Archive | April, 2012

Reflections on Data Mining

29 Apr

Props go to Prof. Dalkilic for finally getting me off my butt and starting a blog.

Data Mining is hard. That seems to go without saying, but most people who hear about data mining and machine learning don’t realize how hard it really is, or how much of it is an art rather than a science. Doing it properly takes a lot of work and knowledge of several fields, from statistics to computer algorithms, plus too many gaps in between.

As an example, the assignment that I just finished involved taking a data set of purchases, some of which had been classified as fraudulent or not fraudulent, extrapolating from the classified ones to the unclassified ones, and seeing what we could infer. It’s relatively straightforward, and the data wasn’t as dirty as I feared it would be. By the time I was finished, I was more impressed by the amount of work that I had left undone than by the work that I had done, and I spent a ton of time on this.

First, I never verified that the data attributes were independent of one another (strictly speaking, conditionally independent given the class). That assumption is vitally important to the Naive Bayes classifier that we were supposed to use, but I just ran out of time.
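For categorical (or binned) attributes, a check like the one skipped here could look roughly like the sketch below. It is not what was used in the assignment, and it is in Python rather than R. It computes Cramér's V, a chi-square-based association score between 0 and 1, for a pair of attributes separately within each class. The function names, the row layout (one tuple of attribute values per purchase), and the class labels in the usage note are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences.

    0 means no association in the sample, 1 means one attribute
    fully determines the other.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        return 0.0  # a constant attribute cannot show any association
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # count expected under independence
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def association_by_class(rows, labels, i, j):
    """Cramér's V between attributes i and j, computed within each class.

    Naive Bayes assumes independence given the class, so the check is
    run per class rather than on the pooled data.
    """
    result = {}
    for label in set(labels):
        subset = [r for r, lab in zip(rows, labels) if lab == label]
        result[label] = cramers_v([r[i] for r in subset],
                                  [r[j] for r in subset])
    return result
```

Calling `association_by_class(rows, labels, 0, 1)` returns one score per class. Values near 0 are consistent with the Naive Bayes assumption for that pair of attributes, while values near 1 suggest the two attributes are largely redundant. What counts as "too high" is a judgment call that this sketch does not settle.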

My data was not Gaussian; that was clear from the graphs. Most Naive Bayes implementations assume that continuous attributes are Gaussian within each class. What to do? I punted, and binned the data instead of treating it as a continuous attribute.
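The post does not say how the bins were chosen, so the following is only one plausible way to do it, sketched in Python: equal-frequency bins, followed by a Laplace-smoothed estimate of P(bin | class) that a categorical Naive Bayes could use in place of a Gaussian density. The function names, the bin count, and the smoothing constant `alpha` are assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter


def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 cut points so each bin holds roughly
    the same number of the given values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def to_bin(value, edges):
    """Map a continuous value to a bin index in 0..len(edges)."""
    return bisect_right(edges, value)


def bin_likelihoods(bins, n_bins, alpha=1.0):
    """Smoothed estimate of P(bin | class) from the bin indices
    observed for one class. alpha > 0 keeps empty bins from
    getting probability zero."""
    counts = Counter(bins)
    total = len(bins) + alpha * n_bins
    return [(counts.get(b, 0) + alpha) / total for b in range(n_bins)]
```

The cut points would be computed once from the training values of an attribute, then `to_bin` applied to every value (training and unclassified alike), and `bin_likelihoods` called once per class. Equal-frequency bins cope with skewed data better than equal-width ones, but the number of bins still has to be picked by hand.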

Most egregiously, I didn’t run my classifier on the unclassified data.  I just couldn’t.  It took 3 hours to run on the classified data, which was only 4% of the overall data.  Clearly my program needed to be optimized further.

To challenge myself with something that I hadn’t done before, I tossed my nice, fast K-means implementation in C and decided to do the full assignment in R with MySQL as a backend. I learned a lot, but damn, it took a long time. I just don’t like programming in R. It doesn’t feel intuitive, and it doesn’t feel robust. At one point, I crashed the R runtime engine. I have no idea what happened, but I suspect I just blew out the memory somehow. Restarting my program from the fold that it was calculating worked fine, and it kept on going without any further crashes.

I started the assignment with grandiose ideas about all the different algorithms that I would try. I ended up barely getting my required implementations done. I did try a few different cleaned data sets, but that was about it.

I loved it. It’s too bad that it’s the only Data Mining class offered here; I’m going to have to do some more on my own afterwards. It is clear from the readings that DM and Machine Learning are central to research, so I’ll need to learn this better.