Remembering Aaron Swartz

14 Jan

Aaron Swartz committed suicide over the weekend at 26 years of age.  There have been many articles written about his death, including ones by Lawrence Lessig and a few on Ars Technica.  They write about his life better than I could.

He was facing up to 50 years in prison for downloading academic journal articles, most of which were paid for by public funding.  Regardless of the legality of his actions, a Bernie Madoff-level prison sentence is grossly unfair for a copyright violation.  Shame on MIT for pursuing this.  Shame on the prosecutors for pursuing this vendetta against an activist.

Aaron’s family has written a statement blaming the prosecutor, US Attorney Carmen Ortiz, for persecuting Aaron.  She now has to live with the knowledge that she drove a brilliant mind who was pursuing social justice to suicide.

Edit: From a witness who was going to speak on behalf of Aaron:


The Rise of License Plate Readers

21 Aug

Ars Technica has a very interesting article on automated license plate readers being installed throughout the country.  It is a good example of how exposure differs from privacy.  From a privacy perspective, there is nothing private about license plate information being collected outside of a home; after all, there is no expectation of privacy while driving down a public road.  From an exposure perspective, however, having a machine automatically track your movements is extremely exposing, and it worries a lot of people.

Of particular interest to me is the small section in which Cyrus Farivar talks about how private companies are collecting this data as well and making it available for both public and private organizational use. This is troublesome because there are no protections against private organizations collecting this data. They can keep it as long as they want, do whatever they want with it, and sell it to whomever they desire. Municipalities ostensibly answer to the will of their voters, while corporations answer only to their stockholders.

Regurgitation on Finals

2 May

When I finally get myself a faculty position somewhere, my policy will be to assume that every exam, homework, and project I assign will make it onto the Internet shortly after it is graded, and hopefully not before.

I’ve had two final exams this semester and the difference between them could not have been more striking.  The first final was done in 25 minutes.  It was a regurgitation of material that the professor had presented during the semester.  If we had picked the right material to memorize, perfect!  If we hadn’t, we were screwed.  I lucked out and understood the concept that was behind 2/3 of the points on the test.  I hope my classmates were as lucky.

The second exam was a take-home in which we were honor-bound not to look at materials outside the textbook or to consult each other.  I was told that this was just not doable in a graduate class, because students will cheat.  I do not understand that philosophy.  If you are that worried about your students cheating, aren’t you teaching incorrectly?  Shouldn’t you be designing better exams?  Exams are hard to create, I understand, but that seems a better path than throwing your hands up in the air and giving up in frustration.

I’m sure my peers will find this post at some point in the future and rub it in my face when I complain about students.

I wonder how much of this is due to presence.  A professor who presents themselves as an intimidating, knowledgeable figure who will not tolerate that kind of behavior will not get as much of it as someone who is seen by the students as not caring.

Two out of three classes are now finished.  After the third, my first year at IU will be complete.

Reflections on Data Mining

29 Apr

Props go to Prof. Dalkilic for finally getting me off my butt to start writing a blog.

Data mining is hard.  It seems to go without saying, but most people who hear about data mining and machine learning don’t realize how hard it really is, and how much of it is an art rather than a science.  Doing it properly requires a lot of work and knowledge of various fields, from statistics to computer algorithms, with too many gaps in between.

As an example, the assignment that I just finished involved taking a data set of purchases, some of which had been classified as fraudulent or not fraudulent, and extrapolating from the classified ones to the unclassified ones to see what we could infer.  It’s relatively straightforward, and the data wasn’t as dirty as I feared it would be.  By the time I was finished, I was more impressed by the amount of work that I had left undone than by the work that I had done, and I spent a ton of time on this.

First, I never verified that the data attributes were independent.  This is vitally important to the Naive Bayes classifier that we were supposed to use, but I just ran out of time.
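As a sketch of what that check might have looked like (this is not the assignment’s actual code; the data and attributes here are synthetic, for illustration only), a pairwise correlation matrix is a quick first pass at spotting attributes that violate the independence assumption:

```python
import numpy as np

# Hypothetical purchase data: each row a transaction, each column an
# attribute (amount, hour of day, items in cart).  Entirely made up --
# "items" is deliberately constructed to depend on "amount".
rng = np.random.default_rng(0)
amount = rng.exponential(scale=50.0, size=1000)
hour = rng.integers(0, 24, size=1000).astype(float)
items = np.round(amount / 10.0) + rng.integers(0, 3, size=1000)

data = np.column_stack([amount, hour, items])

# Pearson correlation between attribute columns: values near 0 suggest
# (linear) independence; values near +/-1 flag pairs that break the
# Naive Bayes independence assumption.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

Correlation only catches linear dependence, so it is a screen rather than a proof, but even this much would have been better than skipping the check entirely.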

It was clear from the graphs that my data was not Gaussian, but most Naive Bayes implementations assume that it is.  What to do?  I punted, and binned the data instead of treating it as a continuous attribute.
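Binning like that can be as simple as cutting each continuous attribute into quantile buckets and letting a discrete Naive Bayes model count the bucket labels.  A rough sketch, again on made-up data rather than the assignment’s:

```python
import numpy as np

# Synthetic, clearly non-Gaussian attribute (heavy right tail),
# purely for illustration.
rng = np.random.default_rng(1)
values = rng.exponential(scale=25.0, size=2000)

# Cut into 5 equal-frequency (quantile) bins; each value is replaced
# by its bin index, which a discrete classifier can count directly.
edges = np.quantile(values, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(values, edges)  # bin labels 0..4

# Each bin should hold roughly 20% of the data.
counts = np.bincount(bins, minlength=5)
print(counts)
```

Equal-frequency bins sidestep the distributional assumption entirely, at the cost of throwing away some information about where within a bin a value falls.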

Most egregiously, I didn’t run my classifier on the unclassified data.  I just couldn’t.  It took 3 hours to run on the classified data, which was only 4% of the overall data.  Clearly my program needed to be optimized further.

To challenge myself, I tossed aside my nice, fast K-means implementation in C and did the full assignment in R with MySQL as a backend, something I hadn’t tried before.  I learned a lot, but damn, it took a long time.  I just don’t like programming in R.  It doesn’t feel intuitive, and it doesn’t feel robust.  At one point, I crashed the R runtime engine.  I have no idea what happened; I suspect I just blew out the memory somehow.  Restarting my program from the fold that it was calculating was fine, and it kept on going without any further crashes.
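For anyone unfamiliar with it, the K-means algorithm I keep reimplementing is short enough to sketch in a few lines.  This is the generic textbook version (Lloyd’s algorithm) in Python, not my C or R code:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's-algorithm K-means: assign each point to the nearest
    centroid, then move each centroid to the mean of its assigned points,
    repeating until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs; K-means should place one
# centroid in each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(100, 2))
centroids, labels = kmeans(np.vstack([blob_a, blob_b]), k=2)
print(np.round(centroids, 1))
```

The algorithm itself is simple; the engineering pain is all in making it fast on data that doesn’t fit comfortably in memory, which is exactly where my R version struggled.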

I started the assignment with grandiose ideas on all the different algorithms that I would try.  I ended up barely getting my required implementations done.  I did try on a few different cleaned data sets, but that was about it.

I loved it.  It’s too bad that it’s the only data mining class offered here; I’m going to have to do some more on my own afterwards.  It is clear from the readings that data mining and machine learning are central to research, so I’ll need to learn this material better.