Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
25. November 2012 09:10

I am still working my way through “Machine Learning in Action”, converting the samples from Python to F#. I am currently in the middle of chapter 6, dedicated to Support Vector Machines, which has given me more trouble than the previous ones. This post will be sharing my current progress: the code I have so far is a working translation of the naïve SVM implementation, presented in the first half of the chapter. We’ll get to kernels, and the full Platt SMO algorithm in a later post – today will be solely discussing the simple, un-optimized version.

Two factors slowed me down with this chapter: the math, and Python.

The math behind the algorithm is significantly more involved than the other algorithms, and I won’t even try to go into why it works. I recommend reading An Idiot’s guide to SVMs, which I found a pretty complete and accessible explanation of the theory behind SVMs. I will focus instead on the implementation, which was in itself a bit challenging.

First, the Python code uses algebra quite a bit, and I found that deciphering what was going on required a bit of patience. Take a line like the following:

fXi = (float)(multiply(alphas, labelMat).T*(dataMatrix*dataMatrix[i,:].T))+b

I am reasonably well versed in linear algebra, but figuring out what this is saying takes some attention. Granted, I have no experience with Python and NumPy, so my whining is probably a bit unfair. Still, I thought the code was not very readable, and it motivated me to see if that could be improved, and as a result I ended up moving away from heavy algebra notation.

Then, the algorithm is implemented as a Deep Arrow. A main loop performs computations and evaluates conditions at multiple points, using continue to exit / short-circuit the evaluation. The code I ended up with doesn’t use mutation, but is still heavily indented, which I am not happy about - I’ll work on that later.

## Simplified algorithm implementation

Note: as the title of the post indicates, this is work in progress. The current implementation works, but has some obvious flaws (see last paragraph), which I intend to fix in upcoming posts. My intent is to share my progression through the problem – please don’t take this as a good reference SVM implementation. Hopefully we’ll get there soon, but this is not it, not yet.

More...

10. November 2012 11:00

The Kaggle/StackOverflow contest officially closed a few days ago, which makes it a perfect time to have a miniature retrospective on that experience. The objective of the contest was to write an algorithm to predict whether a StackOverflow question would be closed by moderators, and the reason why.

The contest was announced just a couple of days before what was supposed to be 4 weeks of computer-free vacation travelling around Europe. Needless to say, a quick change of plans followed; I am a big fan of StackOverflow, and Machine Learning has been on my mind quite a bit lately, so I packed my smallest laptop with Visual Studio installed. At the same time, the wonders of the Interwebs resulted in the formation of Team Charon - the awesome @lu_a_jalla and me, around the loosely defined project of "having fun with this, using 100% F#".

Now that the contest is over, here are a few notes on the experience, focusing on process and tools, and not the modeling aspects – I’ll get back to that in a later post.

This is my first unquestionably positive experience with a dispersed team - every morning I was genuinely looking forward to code check-ins, something I can't say of every experience I have had with remote teams. I recall reading somewhere that there was only one valid reason to work with a dispersed team: when you really want to work with that person, and it is the only way to work together. I tend to agree, and this was tremendously fun. There are not that many opportunities to have meaningful interactions involving both F# and Machine Learning, and I learnt quite a bit in the process, in large part because this was team work.

As a side note, I find it amazing how ridiculously easy it is today to set up a collaborative environment. Set up a GitHub repository, use Skype and Twitter – and you are good to go. The only thing technology hasn’t quite solved yet are these pesky time zones: Minsk and San Francisco are still 11 hours apart. This is were a team of night owls might help…

Whenever there is a deadline, make sure the when and what is clear. Had I followed this simple rule, I would have been on time for the final submission. Instead, I missed it by a couple of hours, because I didn't check what "you have three days left" meant exactly, which is too bad, because otherwise we could have ended up in 27th position, among 160+ competitors:

... which is a result I am pretty proud of, given that this was my first “official” attempt at Machine Learning stuff, and some of the competitors looked pretty qualified. During the initial phase, we went as high as 10th position, and ended up in 40th position, in the top 25%.

More...