Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
6. September 2013 08:15

Recently, Cesar De Souza began moving his .NET machine learning library, Accord.NET, from Google Code to GitHub. The move is still in progress, but that motivated me to take a closer look at the library; given that it is built in C#, with an intended C# usage in mind, I wanted to see how usable it is from F#.

There is a lot in the library; as a starting point, I decided I would try out its Support Vector Machine (SVM), a classic machine learning algorithm, and run it on a classic problem, automatically recognizing hand-written digits. The dataset I will be using here is a subset of the Kaggle Digit Recognizer contest; each example in the dataset is a 28x28 grayscale pixels image, the result of scanning a number written down by a human, and what the actual number is. From that original dataset, I sampled 5,000 examples, which will be used to train the algorithm, and another 500 in a validation set, which we’ll use to evaluate the performance of the model on data it hasn’t “seen before”.

The full example is available as a gist on GitHub.

I’ll be working in a script file within a Library project, as I typically do when exploring data. First, we need to add references to Accord.NET via NuGet:

#r @"..\packages\Accord.2.8.1.0\lib\Accord.dll"
#r @"..\packages\Accord.Math.2.8.1.0\lib\Accord.Math.dll"
#r @"..\packages\Accord.Statistics.2.8.1.0\lib\Accord.Statistics.dll"
#r @"..\packages\Accord.MachineLearning.2.8.1.0\lib\Accord.MachineLearning.dll"

open System
open System.IO

open Accord.MachineLearning
open Accord.MachineLearning.VectorMachines
open Accord.MachineLearning.VectorMachines.Learning
open Accord.Statistics.Kernels


Note the added reference to the Accord.dll and Accord.Math.dll assemblies; while the code presented below doesn’t reference it explicitly, it looks like Accord.MachineLearning is trying to load the assembly, which fails miserably if they are not referenced.

Then, we need some data; once the training set and validation set have been downloaded to your local machine (see the gist for the datasets url), that’s fairly easy to do:

let training = @"C:/users/mathias/desktop/dojosample/trainingsample.csv"
let validation = @"C:/users/mathias/desktop/dojosample/validationsample.csv"

|> fun lines -> lines.[1..]
|> Array.map (fun line -> line.Split(','))
|> Array.map (fun line ->
(line.[0] |> Convert.ToInt32), (line.[1..] |> Array.map Convert.ToDouble))
|> Array.unzip

let labels, observations = readData training


We read every line of the CSV file into an array of strings, drop the headers with array slicing, keeping only items at or after index 1, split each line around commas (so that each line is now an array of strings), retrieve separately the first element of each line (what the number actually is), and all the pixels, which we transform into a float, and finally unzip the result, so that we get an array of integers (the actual numbers), and an array of arrays, the grayscale level of each pixel.

More...

26. December 2012 11:50

This is the continuation of my series converting the samples found in Machine Learning in Action from Python to F#. After starting on a nice and steady pace, I hit a speed bump with Chapter 6, dedicated to the Support Vector Machine algorithm. The math is more involved than the previous algorithms, and the original Python implementation is very procedural , which both slowed down the conversion to a more functional style.

Anyways, I am now at a good point to share progress. The current version uses Sequential Minimization Optimization to train the classifier, and supports Kernels. Judging from my experiments, the algorithm works - what is missing at that point is some performance optimization.

I’ll talk first about the code changes from the “naïve SVM” version previously discussed, and then we’ll illustrate the algorithm in action, recognizing hand-written digits.

## Main changes from previous version

From a functionality standpoint, the 2 main changes from the previous post are the replacement of the hard-coded vector dot product by arbitrary Kernel functions, and the modification of the algorithm from a naïve loop to the SMO double-loop, pivoting on observations based on their prediction error.

You can browse the current version of the SVM algorithm on GitHub here.

Injecting arbitrary Kernels

The code I presented last time relied on vector dot-product to partition linearly separable datasets. The obvious issue is that not all datasets are linearly separable. Fortunately, with minimal change, the SVM algorithm can be used to handle more complex situations, using what’s known as the “Kernel Trick”. In essence, instead of working on the original data, we transform our data in a new space where it is linearly separable:

[ from http://zutopedia.com/udi/?p=40, via Cesar Souza’s blog ]

More...

25. November 2012 09:10

I am still working my way through “Machine Learning in Action”, converting the samples from Python to F#. I am currently in the middle of chapter 6, dedicated to Support Vector Machines, which has given me more trouble than the previous ones. This post will be sharing my current progress: the code I have so far is a working translation of the naïve SVM implementation, presented in the first half of the chapter. We’ll get to kernels, and the full Platt SMO algorithm in a later post – today will be solely discussing the simple, un-optimized version.

Two factors slowed me down with this chapter: the math, and Python.

The math behind the algorithm is significantly more involved than the other algorithms, and I won’t even try to go into why it works. I recommend reading An Idiot’s guide to SVMs, which I found a pretty complete and accessible explanation of the theory behind SVMs. I will focus instead on the implementation, which was in itself a bit challenging.

First, the Python code uses algebra quite a bit, and I found that deciphering what was going on required a bit of patience. Take a line like the following:

fXi = (float)(multiply(alphas, labelMat).T*(dataMatrix*dataMatrix[i,:].T))+b

I am reasonably well versed in linear algebra, but figuring out what this is saying takes some attention. Granted, I have no experience with Python and NumPy, so my whining is probably a bit unfair. Still, I thought the code was not very readable, and it motivated me to see if that could be improved, and as a result I ended up moving away from heavy algebra notation.

Then, the algorithm is implemented as a Deep Arrow. A main loop performs computations and evaluates conditions at multiple points, using continue to exit / short-circuit the evaluation. The code I ended up with doesn’t use mutation, but is still heavily indented, which I am not happy about - I’ll work on that later.

## Simplified algorithm implementation

Note: as the title of the post indicates, this is work in progress. The current implementation works, but has some obvious flaws (see last paragraph), which I intend to fix in upcoming posts. My intent is to share my progression through the problem – please don’t take this as a good reference SVM implementation. Hopefully we’ll get there soon, but this is not it, not yet.

More...