Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
6. September 2013 08:15

Recently, Cesar De Souza began moving his .NET machine learning library, Accord.NET, from Google Code to GitHub. The move is still in progress, but that motivated me to take a closer look at the library; given that it is built in C#, with an intended C# usage in mind, I wanted to see how usable it is from F#.

There is a lot in the library; as a starting point, I decided I would try out its Support Vector Machine (SVM), a classic machine learning algorithm, and run it on a classic problem, automatically recognizing hand-written digits. The dataset I will be using here is a subset of the Kaggle Digit Recognizer contest; each example in the dataset is a 28x28 grayscale pixels image, the result of scanning a number written down by a human, and what the actual number is. From that original dataset, I sampled 5,000 examples, which will be used to train the algorithm, and another 500 in a validation set, which we’ll use to evaluate the performance of the model on data it hasn’t “seen before”.

The full example is available as a gist on GitHub.

I’ll be working in a script file within a Library project, as I typically do when exploring data. First, we need to add references to Accord.NET via NuGet:

#r @"..\packages\Accord.2.8.1.0\lib\Accord.dll"
#r @"..\packages\Accord.Math.2.8.1.0\lib\Accord.Math.dll"
#r @"..\packages\Accord.Statistics.2.8.1.0\lib\Accord.Statistics.dll"
#r @"..\packages\Accord.MachineLearning.2.8.1.0\lib\Accord.MachineLearning.dll"

open System
open System.IO

open Accord.MachineLearning
open Accord.MachineLearning.VectorMachines
open Accord.MachineLearning.VectorMachines.Learning
open Accord.Statistics.Kernels


Note the added reference to the Accord.dll and Accord.Math.dll assemblies; while the code presented below doesn’t reference it explicitly, it looks like Accord.MachineLearning is trying to load the assembly, which fails miserably if they are not referenced.

Then, we need some data; once the training set and validation set have been downloaded to your local machine (see the gist for the datasets url), that’s fairly easy to do:

let training = @"C:/users/mathias/desktop/dojosample/trainingsample.csv"
let validation = @"C:/users/mathias/desktop/dojosample/validationsample.csv"

|> fun lines -> lines.[1..]
|> Array.map (fun line -> line.Split(','))
|> Array.map (fun line ->
(line.[0] |> Convert.ToInt32), (line.[1..] |> Array.map Convert.ToDouble))
|> Array.unzip

let labels, observations = readData training


We read every line of the CSV file into an array of strings, drop the headers with array slicing, keeping only items at or after index 1, split each line around commas (so that each line is now an array of strings), retrieve separately the first element of each line (what the number actually is), and all the pixels, which we transform into a float, and finally unzip the result, so that we get an array of integers (the actual numbers), and an array of arrays, the grayscale level of each pixel.

More...

5. July 2013 15:51

Besides having one of the coolest names around, Random Forest is an interesting machine learning algorithm, for a few reasons. It is applicable to a large range of classification problems, isn’t prone to over-fitting, can produce good quality metrics as a side-effect of the training process itself, and is very suitable for parallelization. For all these reasons, I thought it would be interesting to try it out in F#.

The current implementation I will be discussing below works, but isn’t production ready (yet) – it is work in progress. The API and implementation are very likely to change over the next few weeks. Still, I thought I would share what I did so far, and maybe get some feedback!

## The idea behind the algorithm

As the name suggests, Random Forest (introduced in the early 2000s by Leo Breiman) can be viewed as an extension of Decision Trees, which I discussed before. A decision tree grows a single classifier, in a top-down manner: the algorithm recursively selects the feature which is the most informative, partitions the data according to the outcomes of that feature, and repeats the process until no information can be gained by partitioning further. On a non-technical level, the algorithm is playing a smart “game of 20 questions”: given what has been deduced so far, it picks from the available features the one that is most likely to lead to a more certain answer.

How is a Random Forest different from a Decision Tree? The first difference is that instead of growing a single decision tree, the algorithm will create a “forest” – a collection of Decision Trees; the final decision of the classifier will be the majority decision of all trees in the forest. However, having multiple times the same tree wouldn’t be of much help, because we would get the same classifier repeated over and over again. This is where the algorithm gets interesting: instead of growing a Tree using the entire training set and features, it introduces two sources of randomness:

• each tree is grown on a new sample, created by randomly sampling the original dataset with replacement (“bagging”),
• at each node of the tree, only a random subset of the remaining features is used.

Why would introducing randomness be a good idea? It has a few interesting benefits:

• by selecting different samples, it mitigates the risk of over-fitting. A single tree will produce an excellent fit on the particular dataset that was used to train it, but this doesn’t guarantee that the result will generalize to other sets. Training multiple trees on random samples creates a more robust overall classifier, which will by construction handle a “wider” range of situations than a single dataset,
• by selecting a random subset of features, it mitigates the risks of greedily picking locally optimal features that could be overall sub-optimal. As a bonus, it also allows a computation speed-up for each tree, because fewer features need to be considered at each step,
• the bagging process, by construction, creates for each tree a Training Set (the selected examples) and a Cross-Validation Set (what’s “out-of-the-bag”), which can be directly used to produce quality metrics on how the classifier may perform in general.

## Usage

Before delving into the current implementation, I thought it would be interesting to illustrate on an example the intended usage. I will be using the Titanic dataset, from the Kaggle Titanic contest. The goal of the exercise is simple: given the passengers list of the Titanic, and what happened to them, can you build a model to predict who sinks or swims?

I didn’t think the state of affairs warranted a Nuget package just yet, so this example is implemented as a script, in the Titanic branch of the project itself on GitHub.

First, let’s create a Record type to represent passengers:

type Passenger = {
Id: string;
Class: string;
Name: string;
Sex: string;
Age: string;
SiblingsOrSpouse: string;
ParentsOrChildren: string;
Ticket: string;
Fare: string;
Cabin: string;
Embarked: string }


Note that all the properties are represented as strings; it might be better to represent them for what they are (Age is a float, SiblingsOrSpouse an integer…) – but given that the dataset contains missing data, this would require dealing with that issue, perhaps using an Option type. We’ll dodge the problem for now, and opt for a stringly-typed representation.

Next, we need to construct a training set from the Kaggle data file. We’ll use the CSV parser that comes with FSharp.Data to extract the passengers from that list, as well as their known fate (the file is assumed to have been downloaded on your local machine first):

let path = @"C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon\train.csv"

let trainingSet =
[| for line in data.Data ->
line.GetColumn "Survived" |> Some, // the label
{   Id = line.GetColumn "PassengerId";
Class = line.GetColumn "Pclass";
Name = line.GetColumn "Name";
Sex = line.GetColumn "Sex";
Age = line.GetColumn "Age";
SiblingsOrSpouse = line.GetColumn "SibSp";
ParentsOrChildren = line.GetColumn "Parch";
Ticket = line.GetColumn "Ticket";
Fare =line.GetColumn "Fare";
Cabin = line.GetColumn "Cabin";
Embarked = line.GetColumn "Embarked" } |]


Now that we have data, we can get to work, and define a model. We’ll start first with a regular Decision Tree, and extract only one feature, Sex:

let features =
[| (fun x -> x.Sex |> StringCategory); |]


What this is doing is defining an Array of features, a feature being a function which takes in a Passenger, and returns an Option string, via the utility StringCategory. StringCategory simply expects a string, and transforms a null or empty case into the “missing data” case, and otherwise treats the string as a Category. So in that case, x is a passenger, and if no Sex information is found, it will transform it into None, and otherwise into Some(“male”) or Some(“female”), the two cases that exist in the dataset.

We are now ready to go – we can run the algorithm and get a Decision Tree classifier, with a minimum leaf of 5 elements (i.e. we stop partitioning if we have less than 5 elements left):

let minLeaf = 5
let classifier = createID3Classifier trainingSet features minLeaf


… and we are done. How good is our classifier? Let’s check:

let correct =
trainingSet
|> Array.averageBy (fun (label, obs) ->
if label = Some(classifier obs) then 1. else 0.)
printfn "Correct: %.4f" correct


More...

29. December 2012 17:23

This post continues my journey converting the Python samples from Machine Learning in Action into F#. On the program today: chapter 7, dedicated to AdaBoost. This is also the last chapter revolving around classification. After almost 6 months spending my week-ends on classifiers, I am rather glad to change gears a bit!

## The idea behind the algorithm

Algorithm outline

AdaBoost is short for “Adaptative Boosting”. Boosting is based on a very common-sense idea: instead of trying to find one perfect classifier that fits the dataset, the algorithm will train a sequence of classifiers, and, at each step, will analyze the latest classifiers’ results, and focus the next training round on reducing classification mistakes, by giving a bigger weight to the misclassified observations. In other words, “get better by working on your weaknesses”.

The second idea in AdaBoost, which I found very interesting and somewhat counter-intuitive, is that multiple poor classification models taken together can constitute a highly reliable source. Rather than discarding previous classifiers, AdaBoost combines them all into a meta-classifier. AdaBoost computes a weight Alpha for each of the “weak classifiers”, based on the proportion of examples properly classified, and classifies observations by taking a majority vote among the weak classifiers, weighted by their Alpha coefficients. In other words, “decide based on all sources of information, but take into account how reliable each source is”.

In pseudo-code, the algorithm looks like this:

Given examples = observations + labels,

Until overall quality is good enough or iteration limit reached,

• From the available weak classifiers,
• Pick the classifier with the lowest weighted prediction error,
• Compute its Alpha weight based on prediction quality,
• Update weights assigned to each example, based on Alpha and whether example was properly classified or not

The weights update mechanism

Let’s dive into the update mechanism for both the training example weights and the weak classifiers Alpha weights. Suppose that we have

• a training set with 4 examples & their label [ (E1, 1); (E2, –1); (E3, 1); (E4, –1) ],
• currently weighted [ 20%; 20%; 30%; 30% ], (note: example weights must sum to 100%)
• f is the best weak classifier selected.

If we apply a weak classifier f to the training set, we can check what examples are mis-classified, and compute the weighted error, i.e. the weighted proportion of mis-classifications:

 Example Label Weight f(E) f is… weighted error E1 1 0.2 1 correct 0.0 E2 -1 0.2 1 incorrect 0.2 E3 1 0.3 1 correct 0.0 E4 -1 0.3 -1 correct 0.0 0.2

This gives us a weighted error rate of 20% for f, given the weights.

The weight given to f in the final classifier is given by

Alpha = 0.5 x ln ((1 - error) / error)

Here is how Alpha looks, plotted as a function of the proportion correctly classified (i.e. 1 – error):

If 50% of the examples are properly classified, the classifier is totally random, and gets a weight of 0 – its output is ignored. Higher quality models get higher weights – and models with high level of misclassification get a strong negative weight. This is interesting; in essence, this treats them as a great negative source of information: if you know that I am always wrong, my answers are still highly informative – you just need to flip the answer…

More...

26. December 2012 11:50

This is the continuation of my series converting the samples found in Machine Learning in Action from Python to F#. After starting on a nice and steady pace, I hit a speed bump with Chapter 6, dedicated to the Support Vector Machine algorithm. The math is more involved than the previous algorithms, and the original Python implementation is very procedural , which both slowed down the conversion to a more functional style.

Anyways, I am now at a good point to share progress. The current version uses Sequential Minimization Optimization to train the classifier, and supports Kernels. Judging from my experiments, the algorithm works - what is missing at that point is some performance optimization.

I’ll talk first about the code changes from the “naïve SVM” version previously discussed, and then we’ll illustrate the algorithm in action, recognizing hand-written digits.

## Main changes from previous version

From a functionality standpoint, the 2 main changes from the previous post are the replacement of the hard-coded vector dot product by arbitrary Kernel functions, and the modification of the algorithm from a naïve loop to the SMO double-loop, pivoting on observations based on their prediction error.

You can browse the current version of the SVM algorithm on GitHub here.

Injecting arbitrary Kernels

The code I presented last time relied on vector dot-product to partition linearly separable datasets. The obvious issue is that not all datasets are linearly separable. Fortunately, with minimal change, the SVM algorithm can be used to handle more complex situations, using what’s known as the “Kernel Trick”. In essence, instead of working on the original data, we transform our data in a new space where it is linearly separable:

[ from http://zutopedia.com/udi/?p=40, via Cesar Souza’s blog ]

More...

25. November 2012 09:10

I am still working my way through “Machine Learning in Action”, converting the samples from Python to F#. I am currently in the middle of chapter 6, dedicated to Support Vector Machines, which has given me more trouble than the previous ones. This post will be sharing my current progress: the code I have so far is a working translation of the naïve SVM implementation, presented in the first half of the chapter. We’ll get to kernels, and the full Platt SMO algorithm in a later post – today will be solely discussing the simple, un-optimized version.

Two factors slowed me down with this chapter: the math, and Python.

The math behind the algorithm is significantly more involved than the other algorithms, and I won’t even try to go into why it works. I recommend reading An Idiot’s guide to SVMs, which I found a pretty complete and accessible explanation of the theory behind SVMs. I will focus instead on the implementation, which was in itself a bit challenging.

First, the Python code uses algebra quite a bit, and I found that deciphering what was going on required a bit of patience. Take a line like the following:

fXi = (float)(multiply(alphas, labelMat).T*(dataMatrix*dataMatrix[i,:].T))+b

I am reasonably well versed in linear algebra, but figuring out what this is saying takes some attention. Granted, I have no experience with Python and NumPy, so my whining is probably a bit unfair. Still, I thought the code was not very readable, and it motivated me to see if that could be improved, and as a result I ended up moving away from heavy algebra notation.

Then, the algorithm is implemented as a Deep Arrow. A main loop performs computations and evaluates conditions at multiple points, using continue to exit / short-circuit the evaluation. The code I ended up with doesn’t use mutation, but is still heavily indented, which I am not happy about - I’ll work on that later.

## Simplified algorithm implementation

Note: as the title of the post indicates, this is work in progress. The current implementation works, but has some obvious flaws (see last paragraph), which I intend to fix in upcoming posts. My intent is to share my progression through the problem – please don’t take this as a good reference SVM implementation. Hopefully we’ll get there soon, but this is not it, not yet.

More...