Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
6. September 2013 08:15

Recently, Cesar De Souza began moving his .NET machine learning library, Accord.NET, from Google Code to GitHub. The move is still in progress, but that motivated me to take a closer look at the library; given that it is built in C#, with an intended C# usage in mind, I wanted to see how usable it is from F#.

There is a lot in the library; as a starting point, I decided I would try out its Support Vector Machine (SVM), a classic machine learning algorithm, and run it on a classic problem, automatically recognizing hand-written digits. The dataset I will be using here is a subset of the Kaggle Digit Recognizer contest; each example in the dataset consists of a 28x28-pixel grayscale image, the result of scanning a number written down by a human, together with the actual number it represents. From that original dataset, I sampled 5,000 examples, which will be used to train the algorithm, and another 500 in a validation set, which we’ll use to evaluate the performance of the model on data it hasn’t “seen before”.

The full example is available as a gist on GitHub.

I’ll be working in a script file within a Library project, as I typically do when exploring data. First, we need to add references to Accord.NET via NuGet:

#r @"..\packages\Accord.2.8.1.0\lib\Accord.dll"
#r @"..\packages\Accord.Math.2.8.1.0\lib\Accord.Math.dll"
#r @"..\packages\Accord.Statistics.2.8.1.0\lib\Accord.Statistics.dll"
#r @"..\packages\Accord.MachineLearning.2.8.1.0\lib\Accord.MachineLearning.dll"

open System
open System.IO

open Accord.MachineLearning
open Accord.MachineLearning.VectorMachines
open Accord.MachineLearning.VectorMachines.Learning
open Accord.Statistics.Kernels


Note the added references to the Accord.dll and Accord.Math.dll assemblies; while the code presented below doesn’t use them explicitly, it looks like Accord.MachineLearning tries to load them, which fails miserably if they are not referenced.

Then, we need some data; once the training set and validation set have been downloaded to your local machine (see the gist for the datasets url), that’s fairly easy to do:

let training = @"C:/users/mathias/desktop/dojosample/trainingsample.csv"
let validation = @"C:/users/mathias/desktop/dojosample/validationsample.csv"

let readData (filePath: string) =
    File.ReadAllLines filePath
    |> fun lines -> lines.[1..]
    |> Array.map (fun line -> line.Split(','))
    |> Array.map (fun line ->
        (line.[0] |> Convert.ToInt32), (line.[1..] |> Array.map Convert.ToDouble))
    |> Array.unzip

let labels, observations = readData training


We read every line of the CSV file into an array of strings and drop the header row with array slicing, keeping only the items at or after index 1. We then split each line on commas, so that each line becomes an array of strings, and separate the first element of each line (what the number actually is) from the remaining elements (the pixels), converting each pixel to a float. Finally, we unzip the result, so that we get an array of integers (the actual numbers) and an array of arrays holding the grayscale level of each pixel.
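To make the pipeline concrete, here is a minimal sketch that runs the same steps on three in-memory lines instead of a file. The sample rows and the name parseLines are made up for illustration; real rows have 784 pixel columns rather than 3.

```fsharp
open System

// Hypothetical stand-in for the CSV file: a header row, then "label,pixel,pixel,..."
let sampleLines =
    [| "label,pixel0,pixel1,pixel2"
       "7,0,128,255"
       "3,12,0,64" |]

// Same steps as the file-based version, minus the file access
let parseLines (lines: string[]) =
    lines.[1..]                                        // drop the header row
    |> Array.map (fun line -> line.Split(','))         // one string[] per row
    |> Array.map (fun cells ->
        Convert.ToInt32(cells.[0]),                    // first cell: the digit
        cells.[1..] |> Array.map (fun (s: string) -> Convert.ToDouble(s)))
    |> Array.unzip                                     // labels and pixel rows

let sampleLabels, samplePixels = parseLines sampleLines

printfn "%A" sampleLabels
printfn "%A" samplePixels
```

Each label at index i describes the pixel row at the same index, which is the shape most learning APIs expect.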


5. July 2013 15:51

Besides having one of the coolest names around, Random Forest is an interesting machine learning algorithm, for a few reasons. It is applicable to a large range of classification problems, isn’t prone to over-fitting, can produce good quality metrics as a side-effect of the training process itself, and is very suitable for parallelization. For all these reasons, I thought it would be interesting to try it out in F#.

The current implementation I will be discussing below works, but isn’t production ready (yet) – it is work in progress. The API and implementation are very likely to change over the next few weeks. Still, I thought I would share what I did so far, and maybe get some feedback!

## The idea behind the algorithm

As the name suggests, Random Forest (introduced in the early 2000s by Leo Breiman) can be viewed as an extension of Decision Trees, which I discussed before. A decision tree grows a single classifier, in a top-down manner: the algorithm recursively selects the feature which is the most informative, partitions the data according to the outcomes of that feature, and repeats the process until no information can be gained by partitioning further. On a non-technical level, the algorithm is playing a smart “game of 20 questions”: given what has been deduced so far, it picks from the available features the one that is most likely to lead to a more certain answer.

How is a Random Forest different from a Decision Tree? The first difference is that instead of growing a single decision tree, the algorithm will create a “forest”, a collection of Decision Trees; the final decision of the classifier will be the majority decision of all trees in the forest. However, having the same tree multiple times wouldn’t be of much help, because we would get the same classifier repeated over and over again. This is where the algorithm gets interesting: instead of growing a Tree using the entire training set and features, it introduces two sources of randomness:

• each tree is grown on a new sample, created by randomly sampling the original dataset with replacement (“bagging”),
• at each node of the tree, only a random subset of the remaining features is used.
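The two sources of randomness, plus the majority vote, can be sketched in a few lines. This is only an illustration under my own assumptions, not the code from the project: the function names bootstrapIndices, randomFeatureSubset and majorityVote are hypothetical, and the feature subset is drawn with a simple partial shuffle.

```fsharp
open System

// Bagging: draw n row indices uniformly, with replacement
let bootstrapIndices (rng: Random) (n: int) =
    Array.init n (fun _ -> rng.Next(n))

// Pick k distinct feature indices out of featureCount (partial Fisher-Yates shuffle).
// Assumes 0 <= k <= featureCount.
let randomFeatureSubset (rng: Random) (featureCount: int) (k: int) =
    let indices = Array.init featureCount id
    for i in 0 .. k - 1 do
        let j = rng.Next(i, featureCount)
        let tmp = indices.[i]
        indices.[i] <- indices.[j]
        indices.[j] <- tmp
    indices.[0 .. k - 1]

// The forest's decision: the label predicted by the largest number of trees
let majorityVote votes =
    votes |> Seq.countBy id |> Seq.maxBy snd |> fst

let rng = Random(42)
printfn "rows for one tree: %A" (bootstrapIndices rng 10)
printfn "features for one node: %A" (randomFeatureSubset rng 8 3)
printfn "forest says: %s" (majorityVote [ "yes"; "no"; "yes" ])
```

Because sampling is done with replacement, some rows show up several times in a tree's sample and others not at all.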

Why would introducing randomness be a good idea? It has a few interesting benefits:

• by selecting different samples, it mitigates the risk of over-fitting. A single tree will produce an excellent fit on the particular dataset that was used to train it, but this doesn’t guarantee that the result will generalize to other sets. Training multiple trees on random samples creates a more robust overall classifier, which will by construction handle a “wider” range of situations than a single dataset,
• by selecting a random subset of features, it mitigates the risks of greedily picking locally optimal features that could be overall sub-optimal. As a bonus, it also allows a computation speed-up for each tree, because fewer features need to be considered at each step,
• the bagging process, by construction, creates for each tree a Training Set (the selected examples) and a Cross-Validation Set (what’s “out-of-the-bag”), which can be directly used to produce quality metrics on how the classifier may perform in general.
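The “out-of-the-bag” idea from the last point is easy to show in isolation. Here is a minimal sketch, assuming the bagged sample is represented as an array of row indices; the helper name outOfBag is mine, not the project's.

```fsharp
// Rows that were never drawn into a tree's bootstrap sample form its
// out-of-bag set, which can serve as a validation set for that tree.
let outOfBag (n: int) (inBag: int[]) =
    let selected = Set.ofArray inBag
    [| for i in 0 .. n - 1 do
        if not (selected.Contains i) then yield i |]

// A hand-written sample over 5 rows: rows 0 and 3 were drawn twice
let inBag = [| 0; 0; 3; 3; 1 |]
printfn "out-of-bag rows: %A" (outOfBag 5 inBag)
```

Evaluating each tree only on its own out-of-bag rows gives a quality estimate without setting aside a separate dataset.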

## Usage

Before delving into the current implementation, I thought it would be interesting to illustrate on an example the intended usage. I will be using the Titanic dataset, from the Kaggle Titanic contest. The goal of the exercise is simple: given the passengers list of the Titanic, and what happened to them, can you build a model to predict who sinks or swims?

I didn’t think the state of affairs warranted a NuGet package just yet, so this example is implemented as a script, in the Titanic branch of the project itself on GitHub.

First, let’s create a Record type to represent passengers:

type Passenger = {
    Id: string;
    Class: string;
    Name: string;
    Sex: string;
    Age: string;
    SiblingsOrSpouse: string;
    ParentsOrChildren: string;
    Ticket: string;
    Fare: string;
    Cabin: string;
    Embarked: string }


Note that all the properties are represented as strings; it might be better to represent them for what they are (Age is a float, SiblingsOrSpouse an integer…) – but given that the dataset contains missing data, this would require dealing with that issue, perhaps using an Option type. We’ll dodge the problem for now, and opt for a stringly-typed representation.
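For reference, here is roughly what the Option-based alternative could look like for numeric fields such as Age or SiblingsOrSpouse. This is a sketch of the idea only; the helpers tryParseFloat and tryParseInt are hypothetical and not part of the project.

```fsharp
open System
open System.Globalization

// Missing or malformed text becomes None instead of throwing
let tryParseFloat (text: string) : float option =
    match Double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

let tryParseInt (text: string) : int option =
    match Int32.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

printfn "%A" (tryParseFloat "22.5")   // an Age that is present
printfn "%A" (tryParseFloat "")       // an Age that is missing
printfn "%A" (tryParseInt "1")        // a SiblingsOrSpouse count
```

The price of that extra type safety is that every consumer of Age then has to handle the None case explicitly.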

Next, we need to construct a training set from the Kaggle data file. We’ll use the CSV parser that comes with FSharp.Data to extract the passengers from that list, as well as their known fate (the file is assumed to have been downloaded on your local machine first):

let path = @"C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon\train.csv"
// Assumption: the listing does not show how `data` is created;
// loading it with FSharp.Data's CsvFile.Load is a reconstruction.
let data = CsvFile.Load(path)

let trainingSet =
    [| for line in data.Data ->
        line.GetColumn "Survived" |> Some, // the label
        {   Id = line.GetColumn "PassengerId";
            Class = line.GetColumn "Pclass";
            Name = line.GetColumn "Name";
            Sex = line.GetColumn "Sex";
            Age = line.GetColumn "Age";
            SiblingsOrSpouse = line.GetColumn "SibSp";
            ParentsOrChildren = line.GetColumn "Parch";
            Ticket = line.GetColumn "Ticket";
            Fare = line.GetColumn "Fare";
            Cabin = line.GetColumn "Cabin";
            Embarked = line.GetColumn "Embarked" } |]


Now that we have data, we can get to work, and define a model. We’ll start first with a regular Decision Tree, and extract only one feature, Sex:

let features =
    [| (fun x -> x.Sex |> StringCategory); |]


What this is doing is defining an Array of features, a feature being a function which takes in a Passenger and returns a string option, via the utility StringCategory. StringCategory simply expects a string, and transforms a null or empty string into the “missing data” case, and otherwise treats the string as a Category. So in that case, x is a passenger; if no Sex information is found, the feature returns None, and otherwise Some(“male”) or Some(“female”), the two cases that exist in the dataset.
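Based purely on that description, a helper with this behavior could be written as below. To be clear, this is my guess at the behavior and not the project's actual StringCategory; the lowercase name stringCategory and the trimmed-down MiniPassenger record are hypothetical stand-ins.

```fsharp
open System

// Trimmed-down stand-in for the Passenger record, for illustration only
type MiniPassenger = { Name: string; Sex: string }

// Null or empty text is "missing data" (None); anything else is a category
let stringCategory (text: string) : string option =
    if String.IsNullOrEmpty text then None else Some text

// A feature is just a function from an observation to an optional category
let sexFeature (passenger: MiniPassenger) =
    passenger.Sex |> stringCategory

printfn "%A" (sexFeature { Name = "A"; Sex = "female" })
printfn "%A" (sexFeature { Name = "B"; Sex = "" })
```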

We are now ready to go – we can run the algorithm and get a Decision Tree classifier, with a minimum leaf of 5 elements (i.e. we stop partitioning if we have fewer than 5 elements left):

let minLeaf = 5
let classifier = createID3Classifier trainingSet features minLeaf


… and we are done. How good is our classifier? Let’s check:

let correct =
    trainingSet
    |> Array.averageBy (fun (label, obs) ->
        if label = Some(classifier obs) then 1. else 0.)
printfn "Correct: %.4f" correct



30. June 2013 14:52

I just completed the Coursera Machine Learning class this week, and enjoyed the experience very much. Let’s get the obvious out of the way: getting a high-quality class, for free, wherever you are, at your own pace, is pretty amazing, and I can put up with a sometimes flaky video player for that. Every quarter in college, I would agonize over what limited number of classes I should take, thinking that I might not be able to take that class ever again once I graduated, and Coursera is awesome for that – now I know I can keep learning forever, with that worry gone. Thank you!

One thing I was hoping to get from this class was a broader perspective on machine learning. What I know about the topic, I learnt from various disparate sources, collecting ideas, algorithms and recipes. As a result, my knowledge was a bit of a hodge-podge. The class really delivered on that front. For the most part, the lectures progressed logically, from linear regression, to logistic regression, neural networks and support vector machines, all in a unified manner: define a cost function, use regularization to compensate for over-fitting, and fit parameters using gradient descent. I really enjoyed the coherence of the progression, which helped me see the commonalities between all the approaches.

Following that thought, my biggest take-away was the emphasis on over- or under-fitting. I understood the concept before the class, but it wasn’t as prominently on my mind as it is now. This is probably a side-effect of my past experience in optimization and statistics, where the data was easier to visualize, and the goal was mostly to find the optimum fit, potentially leading to fragile solutions which wouldn’t generalize – over-fitting wasn’t a problem I gave much thought. In a space like machine learning, where datasets are too large to get a visual sense of what’s going on, keeping that question in mind is important. Relatedly, I found the discussions on how to diagnose a model to focus efforts extremely valuable: while more data is usually better, there are situations where it won’t help, and again, with large and hard to comprehend datasets, understanding what is potentially going wrong in a model and why is very important to avoid wasting efforts in the wrong direction, or simply figure out a direction to take when stuck.

One aspect I found interesting is that while quite a few of the models discussed have a long history in statistics (linear and logistic regression for instance), there was virtually no mention of statistics in the entire class – no goodness-of-fit statistics, no null hypothesis, no discussion on how parameters are distributed, nothing like that. Instead, it felt much closer to my background in operations research: define the function you are trying to minimize, and minimize it. I am not sure whether this has any implications regarding where statistics are headed, but found it intriguing.

Finally, the other aspect that struck me was the emphasis on linear algebra. In essence, one of the messages of the class was “if you want high performance, express your problem in a vectorized form”. I can understand why, from two perspectives. First, computers are really, really good at dealing with linear algebra operations, and second, expressing a problem with matrices and vectors is typically nicely compact. Coming from operations research, this is something I am fairly comfortable with. At the same time, the explosion of indices, sub- and super-scripts wasn’t the most pleasant part of the class, and I spent an inordinate amount of time in programming homework just trying to figure out if a particular element was a row or column vector, tinkering with transposition until the bloody product would just work. I found myself really missing some hints on what the shape of a particular element was, or early warning that the operation wasn’t going to work. Relatedly, while there is a nice consistency in working in a world where everything is a vector or matrix of floats, I felt slightly disturbed at representing true or false as 1s and 0s (for instance), and missed cleaner functional operations like filter or map.

That’s it – overall, this was time very well spent. I am very glad I did it, and would recommend that class to anyone who is not allergic to math, and wants a good introduction to the topic!

26. May 2013 09:06

I got interested in the following question lately: given a data set of examples with some continuous-valued features and discrete classes, what’s a good way to reduce the continuous features into a set of discrete values?

What makes this question interesting? One very specific reason is that some machine learning algorithms, like Decision Trees, require discrete features. As a result, potentially informative data has to be discarded. For example, consider the Titanic dataset: we know the age of passengers of the Titanic, or how much they paid for their ticket. To use these features, we would need to reduce them to a set of states, like “Old/Young” or “Cheap/Medium/Expensive” – but how can we determine what states are appropriate, and what values separate them?
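Once cut points are known, the mapping from a continuous value to a state is the easy part, as the sketch below suggests; choosing the cut points is the actual problem. The function name discretize is hypothetical, and the thresholds 10 and 50 are arbitrary numbers picked for illustration, not values derived from the Titanic data.

```fsharp
// Map a continuous value to a named state, given cut points in ascending order.
// A value lands in state i when exactly i cut points are less than or equal to it,
// so there must be one more state than there are cut points.
let discretize (cutPoints: float[]) (states: string[]) (value: float) =
    let index = cutPoints |> Array.filter (fun cut -> value >= cut) |> Array.length
    states.[index]

// Arbitrary, illustrative thresholds: NOT derived from the dataset
let fareState = discretize [| 10.0; 50.0 |] [| "Cheap"; "Medium"; "Expensive" |]

printfn "%s" (fareState 7.25)
printfn "%s" (fareState 30.0)
printfn "%s" (fareState 80.0)
```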

More generally, it’s easier to reason about a handful of cases than a continuous variable – and it’s also more convenient computationally to represent information as a finite set of states.

So how could we go about identifying a reasonable way to partition a continuous variable into a handful of informative, representative states?

In the context of a classification problem, what we are interested in is whether the states provide information with respect to the Classes we are trying to recognize. As far as I can tell from my cursory review of what’s out there, the main approaches use either Chi-Square tests or Entropy to achieve that goal. I’ll leave aside Chi-Square based approaches for today, and look into the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani in 1993.

## The algorithm idea

The algorithm hinges on two key ideas:

• Data should be split into intervals that maximize the information, measured by Entropy,
• Partitioning should not be too fine-grained, to avoid over-fitting.

The first part is classic: given a data set, split it in two groups, based on whether the continuous value is above or below the “splitting value”, and compute the gain in entropy. Out of all possible splitting values, take the one that generates the best gain – and repeat in a recursive fashion.

Let’s illustrate on an artificial example – our output can take 2 values, Yes or No, and we have one continuous-valued feature:

| Continuous Feature | Output Class |
| --- | --- |
| 1.0 | Yes |
| 1.0 | Yes |
| 2.0 | No |
| 3.0 | Yes |
| 3.0 | No |

As is, the dataset has an Entropy of H = - 0.6 x Log (0.6) – 0.4 x Log (0.4) = 0.67 (5 examples, with 3/5 Yes, and 2/5 No).

The Continuous Feature takes 3 values: 1.0, 2.0 and 3.0, which leaves us with 2 possible splits: strictly less than 2, or strictly less than 3. Suppose we split on 2.0 – we would get 2 groups. Group 1 contains Examples where the Feature is less than 2:

| Continuous Feature | Output Class |
| --- | --- |
| 1.0 | Yes |
| 1.0 | Yes |

The Entropy of Group 1 is H(g1) = - 1.0 x Log(1.0) = 0.0

Group 2 contains the rest of the examples:

| Continuous Feature | Output Class |
| --- | --- |
| 2.0 | No |
| 3.0 | Yes |
| 3.0 | No |

The Entropy of Group 2 is H(g2) = - 1/3 x Log(1/3) - 2/3 x Log(2/3) = 0.64

Partitioning on 2.0 gives us a gain of H - 2/5 x H(g1) - 3/5 x H(g2) = 0.67 - 0.4 x 0.0 - 0.6 x 0.64 = 0.29. That split gives us additional information on the output, which seems intuitively correct, as one of the groups is now formed purely of “Yes”. In a similar fashion, we can compute the information gain of splitting around the other possible value, 3.0: the group below 3.0 (Yes, Yes, No) has an Entropy of 0.64, and the group at or above 3.0 (Yes, No) has an Entropy of 0.69, which gives us a gain of 0.67 - 0.6 x 0.64 - 0.4 x 0.69 = 0.01. That split barely improves information, so we would use the first split: when multiple splits have a positive gain, we take the one leading to the largest gain.
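The arithmetic is easy to get wrong by hand, so here is a small sketch that computes the entropy and the gain of each candidate split on the five-example dataset. It assumes natural logarithms (which is what the 0.67 figure corresponds to); the function names entropy and gain are mine.

```fsharp
// Entropy (natural log) of a collection of class labels
let entropy labels =
    let total = labels |> Seq.length |> float
    labels
    |> Seq.countBy id
    |> Seq.sumBy (fun (_, count) ->
        let p = float count / total
        -p * log p)

// Information gain from splitting (value, label) pairs on "value < threshold"
let gain (data: (float * string)[]) (threshold: float) =
    let below, above = data |> Array.partition (fun (value, _) -> value < threshold)
    let weighted (group: (float * string)[]) =
        float group.Length / float data.Length * entropy (group |> Array.map snd)
    entropy (data |> Array.map snd) - weighted below - weighted above

let examples =
    [| 1.0, "Yes"; 1.0, "Yes"; 2.0, "No"; 3.0, "Yes"; 3.0, "No" |]

// Candidate thresholds: every distinct feature value except the smallest
let candidates =
    examples |> Array.map fst |> Array.distinct |> Array.sort |> Array.skip 1

let bestSplit = candidates |> Array.maxBy (gain examples)

for threshold in candidates do
    printfn "split at %.1f: gain %.2f" threshold (gain examples threshold)
printfn "best split: %.1f" bestSplit
```

Applying the same search recursively to each resulting group yields the full partition, which is exactly where the risk of over-splitting comes from.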

So why not just recursively apply that procedure, and split our dataset until we cannot achieve information gain by splitting further? The issue is that we might end up with an artificially fine-grained partition, over-fitting the data.


21. May 2013 12:55

Last week, we had our first Coding Dojo at SFSharp.org, the San Francisco F# group – and it was great! A few people in the group had mentioned that at that point they were already convinced F# was a great language, and that what they wanted was help getting started writing actual code, so I figured this would be a good format to try out.

What I wanted was something fun, something cool people could realistically achieve in under 2 hours. I settled for one of the Kaggle introduction problems, a classic of Machine Learning, where the goal is to automatically recognize hand-written digits. I didn’t think it would be fair to just throw people in the shark tank without any guidance, especially for F# beginners, so I prepared a minimal slide deck to explain the problem and data set, and a “guided script”, with hints and language syntax examples.

And… it worked! The attendees were absolutely awesome. We had people from Kaggle, Rdio, and two people who drove all the way from Sacramento; we had beginners and experienced FSharpers – and everybody managed to get a classifier working, from scratch. Having some beers available definitely helped, too.

My favorite part is this one attendee, an F# beginner, who kept going at it after the meeting was over, and posted an algorithm improvement in the comments section of the Meetup a couple of days after. Way to go! And given the positive response, we’ll definitely have more of these.

I also wanted to say a huge thanks to Matt Harrington, first for starting this user group back then, and then for still being an incredible supporter of the F# community in SF, in spite of a crazy work schedule. Thanks, Matt!

Introduction slide deck

“Guided script”