Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
1. March 2014 14:32

During some recent meanderings through the confines of the internet, I ended up discovering the Winnow Algorithm. The simplicity of the approach intrigued me, so I thought it would be interesting to try and implement it in F# and see how well it worked.

The purpose of the algorithm is to train a binary classifier, based on binary features. In other words, the goal is to predict one of two states, using a collection of features which are all binary. The prediction model assigns weights to each feature; to predict the state of an observation, it checks all the features that are “active” (true), and sums up the weights assigned to these features. If the total is above a certain threshold, the result is true, otherwise it’s false. Dead simple – and so is the corresponding F# code:

type Observation = bool []
type Label = bool
type Example = Label * Observation
type Weights = float []

let predict (theta:float) (w:Weights) (obs:Observation) =
(obs,w) ||> Seq.zip
|> Seq.filter fst
|> Seq.sumBy snd
|> ((<) theta)

We create some type aliases for convenience, and write a predict function which takes in theta (the threshold), weights and and observation; we zip together the features and the weights, exclude the pairs where the feature is not active, sum the weights, check whether the threshold is lower that the total, and we are done.

In a nutshell, the learning process feeds examples (observations with known label), and progressively updates the weights when the model makes mistakes. If the current model predicts the output correctly, don’t change anything. If it predicts true but should predict false, it is over-shooting, so weights that were used in the prediction (i.e. the weights attached to active features) are reduced. Conversely, if the prediction is false but the correct result should be true, the active features are not used enough to reach the threshold, so they should be bumped up.

And that’s pretty much it – the algorithm starts with arbitrary initial weights of 1 for every feature, and either doubles or halves them based on the mistakes. Again, the F# implementation is completely straightforward. The weights update can be written as follows:

let update (theta:float) (alpha:float) (w:Weights) (ex:Example) =
let real,obs = ex
match (real,predict theta w obs) with
| (true,false) -> w |> Array.mapi (fun i x -> if obs.[i] then alpha * x else x)
| (false,true) -> w |> Array.mapi (fun i x -> if obs.[i] then x / alpha else x)
| _ -> w

Let’s check that the update mechanism works:

> update 0.5 2. [|1.;1.;|] (false,[|false;true;|]);;
val it : float [] = [|1.0; 0.5|]

The threshold is 0.5, the adjustment multiplier is 2, and each feature is currently weighted at 1. The state of our example is [| false; true; |], so only the second feature is active, which means that the predicted value will be 1. (the weight of that feature). This is above the threshold 0.5, so the predicted value is true. However, because the correct value attached to that example is false, our prediction is incorrect, and the weight of the second feature is reduced, while the first one, which was not active, remains unchanged.

Let’s wrap this up in a convenience function which will learn from a sequence of examples, and give us directly a function that will classify observations:

let learn (theta:float) (alpha:float) (fs:int) (xs:Example seq) =
let updater = update theta alpha
let w0 = [| for f in 1 .. fs -> 1. |]
let w = Seq.fold (fun w x -> updater w x) w0 xs
fun (obs:Observation) -> predict theta w obs

We pass in the number of features, fs, to initialize the weights at the correct size, and use a fold to update the weights for each example in the sequence. Finally, we create and return a function that, given an observation, will predict the label, based on the weights we just learnt.

And that’s it – in 20 lines of code, we are done, the Winnow is implemented.

More...

15. February 2014 12:51

My favorite column in MSDN Magazine is Test Run; it was originally focused on testing, but the author, James McCaffrey, has been focusing lately on topics revolving around numeric optimization and machine learning, presenting a variety of methods and approaches. I quite enjoy his work, with one minor gripe –his examples are all coded in C#, which in my opinion is really too bad, because the algorithms would gain much clarity if written in F# instead.

Back in June 2013, he published a piece on Amoeba Method Optimization using C#. I hadn’t seen that approach before, and found it intriguing. I also found the C# code a bit too hairy for my feeble brain to follow, so I decided to rewrite it in F#.

In a nutshell, the Amoeba approach is a heuristic to find the minimum of a function. Its proper respectable name is the Nelder-Nead method. The reason it is also called the Amoeba method is because of the way the algorithm works: in its simple form, it starts from a triangle, the “Amoeba”; at each step, the Amoeba “probes” the value of 3 points in its neighborhood, and moves based on how much better the new points are. As a result, the triangle is iteratively updated, and behaves a bit like an Amoeba moving on a surface.

Before going into the actual details of the algorithm, here is how my final result looks like. You can find the entire code here on GitHub, with some usage examples in the Sample.fsx script file. Let’s demo the code in action: in a script file, we load the Amoeba code, and use the same function the article does, the Rosenbrock function. We transform the function a bit, so that it takes a Point (an alias for an Array of floats, essentially a vector) as an input, and pass it to the solve function, with the domain where we want to search, in that case, [ –10.0; 10.0 ] for both x and y:

#load "Amoeba.fs"

open Amoeba
open Amoeba.Solver

let g (x:float) y =
100. * pown (y - x * x) 2 + pown (1. - x) 2

let testFunction (x:Point) =
g x.[0] x.[1]

solve Default [| (-10.,10.); (-10.,10.) |] testFunction 1000

Running this in the F# interactive window should produce the following:

val it : Solution = (0.0, [|1.0; 1.0|])
>

The algorithm properly identified that the minimum is 0, for a value of x = 1.0 and y = 1.0. Note that results may vary: this is a heuristic, which starts with a random initial amoeba, so each run could produce slightly different results, and might at times epically fail.

More...

18. January 2014 14:49

A couple of months ago, I started working on an F# decision tree & random forest library, and pushed a first draft out in July 2013. It was a very minimal implementation, but it was a start, and my plan was to keep refining and add features. And then life happened: I got really busy, I began a very poorly disciplined refactoring effort on the code base, I second and third guessed my design - and got nothing to show for a while. Finally in December, I took some time off in Europe, disappeared in the French country side, a perfect setup to roll up my sleeves and finally get some serious coding done.

And here we go - drum roll please, version 0.1 of Charon is out. You can find it on GitHub, or install it as a NuGet package.

As you can guess from the version number, this is alpha-release grade code. There will be breaking changes, there are probably bugs and obvious things to improve, but I thought it was worth releasing, because it is in a shape good enough to illustrate the direction I am taking, and hopefully get some feedback from the community.

But first, what does Charon do? Charon is a decision tree and random forest machine learning classifier. An example will probably illustrate best what it does - let's work through the classic Titanic example. Using the Titanic passenger list, we want to create a model that predicts whether a passenger is likely to survive the disaster – or meet a terrible fate. Here is how you would do that with Charon, in a couple of lines of F#.

First, we use the CSV type provider to extract passenger information from our data file:

open Charon
open FSharp.Data

type DataSet = CsvProvider<"""C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon.Examples\titanic.csv""",
SafeMode=true, PreferOptionals=true>

type Passenger = DataSet.Row

In order to define a model, Charon needs two pieces of information: what is it you are trying to predict (the label, in that case, whether the passenger survives or not), and what information Charon is allowed to use to produce predictions (the features, in that case whatever passenger information we think is relevant):

let training =
use data = new DataSet()
[| for passenger in data.Data ->
passenger, // label source
passenger |] // features source

let labels = "Survived", (fun (obs:Passenger) -> obs.Survived) |> Categorical

let features =
[
"Sex", (fun (o:Passenger) -> o.Sex) |> Categorical;
"Class", (fun (o:Passenger) -> o.Pclass) |> Categorical;
"Age", (fun (o:Passenger) -> o.Age) |> Numerical;
]

For each feature, we specify whether the feature is Categorical (a finite number of "states" is expected, for instance Sex) or Numerical (the feature is to be interpreted as a numeric value, such as Age).

The Model is now fully specified, and we can train it on our dataset, and retrieve the results:

let results = basicTree training (labels,features) { DefaultSettings with Holdout = 0.1 }

printfn "Quality, training: %.3f" (results.TrainingQuality |> Option.get)
printfn "Quality, holdout: %.3f" (results.HoldoutQuality |> Option.get)

printfn "Tree:"
printfn "%s" (results.Pretty)

… which generates the following output:

Quality, training: 0.796
Quality, holdout: 0.747
Tree:
├ Sex = male
│   ├ Class = 3 → Survived False
│   ├ Class = 1 → Survived False
│   └ Class = 2
│      ├ Age = <= 16.000 → Survived True
│      └ Age = >  16.000 → Survived False
└ Sex = female
├ Class = 3 → Survived False
├ Class = 1 → Survived True
└ Class = 2 → Survived True

Charon automatically figures out what features are most informative, and organizes them into a tree; in our example, it appears that being a lady was a much better idea than being a guy – and being a rich lady traveling first or second class an even better idea. Charon also automatically breaks down continuous variables into bins. For instance, second-class male passengers under 16 had apparently much better odds of surviving than other male passengers. Charon splits the sample into training and validation; in this example, while our model appears quite good on the training set, with nearly 80% correct calls, the performance on the validation set is much weaker, with under 75% correctly predicted, suggesting an over-fitting issue.

I won’t demonstrate the Random Forest here; the API is basically the same, with better results but less human-friendly output. While formal documentation is lacking for the moment, you can find code samples in the Charon.Examples project that illustrate usage on the Titanic and the Nursery datasets.

What I hope I conveyed with this small example is the design priorities for Charon: a lightweight API that permits quick iterations to experiment with features and refine a model, using the F# Interactive capabilities.

I will likely discuss in later posts some of the challenges I ran into while implementing support for continuous variables – I learnt a lot in the process. I will leave it at that for today – in the meanwhile, I would love to get feedback on the current direction, and what you may like or hate about it. If you have comments, feel free to hit me up on Twitter, or to open an Issue on GitHub!

6. September 2013 08:15

Recently, Cesar De Souza began moving his .NET machine learning library, Accord.NET, from Google Code to GitHub. The move is still in progress, but that motivated me to take a closer look at the library; given that it is built in C#, with an intended C# usage in mind, I wanted to see how usable it is from F#.

There is a lot in the library; as a starting point, I decided I would try out its Support Vector Machine (SVM), a classic machine learning algorithm, and run it on a classic problem, automatically recognizing hand-written digits. The dataset I will be using here is a subset of the Kaggle Digit Recognizer contest; each example in the dataset is a 28x28 grayscale pixels image, the result of scanning a number written down by a human, and what the actual number is. From that original dataset, I sampled 5,000 examples, which will be used to train the algorithm, and another 500 in a validation set, which we’ll use to evaluate the performance of the model on data it hasn’t “seen before”.

The full example is available as a gist on GitHub.

I’ll be working in a script file within a Library project, as I typically do when exploring data. First, we need to add references to Accord.NET via NuGet:

#r @"..\packages\Accord.2.8.1.0\lib\Accord.dll"
#r @"..\packages\Accord.Math.2.8.1.0\lib\Accord.Math.dll"
#r @"..\packages\Accord.Statistics.2.8.1.0\lib\Accord.Statistics.dll"
#r @"..\packages\Accord.MachineLearning.2.8.1.0\lib\Accord.MachineLearning.dll"

open System
open System.IO

open Accord.MachineLearning
open Accord.MachineLearning.VectorMachines
open Accord.MachineLearning.VectorMachines.Learning
open Accord.Statistics.Kernels


Note the added reference to the Accord.dll and Accord.Math.dll assemblies; while the code presented below doesn’t reference it explicitly, it looks like Accord.MachineLearning is trying to load the assembly, which fails miserably if they are not referenced.

Then, we need some data; once the training set and validation set have been downloaded to your local machine (see the gist for the datasets url), that’s fairly easy to do:

let training = @"C:/users/mathias/desktop/dojosample/trainingsample.csv"
let validation = @"C:/users/mathias/desktop/dojosample/validationsample.csv"

|> fun lines -> lines.[1..]
|> Array.map (fun line -> line.Split(','))
|> Array.map (fun line ->
(line.[0] |> Convert.ToInt32), (line.[1..] |> Array.map Convert.ToDouble))
|> Array.unzip

let labels, observations = readData training


We read every line of the CSV file into an array of strings, drop the headers with array slicing, keeping only items at or after index 1, split each line around commas (so that each line is now an array of strings), retrieve separately the first element of each line (what the number actually is), and all the pixels, which we transform into a float, and finally unzip the result, so that we get an array of integers (the actual numbers), and an array of arrays, the grayscale level of each pixel.

More...

25. August 2013 08:54

About a month ago, FSharp.Data  released version 1.1.9, which contains some very nice improvements – you can find them listed on Gustavo Guerra’s blog. I was particularly excited by the changes made to the CSV Type Provider, because they make my life digging through datasets even simpler, but couldn’t find the time to write about it, because of my recent cross-country peregrinations.

Now that I am back, let’s talk about why this update made me so happy, with a concrete example. My latest week-end project is an F# implementation of Random Forests; as part of the process, I am trying out the algorithm on various datasets, to get a sense for potential performance problems, and dog-food my own API, the best way I know to quickly spot suckiness.

One of the problems I ran into was the representation of missing values. Most datasets don’t come clean and ready to use – usually you’ll have a few records with missing data. I opted for what seemed the most straightforward representation in F#, and decided to represent every feature value as an Option – anything can either have Some value, or None.

The original CSV Type Provider introduced a bit of friction there, because it inferred types “optimistically”: if the sample used contained only integers, it would create an integer, which is great in most cases, except when you want to be “pessimistic” (which is usually a safe world-view when setting expectations regarding data).

The new-and-improved CSV Type Provider fixes that, and introduces a few niceties. Case in point: the Kaggle Titanic dataset, which contains the Titanic’s passenger list. With the new version, extracting the data is as simple as this:

type DataSet = CsvProvider<"titanic.csv",
Schema="PassengerId=int, Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse",
SafeMode=true,
PreferOptionals=true>

type Passenger = DataSet.Row


This is pretty awesome. In a couple of lines, just by passing in the path to my CSV file and some (optional) schema information, I get a Passenger type:

What’s neat here is that first, I immediately get a Passenger with properties – with the correct Optional types, thanks to SafeMode and PreferOptional. Then, notice in the Schema the Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse bit? This renames “on the fly” the properties; instead of the pretty obscurely named Parch feature coming from the CSV file header, I get a nice and readable ParentsOrChildren property. The Type Provider even does a few more cool things, automagically; for instance, the feature “Survived”, which is encoded in the original dataset as 0 or 1, gets automatically converted to a boolean. Really nice.

And just like that, I can now use this CSV file, and send it to my (still very much in alpha version) Decision Tree classifier:

// We read the training set into an array,
// defining the Label we want to classify on:
let training =
use data = new DataSet()
[| for passenger in data.Data ->
passenger.Survived |> Categorical, // the label
passenger |]
// We define what features should be used:
let features = [|
"Sex", (fun (x:Passenger) -> x.Sex |> Categorical);
"Class", (fun x -> x.Class |> Categorical); |]
// We run the classifier...
let classifier, report = createID3Classifier training features { DefaultID3Config with DetailLevel = Verbose }
// ... and display the resulting tree:
report.Value.Pretty()


… which produces the following results in the F# Interactive window:

> titanicDemo();;
├ Sex = male
│   ├ Class = 3 → False
│   ├ Class = 1 → False
│   └ Class = 2 → False
└ Sex = female
├ Class = 3 → False
├ Class = 1 → True
└ Class = 2 → True
val it : unit = ()
>

The morale of the story here is triple. First, it was a much better idea to be a rich lady on the Titanic, rather than a (poor) dude. Then, Type Providers are really awesome – in a couple of lines, we extracted from a CSV file a collection of Passengers, all of them statically typed, with all the benefits attached to that; in a way, this is the best of both worlds – access the data as easily as with a dynamic language, but with all the benefits of types. Finally, the F# community is just awesome – big thanks to everyone who contributed to FSharp.Data, and specifically to @ovatsus for the recent improvements to the CSV Type Provider!

You can find the full Titanic example here on GitHub.