Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
15. February 2014 12:51

My favorite column in MSDN Magazine is Test Run; it was originally focused on testing, but the author, James McCaffrey, has been focusing lately on topics revolving around numeric optimization and machine learning, presenting a variety of methods and approaches. I quite enjoy his work, with one minor gripe –his examples are all coded in C#, which in my opinion is really too bad, because the algorithms would gain much clarity if written in F# instead.

Back in June 2013, he published a piece on Amoeba Method Optimization using C#. I hadn’t seen that approach before, and found it intriguing. I also found the C# code a bit too hairy for my feeble brain to follow, so I decided to rewrite it in F#.

In a nutshell, the Amoeba approach is a heuristic to find the minimum of a function. Its proper respectable name is the Nelder-Nead method. The reason it is also called the Amoeba method is because of the way the algorithm works: in its simple form, it starts from a triangle, the “Amoeba”; at each step, the Amoeba “probes” the value of 3 points in its neighborhood, and moves based on how much better the new points are. As a result, the triangle is iteratively updated, and behaves a bit like an Amoeba moving on a surface.

Before going into the actual details of the algorithm, here is how my final result looks like. You can find the entire code here on GitHub, with some usage examples in the Sample.fsx script file. Let’s demo the code in action: in a script file, we load the Amoeba code, and use the same function the article does, the Rosenbrock function. We transform the function a bit, so that it takes a Point (an alias for an Array of floats, essentially a vector) as an input, and pass it to the solve function, with the domain where we want to search, in that case, [ –10.0; 10.0 ] for both x and y:

#load "Amoeba.fs"

open Amoeba
open Amoeba.Solver

let g (x:float) y =
100. * pown (y - x * x) 2 + pown (1. - x) 2

let testFunction (x:Point) =
g x.[0] x.[1]

solve Default [| (-10.,10.); (-10.,10.) |] testFunction 1000

Running this in the F# interactive window should produce the following:

val it : Solution = (0.0, [|1.0; 1.0|])
>

The algorithm properly identified that the minimum is 0, for a value of x = 1.0 and y = 1.0. Note that results may vary: this is a heuristic, which starts with a random initial amoeba, so each run could produce slightly different results, and might at times epically fail.

More...

5. July 2013 15:51

Besides having one of the coolest names around, Random Forest is an interesting machine learning algorithm, for a few reasons. It is applicable to a large range of classification problems, isn’t prone to over-fitting, can produce good quality metrics as a side-effect of the training process itself, and is very suitable for parallelization. For all these reasons, I thought it would be interesting to try it out in F#.

The current implementation I will be discussing below works, but isn’t production ready (yet) – it is work in progress. The API and implementation are very likely to change over the next few weeks. Still, I thought I would share what I did so far, and maybe get some feedback!

## The idea behind the algorithm

As the name suggests, Random Forest (introduced in the early 2000s by Leo Breiman) can be viewed as an extension of Decision Trees, which I discussed before. A decision tree grows a single classifier, in a top-down manner: the algorithm recursively selects the feature which is the most informative, partitions the data according to the outcomes of that feature, and repeats the process until no information can be gained by partitioning further. On a non-technical level, the algorithm is playing a smart “game of 20 questions”: given what has been deduced so far, it picks from the available features the one that is most likely to lead to a more certain answer.

How is a Random Forest different from a Decision Tree? The first difference is that instead of growing a single decision tree, the algorithm will create a “forest” – a collection of Decision Trees; the final decision of the classifier will be the majority decision of all trees in the forest. However, having multiple times the same tree wouldn’t be of much help, because we would get the same classifier repeated over and over again. This is where the algorithm gets interesting: instead of growing a Tree using the entire training set and features, it introduces two sources of randomness:

• each tree is grown on a new sample, created by randomly sampling the original dataset with replacement (“bagging”),
• at each node of the tree, only a random subset of the remaining features is used.

Why would introducing randomness be a good idea? It has a few interesting benefits:

• by selecting different samples, it mitigates the risk of over-fitting. A single tree will produce an excellent fit on the particular dataset that was used to train it, but this doesn’t guarantee that the result will generalize to other sets. Training multiple trees on random samples creates a more robust overall classifier, which will by construction handle a “wider” range of situations than a single dataset,
• by selecting a random subset of features, it mitigates the risks of greedily picking locally optimal features that could be overall sub-optimal. As a bonus, it also allows a computation speed-up for each tree, because fewer features need to be considered at each step,
• the bagging process, by construction, creates for each tree a Training Set (the selected examples) and a Cross-Validation Set (what’s “out-of-the-bag”), which can be directly used to produce quality metrics on how the classifier may perform in general.

## Usage

Before delving into the current implementation, I thought it would be interesting to illustrate on an example the intended usage. I will be using the Titanic dataset, from the Kaggle Titanic contest. The goal of the exercise is simple: given the passengers list of the Titanic, and what happened to them, can you build a model to predict who sinks or swims?

I didn’t think the state of affairs warranted a Nuget package just yet, so this example is implemented as a script, in the Titanic branch of the project itself on GitHub.

First, let’s create a Record type to represent passengers:

type Passenger = {
Id: string;
Class: string;
Name: string;
Sex: string;
Age: string;
SiblingsOrSpouse: string;
ParentsOrChildren: string;
Ticket: string;
Fare: string;
Cabin: string;
Embarked: string }


Note that all the properties are represented as strings; it might be better to represent them for what they are (Age is a float, SiblingsOrSpouse an integer…) – but given that the dataset contains missing data, this would require dealing with that issue, perhaps using an Option type. We’ll dodge the problem for now, and opt for a stringly-typed representation.

Next, we need to construct a training set from the Kaggle data file. We’ll use the CSV parser that comes with FSharp.Data to extract the passengers from that list, as well as their known fate (the file is assumed to have been downloaded on your local machine first):

let path = @"C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon\train.csv"
let data = CsvFile.Load(path).Cache()

let trainingSet =
[| for line in data.Data ->
line.GetColumn "Survived" |> Some, // the label
{   Id = line.GetColumn "PassengerId";
Class = line.GetColumn "Pclass";
Name = line.GetColumn "Name";
Sex = line.GetColumn "Sex";
Age = line.GetColumn "Age";
SiblingsOrSpouse = line.GetColumn "SibSp";
ParentsOrChildren = line.GetColumn "Parch";
Ticket = line.GetColumn "Ticket";
Fare =line.GetColumn "Fare";
Cabin = line.GetColumn "Cabin";
Embarked = line.GetColumn "Embarked" } |]


Now that we have data, we can get to work, and define a model. We’ll start first with a regular Decision Tree, and extract only one feature, Sex:

let features =
[| (fun x -> x.Sex |> StringCategory); |]


What this is doing is defining an Array of features, a feature being a function which takes in a Passenger, and returns an Option string, via the utility StringCategory. StringCategory simply expects a string, and transforms a null or empty case into the “missing data” case, and otherwise treats the string as a Category. So in that case, x is a passenger, and if no Sex information is found, it will transform it into None, and otherwise into Some(“male”) or Some(“female”), the two cases that exist in the dataset.

We are now ready to go – we can run the algorithm and get a Decision Tree classifier, with a minimum leaf of 5 elements (i.e. we stop partitioning if we have less than 5 elements left):

let minLeaf = 5
let classifier = createID3Classifier trainingSet features minLeaf


… and we are done. How good is our classifier? Let’s check:

let correct =
trainingSet
|> Array.averageBy (fun (label, obs) ->
if label = Some(classifier obs) then 1. else 0.)
printfn "Correct: %.4f" correct


More...

30. September 2012 13:08

After four weeks of vacations, I am back home, ready to continue my series of posts converting the samples from Machine Learning in Action from Python to F#.

Today’s post covers Chapter 5 of the book, dedicated to Logistic Regression. Logistic Regression is another classification method. It uses numeric data to determine how to separate observations into two classes, identified by 0 or 1.

The entire code presented in this post can be found on GitHub, commit 398677f

## The idea behind the algorithm

The main idea behind the algorithm is to find a function which, using the numeric data that describe an individual observation as input, will return a number between 0 and 1. Ideally, that function will return a number close to respectively 0 or 1 for observations belonging to group 0 or 1.

To achieve that result, the algorithm relies on the Sigmoid function, f(x) = 1 / (1 + exp(-x)) .

For any input value, the Sigmoid function returns a value in ] 0 ; 1 [. A positive value will return a value greater than 0.5, and the greater the input value, the closer to 1. One could think of the function as returning a probability: for very high or low values of x, there is a high certainty that it belongs to one of the two groups, and for values close to zero, the probability of each group is 50% / 50%.

The only thing needed then is a transformation taking the numeric values describing the observations from the dataset, and mapping them to a single value, such that applying the Sigmoid function to it produces results close to the group the observation belongs to. The most straightforward way to achieve this is to apply a linear combination: an observation with numeric values [ x1; x2; … xk ] will be converted into w0 + w1 x x1 + w2 x x2 … + wk x xk, by applying weights [ w0; w1; … wk ] to each of the components of the observation. Note how the weights have one extra element w0, which is used for a constant term.

If our observations had two components X and Y, each observation can be represented as a point (X, Y) in the plane, and what we are looking for is a straight line w0 + w1 x X + w2 x Y, such that every observation of group 0 is on one side of the line, and every observation of group 1 on the other side.

We now replaced one problem by another – how can we find a suitable set of weights W?

I won’t even attempt a full explanation of the approach, and will stick to fuzzy, high-level intuition. Basically, the algorithm starts with an arbitrary set of weights, and iteratively adjusts the weights, by comparing the results of the function and what it should be (the actual group), and adjusting them to reduce error.

Note: I’ll skip the Gradient Ascent method, and go straight to the second part of Chapter 5, which covers Stochastic Gradient Ascent, because the code is both easier to understand and more suitable to large datasets. On the other hand, the deterministic gradient ascent approach is probably clearer for the math inclined. If that’s your situation, you might be interested in this MSDN Magazine article, which presents a C# implementation of the Logistic Regression.

Let’s illustrate the update procedure, on an ultra-simplified example, where we have a single weight W. In that case, the predicted value for an observation which has value X will be sigmoid (W x X) , and the algorithm adjustment is given by the following formula:

W <- W + alpha x (Label – sigmoid (W x X))

where Label is the group the observation belongs to (0 or 1), and alpha is a user-defined parameter, between 0 and 1. In other words, W is updated based on the error, Label – sigmoid (W x X) . First, obviously, if there is no error, W will remain unchanged, there is nothing to adjust. Let’s consider the case where Label is 1, and both X and W are positive. In that case, Label – sigmoid (W x X) will be positive (between 0 and 1), and W will be increased. As W increases, the sigmoid becomes closer to 1, and the adjustments become progressively smaller. Similarly, considering all the cases for W and X (positive and negative), one can verify that W will be adjusted in a direction which reduces the classification error. Alpha can be described as “how aggressive” the adjustment should be – the closer to 1, the more W will be updated.

That’s the gist of the algorithm – the full-blown deterministic gradient algorithm proceeds to update the weights by considering the error on the entire dataset at once, which makes it more expensive, whereas the stochastic gradient approach updates the weights sequentially, taking the dataset observations one by one, which makes it convenient for larger datasets.

## Simple implementation

Enough talk – let’s jump into code, with a straightforward implementation first. We create a module “LogisticRegression”, and begin with building the function which predicts the class of an observation, given weights:

module LogisticRegression =

open System

let sigmoid x = 1.0 / (1.0 + exp -x)

// Vector dot product
let dot (vec1: float list)
(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e * snd e)
|> List.sum

let add (vec1: float list)
(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e + snd e)

// Vector scalar product
let scalar alpha (vector: float list) =
List.map (fun e -> alpha * e) vector

// Weights have 1 element more than observations, for constant
let predict (weights: float list)
(obs: float list) =
1.0 :: obs
|> dot weights
|> sigmoid

More...

22. August 2012 13:35

In my recent post on Decision Tree Classifiers, I mentioned that I was too lazy to figure out how to visualize the Decision Tree “supporting” the classifier. Well, at times, the Internet can be an awesome place. Cesar Mendoza has forked the Machine Learning in Action GitHub project, and done a very fine job resolving that problem using the Microsoft Automatic Graph Layout library, and running it on the Lenses Dataset from the University of California, Irvine Machine Learning dataset repository.

Here is the result of the visualization, you can find his code here:

Unfortunately, as far as I can tell, the library is not open source, and requires a MSDN license. The amount of great stuff produced at Microsoft Research is amazing, it’s just too bad that at times licensing seems to get in the way of getting the word out…

18. August 2012 03:39

This is the continuation of my series exploring Machine Learning, converting the code samples of “Machine Learning in Action” from Python to F# as I go through the book. Today’s post covers Chapter 4, which is dedicated to Naïve Bayes classification – and you can find the resulting code on GitHub.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

## The idea behind the Algorithm

The canonical application of Bayes naïve classification is in text classification, where the goal is to identify to which pre-determined category a piece of text belongs to  – for instance, is this email I just received spam, or ham (“valuable” email)?

The underlying idea is to use individual words present in the text as indications for what category it is most likely to belong to, using Bayes Theorem, named after the cheerful-looking Reverend Bayes.

Imagine that you received an email containing the words “Nigeria”, “Prince”, “Diamonds” and “Money”. It is very likely that if you look into your spam folder, you’ll find quite a few emails containing these words, whereas, unless you are in the business of importing diamonds from Nigeria and have some aristocratic family, your “normal” emails would rarely contain these words. They have a much higher frequency within the category “Spam” than within the Ham, which makes them a potential flag for undesired business ventures.

On the other hand, let’s assume that you are a lucky person, and that typically, what you receive is Ham, with the occasional Spam bit. If you took a random email in your inbox, it is then much more likely that it belongs to the Ham category.

Bayes’ Theorem combines these two pieces of information together, to determine the probability that a particular email belongs to the “Spam” category, if it contains the word “Nigeria”:

P(is “Spam”|contains ”Nigeria”) = P(contains “Nigeria|is ”Spam”) x P(is “Spam”) / P(contains “Nigeria”)

In other words, 2 factors should be taken into account when deciding whether an email containing “Nigeria” is spam: how over-represented is that word in Spam, and how likely is it that any email is spammy in the first place?

The algorithm is named “Naïve”, because it makes a simplifying assumption about the text, which turns out to be very convenient for computations purposes, namely that each word appears with a frequency which doesn’t depend on other words. This is an unlikely assumption (the word “Diamond” is much more likely to be present in an email containing “Nigeria” than in your typical family-members discussion email).

We’ll leave it at that on the concepts –  I’ll refer the reader who want to dig deeper to the book, or to this explanation of text classification with Naïve Bayes.

## A simple F# implementation

For my first pass, I took a slightly different direction from the book, and decided to favor readability over performance. I assume that we are operating on a dataset organized as a sequence of text samples, each of them labeled by category, along these lines (example from the book “Machine Learning in Action”):

Note: the code presented here can be found found on GitHub

let dataset =
[| ("Ham",  "My dog has flea problems help please");
("Spam", "Maybe not take him to dog park stupid");
("Ham",  "My dalmatian is so cute I love him");
("Spam", "Stop posting stupid worthless garbage");
("Ham",  "Mr Licks ate my steak how to stop him");
("Spam", "Quit buying worthless dog food stupid") |]

We will need to do some word counting to compute frequencies, so let’s start with a few utility functions:

    open System
open System.Text.RegularExpressions

// Regular Expression matching full words, case insensitive.
let matchWords = new Regex(@"\w+", RegexOptions.IgnoreCase)

// Extract and count words from a string.
// http://stackoverflow.com/a/2159085/114519
let wordsCount text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.groupBy (fun m -> m.Value)
|> Seq.map (fun (value, groups) ->
value.ToLower(), (groups |> Seq.length))

// Extracts all words used in a string.
let vocabulary text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.map (fun m -> m.Value.ToLower())
|> Seq.distinct

// Extracts all words used in a dataset;
// a Dataset is a sequence of "samples",
// each sample has a label (the class), and text.
let extractWords dataset =
dataset
|> Seq.map (fun sample -> vocabulary (snd sample))
|> Seq.concat
|> Seq.distinct

// "Tokenize" the dataset: break each text sample
// into words and how many times they are used.
let prepare dataset =
dataset
|> Seq.map (fun (label, sample) -> (label, wordsCount sample))

We use a Regular Expression, \w+, to match all words, in a case-insensitive way. wordCount extracts individual words and the number of times they occur, while vocabulary simply returns the words encountered. The prepare function takes a complete dataset, and transforms each text sample into a Tuple containing the original classification label, and a Sequence of Tuples containing all lower-cased words found and their count.

More...