Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
25. November 2012 09:10

I am still working my way through “Machine Learning in Action”, converting the samples from Python to F#. I am currently in the middle of chapter 6, dedicated to Support Vector Machines, which has given me more trouble than the previous ones. This post will be sharing my current progress: the code I have so far is a working translation of the naïve SVM implementation, presented in the first half of the chapter. We’ll get to kernels, and the full Platt SMO algorithm in a later post – today will be solely discussing the simple, un-optimized version.

Two factors slowed me down with this chapter: the math, and Python.

The math behind the algorithm is significantly more involved than the other algorithms, and I won’t even try to go into why it works. I recommend reading An Idiot’s guide to SVMs, which I found a pretty complete and accessible explanation of the theory behind SVMs. I will focus instead on the implementation, which was in itself a bit challenging.

First, the Python code uses algebra quite a bit, and I found that deciphering what was going on required a bit of patience. Take a line like the following:

fXi = (float)(multiply(alphas, labelMat).T*(dataMatrix*dataMatrix[i,:].T))+b

I am reasonably well versed in linear algebra, but figuring out what this is saying takes some attention. Granted, I have no experience with Python and NumPy, so my whining is probably a bit unfair. Still, I thought the code was not very readable, and it motivated me to see if that could be improved, and as a result I ended up moving away from heavy algebra notation.

Then, the algorithm is implemented as a Deep Arrow. A main loop performs computations and evaluates conditions at multiple points, using continue to exit / short-circuit the evaluation. The code I ended up with doesn’t use mutation, but is still heavily indented, which I am not happy about - I’ll work on that later.

## Simplified algorithm implementation

Note: as the title of the post indicates, this is work in progress. The current implementation works, but has some obvious flaws (see last paragraph), which I intend to fix in upcoming posts. My intent is to share my progression through the problem – please don’t take this as a good reference SVM implementation. Hopefully we’ll get there soon, but this is not it, not yet.

More...

30. September 2012 13:08

After four weeks of vacations, I am back home, ready to continue my series of posts converting the samples from Machine Learning in Action from Python to F#.

Today’s post covers Chapter 5 of the book, dedicated to Logistic Regression. Logistic Regression is another classification method. It uses numeric data to determine how to separate observations into two classes, identified by 0 or 1.

The entire code presented in this post can be found on GitHub, commit 398677f

## The idea behind the algorithm

The main idea behind the algorithm is to find a function which, using the numeric data that describe an individual observation as input, will return a number between 0 and 1. Ideally, that function will return a number close to respectively 0 or 1 for observations belonging to group 0 or 1.

To achieve that result, the algorithm relies on the Sigmoid function, f(x) = 1 / (1 + exp(-x)) .

For any input value, the Sigmoid function returns a value in ] 0 ; 1 [. A positive value will return a value greater than 0.5, and the greater the input value, the closer to 1. One could think of the function as returning a probability: for very high or low values of x, there is a high certainty that it belongs to one of the two groups, and for values close to zero, the probability of each group is 50% / 50%.

The only thing needed then is a transformation taking the numeric values describing the observations from the dataset, and mapping them to a single value, such that applying the Sigmoid function to it produces results close to the group the observation belongs to. The most straightforward way to achieve this is to apply a linear combination: an observation with numeric values [ x1; x2; … xk ] will be converted into w0 + w1 x x1 + w2 x x2 … + wk x xk, by applying weights [ w0; w1; … wk ] to each of the components of the observation. Note how the weights have one extra element w0, which is used for a constant term.

If our observations had two components X and Y, each observation can be represented as a point (X, Y) in the plane, and what we are looking for is a straight line w0 + w1 x X + w2 x Y, such that every observation of group 0 is on one side of the line, and every observation of group 1 on the other side.

We now replaced one problem by another – how can we find a suitable set of weights W?

I won’t even attempt a full explanation of the approach, and will stick to fuzzy, high-level intuition. Basically, the algorithm starts with an arbitrary set of weights, and iteratively adjusts the weights, by comparing the results of the function and what it should be (the actual group), and adjusting them to reduce error.

Note: I’ll skip the Gradient Ascent method, and go straight to the second part of Chapter 5, which covers Stochastic Gradient Ascent, because the code is both easier to understand and more suitable to large datasets. On the other hand, the deterministic gradient ascent approach is probably clearer for the math inclined. If that’s your situation, you might be interested in this MSDN Magazine article, which presents a C# implementation of the Logistic Regression.

Let’s illustrate the update procedure, on an ultra-simplified example, where we have a single weight W. In that case, the predicted value for an observation which has value X will be sigmoid (W x X) , and the algorithm adjustment is given by the following formula:

W <- W + alpha x (Label – sigmoid (W x X))

where Label is the group the observation belongs to (0 or 1), and alpha is a user-defined parameter, between 0 and 1. In other words, W is updated based on the error, Label – sigmoid (W x X) . First, obviously, if there is no error, W will remain unchanged, there is nothing to adjust. Let’s consider the case where Label is 1, and both X and W are positive. In that case, Label – sigmoid (W x X) will be positive (between 0 and 1), and W will be increased. As W increases, the sigmoid becomes closer to 1, and the adjustments become progressively smaller. Similarly, considering all the cases for W and X (positive and negative), one can verify that W will be adjusted in a direction which reduces the classification error. Alpha can be described as “how aggressive” the adjustment should be – the closer to 1, the more W will be updated.

That’s the gist of the algorithm – the full-blown deterministic gradient algorithm proceeds to update the weights by considering the error on the entire dataset at once, which makes it more expensive, whereas the stochastic gradient approach updates the weights sequentially, taking the dataset observations one by one, which makes it convenient for larger datasets.

## Simple implementation

Enough talk – let’s jump into code, with a straightforward implementation first. We create a module “LogisticRegression”, and begin with building the function which predicts the class of an observation, given weights:

module LogisticRegression =

open System

let sigmoid x = 1.0 / (1.0 + exp -x)

// Vector dot product
let dot (vec1: float list)
(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e * snd e)
|> List.sum

(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e + snd e)

// Vector scalar product
let scalar alpha (vector: float list) =
List.map (fun e -> alpha * e) vector

// Weights have 1 element more than observations, for constant
let predict (weights: float list)
(obs: float list) =
1.0 :: obs
|> dot weights
|> sigmoid

More...

1. August 2012 17:27

This is the continuation of this post, where I began exploring k-nearest-neighbor classification, a Machine Learning algorithm used to classify items in different categories, based on an existing sample of items that have been properly classified.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading “Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

To determine what category an item belongs to, the algorithm measures its distance from each element of the known dataset, and takes a “majority vote” to determine its category, based on its k nearest neighbors.

In our last installment, we wrote a simple classification function, which was doing just that. Today, we’ll continue along Chapter 2, applying our algorithm to real data, and dealing with data normalization.

## The dataset: Elections 2008

Rather than use the data sample from the book, I figured it would be more fun to create my own data set. The problem we’ll look into is whether we can predict whether a state voted Republican or Democrat in the 2008 presidential election. Our dataset will consist of the following: the Latitude and Longitude of the State, and its Population. A State is classified as Democrat if the number of votes (i.e. popular vote) recorded for Obama was greater than McCain.

Notes

• My initial goal was to do this for Cities, but I couldn’t find voting data at that level – so I had to settle for States. The resulting sample is smaller than I would have liked, and also less useful (we can’t realistically use our classifier to produce a prediction for a new State, because the likelihood of a new State joining the Union is fairly small) but it still works well for illustration purposes.
• Computing distance based on raw Latitude and Longitude wouldn’t be such a great idea in general, because they denote a position on a sphere (very close points may have very different Longitudes); however, given the actual layout of the United States, this will be good enough here.

I gathered the data (see sources at the end) and saved it in a comma-delimited text file “Election2008.txt” (the raw text version is available at the bottom of the post).

First, we need to open that file and parse it into the structure we used in our previous post, with a matrix of observations for each state, and a vector of categories (DEM or REP). That’s easy enough with F#:

let elections =
let file = @"C:\Users\Mathias\Desktop\Elections2008.txt"
let fileAsLines =
|> Array.map (fun line -> line.Split(','))
let dataset =
fileAsLines
|> Array.map (fun line ->
[| Convert.ToDouble(line.[1]);
Convert.ToDouble(line.[2]);
Convert.ToDouble(line.[3]) |])
let labels = fileAsLines |> Array.map (fun line -> line.[4])
dataset, labels

We open System and System.IO, and use File.ReadAllLines, which returns an array of strings for each line in the text file, and apply string.Split to each line, using the comma as a split-delimiter, which gives us an array of string for each text line. For each row, we then retrieve the dataset part, elements 1, 2 and 3 (the Latitude, Longitude and Population), creating an Array of doubles by converting the strings to doubles. Finally, we retrieve the fourth element of each row, the vote, into an array of labels, and return dataset and labels as a tuple.

Let’s visualize the result, using the FSharpChart display function we wrote last time. display dataset 1 0 plots Longitude on the X axis, and Latitude on the Y axis, producing a somewhat unusual map of the United States, where we can see the red states / blue states pattern (colored differently), with center states leaning Republican, and Coastal states Democrat:

Plotting Longitude against Population produces the following chart, which also displays some clusters:

Finally, Longitude versus Population also exhibits some patterns:

## Normalizing the data

We could run the algorithm we wrote in the previous post on the data, and measure the distances between observations based on the raw measures, but this would likely produce poor results, because of the discrepancy in scales. Our Latitudes range from about 20 (Hawaii) to 60 (Alaska), while Populations vary from half a million (Washington DC) to over 30 millions (California). Because we compute the distance between observations using the Euclidean distance, differences in Population will have a huge weight in the overall distance, and in effect we would be “ignoring” Latitude and Longitude in our classification.

Consider this: the raw distance between 2 states is given by

Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + (Lat(S1) – Lat(S2))^2 + (Lon(S1) – Lon(S2))^2)

Even if we considered 2 states located as far as possible from each other, at the 2 corners of our map, the distance would look like:

Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + 40^2 + 90^2)

It’s clear that even minimal differences in population in this formula (say, 100,000 inhabitants) will completely dwarf the largest possible effect of geographic distance. If we want to observe the effect of all three dimensions, we need to convert the measurements to a comparable scale, so that differences in Population are comparable, relatively, to differences in Location.

To that effect, we will Normalize our dataset. We’ll follow the example of “Machine Learning in Action”, and normalize each measurement so that its minimal value is 0, and its maximum is 1. Other approaches would be feasible, but this one has the benefit of being fairly straightforward. If we consider Population for instance, what we need to do is:

• Retrieve all the Populations in our Dataset
• Retrieve the minimum and the maximum
• Transform the Population to (Population – minimum) / (maximum – minimum)

Here is what I came up with:

let column (dataset: float [][]) i =
dataset |> Array.map (fun row -> row.[i])

let columns (dataset: float [][]) =
let cols = dataset.[0] |> Array.length
[| for i in 0 .. (cols - 1) -> column dataset i |]

let minMax dataset =
dataset
|> columns
|> Array.map (fun col -> Array.min(col), Array.max(col))

let minMaxNormalizer dataset =
let bounds = minMax dataset
fun (vector: float[]) ->
Array.mapi (fun i v ->
(vector.[i] - fst v) / (snd v - fst v)) bounds

Compared to the Python example, I have to pay a bit of a code tax here, because I chose to use plain F# arrays to model my dataset, instead of using matrices. The column function extracts column i from a dataset, by mapping each row (an observation) to its ith component. columns expands on it, and essentially transposes the matrix, converting it from an array of row vectors to an array of column vectors.

More...

29. July 2012 12:43

With all sorts of people waving around the term “Machine Learning” lately, I figured it was time to look into what the fuss was about, so I purchased “Machine Learning In Action”. I am mostly enjoying the book so far, with one inconvenience: all the code presented is in Python, which is easy enough to follow, but not directly useful to me. The best way to learn is to get your hands dirty and code, so I am planning on converting the Python examples into F# as a progress through – which should also be a good exercise in learning more F#.

Chapter 2 of the book covers classification using k-Nearest Neighbors. The idea behind the algorithm is fairly straightforward: given a dataset of numeric observations, each observation being classified in a group, the algorithm will classify a new observation based on what group most of its close neighbors belong to.

The book uses a linear algebra library in its implementation. It seemed like overkill for the situation, so I’ll go for raw F# here.

Let’s first create a new F# library project, and start working on a script, creating a fictional dataset like this:

let createDataSet =
[| [| 1.0; 0.9 |]
[| 0.8; 1.0 |]
[| 0.8; 0.9 |]
[| 0.0; 0.1 |]
[| 0.3; 0.0 |]
[| 0.1; 0.1 |] |],
[| "A"; "A"; "A"; "B"; "B"; "B" |]

createDataSet returns a Tuple with two elements. First, we create an Array of Arrays, an Array containing 6 observations on 2 fictional variables. The second element is also an Array, containing the Label of the group the observation belongs to. For instance, the first observation was [ 1.0; 0.9 ], and it belonged to group A.

It would be helpful to visualize the dataset, to get a sense for the structure of the data. One option to do this is FSharpChart, a lightweight charting library which works fairly well with the F# interactive window. The easiest way to use it is by adding it to our project via NuGet, which adds a reference to MSDN.FSharpChart. We need to add a reference to FSharpChart to the script, with a reference to the path where NuGet downloaded the libraries (this post by Don Syme provides a great example) – we are now ready to add a scatterplot function to the script:

#r @"C:\Users\Mathias\Documents\Visual Studio 2010\Projects\MachineLearningInAction\packages\MSDN.FSharpChart.dll.0.60\lib\MSDN.FSharpChart.dll"

open MSDN.FSharp.Charting

let createDataSet =
[| [| 1.0; 0.9 |]
[| 0.8; 1.0 |]
[| 0.8; 0.9 |]
[| 0.0; 0.1 |]
[| 0.3; 0.0 |]
[| 0.1; 0.1 |] |],
[| "A"; "A"; "A"; "B"; "B"; "B" |]

let scatterplot (dataset: float[][]) =
dataset
|> Array.map (fun e -> e.[0], e.[1])
|> FSharpChart.FastPoint
|> FSharpChart.Create

The scatterplot simply takes a dataset, maps each observation to a tuple of X and Y coordinates, and passes it to FSharpChart.FastPoint, which produces a… scatterplot. Let’s select all that code, send it to F# interactive, and start playing in fsi:

> let data, labels = createDataSet
scatterplot data;;

val labels : string [] = [|"A"; "A"; "A"; "B"; "B"; "B"|]
val data : float [] [] =
[|[|1.0; 0.9|]; [|0.8; 1.0|]; [|0.8; 0.9|]; [|0.0; 0.1|]; [|0.3; 0.0|];
[|0.1; 0.1|]|]

At that point, you should see a chart popping up, looking like this one:

More...

11. September 2011 12:39

The project I am currently working on involves developing a forecasting model. Starting from an initial estimate, the model will progressively update its forecast as time goes by and real data becomes available.

The process of developing such a model is iterative by nature: you design the mechanics of a forecasting algorithm, look at how it would have performed on historical data, fine-tune the design and parameters based on the results, and rinse & repeat.

The problem I started running into is the “look at how it would have performed on historical data”. There is plenty of data available, which is both a blessing and a curse. A blessing, because more data means better validation. A curse, because as the amount of data increases, it becomes difficult to navigate through it, and focus on individual cases.

So far, my approach has been to create metrics of fit between a model and a set of data, and to run a model against large sets of data, measuring how well the model is doing against the data set. However, I still don’t have a good solution for digging into why a particular case is not working so well. What I would like to achieve is to identify a problematic case, and explore what is going on, ideally by generating charts on the fly to visualize what is happening. Unfortunately, the tools I am using currently do not accommodate that scenario well. Excel is great at producing charts in a flexible manner, but my model is .NET code, and I don’t have a convenient, lightweight way to use C# code in Excel. Conversely, creating exploratory charts from C# is somewhat expensive, and requires a lengthy cycle: write code for the chart, compile (and lose whatever is loaded in memory), observe – and repeat.

I am currently exploring an alternative, via F# and FSharpChart. F# offers a very interesting possibility over C#, F# Interactive (fsi). Fsi is a REPL (Read, Evaluate, Print Loop), which allows you to type in code interactively in the console and execute it as you go. The beauty of it is that you can experiment with code live, without having to go through the code change / recompile cycle. Add to the mix FSharpChart, a library/script which wraps .NET DataVisualization.Charting and makes it conveniently usable from F#, and you get a nice way to write .NET code and generate charts, on the fly.

Let’s illustrate on a simple example. Suppose I have a model that simulates sales following a Poisson process, and want to check whether this “looks right”. First, let’s download FSharpChart, create a folder called “Explore” on the Desktop, and copy the FSharpChart.fsx script file into that folder. Then, let’s create an empty text file called Explore.fsx in the same folder, which we will use to experiment with code and charts, and save whatever snippets come in handy at the time.

Next, let’s double-click on the Explore.fsx file, which will then be opened in Visual Studio, and type in the following:

#load @"C:\Users\Mathias\Desktop\Explore\fsharpchart.fsx"

open System
open System.Drawing
open MSDN.FSharp.Charting

let random = new Random()

// Simulate a Poisson distribution with parameter lambda
let poisson lambda =
let L = Math.Exp(-lambda)
let rec simulate (k,p) =
if p > L then simulate (k + 1, p * random.NextDouble())
else k - 1
simulate (0, 1.0)

let sales lambda periods = [
for i in 1.0 .. periods -> (i, poisson lambda) ]

More...