Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
30. March 2015 16:34

In our previous post, we looked at James McCaffrey’s code, “Gradient Descent Training Using C#” from MSDN magazine, and took a stab at rewriting the first part in F#, to clarify a bit the way the dataset was created. Today, we’ll dive in the second block, which implements the logistic regression using gradient descent. Again, we won’t discuss why the algorithm works – the article does a pretty good job at that – and focus instead purely on the F# / C# conversion part.

Let’s begin by taking a look at the core of the C# code, which lives in the LogisticClassifier class. I took the liberty to do some minor cleanup, and remove some parts which were un-necessary, so as to make it a bit easier to see what is going on:

public class LogisticClassifier
{
private int numFeatures; // number of x variables aka features
private double[] weights; // b0 = constant
private Random rnd;

public LogisticClassifier(int numFeatures)
{
this.numFeatures = numFeatures;
this.weights = new double[numFeatures + 1]; // [0] = b0 constant
this.rnd = new Random(0);
}

public double[] Train(double[][] trainData, int maxEpochs, double alpha)
{
// alpha is the learning rate
int epoch = 0;
int[] sequence = new int[trainData.Length]; // random order
for (int i = 0; i < sequence.Length; ++i)
sequence[i] = i;

while (epoch < maxEpochs)
{
++epoch;

if (epoch % 100 == 0 && epoch != maxEpochs)
{
double mse = Error(trainData, weights);
Console.Write("epoch = " + epoch);
Console.WriteLine("  error = " + mse.ToString("F4"));
}

Shuffle(sequence); // process data in random order

// stochastic/online/incremental approach
for (int ti = 0; ti < trainData.Length; ++ti)
{
int i = sequence[ti];
double computed = ComputeOutput(trainData[i], weights);
int targetIndex = trainData[i].Length - 1;
double target = trainData[i][targetIndex];

weights[0] += alpha * (target - computed) * 1; // the b0 weight has a dummy 1 input
for (int j = 1; j < weights.Length; ++j)
weights[j] += alpha * (target - computed) * trainData[i][j - 1];
}
} // while
return this.weights; // by ref is somewhat risky
} // Train

private void Shuffle(int[] sequence)
{
for (int i = 0; i < sequence.Length; ++i)
{
int r = rnd.Next(i, sequence.Length);
int tmp = sequence[r];
sequence[r] = sequence[i];
sequence[i] = tmp;
}
}

private double Error(double[][] trainData, double[] weights)
{
// mean squared error using supplied weights
int yIndex = trainData[0].Length - 1; // y-value (0/1) is last column
double sumSquaredError = 0.0;
for (int i = 0; i < trainData.Length; ++i) // each data
{
double computed = ComputeOutput(trainData[i], weights);
double desired = trainData[i][yIndex]; // ex: 0.0 or 1.0
sumSquaredError += (computed - desired) * (computed - desired);
}
return sumSquaredError / trainData.Length;
}

private double ComputeOutput(double[] dataItem, double[] weights)
{
double z = 0.0;
z += weights[0]; // the b0 constant
for (int i = 0; i < weights.Length - 1; ++i) // data might include Y
z += (weights[i + 1] * dataItem[i]); // skip first weight
return 1.0 / (1.0 + Math.Exp(-z));
}
} // LogisticClassifier

Just from the length of it, you can tell that most of the action is taking place in the Train method, so let’s start there. What we have here is two nested loops. The outer one runs maxEpoch times, a user defined parameter. Inside that loop, we randomly shuffle the input dataset, and then loop over each training example, computing the predicted output of the logistic function for that example, comparing it to a target, the actual  label of the example, which can be 0 or 1, and adjusting the weights so as to reduce the error. We also have a bit of logging going on, displaying the prediction error every hundred outer iteration. Once the two loops are over, we return the weights.

Two things strike me here. First, a ton of indexes are involved, and this tends to obfuscate what is going on; as a symptom, a few comments are needed, to clarify how the indexes work, and what piece of the data is organized. Then, there is a lot of mutation going on. It’s not necessarily a bad thing, but I tend to avoid it as much as possible, simply because it requires keeping more moving parts in my head when I try to follow the code, and also, as McCaffrey himself points out in a comment, because “by ref is somewhat risky”.

As a warm up, let’s begin with the error computation, which is displayed every 100 iterations.  Rather than having to remember in what column the actual expected value is stored, let’s make our life easier, and use a type alias, Example, so that the features are neatly tucked in an array, and the value is clearly separated. We need to compute the average square difference between the expected value, and the output of the logistic function for each example. As it turns out, we have already implemented the logistic function in the first part in the code, so re-implementing it as in ComputeOutput seems like un-necessary work – we can get rid of that part entirely, and simply map every example to the square error, and compute the average, using pattern matching on the examples to separate clearly the features and the expected value:

type Example = float [] * float

let Error (trainData:Example[], weights:float[]) =
// mean squared error using supplied weights
trainData
|> Array.map (fun (features,value) ->
let computed = logistic weights features
let desired = value
(computed - desired) * (computed - desired))
|> Array.average

Some of you might argue that this could be made tighter – I can think of at least two possibilities. First, using a Tuple might not be the most expressive approach; replacing it with a Record instead could improve readability. Then, we could also skip the map + average part, and directly ask F# to compute the average on the fly:

type Example = { Features:float[]; Label:float }

let Error (trainData:Example[], weights:float[]) =
trainData
|> Array.averageBy (fun example ->
let computed = logistic weights example.Features
let desired = example.Label
(computed - desired) * (computed - desired))

I will keep my original version the way it is, mostly because we created a dataset based on tuples last times.

We are now ready to hit the center piece of the algorithm. Just like we would probably try to extract a method in C#, we will start extracting some of the gnarly code that lies in the middle:

for (int ti = 0; ti < trainData.Length; ++ti)
{
int i = sequence[ti];
double computed = ComputeOutput(trainData[i], weights);
int targetIndex = trainData[i].Length - 1;
double target = trainData[i][targetIndex];

weights[0] += alpha * (target - computed) * 1; // the b0 weight has a dummy 1 input
for (int j = 1; j < weights.Length; ++j)
weights[j] += alpha * (target - computed) * trainData[i][j - 1];
}

Rather than modify the weights values, it seems safer to compute new weights. And because we opted last week to insert a column with ones for the constant feature, we won’t have to deal with the index misalignment, which requires separate handling for b0 and the rest. Instead, we can write an update operation that takes in an example and weights, and returns new weights:

let update (example:Example) (weights:float[]) =
let features,target = example
let computed = logistic weights features
weights
|> Array.mapi (fun i w ->
w + alpha * (target - computed) * features.[i])

Array.mapi allows us to iterate over the weights, while maintaining the index we are currently at, which we use to grab the feature value at the corresponding index. Alternatively, you could go all verbose and zip the arrays together – or all fancy with a double-pipe and map2 to map the two arrays in one go. Your pick:

Array.zip weights features
|> Array.map (fun (weight,feat) ->
weight + alpha * (target - computed) * feat)

(weights,features)
||> Array.map2 (fun weight feat ->
weight + alpha * (target - computed) * feat)

We are now in a very good place; the only thing left to do is to plug that into the two loops. The inner loop is a perfect case for a fold (the Aggregate method in LINQ): given a starting value for weights, we want to go over every example in our training set, and, for each of them, run the update function to compute new weights. For the while loop, we’ll take a different approach, and use recursion: when the epoch reaches maxEpoch, you are done, return the weights, otherwise, keep shuffling the data and updating weights. Let’s put that all together:

let Train (trainData:Example[], numFeatures, maxEpochs, alpha, seed) =

let rng = Random(seed)
let epoch = 0

let update (example:Example) (weights:float[]) =
let features,target = example
let computed = logistic weights features
weights
|> Array.mapi (fun i w ->
w + alpha * (target - computed) * features.[i])

let rec updateWeights (data:Example[]) epoch weights =

if epoch % 100 = 0
then printfn "Epoch: %i, Error: %.2f" epoch (Error (data,weights))

if epoch = maxEpochs then weights
else
let data = shuffle rng data
let weights =
data
|> Array.fold (fun w example -> update example w) weights
updateWeights data (epoch + 1) weights
// initialize the weights and start the recursive update
let initialWeights = [| for _ in 1 .. numFeatures + 1 -> 0. |]
updateWeights trainData 0 initialWeights

And that’s pretty much it. We replaced the whole class by a couple of functions, and all the indexes are gone. This is probably a matter of taste and comfort with functional concepts, but in my opinion, this is much easier to follow.

Before trying it out, to make sure it works, I’ll take a small liberty, and modify the Train function. As it stands right now, it returns the final weights, but really, we don’t care about the weights, what we want is a classifier, which is a function that, given an array, will predict a one or a zero. That’s easy enough, let’s return a function at the end instead of weights:

// initialize the weights and start the recursive update
let initialWeights = [| for _ in 1 .. numFeatures + 1 -> 0. |]
let finalWeights = updateWeights trainData 0 initialWeights
let classifier (features:float[]) =
if logistic finalWeights features > 0.5 then 1. else 0.
classifier

We can now wrap it up, and see our code in action:

printfn "Begin Logistic Regression (binary) Classification demo"
printfn "Goal is to demonstrate training using gradient descent"

let numFeatures = 8 // synthetic data
let numRows = 10000
let seed = 1

printfn "Generating %i artificial data items with %i features" numRows numFeatures
let trueWeights, allData = makeAllData(numFeatures, numRows, seed)

printfn "Data generation weights:"
trueWeights |> Array.iter (printf "%.2f ")
printfn ""

printfn "Creating train (80%%) and test (20%%) matrices"

let trainData, testData = makeTrainTest(allData, 0)
printfn "Done"

let maxEpochs = 1000
let alpha = 0.01

let classifier = Train (trainData,numFeatures,maxEpochs,alpha,0)

let accuracy (examples:Example[]) =
examples
|> Array.averageBy (fun (feat,value) ->
if classifier feat = value then 1. else 0.)

accuracy trainData |> printfn "Prediction accuracy on train data: %.4f"
accuracy testData |> printfn "Prediction accuracy on test data: %.4f"

We used a small trick to compute the accuracy – we mark every correct call as a one, every incorrect one as a zero, which, when we compute the average, gives us directly the proportion of cases that were called correctly. On my machine, I get the following output:

>
Prediction accuracy on train data: 0.9988
Prediction accuracy on test data: 0.9980

Looks good enough to me, the implementation seems to be working. The whole code presented here is available as a gist here. I’ll leave it at that for now (I might revisit it later, and try to make this work with DiffSharp at some point, if anyone is interested) – feel free to ask questions, or drop me a comment on Twitter!

1. March 2014 14:32

During some recent meanderings through the confines of the internet, I ended up discovering the Winnow Algorithm. The simplicity of the approach intrigued me, so I thought it would be interesting to try and implement it in F# and see how well it worked.

The purpose of the algorithm is to train a binary classifier, based on binary features. In other words, the goal is to predict one of two states, using a collection of features which are all binary. The prediction model assigns weights to each feature; to predict the state of an observation, it checks all the features that are “active” (true), and sums up the weights assigned to these features. If the total is above a certain threshold, the result is true, otherwise it’s false. Dead simple – and so is the corresponding F# code:

type Observation = bool []
type Label = bool
type Example = Label * Observation
type Weights = float []

let predict (theta:float) (w:Weights) (obs:Observation) =
(obs,w) ||> Seq.zip
|> Seq.filter fst
|> Seq.sumBy snd
|> ((<) theta)

We create some type aliases for convenience, and write a predict function which takes in theta (the threshold), weights and and observation; we zip together the features and the weights, exclude the pairs where the feature is not active, sum the weights, check whether the threshold is lower that the total, and we are done.

In a nutshell, the learning process feeds examples (observations with known label), and progressively updates the weights when the model makes mistakes. If the current model predicts the output correctly, don’t change anything. If it predicts true but should predict false, it is over-shooting, so weights that were used in the prediction (i.e. the weights attached to active features) are reduced. Conversely, if the prediction is false but the correct result should be true, the active features are not used enough to reach the threshold, so they should be bumped up.

And that’s pretty much it – the algorithm starts with arbitrary initial weights of 1 for every feature, and either doubles or halves them based on the mistakes. Again, the F# implementation is completely straightforward. The weights update can be written as follows:

let update (theta:float) (alpha:float) (w:Weights) (ex:Example) =
let real,obs = ex
match (real,predict theta w obs) with
| (true,false) -> w |> Array.mapi (fun i x -> if obs.[i] then alpha * x else x)
| (false,true) -> w |> Array.mapi (fun i x -> if obs.[i] then x / alpha else x)
| _ -> w

Let’s check that the update mechanism works:

> update 0.5 2. [|1.;1.;|] (false,[|false;true;|]);;
val it : float [] = [|1.0; 0.5|]

The threshold is 0.5, the adjustment multiplier is 2, and each feature is currently weighted at 1. The state of our example is [| false; true; |], so only the second feature is active, which means that the predicted value will be 1. (the weight of that feature). This is above the threshold 0.5, so the predicted value is true. However, because the correct value attached to that example is false, our prediction is incorrect, and the weight of the second feature is reduced, while the first one, which was not active, remains unchanged.

Let’s wrap this up in a convenience function which will learn from a sequence of examples, and give us directly a function that will classify observations:

let learn (theta:float) (alpha:float) (fs:int) (xs:Example seq) =
let updater = update theta alpha
let w0 = [| for f in 1 .. fs -> 1. |]
let w = Seq.fold (fun w x -> updater w x) w0 xs
fun (obs:Observation) -> predict theta w obs

We pass in the number of features, fs, to initialize the weights at the correct size, and use a fold to update the weights for each example in the sequence. Finally, we create and return a function that, given an observation, will predict the label, based on the weights we just learnt.

And that’s it – in 20 lines of code, we are done, the Winnow is implemented.

More...

15. February 2014 12:51

My favorite column in MSDN Magazine is Test Run; it was originally focused on testing, but the author, James McCaffrey, has been focusing lately on topics revolving around numeric optimization and machine learning, presenting a variety of methods and approaches. I quite enjoy his work, with one minor gripe –his examples are all coded in C#, which in my opinion is really too bad, because the algorithms would gain much clarity if written in F# instead.

Back in June 2013, he published a piece on Amoeba Method Optimization using C#. I hadn’t seen that approach before, and found it intriguing. I also found the C# code a bit too hairy for my feeble brain to follow, so I decided to rewrite it in F#.

In a nutshell, the Amoeba approach is a heuristic to find the minimum of a function. Its proper respectable name is the Nelder-Nead method. The reason it is also called the Amoeba method is because of the way the algorithm works: in its simple form, it starts from a triangle, the “Amoeba”; at each step, the Amoeba “probes” the value of 3 points in its neighborhood, and moves based on how much better the new points are. As a result, the triangle is iteratively updated, and behaves a bit like an Amoeba moving on a surface.

Before going into the actual details of the algorithm, here is how my final result looks like. You can find the entire code here on GitHub, with some usage examples in the Sample.fsx script file. Let’s demo the code in action: in a script file, we load the Amoeba code, and use the same function the article does, the Rosenbrock function. We transform the function a bit, so that it takes a Point (an alias for an Array of floats, essentially a vector) as an input, and pass it to the solve function, with the domain where we want to search, in that case, [ –10.0; 10.0 ] for both x and y:

#load "Amoeba.fs"

open Amoeba
open Amoeba.Solver

let g (x:float) y =
100. * pown (y - x * x) 2 + pown (1. - x) 2

let testFunction (x:Point) =
g x.[0] x.[1]

solve Default [| (-10.,10.); (-10.,10.) |] testFunction 1000

Running this in the F# interactive window should produce the following:

val it : Solution = (0.0, [|1.0; 1.0|])
>

The algorithm properly identified that the minimum is 0, for a value of x = 1.0 and y = 1.0. Note that results may vary: this is a heuristic, which starts with a random initial amoeba, so each run could produce slightly different results, and might at times epically fail.

More...

18. January 2014 14:49

A couple of months ago, I started working on an F# decision tree & random forest library, and pushed a first draft out in July 2013. It was a very minimal implementation, but it was a start, and my plan was to keep refining and add features. And then life happened: I got really busy, I began a very poorly disciplined refactoring effort on the code base, I second and third guessed my design - and got nothing to show for a while. Finally in December, I took some time off in Europe, disappeared in the French country side, a perfect setup to roll up my sleeves and finally get some serious coding done.

And here we go - drum roll please, version 0.1 of Charon is out. You can find it on GitHub, or install it as a NuGet package.

As you can guess from the version number, this is alpha-release grade code. There will be breaking changes, there are probably bugs and obvious things to improve, but I thought it was worth releasing, because it is in a shape good enough to illustrate the direction I am taking, and hopefully get some feedback from the community.

But first, what does Charon do? Charon is a decision tree and random forest machine learning classifier. An example will probably illustrate best what it does - let's work through the classic Titanic example. Using the Titanic passenger list, we want to create a model that predicts whether a passenger is likely to survive the disaster – or meet a terrible fate. Here is how you would do that with Charon, in a couple of lines of F#.

First, we use the CSV type provider to extract passenger information from our data file:

open Charon
open FSharp.Data

type DataSet = CsvProvider<"""C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon.Examples\titanic.csv""",
SafeMode=true, PreferOptionals=true>

type Passenger = DataSet.Row

In order to define a model, Charon needs two pieces of information: what is it you are trying to predict (the label, in that case, whether the passenger survives or not), and what information Charon is allowed to use to produce predictions (the features, in that case whatever passenger information we think is relevant):

let training =
use data = new DataSet()
[| for passenger in data.Data ->
passenger, // label source
passenger |] // features source

let labels = "Survived", (fun (obs:Passenger) -> obs.Survived) |> Categorical

let features =
[
"Sex", (fun (o:Passenger) -> o.Sex) |> Categorical;
"Class", (fun (o:Passenger) -> o.Pclass) |> Categorical;
"Age", (fun (o:Passenger) -> o.Age) |> Numerical;
]

For each feature, we specify whether the feature is Categorical (a finite number of "states" is expected, for instance Sex) or Numerical (the feature is to be interpreted as a numeric value, such as Age).

The Model is now fully specified, and we can train it on our dataset, and retrieve the results:

let results = basicTree training (labels,features) { DefaultSettings with Holdout = 0.1 }

printfn "Quality, training: %.3f" (results.TrainingQuality |> Option.get)
printfn "Quality, holdout: %.3f" (results.HoldoutQuality |> Option.get)

printfn "Tree:"
printfn "%s" (results.Pretty)

… which generates the following output:

Quality, training: 0.796
Quality, holdout: 0.747
Tree:
├ Sex = male
│   ├ Class = 3 → Survived False
│   ├ Class = 1 → Survived False
│   └ Class = 2
│      ├ Age = <= 16.000 → Survived True
│      └ Age = >  16.000 → Survived False
└ Sex = female
├ Class = 3 → Survived False
├ Class = 1 → Survived True
└ Class = 2 → Survived True

Charon automatically figures out what features are most informative, and organizes them into a tree; in our example, it appears that being a lady was a much better idea than being a guy – and being a rich lady traveling first or second class an even better idea. Charon also automatically breaks down continuous variables into bins. For instance, second-class male passengers under 16 had apparently much better odds of surviving than other male passengers. Charon splits the sample into training and validation; in this example, while our model appears quite good on the training set, with nearly 80% correct calls, the performance on the validation set is much weaker, with under 75% correctly predicted, suggesting an over-fitting issue.

I won’t demonstrate the Random Forest here; the API is basically the same, with better results but less human-friendly output. While formal documentation is lacking for the moment, you can find code samples in the Charon.Examples project that illustrate usage on the Titanic and the Nursery datasets.

What I hope I conveyed with this small example is the design priorities for Charon: a lightweight API that permits quick iterations to experiment with features and refine a model, using the F# Interactive capabilities.

I will likely discuss in later posts some of the challenges I ran into while implementing support for continuous variables – I learnt a lot in the process. I will leave it at that for today – in the meanwhile, I would love to get feedback on the current direction, and what you may like or hate about it. If you have comments, feel free to hit me up on Twitter, or to open an Issue on GitHub!

6. September 2013 08:15

Recently, Cesar De Souza began moving his .NET machine learning library, Accord.NET, from Google Code to GitHub. The move is still in progress, but that motivated me to take a closer look at the library; given that it is built in C#, with an intended C# usage in mind, I wanted to see how usable it is from F#.

There is a lot in the library; as a starting point, I decided I would try out its Support Vector Machine (SVM), a classic machine learning algorithm, and run it on a classic problem, automatically recognizing hand-written digits. The dataset I will be using here is a subset of the Kaggle Digit Recognizer contest; each example in the dataset is a 28x28 grayscale pixels image, the result of scanning a number written down by a human, and what the actual number is. From that original dataset, I sampled 5,000 examples, which will be used to train the algorithm, and another 500 in a validation set, which we’ll use to evaluate the performance of the model on data it hasn’t “seen before”.

The full example is available as a gist on GitHub.

I’ll be working in a script file within a Library project, as I typically do when exploring data. First, we need to add references to Accord.NET via NuGet:

#r @"..\packages\Accord.2.8.1.0\lib\Accord.dll"
#r @"..\packages\Accord.Math.2.8.1.0\lib\Accord.Math.dll"
#r @"..\packages\Accord.Statistics.2.8.1.0\lib\Accord.Statistics.dll"
#r @"..\packages\Accord.MachineLearning.2.8.1.0\lib\Accord.MachineLearning.dll"

open System
open System.IO

open Accord.MachineLearning
open Accord.MachineLearning.VectorMachines
open Accord.MachineLearning.VectorMachines.Learning
open Accord.Statistics.Kernels


Note the added reference to the Accord.dll and Accord.Math.dll assemblies; while the code presented below doesn’t reference it explicitly, it looks like Accord.MachineLearning is trying to load the assembly, which fails miserably if they are not referenced.

Then, we need some data; once the training set and validation set have been downloaded to your local machine (see the gist for the datasets url), that’s fairly easy to do:

let training = @"C:/users/mathias/desktop/dojosample/trainingsample.csv"
let validation = @"C:/users/mathias/desktop/dojosample/validationsample.csv"

let readData filePath =
File.ReadAllLines filePath
|> fun lines -> lines.[1..]
|> Array.map (fun line -> line.Split(','))
|> Array.map (fun line ->
(line.[0] |> Convert.ToInt32), (line.[1..] |> Array.map Convert.ToDouble))
|> Array.unzip

let labels, observations = readData training


We read every line of the CSV file into an array of strings, drop the headers with array slicing, keeping only items at or after index 1, split each line around commas (so that each line is now an array of strings), retrieve separately the first element of each line (what the number actually is), and all the pixels, which we transform into a float, and finally unzip the result, so that we get an array of integers (the actual numbers), and an array of arrays, the grayscale level of each pixel.

More...