Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
8. August 2015 12:42

A couple of weeks ago, I came across this blog post by Steve Shogren, which looks at various programming languages, and attempts to define a “language safety score”, by taking into account a list of language criteria (Are nulls allowed? Can variables be mutated? And so on), aggregating them into an overall safety score – and finally looking for whether the resulting score was a reasonable predictor for the observed bug rate across various projects.

I thought this was an interesting idea. However, I also had reservations on the methodology. Picking a somewhat arbitrary list of criteria, giving them indiscriminately the same weight, and summing them up, didn’t seem to me like the most effective approach – especially given that Steve had already collected a nice dataset. If the goal is to identify which language features best predict how buggy the code will be, why not start from there, and build a model which attempts to predict the bug rate based on language features?

So I decided to give it a shot, and build a quick-and-dirty logistic regression model. In a nutshell, logistic regression attempts to model the probability of observing an event, based on a set of criteria / features. A prototypical application would be in medicine, trying to predict, for instance, the chances of developing a disease, given patient characteristics. In our case, the disease is a bug, and the patient a code base. We’ll use the criteria listed by Steve as potential predictors, and, as a nice side-product of logistic regression, we will get a quantification of how important each of the criteria is in predicting the bug rate.

I’ll discuss later some potential issues with the approach; for now, let’s build a model, and see where that leads us. I lifted the data from Steve’s post (hopefully without typos), with one minor modification: instead of scoring criteria as 1, 0 or –1, I just retained 1 or 0 (it’s there or it’s not there), and prepared an F# script, using the Accord framework to run my logistic regression.

Note: the entire script is here as a Gist.

#I @"../packages"
#r @"Accord.3.0.1-alpha\lib\net45\Accord.dll"
#r @"Accord.MachineLearning.3.0.1-alpha\lib\net45\Accord.MachineLearning.dll"
#r @"Accord.Math.3.0.1-alpha\lib\net45\Accord.Math.dll"
#r @"Accord.Statistics.3.0.1-alpha\lib\net45\Accord.Statistics.dll"

let language, bugrate, criteria =
[|  "F#",           0.023486288,    [|1.;1.;1.;0.;1.;1.;1.;0.;0.;0.;1.;1.;1.;0.|]
"Javascript",   0.039445132,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|]
"CoffeeScript", 0.047242288,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|]
"Clojure",      0.011503478,    [|0.;1.;0.;0.;0.;0.;1.;0.;1.;1.;1.;0.;0.;0.|]
"C#",           0.03261284,     [|0.;0.;1.;0.;0.;1.;1.;0.;0.;0.;1.;0.;1.;0.|]
"Python",       0.02531419,     [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|]
"Java",         0.032567736,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|]
"Ruby",         0.020303702,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|]
"Scala",        0.01904762,     [|1.;1.;1.;0.;1.;1.;1.;0.;0.;0.;1.;0.;0.;0.|]
"Go",           0.024698375,    [|0.;0.;1.;0.;0.;1.;1.;0.;0.;0.;1.;0.;1.;0.|]
"PHP",          0.031669293,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] |]
|> Array.unzip3

open Accord.Statistics.Models.Regression
open Accord.Statistics.Models.Regression.Fitting

let features = 14
let model = LogisticRegression(features)

let rec learn () =
let delta = learner.Run(criteria, bugrate)
if delta > 0.0001
then learn ()
else ignore ()

learn () |> ignore

And we are done – we have trained a model to predict the bug rate, based on our 14 criteria. How is this working? Let’s find out:

for i in 0 .. (language.Length - 1) do
let lang = language.[i]
let predicted = model.Compute(criteria.[i])
let real = bugrate.[i]
printfn "%16s Real: %.3f Pred: %.3f" lang real predicted

>
F# Real: 0.023 Pred: 0.023
Javascript Real: 0.039 Pred: 0.033
CoffeeScript Real: 0.047 Pred: 0.033
Clojure Real: 0.012 Pred: 0.011
C# Real: 0.033 Pred: 0.029
Python Real: 0.025 Pred: 0.033
Java Real: 0.033 Pred: 0.033
Ruby Real: 0.020 Pred: 0.033
Scala Real: 0.019 Pred: 0.020
Go Real: 0.025 Pred: 0.029
PHP Real: 0.032 Pred: 0.033

Looks pretty good. Let’s confirm that with a chart, using FSharp.Charting:

#load "FSharp.Charting.0.90.12\FSharp.Charting.fsx"
open FSharp.Charting

let last = language.Length - 1

Chart.Combine [
Chart.Line ([ for i in 0 .. last -> bugrate.[i]], "Real", Labels=language)
Chart.Line ([ for i in 0 .. last -> model.Compute(criteria.[i])], "Pred") ]
|> Chart.WithLegend()

What criteria did our model identify as predictors for bugs? Let’s find out.

let criteriaNames = [|
"Null Variable Usage"
"Null List Iteration"
"Prevent Variable Reuse"
"Ensure List Element Exists"
"Safe Type Casting"
"Passing Wrong Type"
"Misspelled Method"
"Missing Enum Value"
"Variable Mutation"
"Memory Deallocation"
"Tail Call Optimization"
"Guaranteed Code Evaluation"
"Functional Purity" |]

for i in 0 .. (features - 1) do
let name = criteriaNames.[i]
let wald = model.GetWaldTest(i)
let odds = model.GetOddsRatio(i)
(printfn "%28s odds: %4.2f significant: %b" name odds wald.Significant)

>
Null Variable Usage odds:  0.22 significant: true
Null List Iteration odds:  0.86 significant: true
Prevent Variable Reuse odds:  0.64 significant: true
Ensure List Element Exists odds:  1.05 significant: true
Safe Type Casting odds:  1.00 significant: false
Passing Wrong Type odds:  0.86 significant: true
Misspelled Method odds:  1.05 significant: true
Missing Enum Value odds:  0.78 significant: true
Variable Mutation odds:  0.86 significant: true
Prevent Deadlocks odds:  0.64 significant: true
Memory Deallocation odds:  0.74 significant: true
Tail Call Optimization odds:  0.22 significant: true
Guaranteed Code Evaluation odds:  1.71 significant: true
Functional Purity odds:  0.69 significant: true

How should you read this? The first output, the odds ratio, describes how much more likely it is to observe success than failure when that criterion is active. In our case, success means “having a bug”, so for instance, if your language prevents using nulls, you’d expect 1.0 / 0.22 = 4.5 times less chances to write bugs. In other words, if the odds are close to 1.0, the criterion doesn’t make much of a difference. The closer to zero it is, the lower the predicted bug count, and vice-versa.

Conclusions and caveats

The 3 most significant predictors of a low bug rate are, in order, no nulls, tail calls optimization, and (to a much lesser degree) lazy evaluation. After that, we have honorable scores for avoiding variable reuse, preventing deadlocks, and functional purity.

So… what’s the bottom line? First off, just based on the bug rates alone, it seems that using functional languages would be a safer bet than Javascript (and CoffeeScript) to avoid bugs.

Then, now would be a good time to reiterate that this is a quick-and-dirty analysis. Specifically, there are some clear issues with the dataset. First, we are fitting 12 languages on 14 criteria – that’s not much to go on. On top of that, there is some data redundancy. None of the languages in our sample has “ensure list element exists” (4th column is filled with zeroes), and all of them guarantee memory de-allocation (11th column filled with ones). I suspect there is some additional redundancy, because of the similarity between the columns.

Note: another interesting discussion would be whether the selected criteria properly cover the differences between languages. I chose to not go into that, and focus strictly on using the data as-is.

I ran the model again, dropping the 2 columns that contain no information; while this doesn’t change the predictions of the model, it does impact a bit the weight of each criterion. The results, while similar, do show some differences:

("Null Variable Usage", 0.0743885639)
("Functional Purity", 0.4565632287)
("Prevent Variable Reuse", 0.5367456237)
("Tail Call Optimization", 0.7028982809)
("Missing Enum Value", 0.7539575884)
("Null List Iteration", 0.7636177784)
("Passing Wrong Type", 0.7636177784)
("Variable Mutation", 0.7646027916)
("Safe Type Casting", 1.072641105)
("Misspelled Method", 1.072641105)
("Guaranteed Code Evaluation", 2.518831684)

Another piece of information I didn’t use is how many commits were taken into consideration. This matters, because the information gathered for PHP is across 10 times more commits than F#, for instance. It wouldn’t be very hard to do – instead of regressing against the bug rate, we could count the clean and buggy commits per language, and proceed along the lines of the last example described here.

In spite of these issues, I think this constitutes a better base to construct a language score index. Rather than picking criteria by hand and giving them arbitrary weights, let the data speak. Measure how well each of them explains defects, and use that as a basis to determine their relative importance.

That’s it for today! Big thanks to Steve Shogren for a stimulating post, and for making the data available. And again, you can find the script here as a Gist. If you have comments or questions, ping me on Twitter!

30. March 2015 16:34

In our previous post, we looked at James McCaffrey’s code, “Gradient Descent Training Using C#” from MSDN magazine, and took a stab at rewriting the first part in F#, to clarify a bit the way the dataset was created. Today, we’ll dive in the second block, which implements the logistic regression using gradient descent. Again, we won’t discuss why the algorithm works – the article does a pretty good job at that – and focus instead purely on the F# / C# conversion part.

Let’s begin by taking a look at the core of the C# code, which lives in the LogisticClassifier class. I took the liberty to do some minor cleanup, and remove some parts which were un-necessary, so as to make it a bit easier to see what is going on:

public class LogisticClassifier
{
private int numFeatures; // number of x variables aka features
private double[] weights; // b0 = constant
private Random rnd;

public LogisticClassifier(int numFeatures)
{
this.numFeatures = numFeatures;
this.weights = new double[numFeatures + 1]; // [0] = b0 constant
this.rnd = new Random(0);
}

public double[] Train(double[][] trainData, int maxEpochs, double alpha)
{
// alpha is the learning rate
int epoch = 0;
int[] sequence = new int[trainData.Length]; // random order
for (int i = 0; i < sequence.Length; ++i)
sequence[i] = i;

while (epoch < maxEpochs)
{
++epoch;

if (epoch % 100 == 0 && epoch != maxEpochs)
{
double mse = Error(trainData, weights);
Console.Write("epoch = " + epoch);
Console.WriteLine("  error = " + mse.ToString("F4"));
}

Shuffle(sequence); // process data in random order

// stochastic/online/incremental approach
for (int ti = 0; ti < trainData.Length; ++ti)
{
int i = sequence[ti];
double computed = ComputeOutput(trainData[i], weights);
int targetIndex = trainData[i].Length - 1;
double target = trainData[i][targetIndex];

weights[0] += alpha * (target - computed) * 1; // the b0 weight has a dummy 1 input
for (int j = 1; j < weights.Length; ++j)
weights[j] += alpha * (target - computed) * trainData[i][j - 1];
}
} // while
return this.weights; // by ref is somewhat risky
} // Train

private void Shuffle(int[] sequence)
{
for (int i = 0; i < sequence.Length; ++i)
{
int r = rnd.Next(i, sequence.Length);
int tmp = sequence[r];
sequence[r] = sequence[i];
sequence[i] = tmp;
}
}

private double Error(double[][] trainData, double[] weights)
{
// mean squared error using supplied weights
int yIndex = trainData[0].Length - 1; // y-value (0/1) is last column
double sumSquaredError = 0.0;
for (int i = 0; i < trainData.Length; ++i) // each data
{
double computed = ComputeOutput(trainData[i], weights);
double desired = trainData[i][yIndex]; // ex: 0.0 or 1.0
sumSquaredError += (computed - desired) * (computed - desired);
}
return sumSquaredError / trainData.Length;
}

private double ComputeOutput(double[] dataItem, double[] weights)
{
double z = 0.0;
z += weights[0]; // the b0 constant
for (int i = 0; i < weights.Length - 1; ++i) // data might include Y
z += (weights[i + 1] * dataItem[i]); // skip first weight
return 1.0 / (1.0 + Math.Exp(-z));
}
} // LogisticClassifier

Just from the length of it, you can tell that most of the action is taking place in the Train method, so let’s start there. What we have here is two nested loops. The outer one runs maxEpoch times, a user defined parameter. Inside that loop, we randomly shuffle the input dataset, and then loop over each training example, computing the predicted output of the logistic function for that example, comparing it to a target, the actual  label of the example, which can be 0 or 1, and adjusting the weights so as to reduce the error. We also have a bit of logging going on, displaying the prediction error every hundred outer iteration. Once the two loops are over, we return the weights.

Two things strike me here. First, a ton of indexes are involved, and this tends to obfuscate what is going on; as a symptom, a few comments are needed, to clarify how the indexes work, and what piece of the data is organized. Then, there is a lot of mutation going on. It’s not necessarily a bad thing, but I tend to avoid it as much as possible, simply because it requires keeping more moving parts in my head when I try to follow the code, and also, as McCaffrey himself points out in a comment, because “by ref is somewhat risky”.

As a warm up, let’s begin with the error computation, which is displayed every 100 iterations.  Rather than having to remember in what column the actual expected value is stored, let’s make our life easier, and use a type alias, Example, so that the features are neatly tucked in an array, and the value is clearly separated. We need to compute the average square difference between the expected value, and the output of the logistic function for each example. As it turns out, we have already implemented the logistic function in the first part in the code, so re-implementing it as in ComputeOutput seems like un-necessary work – we can get rid of that part entirely, and simply map every example to the square error, and compute the average, using pattern matching on the examples to separate clearly the features and the expected value:

type Example = float [] * float

let Error (trainData:Example[], weights:float[]) =
// mean squared error using supplied weights
trainData
|> Array.map (fun (features,value) ->
let computed = logistic weights features
let desired = value
(computed - desired) * (computed - desired))
|> Array.average

Some of you might argue that this could be made tighter – I can think of at least two possibilities. First, using a Tuple might not be the most expressive approach; replacing it with a Record instead could improve readability. Then, we could also skip the map + average part, and directly ask F# to compute the average on the fly:

type Example = { Features:float[]; Label:float }

let Error (trainData:Example[], weights:float[]) =
trainData
|> Array.averageBy (fun example ->
let computed = logistic weights example.Features
let desired = example.Label
(computed - desired) * (computed - desired))

I will keep my original version the way it is, mostly because we created a dataset based on tuples last times.

We are now ready to hit the center piece of the algorithm. Just like we would probably try to extract a method in C#, we will start extracting some of the gnarly code that lies in the middle:

for (int ti = 0; ti < trainData.Length; ++ti)
{
int i = sequence[ti];
double computed = ComputeOutput(trainData[i], weights);
int targetIndex = trainData[i].Length - 1;
double target = trainData[i][targetIndex];

weights[0] += alpha * (target - computed) * 1; // the b0 weight has a dummy 1 input
for (int j = 1; j < weights.Length; ++j)
weights[j] += alpha * (target - computed) * trainData[i][j - 1];
}

Rather than modify the weights values, it seems safer to compute new weights. And because we opted last week to insert a column with ones for the constant feature, we won’t have to deal with the index misalignment, which requires separate handling for b0 and the rest. Instead, we can write an update operation that takes in an example and weights, and returns new weights:

let update (example:Example) (weights:float[]) =
let features,target = example
let computed = logistic weights features
weights
|> Array.mapi (fun i w ->
w + alpha * (target - computed) * features.[i])

Array.mapi allows us to iterate over the weights, while maintaining the index we are currently at, which we use to grab the feature value at the corresponding index. Alternatively, you could go all verbose and zip the arrays together – or all fancy with a double-pipe and map2 to map the two arrays in one go. Your pick:

Array.zip weights features
|> Array.map (fun (weight,feat) ->
weight + alpha * (target - computed) * feat)

(weights,features)
||> Array.map2 (fun weight feat ->
weight + alpha * (target - computed) * feat)

We are now in a very good place; the only thing left to do is to plug that into the two loops. The inner loop is a perfect case for a fold (the Aggregate method in LINQ): given a starting value for weights, we want to go over every example in our training set, and, for each of them, run the update function to compute new weights. For the while loop, we’ll take a different approach, and use recursion: when the epoch reaches maxEpoch, you are done, return the weights, otherwise, keep shuffling the data and updating weights. Let’s put that all together:

let Train (trainData:Example[], numFeatures, maxEpochs, alpha, seed) =

let rng = Random(seed)
let epoch = 0

let update (example:Example) (weights:float[]) =
let features,target = example
let computed = logistic weights features
weights
|> Array.mapi (fun i w ->
w + alpha * (target - computed) * features.[i])

let rec updateWeights (data:Example[]) epoch weights =

if epoch % 100 = 0
then printfn "Epoch: %i, Error: %.2f" epoch (Error (data,weights))

if epoch = maxEpochs then weights
else
let data = shuffle rng data
let weights =
data
|> Array.fold (fun w example -> update example w) weights
updateWeights data (epoch + 1) weights
// initialize the weights and start the recursive update
let initialWeights = [| for _ in 1 .. numFeatures + 1 -> 0. |]
updateWeights trainData 0 initialWeights

And that’s pretty much it. We replaced the whole class by a couple of functions, and all the indexes are gone. This is probably a matter of taste and comfort with functional concepts, but in my opinion, this is much easier to follow.

Before trying it out, to make sure it works, I’ll take a small liberty, and modify the Train function. As it stands right now, it returns the final weights, but really, we don’t care about the weights, what we want is a classifier, which is a function that, given an array, will predict a one or a zero. That’s easy enough, let’s return a function at the end instead of weights:

// initialize the weights and start the recursive update
let initialWeights = [| for _ in 1 .. numFeatures + 1 -> 0. |]
let finalWeights = updateWeights trainData 0 initialWeights
let classifier (features:float[]) =
if logistic finalWeights features > 0.5 then 1. else 0.
classifier

We can now wrap it up, and see our code in action:

printfn "Begin Logistic Regression (binary) Classification demo"
printfn "Goal is to demonstrate training using gradient descent"

let numFeatures = 8 // synthetic data
let numRows = 10000
let seed = 1

printfn "Generating %i artificial data items with %i features" numRows numFeatures
let trueWeights, allData = makeAllData(numFeatures, numRows, seed)

printfn "Data generation weights:"
trueWeights |> Array.iter (printf "%.2f ")
printfn ""

printfn "Creating train (80%%) and test (20%%) matrices"

let trainData, testData = makeTrainTest(allData, 0)
printfn "Done"

let maxEpochs = 1000
let alpha = 0.01

let classifier = Train (trainData,numFeatures,maxEpochs,alpha,0)

let accuracy (examples:Example[]) =
examples
|> Array.averageBy (fun (feat,value) ->
if classifier feat = value then 1. else 0.)

accuracy trainData |> printfn "Prediction accuracy on train data: %.4f"
accuracy testData |> printfn "Prediction accuracy on test data: %.4f"

We used a small trick to compute the accuracy – we mark every correct call as a one, every incorrect one as a zero, which, when we compute the average, gives us directly the proportion of cases that were called correctly. On my machine, I get the following output:

>
Prediction accuracy on train data: 0.9988
Prediction accuracy on test data: 0.9980

Looks good enough to me, the implementation seems to be working. The whole code presented here is available as a gist here. I’ll leave it at that for now (I might revisit it later, and try to make this work with DiffSharp at some point, if anyone is interested) – feel free to ask questions, or drop me a comment on Twitter!

22. March 2015 17:27

I will admit it, I got a bit upset by James McCaffrey’s column in MSDN magazine this month, “Gradient Descent Training Using C#”. While the algorithm explanations are quite good, I was disappointed by the C# sample code, and kept thinking to myself “why oh why isn’t this written in F#”. This is by no means intended as a criticism of C#; it’s a great language, but some problems are just better suited for different languages, and in this case, I couldn’t fathom why F# wasn’t used.

Long story short, I just couldn’t let it go, and thought it would be interesting to take that C# code, and do a commented rewrite in F#. I won’t even go into why the code does what it does – the article explains it quite well – but will instead purely focus on the implementation, and will try to keep it reasonably close to the original, at the expense of some additional nifty things that could be done.

The general outline of the code follows two parts:

• Create a synthetic dataset, creating random input examples, and computing the expected result using a known function,
• Use gradient descent to learn the model parameters, and compare them to the true value to check whether the method is working.

You can download the original C# code here. Today we’ll focus only on the first part, which is mainly contained in two methods, MakeAllData and MakeTrainTest:

static double[][] MakeAllData(int numFeatures, int numRows, int seed)
{
Random rnd = new Random(seed);
double[] weights = new double[numFeatures + 1]; // inc. b0
for (int i = 0; i < weights.Length; ++i)
weights[i] = 20.0 * rnd.NextDouble() - 10.0; // [-10.0 to +10.0]

double[][] result = new double[numRows][]; // allocate matrix
for (int i = 0; i < numRows; ++i)
result[i] = new double[numFeatures + 1]; // Y in last column

for (int i = 0; i < numRows; ++i) // for each row
{
double z = weights[0]; // the b0
for (int j = 0; j < numFeatures; ++j) // each feature / column except last
{
double x = 20.0 * rnd.NextDouble() - 10.0; // random X in [10.0, +10.0]
result[i][j] = x; // store x
double wx = x * weights[j + 1]; // weight * x
z += wx; // accumulate to get Y
}
double y = 1.0 / (1.0 + Math.Exp(-z));
if (y > 0.55)  // slight bias towards 0
result[i][numFeatures] = 1.0; // store y in last column
else
result[i][numFeatures] = 0.0;
}
Console.WriteLine("Data generation weights:");
ShowVector(weights, 4, true);

return result;
}

MakeAllData takes a number of features and rows, and a seed for the random number generator so that we can replicate the same dataset repeatedly. The dataset is represented as an array of array of doubles. The first columns, from 0 to numFeatures – 1, contain random numbers between –10 and 10. The last column contains a 0 or a 1. What we are after here is a classification model: each row can take two states (1 or 0), and we are trying to predict them from observing the features. In our case, that value is computed using a logistic model: we have a set of weights (which we also generate randomly), corresponding to each feature, and the output is

logistic [ x1; x2; … xn ] = 1.0 / (1.0 + exp ( - (w0 * 1.0 + w1 * x1 + w2 * x2 + … + wn * xn))

Note that w0 plays the role of a constant term in the equation, and is multiplied by 1.0 all the time. This is adding some complications to a code where indices are already flying left and right, because now elements in the weights array are mis-aligned by one element with the elements in the features array. Personally, I also don’t like adding another column to contain the predicted value, because that’s another implicit piece of information we have to remember.

In that frame, I will make two minor changes here, just to keep my sanity. First, as is often done, we will insert a column containing just 1.0 in each observation, so that the weights and features are now aligned. Then, we will move the 0s and 1s outside of the features array, to avoid any ambiguity.

Good. Instead of creating a Console application, I’ll simply go for a script. That way, I can just edit my code and check live whether it does what I want, rather than recompile and run every time.

Let’s start with the weights. What we are doing here is simply creating an array of numFeatures + 1 elements, populated by random values between –10.0 and 10.0. We’ll go a bit fancy here: given that we are also generating random numbers the same way a bit further down, let’s extract a function that generates numbers uniformly between a low and high value:

let rnd = Random(seed)
let generate (low,high) = low + (high-low) * rnd.NextDouble()
let weights = Array.init (numFeatures + 1) (fun _ -> generate(-10.0,10.0))

The next section is where things get a bit thornier. The C# code creates an array, then populates it row by row, first filling in the columns with random numbers, and then applying the logistic function to compute the value that goes in the last column. We can make that much clearer, by extracting that function out. The logistic function is really doing 2 things:

• first, the sumproduct of 2 arrays,
• and then, 1.0/(1.0 + exp ( – z ).

That is easy enough to implement:

let sumprod (v1:float[]) (v2:float[]) =
Seq.zip v1 v2 |> Seq.sumBy (fun (x,y) -> x * y)

let sigmoid z = 1.0 / (1.0 + exp (- z))

let logistic (weights:float[]) (features:float[]) =
sumprod weights features |> sigmoid

We can now use all this, and generate a dataset by simply first creating rows of random values (with a 1.0 in the first column for the constant term), applying the logistic function to compute the value for that row, and return them as a tuple:

open System

let sumprod (v1:float[]) (v2:float[]) =
Seq.zip v1 v2 |> Seq.sumBy (fun (x,y) -> x * y)

let sigmoid z = 1.0 / (1.0 + exp (- z))

let logistic (weights:float[]) (features:float[]) =
sumprod weights features |> sigmoid

let makeAllData (numFeatures, numRows, seed) =

let rnd = Random(seed)
let generate (low,high) = low + (high-low) * rnd.NextDouble()
let weights = Array.init (numFeatures + 1) (fun _ -> generate(-10.0,10.0))

let dataset =
[| for row in 1 .. numRows ->
let features =
[|
yield 1.0
for feat in 1 .. numFeatures -> generate(-10.0,10.0)
|]
let value =
if logistic weights features > 0.55
then 1.0
else 0.0
(features, value)
|]

weights, dataset

Done. Let’s move to the second part of the data generation, with the MakeTrainTest method. Basically, what this does is take a dataset, shuffle it, and split it in two parts, 80% which we will use for training, and 20% we leave out for validation.

static void MakeTrainTest(double[][] allData, int seed,
out double[][] trainData, out double[][] testData)
{
Random rnd = new Random(seed);
int totRows = allData.Length;
int numTrainRows = (int)(totRows * 0.80); // 80% hard-coded
int numTestRows = totRows - numTrainRows;
trainData = new double[numTrainRows][];
testData = new double[numTestRows][];

double[][] copy = new double[allData.Length][]; // ref copy of all data
for (int i = 0; i < copy.Length; ++i)
copy[i] = allData[i];

for (int i = 0; i < copy.Length; ++i) // scramble order
{
int r = rnd.Next(i, copy.Length); // use Fisher-Yates
double[] tmp = copy[r];
copy[r] = copy[i];
copy[i] = tmp;
}
for (int i = 0; i < numTrainRows; ++i)
trainData[i] = copy[i];

for (int i = 0; i < numTestRows; ++i)
testData[i] = copy[i + numTrainRows];
}

Again, there is a ton of indexing going on, which in my old age I find very hard to follow. Upon closer inspection, really, the only thing complicated here is the Fischer-Yates shuffle, which takes an array and randomly shuffles the order. The rest is pretty simply – we just want to shuffle, and then split into two arrays. Let’s extract the shuffle code (which happens to also be used and re-implemented later on):

let shuffle (rng:Random) (data:_[]) =
let copy = Array.copy data
for i in 0 .. (copy.Length - 1) do
let r = rng.Next(i, copy.Length)
let tmp = copy.[r]
copy.[r] <- copy.[i]
copy.[i] <- tmp
copy

We went a tiny bit fancy again here, and made the shuffle work on generic arrays; we also pass in the Random instance we want to use, so that we can control / repeat shuffles if we want, by passing a seeded Random. Does this work? Let’s check in FSI:

> [| 1 .. 10 |] |> shuffle (Random ());;
val it : int [] = [|6; 7; 2; 10; 8; 5; 4; 9; 3; 1|]

Looks reasonable. Let’s move on – we can now implement the makeTrainTest function.

let makeTrainTest (allData:_[], seed) =

let rnd = Random(seed)
let totRows = allData.Length
let numTrainRows = int (float totRows * 0.80) // 80% hard-coded

let copy = shuffle rnd allData
copy.[.. numTrainRows-1], copy.[numTrainRows ..]

Done. A couple of remarks here. First, F# is a bit less lenient than C# around types, so we have to be explicit when converting the number of rows to 80%, first to float, then back to int. As an aside, this used to annoy me a bit in the beginning, but I have come to really like having F# as this slightly psycho-rigid friend who nags me when I am taking a dangerous path (for instance, dividing two integers and hoping for a percentage).

Besides that, I think the code is markedly clearer. The complexity of the shuffle has been nicely contained, and we just have to slice the array to get a training and test sets. As an added bonus, we got rid of the out parameters, and that always feels nice and fuzzy.

I’ll leave it at for today; next time we’ll look at the second part, the learning algorithm itself. Before closing shop, let me make a couple of comments. First, the code is a tad shorter, but not by much. I haven’t really tried, and deliberately made only the changes I thought were needed. What I like about it, though, is that all the indexes are gone, except for the shuffle. In my opinion, this is a good thing. I find it difficult to keep it all in my head when more than one index is involved; when I need to also remember what columns contain special values, I get worried – and just find it hard to figure out what is going on. By contrast, I think makeTrainTest, for instance, conveys pretty directly what it does. makeAllData, in spite of some complexity, also maps closely the way I think about my goal: “I want to generate rows of inputs” – this is precisely what the code does. There is probably an element of culture to it, though; looping over arrays has a long history, and is familiar to every developer, and what looks readable to me might look entirely weird to some.

Easier, or more complicated than before? Anything you like or don’t like – or find unclear? Always interested to hear your opinion! Ping me on Twitter if you have comments.

30. September 2012 13:08

After four weeks of vacations, I am back home, ready to continue my series of posts converting the samples from Machine Learning in Action from Python to F#.

Today’s post covers Chapter 5 of the book, dedicated to Logistic Regression. Logistic Regression is another classification method. It uses numeric data to determine how to separate observations into two classes, identified by 0 or 1.

The entire code presented in this post can be found on GitHub, commit 398677f

The idea behind the algorithm

The main idea behind the algorithm is to find a function which, using the numeric data that describe an individual observation as input, will return a number between 0 and 1. Ideally, that function will return a number close to respectively 0 or 1 for observations belonging to group 0 or 1.

To achieve that result, the algorithm relies on the Sigmoid function, f(x) = 1 / (1 + exp(-x)) .

For any input value, the Sigmoid function returns a value in ] 0 ; 1 [. A positive value will return a value greater than 0.5, and the greater the input value, the closer to 1. One could think of the function as returning a probability: for very high or low values of x, there is a high certainty that it belongs to one of the two groups, and for values close to zero, the probability of each group is 50% / 50%.

The only thing needed then is a transformation taking the numeric values describing the observations from the dataset, and mapping them to a single value, such that applying the Sigmoid function to it produces results close to the group the observation belongs to. The most straightforward way to achieve this is to apply a linear combination: an observation with numeric values [ x1; x2; … xk ] will be converted into w0 + w1 x x1 + w2 x x2 … + wk x xk, by applying weights [ w0; w1; … wk ] to each of the components of the observation. Note how the weights have one extra element w0, which is used for a constant term.

If our observations had two components X and Y, each observation can be represented as a point (X, Y) in the plane, and what we are looking for is a straight line w0 + w1 x X + w2 x Y, such that every observation of group 0 is on one side of the line, and every observation of group 1 on the other side.

We now replaced one problem by another – how can we find a suitable set of weights W?

I won’t even attempt a full explanation of the approach, and will stick to fuzzy, high-level intuition. Basically, the algorithm starts with an arbitrary set of weights, and iteratively adjusts the weights, by comparing the results of the function and what it should be (the actual group), and adjusting them to reduce error.

Note: I’ll skip the Gradient Ascent method, and go straight to the second part of Chapter 5, which covers Stochastic Gradient Ascent, because the code is both easier to understand and more suitable to large datasets. On the other hand, the deterministic gradient ascent approach is probably clearer for the math inclined. If that’s your situation, you might be interested in this MSDN Magazine article, which presents a C# implementation of the Logistic Regression.

Let’s illustrate the update procedure, on an ultra-simplified example, where we have a single weight W. In that case, the predicted value for an observation which has value X will be sigmoid (W x X) , and the algorithm adjustment is given by the following formula:

W <- W + alpha x (Label – sigmoid (W x X))

where Label is the group the observation belongs to (0 or 1), and alpha is a user-defined parameter, between 0 and 1. In other words, W is updated based on the error, Label – sigmoid (W x X) . First, obviously, if there is no error, W will remain unchanged, there is nothing to adjust. Let’s consider the case where Label is 1, and both X and W are positive. In that case, Label – sigmoid (W x X) will be positive (between 0 and 1), and W will be increased. As W increases, the sigmoid becomes closer to 1, and the adjustments become progressively smaller. Similarly, considering all the cases for W and X (positive and negative), one can verify that W will be adjusted in a direction which reduces the classification error. Alpha can be described as “how aggressive” the adjustment should be – the closer to 1, the more W will be updated.

That’s the gist of the algorithm – the full-blown deterministic gradient algorithm proceeds to update the weights by considering the error on the entire dataset at once, which makes it more expensive, whereas the stochastic gradient approach updates the weights sequentially, taking the dataset observations one by one, which makes it convenient for larger datasets.

Simple implementation

Enough talk – let’s jump into code, with a straightforward implementation first. We create a module “LogisticRegression”, and begin with building the function which predicts the class of an observation, given weights:

module LogisticRegression =

open System

let sigmoid x = 1.0 / (1.0 + exp -x)

// Vector dot product
let dot (vec1: float list)
(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e * snd e)
|> List.sum

(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e + snd e)

// Vector scalar product
let scalar alpha (vector: float list) =
List.map (fun e -> alpha * e) vector

// Weights have 1 element more than observations, for constant
let predict (weights: float list)
(obs: float list) =
1.0 :: obs
|> dot weights
|> sigmoid

More...