Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
19. October 2013 10:33

A couple of weeks ago, I had the pleasure to attend Progressive F# Tutorials in NYC. The conference was fantastic – two days of hands-on workshops, great organization by the good folks at SkillsMatter, Rickasaurus and Paul Blasucci, and a great opportunity to exchange with like-minded people, catch up with old friends and make new ones.

As an aside, if you missed NYC, fear not – you can still get tickets for the Progressive F# Tutorials in London, coming up October 31 and November 1.

After some discussion with Phil Trelford, we decided it would be a lot of fun to organize a workshop around PacMan. Phil has a long history with game development, and a lot of wisdom to share on the topic. I am a total n00b as far as game programming goes, but I thought PacMan would make a fun theme to hack some AI, so I set to refactor some of Phil’s old code, and transform it into a “coding playground” where people could tinker with how PacMan and the Ghosts behave, and make them smarter.

Long story short, the refactoring exercise turned out to be a bit more involved than I had initially anticipated. First, games are written in a style which is pretty different from your run-of-the-mill business app, and getting comfortable with a code base written in such an unfamiliar style wasn’t trivial.

So here I am, trying to refactor that unfamiliar and somewhat idiosyncratic code base, and I start hitting stuff like this:

let ghost_starts =
    [
        "red",    (16, 16), (1, 0)
        "cyan",   (14, 16), (1, 0)
        "pink",   (16, 14), (0, -1)
        "orange", (18, 16), (-1, 0)
    ]
    |> List.map (fun (color, (x, y), v) ->
        // some stuff happens here
        { … X = x * 8 - 7; Y = y * 8 - 3; V = v; … })


This is where I begin to get nervous. I need to get this done quickly, and factor out functions, but I am really worried about touching any of this. What are X and Y? Why 8, 7 or 3?
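One low-risk way to start in a situation like this is to give the magic numbers tentative names before touching anything else. To be clear, the names below are pure guesses on my part, not facts about the original code:

```fsharp
// Hypothetical names - the point is the refactoring pattern, not the semantics.
let tileSize = 8  // pixels per maze tile? (a guess)
let xOffset  = 7  // horizontal sprite adjustment? (a guess)
let yOffset  = 3  // vertical sprite adjustment? (a guess)

// Convert a tile coordinate to a pixel coordinate, in one single place.
let toPixels (x, y) = x * tileSize - xOffset, y * tileSize - yOffset
```

Even if the guesses turn out to be wrong, the arithmetic now lives in one function, and renaming is cheap.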

More...

6. September 2013 08:15

Recently, Cesar De Souza began moving his .NET machine learning library, Accord.NET, from Google Code to GitHub. The move is still in progress, but it motivated me to take a closer look at the library; given that it is built in C#, with C# usage in mind, I wanted to see how usable it is from F#.

There is a lot in the library; as a starting point, I decided I would try out its Support Vector Machine (SVM), a classic machine learning algorithm, and run it on a classic problem, automatically recognizing hand-written digits. The dataset I will be using here is a subset of the Kaggle Digit Recognizer contest; each example in the dataset is a 28x28 grayscale pixel image, the result of scanning a number written down by a human, together with what the actual number is. From that original dataset, I sampled 5,000 examples, which will be used to train the algorithm, and another 500 in a validation set, which we’ll use to evaluate the performance of the model on data it hasn’t “seen before”.

The full example is available as a gist on GitHub.

I’ll be working in a script file within a Library project, as I typically do when exploring data. First, we need to add references to Accord.NET via NuGet:

#r @"..\packages\Accord.2.8.1.0\lib\Accord.dll"
#r @"..\packages\Accord.Math.2.8.1.0\lib\Accord.Math.dll"
#r @"..\packages\Accord.Statistics.2.8.1.0\lib\Accord.Statistics.dll"
#r @"..\packages\Accord.MachineLearning.2.8.1.0\lib\Accord.MachineLearning.dll"

open System
open System.IO

open Accord.MachineLearning
open Accord.MachineLearning.VectorMachines
open Accord.MachineLearning.VectorMachines.Learning
open Accord.Statistics.Kernels


Note the added references to the Accord.dll and Accord.Math.dll assemblies; while the code presented below doesn’t reference them explicitly, it looks like Accord.MachineLearning tries to load them, which fails miserably if they are not referenced.

Then, we need some data; once the training set and validation set have been downloaded to your local machine (see the gist for the dataset URLs), that’s fairly easy to do:

let training = @"C:/users/mathias/desktop/dojosample/trainingsample.csv"
let validation = @"C:/users/mathias/desktop/dojosample/validationsample.csv"

let readData filePath =
    File.ReadAllLines filePath
    |> fun lines -> lines.[1..]
    |> Array.map (fun line -> line.Split(','))
    |> Array.map (fun line ->
        (line.[0] |> Convert.ToInt32), (line.[1..] |> Array.map Convert.ToDouble))
    |> Array.unzip

let labels, observations = readData training


We read every line of the CSV file into an array of strings, and drop the headers by array slicing, keeping only items at or after index 1. We then split each line around commas (so that each line is now an array of strings), and retrieve separately the first element of each line (what the number actually is) and all the pixels, which we convert to floats. Finally, we unzip the result, so that we get an array of integers (the actual numbers), and an array of float arrays (the grayscale level of each pixel).
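Once a classifier has been trained, the same loading mechanics can be reused to measure accuracy on the validation set. As a sketch, here is the shape of that check; `classify` below is a stand-in placeholder, not part of Accord.NET:

```fsharp
// 'classify' is a placeholder for a trained classifier: float [] -> int.
let classify (pixels: float []) = 0 // always predicts "0" - for illustration only

// Proportion of examples where the predicted label matches the true label.
let evaluate classifier (labels: int []) (observations: float [] []) =
    Array.map2 (fun label obs -> if classifier obs = label then 1.0 else 0.0)
        labels observations
    |> Array.average

// Intended usage, once validation data is loaded and a real classifier exists:
// let validationLabels, validationObservations = readData validation
// evaluate classify validationLabels validationObservations
```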

More...

1. September 2013 13:39

I have been back for about a week now, after nearly three weeks on the road, talking about F# all over the US. The first day I woke up in my own bed, my first thought was “where am I again? And where am I speaking tonight?” Now life is slowly getting back to normal, and I thought it would be a good time to share some impressions from the trip.

• I am very proud to have inaugurated two new F# meetup groups during that trip! The Washington DC F# meetup, organized by @devshorts, is off to a great start: we had a full house at B-Line Medical that evening, with a great crowd mixing F# fans, C# developers, as well as OCaml and Python people. My favorite moment there was with Sam, a solid C# developer, who looked very worried about writing F# code for the first time. Two hours later, he was so proud (and legitimately so) of having a nice classifier working, all in F#, that he couldn’t resist, and presented his code to the entire group. Nice job! Detroit was my final stop on the road, and didn’t disappoint: the Detroit F# meetup was awesome. It was hosted at the Grand Trunk Pub; while the location had minor logistics drawbacks, that was amply compensated by having food and drinks right there, as well as a great crowd. Thanks to @OldDutchCap and @JohnBFair for making this happen, this was a suitable grand finale for the trip!
• In general, August seems to be the blossoming period for F# meetups – two other groups popped up in the same month, one in Minsk, thanks to the efforts of @lu_a_jalla and @sergey_tihon, and one in Paris, spearheaded by @tjaskula, @robertpi and @thinkb4coding, this is very exciting, and I am looking forward to meeting some F#ers next time I stop back home!
• A lesson I learnt the hard way is that San Francisco is most definitely not a good benchmark for what to wear in August in the US. My first stops were all in the south – Houston, Nashville, Charlotte and Raleigh, and boy was I not ready for the crazy heat and humidity! On the other hand, I can confirm the rumor, the South knows how to make a guest welcome. For that matter, I am extremely grateful to everyone who hosted me during this trip – you know who you are, thank you for all the help.
• One surprise during this trip was the general level of interest in F#. I regularly hear nonsense like “F# is a niche language”, so I expected smaller crowds in the general .NET groups. Well, apparently someone forgot to tell the .NET developers, because I got pretty solid audiences in these groups as well, with an amazing 100 people showing up in Raleigh. TriNUG rocked!
• In general, I was a bit stressed out by running a hands-on machine learning lab with F# novices; for an experienced F# user, it’s not incredibly complex, but for someone who hasn’t used the language before, it’s a bit of a “here is the deep-end of the swimming pool, now go see if you can swim” moment. I was very impressed by how people did in these groups, everyone either finished or ended up very close. Amusingly, in one of the groups, the first person who completed the exercise, in very short time, was… a DBA, who explained that he immediately went for a set-oriented style. Bingo! The lesson for me is that F# is not complicated, but you have to embrace its flow, and largely forget about C#. One trick which seemed to help was to ask the question “how would you write it if you were using only LINQ”. Otherwise, C# developers seemed to often over-think and build code blocks too large for their own good, whereas F# works best by creating very small and simple functions, and then assembling them in larger workflows.
• Another fun moment was in Boston, where I ran the Machine Learning dojo at Hack/Reduce, language agnostic (thanks @JonnyBoats for making the introductions!). Pretty much every language under the sun was represented (C#, Java, F#, Scala, Python, Matlab, Octave, R, Clojure, Ruby) – but one of the participants still managed to pull “something special”, and tried to implement a classifier entirely in PostgreSQL. It didn’t quite work out, but hats off nevertheless, that was a valiant experiment!
• As a Frenchman, I take food seriously. As a scientist, I want to see the data. Therefore, I was very excited to have the opportunity to investigate first hand whether North Carolina style BBQ is indeed a heresy. I got the chance to try out BBQ in Houston and Raleigh, and I have to give it to Texas, hands down.

• Lesson learnt the hard way: do not ever depend on the internet for a presentation. Some of my material was in a gist on GitHub, and a couple of hours before a presentation, I realized that GitHub was under a DoS attack. Not happy times.
• I am more and more of a fan of the hands-on, write code in groups format. It has its limitations – you can’t really do it with a very large crowd, and it requires more time than a traditional talk – but it’s a very different experience. One thing I really enjoyed when starting with F# was its interactivity; the “write code and see what happens” experience rekindled the joy of coding for me. The hands-on format captures some of that “happy hacking” spirit, and gets people really engaged. Once someone starts writing code, they own it – and working in groups is a great way to accelerate the learning process, and build a community.

• I have been complacent with the story “it works on environments other than Windows/Visual Studio”. It does, but the best moment to figure out how to make it work exactly is not during a group coding exercise. In these situations, fsharp.org is your friend – and since I came back, I started actually trying all that out, because “I heard it should work” is just not good enough.
• I saw probably somewhere between 500 and 1,000 developers during this trip, and while this was completely exhausting, I don’t regret any of it. One of the high points of the whole experience was to just get some time to hang out with old or new friends from the F#/functional community – @panesofglass in Houston, @bryan_hunter and the FireFly Logic & @NashFP crew in Nashville, @rickasaurus, @tomaspetricek, @pblasucci, @mitekm and @hmansell in New York City, and @plepilov, @kbattocchi and @talbott in Boston (sorry if I forgot anyone!). If this trip taught me one thing, it’s that there is actually a lot of interest in F# in the .NET community, and beyond – but we, the F# community, are very scattered, and from our smaller local groups, it’s often hard to get a sense for that. Having a chance to talk to all of you guys who have been holding the fort and spreading F# around, discussing what we do, what works and what doesn’t, and simply having a good time, was fantastic. We need more of this – I am incredibly invigorated, and very hopeful that 2014 will be a great year for F#!
13. July 2013 10:43

It looks like this summer will be my strangest vacation in a while – I’ll be taking a F# road trip of sorts in August, talking about F# at user groups all over the United States. How this crazy plan took shape exactly I am not quite sure in retrospect, but I am really looking forward to meeting all the local communities – this will be fun!

As of July 28th (updated from July 13th), here is the plan:

July 31, Sacramento: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 8, Houston: “An Introduction to F# for the C# Developer”

August 9, Houston: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 12, Nashville: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 13, Charlotte: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 14, Raleigh: “An Introduction to F# for the C# Developer”

August 15, Raleigh: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 16, Washington DC: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 17, Philadelphia: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 19, New York City: “Coding Dojo: a gentle introduction to Machine Learning with F#”

August 20, Boston: “An introduction to F# for the C# developer”

August 21, Boston: “Coding Dojo: a gentle introduction to Machine Learning”

August 22, Detroit: TBA

… and a few more should be added to the list soon! I’ll let you extrapolate what possible cities could be following, given the map below. Stay tuned for updates.


Huge thanks to the people who helped make this happen – I am sure I forgot some of you, sorry about that, and I’ll owe you a beer when I visit your city!

@zychr and the Sacramento .NET group

@bryan_hunter and @NashFP in Nashville

@devshorts and the newly founded Washington DC F# meetup group!

@rickasaurus and the NYC F# group

@plepilov, @talbott, @jonnyboats and @hackreduce + New England F# group in Boston

… and of course @INETA!

5. July 2013 15:51

Besides having one of the coolest names around, Random Forest is an interesting machine learning algorithm, for a few reasons. It is applicable to a large range of classification problems, isn’t prone to over-fitting, can produce good quality metrics as a side-effect of the training process itself, and is very suitable for parallelization. For all these reasons, I thought it would be interesting to try it out in F#.

The current implementation I will be discussing below works, but isn’t production ready (yet) – it is work in progress. The API and implementation are very likely to change over the next few weeks. Still, I thought I would share what I did so far, and maybe get some feedback!

## The idea behind the algorithm

As the name suggests, Random Forest (introduced in the early 2000s by Leo Breiman) can be viewed as an extension of Decision Trees, which I discussed before. A decision tree grows a single classifier, in a top-down manner: the algorithm recursively selects the feature which is the most informative, partitions the data according to the outcomes of that feature, and repeats the process until no information can be gained by partitioning further. On a non-technical level, the algorithm is playing a smart “game of 20 questions”: given what has been deduced so far, it picks from the available features the one that is most likely to lead to a more certain answer.

How is a Random Forest different from a Decision Tree? The first difference is that instead of growing a single decision tree, the algorithm creates a “forest” – a collection of decision trees; the final decision of the classifier is the majority decision of all trees in the forest. However, a forest of identical trees wouldn’t be of much help, because we would get the same classifier repeated over and over again. This is where the algorithm gets interesting: instead of growing each tree using the entire training set and all the features, it introduces two sources of randomness:

• each tree is grown on a new sample, created by randomly sampling the original dataset with replacement (“bagging”),
• at each node of the tree, only a random subset of the remaining features is used.
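The “majority decision of all trees” mentioned above is easy to sketch: treat each tree as a function from observation to label, collect the votes, and pick the most frequent label.

```fsharp
// Majority vote: each tree casts a prediction, the most frequent label wins.
let majority (forest: ('a -> 'b) list) (observation: 'a) =
    forest
    |> List.map (fun tree -> tree observation)  // each tree votes
    |> List.countBy id                          // tally votes per label
    |> List.maxBy snd                           // most frequent label
    |> fst
```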

Why would introducing randomness be a good idea? It has a few interesting benefits:

• by selecting different samples, it mitigates the risk of over-fitting. A single tree will produce an excellent fit on the particular dataset that was used to train it, but this doesn’t guarantee that the result will generalize to other sets. Training multiple trees on random samples creates a more robust overall classifier, which will by construction handle a “wider” range of situations than a single dataset,
• by selecting a random subset of features, it mitigates the risks of greedily picking locally optimal features that could be overall sub-optimal. As a bonus, it also allows a computation speed-up for each tree, because fewer features need to be considered at each step,
• the bagging process, by construction, creates for each tree a Training Set (the selected examples) and a Cross-Validation Set (what’s “out-of-the-bag”), which can be directly used to produce quality metrics on how the classifier may perform in general.
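The bagging step itself can be sketched in a few lines: draw n indices with replacement, and keep track of what was never picked, since that out-of-the-bag portion serves as a free validation set for that tree. This is a sketch of the general idea, not the project’s actual code:

```fsharp
// Bagging: sample n items with replacement; anything never picked
// is "out of the bag" and can be used to validate the tree.
let rng = System.Random(42)

let bag (sample: 'a []) =
    let n = sample.Length
    let picked = Array.init n (fun _ -> rng.Next(n)) // indices, with replacement
    let inBag = picked |> Array.map (fun i -> sample.[i])
    let pickedSet = Set.ofArray picked
    let outOfBag =
        sample
        |> Array.mapi (fun i x -> i, x)
        |> Array.filter (fun (i, _) -> not (pickedSet.Contains i))
        |> Array.map snd
    inBag, outOfBag
```

On average, roughly a third of the original sample ends up out of the bag, which is what makes the built-in quality metrics possible.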

## Usage

Before delving into the current implementation, I thought it would be interesting to illustrate on an example the intended usage. I will be using the Titanic dataset, from the Kaggle Titanic contest. The goal of the exercise is simple: given the passengers list of the Titanic, and what happened to them, can you build a model to predict who sinks or swims?

I didn’t think the state of affairs warranted a Nuget package just yet, so this example is implemented as a script, in the Titanic branch of the project itself on GitHub.

First, let’s create a Record type to represent passengers:

type Passenger = {
    Id: string
    Class: string
    Name: string
    Sex: string
    Age: string
    SiblingsOrSpouse: string
    ParentsOrChildren: string
    Ticket: string
    Fare: string
    Cabin: string
    Embarked: string }


Note that all the properties are represented as strings; it might be better to represent them for what they are (Age is a float, SiblingsOrSpouse an integer…) – but given that the dataset contains missing data, this would require dealing with that issue, perhaps using an Option type. We’ll dodge the problem for now, and opt for a stringly-typed representation.
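If we did want properly typed fields, a small helper along these lines would handle the missing-data issue with options; this is a sketch of the alternative, not part of the project:

```fsharp
open System
open System.Globalization

// Parse a possibly-missing numeric field into a float option.
let tryParseFloat (s: string) =
    match Double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None
```

Age would then become a `float option`, and every consumer of the field would be forced to handle the `None` case explicitly.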

Next, we need to construct a training set from the Kaggle data file. We’ll use the CSV parser that comes with FSharp.Data to extract the passengers from that list, as well as their known fate (the file is assumed to have been downloaded on your local machine first):

open FSharp.Data // CsvFile lives in FSharp.Data

let path = @"C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon\train.csv"
let data = CsvFile.Load(path).Cache()

let trainingSet =
    [| for line in data.Data ->
        line.GetColumn "Survived" |> Some, // the label
        {   Id = line.GetColumn "PassengerId"
            Class = line.GetColumn "Pclass"
            Name = line.GetColumn "Name"
            Sex = line.GetColumn "Sex"
            Age = line.GetColumn "Age"
            SiblingsOrSpouse = line.GetColumn "SibSp"
            ParentsOrChildren = line.GetColumn "Parch"
            Ticket = line.GetColumn "Ticket"
            Fare = line.GetColumn "Fare"
            Cabin = line.GetColumn "Cabin"
            Embarked = line.GetColumn "Embarked" } |]


Now that we have data, we can get to work, and define a model. We’ll start first with a regular Decision Tree, and extract only one feature, Sex:

let features =
    [| (fun x -> x.Sex |> StringCategory) |]


What this is doing is defining an array of features, a feature being a function which takes in a Passenger and returns a string option, via the utility StringCategory. StringCategory simply expects a string, transforms a null or empty case into the “missing data” case, and otherwise treats the string as a category. So in that case, x is a passenger; if no Sex information is found, it is transformed into None, and otherwise into Some(“male”) or Some(“female”), the two cases that exist in the dataset.
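The behavior just described could be sketched as follows; note that this is my reconstruction from the description above, not the project’s actual implementation:

```fsharp
// A sketch of the StringCategory behavior described above:
// null or empty means "missing data", anything else is a category.
let stringCategory (s: string) =
    if System.String.IsNullOrEmpty s then None else Some s
```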

We are now ready to go – we can run the algorithm and get a Decision Tree classifier, with a minimum leaf of 5 elements (i.e. we stop partitioning if we have fewer than 5 elements left):

let minLeaf = 5
let classifier = createID3Classifier trainingSet features minLeaf


… and we are done. How good is our classifier? Let’s check:

let correct =
    trainingSet
    |> Array.averageBy (fun (label, obs) ->
        if label = Some(classifier obs) then 1. else 0.)
printfn "Correct: %.4f" correct


More...