Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 18. January 2014 14:49

A couple of months ago, I started working on an F# decision tree & random forest library, and pushed a first draft out in July 2013. It was a very minimal implementation, but it was a start, and my plan was to keep refining it and adding features. And then life happened: I got really busy, I began a very poorly disciplined refactoring effort on the code base, I second- and third-guessed my design - and had nothing to show for a while. Finally, in December, I took some time off in Europe and disappeared into the French countryside, a perfect setup to roll up my sleeves and finally get some serious coding done.

And here we go - drum roll please, version 0.1 of Charon is out. You can find it on GitHub, or install it as a NuGet package.

As you can guess from the version number, this is alpha-release grade code. There will be breaking changes, there are probably bugs and obvious things to improve, but I thought it was worth releasing, because it is in a shape good enough to illustrate the direction I am taking, and hopefully get some feedback from the community.

But first, what does Charon do? Charon is a decision tree and random forest machine learning classifier. An example will probably illustrate best what it does - let's work through the classic Titanic example. Using the Titanic passenger list, we want to create a model that predicts whether a passenger is likely to survive the disaster – or meet a terrible fate. Here is how you would do that with Charon, in a couple of lines of F#.

First, we use the CSV type provider to extract passenger information from our data file:

open Charon
open FSharp.Data

type DataSet = CsvProvider<"""C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon.Examples\titanic.csv""", 
                           SafeMode=true, PreferOptionals=true>

type Passenger = DataSet.Row

In order to define a model, Charon needs two pieces of information: what you are trying to predict (the label, in this case whether the passenger survives or not), and what information Charon is allowed to use to produce predictions (the features, in this case whatever passenger information we think is relevant):

let training = 
    use data = new DataSet()
    [| for passenger in data.Data -> 
        passenger, // label source
        passenger |] // features source

let labels = "Survived", (fun (obs:Passenger) -> obs.Survived) |> Categorical
    
let features = 
    [ 
        "Sex", (fun (o:Passenger) -> o.Sex) |> Categorical;
        "Class", (fun (o:Passenger) -> o.Pclass) |> Categorical;
        "Age", (fun (o:Passenger) -> o.Age) |> Numerical;
    ]

For each feature, we specify whether the feature is Categorical (a finite number of "states" is expected, for instance Sex) or Numerical (the feature is to be interpreted as a numeric value, such as Age).

The Model is now fully specified, and we can train it on our dataset, and retrieve the results:

let results = basicTree training (labels,features) { DefaultSettings with Holdout = 0.1 }

printfn "Quality, training: %.3f" (results.TrainingQuality |> Option.get)
printfn "Quality, holdout: %.3f" (results.HoldoutQuality |> Option.get)
    
printfn "Tree:"
printfn "%s" (results.Pretty)

… which generates the following output:

Quality, training: 0.796
Quality, holdout: 0.747
Tree:
├ Sex = male
│   ├ Class = 3 → Survived False
│   ├ Class = 1 → Survived False
│   └ Class = 2
│      ├ Age = <= 16.000 → Survived True
│      └ Age = >  16.000 → Survived False
└ Sex = female
   ├ Class = 3 → Survived False
   ├ Class = 1 → Survived True
   └ Class = 2 → Survived True

Charon automatically figures out which features are most informative, and organizes them into a tree; in our example, it appears that being a lady was a much better idea than being a guy – and being a rich lady traveling first or second class an even better idea. Charon also automatically breaks down continuous variables into bins. For instance, second-class male passengers aged 16 or younger apparently had much better odds of surviving than other male passengers. Charon splits the sample into a training set and a holdout (validation) set; in this example, while our model appears quite good on the training set, with nearly 80% correct calls, the performance on the holdout set is weaker, with under 75% correctly predicted, suggesting an over-fitting issue.
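To make "most informative" and "breaks down continuous variables into bins" a bit more concrete, here is a minimal sketch of one common way to do it: measure the entropy of the labels, and pick the split that reduces it the most. This is a generic illustration in plain F#, not Charon's actual implementation; the function names `entropy`, `gain` and `bestThreshold` are made up for this example, and the code assumes a recent F# version (4.0 or later) for `Array.groupBy` and `Array.pairwise`.

```fsharp
// Sketch only: a generic entropy-based split search, NOT Charon's code.
// All names below (entropy, gain, bestThreshold) are hypothetical.

/// Shannon entropy (in bits) of a collection of labels.
let entropy (labels: seq<'a>) =
    let counts = labels |> Seq.countBy id |> Seq.map snd |> Seq.toArray
    let total = counts |> Array.sum |> float
    counts
    |> Array.sumBy (fun c ->
        let p = float c / total
        -(p * System.Math.Log(p, 2.0)))

/// Information gain obtained by grouping (feature, label) pairs
/// according to `key`: entropy before minus weighted entropy after.
let gain (key: 'f -> 'k) (data: ('f * 'l) []) =
    let before = data |> Array.map snd |> entropy
    let after =
        data
        |> Array.groupBy (fst >> key)
        |> Array.sumBy (fun (_, group) ->
            let weight = float group.Length / float data.Length
            weight * (group |> Array.map snd |> entropy))
    before - after

/// Best binary threshold for a numeric feature: try the midpoints
/// between consecutive distinct values, keep the one with highest gain.
/// Throws if the feature has fewer than two distinct values.
let bestThreshold (data: (float * 'l) []) =
    data
    |> Array.map fst
    |> Array.distinct
    |> Array.sort
    |> Array.pairwise
    |> Array.map (fun (a, b) -> (a + b) / 2.0)
    |> Array.map (fun t -> t, gain (fun x -> x <= t) data)
    |> Array.maxBy snd

// Made-up sample: (age, survived) pairs, for illustration only.
let sample =
    [| 4.0, true; 10.0, true; 15.0, true
       22.0, false; 35.0, false; 60.0, false |]

let threshold, improvement = bestThreshold sample
printfn "Split at %.1f, gain %.3f" threshold improvement
```

On this toy sample, the split lands at 18.5, halfway between 15 and 22, because that threshold separates the two groups perfectly. A categorical feature can be scored with the same `gain` function, using the identity function as the key.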

I won’t demonstrate the Random Forest here; the API is basically the same, with better results but less human-friendly output. While formal documentation is lacking for the moment, you can find code samples in the Charon.Examples project that illustrate usage on the Titanic and the Nursery datasets.
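For readers unfamiliar with the idea, the general principle behind a random forest can be sketched in a few lines: train many trees, each on a random resample of the data, and let them vote. The code below is a generic illustration in plain F#, not Charon's API; `bootstrap`, `forest` and `predict` are hypothetical names, the `train` argument stands in for any function that turns a dataset into a classifier, and the random selection of features that real random forests also perform is left out.

```fsharp
// Sketch only: generic bagging + majority vote, NOT Charon's API.
// All names below (bootstrap, forest, predict) are hypothetical.

/// Draw a bootstrap sample: same size as the data, with replacement.
let bootstrap (rng: System.Random) (data: 'a []) =
    Array.init data.Length (fun _ -> data.[rng.Next(data.Length)])

/// Train `size` classifiers, each on its own bootstrap sample.
/// `train` is any function turning a dataset into a classifier.
let forest (rng: System.Random) (size: int) (train: 'a [] -> ('f -> 'l)) (data: 'a []) =
    List.init size (fun _ -> data |> bootstrap rng |> train)

/// Predict by majority vote across all classifiers.
let predict (trees: ('f -> 'l) list) (observation: 'f) =
    trees
    |> List.countBy (fun tree -> tree observation)
    |> List.maxBy snd
    |> fst

// Three hand-written stand-ins for trained trees, for illustration:
// two vote true, one votes false, so the forest answers true.
let votes : (int -> bool) list =
    [ (fun _ -> true); (fun _ -> false); (fun _ -> true) ]

printfn "Majority vote: %b" (predict votes 42)
```

Individual trees tend to over-fit their own sample, but their errors differ, so the vote is usually more robust than any single tree; the price is that there is no longer one readable tree to print.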

What I hope I conveyed with this small example are the design priorities for Charon: a lightweight API that permits quick iterations to experiment with features and refine a model, using the F# Interactive capabilities.

I will likely discuss in later posts some of the challenges I ran into while implementing support for continuous variables – I learnt a lot in the process. I will leave it at that for today – in the meantime, I would love to get feedback on the current direction, and what you may like or hate about it. If you have comments, feel free to hit me up on Twitter, or to open an Issue on GitHub!

by Mathias 4. January 2014 12:00

’Tis still the season for yearly retrospectives, and for making foolish predictions or commitments; here is a very incomplete and disorganized review of my year 2013 with F#, and some of my take-aways for the year ahead.

2013 has been a CRAZY year for me. I used to be proud of myself when I gave one talk per quarter – this is the map of places where I gave F# presentations / Dojos this year (note that I spoke multiple times in some of these places, and some online talks are not listed…):

So yes, it’s been a crazy year. One thing I find interesting here is that most of these talks were direct requests from the community. Not so long ago, I had to knock on doors and “sell” F# talks to user groups – recently, my worry is that I won’t be able to keep up with the requests for F# presentations. In my opinion, the interest in F# is currently totally under-estimated; I was stunned at how many developers showed up for some of these, in unexpected places.

What 2013 has taught me is that the picture “the F# community is all finance in London and NYC” is rather incorrect; while those cities have the largest concentrations of F# developers, there are also incredibly passionate developers all over the place, and the interest in the language is widespread. The problem here is that the community is pretty scattered; there are many trees, but it’s easy not to notice the sparse forest (yes I am looking at you, Microsoft :)).

I believe one of the main reasons for the current surge in interest is the fantastic work happening around the F# Foundation. It’s been instrumental in shaping a consistent message around the language, providing resources, and getting the community to coalesce around the idea “F# is our language, let’s make it what we want it to be”. Huge props and thanks to Don Syme, Tomas Petricek and Philip Trelford for getting the ball rolling there – what has happened in very little time is amazing. And if you want to get involved in a language with an amazing community, where you can make a difference and help shape an ecosystem, GO TO FSHARP.ORG. NOW. We want you!

It’s been a crazy amount of fun, but I have been flirting with burnout quite a bit, too, hence my New Year’s resolution:

As much as I would love to keep going everywhere talking about F# to anyone who expresses interest (my 2013 policy in a nutshell…), I’ll have to focus in 2014 and calm down a bit. Instead of talking everywhere (I’ll probably still come if you ask nicely and offer a couch to crash on ;) ), I want to work on scaling.

One format that has worked extremely well is hands-on Dojos: instead of a formal presentation, just get people to code together on an interesting problem, in a fun and friendly atmosphere. It’s great for bootstrapping people, and has the added benefit of being more centered on the community itself, and less on a speaker. So one of my goals this year is to begin building a library of ready-to-use Dojos, which groups can simply grab and run, without the hassle of finding speakers, something which is always a bottleneck for Meetups / user groups. I plan on doing this via Community for F# (@c4fsharp), the brainchild of functional cowboy Ryan Riley. If you are interested in that project, and in general in questions around growing a local community, I’d love to hear from you!

In a similar vein, I want to spend more time in my own backyard, San Francisco and the Bay Area, and help grow a stronger, inclusive, economically viable F# community there. Lots of reasons for optimism: we already have a strong, passionate, and growing community (hello, @FoxyJackFox!), we have stable hosting for sfsharp.org at ThoughtWorks (thanks for your support and enthusiasm, Logan!), and seeing exciting companies like GitHub, Xamarin or Kaggle embrace F# is awesome. The goal for 2014 is simple: crank up the level with Dojos and talks in SF, and start an outpost in Silicon Valley (I hear there are some developers there, too).

As an aside, I wanted to tip my hat to Bryan Hunter, whose ideas on community building have been very inspirational. Nashville is slowly becoming a hotbed of functional programmers, and seems to be the place to be for F#ers lately; and I am sure this is in no small part due to Bryan’s focus on building a community that emphasizes cross-language collaboration, inclusion, empowerment, and dare I say, happiness.

What else? A big part of my year has been focused on my brand-new hashtag (vocation?) #OpenSourceMom ©. If you follow this blog, it shouldn’t come as a surprise to hear that I am very, very interested in Machine Learning and Data Science. I have been busy doing my best helping F# gain the recognition it deserves in that space (in part for selfish reasons: I think it’s a fantastic language for the job, and I want to be able to use it as much as possible), and that has led me to try and help the community work better together. It never ceases to amaze me how much high-quality code the community has produced already; at the same time, writing code alone is only fun for so long, and because we are so dispersed, good ideas go unfinished or unnoticed, which is a shame. So while I prefer writing my own code (and not writing any documentation for it), I was very excited when the F# Foundation began launching working groups, and did my best to take a backseat and just try to facilitate communication and cooperation in the area of data science. It has been a fantastic experience, and I am incredibly happy with the results, and with the opportunity this has given me to get to know all you guys better, and to learn a lot from you (you know who you are). Also, a tip of the hat to Keith and his “Up for Grabs” initiative, which I hope we’ll get to leverage more this year – it’s IMO a great way to channel help, as well as provide easy entry points for beginners who are interested in getting started with a new language. Oh, and this was also the year of my first pull request ever :)

Finally, one high point of the year was the month of December, which gave me the chance to get to know the European community better. I had the pleasure of speaking at BuildStuff in Vilnius, which was a fantastic conference. Greg, Neringa and Laura put together the kind of event you know you won’t forget – great speakers, of course, but also, and perhaps more importantly, an event with a soul. So thank you guys, and everyone I had the pleasure to talk to there! Oh, and by the way registration is open for 2014, and the price is unbelievable. Go there, buy your ticket now, you’ll thank me later.

While in Europe, I figured I might as well travel around a bit; isn’t that what people do while on vacation? So I went to visit the F# communities in Paris, London and Minsk, which was a blast. Having no organized F# community in France, a country with a strong OCaml history and my place of origin, was a thorn in my side for a long time; that problem has since been solved, the Paris meetup is in very good hands, and I was thrilled to speak there. Similarly, taking the trip to Minsk to speak at that user group was awesome. It was so great to finally meet Natallie and Serguey in person, after years of online contact! And I don’t know what they put in the water in Minsk (Vitamin F, maybe?) but the talent level there is just unbelievable. And I capped the year with London, which I expected to be great, and which totally delivered.

So yes, this has been a pretty crazy year of F# for me. At the same time, this has been one of my most fun and rewarding years – all because of you, the F# community. I don’t know how to say it better, but this community just completely, utterly, massively kicks ass. Which makes me even more grateful and humbled that I got nominated F# MVP of the year for 2013. So from the bottom of my heart, thank you – even if it was a grueling year at times, you made it all worth it, and I can’t wait to see what we’ll do together this year. Happy 2014 – the Year of F#!
