Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
21. May 2013 12:55

Last week, we had our first Coding Dojo at SFSharp.org, the San Francisco F# group – and it was great! A few people in the group had mentioned that at that point they were already convinced F# was a great language, and that what they wanted was help getting started writing actual code, so I figured this would be a good format to try out.

What I wanted was something fun, something cool people could realistically achieve under 2 hours. I settled for one of the Kaggle introduction problems, a classic of Machine Learning, where the goal is to automatically recognize hand-written digits. I didn’t think it would be fair to just throw people in the shark tank without any guidance, especially for F# beginners, so I prepared a minimal slide deck to explain the problem and data set, and a “guided script”, with hints and language syntax examples.

And… it worked! The attendees were absolutely awesome. We had people from Kaggle, Rdio, and two people who drove all the way from Sacramento; we had beginners and experienced FSharpers – and everybody managed to get a classifier working, from scratch. Having some beers available definitely helped, too.

My favorite part is this one attendee, a F# beginner, who kept going at it after the meeting was over, and posted an algorithm improvement in the comments section of the Meetup a couple days after. Way to go! And given the positive response, we’ll definitely have more of these.

Also wanted to say a huge thanks to Matt Harrington, first for starting this user group back then, and then for still being an incredible supporter of the F# community in SF, in spite of a crazy work schedule. Thanks, Matt!

Introduction slide deck

“Guided script”

28. April 2013 09:32

In our previous post, we began exploring Singular Value Decomposition (SVD) using Math.NET and F#, and showed how this linear algebra technique can be used to “extract” the core information of a dataset and construct a reduced version of the dataset with limited loss of information.

Today, we’ll pursue our excursion in Chapter 14 of Machine Learning in Action, and look at how this can be used to build a collaborative recommendation engine. We’ll follow the approach outlined by the book, starting first with a “naïve” approach, and then using an SVD-based approach.

We’ll start from a slightly modified setup from last post, loosely inspired by the Netflix Prize. The full code for the example can be found here on GitHub.

## The problem and setup

In the early 2000s, Netflix had an interesting problem. Netflix’s business model was simple: you would subscribe, and for a fixed fee you could watch as many movies from their catalog as you wanted. However, what happened was the following: users would watch all the movies they knew they wanted to watch, and after a while, they would run out of ideas – and rather than search for lesser-known movies, they would leave. As a result, Netflix launched a prize: if you could create a model that could provide users with good recommendations for new movies to watch, you could claim a \$1,000,000 prize.

Obviously, we won’t try to replicate the Netflix prize here, if only because the dataset was rather large; 500,000 users and 20,000 movies is a lot of data… We will instead work off a fake, simplified dataset that illustrates some of the key ideas behind collaborative recommendation engines, and how SVD can help in that context. For the sake of clarity, I’ll be erring on the side of extra-verbose.

Our dataset consists of users and movies; a movie can be rated from 1 star (terrible) to 5 stars (awesome). We’ll represent it with a Rating record type, associating a UserId, MovieId, and Rating:

type UserId = int
type MovieId = int
type Rating = { UserId:UserId; MovieId:MovieId; Rating:int }


To make our life simpler, and to be able to validate whether “it works”, we’ll imagine a world where only 3 types of movies exist, say, Action, Romance and Documentary – and where people have simple tastes: people either love Action and hate the rest, love Romance or hate the rest, or love Documentaries and hate the rest. We’ll assume that we have only 12 movies in our catalog: 0 to 3 are Action, 4 to 7 Romance, and 8 to 11 Documentary.

More...

14. April 2013 12:20

Last Thursday, I gave a talk at the Bay.NET user group in Berkeley, introducing F# to C# developers. First off, I have to thank everybody who came – you guys were great, lots of good questions, nice energy, I had a fantastic time!

My goal was to highlight why I think F# is awesome, and of course this had to include a Type Provider demo, one of the most amazing features of F# 3.0. So I went ahead, and demoed Tomas Petricek’s World Bank Type Provider, and Howard Mansell’s R Type Provider – together. The promise of Type Providers is to enable information-rich programming; in this case, we get immediate access to a wealth of data over the internet, in one line of code, entirely discoverable by IntelliSense in Visual Studio - and we can use all the visualization arsenal of R to see what’s going on. Pretty rad.

Rather than just dump the code, I thought it would be fun to turn that demo into a video. The result is a 7 minutes clip, with only minor editing (a few cuts, and I sped up the video x3 because the main point here isn’t how terrible my typing skills are). I think it’s largely self-explanatory, the only points that are worth commenting upon are:

• I am using a NuGet package for the R Type Provider that doesn’t officially exist yet. I figured a NuGet package would make that Type Provider more usable, and spent my week-end creating it, but haven’t published it yet. Stay tuned!
• The most complex part of the demo is probably R’s syntax from hell. For those of you who don’t know R, it’s a free, open-source statistical package which does amazingly cool things. What you need to know to understand this video is that R is very vector-centric. You can create a vector in R using the syntax myData <- c(1,2,3,4), and combine vectors into what’s called a data frame, essentially a collection of features. The R type provider exposes all R packages and functions through a single static type, aptly named R – so for instance, one can create a R vector from F# by typing let myData = R.c( [|1; 2; 3; 4 |]).

That’s it! Let me know what you think, and if you have comments or questions.

25. March 2013 10:33

My trajectory through “Machine Learning in Action” is becoming more unpredictable as we go – this time, rather than completing our last episode on K-means clustering (we’ll get back to it later), I’ll make another jump directly to Chapter 14, which is dedicated to Singular Value Decomposition, and convert the example from Python to F#.

The chapter illustrates how Singular Value Decomposition (or SVD in short) can be used to build a collaborative recommendation engine. We will follow the chapter pretty closely: today we will focus on the mechanics of using SVD in F# – and leave the recommendation part to our next installment.

As usual, the code is on GitHub.

Until this point, I have avoided using a Linear Algebra library, because the algorithms we discussed so far involved lightweight, row-centric operations, which didn’t warrant taking such a dependency. SVD is one of these cases where using an established library is a good idea, if only because implementing it yourself would not be trivial. So let’s create a new script file (Chapter14.fsx), add a reference to Math.NET Numerics for F# to our project via NuGet, and reference it in our script:

#r @"..\..\MachineLearningInAction\packages\MathNet.Numerics.2.4.0\lib\net40\MathNet.Numerics.dll"
#r @"..\..\MachineLearningInAction\packages\MathNet.Numerics.FSharp.2.4.0\lib\net40\MathNet.Numerics.FSharp.dll"

open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.LinearAlgebra.Double


Now that we have our tools, let’s start working our example. Imagine that we are running a website, where our users can rate dishes, from 1 (horrendous) to 5 (delightful). Our data would look something along these lines:

type Rating = { UserId: int; DishId: int; Rating: int }

// Our existing "ratings database"
let ratings = [
{ UserId = 0; DishId = 0; Rating = 2 };
{ UserId = 0; DishId = 3; Rating = 4 };
... omitted for brevity ...
{ UserId = 10; DishId = 8; Rating = 4 };
{ UserId = 10; DishId = 9; Rating = 5 } ]

Our goal will be to provide recommendations to User for Dishes they haven’t tasted yet, based on their ratings and what other users are saying.

Our first step will be to represent this as a Matrix, where each Row is a User, each Column a Dish, and the corresponding cell is the User Rating for that Dish. Note that not every Dish has been rated by every User – we will represent missing ratings as zeroes in our matrix:

let rows = 11
let cols = 11
let data = DenseMatrix(rows, cols)
ratings
|> List.iter (fun rating ->
data.[rating.UserId, rating.DishId] <- (float)rating.Rating)


We initialize our 11 x 11 matrix, which creates a zero-filled matrix, and then map our user ratings to each “cell”. Because we constructed our example that way, our UserIds go from 0 to 10, and DishIds from 0 to 10, so we can map them respectively to Rows and Columns.

Note: while this sounded like a perfect case to use a Sparse Matrix, I chose to go first with a DenseMatrix, which is more standard. I may look at whether there is a benefit to going sparse later.

Note: our matrix happens to be square, but this isn’t a requirement.

Note: I will happily follow along the book author and replace unknown ratings by zero, because it’s very convenient. I don’t fully get how this is justified, but it seems to work, so I’ll temporarily suspend disbelief and play along.

At that point, we have our data matrix ready. Before going any further, let’s write a quick utility function, to “pretty-render” matrices:

let printNumber v =
if v < 0.
then printf "%.2f " v
else printf " %.2f " v
// Display a Matrix in a "pretty" format
let pretty matrix =
Matrix.iteri (fun row col value ->
if col = 0 then printfn "" else ignore ()
printNumber value) matrix
printfn ""


We iterate over each row and column, start a newline every time we hit column 0, and print every value, nicely formatted with 2 digits after the decimal.

In passing, note the F#-friendly Matrix.iteri syntax – the good people at Math.NET do support F#, and MathNet.Numerics.FSharp.dll contains handy helpers, which allow for a much more functional usage of the library. Thanks, guys!

Let’s see how our data matrix looks like:

printfn "Original data matrix"
pretty data


… which produces the following output in FSI:

Original data matrix

2.00  0.00  0.00  4.00  4.00  0.00  0.00  0.00  0.00  0.00  0.00
0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  5.00
4.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
3.00  3.00  4.00  0.00  3.00  0.00  0.00  2.00  2.00  0.00  0.00
5.00  5.00  5.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
0.00  0.00  0.00  0.00  0.00  0.00  5.00  0.00  0.00  5.00  0.00
4.00  0.00  4.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  5.00
0.00  0.00  0.00  0.00  0.00  4.00  0.00  0.00  0.00  0.00  4.00
0.00  0.00  0.00  0.00  0.00  0.00  5.00  0.00  0.00  0.00  0.00
0.00  0.00  0.00  3.00  0.00  0.00  0.00  0.00  4.00  5.00  0.00
1.00  1.00  2.00  1.00  1.00  2.00  1.00  0.00  4.00  5.00  0.00
>

We seem to be in business.

More...

16. March 2013 10:04

Mondrian is one of those modern painters whose work everyone recognizes, even though few people will quote his name. He also happens to be one of my favorite artists – in spite of their simple geometric structure, I find his pieces strangely beautiful:

“Composition II in Red, Blue, and Yellow”, from Wikipedia

I have been hard at work on some pretty dry stuff lately, and needed a bit of a change of pace, and ended up spending a couple of hours coding a simple Mondrianizer in F#: give it a picture, and it will transform it into something “in the style of Mondrian”.

For instance, starting from my Twitter avatar, here is what the Mondrianizer produces:

This was strictly quick-and-dirty hackery, so the code is not my best by any stretch of the imagination, but I was rather pleased by the results – you can find the current version of the Mondrianizer here on GitHub.

More...