Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
28. April 2013 09:32

In our previous post, we began exploring Singular Value Decomposition (SVD) using Math.NET and F#, and showed how this linear algebra technique can be used to “extract” the core information of a dataset and construct a reduced version of the dataset with limited loss of information.

Today, we’ll pursue our excursion in Chapter 14 of Machine Learning in Action, and look at how this can be used to build a collaborative recommendation engine. We’ll follow the approach outlined by the book, starting first with a “naïve” approach, and then using an SVD-based approach.

We’ll start from a slightly modified setup from last post, loosely inspired by the Netflix Prize. The full code for the example can be found here on GitHub.

## The problem and setup

In the early 2000s, Netflix had an interesting problem. Netflix’s business model was simple: you would subscribe, and for a fixed fee you could watch as many movies from their catalog as you wanted. However, what happened was the following: users would watch all the movies they knew they wanted to watch, and after a while, they would run out of ideas – and rather than search for lesser-known movies, they would leave. As a result, Netflix launched a prize: anyone who could create a model providing users with good recommendations for new movies to watch could claim a $1,000,000 prize.

Obviously, we won’t try to replicate the Netflix prize here, if only because the dataset was rather large; 500,000 users and 20,000 movies is a lot of data… We will instead work off a fake, simplified dataset that illustrates some of the key ideas behind collaborative recommendation engines, and how SVD can help in that context. For the sake of clarity, I’ll err on the side of being extra verbose.

Our dataset consists of users and movies; a movie can be rated from 1 star (terrible) to 5 stars (awesome). We’ll represent it with a Rating record type, associating a UserId, MovieId, and Rating:

type UserId = int
type MovieId = int
type Rating = { UserId:UserId; MovieId:MovieId; Rating:int }


To make our life simpler, and to be able to validate whether “it works”, we’ll imagine a world where only 3 types of movies exist, say, Action, Romance and Documentary – and where people have simple tastes: people either love Action and hate the rest, love Romance and hate the rest, or love Documentaries and hate the rest. We’ll assume that we have only 12 movies in our catalog: movies 0 to 3 are Action, 4 to 7 Romance, and 8 to 11 Documentary.
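To have a dataset to play with, tastes like these can be generated in a few lines, using the Rating record above. This is my own illustrative generator – the split between “love” (4 or 5 stars) and “hate” (1 or 2 stars) is a placeholder, and the full GitHub sample may build its data differently:

```fsharp
// Sketch of a fake dataset generator (illustrative, not from the post):
// each user loves exactly one genre, and rates accordingly.
let rng = System.Random(42)

// movies 0-3 are Action, 4-7 Romance, 8-11 Documentary
let genreOf (movie: MovieId) = movie / 4

let ratingsFor (user: UserId) (lovedGenre: int) =
    [ for movie in 0 .. 11 ->
        let stars =
            if genreOf movie = lovedGenre
            then rng.Next(4, 6) // loved: 4 or 5 stars
            else rng.Next(1, 3) // hated: 1 or 2 stars
        { UserId = user; MovieId = movie; Rating = stars } ]

// 9 users, evenly split across the 3 tastes
let sampleRatings =
    [ for user in 0 .. 8 do yield! ratingsFor user (user % 3) ]
```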


14. April 2013 12:20

Last Thursday, I gave a talk at the Bay.NET user group in Berkeley, introducing F# to C# developers. First off, I have to thank everybody who came – you guys were great, lots of good questions, nice energy, I had a fantastic time!

My goal was to highlight why I think F# is awesome, and of course this had to include a Type Provider demo, one of the most amazing features of F# 3.0. So I went ahead, and demoed Tomas Petricek’s World Bank Type Provider, and Howard Mansell’s R Type Provider – together. The promise of Type Providers is to enable information-rich programming; in this case, we get immediate access to a wealth of data over the internet, in one line of code, entirely discoverable by IntelliSense in Visual Studio - and we can use all the visualization arsenal of R to see what’s going on. Pretty rad.

Rather than just dump the code, I thought it would be fun to turn that demo into a video. The result is a 7-minute clip, with only minor editing (a few cuts, and I sped the video up 3x, because the main point here isn’t how terrible my typing skills are). I think it’s largely self-explanatory; the only points worth commenting upon are:

• I am using a NuGet package for the R Type Provider that doesn’t officially exist yet. I figured a NuGet package would make that Type Provider more usable, and spent my week-end creating it, but haven’t published it yet. Stay tuned!
• The most complex part of the demo is probably R’s syntax from hell. For those of you who don’t know R, it’s a free, open-source statistical package which does amazingly cool things. What you need to know to understand this video is that R is very vector-centric. You can create a vector in R using the syntax myData <- c(1,2,3,4), and combine vectors into what’s called a data frame, essentially a collection of features. The R type provider exposes all R packages and functions through a single static type, aptly named R – so for instance, one can create an R vector from F# by typing let myData = R.c( [|1; 2; 3; 4 |]).

That’s it! Let me know what you think, and if you have comments or questions.

10. February 2013 11:25

And the journey converting “Machine Learning in Action” from Python to F# continues! Rather than following the order of the book, I decided to skip chapters 8 and 9, dedicated to regression methods (regression is something I spent a bit too much time on in the past to be excited about it right now), and go straight to Unsupervised Learning, which begins with the k-means clustering algorithm.

In a nutshell, clustering focuses on the following question: given a set of observations, can the computer figure out a way to classify them into “meaningful groups”? The major difference with Classification methods is that in clustering, the Categories / Groups are initially unknown: it’s the algorithm’s job to figure out sensible ways to group items into Clusters, all by itself (hence the word “unsupervised”).

Chapter 10 covers 2 clustering algorithms, k-means and bisecting k-means. We’ll discuss only the first one today.

The underlying idea behind the k-means algorithm is to identify k “representative archetypes” (k being a user input), the Centroids. The algorithm proceeds iteratively:

• Start from k random Centroids,
• Assign each Observation to the closest Centroid; the Observations assigned to a Centroid constitute a Cluster,
• Update each Centroid, by taking the average of the Observations in its Cluster,
• Repeat the two previous steps until the allocation of Observations to Clusters doesn’t change any more.

When things go well, we end up with k stable Centroids (minor modifications of the Centroids no longer change the Clusters), and each Cluster contains Observations that are similar, because they are all close to the same Centroid (the Wikipedia page for the algorithm provides a nice graphical representation).

## F# implementation

The Python implementation proposed in the book is very procedural, and deals only with Observations that are vectors. I thought it would be interesting to take a different approach, focused on functions instead. The current implementation is likely to change when I get into bisecting k-means, but should remain similar in spirit. Note also that I have given no focus to performance – this is my take on the easiest thing that would work.

The entire code can be found here on GitHub.

Here is how I approached the problem. First, rather than restricting ourselves to vectors, suppose we want to deal with any generic type. Looking at the pseudo-code above, we need a few functions to implement the algorithm:

• to assign Observations of type ‘a to the closest Centroid ‘a, we need a notion of Distance,
• we need to create an initial collection of k Centroids of type ‘a, given a dataset of ‘as,
• to update the Centroids based on a Cluster of ‘as, we need some aggregation function.

Let’s create these 3 functions:

    // the Distance between 2 observations 'a is a float
    // it had better be positive - left to the implementer
    type Distance<'a> = 'a -> 'a -> float
    // a CentroidsFactory, given a dataset,
    // should generate k initial Centroids
    type CentroidsFactory<'a> = 'a seq -> int -> 'a seq
    // given a Centroid and the observations in a Cluster,
    // create an updated Centroid
    type ToCentroid<'a> = 'a -> 'a seq -> 'a


We can now define a function which, given a set of Centroids, will return the index of the closest Centroid to an Observation, as well as the distance from the Centroid to the Observation:

    // Returns the index of and distance to the
    // Centroid closest to observation
    let closest (dist: Distance<'a>) centroids (obs: 'a) =
        centroids
        |> Seq.mapi (fun i c -> (i, dist c obs))
        |> Seq.minBy (fun (i, d) -> d)


Finally, we’ll go for the laziest possible way to generate k initial Centroids, by picking up k random observations from our dataset:

    // Picks k random observations as initial centroids
    // (this is very lazy, even tolerates duplicates)
    let randomCentroids<'a> (rng: System.Random)
                            (sample: 'a seq)
                            k =
        let size = Seq.length sample
        seq { for i in 1 .. k do
                let pick = Seq.nth (rng.Next(size)) sample
                yield pick }
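To see these pieces fit together, here is what Distance and ToCentroid could look like for plain float[] observations – euclidean distance, and a coordinate-wise average. These definitions are my own illustration, not code from the post:

```fsharp
// Euclidean distance between two float[] observations
let euclidean : Distance<float[]> =
    fun x y ->
        Array.map2 (fun a b -> (a - b) * (a - b)) x y
        |> Array.sum
        |> sqrt

// Updated Centroid = coordinate-wise average of the Cluster
let avgCentroid : ToCentroid<float[]> =
    fun current cluster ->
        if Seq.isEmpty cluster then current // empty Cluster: keep as-is
        else
            let size = Seq.length cluster |> float
            cluster
            |> Seq.reduce (Array.map2 (+))
            |> Array.map (fun sum -> sum / size)

// Try it on 2 obvious clusters in the plane; note that the lazy
// sequence of centroids is materialized into a list, so that the
// random picks don't change every time it is re-enumerated
let sample = [ [| 0.; 0. |]; [| 0.; 1. |]; [| 10.; 10. |]; [| 10.; 11. |] ]
let centroids = randomCentroids (System.Random(1)) sample 2 |> Seq.toList
let assignments =
    sample |> List.map (fun obs -> fst (closest euclidean centroids obs))
```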



29. December 2012 17:23

This post continues my journey converting the Python samples from Machine Learning in Action into F#. On the program today: chapter 7, dedicated to AdaBoost. This is also the last chapter revolving around classification. After almost 6 months of spending my week-ends on classifiers, I am rather glad to change gears a bit!

## The idea behind the algorithm

Algorithm outline

AdaBoost is short for “Adaptive Boosting”. Boosting is based on a very common-sense idea: instead of trying to find one perfect classifier that fits the dataset, the algorithm will train a sequence of classifiers and, at each step, will analyze the latest classifier’s results, and focus the next training round on reducing classification mistakes, by giving a bigger weight to the misclassified observations. In other words, “get better by working on your weaknesses”.

The second idea in AdaBoost, which I found very interesting and somewhat counter-intuitive, is that multiple poor classification models taken together can constitute a highly reliable source. Rather than discarding previous classifiers, AdaBoost combines them all into a meta-classifier. AdaBoost computes a weight Alpha for each of the “weak classifiers”, based on the proportion of examples properly classified, and classifies observations by taking a majority vote among the weak classifiers, weighted by their Alpha coefficients. In other words, “decide based on all sources of information, but take into account how reliable each source is”.

In pseudo-code, the algorithm looks like this:

Given examples = observations + labels,

Until overall quality is good enough or the iteration limit is reached,

• From the available weak classifiers, pick the classifier with the lowest weighted prediction error,
• Compute its Alpha weight based on its prediction quality,
• Update the weight assigned to each example, based on Alpha and on whether the example was properly classified or not.

The weights update mechanism

Let’s dive into the update mechanism for both the training example weights and the weak classifiers’ Alpha weights. Suppose that we have

• a training set with 4 examples & their labels [ (E1, 1); (E2, –1); (E3, 1); (E4, –1) ],
• currently weighted [ 20%; 20%; 30%; 30% ] (note: example weights must sum to 100%),
• f, the best weak classifier selected.

If we apply the weak classifier f to the training set, we can check which examples are mis-classified, and compute the weighted error, i.e. the weighted proportion of mis-classifications:

| Example | Label | Weight | f(E) | f is… | Weighted error |
| --- | --- | --- | --- | --- | --- |
| E1 | 1 | 0.2 | 1 | correct | 0.0 |
| E2 | -1 | 0.2 | 1 | incorrect | 0.2 |
| E3 | 1 | 0.3 | 1 | correct | 0.0 |
| E4 | -1 | 0.3 | -1 | correct | 0.0 |
| Total | | | | | 0.2 |

This gives us a weighted error rate of 20% for f, given the weights.

The weight given to f in the final classifier is given by

Alpha = 0.5 x ln ((1 - error) / error)

Here is how Alpha looks, plotted as a function of the proportion correctly classified (i.e. 1 – error):

If 50% of the examples are properly classified, the classifier is no better than random, and gets a weight of 0 – its output is ignored. Higher quality models get higher weights – and models with a high level of misclassification get a strong negative weight. This is interesting; in essence, this treats them as a great negative source of information: if you know that I am always wrong, my answers are still highly informative – you just need to flip the answer…
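In F#, the error, Alpha, and weight-update steps from the pseudo-code can be sketched on the 4-example dataset above. The update rule used here, w * exp(-Alpha * label * prediction) followed by renormalization, is the standard AdaBoost one; the book’s actual code may differ in its details:

```fsharp
let labels      = [ 1.0; -1.0; 1.0; -1.0 ]
let weights     = [ 0.2;  0.2; 0.3;  0.3 ]
let predictions = [ 1.0;  1.0; 1.0; -1.0 ] // f applied to E1 .. E4

// weighted proportion of mis-classifications: 0.2 here
let error =
    List.zip3 labels weights predictions
    |> List.sumBy (fun (lbl, w, pred) -> if lbl <> pred then w else 0.0)

// Alpha = 0.5 x ln((1 - error) / error): about 0.69 for a 20% error
let alpha = 0.5 * log ((1.0 - error) / error)

// increase the weight of misclassified examples, decrease the rest,
// and renormalize so that the weights still sum to 100%
let unnormalized =
    List.zip3 labels weights predictions
    |> List.map (fun (lbl, w, pred) -> w * exp (-alpha * lbl * pred))
let total = List.sum unnormalized
let newWeights = unnormalized |> List.map (fun w -> w / total)
```

Running this on the example, E2 (the misclassified one) sees its weight double before renormalization, while the three correct examples see theirs halved – exactly the “work on your weaknesses” behavior described above.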


26. December 2012 11:50

This is the continuation of my series converting the samples found in Machine Learning in Action from Python to F#. After starting at a nice and steady pace, I hit a speed bump with Chapter 6, dedicated to the Support Vector Machine algorithm. The math is more involved than for the previous algorithms, and the original Python implementation is very procedural, both of which slowed down the conversion to a more functional style.

Anyways, I am now at a good point to share progress. The current version uses Sequential Minimal Optimization (SMO) to train the classifier, and supports Kernels. Judging from my experiments, the algorithm works – what is missing at this point is some performance optimization.

I’ll talk first about the code changes from the “naïve SVM” version previously discussed, and then we’ll illustrate the algorithm in action, recognizing hand-written digits.

## Main changes from previous version

From a functionality standpoint, the 2 main changes from the previous post are the replacement of the hard-coded vector dot product by arbitrary Kernel functions, and the modification of the algorithm from a naïve loop to the SMO double-loop, pivoting on observations based on their prediction error.

You can browse the current version of the SVM algorithm on GitHub here.

Injecting arbitrary Kernels

The code I presented last time relied on the vector dot-product to partition linearly separable datasets. The obvious issue is that not all datasets are linearly separable. Fortunately, with minimal change, the SVM algorithm can be used to handle more complex situations, using what’s known as the “Kernel Trick”. In essence, instead of working on the original data, we transform our data into a new space where it is linearly separable:

[ from http://zutopedia.com/udi/?p=40, via Cesar Souza’s blog ]
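In code terms, a Kernel is simply a function computing a similarity between two observations, which replaces the hard-coded dot product. For illustration (my own signatures, not necessarily the ones in the GitHub version):

```fsharp
// a Kernel measures the similarity between two observations
type Kernel = float list -> float list -> float

// the linear kernel is the plain dot product we had before
let linearKernel : Kernel =
    fun x y -> List.fold2 (fun acc a b -> acc + a * b) 0.0 x y

// the radial basis function (gaussian) kernel handles datasets
// that are not linearly separable in the original space
let rbfKernel (sigma: float) : Kernel =
    fun x y ->
        let distSq =
            List.fold2 (fun acc a b -> acc + (a - b) * (a - b)) 0.0 x y
        exp (-distSq / (2.0 * sigma * sigma))
```

Swapping linearKernel for, say, rbfKernel 10.0 then becomes a one-argument change in the training call, which is the whole point of injecting the Kernel rather than hard-coding it.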
