Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 10. February 2013 11:25

And the Journey converting “Machine Learning in Action” from Python to F# continues! Rather than following the order of the book, I decided to skip chapters 8 and 9, dedicated to regression methods (regression is something I spent a bit too much time doing in the past to be excited about it just right now), and go straight to Unsupervised Learning, which begins with the K-means clustering algorithm.

So what is clustering about?

In a nutshell, clustering focuses on the following question: given a set of observations, can the computer figure out a way to classify them into “meaningful groups”? The major difference with Classification methods is that in clustering, the Categories / Groups are initially unknown: it’s the algorithm’s job to figure out sensible ways to group items into Clusters, all by itself (hence the word “unsupervised”).

Chapter 10 covers 2 clustering algorithms, k-means , and bisecting k-means. We’ll discuss only the first one today.

The underlying idea behind the k-means algorithm is to identify k “representative archetypes” (k being a user input), the Centroids. The algorithm proceeds iteratively:

  • Starting from k random Centroids,
  • Observations are assigned to the closest Centroid, and constitute a Cluster,
  • Centroids are updated, by taking the average of their Cluster,
  • Until the allocation of Observation to Clusters doesn’t change any more.

When things go well, we end up with k stable Centroids (minimal modification of Centroids do not change the Clusters), and Clusters contain Observations that are similar, because they are all close to the same Centroid (The wikipedia page for the algorithm provides a nice graphical representation).

F# implementation

The Python implementation proposed in the book is both very procedural and deals with Observations that are vectors. I thought it would be interesting to take a different approach, focused on functions instead. The current implementation is likely to change when I get into bisecting k-means, but should remain similar in spirit. Note also that I have given no focus to performance – this is my take on the easiest thing that would work.

The entire code can be found here on GitHub.

Here is how I approached the problem. First, rather than restricting ourselves to vectors, suppose we want to deal with any generic type. Looking at the pseudo-code above, we need a few functions to implement the algorithm:

  • to assign Observations of type ‘a to the closest Centroid ‘a, we need a notion of Distance,
  • we need to create an initial collection of k Centroids of type ‘a, given a dataset of ‘as,
  • to update the Centroids based on a Cluster of ‘as, we need some aggregation function.

Let’s create these 3 functions:

    // the Distance between 2 observations 'a is a float
    // It also better be positive - left to the implementer
    type Distance<'a> = 'a -> 'a -> float
    // CentroidsFactory, given a dataset, 
    // should generate n Centroids
    type CentroidsFactory<'a> = 'a seq -> int -> 'a seq
    // Given a Centroid and observations in a Cluster,
    // create an updated Centroid
    type ToCentroid<'a> = 'a -> 'a seq -> 'a

We can now define a function which, given a set of Centroids, will return the index of the closest Centroid to an Observation, as well as the distance from the Centroid to the Observation:

    // Returns the index of and distance to the 
    // Centroid closest to observation
    let closest (dist: Distance<'a>) centroids (obs: 'a) =
        centroids
        |> Seq.mapi (fun i c -> (i, dist c obs)) 
        |> Seq.minBy (fun (i, d) -> d)

Finally, we’ll go for the laziest possible way to generate k initial Centroids, by picking up k random observations from our dataset:

    // Picks k random observations as initial centroids
    // (this is very lazy, even tolerates duplicates)
    let randomCentroids<'a> (rng: System.Random) 
                            (sample: 'a seq) 
                            k =
        let size = Seq.length sample
        seq { for i in 1 .. k do 
              let pick = Seq.nth (rng.Next(size)) sample
              yield pick }

More...

by Mathias 6. February 2013 15:06

In spite of being color blind, I am a visual guy – I like to see things. Nothing beats a chart to identify problems in your data. I also spend lots of time manipulating data in FSI, the F# REPL, and while solutions like FSharpChart makes it possible to produce nice graphs fairly easily, I still find it introduces a bit of friction, and wondered how complicated it would be to use Excel as a charting engine.

Turns out, it’s not very complicated. The typical use case for generating charts in Excel is to first put data in a spreadsheet, and use the corresponding range as a source for a chart. However, it’s also perfectly possible to directly create a Chart object, and manipulate its SeriesCollection, adding and editing Series, which are arrays of XValues and Values.

As a starting point, I decided to focus on 2 problems:

  • plotting functions, in 2 and 3 dimensions,
  • producing scatterplots.

Both are rather painful to do in Excel itself – and scatterplots are the one chart I really care about when analyzing data, because it helps figuring out whether or not some variables are related.

What I wanted was a smooth experience from FSI – start typing code, and ship data to Excel, without having to worry about the joys of the Excel interop and its syntax. The video below shows what I ended up with, in action.

Note: watching me type is about as exciting as watching paint dry, so I sped up the video from its original 5 minutes down to 2 - otherwise there is no trick or editing.

This year’s blockbuster: plotting functions from F# to Excel

More...

Comments

Comment RSS