Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 10. February 2013 11:25

And the Journey converting “Machine Learning in Action” from Python to F# continues! Rather than following the order of the book, I decided to skip chapters 8 and 9, dedicated to regression methods (regression is something I spent a bit too much time doing in the past to be excited about it just right now), and go straight to Unsupervised Learning, which begins with the K-means clustering algorithm.

So what is clustering about?

In a nutshell, clustering focuses on the following question: given a set of observations, can the computer figure out a way to classify them into “meaningful groups”? The major difference with Classification methods is that in clustering, the Categories / Groups are initially unknown: it’s the algorithm’s job to figure out sensible ways to group items into Clusters, all by itself (hence the word “unsupervised”).

Chapter 10 covers 2 clustering algorithms, k-means , and bisecting k-means. We’ll discuss only the first one today.

The underlying idea behind the k-means algorithm is to identify k “representative archetypes” (k being a user input), the Centroids. The algorithm proceeds iteratively:

  • Starting from k random Centroids,
  • Observations are assigned to the closest Centroid, and constitute a Cluster,
  • Centroids are updated, by taking the average of their Cluster,
  • Until the allocation of Observation to Clusters doesn’t change any more.

When things go well, we end up with k stable Centroids (minimal modification of Centroids do not change the Clusters), and Clusters contain Observations that are similar, because they are all close to the same Centroid (The wikipedia page for the algorithm provides a nice graphical representation).

F# implementation

The Python implementation proposed in the book is both very procedural and deals with Observations that are vectors. I thought it would be interesting to take a different approach, focused on functions instead. The current implementation is likely to change when I get into bisecting k-means, but should remain similar in spirit. Note also that I have given no focus to performance – this is my take on the easiest thing that would work.

The entire code can be found here on GitHub.

Here is how I approached the problem. First, rather than restricting ourselves to vectors, suppose we want to deal with any generic type. Looking at the pseudo-code above, we need a few functions to implement the algorithm:

  • to assign Observations of type ‘a to the closest Centroid ‘a, we need a notion of Distance,
  • we need to create an initial collection of k Centroids of type ‘a, given a dataset of ‘as,
  • to update the Centroids based on a Cluster of ‘as, we need some aggregation function.

Let’s create these 3 functions:

    // the Distance between 2 observations 'a is a float
    // It also better be positive - left to the implementer
    type Distance<'a> = 'a -> 'a -> float
    // CentroidsFactory, given a dataset, 
    // should generate n Centroids
    type CentroidsFactory<'a> = 'a seq -> int -> 'a seq
    // Given a Centroid and observations in a Cluster,
    // create an updated Centroid
    type ToCentroid<'a> = 'a -> 'a seq -> 'a

We can now define a function which, given a set of Centroids, will return the index of the closest Centroid to an Observation, as well as the distance from the Centroid to the Observation:

    // Returns the index of and distance to the 
    // Centroid closest to observation
    let closest (dist: Distance<'a>) centroids (obs: 'a) =
        |> Seq.mapi (fun i c -> (i, dist c obs)) 
        |> Seq.minBy (fun (i, d) -> d)

Finally, we’ll go for the laziest possible way to generate k initial Centroids, by picking up k random observations from our dataset:

    // Picks k random observations as initial centroids
    // (this is very lazy, even tolerates duplicates)
    let randomCentroids<'a> (rng: System.Random) 
                            (sample: 'a seq) 
                            k =
        let size = Seq.length sample
        seq { for i in 1 .. k do 
              let pick = Seq.nth (rng.Next(size)) sample
              yield pick }



Comment RSS