Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
22. August 2012 13:35

In my recent post on Decision Tree Classifiers, I mentioned that I was too lazy to figure out how to visualize the Decision Tree “supporting” the classifier. Well, at times, the Internet can be an awesome place. Cesar Mendoza has forked the Machine Learning in Action GitHub project, and done a very fine job resolving that problem using the Microsoft Automatic Graph Layout library, and running it on the Lenses Dataset from the University of California, Irvine Machine Learning dataset repository.

Here is the result of the visualization, you can find his code here:

Unfortunately, as far as I can tell, the library is not open source, and requires a MSDN license. The amount of great stuff produced at Microsoft Research is amazing, it’s just too bad that at times licensing seems to get in the way of getting the word out…

18. August 2012 03:39

This is the continuation of my series exploring Machine Learning, converting the code samples of “Machine Learning in Action” from Python to F# as I go through the book. Today’s post covers Chapter 4, which is dedicated to Naïve Bayes classification – and you can find the resulting code on GitHub.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

## The idea behind the Algorithm

The canonical application of Bayes naïve classification is in text classification, where the goal is to identify to which pre-determined category a piece of text belongs to  – for instance, is this email I just received spam, or ham (“valuable” email)?

The underlying idea is to use individual words present in the text as indications for what category it is most likely to belong to, using Bayes Theorem, named after the cheerful-looking Reverend Bayes.

Imagine that you received an email containing the words “Nigeria”, “Prince”, “Diamonds” and “Money”. It is very likely that if you look into your spam folder, you’ll find quite a few emails containing these words, whereas, unless you are in the business of importing diamonds from Nigeria and have some aristocratic family, your “normal” emails would rarely contain these words. They have a much higher frequency within the category “Spam” than within the Ham, which makes them a potential flag for undesired business ventures.

On the other hand, let’s assume that you are a lucky person, and that typically, what you receive is Ham, with the occasional Spam bit. If you took a random email in your inbox, it is then much more likely that it belongs to the Ham category.

Bayes’ Theorem combines these two pieces of information together, to determine the probability that a particular email belongs to the “Spam” category, if it contains the word “Nigeria”:

P(is “Spam”|contains ”Nigeria”) = P(contains “Nigeria|is ”Spam”) x P(is “Spam”) / P(contains “Nigeria”)

In other words, 2 factors should be taken into account when deciding whether an email containing “Nigeria” is spam: how over-represented is that word in Spam, and how likely is it that any email is spammy in the first place?

The algorithm is named “Naïve”, because it makes a simplifying assumption about the text, which turns out to be very convenient for computations purposes, namely that each word appears with a frequency which doesn’t depend on other words. This is an unlikely assumption (the word “Diamond” is much more likely to be present in an email containing “Nigeria” than in your typical family-members discussion email).

We’ll leave it at that on the concepts –  I’ll refer the reader who want to dig deeper to the book, or to this explanation of text classification with Naïve Bayes.

## A simple F# implementation

For my first pass, I took a slightly different direction from the book, and decided to favor readability over performance. I assume that we are operating on a dataset organized as a sequence of text samples, each of them labeled by category, along these lines (example from the book “Machine Learning in Action”):

Note: the code presented here can be found found on GitHub

let dataset =
[| ("Ham",  "My dog has flea problems help please");
("Spam", "Maybe not take him to dog park stupid");
("Ham",  "My dalmatian is so cute I love him");
("Spam", "Stop posting stupid worthless garbage");
("Ham",  "Mr Licks ate my steak how to stop him");
("Spam", "Quit buying worthless dog food stupid") |]

We will need to do some word counting to compute frequencies, so let’s start with a few utility functions:

    open System
open System.Text.RegularExpressions

// Regular Expression matching full words, case insensitive.
let matchWords = new Regex(@"\w+", RegexOptions.IgnoreCase)

// Extract and count words from a string.
// http://stackoverflow.com/a/2159085/114519
let wordsCount text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.groupBy (fun m -> m.Value)
|> Seq.map (fun (value, groups) ->
value.ToLower(), (groups |> Seq.length))

// Extracts all words used in a string.
let vocabulary text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.map (fun m -> m.Value.ToLower())
|> Seq.distinct

// Extracts all words used in a dataset;
// a Dataset is a sequence of "samples",
// each sample has a label (the class), and text.
let extractWords dataset =
dataset
|> Seq.map (fun sample -> vocabulary (snd sample))
|> Seq.concat
|> Seq.distinct

// "Tokenize" the dataset: break each text sample
// into words and how many times they are used.
let prepare dataset =
dataset
|> Seq.map (fun (label, sample) -> (label, wordsCount sample))

We use a Regular Expression, \w+, to match all words, in a case-insensitive way. wordCount extracts individual words and the number of times they occur, while vocabulary simply returns the words encountered. The prepare function takes a complete dataset, and transforms each text sample into a Tuple containing the original classification label, and a Sequence of Tuples containing all lower-cased words found and their count.

More...

5. August 2012 13:50

Today’s topic will be Chapter 3 of “Machine Learning in Action”, which covers Decision Trees.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

The idea behind Decision Trees is similar to the Game of 20 Questions: construct a set of discrete Choices to identify the Class of an item. We will use the following dataset for illustration: imagine that we have 5 cards, each with a major masterpiece of contemporary cinema, classified by genre. Now I hide one – and you can ask 2 questions about the genre of the movie to identify the Thespian luminary in the lead role, in as few questions as possible:

 Action Sci-Fi Actor Cliffhanger Yes No Stallone Rocky Yes No Stallone Twins No No Schwarzenegger Terminator Yes Yes Schwarzenegger Total Recall Yes Yes Schwarzenegger

The questions you would likely ask are:

• Is this a Sci-Fi movie? If yes, Arnold is the answer, if no,
• Is this an Action movie? if yes, go for Sylvester, otherwise Arnold it is.

That’s a Decision Tree in a nutshell: we traverse a Tree, asking about features, and depending on the answer, we draw a conclusion or recurse deeper into more questions. The goal today is to let the computer build the right tree from the dataset, and use that tree to classify “subjects”.

## Defining a Tree

Let’s start with the end – the Tree. A common and convenient way to model Trees in F# is to use a discriminated union like this:

type Tree =
| Conclusion of string
| Choice of string * (string * Tree) []

A Tree is composed of either a Conclusion, described by a string, or a Choice, which is described by a string, and an Array of multiple options, each described by a string and its own Tree, “tupled”.

For instance, we can manually create a tree for our example like this:

let manualTree =
Choice
("Sci-Fi",
[|("No",
Choice
("Action",
[|("Yes", Conclusion "Stallone");
("No", Conclusion "Schwarzenegger")|]));
("Yes", Conclusion "Schwarzenegger")|])

Our tree starts with a Choice, labeled “Sci-Fi”, with 2 options in an Array, “No” or “Yes”. “Yes” leads to a Conclusion (a Leaf node), Arnold, while “No” opens another Choice, “Action”, with 2 Conclusions.

So how can we use this to Classify a “Subject”? We need to traverse down the Tree, check what branch corresponds to the Subject for the current Choice, and continue until we reach a Decision node, at what point we can return the contents of the Conclusion. To that effect, we’ll represent a “Subject” (the thing we are trying to classify) as an collection of Tuples, each Tuple being a key/value pair, representing a Feature and its value:

let test = [| ("Action", "Yes"); ("Sci-Fi", "Yes") |]

We are ready to write a classification function now:

let rec classify subject tree =
match tree with
| Conclusion(c) -> c
| Choice(label, options) ->
let subjectState =
subject
|> Seq.find(fun (key, value) -> key = label)
|> snd
options
|> Array.find (fun (option, tree) -> option = subjectState)
|> snd
|> classify subject

classify is a recursive function: given a subject and a tree, if the Tree is a Conclusion, we are done, otherwise, we retrieve the label of the next Choice, find the value of the Subject for that Choice, and use it to pick the next level of the Tree.

At that point, using the Tree to classify our subject is as simple as:

> let actor = classify test manualTree;;

val actor : string = "Schwarzenegger"

Not bad for 14 lines of code. The most painful part is the manual construction of the Tree – let’s see if we can get the computer to build that for us.

More...

1. August 2012 17:27

This is the continuation of this post, where I began exploring k-nearest-neighbor classification, a Machine Learning algorithm used to classify items in different categories, based on an existing sample of items that have been properly classified.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading “Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

To determine what category an item belongs to, the algorithm measures its distance from each element of the known dataset, and takes a “majority vote” to determine its category, based on its k nearest neighbors.

In our last installment, we wrote a simple classification function, which was doing just that. Today, we’ll continue along Chapter 2, applying our algorithm to real data, and dealing with data normalization.

## The dataset: Elections 2008

Rather than use the data sample from the book, I figured it would be more fun to create my own data set. The problem we’ll look into is whether we can predict whether a state voted Republican or Democrat in the 2008 presidential election. Our dataset will consist of the following: the Latitude and Longitude of the State, and its Population. A State is classified as Democrat if the number of votes (i.e. popular vote) recorded for Obama was greater than McCain.

Notes

• My initial goal was to do this for Cities, but I couldn’t find voting data at that level – so I had to settle for States. The resulting sample is smaller than I would have liked, and also less useful (we can’t realistically use our classifier to produce a prediction for a new State, because the likelihood of a new State joining the Union is fairly small) but it still works well for illustration purposes.
• Computing distance based on raw Latitude and Longitude wouldn’t be such a great idea in general, because they denote a position on a sphere (very close points may have very different Longitudes); however, given the actual layout of the United States, this will be good enough here.

I gathered the data (see sources at the end) and saved it in a comma-delimited text file “Election2008.txt” (the raw text version is available at the bottom of the post).

First, we need to open that file and parse it into the structure we used in our previous post, with a matrix of observations for each state, and a vector of categories (DEM or REP). That’s easy enough with F#:

let elections =
let file = @"C:\Users\Mathias\Desktop\Elections2008.txt"
let fileAsLines =
|> Array.map (fun line -> line.Split(','))
let dataset =
fileAsLines
|> Array.map (fun line ->
[| Convert.ToDouble(line.[1]);
Convert.ToDouble(line.[2]);
Convert.ToDouble(line.[3]) |])
let labels = fileAsLines |> Array.map (fun line -> line.[4])
dataset, labels

We open System and System.IO, and use File.ReadAllLines, which returns an array of strings for each line in the text file, and apply string.Split to each line, using the comma as a split-delimiter, which gives us an array of string for each text line. For each row, we then retrieve the dataset part, elements 1, 2 and 3 (the Latitude, Longitude and Population), creating an Array of doubles by converting the strings to doubles. Finally, we retrieve the fourth element of each row, the vote, into an array of labels, and return dataset and labels as a tuple.

Let’s visualize the result, using the FSharpChart display function we wrote last time. display dataset 1 0 plots Longitude on the X axis, and Latitude on the Y axis, producing a somewhat unusual map of the United States, where we can see the red states / blue states pattern (colored differently), with center states leaning Republican, and Coastal states Democrat:

Plotting Longitude against Population produces the following chart, which also displays some clusters:

Finally, Longitude versus Population also exhibits some patterns:

## Normalizing the data

We could run the algorithm we wrote in the previous post on the data, and measure the distances between observations based on the raw measures, but this would likely produce poor results, because of the discrepancy in scales. Our Latitudes range from about 20 (Hawaii) to 60 (Alaska), while Populations vary from half a million (Washington DC) to over 30 millions (California). Because we compute the distance between observations using the Euclidean distance, differences in Population will have a huge weight in the overall distance, and in effect we would be “ignoring” Latitude and Longitude in our classification.

Consider this: the raw distance between 2 states is given by

Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + (Lat(S1) – Lat(S2))^2 + (Lon(S1) – Lon(S2))^2)

Even if we considered 2 states located as far as possible from each other, at the 2 corners of our map, the distance would look like:

Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + 40^2 + 90^2)

It’s clear that even minimal differences in population in this formula (say, 100,000 inhabitants) will completely dwarf the largest possible effect of geographic distance. If we want to observe the effect of all three dimensions, we need to convert the measurements to a comparable scale, so that differences in Population are comparable, relatively, to differences in Location.

To that effect, we will Normalize our dataset. We’ll follow the example of “Machine Learning in Action”, and normalize each measurement so that its minimal value is 0, and its maximum is 1. Other approaches would be feasible, but this one has the benefit of being fairly straightforward. If we consider Population for instance, what we need to do is:

• Retrieve all the Populations in our Dataset
• Retrieve the minimum and the maximum
• Transform the Population to (Population – minimum) / (maximum – minimum)

Here is what I came up with:

let column (dataset: float [][]) i =
dataset |> Array.map (fun row -> row.[i])

let columns (dataset: float [][]) =
let cols = dataset.[0] |> Array.length
[| for i in 0 .. (cols - 1) -> column dataset i |]

let minMax dataset =
dataset
|> columns
|> Array.map (fun col -> Array.min(col), Array.max(col))

let minMaxNormalizer dataset =
let bounds = minMax dataset
fun (vector: float[]) ->
Array.mapi (fun i v ->
(vector.[i] - fst v) / (snd v - fst v)) bounds

Compared to the Python example, I have to pay a bit of a code tax here, because I chose to use plain F# arrays to model my dataset, instead of using matrices. The column function extracts column i from a dataset, by mapping each row (an observation) to its ith component. columns expands on it, and essentially transposes the matrix, converting it from an array of row vectors to an array of column vectors.

More...