Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
29. December 2012 17:23

This post continues my journey converting the Python samples from Machine Learning in Action into F#. On the program today: chapter 7, dedicated to AdaBoost. This is also the last chapter revolving around classification. After almost 6 months spending my week-ends on classifiers, I am rather glad to change gears a bit!

## The idea behind the algorithm

Algorithm outline

AdaBoost is short for “Adaptative Boosting”. Boosting is based on a very common-sense idea: instead of trying to find one perfect classifier that fits the dataset, the algorithm will train a sequence of classifiers, and, at each step, will analyze the latest classifiers’ results, and focus the next training round on reducing classification mistakes, by giving a bigger weight to the misclassified observations. In other words, “get better by working on your weaknesses”.

The second idea in AdaBoost, which I found very interesting and somewhat counter-intuitive, is that multiple poor classification models taken together can constitute a highly reliable source. Rather than discarding previous classifiers, AdaBoost combines them all into a meta-classifier. AdaBoost computes a weight Alpha for each of the “weak classifiers”, based on the proportion of examples properly classified, and classifies observations by taking a majority vote among the weak classifiers, weighted by their Alpha coefficients. In other words, “decide based on all sources of information, but take into account how reliable each source is”.

In pseudo-code, the algorithm looks like this:

Given examples = observations + labels,

Until overall quality is good enough or iteration limit reached,

• From the available weak classifiers,
• Pick the classifier with the lowest weighted prediction error,
• Compute its Alpha weight based on prediction quality,
• Update weights assigned to each example, based on Alpha and whether example was properly classified or not

The weights update mechanism

Let’s dive into the update mechanism for both the training example weights and the weak classifiers Alpha weights. Suppose that we have

• a training set with 4 examples & their label [ (E1, 1); (E2, –1); (E3, 1); (E4, –1) ],
• currently weighted [ 20%; 20%; 30%; 30% ], (note: example weights must sum to 100%)
• f is the best weak classifier selected.

If we apply a weak classifier f to the training set, we can check what examples are mis-classified, and compute the weighted error, i.e. the weighted proportion of mis-classifications:

 Example Label Weight f(E) f is… weighted error E1 1 0.2 1 correct 0.0 E2 -1 0.2 1 incorrect 0.2 E3 1 0.3 1 correct 0.0 E4 -1 0.3 -1 correct 0.0 0.2

This gives us a weighted error rate of 20% for f, given the weights.

The weight given to f in the final classifier is given by

Alpha = 0.5 x ln ((1 - error) / error)

Here is how Alpha looks, plotted as a function of the proportion correctly classified (i.e. 1 – error):

If 50% of the examples are properly classified, the classifier is totally random, and gets a weight of 0 – its output is ignored. Higher quality models get higher weights – and models with high level of misclassification get a strong negative weight. This is interesting; in essence, this treats them as a great negative source of information: if you know that I am always wrong, my answers are still highly informative – you just need to flip the answer…

More...

26. December 2012 11:50

This is the continuation of my series converting the samples found in Machine Learning in Action from Python to F#. After starting on a nice and steady pace, I hit a speed bump with Chapter 6, dedicated to the Support Vector Machine algorithm. The math is more involved than the previous algorithms, and the original Python implementation is very procedural , which both slowed down the conversion to a more functional style.

Anyways, I am now at a good point to share progress. The current version uses Sequential Minimization Optimization to train the classifier, and supports Kernels. Judging from my experiments, the algorithm works - what is missing at that point is some performance optimization.

I’ll talk first about the code changes from the “naïve SVM” version previously discussed, and then we’ll illustrate the algorithm in action, recognizing hand-written digits.

## Main changes from previous version

From a functionality standpoint, the 2 main changes from the previous post are the replacement of the hard-coded vector dot product by arbitrary Kernel functions, and the modification of the algorithm from a naïve loop to the SMO double-loop, pivoting on observations based on their prediction error.

You can browse the current version of the SVM algorithm on GitHub here.

Injecting arbitrary Kernels

The code I presented last time relied on vector dot-product to partition linearly separable datasets. The obvious issue is that not all datasets are linearly separable. Fortunately, with minimal change, the SVM algorithm can be used to handle more complex situations, using what’s known as the “Kernel Trick”. In essence, instead of working on the original data, we transform our data in a new space where it is linearly separable:

[ from http://zutopedia.com/udi/?p=40, via Cesar Souza’s blog ]

More...

1. December 2012 13:55

I have been obsessing about the following idea lately – what if I could run a FSI session from within Excel? The motivation behind this is double. First, one thing Excel is good at is creating and formatting charts. If I could use F# for data manipulation, and Excel for data visualization, I would be a happy camper. Then, I think F# via FSI could provide an interesting alternative for Excel automation. I’d much rather leverage existing .NET libraries to, say, grab data from the internet, than write some VBA to do that – and the ability to write live code in FSI would be less heavy handed that VSTO automation, and closer to what people typically do in Excel, that is, explore data. Having the ability to execute F# scripts would be, at least for me, very useful.

Seeing Tim Robinson’s awesome job with FsNotebook.net kicked me out of procrastination. Even though FsNotebook is still in early development, it provides a very nice user experience – on the web. If something that nice can be done on the web, it should be feasible on a local machine.

As an aside, Tim is looking for feedback and input on FsNotebook – go try it out, it’s really fun:

Anyways – this is the grand plan, now we need to start with baby steps. If I want to embed FSI in Excel (presumably via a VSTO add-in), I need a way to talk to FSI from .NET, so that I can create a Session and send arbitrary strings of code to be evaluated.

As usual, StackOverflow provided two good starting points (this answer, and this answer) – so I set out to look into the Process class, which I didn’t know much about, and attempted to spawn a FSI.EXE process, redirecting input and output. Turns out it’s not overly complicated – here are the 34 lines of code I ended up with so far (see it on GitHub):

namespace ClearLines.FsiRunner

open System.Diagnostics

type public FsiSession(fsiPath: string) =

let info = new ProcessStartInfo()
let fsiProcess = new Process()

do
info.RedirectStandardInput <- true
info.RedirectStandardOutput <- true
info.UseShellExecute <- false
info.CreateNoWindow <- true
info.FileName <- fsiPath

fsiProcess.StartInfo <- info

[<CLIEvent>]

[<CLIEvent>]

member this.Start() =
fsiProcess.Start()