After four weeks of vacations, I am back home, ready to continue my series of posts converting the samples from Machine Learning in Action from Python to F#.
Today’s post covers Chapter 5 of the book, dedicated to Logistic Regression. Logistic Regression is another classification method. It uses numeric data to determine how to separate observations into two classes, identified by 0 or 1.
The entire code presented in this post can be found on GitHub, commit 398677f
The idea behind the algorithm
The main idea behind the algorithm is to find a function which, using the numeric data that describe an individual observation as input, will return a number between 0 and 1. Ideally, that function will return a number close to respectively 0 or 1 for observations belonging to group 0 or 1.
To achieve that result, the algorithm relies on the Sigmoid function, f(x) = 1 / (1 + exp(-x)) .
For any input value, the Sigmoid function returns a value in ] 0 ; 1 [. A positive value will return a value greater than 0.5, and the greater the input value, the closer to 1. One could think of the function as returning a probability: for very high or low values of x, there is a high certainty that it belongs to one of the two groups, and for values close to zero, the probability of each group is 50% / 50%.
The only thing needed then is a transformation taking the numeric values describing the observations from the dataset, and mapping them to a single value, such that applying the Sigmoid function to it produces results close to the group the observation belongs to. The most straightforward way to achieve this is to apply a linear combination: an observation with numeric values [ x1; x2; … xk ] will be converted into w0 + w1 x x1 + w2 x x2 … + wk x xk, by applying weights [ w0; w1; … wk ] to each of the components of the observation. Note how the weights have one extra element w0, which is used for a constant term.
If our observations had two components X and Y, each observation can be represented as a point (X, Y) in the plane, and what we are looking for is a straight line w0 + w1 x X + w2 x Y, such that every observation of group 0 is on one side of the line, and every observation of group 1 on the other side.
We now replaced one problem by another – how can we find a suitable set of weights W?
I won’t even attempt a full explanation of the approach, and will stick to fuzzy, high-level intuition. Basically, the algorithm starts with an arbitrary set of weights, and iteratively adjusts the weights, by comparing the results of the function and what it should be (the actual group), and adjusting them to reduce error.
Note: I’ll skip the Gradient Ascent method, and go straight to the second part of Chapter 5, which covers Stochastic Gradient Ascent, because the code is both easier to understand and more suitable to large datasets. On the other hand, the deterministic gradient ascent approach is probably clearer for the math inclined. If that’s your situation, you might be interested in this MSDN Magazine article, which presents a C# implementation of the Logistic Regression.
Let’s illustrate the update procedure, on an ultra-simplified example, where we have a single weight W. In that case, the predicted value for an observation which has value X will be sigmoid (W x X) , and the algorithm adjustment is given by the following formula:
W <- W + alpha x (Label – sigmoid (W x X))
where Label is the group the observation belongs to (0 or 1), and alpha is a user-defined parameter, between 0 and 1. In other words, W is updated based on the error, Label – sigmoid (W x X) . First, obviously, if there is no error, W will remain unchanged, there is nothing to adjust. Let’s consider the case where Label is 1, and both X and W are positive. In that case, Label – sigmoid (W x X) will be positive (between 0 and 1), and W will be increased. As W increases, the sigmoid becomes closer to 1, and the adjustments become progressively smaller. Similarly, considering all the cases for W and X (positive and negative), one can verify that W will be adjusted in a direction which reduces the classification error. Alpha can be described as “how aggressive” the adjustment should be – the closer to 1, the more W will be updated.
That’s the gist of the algorithm – the full-blown deterministic gradient algorithm proceeds to update the weights by considering the error on the entire dataset at once, which makes it more expensive, whereas the stochastic gradient approach updates the weights sequentially, taking the dataset observations one by one, which makes it convenient for larger datasets.
Enough talk – let’s jump into code, with a straightforward implementation first. We create a module “LogisticRegression”, and begin with building the function which predicts the class of an observation, given weights:
module LogisticRegression =
let sigmoid x = 1.0 / (1.0 + exp -x)
// Vector dot product
let dot (vec1: float list)
(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e * snd e)
// Vector addition
let add (vec1: float list)
(vec2: float list) =
List.zip vec1 vec2
|> List.map (fun e -> fst e + snd e)
// Vector scalar product
let scalar alpha (vector: float list) =
List.map (fun e -> alpha * e) vector
// Weights have 1 element more than observations, for constant
let predict (weights: float list)
(obs: float list) =
1.0 :: obs
|> dot weights