This is the continuation of this post, where I began exploring k-nearest-neighbor classification, a Machine Learning algorithm used to classify items in different categories, based on an existing sample of items that have been properly classified.
Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading “Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.
To determine what category an item belongs to, the algorithm measures its distance from each element of the known dataset, and takes a “majority vote” to determine its category, based on its k nearest neighbors.
In our last installment, we wrote a simple classification function, which was doing just that. Today, we’ll continue along Chapter 2, applying our algorithm to real data, and dealing with data normalization.
The dataset: Elections 2008
Rather than use the data sample from the book, I figured it would be more fun to create my own data set. The problem we’ll look into is whether we can predict whether a state voted Republican or Democrat in the 2008 presidential election. Our dataset will consist of the following: the Latitude and Longitude of the State, and its Population. A State is classified as Democrat if the number of votes (i.e. popular vote) recorded for Obama was greater than McCain.
- My initial goal was to do this for Cities, but I couldn’t find voting data at that level – so I had to settle for States. The resulting sample is smaller than I would have liked, and also less useful (we can’t realistically use our classifier to produce a prediction for a new State, because the likelihood of a new State joining the Union is fairly small) but it still works well for illustration purposes.
- Computing distance based on raw Latitude and Longitude wouldn’t be such a great idea in general, because they denote a position on a sphere (very close points may have very different Longitudes); however, given the actual layout of the United States, this will be good enough here.
I gathered the data (see sources at the end) and saved it in a comma-delimited text file “Election2008.txt” (the raw text version is available at the bottom of the post).
First, we need to open that file and parse it into the structure we used in our previous post, with a matrix of observations for each state, and a vector of categories (DEM or REP). That’s easy enough with F#:
let elections =
let file = @"C:\Users\Mathias\Desktop\Elections2008.txt"
let fileAsLines =
|> Array.map (fun line -> line.Split(','))
let dataset =
|> Array.map (fun line ->
let labels = fileAsLines |> Array.map (fun line -> line.)
We open System and System.IO, and use File.ReadAllLines, which returns an array of strings for each line in the text file, and apply string.Split to each line, using the comma as a split-delimiter, which gives us an array of string for each text line. For each row, we then retrieve the dataset part, elements 1, 2 and 3 (the Latitude, Longitude and Population), creating an Array of doubles by converting the strings to doubles. Finally, we retrieve the fourth element of each row, the vote, into an array of labels, and return dataset and labels as a tuple.
Let’s visualize the result, using the FSharpChart display function we wrote last time. display dataset 1 0 plots Longitude on the X axis, and Latitude on the Y axis, producing a somewhat unusual map of the United States, where we can see the red states / blue states pattern (colored differently), with center states leaning Republican, and Coastal states Democrat:
Plotting Longitude against Population produces the following chart, which also displays some clusters:
Finally, Longitude versus Population also exhibits some patterns:
Normalizing the data
We could run the algorithm we wrote in the previous post on the data, and measure the distances between observations based on the raw measures, but this would likely produce poor results, because of the discrepancy in scales. Our Latitudes range from about 20 (Hawaii) to 60 (Alaska), while Populations vary from half a million (Washington DC) to over 30 millions (California). Because we compute the distance between observations using the Euclidean distance, differences in Population will have a huge weight in the overall distance, and in effect we would be “ignoring” Latitude and Longitude in our classification.
Consider this: the raw distance between 2 states is given by
Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + (Lat(S1) – Lat(S2))^2 + (Lon(S1) – Lon(S2))^2)
Even if we considered 2 states located as far as possible from each other, at the 2 corners of our map, the distance would look like:
Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + 40^2 + 90^2)
It’s clear that even minimal differences in population in this formula (say, 100,000 inhabitants) will completely dwarf the largest possible effect of geographic distance. If we want to observe the effect of all three dimensions, we need to convert the measurements to a comparable scale, so that differences in Population are comparable, relatively, to differences in Location.
To that effect, we will Normalize our dataset. We’ll follow the example of “Machine Learning in Action”, and normalize each measurement so that its minimal value is 0, and its maximum is 1. Other approaches would be feasible, but this one has the benefit of being fairly straightforward. If we consider Population for instance, what we need to do is:
- Retrieve all the Populations in our Dataset
- Retrieve the minimum and the maximum
- Transform the Population to (Population – minimum) / (maximum – minimum)
Here is what I came up with:
let column (dataset: float ) i =
dataset |> Array.map (fun row -> row.[i])
let columns (dataset: float ) =
let cols = dataset. |> Array.length
[| for i in 0 .. (cols - 1) -> column dataset i |]
let minMax dataset =
|> Array.map (fun col -> Array.min(col), Array.max(col))
let minMaxNormalizer dataset =
let bounds = minMax dataset
fun (vector: float) ->
Array.mapi (fun i v ->
(vector.[i] - fst v) / (snd v - fst v)) bounds
Compared to the Python example, I have to pay a bit of a code tax here, because I chose to use plain F# arrays to model my dataset, instead of using matrices. The column function extracts column i from a dataset, by mapping each row (an observation) to its ith component. columns expands on it, and essentially transposes the matrix, converting it from an array of row vectors to an array of column vectors.