Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
26. May 2013 09:06

I got interested in the following question lately: given a data set of examples with some continuous-valued features and discrete classes, what’s a good way to reduce the continuous features into a set of discrete values?

What makes this question interesting? One very specific reason is that some machine learning algorithms, like Decision Trees, require discrete features. As a result, potentially informative data has to be discarded. For example, consider the Titanic dataset: we know the age of passengers of the Titanic, or how much they paid for their ticket. To use these features, we would need to reduce them to a set of states, like “Old/Young” or “Cheap/Medium/Expensive” – but how can we determine what states are appropriate, and what values separate them?

More generally, it’s easier to reason about a handful of cases than a continuous variable – and it’s also more convenient computationally to represent information as a finite set states.

So how could we go about identifying a reasonable way to partition a continuous variable into a handful of informative, representative states?

In the context of a classification problem, what we are interested in is whether the states provide information with respect to the Classes we are trying to recognize. As far as I can tell from my cursory review of what’s out there, the main approaches use either Chi-Square tests or Entropy to achieve that goal. I’ll leave aside Chi-Square based approaches for today, and look into the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani in 1993.

## The algorithm idea

The algorithm hinges on two key ideas:

• Data should be split into intervals that maximize the information, measured by Entropy,
• Partitioning should not be too fine-grained, to avoid over-fitting.

The first part is classic: given a data set, split in two halves, based on whether the continuous value is above or below the “splitting value”, and compute the gain in entropy. Out of all possibly splitting values, take the one that generates the best gain – and repeat in a recursive fashion.

Let’s illustrate on an artificial example – our output can take 2 values, Yes or No, and we have one continuous-valued feature:

 Continuous Feature Output Class 1.0 Yes 1.0 Yes 2.0 No 3.0 Yes 3.0 No

As is, the dataset has an Entropy of H = - 0.6 x Log (0.6) – 0.4 x Log (0.4) = 0.67 (5 examples, with 3/5 Yes, and 2/5 No).

The Continuous Feature takes 3 values: 1.0, 2.0 and 3.0, which leaves us with 2 possible splits: strictly less than 2, or strictly less than 3. Suppose we split on 2.0 – we would get 2 groups. Group 1 contains Examples where the Feature is less than 2:

 Continuous Feature Output Class 1.0 Yes 1.0 Yes

The Entropy of Group 1 is H(g1) = - 1.0 x Log(1.0) = 0.0

Group 2 contains the rest of the examples:

 Continuous Feature Output Class 2.0 No 3.0 Yes 3.0 No

The Entropy of Group 2 is H(g2) = - 0.33 x Log(0.33) – 0.66 x Log(0.66) = 0.63

Partitioning on 2.0 gives us a gain of H – 2/5 x H(g1) – 3/5 x H(g2) = 0.67 – 0.4 x 0.0 – 0.6 x 0.63 = 0.04. That split gives us additional information on the output, which seems intuitively correct, as one of the groups is now formed purely of “Yes”. In a similar fashion, we can compute the information gain of splitting around the other possible value, 3.0, which would give us a gain of 0.67 – 0.6 x 0.63 – 0.4 x 0.69 =  - 0.00: that split doesn’t improve information, so we would use the first split (or, if we had multiple splits with positive gain, we would take the split leading to the largest gain).

So why not just recursively apply that procedure, and split our dataset until we cannot achieve information gain by splitting further? The issue is that we might end up with an artificially fine-grained partition, over-fitting the data.

More...

5. August 2012 13:50

Today’s topic will be Chapter 3 of “Machine Learning in Action”, which covers Decision Trees.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

The idea behind Decision Trees is similar to the Game of 20 Questions: construct a set of discrete Choices to identify the Class of an item. We will use the following dataset for illustration: imagine that we have 5 cards, each with a major masterpiece of contemporary cinema, classified by genre. Now I hide one – and you can ask 2 questions about the genre of the movie to identify the Thespian luminary in the lead role, in as few questions as possible:

 Action Sci-Fi Actor Cliffhanger Yes No Stallone Rocky Yes No Stallone Twins No No Schwarzenegger Terminator Yes Yes Schwarzenegger Total Recall Yes Yes Schwarzenegger

The questions you would likely ask are:

• Is this a Sci-Fi movie? If yes, Arnold is the answer, if no,
• Is this an Action movie? if yes, go for Sylvester, otherwise Arnold it is.

That’s a Decision Tree in a nutshell: we traverse a Tree, asking about features, and depending on the answer, we draw a conclusion or recurse deeper into more questions. The goal today is to let the computer build the right tree from the dataset, and use that tree to classify “subjects”.

## Defining a Tree

Let’s start with the end – the Tree. A common and convenient way to model Trees in F# is to use a discriminated union like this:

type Tree =
| Conclusion of string
| Choice of string * (string * Tree) []

A Tree is composed of either a Conclusion, described by a string, or a Choice, which is described by a string, and an Array of multiple options, each described by a string and its own Tree, “tupled”.

For instance, we can manually create a tree for our example like this:

let manualTree =
Choice
("Sci-Fi",
[|("No",
Choice
("Action",
[|("Yes", Conclusion "Stallone");
("No", Conclusion "Schwarzenegger")|]));
("Yes", Conclusion "Schwarzenegger")|])

Our tree starts with a Choice, labeled “Sci-Fi”, with 2 options in an Array, “No” or “Yes”. “Yes” leads to a Conclusion (a Leaf node), Arnold, while “No” opens another Choice, “Action”, with 2 Conclusions.

So how can we use this to Classify a “Subject”? We need to traverse down the Tree, check what branch corresponds to the Subject for the current Choice, and continue until we reach a Decision node, at what point we can return the contents of the Conclusion. To that effect, we’ll represent a “Subject” (the thing we are trying to classify) as an collection of Tuples, each Tuple being a key/value pair, representing a Feature and its value:

let test = [| ("Action", "Yes"); ("Sci-Fi", "Yes") |]

We are ready to write a classification function now:

let rec classify subject tree =
match tree with
| Conclusion(c) -> c
| Choice(label, options) ->
let subjectState =
subject
|> Seq.find(fun (key, value) -> key = label)
|> snd
options
|> Array.find (fun (option, tree) -> option = subjectState)
|> snd
|> classify subject

classify is a recursive function: given a subject and a tree, if the Tree is a Conclusion, we are done, otherwise, we retrieve the label of the next Choice, find the value of the Subject for that Choice, and use it to pick the next level of the Tree.

At that point, using the Tree to classify our subject is as simple as:

> let actor = classify test manualTree;;

val actor : string = "Schwarzenegger"

Not bad for 14 lines of code. The most painful part is the manual construction of the Tree – let’s see if we can get the computer to build that for us.

More...

20. May 2012 15:10

During a recent Internet excursions, I ended up on the Infinite Monkey Theorem wiki page. The infinite monkey is a somewhat famous figure in probability; his fame comes from the following question: suppose you gave a monkey a typewriter, what’s the likelihood that, given enough time randomly typing, he would produce some noteworthy literary output, say, the complete works of Shakespeare?

Somewhat unrelatedly, this made me wonder about the following question: imagine that I had a noteworthy literary output and such a monkey – could I get my computer to distinguish these?

For the sake of experimentation, let’s say that our “tolerable page” is the following paragraph by Jorge Luis Borges:

Everything would be in its blind volumes. Everything: the detailed history of the future, Aeschylus' The Egyptians, the exact number of times that the waters of the Ganges have reflected the flight of a falcon, the secret and true nature of Rome, the encyclopedia Novalis would have constructed, my dreams and half-dreams at dawn on August 14, 1934, the proof of Pierre Fermat's theorem, the unwritten chapters of Edwin Drood, those same chapters translated into the language spoken by the Garamantes, the paradoxes Berkeley invented concerning Time but didn't publish, Urizen's books of iron, the premature epiphanies of Stephen Dedalus, which would be meaningless before a cycle of a thousand years, the Gnostic Gospel of Basilides, the song the sirens sang, the complete catalog of the Library, the proof of the inaccuracy of that catalog. Everything: but for every sensible line or accurate fact there would be millions of meaningless cacophonies, verbal farragoes, and babblings. Everything: but all the generations of mankind could pass before the dizzying shelves—shelves that obliterate the day and on which chaos lies—ever reward them with a tolerable page.

Assuming my imaginary typewriter-pounding monkey is typing each letter with equal likelihood, my first thought was that by comparison, a text written in English would have more structure and predictability – and we could use Entropy to measure that difference in structure.

Entropy is the expected information of a message; the general idea behind it is that a signal where every outcome is equally likely is unpredictable, and has a high entropy, whereas a message where certain outcomes are more frequent than others has more structure and lower entropy.

The formula for Entropy, lifted from Wikipedia, is given below; it corresponds to the average quantity of information of a message X, where X can take different values x:

For instance, a series of coin tosses with the proverbial fair coin would produce about as many heads and tails, and the entropy would come out as –0.5 x log2(0.5) – 0.5 x log2(0.5) = 1.0, whereas a totally unfair coin producing only heads would have an entropy of –1.0 x log2(1.0) – 0.0 = 0.0, a perfectly predictable signal.

How could I apply this to my problem?

First, we need a mechanical monkey. Given a sample text (our benchmark), we’ll extract its alphabet (all characters used), and create a virtual typewriter where each key corresponds to one of these characters. The monkey will then produce monkey literature, by producing a string as long as the original text, “typing” on the virtual keyboard randomly:

let monkey (text: string) =
let rng = new System.Random()
let alphabet = Seq.distinct text |> Seq.toArray
let alphabetLength = Array.length alphabet
let textLength = text.Length
[| for i in 1 .. textLength ->
alphabet.[rng.Next(0, alphabetLength)] |]

We store the Borges paragraph as:

let borges = "Everything would be in its blind volumes. (etc...)

… and we can now run the Monkey on the Borges paragraph,

> new string(monkey borges);;

which produces a wonderful output (results may vary – you could, after all, get a paragraph of great literary value):

ovfDG4,xUfETo4Sv1dbxkknthzB19Dgkphz3Tsa1L——w—w iEx-Nra mDs--k3Deoi—hFifskGGBBStU11-iiA3iU'S R9DnhzLForbkhbF:PbAUwP—ir-U4sF u w-tPf4LLuRGsDEP-ssTvvLk3NyaE f:krRUR-Gbx'zShb3wNteEfGwnuFbtsuS9Fw9lgthE1vL,tE4:Uk3UnfS FfxDbcLdpihBT,e'LvuaN4royz ,Aepbu'f1AlRgeRUSBDD.PwzhnA'y.s9:d,F'T3xvEbvNmy.vDtmwyPNotan3Esli' BTFbmywP.fgw9ytAouLAbAP':txFvGvBti Fg,4uEu.grk-rN,tEnvBs3uUo,:39altpBud3'-Aetp,T.chccE1yuDeUT,Pp,R994tnmidffcFonPbkSuw :pvde .grUUTfF1Flb4s cw'apkt GDdwadn-Phn4h.TGoPsyc'pcBEBb,kasl—aepdv,ctA TxrhRUgPAv-ro3s:aD z-FahLcvS3k':emSoz9NTNRDuus3PSpe-Upc9nSnhBovRfoEBDtANiGwvLikp4w—nPDAfknr—p'-:GnPEsULDrm'ub,3EyTmRoDkG9cERvcyxzPmPbD Fuit:lwtsmeUEieiPdnoFUlP'uSscy—Nob'st1:dA.RoLGyakGpfnT.zLb'hsBTo.mRRxNerBD9.wvFzg,,UAc,NSx.9ilLGFmkp—:FnpcpdP—-ScGSkmN9BUL1—uuUpBhpDnwS9NddLSiBLzawcbywiG—-E1DBlx—aN.D9u-ggrs3S4y4eFemo3Ba g'zeF:EsS-gTy-LFiUn3DvSzL3eww4NPLxT13isGb:—vBnLhy'yk1Rsip—res9t vmxftwvEcc::ezvPPziNGPylw:tPrluTl3E,T,vDcydn SyNSooaxaT llwNtwzwoDtoUcwlBdi',UrldaDFeFLk 3goos4unyzmFD9.vSTuuv4.wzbN.ynakoetb—ecTksm—-f,N:PtoNTne3EdErBrzfATPRreBv1:Rb.cfkELlengNkr'L1cA—lfAgU-vs9  Lic-m,kheU9kldUzTAriAg:bBUb'n—x'FL Adsn,kmar'p BE9akNr194gP,hdLrlgvbymp dplh9sPlNf'.'9

Does the entropy of these 2 strings differ? Let’s check.

let I p =
match p with
| 0.0 -> 0.0
| _ -> - System.Math.Log(p, 2.0)

let freq text letter =
let count =
Seq.fold (fun (total, count) l ->
if l = letter
then (total + 1.0, count + 1.0)
else (total + 1.0, count)) (0.0, 0.0) text
(letter, snd count / fst count)

let H text =
let alphabet = Seq.distinct text
Seq.map (fun l -> snd (freq text l)) alphabet
|> Seq.sumBy (fun p -> p * I(p))

I computes the self-information of a message of probability p, freq computes the frequency of a particular character within a string, and H, the entropy, proceeds by first extracting all the distinct characters present in the text into an “alphabet”, and then maps each character of the alphabet to its frequency and computes the expected self-information.

We have now all we need – let’s see the results:

> H borges;;
val it : float = 4.42083025
> H monkeyLit;;
val it : float = 5.565782825

Monkey lit has a higher entropy / disorder than Jorge Luis Borges’ output. This is reassuring.

How good of a test is this, though? In the end, what we measured with Entropy is that some letters were more likely to come up than others, which we would expect from a text written in English, where the letter “e” has a 12% probability to show up. However, if we gave our Monkey a Dvorak keyboard, he may fool our test; we could also create an uber Mechanical Monkey which generates a string based on the original text frequency:

let uberMonkey (text: string) =
let rng = new System.Random()
let alphabet = Seq.distinct text |> Seq.toArray
let textLength = text.Length
let freq = Array.map (fun l -> freq text l) alphabet
let rec index i p cumul =
let cumul = cumul + snd freq.[i]
if cumul >= p then i else index (i+1) p cumul
[| for i in 1 .. textLength ->
let p = rng.NextDouble()
alphabet.[index 0 p 0.0] |]

This somewhat ugly snippet computes the frequency of every letter in the original text, and returns random chars based on the frequency. The ugly part is the index function; given a probability p, it returns the index of the first char in the frequency array such that the cumulative probability of all chars up to that index is greater than p, which will return each index based on its frequency.

Running the uberMonkey produces another milestone of worldwide literature:

lk  aeew omauG dga rlhhfatywrrg   earErhsgeotnrtd utntcnd  o,  ntnnrcl gltnhtlu eAal yeh uro  it-lnfELiect eaurefe Busfehe h f1efeh hs eo.dhoc , rbnenotsno, e et tdiobronnaeonioumnroe  escr l hlvrds anoslpstr'thstrao lttamxeda iotoaeAB ai sfoF,hfiemnbe ieoobGrta dogdirisee nt.eba   t oisfgcrn  eehhfrE' oufepas Eroshhteep snodlahe sau  eoalymeonrt.ss.ehstwtee,ttudtmr ehtlni,rnre  ae h  e chp c crng Rdd  eucaetee gire dieeyGhr a4ELd  sr era tndtfe rsecltfu  t1tefetiweoroetasfl bnecdt'eetoruvmtl ii fi4fprBla Fpemaatnlerhig  oteriwnEaerebepnrsorotcigeotioR g  bolrnoexsbtuorsr si,nibbtcrlte uh ts ot  trotnee   se rgfTf  ibdr ne,UlA sirrr a,ersus simf bset  guecr s tir9tb e ntcenkwseerysorlddaaRcwer ton redts— nel ye oi leh v t go,amsPn 'e  areilynmfe ae  evr  lino t, s   a,a,ytinms   elt i :wpa s s hAEgocetduasmrlfaar  de cl,aeird fefsef E  s se hplcihf f  cerrn rnfvmrdpo ettvtu oeutnrk —toc anrhhne  apxbmaio hh  edhst, mfeveulekd. vrtltoietndnuphhgp rt1ocfplrthom b gmeerfmh tdnennletlie hshcy,,bff,na nfitgdtbyowsaooomg , hmtdfidha l aira chh olnnehehe acBeee  n  nrfhGh dn toclraimeovbca atesfgc:rt  eevuwdeoienpttdifgotebeegc ehms ontdec e,ttmae llwcdoh

… and, as expected, if we run our Entropy function on uberMonkeyLit, we get

> H uberMonkeyLit;;
val it : float = 4.385303632

This is pretty much what we got with the Borges original. The uberMonkey produced a paragraph just as organized as Borges, or so it seems.

Obviously, the raw Entropy calculation is not cutting it here. So what are we missing? The problem is that we are simply looking at the frequency of characters, which measures a certain form of order / predictability; however, there is “more” order than that in English. If I were to tell you that the 2 first characters of a text are “Th”, chances are, you would bet on “e” for the third position – because some sequences are much more likely than others, and “The” much more probable than “Thw”. The “raw” Entropy would consider the two following sequences “ABAABABB” and “ABABABAB” equally ordered (each contains 4 As and 4 Bs), whereas a human eye would consider that the second one, with its neat succession of A and Bs, may follow a pattern, where knowing the previous observations of the sequence conveys some information about what’s likely to show up next.

We’ll leave it at that for today, with an encouraging thought for those of you who may now worry that world literature could be threaten by offshore monkey typists. According to Wikipedia again,

In 2003, lecturers and students from the University of Plymouth MediaLab Arts course used a £2,000 grant from the Arts Council to study the literary output of real monkeys. They left a computer keyboard in the enclosure of six Celebes Crested Macaques in Paignton Zoo in Devon in England for a month, with a radio link to broadcast the results on a website.

Not only did the monkeys produce nothing but five pages consisting largely of the letter S, the lead male began by bashing the keyboard with a stone, and the monkeys continued by urinating and defecating on it. Phillips said that the artist-funded project was primarily performance art, and they had learned "an awful lot" from it.

£2,000 may seem a bit steep to watch monkeys defecating on typewriters; on the other hand, it seems that human writers can sleep quietly, without worrying about their jobs.