Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 20. May 2012 15:10

During a recent Internet excursions, I ended up on the Infinite Monkey Theorem wiki page. The infinite monkey is a somewhat famous figure in probability; his fame comes from the following question: suppose you gave a monkey a typewriter, what’s the likelihood that, given enough time randomly typing, he would produce some noteworthy literary output, say, the complete works of Shakespeare?

Somewhat unrelatedly, this made me wonder about the following question: imagine that I had a noteworthy literary output and such a monkey – could I get my computer to distinguish these?

For the sake of experimentation, let’s say that our “tolerable page” is the following paragraph by Jorge Luis Borges:

Everything would be in its blind volumes. Everything: the detailed history of the future, Aeschylus' The Egyptians, the exact number of times that the waters of the Ganges have reflected the flight of a falcon, the secret and true nature of Rome, the encyclopedia Novalis would have constructed, my dreams and half-dreams at dawn on August 14, 1934, the proof of Pierre Fermat's theorem, the unwritten chapters of Edwin Drood, those same chapters translated into the language spoken by the Garamantes, the paradoxes Berkeley invented concerning Time but didn't publish, Urizen's books of iron, the premature epiphanies of Stephen Dedalus, which would be meaningless before a cycle of a thousand years, the Gnostic Gospel of Basilides, the song the sirens sang, the complete catalog of the Library, the proof of the inaccuracy of that catalog. Everything: but for every sensible line or accurate fact there would be millions of meaningless cacophonies, verbal farragoes, and babblings. Everything: but all the generations of mankind could pass before the dizzying shelves—shelves that obliterate the day and on which chaos lies—ever reward them with a tolerable page.

Assuming my imaginary typewriter-pounding monkey is typing each letter with equal likelihood, my first thought was that by comparison, a text written in English would have more structure and predictability – and we could use Entropy to measure that difference in structure.

Entropy is the expected information of a message; the general idea behind it is that a signal where every outcome is equally likely is unpredictable, and has a high entropy, whereas a message where certain outcomes are more frequent than others has more structure and lower entropy.

The formula for Entropy, lifted from Wikipedia, is given below; it corresponds to the average quantity of information of a message X, where X can take different values x:

 H(X) = \mathbb{E}_{X} [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x).

For instance, a series of coin tosses with the proverbial fair coin would produce about as many heads and tails, and the entropy would come out as –0.5 x log2(0.5) – 0.5 x log2(0.5) = 1.0, whereas a totally unfair coin producing only heads would have an entropy of –1.0 x log2(1.0) – 0.0 = 0.0, a perfectly predictable signal.

How could I apply this to my problem?

First, we need a mechanical monkey. Given a sample text (our benchmark), we’ll extract its alphabet (all characters used), and create a virtual typewriter where each key corresponds to one of these characters. The monkey will then produce monkey literature, by producing a string as long as the original text, “typing” on the virtual keyboard randomly:

let monkey (text: string) =
   let rng = new System.Random()
   let alphabet = Seq.distinct text |> Seq.toArray
   let alphabetLength = Array.length alphabet
   let textLength = text.Length
   [| for i in 1 .. textLength -> 
      alphabet.[rng.Next(0, alphabetLength)] |]

We store the Borges paragraph as:

let borges = "Everything would be in its blind volumes. (etc...)

… and we can now run the Monkey on the Borges paragraph,

> new string(monkey borges);;

which produces a wonderful output (results may vary – you could, after all, get a paragraph of great literary value):

ovfDG4,xUfETo4Sv1dbxkknthzB19Dgkphz3Tsa1L——w—w iEx-Nra mDs--k3Deoi—hFifskGGBBStU11-iiA3iU'S R9DnhzLForbkhbF:PbAUwP—ir-U4sF u w-tPf4LLuRGsDEP-ssTvvLk3NyaE f:krRUR-Gbx'zShb3wNteEfGwnuFbtsuS9Fw9lgthE1vL,tE4:Uk3UnfS FfxDbcLdpihBT,e'LvuaN4royz ,Aepbu'f1AlRgeRUSBDD.PwzhnA'y.s9:d,F'T3xvEbvNmy.vDtmwyPNotan3Esli' BTFbmywP.fgw9ytAouLAbAP':txFvGvBti Fg,4uEu.grk-rN,tEnvBs3uUo,:39altpBud3'-Aetp,T.chccE1yuDeUT,Pp,R994tnmidffcFonPbkSuw :pvde .grUUTfF1Flb4s cw'apkt GDdwadn-Phn4h.TGoPsyc'pcBEBb,kasl—aepdv,ctA TxrhRUgPAv-ro3s:aD z-FahLcvS3k':emSoz9NTNRDuus3PSpe-Upc9nSnhBovRfoEBDtANiGwvLikp4w—nPDAfknr—p'-:GnPEsULDrm'ub,3EyTmRoDkG9cERvcyxzPmPbD Fuit:lwtsmeUEieiPdnoFUlP'uSscy—Nob'st1:dA.RoLGyakGpfnT.zLb'hsBTo.mRRxNerBD9.wvFzg,,UAc,NSx.9ilLGFmkp—:FnpcpdP—-ScGSkmN9BUL1—uuUpBhpDnwS9NddLSiBLzawcbywiG—-E1DBlx—aN.D9u-ggrs3S4y4eFemo3Ba g'zeF:EsS-gTy-LFiUn3DvSzL3eww4NPLxT13isGb:—vBnLhy'yk1Rsip—res9t vmxftwvEcc::ezvPPziNGPylw:tPrluTl3E,T,vDcydn SyNSooaxaT llwNtwzwoDtoUcwlBdi',UrldaDFeFLk 3goos4unyzmFD9.vSTuuv4.wzbN.ynakoetb—ecTksm—-f,N:PtoNTne3EdErBrzfATPRreBv1:Rb.cfkELlengNkr'L1cA—lfAgU-vs9  Lic-m,kheU9kldUzTAriAg:bBUb'n—x'FL Adsn,kmar'p BE9akNr194gP,hdLrlgvbymp dplh9sPlNf'.'9

Does the entropy of these 2 strings differ? Let’s check.

let I p =
   match p with
   | 0.0 -> 0.0
   | _ -> - System.Math.Log(p, 2.0)

let freq text letter =
   let count =
      Seq.fold (fun (total, count) l -> 
         if l = letter
         then (total + 1.0, count + 1.0)
         else (total + 1.0, count)) (0.0, 0.0) text
   (letter, snd count / fst count)

let H text =
   let alphabet = Seq.distinct text
   Seq.map (fun l -> snd (freq text l)) alphabet
   |> Seq.sumBy (fun p -> p * I(p))

I computes the self-information of a message of probability p, freq computes the frequency of a particular character within a string, and H, the entropy, proceeds by first extracting all the distinct characters present in the text into an “alphabet”, and then maps each character of the alphabet to its frequency and computes the expected self-information.

We have now all we need – let’s see the results:

> H borges;;
val it : float = 4.42083025
> H monkeyLit;;
val it : float = 5.565782825

Monkey lit has a higher entropy / disorder than Jorge Luis Borges’ output. This is reassuring.

How good of a test is this, though? In the end, what we measured with Entropy is that some letters were more likely to come up than others, which we would expect from a text written in English, where the letter “e” has a 12% probability to show up. However, if we gave our Monkey a Dvorak keyboard, he may fool our test; we could also create an uber Mechanical Monkey which generates a string based on the original text frequency:

let uberMonkey (text: string) =
   let rng = new System.Random()
   let alphabet = Seq.distinct text |> Seq.toArray
   let textLength = text.Length
   let freq = Array.map (fun l -> freq text l) alphabet
   let rec index i p cumul =
      let cumul = cumul + snd freq.[i]
      if cumul >= p then i else index (i+1) p cumul
   [| for i in 1 .. textLength -> 
      let p = rng.NextDouble()
      alphabet.[index 0 p 0.0] |]

This somewhat ugly snippet computes the frequency of every letter in the original text, and returns random chars based on the frequency. The ugly part is the index function; given a probability p, it returns the index of the first char in the frequency array such that the cumulative probability of all chars up to that index is greater than p, which will return each index based on its frequency.

Running the uberMonkey produces another milestone of worldwide literature:

lk  aeew omauG dga rlhhfatywrrg   earErhsgeotnrtd utntcnd  o,  ntnnrcl gltnhtlu eAal yeh uro  it-lnfELiect eaurefe Busfehe h f1efeh hs eo.dhoc , rbnenotsno, e et tdiobronnaeonioumnroe  escr l hlvrds anoslpstr'thstrao lttamxeda iotoaeAB ai sfoF,hfiemnbe ieoobGrta dogdirisee nt.eba   t oisfgcrn  eehhfrE' oufepas Eroshhteep snodlahe sau  eoalymeonrt.ss.ehstwtee,ttudtmr ehtlni,rnre  ae h  e chp c crng Rdd  eucaetee gire dieeyGhr a4ELd  sr era tndtfe rsecltfu  t1tefetiweoroetasfl bnecdt'eetoruvmtl ii fi4fprBla Fpemaatnlerhig  oteriwnEaerebepnrsorotcigeotioR g  bolrnoexsbtuorsr si,nibbtcrlte uh ts ot  trotnee   se rgfTf  ibdr ne,UlA sirrr a,ersus simf bset  guecr s tir9tb e ntcenkwseerysorlddaaRcwer ton redts— nel ye oi leh v t go,amsPn 'e  areilynmfe ae  evr  lino t, s   a,a,ytinms   elt i :wpa s s hAEgocetduasmrlfaar  de cl,aeird fefsef E  s se hplcihf f  cerrn rnfvmrdpo ettvtu oeutnrk —toc anrhhne  apxbmaio hh  edhst, mfeveulekd. vrtltoietndnuphhgp rt1ocfplrthom b gmeerfmh tdnennletlie hshcy,,bff,na nfitgdtbyowsaooomg , hmtdfidha l aira chh olnnehehe acBeee  n  nrfhGh dn toclraimeovbca atesfgc:rt  eevuwdeoienpttdifgotebeegc ehms ontdec e,ttmae llwcdoh

… and, as expected, if we run our Entropy function on uberMonkeyLit, we get

> H uberMonkeyLit;;
val it : float = 4.385303632

This is pretty much what we got with the Borges original. The uberMonkey produced a paragraph just as organized as Borges, or so it seems.

Obviously, the raw Entropy calculation is not cutting it here. So what are we missing? The problem is that we are simply looking at the frequency of characters, which measures a certain form of order / predictability; however, there is “more” order than that in English. If I were to tell you that the 2 first characters of a text are “Th”, chances are, you would bet on “e” for the third position – because some sequences are much more likely than others, and “The” much more probable than “Thw”. The “raw” Entropy would consider the two following sequences “ABAABABB” and “ABABABAB” equally ordered (each contains 4 As and 4 Bs), whereas a human eye would consider that the second one, with its neat succession of A and Bs, may follow a pattern, where knowing the previous observations of the sequence conveys some information about what’s likely to show up next.

We’ll leave it at that for today, with an encouraging thought for those of you who may now worry that world literature could be threaten by offshore monkey typists. According to Wikipedia again,

In 2003, lecturers and students from the University of Plymouth MediaLab Arts course used a £2,000 grant from the Arts Council to study the literary output of real monkeys. They left a computer keyboard in the enclosure of six Celebes Crested Macaques in Paignton Zoo in Devon in England for a month, with a radio link to broadcast the results on a website.

Not only did the monkeys produce nothing but five pages consisting largely of the letter S, the lead male began by bashing the keyboard with a stone, and the monkeys continued by urinating and defecating on it. Phillips said that the artist-funded project was primarily performance art, and they had learned "an awful lot" from it.

£2,000 may seem a bit steep to watch monkeys defecating on typewriters; on the other hand, it seems that human writers can sleep quietly, without worrying about their jobs.

Comments

5/24/2012 7:38:36 AM #

Jack Fox

Nearly all the literature on entropy in information theory falls immediately into using log base 2 in the standard equation, with little or no explanation. "Foundations of Statistical Natural Language Processing" 2.21 says "...using any other base yields only a linear scaling of results". Yet Shannon (1948) mentions at least twice that base 2 is to be used when measuring bits. "If the base 10 is used the units may be called decimal digits." (p. 1) He doesn't explicitly say "use the log to the base of the number of characters in the alphabet", but I think that is what he meant. Also the article http://en.wikipedia.org/wiki/Dit_(information) suggests this (also without explicitly saying it).  Almost all the interest in information theory revolves around bit strings and streams, so my quick survey found little information on the topic of alternate log bases (perhaps there is more research in the area of code breaking).

So I started tweaking your original experiment and found some interesting results. Applying a base 48 log to the strings "borges" and "monkeyLit" both resulted in entropy of 0.9965658363. This was very frustrating! Perhaps the explanation is the "linear scaling" scaled the difference to where it does not even appear in 10 significant digits.

Next I redefined the originating alphabet, after all Borges did not carefully craft the excerpt you chose from an alphabet of 48 characters (assuming the fiction he wrote in English), but lets say he had the full 95 characters on my keyboard. So I generated "newMonkeyLit" (same length as "borges", of course) from the new alphabet, and generated the entropy for both strings using log base 95:

borges: 0.8491429365
newMonkeyLit:  0.9946488358

So now, just like in your analysis of the fair coin, we appear to have an entropy scale in which perfect randomness is 1.0. It seems the entropy reading using log to the size of the alphabet does well at measuring the entropy resulting from using a subset of the alphabet, but does not differentiate so well when both strings use the full alphabet. So if your goal is compare entropy with no reference to "perfect randomness" (whatever that is), use base 2. It would take a lot more experimenting (or a real command of theory proof, which I don't have) to vet my hypotheses.

And speaking of experiments, I think to do real statistics R&D one should hook-up to something like http://qrng.anu.edu.au/index.php rather than rely on the .NET pseudo-random generator. I'll get around to doing this someday.

Jack Fox United States | Reply

5/26/2012 4:47:25 AM #

Mathias

Hi Jack,
Thanks for the comments! I was actually wondering about this question of whether there were implications on picking different bases for the logarithm, and your comments forced me to look into it (and take a math refresher!)...
Let's note that
1) Log2 and LogB the log in base 2 and B. LogB(x) = Log2(x)/Log2(B)
Then, noting HB the entropy in base B:
H2(X) = - ( p(x1) x Log2(p(x1)) + ... p(xn) x Log2(p(xn)) )
HB(X) = - ( p(x1) x LogB(p(x1)) + ... p(xn) x LogB(p(xn)) )
HB(X) = - ( p(x1) x Log2(p(x1) / Log2(B)) + ... p(xn) x Log2(p(xn) / Log2(B)) )
HB(X) = - (1 / Log2(B)) x ( p(x1) x Log2(p(x1)) + ... p(xn) x Log2(p(xn)) )
HB(X) = H2(X) x 1 / Log2(B)

i.e., unless I messed up the math, the choice of the base will simply apply a multiplier to the entropy - but the base won't impact the order. If messages M1 and M2 are such that H(M1) > H(M2), that inequality will hold regardless of the base.

I think the choice of base helps interpret the number. Base 2 is in bits, and the result has some relationship to the number of bits you could compress the message to (if I understand properly). I think using for base the number of characters is interesting, in that the maximum entropy possible is then - ( 1/b x Logb(1/b) + ... + 1/b x Logb(1/b) ) = 1.0 . On the other hand, using different bases for different messages prevents comparison.

Hope this makes sense! Thanks for the interesting comment,

Mathias

Mathias United States | Reply

5/26/2012 5:07:35 AM #

Jack Fox

Yes, completely agree. Having a scale where the maximum entropy is a fixed value has some value, but for larger alphabet sizes you loose "resolution" in that you cannot distinguish between cases with different, but close entropies.

Jack Fox United States | Reply

5/26/2012 5:09:38 AM #

Jack Fox

That is to say "cannot distinguish" because you run out of significant digits.

Jack Fox United States | Reply

5/27/2012 12:41:21 PM #

Mathias

Yes. Also, there is a clear benefit in using the same base across, in that it allows for comparison, regardless of the "alphabet".

Mathias United States | Reply

8/5/2012 1:51:01 PM #

trackback

Decision Tree classification

Decision Tree classification

Clear Lines Blog | Reply

12/10/2012 1:45:15 AM #

Cloud

Great source code, thanks

Cloud United States | Reply

12/18/2016 8:11:35 PM #

pingback

Pingback from steroidsforsale.biz

where to buy d-bol

steroidsforsale.biz | Reply

Add comment




  Country flag

biuquote
  • Comment
  • Preview
Loading



Comments

Comment RSS