Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 3. May 2015 16:34

I got curious the other day about how to measure the F# community growth, and thought it could be interesting to take a look at this through StackOverflow. As it turns out, it’s not too hard to get some data, because StackExchange exposes a nice API, which allows you to make all sorts of queries and get a JSON response back.

As a starting point, I figured I would just try to get the number of questions asked per month. The API allows you to retrieve questions on any site, by tag, between arbitrary dates. Responses are paged: you can get up to 100 items per page, and keep asking for next pages until there is nothing left to receive. That sounds like a perfect job for the FSharp.Data JSON Type Provider.

First things first, we create a type, Questions, by pointing the JSON Type Provider to a url that returns questions; based on the structure of the JSON document it receives, the Type Provider creates a type, which we will then be able to use to make queries:

#r @"FSharp.Data.2.2.0\lib\net40\FSharp.Data.dll"
open FSharp.Data
open System

// JsonProvider needs a compile-time constant, hence the [<Literal>] attribute.
[<Literal>]
let sampleUrl = ""
type Questions = JsonProvider<sampleUrl>

Next, we’ll need to grab all the questions tagged F# between 2 given dates. As an example, the following would return the second page (questions 101 to 200) from all F# questions asked between January 1, 2014 and January 31, 2015:

There are a couple of quirks here. First, the dates are in UNIX time, that is, the number of seconds elapsed since January 1, 1970. Then, we need to keep pulling pages until the response indicates, through the HasMore property, that there are no more questions to receive. That’s not too hard: let’s create a couple of functions, first to convert a .NET date to a UNIX date, and then to build up a proper query, appending the page and dates we are interested in to our base query. Finally, let’s build a request that recursively calls the API and appends results, until there is nothing left:

let fsharpQuery = ""

let unixEpoch = DateTime(1970,1,1)
let unixTime (date:DateTime) = 
    (date - unixEpoch).TotalSeconds |> int64

let page (page:int) (query:string) = 
    sprintf "%s&page=%i" query page
let between (from:DateTime) (``to``:DateTime) (query:string) = 
    sprintf "%s&fromdate=%i&todate=%i" query (unixTime from) (unixTime ``to``)

let questionsBetween (from:DateTime) (``to``:DateTime) =
    let baseQuery = fsharpQuery |> between from ``to``
    let rec pull results p = 
        let nextPage = Questions.Load (baseQuery |> page p)
        let results = results |> Array.append nextPage.Items
        if (nextPage.HasMore)
        then pull results (p+1)
        else results
    pull Array.empty 1
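
As a quick sanity check that needs no network access, here is a minimal sketch that exercises the date conversion and the query composition on their own. The helpers mirror the ones above and are repeated so the snippet runs by itself; exampleQuery is a made-up URL used purely for illustration, not the actual StackExchange query.

```fsharp
open System

// Helpers mirroring the ones above, repeated so the snippet is self-contained.
let unixEpoch = DateTime(1970, 1, 1)
let unixTime (date: DateTime) =
    (date - unixEpoch).TotalSeconds |> int64

let page (page: int) (query: string) =
    sprintf "%s&page=%i" query page
let between (from: DateTime) (``to``: DateTime) (query: string) =
    sprintf "%s&fromdate=%i&todate=%i" query (unixTime from) (unixTime ``to``)

// Hypothetical base query, for illustration only.
let exampleQuery = "https://example.com/questions?site=demo"

// January 1, 2014 is 1388534400 seconds after the UNIX epoch.
let start = unixTime (DateTime(2014, 1, 1))

// Compose dates and page number onto the base query.
let example =
    exampleQuery
    |> between (DateTime(2014, 1, 1)) (DateTime(2015, 1, 31))
    |> page 2

printfn "%i" start
printfn "%s" example
```

Because each helper takes the query as its last argument, the pieces chain naturally with the pipe operator, in any order.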

And we are pretty much done. At that point, we can for instance ask for all the questions asked in January 2015, and check what percentage were answered:

let january2015 = questionsBetween (DateTime(2015,1,1)) (DateTime(2015,1,31))

january2015
|> Seq.averageBy (fun x -> if x.IsAnswered then 1. else 0.)
|> printfn "Average answer rate: %.3f"

… which produces a fairly solid 78%.

If you play a bit more with this, and perhaps try to pull down more data, you might experience (as I did) the Big StackExchange BanHammer. As it turns out, the API has usage limits (which is totally fair). In particular, if you ask for too much data, too fast, you will get banned from making requests, for a dozen hours or so.

This is not pleasant. However, in their great kindness, the API designers have provided a way to avoid it. When you are making too many requests, the response you receive will include a field named “backoff”, which indicates for how many seconds you should back off until you make your next call.

This got me stumped for a bit, because that field doesn’t show up by default on the response – only when you are hitting the limit. As a result, I wasn’t sure how to pass that information to the JSON Type Provider, until Max Malook helped me out (thanks so much, Max!). The trick here is to supply not one sample response to the type provider, but a list of samples, in that case, one without the backoff field, and one with it.

I carved out an artisanal, hand-crafted sample for the occasion, along these lines:

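// Two sample responses in a single JSON array: the first without the
// "backoff" field, the second with it (the backoff value is illustrative).
[<Literal>]
let sample = """
[{"items":[{"tags":["f#","units-of-measurement"],//SNIPPED FOR BREVITY}],
  "has_more":false},
 {"items":[{"tags":["f#","units-of-measurement"],//SNIPPED FOR BREVITY}],
  "has_more":false,"backoff":10}]"""

type Questions = JsonProvider<sample,SampleIsList=true>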

… and everything is back in order – we can now modify the recursive request, causing it to sleep for a bit when it encounters a backoff. Not the cleanest solution ever, but hey, I just want to get data here:

let questionsBetween (from:DateTime) (``to``:DateTime) =
    let baseQuery = fsharpQuery |> between from ``to``
    let rec pull results p = 
        let nextPage = Questions.Load (baseQuery |> page p)
        let results = results |> Array.append nextPage.Items
        if (nextPage.HasMore)
        then
            // Wait for the requested backoff (plus one second) before the next call.
            match nextPage.Backoff with
            | Some(seconds) -> System.Threading.Thread.Sleep (1000*seconds + 1000)
            | None -> ignore ()
            pull results (p+1)
        else results
    pull Array.empty 1
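
To see the paging-with-backoff pattern in isolation, here is a small sketch that runs without touching the API. Everything in it is an assumption made for illustration: the Page record, fakeFetch (three canned pages, with a backoff requested on the second) and pullAll are hypothetical stand-ins, and the sleep function is passed in so that the example records the requested pauses instead of actually waiting.

```fsharp
// Hypothetical stand-in for one page of an API response.
type Page = { Items: int[]; HasMore: bool; Backoff: int option }

// Fake "API": three pages of three items each; page 2 asks for a 2 second backoff.
let fakeFetch (p: int) : Page =
    { Items = [| for i in 1 .. 3 -> (p - 1) * 3 + i |]
      HasMore = p < 3
      Backoff = if p = 2 then Some 2 else None }

// Pull pages recursively until HasMore is false, pausing when asked to.
let pullAll (fetch: int -> Page) (sleep: int -> unit) =
    let rec pull results p =
        let next = fetch p
        let results = Array.append results next.Items
        if next.HasMore then
            match next.Backoff with
            | Some seconds -> sleep seconds
            | None -> ()
            pull results (p + 1)
        else results
    pull Array.empty 1

// Record the requested pauses rather than sleeping.
let slept = ResizeArray<int>()
let allItems = pullAll fakeFetch (fun seconds -> slept.Add seconds)

printfn "%A" allItems
printfn "%A" (List.ofSeq slept)
```

Passing the fetch and sleep functions as arguments is only a convenience for testing the loop; with the real API, sleep would be a call to Thread.Sleep as in the code above.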

So what were the results? I decided, quite arbitrarily, to count questions month by month since January 2010. Here is what the results look like:


Clearly, the trend is up – it doesn’t take an advanced degree in statistics to see that. It’s interesting also to see the slump around 2012-2013; I can see a similar pattern in the Meetup registration numbers in San Francisco. My sense is that after a spike in interest in 2010, when F# launched with Visual Studio, there hasn’t been much marketing push for the language, and interest eroded a bit, until serious community-driven efforts took place. However, I don’t really have data to back that up – this is speculation.
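
For reference, the month-by-month buckets behind a count like this can be generated along the following lines. This is a sketch under my own assumptions, not the code used for the chart: monthStarts and monthRanges are hypothetical helpers, and each range runs from the first day of a month to the first day of the next one.

```fsharp
open System

// First day of each month, from `first` up to and including `last`.
let monthStarts (first: DateTime) (last: DateTime) =
    first
    |> Seq.unfold (fun (d: DateTime) ->
        if d > last then None else Some (d, d.AddMonths(1)))
    |> Seq.toList

// Pair each month start with the start of the following month.
let monthRanges (first: DateTime) (last: DateTime) =
    monthStarts first last
    |> List.map (fun start -> start, start.AddMonths(1))

let ranges = monthRanges (DateTime(2010, 1, 1)) (DateTime(2010, 3, 1))

for (from, until) in ranges do
    printfn "%s -> %s" (from.ToString "yyyy-MM-dd") (until.ToString "yyyy-MM-dd")
```

Each pair could then be handed to a function such as questionsBetween, keeping the length of the returned array as the count for that month.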

How this correlates to overall F# adoption is another question: while I think this curve indicates growth, the number of questions on StackOverflow is clearly a very indirect measurement of how many people actually use F#, and StackOverflow itself is a distorted sample of the overall population. It would be interesting to take a similar look at GitHub, perhaps…

by Mathias 10. November 2012 11:00

The Kaggle/StackOverflow contest officially closed a few days ago, which makes it a perfect time to have a miniature retrospective on that experience. The objective of the contest was to write an algorithm to predict whether a StackOverflow question would be closed by moderators, and the reason why.

The contest was announced just a couple of days before what was supposed to be 4 weeks of computer-free vacation travelling around Europe. Needless to say, a quick change of plans followed; I am a big fan of StackOverflow, and Machine Learning has been on my mind quite a bit lately, so I packed my smallest laptop with Visual Studio installed. At the same time, the wonders of the Interwebs resulted in the formation of Team Charon - the awesome @lu_a_jalla and me, around the loosely defined project of "having fun with this, using 100% F#".

Now that the contest is over, here are a few notes on the experience, focusing on process and tools, and not the modeling aspects – I’ll get back to that in a later post.

This is my first unquestionably positive experience with a dispersed team - every morning I was genuinely looking forward to code check-ins, something I can't say of every experience I have had with remote teams. I recall reading somewhere that there was only one valid reason to work with a dispersed team: when you really want to work with that person, and it is the only way to work together. I tend to agree, and this was tremendously fun. There are not that many opportunities to have meaningful interactions involving both F# and Machine Learning, and I learnt quite a bit in the process, in large part because this was team work.

As a side note, I find it amazing how ridiculously easy it is today to set up a collaborative environment. Set up a GitHub repository, use Skype and Twitter, and you are good to go. The only thing technology hasn’t quite solved yet is these pesky time zones: Minsk and San Francisco are still 11 hours apart. This is where a team of night owls might help…

Whenever there is a deadline, make sure the when and the what are clear. Had I followed this simple rule, I would have been on time for the final submission. Instead, I missed it by a couple of hours, because I didn't check what "you have three days left" meant exactly. That is too bad, because otherwise we could have ended up in 27th position, among 160+ competitors:


... which is a result I am pretty proud of, given that this was my first “official” attempt at Machine Learning stuff, and some of the competitors looked pretty qualified. During the initial phase, we went as high as 10th position, and ended up in 40th position, in the top 25%.


by Mathias 18. August 2012 03:39

This is the continuation of my series exploring Machine Learning, converting the code samples of “Machine Learning in Action” from Python to F# as I go through the book. Today’s post covers Chapter 4, which is dedicated to Naïve Bayes classification – and you can find the resulting code on GitHub.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading “Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

[Image: Thomas Bayes]

The idea behind the Algorithm

The canonical application of Naïve Bayes classification is text classification, where the goal is to identify which pre-determined category a piece of text belongs to: for instance, is this email I just received spam, or ham (“valuable” email)?

The underlying idea is to use individual words present in the text as indications for what category it is most likely to belong to, using Bayes Theorem, named after the cheerful-looking Reverend Bayes.

Imagine that you received an email containing the words “Nigeria”, “Prince”, “Diamonds” and “Money”. It is very likely that if you look into your spam folder, you’ll find quite a few emails containing these words, whereas, unless you are in the business of importing diamonds from Nigeria and have some aristocratic family, your “normal” emails would rarely contain them. These words have a much higher frequency within the “Spam” category than within “Ham”, which makes them a potential flag for undesired business ventures.

On the other hand, let’s assume that you are a lucky person, and that typically, what you receive is Ham, with the occasional Spam bit. If you took a random email in your inbox, it is then much more likely that it belongs to the Ham category.

Bayes’ Theorem combines these two pieces of information together, to determine the probability that a particular email belongs to the “Spam” category, if it contains the word “Nigeria”:

P(is “Spam” | contains “Nigeria”) = P(contains “Nigeria” | is “Spam”) x P(is “Spam”) / P(contains “Nigeria”)

In other words, 2 factors should be taken into account when deciding whether an email containing “Nigeria” is spam: how over-represented is that word in Spam, and how likely is it that any email is spammy in the first place?
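
To make the formula concrete, here is a worked example with entirely made-up numbers (20% of all email is spam, “Nigeria” shows up in 10% of spam and 0.1% of ham); none of these figures come from real data. The denominator, P(contains “Nigeria”), is obtained by adding up the two ways the word can appear: in a spam email or in a ham email.

```fsharp
// Made-up probabilities, for illustration only.
let pSpam = 0.2                // P(is "Spam")
let pNigeriaGivenSpam = 0.10   // P(contains "Nigeria" | is "Spam")
let pNigeriaGivenHam = 0.001   // P(contains "Nigeria" | is "Ham")

// P(contains "Nigeria"): the word appears either in a spam or in a ham email.
let pNigeria =
    pNigeriaGivenSpam * pSpam + pNigeriaGivenHam * (1.0 - pSpam)

// Bayes' Theorem: P(is "Spam" | contains "Nigeria").
let pSpamGivenNigeria = pNigeriaGivenSpam * pSpam / pNigeria

printfn "P(Spam | Nigeria) = %.3f" pSpamGivenNigeria
```

With these numbers the result is roughly 0.96: even though only one email in five is spam to begin with, the word is so over-represented in spam that seeing it makes spam by far the likelier category.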

The algorithm is named “Naïve” because it makes a simplifying assumption about the text, which turns out to be very convenient for computational purposes: each word appears with a frequency that doesn’t depend on other words. This assumption is unlikely to hold (the word “Diamond” is much more likely to be present in an email containing “Nigeria” than in your typical family-members discussion email).
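
Here is a small sketch of why that assumption is convenient: if words are independent, the probability of seeing a whole set of words in a category is just the product of the per-word probabilities, so each category can be scored word by word. The probabilities below are made up for illustration, logScore is a hypothetical helper, and summing logarithms instead of multiplying is a common trick to avoid numerical underflow on long texts, not something specific to this post.

```fsharp
// Made-up per-word probabilities, for illustration only.
let pWordGivenSpam = Map.ofList [ "nigeria", 0.10; "prince", 0.05; "money", 0.20 ]
let pWordGivenHam = Map.ofList [ "nigeria", 0.001; "prince", 0.002; "money", 0.05 ]

// log P(class) + sum of log P(word | class): the log of the product
// that the independence assumption gives us.
let logScore (prior: float) (table: Map<string, float>) (words: string list) =
    log prior + (words |> List.sumBy (fun w -> log (table.[w])))

let words = [ "nigeria"; "prince"; "money" ]
let spamScore = logScore 0.2 pWordGivenSpam words
let hamScore = logScore 0.8 pWordGivenHam words

// The category with the highest score wins.
let verdict = if spamScore > hamScore then "Spam" else "Ham"
printfn "%s" verdict
```

The shared denominator of Bayes' Theorem is left out here, because it is the same for both categories and does not change which score is larger.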

We’ll leave it at that on the concepts; I’ll refer readers who want to dig deeper to the book, or to this explanation of text classification with Naïve Bayes.

A simple F# implementation

For my first pass, I took a slightly different direction from the book, and decided to favor readability over performance. I assume that we are operating on a dataset organized as a sequence of text samples, each of them labeled by category, along these lines (example from the book “Machine Learning in Action”):

Note: the code presented here can be found on GitHub

let dataset =
    [| ("Ham",  "My dog has flea problems help please");
       ("Spam", "Maybe not take him to dog park stupid");
       ("Ham",  "My dalmatian is so cute I love him");
       ("Spam", "Stop posting stupid worthless garbage");
       ("Ham",  "Mr Licks ate my steak how to stop him");
       ("Spam", "Quit buying worthless dog food stupid") |]

We will need to do some word counting to compute frequencies, so let’s start with a few utility functions:

    open System
    open System.Text.RegularExpressions

    // Regular Expression matching full words, case insensitive.
    let matchWords = new Regex(@"\w+", RegexOptions.IgnoreCase)

    // Extract and count words from a string.
    let wordsCount (text: string) =
        matchWords.Matches(text)
        |> Seq.cast<Match>
        // Group on the lower-cased word, so that "My" and "my" are counted together.
        |> Seq.groupBy (fun m -> m.Value.ToLower())
        |> Seq.map (fun (value, groups) -> 
            value, (groups |> Seq.length))

    // Extracts all words used in a string.
    let vocabulary (text: string) =
        matchWords.Matches(text)
        |> Seq.cast<Match>
        |> Seq.map (fun m -> m.Value.ToLower())
        |> Seq.distinct

    // Extracts all words used in a dataset;
    // a Dataset is a sequence of "samples", 
    // each sample has a label (the class), and text.
    let extractWords dataset =
        dataset
        |> Seq.map (fun sample -> vocabulary (snd sample))
        |> Seq.concat
        |> Seq.distinct

    // "Tokenize" the dataset: break each text sample
    // into words and how many times they are used.
    let prepare dataset =
        dataset
        |> Seq.map (fun (label, sample) -> (label, wordsCount sample))

We use a Regular Expression, \w+, to match all words, in a case-insensitive way. wordsCount extracts individual words and the number of times they occur, while vocabulary simply returns the words encountered. The prepare function takes a complete dataset, and transforms each text sample into a Tuple containing the original classification label, and a Sequence of Tuples containing all lower-cased words found and their count.
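
As a quick illustration of what the word counting produces, here is a self-contained sketch run on a made-up sentence. It restates the regular expression and a wordsCount function along the lines described above (grouping on the lower-cased word is my choice here), and turns the result into a Map only to make lookups convenient.

```fsharp
open System.Text.RegularExpressions

// Regular Expression matching full words.
let matchWords = Regex(@"\w+", RegexOptions.IgnoreCase)

// Count occurrences of each lower-cased word in a string.
let wordsCount (text: string) =
    matchWords.Matches(text)
    |> Seq.cast<Match>
    |> Seq.groupBy (fun m -> m.Value.ToLower())
    |> Seq.map (fun (word, group) -> word, Seq.length group)

// Made-up sample text, with one word repeated in a different case.
let counts = wordsCount "Stop posting stupid stop" |> Map.ofSeq

printfn "%A" counts
```

Here “Stop” and “stop” end up in the same bucket, so the word is counted twice while the other two are counted once each.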


