Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 22. December 2015 14:43

This is my modest contribution to the F# Advent Calendar 2015. Thanks to @sergey_tihon for organizing it! Check out the epic stuff others have produced so far on his website or under the #fsAdvent hashtag on Twitter. Also, don’t miss the Japan Edition of #fsAdvent for more epicness…

Sometime last year, in a moment of beer-fueled inspiration, I ended up putting together @fsibot, the ultimate mobile F# IDE for the nomad developer with a taste for functional-first programming. This was fun, some people created awesome things with it, other people, not so much, and I learnt a ton.

People also had feature requests (of course they did), some obviously crucial (Quines! We need quines!), some less so. Among others came the suggestion to support querying the World Bank for data, and returning results as a chart.

So... Let's do it! After a bit of thought, I decided I would not extend @fsibot to support this, but rather build a separate bot, with its own external DSL. My thinking here was that adding this as a feature to @fsibot would clutter the code; also, this is a specialized task, and it might make sense to create a dedicated language for it, to make it accessible to the broader public who might not be familiar with F# and its syntax.

You can find the code for this thing here.

The World Bank Type Provider

Let's start with the easy part - accessing the World Bank data and turning it into a chart. So what I want to do is something along the lines of 'give me the total population for France between 2000 and 2005', and make a nice columns chart out of this. The first step is trivial using the World Bank type provider, which can be found in the FSharp.Data library:

open FSharp.Data

let wb = WorldBankData.GetDataContext ()
let france = wb.Countries.France
let population = france.Indicators.``Population, total``
let series = [ for year in 2000 .. 2005 -> year, population.[year]]

Creating a chart isn't much harder, using FSharp.Charting:

open FSharp.Charting

let title = sprintf "%s, %s" (france.Name) (population.Name)
let filename = __SOURCE_DIRECTORY__ + "/chart.png"

Chart.Line(series, Title=title)
|> Chart.Save(filename)

Wrapping up calls to the Type Provider

Next, we need to take in whatever string the user will send us over Twitter, and convert it into something we can execute. Specifically, what we want is to take user input along the lines of "France, Total population, 2000-2005", and feed that information into the WorldBank type provider.

Suppose for a moment that we had broken down our message into its 4 pieces, a country name, an indicator name, and two years. We could then call the WorldBank type provider, along these lines:

type WB = WorldBankData.ServiceTypes
type Country = WB.Country
type Indicator = Runtime.WorldBank.Indicator

let findCountry (name:string) =
    |> Seq.tryFind (fun c -> c.Name = name)

let findIndicator (name:string) (c:Country) =
    |> Seq.tryFind (fun i -> i.Name = name)

let getValues (year1,year2) (indicator:Indicator) =
    [ for year in year1 .. year2 -> year, indicator.[year]]

We can then easily wrap this into a single function, like this:

let getSeries (country,indicator,year1,year2) =
    findCountry country
    |> Option.bind (findIndicator indicator)
    |> (getValues (year1,year2))

Defining our language

This is a bit limiting, however. Imagine that we wanted to also support queries like "France, Germany, Italy, Total population, total GDP, 2000". We could of course pass in everything as lists, say,

 ["France";"Germany"], ["Total population"], [2000],

… but we'd have to then examine how many elements the list contains to make a decision. Also, more annoyingly, this allows for cases that should not be possible: ideally, we wouldn't want to even allow requests such as

[], [], [2000; 2010; 2020].

One simple solution is to carve out our own language, using F# Discriminated Unions. Instead of lists, we could, for instance, create a handful of types to represent valid arguments:

type PLACE =
    | COUNTRY of string
    | COUNTRIES of string list

type MEASURE =
    | INDICATOR of string

    | OVER of int * int
    | IN of int

This is much nicer: we can now clean up our API using pattern matching, eliminating a whole class of problems:

let cleanAPI (place:PLACE) (values:MEASURE) (timeframe:TIMEFRAME) =
    match (place, values, timeframe) with
    | COUNTRY(country), INDICATOR(indicator), OVER(year1,year2) ->             
        // do stuff
    | COUNTRIES(countries), INDICATOR(indicator), OVER(year1,year2) -> 
        // do different stuff
    | // etc...

Parsing user input

The only problem we are left with now is to break a raw string - the user request - into a tuple of arguments. If we have that, then we can compose all the pieces together, piping them into a function that will take a string and go all the way down to the type provider.

We are faced with a decision now: we can go the hard way, powering our way through this using Regex and string manipulation, or the easy way, using a parser like FParsec. Let's be lazy and smart!

Note to self: when using FParsec from a script file, make sure you #r FParsecCS before FParsec. I spent a couple of hours stuck trying to understand what I was doing wrong because of that one.

Simply put, FParsec is awesome. It allows you to define small functions to parse input strings, test them on small pieces of input, and compose them together into bigger and badder parsers. Let's illustrate: suppose that in our DSL, we expect user requests to contain a piece that looks like "IN 2010", or "OVER 2000 - 2010" to define the timeframe.

In the first case, we want to recognize the string “IN”, followed by spaces, followed by an integer; if we find that pattern, we want to retrieve the integer and create an instance of IN:

let pYear = spaces >>. pint32 .>> spaces
let pIn =
    pstring "IN" >>. pYear
    |>> IN

If we run the parser on a well-formed string, we get what we expect:

run pIn "IN  2000 "
val it : ParserResult<TIMEFRAME,unit> = Success: IN 2000

If we pass in an incorrectly formed string, we get a nice error diagnosis:

run pIn "IN some year "
val it : ParserResult<TIMEFRAME,unit> =
Error in Ln: 1 Col: 4
IN some year 
Expecting: integer number (32-bit, signed)

Beautiful! The second case is rather straightforward, too:

let pYears = 
    tuple2 pYear (pstring "-" >>. pYear)
let pOver = 
    pstring "OVER" >>. pYears
    |>> OVER

Passing in a well-formed string gives us back OVER(2000,2010):

run pOver "OVER 2000- 2010"
val it : ParserResult<TIMEFRAME,unit> = Success: OVER (2000,2010)

Finally we can compose these together, so that when we encounter either IN 2000, or OVER 2000 - 2005, we parse this into a TIMEFRAME:

let pTimeframe = pOver <|> pIn

I won't go into the construction of the full parser - you can just take a look here. The trickiest part was my own doing. I wanted to allow messages without quotes, that is,


and not

COUNTRY "France"

The second case is much easier to parse (look for any chars between ""), especially because there are indicators like, for instance, "Population, total". The parser is pretty hacky, but hey, it mostly works, so... ship it!

Ship it!

That's pretty much it. At that point, all the pieces are there. I ended up copy pasting taking inspiration from the existing @fsibot code, using LinqToTwitter to deal with reading and writing to Twitter, and TopShelf to host the bot as a Windows service, hosted on an Azure VM, and voila! You can now tweet to @wbfacts, and get back a nice artisanal chart, hand-crafted just for you, with the freshest data from the World Bank:

A couple of quick final comments:

  • One of the most obvious issues with the bot is that Twitter offers very minimal support for IntelliSense (and by minimal, I mean 'none'). This is a problem, because we lose discoverability, a key benefit of type providers. To compensate for that, I added a super-crude string matching strategy, which will give a bit of flexibility around misspelled country or indicator names. This is actually a fun problem - I was a bit pressed by time, but I'll probably revisit it later.
  • In the same vein, it would be nice to add a feature like "find me an indicator with a name like GDP total". That should be reasonably easy to do, by extending the language to support instructions like HELP and / or INFO.
  • The bot seems like a perfect case for some Railway-Oriented Programming. Currently the wiring is pretty messy; for instance, our parsing step returns an option, and drops parsing error messages from FParsec. That message would be much more helpful to the user than our current message that only states that “parsing failed". With ROP, we should be able to compose a clean pipeline of functions, along the lines of parseArguments >> runArguments >> composeResponse.
  • The performance of looking up indicators by name is pretty terrible, at least on the first call on a country. You have been warned :)
  • That's right, there is no documentation. Not a single test, either. Tests show a disturbing lack of confidence in your coding skills. Also, I had to ship by December 22nd :)

That being said, in spite of its many, many warts, I am kind of proud of @wbfacts! It is ugly as hell, the code is full of duct-tape, the parser is wanky, and you should definitely not take this as ‘best practices’. I am also not quite clear on how the Twitter rate limits work, so I would not be entirely surprised if things went wrong in the near future… In spite of all this, hey, it kind of runs! Hopefully you find the code or what it does fun, and perhaps it will even give you some ideas for your own projects. In the meanwhile, I wish you all happy holidays!

You can find the code for this thing here.

This is my modest contribution to the F# Advent Calendar 2015. Thanks to @sergey_tihon for organizing it! Check out the epic stuff others have produced so far on his website or under the #fsAdvent hashtag on Twitter. Also, don’t miss the Japan Edition of #fsAdvent for more epicness…

I also wanted to say thanks to Tomas Petricek, for opening my eyes to discriminated unions as a modeling tool, and Phil Trelford for introducing me to FParsec, which is truly a thing of beauty. They can be blamed to an extent for inspiring this ill-conceived project, but whatever code monstrosity is in the repository is entirely my doing :)

And… ping me on Twitter as @brandewinder if you have questions or comments!

by Mathias 8. August 2015 12:42

A couple of weeks ago, I came across this blog post by Steve Shogren, which looks at various programming languages, and attempts to define a “language safety score”, by taking into account a list of language criteria (Are nulls allowed? Can variables be mutated? And so on), aggregating them into an overall safety score – and finally looking for whether the resulting score was a reasonable predictor for the observed bug rate across various projects.

I thought this was an interesting idea. However, I also had reservations on the methodology. Picking a somewhat arbitrary list of criteria, giving them indiscriminately the same weight, and summing them up, didn’t seem to me like the most effective approach – especially given that Steve had already collected a nice dataset. If the goal is to identify which language features best predict how buggy the code will be, why not start from there, and build a model which attempts to predict the bug rate based on language features?

So I decided to give it a shot, and build a quick-and-dirty logistic regression model. In a nutshell, logistic regression attempts to model the probability of observing an event, based on a set of criteria / features. A prototypical application would be in medicine, trying to predict, for instance, the chances of developing a disease, given patient characteristics. In our case, the disease is a bug, and the patient a code base. We’ll use the criteria listed by Steve as potential predictors, and, as a nice side-product of logistic regression, we will get a quantification of how important each of the criteria is in predicting the bug rate.

I’ll discuss later some potential issues with the approach; for now, let’s build a model, and see where that leads us. I lifted the data from Steve’s post (hopefully without typos), with one minor modification: instead of scoring criteria as 1, 0 or –1, I just retained 1 or 0 (it’s there or it’s not there), and prepared an F# script, using the Accord framework to run my logistic regression.

Note: the entire script is here as a Gist.

#I @"../packages"
#r @"Accord.3.0.1-alpha\lib\net45\Accord.dll"
#r @"Accord.MachineLearning.3.0.1-alpha\lib\net45\Accord.MachineLearning.dll"
#r @"Accord.Math.3.0.1-alpha\lib\net45\Accord.Math.dll"
#r @"Accord.Statistics.3.0.1-alpha\lib\net45\Accord.Statistics.dll"

let language, bugrate, criteria = 
    [|  "F#",           0.023486288,    [|1.;1.;1.;0.;1.;1.;1.;0.;0.;0.;1.;1.;1.;0.|]
        "Haskell",      0.015551204,    [|1.;1.;1.;0.;1.;1.;1.;1.;1.;0.;1.;1.;0.;1.|]
        "Javascript",   0.039445132,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] 
        "CoffeeScript", 0.047242288,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] 
        "Clojure",      0.011503478,    [|0.;1.;0.;0.;0.;0.;1.;0.;1.;1.;1.;0.;0.;0.|] 
        "C#",           0.03261284,     [|0.;0.;1.;0.;0.;1.;1.;0.;0.;0.;1.;0.;1.;0.|] 
        "Python",       0.02531419,     [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] 
        "Java",         0.032567736,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] 
        "Ruby",         0.020303702,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] 
        "Scala",        0.01904762,     [|1.;1.;1.;0.;1.;1.;1.;0.;0.;0.;1.;0.;0.;0.|] 
        "Go",           0.024698375,    [|0.;0.;1.;0.;0.;1.;1.;0.;0.;0.;1.;0.;1.;0.|] 
        "PHP",          0.031669293,    [|0.;0.;0.;0.;0.;0.;0.;0.;0.;0.;1.;0.;1.;0.|] |]
    |> Array.unzip3

open Accord.Statistics.Models.Regression
open Accord.Statistics.Models.Regression.Fitting 

let features = 14
let model = LogisticRegression(features)
let learner = LogisticGradientDescent(model)

let rec learn () = 
    let delta = learner.Run(criteria, bugrate)
    if delta > 0.0001
    then learn ()
    else ignore ()

learn () |> ignore

And we are done – we have trained a model to predict the bug rate, based on our 14 criteria. How is this working? Let’s find out:

for i in 0 .. (language.Length - 1) do    
    let lang = language.[i]
    let predicted = model.Compute(criteria.[i])
    let real = bugrate.[i]
    printfn "%16s Real: %.3f Pred: %.3f" lang real predicted

              F# Real: 0.023 Pred: 0.023
         Haskell Real: 0.016 Pred: 0.016
      Javascript Real: 0.039 Pred: 0.033
    CoffeeScript Real: 0.047 Pred: 0.033
         Clojure Real: 0.012 Pred: 0.011
              C# Real: 0.033 Pred: 0.029
          Python Real: 0.025 Pred: 0.033
            Java Real: 0.033 Pred: 0.033
            Ruby Real: 0.020 Pred: 0.033
           Scala Real: 0.019 Pred: 0.020
              Go Real: 0.025 Pred: 0.029
             PHP Real: 0.032 Pred: 0.033

Looks pretty good. Let’s confirm that with a chart, using FSharp.Charting:

#load "FSharp.Charting.0.90.12\FSharp.Charting.fsx"
open FSharp.Charting

let last = language.Length - 1

Chart.Combine [
    Chart.Line ([ for i in 0 .. last -> bugrate.[i]], "Real", Labels=language)
    Chart.Line ([ for i in 0 .. last -> model.Compute(criteria.[i])], "Pred") ] 
|> Chart.WithLegend()


What criteria did our model identify as predictors for bugs? Let’s find out.

let criteriaNames = [|
    "Null Variable Usage"
    "Null List Iteration"
    "Prevent Variable Reuse"
    "Ensure List Element Exists"
    "Safe Type Casting"
    "Passing Wrong Type"
    "Misspelled Method"
    "Missing Enum Value"
    "Variable Mutation"
    "Prevent Deadlocks"
    "Memory Deallocation"
    "Tail Call Optimization"
    "Guaranteed Code Evaluation"
    "Functional Purity" |]    
for i in 0 .. (features - 1) do 
    let name = criteriaNames.[i]
    let wald = model.GetWaldTest(i)
    let odds = model.GetOddsRatio(i)
    (printfn "%28s odds: %4.2f significant: %b" name odds wald.Significant)

         Null Variable Usage odds:  0.22 significant: true
         Null List Iteration odds:  0.86 significant: true
      Prevent Variable Reuse odds:  0.64 significant: true
  Ensure List Element Exists odds:  1.05 significant: true
           Safe Type Casting odds:  1.00 significant: false
          Passing Wrong Type odds:  0.86 significant: true
           Misspelled Method odds:  1.05 significant: true
          Missing Enum Value odds:  0.78 significant: true
           Variable Mutation odds:  0.86 significant: true
           Prevent Deadlocks odds:  0.64 significant: true
         Memory Deallocation odds:  0.74 significant: true
      Tail Call Optimization odds:  0.22 significant: true
  Guaranteed Code Evaluation odds:  1.71 significant: true
           Functional Purity odds:  0.69 significant: true

How should you read this? The first output, the odds ratio, describes how much more likely it is to observe success than failure when that criterion is active. In our case, success means “having a bug”, so for instance, if your language prevents using nulls, you’d expect 1.0 / 0.22 = 4.5 times less chances to write bugs. In other words, if the odds are close to 1.0, the criterion doesn’t make much of a difference. The closer to zero it is, the lower the predicted bug count, and vice-versa.

Conclusions and caveats

The 3 most significant predictors of a low bug rate are, in order, no nulls, tail calls optimization, and (to a much lesser degree) lazy evaluation. After that, we have honorable scores for avoiding variable reuse, preventing deadlocks, and functional purity.

So… what’s the bottom line? First off, just based on the bug rates alone, it seems that using functional languages would be a safer bet than Javascript (and CoffeeScript) to avoid bugs.

Then, now would be a good time to reiterate that this is a quick-and-dirty analysis. Specifically, there are some clear issues with the dataset. First, we are fitting 12 languages on 14 criteria – that’s not much to go on. On top of that, there is some data redundancy. None of the languages in our sample has “ensure list element exists” (4th column is filled with zeroes), and all of them guarantee memory de-allocation (11th column filled with ones). I suspect there is some additional redundancy, because of the similarity between the columns.

Note: another interesting discussion would be whether the selected criteria properly cover the differences between languages. I chose to not go into that, and focus strictly on using the data as-is.

I ran the model again, dropping the 2 columns that contain no information; while this doesn’t change the predictions of the model, it does impact a bit the weight of each criterion. The results, while similar, do show some differences:

("Null Variable Usage", 0.0743885639)
("Functional Purity", 0.4565632287)
("Prevent Variable Reuse", 0.5367456237)
("Prevent Deadlocks", 0.5374379877)
("Tail Call Optimization", 0.7028982809)
("Missing Enum Value", 0.7539575884)
("Null List Iteration", 0.7636177784)
("Passing Wrong Type", 0.7636177784)
("Variable Mutation", 0.7646027916)
("Safe Type Casting", 1.072641105)
("Misspelled Method", 1.072641105)
("Guaranteed Code Evaluation", 2.518831684)

Another piece of information I didn’t use is how many commits were taken into consideration. This matters, because the information gathered for PHP is across 10 times more commits than F#, for instance. It wouldn’t be very hard to do – instead of regressing against the bug rate, we could count the clean and buggy commits per language, and proceed along the lines of the last example described here.

In spite of these issues, I think this constitutes a better base to construct a language score index. Rather than picking criteria by hand and giving them arbitrary weights, let the data speak. Measure how well each of them explains defects, and use that as a basis to determine their relative importance.

That’s it for today! Big thanks to Steve Shogren for a stimulating post, and for making the data available. And again, you can find the script here as a Gist. If you have comments or questions, ping me on Twitter!

by Mathias 12. May 2015 13:50

As much as we people who write code like to talk about code, the biggest challenge in a software project is not code. A project rarely fails because of technology – it usually fails because of miscommunications: the code that is delivered solves a problem (sometimes), but not the right one. One of the reasons we often deliver the wrong solution is that coding involves translating the world of the original problem into a different language. Translating one way is hard enough as it is, but then, rarely are users comfortable with reading and interpreting code – and as a result, confirming whether the code “does the right thing” is hard, and errors go un-noticed.

This is why the idea of Ubiquitous Language, coined by Eric Evans in his Domain Driven Design book, always appealed to me. The further apart the languages of the domain expert and the code are, the more likely it is that something will be lost in translation.

However, achieving this perfect situation, with “a language structured around the domain model and used by all team members to connect all the activities of the team with the software” [source], is hard. I have tried this in the past, mainly through tests. My idea at the time was that tests, especially BDD style, could perhaps provide domain experts with scenarios similar enough to their worldview that they could serve as a basis for an active dialogue. The experience wasn’t particularly successful: it helped some, but in the end, I never got to the point where tests would become a shared, common ground (which doesn’t mean it’s not possible – I just didn’t manage to do it).

Fast forward a bit to today – I just completed a project, and it’s the closest I have ever been to seeing Ubiquitous Language in action. It was one of the most satisfying experiences I had, and F# had a lot to do with why it worked.

The project involved some pretty complex modeling, and only two people – me and the client. The client is definitely a domain expert, and on the very high end of the “computer power user” spectrum: he is very comfortable with SQL, doesn’t write software applications, but has a license to Visual Studio and is not afraid of code.

The fact that F# worked well for me isn’t a surprise – I am the developer in that equation, and I love it, for all the usual technical reasons. It just makes my life writing code easier. The part that was interesting here is that F# worked well for the client, too, and became our basis for communication.

What ended up happening was the following: I created a GitHub private repository, and started coding in a script file, fleshing out a domain model with small, runnable pieces of code illustrating what it was doing. We would have regular Skype meetings, with a screen share so that I could walk him through the code in Visual Studio, and explain the changes I made - and we would discuss. Soon after, he started to run the code himself, and even making small changes here and there, not necessarily the most complicated bits, but more domain-specific parts, such as adjusting parameters and seeing how the results would differ. And soon, I began receiving emails containing specific scenarios he had experimented with, using actual production data, and pointing at possible flaws in my approach, or questions that required clarifications.

So how did F# make a difference? I think it’s a combination of at least 2 things: succinctness, and static typing + scripts. Succinctness, because you can define a domain with very little code, without loosing expressiveness. As a result, the core entities of the domain end up taking a couple of lines at the top of a single file, and it’s easy to get a full picture, without having to navigate around between files and folders, and keep information in your head. As an illustration, here is a snippet of code from the project:

type Window = { Early:DateTime; Target:DateTime; Late:DateTime }

type Trip = { 
    Dwell:TimeSpan }

type Action = 
    | Pickup of Trip 
    | Dropoff of Trip
    | CompleteRoute of Location

This is concise, and pretty straightforward – no functional programming guru credentials needed. This is readable code, which we can talk about without getting bogged down in extraneous details.

The second ingredient is static typing + scripts. What this creates is a safe environment for experimentation.  You can just change a couple of lines here and there, run the code, and see what happens. And when you break something, the compiler immediately barks at you – just undo or fix it. Give someone a running script, and they can start playing with it, and exploring ideas.

In over 10 years writing code professionally, I never had such a collaborative, fruitful, and productive interaction cycle with a client. Never. This was the best of both worlds – I could focus on the code and the algorithms, and he could immediately use it, try it out, and send me invaluable feedback, based on his domain knowledge. No noise, no UML diagrams, no slides, no ceremony – just write code, and directly communicate around it, making sure nothing was amiss. Which triggered this happy tweet a few weeks back:

We were looking at the code together, and my client spotted a domain modeling mistake, right there. This is priceless.

As a side-note, another thing that is priceless is “F# for Fun and Profit”. Scott Wlaschin has been doing an incredible work with this website. It’s literally a gold mine, and I picked up a lot of ideas there. If you haven’t visited it yet, you probably should.

by Mathias 3. May 2015 16:34

I got curious the other day about how to measure the F# community growth, and thought it could be interesting to take a look at this through StackOverflow. As it turns out, it’s not too hard to get some data, because StackExchange exposes a nice API, which allows you to make all sorts of queries and get a JSON response back.

As a starting point, I figured I would just try to get the number of questions asked per month. The API allows you to retrieve questions on any site, by tag, between arbitrary dates. Responses are paged: you can get up to 100 items per page, and keep asking for next pages until there is nothing left to receive. That sounds like a perfect job for the FSharp.Data JSON Type Provider.

First things first, we create a type, Questions, by pointing the JSON Type Provider to a url that returns questions; based on the structure of the JSON document it receives, the Type Provider creates a type, which we will then be able to use to make queries:

#r @"FSharp.Data.2.2.0\lib\net40\FSharp.Data.dll"
open FSharp.Data
open System

let sampleUrl = ""
type Questions = JsonProvider<sampleUrl>

Next, we’ll need to grab all the questions tagged F# between 2 given dates. As an example, the following would return the second page (questions 101 to 200) from all F# questions asked between January 1, 2014 and January 31, 2015:

There are a couple of quirks here. First, the dates are in UNIX standard, that is, the number of seconds elapsed from January 1, 1970. Then, we need to keep pulling pages, until the response indicates that there are no more questions to receive, which is indicated by the HasMore property. That’s not too hard: let’s create a couple of functions, first to convert a .NET date to a UNIX date, and then to build up a proper query, appending the page and dates we are interested in to our base query – and finally, let’s build a request that recursively calls the API and appends results, until there is nothing left:

let fsharpQuery = ""

let unixEpoch = DateTime(1970,1,1)
let unixTime (date:DateTime) = 
    (date - unixEpoch).TotalSeconds |> int64

let page (page:int) (query:string) = 
    sprintf "%s&page=%i" query page
let between (from:DateTime) (``to``:DateTime) (query:string) = 
    sprintf "%s&&fromdate=%i&todate=%i" query (unixTime from) (unixTime ``to``)

let questionsBetween (from:DateTime) (``to``:DateTime) =
    let baseQuery = fsharpQuery |> between from ``to``
    let rec pull results p = 
        let nextPage = Questions.Load (baseQuery |> page p)
        let results = results |> Array.append nextPage.Items
        if (nextPage.HasMore)
        then pull results (p+1)
        else results
    pull Array.empty 1

And we are pretty much done. At that point, we can for instance ask for all the questions asked in January 2015, and check what percentage were answered:

let january2015 = questionsBetween (DateTime(2015,1,1)) (DateTime(2015,1,31))

|> Seq.averageBy (fun x -> if x.IsAnswered then 1. else 0.)
|> printfn "Average answer rate: %.3f"

… which produces a fairly solid 78%.

If you play a bit more with this, and perhaps try to pull down more data, you might experience (as I did) the Big StackExchange BanHammer. As it turns out, the API has usage limits (which is totally fair). In particular, if you ask for too much data, too fast, you will get banned from making requests, for a dozen hours or so.

This is not pleasant. However, in their great kindness, the API designers have provided a way to avoid it. When you are making too many requests, the response you receive will include a field named “backoff”, which indicates for how many seconds you should back off until you make your next call.

This got me stumped for a bit, because that field doesn’t show up by default on the response – only when you are hitting the limit. As a result, I wasn’t sure how to pass that information to the JSON Type Provider, until Max Malook helped me out (thanks so much, Max!). The trick here is to supply not one sample response to the type provider, but a list of samples, in that case, one without the backoff field, and one with it.

I carved out an artisanal, hand-crafted sample for the occasion, along these lines:

let sample = """
   {"tags":["f#","units-of-measurement"],//SNIPPED FOR BREVITY}],
   {"tags":["f#","units-of-measurement"],//SNIPPED FOR BREVITY}],

type Questions = JsonProvider<sample,SampleIsList=true>

… and everything is back in order – we can now modify the recursive request, causing it to sleep for a bit when it encounters a backoff. Not the cleanest solution ever, but hey, I just want to get data here:

let questionsBetween (from:DateTime) (``to``:DateTime) =
    let baseQuery = fsharpQuery |> between from ``to``
    let rec pull results p = 
        let nextPage = Questions.Load (baseQuery |> page p)
        let results = results |> Array.append nextPage.Items
        if (nextPage.HasMore)
            match nextPage.Backoff with
            | Some(seconds) -> System.Threading.Thread.Sleep (1000*seconds + 1000)
            | None -> ignore ()
            pull results (p+1)
        else results
    pull Array.empty 1

So what were the results? I decided, quite arbitrarily, to count questions month by month since January 2010. Here is how the results looks like:


Clearly, the trend is up – it doesn’t take an advanced degree in statistics to see that. It’s interesting also to see the slump around 2012-2013; I can see a similar pattern in the Meetup registration numbers in San Francisco. My sense is that after a spike in interest in 2010, when F# launched with Visual Studio, there hasn’t been much marketing push for the language, and interest eroded a bit, until serious community-driven efforts took place. However, I don’t really have data to back that up – this is speculation.

How this correlates to overall F# adoption is another question: while I think this curves indicates growth, the number of questions on StackOverflow is clearly a very indirect measurement of how many people actually use it, and StackOverflow itself is a distorted sample of the overall population. Would be interesting to take a similar look at GitHub, perhaps…

by Mathias 22. March 2015 17:27

I will admit it, I got a bit upset by James McCaffrey’s column in MSDN magazine this month, “Gradient Descent Training Using C#”. While the algorithm explanations are quite good, I was disappointed by the C# sample code, and kept thinking to myself “why oh why isn’t this written in F#”. This is by no means intended as a criticism of C#; it’s a great language, but some problems are just better suited for different languages, and in this case, I couldn’t fathom why F# wasn’t used.

Long story short, I just couldn’t let it go, and thought it would be interesting to take that C# code, and do a commented rewrite in F#. I won’t even go into why the code does what it does – the article explains it quite well – but will instead purely focus on the implementation, and will try to keep it reasonably close to the original, at the expense of some additional nifty things that could be done.

The general outline of the code follows two parts:

  • Create a synthetic dataset, creating random input examples, and computing the expected result using a known function,
  • Use gradient descent to learn the model parameters, and compare them to the true value to check whether the method is working.

You can download the original C# code here. Today we’ll focus only on the first part, which is mainly contained in two methods, MakeAllData and MakeTrainTest:

static double[][] MakeAllData(int numFeatures, int numRows, int seed)
  Random rnd = new Random(seed);
  double[] weights = new double[numFeatures + 1]; // inc. b0
  for (int i = 0; i < weights.Length; ++i)
    weights[i] = 20.0 * rnd.NextDouble() - 10.0; // [-10.0 to +10.0]

  double[][] result = new double[numRows][]; // allocate matrix
  for (int i = 0; i < numRows; ++i)
    result[i] = new double[numFeatures + 1]; // Y in last column

  for (int i = 0; i < numRows; ++i) // for each row
    double z = weights[0]; // the b0 
    for (int j = 0; j < numFeatures; ++j) // each feature / column except last
      double x = 20.0 * rnd.NextDouble() - 10.0; // random X in [10.0, +10.0]
      result[i][j] = x; // store x
      double wx = x * weights[j + 1]; // weight * x 
      z += wx; // accumulate to get Y
    double y = 1.0 / (1.0 + Math.Exp(-z));
    if (y > 0.55)  // slight bias towards 0
      result[i][numFeatures] = 1.0; // store y in last column
      result[i][numFeatures] = 0.0;
  Console.WriteLine("Data generation weights:");
  ShowVector(weights, 4, true);

  return result;

MakeAllData takes a number of features and rows, and a seed for the random number generator so that we can replicate the same dataset repeatedly. The dataset is represented as an array of array of doubles. The first columns, from 0 to numFeatures – 1, contain random numbers between –10 and 10. The last column contains a 0 or a 1. What we are after here is a classification model: each row can take two states (1 or 0), and we are trying to predict them from observing the features. In our case, that value is computed using a logistic model: we have a set of weights (which we also generate randomly), corresponding to each feature, and the output is

logistic [ x1; x2; … xn ] = 1.0 / (1.0 + exp ( - (w0 * 1.0 + w1 * x1 + w2 * x2 + … + wn * xn))

Note that w0 plays the role of a constant term in the equation, and is multiplied by 1.0 all the time. This is adding some complications to a code where indices are already flying left and right, because now elements in the weights array are mis-aligned by one element with the elements in the features array. Personally, I also don’t like adding another column to contain the predicted value, because that’s another implicit piece of information we have to remember.

In that frame, I will make two minor changes here, just to keep my sanity. First, as is often done, we will insert a column containing just 1.0 in each observation, so that the weights and features are now aligned. Then, we will move the 0s and 1s outside of the features array, to avoid any ambiguity.

Good. Instead of creating a Console application, I’ll simply go for a script. That way, I can just edit my code and check live whether it does what I want, rather than recompile and run every time.

Let’s start with the weights. What we are doing here is simply creating an array of numFeatures + 1 elements, populated by random values between –10.0 and 10.0. We’ll go a bit fancy here: given that we are also generating random numbers the same way a bit further down, let’s extract a function that generates numbers uniformly between a low and high value:

let rnd = Random(seed)
let generate (low,high) = low + (high-low) * rnd.NextDouble()
let weights = Array.init (numFeatures + 1) (fun _ -> generate(-10.0,10.0))

The next section is where things get a bit thornier. The C# code creates an array, then populates it row by row, first filling in the columns with random numbers, and then applying the logistic function to compute the value that goes in the last column. We can make that much clearer, by extracting that function out. The logistic function is really doing 2 things:

  • first, the sumproduct of 2 arrays,
  • and then, 1.0/(1.0 + exp ( – z ).

That is easy enough to implement:

let sumprod (v1:float[]) (v2:float[]) = v1 v2 |> Seq.sumBy (fun (x,y) -> x * y)

let sigmoid z = 1.0 / (1.0 + exp (- z))

let logistic (weights:float[]) (features:float[]) =
    sumprod weights features |> sigmoid

We can now use all this, and generate a dataset by simply first creating rows of random values (with a 1.0 in the first column for the constant term), applying the logistic function to compute the value for that row, and return them as a tuple:

open System

let sumprod (v1:float[]) (v2:float[]) = v1 v2 |> Seq.sumBy (fun (x,y) -> x * y)

let sigmoid z = 1.0 / (1.0 + exp (- z))

let logistic (weights:float[]) (features:float[]) =
    sumprod weights features |> sigmoid

let makeAllData (numFeatures, numRows, seed) =

    let rnd = Random(seed)
    let generate (low,high) = low + (high-low) * rnd.NextDouble()
    let weights = Array.init (numFeatures + 1) (fun _ -> generate(-10.0,10.0))
    let dataset = 
        [| for row in 1 .. numRows ->
            let features = 
                    yield 1.0 
                    for feat in 1 .. numFeatures -> generate(-10.0,10.0) 
            let value = 
                if logistic weights features > 0.55 
                then 1.0 
                else 0.0
            (features, value)

    weights, dataset

Done. Let’s move to the second part of the data generation, with the MakeTrainTest method. Basically, what this does is take a dataset, shuffle it, and split it in two parts, 80% which we will use for training, and 20% we leave out for validation.

static void MakeTrainTest(double[][] allData, int seed,
  out double[][] trainData, out double[][] testData)
  Random rnd = new Random(seed);
  int totRows = allData.Length;
  int numTrainRows = (int)(totRows * 0.80); // 80% hard-coded
  int numTestRows = totRows - numTrainRows;
  trainData = new double[numTrainRows][];
  testData = new double[numTestRows][];

  double[][] copy = new double[allData.Length][]; // ref copy of all data
  for (int i = 0; i < copy.Length; ++i)
    copy[i] = allData[i];

  for (int i = 0; i < copy.Length; ++i) // scramble order
    int r = rnd.Next(i, copy.Length); // use Fisher-Yates
    double[] tmp = copy[r];
    copy[r] = copy[i];
    copy[i] = tmp;
  for (int i = 0; i < numTrainRows; ++i)
    trainData[i] = copy[i];

  for (int i = 0; i < numTestRows; ++i)
    testData[i] = copy[i + numTrainRows];

Again, there is a ton of indexing going on, which in my old age I find very hard to follow. Upon closer inspection, really, the only thing complicated here is the Fischer-Yates shuffle, which takes an array and randomly shuffles the order. The rest is pretty simply – we just want to shuffle, and then split into two arrays. Let’s extract the shuffle code (which happens to also be used and re-implemented later on):

let shuffle (rng:Random) (data:_[]) =
    let copy = Array.copy data
    for i in 0 .. (copy.Length - 1) do
        let r = rng.Next(i, copy.Length)
        let tmp = copy.[r]
        copy.[r] <- copy.[i]
        copy.[i] <- tmp

We went a tiny bit fancy again here, and made the shuffle work on generic arrays; we also pass in the Random instance we want to use, so that we can control / repeat shuffles if we want, by passing a seeded Random. Does this work? Let’s check in FSI:

> [| 1 .. 10 |] |> shuffle (Random ());;
val it : int [] = [|6; 7; 2; 10; 8; 5; 4; 9; 3; 1|]

Looks reasonable. Let’s move on – we can now implement the makeTrainTest function.

let makeTrainTest (allData:_[], seed) =

      let rnd = Random(seed)
      let totRows = allData.Length
      let numTrainRows = int (float totRows * 0.80) // 80% hard-coded

      let copy = shuffle rnd allData
      copy.[.. numTrainRows-1], copy.[numTrainRows ..]

Done. A couple of remarks here. First, F# is a bit less lenient than C# around types, so we have to be explicit when converting the number of rows to 80%, first to float, then back to int. As an aside, this used to annoy me a bit in the beginning, but I have come to really like having F# as this slightly psycho-rigid friend who nags me when I am taking a dangerous path (for instance, dividing two integers and hoping for a percentage).

Besides that, I think the code is markedly clearer. The complexity of the shuffle has been nicely contained, and we just have to slice the array to get a training and test sets. As an added bonus, we got rid of the out parameters, and that always feels nice and fuzzy.

I’ll leave it at for today; next time we’ll look at the second part, the learning algorithm itself. Before closing shop, let me make a couple of comments. First, the code is a tad shorter, but not by much. I haven’t really tried, and deliberately made only the changes I thought were needed. What I like about it, though, is that all the indexes are gone, except for the shuffle. In my opinion, this is a good thing. I find it difficult to keep it all in my head when more than one index is involved; when I need to also remember what columns contain special values, I get worried – and just find it hard to figure out what is going on. By contrast, I think makeTrainTest, for instance, conveys pretty directly what it does. makeAllData, in spite of some complexity, also maps closely the way I think about my goal: “I want to generate rows of inputs” – this is precisely what the code does. There is probably an element of culture to it, though; looping over arrays has a long history, and is familiar to every developer, and what looks readable to me might look entirely weird to some.

Easier, or more complicated than before? Anything you like or don’t like – or find unclear? Always interested to hear your opinion! Ping me on Twitter if you have comments.


Comment RSS