Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 22. December 2015 14:43

This is my modest contribution to the F# Advent Calendar 2015. Thanks to @sergey_tihon for organizing it! Check out the epic stuff others have produced so far on his website or under the #fsAdvent hashtag on Twitter. Also, don’t miss the Japan Edition of #fsAdvent for more epicness…

Sometime last year, in a moment of beer-fueled inspiration, I ended up putting together @fsibot, the ultimate mobile F# IDE for the nomad developer with a taste for functional-first programming. This was fun, some people created awesome things with it, other people, not so much, and I learnt a ton.

People also had feature requests (of course they did), some obviously crucial (Quines! We need quines!), some less so. Among others came the suggestion to support querying the World Bank for data, and returning results as a chart.

So... Let's do it! After a bit of thought, I decided I would not extend @fsibot to support this, but rather build a separate bot, with its own external DSL. My thinking here was that adding this as a feature to @fsibot would clutter the code; also, this is a specialized task, and it might make sense to create a dedicated language for it, to make it accessible to the broader public who might not be familiar with F# and its syntax.

You can find the code for this thing here.

The World Bank Type Provider

Let's start with the easy part - accessing the World Bank data and turning it into a chart. So what I want to do is something along the lines of 'give me the total population for France between 2000 and 2005', and make a nice line chart out of this. The first step is trivial using the World Bank type provider, which can be found in the FSharp.Data library:

open FSharp.Data

let wb = WorldBankData.GetDataContext ()
let france = wb.Countries.France
let population = france.Indicators.``Population, total``
let series = [ for year in 2000 .. 2005 -> year, population.[year]]

Creating a chart isn't much harder, using FSharp.Charting:

open FSharp.Charting

let title = sprintf "%s, %s" (france.Name) (population.Name)
let filename = __SOURCE_DIRECTORY__ + "/chart.png"

Chart.Line(series, Title=title)
|> Chart.Save(filename)

Wrapping up calls to the Type Provider

Next, we need to take in whatever string the user will send us over Twitter, and convert it into something we can execute. Specifically, what we want is to take user input along the lines of "France, Total population, 2000-2005", and feed that information into the WorldBank type provider.

Suppose for a moment that we had broken down our message into its 4 pieces: a country name, an indicator name, and two years. We could then call the WorldBank type provider, along these lines:

type WB = WorldBankData.ServiceTypes
type Country = WB.Country
type Indicator = Runtime.WorldBank.Indicator

let findCountry (name:string) =
    wb.Countries 
    |> Seq.tryFind (fun c -> c.Name = name)

let findIndicator (name:string) (c:Country) =
    c.Indicators 
    |> Seq.tryFind (fun i -> i.Name = name)

let getValues (year1,year2) (indicator:Indicator) =
    [ for year in year1 .. year2 -> year, indicator.[year]]

We can then easily wrap this into a single function, like this:

let getSeries (country,indicator,year1,year2) =
    findCountry country
    |> Option.bind (findIndicator indicator)
    |> Option.map (getValues (year1,year2))
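
To see how the Option.bind / Option.map chain behaves without hitting the network, here is a minimal sketch that swaps the type provider for an in-memory Map. The data and its values are made up for illustration, and the three helper functions are simplified stand-ins for the ones above, not the actual bot code:

```fsharp
// Made-up, in-memory stand-in for the World Bank data:
// country name -> indicator name -> (year -> value)
let population = Map.ofList [ 2000, 1.0; 2001, 2.0 ]
let france = Map.ofList [ "Population, total", population ]
let data = Map.ofList [ "France", france ]

let findCountry (name:string) =
    data |> Map.tryFind name

let findIndicator (name:string) country =
    country |> Map.tryFind name

let getValues (year1,year2) (indicator:Map<int,float>) =
    [ for year in year1 .. year2 -> year, indicator |> Map.tryFind year ]

// Same shape as getSeries above: any missing piece short-circuits to None
let getSeries (country,indicator,year1,year2) =
    findCountry country
    |> Option.bind (findIndicator indicator)
    |> Option.map (getValues (year1,year2))
```

If the country or the indicator cannot be found, the whole pipeline returns None, and the caller never has to check intermediate results.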

Defining our language

This is a bit limiting, however. Imagine that we wanted to also support queries like "France, Germany, Italy, Total population, total GDP, 2000". We could of course pass in everything as lists, say,

 ["France";"Germany"], ["Total population"], [2000],

… but we'd have to then examine how many elements the list contains to make a decision. Also, more annoyingly, this allows for cases that should not be possible: ideally, we wouldn't want to even allow requests such as

[], [], [2000; 2010; 2020].

One simple solution is to carve out our own language, using F# Discriminated Unions. Instead of lists, we could, for instance, create a handful of types to represent valid arguments:

type PLACE =
    | COUNTRY of string
    | COUNTRIES of string list

type MEASURE =
    | INDICATOR of string

type TIMEFRAME = 
    | OVER of int * int
    | IN of int

This is much nicer: we can now clean up our API using pattern matching, eliminating a whole class of problems:

let cleanAPI (place:PLACE) (values:MEASURE) (timeframe:TIMEFRAME) =
    match (place, values, timeframe) with
    | COUNTRY(country), INDICATOR(indicator), OVER(year1,year2) ->             
        // do stuff
    | COUNTRIES(countries), INDICATOR(indicator), OVER(year1,year2) -> 
        // do different stuff
    | // etc...
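
As a runnable variation on the snippet above, here is a small sketch where each case simply returns a description of what would be charted. The describe function and its messages are hypothetical, made up for illustration; the point is that the compiler checks that every combination is handled:

```fsharp
type PLACE =
    | COUNTRY of string
    | COUNTRIES of string list

type MEASURE =
    | INDICATOR of string

type TIMEFRAME =
    | OVER of int * int
    | IN of int

// Hypothetical helper: describe the request instead of charting it
let describe (place:PLACE) (values:MEASURE) (timeframe:TIMEFRAME) =
    match (place, values, timeframe) with
    | COUNTRY(country), INDICATOR(indicator), OVER(year1,year2) ->
        sprintf "%s for %s, from %i to %i" indicator country year1 year2
    | COUNTRY(country), INDICATOR(indicator), IN(year) ->
        sprintf "%s for %s in %i" indicator country year
    | COUNTRIES(countries), INDICATOR(indicator), OVER(year1,year2) ->
        sprintf "%s for %s, from %i to %i" indicator (String.concat ", " countries) year1 year2
    | COUNTRIES(countries), INDICATOR(indicator), IN(year) ->
        sprintf "%s for %s in %i" indicator (String.concat ", " countries) year
```

Removing any one of the four cases produces an incomplete-match warning, which is exactly the kind of check the list-based version could not give us.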

Parsing user input

The only problem we are left with now is to break a raw string - the user request - into a tuple of arguments. If we have that, then we can compose all the pieces together, piping them into a function that will take a string and go all the way down to the type provider.

We are faced with a decision now: we can go the hard way, powering our way through this using Regex and string manipulation, or the easy way, using a parser like FParsec. Let's be lazy and smart!

Note to self: when using FParsec from a script file, make sure you #r FParsecCS before FParsec. I spent a couple of hours stuck trying to understand what I was doing wrong because of that one.

Simply put, FParsec is awesome. It allows you to define small functions to parse input strings, test them on small pieces of input, and compose them together into bigger and badder parsers. Let's illustrate: suppose that in our DSL, we expect user requests to contain a piece that looks like "IN 2010", or "OVER 2000 - 2010" to define the timeframe.

In the first case, we want to recognize the string “IN”, followed by spaces, followed by an integer; if we find that pattern, we want to retrieve the integer and create an instance of IN:

open FParsec

let pYear = spaces >>. pint32 .>> spaces
let pIn =
    pstring "IN" >>. pYear
    |>> IN

If we run the parser on a well-formed string, we get what we expect:

run pIn "IN  2000 "
> 
val it : ParserResult<TIMEFRAME,unit> = Success: IN 2000

If we pass in an incorrectly formed string, we get a nice error diagnosis:

run pIn "IN some year "
> 
val it : ParserResult<TIMEFRAME,unit> =
  Failure:
Error in Ln: 1 Col: 4
IN some year 
   ^
Expecting: integer number (32-bit, signed)

Beautiful! The second case is rather straightforward, too:

let pYears = 
    tuple2 pYear (pstring "-" >>. pYear)
     
let pOver = 
    pstring "OVER" >>. pYears
    |>> OVER

Passing in a well-formed string gives us back OVER(2000,2010):

run pOver "OVER 2000- 2010"
> 
val it : ParserResult<TIMEFRAME,unit> = Success: OVER (2000,2010)

Finally we can compose these together, so that when we encounter either IN 2000, or OVER 2000 - 2005, we parse this into a TIMEFRAME:

let pTimeframe = pOver <|> pIn
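
For comparison, here is roughly what the "hard way" could look like for the same timeframe piece, using only Regex from the .NET base library. This is a sketch, not code from the bot: tryParseTimeframe is a hypothetical name, and unlike FParsec it only tells us that parsing failed, not where or why:

```fsharp
open System.Text.RegularExpressions

type TIMEFRAME =
    | OVER of int * int
    | IN of int

// Hypothetical Regex-based counterpart of pTimeframe.
// \d{1,9} keeps the digits within the range of a 32-bit integer.
let tryParseTimeframe (input:string) =
    let over = Regex.Match(input, @"^OVER\s*(\d{1,9})\s*-\s*(\d{1,9})\s*$")
    let single = Regex.Match(input, @"^IN\s*(\d{1,9})\s*$")
    if over.Success then
        Some (OVER (int over.Groups.[1].Value, int over.Groups.[2].Value))
    elif single.Success then
        Some (IN (int single.Groups.[1].Value))
    else
        None
```

It works on the same inputs as the examples above, but every new piece of syntax means another hand-written pattern, and the patterns do not compose the way the small parsers do.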

I won't go into the construction of the full parser - you can just take a look here. The trickiest part was my own doing. I wanted to allow messages without quotes, that is,

COUNTRY France

and not

COUNTRY "France"

The second case is much easier to parse (look for any chars between ""), especially because there are indicators like, for instance, "Population, total". The parser is pretty hacky, but hey, it mostly works, so... ship it!

Ship it!

That's pretty much it. At that point, all the pieces are there. I ended up taking inspiration from (read: copy-pasting) the existing @fsibot code, using LinqToTwitter to deal with reading and writing to Twitter, and TopShelf to host the bot as a Windows service on an Azure VM, and voila! You can now tweet to @wbfacts, and get back a nice artisanal chart, hand-crafted just for you, with the freshest data from the World Bank.

A couple of quick final comments:

  • One of the most obvious issues with the bot is that Twitter offers very minimal support for IntelliSense (and by minimal, I mean 'none'). This is a problem, because we lose discoverability, a key benefit of type providers. To compensate for that, I added a super-crude string matching strategy, which will give a bit of flexibility around misspelled country or indicator names. This is actually a fun problem - I was a bit pressed for time, but I'll probably revisit it later.
  • In the same vein, it would be nice to add a feature like "find me an indicator with a name like GDP total". That should be reasonably easy to do, by extending the language to support instructions like HELP and / or INFO.
  • The bot seems like a perfect case for some Railway-Oriented Programming. Currently the wiring is pretty messy; for instance, our parsing step returns an option, and drops parsing error messages from FParsec. That message would be much more helpful to the user than our current message that only states that "parsing failed". With ROP, we should be able to compose a clean pipeline of functions, along the lines of parseArguments >> runArguments >> composeResponse.
  • The performance of looking up indicators by name is pretty terrible, at least on the first call on a country. You have been warned :)
  • That's right, there is no documentation. Not a single test, either. Tests show a disturbing lack of confidence in your coding skills. Also, I had to ship by December 22nd :)
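
To make the Railway-Oriented Programming idea a bit more concrete, here is a minimal sketch of what such a pipeline could look like, using the Result type (available in FSharp.Core from F# 4.1). The three functions reuse the names mentioned above, but their bodies, the Request type and the messages are hypothetical stand-ins, not the bot's actual code:

```fsharp
// Hypothetical, simplified request: one country, one year
type Request = { Country : string; Year : int }

let tryYear (text:string) =
    match System.Int32.TryParse(text) with
    | true, year -> Ok year
    | false, _ -> Error (sprintf "'%s' is not a valid year" text)

// Stand-in for the parser: expects "country, year"
let parseArguments (text:string) =
    match text.Split(',') with
    | [| country; year |] ->
        tryYear (year.Trim())
        |> Result.map (fun y -> { Country = country.Trim(); Year = y })
    | _ -> Error "expected: country, year"

// Stand-in for the World Bank query and charting step
let runArguments (request:Request) =
    if request.Country = ""
    then Error "missing country name"
    else Ok (sprintf "chart for %s in %i" request.Country request.Year)

// Both tracks end up as a message for the user
let composeResponse (result:Result<string,string>) =
    match result with
    | Ok chart -> sprintf "Here you go: %s" chart
    | Error message -> sprintf "Sorry: %s" message

let respond = parseArguments >> Result.bind runArguments >> composeResponse
```

The error message from whichever step failed first travels all the way to the response, instead of being flattened into a generic "parsing failed".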

That being said, in spite of its many, many warts, I am kind of proud of @wbfacts! It is ugly as hell, the code is full of duct-tape, the parser is wonky, and you should definitely not take this as ‘best practices’. I am also not quite clear on how the Twitter rate limits work, so I would not be entirely surprised if things went wrong in the near future… In spite of all this, hey, it kind of runs! Hopefully you find the code or what it does fun, and perhaps it will even give you some ideas for your own projects. In the meantime, I wish you all happy holidays!

I also wanted to say thanks to Tomas Petricek, for opening my eyes to discriminated unions as a modeling tool, and Phil Trelford for introducing me to FParsec, which is truly a thing of beauty. They can be blamed to an extent for inspiring this ill-conceived project, but whatever code monstrosity is in the repository is entirely my doing :)

And… ping me on Twitter as @brandewinder if you have questions or comments!

by Mathias 3. May 2015 16:34

I got curious the other day about how to measure the F# community growth, and thought it could be interesting to take a look at this through StackOverflow. As it turns out, it’s not too hard to get some data, because StackExchange exposes a nice API, which allows you to make all sorts of queries and get a JSON response back.

As a starting point, I figured I would just try to get the number of questions asked per month. The API allows you to retrieve questions on any site, by tag, between arbitrary dates. Responses are paged: you can get up to 100 items per page, and keep asking for next pages until there is nothing left to receive. That sounds like a perfect job for the FSharp.Data JSON Type Provider.

First things first, we create a type, Questions, by pointing the JSON Type Provider to a url that returns questions; based on the structure of the JSON document it receives, the Type Provider creates a type, which we will then be able to use to make queries:

#I"../packages"
#r @"FSharp.Data.2.2.0\lib\net40\FSharp.Data.dll"
open FSharp.Data
open System

[<Literal>]
let sampleUrl = "https://api.stackexchange.com/2.2/questions?site=stackoverflow"
type Questions = JsonProvider<sampleUrl>

Next, we’ll need to grab all the questions tagged F# between 2 given dates. As an example, the following would return the second page (questions 101 to 200) from all F# questions asked between January 1, 2015 and January 31, 2015:

https://api.stackexchange.com/2.2/questions?page=2&pagesize=100&fromdate=1420070400&todate=1422662400&tagged=F%23&site=stackoverflow

There are a couple of quirks here. First, the dates are in UNIX standard, that is, the number of seconds elapsed from January 1, 1970. Then, we need to keep pulling pages, until the response indicates that there are no more questions to receive, which is indicated by the HasMore property. That’s not too hard: let’s create a couple of functions, first to convert a .NET date to a UNIX date, and then to build up a proper query, appending the page and dates we are interested in to our base query – and finally, let’s build a request that recursively calls the API and appends results, until there is nothing left:

let fsharpQuery = "https://api.stackexchange.com/2.2/questions?site=stackoverflow&tagged=F%23&pagesize=100"

let unixEpoch = DateTime(1970,1,1)
let unixTime (date:DateTime) = 
    (date - unixEpoch).TotalSeconds |> int64

let page (page:int) (query:string) = 
    sprintf "%s&page=%i" query page
let between (from:DateTime) (``to``:DateTime) (query:string) = 
    sprintf "%s&fromdate=%i&todate=%i" query (unixTime from) (unixTime ``to``)

let questionsBetween (from:DateTime) (``to``:DateTime) =
    let baseQuery = fsharpQuery |> between from ``to``
    let rec pull results p = 
        let nextPage = Questions.Load (baseQuery |> page p)
        let results = results |> Array.append nextPage.Items
        if (nextPage.HasMore)
        then pull results (p+1)
        else results
    pull Array.empty 1
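
A quick way to double-check the date conversion is to run it against the two timestamps from the example URL above. The sketch below repeats unixEpoch and unixTime so that it runs on its own; fromUnixTime is a hypothetical helper added just for the check:

```fsharp
open System

let unixEpoch = DateTime(1970,1,1)
let unixTime (date:DateTime) =
    (date - unixEpoch).TotalSeconds |> int64

// Hypothetical inverse, handy for reading timestamps found in a URL
let fromUnixTime (seconds:int64) =
    unixEpoch.AddSeconds (float seconds)

// fromdate in the example URL
printfn "%i" (unixTime (DateTime(2015,1,1)))
// todate in the example URL
printfn "%s" ((fromUnixTime 1422662400L).ToString("yyyy-MM-dd HH:mm"))
```

This prints 1420070400 and 2015-01-31 00:00. Note that the upper bound is midnight at the start of January 31, so depending on what you want to count, passing the first day of the following month might be the safer choice.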

And we are pretty much done. At that point, we can for instance ask for all the questions asked in January 2015, and check what percentage were answered:

let january2015 = questionsBetween (DateTime(2015,1,1)) (DateTime(2015,1,31))

january2015 
|> Seq.averageBy (fun x -> if x.IsAnswered then 1. else 0.)
|> printfn "Average answer rate: %.3f"

… which produces a fairly solid 78%.

If you play a bit more with this, and perhaps try to pull down more data, you might experience (as I did) the Big StackExchange BanHammer. As it turns out, the API has usage limits (which is totally fair). In particular, if you ask for too much data, too fast, you will get banned from making requests, for a dozen hours or so.

This is not pleasant. However, in their great kindness, the API designers have provided a way to avoid it. When you are making too many requests, the response you receive will include a field named “backoff”, which indicates for how many seconds you should back off until you make your next call.

This got me stumped for a bit, because that field doesn’t show up by default on the response – only when you are hitting the limit. As a result, I wasn’t sure how to pass that information to the JSON Type Provider, until Max Malook helped me out (thanks so much, Max!). The trick here is to supply not one sample response to the type provider, but a list of samples, in that case, one without the backoff field, and one with it.

I carved out an artisanal, hand-crafted sample for the occasion, along these lines:

[<Literal>]
let sample = """
[{"items":[
   {"tags":["f#","units-of-measurement"],//SNIPPED FOR BREVITY}],
   "has_more":false,
   "quota_max":300,
   "quota_remaining":294},
 {"items":[
   {"tags":["f#","units-of-measurement"],//SNIPPED FOR BREVITY}],
   "has_more":false,
   "quota_max":300,
   "quota_remaining":294,
   "backoff":10}]"""

type Questions = JsonProvider<sample,SampleIsList=true>

… and everything is back in order – we can now modify the recursive request, causing it to sleep for a bit when it encounters a backoff. Not the cleanest solution ever, but hey, I just want to get data here:

let questionsBetween (from:DateTime) (``to``:DateTime) =
    let baseQuery = fsharpQuery |> between from ``to``
    let rec pull results p = 
        let nextPage = Questions.Load (baseQuery |> page p)
        let results = results |> Array.append nextPage.Items
        if (nextPage.HasMore)
        then
            match nextPage.Backoff with
            | Some(seconds) -> System.Threading.Thread.Sleep (1000*seconds + 1000)
            | None -> ignore ()
            pull results (p+1)
        else results
    pull Array.empty 1
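
The paging-with-backoff loop can also be tried without touching the API (and without risking the BanHammer), by passing in the function that fetches a page and the function that waits. This is a sketch under assumptions: the Page record is a hypothetical simplification of the type generated by the Type Provider, and fakeFetch returns made-up data:

```fsharp
// Hypothetical, simplified shape of one page of results
type Page<'a> = { Items : 'a []; HasMore : bool; Backoff : int option }

// fetch and wait are arguments, so the loop can run against fake data
let pullAll (fetch : int -> Page<'a>) (wait : int -> unit) =
    let rec pull results p =
        let nextPage = fetch p
        // honor the backoff, if any, before making another call
        nextPage.Backoff |> Option.iter wait
        let results = Array.append results nextPage.Items
        if nextPage.HasMore
        then pull results (p+1)
        else results
    pull Array.empty 1

// Fake source with 3 pages; page 2 asks for a 10 seconds backoff
let fakeFetch p =
    { Items = [| p * 10; p * 10 + 1 |]
      HasMore = p < 3
      Backoff = if p = 2 then Some 10 else None }

// Record the requested waits instead of actually sleeping
let waits = ResizeArray<int>()
let items = pullAll fakeFetch waits.Add
```

Against the real API, wait could be something like fun seconds -> System.Threading.Thread.Sleep (1000 * seconds + 1000), and fetch would wrap the call to Questions.Load.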

So what were the results? I decided, quite arbitrarily, to count questions month by month since January 2010. Here is what the results look like:

[Chart: number of F# questions asked on StackOverflow, month by month since January 2010]

Clearly, the trend is up – it doesn’t take an advanced degree in statistics to see that. It’s interesting also to see the slump around 2012-2013; I can see a similar pattern in the Meetup registration numbers in San Francisco. My sense is that after a spike in interest in 2010, when F# launched with Visual Studio, there hasn’t been much marketing push for the language, and interest eroded a bit, until serious community-driven efforts took place. However, I don’t really have data to back that up – this is speculation.

How this correlates to overall F# adoption is another question: while I think this curve indicates growth, the number of questions on StackOverflow is clearly a very indirect measurement of how many people actually use it, and StackOverflow itself is a distorted sample of the overall population. Would be interesting to take a similar look at GitHub, perhaps…

by Mathias 25. August 2013 08:54

About a month ago, FSharp.Data released version 1.1.9, which contains some very nice improvements – you can find them listed on Gustavo Guerra’s blog. I was particularly excited by the changes made to the CSV Type Provider, because they make my life digging through datasets even simpler, but couldn’t find the time to write about it, because of my recent cross-country peregrinations.

Now that I am back, let’s talk about why this update made me so happy, with a concrete example. My latest week-end project is an F# implementation of Random Forests; as part of the process, I am trying out the algorithm on various datasets, to get a sense for potential performance problems, and dog-food my own API, the best way I know to quickly spot suckiness.

One of the problems I ran into was the representation of missing values. Most datasets don’t come clean and ready to use – usually you’ll have a few records with missing data. I opted for what seemed the most straightforward representation in F#, and decided to represent every feature value as an Option – anything can either have Some value, or None.
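
As a small illustration of that representation, here is a sketch with made-up raw values, where an empty or unreadable cell becomes None. The tryInt helper is hypothetical, and not part of my Random Forests code:

```fsharp
// Hypothetical helper: a cell that cannot be read as an integer is missing
let tryInt (raw:string) =
    match System.Int32.TryParse(raw) with
    | true, value -> Some value
    | false, _ -> None

// Made-up raw column, with one missing value
let ages = [ "22"; ""; "38" ] |> List.map tryInt

// Only the known values take part in the computation
let known = ages |> List.choose id
let averageAge = known |> List.averageBy float
```

The missing value stays visible in the type, so any code using ages has to decide explicitly what to do with it.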

The original CSV Type Provider introduced a bit of friction there, because it inferred types “optimistically”: if the sample used contained only integers, it would create an integer, which is great in most cases, except when you want to be “pessimistic” (which is usually a safe world-view when setting expectations regarding data).

The new-and-improved CSV Type Provider fixes that, and introduces a few niceties. Case in point: the Kaggle Titanic dataset, which contains the Titanic’s passenger list. With the new version, extracting the data is as simple as this:

type DataSet = CsvProvider<"titanic.csv", 
                           Schema="PassengerId=int, Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse", 
                           SafeMode=true, 
                           PreferOptionals=true>

type Passenger = DataSet.Row

This is pretty awesome. In a couple of lines, just by passing in the path to my CSV file and some (optional) schema information, I get a Passenger type:

[Image: Titanic]

What’s neat here is that first, I immediately get a Passenger with properties – with the correct Optional types, thanks to SafeMode and PreferOptionals. Then, notice in the Schema the Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse bit? This renames “on the fly” the properties; instead of the pretty obscurely named Parch feature coming from the CSV file header, I get a nice and readable ParentsOrChildren property. The Type Provider even does a few more cool things, automagically; for instance, the feature “Survived”, which is encoded in the original dataset as 0 or 1, gets automatically converted to a boolean. Really nice.

And just like that, I can now use this CSV file, and send it to my (still very much in alpha version) Decision Tree classifier:

// We read the training set into an array,
// defining the Label we want to classify on:
let training =
    use data = new DataSet()
    [| for passenger in data.Data -> 
        passenger.Survived |> Categorical, // the label
        passenger |]
// We define what features should be used:
let features = [|
    "Sex", (fun (x:Passenger) -> x.Sex |> Categorical);
    "Class", (fun x -> x.Class |> Categorical); |]
// We run the classifier...
let classifier, report = createID3Classifier training features { DefaultID3Config with DetailLevel = Verbose }
// ... and display the resulting tree:
report.Value.Pretty()

… which produces the following results in the F# Interactive window:

> titanicDemo();;
├ Sex = male
│   ├ Class = 3 → False
│   ├ Class = 1 → False
│   └ Class = 2 → False
└ Sex = female
   ├ Class = 3 → False
   ├ Class = 1 → True
   └ Class = 2 → True
val it : unit = ()
>

The moral of the story here is threefold. First, it was a much better idea to be a rich lady on the Titanic, rather than a (poor) dude. Then, Type Providers are really awesome – in a couple of lines, we extracted from a CSV file a collection of Passengers, all of them statically typed, with all the benefits attached to that; in a way, this is the best of both worlds – access the data as easily as with a dynamic language, but with all the benefits of types. Finally, the F# community is just awesome – big thanks to everyone who contributed to FSharp.Data, and specifically to @ovatsus for the recent improvements to the CSV Type Provider!

You can find the full Titanic example here on GitHub.

by Mathias 14. April 2013 12:20

Last Thursday, I gave a talk at the Bay.NET user group in Berkeley, introducing F# to C# developers. First off, I have to thank everybody who came – you guys were great, lots of good questions, nice energy, I had a fantastic time!

My goal was to highlight why I think F# is awesome, and of course this had to include a Type Provider demo, one of the most amazing features of F# 3.0. So I went ahead, and demoed Tomas Petricek’s World Bank Type Provider, and Howard Mansell’s R Type Provider – together. The promise of Type Providers is to enable information-rich programming; in this case, we get immediate access to a wealth of data over the internet, in one line of code, entirely discoverable by IntelliSense in Visual Studio - and we can use all the visualization arsenal of R to see what’s going on. Pretty rad.

Rather than just dump the code, I thought it would be fun to turn that demo into a video. The result is a 7-minute clip, with only minor editing (a few cuts, and I sped up the video x3 because the main point here isn’t how terrible my typing skills are). I think it’s largely self-explanatory; the only points worth commenting upon are:

  • I am using a NuGet package for the R Type Provider that doesn’t officially exist yet. I figured a NuGet package would make that Type Provider more usable, and spent my week-end creating it, but haven’t published it yet. Stay tuned!
  • The most complex part of the demo is probably R’s syntax from hell. For those of you who don’t know R, it’s a free, open-source statistical package which does amazingly cool things. What you need to know to understand this video is that R is very vector-centric. You can create a vector in R using the syntax myData <- c(1,2,3,4), and combine vectors into what’s called a data frame, essentially a collection of features. The R type provider exposes all R packages and functions through a single static type, aptly named R – so for instance, one can create an R vector from F# by typing let myData = R.c( [|1; 2; 3; 4 |]).

That’s it! Let me know what you think, and if you have comments or questions.
