Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 22. March 2015 17:27

I will admit it, I got a bit upset by James McCaffrey’s column in MSDN magazine this month, “Gradient Descent Training Using C#”. While the algorithm explanations are quite good, I was disappointed by the C# sample code, and kept thinking to myself “why oh why isn’t this written in F#”. This is by no means intended as a criticism of C#; it’s a great language, but some problems are just better suited for different languages, and in this case, I couldn’t fathom why F# wasn’t used.

Long story short, I just couldn’t let it go, and thought it would be interesting to take that C# code, and do a commented rewrite in F#. I won’t even go into why the code does what it does – the article explains it quite well – but will instead purely focus on the implementation, and will try to keep it reasonably close to the original, at the expense of some additional nifty things that could be done.

The general outline of the code follows two parts:

  • Create a synthetic dataset, creating random input examples, and computing the expected result using a known function,
  • Use gradient descent to learn the model parameters, and compare them to the true value to check whether the method is working.

You can download the original C# code here. Today we’ll focus only on the first part, which is mainly contained in two methods, MakeAllData and MakeTrainTest:

static double[][] MakeAllData(int numFeatures, int numRows, int seed)
{
  Random rnd = new Random(seed);
  double[] weights = new double[numFeatures + 1]; // inc. b0
  for (int i = 0; i < weights.Length; ++i)
    weights[i] = 20.0 * rnd.NextDouble() - 10.0; // [-10.0 to +10.0]

  double[][] result = new double[numRows][]; // allocate matrix
  for (int i = 0; i < numRows; ++i)
    result[i] = new double[numFeatures + 1]; // Y in last column

  for (int i = 0; i < numRows; ++i) // for each row
  {
    double z = weights[0]; // the b0 
    for (int j = 0; j < numFeatures; ++j) // each feature / column except last
    {
      double x = 20.0 * rnd.NextDouble() - 10.0; // random X in [10.0, +10.0]
      result[i][j] = x; // store x
      double wx = x * weights[j + 1]; // weight * x 
      z += wx; // accumulate to get Y
    }
    double y = 1.0 / (1.0 + Math.Exp(-z));
    if (y > 0.55)  // slight bias towards 0
      result[i][numFeatures] = 1.0; // store y in last column
    else
      result[i][numFeatures] = 0.0;
  }
  Console.WriteLine("Data generation weights:");
  ShowVector(weights, 4, true);

  return result;
}

MakeAllData takes a number of features and rows, and a seed for the random number generator so that we can replicate the same dataset repeatedly. The dataset is represented as an array of array of doubles. The first columns, from 0 to numFeatures – 1, contain random numbers between –10 and 10. The last column contains a 0 or a 1. What we are after here is a classification model: each row can take two states (1 or 0), and we are trying to predict them from observing the features. In our case, that value is computed using a logistic model: we have a set of weights (which we also generate randomly), corresponding to each feature, and the output is

logistic [ x1; x2; … xn ] = 1.0 / (1.0 + exp ( - (w0 * 1.0 + w1 * x1 + w2 * x2 + … + wn * xn))

Note that w0 plays the role of a constant term in the equation, and is multiplied by 1.0 all the time. This is adding some complications to a code where indices are already flying left and right, because now elements in the weights array are mis-aligned by one element with the elements in the features array. Personally, I also don’t like adding another column to contain the predicted value, because that’s another implicit piece of information we have to remember.

In that frame, I will make two minor changes here, just to keep my sanity. First, as is often done, we will insert a column containing just 1.0 in each observation, so that the weights and features are now aligned. Then, we will move the 0s and 1s outside of the features array, to avoid any ambiguity.

Good. Instead of creating a Console application, I’ll simply go for a script. That way, I can just edit my code and check live whether it does what I want, rather than recompile and run every time.

Let’s start with the weights. What we are doing here is simply creating an array of numFeatures + 1 elements, populated by random values between –10.0 and 10.0. We’ll go a bit fancy here: given that we are also generating random numbers the same way a bit further down, let’s extract a function that generates numbers uniformly between a low and high value:

let rnd = Random(seed)
let generate (low,high) = low + (high-low) * rnd.NextDouble()
let weights = Array.init (numFeatures + 1) (fun _ -> generate(-10.0,10.0))

The next section is where things get a bit thornier. The C# code creates an array, then populates it row by row, first filling in the columns with random numbers, and then applying the logistic function to compute the value that goes in the last column. We can make that much clearer, by extracting that function out. The logistic function is really doing 2 things:

  • first, the sumproduct of 2 arrays,
  • and then, 1.0/(1.0 + exp ( – z ).

That is easy enough to implement:

let sumprod (v1:float[]) (v2:float[]) =
    Seq.zip v1 v2 |> Seq.sumBy (fun (x,y) -> x * y)

let sigmoid z = 1.0 / (1.0 + exp (- z))

let logistic (weights:float[]) (features:float[]) =
    sumprod weights features |> sigmoid

We can now use all this, and generate a dataset by simply first creating rows of random values (with a 1.0 in the first column for the constant term), applying the logistic function to compute the value for that row, and return them as a tuple:

open System

let sumprod (v1:float[]) (v2:float[]) =
    Seq.zip v1 v2 |> Seq.sumBy (fun (x,y) -> x * y)

let sigmoid z = 1.0 / (1.0 + exp (- z))

let logistic (weights:float[]) (features:float[]) =
    sumprod weights features |> sigmoid

let makeAllData (numFeatures, numRows, seed) =

    let rnd = Random(seed)
    let generate (low,high) = low + (high-low) * rnd.NextDouble()
    let weights = Array.init (numFeatures + 1) (fun _ -> generate(-10.0,10.0))
    
    let dataset = 
        [| for row in 1 .. numRows ->
            let features = 
                [|  
                    yield 1.0 
                    for feat in 1 .. numFeatures -> generate(-10.0,10.0) 
                |]
            let value = 
                if logistic weights features > 0.55 
                then 1.0 
                else 0.0
            (features, value)
        |]

    weights, dataset

Done. Let’s move to the second part of the data generation, with the MakeTrainTest method. Basically, what this does is take a dataset, shuffle it, and split it in two parts, 80% which we will use for training, and 20% we leave out for validation.

static void MakeTrainTest(double[][] allData, int seed,
  out double[][] trainData, out double[][] testData)
{
  Random rnd = new Random(seed);
  int totRows = allData.Length;
  int numTrainRows = (int)(totRows * 0.80); // 80% hard-coded
  int numTestRows = totRows - numTrainRows;
  trainData = new double[numTrainRows][];
  testData = new double[numTestRows][];

  double[][] copy = new double[allData.Length][]; // ref copy of all data
  for (int i = 0; i < copy.Length; ++i)
    copy[i] = allData[i];

  for (int i = 0; i < copy.Length; ++i) // scramble order
  {
    int r = rnd.Next(i, copy.Length); // use Fisher-Yates
    double[] tmp = copy[r];
    copy[r] = copy[i];
    copy[i] = tmp;
  }
  for (int i = 0; i < numTrainRows; ++i)
    trainData[i] = copy[i];

  for (int i = 0; i < numTestRows; ++i)
    testData[i] = copy[i + numTrainRows];
}

Again, there is a ton of indexing going on, which in my old age I find very hard to follow. Upon closer inspection, really, the only thing complicated here is the Fischer-Yates shuffle, which takes an array and randomly shuffles the order. The rest is pretty simply – we just want to shuffle, and then split into two arrays. Let’s extract the shuffle code (which happens to also be used and re-implemented later on):

let shuffle (rng:Random) (data:_[]) =
    let copy = Array.copy data
    for i in 0 .. (copy.Length - 1) do
        let r = rng.Next(i, copy.Length)
        let tmp = copy.[r]
        copy.[r] <- copy.[i]
        copy.[i] <- tmp
    copy

We went a tiny bit fancy again here, and made the shuffle work on generic arrays; we also pass in the Random instance we want to use, so that we can control / repeat shuffles if we want, by passing a seeded Random. Does this work? Let’s check in FSI:

> [| 1 .. 10 |] |> shuffle (Random ());;
val it : int [] = [|6; 7; 2; 10; 8; 5; 4; 9; 3; 1|]

Looks reasonable. Let’s move on – we can now implement the makeTrainTest function.

let makeTrainTest (allData:_[], seed) =

      let rnd = Random(seed)
      let totRows = allData.Length
      let numTrainRows = int (float totRows * 0.80) // 80% hard-coded

      let copy = shuffle rnd allData
      copy.[.. numTrainRows-1], copy.[numTrainRows ..]

Done. A couple of remarks here. First, F# is a bit less lenient than C# around types, so we have to be explicit when converting the number of rows to 80%, first to float, then back to int. As an aside, this used to annoy me a bit in the beginning, but I have come to really like having F# as this slightly psycho-rigid friend who nags me when I am taking a dangerous path (for instance, dividing two integers and hoping for a percentage).

Besides that, I think the code is markedly clearer. The complexity of the shuffle has been nicely contained, and we just have to slice the array to get a training and test sets. As an added bonus, we got rid of the out parameters, and that always feels nice and fuzzy.

I’ll leave it at for today; next time we’ll look at the second part, the learning algorithm itself. Before closing shop, let me make a couple of comments. First, the code is a tad shorter, but not by much. I haven’t really tried, and deliberately made only the changes I thought were needed. What I like about it, though, is that all the indexes are gone, except for the shuffle. In my opinion, this is a good thing. I find it difficult to keep it all in my head when more than one index is involved; when I need to also remember what columns contain special values, I get worried – and just find it hard to figure out what is going on. By contrast, I think makeTrainTest, for instance, conveys pretty directly what it does. makeAllData, in spite of some complexity, also maps closely the way I think about my goal: “I want to generate rows of inputs” – this is precisely what the code does. There is probably an element of culture to it, though; looping over arrays has a long history, and is familiar to every developer, and what looks readable to me might look entirely weird to some.

Easier, or more complicated than before? Anything you like or don’t like – or find unclear? Always interested to hear your opinion! Ping me on Twitter if you have comments.

by Mathias 4. December 2011 15:47

I am in the middle of “Working Effectively with Legacy Code”, and found it every bit as great as it was said to be. In the book, Feathers introduces the concept of Seams and Enabling Points:

a Seam is a place where you can alter behavior in your program without editing it in that place

every seam has an enabling point, a place where you can make the decision to use one behavior or another.

The idea - as I understand it - is that an enabling point is a hook for testability, a place where you can replace the behavior of a piece of code with your own controlled behavior, and validate that the results are as expected.

The reason I am bringing this up is that I have been writing lots of F# lately, and it made me realize that a functional style provides lots of enabling points, and can be much easier to test than object-oriented code.

Here is a simplified, but representative, example of the problem I was looking at: I needed to pick a random item in a list. In C#, a method along these lines would do the job:

public T PickFrom(IList<T> list)
{
   var random = new Random();
   return list[random.Next(list.Count())];
}

However, this code is utterly untestable; it’s also probably a terrible idea to instantiate a new Random every time this is called, so we modify it this way:

public T PickFrom(IList<T> list, Random random)
{
   return list[random.Next(list.Count())];
}

This is much better: now we have a decent Enabling Point, because the list of arguments of the method contains everything that is used inside the method. However, this is still untestable, but for a different reason: by definition, Random.Next() will return different values every time PickFrom is called, and expecting a repeatable result from PickFrom is a bit of a desperate enterprise.

More...

by Mathias 27. October 2010 15:14

When it comes to design discussions, I have been described as “opinionated” (at times, harsher words like pig-headed or argumentative – or way worse - have been substituted). It isn’t that I think I am usually right, and  I am normally a pretty gentle person, but I firmly believe in Socratic method for identifying well-formulated, clear and transparent solutions.

Two ingredients are required to make it work: an unambiguous statement, and a contradictor, who will poke at the statement until it either crumbles, or gets refined until it is self-evident, with clearly identified qualities and limitations. So I like to take somewhat extreme positions in design discussions, either in my proposals, or in my criticism. It has nothing to do with how much I believe them to be right, and everything to do with understanding what is at stake in the discussion and where the tensions are.

Unfortunately, it has also gotten me into trouble at times, because, well, discussions can get heated. I will honestly try my best to find issues with any design I am presented with, and push for clarification – just like I really appreciate when people do the same with mine. It’s a tricky exercise, because while these are ideas that are under fire, it’s often difficult not to take the criticism personally. I hope to get better one day at sensing when that emotional line is being crossed, so that an honest and open discussion can be maintained.

Anyways, the reason for this rant is that I am finally reading The Structure of Scientific Revolutions, by Thomas S. Kuhn - and I am enjoying it tremendously. The book is mostly concerned with science, but some of the ideas resonate deeply with me, and in my opinion extend beyond science. The two following gems are lifted from the book:

Truth emerges more readily from error than from confusion.

Sir Francis Bacon, quoted in Kuhn

… Novelty ordinarily emerges only for the man who, knowing with precision what he should expect, is able to recognize that something has gone wrong. Anomaly appears only against the background provided by the paradigm. The more precise and far-reaching that paradigm is, the more sensitive an indicator it provides of anomaly […] By ensuring that the paradigm will not be too easily surrendered, resistance guarantees that scientists will not be lightly distracted…

Thomas S. Kuhn

I could not agree more.

by Mathias 26. September 2010 08:12

The most awesome free developer event on the west coast is back again, and it looks like it will be even more awesome this year. The 5th edition of Silicon Valley Code Camp is taking place Saturday and Sunday Oct 9th and 10th, 2010, at Foothill College, in Los Altos Hill. There are 194 sessions, already over 1,600 registrations, with topics for all tastes - .NET, Java, Cloud Computing, Python, Javascript, mobile development, and more.

The 2010 schedule is up here. I will be giving 2 talks, on Test-Driven Development, and on Mocking. Both are on Sunday – hope to see you there!

by Mathias 20. May 2010 06:43

dotlesscss I gave a quick Firestarter talk at the San Francisco .Net user group yesterday about .less. .less (‘dot-less’) is an open source .Net library, which extends the functionality of CSS (it works with your existing CSS files), adding features like variables, using a syntax close to CSS. If you find that working with CSS causes some teeth-grinding , you should probably have a look!

Download the slide deck here

Comments

Comment RSS