Code for the Dark Worlds competition

Here’s the code for my entry in the Dark Worlds competition.

- Run ‘test_on_train.m’ to test the algorithm on the training data. The settings in this file are such that you will get a good result in about 15 minutes on a quad-core PC.

- Run ‘predict_leaderboard.m’ to reproduce my winning submission for the competition. This requires a lot more running time for a relatively modest gain in accuracy.

Get the code.

Observing Dark Worlds

Kaggle recently ran another great competition, which I was very fortunate to win. The goal of this competition: detect clouds of dark matter floating around the universe through their effect on the light emitted by background galaxies.

From the competition website:

There is more to the Universe than meets the eye. Out in the cosmos exists a form of matter that outnumbers the stuff we can see by almost 7 to 1, and we don’t know what it is. What we do know is that it does not emit or absorb light, so we call it Dark Matter. Such a vast amount of aggregated matter does not go unnoticed. In fact we observe that this stuff aggregates and forms massive structures called Dark Matter Halos. Although dark, it warps and bends spacetime such that any light from a background galaxy which passes close to the Dark Matter will have its path altered and changed. This bending causes the galaxy to appear as an ellipse in the sky.

The task is then to use this “bending of light” to estimate where in the sky this dark matter is located.

Dark Matter bending the light from a background galaxy.

Although the description makes this sound like a physics problem, it is really one of statistics: given the noisy data (the elliptical galaxies), recover the model and parameters (position and mass of the dark matter) that generated them. After recovering these dark matter halos, their positions could then be uploaded to the Kaggle website, where a complicated loss function was used to calculate the accuracy of our estimates.

Bayesian analysis provided the winning recipe for solving this problem:

  1. Construct a prior distribution for the halo positions p(x), i.e. formulate our expectations about the halo positions before looking at the data.
  2. Construct a probabilistic model for the data (observed ellipticities of the galaxies) given the positions of the dark matter halos: p(e|x).
  3. Use Bayes’ rule to get the posterior distribution of the halo positions: p(x|e) = p(e|x)p(x)/p(e), i.e. use the data to guess where the dark matter halos might be.
  4. Minimize the expected loss with respect to the posterior distribution over the predictions for the halo positions: \hat{x} = \arg\min_{\text{prediction}} \mathbb{E}_{p(x|e)} L(\text{prediction},x), i.e. tune our predictions to be as good as possible for the given error metric.

For step 1, I simply assumed that the dark matter halos were distributed uniformly at random across the sky. Step 2 is more complicated. Fortunately the competition organizers provided us with a set of training skies for which the positions of the dark matter halos were known, as well as a summary of the physics behind it all. After reading through the tutorials and forum posts it became clear that the following model should be reasonable:

p(e_{i}|x) = N(\sum_{j = \text{all halos}} d_{i,j}m_{j}f(r_{i,j}), \sigma^{2}),

where N() denotes the normal distribution, d_{i,j} is the tangential direction, i.e. the direction in which halo j bends the light of galaxy i, m_{j} is the mass of halo j, and f(r_{i,j}) is a decreasing function of the Euclidean distance r_{i,j} between galaxy i and halo j.

After looking at the data I fixed the variance of the Gaussian distribution \sigma^{2} at 0.05. Like most competitors I also noticed that all skies seemed to have a single large halo, and that the other halos were much smaller. For the large halo I assigned the halo mass m a log-uniform distribution between 40 and 180, and I set f(r_{i,j}) = 1/\text{max}(r_{i,j},240). For the small halos I fixed the mass at 20, and I used f(r_{i,j}) = 1/\text{max}(r_{i,j},70). The resulting model is likely to be overly simplistic but it seems to capture most of the signal that is present in the data. In addition, keeping the model simple protected me against overfitting the data. Note that I assumed that the galaxy positions were independent of the halo positions, although it turns out this may not have been completely accurate.
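
To make the model above concrete, here is a minimal Matlab sketch of the corresponding log-likelihood. It is an illustration rather than my actual competition code: the variable names (gal_xy, e_obs, halo_xy, halo_m, r_floor) are my own, and the sign convention used for the tangential direction d_{i,j} is an assumption.

function ll = dark_worlds_loglik(gal_xy, e_obs, halo_xy, halo_m, r_floor)
% gal_xy:  n x 2 galaxy positions,   e_obs:  n x 2 observed ellipticities (e1,e2)
% halo_xy: k x 2 halo positions,     halo_m: k x 1 halo masses
% r_floor: k x 1 distance floor in f(r) = 1/max(r, r_floor)
sigma2 = 0.05;                          % noise variance, fixed after looking at the data
e_pred = zeros(size(e_obs));
for j = 1:size(halo_xy,1)
    dx  = gal_xy(:,1) - halo_xy(j,1);
    dy  = gal_xy(:,2) - halo_xy(j,2);
    r   = sqrt(dx.^2 + dy.^2);          % distance between each galaxy and halo j
    phi = atan2(dy, dx);                % angle of the galaxy-halo direction
    f   = 1 ./ max(r, r_floor(j));      % decreasing function of distance
    % add the tangential contribution d_{i,j} * m_j * f(r_{i,j}) of halo j
    e_pred(:,1) = e_pred(:,1) - halo_m(j) * f .* cos(2*phi);
    e_pred(:,2) = e_pred(:,2) - halo_m(j) * f .* sin(2*phi);
end
ll = -sum(sum((e_obs - e_pred).^2)) / (2*sigma2);   % Gaussian log-likelihood, up to a constant
end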

After completing steps 1 and 2, steps 3 and 4 are simply a matter of implementation: I chose to use a simple random-walk Metropolis-Hastings sampler to approximate the posterior distribution in step 3. The optimization in step 4 was done using standard gradient-based optimization, with random restarts to avoid local minima.
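
As an illustration of step 3, below is a minimal random-walk Metropolis-Hastings sketch in Matlab. It is not my exact sampler: log_posterior is a hypothetical function handle (for instance the log-likelihood sketch above plus a flat prior over the sky), and the step size is an assumption that needs tuning.

function samples = sample_halo_positions(log_posterior, x0, n_samples, step)
% x0: initial k x 2 halo positions, step: standard deviation of the random walk
samples = zeros(n_samples, numel(x0));
x  = x0;
lp = log_posterior(x);
for s = 1:n_samples
    x_prop  = x + step * randn(size(x));   % symmetric random-walk proposal
    lp_prop = log_posterior(x_prop);
    if log(rand) < lp_prop - lp            % Metropolis acceptance test
        x  = x_prop;
        lp = lp_prop;
    end
    samples(s,:) = x(:)';                  % store the current halo positions
end
end

A call such as sample_halo_positions(@(h) dark_worlds_loglik(gal_xy, e_obs, h, halo_m, r_floor), halo_init, 5000, 50) then returns posterior draws that the step-4 optimization can be run against (the number of samples and step size here are arbitrary).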

As I remarked in the competition forums, the outcome of this competition was noisier than usual: final prediction accuracy was judged on a set of only 90 cases, with an evaluation metric that is very sensitive to small (angular) perturbations of the predictions. The public leaderboard standings were even more random, being based on only 30 cases. In fact, the 1.05 public score of my winning submission was only about average on the public leaderboard. All of this means I was very lucky indeed to win this competition. Nevertheless, runner-up Iain Murray seems to have taken a very similar approach, suggesting there is at least something to be said for looking at this kind of problem from a Bayesian perspective.

Finally, I would like to thank the organizers Dave and Tom, and sponsor Winton Capital for organizing a great competition. Looking at a problem different from the standard regression/classification problems was very refreshing.

I will post the Matlab code for my solution sometime later this week.

A Shiny new way of communicating Bayesian statistics

Bayesian data analysis follows a very simple and general recipe:

  1. Specify a model and likelihood, i.e. what process do you think is generating your data?
  2. Specify a prior distribution, i.e. quantify what you know about a problem before having seen the data.
  3. Apply Bayes’ rule and out comes the posterior distribution, i.e. what you know about the problem after having seen the data. This is typically what you would report as the result of your analysis.

Because of step 2, Bayesian methods are sometimes criticized for being “subjective”: you and I may have different ideas about a particular problem and therefore we may have different prior distributions, leading to different posterior results in step 3. Such criticism is largely unfounded since step 1 almost always has much more influence on the results and since there exist good methods for specifying “objective priors”. Nevertheless it does point to a problem in communicating the results of a Bayesian data analysis: Should I report only the results obtained using my personal (subjective) prior? Should I try to make the analysis as objective as possible by specifying an “uninformative” prior? Or should I report the results obtained under various different prior specifications? This problem often comes up in practice, with one of the most common questions in academic referee reports being: “Are your results robust to changing the prior distribution on variable X?”.

Ideally, the reader of a data analysis report should be able to easily obtain the posterior results under different prior specifications without having to repeat the complete analysis. In his book, Bayesian econometrician John Geweke points out that this is often easy to do: if the reader has access to draws from the posterior distribution resulting from the data analysis, the results under a different prior can be obtained by simply re-weighting these draws using the principle of importance sampling. Although this is a simple and powerful idea, I have never seen it being used in practice. Presumably that is because until recently such functionality was difficult to offer online, and required running specialized software such as R or Matlab on the reader’s PC. All of this changed, however, with the recent announcement of a new R package called ‘Shiny’ by the folks at RStudio. In their own words: “Shiny makes it super simple for R users like you to turn analyses into interactive web applications that anyone can use.”
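
To make Geweke’s trick concrete, here is a minimal sketch of the re-weighting step, written in Matlab for consistency with the rest of this blog (the same idea is only a few lines of R); the function name and inputs are illustrative and not part of any package.

function [w, new_mean] = reweight_draws(draws, log_prior_old, log_prior_new)
% draws: n x d matrix of posterior draws obtained under the old prior
% log_prior_old, log_prior_new: function handles returning the log prior density of a draw
log_w = zeros(size(draws,1), 1);
for s = 1:size(draws,1)
    log_w(s) = log_prior_new(draws(s,:)) - log_prior_old(draws(s,:));  % importance ratio
end
w = exp(log_w - max(log_w));   % subtract the maximum before exponentiating, for numerical stability
w = w / sum(w);                % self-normalized importance weights
new_mean = w' * draws;         % for example: the posterior mean under the new prior
end

Any posterior summary (means, quantiles, or a histogram like the one below) can then be recomputed from the weighted draws without rerunning the sampler.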

Today I have been playing around with Shiny, and I can say: they’re not lying! I produced the interactive graph below using just a few lines of R code. The graph is from my recent paper analyzing economic growth data for a cross-section of countries. One of the problems in studying economic growth is that we have relatively little data and many potential explanatory variables: in my case, 88 observations of economic growth rates and 67 explanatory variables. Because it is hard to be sure which and how many of these explanatory variables to include in our model, it is customary to perform Bayesian model averaging, which averages over all possible selections of variables. The graph below shows the posterior distribution of the model size in this model averaging, i.e. the distribution of the number of explanatory variables that ended up being included in the regression models. In my paper I show this graph for a flat prior on the model size (represented by the black line in the graph), but really this choice is just as arbitrary as any other. By playing with the sliders at the bottom of the graph you can now easily see how this posterior distribution changes with the prior specification:

Refresh your browser if the graph is not showing.

The graph above is being generated by an R script that is currently being hosted free of charge under beta-testing on the Shiny server platform, and will stay up at least until they decide to charge me ;-) Overall I’m very impressed with the capabilities and ease of use offered by this package and I’m looking forward to seeing how other data scientists will use it to present data analyses in an interactive way.

Gibbs sampling with Julia

Some time ago I wrote a post about Gibbs sampling using Matlab. There I showed that the JIT compiler of Matlab can get you very close to the speed of compiled C code if you know what you are doing, but that it is easy to screw up and get a very slow program as a result. More recently, I came across a new scientific programming language called Julia, which seems to be designed specifically with this kind of JIT compilation in mind. So I put Julia to the test, using the slow version of the Gibbs sampler from my last post:

function gibbs2(n, thin)
   x_samp = zeros(n,1)
   y_samp = zeros(n,1)
   x=0.0
   y=0.0
   for i=1:n
      for j=1:thin
         x=randg(3)/(y^2+4)              # x | y ~ Gamma(shape 3, rate y^2+4)
         y=1/(1+x)+randn()/sqrt(2*x+2)   # y | x ~ N(1/(1+x), 1/(2x+2))
      end
      x_samp[i] = x
      y_samp[i] = y
   end
   return x_samp, y_samp
end
julia> @elapsed gibbs2(50000,1000)
7.6084020137786865

The first thing to notice is that the Julia code looks very similar to the Matlab code from my last post, which is great for people familiar with that language. The second thing is that it is very fast and does not suffer from Matlab’s dumb JIT problems. In fact, this Julia code is even faster than the ‘fast version’ of the Matlab code from my last post. Way to go Julia!

Top Lecturer Award & Kaggle-in-Class

I was very fortunate and pleasantly surprised to receive my school’s ‘Top Lecturer 2011’ award last week. The award is an encouragement for innovation in education. Among other reasons, I received the award because of how much the students of last year’s ‘Econometrie 2’ course liked the prediction contest I organized as a class assignment. The competition required the students to model a given ‘training’ data set in order to predict an unseen ‘testing’ data set, using the methods that were covered during the course, with the winners of the competition receiving a bonus to their grade. The contest was set up using Kaggle-in-Class: a free service by prediction competition company Kaggle, which makes this process extremely easy. If you are involved in teaching any courses on statistics or data analysis, I would highly recommend having a look at Kaggle-in-Class!

In addition to Kaggle, I would also like to thank Dennis Fok and Jan Brinkhuis for supporting my development as a teacher, and of course the students who worked hard and made the contest a great experience for everybody!

The power of JIT compilation

Darren Wilkinson has an interesting blog post, comparing execution times of a Gibbs sampler implemented in various different programming languages. His sampler generates draws from the following joint distribution

f(x,y) \propto x^{2}\exp(-xy^{2}-y^{2}+2y-4x)

by drawing iteratively from the full conditional distributions

x|y \sim \text{Gamma}(3,y^{2}+4)

and

y|x \sim N\left(\frac{1}{1+x},\frac{1}{2(1+x)}\right).

Darren presents different implementations of this sampler in R, Python, C, Java, Scala, and Groovy, and then compares their running times. You will have to take a look at Darren’s blog post for the complete results, but one thing they show is that popular interpreted languages such as R and Python are many times slower than compiled C code.

With this in mind I decided to give his sampler a try myself, using one of my favorite languages for quick prototyping: Matlab. Here’s the code and corresponding running time:

function [x_samp,y_samp] = gibbs(n,thin)

x_samp = zeros(n,1);
y_samp = zeros(n,1);

y=0;
for i=1:n
    gammarands=randg(3,thin,1);   % pre-generate 'thin' Gamma(3,1) draws in one call
    normrands=randn(thin,1);      % pre-generate 'thin' standard normal draws in one call

    for j=1:thin
        x=gammarands(j)/(y^2+4);              % x | y ~ Gamma(shape 3, rate y^2+4)
        y=1/(1+x)+normrands(j)/sqrt(2*x+2);   % y | x ~ N(1/(1+x), 1/(2x+2))
    end

    x_samp(i) = x;
    y_samp(i) = y;

end
end
tic
gibbs(5e4,1e3);
toc
Elapsed time is 10.027671 seconds.

Although Matlab is also an interpreted language, this running time of 10 seconds is remarkably close to the ~8 seconds Darren reports for compiled C code (using a PC very similar to mine). The reason that the code above is so fast is that Matlab internally compiles the inner loop of the ‘gibbs’ function, a trick called ‘Just-In-Time’ (JIT) compilation. To obtain the random numbers used by the function, Matlab relies on its fast built-in ‘randn’ and ‘randg’ routines, which are only called once every 1000 iterations of the inner loop (1000 being the value of the ‘thin’ parameter). Despite its similar speed, coding and running the Matlab example above is much easier than coding, compiling, and running the corresponding C code.

The Matlab JIT compiler is powerful, but not very smart: it can only compile very simple operations like those found in the inner loop of the code above. This means that we have to be very careful with the operations we perform in the inner loop of a function. The example below shows what happens when we move the random number generation (‘randn’ and ‘randg’) to the inner loop:

function [x_samp,y_samp] = gibbs2(n,thin)

x_samp = zeros(n,1);
y_samp = zeros(n,1);

y=0;
for i=1:n

    for j=1:thin
        x=randg(3,1,1)/(y^2+4);        % x | y ~ Gamma(shape 3, rate y^2+4), drawn one at a time
        y=1/(1+x)+randn/sqrt(2*x+2);   % y | x ~ N(1/(1+x), 1/(2x+2)), also one draw per iteration
    end

    x_samp(i) = x;
    y_samp(i) = y;
end
end
tic
gibbs2(5e4,1e3);
toc
Elapsed time is 218.589886 seconds.

The new code takes more than 20 times as long! The problem with the new code is that Matlab does not know what to do with the ‘randn’ and ‘randg’ functions until it executes them, and is therefore no longer able to compile the inner loop of the algorithm.

To summarize: compiled code can be much faster than interpreted code for number crunching, but if we are smart and use JIT compilation well, we can get remarkably close!

Note that many other languages besides Matlab also support some form of JIT compilation: Darren Wilkinson’s post has a nice example of speeding things up by JIT compiling Python code.

Prediction contests in the media

The Deloitte/FIDE Chess Rating Challenge has been getting quite some attention in the media. Competition organizer Jeff Sonas has a nice article on Chessbase.com, which does a great job of explaining the competition. The Australian TV channel ABC broadcast a great show on Kaggle, the company that hosts these prediction competitions:

ABC Catalyst show on Kaggle

If you watch very very carefully you might even catch a glimpse of me! ;-)

Winning the “Don’t Overfit!” competition

Between February 28, 2011 and May 15, 2011, Kaggle hosted a prediction competition entitled “Don’t Overfit!”. The goal of this competition was to develop a model that would predict well in a setting where you have little data and many explanatory variables. I was lucky enough to end up winning one of the two parts of the competition, as well as being the overall winner. Below you can find a cross-post of the description of my winning entry:

The data set for this competition was an artificial data set created by competition organizer Phil Brierley, consisting of 250 observations of a binary “target” variable and 200 different “explanatory variables”. The goal was to model the relationship between the explanatory variables and the targets in order to predict another 19750 holdout target variables. To do this well, one has to avoid the trap of “overfitting”, i.e. creating a model with a good in-sample fit but with poor predictive performance. The question “how do we prevent overfitting?” has been asked again and again, but in my opinion the answer has been known since Laplace: use Bayesian analysis with a sensible prior.

A Bayesian analysis of any problem consists of two steps:

1. Formulate your prior guess about how the data was generated in terms of a probability distribution. In this case let’s call that distribution p(T), with T the full 20,000×1 vector of targets.

2. Condition on the observed data to make predictions, i.e. construct the posterior distribution p(T_predict | T_observed) where T_observed is now the 250×1 vector of observed targets. The predictions can then be obtained by minimizing the expected loss under this posterior distribution. I did not have time to properly look into the AUC measure used to judge accuracy in this competition, but I guessed it would be (near) optimal to just use the conditional expectations E(T_predict | T_observed) as my predictions. Taking the expectation over the posterior distribution implies averaging over all models and variable selections that are plausible given the data T_observed. Because of this averaging, Bayes is inherently less prone to overfitting than estimation methods that are optimization-based.

Different people may have different ideas on what the appropriate prior distribution p(T) should be, but the nice thing about Bayes is that, conditional on our choice for p(T), it automatically gives us the predictions with the lowest expected loss! (the statistician Dennis Lindley famously called this “turning the Bayesian crank”) For this competition this really meant the following: the only thing that the competitors would have needed to discuss is how Phil generated the data. Given our guess about Phil’s data generating process, Bayes then gives us the ideal predictions. (in expectation… this contest was quite random due to the small sample size)

I started this contest with very little time left, but fortunately the other participants had already left me lots of clues in the forum. In particular, a quick read revealed the following:

- The “equation” used to generate the data seemed to be linear
- The coefficients of the explanatory variables all seemed to have the same sign
- According to Phil, the “equation” did not have any noise in it

Based on these clues and some experimentation, I guessed that the data was generated as follows (see the simulation sketch after this list):

1. Sample the 200 explanatory variables ‘X’ uniformly on [0,1]

2. With probability 0.5 select each different X variable for use in the “equation”

3. For each included variable uniformly sample a coefficient A

4. Define Y = A_1*X_1 + A_2*X_2 etc

5. Define Z = Y - mean(Y)

6. Set T_i = 1 if Z_i < 0 and set T_i = 0 otherwise

7. Round all X variables to 3 decimal places
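
As a concrete illustration, here is a minimal Matlab sketch of this guessed generating process. It is my reconstruction for exposition, not Phil’s actual code; in particular, the [0,1] range for the coefficients in step 3 is an assumption.

% simulate the guessed data generating process (illustrative reconstruction)
n = 20000;  p = 200;
X = rand(n, p);                 % 1. explanatory variables, uniform on [0,1]
included = rand(1, p) < 0.5;    % 2. each variable included with probability 0.5
A = rand(1, p) .* included;     % 3. uniform coefficients for the included variables
Y = X * A';                     % 4. linear combination Y = A_1*X_1 + A_2*X_2 + ...
Z = Y - mean(Y);                % 5. center
T = double(Z < 0);              % 6. binary targets
X = round(X * 1000) / 1000;     % 7. round all X variables to 3 decimal places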

The above defines the prior distribution p(T) to be used in the Bayesian analysis. The posterior distribution can then be approximated quite straightforwardly using Gibbs sampling. This Gibbs sampler will then average over all probable coefficients A and all probable X (since we only observed the rounded X’s). My implementation of this turned out to be good enough for me to become the overall winner of the competition. (for the complete results, see here)

I had fun with this competition and I would like to thank Phil Brierley for organizing it. If this post has coincidentally managed to convert any of you to the Bayesian religion ;-) , I strongly recommend reading Jaynes’s “Probability Theory: The Logic of Science”, of which the first few chapters can be read online here. (that’s how I first learned Bayesian analysis)