# My talks at NIPS 2014

Last December, I gave three talks at the NIPS conference in Montreal. It took some time, but the videos are now finally online!

- Challenges in Machine Learning workshop: 10 lessons learned from 10 Machine Learning Challenges (Video)
- High Energy Physics workshop: My 2nd place solution for the Higgs Boson Machine Learning Challenge (Video, Code and Slides)
- Advances in Variational Inference workshop: Bridging the gap between variational methods and MCMC (Slides, Abstract)

# A Shiny new way of communicating Bayesian statistics

Bayesian data analysis follows a very simple and general recipe:

1. Specify a model and likelihood, i.e. what process do you think is generating your data?
2. Specify a prior distribution, i.e. quantify what you know about a problem before having seen the data.
3. Apply Bayes’ rule and out comes the posterior distribution, i.e. what you know about the problem after having seen the data. This is typically what you would report as the result of your analysis.
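As a minimal illustration of the three steps (a hypothetical coin-flip example, not one from the text): with a Binomial likelihood and a Beta prior, Bayes’ rule gives a Beta posterior in closed form.

```python
# 1. Likelihood: k heads in n flips ~ Binomial(n, p)
# 2. Prior: p ~ Beta(a, b); a = b = 1 is the flat prior
# 3. Posterior: p | data ~ Beta(a + k, b + n - k), by conjugacy

def beta_binomial_posterior(a, b, k, n):
    return a + k, b + n - k

a_post, b_post = beta_binomial_posterior(1, 1, 7, 10)  # 7 heads in 10 flips
print((a_post, b_post))              # Beta(8, 4)
print(a_post / (a_post + b_post))    # posterior mean = 2/3
```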

Because of step 2, Bayesian methods are sometimes criticized for being “subjective”: you and I may have different ideas about a particular problem and therefore we may have different prior distributions, leading to different posterior results in step 3. Such criticism is largely unfounded since step 1 almost always has much more influence on the results and since there exist good methods for specifying “objective priors”. Nevertheless it does point to a problem in communicating the results of a Bayesian data analysis: Should I report only the results obtained using my personal (subjective) prior? Should I try to make the analysis as objective as possible by specifying an “uninformative” prior? Or should I report the results obtained under various different prior specifications? This problem often comes up in practice, with one of the most common questions in academic referee reports being: “Are your results robust to changing the prior distribution on variable X?”.

Ideally, the reader of a data analysis report should be able to easily obtain the posterior results under different prior specifications without having to repeat the complete analysis. In his book, Bayesian econometrician John Geweke points out that this is often easy to do: If the reader has access to draws from the posterior distribution resulting from the data analysis, the results under a different prior can be obtained by simply re-weighting these draws using the principle of importance sampling. Although this is a simple and powerful idea, I have never seen it used in practice. Presumably that is because until recently such functionality was difficult to offer online, and required running specialized software such as R or Matlab on the reader’s PC. All of this changed however with the recent announcement of a new R package called ‘Shiny’ by the folks at RStudio. In their own words “Shiny makes it super simple for R users like you to turn analyses into interactive web applications that anyone can use.”
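Geweke’s re-weighting trick is easy to sketch. In the toy example below (hypothetical normal priors, not from any particular paper), each posterior draw gets an importance weight equal to the ratio of the new prior density to the old one:

```python
import math
import random

def reweighted_mean(draws, log_new_prior, log_old_prior):
    # importance weights: new prior density over old prior density
    logw = [log_new_prior(t) - log_old_prior(t) for t in draws]
    m = max(logw)                          # stabilize before exponentiating
    w = [math.exp(lw - m) for lw in logw]
    return sum(wi * t for wi, t in zip(w, draws)) / sum(w)

# toy posterior draws, imagined to come from an analysis under a N(0, 1) prior...
rng = random.Random(1)
draws = [rng.gauss(0.5, 0.3) for _ in range(20000)]
# ...re-weighted to reflect a N(1, 1) prior instead (log densities up to a constant)
log_old = lambda t: -0.5 * t * t
log_new = lambda t: -0.5 * (t - 1.0) ** 2
m_new = reweighted_mean(draws, log_new, log_old)
print(m_new)   # pulled upward, toward the new prior's mean
```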

Today I have been playing around with Shiny, and I can say: they’re not lying! I produced the interactive graph below using just a few lines of R code. The graph is from my recent paper analyzing economic growth data for a cross-section of countries. One of the problems in studying economic growth is that we have relatively little data and many potential explanatory variables: in my case 88 observations of economic growth rates, and 67 explanatory variables. Because it is hard to be sure which and how many of these explanatory variables to include in our model, it is customary to perform Bayesian model averaging, which averages over all possible selections of variables. The graph below shows the posterior distribution of the model size in this model averaging, i.e. the distribution of the number of explanatory variables that ended up being included in the regression models. In my paper I show this graph for a flat prior on the model size (represented by the black line in the graph), but really this choice is just as arbitrary as any other. By playing with the sliders at the bottom of the graph you can now easily see how this posterior distribution changes with the prior specification:

Refresh your browser if the graph is not showing.

The graph above is being generated by an R script that is currently being hosted free of charge under beta-testing on the Shiny server platform, and will stay up at least until they decide to charge me. Overall I’m very impressed with the capabilities and ease of use offered by this package and I’m looking forward to seeing how other data scientists will use it to present data analyses in an interactive way.

# Kaggle’s Top 10 Data Scientists

Recently I was honored to be named one of Kaggle’s top 10 data scientists. For coverage on the topic see:

# Gibbs sampling with Julia

Some time ago I wrote a post about Gibbs sampling using Matlab. There I showed that the JIT compiler of Matlab can get you very close to the speed of compiled C code if you know what you are doing, but that it is easy to screw up and get a very slow program as a result. More recently, I came across a new scientific programming language called Julia, which seems to be designed specifically with this kind of JIT compilation in mind. So I put Julia to the test, using the slow version of the Gibbs sampler from my last post:

    function gibbs2(n, thin)
        x_samp = zeros(n,1)
        y_samp = zeros(n,1)
        x = 0.0
        y = 0.0
        for i = 1:n
            for j = 1:thin
                x = randg(3)/(y^2+4)               # x|y ~ Gamma(3, rate y^2+4)
                y = 1/(1+x) + randn()/sqrt(2*x+2)  # y|x ~ N(1/(1+x), 1/(2x+2))
            end
            x_samp[i] = x
            y_samp[i] = y
        end
        return x_samp, y_samp
    end
    julia> @elapsed gibbs2(50000,1000)
    7.6084020137786865

The first thing to notice is that the Julia code looks very similar to the Matlab code from my last post, which is great for people familiar with that language. The second thing is that it is very fast and does not suffer from Matlab’s dumb JIT problems. In fact, this Julia code is even faster than the ‘fast version’ of the Matlab code from my last post. Way to go Julia!

# Top Lecturer Award & Kaggle-in-Class

I was very fortunate and pleasantly surprised to receive my school’s ‘Top Lecturer 2011’ award last week. The award is an encouragement for innovation in education. Among other reasons, I received the award because of how much the students of last year’s ‘Econometrie 2’ course liked the prediction contest I organized as a class assignment. The competition required the students to model a given ‘training’ data set in order to predict an unseen ‘testing’ data set, using the methods that were covered during the course, with the winners of the competition receiving a bonus to their grade. The contest was set up using Kaggle-in-Class: a free service by prediction competition company Kaggle, which makes this process extremely easy. If you are involved in teaching any courses on statistics or data analysis, I would highly recommend having a look at Kaggle-in-Class!

In addition to Kaggle, I would also like to thank Dennis Fok and Jan Brinkhuis for supporting my development as a teacher, and of course the students who worked hard and made the contest a great experience for everybody!

# The power of JIT compilation

Darren Wilkinson has an interesting blog post, comparing execution times of a Gibbs sampler implemented in various different programming languages. His sampler generates draws from the following joint distribution

$f(x,y) \propto x^{2}\exp(-xy^{2}-y^{2}+2y-4x)$

by drawing iteratively from the full conditional distributions

$x|y \sim \text{Gamma}(3,y^{2}+4)$

and

$y|x \sim N\left(\frac{1}{1+x},\frac{1}{2(1+x)}\right)$.

Darren presents different implementations of this sampler in R, Python, C, Java, Scala, and Groovy, and then compares their running times. You will have to take a look at Darren’s blog post for the complete results, but one thing they show is that popular interpreted languages such as R and Python are many times slower than compiled C code.
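As a point of reference, the two full conditionals above can be sampled in a few lines of pure Python (my own sketch, not one of Darren’s implementations; note that Python’s `gammavariate` is parameterized by shape and scale, so we divide by the rate $y^{2}+4$):

```python
import math
import random

def gibbs(n=5000, thin=50, seed=0):
    # x|y ~ Gamma(3, rate = y^2 + 4), y|x ~ N(1/(1+x), 1/(2(1+x)))
    rng = random.Random(seed)
    x_samp, y_samp = [], []
    x, y = 0.0, 0.0
    for _ in range(n):
        for _ in range(thin):
            # gammavariate takes (shape, scale), so divide by the rate
            x = rng.gammavariate(3.0, 1.0) / (y * y + 4.0)
            y = 1.0 / (1.0 + x) + rng.gauss(0.0, 1.0) / math.sqrt(2.0 * x + 2.0)
        x_samp.append(x)
        y_samp.append(y)
    return x_samp, y_samp

x_samp, y_samp = gibbs()
print(sum(x_samp) / len(x_samp))
```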

With this in mind I decided to give his sampler a try myself, using one of my favorite languages for quick prototyping: Matlab. Here’s the code and corresponding running time:

    function [x_samp,y_samp] = gibbs(n,thin)

    x_samp = zeros(n,1);
    y_samp = zeros(n,1);

    y = 0;
    for i = 1:n
        gammarands = randg(3,thin,1);
        normrands = randn(thin,1);

        for j = 1:thin
            x = gammarands(j)/(y^2+4);              % x|y ~ Gamma(3, rate y^2+4)
            y = 1/(1+x) + normrands(j)/sqrt(2*x+2); % y|x ~ N(1/(1+x), 1/(2x+2))
        end

        x_samp(i) = x;
        y_samp(i) = y;
    end
    end

    tic
    gibbs(5e4,1e3);
    toc
    Elapsed time is 10.027671 seconds.

Although Matlab is also an interpreted language, this running time of 10 seconds is remarkably close to the ~8 seconds Darren reports for compiled C code (using a PC very similar to mine). The reason that the code above is so fast is that Matlab internally compiles the inner loop of the ‘gibbs’ function; a trick called ‘Just In Time’ compilation. For obtaining the random numbers used by the function, Matlab uses fast built-in code (the ‘randn’ and ‘randg’ functions) which are only called once every 1000 iterations of the inner loop (1000 being the value of the ‘thin’ parameter). Despite its similar speed, coding and running the Matlab example above is much easier than coding, compiling, and running the corresponding C code.

The Matlab JIT compiler is powerful, but not very smart: it can only compile very simple operations like those found in the inner loop of the code above. This means that we have to be very careful with the operations we perform in the inner loop of a function. The example below shows what happens when we move the random number generation (‘randn’ and ‘randg’) to the inner loop:

    function [x_samp,y_samp] = gibbs2(n,thin)

    x_samp = zeros(n,1);
    y_samp = zeros(n,1);

    y = 0;
    for i = 1:n

        for j = 1:thin
            x = randg(3,1,1)/(y^2+4);        % x|y ~ Gamma(3, rate y^2+4)
            y = 1/(1+x) + randn/sqrt(2*x+2); % y|x ~ N(1/(1+x), 1/(2x+2))
        end

        x_samp(i) = x;
        y_samp(i) = y;
    end
    end

    tic
    gibbs2(5e4,1e3);
    toc
    Elapsed time is 218.589886 seconds.

The new code takes more than 20 times as long! The problem with the new code is that Matlab does not know what to do with the ‘randn’ and ‘randg’ functions until it executes them, and is therefore no longer able to compile the inner loop of the algorithm.

To summarize: compiled code can be much faster than interpreted code for number crunching, but if we are smart and use JIT compilation well, we can get remarkably close!

Note that many other languages besides Matlab also support some form of JIT compilation: Darren Wilkinson’s post has a nice example of speeding things up by JIT compiling Python code.

# Prediction contests in the media

The Deloitte/FIDE Chess Rating Challenge has been getting quite some attention in the media. Competition organizer Jeff Sonas has a nice article on Chessbase.com, which does a great job of explaining the competition. The Australian TV channel ABC broadcast a great show on Kaggle, the company that hosts these prediction competitions:

If you watch very very carefully you might even catch a glimpse of me!

# Winning the “Don’t Overfit!” competition

Between February 28, 2011 and May 15, 2011, Kaggle hosted a prediction competition entitled “Don’t Overfit!”. The goal of this competition was to develop a model that would predict well in a setting where you have little data and many explanatory variables. I was lucky enough to end up winning one of the two parts of the competition, as well as being the overall winner. Below you can find a cross-post of the description of my winning entry:

The data set for this competition was an artificial data set created by competition organizer Phil Brierley, consisting of 250 observations of a binary “target” variable and 200 different “explanatory variables”. The goal was to model the relationship between the explanatory variables and the targets in order to predict another 19750 holdout target variables. To do this well, one has to avoid the trap of “overfitting”, i.e. creating a model with a good in-sample fit but with poor predictive performance. The question “how do we prevent overfitting?” has been asked again and again, but in my opinion the answer has been known since Laplace: use Bayesian analysis with a sensible prior.

A Bayesian analysis of any problem consists of two steps:

1. Formulate your prior guess about how the data was generated in terms of a probability distribution. In this case let’s call that distribution p(T), with T the full 20,000×1 vector of targets.

2. Condition on the observed data to make predictions, i.e. construct the posterior distribution p(T_predict | T_observed), where T_observed is now the 250×1 vector of observed targets. The predictions can then be obtained by minimizing the expected loss under this posterior distribution. I did not have time to properly look into the AUC measure used to judge accuracy in this competition, but I guessed it would be (near) optimal to just use the conditional expectations E(T_predict | T_observed) as my predictions. Taking the expectation over the posterior distribution implies averaging over all models and variable selections that are plausible given the data T_observed. Because of this averaging, Bayes is inherently less prone to overfitting than estimation methods that are optimization-based.

Different people may have different ideas on what the appropriate prior distribution p(T) should be, but the nice thing about Bayes is that, conditional on our choice for p(T), it automatically gives us the predictions with the lowest expected loss! (The statistician Dennis Lindley famously called this “turning the Bayesian crank”.) For this competition this really meant the following: the only thing the competitors needed to discuss is how Phil generated the data. Given our guess about Phil’s data generating process, Bayes then gives us the ideal predictions. (In expectation, at least: this contest was quite random due to the small sample size.)

I started this contest with very little time left, but fortunately the other participants had already left me lots of clues in the forum. In particular, a quick read revealed the following:

- The “equation” used to generate the data seemed to be linear
- The coefficients of the explanatory variables all seemed to have the same sign
- According to Phil, the “equation” did not have any noise in it

Based on these clues and some experimentation, I guessed that the data was generated as follows:

1. Sample the 200 explanatory variables ‘X’ uniformly on [0,1]

2. With probability 0.5 select each different X variable for use in the “equation”

3. For each included variable uniformly sample a coefficient A

4. Define Y = A_1*X_1 + A_2*X_2 etc

5. Define Z = Y – mean(Y)

6. Set T_i = 1 if Z_i < 0 and set T_i = 0 otherwise

7. Round all X variables to 3 decimal places

The above defines the prior distribution p(T) to be used in the Bayesian analysis. The posterior distribution can then be approximated quite straightforwardly using Gibbs sampling. This Gibbs sampler will then average over all probable coefficients A and all probable X (since we only observed the rounded X’s). My implementation of this turned out to be good enough for me to become the overall winner of the competition. (for the complete results, see here)
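The guessed generating process is easy to write down as code. The sketch below uses the training-set size as a default; the range of the coefficients A is my own assumption, since the forum clues did not pin it down:

```python
import random

def generate(n=250, p=200, seed=0):
    rng = random.Random(seed)
    # explanatory variables sampled uniformly on [0, 1]
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    # each variable enters the "equation" with probability 0.5;
    # included coefficients sampled uniformly on [0, 1] (assumed range)
    A = [rng.random() if rng.random() < 0.5 else 0.0 for _ in range(p)]
    # linear combination, then centered at its mean
    Y = [sum(a * x for a, x in zip(A, row)) for row in X]
    ybar = sum(Y) / n
    # binary targets from the sign of the centered combination
    T = [1 if y - ybar < 0 else 0 for y in Y]
    # finally, round the observed X's to 3 decimal places
    X = [[round(x, 3) for x in row] for row in X]
    return X, T

X, T = generate()
print(sum(T), len(T))
```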

I had fun with this competition and I would like to thank Phil Brierley for organizing it. If this post has coincidentally managed to convert any of you to the Bayesian religion, I strongly recommend reading Jaynes’s “Probability Theory: The Logic of Science”, of which the first few chapters can be read online here. (That’s how I first learned Bayesian analysis.)

# Deloitte/FIDE Chess Rating Challenge Details

This year, from February 7 to May 4, a prediction contest was held at Kaggle.com/c/ChessRatings2 where I ended up taking first place. The goal of the contest was to build a model to forecast the results of future chess matches based on the results of past matches. This post contains a description of my approach. The corresponding MATLAB code can be found here. For a non-technical account of my experiences in the competition, see my earlier post.

## The base model

The basic model underlying my approach was inspired strongly by the TrueSkill model, as well as by the winner and runner-up of an earlier chess rating contest on Kaggle. The final model was programmed in Matlab, but some of the early experimentation was done using the Infer.NET package, which is definitely worth having a look at. Warning: The discussion in this section is somewhat technical.

The basic statistical model assumed for the result of a match between a white player A and a black player B is the familiar ordered probit model:

$d = s_{A} - s_{B} + \gamma + \epsilon$, $\epsilon \sim N(0,\sigma^{2})$

if $d$ > 1 : A wins

if -1 < $d$ < 1 : A and B draw

if $d$ < -1 : B wins

Here $d$ can be seen as the _performance difference_ between A and B in this match, $s_{A}$ and $s_{B}$ as the skills of player A and B, $\gamma$ as the advantage of playing white, and $\epsilon$ as a random error term.
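Under this specification the three outcome probabilities follow directly from the two thresholds at $\pm 1$. A small sketch (the values of $\gamma$ and $\sigma$ here are illustrative, not the estimated ones):

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def outcome_probs(s_a, s_b, gamma=0.3, sigma=1.0):
    mu = s_a - s_b + gamma                   # mean of the performance difference d
    p_win = 1.0 - phi((1.0 - mu) / sigma)    # P(d > 1)
    p_loss = phi((-1.0 - mu) / sigma)        # P(d < -1)
    p_draw = 1.0 - p_win - p_loss            # P(-1 < d < 1)
    return p_win, p_draw, p_loss

p_win, p_draw, p_loss = outcome_probs(0.0, 0.0)
print(p_win, p_draw, p_loss)   # equal skills: white still wins more often than black
```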

Given a set of match results, we will infer the skills of all players by means of factorized approximate Bayesian inference. $\gamma$ and $\sigma^{2}$ are estimated using approximate maximum likelihood.

We specify an independent normal prior for each player’s skill $s_{i}$, having mean $\mu_{i}$ and a variance of 1 (determined by cross-validation). The means $\mu_{i}$ can be initialized to 0 and are set at each iteration to a weighted average of the posterior means of the skills of each player’s opponents. The effect of this is to shrink the skills of the players toward those of their opponents, as was first done by Yannis Sismanis.

Given a set of match results $r$, the skills have the following posterior density

$p(s|r) \propto \prod_{i=1}^{players} \phi(s_{i}-\mu_{i}) \prod_{j=1}^{matches} \psi_{j}(s_{W_{j}},s_{B_{j}})$

where $\phi()$ is the standard normal density function, $\psi_{j}()$ is the likelihood term due to the match result $r_{j}$, and $W_{j}$ and $B_{j}$ identify the white and black player in match $j$. This posterior distribution is of a very high dimension and is not of any standard form, which makes it intractable for exact inference. Approximate Bayesian inference can solve this problem by approximating the above density by a product of univariate normal densities.

$\tilde{p}(s|r) \propto \prod_{i=1}^{players} \phi(s_{i}-\mu_{i}) \prod_{j=1}^{matches} \phi([s_{W_{j}}-m_{j,1}]/\sqrt{v_{j,1}}) \phi([s_{B_{j}}-m_{j,2}]/\sqrt{v_{j,2}})$

There exist various ways of obtaining the mean and variance terms ($m_{j,1}$, $v_{j,1}$, $m_{j,2}$, $v_{j,2}$) in this pseudo-posterior. The two methods I tried were expectation propagation, as used in the TrueSkill model, and the Laplace approximation, as used here in a similar context. For the current model and data set the results of both methods were practically the same. The advantage of the Laplace approximation is that it is easier to apply when we change the ordered probit specification to something else, like an (ordered or multinomial) logit model. However, since the ordered probit specification provided the best fit, my final submission was made using this specification in combination with expectation propagation. Both methods can only be applied directly when we know which of the two players was playing white, which wasn’t the case for part of the data. My solution to this problem was to calculate the likelihood terms for both the case that the first player is white and the case that the second player is white, after which I weight the likelihood terms of the two cases by their respective posterior probabilities. This is the natural thing to do when using the Laplace approximation, but it also works well with expectation propagation.

In estimating the skills of the players we would like to assign more importance to matches that have occurred recently than to matches that were played long ago. The main innovation in my approach is to do this by replacing the pseudo posterior above with a weighted version:

$\tilde{p}_{w}(s|r) \propto \prod_{i=1}^{players} \phi(s_{i}-\mu_{i}) \prod_{j=1}^{matches} \phi([s_{W_{j}}-m_{j,1}]/\sqrt{v_{j,1}})^{w_{j,1}} \phi([s_{B_{j}}-m_{j,2}]/\sqrt{v_{j,2}})^{w_{j,2}}$

for weights $w_{j,1}$ and $w_{j,2}$ between zero and one. Since the normal distribution is a member of the exponential family this does not change the functional form of the posterior. Because of this, the weights can be incorporated quite naturally into the expectation propagation algorithm. An advantage of using this weighting scheme in combination with factorized approximate inference is that each match may now have a different weight for each of the two players. This is not possible using more conventional weighting methods like the one used to win the first Kaggle chess competition.
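To see why the weights preserve the functional form, note that raising a normal density to the power $w$ simply rescales its variance:

$\phi\left([s-m]/\sqrt{v}\right)^{w} \propto \exp\left(-\frac{w(s-m)^{2}}{2v}\right)$,

which is proportional to a normal density with mean $m$ and variance $v/w$, so the weighted pseudo-posterior remains a product of normals.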

The use of a weighted likelihood in a Bayesian framework is an ad hoc solution, but can be viewed as a way of performing approximate inference in a model where the skills vary over time according to some stochastic process. An alternative solution would be to assume that this stochastic process is a (possibly mean-reverting) random walk, in which case we could use a forward-backward algorithm similar to the Kalman filter. However, for this particular problem the weighting approach performed slightly better.

After trying multiple options, the weight function chosen was $w_{j,1}=\exp(-0.012 \times l_{j,1}-0.02 \times t_{j}-0.1 \times q_{j})$, with $l_{j,1}$ the number of matches played by this player between the current month and the end of the sample, $t_{j}$ the number of months in the same period, and $q_{j}$ an indicator variable equal to one if match $j$ is from the tertiary data set, which was of lower quality. The coefficients in this function were determined by cross-validation. There were other weighting schemes that showed some promise, such as overweighting those matches with players close in skill level to player B, when estimating the skill of player A for predicting his/her result against B. Alternatively, we could overweight those matches containing players that regularly played against B, as we can be more certain about their strength in relation to B than for players that have never played against this player. Due to time constraints I was unable to explore these possibilities further. The combination of approximate inference with likelihood weighting may be an interesting topic for future research.
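Written out as code, the chosen weight function is simply the following (variable names are mine; the coefficients are those quoted above):

```python
import math

def match_weight(l, t, q):
    # l: matches played by this player between the current month and the
    #    end of the sample, t: months in that same period,
    # q: 1 if the match is from the lower-quality tertiary data set
    return math.exp(-0.012 * l - 0.02 * t - 0.1 * q)

print(match_weight(0, 0, 0))    # a match at the end of the sample gets full weight
print(match_weight(10, 24, 1))  # older, lower-quality matches are discounted
```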

## Post-processing

The predictions of the base model scored very well on the leaderboard of the competition, but they were not yet good enough to put me in first place. It was at this time that I realized that the match schedule itself contained useful information for predicting the results, something that had already been noticed by some of the other competitors. In chess, most tournaments are played according to the Swiss system, in which in each round players are paired with other players that have achieved a comparable performance in earlier rounds. This means that if in a given tournament player A has encountered better opponents than player B, this most likely means that player A has won a larger percentage of his/her matches in that tournament.

In order to incorporate the information present in the match schedule, I generated out-of-sample predictions for the last 1.5 years of data using a rolling 3-month prediction window. (i.e. predicting months 127-129 using months 1-126, predicting months 130-132 using months 1-129 etc.) I then performed two post-processing steps using these predictions and the realized match outcomes, the first using standard logistic regression and the second using a locally weighted variant of logistic regression. These post-processing steps used a large number of different variables as can be seen in the code below, but the most important variables were:

- the predictions of the base model
- the posterior means of the skills of A and B
- the number of matches played by these players
- the posterior means of the skills of the opponents encountered by A and B
- the variation in the quality of the opponents
- the average predicted win percentage over all matches in the same month for these players
- the predictions of a random forest using these variables

By comparing the quality of the opponents of A and B in a given tournament we may predict the result of the match between A and B. However, the data only indicated the month in which each match was played and not the tournament, and some players appear to have played in multiple tournaments in the same month. What finally pulled me ahead of the competition may have been the addition of a variable that weighs the skills of the opponents of B by their rooted pagerank to A on the match graph. The idea behind this was that if a player C, who played against B, is close to A on the match graph, the match between B and C most likely occurred in the same tournament as the match between A and B.

In order to make the locally weighted logistic regression computationally feasible, the post-processing procedure first allocates the matches in the test set into a small number of cluster points for which the logistic regression is performed. For the final prediction we can then use local interpolation between the parameter estimates at the cluster centers using Gaussian processes. Using this approach the post-processing was sufficiently fast to allow me to quickly try many different settings and variables.
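The weighted logistic step can be sketched generically as follows. This is my own toy gradient-ascent fit, not the author’s actual Matlab implementation; in the real procedure the weights would come from a kernel around each cluster point rather than being uniform:

```python
import math

def fit_weighted_logistic(X, y, w, steps=500, lr=0.5):
    # gradient ascent on the weighted log-likelihood of a logistic regression
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi, wi in zip(X, y, w):
            z = sum(b * x for b, x in zip(beta, xi))
            pr = 1.0 / (1.0 + math.exp(-z))
            for k in range(p):
                grad[k] += wi * (yi - pr) * xi[k]
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# toy check: intercept plus one feature; outcome more likely when the feature is high
X = [[1.0, v] for v in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
y = [0, 0, 0, 1, 1, 1]
w = [1.0] * len(X)   # uniform weights reduce this to ordinary logistic regression
beta = fit_weighted_logistic(X, y, w)
print(beta)
```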

## Conclusions

I would like to thank both the organizers and the competitors for a great competition. Much of the contest came down to how to use the information in the match schedule. Although interesting in its own right, this was less than ideal for the original goal of finding a good rating system. Despite this, I hope that the competition, and my contribution to it, was useful and that it will help to advance the science of rating systems.