Winning the “Don’t Overfit!” competition

Between February 28, 2011 and May 15, 2011, Kaggle hosted a prediction competition entitled “Don’t Overfit!”. The goal of this competition was to develop a model that would predict well in a setting where you have little data and many explanatory variables. I was lucky enough to end up winning one of the two parts of the competition, as well as being the overall winner. Below you can find a cross-post of the description of my winning entry:

The data set for this competition was an artificial data set created by competition organizer Phil Brierley, consisting of 250 observations of a binary “target” variable and 200 different “explanatory variables”. The goal was to model the relationship between the explanatory variables and the targets in order to predict another 19,750 held-out target values. To do this well, one has to avoid the trap of “overfitting”, i.e. creating a model with a good in-sample fit but poor predictive performance. The question “how do we prevent overfitting?” has been asked again and again, but in my opinion the answer has been known since Laplace: use Bayesian analysis with a sensible prior.

A Bayesian analysis of any problem consists of two steps:

1. Formulate your prior guess about how the data was generated in terms of a probability distribution. In this case let’s call that distribution p(T), with T the full 20,000×1 vector of targets.

2. Condition on the observed data to make predictions, i.e. construct the posterior distribution p(T_predict | T_observed), where T_observed is now the 250×1 vector of observed targets. The predictions can then be obtained by minimizing the expected loss under this posterior distribution. I did not have time to properly look into the AUC measure used to judge accuracy in this competition, but I guessed it would be (near) optimal to just use the conditional expectations E(T_predict | T_observed) as my predictions. Taking the expectation over the posterior distribution implies averaging over all models and variable selections that are plausible given the data T_observed. Because of this averaging, Bayes is inherently less prone to overfitting than estimation methods that are optimization-based.
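To make the averaging point concrete, here is a generic sketch of how such a posterior expectation is typically approximated by Monte Carlo. The functions sample_posterior and predict_given_params are hypothetical placeholders for a model-specific sampler and predictor, not code from my actual entry:

```python
import numpy as np

def posterior_expectation(sample_posterior, predict_given_params, n_draws=1000):
    """Monte Carlo estimate of E(T_predict | T_observed).

    sample_posterior     : hypothetical function returning one posterior draw
                           of the model parameters (including any latent
                           variable selection).
    predict_given_params : hypothetical function mapping one such draw to a
                           vector of predicted targets.
    """
    draws = [sample_posterior() for _ in range(n_draws)]
    preds = np.array([predict_given_params(theta) for theta in draws])
    # Averaging over draws averages over all plausible models and variable
    # selections, instead of committing to a single best-fitting model.
    return preds.mean(axis=0)
```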

Different people may have different ideas about what the appropriate prior distribution p(T) should be, but the nice thing about Bayes is that, conditional on our choice for p(T), it automatically gives us the predictions with the lowest expected loss! (The statistician Dennis Lindley famously called this “turning the Bayesian crank”.) For this competition, this meant that the only thing the competitors really needed to discuss was how Phil generated the data. Given our guess about Phil’s data generating process, Bayes then gives us the ideal predictions. (In expectation, that is… this contest was quite random due to the small sample size.)

I started this contest with very little time left, but fortunately the other participants had already left me lots of clues in the forum. In particular, a quick read revealed the following:

- The “equation” used to generate the data seemed to be linear

- The coefficients of the explanatory variables all seemed to have the same sign

- According to Phil, the “equation” did not have any noise in it

Based on these clues and some experimentation, I guessed that the data was generated as follows:

1. Sample the 200 explanatory variables ‘X’ uniformly on [0,1]

2. With probability 0.5, select each X variable independently for use in the “equation”

3. For each included variable uniformly sample a coefficient A

4. Define Y = A_1*X_1 + A_2*X_2 + …

5. Define Z = Y - mean(Y)

6. Set T_i = 1 if Z_i < 0 and set T_i = 0 otherwise

7. Round all X variables to 3 decimal places
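For concreteness, here is a minimal simulation of that guessed process. The Uniform(0, 1) range for the coefficients in step 3 is my assumption for illustration; the steps above only say they are sampled uniformly with a common sign:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20000, 200

X = rng.uniform(0.0, 1.0, size=(n, p))        # step 1: X uniform on [0, 1]
included = rng.random(p) < 0.5                 # step 2: each variable in with prob 0.5
A = rng.uniform(0.0, 1.0, size=p) * included   # step 3: coefficients, assumed Uniform(0, 1), same sign
Y = X @ A                                      # step 4: Y = A_1*X_1 + A_2*X_2 + ...
Z = Y - Y.mean()                               # step 5: Z = Y - mean(Y)
T = (Z < 0).astype(int)                        # step 6: T_i = 1 iff Z_i < 0
X = np.round(X, 3)                             # step 7: X rounded to 3 decimals
```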

The above defines the prior distribution p(T) to be used in the Bayesian analysis. The posterior distribution can then be approximated quite straightforwardly using Gibbs sampling. This Gibbs sampler averages over all probable coefficients A and all probable X (since we only observed the rounded X’s). My implementation of this turned out to be good enough to make me the overall winner of the competition. (For the complete results, see here.)
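For illustration, here is a stripped-down sketch of such a Gibbs sampler, not my actual competition code: it ignores the rounding of X (so it does not average over the unobserved digits), drops the variable-inclusion indicators, and assumes a feasible starting vector a_init (one that already reproduces the observed targets) is found separately. Under these simplifications the likelihood is an indicator function, so the posterior over the coefficients is uniform on the feasible region, and each full conditional is uniform on an interval:

```python
import numpy as np

def gibbs_sampler(X, t, a_init, n_iter=2000, rng=None):
    """Gibbs sampler for the simplified model: t_i = 1 iff (Xc @ a)_i < 0,
    with a_j ~ Uniform(0, 1) a priori and Xc the column-centered X.

    X      : all rows of explanatory variables (observed + holdout).
    t      : binary targets for the first len(t) rows.
    a_init : feasible starting coefficients, assumed found separately.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_obs = len(t)
    Xc = X - X.mean(axis=0)          # centering plays the role of Z = Y - mean(Y)
    sign = np.where(t == 1, -1.0, 1.0)
    C = sign[:, None] * Xc[:n_obs]   # feasibility is the linear system C @ a >= 0
    a = np.asarray(a_init, dtype=float).copy()
    z = C @ a                        # current constraint values
    draws = np.empty((n_iter, X.shape[1]))
    for it in range(n_iter):
        for j in range(X.shape[1]):
            c = C[:, j]
            r = z - c * a[j]         # constraint values with coordinate j removed
            # Conditional on the other coordinates, a_j is uniform on the
            # interval of values that keeps every constraint satisfied.
            lo_bounds = -r[c > 0] / c[c > 0]
            hi_bounds = -r[c < 0] / c[c < 0]
            lo = max(0.0, lo_bounds.max()) if lo_bounds.size else 0.0
            hi = min(1.0, hi_bounds.min()) if hi_bounds.size else 1.0
            if lo < hi:              # numerical safety: else keep current value
                a_new = rng.uniform(lo, hi)
                z += c * (a_new - a[j])
                a[j] = a_new
        draws[it] = a
    return draws

def predict(X, draws, n_obs):
    """Posterior probability that each holdout target equals 1, averaged
    over the sampled coefficient vectors."""
    Xc = X - X.mean(axis=0)
    Z = Xc[n_obs:] @ draws.T         # holdout Z values under every draw
    return (Z < 0).mean(axis=1)      # E(T_predict | T_observed) per holdout row
```

Averaging the indicator Z_i < 0 over the draws is exactly the conditional expectation E(T_predict | T_observed) discussed above; the full sampler additionally averages over the unrounded X values and the variable selection.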

I had fun with this competition and I would like to thank Phil Brierley for organizing it. If this post has coincidentally managed to convert any of you to the Bayesian religion ;-), I strongly recommend reading Jaynes’s “Probability Theory: The Logic of Science”, the first few chapters of which can be read online here. (That’s how I first learned Bayesian analysis.)
