## Monday, January 20, 2014

### Thoughts on the Kaggle Contest

As mentioned in a previous post, Kaggle is sponsoring a March Madness contest.  After some false starts, I managed to figure out the rules, scoring, and submission format.  The first phase of the contest is scoring predictions for the past five tournaments.  I entered a submission based on the Prediction Machine's point spreads and placed in the Top Ten of the leaderboard.

Some random thoughts about the contest in no particular order.

(1) The "typical" March Madness contest awards full points for predicting a game correctly, and no points for predicting a game incorrectly.  Scoring is totally dependent upon the outcome of the game, so scoring for games between closely matched opponents is essentially random.  Consequently, winning these contests usually comes down to getting a few late-round upset picks correct, a topic I've previously explored.

The Kaggle contest is using an interesting alternative scoring method, the log loss, also called the predictive binomial deviance.  Submissions give a likelihood from 0 to 1 for a particular outcome (e.g., Arizona will beat UNC Greensboro).  The more confident the prediction (the closer to 0 or 1), the bigger the reward for getting the game right and the bigger the penalty for getting it wrong.  For close games, you can predict an outcome around 0.50 and get a small reward if you are right but only a small penalty if you are wrong.  This scoring metric does a better job of rewarding contestants who accurately judge the relative strengths of the teams in each game rather than the outcome of that particular game (if that makes sense).
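To make the scoring concrete, here is a minimal sketch of the standard log loss calculation in Python. The function name `log_loss` and the clipping constant `eps` are my own choices for illustration, not anything taken from the contest's scoring code.

```python
import math

def log_loss(predictions, outcomes, eps=1e-15):
    """Mean log loss (lower is better).

    predictions: predicted probabilities that the named team wins
    outcomes:    1 if that team actually won, 0 otherwise
    """
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip so we never take log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(predictions)

# A confident pick costs little when right and a lot when wrong.
confident_right = log_loss([0.95], [1])  # about 0.05
confident_wrong = log_loss([0.95], [0])  # about 3.00
coin_flip = log_loss([0.50], [1])        # about 0.69, whoever wins
```

Note the asymmetry: going from 0.50 to 0.95 saves about 0.64 when the pick is right but costs about 2.3 when it is wrong.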

You can also think of this scoring method as a betting strategy.  When you place a high likelihood on a particular outcome, it's like betting a lot on the game.  When you place a likelihood near 0.50 on a particular outcome, it's like betting only a small amount on the game.  The winner is the contestant who ends up with the most money at the end of the tournament.

(2)  The problem with predicting "likelihood" is that there's no way to measure the actual likelihood.  If we made the teams play 100 games, we'd get a good approximation of the likelihood, but that's obviously not reasonable.  So there are really two parts to each submission:  (a) assessing the relative strength of the competing teams, and (b) translating that into a likelihood of victory for one of the teams.

To see that these are two separate problems, imagine that every competitor had to base their entry on the RPI scores of the teams.  Every competitor would have the same relative strength assessment.  But they could translate that into a likelihood any way they wanted.  One competitor might use an exponential model with an exponent of 15, another an exponential model with an exponent of 22, another a logistic distribution, etc.  The winner in this case would be whoever happened to pick the best likelihood model for that year's tournament.
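As a rough illustration of how much the translation step matters, the sketch below feeds the same pair of ratings through a logistic curve with two different steepness values. Everything here is hypothetical: the ratings are invented, and the logistic form with a `steepness` parameter is just one convenient stand-in, not necessarily the "exponential model" referred to above.

```python
import math

def logistic_win_prob(strength_a, strength_b, steepness):
    """Probability that team A beats team B under a logistic curve (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-steepness * (strength_a - strength_b)))

# Hypothetical RPI-like ratings, invented for this example.
rpi_a, rpi_b = 0.62, 0.55

for steepness in (15, 22):
    p = logistic_win_prob(rpi_a, rpi_b, steepness)
    print(f"steepness={steepness}: P(A wins) = {p:.3f}")
```

With these made-up numbers the same strength gap turns into roughly 0.74 under one setting and roughly 0.82 under the other, and the log loss treats those two predictions quite differently.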

To my mind, it would be a better test of the predictors to have them predict the point spread of each game.  Point spread is directly measurable and is the best proxy we have for likelihood, so we'd take the question of how well the competitors translated relative strength to likelihood out of the contest.  But this is probably a minor point -- predicting likelihood with a log loss evaluation is overall a pretty good approach.

(3)  So what's the right strategy for this contest?  The default strategy is obviously to make your best possible predictions for the games and enter that.  But does it ever make sense to intentionally use something other than your best possible prediction?

In a traditionally scored tournament pool, I believe it does make sense to pick against your best predictions.  The reason is that most good predictors are going to have similar outcomes for almost all the games.  In that situation, the best possible result for your best predictions might be to end up in a multi-way tie for first place.  But in any decent-sized pool, the most likely result is that you're going to lose to someone who got lucky and picked one or more of the inevitable upsets.  So if you want to win the pool, you need to pick upsets yourself, and hope to get lucky.

It isn't clear to me that the same reasoning applies with the log loss scoring method.  Since it rewards accurate assessment more than game outcome, it may be that the best strategy is to simply use your best possible predictions.
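One piece of this can be checked numerically: under log loss, the expected score for a single game is best when the submitted probability equals the true probability. The sketch below assumes a hypothetical true probability of 0.70 and searches a grid of candidate predictions. It only speaks to the expected score, not to the chance of finishing first in a field of competitors, which is the harder question.

```python
import math

def expected_log_loss(predicted, true_prob):
    """Expected log loss of submitting `predicted` when the event happens with probability `true_prob`."""
    return -(true_prob * math.log(predicted)
             + (1 - true_prob) * math.log(1 - predicted))

true_prob = 0.70  # hypothetical "true" chance that the favorite wins
candidates = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99

# The candidate with the lowest expected loss lands on the true probability.
best = min(candidates, key=lambda p: expected_log_loss(p, true_prob))
print(f"best prediction on the grid: {best:.2f}")
```

Shading the prediction in either direction (say to 0.60 or 0.80) raises the expected loss, so deliberately picking against your own estimate costs points on average.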

(4)  Phase One of this contest is essentially meaningless.  The outcomes of the last five tournaments are known, so it is trivial to craft a "perfect" submission.  No one has done that yet, but the top of the leaderboard is already filled with (what appear to be) unrealistic submissions.  These submissions are probably "cheating" or are heavily tuned to do well on the Phase One test data.

(5) So what's the best "realistic" score for this contest?  By this, I mean the score over a large number of tournament games.

On the point spread side, the best known predictor of college basketball game outcomes is the Vegas closing line.  This isn't an absolute bound on performance, but it's a good starting point.  As I pointed out above, converting point spreads to likelihoods isn't straightforward, but with one reasonable approach, the lines have a log loss score of around 0.52 for the past few seasons of regular season games.  So I'd be dubious of any approach that does significantly better than that.
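For readers who want to try a spread-to-likelihood conversion, one common approach (not necessarily the one used for the 0.52 figure) is to assume the final margin is normally distributed around the spread. The function name and the default standard deviation of 11 points are assumptions made for this sketch; the right value would have to be fit to real game data.

```python
import math

def spread_to_win_prob(spread, sigma=11.0):
    """P(favorite wins) when favored by `spread` points.

    Assumes the final margin ~ Normal(mean=spread, std=sigma).
    sigma=11.0 is a placeholder assumption, not a fitted value.
    """
    # P(margin > 0) is the standard normal CDF evaluated at spread / sigma.
    return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2.0))))

for spread in (0, 3, 7, 14):
    print(f"favored by {spread:>2}: P(win) = {spread_to_win_prob(spread):.3f}")
```

Changing `sigma` changes every likelihood, which is exactly the translation problem described in (2).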

(6) That said, it's important to remember that a single NCAA tournament is a very small set of data.  It's perfectly reasonable to expect an approach that would be terrible on average over a large number of tournaments to do very well on any particular tournament (or vice versa).  For example, my entry to Phase One had a score of about 0.54.  When I look at how that entry scored on each individual season, I see that in some seasons it scored around 0.51.  So the winner of Phase Two could easily be someone who just happened to get lucky with a good score this year.

It wouldn't be an entirely unreasonable approach to build a model to assess team strengths and an algorithm for translating that to likelihoods, and then tune the combination to do particularly well on some past tournament (say, 2010).  That's probably not the best general approach, but it might get lucky and do very well this year.
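The small-sample point can be illustrated with a quick simulation. The sketch below scores a perfectly calibrated predictor on many simulated tournaments and looks at how much the log loss moves around from one tournament to the next. The 63-game tournament size, the uniform range of win probabilities, and the seed are all assumptions chosen for illustration, not measurements from real tournaments.

```python
import math
import random
import statistics

def simulate_tournament_log_loss(rng, n_games=63):
    """Log loss for one simulated tournament, scored with perfectly calibrated predictions."""
    total = 0.0
    for _ in range(n_games):
        p = rng.uniform(0.5, 0.95)        # hypothetical true win probability of the favorite
        favorite_won = rng.random() < p   # simulate the game
        total += -math.log(p if favorite_won else 1.0 - p)
    return total / n_games

rng = random.Random(2014)  # fixed seed so the run is repeatable
scores = [simulate_tournament_log_loss(rng) for _ in range(2000)]

print(f"mean log loss:              {statistics.mean(scores):.3f}")
print(f"std dev across tournaments: {statistics.stdev(scores):.3f}")
print(f"best / worst tournament:    {min(scores):.3f} / {max(scores):.3f}")
```

Even though the predictor in this toy setup never changes and is as good as it can be, its score differs noticeably from tournament to tournament, which is the sense in which one year's leaderboard can reward luck.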