## Wednesday, February 25, 2015

### JQAS Paper Reviews

Some of you who participated in last year's Kaggle contest may remember that the Journal of Quantitative Analysis in Sports (JQAS) solicited papers on the methods contestants used to predict basketball game outcomes in the NCAA tournament as part of the Kaggle contest.  The next issue of JQAS will contain five papers that resulted from this solicitation and the publisher has made the papers freely downloadable for a month after the issue is published as well as while they are posted in the "Ahead of Print" section on the JQAS site.  (I have also added them to my Papers archive.)  Below are short reviews of the five papers.

Michael J. Lopez  and Gregory J. Matthews, "Building an NCAA men's basketball predictive model and quantifying its success."

Lopez and Matthews won the 2014 Kaggle Contest.  The paper describes their approach as well as an analysis of how "lucky" their win was.

Lopez and Matthews employed a two-pronged prediction approach based upon (1) point spreads and (2) efficiency ratings (from Pomeroy).  They built separate regression models for points spreads and the efficiency ratings and combined them in a weighted average for their two submissions:  One that weighted point spreads at 75% and efficiency ratings at 25%, and one vice versa.  Since point spreads were only available for the first 32 games, Lopez & Matthews estimated the point spreads for the remaining games using an undescribed "proprietary" model.

Lopez & Matthews also analyzed how "lucky" they were to win the contest.  Their analysis suggests that luck is by far the biggest factor in the competition.  For example, they found that about 80% of the entries could have won the contest under different reasonable outcomes, and the true probability of their entry being the best was less than 2%.
Commentary:  While I appreciate that Lopez & Matthews took the time to write up their experience, I find myself disappointed that this approach ended up winning; it brings nothing novel or interesting to the problem.  Their analysis in the second part of the paper is interesting -- it confirmed my belief that the Kaggle contest was essentially a random lottery amongst the top few hundred entries.
Ajay Andrew Gupta, "A new approach to bracket prediction in the NCAA Men’s Basketball Tournament based on a dual proportion likelihood"

In this paper, Gupta describes his approach to predicting the Tournament games and also does some analysis of bracket strategy under traditional (1, 2, 4, 8, 16, 32) scoring.

Gupta's prediction approach is complex.  It involves team ratings based upon maximum likelihood and what Gupta terms a "dual proportion" model.  I won't attempt to summarize the math here -- it requires several appendices in the paper itself to describe -- the interested reader should consult the paper.

In the second half of the paper, Gupta addresses how to compose a tournament selection to do well in a traditional bracket competition.  His conclusion is to pick a high-probability upset for one to three late round games.
Commentary:  This paper is poorly written and confusing from start to finish.  I'm frankly very surprised that it was chosen for publication.

One of the major problems is that uninteresting or unoriginal ideas are inflated with confusing descriptions.  For example, the paper presents the "dual proportion model" as a novel new approach.  So what is the "dual proportion model"?  "Each of the two teams in a game has a probability of winning the game, and these must add up to 100%."  That's hardly worthy of mention, much less to be held up as a new insight.
Another major problem is the long list of unsupported assumptions throughout the model:  a scaling parameter beta "that applies to big wins, meaning at least 10 points" (Why scale big wins?  Why is 10 points a big win?), "However, [log-likelihood's] shape is better for bracket prediction." (Why is it better?)  "Some wins are more indicative of future wins than others are."  (Really?  What wins?  Why?)  "Point differences can also be deceptive..."  (What is your proof of this?)  "The strength-of-schedule adjustment works by reducing the strengths of the non-tournament teams in a weak conference."  (Why?)  There are many more examples.  None of these various assumptions are given any more than a vague explanation, and worse, none are tested in any way.  The result is a pastiche of unexplained, untested ideas that likely have little or no value.
One final nitpick is that this paper doesn't seem to have anything to do with the Kaggle competition, and all of its analysis is based upon the more standard pool scoring methods.
Andrew Hoegh, Marcos Carzolio, Ian Crandell, Xinran Hu, Lucas Roberts, Yuhyun Song and
Scotland C. Leman, "Nearest-neighbor matchup effects: accounting for team matchups for predicting March Madness"

In this paper, Hoegh (et al) augment a standard strength rating-based predictive system with relative adjustments based upon how each team in the matchup has performed in similar past matchups.  So, for example, if a team is playing a very tall, good rebounding team, the model will look at the team's past performances against very tall, good rebounding teams and see if they played particularly well (or particularly poorly) against these sorts of teams in the past, and then apply that adjustment to predicting the current game.
Commentary:  This paper is well-written and presents an interesting and novel idea.  The notion of adjusting a general prediction to account for a particular matchup is at least intuitively appealing, and their approach is straightforward and easily defensible.  There are a couple of interesting issues to think about in their scheme.

First of all, how should you find past games for calculating the matchup adjustment?  Since you're trying to improve a generic strength measurement, I'd argue that ideally you'd like to find past games using some factors that aren't already reflected in the strength metric.  (Otherwise you're likely to just reinforce the information already in the strength metric.)  In this case, the authors find similar past games using a nearest-neighbor distance metric based upon twenty-one of Pomeroy's team statistics.  Some of these statistics do seem orthogonal to the strength metric (e.g., Effective Height, Adjusted Tempo) but others seem as if they would be highly correlated with the strength metric (e.g., FG percentage).  I would be interested to see some feature selection work on these statistics to see what statistics perform best on finding past games.

Second of all, testing this scheme is problematic.  The authors note that the scheme can really only be applied to the Tournament (or at least late in the season) when teams have played enough games that there's a reasonable chance to find similar past matchups.  In this case the authors have tested the scheme using Tournament games but only (if my reading is correct) looking in detail at the 2014 results.  That shows some positive benefits of the scheme, but 65 games is just too small a sample size to draw any conclusions.

Overall, I'm a little dubious that individual matchup effects exist, and that you can detect them and exploit them.  For one thing, if this were true I'd expect to see some obvious evidence of that in conference play, where teams play home-and-home.  For example, you might expect that if Team A has a matchup advantage over Team B that it would outperform expectations in both the home and away against Team B.  I haven't seen any evidence for that sort of pattern.  I've also looked at individual team adjustments a number of times.  For example, you might think that teams have individual home court advantages -- i.e., that Team A has a really good home crowd and does relatively better at home than other teams.  But I've never been able to find individual team adjustments with predictive value.  Sometimes teams do appear to have an unusually good home court advantage -- I recall a season when Denver was greatly outperforming expectations at home for the first part of the season.  But it (almost?) always turns out to be random noise in the data -- Denver's great home performance in the first part of the season evaporated in the second half of the season.
So this paper would have benefited from some more rigorous attempts to verify the existence and value of matchup effects, but it nonetheless presents and interesting idea and approach.
Lo-Hua Yuan, Anthony Liu, Alec Yeh, Aaron Kaufman, Andrew Reece, Peter Bull, Alex Franks, Sherrie Wang, Dmitri Illushin and Luke Bornn, "A mixture-of-modelers approach to forecasting NCAA tournament outcomes."

This paper discusses a number of predictive models created at Harvard for the 2014 Kaggle competition.  The final models included three logistic regressions, a stochastic gradient descent model, and a neural network.  Inputs to the models were team-level statistics from Pomeroy, Sonny Moore, Massey, ESPN and RPI.  The models were also used to build ensemble predictors.
Commentary:  This paper presents a very ordinary, not very interesting approach.  (I suspect that the Kaggle competition was used as an exercise in an undergraduate statistics course and this paper is a write-up of that experience.)  The approach uses standard models (logistic regression, SGD, neural networks) on standard inputs.  The model performances are also unusually bad.  None of the models performed as well as the baseline "predict every game at 50%" model.  Even a very naive model should easily outperform the baseline 0.5 predictor.  That none of these models did suggests very strongly that there is a fundamental underlying problem in this work.

The paper also spends an inordinate amount of time on "data decontamination" -- by which the authors mean you can't use data which includes the Tournament to predict the Tournament.  I realize that many Kaggle participants trying to use canned, off-the-shelf statistics like Pomeroy fell into this trap, but it's frankly a very basic mistake that doesn't warrant a journal publication.  The paper also makes the mistake of trying to train and test using only Tournament data.  The authors acknowledge that there isn't enough data in Tournament games for this approach to work, but persist in using it anyway.
Francisco J. R. Ruiz and Fernando Perez-Cruz, "A generative model for predicting outcomes in college basketball."

This paper extends a model previously used for soccer to NCAA basketball.  Teams have attack and defense coefficients, and the expected score for a team in a game is the attack coefficient of the team multiplied by the defense coefficient of the opponent team.  This basic model is extended first by representing each team as a vector of attack and defense coefficients, and secondly representing conferences as vectors of attack and defense coefficients as well.  The resulting model finished 39th in the 2014 Kaggle competition.  The authors also assess the statistical significance of the results of the 2014 Kaggle competition and conclude that 198 out of the 248 participants are statistically indistinguishable.  This agrees with the similar analysis in the Lopez paper.

Commentary: The approach used in this paper is similar to the one used by Danny Tarlow in the 2009 March Madness contest, although with a more mathematically sophisticated basis.  (Whether that results in better performance is unclear.)  The authors give an intuitively appealing rationale for using vectors of coefficients (instead of a single coefficient) to represent teams: "Each coefficient may represent a particular tactic or strategy, so that teams can be good at defending some tactics but worse at defending others (the same applies for attacking)."  It would have been helpful to have a direct comparison between a model with one coefficient and multiple coefficients to see if this in fact has any value.  Similarly, the idea of explicitly representing conferences has some appeal (although it's hard to imagine what reality that captures) but without some test of the value of that idea it remains simply an interesting notion.  Although the basic ideas of this paper are interesting, the lack of any validation is a major weakness.