Friday, April 8, 2016

2016 Machine Madness Winner

I've been a little slow in getting around to this, but I want to congratulate "SDSU Fan" on winning the 2016 Machine Madness contest!  In real life, SDSU Fan is Peter Calhoun, a graduate student in Statistics at (no surprise) San Diego State University.  We had a very large pool of entrants this year (40!) so Peter deserves some congratulations for beating the masses.  Peter was trailing by a significant amount after the Round of 32, but strong performances in the later rounds (and especially the Final Four) resulted in a big lead by the end.

Peter's model modified the Logistic Regression/Markov Chain (LRMC) approach proposed by Kvam and Sokol to use random forests.  Peter also finished fiftieth on Kaggle -- a very strong performance all around.

Despite the large number of entries, nobody had Villanova winning it all.  I think that makes the Villanova win a "true upset".  I know in my model, Villanova played considerably better than predicted.

Speaking of my model, it follows a strategy in pool-based contests of picking some "likely" upsets to try to maximize the chance of winning.  (This is probably more important in a larger pool.)  This year, it picked Purdue to make it to the Championship Game.  Not only did that not happen, but Purdue was upset in the first round by #12 Little Rock.  I'm adding a special "Purdue Rule" to the Net Prophet model so that mistake is never repeated again.  :-)

Congratulations again to Peter on a great performance!

Paper Reviews

These papers have been added to the paper archive available through the Papers link on the sidebar.  Links are also provided for direct download of the papers.

Dubbs, Alexander, "Statistics-Free Sports Prediction", arXiv.org
The author builds logistic regression models for MLB, NBA, NFL, and NHL games that use only the teams and scores.  This works best for basketball, and the author concludes that "in basketball, most statistics are subsumed by the scores of the games, whereas in baseball, football, and hockey, further study of game and player statistics is necessary to predict games as well as can be done."

COMMENT: I'm not sure the results of this paper say anything deeper than "Compared to the other major sports, NBA has a long season and the teams don't change much from year to year." 
Clay, Daniel, "Geospatial Determinants of Game Outcomes in NCAA Men’s Basketball," International journal of sport and society 02/2015; 4(4):71-81.
The authors build a logistic regression model for 1,648 NCAA Tournament games that include features for distance travel, time zones crossed, direction of travel, altitude and temperature.  They conclude "We found that traveling east reduces the odds of winning more than does traveling west, and this finding holds when controlling for strength of team, home region advantage and other covariates. Traveling longer distances (>150 miles) also has a dramatic negative effect on game outcomes..."
COMMENT: This paper shows that travel distance and direction have a statistically significant impact upon game results in the NCAA Tournament, but I want to add a few caveats to this conclusion.  First, it isn't clear that the authors understand and control for the fact that there are many more basketball programs (and arguably stronger basketball programs) on the East Coast than elsewhere in the nation.  For this reason, it's likely that teams moving west to play in the Tournament are stronger than teams moving east.  Since the authors don't adequately control for the strength of teams, it's impossible to say whether the claimed impact of direction of travel means anything.  Second, the magnitude of these effects may not be huge.  I don't understand how the authors calculate their "Odds Ratio", but factors like strength of team are several orders of magnitude more significant in determining outcome.  Third, the authors measure strength of team by seed, which has several problems: it's a very coarse measure, it doesn't distinguish between teams with the same seed, and it's often poorly correlated with actual team strength (i.e., teams are commonly mis-seeded).  In my experience, many factors with low significance vanish when team strength is more accurately estimated.  I think distance and direction of travel probably do have an impact on Tournament games, but I suspect the true effect is smaller than this paper would indicate.
Clay, Daniel, "Player Rotation, On-court Performance and Game Outcomes in NCAA Men's Basketball", International Journal of Performance Analysis in Sport · August 2014

The authors look at the relationship between the size of the rotation (how many players play at least 10 minutes in a game) and statistics such as rebounding, shooting percentage, etc.  The authors conclude that teams with a deep rotation tend to rebound better, particularly on the offensive end.  They also have more steals.  By contrast, smaller-rotation teams tend to shoot the ball better, both field goals and free throws, and they are more effective at taking care of the ball, resulting in fewer turnovers.  In general, a larger rotation improves the chance of winning.
COMMENT: There's quite a bit of interesting material in this paper, and I recommend reading it and drawing your own conclusions.  I have reservations about some of the conclusions because the authors have not controlled for the number of possessions in the game for many of the statistics.  Since I'd expect (for example) both the number of offensive rebounds and the depth of rotation to increase with more possessions, I'm not sure I immediately accept that teams with deeper rotations rebound better.  The authors do control for possessions in two of the statistics (offensive and defensive rating), and those conclusions are more convincing.  However, as far as I can tell the authors did nothing to control for overtime games, and that may also be affecting the results.
From the specific viewpoint of predicting game outcomes, the authors don't make use of any kind of strength rating, so it isn't clear whether depth of rotation has any predictive value that wouldn't already be covered by a good strength metric.

Monday, March 28, 2016

Sorry About That!

I have to apologize to anyone who Stole My Entry over on Kaggle, because the Net Prophet predictor has made a hash of it this Tournament, and is mired low in the Leaderboard and well below the median entry.  A number of the upsets have been very improbable according to the Net Prophet predictor and it has suffered accordingly.

It's worth noting that some others have been suffering too:  Monte McNair has done better than Net Prophet but not by a whole lot.   Ken Massey entered for the first time and is very low on the Leaderboard (apparently because he gambled rather heavily on 2-15 matchups).   The most interesting story is ShiningMGF, who started poorly (perhaps because their first-round predictions are influenced by the Vegas lines?) but have been climbing steadily and are now in tenth place.  Top Ten finishes three years running is almost certainly a good indication that they know something the rest of us don't!

Over at the Machine Madness contest, Net Prophet isn't doing any better, being one of the many entries that predicted Kansas as the eventual champion.  It looks like "SDSU Fan" has the win locked up already.  "Predict the Madness" is likely to finish second unless North Carolina loses its next game.  Beyond that it gets a little murky, but all the entries with UNC winning it all have an obvious advantage.

But regardless of who wins, it's been a great turnout for the contest (40 entries!) and I want to give my sincere thanks to everyone who entered.  It's really great to see so much interest and participation!


Tuesday, March 22, 2016

What Would a Perfect (Knowledge) Predictor Score in the Kaggle Competition?

It isn't possible to have a perfect predictor for NCAA Tournament games, because the outcome is probabilistic.  We can't know for sure who is going to win a game.  But we could conceivably have a predictor with perfect knowledge.  This predictor would know the true probability for every game.  That is, if Duke is 75% likely to beat Yale, the perfect knowledge predictor would provide that number.  (Because predicting the true probability results in the best score in the long run.) What would such a predictor score in the Kaggle Contest?

The Kaggle contest uses a log-loss scoring system.  In this system, a correct prediction is worth the log of the confidence of the prediction, and an incorrect prediction is worth the log of one minus the confidence of the prediction.  (And for the Kaggle contest the sign is then swapped so that smaller numbers are better.)

Let's return to our example of Duke versus Yale.  Our perfect knowledge predictor predicts Duke over Yale with 0.75 confidence.  What would this predictor score in the long run?  (I.e., if Duke and Yale played thousands of times.)  Since the prediction is also the true probability that Duke will win, that number is given by the equation:

`0.75 * ln(0.75) + (1-0.75) * ln(1-0.75)`

that is, 75%  of the time Duke will win and in those cases the predictor will score ln(0.75), and 25% of the time Yale will win and the predictor will score ln(0.25).   This happens to come out to about -0.56 (or 0.56 in Kaggle terms).
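
If you want to check the arithmetic, here's a quick Python sketch of that expected-score calculation (the function name is mine, not anything from the Kaggle tooling):

```python
import math

def expected_log_loss(p):
    """Expected log-loss when we predict the true win probability p."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

print(round(expected_log_loss(0.75), 2))  # -0.56
```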

So we see how to calculate the expected score of our perfect knowledge predictor given the true advantage.  If the favorite in all the Tournament games was 75% likely to win, then our perfect predictor would be expected to score 0.56.  But we don't know the true advantage in Tournament games, and it differs from game to game.  Is there some way we can estimate it?

One approach is to use the historical results.  We know how many games were upsets in past Tournaments, so we can use this to estimate the true advantage.  For example, we can look at all the historical 7 vs. 12 matchups and use the results to estimate the true advantage in those games.  (One problem with this approach is that in every Tournament, some teams are "mis-seeded".  If we judge upsets by seed numbers, this adds some error.)

Between this Wikipedia page and this ESPN page we can determine the win percentages for every possible first-round matchup.  There have been a reasonable number of these matchups (128 for each type of first-round matchup) so we can have at least a modicum of confidence that the historical win percentage is indicative of the true advantage:

Seed        Win Pct
1 vs. 16    100%
2 vs. 15     94%
3 vs. 14     84%
4 vs. 13     80%
5 vs. 12     64%
6 vs. 11     64%
7 vs. 10     61%
8 vs. 9      51%

Using the win percentage as the true advantage, we can then calculate what our perfect knowledge predictor would score in each type of match-up:

Seed        Win Pct    Score
1 vs. 16    100%        0.00
2 vs. 15     94%       -0.22
3 vs. 14     84%       -0.45
4 vs. 13     80%       -0.50
5 vs. 12     64%       -0.65
6 vs. 11     64%       -0.65
7 vs. 10     61%       -0.67
8 vs. 9      51%       -0.69

Since there are equal numbers of each of these games, the average performance of the predictor is just the average of these scores:  -0.48.
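
If you want to reproduce the table yourself, here's a quick Python sketch (small rounding differences from the table are possible, since the win percentages are themselves rounded):

```python
import math

# Same expected-score calculation as above, applied to the historical
# win percentages (the 100% case is a sure thing and scores exactly 0.0)
win_pct = {"1 vs. 16": 1.00, "2 vs. 15": 0.94, "3 vs. 14": 0.84,
           "4 vs. 13": 0.80, "5 vs. 12": 0.64, "6 vs. 11": 0.64,
           "7 vs. 10": 0.61, "8 vs. 9": 0.51}

scores = {m: 0.0 if p == 1.0 else p * math.log(p) + (1 - p) * math.log(1 - p)
          for m, p in win_pct.items()}

for m, s in scores.items():
    print(f"{m}: {s:.2f}")
print("Average:", round(sum(scores.values()) / len(scores), 2))  # about -0.48
```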

This analysis can be extended in a straightforward way to the later rounds of the tournament, but since there are fewer examples in each category it's hard to have much faith in some of those numbers.  But I would expect the later round games to make the perfect knowledge predictor's score worse, because more of those games are going to be close match-ups like the 8 vs. 9 case.

So 0.48 probably represents an optimistic lower bound for performance in the Kaggle competition.

UPDATE #1:

Here's a rough attempt to estimate the performance of the perfect predictor in the other rounds of the Tournament.

According to the Wikipedia page, there have been 52 upsets in the remaining rounds of the Tournament (a rate of about 2%).  If we treat all these games as having an average seed difference of 4 (which is a conservative estimate), then our log-loss score on these games would be about -0.66.  (Intuitively, this is as we would expect -- with most of the low seeds eliminated, games in the later rounds are going to be between teams that are more nearly equal in strength, so our log-loss score will be correspondingly worse.)  Since there are roughly as many first-round games as games in all the other rounds combined, the overall performance is just the average of -0.48 and -0.66: about -0.57 (0.57 in Kaggle terms).

UPDATE #2:

Over in the Kaggle thread on this topic, Good Spellr pointed out that if you treat the first round games as independent events with a normal distribution, you can estimate the variance as well:

`variance = (1/n^2) * sum_{i=1}^{n} p_i * (1 - p_i) * (ln(p_i / (1 - p_i)))^2`

which works out to a standard deviation of about 0.07. That means that after the first round of the tournament, the perfect knowledge predictor's score would fall in the range [0.34, 0.62] about 95% of the time.
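
Here's a quick sketch of that variance calculation, assuming 32 first-round games (four of each matchup) and using the win percentages from the table above as the true probabilities:

```python
import math

# Per-game true win probabilities for the 32 first-round games
probs = [1.00, 0.94, 0.84, 0.80, 0.64, 0.64, 0.61, 0.51] * 4

def term(p):
    # A sure thing (p == 1.0) contributes zero variance
    if p in (0.0, 1.0):
        return 0.0
    return p * (1 - p) * math.log(p / (1 - p)) ** 2

n = len(probs)
variance = sum(term(p) for p in probs) / n ** 2
print(round(math.sqrt(variance), 2))  # about 0.07
```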

Sunday, March 20, 2016

A Quick Update

I'm still in Brooklyn watching games (well, we're done watching now -- had a couple of fun games) and have been too busy to do more than minimal checking of email, but I found time to check on the Machine Madness contest.  I see that we have an amazing 40 contestants -- presumably most found us through the Kaggle Contest, but it's great to see the participation.  What's not so great is that the Net Prophet entry is doing poorly both here and at the Kaggle Contest, but that's a post for another day :-)

Tuesday, March 15, 2016

Year End Rankings

I'm not really into ranking teams that much (because match-ups depend on many more factors), but I came up with a new (and I think better) rating system today and here's how it ranks the Top Twenty:

Rank   Team             Rating
1      North Carolina   131.6
2      Kansas           129.6
3      Michigan State   126.9
4      West Virginia    125.3
5      Virginia         117.9
6      Villanova        114.9
7      Oregon           112.1
8      Xavier           110.4
9      Purdue           109.3
10     Louisville       108.9
11     Arizona          106.1
12     Duke             105.4
13     Kentucky         105.2
14     SMU              104.0
15     Indiana          103.9
16     Oklahoma         103.7
17     Miami Florida     99.9
18     Maryland          97.9
19     Baylor            97.7
20     Wichita State     97.4

I'm not entirely sure what I think of this.  The top of the rankings isn't too surprising, although I think most folks wouldn't have UNC ahead of Kansas and MSU.  Oklahoma is much lower than the #2 seed they received.  Wichita State is also a surprise at 20 -- although they seem to be handling Vanderbilt tonight so maybe there's something to that.

And I guess you could conclude that it's a bad year for Louisville and SMU to be on probation -- they were both very solid this year.

Monday, March 14, 2016

Does Coaching Experience Matter?

One of the things I investigated in the run-up to the Tournament this year was whether coaching experience matters.  My approach was pretty simplistic -- I offered my prediction model information on how a team/coach had performed the previous year in the Tournament to see if that information had any predictive value.  It didn't -- at least for my model.

Over at Harvard Sports Analysis Collective (worth reading, by the way), Kurt Bullard takes a better look at the same question.  He looks at how coaches perform relative to their seeding over their coaching lifetime.  If experience matters, you'd expect coaches with more experience to do better.  But that's not the case -- there's no correlation between how well a coach does and how much experience he has.  (Alternatively, it could be that his experience is factored into the seed his team gets, although I'd argue that's probably not the case.)

At any rate, you might want to be leery of analysts who say that "Michigan State is going to do well in the Tournament because Coach Izzo has more experience than anyone in the Tournament."  Michigan State probably is going to do well -- but that's because the Committee mis-seeded them, not because of Coach Izzo's experience.

Wednesday, March 9, 2016

That's Not Really A Number

Suppose that you're competing in the Kaggle competition and you're using team win ratios and average scoring for the season to predict who is going to win a game.  Your input to your model might look something like this:

Team       Win Ratio   Ave. Score      Team   Win Ratio   Ave. Score
Michigan   0.75        86.5            UCLA   0.73        81

Your results are mediocre, so you decide to improve your model by adding more information about each team.  The seeding of the team -- the NCAA Tournament committee's assessment of team strength -- seems like it would be useful for prediction, so you add each team's seed to your inputs:

Team       Win Ratio   Ave. Score   Seed      Team   Win Ratio   Ave. Score   Seed
Michigan   0.75        86.5         5         UCLA   0.73        81           14

You've just made a mistake.  Do you see what it is?

The way you've added the seeding information, many machine learning tools / models are going to treat the seed as a number[1], no different from the Win Ratio or the Average Score.  And that's a problem, because the seed is not really a number.  It's actually what statisticians would call a categorical variable, because it can take one value out of a fixed set of arbitrary values.  (Machine learning types might be more likely to call it a categorical feature.)  If you're not convinced, imagine replacing each seed with a letter -- the #1 seed becomes the A seed, the #2 seed becomes the B seed, and so on.  This is perfectly reasonable -- we could still talk about the A seeds being the favorites, we'd know that in the first round the A seeds play the P seeds, and so on.  It wouldn't make any sense to do that with a numeric feature like (for instance) the Average Score.

Another difference between true numeric features and categorical features that merely look like numbers is that numeric features have a meaningful scale, while categorical features have an arbitrary one.  The difference between an Ave. Score of 86.5 and an Ave. Score of 81 is 5.5 points, and that's the same amount as the difference between an Ave. Score of 78.8 and 73.3.  But the difference between a 16 seed and a 15 seed might be quite different from the difference between a 1 seed and a 2 seed.  (Not to mention that the difference between a 16 seed and a 15 seed might be quite different from the difference between that 16 seed and a different 15 seed!)

So if I've convinced you that team seeds are not really numbers but just look like numbers, then how should you represent them in your model?

The basic approach is something called "one hot encoding" (the name derives from a type of digital circuit design).  The idea behind one hot encoding is to represent each possible value of the categorical feature as a different binary feature (with value 0 or 1).  For any particular value of the categorical feature, one of these binary features will have the value 1 and the rest will be zero.  (Hence "one hot".)  To represent the seeds, we need 16 binary features:

Team       Win Ratio   Ave. Score   Seed_1   Seed_2   Seed_3   Seed_4   Seed_5   ...   Seed_16      Team   Win Ratio   Ave. Score   Seed_1   ...
Michigan   0.75        86.5         0        0        0        0        1        ...   0            UCLA   0.73        81           0        ...

This representation gives your model much more flexibility to learn the differences between the seeds.  For example, it could learn that being a 5 seed is every bit as valuable as being a 4 seed.  If you represented the seed as a number, it would be impossible to learn that.
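
Here's one way to do the encoding in Python with pandas (the data frame and column names are just an illustration; scikit-learn's OneHotEncoder would work equally well):

```python
import pandas as pd

# Toy input rows; column names are made up for illustration
games = pd.DataFrame({
    "team":      ["Michigan", "UCLA"],
    "win_ratio": [0.75, 0.73],
    "avg_score": [86.5, 81.0],
    "seed":      [5, 14],
})

# Treat the seed as a categorical feature and one-hot encode it.
# Declaring the full 1..16 category list guarantees all 16 Seed_* columns
# appear even if some seeds are missing from the training data.
games["seed"] = pd.Categorical(games["seed"], categories=range(1, 17))
encoded = pd.get_dummies(games, columns=["seed"], prefix="Seed")
print(encoded.columns.tolist())
```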

One drawback of this representation is that it tends to rapidly increase the number of input features.  Seeds added 32 features to the model -- adding the Massey ordinals to your model could add many thousands of features!  That can be bad for several reasons.  First, the increased dimensionality will slow down the training of your model.  This may or may not be a problem, depending upon the overall size of your training data and the computational power you have for training.  A more significant problem is that it will encourage your model to overfit.  Even if you use the Tournament data all the way back to 1985 (which I do not encourage), there may be only a few games for unlikely pairings such as (say) a #8 seed playing a #16 seed.  That may cause your model to learn a rule for those games that is too specific.

Now it may be that you disagree with the very premise of this posting.  You really do think that seeds can be treated like numbers.  The good news for you is that it is very easy to do that -- just include the seed as both a numeric feature and a categorical feature.  You can look at the size of your model coefficients or other components to see whether treating the seed as a number has any value or not.


[1] This is admittedly a simplification.  Some machine learning models (such as tree-based models) may be flexible enough to treat numeric variables in ways that mimic categorical variables.  But many models (and notably linear and logistic regressions) will not.

Tuesday, March 8, 2016

Machine Madness 2016

This is just a quick post to announce the return of the Machine Madness contest for its 7th year -- a chance for machine predictors to compete in a traditional bracket-style March Madness competition.  Details can be found on this page.  Last year, Dr. Amanda Schierz (aka "Bluefool") won the competition and ESPN immediately (*) sent a film crew all the way to England to interview her.  Don't miss out on your chance to be a media star!

(*) If by immediately you mean a year later.

Please come compete, and let me know if you have any questions!

Friday, March 4, 2016

Scoring the Kaggle Contest

In this previous post, I talked briefly about whether competitors in the Kaggle March Madness contest should "gamble" with their entries.  The short answer is "yes" -- if your goal is to win money, then your best strategy is to gamble with at least some of your game predictions.  (How many games you should gamble with is an interesting question for another time.) In my opinion, that's a sign that the contest is broken.  Rather than testing who can make the best predictions about the NCAA Tournament, the contest is testing who can formulate the best meta-strategy for winning the contest.  

So, is it possible to fix the contest so that the results more accurately identify the best predictions?

The log-loss scoring function asks competitors to provide a confidence in their predictions, and then scores them based upon how confident their correct (or incorrect) prediction was.  If you analyze this scoring approach, you find that the best strategy in the long run is for the competitors to set their confidence in each prediction to exactly their real confidence.  And that's exactly what you want for a fair and accurate scoring system.

The problem is that this contest is not a "long run".  In fact, it's anything but a long run -- there are only 63 games being scored.  That's a lot compared to predicting (say) just the Super Bowl, but for a contest like this it's not nearly long enough to ensure that true predictions are the best strategy.

So, how can we fix the scoring to better reward true predictions?

The obvious fix of having the teams play a few thousand games is probably a non-starter.  But it does point towards the necessary condition:  We want the competitors to be making many choices instead of just 63.  My suggestion is to have the competitors predict the Margin of Victory (MOV) for each game, and score them on how close they get to the actual MOVs.  Now instead of making 63 binary predictions, the competitors are making 63 predictions with many more choices, and -- crucially -- they don't have control over how much they will win/lose on each prediction.

It should be obvious that this makes it more difficult to "gamble" for an improved score.  Consider last year, where Kentucky was viewed as an overwhelming favorite coming into the Tournament.  Under the current scoring system there was an easy and obvious "gambling" strategy -- predict that Kentucky would win every game and set your confidence in each of those games very high.  (And in fact, if Kentucky had won the championship game, a gambling strategy would probably have won the contest.)  However, under the Margin of Victory scoring system, how would you "gamble" to improve your chances of winning the contest?  It's hard to imagine any approach that would work better than submitting your actual best predictions.
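
To make the proposal a little more concrete, here's one possible scoring function.  Mean absolute error is just my choice for illustration -- "how close they get to the actual MOVs" could be measured in other ways (squared error, say):

```python
def mov_score(predicted_movs, actual_movs):
    """Mean absolute error between predicted and actual margins of victory.
    Lower is better.  The exact error metric is an assumption on my part --
    the proposal above only says 'how close they get'."""
    errors = [abs(p - a) for p, a in zip(predicted_movs, actual_movs)]
    return sum(errors) / len(errors)

# Example: three games, predicted margins vs. what actually happened
print(mov_score([7, -3, 12], [10, 2, 5]))  # (3 + 5 + 7) / 3 = 5.0
```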

The Kaggle contest is a fun diversion and I think the results have provided some interesting insight into predicting college basketball games.  But I think the contest would be improved by using a scoring system that more accurately identified the best predictor, and I'll continue my low-key lobbying efforts (*) for that change.

(*) Which consist entirely of posting something like this every year :-)

Thursday, March 3, 2016

To Gamble or Not To Gamble, That is the Question

Or at least that's "a" question -- one that comes up yearly in the Kaggle competition.  Here's a version of it that popped up this year.

The Kaggle competition (for those who aren't aware) uses log-loss scoring.  Competitors express each prediction as a confidence level (e.g., 95% certain of a win by Kentucky) and are then rewarded or punished accordingly.  And since the scoring is logarithmic, you are punished a lot if you make a very confident wrong prediction.

The question that plagues competitors is whether forcing their predictions to be more conservative or less conservative will improve their chances of winning the contest.  (Or at least finishing in the top five and receiving a cash prize.)  Note that this is only concerned with winning the contest, not with improving the predictions.  Presumably your predictions are already as accurate as you can make them, and artificially changing them would make them worse -- in the long run.  But the Kaggle contest isn't concerned with the long run -- it's only concerned with how you perform during this particular March Madness.


As a thought experiment, let's assume that you could change your entry right before the final game.  You can see the current standings, but not any of the other entries.  Would you change your entry?  And if so, how?

Well, if you see that you're in first place with a big lead, you might not change it at all.  Or maybe you'd make your pick more conservative so that you could be sure you wouldn't lose much if your pick was wrong.  But if you didn't have a big lead (and in general the farther away from first place you were), you'd probably want to gamble on getting that last game correct.  At that point "average" performance cannot be expected to move you ahead of the teams ahead of you, and even "good" performance might be passed by someone behind you who was willing to gamble more than you.

Since it's much more likely that you will be losing the contest going into the final game than in first place with a big lead, I think this argues that (if your goal is to maximize your expected profit) you should "gamble" on at least the last game.  It's left to the reader to apply this reasoning recursively to games before the final game :-).

As a concrete example of this, last year Juho Kokkala submitted entries based upon "Steal This Entry" but with Kentucky's probabilities turned up to 1.0.   The non-gambling "Steal This Entry" finished in 42nd place, but if Kentucky had won out, Juho would have probably placed in the top two and collected some prize money. 
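
For the curious, a gamble like that takes only a few lines of code against a submission file.  This is just a sketch -- the file name, column layout, and Kentucky's team id are my assumptions about the Kaggle submission format, not Juho's actual code:

```python
import pandas as pd

KENTUCKY_ID = 1246  # hypothetical team id

# Hypothetical submission file with columns Id ("Season_Team1_Team2") and Pred
entry = pd.read_csv("steal_this_entry.csv")
team1 = entry["Id"].str.split("_").str[1].astype(int)
team2 = entry["Id"].str.split("_").str[2].astype(int)

entry.loc[team1 == KENTUCKY_ID, "Pred"] = 1.0  # Kentucky listed first: certain win
entry.loc[team2 == KENTUCKY_ID, "Pred"] = 0.0  # Kentucky listed second: certain loss
entry.to_csv("gambled_entry.csv", index=False)
```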

Friday, February 19, 2016

More Kaggle News, ESPN Irritates Me

As a follow-up to this previous post, the Kaggle competition is officially back.  A good deal of data is available, and the forums have been moderately active.  The new Kaggle Notebooks feature is getting some exercise, too: there are 116 scripts for this competition at the moment, although I'm unclear on what they all are.  There are at least a couple of scripts to calculate Elo ratings and similar things.  Might be worth a look if you're just getting started in this area.

Prizes this year are considerable -- $20K split 10/6/4/3/2.  I suggested awarding prizes for the best performance on each round of the Tournament, but that might have been too hard to implement quickly.  At any rate, spreading the prizes down to 5th place is a good improvement.  The contest is basically random among the top 100 or so contestants, so weighting all the money at the top makes it even more of a "random number lottery."

On a completely unrelated note, the Net Prophet predictor broke on me last night.  It turned out that ESPN had changed the format of its box scores.  You can see the new format here.  The change seems to have also broken all the past seasons.  If you go to (say) November 2014, the scoreboard and schedule pages will claim that no games were played.

ESPN has been modifying their page formats for a while now, and I was expecting a change at some point.  The scoreboard page had earlier been modified to run from JSON data embedded in the page, and I was expecting to see something similar happen with the box scores and other game pages.  But interestingly enough, although the page formats have changed, they haven't gone to using embedded JSON data on these pages.  That's too bad, because pulling the JSON data out of the page, parsing it and then using it is more straightforward -- and probably a lot more robust -- than pulling data out of the HTML.
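
For what it's worth, here's roughly what the embedded-JSON approach looks like.  The URL and the variable name holding the JSON blob are assumptions on my part -- and exactly the sort of thing ESPN changes without warning:

```python
import json
import re
import requests

# Hypothetical scoreboard URL and JSON marker; both are assumptions
url = "http://espn.go.com/mens-college-basketball/scoreboard/_/date/20141115"
html = requests.get(url).text

match = re.search(r"window\.espn\.scoreboardData\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    # ... walk the parsed structure instead of scraping the HTML ...
else:
    print("No embedded JSON found -- fall back to parsing the HTML")
```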

Saturday, February 6, 2016

Kaggle Competition is Back for 2016

I've been remiss about posting to the blog, but I thought I'd share that a little birdie hinted to me that the Kaggle Competition will be back again this year, with perhaps some new twists.  So keep your predictors warmed up.

I'm undecided whether I'm going to provide "Steal My Entry" again this year, but I might be interested in a private collaborative effort. In particular my thought is to merge an entry from my predictor -- which mostly focuses on regular-season games -- with a predictor that has specifically been trained on tournament games.  I'll provide my model's game predictions for all the tournament games back to 2009, and then you train a tournament-specific model using my predictions along with any other information you think is valuable (e.g., team seedings, locations, etc.).  Contact me if that sounds interesting -- and this isn't an exclusive offer, I'm happy to collaborate with multiple folks either individually or as part of a larger group.