Monday, March 28, 2016

Sorry About That!

I have to apologize to anyone who Stole My Entry over on Kaggle, because the Net Prophet predictor has made a hash of it this Tournament, and is mired low in the Leaderboard and well below the median entry.  A number of the upsets have been very improbable according to the Net Prophet predictor and it has suffered accordingly.

It's worth noting that some others have been suffering too:  Monte McNair has done better than Net Prophet but not by a whole lot.   Ken Massey entered for the first time and is very low on the Leaderboard (apparently because he gambled rather heavily on 2-15 matchups).   The most interesting story is ShiningMGF, who started poorly (perhaps because their first-round predictions are influenced by the Vegas lines?) but have been climbing steadily and are now in tenth place.  Three straight Top Ten finishes are almost certainly a good indication that they know something the rest of us don't!

Over at the Machine Madness contest, Net Prophet isn't doing any better, being one of the many entries that predicted Kansas as the eventual champion.  It looks like "SDSU" has the win locked up already.  "Predict the Madness" is likely to finish second unless North Carolina loses the next game.  Beyond that it gets a little murky, but all the entries with UNC winning it all have an obvious advantage.

But regardless of who wins, it's been a great turnout for the contest (40 entries!) and I want to give my sincere thanks to everyone who entered.  It's really great to see so much interest and participation!

Tuesday, March 22, 2016

What Would a Perfect (Knowledge) Predictor Score in the Kaggle Competition?

It isn't possible to have a perfect predictor for NCAA Tournament games, because the outcome is probabilistic.  We can't know for sure who is going to win a game.  But we could conceivably have a predictor with perfect knowledge.  This predictor would know the true probability for every game.  That is, if Duke is 75% likely to beat Yale, the perfect knowledge predictor would provide that number.  (Because predicting the true probability results in the best score in the long run.) What would such a predictor score in the Kaggle Contest?

The Kaggle contest uses a log-loss scoring system.  In this system, a correct prediction is worth the log of the confidence of the prediction, and an incorrect prediction is worth the log of one minus the confidence of the prediction.  (And for the Kaggle contest the sign is then swapped so that smaller numbers are better.)

Let's return to our example of Duke versus Yale.  Our perfect knowledge predictor predicts Duke over Yale with 0.75 confidence.  What would this predictor score in the long run?  (I.e., if Duke and Yale played thousands of times.)  Since the prediction is also the true probability that Duke will win, that number is given by the equation:

`0.75 * ln(0.75) + (1-0.75) * ln(1-0.75)`

that is, 75%  of the time Duke will win and in those cases the predictor will score ln(0.75), and 25% of the time Yale will win and the predictor will score ln(0.25).   This happens to come out to about -0.56 (or 0.56 in Kaggle terms).
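As a quick check on that arithmetic, here is a minimal Python sketch of the calculation. The function names are my own for illustration and are not part of any Kaggle tooling.

```python
import math

def log_score(confidence: float, favorite_won: bool) -> float:
    """Log score for one game, before Kaggle's sign flip.

    `confidence` is the predicted probability that the favorite wins.
    """
    return math.log(confidence if favorite_won else 1.0 - confidence)

def expected_log_score(p_true: float, confidence: float) -> float:
    """Long-run average score when the favorite truly wins with probability p_true."""
    return (p_true * log_score(confidence, True)
            + (1.0 - p_true) * log_score(confidence, False))

# Perfect knowledge: the prediction equals the true probability.
duke_over_yale = expected_log_score(0.75, 0.75)
print(round(duke_over_yale, 2))   # -0.56
print(round(-duke_over_yale, 2))  # 0.56 in Kaggle terms
```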

So we see how to calculate the expected score of our perfect knowledge predictor given the true advantage.  If the favorite in all the Tournament games was 75% likely to win, then our perfect predictor would be expected to score 0.56.  But we don't know the true advantage in Tournament games, and they're all different advantages.  Is there some way we can estimate this?

One approach is to use the historical results.  We know how many games were upsets in past Tournaments, so we can use this to estimate the true advantage.  For example, we can look at all the historical 7 vs. 12 matchups and use the results to estimate the true advantage in those games.  (One problem with this approach is that in every Tournament, some teams are "mis-seeded".  If we judge upsets by seed numbers, this adds some error.)

Between this Wikipedia page and this ESPN page we can determine the win percentages for every possible first-round matchup.  There have been a reasonable number of these matchups (128 for each type of first-round matchup) so we can have at least a modicum of confidence that the historical win percentage is indicative of the true advantage:

| Matchup | Win Pct |
|---|---|
| 1 vs. 16 | 100% |
| 2 vs. 15 | 94% |
| 3 vs. 14 | 84% |
| 4 vs. 13 | 80% |
| 5 vs. 12 | 64% |
| 6 vs. 11 | 64% |
| 7 vs. 10 | 61% |
| 8 vs. 9 | 51% |

Using the win percentage as the true advantage, we can then calculate what our perfect knowledge predictor would score in each type of match-up:

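| Matchup | Win Pct | Score |
|---|---|---|
| 1 vs. 16 | 100% | 0.00 |
| 2 vs. 15 | 94% | -0.22 |
| 3 vs. 14 | 84% | -0.45 |
| 4 vs. 13 | 80% | -0.50 |
| 5 vs. 12 | 64% | -0.65 |
| 6 vs. 11 | 64% | -0.65 |
| 7 vs. 10 | 61% | -0.67 |
| 8 vs. 9 | 51% | -0.69 |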

Since there are equal numbers of each of these games, the average performance of the predictor is just the average of these scores:  -0.48.
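The per-matchup scores and their average can be recomputed with a short Python sketch. It uses the rounded win percentages from the table, so a row or two may differ from the table in the last digit (for instance, 94% gives about -0.23 rather than -0.22), possibly because the table was built from unrounded percentages. The dictionary and function names are illustrative.

```python
import math

# Historical win percentage for the better seed, rounded as in the table above.
WIN_PCT = {
    "1 vs. 16": 1.00, "2 vs. 15": 0.94, "3 vs. 14": 0.84, "4 vs. 13": 0.80,
    "5 vs. 12": 0.64, "6 vs. 11": 0.64, "7 vs. 10": 0.61, "8 vs. 9": 0.51,
}

def perfect_knowledge_score(p: float) -> float:
    """Expected log score when the prediction equals the true probability p.

    Terms with probability 0 contribute nothing (the limit of q*ln(q) is 0).
    """
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

scores = {matchup: perfect_knowledge_score(p) for matchup, p in WIN_PCT.items()}
average = sum(scores.values()) / len(scores)

for matchup, score in scores.items():
    print(f"{matchup:9s} {score:6.2f}")
print(round(average, 2))  # -0.48
```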

This analysis can be extended in a straightforward way to the later rounds of the tournament, but since there are fewer examples in each category it's hard to have much faith in some of those numbers.  But I would expect the later round games to make the perfect knowledge predictor's score worse, because more of those games are going to be close match-ups like the 8 vs. 9 case.

So 0.48 probably represents an optimistic lower bound for performance in the Kaggle competition.


Here's a rough attempt to estimate the performance of the perfect predictor in the other rounds of the Tournament.

According to the Wikipedia page, there have been 52 upsets in the remaining rounds of the Tournament (a rate of about 2%).  If we treat all these games as having an average seed difference of 4 (which is a conservative estimate), then our log-loss score on these games would be about -0.66.  (Intuitively, this is as we would expect -- with most of the low seeds eliminated, games in the later rounds are going to be between teams that are more nearly equal in strength, so our log-loss score will be correspondingly worse.)  Since there are about as many first round games as games in all the other rounds combined, the overall performance is just the average of -0.48 and -0.66:  -0.57 (or 0.57 in Kaggle terms).


Over in the Kaggle thread on this topic, Good Spellr pointed out that if you treat the first round games as independent events with a normal distribution, you can estimate the variance as well:

`variance = (1/n^2) * sum_{i=1}^{n} p_i * (1 - p_i) * (ln(p_i / (1 - p_i)))^2`

which works out to a standard deviation of about 0.07. That means that after the first round of the tournament, the perfect knowledge predictor's score would fall in the range [0.34, 0.62] (that is, 0.48 plus or minus two standard deviations) about 95% of the time.
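Here is a small Python sketch of that variance calculation. It assumes four first-round games for each matchup type (32 games in all) and uses the rounded historical win percentages as the true probabilities; the names are illustrative.

```python
import math

# Rounded win percentages for the better seed in each first-round matchup type.
FAVORITE_WIN_PCT = [1.00, 0.94, 0.84, 0.80, 0.64, 0.64, 0.61, 0.51]
GAMES_PER_MATCHUP = 4  # one game of each type in each of the four regions

def game_variance(p: float) -> float:
    """Variance of the log score for one game predicted at its true probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome contributes no variance
    return p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2

probs = [p for p in FAVORITE_WIN_PCT for _ in range(GAMES_PER_MATCHUP)]
n = len(probs)

# Variance of the average score over n independent games.
variance = sum(game_variance(p) for p in probs) / n ** 2
std_dev = math.sqrt(variance)
print(round(std_dev, 2))  # 0.07
```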

Sunday, March 20, 2016

A Quick Update

I'm still in Brooklyn watching games (well, we're done watching now -- had a couple of fun games) and have been too busy to do more than minimum checking of email, but I found time to check on the Machine Madness contest.  I see that we have an amazing 40 contestants -- presumably most found us through the Kaggle Contest, but it's great to see the participation.  What's not so great is that the Net Prophet entry is doing poorly both here and at the Kaggle Contest, but that's a post for another day :-)

Tuesday, March 15, 2016

Year End Rankings

I'm not really into ranking teams that much (because match-ups depend on many more factors), but I came up with a new (and I think better) rating system today and here's how it ranks the Top Twenty:

| Rank | Team | Rating |
|---|---|---|
| 1 | North Carolina | 131.6 |
| 3 | Michigan State | 126.9 |
| 4 | West Virginia | 125.3 |
| 17 | Miami Florida | 99.9 |
| 20 | Wichita State | 97.4 |

I'm not entirely sure what I think of this.  The top of the rankings isn't too surprising, although I think most folks wouldn't have UNC ahead of Kansas and MSU.  Oklahoma is much lower than the #2 seed they received.  Wichita State is also a surprise at 20 -- although they seem to be handling Vanderbilt tonight so maybe there's something to that.

And I guess you could conclude that it's a bad year for Louisville and SMU to be on probation -- they were both very solid this year.

Monday, March 14, 2016

Does Coaching Experience Matter?

One of the things I investigated in the run-up to the Tournament this year was whether coaching experience matters.  My approach was pretty simplistic -- I offered my prediction model information on how a team/coach had performed the previous year in the Tournament to see if that information had any predictive value.  It didn't -- at least for my model.

Over at Harvard Sports Analysis Collective (worth reading, by the way), Kurt Bullard takes a better look at the same question.  He looks at how coaches perform relative to their seeding over their coaching lifetime.  If experience matters, you'd expect coaches with more experience to do better.  But that's not the case -- there's no correlation between how well a coach does and how much experience he has.  (Alternatively, it could be that his experience is factored into the seed his team gets, although I'd argue that's probably not the case.)

At any rate, you might want to be leery of analysts who say that "Michigan State is going to do well in the Tournament because Coach Izzo has more experience than anyone in the Tournament."  Michigan State probably is going to do well -- but that's because the Committee mis-seeded them, not because of Coach Izzo's experience.

Wednesday, March 9, 2016

That's Not Really A Number

Suppose that you're competing in the Kaggle competition and you're using team win ratios and average scoring for the season to predict who is going to win a game.  Your input to your model might look something like this:

| Team | Win Ratio | Ave. Score | Team | Win Ratio | Ave. Score |
|---|---|---|---|---|---|

Your results are mediocre, so you decide to improve your model by adding more information about each team.  The seeding of the team -- the NCAA Tournament committee's assessment of team strength -- seems like it would be useful for prediction, so you add each team's seeds to your inputs:

| Team | Win Ratio | Ave. Score | Seed | Team | Win Ratio | Ave. Score | Seed |
|---|---|---|---|---|---|---|---|

You've just made a mistake.  Do you see what it is?

The way you've added the seeding information, many machine learning tools / models are going to treat the seed as a number (1), not any different from the Win Ratio or the Average Score.  And that's a problem, because the seed is not really a number.  It's actually what statisticians would call a categorical variable, because it can take one value out of a fixed set of arbitrary values.  (Machine learning types might be more likely to call it a categorical feature.)  If you're not convinced about that, imagine replacing each seed with a letter -- the #1 seed becomes the A seed, the #2 seed becomes the B seed and so on.  This is perfectly reasonable -- we could still talk about the A seeds being the favorites, we'd know in the first round the A seeds play the P seeds and so on.  It wouldn't make any sense to try to do that with a numeric feature like (for instance) the Average Score.

Another difference between numeric and categorical features that merely look like numbers is that real numeric features have a fixed scale, while categorical features have an arbitrary scale.  The difference between an Ave. Score of 86.5 and an Ave. Score of 81 is 5.5 points, and that's the same amount as the difference between an Ave. Score of 78.8 and 73.3.  But the difference between a 16 seed and a 15 seed might be quite different than the difference between a 1 seed and a 2 seed.  (Not to mention that the difference between a 16 seed and a 15 seed might be quite different than between the 16 seed and a different 15 seed!)

So if I've convinced you that team seeds are not really numbers but just look like numbers, then how should you represent them in your model?

The basic approach is something called "one hot encoding" (the name derives from a type of digital circuit design).  The idea behind one hot encoding is to represent each possible value of the categorical feature as a different binary feature (with value 0 or 1).  For any particular value of the categorical feature, one of these binary features will have the value 1 and the rest will be zero.  (Hence "one hot".)  To represent the seeds, we need 16 binary features:

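| Team | Win Ratio | Ave. Score | Seed_1 | Seed_2 | Seed_3 | Seed_4 | Seed_5 | ... | Seed_16 | Team | Win Ratio | Ave. Score | Seed_1 | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|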

This representation gives your model much more flexibility to learn the differences between the seeds.  For example, it could learn that being a 5 seed is every bit as valuable as being a 4 seed.  If you represented the seed as a number, it would be impossible to learn that.
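To make the encoding concrete, here is a minimal Python sketch that builds one input row for a game. The team dictionaries, field names, and statistics are made up for illustration, and the team name is left out of the numeric row since it is only an identifier.

```python
NUM_SEEDS = 16

def one_hot_seed(seed: int) -> list[int]:
    """Return 16 binary features with a 1 in the position for this seed."""
    if not 1 <= seed <= NUM_SEEDS:
        raise ValueError(f"seed must be between 1 and {NUM_SEEDS}, got {seed}")
    return [1 if s == seed else 0 for s in range(1, NUM_SEEDS + 1)]

def game_features(team_a: dict, team_b: dict) -> list[float]:
    """Concatenate each team's numeric stats with its one-hot encoded seed."""
    features: list[float] = []
    for team in (team_a, team_b):
        features += [team["win_ratio"], team["avg_score"]]
        features += one_hot_seed(team["seed"])
    return features

# Hypothetical teams, not real season data.
favorite = {"win_ratio": 0.85, "avg_score": 78.8, "seed": 4}
underdog = {"win_ratio": 0.70, "avg_score": 73.3, "seed": 13}

row = game_features(favorite, underdog)
print(len(row))  # 36: two teams x (2 stats + 16 seed features)
```

Libraries such as pandas and scikit-learn provide their own one-hot encoders; the hand-rolled version here is only meant to show what the resulting features look like.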

One drawback of this representation is that it has a tendency to rapidly increase the number of input features.  Seeds added 32 features to the model -- adding the Massey ordinals to your model could add many thousands of features!  That can be bad for several reasons.  First, the increased dimensionality will slow down the training of your model.  This may or may not be a problem, depending upon the overall size of your training data and the computational power you have for training.  A more significant problem is that it will encourage your model to overfit.  Even if you use the Tournament data all the way back to 1985 (which I do not encourage) there may be only a few games for unlikely pairings such as (say) a #8 seed playing a #16 seed.  That may cause your model to learn a rule for those games that is too specific.

Now it may be that you disagree with the very premise of this posting.  You really do think that seeds can be treated like numbers.  The good news for you is that it is very easy to do that -- just include the seed as both a numeric feature and a categorical feature.  You can look at the size of your model coefficients or other components to see whether treating the seed as a number has any value or not.

(1) This is admittedly a simplification.  Some machine learning models (such as tree-based models) may be flexible enough to treat numeric variables in ways that mimic categorical variables.  But many models (and notably linear and logistic regressions) will not.

Tuesday, March 8, 2016

Machine Madness 2016

This is just a quick post to announce the return for the 7th year of the Machine Madness contest, for machine predictors to compete in a traditional bracket-style March Madness competition.  Details can be found on this page.  Last year, Dr. Amanda Schierz (aka "Bluefool") won the competition and ESPN immediately (*) sent a film crew all the way to England to interview her.  Don't miss out on your chance to be a media star!

(*) If by immediately you mean a year later.

Please come compete, and let me know if you have any questions!

Friday, March 4, 2016

Scoring the Kaggle Contest

In this previous post, I talked briefly about whether competitors in the Kaggle March Madness contest should "gamble" with their entries.  The short answer is "yes" -- if your goal is to win money, then your best strategy is to gamble with at least some of your game predictions.  (How many games you should gamble with is an interesting question for another time.) In my opinion, that's a sign that the contest is broken.  Rather than testing who can make the best predictions about the NCAA Tournament, the contest is testing who can formulate the best meta-strategy for winning the contest.  

So, is it possible to fix the contest so that the results more accurately identify the best predictions?

The log-loss scoring function asks competitors to provide a confidence in their predictions, and then scores them based upon how confident their correct (or incorrect) prediction was.  If you analyze this scoring approach, you find that the best strategy in the long run is for the competitors to set their confidence in each prediction to exactly their real confidence.  And that's exactly what you want for a fair and accurate scoring system.
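A quick numeric check of that long-run claim, as a Python sketch: for an assumed true win probability of 0.75 (an arbitrary example value), scan a grid of possible submitted confidences and see which one gives the best expected log score.

```python
import math

def expected_log_score(p_true: float, confidence: float) -> float:
    """Long-run average log score (before the sign flip) for a submitted confidence."""
    return (p_true * math.log(confidence)
            + (1.0 - p_true) * math.log(1.0 - confidence))

p_true = 0.75  # assumed true probability that the favorite wins

# Candidate confidences 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda q: expected_log_score(p_true, q))
print(best)  # 0.75
```

Shading the confidence in either direction lowers the expected score, which is why honest predictions are the best strategy over a long enough run.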

The problem is that this contest is not a "long run".  In fact, it's anything but a long run -- there are only 63 games being scored.  That's a lot compared to predicting (say) just the Super Bowl, but for a contest like this it's not nearly long enough to ensure that true predictions are the best strategy.

So, how can we fix the scoring to better reward true predictions?

The obvious fix of having the teams play a few thousand games is probably a non-starter.  But it does point towards the necessary condition:  We want the competitors to be making many choices instead of just 63.  My suggestion is to have the competitors predict the Margin of Victory (MOV) for each game, and score them on how close they get to the actual MOVs.  Now instead of making 63 binary predictions, the competitors are making 63 predictions with many more choices, and -- crucially -- they don't have control over how much they will win/lose on each prediction.
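What a Margin of Victory score might look like is sketched below in Python. The suggestion above doesn't pin down a particular metric, so mean absolute error is purely an assumption here (root mean squared error would be another reasonable choice), and the games are made up.

```python
def mov_score(predicted: list[float], actual: list[float]) -> float:
    """Mean absolute error between predicted and actual margins (lower is better).

    A positive margin means the first-listed team won by that many points.
    """
    if len(predicted) != len(actual):
        raise ValueError("need one prediction per game")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Three hypothetical games.
actual_movs = [12, -3, 7]
honest = [9.0, 1.5, 5.0]

print(round(mov_score(honest, actual_movs), 2))  # 3.17
```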

It should be obvious that this makes it more difficult to "gamble" for an improved score.  Consider last year, where Kentucky was viewed as an overwhelming favorite coming into the Tournament.  Under the current scoring system there was an easy and obvious "gambling" strategy -- predict that Kentucky would win every game and set your confidence in each of those games very high.  (And in fact, if Kentucky had won the championship game, a gambling strategy would probably have won the contest.)  However, under the Margin of Victory scoring system, how would you "gamble" to improve your chances of winning the contest?  It's hard to imagine any approach that would work better than submitting your actual best predictions.

The Kaggle contest is a fun diversion and I think the results have provided some interesting insight into predicting college basketball games.  But I think the contest would be improved by using a scoring system that more accurately identified the best predictor, and I'll continue my low-key lobbying efforts (*) for that change.

(*) Which consist entirely of posting something like this every year :-)

Thursday, March 3, 2016

To Gamble or Not To Gamble, That is the Question

Or at least that's "a" question -- one that comes up yearly in the Kaggle competition.  Here's a version of it that popped up this year.

The Kaggle competition (for those who aren't aware) uses log-loss scoring.  Competitors predict which team will win as a confidence level (e.g., 95% certain of a win by Kentucky) and then are rewarded/punished accordingly.  And since the scoring is logarithmic, you are punished a lot if you make a very confident wrong decision.

The question that plagues competitors is whether forcing their predictions to be more conservative or less conservative will improve their chances of winning the contest.  (Or at least finishing in the top five and receiving a cash prize.)  Note that this is only concerned with winning the contest, not with improving the predictions.  Presumably your predictions are already as accurate as you can make them, and artificially changing them would make them worse -- in the long run.  But the Kaggle contest isn't concerned with the long run -- it's only concerned with how you perform during this particular March Madness.

As a thought experiment, let's assume that you could change your entry right before the final game.  You can see the current standings, but not any of the other entries.  Would you change your entry?  And if so, how?

Well, if you see that you're in first place with a big lead, you might not change it at all.  Or maybe you'd make your pick more conservative so that you could be sure you wouldn't lose much if your pick was wrong.  But if you didn't have a big lead (and in general the farther away from first place you were) you'd probably want to gamble on getting that last game correct.  At that point "average" performance cannot be expected to move you ahead of the teams ahead of you, and even "good" performance might be passed by someone behind you who was willing to gamble more than you.

Since it's much more likely that you will be losing the contest going into the final game than in first place with a big lead, I think this argues that (if your goal is to maximize your expected profit) you should "gamble" on at least the last game.  It's left to the reader to apply this reasoning recursively to games before the final game :-).
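The last-game thought experiment can be made concrete with a toy Python sketch. Everything in it is an assumption for illustration: there is a single leader to catch, the leader submits the true probability, and the deficit is measured in summed (not averaged) log-loss points.

```python
import math

def win_probability(deficit: float, p_true: float,
                    leader_conf: float, my_conf: float) -> float:
    """Chance of finishing ahead of the leader after one final game.

    deficit: how far behind the leader you are, in summed log-loss points.
    p_true: true probability that the favorite wins.
    leader_conf, my_conf: submitted confidences that the favorite wins.
    """
    chance = 0.0
    # Favorite wins: you gain ln(my_conf) - ln(leader_conf) on the leader.
    if math.log(my_conf) - math.log(leader_conf) > deficit:
        chance += p_true
    # Underdog wins: you gain ln(1 - my_conf) - ln(1 - leader_conf).
    if math.log(1.0 - my_conf) - math.log(1.0 - leader_conf) > deficit:
        chance += 1.0 - p_true
    return chance

deficit, p_true, leader_conf = 0.2, 0.6, 0.6

honest = win_probability(deficit, p_true, leader_conf, my_conf=0.6)
gamble = win_probability(deficit, p_true, leader_conf, my_conf=0.95)
print(honest)  # 0.0
print(gamble)  # 0.6
```

In this toy setup the honest entry can never catch the leader, while the gambling entry wins whenever the favorite does.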

As a concrete example of this, last year Juho Kokkala submitted entries based upon "Steal This Entry" but with Kentucky's probabilities turned up to 1.0.   The non-gambling "Steal This Entry" finished in 42nd place, but if Kentucky had won out, Juho would have probably placed in the top two and collected some prize money.