Friday, May 27, 2011

The Offense-Defense Model & Probabilistic Matrix Model (PMM)

The first two MOV-based models will look at are similar:  they both calculate an "Offense" and a "Defense" for each team.  The "Offense" number represents a teams offensive capability and the "Defense" the defensive capability.  When Duke plays UNC, Duke's predicted score is Duke's "Offense" times UNC's "Defense."  Generally speaking, these numbers are calculated by initializing all teams to some baseline numbers (e.g., so that the expected score OxD = 65) and then iteratively adjusting the values so that they more closely match actual game outcomes.  If Arizona State University consistently scores few points, it's "Offense" value will drop (and at the same time it's opponents' "Defense" value will also drop).  After some number of iterations adjusting the numbers, the total error (across all games) will be minimized.

The Offense-Defense model is a version of this I first implemented for the 2010 March Madness Predictive Analytics Challenge.  It's a fairly simple model.  For each team, for each game, it predicts a score based upon the current Offense and Defense ratings.  It then determines the error between the prediction and the actual game result, and adjusts the appropriate Offense and Defense ratings to remove 75% of the error.  It then iterates this across all the games for a fixed number of iterations.  (This algorithm isn't guaranteed to converge, although in practice it usually does.)

Testing this algorithm with our usual methodology gives these results (for comparison, I show the best non-MOV predictor as well):

Predictor    % Correct    MOV Error
TrueSkill + iRPI72.9%11.01
Offense-Defense69.6%11.84

This performance (with some adjustments) was enough to win the 2010 Challenge, but seems disappointing in comparison to the TrueSkill + iRPI performance.  In particular, we might expect ratings based upon MOV to have lower MOV error rates, but that is not the case here.  I also implemented a version of the Dick Vitale methodology, where I calculated separate home and away ratings for all teams.  In this case, our predicted score is the home team's home Offense times the away team's away Defense (and vice versa).  Here's how that performs:

Predictor    % Correct    MOV Error
TrueSkill + iRPI72.9%11.01
Offense-Defense (home & away)68.2%12.26

Surprisingly (at least to me) this is significantly worse than the undifferentiated ratings.  Perhaps this is additional evidence that teams don't play differently at home than away; the home court advantage would then be due primarily to the referees -- a conclusion shared by Sports Illustrated.

The second model I tested is the "Probabilistic Matrix Model" (PMM).  This model is based upon the code Danny Tarlow released for his tournament predictor, which he discusses here.  This is similar in spirit to the Offense-Defense model, if much more sophisticated mathematically.  (You can tell this because the code has variables like s_hat_i in it.)  Testing PMM gives these results:

Predictor    % Correct    MOV Error
TrueSkill + iRPI72.9%11.01
Offense-Defense69.6%11.84
PMM71.7%11.23

The PMM does better than my naive Offense-Defense model (apparently there's something to all that math stuff) but still does not approach the performance of TrueSkill + iRPI.  I did not implement separate home & away ratings, but there's no reason to think they would provide improved performance.

Combinations & Other Models

Before we move on to MOV-based rating systems, it may be instructive to look at combining the various "RPI-like" rating systems to see if using them together can improve our prediction performance.  For these experiments, I'll be looking at our three best RPI-like ratings: TrueSkill, Improved RPI (iRPI) and the Iterative Strength Rating (ISR).

The first experiment we can try is to use more than one rating as an input to our linear regression.  The following table shows the performance using various combinations of the three ratings:

Predictor    % Correct    MOV Error
TrueSkill72.8%11.07
TrueSkill + Improved RPI72.9%11.01
TrueSkill + Iterative Strength Rating72.7%11.05
TrueSkill + Improved RPI + Iterative Strength Rating73.0%11.01
Improved RPI + Iterative Strength Rating71.9%11.31

The combination of TrueSkill and the Improved RPI improves performance modestly.  Adding in the Iterative Strength Rating does little (and in fact, the home team's ISR gets optimized out of the linear regression).

Another experiment we can try is to do a separate linear regression for each rating and then average their predictions (for regression tasks, this is done with the Vote operator in RapidMiner).  Here are some averaging results:

Predictor    % Correct    MOV Error
TrueSkill + Improved RPI (combined regression)72.9%11.01
TrueSkill + Improved RPI (averaged)72.9%11.04
TrueSkill + Iterative Strength Rating (averaged)72.8%11.17
TrueSkill + Improved RPI + Iterative Strength Rating (averaged)73.0%11.16

Averaging provides worse performance than using a single linear regression.

We can also try using a more sophisticated prediction model than a linear regression.  For cases where we're only using a single rating value for each team, we wouldn't expect this to provide significantly better performance than the linear regression.  Here are the performances of some alternative models for the TrueSkill ratings:

Predictor    % Correct    MOV Error
TrueSkill (neural network)72.8%11.07
TrueSkill (support vector machine)72.7%11.08
TrueSkill (k-NN, k=96)72.7%11.13

As expected, there is no improvement over a simple linear regression.  The alternate models also provide no benefit when we are using multiple ratings:

Predictor    % Correct    MOV Error
TrueSkill + Improved RPI (neural network)68.8%12.24
TrueSkill + Improved RPI (support vector machine)72.9%11.02
TrueSkill + Improved RPI (k-NN, k=96)72.8%11.13

SVNs do the best here, but still not an improvement over the linear regression.

So combining TrueSkill & Improved RPI into a single regression is an improvement, but generally more sophisticated models don't provide any value.  It's interesting to note that this performance is already as good as the best models reported in the literature.

Unless I get further distracted, I'll be moving on next to assessing ratings/models which make use of the margin of victory (MOV) next.  I have a small collection of ratings/models to assess, but I'm always looking for inputs, so if you have a favorite ranking that you'd like to see included, please let me know!

Tuesday, May 24, 2011

RPI-Like Summary

I've re-implemented the incorrect algorithms and results are below.   But before summarizing the results so far, I'll briefly mention a few others that I did not implement.

Glicko

The Glicko Rating system was developed by Mark Glickman (chairman of the US Chess Federation (USCF) ratings committee) as an improvement upon ELO.  The Glicko Rating system is very similar to Microsoft's TrueSkill rating system.  Both are based on Bayesian reasoning and provide both a rating and an uncertainty.  For two player games that don't produce ties (i.e., basketball) the only significant difference is that Glicko uses a logistic distribution of performance ratings rather than the Gaussian distribution used by TrueSkill.  It seems unlikely that this small difference would result in a significant difference in predictive performance.  (The Glicko-2 system adds a "volatility" factor, and this might be worth investigating at some point.)

Jon Dokter

The Prediction Tracker tracks the accuracy of various NCAA basketball rating systems both for won-loss and against the Las Vegas line.  The highest rated system as of the end of the 2010-2011 season belonged to Jon Dokter.  Dokter's ratings are intended to be predictive (he sells wagering advice for \$10.99/week) but are not well-explained.  This page provides a general overview of his methods, but there is not enough detail to replicate his rating system for testing.

Bethel Rank

Roy Bethel proposed a ranking system based upon "Maximum Likelihood Estimation" specifically to address the problem of sports with unequal strength of schedules, i.e., where teams do not play a complete round-robin.  (His paper is available in the papers archive.) While Bethel's rating system appears interesting, it cannot handle winless or lossless teams.  Since both happen with some regularity in college basketball (and are certain to occur for a significant portion of the season) using this rating is problematic.

NCAA Tournament-Specific Ratings

A number of people have created rating systems specific to predicting the NCAA tournament, e.g., Bradley West and Nate Silver.  Most of these rely on seeding, human polls or other information that exists only for the tournament, and makes them unsuitable for predicting regular season games.

Other Systems

I'm interested in any pointers to other rating systems that I should investigate, particularly if they have a fundamentally different approach than the ones I've covered already.  Send me an email (srt19170@gmail.com) or leave a comment.

Summary of Results

In total, I tested about 110 algorithm variants.  (Some multiple times as I uncovered errors in my code!)  The following table summarizes the best performances for each algorithm:

Predictor    % Correct    MOV Error
Naïve50.0%14.50
1-Bit62.6%14.17
Random Walkers71.0%11.72
RPI71.2%11.62
ELO71.8%11.59
KRACH71.5%11.50
Colley71.8%11.33
Wilson71.9%11.32
ISR71.9%11.32
Improved RPI72.1%11.30
TrueSkill72.8%11.09

There are a couple of interesting points to be gathered from this.

First, the best performer (TrueSkill) represents about a 15% improvement over always picking the home team, but only about a 2% improvement over the standard RPI.  On MOV Error, it fares somewhat better, being a 22% improvement over picking the home team, and about 5% over RPI.  But given the complexity of TrueSkill compared to the 1-Bit algorithm (or even standard RPI), that isn't as much improvement as we might have hoped to see.

Second, we note that the performance of our implementation of ELO is very close to the tracked performance of Sagarin ELO at Prediction Tracker.  That gives us some confidence in these results.  (On the MOV Error side, there seems to be about a 1.5 point bias in MOV Error between my measurements and those on Prediction Tracker.  I don't know why that would be.)

Third, if we compare TrueSkill to the rating systems tracked at Prediction Tracker, we see it would have beaten all systems except Jon Doktor -- and his system makes use of MOV.  So even without making use of MOV, we have a system that is competitive with the best systems available.

Friday, May 20, 2011

Whoops!

I just discovered a significant error in the code underlying the ISR, Wilson and Colley ratings.  It's a variant of a problem I've had before -- information from later games "leaking" back into the ratings for earlier games.  In this case, it was just operator error -- I was using a function that returned all the games for the season where I should have used one that returned all the games so far.  I'm re-running the assessments of the affected rating systems and will post results when I have them, but the preliminary results indicate that the performance of these rating systems will take a significant hit, which may return TrueSkill to the top of the leaderboard.

KRACH

College hockey seems to be a hotbed for ratings systems -- possibly because (like college football) it has many teams and few games.  One of the most popular ratings sytems for college hockey is Ken's Ratings for American College Hockey (KRACH).  KRACH is based upon Bradley-Terry rankings, which is a probablistic model for pairwise comparison, i.e., gives a method for ranking a set of entities by making only pairwise comparisons.  Since games are "pairwise comparisons" between teams, there's an obvious application to ranking sports teams.

Unlike some of the rating systems we've looked at, KRACH gives the ratings a specific meaning: an odds scale.  This means that if Team A has a rating of 200 and Team B has a rating of 100, then Team A is a 2:1 favorite to beat Team B.  If Team A and Team B play an infinitely long series of games, then Team A is expected to win 2/3 of those games and Team B to win 1/3.  If Team A and Team B play just a single game, then Team A is expected to win 0.66 of that game and Team B is expected to win 0.33 of that game.

(I'm reminded of the old joke that a statistician is a person who can put his head in an oven, his feet in a freezer, and on average be quite comfortable.)

So if we're given a team and its schedule (along with the KRACH ratings) we can calculated the expected number of wins by summing up the expected wins for each game.  For example, if a Team with a Krach rating of 100 plays Teams A, B, C, and D with ratings of (respectively) 150, 100, 75 and 50, then the expected number of wins is:
X vs A:  0.40
X vs B:  0.50
X vs C:  0.57
X vs D:  0.67
------
Total:  2.14
What the KRACH system does is try to pick ratings for all the teams (simultaneously) so that the expected number of wins for each team matches the actual number of wins.  As you might imagine, this requires an iterative approach.  The update equation for each iteration is given by this formula:
Ki = Vi / Sum over j of [Nij/(Ki+Kj)]
Where Ki is the KRACH rating for team i, Vi is the number of wins for team i, and Nij is the number of times that team i and j have played each other.  To avoid problems with undefeated and winless teams, we also need to add in a "tie" game with a fictitious team with an arbitrary rating.

There's one other adjustment necessary for using the KRACH ratings for prediction.  Since the KRACH ratings represent odds, when we compare two ratings to predict an outcome, we need to use a ratio of the ratings:
Ki/(Ki+Kj)
rather than use Ki and Kj directly in our linear regression.

Implementing the KRACH rating and testing with our usual methodology yields this performance:

Predictor    % Correct    MOV Error
Wilson77.7%10.33
KRACH71.5%11.50

Once again, this rating does not provide an improvement on our best rating so far.  Because the KRACH rating is an odds rating, we might expect the MOV Error to be higher, but even on predicting outcome, KRACH is not an improvement on Wilson.

Thursday, May 19, 2011

Colley's Bias Free Matrix Rankings

(Note #1:  My original implementation of the Colley rating had an error.  See "Whoops!")
(Note #2: An interesting discussion about Colley can be found here.)

The next RPI alternative we'll look at comes from Wesley Colley.  Colley developed his rating system for rating NCAA football, and specifically to avoid bias towards conferences, traditions or regions.  Like RPI, the Colley Rating for a team has two components, one based on a team's performance and the other based upon the strength of its opponents.  Like the Iterative Strength Rating, and the Wilson Rating, the Colley Rating is calculated iteratively.

One unique feature of the Colley Rating is that, rather than use a team's winning percentage as a measure of its performance, Colley uses the "Laplace percentage."  This adjusts the winning percentage by adding a 1 in the numerator and a 2 in the denominator.  So a team with a 0-0 record will have a Laplace percentage of 1/2, and a team with a 5-2 record would have a percentage of 6/9.  This has several good effects. First, it eliminates problems that occur with undefeated teams and winless teams, which tend to introduce nasty divide-by-zero errors.  It also reduces the impact of wins and losses in the early season.  For example, with straight winning percentages, a team that is 1-0 is "infinitely" better than a team that is 0-1.  With the Laplace percentage, the 1-0 team is now 2/3, and the 0-1 team is 1/3, or half as good, which intuitively seems reasonable.

For each team, the Colley Rating calculates a "number of effective wins", determined as half the team's wins minus losses, plus the sum of all its opponents' ratings.  Then the ratings can be recalculated using Laplace percentage with the number of effective wins.  This is then repeated iteratively until the solution converges.

(Colley also provides a matrix solution to calculating the ratings, hence the "Colley Matrix" name for the rating.)

Testing this with our usual methodology provides this result:

Predictor    % Correct    MOV Error
Wilson77.7%10.33
Colley72.0%11.29

There is not much to "tweak" in the Colley Rating, and since this initial result is considerably worse than our current best, it seems unlikely that it can be tweaked to provide a new best result.

Friday, May 13, 2011

ELO

Several of the rating systems we have looked at so far borrow ideas from the ELO Chess Rating system.  The inventor of this rating system, Arpad Elo, was a master-level chess player who invented the system to enable fairer rankings of chess players.  ELO has since been used for many other two-player games, and Jeff Sagarin uses "ELO-CHESS" as one of his rating systems for basketball and football.  It is also one component of the Bowl Championship Series (BCS).

The version of ELO that I have implemented for testing (based upon the explanation given here) is defined by this formula:
N(t) = R(t)+K*(S(t)-E(t))

N(t) = the new rating for team t
R(t) = the old rating for team t
K = a maximum value for increase or decrease of rating
S(t) = the outcome of the game for team t
E(t) = the expected outcome of the game for team t
Note that (because it does not base a team's rating recursively on the ratings of any other team) ELO does not use an iterative solution like ISR or Wilson.  And unlike RPI, changing the ELO rating for a team does not affect the rating of any other team.  So it is very simple and fast to calculate.

Since there are no draws in basketball (and since we're ignoring MOV), the outcome of a game for team t [S(t)] is 1 if team t won the game, and 0 otherwise.  The heart of ELO system is determining the "expected outcome" for a game (the term E in the update equation above).  This is a number between 0 and 1 indicating how likely team t was to win this game based upon the team's current ELO rating and the opponent's current ELO rating.  ELO assumes that performance is a normally distributed random variable, and that each player has the same standard deviation.  As a result, E(t) is defined as this:
E(t) = 1/[1+10^([R(o)-R(t)]/400)]
where team o is the opponent in the current game.  (The "400" in this equation is an historical artifact.)

The only variable in the ELO formula is "K," the maximum update to the rating allowed from one game.  In chess this is typically 16 or 32 (depending upon the skill of the player).   For our purposes, we can test a range of values and look for one that maximizes performance for college basketball.

Here are the results of testing ELO with our usual methodology:

Predictor    % Correct    MOV Error
Wilson77.7%10.33
ELO (K=16)71.6%11.77
ELO (K=32)71.6%11.67
ELO (K=64)71.8%11.59
ELO (K=100)71.4%11.60
ELO (K=200)70.7%11.76

Performance seems to peak around K=64, but even at its best is significantly short of the best performing rating so far.  It is also significantly less accurate than the Trueskill rating (which is also based on Bayesian reasoning).

Wednesday, May 11, 2011

Random Walkers

The next RPI alternative we'll look at is inspired by the LRMC rating system developed by Paul Kvam and Joel Sokol at Georgia Tech.  This rating system has gotten some publicity in recent years for predicting the NCAA tournament outcomes better than other rating systems.  The LMRC rating system is inspired by ranking system for NCAA football developed by Callaghan, Porter, and Mucha.  The core of their system is what they call a "simple random walker":
Consider independent random walkers who each cast a single vote for the team they believe is the best. Each walker occasionally considers changing its vote by examining the outcome of a single game selected randomly from those played by their favorite team, recasting its vote for the winner of that game with probability p (and for the loser with probability 1-p).
So the random walker rating system simulates a large number of these random walker voters, and iterates until the votes across all the teams settles down to a steady state.

Callaghan, Porter, and Mucha derive a solution for this problem based upon linear algebra and differential equations that is beyond my mathematical abilities to comprehend (or program).  However, it is relatively straightfoward to calculate this with the sort of iterative solution we have used elsewhere.

Assume that a team T currently has W (weight) votes as the best team. (W need not be an integer.)  We update W by looking at the games that T has played, and distributing some of W to the winner and the loser of each game.  So some of the weight comes back to team T and some goes to its opponent.  We do the same thing for every other team until the weights of the teams settle down to a steady state.

If team T has played N games, we begin our distribution by splitting W into N buckets.  We then walk through the list of games that team T has played, and for each team we split the bucket (W/N) into two parts: (W/N)*p for the winning team, and (W/N)*(1-p) for the losing team.

The selection of "p" determines how the weights will be split between winning and losing teams.  If p=1.0, then the winning team in each game receives all the "votes" and the system is similar to RPI.  If p=0.50, then there is no bonus for winning games and every team tends to the same rating. In the casue of the Callaghan, Porter, and Mucha paper, they use p=0.80 to generate their ratings.  (LRMC, as we shall see, uses a more complex formula based upon MOV to determine the split.)  We will try a range of values for p=0.50 to p=1.00:

Predictor    % Correct    MOV Error
Wilson77.7%10.33
RW (p=0.60)70.3%11.87
RW (p=0.70)71.0%11.72
RW (p=0.80)69.3%12.09
RW (p=1.00)64.7%13.60

Performance is maximized somewhere in the neighborhood of p=0.70, but even so is considerably worse than our best predictor so far.

We'll revisit the "random walker" model later when we look at LRMC.  It's an intriguing framework for distributing weight between teams, and perhaps can be modified to provide better overall performance.

Tuesday, May 10, 2011

Wilson Rating

(Note:  My original implementation of the Wilson rating had an error.  See "Whoops!"

The next RPI alternative we'll look at comes from David Wilson.  Wilson's rating system was developed for use with college football, but applies equally well to college basketball.  Wilson describes this system this way:
A team's rating is the average of its opponents' ratings plus 100  for a win or minus 100 for a loss.  Wins that lower the rating and losses that raise the rating count one twentieth as much as the other games.  Post-season games count double.
A comparison of this system to the Iterative Strength Rating shows one major difference.  In the Wilson Rating, wins that lower a team's rating and losses that raise a team's rating are heavily discounted.  Other than this, the only differences between the systems are the initial values and the size of the win/loss bonus.  Both ratings are calculated with an iterative algorithm to hone in on the final values.

Implementing the standard Wilson algorithm and testing it with our usual methodology gives this result:

Predictor    % Correct    MOV Error
ISR77.7%10.45
Wilson77.7%10.33

This shows a slight improvement in MOV Error over the ISR rating.

There are only a few tweaks we can easily apply to the Wilson Rating.  One is to use our MOV cutoff to exclude close games from the ratings.  This didn't improve ISR, so it likely won't improve Wilson either:

Predictor    % Correct    MOV Error
ISR77.7%10.45
Wilson77.7%10.33
Wilson (mov=1)77.6%10.34
Wilson (mov=4)76.9%10.35

And indeed it doesn't.  Another possibility is to tweak the amount that some games are discounted.  Wilson himself initially set the discount to .01 and later raised it to .05.  If the discount is set to 1.0, Wilson is functionally equivalent to ISR; if it is set to 0.0 it maximally discounts those games, so we can try those two limit cases to see how performance is affected:

Predictor    % Correct    MOV Error
ISR77.7%10.45
Wilson77.7%10.33
Wilson (weight=1.0)77.6%10.35
Wilson (weight=0.0)77.6%10.34

Interestingly, the game discounting doesn't seem to have much of an impact.  There is a performance improvement that maximizes around 0.15 but the improvement is not substantial.

Except for the game discounting, the primary difference between the Wilson rating and the ISR is the size of the bonus given for a win or a loss.  My implementation of ISR uses a bonus of 4, and the standard Wilson algorithm uses 100, so we can try a range of values to see where performance is maximized:

Predictor    % Correct    MOV Error
ISR (bonus=4)77.7%10.45
Wilson (bonus=100)77.7%10.33
Wilson (bonus=25)77.5%10.42
Wilson (bonus=250)77.7%10.33

Performance appears to maximize for values >100.  This is somewhat counter-intuitive -- one would imagine that with an iterative algorithm like ISR/Wilson, the size of the win bonus would only affect the time to reach a steady-state solution, but apparently it can affect (at least marginally) the accuracy of the solution as well.

So the Wilson Rating with the bonus set at 100 and game discounting at 0.15 is our new champion RPI-like rating system.

Monday, May 2, 2011

Iterative Strength Rating (ISR)

(Note:  My original implementation of the ISR had an error.  See "Whoops!"

Having found some success with Trueskill, the second RPI alternative we'll look at is called "Iterative Strength Rating."  This particular rating was developed by Boyd Nation for rating college baseball teams.  Nation describes the ISR algorithm thus:
Begin with all teams set to an even rating -- 100 in this case. Then, for each game played, give each team the value of their opponent's rating plus or minus a factor for winning or losing the game -- 25 in this case. Total all of a team's results, divide by the number of games played, and that's the end of a cycle. Then use those numbers as the start of the next cycle until you get the same results for each team for two consecutive cycles.
So the ISR is an iterative algorithm similar to what we used for the "infinitely deep" RPI.  Implementing this and testing with our standard cross-validation methodology produces these results:

Predictor    % Correct    MOV Error
1-Bit62.6%14.17
Trueskill (draw=8 points)72.8%11.09
ISR77.7%10.45

This algorithm substantially outperforms our previous best algorithm (Trueskill).  It gets 5% more games correct and pushes the MOV error down below 11 points.  That's very impressive performance from a very simple approach.

There are limited opportunities for tweaking ISR.  As with our other models, we're using a linear regression, so the HCA is efficiently modeled within the regression.  The only tweak that obviously applies is to filter out some of the training data with an MOV cutoff, as we did with success for both RPI and Trueskill:

Predictor    % Correct    MOV Error
1-Bit62.6%14.17
Trueskill (draw=8 points)72.8%11.09
ISR77.7%10.45
ISR (mov=1)77.1%10.48
ISR (mov=4)76.2%10.49

Unfortunately, for ISR eliminating close games from the training set does not improve the predictive performance.  However, we've set a new highwater mark for performance from W-L only ratings.  If ISR remains the champion after look at other algorithms, we'll put more effort into seeing if the performance can be further improved.