Wednesday, August 31, 2011

Prediction by Similarity

Well, I've been all over the map with my blog postings lately and this one is no exception.  I'm going to post today about what I call "prediction by similarity."  I don't know what it's really called by the machine learning community -- back in the day when I did artificial intelligence we would have called it case-based reasoning and for all I know that may still be perfectly fine nomenclature.

The basic idea (as it applies to college basketball) is that if I want to know how Illinois @ Michigan State is going to turn out, I can look at historical examples of similar games and see how they turned out.  Hopefully they'll be a good guide for predicting the Illinois @ Michigan State matchup.

The first challenge in this method is to find similar games.  The easy solution would be if Illinois had already played (say) five games at Michigan State this season.  Presumably those results would be an excellent guide to how the current game would turn out.  Of course, it's never the case that we have five previous matchups to look at.  Even when the two teams have played before, it's usually the other end of a home-and-home series, and because of the impact of home court advantage, it's hard to use even that.  On the other hand, there are 15K games in my corpus -- and about twice that if I include the 2006-2008 seasons as additional historical examples.  Out of 30,000 games, I ought to be able to find a few pretty similar to Illinois @ Michigan State. 

About our only option for measuring the similarity of two games is to compare the season statistics (including things like ratings) of the four teams.  If the two home teams have very similar statistics, and the two visiting teams have very similar statistics, we might have some hope that the past game can be used to predict the current game.  So what statistics are important?  I have no idea (and as far as I know, no one has looked at the question) but we can make some educated guesses.  The relative strengths of the teams are probably important -- that is, we want to look at past games where the home team was about as strong as MSU and the visiting team about as strong as Illinois.  The absolute strengths might be important, too.  If we use (say) TrueSkill as our strength rating, a matchup between an 850 team and an 800 team might be quite different than a matchup between a 450 team and a 400 team.  (Or maybe not.)  Pace of play might be important -- a fast 850 team playing a fast 800 team might have a different result than a fast 850 team playing a slow 800 team.  In fact, I can probably make an argument for just about any statistic being important.
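To make that concrete, here's a rough sketch (in Python, with made-up stat names -- not my actual code) of the kind of nearest-neighbor search I have in mind.  A real version would need to normalize the statistics so that TrueSkill (hundreds of points) doesn't swamp PPP (around 1.0):

```python
import math

# Illustrative only: the game representation and stat names are
# assumptions.  Each game holds season stats for both teams.
STATS = ["trueskill", "ppp", "pppa", "poss"]

def distance(game_a, game_b):
    """Euclidean distance between the corresponding teams' statistics."""
    total = 0.0
    for side in ("home", "away"):
        for stat in STATS:
            total += (game_a[side][stat] - game_b[side][stat]) ** 2
    return math.sqrt(total)

def most_similar(target, corpus, k=3):
    """Return the k historical games closest to the target matchup."""
    return sorted(corpus, key=lambda g: distance(target, g))[:k]
```

The prediction itself would then be something like the average MOV of the k most similar games.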

As usual, I intend to compensate for my lack of knowledge with persistence and processing power.  I'll try different combinations until I find one that works well or convince myself that further search won't be fruitful.  Unfortunately, this search will take a lot more processing power.  By the end of the 2011 season, I have to rate and sort 30K games for each predicted game.  That goes a lot slower than (say) just updating RPI.  It takes several hours to make one run.

Let's take a look at what sort of "similar" games we find with a similarity function that uses TrueSkill, average points per possession (PPP), average points per possession allowed (PPPA), and average number of possessions (POSS):

  Sim    Date        Home         TS     PPP   PPPA  POSS   Away        TS     PPP   PPPA  POSS   MOV  
  0.00   1/24/2011   Pittsburgh   104.7  1.21  0.92  66.4   Notre Dame  70.61  1.13  0.98  66.7   -5
  0.20   3/21/2009   Connecticut  104.3  1.23  0.92  68.7   Texas A&M   68.39  1.07  0.98  66.9   26

Our current game (the one we're trying to predict) is in the first row: the 1/24 conference game between Pittsburgh and Notre Dame.  The next three lines are the closest matches found.  In the closest match, Oregon's statistics are amazingly close to Notre Dame's.  Florida's aren't quite as close a match to Pittsburgh -- notably, Florida's strength (as measured by TrueSkill) is significantly less.  In all of the similar games, the home team wins handily -- we would predict Pittsburgh by 16 points based upon these games.  And that probably wouldn't be a bad prediction: Notre Dame's victory was considered a significant upset.

So how well does this work as a predictor?  So far, not very well.  About the best performance I've found is a MOV Error of 12.5 and a % Correct of about 68%.  There are a couple of positives: first, I'm still working on the code, so I may find some bugs or improvements.  Second, the predictions from this model are only correlated about 60% with (say) TrueSkill, which may make it useful as part of an ensemble model.

Saturday, August 27, 2011

Correlation Between Predictors

Danny Tarlow was kind enough to give me some comments on a paper I'm writing about the work reported in this blog, and one of his suggestions was to look at whether the predictors I've tested are picking up on the same signals.  This is a significant question because if the predictors are picking up on different signals, then they can be combined into an ensemble predictor that will perform better than the individual predictors.  (Dietterich 2000) showed that
"...a necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse."
A classifier is accurate if it is better than random guessing.  Two predictors are diverse if they make different errors.  Intuitively, an ensemble will perform better than the base predictors if the errors of the base predictors are uncorrelated and tend to cancel each other out.  Our predictors are all obviously accurate, but are they diverse?

To test this we can measure the correlation between the errors made by the different predictors.  If they are uncorrelated, then it is likely that we can construct an ensemble with improved performance.  I don't have the time and energy to test all combinations of the predictors I've implemented, but here are the correlations between the top two won-loss based predictors (Wilson, iRPI) and the top two MOV-based predictors (TrueSkill+MOV, Govan):



Not surprisingly, the highest correlations are between the two won-loss predictors and between the two MOV-based predictors.  But all of the predictors are highly correlated.  The least correlated (by a hair) are Wilson and TrueSkill+MOV.  Putting those two predictors into a combined linear regression or an averaging ensemble results in performance worse than TrueSkill+MOV alone.
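For what it's worth, the correlation being measured here is just the ordinary Pearson correlation between the two predictors' per-game errors (predicted MOV minus actual MOV over the same set of games).  A quick sketch in plain Python, with hypothetical error lists as inputs:

```python
import math

# Diversity check: Pearson correlation between two predictors'
# per-game errors.  Values near 1.0 mean the predictors are making
# essentially the same mistakes.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A correlation near 1.0 is exactly what rules out a useful ensemble.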

On the other hand, perhaps using the best predictors is the wrong course.  Perhaps it's more likely that the worst predictors are uncorrelated with the best predictors, and a combination of one of the worst with one of the best would be fruitful.

  Predictor            Wilson   iRPI   TrueSkill+MOV   Govan  
  Winning Percentage   0.97     0.98   0.92            0.93

As this shows, even the 1-Bit predictor ("the home team wins by 4.5") is highly correlated with the better predictors, and using just the winning percentage shoots the correlation to 0.92+.  Adding these predictors to an ensemble with the better predictors also results in worse performance.

Of course, it's always possible that some combination of predictors will improve performance.  There's been some interesting work in this area -- see (Caruana 2004) in Papers.  But for right now I don't have the infrastructure to search all the possible combinations.

Monday, August 22, 2011

The Development Cycle

My development cycle seems to go something like this:

  Work on content --> Clean up code --> Remove impediments --+
        ^                                                    |
        |                                                    |
        +----------------------------------------------------+
(Excuse the 1970s typewriter graphics...)  I'm in the latter half of the cycle right now, refactoring my Lisp code to make it better organized & maintainable, and also addressing some little impediments in my tools (primarily on the Emacs / Lisp end). 

Lee Ming was also nice enough to scrape the rest of the 2010-2011 basketball season, so I also spent some time incorporating that into my test data.  Lee's archive is missing about 10% of the games over the course of the season -- and that's comparing it to my independent scrape, which I know is missing games.  Lee is taking a look at why he's missing games, but it's frustratingly hard to get a good, complete data set scraping Yahoo!.  If anyone knows somebody at Yahoo! that we can bug to have a dataset released (or to set up an API for the scores), please let me know!

Friday, August 19, 2011

The Relative Importance of Possessions

One of the reasons we want to calculate the number of possessions in a game is so that we can calculate "tempo-free" stats such as Points Per Possession (PPP).  By factoring out possessions, we can get better comparisons between teams playing at different paces.  One of the reasons we want to be able to predict the number of possessions in a game is to predict the Margin Of Victory (MOV) and other statistics that depend upon the pace of the game.

For example, suppose that Duke is playing Maryland, Duke's predicted Points Per Possession (PPP) for this game is PPP_Duke, and Maryland's is PPP_Maryland.  If we know the number of possessions that will occur in the game, we can then predict the MOV as:
MOV_pred = Poss_pred * (PPP_Duke - PPP_Maryland)
One of the appealing features of this predictor is that it is fairly orthogonal to predicting MOV from won-loss records or even previous MOV; so this predictor (if any good) would likely be a good candidate for an ensemble predictor with methods like TrueSkill or Govan.
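To make the arithmetic concrete, here's a trivial sketch with hypothetical numbers: a 67-possession game between a 1.10 PPP team and a 1.02 PPP team comes out to about a 5.4-point margin.

```python
def predicted_mov(poss, ppp_home, ppp_away):
    """MOV_pred = Poss_pred * (PPP_home - PPP_away)."""
    return poss * (ppp_home - ppp_away)

# e.g., 67 possessions, 1.10 PPP vs. 1.02 PPP -> about 5.4 points
```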

We've already seen that we don't (yet) have a very good method for predicting the number of possessions in a game.  But we don't know how important Poss_pred is in the equation above; it could be that we'd do fine with a fairly poor predictor and shouldn't waste too much time trying to improve it.  So how can we estimate the importance of Poss_pred?

One approach is to bound the importance by assuming that we can't predict possessions at all, and see how well we can do predicting MOV based upon only the PPP for the two teams.  If we take the actual PPPs for games (as if we had a perfect predictor for PPP) and stick them into a linear regression to predict MOV, we get this performance:

  Predictor                      Error   % Correct  
  Perfect PPP information only   1.38    91%

Which is an amazingly good result.  Without any notion of the pace of the game, we can still predict the MOV within ~1.5 points if we know the relative offensive efficiencies of the two teams.

(If you're wondering why we only get 91% of the games correct, it's because the regression optimizes for MOV, and it turns out the MOV prediction is better when the home team is slightly overweighted.)

If we throw our best possessions predictor into the model, the performance improves by about 20%:

  Predictor                                                Error   % Correct  
  Perfect PPP information only                             1.38    91%
  Perfect PPP information + "Best" possessions predictor   1.07    91%

This suggests that it's far more important to accurately predict PPP, and that even our current fairly poor possessions predictor may be good enough.

Wednesday, August 17, 2011

Possessions, Part 3

In a comment on the last posting, ProbablePicks suggested trying to predict the number of possessions in a game by a regression on the average number of possessions for both teams in their previous games.  That was an excuse to add the ability to create averaged stats to my processing framework, so I put that in, debugged it for a while and then created the suggested regression.  Here was its performance:

  Predictor                              Error  
  Possessions (67)                       6.30
  Possessions (Split Model)              5.20
  Possessions (Regression on Averages)   5.10

It does a little better than the Split Model.  The regression equation looks like this:
Poss = 0.665*HPoss_avg + 0.620*APoss_avg - 20.540
This weights the home team slightly more (about 7%) than the away team -- I speculated on this possibility with the Split Model but didn't see a performance improvement in that case.
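Applying the fitted equation is straightforward; here's a one-function sketch using the coefficients above (the team possession averages are hypothetical inputs):

```python
# The fitted regression: home average weighted slightly more than away.
def predict_possessions(home_avg, away_avg):
    return 0.665 * home_avg + 0.620 * away_avg - 20.540
```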

One speculation I've had is that possessions might be harder to predict in close games.  There will usually be more fouling and aggressive defense that might create turnovers and additional possessions.  We can (sort of) look at that by filtering out games where the MOV was above some cutoff:

  Predictor                              Error  
  Possessions (Regression on Averages)   5.10
  Possessions (MOV > 8)                  4.97
  Possessions (MOV > 12)                 4.77
  Possessions (MOV < -8)                 5.17
  Possessions (MOV < -12)                5.19

There's an interesting result here:  We can do a better job predicting possessions when the home team is winning a blowout, but we do worse predicting possessions when the away team is winning a blowout.  I'd be inclined to think this was because games with MOV > 12 (say) are going to skew to the top end of the possessions range anyway, and the compression of the range of possible results will reduce the error.  But that's contradicted by the results for away team blowouts, so there's presumably some other explanation.  Of course, for predictive purposes this doesn't matter because we won't know the MOV anyway, but it's an intriguing result nonetheless.

Tuesday, August 16, 2011

Possessions, Continued

A few more experiments and thoughts about predicting the number of possessions in a game.

Following up on my previous post, I took a look at some variants of the model used there to predict possessions.  To start with, I looked at a variant model where one team had more control over the pace of the game:
Poss_pred = Alpha * PreferredPoss_Home + (1 - Alpha) * PreferredPoss_Away
The idea being that perhaps the home team has more control over the pace of the game -- analogous to the home court scoring advantage.   Or perhaps the away team has the advantage. However, the results didn't indicate an advantage for either team.  Prediction performance went down for any value of Alpha significantly different from 0.50.
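The Alpha search itself is simple enough to sketch (the preferred paces and candidate Alphas here are hypothetical -- this is a Python stand-in, not my actual code):

```python
# Try each candidate Alpha and keep the one with the lowest mean
# absolute error on historical games.
def best_alpha(games, alphas):
    """games: list of (preferred_home, preferred_away, actual_possessions)."""
    def mae(alpha):
        errs = [abs(alpha * home + (1 - alpha) * away - actual)
                for home, away, actual in games]
        return sum(errs) / len(errs)
    return min(alphas, key=mae)
```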

The next experiment was to use a different model altogether for predicted possessions:
Poss_pred = F_Home * F_Away
There's no intuitive explanation for this model -- it just presumes that there's a multiplicative relationship between two underlying factors.  But it's quite different from our intuitive model (that the two teams essentially split the difference between their desired paces), so at a minimum, if this model worked poorly, it would be evidence that our intuitive model has some validity.

But interestingly enough, this model did just about as well as the split model:

  Predictor                            Error  
  Possessions (Split Model)            5.20
  Possessions (Multiplicative Model)   5.22

This suggests to me that the intuitive approach may not be particularly valid.

So I took a step back and tried to characterize the range of performance for predicting possessions.  To start with, I created a predictor that simply predicts the average for the test data (~67):

  Predictor                            Error  
  Possessions (67)                     6.30
  Possessions (Split Model)            5.20
  Possessions (Multiplicative Model)   5.22

This showed that the split model is only about 1 possession per game more accurate than just guessing the average.  As a second experiment, I ran a linear regression using all the teams as the attributes -- this generates a huge regression with 592 terms (essentially a Home term and an Away term for every team in Division I) with the following performance:

  Predictor                               Error  
  Possessions (67)                        6.30
  Possessions (Split Model)               5.20
  Possessions (Multiplicative Model)      5.22
  Possessions (Regression on All Teams)   4.72

I wouldn't expect much out of this model, but it does about half a possession better than the Split Model.  (It should be noted that this is not an apples-to-apples comparison to the other models; this simple regression uses the entire season of data, not just the season up to the predicted game.)  So I think there's clearly some work to be done in improving our model for predicting possessions.

Monday, August 15, 2011

More on Possessions

In a previous post, I mentioned that I was working on calculating the number of possessions in a game.  There's no direct stat for this, so it has to be approximated by a formula from the existing stats.  The number of possessions in a college game averages about 67.  (In the previous post I reported the average as 68, but that number was skewed by overtime games.  I've since fixed that problem by normalizing to 40 minutes rather than per game.)

The next step is to try to predict the number of possessions when two teams play each other.  Why would we want to do this?  Consider the situation where Maryland beats Duke by 5 points.  How strong is that evidence that Maryland is better than Duke?  Well, if the final score was 45-40 we might consider that stronger evidence than if the score was 125-120.  Number of possessions is also used to calculate "tempo-free statistics" which allow us to better make apples-to-apples comparisons between games that are played at different paces.

So how do we predict the number of possessions?  The model I'm using right now supposes that each team has a preferred pace -- i.e., an ideal number of possessions.  Some teams would like to play games with lots of running up and down the court and many possessions; others would like to play a very slow, controlled game.  When two teams meet up, they each try to play the game at their preferred pace, and as a result the game is played somewhere in-between:
Poss_pred = (PreferredPoss_Home + PreferredPoss_Away)/2

Of course, we don't know the "preferred pace" for a team, so we have to try to discover that from the game data.  One way to do that is gradient descent, as was used for Danny Tarlow's PMM.  If we do that, and then test the predictor in the same way we've tested the others, we get this performance:

  Predictor    Error  

Is that good performance?  If we look at the distribution of possessions/game:

I'm inclined to say the predictor is "meh" -- not particularly good, but probably good enough to be useful.
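For the curious, the gradient-descent fit for preferred paces can be sketched like this -- a simplified Python stand-in for my Lisp code, with a made-up learning rate and iteration count:

```python
# Batch gradient descent on squared error for the split model:
# Poss_pred = (PreferredPoss_Home + PreferredPoss_Away) / 2
def fit_preferred_pace(games, teams, lr=0.1, iters=500):
    """games: list of (home, away, actual_possessions)."""
    pace = {t: 67.0 for t in teams}  # start everyone at the D-I average
    for _ in range(iters):
        grads = {t: 0.0 for t in teams}
        for home, away, actual in games:
            # d/dPace of ((pace_h + pace_a)/2 - actual)^2, up to a constant
            err = (pace[home] + pace[away]) / 2 - actual
            grads[home] += err
            grads[away] += err
        for t in teams:
            pace[t] -= lr * grads[t] / len(games)
    return pace
```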

Wednesday, August 10, 2011

More PMM

In my previous post, I replaced the prediction model from Danny Tarlow's PMM:
Pred_i = Offense_i * Defense_j
with this model:
Pred_i = Offense_i + Defense_j
with predictably terrible results.   There are other models we could try that would likely be more reasonable, but I want to detour a bit into a two-stage model.

The basic idea is that we predict the game outcome using the original Offense*Defense model, and minimize the error in the prediction across all the games using gradient descent.  However, we then add a second stage, where we attempt to predict the residual error between our best Offense*Defense model and the actual scores.  The value we're going to try to predict is:
Residual_i = Score_i - (Offense_i * Defense_j)
Now there wouldn't be much sense in trying to predict this Residual by the same sort of Offense*Defense model -- if that would work, it would presumably be captured in our original model.  So we need to pick some different sort of model for the Residual, and in this case we'll use the additive model we used before, except applied this time to predict the Residual:
Pred Residual_i = R_i + S_j
and we'll determine R and S by the same sort of gradient descent we use for Offense and Defense.  Our final predicted score will be:

Pred_i = Offense_i * Defense_j + R_i + S_j

Here's how that performs:

  Predictor                     % Correct   MOV Error  
  PMM (w Residual Prediction)   72.0%       11.19

The improvement isn't huge, but it does show some promise.  Intuitively, if we think of the performance of a team as a sum of a number of different factors plus some noise, then different models may be capable of accurately modeling different factors.  Some factors may be well-modeled by "Offense*Defense" while others are better modeled by "Offense+Defense".
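Fitting the residual model looks just like fitting the primary model; here's a simplified sketch (a Python stand-in for my Lisp, with a hypothetical learning rate), assuming the residuals from the primary Offense*Defense fit have already been computed:

```python
# Fit the additive residual model Pred Residual_i = R_i + S_j by
# batch gradient descent on squared error.
def fit_residual_model(games, teams, lr=0.1, iters=500):
    """games: list of (team_i, team_j, residual)."""
    R = {t: 0.0 for t in teams}
    S = {t: 0.0 for t in teams}
    for _ in range(iters):
        gR = {t: 0.0 for t in teams}
        gS = {t: 0.0 for t in teams}
        for i, j, resid in games:
            err = R[i] + S[j] - resid
            gR[i] += err
            gS[j] += err
        for t in teams:
            R[t] -= lr * gR[t] / len(games)
            S[t] -= lr * gS[t] / len(games)
    return R, S
```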

There are several avenues to explore from here.  One is to look at other alternate models, for both the primary model and the residual model.  Another is to look at combining the two models, so we optimize both at once -- this would have some advantages if the two models are interdependent.  Another interesting notion is to use an entirely different approach -- say, TrueSkill -- for either the primary or the residual model.

Tuesday, August 9, 2011

An Alternate PMM Model

As mentioned in the last post, I'm looking at the code for implementing Danny Tarlow's PMM.  In Danny's (1 dimensional) model, each team has an Offense rating and a Defense rating and when Team_i plays Team_j, the predicted score is:
Pred_i = Offense_i * Defense_j
In other words, we have a model where teams have a certain scoring potential (Offense) and the effect of defense is to reduce that potential proportionately.  So if a team has a Defense of 0.85, and plays a team with an Offense of 100, we expect the opponent to score 85 points -- 15 points below its potential.  On the other hand, if it plays a team with an Offense of 60, we expect that team to score 51 points -- 9 points below its potential.
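The arithmetic above fits in two lines of code (the numbers are just the examples from the text):

```python
# Proportional defense: a Defense rating scales the opponent's scoring
# potential, so stronger offenses lose more raw points to the same defense.
def predicted_points(offense, defense):
    """Pred_i = Offense_i * Defense_j."""
    return offense * defense
```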

That seems reasonable, but it's not the only possible model for the relationship between Offense and Defense.  For example, we might model defense as reducing the opposing team's offense by a fixed amount, e.g., Maryland holds opposing teams to 4 points less than they would score against an "average" defender.  With that model, our predicted score would look like this:
Pred_i = Offense_i + Defense_j
Which model is more reasonable?  My intuition says the first model, and it's certainly widely used, but I don't know of any work that has tried to determine the best model of the interaction between offense and defense in basketball.  (Although looking around did lead me to this interesting blog.)  Please correct my ignorance if you're aware of something relevant.  Easy enough, though, to test this model:

  Predictor               % Correct   MOV Error  
  PMM (alternate model)   64.5%       13.65

So, not very good.  No real surprise, but it does raise the question of whether some different model for the interaction between Offense and Defense could out-perform the Offense*Defense model.

Monday, August 8, 2011

Regularization in PMM

I'm back from vacation and slowly finding some time to work on prediction.  One of the things I'm doing is revisiting the code for "Probabilistic Matrix Model" (PMM).  This model is based upon the code Danny Tarlow released for his tournament predictor, which he discusses here. At the heart of this code is an algorithm to minimize the error between the predicted scores and the actual scores using batch gradient descent.  This is something I'll want to do frequently in the next stage of development (e.g., to predict the number of possessions in a game), so I'm looking at adapting Danny's code.  (Or rather, adapting my adaptation of Danny's code :-)

Danny's code differs in a couple of ways from a straightforward batch gradient descent.  One difference is that Danny has added in a regularization step.  Danny made this comment about regularization:
In addition, I regularize the latent vectors by adding independent zero-mean Gaussian priors (or equivalently, a linear penalty on the squared L2 norm of the latent vectors). This is known to improve these matrix-factorization-like models by encouraging them to be simpler, and less willing to pick up on spurious characteristics of the data.
I theorize that with a large, diverse training set such as the one I'm using, regularization is unnecessary.  To test that, I re-ran the PMM without any regularization:

  Predictor                  % Correct   MOV Error  
  PMM (w/o regularization)   71.8%       11.20

Performance is almost identical, so indeed there doesn't seem to be any value in regularization.
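For reference, the difference between the two runs comes down to a single term in the gradient update.  A sketch (hypothetical learning rate and penalty weight -- not Danny's actual code):

```python
# One gradient-descent update on a single rating.  With lam > 0 the
# squared-L2 penalty adds lam * rating to the gradient, shrinking
# ratings toward zero; lam = 0 is the unregularized run above.
def grad_step(rating, data_grad, lr=0.05, lam=0.0):
    return rating - lr * (data_grad + lam * rating)
```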