## Thursday, October 18, 2012

### Early Season Predictions, Part 2

As mentioned previously, I'm using this time before the college basketball season gets going to think about how to predict early season games.  In the early season, we're missing two elements needed for good predictions:  (1) a meaningful statistical description of each team, and (2) a model that uses those statistics to predict game outcomes.  By the end of the season we have both: a good statistical characterization of each team and a model that has been trained on the season's outcomes.  So how do we replace those two elements in the early season?

Replacing the model turns out to be fairly easy, because the factors that determine whether teams win or lose don't change drastically from season to season.  When you try to predict the tournament games at the end of the season, a model trained on the previous season's games does nearly as well as a model trained on the current season's games.  Of course, if the current year happens to be the year when the NCAA introduces the 3-point shot, all bets are off.  In fact, in my testing the best performing models are the ones trained on several previous years of data.  So in the early season we can expect a model trained on previous seasons to perform well.

(You might argue that early season predictions could be more accurate with a model specifically trained for early season games.  There's some merit to this argument and I may look at this in the future.)
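To make the "reuse last year's model" idea concrete, here is a minimal sketch. It is not the actual predictor: it assumes a single hypothetical feature (the difference between the two teams' ratings), fits a one-variable least squares line on games pooled from earlier seasons, and then applies that line unchanged to new games. The training rows are made-up toy numbers.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Toy, made-up rows pooled from earlier seasons:
# (home rating - away rating, actual home margin of victory)
prior_seasons = [(-6.0, -3.0), (-2.0, 1.0), (0.0, 3.0), (4.0, 7.0), (8.0, 11.0)]

# Train once on the earlier seasons...
intercept, slope = fit_line(*zip(*prior_seasons))

def predict_mov(rating_diff):
    """...then reuse the fitted line for games in the new season."""
    return intercept + slope * rating_diff
```

The point of the sketch is only that nothing in `predict_mov` depends on the current season's outcomes, so it is usable from the first game onward.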

Replacing the team data is not so easy.  The problem here is that teams have played so few games (none at all for the first game of the season) that we don't have an accurate characterization of their strengths and weaknesses.  Even worse, many of the comparative statistics (like RPI) rely on teams having common opponents to determine their relative strength.  In the early season, the teams don't "connect up," and some teams play few or no strong opponents.  So how bad is it?  I tested it on games from the 2011-2012 season:

| Predictor | % Correct | MOV Error |
|---|---|---|
| Late Season Prediction | 72.3% | 11.10 |
| Early Season Prediction | 71.3% | 15.06 |

So, pretty bad.  The percentage of correct picks drops by only a point, but the MOV error grows by about 4 points (15.06 vs. 11.10).  Since we've been groveling to pick up a tenth of a point here and there, that's a lot!
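For reference, here is a minimal sketch of how the two columns could be computed. I'm assuming here that "MOV Error" is the mean absolute difference between the predicted and actual margin of victory; a root-mean-square error would be an equally plausible definition. The margins below are made-up toy values.

```python
def evaluate(predicted, actual):
    """Return (% of winners picked correctly, mean absolute MOV error).

    Margins are from the home team's point of view: positive = home win.
    Assumption: MOV error is a mean absolute error, not RMSE.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    n = len(predicted)
    # A pick is correct when the predicted and actual margins agree on the winner.
    correct = sum(1 for p, a in zip(predicted, actual) if (p > 0) == (a > 0))
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    return 100.0 * correct / n, abs_err / n

# Toy, made-up margins for four games
predicted = [5.0, -3.0, 8.0, 2.0]
actual = [10, -7, -2, 1]
pct_correct, mov_error = evaluate(predicted, actual)
```

Note that the two numbers can move independently: a prediction can pick the right winner and still be off by many points, which is why the MOV error can get much worse while the percentage correct barely changes.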

The obvious proxy for the team data is to use the team data from the previous season.  Clearly this has problems (in college basketball, team performance is highly variable from season to season), but it's at least worth examining to see whether it improves performance.  In this experiment, I used the entire previous season's data to "prime the pump" for the next season.  In effect, I treated the early season games as if they were being played by the previous year's team at the end of the previous season.  Here are the results:

| Predictor | % Correct | MOV Error |
|---|---|---|
| Early Season | 71.3% | 15.06 |
| Early Season (with previous season) | 75.5% | 12.18 |

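A fairly significant improvement:  the percentage of correct picks rises by about 4 points and the MOV error drops by nearly 3 points.  Is there anything we can do to improve the previous season's data as a proxy for this season?  We'll investigate some possibilities next time.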
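The "prime the pump" experiment above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the real system: the hypothetical `TeamRatings` class uses a running average scoring margin as a stand-in for the full set of team statistics, and the games are made-up toy data. The key step is that last season's games are fed in before any of this season's games are predicted.

```python
from collections import defaultdict

class TeamRatings:
    """Running average scoring margin per team (stand-in for real team stats)."""

    def __init__(self):
        self.total = defaultdict(float)
        self.games = defaultdict(int)

    def update(self, home, away, home_margin):
        self.total[home] += home_margin
        self.total[away] -= home_margin
        self.games[home] += 1
        self.games[away] += 1

    def rating(self, team):
        # A team with no games yet gets a neutral rating of 0.
        return self.total[team] / self.games[team] if self.games[team] else 0.0

    def predict(self, home, away):
        """Predicted home margin of victory."""
        return self.rating(home) - self.rating(away)

def predict_season(games, prior_games=()):
    """Predict each game before seeing its result, in chronological order."""
    ratings = TeamRatings()
    for home, away, margin in prior_games:  # prime the pump with last season
        ratings.update(home, away, margin)
    predictions = []
    for home, away, margin in games:
        predictions.append(ratings.predict(home, away))
        ratings.update(home, away, margin)
    return predictions

# Toy, made-up games: (home team, away team, home margin of victory)
last_season = [("A", "B", 10), ("B", "A", -4)]
this_season = [("A", "B", 6), ("B", "A", 2)]

cold = predict_season(this_season)                 # no prior data
primed = predict_season(this_season, last_season)  # seeded with last season
```

In the cold run the first prediction is a coin flip (a margin of 0) because neither team has any history, while the primed run already has an opinion about both teams for the first game of the season.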