Thursday, April 14, 2011

Testing Methodology

We'll take a slight detour for this posting to discuss my testing methodology for these RPI experiments.

To begin with, I produce RPI values for both the home team and the away team for every game in the 2009, 2010 and 2011 regular seasons. (I drop the last 150 games of each data set to exclude the Tournament games, NIT games, and the conference tournaments.)  I also drop the first 1000 games in each season so that the RPI values are based on at least 5 games for each team.  Finally, I drop any non-Div I games.
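The game selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_games` and the dictionary keys `home_div1` and `away_div1` are hypothetical, each season's games are assumed to be in chronological order, and the positional trimming is assumed to happen before the non-Div I filter (the text does not say which comes first).

```python
def select_games(season_games, skip_first=1000, skip_last=150):
    """Pick the games of one season that are used for training and testing.

    season_games: list of game dicts in chronological order. The keys
    'home_div1' and 'away_div1' are hypothetical booleans marking whether
    each team is a Division I team.
    """
    # Drop the early games (too few results per team for a stable RPI)
    # and the late games (conference tournaments, NIT, NCAA Tournament).
    end = len(season_games) - skip_last
    trimmed = season_games[skip_first:end]

    # Keep only games where both teams are Division I.
    return [g for g in trimmed if g["home_div1"] and g["away_div1"]]
```

Note that the dropped early games would still feed the RPI calculation for later games; they are just not used as prediction examples themselves.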

For each game, the RPI value for both teams is calculated based upon all the previous games for that season.
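To make the "previous games only" idea concrete, here is a minimal Python sketch. It assumes the commonly cited RPI weighting of 25% winning percentage, 50% opponents' winning percentage, and 25% opponents' opponents' winning percentage, which may differ from the exact variant used in these experiments. The function name `pregame_rpi`, the tuple layout, and the choice of home score minus away score for the margin are all assumptions made for illustration. It recomputes everything from scratch for each game, so it favors clarity over speed.

```python
from collections import defaultdict


def pregame_rpi(games):
    """games: chronological list of (home, away, home_pts, away_pts).

    Returns one (home_rpi, away_rpi, margin) row per game, where both
    RPI values use only the games played before that game.
    """
    wins = defaultdict(lambda: defaultdict(int))    # wins[team][opp]
    played = defaultdict(lambda: defaultdict(int))  # played[team][opp]

    def wp(team, exclude=None):
        # Winning percentage, optionally ignoring games against `exclude`.
        w = sum(n for opp, n in wins[team].items() if opp != exclude)
        g = sum(n for opp, n in played[team].items() if opp != exclude)
        return w / g if g else 0.0

    def owp(team):
        # Opponents' winning percentage, weighted by games played,
        # leaving out each opponent's games against `team` itself.
        total = sum(played[team].values())
        if not total:
            return 0.0
        return sum(n * wp(opp, exclude=team)
                   for opp, n in played[team].items()) / total

    def oowp(team):
        # Opponents' opponents' winning percentage.
        total = sum(played[team].values())
        if not total:
            return 0.0
        return sum(n * owp(opp) for opp, n in played[team].items()) / total

    def rpi(team):
        return 0.25 * wp(team) + 0.50 * owp(team) + 0.25 * oowp(team)

    rows = []
    for home, away, home_pts, away_pts in games:
        # Record the pre-game RPI values first, then update the records.
        rows.append((rpi(home), rpi(away), home_pts - away_pts))
        winner, loser = (home, away) if home_pts > away_pts else (away, home)
        wins[winner][loser] += 1
        played[home][away] += 1
        played[away][home] += 1
    return rows
```

The important detail is the ordering inside the loop: the row for a game is recorded before that game's result is added to the records, so no game ever contributes to its own inputs.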

The resulting data is fed into a RapidMiner process which calculates a Margin of Victory (MOV) for each game.  A test set of 500 games is then split off.  (The test set is chosen randomly, but is the same for every predictor tested.)  The remaining games (approx. 10,000) are then used to train a linear regression using the MOV as the label.
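The actual process runs in RapidMiner, but the same split-and-train step can be sketched in plain Python as a stand-in. The names `split_games` and `fit_linear`, the seed value, and the row layout `(home_rpi, away_rpi, mov)` are assumptions for illustration. The fit is an ordinary least-squares solve of the normal equations, which is not necessarily what RapidMiner's linear regression operator does internally.

```python
import random


def split_games(rows, test_size=500, seed=42):
    """Return (train, test). A fixed seed keeps the randomly chosen
    test set identical for every predictor that is evaluated."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def fit_linear(train):
    """Fit mov ~ b0 + b1*home_rpi + b2*away_rpi by least squares.

    train: list of (home_rpi, away_rpi, mov) rows.
    Returns [b0, b1, b2]. Raises ZeroDivisionError if the inputs are
    degenerate (for example, every row has the same RPI values).
    """
    X = [(1.0, h, a) for h, a, _ in train]
    y = [m for _, _, m in train]
    n = 3

    # Augmented normal-equation matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(n)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]

    return [A[i][n] / A[i][i] for i in range(n)]
```

A call such as `train, test = split_games(rows)` followed by `coefs = fit_linear(train)` would mirror the two steps described above.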

The resulting linear regression is then applied to the test set of 500 games, and scored for RMSE and correctness of prediction.
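For reference, the two scores could be computed along these lines. This sketch assumes MOV is the home score minus the away score, and it reads "correctness of prediction" as the fraction of test games where the sign of the predicted MOV matches the sign of the actual MOV (that is, the predicted winner actually won). The function name `score` and the row layout are hypothetical.

```python
import math


def score(coefs, test):
    """Score a fitted model on the held-out games.

    coefs: [b0, b1, b2] for mov ~ b0 + b1*home_rpi + b2*away_rpi.
    test:  list of (home_rpi, away_rpi, mov) rows.
    Returns (rmse, fraction_correct).
    """
    b0, b1, b2 = coefs
    squared_error = 0.0
    correct = 0
    for home_rpi, away_rpi, mov in test:
        predicted = b0 + b1 * home_rpi + b2 * away_rpi
        squared_error += (predicted - mov) ** 2
        # A pick is correct when predicted and actual MOV agree on
        # which team won (positive means the home team).
        if (predicted > 0) == (mov > 0):
            correct += 1
    rmse = math.sqrt(squared_error / len(test))
    return rmse, correct / len(test)
```

The two numbers measure different things: RMSE penalizes being far off on the margin, while the correctness rate only cares whether the right team was picked.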

More sophisticated regression models are available (e.g., neural networks and polynomial regression), but when I experimented with them, none produced better results than a linear regression.  (This is not too surprising, given that the input data is just the two RPI values.)
