Net Prophet: Combinations & Other Models

Before we move on to MOV-based rating systems, it may be instructive to look at combining the various "RPI-like" rating systems to see if using them together can improve our prediction performance. For these experiments, I'll be looking at our three best RPI-like ratings: TrueSkill, Improved RPI (iRPI) and the Iterative Strength Rating (ISR).

The first experiment we can try is to use more than one rating as an input to our linear regression. The following table shows the performance using various combinations of the three ratings:

Predictor	% Correct	MOV Error
TrueSkill	72.8%	11.07
TrueSkill + Improved RPI	72.9%	11.01
TrueSkill + Iterative Strength Rating	72.7%	11.05
TrueSkill + Improved RPI + Iterative Strength Rating	73.0%	11.01
Improved RPI + Iterative Strength Rating	71.9%	11.31

The combination of TrueSkill and the Improved RPI improves performance modestly. Adding in the Iterative Strength Rating does little (and in fact, the home team's ISR gets optimized out of the linear regression).
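To make the combined regression concrete, here is a minimal Python sketch. It is not the RapidMiner process that produced the numbers above. The games are randomly generated, the helper names (`fit_ols`, `predict`, `evaluate`) are made up for illustration, and I'm assuming that "MOV Error" is a root-mean-square error and that each team's rating goes in as its own input. For brevity it also scores on the same games it was fit to, which a real evaluation should not do.

```python
import math
import random

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X)b = X'y.

    A leading 1.0 is added to every row, so b[0] is the intercept.
    """
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(coefs, row):
    """Predicted home margin of victory for one game."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def evaluate(coefs, rows, margins):
    """Return (fraction of winners called correctly, RMSE of the margin)."""
    preds = [predict(coefs, r) for r in rows]
    correct = sum((p > 0) == (m > 0) for p, m in zip(preds, margins))
    rmse = math.sqrt(sum((p - m) ** 2 for p, m in zip(preds, margins))
                     / len(margins))
    return correct / len(margins), rmse

# Made-up games: (home TrueSkill, away TrueSkill, home iRPI, away iRPI).
rng = random.Random(0)
rows, margins = [], []
for _ in range(500):
    home, away = rng.gauss(0, 1), rng.gauss(0, 1)  # hidden team strengths
    rows.append((25 + 5 * home + rng.gauss(0, 1),
                 25 + 5 * away + rng.gauss(0, 1),
                 0.5 + 0.05 * home + rng.gauss(0, 0.02),
                 0.5 + 0.05 * away + rng.gauss(0, 0.02)))
    margins.append(3.5 + 8 * (home - away) + rng.gauss(0, 10))

coefs = fit_ols(rows, margins)
pct_correct, mov_error = evaluate(coefs, rows, margins)
print(f"{pct_correct:.1%} correct, MOV error {mov_error:.2f}")
```

Adding a third rating just means adding two more columns to each row; the fit then decides how much weight, if any, each column gets.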

Another experiment we can try is to do a separate linear regression for each rating and then average their predictions (for regression tasks, this is done with the Vote operator in RapidMiner). Here are some averaging results:

Predictor	% Correct	MOV Error
TrueSkill + Improved RPI (combined regression)	72.9%	11.01
TrueSkill + Improved RPI (averaged)	72.9%	11.04
TrueSkill + Iterative Strength Rating (averaged)	72.8%	11.17
TrueSkill + Improved RPI + Iterative Strength Rating (averaged)	73.0%	11.16

Averaging yields about the same % correct but a consistently worse MOV error than a single combined linear regression.
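For comparison with the combined regression, the averaging approach can be sketched in a few lines of Python. This is an illustration only, not RapidMiner's Vote operator: the function names and the five games are made up, and as a simplification each per-rating regression uses a single input, the home-minus-away rating difference.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def averaged_prediction(models, inputs):
    """Mean of the per-rating margin predictions for one game."""
    preds = [slope * x + intercept
             for (slope, intercept), x in zip(models, inputs)]
    return sum(preds) / len(preds)

# Made-up training games: rating differences (home minus away) and the
# home team's margin of victory.
ts_diff = [4.0, -2.0, 1.5, -3.5, 0.5]
irpi_diff = [0.06, -0.01, 0.02, -0.05, -0.01]
margin = [12, -3, 6, -9, 4]

# One regression per rating system, then average the two predictions.
models = [fit_line(ts_diff, margin), fit_line(irpi_diff, margin)]
print(averaged_prediction(models, (2.0, 0.03)))
```

The average gives every rating equal say in the final prediction, whereas the combined regression is free to fit a different weight for each input.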

We can also try using a more sophisticated prediction model than a linear regression. For cases where we're only using a single rating value for each team, we wouldn't expect this to provide significantly better performance than the linear regression. Here are the performances of some alternative models for the TrueSkill ratings:

Predictor	% Correct	MOV Error
TrueSkill (neural network)	72.8%	11.07
TrueSkill (support vector machine)	72.7%	11.08
TrueSkill (k-NN, k=96)	72.7%	11.13

As expected, there is no improvement over a simple linear regression. The alternative models also provide no benefit when we are using multiple ratings:

Predictor	% Correct	MOV Error
TrueSkill + Improved RPI (neural network)	68.8%	12.24
TrueSkill + Improved RPI (support vector machine)	72.9%	11.02
TrueSkill + Improved RPI (k-NN, k=96)	72.8%	11.13

SVMs do the best here, but they are still not an improvement over the linear regression.
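For readers who haven't met k-NN in a regression setting, here is a minimal Python sketch of the idea: predict a game's margin as the average margin of the k most similar past games. The function name and the data are made up, and this is plain unweighted k-NN with Euclidean distance, which may differ from the RapidMiner configuration behind the tables above.

```python
import math

def knn_predict(train_rows, train_margins, query, k):
    """Predict a margin as the mean margin of the k nearest training games."""
    nearest = sorted(
        zip(train_rows, train_margins),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return sum(m for _, m in nearest) / len(nearest)

# Made-up games: (home rating, away rating) -> home margin of victory.
train_rows = [(30.0, 22.0), (24.0, 26.0), (28.0, 25.0), (21.0, 29.0)]
train_margins = [14, -2, 5, -11]

print(knn_predict(train_rows, train_margins, query=(27.0, 24.0), k=2))
```

With ratings on very different scales (TrueSkill versus an RPI, say), the inputs would normally be rescaled first so that one rating doesn't dominate the distance.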

So combining TrueSkill & Improved RPI into a single regression is an improvement, but generally more sophisticated models don't provide any value. It's interesting to note that this performance is already as good as the best models reported in the literature.

Unless I get further distracted, I'll be moving on next to assessing ratings/models that make use of the margin of victory (MOV). I have a small collection of ratings/models to assess, but I'm always looking for inputs, so if you have a favorite ranking that you'd like to see included, please let me know!

2 comments:

Unknown, April 21, 2012 at 4:54 PM
I love this blog. But a lot of these results would seem to have been predicted by the "flat maximum" phenomenon, no?
Scott Turner, April 21, 2012 at 9:32 PM
I wasn't aware of the "flat maximum" phenomenon, so thanks for bringing it to my attention. I can't find a good tutorial article on it, but from what I can find, the hypothesis appears to be that performance is not sensitive to the weighting of the inputs for some class of social and human predictions. I'm not sure that's entirely the case here, but there's certainly at least some element of that at play. I know from the statistical modeling that I can get nearly equivalent performance from a variety of different inputs, which certainly seems in line with the "flat maximum" hypothesis.

Friday, May 27, 2011