Friday, May 27, 2011

Combinations & Other Models

Before we move on to rating systems based on margin of victory (MOV), it may be instructive to look at combining the various "RPI-like" rating systems to see if using them together can improve our prediction performance.  For these experiments, I'll be looking at our three best RPI-like ratings: TrueSkill, Improved RPI (iRPI) and the Iterative Strength Rating (ISR).

The first experiment we can try is to use more than one rating as an input to our linear regression.  The following table shows the performance using various combinations of the three ratings:

Predictor                                             % Correct    MOV Error
TrueSkill + Improved RPI                              72.9%        11.01
TrueSkill + Iterative Strength Rating                 72.7%        11.05
TrueSkill + Improved RPI + Iterative Strength Rating  73.0%        11.01
Improved RPI + Iterative Strength Rating              71.9%        11.31

The combination of TrueSkill and the Improved RPI improves performance modestly.  Adding in the Iterative Strength Rating does little (and in fact, the home team's ISR gets optimized out of the linear regression).
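
As a rough illustration of the combined-regression idea, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the games are synthetic, the two rating systems "A" and "B" are just noisy views of a hidden team strength, the 3.5-point home advantage is made up, and "MOV error" is taken to be the mean absolute error (the tables above don't spell out the exact definition). It is not the RapidMiner process used for the results above.

```python
import random

def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def make_games(n, rng):
    """Synthetic games: two noisy ratings per team plus a noisy margin of victory."""
    games = []
    for _ in range(n):
        home, away = rng.gauss(0, 5), rng.gauss(0, 5)  # hidden true strengths
        ratings = {
            "A": (home + rng.gauss(0, 5), away + rng.gauss(0, 5)),
            "B": (home + rng.gauss(0, 5), away + rng.gauss(0, 5)),
        }
        mov = home - away + 3.5 + rng.gauss(0, 8)  # 3.5 = made-up home advantage
        games.append((ratings, mov))
    return games

def features(game, systems):
    """Home and away rating from each chosen system, as one input row."""
    ratings, _ = game
    return [v for s in systems for v in ratings[s]]

def evaluate(train, test, systems):
    """Fit on train, then return (fraction of winners correct, mean absolute MOV error)."""
    coef = fit_linear([features(g, systems) for g in train], [g[1] for g in train])
    preds = [predict(coef, features(g, systems)) for g in test]
    actual = [g[1] for g in test]
    correct = sum((p > 0) == (a > 0) for p, a in zip(preds, actual)) / len(test)
    mov_error = sum(abs(p - a) for p, a in zip(preds, actual)) / len(test)
    return correct, mov_error

rng = random.Random(2011)
train, test = make_games(2000, rng), make_games(2000, rng)
single = evaluate(train, test, ["A"])
combined = evaluate(train, test, ["A", "B"])
print("A only: %.1f%% correct, MOV error %.2f" % (100 * single[0], single[1]))
print("A + B:  %.1f%% correct, MOV error %.2f" % (100 * combined[0], combined[1]))
```

In this toy setup the two ratings carry independent noise, so feeding both into one regression helps by construction.  Real rating systems computed from the same game results are far more correlated, which is presumably why the gains in the table are so modest.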

Another experiment we can try is to do a separate linear regression for each rating and then average their predictions (for regression tasks, this is done with the Vote operator in RapidMiner).  Here are some averaging results:

Predictor                                                        % Correct    MOV Error
TrueSkill + Improved RPI (combined regression)                   72.9%        11.01
TrueSkill + Improved RPI (averaged)                              72.9%        11.04
TrueSkill + Iterative Strength Rating (averaged)                 72.8%        11.17
TrueSkill + Improved RPI + Iterative Strength Rating (averaged)  73.0%        11.16

Averaging predicts about the same percentage of games correctly as a single combined regression, but its MOV error is consistently worse.
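
The averaging approach can be sketched the same way. This is only an illustration under stated assumptions: the data is synthetic, each rating system is reduced to a single home-minus-away rating difference (a simplification, since the regressions above take home and away ratings as separate inputs), and the plain mean of the per-rating predictions stands in for what RapidMiner's Vote operator does on regression tasks.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(27)
games = []
for _ in range(4000):
    gap = rng.gauss(0, 7)                    # hidden true strength gap (home minus away)
    diff_a = gap + rng.gauss(0, 5)           # rating system A's view of the gap
    diff_b = 2.0 * (gap + rng.gauss(0, 5))   # system B, on a different scale
    mov = gap + 3.5 + rng.gauss(0, 8)        # 3.5 = made-up home advantage
    games.append((diff_a, diff_b, mov))
train, test = games[:2000], games[2000:]

# One separate regression per rating system, each predicting MOV on its own.
models = [fit_line([g[i] for g in train], [g[2] for g in train]) for i in (0, 1)]

def single_prediction(game, i):
    slope, intercept = models[i]
    return slope * game[i] + intercept

def averaged_prediction(game):
    """Mean of the per-rating predictions."""
    preds = [single_prediction(game, i) for i in range(len(models))]
    return sum(preds) / len(preds)

def mean_abs_error(predict):
    return sum(abs(predict(g) - g[2]) for g in test) / len(test)

single_a = mean_abs_error(lambda g: single_prediction(g, 0))
averaged = mean_abs_error(averaged_prediction)
print("A only:   MOV error %.2f" % single_a)
print("averaged: MOV error %.2f" % averaged)
```

Because each regression is fitted on its own, the averaged prediction weights every rating equally, whereas a single combined regression is free to pick whatever weights minimize the error.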

We can also try using a more sophisticated prediction model than a linear regression.  For cases where we're only using a single rating value for each team, we wouldn't expect this to provide significantly better performance than the linear regression.  Here are the performances of some alternative models for the TrueSkill ratings:

Predictor                           % Correct    MOV Error
TrueSkill (neural network)          72.8%        11.07
TrueSkill (support vector machine)  72.7%        11.08
TrueSkill (k-NN, k=96)              72.7%        11.13

As expected, there is no improvement over a simple linear regression.  The alternate models also provide no benefit when we are using multiple ratings:

Predictor                                          % Correct    MOV Error
TrueSkill + Improved RPI (neural network)          68.8%        12.24
TrueSkill + Improved RPI (support vector machine)  72.9%        11.02
TrueSkill + Improved RPI (k-NN, k=96)              72.8%        11.13

SVMs do the best here, but they still offer no improvement over the linear regression.
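
For readers unfamiliar with the k-NN entries in the tables above, here is a minimal sketch of k-nearest-neighbor regression on a pair of ratings. The data is synthetic and the helper names are made up for the example; only k=96 comes from the experiments, and "MOV error" is again assumed to be the mean absolute error.

```python
import random

def knn_predict(train, query, k):
    """Average the MOV of the k training games closest to `query` in rating space."""
    def squared_distance(game):
        return sum((a - b) ** 2 for a, b in zip(game[0], query))
    nearest = sorted(train, key=squared_distance)[:k]
    return sum(mov for _, mov in nearest) / k

rng = random.Random(96)

def make_game():
    """One synthetic game: (home rating, away rating) and a noisy margin of victory."""
    home, away = rng.gauss(0, 5), rng.gauss(0, 5)      # hidden true strengths
    ratings = (home + rng.gauss(0, 3), away + rng.gauss(0, 3))
    return ratings, home - away + 3.5 + rng.gauss(0, 8)  # 3.5 = made-up home advantage

train = [make_game() for _ in range(1500)]
test = [make_game() for _ in range(300)]

preds = [knn_predict(train, ratings, k=96) for ratings, _ in test]
correct = sum((p > 0) == (mov > 0) for p, (_, mov) in zip(preds, test)) / len(test)
mov_error = sum(abs(p - mov) for p, (_, mov) in zip(preds, test)) / len(test)
print("k-NN (k=96): %.1f%% correct, MOV error %.2f" % (100 * correct, mov_error))
```

A large k makes sense for this kind of data: individual game margins are very noisy, so the prediction has to average over many similar games before the underlying trend shows through.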

So combining TrueSkill & Improved RPI into a single regression is an improvement, but generally more sophisticated models don't provide any value.  It's interesting to note that this performance is already as good as the best models reported in the literature.

Unless I get further distracted, I'll be moving on next to assessing ratings/models which make use of the margin of victory (MOV).  I have a small collection of ratings/models to assess, but I'm always looking for inputs, so if you have a favorite ranking that you'd like to see included, please let me know!


  1. I love this blog. But a lot of these results would seem to have been predicted by the "flat maximum" phenomenon, no?

  2. I wasn't aware of the "flat maximum" phenomenon, so thanks for bringing it to my attention. I can't find a good tutorial article on it, but from what I can find, the hypothesis appears to be that performance is not sensitive to the weighting of the inputs for some class of social and human predictions. I'm not sure that's entirely the case here, but there's certainly at least some element of that at play. I know from my own statistical modeling that I can get nearly equivalent performance from a variety of different inputs, which certainly seems in line with the "flat maximum" hypothesis.