## Friday, February 3, 2012

### The Continued (Slow) Pursuit of Statistical Prediction (Part II)

Continuing on from last time, I had set up the infrastructure to allow me to easily test the value of derived variables in statistical prediction. Before testing any of these derived variables, we need a baseline. In this case, the baseline is the performance of a linear regression using all the base variables. I don't know that I've ever documented the base variables, but they are basically everything that can be created from the full game statistics available at Yahoo! Sports. These are averaged by game, so for example one of the base statistics is "Average free throw attempts per game." I also have the capability to average statistics by possession (e.g., "Average free throw attempts per possession"), but unlike some other researchers, I've never found per-possession averages to be any more useful than per-game averages, so I generally don't produce them.

For most statistics, I also produce the average for the team's opponents. So to continue the example above, I produce "Average free throw attempts per game for this team's opponents." I also produce a small number of simple derived statistics, such as "Average Margin of Victory (MOV)", and winning percentages at home and on the road.

When we get to predicting game outcomes, of course we have all of these statistics for both the home and the away team.  (And that home/road distinction is important, obviously.)  If we use all these base statistics to create a linear regression, we get the following performance:

| Predictor | % Correct | MOV Error |
|---|---|---|
| Base Statistical Predictor | 72.3% | 11.10 |

This is the same performance I have reported earlier, and tracks fairly well with the best performance from the predictors based upon strength ratings.
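For readers who want to reproduce the two columns in that table, here is a minimal sketch of how they could be computed from a list of predicted and actual margins. The function name is hypothetical, margins are assumed to be home score minus away score, and "MOV Error" is assumed to be the root-mean-square error of the predicted margin; none of those details are spelled out above.

```python
import math

def score_predictions(predicted_margins, actual_margins):
    """Return (percent_correct, mov_error) for paired lists of margins.

    Assumption: a margin is home score minus away score.
    """
    if len(predicted_margins) != len(actual_margins) or not actual_margins:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual_margins)
    pairs = list(zip(predicted_margins, actual_margins))
    # A pick is correct when predicted and actual margins have the same sign.
    correct = sum(1 for p, a in pairs if p * a > 0)
    # Assumption: "MOV Error" is the root-mean-square error of the margin.
    squared = sum((p - a) ** 2 for p, a in pairs)
    return 100.0 * correct / n, math.sqrt(squared / n)

# Made-up margins, purely for illustration.
predicted = [5.0, -3.0, 8.0, 1.5]
actual = [7, 2, 3, 10]
pct, err = score_predictions(predicted, actual)
print(f"{pct:.1f}% correct, MOV error {err:.2f}")
```

The second game is the only miss here (the prediction had the away team winning), so three of four picks are correct.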

Now we want to augment that predictor with derived statistics to see if they offer any performance improvement.  As mentioned last time, we have 1200 derived statistics, so we have to do some feature selection to thin that crop for testing.

One possibility (as discussed here) is to build a decision tree, and use the features identified in the tree.  If we do that (and force the tree to be small), we identify these derived features as important:

1. The home team's average margin of victory per possession over the overall winning percentage
2. The away team's average number of field goals made by opponents over average score
3. The home team's average assists by opponents over the field goals made
4. The home team's average MOV per game over the home winning percentage

That is, you'd have to admit, quite a goulash of statistics. I can probably come up with some rationale about some of those, but I won't bother. All I really care about is whether they will improve my predictive accuracy.

To test that, I add those statistics to my base statistics and re-run the linear regression.  In this case, what I find is that while some of the derived statistics are identified as having high value by the linear regression, the overall performance does not improve.
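The tree-based selection step can be sketched roughly as follows. This uses scikit-learn rather than the tool used for the results above, treats the target as a numeric margin (an assumption), and runs on synthetic data; the helper name and the feature names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_selected_features(X, y, names, max_depth=3):
    """Return the names of the features a small regression tree splits on."""
    # Keeping max_depth small forces the tree to use only a few features.
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X, y)
    # tree_.feature holds the split feature per node; leaves are negative.
    used = sorted({int(i) for i in tree.tree_.feature if i >= 0})
    return [names[i] for i in used]

# Synthetic stand-in data: only column 2 carries any signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 10.0 * X[:, 2] + rng.normal(scale=0.1, size=400)
names = ["f0", "f1", "f2", "f3", "f4"]

selected = tree_selected_features(X, y, names)
print(selected)
```

The selected columns would then be appended to the base statistics before re-running the regression.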

There are other methods for feature selection, of course. RapidMiner has an extension focused solely on feature selection. This offers a variety of approaches, including selecting based on Maximum Relevance, Correlation-Based Feature Selection, and Recursive Conditional Correlation Weighting. All of these methods identified "important" derived statistics, but none produced a set of features that out-performed the base set.

A final approach is a brute force approach called forward search.  In this approach, we start with the base set of statistics, add each of the derived statistics in turn, and test each combination.  If any of those combinations improve on the base set, we pick the best combination and repeat the process.  We continue this way until we can find no further improvement.
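In code, the procedure just described looks something like the sketch below. The `evaluate` callback is a hypothetical stand-in for a full test run that returns an error to minimize, and the toy version here just subtracts made-up gains so the example runs instantly.

```python
def forward_search(base_features, candidates, evaluate):
    """Greedy forward selection over a list of candidate features.

    evaluate(features) is a hypothetical callback that runs a full test
    on the given feature list and returns an error (lower is better).
    """
    selected = list(base_features)
    remaining = list(candidates)
    best_error = evaluate(selected)
    while remaining:
        # Try adding each remaining candidate to the current set.
        trials = [(evaluate(selected + [c]), c) for c in remaining]
        trial_error, best_candidate = min(trials, key=lambda t: t[0])
        if trial_error >= best_error:
            break  # no candidate improves on the current set
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_error = trial_error
    return selected, best_error

# Toy stand-in for a real test run; the gains are made up for illustration.
GAINS = {"d1": 2.0, "d2": 1.0, "d3": 0.0}

def toy_evaluate(features):
    return 10.0 - sum(GAINS.get(f, 0.0) for f in features)

features, error = forward_search(["base"], ["d1", "d2", "d3"], toy_evaluate)
print(features, error)
```

In the toy run, `d1` is added first, then `d2`, and the search stops when `d3` fails to lower the error any further.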

There are a couple of advantages to this approach. First, there's no guessing about what features will be useful -- instead we're actually running a full test every time and determining whether a feature is useful or not. Second, at every step we're testing every possible addition to the current set, so we know we'll find the best combination that can be reached that way. The caveat here is that we assume that improvement is monotonic with regard to adding features. If the best feature set is "A, B, C" then we're assuming we can find that by adding A first (because it offers the most improvement at the first step), then B to that, and so on. That isn't always true, but in this case it seems a reasonable assumption.

The big drawback of this approach is that it is very expensive. We have to try lots of combinations of features, and we have to run a full test for each combination. In this case, the forward search took about 54 hours to complete -- and since I had to run it several times because of errors or tweaks to the process, it ended up taking about a solid week of computer time.
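A back-of-the-envelope count shows where the time goes. This is only a rough estimate under stated assumptions: that all 1200 derived statistics were candidates in every round, that the search stopped after selecting ten features, and that a single run took the full 54 hours. The function name is hypothetical.

```python
def forward_search_tests(num_candidates, num_selected):
    """Count the full tests a greedy forward search would run.

    Each round tests every remaining candidate, and one extra round is
    needed to discover that nothing else helps.
    """
    rounds = num_selected + 1
    return sum(num_candidates - k for k in range(rounds))

# Assumptions: 1200 candidates in play, ten features selected, 54 hours total.
tests = forward_search_tests(1200, 10)
seconds_per_test = 54 * 3600 / tests
print(tests, round(seconds_per_test, 1))
```

Under those assumptions that is roughly 13,000 full tests at about 15 seconds apiece, which is why even a modest candidate pool gets expensive quickly.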

In the end, the forward search identified ten derived features, with this performance:

| Predictor | % Correct | MOV Error |
|---|---|---|
| Base Statistical Predictor | 72.3% | 11.10 |
| w/ Forward Search Features | 74.0% | 10.73 |

This is a fairly significant improvement. The most important derived features in the resulting model were:

1. The away team's opponent scoring average over the away team's winning percentage
2. The away team's offensive rebounding average over the away team's # of field goals attempted
3. The away team's scoring average over the away team's winning percentage
4. The away team's opponent treys attempted over the away team's rebounds

The ten statistics were actually evenly divided between home team statistics and away team statistics, but it turned out that the most significant five were all the away team statistics.

I'll leave it to the reader to contemplate the meaning of these statistics, but there are some interesting suggestions here.  The first and third statistics seem to be saying something about whether the away team is winning games through defense or offense.  The second and fourth statistics seem to be saying something about rebounding efficiency, and perhaps about whether the team is good at getting "long" rebounds.  (The statistics for the home team are completely different, by the way.)

Next time I'll begin looking at a different set of derived statistics.