## Friday, February 24, 2012

Danny Tarlow and Lee-Ming Zen over at This Number Crunching Life have announced their annual March Madness prediction contest. To compete, you use data from this season and past seasons (which Danny & Lee will provide), build a computer system that fills out a bracket, and then pit yourself against the field of silicon competition. The posts from last season's tournament can be found here.

I personally know the winner from last year and the previous year, and I can only say that I have the utmost respect for their dedication, intelligence, and ruggedly handsome good looks.

## Thursday, February 23, 2012

### 3PT Attempt Percentage

Ken Pomeroy recently wrote a couple of blog posts about defense, and specifically about a statistic he calls the "3 Point Attempt Percentage" (3PA%). He defines this statistic as the "percentage of field-goal attempts that are from three-point range," and argues that it is a better measure of three-point defense than the opponents' 3PT%. His reasoning is that most teams take three-point shots only when the shooter is relatively unguarded; the effect of defense is not to make these shots harder, but to cut down on the number of opportunities. Hence the claim that it's really how many three-pointers your opponent takes that reveals the quality of your 3PT defense. Near the end of the second post he says:


> People that are unaware of 3PA% (which is to say nearly everyone) are missing a very telling statistic that explains a lot of how defense works.

This is a strong statement, and worthy of a little research to see whether it is true (at least so far as predicting outcomes is concerned).

3PA% is similar to Effective Field Goal Percentage, one of Dean Oliver's Four Factors. I have previously considered the Four Factors and concluded that they didn't add any predictive value to my models, but 3PA% captures a slightly different slice of information.

When I recently looked at derived statistics, one of the derived statistics was pretty close to 3PA%:

(Ave. number of 3PT attempts by the opposing team) / (Ave. number of FG attempts by the opposing team)

This isn't quite the same statistic, because it is using game averages rather than cumulative totals, but it is close. This statistic turned out to have no predictive value, but a couple of statistics based upon 3PT attempts did have value:

(Ave. number of 3PT attempts by the opposing team) / (Ave. number of turnovers)

(Ave. number of 3PT attempts by the opposing team) / (Ave. number of rebounds)

Note that these statistics relate the number of 3PT attempts by the *opponent* to a statistic for the *defending* team. I'm not entirely sure what these statistics are capturing, but I don't think it is 3PT defense. (The latter might indirectly say something about how a team defends against the three-pointer, through whether it is positioned to rebound effectively after a three-point attempt.)

That aside, I modified my models to generate four new statistics: the 3PA% for the home team in previous games, the 3PA% for the away team in previous games, the 3PA% for the home team's opponents in previous games, and the 3PA% for the away team's opponents in previous games. I then tested the model both with and without these statistics:
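As a sketch, here's how those four statistics could be computed from a simple list of box scores. This is just an illustration of the calculation, not my actual model code; the game records and field names (`team`, `fga`, `tpa`, and the `opp_` prefixes) are invented for the example.

```python
# Sketch: the four 3PA% features for a matchup, from a list of past box
# scores. Field names (team, fga, tpa, opp_fga, opp_tpa) are hypothetical.

def tpa_pct(games, team, for_opponents=False):
    """3PA% = three-point attempts / field-goal attempts, cumulative over
    a team's previous games (or over its opponents' attempts)."""
    tpa = fga = 0
    for g in games:
        if g["team"] != team:
            continue
        side = "opp_" if for_opponents else ""
        tpa += g[side + "tpa"]
        fga += g[side + "fga"]
    return tpa / fga if fga else 0.0

def matchup_features(games, home, away):
    return {
        "home_3pa_pct": tpa_pct(games, home),
        "away_3pa_pct": tpa_pct(games, away),
        "home_opp_3pa_pct": tpa_pct(games, home, for_opponents=True),
        "away_opp_3pa_pct": tpa_pct(games, away, for_opponents=True),
    }

# Made-up box scores for illustration
games = [
    {"team": "A", "fga": 60, "tpa": 20, "opp_fga": 55, "opp_tpa": 11},
    {"team": "A", "fga": 50, "tpa": 15, "opp_fga": 65, "opp_tpa": 13},
    {"team": "B", "fga": 58, "tpa": 29, "opp_fga": 60, "opp_tpa": 24},
]

print(matchup_features(games, "A", "B"))
```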

Model | Error | %Correct |
---|---|---|
Base Statistical model | 11.06 | 72.7% |
Base Statistical model + 3PA% statistics | 11.05 | 72.7% |

There's only a very small improvement in RMSE (11.06 to 11.05) with the added 3PA% statistics. So at least for my model, the 3PA% statistics don't seem to add any significant new information.

Labels: model performance, predictions, stats

## Friday, February 17, 2012

### Performance Versus "The Line"

As I've mentioned earlier, I use my models to bet (in some theoretical sense) against "the line". Typically I bet the games where my model differs significantly from the line (e.g., by more than 4 points or so). As I've documented here, I have a number of different models, all of which have around the same performance (~11 points RMSE).


In the past I've usually averaged the predictions of these models for betting purposes, but for some time I've wondered whether they all perform equally well against the line. Although they all have similar errors, it's possible that some of the models err more consistently on the winning side of the line. To test this, I gathered three seasons' worth of Vegas closing line data (about 7700 games) and tested each model for how often its predictions were correct versus the line. (The predictor is "correct" if it would make a winning bet given the line.) I also looked at each predictor's error versus the line (i.e., how accurately it predicted the line).

Model | Performance vs. Line | Error vs. Line |
---|---|---|
TrueSkill | 49.89% | 3.75 |
Govan | 49.28% | 3.49 |
BGD | 49.58% | 3.51 |
Base Statistical | 50.12% | 4.34 |
Statistical w/ Derived | 50.15% | 4.34 |
All | 52.00% | 3.49 |
All (Difference > 2) | 53.15% | |
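For concreteness, here's a sketch of the "correct versus the line" test, using made-up games rather than the actual Vegas data. Margins and lines are both stated from the home team's perspective; games that land exactly on the line are pushes and are skipped.

```python
# Sketch: scoring a predictor against the closing line. A prediction is
# "correct" if the predicted and actual margins fall on the same side of
# the line (i.e., a bet on the model's side would have won).

def record_vs_line(rows, min_edge=0.0):
    """rows: (predicted_margin, closing_line, actual_margin) per game,
    all from the home team's perspective. Returns (win fraction, bets)."""
    wins = bets = 0
    for pred, line, actual in rows:
        if abs(pred - line) <= min_edge or actual == line:
            continue  # no edge over the line, or a push
        bets += 1
        if (pred - line) * (actual - line) > 0:
            wins += 1
    return (wins / bets if bets else 0.0), bets

# Illustrative games, not real data
games = [
    (7.5, 4.0, 10.0),   # model liked the home side; home covered: win
    (1.0, 4.0, 2.0),    # model liked the away side; home failed to cover: win
    (9.0, 4.0, 3.0),    # model liked the home side; home didn't cover: loss
    (4.5, 4.0, 12.0),   # only a 0.5-point edge; skipped when min_edge=2
]

print(record_vs_line(games))              # every game bet
print(record_vs_line(games, min_edge=2))  # only bets with a >2-point edge
```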

The "All" model here is a linear predictor using all the inputs to TrueSkill, Govan, BGD and Statistics w/ Derived. (I also tested some voting models, but they all under-perform the Statistical/All models.)

There are a couple of interesting results.

Most noticeably, the "All" predictor is at break-even versus the line. (Because of the house "cut" on sports bets, you need to win about 52% of your bets to break even.) If we restrict ourselves to bets where the predictor differs from the line by at least two points, performance moves into (barely) positive territory. This is very good performance; the best predictors tracked at The Prediction Tracker do not even break 50%. (Furthermore, I am using the "closing" line, which is a tougher measure [by about one point] than the opening line used at The Prediction Tracker.)

It's also intriguing that TrueSkill/Govan/BGD all underperform the line but track it noticeably better than the statistical predictor. This suggests to me that the line is set not by wily veteran gamblers in the smoky back rooms, but by a computer program using some kind of team strength measure.

A (possibly interesting) side note: All models that under-perform the line are going to fall into the seemingly minuscule range of 48-52%. (If a model performs worse than 48% against the line, we would simply bet against the model.) Pick any crazy model you like -- "Always bet the home team," "Always bet on the team whose trainer's name is first alphabetically," etc. -- and the performance is almost certainly going to fall in that 48-52% range against the line. (If it doesn't, you've found the key to beating Vegas!)
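That symmetry is easy to demonstrate: fading a model (betting the opposite side every game) wins exactly when the model loses, so against the line a win rate of p becomes 1 − p. A toy illustration, with invented results:

```python
# Sketch of the symmetry argument: betting against a model flips its win
# rate p into 1 - p (ignoring pushes), so no model can stay far below 50%
# against the line -- you would just fade it. Data is illustrative.

def win_rate(picks, results):
    """picks/results: +1 or -1 per game for which side covered."""
    wins = sum(1 for p, r in zip(picks, results) if p == r)
    return wins / len(picks)

results = [+1, -1, -1, +1, -1, -1, -1, +1, -1, -1]
bad_model = [+1] * 10           # e.g. "always bet the home side"

p = win_rate(bad_model, results)
faded = win_rate([-x for x in bad_model], results)
print(p, faded)                 # the two rates always sum to 1
```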

## Friday, February 10, 2012

### The Continued (Slow) Pursuit of Statistical Prediction (Part III)

As promised last time, we'll now look at a different type of derived statistic. We're going to look at statistics which are the ratio between the two teams of the same base statistic, e.g.,


(Ave # of offensive rebounds for the home team / Ave # of offensive rebounds for the away team)

The idea here is that it may be more predictive to look at the relative strengths of the teams rather than the absolute strengths.

The first statistics I want to try this upon are the strength measures like TrueSkill and RPI. Suppose that Syracuse, with an RPI of 0.6823, plays Missouri, with an RPI of 0.6234, and the same night UCF, with an RPI of 0.5723, plays Oregon State, with an RPI of 0.5160. Would we expect the same outcome in those games? In both cases, the better team is about 0.06 better in RPI. But Syracuse is about 9% better than Missouri, while UCF is about 11% better than OSU. If it's the relative strength that matters, we would expect UCF to win (on average) by more than Syracuse.
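To make the difference-versus-ratio distinction concrete, here's the arithmetic for that example (the RPI values are the ones quoted above):

```python
# Absolute edge (difference) versus relative edge (ratio) for the two
# matchups in the RPI example. The RPI values come from the text above.
matchups = {
    "Syracuse vs Missouri": (0.6823, 0.6234),
    "UCF vs Oregon State": (0.5723, 0.5160),
}

for name, (a, b) in matchups.items():
    diff = a - b              # absolute edge: nearly identical (~0.06) in both games
    relative = a / b - 1.0    # relative edge: differs between the two games
    print(f"{name}: diff={diff:.4f}, relative={relative:.1%}")
```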

To test this out, I generated the relative strengths for measures like TrueSkill and ran them through my testing setup. In every case, the relative strengths had no predictive value above and beyond the value of the absolute strengths. And when the relative strengths alone were used for prediction, they underperformed the absolutes used alone.

I then did the same thing for the statistical attributes like offensive rebounding and got the same result. The relative strengths of the two teams provided no additional predictive accuracy.

I find this result fairly intriguing. My strong intuition was that at least a portion of the game outcome would be better explained by the relative strengths of the two teams. It's hard to believe that Syracuse should win its game against Missouri by more points simply because they're both stronger teams than UCF and OSU. But (as has often proven to be the case!) my intuition was just wrong, and relative strength is much less important than I would guess.

## Friday, February 3, 2012

### The Continued (Slow) Pursuit of Statistical Prediction (Part II)

Continuing on from last time, I had set up the infrastructure to allow me to easily test the value of derived variables in statistical prediction. Before testing any of these derived variables, we need a baseline. In this case, the baseline is the performance of a linear regression using all the base variables. I don't know that I've ever documented the base variables, but they are basically everything that can be created from the full game statistics available at Yahoo! Sports. These are averaged by game, so for example one of the base statistics is "Average free throw attempts per game." I also have the capability to average statistics by possession (e.g., "Average free throw attempts per possession"), but unlike some other researchers, I've never found per-possession averages to be any more useful than per-game averages, so I generally don't produce them.


For most statistics, I also produce the average for the team's opponents. So to continue the example above, I produce "Average free throws per game for this team's opponents." I also produce a small number of simple derived statistics, such as "Average Margin of Victory (MOV)", and winning percentages at home and on the road.

When we get to predicting game outcomes, of course we have all of these statistics for both the home and the away team. (And that home/road distinction is important, obviously.) If we use all these base statistics to create a linear regression, we get the following performance:

Predictor | % Correct | MOV Error |
---|---|---|
Base Statistical Predictor | 72.3% | 11.10 |

This is the same performance I have reported earlier, and tracks fairly well with the best performance from the predictors based upon strength ratings.

Now we want to augment that predictor with derived statistics to see if they offer any performance improvement. As mentioned last time, we have 1200 derived statistics, so we have to do some feature selection to thin that crop for testing.

One possibility (as discussed here) is to build a decision tree, and use the features identified in the tree. If we do that (and force the tree to be small), we identify these derived features as important:

* The home team's average margin of victory per possession over the overall winning percentage
* The away team's average number of field goals made by opponents over the average score
* The home team's average assists by opponents over the field goals made
* The home team's average MOV per game over the home winning percentage

To test that, I add those statistics to my base statistics and re-run the linear regression. In this case, what I find is that while some of the derived statistics are identified as having high value by the linear regression, the overall performance does not improve.

There are other methods for feature selection, of course. RapidMiner has an extension focused solely on feature selection. This offers a variety of approaches, including selection based on Maximum Relevance, Correlation-Based Feature Selection, and Recursive Conditional Correlation Weighting. All of these methods identified "important" derived statistics, but none produced a set of features that out-performed the base set.

A final approach is a brute force approach called forward search. In this approach, we start with the base set of statistics, add each of the derived statistics in turn, and test each combination. If any of those combinations improve on the base set, we pick the best combination and repeat the process. We continue this way until we can find no further improvement.
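Here's a sketch of that forward search in plain Python/NumPy on synthetic data. (My actual runs use RapidMiner, so this is just the shape of the procedure: fit a linear model, score it on a holdout set, and greedily add whichever candidate feature most improves the holdout RMSE.)

```python
# Sketch: greedy forward feature selection with a linear model, on
# synthetic data. Starting from a base feature set, repeatedly add the
# candidate feature that most improves holdout RMSE, until nothing helps.
import numpy as np

def rmse(X, y, X_test, y_test):
    # least-squares linear fit (with intercept), scored on the holdout set
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = np.column_stack([X_test, np.ones(len(X_test))]) @ w
    return float(np.sqrt(np.mean((pred - y_test) ** 2)))

def forward_search(X, y, X_test, y_test, base, candidates):
    chosen = list(base)
    best = rmse(X[:, chosen], y, X_test[:, chosen], y_test)
    remaining = list(candidates)
    while remaining:
        scores = [(rmse(X[:, chosen + [c]], y, X_test[:, chosen + [c]], y_test), c)
                  for c in remaining]
        new_best, pick = min(scores)
        if new_best >= best:
            break                # no candidate improves the holdout error
        best, chosen = new_best, chosen + [pick]
        remaining.remove(pick)
    return chosen, best

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
# the outcome depends on base feature 0 and "derived" features 5 and 6 only
y = X[:, 0] + 2 * X[:, 5] - X[:, 6] + rng.normal(scale=0.3, size=400)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

chosen, err = forward_search(Xtr, ytr, Xte, yte, base=[0], candidates=[4, 5, 6, 7])
print("selected features:", chosen, "holdout RMSE:", round(err, 3))
```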

There are a couple of advantages to this approach. First, there's no guessing about what features will be useful -- instead we're actually running a full test every time and determining whether a feature is useful or not. Second, we're testing all combinations in our search space, so we know we'll find the best combination. The caveat here is that we assume that improvement is monotonic with regards to adding features. If the best feature set is "A, B, C" then we're assuming we can find that by adding A first (because it offers the most improvement at the first step), then B to that, and so on. That isn't always true, but in this case it seems a reasonable assumption.

The big drawback of this approach is that it is very expensive. We have to try lots of combinations of features, and we have to run a full test for each combination. In this case, the forward search took about 54 hours to complete -- and since I had to run it several times because of errors or tweaks to the process, it ended up taking about a solid week of computer time.

In the end, the forward search identified ten derived features, with this performance:

Predictor | % Correct | MOV Error |
---|---|---|
Base Statistical Predictor | 72.3% | 11.10 |
w/ Forward Search Features | 74.0% | 10.73 |

This is a fairly significant improvement. The most important derived features in the resulting model were:

* The away team's opponent scoring average over the away team's winning percentage
* The away team's offensive rebounding average over the away team's number of field goals attempted
* The away team's scoring average over the away team's winning percentage
* The away team's opponent treys attempted over the away team's rebounds

I'll leave it to the reader to contemplate the meaning of these statistics, but there are some interesting suggestions here. The first and third statistics seem to be saying something about whether the away team is winning games through defense or offense. The second and fourth statistics seem to be saying something about rebounding efficiency, and perhaps about whether the team is good at getting "long" rebounds. (The statistics for the home team are completely different, by the way.)

Next time I'll begin looking at a different set of derived statistics.
