Wednesday, October 12, 2011

More on Statistical Prediction

I am continuing to explore statistical prediction.  In particular, after implementing the Four Factors as described here, I became interested in examining other statistics generated from the base set of statistics.  A subset of these generated statistics are ratios of the base statistics, like the "Offensive Balance" statistic I defined in my earlier post:
Offensive Balance = (# 3 Pt Attempts) / (# FG Attempts)
You can probably come up with a few sensible statistics like these off the top of your head.  But since I've seen time and again the value of exploring all options -- even the ones that make no "sense" -- I decided to calculate and test all of these sorts of ratios to see which of them (if any) have predictive value.

That's a more difficult job than you might imagine.  In my data sets there are 13 base statistics per team per game (FG Made, FG Attempted, 3PT Made, 3PT Attempted, FT Made, FT Attempted, Offensive Rebounds, Total Rebounds, Assists, Turnovers, Steals, Fouls, Score, and MOV).  For predictive purposes, we want to use the average of these over a team's previous games [1], and we can average by either game or possession, so that's 26 base statistics per team.  There are 26*25 = 650 possible ratios of those statistics.  But we want to consider not only ratios of a team's statistics with each other but also ratios of the team's statistics with its opponent's, e.g., the ratio of the team's average number of 3 PT attempts in past games to its opponent's average number of 3 PT attempts in past games.  That adds another 26*26 = 676 possible ratios.  Finally, we also want to consider the statistics for a team's past opponents, e.g., the average number of 3 PT attempts in past games of a team's opponents in those games.  Adding those in creates a lot more ratios.  Multiply all that by the 12K games in my training data, and it's a lot of data.
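
As a quick check on the counts above, here is a minimal Python sketch.  The statistic names are placeholders I made up for illustration; only the counts (13 base statistics, two ways of averaging) come from the text.

```python
from itertools import permutations, product

# Count taken from the post; the names themselves are placeholders.
N_BASE = 13
base = [f"stat{i}" for i in range(N_BASE)]

# Each base statistic can be averaged per game or per possession.
averaged = [f"{s}_per_{unit}" for s in base for unit in ("game", "possession")]

# Ratios within one team: ordered pairs with numerator != denominator.
within_team = list(permutations(averaged, 2))

# Ratios of a team statistic to an opponent statistic: any pair is allowed,
# including the same statistic on both sides.
team_vs_opponent = list(product(averaged, repeat=2))

print(len(averaged), len(within_team), len(team_vs_opponent))  # 26 650 676
```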

My approach is to generate a subset of the possible ratios and test them for predictive value.  For various reasons I settled on generating all the ratios with a particular numerator, e.g.,
(FG Made) / (# Fouls)
(FG Made) / (Opponent's # Fouls)
(FG Made) / (# Fouls by Opponents in Past Games)
etc.
This ends up adding about 96 new statistics to every game in the database.  I can then take this expanded data and pump it through the usual linear regressions, etc., to find the statistics that have predictive value.  But this is a slow process -- for each numerator, it takes hours to generate all the statistics and run them through iterations of the predictive model.  (Testing one numerator at a time also has the disadvantage that I may miss generated statistics with different numerators that are only valuable in combination.)
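
To make the one-numerator-at-a-time idea concrete, here is a rough sketch in Python.  It is not the code I actually run: the function name `ratios_for_numerator`, the column names, and the choice to record `None` for a zero denominator are all assumptions made for illustration.

```python
def ratios_for_numerator(row, numerator):
    """Return {ratio_name: value} for `numerator` divided by every other column.

    `row` is a dict of averaged statistics for one game (hypothetical layout).
    """
    num = row[numerator]
    ratios = {}
    for name, value in row.items():
        if name == numerator:
            continue  # skip the trivial ratio of a statistic with itself
        # Assumption: a zero denominator yields None rather than an error.
        ratios[f"{numerator}/{name}"] = num / value if value else None
    return ratios


# Hypothetical averaged statistics for one team going into one game.
row = {
    "fg_made": 25.0,
    "fouls": 18.0,
    "opp_fouls": 20.0,
    "past_opp_fouls": 0.0,
}
new_stats = ratios_for_numerator(row, "fg_made")
print(new_stats["fg_made/opp_fouls"])  # 1.25
```

Each generated ratio then becomes one more column fed to the regression.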

So far, I haven't identified any ratios that result in significantly better predictions.  But I have been surprised that (at least so far) the models have selected a number of unexpected ratios as being of value.  For example:
(Away team's Average FG Made) / (Away team's Average 3PTs Attempted)
(Away team's Average FG Made) / (Away team's Average 3PTs Made)
These ratios seem to be capturing something about the Away team's offensive balance between inside and outside play.  Interestingly, both the ratio with 3 PTs Attempted and the ratio with 3 PTs Made are significant -- it may be that the first captures the "offensive strategy" (whether a team plays outside first or inside first) and the second captures something about how effective the team is at executing that strategy.  It's also interesting that these ratios are only significant for the Away team -- apparently the home team's performance doesn't depend strongly on what sort of offensive strategy it uses.

Another interesting statistic:
(Home team's Average FG Made) / (Home team's Past Opponents' Average Offensive Rebounds)
It takes a moment's thought to grasp this statistic.  It compares the average number of FGs made by a team to the offensive rebounding of the opponents the team faced.  If we take Offensive Rebounds as an indicator of how strongly teams are contesting inside play, then this ratio would seem to say something about how effective the home team's inside play has been relative to its opponents.
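
A small worked example may help.  This is only a sketch: the numbers are invented, the field names are hypothetical, and it assumes that "past opponents' average" means the opponents' offensive rebounds in the games they played against this team.

```python
from statistics import mean

# Invented past games for one home team (hypothetical field names):
# the team's FG made and its opponent's offensive rebounds in each game.
past_games = [
    {"fg_made": 24, "opp_off_reb": 10},
    {"fg_made": 28, "opp_off_reb": 12},
    {"fg_made": 26, "opp_off_reb": 8},
]


def fg_made_per_opp_off_reb(games):
    """Average FG made divided by past opponents' average offensive rebounds."""
    avg_fg_made = mean(g["fg_made"] for g in games)
    avg_opp_off_reb = mean(g["opp_off_reb"] for g in games)
    return avg_fg_made / avg_opp_off_reb


ratio = fg_made_per_opp_off_reb(past_games)
print(ratio)  # 2.6
```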

Hopefully working through all the ratio statistics will turn up a set of statistics that provide significantly better predictive value.

[1] Averaging isn't the only option here, and there are other possibilities for generated statistics that might be useful, but I feel that ratios are a reasonably fertile area for exploration.

2 comments:

  1. Scott,

    Check out log-ratio analysis for techniques on ratio metric features.

    Also curious: how are you performing your feature selection? Are you using a recursive feature elimination (RFE) algorithm?

  2. Thanks for the suggestion - do you have a pointer to a suitable intro?

    I've used a couple of methods so far. Since I'm using a linear predictor, I can run that on all the features and see which ones it selects (M5, I think). Alternatively, I've tried greedy forward and backward approaches, as well as some evolutionary algorithms. The linear predictor seems to work as well as anything.
