## Monday, September 19, 2011

### Statistical Prediction: Normalizing Inputs

One thing to consider when doing statistical prediction (or any sort of prediction with a variety of dissimilar inputs) is normalizing the inputs. The purpose is to be able to compare inputs that have different scales. For example, in my data set, home team scoring average varies from 43 to 102, while "steals by the away team" varies from 0 to 13, so it's hard to compare those two numbers directly. And we don't want our prediction model to favor one input over another just because it has a bigger absolute value. To address this, we can "normalize" our data to similar scales.

I mentioned earlier that Brady West normalizes all the input data to his model by subtracting the mean and dividing by the standard deviation; this is called the "standard score." Instead of knowing that the home team scored 108 points, you'd know that they scored 2.38 standard deviations above the mean. That sounds like a fine approach to me. RapidMiner (the tool I'm using to do the predictive models) doesn't offer anything labeled "standard score," but it does offer a z-transformation, which is the same operation: it transforms the data so that it has a mean of zero and a standard deviation of 1. If we apply that to all of our inputs, we'll have more of an apples-to-apples comparison. For example, the home scoring average ends up ranging from -9.96 to 3.99, while the away team's FT percentage varies from -14.34 to 4.87, which gives you some sense that FT shooting percentage has more extreme outliers relative to its typical spread.
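As a rough sketch of what the z-transformation does to a column of inputs, here is a small Python version. This is not RapidMiner's implementation: the function name `z_transform` and the sample numbers are made up for illustration, and I'm assuming the population standard deviation, which may not match what RapidMiner uses.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Return standard scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot normalize a constant column")
    return [(x - mu) / sigma for x in values]


# Hypothetical values, NOT taken from the actual data set.
scoring_averages = [43.0, 61.5, 68.0, 72.5, 102.0]
steals = [0.0, 4.0, 6.0, 7.0, 13.0]

z_scoring = z_transform(scoring_averages)
z_steals = z_transform(steals)

# Both columns now have mean 0 and standard deviation 1,
# so their values can be compared on the same scale.
print([round(z, 2) for z in z_scoring])
print([round(z, 2) for z in z_steals])
```

After the transformation, a value of 2.0 means "two standard deviations above average" whether the column started out as points or steals.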

If we apply the z-transformation to our inputs, there is no change in performance for the model that takes only scoring averages. That's reasonable, since the scoring averages are all basically on the same scale anyway. But when we throw in a second input with a different scale (here, 3-point percentage), a difference becomes apparent:

| Predictor | % Correct | MOV Error |
|---|---|---|
| Govan + Averaging | 73.5% | 10.80 |
| Scoring averages | 72.1% | 11.18 |
| Scoring + 3 pt % (without normalization) | 72.1% | 11.18 |
| Scoring + 3 pt % (with normalization) | 72.1% | 11.09 |

So as a matter of course I'll perform a normalization step as part of the prediction workflow.  (In this case, it doesn't improve our best performance by much.)

It's also interesting to compare the coefficients in our linear regression.  This is what we see if we look at the coefficients for the various scoring averages:

| Datum | Coefficient |
|---|---|
| Home Team Scoring Average | 5.886 |
| Away Team's Opponent Scoring Average | -4.447 |
| Away Team Scoring Average | -5.686 |
| Home Team's Opponent Scoring Average | 4.793 |

Naively, you might want to predict a team's score as exactly halfway between what the team usually scores (offense) and what the other team usually gives up (defense). What this shows is that the best estimate actually weights offense slightly more: about 57% for the home team and 54% for the away team, taking each offense coefficient as a share of the combined magnitude of the offense and defense coefficients.
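The 57% and 54% figures can be reproduced from the coefficients in the table. A minimal sketch, assuming the weight is the offense coefficient's share of the combined absolute values of the offense and defense coefficients (the helper name `offense_weight` is made up for illustration):

```python
def offense_weight(offense_coef, defense_coef):
    """Share of the total coefficient magnitude carried by the offense term."""
    off, dfn = abs(offense_coef), abs(defense_coef)
    return off / (off + dfn)


# Home score: home team's scoring average vs. what the away team gives up.
home = offense_weight(5.886, -4.447)
# Away score: away team's scoring average vs. what the home team gives up.
away = offense_weight(-5.686, 4.793)

print(f"home: {home:.0%}, away: {away:.0%}")  # prints "home: 57%, away: 54%"
```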