Wednesday, March 9, 2016

That's Not Really A Number

Suppose that you're competing in the Kaggle competition and you're using team win ratios and average scoring for the season to predict who is going to win a game.  Your input to your model might look something like this:

Team | Win Ratio | Ave. Score | Team | Win Ratio | Ave. Score

Your results are mediocre, so you decide to improve your model by adding more information about each team.  The seeding of the team -- the NCAA Tournament committee's assessment of team strength -- seems like it would be useful for prediction, so you add each team's seeds to your inputs:

Team | Win Ratio | Ave. Score | Seed | Team | Win Ratio | Ave. Score | Seed

You've just made a mistake.  Do you see what it is?

The way you've added the seeding information, many machine learning tools / models are going to treat the seed as a number [1], not any different from the Win Ratio or the Average Score.  And that's a problem, because the seed is not really a number.  It's actually what statisticians would call a categorical variable, because it can take one value out of a fixed set of arbitrary values.  (Machine learning types might be more likely to call it a categorical feature.)  If you're not convinced about that, imagine replacing each seed with a letter -- the #1 seed becomes the A seed, the #2 seed becomes the B seed and so on.  This is perfectly reasonable -- we could still talk about the A seeds being the favorites, we'd know in the first round the A seeds play the P seeds and so on.  It wouldn't make any sense to try to do that with a numeric feature like (for instance) the Average Score.
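To make the relabeling argument concrete, here is a minimal Python sketch.  The names SEED_LETTERS, seed_to_letter and first_round_opponent are made up for illustration; the only thing assumed is the usual bracket rule that seed s meets seed 17 - s in the first round.

```python
import string

# Hypothetical relabeling: seed 1 -> "A", seed 2 -> "B", ..., seed 16 -> "P".
SEED_LETTERS = string.ascii_uppercase[:16]


def seed_to_letter(seed):
    """Return the letter label for a seed in 1..16."""
    if not 1 <= seed <= 16:
        raise ValueError(f"seed must be in 1..16, got {seed}")
    return SEED_LETTERS[seed - 1]


def first_round_opponent(seed):
    """In the first round, seed s plays seed 17 - s (1 vs 16, 2 vs 15, ...)."""
    return 17 - seed


for seed in (1, 2, 8):
    opponent = first_round_opponent(seed)
    print(seed_to_letter(seed), "seed plays", seed_to_letter(opponent), "seed")
```

Nothing about the bracket is lost by the relabeling, which is the sense in which the seed is a label rather than a quantity.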

Another difference between numeric and categorical features that merely look like numbers is that real numeric features have a fixed scale, while categorical features have an arbitrary scale.  The difference between an Ave. Score of 86.5 and an Ave. Score of 81 is 5.5 points, and that's the same amount as the difference between an Ave. Score of 78.8 and 73.3.  But the difference between a 16 seed and a 15 seed might be quite different than the difference between a 1 seed and a 2 seed.  (Not to mention that the difference between a 16 seed and a 15 seed might be quite different than between the 16 seed and a different 15 seed!)

So if I've convinced you that team seeds are not really numbers but just look like numbers, then how should you represent them in your model?

The basic approach is something called "one hot encoding" (the name derives from a type of digital circuit design).  The idea behind one hot encoding is to represent each possible value of the categorical feature as a different binary feature (with value 0 or 1).  For any particular value of the categorical feature, one of these binary features will have the value 1 and the rest will be zero.  (Hence "one hot".)  To represent the seeds, we need 16 binary features:

Team | Win Ratio | Ave. Score | Seed_1 | Seed_2 | Seed_3 | Seed_4 | Seed_5 | ... | Seed_16 | Team | Win Ratio | Ave. Score | Seed_1 | ...
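As a rough sketch of how such a row could be built, here is one way to do it in plain Python.  The helper names (one_hot_seed, game_features) and the example values are my own, not part of any particular library; most machine learning toolkits provide their own encoder for this.

```python
NUM_SEEDS = 16  # seeds run from 1 to 16


def one_hot_seed(seed):
    """Return a list of 16 zeros with a single 1 at the seed's position."""
    if not 1 <= seed <= NUM_SEEDS:
        raise ValueError(f"seed must be in 1..{NUM_SEEDS}, got {seed}")
    return [1 if s == seed else 0 for s in range(1, NUM_SEEDS + 1)]


def game_features(win_ratio_a, ave_score_a, seed_a,
                  win_ratio_b, ave_score_b, seed_b):
    """Build one input row: numeric stats plus one-hot seeds for both teams."""
    return ([win_ratio_a, ave_score_a] + one_hot_seed(seed_a)
            + [win_ratio_b, ave_score_b] + one_hot_seed(seed_b))


# Made-up example values, purely for illustration.
row = game_features(0.85, 78.8, 4, 0.70, 73.3, 13)
print(one_hot_seed(4))
print(len(row))  # 2 teams * (2 numeric columns + 16 seed indicators)
```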

This representation gives your model much more flexibility to learn the differences between the seeds.  For example, it could learn that being a 5 seed is every bit as valuable as being a 4 seed.  If you represented the seed as a single number, a linear model could not learn that, because every step from one seed to the next would have to count for the same amount.
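Here is a small sketch of why that is, assuming a linear model where the seed contributes either weight * seed (numeric) or one separate weight per seed (one hot).  All of the weights below are invented for illustration, not fitted to any data.

```python
def numeric_contribution(seed, weight):
    """Seed's contribution to a linear score when the seed is one numeric input."""
    return weight * seed


def one_hot_contribution(seed, weights):
    """Seed's contribution when each seed has its own weight (one-hot encoding)."""
    return weights[seed - 1]


# Hypothetical numeric weight: every step down the seed list costs the same.
w = -0.3
gaps = [numeric_contribution(s + 1, w) - numeric_contribution(s, w)
        for s in range(1, 16)]
print(gaps)  # every gap is w, up to floating-point rounding

# Hypothetical per-seed weights: seeds 4 and 5 share a value, others differ.
one_hot_weights = [4.0, 3.2, 2.5, 1.5, 1.5, 1.0, 0.8, 0.5,
                   0.4, 0.2, 0.1, 0.0, -0.5, -1.0, -1.8, -3.0]
print(one_hot_contribution(4, one_hot_weights) ==
      one_hot_contribution(5, one_hot_weights))
```

With the numeric encoding, the only way to make the 4 and 5 seeds equal is to set the weight to zero, which makes every seed equal.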

One drawback of this representation is that it tends to rapidly increase the number of input features.  Seeds added 32 features to the model (16 for each of the two teams) -- adding the Massey ordinals to your model could add many thousands of features!  That can be bad for several reasons.  First, the increased dimensionality will slow down the training of your model.  This may or may not be a problem, depending upon the overall size of your training data and the computational power you have for training.  A more significant problem is that it will encourage your model to overfit.  Even if you use the Tournament data all the way back to 1985 (which I do not encourage) there may be only a few games for unlikely pairings such as (say) a #8 seed playing a #16 seed.  That may cause your model to learn a rule for those games that is too specific.
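A quick way to see both the feature count and the sparsity is to count them.  The sketch below uses a made-up list of seed matchups (not real Tournament results), and the helper name pairing_counts is my own.

```python
from collections import Counter

NUM_SEEDS = 16
TEAMS_PER_GAME = 2

# One indicator column per seed, per team.
print(NUM_SEEDS * TEAMS_PER_GAME)  # 32

# Distinct unordered seed pairings, counting same-seed meetings.
print(NUM_SEEDS * (NUM_SEEDS + 1) // 2)  # 136


def pairing_counts(games):
    """Count games per unordered seed pairing; games holds (seed_a, seed_b) tuples."""
    return Counter(tuple(sorted(game)) for game in games)


# Made-up matchups, for illustration only.
games = [(1, 16), (16, 1), (8, 9), (1, 8), (5, 12)]
counts = pairing_counts(games)

# Pairings seen fewer than twice are the ones a model could overfit to.
rare = sorted(pair for pair, n in counts.items() if n < 2)
print(rare)
```

Running the same count on your own training data shows which pairings have too few games to support a rule of their own.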

Now it may be that you disagree with the very premise of this posting.  You really do think that seeds can be treated like numbers.  The good news for you is that it is very easy to do that -- just include the seed as both a numeric feature and a categorical feature.  You can look at the size of your model coefficients or other components to see whether treating the seed as a number has any value or not.
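A minimal sketch of that combined representation, with a hypothetical helper name (seed_features):

```python
def seed_features(seed, num_seeds=16):
    """Return the seed as a numeric feature followed by its one-hot indicators."""
    if not 1 <= seed <= num_seeds:
        raise ValueError(f"seed must be in 1..{num_seeds}, got {seed}")
    indicators = [1 if s == seed else 0 for s in range(1, num_seeds + 1)]
    return [seed] + indicators


features = seed_features(5)
print(features[0])    # the numeric seed
print(features[1:])   # the 16 indicator columns
print(len(features))  # 17 columns per team
```

One caveat: the numeric column is an exact linear combination of the indicator columns, so in a linear model without regularization the coefficients are not uniquely determined.  Comparing their sizes is more meaningful when the model is regularized.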

[1] This is admittedly a simplification.  Some machine learning models (such as tree-based models) may be flexible enough to treat numeric variables in ways that mimic categorical variables.  But many models (and notably linear and logistic regressions) will not.
