Friday, January 23, 2015

Reversing the Data?

Over on Reddit's machine learning subreddit, a user posed this question (paraphrasing):
For my basketball predictor, I could double the number of training examples by creating new examples with the Home and Away teams swapped.  Would it make sense to do that?
That's an interesting thought.

The first concern is that you can't just swap Home and Away.  No one is entirely sure why, but home teams play differently (and better) than away teams.  In NCAA basketball, the Home Court Advantage (HCA) is between 4 and 5 points.  What's more, all of a team's statistics are typically different for their home games.  There's sometimes even a home court effect for neutral site games.  In particular, the "home" team in NCAA Tournament games does better than you would expect -- even when you adjust for them being the better team.  (In the Tournament, the better-seeded team in the matchup is the home team.)  If you just blindly swap Home and Away, the newly created games are going to be bad training examples because they won't reflect what would happen in a real game with a swapped matchup.

But let's assume that there is no difference between Home and Away.  In that case would it be valuable to swap Home and Away to create new training examples?

Let's look at a simple case.  Suppose that our training examples consist of a strength rating for each team (S1, S2) and the margin of victory (MOV).  Here's our very small training set:
S1, S2, MOV 
20, 18, 2
14, 28, -14
Based upon this training data, we'll build a (perfect) model that predicts MOV = S1 - S2.

Now let's augment the training set by adding new games where S1 and S2 are swapped and MOV is negated.
S1, S2, MOV 
20, 18, 2
14, 28, -14
18, 20, -2
28, 14, 14
Based upon the new training data, we'll build the exact same model!

That happens because we didn't add any new information to our training set.  We added new examples, but they didn't contain any new information. They just repeated existing information transformed in a way that was irrelevant to our model.
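The toy example above can be checked with a few lines of Python. This is only a sketch under stated assumptions: the model is taken to be a two-weight linear fit with no intercept, MOV = w1*S1 + w2*S2, and the function name `fit_two_feature_model` is made up for illustration. A real predictor would use more features and a real learning library.

```python
def fit_two_feature_model(games):
    """Least-squares fit of MOV = w1*S1 + w2*S2 (no intercept).

    Solves the 2x2 normal equations directly, so no libraries are needed.
    """
    a11 = sum(s1 * s1 for s1, s2, mov in games)
    a12 = sum(s1 * s2 for s1, s2, mov in games)
    a22 = sum(s2 * s2 for s1, s2, mov in games)
    b1 = sum(s1 * mov for s1, s2, mov in games)
    b2 = sum(s2 * mov for s1, s2, mov in games)
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


# The two-game training set from the post.
original = [(20, 18, 2), (14, 28, -14)]

# Swap S1 and S2 and negate MOV to create the "reversed" games.
reversed_games = [(s2, s1, -mov) for s1, s2, mov in original]
augmented = original + reversed_games

w_original = fit_two_feature_model(original)
w_augmented = fit_two_feature_model(augmented)
print("original: ", w_original)
print("augmented:", w_augmented)
```

Both fits come out to weights of 1 and -1, that is, MOV = S1 - S2. The reversed games are consistent with the model the original games already implied, so they don't move it.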

I suspect this will always happen if (1) the data transformation is perfect, and (2) the machine learning model can "see through" the transformation.

An example of modifying the training set where the data transformation is not perfect is boosting.  In boosting, we effectively duplicate SOME of the training examples (by giving extra weight to the examples the current model gets wrong), and that leads to a different model.  If we duplicated all of the training examples equally, the transformation would be perfect and there would be no change to the learned model.  Duplicating only some of the examples is what's key to boosting.
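Here's a rough illustration of the "all versus some" point. To be clear, this is not a boosting algorithm, just a one-weight least-squares fit on three made-up points (the data and the name `fit_slope` are assumptions for the sketch) showing what duplication does to the learned model.

```python
def fit_slope(points):
    """Least-squares slope w for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in points) / sum(x * x for x, y in points)


# Made-up, slightly noisy data so the fit is not perfect.
points = [(1, 1), (2, 3), (3, 2)]

w_base = fit_slope(points)
# Duplicate every example: numerator and denominator both double.
w_all_doubled = fit_slope(points + points)
# Duplicate only one example: the fit is pulled toward that point.
w_some_doubled = fit_slope(points + [(2, 3)])

print(w_base, w_all_doubled, w_some_doubled)
```

Duplicating everything leaves the slope where it was, while duplicating only the (2, 3) example changes it.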

A situation where the machine learning model cannot "see through" the transformation is using non-linear transformations with a linear model.  For example, if we have a data set where the points are arranged in a circle, a linear regression will not model the data well because it always produces a straight line.  However, if we transform the data to polar coordinates, the circle becomes a line (constant radius at every angle), and the linear regression will be more effective.
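The circle example can be sketched as follows. The assumptions are mine: twelve evenly spaced points on a circle of radius 5, and a hand-rolled ordinary least squares helper (`fit_line` is an illustrative name, not from any library).

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


# Assumed data: 12 points evenly spaced around a circle of radius 5.
radius = 5.0
angles = [2 * math.pi * k / 12 for k in range(12)]
xs = [radius * math.cos(t) for t in angles]
ys = [radius * math.sin(t) for t in angles]

# Straight-line fit in the original (x, y) coordinates.
_, _, sse_cartesian = fit_line(xs, ys)

# Same points in polar coordinates: fit radius as a function of angle.
thetas = [math.atan2(y, x) for x, y in zip(xs, ys)]
radii = [math.hypot(x, y) for x, y in zip(xs, ys)]
intercept, slope, sse_polar = fit_line(thetas, radii)

print("squared error, (x, y) fit:  ", sse_cartesian)
print("squared error, polar fit:   ", sse_polar)
```

In the original coordinates the straight line leaves a large squared error, while in polar coordinates the fit is essentially a flat line at the radius with an error near zero. The transformation added nothing the data didn't already contain, but it put that information in a form the linear model can use.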

My guess is that swapping Home & Away to create new training examples isn't an effective technique for NCAA Basketball.  However, I encouraged the original poster to go ahead and give it a try.  It's a fairly easy experiment, and who knows, my intuitions may be wrong!

(And if they are, I'm totally stealing the technique.)

3 comments:

  1. Hey! I'm famous! :)

    I'm almost certain you're correct, but I just finished getting the initial model built this afternoon. I was able to get it up to 77.01% accuracy based on a little over 23k games and an 80/20 split. I'm going to randomize the training/test sets a bit more and see if that number holds. Of course, I don't expect that number to hold up to much outside of my own little world of data, but it's been a great learning exercise. I'll see if I can slap some Python together for reversing the data and see if the model changes at all. Thanks again for all of the pointers!

  2. You're welcome! :-)

    Glad to hear you've got the model working. 77% is pretty good -- actually a little too good to be true! You can see some representative results at http://www.thepredictiontracker.com/bbresults.php. The good predictors do around 75%. Hopefully you've made a spectacular breakthrough, but it's more likely you have some sort of error. The most likely possibility is a "leak" of information into the test examples.

    Keep working, and try doing a cross-validation if you have the capability. If you want any specific help, email me and I'll try to give you some pointers.

    1. I'll definitely do that! I've seen the options for cross-validation in the model parameters, and will certainly look into it (still don't fully understand it). It took a lot of finessing to get the 77% (initial model was in the high 75's), both with the parameters and the variable combinations. I'm curious to see if the model still holds up at 77% with those tweaks when I use another random seed to generate the 80/20 split.

      I'll have to read more on that prediction tracker site, that looks amazing! Sad that there aren't too many people doing it. I'd love to do that but I'm sure it's way out of my league.

      I'll shoot you an email when I've got the results of the reversed / duplicated data worked out. That should be pretty easy to do; I was overthinking it. I've already got all of the data in a database, so it's just a matter of swapping the order of my select statement and tacking it on to the existing file, then running another 80/20 split on it.

      I am making sure each portion of the 80/20 maintains the same distribution with respect to the predictor so hopefully I'm not introducing any errors there.
