For my basketball predictor, I could double the number of training examples by creating new examples with the Home and Away teams swapped. Would it make sense to do that?
That's an interesting thought.
The first concern is that you can't just swap Home and Away. No one is entirely sure why, but home teams play differently (and better) than away teams. In NCAA basketball, the Home Court Advantage (HCA) is between 4 and 5 points. What's more, all of a team's statistics are typically different for their home games. There's sometimes even a home court effect for neutral site games. In particular, the "home" team in NCAA Tournament games does better than you would expect -- even when you adjust for them being the better team. (In the Tournament, the better-seeded team in the matchup is the home team.) If you just blindly swap Home and Away, the newly created games are going to be bad training examples because they won't reflect what would happen in a real game with a swapped matchup.
But let's assume that there is no difference between Home and Away. In that case would it be valuable to swap Home and Away to create new training examples?
Let's look at a simple case. Suppose that our training examples consist of a strength rating for each team (S1, S2) and the margin of victory (MOV). Here's our very small training set:
S1, S2, MOV
20, 18, 2
14, 28, -14
Based upon this training data, we'll build a (perfect) model that predicts MOV = S1 - S2.
Now let's augment the training set by adding new games where S1 and S2 are swapped and MOV is negated.
S1, S2, MOV
20, 18, 2
14, 28, -14
18, 20, -2
28, 14, 14
Based upon the new training data, we'll build the exact same model!
That happens because we didn't add any new information to our training set. We added new examples, but they didn't contain any new information. They just repeated existing information transformed in a way that was irrelevant to our model.
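Here's a minimal sketch of that experiment in Python, using the four games from the example. The helper names (fit_linear, swap_augment) are my own, and I'm assuming the model is a plain least-squares fit of MOV = a*S1 + b*S2 with no intercept, which is only one of many models you might use.

```python
def fit_linear(games):
    """Least-squares fit of MOV = a*S1 + b*S2 (no intercept).

    Solves the 2x2 normal equations directly. Assumes the games
    are not all multiples of one another (non-zero determinant).
    """
    s11 = sum(s1 * s1 for s1, s2, mov in games)
    s12 = sum(s1 * s2 for s1, s2, mov in games)
    s22 = sum(s2 * s2 for s1, s2, mov in games)
    t1 = sum(s1 * mov for s1, s2, mov in games)
    t2 = sum(s2 * mov for s1, s2, mov in games)
    det = s11 * s22 - s12 * s12
    a = (s22 * t1 - s12 * t2) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


def swap_augment(games):
    """Add a mirrored copy of every game: swap S1 and S2, negate MOV."""
    return games + [(s2, s1, -mov) for s1, s2, mov in games]


original = [(20, 18, 2), (14, 28, -14)]
augmented = swap_augment(original)

coef_original = fit_linear(original)
coef_augmented = fit_linear(augmented)
print("original: ", coef_original)
print("augmented:", coef_augmented)
```

Both fits come out at a = 1 and b = -1, that is, MOV = S1 - S2. Doubling the training set this way leaves the learned model unchanged.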
I suspect this will always happen if (1) the data transformation is perfect, and (2) the machine learning model can "see through" the transformation.
An example of modifying the training set where the data transformation is not perfect is boosting. In boosting, we effectively duplicate SOME of the training examples (by giving more weight to the ones the current model gets wrong), and that leads to a different model. If we duplicate all of the training examples, the transformation is perfect and there's no change to the learned model. Duplicating only some of the examples is what's key to boosting.
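To be clear, the following is not boosting itself, just a small Python illustration of the duplication point. The data is made up, and I'm assuming a one-parameter least-squares model y = a*x so the arithmetic stays easy to follow.

```python
def fit_slope(points):
    """Least-squares slope for y = a*x (no intercept)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, y in points)
    return sxy / sxx


# Made-up, slightly noisy data (purely illustrative).
points = [(1, 1), (2, 3), (3, 2)]

slope_original = fit_slope(points)
# Duplicate EVERY example: nothing new for the model to learn from.
slope_all_duplicated = fit_slope(points + points)
# Duplicate only ONE example: that point now counts double.
slope_some_duplicated = fit_slope(points + [(2, 3)])

print(slope_original, slope_all_duplicated, slope_some_duplicated)
```

Duplicating everything gives the same slope (13/14 either way), while duplicating just the (2, 3) example pulls the slope toward that point (19/18).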
A situation where the machine learning model cannot "see through" the transformation is using non-linear transformations with a linear model. For example, if we have a data set where the data is arranged in a circle, a linear regression will not model the data well because it always produces a straight line. However, if we apply a polar coordinate transformation to the data, the circle becomes a straight line (every point has the same radius), and the linear regression will be more effective.
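A rough Python sketch of the circle example follows. The radius, the number of points, and the fit_line helper are all my own choices for illustration. It fits an ordinary least-squares line to points on a circle in (x, y) coordinates, and then again after converting them to polar (angle, radius) coordinates.

```python
import math


def fit_line(xs, ys):
    """Simple least-squares line y = intercept + slope*x.

    Returns (slope, intercept, sum of squared errors).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


# Twelve points evenly spaced on a circle of radius 5 (illustrative values).
radius, n_points = 5.0, 12
angles = [2 * math.pi * k / n_points for k in range(n_points)]
xs = [radius * math.cos(t) for t in angles]
ys = [radius * math.sin(t) for t in angles]

# Fit in the original coordinates: predict y from x.
_, _, sse_cartesian = fit_line(xs, ys)

# Transform to polar coordinates, then predict the radius from the angle.
thetas = [math.atan2(y, x) for x, y in zip(xs, ys)]
radii = [math.hypot(x, y) for x, y in zip(xs, ys)]
_, intercept_polar, sse_polar = fit_line(thetas, radii)

print("error in (x, y) coordinates:", sse_cartesian)
print("error in polar coordinates: ", sse_polar)
```

In (x, y) coordinates the straight line leaves a large error, while in polar coordinates the radius is constant, so a flat line at the circle's radius fits essentially perfectly. The linear model couldn't "see through" the transformation, so applying it changed what the model could learn.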
My guess is that swapping Home & Away to create new training examples isn't an effective technique for NCAA Basketball. However, I encouraged the original poster to go ahead and give it a try. It's a fairly easy experiment, and who knows, my intuitions may be wrong!
(And if they are, I'm totally stealing the technique.)