Net Prophet: Testing Methodology Redux

For the RPI experiments, I used a testing methodology that tried each RPI variant against the same set of approximately 10K training games and 500 test games. This had the advantage of being fast and repeatable. However, it has the disadvantage that performance on the 500 test games might not accurately estimate the general performance. That is, we might have a tweak that (for whatever reason) happens to perform very well (or very poorly) on that particular set of test games. As we move forward into other ratings and more complex models, we'd like to avoid that problem.

To do that, we can test our algorithms on several different test sets. The general approach is called cross-validation. The basic idea is to split the input set into a large training set and a smaller test set, train the algorithm on the training set, test it on the test set, and then repeat for a new training & test set. The more test sets we use, the closer we can come to accurately estimating the true performance of the algorithm. The drawback is that testing becomes slower because of the repeated training-testing loop.
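Outside of RapidMiner, the split-and-repeat idea can be sketched in a few lines of Python. This is only an illustration of one common scheme (k-fold cross-validation), not RapidMiner's implementation; the function name `k_fold_splits` and its parameters are made up for this example.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    # Shuffle once with a fixed seed so the folds are random but repeatable.
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test = indices[fold::k]          # every k-th shuffled index is one test fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 20 games split into 5 folds: each round trains on 16 games and tests on 4.
splits = list(k_fold_splits(n_items=20, k=5))
```

Each game lands in exactly one test fold, so every game is used for testing once and for training in all the other rounds.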

RapidMiner provides cross-validation as a standard operator. This picture shows the top-level process flow I am using in testing:

The flow begins at the top left, where the entire set of game data is read in from a CSV (comma-separated values) file. Each game has the date, home team, away team, scores, etc., as well as the computed ratings to be tested -- for example, we might have the basic RPI rating for each team. (The computed ratings are produced by a Lisp program that must be run before the cross-validation.)

The game data is then preprocessed to get it ready for use in the predictive model. The first step is to generate a unique ID for each game. (This is useful if we split the data from a game into two parts and want to later re-join the parts.) Next is a "Generate Attributes" operator, which takes the home team score and the away team score and creates a derived attribute called MOV (the Margin of Victory). The "Set Role" operator then marks this new attribute as the "label" in our data -- the label is what we are trying to predict. Finally, we use a "Select Attributes" operator to keep only those attributes in the game data that we want to use as inputs to our predictive model. For example, if we are testing RPI, we'd select only the home team's RPI and the away team's RPI (along with the label) as inputs to the model.
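For readers who don't use RapidMiner, here is a rough Python equivalent of those preprocessing steps. The column names and the two sample rows are hypothetical (the real file layout isn't shown here), and the sketch assumes MOV is the home score minus the away score.

```python
import csv
import io

# Hypothetical two-game sample; the real CSV's column names may differ.
SAMPLE_CSV = """date,home,away,home_score,away_score,home_rpi,away_rpi
2011-01-05,TeamA,TeamB,78,70,0.62,0.58
2011-01-06,TeamC,TeamD,65,71,0.60,0.61
"""

def preprocess(csv_text, input_columns):
    """Add an ID and the MOV label, then keep only the chosen input columns."""
    rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for game_id, row in enumerate(reader, start=1):       # "Generate ID"
        # "Generate Attributes": assumed sign convention is home minus away.
        mov = int(row["home_score"]) - int(row["away_score"])
        # "Select Attributes": keep only the inputs under test.
        selected = {name: float(row[name]) for name in input_columns}
        selected["id"] = game_id
        selected["MOV"] = mov                              # "Set Role": this is the label
        rows.append(selected)
    return rows

games = preprocess(SAMPLE_CSV, ["home_rpi", "away_rpi"])
```

Testing a different rating only requires passing different column names as `input_columns`.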

The yellow Validation operator encapsulates the cross-validation process. It takes the preprocessed data as input and outputs a model, the same input data, and some computed performance estimates. In RapidMiner, we can drill down inside this operator to look at the cross-validation process:

This process is divided into two halves: Training and Testing. RapidMiner takes care of splitting the data into a training set and a testing set.

The training set is fed into the Training side of the cross-validation as the input. The Training side takes this input and produces a model. In this case, we're using a Linear Regression operator that takes the training data and produces a linear regression that best fits the input attributes (Home RPI, Away RPI) to the label data (the Margin of Victory). The model is then output from the Training side and passed over to the Testing side.
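To make the Training side concrete, here is a minimal least-squares fit in plain Python. It is a sketch of ordinary linear regression via the normal equations, not RapidMiner's Linear Regression operator (which may differ in its details); the function names and the synthetic data are invented for this example.

```python
def fit_linear_regression(features, labels):
    """Least-squares fit of label ~ intercept + weights . features.

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination
    and returns [intercept, weight_1, ..., weight_n].
    """
    rows = [[1.0] + list(f) for f in features]   # constant column for the intercept
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, labels))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def predict(coefficients, feature_row):
    """Apply the fitted model to one (home_rpi, away_rpi) pair."""
    return coefficients[0] + sum(w * x for w, x in zip(coefficients[1:], feature_row))

# Synthetic data generated from MOV = 2 + 40*home_rpi - 30*away_rpi (made-up numbers).
features = [(0.60, 0.50), (0.50, 0.60), (0.55, 0.45), (0.40, 0.50), (0.65, 0.60)]
labels = [2.0 + 40.0 * home - 30.0 * away for home, away in features]
coefficients = fit_linear_regression(features, labels)
```

Because the synthetic labels are exactly linear in the two inputs, the fit should recover the generating coefficients up to floating-point error.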

The Testing side of the cross-validation takes the model and the test data and outputs performance values. In this case, we use the "Apply Model" operator to apply the model from the Training side to the test data. This produces "labelled data" -- the test data with an added attribute called "Prediction(MOV)".

In this case, I want to keep track of two measures of performance: the number of correctly predicted games, and the root mean squared error in the MOV prediction. To do this, a copy is made of the labelled data using the Multiply operator. One copy is sent down to the yellow Performance operator at the bottom of the figure. This is a built-in RapidMiner operator that calculates root mean squared error from the labelled data. This result is then pushed out as a performance measure for this model.

The other copy of the labelled data is sent into the pink boxes at the top of the figure. These boxes rename the "Prediction(MOV)" attribute, generate a new attribute which is 1 if the prediction was correct and 0 otherwise, and then aggregate (sum) the new attribute. The resulting count of correct predictions is then converted to a performance measure and pushed out alongside the root mean squared error.
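Both performance measures are simple to compute by hand. The sketch below assumes that a "correct" prediction means the predicted MOV has the same sign as the actual MOV (that is, the predicted winner actually won); the function names and the sample numbers are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted MOV."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def count_correct(actual, predicted):
    """Number of games where the predicted winner matches the actual winner.

    Assumption: a prediction is correct when predicted and actual MOV share
    a sign; a predicted or actual MOV of exactly 0 is counted as incorrect.
    """
    return sum(1 if a * p > 0 else 0 for a, p in zip(actual, predicted))

# Made-up example: four games, one of which (the second) picks the wrong winner.
actual_mov = [8, -6, 3, -10]
predicted_mov = [5.0, 2.0, 1.0, -4.0]

error = rmse(actual_mov, predicted_mov)
correct = count_correct(actual_mov, predicted_mov)
```

The two measures can disagree: a model can pick the right winner while missing the margin badly, which is why it's useful to track both.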

When this process is run, RapidMiner splits the data into training and test sets according to the parameters of the Cross-Validation operator (in this case, I'm doing 100 cross-validations), runs the Training/Testing process on each data set, and averages the performance measures across all the sets:
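As a rough illustration of that final averaging step, the whole loop fits in a short Python function. Everything here is hypothetical scaffolding: `cross_validate`, `train_fn`, and `evaluate_fn` are names invented for the sketch, and the toy "model" simply predicts the mean MOV of its training games.

```python
from statistics import mean

def cross_validate(rows, k, train_fn, evaluate_fn):
    """Run k train/test rounds and average each performance measure.

    train_fn(train_rows) -> model
    evaluate_fn(model, test_rows) -> dict mapping measure name to value
    """
    per_fold = []
    for fold in range(k):
        test_rows = rows[fold::k]                                  # held-out games
        train_rows = [r for i, r in enumerate(rows) if i % k != fold]
        model = train_fn(train_rows)
        per_fold.append(evaluate_fn(model, test_rows))
    # Average every measure across the k rounds.
    return {name: mean(result[name] for result in per_fold) for name in per_fold[0]}

# Toy stand-ins: rows are bare MOV values, and the model is just their mean.
def train_mean(train_rows):
    return mean(train_rows)

def evaluate_mean(model, test_rows):
    return {"mean_abs_error": mean(abs(mov - model) for mov in test_rows)}

averaged = cross_validate([1, 2, 3, 4, 5, 6], k=3,
                          train_fn=train_mean, evaluate_fn=evaluate_mean)
```

Swapping in a real training function and a function that returns both performance measures would give one averaged number per measure.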