Thursday, October 25, 2012

A Detour into RapidMiner

A chunk of visitors to this blog find it looking for RapidMiner, so I thought I'd take a detour to explain the RapidMiner process I'm using to explore early season performance.  This RapidMiner process uses training data to build a model, applies the model to separate test data, and then measures performance.  This is something of a sequel to the post I did for Danny Tarlow over at his blog.  Hopefully it will be useful to some folks as an example of how to put together a more complex RapidMiner process, as well as how to apply a model to test data, which wasn't covered in the previous post.

(Reminder: RapidMiner is a free data-mining tool that you can download here.)

The (unreadable) graphic above illustrates the entire process.  There are four parts to this process.  In Process 1, the training data is read in and processed.  In Process 2, the test data is read in and processed.  In Process 3, the training data is used to build a linear regression and then the model from that regression is applied to the test data.  In Process 4, the results are processed and performance measures calculated.  I'll now go into each process in detail.

The graphic above shows Process 1 in more detail. It's a straightforward linear flow starting at the upper left and ending at the lower right.  The steps are:
  1. Read CSV -- This operator reads in the training data, which is simply a large text file in comma-separated value (CSV) format, with one line for every game (record) in our training data set.
  2. Generate ID -- This operator adds a unique ID attribute to every record in our training data set.  (We'll see later why it is useful to have a unique ID on every record in the data set.)
  3. Rename by Replacing -- This operator is used to rename attributes in the data set.  In this case, I use it to replace every occurrence of a dash (-) with an underscore ( _ ).  Dashes in attribute names are problematic when you do arithmetic on the attributes, because they get mistaken for minus signs.
  4. Generate Attributes -- This operator generates new attributes based on the existing attributes.  In this case, I calculate a new attribute called "mov" (Margin of Victory) by subtracting the visiting team's score from the home team's score.
  5. Set Role -- Most attributes are "regular" but some have special roles.  For example, the ID attribute generated in step 2 has the "id" role.  Here I use the Set Role operator to set the role of the "mov" attribute to "label."  This role identifies the attribute that we are trying to predict.
  6. Read Constructions -- You can use the Generate Attributes operator to generate new attributes, but that's not convenient if you want to generate a lot of new attributes, or if you want to generate new attributes based on some external inputs.  In my case, I have generated and tested many derived statistics, and entering them manually into "Generate Attributes" was not feasible.  The Read Constructions operator reads formulas to generate new attributes from a file and creates them in the data set.  Using this, I was able to have a Lisp program create a (long) list of derived statistics to test, write them to a file, and then have the RapidMiner process construct them automatically.
  7. Replace Missing Values -- This is the first of several data cleanup operators.  There shouldn't be any missing values in my data sets, but if there is, this operator replaces the missing value with the average over the rest of the data.
  8. Replace Infinite Values -- Some of the constructions in Step 6 can result in "infinite" values if (for example) they cause a divide by zero.  These two operators replace positive infinite values with 250 and negative infinite values with -250.
  9. Select Attributes -- The last operator in this process removes some attributes from our data.  In particular, we don't want to leave the scores in the data -- because the predictive model will (rightfully) use those to predict the MOV.  (The MOV itself is not a problem, because it has the "label" role.)  We also remove a couple of other attributes (like the team names) that would cause other problems.
So at the end of this process, we have read in the training data, processed it to contain all the attributes we want and none that we don't want, and cleaned up any inconsistencies in the data.

Process 2 is exactly the same as Process 1, except it is applied to the test data.  It's important to ensure that both the training data and the test data are treated identically.  If they aren't, you'll get misleading results or cause an error later in the process.  (I should point out that RapidMiner can bundle up a process into a sub-process and reuse it in multiple places, and that's probably what I should do here.)

The graphic above shows Process 3.  I've left in the "Select Attributes" from the end of Process 1 and Process 2 for context.  Here are the steps in Process 3:
  1. Linear Regression -- RapidMiner offers a wide variety of classification models that can be used for prediction.  In this case we're using a linear regression.  The input to this operator is the training data, and the output is a model.  This operator trains itself to predict the "label" attribute (MOV in our case) from the regular attributes.  The model it produces is a linear equation based upon the regular attributes.  In my process here, I'm training the model every time I run the process.  It's also possible to train the model once, save it, and re-use it every time you want to test or predict.  In my case, I tweak the data and/or process almost continuously, so it's easiest just to re-train every time.  There are about 15K records in the training data set, and the Linear Regression takes a couple of minutes on my laptop.  Other classification operators are much slower, and re-training each time is not feasible.
  2. Apply Model -- This operator applies the model from step 1 to the testing data from Process 2 and "labels" it -- that is, it adds a "prediction(mov)" attribute that has a predicted Margin of Victory for the game.
  3. Join -- This operator "joins" two data sets.  To do this, it finds records in the two data sets that have the same ID and then merges the attributes into a single record.  (Now we see why we need a unique ID!)  The two data sets being merged here are (1) the labeled data from the model, and (2) the original data from the Select Attributes operator.  Recall that the Select Attributes operator is used to remove unwanted attributes from the data, including the team names and scores.  So the labeled data coming out of the model does not have that information.  However, to evaluate our predictive performance we need the scores (so we can compare the actual outcome to the predicted outcome) and it would be nice to have team names and dates on the data as well.  So this Join operator puts those attributes back into our data.  In general, this is a useful technique for temporarily removing attributes from a data set.
At this point, we have a data set which consists of our test data with an added attribute "prediction(mov)"containing the predicted margin of victory for each game.  Next we want to see how well our model performed.

The graphic above shows Process 4.  I've left in the "Join" from Process 3 to make it clear where it connects.  Here are the steps to Process 4:
  1. Rename -- The first step is to rename the "prediction(mov)" attribute to "pred".  The parentheses in this name can confuse some later processing, so it's best just to remove them.
  2. Generate Attributes -- Next we generate a new attribute called "correct".  This attribute is 1 if we've correctly predicted the winner of the game, and 0 if not.  RapidMiner provides a powerful syntax for defining new attributes.  In this case, "correct" is defined as "if(sgn(mov)==sgn(pred),1,0)" -- if the sign of our predicted MOV is the same as the sign of the actual MOV, then we correctly predicted the winner.
  3. Write Excel -- At this point, I save the results to an Excel spreadsheet for later reference and processing (e.g., to produce the graphs seen here).
  4. Multiply -- I like to look at two different measures of performance, so I create two copies of the test data.  This isn't strictly necessary in this case (I could chain the two Performance operators) but this is another example of a useful general technique.
  5. Performance -- RapidMiner provides a powerful operator for measuring performance that can assess many different measures of error and correlation.  In the top use of Performance in this process, I use the built-in "root mean squared error" measure and apply it to the predicted MOV to calculate the RMSE error.
  6. Aggregate / Performance -- The second measure of performance I like to look at is how often I predicted the correct winner.  (I might prefer a model that predicts the correct winner more often even if it increases the RMSE of the predicted MOV.)  I want to know this number over the entire data set, so the first step is to Aggregate the "correct" attribute.  This produces a new attribute "sum(correct)" which is the number of correct predictions over the whole data set (and has the same value for every record in the data set).  This is then reported by the Performance operator as a performance measure.  The Performance operator isn't strictly necessary in this situation -- I could just report out the "sum(correct)" value -- but in general marking this as a measure of performance allows me to (for example) use the value to drive an optimization process (e.g., selecting a subset of attributes that maximizes the number of correct predictions).
And that's "all" there is to it.  One of the advantages of RapidMiner is that the graphical interface for building processes let's you quickly lay out a process, as well as easily modify it (such as switching the Linear Regression to an SVM, for example). 

No comments:

Post a Comment