## Friday, December 5, 2014

### Ensemble Classifiers

(Review:  In machine learning, a classifier is a model for classifying/predicting.  The Pain Machine uses a linear regression as its classifier.  In this series of posts, I'm discussing other potential approaches.  Previously I discussed using Auto-WEKA to automatically find a good classifier.)

An ensemble classifier is one that takes a number of other classifiers (called the "base classifiers") and aggregates their results to get a (hopefully) more accurate prediction.  Ensemble classifiers are popular because -- in certain cases -- the approach provides an easy way to improve accuracy.

To see why this might be so, imagine that we're trying to predict the winner of an NCAA basketball game.  We have three base classifiers, and they're each 70% accurate.  So if we use any one of these classifiers, we'll predict 30% of the games incorrectly.  Instead, let's make a prediction by taking a majority vote amongst the classifiers.  Now we'll only make the wrong prediction if at least two of the three classifiers are wrong.  If the classifiers' errors are independent, that happens with probability 3 × 0.3² × 0.7 + 0.3³ = 0.216, so the combined (ensemble) classifier is about 78% accurate.
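The arithmetic is easy to check with a few lines of Python.  This is just a sketch that assumes the classifiers' errors are fully independent; the function name `majority_vote_accuracy` is my own invention for illustration.

```python
from itertools import product

def majority_vote_accuracy(accuracies):
    """Probability that a strict majority of independent classifiers is correct."""
    n = len(accuracies)
    total = 0.0
    # Enumerate every right/wrong pattern across the classifiers.
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for correct, acc in zip(outcome, accuracies):
            prob *= acc if correct else 1.0 - acc
        # Count the pattern only if more than half the classifiers are right.
        if sum(outcome) > n / 2:
            total += prob
    return total

print(round(majority_vote_accuracy([0.7, 0.7, 0.7]), 3))  # 0.784
```

Three independent 70% classifiers come out at 78.4% under a majority vote.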

The figure below (taken from this article at Scholarpedia) illustrates this idea.

Each of the three classifiers across the top has mis-classified some examples -- the ones with the dark borders.  But since each classifier has made different errors, they all "wash out" in the final ensemble prediction.

If you're mathematically inclined, you might have already realized the "catch".  For this to work, the errors made by the base classifiers need to be independent.  If all the classifiers make the same mistakes, nothing will be gained by collecting them into an ensemble.  Many approaches have been suggested for creating independent classifiers.  Common ones include subsetting the training data and giving a different subset to each classifier, or subsetting the attributes and giving different attributes to each classifier.  You can also try creating a large number of classifiers and then searching among them to find independent ones.  (I will have more to say on that idea later.)  Unfortunately, many of these methods trade off accuracy of the base classifier for independence, and the end result is often that the ensemble does no better than the best base classifier.
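A quick simulation makes the catch concrete.  This is a toy sketch, not anything from the Pain Machine: it assumes three classifiers that are each 70% accurate and compares the two extremes, fully independent errors versus identical errors.  The function name `simulate` and the game count are arbitrary choices of mine.

```python
import random

def simulate(n_games, accuracy, independent, seed=0):
    """Return the majority-vote accuracy of three simulated classifiers."""
    rng = random.Random(seed)
    correct_votes = 0
    for _ in range(n_games):
        if independent:
            # Each classifier is right or wrong on its own coin flip.
            votes = [rng.random() < accuracy for _ in range(3)]
        else:
            # All three classifiers share one coin flip, so they err together.
            shared = rng.random() < accuracy
            votes = [shared] * 3
        if sum(votes) >= 2:
            correct_votes += 1
    return correct_votes / n_games

print(simulate(100_000, 0.7, independent=True))   # close to 0.784
print(simulate(100_000, 0.7, independent=False))  # close to 0.70
```

Real base classifiers fall somewhere between these two extremes, which is why the gain from an ensemble is usually smaller than the independent-case arithmetic suggests.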

At any rate, the experiment with Auto-WEKA provided me with a pool of good base classifiers, so I thought it would be worth a little effort to try combining them into an ensemble classifier.

There are a couple of options on how to combine the base classifiers.  The most straightforward is the voting scheme outlined above (for numeric predictions, this is averaging).  This often works well, but in many cases weighting the base classifiers can produce better predictions.  For example, instead of averaging the predictions of the base classifiers, we might want to take 80% of Classifier A, 15% of Classifier B and 5% of Classifier C.  How do we figure that out?  Well, we just create a new classifier -- one that takes the outputs of the base classifiers as inputs -- and produces a prediction.  We can then pick an appropriate model (say, a linear regression) and train that for the best results.
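To make the "classifier on top of classifiers" idea concrete, here is a minimal sketch of fitting a linear regression over base predictions using plain least squares.  This is not how WEKA does it internally; the helper names (`solve`, `fit_stacker`, `predict`) and the data are made up for illustration.  The targets are constructed as exactly 80% A + 15% B + 5% C so that the fit has a known answer to recover.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_stacker(base_predictions, targets):
    """Least-squares weights (plus intercept) over the base predictions."""
    rows = [list(p) + [1.0] for p in base_predictions]  # last column = intercept
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, prediction_row):
    """Weighted sum of the base predictions plus the intercept (last weight)."""
    return sum(w * p for w, p in zip(weights, prediction_row)) + weights[-1]

# Made-up base predictions (columns: classifiers A, B, C).
base = [(5, 2, -3), (-4, 1, 6), (10, 12, 3), (-7, -9, 0), (2, -1, 8), (0, 4, -5)]
targets = [0.8 * a + 0.15 * b + 0.05 * c for a, b, c in base]
weights = fit_stacker(base, targets)
print([round(w, 3) for w in weights])  # roughly 0.8, 0.15, 0.05 and a near-zero intercept
```

With real predictions the targets are noisy, so the fitted weights won't be this tidy, and they aren't constrained to be positive or to sum to one.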

This brings up an easily-overlooked complexity, namely, what data shall we use to train and test this new ensemble classifier?  For best results, we need training & test data for the ensemble classifier that is independent of the data we used to train and test the base classifiers.  If we re-use any of that data, we will probably overfit the ensemble classifier to that data.  This will often give us poor results on new data.  So to pursue an ensemble classifier, we need to partition our data into four sets:  training data for the base classifiers, test data for the base classifiers, training data for the ensemble classifier, and test data for the ensemble classifier.
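The four-way partition might look something like the sketch below.  The split fractions (50/20/20/10) are an arbitrary assumption of mine rather than the proportions I actually used, and `four_way_split` is a hypothetical helper.

```python
import random

def four_way_split(examples, fractions=(0.5, 0.2, 0.2, 0.1), seed=0):
    """Shuffle examples and cut them into four non-overlapping sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Turn the cumulative fractions into slice boundaries.
    cuts = []
    running = 0.0
    for f in fractions[:-1]:
        running += f
        cuts.append(round(running * n))
    bounds = [0] + cuts + [n]
    return [shuffled[bounds[i]:bounds[i + 1]] for i in range(4)]

# Stand-in game IDs; real data would be rows of game statistics.
base_train, base_test, ens_train, ens_test = four_way_split(range(1000))
print(len(base_train), len(base_test), len(ens_train), len(ens_test))  # 500 200 200 100
```

The obvious cost is that every set ends up smaller than it would be in a simple train/test split, so this only makes sense when you have plenty of data.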

The best classifiers found during the Auto-WEKA experiment were a linear regression, a Support Vector Machine (SVM), an M5P tree-based classifier and a PLS-based classifier.  So I combined these four into an ensemble and trained a linear regression ensemble classifier, with this result:

```
mov =
      0.1855 * weka.classifiers.functions.SMOreg-1 +
      0.2447 * weka.classifiers.functions.LinearRegression-2 +
      0.4786 * weka.classifiers.tree.M5P +
      0.1147 * weka.classifiers.function.PLSclassifier +
     -0.1154
```

So in this case, the final classifier uses about 19% of the SVM prediction, 24% of the linear regression, and so on.
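Written out as code, the fitted ensemble is just a weighted sum plus an intercept.  The coefficients below are copied from the result above; the function name `ensemble_mov` and the example inputs are hypothetical, and I'm assuming here that `mov` is the predicted margin of victory.

```python
def ensemble_mov(svm, linreg, m5p, pls):
    """Combine four base predictions using the fitted ensemble weights."""
    return (0.1855 * svm
            + 0.2447 * linreg
            + 0.4786 * m5p
            + 0.1147 * pls
            - 0.1154)

# Hypothetical base predictions for a single game (not real model output).
print(round(ensemble_mov(svm=6.0, linreg=4.5, m5p=5.0, pls=7.0), 2))  # 5.29
```

Note that the four weights add up to about 1.02 rather than exactly 1, since nothing in a linear regression forces them to sum to one.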

The ensemble classifier did perform better than the base classifiers -- but only by a tiny amount which was not statistically significant.  So at least in this case, there was no appreciable benefit.

WEKA also provides some built-in methods for creating ensembles.  It has meta-classifiers called "Random Subspace" and "Rotation Forest" that create ensembles of base classifiers by splitting up the attributes into independent subspaces (explained in more detail here and here).  These are somewhat more restrictive approaches because they build ensembles of only a single type of base classifier, but they're very easy to use.  In this case, using these meta-classifiers can improve the performance of the M5P and PLSclassifiers to be as good as the mixed ensemble.
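The attribute-splitting idea can be sketched in a few lines.  To be clear, this is not WEKA's implementation of Random Subspace, just the general notion of handing each base classifier its own random subset of attributes.  The helper `random_subspaces` and the attribute names are made up for illustration.

```python
import random

def random_subspaces(attributes, n_models, subspace_size, seed=0):
    """Pick a random attribute subset for each base model."""
    rng = random.Random(seed)
    # Each subset is drawn without replacement; different models may overlap.
    return [sorted(rng.sample(attributes, subspace_size)) for _ in range(n_models)]

# Hypothetical attribute names for a basketball model.
attributes = ["off_eff", "def_eff", "tempo", "rebound_pct", "turnover_pct", "ft_rate"]
for subset in random_subspaces(attributes, n_models=4, subspace_size=3):
    print(subset)
```

Each base classifier would then be trained only on its own subset, and the ensemble would combine their predictions by voting or averaging.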

Although these ensemble approaches didn't provide a significant benefit for the Pain Machine model, ensemble learning is a worthwhile concept to keep in your toolkit.  If nothing else, you should keep in mind that prediction diversity can be converted into accuracy, and be on the lookout for potential sources of diversity.