Wednesday, December 31, 2014

Some Miscellany

A few miscellaneous things I've experimented with recently.

At some point over the past few years I snagged a dataset which had the locations (latitude and longitude) of many of the Division 1 arenas.  I filled out the dataset and added in the date the arenas were constructed.  I also started grabbing the attendance numbers for games.  With the locations of all the team arenas, I can calculate the travel distance for Away teams.  (For Neutral site games, I just use a generic 500 mile travel distance for both teams.)  I was then able to feed all this data into the predictor.
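For the curious, the post doesn't spell out how the travel distance gets computed, but a great-circle (haversine) calculation over the arena coordinates is the obvious way to do it.  Here's a minimal Python sketch along those lines (the function name and the miles-based Earth radius are just illustrative choices):

    import math

    def travel_miles(lat1, lon1, lat2, lon2):
        """Great-circle (haversine) distance in miles between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 3959 * math.asin(math.sqrt(a))  # Earth radius is roughly 3959 miles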

There's a small negative correlation between travel distance and performance: teams that travel further generally perform slightly worse.  But the effect is weak, and it doesn't add anything significant to prediction accuracy.

Attendance numbers correlate positively with performance -- teams that draw lots of fans tend to do well.  Cause and effect are probably reversed here -- fans turn out to see good teams.  Again, there's little benefit to prediction accuracy from the attendance numbers, probably because that information is already captured in the other statistics that describe how good the team is.  (If attendance were a good predictor it would be problematic, because we usually don't know the attendance numbers until the game is over!)

There's also a weak positive correlation between the date an arena was built and how well the team performs.  This is probably because good teams get new arenas.  For example, after winning the National Championship, the Maryland basketball team moved out of the rickety on-campus gym and into a shiny new arena. 

On a different note, I also started calculating the variance of team statistics and using it for prediction.  Variance measures how much a variable fluctuates around its average.  Imagine two teams that both average 10 offensive rebounds per game.  Team A has gotten exactly 10 offensive rebounds in every game.  Team B has gotten 0 offensive rebounds in half its games and 20 in the other half.  Their averages are the same, but Team B has a much higher variance.
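To make that concrete, here's a quick Python sketch (with made-up rebound counts) showing the two teams sharing a mean but differing enormously in variance:

    import numpy as np

    # Hypothetical offensive rebound counts over a ten-game stretch.
    team_a = np.array([10] * 10)    # exactly 10 boards every game
    team_b = np.array([0, 20] * 5)  # 0 boards in half the games, 20 in the other half

    print(team_a.mean(), team_b.mean())  # both 10.0
    print(team_a.var(), team_b.var())    # 0.0 versus 100.0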

One of the obvious ways to use variance is to quantify "confidence".  We might be much more confident in our prediction about a team that has very little variance (i.e., performs very consistently) than in our prediction about a team that has a lot of variance.  I've had some success with this (particularly for predicting upsets in the NCAA tournament), but it's not an area where I've done a lot of experimentation yet.

But interestingly enough, it turns out that variance has some value as a direct predictor of performance.  In some cases, variance is a good thing -- a team with high variance in some statistic does better than a team with low variance.  In other cases, it's the opposite.  I've noticed this in the past with some statistics (like Trueskill) that produce a variance as part of their calculation, but I decided to calculate variance consistently for all the statistics and test it for predictive value.

The results were mixed.  Some of the variance calculations were statistically significant and were selected by the model.  On the other hand, they didn't significantly improve accuracy.  I ended up keeping a number of the variance statistics because they represent a different aspect of the data, and I hope that means they'll make the model more consistent overall.

Wednesday, December 24, 2014

Other Approaches to Boosting

In my previous post I talked about boosting.  Boosting tries to improve the accuracy of a model by increasing the importance of ("boosting") the training examples where the model has the most error.  Most machine learning toolkits include some form of boosting (often AdaBoost).  If you're not using a toolkit, it's pretty easy to implement a simple form of boosting manually.

To do this, train your classifier and then run it on all the examples in your training set.  You now have a training set with predicted outcomes and actual outcomes.  You can use this information to find the training examples with the most error.  If you're predicting the margin of victory (MOV) of a college basketball game, those are the training examples where the prediction differs most from the actual outcome.

(As an aside, this is most straightforwardly measured as the absolute value (or square) of the difference between the prediction and the actual outcome.  But you might also consider measuring this as a percentage error, if you think that being off by 1 point when you predict a 1 point victory is worse than being off 1 point when you predict a 20 point victory.)

Now you can take the worst predictions and duplicate them in the training set (so that they effectively count double) or even use them as the training set for a new predictor.  (Some machine learning models allow you to explicitly assign a weight to each training example.  In that case you can make them worth (say) 150%.) 

For my predictor, I have about 28,000 games in the training set.  I did an experiment where I took the worst 5600 (~20%) predictions and duplicated them in my training data.  Then I trained a new predictor on this "boosted" data and checked its accuracy.  I found a very small improvement in accuracy.
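Here's a rough sketch of that experiment in Python.  The DataFrame, the "mov" column name, the feature list, and the plain linear regression are stand-ins for illustration, not the predictor's actual code:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def boost_by_duplication(games: pd.DataFrame, features, frac=0.20):
        """Duplicate the worst-predicted games (largest absolute MOV error) and retrain."""
        X, y = games[features], games["mov"]
        base = LinearRegression().fit(X, y)

        # Find the ~20% of training games with the largest absolute error.
        errors = np.abs(y - base.predict(X))
        worst = games.loc[errors.nlargest(int(frac * len(games))).index]

        # "Boost" those games by appending a second copy, then retrain.
        boosted = pd.concat([games, worst], ignore_index=True)
        return LinearRegression().fit(boosted[features], boosted["mov"])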

Why doesn't this approach do more to boost accuracy for this problem?  I speculate that most of the games with large errors are games where "luck" skewed strongly towards one team.  Maybe one team shot 70% on their three-pointers, and the other team missed all their free throws.  Or one team just had a slightly bad night all-around and the other team had a slightly good night all-around.  By definition these sorts of events are unpredictable, so it doesn't benefit our model to try harder to predict them.  There's nothing useful in that data for the model to learn.

This insight suggests another approach.  If the worst predictions are in fact unpredictable, it might benefit us not to skew our training by trying to predict them.  We can do this by removing them from our training set (or de-weighting them if our model permits that).  I've never seen this approach suggested in the literature, so I don't have a name for it, but I'll call this "Deboosting."

We can create a deboosted training set in much the same way as a boosted set, except deleting rather than duplicating the worst predictions.  I did this and checked its accuracy.  Again, this shows only a slight improvement in accuracy.
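Continuing the sketch above (same imports and made-up column names), the deboosted set just drops the worst games instead of duplicating them:

    def deboost_by_removal(games, features, frac=0.20):
        """Drop the worst-predicted games entirely and retrain on what's left."""
        X, y = games[features], games["mov"]
        base = LinearRegression().fit(X, y)

        errors = np.abs(y - base.predict(X))
        worst_index = errors.nlargest(int(frac * len(games))).index

        trimmed = games.drop(index=worst_index)
        return LinearRegression().fit(trimmed[features], trimmed["mov"])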

So why doesn't this approach improve accuracy more?  I suspect the answer is that the errors in the worst predictions are essentially random -- some in one direction, some in the other.  The games are unpredictable, but there's little net effect on the model when we train with these examples, because the errors tend to cancel each other out.  So removing them doesn't have as much impact as we might naively expect.

You can also consider doing similar things with the best predictions.  Removing the best predictions is a form of boosting (because it effectively increases the weight of the worst predictions) and duplicating the best predictions is likewise a form of deboosting.  You can also experiment with combinations, such as removing the worst examples and duplicating the best examples.

Finally, you can experiment with boosting approaches that make use of your knowledge of the problem domain and/or of attribute values to select the examples to boost.  As a simple example, maybe accuracy is improved by boosting the value of all non-conference road games.  Or by boosting examples where one team is much stronger than the other team (or conversely, where teams are very evenly matched).  There's a lot of room for creative speculation!

Tuesday, December 9, 2014

Other Meta-Classifier Approaches

In previous posts, I've talked about hyper-parameter tuning and ensemble learning.  The appeal of these techniques is to get something for nothing.  Without any real effort, they can (in theory) turn your inaccurate models into more accurate models.  Hyper-parameter tuning and ensemble learning aren't the only approaches to improve the accuracy of a base classifier.  Another commonly used approach is called "boosting."

Imagine that you have a model that predicts the final margin of victory (MOV) of a college basketball game.  If you ran this model on a bunch of training examples, you could calculate the error for each example (the actual MOV minus the predicted MOV).  You could then try to train a second model to predict that error.  And if you could do that, your combined prediction (the predicted MOV from the first model + the predicted error from the second model) would be better than your first model.  This idea is known as gradient boosting (or additive regression).

However, to be useful, there has to be something for the gradient boosting to find in the errors from the first model.  To understand why this is important, consider using gradient boosting with linear regression.  The linear regression algorithm produces the best linear fit to the data; no other linear solution will yield smaller (squared) errors.  The leftover residuals are uncorrelated with the inputs, so if you try to fit a second linear regression to those errors, you get no improvement over the original model.  (Because no further linear improvement is possible.)

There are two ways to overcome this limitation.  The first is to use a weaker base model.  That may seem non-intuitive, but it has the nice property of automatically turning a weak model into a stronger one.  (And in fact gradient boosting is usually applied to tree models for just this reason.)  The second approach is to use a different type of model for the second model.  For example, we could build a linear regression in the first step, and then a tree model in the second step.  The tree model could pick up some non-linear information in the errors that the linear regression would miss.
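Here's a small Python sketch of that second approach: a linear regression first, then a shallow decision tree fit to its residuals.  The scikit-learn models and the generic X/y inputs are illustrative choices, not the Pain Machine's actual setup:

    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    def fit_residual_boosted(X, y, max_depth=3):
        """Stage one: linear regression.  Stage two: a tree trained on the leftover errors."""
        stage1 = LinearRegression().fit(X, y)
        residuals = y - stage1.predict(X)

        # The tree can pick up non-linear structure the linear fit missed.
        stage2 = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)

        def predict(X_new):
            return stage1.predict(X_new) + stage2.predict(X_new)
        return predict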

Gradient boosting is just one form of boosting.  In general, "boosting" methods try to improve accuracy by boosting the importance of training examples that have errors.  One approach is to weight the training examples, create a base classifier, find the training examples where the base classifier is wrong, increase the weight of those examples and create a new classifier.  The idea is that by forcing the base classifier to pay more attention to the incorrect examples, it will become more accurate on those examples.  A popular version of this approach is called adaptive boosting or AdaBoost and has proven to be very effective for a wide range of problems.

Machine learning toolkits like WEKA and RapidMiner have various forms of boosting built in, including both additive regression and AdaBoost, but the idea is straightforward enough that it's not difficult to implement yourself if you need more control over the approach.  In the case of the Pain Machine, I haven't found a boosting approach that improves the accuracy of the base linear regression, but I have found that some tree approaches can be boosted to an accuracy close to that of the linear regression.  I suspect that boosting is less useful for problem domains with large amounts of noisy data, or for models that are already close to the best possible performance, but that's just speculation on my part.

Friday, December 5, 2014

Ensemble Classifiers

(Review:  In machine learning, a classifier is a model for classifying/predicting.  The Pain Machine uses a linear regression as its classifier.  In this series of posts, I'm discussing other potential approaches.  Previously I discussed using Auto-WEKA to automatically find a good classifier.)

An ensemble classifier is one that takes a number of other classifiers (called the "base classifiers") and aggregates their results to get a (hopefully) more accurate prediction.  Ensemble classifiers are popular because -- in certain cases -- the approach provides an easy way to improve accuracy.

To see why this might be so, imagine that we're trying to predict the winner of an NCAA basketball game.  We have three base classifiers, and they're each 70% accurate.  So if we use any one of these classifiers, we'll predict 30% of the games incorrectly.  Instead, let's make a prediction by taking a majority vote amongst the classifiers.  Now we'll only make the wrong prediction if at least two of the three classifiers are wrong.  If the three classifiers make their mistakes independently, that works out to a combined (ensemble) classifier that is about 78% accurate.
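The arithmetic, assuming the three classifiers err independently:

    p = 0.7
    # The majority vote is right when at least two of the three classifiers are right.
    ensemble_accuracy = 3 * p**2 * (1 - p) + p**3
    print(ensemble_accuracy)  # about 0.78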

The figure below (taken from this article at Scholarpedia) illustrates this idea.


Each of the three classifiers across the top has mis-classified some examples -- the ones with the dark borders.  But since each classifier has made different errors, they all "wash out" in the final ensemble prediction.

If you're mathematically inclined, you might have already realized the "catch".  For this to work, the errors made by the base classifiers need to be independent.  If all the classifiers make the same mistakes, nothing will be gained by collecting them into an ensemble.  Many approaches have been suggested for creating independent classifiers.  Common ones include subsetting the training data and giving a different subset to each classifier, or subsetting the attributes and giving different attributes to each classifier.  You can also try creating a large number of classifiers, and then search them to find independent ones.  (I will have more to say on that idea later.)  Unfortunately, many of these methods trade off accuracy of the base classifier for independence, and the end result is often that the ensemble does no better than the best base classifier.

At any rate, the experiment with Auto-WEKA provided me with a pool of good base classifiers, so I thought it would be worth a little effort to try combining them into an ensemble classifier.

There are a couple of options for how to combine the base classifiers.  The most straightforward is the voting scheme outlined above (for numeric predictions, this is averaging).  This often works well, but in many cases weighting the base classifiers can produce better predictions.  For example, instead of averaging the predictions of the base classifiers, we might want to take 80% of Classifier A, 15% of Classifier B and 5% of Classifier C.  How do we figure that out?  Well, we just create a new classifier -- one that takes the outputs of the base classifiers as inputs and produces a prediction.  We can then pick an appropriate model (say, a linear regression) and train that for the best results.

This brings up an easily-overlooked complexity, namely, what data shall we use to train and test this new ensemble classifier?  For best results, we need training & test data for the ensemble classifier that is independent of the data we used to train and test the base classifiers.  If we re-use any of that data, we will probably overfit the ensemble classifier to that data.  This will often give us poor results on new data.  So to pursue an ensemble classifier, we need to partition our data into four sets:  training data for the base classifiers, test data for the base classifiers, training data for the ensemble classifier, and test data for the ensemble classifier.
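Here's a bare-bones Python sketch of that stacking setup.  To keep it short, it only splits the data into a base-model half and an ensemble half (leaving out the two test sets), and the scikit-learn models are placeholders rather than the classifiers Auto-WEKA actually picked:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    def fit_stacked_ensemble(X, y):
        """Train base models and a linear 'combiner' on separate slices of the data."""
        # The ensemble stage never sees the games the base models trained on.
        X_base, X_ens, y_base, y_ens = train_test_split(X, y, test_size=0.5, random_state=0)

        base_models = [LinearRegression(), SVR(), DecisionTreeRegressor(max_depth=5)]
        for m in base_models:
            m.fit(X_base, y_base)

        # The combiner's inputs are the base models' predictions.
        meta_X = np.column_stack([m.predict(X_ens) for m in base_models])
        combiner = LinearRegression().fit(meta_X, y_ens)

        def predict(X_new):
            features = np.column_stack([m.predict(X_new) for m in base_models])
            return combiner.predict(features)
        return predict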

The best classifiers found during the Auto-WEKA experiment were a linear regression, a Support Vector Machine (SVM), an M5P tree-based classifier and a PLS-based classifier.  So I combined these four into an ensemble and trained a linear regression ensemble classifier, with this result:
mov =
      0.1855 * weka.classifiers.functions.SMOreg-1 +
      0.2447 * weka.classifiers.functions.LinearRegression-2 +
      0.4786 * weka.classifiers.trees.M5P +
      0.1147 * weka.classifiers.functions.PLSClassifier +
     -0.1154
So in this case, the final classifier uses 18% of the SVM prediction, 25% of the linear regression, and so on.

The ensemble classifier did perform better than the base classifiers -- but only by a tiny amount which was not statistically significant.  So at least in this case, there was no appreciable benefit.

WEKA also provides some built-in methods for creating ensembles.  It has meta-classifiers called "Random Subspace" and "Rotation Forest" that create ensembles of base classifiers by splitting up the attributes into independent subspaces (explained in more detail here and here).  These are somewhat more restrictive approaches because they build ensembles of only a single type of base classifier, but they're very easy to use.  In this case, using these meta-classifiers can improve the performance of the M5P and PLS classifiers to be as good as the mixed ensemble.

Although these ensemble approaches didn't provide a significant benefit for the Pain Machine model, ensemble learning is a worthwhile concept to keep in your toolkit.  If nothing else, keep in mind that prediction diversity can be converted into accuracy, and be on the lookout for potential sources of diversity.

Thursday, December 4, 2014

Least Frightening Div I Team Names & Other Diversions

In alphabetical order:
Centenary Gentlemen
UMKC Kangaroos
Pennsylvania Quakers
Presbyterian Blue Hose
South Dakota State Jackrabbits
St. Peter's Peacocks
UC Irvine Anteaters
Vanderbilt Commodores (feat. Lionel Richie)
Youngstown St. Penguins
It's always a blood bath when the Quakers take on the Gentlemen.

BONUS CONTENT:  First round matchups we'd like to see
The Campbell Fighting Camels vs. the Delaware Fighting Blue Hens
The Kent St. Golden Flashes vs. the St. Francis (PA) Red Flash
The Furman Paladins vs. the Northwestern St. Demons
The Loyola-Maryland Greyhounds vs. the Marist Red Foxes
The McNeese St. Cowboys vs. the Marshall Thundering Herd
The North Dakota Fighting Sioux vs. the North Dakota State Bison
The Northern Arizona Lumberjacks vs. the Indiana St. Sycamores
That is all.