Saturday, November 29, 2014

Auto-WEKA

As I've mentioned before, the Pain Machine uses a linear regression for prediction.  Is that the best model for this problem?  One of the challenges of machine learning is that there are many different algorithms, and each of these typically has several parameters that control its performance.  For example, if you're trying to predict Margin of Victory (MOV), you could use a linear regression, a polynomial regression, a local linear regression, a Support Vector Machine (SVM), and so on.  There's no straightforward way to know which of these will be best, and likewise no easy approach to tuning each algorithm to find its best performance.

Auto-WEKA is a software package from some smart folks at the University of British Columbia that tries to address this problem.  It applies something called "Sequential Model-based Algorithm Configuration" to a machine learning problem to try to find the best Weka algorithm for solving that problem.  (Weka is a popular open source suite of machine learning software written in Java at the University of Waikato, New Zealand.)  Auto-WEKA tries to be smart about picking parameters for the Weka algorithms it tries so that it can home in on the "best" parameters relatively quickly.

Regardless of how efficient Auto-WEKA is at finding the best parameters, it's very handy simply because it automates the work of trying a bunch of different algorithms on a problem.  At the simplest, you just point it at a data file and let it work, but I'll illustrate a more complex example.

First of all, you need to get Auto-WEKA from here and unpack it on your system.  It's self-contained and written in Java, so as long as you have Java installed you should be able to run it on just about any sort of machine.

Second, you need to get your data in a format that Auto-WEKA (and Weka) likes.  This is the Attribute-Relation File Format (ARFF).  For me, the easiest way to do this was using RapidMiner.  RapidMiner has operators to read and write data in various formats, so I put together a simple process to read my data (in CSV format) and write it out in ARFF.  One caveat is that Weka expects the label (that is, the data we're trying to predict) to be the last column in the ARFF data.  This is something of a standard, so RapidMiner takes care of this when writing out the ARFF format.  Here's what my RapidMiner process looks like:


The Pre-Process operator is responsible for reading in and labeling my data.  The Sample operator is in there because my data set is very large, so I down-sample it so that Auto-WEKA doesn't take forever to make progress.  (I used about 5000 games.)
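For readers without RapidMiner, the same read, sample, and write steps can be sketched in a few lines of Python.  This is only an illustration, not the RapidMiner process described above: the function name csv_to_arff, the relation name, and the column names in the example are all made up, and the sketch assumes every column is numeric and that attribute names contain no spaces or commas.

```python
import csv
import io
import random


def csv_to_arff(csv_text, relation, label, sample_size=None, seed=0):
    """Convert CSV text to ARFF text with the label column placed last.

    Assumes a header row, numeric values in every column, and attribute
    names without spaces or commas (a simplification for this sketch).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    if label not in header:
        raise ValueError(f"label column {label!r} not found in header")

    # Optionally down-sample the rows so the search has less data to chew on.
    if sample_size is not None and sample_size < len(data):
        data = random.Random(seed).sample(data, sample_size)

    # Reorder the columns so the label becomes the final attribute.
    label_index = header.index(label)
    order = [i for i in range(len(header)) if i != label_index]
    order.append(label_index)

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {header[i]} numeric" for i in order]
    lines += ["", "@data"]
    lines += [",".join(row[i] for i in order) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical example data: two ratings and the margin of victory.
sample = "home_rating,mov,away_rating\n1.5,7,0.9\n0.2,-3,1.1\n"
print(csv_to_arff(sample, relation="games", label="mov"))
```

Passing something like sample_size=5000 would mimic the down-sampling step.  Fixing the seed keeps the sample reproducible from run to run.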

With the data in hand, we start Auto-WEKA up and are presented with the startup screen:



The "Wizard" option does about what you'd expect -- you point it at your data file and it does the rest.  For more control, you can use the Experiment Builder.  This opens a window with three sequential tabs.  In the first tab, you open the data file.  In the second tab, you pick the classifiers (and meta-classifiers) you'd like to try in your experiment:


Here I've picked some tree-based classifiers as well as a few of the meta-classifiers.  By default, Auto-WEKA will use all the classifiers suitable for your data set.  The third tab allows you to control the experiment settings, such as how long to run:


The defaults here are reasonable, but depending upon your dataset you might want to allocate more memory and time.

With all that done, your experiment is constructed and you can now run it.


(Note that you supply a random seed for the run.  You can do multiple runs in parallel, using different random seeds, to speed the overall search.)
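To see why different seeds help, here is a toy Python sketch of the idea: several independent searches, each driven by its own seed, with the best result kept at the end.  Everything in it is hypothetical.  The toy_error function stands in for training and scoring a classifier, the "depth" and "rate" parameters are invented, and the search is plain random search rather than the model-based search Auto-WEKA actually uses.  Threads are used only to keep the sketch short; real Auto-WEKA runs would be separate processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def toy_error(config):
    # Hypothetical stand-in for training and evaluating a classifier.
    return (config["depth"] - 7) ** 2 + abs(config["rate"] - 0.3)


def search(seed, trials=50):
    """Run one independent random search and return (error, config)."""
    rng = random.Random(seed)  # each run explores a different sequence
    best = None
    for _ in range(trials):
        config = {"depth": rng.randint(1, 20), "rate": rng.random()}
        err = toy_error(config)
        if best is None or err < best[0]:
            best = (err, config)
    return best


def best_over_seeds(seeds):
    """Run one search per seed concurrently and keep the lowest error."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(search, seeds))
    return min(results, key=lambda result: result[0])


error, config = best_over_seeds([1, 2, 3, 4])
print(error, config)
```

Because each seed explores a different part of the search space, the best result over several runs can only match or beat the result of any single run.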

Now you sit back and wait ...  for days.  On a large dataset like mine, training classifiers can take many hours.  And Auto-WEKA will try different classifiers and different combinations of parameters, so it will make many runs.  In truth, you can spend about as much CPU time as you can find.

The good news is that with about a week of CPU time, Auto-WEKA was able to identify several algorithms that performed as well as linear regression.  The combinations it found involved base classifiers combined with meta-classifiers, using parameter settings quite different from the defaults.  It's unlikely that I could have found any of these experimenting on my own.  The bad news is that none of these significantly outperformed linear regression, so my overall performance did not improve.

Overall, Auto-WEKA is a tool that could be very useful in the early stages of building a predictor by quickly identifying promising classifier configurations.  And certainly you have nothing to lose by simply letting it loose on the data to see what it can find.

Thursday, November 20, 2014

A New Season

Well, the new college basketball season is upon us.  My early predictions indicate that the Kentucky Blue Platoon is going to play the Kentucky White Platoon in the Championship game, but that prediction might evolve as we see more games.

In the meantime, I have been working on the Net Prophet model (for new readers of the blog - ha, as if! - sometimes known as the Pain Machine or PM).  Traditionally the PM has used a linear regression as its predictive model, but lately I have been looking at other possibilities.  In coming blog posts I'll detail some of these things.

I'm also hoping to get my act together enough to get the PM listed on the Prediction Tracker page for basketball predictions.  I meant to do this last year but never quite got around to doing it.  But this year for sure :-).