Wednesday, December 31, 2014

Some Miscellany

A few miscellaneous things I've experimented with recently.

At some point over the past few years I snagged a dataset which had the locations (latitude and longitude) of many of the Division 1 arenas.  I filled out the dataset and added in the date the arenas were constructed.  I also started grabbing the attendance numbers for games.  With the locations of all the team arenas, I can calculate the travel distance for Away teams.  (For Neutral site games, I just use a generic 500 mile travel distance for both teams.)  I was then able to feed all this data into the predictor.
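
(For the curious, the travel distance is just the great-circle distance between arenas.  Here's a minimal sketch of that calculation; the function and constant names are mine, and only the 500-mile neutral-site default comes from the description above.)

    from math import radians, sin, cos, asin, sqrt

    EARTH_RADIUS_MILES = 3959.0
    NEUTRAL_SITE_MILES = 500.0   # flat distance used for both teams at neutral sites

    def travel_miles(away_lat, away_lon, arena_lat, arena_lon):
        """Great-circle (haversine) distance from the away team's arena to the game arena."""
        lat1, lon1, lat2, lon2 = map(radians, (away_lat, away_lon, arena_lat, arena_lon))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))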

There's a small correlation between travel distance and performance; teams that travel further generally perform more poorly.  But the effect is small, and the impact upon prediction accuracy is not significant.

Attendance numbers correlate positively with performance -- teams that get lots of fans tend to do well.  Cause and effect are probably reversed here -- fans turn out to see good teams.  Again, there's little benefit to prediction accuracy from the attendance numbers, probably because that information is already captured in the other statistics that describe how good the team is.  (If attendance was a good predictor it would be problematic, because we don't usually know the attendance numbers until the game is over!)

There's also a weak positive correlation between the date an arena was built and how well the team performs.  This is probably because good teams get new arenas.  For example, after winning the National Championship, the Maryland basketball team moved out of the rickety on-campus gym and into a shiny new arena. 

On a different note, I also started calculating the variance for team statistics and using them for prediction.  Variance measures how much a statistic spreads out from game to game.  Imagine two teams that both average 10 offensive rebounds per game.  Team A has gotten exactly 10 offensive rebounds in every game.  Team B has gotten 0 offensive rebounds in half its games and 20 in the other half.  Their averages are the same, but Team B has a much higher variance.
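
Worked out as a toy example (purely illustrative, not the predictor's actual code):

    from statistics import mean, pvariance

    team_a = [10] * 10        # exactly 10 offensive rebounds in every game
    team_b = [0, 20] * 5      # 0 in half the games, 20 in the other half

    print(mean(team_a), pvariance(team_a))   # 10 0
    print(mean(team_b), pvariance(team_b))   # 10 100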

One of the obvious ways to use variance is to quantify "confidence".  We might be much more confident in our prediction about a team that has very little variance (i.e., performs very consistently) than our prediction about a team that has a lot of variance.  I've had some success with this (particularly for predicting upsets in the NCAA tournament) but it's not an area where I've yet done a lot of experimentation.

But interestingly enough, it turns out that variance has some value as a direct predictor of performance.  In some cases, variance is a good thing -- a team with high variance in some statistic does better than a team with low variance.  In other cases, it's the opposite.  I've noticed this in the past with some statistics (like Trueskill) that produce a variance as part of their calculation, but I decided to calculate variance consistently for all the statistics and test it for predictive value.

The results were mixed.  Some of the variance calculations were statistically significant and were selected by the model.  On the other hand, they didn't significantly improve accuracy.  I ended up keeping a number of the variance statistics because they represent a different aspect of the data, and I hope that means they'll make the model more consistent overall.

Wednesday, December 24, 2014

Other Approaches to Boosting

In my previous post I talked about boosting.  Boosting tries to improve the accuracy of a model by increasing the importance of ("boosting") the training examples where the model has the most error.  Most machine learning toolkits include some form of boosting (often AdaBoost).  If you're not using a toolkit, it's pretty easy to manually implement a simple form of boosting.

To do this, train your classifier and then run it on all the examples of your training set.  You now have a training set with predicted outcomes and actual outcomes.  You can use this information to find the training examples with the most error.  If you're predicting the margin of victory (MOV) of a college basketball game, those are the training examples where the prediction is most different from the actual outcome.

(As an aside, this is most straightforwardly measured as the absolute value (or square) of the difference between the prediction and the actual outcome.  But you might also consider measuring this as a percentage error, if you think that being off by 1 point when you predict a 1 point victory is worse than being off 1 point when you predict a 20 point victory.)

Now you can take the worst predictions and duplicate them in the training set (so that they effectively count double) or even use them as the training set for a new predictor.  (Some machine learning models allow you to explicitly assign a weight to each training example.  In that case you can make them worth (say) 150%.) 
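
Here's a rough sketch of that recipe for any model with the usual fit/predict interface -- the function name and the 20% cutoff are just placeholders:

    import numpy as np

    def boost_training_set(model, X, y, frac=0.20):
        """Duplicate the worst-predicted games so they count double when retraining.
        To "deboost" instead, delete the rows in `worst` rather than duplicating them."""
        model.fit(X, y)
        errors = np.abs(model.predict(X) - y)            # absolute MOV error per game
        worst = np.argsort(errors)[-int(frac * len(y)):]  # indices of the largest errors
        return np.concatenate([X, X[worst]]), np.concatenate([y, y[worst]])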

For my predictor, I have about 28,000 games in the training set.  I did an experiment where I took the worst 5600 (~20%) predictions and duplicated them in my training data.  Then I trained a new predictor on this "boosted" data and checked its accuracy.  I found a very small improvement in accuracy.

Why doesn't this approach boost accuracy for this problem?  I speculate that most of the games with large errors are games where "luck" skewed strongly towards one team.  Maybe one team shot 70% on their three-pointers, and the other team missed all their free throws.  Or one team just had a slightly bad night all-around and the other team had a slightly good night all-around.  By definition these sorts of events are unpredictable, so it doesn't benefit our model to try harder to predict them.  There's nothing useful in that data for the model to learn.

This insight suggests another approach.  If the worst predictions are in fact unpredictable, it might benefit us not to skew our training by trying to predict them.  We can do this by removing them from our training set (or de-weighting them if our model permits that).  I've never seen this approach suggested in the literature, so I don't have a name for it, but I'll call this "Deboosting."

We can create a deboosted training set in much the same way as a boosted set, except deleting rather than duplicating the worst predictions.  I did this and checked its accuracy.  Again, this shows only a slight improvement in accuracy.

So why doesn't this approach improve accuracy?  I suspect the answer is that the errors in the worst predictions are essentially random -- some in one direction, some in another.  So the games are unpredictable, but there's little net effect on the model when we train with these examples, because they tend to cancel each other out.  So removing them doesn't have as much impact as we might naively expect.

You can also consider doing similar things with the best predictions.  Removing the best predictions is a form of boosting (because it effectively increases the weight of the worst predictions) and duplicating the best predictions is likewise a form of deboosting.  You can also experiment with combinations, such as removing the worst examples and duplicating the best examples.

Finally, you can experiment with boosting approaches that make use of your knowledge of the problem domain and/or of attribute values to select the examples to boost.  As a simple example, maybe accuracy is improved by boosting the value of all non-conference road games.  Or by boosting examples where one team is much stronger than the other team (or conversely, where teams are very evenly matched).  There's a lot of room for creative speculation!

Tuesday, December 9, 2014

Other Meta-Classifier Approaches

In previous posts, I've talked about hyper-parameter tuning and ensemble learning.  The appeal of these techniques is to get something for nothing.  Without any real effort, they can (in theory) turn your inaccurate models into more accurate models.  Hyper-parameter tuning and ensemble learning aren't the only approaches to improve the accuracy of a base classifier.  Another commonly used approach is called "boosting."

Imagine that you have a model that predicts the final margin of victory (MOV) of a college basketball game.  If you ran this model on a bunch of training examples, you could calculate the error for each example (the actual MOV minus the predicted MOV).  You could then try to train a second model to predict that error.  And if you could do that, your combined prediction (the predicted MOV from the first model + the predicted error from the second model) would be better than your first model.  This idea is known as gradient boosting (or additive regression).

However, to be useful, there has to be something that the gradient boosting can find in the errors from the first model.  To understand why this is important, consider using gradient boosting with linear regression.  The linear regression algorithm produces the best linear fit to the data -- no other linear solution will yield smaller (squared) errors.  So if you try to fit a second linear regression to the errors that are left behind, you get no improvement over the original model.  (Because no improvement is possible.)

There are two ways to overcome this limitation.  The first is to use a weaker base model.  That may seem non-intuitive, but it has the nice property of automatically turning a weak model into a stronger one.  (And in fact gradient boosting is usually applied to tree models for just this reason.)  The second approach is to use a different type of model for the second model.  For example, we could build a linear regression in the first step, and then a tree model in the second step.  The tree model could pick up some non-linear information in the errors that the linear regression would miss.
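
Here's a minimal sketch of that last idea using scikit-learn; the models and settings are placeholders, not what the Pain Machine actually uses:

    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    def fit_two_stage(X, y):
        """Linear regression first, then a small tree fit to its residual errors."""
        base = LinearRegression().fit(X, y)
        corrector = DecisionTreeRegressor(max_depth=3).fit(X, y - base.predict(X))
        return base, corrector

    def predict_two_stage(base, corrector, X):
        # Combined prediction = predicted MOV + predicted error.
        return base.predict(X) + corrector.predict(X)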

Gradient boosting is just one form of boosting.  In general, "boosting" methods try to improve accuracy by boosting the importance of training examples that have errors.  One approach is to weight the training examples, create a base classifier, find the training examples where the base classifier is wrong, increase the weight of those examples and create a new classifier.  The idea is that by forcing the base classifier to pay more attention to the incorrect examples, it will become more accurate on those examples.  A popular version of this approach is called adaptive boosting or AdaBoost and has proven to be very effective for a wide range of problems.

Machine learning toolkits like WEKA and RapidMiner have various forms of boosting built in, including both additive regression and AdaBoost, but the idea is straightforward enough that it's not difficult to implement yourself if you need more control over the approach.  In the case of the Pain Machine, I haven't found a boosting approach that improves the accuracy of the base linear regression, but I have found that some tree approaches can be boosted to an accuracy close to that of the linear regression.  I suspect that boosting is less useful for problem domains with large amounts of noisy data, or for models that are already close to the best possible performance, but that's just speculation on my part.

Friday, December 5, 2014

Ensemble Classifiers

(Review:  In machine learning, a classifier is a model for classifying/predicting.  The Pain Machine uses a linear regression as its classifier.  In this series of posts, I'm discussing other potential approaches.  Previously I discussed using AutoWEKA to automatically find a good classifier.)

An ensemble classifier is one that takes a number of other classifiers (called the "base classifiers") and aggregates their results to get a (hopefully) more accurate prediction.  Ensemble classifiers are popular because -- in certain cases -- the approach provides an easy way to improve accuracy.

To see why this might be so, imagine that we're trying to predict the winner of an NCAA basketball game.  We have three base classifiers, and they're each 70% accurate.  So if we use any one of these classifiers, we'll predict 30% of the games incorrectly.  Instead, let's make a prediction by taking a majority vote amongst the classifiers.  Now we'll only make the wrong prediction if at least two of the three classifiers are wrong, which works out to a combined (ensemble) classifier that is about 78% accurate.
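
Here's the quick arithmetic check:

    p = 0.70
    ensemble = p**3 + 3 * p**2 * (1 - p)   # all three correct, or exactly two correct
    print(ensemble)                        # -> about 0.78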

The figure below (taken from this article at Scholarpedia) illustrates this idea.


Each of the three classifiers across the top has mis-classified some examples -- the ones with the dark borders.  But since each classifier has made different errors, they all "wash out" in the final ensemble prediction.

If you're mathematically inclined, you might have already realized the "catch".  For this to work, the errors made by the base classifiers need to be independent.  If all the classifiers make the same mistakes, nothing will be gained by collecting them into an ensemble.  Many approaches have been suggested for creating independent classifiers.  Common ones include subsetting the training data and giving a different subset to each classifier, or subsetting the attributes and giving different attributes to each classifier.  You can also try creating a large number of classifiers, and then search them to find independent ones.  (I will have more to say on that idea later.)  Unfortunately, many of these methods trade off accuracy of the base classifier for independence, and the end result is often that the ensemble does no better than the best base classifier.

At any rate, the experiment with Auto-WEKA provided me with a pool of good base classifiers, so I thought it would be worth a little effort to try combining them into an ensemble classifier.

There are a couple of options on how to combine the base classifiers.  The most straightforward is the voting scheme outlined above (for numeric predictions, this is averaging).  This often works well, but in many cases weighting the base classifiers can produce better predictions.  For example, instead of averaging the predictions of the base classifiers, we might want to take 80% of Classifier A, 15% of Classifier B and 5% of Classifier C.  How do we figure that out?  Well, we just create a new classifier -- one that takes the outputs of the base classifiers as inputs and produces a prediction.  We can then pick an appropriate model (say, a linear regression) and train that for the best results.

This brings up an easily-overlooked complexity, namely, what data shall we use to train and test this new ensemble classifier?  For best results, we need training & test data for the ensemble classifier that is independent of the data we used to train and test the base classifiers.  If we re-use any of that data, we will probably overfit the ensemble classifier to that data.  This will often give us poor results on new data.  So to pursue an ensemble classifier, we need to partition our data into four sets:  training data for the base classifiers, test data for the base classifiers, training data for the ensemble classifier, and test data for the ensemble classifier.
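
A quick sketch of that four-way split (the equal split is arbitrary; in practice you'd probably give the base classifiers the lion's share of the data):

    import numpy as np

    def four_way_split(n_games, seed=0):
        """Shuffle the game indices and split them into the four independent sets."""
        idx = np.random.default_rng(seed).permutation(n_games)
        return np.array_split(idx, 4)   # base-train, base-test, ensemble-train, ensemble-test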

The best classifiers found during the Auto-WEKA experiment were a linear regression, a Support Vector Machine (SVM), an M5P tree-based classifier and a PLS-based classifier.  So I combined these four into an ensemble and trained a linear regression ensemble classifier, with this result:
mov =
      0.1855 * weka.classifiers.functions.SMOreg-1 +
      0.2447 * weka.classifiers.functions.LinearRegression-2 +
      0.4786 * weka.classifiers.trees.M5P +
      0.1147 * weka.classifiers.functions.PLSClassifier +
     -0.1154
So in this case, the final classifier uses about 19% of the SVM prediction, 24% of the linear regression, and so on.

The ensemble classifier did perform better than the base classifiers -- but only by a tiny amount which was not statistically significant.  So at least in this case, there was no appreciable benefit.

WEKA also provides some built-in methods for creating ensembles.  It has meta-classifiers called "Random Subspace" and "Rotation Forest" that create ensembles of base classifiers by splitting up the attributes into independent subspaces (explained in more detail here and here).  These are somewhat more restrictive approaches because they build ensembles of only a single type of base classifier, but they're very easy to use.  In this case, using these meta-classifiers can improve the performance of the M5P and PLS classifiers to be as good as the mixed ensemble.

Although these ensemble approaches didn't provide a significant benefit for the Pain Machine model, ensemble learning is a worthwhile concept to keep in your toolkit.  If nothing else, you should keep in mind that prediction diversity can be converted into accuracy, and be on the lookout for potential sources of diversity.

Thursday, December 4, 2014

Least Frightening Div I Team Names & Other Diversions

In alphabetical order:
Centenary Gentlemen
UMKC Kangaroos
Pennsylvania Quakers
Presbyterian Blue Hose
South Dakota State Jackrabbits
St. Peter's Peacocks
UC Irvine Anteaters
Vanderbilt Commodores (feat. Lionel Richie)
Youngstown St. Penguins
It's always a blood bath when the Quakers take on the Gentlemen.

BONUS CONTENT:  First round matchups we'd like to see
The Campbell Fighting Camels vs. the Delaware Fighting Blue Hens
The Kent St. Golden Flashes vs. the St. Francis(PA) Red Flash
The Furman Paladins vs. the Northwestern St. Demons
The Loyola-Maryland Greyhounds vs. the Marist Red Foxes
The McNeese St. Cowboys vs. the Marshall Thundering Herd
The North Dakota Fighting Sioux vs. the North Dakota State Bison
The Northern Arizona Lumberjacks vs. the Indiana St. Sycamores
That is all.

Saturday, November 29, 2014

Auto-WEKA

As I've mentioned before, the Pain Machine uses a linear regression for prediction.  Is that the best model for this problem?  One of the challenges of machine learning is that there are a number of different machine learning algorithms, and each of these typically has several parameters that control its performance.  For example, if you're trying to predict Margin of Victory (MOV), you could use a linear regression, a polynomial regression, a linear local regression, a Support Vector Machine (SVM), and so on.  There's no straightforward way to know which of these will be best, and likewise no easy approach to tuning each algorithm to find its best performance.

Auto-WEKA is a software package from some smart folks at the University of British Columbia to try to address this problem.  It applies something called "Sequential Model-based Algorithm Configuration" to a machine learning problem to try to find the best Weka algorithm for solving that problem.   (Weka is a popular open source suite of machine learning software written in Java at the University of Waikato, New Zealand.)  Auto-WEKA tries to be smart about picking parameters for the Weka algorithms it tries so that it can home in on the "best" parameters relatively quickly.

Regardless of how efficient Auto-WEKA is at finding the best parameters, it's very handy simply because it automates the work of trying a bunch of different algorithms on a problem.  At the simplest, you just point it at a data file and let it work, but I'll illustrate a more complex example.

First of all, you need to get Auto-WEKA from here and unpack it on your system.  It's self-contained and written in Java, so as long as you have Java installed you should be able to run it on just about any sort of machine.

Second, you need to get your data in a format that Auto-WEKA (and Weka) likes.  This is the Attribute-Relation File Format (ARFF). For me, the easiest way to do this was using RapidMiner.  RapidMiner has operators to read and write data in various formats, so I put together a simple process to read my data (in CSV format) and write it out in ARFF.  One caveat is that Weka expects the label (that is, the data we're trying to predict) to be the last column in the ARFF data.  This is something of a standard, so RapidMiner takes care of this when writing out the ARFF format.  Here's what my RapidMiner process looks like:


The Pre-Process operator is responsible for reading in and labeling my data.  The Sample operator is in there because my data set is very large, so I down-sample it so that Auto-WEKA doesn't take forever to make progress.  (I used about 5000 games.)
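
If you don't have RapidMiner handy, a bare-bones conversion is easy enough to write yourself.  This sketch assumes an all-numeric CSV whose last column is already the label:

    import csv

    def csv_to_arff(csv_path, arff_path, relation="games"):
        """Write an all-numeric CSV out as ARFF, keeping the label as the last column."""
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        with open(arff_path, "w") as out:
            out.write("@RELATION %s\n\n" % relation)
            for name in header:
                out.write("@ATTRIBUTE %s NUMERIC\n" % name)
            out.write("\n@DATA\n")
            for row in data:
                out.write(",".join(row) + "\n")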

With the data in hand, we start Auto-WEKA up and are presented with the startup screen:



The "Wizard" option does about what you'd expect -- you point it at your data file and it does the rest.  For more control, you can use the Experiment Builder.  This opens a window with three sequential tabs.  In the first tab, you open the data file.  In the second tab, you pick the classifiers (and meta-classifiers) you'd like to try in your experiment:


Here I've picked some tree-based classifiers as well as a few of the meta-classifiers.  By default, Auto-WEKA will use all the classifiers suitable for your data set.  The third tab allows you to control the experiment settings, such as how long to run:


The defaults here are reasonable, but depending upon your dataset you might want to allocate more memory and time.

Having done all that, your experiment is constructed and you can run it.


(Note that you supply a random seed for the run.  You can do multiple runs in parallel, using different random seeds, to speed the overall search.)

Now you sit back and wait ...  for days.  On a large dataset like mine, training classifiers can take many hours.  And Auto-WEKA will try different classifiers and different combinations of parameters, so it will make many runs.  In truth, you can spend about as much CPU time as you can find.

The good news is that with about a week of CPU time, Auto-WEKA was able to identify several algorithms that performed as well as linear regression.  The combinations it found involved base classifiers combined with meta-classifiers, using parameter settings quite different than the defaults.  It's unlikely that I could have found any of these experimenting on my own.  The bad news is that none of these significantly outperformed linear regression, so my overall performance did not improve.

Overall, Auto-WEKA is a tool that could be very useful in the early stages of building a predictor by quickly identifying promising classifier configurations.  And certainly you have nothing to lose by simply letting it loose on the data to see what it can find.

Thursday, November 20, 2014

A New Season

Well, the new college basketball season is upon us.  My early predictions indicate that the Kentucky Blue Platoon is going to play the Kentucky White Platoon in the Championship game, but that prediction might evolve as we see more games.

In the meantime, I have been working on the Net Prophet model (for new readers of the blog - ha, as if! - sometimes known as the Pain Machine or PM).  Traditionally the PM has used a linear regression as its predictive model, but lately I have been looking at other possibilities.  In coming blog posts I'll detail some of these things.

I'm also hoping to get my act together enough to get the PM listed on the Prediction Tracker page for basketball predictions.  I meant to do this last year but never quite got around to doing it.  But this year for sure :-).

Friday, October 10, 2014

Day of the Week Effect && Polynominal Variables in Linear Regression

Motivated partly by recent discussion of Thursday Night Football, I began to wonder if the day of the week has any impact upon college basketball games.  This is a little bit of a tricky topic, because conferences play games on different nights (e.g., the Ivy League plays on Friday nights) so there's some conference bias mixed into any discussion of the impact of day of the week.  But I decided to ignore that for the moment and just look at the straightforward question.

This is a little trickier than you might expect, because my prediction model uses linear regression.  Linear regression works fine when we're looking for the relationship between two numerical variables (e.g., how does rebounds/game affect score) but it doesn't work so well with polynominal (not polynomial!) variables.  A polynominal variable is one that takes on a number of discrete, non-numeric values.  In this case, day of the week can be Monday, Tuesday, Wednesday and so on.

To use a polynominal variable in linear regression, we turn it into a number of binominal variables.  In this case, we create a new variable called "DOW = Monday" and give it a 1 or 0 value depending upon whether or not the day of the game is Monday.  We do this for each possible value of the polynominal variable, so in this case we end up with seven new variables.  We can then use these as input to our linear regression.
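
In pandas this expansion is nearly a one-liner (the column names here are made up for illustration):

    import pandas as pd

    games = pd.DataFrame({"dow": ["Mon", "Fri", "Sat", "Fri"], "mov": [5, -2, 10, 3]})

    # One binominal (0/1) column per day of the week, e.g. "dow_Fri".
    dummies = pd.get_dummies(games["dow"], prefix="dow")
    games = pd.concat([games.drop(columns="dow"), dummies], axis=1)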

When I do so, I find that only one of the new variables has any importance in the regression:

      0.6636 * DOW = 4=false

Translating, this says the home team is at a small disadvantage in Friday games.  I leave it up to the reader to explain why that might be true.  (Ivy League effect?)


We can also look at whether predictions are more or less accurate on some days.  When I do that for my model, I find that the predictions are most accurate for Saturday games -- better than any other day of the week -- and least accurate for Sunday games.  The difference in RMSE between those two is about 6/10 of a point, so it's not an entirely trivial difference.

Monday, September 8, 2014

A Few More Papers

As usual, all these papers are available in the Papers archive.

[Trono 2007] Trono, John A., "An Effective Nonlinear Rewards-Based Ranking System," Journal of Quantitative Analysis in Sports, Volume 3, Issue 2, 2007.

Trono is very concerned about the NCAA football polls and with formulating a rating system that will closely match those polls.  I'm not exactly sure what utility that provides -- surely if I want to know what the polls say I can just look at them?  That issue aside, his description of his ranking system is vague and confusing -- I came away with no good understanding of how it worked or how to implement it. 

[Minton 1992] Minton, R. "A mathematical rating system." UMAP Journal 13.4 (1992): 313-334.
This is a teaching module for undergraduate mathematics that illustrates basic linear algebra through application to sports rating.  The ratings systems developed are simple systems of linear equations based upon wins, MOV, etc.  The systems are very simple, but this is a clear and detailed introduction to some basic concepts.

[Redmond 2003] Redmond, Charles. "A natural generalization of the win-loss rating system." Mathematics magazine (2003): 119-126.
Redmond presents a rating system based upon MOV that includes a first-generation strength of schedule factor. It isn't extremely sophisticated, but makes a nice follow-on to [Minton 1992].

[Gleich 2014] Gleich, David. "PageRank Beyond the Web," http://arxiv.org/abs/1407.5107.

This is a thorough and well-written survey of the use of the PageRank algorithm.  Gleich provides clear, non-formal descriptions of the subject but also delves into the mathematical details at a level that will require some knowledge to understand.  There is a section on PageRank applied to sports rankings, and Gleich also shows that the Colley rating is equivalent to a PageRank.  Required reading for anyone interested in applying PageRank-type algorithms.

[Massey 1997] Massey, Kenneth. "Statistical models applied to the rating of sports teams." Bluefield College (1997).
Kenneth Massey's undergraduate thesis is required reading for anyone interested in sports rating systems.  He covers the least-squares and maximum-likelihood ratings that form the basis of the Massey rating system.

Thursday, September 4, 2014

Welcome Back & The Oracle Rating System

Welcome back!  I hope you had a great summer.  With Fall rapidly approaching my attention has returned (somewhat) back to NCAA basketball and sports prediction.  One trigger was happening across a paper from the June issue of JQAS:

[Balreira 2014] Eduardo Cabral Balreira, Brian K. Miceli and Thomas Tegtmeyer, "An Oracle method to predict NFL games", Journal of Quantitative Analysis in Sports, Volume 10, Issue 2, Pages 183–196, ISSN (Online) 1559-0410, ISSN (Print) 2194-6388, DOI: 10.1515/jqas-2013-0063, March 2014

The paper describes a variant of a random walker algorithm and uses it to predict NFL games. The work here was motivated by a quirky feature of random walkers.  Beating a very good team can raise a team's rating significantly, even if the rest of the team's performance is poor.  In some ways this makes sense, but it can lead to a situation where a mediocre team is ranked inordinately high based upon a lucky win over a very good team.  To address this, the Oracle algorithm introduces an artificial additional team (called the Oracle) and by varying how many times each real team has "won" or "lost" against this Oracle team, biases the resulting rankings.  The authors test the predictive performance of the Oracle rating on NFL games from 1966-2013, and out-perform rating systems like Massey and Colley, although only by small margins (1-2% in most cases).  The paper is well-written and comprehensive, with clear explanation of the approach, illustrative examples, and thorough testing.  

Since I have previously implemented various random walker algorithms, it wasn't difficult to implement this approach and test its performance on NCAA basketball games.  There were a couple of interesting results from this experiment.

First of all, I found the best performance was based upon the won-loss records of teams, and not margin of victory (MOV).  This is pretty unusual -- I don't think I've found any other rating system that performed better using won-loss than MOV.  The performance was also competitive with very good MOV-based rating systems.

Second, I found that for NCAA basketball games, the algorithm performed much better without converting the results matrix to a column-stochastic form before creating the ratings.  A brief digression is in order to explain that remark.

Random walker algorithms model a system with a large number of random walkers:
Consider independent random walkers who each cast a single vote for the team they believe is the best. Each walker occasionally considers changing its vote by examining the outcome of a single game selected randomly from those played by their favorite team, recasting its vote for the winner of that game with probability p (and for the loser with probability 1-p).
If you let this process go long enough, it reaches a steady state, and the percentage of total walkers on each team becomes that team's rating.  That means that the sum of all the ratings is 1, and each rating represents the probability that a walker will end on that team.  When you formulate this as a matrix mathematics problem, you must normalize each column in the raw results matrix to sum to one (making the matrix "column stochastic") to ensure that the final ratings will represent the probabilities.
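
Here's a rough sketch of the column-stochastic version of that process -- my own paraphrase of the general random-walker idea, not the paper's Oracle algorithm:

    import numpy as np

    def random_walker_ratings(wins, p=0.75, iters=5000):
        """wins[i, j] = number of times team i beat team j (every team assumed to have played)."""
        games = wins.sum(axis=0) + wins.sum(axis=1)        # games played by each team
        T = (p * wins + (1 - p) * wins.T) / games          # column j: where team j's walkers move
        T[np.diag_indices(len(games))] += 1.0 - T.sum(axis=0)   # walkers that stay put
        r = np.full(len(games), 1.0 / len(games))
        for _ in range(iters):
            r = T @ r                                      # steady state: ratings sum to 1
        return r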

It isn't clear what the ratings "mean" if you don't convert to column stochastic form, but I found that the ratings had much better performance for NCAA basketball games without the conversion.  When I reported this result back to Eduardo Balreira, he tested it for his corpus of NFL games and found that it performed worse.  It's altogether a rather curious result and I'm not certain what to make of it.

In my experimentation so far, I haven't found any customization of the Oracle system that produces results better than my current best predictors.  However, it is close and has a few interesting properties that bear some more thought, so I may continue to play with it to see if I can discover a way to further improve its performance for NCAA basketball games.

Monday, April 7, 2014

Championship Game Prediction

The Prediction Machine hasn't fared very well this Tournament (languishing in the middle of both the Kaggle and March Machine Madness contests) but for what it's worth here is the prediction for the Championship Game:
Connecticut vs. Kentucky:  Kentucky by 2
I'd like to see Connecticut win myself, but I think they have a hard row to hoe.  Napier & Boatright have been destroying opposing guards with their pressure defense.  If they can do that to the Harrison twins and keep them from repeatedly driving the lane, that will certainly help Connecticut's chances.  But so far the referees have been very stingy with charge calls, which is going to make it very difficult for Connecticut's undersized defense to deal with Kentucky's dribble-drive offense.  Wisconsin figured out in the second half that they could mug the Harrisons once they were in the lane with little repercussion, but who knows if the reffing crew tonight will allow that.  And you have to figure that Kentucky is going to continue to enjoy an enormous advantage in rebounding.  Still, anything can happen, and it will hopefully be a tight and entertaining game.

Machine March Madness Winner: Congratulations to Monte McNair!

Apparently none of the competitors in the Machine March Madness have either Kentucky or Connecticut winning the final game, so the contest has been decided, and the winner is Monte McNair with 108 points and 40 correct picks.

(Note that we did have one Machine March Madness competitor who did better than Monte -- "TD" -- but since he never contacted me to explain his entry, he has been disqualified.)
Congratulations to Monte who continues to be one of the strongest competitors year after year.  (Although unfortunately something went wrong for him in the semi-final games in the Kaggle contest, where he dropped from the top ten to 44!)

Wednesday, April 2, 2014

Recent Papers Reviewed

I have added several new papers to the Papers archive.  Short descriptions follow.

[Barrow 2013] D. Barrow, I. Drayer, P. Elliott, G. Gaut, and B. Osting, "Ranking rankings: an empirical comparison of the predictive power of sports ranking methods," 2013.

This paper compares a number of ranking systems on predictive power.  The main conclusions are that (1) ranking systems which use margin of victory are more predictive than those that use only win-loss data, and (2) least squares and random walkers are better than other methods for predicting NCAA football outcomes.
[Hvattum 2010] Lars Magnus Hvattum, Halvard Arntzen, "Using ELO ratings for match result prediction in association football," International Journal of Forecasting 26 (2010) 460–470.
This paper looks at using ELO ratings to predict association football (soccer) matches.  ELO was better than all of the other rating systems, but failed to out-perform the market lines.
[Kain 2011] Kyle J. Kain and Trevon D. Logan, "Are Sports Betting Markets Prediction Markets?  Evidence from a New Test," January 2011.
This paper tests whether the point spread is a good predictor of margin of victory (it is) and whether the over/under is a good predictor of total points scored (it is not).
[Melo 2012] Pedro O. S. Vaz De Melo, Virgilio A. F. Almeida, Antonio A. F. Loureiro, and Christos Faloutsos, "Forecasting in the NBA and Other Team Sports: Network Effects in Action," ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 3, Article 13, October 2012.
This is a rather interesting paper that models NBA teams as networks exchanging players and coaches.  This allows the authors to look at hypotheses such as "trading players improves a team's performance," or "a player who has played for a number of teams is more valuable than one who hasn't."  They develop metrics such as "team volatility" and use these to predict future performance.
[Page 2007] Garritt L. Page, Gilbert W. Fellingham, C. Shane Reese, "Using Box-Scores to Determine a Position’s Contribution to Winning Basketball Games," Journal of Quantitative Analysis in Sports, Volume 3, Issue 4 2007 Article 1.
This paper looks at box scores for games from the 1996-97 NBA season to determine how important different basketball skills (e.g., defensive rebounding) were to each basketball position (e.g., point guard).  The surprising result was the importance of defensive rebounding by the guard positions and offensive rebounding by the point guard.
[Park 2005] Juyong Park and M. E. J. Newman, "A network-based ranking system for US college football," Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI, 2005.
The authors develop a ranking system based upon the intuitive logic that "If A beat B and B beat C, then A indirectly beat C" and apply it to college football.
[Strumbelj 2012] Erik Štrumbelj, Petar Vračar, "Simulating a basketball match with a homogeneous Markov model and forecasting the outcome," International Journal of Forecasting 28 (2012) 532–542.
The authors build a possession-by-possession transition matrix for an NBA game based upon box score data and team statistics.  They then use this matrix to predict game outcomes.  The results were not statistically better than methods such as ELO, and worse than point spreads.

Monday, March 31, 2014

Final Four Predictions

The Prediction Machine did pretty well on the Sweet Sixteen games.  I think it would have missed many of the Elite Eight games, but I didn't actually run it so we'll never know.  For the first two games of the Final Four:

#1 Florida vs. #7 Connecticut:  Florida by 5.5

I think most people would agree that Connecticut is the weakest of the Final Four teams.  Florida meanwhile has been rolling along quietly taking care of business.  Short of an abnormal shooting night from one or both of the teams, I don't think UConn has much chance in this game.

#2 Wisconsin vs. #8 Kentucky:  Toss-up
Before watching the Kentucky-Michigan game, I thought Wisconsin was playing the best basketball of any of the contenders.  Now I'm not so sure.  Kentucky has been nearly unstoppable on offense throughout the Tournament, and the fabled freshmen have been impervious to the pressure.  Still, the Wildcats may be vulnerable if they get stymied enough on offense (as they did a couple of times this year against Florida), and Bo Ryan's team is certainly capable of applying the defensive pressure.  But even so, Wisconsin is going to have to be very efficient on the offensive end to stay even with the Wildcats.

Wednesday, March 26, 2014

Adventures in Data Cleansing

According to the ESPN play-by-play data, the American University vs. Penn State game on 12/21/2009 was a blowout -- Penn State won 914-629.

Imagine if it had gone to OT!

Machine Madness Competitors: Monte McNair

Next up in our tour of Machine Madness competitors is Monte McNair.  Monte was in this contest last year as well, under the nom de plume "Predict the Madness". 

Monte attended Princeton but is also a lifelong Stanford fan, so he is enjoying their current Tournament run.  As a UCLA fan I'll try not to hold that against him.  At least he isn't a Cal fan.  Monte blogs (infrequently) about sports at Outside the Hashes.  He also runs a site called Ultimate Bracket Challenge that's worth checking out and bookmarking for next year.

Last year he did a posting over on the Number Crunching Life where he talked about his approach.  He uses a logistic regression based upon the location of the game, metrics for the team's offense and defense, and metrics of the team's opponents' averages for both offense and defense.  Unlike some approaches (like mine) that produce a predicted point spread, Monte's approach produces a confidence number.  Monte finished in the middle of the pack last year but is doing much better this year.  He's currently in second in this contest, and is doing quite well over on Kaggle, where he's currently in eleventh.

Monte has Villanova-Florida-Arizona-Louisville as his Final Four, with Arizona winning it all.  The current leader has Florida for champion, so if Arizona wins it all Monte will likely jump into first and win this contest.

Sweet Sixteen Analysis

Jeff Fogle over on Stats Intelligence has a nice post up analyzing the Sweet Sixteen matchups.  Unlike most analysis you'll see, this is actually grounded in the team statistics instead of some pundit's vague intuitions.

Unfortunately for me, Jeff comes to the same conclusion I did about UCLA's chances against Florida:  not very good.  UCLA did beat Arizona (a team very similar to Florida) in the final of the Pac-12 Tournament, but Arizona was a little tired for that game, and UCLA enjoyed a tremendous advantage on the free throw line.  You never know what the officiating will be like in the Tournament, but I'll be very surprised if UCLA ends up with a significant advantage in that category. 

Machine March Madness Competitors: Eric Akers

Our next profile is Eric Akers.

Eric is a software engineer at a small biotech company, so he fits my mental profile of a Machine Madness competitor.  He went to Kansas and his education has focused on robotics, computer vision, and some machine learning.  Needless to say he's a big Jayhawks fan.

Sorry about that Stanford game, Eric.

He's also passionate about baseball (KC Royals fan, naturally) and is contemplating trying to predict baseball games.  It's a slippery slope, this prediction business.  For outside hobbies he's building an ASV (autonomous sea-surface vehicle) and is looking at building a quadcopter.

Like many of the competitors, he got hooked into this via Number Crunching Life. One of his work colleagues did some number crunching to try to get an edge in the office pool, Eric got intrigued, Googled around and found Number Crunching Life and got sucked in.  Looking at the competitors from last year I don't see Eric's name, so I think this is his first year competing.

His algorithm uses Danny Tarlow's probabilistic matrix factorization method using the 2D model with a vector for both offense and defense. Data was taken from the Kaggle site. Stochastic gradient descent was used to train the model, but with aging added. After an initial training, the teams were ranked based on the offense and defense vectors, then training continued using a higher learning rate based on the rank of the opponent being faced.
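
For flavor, here's a loose sketch of what an offense/defense vector model trained by stochastic gradient descent might look like.  This is my own illustration, not Eric's code, and it leaves out his aging and rank-based learning rate tweaks:

    import numpy as np

    def fit_offense_defense(games, n_teams, dim=2, lr=0.001, epochs=50, seed=0):
        """games: list of (team, opponent, points_scored).  Predicted score = offense . defense."""
        rng = np.random.default_rng(seed)
        offense = rng.normal(6.0, 0.1, (n_teams, dim))   # init so dot products start near ~70 points
        defense = rng.normal(6.0, 0.1, (n_teams, dim))
        for _ in range(epochs):
            for team, opp, points in games:
                err = points - offense[team] @ defense[opp]
                offense[team] += lr * err * defense[opp]   # gradient step on squared error
                defense[opp] += lr * err * offense[team]
        return offense, defense
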
Eric also entered the Kaggle competition and was as high as #7 at one point on the first day, but has since dropped to #224.  If he'd taken the opposite of his predictions he'd be in 25th place :-).
Right now Eric is in 6th place in the Machine Madness competition.  Eric has a Virginia-Louisville final predicted, with Louisville winning it all.  If that happens he'll certainly jump upwards in the standings!


Tuesday, March 25, 2014

Machine Madness Competitors: Brandon Kling

I thought it would be interesting to get a little insight into some of the Machine Madness competitors, and I'm starting off with Brandon Kling.

I frankly expected all the competitors to be data mining/AI geeks, but Brandon is a Commercial Real Estate broker/investor from Bloomfield, Michigan.  He went to Walsh College of Business, a private school in the suburbs about 1 hour north of Detroit, and -- given that Walsh College doesn't seem to field a basketball team -- is a die-hard University of Michigan fan.  He also plays a little basketball and volleyball himself.

This is his first year entering any sort of automated/algorithm based prediction models. He stumbled upon the Machine March Madness pages while looking for the historical best ways to predict the NCAA tournament brackets and got sucked into doing his own completely automated bracket. 

His method was pretty straightforward:

  1. Teams with the higher BPI (Basketball Power Index) advance (unless the difference between seeds playing each other is 3 or less, then see #2)
  2. If difference in seeds is 3 or less (i.e. 10v7 or 2vs3 or 1vs1), then disregard BPI and instead advance the team with the lower OPPONENT AVG PPG
  3. Championship Game points -  (winning score = Championship Game winner's AVG PPG)  (losing score= Championship Game winner's OPPONENT AVG PPG)
In one sense his algorithm was more advanced than mine -- I just randomly filled in the tie-breaker game points!
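
As a sketch, the advancement rule looks something like this (the field names are hypothetical):

    def pick_winner(team_a, team_b):
        """Advance by BPI, unless the seeds are within 3, in which case use opponent average PPG."""
        if abs(team_a["seed"] - team_b["seed"]) <= 3:
            return min(team_a, team_b, key=lambda t: t["opp_avg_ppg"])
        return max(team_a, team_b, key=lambda t: t["bpi"])
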
Sadly, Brandon's in last place in the pool.   However, he's the only competitor with Virginia winning the Championship, so if that happens he'll move up quite a bit.  And of course his beloved Wolverines are still in the Tournament (at least until they play Tennessee :-) so he has that to be happy about!

Sweet Sixteen Predictions

Thursday Games

#10 Stanford vs. #11 Dayton:   Stanford by 3
The Sweet Sixteen surprise match-up.  My prediction for this game prior to the tournament had Stanford by 4, so Dayton has closed the gap but Stanford is still favored.  My subjective judgement agrees.  (Dayton's biggest fan for this game?  Wojo.)

#2 Wisconsin vs. #6 Baylor:  Wisconsin by 4
The needle hasn't  moved at all on this game.  Baylor has surprised some with their performance, but Creighton was probably over-rated at a three seed.  And let's not discount Wisconsin's beat-down of American and handling of Oregon.  I know many think that Baylor could surprise Wisconsin, but my own expectation is that Wisconsin will win handily.  I think Baylor will quickly get disheartened by the Wisconsin defense.

#1 Florida vs. #4 UCLA:  Florida by 4.5
The machine predictors generally rate Florida as the best team in the country, and after their dismantling of Pittsburgh it's easy to see why.  Prior to 2005 UCLA had never played Florida.  Since then, they've had the misfortune to meet them four times in the Tournament -- every time during a year when Florida was at its best.   As a UCLA fan, I don't like the matchups with the Florida players, and you can't expect them to shoot 55% from the field as they did against Stephen F. Austin.

#1 Arizona vs. #4 SDSU:  Arizona by 8
It feels like both of these teams have been playing well, but they've actually just won their tournament games about as expected.   SDSU was lucky to face North Dakota State instead of Oklahoma, or they might not be in this game.  This should be a straightforward victory for Arizona.

Friday Games

#2 Michigan vs. #11 Tennessee:  Tennessee by 3
The Prediction Machine had Tennessee as one of the most mis-seeded teams this year, and that has certainly borne out.  They've actually played significantly above the predictions, so they're now a 3 point favorite in this game (they were a 1 point favorite before the tourney began).  Michigan has played about as expected.  Tennessee may revert to form, but either way Michigan is facing a tougher challenge here than you'd expect from the seeding.

#3 Iowa State vs. #7 Connecticut:  Iowa State by 3
The Prediction Machine doesn't consider injuries, so this line should probably be a little tighter.  These are both mediocre, inconsistent teams, so I won't be surprised if either team wins big or if it's a 3OT thriller.
#4 Louisville vs. #8 Kentucky:  Louisville by 6
This is the only sub-regional that has played out (so far, anyway) exactly as the PM predicted.   The Kentucky-Wichita State game was a tremendously fun game to watch, but let's not overstate the value of a two point win over a very over-seeded #1.  UK beat Louisville solidly early in the year, but that was on UK's home court.  On the other hand, six points isn't a lock.  It's certainly going to be a hard-fought game, and could be an instant classic.

#1 Virginia vs. #4 Michigan State:  Michigan State by 2
Michigan State has the slight edge here, but Virginia's defense should keep them in the game.  I'll be surprised if this is a blow-out either way.

Monday, March 24, 2014

A Look Back At Some Predictions

Previously on Net Prophet:

Courtesy of the Prediction Machine, here are the five most unpredictable teams in the Tournament:
  1. Oklahoma State
  2. North Dakota State
  3. Harvard
  4. Memphis
  5. Massachusetts
ND State and Harvard certainly did unpredictably well.  OK State may have been unpredictably bad at the wrong time.

Over-seeded teams:
UMass!!  should be a 15, was a 6 (-9)
St. Louis should be a 10, was a 5 (-5)
Colorado should be a 13, was an 8 (-5)
St. Joseph's should be a 15, was a 10 (-5)
New Mexico State should be an "18", was a 13 (-5)
Syracuse should be a 7, was a 3 (-4)
 All of these teams are out.   Only Syracuse and St. Louis even made it to a second game.

Under-seeded teams:
Iowa should be a 4, was an 11 (+7)
Oklahoma St. should be a 2, was a 9 (+7)
Tennessee should be a 5, was an 11 (+6)
Ohio State should be a 2, was a 6 (+4)
Harvard should be a 9, was an 11 (+2) 
Tennessee and Harvard both out-performed their seeds.  Iowa had the misfortune to face Tennessee.  Oklahoma State and Ohio State both lost games they should have won, so the committee might have been right about them.
Three possibly first-round upsets:  Stanford over New Mexico, Providence over UNC, and Xavier over St. Louis. 
Stanford beat New Mexico, and UNC and St. Louis won by a total of 5 points.

Machine March Madness Update

The Machine March Madness competition ended up with 9 competitors.  After 32 games, the current leader is "T.D." with a fairly commanding 9 point lead over perennial competitor Monte McNair.  The winner of the bracket may be determined by the final game, where TD has Florida and Monte has Arizona.  Mark & I (in third place) will be overshadowed by TD and Monte, but if Louisville wins the championship then Tim (currently in fifth place) may pick up enough points to win it all.

Here's a quick summary of the Final Four and Champion predictions from the machines:

Team                 Final Four   Champion
#1 Florida                6            3
#1 Arizona                7            2
#4 Louisville             7            2
#1 Virginia               2            1
#2 Villanova              4            0
#4 Michigan State         2            0
#2 Kansas                 2            0
#2 Wisconsin              2            0
#1 Wichita State          2            0
#3 Syracuse               1            0
#3 UNC                    1            0

Florida-Arizona is the clear favorite for the championship game.  The predictors also seem to agree that the committee under-estimated Louisville and Michigan State, while over-estimating Wichita State and Virginia.

More to come from the competitors in the next few days.

Thursday, March 20, 2014

Prediction Recap & Kaggle Contest

Albany vs. Mt. St. Mary's:  Albany by 1.5
NC State vs. Xavier:  Xavier by 4
Cal Poly vs. Texas Southern:  Cal Poly by 2.5
Iowa vs. Tennessee:  Iowa by 2.5
The PM goes 2-2 in the "First Four" which is probably no better than chance.  Interestingly, it got both the 16 seed play-in games correct.   NC State outplayed Xavier, but the Iowa-Tennessee game was more competitive and could have gone either way.

I'm in Vegas with a group of friends to watch the first-round games, so posting will be light, but here's an interesting graphic showing the spread of predictions amongst the Kaggle competitors:


This is a little non-intuitive, but if the caption says "Albany beats Florida" then having the histogram to the left indicates that the predictors don't believe in that hypothesis (and vice versa).

Tuesday, March 18, 2014

First Four Predictions

The Prediction Machine on the First Four:
Albany vs. Mt. St. Mary's:  Albany by 1.5
NC State vs. Xavier:  Xavier by 4
Cal Poly vs. Texas Southern:  Cal Poly by 2.5
Iowa vs. Tennessee:  Iowa by 2.5

Monday, March 17, 2014

More Grist For Your Tournament Picks

Pando did an article on machines predicting the Tournament.  Quotes from many of my favorite writers on the topic, including my most favorite writer -- me.

The Harvard Sports Analysis Collective put together a nice collection of random Tournament facts, such as:
24. Coming in at the 94th most efficient offense and the 36th most efficient defense in the country, UMASS is statistically the worst 6 seed by a wide margin.
I don't put much stock in those sorts of factoids, but it's an entertaining read nonetheless.

Courtesy of the Prediction Machine, here are the five most unpredictable teams in the Tournament:

  1. Oklahoma State
  2. North Dakota State
  3. Harvard
  4. Memphis
  5. Massachusetts
 I suspect that much of Oklahoma State's unpredictability stems from the temporary loss of Marcus Smart, so you might want to discount that.  But if you're counting on (say) Harvard to play a great game and beat Cincinnati in the first round upset, you might take this as a positive sign ("They're an inconsistent team, so they have a chance to play over their heads!") or as a negative sign ("They're too inconsistent to count on to pull off the upset!").  In the past, the Prediction Machine has used these numbers to fairly good effect.

Three possibly first-round upsets:  Stanford over New Mexico, Providence over UNC, and Xavier over St. Louis.

Sunday, March 16, 2014

Mis-Seedings in the NCAA Tournament

Here are some of the overseeded/underseeded teams, at least according to one analysis by the Prediction Machine.  There's a lot of mis-seeding in the lower seeds because most of the automatic invites down there wouldn't even be 16 seeds in the PM's book, but I'm ignoring those.
UMass!!  should be a 15, was a 6 (-9)
St. Louis should be a 10, was a 5 (-5)
Colorado should be a 13, was an 8 (-5)
St. Joseph's should be a 15, was a 10 (-5)
New Mexico State should be an "18", was a 13 (-5)
Syracuse should be a 7, was a 3 (-4)
On the other end of the stick:
Iowa should be a 4, was an 11 (+7)
Oklahoma St. should be a 2, was a 9 (+7)
Tennessee should be a 5, was an 11 (+6)
Ohio State should be a 2, was a 6 (+4)
Harvard should be a 9, was an 11 (+2) 

The Iowa/Tennessee mis-rating is particularly interesting.  The two most underseeded teams play each other, and then the winner gets UMass, the most overseeded team in the tournament.  Similarly we get underseeded tOSU against overseeded Syracuse in the second round.  Of course, the most egregious mistake might be Louisville as a 4 seed.  And overseeded St. Louis gets them in the second round.

I'm tempted to say that the Committee was intentionally stacking the seeding to create "upsets".

Here are the teams that the Committee got right:
Florida
Arizona
Villanova
Creighton
UCLA
Cincinnati
Oregon
Stanford
Xavier
For whatever reason, the Committee seems to have had the clearest vision of the Pac-12 teams.

UCLA (my team) got a pretty favorable draw for the first two rounds.  Then they run into their old nemesis Florida, a match-up I'm sure the Committee considered when laying out the regions.

Thursday, March 13, 2014

Flattening & the Kaggle Contest

Jeff Fogle at Stat Intelligence has another good post up, this time arguing that in the post-season (the NCAA Tournament in the case of college basketball), differences between teams condense.  By his argument, once the best teams are playing each other, they're more able to reduce any differences in strength between the teams.  And one implication of this is that handicapping (prediction) that works for the regular season won't work as well for the post-season.

Putting aside for the moment whether I buy this, how would we test this idea?

One of the big problems with any sort of assertion about the post-season is that it's very difficult to test, simply because of the small sample size.  College basketball is actually the best candidate amongst the major sports, because you have 67 games a year in the post-season -- and arguably more if you're willing to include the NIT and the conference tournaments.  In contrast, the NFL has only 11 games a year in the post-season.

But even for college basketball, 67 games per year is just not that big a sample size.  With five years of data, you still have less than 350 games for testing purposes.  (And given how the rules change in college basketball, going back more than 5 years or so runs the risk of comparing apples to oranges.)  And if we're looking specifically at Jeff Fogle's hypothesis about the best teams playing each other, it isn't entirely clear that some of these games would count -- is a #1 playing a #16 more like a regular season game, or more like a playoff game?

I'm not going to work out the math, but with 350 games in our test set and the known high variance in college basketball, any difference you found between Tournament games and regular season games would have to be huge to be statistically significant.
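
Just to put a very rough number on it -- the 11-point error spread below is an assumption, used only for illustration:

    from math import sqrt

    sigma = 11.0            # assumed per-game prediction error std. dev. (points)
    n = 350                 # roughly five years of Tournament games
    print(sigma / sqrt(n))  # ~0.59 points: the standard error of the mean error over those games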

Several other factors make it difficult to assess the difference between the Tournament and the regular season.

One is that the Tournament games are played on neutral courts with mixed officiating crews.  That might well be the cause for any difference we saw between regular season and Tournament games.

Another is that (by necessity) we have to try to predict Tournament games based upon regular season performance.  That will make it more difficult to discern any qualitative difference between regular season and Tournament games.

All that said, in my own experience I haven't identified a qualitative difference between regular season games and Tournament games.  (Or, for that matter, between conference and non-conference games.)  Specifically, if I build a predictor based upon regular season games and a predictor based upon only Tournament games, I find that the regular season version is still the better predictor of Tournament games.  But given the small sample size for building the predictor based on Tournament games, I don't place a lot of confidence in that result.

(I will caveat the preceding paragraph slightly: Tournament games have different home court advantage numbers in my predictor, but I ascribe that difference to the fact that they're played on a neutral court.)
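
For what it's worth, here's a minimal sketch of the kind of comparison I'm describing, using a generic logistic regression from scikit-learn.  The DataFrame layout, feature names, and label column are hypothetical stand-ins for illustration, not my actual predictor:

# Sketch: train one model on regular-season games and one on Tournament-only
# games, then score both on the same held-out set of Tournament games.
# The column names here are made up for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

FEATURES = ["rating_diff", "pace_diff", "reb_rate_diff"]   # hypothetical features

def compare_training_sets(games: pd.DataFrame) -> None:
    tourney = games[games["is_tournament"]]
    regular = games[~games["is_tournament"]]

    # Hold out a slice of Tournament games as the common test set
    test = tourney.sample(frac=0.3, random_state=0)
    tourney_train = tourney.drop(test.index)

    for name, train in [("regular-season", regular),
                        ("tournament-only", tourney_train)]:
        model = LogisticRegression(max_iter=1000)
        model.fit(train[FEATURES], train["team1_win"])
        accuracy = accuracy_score(test["team1_win"],
                                  model.predict(test[FEATURES]))
        print("%s model, accuracy on Tournament games: %.3f" % (name, accuracy))

The tiny Tournament-only training set is exactly why I don't place much confidence in the outcome of this comparison.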

Interestingly, the Kaggle competition will provide something of an empirical test of this thesis.  Judging by the Phase 1 leaderboard, there are a number of competitors who are specializing their predictors for good performance on the past five Tournaments.  If these predictors generally out-perform the predictors that are optimized for all games (or for regular-season games) it could be taken as some level of evidence that there really are fundamental differences that a predictor can exploit.  (Or not; again, small sample size.)  But at any rate I'm quite interested in seeing the results.

Wednesday, March 12, 2014

Basketball Power Index

Jeff Fogle over at Stat Intelligence has a new posting deriding ESPN's self-congratulatory coverage of the Basketball Power Index.  I won't comment much on what he says, other than to note that as usual he's right on the mark with his criticism.

What I don't like about BPI that he doesn't mention is its "secret sauce" formulation.  Nobody except the stats gurus at ESPN knows exactly what the formula for BPI is.  If you chase around the ESPN links trying to find a definition for BPI, you get to this page, which provides this sort of "explanation":
There are a number of small details that we have in our methodology to make it reflective of a résumé for a tournament team -- these are pretty technical and many people won't be interested, so we won't go into detail, but we think they improve how the tool works.
There's no way to check how BPI is calculated, whether all the small details are being applied consistently, or whether ESPN is tweaking it weekly to inflate its performance, and no way to compare it to other methodologies.  (I have the same complaint about Ken Pomeroy, who is similarly vague about his actual calculations.)

Obviously, these folks have every right to keep their ratings formulas secret.  And by all means, compare your rating's performance to other ratings.  But to my mind, you're not a leader in sports rating systems unless you're willing to expose the details of your rating system and let others test and criticize it.

Monday, March 10, 2014

Top Twenty (3/10) and Predictions


Rank  Team            Rating  Change
  1   Louisville       31.10   +0.65%
  2   Arizona          30.18   -0.46%
  3   Iowa             29.88   -0.57%
  4   Duke             29.72   -0.47%
  5   Oklahoma St.     29.69   -0.13%
  6   Michigan         29.58   +0.61%
  7   Creighton        29.44   -0.37%
  8   Villanova        29.38   +0.20%
  9   Ohio St.         29.24   -0.03%
 10   Florida          29.14   +0.83%
 11   Michigan St.     29.09   +0.28%
 12   Kansas           29.01   +0.21%
 13   Kentucky         28.90   -0.52%
 14   Iowa St.         28.80   -0.24%
 15   UCLA             28.71   -0.73%
 16   Cincinnati       28.62   -0.07%
 17   Wisconsin        28.60   -0.35%
 18   Gonzaga          28.53   NEW
 19   Arkansas         28.49   -0.21%
 20   Arizona St.      28.47   -0.32%

The regular season Top Twenty ends with Louisville atop the leaderboard after strong showings against SMU and Connecticut this week.  But the biggest winner was Florida, who jumped four spots after crushing South Carolina and Kentucky.  UCLA took the biggest hit following a bad loss at Washington State.

It's interesting to compare the computer rankings to the AP rankings.  Wichita State is probably going to get a #1 seed for the tournament despite not appearing in the Top Twenty above, so this could be the year that a #16 beats a #1.  And Oklahoma State might not get an invite despite being in the top twenty of most computer polls.

PREDICTIONS

It's conference tournament week, so the marquee matchups won't happen till later in the week and aren't scheduled yet.  Only a few games currently stand out:

Notre Dame vs. Wake Forest (Notre Dame by 1)
An early-round ACC matchup between two teams that are statistically nearly identical.

Utah State vs. Colorado State (CSU by 2)
These two teams are the Notre Dame-Wake Forest equivalents in the Mountain West.

Indiana vs. Illinois (Indiana by 3.5)
These two teams are the Notre Dame-Wake Forest equivalents in the B1G.  They split their previous two meetings (although the Illini needed OT to win in Urbana), so this is the rubber match.

Maryland vs. FSU (Maryland by 3)
After surprising UVa, Maryland gets one more chance at a win on its way out of the ACC.  FSU needs to get on a roll in the ACC Tournament to have a shot at getting into the NCAA Tournament.

Predictions (3/4) Recap

#16 Iowa State @ Baylor:  Baylor by 1 (Baylor by 13)
#11 Louisville @ #18 SMU:  Louisville by 3 (Louisville by 13)
#10 SDSU @ UNLV:  SDSU by 2.5 (SDSU by 9)
#6 Villanova @ Xavier:  Villanova by 6  (Villanova by 7) 
#20 Memphis @ #15 Cincinnati:  Cincinnati by 12.5 (Cincy by 13)
#24 Iowa @ #22 MSU:  MSU by 4 (MSU by 10)
#25 Kentucky @ #1 Florida: Florida by 9 (Florida by 19)
#14 UNC @ #4 Duke: Duke by 12.5 (Duke by 12)
#21 New Mexico @ #10 SDSU:  SDSU by 5 (SDSU by 3)
#19 Connecticut @ #11 Louisville:  Louisville by 14 (Louisville by 33)
Oklahoma State @ #16 Iowa State:  Iowa State by 5 (Iowa State by 4 in OT)
#18 SMU @ #20 Memphis: Memphis by 3.5 (Memphis by 9)

The Prediction Machine goes 12-0 in the final week of the regular season, if not always close on the point spread.  Peaking just in time for the Tournament :-)

Friday, March 7, 2014

Machine March Madness 2014

Obviously the big news this year (in the area of machine prediction of the NCAA Tournament, anyway) is the Kaggle competition for $15,000.  Oh, and there's the Quicken Loans competition for $1 billion.  But if that's not enough to keep you busy this March, I'm pleased to announce the continuation of the Machine March Madness competitions.  There's no money at stake, but this is the longest-running machine prediction competition, and let's face it -- if you're going to enter the Kaggle competition, you might as well enter Machine March Madness, too.  It's not much more work :-).

 (My thanks to Danny and Lee for letting me keep the competition running.)

The rules are very informal.  Your predictions must be based on a computer algorithm, but you can implement some parts manually as long as they're objective.  For example, your method might include a step like "take the team with the higher Sagarin rating" that you carry out by hand, but please limit these steps and avoid simply using your subjective judgement.  You can use any data you can find, including human-generated rankings like the AP poll.

The competition will be run as a Yahoo! Pool called "Machine March Madness" which you can find here.  Scoring will be Fibonacci -- 2-3-5-8-13-21 points per correct pick by round -- which will make the competition a little less dependent on the final round(s) than the traditional scoring.  To get the password to join the pool, email me (srt19170@gmail.com) with the name of your entry and a short description of your approach.  Also, please join the Google Group for announcements and discussion.
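
To make the scoring concrete, here's a small sketch of how the Fibonacci points add up.  The bracket representation (sets of teams advancing out of each round) is made up purely for illustration; Yahoo! does the real bookkeeping:

# Fibonacci bracket scoring: a correct pick in round r is worth ROUND_POINTS[r].
ROUND_POINTS = [2, 3, 5, 8, 13, 21]   # round of 64 through the championship game

def score_bracket(picks, results):
    # picks[r] and results[r] are the sets of teams picked / actually
    # advancing out of round r.
    return sum(points * len(picked & actual)
               for points, picked, actual in zip(ROUND_POINTS, picks, results))

# Example: an entry that only gets the last two rounds right
picks   = [set()] * 4 + [{"Team A", "Team B"}, {"Team A"}]
results = [set()] * 4 + [{"Team A", "Team B"}, {"Team A"}]
print(score_bracket(picks, results))   # 2 * 13 + 1 * 21 = 47

Compared to scoring that doubles every round, the later rounds still matter most, but a good early-round bracket isn't wiped out by a single Final Four miss.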

Useful data can be found in a couple of places.  First, at the Kaggle competition data page.  Second, you can look in this Google Group thread from last year for some pointers to last year's data.  Finally, I have fairly extensive data and will make it available as needed -- email me (or post in the Google Group) what you'd like to see.

Danny Tarlow's starter code from past years can be found here.  A short tutorial I wrote on using RapidMiner to predict games can be found here.  Finally, there have been several useful postings on rating systems and predictions in the Kaggle forum.

Tuesday, March 4, 2014

Tool Posting: Slime and Swank Versions Differ in Emacs

(Another tool posting.  Apologies to the pure basketball types, but this problem and its fix are impossible to find on the Internets so I wanted to document it for others.)

If you use the Superior Lisp Interaction Mode for Emacs (SLIME), you may get an error when starting up Slime that says "Slime and Swank versions differ, continue?"  Answering yes to the prompt usually lets Slime start without any problems, but it's still annoying.

The bug is caused by having multiple copies of Slime on your machine.  To fix it, you need to find and remove all older versions of Slime.  (Searching your file system for "slime.el" is a simple way to find all the installed copies of Slime.)  You might also need to clean up your .emacs file to not load (or include on the load-path) those old versions.

However, the problem will persist until the "slime.el" file is recompiled.  To force this, find the "slime.elc" file in the remaining installation of Slime and remove it.  The next time you start Slime in a new Emacs the file will be recompiled and the error should go away.
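
If you'd rather script the hunt than search by hand, here's a small sketch.  The search roots are assumptions; point them at wherever your Emacs packages actually live:

# Find every copy of slime.el / slime.elc on the machine (the manual
# "search your file system" step).  Adjust SEARCH_ROOTS as needed.
import os

SEARCH_ROOTS = [os.path.expanduser("~/.emacs.d"), "/usr/share/emacs"]

for root in SEARCH_ROOTS:
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in ("slime.el", "slime.elc"):
                print(os.path.join(dirpath, name))

# Remove the stale Slime directories, then delete the remaining slime.elc
# so that slime.el is recompiled the next time Slime starts.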

Top Twenty (3/4) and Predictions

Rank  Move  Team            Rating  Change
  1   NC    Louisville       30.90   -0.48%
  2   +1    Arizona          30.32   +0.43%
  3   -1    Iowa             30.05   -0.86%
  4   NC    Duke             29.86   -0.07%
  5   NC    Oklahoma St.     29.73   +0.27%
  6   NC    Creighton        29.55   -0.27%
  7   NC    Michigan         29.40   -0.20%
  8   +1    Villanova        29.32   +0.24%
  9   -1    Ohio St.         29.25   -0.48%
 10   NC    Kentucky         29.05   -0.51%
 11   NC    Michigan St.     29.01   -0.38%
 12   +1    Kansas           28.95    0.00%
 13   -1    UCLA             28.92   -0.45%
 14   NC    Florida          28.90   +0.07%
 15   NC    Iowa St.         28.87   +0.03%
 16   NC    Wisconsin        28.70   +0.07%
 17   NC    Cincinnati       28.64   -0.07%
 18   NC    Connecticut      28.58   -0.10%
 19   NEW   Arizona St.      28.56   NA
 20   NEW   Arkansas         28.55   NA

(NC = no change; NEW = new to the Top Twenty)

The big loser this week is Iowa, who drops nearly an entire percentage point thanks to three straight losses.  The big winner is Arizona, who continues to roll through the Pac-12.  Arizona State and Arkansas climb back into the Top Twenty, thanks in part to Syracuse and Pittsburgh looking mortal.

PREDICTIONS

The season is ending with a bang, as this week (and particularly next Saturday) has a large number of marquee matchups.

#16 Iowa State @ Baylor: Baylor by 1
Baylor probably needs a few more wins to get into the Tournament, and this is a good opportunity.

#11 Louisville @ #18 SMU: Louisville by 3
The PM continues to love Louisville despite its unappealing record against the top twenty.

#10 SDSU @ UNLV: SDSU by 2.5
A possible stumbling block for SDSU.

#6 Villanova @ Xavier: Villanova by 6
The Cintas Center has been something of a graveyard for visiting ranked teams (as Creighton can testify), but Xavier also does things like lose to Seton Hall, so overall I still expect Villanova to win this game.

#20 Memphis @ #15 Cincinnati: Cincinnati by 12.5
A surprisingly large number in this game, but Cincinnati is probably better than the AP thinks.

#24 Iowa @ #22 MSU: MSU by 4
More B10 on B10 thrashing.

#25 Kentucky @ #1 Florida: Florida by 9
Kentucky has struggled for the last half of February and into March, and this is going to be another frustrating game, I think.

#14 UNC @ #4 Duke: Duke by 12.5
The PM is unimpressed by UNC's recent wins.

#21 New Mexico @ #10 SDSU: SDSU by 5
Comparing this game to the UNLV game shows the power of the HCA.

#19 Connecticut @ #11 Louisville: Louisville by 14
Should be an easy win for Pitino's boys.

Oklahoma State @ #16 Iowa State: Iowa State by 5
There's some talk that OK State won't get into the Tournament without a few more wins.  Crazy talk in my opinion, but what do I know?  A win at Iowa State would probably clinch a spot, but that's going to be a tough road.

#18 SMU @ #20 Memphis: Memphis by 3.5
Both teams are probably a lock for the Tournament at this point, and might not have much to play for in this game.