Monday, September 7, 2015

Welcome to Another Year

"In the Fall, a young man's fancy lightly turns to thoughts of college hoops..."
                                                                                    -- Alfred Lord Tennyson

I may have gotten that quote slightly wrong, but as the summer has been winding to a close I've begun thinking again about college basketball. One of my goals for this year is to move my system into Python. This is a fairly major undertaking, so I thought I'd briefly explain why I've decided to do it.

The architecture of the NetProphet system currently looks something like this:

The top half shows the flow when building the predictive model. WebHarvest is used to fetch the historical data (this only happens once; from then on we can work from the saved copies). A Common Lisp program is then used to process this historical data in various ways. This includes calculating strength measures such as RPI as well as preparing the data for input into the machine learning algorithm. The processed data is then fed into RapidMiner, which uses this processed historical data to create a predictive model that is saved for later use.
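To give a concrete flavor of the strength measures computed in that processing step: the standard RPI is a weighted sum, 0.25 * WP + 0.50 * OWP + 0.25 * OOWP (winning percentage, opponents' winning percentage, and opponents' opponents' winning percentage). Here's a minimal Python sketch, simplified to ignore the official rule that excludes games against the rated team, as well as the home/away weighting:

```python
from collections import defaultdict

def rpi(games, weights=(0.25, 0.50, 0.25)):
    """Simplified RPI from a list of (winner, loser) game results.

    Ignores the official exclusion of games against the rated team
    and the home/away win weighting.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        losses[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)
    teams = set(wins) | set(losses)
    # WP: each team's own winning percentage
    wp = {t: wins[t] / (wins[t] + losses[t]) for t in teams}
    # OWP: average of opponents' winning percentages
    owp = {t: sum(wp[o] for o in opponents[t]) / len(opponents[t])
           for t in teams}
    # OOWP: average of opponents' OWP values
    oowp = {t: sum(owp[o] for o in opponents[t]) / len(opponents[t])
            for t in teams}
    a, b, c = weights
    return {t: a * wp[t] + b * owp[t] + c * oowp[t] for t in teams}
```

For example, given results [("Duke", "UNC"), ("UNC", "State"), ("Duke", "State")], the function rates Duke highest, then UNC, then State.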

The bottom half shows the flow when making a prediction.  WebHarvest is used to fetch the schedule of upcoming games.  This information is fed into a (slightly different) Common Lisp program that prepares the scheduled games for input into the predictive model.  Finally, RapidMiner is used to apply the previously-saved model to the processed schedule to produce predictions.

This may seem complicated, but there's something to be said for breaking up the flow this way and applying the most appropriate tool at each step.  However, executing these processes can be tedious, and unfortunately, the past few years have revealed some shortcomings in these tools.

At this point, WebHarvest is basically abandonware.  It is open-source, so this isn't a showstopper.  In fact, I've hacked the version I'm using to address some problems and add some additional features.  However, it would be nice to use a supported tool with some more modern features.

I've used Common Lisp for many years.  I'm very productive in Common Lisp and have an efficient coding environment.  But for this problem it has some shortcomings.  Support for things like matrix mathematics and linear algebra is weak.  I've managed to get an interface to BLAS working, but there are still areas where I'm limited by what's available.

RapidMiner has traditionally had some nice advantages for machine learning.  The graphical interface makes it very easy to construct models and to try out many different processes.  Saving models and applying them to create predictions is very easy.  Unfortunately, a couple of years ago RapidMiner transitioned from being free & open-source to a tiered model.  While there is still a free & open-source version, all development has stopped on that branch.  The free "Community" version is hobbled in ways that make it unusable for me.  And the "Professional" version is very expensive -- $2000/year (!).  While I was once a strong advocate for RapidMiner as a beginning machine learning tool, these changes mean I can no longer recommend it.

Perhaps more importantly than any of the tool shortcomings, machine learning is a rapidly changing field, and few of the recent developments were available in Common Lisp or Rapidminer.  As I've watched the field for the past few years, it seemed that Python is the most popular environment for machine learning research.  There's certainly work in other languages, but more often than not interesting work finds its way into the Python environment.  Python's also powerful enough to host all of the data processing components that are currently in Common Lisp.  Python even has a very good web scraping functionality (Scrapy) that can replace WebHarvest.
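Even before bringing in Scrapy, the schedule-scraping step can be sketched with nothing but the standard library -- roughly what a Scrapy spider's parse callback would do. The markup shape below is made up for illustration:

```python
from html.parser import HTMLParser

class ScheduleParser(HTMLParser):
    """Collects team names from <td class="team"> cells, two per game row.

    The td/class markup is a hypothetical example, not ESPN's real format.
    """
    def __init__(self):
        super().__init__()
        self._in_team = False
        self._teams = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text we see belongs to a team cell.
        if tag == "td" and ("class", "team") in attrs:
            self._in_team = True

    def handle_data(self, data):
        if self._in_team and data.strip():
            self._teams.append(data.strip())
            self._in_team = False

    def games(self):
        # Pair the collected names off two at a time: (away, home).
        it = iter(self._teams)
        return list(zip(it, it))
```

A real Scrapy spider would add the crawling, retry, and pipeline machinery on top of this kind of extraction logic.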

The final factor was that I don't currently know Python, so this provides a chance for me to learn a new programming language.

I'm a few days into the rewrite and seem to be progressing fairly well.  Updates to follow.


  1. Glad the blog is back! I've been enjoying it.

    I do almost all of my work in Python, including some lighter NCAABB things similar to (but not as fancy as) yours. I also used Scrapy for data gathering, and it's a really great library for that. I'm sure you've probably already run into scikit-learn, and that library has great methods for doing cross-validation of all kinds. If you're looking for something more visual, the Orange project is a visual programming tool built in Python for data analysis.

  2. Thanks James! Converting my web scraper to Scrapy is on my list, but I'm holding off because it appears that ESPN is switching to a new format. I hadn't heard of Orange (or if I had, I've forgotten), but I'll take a look.