Wednesday, December 9, 2015

Sports Information API

Tonight I stumbled across Sportradar.us, which seems to be the former SportsData.  Interestingly, they have APIs to deliver all sorts of sports information, including comprehensive NCAA men's basketball coverage with play-by-play data and even location data, i.e., where on the court a shot was taken (!).

The bad news is that the lowest pricing tier is $500/month.  So not something I'll be buying for Christmas.  But interesting.

Working Overtime

Overtime is one of the interesting quirks of basketball.  In some sports -- particularly low-scoring sports like soccer and hockey -- a game may end in a tie.  But in college basketball, teams play additional periods -- as many as needed -- until a winner is determined.

Overtime games skew team statistics.  ESPN and other sites typically have pages of statistics such as "Points Per Game".  But if one game is 40 minutes long and another is 70 minutes long (regulation plus six five-minute overtimes), it's not really an apples-to-apples comparison.  This is one reason analysts are fond of "per possession" statistics -- not only do they correct for pace of play, but they also correct for overtime games.
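As a rough illustration of the overtime correction, a raw game total can be rescaled to a 40-minute basis.  This is only a sketch: it assumes a 40-minute regulation game and five-minute overtime periods, and the function names and sample numbers are made up for the example.

```python
REGULATION_MINUTES = 40  # assumed regulation length of a college men's game
OVERTIME_MINUTES = 5     # assumed length of each overtime period


def game_minutes(overtimes: int) -> int:
    """Total minutes played in a game with the given number of overtimes."""
    return REGULATION_MINUTES + OVERTIME_MINUTES * overtimes


def per_40(stat_total: float, overtimes: int) -> float:
    """Rescale a raw game total (points, rebounds, ...) to a 40-minute basis."""
    return stat_total * REGULATION_MINUTES / game_minutes(overtimes)


# Hypothetical example: 84 points scored in a double-overtime (50-minute) game
print(per_40(84, overtimes=2))
```

Note that this only corrects for game length.  A per-possession statistic also corrects for pace, which is why it's usually the better choice when possession counts are available.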

Clearly the statistics you feed into a predictor need to be corrected somehow for overtime games.  But there's another interesting overtime issue to consider:  What's the final score of an overtime game?

One choice is to use the score at the end of the overtime(s).  The other is to treat the game as a tie.  There are intuitive arguments in favor of both choices.  Take the six-overtime game between Syracuse and Connecticut in the 2009 Big East tournament.  The fact that Syracuse beat Connecticut suggests that Syracuse is a better team, regardless of how many minutes that took, so we should treat the game as a win for Syracuse.  On the other hand, the teams were deadlocked for six overtimes, which suggests that they're about as equal as it is possible to be, regardless of whether one team or the other managed to win the game in the wee hours of the morning.

Or maybe the game should be treated as a tie for some statistics and not for others.

As longtime readers of this blog are aware, I'm a believer in doing whatever works best.  So in this case, I made two runs of my predictor, once treating overtime games as ties and once using the actual final scores.  In my case, the predictor performed better treating overtime games as ties.

Another possibility is to treat the final margin of an overtime game as 1 or -1 (or 0.1 and -0.1 if your predictor can handle that), depending upon which team wins the overtime period(s).  This retains the won/loss information, but otherwise treats the game as (nearly) a tie.
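For anyone who wants to try these variations, here is a minimal sketch of the three treatments discussed above.  It is not the code from my predictor; the function name, the mode names, and the sample scores are all made up for illustration.

```python
MODES = ("actual", "tie", "nudge")


def training_margin(home_score: int, away_score: int, overtimes: int,
                    mode: str = "tie", nudge: float = 1.0) -> float:
    """Return the home-team margin to feed to a predictor.

    mode="actual": use the real final margin.
    mode="tie":    treat any overtime game as a 0-point margin.
    mode="nudge":  treat an overtime game as +nudge or -nudge,
                   depending on which team won.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    margin = float(home_score - away_score)
    # Games decided in regulation are never adjusted.
    if overtimes == 0 or mode == "actual":
        return margin
    if mode == "tie":
        return 0.0
    # mode == "nudge": keep only the won/loss information.
    return nudge if margin > 0 else -nudge


# Hypothetical overtime game: home team wins 80-75 after one overtime
for mode in MODES:
    print(mode, training_margin(80, 75, overtimes=1, mode=mode))
```

Running the predictor once per mode over the same set of games and comparing the results is the experiment described above.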

For those of you who also have predictors, I encourage you to try the same experiment and report back which choice (if either) works better for you.

Sunday, December 6, 2015

A Few Funny Things

When I logged in to work on this post, I noticed that my blog had 100,000 page views.  Since I have an audience of like six people, you guys must be checking my pages a lot.  Good job!  Anyway, I've been spending some time lately getting my data scraping working, and that always involves a few trips through the bowels of data validation.

First stop is this game.  I was running the predictor when it warned me about an unusual event:  a conference game in early November.  Unusual, but it happens (often a Big 5 game).  What was more surprising was that it was a team playing itself.  According to the predictor, UNC Greensboro had come up with the clever notion of scheduling a home game against itself.  Or maybe it was on the road.

One of the challenges of predicting NCAA basketball is that every data source uses different names for teams.  To try to match them up I have lists of alternate team names:

St. Francis (NY)
1383
St. Francis BRK
St. Francis (N.Y.)
St. Francis-NY
St Francis NY
St Francis(NY)
St Francis (NY)
St. Francis Brooklyn
St. Francis NY
St. Francis-NY Terriers
St Francis (BKN)
st.-francis-(NY)-terriers
St. Francis (BKN)

(That weird-looking "1383" is the name for St. Francis (NY) in the Kaggle contest.  It's run by data scientists, so why use a human-readable name when you can use an arbitrary and completely useless number?)

In this case the predictor too aggressively (although reasonably) determined that Div III Greensboro College was a nickname for UNC Greensboro.  (By the way, my list of nicknames and the Python code that goes with it are available for the asking.  But you're on your own dealing with Greensboro vs. Greensboro.)
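To give a flavor of the approach, here is a minimal sketch of alias matching with a sanity check for a team playing itself.  This is not the code I actually use.  The table is a small hypothetical excerpt, 9999 is a made-up placeholder ID, and the function names are invented for the example.

```python
import re

# Hypothetical alias table: canonical ID -> known spellings.
# 1383 is the Kaggle ID quoted above; 9999 is a made-up placeholder.
ALIASES = {
    1383: ["St. Francis (NY)", "St. Francis BRK", "St. Francis Brooklyn"],
    9999: ["UNC Greensboro", "UNCG"],
}


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Map each normalized spelling to its canonical ID.
LOOKUP = {normalize(alias): team_id
          for team_id, names in ALIASES.items()
          for alias in names}


def resolve(name: str):
    """Return the canonical ID for an exact normalized match, else None."""
    return LOOKUP.get(normalize(name))


def check_game(home: str, away: str):
    """Resolve both teams, rejecting unknown names and self-play."""
    home_id, away_id = resolve(home), resolve(away)
    if home_id is None or away_id is None:
        raise ValueError(f"unrecognized team name in {home!r} vs {away!r}")
    if home_id == away_id:
        raise ValueError(f"{home!r} and {away!r} resolve to the same team")
    return home_id, away_id


print(resolve("St Francis(NY)"))   # punctuation and spacing are ignored
print(resolve("Greensboro"))       # no exact match, so no guess is made
```

Matching only on exact normalized names is deliberately conservative.  It means more names to add by hand, but an unknown name like "Greensboro" fails loudly instead of quietly becoming UNC Greensboro.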

Next up is this game.  Looks like a perfectly reasonable WAC game.  Problem is, one of those teams was not in the WAC.  Actually, one of those teams didn't even exist.

You see, last year the University of Texas decided to merge two campuses -- the University of Texas Brownsville and the University of Texas Pan American -- to form a brand new campus, the University of Texas Rio Grande Valley.  Brownsville didn't have sports, but UT-PA was a Division I team in the WAC, so the new campus stayed in the WAC and became the "Vaqueros."

(Trivia Question:  Name the other four NCAA Division I basketball nicknames that are Spanish words.)

Well, ESPN decided the easiest way to deal with this whole business was to just go into their database and replace every instance of "University of Texas Pan American Broncs" with "UTRGV Vaqueros."  Hence the mysterious 2013 game involving a university that wouldn't exist for several more years.