Tuesday, September 29, 2015

Missed It By *That* Much!

After five years of working on predicting college basketball, I've honed a decent intuition for the problem.  That applies doubly to my own predictor.  When I add a feature I have a pretty good idea of how it will affect performance.  Anything too significant and I immediately suspect that I've had data leakage or some other error.  In fact, this happened to me enough in the first year or so that I eventually re-architected my predictor specifically to create separation between the processing of the season's games and the prediction of scheduled games, and I carried that architecture over into the Python rewrite.

So I was a little dubious when I added a new feature to the predictor and saw a significant increase in performance.  The increase wasn't enormous (although with tuning it got even better), but it was enough to trigger my "Spidey sense".  I began carefully inspecting my code, but I couldn't find any leaks.  The code was parallel to a similar feature that I was confident was sound, and I hadn't done anything obvious to violate my separation architecture, so it was a bit puzzling.

My ability to test the problem directly was limited (because portions of the architecture aren't yet built out in Python), but what testing I could do seemed to confirm there was a problem.  Without many other options, I broke out the debugger and started stepping through the program while monitoring the data structures -- a laborious task when you're processing 32,000 games and your model has 500+ features.

Eventually I found the problem.  At one point in the code where I thought I was copying a value out of an array to save it elsewhere, I was actually getting a 1x1 array -- i.e., an array containing just one value.  Python is so flexible with data structures that this actually worked fine throughout the rest of the code.  It turns out that a 1x1 array of a value is just about interchangeable with the value itself.

The rub is that numpy doesn't copy array data unless forced: slicing an array gives you a view into the original.  So when I grabbed that 1x1 array, I was really just getting a view into the array it came from.  When the value in the original array changed, so did the value in my little 1x1 array.  The end result was that new values in the original array leaked back into what I thought were saved values.
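
Here's a minimal sketch of the gotcha, outside the predictor (the array and names here are made up just for illustration):

import numpy as np

stats = np.array([[10.0, 20.0, 30.0]])
saved = stats[0:1, 0:1]          # looks like a copy, but it's a view into stats
stats[0, 0] = 99.0
print(saved)                     # [[99.]] -- the "saved" value changed too

saved = stats[0:1, 0:1].copy()   # .copy() breaks the link to the original
stats[0, 0] = -1.0
print(saved)                     # still [[99.]]

Either an explicit .copy() or pulling the scalar out directly (e.g. stats[0, 0]) avoids the problem.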

This sort of thing leads a lot of people to hate dynamically typed scripting languages like Python and JavaScript.  The flexibility is great, but it can lead to difficult-to-diagnose errors (and also poor performance).  I'm a little more pragmatic.  All the choices in programming languages have pluses and minuses, and in my experience they tend to largely balance out.

In the end I didn't discover the secret of college basketball, but at least I found my bug.  So that's something.

Friday, September 25, 2015

Mysterious Errors in Regression

I mentioned the other day that I'd been wrestling with a bug in the predictor.  I found a situation where I had generated some new features for my model and they caused a large drop in accuracy.  Here's a picture that captures the problem:

[Plot: cross-validated error versus the number of features in the model]

This is a plot of the model's performance (a cross-validated error measure) as more features are added.  You can see the puzzling behavior around feature #25.  With the addition of a single feature, the accuracy of the model plummets: the error jumps from around 100 to about 11000 (!).  At first I thought there must have been missing values, corruption, or some other problem with the feature, but examination shows there's nothing ill-behaved or broken in it.  What's more, if I eliminate the offending feature from the data, the same thing happens with some other feature.

The model I'm using here is the standard Ridge Regression model from Scikit-Learn.  I'm dubious that any single feature could cause this much inaccuracy, but even so, the fit should have effectively optimized the offending feature out of the model by driving its coefficient toward zero.  And the fact that this happens with multiple features suggests to me that it's a pervasive problem in my data or in the model.
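
For context, here's roughly the shape of the experiment that produces a curve like the one above: fit a ridge model on the first k features and record the cross-validated error as k grows.  This is only a sketch with stand-in data and an arbitrary alpha, not my actual harness.

import numpy as np
from sklearn.linear_model import Ridge

def cv_mse(X, y, alpha=1.0, folds=5):
    # Cross-validated mean squared error for a ridge model on X, y.
    idx = np.arange(len(y))
    fold_errs = []
    for f in range(folds):
        test = idx % folds == f
        train = ~test
        model = Ridge(alpha=alpha).fit(X[train], y[train])
        pred = model.predict(X[test])
        fold_errs.append(np.mean((pred - y[test]) ** 2))
    return np.mean(fold_errs)

# Stand-in data; the real predictor has ~32,000 games and 500+ features.
rng = np.random.RandomState(0)
X = rng.randn(1000, 40)
y = 3 * X[:, 0] + rng.randn(1000)

# Error as features are added one at a time -- the kind of curve plotted above.
errors = [cv_mse(X[:, :k], y) for k in range(1, X.shape[1] + 1)]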

At any rate, I was unable to figure it out and moved on to other things.  But I'm still curious about what could be going on here.  Maybe there's something obvious I'm missing.

Wednesday, September 23, 2015

Calculating the Massey Rating, Part 2

I've been consumed for the last few days by a bug in the predictor, and I forgot that I was intending to post the second part of this notebook. The good news is that in the meantime I figured out how to make these notebooks play better with Blogger.

In this posting I'll expand the basic Massey rating with a few additions. First, let's recover the game list and the definitions from the previous posting.

In [70]:
games = [[0, 3, 3], [1, 4, 21], [1, 5, 5], [2, 0, 13], [2, 3, 13],
         [3, 0, 25], [4, 5, 8], [4, 3, 15], [5, 1, 21], [5, 4, 8]]

import numpy as np

def buildGamesMatrix(games, num_teams):
    # One row per game: +1 in the column of the first team listed,
    # -1 in the column of the second team.
    M = np.zeros([len(games), num_teams])
    row = 0
    for g in games:
        M[row, g[0]] = 1
        M[row, g[1]] = -1
        row += 1
    return M

M = buildGamesMatrix(games,6)

def buildOutcomes(games):
    # One entry per game: the margin of victory from the game record.
    E = np.zeros([len(games)])
    row = 0
    for g in games:
        E[row] = g[2]
        row += 1
    return E

E = buildOutcomes(games)

bh = np.linalg.lstsq(M,E)[0]
print bh
[-18.55833333  11.74166667  -1.15833333  -9.75833333   5.24166667
  12.49166667]

One point that Massey makes in his thesis is that it is sometimes impossible to solve this set of equations. This occurs when the teams are divided into two or more subsets that have never played each other. (In math terms, we can only solve the equations when the graph of teams is connected.) In practice this isn't a problem for me, because I don't start calculating ratings until about 800 games into the season, and by that time the graph is connected. But it can be a problem in a sport like football, so how do we address this?
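
(As an aside, it's easy to check connectivity directly.  Here's a small union-find sketch, not part of the original derivation, that reuses the games list format defined above.)

def isConnected(games, num_teams):
    # Union-find: every team starts in its own group; each game merges
    # the groups of the two teams involved.  One group left => connected.
    parent = list(range(num_teams))
    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t
    for g in games:
        a, b = find(g[0]), find(g[1])
        if a != b:
            parent[a] = b
    return len(set(find(t) for t in range(num_teams))) == 1

print(isConnected(games, 6))   # True for the sample schedule above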

Massey recommends forcing the graph to be connected by arbitrarily changing one row of $M$ to be all ones, and the corresponding element of $E$ to zero. For example:

In [71]:
print "\nBefore forcing connectivity:\n"
print M
print 
print E
M[9, :] = 1.0
E[9] = 0.0
print "\nAfter forcing connectivity: \n"
print M
print
print E
Before forcing connectivity:

[[ 1.  0.  0. -1.  0.  0.]
 [ 0.  1.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.]
 [-1.  0.  1.  0.  0.  0.]
 [ 0.  0.  1. -1.  0.  0.]
 [-1.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  1. -1.]
 [ 0.  0.  0. -1.  1.  0.]
 [ 0. -1.  0.  0.  0.  1.]
 [ 0.  0.  0.  0. -1.  1.]]

[  3.  21.   5.  13.  13.  25.   8.  15.  21.   8.]

After forcing connectivity: 

[[ 1.  0.  0. -1.  0.  0.]
 [ 0.  1.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.]
 [-1.  0.  1.  0.  0.  0.]
 [ 0.  0.  1. -1.  0.  0.]
 [-1.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  1. -1.]
 [ 0.  0.  0. -1.  1.  0.]
 [ 0. -1.  0.  0.  0.  1.]
 [ 1.  1.  1.  1.  1.  1.]]

[  3.  21.   5.  13.  13.  25.   8.  15.  21.   0.]

Now we can re-run our ratings with this new $M$.

In [72]:
bh = np.linalg.lstsq(M,E)[0]
print bh
[-18.43333333  11.56666667  -1.03333333  -9.63333333   5.36666667
  12.16666667]

As you can see, this has (not surprisingly) changed the ratings. In addition to making the graph connected, this change also pushes the mean of the ratings toward 0.
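
A quick sanity check (not in the original notebook):

print(bh.sum())   # approximately 0: the all-ones row pulls the ratings toward a zero mean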

I prefer a variant approach that adds the connectivity fix as a new "game" instead of overwriting the last real game. This keeps all the actual results in the system and, as we'll see, leaves the ratings unchanged here.

In [73]:
M = buildGamesMatrix(games,6)
E = buildOutcomes(games)
M = np.vstack((M, [1, 1, 1, 1, 1, 1]))
print M
E = np.append(E, 0)
print E
[[ 1.  0.  0. -1.  0.  0.]
 [ 0.  1.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.]
 [-1.  0.  1.  0.  0.  0.]
 [ 0.  0.  1. -1.  0.  0.]
 [-1.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  1. -1.]
 [ 0.  0.  0. -1.  1.  0.]
 [ 0. -1.  0.  0.  0.  1.]
 [ 0.  0.  0.  0. -1.  1.]
 [ 1.  1.  1.  1.  1.  1.]]
[  3.  21.   5.  13.  13.  25.   8.  15.  21.   8.   0.]

And now we'll calculate the ratings again.

In [74]:
bh = np.linalg.lstsq(M,E)[0]
print bh
[-18.55833333  11.74166667  -1.15833333  -9.75833333   5.24166667
  12.49166667]

Now the connectivity fix doesn't affect the ratings. (In this case, anyway. If the graph weren't already connected, or if the ratings didn't already sum to zero, it would still have an effect.)

Moving on, you may have noticed in the previous notebook about the Massey rating that there was no mention of whether teams were playing at home or away. Games were expressed as a winner, a loser and a margin of victory. So let's add home & away. To start with, we'll modify our game representation so that instead of

$[Winner, Loser, Margin]$

we have

$[Home, Away, MOV]$

where $MOV$ is the more familiar margin of victory: the home team's score minus the away team's score. I've modified the games array from the previous posting so that a few games are wins by the Away team. Since Home comes first in the new representation and Margin has simply become MOV, nothing actually changes in the code. So let's calculate the ratings for this new set of games.

In [80]:
games = [[0, 3, 3], [1, 4, 21], [1, 5, -5], [2, 0, 13], [2, 3, 13],
         [3, 0, 25], [4, 5, -8], [4, 3, 15], [5, 1, 21], [5, 4, -8]]
M = buildGamesMatrix(games,6)
E = buildOutcomes(games)
M = np.vstack((M, [1, 1, 1, 1, 1, 1]))
E = np.append(E, 0)

bh = np.linalg.lstsq(M,E)[0]
print bh
[-18.35   9.45  -0.95  -9.55   5.45  13.95]

So we now have Home/Away represented and we have forced connectivity. Another issue we might want to address is that the Massey rating system doesn't have the notion of a "home court advantage". You can see that it predicts the same game outcome for $Team_i$ versus $Team_j$ regardless of whether $Team_i$ is Home or $Team_j$ is Home.

One way we can try to address that shortcoming is to add an HCA constant to the expected outcome equation, like so:

$e_{ha} = R_{home} - R_{away} + C_{HCA}$

Here $C_{HCA}$ is a single constant shared by all games. We can represent it as a new column in $M$ that is 1 in every game row (because the home advantage applies to every game) and 0 in the connectivity row. (Do you see why? $C_{HCA}$ captures a bias in the scoring data, so we don't want to force it to sum to zero along with the ratings.)

In [81]:
M = buildGamesMatrix(games,6)
E = buildOutcomes(games)
# Adjust for connectivity
M = np.vstack((M, [1, 1, 1, 1, 1, 1]))
E = np.append(E, 0)
# Calculate rating & error before HCA
bh = np.linalg.lstsq(M,E)[0]
print "Error rating without HCA:\n"
print sum(map(abs, M.dot(bh)-E))/10
# Adjust for HCA
M = np.hstack((M, [[1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [0]]))
print "\nM with HCA column:\n"
print M
bh = np.linalg.lstsq(M,E)[0]
print "\nRatings with new HCA:\n"
print bh
Error rating without HCA:

8.78

M with HCA column:

[[ 1.  0.  0. -1.  0.  0.  1.]
 [ 0.  1.  0.  0. -1.  0.  1.]
 [ 0.  1.  0.  0.  0. -1.  1.]
 [-1.  0.  1.  0.  0.  0.  1.]
 [ 0.  0.  1. -1.  0.  0.  1.]
 [-1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  0.  0.  1. -1.  1.]
 [ 0.  0.  0. -1.  1.  0.  1.]
 [ 0. -1.  0.  0.  0.  1.  1.]
 [ 0.  0.  0.  0. -1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  0.]]

Ratings with new HCA:

[-12.86923077   4.54615385  -2.39230769  -4.06923077   4.00769231
  10.77692308   6.92307692]

$C_{HCA}$ is the last term in $bh$ and you can see that in this example it is quite substantial. Did adding this make our rating system any more accurate? Let's calculate the error with $C_{HCA}$:

In [83]:
sum(map(abs, M.dot(bh)-E))/10
Out[83]:
7.987692307692309

So yes, in this case adding $C_{HCA}$ to our model resulted in an improvement of about 0.8 in our MAE.

As a last note, if you read Massey's thesis you'll see he's actually solving a set of equations that look like this:

$M^T E = M^T M R$

which is to say that he has multiplied both sides by $M^T$ (the transpose of $M$). These are the familiar normal equations, and the effect is to reduce the number of equations from the number of games to the number of unknowns (here the six team ratings plus the HCA term), as you can see here:

In [86]:
print "M is 11 rows long:\n"
print M
print "\nMt*M is 6 rows long:\n"
print M.T.dot(M)
M is 11 rows long:

[[ 1.  0.  0. -1.  0.  0.  1.]
 [ 0.  1.  0.  0. -1.  0.  1.]
 [ 0.  1.  0.  0.  0. -1.  1.]
 [-1.  0.  1.  0.  0.  0.  1.]
 [ 0.  0.  1. -1.  0.  0.  1.]
 [-1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  0.  0.  1. -1.  1.]
 [ 0.  0.  0. -1.  1.  0.  1.]
 [ 0. -1.  0.  0.  0.  1.  1.]
 [ 0.  0.  0.  0. -1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  0.]]

Mt*M is 7 rows long:

[[  4.   1.   0.  -1.   1.   1.  -1.]
 [  1.   4.   1.   1.   0.  -1.   1.]
 [  0.   1.   3.   0.   1.   1.   2.]
 [ -1.   1.   0.   5.   0.   1.  -2.]
 [  1.   0.   1.   0.   5.  -1.   0.]
 [  1.  -1.   1.   1.  -1.   5.   0.]
 [ -1.   1.   2.  -2.   0.   0.  10.]]

Forming the normal equations doesn't change the least-squares solution, so you get the same result if you solve the new formulation.

In [87]:
bh = np.linalg.lstsq(M.T.dot(M),M.T.dot(E))[0]
print "\nRatings with new formulation:\n"
print bh
Ratings with new formulation:

[-12.86923077   4.54615385  -2.39230769  -4.06923077   4.00769231
  10.77692308   6.92307692]

The advantage (at least in 1997, when Massey wrote his thesis) was that it was easier and faster to solve the smaller set of equations. I'm not sure whether that still matters, and numpy's least-squares solver may well do something equivalent internally. But you'll get the same result either way.
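
One footnote: since $M^T M$ is square and (thanks to the connectivity row and the HCA column) nonsingular here, you could also solve the normal equations with an exact solver instead of the least-squares routine. This isn't in the original notebook; it's just another route to the same numbers.

bh = np.linalg.solve(M.T.dot(M), M.T.dot(E))   # exact solve of the normal equations
print(bh)                                      # same ratings as In [87]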
