Saturday, January 31, 2015

Strength of Schedule & Adjusted Statistics: An Unexpected Update

I've removed my previous posts on Strength of Schedule and adjusted statistics from the blog because I now believe that the method I was proposing does not always converge -- in some cases it appears that the adjusted statistics will end up with one team at 1.0 and all the others at 0.0 -- not a particularly useful approach!

That will teach me to post before I've implemented -- although it also goes to show that the process of experimentation is bound to be fraught with problems.  I usually avoid describing all the pitfalls, but the process of discovery is rarely as clean as the final product.


UPDATE

After some more work, I think I understand the conditions under which the approach converges, and I will go back, fix the original postings and put them up again.

Friday, January 30, 2015

Strength of Schedule & Adjusted Statistics (Part 3)

In a previous post, I described how to calculate a "Strength of Schedule" measure for an arbitrary statistic, along with one way you might apply it to create an "adjusted" statistic:

`S_"adj"(T) = (S(T))/(SoS(T,S))`

This adjusted statistic is a way of normalizing statistics between teams when they have played different opponents.

One shortcoming of this approach is that the Strength of Schedule is itself based upon the unadjusted statistic:

`SoS(T,S) = (1/(n*m)) sum_(i="opponents"(T))^n sum_(j="opponents"(i))^m (j ne T) S(j)`

So to some extent this measure is biased by the quality of a team's opponents' opponents' opponents (if you can follow that!).  This bias may not be significant -- certainly by the middle of the season, the set of third-level opponents for a given team is pretty big, so there's going to be a lot of overlap between teams.  But each team's opponents are dominated by conference foes, and there are sizable differences in strength between conferences, so the bias may well matter.

The obvious way to eliminate this bias is to use the adjusted statistic in the SoS calculation.  The whole purpose of the adjusted statistic is to normalize away these biases.

`SoS(T,S) = (1/(n*m)) sum_(i="opponents"(T))^n sum_(j="opponents"(i))^m (j ne T) S_"adj"(j)`

However, we've now created a circular definition!  The adjusted statistic for a team depends upon its opponents' opponents' adjusted statistics -- and those in turn depend upon still other teams' adjusted statistics, and so on.

But just because the definition is circular doesn't mean we can't compute its value.  One approach is to guess an initial value for the adjusted statistic (say, the unadjusted value), recalculate all the adjusted statistics from that guess, and then repeat the calculation until the values converge on an answer.

NOTE: A system of non-linear equations is not guaranteed to have a solution.  For this and other reasons, the iterative approach described here is not guaranteed to converge for every system.  In this case, I believe the solution will converge if the system has "full support", which for NCAA basketball should be true after about 500 games.  Testing seems to bear this out -- the iteration has converged on data from each of the past five seasons -- but I haven't proven that it always will.  It also turns out that the normalization step shown below isn't strictly necessary for convergence, but I'm leaving it in because the small example below doesn't converge without it.
 
Let's take a look at how this works.  For this experiment, we have a small league of three teams.  They've all played each other once, and here are their unadjusted statistics for 3 point shooting:

Team     Unadj
Blue     34%
Gold     24%
Silver   30%

Now we will calculate the Strength of Schedule for each team.  Since each team has played the other two once, each team's opponents' opponents (excluding the team itself) are just the other two teams.  (I'll wait while you confirm that.)  We'll use the Unadjusted statistic as our first guess for the Adjusted statistic, so each team's SoS is just the average of the other two teams' statistics.

Team     Unadj   SoS(1)
Blue     34%     0.27
Gold     24%     0.32
Silver   30%     0.29

The Adjusted Statistic is then the Unadjusted Statistic divided by the SoS.

Team     Unadj   SoS(1)   Adj(1)
Blue     34%     0.27     1.26
Gold     24%     0.32     0.75
Silver   30%     0.29     1.03

Finally, we take the Adjusted Statistic and normalize it so that it sums to one to create the Normalized Adjusted Statistic:

Team     Unadj   SoS(1)   Adj(1)   Nadj(1)
Blue     0.34    0.27     1.26     0.41
Gold     0.24    0.32     0.75     0.25
Silver   0.30    0.29     1.03     0.34

Now let's repeat that calculation again for a second iteration.

Team     Unadj   SoS(1)   Adj(1)   Nadj(1)   SoS(2)   Adj(2)   Nadj(2)
Blue     0.34    0.27     1.26     0.41      0.29     1.16     0.43
Gold     0.24    0.32     0.75     0.25      0.38     0.64     0.24
Silver   0.30    0.29     1.03     0.34      0.33     0.91     0.34

The Normalized Adjusted Statistics still changed a little bit, so let's do a third iteration.

Team     Unadj   SoS(1)   Adj(1)   Nadj(1)   SoS(2)   Adj(2)   Nadj(2)   SoS(3)   Adj(3)   Nadj(3)
Blue     0.34    0.27     1.26     0.41      0.29     1.16     0.43      0.29     1.19     0.44
Gold     0.24    0.32     0.75     0.25      0.38     0.64     0.24      0.38     0.63     0.23
Silver   0.30    0.29     1.03     0.34      0.33     0.91     0.34      0.33     0.90     0.33

Still a little bit of change, so we do another iteration.

Team     Unadj   SoS(1)   Adj(1)   Nadj(1)   SoS(2)   Adj(2)   Nadj(2)   SoS(3)   Adj(3)   Nadj(3)   SoS(4)   Adj(4)   Nadj(4)
Blue     0.34    0.27     1.26     0.41      0.29     1.16     0.43      0.29     1.19     0.44      0.28     1.21     0.44
Gold     0.24    0.32     0.75     0.25      0.38     0.64     0.24      0.38     0.63     0.23      0.38     0.62     0.23
Silver   0.30    0.29     1.03     0.34      0.33     0.91     0.34      0.33     0.90     0.33      0.33     0.90     0.33

And voila!  The Normalized Adjusted Statistics (and the SoS) have converged (at least to two decimal places).  And a good thing, too, because that table was getting very wide.  If you compare the adjusted statistics to the unadjusted statistics, you see that Blue is a much better 3 point shooting team than the unadjusted statistic shows, Gold is slightly worse, and Silver is essentially unchanged.
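
To make the iteration concrete, here's a minimal Python sketch of the calculation, using the Blue/Gold/Silver league above.  (This is just an illustration put together for this post -- toy data structures and team names, not my actual implementation.)

    # The three-team league from the example: everyone has played everyone once.
    opponents = {"Blue":   ["Gold", "Silver"],
                 "Gold":   ["Blue", "Silver"],
                 "Silver": ["Blue", "Gold"]}
    unadj = {"Blue": 0.34, "Gold": 0.24, "Silver": 0.30}

    adj = dict(unadj)          # initial guess: the unadjusted statistic
    while True:
        # SoS = average adjusted statistic of a team's opponents' opponents,
        # excluding the team itself.
        sos = {}
        for t in opponents:
            oo = [j for i in opponents[t] for j in opponents[i] if j != t]
            sos[t] = sum(adj[j] for j in oo) / len(oo)
        # Adjusted = unadjusted / SoS, then normalized so the values sum to one.
        new_adj = {t: unadj[t] / sos[t] for t in opponents}
        total = sum(new_adj.values())
        new_adj = {t: v / total for t, v in new_adj.items()}
        # Stop when the values have settled down to about two decimal places.
        change = max(abs(new_adj[t] - adj[t]) for t in adj)
        adj = new_adj
        if change < 0.01:
            break

    print({t: round(v, 2) for t, v in adj.items()})
    # {'Blue': 0.44, 'Gold': 0.23, 'Silver': 0.33} -- the same values as the table above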

So there you have it -- a method for adjusting a non-symmetric statistic to account for strength of schedule, and a way to compute it.  In the next post, I'll share some thoughts on how to create an efficient implementation of this method.

Wednesday, January 28, 2015

Strength of Schedule & Adjusted Statistics (Part 2)

In a previous post, I talked about why Strength of Schedule (SoS) is important to interpreting a team's statistical performance, and I briefly described the standard SoS approach taken by Ken Pomeroy, SevenOvertimes, and others.

To briefly review, the standard approach for calculating SoS for a statistic like winning percentage (WP) for a team (T) is to average the winning percentage of all of a team's opponents:

`SoS(T) = (1/n) sum_(i="opponents"(T))^n WP(i)`

To use this SoS measure to interpret the original statistic, we could create an adjusted statistic:

`WP_"adj"(T) = WP(T) * SoS(T)`

This works fine for symmetric statistics like winning percentage, where a win for you means a loss for your opponent.  Unfortunately, winning percentage (and other won-loss stats) are about the only statistics with this property.  Most statistics are like three point percentage, where a team's performance is mostly unrelated to how well its opponent does the same thing.  Instead, there's an offense-defense aspect to the statistic, and to interpret it you want to know how well the opponent does at defending the statistic.  However, there's not usually a corresponding defensive statistic (e.g., "3 PT defense"), so we have to derive the opponent's defensive strength by looking at how well the opponent has done at stopping other teams.  So in the case of three point percentage, we want to know how well a team's opponents have done at stopping the three pointer.
 
We calculate the Strength of Schedule by averaging the performance of the team's opponents' opponents:

`SoS(T) = (1/(n*m)) sum_(i="opponents"(T))^n sum_(j="opponents"(i))^m 3PT%(j)`

There's actually one more little wrinkle: we want to exclude the original team from its opponents' opponents.

`SoS(T) = (1/(n*m)) sum_(i="opponents"(T))^n sum_(j="opponents"(i))^m (j ne T) 3PT%(j)`

For example, suppose that Louisville is shooting 32% from the arc.  If the teams Louisville has played have held all their opponents to an average three point percentage of only 24%, then Louisville's 32% is more impressive.  Conversely, if Louisville's opponents have allowed the teams they played to average 48%, then Louisville's 32% looks less impressive.

Note that this SoS measure is backwards from the typical one used for symmetric statistics.  In this case, a smaller SoS indicates tougher competition.   (This all assumes that "bigger is better" for our statistic.  If we have a statistic where you want to have a low number, such as turnovers, everything flips.)

We can capture this analytically as an adjusted statistic (using S for a generic statistic, and assuming that for S bigger is better):

`S_"adj"(T) = (S(T))/(SoS(T,S))`

To return to the Louisville example, if the strength of schedule is 24%, then Louisville's adjusted 3PT% is 1.33.  But if the strength of schedule is 48%, then Louisville's adjusted 3PT% is only 0.67.

As should be obvious from that example, these adjusted statistics don't have any intrinsic meaning.  They're just numbers, where bigger is better.  But they can be used to compare teams, and they may be more useful than the original statistic for prediction because they provide a common measure even when teams haven't faced the same opponents.
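
Here's a minimal sketch of that calculation in Python.  The schedule and 3PT% values are made up for illustration -- this isn't Louisville's real schedule or my actual code.

    # Made-up schedule and raw 3PT% values, just for illustration.
    opponents = {"Louisville": ["Team A", "Team B"],
                 "Team A":     ["Louisville", "Team B"],
                 "Team B":     ["Louisville", "Team A"]}
    three_pt = {"Louisville": 0.32, "Team A": 0.28, "Team B": 0.36}

    def sos(team):
        # Average the statistic over the team's opponents' opponents,
        # excluding the team itself from that set.
        oo = [j for i in opponents[team] for j in opponents[i] if j != team]
        return sum(three_pt[j] for j in oo) / len(oo)

    def adjusted(team):
        # Bigger is better: a smaller SoS (tougher schedule) boosts the result.
        return three_pt[team] / sos(team)

    print(round(sos("Louisville"), 2), round(adjusted("Louisville"), 2))   # 0.32 1.0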

One problem that we haven't yet addressed is that this SoS measure only goes one level deep.  Maybe Louisville's opponents held teams to 24% three point shooting because they played a bunch of teams that were terrible three point shooters.  I'll address that in Part 3.

Tuesday, January 27, 2015

Strength of Schedule & Adjusted Statistics (Part 1)

The usual method of predicting games is to look at a team's past performance -- usually expressed as a statistical value such as "winning percentage" -- and use that to estimate future performance.  But this approach is problematic, because the same statistical value doesn't mean the same thing for every team.  In the case of winning percentage, two teams with a winning percentage of 84% are not necessarily equivalent.  Louisville's 16-3 record, with losses to #1 Kentucky, #4 Duke and #18 UNC, is not the same as Dayton's 16-3 record, with losses to #17 Connecticut, Arkansas and Davidson.

This problem arises because college basketball is a case of incomplete pairwise comparison.  If every team played every other team twenty times, by the end of the (admittedly long) season, winning percentage would be a pretty good measure of team strength.  But that's never going to happen, so we need other ways to compensate for this weakness.  One of the simplest is to calculate a "Strength of Schedule" and use that to interpret the statistic.

In its simplest form, Strength of Schedule (SoS) is calculated as the average of the same statistic across all of a team's opponents.  So if we were looking at winning percentage, SoS would be calculated by averaging the winning percentages of all of a team's opponents.  Louisville might have a SoS of .57 (meaning its opponents have won 57% of their games) while Dayton has a SoS of .51 (meaning its opponents have won only 51% of their games).  In light of this, we could say that Louisville's 84% winning percentage is better than Dayton's 84% winning percentage.
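
In code, the simple version is just an average.  (Made-up opponent records here, not Louisville's actual schedule.)

    # Made-up winning percentages for a team's opponents.
    opponent_wp = {"Opp A": 0.70, "Opp B": 0.55, "Opp C": 0.46}

    # The simple SoS is the average winning percentage of the opponents.
    sos = sum(opponent_wp.values()) / len(opponent_wp)
    print(round(sos, 2))   # 0.57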

There are several shortcomings with this definition of SoS.

First, it doesn't always make sense to measure Strength of Schedule using the same statistic.  For example, suppose we're looking at "3 Pt Shooting Percentage".  In this case, SoS would tell us how well our opponents shot the three-pointer.  That doesn't make a lot of sense.  How well our opponent shot the three doesn't affect how well we shot the three.  In this case, we really want to know how well our opponent's opponents shot the three (if you can follow that thought).  The simplistic form of SoS only makes sense for symmetric statistics -- where a plus for one team is automatically a minus for the other team -- such as winning percentage, where a win for you necessarily means a loss for the other team.

Even for symmetric statistics, there are problems with this view of SoS.  One is that we've only pushed off the problem one level by looking at a team's opponents.  To return to the previous example, Louisville's opponents seem better because they have better records than Dayton's opponents.  But maybe that's just because Louisville's opponents themselves played weak teams. This is the problem that RPI tries to address by looking at opponents and opponents' opponents.  Two layers is pretty good, and RPI is a much better metric than straight Winning Percentage.  Of course, you can take it deeper.

In general, many of the more sophisticated rating systems (e.g., Massey) can be viewed as different  approaches to extending Strength of Schedule as deep as possible.  I'm not sure there's a "right" answer to measuring Strength of Schedule, but it seems clear that the general idea -- to adjust or interpret statistics based upon a team's opponents -- is valuable.

Top Twenty (1/27)

I've got a couple of topics I'm trying to write up, but in the meantime a quick Top Twenty:

Rank  Team            Rating
1     Kentucky        31.67
2     Ohio State      30.29
3     Virginia        30.27
4     Wisconsin       30.05
5     Duke            30.04
6     Utah            29.80
7     Gonzaga         29.77
8     Notre Dame      29.72
9     North Carolina  29.58
10    Louisville      29.35
11    Arizona         29.35
12    Villanova       29.11
13    Texas           29.01
14    West Virginia   28.88
15    Butler          28.60
16    Oklahoma        28.52
17    Michigan State  28.44
18    Baylor          28.39
19    Wichita State   28.33
20    Indiana         28.29

The amazing thing here is that the gap between #1 and #2 is almost as big as the gap between #2 and #20.  Kentucky's really way better than the rest of the field.  And if not for Kentucky, we'd have a dead heat for the four #1 seeds for the tournament.

Overall, the Top Twenty has been pretty stable for the last three weeks.  Some shuffling around, but only Wichita State is new.  (Pushing off Maryland.)

Friday, January 23, 2015

Reversing the Data?

Over on Reddit's machine learning subreddit, a user posed this question (paraphrasing):
For my basketball predictor, I could double the number of training examples by creating new examples with the Home and Away teams swapped.  Would it make sense to do that?
That's an interesting thought.

The first concern is that you can't just swap Home and Away.  No one is entirely sure why, but home teams play differently (and better) than away teams.  In NCAA basketball, the Home Court Advantage (HCA) is between 4 and 5 points.  What's more, all of a team's statistics are typically different for their home games.  There's sometimes even a home court effect for neutral site games.  In particular, the "home" team in NCAA Tournament games does better than you would expect -- even when you adjust for them being the better team.  (In the Tournament, the better-seeded team in the matchup is the home team.)  If you just blindly swap Home and Away, the newly created games are going to be bad training examples because they won't reflect what would happen in a real game with a swapped matchup.

But let's assume that there is no difference between Home and Away.  In that case would it be valuable to swap Home and Away to create new training examples?

Let's look at a simple case.  Suppose that our training examples consist of a strength rating for each team (S1, S2) and the margin of victory (MOV).  Here's our very small training set:
S1, S2, MOV 
20, 18, 2
14, 28, -14
Based upon this training data, we'll build a (perfect) model that predicts MOV = S1 - S2.

Now let's augment the training set by adding new games where S1 and S2 are swapped and MOV is negated.
S1, S2, MOV 
20, 18, 2
14, 28, -14
18, 20, -2
28, 14, 14
Based upon the new training data, we'll build the exact same model!

That happens because we didn't add any new information to our training set.  The new examples just repeated existing information, transformed in a way that was irrelevant to our model.
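
Here's a quick demonstration using the toy data above and a simple least-squares fit.  (numpy is used just for illustration -- this isn't the original poster's setup or my predictor.)

    import numpy as np

    # Original training set: columns are S1, S2; targets are MOV.
    X = np.array([[20.0, 18.0], [14.0, 28.0]])
    y = np.array([2.0, -14.0])

    # Augmented set: swap S1/S2 and negate MOV for each game.
    X_aug = np.vstack([X, X[:, ::-1]])
    y_aug = np.concatenate([y, -y])

    # Least-squares fit of MOV = a*S1 + b*S2 (no intercept) on each set.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    coef_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

    print(coef)       # approximately [ 1. -1.]  ->  MOV = S1 - S2
    print(coef_aug)   # approximately [ 1. -1.]  ->  the same model; no new information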

I suspect this will always happen if (1) the data transformation is perfect, and (2) the machine learning model can "see through" the transformation.

An example of modifying the training set where the data transformation is not perfect is boosting.  In boosting, we duplicate SOME of the training examples, and that leads to a different model.  If we duplicate all of the training examples, the transformation is perfect and there's no change to the learned model.  It's only duplicating some of the examples that is key to boosting.

A situation where the machine learning model cannot "see through" the transformation is using non-linear transformations with a linear model.  For example, if we have a data set where the data is arranged in a circle, a linear regression will not model the data well, because it always produces a straight line.  However, if we apply a polar coordinate transformation to the data, the circle becomes a line, and the linear regression will be much more effective.
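
As a toy illustration of that point (entirely made-up data): fit a straight line to points scattered around a circle, then fit the same points after converting to polar coordinates.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 500)
    r = 5 + rng.normal(0, 0.1, 500)          # points scattered around a circle of radius 5
    x, y = r * np.cos(theta), r * np.sin(theta)

    def rmse_of_linear_fit(u, v):
        # Least-squares fit v ~ a*u + b, and the RMSE of that fit.
        a, b = np.polyfit(u, v, 1)
        return np.sqrt(np.mean((v - (a * u + b)) ** 2))

    print(rmse_of_linear_fit(x, y))          # large: a straight line can't follow a circle
    print(rmse_of_linear_fit(theta, r))      # small: in polar coordinates the data is nearly a line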

My guess is that swapping Home & Away to create new training examples isn't an effective technique for NCAA Basketball.  However, I encouraged the original poster to go ahead and give it a try.  It's a fairly easy experiment, and who knows, my intuitions may be wrong!

(And if they are, I'm totally stealing the technique.)

Monday, January 5, 2015

Top Twenty (1/5)

I had email today asking about my Top Twenty.  I've been lax about posting it this year, but here are the ratings as of today:

Position  AP  Team            Rating
1         1   Kentucky        31.70
2         22  Ohio State      30.72
3         2   Duke            30.62
4         3   Virginia        30.18
5         4   Wisconsin       30.00
6         18  North Carolina  29.80
7         13  Notre Dame      29.78
8         5   Louisville      29.61
9         6   Gonzaga         29.51
10        9   Utah            29.24
11        10  Texas           29.23
12        8   Villanova       29.06
13        14  West Virginia   29.05
14        NR  Butler          28.62
15        7   Arizona         28.61
16        16  Oklahoma        28.60
17        NR  Illinois        28.45
18        NR  Michigan State  28.42
19        NR  Baylor          28.40
20        11  Maryland        28.30

I've included the AP rankings for comparison.  Kentucky is very strong -- as much better than #2 as #2 is than #7.  The PM likes Ohio State, North Carolina and Notre Dame much more than the poll does.  Conversely, it doesn't think as much of Arizona, Villanova or Maryland.  Butler is probably the biggest darkhorse on this list, although they did receive votes in the AP poll.

The Effect of Additional Data on Performance

I've been wondering whether having more training data (i.e., additional seasons of games) would further improve my predictor.  This is problematic, because I already have data back to when the 3 point shot was introduced in the 2009-2010 season, so I can't actually get any more usable data.  But the question persisted, so I did a quick and dirty experiment to try to characterize how much improvement I'll see with additional data.

I trained a model on differing amounts of training data and tested it on the entire training set.  Ideally, I'd do this as some sort of cross-fold validation, picking different slices of the data for training, but I didn't want to spend the time that would require, so I just did each trial once.  So there's necessarily a lot of fuzziness in these results, but I still think the result is instructive.  The plot of error versus amount of training data looks like this:

[Plot: prediction error vs. number of training examples]

That's error along the Y axis and number of training examples along the X.  You can see that error falls fairly steeply for the first 10K or so training examples and then begins to level off.  (Although it continues to slowly decrease.)  Eyeballing this chart suggests that additional data isn't likely to provide any big improvement.
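
If you want to run a similar quick-and-dirty experiment on your own data, the procedure is just: train on the first N games, evaluate on everything, and increase N.  Here's a sketch with stand-in data (random features and a plain least-squares model -- not my actual features or predictor):

    import numpy as np

    # Stand-in data: 30,000 "games", 200 features, and a noisy margin of victory.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(30000, 200))
    y = X @ rng.normal(size=200) + rng.normal(0.0, 8.0, 30000)

    # Train on the first n games, evaluate on the full set, and watch the error level off.
    for n in [500, 1000, 2000, 5000, 10000, 15000, 30000]:
        w, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)
        print(n, round(float(np.mean(np.abs(X @ w - y))), 2))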

If you're building your own predictor, this suggests that you should try to get at least 15K games for training data.  Depending upon how many games you throw out from the early season, that's around 3 full seasons of games.

This also shows the folly of trying to build a Tournament predictor based upon past Tournament games.  At 63 games a year, you'd need about 238 years of Tournament results to get a decent error rate :-).

The Prediction Tracker

This is just a quick note to say that this season I've been submitting predictions to The Prediction Tracker.  This site tracks the performance of a number of rating / prediction sites.  Current standings as of today look like this:

                                                                  Mean        Straight    Against
                           Straight  Against    Mean     Mean     Square        Up        Spread
                              Up     Spread     Error    Bias     Error       W     L     W     L

  Opening Line               0.77321  0.52652  8.82583 -0.12012  150.521    1016  298    268   241
  Dokter Entropy             0.76173  0.52541  8.79621 -0.15383  126.594    1055  330    703   635
  Line                       0.76170   .       8.63633 -0.07785  121.524    1042  326      .     .
  Ashby AccuRatings          0.74982  0.50901  8.86946  0.51996  127.529    1034  345    650   627
  StatFox                    0.74982  0.50000  9.15083  0.18088  135.416    1034  345    629   629
  System Average             0.74946  0.49440  8.92108  0.07232  128.329    1038  347    662   677
  Sonny Moore                0.74764  0.51311  9.30695  1.43590  139.909    1031  348    685   650
  NetProphet                 0.74661  0.53379  8.92028 -0.17446  126.397     551  187    387   338
  Sagarin Rating             0.74619  0.49625  9.16057 -0.02549  134.504    1029  350    661   671
  DRatings.com               0.74566  0.49234  9.43660 -1.43057  147.705     988  337    611   630
  Sagarin Predictor          0.74547  0.49737  9.11497  0.02793  133.947    1028  351    663   670
  ComPughter Ratings         0.74032  0.51709  9.53151  0.20241  148.497     841  295    575   537
  Sagarin Golden Mean        0.72806  0.49512  9.81426  0.07640  154.552    1004  375    659   672
  Sagarin Elo Score          0.71501  0.50113  10.2929  0.17347  169.295     986  393    668   665

I optimize my predictor on RMSE, so I'm pleased to see that I have the best performance in that metric of the tracked predictors.  I'm also doing the best of the predictors against the spread, although that performance is a little higher than I'd expect from my own testing so I won't be surprised if that trends down.  It's interesting to note that my predictor is about middle-of-the-pack for predicting the winner straight up and also not very good on Mean Error.

I only submit picks once a week on Monday, so that hurts my performance a little bit.  The predictions for the Saturday games are five or six days stale, which probably costs me 0.1 or so in RMSE.  (But for all I know the other predictors have the same problem.)