Friday, September 25, 2015

Mysterious Errors in Regression

I mentioned the other day that I'd been wrestling with a bug in the predictor.  I found a situation where I had generated some new features for my model and they caused a large drop in accuracy.  Here's a picture that captures the problem:
This is a plot of the model's error (as measured by cross-validation) as more features are added.  You can see the puzzling behavior around feature #25.  With the addition of one feature, the error jumps from around 100 to about 11000 (!), which is to say the accuracy plummets.  At first I thought there must have been missing values, corruption, or some other problem with the feature, but examination shows nothing ill-behaved or broken in it.  What's more, if I eliminate the offending feature from the data, the same thing happens with some other feature.
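For anyone who wants to reproduce this kind of curve, here is a minimal sketch of the experiment, not my actual code.  The helper name `cv_error_by_feature_count`, the `alpha` and `folds` values, and the synthetic data are all assumptions made up for illustration.  It fits a Ridge model on the first k columns for each k and records the cross-validated error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def cv_error_by_feature_count(X, y, alpha=1.0, folds=5):
    """Cross-validated RMSE using the first k columns of X, for k = 1..n_features."""
    errors = []
    for k in range(1, X.shape[1] + 1):
        # scikit-learn reports negated MSE so that larger is always better.
        scores = cross_val_score(
            Ridge(alpha=alpha), X[:, :k], y,
            cv=folds, scoring="neg_mean_squared_error",
        )
        errors.append(float(np.sqrt(-scores.mean())))
    return errors


# Synthetic stand-in data (hypothetical, not the dataset from this post).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
true_weights = np.array([3.0, -2.0, 1.5, 0.5, 4.0])
y = X[:, :5] @ true_weights + rng.normal(scale=0.1, size=200)

errors = cv_error_by_feature_count(X, y)

# np.diff(errors)[i] is the change when going from i+1 to i+2 features,
# so the feature count with the largest increase is argmax + 2.
worst = int(np.argmax(np.diff(errors))) + 2
print("largest error increase when adding feature #", worst)
```

On well-behaved data like this the curve drops and then stays flat, so a sudden spike like the one in my plot would stand out right away in `errors`.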

The model I'm using here is the standard Ridge Regression model from Scikit-Learn.  I'm dubious that any single attribute could cause this much inaccuracy, but even so, the fit should have driven that attribute's weight down to almost nothing.  And the fact that this happens with multiple features suggests to me that it's a pervasive problem in my data or in the model.
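To make that expectation concrete, here is a small sketch on made-up data (the variable names and the `alpha=1.0` setting are assumptions, not taken from my predictor).  It fits Ridge on one informative column and one pure-noise column.  Note that Ridge's L2 penalty shrinks a useless feature's coefficient toward zero but generally does not set it exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 3.0 * signal + rng.normal(scale=0.1, size=n)

X = np.column_stack([signal, noise_feature])
model = Ridge(alpha=1.0).fit(X, y)
signal_coef, noise_coef = model.coef_

# The informative feature keeps a weight near its true value of 3,
# while the noise feature ends up with a tiny (but nonzero) weight.
print("signal coefficient:", signal_coef)
print("noise coefficient: ", noise_coef)
```

If a feature in the real data gets a large coefficient despite looking uninformative, that would be one place to start digging.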

At any rate, I was unable to figure it out and moved on to other things.  But I'm still curious about what could be going on here.  Maybe there's something obvious I'm missing.
