Why using R-squared is a bad idea
21 Nov 2015

The coefficient of determination, otherwise known as \(R^2\), is often used to judge whether a model is good. The Wikipedia article says that \(R^2\) “[…] is a number that indicates how well data fit a statistical model – sometimes simply a line or a curve. An \(R^2\) of 1 indicates that the regression line perfectly fits the data, while an \(R^2\) of 0 indicates that the line does not fit the data at all”. From this we could conclude that we can use this measure to tell how good our model is. There are, however, two major problems with that conclusion:
- Sometimes just having a model that is a little better than random guessing is already great.
- Adding useless covariates to the model improves \(R^2\).
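For reference, one common definition of \(R^2\) compares the residual sum of squares of the fitted model to the total sum of squares of the outcome:

\[
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}
\]

In ordinary least squares, adding a covariate can never increase the residual sum of squares on the training data, so \(R^2\) can never go down when we add covariates, no matter how useless they are.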
The first argument is best supported by the example of the stock market: if I have a model that predicts returns just a little bit better than random guessing, I will be rich. The second argument I want to demonstrate with an example.
So we have an outcome \(y\) that depends on a covariate \(x\), but the noise is very high, so our \(R^2\) is pretty low. Let's add further useless covariates to the model, as sketched below.
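The following is a minimal sketch of that experiment in Python (not the original code from this post); the sample size, noise level, and numbers of added covariates are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100

# Outcome y depends on a single covariate x, but the noise is strong,
# so even the correct model only reaches a low R^2.
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=5.0, size=n)

base = LinearRegression().fit(x, y)
print(f"R^2 with x only: {base.score(x, y):.3f}")

# Pre-draw a pool of purely random covariates and add more and more of
# them to the model; the in-sample R^2 can only go up.
noise_cols = rng.normal(size=(n, 90))
for k in (1, 10, 50, 90):
    X = np.hstack([x, noise_cols[:, :k]])
    model = LinearRegression().fit(X, y)
    print(f"R^2 with x and {k:2d} random covariates: {model.score(X, y):.3f}")
```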
And voilà: the more random covariates we add, the better the model looks according to \(R^2\). That does not make sense, right?