# AnalyticBridge

A Data Science Central Community

R squared, also known as the coefficient of determination, is a popular measure of quality of fit in regression. However, it says little about how well our regression model can predict future values. Instead, the PRESS statistic (the predicted residual sum of squares) can be used as a measure of predictive power. The PRESS statistic can be computed in the leave-one-out cross-validation process, by adding up the squared residuals for the cases that are left out. As a reminder, in leave-one-out cross-validation, one case of the data set is used as the testing set and the remaining cases are used as the training set. We iterate this process until every case has served as the testing set.
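The leave-one-out procedure described above can be sketched in R as an explicit loop. This is only a minimal illustration, not necessarily how the figures in this post were obtained: `press_loocv` is a hypothetical helper name, and the demo uses the built-in `mtcars` data rather than the gala data so that it runs without extra packages.

```r
# Leave-one-out PRESS for a linear model, written as an explicit loop.
# `press_loocv` is a hypothetical helper, not a function from any package.
press_loocv <- function(formula, data) {
  n <- nrow(data)
  sq_err <- numeric(n)
  for (i in seq_len(n)) {
    train <- data[-i, , drop = FALSE]   # every case except case i
    test  <- data[i, , drop = FALSE]    # the single held-out case
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    # Extract the observed response for the held-out case from the formula
    obs   <- model.response(model.frame(formula, data = test))
    sq_err[i] <- (obs - pred)^2
  }
  sum(sq_err)                           # sum of squared prediction errors
}

# Illustration on the built-in mtcars data (an assumption, not the gala data)
press_loocv(mpg ~ wt + hp, mtcars)
```

The loop refits the model once per case, which is cheap for a data set of a few dozen rows but can become slow for large ones.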

Here is an example implemented in R, on the gala dataset in the faraway package:

    > library(faraway)
    > data(gala)
    > gala[1:3,]
              Species Endemics  Area Elevation Nearest Scruz Adjacent
    Baltra         58       23 25.09       346     0.6   0.6     1.84
    Bartolome      31       21  1.24       109     0.6  26.3   572.33
    Caldwell        3        3  0.21       114     2.8  58.7     0.78

Model1:

> model1 <- lm(Species ~ Endemics + Area + Elevation, data = gala)

>summary(model1)

....

Residual standard error: 27.29 on 26 degrees of freedom

Multiple R-squared: 0.9492,    Adjusted R-squared: 0.9433

F-statistic: 161.8 on 3 and 26 DF,  p-value: < 2.2e-16

Model2:

> model2 <- lm(Species ~ I(Endemics^2), data = gala)

> summary(model2)

...

Residual standard error: 27.1 on 28 degrees of freedom

Multiple R-squared: 0.946,     Adjusted R-squared: 0.9441

F-statistic:   491 on 1 and 28 DF,  p-value: < 2.2e-16

Model3:

> model3 <- lm(Species ~ Endemics + I(Endemics^2), data = gala)

> summary(model3)

.....

Residual standard error: 22.94 on 27 degrees of freedom

Multiple R-squared: 0.9627,    Adjusted R-squared: 0.9599

F-statistic: 348.5 on 2 and 27 DF,  p-value: < 2.2e-16

Here are the AIC (Akaike information criterion), BIC (Bayesian information criterion), and PRESS statistic of the three models:

Model 1:

>AIC(model1)

289.243

> BIC(model1)

296.249

PRESS(model1)=259520.5

Model 2:

> AIC(model2)

287.0325

> BIC(model2)

291.2361

PRESS(model2)=26382.22

Model 3:

> AIC(model3)

277.9558

> BIC(model3)

283.5606

PRESS(model3)=22567.03

As we can see, the PRESS statistic is roughly ten times smaller (better) for models 2 and 3 than for model 1, even though R squared barely changes: it dips slightly for model 2 (0.946 versus 0.9492) and improves only modestly for model 3 (0.9627). So, according to PRESS, model 3 has the highest predictive power. It is interesting to note that the AIC and BIC also take their best (lowest) values for model 3.
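For readers who want to line up these criteria side by side, here is a rough sketch. It relies on the standard least-squares identity that the leave-one-out residual equals the ordinary residual divided by one minus the leverage, so no refitting is needed. The helper names `press_hat` and `compare_models` are hypothetical, and the demo again uses the built-in `mtcars` data as a stand-in, so its numbers are unrelated to the gala results above.

```r
# PRESS via the leave-one-out identity for least squares:
# deleted residual_i = e_i / (1 - h_ii), where h_ii is the leverage (hat value).
# `press_hat` and `compare_models` are hypothetical helper names.
press_hat <- function(model) {
  sum((residuals(model) / (1 - hatvalues(model)))^2)
}

# Tabulate R squared, AIC, BIC and PRESS for a named list of fitted lm objects.
compare_models <- function(models) {
  data.frame(
    R2    = sapply(models, function(m) summary(m)$r.squared),
    AIC   = sapply(models, AIC),
    BIC   = sapply(models, BIC),
    PRESS = sapply(models, press_hat)
  )
}

# Illustration on the built-in mtcars data (an assumption, not the gala data)
models <- list(
  linear    = lm(mpg ~ wt, data = mtcars),
  quadratic = lm(mpg ~ wt + I(wt^2), data = mtcars)
)
compare_models(models)
```

With the faraway package installed, passing `list(model1 = model1, model2 = model2, model3 = model3)` to `compare_models` should give a table in the same layout for the three gala models.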

If you are interested in how I computed the PRESS statistic doing cross-validation in R, please check my next blog post.


Comments

Comment by Sean Flanigan on May 13, 2013 at 6:27pm

This is an amazing post. Thanks so much. R-Squared discussions tend to launch many bar fights.

Comment by Mirko Krivanek on May 13, 2013 at 11:27am

The ability to predict the future performance, rather than goodness of fit on existing data, is a great advantage. This can be achieved using cross-validation, which your method does in some way, through the leaving-one-out procedure. It would be nice to see a metric that simultaneously addresses

• robustness (R Square and PRESS fail)
• no sensitivity to number of observations (R square fails, not sure about PRESS)
• has predictive power (R square fails, PRESS wins)