# AnalyticBridge

A Data Science Central Community

# Cross-validation in R: a do-it-yourself and a black box approach

In my previous post, we saw that R-squared  can lead to a misleading interpretation of the quality of our regression fit, in terms of prediction power. One thing that R-squared offers no protection against is overfitting. On the other hand, cross validation, by allowing us to have cases in our testing set that are different from the cases in our training set,  inherently offers protection against overfittting.

1.Do-it-yourself leave-one-out cross validation in R.

In this type of validation, one case in our data set is used as the test set, while the remaining cases are used as the training set. We iterate through the data set, until all cases have served as the test set. In order to implement the iteration in R, we introduce an extra column, that is used as an index to identify the leave-out-case. Here is the code, for the gala data set in the faraway package.

> library(faraway)

> gala[1:3,]

Species Endemics  Area Elevation Nearest Scruz Adjacent

Baltra         58       23 25.09       346     0.6   0.6     1.84

Bartolome      31       21  1.24       109     0.6  26.3   572.33

Caldwell        3        3  0.21       114     2.8  58.7     0.78

>c1<-c(1:30)

> gala2<-cbind(gala,c1)

> gala2[1:3,]

Species Endemics  Area Elevation Nearest Scruz Adjacent c1

Baltra         58       23      25.09       346     0.6        0.6        1.84       1

Bartolome   31       21     1.24       109       0.6      26.3     572.33      2

> diff1<-numeric(30)

> for(i in 1:30){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)

+specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]),data=gala2))

+diff1[i]<-gala2[i,1]-specpr }

>summ1<-numeric(1)

>summ1=0

> for(i in 1:30){summ1<-summ1+diff1[i]^2}

> summ1

[1] 259520.5

The do-it-yourself approach allows us to incorporate any computation we like inside the leave-one-out loop. In this case, we computed the difference between the expected and observed value of the dependent variable. Summing these errors squared, gives us variable summ1 which holds the value of the PRESS statistic, which can be used to evaluate the prediction power of our regression.

2.A black box approach to cross-validation.

R offers various packages to do cross-validation. One of them is the DAAG package, which offers a method CVlm(), that allows us to do k-fold cross validation. In this type of validation, the data set is divided into K subsamples. Below, we see 10-fold validation on the gala data set and for the best model in my previous post (model 3).

>library(DAAG)

>model3.daag<-CVlm(df=gala,m=10,form.lm=formula(Species~I(Endemics^2)+Endemics))

Analysis of Variance Table

Response: Species

Df Sum Sq Mean Sq F value Pr(>F)

I(Endemics^2)  1 360522  360522   684.9 <2e-16 ***

Endemics       1   6347    6347    12.1 0.0018 **

Residuals     27  14212     526

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

fold 1

Observations in test set: 3

Caldwell Daphne.Minor Marchena

Predicted      6.407        0.254     58.3

cvpred         3.737       -3.044     58.9

Species        3.000       24.000     51.0

CV residual   -0.737       27.044     -7.9

Sum of squares = 794    Mean square = 265    n = 3

…………………………………………………………

fold 10

Observations in test set: 3

Champion Las.Plazas SanCristobal

Predicted      19.99      19.99        229.2

cvpred         20.52      20.52        221.6

Species        25.00      12.00        280.0

CV residual     4.48      -8.52         58.4

Sum of squares = 3505    Mean square = 1168    n = 3

Overall (Sum over all 3 folds)

ms

770

At the end, CVlm reports the mean square error (based on each test case’s CV residual), which can also be used as a measure of prediction power.

It is interesting to compare the do-it-yourself and black box cross validations, and see how they do in terms of evaluating predictive power. Let’s do this for model3 and model1, of my previous post. If you recall model3 had the best predictive power, as measured by PRESS, and model1, the worst. Here is the comparison:

`Model3: model3<-lm(Species~I(Endemics^2)+Endemics`
`PRESS (model3)=22567.03`
`MS(model3)=770`

`Model1: model1<-lm(Species~Endemics+Area+Elevation)`
`PRESS (model1)=259520.5`
`MS(model1)=8644`

We see that both methods produced the smallest error for model3, and in addition, in both cases, the error of model3 is approximately an order of magnitude smaller than that of model 1.

Views: 33048

Comment

Join AnalyticBridge

Comment by Alex Zolot on May 26, 2013 at 11:45pm

>summ1<-numeric(1)

>summ1=0

> for(i in 1:30){summ1<-summ1+diff1[i]^2}

> summ1

more R-stile and more effective is

summ1= sum(diff1^2)

specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]),data=gala2))

I would write

specpr<-predict(model1, gala2[i,])

Comment by Arthur Aguirre on May 23, 2013 at 8:44am

I intend to implement this in several working regression models so very useful.

Thanks!