A Data Science Central Community
In my previous post, we saw that R-squared can lead to a misleading interpretation of the quality of our regression fit, in terms of prediction power. One thing that R-squared offers no protection against is overfitting. On the other hand, cross validation, by allowing us to have cases in our testing set that are different from the cases in our training set, inherently offers protection against overfittting.
1.Do-it-yourself leave-one-out cross validation in R.
In this type of validation, one case in our data set is used as the test set, while the remaining cases are used as the training set. We iterate through the data set, until all cases have served as the test set. In order to implement the iteration in R, we introduce an extra column, that is used as an index to identify the leave-out-case. Here is the code, for the gala data set in the faraway package.
> library(faraway)
> gala[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
>c1<-c(1:30)
> gala2<-cbind(gala,c1)
> gala2[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent c1
Baltra 58 23 25.09 346 0.6 0.6 1.84 1
Bartolome 31 21 1.24 109 0.6 26.3 572.33 2
> diff1<-numeric(30)
> for(i in 1:30){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)
+specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]),data=gala2))
+diff1[i]<-gala2[i,1]-specpr }
>summ1<-numeric(1)
>summ1=0
> for(i in 1:30){summ1<-summ1+diff1[i]^2}
> summ1
[1] 259520.5
The do-it-yourself approach allows us to incorporate any computation we like inside the leave-one-out loop. In this case, we computed the difference between the expected and observed value of the dependent variable. Summing these errors squared, gives us variable summ1 which holds the value of the PRESS statistic, which can be used to evaluate the prediction power of our regression.
2.A black box approach to cross-validation.
R offers various packages to do cross-validation. One of them is the DAAG package, which offers a method CVlm(), that allows us to do k-fold cross validation. In this type of validation, the data set is divided into K subsamples. Below, we see 10-fold validation on the gala data set and for the best model in my previous post (model 3).
>library(DAAG)
>model3.daag<-CVlm(df=gala,m=10,form.lm=formula(Species~I(Endemics^2)+Endemics))
Analysis of Variance Table
Response: Species
Df Sum Sq Mean Sq F value Pr(>F)
I(Endemics^2) 1 360522 360522 684.9 <2e-16 ***
Endemics 1 6347 6347 12.1 0.0018 **
Residuals 27 14212 526
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fold 1
Observations in test set: 3
Caldwell Daphne.Minor Marchena
Predicted 6.407 0.254 58.3
cvpred 3.737 -3.044 58.9
Species 3.000 24.000 51.0
CV residual -0.737 27.044 -7.9
Sum of squares = 794 Mean square = 265 n = 3
…………………………………………………………
fold 10
Observations in test set: 3
Champion Las.Plazas SanCristobal
Predicted 19.99 19.99 229.2
cvpred 20.52 20.52 221.6
Species 25.00 12.00 280.0
CV residual 4.48 -8.52 58.4
Sum of squares = 3505 Mean square = 1168 n = 3
Overall (Sum over all 3 folds)
ms
770
At the end, CVlm reports the mean square error (based on each test case’s CV residual), which can also be used as a measure of prediction power.
It is interesting to compare the do-it-yourself and black box cross validations, and see how they do in terms of evaluating predictive power. Let’s do this for model3 and model1, of my previous post. If you recall model3 had the best predictive power, as measured by PRESS, and model1, the worst. Here is the comparison:
Model3: model3<-lm(Species~I(Endemics^2)+Endemics
PRESS (model3)=22567.03
MS(model3)=770
Model1: model1<-lm(Species~Endemics+Area+Elevation)
PRESS (model1)=259520.5
MS(model1)=8644
We see that both methods produced the smallest error for model3, and in addition, in both cases, the error of model3 is approximately an order of magnitude smaller than that of model 1.
Comment
Instead of
>summ1<-numeric(1)
>summ1=0
> for(i in 1:30){summ1<-summ1+diff1[i]^2}
> summ1
more R-stile and more effective is
summ1= sum(diff1^2)
Instead of
specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]),data=gala2))
I would write
specpr<-predict(model1, gala2[i,])
I intend to implement this in several working regression models so very useful.
Thanks!
© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge