A Data Science Central Community
While carrying out logistic regression, the model with the most significant variable (based on p-values) removed gives the highest accuracy, but its residual deviance is no better than that of the null model. Which criterion should be followed: the model with the best accuracy or the model with the lowest residual deviance? That is, should we remove variables if the accuracy improves, or remove them if the residual deviance becomes lower? Any reference would be useful.
Hi Mine the data,
can you please explain what you mean by 'residual deviance'?
Deviance, as I understand it, in logistic regression is the -2 log L chi-square test (or the related AIC/BIC/score tests), right?
It is used to test the hypothesis that removing a variable makes a significant difference to the fitted logit.
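To make the -2 log L idea concrete, here is a minimal sketch in plain Python. The labels and fitted probabilities are made up for illustration, the helper names are my own, and the p-value helper only covers the case where exactly one parameter is dropped (chi-square with 1 degree of freedom).

```python
import math

def binomial_deviance(y, p):
    """Deviance = -2 * log-likelihood for 0/1 labels y and fitted probabilities p."""
    loglik = sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, p)
    )
    return -2.0 * loglik

def null_deviance(y):
    """Deviance of the intercept-only model, which predicts the base rate for every row."""
    base_rate = sum(y) / len(y)
    return binomial_deviance(y, [base_rate] * len(y))

def lr_test_pvalue_df1(dev_reduced, dev_full):
    """Upper-tail chi-square(1) probability of the deviance difference (one parameter dropped)."""
    diff = max(dev_reduced - dev_full, 0.0)
    return math.erfc(math.sqrt(diff / 2.0))

# Hypothetical labels and fitted probabilities from a one-predictor model.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_full = [0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4]

dev_null = null_deviance(y)
dev_resid = binomial_deviance(y, p_full)
print("null deviance:", round(dev_null, 3))
print("residual deviance:", round(dev_resid, 3))
print("p-value (df=1):", lr_test_pvalue_df1(dev_null, dev_resid))
```

If the residual deviance is barely below the null deviance, the p-value will be large, which is another way of saying the predictors add little over the intercept alone.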
Residual analysis, however, is used for a different purpose:
1) testing the multivariate normal assumption
2) testing/observing heteroskedasticity of residuals (i.e. whether the variance differs across ranges of the predicted odds)
3) getting a feeling for the noise (i.e. mean, median, std, range, etc.)
To test the quality of your classification you need other tests:
ROC area under the curve
F1 score - harmonic mean of precision and recall (sensitivity)
Hosmer & Lemeshow GoF test
KS - D statistic and GoF test
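As a rough illustration of three of the measures listed above (ROC AUC, F1 and the KS D statistic), here is a small sketch in plain Python. The scores and labels are invented, the function names are my own, and the 0.5 cut-off used for F1 is just an assumption; Hosmer & Lemeshow is left out to keep it short.

```python
def roc_auc(y, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos for sn in neg
    )
    return wins / (len(pos) * len(neg))

def f1_score(y, scores, threshold=0.5):
    """Harmonic mean of precision and recall at an assumed cut-off."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for yi, pi in zip(y, pred) if yi == 1 and pi == 1)
    fp = sum(1 for yi, pi in zip(y, pred) if yi == 0 and pi == 1)
    fn = sum(1 for yi, pi in zip(y, pred) if yi == 1 and pi == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def ks_statistic(y, scores):
    """Largest gap between the empirical score CDFs of positives and negatives."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Hypothetical labels and predicted probabilities.
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y, scores), f1_score(y, scores), ks_statistic(y, scores))
```

Note that AUC and KS depend only on how the scores rank the cases, while accuracy and F1 change with the chosen cut-off, so the two groups of measures can disagree about which model is "best".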
Furthermore, you test the influence/leverage of outliers on your solution using:
Leverage vs predicted
Influence vs predicted
Individual beta influence against each feature
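For the leverage part, a minimal sketch in plain Python, assuming a model with an intercept and a single feature. Leverage here is the diagonal of the weighted hat matrix, with weights p(1 - p); the x values and fitted probabilities are hypothetical and the function name is my own.

```python
def logistic_leverage(x, p):
    """Hat-matrix diagonal for a logistic model with an intercept and one feature x.

    h_i = w_i * [1, x_i] (X'WX)^-1 [1, x_i]', where w_i = p_i * (1 - p_i).
    """
    w = [pi * (1 - pi) for pi in p]
    # Entries of the 2x2 matrix X'WX = [[a, b], [b, c]].
    a = sum(w)
    b = sum(wi * xi for wi, xi in zip(w, x))
    c = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = a * c - b * b
    # Quadratic form with the inverse [[c, -b], [-b, a]] / det.
    return [wi * (c - 2 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]

# Hypothetical feature values (the last one is far from the rest) and fitted probabilities.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 8.0]
p = [0.2, 0.3, 0.4, 0.5, 0.6, 0.95]

leverage = logistic_leverage(x, p)
for xi, pi, hi in zip(x, p, leverage):
    print(f"x={xi:4.1f}  predicted={pi:.2f}  leverage={hi:.3f}")
```

The leverages always sum to the number of fitted parameters (two here), so a point that takes a large share of that total is worth plotting against its predicted value.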
It could be, however, that I just don't understand what 'residual deviance' is and need enlightenment.