
I'm confused about the Hosmer and Lemeshow Goodness-of-Fit test in SAS
for logit models, and it is giving me some problems. The test comes out
significant (p<0.001), which, as I understand it, implies that my logit
model does not fit the data well.

Some literature indicates that this test depends too heavily on the
actual groupings and cutoff values used when conducting the test,
i.e. it may not be that reliable.

I'm just not sure how much emphasis I should place on the HL test.

On the other hand, all of the independent variables are significant
(p<.001). My pseudo R-square is quite low (.17), but the area under the
ROC curve is .71, which I don't think is that bad. When predicting
outside the data set, the percentage of correct predictions is 62%.

Can someone give me their assessment of these results? I'd like these
numbers to be higher, but does it matter given the results from the HL
test above?

I have not yet tested for multicollinearity, serial correlation, or
spatial autocorrelation. Perhaps the model can be improved once these
tests are done and any necessary corrections are made.


Replies to This Discussion

Hi Tom,

I used a stepwise regression procedure to eliminate independent variables, but most of my independent variables are significant (p<0.001), so I eliminated only a few insignificant ones.
Binning: I binned the data based on a preliminary univariate analysis. I binned only the demographic variables and kept the continuous variables as they are. Should I try changing my binning?

Also, Age is missing for almost 30% of my data, so I created a separate group called Unspecified. Is this the correct way to handle it?

Before using stepwise regression to eliminate independent variables, you should deal with multicollinearity. Multicollinearity makes the coefficient estimates unstable and inflates their standard errors, so the p-values become unreliable. To get better performance, please remove multicollinearity first. After that you can proceed with standardizing the data.
If possible, remove serial correlation as well.

Please also put some emphasis on the HL statistic.

What would be a good method to detect multicollinearity and correlation among variables? I have lots of categorical variables and a few continuous variables.
I can use the PROC REG VIF option, but I need to recode the categorical variables as dummy variables.
Can anybody suggest a better way to detect correlation and multicollinearity?

P.S. I have access to only Base SAS
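To make the dummy-coding and VIF idea above concrete, here is a minimal sketch in plain Python (not SAS, and not any poster's actual code). It assumes k-1 indicator columns per categorical variable, computes each VIF as 1 / (1 - R^2) from an ordinary least-squares regression of that column on all the others, and uses made-up variable names and values. The helper names `dummy_code`, `solve`, `r_squared` and `vif` are hypothetical.

```python
def dummy_code(name, values):
    """Return k-1 indicator columns; the first sorted level is the reference."""
    levels = sorted(set(values))
    return {f"{name}={lvl}": [1.0 if v == lvl else 0.0 for v in values]
            for lvl in levels[1:]}


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Variance inflation factor for every column in a name -> values dict."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Made-up illustration data: two continuous variables and one categorical.
columns = {
    "age": [23.0, 35.0, 41.0, 52.0, 60.0, 29.0],
    "income": [20.0, 31.0, 44.0, 50.0, 58.0, 27.0],
}
columns.update(dummy_code("region", ["N", "S", "E", "N", "S", "E"]))
for name, value in vif(columns).items():
    print(f"{name}: VIF = {value:.2f}")
```

The sketch breaks down when one column is an exact linear combination of the others (the VIF is then undefined), and the normal-equations solve is only meant for small illustrations like this one.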

I recommend that you read the following web page on logistic regression assumptions, which is written for SPSS but will certainly be useful to you:
Also check your residuals to see if they are random. Your model may be missing something.

-Ralph Winters

Hi, when evaluating predictions, look at the initial breakdown of the outcome in the data. You can get a good overall hit rate (I use 80% as a simple rule of thumb), but what were your sensitivity and specificity? In other words, does your model classify both of the outcomes you are modelling (outcome a and outcome b) well? If a high percentage of cases falls in one group and the model classifies them correctly, the overall hit rate can be really misleading.
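A small sketch of the point about hit rate versus sensitivity and specificity, in plain Python. The data are made up (10 events among 100 cases) and the function name `classification_summary` is hypothetical; the "model" here simply predicts the majority class for everyone.

```python
def classification_summary(actual, predicted):
    """Hit rate, sensitivity and specificity for 0/1 outcomes and predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "hit_rate": (tp + tn) / len(pairs),
        # Share of actual events that were predicted as events.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of actual non-events that were predicted as non-events.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up data: 10 events, 90 non-events, and a model that always predicts 0.
actual = [1] * 10 + [0] * 90
predicted = [0] * 100
print(classification_summary(actual, predicted))
```

In this made-up case the hit rate looks good only because the non-events dominate; the model never identifies a single event.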

I would check your residuals (the difference between the predicted probability and the observed outcome), see which cases you are misclassifying and which ones you are misclassifying really badly, and perhaps then try to profile them.

Also, remember that statistical significance can be boosted by sample size (power): if you have a lot of cases, your predictors can be significant even when their effects are small. I would check the odds ratios etc. to see how much the presence of each predictor changes the overall odds of falling into one group or the other.
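As a rough illustration of checking the odds ratio rather than only the p-value, the sketch below converts a logit coefficient and its standard error into an odds ratio with an approximate 95% Wald interval. The coefficient and standard error are hypothetical values chosen to mimic a large sample, not numbers from this thread, and `odds_ratio_ci` is a made-up helper name.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and approximate 95% Wald interval for one logit coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error (assumed, not from the thread).
beta, se = 0.05, 0.005
wald_z = beta / se  # a large z statistic means a very small p-value
odds, low, high = odds_ratio_ci(beta, se)
print(f"z = {wald_z:.1f}, odds ratio = {odds:.3f} ({low:.3f} to {high:.3f})")
```

With values like these the predictor would be highly significant, yet a one-unit change multiplies the odds by only about 1.05, which is the kind of gap between significance and effect size described above.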

The only advantage I see to using model fitting stats like this is to compare different models (:

PS - don't use backward or forward elimination or stepwise procedures; try to compile a rationale for including or excluding certain variables. I would only use stepwise as a last resort. Also see if you can find some freeware to run tree algorithms on your data to see the impact of different models (logistic is not the be-all and end-all of modelling). Rattle in R is a nice GUI you could look at: it removes the programming learning curve and offers logistic regression, boosting, random forests, SVM etc., which enables comparisons of classification rates.

I have the same issue with this stat. I think it is a sample size issue.

I used the VIF option under PROC REG to make sure the variables entered into the logistic model are not highly correlated, so correlation is not an issue.

My logistic model also has a very high KS value.

I suspect it is a binning issue when dealing with a very small group of responders. Sometimes a low response % also tends to generate a logistic model with a high KS. The H-L test is like a chi-square test performed on 10 bins.

If responders are very thin, 1 responder falling into a different bin can change the outcome of this chi-square test a lot.
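The description above (a chi-square test on bins of predicted probability) can be sketched in plain Python as follows. This is a simplified version, not SAS's implementation: it assumes equal-count bins and ignores tie handling, uses df = groups - 2, and the p-value helper only supports even degrees of freedom. The data are simulated from a low-response model, and the function names are hypothetical.

```python
import math
import random


def hosmer_lemeshow(y, p, groups=10):
    """H-L chi-square: sort by predicted probability, split into equal-count bins."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)   # responders seen in the bin
        expected = sum(pi for pi, _ in chunk)   # responders the model expects
        p_bar = expected / size
        denom = size * p_bar * (1.0 - p_bar)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch supports positive even df only")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Simulated low-response data where the model is correct by construction.
random.seed(0)
p = [random.uniform(0.01, 0.2) for _ in range(2000)]
y = [1 if random.random() < pi else 0 for pi in p]
for g in (6, 10, 20):
    stat = hosmer_lemeshow(y, p, groups=g)
    print(g, round(stat, 2), round(chi2_sf_even_df(stat, g - 2), 3))
```

Running the loop with different numbers of bins gives a feel for how much the statistic and its p-value move with the grouping alone, which is the concern raised in the original question.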

I'm also confused about this test for binary logistic regression and have a similar problem to Hariharan's. Thank you for the advice. The significant H-L statistic on my model was resolved by following the suggestions in this thread.

