A Data Science Central Community
I need an understanding of the usage of the WEIGHT statement.
Background - I had to build a logistic regression response model on rare-event data with an event rate as low as 0.008%. I increased the event rate to 3% by creating two separate datasets, one through oversampling (increasing the number of events) and one through undersampling (decreasing the number of non-events). It is believed that after such sampling only the intercept of the model equation changes, while the other coefficients remain the same. To adjust for this, weights are used: if I had sampled (1/10)th of the non-event group, I would assign a weight of 1 to the event group and 10 to the non-event group. To correct the intercept term, I use the WEIGHT statement in PROC LOGISTIC. But that is where the problem starts: concordance falls from 70% in the unadjusted model to 20% in the weight-adjusted model (when I use the WEIGHT statement). I did not use the WEIGHT statement in any PROC during the analysis before building the model; I used it only when producing the final model output, to get the correct intercept. Is it because I never used the WEIGHT statement in bivariate profiling (PROC FREQ, PROC SUMMARY, PROC UNIVARIATE, etc.), yet created variable transformations, indicator variables, etc. based on that bivariate analysis, that I am getting a model with abysmally low concordance once the WEIGHT statement is added?
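The claim that undersampling non-events shifts only the intercept (by the log of the sampling fraction) can be checked numerically. A minimal sketch in Python with scikit-learn rather than SAS, on simulated data - all numbers here are illustrative, not from the actual problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
# assumed true model: logit(p) = -5 + 1.2*x, a rare-ish event
p = 1 / (1 + np.exp(-(-5 + 1.2 * x)))
y = rng.binomial(1, p)

# undersample non-events: keep each non-event with probability 1/10
keep = (y == 1) | (rng.random(n) < 0.1)
xs, ys = x[keep], y[keep]

# C=1e6 makes the fit effectively unpenalized maximum likelihood
full = LogisticRegression(C=1e6, max_iter=1000).fit(x.reshape(-1, 1), y)
samp = LogisticRegression(C=1e6, max_iter=1000).fit(xs.reshape(-1, 1), ys)

# slope is (approximately) unchanged between the two fits
print(full.coef_, samp.coef_)
# intercept shifts by ln(10); adding ln(0.1) back recovers the original
print(samp.intercept_ + np.log(0.1), full.intercept_)
```

Here the weight-of-10 adjustment on non-events and the additive ln(10) intercept offset are two routes to the same correction.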
If yes, then my question is: if I had used the WEIGHT statement from the initial steps of the analysis, each step would effectively have replicated the event rate of the original dataset (0.008%), and if that is what I wanted, there would have been no need for oversampling or undersampling in the first place. Or am I missing something here?
Please help me solve this conundrum. How exactly does the WEIGHT statement work, and why is my concordance falling to such a low value?
I think your problem is that you have already oversampled your data and are trying to compensate with the WEIGHT statement. I would use the original data and then use the WEIGHT statement to perform the oversampling.
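For what it's worth, the effect of weighting the full data instead of resampling can be sketched outside SAS: up-weighting events 10x on the complete dataset behaves like oversampling - the slope is essentially unchanged and the intercept shifts by about ln(10). A hedged sketch with scikit-learn on simulated data (nothing here comes from the actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
# assumed true model: logit(p) = -4 + x
p = 1 / (1 + np.exp(-(-4 + x)))
y = rng.binomial(1, p)

X = x.reshape(-1, 1)
unweighted = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# up-weight events 10x instead of duplicating or resampling rows
w = np.where(y == 1, 10.0, 1.0)
weighted = LogisticRegression(C=1e6, max_iter=1000).fit(X, y, sample_weight=w)

# slope agrees between the two fits
print(unweighted.coef_, weighted.coef_)
# weighted intercept sits roughly ln(10) above the unweighted one
print(weighted.intercept_ - np.log(10), unweighted.intercept_)
```

The advantage of weighting over deleting rows is that no information is thrown away, and the known ln(10) offset can be removed afterwards to restore the original-scale intercept.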
Thanks for the reply, Ralph.
PFA two documents, both containing statistics from models built on hypothetically created datasets. Notice that when weights are used the HL test fails (high chi-square, low probability); however, concordance remains the same.
Usage of weights -
I understand that if I had not oversampled the dataset by eliminating observations, I could have attached a weight to each observation instead - low weights for non-events relative to high weights for events - so that the effective event rate rises as desired. I would then have used this dataset and its weights in every step of the modelling procedure, and finally built the model in PROC LOGISTIC with the WEIGHT statement again. However, up to that point every activity, including running the logistic, would have been done on an effectively oversampled dataset (observations carrying low non-event v/s high event weights), so the intercept term would still be incorrect. I could correct it by building the final model WITHOUT the WEIGHT statement, because at that stage I want the model built on the original dataset to obtain the correct intercept term.
I did the opposite of the above - I oversampled by eliminating non-events - so I had to get the correct intercept WITH the WEIGHT statement in PROC LOGISTIC.
The key point is that in both methods the corrected intercept term is obtained by, in effect, building the model on the original dataset. However, concordance (one of the model's statistics) falls by a large amount, from 70% to 20%. Hence, I am trying to determine the following:
1. Overfitting of the model - If I conclude that, since concordance has fallen, the model equation does not hold on the original dataset, then it becomes a case of overfitting. However, the lift charts (original, oversampled 80%, oversampled 20%) overlap completely.
Here I have a digressing question: if this were true, then lift would be a not-so-accurate measure for validating the model on another dataset. Even if it were adequate, why not take the model's variables and rebuild the model on the validation dataset (rather than just scoring it with the model's equation), and then compare whether concordance or any other important statistic changes?
2. Error due to a very low event rate - If this is not a case of overfitting, is it something erratic or spurious caused by the very low event rate of the original dataset?
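One observation that may help untangle the concordance puzzle: the c-statistic depends only on the rank order of the scores, so a pure intercept shift cannot change it, and if the weights are constant within each class they cancel out of a pair-weighted concordance entirely (consistent with the attached documents, where concordance stays the same). A small self-contained sketch of the pair-based computation in Python - my own illustrative numbers, not SAS's exact algorithm:

```python
import numpy as np

def concordance(scores, y, w=None):
    """Weighted c-statistic: weighted fraction of (event, non-event)
    pairs in which the event gets the higher score (ties count half)."""
    if w is None:
        w = np.ones_like(scores, dtype=float)
    ev, ne = y == 1, y == 0
    s1, s0 = scores[ev], scores[ne]
    w1, w0 = w[ev], w[ne]
    diff = s1[:, None] - s0[None, :]        # every event/non-event pair
    pw = w1[:, None] * w0[None, :]          # weight of each pair
    wins = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    return float((wins * pw).sum() / pw.sum())

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.1])
y      = np.array([1,   0,   1,   0,   0,   0])

print(concordance(scores, y))            # unweighted c
print(concordance(scores + 5.0, y))      # same c: shifting scores is rank-preserving
w = np.where(y == 0, 10.0, 1.0)          # class-constant weights (the 1-vs-10 scheme)
print(concordance(scores, y, w))         # same c again: constant pair weights cancel
```

If SAS reports a c that moves from 70% to 20% under class-constant weights, the drop is unlikely to come from the pair weighting itself, which makes it worth checking what else changed between the two runs.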
P.S. I am using bootstrap validation to determine whether this could be a case of overfitting.
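On the P.S.: one common way to run that check is the Harrell-style optimism bootstrap - refit on each bootstrap resample, and average the gap between the resample's c and the c the refitted model achieves on the original data. A sketch with scikit-learn on assumed simulated data (for a binary outcome, roc_auc_score equals the c-statistic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 3))
# assumed true model: only the first feature matters
p = 1 / (1 + np.exp(-(-2 + X[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.decision_function(X))  # apparent c

optimism = []
for _ in range(50):
    idx = rng.integers(0, n, n)                          # bootstrap resample
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    c_boot = roc_auc_score(y[idx], m.decision_function(X[idx]))
    c_orig = roc_auc_score(y, m.decision_function(X))
    optimism.append(c_boot - c_orig)                     # in-sample inflation

corrected = apparent - np.mean(optimism)                 # optimism-corrected c
print(apparent, corrected)
```

A large average optimism (apparent c well above the corrected c) points to overfitting; a small one suggests the concordance drop has some other cause.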