Hi guys,

I have some questions about oversampling. I am trying to predict the probability that a person will unsubscribe from emails for an online retailer. I plan to run a logistic regression model for this on a 5% sample of my total user base, which is around 3 million users.
In my total population, about 10% are unsubscribers. I have oversampled my 5% random sample so that it contains 25% unsubscribers. Now, to run cross-tabs or test hypotheses, should I assign weights because my data is oversampled, and if so, how should I assign them?
Can I assign weights of 10% / 25% for unsubscribers and 90% / 75% for subscribers?

Please advise.

Thanks,
Hari

Replies to This Discussion

Hi Tom,

Thanks for the reply. But can I assign a weight of 0.1/0.25 = 0.4 for unsubscribers and 0.9/0.75 = 1.2 for subscribers? Would this make sense?
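The arithmetic behind these weights can be checked with a quick sketch. It assumes the 10% population rate and 25% sample rate from the question; the function name `class_weights` and the 1,000-row sample are made up purely for illustration.

```python
def class_weights(pop_rate, sample_rate):
    """Return (event_weight, non_event_weight) that rescale an
    oversampled sample back to the population event rate."""
    event_w = pop_rate / sample_rate                  # 0.10 / 0.25
    non_event_w = (1 - pop_rate) / (1 - sample_rate)  # 0.90 / 0.75
    return event_w, non_event_w

w_unsub, w_sub = class_weights(0.10, 0.25)

# Hypothetical oversampled sample: 250 unsubscribers, 750 subscribers.
n_unsub, n_sub = 250, 750
weighted_unsub = n_unsub * w_unsub
weighted_sub = n_sub * w_sub
weighted_total = weighted_unsub + weighted_sub
weighted_rate = weighted_unsub / weighted_total

print(w_unsub, w_sub, weighted_rate, weighted_total)
```

With these weights the weighted unsubscribe rate returns to roughly 10%, and the weighted row count stays at the original sample size, so cross-tab percentages reflect the population mix.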

Thanks,
Hari
Hi Guys,

One more question, regarding cross-tabs. When I run a 2x2 cross-tab, I get a p-value < 0.001, but the phi coefficient is very low at 0.0063. Looking at the p-value alone suggests the two variables are associated, but since my phi coefficient is so low, can I conclude that the variables are independent?
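For a 2x2 table the chi-square statistic and phi are tied together by chi-square = n * phi^2, so a very large n can produce a tiny p-value even when phi is close to zero. The sketch below illustrates this with made-up cell counts (they are not the data from this thread); `phi_and_p` is a hypothetical helper, and the p-value is for the uncorrected chi-square test with 1 degree of freedom.

```python
import math

def phi_and_p(a, b, c, d):
    """Phi coefficient, chi-square statistic and p-value for the
    2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    chi2 = n * phi ** 2
    # Upper tail of a chi-square with 1 df: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return phi, chi2, p_value

# Same cell proportions, two very different sample sizes (hypothetical counts).
small = phi_and_p(510, 490, 490, 510)          # n = 2,000
large = phi_and_p(51000, 49000, 49000, 51000)  # n = 200,000

print("small n:", small)
print("large n:", large)
```

Phi is identical in both tables, but only the large one is "significant". The p-value says the association is unlikely to be exactly zero; phi says how strong it is, so a low phi with a small p-value points to a real but very weak association rather than independence.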

Thanks,
Hari
Hari,

Are you building a predictive model? If so, are you using the model only to rank, or do you need *accurate* probabilities as well? If it's the former, I would say there is no reason to be concerned with weighting the data back (though you should validate the model on a holdout sample at the natural, non-oversampled density). If you really need accurate standard errors, then I don't believe weights alone are enough, and you should use software specific to the task (e.g. PROC SURVEYLOGISTIC in SAS).

Here is a post you may be interested in for oversampling adjustment in general:
http://blog.data-miners.com/2009/09/adjusting-for-oversampling.html
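If accurate probabilities are needed, one common adjustment is to rescale the scores from the oversampled model back to the population event rate. The sketch below is an illustration of that general idea, not necessarily the exact method in the linked post. It assumes cases were sampled at random within each class, and it reuses the 10% and 25% rates from the question; `adjust_probability` and `intercept_offset` are hypothetical names.

```python
import math

def adjust_probability(p_sample, pop_rate, sample_rate):
    """Map a probability scored on the oversampled data back to the
    population scale by reweighting the event and non-event parts."""
    event_part = p_sample * pop_rate / sample_rate
    non_event_part = (1 - p_sample) * (1 - pop_rate) / (1 - sample_rate)
    return event_part / (event_part + non_event_part)

def intercept_offset(pop_rate, sample_rate):
    """Amount to add to the fitted logistic intercept (log-odds scale);
    it is negative when the event class was oversampled."""
    pop_odds = pop_rate / (1 - pop_rate)
    sample_odds = sample_rate / (1 - sample_rate)
    return math.log(pop_odds / sample_odds)

def logit(p):
    return math.log(p / (1 - p))

pop_rate, sample_rate = 0.10, 0.25
scores = [0.05, 0.25, 0.60, 0.90]  # hypothetical model outputs
adjusted = [adjust_probability(p, pop_rate, sample_rate) for p in scores]
offset = intercept_offset(pop_rate, sample_rate)

for p, q in zip(scores, adjusted):
    print(f"{p:.2f} -> {q:.4f}")
```

The adjustment is monotonic, so the ranking of customers does not change, which is why it can be skipped when the model is used for ranking only. A score equal to the sample rate (0.25) maps back to the population rate (0.10).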

While oversampling is vital for many techniques, I have always found logistic regression to be quite insensitive to rare events, especially with an event density of 10%, which is not really rare compared with what you see in database marketing and fraud detection! You could try building the model both ways, and if oversampling does not provide a significant improvement, don't do it.
