A Data Science Central Community
Most people use logistic regression for modeling response, attrition, risk, etc., and in the world of business these are usually rare occurrences.
One widely accepted practice is oversampling or undersampling to model these rare events. Some time back, I was working on a campaign response model using logistic regression. After getting frustrated with the model's performance/accuracy, I used weights to oversample the responders. I remember clearly that I got the same, or a very similar, model.
According to Gordon Linoff and Michael Berry's blog:
"Standard statistical techniques are insensitive to the original density of the data. So, a logistic regression run on oversampled data should produce essentially the same model as on the original data. It turns out that the confidence intervals on the coefficients do vary, but the model remains basically the same."
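This claim can be checked on simulated data. The sketch below (my own illustration, not from the thread; all names and numbers are made up) fits a logistic regression on an imbalanced sample twice, once as-is and once with each rare positive weighted 10x, which is equivalent to oversampling the responders:

```python
# Hypothetical check of the claim that oversampling the rare class changes
# logistic regression's intercept far more than its slope coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 2))
# True model: rare event, roughly 5% positives.
logit = -3.0 + 1.0 * X[:, 0] - 0.5 * X[:, 1]
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Fit on the original (imbalanced) data; large C = essentially no penalty.
base = LogisticRegression(C=1e6).fit(X, y)

# "Oversample" by giving every positive a weight of 10.
w = np.where(y, 10.0, 1.0)
over = LogisticRegression(C=1e6).fit(X, y, sample_weight=w)

print("slopes (original):   ", base.coef_[0])
print("slopes (oversampled):", over.coef_[0])
print("intercept shift:     ", over.intercept_[0] - base.intercept_[0])
```

On a run like this the slope estimates come out close, while the intercept shifts by roughly ln(10), i.e. exactly the offset the weighting introduces, which is consistent with the quote above.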
But everyone seems to extol or recommend oversampling/undersampling for modeling rare events using logistic regression. What are your experiences and opinions on this?
Hello, indeed in many applications, if not most of them, the number of "positive" events (e.g., the number of buyers in a marketing campaign) is very small, much smaller than the number of "non-events" (e.g., non-buyers in the above example). Hence, a sample of observations, however large, drawn randomly from the population (the "universe") may not include enough positive events to build a significant model. The solution in these cases is to train the logistic regression model on a choice-based sample, namely a sample that contains a larger proportion of positive events (e.g., buyers in a marketing campaign) than the universe does (sometimes all of them), and only a sample of the non-buyers. As shown in Ben-Akiva, M., and S. R. Lerman, 1987, Discrete Choice Analysis, The MIT Press, Cambridge, MA, one needs to correct only the constant of the resulting logistic regression model to obtain a model that reflects the true proportion of buyers and non-buyers in the universe. In the case where the choice-based sample contains all the buyers in the universe, the constant is modified using the expression: modified constant = fitted constant of the logistic regression model + LN(TN/UN), where TN is the number of non-buyers in the training dataset used to build the model, UN is the number of non-buyers in the universe, and LN is the natural logarithm.
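A minimal sketch of that correction on simulated data (my own illustration; the universe, sampling rate, and coefficients here are invented for the example): keep every buyer, subsample the non-buyers, fit, then add LN(TN/UN) back to the constant.

```python
# Sketch of the intercept correction for a choice-based sample, as described
# above: fit on all buyers plus a subsample of non-buyers, then shift the
# constant by ln(TN/UN) to recover the universe-level model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
N = 200_000
X = rng.normal(size=(N, 1))
logit = -4.0 + 1.2 * X[:, 0]          # rare event in the "universe"
y = rng.random(N) < 1 / (1 + np.exp(-logit))

# Choice-based sample: every buyer, plus a 5% sample of the non-buyers.
pos = np.flatnonzero(y)
neg = np.flatnonzero(~y)
neg_sample = rng.choice(neg, size=len(neg) // 20, replace=False)
idx = np.concatenate([pos, neg_sample])

model = LogisticRegression(C=1e6).fit(X[idx], y[idx])

TN = len(neg_sample)   # non-buyers in the training data
UN = len(neg)          # non-buyers in the universe
corrected_intercept = model.intercept_[0] + np.log(TN / UN)
print("fitted intercept:   ", model.intercept_[0])
print("corrected intercept:", corrected_intercept)
```

The fitted constant is inflated by roughly ln(UN/TN) = ln(20) because non-buyers were undersampled; adding ln(TN/UN) removes that offset, and the corrected constant lands near the true universe value of -4.0 while the slope needs no adjustment.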