# AnalyticBridge

A Data Science Central Community

Most people use logistic regression to model response, attrition, risk, and similar outcomes, and in the business world these are usually rare occurrences.

One widely accepted practice for modeling these rare events is oversampling or undersampling. Some time back, I was working on a campaign response model using logistic regression. After getting frustrated with the model's performance/accuracy, I used weights to oversample the responders. I remember clearly that I got the same, or a very similar, model.

According to Gordon Linoff and Michael Berry's blog:

"Standard statistical techniques are insensitive to the original density of the data. So, a logistic regression run on oversampled data should produce essentially the same model as on the original data. It turns out that the confidence intervals on the coefficients do vary, but the model remains basically the same."
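This claim is easy to check numerically. Below is a minimal NumPy sketch (simulated data; the `fit_logistic` Newton-Raphson helper is written for illustration, not taken from any library): oversampling the rare responders via case weights leaves the slope essentially unchanged and shifts only the intercept, by roughly ln(weight).

```python
import numpy as np

def fit_logistic(X, y, w=None, iters=25):
    """Weighted logistic regression via Newton-Raphson (illustrative helper)."""
    if w is None:
        w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-4.0, 1.0])   # intercept -4 makes positives rare (~3%)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

b_plain = fit_logistic(X, y)
b_over = fit_logistic(X, y, w=np.where(y == 1.0, 10.0, 1.0))  # 10x weight on responders

# The slope barely moves; only the intercept shifts, by roughly ln(10).
print("plain:   ", b_plain)
print("weighted:", b_over)
```

This matches the quote: the model is "basically the same", and the intercept shift is exactly what the choice-based-sampling correction discussed further down this thread undoes.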

But everyone seems to extol or recommend oversampling/undersampling for modeling rare events using logistic regression. What are your experiences and opinions on this?

Regards,

Datalligence

Views: 12495

### Replies to This Discussion

What would stop you from using a Poisson regression technique here? Like logistic regression, Poisson regression is log-linear (with a natural-log link). You could do this in R very quickly and measure the overdispersion coefficient to see whether you have the correct level of precision for your point estimates. You can also adjust this via the Pearson coefficient and get decent classification accuracy if the data are amenable to this type of regression.

You might also want to look into negative binomial regression, which is very applicable to rare events, though it may not fit your criteria since the details aren't provided. It allows for more variability than Poisson, but it's really just a matter of firing your data sets into R or RapidMiner to see which weapons in your war chest give you the best model.
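The overdispersion check suggested above can be sketched in a few lines. Here is a minimal NumPy version (simulated data; the `fit_poisson` Newton-Raphson helper is illustrative, not a library call) that fits a Poisson regression and computes the Pearson dispersion statistic, which should sit near 1 for well-specified Poisson data and well above 1 under overdispersion:

```python
import numpy as np

def fit_poisson(X, y, iters=25):
    """Poisson regression (log link) via Newton-Raphson (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        beta += np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
    return beta

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(-2.0 + 0.5 * x)).astype(float)  # rare counts, mostly zeros

beta = fit_poisson(X, y)
mu = np.exp(X @ beta)
# Pearson dispersion: near 1 for well-specified Poisson data;
# well above 1 signals overdispersion (consider negative binomial).
dispersion = np.sum((y - mu) ** 2 / mu) / (n - X.shape[1])
print(beta, dispersion)
```

If you replaced the `rng.poisson` draw with a negative binomial one, the same dispersion statistic would land well above 1, which is the signal to switch models as the reply suggests.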

My student tried to predict fog and had only 100 positives among more than 200k observations.

Just say there will be no fog and you are 99.95% right. There's no value in such a trivial prediction.

Fog requires certain conditions: humidity, temperature, etc. If you focus only on the observations that satisfy those basic conditions, the data mining task becomes completely different.

Instead of 0.05% you have 5%, or maybe 20%, of positive observations, and your favorite algorithm works well.

I agree with the statement that the density of rare events does not affect logistic regression, if we are talking about prediction. Where oversampling (or weights, costs, or priors) does pay big dividends, in my experience, is decision trees. I think that's the canonical example of where consideration of the rare class's proportion is needed.

It's not widely used, but a close friend of mine has had good success with Rare Events Logit (RELogit), available for Stata, GAUSS, and R. It's fairly new (2001), and it seems to be just now starting to gain traction in the academic literature.
If you are modeling binomial data, i.e., a numerator consisting of the number of 1/0 successes for a given pattern of covariates, and a denominator giving the total number of observations having that covariate pattern (a specific profile of predictor values, e.g., age=23, married=1, working=0), a logistic regression is generally appropriate. But when the mean of the numerators is less than 10% of the mean of the denominators, a Poisson model is likely preferred: the otherwise-logistic numerator becomes the count response variable (dependent variable), and the natural log of the denominator enters as the offset. Generally the Poisson model will fit such data better. Logistic models are not intended for rare occurrences, but that is exactly how the Poisson distribution is derived from the binomial.

If the Poisson model is overdispersed, i.e., if the variance of the response is greater than the mean, then you'll likely need a negative binomial model. Nearly all count models are overdispersed; some aren't, and some are underdispersed. Remember, given the data you have (and I don't know it), it may be that none of these models is really appropriate. But as a rule of thumb, modeling rare events is typically the task of Poisson or negative binomial count models. There is a host of different negative binomial models, each addressing a certain type of overdispersion. On the surface, these appear to be better alternatives than a logistic model.
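The numerator/denominator/offset construction described above can be sketched as follows, assuming grouped binomial data with a tiny success rate (all names and numbers here are illustrative; `fit_poisson_offset` is a hand-rolled helper, not a library function):

```python
import numpy as np

def fit_poisson_offset(X, y, offset, iters=25):
    """Poisson regression with an offset term via Newton-Raphson (illustrative)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta + offset)
        beta += np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
    return beta

rng = np.random.default_rng(2)
g = 400                                        # covariate patterns
x = rng.normal(size=g)
n = rng.integers(200, 1000, size=g)            # denominator: group sizes
p = 1.0 / (1.0 + np.exp(-(-5.0 + 0.7 * x)))    # rare success rate (under 1%)
y = rng.binomial(n, p).astype(float)           # numerator: success counts

X = np.column_stack([np.ones(g), x])
beta = fit_poisson_offset(X, y, np.log(n))     # ln(denominator) as the offset
# For tiny p, log-odds and log-rate nearly coincide, so the Poisson fit
# lands close to the logistic truth (-5.0, 0.7).
print(beta)
```

This illustrates why the rule of thumb works: when the numerator is a small fraction of the denominator, the binomial and Poisson likelihoods are nearly interchangeable.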

Hello. Indeed, in many applications, if not most, the number of "positive" events (e.g., the number of buyers in a marketing campaign) is very small, much smaller than the number of "non-events" (e.g., non-buyers in the same example). Hence, a sample of observations, however large, drawn randomly from the population (the "universe") may not include enough positive events to build a significant model.

The solution in these cases is to train the logistic regression model on a choice-based sample, namely a sample that contains a larger proportion of positive events (e.g., buyers) than the universe does (sometimes all of them), and only a sample of the non-buyers. As shown in Ben-Akiva, M., and S. R. Lerman, 1987, Discrete Choice Analysis, The MIT Press, Cambridge, MA, one needs to correct only the constant of the resulting logistic regression model to render a model that reflects the true proportion of buyers and non-buyers in the universe. In the case where the choice-based sample contains all the buyers in the universe, the constant is modified as: modified constant = resulting constant of the logistic regression model + LN(TN/UN), where TN is the number of non-buyers in the training dataset used to build the model, UN is the number of non-buyers in the universe, and LN is the natural logarithm.
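The constant-only correction can be demonstrated on simulated data. A minimal NumPy sketch (the universe, sampling rate, and `fit_logistic` Newton-Raphson helper are all illustrative assumptions, not from the reference): fit on a choice-based sample holding every buyer plus a thin slice of non-buyers, then add LN(TN/UN) to the constant to recover the universe-level model.

```python
import numpy as np

def fit_logistic(X, y, iters=30):
    """Logistic regression via Newton-Raphson (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1.0 - p))[:, None]).T @ X,
                                X.T @ (y - p))
    return beta

rng = np.random.default_rng(3)
N = 200_000                                    # the "universe"
x = rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-(-6.0 + 1.0 * x)))    # buyers are rare (~0.4%)
y = (rng.random(N) < p).astype(float)

# Choice-based sample: keep every buyer, plus 1 in 50 of the non-buyers.
nonbuyers = np.flatnonzero(y == 0.0)
keep = np.concatenate([np.flatnonzero(y == 1.0),
                       rng.choice(nonbuyers, size=len(nonbuyers) // 50,
                                  replace=False)])
beta = fit_logistic(np.column_stack([np.ones(len(keep)), x[keep]]), y[keep])

TN = float((y[keep] == 0.0).sum())             # non-buyers in the training sample
UN = float((y == 0.0).sum())                   # non-buyers in the universe
beta[0] += np.log(TN / UN)                     # correct only the constant
print(beta)                                    # back near the universe model (-6.0, 1.0)
```

Note that only the constant is touched; the slope estimated on the enriched sample is already consistent, which is the same fact the Linoff and Berry quote at the top of the thread points to.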