# AnalyticBridge

A Data Science Central Community

# Techniques to address very low event rate for Logistic Regression Model

Hi Folks,

I am looking at data from a telecom company and developing a model to predict an event (read: churn).

I am planning to develop a GLM using a logit link function.

The real problem I am facing in the data is the very low proportion (1.6%) of churners.

So I am seeking advice on the following:

- What are the possible (bad) outcomes if I take a randomised training sample containing just 1.6% churners?

- Should I weight the training sample to have an event rate >25%?

- Is there any other technique to address the problem of such a small event rate?

Regards,

HV


### Replies to This Discussion

HV,

Please see a paper by Gary King & Langche Zeng entitled "Logistic Regression in Rare Events Data".

Basile

Hi, I wish I could help in such a way. I am myself using the Link Model to observe and study repeated events. All events are repeated. My sampling study was "Random or Causality" for drawing winning lottery numbers. The term regression is somehow a slow process of continuity of events, regarding the model that is used. I only observed the activities of all celestial bodies that caused things to happen the way they happened. I may use a mathematical model for regression. Can it be manipulated? I want to know. Please, I am open to suggestions or criticism. I am very sorry for any disappointment from my reply.

HS,

Take the second option: oversample, then add an offset to the final result. Calculate the probabilities and see whether you can distinguish between churners and non-churners at a specified cutoff.
If it works, you are in good shape!

Thanks,
Ratheen
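A minimal sketch of the offset idea, assuming churners and non-churners are each sampled at random and only their proportions differ from the population. The function names are hypothetical, and the 25% sample rate is taken from the question as an illustration. The correction subtracts the difference in log-odds between the sample and population event rates, which is the intercept adjustment usually called prior correction in the rare-events literature.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def correct_for_oversampling(p_sample, sample_rate, population_rate):
    """Map a probability from a model fit on oversampled data
    back to the population scale.

    p_sample        -- probability predicted by the oversampled model
    sample_rate     -- event rate in the training sample (e.g. 0.25)
    population_rate -- event rate in the population (e.g. 0.016)
    """
    # Offset = how much the oversampling inflated the log-odds
    offset = logit(sample_rate) - logit(population_rate)
    corrected_logit = logit(p_sample) - offset
    return 1.0 / (1.0 + math.exp(-corrected_logit))


# Figures from the thread: 1.6% churners in the population,
# 25% churners in the (assumed) oversampled training set
for p in (0.10, 0.25, 0.60, 0.90):
    print(p, round(correct_for_oversampling(p, 0.25, 0.016), 4))
```

The correction is monotonic, so it changes the probabilities and the appropriate cutoff but not the rank order of customers.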

Thanks, Ratheen!

I am trying the other option.

Just to share: this is the second time I am building such a model.

The disadvantage of such a model (with a rare event and oversampled events) is that once you use it to score a population larger than the sample, the lifts drop dramatically, at times to worse than random.

I have used several alternative models, including a Genetic Algorithm, but with limited success in overcoming this issue.

Regards,

HS
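One way to see the drop in lift before deployment is to measure lift on a holdout set that keeps the natural event rate rather than on the oversampled data. The sketch below is only illustrative: `top_fraction_lift` is a hypothetical helper, and the tiny data set is made up to show the call.

```python
def top_fraction_lift(y_true, scores, fraction=0.1):
    """Lift in the top `fraction` of cases ranked by score.

    Lift = event rate among the top-scored cases / overall event rate.
    A value near 1.0 means the model ranks no better than random.
    """
    n = len(y_true)
    if n == 0 or n != len(scores):
        raise ValueError("y_true and scores must be non-empty and equal length")
    base_rate = sum(y_true) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no events")
    k = max(1, int(n * fraction))
    # Indices of the k highest-scored cases
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(y_true[i] for i in order[:k]) / k
    return top_rate / base_rate


# Made-up holdout: 1 churner among 10 customers, scored highest by the model
y_holdout = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.1, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35, 0.12]
print(top_fraction_lift(y_holdout, scores, fraction=0.1))
```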

Hi,

Your problem description makes me think of a penalized likelihood method (e.g. Firth's method). Please refer to the following link for details.

http://www.statisticalhorizons.com/logistic-regression-for-rare-events

thanks,

Sulabh Dube
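For reference, Firth's method replaces the ordinary log-likelihood $\ell(\beta)$ with a penalized version, where $I(\beta)$ is the Fisher information matrix:

$$
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\,\log\bigl|I(\beta)\bigr|
$$

The penalty term shrinks the estimates and reduces the small-sample bias of maximum likelihood, which is why it is often suggested when events are few in absolute number.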

Just one more point to add: whatever method you use (traditional statistical algorithms or machine learning), an equivalent approach exists. For example, in machine learning the objective function can penalize misclassifying events (EVENTS=1) much more severely than misclassifying non-events.
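A minimal sketch of such a penalized objective, assuming a log-loss in which events carry a larger weight than non-events. The function name and the weight of 61.5 (the ratio of non-churners to churners at a 1.6% event rate, a common heuristic rather than a rule) are illustrative assumptions.

```python
import math


def weighted_log_loss(y_true, p_pred, event_weight=1.0):
    """Sum of per-case log-loss, with events (y == 1) weighted by event_weight.

    With event_weight > 1, a missed event costs more than an equally
    confident mistake on a non-event.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        weight = event_weight if y == 1 else 1.0
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Inverse-frequency heuristic for a 1.6% event rate: (1 - 0.016) / 0.016
w = (1 - 0.016) / 0.016

missed_event = weighted_log_loss([1], [0.1], event_weight=w)
false_alarm = weighted_log_loss([0], [0.9], event_weight=w)
print(missed_event > false_alarm)
```

Many libraries expose the same idea through a class-weight or case-weight option, so the loss rarely needs to be written by hand.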

Applied Predictive Modeling by Max Kuhn and Kjell Johnson has a chapter on remedies for severe class imbalance. It offers many approaches to try, including different sampling methods.