A Data Science Central Community
I am looking at data from a telecom company and developing a model to predict an event (read: churn).
I am planning to fit a GLM with a logit link function.
The real problem I am facing in the data is the very low share of churners (1.6%).
So I am seeking advice on the following:
- What are the possible (bad) outcomes if I take a randomised training sample containing just 1.6% churners?
- Should I weight the training sample to get an event rate above 25%?
- Is there any other technique to address such a small event rate?
Please see a paper by Gary King & Langche Zeng entitled "Logistic Regression in Rare Events Data".
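The prior correction described in that paper can be sketched as follows. If the model is fit on a sample in which churners were deliberately oversampled, the slope estimates remain usable but the intercept is biased; the correction subtracts ln[((1 - tau) / tau) * (ybar / (1 - ybar))] from it, where tau is the event rate in the population and ybar is the event rate in the training sample. The function names and the 25% sample rate below are illustrative assumptions, not taken from the paper.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def prior_correction_offset(sample_rate, population_rate):
    """ln[((1 - tau) / tau) * (ybar / (1 - ybar))], with tau = population_rate
    and ybar = sample_rate."""
    return logit(sample_rate) - logit(population_rate)


def correct_intercept(b0_sample, sample_rate, population_rate):
    """Intercept adjusted back to the population event rate."""
    return b0_sample - prior_correction_offset(sample_rate, population_rate)


def correct_probability(p_sample, sample_rate, population_rate):
    """Map a score from the oversampled model onto the population scale."""
    z = logit(p_sample) - prior_correction_offset(sample_rate, population_rate)
    return 1.0 / (1.0 + math.exp(-z))


# Assumed setup: churners oversampled to 25%, true churn rate 1.6%.
offset = prior_correction_offset(sample_rate=0.25, population_rate=0.016)
p_population = correct_probability(0.25, sample_rate=0.25, population_rate=0.016)
```

The correction shifts every score by the same amount on the logit scale, so it fixes calibration but leaves the ranking of customers unchanged.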
Hi, I wish I could help with this. I am myself using a link model to observe and study repeated events; all events are repeated. My sampling study was "Random or Causality" for drawing winning lottery numbers. The term regression suggests, in a way, a slow and continuous progression of events under the model that is used. I only observed the activities of all celestial bodies that caused things to happen the way they happened. I may use a mathematical model for regression. Can it be manipulated? I want to know. I am open to suggestions and criticism. I am very sorry if my reply is disappointing.
Thanks, Ratheen!
I am trying the other option.
Just to share: this is the second time I am building such a model.
The disadvantage of such a model (a rare event with oversampled events) is that once you use it to score a population larger than the sample, the lift drops dramatically, at times to worse than random.
I have tried several alternative models, including a genetic algorithm, but they were of limited help in overcoming this issue.
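For reference, lift here is taken to mean the event rate among the top-scored fraction of records divided by the overall event rate, so a value below 1 means the model ranks worse than random selection. A minimal sketch of that calculation, with hypothetical function and argument names, assuming at least one event is present in the scored data:

```python
def lift_at(scores, labels, fraction=0.1):
    """Event rate in the top `fraction` of scores divided by the overall event rate."""
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# Toy data: 10 customers, 2 churners, one of them ranked first.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
example_lift = lift_at(scores, labels, fraction=0.2)
```

Computing this on a holdout drawn at the natural 1.6% rate, rather than on the oversampled training data, gives a more honest picture of how the model will behave when scoring the full population.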
Your problem description makes me think of penalized likelihood methods (e.g. Firth's method). Please refer to the following link for details.
Hope this helps.
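As a rough illustration of Firth's method: it replaces the usual logistic score with a modified one, sum over i of x_i * (y_i - p_i + h_i * (0.5 - p_i)), where h_i is the leverage of record i. The sketch below solves that equation by plain Fisher scoring in pure Python. It is a teaching sketch under simplifying assumptions (no step-halving, no overflow guards, dense loops), not a replacement for a tested package, and all names in it are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def firth_logistic(X, y, max_iter=200, tol=1e-10):
    """Firth-penalized logistic regression; X must include the intercept column."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
             for row in X]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information X'WX.
        info = [[sum(wi * row[a] * row[c] for wi, row in zip(w, X))
                 for c in range(k)] for a in range(k)]
        # Leverages h_i = w_i * x_i' (X'WX)^-1 x_i.
        h = [wi * sum(u * v for u, v in zip(row, solve(info, row)))
             for wi, row in zip(w, X)]
        # Firth-modified score vector.
        score = [sum(row[a] * (yi - pi + hi * (0.5 - pi))
                     for row, yi, pi, hi in zip(X, y, p, h))
                 for a in range(k)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy data with complete separation in one group: all 5 records with x = 1
# are events, while only 1 of the 10 records with x = 0 is an event.
X = [[1.0, 1.0]] * 5 + [[1.0, 0.0]] * 10
y = [1] * 5 + [1] + [0] * 9
beta = firth_logistic(X, y)
```

Ordinary maximum likelihood would push the slope toward infinity on this toy data, whereas the penalized estimate stays finite; with a single binary predictor it coincides with adding 0.5 to each cell of the 2x2 table.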
Just one more point to add: whichever method you use (traditional statistical algorithms or machine learning), an equivalent approach exists. In machine learning, for example, the objective function can penalize misclassifying events (EVENT=1) much more severely than misclassifying non-events.
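One way to picture such a penalty is a log loss in which every event carries a larger weight than a non-event. The sketch below is illustrative only: the function names are hypothetical, and the "balanced" weight (non-event share divided by event share) is one common heuristic rather than a prescribed value.

```python
import math


def weighted_log_loss(y_true, p_pred, event_weight=1.0):
    """Log loss in which each event (y = 1) counts `event_weight` times as much."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        if y == 1:
            total += -event_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total


def balanced_event_weight(event_rate):
    """Weight that makes events and non-events contribute equally in total."""
    return (1.0 - event_rate) / event_rate


# With 1.6% churners, each churner gets roughly 61.5 times the weight.
w = balanced_event_weight(0.016)
missed_event = weighted_log_loss([1], [0.1], event_weight=10.0)
false_alarm = weighted_log_loss([0], [0.9], event_weight=10.0)
```

Like oversampling, weighting of this kind inflates the predicted probabilities, so scores from a weighted model need recalibrating before they are read as churn rates.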
Applied Predictive Modeling by Max Kuhn and Kjell Johnson has a chapter on remedies for severe class imbalance. It covers many sampling approaches to try.