Hi group,

Has anyone worked on modeling rare events using unconventional techniques (say, anything other than logistic regression and its variants)? When I say rare, I mean something like a ratio of 1:500 or even lower.

Looking forward to your input.

Replies to This Discussion

Hi Manish,

You say 1:500; can you make the ratio less extreme? Can you broaden the date range or the definition of N so that the number of events is larger? Can you define the base population more carefully, to exclude cases that could never convert from a "0" to a "1"? Perhaps there's a pattern or a link that is missing. Just a thought. I suggest it because with that ratio you have to use specialized statistical methods, like the ones used in medical studies for, say, side effects. If you have to stick with those odds, I would look into the statistical methods used in medical research.

Good luck man,

Manish, I am not sure about the domain or the business problem, but in fraud/risk modeling (where the target is always rare), balancing the data (oversampling or undersampling) has given me very good results most of the time.

Thanks Amanda,

Unfortunately the room to "redefine" Y isn't there. But I have got another idea out of it :).

Do you have any papers (even from medical research) where such a technique is used?

Redefining Y? Not sure if it's redefinition :-) but I worked on a project where the fraud cases were extremely rare according to the client's existing rules. During our discussions with the client, we helped them define "suspected" frauds too, which were different from the "real" fraud cases. These suspected fraud records/transactions were then recoded in the dataset as real fraud cases, and it definitely improved the model performance, not to mention the business benefits!
"My next question in that case is: how do you re-estimate your model coefficients to match back to the original ratio? I will be thankful if you can share some code."

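You always oversample/undersample in the training dataset only. The testing dataset should be left untouched, so that it keeps the original proportions. The model built on the resampled training dataset should be able to detect a fair percentage (as defined by you and the business requirement) of the rare events in the testing dataset. For classification or ranking there is no need to re-estimate your model coefficients. One caveat: resampling shifts the intercept of a logistic regression, so if you need predicted probabilities on the original scale, the intercept (or the predicted odds) has to be adjusted back to the original event rate.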
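Since code was asked for, here is a minimal sketch of that workflow in Python with NumPy. Everything in it is an assumption for illustration, not something from an actual project: the synthetic data, the 70/30 split, and the helper names `undersample_majority` and `correct_probability` are all made up. It undersamples the non-events in the training set only, leaves the test set alone, and shows the standard odds rescaling you could apply if you want probabilities on the original scale.

```python
import numpy as np


def undersample_majority(X, y, majority_per_minority=1.0, seed=0):
    """Keep every rare event (y == 1) and a random subset of non-events (y == 0).

    Hypothetical helper: majority_per_minority=1.0 gives a 1:1 training set.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(round(len(pos) * majority_per_minority)))
    keep_neg = rng.choice(neg, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]


def correct_probability(p_sampled, population_rate, sample_rate):
    """Rescale probabilities from a model fit on resampled data.

    The predicted odds are multiplied by (population odds / sample odds),
    which is equivalent to shifting a logistic regression intercept.
    Assumes 0 <= p_sampled < 1.
    """
    p_sampled = np.asarray(p_sampled, dtype=float)
    odds = p_sampled / (1.0 - p_sampled)
    shift = (population_rate / (1.0 - population_rate)) / (
        sample_rate / (1.0 - sample_rate)
    )
    corrected_odds = odds * shift
    return corrected_odds / (1.0 + corrected_odds)


# Synthetic data with an event rate of roughly 1:500 (illustrative only).
rng = np.random.default_rng(42)
n = 100_000
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 1 / 500).astype(int)

# Split first, then resample the training part only.
cut = int(0.7 * n)
X_train, y_train = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]  # left untouched, original proportions

X_bal, y_bal = undersample_majority(X_train, y_train, majority_per_minority=1.0)

# Fit any classifier on (X_bal, y_bal), score X_test, then, if calibrated
# probabilities matter, pass the scores through correct_probability with
# population_rate=y_train.mean() and sample_rate=y_bal.mean().
```

The odds rescaling only matters when you report probabilities; the ranking of cases is the same before and after, so lift or capture-rate evaluation on the untouched test set is not affected.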

Let me know if you have any questions. Regarding techniques, logistic regression has some assumptions/requirements about your data. Have you tried decision trees - C5, CHAID or CART?

