Hi group,

Has anyone worked on modeling rare events using unconventional techniques (say, anything other than logistic regression and its variants)? When I say rare, I mean an event rate of about 1:500 or even lower.

Looking for your inputs

Replies to This Discussion

Hello Manish,

One method that I have used in the past is a form of bootstrap sampling that boosts the share of rare-event cases in the development sample. Validation then becomes important, and a clean validation can be hard to get, although depending on what you are modeling and what your goal is, that may be acceptable. I would also stress-test your parameter estimates by refitting on repeated random samples.
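
For readers who want to try this, here is a minimal sketch of the resampling idea, not the exact procedure described above. The function names, the `response` field and the 10% target rate are assumptions made for illustration. The second function shifts the log-odds of a score fitted on the boosted sample back to the original base rate (the usual prior correction for outcome-based sampling), so that predicted probabilities are not inflated.

```python
import math
import random


def oversample_rare(rows, label_key="response", target_rate=0.1, seed=0):
    """Resample event rows with replacement until they make up roughly
    target_rate of the development sample. Non-events are kept as-is."""
    rng = random.Random(seed)
    events = [r for r in rows if r[label_key] == 1]
    non_events = [r for r in rows if r[label_key] == 0]
    if not events:
        raise ValueError("no rare-event rows to resample")
    # Solve n_events / (n_events + n_non_events) = target_rate for n_events.
    n_needed = math.ceil(target_rate * len(non_events) / (1 - target_rate))
    boosted = [rng.choice(events) for _ in range(n_needed)]
    sample = non_events + boosted
    rng.shuffle(sample)
    return sample


def correct_probability(p_boosted, true_rate, boosted_rate):
    """Map a probability estimated on the boosted sample back to the
    original base rate by subtracting the log-odds offset."""
    offset = math.log(
        (boosted_rate / (1 - boosted_rate)) * ((1 - true_rate) / true_rate)
    )
    logit = math.log(p_boosted / (1 - p_boosted)) - offset
    return 1 / (1 + math.exp(-logit))
```

With a true rate of 1:500 and a boosted rate of 10%, a boosted-sample score of 0.10 maps back to about 0.002, i.e. the original base rate.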

I think there are probably other palatable techniques depending on what your outcome is - do you mind sharing what it is?

Hi Matthew,

Thanks for the reply.

Yes, I agree that bootstrapping can also help -- though my experience with this tells me that it may not make a significant dent.

As for the problem, I am looking to model and predict responses to marketing solicitations (external and internal acquisitions) in the card business. Feel free to share your thoughts on this.

"...a form of bootstrap sampling to boost your rare event cases higher for development purposes "

Have you considered adaptive methods (e.g. gradient boosting, available as Salford Systems' "TreeNet")?

Actually, you can model pretty much anything with a low incidence rate, even below 1 in 500. Believe it or not, straight statistical methods like logistic regression, or any other algorithm of your choice, can handle this type of problem without too much trouble. My favorite, however, is spline modeling. The secret, in my experience, is in balancing your samples (i.e. development and validation). With any low-incidence data, there is a greater likelihood that a couple of outliers will throw your samples off, leading to big differences in predictive performance between the two. My suggestion is to create several such sample splits and build models on each set. Look for the splits that give you relatively good and roughly equal performance on both samples.
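
A rough sketch of the "several splits" idea, under my own assumptions: events and non-events are split separately so both samples keep a similar event rate, and `score_fn` is a hypothetical callback that fits whatever model you like on the development indices and returns a (development score, validation score) pair.

```python
import random


def stratified_split(labels, val_fraction=0.3, seed=0):
    """Return (dev_idx, val_idx). Events and non-events are shuffled and cut
    separately so both samples end up with a similar event rate."""
    rng = random.Random(seed)
    dev, val = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * val_fraction)
        val.extend(idx[:cut])
        dev.extend(idx[cut:])
    return sorted(dev), sorted(val)


def rank_splits(labels, score_fn, n_splits=10, val_fraction=0.3):
    """Try several seeds and sort the splits by the gap between development
    and validation performance (smallest gap first)."""
    results = []
    for seed in range(n_splits):
        dev, val = stratified_split(labels, val_fraction, seed)
        dev_score, val_score = score_fn(dev, val)  # hypothetical user callback
        results.append((abs(dev_score - val_score), seed, dev_score, val_score))
    return sorted(results)
```

Sorting by the gap alone only checks for equal performance; you would still want to discard splits where both scores are poor.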

Hope this helps,

I'd like to know if you can point me to publications or papers where I can learn more about the issue.

Many thanks

You likely want to do Poisson, or better, negative binomial regression. See my book:
Hilbe, Joseph M. (2007), Negative Binomial Regression, Cambridge University Press

I give a rather lengthy discussion of how dealing with low incidence data differs from logistic regression in my new book:
Hilbe, Joseph M (2009), Logistic Regression Models, Chapman & Hall/CRC
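
As a quick way to see whether count data call for negative binomial rather than Poisson, one can compare the sample mean and variance. The sketch below is not taken from the books above; it assumes the common NB2 variance form, variance = mean + alpha * mean^2, and uses a simple method-of-moments estimate of alpha. The function name is made up for illustration.

```python
from statistics import mean, variance


def overdispersion_alpha(counts):
    """Method-of-moments estimate of the NB2 dispersion parameter alpha,
    assuming variance = mean + alpha * mean**2. A value near 0 is consistent
    with Poisson; a clearly positive value suggests negative binomial."""
    m = mean(counts)
    if m == 0:
        raise ValueError("all counts are zero; alpha is undefined")
    v = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (v - m) / m**2)
```

This is only a screening step; a fitted model with covariates can show less overdispersion than the raw counts do.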

I am not sure if this is the right place to ask this question but I am currently working on Supply chain Disruption Management. Specifically, I would like to model the impact of very low frequency but high impact events analytically - Not in a data mining way but more in terms of Operations Research/Management Science. Are there analytical methodologies other than Rare Event Simulation to model extreme or very low probability events?

Thank you in advance!

Thanks for the replies,

Bill, can you share a little more about the spline-based modeling?

Manish - something like your problem happened in Basel II Credit Risk analysis. About 2005, the financial regulators in Europe and US realised their rules would require banks to produce a probability of default model for some loan portfolios that in fact had no, or few, historical defaults - this is pre-crunch 2005, mind! How to do it? (PD=0 is the wrong answer, by the way).

Some good ideas came out of the FSA (UK regulator), the Bundesbank and other financial institutions and some of these were gathered in the following link.

I'm interested in this because I proposed one of the methods, based on well-known ideas of marginal likelihood (so nothing proprietorial about this!). I still like it, as it gets to the heart of the underlying portfolio default model - a random effects model with time-series autocorrelations. See

I think if you're data-mining then you're probably more interested in a purely fixed effects model and will have a rather high dimensional state space over which to build your likelihood functions, so likelihood surfaces may become hard to visualise. But it's important to remember that likelihood is the fundamental quantity from which any exact model will derive its best fit and standard errors, so in low event situations you'll probably find yourself working with likelihoods directly whatever way you do it.
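
To make "working with likelihoods directly" concrete, here is a very small sketch for the zero-event case. It is not the marginal likelihood method mentioned above; it assumes independent cases with a common event probability, which is a strong simplification. With n cases and no events the likelihood is (1 - p)^n, and the largest p still consistent with the data at a chosen confidence level solves (1 - p)^n = 1 - confidence.

```python
def zero_event_likelihood(p, n):
    """Likelihood of seeing no events in n independent cases when each case
    has event probability p."""
    return (1 - p) ** n


def zero_event_upper_bound(n, confidence=0.95):
    """Largest p such that seeing zero events in n cases still has
    probability 1 - confidence (a one-sided upper bound on p)."""
    return 1 - (1 - confidence) ** (1 / n)
```

For n = 500 and 95% confidence the bound is close to 3/500, which is the familiar "rule of three", and it shows why an estimate of exactly zero understates the risk.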

Hope this helps, or at least triggers a few ideas.

Any thoughts on claims fraud modeling?

A bit late to the table... One-class classification, also known as novelty detection, might be worth a look. There is a Matlab package called DD_Tools that contains a number of one-class classifiers, including SVDD (support vector data description). Two researchers active in this area are David J. Tax and Nathalie Japkowicz.
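
To give a feel for the one-class idea, here is a toy sketch in plain Python. It is not SVDD and has nothing to do with the DD_Tools API; the class name and the 99% quantile are assumptions. It learns a description of the "normal" cases only (per-feature mean and spread) and flags anything farther from the centre than almost all of the training cases.

```python
import math
from statistics import mean, pstdev


class SimpleOneClass:
    """Toy one-class detector: standardised distance from the training mean,
    with the threshold set at a quantile of the training distances."""

    def fit(self, rows, quantile=0.99):
        cols = list(zip(*rows))
        self.mu = [mean(c) for c in cols]
        # Fall back to 1.0 for constant features to avoid dividing by zero.
        self.sd = [pstdev(c) or 1.0 for c in cols]
        dists = sorted(self._dist(r) for r in rows)
        self.threshold = dists[min(len(dists) - 1, int(quantile * len(dists)))]
        return self

    def _dist(self, row):
        return math.sqrt(
            sum(((x - m) / s) ** 2 for x, m, s in zip(row, self.mu, self.sd))
        )

    def is_novel(self, row):
        """True if the row lies outside the region covered by the training data."""
        return self._dist(row) > self.threshold
```

The appeal for rare events is that training needs only the abundant class; the rare cases are used, if at all, only to check the threshold.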


I'd like to recommend you take a look at esProc, a scripting tool for complicated data processing. It is an unconventional technique that involves no modeling, just a step-by-step mode of computing. Furthermore, the data analysis does not require a strong technical background, so analysts can get quick answers and try out their ideas without calling on data scientists for help.

I'm wondering if the technique could help you, if so, you could download it at

Looking forward to hearing from you.


