
Dear all,


I have received a data set where only 0.23% of the claims are fraudulent and the remaining 99.77% are legitimate. Can I build a logistic regression model using this data set to identify future suspicious/fraudulent claims?


My worry is that such a low percentage of fraudulent claims in the present data set may not give a proper result if I use it as it is.


Can you suggest a particular technique?


Best regards,


Shounak Ghosal



Replies to This Discussion

How many input variables do you have, and how many observations? I would try neural networks or some tree-based method, but it depends on your variables.

Hi Tomas,

Thank you for your reply.

I have around 60 variables and 195,953 observations, but 55 of these variables are categorical in nature.

Let me know if you require further detail.


I haven't worked with fraud/risk/credit data before, but I understand the precision with which modeling is done in those domains.
I've also heard of very small modeling populations like yours (~0.2%).

There are ways to weight your sample using weighted response modeling techniques: basically, you skew the sample so it contains a 0.5% or 1% response rate. This may not even be required in your case, since I believe modeling with a 0.2% response is quite common; you just may not get the best fit, or a likelihood that covers the [0-1] range completely.

Also, just to mention: skewing/biasing/weighting the sample shifts the intercept but should leave the slope estimates essentially unchanged, though I'd double-check that.
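As a minimal sketch of the biased-sampling idea above, assuming the data lives in a pandas DataFrame with a binary `fraud` column (column and function names are hypothetical): keep every fraud case and draw a fixed number of legitimate cases per fraud, so the sample hits the target response rate.

```python
import numpy as np
import pandas as pd

def biased_sample(df, target_col, ratio=10, seed=42):
    """Keep every fraud case and draw `ratio` legitimate cases per fraud,
    producing a roughly 1:`ratio` fraud-to-legitimate sample."""
    rng = np.random.default_rng(seed)
    fraud = df[df[target_col] == 1]
    legit = df[df[target_col] == 0]
    n_keep = min(len(legit), ratio * len(fraud))
    keep_idx = rng.choice(legit.index, size=n_keep, replace=False)
    # Shuffle so fraud and legitimate rows are interleaved.
    return pd.concat([fraud, legit.loc[keep_idx]]).sample(frac=1, random_state=seed)

# Toy data: ~0.23% fraud, as in the original question.
n = 100_000
df = pd.DataFrame({
    "amount": np.random.default_rng(0).lognormal(size=n),
    "fraud": (np.random.default_rng(1).random(n) < 0.0023).astype(int),
})
sample = biased_sample(df, "fraud", ratio=10)
print(sample["fraud"].mean())  # ~1/11, i.e. a 1:10 fraud-to-legitimate mix
```

The model is then fit on `sample` instead of `df`; note that a model fit this way needs its intercept corrected before its probabilities are interpreted at the population scale.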

What I would be more concerned with is the number of categorical variables: 55 out of 60! That's a lot, and usually there's not much use for all of them. I'd suggest you try some other techniques as well: neural networks as suggested, decision trees/CART, or perhaps GENMOD.

Would be interested to know your outcome.

- Arun
Dear Arun,

Thank you for your reply.

I have tried biased sampling, keeping fraud and legitimate claims in a 1:10 ratio, and found only 6 significant variables in my final model. You may be surprised to know that my concordance is 78.5% with 19.8% tied observations. That seems a satisfactory outcome to me. However, when I calculated the threshold based on the cost of misclassification, I got a value of 0.0084. That is a very low probability cutoff and carries a high risk of inflating the false positive rate.

I am just wondering if I need to try some other method in this case. I will definitely update you with my new approach and result of the same.
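For reference, a cutoff like the 0.0084 above falls out of the standard cost-based rule: flag a claim when the expected cost of ignoring it exceeds the expected cost of investigating it. The cost values below are hypothetical, chosen only to reproduce a cutoff of that magnitude:

```python
def cost_threshold(c_fp, c_fn):
    """Bayes-optimal cutoff: flag a claim when
    p * c_fn >= (1 - p) * c_fp, i.e. when p >= c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

# Hypothetical costs: a missed fraud costs 118x a false alarm.
t = cost_threshold(c_fp=1, c_fn=118)
print(round(t, 4))  # 0.0084
```

A very asymmetric cost ratio necessarily produces a very low threshold, which is why the false positive rate climbs; this is a property of the costs, not a defect of the model.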



I would not try to find a threshold in this case (because of the risk you rightly mention).

What I'd do is to create a 'score distribution' with fraud rate.

To do this, create deciles of your scored population (after building your fraud model via logistic regression) and calculate the fraud percentage in each range.

The scores would be the outputs of your model, perhaps with a transformation applied afterwards if you like (multiply by 1000, for example).

With this, you can calculate your KS and Gini, as with a regular scorecard model, and take decisions based on the performance (fraud rate) of each score range.
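The decile report and KS statistic described above can be sketched as follows, assuming model scores and true labels are already in hand (the synthetic scores here are illustrative only):

```python
import numpy as np
import pandas as pd

def score_distribution(scores, y, n_bins=10):
    """Decile report: fraud rate per score band, plus the KS statistic."""
    df = pd.DataFrame({"score": scores, "fraud": y})
    df["decile"] = pd.qcut(df["score"], n_bins, labels=False, duplicates="drop")
    report = df.groupby("decile").agg(n=("fraud", "size"),
                                      fraud_rate=("fraud", "mean"))
    # KS: max gap between cumulative fraud share and cumulative non-fraud
    # share, scanning from the highest score downwards.
    order = df.sort_values("score", ascending=False)
    cum_bad = order["fraud"].cumsum() / order["fraud"].sum()
    cum_good = (1 - order["fraud"]).cumsum() / (1 - order["fraud"]).sum()
    ks = (cum_bad - cum_good).abs().max()
    return report, ks

# Illustrative data: 2% fraud, frauds scoring higher on average.
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.02).astype(int)
scores = rng.normal(size=5000) + 2.0 * y
report, ks = score_distribution(scores, y)
print(report)
print(f"KS = {ks:.3f}")
```

Decisions are then taken per score band (e.g. investigate everything in the top one or two deciles), which sidesteps the single-threshold problem entirely.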


I propose the following approach:
1. Classify (cluster) your data.
2. Identify the classes that contain the fraudsters; observations outside these classes are then considered non-fraudulent.

If the rate of potential fraudsters is high enough, go to step 3; if not, another solution must be found.

3. Create a binary variable separating potential fraudsters from non-fraudsters.
4. Apply logistic regression.
5. Determine the misclassification rate for the new variable.
6. Adapt the original variable to account for these misclassifications.

I think this may solve your problem.

I apologize if my English is approximate (I am French), so I use a translator.


Rare events logistic regression and/or matching could be the way to go.
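One concrete piece of the rare-events approach (from the King & Zeng paper cited later in this thread) is the prior correction: fit an ordinary logistic regression on a biased sample, then adjust only the intercept to reflect the true population event rate. A minimal sketch, with illustrative numbers (the fitted intercept of -2.3 is hypothetical; 0.23% is the fraud rate from the original question):

```python
import math

def prior_correct_intercept(beta0, tau, ybar):
    """King & Zeng (2001) prior correction: after fitting a logistic
    regression on a sample with event fraction `ybar`, shift the
    intercept to reflect the true population event fraction `tau`.
    Slope coefficients are left unchanged."""
    return beta0 - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Example: model fit on a 1:10 biased sample (ybar = 1/11),
# true fraud rate 0.23% as in the original question.
b0 = prior_correct_intercept(beta0=-2.3, tau=0.0023, ybar=1 / 11)
print(b0)  # substantially more negative than -2.3
```

After the correction, predicted probabilities are on the population scale again, so any cost-based cutoff applies directly.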

Best of luck


Dear All,

I have been trying to establish the best model for detecting fraud with regards to money laundering activities.

I have found that,

1. Logistic regression


3. Custom neural models

4. MARS and

5. Time series Analysis

are the appropriate tools for this task. Kindly advise which one is the most appropriate in this case?




For a previous discussion on the same topic see:

I also suggest looking at the work of Prof David Hand, he's quite active in the field of fraud detection and classification.

You need to use SMOTE (Synthetic Minority Oversampling Technique). Weka and KNIME both have modules that can do it for you.
King, Gary, and Langche Zeng. "Logistic Regression in Rare Events Data." Political Analysis 9 (2001): 137-163.
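To make the SMOTE suggestion concrete without depending on Weka or KNIME, here is a minimal, from-scratch sketch of the core idea (not a production implementation): each synthetic point is an interpolation between a random minority sample and one of its k nearest minority-class neighbours.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: for each synthetic point, pick a random
    minority sample, pick one of its k nearest minority neighbours,
    and interpolate at a random point on the segment between them."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per row
    base = rng.integers(0, len(X_min), n_new)  # random minority anchors
    nbr = nn[base, rng.integers(0, k, n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))               # interpolation weights in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(1)
X_fraud = rng.normal(loc=3.0, size=(20, 2))   # 20 minority samples, 2 features
X_syn = smote(X_fraud, n_new=180)             # grow the minority class 10x
print(X_syn.shape)  # (180, 2)
```

Because every synthetic point lies on a segment between two real minority points, SMOTE densifies the minority region rather than simply duplicating rows, which is what distinguishes it from plain oversampling.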

Dear Mr Shounak,

You may try undersampling, oversampling, or SMOTE in order to bring some balance to the dataset. Then any good algorithm will give reasonably high sensitivity, meaning the proportion of frauds correctly identified by the model as fraud. You need not hesitate to do this, as it would not alter the boundary between the fraudulent and legitimate classes; in other words, it would not tinker with the physics of the system.

In addition, you may discard redundant predictor variables identified by algorithms, domain knowledge, or both.

I am sure these things will help.


Dr Ravi


On Data Science Central

© 2021 TechTarget, Inc.
