A Data Science Central Community
I have received a data set in which only 0.23% of the claims are fraudulent; the remaining claims (over 99.7%) are legitimate. Can I build a logistic regression model on this data set to identify suspicious or fraudulent claims in the future?
My worry is that such a low percentage of fraudulent claims may not give me a proper result if I use the data set as it is.
Can you suggest a particular technique?
I have been trying to establish the best model for detecting fraud related to money laundering activities.
I have found that,
1. Logistic regression
3. Custom neural models
4. MARS and
5. Time series Analysis
are appropriate tools for this purpose. Could you help me decide which one is the most appropriate in this case?
Dear Mr Shoumak,
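You may try undersampling, oversampling, or SMOTE to bring some sort of balance to the data set. After that, any good algorithm should give reasonably high sensitivity, meaning the proportion of actual frauds that the model correctly identifies as fraud. You need not hesitate to do this: these techniques change the class proportions the model sees during training, not the underlying relationship between the predictors and fraud. In other words, they do not tinker with the physics of the system. Do apply them to the training data only and evaluate on data with the original class mix, because a model trained on balanced data will overstate the probability of fraud.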
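To make the resampling idea concrete, here is a minimal sketch of random undersampling and random oversampling in plain Python, together with a sensitivity calculation. The function names and the toy data (1000 claims, 3 of them fraudulent) are assumptions made up for illustration. SMOTE is not implemented here: it creates synthetic fraud cases by interpolating between neighbouring real ones, and the imbalanced-learn package offers one implementation.

```python
import random

def _split_indices(labels):
    """Return (fraud_indices, legit_indices) for labels coded 1 = fraud, 0 = legitimate."""
    fraud = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    return fraud, legit

def random_undersample(rows, labels, seed=0):
    """Keep every fraud row and an equal-sized random sample of legitimate rows."""
    rng = random.Random(seed)
    fraud, legit = _split_indices(labels)
    keep = fraud + rng.sample(legit, min(len(fraud), len(legit)))
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

def random_oversample(rows, labels, seed=0):
    """Keep every row and duplicate random fraud rows until both classes are the same size."""
    rng = random.Random(seed)
    fraud, legit = _split_indices(labels)
    if not fraud:  # nothing to duplicate
        return list(rows), list(labels)
    extra = [rng.choice(fraud) for _ in range(max(0, len(legit) - len(fraud)))]
    keep = fraud + legit + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

def sensitivity(y_true, y_pred):
    """Share of actual frauds that were predicted as fraud (true positive rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy training data (made up): one feature per claim, 3 frauds among 1000 claims.
labels = [1] * 3 + [0] * 997
rows = [[float(i)] for i in range(1000)]

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)

print(len(under_labels), sum(under_labels))  # rows kept, frauds kept
print(len(over_labels), sum(over_labels))    # rows after duplication, frauds after duplication
```

Either balanced set can then be passed to whatever classifier you prefer. Sensitivity should be computed on held-out claims that were not resampled.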
In addition, you may drop redundant predictor variables, identified by a variable selection algorithm, by domain knowledge, or by both.
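One simple way an algorithm can flag redundant predictors is a pairwise correlation filter. The sketch below is only one possible approach, not the only one: the column names, the data, and the 0.95 threshold are all hypothetical, and a real project would combine such a filter with domain knowledge.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if its absolute correlation with every
    previously kept column is below the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical predictors: the second column is the first one in different units.
claims = {
    "claim_amount": [100.0, 250.0, 400.0, 800.0, 1200.0],
    "claim_amount_cents": [10000.0, 25000.0, 40000.0, 80000.0, 120000.0],
    "days_since_policy_start": [30.0, 400.0, 12.0, 250.0, 90.0],
}

kept = drop_redundant(claims)
print(kept)  # ['claim_amount', 'days_since_policy_start']
```

The filter keeps the first column of each highly correlated group, so the order of the columns decides which of two near-duplicates survives.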
I am sure these things will help.