
One of the challenges in data mining for fraud detection or prediction is the extremely small number of fraud cases captured by the client's existing system.

Besides under- or oversampling to overcome this problem, what other approaches do you use at the data level? And from your experience, which DM technique is best suited to this kind of problem? Please share your knowledge/experience.

Tags: data, fraud, techniques



Replies to This Discussion

I recently heard of a very interesting technique that is a mixture between parameter estimation and filtering. The idea is used in discovering Denial of Service attacks in networks. It observes the network over many months and uses this data to feed a parameter estimation. This functions as the central tendency of the dynamic behavior of the network. Then when the DoS attack starts, it can recognize the abnormal behavior and start filtering the offending IP addresses.
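
A minimal sketch of the estimate-then-filter idea, under loud assumptions: the "parameter estimation" is reduced here to a mean and standard deviation of per-source request counts rather than a differential-equation model, and the names `fit_baseline`, `flag_sources`, the threshold `k` and all the numbers are hypothetical.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Estimate the 'normal' per-source request rate from historical counts."""
    return mean(history), stdev(history)

def flag_sources(requests_by_ip, baseline, k=4.0):
    """Return the IPs whose request count exceeds mean + k * std of the baseline."""
    mu, sigma = baseline
    limit = mu + k * sigma
    return sorted(ip for ip, n in requests_by_ip.items() if n > limit)

# Hypothetical requests per minute per source, observed during normal operation.
history = [10, 12, 9, 11, 13, 10, 8, 12, 11, 10]
baseline = fit_baseline(history)

# Hypothetical current window: one source is far outside the baseline.
current = {"10.0.0.1": 11, "10.0.0.2": 9, "10.0.0.3": 250}
blocked = flag_sources(current, baseline)
print(blocked)
```

A real system would presumably re-estimate the baseline continuously and model time-of-day effects; this only shows the two stages, estimation and filtering.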

This technique rests on the observation that the network is a dynamic system and can thus be characterized by some differential equation. Would this apply to the fraud prediction you are looking at?

not exactly, but interesting :)

My interest has more to do with credit cards and other payment methods. From the few projects I've worked on, I find that decision trees perform better most of the time.

I would like to know other people's experiences while modeling for fraud detection/prediction.

I've been working on credit card fraud detection with Visa for about 2 years. I believe that combining multiple approaches works best: a hybrid algorithm, e.g. 50 fairly small, possibly overlapping decision trees, where some final nodes are replaced by a logistic or logic regression if appropriate, each decision tree featuring a few typical, clustered fraud cases.
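
To make the "many small trees" idea concrete, here is a rough sketch that is far simpler than the hybrid described above: 50 one-split trees (stumps), each fit on a balanced bootstrap sample and a randomly chosen feature, with the score being the fraction of stumps voting "fraud". It has no logistic end nodes and no clustering of fraud cases. The two features (amount, transactions in the last hour), the data, and the assumption that higher values mean more risk are all hypothetical.

```python
import random

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that minimizes misclassifications.

    Assumes (for this sketch only) that larger values indicate fraud.
    """
    best = None
    for t in sorted({r[feature] for r in rows}):
        errors = sum((r[feature] >= t) != bool(y) for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, t)
    return feature, best[1]

def fit_ensemble(rows, labels, n_trees=50, seed=0):
    """Fit n_trees stumps, each on a balanced bootstrap sample."""
    rng = random.Random(seed)
    fraud = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    stumps = []
    for _ in range(n_trees):
        # Draw as many legitimate cases as fraud cases, with replacement.
        idx = [rng.choice(fraud) for _ in fraud] + [rng.choice(legit) for _ in fraud]
        feature = rng.randrange(len(rows[0]))
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return stumps

def score(stumps, row):
    """Fraction of stumps that vote 'fraud' for this row."""
    return sum(row[f] >= t for f, t in stumps) / len(stumps)

# Hypothetical data: (amount, transactions in the last hour); 2 frauds out of 10.
rows = [(20, 1), (35, 2), (15, 1), (40, 1), (25, 2),
        (30, 1), (22, 2), (18, 1), (900, 9), (750, 8)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

stumps = fit_ensemble(rows, labels)
print(score(stumps, (1000, 10)), score(stumps, (25, 1)))
```

Because each stump sees a different resample, no single tree has to explain every fraud pattern, which is one plausible reading of why many small trees are more robust than one very large one.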

Back in 2002, we were using one very big decision tree, with thousands of nodes, produced by SAS Enterprise Miner. However, this approach is very poor in terms of robustness and interpretation. I was actually one of the guys to recommend moving away from this approach.
See below something I wrote a while ago
See for the context

Mathematical Model

The scoring methodology developed by Authenticlick is state of the art. It is based on almost 30 years of experience in auditing, statistics and fraud detection, both in real time and on historical data. Several patents are currently pending.

It combines sophisticated cross-validation, design of experiments, linkage and unsupervised clustering to find new rules, machine learning, and the most advanced models ever used in scoring, with a parallel implementation and fast, robust algorithms to produce at once a large number of small overlapping decision trees. The clustering algorithm is a hybrid combination of unique decision-tree technology with a new type of PLS logistic stepwise regression to handle tens of thousands of highly redundant metrics. It provides meaningful regression coefficients computed in a very short amount of time, and efficiently handles interaction between rules.

Some aspects of the methodology show limited similarities with ridge regression, tree bagging and tree boosting. Below we compare the efficiency of different systems to detect click fraud on highly realistic simulated data. The criterion for comparison is the mean square error, a metric that measures the fit between scored clicks and conversions:

* Scoring system with identical weights: 60% improvement over binary (fraud / non fraud) approach
* First-order PLS regression: 113% improvement over binary approach
* Full standard regression (not recommended as it provides highly unstable and non-interpretable results): 157% improvement over binary approach
* Second-order PLS regression: 197% improvement over binary approach, easy interpretation and robust, nearly parameter-free technique

Substantial additional improvement is achieved when the decision trees component is added to the mix. Improvement rates on real data are similar.

Do you have code for the Markov Chain Monte Carlo or simulated annealing methods you speak of in your blog? I would like to understand the basic data size to compute ratio as well as the native instruction set of the operators.

Hi Theo,

Much of the material is protected (patent). You can have a look at the patent document, although it does not cover the details of Simulated Annealing / Monte Carlo. Some of the Monte Carlo simulations were performed in the context of computing extreme value distributions to detect strong outliers in complicated discrete distributions (multivariate bin counts).

For the patent, see

I've been developing fraud detection 'rules' for a few years and have found the preponderance of credit card fraud to be associated with identity theft. Clustering is useful to augment link analysis - to uncover fraud rings. Thresholds (possibly velocity checks) are commonly used.
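
As an illustration of a velocity check, here is a minimal sketch that flags a card once it makes more than a set number of transactions inside a sliding time window. The function name `velocity_alerts`, the limit of 3, the 10-minute window and the sample events are assumptions made up for the example, not rules from any production system.

```python
from collections import defaultdict, deque

def velocity_alerts(events, max_count=3, window_seconds=600):
    """Return (card, timestamp) for each transaction that pushes a card
    above max_count transactions within the last window_seconds."""
    recent = defaultdict(deque)
    alerts = []
    for card, ts in sorted(events, key=lambda e: e[1]):
        q = recent[card]
        q.append(ts)
        # Drop transactions that have fallen out of the window.
        while q and ts - q[0] > window_seconds:
            q.popleft()
        if len(q) > max_count:
            alerts.append((card, ts))
    return alerts

# Hypothetical (card id, timestamp in seconds) events.
events = [("A", 0), ("A", 60), ("B", 90), ("A", 120),
          ("A", 180), ("B", 4000), ("A", 5000)]
alerts = velocity_alerts(events)
print(alerts)
```

Card A triggers on its fourth transaction within ten minutes; its later transaction at 5000 seconds does not, because the earlier ones have aged out of the window.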

Interesting thread. Does anyone have experience with Detica?

Hello Romakanta,
There are many sophisticated algorithms for identifying rare phenomena, but their main problem is a lack of stability. These algorithms detect low-frequency groups in the learning sample, and those groups almost never reappear in a different data set, even one from the same source. If you are going to use logistic regression, you can try changing the cutoff point. Regardless of your preferred algorithm, my advice is to focus on the right set of variables. Good variables always come before sophisticated methods. If you would like, I can send you a copy of my software that fits fractional polynomials for logistic regression. This method produces new variables from the original data set that can be more sensitive to the target variable (as long as it is binary: fraud / not fraud).
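
The cutoff suggestion can be sketched as follows. With rare fraud, predicted probabilities tend to sit well below 0.5, so the default cutoff may flag nothing. The scores and labels below are made-up stand-ins for a logistic model's output, and choosing the cutoff by F1 is just one possible criterion, not something prescribed in this thread.

```python
def confusion(scores, labels, cutoff):
    """Count true positives, false positives and false negatives at a cutoff."""
    tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= cutoff and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < cutoff and y == 1 for s, y in zip(scores, labels))
    return tp, fp, fn

def best_cutoff(scores, labels, candidates):
    """Return the candidate cutoff with the highest F1 score."""
    def f1(cutoff):
        tp, fp, fn = confusion(scores, labels, cutoff)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(candidates, key=f1)

# Hypothetical predicted fraud probabilities; 2 frauds in 10 cases.
scores = [0.02, 0.05, 0.03, 0.30, 0.08, 0.04, 0.22, 0.01, 0.06, 0.12]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# Recall at the default 0.5 cutoff: no fraud case scores that high.
default_recall = confusion(scores, labels, 0.5)[0] / sum(labels)
cutoff = best_cutoff(scores, labels, [0.05, 0.1, 0.2, 0.3, 0.5])
print(default_recall, cutoff)
```

In practice the cutoff should be chosen on held-out data and weighed against the cost of false alarms versus missed fraud.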

You can use stratified sampling of good and bad observations in combination with a resampling technique such as bootstrapping or cross-validation to get started.
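
One way to combine stratification with cross-validation is to build the folds class by class, so every fold keeps roughly the overall fraud rate. The sketch below assumes a hypothetical data set of 100 cases with 5 frauds; `stratified_folds` is an illustrative name, not a library function.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, dealing each class out round-robin
    so that every fold has about the same fraud rate as the full data."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Hypothetical labels: 5 fraud cases among 100 observations.
labels = [1] * 5 + [0] * 95
folds = stratified_folds(labels, k=5)

for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))
```

Without stratification, a random split of so few fraud cases can easily leave a fold with none at all, which makes its evaluation meaningless.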

I am relatively new to the field of internet fraud, but I have had some success there with associative clustering. It can succeed where statistically based methods fail.

Let me give you an example: harvester bots. Remember the phrase from the nineties, "Internet Superhighway"? The Internet is still the superhighway for bots today, and the fast lanes are online advertising. By following ads, your bot is guaranteed to land on sites where the money is and where people with money go. On advertisers' sites you can harvest e-mail addresses to add to your spam list, look for unpatched servers whose web pages you can alter so that they install your "identity outsourcing" tools on their visitors' computers, and pursue all sorts of other lucrative activities. Of course, these bots do not declare themselves as bots; they try to pretend to be human visitors.

The traditional way of capturing them was to look at the pattern of their behavior. They click systematically on many or all ads on a page, so it should be easy to detect this level of activity. Most detection methods would look for a large number of clicks from the same IP address and the same user agent (the string that identifies the "browser"). At the beginning of this year these methods stopped working and the hundreds of known harvester bots disappeared. A quick application of clustering on several attributes produced very strong clusters, all but a few containing a clique of bots. It was amazing to see how effectively a few simple tricks hid tens of very active bots in plain sight! They would change each of the attributes after a few clicks in a fashion that avoids not only any statistics-based detection, but also the tests based on validating text with regular-expression checkers.

The point is that it is often easy to beat detection based on statistical anomalies. In these cases you need to look for associations between attackers that are harder to find and try to flush out shared unusual patterns. Understanding the goals and mechanics of an attack can help you narrow down which attributes to look at, but a broader approach of searching for unusual clusters will help you detect "0-day attacks", that is, attacks whose pattern has not been detected yet.
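
The post does not say which clustering algorithm was used, so the following is only one simple way to look for associations: link any two click records that share an attribute value, and take the connected components (via union-find). A bot that rotates its IP, user agent and cookie at different moments still leaves a chain of shared values. The attribute names and records are hypothetical.

```python
from collections import defaultdict

def link_clusters(records):
    """Group records that share any (attribute, value) pair, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for i, rec in enumerate(records):
        for pair in rec.items():
            if pair in first_seen:
                # Merge this record's group with the group that first used the value.
                parent[find(i)] = find(first_seen[pair])
            else:
                first_seen[pair] = i

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical click records; the first four rotate one attribute at a time.
clicks = [
    {"ip": "1.1.1.1", "agent": "UA-1", "cookie": "c1"},
    {"ip": "2.2.2.2", "agent": "UA-1b", "cookie": "c1"},  # same cookie
    {"ip": "2.2.2.2", "agent": "UA-2", "cookie": "c2"},   # same IP
    {"ip": "3.3.3.3", "agent": "UA-2", "cookie": "c3"},   # same agent
    {"ip": "9.9.9.9", "agent": "UA-9", "cookie": "c9"},   # unrelated visitor
]
clusters = link_clusters(clicks)
print(clusters)
```

On real traffic, very common values (a popular browser string, a shared proxy IP) would have to be excluded or down-weighted, otherwise they link everything into one cluster.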


On Data Science Central

© 2020   TechTarget, Inc.   Powered by
