
Fraud Detection Using Interactive Construction And Analysis Of Decision Trees

A fantastic article from one of our company's very talented engineers:



Last week we presented a novel prototype visual analytics tool at the IEEE VisWeek conference, the premier forum for advances in scientific and information visualization. This event-packed week brings together researchers and practitioners from academia, government, and industry to explore their shared interests in tools, techniques, and technology. We presented our work in the Visual Analytics Science and Technology (VAST) track.

In this post I will explain what decision trees are and how we can construct and analyze them. Finally, I will explain how we apply our novel techniques to financial KYC (know-your-customer) data to gain insight into our customers and use that insight to detect future fraud cases.

Decision trees are used as a classification model: with a decision tree we can classify, or predict, the target class label of an item using that item's attribute values. For example, suppose we have data about animals (see Table 1). The data consists of a number of attributes, such as whether the animal has feathers (true, false), whether it gives milk (true, false), and the number of legs it has (0 <= n <= N). Furthermore, we have a special attribute, the class label, that tells us to which class the animal belongs (mammal, bird, insect, amphibian, etc.).

Table 1: Example animal data
From this data we can construct a decision tree. Decision trees can be represented by a standard node-link diagram (see Figure 1). In this node-link diagram, each node contains a test on an attribute, each link carries the test values, and each leaf represents a target class label.

Figure 1: Example decision tree

Then, if a new animal arrives for which we know all the attribute values but not the class label, we can evaluate the attributes down the decision tree to predict the target class label. For example, if a new animal arrives with no hair and four legs, we classify it as an amphibian.
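To make this evaluation concrete, here is a minimal sketch in Python of walking a tree like the one in Figure 1. The split rules below are illustrative stand-ins, not the actual tests from the figure:

```python
def classify(animal):
    """Walk the tree: each internal node tests one attribute and the
    matching branch is followed until a leaf holds the class label."""
    if animal["feathers"]:          # root test
        return "bird"
    if animal["gives_milk"]:
        return "mammal"
    if animal["legs"] >= 6:
        return "insect"
    return "amphibian"

# A new animal with unknown class label: no feathers, no milk, 4 legs.
print(classify({"feathers": False, "gives_milk": False, "legs": 4}))  # amphibian
```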

Traditional and Interactive Construction

There exist different methods to automatically construct a decision tree, such as the C4.5 and ID3 algorithms by Ross Quinlan. However, there are several problems with this automatic construction process. First, the users constructing the decision tree are often domain experts, such as fraud analysts. They have extensive knowledge of their field but no knowledge of the decision tree construction algorithms. Even worse, domain experts often do not have detailed knowledge of the algorithm's parameters, so construction becomes a trial-and-error process: the user sets some parameters, generates the tree, checks the result, tweaks the parameters, generates the tree again, and so on. Furthermore, because users are not actively involved in the construction process, they are unable to use their domain knowledge to steer it. Finally, once users have a decision tree, they have no way to analyze it other than standard node-link diagrams, which fall short: if a decision tree grows beyond, say, 100 nodes, analysis becomes very difficult.
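At the core of ID3 (and its successor C4.5) is a greedy split criterion: at each node, pick the attribute whose split yields the largest reduction in class entropy. A minimal sketch of that criterion, using a toy dataset in the spirit of Table 1:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy after splitting on `attr` --
    the criterion ID3 uses to pick the next split attribute."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = [{"feathers": True}, {"feathers": True},
        {"feathers": False}, {"feathers": False}]
labels = ["bird", "bird", "mammal", "amphibian"]
print(information_gain(rows, labels, "feathers"))  # 1.0
```

Splitting on feathers perfectly separates the birds, so the gain here is a full bit of entropy.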

Visual analytics to the rescue

Our solution to this problem is a visual analytics tool with a tight integration of interaction, visualization and algorithmic support. In the visual analytics tool users are able to interactively construct a decision tree, automatically construct a decision tree for analysis, or a combination of both. Furthermore, we developed scalable decision tree visualizations for analysis.

Figure 2: Visual Analytics tool


The first design decision concerned the visualization of the decision tree. We used a standard node-link diagram as a starting point. In this node-link diagram we use the edges to convey the size of the child nodes: the width of each edge denotes the number of items flowing from the parent node to the child node. Next, we divided these thick edges into proportionally colored bands to show the different classes that are involved. At the nodes we show the split predicate, the class distribution, a streamgraph that shows the distribution as well as the quantity of each class for the split attribute, and finally the split points as histogram widgets. The histograms show how many items of each class lie on either side of the split point.

For analysis purposes we are less interested in the actual data and more in the underlying structure of the data. Therefore, we do not show the streamgraph visualization at the nodes but only the split predicate and the color-banded links. Later we took this concept even further by showing only the color-banded links and their split predicates. This provides a scalable decision tree visualization in which the underlying structure of the data can be analyzed.
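A minimal sketch of the data behind one color-banded link: the edge's total width is proportional to the number of items flowing to the child, subdivided into one band per class. The field names and the income threshold below are illustrative:

```python
from collections import Counter

def edge_bands(parent_items, predicate, n_total):
    """For one parent-to-child edge: total edge width (fraction of all
    items that flow along it) and one colored band per class."""
    child = [it for it in parent_items if predicate(it)]
    width = len(child) / n_total
    per_class = Counter(it["label"] for it in child)
    bands = {cls: cnt / n_total for cls, cnt in per_class.items()}
    return width, bands

items = [{"label": "normal", "income": 20000},
         {"label": "fraud",  "income": 50000},
         {"label": "fraud",  "income": 60000},
         {"label": "normal", "income": 30000}]

# Edge for the branch income > 36281.
width, bands = edge_bands(items, lambda it: it["income"] > 36281, len(items))
print(width, bands)  # 0.5 {'fraud': 0.5}
```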
Fraud detection use case

Now we will apply our techniques to a fictional KYC dataset to analyze suspicious, fraud, and non-fraud cases, and show how we can use the result to detect new fraud cases. Say we have a dataset with KYC data. We have attributes such as age, sex, income, region, number of transactions per week and per month, the average transaction amount, total transaction amount per week and per month, etc. Each data row represents a customer. The target class label assigned to each customer is one of three: normal, suspicious, or fraud. Normal denotes that this customer did not commit any fraud, suspicious means that this customer has had suspicious transactions in the past, and fraud denotes customers who were identified in the past as committing fraud.
Figure 3: Node link decision tree of KYC data
First we load the data into our visual analytics tool and let the tool automatically generate a decision tree based on the C4.5 algorithm. The generated decision tree consists of a total of 179 nodes: 93 leaf nodes and 86 internal nodes. If we use a standard node-link diagram to analyze the decision tree (see Figure 3), we miss information such as the number of items of each class flowing through the tree, or the distribution of the classes at the nodes. This makes interpretation of the data very difficult. Therefore we visualize the decision tree using our color-banded link technique in order to analyze the underlying structure of our data (see Figure 4).
Figure 4: Decision tree visualized using color banded links technique
From this image several observations can be made. We see that the normal cases (fortunately) make up the biggest share of the data, and that the numbers of suspicious and fraud cases are roughly equal. We see that most normal cases have an income lower than or equal to 36281, with only a few normal cases above that; a big share of the fraud cases, by contrast, have a higher income. Furthermore, we observe that the suspicious cases all have an income lower than 36281. The second observation is that among the fraud cases with a high income, the average transaction amount is greater than 25781 and the fraudsters are married and male; the fraud cases that are not married all have children. In addition, we observe that among the fraud cases with an income lower than 36281, the biggest share lives in an inner-city or town region. The part that lives in a town region is mostly female. Of the part that lives in an inner-city region, most are also married, do not have children, and have an average transaction amount per week smaller than 19532.
Figure 5: Different classes highlighted
Next we observe that suspicious cases are hard to separate from the fraud cases, because they tend to follow the same path down the decision tree. This becomes even clearer if we align the suspicious and fraud cases (see Figure 6).
Figure 6: Suspicious and Fraud classes aligned and highlighted
Finally, we can determine which attributes are most important in detecting fraud cases. We color the split attributes and analyze the decision tree (see Figure 7). We see that income, region, average transaction amount, married, age, sex, and average transaction amount per week are the most important attributes for detecting the fraud cases (see Figure 8).
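One rough proxy for attribute importance, in the spirit of coloring the split attributes, is to tally how often each attribute appears as a split in the tree. A sketch over a toy nested-dict tree (the real tree has 179 nodes, and a production measure would also weight splits by information gain):

```python
from collections import Counter

def split_attribute_counts(tree):
    """Recursively tally how often each attribute is used as a split
    in a tree stored as nested dicts."""
    if "leaf" in tree:              # leaf node: holds a class label
        return Counter()
    counts = Counter([tree["attr"]])
    for child in tree["children"]:
        counts += split_attribute_counts(child)
    return counts

# Toy three-split tree with illustrative KYC attributes.
tree = {"attr": "income", "children": [
    {"attr": "region", "children": [
        {"leaf": "suspicious"}, {"leaf": "normal"}]},
    {"attr": "avg_transaction", "children": [
        {"leaf": "fraud"}, {"leaf": "normal"}]},
]}
print(split_attribute_counts(tree))
```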

Figure 7: Important attributes
Figure 8: Important attributes in detecting Fraud cases

By analyzing the decision tree we gained insight into the underlying structure of our data. We identified the most important attributes for determining whether a customer is likely to commit fraud. Many more interesting observations can be made by analyzing the decision tree further; we leave this as an exercise for the reader.
Now, if a new customer arrives, we can use this decision tree to determine whether the customer needs further inspection because they are classified as suspicious or fraud.
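As a sketch, a new customer can be routed through the tree and flagged for review when the predicted label is anything other than normal. The rules below are a drastically simplified, illustrative stand-in for the full tree, reusing only the income and transaction thresholds mentioned in the observations above:

```python
def classify_customer(c):
    """Illustrative stand-in for the full 179-node tree: only the
    high-income fraud branch from the observations above is modeled."""
    if c["income"] > 36281:
        if (c["avg_transaction"] > 25781 and c["married"]
                and c["sex"] == "male"):
            return "fraud"
        return "normal"
    return "normal"  # the real tree separates suspicious cases here

def needs_inspection(customer):
    """Flag a new customer for manual review."""
    return classify_customer(customer) != "normal"

print(needs_inspection({"income": 40000, "avg_transaction": 30000,
                        "married": True, "sex": "male"}))    # True
print(needs_inspection({"income": 20000, "avg_transaction": 5000,
                        "married": False, "sex": "female"})) # False
```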


We identified different user tasks and developed corresponding visualization and interaction techniques. Furthermore, we provide scalable decision tree visualizations and a tight integration of both the data and the tree. We help users extensively with algorithmic support during each task of the construction process, achieved through a tight integration of user and computer tasks. As an example, we applied our techniques to fictional know-your-customer data to show how decision trees can be used to analyze and detect new fraud cases.
More details are described in our paper (to appear):
Stef van den Elzen and Jarke J. van Wijk.
BaobabView: Interactive Construction and Analysis of Decision Trees.
In: Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (VAST 2011), Providence, RI, USA, October 23-28, pp. 151-160.
