In this post I will explain what decision trees are and how we can construct and analyse them. Finally, I will explain how we can apply our techniques to financial KYC (know-your-customer) data in order to gain insight into our customers and use that insight to detect future fraud cases.
Decision trees are used as a classification model. With a decision tree we can classify an item, that is, predict its target class label, using the attribute values of that item. For example, suppose we have data about animals (see Table 1). The data consists of a number of attributes such as whether the animal has feathers (true, false), whether it gives milk (true, false) and the number of legs it has (0 <= n <= N). Furthermore we have a special attribute, the class label, which tells us to which class the animal belongs (mammal, bird, insect, amphibian etc.).
|Table 1: Example animal data|
|Figure 1: Example decision tree|
Then, when a new animal arrives for which we know all the attribute values but not the class label, we can evaluate its attributes down the decision tree to predict the class label. For example, if a new animal arrives with no hair and 4 legs, we classify it as an amphibian.
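To make this concrete, here is a minimal sketch of the animal example using scikit-learn's `DecisionTreeClassifier`. The attribute values and rows below are invented for illustration and are not the actual contents of Table 1:

```python
# Minimal sketch of the animal example; the rows below are made up
# and stand in for the data of Table 1.
from sklearn.tree import DecisionTreeClassifier

# Columns: has_feathers, gives_milk, number_of_legs
animals = [
    [0, 1, 4],  # mammal
    [1, 0, 2],  # bird
    [0, 0, 6],  # insect
    [0, 0, 4],  # amphibian
]
labels = ["mammal", "bird", "insect", "amphibian"]

tree = DecisionTreeClassifier(random_state=0).fit(animals, labels)

# A new animal arrives: no feathers, no milk, 4 legs.
print(tree.predict([[0, 0, 4]])[0])  # amphibian
```

The fitted tree plays the role of Figure 1: evaluating the new animal's attributes down the tree yields the predicted class label.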
Traditional and Interactive Construction
There exist different methods to automatically construct a decision tree, such as the C4.5 and ID3 algorithms by Ross Quinlan. However, there are several problems with this automatic construction process. First, the users constructing the decision tree are often domain experts, such as fraud analysts. They have extensive knowledge of their field but no knowledge of the decision tree construction algorithms. Even worse, domain experts often do not have detailed knowledge of an algorithm's parameters. Therefore the construction is often a trial-and-error process: the user sets some parameters, generates the tree, checks the result, tweaks the parameters, generates the tree again, and so on. Furthermore, because users are not actively involved in the construction process, they are not able to use their domain knowledge to steer it. Finally, once users have a decision tree they are not able to analyze it, because they have no method to visualize it other than standard node-link diagrams, which fall short. If decision trees are bigger than, say, 100 nodes, analysis becomes really difficult.
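The trial-and-error loop described above can be sketched as follows. This uses scikit-learn's CART implementation as a stand-in for C4.5/ID3, and the Iris dataset as a stand-in for the analyst's own data; both are assumptions for illustration only:

```python
# Sketch of the parameter trial-and-error loop: tweak parameters,
# regenerate the tree, check the result, repeat -- with no way for
# the analyst to inject domain knowledge into any single split.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = []
for max_depth in (2, 4, 8):
    for min_samples_leaf in (1, 5, 10):
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      min_samples_leaf=min_samples_leaf,
                                      random_state=0)
        score = cross_val_score(tree, X, y, cv=5).mean()
        results.append(score)
        print(max_depth, min_samples_leaf, round(score, 3))
```

Each iteration produces a completely new tree; the analyst can only steer indirectly through the parameter grid, which is exactly the problem the interactive approach addresses.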
Visual analytics to the rescue
Our solution to this problem is a visual analytics tool with a tight integration of interaction, visualization and algorithmic support. In the visual analytics tool users are able to interactively construct a decision tree, automatically construct a decision tree for analysis, or a combination of both. Furthermore, we developed scalable decision tree visualizations for analysis.
|Figure 2: Visual Analytics tool|
The first design decision concerned the visualization of the decision tree. We used a standard node-link diagram as a starting point. In this node-link diagram we use the edges to denote the size of the child nodes: the width of each edge denotes the number of items flowing from the parent node to the child node. Next we divided the thick edges into proportionally colored bands to show the different classes that are involved. At the nodes we show the split predicate, the class distribution, a streamgraph that shows the distribution as well as the quantity of each class for the split attribute, and finally the splitpoints with histogram widgets. The histograms show how many items of each class there are on both sides of the splitpoint. For analysis purposes we are less interested in the actual data and more in the underlying structure of the data. Therefore, we do not show the streamgraph visualization at the nodes but only the split predicate and color-banded links. Later we took this concept even further by showing only the color-banded links and corresponding split predicates. This provides a scalable decision tree visualization in which the user can analyze the underlying structure of the data.
Now we will apply our techniques to a fictitious KYC dataset to analyse normal, suspicious and fraud cases, and show how the result can be used to detect new fraud cases. Say we have a dataset with KYC data. We have attributes such as age, sex, income, region, the number of transactions per week and per month, the average transaction amount, the total transaction amount per week and per month, etc. Each data row represents a customer. The target class label assigned to each customer is one of three: normal, suspicious or fraud. Normal denotes that this customer did not commit any fraud. Suspicious means that this customer has had suspicious transactions in the past, and fraud denotes customers that were identified in the past as having committed fraud.
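A sketch of what training a tree on such KYC data could look like is shown below. The customer rows, attribute values and labels are entirely made up for illustration; they are not the fictitious dataset used in the figures:

```python
# Sketch of training a decision tree on KYC-style data; the rows
# and labels below are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: age, income, transactions_per_week, avg_transaction_amount
customers = [
    [34, 42000,  5, 120.0],
    [51, 80000,  2, 300.0],
    [29, 15000, 40, 950.0],
    [45, 60000, 35,  40.0],
]
labels = ["normal", "normal", "fraud", "suspicious"]

tree = DecisionTreeClassifier(random_state=0).fit(customers, labels)

# Print the learned split predicates, the textual counterpart of
# the node-link view in Figure 3.
print(export_text(tree, feature_names=[
    "age", "income", "tx_per_week", "avg_amount"]))
```

The resulting split predicates (e.g. thresholds on transaction frequency or amount) are what the analyst inspects, refines, and ultimately uses to flag new customers.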
|Figure 3: Node link decision tree of KYC data|
|Figure 4: Decision tree visualized using color banded links technique|
|Figure 5: Different classes highlighted|
|Figure 6: Suspicious and Fraud classes aligned and highlighted|
We identified different user tasks and developed corresponding visualization and interaction techniques. Furthermore, we provide scalable decision tree visualizations and a tight integration of both the data and the tree. We help users extensively with algorithmic support during each task of the construction process, achieved through a tight integration of user and computer tasks. As an example, we applied our techniques to fictitious know-your-customer data to show how decision trees can be used to analyse and detect new fraud cases.