What are Tree Methods?
Tree methods are commonly used in data science to understand patterns within data and to build predictive models. The term Tree Methods covers a variety of techniques with different levels of complexity, but my aim is to highlight three I find useful. To set the problem up, let’s assume we have a census dataset containing age, education, employment status and so on. Given all this information, we want to see if we can predict whether a person earns more than $50k per year. How can tree methods help us?
A simple decision tree is the easiest approach to understand. The model looks for the variable that best separates high earners from low earners, and the optimal point at which to make that split. In this example the model finds that age is the most important splitter when predicting income > $50k, so age forms the first branch of the tree (people in the data who were younger than 35 have a lower likelihood of earning > $50k per year).
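To make the idea of "finding the best split" concrete, here is a minimal sketch in plain Python. It assumes Gini impurity as the measure of how well a split separates the two groups (one common choice, not necessarily the one behind the example above), and the toy rows and the `education_years` field are hypothetical, made up purely for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature, threshold, weighted_impurity) for the best binary split.

    rows is a list of dicts mapping a feature name to a numeric value.
    """
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] < threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
            # Size-weighted impurity of the two groups: lower is better.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical toy census rows (not real data).
rows = [
    {"age": 22, "education_years": 12},
    {"age": 25, "education_years": 16},
    {"age": 31, "education_years": 12},
    {"age": 33, "education_years": 14},
    {"age": 38, "education_years": 12},
    {"age": 45, "education_years": 16},
    {"age": 52, "education_years": 10},
    {"age": 58, "education_years": 14},
]
earns_over_50k = [0, 0, 0, 0, 1, 1, 1, 1]

feature, threshold, impurity = best_split(rows, earns_over_50k)
print(feature, threshold, impurity)
```

On this toy data the search picks `age` with a threshold of 35.5, because that split separates the two groups perfectly. A full decision tree simply repeats the same search inside each of the two resulting groups.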
You can continue to split your data, making the tree deeper and deeper, but there is a key trade-off here. The deeper the tree, the better the model becomes at explaining the data, but also the more over-fit it becomes. This means that whilst it does a great job of explaining the exact data it was trained on, it may do a worse job of predicting new data. To help manage this trade-off we often build a model on one portion of a dataset and test its predictive power on the remaining sample.
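The trade-off can be explored with a small experiment. The sketch below is only illustrative: the data is simulated under an assumed rule (people aged 35 or over earn more than $50k with probability 0.6, younger people with probability 0.2), the tree uses age alone, and the 300/100 split between build and test samples is an arbitrary choice.

```python
import random


def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def build_tree(ages, labels, max_depth):
    """Grow a tree on age alone; a leaf is the majority class (0 or 1)."""
    majority = int(2 * sum(labels) >= len(labels))
    values = sorted(set(ages))
    if max_depth == 0 or len(set(labels)) == 1 or len(values) < 2:
        return majority
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for a, y in zip(ages, labels) if a < threshold]
        right = [y for a, y in zip(ages, labels) if a >= threshold]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, threshold)
    threshold = best[1]
    left = [(a, y) for a, y in zip(ages, labels) if a < threshold]
    right = [(a, y) for a, y in zip(ages, labels) if a >= threshold]
    return (
        threshold,
        build_tree([a for a, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([a for a, _ in right], [y for _, y in right], max_depth - 1),
    )


def predict(tree, age):
    """Walk from the root to a leaf; internal nodes are (threshold, left, right)."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if age < threshold else right
    return tree


def accuracy(tree, people):
    return sum(predict(tree, a) == y for a, y in people) / len(people)


rng = random.Random(0)


def sample_person():
    # Assumed data-generating rule, for illustration only.
    age = rng.randint(18, 70)
    p_high = 0.6 if age >= 35 else 0.2
    return age, int(rng.random() < p_high)


people = [sample_person() for _ in range(400)]
train, test = people[:300], people[300:]  # build sample and hold-out sample

results = {}
for depth in (1, 3, 50):
    tree = build_tree([a for a, _ in train], [y for _, y in train], depth)
    results[depth] = (accuracy(tree, train), accuracy(tree, test))
    print(depth, results[depth])
```

Accuracy on the build sample can only go up as the tree gets deeper, because each extra split refines the previous tree. Accuracy on the hold-out sample has no such guarantee, which is why the hold-out figure is the one to watch when choosing the depth.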
Simple decision trees are useful for basic models or for understanding data at a high level. The major downside is that the larger the tree becomes, the more your sample is sliced and diced, leaving very small sample sizes at the lowest level. This restricts the technique’s ability to build more complex models. To take this further, a new approach is required that combines multiple trees (called an ensemble method). At Capital One we use both Random Forests and Gradient Boosted Machines to do this.
The Random Forests technique builds multiple trees, each on a random sample of rows drawn from the data with replacement (also known as Bagging). Where this technique deviates from plain Bagging is that the input variables are randomly sampled as well as the rows. This sampling and tree building process happens many times, and the predictions from all the trees are then combined (often through a simple average) to give a final prediction.
This approach overcomes the issue of running out of sample, as the trees built are generally smaller than the optimal tree. The re-sampling of rows with replacement helps the model capture variation in the data better and helps guard against over-fit. One potential downside of Bagging is that if there are a few very strongly predictive inputs they may dominate every tree, leading to highly correlated trees whose predictions vary little from one tree to the next, so averaging them adds little. The sampling of input variables helps overcome this risk, as it leads to more varied trees.
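The two sampling steps can be sketched in a few lines of plain Python. This is a simplified illustration rather than a full Random Forest: each "tree" is a single split (a stump), the input variables are sampled once per tree, and the data is simulated under an assumed rule in which age and education both raise the chance of earning more than $50k while `hours_per_week` is pure noise. All field names and numbers are hypothetical.

```python
import random


def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(rows, labels, features):
    """Best single split over the given features.

    Returns (feature, threshold, left_rate, right_rate), where the rates are
    the share of high earners on each side of the split.
    """
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] < t]
            right = [y for row, y in zip(rows, labels) if row[f] >= t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # every sampled feature was constant: predict the overall rate
        rate = sum(labels) / len(labels)
        return (features[0], 0.0, rate, rate)
    return best[1:]


def fit_forest(rows, labels, n_trees, n_features, rng):
    names = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bagging: sample rows with replacement, same size as the original data.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random Forest twist: also sample which input variables the tree may use.
        features = rng.sample(names, n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Simple average of the individual tree predictions."""
    votes = [left if row[f] < t else right for f, t, left, right in forest]
    return sum(votes) / len(votes)


rng = random.Random(1)
rows, labels = [], []
for _ in range(200):
    row = {
        "age": rng.randint(18, 70),
        "education_years": rng.randint(8, 18),
        "hours_per_week": rng.randint(10, 60),
    }
    # Assumed data-generating rule, for illustration only.
    p_high = 0.1 + (0.4 if row["age"] >= 35 else 0.0) + (0.4 if row["education_years"] >= 14 else 0.0)
    rows.append(row)
    labels.append(int(rng.random() < p_high))

forest = fit_forest(rows, labels, n_trees=50, n_features=2, rng=rng)
p_older = predict_forest(forest, {"age": 50, "education_years": 17, "hours_per_week": 40})
p_younger = predict_forest(forest, {"age": 20, "education_years": 10, "hours_per_week": 40})
print(p_older, p_younger)
```

Because each tree only sees two of the three inputs, a tree that happens to miss `age` is forced to split on something else, which is what keeps the trees from all looking the same.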
Gradient Boosted Machines (GBMs)
GBMs differ from Random Forests in that, rather than building a series of trees independently and pooling the predictions, each tree builds on the previous trees’ predictions. A basic first tree is built and scored out on the sample to create both the predictions and the residuals from the tree (actual outcome minus the prediction). The second phase is to build another tree to predict these residuals. This process then continues, building each new tree on the residuals left by the trees before it, until adding further trees no longer improves the model. The final prediction combines the contributions of all the trees.
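The residual-fitting loop can be sketched as follows. This is a minimal illustration under several assumptions: each tree is a single split on age, the starting prediction is simply the overall rate of high earners, the loop runs for a fixed number of rounds rather than stopping when the model stops improving, and a `learning_rate` parameter (my own naming) shrinks each tree's contribution. The ten data points are made up.

```python
def fit_stump(xs, residuals):
    """Regression stump: the split on x that minimises squared error of the residuals.

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]


def fit_gbm(xs, ys, n_trees, learning_rate):
    base = sum(ys) / len(ys)  # starting prediction: the overall average outcome
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Residual = actual outcome minus the current prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        # Add a shrunken version of the new tree to the running prediction.
        predictions = [
            p + learning_rate * (left_mean if x < t else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (l if x < t else r) for t, l, r in stumps)


def mse(predictions, ys):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical toy data: age and whether the person earns more than $50k.
ages = [22, 25, 29, 33, 37, 41, 46, 52, 58, 63]
earns_over_50k = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]

model = fit_gbm(ages, earns_over_50k, n_trees=20, learning_rate=0.3)
baseline_error = mse([model[0]] * len(ages), earns_over_50k)
fitted_error = mse([predict_gbm(model, a) for a in ages], earns_over_50k)
print(baseline_error, fitted_error)
```

The error on the build sample falls as trees are added, which is exactly why a GBM left to run for too long will over-fit. In practice the number of trees would be chosen by watching the error on a hold-out sample instead.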
As with the other tree methods, it is important not to over-fit to the data. There are lots of options for controlling the process, including: restricting the size of each tree, penalizing more complex trees, controlling the influence of a single tree and sampling approaches for each tree build. When building a GBM it is important to know what settings have been used in order to understand the output. If they are well understood, GBM models can often be some of the most predictive (for example, they are often seen in winning Kaggle entries).
Examples where we’ve used Tree Methods at Capital One:
Basic decision trees are really useful for building quick, simple models that can be easily understood and implemented. There are cases where the in-market business impact can be huge from building a quick tree and deploying in a short period of time rather than the more time and resource intensive build/deployment of a complex model. Knowing the trade-offs in techniques and having a really clear understanding of the business need helps the UK Data Science team to build the right type of model.
Trees are very useful at the outset of a modelling project to try and understand the relationships within your dataset. We use GBMs to understand the influential variables in a model upfront. They are also useful for quickly reducing the field of potential splitters, focusing in on the data that matters. A quick unconstrained GBM model can also act as a useful benchmark, allowing the Data Scientist to gauge how predictive their final model may be.
When would I use Tree Methods?
As with any technique related to data science, Tree Methods are one of many approaches you could take to solve a business problem using large amounts of data. The key is being able to pick and choose when to take Tree Methods off the shelf. At a high level: Tree Methods may help you with a prediction problem, provided you keep in mind the warning around potential over-fit.
Credits: Sarah Pollicott, Carola Deppe, Sarah Johnston, Kevin Chisholm