# AnalyticBridge

A Data Science Central Community

I've been trying to understand the CART algorithm for classification trees, and I've got a rudimentary implementation up and running, but generating all possible splits to consider across all variables is computationally inefficient.  Right now I'm doing an exhaustive search which means it gets out of hand past when a variable has 10 classes to consider (2 ** 10 - 1 things to consider yikes!).

I know there must be a way to reduce the search space to find an optimal split, but I can't seem to find a straight forward explanation in the book.  Is the idea here to use sub sampling when predictor's unique values get large enough?

The next part is a crazy idea I had, and I wonder if this might work please be gentle. :-) Thinking about it by my self I thought I know what a pure split might be. So what if I built a couple of sets that are pure from the target response then worked it backwards to find the right predictor's split.  For example say I have the following data set where P1 is a predictor and T1 is the target response where both are categorical:

P1  T1
a    -
b    +
a    -
d    +
c    -
b    +
e    +

So I would split T1 it into two pure sets + = ( b, d, e ) and - = ( a,c ).  Calls those two sets TL and TR respectively.  Since there's no overlap between those two sets I've found the best split.  What do we do if there is overlap

P1  T1
a    -
a    +
b    +
b    -
b    -
c    +
c    +
d    -
e    -
e    -

So doing the same thing we'd get TR = ( a, b, c ) and TL = ( a, b, d, e ).  Now we need to make a trade off because we have overlap between these two sets.  Either by taking a & b out of TR's set, or TL's set. The set of overlap (a&b) will add some amount of impurity to either set.  At this point I could calculate the impurity of putting the overlap set with TR or TL, and pick the best one. For multi-class target responses it would be still be 2 ** L - 1 combinations worst case, but by concentrating on the overlap only it would be L - number of values fitting into a pure set.  Do this across all of the predictors and compare their relative gain of purity.

I think this first would limit my big search to just the target response's size.  And, it's dividing up the what's pure from the impure, and just doing permutations on the impure rather than the whole space.  I guess worst case is that you'd end up doing everything.  However, you could balance that with other predictors that have less impurity and expand those first, and maybe forget the one that's highly impure.

I know this is well trodden territory so what I'm suggesting probably won't work.

I'd still appreciate any help pointing me in the right direction.

Thanks,
Charlie

Views: 382

### Replies to This Discussion

It sounds like you are attempting to overfit the model. I very rarely go more than 3 nodes deep since it is impossible to explain to anyone what is going on any deeper that that. Is you go deeper you will probably end up pruning the tree later. There are various rules and algorithms which will tell you approximately when you have the best tree. You are looking for a balance between accuracy and simplicity. Try to keep your nodes as large as possible. You also want to make sure that you have a test and validation data set, since there is no guarantee that what will work on your dataset will work on the next dataset. You can also perform what is called crossfolding, which essentially "samples" the dataset to accomplish the same thing.

Hope this helps...

-Ralph Winters
Ralph,

Thanks for the reply. My first post is not very good so let me see if I can be more precise about my question. I'm actually trying to figure out the best split for a node during the tree growing process. Since CART is a binary tree nodes are always split on simple boolean questions(i.e x in set B?). For categorical variables that take possible values {b1, b2, b3, ..., bL} we want to construct a set B of some portion of those values that maximizes the goodness split criterion. My question is what's the best algorithm for finding the best set B when the brute force algorithm runs O(2 ** L - 1)? The brute force algorithm simply generates all possible combinations of values b1,...,bL then evaluates those splits using the goodness split criterion.

I found a discussion of a potential heuristic algorithm that that suppose to find a maximum in O(L), but I'm having trouble understanding how it actually works. This is section 4.2.2 in the Breiman 84 book on the subject.
Charlie,

I don't have the Breiman, Friedman, Olshen, and Stone [BFOS] book immediately at hand, so my recollection however incomplete will have to suffice.

If I am not mistaken, for categorical variables CART considers only binary splits of the form {x=a} versus {x!=a}. Internally, it uses a set of indicator variables to represent the categorical variable, treating them the same as any other variable. There is no loss of generality, since there's nothing to stop it from splitting off other values elsewhere in the tree. So it might split {a} versus {b,c}, then split again to get {b} versus {c}.)

By comparison the CHAID algorithm considers arbitrary subsets. As far as I know, this is the only tree algorithm in common use which attempts this.

I think the discussion on page 84 concerns the order statistics, i.e. take each variable, sort the values so that b1 < b2 < b3 < ... < bL, (there's a little bookkeeping needed to keep track of duplicated values in the original data) and proceed to play 'twenty questions'. The heuristic has some monotonicity requirements which don't fit well with categorical variables. You will need to understand this heuristic for your implementation to be as efficient as [BFOS].
> So it might split {a} versus {b,c}, then split again to get {b} versus {c}.)
That's good point. Just - when there are many small categories (e.g. 50cat x 2% of pop. in each), splits with larger subsets create more stable prediction and smaller tree. Charlie's question is valid.
Charlie,

I am not familiar with that method. I would probably approach this by first using a dimension reduction technique, to find the most important features, and then run CART in the usual way. This would reduce your computational time considerably.

-Ralph Winters
Thanks for help and ideas. The page number I quoted actually might only work well for the two class type problems. Reading further in the book it claims that section 4.2.2 is not generally applicable to Gini Impurity functions, and that only exhaustive searches will produce a global maxim.

However, just thinking about this problem I think there might be a way to reduce the search by considering all sets with a single unique value from B. Evaluate each split using the goodness split criterion (GINI Impurity in this case). Pick the set that produced the maximum reduction in impurity. Then repeat those steps using the remaining (L-1) values by adding them to the set and calculate the goodness split criterion on that new set if it doesn't increase in goodness split criterion from last iteration you're done. If you do get a boost recursively continue the process using this new set.

Of course this is just an idea, and it's not been proved out that this optimization would produce an optimal split. There could always be an issue where one that doesn't maximize early on and I'd miss it because sets containing that value aren't expanded. However, that would mean it must be a combination of some value in B + some other value remaining in B which I'm evaluating in the first pass so if something spiked the impurity this algorithm would chase that over the smaller value. It intuitively makes sense this will produce a maximum, but maybe not a global maximum. I tried it out on a couple of datasets, and found that it found the global maxim that the exhaustive search did. It only took 5 seconds on 1 Million rows with 13 unique values in the predictor for the optimized one, but 3 hours for the exhaustive search.

Thanks
Charlie
Jumping in...

@youridea
The data I deal normally with is quite noisy ... so I wont find any subset of pure values per variable. However, you are on the right track...

@yourproblem:
There is a measure called information gain (entropy) used in the decision tree ID3/C4.5. These type of trees have the advantage have that ...
- the information gain criterion measures the purity you described in your idea
- a variable is used for splitting in such a way, that a variable with n-values causes a splitting-node with n branches. Hence no binary split is made.

The tree starts of with the basic purity (of a random classifier) and then selects the variable for splitting which increases the average purity. Every good data mining textbook describes this algorithm, so I guess you can figure out the rest.

If you want to stick to CART, you could try "Random Forests". I didnt try them on my own, but they sound promising.

steffen
Steffen is right, there aren't many 'pure' subsets in real noisy data.
Cart may also use GINI or some other splitting criteria...

1. Create crosstab Category * Binary_target
2. Compute Weight Of Evidence for each category
3. Order categories by WoE
4a) Categories with WoE<0 go left, others right
4b) OR look at it as on ordinal values (order by WoE) - you have only k-1 splits to evaluate

I haven't got the book, couldn't you describe that potential heuristic alogithm here?
I have found one of heuristic approaches:
1. create initial split with k single member subsets ( k is number of categories) & compute informational gain
2. create all possible 2-combinations of those subsets, score them and choose the best split.
3. recursive continue until you have reach marginal distribution with only two subsets.
@Jozo

This sounds very close to what I just added above. Thanks for doing some research for me. :-) While I was looking for other options I came across some weka documentation that describes their process. They said they only consider local maximums for deciding splits, and they really only look at single valued partitions. If that's acceptable practice to this problem I'm going to do the same.