Help understanding optimizing CART for finding optimal splits - AnalyticBridge2020-08-12T07:29:52Zhttps://www.analyticbridge.datasciencecentral.com/forum/topics/help-understanding-optimizing?commentId=2004291%3AComment%3A63386&x=1&feed=yes&xn_auth=no> So it might split {a} ve…tag:www.analyticbridge.datasciencecentral.com,2010-03-16:2004291:Comment:633862010-03-16T11:19:35.686ZJozo Kovachttps://www.analyticbridge.datasciencecentral.com/profile/JozoKovac
> So it might split {a} versus {b,c}, then split again to get {b} versus {c}.)<br />
That's good point. Just - when there are many small categories (e.g. 50cat x 2% of pop. in each), splits with larger subsets create more stable prediction and smaller tree. Charlie's question is valid.
> So it might split {a} versus {b,c}, then split again to get {b} versus {c}.)<br />
That's good point. Just - when there are many small categories (e.g. 50cat x 2% of pop. in each), splits with larger subsets create more stable prediction and smaller tree. Charlie's question is valid. Charlie,
I don't have the Br…tag:www.analyticbridge.datasciencecentral.com,2010-03-15:2004291:Comment:632552010-03-15T14:09:39.040ZJohn Bauerhttps://www.analyticbridge.datasciencecentral.com/profile/JohnBauer
Charlie,<br />
<br />
I don't have the Breiman, Friedman, Olshen, and Stone [BFOS] book immediately at hand, so my recollection however incomplete will have to suffice.<br />
<br />
If I am not mistaken, for categorical variables CART considers only binary splits of the form {x=a} versus {x!=a}. Internally, it uses a set of indicator variables to represent the categorical variable, treating them the same as any other variable. There is no loss of generality, since there's nothing to stop it from splitting off other…
Charlie,<br />
<br />
I don't have the Breiman, Friedman, Olshen, and Stone [BFOS] book immediately at hand, so my recollection however incomplete will have to suffice.<br />
<br />
If I am not mistaken, for categorical variables CART considers only binary splits of the form {x=a} versus {x!=a}. Internally, it uses a set of indicator variables to represent the categorical variable, treating them the same as any other variable. There is no loss of generality, since there's nothing to stop it from splitting off other values elsewhere in the tree. So it might split {a} versus {b,c}, then split again to get {b} versus {c}.)<br />
<br />
By comparison the CHAID algorithm considers arbitrary subsets. As far as I know, this is the only tree algorithm in common use which attempts this.<br />
<br />
I think the discussion on page 84 concerns the order statistics, i.e. take each variable, sort the values so that b1 < b2 < b3 < ... < bL, (there's a little bookkeeping needed to keep track of duplicated values in the original data) and proceed to play 'twenty questions'. The heuristic has some monotonicity requirements which don't fit well with categorical variables. You will need to understand this heuristic for your implementation to be as efficient as [BFOS]. @Jozo
This sounds very close…tag:www.analyticbridge.datasciencecentral.com,2010-03-15:2004291:Comment:632522010-03-15T13:49:12.355ZCharlie Hubbardhttps://www.analyticbridge.datasciencecentral.com/profile/CharlieHubbard
@Jozo<br />
<br />
This sounds very close to what I just added above. Thanks for doing some research for me. :-) While I was looking for other options I came across some weka documentation that describes their process. They said they only consider local maximums for deciding splits, and they really only look at single valued partitions. If that's acceptable practice to this problem I'm going to do the same.<br />
<br />
<a href="http://wekadocs.com/node/2" target="_blank">http://wekadocs.com/node/2</a><br />
<br />
Charlie
@Jozo<br />
<br />
This sounds very close to what I just added above. Thanks for doing some research for me. :-) While I was looking for other options I came across some weka documentation that describes their process. They said they only consider local maximums for deciding splits, and they really only look at single valued partitions. If that's acceptable practice to this problem I'm going to do the same.<br />
<br />
<a href="http://wekadocs.com/node/2" target="_blank">http://wekadocs.com/node/2</a><br />
<br />
Charlie Thanks for help and ideas. Th…tag:www.analyticbridge.datasciencecentral.com,2010-03-15:2004291:Comment:632512010-03-15T13:34:11.816ZCharlie Hubbardhttps://www.analyticbridge.datasciencecentral.com/profile/CharlieHubbard
Thanks for help and ideas. The page number I quoted actually might only work well for the two class type problems. Reading further in the book it claims that section 4.2.2 is not generally applicable to Gini Impurity functions, and that only exhaustive searches will produce a global maxim.<br />
<br />
However, just thinking about this problem I think there might be a way to reduce the search by considering all sets with a single unique value from B. Evaluate each split using the goodness split criterion…
Thanks for help and ideas. The page number I quoted actually might only work well for the two class type problems. Reading further in the book it claims that section 4.2.2 is not generally applicable to Gini Impurity functions, and that only exhaustive searches will produce a global maxim.<br />
<br />
However, just thinking about this problem I think there might be a way to reduce the search by considering all sets with a single unique value from B. Evaluate each split using the goodness split criterion (GINI Impurity in this case). Pick the set that produced the maximum reduction in impurity. Then repeat those steps using the remaining (L-1) values by adding them to the set and calculate the goodness split criterion on that new set if it doesn't increase in goodness split criterion from last iteration you're done. If you do get a boost recursively continue the process using this new set.<br />
<br />
Of course this is just an idea, and it's not been proved out that this optimization would produce an optimal split. There could always be an issue where one that doesn't maximize early on and I'd miss it because sets containing that value aren't expanded. However, that would mean it must be a combination of some value in B + some other value remaining in B which I'm evaluating in the first pass so if something spiked the impurity this algorithm would chase that over the smaller value. It intuitively makes sense this will produce a maximum, but maybe not a global maximum. I tried it out on a couple of datasets, and found that it found the global maxim that the exhaustive search did. It only took 5 seconds on 1 Million rows with 13 unique values in the predictor for the optimized one, but 3 hours for the exhaustive search.<br />
<br />
Thanks<br />
Charlie I have found one of heuristic…tag:www.analyticbridge.datasciencecentral.com,2010-03-15:2004291:Comment:632492010-03-15T12:40:48.613ZJozo Kovachttps://www.analyticbridge.datasciencecentral.com/profile/JozoKovac
I have found one of heuristic approaches:<br />
1. create initial split with k single member subsets ( k is number of categories) & compute informational gain<br />
2. create all possible 2-combinations of those subsets, score them and choose the best split.<br />
3. recursive continue until you have reach marginal distribution with only two subsets.
I have found one of heuristic approaches:<br />
1. create initial split with k single member subsets ( k is number of categories) & compute informational gain<br />
2. create all possible 2-combinations of those subsets, score them and choose the best split.<br />
3. recursive continue until you have reach marginal distribution with only two subsets. Steffen is right, there aren'…tag:www.analyticbridge.datasciencecentral.com,2010-03-15:2004291:Comment:632482010-03-15T12:29:33.832ZJozo Kovachttps://www.analyticbridge.datasciencecentral.com/profile/JozoKovac
Steffen is right, there aren't many 'pure' subsets in real noisy data.<br />
Cart may also use GINI or some other splitting criteria...<br />
<br />
What about this idea (for nominal values)<br />
1. Create crosstab Category * Binary_target<br />
2. Compute Weight Of Evidence for each category<br />
3. Order categories by WoE<br />
4a) Categories with WoE<0 go left, others right<br />
4b) OR look at it as on ordinal values (order by WoE) - you have only k-1 splits to evaluate<br />
<br />
I haven't got the book, couldn't you describe that potential…
Steffen is right, there aren't many 'pure' subsets in real noisy data.<br />
Cart may also use GINI or some other splitting criteria...<br />
<br />
What about this idea (for nominal values)<br />
1. Create crosstab Category * Binary_target<br />
2. Compute Weight Of Evidence for each category<br />
3. Order categories by WoE<br />
4a) Categories with WoE<0 go left, others right<br />
4b) OR look at it as on ordinal values (order by WoE) - you have only k-1 splits to evaluate<br />
<br />
I haven't got the book, couldn't you describe that potential heuristic alogithm here? Jumping in...
@youridea
The…tag:www.analyticbridge.datasciencecentral.com,2010-03-15:2004291:Comment:632272010-03-15T07:23:34.503ZSteffen Springerhttps://www.analyticbridge.datasciencecentral.com/profile/SteffenSpringer
Jumping in...<br />
<br />
@youridea<br />
The data I deal normally with is quite noisy ... so I wont find any subset of pure values per variable. However, you are on the right track...<br />
<br />
@yourproblem:<br />
There is a measure called information gain (entropy) used in the decision tree ID3/C4.5. These type of trees have the advantage have that ...<br />
- the information gain criterion measures the purity you described in your idea<br />
- a variable is used for splitting in such a way, that a variable with n-values causes a…
Jumping in...<br />
<br />
@youridea<br />
The data I deal normally with is quite noisy ... so I wont find any subset of pure values per variable. However, you are on the right track...<br />
<br />
@yourproblem:<br />
There is a measure called information gain (entropy) used in the decision tree ID3/C4.5. These type of trees have the advantage have that ...<br />
- the information gain criterion measures the purity you described in your idea<br />
- a variable is used for splitting in such a way, that a variable with n-values causes a splitting-node with n branches. Hence no binary split is made.<br />
<br />
The tree starts of with the basic purity (of a random classifier) and then selects the variable for splitting which increases the average purity. Every good data mining textbook describes this algorithm, so I guess you can figure out the rest.<br />
<br />
If you want to stick to CART, you could try "Random Forests". I didnt try them on my own, but they sound promising.<br />
<br />
hope this was helpful<br />
<br />
steffen Charlie,
I am not familiar w…tag:www.analyticbridge.datasciencecentral.com,2010-03-13:2004291:Comment:630692010-03-13T16:21:47.308ZRalph Wintershttps://www.analyticbridge.datasciencecentral.com/profile/RalphWinters
Charlie,<br />
<br />
I am not familiar with that method. I would probably approach this by first using a dimension reduction technique, to find the most important features, and then run CART in the usual way. This would reduce your computational time considerably.<br />
<br />
-Ralph Winters
Charlie,<br />
<br />
I am not familiar with that method. I would probably approach this by first using a dimension reduction technique, to find the most important features, and then run CART in the usual way. This would reduce your computational time considerably.<br />
<br />
-Ralph Winters Ralph,
Thanks for the reply.…tag:www.analyticbridge.datasciencecentral.com,2010-03-13:2004291:Comment:630272010-03-13T05:48:40.042ZCharlie Hubbardhttps://www.analyticbridge.datasciencecentral.com/profile/CharlieHubbard
Ralph,<br />
<br />
Thanks for the reply. My first post is not very good so let me see if I can be more precise about my question. I'm actually trying to figure out the best split for a node during the tree growing process. Since CART is a binary tree nodes are always split on simple boolean questions(i.e x in set B?). For categorical variables that take possible values {b1, b2, b3, ..., bL} we want to construct a set B of some portion of those values that maximizes the goodness split criterion. My…
Ralph,<br />
<br />
Thanks for the reply. My first post is not very good so let me see if I can be more precise about my question. I'm actually trying to figure out the best split for a node during the tree growing process. Since CART is a binary tree nodes are always split on simple boolean questions(i.e x in set B?). For categorical variables that take possible values {b1, b2, b3, ..., bL} we want to construct a set B of some portion of those values that maximizes the goodness split criterion. My question is what's the best algorithm for finding the best set B when the brute force algorithm runs O(2 ** L - 1)? The brute force algorithm simply generates all possible combinations of values b1,...,bL then evaluates those splits using the goodness split criterion.<br />
<br />
I found a discussion of a potential heuristic algorithm that that suppose to find a maximum in O(L), but I'm having trouble understanding how it actually works. This is section 4.2.2 in the Breiman 84 book on the subject. It sounds like you are attemp…tag:www.analyticbridge.datasciencecentral.com,2010-03-13:2004291:Comment:630092010-03-13T02:56:57.799ZRalph Wintershttps://www.analyticbridge.datasciencecentral.com/profile/RalphWinters
It sounds like you are attempting to overfit the model. I very rarely go more than 3 nodes deep since it is impossible to explain to anyone what is going on any deeper that that. Is you go deeper you will probably end up pruning the tree later. There are various rules and algorithms which will tell you approximately when you have the best tree. You are looking for a balance between accuracy and simplicity. Try to keep your nodes as large as possible. You also want to make sure that you have a…
It sounds like you are attempting to overfit the model. I very rarely go more than 3 nodes deep since it is impossible to explain to anyone what is going on any deeper that that. Is you go deeper you will probably end up pruning the tree later. There are various rules and algorithms which will tell you approximately when you have the best tree. You are looking for a balance between accuracy and simplicity. Try to keep your nodes as large as possible. You also want to make sure that you have a test and validation data set, since there is no guarantee that what will work on your dataset will work on the next dataset. You can also perform what is called crossfolding, which essentially "samples" the dataset to accomplish the same thing.<br />
<br />
Hope this helps...<br />
<br />
-Ralph Winters