All Discussions Tagged 'decisiontrees' - AnalyticBridge2020-07-05T04:43:17Zhttps://www.analyticbridge.datasciencecentral.com/forum/topic/listForTag?tag=decisiontrees&feed=yes&xn_auth=noHelp with Regression Treestag:www.analyticbridge.datasciencecentral.com,2010-04-23:2004291:Topic:672862010-04-23T20:48:09.904ZCharlie Hubbardhttps://www.analyticbridge.datasciencecentral.com/profile/CharlieHubbard
I'm working on implementing regression trees, but the algorithm I'm using to calculate the best split is not working. I've read and re-read the chapters on regression trees and choosing the best split using Least Squares Regression, and they are very clear about how to do it. I've also validated my math against R with respect to MSE calculations and they match perfectly. My problem comes when I go to calculate the gain in MSE from performing a split. The texts are very clear. That…
I'm working on implementing regression trees, but the algorithm I'm using to calculate the best split is not working. I've read and re-read the chapters on regression trees and choosing the best split using Least Squares Regression, and they are very clear about how to do it. I've also validated my math against R with respect to MSE calculations and they match perfectly. My problem comes when I go to calculate the gain in MSE from performing a split. The texts are very clear. That calculation is as follows:<div><br/></div>
<div>deltaR(s, t) = R(t) - R(leftSplit) - R( rightSplit )</div>
<div>Best split = max( deltaR(s,t) )</div>
<div><br/></div>
<div>However, quite frequently I'm getting negative numbers or very small numbers for this quantity. For example, if R(t) = 0.65 then R(left) = 0.64 and R(right) = 0.43. Since adding up the left and right is potentially greater than 1.0 it can go negative. In this particular case R says the improvement is 0.17, and I'm coming up with -0.42. I also tried multiplying those two quantities by their relative size:</div>
<div><br/></div>
<div>R(t) - sizeOf(leftSplit) / sizeOf(t) * R(leftSplit) - sizeOf(rightSplit) / sizeOf(t) * R(rightSplit)</div>
<div><br/></div>
<div>But I didn't get anything close to what R had as the gain in MSE. At this point I'm not sure if they are measuring the gain the same way, but in their comments they mention Sum of Squares, and they calculate the mean of each split. </div>
<div><br/></div>
<div>I've compared this against the improvements in gain from R using rpart(), and I can't figure out how they are coming up with the numbers they get. I've even read the code in R, but they are doing an optimization that I can't quite understand so comparing their method against my own is not direct. If someone understands the optimization I'd love to incorporate that into what I'm doing, but without an explanation of how it works I'm reluctant to just copy it.</div>
<div><br/></div>
<div>What am I missing here? I'm running out of ideas.</div>
<div><br/></div>
<div>Thanks in Advance,</div>
<div>Charlie</div> Help understanding optimizing CART for finding optimal splitstag:www.analyticbridge.datasciencecentral.com,2010-03-11:2004291:Topic:628712010-03-11T21:29:15.853ZCharlie Hubbardhttps://www.analyticbridge.datasciencecentral.com/profile/CharlieHubbard
I've been trying to understand the CART algorithm for classification trees, and I've got a rudimentary implementation up and running, but generating all possible splits to consider across all variables is computationally inefficient. Right now I'm doing an exhaustive search which means it gets out of hand past when a variable has 10 classes to consider (2 ** 10 - 1 things to consider yikes!).<div><br></br></div>
<div>I know there must be a way to reduce the search space to find an optimal split,…</div>
I've been trying to understand the CART algorithm for classification trees, and I've got a rudimentary implementation up and running, but generating all possible splits to consider across all variables is computationally inefficient. Right now I'm doing an exhaustive search which means it gets out of hand past when a variable has 10 classes to consider (2 ** 10 - 1 things to consider yikes!).<div><br/></div>
<div>I know there must be a way to reduce the search space to find an optimal split, but I can't seem to find a straight forward explanation in the book. Is the idea here to use sub sampling when predictor's unique values get large enough?</div>
<div><br/></div>
<div>The next part is a crazy idea I had, and I wonder if this might work please be gentle. :-) Thinking about it by my self I thought I know what a pure split might be. So what if I built a couple of sets that are pure from the target response then worked it backwards to find the right predictor's split. For example say I have the following data set where P1 is a predictor and T1 is the target response where both are categorical:</div>
<div><br/></div>
<div>P1 T1</div>
<div> a -</div>
<div> b +</div>
<div> a -</div>
<div> d +</div>
<div> c -</div>
<div> b +</div>
<div> e +</div>
<div><br/></div>
<div>So I would split T1 it into two pure sets + = ( b, d, e ) and - = ( a,c ). Calls those two sets TL and TR respectively. Since there's no overlap between those two sets I've found the best split. What do we do if there is overlap</div>
<div><br/></div>
<div>P1 T1</div>
<div> a -</div>
<div> a +</div>
<div> b +</div>
<div> b -</div>
<div> b -</div>
<div> c +</div>
<div> c +</div>
<div> d -</div>
<div> e -</div>
<div> e -</div>
<div><br/></div>
<div>So doing the same thing we'd get TR = ( a, b, c ) and TL = ( a, b, d, e ). Now we need to make a trade off because we have overlap between these two sets. Either by taking a & b out of TR's set, or TL's set. The set of overlap (a&b) will add some amount of impurity to either set. At this point I could calculate the impurity of putting the overlap set with TR or TL, and pick the best one. For multi-class target responses it would be still be 2 ** L - 1 combinations worst case, but by concentrating on the overlap only it would be L - number of values fitting into a pure set. Do this across all of the predictors and compare their relative gain of purity.</div>
<div><br/></div>
<div>I think this first would limit my big search to just the target response's size. And, it's dividing up the what's pure from the impure, and just doing permutations on the impure rather than the whole space. I guess worst case is that you'd end up doing everything. However, you could balance that with other predictors that have less impurity and expand those first, and maybe forget the one that's highly impure.</div>
<div><br/></div>
<div>I know this is well trodden territory so what I'm suggesting probably won't work.</div>
<div><br/></div>
<div>I'd still appreciate any help pointing me in the right direction.</div>
<div><br/></div>
<div>Thanks,</div>
<div>Charlie</div>