# AnalyticBridge

A Data Science Central Community

Hi All-

Looking for validation on this one:

Is it legitimate and useful to calculate R-squared for a decision tree model, both in general and specifically with the following methodology (example below)?

1) Calculate SSTO on the testing set as SUM((yi - ybar)^2), where ybar is the overall mean of y.
2) Calculate SSE on the testing set by computing SUM((yi - ybar_leaf)^2) within every leaf node, where ybar_leaf is the mean at that leaf node, and then simply adding up the per-leaf SSEs.
3) Calculate R-squared as 1 - (SSE/SSTO).
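
The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not the poster's actual code: the function name `variation_explained`, the toy data, and the leaf labels are all made up. It also assumes the per-leaf mean is the value the tree learned on the training data and is passed in as `leaf_means`; the post does not say whether the training or testing leaf mean is used.

```python
from collections import defaultdict

def variation_explained(y, leaf_ids, leaf_means):
    """Return 1 - SSE/SSTO for test-set targets y.

    y          : observed target values on the testing set
    leaf_ids   : leaf node each record falls into (same length as y)
    leaf_means : dict mapping leaf id -> mean used as that leaf's prediction
                 (assumed here to come from the training data)
    """
    y_bar = sum(y) / len(y)
    ssto = sum((yi - y_bar) ** 2 for yi in y)          # step 1

    sse_by_leaf = defaultdict(float)                   # step 2: SSE per leaf
    for yi, leaf in zip(y, leaf_ids):
        sse_by_leaf[leaf] += (yi - leaf_means[leaf]) ** 2
    sse = sum(sse_by_leaf.values())                    # add up the leaves

    return 1.0 - sse / ssto                            # step 3

# Toy data, invented for illustration: two leaves, six test records.
y_test = [0.0, 0.0, 1.0, 9.0, 10.0, 11.0]
leaves = ["A", "A", "A", "B", "B", "B"]
train_leaf_means = {"A": 0.5, "B": 10.0}

r2 = variation_explained(y_test, leaves, train_leaf_means)
print(r2)
```

Passing the leaf means in as an argument keeps the choice explicit: using means recomputed on the testing set instead would generally give a more optimistic number.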

For example, I have a testing set with an SSTO of 199375602089438.
The SSE summed over the 24 leaf nodes is 7460083730039.81.
So R-squared is 1 - (7460083730039.81/199375602089438) = 0.96.
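
As a quick check of the arithmetic, the two totals quoted in the post can be plugged in directly. The variable names are just for illustration.

```python
ssto = 199375602089438       # total sum of squares on the testing set (from the post)
sse = 7460083730039.81       # SSE summed over the 24 leaf nodes (from the post)

r2 = 1 - sse / ssto
print(round(r2, 2))          # two decimals, as reported in the post
print(round(100 * r2, 1))    # the same value as a percentage
```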

NOTES:
1) Used a CHAID model with a numeric dependent variable and a couple of predictors.
2) 70% of the records have 0 sales in the period (sales is the dependent variable). I think that decision trees are not affected by a mass at zero; thoughts on this are appreciated.

Thanks!


### Replies to This Discussion

I will reply to my own question. I did find the following discussion where it seems that there is disagreement on this practice. One professor advocates using the normal R-squared formula and another suggests other sources / methods.

If anyone has an opinion, I would love to hear it.

http://www.mail-archive.com/[email protected]/msg86041.html
I use the methodology you speak of all the time. I was the original programmer for Breiman and Stone's version of CART in the late 1970s, which is where I believe I was first introduced to that method. However, we were very careful to use the term "variation explained," since there is little relationship to the theoretical Pearson "r". (Multiply by 100 to get percent variation explained.)
Be aware that this value can go negative, which implies that parts of your model have much higher variation than the population variance.
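
A small made-up case shows how the value goes negative. This sketch assumes the leaf predictions come from the training data and happen to fit the testing records badly; the data and the helper name `one_minus_sse_over_ssto` are hypothetical.

```python
def one_minus_sse_over_ssto(y, predictions):
    """1 - SSE/SSTO, with SSTO measured around the mean of y."""
    y_bar = sum(y) / len(y)
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    return 1.0 - sse / ssto

# Two leaves whose training means are far from what the test records show.
y_test = [0.0, 0.0, 10.0, 10.0]
leaf_predictions = [10.0, 10.0, 0.0, 0.0]

r2 = one_minus_sse_over_ssto(y_test, leaf_predictions)
print(r2)   # negative: SSE (400) is larger than SSTO (100)
```

Whenever the summed leaf SSE exceeds SSTO, the ratio is above 1 and the statistic drops below zero.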
I would use this "statistic" only as a means of comparing the outcomes of different models built on the same population base.
In my experience, a percent variation explained as high as yours usually implies the model is "too good to be true." You might want to randomly exclude a large subset of your zero-sales data and see what changes when you model what is left. You might also need to run two models: one to predict a zero or non-zero outcome, and a second that takes the records predicted to be non-zero and models those separately.
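
The first suggestion, randomly excluding most of the zero-sales records before refitting, might look like the sketch below. The record layout (a dict with a `sales` key), the function name `drop_zero_records`, and the 20% keep fraction are assumptions for illustration only.

```python
import random

def drop_zero_records(records, keep_fraction, seed=0):
    """Keep all non-zero records and a random fraction of the zero ones."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    zeros = [r for r in records if r["sales"] == 0]
    nonzeros = [r for r in records if r["sales"] != 0]
    n_keep = round(len(zeros) * keep_fraction)
    return nonzeros + rng.sample(zeros, n_keep)

# Toy data mirroring the 70% zero share mentioned in the question.
records = [{"sales": 0.0} for _ in range(70)] + [{"sales": 25.0} for _ in range(30)]
subset = drop_zero_records(records, keep_fraction=0.2)
n_zero_kept = sum(1 for r in subset if r["sales"] == 0)
print(len(subset), n_zero_kept)
```

Refitting the tree on `subset` and recomputing the variation explained would show how much of the original figure depended on the zeros.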
Other modeling tools like KXEN K2R usually handle that type of underlying data structure pretty well.
John Gins
Thanks John!

Very insightful. I had a pretty large holdout group for the validation partition, but I understand the dangers; it did seem too good to be true :)

I'll check out KXEN for this type of data. In the past I have modeled the zeros via logistic regression and the >0 part as a gamma distribution with a log link (generalized linear model), then put them together as a "zero-inflated gamma" using the SAS NLMIXED procedure.
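
For readers who have not seen the two-part idea before, here is a minimal sketch of how the two pieces combine into one prediction. It is not the SAS NLMIXED code; it assumes the logistic part models the probability that sales are greater than zero, and the coefficients and the function name `expected_sales` are invented.

```python
import math

def expected_sales(x, logit_coefs, gamma_coefs):
    """Expected sales = P(sales > 0) * E[sales | sales > 0].

    x           : predictor values, with a leading 1.0 for the intercept
    logit_coefs : coefficients of the logistic part (hypothetical)
    gamma_coefs : coefficients of the gamma part with log link (hypothetical)
    """
    eta_zero = sum(b * xi for b, xi in zip(logit_coefs, x))
    eta_pos = sum(b * xi for b, xi in zip(gamma_coefs, x))
    p_positive = 1.0 / (1.0 + math.exp(-eta_zero))   # inverse logit
    mean_if_positive = math.exp(eta_pos)             # inverse of the log link
    return p_positive * mean_if_positive

pred = expected_sales([1.0, 2.0], logit_coefs=[-1.0, 0.5], gamma_coefs=[3.0, 0.1])
print(pred)
```

The combined prediction can then be scored with the same 1 - SSE/SSTO statistic, which makes it comparable to the tree on the same testing set.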