
I have performed a decision tree analysis. The problem is that I get an "impossible" tree: the combination of values that the tree shows can't be true.

 

The first split is on a variable that counts how many donations the customer has given during a period. One of the nodes is 'Missing', meaning zero gifts during that time period.

 

This node then splits into several child nodes using a component variable that holds information about recency and frequency. When I read the child node titles, I see a component value that cannot occur there. I have checked the raw data, and that combination doesn't exist.

 

I have not run the score code, just looked at the tree diagram, and that's when I saw the odd split. Is it possible that the tree diagram gives wrong information while the score code does it right? Could this be caused by some vague 'missing' bug in EM, or by the fact that the component variable has several hundred different values and the diagram only shows one or two of them?

 

Has anyone met this problem before?

 

 


Replies to This Discussion

What kind of data are you putting into your decision tree? If you feed your decision tree extremely granular data, such as zip code, area code or IP address, it won't work well. For instance, replacing IP address (this field has potentially up to 256^4 values) with IP category (anonymous proxy, static IP, corporate proxy, AOL, dynamic ISP, edu proxy, etc.) will provide a very significant improvement.
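To make the IP example concrete, here is a minimal Python sketch of collapsing a granular field into a handful of categories before it goes into a tree. The network ranges, the category assignments and the `ip_category` helper are all made up for illustration; a real mapping would come from a proper IP intelligence source.

```python
import ipaddress

# Hypothetical lookup table: network range -> coarse category.
# These ranges and labels are illustrative assumptions, not real assignments.
CATEGORY_BY_NETWORK = {
    ipaddress.ip_network("10.0.0.0/8"): "corporate proxy",
    ipaddress.ip_network("192.0.2.0/24"): "edu proxy",
    ipaddress.ip_network("198.51.100.0/24"): "anonymous proxy",
}


def ip_category(ip: str, default: str = "other") -> str:
    """Map a raw IP address to a coarse category (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    for network, category in CATEGORY_BY_NETWORK.items():
        if addr in network:
            return category
    return default


raw_ips = ["10.1.2.3", "10.200.0.9", "192.0.2.17", "203.0.113.5"]

# The tree now sees a few category levels instead of up to 256^4 raw values.
categories = [ip_category(ip) for ip in raw_ips]
print(categories)
```

The same idea applies to zip codes or area codes: group them into a small number of meaningful levels first, then let the tree split on the grouped variable.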

The issue that you are talking about is probably not specific to SAS, but to all decision tree algorithms, including my own hidden decision trees.

Thanks for your answer!

The first split is generated from a frequency variable that is on an interval level and has about 20 distinct values. The next level is quite granular: a nominal variable with over 1000 categories.

I can understand that this number of nominal categories is not optimal. But what I find odd is that the decision tree diagram presents a combination of a value in the first split ("Missing") and a value in the second split that doesn't exist in the raw data. That's why I'm curious whether it's a bug.

In this case the first split is no donations during the last 4 months (missing), and the second split is recency/frequency (last donation 1 month ago), an impossible combination. There are dozens of categories in the second split, only two are written out in the tree diagram, and one of those two is an impossible value.
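For anyone who wants to repeat the raw-data check, this is roughly what it looks like as a small Python sketch. The rows, the `R1_F1`-style codes and the `levels_in_node` helper are invented for illustration and are not the real data or anything produced by EM; `None` stands in for 'Missing'.

```python
from collections import Counter

# Hypothetical rows: (donations_last_4_months, recency_frequency_code).
# None represents 'Missing' (no donations in the period). Codes are made up.
rows = [
    (None, "R6_F1"),
    (None, "R12_F2"),
    (2, "R1_F2"),
    (1, "R1_F1"),
    (None, "R6_F1"),
]

# Count how often each (first split value, second split value) pair occurs.
combo_counts = Counter(rows)


def levels_in_node(rows, first_value):
    """Return the second-variable levels that actually occur in a node."""
    return sorted({rf for donations, rf in rows if donations == first_value})


# Levels of the component variable really present under the 'Missing' node.
missing_levels = levels_in_node(rows, None)

# The level the diagram shows under 'Missing' (assumed label for the example).
suspect = "R1_F1"
suspect_count = combo_counts[(None, suspect)]

print(missing_levels)
print(suspect_count)
```

If the count for the suspect pair is zero while the diagram still lists that level under the 'Missing' node, the label does not describe any row in the node, which is the situation described above.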

I must run the score code ASAP to see whether the problem is there too or whether it's just a bug in the tree diagram. But doesn't it seem strange, and not good at all?

Hi Sven,

Have you confirmed that it's a bug? My advice is to contact SAS support.

Tomas
