A Data Science Central Community
Have you checked hidden decision trees? It's a methodology that I developed: it blends decision trees with logistic regression.
Thanks for your reply, Vincent. I'll explore the concept of hidden DTs. Still, can you please also tell me whether logistic regression (LR) can be used in, say, churn analysis? Can DT and LR be used interchangeably? Could you point out the specific business scenarios for each?
There is no hard-and-fast decision rule, except that logistic regression is parametric while IDT is non-parametric. What you need to know is that they give you similar output through different approaches, and that one is preferable to the other in certain situations.
For example, IDT can be very helpful when you want rules for creating your segments. Also, when you have no clue what your data looks like, IDT is a good place to start. Logistic regression is a very good predictive tool, and is perfect when all you need is the probability that someone belongs to a given class.
FYI, churn, survival, response, etc.: any kind of predictive modeling like these has its base in logistic regression!
One important note: both techniques are subject to over-fitting. You should use robust versions of these techniques.
Vincent, can we not compare the two models (DT and LR) using ROC curves, gains charts, and accuracy, and then decide which to use?
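A minimal sketch of such a side-by-side comparison, in plain Python. The labels and both sets of scores below are made up for illustration, and the function names (`accuracy`, `roc_auc`, `gain_at`) are hypothetical helpers, not part of any library. In practice the scores would come from each fitted model on the same holdout set.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Share of cases classified correctly when score >= threshold means positive."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(y_true, scores))
    return hits / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the chance that a random
    positive outscores a random negative (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gain_at(y_true, scores, fraction):
    """Share of all positives captured in the top `fraction` of cases ranked
    by score. Tied scores keep their input order (the sort is stable)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    cutoff = int(round(fraction * len(ranked)))
    return sum(y for _, y in ranked[:cutoff]) / sum(y_true)


# Made-up holdout labels (1 = churned) and made-up scores from two hypothetical models.
y_holdout = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
lr_scores = [0.81, 0.35, 0.62, 0.20, 0.55, 0.48, 0.10, 0.30, 0.90, 0.15]
dt_scores = [0.75, 0.25, 0.75, 0.25, 0.75, 0.25, 0.05, 0.25, 0.75, 0.05]

for name, scores in [("LR", lr_scores), ("DT", dt_scores)]:
    print(f"{name}: accuracy={accuracy(y_holdout, scores):.2f} "
          f"auc={roc_auc(y_holdout, scores):.3f} "
          f"gain@30%={gain_at(y_holdout, scores, 0.3):.2f}")
```

On this toy data the two models tie on accuracy but differ on AUC and on the gain in the top 30%, which is one reason to look at more than a single metric before choosing.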
Thanks for your reply Arun, two questions:
1) What do you mean by "Logistic Regression is parametric, while IDT is non-parametric"?
2) Doesn't IDT also give probabilities, so that you can get the ROC, gains chart, etc.?
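On the second question, a classification tree can be read as giving a probability: one common convention is to score a case with the share of positive training cases in the leaf it lands in. The sketch below assumes a hypothetical one-split tree on a made-up `months_inactive` feature; the data, the split point, and the helper names are all illustrative.

```python
from collections import defaultdict


def fit_leaf_probabilities(leaf_ids, labels):
    """Map each leaf to the fraction of its training cases that are positive."""
    counts = defaultdict(lambda: [0, 0])  # leaf -> [positives, total]
    for leaf, y in zip(leaf_ids, labels):
        counts[leaf][0] += y
        counts[leaf][1] += 1
    return {leaf: pos / total for leaf, (pos, total) in counts.items()}


def assign_leaf(months_inactive):
    """Hypothetical one-split tree: route a customer by months of inactivity."""
    return "inactive_3_plus" if months_inactive >= 3 else "active"


# Made-up training data: months inactive, and whether the customer churned.
train_months = [0, 1, 5, 4, 2, 6, 0, 3]
train_churn = [0, 0, 1, 1, 0, 1, 1, 0]

leaf_prob = fit_leaf_probabilities(
    [assign_leaf(m) for m in train_months], train_churn
)

# Score a new customer who has been inactive for 5 months.
new_customer_score = leaf_prob[assign_leaf(5)]
print(leaf_prob, new_customer_score)
```

Because every case in a leaf gets the same score, a tree produces only as many distinct scores as it has leaves, so its ROC curve and gains chart are step-like and contain ties.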
To make regression more robust, what about using PLS regression (partial least squares)? It was created for linear regression, but with ad-hoc transformations, you can easily turn a logistic regression problem into a linear regression problem. PLS regression is great when you did a terrible job at feature selection.
If undecided, go with LR, as it has better diagnostics (analysis of deviance).
If your predictors are mostly numeric and your exploratory data analysis shows that they behave as smooth functions, then you would want to go with LR. LR is good with smooth numeric predictors.
DT is better if you have lots of categorical variables, or a small number of them but each with a large number of levels. LR has to create a dummy variable for each level, so it can become cumbersome and slow, and you can run out of memory.
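To get a feel for the dummy-variable blow-up, here is a small sketch in plain Python. The `zip_codes` column is made up (500 distinct levels over 2,000 rows) and `one_hot` is a hypothetical helper. Regression software typically drops one level as the reference category, so the real count is one column fewer per variable.

```python
def one_hot(values):
    """Return (levels, rows) with one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1  # exactly one indicator is switched on per row
        rows.append(row)
    return levels, rows


# Made-up categorical predictor with many levels (a postal-code-like field).
zip_codes = [f"Z{n % 500:03d}" for n in range(2000)]

levels, dummies = one_hot(zip_codes)
n_cells = len(dummies) * len(levels)
print(f"1 raw column -> {len(levels)} dummy columns, {n_cells} cells")
```

A single raw column becomes hundreds of mostly-zero columns here, and several such variables multiply the cost. Sparse storage helps with memory, but the model still has to estimate a coefficient per level.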
If you have complex interaction terms between your variables, DT may be better.
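An extreme toy case of the interaction point, sketched in plain Python: the outcome depends only on the combination of two binary predictors (an XOR pattern). The data is made up, the logistic fit is a bare-bones gradient descent written for this illustration, and the tree is written by hand rather than learned, so this shows what each model form can represent, not how a particular package would behave.

```python
import math

# Made-up XOR-style data: the outcome depends only on the interaction of x1 and x2.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, steps=5000, step_size=0.5):
    """Main-effects-only logistic regression fitted by plain gradient descent."""
    w, b = [0.5, -0.3], 0.2  # arbitrary nonzero starting point
    n = len(X)
    for _ in range(steps):
        errors = [sigmoid(w[0] * x1 + w[1] * x2 + b) - t
                  for (x1, x2), t in zip(X, y)]
        w[0] -= step_size * sum(e * x1 for e, (x1, _) in zip(errors, X)) / n
        w[1] -= step_size * sum(e * x2 for e, (_, x2) in zip(errors, X)) / n
        b -= step_size * sum(errors) / n
    return w, b


def tree_predict(x1, x2):
    """Hand-written depth-2 tree: split on x1, then on x2 within each branch."""
    if x1 == 0:
        return 1 if x2 == 1 else 0
    return 0 if x2 == 1 else 1


w, b = fit_logistic(X, y)
lr_probs = [sigmoid(w[0] * x1 + w[1] * x2 + b) for x1, x2 in X]
tree_preds = [tree_predict(x1, x2) for x1, x2 in X]

print("LR probabilities:", [round(p, 3) for p in lr_probs])
print("Tree predictions:", tree_preds, "labels:", y)
```

Without an explicit `x1 * x2` term, the logistic model settles on a probability of about 0.5 for every case, while two nested splits reproduce the labels exactly. Adding the interaction term by hand would let LR fit this too, but you have to know to add it.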
If you have a lot of numeric variables that are correlated with each other, use a multilevel model (a Bayesian hierarchical model if you're a Bayesian, or a random effects model if you're not).
I agree with Egan: it depends on the independent variable types. Choosing the right algorithm saves you time in data preparation.
In addition, I would check whether the organization already uses a similar algorithm. If the risk team uses logistic regression, then my department (marketing) should use logistic regression too. This eases the pressure of explaining why you chose this algorithm.
At the other extreme, I don't mind using a linear regression model if people are comfortable with it.