A Data Science Central Community
I need inputs on the pros and cons of building a logistic regression model using dummy variables instead of the Weight of Evidence (WoE) approach for categorical variables. Some of the cons of the dummy variable approach that I can think of are:
2. Interpretation of output
I know one of the things that needs to be looked at is the number of unique levels within a categorical variable. But, making reasonable assumptions, I would like to know in a generic sense whether there are any pros, and any other cons, of using the dummy variable approach vs. the WoE approach.
First, you can compute WoE for both dummy and categorical variables; they aren't competitors.
Second, dummies lower the degrees of freedom and produce simpler models, and simpler is better according to Occam's razor. They may also be less sensitive to future changes in distribution.
Thanks for the reply. What I meant by WoE vs. dummy is this: say, for example, there is a categorical variable (an independent variable, of course) with 4 levels:
Colour- Blue, Green, Red and White
The two approaches I mentioned are: (a) computing the WoE for each of these 4 colours, thereby quantifying the categories, and using those values to build the model; or (b) the dummy variable approach, i.e. creating 3 dummy variables for the 4 categories, say C1 (1 or 0) for Blue, C2 (1 or 0) for Green and C3 (1 or 0) for Red [creating n-1 dummy variables for n levels in a variable].
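To make the two encodings concrete, here is a minimal Python sketch for the Colour example. The counts are made up purely for illustration, the function names are hypothetical, and the WoE sign convention (non-events over events) is an assumption; some texts use the inverse ratio.

```python
import math

# Toy (events, non_events) counts per colour, invented for illustration only.
counts = {
    "Blue": (30, 70),
    "Green": (10, 90),
    "Red": (20, 80),
    "White": (40, 60),
}

def dummy_encode(value, levels, reference):
    """Return n-1 binary flags; the reference level maps to all zeros."""
    return [1 if value == level else 0 for level in levels if level != reference]

def woe_table(counts):
    """WoE per level as ln(share of non-events / share of events).

    Assumes every level has at least one event and one non-event.
    """
    total_events = sum(e for e, _ in counts.values())
    total_non_events = sum(n for _, n in counts.values())
    return {
        level: math.log((n / total_non_events) / (e / total_events))
        for level, (e, n) in counts.items()
    }

levels = ["Blue", "Green", "Red", "White"]

# Approach (b): three flags C1, C2, C3 with White as the reference level.
print(dummy_encode("Green", levels, reference="White"))

# Approach (a): a single numeric column holding the WoE of each colour.
woe = woe_table(counts)
print({level: round(value, 3) for level, value in woe.items()})
```

With dummies the model estimates one coefficient per flag; with WoE it estimates a single coefficient for the whole variable.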
If you could share any reference material on this, it would be great. I am working on a comparative study on the same topic.
I understand well. It's about terminology.
WoE (Weight of Evidence) is a metric with its own formula.
A dummy variable is a binary flag created from a categorical variable with more than 2 categories.
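For reference, the WoE of category i is commonly written as below. The symbols are my own notation, not from this thread: n_i is the count in category i and N the overall count, split by outcome. The sign convention varies between texts (some put events in the numerator).

```latex
\mathrm{WoE}_i = \ln\left(
  \frac{n_i^{\text{non-event}} / N^{\text{non-event}}}
       {n_i^{\text{event}} / N^{\text{event}}}
\right)
```

A category whose event mix matches the overall population has a ratio of 1 and therefore a WoE of 0.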
And again, a simpler model is better. With a categorical variable, a stepwise logistic regression procedure has to add the whole attribute or exclude it, whereas dummies can enter individually. Dummies also help when some categories are:
- useless; WoE close to 0
- problematic; e.g. unstable over time
- too many
- too small
- unwanted for business reasons
Do you want categories with a WoE close to zero in the model? If you don't, definitely go for dummies.
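One way to act on this is to compute the WoE per category and create dummies only for categories that clear a cut-off. The sketch below is a rough illustration: the counts are invented, the function names are hypothetical, and the 0.1 threshold is an arbitrary assumption rather than a recommended value.

```python
import math

def woe_by_level(counts):
    """WoE per level as ln(share of non-events / share of events).

    Assumes every level has at least one event and one non-event.
    """
    total_events = sum(e for e, _ in counts.values())
    total_non_events = sum(n for _, n in counts.values())
    return {
        level: math.log((n / total_non_events) / (e / total_events))
        for level, (e, n) in counts.items()
    }

def levels_worth_a_dummy(counts, min_abs_woe=0.1):
    """Keep only levels whose |WoE| reaches the (arbitrary) cut-off."""
    woe = woe_by_level(counts)
    return [level for level, value in woe.items() if abs(value) >= min_abs_woe]

# Toy (events, non_events) counts, invented for illustration only.
toy_counts = {
    "Blue": (25, 75),   # same mix as the overall population, so WoE is 0
    "Green": (10, 90),
    "Red": (20, 80),
    "White": (45, 55),
}

kept = levels_worth_a_dummy(toy_counts)
print(kept)
```

Levels that are dropped this way simply fall into the reference group, which is something a single WoE-coded column cannot do for one category at a time.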
And it depends on your modeling technique too:
- Logistic regression is very sensitive to data representation, and dummies work well.
- Decision trees are less sensitive, but some splitting criteria don't like attributes with too many categories.
A great book covering this topic is The Credit Risk Toolkit. Good luck.
Would you mind sharing the results of your comparative study?