A Data Science Central Community
Is there a cut-off for the number of unique classes (levels) of a categorical variable that one should use to eliminate or select it during variable selection, prior to building the model? I read somewhere that a categorical variable should be discarded if its classes, even after grouping them meaningfully, cannot be restricted to about a dozen levels.
I would like to know whether this is subjective, or whether the variable should be screened only after looking at its Weight of Evidence and Information Value, as in the case of logistic regression.
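For context, here is a minimal sketch of the Weight of Evidence / Information Value screening mentioned above, assuming a binary target (1 = event, 0 = non-event). The function name `woe_iv` and the small `eps` smoothing term are my own illustrative choices, not a standard API:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target, eps=0.5):
    """Weight of Evidence and Information Value for one categorical feature.

    Assumes `target` is binary (1 = event, 0 = non-event). `eps` is a small
    smoothing adjustment to avoid log(0) for levels with no events or
    no non-events.
    """
    grouped = df.groupby(feature)[target].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    pct_events = (grouped["events"] + eps) / (grouped["events"].sum() + eps)
    pct_non = (grouped["non_events"] + eps) / (grouped["non_events"].sum() + eps)
    grouped["woe"] = np.log(pct_events / pct_non)
    grouped["iv"] = (pct_events - pct_non) * grouped["woe"]
    return grouped, grouped["iv"].sum()
```

A variable's total IV (the sum of the per-level terms) is then compared against whatever cut-offs the modeling team uses; each term is non-negative because the difference and the log ratio always share a sign.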
Also, do many classes for a categorical variable cause overfitting of a model, or does it depend on the sample size for each of the individual classes?
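One common mitigation for the sample-size concern is to collapse sparse levels into a single catch-all bucket before encoding, since levels with only a handful of observations are what typically drive overfitting. A minimal sketch, where the function name `collapse_rare_levels` and the threshold of 30 observations are illustrative assumptions rather than a rule:

```python
import pandas as pd

def collapse_rare_levels(series, min_count=30, other_label="Other"):
    """Collapse levels with fewer than min_count observations into one bucket.

    min_count=30 is an arbitrary illustrative threshold; in practice it
    would be chosen based on the data and validated on a holdout sample.
    """
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep common levels as-is; replace rare ones with the catch-all label.
    return series.where(~series.isin(rare), other_label)
```

After collapsing, the remaining levels each have enough observations for their event rates (and hence WoE values) to be estimated with some stability.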
Thanks for the reply. Hope you are doing well. I heard about you through Sujith and Anil when I was working with them in the Data Mining team. Now I am working for a different organization.
My question actually arises from our ex-client's approach (yeah, as you might have known, she is still there :) ). I know about the approach you described; it is exactly what she used to tell me, but I was not sure it actually works well for variables with more than two dozen levels (even after meaningful grouping, where applicable). I have read elsewhere about overfitting due to a large number of levels in a categorical variable.
Anyway, thanks once again. Keep in touch. Correct me if I am wrong, but I believe I filled your place once you moved out of Sujith's team. The world is small! :)
P.S. I have deliberately not mentioned names, since I did not want to discuss a few things on a public forum.