A Data Science Central Community
Hi team,
I am facing a simple problem and trying to find the optimum solution:
Y(cont) = x1(cat) + x2(cat) + x3(cat) + x4(cat) + x5(cont)
Where cat = categorical and cont = continuous. Each of my categorical variables has 100 classes.
So my Y is continuous and 4 of my 5 Xs are categorical. What is the optimum approach? ANOVA? I think ANOVA would apply only if ALL of my Xs were categorical. If I simply apply a linear regression, I would have about 400 dummy Xs (396 if one reference class is dropped per variable) + 1 continuous.
I tried that in SAS and it gives me some results, but I am afraid they may be biased.
Thanks!
Thanks guys!
Hi,
Thanks for the question.
To begin with, please tell us more about the kind of data you have and what your objectives and research questions are, so that we can give relevant help rather than speculate.
So please, help us to help you.
However, a general principle which I have often used successfully is to run a univariate regression on the combined effect of each categorical variable and then follow up with a multiple regression. If the combined effect of a categorical variable is not significant, there is no need to declare its classes in the multiple regression model. If some of the classes are similar in nature, you could collapse them into one class and then test the combined effect again by repeating the process above. Do this for all the categorical variables in your data set.
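The "combined effect" of one categorical variable can be checked with a one-way ANOVA F statistic, which is the same test as the overall F test of a regression on that variable's dummies. Below is a minimal sketch using only the Python standard library; the function name `one_way_f` and the toy data are made up for illustration and are not from the original question.

```python
from collections import defaultdict
from statistics import mean

def one_way_f(levels, y):
    """Return (F, df_between, df_within) for y grouped by a categorical variable."""
    groups = defaultdict(list)
    for lev, val in zip(levels, y):
        groups[lev].append(val)

    grand_mean = mean(y)
    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the observations around their own class mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    df_between = len(groups) - 1
    df_within = len(y) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data (hypothetical): one categorical variable with three classes
x1 = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
y = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

f_stat, df_b, df_w = one_way_f(x1, y)
print(f_stat, df_b, df_w)
```

The p-value comes from comparing `f_stat` with an F distribution on `(df_b, df_w)` degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` computes both the statistic and the p-value directly.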
Yes, you can use linear regression for this. But with 100 classes per categorical variable, I am afraid the model will use up a great many degrees of freedom, which could seriously hurt the quality of the fit and its predictive power. So I suggest you collapse the classes into fewer ones if that is possible, bearing in mind your research questions and objectives.
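To make the degrees-of-freedom concern concrete, here is a small sketch that counts the parameters of the model in the question. It assumes reference (dummy) coding, where each categorical variable with k classes contributes k - 1 columns, and a model with an intercept. The sample size `n_obs` is a made-up number, since the question does not state one.

```python
def n_parameters(classes_per_factor, n_continuous, intercept=True):
    """Count regression parameters under reference (k - 1) dummy coding."""
    n_dummies = sum(k - 1 for k in classes_per_factor)
    return n_dummies + n_continuous + (1 if intercept else 0)

# Four categorical variables with 100 classes each, one continuous regressor
p = n_parameters([100, 100, 100, 100], n_continuous=1)

n_obs = 10_000               # hypothetical sample size, not from the question
residual_df = n_obs - p      # degrees of freedom left for estimating error
print(p, residual_df)
```

Collapsing each variable to, say, 10 classes would reduce the count from 398 to 38 parameters under the same assumptions.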
Now you see, you have succeeded in making me speculate because of the incomplete information in your question.
Please, help us to help you by providing us with more details.
Thanks
Hi,
ANOVA is designed for purely categorical regressors. When there are both categorical and continuous regressors, ANCOVA is the appropriate method. Linear regression with dummy-coded categories gives exactly the same results. Finally, with so many categories you may need a very large sample. Chances are, however, that many of these categories have similar effects; a method called Level Clustering can help reduce the number of categories by clustering similar ones.
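One simple way to read the level-clustering idea is to rank the classes by their mean response and merge neighbouring classes into a small number of groups. The sketch below does only that; it is an illustration under that assumption, not necessarily the exact method meant above, and the function name `cluster_levels` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def cluster_levels(levels, y, n_groups):
    """Map each class to one of n_groups merged groups, ranked by mean of y."""
    by_level = defaultdict(list)
    for lev, val in zip(levels, y):
        by_level[lev].append(val)

    # Order the classes from lowest to highest mean response
    ranked = sorted(by_level, key=lambda lev: mean(by_level[lev]))

    # Cut the ranked list into n_groups chunks of nearly equal size
    return {lev: rank * n_groups // len(ranked) for rank, lev in enumerate(ranked)}

# Toy data (hypothetical): four classes, two of them low and two high
levels = ["a", "a", "b", "b", "c", "c", "d", "d"]
y = [1.0, 1.2, 1.1, 0.9, 5.0, 5.2, 4.9, 5.1]

mapping = cluster_levels(levels, y, n_groups=2)
merged = [mapping[lev] for lev in levels]   # collapsed variable for the regression
print(mapping)
```

Because the groups are formed using Y itself, testing the merged variable on the same data tends to overstate its significance, so it is safer to build the mapping on one part of the data and fit the model on another.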
If the categories are ordered, you can consider treating them as simple continuous regressors, which assumes the effect changes by roughly equal steps between adjacent categories.
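As a minimal sketch of that last suggestion, an ordered categorical variable can be replaced by its rank, so it costs one regression coefficient instead of one per class. The class names and their order below are made up for illustration.

```python
# Hypothetical ordered classes, from lowest to highest
order = ["low", "medium", "high"]
rank_of = {level: rank for rank, level in enumerate(order)}

x = ["low", "high", "medium", "high"]
x_numeric = [rank_of[value] for value in x]   # single numeric regressor
print(x_numeric)
```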
I hope it helps.