# AnalyticBridge

A Data Science Central Community


# Question on Regression

Hi team,

I am facing a simple problem and trying to find the optimum solution:

Y(cont) = x1(cat) + x2(cat) +x3(cat) + x4(cat) + x5(cont)

Where: cat = categorical and cont = continuous. Each of my categorical variables has 100 classes.

So my Y is continuous and 4 of my 5 Xs are categorical. What is the optimum approach? ANOVA? I think ANOVA would be appropriate only if ALL of my Xs were categorical. If I simply apply a linear regression, then I would have 400 dummy Xs + 1 continuous.

I tried that in SAS and it gives me some results, but I am afraid they may be biased.

Thanks!


Comment by Konstantinos Chlouverakis on January 3, 2015 at 3:41pm

Thanks guys!

Comment by JUSTICE MOSES K. AHETO on December 29, 2014 at 3:54am

Hi,

Thanks for the question.

To begin with, you need to give us more information about what kind of data you have and what your objectives and research questions are, so that we can give you relevant help rather than speculate.

However, a general principle that I have often used successfully is to run a univariate regression on the combined effect of each categorical variable and then follow on with multiple regression. If the combined effect of a categorical variable is not significant, there is no need to include its classes in the multiple regression model. If some of the classes are similar in nature, you could collapse them into one class and then test their combined effect again by repeating the process above. Do this for all the categorical variables in your data set.
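One way to test the combined effect of a categorical variable is a partial F-test, sketched below with NumPy. This is a related check rather than the exact univariate screen described above: it compares the multiple regression with and without all dummies of one variable. The data are simulated, and the names `dummies`, `rss`, and `partial_f` are hypothetical helpers, not part of any library.

```python
import numpy as np

def dummies(codes, n_levels):
    # Treatment coding: level 0 is the reference and gets no column.
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)

def rss(X, y):
    # Residual sum of squares of the least-squares fit of y on X.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    # F statistic for the columns present in X_full but not in X_reduced.
    p_full = np.linalg.matrix_rank(X_full)
    p_reduced = np.linalg.matrix_rank(X_reduced)
    df_num = p_full - p_reduced
    df_den = len(y) - p_full
    rss_full = rss(X_full, y)
    f = ((rss(X_reduced, y) - rss_full) / df_num) / (rss_full / df_den)
    return f, df_num, df_den

# Simulated data (an assumption): factor a affects y, factor b does not.
rng = np.random.default_rng(0)
n, k = 300, 5
a = rng.integers(0, k, size=n)
b = rng.integers(0, k, size=n)
x = rng.normal(size=n)
y = np.arange(k, dtype=float)[a] + 0.5 * x + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, dummies(a, k), dummies(b, k), x])
X_without_a = np.column_stack([ones, dummies(b, k), x])
X_without_b = np.column_stack([ones, dummies(a, k), x])

f_a, df_num_a, df_den_a = partial_f(X_without_a, X_full, y)
f_b, df_num_b, df_den_b = partial_f(X_without_b, X_full, y)
print(f"factor a: F({df_num_a}, {df_den_a}) = {f_a:.2f}")
print(f"factor b: F({df_num_b}, {df_den_b}) = {f_b:.2f}")
```

Each F statistic would then be compared against the F distribution with the printed degrees of freedom. A large value suggests the variable's classes jointly matter.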

Yes, you can use linear regression for this. But with 100 classes for one categorical variable, I am afraid you will be spending so many degrees of freedom that it might seriously affect the quality of your fitted model and its predictive power. So I suggest you collapse the classes into fewer if possible, bearing in mind your research questions and objectives.
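To make the degrees-of-freedom concern concrete, here is a small sketch that counts the parameters of a dummy-coded model and collapses classes through a user-supplied mapping. It assumes reference (treatment) coding with an intercept, so each categorical variable with k classes contributes k - 1 dummies. The helper names and the example mapping are hypothetical.

```python
def model_parameters(class_counts, n_continuous, intercept=True):
    # Each categorical variable with k classes needs k - 1 dummy columns.
    n_dummies = sum(k - 1 for k in class_counts)
    return n_dummies + n_continuous + (1 if intercept else 0)

def collapse(values, mapping):
    # Replace each class by its merged class; unmapped classes stay as-is.
    return [mapping.get(v, v) for v in values]

# Four categorical variables with 100 classes each, plus one continuous.
before = model_parameters([100] * 4, n_continuous=1)

# The same model after collapsing each variable to 10 classes.
after = model_parameters([10] * 4, n_continuous=1)

print(before, after)

# Hypothetical mapping that merges classes judged similar in nature.
merged = collapse(["A1", "A2", "B1", "C7"], {"A1": "A", "A2": "A", "B1": "B"})
print(merged)
```

Under this coding the original model estimates 398 parameters, against 38 after collapsing to 10 classes per variable. Which classes to merge is a subject-matter decision that the code cannot make.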

Now you see, you have succeeded in allowing me to speculate because of the incomplete information you provided about your question.

Please, help us to help you by providing us with more details.

Thanks

Comment by Nasim Mousavi on December 24, 2014 at 11:01pm

Hi,

ANOVA is designed for categorical regressors only. When there are both categorical and continuous regressors, ANCOVA is the appropriate method. Linear regression with dummy-coded categorical variables gives exactly the same results. Finally, because of the many categories, you may need a very large sample. However, chances are that many of these categories have similar effects; there is a method called Level Clustering that can help reduce the number of categories by clustering similar ones.
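As a minimal sketch of ANCOVA fitted as an ordinary linear regression, the example below dummy-codes one categorical regressor and adds one continuous regressor, then solves by least squares with NumPy. The data are simulated, and 4 classes stand in for the 100 in the question; all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
levels = 4  # small stand-in for the 100 classes in the question
group = rng.integers(0, levels, size=n)  # one categorical regressor
x5 = rng.normal(size=n)                  # the continuous regressor

# Simulated response: a class-specific offset plus a common slope on x5.
true_offsets = np.array([0.0, 1.0, -0.5, 2.0])
y = true_offsets[group] + 0.8 * x5 + rng.normal(scale=0.1, size=n)

# Dummy (treatment) coding: class 0 is the reference and gets no column.
dummies = (group[:, None] == np.arange(1, levels)).astype(float)
X = np.column_stack([np.ones(n), dummies, x5])

# Ordinary least squares on the combined design matrix.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = beta[0]
group_effects = beta[1:levels]  # offsets of classes 1..3 relative to class 0
slope = beta[-1]                # common slope on the continuous regressor

print("class effects vs. reference:", np.round(group_effects, 2))
print("slope on x5:", round(float(slope), 2))
```

The estimated class effects are differences from the reference class, which is why a variable with 4 classes needs only 3 dummy columns.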

If the categories are ordered, you can treat them as simple continuous regressors.
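A minimal sketch of that idea, assuming the ordered classes are equally spaced so that integer scores are a reasonable encoding (an assumption worth checking against the data). The class labels and simulated data are hypothetical.

```python
import numpy as np

ordered_levels = ["low", "medium", "high"]  # hypothetical ordered classes
score = {level: i for i, level in enumerate(ordered_levels)}

rng = np.random.default_rng(1)
labels = rng.choice(ordered_levels, size=150)
x_ord = np.array([score[str(label)] for label in labels], dtype=float)

# Simulated response that rises by 1.5 per step up the ordering.
y = 2.0 + 1.5 * x_ord + rng.normal(scale=0.2, size=150)

# One slope replaces the k - 1 dummy columns the variable would need.
X = np.column_stack([np.ones_like(x_ord), x_ord])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and slope:", np.round(beta, 2))
```

With 100 ordered classes this would replace 99 dummy columns with a single column, at the cost of assuming a linear trend across the classes.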

I hope it helps.