A Data Science Central Community
Hi all!
I have a query with regard to one of the assumptions of linear regression: linearity between the dependent and independent variables.
Let's say I want to build an OLS model where my dependent variable is INCOME and one of my many independent variables is GENDER, which is binary.
Now, wouldn't including GENDER (0,1) as one of my independent variables in the OLS model violate the LINEARITY assumption of linear regression? Is it possible for a binary variable, or for that matter dummy variables (instead of using categorical variables directly), to be linear with a continuous variable?
Regards,
Sharath
Hi Sharath,
For a categorical variable like GENDER, linearity means a slightly different thing.
Let's say dependent variable PRICE is continuous and independent variable COLOR is categorical with 3 levels (red,green,blue).
Let's talk about a model: PRICE = a + b*COLOR
In reality, the model is: PRICE = a + b1 * I_green + b2 * I_blue
Here I_green and I_blue are dummy (0/1) variables for those levels.
There is always one fewer dummy than the number of levels. Here red has no dummy; that level is called the baseline.
So the model is:
if COLOR = red, PRICE = a
if COLOR = green, PRICE = a + b1
if COLOR = blue, PRICE = a + b2
So, in this sense, PRICE is linear in COLOR. b1 and b2 are the increases in price over and above the baseline color red.
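To make the dummy coding concrete, here is a minimal sketch in plain Python. The data are invented for illustration, and the names (`dummy_row`, `group_mean`) are hypothetical helpers, not part of any library. It assumes COLOR is the only predictor, in which case the OLS coefficients reduce to group means; the last step checks that the residuals are orthogonal to every column of the design matrix, which is the condition that characterizes the OLS solution.

```python
# Hypothetical toy data: (COLOR, PRICE) pairs invented for illustration.
data = [
    ("red", 10.0), ("red", 12.0), ("red", 11.0),
    ("green", 14.0), ("green", 16.0),
    ("blue", 20.0), ("blue", 22.0), ("blue", 21.0),
]


def dummy_row(color):
    """Return [intercept, I_green, I_blue]; red is the baseline level."""
    return [1.0,
            1.0 if color == "green" else 0.0,
            1.0 if color == "blue" else 0.0]


def group_mean(color):
    """Average PRICE over the rows with the given COLOR."""
    prices = [p for c, p in data if c == color]
    return sum(prices) / len(prices)


# With a single categorical predictor, OLS reduces to group means:
# a is the baseline mean, b1 and b2 are offsets from that baseline.
a = group_mean("red")
b1 = group_mean("green") - a
b2 = group_mean("blue") - a
coef = [a, b1, b2]

# Build the design matrix and compute fitted values and residuals.
X = [dummy_row(c) for c, _ in data]
y = [p for _, p in data]
fitted = [sum(x_j * b_j for x_j, b_j in zip(row, coef)) for row in X]
residuals = [y_i - f_i for y_i, f_i in zip(y, fitted)]

# Normal equations: residuals must be orthogonal to each column of X.
normal_eq = [sum(row[j] * r for row, r in zip(X, residuals))
             for j in range(3)]

print("a, b1, b2 =", coef)
print("X^T residuals =", normal_eq)
```

With other predictors in the model, the coefficients would no longer be plain group means, but the interpretation of b1 and b2 as shifts relative to the baseline level stays the same.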
regards,
Angshuman
Thanks a lot for the clarification, Angshuman. I have one more question about your response. In the example you provided, let's say the dummy variables I_green and I_blue don't turn out to be significant (a hypothetical example; I am not sure whether this can happen, so please let me know if I am wrong). In that case, how will one know whether I_red is significant or not? I hope my question is not silly :).
Thanks once again.
Regards,
Sharath
Hello Sharath,
This is actually a good question. For a categorical variable, can the model say that some levels are significant and some are not? Typically, after a regression we look at the ANOVA (Analysis of Variance) table, which has one row per independent variable. In other words, in my example we will see a single row corresponding to the variable COLOR (as opposed to, say, two rows for I_green and I_blue). The F statistic for the COLOR variable in the ANOVA table will tell you whether COLOR as a whole is significant or not. This is one way.
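As a rough sketch of what that single COLOR row measures, the F statistic can be computed by comparing an intercept-only model with the model that includes the COLOR dummies. The data below are invented, and the sketch assumes COLOR is the only predictor, so the full model's fitted values are just the group means; with more predictors, the reduced model would drop only the COLOR dummies and keep everything else.

```python
from collections import defaultdict

# Hypothetical toy data: (COLOR, PRICE) pairs invented for illustration.
data = [
    ("red", 10.0), ("red", 12.0), ("red", 11.0),
    ("green", 14.0), ("green", 16.0),
    ("blue", 20.0), ("blue", 22.0), ("blue", 21.0),
]

# Collect prices per COLOR level.
groups = defaultdict(list)
for color, price in data:
    groups[color].append(price)


def mean(values):
    return sum(values) / len(values)


n = len(data)       # number of observations
k = len(groups)     # number of COLOR levels
grand_mean = mean([p for _, p in data])

# Reduced model: intercept only, so every fitted value is the grand mean.
rss_reduced = sum((p - grand_mean) ** 2 for _, p in data)

# Full model: intercept plus k - 1 dummies, so fitted values are group means.
rss_full = sum((p - mean(ps)) ** 2 for ps in groups.values() for p in ps)

# F statistic for COLOR as a whole, with (k - 1, n - k) degrees of freedom.
df_num = k - 1
df_den = n - k
f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)

print(f"F({df_num}, {df_den}) = {f_stat:.3f}")
```

The resulting value would then be compared against the F distribution with those degrees of freedom to get a p-value; a statistics package would normally report this for you.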
The other, more complex way is to dig deeper into the data. Say I_blue is significant but I_green is not. That might mean there is no significant difference in PRICE when you change COLOR from red to green. However, when you change the COLOR to blue, there is a significant difference. In such a case you may redefine the COLOR variable with 2 categories (red_or_green, blue). Refit the model and see if blue is now significant. So what you are doing, essentially, is merging two categories and treating them as one.
Typically in such cases, the recommendation is to take a careful look at various exploratory plots and stats. For example, if you draw a boxplot of PRICE, with one box per COLOR category, do you see that the boxes for red and green are pretty close and that the box for blue is very different? That kind of thing.
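A text-only stand-in for that boxplot check might look like the sketch below. The prices are invented so that red and green overlap while blue sits apart, and `five_number_summary` is a hypothetical helper. It prints the numbers a boxplot would draw for each level, then recomputes them after merging red and green.

```python
from statistics import quantiles

# Hypothetical toy data, invented for illustration only.
prices_by_color = {
    "red":   [10.0, 11.0, 12.0, 13.0],
    "green": [10.5, 11.5, 12.5, 13.5],
    "blue":  [20.0, 21.0, 22.0, 23.0],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    q1, q2, q3 = quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))


summaries = {c: five_number_summary(v) for c, v in prices_by_color.items()}
for color, s in summaries.items():
    print(f"{color:>5}: min={s[0]} Q1={s[1]} median={s[2]} "
          f"Q3={s[3]} max={s[4]}")

# Merge red and green into a single level and summarize again.
merged = {
    "red_or_green": prices_by_color["red"] + prices_by_color["green"],
    "blue": prices_by_color["blue"],
}
merged_summaries = {c: five_number_summary(v) for c, v in merged.items()}
for level, s in merged_summaries.items():
    print(f"{level:>12}: min={s[0]} median={s[2]} max={s[4]}")
```

If the red and green summaries nearly coincide while blue is clearly shifted, that supports merging the two levels before refitting; a plotting library would show the same thing graphically.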
Regards,
Angshuman
Thank you Angshuman.
I am going to post one more question on a general topic. Your inputs and comments will be highly appreciated.
Regards,
Sharath
The 'linearity' assumption in linear regression means that the expected value of the response is a linear function of the parameters: "linear in the betas", as opposed to "linear in the predictor variable". Here is more info: https://goo.gl/8YRr6A
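A small spot check of the "linear in the betas" idea, as a sketch: the two model functions and the coefficient values below are invented for illustration. A model that is linear in its parameters satisfies additivity in the coefficients at any fixed x, even if it is curved in x; a single-point check like this can expose non-linearity but does not prove linearity.

```python
import math


def quadratic_model(x, beta):
    # Curved in x, but linear in the coefficients beta.
    return beta[0] + beta[1] * x + beta[2] * x ** 2


def exponential_model(x, beta):
    # Not linear in the coefficients: beta[1] sits inside the exponent.
    return beta[0] * math.exp(beta[1] * x)


def is_additive_in_params(model, x, beta_a, beta_b, tol=1e-9):
    """Spot-check model(x, a + b) == model(x, a) + model(x, b) at one x."""
    beta_sum = [a + b for a, b in zip(beta_a, beta_b)]
    lhs = model(x, beta_sum)
    rhs = model(x, beta_a) + model(x, beta_b)
    return abs(lhs - rhs) < tol


# Hypothetical coefficient vectors and evaluation point.
quad_ok = is_additive_in_params(
    quadratic_model, 1.5, [1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
exp_ok = is_additive_in_params(
    exponential_model, 1.5, [1.0, 2.0], [0.5, -1.0])

print("quadratic additive in betas:", quad_ok)
print("exponential additive in betas:", exp_ok)
```

This is why a model with dummy variables, squared terms, or other transformed predictors still counts as a linear regression, as long as each coefficient enters as a plain multiplier.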
© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC