A Data Science Central Community
I have a query regarding one of the assumptions of linear regression: linearity between the dependent and independent variables.
Let's say I want to build an OLS model where my dependent variable is INCOME and one of my many independent variables is GENDER, which is binary.
Now, wouldn't including GENDER (0/1) as one of my independent variables in the OLS model violate the LINEARITY assumption of linear regression? Is it possible for a binary variable, or for that matter dummy variables (used instead of the categorical variables directly), to be linear with a continuous variable?
For a categorical variable like GENDER, linearity means something slightly different.
Let's say the dependent variable PRICE is continuous and the independent variable COLOR is categorical with 3 levels (red, green, blue).
Let's talk about a model: PRICE = a + b*COLOR
In reality, the model is: PRICE = a + b1 * I_green + b2 * I_blue
Here I_green and I_blue are dummy (0/1) variables for those levels.
There is always one fewer dummy than the number of levels. Here red has no dummy; that level is called the baseline.
So the model is,
if COLOR = red, PRICE = a
if COLOR = green, PRICE = a + b1
if COLOR = blue, PRICE = a + b2
So, in this sense, PRICE is linear in COLOR. b1 and b2 are the increases in price over and above the baseline color red.
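To make this concrete, here is a minimal sketch in plain Python. The prices are made-up numbers, and `solve` and `fit_ols` are hypothetical helper names, not from any library. It builds the two dummies (b1 on I_green and b2 on I_blue, as in the model equation), fits OLS through the normal equations, and lets you check that the intercept is the mean price of red while each dummy coefficient is that color's mean minus the red mean.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Made-up illustration data: two observations per color.
colors = ["red", "red", "green", "green", "blue", "blue"]
prices = [10.0, 12.0, 14.0, 16.0, 20.0, 22.0]

# Columns: intercept, I_green, I_blue. Red is the baseline and gets no dummy.
X = [[1.0, float(c == "green"), float(c == "blue")] for c in colors]

a, b1, b2 = fit_ols(X, prices)
print("a  (mean of red):      ", round(a, 6))
print("b1 (green minus red):  ", round(b1, 6))
print("b2 (blue minus red):   ", round(b2, 6))
```

With only dummies for one categorical variable in the model, the fitted values are just the group means, which is why no straight-line relationship between PRICE and a numeric coding of COLOR is needed.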
Thanks a lot for the clarification, Angshuman. I have one more question about your response. In the example you provided, let's say the dummy variables I_green and I_blue don't turn out to be significant (a hypothetical example; I am not sure whether this can happen, so please let me know if I am wrong). In that case, how will one know whether I_red is significant or not? I hope my question is not silly :).
Thanks once again.
This is actually a good question: for a categorical variable, can the model say that some levels are significant and some are not? Note that red has no dummy of its own. As the baseline it is absorbed into the intercept a, and the tests on I_green and I_blue are tests of the difference from red. Typically, after a regression, we look at the ANOVA (Analysis of Variance) table. There we have one row per independent variable. In other words, in my example we will see a single row corresponding to the variable COLOR (as opposed to, say, two rows for I_green and I_blue). The F statistic for the COLOR variable in the ANOVA table will tell you whether COLOR as a whole is significant or not. This is one way.
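A rough sketch of that F statistic, assuming COLOR is the only predictor in the model, so the test reduces to a one-way ANOVA. The prices are made-up and `f_statistic` is a hypothetical helper name. With more predictors in the model you would instead compare the residual sums of squares with and without the COLOR dummies.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    values = [v for g in groups.values() for v in g]
    grand = mean(values)
    n, k = len(values), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up illustration data, keyed by COLOR level.
prices = {"red": [10.0, 12.0], "green": [14.0, 16.0], "blue": [20.0, 22.0]}

F = f_statistic(prices)
k = len(prices)
n = sum(len(g) for g in prices.values())
# Compare F against an F distribution with (k - 1, n - k) degrees of freedom.
print("F =", round(F, 2), "with df =", (k - 1, n - k))
```

A large F says the COLOR levels differ somewhere. It does not say which levels differ, which is where the per-dummy tests come in.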
The other, more complex way is perhaps to dig deeper into the data. Say I_blue is significant but I_green is not. That might mean there is no significant difference in PRICE when you change COLOR from red to green, but there is a significant difference when you change COLOR to blue. In such a case you may redefine the COLOR variable with 2 categories (red_or_green, blue), refit the model, and see if blue is still significant. So what you are doing, essentially, is merging two categories and treating them as one.
Typically in such cases, the recommendation is to take a careful look at various exploratory plots and statistics. For example, if you draw a boxplot of PRICE, one box per COLOR category, do you see that the boxes for red and green are pretty close and the one for blue is very different? That kind of thing.
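As a numeric stand-in for the boxplot check, here is a small sketch (made-up prices; `by_level` is a hypothetical helper name). It prints a summary per COLOR level, then recodes red and green into a single red_or_green level as described above. With one dummy left in the model, the OLS intercept is simply the baseline mean and the coefficient is the difference in means.

```python
from statistics import mean, median

# Made-up illustration data: (COLOR, PRICE) pairs.
rows = [("red", 10.0), ("red", 12.0), ("green", 11.0),
        ("green", 13.0), ("blue", 20.0), ("blue", 22.0)]


def by_level(pairs):
    """Group prices by category level."""
    groups = {}
    for level, price in pairs:
        groups.setdefault(level, []).append(price)
    return groups


# Per-level summaries: roughly what the boxes in a boxplot would show.
for level, vals in by_level(rows).items():
    print(level, "median:", median(vals), "min:", min(vals), "max:", max(vals))

# Merge red and green into one level, keeping blue separate.
merged = [("blue" if c == "blue" else "red_or_green", p) for c, p in rows]
groups = by_level(merged)

a = mean(groups["red_or_green"])  # intercept: mean of the merged baseline
b = mean(groups["blue"]) - a      # coefficient on I_blue: blue minus baseline
print("PRICE = %.2f + %.2f * I_blue" % (a, b))
```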
Thank you Angshuman.
I am going to post one more question on a general topic. Your inputs and comments will be highly appreciated.
The 'linearity' assumption in linear regression means that the expected value of the response is a linear function of the parameters: "linear in the betas", as opposed to "linear in the predictor variable". Here is more info: https://goo.gl/8YRr6A
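To spell that out with the GENDER example from the original question (the beta symbols and the generic y and x are my own notation, not from the thread):

```latex
\begin{align*}
E[\text{INCOME} \mid \text{GENDER}] &= \beta_0 + \beta_1\,\text{GENDER}
  && \text{linear in } \beta_0, \beta_1 \\
E[y \mid x] &= \beta_0 + \beta_1 x + \beta_2 x^2
  && \text{still linear in the betas, though not in } x \\
E[y \mid x] &= \beta_0\, e^{\beta_1 x}
  && \text{not linear in the betas}
\end{align*}
```

With a 0/1 predictor, the first line only ever produces two values: beta_0 when GENDER = 0 and beta_0 + beta_1 when GENDER = 1. Two points always lie on a line, so a binary predictor cannot violate linearity on its own.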