# AnalyticBridge

A Data Science Central Community

# Do independent variables need to be normally distributed in multiple regression?

Below is a quote regarding logistic regression. It seems to say that OLS regression requires the independent variables to be normally distributed. In my experience, most independent variables in real datasets are not normally distributed. Could anyone comment on this?

Source: http://www.sagepub.com/upm-data/5081_Spicer_Chapter_5.pdf, page 13

"The assumptions required for statistical tests in logistic regression are far less restrictive than those for OLS regression. There is no formal requirement for multivariate normality, homoscedasticity, or linearity of the independent variables within each category of the dependent variable."


### Replies to This Discussion

They don't need to be. Indeed, you can have binary or dummy variables. What can cause problems, though, is highly skewed variables, e.g. a variable that takes the value zero 98% of the time (as in fraud detection). Other sources of problems are residual errors that are not independent, and strong cross-correlations among the predictors. All these issues can be addressed by:

• splitting your models into sub-models where residual noise is true white noise within each sub-model
• identifying nuisance variables such as time (they can be a source of non-independent noise)
• over/under sampling to avoid skewed variables
• decorrelating the variables, using stepwise regression, or using Lasso or ridge regression (cross-correlations affect the stability of your model)
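
To make the last bullet concrete, here is a minimal sketch of how strong cross-correlation destabilizes OLS coefficients and how a ridge penalty damps that. Everything in it is an assumption for illustration: the simulated data, the penalty value of 10, and the helper names `fit_two_predictors` and `simulate` are hypothetical and not from the thread.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y, ridge=0.0):
    """Solve the 2x2 normal equations on centred data.

    ridge=0.0 gives plain OLS; ridge>0 adds the penalty to the diagonal.
    """
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    c = [v - my for v in y]
    s11 = sum(u * u for u in a) + ridge
    s22 = sum(v * v for v in b) + ridge
    s12 = sum(u * v for u, v in zip(a, b))
    s1y = sum(u * w for u, w in zip(a, c))
    s2y = sum(v * w for v, w in zip(b, c))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(seed, n=200):
    """Two nearly identical predictors; true coefficients are 1 and 1."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [u + rng.gauss(0, 0.05) for u in x1]  # almost a copy of x1
    y = [u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]
    return x1, x2, y


ols_b1, ridge_b1 = [], []
for seed in range(50):  # refit on 50 independent samples
    x1, x2, y = simulate(seed)
    ols_b1.append(fit_two_predictors(x1, x2, y)[0])
    ridge_b1.append(fit_two_predictors(x1, x2, y, ridge=10.0)[0])

ols_spread = statistics.stdev(ols_b1)
ridge_spread = statistics.stdev(ridge_b1)
print(f"spread of first coefficient, OLS:   {ols_spread:.3f}")
print(f"spread of first coefficient, ridge: {ridge_spread:.3f}")
```

The first coefficient swings widely from sample to sample under OLS and much less under the penalty. The price is some bias, so the penalty would normally be tuned by cross-validation rather than fixed as it is here.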

Given that dummy codes are not normal, does this affect the usual business phrasing "on average, a unit increase in x produces an increase in y" when neither variable is normally distributed? That phrasing seems to rest on the assumption that the mean is the best predictor for both distributions, which is not the case for price-elastic distributions. If so, what would be a better way to describe the effect without the "on average" part? Or is the "on average" consideration not really relevant at all?

The reason I ask is that the binary representation of a category can be replaced by the within-category means, and the coefficient will be close to 1 in a univariate model.

Sorry, I meant the within category means of the DV, not the IV.
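
The point about within-category means of the DV can be checked with a short sketch. The category labels, base levels, and skewed noise below are made-up assumptions, and `ols_slope_intercept` is a hypothetical helper written for this example. With plain OLS and every category present, the slope works out to 1 (up to floating-point error) and the intercept to 0, whatever the shape of the distributions.

```python
import random
import statistics


def ols_slope_intercept(x, y):
    """Simple (one-predictor) least-squares fit."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(0)
group = [rng.choice("ABC") for _ in range(300)]   # categorical IV
base = {"A": 10.0, "B": 14.0, "C": 21.0}           # assumed group levels
y = [base[g] + rng.expovariate(1.0) for g in group]  # skewed, non-normal DV

# Replace each category label by the mean of the DV within that category.
group_mean = {
    g: statistics.fmean([v for v, k in zip(y, group) if k == g]) for g in base
}
x = [group_mean[g] for g in group]

slope, intercept = ols_slope_intercept(x, y)
print(f"slope = {slope:.6f}, intercept = {intercept:.6f}")
```

So the "on average" reading here is a statement about conditional means within each category. It does not depend on either variable being normal.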

All you have to do is transform the shape of the distribution.

There can be hidden distributions within ranges of the IV. So there are techniques to transform the whole distribution, or to restrict the range of the IV and transform only that subset. All the usual rules for validation still apply.
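
As one example of transforming the shape of a distribution, here is a sketch of a log transform applied to a right-skewed variable. The log-normal data and the `skewness` helper are assumptions for illustration. A log is only one possible transform, and it only applies to strictly positive values.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: mean of cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


rng = random.Random(1)
x = [rng.lognormvariate(0, 1) for _ in range(5000)]  # strongly right-skewed
log_x = [math.log(v) for v in x]                     # transformed variable

raw_skew = skewness(x)
log_skew = skewness(log_x)
print(f"skewness before transform: {raw_skew:.2f}")
print(f"skewness after log:        {log_skew:.2f}")
```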

You are correct that most independent variables are not normally distributed: in classical statistics, predictors come from designed experiments and are not random in that sense. More importantly, the usual normality assumption applies to the distribution of the prediction errors (observed - predicted), not to the independent variables.

-Ralph Winters
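
A small simulation can illustrate the distinction between the distribution of a predictor and the distribution of the errors. It is a sketch under assumed values (an exponential predictor, true intercept 1, true slope 2, unit-variance normal errors), with hypothetical helpers `skewness` and `ols_fit`.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean of cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def ols_fit(x, y):
    """Simple (one-predictor) least-squares fit."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(2)
n = 5000
x = [rng.expovariate(1.0) for _ in range(n)]        # clearly non-normal predictor
y = [1.0 + 2.0 * v + rng.gauss(0, 1) for v in x]    # normal errors around the line

slope, intercept = ols_fit(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

x_skew = skewness(x)
resid_skew = skewness(residuals)
print(f"estimated slope:       {slope:.3f}")
print(f"skewness of predictor: {x_skew:.2f}")
print(f"skewness of residuals: {resid_skew:.2f}")
```

The predictor is heavily skewed, yet the slope is recovered and the residuals are roughly symmetric. The residuals, not the predictor, are what a normality check should look at.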