
What do you do with a multiple-variable model that has a low coefficient of determination (R²)?

OK, so I've spent hours trying to maximize the R² of a multiple-variable model, and the R² is still low (e.g., below 50%). Should I become discouraged and set the model aside? After all, I used expensive professional statistical analysis software.

When creating a comprehensive multiple-variable model with dependent and independent variables, the user usually has the upper hand in understanding the fundamentals (e.g., physics, chemistry, finance). However, the following must also be considered: the quality of the data, the number of data points, the number of independent variables, the variation in the variables, data filtering, and so on.

For example: was a delay introduced into the independent variables? Is the model linear while the dependent variable is actually a non-linear function of the independent variables?

Assume the independent and dependent variables are being measured with an adequate degree of accuracy and precision. If the R² is still low after optimizing the model, this may be because an important independent variable is not being measured or is not included in the model. That is not a bad outcome; in fact, it may lead to a breakthrough or increase our understanding of whatever is being modeled.

Even a model with a low R² tells us something important!
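The omitted-variable situation described above is easy to demonstrate. Below is a minimal sketch (using simulated data and NumPy least squares; all variable names and coefficients are illustrative, not from the original post): leaving a strong predictor out of the model yields a low R², and including it recovers a high one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # the "unmeasured" important variable
y = 2.0 * x1 + 5.0 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_partial = r_squared(x1.reshape(-1, 1), y)        # x2 omitted: low R^2
r2_full = r_squared(np.column_stack([x1, x2]), y)   # x2 included: high R^2
```

Here the low R² of the partial model is not a defect of the fitting procedure; it is the model telling you that most of the variation in y is driven by something you are not measuring.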




Comment by Ralph Winters on September 22, 2010 at 7:34pm
Looking at the residuals after regression will also tell you a lot about where you stand. If the residuals are normal, I would suggest starting to look for other, unmeasured variables to supplement your model, as you yourself have suggested. Otherwise, tricks like logs, Box-Cox power transforms, etc., can work to smooth the data out, as suggested by Tom. But they are somewhat artificial, so I would proceed with caution.
I have also often found that if you have TOO much data you end up with a reversion-to-the-mean type of problem, and again a low R². In that case you need to segment the data, as suggested by Jon.

-Ralph Winters
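Ralph's residual check can be sketched as follows. This is a minimal illustration, not his actual workflow: simulated data with multiplicative noise is fit on the raw scale (skewed residuals) and again after a log transform (roughly normal residuals), with normality assessed by a Shapiro-Wilk test from SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=400)
# multiplicative (lognormal) noise: linear fit on the raw scale misbehaves
y = np.exp(1.0 + 0.8 * x) * rng.lognormal(sigma=0.3, size=400)

def fit_resid(x, y):
    """Residuals of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

p_raw = stats.shapiro(fit_resid(x, y)).pvalue          # raw scale: non-normal residuals
p_log = stats.shapiro(fit_resid(x, np.log(y))).pvalue  # log scale: residuals near normal
```

A tiny p-value on the raw scale and a much larger one after the transform is the pattern that suggests a log (or Box-Cox) transform is appropriate; normal-looking residuals on the original scale would instead point toward missing variables.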
Comment by Chris Carozza on September 22, 2010 at 7:28pm
Scott, that is a good point: a model with a high R² is not necessarily a "good" model. However, I've already caught myself trying to maximize the R² to the detriment of model integrity. This is especially true when trying to develop predictive process models.
Comment by Scott Nicholson on September 22, 2010 at 6:40pm
A high R^2 is NOT a measure of a 'good' model. Generally time series data give you a high R^2 whereas cross-sectional data will yield a low R^2. Regressing a variable on the lag of itself generally will give you a high R^2, but does that fit the definition of a 'good' model? Depends on what the goal of your model-building exercise is.

If you have a low R^2 and are confident about your choice of predictors, then it's just true that there is a large amount of unobservable variation in your data. And expensive software won't solve that problem either, obviously.
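Scott's lag example is worth seeing concretely. The sketch below (simulated data; not from the comment itself) regresses a pure random walk on its own one-step lag: the R² comes out very high even though the series is just accumulated noise, so the "model" explains nothing useful about the process.

```python
import numpy as np

rng = np.random.default_rng(2)
walk = np.cumsum(rng.normal(size=1000))   # a pure random walk: accumulated noise

def r_squared(x, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - (y - X @ beta).var() / y.var()

r2_lag = r_squared(walk[:-1], walk[1:])   # regress the walk on its own lag
```

The high R² here reflects the persistence of the series, not any explanatory power, which is exactly why a high R² alone is not evidence of a good model.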
Comment by Jonathan Davis on September 22, 2010 at 12:03pm
Another thing you can try is partitioning your data--for instance if one of those independent variables has a large influence in the prediction, the value of that one independent variable may affect how other variables influence the predicted value. I've run across this effect before when examining transportation systems--the relationship between wait times and speed/capacity/number of transports was completely different under different demand circumstances. A single model trying to include the level of demand yielded poor results--bad residual distribution, non constant variance. Multiple models of the overall system worked very well.
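Jonathan's partitioning effect can be sketched as follows. The simulated "transportation" data and names below are illustrative only: the relationship between wait time and speed flips between a low-demand and a high-demand regime, so one pooled model fits poorly while per-regime models fit well.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
demand = rng.choice([0, 1], size=n)          # low- vs. high-demand regime
speed = rng.uniform(0, 10, size=n)
# the wait/speed relationship differs completely between regimes
wait = np.where(demand == 0,
                2.0 + 0.5 * speed,           # low demand
                20.0 - 1.5 * speed)          # high demand
wait = wait + rng.normal(scale=0.5, size=n)

def r_squared(x, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - (y - X @ beta).var() / y.var()

r2_pooled = r_squared(speed, wait)                           # one model: poor fit
r2_low = r_squared(speed[demand == 0], wait[demand == 0])    # per-regime: good fit
r2_high = r_squared(speed[demand == 1], wait[demand == 1])
```

Segmenting on the dominant variable before fitting is what rescues the R² here; the pooled model's residuals would also show the non-constant variance Jonathan describes.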
Comment by Chris Carozza on September 8, 2010 at 8:29pm
Tom, thank you for your feedback. The independent variables were not highly correlated, and the outliers were removed from the independent and dependent variables. I like the point you made about using a function (e.g., ln) to normalize the data. In the end, it would appear that one of the most important independent variables was not in the model.
