

What is the best way to select significant variables before doing any predictive modeling or regression?


Replies to This Discussion

Throw everything in and try taking out the variables with insignificant coefficients, or add them in one by one and watch what happens.
Gretl runs an F-test when you add variables to or remove them from a model.
You should also consider model selection criteria (R-squared, AIC, etc.) as you work through your variables.
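The "add them in one by one" idea can be sketched as forward selection scored by AIC. This is only an illustration in plain Python, not what Gretl does internally: the helper names (`solve`, `aic`, `forward_select`) and the toy data are assumptions, and the AIC is computed up to an additive constant for an ordinary least squares fit with an intercept.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def aic(columns, y):
    """AIC (up to a constant) of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    rss = sum((y[i] - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
              for i, row in enumerate(design))
    return n * math.log(rss / n) + 2 * k

def forward_select(candidates, y):
    """Add the variable that lowers AIC the most, until no addition helps."""
    chosen = []
    best = aic([], y)  # intercept-only model as the starting point
    improved = True
    while improved:
        improved = False
        for name in candidates:
            if name in chosen:
                continue
            score = aic([candidates[c] for c in chosen + [name]], y)
            if score < best:
                best, best_name, improved = score, name, True
        if improved:
            chosen.append(best_name)
    return chosen

# Toy data (made up): y depends on x1 only; x2 is unrelated noise.
candidates = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6],
}
y = [3.3, 4.8, 7.1, 8.6, 11.2, 12.9, 15.4, 16.7]
selected = forward_select(candidates, y)
```

The sketch assumes the candidates are not perfectly collinear and that the fit is not exact (a residual sum of squares of zero would break the logarithm). A statistics package would handle both cases for you.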
Hi Robin,
Thank you for your suggestion, but my question is how to select good variables before any predictive modeling or regression, for example by checking for missing information in variables, repeated values, etc.
If you don't have too many variables, the easiest way is to run scatter plots. Visualizing how your data points are scattered in an X-Y chart gives a first indication of whether a variable is related to the target. Simple univariate distributions can also help you detect missing values, repeating values, etc.
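For 60 or more variables, the univariate checks can be scripted instead of eyeballed one by one. Below is a minimal sketch in plain Python; the function names, the thresholds (50% missing, 95% share for one value), and the sample rows are all assumptions chosen for illustration.

```python
from collections import Counter

def profile_columns(rows, missing=(None, "")):
    """Per column: share of missing values, number of distinct values,
    and share of the most frequent value among the non-missing ones."""
    columns = sorted({key for row in rows for key in row})
    n = len(rows)
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in missing]
        counts = Counter(present)
        top_share = max(counts.values()) / len(present) if present else 0.0
        report[col] = {
            "missing_rate": 1 - len(present) / n,
            "n_distinct": len(counts),
            "top_share": top_share,
        }
    return report

def flag_weak_columns(report, max_missing=0.5, max_top_share=0.95):
    """Names of columns that are mostly missing or (nearly) constant."""
    return sorted(
        col for col, stats in report.items()
        if stats["missing_rate"] > max_missing
        or stats["top_share"] > max_top_share
        or stats["n_distinct"] <= 1
    )

# Made-up sample: "flag" never varies and "income" is mostly missing.
rows = [
    {"age": 34, "region": "N", "flag": 1, "income": None},
    {"age": 51, "region": "N", "flag": 1, "income": None},
    {"age": 29, "region": "S", "flag": 1, "income": 42000},
    {"age": 42, "region": "N", "flag": 1, "income": None},
]
report = profile_columns(rows)
weak = flag_weak_columns(report)
```

Columns flagged this way are candidates to drop or repair before any modeling. The thresholds should be tuned to the data at hand.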

Thank you. I have around 60 to 65 variables. I was doing the same thing that you have suggested, but it's very time consuming to do it for all 65 variables.
One tool I use for variable reduction is variable clustering (e.g. PROC VARCLUS): select one variable from each cluster, then test for significance.
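To show the idea of grouping correlated variables and keeping one per group, here is a rough sketch in plain Python. It is not the PROC VARCLUS algorithm: it is a much simpler "leader" clustering on absolute Pearson correlation, and the 0.8 threshold, the function names, and the data are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def leader_clusters(variables, threshold=0.8):
    """Put each variable into the first cluster whose leader (first member)
    it correlates with at |r| >= threshold; otherwise start a new cluster."""
    clusters = []
    for name, values in variables.items():
        for cluster in clusters:
            leader = variables[cluster[0]]
            if abs(pearson(values, leader)) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Made-up data: b and d move with a (d in the opposite direction); c does not.
variables = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
    "d": [10, 8, 6, 4, 2],
}
clusters = leader_clusters(variables)
representatives = [cluster[0] for cluster in clusters]
```

Here the representative is simply the first variable in each cluster, which is an arbitrary choice. The result also depends on the order in which variables are visited, so treat it as a quick screen rather than a replacement for a proper clustering procedure.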

Does VARCLUS work for binary data as well? I mean, what if all predictors are binary and I want to group variables?

On what criterion is the variable from each cluster selected?

Hi All,

In my limited experience I have found that PROC VARCLUS is a good method for finding the key variable clusters, after which one can pick a few variables from each cluster. However, sometimes most of the variables get bunched up in a single cluster, and then this method becomes less effective. In that case, ordering the variables in the biggest cluster by information value, R-squared, or a similar metric can be used to select the most predictive ones.

Another limitation of VARCLUS is that it does not take into account the predictive power of the variables, so when one picks the top few contributing variables from each cluster, one cannot be sure of picking the most predictive ones. Hence an information value (IV) or R-squared type of measure should be looked at in conjunction with the VARCLUS results to pick the most predictive variables.

It would be great if we could get some more opinions on this.
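As a rough illustration of ranking variables by information value, the sketch below computes IV for a categorical (or already binned) predictor against a 0/1 target, using the usual weight-of-evidence form: the sum over levels of (share of events minus share of non-events) times the log of their ratio. The function name, the 0.5 smoothing constant that avoids division by zero for empty cells, and the data are assumptions.

```python
import math
from collections import defaultdict

def information_value(feature, target, smoothing=0.5):
    """Information value of a categorical feature against a 0/1 target."""
    events = defaultdict(float)      # counts per level where target == 1
    non_events = defaultdict(float)  # counts per level where target == 0
    for x, y in zip(feature, target):
        if y == 1:
            events[x] += 1
        else:
            non_events[x] += 1
    levels = set(events) | set(non_events)
    # Smoothed totals so that every level has a non-zero share in both groups.
    total_e = sum(events.values()) + smoothing * len(levels)
    total_n = sum(non_events.values()) + smoothing * len(levels)
    iv = 0.0
    for level in levels:
        share_e = (events[level] + smoothing) / total_e
        share_n = (non_events[level] + smoothing) / total_n
        iv += (share_e - share_n) * math.log(share_e / share_n)
    return iv

# Made-up data: "strong" mostly follows the target, "weak" is unrelated to it.
target = [1, 1, 1, 1, 0, 0, 0, 0]
strong = [1, 1, 1, 0, 0, 0, 0, 1]
weak = [1, 0, 1, 0, 1, 0, 1, 0]
iv_strong = information_value(strong, target)
iv_weak = information_value(weak, target)
```

Within a large cluster, sorting the member variables by this value (highest first) is one way to apply the ordering described above. Continuous variables would need to be binned first.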
Hi Sunil
STATISTICA Data Miner has a very good option which may solve your problem.
The option is called Feature selection condition. It uses chi-square and F-tests for categorical and continuous dependent variables respectively, and after the analysis it lists the important variables in descending order of significance.
This is very effective, especially when you have a large number of predictors.
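The same kind of ranking can be sketched without any particular tool. The example below scores categorical predictors against a categorical dependent variable with the Pearson chi-square statistic and sorts them in descending order. It is an illustration only, not a description of how STATISTICA implements the feature; the function names and data are assumptions, and it ranks by the raw statistic rather than by p-value, which is only comparable across predictors with the same number of levels.

```python
from collections import Counter

def chi_square(feature, target):
    """Pearson chi-square statistic of the feature-by-target contingency table."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n
            stat += (joint[(f, t)] - expected) ** 2 / expected
    return stat

def rank_by_chi_square(features, target):
    """(name, statistic) pairs sorted from most to least associated."""
    scores = {name: chi_square(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up data: "plan" is associated with the outcome, "coin" is not.
target = ["y", "y", "y", "y", "n", "n", "n", "n"]
features = {
    "coin": ["h", "t", "h", "t", "h", "t", "h", "t"],
    "plan": ["a", "a", "a", "b", "b", "b", "b", "a"],
}
ranking = rank_by_chi_square(features, target)
```

With many predictors, the top of this list is where to look first. Converting the statistics to p-values needs the chi-square distribution, which a statistics library provides.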
I like to categorize the variables and run chi-square tests on them first independently. If they pass, then I will look further by adding a moderating variable.

-Ralph Winters
If we're talking about a linear regression here, a correlation analysis between the dependent variable (DV) and all candidate independent variables (IVs) is usually done as a general practice. This will not reveal much about any non-linear relationships, though; for that, a scatterplot analysis is usually helpful. So it can be done this way:
1. Run a correlation analysis and pick the IVs showing a significantly high correlation with the DV.
2. Run a scatterplot for the ones that weren't significant in step 1, to check whether there is any visible non-linear relationship between the DV and the IV in question.
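The two steps above can be sketched as a small screening function. This is a simplified illustration in plain Python: it splits candidates by a fixed cutoff on the absolute Pearson correlation (0.5 here, an arbitrary assumption) instead of a formal significance test, and the function names and data are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def screen_by_correlation(candidates, dv, threshold=0.5):
    """Step 1: keep IVs with |r| >= threshold.
    Step 2: return the rest as the list to inspect with scatterplots."""
    scores = {name: pearson(values, dv) for name, values in candidates.items()}
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    keep = [name for name, r in ranked if abs(r) >= threshold]
    inspect = [name for name, r in ranked if abs(r) < threshold]
    return keep, inspect, scores

# Made-up data: the DV is exactly u squared, so u has zero linear correlation
# with it even though the relationship is strong; "lin" is roughly 2 * DV.
dv = [4, 1, 0, 1, 4]
candidates = {
    "lin": [8.1, 2.2, 0.1, 1.9, 7.8],
    "u": [-2, -1, 0, 1, 2],
}
keep, inspect, scores = screen_by_correlation(candidates, dv)
```

The variable `u` shows why step 2 matters: its correlation with the DV is zero, so it lands in the `inspect` list, and only a scatterplot would reveal the U-shaped relationship.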

Does this make sense?
Thanks for all the suggestions.

"but it's very time consuming to do it for all 65 variables."


Don't use time as an excuse for shortcuts, as you don't know what learning you will miss. Data experience comes from touching and exploring all the data at as low a level as possible, to learn and understand how the variables relate to and interact with each other, and then with the dependent variable. That is how data intuition is built and how data business experience is built.


On Data Science Central
