A Data Science Central Community
Hi everyone, I would like to know: is it necessary to exclude independent variables from a regression model simply because they are correlated? I am working on a logistic regression model for fraud, built from a very large dataset but with a very big imbalance in population size between the target classes, i.e. a very large number of non-frauds and a small number of frauds.
My model has proved consistent over a long period with these variables included, as they each have their own independent definition and I need them in the model. Is there any statistical reason why we could include all variables in a model despite them being correlated?
Regression models can become unstable if the variables included have strong correlations.
If you want to include all the variables but want to avoid the problems that come from correlated variables, you could use principal component analysis (or some other method that combines the variables in an uncorrelated way) to create new variables that are not correlated but still retain the information in the originals. If you are able to name the new variables in a meaningful way, you wouldn't even lose any interpretability in the model.
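To make the PCA idea concrete, here is a minimal numpy sketch on synthetic data (the variables and data are hypothetical, purely for illustration): the principal-component scores are uncorrelated by construction while keeping all the information in the original columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated predictors (hypothetical data).
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly a copy of x1
X = np.column_stack([x1, x2])

# PCA via eigendecomposition of the correlation matrix.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# Project onto the principal components: the new variables
# are uncorrelated but jointly retain all the original information.
components = Xc @ eigvecs

new_corr = np.corrcoef(components, rowvar=False)
print(np.round(new_corr, 6))   # off-diagonal entries are ~0
```

The interpretability point from the post applies here: with two near-duplicate inputs, one component is roughly "the common signal" and the other "the disagreement", and naming them that way keeps the model readable.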
To start with, logistic regression is usually applied when the cases of interest are a small proportion of the data (<5%), like the small number of frauds in your case. The intention is to identify patterns so that fraud can be flagged before it happens in future.
Second, when variables are correlated, they generally also carry similar information in a business sense. For example, Price and Discount are two correlated variables, different transformations of the same kind of data, so it makes sense to keep just one! Similarly, in your case, it's best to isolate such cases.
Even after that, if you have unavoidable collinearity due to activity happening at the same time, for example two TV ad campaigns, or TV ads and price discounts running in conjunction, you then have to resort to methods mentioned in the other post, like factor analysis or PCA, to combine the variables. Sometimes, digging into the data to see what makes them collinear helps to alleviate the problem.
Finally, there is NO statistical reason to include variables that are collinear! The very fact that they are collinear means the information is redundant in the model!
Thanks Jarko and Arun,
The variables I have possess independent definitions and, to the business, play independent roles, e.g. citizenship and nationality: both are highly correlated but play different roles as they have different definitions.
My tests show enormous consistency in the trends for each variable in the model, keeping in mind that each variable was significant.
I am basically trying to find valid reasons, from a statistics perspective, why my model has been consistent after I decided to include all the significant independent variables despite them being correlated.
If most of your correlated variables are categorical, like citizenship and nationality, then the reason that having both in the model gives better results is that the difference between them is meaningful for the case. That citizenship and nationality differ might be as important to the model as what they are themselves. You could also try a model with only an indicator that flags when the two variables differ. Are you coding the categorical variables as 0-1 or with a number indicating the category?
With these kinds of variables you are always going to see a high "correlation", since in the majority of cases the values are exactly the same, and the cases where they differ are mostly seen as noise. The coding of the variables can also affect the amount of correlation shown.
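A quick numpy sketch of this point, using synthetic 0-1 coded data (the ~2% disagreement rate is an assumption for illustration): two indicators that agree on almost every record show a correlation near 1, and an explicit "differs" indicator isolates exactly the disagreements.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical 0-1 coded variables: citizenship and nationality
# agree for ~98% of records and differ for the rest.
citizenship = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.02                  # ~2% of cases differ
nationality = np.where(flip, 1 - citizenship, citizenship)

r = np.corrcoef(citizenship, nationality)[0, 1]
print(round(r, 3))   # close to 1: the rare disagreements look like "noise"

# An explicit indicator for the disagreement itself,
# as suggested above, isolates exactly those cases.
differs = (citizenship != nationality).astype(int)
```

With different codings (e.g. multi-level category numbers instead of 0-1 dummies) the same data can produce a quite different correlation figure, which is the coding caveat made above.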
"i need them in the model". Observational data is always co-linear to some degree. Since you need them in the model, it means that you are already decided on which variables go in. If your concern is about coefficient stability, and assuming that you don't have perfect co-linearity, you could check the VIFS of your chosen predictors. In logistic, vifs or about 2 could be areas to start looking at.
That's a nice example of Nationality & Citizenship! They differ in meaning, but they correlate to the degree that all people of a certain Nationality are its Citizens too. Hence, it is a form of derived relationship.
If you think in terms of set theory, you can visualize how these two variables relate based on their definitions.
In your case, it's possible that both carry information about your 'Y', and hence both come into the model. Also, logistic regression is more tolerant of multicollinearity, basically because it is usually run on categorical/class data and uses MLE, which is more robust to collinearity than OLS.
Now, I would also suggest that you create one variable out of these two for operational purposes! Possibly something like those whose Nationality and Citizenship are the same versus those who added a Citizenship beyond their Nationality, depending on how you think each variable independently affects the 'Y'.
Thanks again Jarko,
I agree with you, especially that all these variables play independent roles as they have independent definitions.
Thanks for your input. I am not necessarily forcing these variables into the model; they are all significant and entered the model through a stepwise procedure. I personally do expect correlation between some variables, e.g. permanent residence and nationality, because many permanent residents will have the same nationality, while some will have changed their permanent residence status. Having both in the model will give a more conclusive prediction, as it captures a complete profile of an individual.
Why don't you try using Bayesian networks to model your problem? Causal discovery procedures will learn the structure of your model and account for dependencies between your variables. You may get interesting insight into what is happening in your domain.