A Data Science Central Community
Dear colleagues and friends,
I would like to know how you handle a dataset with imbalanced classes when fitting a classification model, e.g. logistic regression. As an example, consider fitting a logistic regression model to a dataset whose dependent variable is made up of 5% bads and 95% goods.
Many thanks, Abhijit.
I agree with Steven Finlay that this paper gives a comprehensive review of how to deal with imbalanced datasets when modelling.
I usually do nothing because it doesn't matter. At least not from the perspective of the logistic regression model.
The inputs are 3 original variables (LASTGIFT, FISTDATE, RFA_2F), 3 dummy variables built from RFA_2A (values D, E, F), and 2 dummy variables built from DOMAIN (DOMAIN3 and DOMAIN1).
Notice that the coefficients for every variable are identical to within the standard error, except for the constant (which takes the relative proportions into account).
So for logistic regression, the distribution of the target variable (unbalanced or balanced) doesn't matter. It's the odds ratio that matters.
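To make the point concrete, here is a minimal sketch in Python. It is not the model from my example: it assumes a single binary predictor and made-up counts, chosen so the fit has a closed form (with one binary predictor the model is saturated, so the maximum-likelihood estimates are just the observed log-odds). The function name and all counts are hypothetical. Keeping every bad and exactly 1 in 19 goods moves the constant by log(19) and leaves the slope untouched.

```python
import math

def saturated_logit_fit(bads_x0, goods_x0, bads_x1, goods_x1):
    """MLE of logit P(bad | x) = b0 + b1 * x for one binary predictor x.

    With a single binary predictor the model is saturated, so the MLE
    reproduces the observed log-odds in each group exactly.
    """
    b0 = math.log(bads_x0 / goods_x0)        # log-odds of bad when x = 0
    b1 = math.log(bads_x1 / goods_x1) - b0   # log odds ratio, x = 1 vs x = 0
    return b0, b1

# Hypothetical full sample: 10,000 records, 500 bads (5%) and 9,500 goods (95%).
full = {"bads_x0": 200, "goods_x0": 7600, "bads_x1": 300, "goods_x1": 1900}

# Hypothetical balanced sample: keep every bad, keep exactly 1 in 19 goods.
keep_one_in = 19
balanced = {
    "bads_x0": full["bads_x0"],
    "goods_x0": full["goods_x0"] // keep_one_in,   # 400
    "bads_x1": full["bads_x1"],
    "goods_x1": full["goods_x1"] // keep_one_in,   # 100
}

b0_full, b1_full = saturated_logit_fit(**full)
b0_bal, b1_bal = saturated_logit_fit(**balanced)

print("slope (full):     ", b1_full)
print("slope (balanced): ", b1_bal)
print("constant shift:   ", b0_bal - b0_full, "vs log(19) =", math.log(keep_one_in))
```

With a real random subsample the goods are not thinned by exactly the same factor in every cell, so the slopes agree only up to sampling noise rather than exactly.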
I deal with this as one of my "Five Predictive Analytics Pet Peeves" (#5) here: http://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_....
I'm not saying one should never balance/stratify, only that it isn't necessary. I sometimes do it myself to reduce the total sample size, but it isn't my default approach, especially when the sample sizes are relatively small.
The primary reason I believe most practitioners stratify (and the reason I always used to) is the "my classifier calls everything a 0" problem. But this has nothing to do with the classifier per se. It happens because the posterior probability threshold applied to the probabilities to create the confusion matrix is 0.5 in every software package. It needn't be. If you apply the prior probability (proportion) as the posterior probability threshold, the confusion matrix will look fine. My PDF shows an example of this.
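Here is a small illustration of the threshold point, again as a sketch rather than the example from my PDF. The labels and predicted probabilities are invented (100 records, 5 bads), and `confusion_matrix` is a hypothetical helper written for this post, not a library function. The same scores are cut once at 0.5 and once at the prior proportion of bads.

```python
def confusion_matrix(y_true, p_hat, threshold):
    """Count TP/FP/TN/FN when records with p_hat >= threshold are called 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

# Invented data: 5 bads (1) followed by 95 goods (0).
y_true = [1] * 5 + [0] * 95
# Invented model scores; none reaches 0.5, as is typical with a 5% base rate.
p_hat = [0.35, 0.25, 0.18, 0.10, 0.03] + [0.12] * 6 + [0.04] * 39 + [0.03] * 50

prior = sum(y_true) / len(y_true)   # proportion of bads, 0.05

at_half = confusion_matrix(y_true, p_hat, 0.5)
at_prior = confusion_matrix(y_true, p_hat, prior)

print("threshold 0.5:  ", at_half)
print("threshold prior:", at_prior)
```

At 0.5 every record is called a 0, which is the symptom people try to cure by balancing. At the prior, the same unchanged model picks out 4 of the 5 bads at the cost of 6 false alarms.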