# AnalyticBridge

A Data Science Central Community

# Handling Imbalanced data when building regression models

Dear colleagues and friends,

I would like to know how you go about handling a dataset with imbalanced groups when fitting a classification model, e.g. logistic regression. As an example, consider fitting a logistic regression model to a dataset whose dependent variable is made up of 5% bads and 95% goods.


### Replies to This Discussion

Many thanks Abhijit,

I agree with Steven Finlay that this paper gives a comprehensive review of how to deal with imbalances in datasets when modelling.

Mark

I usually do nothing because it doesn't matter. At least not from the perspective of the logistic regression model.

• Consider this example from the KDD Cup 1998 data set (https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html). I built a logistic regression model in KNIME (http://www.knime.org) based on a 50% sample of the natural distribution (45,281 0s, 2,425 1s) and a second based on a stratified sample (2,425 0s, 2,425 1s; so I deleted 0s to balance the sample).
• I selected just 8 variables to make the comparison more clear: 3 original variables (LASTGIFT, FISTDATE, RFA_2F), 3 dummy variables built from RFA_2A (values D, E, F), and two dummies based on the variable DOMAIN (DOMAIN3 and DOMAIN1).
• The model from the original distribution looks like this: (apologies for the formatting)

| Variable | Coefficient | Std. Error | z-score | P>\|z\| |
|---|---|---|---|---|
| LASTGIFT | 8.82E-04 | 0.002 | 0.495 | 0.620 |
| FISTDATE | -4.68E-04 | 0.00007 | -6.882 | 5.92E-12 |
| RFA_2F | 0.209 | 0.023 | 8.999 | 0.0000 |
| D_RFA_2A | 0.507 | 0.103 | 4.930 | 8.21E-07 |
| E_RFA_2A | 0.382 | 0.081 | 4.722 | 2.33E-06 |
| F_RFA_2A | 0.301 | 0.069 | 4.367 | 1.26E-05 |
| DOMAIN3 | -0.160 | 0.061 | -2.640 | 0.0083 |
| DOMAIN1 | 0.164 | 0.047 | 3.489 | 4.85E-04 |
| Constant | 0.579 | 0.622 | 0.932 | 0.351 |

• For the stratified sample, the model looks like this:

| Variable | Coefficient | Std. Error | z-score | P>\|z\| |
|---|---|---|---|---|
| LASTGIFT | 4.15E-04 | 0.002 | 0.191 | 0.849 |
| FISTDATE | -5.52E-04 | 0.00010 | -5.635 | 1.75E-08 |
| RFA_2F | 0.176 | 0.033 | 5.321 | 0.0000 |
| D_RFA_2A | 0.544 | 0.147 | 3.708 | 2.09E-04 |
| E_RFA_2A | 0.398 | 0.108 | 3.666 | 2.47E-04 |
| F_RFA_2A | 0.289 | 0.089 | 3.233 | 1.22E-03 |
| DOMAIN3 | -0.204 | 0.082 | -2.477 | 0.0132 |
| DOMAIN1 | 0.138 | 0.067 | 2.068 | 3.86E-02 |
| Constant | 4.358 | 0.895 | 4.872 | 0.000 |

Notice that the coefficients for every variable agree to within their standard errors. The only exception is the constant, which absorbs the relative proportions of 0s and 1s in the sample.
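For anyone who wants to check that claim against the numbers, here is a rough sketch in Python. The coefficients and standard errors are copied from the two models in this post. Dividing each difference by a combined standard error is my own quick check, and it assumes the two fits are independent, which they are not (both samples contain the same 1s), so read the scores as indicative only.

```python
import math

# (coefficient, std. error) pairs copied from the two models in this post.
natural = {
    "LASTGIFT": (8.82e-4, 0.002),
    "FISTDATE": (-4.68e-4, 0.00007),
    "RFA_2F": (0.209, 0.023),
    "D_RFA_2A": (0.507, 0.103),
    "E_RFA_2A": (0.382, 0.081),
    "F_RFA_2A": (0.301, 0.069),
    "DOMAIN3": (-0.160, 0.061),
    "DOMAIN1": (0.164, 0.047),
    "Constant": (0.579, 0.622),
}
stratified = {
    "LASTGIFT": (4.15e-4, 0.002),
    "FISTDATE": (-5.52e-4, 0.00010),
    "RFA_2F": (0.176, 0.033),
    "D_RFA_2A": (0.544, 0.147),
    "E_RFA_2A": (0.398, 0.108),
    "F_RFA_2A": (0.289, 0.089),
    "DOMAIN3": (-0.204, 0.082),
    "DOMAIN1": (0.138, 0.067),
    "Constant": (4.358, 0.895),
}


def difference_z(name):
    """Absolute difference between the two estimates, in combined std. errors.

    Assumption: the two estimates are treated as independent, which is only
    approximately true here because the samples overlap.
    """
    b1, se1 = natural[name]
    b2, se2 = stratified[name]
    return abs(b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)


z_scores = {name: difference_z(name) for name in natural}
for name, z in z_scores.items():
    print(f"{name:10s} {z:5.2f}")
```

Every slope comes out at less than one combined standard error apart, while the constant is more than three apart.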

So for logistic regression, the distribution of the target variable (unbalanced or balanced) doesn't matter. It's the odds ratio that matters.
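The shift in the constant can be estimated from the sample counts alone. The sketch below uses the standard result that keeping each 0 with probability `keep_rate` raises the log-odds by `-ln(keep_rate)` and leaves the slopes alone. The counts and constants are the ones quoted in this post; the names `keep_rate` and `to_natural_probability` are mine and purely illustrative.

```python
import math

# Counts quoted in this post.
n_zeros_natural = 45281
n_zeros_stratified = 2425

# Fraction of 0s that survived the downsampling (all 1s were kept).
keep_rate = n_zeros_stratified / n_zeros_natural

# Downsampling 0s inflates the odds of a 1 by 1 / keep_rate, so the
# constant of the stratified model should be higher by -ln(keep_rate).
expected_shift = -math.log(keep_rate)

# Constants and their std. errors from the two fitted models in this post.
observed_shift = 4.358 - 0.579
combined_se = math.sqrt(0.622 ** 2 + 0.895 ** 2)

print(f"expected shift: {expected_shift:.3f}")
print(f"observed shift: {observed_shift:.3f}")
print(f"combined SE:    {combined_se:.3f}")


def to_natural_probability(p_stratified, keep_rate=keep_rate):
    """Map a probability from the stratified model back to the natural scale
    by removing the log-odds offset introduced by downsampling the 0s."""
    logit = math.log(p_stratified / (1 - p_stratified))
    return 1 / (1 + math.exp(-(logit + math.log(keep_rate))))
```

The expected shift is about 2.93 against an observed 3.78, a gap smaller than the combined standard error of the two constants. As a side effect, `to_natural_probability(0.5)` returns the natural proportion of 1s, so a score of 0.5 from the balanced model corresponds to the prior on the original scale.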

I deal with this as one of my "Five Predictive Analytics Pet Peeves" (#5) here: http://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_...

I'm not saying one should never balance/stratify, only that it isn't necessary. I occasionally do it myself to reduce the total sample size, but it isn't my default approach, especially when the sample sizes are relatively small.

The primary reason I believe most practitioners stratify (and the reason I always used to) is the "my classifier calls everything a 0" problem. But this has nothing to do with the classifier per se. It happens because the threshold applied to the posterior probabilities to create the confusion matrix is 0.5 by default in every software package. It needn't be. If you use the prior probability (the proportion of 1s) as the threshold instead, the confusion matrix will look fine. My PDF shows an example of this.
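To make the threshold point concrete, here is a small self-contained sketch. The data are synthetic and hypothetical (one made-up predictor and a made-up intercept of -3.5, giving roughly 4 to 5% of 1s), not the KDD Cup data, and the true probabilities stand in for the scores of a well-calibrated model.

```python
import math
import random

random.seed(0)


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Hypothetical imbalanced data: one predictor, rare positive class.
n = 5000
xs = [random.gauss(0, 1) for _ in range(n)]
probs = [sigmoid(-3.5 + x) for x in xs]  # stand-in for model scores
labels = [1 if random.random() < p else 0 for p in probs]


def confusion_matrix(probs, labels, threshold):
    """Count TP/FP/FN/TN when scores >= threshold are called 1."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Prior probability = observed proportion of 1s.
prior = sum(labels) / len(labels)

at_half = confusion_matrix(probs, labels, 0.5)
at_prior = confusion_matrix(probs, labels, prior)

print(f"prior = {prior:.3f}")
print("threshold 0.5:  ", at_half)
print("threshold prior:", at_prior)
```

At 0.5 almost nothing is called a 1, even though the scores themselves are fine. At the prior, the same scores produce a confusion matrix with a usable number of predicted 1s. Nothing about the model changed, only the cut-off.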

Cheers!