Dear colleagues and friends,

I would like to know how you handle a dataset with imbalanced classes when fitting a classification model such as logistic regression. For example, fitting a logistic regression model to a dataset whose dependent variable is made up of 5% bads and 95% goods.

Replies to This Discussion

Many thanks Abhijit,

I agree with Steven Finlay that this paper gives a comprehensive review of how to deal with imbalances in datasets when modelling.

 

Mark

I usually do nothing because it doesn't matter. At least not from the perspective of the logistic regression model. 

  • Consider this example from the KDD Cup 1998 data set (https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html). I built one logistic regression model in KNIME (http://www.knime.org) on a 50% sample of the natural distribution (45,281 0s, 2,425 1s) and a second on a stratified sample (2,425 0s, 2,425 1s; that is, I deleted 0s to balance the sample).
  • I selected just 8 variables to make the comparison clearer: 3 original variables (LASTGIFT, FISTDATE, RFA_2F), 3 dummy variables built from RFA_2A (values D, E, F), and 2 dummies based on the variable DOMAIN (DOMAIN3 and DOMAIN1).

  • The model from the original distribution looks like this:

    Variable    Coefficient   Std. Error   z-score   P>|z|
    LASTGIFT     8.82E-04     0.002         0.495    0.620
    FISTDATE    -4.68E-04     0.00007      -6.882    5.92E-12
    RFA_2F       0.209        0.023         8.999    0.0000
    D_RFA_2A     0.507        0.103         4.930    8.21E-07
    E_RFA_2A     0.382        0.081         4.722    2.33E-06
    F_RFA_2A     0.301        0.069         4.367    1.26E-05
    DOMAIN3     -0.160        0.061        -2.640    0.0083
    DOMAIN1      0.164        0.047         3.489    4.85E-04
    Constant     0.579        0.622         0.932    0.351

  • For the stratified sample, the model looks like this:

    Variable    Coefficient   Std. Error   z-score   P>|z|
    LASTGIFT     4.15E-04     0.002         0.191    0.849
    FISTDATE    -5.52E-04     0.00010      -5.635    1.75E-08
    RFA_2F       0.176        0.033         5.321    0.0000
    D_RFA_2A     0.544        0.147         3.708    2.09E-04
    E_RFA_2A     0.398        0.108         3.666    2.47E-04
    F_RFA_2A     0.289        0.089         3.233    1.22E-03
    DOMAIN3     -0.204        0.082        -2.477    0.0132
    DOMAIN1      0.138        0.067         2.068    3.86E-02
    Constant     4.358        0.895         4.872    0.000

Notice that the coefficients for every variable are the same in both models to within their standard errors. Only the constant differs materially, because it absorbs the relative proportions of 0s and 1s.

So for logistic regression, the distribution of the target variable (unbalanced or balanced) doesn't matter. It's the odds ratio that matters. 
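To make the point concrete, here is a minimal sketch on simulated data (not the KDD Cup data). Everything in it is an assumption for illustration: a single predictor, the true coefficients -3.5 and 1.0, and a small hand-written Newton solver, `fit_logit`, used in place of KNIME so the example needs only the standard library. It fits the same model on the full imbalanced sample and on a sample balanced by deleting 0s, then compares the slopes and checks the intercept shift against the log of the downsampling ratio.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logit(rows, iters=20):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton's method; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated imbalanced data (hypothetical parameters, roughly 4-5% 1s).
rng = random.Random(0)
TRUE_B0, TRUE_B1 = -3.5, 1.0
data = []
for _ in range(40000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0
    data.append((x, y))

ones = [row for row in data if row[1] == 1]
zeros = [row for row in data if row[1] == 0]

# Balance the sample by deleting 0s, as in the stratified model above.
kept_zeros = rng.sample(zeros, len(ones))
balanced = ones + kept_zeros

full_b0, full_b1 = fit_logit(data)
bal_b0, bal_b1 = fit_logit(balanced)

# Downsampling the 0s should shift only the intercept, by log(n0 / n0_kept).
expected_shift = math.log(len(zeros) / len(kept_zeros))

print(f"slope:     full={full_b1:.3f}  balanced={bal_b1:.3f}")
print(f"intercept: full={full_b0:.3f}  balanced={bal_b0:.3f}")
print(f"intercept shift: observed={bal_b0 - full_b0:.3f}  expected={expected_shift:.3f}")
```

For the counts quoted in the post, the same formula gives log(45281/2425), about 2.93, against an observed difference in constants of about 3.78, which is within the reported standard errors of the two constants.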

I deal with this as one of my "Five Predictive Analytics Pet Peeves" (#5) here: http://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_...

I'm not saying one should never balance/stratify, only that it isn't necessary. I sometimes do it myself to reduce the total sample size, but it isn't my default approach, especially when the sample sizes are relatively small.

The primary reason I believe most practitioners stratify (and the reason I always used to) is the "my classifier calls everything a 0" problem. But this has nothing to do with the classifier per se. It happens because the threshold applied to the posterior probabilities to create the confusion matrix defaults to 0.5 in most software packages. It needn't be. If you use the prior probability (the proportion of 1s) as the threshold instead, the confusion matrix will look fine. My PDF shows an example of this.
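A small sketch of the threshold effect, again on simulated data rather than the example from the PDF. The scores here are assumed to come from a well-calibrated model (they are simply the true probabilities of a hypothetical one-variable logistic model), and `confusion` is an illustrative helper, not a function from any package.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def confusion(scores, labels, threshold):
    """Count outcomes when predicting 1 for every score >= threshold."""
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1:
            cm["tp" if y == 1 else "fp"] += 1
        else:
            cm["fn" if y == 1 else "tn"] += 1
    return cm

# Simulated scores and labels (hypothetical model, roughly 4-5% 1s).
rng = random.Random(1)
scores, labels = [], []
for _ in range(20000):
    x = rng.gauss(0.0, 1.0)
    p = sigmoid(-3.5 + 1.0 * x)   # assumed posterior probability
    scores.append(p)
    labels.append(1 if rng.random() < p else 0)

prior = sum(labels) / len(labels)          # observed proportion of 1s

cm_half = confusion(scores, labels, 0.5)   # the usual default threshold
cm_prior = confusion(scores, labels, prior)  # prior used as threshold

print(f"prior = {prior:.4f}")
print("threshold 0.5:  ", cm_half)
print("threshold prior:", cm_prior)
```

With the 0.5 threshold almost nothing is called a 1, because very few scores exceed 0.5 when 1s are rare. The scores themselves are unchanged in the second matrix; only the cut-off moves.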

Cheers!

© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
