I am working on a highly imbalanced dataset (over 20K negative examples and about 100 positive examples) and trying to build a logistic regression model. My current approach is to undersample the negative examples. However, this approach raises a couple of problems:

1) Different samples yield different LR models. How do I generalize across these models and interpret the output? Different models flag different attributes as significant. Also, how do I decide on the ideal sample size? Currently I evaluate the models by % true positives, since I am interested in models that predict a high % of true positives. However, this is the root of my other problem.

2) I get a high % of false positives when I test my models on the complete dataset. Is there a way to minimize the % of false positives while maximizing the % of true positives?

Any comments / suggestions are welcome.

Thank you


Replies to This Discussion

You need to concentrate on reducing the number of predictor variables you are using, so that you include only those that are useful.

Build many models from many random samples, then keep only those predictors that show up as significant in all the models. For each model, also randomly select the initial predictors.
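A rough sketch of that procedure in Python, in case it helps. Everything named here is an assumption made for illustration: the synthetic data, the helper names `fit_logit` and `stable_predictors`, the sample sizes, and the use of a Wald test at |z| > 1.96 as the significance criterion. "Significant in all the models" is read as "significant in every model that happened to include the predictor".

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson; returns (coefs, Wald z-scores)."""
    Xd = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))
        # Hessian of the negative log-likelihood (tiny ridge for numerical safety)
        H = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(H, Xd.T @ (y - p))
    # H comes from the last iteration, which is effectively the converged value
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return beta, beta / se

def stable_predictors(X, y, n_models=50, n_neg=300, n_feats=4, z_crit=1.96, seed=0):
    """Keep predictors that are significant in every model that included them."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    tried = np.zeros(X.shape[1], dtype=int)
    signif = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_models):
        # all positives plus a fresh random undersample of the negatives
        rows = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
        # random subset of initial predictors for this model
        cols = rng.choice(X.shape[1], size=n_feats, replace=False)
        _, z = fit_logit(X[np.ix_(rows, cols)], y[rows])
        tried[cols] += 1
        signif[cols] += np.abs(z[1:]) > z_crit      # z[0] is the intercept
    return [j for j in range(X.shape[1]) if tried[j] > 0 and signif[j] == tried[j]]

# Made-up rare-event data: only predictors 0 and 1 carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(20100, 6))
true_logit = -7.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1]
y = (rng.random(len(X)) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

kept = stable_predictors(X, y)
print("predictors significant in every model that used them:", kept)
```

Requiring significance in every model is a strict rule; relaxing it to "significant in at least, say, 90% of the models" is a judgment call that this sketch does not make for you.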

How many initial predictors do you have and how are you determining significance?
I have found this article useful:
Gary King and Langche Zeng. "Logistic Regression in Rare Events Data," Political Analysis, Vol. 9, No. 2.
PDF available at:
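One idea from that article that bears directly on the false-positive problem is "prior correction": a model fitted on an undersampled set has an intercept that is too high for the full population, and it can be adjusted by subtracting ln[((1 - tau) / tau) * (ybar / (1 - ybar))], where tau is the fraction of positives in the full data and ybar is the fraction in the sample. A minimal sketch, assuming the negatives were undersampled at random; the 300 sampled negatives and the fitted intercept of 0.0 are made-up values for illustration:

```python
import math

def prior_corrected_intercept(b0, tau, ybar):
    """Adjust an intercept fitted on an undersampled set back to the population scale.

    tau  : fraction of positives in the full dataset
    ybar : fraction of positives in the undersampled training set
    """
    return b0 - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

tau = 100 / 20100        # about 100 positives among 20K negatives (from the question)
ybar = 100 / 400         # assumption: 100 positives + 300 sampled negatives
b0_sample = 0.0          # hypothetical intercept from the undersampled fit

b0_full = prior_corrected_intercept(b0_sample, tau, ybar)

# Predicted probability for a case whose predictors contribute nothing
p_before = sigmoid(b0_sample)   # what the undersampled model reports
p_after = sigmoid(b0_full)      # after correcting the intercept
print(round(p_before, 4), round(p_after, 4))
```

Only the intercept changes; the slope coefficients are left as fitted. Scoring the complete dataset with the corrected intercept pulls the predicted probabilities down toward the true base rate, so fewer negatives cross a fixed cutoff.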
Thank you very much for the replies. They were very helpful for my analysis. Random multiple sampling really resulted in better insights about the data.

