I am working on a highly imbalanced dataset (over 20K negative examples and only about 100 positive examples). I am trying to build a logistic regression model. My current approach is to undersample the negative examples. However, this approach raises a couple of problems:
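To make the setup concrete, here is a minimal sketch of what I am doing, assuming scikit-learn; the synthetic data, the 1:1 sampling ratio, and the random seed are placeholders, not my actual data or settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for my data: ~20K negatives, ~100 positives
X, y = make_classification(n_samples=20100, weights=[0.995], random_state=0)

rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

# Undersample the negatives to match the positives (1:1 here; the
# right ratio is exactly one of the things I am unsure about)
neg_sample = rng.choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, neg_sample])

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Evaluate on the complete dataset: the true-positive rate looks good,
# but the false-positive rate is high
tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
```

A different `neg_sample` draw gives a different fitted model (and different significant coefficients), which is problem 1 below.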
1) Different samples of negatives yield different LR models. How do I generalize across these models and interpret their output, given that each one flags different attributes as significant? Also, how do I decide on an appropriate sample size? Currently I evaluate the models by their true-positive rate, since I want models that catch a high percentage of positives. However, this is the root of my second problem.
2) When I test my models on the complete dataset, I get a high false-positive rate. Is there a way to minimize the false-positive rate while maximizing the true-positive rate?
Any comments or suggestions are welcome.