A Data Science Central Community
Dear colleagues and friends,
I would like to know how you handle a dataset with imbalanced groups when fitting a classification model, e.g. logistic regression. As an example, consider fitting a logistic regression model to a dataset whose dependent variable is made up of 5% bads and 95% goods.
This is a good question, and one that seems to get raised time and time again.
A colleague (Sven Crone from Lancaster University in the UK) and I published a paper on this issue last year in the International Journal of Forecasting: "Instance sampling in credit scoring: An empirical study of sample size and balancing." A summary of our findings can also be found in the book "Credit Scoring, Response Modeling and Insurance Rating. A Practical Guide to Forecasting Consumer Behavior."
There are also some very good papers by G. Weiss from 2004/5 which are highly cited and referenced in our paper/book.
What we found was that for some methods of model construction sample imbalance was not an issue at all, not even a tiny amount. For logistic regression in particular, there was absolutely no benefit to creating a balanced sample. What was far more important was using all the data you had available. For example, for a marketing campaign, if you had 1,000 responses and 50,000 non-responses you got better models by using all 51,000 cases than by sampling down the non-responses to 1,000 or by weighting up the 1,000 responses.
We also looked at Neural Networks, Discriminant Analysis and Decision Trees. Discriminant Analysis was somewhat sensitive to the class imbalance (Balanced better than imbalanced) but the method that was the most sensitive, by far, was the Decision Tree approach (CART 4.5). We saw differences in model performance of more than 10% - with the balanced sample performing much better than the imbalanced one.
We also considered the two different ways of creating a balanced sample. The first was “under-sampling”, where you “throw away” some of the larger class (non-responders in the above example) to create a sample with the same number of cases in each class. The second method was “over-sampling”, where you weight up the minority class, so in the above example you would treat each response as if it appeared 51 times in your sample. Over-sampling was generally better than under-sampling, particularly when small samples were involved, which makes intuitive sense given that with under-sampling you are not making full use of the data available to you. Weiss comes to the same conclusion in his paper, if memory serves me right.
Hope this is useful
In my case, logistic regression did not perform well on imbalanced data. In your case, did you maintain the same ratio of classes in the training set and the testing set? I have heard that random forests can be used, as they are not sensitive to class imbalance.
That's interesting to know.
When assessing the performance of a model you should always use a data set that matches your real world application to decide how good it is.
So in the above example, if you balanced your data to construct your model, then you should assess performance on an original, unbalanced validation data set. After all, it's on a population like this that the model is going to be used. A common mistake is to develop and validate on a data set that's been balanced (i.e. not representative of the true population proportions from which the sample was taken).
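One way to keep this straight is to split first and balance afterwards. The following is only a minimal Python sketch, not code from the paper: the function names `stratified_split` and `undersample` and the toy 5%/95% data are made up for illustration. It holds out a validation set with the original class proportions and under-samples only the development portion.

```python
import random

def stratified_split(labels, valid_fraction, seed=0):
    """Return (train_idx, valid_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_valid = round(len(idx) * valid_fraction)
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return train_idx, valid_idx

def undersample(indices, labels, seed=0):
    """Randomly drop cases from the larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(v) for v in by_class.values())
    balanced = []
    for cls in sorted(by_class):
        balanced.extend(rng.sample(by_class[cls], n_min))
    return balanced

def bad_rate(idx, labels):
    """Share of bads (label 1) among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)

# Toy data (hypothetical): 1 = bad (5%), 0 = good (95%)
labels = [1] * 50 + [0] * 950
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.3)
balanced_train_idx = undersample(train_idx, labels)

print("validation bad rate:", bad_rate(valid_idx, labels))
print("balanced development bad rate:", bad_rate(balanced_train_idx, labels))
```

The model would be built on `balanced_train_idx` and assessed on `valid_idx`, which still has the population bad rate.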
I've not done any research or read anything about balancing with Random Forests so can't comment on that.
Thank you very much for the detailed response. I have been developing scoring models and have always undersampled the larger group of responses. Is there a minimum number of observations you need to have in a group? As an example, I have a situation where the dataset is made up of 4,000 goods and 230 bads. Is this suitable for fitting a regression model? I have been considering bootstrapping the regression estimates to possibly obtain unbiased estimates. Please advise whether this approach is reliable or whether there is a more suitable one. Many regards.
In my experience, having only 230 bads is quite difficult to work with for a credit scoring type problem. One problem with bootstrapping is that each model only uses about 2/3 of the distinct observations (roughly 63%), which in your case may be a problem. To put it another way, the estimates will be unbiased, but the model won't be as efficient as if you used all of the data or had a bigger sample.
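The "about 2/3" figure can be checked quickly. A bootstrap sample of size n drawn with replacement is expected to contain a fraction 1 - (1 - 1/n)^n of the distinct observations, which approaches 1 - 1/e (about 0.632) for large n. A small Python sketch, with the helper name `bootstrap_unique_fraction` invented for illustration:

```python
import random

def bootstrap_unique_fraction(n, seed=0):
    """Draw one bootstrap sample of size n (with replacement) and return the
    fraction of the original observations that appear in it at least once."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

n = 4230                          # 4,000 goods + 230 bads
expected = 1 - (1 - 1 / n) ** n   # theoretical value, close to 1 - 1/e
simulated = bootstrap_unique_fraction(n)

print("expected fraction of distinct observations:", round(expected, 3))
print("fraction in one simulated bootstrap sample:", round(simulated, 3))
```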
To maximize the use of the data with small samples, I have in the past used "leave one out" cross validation so that all the data (less one observation) is used to build each model. I'm a SAS user, so to do this I would write a SAS macro to run the regression 4,230 times, leaving one observation out each time. The left-out observation is then used for validation, so after all 4,230 runs you have 4,230 observations in your validation sample. The downside is that it takes a little while to run all the cross validation models (about an hour or two for 10,000 observations and 20 variables). If you can't do this, then a similar idea is "K-fold cross validation" with, say, 20 folds.
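The bookkeeping behind both ideas is the same, and leave-one-out is just K-fold with K equal to the number of observations. Below is a rough Python sketch rather than the SAS macro mentioned above; `k_fold_indices`, `cross_validated_scores`, and the placeholder `fit`/`score` functions are hypothetical names, and the "model" is simply the bad rate of the training fold so that the example stays self-contained.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; each observation is validated exactly once.
    Setting k == n gives leave-one-out."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        valid = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, valid

def cross_validated_scores(n, k, fit, score):
    """Collect one out-of-sample score per observation.
    `fit(train_idx)` returns a model; `score(model, i)` scores observation i."""
    out = [None] * n
    for train, valid in k_fold_indices(n, k):
        model = fit(train)
        for i in valid:
            out[i] = score(model, i)
    return out

# Toy usage (hypothetical data): 23 bads and 400 goods.
labels = [1] * 23 + [0] * 400

def fit(train_idx):
    # Placeholder "model": the bad rate of the training rows.
    return sum(labels[i] for i in train_idx) / len(train_idx)

def score(model, i):
    # Placeholder score: every case gets the training bad rate.
    return model

scores_20_fold = cross_validated_scores(len(labels), 20, fit, score)
scores_loo = cross_validated_scores(len(labels), len(labels), fit, score)
```

In real use, `fit` would run the regression on the training rows and `score` would return the predicted probability for the held-out row.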
I've also talked to people who adopt a "Delphi" approach with samples below about 200-300. All this means is that they look at the patterns in the data (univariate analysis) and combine them with their expert opinion and wider domain knowledge to "make up" a model. For example, with your 230 bads, I expect that relatively few will have been bankrupt recently? So there may not be enough bankrupts for a statistical process to include bankruptcy in the model. However, an expert knows that previous bankruptcy is correlated with going bad and so will allocate some points in the model. This might not be optimal, but is probably better than nothing.
I believe the Delphi method, if properly applied, can yield similar levels of performance to common statistical methods for small samples.
Thanks for your explanation. We have also struggled with this issue and when we tried support vector machines, they were quite sensitive to imbalanced data. We balanced the data by undersampling but the results were sub-par.
I am not sure I fully understood your oversampling methodology. How and where are you applying this 51x weight? Thanks.
Most modelling software (e.g. SAS) allows you to create a weight variable. In the above example you would set the weight variable to 1 for the non-responders (the majority class) and 50 for the responders (the minority class). My mistake in the previous post: you would give a weight of 50, not 51! You then tell the modelling procedure (PROC LOGISTIC, PROC GLM etc.) which field holds the weight, and the software treats each observation with a weight of 50 as if it were repeated 50 times.
If your software does not support weighting, then the simplest (but very inefficient) way to achieve a weighting is to create multiple instances of the minority class. So in the above example, create 49 copies of each of the responders so that you have 50,000 responders and 50,000 non-responders.
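To make the arithmetic concrete, here is a small Python sketch of both routes, using the 1,000 / 50,000 example. It is an illustration under assumptions, not what SAS does internally: `balancing_weights` and `replicate_minority` are made-up helper names, and the rows are just integers standing in for real predictor records.

```python
from collections import Counter

def balancing_weights(labels):
    """Weight each class by (size of largest class) / (size of this class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

def replicate_minority(rows, labels, weights):
    """Fallback when the software has no weight option: repeat each row
    round(weight) times in total (the original plus its copies)."""
    out_rows, out_labels = [], []
    for row, y in zip(rows, labels):
        copies = round(weights[y])
        out_rows.extend([row] * copies)
        out_labels.extend([y] * copies)
    return out_rows, out_labels

# The marketing example: 1,000 responders (1) and 50,000 non-responders (0).
labels = [1] * 1_000 + [0] * 50_000
rows = list(range(len(labels)))        # stand-in for the real predictor rows
weights = balancing_weights(labels)    # {1: 50.0, 0: 1.0}
weight_column = [weights[y] for y in labels]

big_rows, big_labels = replicate_minority(rows, labels, weights)
print(weights, Counter(big_labels))
```

Either way, each class ends up with a total weight of 50,000; the weight column simply avoids holding 100,000 rows in memory.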
I worked on a fraud model with a dataset which had 99.95% non-frauds and 0.05% frauds. I used an undersampling technique to adjust the dataset so that the ratio of frauds to non-frauds in the model development dataset was 1:10. This was so that the estimates could capture the reality of the events being modelled. This worked very well for me.
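As a rough sketch of this kind of partial undersampling (not the poster's actual code, and with an invented helper name `undersample_to_ratio` and toy data), the idea is to keep every fraud and draw a random subset of the non-frauds:

```python
import random

def undersample_to_ratio(labels, majority_per_minority, minority=1, seed=0):
    """Keep every minority case and a random subset of the majority so that
    there are `majority_per_minority` majority cases per minority case."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(majority_idx), majority_per_minority * len(minority_idx))
    kept = minority_idx + rng.sample(majority_idx, n_keep)
    return sorted(kept)

# Toy data with a 0.05% fraud rate: 100 frauds in 200,000 records.
labels = [1] * 100 + [0] * 199_900
dev_idx = undersample_to_ratio(labels, majority_per_minority=10)
n_fraud = sum(labels[i] for i in dev_idx)
n_clean = len(dev_idx) - n_fraud
print(n_fraud, "frauds and", n_clean, "non-frauds in the development sample")
```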
Handling imbalanced data sets in classification is a tricky job. As suggested in other replies, you can handle it with a few sampling tricks. In my view, under-sampling the majority class is not advisable, as it normally means a potential loss of information. Giving differential weights to the minority and majority classes is standard industry practice and gives reasonably good results.
From the industry perspective, making a stratified sample (i.e. keeping the same proportion of majority to minority class) when creating train / validation / test splits is a very important step. Next, whatever modeling method you use, the lift you get in the first few deciles is quite important.
Here, you can also think of introducing priors. By priors I mean this: let's say your sample has 5% bads and 95% goods, but from industry experience you expect more bads, say 10% bads and 90% goods. You can enforce this information by setting up the priors, which will influence your classifier's performance. You can also design the profit matrix so that classifying a bad as good carries a higher penalty than classifying a good as bad. I am not sure what tool you are using, but with SAS tools such as EMiner you can do all the things I described above.
From my personal experience, no single algorithm is inherently good or bad at handling imbalance in the data. You can probably expect improvements by combining several classifiers' outputs (i.e. by creating an ensemble of classifiers). Random Forest does this for decision trees, but my suggestion would be to create an ensemble of different classifiers, such as logistic regression, decision trees, neural networks, SVMs etc. The diversity in the classifier space will handle most of the cases in the data set properly.
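The priors and the profit matrix can each be reduced to a short formula. The Python sketch below is an illustration under assumptions, not a description of what EMiner does internally: `adjust_for_prior` applies the usual Bayes rescaling of a predicted probability from the sample bad rate to an assumed prior, and `cost_threshold` gives the cut-off that minimises expected cost when correct decisions cost nothing. Both function names and the cost figures are hypothetical.

```python
def adjust_for_prior(p, sample_rate, prior_rate):
    """Rescale a predicted bad probability from the sample bad rate to an assumed prior."""
    bad = p * prior_rate / sample_rate
    good = (1 - p) * (1 - prior_rate) / (1 - sample_rate)
    return bad / (bad + good)

def cost_threshold(cost_bad_as_good, cost_good_as_bad):
    """Probability cut-off above which calling a case 'bad' has the lower expected cost
    (assumes correct classifications cost nothing)."""
    return cost_good_as_bad / (cost_good_as_bad + cost_bad_as_good)

# Sample has 5% bads, but 10% bads are expected in practice.
p_adjusted = adjust_for_prior(0.05, sample_rate=0.05, prior_rate=0.10)

# Hypothetical costs: a missed bad costs 10 units, a wrongly rejected good costs 1 unit.
cutoff = cost_threshold(cost_bad_as_good=10, cost_good_as_bad=1)

print("average-risk case after prior adjustment:", round(p_adjusted, 4))
print("classify as bad when adjusted probability exceeds:", round(cutoff, 4))
```

With these made-up costs the cut-off falls well below 0.5, which is how a heavier penalty for missed bads shows up in practice.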
Finally, as I said at the beginning, from the business perspective, apart from pure classification performance, end users are interested in performance metrics such as area under the ROC curve, lift, and profit in the first few deciles. I hope my answer is helpful to you.
Thank you very much for your response.
I think a lot of us developing predictive models have always used the undersampling technique when dealing with unbalanced datasets, and you are right that there is a likelihood of loss of information. I will try adjusting the sampling weights for each class. I would also like to know if there is a minimum number of observations you need to have in a group in order to fit a regression model. As an example, I have a situation where the dataset is made up of 4,000 goods and 230 bads. Is this suitable for fitting a regression model? I have been considering bootstrapping the regression estimates to possibly obtain unbiased estimates. Please advise whether this approach is reliable or whether there is a more suitable one. Thanks in advance.
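For readers who have not computed decile lift by hand, here is a minimal Python sketch. The function name `lift_by_decile` and the toy scores are invented for illustration; lift is taken as the bad rate within each score-ranked group divided by the overall bad rate.

```python
def lift_by_decile(scores, labels, n_bins=10):
    """Sort cases by score (highest first), cut into n_bins equal-sized groups and
    return each group's bad rate divided by the overall bad rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    overall = sum(labels) / len(labels)
    lifts = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        group = order[start:stop]
        rate = sum(labels[i] for i in group) / len(group)
        lifts.append(rate / overall)
    return lifts

# Toy example: 100 cases, 10 bads; the score ranks 6 bads in the top decile
# and the other 4 in the second decile.
labels = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86
scores = [1 - i / 100 for i in range(100)]   # already in descending order
lifts = lift_by_decile(scores, labels)
print([round(x, 2) for x in lifts])
```

A lift of about 6 in the first decile means that group contains roughly six times as many bads as a random 10% of the population would.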
Please find below the link for a very good review paper which addresses this problem. I hope you will find it interesting:
Thanks, I've not seen this paper before - a very comprehensive literature review of the methods to date.