Oversampling/Undersampling in Logistic Regression - AnalyticBridge2020-09-25T22:17:52Zhttps://www.analyticbridge.datasciencecentral.com/forum/topics/oversamplingundersampling-in?feed=yes&xn_auth=noHello, indeed in many applica…tag:www.analyticbridge.datasciencecentral.com,2018-03-07:2004291:Comment:3813092018-03-07T15:17:25.873ZJacob Zahavihttps://www.analyticbridge.datasciencecentral.com/profile/JacobZahavi
<p>Hello, indeed in many applications, if not in most of them, the number of "positive" events (e.g., the number of buyers in a marketing campaign) is very small, much smaller than the number of "non-events" (e.g., non-buyers, in the above example). Hence, a sample of observations, however large, drawn randomly from the population (the "universe") may not include enough positive events to build a significant model with. The solution in these cases is to use a choice-based sample for training the logistic regression model, namely a sample that contains a larger proportion of positive events (e.g., buyers, in a marketing campaign) than the universe does (sometimes all of them), and only a sample of the non-buyers. As shown in Ben-Akiva, M., and S.R. Lerman, 1987, Discrete Choice Analysis, the MIT Press, Cambridge, MA, one needs to correct only the constant of the resulting logistic regression model to obtain a model that reflects the true proportion of buyers and non-buyers in the universe. In the case where the choice-based sample contains all the buyers in the universe, the constant is modified using the expression: modified constant = resulting constant of the logistic regression model + LN(TN/UN), where TN is the number of non-buyers in the training dataset used to build the model, UN is the number of non-buyers in the universe, and LN is the natural logarithm.</p>
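<p>As a rough illustration of the constant correction described above, here is a minimal Python sketch. The fitted constant, the coefficients, the counts TN and UN, and the function names are all made-up values for illustration, not taken from any real campaign or library.</p>

```python
import math

def corrected_intercept(fitted_intercept, train_nonbuyers, universe_nonbuyers):
    """Apply: modified constant = fitted constant + LN(TN / UN).

    Assumes the training sample holds ALL buyers from the universe and
    only a random sample of the non-buyers.
    """
    return fitted_intercept + math.log(train_nonbuyers / universe_nonbuyers)

def predict_probability(intercept, coefs, features):
    """Standard logistic response for one observation."""
    z = intercept + sum(c * f for c, f in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values: a model fitted on the choice-based sample.
fitted_b0 = -0.40          # constant estimated on the training sample
coefs = [0.8, -0.3]        # slope coefficients (left unchanged)
TN = 20_000                # non-buyers in the training sample
UN = 2_000_000             # non-buyers in the universe

b0 = corrected_intercept(fitted_b0, TN, UN)

customer = [1.0, 2.0]      # hypothetical predictor values
p_sample = predict_probability(fitted_b0, coefs, customer)
p_universe = predict_probability(b0, coefs, customer)

print(f"corrected constant: {b0:.4f}")
print(f"probability before correction: {p_sample:.4f}")
print(f"probability after correction:  {p_universe:.6f}")
```

<p>Only the constant moves; the slope coefficients stay as fitted, so the ranking of customers is unchanged and the odds are simply rescaled by TN/UN.</p>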
<p> </p> If you are modeling binomial…tag:www.analyticbridge.datasciencecentral.com,2010-07-19:2004291:Comment:742492010-07-19T17:36:21.624ZJoseph Hilbehttps://www.analyticbridge.datasciencecentral.com/profile/JosephHilbe
If you are modeling binomial data, i.e., a numerator consisting of the number of successes (the 1s among the 1/0 outcomes) for a given pattern of covariates, and a denominator giving the total number of observations having that covariate pattern (a specific profile of predictor values, e.g., age=23, married=1, working=0), a logistic regression is generally appropriate. But when the mean value of the numerators is less than 10% of the mean value of the denominators, a Poisson model is likely to be preferred. The numerator of the would-be logistic model becomes the count response (dependent) variable, and the natural log of the denominator is the offset. Generally the Poisson model will fit the data better. Logistic models are not intended for rare occurrences, whereas rare occurrences are exactly how the Poisson distribution is derived from the binomial. If the Poisson model is overdispersed, i.e., if the variance of the response is greater than the mean, then you'll likely need a negative binomial model. Nearly all count models are overdispersed; some aren't, and some are underdispersed. Remember, given the data you have (and I don't know it), it may be that none of these models is really appropriate. But as a rule of thumb, modeling rare events is typically the task of Poisson or negative binomial count models. There are a host of different types of negative binomial models, each of which addresses a certain type of overdispersion. On the surface, these appear to be better alternatives than a logistic model. It's not widely used, but a c…tag:www.analyticbridge.datasciencecentral.com,2010-07-06:2004291:Comment:732492010-07-06T20:05:32.249ZJoseph Foutzhttps://www.analyticbridge.datasciencecentral.com/profile/JosephFoutz
It's not widely used, but a close friend of mine has had good success with Rare Events Logit (RELogit). Available for Stata, GAUSS, and R. It's fairly new (2001) and it seems that it is just now starting to gain traction in the academic literature. I agree with the statement th…tag:www.analyticbridge.datasciencecentral.com,2010-06-24:2004291:Comment:723692010-06-24T00:50:12.617ZJeffhttps://www.analyticbridge.datasciencecentral.com/profile/Jeff710
I agree with the statement that the density of rare events does not affect logistic regression, if we are talking about prediction. Where oversampling (or weights, costs, or priors) does pay big dividends, in my experience, is decision trees. I think that's the canonical example of where the proportion of the rare case needs to be considered. My student tryied to predict…tag:www.analyticbridge.datasciencecentral.com,2010-06-23:2004291:Comment:723542010-06-23T22:32:21.054ZJozo Kovachttps://www.analyticbridge.datasciencecentral.com/profile/JozoKovac
My student tried to predict fog and had only 100 positives among more than 200k observations.<br />
<br />
Just say there will be no fog and you are right about 99.95% of the time. There's no value in such a trivial prediction.<br />
<br />
Fog requires certain conditions: humidity, temperature, etc. If you focus only on the observations that satisfy these basic conditions, the data mining task is completely different.<br />
<br />
Instead of about 0.05% you have 5% or maybe 20% of positive observations and your favorite algorithm works well. What would stop you from usin…tag:www.analyticbridge.datasciencecentral.com,2010-06-23:2004291:Comment:723322010-06-23T16:51:40.846ZStephen Croninhttps://www.analyticbridge.datasciencecentral.com/profile/StephenCronin
What would stop you from using a Poisson regression technique here? Like logistic regression, it is linear on a transformed scale (for Poisson, the natural log), and I think you could do this in R very quickly and measure the overdispersion coefficient to see whether your point estimates have the correct level of precision. You can also adjust for overdispersion via a Pearson-based scale correction and get some decent classification accuracy if the data are amenable to this type of regression.<br />
<br />
You might also want to look into negative binomial regression, which is very applicable to rare events, though again it may not fit your criteria, as the details aren't provided. It allows for more variability than Poisson, but it's really just a matter of firing your data sets into R or RapidMiner to see which weapons in your war chest give you the best model.<br />
<br />
Good luck with your problem!
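
To make the Poisson suggestion in this thread more concrete, here is a minimal plain-Python sketch of a Poisson regression on grouped counts, with the log of the group size as the offset, followed by a Pearson-based overdispersion check. The data are invented for illustration and `fit_poisson_offset` is a hypothetical helper written for this sketch, not a library function; in practice you would more likely use a GLM routine in R or a similar package.

```python
import math

def fit_poisson_offset(x, y, n, iters=50, tol=1e-10):
    """Fit log(E[y]) = log(n) + b0 + b1*x by Newton-Raphson (one covariate)."""
    # Start from the pooled log event rate and a zero slope.
    b0 = math.log(sum(y) / sum(n))
    b1 = 0.0
    for _ in range(iters):
        mu = [ni * math.exp(b0 + b1 * xi) for xi, ni in zip(x, n)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # 2x2 Fisher information matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Invented grouped data: one row per covariate pattern.
x = [0, 1, 2, 3, 4, 5]                         # predictor value
trials = [5000, 4000, 6000, 3000, 5000, 4000]  # denominators
events = [12, 15, 30, 22, 51, 58]              # numerators (rare events)

b0, b1 = fit_poisson_offset(x, events, trials)
mu = [ni * math.exp(b0 + b1 * xi) for xi, ni in zip(x, trials)]

# Pearson chi-square divided by residual degrees of freedom.
# Values well above 1 suggest overdispersion (consider negative binomial).
pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(events, mu))
dispersion = pearson / (len(events) - 2)

print(f"b0={b0:.4f}  b1={b1:.4f}  dispersion={dispersion:.3f}")
```

A dispersion estimate near 1 is consistent with the Poisson assumption; a clearly larger value is the usual cue to scale the standard errors or move to a negative binomial model.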