# AnalyticBridge

A Data Science Central Community

# two-stage models to improve response prediction

Hi,
I am using PROC LOGISTIC to build a customer response model that predicts the probability that each of our customers will respond to our advertisement. However, customers with higher predicted probabilities do not necessarily produce higher sales amounts. I heard that I should use PROC LOGISTIC and PROC REG to build a two-stage model to address this problem. Does anyone have experience doing this? Please share how you do it. Thanks.

### Replies to This Discussion

Hi,
The SAS instructor whose course I took gave me an idea of how to do it. Basically, you have to build two models: a logistic regression model that predicts the probability of each customer responding to your advertisement, and a linear regression model that predicts the sales amount of each customer. The trick is that when you build the logistic regression model, you have to include every customer regardless of whether he/she responded, whereas for the linear regression model you include only the customers who actually responded to your campaign with a positive sales amount. After that, you just multiply the predicted probability by the predicted sales amount and rank/score your customers by the product of the two quantities.
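The scoring step can be sketched in a few lines. This is a minimal illustration in Python rather than SAS, and it assumes the two models have already been fit: the customer IDs, the field names `p_respond` and `pred_sales`, and all the numbers are hypothetical stand-ins for the outputs of the logistic and linear models.

```python
# Hypothetical model outputs for four customers. In practice p_respond would
# come from the logistic model (fit on ALL customers) and pred_sales from the
# linear model (fit only on responders with a positive sales amount).
customers = [
    {"id": "A", "p_respond": 0.60, "pred_sales": 20.0},
    {"id": "B", "p_respond": 0.30, "pred_sales": 100.0},
    {"id": "C", "p_respond": 0.10, "pred_sales": 150.0},
    {"id": "D", "p_respond": 0.45, "pred_sales": 40.0},
]


def expected_sales(customer):
    """Score = predicted response probability * predicted sales amount."""
    return customer["p_respond"] * customer["pred_sales"]


def rank_by(rows, key):
    """Return customer ids sorted from highest to lowest by the given key."""
    return [row["id"] for row in sorted(rows, key=key, reverse=True)]


by_probability = rank_by(customers, key=lambda c: c["p_respond"])
by_expected_sales = rank_by(customers, key=expected_sales)

print("Ranked by response probability:", by_probability)
print("Ranked by expected sales:      ", by_expected_sales)
```

With these made-up numbers the two rankings disagree: customer A is the most likely responder but has the lowest expected sales, which is the situation described in the original question.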
Weight the solution by the sales amount
Could you give me more details? Thanks.
Yi-Chun's response is a fairly common approach (and a good description).

The problem is that it assumes that the amount of the sale is independent of the probability of responding to the advertisement.

If this assumption doesn't make sense then you might consider a "Heckit" (Heckman) model. It's like the two stage approach described by Yi-Chun but when you build the regression model you include a term (the "Inverse Mills Ratio") that comes from the Logistic (technically, a 'Probit') model.
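For reference, the Inverse Mills Ratio for a responder is the standard normal density divided by the standard normal CDF, both evaluated at that customer's probit index (the linear predictor from the first-stage probit). A minimal sketch in Python, using only the standard library; the function names and the example index values are illustrative assumptions, not output from any fitted model:

```python
import math


def normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills_ratio(z):
    """phi(z) / Phi(z), where z is the first-stage probit index of a responder."""
    return normal_pdf(z) / normal_cdf(z)


# Hypothetical probit index values for three responders.
for z in (-1.0, 0.0, 1.0):
    print(f"index={z:+.1f}  IMR={inverse_mills_ratio(z):.4f}")
```

In the second stage, this value is computed for each responder and added as an extra regressor in the sales regression. The ratio gets larger as the probit index gets smaller, so customers who were unlikely to respond but did so anyway receive the largest correction.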

The method is described in detail in Maddala's (excellent) "Limited Dependent Variables" book, but it's also covered in Greene's "Econometrics" (and there's almost always a copy of Greene in someone's office/cubicle).

Note: I prefer the "Heckit" to the "Tobit", particularly when I believe (and I think this is your case) that the sales amount is NEGATIVELY correlated with the response.
Hi, Mark:
Do you mean that if the independence assumption does not hold, then I could include the estimated probability from the logistic regression in my linear regression model? If that is what you meant, then it makes sense to me to do just that. By the way, which book do you suggest I read in detail?
One way would be to tag the customers who respond AND have high sales as 1's in your logistic model. Alternatively, build two logistic models, one to identify responders and one to identify high-spending responders; then do a cross-tab of the two and set the strategy.
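The cross-tab of the two models might look something like the sketch below. It is in Python with made-up scores; the field names `p_respond` and `p_high_spend` and the 0.5 cutoff are assumptions for illustration only, and in practice the cutoffs would be chosen from your own data.

```python
from collections import Counter

# Hypothetical scores from the two separate logistic models.
scores = [
    {"id": 1, "p_respond": 0.70, "p_high_spend": 0.80},
    {"id": 2, "p_respond": 0.65, "p_high_spend": 0.20},
    {"id": 3, "p_respond": 0.15, "p_high_spend": 0.75},
    {"id": 4, "p_respond": 0.10, "p_high_spend": 0.10},
    {"id": 5, "p_respond": 0.90, "p_high_spend": 0.60},
    {"id": 6, "p_respond": 0.55, "p_high_spend": 0.30},
]

CUTOFF = 0.5  # assumed threshold for calling a score "high"


def segment(customer):
    """Return (likely responder?, likely high spender?) for one customer."""
    return (customer["p_respond"] >= CUTOFF, customer["p_high_spend"] >= CUTOFF)


# Count customers in each of the four cells of the 2x2 cross-tab.
crosstab = Counter(segment(c) for c in scores)

print("responder  high_spender  count")
for responder in (True, False):
    for high_spender in (True, False):
        print(f"{responder!s:<10} {high_spender!s:<13} {crosstab[(responder, high_spender)]}")
```

Each cell can then get its own treatment, for example prioritizing the customers flagged by both models.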

With Best Regards,

--
Anunay Gupta
Co-founder & Head of Analytics, Marketelligent
Hi Yi-Chun,

Yes, it's pretty much that simple, and makes that much sense. I'd say Greene's "Econometrics" is a good place to start, and if you really want a deep dive, then Maddala's book. You might just search on "Heckit" or "Heckman" and "model" to find what you need. A more general search term would be "instrumental variable".

Regards,

Mark
Hi,

I agree with the Heckman model strategy, as it is precisely meant to take care of the fact that we model the spend (or any other variable) given that the customer has taken step 1, that is, responded.

Thanks
M