# AnalyticBridge

A Data Science Central Community

I came across some speculation that the responder to non-responder (R:NR) ratio should decide which modeling technique to use. I haven't found any documentation or proof yet, so I thought I'd ask for feedback and comments.

Take three modeling scenarios:
We have three populations of 100K customers, each targeted by a different program.

Situation A - 5% have responded to a program of ours.
Situation B - Nearly 50% have responded.
Situation C - Greater than 70-80% have responded.
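To make the three situations concrete, here is a minimal Python sketch of the responder and non-responder counts and the resulting R:NR ratio. The rates for B and C (0.50 and 0.75) are assumed point values, since the post only says "nearly 50%" and "greater than 70-80%", and `class_counts` is a hypothetical helper name.

```python
POPULATION = 100_000

# Illustrative response rates. A is from the post; B and C are assumed
# point values standing in for "nearly 50%" and "greater than 70-80%".
RESPONSE_RATES = {"A": 0.05, "B": 0.50, "C": 0.75}


def class_counts(population, rate):
    """Return (responders, non_responders) for a given response rate."""
    responders = round(population * rate)
    return responders, population - responders


for name, rate in RESPONSE_RATES.items():
    r, nr = class_counts(POPULATION, rate)
    # The smaller class limits how much a model can learn about that group.
    print(f"Situation {name}: R={r}, NR={nr}, R:NR={r / nr:.2f}, "
          f"minority class size={min(r, nr)}")
```

Note that in Situation C it is the non-responders who form the minority class, so the imbalance question applies there too, just in the other direction.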

In each of the three scenarios, we can exploit the data to yield insights into what kind of customers our responders are. But the question is, does the response rate define what techniques we need to use?

For example, does only Situation A call for logistic regression, while B and C are unsuitable for it? Would CHAID IDTs be more suitable where the R:NR ratio is nearly equal, i.e. 50:50?

As far as I know, more data should help logistic regression produce a more robust model with better probability scores. So a logistic regression model should work in any of these scenarios, given good predictor variables, and should do better at 50:50 than at 5:95.

Thanks,
Arun


### Replies to This Discussion

I think this is a great question, but I'm not sure I would let the ratio dictate which technique to use, other than from a purely theoretical standpoint.

For a 50/50 split, my inclination is to start with a decision tree technique, only because I would assume that for regression the variance is at its maximum at 50/50 (a coin toss), which would mean a lot of work refining the model. But I would definitely look at the distributions first and let that dictate how to proceed. You do have a lot more flexibility in how you define the problem in logistic regression. If I had a lot of categorical variables I would lean more towards CHAID. But there is no reason why you couldn't combine both in the short term. I would probably end up presenting the results via logistic regression, since it is easier to interpret.

Again, good question.

-Ralph Winters

Thanks for your comment. I think I may not have understood the maximum variance part of it.

When you say that regression has maximum variance at 50:50, is that due to the theory that variance = np for a binomial distribution? In that case, I would assume that the greater the value of p, the greater the variance!
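On the formula itself: for a binomial count, np is the mean, while the variance is np(1 - p), which peaks at p = 0.5 and shrinks toward either extreme. A minimal Python sketch to check this numerically (the function names are hypothetical, and n = 100,000 is taken from the population size in the original question):

```python
def binomial_mean(n, p):
    """Expected number of responders out of n customers."""
    return n * p


def binomial_variance(n, p):
    """Variance of the responder count: n * p * (1 - p)."""
    return n * p * (1 - p)


n = 100_000
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p={p:.2f}  mean={binomial_mean(n, p):>9.1f}  "
          f"variance={binomial_variance(n, p):>9.1f}")
```

The mean keeps growing with p, but the variance is symmetric around 0.5, so a 5% and a 95% response rate carry the same per-customer variance.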

Also, how does variance affect logistic regression? And when you say refining, do you mean finally getting a good S-shaped logistic curve out of 'Y', where the output scores rank-order across deciles so as to show the best separation of Goods from Bads?
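For anyone unfamiliar with the decile check mentioned here, this is a rough Python sketch of one way to do it: sort customers by model score, cut them into ten groups, and see whether the response rate falls steadily from the top group to the bottom. The helper names and the toy data are made up for illustration and are not from any real model.

```python
def decile_response_rates(scores, labels, n_bins=10):
    """Sort by score (highest first), split into n_bins near-equal groups,
    and return the response rate of each group.
    Assumes len(scores) >= n_bins so that no group is empty."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rates = []
    for i in range(n_bins):
        group = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        rates.append(sum(label for _, label in group) / len(group))
    return rates


def is_rank_ordered(rates):
    """True if the response rate never increases from one decile to the next."""
    return all(a >= b for a, b in zip(rates, rates[1:]))


# Toy data: 100 customers with distinct scores; higher-scored customers
# respond more often by construction (label 1 = responder, 0 = non-responder).
scores = [i / 100 for i in range(100)]
labels = [1 if (i % 10) < (i // 10) else 0 for i in range(100)]

rates = decile_response_rates(scores, labels)
print(rates)
print("rank ordered:", is_rank_ordered(rates))
```

A model whose deciles rank-order cleanly like this is separating Goods from Bads; breaks in the ordering are usually the places that need refining.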

- Arun