# AnalyticBridge

A Data Science Central Community

# Test for Difference in Proportions - T Test? Proc GLM?

Hi,

I am coming across a lot of cases where people are talking about, or using, the t-test when they are comparing campaign response rates, membership renewal rates, etc.

My understanding is that the t-test is not appropriate for such cases, and that people are confusing it with the two-sample test for proportions (which uses the z statistic). Personally, I think the chi-square test and its related tests (Fisher's exact test, McNemar's test) are more appropriate for testing differences in proportions/ratios.

There is one case where people have actually used Proc GLM to test the difference in renewals among customer groups reached through different communication channels. So how can Proc GLM or the t-test, which are meant for continuous dependent variables, be used for comparing proportions/ratios?

Thanks
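
For reference, the two-sample z test for proportions and the chi-square test mentioned in the question are closely linked: on a 2x2 table, the Pearson chi-square statistic (without continuity correction) equals the square of the pooled z statistic. The sketch below illustrates this with made-up renewal counts; the function names and the counts are assumptions for illustration, not something taken from the thread.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 + n2) - (x1 + x2)]
    total = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical counts: 120 renewals out of 1000 vs 90 renewals out of 1000
z = two_proportion_z(120, 1000, 90, 1000)
chi2 = chi_square_2x2(120, 1000, 90, 1000)
print(f"z = {z:.4f}, z squared = {z * z:.4f}, chi-square = {chi2:.4f}")
```

So for two independent groups the two tests give the same two-sided answer; Fisher's exact test is the usual alternative when counts are small, and McNemar's test applies to paired data.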


### Replies to This Discussion

Student's t test is so widely used for the simple reason that it is the only test that many people know. This does not mean that it is appropriate in all these cases. This misuse is so widespread, even in refereed journals, that it tends to become legitimised and propagated.

We need to encourage the use of more appropriate methods wherever possible.
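
One possible reason the practice survives with large campaign samples: if each customer is coded 1 (responded) or 0 (did not), the Welch t statistic on that 0/1 data is numerically very close to the unpooled two-proportion z statistic. The sketch below shows this under assumed, made-up counts; it is not an endorsement of the t-test here, and with small samples or rare responses exact or proportion-specific tests remain the safer choice.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic using sample variances (n - 1 denominator)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def unpooled_z(a, b):
    """Two-proportion z statistic with an unpooled standard error."""
    pa, pb = mean(a), mean(b)
    se = sqrt(pa * (1 - pa) / len(a) + pb * (1 - pb) / len(b))
    return (pa - pb) / se


# Hypothetical 0/1 outcomes: 120 of 1000 responded vs 90 of 1000
group_a = [1] * 120 + [0] * 880
group_b = [1] * 90 + [0] * 910

t_stat = welch_t(group_a, group_b)
z_stat = unpooled_z(group_a, group_b)
print(f"t = {t_stat:.4f}, z = {z_stat:.4f}")
```

The two statistics differ only because the sample variance divides by n - 1 rather than n, so the gap shrinks as the groups grow.
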
I love statistics and always look for ways to incorporate them in my everyday work, but... why would one need a statistical test to compare campaign response rates?
I mean, a test will tell you whether one rate is statistically significantly different from the other, but shouldn't response rate "significance", importance, etc. ultimately be a business decision?

For example, to simplify things, if I'm a decision maker I couldn't care less whether my response rate is statistically different from others. I want to know what that response means for my bottom line given the investment, and how the incremental difference between campaigns can be quantified in terms of \$\$\$ and next steps for improvement.
If I'm an analyst, or even more so a marketer, I'd rather focus on that aspect of my analysis than on the statistical significance of tests applied to response rates.

I couldn't care less if XYZ test says my new campaign response rate of 1.2% is not statistically different from the old one of 0.9%. What I care about is that the 0.3% incremental reactivation means that I will potentially make \$1,000,000 for the company.
The difference in response you mention (1.2% vs 0.9%) could be a fluke of random events, and you'd be gambling your company's money. Statistics will give you more confidence in making your decision. If you have two segments of 5000 consumers with these response rates and you perform a z test, you'll find that the confidence level of the difference is 91.5%. This at least tells you that you've decided to interpret the difference as significant while there's still an 8.5% chance that the difference is caused by randomness.
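
The z test described above can be sketched as follows. This is a minimal version assuming 60 and 45 responders (1.2% and 0.9% of 5000), a pooled standard error, and the normal approximation; the variable names are illustrative. The reported "confidence" depends on whether the test is one-sided or two-sided and on how the standard error is computed, so the output may differ somewhat from the 91.5% quoted above.

```python
from math import sqrt
from statistics import NormalDist

# Assumed counts matching 1.2% and 0.9% response in segments of 5000
x1, n1 = 60, 5000
x2, n2 = 45, 5000

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)

# Standard error of the difference under H0: p1 == p2
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# p-values from the standard normal distribution
p_one_sided = 1 - NormalDist().cdf(z)
p_two_sided = 2 * p_one_sided

print(f"z = {z:.3f}")
print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```

With these inputs z comes out at roughly 1.47. Strictly, the p-value is the probability of seeing a difference at least this large if the two true response rates were equal, which is not quite the same as the probability that the observed difference is due to chance.
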
I suppose we have a misunderstanding here.
My example given above pertains to a "main" mailing of millions of pieces of mail, not a test of 5000 or so.
A 0.3% response increase in those circumstances can indeed have a monetary impact on the bottom line, future customer retention efforts, ROI, etc.
I can see the logic you describe being applied in a test/control group setting, though.
"Statistically significant" simply means that the difference you have seen is unlikely to have occurred purely by chance and is therefore likely to be a "real" effect. If the result is not statistically significant then, however attractive the potential profit might look from the pilot study, you simply don't know whether you will get a similar result when you make your investment. Unless you have calculated statistical confidence limits for the expected improvement (giving best and worst case results) you could just as easily see a drop in profit once you go live in a larger market.

In an ideal world, you would ensure via proper statistical sample size calculation that a difference that is important from a business perspective coincides with one that is statistically significant.
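
Both ideas in this reply, confidence limits for the improvement and a sample size chosen up front, can be sketched with the normal approximation. The code below is a rough illustration under stated assumptions: a Wald interval with an unpooled standard error, a common textbook sample-size formula for two equal groups, a two-sided 5% significance level and 80% power. The function names, counts and settings are illustrative, not from the thread.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def diff_confidence_interval(x1, n1, x2, n2, level=0.95):
    """Wald interval for p1 - p2 (normal approximation, unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group to detect p1 vs p2 with a two-sided test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# Assumed pilot: 60 of 5000 (1.2%) vs 45 of 5000 (0.9%)
low, high = diff_confidence_interval(60, 5000, 45, 5000)
print(f"95% interval for the lift: {low:.4%} to {high:.4%}")

# How many customers per group would make a 0.3 point lift detectable?
n_needed = sample_size_per_group(0.012, 0.009)
print(f"approximate sample size per group: {n_needed}")
```

With a pilot of 5000 per group the interval spans zero (worst case a small drop, best case a lift of about 0.7 points), while the sample-size formula suggests roughly 18,000 customers per group would be needed for a 0.3 point lift to be reliably detected under these settings.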
I certainly appreciate your perspective and agree with many things said.

I'll also add that in my opinion there is a danger of relying on the statistical tests too much for a number of reasons.
Here are a couple I could think of off the top of my head:

1) As the sample size increases, the likelihood of a statistical test being significant increases as well (law of large numbers).
In DM you're often dealing with huge samples, which will tend to produce statistically significant tests more often than they really should.

2) Statistical tests have quite a few assumptions and in most "real world application" cases they are violated.
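
Point 1) can be illustrated numerically. For a fixed pair of response rates, the standard error of the difference shrinks in proportion to one over the square root of n, so the z statistic grows with the square root of n. The rates (1.01% vs 1.00%) and group sizes below are made up purely for illustration.

```python
from math import sqrt


def z_for_fixed_rates(n, p1=0.0101, p2=0.0100):
    """Pooled z statistic for two groups of size n with fixed observed rates."""
    pooled = (p1 + p2) / 2  # equal group sizes, so the pooled rate is the average
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    return (p1 - p2) / se


sizes = [10_000, 1_000_000, 100_000_000]
z_values = [z_for_fixed_rates(n) for n in sizes]

for n, z in zip(sizes, z_values):
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n per group = {n:>11,}: z = {z:6.3f} ({verdict} at the 5% level)")
```

The same 0.01 point difference moves from far below the usual 1.96 cutoff to far above it purely because n grew, which is why effect size and business value need to be judged alongside the test result.
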
I agree with both points, especially with your remarks above about 'millions of cases' in mind. But even in that case it might make sense to use 'statistical techniques' like modelling on different samples, because the response to a mailing can have a random factor, and if you model on all the data you are likely to miss some key segments. Using several models from different samples together (and maybe even different types of models) might give a more balanced propensity score. Using multiple samples also removes exactly the problem you describe in 1) with millions of cases in techniques like CHAID, C&RT, logistic regression and other models based on statistics.
OK, now you're talking propensity scores, and again I 100% agree the algorithms you mention are useful for that purpose.

However, I don't see why one would need to build a new predictive model in order to evaluate campaign results (i.e. response rate). Usually people build and apply models prior to launching campaigns.

I can understand why one would use the results of an already developed and deployed model after the campaign is finished, to compare actual results with model output such as response rates by propensity decile or cluster.
I'm not sure why someone would need to build a new predictive model after the campaign is over in order to evaluate that campaign.

Maybe I misunderstood something.
There are several reasons:
1. You've intervened in the process. For example, when you build a model for a retention campaign you've selected the customers most likely to leave and made them an offer to stay. Therefore you've created your own 'false positives'. You will need to look at the response to refine your model for next time and improve both your churn prediction and offer acceptance models.
2. The world has changed. New competitors or competitive offers may have influenced the responses, and you want to make sure you identify these changes asap.
3. In each model you run the risk of having 'false negatives'. You want to make sure for next time that you keep monitoring the performance of your model and either have a model refresh or a champion-challenger approach. You may even include a random sample of non-selected targets to keep an eye on possible new opportunities.
4. More and more companies first send out pilot campaigns to test different factors that influence response, like message, offer, creative, etc. These test campaigns are run specifically for building models afterwards.
It is certainly true that a blind application and reliance on statistical methods without a good understanding of how to interpret the results can be dangerous - but this is not a failing of the methods. In my experience it is much more common for people to neglect the use of proper statistical techniques than to have too much reliance on them.

1. Yes, it is true that with enough observations, even the smallest of effects will become statistically significant, but that is not a reason to dismiss the use of statistical methods in these cases. If you have a very large data set (lucky you) and you find a significant effect that is too small to be of any business value, the interpretation of the result is still the same - that it is unlikely to have come about purely by chance. You have simply found a very small, but still apparently real, effect - no harm done because you (the intelligent practitioner) know that it is too small to be important (though it might still be of interest).

At the other end of the scale, there are those effects that are so large as to be obviously both significant and important without the use of any statistical tests, regardless of sample size - again, lucky you!

The real value of statistical methods comes in providing an objective criterion to assess all of those in-between cases which, due either to small relative sample size or large background variability, are difficult to assess and where lack of knowledge or faulty human intuition can mislead you into making bad decisions.

2. Of course the strict assumptions behind the tests are often violated but this does not necessarily mean that the results are worthless - just that they have to be interpreted with caution. Many methods are quite robust to these violations (within limits) and there are usually other alternatives (e.g. models based on non-Normal distributions, non-parametric methods, etc.) that can be used to verify or refine the results before using them to support mission-critical decisions.

The bottom line...

Statistical methods are simply powerful tools to aid understanding and decision-making. They are not an excuse to turn off the brain. Just like any other power tools, they require training, skill, experience and care to get the best results. Without them, however, you are left with the old 'hand tools' of guesswork, intuition, and trial and error - I know which I prefer.
"Of course the strict assumptions behind the tests are often violated but this does not necessarily mean that the results are worthless"

Sure, but that 95% confidence that one provides, hoping to give the results a little more backbone, loses its charm once one has to communicate it with such caution.

:)

"Statistical methods are simply powerful tools to aid understanding and decision-making. They are not an excuse to turn off the brain"

Couldn't agree more with that statement.
Good points, Matt. If a difference in a statistic between two groups is found to be "statistically significant", it simply means that, based on sample size, variation, and the value of the measured statistic, the difference you have seen is not likely to have occurred by chance. What is actually being tested in this case is the hypothesis that p1-p2=0 (where p1 and p2 are the response rates for the two groups). To become even more useful in making business decisions, the hypothesis being tested can be changed to test whether the difference between the two groups is greater than a certain threshold.

For example, perhaps you have lowered a monthly rate from \$21.99 to \$19.99 for a group of customers, and you know that customer retention rates need to increase by 10% (in real terms) for the lower rate to make business sense. In that case, instead of testing the hypothesis that p1-p2=0, you would want to test the hypothesis that p2-p1>.1. I'm just making this point to show that you are not limited to testing whether or not the difference in a statistic between two groups is zero.
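
A threshold test like this can be sketched by subtracting the required margin from the observed difference before dividing by the standard error. The example below is a minimal version assuming a one-sided z test with an unpooled standard error, testing H0: p2 - p1 <= 0.10 against H1: p2 - p1 > 0.10; the retention counts and names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical retention counts: old price (group 1) vs lowered price (group 2)
x1, n1 = 1200, 2000   # 60% retained at $21.99
x2, n2 = 1460, 2000   # 73% retained at $19.99
margin = 0.10         # required lift in retention (absolute, 10 points)

p1, p2 = x1 / n1, x2 / n2

# Unpooled SE: under H0 the two rates are not assumed equal
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# How far the observed lift exceeds the margin, in standard errors
z = ((p2 - p1) - margin) / se
p_value = 1 - NormalDist().cdf(z)

print(f"observed lift = {p2 - p1:.3f}, z = {z:.3f}, one-sided p = {p_value:.4f}")
```

Here the observed lift of 13 points is tested against the 10 point requirement rather than against zero, so a small p-value supports the claim that the lift clears the business threshold, not merely that some lift exists.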