# AnalyticBridge

A Data Science Central Community

Let's say you want to measure someone's success by the number of dates he/she managed to get:

• A is the number of persons you wanted to date (in a given year) and with whom you managed to secure a date
• B is the number of persons you wanted to date but failed to secure a date

Your odds of success (called the odds ratio) are defined as R = A/B.

In practice, B can be zero, e.g. if you are in a relationship and not interested in dating. This creates all sorts of problems, and the usual fix is to add 0.5 to both A and B. This fix is arbitrary, and it makes R size-dependent when B = 0 (the sample size being n = A+B). That can be a desirable property when comparing R across nodes of different sizes in a decision tree, and undesirable otherwise. Building confidence intervals is also problematic.

So what do you think of these two alternate statistics:

• P = A/(A+B) = R/(R+1) which is always between 0 and 1
• Q = (A-B)/(A+B) = (R-1)/(R+1) which is always between -1 and +1

The mapping between R and P, as well as between R and Q, is one-to-one.
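A minimal sketch of the three statistics (Python, with hypothetical counts; the treatment of B = 0 and the 0.5 correction follow the description above):

```python
def odds_stats(A, B):
    """Return (R, P, Q) for counts A (dates secured) and B (dates failed).

    R = A/B is undefined when B = 0; P and Q only fail when A = B = 0.
    """
    n = A + B
    if n == 0:
        raise ValueError("A + B must be positive")
    P = A / n                # always in [0, 1]
    Q = (A - B) / n          # always in [-1, +1]
    R = A / B if B > 0 else float("inf")
    return R, P, Q

def corrected_odds(A, B):
    """The arbitrary 0.5 correction mentioned above: finite even when B = 0,
    but the result then depends on the size of A."""
    return (A + 0.5) / (B + 0.5)
```

For example, `odds_stats(3, 1)` gives R = 3, P = 0.75 = R/(R+1), Q = 0.5 = (R-1)/(R+1), illustrating the one-to-one mappings; and `corrected_odds(5, 0)` versus `corrected_odds(10, 0)` shows the size dependence of the correction when B = 0.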


### Replies to This Discussion

By the way, if you are a guy interested in dating girls, an easy trick is to attend classes that are traditionally for girls - cooking classes, for instance - or to visit stores that girls frequent (Victoria's Secret). This is the opposite of what most people do, yet I think it's actually an analytic strategy. Wine bars at airports, at least in the US, tend to be frequented more by (good-looking and smart) girls. A nice way to get a conversation started is to order a rare, expensive champagne, for instance at Vino Volo in San Francisco airport, terminal 1 (if you fly with Virgin). Even if you are very shy, it will work.

Although I like the fact that Q < 0 when A < B, I prefer P for two reasons: 1) statistics that range from 0 to 1 are more aesthetically pleasing (just my opinion), and 2) P approaches 1 "faster," which seems to be desirable in this case. Instead of adding an arbitrary constant, could you just skip R and define P to be 0 when A and B are both 0 (probably a graduate student)? The other cases (A = 0 and B > 0; A > 0 and B = 0; A > 0 and B > 0) make sense with the formula P = A/(A+B).
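The piecewise definition proposed here (P = 0 when A = B = 0, P = A/(A+B) otherwise) is a one-liner; a minimal sketch:

```python
def p_stat(A, B):
    """P = A/(A+B), with the convention P = 0 when A = B = 0
    (no dates sought, none secured)."""
    return 0.0 if A + B == 0 else A / (A + B)
```

All four cases behave as described: `p_stat(0, 0)` and `p_stat(0, 3)` are 0, `p_stat(4, 0)` is 1, and `p_stat(3, 1)` is 0.75, with no division by zero anywhere.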

Calculating the odds of something that has no chance of happening is a "mindless application" of a numerical technique or measurement. Measurement should only be applied after (a) the concept to be measured has been operationalized and (b) inclusion/exclusion criteria have been defined for who is eligible/ideal for the event under consideration and who is not.

The issue with big data is that you compute odds on millions of data buckets (decision tree nodes, segments, multivariate features), so chances are high that some of these buckets contain unexpected stuff. Yet you don't want your algorithm to crash because of a few divisions by zero.
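One way to keep a vectorized computation from crashing on such buckets (a sketch with NumPy; the counts are made up) is to mask the empty buckets rather than divide blindly:

```python
import numpy as np

# Hypothetical counts for a few buckets; in practice there may be millions,
# and some will have B = 0 or even A = B = 0.
A = np.array([90.0, 5.0, 0.0, 0.0])
B = np.array([10.0, 0.0, 7.0, 0.0])
n = A + B

# np.divide with `where=` skips the empty buckets instead of crashing
# or emitting NaN; they default to 0 via the `out=` array.
P = np.divide(A, n, out=np.zeros_like(A), where=n > 0)
Q = np.divide(A - B, n, out=np.zeros_like(A), where=n > 0)
```

The defaulting of empty buckets to 0 is of course a convention; the point is only that the empty buckets are handled explicitly rather than left to crash the pipeline.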

... which calls for the role of a "data scientist" (or "data engineer") who should be able to analyze each instance of measurement and apply the concepts of "target population", "population", and (if applicable) "sample" to the measurement data... In addition, the data engineer should be able to implement efficient code (with exception handling when pertinent, etc.) to perform intelligent analysis (i.e., not just rely on COTS products to do the analysis).

If I am forced to choose, I will go with adding 0.5 to A and B in most cases, except when n is really small. The reason is that R is sensitive to higher values of A, which is good in my area of work, and this sensitivity is preserved by adding a small constant. For example, when A = 90 and B = 10, R = 9; a 10% change in A swings R to 99. This characteristic is lost in both P and Q.
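The sensitivity claim is easy to check numerically (holding n = 100 fixed, a hypothetical example):

```python
def R(A, B): return A / B
def P(A, B): return A / (A + B)
def Q(A, B): return (A - B) / (A + B)

# n = 100 held fixed: moving A from 90 to 99 (so B drops from 10 to 1)
# is a 10% change in A, but an 11x swing in R.
r0, r1 = R(90, 10), R(99, 1)   # 9.0 -> 99.0
p0, p1 = P(90, 10), P(99, 1)   # 0.90 -> 0.99
q0, q1 = Q(90, 10), Q(99, 1)   # 0.80 -> 0.98
```

So R amplifies the change near the A = n boundary, while P and Q, being bounded, compress it.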

Having said that, I distrust any rule that tells me 100%, so I will probably prune it...

For most purposes I set a denominator of 0 arbitrarily to 1, but in this case the meaning of the 0 in the denominator depends on the numerator. Consider the case where B = 0 and A = 0 versus B = 0 and A = 100. In the first case, adding 0.5 to each makes the odds ratio equal to 1, which does not make logical sense, as the person is not trying. In passing, I would note that when A = B = 0, the ranges of P and Q are not as neatly defined as you suggest. I would simply eliminate all cases where A = B = 0.

In the second case, whether you add 0.5 to each or set B = 1, it is not likely to change the relation of this odds ratio to those of others, and it will still be somewhat arbitrary. I would not have a preference between P and Q in this case. Also, while P = 1 where A = 100 and B = 0, I don't quite see how it equals R/(R+1), because of the division-by-zero problem. I think you might have to stipulate that B is not equal to 0 for that to be true.

Am I missing something?

The variable, P, is the simple proportion of persons you managed to secure a date (numerator) of all the persons you wanted to date (denominator).  The ratio of two proportions is known as the relative risk or the risk ratio and can be modelled using binomial regression (though the latter can sometimes have problems with convergence).  Note that P can include zero or one and is not strictly BETWEEN zero and one because it includes both of these endpoints.  P is not an odds ratio because it is not a ratio of odds, normalized or not, because its denominator is not an odds but the odds + 1, though I suppose that you can call it what you want.
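As a concrete illustration of the ratio-of-proportions point, here is a sketch (hypothetical counts, not from the thread) of a risk ratio with the standard large-sample Wald confidence interval on the log scale:

```python
import math

# Hypothetical counts for two segments: A = dates secured, B = dates failed.
a1, b1 = 30, 70      # segment 1: P1 = 0.30
a2, b2 = 15, 85      # segment 2: P2 = 0.15

p1 = a1 / (a1 + b1)
p2 = a2 / (a2 + b2)
rr = p1 / p2         # relative risk (risk ratio) of segment 1 vs segment 2

# Standard large-sample 95% Wald interval on the log scale
se = math.sqrt(1 / a1 - 1 / (a1 + b1) + 1 / a2 - 1 / (a2 + b2))
ci = (math.exp(math.log(rr) - 1.96 * se),
      math.exp(math.log(rr) + 1.96 * se))
```

Note the same caveat applies here as to R itself: the log-scale interval breaks down when any cell count is 0, which is exactly the situation the original post is about.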

I prefer Q. Why?

Because if you take the "Weight of Evidence" of Moore (which ranges over ]-infty, +infty[) and pass it through a sigmoid function (to obtain a range of ]-1, +1[), you get the Q indicator.

This way you keep the symmetry of the indicator (negative value = negative influence, positive value = positive influence).

And you may also use it to measure the variables' contribution.
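If we take the weight of evidence here to mean the log-odds, WoE = ln(A/B) (an assumption on my part), then the sigmoid in question is the hyperbolic tangent of half the WoE, via the identity tanh(ln(R)/2) = (R-1)/(R+1) = Q. A sketch under that assumption:

```python
import math

def q_from_counts(A, B):
    """Q computed directly from the counts."""
    return (A - B) / (A + B)

def q_from_woe(A, B):
    """Q recovered from the weight of evidence, assuming WoE = ln(A/B):
    tanh(ln(R)/2) = (R - 1)/(R + 1) = Q."""
    woe = math.log(A / B)          # log-odds, in ]-infty, +infty[
    return math.tanh(woe / 2)      # tanh maps it onto ]-1, +1[
```

The symmetry property mentioned above falls out directly: swapping A and B negates the WoE, and tanh is an odd function, so Q flips sign.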