A Data Science Central Community
Let's say you want to measure someone's success by the number of dates he/she managed to get:
Your odds of success (called the odds ratio) are defined as R = A/B.
In practice, B can be zero, e.g. if you are in a relationship and not interested in dating. This creates all sorts of problems, and usually the fix is to add 0.5 to both A and B. This fix is arbitrary and makes R size-dependent when B = 0 (sample size is n = A+B). This can be a desirable property sometimes, when comparing R in nodes of different sizes in a decision tree, and sometimes not. Building confidence intervals is also problematic.
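To make the size dependence concrete, here is a minimal Python sketch of the add-0.5 fix. The function name `smoothed_odds` and the parameter `c` are illustrative, not from the original post. When B = 0 the corrected value reduces to 2A + 1, so it grows with the sample size n = A.

```python
def smoothed_odds(a, b, c=0.5):
    """Odds with a constant c added to both counts (c = 0.5 is the usual fix)."""
    return (a + c) / (b + c)

# With B = 0, the result depends only on the node size n = A + B = A.
small_node = smoothed_odds(10, 0)    # (10 + 0.5) / 0.5
large_node = smoothed_odds(100, 0)   # (100 + 0.5) / 0.5
print(small_node, large_node)        # 21.0 201.0
```

Two nodes that are both "all successes" thus get very different values of R, which is the property described above as sometimes desirable and sometimes not.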
So what do you think of these two alternate statistics:
The mapping between R and P, as well as between R and Q, is one-to-one.
Although I like the fact that Q < 0 when A < B, I prefer P for two reasons: 1) statistics that range from 0 to 1 are more aesthetically pleasing (just my opinion) and 2) P approaches 1 "faster," which seems to be desirable in this case. Instead of adding an arbitrary constant, could you just skip R and define P to be 0 when A and B are both 0 (probably a graduate student)? The other cases (A = 0 and B > 0; A > 0 and B = 0; A > 0 and B > 0) make sense with the formula P = A/(A+B).
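The convention suggested above can be sketched in a few lines of Python. The function name `p_stat` is an illustrative choice. The only special case is A = B = 0, and for B > 0 the result agrees with R/(R+1).

```python
def p_stat(a, b):
    """P = A / (A + B), with P defined as 0 when A = B = 0."""
    n = a + b
    if n == 0:
        return 0.0  # convention from the text: nobody asked, so no success
    return a / n

# The four cases listed in the text: both zero, A = 0, B = 0, both positive.
for a, b in [(0, 0), (0, 5), (5, 0), (3, 1)]:
    print(a, b, p_stat(a, b))

# For B > 0, P agrees with R / (R + 1), where R = A / B.
r = 3 / 1
assert abs(p_stat(3, 1) - r / (r + 1)) < 1e-12
```

No arbitrary constant is needed, and the function never divides by zero.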
Calculating the odds of something that has no chance to happen is a "mindless application" of a numerical technique or measurement. Measurement should only be applied after (a) the concept to be measured has been operationalized and (b) exclusion/inclusion criteria have been defined for who is eligible/ideal, and who is not, for the event under consideration.
The issue with big data is that you compute odds on millions of data buckets (decision tree nodes, segments, multivariate features), so chances are high that some of these buckets contain unexpected stuff. Yet you don't want your algorithm to crash because of a division by zero.
... which calls for the role of a "data scientist" (or "data engineer") who should be able to analyze each instance of measurement and apply the concepts of "target population", "population" and (if applicable) "sample" to the measurement data. In addition, the data engineer should be able to implement efficient code (with exception handling where pertinent, etc.) to perform intelligent analysis (i.e., not just rely on COTS products to do the analysis).
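One way to reconcile the two comments above is to catch the division by zero per bucket and set the offending buckets aside for review, rather than crashing or silently patching them. This is only a sketch: the function `bucket_odds` and the bucket names are hypothetical.

```python
def bucket_odds(buckets):
    """Return (odds, skipped): R = A / B where B > 0, plus the keys set aside."""
    odds, skipped = {}, []
    for key, (a, b) in buckets.items():
        try:
            odds[key] = a / b
        except ZeroDivisionError:
            skipped.append(key)  # flag for inspection instead of crashing
    return odds, skipped

# Hypothetical buckets: (A, B) counts per decision tree node.
buckets = {"node_1": (90, 10), "node_2": (7, 0), "node_3": (0, 0)}
odds, skipped = bucket_odds(buckets)
print(odds)     # {'node_1': 9.0}
print(skipped)  # ['node_2', 'node_3']
```

What to do with the skipped buckets (drop them, smooth them, or report P instead) remains a judgment call for the analyst.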
If forced to choose, I will go with adding 0.5 to A and B in most cases, except when n is really small. The reason is that R is sensitive to higher values of A, which is good in my area of work and is preserved by adding a small number. For example, when A = 90 and B = 10, R = 9; a 10% increase in A (to A = 99 and B = 1, with n fixed at 100) swings R to 99. But this characteristic is lost in both P and Q.
Having said that, I distrust any rule that tells me 100%, so I will probably prune it...
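The arithmetic behind this example can be checked with a short Python sketch. It reads the 10% change as a move from (A, B) = (90, 10) to (99, 1) with n fixed at 100, which is the reading that gives R = 99. The form Q = (A - B)/(A + B) is an assumption: the thread does not spell Q out, and this form is merely consistent with Q < 0 when A < B.

```python
def stats(a, b):
    """Return (R, P, Q) for counts a, b with b > 0."""
    r = a / b
    p = a / (a + b)
    q = (a - b) / (a + b)  # ASSUMED form of Q, not given in the thread
    return r, p, q

r0, p0, q0 = stats(90, 10)
r1, p1, q1 = stats(99, 1)

# Relative change of each statistic for the same shift in counts.
print("R:", r0, "->", r1, "ratio", r1 / r0)
print("P:", p0, "->", p1, "ratio", p1 / p0)
print("Q:", q0, "->", q1, "ratio", q1 / q0)
```

R grows elevenfold, while P and the assumed Q change by roughly 10% and 22% respectively, which is the loss of sensitivity the comment refers to.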
For most purposes I set the numerator of 0 arbitrarily to 1, but in this case the meaning of the 0 in the denominator depends on the numerator. Compare the case where B=0 and A=0 with the case where B=0 and A=100. In the first case, adding .5 to each makes the odds ratio = 1, which does not make logical sense as the person is not trying. In passing, I would note that when A=B=0, the ranges of P and Q are not as neatly defined as you suggest. I would simply eliminate all cases where A=B=0.
In the second case, whether you add .5 to each or set B=1, it is not likely to change the relation of this odds ratio to those of others, and it will still be somewhat arbitrary. I would not have a preference between P and Q in this case. Also, while P=1 where A=100 and B=0, I don't quite see how it equals R/(R+1) because of the division-by-zero problem. I think you might have to stipulate that B is not equal to 0 for that to be true.
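The point about R/(R+1) can be made explicit with a short derivation, using only the definitions R = A/B and P = A/(A+B) from the thread. The identity holds exactly for B > 0 and only as a limit when B = 0 with A > 0:

```latex
\begin{aligned}
P &= \frac{A}{A+B} = \frac{A/B}{A/B + 1} = \frac{R}{R+1}, && B > 0, \\
\lim_{B \to 0^{+}} \frac{R}{R+1} &= \lim_{R \to \infty} \frac{R}{R+1} = 1 = \frac{A}{A+0}, && A > 0 .
\end{aligned}
```

So P = A/(A+B) is well defined at B = 0, while the expression R/(R+1) is not. The stipulation B > 0 is indeed needed for the exact equality.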
Am I missing something?
The variable P is the simple proportion of persons you managed to secure a date with (numerator) out of all the persons you wanted to date (denominator). The ratio of two proportions is known as the relative risk or the risk ratio and can be modelled using binomial regression (though the latter can sometimes have problems with convergence). Note that P can equal zero or one: it is not strictly BETWEEN zero and one because it includes both of these endpoints. P is not an odds ratio, normalized or not, because it is not a ratio of odds: its denominator is not an odds but the odds + 1, though I suppose that you can call it what you want.
I prefer Q. Why?
Because if you take the "Weight of Evidence" of Moore (which ranges over ]-infty,+infty[) and pass it through a sigmoid function (to get a range of ]-1,+1[), you obtain the Q indicator.
This way you keep the symmetry of this indicator (negative value = negative influence, positive value = positive influence).
You may also use it to measure the variables' contributions.
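This claim can be checked numerically under three assumptions that the thread does not state: the weight of evidence is taken in the simplified form ln(A/B), the sigmoid with range ]-1,+1[ is tanh(x/2), and Q = (A - B)/(A + B). With those choices the two expressions coincide, as this Python sketch shows (function names are illustrative):

```python
import math

def q_from_counts(a, b):
    """ASSUMED form of Q: (A - B) / (A + B)."""
    return (a - b) / (a + b)

def q_from_woe(a, b):
    """Simplified weight of evidence ln(A / B), squashed into (-1, 1) by tanh(x / 2)."""
    woe = math.log(a / b)
    return math.tanh(woe / 2)

# The two routes agree, and the sign flips when A and B are swapped.
for a, b in [(90, 10), (10, 90), (50, 50)]:
    assert math.isclose(q_from_counts(a, b), q_from_woe(a, b), abs_tol=1e-12)
    print(a, b, q_from_counts(a, b))
```

The agreement follows from tanh(ln(R)/2) = (R - 1)/(R + 1) with R = A/B, and it requires A > 0 and B > 0 for the logarithm to be defined.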