A Data Science Central Community
An updated version with source code and detailed explanations can be found here.
If observations from a specific experiment (for instance, scores computed on 10 million credit card transactions) are assigned a random bin ID (labeled 1, ..., k), then you can easily build a confidence interval for any proportion or score computed on these k random bins, using the Analyticbridge theorem.
The proof of this theorem relies on complicated combinatorial arguments and the use of the Beta function. Note that the final result does not depend on the distribution associated with your data - in short, your data does not have to follow a Gaussian (a.k.a. normal) or any other prespecified statistical distribution for the confidence intervals to be valid. You can find more details regarding the proof of the theorem in the book Statistics of Extremes by E.J. Gumbel, pages 58-59 (Dover edition, 2004).
Parameters in the Analyticbridge theorem can be chosen to achieve the desired level of precision - e.g., a 95%, 99% or 99.5% confidence interval. The theorem will also tell you what your sample size should be to achieve a pre-specified accuracy level. This theorem is a fundamental result for computing simple, per-segment, data-driven, model-free confidence intervals in many contexts, in particular when generating predictive scores produced via logistic / ridge regression or decision trees / hidden decision trees (e.g. for fraud detection, consumer or credit scoring).
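The relationship between the parameters and the confidence level can be sketched in a few lines of Python. This is only a sketch under an assumption: the post does not state the theorem's formula, so the code assumes the standard order-statistic form, in which the interval runs from the m-th smallest to the m-th largest of the k bin values and has level 1 - 2m/(k+1). That form is consistent with the 98% quoted for k = 99 and m = 1 in the application below. The function names are made up for this illustration.

```python
from fractions import Fraction
import math


def confidence_level(k: int, m: int) -> Fraction:
    """Level of the interval [p(m), p(k-m+1)] under the ASSUMED form 1 - 2m/(k+1)."""
    if m < 1 or 2 * m > k:
        raise ValueError("need 1 <= m and 2*m <= k")
    return 1 - Fraction(2 * m, k + 1)


def bins_needed(level: Fraction, m: int = 1) -> int:
    """Smallest number of random bins k whose interval reaches at least `level`."""
    # Solve 1 - 2m/(k+1) >= level for k, then round up to a whole number of bins.
    return math.ceil(Fraction(2 * m) / (1 - level)) - 1


print(confidence_level(99, 1))           # 49/50, i.e. 98%
print(bins_needed(Fraction(95, 100)))    # 39
print(bins_needed(Fraction(99, 100)))    # 199
print(bins_needed(Fraction(995, 1000)))  # 399
```

Exact fractions are used so that round confidence levels such as 95% do not suffer from floating-point rounding when solving for k.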
Application:
A scoring system designed to detect customers likely to fail on a loan is based on a rule set. On average, the probability that an individual customer fails is 5%. Consider a data set with 1 million observations (customers) and several metrics such as credit score, amount of debt, salary, etc. If we randomly select 99 bins, each containing 1,000 customers, the 98% confidence interval (per bin of 1,000 customers) for the failure rate is (say) [4.41%, 5.53%], based on the Analyticbridge theorem, with k = 99 and m = 1 (read the theorem to understand what k and m mean - the meaning of these parameters is actually very easy to understand).
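The procedure can be sketched in Python as follows. The sketch makes several assumptions that are not in the post: the loan data is replaced by synthetic outcomes in which each customer fails independently with probability 5%, the interval bounds are taken to be the m-th smallest and m-th largest bin failure rates, and the function name random_bin_interval is made up for illustration. Because the data is synthetic, the bounds it prints will not match the illustrative [4.41%, 5.53%].

```python
import random


def random_bin_interval(outcomes, k=99, bin_size=1000, m=1, seed=0):
    """Return (lower, upper, rates) for k random, non-overlapping bins.

    lower and upper are the m-th smallest and m-th largest bin failure
    rates; rates is the sorted list of all k bin failure rates.
    """
    rng = random.Random(seed)
    # Sampling indices without replacement, then slicing, yields k disjoint
    # bins whose members are chosen at random.
    chosen = rng.sample(range(len(outcomes)), k * bin_size)
    rates = sorted(
        sum(outcomes[i] for i in chosen[b * bin_size:(b + 1) * bin_size]) / bin_size
        for b in range(k)
    )
    return rates[m - 1], rates[k - m], rates


# Synthetic stand-in for the loan data (an assumption): 1 = failed, 0 = did not fail.
data_rng = random.Random(42)
outcomes = [1 if data_rng.random() < 0.05 else 0 for _ in range(1_000_000)]

lower, upper, rates = random_bin_interval(outcomes)
print(f"98% interval for the per-bin failure rate: [{lower:.2%}, {upper:.2%}]")
```

With m = 1 the bounds are simply the smallest and largest of the 99 bin rates; a larger m trims more bins from each end and gives a narrower interval at a lower confidence level.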
Now, looking at a non-random bin with 1,000 observations, consisting of customers with a credit score < 650 and less than 26 years old, we see that the failure rate is 6.73%. We can thus conclude that the rule "credit score < 650 and less than 26 years old" is actually a good rule for detecting failures, because 6.73% is well above the upper bound of the [4.41%, 5.53%] confidence interval.
Indeed, we could test hundreds of rules, and easily identify rules with high predictive power, by systematically and automatically looking at how far the observed failure rate (for a given rule) is from a standard confidence interval. This allows us to rule out the effect of noise, and to process and rank numerous rules at once, based on their predictive power - that is, how much their failure rate is above the confidence interval upper bound.
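A minimal sketch of this ranking step, in Python, could look like the following. The first rule and its 6.73% failure rate, as well as the 5.53% upper bound, come from the example above; the two other rules and their rates are hypothetical values added only to make the ranking visible, and rank_rules is a made-up helper name.

```python
def rank_rules(rule_rates, upper_bound):
    """Keep rules whose failure rate exceeds the upper bound, largest excess first."""
    excess = {
        rule: rate - upper_bound
        for rule, rate in rule_rates.items()
        if rate > upper_bound
    }
    return sorted(excess.items(), key=lambda item: item[1], reverse=True)


rule_rates = {
    "credit score < 650 and age < 26": 0.0673,  # from the example above
    "hypothetical rule A": 0.0510,              # made up; falls inside the interval
    "hypothetical rule B": 0.0592,              # made up; slightly above the bound
}

ranked = rank_rules(rule_rates, upper_bound=0.0553)
for rule, excess in ranked:
    print(f"{rule}: {excess:.2%} above the upper bound")
```

Rules whose rate falls inside the interval are dropped because they cannot be distinguished from noise; the remaining rules are ordered by how far they exceed the upper bound.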
Comment
Based on my understanding of this issue, I wrote some 'toy code' in R. I also created a video about the simulation.
VIDEO:
https://www.youtube.com/watch?v=72uhdRrf6gM
CODE:
set.seed(3.1416)
x <- rnorm(1000000,10,2)
analytic_teorem <- function (N,x,med,desv)
plot(x_a, p_upper,type="l", ylim=range(c(med-1*desv,med+1*
This was an earlier post about the same result.
I'm a little bit confused about how this relates to this post:
http://www.analyticbridge.com/forum/topics/easy-to-compute?commentI...
Vincent: correct me if I am wrong...
k would appear to be the number of random bins;
m would appear to be the number of non-random bins;
5% would appear to be a given (average fail rate = 50,000 fails/1,000,000 customers);
6.73% would appear to be a given (67.3 fails/1,000 customers with credit score <650 and <26 years old) - although I don't really understand how you get 0.3 of a fail for 1,000 customers, assuming that the whole customer, and not some fraction, fails.
What I do not understand is where [4.41%,5.53%] comes from and how that is related to the 98% confidence interval given in the example.
I have to agree with Brt Dnk - Vincent, can you give us a spreadsheet or a dataset from which to replicate your example above (or a similar one, but with perhaps only 100,000 total observations)?
-Marc d. Paradis
Hi Vincent,
This sounds great, but I'm having trouble understanding how to set the two parameters, and the PDF describing the theorem didn't help much. Could you please post a spreadsheet with a specific calculation example and perhaps some pointers on selecting the right parameters?
Thank you in advance, this would be very useful if I knew how to get the right k and m...