AnalyticBridge

A Data Science Central Community

Challenge of the Week - Random Numbers

Most random number generators use an algorithm a(k+1) = f(a(k)) to produce a sequence of integers a(1), a(2), etc. that behaves like random numbers. The function f is integer-valued and bounded; because of these two conditions, the sequence a(k) eventually becomes periodic for k large enough. This is an undesirable property, and many public random number generators (those built in Excel, Python, and other languages) are poor and not suitable for cryptographic applications, Markov Chains Monte-Carlo associated with hierarchical Bayesian models, or large-scale Monte-Carlo simulations to detect extreme events (example: fraud detection, big data context).

It's easy to build a sequence of non-periodic numbers or digits, by using irrational numbers. A basic example is the following number:

0.12345678910111213141516171819202122232425....

By construction, this decimal number has no period. Unfortunately, if you use the digits (decimals) of this number as a sequence of random digits, it has obvious undesirable properties, even worse than most cases of periodicity. Is it possible to reshuffle these digits to make this sequence look much more random?

Better sequences are associated with natural transcendental numbers:

• e = 1/1! + 1/2! + 1/3! + ... + 1/n! + ...
• Log 10 - 2 Log 3 = 1/10 + 1/(2*10^2) + ... + 1/(n*10^n) + ...

The latter number has been artificially designed to produce fast convergence (one new digit for each term in the series) by integrating the series g(x) = 1 + x + x^2 + x^3 + ... = 1/(1 - x) and using x = 1/10. Click here for other similar examples.

To test whether a sequence is random enough, you can use a battery of statistical tests on the digits and blocks of 2, 3 or more digits:

• Is the distribution of a(k+r) - a(k) like a difference of two uniform distributions, for r = 1, 2, etc.
• Chi-square tests to make sure that the proportion of simulated numbers, in any interval, is as expected (for instance, about 10% of all digits must be equal to any pre-specified digit, for instance 7)
• Correlogram behaves like that of a non-correlated time series (that is, absence of auto-correlations of lag 1, 2 etc.)
• Run tests, to check whether sub-sequences of digits are repeating themselves too often or if you can predict a(k) with some accuracy given a(k-1), a(k-2), etc.

As you carry a large number of tests, you will face many false positives (deviation from randomness according to some tests, even though you have almost perfect random numbers). How do you address this issue?

Also, you don't need to know the statistical theory of hypothesis testing to work on this problem. For instance, to test if a(k +1) - a(k) behaves like a difference of two uniform distributions, either compute that theoretical distribution or plot it doing Monte-Carlo simulations, then look at the distribution of a(k+1) - a(k) on hundreds of data buckets (your simulated random numbers), then compute confidence intervals based on these buckets to see how close your observed distribution is to the theoretical distribution. If both distributions are really different, then your random number generator has a serious issue.

The challenge

This challenge can be broken down in a few sub-problems:

1. Prove that if a(k+1) = f(a(k)) where f is integer-valued and bounded, then the sequence a(1), a(2), etc. is periodic for k large enough, and the period is smaller or equal to the number of distinct values that f produces over all integers.
2. Find other transcendental numbers like our Log 10 - 2 Log 3 number, yielding to fast convergence, or even better, yielding to fast computation of the k-th digit, denoted as a(k).
3. Create a metric that measures how well an integer sequence represents randomness, based on the statistical tests previously mentioned.
4. Compare your or our random generators to standard random generators, by performing the statistical tests mentioned in the previous section and using the metric defined in the previous step.
5. Create a class of random number generators by introducing parameters, for instance, the decimals of Log b - c Log d, where b, c, d are parameters. Which parameter sets provide the best random generators (in terms of randomness, and in terms of easy-to-compute)
6. Find recurrence or other formulas (maybe continued fractions) to quickly compute the digits (or decimals) in question. For instance, for e = 2.71... the sum of the n first terms is B(n)/n! where B(n) is an integer satisfying simple recurrence relations; for Log 10 - 2 Log 3, the sum of the n first terms is P(n)/10^n where P(n) is a relatively manageable fraction. The number e also has some interesting continued fraction expansions, worth considering to compute the digits, or for numerical analysis purposes.

Finally, if we define a decimal number d as a concatenation of positive integers a(1), a(2) and so on, preceded by "0.", where a(k) is a sequence with no upper bound,  then d is non-periodic and irrational (Under which conditions is this true? For instance this is not true if a(1) = 3 and a(k+1) = 10*a(k) + 3, resulting in d = 1/3). As an illustration, if a(k) = k^2, then d = 0.149162536496481100121144... Can you find some a(k) that, in addition to non-periodicity, provides good randomness properties?

Related articles

Views: 4149

1

2

3

4