# AnalyticBridge

A Data Science Central Community

Check the three charts below: only one shows no pattern and is truly random. Which one? Chart #1 Chart #2 Chart #3

It is very clear that chart #3 exhibits a strong clustering pattern, unless you define your problem as points randomly distributed in an unknown domain whose boundary has to be estimated. So, the big question is: between chart #1 and #2, which one represents randomness? Look at these charts very closely for 60 seconds, then make a guess, then read on. Note that all three charts contain the same number of points - so there's no scaling issue involved here.

Let's assume that we are dealing with a spatial distrubtion of points over the entire 2-dimentional space, and that observations are seen through a small square window. For instance, points (observations) could be stars as seen on a picture taken from a telescope.

The first issue is the fact that the data is censored: if you look at the distribution of nearest neighbor distances to draw conclusions, you must take into accont the fact that points near the boundary have fewer neighbors because some neighbors are outside the boundary. You can eliminate the bias by

• Tiling the observation window to produce a mathematical tessellation
• Mapping the square observation window onto the surface of a torus
• Apply statistical bias-correction techniques
• Use Monte-Carlo simulations to estimate what the true distribution is (with confidence intervals) if the data was truly random

Second issue: you need to use better visualization tools to see the patterns. The fact that I use a + rather than a dot symbol to represents the points, helps: some points are so close to each other that if you represent points with dots, you won't visually see the double points (in our example, double points could correspond to double star systems - and these very small-scale point interactions are part of what makes the distribution non-random in two of our charts). But you can do much better: you could measure a number of metric (averages, standard deviations, correlation between x and y, number of points in each sub-square, density estimates, etc.) and identify metrics proving that we are not dealing with pure randomness.

In these 3 charts, the standard deviation for either x or y - in case of pure randomness - should be 0.290 plus or minus 0.005. Only one of the 3 charts succeeds with this randomness test.

Third issue: even if multiple statistical tests suggests that the data is truly random, it does not mean it really is. For instance, all three charts show zero correlation between x and y, and have mean x and y close to 0.50 (a requirement to qualify as random distribution in this case). However only one chart exhibits randomness.

Fourth issue: we need a mathematical framework to define and check randomness. True randomness is the realization of a Poisson stochastic process, and we need to use metrics that uniquely characterizes a Poisson process to check whether a point distribution is truly random or not. Such metrics could be e.g.

• The inter-point distance distributions
• Number of observations in sub-squares (these counts should be uniformly distributed over the sub-squares, and a Chi-square test could provide the answer; however in our charts, we don't have enough points in each sub-square to provide a valid test result)

Fifth issue: some of the great metrics (distances between kth-neighbors) might not have a simple mathematical formula. But we can use Monte-Carlo simulations to address this issue: simulate a random process, compute the distribution of distances (with confidence intervals) based on thousands of simulations, and compare with distances computed on your data. If distance distribution computed on the data set matches results from simulations, we are good, it means our data is probably random. However, we would have to make sure that distance distribution uniquely characterizes a Poisson process, and that no non-random processes could yield the same distance distribution. This exercise is known as goodness-of-fit testing: you try to see if your data support a specific hypothesis of randomness.

Sixth issue: if you have a million points (and in high dimensions, you need much more than a million points due to the curse of dimension), then you have a trillion distances to compute. No computer, not even in the cloud, will be able to make all these computations in less than a thousand year. So you need to pick up 10,000 points randomly, compute distances, and compare with equivalent computations based on simulated data. You need to make 1,000 simulations to get confidence intervals, but this is feasible.

Here's how the data (charts 1-3) was created

• Produce 158 random points [a(n), b(n)], n=1,...,158
• Produce 158 random deviates u(n), v(n), n=1,...,158
• Define x(n) as follows for n>1: if u(n) < r, then x(n) = a(n), else x(n) = s*v(n)*a(n) + [1-s*v(n)]*x(n-1), with x(1)=a(1)
• Define y(n) as follows for n>1: if u(n) < r, then y(n) = b(n), else y(n) = s*v(n)*b(n) + [1-s*v(n)]*y(n-1), with y(1)=b(1)
• Chart 1: x(n)=a(n), y(n)=b(n)
• Chart 2: r=0.5, s=0.5
• Chart 3: r=0.4, s=0.1

Notes

• The only chart exhibiting randomness is chart #1. Chart #2 has significantly too low standard deviations for x and y, too few points near boundaries, and too many points that are very close to each other
• Note that chart #1 (the random distribution) exhibits a little bit of clustering, as well as some point alignments: this is however perfectly expected from a random distribution. If the number of points in each sub-square was identical, the distribution would not be random, but would correspond to a situation where antagonist forces make points to stay as far away as possible from each other.
• How would you test randomness if you had only two points (impossible to test), three points, or just 10 points?
• Finally, once a pattern is detected (e.g. abnormal close proximity between neighboring points), it should be interpreted and/or leveraged, that is, it should lead to (say) ROI-positive trading rules if the framework is about stock trading, or the conclusion that double stars do exist (based on chart #2) if the framework is astronomy
• See an example of random number generator at http://www.analyticbridge.com/profiles/blogs/new-state-of-the-art-r...

Views: 47842

Comment

Join AnalyticBridge Comment by Vincent Granville on September 29, 2011 at 11:16am
Comment about the apparent "line structures": some level of "line structure" (several dots that you connect to create lines or smooth curves) happens in true randomness. Absence of these "lines" would also be an indicator of lack of randomness. Indeed, it might be difficult to create a chart where this effect of line structures is non existent (if you have one, please share with us). I tried once to simulate random points with my brain, dotting a piece of paper with my eyes closed, I also found these line structures.

More difficult to detect is interactions with points that are far away from each other. Comment by Vincent Granville on September 29, 2011 at 10:53am

Allan, you are correct. And thanks for the R code!

>Did you mean r=0.1 and s=0.4 for chart 3?  Some R code to draw the figures: Comment by Rick Wicklin on September 29, 2011 at 7:04am

Nice post. I like it. For your statistical readers, I'll add that there are some simple ways to test for the randomness and uniformity of patterns:

http://blogs.sas.com/content/iml/2011/02/04/divide-and-count-a-test...

You might find it interesting to note that I once used these ideas to arrange actors on stage in a pseudo-random configuration. That's an application of statistics that you don't see everyday!

http://blogs.sas.com/content/iml/2011/01/28/random-uniform-versus-u... Comment by Allan Engelhardt on September 29, 2011 at 3:26am

For your example, you can easily convince yourself that the variance of x*y is different from a random distribution.  Try plotting a histogram or use the variance test (even though this is not a normal distributed variable).

`set.seed(1)a <- x <- runif(158)b <- y <- runif(158)u <- runif(158)v <- runif(158)r <- 0.5s <- 0.5for (n in 2:length(a)) {    if (u[n] >= r) {        x[n] <- s*v[n]*a[n] + (1-s*v[n])*x[n-1]        y[n] <- s*v[n]*b[n] + (1-s*v[n])*y[n-1]    }}plot(a, b, pch = "+", col = "grey95")points(x, y, pch = "+")hist(a*b, freq = TRUE, col = rgb(1, 0, 0, .5))hist(x*y, freq = TRUE, col = rgb(0, 0, 1, .5), add = TRUE)print(var.test(x*y, runif(1e6)*runif(1e6)))` Comment by Allan Engelhardt on September 29, 2011 at 2:46am

Did you mean r=0.1 and s=0.4 for chart 3?  Some R code to draw the figures:

set.seed(1)
a <- x <- runif(158)
b <- y <- runif(158)
u <- runif(158)
v <- runif(158)

r <- 0.1
s <- 0.4

for (n in 2:length(a)) {
if (u[n] >= r) {
x[n] <- s*v[n]*a[n] + (1-s*v[n])*x[n-1]
y[n] <- s*v[n]*b[n] + (1-s*v[n])*y[n-1]
}
}

plot(a, b, pch = "+", col = "grey90")
points(x, y, pch = "+")