A Data Science Central Community
Check the three charts below: only one shows no pattern and is truly random. Which one?
It is very clear that chart #3 exhibits a strong clustering pattern, unless you define your problem as points randomly distributed in an unknown domain whose boundary has to be estimated. So, the big question is: between chart #1 and #2, which one represents randomness? Look at these charts very closely for 60 seconds, then make a guess, then read on. Note that all three charts contain the same number of points - so there's no scaling issue involved here.
Let's assume that we are dealing with a spatial distrubtion of points over the entire 2-dimentional space, and that observations are seen through a small square window. For instance, points (observations) could be stars as seen on a picture taken from a telescope.
The first issue is the fact that the data is censored: if you look at the distribution of nearest neighbor distances to draw conclusions, you must take into accont the fact that points near the boundary have fewer neighbors because some neighbors are outside the boundary. You can eliminate the bias by
Second issue: you need to use better visualization tools to see the patterns. The fact that I use a + rather than a dot symbol to represents the points, helps: some points are so close to each other that if you represent points with dots, you won't visually see the double points (in our example, double points could correspond to double star systems - and these very small-scale point interactions are part of what makes the distribution non-random in two of our charts). But you can do much better: you could measure a number of metric (averages, standard deviations, correlation between x and y, number of points in each sub-square, density estimates, etc.) and identify metrics proving that we are not dealing with pure randomness.
In these 3 charts, the standard deviation for either x or y - in case of pure randomness - should be 0.290 plus or minus 0.005. Only one of the 3 charts succeeds with this randomness test.
Third issue: even if multiple statistical tests suggests that the data is truly random, it does not mean it really is. For instance, all three charts show zero correlation between x and y, and have mean x and y close to 0.50 (a requirement to qualify as random distribution in this case). However only one chart exhibits randomness.
Fourth issue: we need a mathematical framework to define and check randomness. True randomness is the realization of a Poisson stochastic process, and we need to use metrics that uniquely characterizes a Poisson process to check whether a point distribution is truly random or not. Such metrics could be e.g.
Fifth issue: some of the great metrics (distances between kth-neighbors) might not have a simple mathematical formula. But we can use Monte-Carlo simulations to address this issue: simulate a random process, compute the distribution of distances (with confidence intervals) based on thousands of simulations, and compare with distances computed on your data. If distance distribution computed on the data set matches results from simulations, we are good, it means our data is probably random. However, we would have to make sure that distance distribution uniquely characterizes a Poisson process, and that no non-random processes could yield the same distance distribution. This exercise is known as goodness-of-fit testing: you try to see if your data support a specific hypothesis of randomness.
Sixth issue: if you have a million points (and in high dimensions, you need much more than a million points due to the curse of dimension), then you have a trillion distances to compute. No computer, not even in the cloud, will be able to make all these computations in less than a thousand year. So you need to pick up 10,000 points randomly, compute distances, and compare with equivalent computations based on simulated data. You need to make 1,000 simulations to get confidence intervals, but this is feasible.
Here's how the data (charts 1-3) was created: