
Data science defeats intuition: twin data points are the norm, not the exception

This is an example where data science and statistical analysis are superior to intuition. Here, intuition misleads you into the wrong conclusions.

By twin data points, I mean observations that are almost identical. In any 2- or 3-dimensional data set with 300+ rows, if the data is quantitative and uniformly distributed in a bounded space, you should expect a large proportion - above 15% - of the data points to have a very close neighbor.

This applies to all data sets, but the discovery was first made by looking at a picture of stars in a galaxy (see below). There are so many binary stars that one would think there is a mechanism that forces stars to cluster in pairs. However, if you look at pure probabilities, it is perfectly normal to have 15% of the stars belonging to a binary star system.

[Picture: star field showing many apparent binary stars]

Here is how I made the computation. Let's say the image is 10 cm x 10 cm, has about n=500 visible stars (data points), and a binary star is defined as a star having a neighboring star 1 mm away (or less) in the picture. If the stars were distributed perfectly at random, the expected number of stars in a binary star system would be 73 (on average), out of 500 stars. Isn't that number far higher than you would have thought? Let's denote this proportion as p, thus p=14.5%, and n*p=73 is the expected number of stars in a binary system, among these 500 stars.

A few interesting numbers (including a proof of my assertion):

We can compute p using the theory of stochastic processes - Poisson process in this case. The intensity L of the process is the number of points per square millimeter, that is L = 500 / (100 mm x 100 mm) = 0.05 per square millimeter.

The probability p that a star has at least one neighbor within 1 mm is 1 - Proba(zero neighbor) = 1 - exp(-L*Pi*r^2) where r = 1 mm and Pi = 3.14. Here Pi*r^2 is the area of a circle of radius r = 1 mm. The exponential term comes from the fact that for a Poisson process, the number of points in a given set (circle, rectangle, etc.), has a Poisson distribution of mean L*Area. Thus p=0.145.

So being a binary star is a Bernoulli (1/0) variable of parameter p=0.145. Let V denote the number of stars that are in a binary star system: V is the sum of n Bernoulli variables of parameter p, and thus has a Binomial distribution of parameters n, p. The standardized variable Z = (V - np)/SQRT{np(1-p)} is very well approximated by a Normal(0,1) distribution. This fact can be used to compute various probabilities (a short Python sketch follows the list below):

  • Probability of at least 60 stars in a binary system, out of 500 = 95%
  • Probability of at least 80 stars in a binary system, out of 500 = 18%
  • Probability of at least 100 stars in a binary system, out of 500 = 0% (almost)
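
Here is a minimal Python sketch of these calculations, using only the standard library and the normal approximation described above (all names are illustrative):

import math

n = 500                        # number of visible stars
L = n / (100.0 * 100.0)        # intensity: stars per square mm on a 100 mm x 100 mm image
r = 1.0                        # "binary star" radius, in mm

# Probability that a given star has at least one neighbor within r mm (Poisson process)
p = 1 - math.exp(-L * math.pi * r ** 2)

mean = n * p                      # expected number of stars with a close neighbor (about 73)
sd = math.sqrt(n * p * (1 - p))   # Binomial standard deviation

def prob_at_least(k):
    # P(V >= k) under the Normal(0,1) approximation for Z = (V - np)/SQRT{np(1-p)}
    z = (k - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

print("p =", round(p, 3), " expected count =", round(mean, 1))
for k in (60, 80, 100):
    print("P(at least", k, "stars in a binary system) =", round(prob_at_least(k), 3))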

Alternate method to compute p

The same results could have been obtained with Monte Carlo simulations rather than a theoretical model. This would involve generating a million simulated images (2-dimensional tables) and, in each simulated image, counting the number of stars in a binary system. This task can be automated and performed in a few minutes with modern computers, a good random number generator, and a smart algorithm.

It could be slow if you use a naive approach. You can do much better than O(n^2) computational complexity when computing the n distances to the nearest stars. The idea is to store the data in a grid with granularity 1 mm (that is, a 2-dimensional array with 100 x 100 = 10,000 cells). For each star, you then only have to look at its own cell and the 8 surrounding cells, and check the actual distances to the few stars found there, to count the neighbors less than 1 mm away. The O(n^2) complexity has been reduced to O(n), at the expense of a 10,000-cell grid recording which stars fall in each cell.
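
As an illustration, here is a short Python sketch of such a simulation using the grid idea (only 1,000 simulated images instead of a million, to keep the run time short; all names are illustrative):

import math, random

def count_paired_stars(n=500, size=100.0, r=1.0):
    # Simulate n uniform points on a size x size mm square and count how many
    # have at least one neighbor within r mm, using 1 mm cells to limit comparisons.
    stars = [(random.uniform(0, size), random.uniform(0, size)) for _ in range(n)]
    grid = {}
    for i, (x, y) in enumerate(stars):
        grid.setdefault((int(x), int(y)), []).append(i)
    paired = 0
    for i, (x, y) in enumerate(stars):
        cx, cy = int(x), int(y)
        neighbors = (j for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     for j in grid.get((cx + dx, cy + dy), []) if j != i)
        if any(math.hypot(x - stars[j][0], y - stars[j][1]) <= r for j in neighbors):
            paired += 1
    return paired

runs = 1000
counts = [count_paired_stars() for _ in range(runs)]
print("average number of paired stars per image:", sum(counts) / runs)

The average should land close to the theoretical value of 73, typically slightly below it, because stars near the edge of the image have a smaller neighborhood.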

Note that I picked the number 1,000,000 arbitrarily; in practice it just needs to be big enough that your estimates are stable, with additional simulations bringing little or no correction. Selecting the right sample and sample size is a design-of-experiments problem, and using model-free confidence intervals facilitates this task and makes the results robust. This Monte Carlo simulation approach is favored by operations research professionals, as well as by some data scientists, computer scientists and software engineers who love model-free statistical modeling. However, in this case the theoretical model is well known, simple if not elementary, and comes with a quick, simple answer. So unless you would have to spend hours understanding how the theoretical model works, or even discovering that it exists, I would go with the theoretical solution in this case.
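
For example, a plain normal-approximation confidence interval on the simulated average (just a quick sketch, not the model-free construction mentioned above; it assumes the counts list produced by the previous snippet) shows when extra runs stop changing the answer:

import math

def mean_ci(counts, z=1.96):
    # 95% normal-approximation confidence interval for the mean of the simulated counts
    m = sum(counts) / len(counts)
    var = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
    half = z * math.sqrt(var / len(counts))
    return m - half, m + half

# Usage, with `counts` from the simulation sketch above:
# print("95% CI for the expected number of paired stars:", mean_ci(counts))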

Caveat

In our example, stars are seen on a 2-dimensional screen. But in reality, they lie in a 3-D space. Two stars might appear to be neighbors because their X and Y coordinates are very close to each other, yet be eons apart on the Z axis. So to compute the real expected proportion of binary stars, one would have to simulate stars (points) in 3-D, project them onto the 10 x 10 cm rectangle, and then count the binary stars. I am not sure the theoretical model offers a simple solution in this case, but the Monte Carlo simulations are still straightforward. In practice, stars that are really far away are not bright enough to show up on the picture, so the 2-D model is a good approximation to the real 3-D problem.
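
A rough Python sketch of this 3-D version is below; the 100 mm depth of the simulated box is an arbitrary assumption, and a brute-force O(n^2) search is kept for simplicity. It compares apparent pairs in the projection with pairs that are truly close in 3-D:

import math, random

def apparent_vs_true_pairs(n=500, size=100.0, depth=100.0, r=1.0):
    # Simulate n stars uniformly in a size x size x depth mm box, project onto the
    # (x, y) plane, and count stars with an apparent neighbor (within r mm in the
    # projection) versus stars with a true neighbor (within r mm in 3-D).
    stars = [(random.uniform(0, size), random.uniform(0, size), random.uniform(0, depth))
             for _ in range(n)]
    apparent = true = 0
    for i, (xi, yi, zi) in enumerate(stars):
        has_apparent = has_true = False
        for j, (xj, yj, zj) in enumerate(stars):
            if j == i:
                continue
            d2d = math.hypot(xi - xj, yi - yj)
            if d2d <= r:
                has_apparent = True
                if math.sqrt(d2d ** 2 + (zi - zj) ** 2) <= r:
                    has_true = True
        apparent += has_apparent
        true += has_true
    return apparent, true

print("apparent pairs vs. true 3-D pairs:", apparent_vs_true_pairs())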

Also, in the theoretical model, we made some implicit independence assumptions regarding star locations (when mentioning the Binomial model), but this is not the case in practice, as the 1 mm circles centered on each star sometimes overlap. The approximation is still good and is conservative, in the sense that the theoretical number, when corrected for overlap, will be even higher than 73.

Final comments

  • This framework provides a good opportunity to test your analytic intuition: look at a chart, look at twin observations, and visually assess whether the twin observations are natural (random) or not (too numerous, or too few of them).  For other tests to assess your analytic intuition, check out our 66 job interview questions for data scientists, or our article How to detect a pattern? Problem and solution.
  • It would be a great exercise to write a piece of code (Python, Perl or R) that performs these simulations (including the more complicated 3-D case) to (1) double-check my theoretical results and (2) compare R, Perl and Python in terms of speed. Indeed, I should add this as a potential project for students working on our data science apprenticeship.
  • More than 80% of stars are in a binary system. This number is not supported by the above theory or simulations, so there is clearly a mechanism that forces stars to cluster in pairs. This conclusion is much deeper than just discovering correlations (what too many business analysts still do full time), but it does not explain the cause.


Comment by Cristian Vava, PhD, MBA on July 2, 2013 at 7:37pm

@Vincent: Naïve intuition may suggest that the expected arrangement for stars would have them almost equally spaced. In reality, assuming the stars are perfect points (reducing their size and mass to zero, thus eliminating all connections to celestial mechanics), it is equally possible to have them in any arrangement, including equally spaced or all clustered together (as non-overlapping entities). No combination is more likely than any other, since both dimensions follow independent uniform distributions.

If you push the abstraction even further and have the points spread on a single axis (1D) you end up with the traditional Near Match Generalization of the Birthday Problem. In this case (discrete space with dimension m filled with n stars) the probability of having 2 stars clustered at distance k is p(n,k,m)= 1-(m-nk-1)! / (m^(n-1) * (m-n(k+1))!). You can take this formula and expand it to multidimensional cases (2D, 3D) to find the probability of having stars clustered at any distance or in any cluster size.

You may even be able to expand it to non-Euclidean spaces by incorporating the proper metric, but this is a case I assume no one will ever try to solve using intuition.

Comment by Vincent Granville on June 28, 2013 at 8:32am

@Maxim: you are talking about hypothetical densities, while I'm talking about real, observed densities. Of course, if the star density were a trillion times higher, my picture would consist only of white pixels and the proportion of binary star systems would have to be 100%. But the reality is different. The idea of the article is to give you a reference framework, not just in the context of stars, but for any data set. And it has all the formulas needed to compute p based on your parameters (n and so on).

Note that when density is so high that 100% of stars are binary because of density, the concept itself of binary stars makes no sense. It does make sense when there are both isolated and binary stars.

Comment by Maxim Nazarov on June 28, 2013 at 8:13am

I want to mention several inconsistencies here:

Your assertion that

In any 2- or 3-dimensional data set with 300+ rows, if the data is quantitative and evenly distributed in a bounded space, you should expect to see a large proportion - above 15% - of data points that have a very close neighbor.

is quite vague, and, in fact, not supported by your computations further:

This proportion very clearly depends on the "density" (number of points per square mm), as you yourself write: p = 1 - exp(-L*Pi*r^2). You can easily plot this function (e.g. in Google) and see that yes, for 0.05 it gives 14.5%, but for 0.01, for example, it gives 3%.

But you make it sound as if 15% is a general constant, by writing casually 

Let's say the image is 10 cm x 10 cm, has about n=500 visible stars

when the value of the density is crucial. Further, you say

More than 80% of stars are in a binary system. This number is not supported by above theory nor simulations, thus there is clearly a mechanism that forces stars to cluster in pairs.

Again, this number (80%) is perfectly supported by your theory, if you assume the density is around 0.51 per square mm (which doesn't seem too improbable, at least for the picture provided).

Further, it is not clear what you mean by "...images ( 2x2 tables)..." and "2x2 array with 100 x 100 = 10,000 cells". Maybe you wanted to say "2-dimensional" instead of "2x2"?

And also you contradict yourself writing 

Two stars ... could be eons apart on the Z axis. So to compute the real expected proportion of binary stars, one would have to simulate stars (points) in 3-D, then project them on the 10 x 10 cm rectangle...

If you project 3-D into 2-D rectangle you will precisely not be able to recognize that points were eons apart.

And there are more...

I am sorry if I am a bit harsh, but to be honest the article is really poorly written...

Comment by Vincent Granville on June 27, 2013 at 10:41am

On a different note, the proportion of binary stars is an indicator of the star density in a given location, unless, for some reason, most stars must be binary.
