A Data Science Central Community
If you look at the picture below (the Pleiades star cluster), you will see - with the naked eye - that many star systems appear to be binary: that is, two (or more) stars orbiting each other.
Is this a coincidence, or can we prove, from a statistical point of view and based on the theory of stochastic point processes, that we are NOT dealing with a purely random process (a Poisson process)? At first glance, as a statistician, I would say that the chance of observing so many pairs is extremely low, far below 0.000000001%. Keep in mind, however, that two stars that look very close to each other when viewed from Earth might actually be much farther apart than two stars that seem far apart, because these pictures lack depth (the third dimension, or perspective). Also, most binary systems apparently consist of a normal star and a much smaller companion, so we might see only a small fraction of all binary systems. In other words, maybe 90% of all star systems are binary. Finally, there are cloudy areas in the picture below, where gas clouds hide the stars located behind them.
The probability of observing so many binary systems can be computed as follows:
Note that if you know elementary statistics and basic concepts about Poisson processes (the most basic of all stochastic processes), then you don't even need to perform one million simulations. There is an exact mathematical formula that tells you the expected number of binary stars that you should see if binary stars were not favored: it is based on the Erlang distribution. Distances to nearest neighbors have been studied extensively in statistics, and there is a solid theoretical background behind them.
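To make the simulation idea concrete, here is a minimal sketch, not a definitive implementation. It scatters points at random (a homogeneous Poisson process) on a unit square, counts how many points have a neighbor closer than some small distance r, and compares the simulated average with the closed-form value. The formula used relies on a standard result: for a planar Poisson process of intensity λ, the quantity λπR², where R is the distance to the k-th nearest neighbor, follows an Erlang distribution with shape k and rate 1 (for k = 1, an exponential distribution). Everything numeric here is an assumption made for illustration: the intensity, the threshold r, the "observed" count, and the use of wrap-around distances to avoid edge effects. None of it is measured from the picture.

```python
import math
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) integer by counting unit-rate arrivals in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def count_close_points(points, r):
    """Count points that have at least one neighbor within distance r."""
    n = len(points)
    has_close = [False] * n
    r2 = r * r
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            dx = abs(xi - points[j][0])
            dy = abs(yi - points[j][1])
            # Wrap-around (torus) distance: an assumption that removes edge effects.
            dx = min(dx, 1.0 - dx)
            dy = min(dy, 1.0 - dy)
            if dx * dx + dy * dy <= r2:
                has_close[i] = True
                has_close[j] = True
    return sum(has_close)

def simulate(intensity, r, n_sims, seed=0):
    """Simulate n_sims random star fields; return the close-point count of each."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        n = poisson_count(intensity, rng)
        points = [(rng.random(), rng.random()) for _ in range(n)]
        counts.append(count_close_points(points, r))
    return counts

def expected_close_points(intensity, r):
    """Exact expectation: intensity * P(nearest-neighbor distance <= r)."""
    return intensity * (1.0 - math.exp(-intensity * math.pi * r * r))

# Hypothetical values, chosen only for illustration.
INTENSITY = 100.0   # average number of stars in the unit square
R = 0.02            # two stars closer than this "look binary"
OBSERVED = 40       # made-up number of stars seen with a close companion

counts = simulate(INTENSITY, R, n_sims=300)
mean_sim = sum(counts) / len(counts)
p_value = sum(c >= OBSERVED for c in counts) / len(counts)

print("simulated mean :", round(mean_sim, 2))
print("exact formula  :", round(expected_close_points(INTENSITY, R), 2))
print("estimated P(count >= observed):", p_value)
```

With only a few hundred simulations, an estimated probability of 0 just means "smaller than about 1 in 300". That is exactly why the exact formula is attractive when the probability of interest is tiny.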
On a different topic, can we apply statistical principles used in astronomy to the business world of big data?
I'm thinking of measuring the distance to faraway stars as an example, where multiple measurements from a highly calibrated system are aggregated to refine the accuracy. In some ways, this amounts to using multiple measurements to amplify a very weak signal. Can this concept of signal amplification be used to gain better, more accurate insights from big data? After all, business data is also very noisy and foggy: it also has its own clouds just as in the above picture, both metaphorically and physically, making statistical inference, pattern detection, and insight discovery more difficult.
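As a rough illustration of why aggregating measurements helps, here is a small sketch under simplifying assumptions: the measurements are independent, unbiased, and have Gaussian noise with a known standard deviation. In that setting, the spread of the average of n measurements shrinks like the noise level divided by the square root of n. The true value and the noise level below are hypothetical numbers in arbitrary units, not real astronomical data.

```python
import math
import random
import statistics

# Hypothetical values, in arbitrary units.
TRUE_DISTANCE = 100.0
NOISE_SD = 50.0

rng = random.Random(42)

def averaged_measurement(n):
    """Average n independent noisy measurements of the same quantity."""
    return statistics.fmean(rng.gauss(TRUE_DISTANCE, NOISE_SD) for _ in range(n))

def empirical_spread(n, n_repeats=2000):
    """Standard deviation of the averaged estimate over many repeated experiments."""
    estimates = [averaged_measurement(n) for _ in range(n_repeats)]
    return statistics.stdev(estimates)

for n in (1, 4, 16, 64, 256):
    theory = NOISE_SD / math.sqrt(n)
    print(f"n={n:4d}  empirical spread={empirical_spread(n):6.2f}  theory={theory:6.2f}")
```

The same caveat applies to business data: averaging only amplifies the signal this way when the errors are roughly independent and unbiased. A systematic bias, like a cloud hiding part of the picture, does not average out.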
Interestingly, this is an illustration where a picture is used as raw data for an analysis, rather than the opposite, classical setting where a picture is produced as the final step of analyzing data.