A Data Science Central Community
Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere -- if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system.
Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such "ordered" patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: "Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory." The message was clear -- beware of apparent order in a random process, and don't be tricked into developing a theory to explain random data.
Randomness is most likely to undermine rational thinking in small-numbers phenomena. For example, suppose that I ask 12 people which NFL football team they like the most, and they all say Baltimore Ravens. Is that a statistical fluke, a fair statement about the national sentiment, or a selection effect (since all 12 people that I asked actually live in Baltimore)? The answer is probably the last of these. Okay, this example may be too obvious. So, consider the following less obvious example:
Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin). Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?
(d) None of the above.
In each case, a coin toss of head is listed as "H", and a coin toss of tail is listed as "T".
The answer is "(d) None of the Above."
None of the above sequences was generated manually. They were all actual subsequences extracted from a larger sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here -- when only 12 coin tosses are considered, the occurrence of any "improbable result" may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continuing for dozens more coin tosses (nothing but Tails, all the way down), then that would be truly significant.
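To make the selection effect concrete, here is a minimal Python sketch that counts how often a striking 12-toss pattern turns up inside a long random sequence. The function names, the seed, and the 100,000-toss length are illustrative assumptions (the length of the larger sequence is not stated above), and the pattern of twelve tails is taken from the description of answer (b).

```python
import random


def count_windows(sequence, pattern):
    """Count overlapping occurrences of `pattern` inside `sequence`."""
    width = len(pattern)
    return sum(
        sequence[i:i + width] == pattern
        for i in range(len(sequence) - width + 1)
    )


def expected_windows(n_tosses, width):
    """Expected number of windows matching ONE specific pattern of this width.

    Each window matches with probability 1 / 2**width for a fair coin, and
    there are n_tosses - width + 1 windows (linearity of expectation).
    """
    return max(n_tosses - width + 1, 0) / 2 ** width


# Assumed length and seed, chosen only for illustration.
rng = random.Random(2013)
long_sequence = "".join(rng.choice("HT") for _ in range(100_000))

observed = count_windows(long_sequence, "T" * 12)
print("windows of 12 tails found:", observed)
print("expected on average:", expected_windows(100_000, 12))
```

Any one specific 12-toss pattern has probability 1/4096, yet in 100,000 tosses roughly 24 windows of twelve straight tails are expected on average, so someone hunting for a "suspicious" stretch will usually find one.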
So, let's try again with another sample problem (#2) in which I truly did invent one of the three sequences (i.e., a bogus sequence that I manually typed on the computer, attempting to create my own example of a random sequence). Which one of these 50-toss sequences is the bogus sequence?
For the two real (non-bogus) sequences, I used a random number generator to generate the 50-coin sequence. The random number generator (common to nearly all scientific programming environments) produces a random number between 0 and 1. I simply labeled the event as "H" when the number was 0.5 or greater, and labeled the event as "T" whenever the number was less than 0.5.
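The labeling rule just described can be sketched in a few lines of Python. This is only an illustration of the rule, not the code actually used for this post; the function name `toss_sequence` and the `seed` parameter are assumptions added here so that a run can be repeated.

```python
import random


def toss_sequence(n_tosses=50, seed=None):
    """Return a string of 'H'/'T' for n_tosses simulated fair-coin tosses.

    Each toss draws a uniform random number in [0, 1) and is labeled
    'H' when the number is 0.5 or greater, 'T' when it is less than 0.5.
    """
    rng = random.Random(seed)  # seed is optional; it only makes runs repeatable
    return "".join("H" if rng.random() >= 0.5 else "T" for _ in range(n_tosses))


print(toss_sequence(50))
```

Calling `toss_sequence(50)` twice without a seed gives two different 50-toss sequences, which is how the two real sequences in a problem like #2 could be produced.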
The answer to sample problem #2 is ... posted at the bottom of this post (by which point you will have probably guessed it).
This topic of "fooled by randomness" came up when I was reading an article recently on the Turing Award Winners from 1966 through 2013.
This article lists many interesting statistical facts about the 61 winners of the award. The article provides a fun, interactive data visualization built with Tableau tools in which you can explore these statistical data, which include: each winner's birth year, age at time of award, nationality, gender, and... astrological sign! Being a data scientist and astrophysicist, I found the inclusion of Zodiac sign to be disconcerting. However, the author of the original post does admit that this was included jokingly.
As you look at the data, you will see that 10 of the 61 Turing Award winners were born under one specific sign of the Zodiac, and only 2 of the 61 winners were born under another sign (in fact, two different signs have only 2 winners each). These questions then arise: Is there significance to this apparent correlation? Is there true order here, and not randomness? Are Capricorns really five times more likely to win future Turing Awards than Scorpios?
Of course, the answer to these questions is that the statistical distribution of astrological birth signs truly represents a purely random process, with no astrological (or astronomical) significance whatsoever. But demonstrating this looked like a fun exercise for my random number generator once again.
So, I generated random birth months (1 through 12, corresponding equivalently to the 12 signs of the Zodiac) for 61 individuals. (For simplicity, I assumed that all birth months are equally likely, thus ignoring the variable length of the various months.) I repeated this simulation 100,000 times (which almost certainly falls into that scientific data analysis category of "overkill"). I then examined in how many of the 100,000 simulations each of the following apparent correlations appeared:
(1) We find 10 or more of the 61 individuals with the same birth month (astrological sign):
Answer: in 32% of the simulations
(2) We find 2 or fewer of the 61 individuals in any one of the birth months:
Answer: in 80% of the simulations
(3) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 5 or greater:
Answer: in 40% of the simulations
(4) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 4.5 or greater:
Answer: in 49% of the simulations
Therefore, it is statistically reasonable and totally expected that we would see 1 or 2 birth months that contain only two award winners. It is also statistically reasonable that we could see 5 times as many winners in the most populous month as in the least populous month. Regarding the first correlation (32% of the simulations revealing 10 or more of the 61 individuals with the same birth month), 32% is a non-trivial percentage, and it is therefore not surprising that we see it occur in real life.
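For readers who want to try the experiment themselves, here is a rough Python sketch of the simulation described above: 61 individuals, 12 equally likely birth months, and the four tests (1) through (4). It is a reconstruction under stated assumptions rather than the original code. The function name, the result keys, and the seed are invented for illustration, and a month with zero birthdays is assumed to count as satisfying both ratio tests (the text above does not say how that case was handled), so the percentages may differ somewhat from those quoted.

```python
import random


def simulate_birth_months(n_people=61, n_months=12, n_sims=100_000, seed=None):
    """Estimate how often each apparent correlation appears by pure chance.

    Returns a dict mapping each test to the fraction of simulations in
    which it occurred.
    """
    rng = random.Random(seed)
    hits = {"max_ge_10": 0, "min_le_2": 0, "ratio_ge_5": 0, "ratio_ge_4_5": 0}

    for _sim in range(n_sims):
        # Assign every individual a uniformly random birth month.
        counts = [0] * n_months
        for _person in range(n_people):
            counts[rng.randrange(n_months)] += 1
        biggest, smallest = max(counts), min(counts)

        if biggest >= 10:   # test (1): 10 or more share a birth month
            hits["max_ge_10"] += 1
        if smallest <= 2:   # test (2): some month has 2 or fewer
            hits["min_le_2"] += 1
        # Tests (3) and (4) are written as multiplications so that an
        # empty month (smallest == 0) counts as exceeding any ratio.
        # That handling is an assumption, not something stated in the post.
        if biggest >= 5 * smallest:        # max/min >= 5
            hits["ratio_ge_5"] += 1
        if 2 * biggest >= 9 * smallest:    # max/min >= 4.5
            hits["ratio_ge_4_5"] += 1

    return {name: count / n_sims for name, count in hits.items()}


# A smaller run than the 100,000 used above, so that it finishes quickly.
results = simulate_birth_months(n_sims=10_000, seed=42)
for name, fraction in results.items():
    print(f"{name}: {fraction:.1%}")
```

Raising `n_sims` to 100,000 matches the number of repetitions described above, at the cost of a longer run time in pure Python.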
What conclusions can we draw from all of this discussion of "fooled by randomness"? What are the traps that we can fall into?