A Data Science Central Community
Randomness is all around us. It refers to the absence of patterns, order, coherence, and predictability in a system, and its existence sends fear into the hearts of predictive analytics specialists everywhere -- if a process is truly random, then it is not predictable, in the analytic sense of that term.
Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such "ordered" patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: "Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory." The message was clear -- beware of apparent order in a random process, and don't be tricked into developing a theory to explain random data.
One of the most likely ways for randomness to induce a lapse in rational thinking is through small-numbers phenomena. For example, suppose that I ask 12 people which NFL football team they like the most, and they all say Baltimore Ravens. Is that a statistical fluke, a fair statement about the national sentiment, or a selection effect (since all 12 people that I asked actually live in Baltimore)? The answer is probably the last. Okay, this example may be too obvious. So, consider the following less obvious example:
Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin). Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?
(a) HTHTHTHTHTHH
(b) TTTTTTTTTTTT
(c) HHHHHHHHHHHT
(d) None of the above.
In each case, a coin toss of head is listed as "H", and a coin toss of tail is listed as "T".
The answer is "(d) None of the Above."
None of the above sequences was generated manually. They were all actual subsequences extracted from a longer sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here -- it corresponds to the fact that when only 12 coin tosses are considered, the occurrence of any "improbable result" may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continue for dozens more coin tosses (nothing but Tails, all the way down), then that would be truly significant.
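The two claims above can be checked with a quick Monte Carlo sketch. This is a minimal illustration in Python (the seed and the sequence length are my arbitrary choices, not from the post): every specific 12-toss sequence has exactly the same probability, and streaks of 12 identical tosses show up routinely inside a long enough run.

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility only

# Any *specific* 12-toss sequence -- alternating, all tails, anything --
# has exactly the same probability: (1/2)**12, about 0.024%.
p_specific = 0.5 ** 12

# Yet streaks of identical results are common inside a long run of tosses.
N = 100_000
tosses = [random.choice("HT") for _ in range(N)]

longest = current = 1
for prev, cur in zip(tosses, tosses[1:]):
    current = current + 1 if cur == prev else 1
    longest = max(longest, current)

print(f"P(any specific 12-toss sequence) = {p_specific:.6f}")
print(f"Longest identical streak in {N:,} tosses: {longest}")
```

In a run of 100,000 fair tosses, the longest identical streak is typically 16 or 17 tosses long, which is why extracting a 12-tail subsequence from such a run is easy.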
So, let's try again with another sample problem (#2) in which I truly did invent one of the three sequences (i.e., a bogus sequence that I manually typed on the computer, attempting to create my own example of a random sequence). Which one of these 50-coin toss sequences is the bogus sequence?
(a) HTHHTHHTTHHTTTHTHTHHHTHTHTHHHTTHTTTHTHTHHTTHTHTHTT
(b) HHHHHHTHTHHHHHTTTHTTTTHTTHHHHTHHHHHTHTTHHHTHHHHHHH
(c) THTTTTTTHTTTTTTTTHHHTTTTHHTTTTHHHTHHTTHHTTTTTHTTHH
For the two real (non-bogus) sequences, I used a random number generator to produce each 50-toss sequence. The random number generator (common to nearly all scientific programming environments) produces a random number between 0 and 1. I simply labeled the event as "H" when the number was 0.5 or greater, and as "T" whenever the number was less than 0.5.
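The labeling procedure just described can be sketched in a few lines. This is Python for illustration (the post does not specify a language, and the seed is an arbitrary choice for reproducibility):

```python
import random

random.seed(7)  # arbitrary seed, for reproducibility only

# Draw a uniform number in [0, 1); label the toss "H" when the number is
# 0.5 or greater, and "T" when it is less than 0.5 -- the rule described above.
sequence = "".join("H" if random.random() >= 0.5 else "T" for _ in range(50))
print(sequence)
```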
The answer to sample problem #2 is ... posted at the bottom of this post (by which point you will have probably guessed it).
This topic of "fooled by randomness" came up when I was reading an article recently on the Turing Award Winners from 1966 through 2013.
This article lists many interesting statistical facts about the 61 winners of the award. The article provides a fun, interactive data visualization built with Tableau tools in which you can explore these statistical data, which include: each winner's birth year, age at time of award, nationality, gender, and... astrological sign! Being a data scientist and astrophysicist, I found the inclusion of Zodiac sign to be disconcerting. However, the author of the original post does admit that this was included jokingly.
As you look at the data, you will see that 10 of the 61 Turing Award winners were born under one specific sign of the Zodiac, and only 2 of the 61 winners were born under another sign (in fact, two such examples exist). These questions then arise: Is there significance to this apparent correlation? Is there true order here, and not randomness? Are Capricorns really five times more likely to win future Turing Awards than Scorpios?
Of course, the answer to these questions is that the statistical distribution of astrological birth signs truly represents a purely random process, with no astrological (or astronomical) significance whatsoever. But demonstrating this fact seemed like a fun exercise for my random number generator once again.
So, I generated random birth months (1 through 12, corresponding equivalently to the 12 signs of the Zodiac) for 61 individuals. (For simplicity, I assumed that all birth months are equally likely, thus ignoring the variable lengths of the months.) I repeated this simulation 100,000 times (which almost certainly falls into that scientific data analysis category of "overkill"). I then examined how often, across the 100,000 simulations, each of the following apparent correlations appeared:
(1) We find 10 or more of the 61 individuals with the same birth month (astrological sign):
Answer: in 32% of the simulations
(2) We find 2 or fewer of the 61 individuals in any one of the birth months:
Answer: in 80% of the simulations
(3) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 5 or greater:
Answer: in 40% of the simulations
(4) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 4.5 or greater:
Answer: in 49% of the simulations
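A minimal Python sketch of this simulation follows. It assumes, as stated above, that all 12 months are equally likely; the smaller trial count and the decision to let a month with zero winners trivially satisfy the ratio tests (an infinite ratio) are my assumptions, not the post's.

```python
import random
from collections import Counter

random.seed(0)    # arbitrary seed, for reproducibility only
TRIALS = 20_000   # the post ran 100,000; fewer trials keep this sketch quick
WINNERS = 61

ten_plus = two_or_fewer = ratio_5 = ratio_45 = 0
for _ in range(TRIALS):
    counts = Counter(random.randrange(12) for _ in range(WINNERS))
    mx = max(counts.values())
    mn = min(counts.get(m, 0) for m in range(12))  # empty months count as 0
    # Assumption: a month with zero winners makes the max/min ratio infinite,
    # so it trivially satisfies both ratio thresholds below.
    ratio = mx / mn if mn else float("inf")
    ten_plus += mx >= 10
    two_or_fewer += mn <= 2
    ratio_5 += ratio >= 5
    ratio_45 += ratio >= 4.5

print(f"(1) 10+ winners in one month:  {ten_plus / TRIALS:.0%}")
print(f"(2) 2 or fewer in some month:  {two_or_fewer / TRIALS:.0%}")
print(f"(3) max/min ratio >= 5:        {ratio_5 / TRIALS:.0%}")
print(f"(4) max/min ratio >= 4.5:      {ratio_45 / TRIALS:.0%}")
```

Running this reproduces percentages in the same neighborhood as the four answers listed above.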
Therefore, it is statistically reasonable and totally expected that we would see 1 or 2 birth months that contain only two award winners. It is also statistically reasonable that we could see 5 times as many winners in the most populous month as in the least populous month. As for the first correlation (10 or more of the 61 individuals sharing a birth month), it appeared in 32% of the simulations -- a non-trivial percentage, so it is not surprising to see it occur in real life.
What conclusions can we draw from all of this discussion of "fooled by randomness"? What are the traps that we can fall into?
Comments
@Dimitrios, That is a great comment! Thank you for adding your insights and discussion of hidden statistical bias (hidden variables) that can adversely induce unscientific conclusions from some individuals.
Very interesting article. I only wanted to add that there can indeed be some real Zodiac correlations, which have nothing to do with astrology, of course.
One example is the birth dates of, e.g., German soccer players, who tend to be born at the beginning rather than the end of the year (i.e., more likely "Aquarius" or "Pisces" than "Scorpio" or "Sagittarius"). The reason is that talented children are trained in groups according to their age, with a cut-off date of January 1st. A developmental edge of a few months is quite significant for children, so children born at the beginning of the year appear more talented and receive more support. This affects their entire subsequent career and is known as the "Relative Age Effect".
I think it is very important to know about such effects since astrologers tend to misuse them for their own "theories".
(See article (in German): http://www.zeit.de/sport/2013-06/dfb-u21-nachwuchsfussball-dezember...)
@William, Thanks for your comments. My point is that astrology is a diversionary pastime (at best), not science, and that science (through modeling and simulation) can reproduce the distribution of birth months for the Turing Award winners. Nevertheless, I do agree with you that one's age relative to one's peer group when starting school is important in early development, but that difference fades with time, especially as inherent aptitudes (for sports, science, math, art, languages, innovation, etc.) start to emerge.
Sadly, you have missed a point on the "randomness" of astrological signs. It has been well documented that your age relative to your peer group on starting school is related to sports performance. The determinant of your relative age in your grade is (unsurprisingly) when in the calendar year you were born. It may be causally true that certain Zodiac signs are over- or under-represented among the Turing Award winners, since the birth-date effect (although possibly for different months) is strongly present in the NFL. I don't claim astrology is useful; I'm just pointing out that the month of your birth can have a real impact.
You capture some of the traps -- all traps in logical statistical thinking -- but there is at least one other. When we want a random number from a distribution, we are often dealing with long-tailed distributions, which roughly means that certain events are possible but very rare. However, when we are sampling on HPC machines, what was rare can and does occur if we wait long enough -- and "long enough" is short in the HPC arena. Such rare events were very rarely sampled in slower CPU environments. One has to plan for the fact that rare random events can and do occur.