A Data Science Central Community
The purpose here is to show that with big data, the risk associated with spurious correlations is high. If you are anti big-data (you don't like the hype), this is your chance to make a valid point about reckless processing of big data.
Assume we have n = 5,000 variables uniformly and independently distributed on [0,1]. In short, they are not correlated, by design. Let k be the number of observations or data points.
Figure 1: Funny discussion about spurious correlations
Let's start with k = 4 and the dummy variable X = (0.20, 0.40, 0.60, 0.80). Simulate n = 5,000 variables and compute how many will have a correlation with X above p = 0.75, in absolute value. A theoretical solution might also be available. Simulations can even be done using Excel.
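The simulation described above can be sketched in a few lines of Python instead of Excel. This is only an illustrative sketch: the helper names `pearson` and `count_spurious` and the fixed random seed are assumptions made here, not part of the original challenge.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(x, n=5000, p=0.75, seed=0):
    """Count how many of n independent uniform variables have |corr with x| > p."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    k = len(x)                 # number of observations per variable
    hits = 0
    for _ in range(n):
        y = [rng.random() for _ in range(k)]  # one simulated variable on [0, 1]
        if abs(pearson(x, y)) > p:
            hits += 1
    return hits


X = (0.20, 0.40, 0.60, 0.80)
hits = count_spurious(X)
print(f"{hits} of 5000 variables ({hits / 5000:.1%}) exceed |r| > 0.75")
```

Changing `n`, `p`, or the length of `X` lets you explore how quickly the share of spurious correlations drops as k grows.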
More complicated exercise:
Among all n(n-1)/2 correlations between these n simulated independent variables, how many will be higher than p in absolute value, theoretically? Let f(n, k, p) be this number. Can you draw the function f, maybe for different values of p?
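One possible way to approach f(n, k, p), offered as a rough sketch rather than the intended solution: if the variables were normal instead of uniform (an assumption made here, so the result is only an approximation for the uniform case), the sample correlation r of two independent samples of size k has density proportional to (1 - r^2)^((k-4)/2) on [-1, 1]. Because expectations add up even when the pairwise correlations are not independent of each other, the expected count is n(n-1)/2 times P(|r| > p). The function names `tail_prob` and `f` below are illustrative.

```python
import math


def tail_prob(k, p, steps=20000):
    """Approximate P(|r| > p) for the sample correlation of two independent
    samples of size k, under the normal-theory null density
    proportional to (1 - r^2)^((k - 4) / 2) on [-1, 1].

    Substituting r = sin(theta) turns this into a ratio of integrals of
    cos(theta)^(k - 3), evaluated here with the midpoint rule.
    """
    if k < 3:
        raise ValueError("need k >= 3 observations")

    def integral(lo, hi):
        h = (hi - lo) / steps
        return h * sum(math.cos(lo + (i + 0.5) * h) ** (k - 3) for i in range(steps))

    return integral(math.asin(p), math.pi / 2) / integral(0.0, math.pi / 2)


def f(n, k, p):
    """Expected number of the n(n-1)/2 pairwise correlations with |r| > p
    (normal-theory approximation, not an exact result for uniform data)."""
    return n * (n - 1) / 2 * tail_prob(k, p)


# Tabulate f for n = 5,000 and a few values of k and p.
for p in (0.5, 0.75, 0.9):
    print(p, [round(f(5000, k, p)) for k in (4, 6, 10, 20)])
```

Plotting `f(n, k, p)` against k for several values of p gives the curves asked for in the exercise, at least approximately.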
PS: Next week's challenge will probably involve correlations on a circle or on a sphere.
You can get spurious correlations with small data, too. The odds are higher with big data, but if the analyst doesn't understand the problem, the volume of data is irrelevant.
Here's a very interesting fact. I computed the correlations between n = 100,000 uniformly distributed, independent (simulated) random variables on [0, 1] each with k = 4 observations, and the seemingly non-random (artificial) observed vector X = (0.2, 0.4, 0.6, 0.8).
By design, the true correlation behind each of these n = 100,000 values is 0. It turns out that 20% of the computed correlations have an absolute value above 0.80, indicating a strong correlation with X. Enough said: check the results for yourself in this Excel spreadsheet (8 MB compressed with gzip).
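The 20% figure is roughly what classical theory would suggest. As a sketch under an assumption not made in the text (normally distributed rather than uniform observations), the density of the sample correlation r computed from k observations of an independent variable does not depend on the fixed vector X, and for k = 4 it is flat on [-1, 1]:

```latex
f_k(r) = \frac{(1 - r^2)^{(k-4)/2}}{B\!\left(\tfrac{1}{2}, \tfrac{k-2}{2}\right)},
\qquad -1 \le r \le 1,
\qquad\Longrightarrow\qquad
f_4(r) = \frac{1}{B\!\left(\tfrac{1}{2}, 1\right)} = \frac{1}{2},
\qquad
P(|r| > 0.80) = 2 \int_{0.80}^{1} \tfrac{1}{2} \, dr = 0.20 .
```

With uniform data this is only an approximation, but it is consistent with the simulated 20%.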
These numbers barely change if you replace X = (0.2, 0.4, 0.6, 0.8) by X = (0.0, 0.0, 0.0, 0.25) or by any arbitrary 4-D vector with components between 0 and 1. So how do you come up with a better correlation definition to eliminate spurious correlations in big data? That's the subject of our next article, and we will see that it's a bit easier if k is bigger than 4. In short, the new correlation is robust; it is a blend of standard (robust) regression with a comparison of the two auto-correlations of lag 1 (and maybe also lag 2) associated with the two time series being compared. More on this later. These are the kinds of techniques that will be used in automated data science.
The graphical display of the "voters" data presented by Dr. Granville is very impressive. However, I do not quite see why one would conclude that the correlation aspect of this data is the thing to study. What I see is a two-dimensional representation of a large data set, with a two-way dichotomy resulting in four quadrants. Each quadrant contains clusters of individuals who are either Democrats or Republicans and who either voted or did not vote. As an applied statistician and not a data scientist (let me emphasize that, before I get into more big trouble), what I see is a data set of conditional events. The normal question for a statistician like me would be: what is the probability that a person selected at random from this data set is a Republican who voted and likes Bud? Obviously, this is a conditional probability that can be derived from the data. Visually, it could perhaps be represented with a small additional graph of a Bayesian decision tree, updated as the data keeps growing and the cluster sizes change. Continuous updating of the prior knowledge with the incoming data would therefore be advisable for deriving the posterior conditional probabilities.
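To make the kind of calculation described above concrete, here is a minimal sketch. The counts are made-up toy numbers, not the "voters" data, and the record layout `(party, voted, likes_bud)` is an assumption chosen for illustration. It shows two readings of the question: the joint probability of all three traits, and the conditional probability of liking Bud given a Republican who voted.

```python
from collections import Counter

# Hypothetical toy records (party, voted, likes_bud); NOT the "voters" data.
records = (
    [("R", True, True)] * 30 + [("R", True, False)] * 20 +
    [("R", False, True)] * 10 + [("R", False, False)] * 15 +
    [("D", True, True)] * 12 + [("D", True, False)] * 38 +
    [("D", False, True)] * 5 + [("D", False, False)] * 20
)
counts = Counter(records)
total = sum(counts.values())

# Joint probability: Republican AND voted AND likes Bud.
p_joint = counts[("R", True, True)] / total

# Conditional probability: likes Bud GIVEN a Republican who voted.
n_rep_voted = counts[("R", True, True)] + counts[("R", True, False)]
p_cond = counts[("R", True, True)] / n_rep_voted

print(f"P(R, voted, Bud)  = {p_joint:.3f}")
print(f"P(Bud | R, voted) = {p_cond:.3f}")
```

Recomputing these ratios as new records arrive is the simplest form of the continuous updating mentioned above.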
I would not even venture to study the correlation. It will yield nothing. Instead, I would try to find out whether the voter was male or female, which age group they belong to, and what kind of beer they like.
In other words, refine the data collection procedure. That would require redesigning the original table, or perhaps adding branches to the Bayesian network representation ahead of the dichotomy, namely the political affiliation of individuals.
I cannot overlook the fact that the data is time dependent: for example, a new beer brand suddenly becomes popular, or the number of registered voters changes. As a statistician (and, again, not as a data scientist), I would be interested in finding out how the voting records of individuals develop over time and how the clusters reshape. These are events that are conditional in a probabilistic sense.
It all depends on the final objective of the study.
The objective of a data scientist might be entirely different from that of a statistician, now that I know one is not necessarily the same as the other.