A Data Science Central Community
With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of best predictive set of metrics.
In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following
Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks, and benefits from the same cures as traditional correlation. So we will focus here on the correlation.
To illustrate the first condition (dependence on n), let's consider the following made-up data set with two paired variables or time series X, Y:
Here n=8 and r (the traditional correlation) is equal to r=0.22. If n=7 (delete the last observation), then r=0. If n=6, r=0.29. Clearly we observe high correlations when n is even, although they slowly decrease to converge to 0, for large values of n. If you shift Y one cell down (assuming both X and Y are infinite time series), then correlations for n even are now negative! However, this (X, Y) process is supposed to simulate a perfect 0 correlation. The traditional correlation coefficient fails to capture this pattern, for small n.
This is a problem with big data, where you might have tons of observations (monthly home prices and square feet for 300 million homes) but you compute correlations on small buckets (for instance, for all specific types of houses sold in 2013 in a particular zip code), to refine home value estimates for houses not on the market, by taking into account local fluctuations. In particular, comparisons between buckets of different sizes become meaningless.
Our starting point to improve the standard correlation will be Spearman's rank correlation. It is the traditional correlation, but measured on ranked variables. All correlations from the new family that I will introduce in the next section, are also based on rank statistics. By working on ranked variables, you satisfy conditions #3 and #4 (though #4 will be further improved with my new correlation). Let's denote Spearman's coefficient as s.
It is easy to prove (see Wikipedia article) that:
s = 1 - S(X, Y)/q(n)
Of course s satisfies condition #2, by construction. A very interesting and important fact, and a source of concerns, is that q(n) = O(n^3). In short, q(n) grows too fast.
1. A new family of rank correlations
Without loss of generality, from now on we can now assume that X is ordered and thus x(j) = j. Our new correlation will be denoted as t, or t(c) to emphasize the fact that it is a family of correlations governed by the parameter c. Bayesian statisticians might view c as a prior.
The new correlation is computed as follows:
Explanations and discussion
Distribution of correlation vector (s, t), for n=9 . It shows all n! = 362,880 potential correlation combinations
Extreme case: s = 0.17, t = -0.26 (n=9). Which one makes sense, s or t? Both are not statistically different from 0
2. Asymptotic behavior and normalization
Again, we assume that t is defined using c=1. Simulations where X = (0, 1, ..., n-1) and Y is a random permutation of n elements (obtained by first creating n random numbers then extracting their rank) show that
Finally, when t and s are quite different, usually the t value looks more natural, as in the above chart: t<0 but s>0 because t is better at eliminating the effect of the two outliers (top right corner, bottom left corner). This is critical when you compute correlations across thousands of data buckets of various sizes: you are bound to find outliers in some buckets. And if you look for extreme correlations, your conclusions might be severely biased: read the curse of big data for explanations.
Below is the distribution for T, when n=10. When n=10, 97.2% of t values are between -0.5 and +0.5. When n=7, 92.0% of t values are between -0.5 and +0.5. So the distribution of t depends on n. Theoretically, when c gets very close to 0, what happens?
Histogram for T (n=10)
Interesting fact: t and s agree most of the time. The classic correlation r between s and t, computed on all n! = 362,880 potential (X, Y) vectors with n=9 observations, is equal to 0.96.
To estimate the asymptotic distribution of t (when n becomes large), one needs to compute q(n). All we know so far is that q(n)=O(n^2). Also, when n is a multiple of 3, then q(n)=(n-6)+(n^2)/3. Is there an exact formula if n is not a multiple of 3? We are going to ask this question on Kaggle. If no exact formula is known, a simple solution would consist in using an approximate t, based on q(n)=(n-6)+(n^2)/3 and T truncated to (n-6)+(n^2)/3.
Because the correlation is in the range [-1, +1] and is thus bounded, we would like a correlation of (say) 0.5 to always represent the same percentile of t, regardless of n. There are various ways to accomplish this, but it always involves transforming t. Some research still needs to be done. For r or s, one typically uses the Fisher transformation. But it will not work on t, plus it transforms a bounded metric in a non-bounded one, making interpretation difficult.
Finally, most of this research and these simulations involve the generation of a large number of permutations for small and large n. Here I discuss how to generate these permutations.
One way is to produce all permutations: click here to download the source code. However, it becomes very slow when n is larger than 20, although it is easy to use Map Reduce to split the task on multiple virtual machines, in parallel. Another strategy is to sample enough random permutations to get a good approximation of t's distribution, as well as q(n). There is a one-to-one mapping between permutations of n elements, and integer numbers between 1 and n! Each of these numbers can be uniquely represented by a n-permutation, and the other way around. See questions 79 and 80 in our 66 job interview questions for data scientists for details. Check the numbering permutations section in the Wikipedia article on permutations, for details. I will write a simple, easier to read version of this algorithms. In short, it is very slow: O(n^2) to transform a number into a permutation or the other way around. So instead, I used a rudimentary approach (click here to download my source code), which is just as fast:
Algorithm to generate a random permutation (p(0), p(1), ... , p(n-1))
For k=0 to n-1, do:
Question: what is the computational complexity of this algorithm? Answer: read here.
2.3. Final comments