A Data Science Central Community
This seminal article highlights the dangers of reckless applications and scaling of data science techniques that have worked well for small, medium-size and large data. We illustrate the problem with flaws in big data trading, and propose solutions. Also, we believe expert data scientists are more abundant (but very different) than what hiring companies claim: read our "related articles" section at the bottom for more details. This article is written in simple English, is very short and contains both high level material for decision makers, as well as deep technical explanations when needed.
In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be
The questions is: how do you discriminate between a real and accidental signal in vast amounts of data?
Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000*999/2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google 2 days ago), then we are dealing with many, many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will loose. Keep in mind that analyzing cross-correlations on all metrics is one of the very first step statisticians do at a beginning of any project - it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provide a much more robust mechanism to identify true relationships.
To illustrate the issue, let's say that you have k time series, each with n observations, for instance, price deltas (price increases or decreases) computed for k different stock symbols with various time lags over a same time period consisting of n days. For instance, you want to detect patterns such as "When Google stock price goes up, Facebook goes down one day later". In order to detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds depending on whether you do daily trading or extremely fast intraday, high frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (when your competitors detect the patterns in question, it stops becoming profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 2-year time period). In some cases n might be much larger, but in that case the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.
Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.
Let's introduce a new parameter, denoted as m, representing the number of paired (bi-variate), independent time series selected out of the set of k time series at our disposal: we want to compute correlations for these m pairs of time series. Theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being e.g. a simulated normalized stock price at a given time for two different stocks), what are the chances that among the m correlation coefficients, at least one is higher than 0.80?
Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, let's assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically, that is approximately when n is not small) normal with mean = 0 and variance = 1/(n-1). The probability that r is larger than a given large number a (say a=0.80, meaning a strong correlation) is p=P(r>a) with P representing a normal distribution with mean = 0 and variance = 1/(n-1). The probability that, among the m bivariate (paired) time series, at least one has of correlation above a=0.80 is thus equal to 1-[(1-p)^m], that is, 1 minus (1-p) at power m.
For instance,
Now in practice the way it works is as follows: you have k metrics or variables, each one providing a time series computed at n different time intervals. You compute all cross-correlations, that is m = k*(k-1)/2. However the assumption of independence between the m paired time series is now violated, thus concentrating correlations further away from a very high value such as a=0.90. But also, your data is not random numbers, it's not white noise. So the theoretical correlations are much above absolute 0, maybe around 0.15 when n=20. Also m will be much higher than (say) 10,000 or 100,000 even when you have as few as k=1,000 time series (say one time series for each stock price). These three factors (non independence, theoretical r different from 0, very large m) balance out and make my above computations still quite accurate when applied to a real typical big data problem. Note that I computed my probabilities using the online calculator stattrek.
Conclusion: hire the right data scientist before attacking big data problems. He/she does not need to be highly technical, but able to think in a way similar to my above argumentation, to identify possible causes of model failures before even writing down a statistical or computer science algorithm. Being a statistician helps, but you don't need to have advanced knowledge of stats. Being a computer scientist also helps to scale your algorithms and make them simple and efficient. Being a MBA analyst also helps to understand the problem that needs to be solved. Being the three types of guy at the same time is even far better. And yes these guys do exist and are not that rare.
Exercise:
Let's say you have 3 random variables X, Y, Z with corr(X,Y)=0.70, corr(X,Z)=0.80. What is the minimum value for corr(Y,Z). Can this correlation be negative?
Related articles:
Comment
Quite apart form whether variance or standard deviation is sensitive to outliers, there are very specific calculations which require it, and if they do, it should be calculated in a numerically stable manner.
Whoever wrote that variance computation code in Mahout is like a grade school child using his lunch knife to perform brain surgery - it may not come out well, but (s)he does know how to use a knife after all, so (s)he is qualified, so (s)he thinks.
@Mark: I would also question the use of E(X)^2-E(X^2), no matter how stable the computation is. Variance uses squares, which tends to make this metric sensitive to outliers, essentially squaring big deviations. In large data sets, there will be some big outliers, and they will render this metric useless.
Following up on Mirko's comment, has the numerically unstable computation of sample variance in Hadoop/Mahout by E(X)^2-E(X^2) been replaced by a numerically stable computation yet? Numerically stable one-pass methods for computing the sample variance have been known for 51 years, see http://webmail.cs.yale.edu/publications/techreports/tr222.pdf , and are readily parallelizable. Nevertheless, Hadoop/Mahout is undoubtedly an excellent tool if your goal is to very rapidly calculate the wrong answer on massive data sets.
The numerical method used to calculate sample variance is THE first thing I check when given access to source code. And hey sports fans, if the program authors don't get that right, it's usually just the tip of the iceberg, and symptomatic of software written by person(s) illiterate in numerical mathematical computation. The finite precision nature of floating point computation must be taken into account if effective and numerically stable and reliable software is to be produced. If the computations were done in higher precision, such as quad precision, the numerical instability "day of reckoning" can be pushed out a ways, but eventually it comes, and when it does, things go downhill in a hurry.
For a given number of floating point digits, all else being equal, the larger the data set, the less you can get away with using numerically unstable algorithms. Ideally, calculations on big data should be done in quad precision using numerically stable algorithms, and barring that, performed in double precision using numerically stable algorithms. Otherwise, people are just maximizing the speed at which they produce unreliable answers. You might think I'm just being a nitpicker, as real data is only an approximation anyhow. Really? What would you think if numerical variance came out negative? That's significant, and why to this day, I still see codes, some V V &A'd by the government for use in safety critical applications, which calculate standard deviation as sqrt(abs(variance)) or sqrt(max(variance, 0)), which is the "fix" someone put in after getting a sqrt of a negative argument error message.
Great article!
I agree totally that in a larger dataset the probabilities of finding spurious/accidental correlations are higher than in a smaller dataset. In this context, "larger" implies higher k. But when you state "curse of big data is very acute when n is smaller than 200", are we even working with big data?
I think the problem is two-fold:
1) Statisticians have not been involved in the big data revolution. Some have written books such as applied data science, but it's just a repackaging of very old stuff, and has nothing to do with data science. Read my article on fake data science, athttp://www.analyticbridge.com/profiles/blogs/fake-data-science
2) Methodologies that work for big data sets - as big data was defined back in 2005 (20 million rows would qualify back then) - miserably fail on post-2010 big data (terabytes). Read my article on the curse of big data, athttp://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data
As a result, people think that data science is just statistics, with a new name. They are totally wrong on two points: they confuse data science and fake data science, and they confuse big data 2005 and big data 2013.
Other examples of misuses:
Good news: eventually these issues will be fixed. It's easy to fix them.
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge