A Data Science Central Community
We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.
MapReduce is a technique that splits big data sets into many smaller ones, process each small data set separately (but simultaneously) on different servers or computers, then gather and aggregate the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.
MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.
However, when you need to process variables or data sets jointly, that is 2 by 2 or or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come with a more sophisticated solution.
The Problem
Let's say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) - e.g. if Google is up today, Facebook is up tomorrow.
You have to compute k * (k-1) /2 correlations to solve this problem, despite the fact that you only have k=10,000 stock symbols. You can not spit your 10,000 stock symbols in 1,000 clusters, each containing 10 stock symbols, then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster, and another one in another cluster (because you have far more correlations to compute than you have clusters). These cross-clusters computations makes MapReduce useless in this case. The same issue arises if you replace the word "correlation" by any other function, say f, computed on two variables, rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can't help. I'll discuss another example (keyword taxonomy) later in this article.
Three Solutions
Here I propose three solutions:
1. Sampling
Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increase the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide with variables to keep, which ones to add, to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs (of variables).
I'll soon publish an article that shows how approximate solutions (a local optimum) to a problem, requiring a million time less computer resources than finding the global optimum, yield very good approximations with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2x2 combinations (as in this correlation issue), but 3x3, 4x4 etc. to find very high quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.
2. Binning
If you can bin your variables in a way that makes sense, and if n is small (say=5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, there are at most 512 potential pairs of value. An example of such a pair is {(up, up, down, up, down), (up, up, up,down, down)} where the first 5 values correspond to a particular stock, and the last 5 values to another stock. It is thus easy to pre-compute all 512 correlations. You will still have to browse all k * (l-1) / 2 pairs of stocks to solve you problem, but now it's much faster: for each pair you get the correlation from the lookup table - no computation required, only accessing a value in a hash table or an array with 512 cells.
Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs migh be faster than using lookup tables to access 512 pre-computed correlations. However, the principle works regardless as to whether you compute a correlation, or much more complicated function f.
3. Classical data reduction
Traditional reduction techniques can also be used: forward or backward step-wise techniques where (in turn) you add or remove one variable at a time (or maybe two). The variable added is chosen to maximize the resulting entropy, and conversely for variables being removed. Entropy can be measured in various ways. In a nutshell, if you have two data subsets (from the same large data set),
Then you can say that the extra 400 variables (e.g. stocks symbols) in set B don't bring any extra predictive power and can be ignored. Or in other words, the lift obtained with the set B is so small that it's probably smaller than the noise inherent to these stock price signals.
Note: An interesting solution consists of using a combination of the three previous strategies. Also, be careful to make sure that the high correlations found are not an artifact caused by the "curse of big data" (see reference article below for details).
Another example where MapReduce is of no use
Building a keyword taxonomy:
Step 1:
You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword, and for each "keyword pair". A "keyword pair" is two keywords found on a same web page, or close to each other on a same web page. Also by keyword, I mean stuff like "California insurance", so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many million keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.
A="California insurance", B="home insurance", x=543, y=998, z=11
where
This "keyword pair" table can indeed be very easily and efficiently built using MapReduce. Note that the vast majority of keywords A and B do not form a "keyword pair", in other words, z=0. So by ignoring these null entries, your "keyword pair" table is still manageable, and might contain as little as 50 million entries.
Step 2:
To create a taxonomy, you want to put these keywords into similar clusters. One way to do it is to compute a dissimilarity d(A,B) between two keywords A, B. For instances d(A, B) = z / SQRT(x * y), although other choices are possible. The higher d(A, B), the closer keywords A and B are to each other. Now the big problem is to perform clustering - any kind of clustering, e.g. hierarchical - on the "keyword pair" table, using any kind of dissimilarity. This problem, just like the correlation problem, can not be split into sub-problems (followed by a merging step) using MapReduce. Why? Which solution would you propose in this case?
Related articles:
Comment
The first problem is interesting because the same observation that lets us reduce k^2 comparisons to k * (k-1) /2 can be used to partition the list not into clusters with the same number of members, but into clusters with about the same number of comparisons.
If you have a list of ten stocks, you have to compare the first stock against 9 others (stocks 2-10), the second stock against 8 others (stocks 3-10), etc. You can walk down that decreasing number of comparisons to map your initial list onto clusters representing an approximately equal number of comparisons, calculate a relatively small number of correlations on each server, and reduce the overall data set to only the significant ones.
I sketched out an algorithm way in the back of my dissertation long ago (Appendix B). http://goo.gl/pW7yx.
More precision, following a comment by a reader, on LinkedIn:
Here we are talking about self-joining 10,000 data sets with themselves, each data set being a time series of 5 or 10 observations. My point in my article is that the distributed architecture of Map Reduce does not bring any smart processing of these computations. Just brute force (better than a non-parallel approach, sure) but totally incapable of removing the very massive redundancy involved in these computations (as well as in storage redundancy, since the same data set must be stored on a bunch of servers to allow for the computations of all cross-correlations).
This (very good) article underscores the importance of a major upcoming change to Hadoop - the introduction of YARN which will open up Hadoop's distributed data and compute framework for use with non-MapReduce paradigms and frameworks. One such example - RushAnalytics which, incidentally, also adds fine-grained parallelism within each node of a cluster.
Hey Vincement,
Your points resonate with my personal experience in implementing algrithms on map-reduce framework.
The rigid framework and data structure (key and value pairs) made it challenging to handle complicated algorithm implementation in efficiency. There is a book, Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullm from Stanford Univ, the book dedicates to discuss some datamining algorithm design and implementation on Hadoop.
Great article. Hadoop is good for lots of things and the only reasonable choice for some things, but it's credibility is only hurt when it is used or promoted for the things it can't do. You do the entire Hadoop community a great service by providing such a high resolution mapping of some of its deeper pitfalls.
I think one of the takeaways here is that often with a fraction of the data you get very good results, IF you perform data reduction in a proper way. I am not sure that the concept of exactness is all that meaningful here anyway, since we are dealing with noisy data in almost all practical cases. The question is more along the lines of: is there meaningful insight to be gained by using ALL data as opposed to employing a smart sampling strategy.
Take a look at Carlos Guestrin's work on "GraphLab". He talks about MapReduce as being data parallel but many problems, as Vincent has illustrated here, are graph parallel. Graphlab is a framework for large scale computing in graph parallel problems.
Hi Vincent,
Really an interesting insight and great propositions for solving the case. However, would you like to talk about any alternative "Big Data Architectural" framework that would be able to solve for the case "exactly" rather than trying an "approximation" involved in these methods?
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge