Comments - The curse of big data - AnalyticBridge2019-02-16T13:07:35Zhttps://www.analyticbridge.datasciencecentral.com/profiles/comment/feed?attachedTo=2004291%3ABlogPost%3A226782&xn_auth=noQuite apart form whether vari…tag:www.analyticbridge.datasciencecentral.com,2013-12-22:2004291:Comment:2839452013-12-22T04:27:28.760ZMark L. Stonehttps://www.analyticbridge.datasciencecentral.com/profile/MarkLStone
<p>Quite apart from whether variance or standard deviation is sensitive to outliers, there are very specific calculations which require it, and when they do, it should be calculated in a numerically stable manner.</p>
<p>Whoever wrote that variance computation code in Mahout is like a grade school child using his lunch knife to perform brain surgery - it may not come out well, but (s)he does know how to use a knife after all, so (s)he is qualified, so (s)he thinks.</p> @Mark: I would also question…tag:www.analyticbridge.datasciencecentral.com,2013-12-22:2004291:Comment:2838572013-12-22T03:47:53.885ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>@Mark: I would also question the use of E(X^2)-E(X)^2, no matter how stable the computation is. Variance uses squares, which makes this metric sensitive to outliers because big deviations are squared. In large data sets, there will be some big outliers, and they will render this metric useless.</p> Following up on Mirko's comme…tag:www.analyticbridge.datasciencecentral.com,2013-12-22:2004291:Comment:2840332013-12-22T01:26:13.159ZMark L. Stonehttps://www.analyticbridge.datasciencecentral.com/profile/MarkLStone
<p>Following up on Mirko's comment, has the numerically unstable computation of sample variance in Hadoop/Mahout by E(X^2)-E(X)^2 been replaced by a numerically stable computation yet? Numerically stable one-pass methods for computing the sample variance have been known for 51 years (see <a href="http://webmail.cs.yale.edu/publications/techreports/tr222.pdf" target="_blank">http://webmail.cs.yale.edu/publications/techreports/tr222.pdf</a>) and are readily parallelizable. Nevertheless, Hadoop/Mahout is undoubtedly an excellent tool if your goal is to very rapidly calculate the wrong answer on massive data sets.</p>
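<p>As a rough illustration of what a stable, parallelizable one-pass method can look like, here is a minimal Python sketch. It uses the update usually attributed to Welford and a pairwise merge of partial summaries of the kind usually attributed to Chan, Golub and LeVeque. The names <code>RunningVariance</code> and <code>summarize</code> are made up for this example, and this is not a claim about what the linked report recommends or what any Mahout release does.</p>

```python
from dataclasses import dataclass


@dataclass
class RunningVariance:
    """Count, mean, and sum of squared deviations from the mean (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> None:
        # One-pass update: only differences from the running mean are
        # squared, so large offsets in the data never get squared.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningVariance") -> "RunningVariance":
        # Combine two partial summaries, e.g. from two workers or chunks.
        n = self.n + other.n
        if n == 0:
            return RunningVariance()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningVariance(n, mean, m2)

    def sample_variance(self) -> float:
        if self.n < 2:
            raise ValueError("sample variance needs at least two values")
        return self.m2 / (self.n - 1)


def summarize(values) -> RunningVariance:
    """Hypothetical helper: fold an iterable of numbers into one summary."""
    acc = RunningVariance()
    for v in values:
        acc.add(v)
    return acc
```

<p>Each worker would call <code>summarize</code> on its own chunk, and the partial results would then be combined with <code>merge</code>, which is what makes the method fit a map/reduce style of computation.</p>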
<p>The numerical method used to calculate sample variance is THE first thing I check when given access to source code. And hey sports fans, if the program authors don't get that right, it's usually just the tip of the iceberg, and symptomatic of software written by person(s) illiterate in numerical computation. The finite-precision nature of floating point computation must be taken into account if numerically stable and reliable software is to be produced. If the computations are done in higher precision, such as quad precision, the numerical instability "day of reckoning" can be pushed out a ways, but eventually it comes, and when it does, things go downhill in a hurry.</p>
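<p>To make the finite-precision point concrete, here is a small Python sketch comparing the naive mean-of-squares minus squared-mean formula with a two-pass computation on values offset by about one billion. The function names are invented for this example, and Python's <code>decimal</code> module at 34 significant digits is used only as a stand-in for quad precision, which is an assumption of this sketch rather than anything from the discussion above.</p>

```python
from decimal import Decimal, localcontext


def naive_variance(xs):
    """Population variance as mean of squares minus squared mean."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean


def two_pass_variance(xs):
    """Population variance from deviations about a precomputed mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def naive_variance_high_precision(xs, digits=34):
    """Same naive formula, evaluated with more significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return naive_variance([Decimal(x) for x in xs])


# Four values whose true population variance is 22.5.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
```

<p>With this data the squares are near 1e18, which is more than a double's 53-bit significand can hold exactly, so in double precision the naive formula loses the small true variance to rounding while the two-pass version recovers 22.5. Evaluating the same naive formula with 34 digits also recovers 22.5 here, but a larger offset or more data would eventually exhaust those digits too.</p>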
<p>For a given number of floating point digits, all else being equal, the larger the data set, the less you can get away with using numerically unstable algorithms. Ideally, calculations on big data should be done in quad precision using numerically stable algorithms or, failing that, in double precision using numerically stable algorithms. Otherwise, people are just maximizing the speed at which they produce unreliable answers. You might think I'm just being a nitpicker, as real data is only an approximation anyhow. Really? What would you think if the computed variance came out negative? That's significant, and it is why, to this day, I still see codes, some VV&A'd by the government for use in safety critical applications, which calculate standard deviation as sqrt(abs(variance)) or sqrt(max(variance, 0)), which is the "fix" someone put in after getting a sqrt of a negative argument error message.</p> Great article!
I agree totall…tag:www.analyticbridge.datasciencecentral.com,2013-08-16:2004291:Comment:2639662013-08-16T19:38:36.035ZBR Deshpandehttps://www.analyticbridge.datasciencecentral.com/profile/BRDeshpande
<p>Great article!</p>
<p>I agree totally that in a larger dataset the probabilities of finding spurious/accidental correlations are higher than in a smaller dataset. In this context, "larger" implies higher k. But when you state "<span>curse of big data is very acute when n is smaller than 200", are we even working with big data? </span></p> I think the problem is two-f…tag:www.analyticbridge.datasciencecentral.com,2013-08-01:2004291:Comment:2594622013-08-01T03:18:46.219ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span> </span><span class="comment-body">I think the problem is two-fold: <br/><br/>1) Statisticians have not been involved in the big data revolution. Some have written books such as applied data science, but it's just a repackaging of very old stuff, and has nothing to do with data science. Read my article on fake data science, at <a href="http://www.linkedin.com/redirect?url=http%3A%2F%2Fwww%2Eanalyticbridge%2Ecom%2Fprofiles%2Fblogs%2Ffake-data-science&urlhash=DIzf&_t=tracking_disc" target="_blank">http://www.analyticbridge.com/profiles/blogs/fake-data-science</a> <br/><br/>2) Methodologies that work for big data sets - as big data was defined back in 2005 (20 million rows would qualify back then) - fail miserably on post-2010 big data (terabytes). Read my article on the curse of big data, at <a href="http://www.linkedin.com/redirect?url=http%3A%2F%2Fwww%2Eanalyticbridge%2Ecom%2Fprofiles%2Fblogs%2Fthe-curse-of-big-data&urlhash=YOC4&_t=tracking_disc" target="_blank">http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data</a> <br/><br/>As a result, people think that data science is just statistics, with a new name. They are totally wrong on two points: they confuse data science and fake data science, and they confuse big data 2005 and big data 2013.</span></p> Other examples of misuses:
F…tag:www.analyticbridge.datasciencecentral.com,2013-01-07:2004291:Comment:2268942013-01-07T02:20:12.358ZMirko Krivanekhttps://www.analyticbridge.datasciencecentral.com/profile/MirkoKrivanek
<p>Other examples of misuses:</p>
<ol>
<li><strong>Flaw in Google's keyword relevancy algorithms</strong> due to complexity of text data. When you enter a search query such as "mining data" on Google (that is, data about mining companies), it returns search results about "data mining": the search relevancy algorithm is flawed. This Google algorithm maps the user query to an internal Google indexed keyword (the keyword index is used to associate URLs with their relevant keywords). To create the mapping, the user query is first standardized: typos are fixed, unimportant words ("or", "the", "and", etc.) are removed, plurals are ignored, -ing endings are stripped ("booking" could become "book" unless Google uses a table of words where -ing can't be removed), and finally only one of the n! possible orderings of the indexed keyword's tokens (where n = number of tokens in the indexed keyword) is stored in the keyword index table: the one where all tokens are listed in alphabetical order. The solution consists of keeping every ordering that has large volume (usually 1, 2 or 3 at most, out of n!): in the case of "data mining", both "data mining" and "mining data" should be kept in the keyword index, and thus treated separately.</li>
<li><strong>Highly unstable numerical computations in Hadoop/Mahout</strong>, reminding me of the inaccuracies in Excel statistical computations. If you look up the way Hadoop/Mahout computes the variance, it uses the naive E(X^2)-E(X)^2 approach, which is fundamentally flawed because it suffers from catastrophic cancellation. (See the numerical stability presentation at ICDE 2012.) Nobody has noticed or fixed this yet! It's a serious flaw: the computed variance is numerically unstable if your data is not centered around 0 (as with timestamps). Heck, it can even become negative. There was too much attention to scalability and "volume", and no "verification".</li>
<li><strong>Spam detection, user reviews and recommendations based on crowd sourcing</strong>. Most of these technologies lack "<a href="http://www.analyticbridge.com/profiles/blogs/yelp-hit-with-a-class-action-lawsuit-for-its-rotten-data-mining-a" target="_blank">fake review</a>" and "bogus spam report" detection algorithms, resulting in tons of false positives and false negatives. A bogus spam report is (for instance) a Gmail user flagging an email message as spam, by error or on purpose (to hurt a competitor). The reverse also happens: a Gmail spammer creating dozens of Gmail accounts, and flagging all his spam messages as "not spam" (that is, moving his spam from SpamBox to InBox in dozens of bogus Gmail accounts created for that very purpose, hoping it will clear spam flags on all Gmail accounts). Big data needs to be much smarter about detecting these simple tricks.</li>
</ol>
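<p>The keyword-index idea in the first item can be sketched in a few lines of Python. This is only an illustration of the description above, not Google's actual implementation: the stop-word list, the function names and the <code>min_volume</code> threshold are all invented for the example, and typo correction, plural handling and -ing stripping are left out.</p>

```python
from collections import Counter

# Tiny illustrative stop-word list, taken from the examples in the text.
STOP_WORDS = {"or", "the", "and"}


def tokens(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


def sorted_key(query):
    """Alphabetical ordering of tokens: the single stored combination."""
    return tuple(sorted(tokens(query)))


def ordered_key(query):
    """Tokens in the order the user typed them."""
    return tuple(tokens(query))


def build_index(query_log, min_volume):
    """Map each sorted key to every ordering seen at least min_volume times."""
    counts = Counter(ordered_key(q) for q in query_log)
    index = {}
    for ordering, volume in counts.items():
        if volume >= min_volume:
            index.setdefault(tuple(sorted(ordering)), set()).add(ordering)
    return index
```

<p>With only <code>sorted_key</code>, "mining data" and "data mining" collapse to the same entry. <code>build_index</code> keeps both orderings under that entry when each is frequent enough in the (hypothetical) query log, so they can be treated separately.</p>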
<p>Good news: eventually these issues will be fixed. It's easy to fix them.</p>