A Data Science Central Community
According to Emcien, data mining is almost dead. Do you agree with this statement?
Here's my answer:
I don't think data mining is dead. It's been renamed big data / data science, although data science is much more than data mining: 20% of data science is pure data mining, that is, exploratory analysis to detect patterns, clusters, etc., in order to develop automated predictive or visual solutions, or automated systems such as fraud detection, automated bidding, or customized recommendations (books, restaurants, Facebook friends, etc.) for targeted individuals. Indeed, data mining is at the very core of data science, together with data architecture / data warehousing / database modeling.
What's your opinion?
1. To say there is too much data to do data mining is a bit like saying there is too much ocean to do shipping--where to start?
2. Data mining sets in place automated processes precisely because the size and growth rate of data is so huge.
3. The Emcien author has a limited view of data mining. I distinguish between statistical modeling, which has a cognitive basis--formation and testing of a hypothesis--and data mining, which is non-cognitive, allowing a computer to discern patterns too complicated or numerous for a human mind to detect.
The confusion in the article is definitional, as you note. Properly understood, the article is an argument for more and continued data mining, not its obsolescence.
Hi Sam, do you think that sampling is part of data mining too (I do)? Also, I disagree that there's too much data. You can run R, Python or Perl on a distributed architecture to efficiently process and analyze billions of data rows.
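To make the "billions of rows on a distributed architecture" point concrete, here is a minimal map-reduce-style sketch in plain Python. It is illustrative only: the function names (`chunk_stats`, `distributed_mean`), the chunk size, and the use of a thread pool are my assumptions, a stand-in for a real cluster where each chunk would live on a separate node.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_stats(chunk):
    # "Map" step: compute partial results on one chunk of rows
    return (len(chunk), sum(chunk))

def distributed_mean(rows, n_workers=4, chunk_size=10_000):
    # Partition the rows into chunks; each worker handles a chunk independently.
    # (Here workers are threads; on a real cluster each chunk would sit on a node.)
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(chunk_stats, chunks))
    # "Reduce" step: combine partial counts and sums into the global mean
    total_n = sum(n for n, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_n

print(distributed_mean(range(1_000_000)))  # mean of 0..999999 -> 499999.5
```

The same partition / map / reduce shape is what frameworks like Hadoop apply at scale; nothing about the analysis itself changes, only where the chunks live.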
Interestingly, more data is sometimes worse than less data. An example where "more data" failed is the spiralup Botnet (http://www.datashaping.com/ppc7.shtml), where fraudulent activity was detected on a very small data set using very few metrics. Google, despite its gigantic data set collected on trillions of clicks, failed to detect this massive fraud.
1. I have no doubt you are more technically adept than I am, so let's take that as given.
2. You raise the interesting question of the optimal size of a data set--I don't know the answer. We know that fewer than 30 observations, perhaps fewer than 50, creates small-sample problems. But I am not sure of the optimal limit on the other end. If the data set is larger than that optimal limit, we would certainly want to sample. In addition, there is the question of limited processing power, which would make sampling important, just as we might string together temp tables in SQL rather than have one query with multiple sub-queries embedded.
3. A corollary to 2 is the problem of a phenomenon being important but still appearing insignificant in contrast to the size of the total data set. Makes me think of Earth in the Universe, or even in the Milky Way--very insignificant--yet much more identifiable and significant as an actor in the solar system or even a dozen nearby solar systems. Perhaps that is what Google ran up against.
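On the sampling point: one standard way to draw a uniform sample from a data set too large (or too streamed) to hold in memory is reservoir sampling, which needs only one pass and memory proportional to the sample size. A minimal sketch, with the function name, seed, and sizes chosen purely for illustration:

```python
import random

def reservoir_sample(stream, k, seed=42):
    # Keep a uniform random sample of k items from a stream of unknown length,
    # using O(k) memory: one pass, no need to load the full data set.
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace an item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10_000), 5))
```

Each item in the stream ends up in the sample with equal probability k/n, so the question of "how big is big enough" can be answered empirically by comparing estimates across sample sizes.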
The Emcien article gives a very narrow definition of data mining: "analysts generating questions to feed to a database in the hope of finding an answer". This activity is indeed dying, but "data mining" has always been defined much more broadly, including predictive analytics and much of what is today called data science.
The American Statistical Association (ASA) managed to kill the meaning of the word "statistician" by narrowing it down to drug-discovery statisticians working on (small data) clinical trials.
Interestingly, my background is computational statistics, which I think is identical to data mining, yet I don't call myself a statistician anymore because of ASA.
It looks to me like the author is calling for new techniques via an "editorial," but either has no clue what those new techniques are, or there aren't any yet.
But from your reply above, it sounds like there are automated methods for generating and then testing hypotheses?
In an audio edition of a book called "Supercrunchers," after taking an extended look at stepwise regression as applied to really large data sets, the author then talked about neural-net-based solution generation. If I understood it right, this technique could "find" correlations that weren't obvious. The problem is that the reliability of the solution isn't as transparent as it is in stepwise regression.
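For readers unfamiliar with stepwise regression, the forward variant greedily adds whichever predictor most improves the fit, stopping when no candidate helps enough. This is a minimal sketch assuming numpy; the function names, the 0.01 stopping threshold, and the synthetic data are my own illustrations, not taken from the book:

```python
import numpy as np

def r_squared(X, y):
    # Fit ordinary least squares with an intercept and return R^2
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y, threshold=0.01):
    # Greedily add the predictor that most improves R^2,
    # stopping when no candidate improves it by more than `threshold`.
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        gains = [(r_squared(X[:, selected + [j]], y), j) for j in remaining]
        r2, j = max(gains)
        if r2 - best_r2 < threshold:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# y depends only on columns 0 and 2; the other three columns are noise
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)
selected, r2 = forward_stepwise(X, y)
print(sorted(selected), round(r2, 3))
```

The transparency point in the post is visible here: each step reports exactly which variable entered and how much R^2 it bought, whereas a neural net offers no comparable per-variable audit trail.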