AnalyticBridge

A Data Science Central Community

What is the difference between statistical computing and data mining?

In my opinion, there is none. If you graduated with a degree in stats, you call it computational statistics. If you graduated with a degree in computer science, you call it data mining. In both cases, it's about processing large data sets using statistical techniques. Do you have a different opinion?

Views: 6867

Replies to This Discussion

Your post may also be summarised by the question: "Is data mining `statistical déjà vu'?". Please check out a summary response at http://www.statoo.com/en/datamining/ and the links included.

A very interesting additional reading is Leo Breiman's 2001 paper "Statistical Modeling: The Two Cultures" which is available at http://projecteuclid.org/Dienst/getRecord?id=euclid.ss/1009213726/.
They are not comparable. Statistical computing is using computers to implement or support statistical methods, which are well defined. Data mining is a loose collection of methods used to find structures or patterns in data sets with many rows or many dimensions. A regression model often exists in data mining tool sets, but it is not data mining. If one is searching for some regression in an undefined subgroup, then that is data mining utilizing a statistical method, namely regression. Everyone uses statistical computing, even for the smallest data sets. If you are not using a statistical package, then you are using a calculator. Different interface, same thing.
I should have used the word "computational statistics" rather than "statistical computing".
oh.

You're welcome.
I hope computational statistics is on the favored side : )
I find the definition that statoo dissects (it appears in a number of places) not very useful.
"Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or models or trends in data to make crucial decisions."

Here is my version:
-non-trivial process: is this a new mathematical definition? Sort of like a Poisson process with non-zero waiting time?
-valid: an abuse of language. If it is valid it exists, but if it is found it exists. To me this is like saying to someone, "Look at this rock I found," and the other person responding, "Is it valid?"
-potentially useful: if the patterns we find are *potentially* useful, that includes all patterns, so this delineates patterns data mined from ... patterns? Why climb a mountain? It is potentially useful? We analyze data because we enjoy it.
-ultimately understandable: wazza?? If you are still confused after all these years, it must not have been data mining. How long must we wait, though?
-crucial decisions: finally an immediate delineating factor! But what if a client is trying to make an insipid decision? I guess in that case we have to do statistical computing!

This definition was written by someone who wanted someone else to give them money to mine data.
Let's dissect it again with the original author's subliminal meanings:
-non-trivial: this is hard stuff! Not just anyone can do it. You need me.
-valid: it is absolutely true.
-potentially useful: you can make money from this information, though some might be junk.
-ultimately understandable: it will seem like junk at first, and not worth the money you are throwing at it, but 'ultimately' you will gain great knowledge and power! [evil laughter]
-crucial decisions: not for your average lack-a-day decisions, but crucial ones. This is heady stuff.
Thanks for this provoking and amusing comment, which calls for some justification.

Let me restate the definition of data mining from our web site at www.statoo.com/en/datamining/:

"Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or models or trends in data to make crucial decisions."

What do all these terms mean?

• `Non-trivial': it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.
• `Valid': the patterns hold in general, i.e. being valid on new data in the face of uncertainty.
• `Novel': the patterns were not known beforehand.
• `Potentially useful': lead to some benefit to the user.
• `Understandable': the patterns are interpretable and comprehensible.

Clearly, one could argue on the exact meanings of these terms. However, there are several important points to make.

• Like statistical thinking and statistics, data mining is neither only modelling and prediction nor a product that can be bought, but a whole iterative problem-solving cycle/process that must be mastered through team effort.
• Without a statistical thinking mind-set, data mining is completely inefficient; please check out www.statoo.com/en/statistical.thinking/ for what is meant by statistical thinking. The process focus of statistical thinking provides the context and the relevancy for broader and more effective use of data mining and statistical methods.
• Defining the `business' problem is the trickiest part of successful data mining because it is exclusively a communication problem. The data mining process is oriented toward solving a `business' problem rather than a data analysis problem!
• Avoid errors of the third kind, that is giving the `right' answer to the wrong question.
• Distinguish statistical `significance' from `business' significance.
• Computers and algorithms do not mine data, people do!

Bearing these points in mind may shed some light on your comment.

Let me conclude with Lewis Carroll's quote: "If you do not know where you are going, any road will take you there."
My opinion is that data mining differs profoundly, but subtly, from statistical computing. As a simple example, imagine an id variable called 'color.' A statistical analysis would, for example, evaluate the different values of attribute variables associated with each color. We would use regression, factorization, cluster analysis, and so on, speak meaningfully about what color 'is' (posit a causal relationship), and make generalizations, i.e. predictions, about its value (id) based on the values of its attributes. So far, so good; data mining does exactly the same thing.

However, data mining takes things at least one step further, because it is not 'afraid' of meaninglessness. A data miner would consume the label too: disaggregate color into as many columns as there are unique colors among the observations and, by setting a flag in a cell whenever a color occurs, produce proportions such as '23 observations of color X out of a possible 3200 observations.' Those proportions extend a meaning of sorts to a meaningless label. The same statistics can now be computed once again with the new variables included, and the combinations among the variables and their interactions will reveal different structure and new messages that are not possible within the conceptual framework of the statistical analysis, wherein a label is just that and nothing more. Note that what is meant here is qualitatively different from just including a column for 'number of occurrences' in the statistical analysis. I am attempting, probably with little success, to articulate a structural coupling that accompanies the disaggregation versus the summarization. In a given data set, clustering colors based on similar proportions is estimable, but in any other context it remains a meaningless result of manipulating a meaningless concept as if it were a 'variable.'

Is this a practical difference today? Probably not; but back when data were more often scarce, precious, and rare, we were exceedingly careful (as now) to work within the methodological constraints imposed by the relevant statistical theory, and we would Never, Ever have included the label in the analysis, except as a stupid, rookie mistake! A number of people whom I respect have difficulty with my take on this, so I surely understand if others whom I don't know take issue with it as well, but, as I said, it's an opinion.
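The disaggregation described above can be sketched in a few lines. This is a hypothetical illustration only; the color values and variable names are invented, not taken from any real data set:

```python
# Hypothetical sketch: disaggregate a nominal 'color' label into
# per-value indicator columns, then compute proportions such as
# "3 observations of 'red' out of a possible 6".
from collections import Counter

observations = ["red", "blue", "red", "green", "blue", "red"]

# One indicator column (a list of 0/1 flags) per unique color.
colors = sorted(set(observations))
indicators = {c: [1 if obs == c else 0 for obs in observations]
              for c in colors}

# Proportions: how often each color occurs among all observations.
counts = Counter(observations)
proportions = {c: counts[c] / len(observations) for c in colors}

print(proportions)
```

The indicator columns can then be fed back into any regression or clustering routine alongside the original attributes, which is exactly the step a classical analysis would have refused to take with a bare label.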
Nice example.

In my opinion data mining also differs profoundly, but subtly from statistics.

But, what distinguishes data mining from statistics?

Statistics traditionally is concerned with analysing primary (e.g. experimental) data that have been collected to check specific research hypotheses (`hypothesis testing'). As such, statistics is `primary data analysis' or top-down (confirmatory) analysis.

Data mining, on the other hand, is typically concerned with analysing secondary (e.g. observational) data collected for other reasons. As such, data mining is `secondary data analysis', bottom-up (exploratory) analysis, `hypothesis generation' or `knowledge discovery'.

Nevertheless, the two approaches are complementary.

• The information obtained from a bottom-up analysis, which identifies important relations and tendencies, cannot explain why these discoveries are useful or to what extent they are valid. The confirmatory tools of top-down analysis can be used to confirm the discoveries and evaluate the quality of decisions based on them.
• Performing a top-down analysis, we think up possible explanations for the observed behaviour and let those hypotheses dictate the data to be analysed. Then, performing a bottom-up analysis, we let the data suggest new hypotheses to test.

Source: www.statoo.com/en/datamining/.

Jim
I think there are two very different cultures between statistics and data mining. Basically, data miners tend to be optimistic and statisticians tend to be pessimistic. Data miners tend to believe that the latest methods will get the right answer and statisticians tend to believe that data can make fools of us unless we follow careful practice.

Charles Elkan wrote a great paper, "Magical Thinking in Data Mining: Lessons from the CoIL Challenge 2000". One of Elkan's main points was that the problem data set was so small that no score was significantly better than the average score (although a few were significantly worse than average :-). A statistician would look at this situation and say, "The numbers are so small we can't really say which is better and which is worse"; a data miner would look at it and say, "Here was the training set, here was the validation set, and here is the method that got the best result on the validation set, so that's the best method."
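The statistician's objection can be made concrete with a back-of-the-envelope significance check. The sample size and accuracies below are invented for illustration, not taken from the CoIL data:

```python
# Rough sketch: is a 2-point accuracy gap between two classifiers
# significant on a small validation set? (All numbers hypothetical.)
import math

n = 800                       # validation-set size (hypothetical)
acc_a, acc_b = 0.76, 0.74     # observed accuracies (hypothetical)

# Approximate standard error of the difference between two proportions,
# treating the two scores as independent (a rough but standard bound).
se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
z = (acc_a - acc_b) / se

print(f"difference = {acc_a - acc_b:.3f}, z = {z:.2f}")
# |z| < 1.96 means the gap is not significant at the 5% level,
# so "method A beat method B" is not supported by these numbers.
```

Under these assumed numbers the z-statistic comes out well under 1.96, which is exactly the "we can't really say which is better" situation Elkan describes.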

Me, I go back and forth. I like trying new things just to try them, and when I'm asked for an answer I give my best shot even if it's not as wrapped up as tight as I'd like, but I often assume that my data was generated by a malicious demon that can't actually lie but delights in getting me to make mistakes.

http://tactical-logic.blogspot.com/