# AnalyticBridge

A Data Science Central Community

# What is the difference between statistical computing and data mining?

In my opinion, there is none. If you graduated with a degree in stats, you call it computational statistics. If you graduated with a degree in computer science, you call it data mining. In both cases, it's about processing large data sets using statistical techniques. Do you have a different opinion?


### Replies to This Discussion

Your post may also be summarised by the question: "Is data mining `statistical déjà vu'?". Please check out a summary response at http://www.statoo.com/en/datamining/ and the links included.

A very interesting additional read is Leo Breiman's 2001 paper "Statistical Modeling: The Two Cultures", which is available at http://projecteuclid.org/Dienst/getRecord?id=euclid.ss/1009213726/.

They are not comparable. Statistical computing is using computers to implement or support statistical methods, which are well defined. Data mining is a loose collection of methods used to find structures or patterns in data sets with many rows or many dimensions. A regression model often exists in data mining tool sets, but it is not data mining. If one is searching for some regression in an undefined subgroup, then that is data mining utilizing a statistical method, namely regression. Everyone uses statistical computing, even for the smallest data sets. If you are not using a statistical package, then you are using a calculator. Different interface, same thing.
I should have used the word "computational statistics" rather than "statistical computing".
oh.

You're welcome.
I hope computational statistics is on the favored side : )
I find the definition that statoo dissects (it appears in a number of places) not very useful.
"Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or models or trends in data to make crucial decisions."

Here is my version:
-non-trivial process: is this a new mathematical definition? Sort of like a Poisson process with non-zero waiting time?
-valid: an abuse of language. If it is valid it exists, but if it is found it exists. To me this is like saying to someone, "Look at this rock I found," and the other person responding, "Is it valid?"
-potentially useful: if the patterns we find are *potentially* useful, that includes all patterns, so this delineates data-mined patterns from ... patterns? Why climb a mountain? Is it potentially useful? We analyze data because we enjoy it.
-ultimately understandable: wazza?? If you are still confused after all these years, it must not have been data mining. How long must we wait, though?
-crucial decisions: finally an immediate delineating factor! But what if a client is trying to make an insipid decision? I guess in that case we have to do statistical computing!

This definition was written by someone who wanted someone else to give them money to mine data.
Let's dissect it again with the original authors' subliminal meanings:
-non-trivial: this is hard stuff, and not just anyone can do it. You need me.
-valid: It is absolutely true.
-potentially useful: you can make money from this information, though some might be junk
-ultimately understandable: It will seem like junk at first, and not worth the money you are throwing at it, but 'ultimately' you will gain great knowledge and power! [evil laughter]
-crucial decisions: not for your average lack-a-day decisions, but crucial ones. This is heady stuff.
Thanks for this provoking and amusing comment, which leaves some space for justification.

Let me retake the definition of data mining from our web site at www.statoo.com/en/datamining/:

"Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or models or trends in data to make crucial decisions."

What do all these terms mean?

• `Non-trivial': it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.
• `Valid': the patterns hold in general, i.e. being valid on new data in the face of uncertainty.
• `Novel': the patterns were not known beforehand.
• `Potentially useful': lead to some benefit to the user.
• `Understandable': the patterns are interpretable and comprehensible.

Clearly, one could argue on the exact meanings of these terms. However, there are several important points to make.

• Like statistical thinking and statistics, data mining is not only modelling and prediction, nor a product that can be bought, but a whole iterative problem solving cycle/process that must be mastered through team effort.
• Without a statistical thinking mind-set, data mining is completely inefficient; please check out www.statoo.com/en/statistical.thinking/ for what is meant by statistical thinking. The process focus of statistical thinking provides the context and the relevancy for broader and more effective use of data mining and statistical methods.
• Defining the `business' problem is the trickiest part of successful data mining because it is exclusively a communication problem. The data mining process is oriented toward solving a `business' problem rather than a data analysis problem!
• Avoid errors of the third kind, that is giving the `right' answer to the wrong question.
• Statistical data mining versus `business' `significance'.
• Computers and algorithms do not mine data, people do!

Bearing these points in mind may shed some light on your comment.

Let me conclude with Lewis Carroll's quote: "If you do not know where you are going, any road will take you there."
My opinion is that data mining differs profoundly, but subtly from statistical computing. As a simple example, imagine an id variable called 'color.' A statistical analysis would, for example, evaluate the different values of attribute variables associated with each color. We would use regression, factorization, cluster analysis, and so on, and speak meaningfully about what color 'is' (posit a causal relationship) and make generalizations -predictions- about its value (id) based on the values of its attributes. So far, so good; data mining does exactly the same thing, too. However, data mining takes things at least one step further because it's not 'afraid' of meaninglessness. A data miner would consume the label too, and disaggregate color into as many columns as there are unique colors among the observations, and, by setting a flag in the cell whenever each color occurs in the observations, the data miner would produce proportions such as '23 observations of color X out of a possible 3200 observations.' Those proportions extend meaning of sorts to a meaningless label. Now, the same statistics can be computed once again with the new variables included, and the combinations among the variables and their interactions will reveal different structure and new messages that are not possible within the conceptual framework of the statistical analysis, wherein a label is just that and nothing more. Note that what is meant here is qualitatively different than just including a column for 'number of occurrences' in the statistical analysis. I'm attempting, probably with little success, to articulate a structural coupling that accompanies the disaggregation versus the summarization. In a given data set, clustering color based on similar proportions is estimable, but in any other context continues to be a meaningless result of manipulating a meaningless concept as if it were a 'variable.' Is this a practical difference today? 
Probably not; but back when data were more often scarce and precious and rare we were exceedingly careful (as now) to work within the methodological constraints imposed by the relevant statistical theory being applied, and we would Never, Ever have included the Label in the analysis, except as a stupid, rookie mistake! A number of people whom I respect have difficulty with my take on this, so I surely understand if others whom I don't know take issue with it as well, but, as I said, it's an opinion.
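
The disaggregation described above can be sketched in a few lines. This is only an illustration of the idea, not the poster's actual procedure: the `colors` list is made-up data, and the helper names `disaggregate_label` and `label_proportions` are hypothetical.

```python
from collections import Counter


def disaggregate_label(labels):
    """Turn one label column into a 0/1 indicator column per unique value."""
    categories = sorted(set(labels))
    return {
        category: [1 if value == category else 0 for value in labels]
        for category in categories
    }


def label_proportions(labels):
    """Share of observations carrying each label value."""
    counts = Counter(labels)
    total = len(labels)
    return {category: count / total for category, count in counts.items()}


# Hypothetical observations; the 'color' label plays the role of the id variable.
colors = ["red", "blue", "red", "green", "blue", "red"]

indicators = disaggregate_label(colors)
proportions = label_proportions(colors)

for color in sorted(indicators):
    print(color, indicators[color], round(proportions[color], 3))
```

The indicator columns and proportions can then be fed back into the same regression or clustering as any other variable, which is the step a classical analysis would refuse to take with a bare label.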
Nice example.

In my opinion data mining also differs profoundly, but subtly from statistics.

But, what distinguishes data mining from statistics?

Statistics traditionally is concerned with analysing primary (e.g. experimental) data that has been collected to check specific research hypotheses (`hypothesis testing'). As such statistics is `primary data analysis' or top-down (confirmatory) analysis.

Data mining, on the other hand, typically is concerned with analysing secondary (e.g. observational) data collected for other reasons. As such data mining is `secondary data analysis', bottom-up (exploratory) analysis, `hypothesis generation' or `knowledge discovery'.

Nevertheless, the two approaches are complementary.

• The information obtained from a bottom-up analysis, which identifies important relations and tendencies, can not explain why these discoveries are useful and to what extent they are valid. The confirmatory tools of top-down analysis can be used to confirm the discoveries and evaluate the quality of decisions based on those discoveries.
• Performing a top-down analysis, we think up possible explanations for the observed behaviour and let those hypotheses dictate the data to be analysed. Then, performing a bottom-up analysis, we let the data suggest new hypotheses to test.

Source: www.statoo.com/en/datamining/.
I agree with your concise and elegant characterization, and with the general thrust of your remarks, which seems to me to be about reaching a perspective that is able to see the techniques, for example, as a 2D projection from a 3D process/observer. That is distinguished from being immersed in one or the other camp, seeing what you see, and knowing the 'enemy' is just over the hill. I hope I'm not being too poetical on a Saturday morning.

However, while it is true that statistical analysis is in common use, and is well-defined and understood in terms of its methods if not in terms of its application and interpretation, data mining still suffers by comparison. To elaborate, I have used commercially the data mining applications 'Enterprise Miner' (SAS) and 'Clementine' (SPSS) for about the last five or six years. In fact, in the latter case, the package was sold as a be-all, end-all application that didn't need any statistical support, so the company refused to purchase base SPSS or any other package, and the typical statistical analysis undertaken there was done in Excel! But that's another story.

Where I am going with this is that data mining is still very much many things to many people. The vast and complex enterprise-level flagships are actually marketed to be used by upper-level Business analysts and officers with relatively little statistical background, with a message similar to 'why rely on stuffy scientific analyses that can take time, or require expertise that is scarce and expensive, and that don't answer the Business questions anyway? Why not take the bull by the horns and explore the data yourself, and let the machinery take care of the details like sampling design and testing, while you apply your sophisticated and sensitive business acumen toward posing interesting relationships and interpreting the results of multiple methodologies effortlessly applied? By the way, the graphical output is stunning, and compelling reports are easy to produce from the comfort of your own desktop.' Or words and graphics to that effect; visit their websites for yourself. So, to many that is 'data mining,' and anything less, or more for that matter, just will not do. These 'experts' produce outcomes that are generally not questioned because the software cost so much that it could not possibly be wrong.

At the other extreme are those whose professional depth enables them to do the same thing by pipelining statistical procedures and macro'ed-up numerical methods to appropriately-constructed subsamples, and interpreting the tested outcomes within a context of business experience (that is often a hard-won gloss on their scientific credentials). In between is a vast region populated with hopefuls.

To return to a point that I made in my last comment, there is no overriding conceptual framework to data mining as there is for statistical analysis (nor should there be) so long as data mining works with what is There in terms of structures both apparent and underlying, and makes no pretense at generalization from what is observed, There. The mistake made by both sides occurs when they volunteer to extend their particular strengths to provide a complete solution in all contexts. Primary data analysis requires an understanding of the probabilistic issues, and the methodologies are designed to illuminate that and little else. It is up to the Analyst to somehow apply the one to the other, in terms of having useful answers to practical questions in science or business, through a combination of clever pre- and postprocessing. Data mining uses all available information and returns sense and nonsense, and again, it is up to the analyst to correctly differentiate the outcomes and espouse the correct ones (which are sometimes the nonsense, by the way). As you rightly point out, the approaches are, at best, complementary.

However, I would suggest that, with the speed advantage that naturally accrues to data mining (little or no training, and the ability to do and redo countless attempts in computer-time), data mining will win out in the end, if there is to be a contest. After all, I can apply the over-fit model of This data set to all data sets and quickly redo the analysis with new data as soon as my realtime sensors tell me that the model is misperforming. What does the elegant and beautiful statistical theory have by comparison with That? I feel like I've lost the point which I wanted to really express, and am beginning to babble on, so I will wish you a very good weekend and come back to it later, if it still bears reviewing.

Jim
I think there are two very different cultures between statistics and data mining. Basically, data miners tend to be optimistic and statisticians tend to be pessimistic. Data miners tend to believe that the latest methods will get the right answer and statisticians tend to believe that data can make fools of us unless we follow careful practice.

Charles Elkan wrote a great paper "Magical Thinking in Data Mining: Lessons from the CoIL Challenge 2000". One of Elkan's main points was that the problem data set was so small that no score was significantly better than the average score! (although a few were significantly worse than average :-) A statistician would look at this situation and say "the numbers are so small we can't really say which is better and which is worse"; a data miner would look at this and say "here was the training set, and here was the validation data set, and here is the method that got the best result on the validation data set, so that's the best method".
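
To make the statistician's side of that concrete, here is a rough sketch of the check they would run. The numbers are invented for illustration and are not the CoIL Challenge results or its scoring rule; the sketch assumes a plain accuracy score, a hypothetical validation set of 200 cases, and a normal-approximation 95% interval.

```python
import math


def accuracy_interval(correct, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy of correct/n."""
    p = correct / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical scores on a small validation set (assumed values).
n = 200
best = accuracy_interval(121, n)     # the "winning" method
average = accuracy_interval(110, n)  # a typical entry

print("best:   ", best)
print("average:", average)
print("overlap:", intervals_overlap(best, average))
```

With a validation set this small the two intervals overlap, so the ranking alone does not show that the "best" method is really better. Overlapping intervals are a conservative check rather than a formal test, but they are enough to show why the data miner's conclusion is shaky.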

Me, I go back and forth. I like trying new things just to try them, and when I'm asked for an answer I give my best shot even if it's not as wrapped up as tight as I'd like, but I often assume that my data was generated by a malicious demon that can't actually lie but delights in getting me to make mistakes.

http://tactical-logic.blogspot.com/