
What is the difference between statistical computing and data mining?

In my opinion, there is none. If you graduated with a degree in stats, you call it computational statistics. If you graduated with a degree in computer science, you call it data mining. In both cases, it's about processing large data sets using statistical techniques. Do you have a different opinion?

Tags: data mining, statistical computing


Replies to This Discussion

I got my tax rebate, so I am going to purchase some paragraph breaks for Dr. Popoff. A great read, though.
Let me ask this question, which leans towards the trivial distinction (not 'a' trivial distinction, a distinction I should make). Is there an example of data mining where a data mining calculation gives a single summary statistic, or two related statistics like slope and intercept in regression?
Good question. I've seen many flavors of data mining, and the stuff I described earlier is just my take on the industry. Most folks consider 'data mining' to be the advanced use of SQL, SAS, etc. to provide summary statistics and some organizational structure to enterprise-scale databases for the purposes of reporting and answering one-off questions. Pure reporting, which resembles data mining in that sense, is best handled in a reporting environment such as Essbase or Cognos, where the cube structure is used to repeat useful queries and format the results. For the most part, the outputs of all these activities (the tools I described earlier, the data-cube reporting, and the SQL querying) resemble one another in detail, in that they are also the standard outputs of standard statistical data exploration or simple descriptive statistics. So, when all is said and done, no matter what you bring to the table in terms of methodologies, you are outputting scatterplots, histograms, pie charts, summary tables, and lift graphs. Maybe even animated time-series plots. The most 'data-mining-representative' output is the decision tree, and that is already strongly associated with standard dendrograms (clustering) and CART. Given that operational bias toward conventional statistical communication, the line between data mining (for 'serious' statisticians, as it were) and statistics (for the less-advanced, perhaps) is really blurred. New paragraph :-)

To a degree, that is to be expected, since DM is an outgrowth and extension of standard statistics (SS). As such, the two are meant to be complementary, and certainly the single-summary concept, such as R-squared, is as useful in one as it is in the other; it is freely used and misused alongside more common DM notions such as rules analysis and topological graphics describing the strength of relationships (the name escapes me just now; I don't use that method often enough, sorry!). Your idea seems to me to be saying that DM lacks something, since it cannot actually provide anything new beyond relaxing the rules and seeing what pops out. Where I would want to take that idea is into the realm of data visualization. There I think that DM may have a huge advantage over SS, since the very 'free-formedness' of DM is de rigueur, and the potential for uniting creative artists and science (and semioticians) is unparalleled since the first images were made on cave walls and in ceremonial dance.
I guess what I am getting at is that statistics tends to reduce all of the data to a single number or single idea, regardless of how complex the analysis. Do you feel this is true?
No, I don't see that at all, Emory. There are lots of statistical analyses that give complicated answers. Statistics by its nature tries to simplify, to reduce to the essence, but by no means does that essence have to be a single number.

On the other hand, data miners can be just as quick to reduce complicated results to a single number. I'm thinking of Charles Elkan's paper on the CoIL challenge. If one model gets 115 right out of 800 and the other model gets 116 right out of 800, does that automatically mean that the second one is a better model?
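To put a rough number on the 115-versus-116 question, here is a minimal sketch of a two-proportion z-test in Python. It assumes, as a simplification, that the two scores come from independent test sets of 800 cases each; in a real competition both models are scored on the same cases, so a paired test would be more appropriate, but that needs per-case results that are not given here. The function name is made up for illustration.

```python
import math

def two_proportion_z(hits_a, hits_b, n):
    """z statistic for the difference between two hit rates.

    Assumption: both models were scored on independent samples of size n.
    """
    p_a = hits_a / n
    p_b = hits_b / n
    # Pooled hit rate under the null hypothesis that the models are equally good
    pooled = (hits_a + hits_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return (p_b - p_a) / std_err

z = two_proportion_z(115, 116, 800)
# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Under these assumptions the z statistic is far below any conventional threshold, so a one-case difference out of 800 is no evidence that the second model is better.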
When I became interested in data mining, I asked myself and others the same question. Some articles dealt with the same problem, but in the end I found an answer that satisfies me.
In statistics, you start with your domain knowledge to formulate a hypothesis, then you gather the appropriate data, and finally you use statistical methods to test the hypothesis: can you accept it or do you have to reject it?
In data mining, you start with data, then you calculate the best hypothesis: the model. You still have to validate it, preferably on a hold-out data set.
It is because the hypothesis is calculated (instead of coming from the researcher's intelligence) that statisticians at first did not consider data mining a serious scientific tool!
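The data-first workflow described above (let the data produce the model, then check it on a hold-out set) can be sketched in a few lines of Python. Everything here is an illustrative assumption: the data are synthetic, generated from a known line plus noise, and the 150/50 split and helper names are arbitrary.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared prediction error of the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

# "Start with data": fit on the first 150 rows only
split = 150
slope, intercept = fit_line(xs[:split], ys[:split])

# "Validate it": measure error on the 50 rows the fit never saw
holdout_mse = mean_squared_error(xs[split:], ys[split:], slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} hold-out MSE={holdout_mse:.2f}")
```

The hold-out error plays the role that the hypothesis test plays in the statistics-first workflow: it is the check that the calculated model is not just an artifact of the rows it was fitted on.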
Not completely. John Tukey's work in exploratory data analysis was very much about identifying hypotheses from the data. That being said, Tukey had a long struggle getting his work accepted by other statisticians for exactly the reasons you describe.
I've given one answer, let me try for another. Data mining, it seems to me, came out of computer science and specifically out of the study of algorithms. Often, its practitioners would take the data set as a given and see what their new methods did with it.

Statistics, it seems to me, has usually had comparatively simple and straightforward algorithms and as a consequence concentrated more on the data: getting the best possible data, making sure there weren't any contaminating artifacts, all that stuff.

I think lately the fields are starting to come together. At the last KDD I was at, there was very little discussion of new methods and all the presentations were about specific problems and applying domain knowledge to the advanced algorithms.
My take is this:

Statistical computing: Using models to reliably understand a process. Predicting the process is secondary to understanding the model and the process.

Data mining: Using models to reliably predict a process. Understanding the process or model is secondary to predicting the process.
Apart from the fact that when you present the outcome to your management, they would rather it be called statistics-based than data mining, I would agree with you.
Hello everyone,

This is a very nice post. I really enjoyed reading everybody's views. I agree with what others are saying regarding the difference. Statistics came more from doing analysis on experimental data, whereas data mining came from doing analysis on observational data. I would like to throw one other distinction into this discussion: 'hard computing' and 'soft computing'. While statistics could not be categorized as hard computing, statisticians find probabilistic models a natural way to think about data. This need not be the way 'data mining' or 'soft computing' looks at the data. From a data mining point of view, data accounts for everything that has happened and there may not be randomness or uncertainty in it. This is probably why, from data mining's point of view, the best model can be evaluated only by testing or validation error. This is not the case for statistical models, for which tests are more stringent.

The post is getting too long. Anyways, these are my views on this discussion.
I think when we say data mining, we need to hunt for data. Statisticians compute with the data they have at hand. As far as processing is concerned, there is no difference.

