# AnalyticBridge

A Data Science Central Community

# The Statistician vs. the Computer Scientist: stochastic vs. algorithmic models

Most statisticians now work either on data-driven / distribution-free models, or on automatically and efficiently testing / maintaining / updating a large number of models via goodness-of-fit criteria. I think your comment describes a type of statistician that was common 30 years ago, but not the modern computational statistician.

While computational statistics is identical to algorithmic modeling from a result / performance point of view, it has the advantage of providing simple interpretation (e.g. when using hierarchical Bayes or hidden Markov models), and of successfully discriminating between cause / consequence / correlation, provided the modeling is performed by real experts.

Interestingly, with the right choices of parameters, methodologies that a priori look very different will produce exactly the same results on ALL data sets: e.g. logistic regression (a statistical technique) and a neural network reduced to a single sigmoid output unit (an algorithmic technique). As a data mining statistician, I like data-driven, exact or approximate models, as long as they result in efficient implementations when processing very large data sets. I even compute confidence intervals for predictive scores, using my own simple, model-free techniques; see http://www.analyticbridge.com/forum/topics/easy-to-compute. For a model-free approach to designing predictive scores, see http://www.analyticbridge.com/group/whitepapers/forum/topics/hidden....
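To make the logistic regression / neural network remark concrete, here is a minimal sketch in Python (NumPy only). It rests on an assumption about what "the right choices of parameters" means: a network with no hidden layer, one sigmoid output unit, and cross-entropy loss. Under that assumption the network and logistic regression optimize the same objective, so they should agree up to numerical tolerance. The synthetic data, the function names, the learning rate, and the iteration counts are all illustrative choices, not taken from the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_data(n=500, seed=0):
    # Synthetic, noisy (non-separable) binary data; purely illustrative.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    p = sigmoid(X @ np.array([1.5, -1.0]) + 0.3)
    y = (rng.uniform(size=n) < p).astype(float)
    return X, y

def fit_logistic_newton(X, y, n_iter=25):
    # "Statistical" fit: maximum likelihood via Newton-Raphson (IRLS).
    A = np.column_stack([np.ones(len(X)), X])  # intercept column + features
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ beta)
        grad = A.T @ (p - y)                       # gradient of negative log-likelihood
        hess = A.T @ (A * (p * (1.0 - p))[:, None])  # Hessian of negative log-likelihood
        beta -= np.linalg.solve(hess, grad)
    return beta

def fit_single_neuron(X, y, lr=0.5, n_epochs=5000):
    # "Algorithmic" fit: one sigmoid unit trained by full-batch gradient
    # descent on the mean cross-entropy loss.
    A = np.column_stack([np.ones(len(X)), X])  # bias input + features
    w = np.zeros(A.shape[1])
    for _ in range(n_epochs):
        p = sigmoid(A @ w)
        w -= lr * (A.T @ (p - y)) / len(y)
    return w

X, y = make_data()
beta_stat = fit_logistic_newton(X, y)
w_net = fit_single_neuron(X, y)
print("logistic regression coefficients:", beta_stat)
print("single-neuron network weights:   ", w_net)
print("largest absolute difference:     ", np.max(np.abs(beta_stat - w_net)))
```

The two routines differ only in the optimizer, which is the point: the model and the loss are the same. Once hidden layers, a different loss, or regularization are added to the network, the equivalence no longer holds in general.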

To me, statistical science includes any technology that does clustering on large or very large data sets, with or without a model. This has been the case for many decades, and clustering is part of every statistics curriculum. It also includes topics such as keyword taxonomy building or fraud detection, since these topics use association metrics. Pretty much everything that algorithms do that includes a predictive, descriptive, or visualization part, statistics does too. The term used to describe these techniques is computational statistics, and in publications such as Computational Statistics and Data Analysis, most of the papers deal with processing large data sets such as images or videos. Markov fields (in image processing) and Bayesian models (in scoring) can also handle very large data sets, such as the largest credit card databases on earth.

You could say that general linear models are to statistics what bubble sorting is to algorithms / computer science.

--
In response to the following message from S. Miller, posted on our LinkedIn group:

My take is that the B-eye-Network author was thinking along the lines of the late Berkeley statistician Leo Breiman, originator of CART and random forests, who published a paper in 2001 entitled "Statistical Modeling: The Two Cultures": http://www.stat.osu.edu/~bli/dmsl/papers/Breiman.pdf

The abstract from that paper is as follows:

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Needless to say, Breiman's paper was controversial. I conducted an interview with Stanford statistician Brad Efron, originator of the bootstrap, who shared some thoughts on Breiman's paper: http://www.b-eye-network.com/view/9947/
