A Data Science Central Community

Most statisticians now work either on data-driven, distribution-free models, or on automatically and efficiently testing, maintaining, and updating large numbers of models via goodness-of-fit criteria. I think your comment describes the type of statistician that was common 30 years ago, not the modern computational statistician.

While computational statistics matches algorithmic modeling in results and performance, it has the advantage of providing simple interpretations (e.g. when using hierarchical Bayes or hidden Markov models), and of successfully discriminating between cause, consequence, and correlation when the modeling is performed by real experts.

Interestingly, with the right choice of parameters, methodologies that a priori look very different will produce exactly the same results on all data sets: for instance, logistic regression (a statistical technique) is mathematically equivalent to a neural network with no hidden layer and a sigmoid output (an algorithmic technique). As a data mining statistician, I like data-driven, exact or approximate models, as long as they result in efficient implementations when processing very large data sets. I even compute confidence intervals for predictive scores using my own simple, model-free techniques (see http://www.analyticbridge.com/forum/topics/easy-to-compute), and a model-free approach for designing predictive scores (see http://www.analyticbridge.com/group/whitepapers/forum/topics/hidden...).
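The logistic regression / neural network equivalence can be checked numerically: a network with no hidden layer and a sigmoid output, trained to minimize cross-entropy, optimizes exactly the logistic regression log-likelihood, so both fits land on the same coefficients. A minimal sketch in Python (the synthetic data and hyperparameters below are illustrative, not from the original post):

```python
import numpy as np

# Synthetic binary classification data (illustrative).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -2.0]) + 0.5)))
y = (rng.random(n) < p_true).astype(float)
Xb = np.hstack([X, np.ones((n, 1))])      # add intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "Statistical" fit: logistic regression via Newton-Raphson (IRLS).
w_stat = np.zeros(3)
for _ in range(25):
    p = sigmoid(Xb @ w_stat)
    W = p * (1 - p)
    H = Xb.T @ (Xb * W[:, None])          # observed information matrix
    w_stat += np.linalg.solve(H, Xb.T @ (y - p))

# "Algorithmic" fit: a single sigmoid neuron trained by gradient
# descent on the cross-entropy loss -- the same objective function.
w_net = np.zeros(3)
lr = 1.0
for _ in range(50_000):
    p = sigmoid(Xb @ w_net)
    w_net -= lr * Xb.T @ (p - y) / n      # gradient of mean cross-entropy

print(np.max(np.abs(w_stat - w_net)))    # the two fits coincide numerically
```

Both loops minimize the same convex loss, so they converge to the same unique optimum; only the optimization routine differs.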

To me, statistical science includes any technology that performs clustering on large or very large data sets, with or without a model. This has been the case for many decades, and clustering is part of every statistics curriculum. It also includes topics such as keyword taxonomy building and fraud detection, since these rely on association metrics. Pretty much everything that algorithmic techniques do, as long as it has a predictive, descriptive, or visualization component, statistics does too. The term used to describe these techniques is computational statistics, and publications such as *Computational Statistics and Data Analysis* devote most of their papers to processing large data sets such as images or videos. Markov random fields (in image processing) and Bayesian models (in scoring) can handle very large data sets, such as the largest credit card databases on earth.
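Model-free clustering at scale can be illustrated with mini-batch k-means, a standard technique for data sets too large to process in full passes (this is a simplified sketch in the spirit of Sculley's algorithm; the data and parameters are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated synthetic "segments", a stand-in for a large
# customer or transaction data set (illustrative only).
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(5000, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(5000, 2)),
])

def minibatch_kmeans(X, k, batch=256, iters=200, seed=0):
    """Model-free clustering that only touches one small batch per
    step, so it scales to data sets that do not fit in memory."""
    rng = np.random.default_rng(seed)
    # farthest-point initialization: spread the k seeds apart
    idx = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1), axis=1)
        idx.append(int(d.argmax()))
    centers = X[idx].astype(float).copy()
    counts = np.zeros(k)
    for _ in range(iters):
        B = X[rng.choice(len(X), batch, replace=False)]
        # assign each batch point to its nearest center
        d = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = B[labels == j]
            if len(pts):
                counts[j] += len(pts)
                eta = len(pts) / counts[j]    # decaying per-center step size
                centers[j] = (1 - eta) * centers[j] + eta * pts.mean(0)
    return centers

centers = minibatch_kmeans(data, k=2)
print(np.round(centers, 1))  # one center near (0, 0), one near (5, 5)
```

No distributional assumptions are made anywhere; the algorithm is driven purely by squared distances, which is the sense in which such clustering is model-free.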

You could say that general linear models are to statistics what bubble sort is to algorithms and computer science.

--

In response to the following message from S. Miller, posted on our LinkedIn group:

*My take is that the B-eye-Network author was thinking along the lines of the late Berkeley statistician Leo Breiman, originator of CART and random forests, who published a 2001 paper entitled "Statistical Modeling: The Two Cultures": http://www.stat.osu.edu/~bli/dmsl/papers/Breiman.pdf*

The abstract from that paper is as follows:

> There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Needless to say, Breiman's paper was controversial. I conducted an interview with Stanford statistician Brad Efron, originator of the bootstrap, who shared some thoughts on Breiman's paper: http://www.b-eye-network.com/view/9947/

© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC.