A Data Science Central Community
It is often said that there are two approaches to working with data: data science, a bottom-up approach (from data to hypothesis), and statistics, a top-down approach (model-driven, from hypothesis to data). I disagree with this split. I believe you can blend bottom-up and top-down approaches. As a data scientist, this is my philosophy.
Vincent, I see the two approaches to "learning from data" as complementary; both are needed to enable proper data-driven decision making!
That is why I quoted John W. Tukey on slide 30: "Neither exploratory nor confirmatory is sufficient alone. To try to replace either by the other is madness. We need them both." The quote is taken from his 1980 paper "We need both exploratory and confirmatory" (The American Statistician, 34, 23-25; see http://www.ece.rice.edu/~fk1/classes/ELEC697/TukeyEDA.pdf), where Tukey expanded on his ideas of how exploratory and confirmatory data analysis fit together.
* Example 1. A bottom-up analysis identifies important relations and tendencies, but the information it yields cannot explain why these discoveries are useful or to what extent they are valid. The confirmatory tools of top-down analysis can be used to confirm the discoveries and evaluate the quality of decisions based on them.
* Example 2. Performing a top-down analysis, we think up possible explanations for the observed behaviour and let those hypotheses dictate the data to be analysed. Then, performing a bottom-up analysis, we let the data suggest new hypotheses (ideas) to test.
I have already applied this complementary view successfully in several client projects!
For example, when historical process data were available, the question put to a bottom-up analysis (using a mix of ensemble techniques such as random forests or stochastic gradient tree boosting) was: "Which are the most important factors (among a "large" list of candidate factors) from a predictive point of view that impact a given process output?" Combined with subject-matter knowledge, the answer resulted in a list of a "small" number of factors ("the critical Xs"). The confirmatory tools of top-down analysis (statistical design of experiments, DOE, in most cases) were then used to confirm the idea.
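The screen-then-confirm workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in the client projects: the process function `process_output`, the factor names `x0` to `x7`, the coefficients, and the sample sizes are all invented, and a plain correlation ranking stands in for random forest or gradient boosting importance so that the sketch needs nothing beyond the Python standard library.

```python
import itertools
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def process_output(setting, rng):
    """Hypothetical process: only x0 and x3 truly drive the output."""
    return 3.0 * setting["x0"] - 2.0 * setting["x3"] + rng.gauss(0, 0.5)


rng = random.Random(0)
factors = [f"x{i}" for i in range(8)]  # the "large" candidate list (assumed)

# Bottom-up step: rank candidate factors on simulated historical data.
# Assumption: absolute correlation replaces ensemble-based importance.
history = []
for _ in range(500):
    row = {f: rng.uniform(-1, 1) for f in factors}
    history.append((row, process_output(row, rng)))

ys = [y for _, y in history]
scores = {f: abs(pearson([row[f] for row, _ in history], ys)) for f in factors}
critical = sorted(sorted(scores, key=scores.get, reverse=True)[:2])

# Top-down step: two-level full factorial design on the shortlisted factors,
# all other factors held at their centre value, 5 replicates per run.
design = list(itertools.product([-1, 1], repeat=len(critical)))
runs = []
for levels in design:
    setting = {f: 0.0 for f in factors}
    setting.update(zip(critical, levels))
    replicates = [process_output(setting, rng) for _ in range(5)]
    runs.append((levels, sum(replicates) / len(replicates)))

# Main effect = mean response at the high level minus mean at the low level.
main_effects = {}
for j, f in enumerate(critical):
    high = [y for levels, y in runs if levels[j] == 1]
    low = [y for levels, y in runs if levels[j] == -1]
    main_effects[f] = sum(high) / len(high) - sum(low) / len(low)

print("shortlisted factors:", critical)
print("estimated main effects:", main_effects)
```

In a real project the shortlist would also be filtered by subject-matter knowledge before any experiment is run, and the designed experiment would be analysed with proper significance tests rather than raw differences in means.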
Note that there is an updated version (as of November 18, 2014) of my presentation
‘A Statistician’s 'Big Tent' View on Big Data and Data Science’
at http://goo.gl/xTcTr9 and/or http://goo.gl/dsXco1
Note that there is an updated version (as of April 30, 2015) of my presentation
‘A Statistician's Introductory View on Big Data and Data Science (Version 7)’
at http://goo.gl/LpvihL and/or http://goo.gl/dsXco1
Note that there is an updated version (as of July 1, 2015) of my presentation
‘A Statistician’s 'Big Tent' View on Big Data and Data Science’
at http://goo.gl/xiZqkC and/or http://goo.gl/dsXco1
Note that there is an updated version (as of April 2016) of my presentation
‘A Statistician’s 'Big Tent' View on Big Data and Data Science in Health Sciences (Version 11)’
Note that there is an updated version (as of August 2016) of my presentation
‘A Swiss Statistician’s 'Big Tent' Overview of Big Data and Data Science in Pharmaceutical Development (Version 12)’
at goo.gl/0C78wy and/or goo.gl/dsXco1