
I was reading the article Gambling versus Probability: Predictive Analytics Requires Advanced... published by Thomas Rathburn on the B-Eye-Network. Here's the section that I found to be filled with wrong information:

Traditional statistical analysis is often of limited value. It is not that these tools are somehow flawed. Rather, it is that they are overly simplistic and, in many cases inappropriate for the task of modeling human behavior.

Traditional statistical techniques are overly simplistic as they are suitable for only the most basic support of our decision making. They typically assume that the interactions in our decision variables are independent of each other, when, in fact, we are bombarded with multiple inputs that are highly interrelated.

Additionally, these simple modeling techniques generally attempt to build linear relationships between the inputs and the desired output. It is often the case that the basic recognition of the non-linear aspects of a solution space will generate improved decision making.

Traditional statistical analysis is often an inappropriate choice because we are attempting to model human behavior. Human behavior is typically not normally distributed, it rarely has a stable mean and standard deviation and it never has inputs into a model that cause a particular type of behavior – conditions that are necessary for the correct application of traditional statistical tools.

My rebuttal:
  • Most statistical models DO NOT assume a normal distribution. None of my models rely on normality; they deal with multimodal or highly skewed distributions (e.g., in the context of fraud detection).
  • Most modern models do not assume that decision variables are independent. See, e.g., my hidden decision tree technology, which handles interactions, as well as many other models that include them.
  • Models with linear relationships are just a very small subset of all models. Hierarchical Bayesian models and stochastic processes are examples of non-linear models.
The author seems to believe that statistics is just linear regression and basic hypothesis tests. That is what you actually study during the first 30 hours of any basic statistics curriculum, but there's much, much more to it than that.
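To make the first bullet concrete, here is a minimal sketch (hypothetical numbers, not from any real fraud system) of why a normality-based outlier rule misbehaves on skewed, multimodal data, while an empirical quantile makes no distributional assumption at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transaction amounts: a right-skewed, multimodal mixture
# (most transactions small, a fraudulent minority much larger).
legit = rng.lognormal(mean=3.0, sigma=0.5, size=9500)
fraud = rng.lognormal(mean=6.0, sigma=0.3, size=500)
amounts = np.concatenate([legit, fraud])

# A normality-based rule (mean + 3 sigma) is distorted by the skew...
z_threshold = amounts.mean() + 3 * amounts.std()

# ...whereas an empirical quantile is read straight off the data.
q_threshold = np.quantile(amounts, 0.95)

print(z_threshold, q_threshold)
```

With these (made-up) parameters the quantile cutoff sits well below the mean-plus-three-sigma cutoff and flags far more of the fraudulent mode.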




Comment by Vincent Granville on January 15, 2010 at 10:16pm
A few answers posted on our LinkedIn group:

Alastair Muir

Lean Leader at GE Energy
I was ready to dismiss the article as trivial. I'm glad you posted your rebuttal - I agree with all your points.

I never assume a normal distribution - if I see one, I assume someone has done some averaging and inadvertently demonstrated the Central Limit Theorem. All statistical tests really come down to measuring or parametrizing the exact nature of the noise (randomness) and comparing it to your signal.
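Alastair's averaging remark is easy to demonstrate. This sketch (illustrative numbers only) draws heavily skewed data and shows that batch means look far more symmetric than the raw values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Raw data: heavily skewed (exponential), clearly non-normal.
raw = rng.exponential(scale=1.0, size=10000)

# Averaging the same data in batches "inadvertently demonstrates" the CLT:
# means of 50-observation batches are far less skewed than the raw values.
batch_means = raw.reshape(200, 50).mean(axis=1)

print(stats.skew(raw), stats.skew(batch_means))
```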

Interdependence is present in one way or another in nearly everything I have run across.

I usually do a lot of bootstrapping and resampling of my historical data when building models.
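The bootstrapping mentioned above can be sketched in a few lines; this is a basic percentile bootstrap on made-up historical data (the real procedure would of course use the actual data and statistic of interest):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for historical data: a skewed sample (illustrative only).
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Percentile bootstrap for the median: resample with replacement,
# recompute the statistic, read the interval off the empirical quantiles.
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(2000)
])
ci_low, ci_high = np.quantile(boot_medians, [0.025, 0.975])

print(np.median(data), ci_low, ci_high)
```

No normality assumption enters anywhere: the interval comes from the empirical distribution of the resampled statistic.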

Cheers, Alastair

Posted 1 day ago
Yogesh N

Senior Software Engineer
The gist of the author's argument in the section on tools is that 'traditional statistical analysis' is of limited value primarily because it:
a. Does not consider interactions between multiple variables simultaneously
b. Assumes, and only tries to model, linear relationships
c. Relies on normal distributions, which many phenomena do not follow

I understand and completely agree with Vincent's and Alastair's comments about
a. not assuming normal distributions, and
b. interdependence being present in one way or another (hidden decision trees, even graphical models).

On nonlinearity:
a. I'm just not clear on how hierarchical Bayes models are nonlinear. (Hierarchical means that the model can have parameters of variables, parameters of those parameters, and so on, each described by a distribution - normal or otherwise.)
b. Nor on how bootstrapping ensures some form of nonlinearity.

Would appreciate any pointers to basic material.

Also, another query/comment (maybe related to the entire topic, not just the tools section): WEKA's creators make a case in their book that one should not look for a dividing line between machine learning (I equate data mining and machine learning because, as claimed, most of the techniques in WEKA originated in machine learning) and statistics, because they form a continuum of data analysis techniques. But if forced to point to a single difference of emphasis - “it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all.”

So I guess it depends on what the author means by ‘traditional statistics’.


Posted 9 hours ago
Abhijit Dasgupta

Principal Statistician at ARAAstat

Bootstrapping does not ensure nonlinearity. It ensures that you are really doing inference based on the empirical distribution of the data (the closest thing to the true distribution we can measure) rather than assuming normality.

Hierarchical Bayes models can model complex dependency structures between variables, and will rarely reduce to a simple linear model. See Andrew Gelman's blog and links from there for an introduction to this world (or the Gelman & Hill book).
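As a pointer for Yogesh, here is a tiny illustration (a toy normal-normal example of my own, not from the Gelman & Hill book) of the partial pooling at the heart of hierarchical models: each group estimate is shrunk toward the grand mean by a weight driven by the between- and within-group variances, rather than fit by one flat model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative two-level data: 20 groups, true group means drawn from a
# population distribution, a few noisy observations per group.
n_groups, n_obs = 20, 5
true_means = rng.normal(0.0, 1.0, size=n_groups)  # between-group sd = 1
obs = true_means[:, None] + rng.normal(0.0, 2.0, size=(n_groups, n_obs))

raw_means = obs.mean(axis=1)

# Partial pooling, computed in closed form for this normal-normal case:
# shrink each raw group mean toward the grand mean by a variance ratio.
tau2 = 1.0                 # between-group variance (known here by design)
sigma2 = 2.0**2 / n_obs    # variance of a raw group mean
weight = tau2 / (tau2 + sigma2)
pooled = weight * raw_means + (1 - weight) * raw_means.mean()

# The shrinkage estimates typically track the true means more closely.
print(np.mean((raw_means - true_means) ** 2),
      np.mean((pooled - true_means) ** 2))
```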


Posted 6 hours ago
John Stewart

Vice President, Analytics at Wireless Generation
As is so often the case, the opposition between so-called traditional statistics and its apparent other is fairly illusory, though to some degree institutionally entrenched. The more effective analysts know when to apply the appropriate technique. I have seen machine learning people approach situations that would standardly be analyzed via parametric techniques with decision trees or naive Bayes or whatever, giving models that 1) overfit the data like crazy and 2) provided no insight into the relationships inherent in the data.

When developing theories you want parsimonious models where the relationships suggested by the analyses can be explained by the theory. In these contexts an extraneous variable that adds 0.5% to, say, an F-score may well not matter. If instead you want to be sure to catch the next 9/11, well then the criteria of model utility are different.

As for normality: when people complain about normality assumptions, ask them where the assumptions must hold and what the cost of their being violated is. Even in basic OLS, the predictors needn't be normally distributed. In many situations the output is multinomial, so normality of the DV is irrelevant. The assumption of normality of errors affects standard errors, i.e., significance tests and confidence intervals; in big-data cases, these are irrelevant. Non-normality of errors might be due to non-normality in the variables (see above), or to very deviant cases, which again is less of an issue for large Ns.
So parametric methods are actually quite robust at scale, and are to be preferred for small samples because they provide better diagnostics -- or, at least, better *known* diagnostics.
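John's point about where the normality assumption actually bites can be checked by simulation. In this sketch (coefficients invented for illustration) the errors are badly skewed, yet the OLS coefficients come out fine; only the small-sample standard errors would be affected:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated regression with grossly non-normal (skewed exponential,
# recentred to mean zero) errors, in a "big data" setting.
n = 100_000
x = rng.uniform(0, 10, size=n)
errors = rng.exponential(scale=2.0, size=n) - 2.0
y = 1.5 + 3.0 * x + errors

# Plain least squares, no normality anywhere in the fit itself.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)  # close to the true (1.5, 3.0)
```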

The assumption of linearity, on the other hand, is crucial, and should be thought through carefully. Here my comment is that people sometimes forget how prevalent the assumption is, however deeply buried in whatever procedure. For example, how aware are people building hidden Markov models, using E-M for Gaussian mixtures, of the linearity assumptions inherent in their model?

Posted 5 hours ago
Abhijit Dasgupta

Principal Statistician at ARAAstat
John makes a lot of good points. The one place where I think traditional mean-based estimators (like OLS or WLS) are misused is for highly skewed data, where some sort of median- or quantile-based regression would be more representative. Generally, for large-data modeling, hypothesis testing and confidence intervals are effectively irrelevant, so the normality assumption is not a deal-breaker.

In my practice I see the default methodology used by many to be linear models in some sense, without any exploration of inherent patterns in the data. This can lead to grossly erroneous conclusions. The default really needs to be flexible non-linear models, where the linear situation becomes a special case if the data support it.
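A quick illustration of the skewness point (an arbitrary lognormal example): the mean that OLS-style estimators target sits well above the typical value, while the median tracks the bulk of the data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Highly skewed outcome (think incomes or claim sizes; illustrative only).
y = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)

# The mean is dragged far above the bulk of the data by the long right
# tail; the median summarizes a "typical" value much better.
print(y.mean(), np.median(y))
print((y < y.mean()).mean())  # well over half the data lies below the mean
```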

Posted 4 hours ago
Ralph Winters

Data Analytics, SAS BI Development, Fortune Teller
I also reread the article several times, and I do not understand the author's comment that statistics "assume that the interactions in our decision variables are independent of each other....".
I have never made any assumptions about that in my work, and there are very clear guidelines for including interactions in regression models. That is really what separates a good regression model from a "simplistic" one.
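Ralph's point is easy to demonstrate: including an interaction is just one extra column in the design matrix of an ordinary regression. A minimal sketch with simulated data (coefficients invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Data generated WITH an interaction: the effect of x1 depends on x2.
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

# Adding the product term x1*x2 to the design matrix is all it takes
# to drop any independence-of-effects assumption.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)  # recovers roughly (1.0, 2.0, -1.0, 1.5)
```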

Am I missing something? Do we need Mr. Rathburn to pipe in and clarify?

-Ralph Winters

Posted 3 hours ago
Vincent Granville

Business Analytics Leader: BI, Web Mining, Ad Optimization, Fraud Detection, Scoring, Predictive Modeling, Web Crawlers
To Yogesh - An example of Bayesian hierarchical model that I used is a stochastic process to model storms, where the top layer is a Poisson process to model the storm centers, and a child process for storm cells (distributed around each storm center according to a radial 2-dim distribution). Other distributions involved in the model include storm intensity, storm velocity, storm direction, etc. There is nothing linear in this model, and no normal distribution involved. This type of model has been around for decades, so it's certainly "traditional statistics".
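For readers who want to see the shape of such a model, here is a rough sketch of the two-layer cluster process described above (all rates and spreads are made up, and the real model involves many more components, such as intensity, velocity, and direction):

```python
import numpy as np

rng = np.random.default_rng(7)

# Top layer: storm centers from a Poisson process on a 10x10 region.
region = 10.0
n_centers = rng.poisson(lam=5)
centers = rng.uniform(0, region, size=(n_centers, 2))

# Child layer: a Poisson number of storm cells per storm, scattered
# around each center with a radial (isotropic Gaussian) spread.
cells = []
for cx, cy in centers:
    n_cells = rng.poisson(lam=8)
    offsets = rng.normal(0, 0.5, size=(n_cells, 2))
    cells.append(np.array([cx, cy]) + offsets)
cells = np.vstack(cells) if cells else np.empty((0, 2))

print(n_centers, len(cells))
```

Nothing here is linear and nothing is normally distributed except the purely geometric scatter of cells around their parent center.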
Comment by Dominic Pouzin on January 14, 2010 at 9:50pm
One key point is that the author complains about the limitations of "traditional statistical analysis", which is a bit vague.

Assumption of Normality:
I think "traditional statistical analysis" ought to include non-parametric stats. Pretty much all stat books cover tests such as Kruskal-Wallis, Spearman's rank, etc. To be fair, these tests make some assumptions, but at least they don't assume a normal distribution. I also can't help but notice that even Excel includes stat functions such as SKEW, GAMMADIST, POISSON, etc. OK, Excel could have better support for non-parametric tests (but I digress...).
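SciPy makes that non-parametric route trivial; this sketch (made-up samples) runs a Kruskal-Wallis test on skewed data with no normality assumption anywhere:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Three skewed (exponential) samples: two share a scale, one is shifted.
a = rng.exponential(scale=1.0, size=60)
b = rng.exponential(scale=1.0, size=60)
c = rng.exponential(scale=3.0, size=60)

# Kruskal-Wallis compares the groups on ranks, not raw values.
h_stat, p_value = stats.kruskal(a, b, c)
print(h_stat, p_value)  # small p-value: the groups differ in location
```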

Most techniques aiming to build de-correlated estimators (while also able to capture complex interactions between variables) haven't fully moved into the mainstream. So perhaps the author is right to say that "traditional statistical analysis" (= Excel?) remains inadequate in that regard. Up to us data crunchers to fix that one, I guess :)

Overall, I'd agree with the author that many people want a "quick fix" - which usually means linear regression, etc.
