# AnalyticBridge

A Data Science Central Community

# Correlation

Does correlation imply causation?

### Replies to This Discussion

I don't think so.

My all-time favorite example: a lot of people die in hospitals (at least here in Germany; I do not know what all the Americans without health insurance do), so does this imply that a hospital is a dangerous place?
On the contrary, it does. It is simply word trickery to treat "dangerous" as a reflexive relation. If someone is likely to die, they will be taken to a hospital, not the other way around. The causation holds here.
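
To make the direction of the arrow concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the `severity` score, the admission rule, and the 20% risk reduction from hospital care are invented numbers, not real data. It only shows that hospital patients can die more often than people at home even when the hospital lowers each patient's risk.

```python
import random

def simulate(n=100_000, seed=0):
    """Return (in_hospital, died) pairs for n hypothetical people."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.random()                 # assumed health risk in [0, 1)
        in_hospital = rng.random() < severity   # sicker people are admitted more often
        # Assumption: hospital care cuts the risk of dying by 20%.
        p_death = 0.5 * severity * (0.8 if in_hospital else 1.0)
        died = rng.random() < p_death
        rows.append((in_hospital, died))
    return rows

def death_rate(rows, hospital):
    """Share of deaths among people with the given hospital status."""
    group = [died for in_h, died in rows if in_h == hospital]
    return sum(group) / len(group)

rows = simulate()
rate_hospital = death_rate(rows, True)
rate_home = death_rate(rows, False)
print(f"death rate in hospital: {rate_hospital:.3f}")
print(f"death rate at home:     {rate_home:.3f}")
```

In this toy setup the hospital group has the higher death rate, purely because severity drives admission.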

Hospitals, alas, are becoming a growing source of in-house infection, so the implication hospital->danger may be literally correct. On the other hand, hospitals are clearly safe from other types of danger. You are not likely to be hit by a car there. This means we need to consider more factors. In reality, we always deal with several inter-correlated attributes with various degrees of mutual dependency. My question now: given a table of observations and no other information, can we statistically infer which attributes are likely to be the real-life antecedents and which ones are the consequents? Is this possible?
Touché! What a brilliant thought!

@Uri: A difficult question ...
My example relies on a dataset like "dies in" followed by several variables like hospital, home, etc. In this case it is not possible to infer such connections, because we simply do not have data on the previous state of health.

On the other hand, consider this dataset:
probability of dying (before treatment); treated_in_hospital; dies

Analyzing this set you have found out that:
1. "probability of dying" is correlated with "treated_in_hospital" (depending on the quality of your health care system :( )
2. "treated_in_hospital" is NOT correlated with "dies".
3. "probability of dying" is correlated with "dies".

Summarized:
3. would be no surprise (normally), but given 1. we ARE SURPRISED, because 2. and 3. imply that a) hospitals fail to cure or b) kill people due to other reasons.

So in conclusion I think that it is not always possible to differentiate antecedents and consequents without incorporation of expert knowledge. Another example is:
"All criminals drink water"
Nice. If you only have data on criminals, you cannot detect that this is a fake correlation. On the other hand, if you have data on both criminals and non-criminals, you clearly see that the variable "drinks water" carries no information, and hence, although it is an antecedent, "is criminal" is not a consequence.
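
A small sketch of why "drinks water" carries no information once both groups are in the table. The six rows are a made-up toy table, and `pearson` is a hand-written helper, not a library function. A column that never varies has zero variance, so its correlation with anything is undefined.

```python
from statistics import pvariance

# Hypothetical toy table: (drinks_water, is_criminal), 1 = yes, 0 = no.
people = [(1, 1), (1, 0), (1, 1), (1, 0), (1, 0), (1, 0)]
drinks_water = [d for d, _ in people]
is_criminal = [c for _, c in people]

def pearson(x, y):
    """Pearson correlation; returns None when either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # zero variance: the correlation is undefined
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

water_variance = pvariance(drinks_water)
water_vs_crime = pearson(drinks_water, is_criminal)
print(water_variance, water_vs_crime)
```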
Where's the "LIKE IT" button next to Steffen's post? I wanna click it!

Dear AnalyticBridge, please implement "LIKE IT". I need it. Kind regards, Jozo
I'll check with Ning whether this nice feature can be implemented. However, it needs to be implemented carefully, to avoid abuse.
Steffen, that's a perfect analogy. I will have to carry your example with me, though. Thank you.
Maybe this is harder than I thought:
Can anyone come up with an example where "correlation does not imply causation" is true and the reason is NOT a limited amount of data? Is THIS possible? :)
I am not sure if this answers your question, but there are correlations between the usage of specific features of a product and
a) usage of other features of the product
b) behavior on the product (for example, sticky rates)

A specific example would be: "consumers who use feature A of the product tend to have higher spend and a longer relationship with the product".
This correlation does not imply causality. We cannot state that people stick with the product longer because they use feature A. Rather, there is an underlying factor that is probably causing both. You can call it data availability, but it is just a different level of data (the psychological underpinnings of adopters of feature A and/or their decision-making) as opposed to observable behavioral data (all of which might be available).
Steffen, I agree. Even if correlation and causation are not blood relatives they are from the same clan.

A simple, typical example could be two attributes that are correlated because both depend on a common third one that is hidden from view. Say you have crime rate and per capita income in various neighborhoods. Many attempts have been made to derive one directly from the other, but we could also consider factors like education. This looks like a better predictor. But wait... surely more money lets you get better training?! Of course, if the education column is not in the table, what can we say at all?
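
Here is a rough sketch of the hidden-third-attribute case on synthetic data. The causal structure is an assumption chosen for illustration (education drives both income and crime, with independent noise), and the numbers are generated, not observed. `pearson` and `partial_corr` are hand-written helpers; the partial correlation uses the standard first-order formula.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(1)
# Assumed structure: education is the common cause of both other columns.
education = [rng.gauss(0, 1) for _ in range(20_000)]
income = [e + rng.gauss(0, 1) for e in education]
crime = [-e + rng.gauss(0, 1) for e in education]

r_raw = pearson(income, crime)
r_partial = partial_corr(income, crime, education)
print(f"income vs crime:                   {r_raw:.2f}")
print(f"income vs crime, given education:  {r_partial:.2f}")
```

With the education column available, the income/crime correlation mostly disappears once education is controlled for. Without that column, only the raw correlation is visible.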

Let me tell you what I think about this chicken-and-egg conundrum. This will be a rather wild heuristic, reeking of junk science. Well, what if we take the mean into account and compute C = sigma / mean? This quantity is called the "coefficient of variation" and is used in some murky applications. I would like to put forward a conjecture: "the attribute with the smaller C is more likely to be the antecedent of the correlation". We therefore want the means to be fairly positive. And, sure, this is a general postulate: a decent mean must always be positive. Good nature abhors negative or small means; they should be regarded as a fiendish eccentricity.

The explanation of the "minimum variation" principle is truly scholastic and would probably make St. Thomas Aquinas happy. In the natural world the variations of the "stimulus" are expected to be on a smaller scale than the variations of the "reaction", because there are other independent factors also adding to the reaction. If I throw pebbles into our village pond, my arm's movements (stimulus) are accompanied by random gusts of air and by inconsistencies in the pebbles' shape and weight. Because of this, the relative variations of all the pebble trajectories (reaction) are larger than the relative variations of my arm movements. So we can use C as the causation score.
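
A minimal sketch of the pebble intuition, under stated assumptions: the reaction is modelled as a scaled copy of the stimulus plus independent zero-mean noise, and all the numbers (mean 10, the factor 2, the noise levels) are invented. It illustrates the arithmetic behind the conjecture; it does not test the conjecture on real data.

```python
import random
from statistics import mean, pstdev

def coeff_var(values):
    """Coefficient of variation C = sigma / mean (mean assumed positive)."""
    return pstdev(values) / mean(values)

rng = random.Random(2)
# Stimulus: arm movement with a clearly positive mean and small spread.
arm = [rng.gauss(10.0, 1.0) for _ in range(10_000)]
# Reaction: depends on the arm, plus independent disturbances (wind, pebble shape).
trajectory = [2.0 * a + rng.gauss(0.0, 3.0) for a in arm]

c_arm = coeff_var(arm)
c_trajectory = coeff_var(trajectory)
print(f"C(arm)        = {c_arm:.3f}")
print(f"C(trajectory) = {c_trajectory:.3f}")
```

Under this additive-noise assumption the reaction always ends up with the larger C, because the noise widens the spread without changing the mean.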

In spite of the hand waving and arm twisting in the air, this theory is only half-jocular :-). Covariance between two attributes is strictly symmetric. To estimate causation we obviously need to break the symmetry. I do not have enough data samples at hand for testing. If you or somebody else can point to a simple collection of easily "causable" tables, I would take pains to calculate the causation scores. Who knows, maybe it will work?

Thank you
As a minor member of the Bayesian Conspiracy, I support this idea.

"A causes B" => "var(A) <= var(B)"

But if the left side is not true, the statement is true whatever you write on the right side. Still, I am pretty sure that you have a lot more in mind about how to calculate a causation score. If everything fails, we can still learn an optimal Bayesian network (pff ... NP-hard) and see what we get.
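
The logical point can be spelled out as a truth table. This is only a sketch of material implication; `implies` is a helper defined here, and the variable names are labels for the two statements. It also prints the converse, which is the direction a causation score would actually need.

```python
def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

print("A causes B | var(A)<=var(B) | forward | converse")
for a_causes_b in (True, False):
    for var_a_le_var_b in (True, False):
        forward = implies(a_causes_b, var_a_le_var_b)    # causes => variance order
        converse = implies(var_a_le_var_b, a_causes_b)   # variance order => causes
        print(a_causes_b, var_a_le_var_b, forward, converse)
```

Whenever "A causes B" is false, the forward statement holds regardless of the variances, so it says nothing about such pairs.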

Unfortunately, I do not have such data available.

Good luck!
Thank you for revealing the very existence of the Conspiracy. I wonder who might be the Magnus Magister of the Order. I will try to get some data. Maybe it will serve as my rite of passage if, in true Bayesian spirit, we see that "[var(A) <= var(B)] => [A causes B]".

Thanks again