AnalyticBridge

A Data Science Central Community

Challenge of the week

You have 3 random variables X, Y, and Z, with corr(X,Y) = 0.70 and corr(X,Z) = 0.80. What is the minimum value for corr(Y,Z)? Can this correlation be negative?

The answer can be found on page 45 of my book.

Replies to This Discussion

This is standard L^2 theory stuff. The lower and upper bounds are given by the following expressions.

corr(Y,Z) >= corr(X,Y)*corr(X,Z) - sqrt((1-corr(X,Y)^2)*(1-corr(X,Z)^2)) and

corr(Y,Z) <= corr(X,Y)*corr(X,Z) + sqrt((1-corr(X,Y)^2)*(1-corr(X,Z)^2))

The bounds here are 0.1315 and 0.9885 if I've calculated them correctly, so corr(Y,Z) cannot be negative.
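For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two expressions above. The function name corr_yz_bounds is made up for illustration; only the formula comes from the post.

```python
import math

def corr_yz_bounds(r_xy, r_xz):
    """Return (lower, upper) bounds on corr(Y,Z) given corr(X,Y) and corr(X,Z).

    Hypothetical helper: it only evaluates the two expressions quoted above.
    """
    center = r_xy * r_xz
    half_width = math.sqrt((1 - r_xy**2) * (1 - r_xz**2))
    return center - half_width, center + half_width

lower, upper = corr_yz_bounds(0.70, 0.80)
print(f"lower = {lower:.4f}, upper = {upper:.4f}")
```

With the inputs from the challenge, the lower bound comes out positive, which is why the correlation cannot be negative in this case.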

Keith Portman

That sounds correct based on my memory of when I worked on this problem: your first 2 digits are identical to mine, for sure. My solution was based on the fact that the correlation matrix is positive semidefinite, thus its determinant must be >= 0. This resulted in a second-degree polynomial with solutions 0.99 and 0.13. Obviously, you came up with the correct solution without reading my book. Congratulations!
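A sketch of how the determinant argument might go, using shorthand that is not in the original posts: a = corr(X,Y), b = corr(X,Z), r = corr(Y,Z). Expanding the 3x3 determinant gives a quadratic in r whose roots match the bounds quoted earlier in the thread.

```latex
\[
\det\begin{pmatrix} 1 & a & b \\ a & 1 & r \\ b & r & 1 \end{pmatrix}
  = 1 - a^2 - b^2 - r^2 + 2abr \;\ge\; 0
\]
\[
\iff\; r^2 - 2ab\,r + (a^2 + b^2 - 1) \;\le\; 0
\;\iff\; ab - \sqrt{(1-a^2)(1-b^2)} \;\le\; r \;\le\; ab + \sqrt{(1-a^2)(1-b^2)}
\]
```

The last step uses the fact that the discriminant term a^2 b^2 - a^2 - b^2 + 1 factors as (1-a^2)(1-b^2). With a = 0.70 and b = 0.80 the roots are approximately 0.13 and 0.99.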

What if we are dealing with n (rather than 3) variables?

I'm not sure whether it generalizes. Let me think about it. Also, let me know if you want my proof.

KP

Yes, I'd be interested to read your proof and any generalization to more than 3 variables. Our next challenge will involve n = 5,000 or 100,000 variables (simulated data, uniform on [0,1]): compute the expected number N of correlations (out of n(n-1)/2) that are above 0.90 or 0.95, with confidence intervals. Either a theoretical solution or Monte Carlo simulations (based on a sound uniform random number generator) are OK to answer this next challenge.

OK, here is a sketch. Without loss of generality, we can assume that X, Y, and Z have mean zero and variance one, so that the inner product <U,V> = E[UV] is the covariance and <X,Y> = corr(X,Y).

Write Y as Y = corr(X,Y)*X + A, where A is uncorrelated with X

Write Z as Z = corr(X,Z)*X + B, where B is uncorrelated with X

Then corr(Y,Z) = corr(X,Y)*corr(X,Z) + <A,B>, since the cross terms are zero.

Now, |<A,B>| <= sqrt(<A,A>*<B,B>) by Cauchy-Schwarz.

Also, 1 = <Y,Y> = corr(X,Y)^2 + <A,A> and

1 = <Z,Z> = corr(X,Z)^2 + <B,B> since the cross terms are again zero.

So, <A,A> = 1-corr(X,Y)^2 and <B,B> = 1-corr(X,Z)^2

Putting this together, we have

corr(Y,Z) <= corr(X,Y)*corr(X,Z) + sqrt((1-corr(X,Y)^2)*(1-corr(X,Z)^2)) and

corr(Y,Z) >= corr(X,Y)*corr(X,Z) - sqrt((1-corr(X,Y)^2)*(1-corr(X,Z)^2))

Q.E.D.
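The proof also suggests that both bounds can actually be reached: Cauchy-Schwarz holds with equality when A and B are proportional. Here is a small Python sketch of that idea, under the assumption that unit vectors in the plane stand in for the standardized variables and the dot product stands in for the correlation. The names dot and extremal_vectors are hypothetical.

```python
import math

def dot(u, v):
    """Plain dot product, playing the role of <U,V> in the proof."""
    return sum(p * q for p, q in zip(u, v))

def extremal_vectors(r_xy, r_xz, sign):
    """Build unit vectors x, y, z with <x,y> = r_xy and <x,z> = r_xz.

    The residual parts of y and z (A and B in the proof) lie along the
    same axis; sign=+1 makes them parallel, sign=-1 antiparallel.
    """
    x = (1.0, 0.0)
    y = (r_xy, math.sqrt(1 - r_xy**2))
    z = (r_xz, sign * math.sqrt(1 - r_xz**2))
    return x, y, z

for sign in (-1, +1):
    x, y, z = extremal_vectors(0.70, 0.80, sign)
    print(sign, round(dot(y, z), 4))
```

With sign = -1 the value of <y,z> lands on the lower bound and with sign = +1 on the upper bound, so neither inequality can be tightened.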

Is the answer to this question 0.3? Please see the code below. After you run it, you estimate the proportion by summing corr_flag and dividing by the number of observations, n*(n-1)/2.

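data temp;
  /* One simulated uniform(0,1) draw per variable, 2000 variables */
  array ran{2000} ran1-ran2000;
  do i = 1 to 2000;
    ran{i} = ranuni(1342);  /* uniform(0,1) generator, seed 1342 */
    do j = 1 to i-1;
      /* Standardized product for the pair (i,j): 0.25 is the mean of the  */
      /* product of two independent uniform(0,1) draws, and sqrt(1/144) =  */
      /* 1/12 is the variance of a single draw. Flag values of 0.9 or more. */
      if (ran{i}*ran{j} - .25)/(sqrt(1/144)) >= .9 then corr_flag = 1;
      else corr_flag = 0;
      output;  /* one row per pair, n*(n-1)/2 rows in total */
    end;
  end;
run;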
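As a cross-check on the 0.3 figure, here is a rough Python sketch (Python rather than SAS, so the random stream will not match ranuni). It relies on reading the flag condition as ran{i}*ran{j} >= 0.25 + 0.9/12 = 0.325, and on the standard result that P(U*V >= c) = 1 - c + c*ln(c) for independent uniform(0,1) U and V with 0 < c < 1. The function names are invented for illustration.

```python
import math
import random

# The SAS flag is 1 exactly when U_i * U_j >= 0.25 + 0.9/12 = 0.325.
THRESHOLD = 0.25 + 0.9 / 12

def exact_flag_rate(c=THRESHOLD):
    """P(U*V >= c) for independent uniform(0,1) U and V, with 0 < c < 1."""
    return 1 - c + c * math.log(c)

def simulated_flag_rate(n, seed=1342):
    """Mimic the SAS loop: one draw per variable, one flag per pair (i, j)."""
    rng = random.Random(seed)  # Python seed; does not reproduce ranuni(1342)
    draws = [rng.random() for _ in range(n)]
    flagged = 0
    for i in range(n):
        for j in range(i):
            if (draws[i] * draws[j] - 0.25) / math.sqrt(1 / 144) >= 0.9:
                flagged += 1
    return flagged / (n * (n - 1) / 2)

print(round(exact_flag_rate(), 4))
```

The closed form gives roughly 0.31, in line with the 0.3 estimate; calling simulated_flag_rate(2000) should land near the same value. Note that the flag is based on a standardized product of single draws, so this checks the posted code rather than a sample correlation computed over many observations per variable.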

Very interesting. Thanks for posting this reference.