A Data Science Central Community
What is the best correlation coefficient R(X, Y) to measure non-linear dependencies between two variables X and Y? Let's say that you want to assess whether there is a linear or quadratic relationship between X and Y. One way to do it is to perform a polynomial regression such as Y = a + bX + cX^2, and then measure the standard coefficient of correlation between the predicted and observed values. How good is this approach?
Note that the proposed correlation coefficient R(X, Y) is not symmetric. One way to get a symmetric version is to use the maximum of | R(X, Y) | and | R(Y, X) |. It will be equal to 1 if and only if there is an exact polynomial or inverse polynomial relationship between X and Y.
Note: If one checks the model Y = a + bX + cX^2, the "inverse polynomial" model would be X = a' + b'Y + c'Y^2. So, R(X, Y) is computed on the first regression, while R(Y, X) is computed on the second (reversed, also called dual) regression.
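The two regressions described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `poly_corr` and `symmetric_poly_corr`, the use of NumPy's `polyfit`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def poly_corr(x, y, degree=2):
    """Correlation between observed y and the values predicted by a
    least-squares polynomial regression of y on x (hypothetical helper)."""
    coeffs = np.polyfit(x, y, degree)   # coefficients, highest power first
    y_hat = np.polyval(coeffs, x)       # fitted values
    return np.corrcoef(y, y_hat)[0, 1]

def symmetric_poly_corr(x, y, degree=2):
    """Symmetric version: maximum of |R(X, Y)| and |R(Y, X)|."""
    return max(abs(poly_corr(x, y, degree)), abs(poly_corr(y, x, degree)))

# Synthetic data (an assumption for illustration): an exact quadratic in x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x**2

r_xy = poly_corr(x, y)                # regression Y = a + bX + cX^2
r_yx = poly_corr(y, x)                # reversed (dual) regression
r_sym = symmetric_poly_corr(x, y)
r_linear = np.corrcoef(x, y)[0, 1]    # ordinary Pearson correlation

print("R(X, Y)   =", r_xy)            # close to 1: the model is exact
print("R(Y, X)   =", r_yx)            # much lower: X is not a polynomial in Y
print("symmetric =", r_sym)
print("Pearson   =", r_linear)        # near 0 despite the exact dependency
```

On data like this, the ordinary Pearson correlation misses the dependency, while R(X, Y) picks it up. The asymmetry is also visible, since the reversed regression fits poorly.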
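An issue with my approach is the risk of over-fitting. If you have n observations and n coefficients in the regression, the fitted polynomial passes through every observation (provided the X values are distinct), so my correlation will always be 1, whatever the data.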
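The over-fitting issue is easy to reproduce. The sketch below fits polynomials to pure noise, so there is no real relationship to find. The sample size, the seed and the helper name `poly_corr` are assumptions chosen for illustration.

```python
import numpy as np

def poly_corr(x, y, degree):
    """Correlation between observed y and polynomial-regression predictions."""
    coeffs = np.polyfit(x, y, degree)
    return np.corrcoef(y, np.polyval(coeffs, x))[0, 1]

rng = np.random.default_rng(1)
n = 6
x = np.linspace(0.0, 1.0, n)   # n distinct X values
y = rng.normal(size=n)         # pure noise, generated independently of x

# A degree n - 1 polynomial has n coefficients and interpolates all n points.
r_by_degree = {d: poly_corr(x, y, d) for d in (1, 2, n - 1)}
for degree, r in r_by_degree.items():
    print(f"degree {degree}: correlation = {r:.6f}")
```

With n coefficients the correlation comes out equal to 1 up to rounding error, even though Y was generated independently of X.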
There are various ways to avoid this problem, for instance:
The correlation coefficient in question can also be used for model selection: The best model would provide the correlation closest to 1.
This is interesting; I am wondering if it is possible to share the data you have adapted for this figure?
In general, I would recommend Mutual Information as an approach to measuring the strength of a relationship between two variables with an unknown linear, nonlinear and even non-functional relationship between them.
There are some challenges, though: most notably, you need to estimate the joint density of the two variables (discussed in the paper supplement and the attached slides). With the number of data points illustrated above, that would not be possible. But then again, you can't really fit a high-order polynomial through the number of points shown above either (i.e., it's obviously overfit, as illustrated). So the issue of sample size is really a generic problem. Nevertheless, MI is probably a bit more data hungry than a constrained analytical model (if you know in advance what the model should be).
An example of the application of this in the biological sciences: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbi...
An introduction to mutual information can also be found in the attached slide excerpt.
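For readers who want to try the mutual information idea mentioned above, here is a minimal sketch. It uses a simple histogram (plug-in) estimate of the joint density, which is an assumption made for illustration and not necessarily the estimator used in the paper or slides referenced above. The function name `mutual_information`, the choice of 10 bins and the synthetic data are likewise hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram (plug-in) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # estimated joint probabilities
    px = pxy.sum(axis=1, keepdims=True)     # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal of y, shape (1, bins)
    nonzero = pxy > 0                       # skip empty cells (0 * log 0 = 0)
    ratio = pxy[nonzero] / (px @ py)[nonzero]
    return float(np.sum(pxy[nonzero] * np.log(ratio)))

rng = np.random.default_rng(0)
n = 5000

# Non-functional relationship: noisy points on a circle.
theta = rng.uniform(0.0, 2.0 * np.pi, n)
x_circle = np.cos(theta) + 0.05 * rng.normal(size=n)
y_circle = np.sin(theta) + 0.05 * rng.normal(size=n)

# Independent variables, for comparison.
x_ind = rng.normal(size=n)
y_ind = rng.normal(size=n)

mi_circle = mutual_information(x_circle, y_circle)
mi_ind = mutual_information(x_ind, y_ind)
r_circle = np.corrcoef(x_circle, y_circle)[0, 1]

print("MI, circle      :", mi_circle)   # clearly above the independent case
print("MI, independent :", mi_ind)      # small, but not exactly 0
print("Pearson, circle :", r_circle)    # near 0
```

The plug-in estimate is biased upward, and the bias grows as the sample shrinks or the number of bins increases. That is one concrete form of the data-hunger problem: with only a handful of points, the estimate for independent variables would no longer be close to zero.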