A Data Science Central Community
I was wondering if you are aware of any methodology to perform multivariate linear regression on non-standard spaces or domains.
The problem I have in mind is as follows:
I am trying to reverse-engineer the recipe for Coca-Cola. The response, Y, measures how close my recipe is to the actual formula, based on tastings performed by a number of different people according to a design-of-experiments plan. It is quite similar to a clinical trial in which a mix of atoms or chemical radicals (each combination producing a unique molecule) is tested to optimize a drug. The independent variables are binary, each one representing an ingredient: salt, water, corn syrup, etc. The value is 1 if the ingredient in question is present in the recipe, 0 otherwise. The regression coefficients a_k (k = 1,...,m) must therefore meet the following requirements: each a_k is greater than or equal to zero, and SUM(a_k) = 1.
In short, I'm doing a regression on the simplex, where the a_k's represent the proportions of a mix. An interesting property of this regression is the fact that the sum of the squares of the a_k coefficients is equal to the square of the area of the (m-1)-dimensional face defined by SUM(a_k) = 1 and a_k greater than or equal to zero (this is just a generalization of Pythagoras' theorem).
I'm wondering if Lasso, ridge or logic (not logistic) regression can solve this problem, or if there is a better solution. And what about solving a regression on a sphere - e.g.
What about a solution that consists of mapping the sphere onto a plane and solving the regression in the plane?
Have you looked at John Cornell's work on mixture models? It consists of almost exactly this kind of problem. He uses regression, but it should be straightforward to add regularization.
I think it should work if you add your constraints to an optimization problem that minimizes the loss function you want (sum of squares).
Your constraints are still convex, hence the resulting problem is tractable.
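A minimal sketch of that constrained least-squares idea, assuming NumPy and using projected gradient descent as one possible convex solver (the thread does not name a solver). The function names and the simulated tasting data are hypothetical, for illustration only.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_k >= 0, sum(a_k) = 1}."""
    u = np.sort(v)[::-1]                      # components in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index that stays positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def simplex_least_squares(X, y, n_iter=2000):
    """Minimise ||y - X a||^2 over the simplex by projected gradient descent."""
    m = X.shape[1]
    a = np.full(m, 1.0 / m)                   # start at the centre of the simplex
    # Step size 1/L, where L = 2 * (largest singular value of X)^2
    # is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ a)
        a = project_to_simplex(a - step * grad)
    return a

# Hypothetical data: 60 tastings, 4 binary ingredient indicators.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 4)).astype(float)
a_true = np.array([0.5, 0.3, 0.2, 0.0])
y = X @ a_true + 0.05 * rng.normal(size=60)
a_hat = simplex_least_squares(X, y)
print(a_hat, a_hat.sum())
```

Because both the loss and the constraint set are convex, any local minimum found this way is a global one; a general-purpose quadratic programming solver would do the same job.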
I'd be curious how you compute confidence intervals for the coefficients, or test whether some are equal to 0. This might be a bit more tricky, though you can always use my non-parametric, model-free approach.
If you think of a Bayesian regression model and place a Dirichlet prior over the weights' mean, then this may do what you need. But you will need to do approximate inference, since the prior is no longer conjugate.
Maybe a Bayesian approach with MCMC.
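A rough sketch of what such an MCMC approach could look like, under assumptions that are mine and not from the thread: the Dirichlet prior is placed directly on the weights, the noise is Gaussian with a known standard deviation sigma, and the sampler is a plain independence Metropolis-Hastings that proposes from the prior. All function names are hypothetical. It is simple rather than efficient, and only practical for a handful of ingredients.

```python
import numpy as np

def log_likelihood(a, X, y, sigma):
    """Gaussian log-likelihood (up to a constant) of y = X a + noise."""
    resid = y - X @ a
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

def sample_posterior(X, y, sigma, alpha, n_iter=20000, seed=0):
    """Independence Metropolis-Hastings with proposals drawn from the
    Dirichlet(alpha) prior, so the acceptance ratio reduces to the
    likelihood ratio. Returns the draws after discarding a burn-in."""
    rng = np.random.default_rng(seed)
    current = rng.dirichlet(alpha)
    current_ll = log_likelihood(current, X, y, sigma)
    samples = np.empty((n_iter, len(alpha)))
    for i in range(n_iter):
        proposal = rng.dirichlet(alpha)
        proposal_ll = log_likelihood(proposal, X, y, sigma)
        if np.log(rng.uniform()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
        samples[i] = current
    return samples[n_iter // 4:]              # drop the first quarter as burn-in

# Hypothetical data: 40 tastings, 3 binary ingredient indicators.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(40, 3)).astype(float)
a_true = np.array([0.6, 0.3, 0.1])
y = X @ a_true + 0.1 * rng.normal(size=40)

samples = sample_posterior(X, y, sigma=0.1, alpha=np.ones(3))
print("posterior mean:", samples.mean(axis=0))
print("95% intervals:", np.percentile(samples, [2.5, 97.5], axis=0))
```

Every draw lies on the simplex by construction, and the percentiles of the draws give interval estimates for the coefficients directly.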
Some of them look like attribution problems. For the "Coca-Cola" problem you can try L² minimization, and even relax the simplex constraint (assuming there are no non-linearities, and your minimization variables are not meant to form a probability distribution). If you have R in your bag of tricks you can use some code I wrote here: http://jcborras.net/carpet/voting-sympathies-in-double-round-electi... (see the "Candidate drop and its..." section and, if the simplex constraint is very important for you, the "Final vote mix..." section).
If your input data is large (as in the number of samples), your matrices may grow too big, though.
You are facing a "composition" problem. You are right to be wary of doing a normal regression on the component percentages; it has flaws. John Aitchison solved this problem, which is big in the mining industry, where one takes core samples. He showed that you must first translate the k constrained variables into a (k-1)-dimensional set of unconstrained variables by using log-ratios. That is, z_j = log(x_j/x_k) for j = 1,...,k-1. (This assumes x_k is not zero; there are other ways to do it if that's a problem.)
Then, with the answer, you can translate back into the original variables.
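The log-ratio transform and its inverse can be sketched as follows, using only the Python standard library. The function names are hypothetical, and the last component is taken as the reference x_k; every component is assumed to be strictly positive.

```python
import math

def alr(x):
    """Additive log-ratio transform: z_j = log(x_j / x_k), j = 1,...,k-1,
    with the last component x_k as the reference."""
    if any(v <= 0 for v in x):
        raise ValueError("all components must be strictly positive")
    ref = x[-1]
    return [math.log(v / ref) for v in x[:-1]]

def alr_inverse(z):
    """Map k-1 unconstrained values back to k proportions that sum to 1."""
    exps = [math.exp(v) for v in z] + [1.0]   # the reference component maps to exp(0)
    total = sum(exps)
    return [e / total for e in exps]

mix = [0.5, 0.3, 0.2]          # hypothetical three-ingredient composition
z = alr(mix)                   # two unconstrained coordinates
back = alr_inverse(z)          # recovers the original proportions
print(z, back)
```

Any ordinary unconstrained method can then be applied to the z values, and whatever it returns is mapped back to the simplex with the inverse.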
Seems like a good approach! I felt uneasy about the idea of running regression on the constrained optimization problem's variables. How to be certain that it made sense to layer one method over another? Either one would need to work it out as a proof (ugh) or try it with numeric data, then sanity check the results. The latter is not ideal, e.g. I wouldn't want to defend that as my rationale!
I like the idea of separating the problem into two parts by translating the constrained variables, solving that, then going back and doing the rest, so to speak ;o)
Can you illustrate the idea more for me? I think the issue in this problem is that we do not observe the ratios in the training data; we just observe a value (say from 0 to 1).
The constraints here are on the hidden variables, not on the output variable, if I understand correctly.
Philip Hanser suggested the following eBook: A Concise Guide to Compositional Data Analysis (by John Aitchison, estimated publication date is 1999 based on references) which deals with mixture / simplex domain. I did not find a section on "simplex regression", but reading this book is a good starting point.
Update: My intent was more to create a competing product that tastes the same, call it something different from Coke, and sell it for far less. If the ingredients are different, even very different, even though the taste is identical, it is actually a significant benefit, because Coke manufacturers won't be able to successfully sue you.
I think Virgin almost managed to create a clone. And of course, Pepsi does not come close, the taste is so different, just like apples and oranges.
As a first thought, I vaguely remember from my navigation skills that the Mercator projection maps a sphere onto the inside face of a cylinder:
For a given longitude lambda and latitude phi,
x = lambda - lambda(0) and y = ln(tan(phi) + sec(phi)),
where lambda(0) is the longitude you took as origin (the Greenwich meridian, for example).
Then you cut the cylinder vertically, unfold it, and you've got a rectangle.
Problem: angles (directions) are preserved, but distances are not. The shortest path between two points is now a curve, the orthodromic...
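For what it's worth, the projection formulas above can be sketched in a few lines of Python, taking lambda as the longitude and phi as the latitude on a unit sphere. The function names are hypothetical, and inputs are assumed to be in degrees.

```python
import math

def mercator(lon_deg, lat_deg, lon0_deg=0.0):
    """Mercator projection on a unit sphere:
    x = lambda - lambda0, y = ln(tan(phi) + sec(phi))."""
    if abs(lat_deg) >= 90.0:
        raise ValueError("the poles map to infinity")
    lam = math.radians(lon_deg - lon0_deg)
    phi = math.radians(lat_deg)
    x = lam
    y = math.log(math.tan(phi) + 1.0 / math.cos(phi))
    return x, y

def parallel_arc_length(lat_deg, dlon_deg):
    """True length on the unit sphere of an arc along a parallel of latitude."""
    return math.cos(math.radians(lat_deg)) * math.radians(dlon_deg)

# One degree of longitude spans the same map width at any latitude...
width_equator = mercator(1.0, 0.0)[0] - mercator(0.0, 0.0)[0]
width_60 = mercator(1.0, 60.0)[0] - mercator(0.0, 60.0)[0]
# ...although the true distance along the parallel shrinks with cos(latitude).
print(width_equator, width_60)
print(parallel_arc_length(0.0, 1.0), parallel_arc_length(60.0, 1.0))
```

The two map widths are equal while the true arc at 60 degrees latitude is half the one at the equator, which is the distance distortion mentioned above.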