# AnalyticBridge

A Data Science Central Community

# 10,000 observations and 100,000 parameters: what to do?

Everybody learned in elementary statistics classes that having more parameters (in your statistical model) than observations is a recipe for disaster.

Here, I would like to provide two examples where more parameters than observations can successfully be handled:

• Scoring system to detect fraud (logistic or linear regression): 500,000 binary rules (most of them with a triggering rate < 0.05%), resulting in 500,000 regression coefficients. Everybody knows this is an ill-conditioned mathematical problem. Solution: set the regression coefficient for rule #42,675 to correlation[response, rule #42,675]. If rule #42,675 has a triggering rate < 0.001%, set the correlation to 0. You would be amazed at how well such a simplistic model works!
• Discriminant analysis (AKA supervised clustering): a classifier based on adaptive density estimation uses kernel density estimators, with the kernel bandwidth (the model parameter) depending on location: in short, we are dealing with an infinite number of parameters. These estimators are actually very good, despite the fact that the number of observations is finite while the number of parameters is infinite. A typical successful implementation uses bandwidths equal to a power function (typically with exponent = 1/5) of the distance to the closest neighbors.
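The correlation-based scoring trick in the first bullet can be sketched in a few lines. This is a minimal illustration, not the production system; the function names, the NumPy implementation, and the triggering-rate threshold used below are my own assumptions:

```python
import numpy as np

def fit_correlation_scores(rules, response, min_trigger_rate=1e-5):
    """Sketch of the correlation-based scoring model: each rule's
    'regression coefficient' is simply its correlation with the
    response, zeroed out when the rule fires too rarely."""
    n, p = rules.shape
    coeffs = np.zeros(p)
    trigger_rates = rules.mean(axis=0)  # fraction of rows where each rule fires
    for k in range(p):
        if trigger_rates[k] >= min_trigger_rate:
            # Pearson correlation between binary rule k and the response
            c = np.corrcoef(rules[:, k], response)[0, 1]
            coeffs[k] = 0.0 if np.isnan(c) else c
    return coeffs

def score(rules, coeffs):
    """Score each row as the sum of the coefficients of the rules that fire."""
    return rules @ coeffs
```

No matrix inversion is involved, so the ill-conditioning of the full 500,000-coefficient regression never comes into play; each coefficient is estimated independently.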

Comment by Vincent Granville on September 13, 2010 at 10:11pm
Joseph: To answer your question (what do you consider to be the practical limit on the number of parameters to directly estimate?), it depends on the type of parameter. In the density estimation problem I described, where the parameter is the kernel bandwidth, many people would indeed not consider this a parameter at all; they would use a different English word.

In short, parameters that explain little variance in your model can be very numerous; those that truly drive the model should be limited to a very small set, maybe fewer than 15 as a rule of thumb. And when you have a large number of parameters, you can usually set constraints on them, a bit like a hierarchical Bayesian model with 15 top-level parameters and thousands of lower-level parameters.
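The constraint idea in the last sentence can be illustrated with a toy shrinkage step: many low-level parameters are pulled toward the mean of their (few) top-level groups, so only the group means behave as truly free parameters. This is my own minimal sketch of the idea, not code from the post:

```python
import numpy as np

def shrink_to_groups(low_level, groups, weight=0.8):
    """Pull each low-level parameter toward the mean of its group.
    weight=1 collapses every parameter onto its group mean, leaving
    only the (few) group means as effective free parameters."""
    low_level = np.asarray(low_level, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(low_level)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = weight * low_level[mask].mean() + (1 - weight) * low_level[mask]
    return out
```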
Comment by Vincent Granville on September 13, 2010 at 4:31pm
Theo: Do you mean unsupervised clustering, as opposed to supervised clustering? If yes, I believe there is no solution that applies to all contexts, because there's no precise definition of what a cluster is (and there never will be): just look at the stars in the night sky and try to group them into clusters with your naked eye; ask a friend to do the same -- you will end up with different results, and different justifications. Even the (apparently simple) problem of detecting the number of clusters is extremely challenging, because what one algorithm or human being considers a cluster, another considers a sub-cluster, and vice versa.

I believe the most promising technology will combine multiple types of classifiers and take advantage of the strengths of each (some are good at detecting linear or curvilinear structures, some are good with overlapping clusters that have density peaks, some are good at detecting clusters defined by nearest neighbors, some are good with clusters that have convex shapes, etc.)
Comment by Theodore Omtzigt on September 13, 2010 at 4:06pm
Vincent: is there a well-defined workflow to drive the clustering and to automatically build optimal models? If you have a pointer to a paper or presentation, I would love to try my hand at implementing this in a high-performance scalable library. I presume there are R implementations available to benchmark against?
Comment by Vincent Granville on September 12, 2010 at 7:36pm
Tom: In some models (regression), you can just ignore variables with little predictive power or, better, try to cluster the variables to reduce the number of variables (and thus parameters) to a meaningful number.

In the context of discriminant analysis based on kernel density estimators, you have one parameter (the kernel bandwidth) at each location in your 2- or 3-dimensional space, and for each group. The bandwidth is usually a function of the distance to the nearest neighbors, and thus you can end up with an efficient model that truly has an infinite number of parameters and an infinite number of parameter values.
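The location-dependent bandwidth described above can be sketched as follows. This is a minimal illustration under my own assumptions (Gaussian kernel, bandwidth taken as the distance to the k-th nearest neighbor raised to the 1/5 power mentioned in the post; function names and defaults are mine):

```python
import numpy as np

def adaptive_bandwidths(points, k=5, exponent=0.2):
    """At each data point, the bandwidth is a power function
    (exponent ~ 1/5) of the distance to the k-th nearest neighbor."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    kth = np.sort(dist, axis=1)[:, k]             # column 0 is the point itself
    return kth ** exponent

def adaptive_kde(x, points, bandwidths):
    """Gaussian kernel density estimate at x with per-point bandwidths."""
    x = np.asarray(x, dtype=float)
    d2 = ((points - x) ** 2).sum(axis=1)
    h = bandwidths
    dim = points.shape[1]
    kernels = np.exp(-0.5 * d2 / h ** 2) / ((np.sqrt(2 * np.pi) * h) ** dim)
    return kernels.mean()
```

Since every data point carries its own bandwidth, and the bandwidth varies continuously with location, the "number of parameters" is unbounded even though the code is only a few lines.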
Comment by Vincent Granville on September 12, 2010 at 1:45pm
Theodore: you are right about information, also called entropy. Suppose you have 10,000 observations, 10 core parameters explaining 80% of the entropy, 1,000 secondary parameters explaining 19%, 10,000 parameters explaining 0.9%, and a billion parameters explaining the remaining 0.1%. You could then say that your framework has 10,000 observations and 1 billion parameters and that statistical inference works very well, yet you would not be telling the whole truth -- the truth being that inference works well because of your top 10 core parameters alone (meaning that if you eliminated (1 billion - 10) parameters from your model, you would get essentially the same results).
Comment by Theodore Omtzigt on September 12, 2010 at 12:54pm
Do you have an intuitive explanation of what the technique does? It appears that by applying a low-pass filter to the correlation between a parameter and the response variable, you filter out 'noise' that might have injected some correlation where there is none in causal terms. However, since this filter is non-discriminatory, it appears to affect only the numerics of the algorithm. A simple thought experiment could construct a plausible case where the technique is mightily wrong: for example, if none of the observations capture any of the causal relationship between parameter and response, you would go badly wrong if a stream of observations suddenly came along that did contain this information.

From an intuition point of view, the technique assumes that the observations contain some information that guides the correlation to a reasonable estimate. To me that sounds like a big assumption, especially for the case we are talking about, where we might have 10 times more parameters than observations. I can see where this could work, but I can also see where this would blow up in your face. Any empirical data that you know of?
Comment by Joseph Foutz on September 11, 2010 at 10:15pm
I have a question on a slightly different tack on the same topic. What do you consider to be the practical limit on the number of parameters to (directly) estimate? Many of the mainstream packages have large limits on what they can theoretically handle. SAS has a 2TB dataset size limit. Stata has a limit of 11,000 parameters (I assume this has to do with assigning a 32-bit integer to every matrix location), with an 8TB theoretical and 5TB practical dataset size limit. Big theoretical limits, but what are the practical limits of what you would estimate?
Comment by Vincent Granville on September 11, 2010 at 8:06pm
Let's say you have a regression model with 500,000 parameters. Parameter k is denoted a[k]. The idea is to model a[k] with a formula such as a[k] = f(correlation[Var[k], Response], x, y, z), thus reducing the 500,000 parameters a[1], a[2], ..., a[500,000] to only 3 parameters: x, y, z.
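The post does not specify the form of f, so here is one plausible choice as a sketch: a cubic polynomial in the rule/response correlation, whose three coefficients are exactly the meta-parameters x, y, z. All names below are illustrative:

```python
import numpy as np

def coefficient(corr, x, y, z):
    """Hypothetical f: generate a regression coefficient a[k] from the
    correlation between variable k and the response, using only the
    3 meta-parameters x, y, z (the form of f is my own assumption)."""
    return x * corr + y * corr ** 2 + z * corr ** 3

# All 500,000 a[k] (here: 5, for illustration) come from 3 numbers:
corrs = np.array([0.9, 0.1, -0.3, 0.0, 0.5])
a = coefficient(corrs, x=1.0, y=0.0, z=0.0)
```

Only x, y, z would be fitted to the data (e.g. by minimizing prediction error), so the effective dimensionality of the model drops from 500,000 to 3.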