Everybody learned in elementary statistics classes that having more parameters in your statistical model than observations is a recipe for disaster.
Here, I would like to provide two examples where more parameters than observations can be handled successfully:
- Scoring system to detect fraud (logistic or linear regression): 500,000 binary rules, most of them with a triggering rate below 0.05%, resulting in 500,000 regression coefficients. Everybody knows that this is an ill-conditioned mathematical problem. Solution: set the regression coefficient for rule #42,675 to correlation[response, rule #42,675]; if rule #42,675 has a triggering rate below 0.001%, set the correlation to 0. You would be amazed at how well such a simplistic model works!
- Discriminant analysis (AKA supervised clustering): a classifier based on adaptive density estimation uses kernel density estimators, with the kernel bandwidth (the model parameter) depending on location: in short, we are dealing with an infinite number of parameters. These estimators perform very well, despite the fact that the number of observations is finite and the number of parameters is infinite. A typical successful implementation uses bandwidths equal to a power function (typically with exponent 1/5) of the distance to the closest neighbors.
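The first example (one correlation per rule instead of a joint regression fit) can be sketched in a few lines. This is a minimal illustration, not a production fraud model; the function names, the rule matrix layout, and the `min_rate` threshold are my own assumptions for the sketch:

```python
import numpy as np

def fit_scoring_model(rules, response, min_rate=1e-5):
    """Approximate each regression coefficient by the correlation
    between the response and the corresponding binary rule.

    rules    : (n_obs, n_rules) 0/1 matrix of rule triggers
    response : (n_obs,) binary fraud indicator
    min_rate : rules triggering less often than this get coefficient 0
               (stands in for the "triggering rate below 0.001%" cutoff)
    """
    n_obs, n_rules = rules.shape
    coefs = np.zeros(n_rules)
    rates = rules.mean(axis=0)          # triggering rate of each rule
    for j in np.where(rates >= min_rate)[0]:
        col = rules[:, j]
        if col.std() > 0 and response.std() > 0:
            coefs[j] = np.corrcoef(response, col)[0, 1]
    return coefs

def score(rules, coefs):
    # Fraud score of an observation = weighted sum of its triggered rules
    return rules @ coefs
```

No matrix inversion takes place, so the fact that the design matrix is hopelessly ill-conditioned (far more rules than observations, most columns nearly all zeros) never becomes a problem.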
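The second example can be sketched as a nearest-neighbor-driven adaptive kernel classifier. Assumptions in this sketch: Gaussian kernels, bandwidths set to the distance to the nearest same-class neighbor raised to the power 1/5 (one plausible reading of the power-function rule above), and classification by comparing per-class density estimates:

```python
import numpy as np

def adaptive_kde_classify(X_train, y_train, X_test, exponent=0.2):
    """Assign each test point to the class with the highest adaptive
    kernel density estimate. Each training point x_i carries its own
    bandwidth h_i = d_i ** exponent, where d_i is the distance from
    x_i to its nearest neighbor within the same class, so the model
    effectively has one parameter per location.
    """
    classes = np.unique(y_train)
    densities = np.zeros((len(X_test), len(classes)))
    for ci, c in enumerate(classes):
        Xc = X_train[y_train == c]
        # distance from each training point to its nearest same-class neighbor
        D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)
        h = np.maximum(D.min(axis=1) ** exponent, 1e-6)  # local bandwidths
        # Gaussian kernel density at each test point (up to a constant factor)
        Dt = np.linalg.norm(X_test[:, None, :] - Xc[None, :, :], axis=-1)
        densities[:, ci] = np.mean(
            np.exp(-0.5 * (Dt / h) ** 2) / h ** Xc.shape[1], axis=1)
    return classes[np.argmax(densities, axis=1)]
```

The bandwidth shrinks in dense regions and grows in sparse ones, which is exactly what makes the "infinitely many parameters" harmless: each parameter is pinned to local data rather than fitted globally.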