# AnalyticBridge

A Data Science Central Community

# Data sets with more variables than observations

In which contexts has this situation come up for you, and how did you handle it? Also, if you used models that had more parameters than data points, how did you deal with that?

Depending on the situation, it might not be an issue. For instance, in density estimation with adaptive kernels, each window might have its own radius, resulting in as many parameters (radii) as data points. In a regression problem, however, ordinary least squares has no unique solution when there are more variables than observations, and you run the risk of over-fitting unless you use ridge regression, dimension reduction, stepwise regression, or a similar technique.
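
As a rough illustration of the ridge regression option, here is a minimal NumPy sketch. The data are simulated, and the sizes (37 observations, 60 variables), the penalty `alpha=1.0`, and the helper name `ridge_fit` are assumptions chosen only for the example.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge coefficients: solve (X'X + alpha*I) beta = X'y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Simulated data (assumption): fewer observations than variables.
rng = np.random.default_rng(0)
n_obs, n_vars = 37, 60
X = rng.normal(size=(n_obs, n_vars))
true_beta = np.zeros(n_vars)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + 0.1 * rng.normal(size=n_obs)

# X'X is rank-deficient here, so plain least squares has no unique solution.
rank = np.linalg.matrix_rank(X.T @ X)

# Adding alpha*I makes the system invertible and shrinks the coefficients.
beta = ridge_fit(X, y, alpha=1.0)
```

The penalty would normally be chosen by cross-validation rather than fixed in advance; the value here is only a placeholder.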

So, how did you cope with this problem?


### Replies to This Discussion

I worked on a problem where we had US-level macroeconomic variables for 37 months.
We were trying to measure the effect of 11 macroeconomic variables (unemployment rate, GSP, disposable income, per capita disposable income, debt to service ratio, etc.). We also created variables that measured the change, or rate of change, in these macroeconomic variables as measured from the booking month, so in all we had 60-odd variables. We used correlation and varclus for dimension reduction, followed by stepwise reduction and, most importantly, business sense.
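
The correlation-based step in the reply above could look something like the following sketch. It is not varclus itself, just a simple greedy filter standing in for the idea of dropping near-duplicate variables; the function name `drop_correlated`, the 0.9 threshold, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily keep columns whose |correlation| with every
    already-kept column is at most `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept

# Simulated data (assumption): 37 months, 3 base series plus 3 near-copies.
rng = np.random.default_rng(1)
base = rng.normal(size=(37, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(37, 3))])

# Indices of the columns that survive the filter.
kept = drop_correlated(X, threshold=0.9)
```

A filter like this only removes redundancy among the predictors; it says nothing about which of the surviving variables matter for the response, which is where the stepwise step and business judgment come in.
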
Singular value decomposition (SVD), which is described on Wikipedia and in the Numerical Recipes series by Press et al., can deal with both "oversquare" and "undersquare" data arrays (more rows than columns, or more columns than rows), using a slightly different interpretation in each case. The problem has come up for me repeatedly, since one sure-fire and very exact way of modeling individual customer behavior is a logit model. In my case, the known attribute values made up the columns and the cases themselves made up the rows, but that configuration could easily have been reversed. As you can imagine, even with several hundred variables, the number of cases in a real-life commercial setting can run to thousands, hundreds of thousands, or millions. The system of equations could theoretically handle that situation, but practical limits on computability with current workstations required a sampling strategy to keep things in the M x N range, where M is at most in the thousands and N is in the hundreds. The outcome was a unique regression model describing the probability of each customer's behavior, framed as a binary decision about something, such as whether a customer would switch their business from a particular store if a sister store were built in the area (called an impact, or cannibalization, model). Hope this helps.

-jim
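
To make the SVD point above concrete, here is a minimal NumPy sketch of solving a linear system through the SVD for both a wide array (more columns than rows) and a tall one. It illustrates the general technique only and is not the model described in the reply; the helper name `svd_solve`, the cutoff `rcond`, and the random test matrices are assumptions.

```python
import numpy as np

def svd_solve(A, b, rcond=1e-10):
    """Solve A x = b via the SVD pseudoinverse.

    Wide A (underdetermined): returns the minimum-norm solution.
    Tall A (overdetermined): returns the least-squares solution.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rcond * s.max()          # ignore negligible singular values
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ b))

rng = np.random.default_rng(2)

# More variables than observations: 5 equations, 8 unknowns.
wide = rng.normal(size=(5, 8))
b_wide = rng.normal(size=5)
x_wide = svd_solve(wide, b_wide)

# More observations than variables: 8 equations, 5 unknowns.
tall = rng.normal(size=(8, 5))
b_tall = rng.normal(size=8)
x_tall = svd_solve(tall, b_tall)
```

In the wide case the equations are satisfied exactly, and among the infinitely many solutions the SVD picks the one with the smallest norm. In the tall case no exact solution generally exists, and the same formula gives the least-squares fit.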