# AnalyticBridge

A Data Science Central Community

# Decreasing Dataset Dimensionality

Hi everyone,

I have a nominal (unordered) categorical predictor variable in my
dataset that has too many levels, and I'd like to bin or group those
levels with respect to my interval-scaled dependent variable.

There are quite a few ways to discretize interval-scaled inputs with
respect to a categorical output, but I'm having trouble finding a
procedure that does the opposite (binning a categorical input with
respect to an interval output).

Any tips would be greatly appreciated!


### Replies to This Discussion

Hi Paul-

Consider using a regression tree. I use CHAID to bin nominals with respect to a categorical target, but it will accept a numeric target as well (at least in most implementations I have seen). If you allow only one level below the root, with just the variable of interest as a candidate predictor, it should work nicely. There may well be more complex methods, but this one is simple and efficient. You could always group based on the mean or median of the target as well, using your judgement, but I like the tree approach.

HTH
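
For anyone who wants to try the one-level tree idea without a modeling package, here is a minimal sketch in plain Python. It is not CHAID: it is a CART-style binary split, which orders the levels by their mean target value and picks the cut that minimizes the within-group sum of squared errors. The function name `best_binary_grouping` and the data are made up for illustration.

```python
from collections import defaultdict


def best_binary_grouping(levels, target):
    """Depth-one regression split of a single nominal predictor.

    Levels are ordered by mean target value, every cut point along that
    ordering is tried, and the cut with the smallest total within-group
    sum of squared errors is returned as two lists of levels.
    """
    values = defaultdict(list)
    for level, y in zip(levels, target):
        values[level].append(y)

    # Order the levels by their mean target value.
    ordered = sorted(values, key=lambda lv: sum(values[lv]) / len(values[lv]))
    if len(ordered) < 2:
        raise ValueError("need at least two levels to split")

    def sse(ys):
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best = None
    for cut in range(1, len(ordered)):
        left = [y for lv in ordered[:cut] for y in values[lv]]
        right = [y for lv in ordered[cut:] for y in values[lv]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, ordered[:cut], ordered[cut:])
    return best[1], best[2]


# Made-up data for illustration only.
levels = ["A", "A", "B", "B", "C", "C", "D", "D"]
target = [10.0, 12.0, 30.0, 32.0, 11.0, 13.0, 29.0, 31.0]
low_group, high_group = best_binary_grouping(levels, target)
print(low_group, high_group)
```

A real tree implementation would also apply stopping rules (minimum node size, a significance test) and, in the case of CHAID, could return more than two groups.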
Right now, I can only think of comparing proportions.

Suppose your dependent variable takes the values 1, 2 and 3, and the levels of your categorical variable are A, B, C and D. Cross-tabulate these two variables.

If A and C have similar proportions of transactions or customers falling in 1, 2 and 3 - let's say 20%, 30% and 50% respectively - I would try to group A and C into a single level. You may need to check the business relevance/requirement before you merge.
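
A rough sketch of that cross-tab comparison in plain Python, assuming a categorical target. The helper names `row_proportions` and `similar_pairs`, the 5-point tolerance, and the counts are all made up for illustration; the tolerance in particular is a judgement call, not a statistical test.

```python
from collections import Counter, defaultdict


def row_proportions(levels, outcomes):
    """Return {level: [share of each outcome class]} with classes sorted."""
    counts = defaultdict(Counter)
    for level, outcome in zip(levels, outcomes):
        counts[level][outcome] += 1
    classes = sorted(set(outcomes))
    return {
        level: [counts[level][c] / sum(counts[level].values()) for c in classes]
        for level in sorted(counts)
    }


def similar_pairs(props, tolerance=0.05):
    """List pairs of levels whose class shares differ by at most `tolerance`."""
    names = list(props)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = max(abs(p - q) for p, q in zip(props[a], props[b]))
            if gap <= tolerance:
                pairs.append((a, b))
    return pairs


# Made-up counts: outcome values 1, 2, 3 for each level of the predictor.
data = {
    "A": [1] * 2 + [2] * 3 + [3] * 5,
    "B": [1] * 6 + [2] * 3 + [3] * 1,
    "C": [1] * 4 + [2] * 6 + [3] * 10,
    "D": [1] * 1 + [2] * 2 + [3] * 7,
}
levels = [lv for lv, ys in data.items() for _ in ys]
outcomes = [y for ys in data.values() for y in ys]

props = row_proportions(levels, outcomes)
candidates = similar_pairs(props)
print(props)
print(candidates)
```

Here A and C share the same profile across the three outcome classes, so they come out as the only merge candidates.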
Hi,
Are you talking about the Chi-squared method?
If this response is about my suggestion for a decision tree: the implementation of CHAID I use in IBM Modeler uses an F-test when the target is interval, not a chi-square test, of course. You could bin the interval target and use a chi-square test, but that raises issues of its own (e.g., choosing the number of bins).
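
To make the F-test idea concrete, here is a small sketch of the one-way ANOVA F statistic computed on the target values of two levels. This is only the textbook statistic, not a reproduction of what any particular CHAID implementation does internally; the function name `one_way_f` and the numbers are made up for illustration.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups.

    F = (between-group sum of squares / (k - 1))
        / (within-group sum of squares / (N - k))
    Raises ZeroDivisionError if there is no within-group variation.
    """
    all_values = [y for g in groups for y in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Two levels with clearly different target means: large F, keep them apart.
f_far = one_way_f([[10.0, 12.0, 11.0, 13.0], [29.0, 31.0, 30.0, 32.0]])

# Two levels with nearly identical target means: small F, candidates to merge.
f_near = one_way_f([[10.0, 12.0], [11.0, 13.0]])

print(f_far, f_near)
```

Turning F into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom, which a statistics package would supply.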
Hi, Jeff:
No, I was not responding to your suggestion. I was responding to DataLLigence.
My reply was a bonus at no charge then :)
How are you doing your prediction? Most regression-type models will give you a chi-square statistic that tests whether the model parameter for a given category is 0. You could try collapsing the categories that fail that test into one, unless it makes no business sense.

-Ralph Winters