
If we use proc varclus to eliminate redundancy among the independent variables (IVs), how should we select the cluster representatives? I know that the lower a variable's (1-R^2) ratio, the better it is as a representative. However, if we also weigh other factors, such as business sense or a variable's univariate chi-square, should we select representatives that have a higher univariate chi-square or make more 'business sense' even if they have a higher (1-R^2) ratio? Please advise.


Replies to This Discussion

Or should we instead select the top 5 or top 10 variables per cluster and look at the other statistics later on?

Varun, I haven't used varclus for a while, but I would say that you could swap one variable for another if it made better business sense. The 1-R^2 ratio is only a guide. Also, look at the relationship between the two candidate variables. They should be correlated.

-Ralph Winters
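For anyone who wants to see what the ratio measures, here is a minimal Python sketch (not SAS, and not what proc varclus runs internally). The ratio is (1 - R^2 with own cluster) / (1 - R^2 with the nearest other cluster), so a low value means a variable sits close to its own cluster and far from the others. As a simplifying assumption, the cluster component here is the average of the standardized variables; proc varclus uses the first principal component by default. The variable names and the simulated data are hypothetical.

```python
import random
from statistics import mean, stdev


def standardize(xs):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    zx, zy = standardize(xs), standardize(ys)
    r = sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)
    return r * r


def centroid(columns):
    """Cluster component: row-wise mean of the standardized variables (an assumption)."""
    zs = [standardize(c) for c in columns]
    return [mean(row) for row in zip(*zs)]


def one_minus_r2_ratios(data, clusters):
    """data: {variable: values}; clusters: {cluster: [variables]} with at least two clusters.

    Returns {variable: (r2_own, r2_nearest, ratio)}.
    """
    comps = {c: centroid([data[v] for v in names]) for c, names in clusters.items()}
    result = {}
    for c, names in clusters.items():
        for v in names:
            r2_own = r_squared(data[v], comps[c])
            r2_nearest = max(
                r_squared(data[v], comp) for k, comp in comps.items() if k != c
            )
            result[v] = (r2_own, r2_nearest, (1 - r2_own) / (1 - r2_nearest))
    return result


# Simulated data: two independent latent factors, three noisy copies of each.
rng = random.Random(0)
n = 300
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [rng.gauss(0, 1) for _ in range(n)]
data = {
    "spend_3m": [a + rng.gauss(0, 0.4) for a in f1],
    "spend_6m": [a + rng.gauss(0, 0.4) for a in f1],
    "spend_12m": [a + rng.gauss(0, 0.4) for a in f1],
    "tenure_months": [b + rng.gauss(0, 0.4) for b in f2],
    "account_age": [b + rng.gauss(0, 0.4) for b in f2],
    "years_active": [b + rng.gauss(0, 0.4) for b in f2],
}
clusters = {
    "spend": ["spend_3m", "spend_6m", "spend_12m"],
    "tenure": ["tenure_months", "account_age", "years_active"],
}

ratios = one_minus_r2_ratios(data, clusters)
# List variables from the best candidate representative (lowest ratio) downward.
for name, (own, nearest, ratio) in sorted(ratios.items(), key=lambda kv: kv[1][2]):
    print(f"{name:15s} own={own:.3f} nearest={nearest:.3f} ratio={ratio:.3f}")
```

When two candidates in the same cluster have ratios this close together, swapping one for the other costs very little, which is why the ratio works better as a guide than as a hard rule.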

I would go with the business sense here. One of the things I like about proc varclus is that it turns a really hard problem for humans -- picking out some variables from hundreds -- into a bunch of very reasonable ones -- picking one or two variables out of 10 or so.

Thanks, Ralph and Edmund. In fact, I used 100 such clusters, then looked at each variable in each cluster, starting from the one with the lowest (1-R^2) ratio, and left out the variables that were 'redundant'.
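One way to combine the suggestions in this thread is sketched below in Python. It is only an illustration, not a standard procedure: shortlist the few lowest-ratio variables in each cluster, take a business-preferred variable if one is on the shortlist, and otherwise take the shortlisted variable with the highest univariate chi-square. The function name, the `top_k` cutoff, the `preferred` set, and every number here are hypothetical.

```python
def pick_representatives(ratios, clusters, chi_square, top_k=3, preferred=()):
    """Pick one representative per cluster.

    ratios:     {variable: (1-R^2) ratio}, lower is better
    clusters:   {cluster: [variables]}
    chi_square: {variable: univariate chi-square}, higher is better
    preferred:  variables the business would rather keep
    """
    chosen = {}
    for cluster, names in clusters.items():
        # Only variables with a low enough ratio are candidates at all.
        shortlist = sorted(names, key=lambda v: ratios[v])[:top_k]
        business = [v for v in shortlist if v in preferred]
        if business:
            # Business sense wins, but only among shortlisted variables.
            chosen[cluster] = min(business, key=lambda v: ratios[v])
        else:
            chosen[cluster] = max(shortlist, key=lambda v: chi_square[v])
    return chosen


# Made-up inputs for illustration only.
ratios = {
    "spend_3m": 0.12, "spend_6m": 0.15, "spend_12m": 0.40,
    "tenure_months": 0.20, "account_age": 0.22,
}
chi_square = {
    "spend_3m": 35.0, "spend_6m": 52.0, "spend_12m": 80.0,
    "tenure_months": 18.0, "account_age": 25.0,
}
clusters = {
    "spend": ["spend_3m", "spend_6m", "spend_12m"],
    "tenure": ["tenure_months", "account_age"],
}

chosen = pick_representatives(
    ratios, clusters, chi_square, top_k=2, preferred={"tenure_months"}
)
print(chosen)
```

The shortlist keeps a variable with a strong chi-square but a poor ratio (here `spend_12m`) from being picked, since it would represent its cluster badly. How large to make `top_k` is a judgment call.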

