A Data Science Central Community
I have a question: if we are using PROC VARCLUS to eliminate redundancy among the independent variables (IVs), how do we go about selecting the cluster representatives? I know that the lower the (1-R^2) ratio, the better a variable is as a representative. However, if we also consider other factors, such as business sense or a variable's univariate chi-square, should we select cluster representatives that have a higher univariate chi-square or make more 'business sense' even if they have a higher (1-R^2) ratio? Please advise!
Or should we instead select the top 5 or top 10 variables per cluster and then look at other statistics later on?
Varun, I haven't used varclus for a while, but I would say that you could swap one variable for another if it made better business sense. The 1-R^2 ratio is only a guide. Also, look at the relationship between the 2 candidate variables. They should be correlated.
I would go with the business sense here. One of the things I like about PROC VARCLUS is that it turns a really hard problem for humans (picking out some variables from hundreds) into a bunch of very reasonable problems (picking one or two variables out of 10 or so).
Thanks, Ralph and Edmund. In fact, I used 100 such clusters, then looked at each variable in each cluster, starting from the one with the lowest (1-R^2) ratio, and left out those variables which were 'redundant'.
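For anyone following along, here is a minimal Python sketch of the selection logic discussed in this thread. It is not PROC VARCLUS output handling and not anyone's exact method. It assumes you have already exported, for each variable, its cluster, its R^2 with its own cluster, and its R^2 with the next closest cluster. The function names, the `preferred` set (variables that make more business sense), and the `tolerance` cutoff are all hypothetical choices for illustration. The (1-R^2) ratio is taken as (1 - R^2 own cluster) / (1 - R^2 next closest cluster).

```python
from collections import defaultdict


def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster).

    Lower is better. Assumes r2_next < 1, otherwise the ratio is undefined.
    """
    return (1.0 - r2_own) / (1.0 - r2_next)


def pick_representatives(rows, preferred=frozenset(), tolerance=0.10):
    """Pick one representative variable per cluster.

    rows: iterable of (cluster, variable, r2_own, r2_next) tuples.
    preferred: hypothetical set of variables favoured on business grounds.
    tolerance: hypothetical cutoff; a preferred variable replaces the
        statistically best one only if its ratio is at most this much higher.
    """
    by_cluster = defaultdict(list)
    for cluster, var, r2_own, r2_next in rows:
        by_cluster[cluster].append((one_minus_r2_ratio(r2_own, r2_next), var))

    reps = {}
    for cluster, scored in by_cluster.items():
        scored.sort()  # lowest ratio first
        best_ratio, best_var = scored[0]
        choice = best_var
        # Treat the ratio as a guide: swap in a preferred variable
        # when it is close enough to the best one.
        for ratio, var in scored:
            if var in preferred and ratio - best_ratio <= tolerance:
                choice = var
                break
        reps[cluster] = choice
    return reps


# Made-up illustrative numbers, not real VARCLUS output.
rows = [
    (1, "income", 0.90, 0.20),
    (1, "salary", 0.85, 0.25),
    (2, "age", 0.80, 0.10),
    (2, "tenure", 0.60, 0.30),
]

by_ratio_only = pick_representatives(rows)
with_business_sense = pick_representatives(rows, preferred={"salary", "tenure"})
```

With these made-up numbers, "salary" is close enough to "income" to be swapped in, while "tenure" is much worse than "age" and is not. Where to set the tolerance is a judgment call, which is really the point made in the replies above.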