A Data Science Central Community
My objective is to do market segmentation of the data. I got suggestions to use cluster analysis, and I have the following queries. I have access to both SAS Enterprise Guide (SAS EG) and SAS Enterprise Miner (SAS EM).
 I have around 500 variables (transactional, demographic, ...). How can I reduce the number of variables before applying cluster analysis?
 I came to know that cluster analysis in SAS requires numeric variables only. How should I treat categorical variables (most of them are nominal)?
 Do I need to validate the cluster analysis output? If so, how?
Looking for suggestions.
For your first question,
Variable reduction for numerical variables can be done in two ways:
1. Before the cluster analysis (which again can be done in two ways):
(a) Use the top N principal components, choosing N by the share of variation explained that you consider sufficient.
(b) Run a variable cluster analysis and select one or two variables from each cluster (you may then run PCA on the selected variables and keep the top components).
2. After the cluster analysis: run the cluster analysis on all the variables. For every variable, check whether the average (mean or median) value differs significantly across the clusters (for example, with an F-test). If it does not differ significantly, remove that variable and re-run the cluster procedure.
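To make option 2 concrete, here is a minimal sketch of the F-test idea in plain Python rather than SAS. The function name `anova_f` and the toy data are made up for illustration; it only computes the one-way ANOVA F statistic for one variable across cluster labels, and you would still compare it against an F distribution (or use your tool's ANOVA output) to judge significance.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of one variable across cluster labels."""
    # Group the variable's values by cluster label.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n = len(values)          # number of observations
    k = len(groups)          # number of clusters
    grand_mean = mean(values)

    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the observations around their own cluster mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    if ss_within == 0:
        # No spread inside any cluster: F is undefined or unbounded.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical toy data: two clusters of three customers each.
clusters = [0, 0, 0, 1, 1, 1]
spend = [1, 2, 3, 11, 12, 13]   # clearly separated by cluster
visits = [1, 2, 3, 1, 2, 3]     # same distribution in both clusters

print(anova_f(spend, clusters))
print(anova_f(visits, clusters))
```

A variable like `visits` here, whose F statistic is close to zero, would be a candidate for removal before re-running the clustering.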
Variable reduction for categorical variables: compute a measure of association between all pairs of categorical variables, then choose one or two variables from each set of highly associated variables.
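As one possible measure of association, here is a small sketch of Cramér's V in plain Python. The choice of Cramér's V is my assumption (the suggestion above does not name a specific measure), and `cramers_v` plus the toy columns are hypothetical names for illustration.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two categorical variables (0 = none, 1 = perfect)."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed cell counts
    row = Counter(x)             # marginal counts of x
    col = Counter(y)             # marginal counts of y

    # Pearson chi-square statistic over every cell of the contingency table.
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0               # a constant variable carries no association
    return sqrt(chi2 / (n * k))


# Hypothetical toy columns.
region = ["north", "north", "south", "south"]
branch = ["A", "A", "B", "B"]    # fully determined by region
gender = ["f", "m", "f", "m"]    # unrelated to region

print(cramers_v(region, branch))
print(cramers_v(region, gender))
```

Pairs with a value near 1 (like `region` and `branch` here) carry nearly the same information, so keeping one of them is usually enough.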
For your second question,
You can also run cluster analysis with categorical variables, but you need to choose an appropriate distance measure in PROC DISTANCE. Euclidean distance is not meaningful for nominal variables; Gower's dissimilarity (METHOD=DGOWER) is one option. Once you have selected the numerical and categorical variables to use, standardize the numerical variables before calculating the distance matrix (PROC DISTANCE has an option for this); otherwise variables with larger values will dominate the distance measure.
Once you have created the distance matrix data set, you can use it as input to the cluster procedure. Whenever you decide to remove a variable, the distances have to be recalculated without that variable.
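To show what a Gower-style dissimilarity does with mixed data, here is a minimal sketch in plain Python. It is not the PROC DISTANCE implementation; `column_ranges`, `gower_dissimilarity` and the toy rows are hypothetical. Numeric columns contribute the absolute difference divided by the column range (which also takes care of scaling), and nominal columns contribute 0 on a match and 1 on a mismatch.

```python
def column_ranges(rows, is_numeric):
    """Range (max - min) of each numeric column; None for nominal columns."""
    ranges = []
    for j, numeric in enumerate(is_numeric):
        if numeric:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def gower_dissimilarity(a, b, ranges, is_numeric):
    """Average per-variable dissimilarity between two rows, in [0, 1]."""
    total = 0.0
    for x, y, rng, numeric in zip(a, b, ranges, is_numeric):
        if numeric:
            # Range-scaled absolute difference; a constant column adds 0.
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Simple matching for nominal values.
            total += 0.0 if x == y else 1.0
    return total / len(a)


# Hypothetical toy rows: (age, yearly spend, area type).
rows = [
    (20, 1000.0, "urban"),
    (40, 3000.0, "rural"),
    (30, 2000.0, "urban"),
]
is_numeric = [True, True, False]
ranges = column_ranges(rows, is_numeric)

# Full pairwise matrix, the kind of input a clustering procedure would take.
matrix = [[gower_dissimilarity(a, b, ranges, is_numeric) for b in rows] for a in rows]
for line in matrix:
    print(line)
```

Dropping a variable changes both the numerators and the divisor, which is why the matrix has to be recomputed each time the variable set changes.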
I am still looking into validation for cluster analysis.
This is just my view; I would appreciate more input.
1) Several nodes are available in EM: the Variable Selection node tests for collinearity, and Variable Clustering is an alternative. You could also start with a factor analysis in SAS/STAT and cluster using the factors as variables.
2) The Cluster node in EM offers several methods to automatically code class variables as numeric variables (this is problematic because you lose control over the variable weights). In the SAS/STAT environment you may need to do this manually.
3) The first step is to find business sense in the clusters by generating profiles or descriptive trees for each cluster. For the technical validation: the EM Cluster node generates score code, so you can split your data into two parts, cluster both parts separately, and apply the score code from the model set to the validation set. The two groupings should overlap for a large portion of the data. The numbering of the clusters will vary, though, so you need to find the permutation of labels that maximizes the overlap. If I recall correctly, that is what the Munkres (Hungarian) algorithm solves, and it can be done with SAS/OR or any other linear optimizer.
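The label-matching step in 3) can be sketched as follows in plain Python. Note that this is not the Munkres algorithm: for a handful of clusters it simply tries every permutation, which gives the same answer but scales factorially, so a real assignment solver is the better choice for larger k. The function name `best_label_match` and the toy labels are made up for illustration.

```python
from itertools import permutations


def best_label_match(labels_a, labels_b, k):
    """Find the relabelling of clustering A that best agrees with clustering B.

    labels_a and labels_b are cluster ids (0..k-1) for the same rows, e.g. the
    validation set's own clustering and the scored labels from the model set.
    Returns (perm, agreement), where perm[i] is the B label matched to A label i.
    """
    # table[i][j] = number of rows labelled i in A and j in B.
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1

    best_perm, best_overlap = None, -1
    for perm in permutations(range(k)):
        overlap = sum(table[i][perm[i]] for i in range(k))
        if overlap > best_overlap:
            best_perm, best_overlap = perm, overlap
    return best_perm, best_overlap / len(labels_a)


# Hypothetical labels for six validation rows.
own_clusters = [0, 0, 1, 1, 2, 2]
scored_clusters = [2, 2, 0, 0, 1, 0]

perm, agreement = best_label_match(own_clusters, scored_clusters, k=3)
print(perm, agreement)
```

An agreement close to 1 suggests the segmentation is stable across the two samples; a low value suggests the clusters depend heavily on which rows were used.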