Is it possible to carry out a cluster analysis using categorical variables?

Replies to This Discussion

Tom, the two main benefits are the ability to mix differently scaled variables within one model framework and the absence of distributional assumptions. LCA does not assign an observation to only one group; it computes the posterior probabilities of the observation belonging to each of the groups. As with cluster or factor analysis, the theory is that the variation is explained by "hidden" groups (latent classes) rather than through the variables themselves.
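As a sketch of what those posterior probabilities look like, here is the usual latent class model for categorical items, written under the standard local independence assumption. The symbols are chosen here for illustration: \gamma_c is the prevalence of class c, and \rho_{j,r|c} is the probability that item j takes category r within class c.

```latex
% Marginal probability of a response pattern y = (y_1, ..., y_J)
P(Y = y) = \sum_{c=1}^{C} \gamma_c \prod_{j=1}^{J} \rho_{j, y_j \mid c}

% Posterior probability of class c given the observed pattern (Bayes' rule)
P(C = c \mid Y = y) =
  \frac{\gamma_c \prod_{j=1}^{J} \rho_{j, y_j \mid c}}
       {\sum_{c'=1}^{C} \gamma_{c'} \prod_{j=1}^{J} \rho_{j, y_j \mid c'}}
```

Each observation gets one such probability per class, and the probabilities sum to 1 across classes.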

I can't really recommend a good book on LCA, since it is a relatively new field and I'm still looking for one myself. I suggest you follow the documentation for whatever package you use.

-Ralph Winters

Yes, you are correct. A factor is similar to a Latent Class variable. Good synopsis.

-Ralph Winters

This problem is very simple now. There is a procedure called "Two Step Cluster Analysis" that accepts both categorical and continuous variables. All leading software packages include this procedure. Please check its assumptions before using it. The technique is useful for a wide class of problems.

Yes. The most important part of cluster analysis is the measure of "statistical distance" between two data points, which takes numerous forms for both numerical and categorical variables. Try googling keywords like "distance measure of categorical variables"; I am sure you will find something useful.
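One commonly used measure that handles numerical and categorical variables together is a Gower-style distance. The sketch below is only an illustration: the records, variable names and ranges are made up, and it uses simple matching for the categorical variables and a range-normalised difference for the numerical ones.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower-style distance between two records with mixed variable types.

    a, b: sequences of equal length, one record each.
    is_categorical: sequence of bools, True where the variable is nominal.
    ranges: per-variable range (max - min) for numerical variables;
            ignored for categorical ones.
    """
    total = 0.0
    for x, y, cat, rng in zip(a, b, is_categorical, ranges):
        if cat:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Range-normalised absolute difference, so it lies in [0, 1].
            total += abs(x - y) / rng if rng else 0.0
    return total / len(a)


# Hypothetical records: (colour, region, age, income)
records = [
    ("red", "north", 25, 30000),
    ("red", "south", 35, 50000),
    ("blue", "south", 45, 70000),
]
is_categorical = [True, True, False, False]
ranges = [None, None, 20, 40000]  # age: 45 - 25, income: 70000 - 30000

d01 = gower_distance(records[0], records[1], is_categorical, ranges)
d02 = gower_distance(records[0], records[2], is_categorical, ranges)
print(d01, d02)
```

The resulting pairwise distances can then be passed to any clustering method that accepts a precomputed distance matrix, such as hierarchical clustering.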

I've never heard the name 'Latent Class Analysis', but from this discussion, it seems to me that it is a Structural Equation Model. Am I right?

Anyway, there are a lot of distance measures for categorical data, just like Simon said. You can surely use them; just make sure to look at several before choosing one, as they vary depending on your final objectives and might need (most likely WILL need) some data recoding.
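To illustrate how recoding and the choice of distance interact, here is a small sketch with made-up variables. It dummy-codes two nominal records and shows that the squared Euclidean distance on the dummy vectors is twice the number of mismatching variables, so Euclidean-based clustering on dummy variables behaves like simple matching.

```python
def one_hot(record, levels):
    """Recode one record of nominal values into a flat 0/1 dummy vector.

    levels: for each variable, the ordered list of its categories.
    """
    vector = []
    for value, cats in zip(record, levels):
        vector.extend(1 if value == c else 0 for c in cats)
    return vector


def mismatches(a, b):
    """Simple matching dissimilarity: number of variables that differ."""
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((p - q) ** 2 for p, q in zip(u, v))


# Hypothetical nominal variables and their categories.
levels = [["red", "green", "blue"], ["north", "south"], ["yes", "no"]]
a = ("red", "north", "yes")
b = ("blue", "north", "no")

n_mismatch = mismatches(a, b)
sq_dist = squared_euclidean(one_hot(a, levels), one_hot(b, levels))
print(n_mismatch, sq_dist)  # sq_dist equals 2 * n_mismatch
```

The factor of two arises because each mismatching variable flips exactly two dummy columns, one switched off and one switched on.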

Hi,

To handle categorical variables in segmentation, first run a canonical discriminant analysis. After obtaining the canonical ("Can") variables through the NCAN option in SAS, run the final cluster analysis on them for the segmentation.

Hi Ravi

Can you please send me more details about the Canonical Discriminant Analysis steps you mentioned?

I looked at the documentation for PROC LCA and it mentions the use of Binary variables for the creation of Latent Classes.

I have a set of 60 nominal variables with anywhere from 2 to 6 categories each.

Would I have to dummy code all of these in order to use PROC LCA?

And if so, how would I be able to analyze the results with so many unique dummy coded variables?

Thanks for any help you can provide.

Is there a command in PROC LCA to output a "scored" dataset, along with an ID variable (for each subject) to match class membership up with the original dataset?

Ross - I have not seen an option to score a dataset. It would be up to you to score it yourself.
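One way to do the scoring by hand is to apply Bayes' rule to the estimated class prevalences and item-response probabilities, then keep each subject's ID alongside the result. The sketch below is only an illustration in Python: the parameter values, the subject IDs and the function name `score` are all made up, and it assumes items are coded 1, 2, etc.

```python
def score(responses, gamma, rho):
    """Posterior class-membership probabilities for one subject.

    responses: item responses coded 1, 2, ... (one per item).
    gamma: class prevalences, gamma[c].
    rho: rho[c][j][r - 1] = P(item j takes category r | class c).
    """
    joint = []
    for c, prevalence in enumerate(gamma):
        p = prevalence
        for j, r in enumerate(responses):
            p *= rho[c][j][r - 1]
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


# Hypothetical estimates for 2 classes and 2 binary items.
gamma = [0.5, 0.5]
rho = [
    [[0.8, 0.2], [0.7, 0.3]],  # class 1
    [[0.2, 0.8], [0.1, 0.9]],  # class 2
]

# Hypothetical data keyed by subject ID.
data = {"S001": (1, 1), "S002": (2, 2), "S003": (1, 2)}

scored = {}
for subject_id, responses in data.items():
    post = score(responses, gamma, rho)
    modal_class = post.index(max(post)) + 1  # classes numbered from 1
    scored[subject_id] = (modal_class, post)
    print(subject_id, modal_class, [round(p, 3) for p in post])
```

Because the result is keyed by subject ID, it can be merged back onto the original dataset. Keeping the full posterior vector rather than only the modal class preserves the uncertainty of each assignment.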

Regarding your previous question: there is no need to code dummy variables in PROC LCA. You code each item's category in the data (1, 2, etc.) and specify the number of categories for each item in the CATEGORIES option.

-Ralph Winters
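If the nominal variables are stored as text labels, they would first need recoding to integers 1, 2, etc. The sketch below shows one way to do that outside SAS, using made-up data and a hypothetical helper `recode_items`; it also collects the number of categories per item, which is the information the CATEGORIES option asks for.

```python
def recode_items(rows):
    """Recode nominal columns to integer codes 1..k, one mapping per column.

    rows: list of equal-length tuples of raw category labels.
    Returns (coded_rows, mappings, n_categories).
    """
    n_items = len(rows[0])
    mappings = []
    for j in range(n_items):
        # Sort the labels so the coding is reproducible between runs.
        labels = sorted({row[j] for row in rows})
        mappings.append({label: code for code, label in enumerate(labels, start=1)})
    coded = [tuple(mappings[j][row[j]] for j in range(n_items)) for row in rows]
    n_categories = [len(m) for m in mappings]
    return coded, mappings, n_categories


# Hypothetical raw data: (housing, transport, income band)
rows = [
    ("renter", "car", "low"),
    ("owner", "bus", "high"),
    ("owner", "bike", "medium"),
    ("renter", "car", "high"),
]

coded, mappings, n_categories = recode_items(rows)
print(coded)
print(n_categories)
```

The codes here follow alphabetical order of the labels, which is arbitrary for nominal variables; keep the mappings so that the fitted classes can be interpreted in terms of the original labels.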


