
Hi - has anyone worked on a clustering project using non-numeric variables? For example, clustering customer behavior based on brand preference, type of product purchased, etc.? I only have SAS EG available and haven't been able to think of a way to do it yet...

Any help would be great!


Replies to This Discussion

SOM (Self-Organizing Maps) is very easy to implement in any programming language and, in batch mode, can deal with both categorical and numerical values. Just be careful to avoid computation packages that cheat by turning categorical or nominal values into continuous values: they assign an arbitrary numerical value to each category and then use these arbitrary numbers to compute distances.
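To make the warning about arbitrary numeric codes concrete, here is a minimal Python sketch (not SOM itself, and not taken from any package). The brand names, the integer codes and both function names are made up for illustration. It compares a simple matching distance, which only asks whether two categories are equal, with a distance computed on arbitrary integer codes.

```python
# Illustrative only: brands, codes and function names are hypothetical.

def simple_matching_distance(a, b):
    """Fraction of attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def coded_distance(a, b, codes):
    """Mean absolute difference after mapping categories to integer codes."""
    return sum(abs(codes[x] - codes[y]) for x, y in zip(a, b)) / len(a)


# Three customers: (preferred brand, product type).
customers = {
    "A": ("BrandX", "shoes"),
    "B": ("BrandY", "shoes"),
    "C": ("BrandZ", "shoes"),
}

# An arbitrary coding of the kind the reply warns about.
arbitrary_codes = {"BrandX": 1, "BrandY": 2, "BrandZ": 9, "shoes": 1}

# Matching distance: A differs from B and from C on exactly one attribute.
d_ab = simple_matching_distance(customers["A"], customers["B"])
d_ac = simple_matching_distance(customers["A"], customers["C"])

# Coded distance: C looks much further from A than B does,
# purely because BrandZ happened to be assigned the code 9.
c_ab = coded_distance(customers["A"], customers["B"], arbitrary_codes)
c_ac = coded_distance(customers["A"], customers["C"], arbitrary_codes)

print(d_ab, d_ac)  # equal distances
print(c_ab, c_ac)  # unequal distances caused only by the coding
```

With the matching distance, B and C are equally far from A. With the coded distance, the ordering depends entirely on which integers were assigned, which is the artifact to watch out for.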

In my not-so-humble opinion, SOM is superior to other clustering methods because it answers the obvious question "are there clusters in the data?" with more than "yes" or "no": it lets you visualize clustering trends even when there are no well-defined clusters.

Any link to SOM sources?

It would not be my tool of choice; however, here is a link to a SOM in Excel

-Ralph Winters

Multiple correspondence analysis may be used as a preliminary step before cluster analysis. SAS PROC CORRESP followed by PROC FASTCLUS or PROC CLUSTER is an interesting combination.

Firstly, thanks to all of you for the valuable suggestions. I have been working on this on and off for the last couple of months, hence the delay.

I tried out something very simple since our clients wanted to see "something" very quickly. I created dummy (1 or 0) variables from the categorical variables, e.g. xi=1 if brand=i is purchased and xi=0 otherwise. With this I ended up with ~30 variables. I also had some numeric variables (distance to closest competitor, guest scores, etc.), which I left aside for the time being since the clients were more interested in the dummy variables than in the others. I derived 3 principal components from this dummy-variable space. Once I was satisfied with these principal components, I used them to cluster guests, ending up with 6 clusters. As a sanity check, I ran an ANOVA on these 6 groups for each of the numeric variables to check whether that variable differed significantly across the 6 groups. All the ANOVA results showed that at least one cluster was different from the rest. The results were well received, but I know we can do a lot better. Do let me know your thoughts.
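For readers asking how the dummy-variable and principal-component steps fit together, the following is a rough Python sketch and not the poster's SAS code. The guest records, the column names and all function names are hypothetical. The principal components are found by plain power iteration with deflation so that only the standard library is needed; a real analysis would more likely use PROC PRINCOMP or a linear algebra library. The clustering and ANOVA steps are not shown.

```python
import math
import random


def one_hot(records, columns):
    """Build a 0/1 matrix with one column per (variable, level) pair."""
    features = []
    for col in columns:
        levels = sorted({r[col] for r in records})
        features.extend((col, lvl) for lvl in levels)
    matrix = [[1.0 if r[col] == lvl else 0.0 for col, lvl in features]
              for r in records]
    return matrix, features


def center(matrix):
    """Subtract each column's mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def covariance(centered):
    """Sample covariance matrix (divisor n - 1) of centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def top_components(cov, k, iters=500, seed=0):
    """Approximate the k leading eigenvectors by power iteration + deflation."""
    rng = random.Random(seed)
    p = len(cov)
    work = [row[:] for row in cov]
    components, eigenvalues = [], []
    for _ in range(k):
        v = [rng.random() for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: variance captured along v.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(p))
                  for i in range(p))
        components.append(v)
        eigenvalues.append(lam)
        # Deflate so the next pass finds the next direction.
        for i in range(p):
            for j in range(p):
                work[i][j] -= lam * v[i] * v[j]
    return components, eigenvalues


def project(centered, components):
    """Principal-component scores: one row per record, one column per PC."""
    return [[sum(a * b for a, b in zip(row, comp)) for comp in components]
            for row in centered]


# Hypothetical guests with two categorical variables.
guests = [
    {"brand": "BrandX", "product": "room"},
    {"brand": "BrandX", "product": "room"},
    {"brand": "BrandX", "product": "spa"},
    {"brand": "BrandY", "product": "spa"},
    {"brand": "BrandY", "product": "dining"},
    {"brand": "BrandZ", "product": "dining"},
]

matrix, features = one_hot(guests, ["brand", "product"])
centered = center(matrix)
components, eigenvalues = top_components(covariance(centered), k=2)
scores = project(centered, components)  # these would be fed to the clustering step
```

The rows of `scores` play the role of the principal-component scores that were clustered in the post above. Note that PCA on 0/1 dummies treats them as continuous variables, which is one reason multiple correspondence analysis is often suggested instead.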

But I'd definitely like to try some of the suggestions you've made e.g. creating the dissimilarity matrix, using the cohesion measure. I am studying these techniques, so any help would be welcome!
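As one possible reading of the dissimilarity-matrix suggestion, here is a small Python sketch of a Gower-style dissimilarity for mixed data. Gower's coefficient is my choice for illustration; the thread does not say which measure was meant. The guest records, column names and the function name `gower_matrix` are hypothetical. Numeric columns contribute a range-scaled absolute difference and categorical columns contribute 0 for a match and 1 for a mismatch.

```python
def gower_matrix(records, numeric, categorical):
    """Pairwise Gower-style dissimilarities, each in the range [0, 1]."""
    # Range of each numeric column; fall back to 1.0 for a constant column.
    ranges = {}
    for col in numeric:
        values = [r[col] for r in records]
        ranges[col] = (max(values) - min(values)) or 1.0

    n = len(records)
    n_features = len(numeric) + len(categorical)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for col in numeric:
                total += abs(records[i][col] - records[j][col]) / ranges[col]
            for col in categorical:
                total += 0.0 if records[i][col] == records[j][col] else 1.0
            dist[i][j] = dist[j][i] = total / n_features
    return dist


# Hypothetical guests mixing categorical and numeric variables.
guests = [
    {"brand": "BrandX", "product": "room", "guest_score": 8.0},
    {"brand": "BrandX", "product": "room", "guest_score": 9.0},
    {"brand": "BrandY", "product": "spa", "guest_score": 4.0},
]

dissimilarity = gower_matrix(
    guests, numeric=["guest_score"], categorical=["brand", "product"]
)
for row in dissimilarity:
    print([round(v, 3) for v in row])
```

A matrix like this can be passed to any clustering method that accepts precomputed dissimilarities, such as hierarchical clustering, which avoids turning categories into arbitrary numbers and lets the numeric variables take part as well.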

Lastly, one of my teammates has access to SAS EM, and he let me know that SOM was also giving great results. It made the clustering output more visually appealing, but it remains to be seen how it compares with other techniques. I guess running tests would be the only way to know :-)

Thanks again.

Here is a partial list of free, open-source predictive analytics tools that can help you cluster categorical values using decision trees or other methods

Hope this helps.

Hi Anindo,

I am working on a similar kind of project and would like to know how you performed factor analysis on binary data. I have Base SAS and tried PROC FACTOR and PROC PRINCOMP, but they don't seem to work for binary data.
Finally, I am now trying correspondence analysis (PROC CORRESP), but I'm not able to interpret the output.
Any help is appreciated.


Anindo, can you please share what you did after the princomp step and how you calculated the cluster distances, etc.? Code snippets or web examples would be deeply appreciated.

Did you use the standard princomp or a special variant? What would be the best method for variable clustering on binary data?

Hi sir,
I'm doing my PhD in the clustering area and need some research titles in clustering. Please provide the details to my mail ID [email protected]
Thanking you

Please have a look at this: it is a clustering application built on text mining results. The web site is green-centric, but the algorithm is domain-independent. It is a small Ruby on Rails application. What kind of data do you have?

I want to be like you: a numeric clustering expert.


On Data Science Central

© 2021   TechTarget, Inc.   Powered by
