# AnalyticBridge

A Data Science Central Community

# Clustering with non numeric data

Hi - has anyone worked on a clustering project using non-numeric variables? For example, clustering customer behavior based on brand preference, type of product purchased, etc.? I only have SAS EG available and haven't been able to think of a way to do it yet...

Any help would be great!


### Replies to This Discussion

Let's say your model is as follows:

observation = (X1,...., Xk, Y1, ..., Yl)

where

• Xi=1 if customer likes brand i, 0 otherwise
• Yi=1 if customer buys product i, 0 otherwise

You can compute the difference between two customers as delta = sum_i |Xi-X'i| + sum_j |Yj-Y'j|, where the primed values belong to the second customer.

Computing delta for every pair of customers gives you a dissimilarity matrix, and you can then run various types of clustering on that matrix.
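
For anyone who wants to see the delta calculation spelled out, here is a minimal Python sketch. The three customers and their 0/1 vectors are made up for illustration, and the dict-of-dicts layout for the matrix is just one convenient assumption; any clustering routine that accepts precomputed dissimilarities could take it from there.

```python
from itertools import combinations

# Hypothetical toy data: each vector is (X1, X2, X3, Y1, Y2), where
# Xi = 1 if the customer likes brand i and Yi = 1 if they buy product i.
customers = {
    "A": [1, 0, 1, 1, 0],
    "B": [1, 0, 0, 1, 0],
    "C": [0, 1, 0, 0, 1],
}

def delta(u, v):
    """Sum of absolute differences between two 0/1 vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

names = list(customers)
# Symmetric dissimilarity matrix stored as a dict of dicts (zero diagonal).
dissimilarity = {n: {m: 0 for m in names} for n in names}
for n, m in combinations(names, 2):
    d = delta(customers[n], customers[m])
    dissimilarity[n][m] = dissimilarity[m][n] = d

for n in names:
    print(n, [dissimilarity[n][m] for m in names])
```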

If you have both numeric and discrete (nominal-scale) data, be aware that simply summing the distances over all variables can be a trap. The distance for a nominal variable is either 0 or 1, but the distance for a numeric variable can be anything, so I recommend using a weighted distance measure to control the relative influence of the discrete and numeric variables.
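
One way to sketch such a weighted measure in Python is below. This is an assumption on my part rather than the poster's exact method: numeric differences are rescaled by the variable's range so they fall in [0, 1], nominal variables contribute 0 or 1, and the two kinds get separate weights (similar in spirit to Gower's distance). The field names, the range of 200 and the weights are all hypothetical.

```python
def mixed_distance(a, b, numeric_ranges, nominal_weight=1.0, numeric_weight=1.0):
    """Weighted distance for records with numeric and nominal fields.

    a, b: dicts mapping field name -> value.
    numeric_ranges: dict mapping each numeric field -> (max - min) over the
    data, used to rescale numeric differences into [0, 1].
    Fields not in numeric_ranges are treated as nominal (0 if equal, else 1).
    """
    total = 0.0
    weight_sum = 0.0
    for field in a:
        if field in numeric_ranges:
            span = numeric_ranges[field]
            d = abs(a[field] - b[field]) / span if span else 0.0
            w = numeric_weight
        else:
            d = 0.0 if a[field] == b[field] else 1.0
            w = nominal_weight
        total += w * d
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

# Hypothetical customers: one numeric field (spend) and one nominal field (brand).
c1 = {"spend": 120.0, "brand": "acme"}
c2 = {"spend": 20.0, "brand": "zenith"}
ranges = {"spend": 200.0}  # assumed max - min of spend in the full data set

equal_weights = mixed_distance(c1, c2, ranges)
damped_nominal = mixed_distance(c1, c2, ranges, nominal_weight=0.5)
print(equal_weights, damped_nominal)
```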

I use Spad, software based on Analyse des Données. The first step is a correspondence analysis; it is then possible to carry out a cluster analysis on the factor scores (not the original variables). So the question is: does SAS do multiple correspondence analysis?

I usually perform decision tree analysis when working with categorical (or non-numeric) data. I don't believe SAS EG contains this capability; I know R does. If you are doing brand preference studies, you can also do a simple paired t-test.

One thing I have done is to perform traditional cluster analysis on the numeric variables of interest, and then observe how the clusters fall into the various categories. That at least will give you some insight into which of the categories are the best discriminators.
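
The "cluster on numerics, then inspect the categories" idea can be sketched as a simple cross-tabulation in Python. The cluster labels here are assumed to come from a clustering step that has already been run on the numeric variables; both the labels and the brand values are invented for illustration. A category that dominates one cluster and is rare in the others is a candidate discriminator.

```python
from collections import Counter

# Hypothetical inputs: cluster labels from a numeric-only clustering step
# and one categorical variable observed for the same customers.
cluster_labels = [0, 0, 1, 1, 1, 0, 2, 2]
brand = ["acme", "acme", "zenith", "zenith", "acme", "acme", "orbit", "orbit"]

# Count (cluster, category) pairs and the size of each cluster.
counts = Counter(zip(cluster_labels, brand))
cluster_sizes = Counter(cluster_labels)
categories = sorted(set(brand))

# Share of each category within each cluster.
for cluster in sorted(cluster_sizes):
    shares = {
        cat: counts[(cluster, cat)] / cluster_sizes[cluster] for cat in categories
    }
    print(cluster, shares)
```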

SAS used to supply a CHAID procedure, and there was also a third-party version called SICHAID. I don't know if it's still available. There is a version available within Enterprise Miner, or if you are lucky enough to have SAS/IML installed, there is a macro you can run that has an algorithm similar to CHAID.

-Ralph Winters

I assume you have a mixed dataset with both numeric and non-numeric data types. In that case, clustering based on a Euclidean distance measure will not be appropriate. You could try conceptual clustering techniques, which are based on a concept hierarchy. Conceptual clustering subdivides the data incrementally into subgroups based on a probabilistic measure known as "COHESION". At each branch in the concept hierarchy, a partition score is computed from a category utility measure. Each node in the hierarchy holds the set of data points that cluster into the class or category represented by that node. Two well-known algorithms are COBWEB and ITERATE. Please let me know if you find this helpful or need more info.
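
To make the category utility score a bit more concrete, here is a small Python sketch of the commonly cited form: CU = (1/K) * sum_k P(C_k) * [sum of P(A_i = v | C_k)^2 minus sum of P(A_i = v)^2]. This is only the scoring function for a given partition, not COBWEB or ITERATE themselves, and the function names and toy customers are hypothetical.

```python
from collections import Counter

def sum_sq_probs(rows):
    """Sum over attributes and values of P(attribute = value)^2 within rows."""
    n = len(rows)
    total = 0.0
    for i in range(len(rows[0])):
        value_counts = Counter(row[i] for row in rows)
        total += sum((c / n) ** 2 for c in value_counts.values())
    return total

def category_utility(clusters):
    """Category utility of a partition given as a list of lists of rows."""
    all_rows = [row for cluster in clusters for row in cluster]
    baseline = sum_sq_probs(all_rows)
    n = len(all_rows)
    gain = sum(
        (len(cluster) / n) * (sum_sq_probs(cluster) - baseline)
        for cluster in clusters
    )
    return gain / len(clusters)

# Hypothetical customers described by (brand, product type).
good = [
    [("acme", "shoes"), ("acme", "shoes")],
    [("zenith", "hats"), ("zenith", "hats")],
]
bad = [
    [("acme", "shoes"), ("zenith", "hats")],
    [("acme", "shoes"), ("zenith", "hats")],
]
print(category_utility(good))  # partition that separates the two profiles
print(category_utility(bad))   # partition that mixes them
```

A partition whose clusters make attribute values more predictable than they are in the data as a whole scores higher, which is why the first partition beats the second here.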

Hello Indar,

Can you please give me an R implementation example of conceptual clustering?

In my experience, qualitative variables are part of every analytics project. One solution, GT data mining, clusters the qualitative data together with the numeric data. It is available as a service (SaaS).

Clearly a dummy variable approach is what is needed: convert your brands to 0-1 vectors and do the clustering on those.

However, CHAID may be a better approach, or possibly latent class analysis methods.

Hi Anindo - I recently worked on a clustering project that used non-numeric variables such as gender and brand.
I used binary conversion (dummy variables) to represent the original attributes.
Then, to bring the different attributes onto a common scale, it helps to standardize the data, which can then be passed to clustering procedures like proc fastclus / proc cluster in SAS.
SAS EG uses proc fastclus, I guess!
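
The two preparation steps above (dummy coding, then standardization) can be sketched in plain Python as follows. This is not the SAS workflow itself, just an illustration of the data preparation; the records and attribute levels are made up, and standardizing to mean 0 and standard deviation 1 is one assumed choice of scaling.

```python
from statistics import mean, pstdev

# Hypothetical raw records with two categorical attributes.
records = [
    {"gender": "F", "brand": "acme"},
    {"gender": "M", "brand": "zenith"},
    {"gender": "F", "brand": "orbit"},
    {"gender": "M", "brand": "acme"},
]

# Step 1: build one 0/1 column per (attribute, level) pair.
columns = sorted({(k, v) for r in records for k, v in r.items()})
dummies = [[1.0 if r[k] == v else 0.0 for k, v in columns] for r in records]

# Step 2: standardize each column to mean 0 and standard deviation 1.
def standardize(matrix):
    cols = list(zip(*matrix))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in matrix]

scaled = standardize(dummies)
print(columns)
for row in scaled:
    print([round(x, 2) for x in row])
```

The rows of `scaled` are what you would hand to a distance-based clustering routine.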

Good Luck!

CHAID will do a good job when it comes to categorical data. You can use SPSS or Clementine to do the magic.

Hi Anindo,

Check out the Jaccard coefficient. It is a similarity index, and I found it to be a very intuitive way of dealing with categorical variables.
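
As a quick illustration, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. Here is a minimal Python sketch; the brand sets are invented, and returning 1.0 for two empty sets is a convention I am assuming rather than something fixed by the definition.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical customers described by the set of brands they bought.
cust1 = {"acme", "zenith", "orbit"}
cust2 = {"acme", "orbit", "nova"}

sim = jaccard_similarity(cust1, cust2)
print(sim)      # similarity between the two customers
print(1 - sim)  # Jaccard distance, usable as a clustering dissimilarity
```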

If you have numeric data too, cluster the numeric and categorical variables separately and then combine the results into a single equation (based on business logic).