Subscribe to DSC Newsletter

Newbie trying to cluster mixed data type variables in SAS


I have a lot of data (around 300,000 rows) and 5 clustering variables. 3 of these have only positive integral values, one is binary and the last one (a variable for day of week) is what is really killing me. Essentially the day of week variable (dow) takes values from Mon-Sun. I dont know how to treat this variable as Mon-Wed distance is the same as Sat-Mon distance. Is it possible to define a distance matrix specifically for this variable. I'm really new to this so any help would be really appreciated.


Tags: Distance, SAS, data, measure, mixed, type, variable

Views: 3917

Reply to This

Replies to This Discussion

simply decode days to 3 periods:
* weekend (sat-sun)
* begin_of_week (mon-wedn)
* end_of_week (thu-fri)
distance between all of them is 1 and that makes common sense :)
First of all, thanks a lot Dirk and Jozo. I didnt expect such fast (and useful) replies. Couple of points:

- Is fastclus the only way to go? (I guess given that I have data in the 100,000 rows range, thats a yes)
- If I use fastclus- can I use categorical variables? (Jozo- I liked your solution [3 periods] but can i use it with fastclus?) ( what I'm thinking is that maybe I can add 3 columns - weekend_ind, bow_ind and eow_ind which use 0/ a weeked dow will be (1,0,0) for weekend_ind, bow_ind and eow_ind variables.... would that be ok?)
dear aditya,

i am glad you like solutions.

1) there are more *clus procedures in sas, you can explore.
2) right, u can use only numbers - "The VAR statement lists the numeric variables to be used in the cluster analysis... " (SAS doc)
so yes, 3 dummy variables should be created:
if day in (sun, sat) then wend=1 else wend=0;
if day in (mon, tue, wed) then bow=1 else bow=0;
if day in (thu, fri) then eow=1 else eow=0;

... distance-based clustering algorithms are very senstitive to training data - extreme values, etc.
will you standardize data or not? e.g. to <0,1>
will you add weights to normalized data? e.g. one variable <0,1> another <0,5>?
will you rank data to N groups or not? after rank, will you divide them by N? or not ? <0,N> vs. <0,1>
will you use all 5 variables or just 4, 3, or you add new??
how many clusters do you want to have? 2? 5? 10? 50?

there are so many options. and each setup will create completly different results ( segments ). with the same data ! :)

i like PASW Modeler node - DECISION LIST. you can simply add conditions and create your own clusters.
in SAS, you can use data step:
data segments;
input my_data
if condition1 and condition2 then CLUSTER1
else if condition3 and condition4 then CLUSTER2
else if condition3 and condition4 then CLUSTER3
else if condition3 and condition4 then CLUSTER4
else Not_clustered_yet;

good luck!
Even though dates have an interval scale, you are "forcing" the DOW to be a numeric variable, when it's not, and you may end up with bad results, or having to explain why there is a difference when there is not. For example, Which is the lowest number Monday or Sunday?

You are better off doing hierarchical clustering with this rather than performing what looks like k-means clustering. Hierarchical clustering can handle both categorical and numeric variables.

Good luck..

-Ralph Winters
Hi, what i would recommend to do is to convert them into 7-dim variables: say
the clustering will then be in 7-dim space
Hey Guys,

Sorry for not replying earlier...But just wanted to thank you all for your help. The methodology I am following now is given below, but the main problem is that im getting large negative CCC values (-200) so I'm kinda worried. Dont want to strech my luck but any luck here would help too :).

- Taken data with 4 attributes ( var1(integral from 1-6),var2(integral from 0-145), var3(integral from 0-45), var4 (the DOW variable which has basically been reduce to a weekend stay indicator) is binary)
- Cleaned data( removed missing values|outliers)
- standardized data using proc standard
- run fastclus( since data is arround 2 million rows)
- changed clusters from 5 to 40 | maxiter=1000

Thats all folks :(
Umm...Really embarassed about this...but there was a stupid mistake in my code.... Now its giving me great CCC values (1600! ummm?) I guess things are great now...

im using the number of clusters where CCC is high and PSF peaks...hope thats ok... too scared to ask any more questions :) ..thanks guys


On Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service