Newbie trying to cluster mixed data type variables in SAS - AnalyticBridge
2020-12-02T04:20:33Z
https://www.analyticbridge.datasciencecentral.com/forum/topics/newbie-trying-to-cluster-mixed?feed=yes&xn_auth=no

Aditya, 2009-12-18T06:19:39.443Z (https://www.analyticbridge.datasciencecentral.com/profile/Aditya)
Umm... really embarrassed about this, but there was a stupid mistake in my code. Now it's giving me great CCC values (1600! ummm?), so I guess things are great now...<br />
<br />
I'm using the number of clusters where CCC is high and PSF peaks. Hope that's OK... too scared to ask any more questions :) Thanks, guys.
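
For readers unfamiliar with PSF: it is the pseudo F statistic, the ratio of between-cluster to within-cluster variation, each divided by its degrees of freedom. The sketch below is written in plain Python rather than SAS so that it is self-contained; the function name `pseudo_f` and the four toy points are illustrative assumptions, not part of the original analysis, and CCC is not computed here.

```python
def pseudo_f(points, labels):
    """Pseudo F = (B / (k - 1)) / (W / (n - k)).

    B is the between-cluster sum of squares, W the within-cluster sum of
    squares, k the number of clusters and n the number of points.
    """
    n = len(points)
    dims = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    grand_mean = [sum(p[d] for p in points) / n for d in range(dims)]
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum(
            (centroid[d] - grand_mean[d]) ** 2 for d in range(dims)
        )
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    if within == 0:
        raise ValueError("within-cluster sum of squares is zero")
    return (between / (k - 1)) / (within / (n - k))


# Toy data (hypothetical): two tight groups far apart on the first axis.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = pseudo_f(points, [0, 0, 1, 1])  # labels follow the real groups
bad = pseudo_f(points, [0, 1, 0, 1])   # labels cut across the groups
```

A labelling that matches the real groups gives a much larger pseudo F than one that cuts across them, which is why a peak in PSF across candidate cluster counts is used as a hint for a reasonable number of clusters.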

Aditya, 2009-12-16T14:54:56.952Z (https://www.analyticbridge.datasciencecentral.com/profile/Aditya)
Hey Guys,<br />
<br />
Sorry for not replying earlier, but I just wanted to thank you all for your help. The methodology I am following now is given below, but the main problem is that I'm getting large negative CCC values (-200), so I'm kind of worried. I don't want to stretch my luck, but any help here would be appreciated too :).<br />
<br />
- Took data with 4 attributes: var1 (integer, 1-6), var2 (integer, 0-145), var3 (integer, 0-45) and var4 (binary; the DOW variable, which has basically been reduced to a weekend-stay indicator)<br />
- Cleaned the data (removed missing values and outliers)<br />
- Standardized the data using PROC STANDARD<br />
- Ran FASTCLUS (since the data is around 2 million rows)<br />
- Varied the number of clusters from 5 to 40, with maxiter=1000<br />
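
The standardization step above can be sketched as follows. This is a minimal Python illustration, not the SAS code that was run: it assumes the data was rescaled to mean 0 and standard deviation 1 using the sample standard deviation, which the post does not state, and the column values are made up.

```python
import statistics


def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(x - mean) / sd for x in column]


# Hypothetical values in the ranges described above.
var2 = [0, 5, 10, 145]   # integer, 0-145
weekend = [0, 0, 1, 1]   # binary weekend-stay indicator

z_var2 = standardize(var2)
z_weekend = standardize(weekend)
```

Note that a standardized binary indicator still takes only two distinct values, so it behaves differently from the count-like variables in a distance calculation even after rescaling.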
<br />
That's all, folks :(

Zhuang Wu, 2009-12-12T22:41:57.698Z (https://www.analyticbridge.datasciencecentral.com/profile/ZhuangWu)
Hi, what I would recommend is to convert them into 7-dim variables, say:<br />
(n1,n2,n3,n4,n5,n6,n7)<br />
The clustering will then be in 7-dim space.
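
One reading of this suggestion is one 0/1 indicator per weekday. The Python sketch below assumes that reading, and the Monday-to-Sunday order for n1..n7 is also an assumption; the helper names are hypothetical.

```python
from itertools import combinations

# Assumed order for (n1, ..., n7); the post does not specify one.
DAYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")


def day_to_indicators(day):
    """Map a day name to a 7-tuple with a single 1 in that day's position."""
    key = day.strip().lower()[:3]
    if key not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    return tuple(int(key == d) for d in DAYS)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


monday = day_to_indicators("Monday")
sunday = day_to_indicators("Sunday")

# Squared distance for every pair of distinct days.
pair_distances = {
    squared_distance(day_to_indicators(a), day_to_indicators(b))
    for a, b in combinations(DAYS, 2)
}
```

With this coding every pair of distinct days is equally far apart, so no artificial ordering of the days enters the distance calculation.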

Ralph Winters, 2009-12-11T01:38:48.939Z (https://www.analyticbridge.datasciencecentral.com/profile/RalphWinters)
Even though dates have an interval scale, you are "forcing" the DOW to be a numeric variable when it's not, and you may end up with bad results or with having to explain a difference that is not really there. For example, which is the lower number, Monday or Sunday?<br />
<br />
You are better off doing hierarchical clustering with this rather than performing what looks like k-means clustering. Hierarchical clustering can handle both categorical and numeric variables.<br />
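
The post does not say which distance would be used for the mixed variables. One common option is a Gower-style dissimilarity, sketched below in Python; treating it as the intended approach is an assumption, and the function name and example records are hypothetical. The numeric ranges are taken from the variable descriptions elsewhere in this thread.

```python
def gower(a, b, numeric_ranges):
    """Average per-variable dissimilarity between two records.

    numeric_ranges[i] is the value range of variable i if it is numeric,
    or None if variable i is categorical (0 on a match, 1 on a mismatch).
    """
    parts = []
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            parts.append(0.0 if x == y else 1.0)
        elif value_range <= 0:
            raise ValueError("numeric range must be positive")
        else:
            parts.append(abs(x - y) / value_range)
    return sum(parts) / len(parts)


# var1 in 1-6, var2 in 0-145, var3 in 0-45, day of week kept as a category.
RANGES = (5, 145, 45, None)

first = (1, 0, 0, "mon")
same_numbers_other_day = (1, 0, 0, "sun")
opposite = (6, 145, 45, "sun")

d_day_only = gower(first, same_numbers_other_day, RANGES)
d_everything = gower(first, opposite, RANGES)
```

Here Monday and Sunday differ only as categories, so no numeric order between them is implied.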
<br />
Good luck..<br />
<br />
-Ralph Winters

Jozo Kovac, 2009-12-10T11:22:11.065Z (https://www.analyticbridge.datasciencecentral.com/profile/JozoKovac)
Dear Aditya,<br />
<br />
I am glad you like the solutions.<br />
<br />
1) There are more *CLUS procedures in SAS that you can explore.<br />
2) Right, you can use only numbers: "The VAR statement lists the numeric variables to be used in the cluster analysis..." (SAS doc)<br />
So yes, 3 dummy variables should be created:<br />
/* assumes day is a character variable holding lowercase three-letter day names */<br />
if day in ('sun', 'sat') then wend=1; else wend=0;<br />
if day in ('mon', 'tue', 'wed') then bow=1; else bow=0;<br />
if day in ('thu', 'fri') then eow=1; else eow=0;<br />
<br />
... Distance-based clustering algorithms are very sensitive to the training data (extreme values, etc.).<br />
Will you standardize the data or not, e.g. to <0,1>?<br />
Will you add weights to the normalized data, e.g. one variable in <0,1> and another in <0,5>?<br />
Will you rank the data into N groups or not? After ranking, will you divide the ranks by N or not (<0,N> vs. <0,1>)?<br />
Will you use all 5 variables or just 4 or 3, or will you add new ones?<br />
How many clusters do you want to have? 2? 5? 10? 50?<br />
<br />
There are so many options, and each setup will create completely different results (segments), with the same data! :)<br />
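
To make the point about scaling and weights concrete, here is a small Python sketch (not SAS). The helper names and the three data rows are hypothetical; the only thing it shows is that rescaling one variable from <0,1> to <0,5> can change which observation is nearest to another.

```python
import math


def min_max(column, upper=1.0):
    """Rescale a column to the range <0, upper>."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("a constant column cannot be rescaled")
    return [upper * (x - lo) / (hi - lo) for x in column]


def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to rows[0], excluding rows[0]."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))


# Hypothetical raw values for two variables on three observations.
a = [0, 50, 100]
b = [0, 5, 1]

equal_scale = list(zip(min_max(a), min_max(b)))         # both in <0,1>
a_weighted = list(zip(min_max(a, upper=5.0), min_max(b)))  # a in <0,5>, b in <0,1>

neighbour_equal = nearest_to_first(equal_scale)
neighbour_weighted = nearest_to_first(a_weighted)
```

With both variables in <0,1> the third observation is closest to the first; once the first variable is stretched to <0,5> the second observation becomes closest, so any clustering built on these distances can change as well.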
<br />
I like the DECISION LIST node in PASW Modeler: you can simply add conditions and create your own clusters.<br />
In SAS, you can use a data step:<br />
/* my_data is the input data set; condition1..condition8 are placeholders for your own rules */<br />
data segments;<br />
  set my_data;<br />
  length segment $ 20;<br />
  if condition1 and condition2 then segment = 'CLUSTER1';<br />
  else if condition3 and condition4 then segment = 'CLUSTER2';<br />
  else if condition5 and condition6 then segment = 'CLUSTER3';<br />
  else if condition7 and condition8 then segment = 'CLUSTER4';<br />
  else segment = 'Not_clustered_yet';<br />
run;<br />
<br />
Good luck!

Aditya, 2009-12-10T10:27:48.014Z (https://www.analyticbridge.datasciencecentral.com/profile/Aditya)
First of all, thanks a lot, Dirk and Jozo. I didn't expect such fast (and useful) replies. A couple of points:<br />
<br />
- Is FASTCLUS the only way to go? (I guess that, given I have data in the 100,000-row range, that's a yes.)<br />
- If I use FASTCLUS, can I use categorical variables? (Jozo, I liked your solution [3 periods], but can I use it with FASTCLUS? What I'm thinking is that maybe I can add 3 columns, weekend_ind, bow_ind and eow_ind, which take 0/1 values, so a weekend DOW will be (1,0,0) for the weekend_ind, bow_ind and eow_ind variables... would that be OK?)

Jozo Kovac, 2009-12-09T20:14:37.870Z (https://www.analyticbridge.datasciencecentral.com/profile/JozoKovac)
Simply recode the days into 3 periods:<br />
* weekend (Sat-Sun)<br />
* begin_of_week (Mon-Wed)<br />
* end_of_week (Thu-Fri)<br />
The distance between any two of them is 1, and that makes common sense :)
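
The "distance is 1" statement holds for a simple match/mismatch distance between period labels. If the periods are instead stored as three 0/1 columns and compared with Euclidean distance, each pair of periods is sqrt(2) apart, which is still the same for every pair. The Python sketch below checks both; the names are hypothetical and the weekend, begin-of-week, end-of-week column order is an assumption.

```python
import math
from itertools import combinations

PERIOD_OF_DAY = {
    "sat": "weekend", "sun": "weekend",
    "mon": "begin_of_week", "tue": "begin_of_week", "wed": "begin_of_week",
    "thu": "end_of_week", "fri": "end_of_week",
}
PERIODS = ("weekend", "begin_of_week", "end_of_week")  # assumed column order


def period_indicators(day):
    """Three 0/1 columns with a single 1 for the period the day falls in."""
    period = PERIOD_OF_DAY[day]
    return tuple(int(period == p) for p in PERIODS)


def mismatch(day_a, day_b):
    """0 if both days fall in the same period, otherwise 1."""
    return 0 if PERIOD_OF_DAY[day_a] == PERIOD_OF_DAY[day_b] else 1


def euclidean(day_a, day_b):
    """Euclidean distance between the indicator vectors of two days."""
    return math.dist(period_indicators(day_a), period_indicators(day_b))


representatives = ("sat", "mon", "thu")  # one day from each period
mismatches = [mismatch(a, b) for a, b in combinations(representatives, 2)]
euclideans = [euclidean(a, b) for a, b in combinations(representatives, 2)]
```

Either way the three periods are mutually equidistant, which is the property that matters for a distance-based clustering procedure.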