cluster analysis - AnalyticBridge (2020-07-09T17:39:09Z)
https://www.analyticbridge.datasciencecentral.com/forum/topics/cluster-analysis?commentId=2004291%3AComment%3A80133&%3Bfeed=yes&%3Bxn_auth=no&feed=yes&xn_auth=no
Kamal (2015-06-17T17:59:01.387Z, https://www.analyticbridge.datasciencecentral.com/profile/Kamal264):
<p>Guys,</p>
<p>I am just jumping into this old discussion thread. Does anyone have a sample dataset or SAS code that determines the optimal number of clusters with PROC CLUSTER first and then feeds the resulting seeds into PROC FASTCLUS to refine the clusters further (or perhaps the other way around: PROC FASTCLUS first to get the seeds, then PROC CLUSTER on those seeds to refine the clusters)?</p>
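<p>The two-stage idea asked about here (hierarchical clustering to obtain seeds, then a k-means pass to refine them) can be sketched in a few lines. This is not SAS code and not anyone's production method: it is a minimal pure-Python illustration on synthetic data, assuming Ward's criterion for the hierarchical stage and plain Lloyd iterations for the refinement. The function names, the three synthetic blobs and the choice of k = 3 are all made up for the example.</p>

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def ward_seeds(points, k):
    """Merge clusters by Ward's criterion until k remain; return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ni, nj = len(clusters[i]), len(clusters[j])
                # Ward's merge cost: the increase in within-cluster sum of squares.
                cost = ni * nj / (ni + nj) * sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm started from the given seeds; returns (centers, labels)."""
    centers = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = centroid(members)
    return centers, labels

# Synthetic data (an assumption for illustration): three well-separated blobs of 20 points.
rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)) for cx, cy in blobs for _ in range(20)]

seeds = ward_seeds(data, k=3)          # stage 1: hierarchical seeds
centers, labels = kmeans(data, seeds)  # stage 2: k-means refinement
print(centers)
```

<p>The pairwise search makes the hierarchical stage far more expensive than the k-means stage as the number of cases grows, which is one reason a sample or a preliminary reduction is often used for stage 1.</p>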
deepa bharti (2011-06-09T07:06:26.216Z, https://www.analyticbridge.datasciencecentral.com/profile/deepabharti):
<p>Hi Kumud,</p>
<p>I need some clarification. I know that clustering can be applied to binary-transformed variables by way of a distance matrix, but can PROC FASTCLUS be used in the same fashion? Please let me know your thoughts on this.</p>
<p> </p>
<p>Thanks,</p>
<p>Deepa</p>
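<p>To make the "binary transformation plus distance matrix" idea concrete, here is a small hedged sketch in pure Python. It assumes nominal variables are expanded into 0/1 indicator columns and compared with a simple matching distance; the example rows and the helper names are hypothetical. The resulting matrix is the kind of input a hierarchical method that accepts distances can use, whereas a k-means style method works from coordinates and centroids instead.</p>

```python
def one_hot(rows):
    """Expand rows of nominal values into 0/1 indicators, one column per (variable, level)."""
    levels = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    encoded = [[1 if row[j] == v else 0 for j, v in levels] for row in rows]
    return encoded, levels

def simple_matching_distance(a, b):
    """Share of indicator columns on which two cases disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical cases: (gender, area)
rows = [("F", "urban"), ("F", "rural"), ("M", "urban"), ("F", "urban")]
encoded, levels = one_hot(rows)

# Full symmetric distance matrix between cases.
dist = [[simple_matching_distance(a, b) for b in encoded] for a in encoded]
for line in dist:
    print(line)
```

<p>Other binary dissimilarities (Jaccard, for instance) could be swapped in for the simple matching distance without changing the rest of the sketch.</p>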
Ralph Winters (2010-10-14T17:21:09.573Z, https://www.analyticbridge.datasciencecentral.com/profile/RalphWinters):
Paul,<br/>
<br/>
Yes, I think we are talking about the same thing. What I meant by "restricted" in factor analysis is that we may be interested in demonstrating causal relationships when confirming the latent factors in post-analysis, and as such would need to set parameters differently (e.g. loadings = 0) on some of the variables. With EDA factor analysis it doesn't matter. Thanks for your insight.<br/>
<br/>
-Ralph Winters
paul d (2010-10-13T23:10:18.843Z, https://www.analyticbridge.datasciencecentral.com/profile/pauld):
<b><i>Your comments about factor analysis could apply to cluster analysis as well. One new case could create a cluster of 1. Certainly seeding the cluster 500 times can throw just about anything into a model. So no matter what technique we use, we need to ensure that our sample is more or less representative.</i><br/></b><br/>
Ralph, I am not sure I made myself clear. What I am talking about here is seeding the k-means solution with 500 different sets of starting seeds. The variables used in the model stay exactly the same, so the seeding has no impact on what enters the model. In fact, this is fairly standard practice in k-means, and given that we know our solutions can depend on the random seed values we start with, it is a very good idea.<br/>
<br/>
<i><b>With regard to factors: yes, it is true that for confirmatory factor analysis we are more restricted in what we can do; however, as a technique for exploratory data analysis, factor analysis does fine.</b></i><br/>
<br/>
On this point, Ralph, can you explain what you mean by restricted? I am suggesting that you run exploratory and then confirmatory factor analysis with the same structure; otherwise it is not necessarily wise to use factor analysis.<br/>
<br/>
Hope that clears this up.<br/>
<br/>
Cheers Paul
Ralph Winters (2010-10-13T19:46:00.670Z, https://www.analyticbridge.datasciencecentral.com/profile/RalphWinters):
Tom,<br />
<br />
Your comments about factor analysis could apply to cluster analysis as well. One new case could create a cluster of 1. Certainly seeding the cluster 500 times can throw just about anything into a model. So no matter what technique we use, we need to ensure that our sample is more or less representative.<br />
With regard to factors: yes, it is true that for confirmatory factor analysis we are more restricted in what we can do; however, as a technique for exploratory data analysis, factor analysis does fine.<br />
<br />
I suppose in certain cases you could work with unscaled data, but I like to initially look at all variables as equal.<br />
<br />
<br />
-Ralph Winters
paul d (2010-10-12T20:19:05.316Z, https://www.analyticbridge.datasciencecentral.com/profile/pauld):
Tom, precisely, especially without justification through confirmatory factor analysis. Given that factor analysis is a data reduction tool based on relationships between variables, I am not sure re-scaling will affect the stability of the structures. Removing or adding a few cases can change the structure, and before psychometric measures such as the Big Five personality traits become accepted, they need to be validated extensively in different populations. Also, think of pre- and post-GFC: these sorts of economic impacts and changes in consumption patterns will no doubt change the collinear relationships between age, income, education and so on over time, for example through loss of employment, so the underlying relationships between these types of variables are likely to suffer from temporal changes. If we apply holdout techniques and test/train methodologies to our model scoring, the same type of validation work can be applied to factor analysis, with exploratory followed by confirmatory factor analysis. Note my other objection about factor scores: depending on the type of extraction used, the scores can represent different things. Try running principal components and principal axis factoring and look at the differences in the initial communalities (they should be in the initial part of your output). This ultimately affects the factor scores. Cheers Paul
paul d (2010-10-12T02:01:28.897Z, https://www.analyticbridge.datasciencecentral.com/profile/pauld):
OK, here is my two bob's worth:<br/>
<br/>
I have always used at least 500 random seedings: the starting seeds are drawn 500 times at different random points, and I then assess the degree to which the results are reproduced under these different seeding conditions. This is because your initial seed values can affect your ultimate solution, and the reproducibility you achieve across different starting values can be an indicator of the reliability/stability of the initial cluster solution/model.<br/>
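
The multi-start check described above can be sketched as follows. This is only an illustration under stated assumptions, not the exact procedure used in the post: it runs a plain Lloyd-style k-means from 500 random starts on synthetic data and counts how many starts reproduce the lowest-SSE partition. The data, the value k = 3 and the helper names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's algorithm from k randomly drawn starting seeds; returns (sse, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return sse, labels

def same_partition(a, b):
    """True when two label vectors describe the same grouping up to renaming."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))

# Synthetic, partly overlapping blobs (an assumption for illustration).
data_rng = random.Random(1)
blobs = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
data = [(cx + data_rng.gauss(0, 1.0), cy + data_rng.gauss(0, 1.0)) for cx, cy in blobs for _ in range(20)]

n_starts = 500
runs = [kmeans(data, 3, random.Random(seed)) for seed in range(n_starts)]
best_sse, best_labels = min(runs, key=lambda r: r[0])
reproduced = sum(same_partition(labels, best_labels) for _, labels in runs)
print(f"{reproduced} of {n_starts} starts reproduce the lowest-SSE partition")
```

A high share of starts landing on the same partition is one informal indicator of stability; a low share suggests the solution depends heavily on where the seeds fall.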
<br/>
Hierarchical clustering is also a viable way to get initial seeds. With a two-step process where you start with hierarchical clustering and then run k-means, you can also trim outliers around your initial seeds (say at 95%) and isolate these cases for follow-up as rare and of interest. A major issue with hierarchical clustering is the capacity of the software (there are limits on the number of variables and cases you can apply the method to), so variable reduction can become essential.<br/>
<br/>
Clustan Graphics gives you a lot (IMHO) that SAS does not, including guaranteed convergence, so please check it out. It can handle larger datasets and can weight each variable's influence, so you can target your clusters to reflect variables based on usability, strategic importance etc. and downweight those that have less impact after an initial run.<br/>
<br/>
I would never ever use factor analysis, because if you think about it, the structure of the factors may not be stable across samples. For example, how are you going to score new cases: with a new or a similar factor structure? And what happens if the factor structure changes over samples or over time? You would at least need to undertake confirmatory factor analysis across samples to justify the stability of the factor solution you are using. If you want to use factor scores in your models, that is another can of worms, because depending on the method of rotation and extraction they have different meanings. Plus, how can you explain these standardised factor scores to management? Interpretation is necessary, and factor analysis, including determining the number of factors (with the exception of methods such as Velicer's MAP), is an art as much as a science. Note that most people use eigenvalues > 1 to identify the number of factors and components, which is a rule of thumb only. Oh, and don't forget that using non-interval/ratio variables is questionable because it breaches assumptions; you need to find an algorithm that accounts for that.<br/><br/>Then how can you tell which of the individual variables is most influential in the segmentation if you are using factor scores?
Hariharan Sunder (2010-10-06T20:17:04.764Z, https://www.analyticbridge.datasciencecentral.com/profile/HariharanSunder):
Hi Chun,<br />
<br />
I believe you are right, but what if I have, say, some 200k records? Then PROC CLUSTER cannot be run, as I think there is a maximum limit of around 80,000 records in PROC CLUSTER (though I guess we may use Wong's method to cluster; I am not sure about this). So maybe in this case we could follow the procedure Kumud described above.<br />
<br />
Kumud,<br />
One more question I had: can we use Age and Gender as clustering variables, or should they be used purely as profiling variables after clustering?<br />
Also, if I have categorical variables (nominal scale), I may not be able to use PROC FASTCLUS, as it doesn't take a distance matrix as input, unlike PROC CLUSTER, where I can supply one.
Yi-Chun Tsai (2010-10-06T19:07:43.906Z, https://www.analyticbridge.datasciencecentral.com/profile/YiChunTsai):
Hi, Kumud:<br />
Thanks. This is the first time I have heard of this way of clustering. It may be worth trying. What people recommended to me was the other way around: determine the optimal number of clusters using PROC CLUSTER first and then feed the resulting seeds into PROC FASTCLUS to further refine the clusters.<br />
<br />
The reason is that, first of all, non-hierarchical clustering algorithms are, in general, very sensitive to the initial partition. Secondly, since any of a number of starting partitions can be used, the final solution may be only a local optimum of the objective function.<br />
<br />
According to some results of simulation studies, non-hierarchical algorithms perform poorly when random initial partitions are used. On the other hand, their performance is much superior when the results from hierarchical methods are used to form the initial partition.
Ralph Winters (2010-10-06T19:00:16.014Z, https://www.analyticbridge.datasciencecentral.com/profile/RalphWinters):
Yes, but you are better off standardizing the variable to mean = 0 and sd = 1. When you normalize to a [0, 1] scale you bound your space, so there is no guarantee that the variance will cover the full range, and it will be more difficult to perform statistical inference.<br/>
<br/>
-Ralph Winters
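
One way to read the point above is that min-max scaling fixes each variable's range but not its variance, whereas z-scoring fixes the variance. The short sketch below illustrates that reading with two made-up variables; the values and function names are hypothetical, and it uses the sample standard deviation from Python's `statistics` module.

```python
import statistics

def zscore(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def minmax(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical variables: one concentrated in the middle, one split between two levels.
score = [0, 48, 50, 50, 52, 100]
tenure = [1, 1, 1, 10, 10, 10]

for name, col in (("score", score), ("tenure", tenure)):
    print(name,
          "sd after z-score:", round(statistics.stdev(zscore(col)), 3),
          "sd after min-max:", round(statistics.stdev(minmax(col)), 3))
```

Both variables span [0, 1] after min-max scaling, yet their spreads differ, so they would not contribute equally to a Euclidean distance; after z-scoring both have a standard deviation of 1.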