
Hi,

I am trying to perform clustering on my customer files with about 80K customers and 50 variables. 

 

Instead of using either just hierarchical or non-hierarchical methods in SAS, I first tried to determine the "OPTIMAL" number of clusters and their seeds using PROC CLUSTER. 

 

Next, I will feed these seeds into PROC FASTCLUS to refine the clusters. This was the recommendation someone gave me: use a hierarchical method first to get the seeds, then feed the seeds to a non-hierarchical method to fine-tune the clusters.

 

However, it took forever for PROC CLUSTER to even create clusters for my 80K customers. I had to abandon it before it returned any result.

 

Can anyone suggest a way to deal with a big data set like mine? Thanks.

Replies to This Discussion

Tom,

Your comments about factor analysis could apply to cluster analysis as well. One new case could create a cluster of 1. Certainly seeding the cluster 500 times can throw just about anything into a model. So no matter what technique we use, we need to ensure that our sample is more or less representative.

With regard to factors: yes, it is true that for confirmatory factor analysis we are more restricted in what we can do; however, as a technique for exploratory data analysis, factor analysis does fine.

I suppose in certain cases you could work with unscaled data, but I like to initially look at all variables as equal.


-Ralph Winters

"Your comments about factor analysis could apply to cluster analysis as well. One new case could create a cluster of 1. Certainly seeding the cluster 500 times can throw just about anything into a model. So no matter what technique we use, we need to insure that our sample is more or less representative"

Ralph, I am not sure I made myself clear. What I am talking about here is seeding the kmeans solution with 500 different sets of starting seeds. The variables used in the model stay exactly the same, so the seeding has no impact on what enters the model. In fact, this is fairly standard practice in kmeans, and given that our solutions can depend on the random seed values we start with, it is a very good idea.
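To make the multi-start idea concrete, here is a minimal sketch in plain Python (not SAS, and not anyone's production code in this thread). It runs kmeans from many random sets of starting seeds on the same variables and keeps the run with the lowest within-cluster sum of squares. The function names, the toy data, and the choice of 50 starts are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One kmeans run from k randomly chosen starting seeds.

    Returns (centers, labels, sse), where sse is the within-cluster
    sum of squared distances.
    """
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse

def best_of_n_starts(points, k, n_starts=500, seed=0):
    """Repeat kmeans with different starting seeds; keep the lowest-SSE run."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        result = kmeans(points, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best

# Toy data (hypothetical): three loose groups in two dimensions.
data_rng = random.Random(42)
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]

centers, labels, sse = best_of_n_starts(points, k=3, n_starts=50)
print("best within-cluster sum of squares:", round(sse, 2))
```

The variables never change between runs; only the starting seeds do, which is why the repetition affects the stability of the solution and not what enters the model.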

"With regard to factors. Yes, it is true that for confirmatory factor analysis, we are more restricted in what we can do, however as a technique for exploratory data analysis, factor analysis does fine."

On this point, Ralph, can you explain what you mean by "restricted"? I am suggesting you run exploratory and then confirmatory factor analysis with the same structure; otherwise it is not necessarily wise to use factor analysis.

Hope that clears this up.

Cheers, Paul

Paul,

Yes, I think we are talking about the same thing. What I meant by "restricted" in factor analysis is that we may be interested in demonstrating causal relationships when confirming the latent factors in post-analysis, and as such we would need to set parameters differently (e.g., loadings = 0) on some of the variables. With EDA factor analysis it doesn't matter. Thanks for your insight.

-Ralph Winters

Guys,

I am just jumping into this old discussion thread. Do you have any sample dataset or SAS code that determines the optimal number of clusters using PROC CLUSTER first and then feeds the resulting seeds into PROC FASTCLUS to further refine the clusters (or maybe the other way around: first PROC FASTCLUS to get the seeds, then use those seeds in PROC CLUSTER to refine the clusters)?
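The two-stage workflow asked about here can be sketched in a language-neutral way. The example below is plain Python, not SAS: a simple agglomerative step (centroid linkage) stands in for the role PROC CLUSTER plays, and a seeded kmeans refinement stands in for PROC FASTCLUS started from a seed data set. Running the hierarchical step on a small random subset, the subset size of 60, the number of clusters, and all function names are assumptions made to keep the illustration quick; they are not a recommendation from this thread.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(rows):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def hierarchical_seeds(sample, k):
    """Agglomerative clustering (centroid linkage) on a small sample.

    Repeatedly merges the two clusters with the closest centroids until
    k clusters remain, then returns their means to use as seeds.
    """
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [mean(c) for c in clusters]

def refine(points, seeds, max_iter=100):
    """kmeans iterations on all points, starting from the supplied seeds."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = mean(members)
    return centers, labels

# Toy data (hypothetical): three well-separated groups of 200 points each.
data_rng = random.Random(7)
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (10, 10), (0, 10)] for _ in range(200)]

# Stage 1: hierarchical clustering on a small random subset only.
subset = random.Random(1).sample(points, 60)
seeds = hierarchical_seeds(subset, k=3)

# Stage 2: refine on the full data, starting from those seeds.
centers, labels = refine(points, seeds)
print("cluster sizes:", [labels.count(j) for j in range(3)])
```

The design point is that the expensive pairwise-merge step only ever sees the subset, while the cheaper seeded refinement sees every record. Whether a subset of a given size is representative enough has to be judged on the real data.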


© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC   Powered by
