
Hi,

I am trying to perform clustering on my customer file, which has about 80K customers and 50 variables.

 

Instead of using just a hierarchical or a non-hierarchical method in SAS, I first tried to determine the "OPTIMAL" number of clusters and their seeds using PROC CLUSTER.

Next, I will feed this information (the seeds) into PROC FASTCLUS to refine the clusters. This was the recommendation someone gave me: use a hierarchical method first to get the seeds, then feed the seeds to a non-hierarchical method to fine-tune the clusters.
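Roughly, the intended workflow would look like this (just a sketch, not code I have run to completion; the variable names x1-x50, Ward's method, and the six-cluster cut are placeholders):

proc cluster data=customers method=ward outtree=tree;
   var x1-x50;
run;

/* cut the dendrogram at the chosen number of clusters */
proc tree data=tree out=treeout nclusters=6 noprint;
   copy x1-x50;
run;

/* use the cluster means as seeds */
proc means data=treeout noprint nway;
   class cluster;
   var x1-x50;
   output out=seeds mean=;
run;

/* refine with k-means, starting from the hierarchical seeds */
proc fastclus data=customers seed=seeds maxclusters=6 maxiter=50 out=final;
   var x1-x50;
run;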

 

However, it took forever for PROC CLUSTER to even create clusters for my 80K customers.  I had to abandon it before it returned any results.

 

Can anyone suggest a way to deal with a big data set like mine?  Thanks.


Replies to This Discussion

2 suggestions:

1. Try reducing the number of variables via factor analysis.
2. Use FASTCLUS to produce X clusters, and then feed the results to PROC CLUSTER (see the sketch below).
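In SAS terms, suggestion 2 might look something like this (a sketch only; the variable names x1-x50, the 100 preliminary clusters, and Ward's method are placeholders):

/* step 1: k-means preclustering into many small clusters */
proc fastclus data=customers maxclusters=100 mean=prelim_means out=prelim noprint;
   var x1-x50;
run;

/* step 2: hierarchical clustering of the 100 preliminary cluster means,
   weighting each mean by its cluster size */
proc cluster data=prelim_means method=ward ccc pseudo outtree=tree;
   var x1-x50;
   freq _freq_;
run;

/* step 3: use the CCC / pseudo-F statistics and the dendrogram
   to choose the final number of clusters */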

-Ralph Winters
Hi, Ralph:
Thanks. It makes sense to do factor analysis to reduce the number of variables. How about the number of observations?
The reason I want to use PROC CLUSTER first to produce initial seeds, and then feed them into FASTCLUS, is that FASTCLUS is quite sensitive to the initial seeds. At least PROC CLUSTER can give me a reasonable starting point (initial seeds) for FASTCLUS to refine.
Hi Kumud,

Can you please elaborate on what you mean by feeding centroid values from FASTCLUS into PROC CLUSTER? For example, suppose I get 1000 centroids for the 1000 clusters I generated using FASTCLUS. Do you want me to feed just those 1000 centroids into PROC CLUSTER?

Thanks,
Hari
Hi, Kumud:
I have a question regarding your suggestion on initial seed generation. I believe that you should get the initial seeds as a result of running PROC CLUSTER and then feed them into PROC FASTCLUS to further refine the clusters, not the other way around. Am I missing something here?
Hi, Kumud:
Thanks. This is the first time I have heard of this way of clustering. It may be worth trying. What people recommended to me was the other way around: determine the optimal number of clusters using PROC CLUSTER first, and then feed the resulting seeds into PROC FASTCLUS to further refine the clusters.

The reason is that, first of all, non-hierarchical clustering algorithms are, in general, very sensitive to the initial partition. Secondly, since a number of starting partitions can be used, the final solution may be only a local optimum of the objective function.

According to some simulation studies, non-hierarchical algorithms perform poorly when random initial partitions are used. On the other hand, their performance is much better when the results from hierarchical methods are used to form the initial partition.
Hi Chun,

I believe you are right, but what if I have, say, some 200K records? Then PROC CLUSTER cannot be run, as I think there is a limit of around 80,000 records in PROC CLUSTER (though I guess we may use Wong's method to cluster; I am not sure about this). So maybe in this case we could follow the procedure Kumud described above.

Kumud,
One more doubt I had: can we use Age and Gender as variables for clustering, or should they be used purely as profiling variables after clustering?
Also, if I have categorical variables (nominal scale), I may not be able to use PROC FASTCLUS, since it doesn't take a distance matrix as input, unlike PROC CLUSTER, where I can specify one.
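For the distance-matrix route I mean something like this (just a sketch: PROC DISTANCE builds the matrix from mixed-level variables, and the variable names here are only placeholders):

proc distance data=customers method=dgower out=distmat;
   var interval(age income)
       nominal(gender region);
   id custid;
run;

proc cluster data=distmat(type=distance) method=average outtree=tree;
   id custid;
run;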

Hi Kumud,

I need some clarification. I know that clustering can be done on binary variables by transforming them into a distance matrix, but can FASTCLUS be used in the same fashion? Please let me know your thoughts on this.

 

Thanks,

Deepa

Hi,

Getting the initial seeds is your first objective. Take a sample of the data and run PROC CLUSTER on the factors; that will give you the initial seeds.
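For example (a sketch only, with placeholder names and sizes): draw a random sample small enough for PROC CLUSTER and cluster the sample on the factor scores.

proc surveyselect data=customers out=sample method=srs sampsize=5000 seed=12345;
run;

proc cluster data=sample method=ward outtree=tree;
   var factor1-factor10;   /* factor scores from an earlier PROC FACTOR run */
run;

The seeds can then be derived from the tree and passed to PROC FASTCLUS as in the sketches above.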
Tom,

The only requirement is that the data be at least interval scale. I think you are talking about another kind of scale. If you can do a correlation analysis between these two variables, then you should be able to do a factor analysis.
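Something along these lines (a sketch; the variable names and the number of factors retained are placeholders):

proc corr data=customers;
   var age income;
run;

proc factor data=customers method=principal rotate=varimax nfactors=10 out=scores;
   var x1-x50;
run;
/* the OUT= data set adds Factor1-Factor10, which can be used as clustering variables */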

-Ralph Winters
Yes, but you are better off standardizing the variables to mean = 0 and sd = 1. When you normalize to a [0,1] scale, there is no guarantee that the variance will cover the full range, since you bound your space, and thus it will be more difficult to perform statistical inference.
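In SAS that is the difference between PROC STDIZE METHOD=STD and METHOD=RANGE (a sketch; variable names are placeholders):

proc stdize data=customers out=cust_std method=std;      /* mean 0, sd 1 */
   var x1-x50;
run;

proc stdize data=customers out=cust_01 method=range;     /* rescaled to [0,1] */
   var x1-x50;
run;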

-Ralph Winters
OK, here is my two bobs' worth.

I have always seeded at least 500+ times randomly: the starting seeds are placed 500 times at different random points, and then I assess the degree to which the results are reproduced under these different seeding conditions. This is because your initial seed values can affect your ultimate solution, and with different starting values, the reproducibility you achieve can be an indicator of the reliability/stability of the initial cluster solution/model.
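One way to do the repeated random seeding in SAS is a small macro loop over PROC FASTCLUS (a sketch only; the data set name, the custid ID variable, and 20 replicates standing in for 500 are placeholders):

%macro seed_runs(nreps=20, k=6);
   %do i = 1 %to &nreps;
      proc fastclus data=customers replace=random random=&i
                    maxclusters=&k maxiter=50 noprint
                    out=run&i(keep=custid cluster rename=(cluster=cluster&i));
         var x1-x50;
      run;
   %end;
%mend seed_runs;
%seed_runs()

/* merge the run1-run20 data sets by custid and cross-tabulate the memberships
   to see how reproducible the solution is across seedings */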

Hierarchical clustering is also a viable way to get initial seeds. With a two-step process where you start with hierarchical clustering and then k-means, you can also look at trimming outliers around your initial seeds (say, beyond the 95th percentile) and isolate these cases for follow-up as rare and of interest. A major issue with hierarchical clustering is the capacity of the software (there are limits on the variables and cases you can deploy this method on), so of course variable reduction can become essential.
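A sketch of the trimming step (names and the 95th-percentile cutoff are placeholders, and "seeds" stands for whatever initial-seed data set you built; PROC FASTCLUS writes each observation's DISTANCE to its assigned cluster seed in the OUT= data set):

proc fastclus data=customers seed=seeds maxclusters=6 out=assigned;
   var x1-x50;
run;

proc means data=assigned noprint;
   var distance;
   output out=cut p95=dist_p95;
run;

data core outliers;
   if _n_ = 1 then set cut(keep=dist_p95);
   set assigned;
   if distance > dist_p95 then output outliers;   /* rare / interesting cases */
   else output core;
run;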

ClustanGraphics gives you a lot (IMHO) that SAS does not, including guaranteed convergence; please check it out. It can also handle larger datasets and weight variables' influence (so you can target your clusters to reflect variables based on usability, strategic importance, etc.) and downweight those that have less impact after an initial run.

I would never ever use factor analysis, because if you think about it, the structure of the factors may not be stable across samples. For example, how are you going to score new cases: with a new or similar factor structure? And what happens if the factor structure changes over samples or over time? You would at least need to undertake confirmatory factor analysis across samples to justify the stability of the factor solution you are using. If you want to use factor scores in your models, that is another can of worms, because depending on the method of rotation and extraction they have different meanings. Plus, how can you explain these standardised factor scores to management? Interpretation is necessary, and factor analysis, including determining the number of factors (with the exception of Velicer's MAP test etc.), is an art as much as a science: note that most people use eigenvalues > 1 to identify the number of factors and components, which is a rule of thumb only. Oh, and don't forget that using non-interval/ratio variables is questionable due to a breach of assumptions; you need to find an algorithm that accounts for that.

And then how can you tell which of the individual variables is most influential in the segmentation if you are using factor scores?
Tom, precisely, especially without justification through confirmatory factor analysis. And given that factor analysis is a data reduction tool based on relationships between variables, I am not sure re-scaling will affect the stability of the structures. Removing or adding a few cases can impact the structure, and before psychometric measures such as the Big Five personality traits become accepted, they need to be validated extensively in different populations. Also, think of pre- and post-GFC: these sorts of economic impacts, and changes in consumption patterns with loss of employment, will no doubt change the collinear relationships between age, income, education, etc. over time (so the underlying relationships between these types of variables are likely to suffer from temporal changes). If we apply holdout techniques and test/train methodologies to our model scoring, the same type of validation work can apply to factor analysis, with exploratory and confirmatory factor analysis. Note my other objection about factor scores: depending on the type of extraction used, the scores can represent different things. Try running principal components and principal axis factoring, and see the differences in the initial communalities (these should be in the initial part of your output); this ultimately impacts the factor scores.
Cheers, Paul
