A Data Science Central Community
k-means is a method of clustering optimization where the number of clusters must be known a priori. If you're not sure about the number of clusters, you can run PROC FASTCLUS several times with different K values to see which number looks best. You can also try a hierarchical method with PROC CLUSTER, which builds a full hierarchy of clusters from a distance metric, so you can see how the observations group at every level instead of fixing K up front. Keep in mind that PROC CLUSTER calculates the distance between every pair of observations in a data set, so if the data set is very large you might not be able to run the PROC.
Thanks for your response, Guillermo.
When we run FASTCLUS with different K values, it returns output with a different number of clusters every time. So how do we know which of the outputs is optimal?
That's the most difficult part of unsupervised learning techniques. You must use your business knowledge, the problem context or discuss the results with a subject matter expert to decide which solution might be the best. For example in a marketing scenario you might be restricted to a maximum of 6 clusters because they can't implement more targeted campaigns than that. So you should generate several possible solutions and discuss which is the most fitting solution for your problem.
Perfectly got your point, Guillermo. Thanks a lot once again.
You can tell PROC FASTCLUS how many clusters to use (i.e. K) with the MAXCLUSTERS= option, like this:
proc fastclus data=YourDataSet maxclusters=3 out=Clusters;
  /* maxclusters= sets the number of clusters (K) */
  var YourFirstVariable YourSecondVariable EtcVariables;
run;
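If you want to compare several values of K in one go, a small macro loop is one way to do it. This is only a minimal sketch: the macro name try_k, the data set names Clusters2, Clusters3, ... and Stats2, Stats3, ..., and the MAXITER= value are my own choices for illustration, and YourDataSet and the VAR list are the same placeholders as above.

```sas
/* Sketch: run PROC FASTCLUS once for each K from kmin to kmax.      */
/* try_k, Clusters&k and Stats&k are hypothetical names.             */
%macro try_k(kmin=2, kmax=6);
  %local k;
  %do k = &kmin %to &kmax;
    title "PROC FASTCLUS with K=&k";
    /* maxiter= raises the iteration limit above the default;        */
    /* out= keeps the cluster assignments, outstat= the statistics   */
    proc fastclus data=YourDataSet maxclusters=&k maxiter=50
                  out=Clusters&k outstat=Stats&k;
      var YourFirstVariable YourSecondVariable EtcVariables;
    run;
  %end;
  title;
%mend try_k;

%try_k(kmin=2, kmax=6)
```

Each run prints its own cluster summary, so you can page through the output for K=2 to K=6 and then plot whichever Clusters data sets look promising.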
I like to try a few different numbers of clusters and inspect them visually to see if there is a "natural" number of clusters that forms. You can do this with the following code in SAS:
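/* Obtain the principal components for the same variables used in the cluster analysis */
proc princomp data=Clusters out=PrincipalComponents;
  var YourFirstVariable YourSecondVariable EtcVariables;
run;
/* Biplot: plot the clusters on the first two principal components */
/* Prin1 and Prin2 are the default PROC PRINCOMP score names;      */
/* Cluster is the variable PROC FASTCLUS adds to the OUT= data set */
proc gplot data=PrincipalComponents;
  plot Prin2*Prin1=Cluster;
run;
quit;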
The code above does two things. It first calculates the principal components using the "Clusters" data set you made in the PROC FASTCLUS step. You can think of principal components as new variables that are each a unique linear combination of the variables you already have. The way they are created (and this gets too complex to explain easily), the first principal component will "explain" the greatest amount of the variance in the data, the second will explain the second greatest amount of the variance, etc. And importantly, the components are uncorrelated with one another, so they each explain a different part of the variation in the data.
Once you have the principal components, you can plot your clusters on the first two to create a biplot. Try this and see where your clusters end up. If they are well separated and appear to be fairly non-overlapping, then you have a good number of clusters. If they are overlapping or awkwardly mixed, try a different number of clusters. You might have to change the default colors to see the clusters a little more distinctly, using statements like the following (submit them before the PROC GPLOT step):
symbol1 v=plus c=blue;
symbol2 v=plus c=red;
symbol3 v=plus c=green;
If there is not an operational definition for the number of clusters, yes, you have to figure this out yourself. You can use an algorithm to figure it out, but how do you know the algorithm is trading off the # clusters vs. compactness the way you want?
You have to have some idea of what you want, of course, but usually in my consulting engagements where k was unknown we would do the following.
1) interpret the clusters
There are two ways to interpret clusters. First, we compute the mean values of all the input variables to get the gist of where the clusters are centered. (Normalizing the input variables can greatly influence the formation of clusters and these mean values. I have a blog post on this topic here: http://abbottanalytics.blogspot.com/2009/04/why-normalization-matte...)
The second way is to compute how the clusters differ from one another. You can compute the mean values of every variable in the clusters, but it could be that all the variables except one have the same mean for every cluster, so that just one variable is really responsible for driving the formation of the clusters. But how do you easily find these differences, especially when you have perhaps dozens of input variables?
You can eyeball it, but that can be tedious and is easy to get wrong. I prefer to find this algorithmically. How? By using decision trees to predict the cluster label from the same inputs. (After all, the one thing that stood in our way of doing supervised learning in the first place was that we didn't have labels for the data. Now that we have clusters, we have record labels!) The tree doesn't have to be perfect; it just has to capture enough for you to understand the key differences between clusters.
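Since this thread is about SAS, here is a rough sketch of both interpretation steps against the Clusters data set from the PROC FASTCLUS example earlier in the thread. It assumes PROC HPSPLIT is available in your installation (it ships with recent SAS/STAT releases); the depth limit and the growing and pruning choices are arbitrary illustrations, not recommendations.

```sas
/* Step 1: mean of every input variable within each cluster */
proc means data=Clusters mean maxdec=2;
  class Cluster;
  var YourFirstVariable YourSecondVariable EtcVariables;
run;

/* Step 2: a shallow decision tree that predicts the cluster label  */
/* from the same inputs; maxdepth=4 is an arbitrary choice to keep  */
/* the tree small enough to read                                    */
proc hpsplit data=Clusters maxdepth=4;
  class Cluster;
  model Cluster = YourFirstVariable YourSecondVariable EtcVariables;
  grow entropy;
  prune costcomplexity;
run;
```

The variables the tree splits on first are usually the ones that separate the clusters most strongly, which is the gist you are after.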
2) overlay the clusters with another measure of interest
If you have another variable that is important, even one that was not included in the cluster analysis, you can compute its mean value (or IQR) for each cluster to get a sense of what the clusters may mean operationally. For example, if you are computing clusters of customers, you can overlay demographics on top of the clusters (age, income, home value, etc.). Or, when I built fraud-related clusters where we had so few adjudicated fraud cases that we couldn't build supervised learning models, we could still overlay the fraud label, even for the relatively few cases we had, to get a sense of which clusters those fraudulent transactions landed in.
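In SAS terms, an overlay like this might look as follows. Income and FraudFlag are hypothetical variable names used only for illustration; the sketch assumes they were already in the data set passed to PROC FASTCLUS (but left off the VAR statement), so they carry over into the OUT= data set Clusters.

```sas
/* Overlay a numeric measure: location and spread by cluster. */
/* Income is a hypothetical variable name.                    */
proc means data=Clusters n mean median q1 q3 qrange maxdec=2;
  class Cluster;
  var Income;
run;

/* Overlay a sparse label: count the labeled cases per cluster.   */
/* FraudFlag is a hypothetical name; rows where it is missing are */
/* left out of the table by default.                              */
proc freq data=Clusters;
  tables Cluster*FraudFlag / nocol nopercent;
run;
```

QRANGE is the interquartile range, so the first step gives both the mean and the IQR mentioned above.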
So there are huge differences between selecting the number of clusters based on operational concerns vs. numeric concerns. In the former case, cluster compactness and separation may not be the most important aspects to consider (though sometimes they may be, too!).