A Data Science Central Community
Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.
Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?
Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.
Methodology:
This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).
Example:
1 2 3 4 5 6 7 8 9 ==> number of clusters
40 65 80 85 88 90 91 91 ==> percentage of variance explained by clusters
25 15 5 3 2 1 0 ==> 1st difference
-10 -10 -2 -1 -1 -1 ==> 2nd difference
0 8 1 0 0 ==> 3rd difference
The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.
Note:
If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.
Comment
This is reminiscent of a "scree plot," which can be used to estimate the number of principal components. I like graphical methods like this.
There has been a lot of work done on methods for estimating the number of clusters. The "percentage of variance explained by clusters" is the R-square value reported by SAS in the PROC Cluster output. PROC cluster also provides three methods for determining the number of clusters, one of which is the "Pseudo t-squared criterion," which is similar to yours in that it looks at "elbows" in the graph of a quantity. Like you, I've found that this criterion works well on a number of examples. If you want to compare your method, there are several examples in the PROC CLUSTER documentation that you can experiment with.
Nice idea.
# of clusters unveiled. Good you still keep method for selection of right clustering dimensions for yourself.
Otherwise everything would be automated and consultants may only start-up their hot-dog business.
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge