A Data Science Central Community
Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.
Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?
Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.
Methodology:
This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).
Example:
1 2 3 4 5 6 7 8 9 ==> number of clusters
40 65 80 85 88 90 91 91 ==> percentage of variance explained by clusters
25 15 5 3 2 1 0 ==> 1st difference
-10 -10 -2 -1 -1 -1 ==> 2nd difference
0 8 1 0 0 ==> 3rd difference
The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.
Note:
If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.
Comment
So, using the 2nd/3rd/.. derivative for discovering the "elbow" point restricts the search to K >= 3 correct ? and so on..
Exactly how does one calculate the % variance explained by a cluster? Why does it depend on
the shape of the cluster?
Another idea, the plot of 1/RSQ_ratio and RSQ itself can be used to select the optimal number of clusters because RSQ_ratio = Between cluster variance/within cluster variance( RSQ/1-RSQ) and will always increase with the increase in the number of clusters . Inverse of this would be a decreasing curve and RSQ will be an increasing curve. The intersection point of these two curves would give us the optimal number of clusters.
Hi there,
If you are interested about validity index for clustering, I have written 3 posts on my blog about it:
http://www.dataminingblog.com/cluster-validity-introduction-to-clus...
http://www.dataminingblog.com/cluster-validity-clustering-algorithms/
http://www.dataminingblog.com/cluster-validity-existing-indices/
Hope this helps!
See also the gap statistics to determine the number of clusters: http://www.stanford.edu/~hastie/Papers/gap.pdf
Hi Vincent,
I believe that your approach has a missing preamble:
It optimizes the number of the cluster when the clustering method is maximizing the variance among the clusters.
In different scenarios it doesn't work.
Consider the following scenario where we have to clusters:
If you are using for example K-means as clustering algorithm, your method will fail for every number of cluster you try to use! Because the problem is not related to the number of the clusters, but to the function that the clustering is trying to maximize:
using 2 clusters:
using 4 clusters:
and using 8 clusters:
As you can see doesn't exist the right number of clusters, for this problem using the "naive" kmeans.
BTW I've seen for kmeans and density based clustering algo, methods based on EM (expectation and maximizazion) and Bayesian information criterion (BIC) that are a little bit more robust than this method.
...But as I like say: be practical: if your method solve your problem, then your method is good enough :).
just a question:
you plotted the points in red/blu: does it mean that are you looking to find a cluster containing only red points and another points containing only blu points?
Could you share the table of the points...just to play a little bit with them :)
Cheers
cristian
The solution to finding the number of clusters is as follows:
This is the solution to the min-angle rule proposed by Amy. SAS should implement it.
@Sandeep: I think 3rd or 4th derivative is usually not necessary, except in a few rare cases where elbow is barely apparent (and thus clusters not well defined).
I believe that Amy's solution, based on a discrete version of curvature, is even better. And curvature only involves 1st derivative, and the angle (or its sinus) discussed by Amy is also very easy to compute. What do you think?
There is Cubic Clustering Criterion (CCC) available in PROC CLUSTER which helps in deciding the number of clusters.
@Vincent: I am little confused in going up to 4th derivative, since the plot is, when you smooth it, a second degree curve, which will have the second differences constant and the third differences vanish. As per my understanding "We can see the second differences being constant after certain point" and that point would be the no.of clusters".
Please correct me if wrong :)
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge