# AnalyticBridge

A Data Science Central Community

Update: a new, much better version of this article, was published in 2019, here

Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.

Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below? Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.

Methodology:

• create a 2-dim table with the following rows: number of clusters in row #1, and percentage of variance explained by clusters in row #2.
• compute 3rd differences
• maximum for 3rd differences (if much higher than other values) determine number of clusters

This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number). Example:
1   2   3   4   5   6   7   8   9   ==> number of clusters

40  65  80  85  88  90  91  91    ==> percentage of variance explained by clusters

25  15   5   3   2   1   0      ==> 1st difference

-10  -10  -2  -1  -1  -1       ==> 2nd difference

0    8   1   0   0         ==> 3rd difference

The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.

Note:

If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.

Views: 28868

Comment

Join AnalyticBridge Comment by Sergio Casca on May 21, 2015 at 5:30am

So, using the 2nd/3rd/.. derivative for discovering the "elbow" point restricts the search to K >= 3 correct ? and so on.. Comment by ohAodhagain on June 17, 2013 at 10:06pm

Exactly how does one calculate the % variance explained by a cluster?  Why does it depend on

the shape of the cluster? Comment by Anand Singh on October 25, 2012 at 8:34am

Another idea, the plot of 1/RSQ_ratio and RSQ itself can be used to select the optimal number of clusters because RSQ_ratio = Between cluster variance/within cluster variance( RSQ/1-RSQ) and will always increase with the increase in the number of clusters . Inverse of this would be a decreasing curve and RSQ will be an increasing curve. The intersection point of these two curves would give us the optimal number of clusters. Comment by Sandro Saitta on April 13, 2012 at 4:59am

Hi there,

If you are interested about validity index for clustering, I have written 3 posts on my blog about it:

Hope this helps! Comment by Vincent Granville on February 29, 2012 at 9:28pm

See also the gap statistics to determine the number of clusters: http://www.stanford.edu/~hastie/Papers/gap.pdf Comment by Vincent Granville on October 4, 2011 at 9:07pm
@Cristian: agree with you, "percentage of variance explained by clusters" might not be a good criterion depending on the configuration / shapes of the expected clusters. The purpose of this post was to illustrate the elbow detection technique and how it can be automated. In many contexts, you will need to use a different curve (not the "percentage of variance explained by clusters"), but you would still use the same automated technique for elbow detection. Comment by Cristian Mesiano on October 4, 2011 at 10:54am

Hi Vincent,

I believe that your approach has a missing preamble:

It optimizes the number of the cluster when the clustering method is maximizing the variance among the clusters.

In different scenarios it doesn't work.

Consider the following scenario where we have to clusters: If you are using for example  K-means as clustering algorithm, your method will fail for every number of cluster you try to use! Because the problem is not related to the number of the clusters, but to the function that the clustering is trying to maximize:

using 2 clusters: using 4 clusters:

and using 8 clusters: As you can see doesn't exist the right number of clusters, for this problem using the "naive" kmeans.

BTW I've seen for kmeans and density based clustering algo, methods based on EM (expectation and maximizazion) and Bayesian information criterion (BIC) that are a little bit more robust than this method.

...But as I like say: be practical: if your method solve your problem, then your method is good enough :).

just a question:

you plotted the points in red/blu: does it mean that are you looking to find a cluster containing only red points and another points containing only blu points?

Could you share the table of the points...just to play a little bit with them :)

Cheers

cristian Comment by Capri on October 3, 2011 at 2:21pm

The solution to finding the number of clusters is as follows:

• Let's f(k) be the the percentage of variance explained by k clusters
• Compute g(k) = arctan[f(k+1)-f(k)] + arctan[f(k)-f(k-1)]
• The number k that minimizes g(k) is the optimum number of clusters

This is the solution to the min-angle rule proposed by Amy. SAS should implement it. Comment by Vincent Granville on October 3, 2011 at 10:36am

@Sandeep: I think 3rd or 4th derivative is usually not necessary, except in a few rare cases where elbow is barely apparent (and thus clusters not well defined).

I believe that Amy's solution, based on a discrete version of curvature, is even better. And curvature only involves 1st derivative, and the angle (or its sinus) discussed by Amy is also very easy to compute. What do you think? Comment by Sandeep Sunkara on October 3, 2011 at 9:32am

There is Cubic Clustering Criterion (CCC) available in PROC CLUSTER which helps in deciding the number of clusters.

@Vincent: I am little confused in going up to 4th derivative, since the plot is, when you smooth it, a second degree curve, which will have the second differences constant and the third differences vanish. As per my understanding "We can see the second differences being constant after certain point" and that point would be the no.of clusters".

Please correct me if wrong :)