# AnalyticBridge

A Data Science Central Community

Subscribe to DSC Newsletter

# Identifying the number of clusters: finally a solution

Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.

Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?

Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.

Methodology:

• create a 2-dim table with the following rows: number of clusters in row #1, and percentage of variance explained by clusters in row #2.
• compute 3rd differences
• maximum for 3rd differences (if much higher than other values) determine number of clusters

This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).

Example:
1   2   3   4   5   6   7   8   9   ==> number of clusters

40  65  80  85  88  90  91  91    ==> percentage of variance explained by clusters

25  15   5   3   2   1   0      ==> 1st difference

-10  -10  -2  -1  -1  -1       ==> 2nd difference

0    8   1   0   0         ==> 3rd difference

The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.

Note:

If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.

Views: 26162

Comment

Join AnalyticBridge

Comment by Rick Wicklin on October 3, 2011 at 7:40am

This is reminiscent of a "scree plot," which can be used to estimate the number of principal components. I like graphical methods like this.

There has been a lot of work done on methods for estimating the number of clusters. The "percentage of variance explained by clusters" is the R-square value reported by SAS in the PROC Cluster output. PROC cluster also provides three methods for determining the number of clusters, one of which is the "Pseudo t-squared criterion," which is similar to yours in that it looks at "elbows" in the graph of a quantity.  Like you, I've found that this criterion works well on a number of examples. If you want to compare your method, there are several examples in the PROC CLUSTER documentation that you can experiment with.

Comment by Jozo Kovac on October 3, 2011 at 3:54am

Nice idea.

# of clusters unveiled. Good you still keep method for selection of right clustering dimensions for yourself.

Otherwise everything would be automated and consultants may only start-up their hot-dog business.

## On Data Science Central

© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC   Powered by