# AnalyticBridge

A Data Science Central Community

Let's say that you have a large number n of elements a, b, c, etc., and you want to group them into clusters. Each cluster is expected to contain only a few elements, say O(1).

You have a similarity metric d(a, b) to compare any two elements a and b. You also have a list of all pairs where d(a, b) > threshold, or, in other words, all pairs (a, b) where a and b belong to the same cluster. The n x n similarity matrix is far too big to be stored in memory (e.g. n = 5 million), but it is extremely sparse.
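For a small toy example, the input pair list can be produced by filtering on the threshold (the names `connected_pairs`, `d`, and `threshold` are illustrative; at the scale described above, the pairs would come precomputed from a sparse pipeline, since enumerating all n² pairs is exactly what we want to avoid):

```python
from itertools import combinations

def connected_pairs(elements, d, threshold):
    """Keep only the pairs whose similarity exceeds the threshold.

    Toy sketch only: this enumerates all pairs, which is fine for a
    small example but not for n = 5 million elements.
    """
    return [(a, b) for a, b in combinations(elements, 2) if d(a, b) > threshold]
```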

I propose the following clustering algorithm, with the following loop over all pairs of connected elements:

1. At a particular iteration, as we sequentially process pairs of connected elements one at a time, let's say that {a, b, c} and {k, l, m, n} are two clusters, among many others.
2. We find the pair {b, n}, meaning that b and n belong to the same cluster. We thus must update (merge) the two clusters above.
3. We use doubly circular linked lists to store the cluster data structure, with the following pointers:
• Next(a) = b, Next(b) = c, Next(c) = a
• Previous(a) = c, Previous(c) = b, Previous(b) = a
• Next(k) = l, Next(l) = m, Next(m) = n, Next(n) = k
• Previous(k) = n, Previous(n) = m, Previous(m) = l, Previous(l) = k
4. Check whether b and n are already attached to the same cluster by following the successors of b until you return to b. If n is not encountered, then b and n are not yet attached. Note that if either b or n has no successor (by far the most frequent case), then b and n are not yet attached to the same cluster. If b and n are not yet attached to the same cluster (as in step 1), then perform step 5 below:
5. Perform the following update: merge the cluster containing b with the cluster containing n (only if b and n are not already in the same cluster):
• Replace Next(b) = c with Next(b) = n
• Replace Next(m) = n with Next(m) = c [note: m is identified as Previous(n), c is identified as Next(b)]
• Replace Previous(n) = m with Previous(n) = b
• Replace Previous(c) = b with Previous(c) = m [note: c is identified as Next(b), m is identified as Previous(n)]
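The steps above can be sketched in Python, with Next(*) and Previous(*) as hash tables (dicts). The names `Next`, `Previous`, `same_cluster`, `merge`, and `add_pair` are illustrative, not from the original post; elements not yet seen are initialized as one-element circular lists so the pointer updates of step 5 apply uniformly:

```python
Next = {}      # successor pointers of the doubly circular linked lists
Previous = {}  # predecessor pointers

def same_cluster(b, n):
    """Step 4: follow the successors of b until we return to b; True if n is found."""
    if b not in Next or n not in Next:
        return False  # an element with no successor is still unattached
    x = Next[b]
    while x != b:
        if x == n:
            return True
        x = Next[x]
    return False

def merge(b, n):
    """Step 5: splice the cluster containing n into the cluster containing b."""
    c = Next[b]       # c is identified as Next(b)
    m = Previous[n]   # m is identified as Previous(n)
    Next[b] = n       # replace Next(b) = c with Next(b) = n
    Next[m] = c       # replace Next(m) = n with Next(m) = c
    Previous[n] = b   # replace Previous(n) = m with Previous(n) = b
    Previous[c] = m   # replace Previous(c) = b with Previous(c) = m

def add_pair(b, n):
    """Process one connected pair (b, n) from the input list."""
    for x in (b, n):
        if x not in Next:       # first time we see x: make it a
            Next[x] = x         # one-element circular list
            Previous[x] = x
    if not same_cluster(b, n):
        merge(b, n)
```

Processing the pairs (a, b), (b, c), (k, l), (l, m), (m, n) builds the two clusters of step 1, and the pair (b, n) then merges them into a single seven-element circle.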

If the Previous(*) and Next(*) functions are implemented as hash tables rather than arrays (due to the large number of potential unique values a, b, c, etc.), then the computational complexity becomes O(n log n) instead of O(n).

Now, once we've processed all pairs of elements (there are O(n) such pairs), we have actually produced all the clusters. One additional step could consist, for each element a, b, c, etc., of finding its cluster by following the Next(*) pointers, then ordering the elements within each cluster to create ordered clusters, and finally counting the distinct clusters. Et voila! Problem solved.
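This final pass can be sketched as follows (the helper name `clusters` is illustrative): each unvisited element seeds a walk along the Next(*) pointers, which enumerates its whole cluster exactly once, and the number of distinct clusters is simply the number of walks:

```python
def clusters(Next):
    """Enumerate the final clusters by following the Next(*) pointers.

    Next maps each element to its successor in a circular linked list;
    returns one sorted list of members per distinct cluster.
    """
    seen = set()
    result = []
    for start in Next:
        if start in seen:
            continue
        members = [start]
        seen.add(start)
        x = Next[start]
        while x != start:       # walk the circle back to the start
            members.append(x)
            seen.add(x)
            x = Next[x]
        result.append(sorted(members))  # ordered clusters
    return result
```

The count of distinct clusters is then just the length of the returned list.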

Comment by Richard on May 31, 2011 at 1:16pm

From a practical viewpoint, your algorithm will be O(n log n) in most situations. It will only be O(n) in rare instances such as:

• n < 1 million
• The set of potential values for a, b, c, etc. is bounded and quite small, e.g. 2 million potential values, so that they can fit in an array structure rather than being accessed through hashing. If a, b, c are words, and the problem is about finding associations between words, this means that word length must be very short to meet the O(n) requirement.