# AnalyticBridge

A Data Science Central Community

Let's say you have to cluster 10 million points, for instance keywords. You have a dissimilarity function, available as a text file with 100,000,000 entries, each entry consisting of three data points:

Keyword A, Keyword B, distance between A and B denoted as d(A,B)

So, in short, you can perform k-NN (k-nearest neighbors) clustering or some other types of clustering, which typically is O(n^2) or worse, from a computational complexity point of view. Has anyone ever used a clustering method based on sampling?

The idea is to start by sampling 1% (or less) of the 100,000,000 entries, and perform clustering on these pairs of keywords, to create a "seed" or "baseline" cluster structure.

The next step is to browse sequentially your 10,000,000 keywords, and for each keyword, find the closest cluster from the baseline cluster structure. If there are 1,000 base clusters, you'll have to perform 10 billion (10,000,000 x 1,000) computations (hash table accesses), a rather large number, but it can easily be done with a distributed architecture (Hadoop).

You could indeed repeat the whole procedure 5 times (with 5 different samples), and blend the 5 different final cluster structures that you obtain. The final output might be a table with 30,000,000 entries, where each entry consist of (1) is a pair of keywords A and B, and (2) the number of times (if >0) when A and B belong to a same cluster (based on the 5 iterations corresponding to 5 different samples). The final cluster detection algorithm consist in extracting the connected components from these 30,000,000 entries: these entries, from a graph theory point of view, are called ridges, joining two nodes A and B (here a node is a keyword).

What are the conditions for such an algorithm to work? You can assume that each keyword A has on average between 2 and 50 neighboring keywords B such that d(A,B) > 0. The number of such neighbors is usually much closer to 2, than to 50. So we might end up, after classifying all keywords, with 10 or 20% of all keywords left isolated (forming single-point clusters).

How do you address this issue, without using too much computer power? Of course if you work with 50 rather than 5 samples, or samples that represents 25% of the data rather than 1%, you'll solve the problem, but this is a time-prohibitive approach that very poorly and inefficiently leverage statistical sampling. How to do better?

Related article

Views: 21036

### Replies to This Discussion

I suggest to consider priority to areas which has much density of nodes. Then assume higher chance to those areas for sampleing. This would reduce the clustering and sampeling effeciency. For me, it worked with a Genetic Algorithm.

Essentially it seems that you are constructing a graph using seed nodes and connecting the other nodes one-at-a-time depending on their distance, similar to a preferential attachment graph growing model.  Very efficient!  If you are worried about the outliers, or the dependence of your algorithm on the seed nodes, or other such problems, I would keep all of the d(a,b) information from your iterations with various seeds, combine this, and then run graph algorithms similar to those use to find potential inferred connections, similar to how LinkedIn and Facebook predict one's possible friends based on clustering coefficients (friends-of-friends are likely my friends).

I'd have to check the computational complexity of calculating clustering coefficients on large sparse networks,  but I'm guessing it would be less than O(n^2) and would get you the accuracy that the method you propose is lacking.

Hello Vincent,

In a working paper from 2015 I exploit a sampling technique to derive an exact algorithm for the minimal diameter clustering problem. When compared to complete linkage our algorithm performs substantially better. One on the main characteristics of our algorithm (besides being exact) is to drop the necessity of loading the dissimilarity matrix in RAM. Our method consumes then very limited RAM (around 100mb in our experiments with problems of ~500k objects). You can see our tech report at the following address : https://www.gerad.ca/en/papers/G-2015-140