A Data Science Central Community
Let's say that you have a large number n of elements a, b, c, etc. and you want to group them into clusters. Each cluster is supposed to contain few elements, say O(1).
You have one similarity metric d(a,b) to compare any two elements a, b. Also, you have a list of all pairs where d(a,b) > threshold, or in other words, all pairs (a,b) where a and b belong to the same cluster. The n x n similarity matrix is way too big to be stored in memory (e.g. n = 5 million), but it is extremely sparse.
I propose the following clustering algorithm, with the following loop over all pairs of connected elements:
If the Previous(*) and Next(*) functions are implemented as hash tables rather than arrays (due to the large number of potential unique values a, b, c, etc.), then the computational complexity becomes O(n log n) instead of O(n).
Now, when we've processed all pairs of elements (there is O(n) such pairs), we have actually produced all the clusters. One additional step could consist, for each element a, b, c etc. to find its cluster, by following the Next(*) pointers, and then order the elements in each cluster, to create ordered clusters.Then count distinct clusters. Et voila! Problem solved.
Comment
From a practical viewpoint, your algorithm will be O(n log n) in most situations. The only times it will be O(n) is in rare instances such as:
© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
Most Popular Content on DSC
To not miss this type of content in the future, subscribe to our newsletter.
Other popular resources
Archives: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More
Most popular articles
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge