Clustering idea for very large datasets - AnalyticBridge
https://www.analyticbridge.datasciencecentral.com/forum/topics/clustering-idea-for-very-large-datasets?feed=yes&xn_auth=no (feed updated 2019-03-25T09:32:15Z)

Claudio Contardo (https://www.analyticbridge.datasciencecentral.com/profile/ClaudioContardo), 2017-07-14T00:43:35.974Z:
Hello Vincent,<br />
<br />
In a 2015 working paper I exploit a sampling technique to derive an exact algorithm for the minimal diameter clustering problem. Compared with complete linkage, our algorithm performs substantially better. One of its main characteristics (besides being exact) is that it removes the need to load the dissimilarity matrix into RAM. As a result, our method consumes very little RAM (around 100 MB in our experiments with problems of ~500k objects). You can see our tech report at the following address: <a href="https://www.gerad.ca/en/papers/G-2015-140" target="_blank">https://www.gerad.ca/en/papers/G-2015-140</a>

Glenn Strycker (https://www.analyticbridge.datasciencecentral.com/profile/GlennStrycker), 2014-04-08T16:47:22.556Z:
<p>Essentially, it seems that you are constructing a graph using seed nodes and connecting the other nodes one at a time depending on their distance, similar to a preferential attachment graph growth model. Very efficient! If you are worried about outliers, the dependence of your algorithm on the seed nodes, or other such problems, I would keep all of the d(a,b) information from your iterations with various seeds, combine it, and then run graph algorithms like those used to find potential inferred connections, in the way LinkedIn and Facebook predict one's possible friends based on clustering coefficients (friends of friends are likely my friends).</p>
<p></p>
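<p>As a rough illustration of the idea above, the sketch below merges the links found in several seeded runs into one undirected graph, computes a local clustering coefficient, and scores unlinked pairs by their number of common neighbours (a simple friends-of-friends measure). The function names, the edge-list input format, and the common-neighbour score are assumptions made for this example; they are not taken from the original method or from any social network's actual algorithm.</p>

```python
from collections import defaultdict
from itertools import combinations


def build_graph(runs):
    """Merge the edge lists from several seeded runs into one undirected graph.

    `runs` is assumed to be an iterable of edge lists, one per seed choice.
    """
    adj = defaultdict(set)
    for edges in runs:
        for a, b in edges:
            if a != b:  # ignore self-loops
                adj[a].add(b)
                adj[b].add(a)
    return dict(adj)


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def friend_of_friend_scores(adj):
    """Count common neighbours for every pair of nodes that is not yet linked."""
    scores = defaultdict(int)
    for nbrs in adj.values():
        # Node labels are assumed to be sortable so each pair has one key.
        for u, v in combinations(sorted(nbrs), 2):
            if v not in adj[u]:
                scores[(u, v)] += 1
    return dict(scores)
```

<p>Both functions loop over pairs of neighbours, so the work grows with the sum of squared node degrees rather than with the square of the number of nodes. On a sparse graph with bounded degrees this stays well below an all-pairs comparison.</p>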
<p>I'd have to check the computational complexity of calculating clustering coefficients on large sparse networks, but I'm guessing it would be less than O(n^2), and it would get you the accuracy that the method you propose is lacking.</p>

Farab Alipanah (https://www.analyticbridge.datasciencecentral.com/profile/FarabAlipanah), 2013-09-25T14:34:37.873Z:
<p>I suggest giving priority to areas with a high density of nodes, and then assigning those areas a higher chance of being sampled. This should improve clustering and sampling efficiency. For me, it worked with a Genetic Algorithm.</p>
<p></p>
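<p>A minimal sketch of density-weighted sampling, under stated assumptions: points are 2-D tuples, density is approximated by counting points per square grid cell, and each point is drawn with probability proportional to the count of its cell. The grid estimate, the function names, and the parameters <code>cell_size</code> and <code>seed</code> are hypothetical choices for illustration; the Genetic Algorithm mentioned above is not reproduced here.</p>

```python
import random
from collections import Counter


def cell_of(point, cell_size):
    """Map a 2-D point to the integer grid cell that contains it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def density_weighted_sample(points, k, cell_size=1.0, seed=0):
    """Draw k points (with replacement), favouring points in crowded cells."""
    # Crude density estimate: number of points falling in each grid cell.
    counts = Counter(cell_of(p, cell_size) for p in points)
    # Each point is weighted by the population of its own cell.
    weights = [counts[cell_of(p, cell_size)] for p in points]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.choices(points, weights=weights, k=k)
```

<p>Because every point in a cell carries the cell's count as its weight, a cell's total chance of being sampled grows with the square of its population, which strongly favours dense areas. Weighting by a gentler function of the count would soften that bias.</p>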