K-means, SOM, k-nn or classical clustering methods? - AnalyticBridge
https://www.analyticbridge.datasciencecentral.com/forum/topics/kmeans-som-knn-or-classical

Urko (2011-01-27):
Thank you all for replying to my questions. I am working on it. I think your opinions have given me a clue, but we can continue discussing.

Kevin Gray (2010-08-24):
For the kinds of segmentations I most often do, latent class is usually most appropriate (though it can be a headache). For k-means, I find CCEA (Convergent Cluster & Ensemble Analysis) from Sawtooth Software very useful (<a href="http://www.sawtoothsoftware.com/products/cca/" target="_blank">http://www.sawtoothsoftware.com/products/cca/</a>).

Jochen Klapheck (2010-08-08):
Hi Urko,<br />
<br />
I guess the most important point is to have an idea of what kinds of types/clusters you need. That means you had better have a hypothesis about your data. In the end you have to describe your clusters with clear labels.<br />
<br />
The advantage of the "classical" methods like k-means is, in most cases, the ease of interpretation. But if your data are correlated in some way, every clustering method will find a solution.<br />
<br />
I would test 2 or 3 different methods (hierarchical such as Ward's method, k-means, SOM, ...) and compare their results. This can be a good validation of your results.
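
One possible way to compare the results of two methods is to measure how far their cluster assignments agree. The sketch below uses the adjusted Rand index, which is my choice of measure and not something named in the post; the label lists are hypothetical and only illustrate the call.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two partitions of the same observations.

    1.0 means identical groupings (the label names do not matter);
    values near 0.0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    n = len(labels_a)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_a, labels_b))   # contingency table cells
    size_a = Counter(labels_a)                 # cluster sizes, method A
    size_b = Counter(labels_b)                 # cluster sizes, method B

    # Count pairs of observations placed together by both, by A, and by B.
    same_both = sum(comb(c, 2) for c in cells.values())
    same_a = sum(comb(c, 2) for c in size_a.values())
    same_b = sum(comb(c, 2) for c in size_b.values())

    expected = same_a * same_b / comb(n, 2)    # agreement expected by chance
    max_index = (same_a + same_b) / 2
    if max_index == expected:                  # degenerate partitions
        return 1.0
    return (same_both - expected) / (max_index - expected)


# Hypothetical labels from three methods on the same six observations.
ward_labels = [0, 0, 0, 1, 1, 1]
kmeans_labels = ["b", "b", "b", "a", "a", "a"]   # same groups, other names
som_labels = [0, 0, 1, 1, 2, 2]                  # a different grouping

print(adjusted_rand_index(ward_labels, kmeans_labels))
print(adjusted_rand_index(ward_labels, som_labels))
```

A high index between two different methods suggests the segments are stable; a low one suggests the structure depends on the method and deserves a closer look.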

Jozo Kovac (2010-03-11):
Hi Urko,<br />
what are your expectations for the segmentation? What results would be acceptable to you?<br />
Don't think about methods for a moment; focus only on the outputs you want.

Tomas Keller (formerly Ohlson) (2010-03-11):
Hi Urko,<br />
I successfully used SOMs in a research project some years ago. At the time I frequently used NNs, so it was a natural step to start using SOMs as a clustering algorithm. There are a couple of parameters you can optimize, but I found it appropriate to map the data onto a 2D map of size m*n where m is less than n (i.e. a rectangular map instead of a square m*m map; the same idea also applied to 3D maps). So, depending on your data (and your deadlines), try the methods you have access to and decide which is best after the evaluation step.<br />
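
A rectangular m*n map of the kind described above might be sketched as follows. This is a minimal illustration and not the setup used in that research project: the Euclidean distance, the Gaussian neighbourhood, the linear decay schedules, and the names `train_som` and `best_matching_unit` are all assumptions made for the example.

```python
import math
import random


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols self-organizing map on a list of numeric vectors."""
    rng = random.Random(seed)
    # One weight vector per map unit, initialised from random observations.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks
        for x in rng.sample(data, len(data)):       # random visiting order
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Units close to the winner on the grid move more.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weights are closest to x."""
    return min(weights, key=lambda unit: math.dist(weights[unit], x))


# Hypothetical data: two loose groups in two dimensions.
data = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
        [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
som = train_som(data, rows=2, cols=3)   # m = 2 is less than n = 3
for x in data:
    print(x, "->", best_matching_unit(som, x))
```

Each observation is assigned to the map unit it lands on, so units (or groups of neighbouring units) play the role of clusters.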
<br />
It would be interesting if you could share some conclusions with us.<br />
<br />
<br />
<br />
Tomas

Steffen Springer (2010-03-11):
<i>The best-known optimization clustering algorithm is k-means clustering</i><br />
<br />
"Best-known" in the sense of "commonly used"? I agree.<br />
"Best-known" in the sense of "best algorithm as far as we know"? That statement would be too strong to be true. But I guess you mean the former interpretation :)<br />
<br />
I want to add:<br />
1. k-nn is not a clustering algorithm; it is a classification algorithm, unless you are referring to the linkage algorithms.<br />
2. k-means is a specialized version of a neural net / SOM.<br />
3. Another nearly parameterless option could be Autoclass, although I have no practical experience with this algorithm. Another (less complicated?) option could be EM clustering with k-means as a pre-step (see for example the implementation within RapidMiner).<br />
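
To make the idea of EM clustering with k-means as a pre-step concrete, here is a minimal one-dimensional sketch. It is not RapidMiner's implementation: the Gaussian mixture model, the evenly spaced starting centres, the variance floor `min_var`, and all function names are assumptions chosen for the illustration.

```python
import math


def kmeans_1d(xs, means, iters=20):
    """Plain 1-D k-means, used here only to initialise EM."""
    for _ in range(iters):
        groups = [[] for _ in means]
        for x in xs:
            nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
            groups[nearest].append(x)
        # An empty group keeps its previous centre.
        means = [sum(g) / len(g) if g else m for g, m in zip(groups, means)]
    return means


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture_1d(xs, k=2, iters=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from k-means centres."""
    lo, hi = min(xs), max(xs)
    # Assumed seeding: k evenly spaced values inside the data range.
    start = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    means = kmeans_1d(xs, start)                      # the k-means pre-step
    grand_mean = sum(xs) / len(xs)
    overall_var = max(sum((x - grand_mean) ** 2 for x in xs) / len(xs), min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft membership of every observation in every component.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate each component from its weighted members.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


# Hypothetical data: two well-separated groups.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_gaussian_mixture_1d(xs, k=2)
print(weights, means, variances)
```

Unlike k-means, the result is a soft assignment: each observation gets a membership probability for every cluster instead of a single label.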
<br />
EM clustering / Autoclass have the advantage that they do not require a metric (k-means does). The choice of the metric is naturally the most critical step when it comes to clustering: you have to decide what is similar and what is not.<br />
<br />
SOM does not require a metric, but has trouble dealing with categorical values. On the other hand, SOM provides a projection / visualization of the data (see for example the esom project).<br />
<br />
my cents

Oleg Danilchenko (2010-03-10):
The best-known optimization clustering algorithm is k-means clustering. Unlike hierarchical clustering methods, which require processing time proportional to the square or cube of the number of observations, the time required by the k-means algorithm is proportional to the number of observations. This means that k-means clustering can be used on larger data sets. In fact, k-means clustering is inappropriate for small (< 100 observations) data sets: if the data set is small, the k-means solution becomes sensitive to the order in which the observations appear (the order effect).<br />
1. A set of points known as seeds is selected as a first guess of the means of the final clusters. These seeds are typically selected from the sample data.<br />
2. Each observation is assigned to the nearest seed, forming temporary clusters. The seeds are then replaced by the means of the temporary clusters, and the process is repeated until no significant change occurs in the position of the cluster means.<br />
3. The final clusters are formed by assigning each observation to its nearest centroid.<br />
The first two steps are just the “hill-climbing” heuristic search algorithm. Step 3 is an extra iteration that performs the final cluster membership assignments.<br />
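
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: Euclidean distance, seeds passed in by the caller (here simply the first two observations), and a convergence tolerance `tol` of my own choosing. The function names are hypothetical.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def kmeans(points, seeds, max_iter=100, tol=1e-9):
    """Return (centroids, labels) following the three steps in the text."""
    centroids = [list(s) for s in seeds]              # step 1: initial seeds
    for _ in range(max_iter):                         # step 2: assign, update
        labels = assign(points, centroids)
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(old)             # empty cluster keeps its seed
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:                              # no significant change
            break
    return centroids, assign(points, centroids)       # step 3: final assignment


# Hypothetical data: two groups of three observations each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, seeds=points[:2])
print(labels)      # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Because the seeds here are just the first observations, reordering a small data set can change the starting point and therefore the solution, which is the order effect mentioned above.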
<br />
<br />
The FASTCLUS procedure selects the first complete observation as its initial seed.<br />
Under the default REPLACE=FULL method, two further tests are performed:<br />
1. An old seed is replaced if the distance between the observation and the closest<br />
seed is greater than the minimum distance between seeds.<br />
If the observation fails the first test, PROC FASTCLUS goes on to the second test.<br />
2. The observation replaces the nearest seed if the smallest distance from the<br />
observation to all seeds other than the nearest one is greater than the shortest<br />
distance from the nearest seed to all other seeds.<br />
If the observation fails the second test, the next complete observation is considered.<br />
You can omit the second test for seed replacement (REPLACE=PART), causing the<br />
FASTCLUS procedure to run faster. But the seeds that are selected may not be as<br />
widely separated as those obtained by the default method.<br />
You can even suppress replacement entirely by specifying REPLACE=NONE, but<br />
you must choose a good value for the RADIUS= option to get good clusters.<br />
Finally, REPLACE=RANDOM specifies that a simple pseudo-random sample of<br />
complete observations is to be selected as the initial seeds.
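
The two replacement tests can be illustrated with a rough Python sketch. This is not SAS code and not the FASTCLUS implementation. Two parts are assumptions of mine and are marked as such in the comments: how the seed list grows until `k` seeds exist, and which of the two closest seeds is dropped when the first test succeeds. The names `select_seeds` and `closest_pair` are hypothetical.

```python
import math


def closest_pair(seeds):
    """Return ((i, j), gap) for the two seeds that are closest together."""
    gap, i, j = min((math.dist(seeds[i], seeds[j]), i, j)
                    for i in range(len(seeds))
                    for j in range(i + 1, len(seeds)))
    return (i, j), gap


def select_seeds(observations, k, radius=0.0):
    """Sketch of seed selection with both replacement tests applied."""
    if k < 2:
        raise ValueError("this sketch needs at least two seeds")
    seeds = []
    for obs in observations:
        if not seeds:
            seeds.append(obs)            # first complete observation
            continue
        d_obs = [math.dist(obs, s) for s in seeds]
        nearest = min(range(len(seeds)), key=d_obs.__getitem__)
        if len(seeds) < k:
            # Assumption: until k seeds exist, accept any observation that
            # lies farther than `radius` from every current seed.
            if d_obs[nearest] > radius:
                seeds.append(obs)
            continue
        # Test 1: the observation is farther from its closest seed than the
        # two closest seeds are from each other.
        (i, j), min_gap = closest_pair(seeds)
        if d_obs[nearest] > min_gap:
            # Assumption: drop whichever of the closest pair is nearer to
            # the observation (a simplified choice, not taken from the text).
            drop = i if d_obs[i] <= d_obs[j] else j
            seeds[drop] = obs
            continue
        # Test 2: compare the observation's smallest distance to the other
        # seeds with the nearest seed's smallest distance to the other seeds.
        obs_to_others = min(d for idx, d in enumerate(d_obs) if idx != nearest)
        seed_to_others = min(math.dist(seeds[nearest], s)
                             for idx, s in enumerate(seeds) if idx != nearest)
        if obs_to_others > seed_to_others:
            seeds[nearest] = obs         # observation replaces nearest seed
    return seeds


# Hypothetical one-dimensional observations.
print(select_seeds([(0.0,), (1.0,), (10.0,), (0.5,)], k=2))
print(select_seeds([(0.0,), (10.0,), (-1.0,)], k=2))
```

In the first call the observation at 10.0 passes the first test and displaces a seed; in the second call the observation at -1.0 fails the first test but passes the second, so it replaces its nearest seed. Skipping the second test corresponds to the faster but less widely separated behaviour described for REPLACE=PART.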