
What is your opinion of k-means, self-organizing maps, and k-NN as clustering methods? I am involved in a project and I would like to know the pros and cons of these methods. Are they better than the classical clustering methods? I would like to work in an unsupervised way.

Thank you!


Replies to This Discussion

The best-known optimization clustering algorithm is k-means clustering. Unlike
hierarchical clustering methods, which require processing time proportional to the
square or cube of the number of observations, the time required by the k-means
algorithm is proportional to the number of observations. This means that k-means
clustering can be used on larger data sets. Conversely, k-means clustering is
inappropriate for small (< 100 observations) data sets: if the data set is small, the
k-means solution becomes sensitive to the order in which the observations appear
(the order effect). The algorithm proceeds in three steps:
1. A set of points known as seeds is selected as a first guess of the means of the
final clusters. These seeds are typically selected from the sample data.
2. Each observation is assigned to the nearest seed, forming temporary clusters. The
seeds are then replaced by the means of the temporary clusters, and the process is
repeated until no significant change occurs in the positions of the cluster means.
3. Form the final clusters by assigning each observation to its nearest centroid.
The first two steps are just the “hill-climbing” heuristic search algorithm. Step 3 is an
extra iteration that performs the final cluster membership assignments.
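
To make the three steps concrete, here is a minimal NumPy sketch of the loop. It is an illustration only, not PROC FASTCLUS or any particular implementation; the data matrix X, the number of clusters k, and the convergence tolerance are placeholders, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    rng = np.random.default_rng(seed)
    # Step 1: select k observations from the sample as initial seeds.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest seed ...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... and replace the seeds by the means of the temporary clusters.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:  # no significant change
            break
        centers = new_centers
    # Step 3: final assignment of each observation to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers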


The FASTCLUS procedure selects the first complete observation as its initial seed.
Under the default REPLACE=FULL method, two further tests are performed:
1. An old seed is replaced if the distance between the observation and the closest
seed is greater than the minimum distance between seeds.
If the observation fails the first test, PROC FASTCLUS goes on to the second test.
2. The observation replaces the nearest seed if the smallest distance from the
observation to all seeds other than the nearest one is greater than the shortest
distance from the nearest seed to all other seeds.
If the observation fails the second test, the next complete observation is considered.
You can omit the second test for seed replacement (REPLACE=PART), causing the
FASTCLUS procedure to run faster. But the seeds that are selected may not be as
widely separated as those obtained by the default method.
You can even suppress replacement entirely by specifying REPLACE=NONE, but
you must choose a good value for the RADIUS= option to get good clusters.
Finally, REPLACE=RANDOM specifies that a simple pseudo-random sample of
complete observations is to be selected as the initial seeds.
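
For readers who want to see the two tests concretely, here is a rough Python sketch of the REPLACE=FULL logic described above. It is an illustration, not PROC FASTCLUS itself; in particular, the rule for choosing which seed of the closest pair to drop in test 1 is simplified here:

import numpy as np

def consider_replacement(seeds, x):
    # Distances from the candidate observation x to every current seed.
    d_to_seeds = np.linalg.norm(seeds - x, axis=1)
    nearest = d_to_seeds.argmin()
    # Pairwise distances between the current seeds (diagonal excluded).
    pair = np.linalg.norm(seeds[:, None] - seeds[None, :], axis=2)
    np.fill_diagonal(pair, np.inf)
    # Test 1: x is farther from its closest seed than the two closest
    # seeds are from each other, so one of that closest pair is replaced.
    if d_to_seeds[nearest] > pair.min():
        i, _ = np.unravel_index(pair.argmin(), pair.shape)
        seeds = seeds.copy()
        seeds[i] = x  # simplified choice within the closest pair
        return seeds
    # Test 2: x replaces its nearest seed if x's smallest distance to all
    # other seeds exceeds the nearest seed's smallest distance to the rest.
    if np.delete(d_to_seeds, nearest).min() > pair[nearest].min():
        seeds = seeds.copy()
        seeds[nearest] = x
    return seeds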
"The best-known optimization clustering algorithm is k-means clustering"

"Best-known" in the sense of "commonly used" ? I agree.
"Best-known" in the sense of "best algorithm as far as we know" ? This statement would be too strong to be true. But I guess you mean the former interpretation :)

I want to add:
1. k-NN is not a clustering method; it is a classification algorithm (unless you are referring to the linkage algorithms). See the short sketch after this list.
2. k-means is a specialized version of a neural net / SOM.
3. Another nearly parameter-free option could be AutoClass, although I cannot provide any practical experience with this algorithm. Another (perhaps less complicated?) option could be EM clustering with k-means as a pre-step (see, for example, the implementation within RapidMiner).
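
To illustrate point 1, here is a small scikit-learn contrast (placeholder data, not from the project discussed here) showing that k-NN needs labels while k-means does not:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: k-NN must be fitted with class labels y.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Unsupervised: k-means clusters X without any labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(knn.predict(X[:5]), km.labels_[:5])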

EM clustering / AutoClass have the advantage that they do not require an explicit distance metric (k-means does). The choice of metric is naturally the most critical step when it comes to clustering: you have to decide what is similar and what is not.
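
As a hedged sketch of the EM-with-k-means-pre-step idea (analogous to, but not, the RapidMiner implementation mentioned above), scikit-learn's GaussianMixture can seed its EM iterations from a k-means solution via init_params="kmeans":

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # placeholder data

# EM fitting of a Gaussian mixture, initialized from a k-means pre-step.
gmm = GaussianMixture(n_components=3, init_params="kmeans", random_state=0)
labels = gmm.fit_predict(X)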

A SOM does not require a metric either, but it has trouble dealing with categorical values. On the other hand, a SOM provides a projection / visualization of the data (see, for example, the ESOM project).

My two cents.
Hi Urko,
I successfully used SOMs in a research project some years ago. At the time I frequently used NNs, so it was a natural step to start using SOMs as a clustering algorithm. There are a couple of parameters you can optimize, but I found it appropriate to map the data onto a 2D map of size m*n where m is less than n (i.e., a rectangular map instead of a square m*m map; the same idea also applies to 3D maps). So, depending on your data (and your deadlines), try the methods you have access to, and after the evaluation step decide which is best.
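
A minimal sketch of the rectangular-map idea, using the MiniSom package (one of several SOM implementations, not necessarily the one used in that project; the data and map size are placeholders):

import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 4))  # placeholder data with 4 features

m, n = 6, 10  # rectangular 2D map with m less than n
som = MiniSom(m, n, input_len=4, sigma=1.0, learning_rate=0.5)
som.train_random(data, num_iteration=1000)

# Each observation is "clustered" by the map node that wins it.
cells = [som.winner(x) for x in data]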

It would be interesting if you could share some conclusions with us.



Tomas
Hi Urko,
What are your expectations for the segmentation? What results would be useful to you?
Don't think about methods for a moment; focus only on the desired outputs.
Hi Urko,

I guess the most important point is to have an idea of what kinds of types/clusters you need. That means you should have a hypothesis about your data; in the end you have to describe your clusters with clear labels.

The advantage of the 'classical' methods like k-means is in most cases the ease of interpretation. But if your data are correlated in some way, every clustering method will find some solution.

I would test two or three different methods (hierarchical such as Ward's method, k-means, SOM, ...) and compare their results; this can be a good validation of your results. A small sketch of such a comparison follows below.
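
As a small sketch of that comparison (placeholder data; the adjusted Rand index is one convenient agreement measure):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

km = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
ward = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# High agreement between independent methods is a reassuring sign.
print("agreement (ARI):", adjusted_rand_score(km, ward))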
For the kinds of segmentations I most often do, latent class is usually most appropriate (though it can be a headache). For k-means, I find CCEA (Convergent Cluster & Ensemble Analysis) from Sawtooth Software very useful (http://www.sawtoothsoftware.com/products/cca/).
Thanks to all of you for replying to my questions... I am working on it... I think your opinions have given me a clue... but we can continue discussing...
