
Most papers on initialization techniques for k-means clustering compare the proposed technique with the random method. Why? The answer seems obvious: as the name suggests, random initialization is arbitrary and can force k-means through a large number of iterations.


Replies to This Discussion

This is not a direct answer to the question, but a comment on the topic of k-means initialization: In my opinion, if your goal is to achieve better clustering results, initialization techniques for k-means only treat the symptom, not the origin of the problem, which is k-means' heavy sensitivity to initialization. In my experience, you can always use Neural Gas (from Martinetz and Schulten) instead of k-means and get much more robust results at only slightly higher computational cost, without caring much about initialization.
However, also from practical experience, I think the original online version of Neural Gas works better than the batch version, although the latter is of course faster.
Thanks, Bassam, for the pointer to Neural Gas. I don't have much exposure to it, so I will study it.

1. Self-organizing maps (with neural gas as a special case) are, in my opinion, always a better pick than k-means, which is used so widely (not only) because it is so easy to understand and implement.

2. Ultsch has shown that batch updates of SOMs can lead to wrong cluster topologies. Hence online updating is recommended.

The one advantage of k-means is that it delivers the clusters automatically. How do you, Bassam, extract the clusters afterwards (from Neural Gas)? I only know a complicated algorithm (U*C), so I'd like to hear yours :).
Sorry for the late answer, I was travelling for a while.
Steffen, I agree with your first point. However, as a side note, I personally would not call NG a special case of SOM: Both are prototype-based methods, i.e., prototypes are introduced and moved in the data space, while the movement is influenced by a neighborhood structure defined over the prototypes. However, SOM has a predefined, fixed grid topology as its neighborhood structure, while NG uses a dynamic neighborhood based on the prototypes' current neighborhood situation in data space. So I don't see SOM being the more general concept. Of course, this might be a matter of debate.
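To make the rank-based neighborhood concrete, here is a minimal sketch of the online Neural Gas update in the spirit of Martinetz and Schulten. The function name, parameter names, and the exponential decay schedules are my own illustrative choices, not something specified in this thread:

```python
import numpy as np

def neural_gas_online(X, n_prototypes=5, n_epochs=20,
                      eps_start=0.5, eps_end=0.01,
                      lam_start=None, lam_end=0.1, seed=0):
    """Online Neural Gas sketch: for each sample, every prototype is
    pulled toward it, weighted by exp(-rank / lambda), where rank is
    the prototype's distance rank to that sample (0 = closest).
    Decay schedules for eps and lambda are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes on random data points.
    W = X[rng.choice(len(X), n_prototypes, replace=False)].astype(float)
    if lam_start is None:
        lam_start = n_prototypes / 2.0
    t, t_max = 0, n_epochs * len(X)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            frac = t / t_max
            eps = eps_start * (eps_end / eps_start) ** frac  # learning rate decay
            lam = lam_start * (lam_end / lam_start) ** frac  # neighborhood decay
            # Rank prototypes by distance to the current sample.
            ranks = np.argsort(np.argsort(np.linalg.norm(W - X[i], axis=1)))
            # Rank-weighted update: near prototypes move a lot, far ones barely.
            W += eps * np.exp(-ranks / lam)[:, None] * (X[i] - W)
            t += 1
    return W
```

Note that, unlike a SOM, the neighborhood here is recomputed from the distance ranking at every step, which is exactly the dynamic neighborhood described above.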

Thanks for mentioning Ultsch's contribution about batch-SOM-updates. Could you give me a reference to the corresponding article?

Now to your question: I'm not sure if I understand correctly, but I see no problem in extracting clusters after training with NG. I'm not familiar with the algorithm you mentioned, but I don't know why a complicated method would be needed. As I said, NG uses prototypes which are moved in data space. In the case of classic NG, this must be a vector space, e.g. a space over numerical features of the data. Therefore, after training, the prototypes have a fixed position in this space. These are the cluster centers. Then, if you want the assignment of data points to prototypes (and therefore clusters), you determine the closest prototype for each data point. To sound a bit more scientific, the assignment of a data point to its closest prototype is often called the winner-takes-all function, and it defines a Voronoi tessellation of the data space. At least that's the classic way that I know. Sorry for all the details, they are meant to avoid potential misunderstandings. :)
Have you ever had a problem with 'too many iterations'? Doesn't k-means converge quite quickly?
It converges, but if the initialization is poor it takes too many iterations.
K-means++ is meant exactly for choosing the initial centroids. The overhead of this approach is negligible; see the final slides of "k-means++: The Advantages of Careful Seeding".
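For reference, the k-means++ seeding rule (Arthur and Vassilvitskii) can be sketched in a few lines; the function name and interface here are my own:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding sketch: first center chosen uniformly at
    random; each subsequent center is sampled with probability
    proportional to its squared distance to the nearest center
    already chosen (the D^2 weighting)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min(
            np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2) ** 2,
            axis=1,
        )
        # D^2 weighting pushes new centers toward uncovered regions.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

The D^2 weighting is why the extra cost is negligible: it is one distance pass per seed, and in return the subsequent Lloyd iterations typically converge much faster.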
Agreed. A lot of k-means-based methods focus mainly on the initialization step, even using background knowledge (pairwise constraints, seeded data, etc.) to initialize the cluster centers; in that case they would of course beat random initialization. However, the iteration count only decreases significantly if there is no noise in the background knowledge.


© 2019 Data Science Central LLC