A Data Science Central Community
Need advice
Dear Community,
I need to classify items into groups (let's say 6). When I ran k-means, 90% of my data fell into one group and the remaining 10% into the other groups. What is the next step? To split the data further, I took the 90% group and ran k-means on it again. This gave me 15 new groups within that subset, but again 76% fell into one group and the rest into the other 14. How should I deal with this situation?
Hi Suresh, have you derived any general statistics on your data? It sounds like the kurtosis of the distribution is really high. That could be the correct result... How many variables are you using? Are they independent of each other? I think the variables in a cluster analysis are supposed to be fairly independent. You can run a correlation test to find out. Good luck.
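For anyone who wants to run these checks, here is a minimal sketch in plain Python. The column names (`spend`, `visits`, `age`) and the synthetic data are made up for illustration, so substitute your own variables. It computes the excess kurtosis of each variable (roughly 0 for a normal distribution, large for heavy tails) and the Pearson correlation of each pair. It reports coefficients only, not a significance test.

```python
import math
import random
from itertools import combinations

def excess_kurtosis(xs):
    """Population excess kurtosis: about 0 for normal data, large for heavy tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def summarize(columns):
    """columns maps a variable name to a list of floats."""
    kurt = {name: excess_kurtosis(vals) for name, vals in columns.items()}
    corr = {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(columns, 2)}
    return kurt, corr

# Synthetic stand-in data (an assumption, not the original poster's data).
rng = random.Random(0)
base = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
columns = {
    "spend": base,                                        # heavy-tailed
    "visits": [b * rng.uniform(0.8, 1.2) for b in base],  # nearly a copy of spend
    "age": [rng.gauss(40.0, 10.0) for _ in range(2000)],  # roughly normal
}

kurt, corr = summarize(columns)
for name, value in kurt.items():
    print(f"excess kurtosis of {name}: {value:.2f}")
for (a, b), value in corr.items():
    print(f"correlation {a} vs {b}: {value:.2f}")
```

A pair with a correlation near 1 is carrying almost the same information twice, so it is worth considering whether both variables belong in the clustering.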
Think about the data that you are trying to cluster. How many dimensions are you using? Are the variables highly related? Do the variables have different standard deviations? How are they distributed?
For instance, if your data are log-normal, then most of the cases will sit at the low end of the distribution with a few at the high end. If you have a bunch of highly correlated log-normal variables, that could produce the kind of results you are seeing.
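To see the log-normal effect in isolation, here is a rough sketch on a single synthetic variable. Everything in it is an assumption for illustration: the data are simulated, the spread parameter 2.0 is arbitrary, and `kmeans_1d` is a bare-bones Lloyd's algorithm written for this example rather than any library's implementation. It clusters the raw values and the log-transformed values into 6 groups and compares the share of cases that land in the largest cluster.

```python
import math
import random

def kmeans_1d(values, k, max_iter=300):
    """Plain Lloyd's algorithm on 1-D data; returns one cluster label per value."""
    lo, hi = min(values), max(values)
    # Start with centroids evenly spaced over the observed range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2)
                      for v in values]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = sum(members) / len(members)
    return labels

def largest_share(labels, k):
    """Fraction of cases that fall in the biggest cluster."""
    counts = [labels.count(j) for j in range(k)]
    return max(counts) / len(labels)

# Simulated log-normal variable (hypothetical data).
rng = random.Random(42)
raw = [rng.lognormvariate(0.0, 2.0) for _ in range(2000)]
logged = [math.log(v) for v in raw]

raw_share = largest_share(kmeans_1d(raw, 6), 6)
log_share = largest_share(kmeans_1d(logged, 6), 6)
print(f"largest cluster share, raw values: {raw_share:.0%}")
print(f"largest cluster share, log-transformed: {log_share:.0%}")
```

On the raw values the few extreme cases pull centroids out into the tail, so most of the data ends up sharing one cluster near the low end. Whether a log transform is appropriate for your own variables depends on what they measure, so treat this as a diagnostic rather than a recipe.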
Clustering is often treated as a garbage-disposal method; toss anything in and it gets crunched. I find that one has to put a lot of thought into the variables used to get meaningful results.
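One concrete piece of that preparation, picking up the standard-deviation question above: k-means works on Euclidean distance, so a variable with a much larger spread can dominate the result. A minimal sketch of z-score standardization follows. The `income` and `age` columns are hypothetical, and `zscore` is a helper written for this example.

```python
import statistics

def zscore(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        raise ValueError("constant column cannot be standardized")
    return [(x - mean) / sd for x in column]

# Hypothetical variables on very different scales.
income = [32000.0, 45000.0, 51000.0, 120000.0, 38000.0]
age = [25.0, 47.0, 33.0, 52.0, 61.0]

print("spread before:", statistics.pstdev(income), statistics.pstdev(age))

income_z, age_z = zscore(income), zscore(age)
print("spread after:", statistics.pstdev(income_z), statistics.pstdev(age_z))
```

After rescaling, both variables contribute on a comparable scale to the distance calculation. Standardizing does not remove skew, so a heavy-tailed variable stays heavy-tailed.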