Let's continue our discussion about the applications of the graph entropy concept.
Today I'm going to show how we can re-use the same concept on the document clustering.
What I want to highlight is that through such methodology it's possible to:
- extract from a document the relevant words (as discussed here);
- clustering of the words of a document (as discussed here);
- clustering set of documents;
- clustering a graph and assign a ranking score to each cluster by homogeneity criteria;
The Experiment
The documents of this data set are email selected from different newsgroups.
I selected some documents from the categories: Hockey, Motorcycle, Atheist and Electronics.
You can download
here the documents used for this experiment.
The procedure
- For each document build the respective graph (filtering by stopwords list, and stemming the words).
- For each graph calculate the entropy graph value of each vertex (word).
- For each graph extract the first k relevant vertex (sorting by entropy graph value)
- Perform hierarchical clustering (in this case I used once again an approach based on Simulated annealing).
The results
Here you are the early results obtained through the clustering.
CLICK HERE TO READ THE COMPLETE POST
ConsiderationsThe results are promising but not perfect. The accuracy achieved is
- 75.0% for first cluster
- 62.5% for second cluster
- 50.0 % for the fifth cluster
- 37.5 % for the third and fourth cluster
CLICK HERE TO READ THE COMPLETE POST
Something more...
I mentioned before that such approach works over the concept of entropy graph.
Here you are the graph representation of the words related to each cluster.
...
CLICK HERE TO READ THE COMPLETE POST
...
...
CLICK HERE TO READ THE COMPLETE POST
...
Notes
Such representation of the clustering highlights that the linkage function used in this experiment depicts quite well the different areas of the graph.
- The Red cluster grouped together the right block of vertexes.
- The Brown cluster grouped the right side of the graph
- The Blue cluster focused on the core of the graph
- The yellow cluster grouped together the corona of the core of the graph
- The green cluster grouped together the corona of the peripheral of the graph.
The next steps
In the next post I'll try to refine the results, optimizing the clustering algorithm, and trying to use different kind of clustering algos.
Stay Tuned
cristian
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge