# AnalyticBridge

A Data Science Central Community

# Document Clustering and Graph Clustering: graph entropy as linkage function

Let's continue our discussion about the applications of the graph entropy concept.

Today I'm going to show how we can re-use the same concept on the document clustering.
What I want to highlight is that through such methodology it's possible to:

1. extract from a document the relevant words (as discussed here);
2. clustering of the words of a document (as discussed here);
3. clustering set of documents;
4. clustering a graph and assign a ranking score to each cluster by homogeneity criteria;
The Experiment
For this experiment I chose a subset of the standard dataset called "the 20 Newsgroup dataset".
The documents of this data set are email selected from different newsgroups.
I selected some documents from the categories: Hockey, Motorcycle, Atheist and Electronics.
The procedure
• For each document build the respective graph (filtering by stopwords list, and stemming the words).
• For each graph calculate the entropy graph value of each vertex (word).
• For each graph extract the first relevant vertex (sorting by entropy graph value)
• Perform hierarchical clustering (in this case I used once again an approach based on Simulated annealing).
The results
Here you are the early results obtained through the clustering.

Considerations
The results are promising but not perfect. The accuracy achieved is
•  75.0% for first cluster
• 62.5% for second cluster
• 50.0 % for the fifth cluster
• 37.5 % for the third and fourth cluster

Something more...
I mentioned before that such approach works over the concept of  entropy graph.
Here you are the graph representation of the words related to each cluster.
...

...
...

...
Notes
Such representation of the clustering highlights that the linkage function used in this experiment depicts quite well the different areas of the graph.
• The Red cluster grouped together the right block of vertexes.
• The Brown cluster grouped the right side of the graph
• The Blue cluster focused on the core of the graph
• The yellow cluster grouped together the corona of the core of the graph
• The green cluster grouped together the corona of the peripheral of the graph.
The next steps
In the next post I'll try to refine the results, optimizing the clustering algorithm, and trying to use different kind of clustering algos.
Stay Tuned
cristian

Views: 467

Comment