So what I did is the following (be aware that this is not the formal implementation of LSA!):
- Filter the text and reduce each word to its base form, as usual.
- Build the multidimensional sparse matrix of co-occurrences.
- For each co-occurrence, compute the frequency with which it appears in the corpus.
- For each co-occurrence, compute the frequency with which it appears in the document.
- Weight this TF-IDF score by the distance between the co-occurring terms.
In this way we can rank all the co-occurrences and set a threshold to discard the low-ranking ones.
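The steps above can be sketched roughly as follows. This is only a minimal illustration of the idea, not my actual implementation: the toy corpus, the window size, and the threshold value are all assumptions I made for the example, and the input is supposed to be already filtered and lemmatized.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of already filtered,
# base-form tokens (the first step of the pipeline is assumed done).
docs = [
    ["latent", "semantic", "analysis", "semantic", "space"],
    ["semantic", "space", "vector", "analysis"],
    ["vector", "space", "model"],
]

WINDOW = 3  # co-occurrence window size (an assumption for this sketch)

def cooccurrences(tokens, window=WINDOW):
    """Yield (pair, distance) for token pairs within the window."""
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            yield tuple(sorted((w1, tokens[j]))), j - i

corpus_freq = Counter()   # pair frequency over the whole corpus
doc_freq = Counter()      # number of documents containing the pair
weight = Counter()        # inverse-distance weight: closer pairs count more
for doc in docs:
    seen = set()
    for pair, dist in cooccurrences(doc):
        corpus_freq[pair] += 1
        weight[pair] += 1.0 / dist
        seen.add(pair)
    doc_freq.update(seen)

# TF-IDF-like score, weighted by the distance between co-occurring terms.
n_docs = len(docs)
scores = {
    pair: weight[pair] * math.log(n_docs / doc_freq[pair] + 1)
    for pair in corpus_freq
}

# Rank the pairs and discard those below an (empirically tuned) threshold.
THRESHOLD = 1.0
kept = {p: s for p, s in scores.items() if s >= THRESHOLD}
```

The exact weighting scheme (here a simple `1/distance` factor folded into the term frequency) is one of several reasonable choices; the point is only that nearby co-occurrences should contribute more than distant ones.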
In the last step I built a graph linking the co-occurrences. As you can see in the examples below, the initial graphs are pretty complex, so to refine the results I applied a filter based on the number of connected components in the graph.
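A sketch of this last step, again under assumptions of mine: the pair names and scores below are illustrative (not taken from the post), and I use a plain BFS over an adjacency list to extract connected components, dropping components smaller than a minimum size.

```python
from collections import defaultdict, deque

# Hypothetical co-occurrence pairs that survived the rank threshold.
kept_pairs = {
    ("latent", "semantic"): 2.1,
    ("semantic", "space"): 3.4,
    ("space", "vector"): 1.8,
    ("foo", "bar"): 1.2,   # a small isolated component, likely noise
}

# Build an undirected graph as an adjacency list.
graph = defaultdict(set)
for w1, w2 in kept_pairs:
    graph[w1].add(w2)
    graph[w2].add(w1)

def connected_components(adj):
    """Return the connected components of the graph via BFS."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Discard components below a minimum size to prune noisy subgraphs.
MIN_SIZE = 3
filtered = [c for c in connected_components(graph) if len(c) >= MIN_SIZE]
```

Here the tiny `foo`-`bar` component is removed while the larger cluster survives, which is the effect of the refinement filter described above.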
To read the entire post, visit my blog at:
Results before filtering:
Results after filtering: