So here is what I did (be aware that this is not a formal implementation of LSA!):

- I filtered the words and took their base form, as usual.
- I built the multidimensional sparse matrix of the co-occurrences.
- I calculated, for each instance, how frequently it occurs in the corpus.
- I calculated, for each instance, how frequently it occurs in the document.
- I weighted this TF-IDF-like score, also taking into account the distance between the co-occurrences.
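The steps above could be sketched roughly as follows. This is only an illustration, not my actual code: the function name `cooccurrence_scores`, the `window` parameter, the `1 / distance` damping and the smoothed IDF formula are all assumptions chosen to keep the example small, and "distance" is read here as the number of token positions between the two words.

```python
import math
from collections import defaultdict

def cooccurrence_scores(docs, window=4):
    """Score word pairs per document with a distance-damped, TF-IDF-like weight.

    docs: list of token lists (assumed already filtered and reduced to base form).
    Returns {doc_index: {(word_a, word_b): score}} with each pair sorted.
    """
    per_doc = []                 # distance-weighted pair counts, one dict per document
    doc_freq = defaultdict(int)  # number of documents in which each pair appears

    for tokens in docs:
        counts = defaultdict(float)
        for i, w1 in enumerate(tokens):
            # Look at the next `window` tokens only (assumed window size).
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                w2 = tokens[j]
                if w1 == w2:
                    continue
                pair = tuple(sorted((w1, w2)))
                counts[pair] += 1.0 / (j - i)  # closer words contribute more
        per_doc.append(counts)
        for pair in counts:
            doc_freq[pair] += 1

    n_docs = len(docs)
    scores = {}
    for d, counts in enumerate(per_doc):
        scores[d] = {
            # Smoothed IDF (an assumption): pairs found in few documents weigh more.
            pair: c * (math.log((1 + n_docs) / (1 + doc_freq[pair])) + 1.0)
            for pair, c in counts.items()
        }
    return scores

# Toy input, purely illustrative.
toy_docs = [["graph", "theory", "graph"], ["graph", "theory"], ["text", "mining"]]
toy_scores = cooccurrence_scores(toy_docs)
```

A plain dictionary keyed by word pair plays the role of the sparse matrix here: only pairs that actually co-occur are stored.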

This lets us rank all co-occurrences and set a threshold to discard the low-ranking ones.
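As a minimal sketch of the ranking and thresholding, assuming the scores are held in a dictionary keyed by word pair (the function name `rank_and_filter`, the toy scores and the threshold value are all made up for illustration):

```python
def rank_and_filter(pair_scores, threshold):
    """Sort co-occurrences by score (highest first) and drop those below threshold."""
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    return [(pair, score) for pair, score in ranked if score >= threshold]

# Hypothetical scores, not real results.
example_scores = {
    ("graph", "theory"): 2.5,
    ("text", "mining"): 1.7,
    ("the", "post"): 0.2,
}
kept = rank_and_filter(example_scores, threshold=1.0)
```

In practice the threshold would have to be tuned by looking at the data; the value above is arbitrary.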

In the last step I built a graph where I linked the co-occurrences.

As you can see in the following examples, the graphs are initially pretty complex, so to refine the results I applied a filter based on the number of connected components in the graph.
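To give an idea of what a filter built on connected components can look like, here is a small sketch using only the standard library. It shows one possible reading, namely dropping every component with fewer than `min_size` nodes; the exact rule, the function names and the `min_size` parameter are assumptions for illustration, not a description of the filter actually used.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the connected components of an undirected graph as a list of node sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from each node not visited yet.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

def drop_small_components(edges, min_size=3):
    """Keep only edges whose nodes belong to a component with at least min_size nodes."""
    keep = set()
    for component in connected_components(edges):
        if len(component) >= min_size:
            keep |= component
    return [(a, b) for a, b in edges if a in keep]

# Toy graph: one chain of three words and one isolated pair.
toy_edges = [("graph", "theory"), ("theory", "node"), ("text", "mining")]
filtered_edges = drop_small_components(toy_edges, min_size=3)
```

Each edge here stands for one co-occurrence that survived the ranking threshold, so the small, isolated clusters are the ones that disappear.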

To read the entire post, visit my blog at:

Results before filtering:

Results after filtering:

