Subscribe to DSC Newsletter

Document Classification: latent semantic vs bag of words. Who is the best?

We have seen few posts ago an approach to extract meta "concepts" from text based on latent semantic paradigm.
In this post we apply this approach to classify documents, and we do a comparison between this approach and the canonical bag of words.
The comparison test will be done through the ensemble method already showed in the last post.

To read the entire post click here.

The image below represents the graph of the co-occurrences extracted considered the highest ranking (the ranking has been assigned in the way described in this post applied to the data set used to explain the ensemble method based on ADA boost M1).

In blue it has been represented the results for LSA based approach.
In red it has been represented the results for Bag of Words approach.

As you can see, even if the results obtained with the LSA classifier are not so bad, the traditional approach based on bag of words (in this experiment extracted through TD-DF) returns better results.

Views: 1082

Tags: Text, analysis, bag, categorization, latent, of, semantic, words

Comment

You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge

On Data Science Central

© 2020   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service