A new post on my blog:

An early approach to a "document classifier".

I hope it will trigger some discussion on classification techniques.

Tags: C5, Classification, SVM

Replies to This Discussion

I like your nicely visualized graphs! I'll have to wait until your series is over to understand what it's all about.


So far I have tried to give an overview of the common steps involved in text classification.

The last post is only meant to show that the naive approach of extracting a single bag of words and passing everything to a classifier doesn't work! (I'm running some tests on the Reuters data set, which is easy to classify... imagine what happens with a really complex data set!)
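For readers who have not seen it before, here is a minimal sketch of what "a single bag of words" could look like in code. It is only an illustration, not the code from the blog post: the whitespace tokenizer, the function names and the three toy documents are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real pipeline would also
    # strip punctuation, remove stop words, stem, and so on.
    return text.lower().split()


def build_vocabulary(documents):
    # One shared vocabulary for the whole corpus: this is the "single bag".
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def to_count_vector(text, vocab):
    # One raw count per vocabulary term, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocab]


# Hypothetical toy corpus, not taken from the Reuters data.
docs = ["wheat prices rise", "oil prices fall", "wheat harvest grows"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
```

Each row of `vectors` is what would be handed to the classifier. Every term gets the same treatment regardless of how informative it is, which is one plausible reason this baseline struggles.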

After a brief comparison of the naive approach above with different bags of words (obtained using TF-IDF and a closeness centrality function), I would like to show some approaches that perform better in terms of accuracy: boosting methods like AdaBoost, Bayesian posterior probability, and so on.
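As a rough sketch of the TF-IDF idea mentioned above: each term is weighted by how often it occurs in a document, discounted by how many documents contain it. The variant below (relative term frequency times log(N / document frequency)) is just one common textbook formulation and not necessarily the one used in the blog posts; the function name and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    # Assumption: lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # tf = relative frequency in this document, idf = log(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, not taken from the Reuters data.
docs = ["wheat prices rise", "oil prices fall", "wheat harvest grows"]
weights = tf_idf(docs)
```

With this weighting a term that appears in every document gets weight zero, so a bag of words built from the top-weighted terms drops the vocabulary that cannot help separate the classes.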

My approach is always to provide a reasonable benchmark, just to understand our data set better.

When the discussion is more mature, I would like to introduce some statistical analysis to understand the predictability of the training set...

Of course, it is just a blog, and I don't pretend to be exhaustive or theoretically precise, but... in the real world too, who has time to always be precise outside an academic context?

BTW, I'm totally open to suggestions and different points of view!


