A Data Science Central Community
So far I have tried to give an overview of the common steps involved in text classification.
The last one was only meant to show that the naive approach of extracting a single bag of words and passing everything to a classifier doesn't work! (I'm running some tests on the Reuters data set, which is easy to classify... imagine what happens with a real, complex data set!)
After a brief comparison of the above-mentioned naive approach using different bags of words (obtained with TF-IDF and the closeness centrality function), I would like to show some paradigms that perform better in terms of accuracy: boosting methods like AdaBoost, Bayesian posterior probability, and so on.
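To make the TF-IDF bag of words a bit more concrete, here is a minimal sketch in plain Python. It is not the code used for the tests: the function name `tfidf_bag_of_words`, the toy documents, and the choice to rank each term by its highest TF-IDF score in any document are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_bag_of_words(docs, top_k=3):
    """Pick top_k terms, ranked by their highest TF-IDF score in any document.

    Hypothetical helper: the ranking rule (max score per term) is an
    assumption, not necessarily how the bag of words was built in the tests.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    best = {}
    for tokens in tokenized:
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)            # relative term frequency
            idf = math.log(n_docs / df[term])   # 0 if the term is in every doc
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Toy corpus (made up, not taken from the Reuters data set).
toy_docs = [
    "the oil price rises",
    "the wheat harvest falls",
    "the oil export rises",
]
bag = tfidf_bag_of_words(toy_docs, top_k=3)
print(bag)
```

A word such as "the" that appears in every document gets an IDF of zero, so it falls to the bottom of the ranking. A single bag built this way still ignores which class each document belongs to, which is one reason the naive approach struggles.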
My approach is always to provide a reasonable benchmark, just to understand our data set better.
When the discussion is more mature, I would like to introduce some statistical analysis to understand the predictability of the training set...
Of course, this is just a blog, and I don't pretend to be exhaustive or theoretically precise, but... even in the real world, who has time to always be precise outside an academic context?
BTW, I'm totally open to suggestions and different points of view!