A Data Science Central Community
Let's continue our discussion about the applications of the graph entropy concept.
Today I'm going to show how we can reuse the same concept for document clustering.
What I want to highlight is that with this methodology it's possible to:
Added by Cristian Mesiano on November 17, 2012 at 1:35am — No Comments
In the last post I showed how to extract keywords from a text through a principle called graph entropy.
Today I'm going to show another application of graph entropy: extracting clusters of keywords.
The keywords of a document depict its main topic, but if the document is large there are often many different subtopics related to the…
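As a rough illustration of the idea (a toy sketch of my own, not the exact construction from the post): build a word co-occurrence graph from the sentences, and look at the entropy of its degree distribution — highly connected words are natural keyword candidates, and the entropy summarizes how concentrated the connectivity is.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def cooccurrence_degrees(sentences):
    """For each word, count how many distinct words it co-occurs with."""
    neighbors = defaultdict(set)
    for sent in sentences:
        words = set(sent.lower().split())
        for a, b in combinations(sorted(words), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {w: len(ns) for w, ns in neighbors.items()}

def degree_entropy(degrees):
    """Shannon entropy of the normalized degree distribution."""
    total = sum(degrees.values())
    return -sum((d / total) * log2(d / total) for d in degrees.values() if d)

sentences = ["the cat sat on the mat", "the dog sat on the log"]
degs = cooccurrence_degrees(sentences)   # "the" has the most neighbors
H = degree_entropy(degs)
```

The degree distribution is only one of several graph statistics one could plug into the entropy; the post's actual definition may differ.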
Added by Cristian Mesiano on October 24, 2012 at 11:34am — No Comments
I would like to share with you some early results of research I'm doing in the field of "graph entropy" applied to text mining problems.
Why is graph entropy so important?
Based on the main concept of entropy, the following assumptions hold:
Added by Cristian Mesiano on September 24, 2012 at 2:39pm — No Comments
Most data mining problems can be reduced to a minimization/maximization problem.
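The workhorse for such reductions is some form of iterative optimization. As a minimal sketch (my own illustration, not from the post), here is plain gradient descent minimizing a 1-D objective:

```python
def minimize(grad, x0=0.0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_star = minimize(lambda x: 2 * (x - 3))
```

The same loop, with a different objective and a vector-valued gradient, underlies most of the learning algorithms discussed on this blog.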
Added by Cristian Mesiano on August 12, 2012 at 3:09am — No Comments
One of the most widely used algorithms in machine learning is simulated annealing (SA).
The reasons for its popularity lie in:
I considered two instances of the problem, the first one with 10 towns and the…
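A minimal SA sketch for the travelling-salesman setting mentioned above (the geometric cooling schedule, the 2-opt reversal move, and all parameter values are my own illustrative choices, not necessarily those used in the post):

```python
import math
import random

def tour_length(tour, pts):
    """Total length of a closed tour over the given points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def simulated_annealing(pts, T=10.0, cooling=0.995, steps=5000, seed=0):
    """SA with 2-opt moves: accept worse tours with prob exp(-delta/T)."""
    rng = random.Random(seed)
    tour = list(range(len(pts)))
    best = tour_length(tour, pts)
    for _ in range(steps):
        i, j = sorted(rng.sample(range(len(pts)), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # 2-opt reversal
        delta = tour_length(cand, pts) - tour_length(tour, pts)
        if delta < 0 or rng.random() < math.exp(-delta / T):
            tour = cand
        T *= cooling  # geometric cooling
        best = min(best, tour_length(tour, pts))
    return tour, best

# A 10-town instance: points on a circle (identity tour is optimal).
pts = [(math.cos(2 * math.pi * k / 10), math.sin(2 * math.pi * k / 10))
       for k in range(10)]
tour, best = simulated_annealing(pts)
```

The acceptance of occasional uphill moves is what lets SA escape local minima — the property that makes it so popular.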
Added by Cristian Mesiano on July 3, 2012 at 12:56pm — No Comments
...Usually such behavior is not conducive to good results, but this time I think that the change of…
Added by Cristian Mesiano on May 23, 2012 at 2:20pm — No Comments
Added by Cristian Mesiano on May 5, 2012 at 1:09pm — No Comments
While I was writing the last post, I wondered how long it would take my followers to notice the mistakes I introduced in the experiments.
Let's start the treasure hunt!
1. Don't always trust your data: often they are not homogeneous.
A good data miner must always check his dataset! You should always ask yourself whether the data have been produced in a…
Added by Cristian Mesiano on April 4, 2012 at 2:10pm — No Comments
In recent months we have discussed text mining algorithms at length; I would like to focus for a while on data mining aspects.
Today I would like to talk about one of the most intriguing topics in data mining: regression analysis.
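To fix ideas, here is the simplest instance of regression analysis — ordinary least squares for a line y = a·x + b, in closed form (a generic textbook sketch, not necessarily the model discussed in the post):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b (closed-form solution)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx          # slope: covariance / variance
    b = my - a * mx        # intercept through the means
    return a, b

# Points lying exactly on y = 2x + 1.
a, b = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
```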
...To read the entire post click here
Added by Cristian Mesiano on March 7, 2012 at 2:27pm — No Comments
In the last two posts we discussed co-occurrence analysis to extract features for classifying documents and to extract "meta concepts" from the corpus.
We have also noticed that this approach doesn't perform better than the traditional bag of words.
I would now like to explore some derivations of this approach, taking advantage of graph theory.
The graph of the co-occurrences is really huge and complex; how can we reduce its complexity without a big information…
Added by Cristian Mesiano on February 29, 2012 at 1:59pm — No Comments
A few posts ago we saw an approach to extracting meta "concepts" from text based on the latent semantic paradigm.
In this post we apply this approach to classify documents and compare it with the canonical bag of words.
The comparison test will be done with the ensemble method already shown in the last post.
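For reference, the bag-of-words baseline against which the latent-semantic features are compared can be sketched as follows (a generic term-frequency encoding; the post's exact preprocessing may differ):

```python
def bag_of_words(docs):
    """Map each document to a term-frequency vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1   # count occurrences per term
        vectors.append(v)
    return vocab, vectors

vocab, vecs = bag_of_words(["a b a", "b c"])
```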
To read the entire post click …
Added by Cristian Mesiano on February 20, 2012 at 7:22am — No Comments
AdaBoost.M1 tries to improve, step by step, the accuracy of the classifier by analyzing its behavior on the training set. (Of course you cannot improve the classifier by working with the test set!)
Here lies the problem: if we choose an SVM as the "weak algorithm", we know that it almost always returns excellent accuracy on the training set, with results close to 100% (in terms of true positives).
In this scenario, trying to improve the accuracy of the classifier by assigning different weights…
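The mechanics behind the problem can be seen in a single AdaBoost.M1 reweighting round (a standard-textbook sketch, not code from the post): misclassified samples keep their weight while correct ones are shrunk by beta = eps/(1-eps). When the weak learner is an SVM with near-zero training error, eps → 0, beta → 0, and the update degenerates — which is exactly the issue raised above.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost.M1 reweighting step.

    weights -- current sample weights (assumed to sum to 1)
    correct -- per-sample flag: did the weak learner classify it right?
    """
    eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error
    beta = eps / (1 - eps)   # near 0 when the learner memorizes the data
    new = [w * (beta if c else 1.0) for w, c in zip(weights, correct)]
    z = sum(new)
    alpha = math.log(1 / beta)          # vote of this round's classifier
    return [w / z for w in new], alpha

# Four equally weighted samples, one misclassified: eps = 0.25.
w, alpha = adaboost_round([0.25] * 4, [True, True, True, False])
```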
Added by Cristian Mesiano on January 30, 2012 at 2:06pm — No Comments
Added by Cristian Mesiano on January 13, 2012 at 9:04am — No Comments
The strategy is very easy to describe:
1. Divide the domain of your function into k sub-intervals.
2. Initialize k monomials.
3. Consider the monomials as the centroids of your clustering algorithm.
4. Assign the points of the function to each monomial according to the clustering algorithm.
5. Use gradient descent to adjust the parameters of each monomial.
6. Go to 4 until the accuracy is good enough.
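The steps above can be sketched as follows. For concreteness I assume degree-1 monomials y = a_j·x (the post does not fix the degree, so this is an illustrative simplification): step 4 assigns each point to its best-fitting monomial, step 5 takes one gradient-descent step per monomial, and the loop alternates the two.

```python
def fit_monomials(points, k=2, lr=0.01, iters=200):
    """Alternate cluster assignment and gradient descent on monomials y = a*x."""
    a = [1.0 + j for j in range(k)]              # step 2: initialize k monomials
    for _ in range(iters):
        # Step 4: assign each point to the monomial with least squared error.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k), key=lambda j: (a[j] * x - y) ** 2)
            clusters[j].append((x, y))
        # Step 5: one gradient-descent step on each monomial's coefficient.
        for j, pts in enumerate(clusters):
            if pts:
                grad = sum(2 * (a[j] * x - y) * x for x, y in pts) / len(pts)
                a[j] -= lr * grad
    return a

# Points drawn from two lines, y = 2x and y = 5x: each monomial captures one.
points = ([(x, 2 * x) for x in (1.0, 2.0, 3.0)] +
          [(x, 5 * x) for x in (1.0, 2.0, 3.0)])
coeffs = sorted(fit_monomials(points, k=2))
```

The assignment step is exactly k-means with monomials in place of centroid points, which is what step 3 is getting at.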
Read the entire post at: …
Added by Cristian Mesiano on December 15, 2011 at 11:00am — No Comments
In the real world a problem can rarely be solved with a single algorithm; more often the solution is a chain of algorithms, where the output of one is the input of the next.
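Such a chain is naturally expressed as function composition. A minimal sketch (my own illustration of the pattern, with a toy text-normalization chain as the example):

```python
def pipeline(*stages):
    """Chain algorithms: the output of each stage feeds the next one."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# Toy chain: lowercase -> tokenize -> unique terms.
clean = pipeline(str.lower, str.split, set)
result = clean("The CAT the")
```

Real pipelines chain heavier stages (feature extraction, dimensionality reduction, a classifier), but the composition pattern is the same.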
Added by Cristian Mesiano on December 8, 2011 at 2:31pm — No Comments
Read more here: …
Added by Cristian Mesiano on November 23, 2011 at 3:05pm — No Comments
I was working on the post about the relation between IFS fractals and analytical probability densities when an IT manager contacted me to ask for an informal opinion about a neural network tool.
Added by Cristian Mesiano on November 13, 2011 at 9:35pm — No Comments
Added by Cristian Mesiano on October 24, 2011 at 3:05pm — No Comments
(have a look at my blog for further details and examples: http://textanddatamining.blogspot.com/)
6 weak classifiers: …
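The simplest way to combine several weak classifiers is plain majority voting — sketched below with six hypothetical threshold classifiers standing in for the six mentioned above (the post's actual combination rule may be weighted differently):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Return the label predicted by the majority of the weak classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Six toy weak classifiers: each just tests x against a threshold.
classifiers = [lambda x, t=t: x > t for t in range(6)]
pred = majority_vote(classifiers, 4)   # four of six vote True
```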
Added by Cristian Mesiano on October 14, 2011 at 12:23am — No Comments
An implementation using AMPL & SNOPT
Added by Cristian Mesiano on September 26, 2011 at 3:36am — No Comments