# AnalyticBridge

A Data Science Central Community

# Cristian Mesiano's Blog (23)

### Document Clustering and Graph Clustering: graph entropy as linkage function

Let's continue our discussion about the applications of the graph entropy concept.

Today I'm going to show how we can reuse the same concept for document clustering.

What I want to highlight is that with this methodology it's possible to:

1. extract from a document the relevant words (as discussed …
Continue

Added by Cristian Mesiano on November 17, 2012 at 1:35am — No Comments

### Key words through graph entropy Hierarchical clustering

In the last post I showed how to extract key words from a text through a principle called graph entropy.

Today I'm going to show another application of graph entropy: extracting clusters of key words.

Why

The key words of a document depict the main topic of the content, but a large document often contains many different subtopics related to the…

Continue

Added by Cristian Mesiano on October 24, 2012 at 11:34am — No Comments

### Graph Entropy to extract relevant words

I would like to share some early results from research I'm doing on "graph entropy" applied to text mining problems.

Why is graph entropy so important?

Based on the main concept of entropy, the following assumptions hold:

• The entropy of a graph should be a functional of the…
Continue

Added by Cristian Mesiano on September 24, 2012 at 2:39pm — No Comments

### Function minimization: Simulated Annealing led by variance criteria vs Nelder Mead

Most data mining problems can be reduced to a minimization or maximization problem.

Examples
Let's consider easy scenarios where the cost function depends on just two parameters.…

Continue

Added by Cristian Mesiano on August 12, 2012 at 3:09am — No Comments

### Simulated Annealing: How to boost performance through Matrix Cost rescaling

One of the most widely used algorithms in machine learning is Simulated Annealing (SA).

The reason for its popularity lies in:

• Simplicity of implementation

The experiments

I considered two instances of the problem, the first one with 10 towns and the…

Continue

Added by Cristian Mesiano on July 3, 2012 at 12:56pm — No Comments

### Outlier analysis: Chebyshev criteria vs approach based on Mutual Information

As often happens, I was doing many things at the same time: during a break from working on a new post about applications of mutual information in data mining, I read the interesting paper on outlier detection suggested by Sandro Saitta on his blog (dataminingblog).

...Usually such behavior is not conducive to good results, but this time I think that the change of…

Continue

Added by Cristian Mesiano on May 23, 2012 at 2:20pm — No Comments

### Uncertainty coefficients for Features Reduction - comparison with LDA technique

Uncertainty coefficient
Consider a set of data about people, each labelled with one of two labels, let's say blue and red, and let's assume that we have a bunch of variables describing these people.
Moreover, let's assume that one of the variables is the social…
Continue

Added by Cristian Mesiano on May 5, 2012 at 1:09pm — No Comments

### Earthquake prediction through sunspots part II: common Data mining mistakes!

While I was writing the last post I was wondering how long it would take my followers to notice the mistakes I introduced in the experiments.

Let's start the treasure hunt!

1. Don't always trust your data: often they are not homogeneous.

A good data miner must always check the dataset! You should always ask yourself whether the data have been produced in a…

Continue

Added by Cristian Mesiano on April 4, 2012 at 2:10pm — No Comments

### Support Vector Regression (SVR): predict earthquakes through sunspots

Over the last few months we have discussed text mining algorithms a lot; for a while I would like to focus on data mining aspects.

Today I would like to talk about one of the most intriguing topics in data mining: regression analysis.

...To read the entire post click here

Experiment: Earthquake prediction using sunspots as…
Continue

Added by Cristian Mesiano on March 7, 2012 at 2:27pm — No Comments

### Features Extraction: Co-occurrences and Graph clustering

In the last two posts we discussed co-occurrence analysis as a way to extract features in order to classify documents and extract "meta concepts" from the corpus.

We also noticed that this approach doesn't perform better than the traditional bag of words.

I would now like to explore some variants of this approach, taking advantage of graph theory.

The co-occurrence graph is really huge and complex: how could we reduce its complexity without big information…

Continue

Added by Cristian Mesiano on February 29, 2012 at 1:59pm — No Comments

### Document Classification: latent semantic vs bag of words. Who is the best?

A few posts ago we saw an approach to extract meta "concepts" from text based on the latent semantic paradigm.

In this post we apply this approach to classify documents, and we compare it with the canonical bag of words.

The comparison test will be done through the ensemble method already shown in the last post.

To read the entire post click …

Continue

Added by Cristian Mesiano on February 20, 2012 at 7:22am — No Comments

### Document Classification: how to boost your classifier

AdaBoost.M1 tries to improve the accuracy of the classifier step by step by analyzing its behavior on the training set. (Of course you cannot try to improve the classifier by working with the test set!)

Here lies the problem: if we choose an SVM as the "weak algorithm", we know that it almost always returns excellent accuracy on the training set, with results close to 100% (in terms of true positives).

In this scenario, trying to improve the accuracy of the classifier by assigning different weights…

Continue

Added by Cristian Mesiano on January 30, 2012 at 2:06pm — No Comments

### Extract meta concepts through co-occurrences analysis and graph theory

....
So what I did is the following (be aware that this is not the formal implementation of LSA!):
1. Filter and take the base form of the words as usual;
2. Build the multidimensional sparse matrix of the co-occurrences;
3. Calculate for each instance how frequently it occurs in the corpus;
4. Calculate for each instance how frequently it occurs in the doc;
5. Weight this TF-IDF, also considering the distance among the…
Continue

Added by Cristian Mesiano on January 13, 2012 at 9:04am — No Comments

### Clustering algorithm to approximate functions

The strategy is very easy to describe:

1. Divide the domain of your function into k sub-intervals.

2. Initialize k monomials.

3. Consider the monomials as the centroids of your clustering algorithm.

4. Assign the points of the function to each monomial according to the clustering algorithm.

5. Use gradient descent to adjust the parameters of each monomial.

6. Go back to step 4 until the accuracy is good enough.
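The six steps can be sketched in a few lines of Python. This is only a minimal illustration under assumptions that are not part of the original post: each monomial has the form `a * x**degree` with a fixed, shared `degree`, a point is assigned to the monomial with the smallest absolute residual, and "accuracy" is measured as mean squared error. The function name `fit_monomials` and all of its parameters are hypothetical.

```python
def fit_monomials(xs, ys, k=2, degree=1, lr=0.05, steps=500, tol=1e-6):
    """Approximate the points (xs, ys) with up to k monomials a_j * x**degree."""
    n = len(xs)
    basis = [x ** degree for x in xs]

    # Step 1: split the domain into k contiguous sub-intervals (equal point counts).
    order = sorted(range(n), key=lambda i: xs[i])
    size = -(-n // k)  # ceiling division
    chunks = [order[j:j + size] for j in range(0, n, size)]

    # Step 2: initialise one coefficient per sub-interval by least squares
    # (assumed initialisation; the post does not say how to initialise).
    coeffs = []
    for chunk in chunks:
        num = sum(basis[i] * ys[i] for i in chunk)
        den = sum(basis[i] ** 2 for i in chunk)
        coeffs.append(num / den if den > 0 else 0.0)

    labels, mse = [], float("inf")
    for _ in range(steps):
        # Steps 3-4: the monomials act as centroids; each point joins the
        # monomial that currently predicts it best.
        labels = [
            min(range(len(coeffs)), key=lambda j: abs(ys[i] - coeffs[j] * basis[i]))
            for i in range(n)
        ]
        # Step 5: one gradient-descent step per monomial on its cluster's MSE.
        for j in range(len(coeffs)):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                continue
            grad = 2.0 * sum(
                (coeffs[j] * basis[i] - ys[i]) * basis[i] for i in members
            ) / len(members)
            coeffs[j] -= lr * grad
        # Step 6: stop as soon as the overall error is small enough.
        mse = sum((coeffs[labels[i]] * basis[i] - ys[i]) ** 2 for i in range(n)) / n
        if mse < tol:
            break
    return coeffs, labels, mse


# Toy target: slope 2 on the left part of the domain, slope 0.5 on the right.
xs = [0.1 + 1.9 * i / 39 for i in range(40)]
ys = [2 * x if x < 1 else 0.5 * x for x in xs]
coeffs, labels, mse = fit_monomials(xs, ys, k=2, degree=1)
print(coeffs, mse)
```

The learning rate `lr` has to be tuned to the scale of the data; the value used here is simply one that works for this toy example.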

Continue

Added by Cristian Mesiano on December 15, 2011 at 11:00am — No Comments

### Power Real Polynomial to approximate functions: The Gradient Method

In the real world a problem can rarely be solved using just a single algorithm; more often a solution is a chain of algorithms where the output of one is the input of the next.

But machine learning algorithms quite often return extremely complex functions that don't fit directly into the next step of your strategy.
In these conditions the trick of function approximation is really helpful: that is, we reduce the complexity…
Continue

Added by Cristian Mesiano on December 8, 2011 at 2:31pm — No Comments

### Neural Nets Tips and Tricks: add recall Output neuron

...

As mentioned, in many cases customizing the algorithm is the only way to achieve the target, but sometimes a few tricks can help to improve the learning even without changing the learning strategy!
Consider for example our XOR problem solved through neural networks.
Let's see how we can considerably reduce the number of epochs required to train the net.

Continue

Added by Cristian Mesiano on November 23, 2011 at 3:05pm — No Comments

### Buy or build, a practical example to explain my point of view

I was working on the post about the relation between IFS fractals and analytical probability density when an IT manager contacted me asking for an informal opinion about a tool for Neural Networks.
He told me: “we are evaluating two different tools to perform…
Continue

Added by Cristian Mesiano on November 13, 2011 at 9:35pm — No Comments

### Think different part 1: IFS fractals to generate points in a convex polygon

How to analytically describe a set of points belonging to an irregular convex polygon:
Here are some example…
Continue

Added by Cristian Mesiano on October 24, 2011 at 3:05pm — No Comments

### ADA boost: a way to improve a weak classifier

(have a look at my blog for further details and examples: http://textanddatamining.blogspot.com/)

6 weak classifiers:…

Continue

Added by Cristian Mesiano on October 14, 2011 at 12:23am — No Comments

### Support Vector Clustering: An approach to overcome the limits of K-means

An implementation using AMPL & SNOPT

http://textanddatamining.blogspot.com/2011/09/support-vector-clustering-approach-to.html

Results managed and plotted via…

Continue

Added by Cristian Mesiano on September 26, 2011 at 3:36am — No Comments
