A Data Science Central Community

Let's continue our discussion about the applications of the graph entropy concept.

Today I'm going to show how we can reuse the same concept for document clustering.

What I want to highlight is that with this methodology it's possible to:

- extract from a document the relevant words (as discussed …

Added by Cristian Mesiano on November 17, 2012 at 1:35am — No Comments

In the last post I showed how to extract key words from a text through a principle called graph entropy.

Today I'm going to show another application of graph entropy: extracting clusters of key words.

**Why**

The key words of a document depict the main topic of the content, but a large document often contains many different subtopics related to the…

Added by Cristian Mesiano on October 24, 2012 at 11:34am — No Comments

I would like to share some early results from my research on "graph entropy" applied to text mining problems.

**Why is graph entropy so important?**

Based on the main concept of entropy the following assumptions are true:

- The entropy of a graph should be a functional of the…

Added by Cristian Mesiano on September 24, 2012 at 2:39pm — No Comments

Most data mining problems can be reduced to a minimization/maximization problem.

Let's consider easy scenarios where the cost function depends on just two parameters.…

Added by Cristian Mesiano on August 12, 2012 at 3:09am — No Comments

One of the most widely used algorithms in Machine Learning is Simulated Annealing (SA).

Its popularity stems from:

- Simplicity of implementation
- Broad spectrum of applicability
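To give an idea of how little code SA needs, here is a minimal sketch applied to a small travelling-salesman instance. It is an illustration under assumptions, not the implementation used in the experiments: the geometric cooling schedule, the segment-reversal move, the parameter values and the 10 towns placed on a circle are all hypothetical choices.

```python
import math
import random


def tour_length(tour, dist):
    """Total length of a closed tour over the distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def simulated_annealing(dist, t_start=10.0, t_end=1e-3, cooling=0.995, seed=0):
    """Return (best_tour, best_length) found by a plain SA run."""
    rng = random.Random(seed)
    n = len(dist)
    current = list(range(n))
    rng.shuffle(current)
    current_len = tour_length(current, dist)
    best, best_len = current[:], current_len
    t = t_start
    while t > t_end:
        # Propose a neighbour by reversing a random segment of the tour.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
        cand_len = tour_length(candidate, dist)
        delta = cand_len - current_len
        # Always accept improvements; accept worse tours with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_len = candidate, cand_len
            if current_len < best_len:
                best, best_len = current[:], current_len
        t *= cooling  # geometric cooling (an assumption of this sketch)
    return best, best_len


# Hypothetical instance: 10 towns evenly spaced on a unit circle.
towns = [(math.cos(2 * math.pi * k / 10), math.sin(2 * math.pi * k / 10)) for k in range(10)]
dist = [[math.dist(a, b) for b in towns] for a in towns]
best, best_len = simulated_annealing(dist)
print(best, round(best_len, 3))
```

The acceptance rule is the only part that is specific to SA; the cost function and the neighbour move can be swapped for any other problem.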

**The experiments**

I considered two instances of the problem, the first one with 10 towns and the…

Added by Cristian Mesiano on July 3, 2012 at 12:56pm — No Comments

As often happens, I do many things at the same time, so during a break from working on a new post about applications of mutual information in data mining, I read the interesting paper on outlier detection suggested by Sandro Saitta on his blog (dataminingblog).

...Usually such behavior is not conducive to good results, but this time I think that the change of…

Added by Cristian Mesiano on May 23, 2012 at 2:20pm — No Comments

Consider a data set of people labelled with one of two labels, say blue and red, and assume that we have a set of variables describing each person.

Moreover, let's assume that one of the variables is the social…

Added by Cristian Mesiano on May 5, 2012 at 1:09pm — No Comments

While I was writing the last post I was wondering how long it would take my followers to notice the mistakes I introduced in the experiments.

Let's start the treasure hunt!

**1. Don't always trust your data: often they are not homogeneous.**

A good data miner must always check the dataset! You should always ask yourself whether the data have been produced in a…

Added by Cristian Mesiano on April 4, 2012 at 2:10pm — No Comments

Over the last few months we have discussed text mining algorithms a lot; for a while I would like to focus on data mining aspects.

Today I would like to talk about one of the most intriguing topics in data mining: regression analysis.

Added by Cristian Mesiano on March 7, 2012 at 2:27pm — No Comments

In the last two posts we discussed co-occurrence analysis as a way to extract features for classifying documents and to extract "meta concepts" from the corpus.

We also noticed that this approach doesn't perform better than the traditional bag of words.

I would now like to explore some variations of this approach that take advantage of graph theory.

The co-occurrence graph is really huge and complex; how could we reduce its complexity without big information…

Added by Cristian Mesiano on February 29, 2012 at 1:59pm — No Comments

A few posts ago we saw an approach for extracting meta "concepts" from text based on the latent semantic paradigm.

In this post we apply this approach to document classification and compare it with the canonical bag of words.

The comparison will be done through the ensemble method already shown in the last post.

Added by Cristian Mesiano on February 20, 2012 at 7:22am — No Comments

AdaBoost.M1 tries to improve the accuracy of the classifier step by step by analyzing its behavior on the training set. (Of course you cannot try to improve the classifier by working with the test set!)

Here lies the problem: if we choose an SVM as the "weak algorithm", we know that it almost always returns excellent accuracy on the training set, with results close to 100% (in terms of true positives).

In this scenario, trying to improve the accuracy of the classifier by assigning different weights…
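The issue is easier to see on a single reweighting round. The sketch below follows the standard AdaBoost.M1 update (training error `err`, factor `beta = err / (1 - err)`, correctly classified examples scaled by `beta`, then renormalization). The function name and the toy inputs are hypothetical, and treating `err == 0` as a stop condition is a choice of this sketch: in that case `beta` would be 0 and there would be nothing left to reweight.

```python
def adaboost_m1_round(weights, correct):
    """One AdaBoost.M1 reweighting step.

    weights: current distribution over the training examples (sums to 1).
    correct: booleans, True where the weak classifier was right.
    Returns (new_weights, beta), or None when boosting cannot continue.
    """
    # Weighted training error of the weak classifier.
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    # err > 0.5: the weak learner is too weak; err == 0: nothing to reweight.
    if err == 0 or err > 0.5:
        return None
    beta = err / (1 - err)
    # Shrink the weight of the examples that were classified correctly.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], beta


# A genuinely weak classifier: 2 mistakes out of 10 examples.
new_w, beta = adaboost_m1_round([0.1] * 10, [True] * 8 + [False] * 2)
print(beta, new_w)

# A classifier that is perfect on the training set: no round is possible.
print(adaboost_m1_round([0.25] * 4, [True] * 4))
```

After the first call the two misclassified examples carry half of the total weight, which is what pushes the next weak classifier to focus on them. With a classifier that never fails on the training set this mechanism has nothing to work with.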

Added by Cristian Mesiano on January 30, 2012 at 2:06pm — No Comments

....

So what I did is the following (be aware that this is not the formal implementation of LSA!):

- Filter the words and take their base form as usual;
- Build the multidimensional sparse matrix of the co-occurrences;
- Calculate for each instance its frequency in the corpus;
- Calculate for each instance its frequency in the doc;
- Weight such TF-IDF considering also the distance among the…
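As a rough illustration of the sparse co-occurrence step only, here is a minimal sketch. It assumes the tokens have already been filtered and reduced to their base form, and it weights each pair by the inverse of the distance between the two words inside a sliding window. The window size, the `1/d` weighting and the function name are assumptions of this sketch, not the scheme used in the post.

```python
from collections import defaultdict


def cooccurrence_weights(tokens, window=3):
    """Sparse co-occurrence matrix stored as {(word_a, word_b): weight}."""
    weights = defaultdict(float)
    for i, left in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            right = tokens[i + d]
            if left == right:
                continue  # ignore a word co-occurring with itself
            # Store each unordered pair once, with the words in sorted order.
            pair = tuple(sorted((left, right)))
            # Closer words contribute more (hypothetical 1/d weighting).
            weights[pair] += 1.0 / d
    return dict(weights)


tokens = "graph entropy extracts relevant words from graph structure".split()
w = cooccurrence_weights(tokens, window=2)
for pair, weight in sorted(w.items()):
    print(pair, weight)
```

Only the pairs that actually occur are stored, which keeps the matrix sparse even for a large vocabulary.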

Added by Cristian Mesiano on January 13, 2012 at 9:04am — No Comments

The strategy is very easy to describe:

1. Divide the domain of your function into k sub-intervals.

2. Initialize k monomials.

3. Consider the monomials as centroids of your clustering algorithm.

4. Assign the points of the function to each monomial according to the clustering algorithm.

5. Use gradient descent to adjust the parameters of each monomial.

6. Go to 4 until the accuracy is good enough.
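The steps above can be sketched as follows. This is a simplified reading under assumptions: each monomial has the form `a * x**p` with a fixed degree `p` and only the coefficient `a` is trained, the first assignment simply follows the k sub-intervals, later assignments pick the monomial with the smallest residual, and a fixed number of rounds stands in for the accuracy check. The function name, learning rate and toy target are hypothetical.

```python
def fit_monomials(xs, ys, degrees, rounds=20, gd_steps=200, lr=0.01):
    """Fit one monomial a * x**p per cluster; xs must be sorted."""
    k, n = len(degrees), len(xs)
    coeffs = [1.0] * k  # step 2: initialize k monomials
    # Steps 1 and 3: the k sub-intervals give the initial clusters.
    labels = [min(k - 1, i * k // n) for i in range(n)]
    for _ in range(rounds):
        # Step 5: gradient descent on the mean squared error of each cluster.
        for j, p in enumerate(degrees):
            pts = [(x, y) for x, y, lab in zip(xs, ys, labels) if lab == j]
            if not pts:
                continue
            for _ in range(gd_steps):
                grad = sum(2 * (coeffs[j] * x**p - y) * x**p for x, y in pts) / len(pts)
                coeffs[j] -= lr * grad
        # Step 4: reassign each point to the monomial that fits it best.
        labels = [
            min(range(k), key=lambda j: abs(y - coeffs[j] * x ** degrees[j]))
            for x, y in zip(xs, ys)
        ]
    return coeffs, labels


# Toy target: a line up to x = 1, a parabola afterwards.
xs = [i / 10 for i in range(1, 21)]
ys = [2 * x if x <= 1.0 else 2 * x * x for x in xs]
coeffs, labels = fit_monomials(xs, ys, degrees=[1, 2])
print(coeffs)
print(labels)
```

The learning rate has to be small enough for the largest `x**(2 * p)` in the data, otherwise the gradient step diverges.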

Added by Cristian Mesiano on December 15, 2011 at 11:00am — No Comments

In the real world a problem can rarely be solved using just a single algorithm; more often a solution is a chain of algorithms where the output of one is the input of the next.

But machine learning algorithms quite often return extremely complex functions that don’t fit directly into the next step of your strategy.

In these conditions the function approximation trick is really helpful: that is, we reduce the complexity…

Added by Cristian Mesiano on December 8, 2011 at 2:31pm — No Comments

...

As mentioned, in many cases customizing the algorithm is the only way to achieve the target, but sometimes a few tricks can improve the learning even without changing the learning strategy!

Consider for example our XOR problem solved through neural networks.

Let's see how we can considerably reduce the number of epochs required to train the net.

Added by Cristian Mesiano on November 23, 2011 at 3:05pm — No Comments

I was working on the post about the relation between IFS fractals and analytical probability density when an IT manager contacted me to ask for an informal opinion about a tool for Neural Networks.

He told me: “we are evaluating two different tools to perform…

Added by Cristian Mesiano on November 13, 2011 at 9:35pm — No Comments

How to analytically describe a set of points belonging to an irregular convex polygon:

Here are some examples…

Added by Cristian Mesiano on October 24, 2011 at 3:05pm — No Comments

(have a look at my blog for further details and examples: http://textanddatamining.blogspot.com/)

6 weak classifiers:…

Added by Cristian Mesiano on October 14, 2011 at 12:23am — No Comments

An implementation using AMPL & SNOPT

http://textanddatamining.blogspot.com/2011/09/support-vector-clustering-approach-to.html

Results managed and plotted via…

Added by Cristian Mesiano on September 26, 2011 at 3:36am — No Comments

- Document Clustering and Graph Clustering: graph entropy as linkage function
- Key words through graph entropy Hierarchical clustering
- Graph Entropy to extract relevant words
- Function minimization: Simulated Annealing led by variance criteria vs Nelder Mead
- Simulated Annealing: How to boost performance through Matrix Cost rescaling
- Outlier analysis: Chebyschev criteria vs approach based on Mutual Information
- Uncertainty coefficients for Features Reduction - comparison with LDA technique

- Outlier analysis: Chebyschev criteria vs approach based on Mutual Information
- ADA boost: a way to improve a weak classifier
- Simulated Annealing: How to boost performance through Matrix Cost rescaling
- Function minimization: Simulated Annealing led by variance criteria vs Nelder Mead
- Document Classification: latent semantic vs bag of words. Who is the best?
- Graph Entropy to extract relevant words
- Features Extraction: Co-occurrences and Graph clustering
