A Data Science Central Community

In the last two posts we discussed co-occurrence analysis as a way to extract features for classifying documents and to extract "meta concepts" from the corpus.

We also noticed that this approach doesn't perform better than the traditional bag of words.

I would now like to explore some derivations of this approach that take advantage of graph theory.

The graph of the co-occurrences is huge and complex. How could we reduce its complexity without a big loss of information?

The **k-core algorithm** could help with that. The k-core of a graph is its largest subgraph in which every vertex is connected to at least k other vertices of that subgraph.

To build it, you repeatedly remove the vertices of degree (out-degree, for a directed graph) less than k, until no such vertex remains.
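The removal procedure can be sketched in a few lines. This is a minimal illustration, assuming an undirected co-occurrence graph stored as a dictionary of neighbour sets; the helper names (`add_edge`, `k_core`) and the toy words are hypothetical and not taken from the corpus used in this post.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets (assumption)


def add_edge(adj: Graph, a: str, b: str) -> None:
    """Register an undirected edge between a and b."""
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)


def k_core(adj: Graph, k: int) -> Graph:
    """Repeatedly delete the vertices that have fewer than k neighbours."""
    core = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while True:
        low = [v for v, nbrs in core.items() if len(nbrs) < k]
        if not low:
            return core
        for v in low:
            for u in core.pop(v):
                if u in core:  # the neighbour may already have been removed
                    core[u].discard(v)


# Toy graph (hypothetical data): four fully connected words plus a dangling chain.
toy: Graph = {}
for a, b in [("rice", "farmer"), ("rice", "crop"), ("rice", "acre"),
             ("farmer", "crop"), ("farmer", "acre"), ("crop", "acre"),
             ("acre", "summer"), ("summer", "month")]:
    add_edge(toy, a, b)

core = k_core(toy, k=3)
print(sorted(core))  # the chain "summer", "month" is peeled away
```

Removing one vertex can push its neighbours below k, which is why the loop keeps going until a full pass finds nothing left to delete. Graph libraries usually ship this as a ready-made function (networkx, for example, has `k_core`).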

Below are the k-core components of the original graph (the k chosen here is five).

From the k-core graph we can extract the shortest path for each pair of vertices.

We can then compare the paths obtained with the documents in the corpus and keep only the paths that really occur in the documents (this trick should of course be applied only to the training set!).
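The path extraction and filtering step might look like the sketch below. It rests on assumptions the post does not spell out: the graph is an undirected dictionary of neighbour sets, documents are lists of tokens, and a path "exists" in a document when its words appear there as a contiguous run, in either direction. All function names and the toy data are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(adj, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in sorted(adj[node]):  # sorted for reproducible tie-breaking
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None


def occurs_in(path, tokens):
    """True if the words of the path form a contiguous run in the document."""
    n = len(path)
    return any(tokens[i:i + n] == path for i in range(len(tokens) - n + 1))


def extract_entities(adj, documents):
    """Keep the shortest paths that are found in at least one document."""
    entities = set()
    for a, b in combinations(sorted(adj), 2):
        path = shortest_path(adj, a, b)
        if path is None:
            continue
        for candidate in (path, path[::-1]):  # try both reading directions
            if any(occurs_in(candidate, doc) for doc in documents):
                entities.add(" ".join(candidate))
    return entities


# Toy reduced graph and training documents (hypothetical data).
edges = [("rice", "farmer"), ("farmer", "reach"), ("reach", "acre"),
         ("summer", "grain"), ("grain", "harvest")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

training_docs = [["rice", "farmer", "reach", "acre"],
                 ["summer", "grain", "harvest", "rise"]]

entities = extract_entities(graph, training_docs)
for entity in sorted(entities, key=lambda e: -len(e.split())):
    print(entity)
```

Only the training documents are passed to `extract_entities`, in line with the caveat above: matching paths against test documents would leak information into the features.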

The procedure returned 526 entities.

Here are the longest entities extracted:

1. rice farmer reach acre
2. department say last summer
3. contract trade agricultural
4. month pct increase
5. summer grain harvest
6. board trade agricultural
7. record tonne previous
8. rice farmer reach
9. cash crop include
10. high rate national
11. pay crop loan
12. cotton payment reach
13. major farm group
14. future year ago
15. future trade area
16. total tonne intervention
17. offer exporter issue
18. import tonne agricultural
19. bushel price current
20. bushel corn producer

Another interesting thing we can do with such entities is to isolate, in the k-core graph, the nodes that describe the entities and apply a clustering algorithm over the graph connections.

In this case I decided to use "community modularity" as the clustering criterion.

The clustering returns unstructured features.
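To make the idea of modularity-driven clustering concrete, here is a small sketch. The post does not say which algorithm was used, so the greedy agglomeration below is an assumption chosen for brevity: it starts from single-node communities and keeps merging the pair that raises modularity the most. Modularity is computed as Q = Σ_c [ e_c / m − (d_c / 2m)² ], where e_c is the number of edges inside community c, d_c the sum of its node degrees, and m the total number of edges. Function names and toy words are hypothetical.

```python
from itertools import combinations


def modularity(adj, communities):
    """Q = sum over communities of e_c / m - (d_c / (2 m)) ** 2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # total number of edges
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both endpoints, hence the division.
        inside = sum(1 for u in community for v in adj[u] if v in community) / 2
        degree = sum(len(adj[u]) for u in community)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def greedy_modularity(adj):
    """Merge the pair of communities that raises Q most, until none does."""
    communities = [frozenset([v]) for v in sorted(adj)]
    best_q = modularity(adj, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(adj, merged)
            if q > best_q + 1e-12:  # keep only strict improvements
                best_q, best_merge = q, merged
        if best_merge is None:
            break
        communities = best_merge
    return communities, best_q


# Toy entity graph (hypothetical data): two triangles joined by a single edge.
edges = [("rice", "farmer"), ("rice", "acre"), ("farmer", "acre"),
         ("bushel", "corn"), ("bushel", "price"), ("corn", "price"),
         ("acre", "bushel")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

communities, q = greedy_modularity(graph)
for community in communities:
    print(sorted(community))
print(round(q, 4))
```

This naive version recomputes modularity for every candidate merge, so it only suits small graphs. On a graph the size of the real k-core, a library implementation (for instance `greedy_modularity_communities` in networkx) would be the practical choice.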

Here are the clustered graph and the clustered unstructured features:
