In this two-part series, we will explore text clustering and how to get insights from unstructured data, with the aim of building an approach robust enough for production use. The first part focuses on the motivation; the second part covers the implementation.
We will build the solution in a modular way so that it can be applied to any dataset. We will also expose its functionality as an API, so that it can serve as a plug-and-play model without disrupting existing systems.
In case you are in a hurry, you can find the full code for the project on my GitHub page.
Here is a sneak peek at what the final output will look like:
It is established beyond reasonable doubt that data is the new oil. Organizations across the globe are aggressively building in-house analytics capabilities to harness this untapped treasure trove. However, sustainable business benefits from analytics initiatives remain largely elusive, as organizations have yet to discover the secret recipe that makes it all work.
According to a recent study, the average ROI from analytics initiatives is still negative for most organizations. Most organizations are in one of the following stages of evolution towards becoming a data-driven organization:
Organizations today are sitting on vast heaps of data and, unfortunately, most of it is unstructured. There is an abundance of data in the form of free-flowing text residing in our data repositories.
While there are many analytical techniques in place that help process and analyze structured (i.e. numeric) data, fewer techniques exist that are targeted towards analyzing natural language data.
In order to overcome these problems, we will devise an unsupervised text clustering approach that enables businesses to programmatically bin this data. The bins themselves are generated programmatically, based on the algorithm's understanding of the data. This reduces the volume of data that has to be reviewed and makes the broader picture easier to grasp: instead of trying to understand millions of rows, one only needs to understand the top keywords in about 50 clusters.
Based on this, a world of opportunities opens up:
This list is endless but the point of focus is a generic machine learning algorithm that can help derive insights in an amenable form from large parts of unstructured text.
The algorithm first performs a series of transformations on the free flow text data (elaborated in subsequent sections) and then performs a k-means clustering on the vectorized form of the transformed data. Subsequently, the algorithm creates cluster-wise tags, also known as cluster-centers, that are representative of the data contained in these clusters.
The solution is automated end to end and is generic enough to operate on any dataset.
The text clustering algorithm works in five stages enumerated below:-
These are elaborated below along with illustrations:-
The free flow text data is first curated in the following stages:-
These steps are best explained through the illustration below:-
Once all the documents in the corpus are transformed as explained above, a term-document matrix is created and the documents are transformed into this vector space model using the 1-gram vectorizer (see below). More sophisticated implementations use n-grams (where n is a reasonably small integer).
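As a rough illustration of this step, here is a minimal sketch in plain Python. It is not the project's actual vectorizer: the helper names (`ngrams`, `build_term_document_matrix`), the whitespace tokenization and the sample documents are all assumptions made for the example. Rows correspond to documents and columns to terms.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_term_document_matrix(documents, n=1):
    """Build a count matrix: one row per document, one column per term.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split.
    """
    tokenized = [ngrams(doc.lower().split(), n) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Made-up sample documents, assumed to be already cleaned.
docs = ["payment failed again", "payment page slow", "slow page slow login"]
vocab, tdm = build_term_document_matrix(docs)        # 1-grams
bigram_vocab, _ = build_term_document_matrix(docs, n=2)  # 2-grams
```

With `n=1` each column is a single word; passing `n=2` makes each column a pair of adjacent words, which captures a little word order at the cost of a larger, sparser matrix.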
TF-IDF weighting is an optional step that can be performed when there is high variability in the document corpus and the number of documents is extremely large (of the order of several million). This normalization increases the importance of terms that appear multiple times in the same document while decreasing the importance of terms that appear in many documents (which are mostly generic terms). The term weights are computed as follows:
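To make the weighting concrete, here is a small sketch in plain Python. It assumes the common textbook form, weight = tf(t, d) × log(N / df(t)), where tf is the raw count of term t in document d, N is the number of documents and df is the number of documents containing t. This may differ from the exact formula used in the project, and the `tf_idf` helper and the sample counts are hypothetical.

```python
import math


def tf_idf(count_matrix):
    """Weight raw counts by tf * log(N / df) (an assumed textbook variant)."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: how many documents contain each term at least once.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in count_matrix:
        weighted.append([
            count * math.log(n_docs / df[j]) if count else 0.0
            for j, count in enumerate(row)
        ])
    return weighted


# Made-up counts: rows are documents; columns are "refund", "the", "login".
counts = [
    [2, 1, 0],
    [0, 1, 0],
    [0, 1, 3],
]
weights = tf_idf(counts)
```

In this example the middle column appears in every document, so its weight drops to zero everywhere, while a term repeated within a single document keeps a high weight.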
After the TF-IDF transformation, the document vectors are passed to a K-Means clustering algorithm, which computes the Euclidean distance between each document vector and the cluster centers and groups nearby documents together.
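The clustering step can be sketched with a bare-bones version of Lloyd's algorithm, the usual way K-Means is computed. This is only an illustration in plain Python, not the project's implementation: the function names, the random initialization from existing points, and the toy vectors are assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, k, max_iter=100, seed=0):
    """Cluster vectors into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct input vectors as starting centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy document vectors forming two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = k_means(vectors, k=2)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers themselves.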
The algorithm then generates cluster tags, derived from the cluster centers, which represent the documents contained in these clusters. The clustering and auto-generated tags are best depicted in the illustration below (principal components 1 and 2 are plotted along the x and y axes respectively):
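One plausible way to turn cluster centers into readable tags is to pick the highest-weighted terms in each center vector. The sketch below assumes that interpretation; the `cluster_tags` helper, the vocabulary and the center values are made up for illustration and are not taken from the project.

```python
def cluster_tags(centers, vocabulary, top_n=3):
    """Return up to top_n highest-weighted terms for each cluster center."""
    tags = []
    for center in centers:
        # Rank term indices by their weight in this center, largest first.
        ranked = sorted(range(len(vocabulary)), key=lambda j: center[j], reverse=True)
        # Skip terms with zero weight: they say nothing about the cluster.
        tags.append([vocabulary[j] for j in ranked[:top_n] if center[j] > 0])
    return tags


# Hypothetical vocabulary and two hypothetical cluster centers.
vocabulary = ["card", "login", "password", "payment", "refund"]
centers = [
    [0.7, 0.0, 0.0, 0.9, 0.4],
    [0.0, 0.8, 0.6, 0.1, 0.0],
]
tags = cluster_tags(centers, vocabulary, top_n=2)
```

Here the first cluster would be tagged with payment-related words and the second with login-related words, which is the kind of summary that lets a reader skim clusters instead of individual rows.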
So that more users can benefit from this solution and analyze their unstructured text data, I have created a RESTful web service that users can access in two ways:
Since all computations are performed in-memory, the results are lightning fast.
A mathematical approach to understanding and analyzing natural language data could prove instrumental in unlocking the value and insights currently trapped within it, and could vastly improve our understanding of our organizations and their ecosystems. The next post will cover the ground-level implementation details and will be up soon; follow along if you are interested. The code is available on my GitHub page.