
Text Analytics/Data Mining: Reaping benefits from decentralisation of knowledge

As the industry begins to understand what can be done with massive amounts of data, Hive aptly illustrates the progress being made in this arena: it makes it simple for users to gather data without having to write Java or Python code.

Facebook considers the appropriate use of the information generated by and from its users as a critical component of its decision-making related to bringing improvements to the overall product.

According to Facebook, Hadoop has enabled the company to make better use of the data at its disposal.

The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programmes in the language of their choice. Second, the team embraced SQL as a familiar paradigm to address and operate on large data sets.

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarisation, ad hoc querying and analysis of large datasets stored in Hadoop files. Together, Hadoop and Hive have allowed Facebook to leverage its massive amounts of data by letting engineers and analysts throughout the company quickly answer questions using a SQL-like query language.

“By distributing data and computation across a tier of several hundred commodity machines, the Hadoop infrastructure supports a wide variety of data needs, from small ad-hoc requests to complex distributed multi-stage machine learning pipelines,” says Roddy Lindsay, Data Scientist, Facebook.

Hive provides a mechanism to put structure on this data, along with a simple SQL-based query language called Hive QL that lets users familiar with SQL query it. The language also lets traditional map/reduce programmers plug in custom mappers and reducers for more sophisticated analysis that the language’s built-in capabilities may not support.
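To make the idea of a custom mapper and reducer concrete, here is a minimal Python sketch of a word count written in the streaming style, where each stage reads and writes tab-separated lines. It is an illustration only, not Facebook's code: the function names, the two-column input layout (user id, free text) and the local `run_locally` helper that stands in for the cluster's sort phase are all assumptions made for this example.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record for every word in each input row.

    Assumed input layout (hypothetical): user_id<TAB>free text.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        text = fields[1]
        for word in text.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; the input must already be sorted by word."""
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(rows: Iterable[str]) -> List[str]:
    """Chain mapper and reducer in one process for experimentation.

    sorted() stands in for the sort step a cluster would perform
    between the map and reduce stages.
    """
    return list(reducer(sorted(mapper(rows))))


if __name__ == "__main__":
    sample = ["1\thappy happy day", "2\tHappy night"]
    for record in run_locally(sample):
        print(record)
```

Keeping each stage as a plain function over lines of text means the same logic can be tried on a laptop before it is plugged into a larger job.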

Lindsay, who is scheduled to speak at the 5th Annual Text Analytics Summit, to be held in the US in June this year, said that Hive was developed by a small team at Facebook to make it simple for users to gather data without having to write Java or Python code.

On how it enables a new breed of computationally intensive language analysis algorithms, Lindsay said that by handling requests in a familiar SQL-like syntax, Hive has enabled most data requests to be handled by users outside of the data team.

“Hive has enabled the decentralisation of knowledge and empowered people from across the organization to leverage Facebook’s data,” said Lindsay.

Prowess of this data warehouse infrastructure

Lindsay highlights that when it comes to dealing with the quantity of data that Facebook has, a traditional relational database doesn’t cut it; there’s too much overhead to store all those terabytes.

The solution is to keep data in flat files, and use commodity hardware to store and compute.

“Hadoop is the bones and muscles of our data infrastructure by providing a distributed file system and Map-Reduce operators on top,” explained Lindsay. “For engineers who write scripts, Hadoop allows a wide variety of complex data transformation, including distributed machine learning.”

He added, “But we found that most of the requests coming from others in the organisation were simple counts; for example, the number of users in a country who had installed a certain application. The traditional SQL operators such as JOIN and GROUP BY fit quite nicely in the latter paradigm, and many people were already familiar with the syntax. So the solution was to create a metadata layer on top of the raw log files stored in Hadoop and provide this SQL-like query language. The result of this effort is Hive, which is now an Apache project.”
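The kind of request Lindsay describes, counting the users in each country who installed a given application, can be sketched with a JOIN and a GROUP BY. The example below uses Python's built-in `sqlite3` module as a small stand-in so that it runs anywhere; it is not Hive, and the table names, column names and sample rows are all invented for illustration.

```python
import sqlite3
from typing import List, Tuple


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up users and app installs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE app_installs (user_id INTEGER, app_name TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "US"), (2, "US"), (3, "GB"), (4, "GB")],
    )
    conn.executemany(
        "INSERT INTO app_installs VALUES (?, ?)",
        [(1, "quiz"), (2, "quiz"), (3, "quiz"), (4, "chess")],
    )
    return conn


def installers_by_country(
    conn: sqlite3.Connection, app_name: str
) -> List[Tuple[str, int]]:
    """Count distinct users per country who installed the given app."""
    query = """
        SELECT u.country, COUNT(DISTINCT u.user_id) AS installers
        FROM users AS u
        JOIN app_installs AS a ON a.user_id = u.user_id
        WHERE a.app_name = ?
        GROUP BY u.country
        ORDER BY u.country
    """
    return conn.execute(query, (app_name,)).fetchall()


if __name__ == "__main__":
    db = build_sample_db()
    for country, installers in installers_by_country(db, "quiz"):
        print(country, installers)
```

The query itself is ordinary SQL, which is the point of the passage: someone who already knows JOIN and GROUP BY can express the request without writing a map-reduce job by hand.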


Lindsay and his friends have worked on a new application, called HappyFactor, built on the Facebook Platform.

“It is not at all affiliated with Facebook,” clarified Lindsay.

The tool sends users an occasional text message that asks a simple question: “How happy are you right now?”

Over time, the data collected from users’ responses can help them understand what in life makes them happy and what makes them depressed. For example, if you are truly happier when you are spending time with a particular friend or doing a particular activity, you will be able to see this trend from the charts on the site.

According to Lindsay, the idea is to create a sort of happiness diary by randomly prompting people via SMS to log how happy they are feeling and what they are doing. When enough data is gathered, there are certain trends that come up; for example, the days of the week and times of the day when you are happiest.
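A trend such as the happiest day of the week can be found by averaging the logged scores per weekday. The sketch below shows one way to do that in Python; the 1-to-10 score scale, the function name and the sample responses are assumptions for illustration and do not come from HappyFactor itself.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def mean_score_by_weekday(
    responses: Iterable[Tuple[datetime, float]]
) -> Dict[str, float]:
    """Average the happiness scores logged on each day of the week."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, score in responses:
        day = WEEKDAYS[timestamp.weekday()]  # weekday() returns 0 for Monday
        totals[day] += score
        counts[day] += 1
    # Only days that have at least one response appear in the result.
    return {day: totals[day] / counts[day] for day in WEEKDAYS if day in counts}


if __name__ == "__main__":
    # Made-up responses: (time of the SMS reply, score on an assumed 1-10 scale).
    sample = [
        (datetime(2009, 6, 1, 9, 0), 4),
        (datetime(2009, 6, 8, 9, 0), 6),
        (datetime(2009, 6, 6, 15, 0), 9),
    ]
    means = mean_score_by_weekday(sample)
    print(means)
    print("Happiest day:", max(means, key=means.get))
```

The same grouping works for times of day by replacing the weekday key with, for example, the hour of the timestamp.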

“With the unstructured data, there is an opportunity to use text analytics to extract relevant information. As a basic example, we might see that the word ‘driving’ is associated with low scores, or the word ‘Sally’ is associated with high scores. So we might suggest that you might spend less time in the car and more time with Sally to increase your happiness,” elaborated Lindsay.
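The basic example in the quote, words associated with low or high scores, can be approximated by averaging the happiness score of every entry in which a word appears. The Python sketch below does exactly that and nothing more; it is a simple stand-in rather than the method HappyFactor uses, and the 1-to-10 scale, the `min_count` threshold and the sample entries are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def word_score_means(
    entries: Iterable[Tuple[float, str]], min_count: int = 2
) -> Dict[str, float]:
    """Return the mean score of the entries that mention each word.

    Words seen in fewer than `min_count` entries are dropped, since a
    single mention says little about an association.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for score, text in entries:
        # set() counts a word once per entry, however often it is repeated.
        for word in set(text.lower().split()):
            totals[word] += score
            counts[word] += 1
    return {
        word: totals[word] / counts[word]
        for word in counts
        if counts[word] >= min_count
    }


if __name__ == "__main__":
    # Made-up diary entries: (score on an assumed 1-10 scale, activity text).
    sample = [
        (2, "driving to work"),
        (3, "driving home"),
        (9, "lunch with Sally"),
        (8, "walking with Sally"),
        (5, "reading"),
    ]
    means = word_score_means(sample)
    for word in sorted(means, key=means.get):
        print(f"{word}: {means[word]:.1f}")
```

A real analysis would also need to filter out common words such as "with" and check that a difference is more than chance, which this sketch does not attempt.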

Expectations going forward

“We’re just beginning to understand what we can do with massive amounts of data, in any context,” says Lindsay.

“The computation that used to be limited to supercomputers and government research labs is now available to anyone. Even if you only have a few dollars and a few GB of data you want to analyse, you can rent computation with a service like Amazon EC2. So obviously this is going to be an important arena, and it’s great that more and more people are getting excited about it and contributing to the ecosystem. I’m very excited to see Cloudera’s commercial offering; Hammer (Jeff Hammerbacher) and his team are very strong, and I think that they will be very successful bringing this technology outside of the traditional weblog-analysis sphere,” concluded Lindsay.

5th Annual Text Analytics Summit

Roddy Lindsay, Data Scientist, Facebook, is scheduled to speak at the 5th Annual Text Analytics Summit, to be held in the US in June this year.



Contact: Ben Satchwell by email [email protected]
