A Data Science Central Community
Unstructured Data Really Isn’t
Bradley S. Fordham, PhD (www.linkedin.com/in/drbradleyfordham)
The term “unstructured data” is truly an oxymoron. All data has structure, and in fact most data has multiple structures that allow us to inspect, analyze, transform, and derive value from it. The big question we need to ask is not “Is the data structured?” but rather “Does our current understanding of the data’s structures support the operations we want to perform?”
Consider the example of a large set of web pages. It is possible to have a number of progressively more refined structural understandings of this information, such as:
Even though this data may not be structured as traditionally as database records, it is structured. What we do not know, at least not yet, is whether our understanding of this structure supports the operations we want to perform.
This question ultimately comes down to how much of the semantics (the meaning of the information) is represented in the structural understanding that we currently have. In a database we can, in a standard and well-known way, find a “schema” that tells us where each data element can be found within the structure. There is also robust metadata, descriptive information about the data, which further explains the data elements. This includes human-readable labels, data types, constraints on the data, the organization of data elements into “entities” (e.g. the first name and last name data elements belong to an entity called Student), relationships between entities (e.g. Student “studies-with” Teacher), and more.
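As a minimal sketch of what a schema and its metadata can tell us, the following uses Python's built-in sqlite3 module. The table and column names (student, teacher, teacher_id, and so on) are hypothetical, chosen only to mirror the Student and Teacher entities mentioned above; a real database would have its own layout.

```python
import sqlite3

# In-memory database with a hypothetical Student/Teacher schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (
    teacher_id INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL
);
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    teacher_id INTEGER REFERENCES teacher(teacher_id)  -- "studies-with"
);
""")

def describe_columns(table):
    """Return (column name, declared type) pairs from the schema metadata."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

def describe_relationships(table):
    """Return (local column, referenced table, referenced column) triples."""
    # PRAGMA foreign_key_list rows are (id, seq, table, from, to, ...).
    return [(row[3], row[2], row[4])
            for row in conn.execute(f"PRAGMA foreign_key_list({table})")]

student_columns = describe_columns("student")
student_relationships = describe_relationships("student")
print(student_columns)
print(student_relationships)
```

Nothing here inspects the data itself. The labels, types, and the link between Student and Teacher are all recovered from the structure the database already publishes about itself.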
In an HTML file, on the other hand, the structure is not always as revealing of the deeper meaning. I can probably figure out that a particular piece of data is a title when it is found within a <title></title> tag pair. I may know that another piece of data should be underlined or emphasized because of how it is tagged, but I would not know with any confidence why. Presumably this information is important, but at this level of structural understanding we run out of clues as to what we can attribute that importance to. Of course, this was by design. The HyperText Markup Language (HTML) was designed to structurally convey the meaning of “how to render the information”, typically within a web browser, as visible or audible web page experiences. So:
Are HTML pages unstructured? Absolutely not.
Is this structure sufficient for the purpose of rendering a visual or audible experience to a web surfer? Certainly.
Are the semantics we understand at the level of HTML tags alone sufficient for finding all the students in Mr. Johnson’s 3rd grade science class, even if that information is clearly part of the content of these pages? No.
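To make the HTML case concrete, here is a rough sketch using Python's standard html.parser module. The sample page, the student names, and the TagTextCollector class are invented for illustration. The sketch records which tag encloses each piece of text, which is about all the HTML level of structure can tell us.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collects (enclosing tag, text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self._open_tags = []   # stack of currently open tags
        self.found = []        # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            self.found.append((self._open_tags[-1], text))

# Hypothetical page content, made up for this example.
page = (
    "<html><head><title>Science Class Roster</title></head>"
    "<body><p>Mr. Johnson teaches <em>Ana Ruiz</em> and <em>Ben Cole</em>.</p>"
    "</body></html>"
)

collector = TagTextCollector()
collector.feed(page)
collector.close()

titles = [text for tag, text in collector.found if tag == "title"]
emphasized = [text for tag, text in collector.found if tag == "em"]
print(titles)
print(emphasized)
```

The parser can say that two strings are emphasized, but nothing in the tags says they are students, or whose students they are. That meaning lives in the prose, not in the markup.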
Lucky for us, HTML (or more precisely XHTML, since it enforces the syntax more rigorously) is an application of the eXtensible Markup Language (XML), which in turn is a simplified subset of the Standard Generalized Markup Language (SGML). At these more general levels of structure we can achieve deeper levels of semantic understanding. In fact, we can find schemata very similar to what we see in databases. So it is indeed possible that the data we have been given is sufficient for the task of finding Mr. Johnson’s 3rd grade science students, if we can just raise our level of understanding of the structure of the information.
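As an illustration of that higher level of understanding, suppose the same information were available as XML whose element names carry the domain meaning. The vocabulary below (school, class, teacher, student, firstName, lastName) is an assumption made up for this sketch, not a real standard, and the names are invented. With it, the question becomes a short query using Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML vocabulary; element and attribute names are assumptions.
doc = """
<school>
  <class grade="3" subject="science">
    <teacher>Mr. Johnson</teacher>
    <student><firstName>Ana</firstName><lastName>Ruiz</lastName></student>
    <student><firstName>Ben</firstName><lastName>Cole</lastName></student>
  </class>
  <class grade="4" subject="science">
    <teacher>Ms. Lee</teacher>
    <student><firstName>Cy</firstName><lastName>Park</lastName></student>
  </class>
</school>
"""

def students_of(root, teacher, grade, subject):
    """Return full names of students in the matching teacher's class."""
    names = []
    for cls in root.findall("class"):
        matches = (
            cls.get("grade") == grade
            and cls.get("subject") == subject
            and cls.findtext("teacher") == teacher
        )
        if matches:
            for student in cls.findall("student"):
                first = student.findtext("firstName")
                last = student.findtext("lastName")
                names.append(f"{first} {last}")
    return names

root = ET.fromstring(doc)
johnson_students = students_of(root, "Mr. Johnson", "3", "science")
print(johnson_students)
```

The data is the same as before. What changed is that the structure now names entities and relationships, much as a database schema does, so the query can be written against meaning rather than presentation.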
In conclusion, the next time someone starts to talk to you about unstructured data, think “balderdash!” quietly, but loudly, to yourself and start asking the right question: is your understanding of the structure that exists sufficient to answer the questions or solve the problems you want to address with the data at hand? If the answer is initially no, do not give up too quickly. Perhaps you can raise your sights a bit and reach another level of structural understanding that is sufficient for the challenges at hand.
Dr. Bradley Fordham PhD
The (ART+DATA) Institute