How to track and visualize data lineage

Data lineage is about tracking the flow of information. It is essential for guaranteeing the quality, usability and security of your data. For large organizations, it is also a key compliance requirement. With Linkurious, it is possible to use a graph-based approach to solve these challenges.

What is data lineage?

The success of an organization depends on the quality, usability and security of its data. Want to provide amazing support to your customers? Create new products and services? Respect legal requirements? The best companies approach these issues in a data-driven way.

But when your management looks at the quarterly sales report, do you know exactly what data they are looking at? Sometimes bad data can be more dangerous than no data. That’s why data lineage is so important.

Data lineage “is defined as a data life cycle that includes the data’s origins and where it moves over time”. For large organizations, that life cycle can be quite complex, as data flows from files to databases and reports while going through various transformation processes. Tracking the provenance of a specific data point is very challenging.

Example of a real-life data pipeline at Pinterest.

Part of the issue is due to the limitations of RDBMS when it comes to connected data:

  • querying connected data through SQL is a hard and error-prone process;
  • performance is slow for questions that require looking up multiple connections (like getting the full data lineage of a given property);
  • it’s hard to accommodate an evolving data model in a relational database.

Graph databases like Neo4j are a perfect match for the challenges of data lineage:

  • it’s easy to model the flow of data in a graph;
  • you can query relationships with ease and in real time;
  • a graph schema can evolve to accommodate new data and relationships.

We are going to show you how to use Linkurious to build a powerful and easy-to-use data lineage system leveraging Neo4j.

Using a Neo4j graph database to power your metadata management

To build an effective data lineage system, it is necessary to map the various data elements and the processes or algorithms they go through. To be thorough we’d have to track the files, the tables, views, columns and reports in databases, the ETL jobs, etc.

For clarity, we have prepared a small dataset that focuses on four types of entities: the metadata, the systems, the processes and the reports.

Data lineage model.

Metadata summarizes basic information about data, for example the name of a column in a database and its type. A metadata element can flow through a process (an ETL job, a SQL query, program code, etc.) to another metadata element. It is stored in a system (like a database) and can be used in a report (a set of data accessible to end users through a visual interface).
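To make the model concrete, here is a minimal Python sketch of the four entity types and the relationships between them. It is only an illustration under stated assumptions: FLOWS_TO is the one relationship type the article names, while STORED_IN, the edge rules and the Total_Sales report name are hypothetical stand-ins (order_total and sales_db come from the example further down).

```python
from dataclasses import dataclass

# Hypothetical edge rules: (source label, relationship, target label).
# Only FLOWS_TO appears in the article; STORED_IN is an assumed name.
ALLOWED_EDGES = {
    ("METADATA", "FLOWS_TO", "PROCESS"),
    ("PROCESS", "FLOWS_TO", "METADATA"),
    ("METADATA", "FLOWS_TO", "REPORT"),
    ("METADATA", "STORED_IN", "SYSTEM"),
}


@dataclass(frozen=True)
class Node:
    name: str
    label: str  # one of METADATA, SYSTEM, PROCESS, REPORT


@dataclass(frozen=True)
class Edge:
    source: Node
    rel: str
    target: Node


def is_valid(edge: Edge) -> bool:
    """Check an edge against the assumed model rules."""
    return (edge.source.label, edge.rel, edge.target.label) in ALLOWED_EDGES


order_total = Node("order_total", "METADATA")
sales_db = Node("sales_db", "SYSTEM")
total_sales = Node("Total_Sales", "REPORT")  # hypothetical report name

edges = [
    Edge(order_total, "STORED_IN", sales_db),
    Edge(order_total, "FLOWS_TO", total_sales),
]
```

Keeping the rules in one set means a new entity or relationship type only requires a new entry, which mirrors how a graph schema can evolve.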

Having the data in Neo4j allows us to ask questions such as: what is the data lineage of a report? For that kind of query, we can use Cypher, the Neo4j query language. Here is, for example, how to find out where the data in our “Employee count” report comes from:

// Data lineage of the "Employee count" report
MATCH (a)-[:FLOWS_TO*]->(b:REPORT {name: 'Employee_Count'})
RETURN a,b

That query returns every entity connected to the report through a chain of one or more FLOWS_TO relationships.
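For readers less familiar with variable-length paths, the following Python sketch shows roughly what the FLOWS_TO* pattern computes: every node that has a directed path to the report. It is an in-memory illustration, not how Neo4j executes the query, and all node names except Employee_Count and order_total are made up for the example.

```python
from collections import defaultdict, deque


def upstream(edges, target):
    """Return every node that has a directed path to `target`."""
    # Index edges by destination so we can walk the graph backwards.
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical FLOWS_TO edges, given as (source, destination) pairs.
flows_to = [
    ("hr_db.employees.id", "count_employees_job"),
    ("count_employees_job", "employee_count"),
    ("employee_count", "Employee_Count"),
    ("order_total", "Total_Sales"),
]

lineage = upstream(flows_to, "Employee_Count")
```

The traversal only follows edges that lead to the target, so unrelated branches such as order_total are never visited.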

Data lineage visualized through Neo4j.

Here are a few other questions we can use Cypher to answer:

  • is my database still being used in an important company process or can I remove it?
  • what systems and reports would be impacted by a change in a particular process?
  • which data is used by whom?
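The second question, impact analysis, is the same traversal run in the opposite direction: start from a process and follow the flow downstream. The sketch below assumes a plain list of FLOWS_TO edges and a dictionary of node labels. Apart from order_total, every name in it is hypothetical.

```python
def impacted(edges, labels, start, kinds=("REPORT",)):
    """Return the downstream nodes of `start` whose label is in `kinds`."""
    # Index edges by source so we can walk the graph forwards.
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(n for n in seen if labels.get(n) in kinds)


# Hypothetical labels and FLOWS_TO edges.
labels = {
    "order_total": "METADATA",
    "aggregate_sales": "PROCESS",
    "sales_by_region": "METADATA",
    "Total_Sales": "REPORT",
    "Regional_Sales": "REPORT",
}
edges = [
    ("order_total", "aggregate_sales"),
    ("aggregate_sales", "sales_by_region"),
    ("sales_by_region", "Total_Sales"),
    ("sales_by_region", "Regional_Sales"),
]

reports = impacted(edges, labels, "aggregate_sales")
```

Passing a different `kinds` tuple, for instance one that includes "SYSTEM", would widen the answer if the graph links systems into the flow.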

Graph visualization can help business users investigate data lineage

A graph visualization solution like Linkurious can complement Neo4j by giving technical and business users the ability to analyze data lineage and find answers.

With Linkurious, it’s possible to search for any property in your graph through a simple search bar and display it. You can then explore the graph by expanding the relationships of your choice. It’s easy to drill down into the data and find answers. That’s the difference between having a theoretical capability to track data lineage and an analyst being able to quickly and confidently answer a question about the provenance of their data.

For example, if I want to know what data is used for my sales report, I simply look up the report via the search bar and add it to my visualization.

Visualizing the total sales report as a node.

I can then explore its connections. In a few seconds I can find out that the origin of my report is the order_total metadata stored in the sales_db.

Complete data lineage with all the processes, metadata and systems associated with the report.

Graph visualization and Neo4j facilitate data lineage. You can try Linkurious now and extract new insights from your data!
