A Data Science Central Community
Data lineage is about tracking the flow of information. It is necessary to guarantee the quality, usability and security of your data. For large organizations, it is also a key compliance requirement. With Linkurious, it is possible to use a graph-based approach to solve these challenges.
The success of an organization depends on the quality, usability and security of its data. Want to provide amazing support to your customers? Create new products and services? Respect legal requirements? The best companies approach these issues in a data-driven way.
But when your management looks at the quarterly sales report, do you know exactly what data they are looking at? Sometimes bad data can be more dangerous than no data. That’s why data lineage is so important.
Data lineage “is defined as a data life cycle that includes the data’s origins and where it moves over time”. For large organizations, that life cycle can be quite complex, as data flows from files to databases and reports while going through various transformation processes. Tracking the provenance of a specific data point is very challenging.
Part of the issue is due to the limitations of RDBMS when it comes to connected data:
Graph databases like Neo4j are a perfect match for the challenges of data lineage:
We are going to show you how to use Linkurious to build a powerful and easy-to-use data lineage system leveraging Neo4j.
To build an effective data lineage system, it is necessary to map the various data elements and the processes or algorithms they go through. To be thorough we’d have to track the files, the tables, views, columns and reports in databases, the ETL jobs, etc.
For clarity, we have prepared a small dataset that focuses on four types of entities: the metadata, the systems, the processes and the reports.
A metadata summarizes basic information about data. It can be, for example, the name of a column in a database and its type. A metadata can flow through a process (an ETL job, a SQL query, program code, etc.) to another metadata. It is stored in a system (like a database) and can be used in a report (a set of data accessible to end users through a visual interface).
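To make the four entity types concrete, here is a minimal Python sketch of such a model as a list of typed nodes and relationships. It is only an illustration, not the dataset used in this article: `FLOWS_TO`, `order_total` and `sales_db` appear in the article, while `STORED_IN`, `aggregate_sales`, `sales_total` and `Sales_Report` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str  # one of "METADATA", "SYSTEM", "PROCESS", "REPORT"
    name: str


@dataclass(frozen=True)
class Edge:
    source: Node
    kind: str   # relationship type, e.g. "FLOWS_TO"
    target: Node


# Sample entities (all names except order_total and sales_db are hypothetical).
order_total = Node("METADATA", "order_total")
sales_total = Node("METADATA", "sales_total")
sales_db = Node("SYSTEM", "sales_db")
aggregate_sales = Node("PROCESS", "aggregate_sales")
sales_report = Node("REPORT", "Sales_Report")

edges = [
    # A metadata is stored in a system (STORED_IN is an assumed name).
    Edge(order_total, "STORED_IN", sales_db),
    # A metadata flows through a process to another metadata...
    Edge(order_total, "FLOWS_TO", aggregate_sales),
    Edge(aggregate_sales, "FLOWS_TO", sales_total),
    # ...and ends up being used in a report.
    Edge(sales_total, "FLOWS_TO", sales_report),
]


def labels_in(edge_list):
    """Return the set of entity types that appear in the edge list."""
    return {n.label for e in edge_list for n in (e.source, e.target)}
```

In Neo4j, each `Node` would become a labeled node and each `Edge` a typed relationship; the sketch only mirrors that shape in memory.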
Having the data in Neo4j allows us to ask questions like: what is the data lineage of a report? For that kind of query, we can use Cypher, the Neo4j query language. Here is, for example, how to find out where the data in our “Employee count” report comes from:
// Data lineage of the "Employee count" report
MATCH (a)-[:FLOWS_TO*]->(b:REPORT {name: 'Employee_Count'})
RETURN a, b
That query will return all the entities which are involved in my report.
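For readers less familiar with variable-length patterns, the sketch below shows roughly what `FLOWS_TO*` does: it walks the `FLOWS_TO` relationships backwards from the report and collects everything it reaches. This is a plain-Python approximation, not how Neo4j executes the query, and every node name other than `Employee_Count` and `order_total` is hypothetical.

```python
from collections import deque

# (source, target) pairs standing in for FLOWS_TO relationships.
# Only Employee_Count and order_total come from the article.
FLOWS_TO = [
    ("employee_id", "count_employees"),
    ("count_employees", "headcount"),
    ("headcount", "Employee_Count"),
    ("order_total", "Sales_Report"),
]


def upstream(edges, target):
    """Return every node that reaches `target` through one or more edges."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


lineage = upstream(FLOWS_TO, "Employee_Count")
```

Here `lineage` contains the three nodes on the path into `Employee_Count`, and `order_total` is left out because it only feeds the other report.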
Here are a few other questions we can use Cypher to answer:
A graph visualization solution like Linkurious can complement Neo4j by giving tech people and business users the ability to analyse data lineage and find answers.
With Linkurious, it’s possible to search any property in your graph through a simple search bar and display it. You can then explore the graph by expanding the relationships of your choice. It’s easy to drill down in the data and find answers. That’s the difference between having a theoretical capability to track data lineage and an analyst being able to quickly and confidently answer a question about the provenance of their data.
For example, if I want to know what data is used for my sales report, I simply look up the report via the search bar and add it to my visualization.
I can then explore its connections. In a few seconds I can find out that the origin of my report is the order_total metadata stored in the sales_db.
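The same exploration can be approximated in code. The sketch below, under the assumption that an origin is an upstream node with nothing flowing into it, finds the origins of a report and looks up the system each one is stored in. `order_total` and `sales_db` come from the article; `aggregate_sales`, `Sales_Report` and the `STORED_IN` mapping are hypothetical.

```python
# (source, target) pairs standing in for FLOWS_TO relationships.
FLOWS_TO = [
    ("order_total", "aggregate_sales"),   # hypothetical process name
    ("aggregate_sales", "Sales_Report"),  # hypothetical report name
]

# Which system each metadata is stored in (assumed relationship).
STORED_IN = {"order_total": "sales_db"}


def origins(flows, report):
    """Return upstream nodes of `report` that have no incoming flow."""
    parents = {}
    for src, dst in flows:
        parents.setdefault(dst, set()).add(src)

    found = set()
    seen = {report}
    stack = [report]
    while stack:
        node = stack.pop()
        ups = parents.get(node, set())
        if not ups and node != report:
            found.add(node)  # nothing flows into this node: it is an origin
        for parent in ups:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return found


# Map each origin to the system it is stored in (None if unknown).
result = {o: STORED_IN.get(o) for o in origins(FLOWS_TO, "Sales_Report")}
```

With this sample data, `result` maps `order_total` to `sales_db`, which is the answer the visual exploration arrives at.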
Graph visualization and Neo4j facilitate data lineage. You can try Linkurious now and extract new insights from your data!
© 2021 TechTarget, Inc.