Data Lineage: Case Studies of Data-Driven Businesses

Parth Shukla Last Updated : 29 Nov, 2022


This article was published as a part of the Data Science Blogathon.

Introduction

Data lineage is the process of tracing the path data takes through an organization: where it originates, how it is transformed, and how it is used over time. Many businesses use it to understand the source of their data, the path it follows, and how it is consumed. This insight helps organizations plan their next steps and use their data to improve product or service performance.

In this article, we will discuss three case studies in which data-driven companies (Netflix, Slack, and Postman) implemented data lineage and benefitted from it. We will also look at the process and techniques each company applied while implementing and using it.

Data Lineage Case Studies

Data-driven businesses such as Netflix, Slack, UBS, Postman, and Airbnb are convinced of the benefits of data lineage and are now using it and reaping the returns. Let us discuss the data lineage process at three of these companies and how they have benefitted from it.

Case 1: Improved data infrastructure reliability and efficiency at Netflix

Netflix has been convinced of the benefits of data lineage and has implemented it. At the project’s inception, the team defined design goals to guide the architecture and development work toward a complete, accurate, reliable, and scalable lineage system that maps Netflix’s diverse data landscape. A few of these principles are:

Ensure data integrity
Enable seamless integration
Design a flexible data model

Building on a standard data model at the entity level, they built a generic relationship model that describes the dependencies between any pair of entities. With this approach, they maintain a unified data model and repository that supports multiple use cases, such as data discovery, SLA service, and data efficiency.
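To make the idea of a generic relationship model more concrete, here is a minimal Python sketch. It is not Netflix's implementation: the class names (Entity, Relationship, LineageRepository), the entity kinds, and the relation labels are all hypothetical and chosen only for illustration. It shows one way a single relationship type can describe a dependency between any pair of entities.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A node in the lineage graph (hypothetical kinds: "table", "job", ...)."""
    kind: str
    name: str


@dataclass(frozen=True)
class Relationship:
    """A directed dependency between any two entities: source feeds target."""
    source: Entity
    target: Entity
    relation: str  # hypothetical label, e.g. "read_by" or "produces"


class LineageRepository:
    """Stores relationships and answers simple one-hop dependency questions."""

    def __init__(self):
        self._outgoing = defaultdict(list)
        self._incoming = defaultdict(list)

    def add(self, rel):
        self._outgoing[rel.source].append(rel)
        self._incoming[rel.target].append(rel)

    def dependents_of(self, entity):
        """Entities that directly consume `entity`."""
        return [r.target for r in self._outgoing.get(entity, [])]

    def dependencies_of(self, entity):
        """Entities that `entity` directly depends on."""
        return [r.source for r in self._incoming.get(entity, [])]


# Hypothetical example: a table is read by a job, which produces another table.
raw = Entity("table", "raw_events")
job = Entity("job", "sessionize")
sessions = Entity("table", "sessions")

repo = LineageRepository()
repo.add(Relationship(raw, job, "read_by"))
repo.add(Relationship(job, sessions, "produces"))

print(repo.dependents_of(raw))
print(repo.dependencies_of(sessions))
```

Because every dependency is stored in the same shape, use cases such as data discovery can query one repository instead of one store per entity type.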

Case 2: Easy operational maintenance and better execution of data programs at Slack

Slack has also been convinced of the benefits of data lineage and has invested in it. Slack states that as datasets become more complex and the number of contributors grows, it becomes increasingly challenging to understand the relationships between different data sources.

To make their lineage data easier to use, they produced a flattened version of their lineage tables and stored it in Hive. The flattened table lets people query lineage data directly in the data warehouse and makes queries for typical use cases easier to write and run.
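As a rough illustration of what flattening lineage can mean, the Python sketch below expands direct upstream-to-downstream edges into one row per (ancestor, descendant, depth) pair. This is an assumption about the general technique, not Slack's actual schema or code; the function name flatten_lineage and the table names are hypothetical.

```python
from collections import defaultdict, deque


def flatten_lineage(edges):
    """Expand direct (upstream, downstream) edges into every
    (ancestor, descendant, depth) row reachable through the graph."""
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)

    rows = []
    for root in list(children):
        seen = {root}
        queue = deque((child, 1) for child in children[root])
        # Breadth-first search, so each descendant gets its shortest depth.
        while queue:
            node, depth = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            rows.append((root, node, depth))
            queue.extend((nxt, depth + 1) for nxt in children.get(node, []))
    return sorted(rows)


# Hypothetical direct edges between tables.
edges = [
    ("raw_events", "sessions"),
    ("sessions", "daily_active_users"),
    ("sessions", "retention"),
]

flat = flatten_lineage(edges)
for row in flat:
    print(row)
```

With rows in this shape, a question like "what is downstream of raw_events?" becomes a simple filter on the ancestor column rather than a recursive query.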

With the help of data lineage, they have also built a notification system. Notification tooling on their internal Data Portal lets people use lineage information to notify downstream consumers. The portal has a notify button that dataset owners can use for this.
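A lineage-driven notification can be sketched as a graph walk: start at the dataset that changed, collect everything downstream, and look up who owns it. The Python below is only an illustrative sketch under that assumption. The function downstream_owners, the dataset names, and the team names are hypothetical, and nothing here reflects how Slack's Data Portal is actually built.

```python
from collections import deque


def downstream_owners(dataset, edges, owners):
    """Return the sorted owners of every dataset downstream of `dataset`."""
    children = {}
    for upstream, downstream in edges:
        children.setdefault(upstream, []).append(downstream)

    seen = {dataset}
    queue = deque([dataset])
    to_notify = set()
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                if child in owners:
                    to_notify.add(owners[child])
    return sorted(to_notify)


# Hypothetical lineage edges and dataset owners.
edges = [
    ("raw_events", "sessions"),
    ("sessions", "retention_dashboard"),
    ("raw_events", "billing_export"),
]
owners = {
    "raw_events": "platform-team",
    "sessions": "analytics-team",
    "retention_dashboard": "growth-team",
    "billing_export": "finance-team",
}

recipients = downstream_owners("raw_events", edges, owners)
print(recipients)
```

The owner of the changed dataset itself is left out of the result, since the point is to reach the consumers who would otherwise be surprised by the change.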

Case 3: Moving beyond data discovery at Postman

Postman has also filled in a missing layer in its data system. Early on, Postman’s data system was quite simple: they had a set of data tables, and knowledge about those tables lived in the heads of the early data team members. This worked while the company and its data were small, but it struggled to keep up as the company started to grow exponentially.

Postman currently has hundreds of team members distributed across four continents and more than 17 million users from 500,000 companies using their API platform.

Postman co-founder and CTO Ankit Sobti wanted to ensure that data was democratized. He said that it is challenging for a data engineering team to gain insights from data at any given time of day, and he believed that everyone in the company should be able to access the data and gain insights. The problem became more pressing in 2020, when Postman went fully remote due to the COVID pandemic.

To address this issue, the data team decided to take on Postman’s data system as a project. Their main goal was to use data lineage to make Postman’s data easier to access and understand, both for new hires on the data team and for people across the company.

They used data lineage to learn where data comes from and how it connects to other layers. Lineage helped them understand how the data is connected and trace the bugs and errors that occur on the system each day. It also helped them resolve issues faster: without having to ask around, the team could solve a problem just by looking at the data lineage. They are also planning further steps in data lineage to make their data management quicker and more accessible.

When Data Lineage Is of Little Use for Some Organizations

Data lineage has proven to be a good fit for most organizations that work with data and data management. Still, there are cases where it proves to be of little use.

Some organizations store large amounts of data and work with many data sources and storage systems. Data lineage can be of little use for such an organization, as it may not provide reliable information for data at that scale.

Data lineage provides information about the data sources and the entire lifecycle of the data. The design lineage, in particular, gives an idea of the data’s origin and consumption. This is helpful for architects who need to understand how data flows are implemented. However, subject matter experts in the business who wish to audit the data processing can find it complex to navigate.

Business lineage provides a simplified, business-oriented view on top of the design lineage. A business lineage report may show only the significant systems, or it may hide the systems and job structures and show only the transformations.
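The relationship between the two views can be sketched in Python as a graph reduction: start from a design-level graph that includes every job, and keep only the edges between significant systems. This is a minimal sketch under the assumption that each node carries a kind label. The function business_view and the node names are hypothetical and do not come from any particular lineage tool.

```python
def business_view(edges, kinds, keep="system"):
    """Collapse a design-level lineage graph to edges between `keep` nodes only."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    def next_kept(node, seen):
        """Follow edges from `node` through hidden nodes until kept nodes are reached."""
        found = set()
        for child in children.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            if kinds.get(child) == keep:
                found.add(child)
            else:
                found |= next_kept(child, seen)
        return found

    result = set()
    for node, kind in kinds.items():
        if kind == keep:
            result |= {(node, target) for target in next_kept(node, {node})}
    return sorted(result)


# Hypothetical design lineage: systems connected through intermediate jobs.
kinds = {
    "crm": "system",
    "extract_job": "job",
    "clean_job": "job",
    "warehouse": "system",
    "report_job": "job",
    "bi_tool": "system",
}
edges = [
    ("crm", "extract_job"),
    ("extract_job", "clean_job"),
    ("clean_job", "warehouse"),
    ("warehouse", "report_job"),
    ("report_job", "bi_tool"),
]

simplified = business_view(edges, kinds)
print(simplified)
```

Five design-level edges reduce to two system-to-system edges here, which is the kind of simplification a business user auditing the flow would want to see.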

Data lineage is thus designed to show how data flows, quickly and easily, rather than to search for individual items. Suppose an organization works with a large amount of data, or with discrete data sources that change frequently. Lineage can show the flowchart or lifecycle of the data, but it will not help the organization find the specific information it wants within that data, and its results will be reliable only for smaller or more stable datasets. Hence, it proves to be of limited use for organizations working with large volumes of frequently changing data.

Conclusion

In this article, we discussed case studies of data-driven companies that implemented data lineage and benefitted from it. We saw how companies like Netflix, Slack, and Postman applied the concept to their data systems with positive results. Knowing how these companies approach data lineage helps one understand how large data companies use it, and it also helps in answering related questions in data engineering interviews.

Some key takeaways from this article are:

1. Today, many data-driven companies use data lineage for better data governance and handling.

2. Companies with many data sources can implement data lineage to quickly get a clearer picture of the data being used.

3. It may be of little use for companies that generate only a small amount of data, or for startups with lightweight databases.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
