Harshit Ahluwalia — January 17, 2022
Beginner Data Engineering Data Science

Overview

In this article, I will walk you through the layers of the Data Platform Architecture. First of all, let’s understand what is a Layer, a layer represents a serviceable part that performs a precise job or set of tasks in the data platform. The different layers of the data platform architecture that we are going to discuss in this article include the Data ingestion layer, Data storage layer, Data processing layer and Analysis, User interface layer, and Data Pipeline layer. If you are new to Data Engineering, then follow these top 9 skills required to be a data engineer.

Source: Author

Table of content

  1. Data Collection Layer or Data ingestion Layer
  2. Data Storage Layer or Integration Layer
  3. Data Processing Layer
  4. Analysis and User Interface Layer
  5. Data Pipeline Layer

 

Data Collection Layer

 

Source: Author

This is the first layer of the data platform architecture. The Data Collection layer as the name suggests is responsible for connecting to the source systems and bringing data into the data platform in a periodic manner. This layer performs the following tasks:

 

  1. This layer is responsible for connecting to the data sources.
  2. This layer is responsible for transferring data from data sources to the data platform in streaming mode or batch mode or both.
  3. Moreover, this layer is responsible for maintaining the information about the data collected in the metadata repository. For example, how much data is gobbled into the data platform and other descriptive information?

 

There are various tools that are available in the market, but some of the popular tools include Google Cloud Data Flow, IBM Streams, Amazon Kinesis and Apache Kafka are some of the tools that are used for data ingestion that supports both batch and streaming modes.

 

Source: Author

Once the data is gobbled, it needs to be stored and integrated into the data platform like the way we store food in the stomach. For storing and integrating the data, we head towards the second layer of the data platform that is the Data storage layer or Data Integration layer.

Data Storage and Data Integration Layer

 

Source: Author

This is the second layer of the data platform architecture. The Data collection layer as the name suggests is responsible for storing data for processing and long-term use. Moreover, this layer is also responsible for making data available for processing in both streaming and batch modes. As this layer is responsible for making data available for processing, it needs to be reliable, scalable, high-performing and cost-efficient. IBM DB2, IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and PostgreSQL are some of the popular relational databases. But nowadays, cloud-based relational databases gained popularity over the recent years, some cloud-based relational databases are IBM DB2, Google Cloud SQL and SQL Azure. In the NoSQL or non-relational database systems on the cloud, we have IBM Cloudant, Redis, MongoDB, Cassandra, and Neo4J. Tools for integration includes IBM’s cloud Pak for Data, IBM’s cloud Pak for Integration and Open Studio. Once the data has been ingested, stored and integrated, it needs to be processed. So with this, we move forward to Data Processing Layer

 

Data Processing Layer

 

Source: Author

This is the third layer of the data platform architecture. As the name suggests, this layer is responsible for a processing task. The processing includes data validations, transformations and applying business logic to the data. The processing layer should be able to perform some tasks that include:

  1. Read data in batch or streaming modes from storage and apply transformations.
  2. Support popular querying tools and programming languages.
  3. Scale to meet the processing demands of a growing dataset.
  4. Provide a way for analysts and data scientists to work with data in the data platform.

The transformation task that usually occurs in this layer include:

  1. Structuring: These are the actions that change the structure of the data. This change can be simple or complex in nature. The simple one can also be like changing the arrangement of fields within the record or dataset or complex as combining fields complex structures using joins and unions.
  2. Normalization: This part focuses on reducing redundancy and inconsistency. It also focuses on cleaning the database of unused data.
  3. Denormalization: Denormalization is the task of combining data from multiple tables into a single table so that it can be queried more efficiently for reporting and analysis purposes.
  4. Data Cleaning: Data Cleaning, which fixes irregularities in data to provide credible data for downstream applications and uses.

 

There are numerous amount of tools available in the market for performing these operations on the data includes Such as spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, and Trifacta Wrangler. Python and R also offer several libraries and packages that are explicitly created for processing data. It’s very important to know that storage and processing are not always been performed in separate layers. For example, in relational databases, storage and processing are both occur in the same layer while in Big data systems, data is first stored in the Hadoop File Distributed system and then processed in the data processing engine such as Spark.

 

Analysis and User Interface Layer

 

Source: Author

This is the fourth layer of the data platform architecture. This layer is responsible for delivering the process data to the end-users that including business intelligence analysts and business stakeholders who consume these data with the help of interactive dashboards and reports, moreover, data scientists and data analysts fall under this end-user category that further process this data for the specific use case. This layer needs to support querying tools such as SQL tools and No-SQL tools and programming languages like Python, R and Java and moreover, these layers need to support API’s that can be used to run reports on data for both online and offline processing.

 

Data Pipeline Layer

 

Source: Author

This is the last layer of this architecture, this layer is responsible for implementing and maintaining a continuous flow of data through this data pipeline. It is the layer that has the capability to extract, transform and load tools. There are a number of data pipeline solutions available, most popular among them being Apache Airflow and DataFlow.

 

End Notes

In this article, you learned about the layers of a data platform architecture. This is a simplified rendition of a complex architecture that supports a broad spectrum of tasks.

About the Author

Harshit Ahluwalia

Our Top Authors

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *