The DataHour Synopsis: Learning Path to Master Data Engineering in 2022

ankita184 Last Updated : 14 Jun, 2022

Data is the new oil of the industry. The way raw oil empowers the industrial economy, data is empowering the information economy.

Anonymous

Overview

Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging, we began a new initiative, “DataHour”: a series of webinars in which top industry experts teach and democratize data science knowledge. On 23rd April 2022, we were joined by Mr. Shashank Mishra for a DataHour session on “Learning Path to Master Data Engineering in 2022”.

Shashank is an experienced Data Engineer with a demonstrated history of working in service and product companies such as Amazon, Paytm and McKinsey. Currently, Shashank is an active contributor to the Data Science and Data Engineering community through his podcasts and YouTube channel (E-Learning Bridge).

Are you excited to dive deeper into the world of Data Engineering? We’ve got you covered. Let’s get started with the major highlights of the session “Learning Path to Master Data Engineering in 2022”.

Introduction

In today’s technologically advanced world, every company or organization depends heavily on data-driven business decisions. The collected data is huge, and extracting the crux of it to make business decisions is no cakewalk. It happens in phases and needs a well-designed system or pipeline. This is where data engineers and data engineering come into the picture: data engineers put the data into an accurate, usable format for better business decision making. Now, let’s dive into the roadmap that will lead us to become successful data engineers.

Who are Data Engineers?

Data engineers are the front-liners whenever a company works on a big data solution, because their core task is to create the scalable, optimized data pipelines that every company requires. Creating a data pipeline or ETL (extract, transform, load) process is not simply a matter of bringing data from some source and putting it into a downstream system. Software engineering also plays a vital role: data engineers need to design scalable systems built on distributed computation engines that can handle the load. The pipelines must be scalable enough that, even if there is a spike in the data, the pipeline doesn’t break or take too long to process it.

The Roadmap that Leads us to our Destination

Data Engineering in 2022

Programming language

You must be wondering whether programming, or an understanding of programming, is actually required in data engineering. If so, you are thinking in the right direction.

Being well equipped with a programming language is a necessity in this domain, because creating scalable and optimized data pipelines involves applying data transformations, and those transformations are written as code. Hence, understanding programming and modern object-oriented concepts is necessary. A few languages are popular in data engineering:

Python
Scala
Java

You can choose any one of them and build expertise in it. But if I had to suggest one, I would say choose Python first. Nowadays, Python is known as the language of data: it has many data analysis libraries and is easy to understand. Java and Scala are somewhat harder to learn because of their more verbose, strongly object-oriented style. Python is the first item on the checklist, and it will help you solve use cases with simpler code.
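
To make the idea of a "data transformation written as code" concrete, here is a minimal sketch in plain Python. The record layout, the field names, and the `clean_order` helper are assumptions invented for illustration; they are not from the session.

```python
from datetime import date

# Hypothetical raw records, as they might arrive from a source system.
raw_orders = [
    {"order_id": "1", "amount": " 250.50 ", "order_date": "2022-04-23"},
    {"order_id": "2", "amount": "99", "order_date": "2022-04-24"},
    {"order_id": "3", "amount": "", "order_date": "2022-04-24"},  # missing amount
]


def clean_order(record):
    """Cast fields to proper types; return None when the amount is missing."""
    amount_text = record["amount"].strip()
    if not amount_text:
        return None
    return {
        "order_id": int(record["order_id"]),
        "amount": float(amount_text),
        "order_date": date.fromisoformat(record["order_date"]),
    }


# Apply the transformation and drop records that could not be cleaned.
cleaned = [row for row in map(clean_order, raw_orders) if row is not None]
total_amount = sum(row["amount"] for row in cleaned)
print(len(cleaned), total_amount)
```

Real pipelines apply the same pattern (parse, validate, cast, filter) at a much larger scale, usually through a framework rather than a plain list.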

Operating System & Scripting

You also need to understand operating systems (such as Linux and Unix) and shell scripting. Get comfortable executing commands in the terminal and performing basic file operations such as copying and formatting. Learn how to write a shell script to automate tasks or run them in the background.

Data Structures & Algorithms

In data engineering, you need a basic understanding of data structures and algorithms (DSA). Structures such as arrays, strings, and stacks are used in activities like building pipelines. Other topics to cover are linked lists, queues, trees, graphs and their traversal, dynamic programming, searching, and sorting.

These topics are also used to assess your logical thinking and programming skills. A working understanding of them is enough to ensure you are familiar with the concepts used later in the process.
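
As one small illustration of the queue and graph traversal topics, the sketch below runs a breadth-first traversal over a made-up graph of tables. The table names and the `reachable_from` helper are hypothetical, chosen only to give the algorithm a data engineering flavour.

```python
from collections import deque

# Hypothetical graph: each table maps to the tables derived from it.
downstream = {
    "raw_events": ["clean_events"],
    "clean_events": ["daily_summary", "user_sessions"],
    "daily_summary": [],
    "user_sessions": ["daily_summary"],
}


def reachable_from(start, graph):
    """Breadth-first traversal: return nodes in the order they are first visited."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                visited.append(neighbour)
                queue.append(neighbour)
    return visited


order = reachable_from("raw_events", downstream)
print(order)
```

The `deque` acts as the queue, and the `seen` set keeps each node from being visited twice.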

DBMS (Database Management System)

A core understanding of DBMS is a must. It will help you with problem statements as well as database design and management; additionally, clarity on “which one to use where” will make these cases simpler. These are the key DBMS concepts to cover:

DDL (Data Definition Language)
DCL (Data Control Language)
DML (Data Manipulation Language)
Integrity Constraints
Data Schema
Basic Operations
ACID Properties
Transactions
Concurrency Control
Deadlock
Indexing
Hashing
Normalization forms
Views
Stored Procedures
ER Diagrams
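
A few of these concepts (DDL, DML, integrity constraints, and the atomicity part of ACID) can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table, its rows, and the `transfer` helper are assumptions made up for the example, and SQLite stands in for whichever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define a table with integrity constraints.
conn.execute(
    """
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        owner      TEXT NOT NULL,
        balance    INTEGER NOT NULL CHECK (balance >= 0)
    )
    """
)

# DML: insert rows. The connection context manager commits on success.
with conn:
    conn.executemany(
        "INSERT INTO accounts (account_id, owner, balance) VALUES (?, ?, ?)",
        [(1, "asha", 100), (2, "ravi", 50)],
    )


def transfer(conn, source, target, amount):
    """Move money between accounts atomically; roll back if any step fails."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates CHECK, so both updates are undone

balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
print(ok, failed, balances)
```

In the second transfer the credit is applied first and the debit then fails the `CHECK` constraint; the rollback undoes the credit as well, which is what atomicity means in practice.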

SQL Scripting

This is a must-have item on the checklist. You will use SQL in your day-to-day activities to produce complex analytical results. Here is what you need to know:

Transactional databases: MySQL, PostgreSQL
All types of joins
Nested Queries
Group By
Use of Case When Statements
Window Functions
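
The items in this list can be combined in a single query. The sketch below uses Python's built-in `sqlite3` module so that it runs without a database server; the tables and rows are invented for illustration, and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      INTEGER
    );
    INSERT INTO customers VALUES (1, 'asha'), (2, 'ravi'), (3, 'meera');
    INSERT INTO orders VALUES (10, 1, 500), (11, 1, 150), (12, 2, 80);
    """
)

# The nested query (written as a CTE) joins and groups; the outer query
# labels each customer with CASE WHEN and ranks them with a window function.
query = """
    WITH totals AS (
        SELECT c.name AS name, COALESCE(SUM(o.amount), 0) AS total_spent
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
    )
    SELECT
        name,
        total_spent,
        CASE
            WHEN total_spent >= 500 THEN 'high'
            WHEN total_spent > 0 THEN 'low'
            ELSE 'none'
        END AS segment,
        RANK() OVER (ORDER BY total_spent DESC) AS spend_rank
    FROM totals
    ORDER BY spend_rank
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The `LEFT JOIN` keeps the customer with no orders in the result, which an inner join would silently drop.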

Big Data

Here, you need to understand what big data is and its terminology, such as the 5 V’s of data, distributed computation, and how it works. Other things to learn are:

Vertical vs Horizontal Scaling
Commodity Hardware
Clusters
File formats: CSV, JSON, Avro, Parquet, ORC
Types of data: structured, unstructured, and semi-structured
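
The difference between structured and semi-structured data shows up as soon as you read two of these formats. This is a small sketch using only the Python standard library; the sample records are made up. Avro, Parquet, and ORC need third-party libraries and are left out here.

```python
import csv
import io
import json

# Structured data: every row has the same columns (CSV).
csv_text = "user_id,city\n1,Pune\n2,Delhi\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: records share a loose shape but fields may vary
# (one JSON object per line).
json_lines = (
    '{"user_id": 1, "tags": ["new"]}\n'
    '{"user_id": 2, "device": {"os": "android"}}\n'
)
json_rows = [json.loads(line) for line in json_lines.splitlines()]

# CSV values arrive as strings; JSON keeps numbers, lists, and nested objects.
csv_id_type = type(csv_rows[0]["user_id"]).__name__
json_id_type = type(json_rows[0]["user_id"]).__name__
all_keys = sorted({key for row in json_rows for key in row})
print(csv_id_type, json_id_type, all_keys)
```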

You’ll use this basic terminology in the frameworks used for big data.

Important Python Libraries

Get used to two very important Python libraries:

NumPy
Pandas

These are data exploration libraries with which you can read and explore data. Additionally, mathematical and statistical operations can be performed. Go through both libraries thoroughly.
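
Here is a short sketch of what exploring data with these two libraries can look like. It assumes `numpy` and `pandas` are installed; the sales figures and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records used only for illustration.
sales = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Delhi"],
        "units": [3, 5, 2, 4],
        "price": [100.0, 80.0, 100.0, 90.0],
    }
)

# Vectorised column arithmetic instead of a Python loop.
sales["revenue"] = sales["units"] * sales["price"]

# Group and aggregate, similar to SQL's GROUP BY.
revenue_by_city = sales.groupby("city")["revenue"].sum()

# NumPy handles the numeric work on the underlying array.
mean_units = np.mean(sales["units"].to_numpy())

print(revenue_by_city)
print(mean_units)
```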

Data Warehousing Concepts

This is important in design rounds and in real-world use cases. All data comes from some source, often a database or a warehouse. To build or design downstream systems efficiently, you need to understand data warehousing and data modelling concepts thoroughly.

OLAP vs OLTP
Dimension Tables
Fact Tables
Star Schema
Snowflake Schema
Warehouse Designing Questions
Many more topics
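
A star schema is easier to picture with a tiny working model. The sketch below builds one fact table and two dimension tables in an in-memory SQLite database; every table name, column, and row is a made-up example rather than anything from the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);

    -- The fact table holds measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        units_sold  INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'toys');
    INSERT INTO dim_date VALUES (20220401, '2022-04'), (20220501, '2022-05');
    INSERT INTO fact_sales VALUES
        (1, 20220401, 10), (2, 20220401, 4), (1, 20220501, 7);
    """
)

# A typical OLAP-style question: units sold per month and category.
report = conn.execute(
    """
    SELECT d.month, p.category, SUM(f.units_sold) AS units
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
    """
).fetchall()
print(report)
```

A snowflake schema would normalize the dimensions further, for example by moving `category` into its own table that `dim_product` references.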

Big Data Frameworks

From here onward, we’ll dive deeper into the data engineering domain. The foundation of big data is Apache Hadoop. It is the first framework to learn for batch data processing, and an architectural understanding of it is a must. It was designed for distributed data computation. The components to learn are:

HDFS (file storage system)
MapReduce
YARN (resource manager)
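
The MapReduce programming model can be sketched on a single machine in a few lines of Python. This is only a conceptual illustration of the map, shuffle, and reduce phases using the classic word count; it does not use Hadoop's API, and the function names are my own.

```python
from collections import defaultdict

documents = ["big data big pipelines", "data pipelines at scale"]


def map_phase(document):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    return [(word, 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In Hadoop, the mappers and reducers run on different machines in the cluster, and the shuffle moves data between them over the network.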

The second framework is Apache Hive, a data warehouse service that is frequently used in companies. It is built on top of Hadoop and converts SQL-like queries into MapReduce jobs under the hood. Learn:

How to load data in different file formats?
Internal Tables and External Tables
Querying table data stored in HDFS
Partitioning and Bucketing
Map-Side Join and Sorted-Merge Join
UDFs and SerDes in Hive
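
Partitioning and bucketing are easier to remember with a toy model. In the sketch below, a partition is one group per distinct column value and a bucket is chosen by hashing a key modulo a fixed bucket count. The rows and the bucket count are made up, and CRC32 is only a stand-in: Hive uses its own hash function, so the bucket numbers here will not match what Hive produces.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

rows = [
    {"user_id": "u1", "country": "IN", "amount": 10},
    {"user_id": "u2", "country": "US", "amount": 20},
    {"user_id": "u1", "country": "IN", "amount": 5},
    {"user_id": "u3", "country": "IN", "amount": 7},
]


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Deterministic hash modulo bucket count (CRC32 here, not Hive's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# Partitioning: one directory-like group per distinct column value.
# Bucketing: within a partition, rows are spread over a fixed number of
# files according to the hash of the bucketing column.
layout = defaultdict(list)
for row in rows:
    layout[(row["country"], bucket_for(row["user_id"]))].append(row)

partitions = sorted({country for country, _ in layout})
print(partitions)
```

Because the bucket depends only on the key, all rows for the same `user_id` land in the same bucket. Joins on the bucketing column can take advantage of that.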

Nowadays, Apache Spark is a must-have skill. Because it does computation in memory, it can run some workloads up to 100 times faster than Hadoop MapReduce. Three components to focus on are:

Spark Core
Spark SQL
Spark Streaming

Next, Apache Flink is meant for real-time data processing (stream processing). It can also handle batch processing, which it treats as a special case of stream processing.

You also need to go through other frameworks such as Apache Sqoop, Apache NiFi, and Apache Flume.

Two things you need to know or focus on:

How to do batch processing?
How to do real-time processing?
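
The difference between the two styles can be sketched in plain Python. This is a conceptual illustration only, with made-up sensor events and hypothetical helper names: batch processing sees the whole dataset and answers once, while stream processing updates its state as each event arrives.

```python
events = [("sensor_a", 3), ("sensor_b", 5), ("sensor_a", 4)]


def batch_totals(all_events):
    """Batch: the full dataset is available, so compute the result once."""
    totals = {}
    for key, value in all_events:
        totals[key] = totals.get(key, 0) + value
    return totals


def streaming_totals(event_stream):
    """Streaming: update state per event and emit a result after each one."""
    totals = {}
    for key, value in event_stream:
        totals[key] = totals.get(key, 0) + value
        yield dict(totals)  # snapshot of the running state


batch_result = batch_totals(events)
stream_snapshots = list(streaming_totals(iter(events)))
print(batch_result)
print(stream_snapshots)
```

Once the stream has consumed every event, its last snapshot matches the batch result.
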

Workflow Schedulers, Dependency Management

Workflow schedulers are used when data is not systematically arranged: it is extracted from different datasets that depend on each other and are delivered at different times. We use schedulers to run these jobs so that the proper dependency order is maintained.

Apache Airflow
Azkaban
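
The core idea behind dependency management is ordering jobs so that each one runs only after everything it depends on. Here is a minimal sketch using `graphlib` from the Python standard library (Python 3.9+). The job names are hypothetical, and this is not the Airflow or Azkaban API, only the underlying topological ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each job maps to the set of jobs it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "daily_report": {"join_orders_users"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two extract jobs do not depend on each other, so either may come first. A real scheduler could run them in parallel.
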

NoSQL Databases

Transactional (relational) databases alone are often a poor fit for big data use cases. So, we need to learn:

HBase
DataStax Cassandra (Recommended)
ElasticSearch
MongoDB

Messaging Queue Frameworks

A messaging queue holds the data that a stream processor such as Apache Flink consumes for real-time processing. Apache Kafka is widely used for stream-related processing.
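
To see why a queue sits between the producer of data and the processor, here is a small in-process sketch using Python's `queue` and `threading` modules. It only illustrates the decoupling idea: Kafka itself is a distributed, persistent system with its own client API, and nothing below is Kafka code. The event shape and the sentinel are assumptions made for the example.

```python
import queue
import threading

buffer = queue.Queue()
SENTINEL = None  # marks the end of the stream in this sketch
processed = []


def producer():
    """Publish events without waiting for the consumer to process them."""
    for event_id in range(5):
        buffer.put({"event_id": event_id})
    buffer.put(SENTINEL)


def consumer():
    """Read events in arrival order and process them one at a time."""
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        processed.append(event["event_id"] * 10)


consumer_thread = threading.Thread(target=consumer)
consumer_thread.start()
producer()
consumer_thread.join()
print(processed)
```

The producer never calls the consumer directly. It only writes to the queue, so either side can slow down or restart without the other needing to know.
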

Dashboarding Tools

These are important for visualizing results and checking that things are right. Below are some tools:

Tableau
PowerBI
Grafana
Kibana (part of the ELK stack: Elasticsearch, Logstash, Kibana)

Big Data Services in Cloud (AWS)

This is the most important area nowadays, but an overview is enough to start with because you’ll get good exposure to it in the company itself.

On-demand Machines: AWS EC2
Access Management: AWS IAM
For Storing and Accessing Credentials: AWS Secrets Manager
Distributed File Storage: AWS S3
Relational Database and SQL Query Services: AWS RDS (transactional), AWS Athena (SQL queries over S3), AWS Redshift (data warehousing)
NoSQL Database Services: AWS DynamoDB
Serverless: AWS Lambda
Scheduler: AWS CloudWatch (scheduled events)
Distributed Data Computation: AWS EMR
Messaging Queue: AWS SNS, AWS SQS
Real-Time Data Processing: AWS Kinesis

Conclusion

This article has discussed the roadmap that will help you become a great data engineer. If you missed the session, head over to our YouTube channel, where the recording is now available.

We plan to bring more such DataHours to you and let the industry experts impart knowledge to you in the most practical sense. Mark your calendar! Hope to see you there!
