The DataHour Synopsis: Learning Path to Master Data Engineering in 2022
Data is the new oil of the industry. Just as raw oil powers the industrial economy, data powers the information economy.
Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. With the intent to make learning data science more engaging for the community, we began our new initiative: “DataHour”. This is a series of webinars in which top industry experts teach and democratize data science knowledge. On 23rd April 2022, we were joined by Mr. Shashank Mishra for a DataHour session on “Learning Path to Master Data Engineering in 2022”.
Shashank is an experienced data engineer with a demonstrated history of working in service and product companies such as Amazon, Paytm, and McKinsey. He is also an active contributor to the data science and data engineering community through his podcasts and YouTube channel (E-Learning Bridge).
Are you excited to dive deeper into the world of data engineering? We've got you covered. Let's get started with the major highlights of “Learning Path to Master Data Engineering in 2022”.
In today's highly technology-driven world, every company and organization depends heavily on data-driven business decisions. The volume of collected data is huge, and extracting business decisions, the crux, from this data is no cakewalk. All of this happens in phases and needs a highly advanced system or pipeline. This is where the data engineer and data engineering come into the picture: data engineers put the data into a more accurate format for better business decision-making. Now, let's dive into the roadmap that leads to becoming a successful data engineer.
Who are Data Engineers?
Data engineers work as front-liners whenever a company builds a big data solution. They are treated as front-liners because their core task is to create the scalable and optimized data pipelines that every company requires. But creating a data pipeline, or ETL process, is not an easy task; it is not simply a matter of bringing data from some source and putting it into some downstream system. Software engineering is also involved and plays a vital role, because engineers need to design scalable systems in which distributed computation engines handle the processing load. You need to build data pipelines in such a way that even if there is a spike in the data, the pipeline doesn't break or slow down while processing it.
The Roadmap that Leads us to our Destination
You must be wondering whether programming, or at least an understanding of programming, is actually required in data engineering. You are thinking in the right direction: it is.
Being well equipped with a programming language is a necessity in this domain, because creating scalable and optimized data pipelines means applying data transformations, and those transformations are expressed in code. Hence, understanding programming and modern object-oriented concepts becomes a necessity. A few languages are popular in data engineering:
- Python
- Java
- Scala
You can choose any one of them and build expertise in it, but if I had to make a suggestion, I would say choose Python first. Nowadays, Python is known as the language of data, whereas Java and Scala are somewhat more difficult due to their object-oriented style of writing and execution. Python also offers many data analysis libraries and is easy to understand. This is the first item on the checklist, and it will help you solve use cases with simpler programming.
Operating System & Scripting
You also need to understand operating systems (such as Linux and Unix) and shell scripting. Be handy with executing commands on the terminal and performing basic file operations such as copying and formatting, and learn how to write a shell script to automate tasks or run jobs in the background.
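The same kind of file automation that a shell script would handle can also be sketched in Python, the language recommended above. This is a minimal, hypothetical example that archives every `.csv` file from one directory into another:

```python
import os
import shutil

def archive_csv_files(src_dir: str, dest_dir: str) -> int:
    """Copy every .csv file from src_dir to dest_dir; return the count copied."""
    os.makedirs(dest_dir, exist_ok=True)  # create the target folder if missing
    copied = 0
    for name in os.listdir(src_dir):
        if name.endswith(".csv"):
            shutil.copy(os.path.join(src_dir, name), os.path.join(dest_dir, name))
            copied += 1
    return copied
```

In practice, you might schedule a script like this with cron to run in the background.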
Data Structures & Algorithms
In data engineering, you need a basic understanding of data structures and algorithms. They come up in activities such as building pipelines, using arrays, strings, stacks, and so on. Other topics to cover are linked lists, queues, trees, graphs and their traversals, dynamic programming, and searching and sorting.
All of these help test your logical thinking and programming skills, so understanding them is important, mainly to make sure you are comfortable with the concepts used further along in the process.
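As a small illustration of how these structures combine, here is a breadth-first graph traversal, a classic interview topic, that uses a queue and a set together:

```python
from collections import deque

def bfs(graph: dict, start: str) -> list:
    """Breadth-first traversal: visit nodes level by level using a queue."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:  # the set prevents revisiting nodes
                visited.add(neighbour)
                queue.append(neighbour)
    return order
```

For example, `bfs({"A": ["B", "C"], "B": ["D"], "C": [], "D": []}, "A")` visits `A`, then `B` and `C`, then `D`.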
DBMS (Database Management System)
A core understanding of DBMS is a must. It will help you with problem statements and with database design and management; additionally, clarity on DBMS concepts, knowing which one to use where, will make such cases simpler. These are a few of the commands and concepts used in a DBMS:
- DDL (Data Definition Language)
- DCL (Data Control Language)
- DML (Data Manipulation Language)
- Integrity Constraints
- Data Schema
- Basic Operations
- ACID Properties
- Concurrency Control
- Normalization forms
- Stored Procedures
- ER Diagrams
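A few of these concepts can be seen concretely with Python's built-in sqlite3 module. This minimal sketch (the table and column names are invented for illustration) shows DDL, DML, an integrity constraint, and the atomicity part of ACID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")  # DDL
conn.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))                  # DML
conn.commit()

try:
    with conn:  # opens a transaction; rolls back automatically on error
        conn.execute("INSERT INTO users (name) VALUES (?)", ("Bob",))
        conn.execute("INSERT INTO users (name) VALUES (?)", (None,))  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the whole transaction rolled back, so "Bob" was never saved

names = [row[0] for row in conn.execute("SELECT name FROM users")]
# names == ["Alice"]: atomicity kept the failed transaction out entirely
```

The `with conn:` block is what makes the two inserts succeed or fail together.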
SQL
This is another must-have on the checklist. You will be using SQL in your day-to-day activities to produce complex and analytical results. Here is the list of what you need to know:
- Transactional Databases : MySQL, PostgreSQL
- All types of joins
- Nested Queries
- Group By
- Use of Case When Statements
- Window Functions
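A quick sketch of GROUP BY combined with a CASE WHEN statement, run through sqlite3 so it stays self-contained (the table and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Alice', 120.0), (2, 'Alice', 40.0), (3, 'Bob', 75.0);
""")

# Aggregate per customer, then bucket each total with CASE WHEN.
rows = conn.execute("""
    SELECT customer,
           SUM(amount) AS total,
           CASE WHEN SUM(amount) >= 100 THEN 'high' ELSE 'low' END AS segment
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
# rows == [('Alice', 160.0, 'high'), ('Bob', 75.0, 'low')]
```

The same pattern, written against MySQL or PostgreSQL, is everyday data engineering work.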
Big Data Fundamentals
Here, you need to understand what big data is and its terminology, such as the 5 V's of big data, distributed computation, and how it works. Other things to learn are:
- Vertical vs Horizontal Scaling
- Commodity Hardware
- File formats-CSV, JSON, AVRO, Parquet, ORC
- Type of Data-Structured, Unstructured and Semi-structured
You'll use these basic terminologies across the tech frameworks used in big data.
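The difference between file formats is easiest to see with the text-based ones. This stdlib sketch serializes the same records as CSV and as JSON (Avro, Parquet, and ORC are binary formats that need third-party libraries, so they are left out here):

```python
import csv
import io
import json

# The same two records in two different text formats.
records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# CSV: a header row plus one line per record; compact but untyped.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing and allows nesting, at the cost of repeating keys.
json_text = json.dumps(records)

# Round-trip both back into Python objects.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)
# json_rows preserves the integer ids; csv_rows comes back with "1" as a string,
# which is exactly the schema/typing trade-off between the formats.
```

Columnar formats such as Parquet and ORC go further, storing each column together so analytical queries read only what they need.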
Important Python Libraries
Get used to two very important Python libraries: Pandas and NumPy. These are data exploration libraries with which you can read and explore data; some mathematical and statistical operations can be performed as well. Go through both libraries thoroughly.
Data Warehousing Concepts
This is important in design rounds and in real-world use cases. Everything comes from some source, and that data eventually lands in a warehouse of some kind. To build or design downstream systems efficiently, you need to understand data warehousing and data modelling concepts thoroughly:
- OLAP vs OLTP
- Dimension Tables
- Fact Tables
- Star Schema
- Snowflake Schema
- Warehouse Designing Questions
- Many more topics
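To make the star schema concrete, here is a minimal sketch using sqlite3: one fact table surrounded by dimension tables, with the typical join-and-aggregate query a warehouse serves. All table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny star schema: the fact table holds measures plus foreign keys
# into the dimension tables that describe each sale.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date    VALUES (10, '2022-04-23');
INSERT INTO fact_sales  VALUES (1, 10, 9.5), (2, 10, 20.0), (1, 10, 3.0);
""")

# A typical warehouse query: join the fact table to a dimension and aggregate.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
# rows == [('gadget', 20.0), ('widget', 12.5)]
```

A snowflake schema would further normalize the dimension tables into sub-dimensions.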
Big Data Frameworks
From here onward, we'll dive deeper into the data engineering domain. The base, or foundation, of big data is Apache Hadoop. It is the first framework used for batch data processing, and it was designed for distributed data computation; an understanding of its architecture is a must. Within Hadoop, you should know:
- HDFS (the distributed file storage system)
- YARN (the resource manager)
The second framework is Apache Hive, a warehouse service frequently used in companies. It is a framework written on top of Hadoop that converts SQL queries into map-reduce code under the hood. Learn:
- How to load data in different file formats?
- Internal Tables and External Tables
- Querying table data stored in HDFS
- Partitioning and Bucketing
- Map-Side Join and Sorted-Merge Join
- UDFs and SerDes in Hive
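To build intuition for what "converts SQL into map-reduce" means, here is a single-machine sketch of the classic map-reduce word count. Real map-reduce distributes the map and reduce phases across many nodes with a shuffle/sort in between; this toy version keeps only the logic:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (the shuffle is implicit here)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(map_phase(["big data", "big pipelines"]))
# word_counts == {"big": 2, "data": 1, "pipelines": 1}
```

A Hive query like `SELECT word, COUNT(*) ... GROUP BY word` compiles down to essentially this pattern.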
Nowadays, Apache Spark is a must-have skill. It can be up to 100 times faster than Hadoop map-reduce because it performs its computation in memory. There are three main components:
- Spark Core
- Spark SQL
- Spark Streaming
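This is not Spark's API, but the lazy-evaluation idea behind Spark's speed can be imitated on one machine with Python generators: transformations only describe work, and nothing executes until an action pulls data through the chain in memory:

```python
numbers = range(1, 11)

# "Transformations": lazily described, nothing is computed yet.
squared = (n * n for n in numbers)
evens   = (n for n in squared if n % 2 == 0)

# "Action": only now does data flow through the whole pipeline, in memory.
total = sum(evens)
# total == 220 (the squares of 2, 4, 6, 8, 10)
```

Spark's RDDs and DataFrames work the same way at cluster scale: `map` and `filter` build a plan, and actions like `count` or `collect` trigger execution.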
Next, Apache Flink is meant for real-time data processing / stream processing. It can also handle batch processing, but as a special case of stream processing.
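Stream processors typically aggregate unbounded data over windows. Here is a tiny single-machine sketch of a tumbling-window sum, the kind of aggregation Flink performs over event streams (the event data is made up):

```python
from collections import defaultdict

def tumbling_window_sums(events, window_size):
    """Group (timestamp, value) events into fixed-size windows and sum each window."""
    windows = defaultdict(float)
    for timestamp, value in events:
        windows[timestamp // window_size] += value  # integer division picks the window
    return dict(windows)

events = [(0, 1.0), (3, 2.0), (5, 4.0), (9, 1.0)]
result = tumbling_window_sums(events, window_size=5)
# result == {0: 3.0, 1: 5.0}: one sum per 5-unit window
```

Real stream processors also handle late and out-of-order events with watermarks, which this sketch ignores.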
You also need to go through other frameworks such as Apache Sqoop, Apache NiFi, and Apache Flume.
Two things you need to know or focus on:
- How to do batch processing?
- How to do real-time processing?
Workflow Schedulers & Dependency Management
These are used whenever data is not systematically arranged: data is extracted from different datasets that depend on each other and arrive at different times. A workflow scheduler runs these jobs in such a way that proper dependencies are maintained.
- Apache Airflow
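Airflow's core idea, running tasks in an order that respects a DAG of dependencies, can be sketched with the standard library's topological sorter (the task names below are hypothetical, and this is only the ordering logic, not Airflow's API):

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# A valid run order: every task appears after all of its dependencies.
run_order = list(TopologicalSorter(dag).static_order())
# run_order == ["extract", "transform", "load", "report"]
```

Airflow adds scheduling, retries, backfills, and monitoring on top of exactly this kind of dependency graph.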
- NoSQL Databases: Transactional databases cannot solve big-data-scale use cases, so we also need to learn:
- DataStax Cassandra (Recommended)
- Messaging Queue Frameworks: These hold the data that stream processors such as Apache Flink consume for real-time processing. Apache Kafka is widely used for stream processing.
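Kafka itself needs a running broker, but the producer/consumer pattern behind any messaging queue can be sketched in-process with Python's queue module:

```python
import queue
import threading

# A tiny in-process stand-in for a messaging queue such as Kafka:
# a producer thread publishes events while a consumer thread processes them.
q = queue.Queue()
consumed = []

def producer():
    for i in range(5):
        q.put(i)   # publish an event
    q.put(None)    # sentinel: signal that no more events are coming

def consumer():
    while True:
        event = q.get()
        if event is None:
            break
        consumed.append(event * 10)  # "process" the event

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
# consumed == [0, 10, 20, 30, 40]
```

The queue decouples the producer's pace from the consumer's, which is exactly why stream pipelines put Kafka between data sources and processors.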
- Dashboarding Tools: These are important for verifying that things are right. Below are some tools:
- Kibana (part of the ELK stack: Elasticsearch – Logstash – Kibana)
- BigData Services in the Cloud (AWS): These are the most important nowadays, but a simple overview is enough to start, because you'll get good exposure in the company itself.
- On-demand Machines: AWS EC2
- Access Management: AWS IAM
- For Storing and Accessing Credentials: AWS Secret Manager
- Distributed File Storage: AWS S3
- Database Services: AWS RDS (transactional), AWS Athena (serverless SQL queries), AWS Redshift (data warehousing)
- NoSQL Database Services: AWS DynamoDB
- Serverless : AWS Lambda
- Scheduler: AWS CloudWatch
- Distributed Data Computation: AWS EMR
- Messaging Queue: AWS SNS, AWS SQS
- Real-Time Data Processing: AWS Kinesis
This article discussed the roadmap that can help you become a great data engineer. If you missed attending this session, head over to our YouTube channel; the recording is available now!
We plan to bring more such DataHours to you and let the industry experts impart knowledge to you in the most practical sense. The upcoming DataHour sessions are:
Head to the above links to learn more about these sessions. And mark your calendar! Hope to see you there!