Step-by-Step Roadmap to Become a Data Engineer in 2023
You must have noticed the personalization happening in the digital world, from personalized YouTube videos to uncanny ad recommendations on Instagram. While not all of us are tech enthusiasts, most of us have a fair idea of how Data Science shapes our day-to-day lives.
All of this is based on Data Science, which is applied behind the scenes. But few people know that none of this is possible without the Data Engineers working alongside the Data Scientists.
In concise terms, Data Engineering is the field that ensures the proper flow of raw data from a plethora of sources into a reliable, common repository that can act as a single source of truth for the entire organization.
This article defines the roadmap you can follow to become a Data Engineer in 2023.
Before you jump into the roadmap in detail, it helps to know what each quarter of the roadmap should deliver.
In the first quarter, you will focus on building a solid foundation of programming which will help you get started with your Data Engineering journey.
In the second quarter, the focus will be on gaining hands-on experience in Cloud and Distributed Frameworks. And so, by the end of this quarter, you may start applying for Data Engineering internships.
In the third quarter, the focus will be on data warehousing and handling streaming data. And since you can handle batch and streaming data at the end of this quarter, you will be in an excellent position to start applying for entry-level Data Engineering roles.
The fourth quarter will focus on Testing, NoSQL Databases, and Workflow Orchestration Tools. So, by the end of this quarter, which is also the end of the year, you will know most of the tools and technologies that a Data Engineer needs and will be in an excellent position to apply for Data Engineering jobs and do well in the interviews.
January – Basics of Programming
The first thing you need to master to become a Data Engineer in 2023 is a programming language. This will kickstart your journey in this field and allow you to think in a structured manner. And there is no better programming language to start your programming journey with than Python.
Python is one of the most suitable programming languages for Data Engineering. It is easy to use, has a wealth of supporting libraries, has a vast community of users, and is supported by most of the tools used in Data Engineering. You can focus on the following while learning Python –
- Understanding Operators, Variables, and Data Types in Python
- Conditional Statements & Looping Constructs
- Data Structures in Python (Lists, Dictionaries, Tuples, Sets & String methods)
- Writing custom Functions (incl. lambda, map & filter functions)
- Understanding Standard Libraries in Python
- Using basic Regular Expressions for data cleaning and extraction tasks
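To make the topics above concrete, here is a minimal sketch that combines a custom function, `map`, `filter`, a lambda, a dictionary, and a basic regular expression in one small cleaning task. The log lines and field names are made up for illustration.

```python
import re

# Hypothetical raw log lines, invented for this illustration.
raw_lines = [
    "2023-01-05  user=alice  amount=120.50",
    "2023-01-06  user=bob    amount=n/a",
    "2023-01-07  user=carol  amount=75",
]

# Regular expression with named groups for the date, user, and amount fields.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+user=(?P<user>\w+)\s+amount=(?P<amount>\S+)"
)

def parse_line(line):
    """Return a dict for a matching line, or None if the line does not match."""
    match = LINE_PATTERN.search(line)
    return match.groupdict() if match else None

def is_numeric(record):
    """True when the amount field looks like a plain decimal number."""
    return re.fullmatch(r"\d+(\.\d+)?", record["amount"]) is not None

# map applies parse_line to every line; filter keeps rows with a numeric amount.
parsed = map(parse_line, raw_lines)
valid = filter(lambda rec: rec is not None and is_numeric(rec), parsed)

# A dictionary comprehension turns the cleaned records into user -> amount.
amount_by_user = {rec["user"]: float(rec["amount"]) for rec in valid}
print(amount_by_user)
```

The row for bob is dropped because its amount is not numeric, which is the kind of decision a cleaning step has to make explicit.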
Besides this, you should specifically focus on the Pandas library in Python. This library is widely used for reading, manipulating, and analyzing data. Here you can focus on the following –
- Basics of data manipulation with Pandas library
- Reading and writing files with Pandas
- Manipulate columns in Pandas – Rename columns, sort data in Pandas dataframe, binning data using Pandas, etc.
- How to deal with missing values using Pandas
- Apply function in Pandas
- Pivot table
- Group by
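As a rough sketch of several items from this list, the snippet below renames a column, fills a missing value, uses `apply`, and then builds a group-by and a pivot table. It assumes the pandas package is installed; the sales table is invented for the example.

```python
import pandas as pd

# Small made-up sales table; NaN marks a missing value.
df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "product": ["a", "b", "a", "b"],
        "units": [10.0, float("nan"), 7.0, 3.0],
    }
)

# Rename a column, then fill the missing value with 0.
df = df.rename(columns={"units": "units_sold"})
df["units_sold"] = df["units_sold"].fillna(0)

# apply runs a function on each value; here it labels each row by volume.
df["volume"] = df["units_sold"].apply(lambda u: "high" if u >= 7 else "low")

# Group by region and sum the units.
units_by_region = df.groupby("region")["units_sold"].sum()

# Pivot table: one row per region, one column per product.
pivot = df.pivot_table(
    index="region", columns="product", values="units_sold", aggfunc="sum"
)

print(units_by_region)
print(pivot)
```

Filling with 0 is only one way to treat missing values; dropping the row or using a column average are equally common, and the right choice depends on the data.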
February – Fundamentals of Computing
Once you become comfortable with the Python programming language, it is essential to focus on some computing fundamentals to become a Data Engineer in 2023. This is extremely helpful because, most of the time, the data sources you are working with will require you to grasp these computing fundamentals well.
It helps to focus on shell scripting in Linux, since you will often work in a Linux environment through the shell. You will use shell scripting extensively for cron jobs, setting up environments, and working in the distributed environments that Data Engineers use widely. Besides this, you will need to work with APIs, because most projects involve multiple APIs that send and receive data. So learn the basics of APIs, including HTTP methods like GET, PUT, and POST. The requests library is widely used for this purpose.
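To see what GET and POST look like in practice, here is a small sketch that runs entirely on your machine. It uses only the standard library (`http.server` and `urllib`) rather than requests so that it works without installing anything or reaching the internet. The `/items` endpoint and the item fields are invented for the example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory "database" served by the toy API below.
ITEMS = {"1": {"name": "sensor-a"}}

class ItemHandler(BaseHTTPRequestHandler):
    """Tiny JSON API: GET lists items, POST adds one."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._send_json(200, ITEMS)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        item = json.loads(self.rfile.read(length))
        new_id = str(len(ITEMS) + 1)
        ITEMS[new_id] = item
        self._send_json(201, {"id": new_id})

    def log_message(self, *args):
        pass  # keep the example output quiet

# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), ItemHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}/items"

# Opener without proxies, so the local request never leaves the machine.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# POST: send a new record as JSON.
post_request = urllib.request.Request(
    base_url,
    data=json.dumps({"name": "sensor-b"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with opener.open(post_request) as response:
    created = json.loads(response.read())

# GET: read the current state back.
with opener.open(base_url) as response:
    items = json.loads(response.read())

server.shutdown()
server.server_close()
print(created, items)
```

Against a real API the client half is all you would write, and the same two calls are shorter with the requests library.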
Web Scraping is also useful for a Data Engineer's day-to-day tasks. This is because, a lot of the time, we need to extract data from websites that might not have a straightforward or helpful API. For this, you can focus on working with the BeautifulSoup and Selenium libraries. Finally, master Git and GitHub, as they are the standard tools for version control. When you work in a team, several members will be changing the same project at once, and without a version control tool this collaboration quickly becomes unmanageable.
March – Relational Databases
No Data Engineering project is complete without a storage component, and the relational database is one of the core storage components used widely in Data Engineering projects. One needs a good understanding of relational databases to work with the enormous amounts of data generated in this field. Relational databases are popular across industries because of their ACID properties, which allow them to handle transactional data reliably.
To work with relational databases, you must master the Structured Query Language (SQL). You can focus on the following while learning SQL –
- Basic Querying in SQL
- Keys in SQL
- Joins in SQL
- Practice Subqueries in SQL
- Constraints in SQL
- Window Functions
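The sketch below touches most of the topics in this list using SQLite through Python's built-in `sqlite3` module, so there is nothing to install. The tables and values are invented, and the window function needs SQLite 3.25 or newer, which recent Python builds normally bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,          -- primary key
        name TEXT NOT NULL UNIQUE          -- constraints
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
        amount      REAL NOT NULL CHECK (amount > 0)
    );
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 80.0);
    """
)

# Join plus aggregation: total spend per customer.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

# Subquery: orders larger than the overall average order amount.
above_average = conn.execute(
    "SELECT id FROM orders "
    "WHERE amount > (SELECT AVG(amount) FROM orders) ORDER BY id"
).fetchall()

# Window function: rank each customer's orders by amount, largest first.
ranked = conn.execute(
    """
    SELECT customer_id, amount,
           RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer_id, rnk
    """
).fetchall()

print(totals, above_average, ranked)
```

The same statements run with little or no change on PostgreSQL or MySQL, which is why SQLite is a convenient place to practice.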
Project – At the end of the first quarter, you will understand programming, SQL, web scraping, and APIs well. This should give you enough grounding to work on a small sample project.
In this project, you can bring in data from any open API or scrape it from a website, transform it using Pandas in Python, and finally store it in a relational database. A similar project can be found here, where we analyze streaming tweets using Python and PostgreSQL.
April – Cloud Computing Fundamentals with AWS
We will start this quarter with a focus on cloud computing because, given the size of Big Data, Data Engineers find it extremely useful to work on the cloud. This lets them scale resources up as the data grows instead of being limited by a single machine. Also, given the proliferation of cloud computing technologies, it has become increasingly easy to manage complex work processes entirely on the cloud. You can focus on the following while learning AWS –
- Learn the basics of AWS
- Learn about IAM users and IAM Roles
- Learn to launch and operate an EC2 on AWS
- Get comfortable with Lambda Functions on AWS
- Work with AWS S3, the primary storage component
- Learn to use API Gateway
- Practice networking with AWS VPC
- Practice databases with AWS RDS and Aurora
Project – You can take your project from the previous quarter to the cloud by following the steps below –
- Use API Gateway to ingest the data from Twitter API
- Process this data with AWS Lambda
- Store the processed data in AWS Aurora for further analysis
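The processing step could look roughly like the handler below. The `lambda_handler(event, context)` signature is the usual shape for a Python function on AWS Lambda. The event layout (a JSON list of tweets in `body`) and the `text` field are assumptions made for this sketch, and the database write is left as a comment.

```python
import json

def lambda_handler(event, context):
    """Entry point in the shape AWS Lambda expects for Python handlers.

    Assumes an API Gateway proxy-style event whose "body" is a JSON string
    holding a list of tweets; the "text" field is an invented schema.
    """
    try:
        tweets = json.loads(event.get("body") or "[]")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    # Minimal processing step: normalise whitespace and drop empty tweets.
    cleaned = [
        {"text": " ".join(t["text"].split())}
        for t in tweets
        if t.get("text", "").strip()
    ]
    # A real function would write `cleaned` to a database here.
    return {"statusCode": 200, "body": json.dumps({"processed": len(cleaned)})}

# Local test call with a hand-built event; no AWS account is needed for this.
sample_event = {"body": json.dumps([{"text": "  data   engineering "}, {"text": ""}])}
result = lambda_handler(sample_event, None)
print(result)
```

Because the handler is a plain function, you can test it locally like this before deploying it.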
May – Data Processing with Apache Spark
Next, you need to learn how to process Big Data. Big Data broadly comes in two forms: batch data and streaming data. This month, you should focus on learning the tools to handle batch data. Batch data is accumulated over time, say a day, a month, or a year. Since the data is extensive in such cases, we need specialized tools. One such popular tool is Apache Spark.
You can focus on the following while learning Apache Spark –
- Spark architecture
- RDDs in Spark
- Working with Spark Dataframes
- Understand Spark Execution
- Broadcast and Accumulators
- Spark SQL
While you are at it, also learn about the ETL pipeline concept. ETL stands for Extracting the data from a source, Transforming the incoming data into the required format, and finally Loading it to a specified location. Apache Spark is widely used for this purpose, and Data Engineers use ETL in most projects.
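The three ETL stages can be sketched in a few lines of plain Python. This is a toy version on a handful of rows: the CSV text stands in for a real source and an in-memory SQLite table stands in for the target, whereas at scale the same shape would be written with a tool such as Spark.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be a file, an API, or a table.
SOURCE_CSV = """order_id,country,amount
1,in,100.5
2,US,
3,us,20
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount and normalise types and casing."""
    return [
        (int(r["order_id"]), r["country"].upper(), float(r["amount"]))
        for r in rows
        if r["amount"]
    ]

def load(records, conn):
    """Load: write the cleaned records to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test and to swap out later.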
You can work with the Databricks community edition to practice Apache Spark.
June – Hadoop Distributed Framework
Most large Data Engineering projects rely on the capabilities of a distributed framework. Distributed frameworks allow Data Engineers to distribute the workload onto multiple small-scale machines instead of relying on a single massive system. This provides higher scalability and better fault tolerance. You can focus on the following while learning Hadoop –
- Get an overview of the Hadoop Ecosystem
- Understand MapReduce architecture
- Understand the working of YARN
- Work with Hadoop on the cloud with AWS EMR
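To get a feel for the MapReduce model before touching a cluster, here is a single-machine sketch of the classic word count. Each string plays the role of an input split on a different node. Real Hadoop runs the same three phases in parallel across machines, which this toy version does not attempt.

```python
from collections import defaultdict

# Each string stands in for one input split stored on a different node.
splits = ["big data big", "data engineering", "big engineering"]

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key across the mappers."""
    grouped = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

word_counts = reduce_phase(shuffle(map(map_phase, splits)))
print(word_counts)
```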
Project – By the end of this quarter, you will have a good understanding of handling batch data in a distributed environment, and you will also have the basics of cloud computing in place. Also, you can start your data engineering journey with these tools by applying to Data Engineering internships.
To showcase your skills, you can build a project on the cloud. You can take up any dataset from the DataHack platform and practice working with the Spark framework.
July – Data Warehousing with Apache Hive
Getting data into databases is only one-half of the work. The real challenge is aggregating and storing the data in a central repository. This will act as a single version of the truth that anyone in an organization can query to get a common and consistent result. You will first need to understand the differences between a Database, a Data Warehouse, and a Data Lake, since you will often come across these terms. Also try to understand the difference between OLTP and OLAP.
Next, you should focus on the modeling aspect of data warehouses. Learn about the Star and Snowflake schemas, which are often used in designing data warehouses. Finally, you can start to learn the various data warehouse tools. One of the most popular is Apache Hive, which is built on top of Apache Hadoop and is used widely in the industry. While learning Hive, you can focus on the following topics –
- Hive Query Language
- Managed vs External tables
- Partitioning and Bucketing
- Types of File formats
- SerDes in Hive
Once you have mastered the basics of Apache Hive, you can practice working with it on the cloud using the AWS EMR service.
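Partitioning and bucketing are easier to picture with a small model. Hive stores each partition value in its own directory (for example `event_date=2023-07-01/`) and spreads rows across a fixed number of bucket files by hashing a column. The sketch below imitates that layout in memory; the hash function, bucket count, and `bucket_N` names are assumptions for illustration and not what Hive uses internally.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count for this illustration

# Made-up event rows.
rows = [
    {"event_date": "2023-07-01", "user_id": "u1", "clicks": 3},
    {"event_date": "2023-07-01", "user_id": "u2", "clicks": 1},
    {"event_date": "2023-07-02", "user_id": "u1", "clicks": 5},
]

def bucket_for(user_id):
    """Assign a bucket from a stable hash of the bucketing column."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

# Layout: one directory per partition value, one file per bucket inside it.
layout = defaultdict(list)
for row in rows:
    path = f"event_date={row['event_date']}/bucket_{bucket_for(row['user_id'])}"
    layout[path].append(row)

def read_partition(event_date):
    """Partition pruning: only touch paths under the requested date."""
    prefix = f"event_date={event_date}/"
    return [r for path, rs in layout.items() if path.startswith(prefix) for r in rs]

print(sorted(layout))
print(read_partition("2023-07-01"))
```

A query that filters on the partition column only has to read one directory, which is where most of the speed-up comes from.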
August – Ingesting Streaming Data with Apache Kafka
Having worked extensively with batch data, it is time to move on to streaming data.
Streaming data is data that is generated continuously in real time. Think of tweets being posted, clicks being recorded on a website, or transactions occurring on an e-commerce website. Such data sometimes needs to be handled in real time. For example, we may need to determine in real time whether a tweet is toxic so we can quickly remove it from the platform, or whether a transaction is fraudulent so we can stop it before it causes extensive damage.
The problem with working with such data is that we must ingest it in real time and process it at the same rate, so that no data is lost in the interim. To ensure that data is ingested reliably while it is being generated, we need a tool such as Apache Kafka. You can focus on the following while learning Kafka –
- Learn Kafka architecture
- Learn about Producers and Consumers
- Create topics in Kafka
Besides this, you can also learn about AWS Kinesis, which is used to handle streaming data on the cloud.
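The core ideas of topics, producers, and consumers fit in a short toy model. This is not the Kafka client API: the `Topic` and `Consumer` classes are invented for illustration, and Kafka's own partitioner uses a different hash function. What carries over is the behaviour: records with the same key land in the same partition in order, and a consumer tracks an offset per partition.

```python
import zlib
from collections import defaultdict

class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Producer side: the key picks the partition, so one key stays ordered."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1  # (partition, offset)

class Consumer:
    """Toy consumer that remembers the next offset to read per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        """Return records not seen yet and advance the stored offsets."""
        records = []
        for index, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)
        return records

clicks = Topic("clicks", num_partitions=2)
consumer = Consumer(clicks)
clicks.append("user-1", "page=/home")
clicks.append("user-1", "page=/cart")
first_poll = consumer.poll()
clicks.append("user-2", "page=/home")
second_poll = consumer.poll()
print(first_poll, second_poll)
```

Because the log is append-only and the consumer only moves its offset, a second consumer could read the same records independently.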
September – Process streaming data with Spark Streaming
Once you have learned to ingest streaming data with Kafka, it is time to learn how to process that data in real time. Although you can do that with Kafka, it is nowhere near as flexible for ETL purposes as Spark Streaming. Spark Streaming is an extension of the core Spark API for processing live data streams. You can focus on the following while learning Spark Streaming –
- Stateless vs. Stateful transformations
- Structured Streaming
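The difference between stateless and stateful transformations shows up even in plain Python. In the sketch below, each inner list stands for one micro-batch of made-up hashtag events. This only models the concept and does not use the Spark API.

```python
from collections import Counter

# Three micro-batches of made-up hashtag events arriving over time.
micro_batches = [
    ["#Spark", "#kafka"],
    ["#spark"],
    ["#Kafka", "#spark", "#hive"],
]

def stateless_transform(batch):
    """Stateless: the output depends only on the current batch."""
    return [tag.lower() for tag in batch]

def stateful_update(state, batch):
    """Stateful: fold the current batch into counts carried across batches."""
    state.update(batch)
    return state

running_counts = Counter()  # state that survives between batches
per_batch_outputs = []
for batch in micro_batches:
    normalised = stateless_transform(batch)
    per_batch_outputs.append(normalised)
    running_counts = stateful_update(running_counts, normalised)

print(per_batch_outputs)
print(running_counts)
```

The stateful version has to keep `running_counts` somewhere between batches, which is why streaming engines put so much effort into checkpointing state.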
Project – At the end of this quarter, you will be able to handle batch and streaming data. You will also have a good knowledge of data warehousing. So, you will know most of the tools and technologies a data engineer needs to master.
To showcase your skills, you can build a small project. Here you can leverage the benefits of streaming services in your Twitter sentiment analysis project from before. Let’s say you want to analyze all the tweets related to Data Engineering in real-time and store them in a database.
- Ingest tweets from the API into AWS Kinesis. This will ensure that you are not losing out on any incoming tweets.
- These tweets can be processed in small batches on AWS EMR with Spark Streaming API.
- The processed tweets can be stored in Hive tables.
- These processed tweets can be aggregated daily using Spark. For example, you can find the total number of tweets, common hashtags used, any upcoming event highlighted in the tweets, and so on.
October – Advanced Programming
The final quarter will focus on mastering the skills required to complete your data engineering profile. We will start by bringing our attention back to programming. We will focus on some advanced programming skills, which will be vital to your growth as a Data Engineer in 2023. These are useful for working on a larger project in the industry as we need to incorporate the best programming practices. For this, you can work on the following –
- OOP concepts – Classes, Objects, Inheritance
- Understand Recursive functions
- Testing – Unit and Integration testing
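Here is one small sketch that touches all three topics: a base class with a subclass, a recursive function, and tests written with the built-in `unittest` module. The `Source`, `ListSource`, and `flatten` names are invented for the example.

```python
import unittest

class Source:
    """Base class: every data source exposes read()."""

    def __init__(self, name):
        self.name = name

    def read(self):
        raise NotImplementedError

class ListSource(Source):
    """Subclass that inherits from Source and reads from an in-memory list."""

    def __init__(self, name, records):
        super().__init__(name)
        self.records = records

    def read(self):
        return list(self.records)

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{full_key}."))
        else:
            flat[full_key] = value
    return flat

class FlattenTest(unittest.TestCase):
    """Unit test: checks flatten() in isolation."""

    def test_nested_record(self):
        nested = {"id": 1, "user": {"name": "a", "geo": {"city": "x"}}}
        expected = {"id": 1, "user.name": "a", "user.geo.city": "x"}
        self.assertEqual(flatten(nested), expected)

class PipelineTest(unittest.TestCase):
    """Integration-style test: source and transform working together."""

    def test_source_then_flatten(self):
        source = ListSource("demo", [{"a": {"b": 2}}])
        self.assertEqual([flatten(r) for r in source.read()], [{"a.b": 2}])

# Build and run the suite explicitly so the script works in any runner.
loader = unittest.defaultTestLoader
suite = unittest.TestSuite(
    [loader.loadTestsFromTestCase(FlattenTest), loader.loadTestsFromTestCase(PipelineTest)]
)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would live in their own files and usually run through a test runner on every commit.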
November – NoSQL
Having worked with relational databases, you must have noticed some drawbacks: the data needs to fit a fixed schema, queries can slow down on very large datasets, and scaling out across many machines is hard. To address such drawbacks, the industry came up with NoSQL databases. These databases can handle structured, semi-structured, and unstructured data, are typically quick for data insertions, and can serve simple lookups quickly at scale. They are increasingly being used in the industry to capture user data.
So start by understanding the difference between SQL and NoSQL databases. Once you have done that, focus on the different types of NoSQL databases.
You can focus on learning one particular NoSQL database. I would suggest going for MongoDB, as it is popular in the industry and is also easy to pick up for someone who already knows SQL. To learn MongoDB, you can focus on the following –
- CAP theorem
- Documents and Collections
- CRUD operations
- Working with different types of operators – Array, Logical, Comparison, etc.
- Aggregation Pipeline
- Sharding and Replication in MongoDB
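To give a flavour of documents and the aggregation pipeline without needing a running server, the sketch below writes a pipeline in MongoDB's stage syntax and then reproduces the same three stages in plain Python. The collection, field names, and data are invented. With a real database, a list like `pipeline` is what you would pass to an aggregate call in a driver such as PyMongo.

```python
from collections import Counter

# Documents in a hypothetical "tweets" collection; field names are invented.
tweets = [
    {"_id": 1, "user": "ana", "lang": "en", "tags": ["data", "spark"]},
    {"_id": 2, "user": "ben", "lang": "en", "tags": ["data"]},
    {"_id": 3, "user": "ana", "lang": "fr", "tags": []},
    {"_id": 4, "user": "ana", "lang": "en", "tags": ["kafka"]},
]

# Aggregation pipeline in MongoDB's stage syntax: filter, group, then sort.
pipeline = [
    {"$match": {"lang": "en"}},
    {"$group": {"_id": "$user", "tweet_count": {"$sum": 1}}},
    {"$sort": {"tweet_count": -1}},
]

# Plain-Python equivalent of the three stages, for intuition only.
matched = [doc for doc in tweets if doc["lang"] == "en"]  # $match
counts = Counter(doc["user"] for doc in matched)          # $group
result = [
    {"_id": user, "tweet_count": n}
    for user, n in counts.most_common()                   # $sort, descending
]
print(result)
```

Notice that the documents do not all need the same fields, which is the main practical difference from rows in a relational table.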
Project – As an excellent hands-on practice, I would encourage you to set up a MongoDB cluster on AWS. You will need to host MongoDB servers on different EC2 instances. You can treat one of these as the Primary, and the rest can be treated as Secondary nodes. Further, you can employ Sharding and Replication concepts to solidify your understanding. And you can use any open-source API to treat it as a source of incoming data.
This goes beyond the basics and explores NoSQL in depth, but it will give you a good understanding of how servers are set up and interact with each other in the real world, and it will be a bonus on your journey to becoming a Data Engineer in 2023.
December – Workflow Scheduling
The ETL pipelines you build to get the data into databases and data warehouses must be managed separately. This is because any Data Engineering project involves building complex ETL pipelines, which must be scheduled at different time intervals and work with varying data types. We need a powerful workflow scheduling tool to manage such pipelines successfully and gracefully handle the errors. One of the most popular ones is Apache Airflow.
You can learn the following concepts in Apache Airflow –
- Task dependencies
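Task dependencies boil down to running tasks in an order that respects a directed acyclic graph. The sketch below shows that idea with Python's built-in `graphlib` module (Python 3.9 or newer). It is not Airflow code: the task names are invented, and in Airflow you would declare the same relationships between operators in a DAG file, for example with the `>>` operator.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_tweets": set(),
    "extract_users": set(),
    "transform": {"extract_tweets", "extract_users"},
    "load_warehouse": {"transform"},
    "send_report": {"load_warehouse"},
}

def run_task(name, log):
    """Stand-in for real work; records the execution order."""
    log.append(name)

execution_log = []
sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
while sorter.is_active():
    # get_ready() returns every task whose upstream tasks have all finished.
    for task in sorted(sorter.get_ready()):
        run_task(task, execution_log)
        sorter.done(task)

print(execution_log)
```

The two extract tasks become ready together, so a scheduler is free to run them in parallel before `transform` starts.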
Project – By now, you will have a thorough understanding of the essential tools in data engineering. These will prepare you for your dream job interview and to become a Data Engineer in 2023.
To showcase your skills, you can build a capstone project. Here, you can take up batch and streaming data to showcase your holistic knowledge of the field. You can manage the ETL pipelines using Apache Airflow. And finally, think of taking the project onto the cloud so that there is no resource crunch at any point in time. This is the final step toward becoming a Data Engineer in 2023.
These tools will get your journey started in the Data Engineering field. Armed with these tools and sample projects, you will be well prepared for internship and job interviews on your way to becoming a Data Engineer in 2023. But these Data Engineering tools are constantly evolving, so it is essential to stay updated with the most recent technologies and trends. Happy learning!