Data Engineering 101 – Getting Started with Apache Airflow

Last Updated : 22 Mar, 2024

Introduction

Automating work plays a key role in any industry and is one of the quickest ways to improve operational efficiency. Yet many of us do not know how to automate our tasks and end up in a loop of manually doing the same things again and again.

Most of us deal with workflows such as collecting data from multiple databases, preprocessing it, uploading it, and reporting on it. It would be great if these daily tasks triggered automatically at a defined time and all the processes executed in order. Apache Airflow is one tool that can help with this. Whether you are a Data Scientist, Data Engineer, or Software Engineer, you will likely find it useful.

In this article, we will discuss Apache Airflow and how to install it, and then create a sample workflow and code it in Python.

Table of contents

Introduction
What is Apache Airflow?
Features of Apache Airflow
Installation Steps
Components of Apache Airflow
User Interface
Define your first DAG
Conclusion

What is Apache Airflow?

Apache Airflow is a workflow engine that schedules and runs your complex data pipelines. It makes sure that each task of your data pipeline is executed in the correct order and that each task gets the resources it requires.

It also provides a user interface to monitor your workflows and fix any issues that may arise.

Features of Apache Airflow

Easy to Use: If you have a bit of Python knowledge, you are good to go and deploy on Airflow.
Open Source: It is free and open-source with a lot of active users.
Robust Integrations: It gives you ready-to-use operators so that you can work with Google Cloud Platform, Amazon Web Services (AWS), Microsoft Azure, etc.
Use Standard Python to Code: You can use Python to create simple to complex workflows with complete flexibility.
Amazing User Interface: You can monitor and manage your workflows, and check the status of completed and ongoing tasks.

Installation Steps

Let’s start with the installation of Apache Airflow. If you already have pip installed on your system, you can skip the first command. To install pip, run the following command in the terminal.

sudo apt-get install python3-pip

Next, Airflow needs a home on your local system. The default location is ~/airflow, but you can change it as per your requirement.

export AIRFLOW_HOME=~/airflow

Now, install Apache Airflow using pip with the following command.

pip3 install apache-airflow

Airflow requires a database backend to store the state of your workflows. To initialize the database, run the following command. This is the Airflow 1.x form of the command; in Airflow 2.x the equivalent is airflow db init.

airflow initdb

We have already discussed that Airflow has an amazing user interface. To start the web server, run the following command in the terminal. The default port is 8080; if you are using that port for something else, you can change it.

airflow webserver -p 8080

Now, start the Airflow scheduler using the following command in a different terminal. It runs all the time, monitors all your workflows, and triggers them as you have scheduled.

airflow scheduler

Now, create a folder named dags in the Airflow directory, where you will define your workflows or DAGs. Then open the web browser and go to http://localhost:8080/admin/ and you will see the Airflow home page, which lists the available DAGs.

Components of Apache Airflow

DAG: A Directed Acyclic Graph is a collection of all the tasks that you want to run, organized to show the relationships between the different tasks. It is defined in a Python script.
Web Server: It is the user interface, built on Flask. It allows us to monitor the status of the DAGs and trigger them.
Metadata Database: Airflow stores the status of all the tasks in a database and performs all read/write operations of a workflow from here.
Scheduler: As the name suggests, this component is responsible for scheduling the execution of DAGs. It retrieves and updates the status of the tasks in the database.
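
To make the idea of a Directed Acyclic Graph more concrete, here is a small sketch that does not use Airflow at all. It relies on Python's standard graphlib module (Python 3.9+) to work out a valid execution order from task dependencies, and shows that a cycle is rejected. The task names are hypothetical and only mirror the kind of workflow described in the introduction; this is an illustration of the concept, not how Airflow implements its scheduler.

```python
from graphlib import CycleError, TopologicalSorter

# Map each (hypothetical) task to the set of tasks it depends on.
dependencies = {
    "collect_data": set(),
    "preprocess": {"collect_data"},
    "upload": {"preprocess"},
    "report": {"upload"},
}

# A topological sort yields every task after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)

# A graph with a cycle has no valid order, so it is not a DAG.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Because the example graph is a simple chain, the only valid order is collect_data, preprocess, upload, report.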

User Interface

Now that you have installed Airflow, let’s have a quick overview of some of the components of the user interface.

DAGS VIEW

It is the default view of the user interface. It lists all the DAGs present in your system and gives a summarized view of each one: how many times a particular DAG ran successfully, how many times it failed, the last execution time, and some other useful links.

GRAPH VIEW

In the graph view, you can visualize every step of your workflow with its dependencies and current status. The status of each step is shown with a color code.

TREE VIEW

The tree view also represents the DAG. If your pipeline took longer to execute than expected, you can check here which part is taking a long time and then work on it.

TASK DURATION

In this view, you can compare the duration of your tasks across runs at different time intervals. This lets you optimize your algorithms and compare their performance.

CODE

In this view, you can quickly view the code that was used to generate the DAG.

Define your first DAG

Let’s start and define our first DAG.

In this section, we will create a workflow in which the first step prints “Getting Live Cricket Scores” on the terminal and the second step uses an API to print the live scores on the terminal. Let’s test the API first; for that, you need to install the cricket-cli library using the following command.

sudo pip3 install cricket-cli

Now, run the following command and get the scores.

cricket scores

It might take a few seconds, depending on your internet connection, and will then print the current live scores.

Importing the Libraries

Now, we will create the same workflow using Apache Airflow. The DAG will be defined entirely in Python. Let’s start by importing the libraries that we need. We will use only the BashOperator, as our workflow requires only Bash operations.

Defining DAG Arguments

For each DAG, we need to pass one dictionary of arguments. Here is a description of some of the arguments that you can pass:

owner: The name of the owner of the workflow. It should be alphanumeric and can contain underscores but should not contain any spaces.
depends_on_past: If set to True, a task runs only when the same task succeeded in the previous run. Mark it as True if each run of your workflow depends on the data of the past run; otherwise, mark it as False.
start_date: The start date of your workflow.
email: Your email ID, so that you can receive an email whenever any task fails.
retry_delay: How long to wait before retrying a task that has failed.
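
A minimal sketch of such an argument dictionary is shown below. All values are hypothetical placeholders chosen for illustration. The retries key is not in the list above; it is included here as an assumption, because retry_delay only has an effect when a task is allowed to retry at least once.

```python
from datetime import datetime, timedelta

# Default arguments applied to every task in the DAG.
# Every value here is a placeholder; adjust them for your own workflow.
default_args = {
    "owner": "data_engineer",             # no spaces, underscores allowed
    "depends_on_past": False,             # runs do not depend on earlier runs
    "start_date": datetime(2024, 1, 1),   # first date the DAG is valid for
    "email": ["you@example.com"],         # recipient for failure emails
    "retries": 1,                         # assumption: allow one retry
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes before retrying
}

print(default_args["retry_delay"])
```

The dictionary is plain Python, so it can be inspected or tested before it is ever passed to Airflow.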

Defining DAG

Now, we will create a DAG object and pass the dag_id, which is the name of the DAG and should be unique. Pass the arguments that we defined in the last step, and add a description and a schedule_interval, which will run the DAG after the specified interval of time.

Defining the Tasks

We will have 2 tasks for our workflow:

print: In the first task, we will print “Getting Live Cricket Scores!!!” on the terminal using the echo command.
get_cricket_scores: In the second task, we will print the live cricket scores using the library that we have installed.

Now, while defining a task, we first need to choose the right operator for it. Here both commands are terminal-based, so we will use the BashOperator.

We will pass the task_id, which is a unique identifier of the task; you will see this name on the nodes of the Graph View of your DAG. Pass the Bash command that you want to run and, finally, the DAG object to which you want to link this task.

Finally, create the pipeline by adding the “>>” operator between the tasks.

Update the DAGs in the Web UI

Now, refresh the user interface and you will see your DAG in the list. Turn on the toggle to the left of the DAG and then trigger it.

Click on the DAG and open the graph view. Each step in the workflow appears in a separate box, and its border turns dark green once the step has completed successfully.

Click on the node “get_cricket_scores” to get more details about this step, then click on View Log to see the output of your code.

That’s it. You have successfully created your first DAG in Apache Airflow.

Conclusion

I recommend you go through further data engineering resources to enhance your knowledge.

If you have any questions related to this article, do let me know in the comments section below.
