The Ultimate Guide to Setting Up an ETL (Extract, Transform, and Load) Process Pipeline

Prashant Sharma 02 Nov, 2021 • 8 min read
This article was published as a part of the Data Science Blogathon

What is ETL?

ETL stands for Extract, Transform, and Load. It is a process that extracts data from multiple source systems, transforms it (through calculations, concatenations, and so on), and then loads it into the Data Warehouse system.

It’s easy to believe that building a Data warehouse is as simple as pulling data from numerous sources and feeding it into a Data warehouse database. This is far from the case, and a complicated ETL procedure is required. The ETL process, which is technically complex, involves active participation from a variety of stakeholders, including developers, analysts, testers, and senior executives.

To preserve its value as a decision-making tool, the data warehouse system must develop in sync with business developments. ETL is a regular (daily, weekly, monthly) process of a data warehouse system that must be agile, automated, and properly documented.

ETL Process (Image 1: https://rivery.io/blog/etl-vs-elt-whats-the-difference/)

How Does ETL Work?

Here we will learn how the ETL process works step by step:

Step 1) Extraction

Data is extracted from the source system and placed in the staging area during extraction. If any transformations are required, they are performed in the staging area so that the performance of the source system is not harmed. Rollback will be difficult if damaged data is transferred directly from the source into the Data warehouse database. Before moving extracted data into the Data warehouse, it can be validated in the staging area.

Data warehouses must combine systems with disparate hardware, database management systems, operating systems, and communication protocols. Sources might include legacy programs such as mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners.

Thus, before extracting data and loading it physically, a logical data map is required. The connection between sources and target data is shown in this data map.
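
As a minimal sketch of what such a logical data map might look like (the table and column names here are placeholders, not taken from any real system), it can be as simple as a dictionary that records which source column feeds which target column:

# Hypothetical logical data map: (source table, source column) -> (target table, target column)
logical_data_map = {
    ('mysql.db_1.customers', 'cust_fname'): ('dim_customer', 'first_name'),
    ('mysql.db_1.customers', 'cust_lname'): ('dim_customer', 'last_name'),
    ('sqlserver.db1.orders', 'order_total'): ('fact_sales', 'revenue'),
}

# print each source-to-target mapping
for source, target in logical_data_map.items():
    print('{} -> {}'.format(source, target))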

Three Data Extraction methods:

  1. Partial Extraction (with update notification) – If the source system notifies you whenever a record is modified, this is the simplest way to obtain the data.
  2. Partial Extraction (without update notification) – Not all systems can send a notification when an update occurs, but many can identify which records have changed so that only those records need to be extracted (a sketch of this incremental approach follows the list).
  3. Full Extraction – Certain systems cannot determine which data has changed at all. In this scenario, the only way to get the data out of the system is to perform a full extract. This approach requires keeping a backup of the previous extract in the same format on hand in order to identify what has changed.
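
As a rough illustration of the second approach, the sketch below pulls only the rows whose last_updated timestamp is newer than the previous run. The last_updated column and the last_run bookkeeping are assumptions for this example, not part of the project files described later in this article.

# Hypothetical partial extraction without update notification:
# fetch only the rows changed since the previous run
import datetime
import mysql.connector

last_run = datetime.datetime(2021, 11, 1)  # in practice, persisted between runs

cnx = mysql.connector.connect(user='your_user_1', password='your_password_1',
                              host='db_connection_string_1', database='db_1')
cursor = cnx.cursor()
cursor.execute(
    "SELECT mysql_column_1, mysql_column_2, mysql_column_3 "
    "FROM mysql_table WHERE last_updated > %s",
    (last_run,))
changed_rows = cursor.fetchall()  # only the records modified since last_run
cursor.close()
cnx.close()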

Regardless of the method adopted, extraction should not have an impact on the performance or response time of the source systems. These are real-time production databases. Any slowdown or locking might have an impact on the company’s bottom line.

Step 2) Transformation

The data retrieved from the source server is raw and not usable in its original state. As a result, it must be cleaned, mapped, and transformed. In fact, this is the key step where the ETL process adds value by transforming data to produce meaningful BI reports.

Transformation is a key ETL concept in which you apply a collection of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.

You can execute customized operations on data during the transformation step. For example, the client might want a sum-of-sales revenue figure that does not exist in the database, or the first and last names in a table might sit in separate columns and need to be concatenated before loading, as in the sketch below.
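
As a minimal sketch of such transformations (the rows and column positions below are made up for illustration), concatenation and aggregation can be performed directly on the tuples returned by a cursor:

# Hypothetical transformation of extracted rows: (first_name, last_name, sale_amount)
rows = [('Prashant', 'Sharma', 120.0), ('Parshant', 'Sharma', 80.5)]

# concatenate first and last names into a single full name before loading
transformed = [(first + ' ' + last, amount) for first, last, amount in rows]

# compute a sum-of-sales revenue figure that does not exist in the source database
total_revenue = sum(amount for _, _, amount in rows)

print(transformed)      # [('Prashant Sharma', 120.0), ('Parshant Sharma', 80.5)]
print(total_revenue)    # 200.5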

The following are some examples of data integrity issues (a small cleansing sketch follows the list):

  1. Different spellings of the same individual's name, such as Prashant and Parshant.
  2. Multiple representations of a company name, such as Google and Google Inc.
  3. Use of different names, such as Cleaveland and Cleveland.
  4. Different applications generating multiple account numbers for the same client.
  5. Required fields left blank.
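
For the company-name example, one simple cleansing approach is to map known variants to a canonical form. The alias table below is hypothetical and only meant to illustrate the idea.

# Hypothetical cleansing step: map known company-name variants to one canonical form
company_aliases = {
    'google': 'Google',
    'google inc': 'Google',
    'google inc.': 'Google',
}

def normalise_company(name):
    key = name.strip().lower()
    return company_aliases.get(key, name.strip())  # unknown names pass through unchanged

print(normalise_company(' Google Inc. '))    # -> Google
print(normalise_company('Analytics Vidhya')) # -> Analytics Vidhya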

Step 3) Loading

The final stage in the ETL process is to load data into the target data warehouse database. A large volume of data is loaded in a relatively short period of time in a typical data warehouse. As a result, the load process should be optimized for performance.

In the event of a load failure, recovery mechanisms should be in place so that operations can restart from the point of failure without compromising data integrity. Data Warehouse administrators must be able to monitor, resume, and stop loads based on server performance.

Types of Loading:

  • Initial Load — populating all of the Data Warehouse tables for the first time
  • Incremental Load — applying ongoing changes periodically as needed
  • Full Refresh — erasing the contents of one or more tables and reloading them with fresh data (the difference between an incremental load and a full refresh is sketched after this list)
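
The sketch below contrasts a full refresh with an incremental load at the cursor level. The fact_sales table and its columns are placeholders, and target_cnx is assumed to be an open pyodbc connection like the one used later in this article.

# Hypothetical sketch: a full refresh clears the table before reloading everything,
# while an incremental load only appends the rows changed since the last run
insert_stmt = 'INSERT INTO fact_sales (column_1, column_2, column_3) VALUES (?, ?, ?)'

def full_refresh(target_cnx, all_rows):
    cursor = target_cnx.cursor()
    cursor.execute('DELETE FROM fact_sales')   # erase existing contents
    cursor.executemany(insert_stmt, all_rows)  # reload with fresh data
    cursor.close()

def incremental_load(target_cnx, changed_rows):
    cursor = target_cnx.cursor()
    cursor.executemany(insert_stmt, changed_rows)  # append only the ongoing changes
    cursor.close()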

Load verification

  • Check that the key field data is not missing or null (see the sketch after this list).
  • Modelling views based on target tables should be tested.
  • Examine the combined values and computed measures.
  • Data checks in the dimension and history tables.
  • Examine the BI reports on the loaded fact and dimension table.
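
As a simple example of the first check, a null-count query against a key column can flag missing key data after the load. The table and column names below are placeholders.

# Hypothetical load-verification check: key fields must not be missing or null
def verify_key_fields(target_cnx):
    cursor = target_cnx.cursor()
    cursor.execute('SELECT COUNT(*) FROM fact_sales WHERE column_1 IS NULL')
    null_count = cursor.fetchone()[0]
    cursor.close()
    if null_count:
        print('load verification failed: {} rows with a null key'.format(null_count))
    else:
        print('load verification passed')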

Setting Up ETL Using a Python Script

Suppose you need to run a basic Extract, Transform, Load (ETL) job from several databases into a data warehouse in order to aggregate data for business intelligence. There are several ETL packages available, but they can feel excessive for such a basic use case.

In this article, I’ll show you how to use Python 3.6 to extract data from MySQL, SQL Server, and Firebird, transform it, and load it into SQL Server (the data warehouse).

First of all, we have to create a directory for our project:

python_etl
    |__main.py
    |__db_credentials.py
    |__variables.py
    |__sql_queries.py
    |__etl.py

To set up ETL using Python, you’ll need to generate the following files in your project directory.

  • db_credentials.py: Holds all of the information needed to connect to each database, such as the database password, port number, etc.
  • sql_queries.py: Holds all commonly used database queries for extracting and loading data, stored as strings.
  • etl.py: Performs all of the necessary operations to connect to the databases and run the required queries.
  • main.py: Manages the flow of operations and executes them in the specified order.


Setup Database Credentials and Variables

In variables.py, create a variable to record the name of the data warehouse database.

datawarehouse_name = 'your_datawarehouse_name'

Configure all of your source and target database connection strings and credentials in db_credentials.py as shown below. Save each platform’s configurations in a list so that we can iterate over multiple databases later.

from variables import datawarehouse_name
# sql-server (target db, datawarehouse)
datawarehouse_db_config = {
  'Trusted_Connection': 'yes',
  'driver': '{SQL Server}',
  'server': 'datawarehouse_sql_server',
  'database': '{}'.format(datawarehouse_name),
  'user': 'your_db_username',
  'password': 'your_db_password',
  'autocommit': True,
}
# sql-server (source db)
sqlserver_db_config = [
  {
    'Trusted_Connection': 'yes',
    'driver': '{SQL Server}',
    'server': 'your_sql_server',
    'database': 'db1',
    'user': 'your_db_username',
    'password': 'your_db_password',
    'autocommit': True,
  }
]
# mysql (source db)
mysql_db_config = [
  {
    'user': 'your_user_1',
    'password': 'your_password_1',
    'host': 'db_connection_string_1',
    'database': 'db_1',
  },
  {
    'user': 'your_user_2',
    'password': 'your_password_2',
    'host': 'db_connection_string_2',
    'database': 'db_2',
  },
]
# firebird (source db)
fdb_db_config = [
  {
    'dsn': "/your/path/to/source.db",
    'user': "your_username",
    'password': "your_password",
  }
]

SQL Queries

The sql_queries.py file is where we store all of our SQL queries for extracting data from the source databases and loading it into our target database (data warehouse).

Because we are working with multiple data platforms, we have to handle different syntax for each database. We do this by separating the queries by database type.

# example queries, will be different across different db platform
firebird_extract = ('''
  SELECT fbd_column_1, fbd_column_2, fbd_column_3
  FROM fbd_table;
''')
firebird_insert = ('''
  INSERT INTO table (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')
firebird_extract_2 = ('''
  SELECT fbd_column_1, fbd_column_2, fbd_column_3
  FROM fbd_table_2;
''')
firebird_insert_2 = ('''
  INSERT INTO table_2 (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')
sqlserver_extract = ('''
  SELECT sqlserver_column_1, sqlserver_column_2, sqlserver_column_3
  FROM sqlserver_table
''')
sqlserver_insert = ('''
  INSERT INTO table (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')
mysql_extract = ('''
  SELECT mysql_column_1, mysql_column_2, mysql_column_3
  FROM mysql_table
''')
mysql_insert = ('''
  INSERT INTO table (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')
# exporting queries
class SqlQuery:
  def __init__(self, extract_query, load_query):
    self.extract_query = extract_query
    self.load_query = load_query
# create instances for SqlQuery class
fbd_query = SqlQuery(firebird_extract, firebird_insert)
fbd_query_2 = SqlQuery(firebird_extract_2, firebird_insert_2)
sqlserver_query = SqlQuery(sqlserver_extract, sqlserver_insert)
mysql_query = SqlQuery(mysql_extract, mysql_insert)
# store as list for iteration
fbd_queries = [fbd_query, fbd_query_2]
sqlserver_queries = [sqlserver_query]
mysql_queries = [mysql_query]

Extract Transform Load

To set up ETL using Python for the above-mentioned data sources, you’ll need the following modules:

# python modules
import mysql.connector
import pyodbc
import fdb

# variables
from variables import datawarehouse_name

We use two functions here: etl() and etl_process().

etl_process() establishes the source database connection and calls etl() based on the database platform.

The etl() function runs the extract query first, stores the returned rows in the variable data, and then inserts them into the target database, our data warehouse. Data transformation can be performed by modifying data, which is a list of tuples.

def etl(query, source_cnx, target_cnx):
  # extract data from source db
  source_cursor = source_cnx.cursor()
  source_cursor.execute(query.extract_query)
  data = source_cursor.fetchall()
  source_cursor.close()
  # load data into warehouse db
  if data:
    target_cursor = target_cnx.cursor()
    target_cursor.execute("USE {}".format(datawarehouse_name))
    target_cursor.executemany(query.load_query, data)
    print('data loaded to warehouse db')
    target_cursor.close()
  else:
    print('data is empty')
def etl_process(queries, target_cnx, source_db_config, db_platform):
  # establish source db connection
  if db_platform == 'mysql':
    source_cnx = mysql.connector.connect(**source_db_config)
  elif db_platform == 'sqlserver':
    source_cnx = pyodbc.connect(**source_db_config)
  elif db_platform == 'firebird':
    source_cnx = fdb.connect(**source_db_config)
  else:
    return 'Error! unrecognised db platform'
  # loop through sql queries
  for query in queries:
    etl(query, source_cnx, target_cnx)
  # close the source db connection
  source_cnx.close()

Putting Everything Together

Now, in main.py, we can loop over all of the credentials and run the ETL process for every database.

For that, we first import all of the required modules, variables, and methods:

# modules
import pyodbc

# variables
from db_credentials import (datawarehouse_db_config, sqlserver_db_config,
                            mysql_db_config, fdb_db_config)
from sql_queries import fbd_queries, sqlserver_queries, mysql_queries
from variables import *

# methods
from etl import etl_process

The code in this file iterates over the credentials in order to connect to each database and execute the necessary ETL operations.

def main():
  print('starting etl')
  # establish connection for target database (sql-server)
  target_cnx = pyodbc.connect(**datawarehouse_db_config)
  # loop through credentials
  # mysql
  for config in mysql_db_config: 
    try:
      print("loading db: " + config['database'])
      etl_process(mysql_queries, target_cnx, config, 'mysql')
    except Exception as error:
      print("etl for {} has error".format(config['database']))
      print('error message: {}'.format(error))
      continue
  # sql-server
  for config in sqlserver_db_config: 
    try:
      print("loading db: " + config['database'])
      etl_process(sqlserver_queries, target_cnx, config, 'sqlserver')
    except Exception as error:
      print("etl for {} has error".format(config['database']))
      print('error message: {}'.format(error))
      continue
  # firebird
  for config in fbd_db_config: 
    try:
      print("loading db: " + config['database'])
      etl_process(fbd_queries, target_cnx, config, 'firebird')
    except Exception as error:
      print("etl for {} has error".format(config['database']))
      print('error message: {}'.format(error))
      continue
  target_cnx.close()
if __name__ == "__main__":
  main()

In your terminal, run python main.py, and you’ve just created an ETL pipeline using a pure Python script.

ETL Tools

There are several Data Warehousing tools on the market. Here are some of the most famous examples:

1. MarkLogic:

MarkLogic is a data warehousing solution that makes data integration easier and faster using an array of enterprise features. It can query many sorts of data, including documents, relationships, and metadata.

https://www.marklogic.com/product/getting-started/

2. Oracle:

Oracle is the industry’s most popular database. It offers a wide variety of Data Warehouse solutions for both on-premises and cloud deployments, and helps improve customer experiences by boosting operational efficiency.

https://www.oracle.com/index.html

3. Amazon RedShift:

Redshift is a data warehousing solution from Amazon. It’s a simple and cost-effective solution for analyzing various sorts of data with standard SQL and existing business intelligence tools. It also enables the execution of complex queries on petabytes of structured data.

https://aws.amazon.com/redshift/?nc2=h_m1

Conclusion

This article gave you a deep understanding of what ETL is, as well as a step-by-step tutorial on how to set up your ETL in Python. It also gave you a list of some of the best-known tools that organizations use nowadays to build their ETL data pipelines.

However, most organizations today have a massive amount of data with a highly dynamic structure. Creating an ETL pipeline from scratch for such data is a difficult process, since organizations have to devote a large amount of resources to building the pipeline and then ensuring that it can keep up with high data volumes and schema changes.

About The Author

Prashant Sharma

Currently, I am pursuing my Bachelor of Technology (B.Tech) at Vellore Institute of Technology. I am very enthusiastic about programming and its real-world applications, including software development, machine learning, deep learning, and data science.

I hope you liked the article. If you want to connect with me, you can reach me on:

Linkedin

or, for any other questions, you can also send me an email.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.



Responses From Readers

Kusum 20 Jan, 2022

Can this ETL tool be used with unstructured data as well? Unstructured data like video, audio, or image files, as well as log files, sensor data, or social media posts, etc.