Shivani Sharma — September 1, 2021

This article was published as a part of the Data Science Blogathon

A data scientist’s ability to extract value from data is closely related to how well-developed a company’s data storage and processing infrastructure is. This means that the analyst must not only be able to build models, but also have sufficient data engineering skills to meet the needs of the company and take on increasingly ambitious projects.

 

Data Engineering

Maxime Beauchemin describes data engineering as follows: “This is a field that can be seen as a mixture of business intelligence and databases, which brings in more programming elements. It includes a specialization in distributed big data systems, the extended Hadoop ecosystem, and scalable computing.”

Among the many skills of a data engineer, one of the most important is the ability to design, build, and maintain data warehouses. Without a good storage infrastructure, any data mining activity is either too expensive or not scalable.

 

ETL: Extract, Transform, Load

Extract, Transform, and Load are three conceptually important steps that define how most modern data pipelines work. Today this is the basic model for making raw data ready for analysis.

Extract – at this step, sensors take data from various sources (user logs, copies of a relational database, an external dataset, etc.) as input and pass it on for subsequent transformation.

Image 1

Transform – this is the heart of any ETL. At this stage we apply the business logic and perform the filtering, grouping, and aggregation that turn the raw data into an analysis-ready dataset. This step requires an understanding of the business objectives and basic domain knowledge.

Load – finally, we load the processed data and deliver it to its end destination. End users can work with the resulting dataset directly, or it can serve as an input stream for another ETL.
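
To make these three steps concrete, here is a minimal Python sketch of a toy ETL job. The file names and event fields (event_type, market) are invented purely for illustration and are not tied to any particular framework.

import csv
import json

def extract(log_path):
    # Extract: read raw user-log events from a JSON-lines file (hypothetical source).
    with open(log_path) as f:
        return [json.loads(line) for line in f]

def transform(events):
    # Transform: apply business logic -- keep only booking events, then aggregate per market.
    counts = {}
    for event in events:
        if event.get("event_type") == "booking":
            counts[event["market"]] = counts.get(event["market"], 0) + 1
    return counts

def load(counts, out_path):
    # Load: write the analysis-ready dataset to its destination (here, a CSV file).
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["dim_market", "m_bookings"])
        for market, n in sorted(counts.items()):
            writer.writerow([market, n])

if __name__ == "__main__":
    load(transform(extract("user_logs.jsonl")), "bookings_by_market.csv")

In a real pipeline each of these steps would be a separate, scheduled job, but the shape of the flow stays the same.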

 

Which ETL Framework to Choose?

In the world of batch processing, there are several open-source frameworks to try and play with. Consider the following characteristics when choosing a specific framework:

Configuration – ETLs are inherently quite complex, so it matters how a user of the framework constructs them. Are pipelines defined through a user interface, or are they written in a programming language? Today the second approach is gaining more and more popularity, since programming pipelines makes them more flexible and lets you change any detail.

Image 2

Error monitoring and alerts – bulky, long-running batch jobs sooner or later fail, even if there are no bugs in the job itself. As a consequence, monitoring and error reporting come to the fore. How well does the framework visualize the progress of a job? Do alerts arrive on time?

Backfilling of data – often, after building a pipeline, we need to go back and re-process historical data. Ideally, we would not want to maintain two independent jobs: one for backfilling historical data and another for current activity. How easy is backfilling with this framework? Is the resulting solution scalable and efficient?

 

2 paradigms: SQL vs. JVM

As we have seen, companies have a huge choice of tools to use for ETL, and it is not always clear to a novice data scientist which framework to devote time to. Speaking from personal experience: at Washington Post Labs, jobs were scheduled primitively with cron; at Twitter, ETL jobs were built in Pig; and now at Airbnb we write pipelines in Hive through Airflow.

JVM-oriented ETLs are usually written in a JVM language (Java or Scala). Building data pipelines in these languages means defining transformations in terms of key-value pairs, but writing custom functions and testing jobs becomes easier because you don’t need to switch to another programming language.

SQL-oriented ETLs are usually written in SQL or Hive. Almost everything in them revolves around SQL and tables, which is very convenient. At the same time, writing custom functions can be problematic, since it requires another language (for example, Java or Python).

I prefer SQL-oriented ETLs: for a newcomer to data science it is easier to learn SQL than Java or Scala (unless, of course, you already know them) and to concentrate on learning new practices rather than stacking that on top of learning a new language.

Data Modeling, Normalization, and Star Schema

In the process of building a high-quality analytical platform, the main goal of the system designer is to make analytical queries easy to write and common statistics efficient to compute. To achieve this, we first need to define a data model.

One of the first steps in data modeling is to decide how far the tables should be normalized. In general, normalized tables have simpler schemas, more standardized data, and eliminate some types of redundancy. At the same time, using such tables means that establishing relationships between them requires more accuracy and diligence, queries become more complex (more JOINs), and more ETL jobs need to be maintained.

On the other hand, it is much easier to write queries against denormalized tables, since all dimensions and metrics are already joined together. However, given the larger size of such tables, data processing becomes slower (this is debatable, since it all depends on how the data is stored and what the queries look like: you can, for example, store large tables in HBase and read individual columns, in which case queries will be fast. Translator’s note).

Of all the data models that try to find the perfect balance between the two approaches, one of the most famous (and the one we use at Airbnb) is the “star” schema. It is based on building normalized tables (fact tables and dimension tables), from which denormalized tables can be derived when needed. As a result, this design tries to balance the ease of analytics with the complexity of ETL support.

Image 3

Fact tables most often contain transactional data captured at specific points in time. Each row in such a table can be extremely simple and most often represents a single transaction. At Airbnb, we have a variety of fact tables that store data by transaction type: bookings, checkouts, cancellations, and more.

Dimension tables contain slowly changing attributes of certain keys from the fact table, and they can be joined to it using these keys. The attributes themselves can be organized within a hierarchical structure. Airbnb, for example, has dimension tables with users, orders, and markets to help us analyze data in detail.

Below is a simple example of how fact tables and dimension tables (both normalized) can be joined to answer a simple question: how many bookings were made in the last week in each market?

SELECT
    j.dim_market
  , SUM(i.m_bookings) AS m_bookings
FROM (
SELECT
      id_listing
    , 1          AS m_bookings
    , m_i        -- not used in this query; shown for illustration
    , m_j        -- not used in this query; shown for illustration
    , m_k        -- not used in this query; shown for illustration
FROM
    fct_bookings
WHERE
    ds BETWEEN '{{ previous_sunday }}' AND '{{ this_saturday }}'
) i 
JOIN (
SELECT
      id_listing
    , dim_market
    , dim_x      -- not used in this query; shown for illustration
    , dim_y      -- not used in this query; shown for illustration
    , dim_z      -- not used in this query; shown for illustration
FROM
    dim_listings
WHERE
    ds = '{{ latest_ds }}'
) j
ON (i.id_listing = j.id_listing)
GROUP BY
  j.dim_market
;

Partitioning data by timestamp

Now that the cost of storing data is very low, companies can afford to keep historical data in their warehouses instead of throwing it away. The downside of this trend is that as data accumulates, analytical queries become inefficient and slow. Along with SQL principles such as “filter data as early as possible” and “select only the fields you need”, there is one more technique that improves query efficiency: data partitioning.

The basic idea of partitioning is very simple: instead of storing data in one piece, we split it into several independent parts. Each part retains the primary key of the original data, so any piece of it can still be accessed quickly.

In particular, using a timestamp as the partitioning key has a number of advantages. First, in S3 storage, raw data is often sorted by timestamp and stored in directories that are likewise labeled with it. Second, a batch ETL job usually runs about once a day, so new data partitions are created daily for each job. Finally, many analytical queries involve counting events that occurred over a certain time period, so partitioning by time is very useful here.
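
As a rough sketch of what such a time-partitioned layout can look like on disk, the snippet below (plain Python, not tied to any particular warehouse; the record fields ts, id_listing, and dim_market are assumed just for this example) groups records by the date part of their timestamp and writes each group into its own ds=YYYY-MM-DD directory, mirroring the Hive/S3 convention described above.

import csv
from collections import defaultdict
from pathlib import Path

def write_partitioned(records, root="fct_bookings"):
    # Group records by the date part of their timestamp,
    # e.g. '2021-09-01T12:34:56' -> partition key '2021-09-01'.
    by_ds = defaultdict(list)
    for record in records:
        by_ds[record["ts"][:10]].append(record)
    # Write one directory per partition, e.g. fct_bookings/ds=2021-09-01/part-00000.csv
    for ds, rows in by_ds.items():
        part_dir = Path(root) / ("ds=" + ds)
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-00000.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["ts", "id_listing", "dim_market"])
            writer.writeheader()
            writer.writerows(rows)

# A time-bounded query then only has to read the matching ds= directories
# instead of scanning the whole table.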

Backfilling historical data

Another major benefit of using a timestamp as the partitioning key is the ease of backfilling data. A finished ETL pipeline computes metrics and dimensions going forward, not retrospectively. Often, though, we would like to look at historical trends by computing those measures for the past as well; this process is called backfilling.
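
Conceptually, a backfill is simply the same daily job executed once for every historical partition. Here is a minimal Python sketch of enumerating those partitions; the earl_ds and lat_ds names anticipate the queries below, and the date values are only examples.

from datetime import date, timedelta

def ds_range(earl_ds, lat_ds):
    # Yield every ds value (YYYY-MM-DD) between the two bounds, inclusive --
    # i.e. the set of partitions a backfill has to recompute.
    start, end = date.fromisoformat(earl_ds), date.fromisoformat(lat_ds)
    for offset in range((end - start).days + 1):
        yield (start + timedelta(days=offset)).isoformat()

# Example: list(ds_range('2021-08-01', '2021-08-03'))
# -> ['2021-08-01', '2021-08-02', '2021-08-03']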

Backfilling is so widespread that Hive has built-in dynamic partitioning for executing equivalent SQL queries across multiple partitions at once. Let’s illustrate this idea with an example: suppose you need to backfill the number of bookings per market for a dashboard, starting from earl_ds and ending with lat_ds. One possible solution looks like this:

INSERT OVERWRITE TABLE bookings_summary PARTITION (ds= '{{ earl_ds }}')
SELECT
      dim_market
    , SUM(m_bookings) AS m_bookings
FROM
    fct_bookings
WHERE
    ds = '{{ earl_ds }}'
GROUP BY
    dim_market
;
-- ... after many similar INSERTs, one per day from '{{ earl_ds + 1 day }}' to '{{ lat_ds - 1 day }}' ...
INSERT OVERWRITE TABLE bookings_summary PARTITION (ds= '{{ lat_ds }}')
SELECT
      dim_market
    , SUM(m_bookings) AS m_bookings
FROM
    fct_bookings
WHERE
    ds = '{{ lat_ds }}'
GROUP BY
    dim_market
;
Such a query works, but it is too cumbersome, since we are performing the same operation over and over, just on different partitions. Using dynamic partitioning, we can simplify everything to a single query:
INSERT OVERWRITE TABLE bookings_summary PARTITION (ds)
SELECT
      dim_market
    , SUM(m_bookings) AS m_bookings
    , ds              
FROM
    fct_bookings
WHERE
    ds BETWEEN '{{ earl_ds }}' AND '{{ lat_ds }}'
GROUP BY
      dim_market
    , ds
;

This approach is very effective and is used in many pipelines at Airbnb.

Now, let’s look at all the concepts learned using the example of ETL jobs in Airflow.

 

Directed acyclic graph (DAG)

From the point of view of the ETL idea, jobs might seem very simple, but in reality they are often quite convoluted and consist of many combinations of Extract, Transform, and Load operations. In this case it is very useful to visualize the entire data flow as a graph, where a node represents an operation and an arrow represents the relationship between operations. Since each operation is performed once and the data moves forward along the graph, the graph is directed and acyclic, hence the name.

Image 4

One of the features of the Airflow interface is a mechanism for visualizing the data pipeline through its DAG.

In addition to DAGs, which determine the order in which operations are launched, Airflow has operators that specify what needs to be done within the pipeline. There are usually three types of operators, each of which corresponds to one of the stages of the ETL process:

  • Sensors: open the data stream after a certain amount of time has passed, or when data from the input source becomes available (as in the Extract stage).

  • Operators: run specific commands (execute a Python file, run a query in Hive, etc.). As in the Transform stage, operators are concerned with transforming data.

  • Transfers: transfer data from one place to another (as in the Load stage).

 

A simple example

Below is a simple example of how to declare a DAG file and define the graph structure using the Airflow operators we discussed above:

from datetime import datetime, timedelta
from airflow.models import DAG  
from airflow.operators.sensors import NamedHivePartitionSensor
from airflow.operators.hive_operator import HiveOperator
# from airflow.operators.bash_operator import BashOperator
# from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'you',
    'depends_on_past': False,
    'start_date': datetime(2018, 2, 9),
}
# Instantiate the Airflow DAG
dag = DAG(
    dag_id='anatomy_of_a_dag',
    description="This describes my DAG",
    default_args=default_args,
    schedule_interval=timedelta(days=1))   # This is a daily DAG.
# Put upstream dependencies in a dictionary
wf_dependencies = {
    'wf_upstream_table_1': 'upstream_table_1/ds={{ ds }}',
    'wf_upstream_table_2': 'upstream_table_2/ds={{ ds }}',
    'wf_upstream_table_3': 'upstream_table_3/ds={{ ds }}',
}
# Define the sensors for upstream dependencies
for wf_task_id, partition_name in wf_dependencies.items():
    NamedHivePartitionSensor(
        task_id=wf_task_id,
        partition_names=[partition_name],
        dag=dag
    )
# Put the tasks in a list
tasks = [
    ('hql', 'task_1'),
    ('hql', 'task_2'),
]
# To define the operators listed above
for directory, task_name in tasks:
    HiveOperator(
        task_id=task_name,
        hql='{0}/{1}.hql'.format(directory, task_name),
        dag=dag,
    )
# Put the dependencies in a map
deps = {
    'task_1': [
        'wf_upstream_table_1',
        'wf_upstream_table_2',
    ],
    'task_2': [
        'wf_upstream_table_1',
        'wf_upstream_table_2',
        'wf_upstream_table_3',
    ],
}
for downstream, upstream_list in deps.items():
    for upstream in upstream_list:
        dag.set_dependency(upstream, downstream)

And the final picture after graph building is shown below:

Image 5

Conclusion

So, I hope that in this article I managed to quickly and effectively immerse you in the interesting and diverse field of Data Engineering. We looked at what ETL is and at the advantages and disadvantages of various ETL platforms. Then we discussed data modeling, the star schema in particular, and saw how fact tables differ from dimension tables. Finally, after covering concepts such as data partitioning and backfilling, we walked through a small ETL job in Airflow. Now you can continue learning how to work with data on your own, building up your knowledge base. See you!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Image Sources :

  1. Image 1 – https://www.google.com/imgres?imgurl=https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1*Uo56hrm9r5L_5fmPsY7I9A.png&imgrefurl=https%3A%2F%2Fmedium.com%2F%40rchang%2Fa-beginners-guide-to-data-engineering-part-i-4227c5c457d7&tbnid=cqts3VKjlDHZ5M&vet=12ahUKEwjwmPam-dfyAhVwmUsFHdMIBJEQMygFegUIARCyAQ..i&docid=60AlnqM9_quSSM&w=1400&h=350&q=Introduction%20to%20Data%20Engineering-%20ETL%2C%20star%20schema%20and%20Airflow&ved=2ahUKEwjwmPam-dfyAhVwmUsFHdMIBJEQMygFegUIARCyAQ
  2. Image 2 – https://biconsult.ru/img/cloud_dwh/cd1-4-3.png
  3. Image 3 – https://biconsult.ru/img/cloud_dwh/cd1-4-3.png
  4. Image 4 –  https://biconsult.ru/img/cloud_dwh/cd1-4-3.png
  5. Image 5 – https://biconsult.ru/img/cloud_dwh/cd1-4-6.png
