Understand The Internal Working of Apache Spark

Dhanya Last Updated : 24 Jul, 2024

7 min read

In this fast-paced digitized world, the size of data generation is increasing every second. This data cannot be thrown away as this may help to get important business insights. Apache Spark is the largest open-source project for data processing.

In this article, I am going to discuss the internals of Apache Spark and the high level of Spark architecture.

Introduction

Apache Spark is an open-source distributed big data processing engine. It provides a common processing engine for both streaming and batch data. It provides parallelism and fault tolerance. Apache Spark provides high-level APIs in four languages such as Java, Scala, Python and R. Apace Spark was developed to eliminate the drawbacks of Hadoop MapReduce. Spark works on the concept of in-memory computation which makes it around a hundred times faster than Hadoop MapReduce.

This article was published as a part of the Data Science Blogathon

Introduction
Spark Components
Cluster Managers
Spark RDD
The high-level architecture of Spark
Spark Architecture run-time components
Spark sample program
How does Spark works?
Endnotes

Let’s see the different components of Spark.

Spark Components

Source

Spark consists of one of the bigger components called Spark core which contains the majority of the libraries. On top of this Spark core, there are four different components. They are

Spark SQL
Spark Streaming
MLLIB
GraphX

Spark SQL

It provides a SQL like interface to do the data processing with Spark as a processing engine. It can process both structured and semi-structured data. Using Spark SQL, you can query in SQL and HQL too. It can process data from multiple sources.

Spark Streaming

It provides support for processing a large stream of data. The real-time streaming is supported by using micro-batching. The incoming live data is converted into small batches and processed by the processing engine.

MLLIB

This component provides libraries for machine learning towards the statistic and dynamic analysis of the data. It also includes clustering, classification, regression and many other machine learning algorithms. It delivers high-quality algorithms with high speed.

GraphX

It provides a distributed graph computation on top of the Spark core. It consists of several Spark RDD API which helps in creating directed graphs whose vertices and edges are linked with arbitrary properties. Using GraphX, traversal, searching and pathfinding can be done.

Cluster Managers

Spark also consists of three pluggable cluster managers.

Standalone

It is a simple cluster manager which is easier to set up and execute the tasks on the cluster. It is a reliable cluster manager which can handle failures successfully. It can manage the resources based on the application requirements.

Apache Mesos

It is a general cluster manager from the Apache group that can also run Hadoop MapReduce along with Spark and other service applications. It consists of API for most of the programming languages.

Hadoop YARN

It is a resource manager which was provided in Hadoop 2. It stands for Yet Another Resource Negotiator. It is also a general-purpose cluster manager and can work in both Hadoop and Spark.

These are the three cluster managers supported by the Spark ecosystem.

Now, let me give a glimpse of the most essential part of Apache Spark. i.e., Spark RDD.

Spark RDD

Apache Spark has a fundamental data structure called Resilient Distributed Dataset (RDD) which forms the backbone of Apache Spark. It performs two main functions called transformations and actions. Transformations are the operations that we perform on the input data. Examples are Map(), Filter(), sortBy() which we can perform on the required data. Transformations create new RDD based on the operations we perform. Actions are the processes that we perform on the newly created RDD to get the desired results. Examples of actions are collect(), countByKey(). Action returns the result to the driver.

The high-level architecture of Spark

Now, let’s look into the high-level architecture of the Spark execution program.

When we run any job
with Spark, this is how underlying execution happens.

Source

There would be one driver program that works with the cluster manager to schedule tasks on the worker nodes. So, once these tasks are completed, they will return the result to the driver program.

Let me explain this
process in detail.

Spark Architecture run-time components

Spark Driver

The first and foremost activity of the Spark driver is to call the main method of the program. Whenever you write any execution code in Spark, you will give the main class which is the entry point to your program and that point is executed by the Spark driver. The Spark driver creates the Spark context or Spark session depends on which version of Spark you are working in. The driver is the process that runs the user code which eventually creates RDD data frames and data units which are data unit abstractions in the Spark world. The driver performs all the different transformations and executes the actions in the form of tasks that are executed on different executors. The two main functions performed by the driver are

Converting the user program into tasks
Scheduling those tasks on the executors with the help of the cluster manager.

Spark cluster manager

Spark execution is agnostic to the cluster manager. You can plug in any of the three available cluster managers or supported cluster managers and the execution behaviour don’t change that it doesn’t change based on which cluster manager you are using.

Spark lies on the cluster manager to launch the executors. This is the prerogative of the cluster manager to schedule and launch executors. Resources are allocated by the cluster manager for the execution of the tasks. The cluster manager is a pluggable component as we have seen we have a choice out of three available and supported cluster managers and we can plug any one of them based on our use case and needs. Cluster manager can dynamically adjust your resource used by the Spark application depending on the workload. The cluster manager can increase the number of executors or decrease the number of executors based on the kind of workload data processing needs to be done.

Now, let’s see what are the different activities performed by Spark executors.

Spark executor

The individual tasks in the given Spark job run in the Spark executor. The Spark executors run the actual programming logic of data processing in the form of tasks. The executors are launched at the beginning of the Spark application when you submit to do the jobs and they run for the entire lifetime of an application. The two main roles of the executors are

To run the tasks and return the results to the driver.
It provides in-memory
storage for the RDD data set and data frames that are cached by the user.

So,
the executor is the actual unit of execution that performs tasks for the data
processing.

Spark sample program

Consider the following sample program written in PySpark.

from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize (
   ["scala", 
   "java", 
   "hadoop", 
   "spark", 
   "akka",
   "spark vs hadoop", 
   "pyspark",
   "pyspark and spark"]
)
words_filter = words.filter(lambda x: 'spark' in x)
counts = words_filter.count()
print "Number of elements filtered  -> %i" % (counts)

//sample output
Number of elements
filtered -> 4

First, we are importing the SparkContext which is the entry point of Spark functionality. The second line creates a SparkContext object. Next, we are creating a sample RDD ‘words’ by the parallelize method. The words containing the string ‘spark’ is filtered and stored in words_filter. Count() function is used to count the number of words filtered and the result is printed.

Here, the process of applying a filter to the data in RDD is transformation and counting the number of words is action.

Now, let’s take a further deep dive into what happens when you do submit a particular Spark job.

How does Spark works?

This is the detailed diagram of Spark Architecture.

Source

The first block you see is the driver program. Once you do a Spark submit, a driver program is launched and this requests for resources to the cluster manager and at the same time the main program of the user function of the user processing program is initiated by the driver program.

Based on that, the execution logic is processed and parallelly Spark context is also created. Using the Spark context, the different transformations and actions are processed. So, till the time the action is not encountered, all the transformations will go into the Spark context in the form of DAG that will create RDD lineage.

Once the action is called job is created. Job is the collection of different task stages. Once these tasks are created, they are launched by the cluster manager on the worker nodes and this is done with the help of a class called task scheduler.

The conversion of RDD lineage into tasks is done by the DAG scheduler. Here DAG is created based on the different transformations in the program and once the action is called these are split into different stages of tasks and submitted to the task scheduler as tasks become ready.

Then these are launched on the different executors in the worker node through the cluster manager. The entire resource allocation and the tracking of the jobs and tasks are performed by the cluster manager.

As soon as you do a Spark submit, your user program and other configuration mentioned are copied onto all the available nodes in the cluster. So that the program becomes the local read on all the worker nodes. Hence, the parallel executors running on the different worker nodes do not have to do any kind of network routing.

This is how the entire execution of the Spark job happens.

Endnotes

Now you have a basic idea about the internal working of Apache Spark and how it is applied in solving big data problems.

I hope you enjoyed reading this article.

Thanks for reading, cheers!

Please take a look at
my other articles on Dhanya_Thailappan, Author at Analytics Vidhya.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Dhanya

Beginner Data Engineering Programming Python Spark

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Marc Leprince

Excellent blog article Dhanya! This really is a fantastic summary of how Apache Spark executes at runtime. You clearly explain the key logical components involved as well as the flow of execution with some fantastic references. This really helped me better understand Apache Spark execution flow and architecture at a deeper level!

best ca in Delhi

thank you for sharing a great full information and good explanation

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Understand The Internal Working of Apache Spark

Introduction

Table of contents

Spark Components

Cluster Managers

Spark RDD

The high-level architecture of Spark

Spark Architecture run-time components

Spark sample program

How does Spark works?

Endnotes

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid