Understanding the Basics of Apache Spark RDD
This article was published as a part of the Data Science Blogathon
In this article, I am going to discuss one of the most essential parts of Apache Spark: the RDD.
Before getting into Spark RDDs, I strongly recommend reading my article, Understand the internal working of Apache Spark, to get an overview of how Apache Spark works.
Table of Contents
- What is RDD in Spark?
- Features of Spark RDD
- How to create RDDs?
- Operations of RDD
- Practical demo of RDD operations
- When to use RDDs?
Let’s start by understanding what an RDD is in Spark.
What is RDD in Spark?
RDD stands for Resilient Distributed Dataset. It is considered the backbone of Apache Spark and has been available since the first release of Spark, which is why it is regarded as Spark’s fundamental data structure. The data structures in newer versions of Spark, such as Datasets and DataFrames, are built on top of RDDs, so in Spark almost everything you do revolves around RDDs. The data in an RDD is divided into logical partitions. Because the data is logically partitioned, different pieces of it can be sent to different nodes of the cluster and processed in parallel. This is how RDDs help Spark achieve efficient data processing.
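To make the idea of logical partitions concrete, here is a minimal sketch. It assumes you paste it into spark-shell, where sc (the SparkContext) is already defined; the name numbers and the partition count of 4 are arbitrary choices for illustration.

```scala
// Assumes spark-shell, where `sc` is predefined.
// Ask Spark to split the collection into 4 partitions (an arbitrary number).
val numbers = sc.parallelize(1 to 12, 4)

// Number of partitions the RDD was split into
val partitionCount = numbers.getNumPartitions   // 4

// glom() turns each partition into an array, so we can see how the
// elements were distributed. Each printed line is one partition.
numbers.glom().collect().foreach(p => println(p.mkString(", ")))
```

On a real cluster, each of these partitions can be processed by a different executor.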
Features of Spark RDD
Spark RDD possesses the following features.
The most important fact about an RDD is that it is immutable: you cannot change its state once it has been created. If you need different data, you apply an operation that produces a new RDD and leaves the existing one untouched. Hence, the original RDD can be retrieved at any time.
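A small sketch of what immutability means in practice, assuming spark-shell with sc predefined. The names original and doubled are only illustrative.

```scala
// Assumes spark-shell, where `sc` is predefined.
val original = sc.parallelize(List(1, 2, 3))

// map() does not modify `original`; it returns a new RDD
val doubled = original.map(_ * 2)

original.collect()   // Array(1, 2, 3), unchanged
doubled.collect()    // Array(2, 4, 6)
```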
Data stored on disk takes a long time to load and process. Spark supports in-memory computation, which keeps data in RAM instead of on disk where possible. Hence, computation in Spark is much faster.
Transformations on RDDs are lazy. With lazy evaluation, the results are not computed immediately; Spark only records what has to be done and generates the results when an action is triggered. This avoids unnecessary work and improves the performance of the program.
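One way to observe lazy evaluation is to count how often the function passed to map() actually runs. The sketch below assumes spark-shell with sc predefined and Spark 2.0 or later (longAccumulator is not available in older versions); the accumulator name calls is my own choice.

```scala
// Assumes spark-shell, where `sc` is predefined, and Spark 2.0+.
// The accumulator counts how many times the map function is executed.
val calls = sc.longAccumulator("calls")

val mapped = sc.parallelize(List(1, 2, 3)).map { x =>
  calls.add(1)
  x * 10
}

// Only a transformation has been declared so far, so nothing has run.
val before = calls.value   // 0

// collect() is an action: it triggers the actual computation.
mapped.collect()           // Array(10, 20, 30)
val after = calls.value    // 3 in a simple run without task retries
```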
As I said earlier, when you perform an operation on an existing RDD, a new RDD is created and the existing one is left unchanged. Spark also keeps track of the sequence of operations (the lineage) used to build each RDD, so any lost partition can be recomputed from its source data. This feature makes Spark RDDs fault-tolerant.
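Spark exposes the chain of RDDs behind a result through toDebugString, which is a convenient way to see what it would replay to rebuild lost data. This is a minimal sketch assuming spark-shell with sc predefined; the exact text printed depends on your Spark version, so I do not show it here.

```scala
// Assumes spark-shell, where `sc` is predefined.
val base = sc.parallelize(1 to 10)
val result = base.map(_ * 2).filter(_ > 5)

// Prints the chain of RDDs that `result` depends on, newest first
val lineage = result.toDebugString
println(lineage)
```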
Datasets in RDDs are usually huge. The data is therefore partitioned and sent across different nodes for distributed computing.
Intermediate results of an RDD computation can be persisted (cached) so that later operations reuse them instead of recomputing them. This makes repeated computations on the same RDD faster.
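Here is a rough sketch of how intermediate results can be kept around, assuming spark-shell with sc predefined. The RDD name squares and the choice of the MEMORY_ONLY storage level are only for illustration.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes spark-shell, where `sc` is predefined.
val squares = sc.parallelize(1 to 1000).map(x => x * x)

// Mark the RDD to be kept in memory once it has been computed.
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
squares.persist(StorageLevel.MEMORY_ONLY)

val n = squares.count()            // first action: computes and caches the partitions
val total = squares.reduce(_ + _)  // second action: can reuse the cached partitions

val level = squares.getStorageLevel

// Release the cached data when it is no longer needed
squares.unpersist()
```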
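Spark RDD operations can be described as coarse-grained or fine-grained. A coarse-grained operation applies to the whole dataset, which is how transformations such as map() and filter() work, while a fine-grained operation applies to individual elements. RDD transformations are coarse-grained; fine-grained access is available only for reading data, not for updating individual elements in place.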
How to create RDDs?
In Apache Spark, RDDs can be created in three ways.
- By calling the parallelize() method on a collection that already exists in the driver program.
- By referencing a dataset in an external storage system such as HDFS or HBase.
- By applying a transformation to an existing RDD.
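The three ways listed above can be sketched as follows, assuming spark-shell with sc predefined and Spark running in local mode. Since I cannot assume an HDFS path on your machine, a temporary local file stands in for the external storage; all names here are illustrative.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Assumes spark-shell, where `sc` is predefined.

// 1. From a collection that exists in the driver program
val fromCollection = sc.parallelize(List(1, 2, 3, 4))

// 2. From external storage. A temporary local file is used as a stand-in;
//    on a cluster you would pass an HDFS path to textFile() instead.
val tmp = Files.createTempFile("rdd-demo", ".txt")
Files.write(tmp, "spark\nrdd\n".getBytes(StandardCharsets.UTF_8))
val fromFile = sc.textFile(tmp.toUri.toString)   // one element per line

// 3. From an existing RDD, by applying a transformation
val fromRdd = fromCollection.map(_ * 10)
```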
Operations of RDD
Two types of operations can be applied to an RDD: transformations and actions.
Transformations are operations that you perform on an RDD to get a result that is also an RDD. Examples are functions such as filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions(), and sortBy(), each of which creates a new RDD. Transformations are evaluated lazily: the new RDD is not computed until an action needs it.
Actions kick off a computation and either return the result to the driver program or write it to storage. Some examples are count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce().
Transformations always return an RDD, whereas actions return some other data type.
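The difference in return types is easy to see if you write the types out explicitly. This is a small sketch assuming spark-shell with sc predefined; the variable names are only illustrative.

```scala
import org.apache.spark.rdd.RDD

// Assumes spark-shell, where `sc` is predefined.
val nums = sc.parallelize(List(3, 1, 2))

// Transformation: the result is again an RDD
val evens: RDD[Int] = nums.filter(_ % 2 == 0)

// Actions: the results are ordinary Scala values in the driver program
val total: Int = nums.reduce(_ + _)        // 6
val howMany: Long = nums.count()           // 3
val asArray: Array[Int] = nums.collect()   // Array(3, 1, 2)
```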
Practical demo of RDD operations
Let’s take a practical look at some of the RDD operations. To practice Apache Spark, you need to install the Cloudera virtual environment. You can find a detailed guide to installing the Cloudera VM here.
First, let’s create an RDD using the parallelize() method, which is the simplest way.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))
Here, sc denotes the SparkContext, and each element of the list is copied to form the RDD.
We can read the result generated by RDD by using the collect operation.
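In spark-shell, a call along these lines should do it. The definition of rdd1 is repeated from above only so that the snippet runs on its own; sc is assumed to be the predefined SparkContext.

```scala
// Assumes spark-shell, where `sc` is predefined.
// Same rdd1 as above, repeated so the snippet is self-contained.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))

// collect() brings every element of the RDD back to the driver as an Array
val all = rdd1.collect()   // Array(23, 45, 67, 86, 78, 27, 82, 45, 67, 86)
```

Keep in mind that collect() pulls the whole dataset into the driver, so it is best used on small RDDs like this one.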
The result is an array containing all the elements of rdd1.
The count action is used to get the total number of elements present in the particular RDD.
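A minimal sketch of the count action, again assuming spark-shell with sc predefined and repeating the definition of rdd1 so the snippet runs on its own:

```scala
// Assumes spark-shell, where `sc` is predefined.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))

// count() returns the number of elements as a Long
val total = rdd1.count()   // 10
```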
There are 10 elements in rdd1.
Distinct is a type of transformation that is used to get the unique elements in the RDD.
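A sketch of the distinct transformation, assuming spark-shell with sc predefined. Note that distinct() shuffles the data, so the order of the returned elements is not guaranteed and may differ from what you see on your machine.

```scala
// Assumes spark-shell, where `sc` is predefined.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))

// distinct() is a transformation, so collect() is needed to see the result.
// The duplicates 45, 67 and 86 appear only once; the order may vary.
val unique = rdd1.distinct().collect()
```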
The distinct elements are displayed.
Filter transformation creates a new dataset by selecting the elements according to the given condition.
rdd1.filter(x => x < 50).collect
Here, the elements which are less than 50 are displayed.
The sortBy transformation sorts the elements by the given key function. Its second argument controls the order: true sorts in ascending order and false sorts in descending order.
rdd1.sortBy(x => x, true).collect
rdd1.sortBy(x => x, false).collect
The reduce action aggregates the elements of the RDD into a single value using the given function.
rdd1.reduce((x, y) => x + y)
Here, each element is added and the total sum is printed.
The map transformation applies the given function to each element in the RDD and creates a new RDD from the results.
rdd1.map(x => x + 1).collect
Here, each element is incremented by 1.
Union, intersection, and cartesian
Let’s create another RDD.
val rdd2 = sc.parallelize(List(25, 73, 97, 78, 27, 82))
Union operation combines all the elements of the given two RDDs.
Intersection operation forms a new RDD by taking the common elements in the given RDDs.
Cartesian operation is used to create a cartesian product of the required RDDs.
rdd1.union(rdd2).collect
rdd1.intersection(rdd2).collect
rdd1.cartesian(rdd2).collect
First is a type of action that always returns the first element of the RDD.
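A minimal sketch of the first action, assuming spark-shell with sc predefined and repeating the definition of rdd1 so the snippet runs on its own:

```scala
// Assumes spark-shell, where `sc` is predefined.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))

// first() returns a single element, not an RDD or an Array
val head = rdd1.first()   // 23
```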
Here, the first element in rdd1 is 23.
Take action returns the first n elements in the RDD.
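A minimal sketch of the take action with n = 5, assuming spark-shell with sc predefined and repeating the definition of rdd1 so the snippet runs on its own:

```scala
// Assumes spark-shell, where `sc` is predefined.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))

// take(n) returns the first n elements to the driver as an Array
val firstFive = rdd1.take(5)   // Array(23, 45, 67, 86, 78)
```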
Here, the first 5 elements are displayed.
Now, you may have noticed that transformations always create new RDDs and that the initially created RDD doesn’t change. This is because RDDs are immutable. This feature helps make RDDs fault-tolerant, since lost data can be recomputed easily.
When to use RDDs?
RDDs are preferred when you want to apply low-level transformations and actions, because they give you finer control over your data. They are a good fit when the data is highly unstructured, such as media or text streams, when you want to manipulate data with functional programming constructs rather than domain-specific expressions, and when no schema is applied to the data.
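As an illustration of working on unstructured text with functional constructs, here is a small word-count sketch. It assumes spark-shell with sc predefined; the two sample sentences are made up for the example.

```scala
// Assumes spark-shell, where `sc` is predefined.
// Two made-up lines of free text with no schema
val lines = sc.parallelize(List("spark makes rdds", "rdds are resilient"))

val counts = lines
  .flatMap(_.split("\\s+"))   // split every line into words
  .map(word => (word, 1))     // pair each word with a count of 1
  .reduceByKey(_ + _)         // add up the counts per word

// collectAsMap() is an action that returns the pairs as a local Map
val wordCounts = counts.collectAsMap()
```

Here the word rdds ends up with a count of 2 and every other word with a count of 1.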
I hope you now have a basic idea of RDDs and their role in Apache Spark.
Thanks for reading, cheers!
Please take a look at my other articles on dhanya_thailappan, Author at Analytics Vidhya.