A Beginner’s Guide to Spark Streaming For Data Engineers

Siddharth Sonkar 22 Dec, 2023
8 min read

Overview

  • Understand Spark Streaming and its functioning.
  • Learn about Windows in Spark Streaming with an example.

Introduction

According to IBM, 60% of all sensory information loses value in a few milliseconds if it is not acted on. Bearing in mind that the Big Data and analytics market has reached $125 billion and a large chunk of this will be attributed to IoT in the future, the inability to tap real-time information will result in a loss of billions of dollars.

Examples of some of these applications include a telco, working out how many of its users have used Whatsapp in the last 30 minutes, a retailer keeping track of the number of people who have said positive things about its products today on social media, or a law enforcement agency looking for a suspect using data from traffic CCTV.

This is the primary reason stream-processing systems like Spark Streaming will define the future of real-time analytics. There is also a growing need to analyze both data at rest and data in motion to drive applications, which makes systems like Spark—which can do both—all the more attractive and powerful. It’s a system for all Big Data seasons.

You learn how Spark Streaming not only keeps the familiar Spark API intact but also, under the hood, uses RDDs for storage as well as fault-tolerance. This enables Spark practitioners to jump into the streaming world from the outset. With that in mind, let’s get right to it.

 Spark Streaming introduction
An Introduction to Spark Streaming | by Harshit Agarwal | Medium

Apache Spark

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open-source engine for this task, making it a standard tool for any developer or data scientist interested in big data.

Spark supports multiple widely-used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and scale-up to big data processing or an incredibly large scale. Below are a few of the features of Spark:

  • Fast and general-purpose engine for large-scale data processing

    • Not a modified version of Hadoop
    • The leading candidate for “successor to Map Reduce”
  • Spark can efficiently support more types of computations

    • For example, interactive queries, stream processing
  • Can read/write to any Hadoop-supported system (e.g., HDFS)
  • Speed: in-memory data storage for very fast iterative queries
    • the system is also more efficient than MapReduce for complex applications running on disk
    • up to 40x faster than Hadoop
    • Ingest data from many sources: Kafka, Twitter, HDFS, TCP sockets
    • Results can be pushed out to file-systems, databases, live dashboards, but not only
 Spark Streaming

Apache Spark Ecosystem

The following are the components of Apache Spark Ecosystem-

  • Spark Core: basic functionality of Spark (task scheduling, memory management, fault recovery, storage systems interaction).
  • Spark SQL: package for working with structured data queried via SQL as well as HiveQL
  • Spark Streaming: a component that enables processing of live streams of data (e.g., log files, status updates messages)
  • MLLib: MLLib is a machine learning library like Mahout. It is built on top of Spark and has the provision to support many machine learning algorithms.
  • GraphX: For graphs and graphical computations, Spark has its own Graph Computation Engine, called GraphX. It is similar to other widely used graph processing tools or databases, like Neo4j, Giraffe, and many other distributed graph databases.
 Spark Streaming - ecosystem

Spark Streaming: Abstractions

Spark Streaming has a micro-batch architecture as follows:

  • treats the stream as a series of batches of data
  • new batches are created at regular time intervals
  • the size of the time intervals is called the batch interval
  • the batch interval is typically between 500 ms and several seconds
 Spark Streaming, Abstractions

The reduce value of each window is calculated incrementally.

Discretized Stream (DStream)

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.

Spark Streaming - DStream
  • RDD transformations are computed by the Spark engine
  • the DStream operations hide most of these details
  • Any operation applied on a DStream translates to operations on the underlying RDDs
  • The reduce value of each window is calculated incrementally.
DStream

Spark Streaming: Streaming Context

It is the main entry point for Spark Streaming functionality. It provides methods used to create DStreams from various input sources. Streaming Spark can be either created by providing a Spark master URL and an appName, or from an org.apache.spark.SparkConf configuration, or from an existing org.apache.spark.SparkContext. The associated SparkContext can be accessed using context.sparkContext.

After creating and transforming DStreams, streaming computation can be started and stopped using context.start() and, respectively. context.awaitTermination() allows the current thread to wait for the termination of the context by stop() or by an exception.

To execute a SparkStreaming application, we need to define the StreamingContext. It specializes SparkContext for streaming applications.

Streaming context in Java can be defined as follows-

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, batchInterval);

where:

  • master is a Spark, Mesos, or YARN cluster URL; to run your code in local mode, use “local[K]” where K>=2 represents the parallelism
  • appname is the name of your application
  • batch interval time interval (in seconds) of each batch

Once built, they offer two types of operations:

  • Transformations which yield a new DStream from a previous one. For example, one common transformation is filtering data.

    • stateless transformations: the processing of each batch does not depend on the data of its previous batches.

                      Examples are: map(), filter(), and reduceByKey()

  • stateful transformations: use data from previous batches to compute the results of the current batch. They include sliding windows, tracking state across time, etc
  • Output operations that write data to an external system. Each streaming application has to define an output operation.

Note that a streaming context can be started only once, and must be started after we set up all the DStreams and output operations.

Basic Data Sources

Below listed are the basic data sources of Spark Streaming:

  • File Streams: It is used for reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
    ... = streamingContext.fileStream<...>(directory);
  • Streams based on Custom Receivers: DStreams can be created with data streams received through custom receivers, extending the Receiver<T> class
    ... = streamingContext.queueStream(queueOfRDDs)
  • Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using
    ... = streamingContext.queueStream(queueOfRDDs)

Most of the transformations have the same syntax as the one applied to RDDs

Transformation

Meaning

map(func)

Return a new DStream by passing each element of the source DStream through a function func.

flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items.

filter(func)

Return a new DStream by selecting only the records of the source DStream on which func returns true.

union(otherStream)

Return a new DStream that contains the union of the elements in the source DStream and otherDStream.

join(other Stream)

When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.

Example: Word Count-

SparkConf sparkConf = new SparkConf()
.setMaster("local[2]").setAppName("WordCount");
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream( ... );
JavaDStream<String> words = lines.flatMap(...);
JavaPairDStream<String, Integer> wordCounts = words
                                             .mapToPair(s -> new Tuple2<>(s, 1))
                                             .reduceByKey((i1, i2) -> i1 + i2);

wordCounts.print();

Spark Streaming: Window

The simplest windowing function is a window, which lets you create a new DStream, computed by applying the windowing parameters to the old DStream. You can use any of the DStream operations on the new stream, so you get all the flexibility you want.

Windowed computations allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters:

  • window length
    • The duration of the window in secs
  • sliding interval
    • The interval at which the window operation is performed in secs
    • These parameters must be multiples of the batch interval
DStream
window(windowLength, slideInterval)

It returns a new DStream which is computed based on windowed batches.

...
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ...
JavaDStream<String> linesInWindow =
lines.window(WINDOW_SIZE, SLIDING_INTERVAL);
JavaPairDStream<String, Integer> wordCounts = linesInWindow.flatMap(SPLIT_LINE)
.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
  • reduceByWindow(func, InvFunc, windowLength, slideInterval)
    • Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func (which should be associative).
    • The reduce value of each window is calculated incrementally.
    • func reduces new data that enters the sliding window
    • invFunc “inverse reduces” the old data that leaves the window.
  • reduceByKeyAndWindow(func, InvFunc, windowLength, slideInterval)
    • When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.

For performing these transformations, we need to define a checkpoint directory

Window-based – Word Count

...
JavaPairDStream<String, Integer> wordCountPairs = ssc.socketTextStream(...)
.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator())
.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = wordCountPairs
.reduceByKeyAndWindow((i1, i2) -> i1 + i2, WINDOW_SIZE, SLIDING_INTERVAL);
wordCounts.print();
wordCounts.foreachRDD(new SaveAsLocalFile());

A (more efficient) Window-based – Word Count

In a more efficient version, the reduce value of each window is calculated incrementally-

  • a reduce function handles new data that enters the sliding window;
  • an “inverse reduce” function handles old data that leaves the window.

Note that checkpointing must be enabled for using this operation.

...
ssc.checkpoint(LOCAL_CHECKPOINT_DIR);
...
JavaPairDStream<String, Integer> wordCounts = wordCountPairs.reduceByKeyAndWindow(
(i1, i2) -> i1 + i2,
(i1, i2) -> i1 - i2, WINDOW_SIZE, SLIDING_INTERVAL);

Spark Streaming: Output Operations

Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems

Output Operation

Meaning

print()

Prints the first ten elements of every batch of data in a DStream on the driver node running the application.

saveAsTextFiles(prefix, [suffix])

Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix.

saveAsHadoopFiles(prefix, [suffix])

Save this DStream’s contents as Hadoop files.

saveAsObjectFiles(prefix, [suffix])

Save this DStream’s contents as SequenceFiles of serialized Java objects.

foreachRDD(func)

Generic output operator that applies a function, func, to each RDD generated from the stream.

Online References-
Spark Documentation
Spark Documentation

Conclusion

It should be clear that Spark Streaming presents a powerful way to write streaming applications. Taking a batch job you already run and turning it into a streaming job with almost no code changes is both simple and extremely helpful from an engineering standpoint if you need to have this job interact closely with the rest of your data processing application.

I recommend you go through the following data engineering resources to enhance your knowledge-

Frequently Asked Questions

Q1.How does Spark Streaming differ from batch processing?

Spark Streaming processes data in small, configurable micro-batches, providing low-latency processing compared to traditional batch processing.

Q2.What types of data sources are supported by Spark Streaming?

Spark Streaming supports various data sources, including HDFS, Kafka, Flume, and others, allowing seamless integration with diverse streaming platforms.

Q3.Can Spark Streaming handle both data at rest and data in motion?

Yes, Spark Streaming can analyze both data at rest (static data) and data in motion (live data streams), making it versatile for different use cases.

If you liked the article then please drop a comment in the comment section below.

Siddharth Sonkar 22 Dec, 2023

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,