A Beginner’s Guide to Spark Streaming For Data Engineers
- Understand Spark Streaming and its functioning.
- Learn about Windows in Spark Streaming with an example.
According to IBM, 60% of all sensory information loses value in a few milliseconds if it is not acted on. Bearing in mind that the Big Data and analytics market has reached $125 billion and a large chunk of this will be attributed to IoT in the future, the inability to tap real-time information will result in a loss of billions of dollars.
Examples of such applications include a telco working out how many of its users have used WhatsApp in the last 30 minutes, a retailer keeping track of the number of people who have said positive things about its products on social media today, or a law enforcement agency looking for a suspect using data from traffic CCTV.
This is the primary reason stream-processing systems like Spark Streaming will define the future of real-time analytics. There is also a growing need to analyze both data at rest and data in motion to drive applications, which makes systems like Spark—which can do both—all the more attractive and powerful. It’s a system for all Big Data seasons.
You will learn how Spark Streaming keeps the familiar Spark API intact while, under the hood, using RDDs for storage as well as fault tolerance. This lets existing Spark practitioners jump into the streaming world right away. With that in mind, let’s get right to it.
Table of Contents
- Apache Spark
- Apache Spark Ecosystem
- Spark Streaming: DStreams
- Spark Streaming: Streaming Context
- Example: Word Count
- Spark Streaming: Window
- A Window-based – Word Count
- A (more efficient) Window-based – Word Count
- Spark Streaming: Output Operations
Apache Spark
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open-source engine for this task, making it a standard tool for any developer or data scientist interested in big data.
Spark supports multiple widely-used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and scale up to big data processing at an incredibly large scale. Below are a few of the features of Spark:
- Fast and general-purpose engine for large-scale data processing
  - Not a modified version of Hadoop
  - The leading candidate for “successor to MapReduce”
- Spark can efficiently support more types of computations
  - For example, interactive queries and stream processing
  - Can read/write to any Hadoop-supported system (e.g., HDFS)
- Speed: in-memory data storage for very fast iterative queries
  - The system is also more efficient than MapReduce for complex applications running on disk
  - Up to 40x faster than Hadoop
- Ingest data from many sources: Kafka, Twitter, HDFS, TCP sockets
- Results can be pushed out to file systems, databases, live dashboards, and more
Apache Spark Ecosystem
The following are the components of the Apache Spark ecosystem:
- Spark Core: basic functionality of Spark (task scheduling, memory management, fault recovery, storage system interaction).
- Spark SQL: package for working with structured data queried via SQL as well as HiveQL.
- Spark Streaming: a component that enables processing of live streams of data (e.g., log files, status update messages).
- MLlib: MLlib is a machine learning library like Mahout. It is built on top of Spark and supports many machine learning algorithms.
- GraphX: for graphs and graph computations, Spark has its own graph computation engine, called GraphX. It is similar to other widely used graph processing tools or databases, such as Neo4j, Giraph, and many other distributed graph databases.
Spark Streaming: Abstractions
Spark Streaming has a micro-batch architecture as follows:
- treats the stream as a series of batches of data
- new batches are created at regular time intervals
- the size of the time intervals is called the batch interval
- the batch interval is typically between 500 ms and several seconds
Discretized Stream (DStream)
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.
- RDD transformations are computed by the Spark engine
- The DStream operations hide most of these details
- Any operation applied on a DStream translates to operations on the underlying RDDs
Spark Streaming: Streaming Context
It is the main entry point for Spark Streaming functionality. It provides the methods used to create DStreams from various input sources. A StreamingContext can be created either by providing a Spark master URL and an appName, from an org.apache.spark.SparkConf configuration, or from an existing org.apache.spark.SparkContext. The associated SparkContext can be accessed using streamingContext.sparkContext.
After creating and transforming DStreams, the streaming computation can be started and stopped using context.start() and context.stop(), respectively.
context.awaitTermination() allows the current thread to wait for the termination of the context, either by stop() or by an exception.
To execute a Spark Streaming application, we need to define the StreamingContext. It specializes SparkContext for streaming applications.
A streaming context in Java can be defined as follows:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, batchInterval);
- master: a Spark, Mesos, or YARN cluster URL; to run your code in local mode, use “local[K]” where K >= 2 represents the parallelism
- appName: the name of your application
- batchInterval: the time interval (in seconds) of each batch
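Putting this together, here is a minimal sketch of creating and running a streaming context in Java; the master URL, application name, and one-second batch interval are illustrative values, and the commented placeholder marks where DStreams and at least one output operation must be defined before the context is started:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingContextSetup {
    public static void main(String[] args) throws InterruptedException {
        // Run locally with two threads: one to receive data, one to process it
        SparkConf sparkConf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("StreamingContextSetup");

        // A new batch of data is created every second (the batch interval)
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

        // ... define DStreams and output operations here, before starting the context ...

        ssc.start();             // start the computation
        ssc.awaitTermination();  // wait for it to terminate, by stop() or by an exception
    }
}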
Once built, DStreams offer two types of operations:
- Transformations, which yield a new DStream from a previous one. For example, one common transformation is filtering data.
  - Stateless transformations: the processing of each batch does not depend on the data of its previous batches. Examples are map(), filter(), and reduceByKey().
  - Stateful transformations: use data from previous batches to compute the results of the current batch. They include sliding windows, tracking state across time, etc.
- Output operations, which write data to an external system. Each streaming application has to define at least one output operation.
Note that a streaming context can be started only once, and must be started after we set up all the DStreams and output operations.
Basic Data Sources
Listed below are the basic data sources of Spark Streaming:
- File Streams: used for reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.); a DStream can be created as:
... = streamingContext.fileStream<...>(directory);
- Streams based on Custom Receivers: DStreams can be created with data streams received through custom receivers, extending the Receiver<T> class:
... = streamingContext.receiverStream(customReceiver);
- Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using
... = streamingContext.queueStream(queueOfRDDs)
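The queue-based stream is handy for unit-style tests. Below is a minimal sketch of such a test; the queue contents, batch interval, and class name are illustrative values only:
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamTest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamTest");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Build a queue of RDDs; each RDD is served as one batch of the test stream
        Queue<JavaRDD<Integer>> queueOfRDDs = new LinkedList<>();
        for (int i = 0; i < 3; i++) {
            queueOfRDDs.add(ssc.sparkContext().parallelize(Arrays.asList(1, 2, 3, 4, 5)));
        }

        // Create a DStream from the queue and print each batch
        JavaDStream<Integer> testStream = ssc.queueStream(queueOfRDDs);
        testStream.print();

        ssc.start();
        ssc.awaitTermination();
    }
}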
Most of the transformations have the same syntax as the ones applied to RDDs:
- map(func): Return a new DStream by passing each element of the source DStream through a function func.
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
- filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
- union(otherDStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
- join(otherDStream): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
Example: Word Count
SparkConf sparkConf = new SparkConf()
    .setMaster("local").setAppName("WordCount");
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream( ... );
JavaDStream<String> words = lines.flatMap(...);
JavaPairDStream<String, Integer> wordCounts = words
    .mapToPair(s -> new Tuple2<>(s, 1))
    .reduceByKey((i1, i2) -> i1 + i2);
wordCounts.print();
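For reference, a fully self-contained version of this example could look as follows. It is a sketch that assumes text is being sent to localhost port 9999 (for instance with nc -lk 9999) and uses a one-second batch interval; note that a socket receiver needs at least two local threads, hence local[2]:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

        // Read lines of text from a TCP socket (assumed: localhost:9999)
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Split each line into words, pair each word with 1, and sum the counts within each batch
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
        JavaPairDStream<String, Integer> wordCounts = words
                .mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((i1, i2) -> i1 + i2);

        wordCounts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}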
Spark Streaming: Window
The simplest windowing function is window(), which lets you create a new DStream, computed by applying the windowing parameters to the old DStream. You can use any of the DStream operations on the new stream, so you get all the flexibility you want.
Windowed computations allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters:
- window length: the duration of the window (in seconds)
- sliding interval: the interval at which the window operation is performed (in seconds)
Both parameters must be multiples of the batch interval.
It returns a new DStream which is computed based on windowed batches.
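The snippet below relies on WINDOW_SIZE and SLIDING_INTERVAL constants that are not defined in the excerpt. As an assumption for illustration, they could be declared like this:
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;

public class WindowSettings {
    // Example values only: a 30-second window recomputed every 10 seconds.
    // Both must be multiples of the batch interval used to create the StreamingContext.
    static final Duration WINDOW_SIZE = Durations.seconds(30);
    static final Duration SLIDING_INTERVAL = Durations.seconds(10);
}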
...
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ...
JavaDStream<String> linesInWindow = lines.window(WINDOW_SIZE, SLIDING_INTERVAL);
JavaPairDStream<String, Integer> wordCounts = linesInWindow.flatMap(SPLIT_LINE)
    .mapToPair(s -> new Tuple2<>(s, 1))
    .reduceByKey((i1, i2) -> i1 + i2);
- reduceByWindow(func, invFunc, windowLength, slideInterval)
  - Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func (which should be associative).
  - The reduce value of each window is calculated incrementally:
    - func reduces new data that enters the sliding window;
    - invFunc “inverse reduces” the old data that leaves the window.
  - A minimal sketch of this operation is shown after this list.
- reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval)
  - When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
For performing these transformations, we need to define a checkpoint directory.
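Since only reduceByKeyAndWindow appears in the word count examples below, here is a minimal sketch of reduceByWindow as well. It assumes a DStream of text lines named lines, a checkpoint directory already set via ssc.checkpoint(...), and the WINDOW_SIZE and SLIDING_INTERVAL values defined earlier; it counts the characters seen in each window incrementally:
// Assumes: JavaDStream<String> lines, a checkpoint directory set via ssc.checkpoint(...),
// and WINDOW_SIZE / SLIDING_INTERVAL defined as Durations (see above).
JavaDStream<Integer> charsPerWindow = lines
    .map(s -> s.length())      // number of characters in each line
    .reduceByWindow(
        (a, b) -> a + b,       // func: add counts for data entering the window
        (a, b) -> a - b,       // invFunc: subtract counts for data leaving the window
        WINDOW_SIZE,
        SLIDING_INTERVAL);
charsPerWindow.print();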
Window-based – Word Count
...
JavaPairDStream<String, Integer> wordCountPairs = ssc.socketTextStream(...)
    .flatMap(x -> Arrays.asList(SPACE.split(x)).iterator())
    .mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = wordCountPairs
    .reduceByKeyAndWindow((i1, i2) -> i1 + i2, WINDOW_SIZE, SLIDING_INTERVAL);
wordCounts.print();
wordCounts.foreachRDD(new SaveAsLocalFile());
A (more efficient) Window-based – Word Count
In a more efficient version, the reduce value of each window is calculated incrementally:
- a reduce function handles new data that enters the sliding window;
- an “inverse reduce” function handles old data that leaves the window.
Note that checkpointing must be enabled for using this operation.
...
ssc.checkpoint(LOCAL_CHECKPOINT_DIR);
...
JavaPairDStream<String, Integer> wordCounts = wordCountPairs.reduceByKeyAndWindow(
    (i1, i2) -> i1 + i2,
    (i1, i2) -> i1 - i2,
    WINDOW_SIZE,
    SLIDING_INTERVAL);
Spark Streaming: Output Operations
Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system.
- print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the application.
- saveAsTextFiles(prefix, [suffix]): Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix.
- saveAsHadoopFiles(prefix, [suffix]): Save this DStream’s contents as Hadoop files.
- saveAsObjectFiles(prefix, [suffix]): Save this DStream’s contents as SequenceFiles of serialized Java objects.
- foreachRDD(func): Generic output operator that applies a function, func, to each RDD generated from the stream.
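foreachRDD is the most general of these operations. As an illustrative sketch, pushing each batch of the word counts from the examples above to an external store might look like the following; ExternalStore and its methods are hypothetical placeholders for your own database or service client, not a real Spark API:
// Sketch only: ExternalStore, connect(), save(), and close() are hypothetical placeholders;
// foreachRDD and foreachPartition are the actual Spark APIs.
wordCounts.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // Create one connection per partition rather than one per record
        ExternalStore store = ExternalStore.connect();
        while (partition.hasNext()) {
            Tuple2<String, Integer> record = partition.next();
            store.save(record._1(), record._2());
        }
        store.close();
    });
});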
It should be clear that Spark Streaming presents a powerful way to write streaming applications. Taking a batch job you already run and turning it into a streaming job with almost no code changes is both simple and extremely helpful from an engineering standpoint if you need to have this job interact closely with the rest of your data processing application.
I recommend you go through the following data engineering resources to enhance your knowledge:
- Getting Started with Apache Hive – A Must Know Tool For all Big Data and Data Engineering Professionals
- Introduction to the Hadoop Ecosystem for Big Data and Data Engineering
- Types of Tables in Apache Hive – A Quick Overview
If you liked the article, please drop a comment in the comments section below.