Apache Kafka: A Metaphorical Introduction to Event Streaming for Data Scientists and Data Engineers

Kaushik Roy Chowdhury 08 Dec, 2020 • 7 min read

Overview

Learn about viewing data as streams of immutable events in contrast to mutable containers
Understand how Apache Kafka captures real-time data through event streaming

Introduction

Back in 2010, professional networking site LinkedIn faced inconsistency and latency issues on their existing request-response system. To overcome this, a team of software engineers at the company developed an optimized messaging system that could solve their problem of the continuous flow of data. In late 2010, LinkedIn open-sourced the project.

In July 2011, Apache Software Foundation accepted it as an incubator project; thus, giving birth to Apache Kafka that went on to become one of the largest streaming platforms in the world.

Capturing real-time data was possible by using Kafka (we will get into the discussion of how later on). Alongside Kafka, LinkedIn also created Samza to process data streams in real-time. Apache added Samza as part of their project repository in 2013.

I always wondered what thoughts the creators of Kafka had in mind when naming the tool. Jay Kreps, one of the cofounders of Kafka (along with Neha Narkhede and Jun Rao), said:

“I thought that since Kafka was a system optimized for writing, using a writer’s name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.

So basically there is not much of a relationship.” (Narkhede, Shapira, & Palino, 2017 p. 16)

The way Kafka treats the concept of data is entirely different from what we have thought of data to be. Though Kreps may be right in saying not to read too much into the name of the tool, I find a lot of similarities between the philosophical underpinnings of 20th-century’s celebrated literary figure Franz Kafka’s works and how Apache Kafka treats data.

In this article, I will take metaphors from Kafka’s stories and relate them to understand the working of Apache Kafka.

“The Metamorphosis” (of the data architecture)

In one of the famous stories of Franz Kafka’s The Metamorphosis, the protagonist Gregor Samsa wakes up in the morning to find himself transformed into a giant insect. There is a physical alteration to Gregor Samsa’s body. The Metamorphosis is a story that deals with the transformation of the protagonist and events that describe the absurdity of life, which remains one of the central ideas in most of Kafka’s works.

You must be wondering how does this story even relates to data architecture in any way? Let me explain.

Data as a stream of events

Traditionally, we have thought of data as a collection of a set of values about objects. Data tells us the current state of an entity which may be qualitative or quantitative — for example, some characteristics of a customer, a product, or an economy.

Till recently, data mostly existed in a monolithic framework called the database. If the attributes of the entity change, we would update the database to reflect the changes.

Thus, traditional relational databases are mutable, that is, subject to change its state and once changed remains in that state until any update occurs. You can think of updating a database as changing the data attributes or appending new records.

However, in the world of Apache Kafka, data is not objectified but treated as a stream of events. This stream of events is recorded in the form of log files called topics. For starters, a log is a file that records an event that occurred sequentially.

Mutable states and immutable events

For example, a customer identified by customer_id 123 buys one unit of a product having product_id 999. In a traditional relational database, the quantity 1 will be recorded against a row matching the customer id and product id, which reflects the current state of the database. Suppose, the customer changes her mind and buys three units of the same product. We need to mutate the database for this transactional change by updating the quantity from 1 to 3.

But, if we treat data as streams of events, the log file reflects the change of mind as a fact or an immutable event occurring at a specific time. The log appends the event after the previous occurrence of quantity 1 has been recorded in the log stream.

Apache Kafka - Mutable vs immutable events

Source: Kleppmann, M. (2015)

Statefulness and Fault Tolerance of Apache Kafka

Returning to the plight of Gregor Samsa, an event at a specific time led to his physical transformation. But his mind remains intact as he worries about getting late to office. The metamorphosis of the physical reality is an event or a fact at a particular point in time. But it is also stateful, in the sense that his mind still thinks like a human.

Just like Gregor accepts the physical transformation without any surprise or shock, his mind changes as per his physical needs. This is even true for event streams in Apache Kafka. Though it keeps a memory of all the events taking place over a while, you have the option of removing the logs dating back to an extended period.

Being stateful ensures that Apache Kafka is fault-tolerant and resilient, i.e., if any problems occur at a future date, you always have the option of returning to the previous working state.

Agile response for real-time data

The advent of a multitude of data sources has transformed the process of decision-making by governments, businesses, and individual agents.

Earlier, data was like a source of validation of the decision-making process; that is, strategic decisions were instinctive and experiential and then validated by data. Whereas, nowadays, data-driven decision making is a part of the process itself. Thus, we record data as events taking place within an enterprise. The enterprise analyses these events manipulates them and uses them for more data as output.

“Every byte of data has a story to tell, something of importance that will inform the next thing to be done. In order to know what that is, we need to get the data from where it is created to where it can be analyzed.” (Narkhede, Shapira, & Palino, 2017 p. 1)

In this fast-changing app oriented world, agility and responsiveness in data transfers are critical. Moving around data quickly within internal systems and responding to requests from external servers become imperative. This is where Apache Kafka’s pub-sub (publish-subscribe) messaging comes in handy.

Publish-Subscribe System of Messaging in Apache Kafka

Consider an example where an e-retailer wants to build an application for the checkout process, which is a webpage or app where buyers can pay for their products that they want to order. When a customer shops online and checks out through the webpage or app, the purchase message should reflect on the shipment team’s data to ship the product to the customer who purchased it. This is not a complicated process.

Now, think of a scenario where this event should also trigger a system of other events like sending an automated email receipt to the customer, updating the inventory database, etc. As the front and back-end services get added, and the list of responses to a purchase checkout grows, more integrations need to be built, and this can get too messy. The integrations lead to an inter-team dependency for any changes to the applications, which makes development slow.

Source: Narkhede, Shapira, & Palino (2017) p. 2

Apache Kafka helps achieve the decoupling of system dependencies that makes the hard integration go away. This makes the checkout webpage or app broadcast events instead of directly transferring the events to different servers. This is what we mean by publishing. Then, the other services like inventory, shipment, or email subscribe to that stream, which triggers them to act accordingly.

The publish-subscribe model uses the messaging service, that is, the publisher stream event messages and the subscriber chooses to subscribe to the requisite messages.

Retrieved from https://www.confluent.io/what-is-apache-kafka/

Producers and Consumers

In Apache Kafka, you can store the streams of events generated by the producers durably and reliably for the specified time.

It also allows you to read and process these streams by the consumers. The producers and consumers are completely decoupled, that is, the producers don’t wait for the consumers to consume the events, and the consumers can consume the streams in any sequence. The consumers also have an option to decide which messages to consume. However, the logs are appended to the topics sequentially. This property removes the complexity of having complicated routing rules which makes Kafka fast, efficient, scalable, and all these are performed in a distributed manner.

This reminds me of how Gregor Samsa’s mind and body are decoupled. Samsa thinks of going to work or standing upright with an insects’ body. To make space for his physical existence, Samsa’s family members move furniture out of his room. Samsa realizes his possessions are taken away from him.

Through these details, the story suggests that our physical lives shape and direct our mental lives, not the other way around. Alternatively, the need for real-time capture of data makes us think about data as not as physical realities but as streams of events.

End Notes

In this article, we looked at an evolved way of understanding data as streams of events. This is very different from how data has been viewed till now. We also understood the need for a decoupled architecture for these data events to be published and subscribed to make the process of data generation, consumption, and update much faster and efficient. This is achieved in the Apache Kafka framework.

I recommend you go through the following data engineering resources to enhance your knowledge:

For more technical details related to Apache Kafka, feel free to refer to some of the below-mentioned resources.

References

Narkhede, N., Shapira, G., & Palino, T. (2017). Kafka: The definitive guide. Sebastopol, CA: O’Reilly.

IBM Cloud Education (2020). Apache Kafka. https://www.ibm.com/cloud/learn/apache-kafka.

Kleppmann, M. (2015). Turning the database inside-out with Apache Samza. https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/.

Apache Kafka. Introduction. https://kafka.apache.org/intro

Kaushik Roy Chowdhury 08 Dec 2020

Kaushik is a data science professional and a trainer with more than 7 years of experience working in the consulting domain and around 3 years of teaching experience.

Beginner Data Engineering