
Kafka Event Streaming Explained

Published on March 8, 2024

The author selected Apache Software Foundation to receive a donation as part of the Write for DOnations program.

Introduction

Traditional ways of processing and receiving data (such as batch processing and polling) are inefficient in the context of the microservices that make up modern applications. These methods work on aggregated chunks of data, which delays the final result and forces a substantial amount of data to accumulate beforehand. They also introduce the complexity of synchronizing workers, potentially leaving some of them underutilized despite consuming resources. In contrast, since cloud computing offers rapid scalability for deployed resources, incoming data can be processed in real time by delegating it to numerous workers in parallel.

Event streaming is the approach of flexibly collecting and delegating incoming events for processing by maintaining a continuous flow of data between different systems. Scheduling incoming data for processing immediately ensures maximal resource usage and real-time responsiveness. Event streaming also decouples producers from consumers, allowing you to run a different number of each depending on the current load. This enables near-instantaneous reactions to dynamic conditions in the real world.

Such responsiveness can be particularly important in areas such as financial trading, payment monitoring, or traffic observation. For example, Uber uses event streaming to connect hundreds of microservices, passing event data from rider to driver apps in real time and archiving it for later analysis.

With event streaming, instead of having a worker wait for batches of data at a regular interval, the event broker can notify the consumer (usually a microservice) as soon as an event happens and provide it with the event data. The event broker takes care of receiving, routing, and passing on events. It also provides fault tolerance in case a worker fails or declines to process an event.

In this conceptual article, we’ll explore the event streaming approach and its benefits. We’ll also introduce Apache Kafka, an open-source event broker, and review its role in this approach.

Event Streaming Architecture

At its core, event streaming is an implementation of the pub/sub architecture pattern. Generally, the pub/sub pattern involves:

  • Topics, to which messages (containing whatever data you wish to communicate) are addressed
  • Publishers, which generate messages
  • Subscribers, which receive messages and act upon them
  • A message broker, which accepts messages from publishers and provides them to subscribers in the most efficient way

A topic is akin to a category to which a message is related. Topics durably store the sequence of messages, ensuring that new messages are always added to the very end of the sequence. Once a message has been added to the topic, it cannot be modified later.
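
To make these roles concrete, below is a minimal, in-memory sketch of the pub/sub pattern. The class, topic name, and message format are purely illustrative assumptions; a real broker adds persistence, routing, and fault tolerance on top of this idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// A minimal, in-memory sketch of the pub/sub pattern (illustrative names only).
class SimpleBroker {
    // Each topic appends messages in order; a published message is never modified.
    private final Map<String, List<String>> topics = new HashMap<>();
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // A publisher addresses a message to a topic; the broker stores and forwards it.
    void publish(String topic, String message) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(message);
        subscribers.getOrDefault(topic, List.of()).forEach(s -> s.accept(message));
    }

    // A subscriber registers interest in a topic and acts on every message it receives.
    void subscribe(String topic, Consumer<String> subscriber) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscriber);
    }
}

public class PubSubSketch {
    public static void main(String[] args) {
        SimpleBroker broker = new SimpleBroker();
        broker.subscribe("article-views", msg -> System.out.println("received: " + msg));
        broker.publish("article-views", "{\"articleId\": 42, \"event\": \"view\"}");
    }
}
```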

With event streaming, the premise is similar, albeit more specialized:

  • Events and related metadata together are passed as messages
  • Events in a topic are ordered, usually by time of arrival
  • Subscribers (also called consumers) can replay events from any point in the topic up to the current moment
  • Unlike true pub/sub, events in a topic can be retained for a set period of time or kept indefinitely (as an archive)

Event streaming does not place restrictions on, or make assumptions about, the nature of an event; as far as the underlying broker is concerned, an event merely signifies that a producer reported that something occurred. What actually happened is up to you to define and give meaning to in your implementation. For this reason, from the viewpoint of the broker, events are also interchangeably called messages or records.

To illustrate, here is a detailed diagram of the Kafka Event Streaming architecture from Confluent documentation:

[Diagram: Kafka event streaming architecture]

There are two models for how consumers can opt to receive data from a broker: pushing and pulling. Pushing refers to the event broker initiating the process of sending data to an available consumer, whereas pulling means that the consumer requests the next available records from the broker. The difference seems innocuous, but pulling is preferred in practice.

One of the main reasons pushing is not widely used is that the broker cannot be sure that the consumer will actually be able to act on the event. Thus, it may end up sending the event multiple times in vain while still needing to store it in the topic. The broker would also have to consider batching events for higher efficiency, which goes against the idea of streaming them as soon as possible.

Having the consumer pull data only when it’s ready to process it cuts down on unnecessary network traffic and allows for greater reliability. How long the processing takes is up to the business logic and influences how many workers you plan for. In either model, the broker should remember which events the consumer has acknowledged.
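
To make the pull model concrete, here is a minimal sketch of a Kafka consumer in Java that polls for new records only when it is ready for them and then acknowledges progress by committing offsets. The broker address, group ID, and topic name are placeholder assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "view-counter");            // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");          // acknowledge only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("article-views")); // placeholder topic

            while (true) {
                // The consumer pulls the next available records when it is ready for them.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Commit offsets so the broker remembers which events were acknowledged.
                consumer.commitSync();
            }
        }
    }
}
```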

You now know what event streaming is and what architecture it’s based on. Next, you’ll learn about the benefits of this dynamic approach.

Benefits of Event Streaming

The main benefits of event streaming are:

  • Consistency: The event broker ensures that events are properly forwarded to all interested consumers.
  • Fault-tolerance: If a consumer fails to accept an event, it can be rerouted elsewhere, ensuring that no event is left unprocessed.
  • Reusability: Events stored in a topic are immutable. They can be replayed in whole or from a specific point in time, allowing you to reprocess events in case of business logic changes.
  • Scalability: Producers and consumers are separate entities and do not have to wait for each other, meaning that they can be dynamically scaled up or down depending on demand.
  • Ease of use: The event broker handles event routing and storage, abstracting away the complex logic and allowing you to focus on the data itself.

Each event should contain only the necessary details about the occurrence. Event brokers are generally very efficient, and although it’s recommended not to expire events once they’re in a topic, topics shouldn’t be regarded as a traditional database.

For instance, it’s fine to signal that an article’s number of views has changed, but there’s no need to store the whole article and its related metadata along with that fact. Instead, the event can contain a reference to the article ID in an external database. This way, history can still be tracked without including unnecessary information and polluting the topic.
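
As a sketch of what producing such a compact event might look like with Kafka’s Java producer client (covered later in this article), the snippet below sends a small JSON payload that references the article by its ID instead of embedding the article itself. The broker address, topic name, key, and payload shape are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ViewEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The event only references the article by ID; the article itself stays in an external database.
            String payload = "{\"articleId\": 42, \"event\": \"view\", \"at\": \"2024-03-08T12:00:00Z\"}";
            producer.send(new ProducerRecord<>("article-views", "article-42", payload));
            producer.flush();
        }
    }
}
```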

You’ll now learn about Apache Kafka and other popular event brokers, how they compare and how they slot into the event streaming ecosystem.

Role of Apache Kafka

Apache Kafka is an open-source event broker written in Java and maintained by the Apache Software Foundation. It consists of distributed servers and clients that communicate using a custom TCP network protocol for maximum performance. Kafka is highly reliable and scalable, and it can run on virtual machines, bare-metal hardware, and containers, both on-premises and in cloud environments.

For reliability, Kafka is deployed as a cluster containing one or more servers. The cluster can span multiple cloud regions and data centres. Kafka clusters are fault-tolerant: if a server fails or loses connectivity, the remaining servers regroup to keep operations highly available without data loss or outside impact.

For maximum efficiency, not all Kafka servers have the same role. Some servers group together and act as brokers, forming the storage layer that holds the data. Others can integrate with your existing systems and pull in data as event streams using Kafka Connect, a tool for reliably streaming data from existing systems (such as relational databases) into Kafka.

Kafka considers producers and consumers its clients. As explained earlier, producers write events to a Kafka broker, which makes them available to interested consumers. Within a consumer group, each event is processed by only one of the group’s consumers; by default Kafka delivers events at least once, and exactly-once processing is available with additional configuration.

In Kafka, topics are partitioned, meaning that a topic is spread across different Kafka brokers in parts called partitions, which ensures scalability. Kafka also guarantees that events written to a given topic partition can always be read back in the same order in which they were written.

Note that partitioning a topic does not by itself provide redundancy; that is achieved through replication, where each partition is copied to multiple brokers (which may be spread across data centres and regions). A replication factor of at least 3 is common in production, meaning that three copies of each topic partition are available at all times.
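
As a sketch of how partitioning and replication are specified, the snippet below creates a topic with six partitions and a replication factor of 3 using Kafka’s admin client. The broker address, topic name, and counts are illustrative assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread load across brokers; replication factor 3 keeps
            // three copies of every partition on separate brokers for redundancy.
            NewTopic topic = new NewTopic("article-views", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```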

Integrating Kafka

As mentioned, data contained in existing systems can be imported and exported using Kafka Connect. It’s suitable for ingesting entire databases, logs or metrics from your servers into topics with low latency. Kafka Connect offers connectors for various data systems, which allow you to manage data in a standardized way. Another advantage of using connectors instead of rolling your own solutions is that Connect is scalable by default (multiple workers can group together) and automatically tracks progress.
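
As an illustrative example, a standalone Kafka Connect worker could be configured with a properties file like the one below, which uses the FileStreamSource connector that ships with Kafka to stream lines from a log file into a topic. The connector name, file path, and topic name are placeholders.

```properties
# connect-file-source.properties -- illustrative standalone connector configuration
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# Placeholder path: every line appended to this file becomes an event in the topic below
file=/var/log/app/events.log
topic=app-log-events
```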

For communicating with Kafka from your own apps, a breadth of clients is available. Many programming languages are supported, such as Java, Scala, Python, .NET, C++, Go, and others. A high-level client library called Kafka Streams is also available for Java and Scala. This library abstracts away the lower-level client details and allows you to connect to a Kafka broker and process streams of events with high-level operations.
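
To give a feel for Kafka Streams, below is a minimal sketch of a topology that counts view events per article key and writes the running totals to another topic. The application ID, broker address, and topic names are placeholder assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ViewCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "view-count-app");    // placeholder app ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read view events keyed by article ID and count them per key.
        KStream<String, String> views = builder.stream("article-views");     // placeholder input topic
        KTable<String, Long> counts = views.groupByKey().count();
        counts.toStream().to("article-view-counts",                          // placeholder output topic
                Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```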

Conclusion

This article covered the paradigms of the modern event streaming approach for processing data and events, as well as its advantages over traditional batch processing. You’ve also learned about Apache Kafka as an event broker and its client ecosystem.

You can check out this tutorial on Introduction to Kafka to learn how to set up Apache Kafka, create and delete topics, and send and receive events.
