Introduction to Kafka

Published on March 8, 2024

By Savic and Easha Abid

The author selected Apache Software Foundation to receive a donation as part of the Write for DOnations program.

Introduction

Apache Kafka is an open-source distributed event streaming and stream-processing platform written in Java and Scala, built to process demanding real-time data feeds. It is inherently scalable, with high throughput and availability. Originally developed at LinkedIn and now maintained by the Apache Software Foundation, Kafka has gained widespread adoption for its reliability, ease of use, and fault tolerance. It’s used by many of the world’s biggest organizations to handle large volumes of data in a distributed and efficient manner.

In this tutorial, you’ll download and set up Apache Kafka. You’ll learn about creating and deleting topics, as well as sending and receiving events using the provided scripts. You’ll also learn about similar projects with the same purpose, and how Kafka compares.

Prerequisites

To complete this tutorial, you’ll need:

A machine with at least 4GB RAM and 2 CPUs. If you’re using an Ubuntu server, follow the Initial Server Setup guide for setup instructions.
Java 8 or higher installed on your Droplet or local machine. For instructions on installing Java on Ubuntu, see the How To Install Java with Apt on Ubuntu tutorial.

Step 1 - Downloading and Configuring Apache Kafka

In this section, you will download and extract Apache Kafka on your machine. You’ll set it up under its own user account for additional security. Then, you’ll configure and run it using KRaft.

First, you’ll create a separate user under which Kafka will run. Create a user called kafka by running the following command:

sudo adduser kafka

You will be asked for the account password. Enter a strong password and skip filling in the additional information by pressing ENTER for each field.

Then, switch to the Kafka-specific user:

su kafka

Next, you’ll download the Kafka release package from the official Downloads page. At the time of writing, the latest version was 3.7.0, built for Scala 2.13. If you’re on macOS or Linux, you can download Kafka with curl.

Use this command to download Kafka and place it under /tmp:

curl -o /tmp/kafka.tgz https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz

You’ll store the release under ~/kafka, in the home directory. Create it by running:

mkdir ~/kafka

Then, extract it to ~/kafka by running:

tar -xzf /tmp/kafka.tgz -C ~/kafka --strip-components=1

Because the archive you downloaded contains a single root folder named after the Kafka release, --strip-components=1 removes that leading path component, so the contents of the folder are extracted directly into ~/kafka.
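To see the effect of --strip-components=1 in isolation, here is a minimal sketch that builds a tiny throwaway archive with the same kind of layout and extracts it. The names demo-release-1.0 and start.sh are hypothetical stand-ins and are not part of the Kafka release.

```shell
# Build a tiny archive that mimics the layout of the Kafka release:
# a single root folder containing everything else.
# (demo-release-1.0 and start.sh are made-up names for illustration.)
workdir="$(mktemp -d)"
mkdir -p "$workdir/src/demo-release-1.0/bin" "$workdir/dest"
echo 'echo hello' > "$workdir/src/demo-release-1.0/bin/start.sh"
tar -czf "$workdir/demo.tgz" -C "$workdir/src" demo-release-1.0

# Extract while dropping the first path component (the root folder).
tar -xzf "$workdir/demo.tgz" -C "$workdir/dest" --strip-components=1

# The contents land directly in dest/, without the root folder.
ls "$workdir/dest"
```

Without the flag, the files would end up one level deeper, under dest/demo-release-1.0.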

At the time of writing, Kafka 3 was the latest major version, which supports two systems for metadata management: Apache ZooKeeper and Kafka KRaft (short for Kafka Raft). ZooKeeper is an open-source project providing a standardized way of distributed data coordination for applications, also developed by the Apache Software Foundation.

KRaft was introduced as an early-access feature in Kafka 2.8 and has been considered production-ready since Kafka 3.3. It is a purpose-built system for coordinating just Kafka instances, simplifying the installation process and allowing much greater scalability. With KRaft, Kafka itself manages its administrative metadata instead of keeping it in an external system.

While ZooKeeper mode is still available in Kafka 3, ZooKeeper support is expected to be removed in Kafka 4. In this tutorial, you’ll set up Kafka using KRaft.

You’ll need to create a unique identifier for your new Kafka cluster. For now, it will consist of just one node. Navigate to the directory where Kafka now resides:

cd ~/kafka

Kafka with KRaft stores its configuration under config/kraft/server.properties, while the ZooKeeper config file is config/server.properties.

Before running it for the first time, you’ll have to override some of the default settings. Open the file for editing by running:

nano config/kraft/server.properties

Find the following lines:

config/kraft/server.properties

...
############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs
...

The log.dirs setting specifies where Kafka will keep its log files, meaning the on-disk data of its topics rather than application logs. By default, it will store them under /tmp/kafka-logs, as that’s guaranteed to be writeable, although temporary. Replace the value with the following path:

config/kraft/server.properties

...
############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/home/kafka/kafka-logs
...

Since you have created a separate user for Kafka, you set the logs directory path to be under the user’s home directory. If it doesn’t exist, Kafka will create it. When you’re done, save and close the file.
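If you prefer to make this change without opening an editor, the same edit can be sketched with sed. The example below works on a temporary stand-in file rather than the real config/kraft/server.properties, so the file contents shown here are an assumption made for illustration.

```shell
# Temporary stand-in for config/kraft/server.properties (illustrative contents).
props="$(mktemp)"
printf '%s\n' \
  '# A comma separated list of directories under which to store log files' \
  'log.dirs=/tmp/kafka-logs' > "$props"

new_log_dir="/home/kafka/kafka-logs"

# Rewrite only the line that starts with "log.dirs=".
# Writing to a second file and moving it avoids relying on "sed -i",
# whose syntax differs between Linux and macOS.
sed "s|^log\.dirs=.*|log.dirs=${new_log_dir}|" "$props" > "$props.new" &&
  mv "$props.new" "$props"

# Show the resulting setting.
grep '^log\.dirs=' "$props"
```

To apply it for real, you would point props at config/kraft/server.properties instead of a temporary file.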

Now that you’ve configured Kafka, run the following command to generate a random cluster ID:

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"

Then, create storage for log files by running the following command and passing in the ID:

bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties

The output will be:

Output

Formatting /home/kafka/kafka-logs with metadata.version 3.7-IV4.
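Formatting only needs to happen once per log directory. As a rough sketch, the two commands above can be wrapped in a guard that skips them when the directory already looks formatted. This assumes that a formatted directory contains a meta.properties file; the needs_format helper and the LOG_DIR variable are hypothetical names introduced here.

```shell
# Returns success (0) when the directory has not been formatted yet.
# Assumption: a formatted log directory contains a meta.properties file.
needs_format() {
  [ ! -f "$1/meta.properties" ]
}

# Log directory configured earlier; LOG_DIR is a hypothetical override.
log_dir="${LOG_DIR:-/home/kafka/kafka-logs}"

if ! needs_format "$log_dir"; then
  echo "Storage in $log_dir is already formatted, skipping."
elif [ -x bin/kafka-storage.sh ]; then
  # The same two commands as above, run from the Kafka directory.
  KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
  bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
else
  echo "bin/kafka-storage.sh not found; run this from the Kafka directory."
fi
```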

Finally, you can start the Kafka server for the first time:

bin/kafka-server-start.sh config/kraft/server.properties

The end of the output will be similar to this:

Output

...
[2024-02-26 10:38:26,889] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.DataPlaneAcceptor)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Waiting for all of the authorizer futures to be completed (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Finished waiting for all of the authorizer futures to be completed (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Waiting for all of the SocketServer Acceptors to be started (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Finished waiting for all of the SocketServer Acceptors to be started (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Transition from STARTING to STARTED (kafka.server.BrokerServer)
[2024-02-26 10:38:26,891] INFO Kafka version: 3.7.0 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-26 10:38:26,891] INFO Kafka commitId: 5e3c2b738d253ff5 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-26 10:38:26,891] INFO Kafka startTimeMs: 1708943906890 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-26 10:38:26,892] INFO [KafkaRaftServer nodeId=1] Kafka Server started (kafka.server.KafkaRaftServer)

The output indicates that Kafka has successfully initialized using KRaft and that it’s accepting connections at 0.0.0.0:9092.

Once you press CTRL+C, the process will exit. Because running Kafka by holding a session open is not preferable, you’ll create a service for running Kafka in the background in the next step.

Step 2 - Creating a systemd Service for Kafka

In this section, you’ll create a systemd service for running Kafka in the background at all times. systemd services can be started, stopped and restarted consistently.

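You’ll store the service configuration in a file named kafka.service, in the /etc/systemd/system directory, where systemd looks for locally defined services. Because the kafka user has no sudo privileges by default, run the sudo commands in this step from your sudo-enabled account, for example in a separate terminal session. Create the file using your text editor: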

sudo nano /etc/systemd/system/kafka.service

Add the following lines:

/etc/systemd/system/kafka.service

[Unit]
Description=kafka-server

[Service]
Type=simple
User=kafka
ExecStart=/bin/sh -c '/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/kraft/server.properties > /home/kafka/kafka/kafka.log 2>&1'
ExecStop=/home/kafka/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target

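Here, you first specify the description of the service. Then, in the [Service] section, you define the type of the service (simple means that systemd considers the service started as soon as the command has been launched) and provide the commands that start and stop Kafka. You also specify that the service runs as the kafka user and, with Restart=on-abnormal, that it should be restarted automatically if Kafka exits abnormally, for example when it is killed by a signal or times out.

The [Install] section tells systemd to start this service as part of multi-user.target, which is reached during a normal boot when it becomes possible to log in to your server. This takes effect once the service is enabled. Save and close the file when you’re done.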

Start the Kafka service by running the following command:

sudo systemctl start kafka

Check that it’s started correctly by observing its status:

sudo systemctl status kafka

You’ll see output similar to:

Output

● kafka.service - kafka-server
     Loaded: loaded (/etc/systemd/system/kafka.service; disabled; preset: enabled)
     Active: active (running) since Mon 2024-02-26 11:17:30 UTC; 2min 40s ago
   Main PID: 1061 (sh)
      Tasks: 94 (limit: 4646)
     Memory: 409.2M
        CPU: 10.491s
     CGroup: /system.slice/kafka.service
             ├─1061 /bin/sh -c "/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/kraft/server.properties > /home/kafka/kafka/kafka.log 2>&1"
             └─1062 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true "-Xlog:gc*:file=/home/kafka/kafka/bin/../logs/kaf>

Feb 26 11:17:30 kafka-test1 systemd[1]: Started kafka.service - kafka-server.
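The status output only tells you that the process is running. Because the service redirects Kafka’s output to /home/kafka/kafka/kafka.log, you can also check that file for the startup message shown in Step 1. The following is a minimal sketch; the kafka_started helper and the KAFKA_LOG variable are hypothetical names introduced here.

```shell
# Returns success once the given log file contains Kafka's startup message.
kafka_started() {
  grep -q 'Kafka Server started' "$1"
}

# File the service redirects Kafka's output to; KAFKA_LOG is a hypothetical override.
kafka_log="${KAFKA_LOG:-/home/kafka/kafka/kafka.log}"

if [ -r "$kafka_log" ] && kafka_started "$kafka_log"; then
  echo "Kafka reported a successful start."
else
  echo "No startup message found in $kafka_log (yet)."
fi
```

The log file lives in the kafka user’s home directory, so you may need to run the check as that user or with sudo.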

To make Kafka start automatically after a server reboot, enable its service by running the following command:

sudo systemctl enable kafka

In this step, you’ve created a systemd service for Kafka and enabled it, so that it starts at every server boot. Next, you’ll learn about creating and deleting topics in Kafka, as well as how to produce and consume textual messages using the included scripts.

Step 3 - Producing and Consuming Topic Messages

Now that you’ve set up a Kafka server, you’ll learn about topics and how to manage them using the provided scripts. You’ll also learn how to send and stream back messages from a topic.

As explained in the Event Streaming article, publishing and receiving messages are tied to topics. A topic can be thought of as a category to which a message belongs.

The provided kafka-topics.sh script can be used to manage topics in Kafka through the CLI. Run the following command to create a topic called first-topic:

bin/kafka-topics.sh --create --topic first-topic --bootstrap-server localhost:9092

All provided Kafka scripts require that you specify the server address with --bootstrap-server.

The output will be:

Output

Created topic first-topic.

To list all available topics, pass in --list instead of --create:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

You’ll see the topic you’ve just created:

Output

first-topic

You can get detailed information and statistics about the topic by passing in --describe:

bin/kafka-topics.sh --describe --topic first-topic --bootstrap-server localhost:9092

The output will look similar to this:

Output

Topic: first-topic      TopicId: VtjiMIUtRUulwzxJL5qVjg PartitionCount: 1       ReplicationFactor: 1    Configs: segment.bytes=1073741824
        Topic: first-topic      Partition: 0    Leader: 1       Replicas: 1     Isr: 1

The first line shows the topic name, its ID, the partition count, and the replication factor, which is 1 because the topic is stored only on the current machine. The second line is indented on purpose and shows information about the first (and only) partition of the topic: which broker leads it (Leader), which brokers hold copies of it (Replicas), and which of those copies are in sync (Isr). Kafka allows you to partition a topic, meaning that different parts of it can be distributed across different servers, which improves scalability. Here, only one partition exists.
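To make partitioning more concrete, the sketch below shows one common way of spreading messages across partitions: hashing a message key and taking the result modulo the partition count. Message keys are not used elsewhere in this tutorial, and cksum is only a stand-in for the hash function that real Kafka clients use, so treat this as an illustration of the idea rather than Kafka’s actual algorithm. The partition_for helper is a hypothetical name.

```shell
# Map a message key to one of N partitions.
# Assumption: cksum stands in for the hash function used by real Kafka clients.
partition_for() {
  pf_key="$1"
  pf_count="$2"
  pf_hash="$(printf '%s' "$pf_key" | cksum | cut -d ' ' -f 1)"
  echo $(( pf_hash % pf_count ))
}

# The same key always maps to the same partition.
for key in user-1 user-2 user-3 user-1; do
  echo "key=$key -> partition $(partition_for "$key" 3)"
done
```

Because the mapping is deterministic, all messages with the same key end up in the same partition, which is what keeps them in order relative to each other.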

Now that you’ve created a topic, you’ll produce messages for it using the kafka-console-producer.sh script. Run the following command to start the producer:

bin/kafka-console-producer.sh --topic first-topic --bootstrap-server localhost:9092

You’ll see an empty prompt:

>

The producer is waiting for you to enter a textual message. Input test and press ENTER. The prompt will look like this:

>test
>

The producer is now waiting for the next message, meaning that the previous one was successfully communicated to Kafka. You can input as many messages as you want for testing. To exit the producer, press CTRL+C.

To read back the messages from the topic, you’ll need a consumer. Kafka provides a simple consumer in the form of kafka-console-consumer.sh. Execute it by running:

bin/kafka-console-consumer.sh --topic first-topic --bootstrap-server localhost:9092

However, there will be no output. The reason is that the consumer is streaming data from the topic, and nothing is currently being produced and sent. To consume the messages you’ve produced before starting the consumer, you’ll have to read the topic from the beginning by running:

bin/kafka-console-consumer.sh --topic first-topic --from-beginning --bootstrap-server localhost:9092

The consumer will replay all events in the topic and fetch the messages:

Output

test
...

As with the producer, press CTRL+C to exit.
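The difference between the two consumer invocations can be modeled without Kafka. In this rough sketch, a plain file stands in for the topic, with one message per line and the line position playing the role of the offset. The read_from_offset helper is a hypothetical name, and the model ignores everything else a real consumer does.

```shell
# A plain file stands in for a topic partition: one message per line,
# and the line position plays the role of the offset.
topic_log="$(mktemp)"
printf '%s\n' 'test' 'second test' > "$topic_log"

# Print every message at or after the given zero-based offset.
read_from_offset() {
  tail -n "+$(( $2 + 1 ))" "$1"
}

# Like --from-beginning: start at offset 0 and replay everything.
from_beginning="$(read_from_offset "$topic_log" 0)"

# Like the default: start at the current end, so earlier messages are skipped.
end_offset="$(wc -l < "$topic_log" | tr -d ' ')"
from_end="$(read_from_offset "$topic_log" "$end_offset")"

printf 'from beginning:\n%s\n' "$from_beginning"
printf 'from end: [%s]\n' "$from_end"
```

A consumer that starts at the end sees nothing until a new message is appended, which matches the empty output you observed earlier.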

To verify that the consumer is indeed streaming the data, you’ll open it in a separate terminal session. Open a secondary SSH session and run the consumer in the default configuration:

bin/kafka-console-consumer.sh --topic first-topic --bootstrap-server localhost:9092

In the primary session, run the producer:

bin/kafka-console-producer.sh --topic first-topic --bootstrap-server localhost:9092

Then, input messages of your choice:

>second test
>third test
>

You’ll immediately see them being received by the consumer:

Output

second test
third test

When you are done testing, terminate both the producer and the consumer.

To delete first-topic, pass in --delete to kafka-topics.sh:

bin/kafka-topics.sh --delete --topic first-topic --bootstrap-server localhost:9092

There will be no output. You can list the topics to verify that it’s indeed been deleted:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

The output will be:

Output

__consumer_offsets

__consumer_offsets is a topic internal to Kafka, which stores the offsets that consumers have committed, that is, how far each of them has read into a topic.
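The idea of a committed offset can be sketched with two files: one standing in for the topic and one for the stored position of a single consumer. This is a simplified model under those assumptions, not how Kafka stores offsets internally, and the consume_next helper is a hypothetical name.

```shell
topic_log="$(mktemp)"     # stand-in for a topic partition, one message per line
offset_file="$(mktemp)"   # stand-in for the committed offset of one consumer
printf '%s\n' 'test' 'second test' 'third test' > "$topic_log"
echo 0 > "$offset_file"

# Read the message at the committed offset, then commit the next offset.
# Returns non-zero when there is nothing new to read.
# (Simplification: an empty line is treated as "no message".)
consume_next() {
  cn_offset="$(cat "$offset_file")"
  cn_message="$(sed -n "$(( cn_offset + 1 ))p" "$topic_log")"
  [ -n "$cn_message" ] || return 1
  echo "$cn_message"
  echo $(( cn_offset + 1 )) > "$offset_file"
}

consume_next   # prints the first message
consume_next   # prints the second message
echo "committed offset: $(cat "$offset_file")"
```

Because the position is stored separately from the messages, a consumer that stops and starts again can continue from where it left off instead of rereading the whole topic.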

In this step, you’ve created a Kafka topic and produced messages into it. Then, you’ve consumed the messages using the provided script and finally, received them in real time. Next, you’ll learn about how Kafka compares to other event brokers and similar software.

Comparison with Similar Architectures

Apache Kafka is considered the de facto solution for event streaming use cases. However, Apache Pulsar and RabbitMQ are also widely used and stand out as versatile options, albeit with differences in their approach.

The main difference between message queuing and event streaming is one of priorities. The main task of a message queue is getting messages out to consumers as quickly as possible, with less regard for their order. Such systems usually store the messages in memory until they are acknowledged by consumers. Filtering and routing the messages is an important aspect, as consumers can express interest in specific categories of data. RabbitMQ is a strong example of a traditional messaging system, where multiple consumers can subscribe to the same topic and receive multiple copies of a message.

Event streaming, on the other hand, is focused on persistence. Events should be archived, kept in order, and processed once. Routing them to specific consumers is less important, as the idea is that all consumers process the events in the same way.

Apache Pulsar is an open-source messaging system, developed by the Apache Software Foundation, which supports event streaming. Unlike Kafka, which was built with event streaming in mind from the start, Pulsar started as a traditional message-queuing solution and gained event-streaming capabilities later on. Pulsar is thus useful when a mixture of both approaches is needed, without having to deploy separate applications.

Conclusion

You now have Apache Kafka securely running in the background on your server, configured as a systemd service. You’ve also learned how to manipulate topics from the command line, as well as produce and consume messages. However, the main appeal of Kafka is the wide variety of clients for integrating it into your apps.
