Mastering Apache Kafka Architecture: A Comprehensive Tutorial for Data Engineers and Developers

Apache Kafka is an open-source distributed event streaming platform used by organizations of all sizes to manage, process, and transmit real-time data streams. It provides a scalable, fault-tolerant, and high-throughput platform for processing streaming data.

In this article, we will explore how to get started with Apache Kafka, step by step, with code examples and real-world use cases. But before diving into hands-on, let us understand the architecture of Apache Kafka in detail.

Overview of Apache Kafka Architecture

Apache Kafka is a distributed streaming platform that is designed to handle high volumes of data in real-time. It is widely used in modern data processing pipelines and is particularly well-suited for use cases that involve real-time data streams. In this article, we will explore the architecture of Apache Kafka and its key components.

At a high level, the architecture of Apache Kafka consists of the following components:

  1. Producers: These are processes or applications that produce data to be sent to Kafka.
  2. Topics: These are named feeds or categories that data is published to and subscribed from.
  3. Partitions: These are the units of parallelism in Kafka. Each partition is an ordered, immutable sequence of records that is stored on a Kafka broker.
  4. Brokers: These are the servers that store and manage the partitions. They are responsible for handling read and write requests from producers and consumers.
  5. Consumers: These are processes or applications that consume data from Kafka.
  6. Consumer groups: These are groups of consumers that coordinate to consume data from Kafka topics. Within a consumer group, each partition is consumed by exactly one consumer at a time, while different consumer groups can each read the same partitions independently.
  7. ZooKeeper: This is a distributed coordination service that is used by Kafka to manage cluster membership, leader election, and other operational tasks.

Let’s take a closer look at each of these components.

Producers

A producer is a process or application that produces data to be sent to Kafka. Producers publish messages to Kafka topics, which are named feeds or categories that data is published to and subscribed from.

When a producer sends a message to a Kafka topic, it specifies a key and a value. The key is used to determine the partition that the message will be written to, and the value is the actual data that is being sent.

If a message has no key, the producer spreads records across the available partitions; when a key is provided, messages with the same key always end up in the same partition. This is useful when the order of messages matters or when messages relate to a particular entity.
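
Under the hood, the default partitioner hashes the message key and maps the hash to a partition, so identical keys always land on the same partition. The following is a simplified sketch of that idea only; it is not Kafka's actual implementation, which hashes the serialized key bytes with murmur2:

public class PartitionSketch {
   // Illustrative stand-in for Kafka's default partitioner: hash the key,
   // then take it modulo the number of partitions.
   static int partitionFor(String key, int numPartitions) {
      return (key.hashCode() & 0x7fffffff) % numPartitions;
   }

   public static void main(String[] args) {
      int numPartitions = 3;
      System.out.println(partitionFor("customer-42", numPartitions)); // always the same value...
      System.out.println(partitionFor("customer-42", numPartitions)); // ...so per-key ordering is preserved
      System.out.println(partitionFor("customer-7", numPartitions));  // may land on a different partition
   }
}

Because the mapping depends only on the key and the partition count, per-key ordering holds as long as the number of partitions does not change.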

Topics and Partitions

A Kafka topic is a named feed or category to which messages are published. Topics are partitioned to enable parallel processing of messages and to provide fault-tolerance.

Each partition in a topic is an ordered, immutable sequence of records that is stored on a Kafka broker. The partitions in a topic are distributed across multiple brokers in a Kafka cluster to provide fault-tolerance and to enable scalability.

The producing side determines where each record lands: records with the same key are always written to the same partition, while records without a key are spread across partitions by the producer's partitioner.

Each record within a partition is assigned a sequential offset that uniquely identifies it in that partition; offsets are what consumers later use to track their read position.
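
If you want to see how a topic's partitions are laid out across brokers, the Java AdminClient can describe a topic. Here is a minimal sketch, assuming a broker running on localhost:9092 and the my-topic topic created later in this article:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import java.util.Collections;
import java.util.Properties;

public class DescribeTopic {
   public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");

      try (AdminClient admin = AdminClient.create(props)) {
         TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
               .all().get().get("my-topic");
         // Each partition reports its leader broker and its replica set.
         desc.partitions().forEach(p ->
               System.out.println("partition " + p.partition()
                     + " leader=" + p.leader().id()
                     + " replicas=" + p.replicas()));
      }
   }
}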

Brokers

A broker is a server that stores and manages the partitions in a Kafka cluster. Brokers are responsible for handling read and write requests from producers and consumers.

Each broker in a Kafka cluster is identified by a unique ID, which is used to coordinate replication and partitioning of data.
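
As a quick illustration, the same AdminClient can also list the brokers in a cluster along with their IDs; a minimal sketch, again assuming a local broker on localhost:9092:

import org.apache.kafka.clients.admin.AdminClient;
import java.util.Properties;

public class ListBrokers {
   public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");

      try (AdminClient admin = AdminClient.create(props)) {
         // Each node in the cluster is a broker with a unique ID.
         admin.describeCluster().nodes().get().forEach(node ->
               System.out.println("broker id=" + node.id() + " at " + node.host() + ":" + node.port()));
      }
   }
}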

Consumers and Consumer Groups

A consumer is a process or application that reads data from Kafka. Consumers subscribe to one or more topics and use each record's offset to keep track of the last message they have consumed.

A consumer group is a set of consumers that coordinate to consume data from Kafka topics. Within a group, each partition is assigned to exactly one consumer at a time, while multiple groups can read the same topic independently, each maintaining its own offsets.

When a consumer joins a consumer group, it is assigned a set of partitions to consume. If consumers join or leave the group, Kafka rebalances the partitions across the remaining members.
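
If you want to observe which partitions a consumer is given when it joins a group, you can pass a ConsumerRebalanceListener to subscribe(). The following sketch reuses the consumer settings introduced later in this tutorial and simply prints the assigned and revoked partitions:

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class GroupMembershipDemo {
   public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("group.id", "demo-group");
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

      KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
      consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
         @Override
         public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // Called after the group coordinator assigns partitions to this member.
            System.out.println("Assigned: " + partitions);
         }

         @Override
         public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Called before a rebalance takes partitions away from this member.
            System.out.println("Revoked: " + partitions);
         }
      });

      // Polling triggers the group join and the callbacks above.
      consumer.poll(Duration.ofSeconds(5));
      consumer.close();
   }
}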

Installing and Configuring Kafka

Before we dive into coding, we need to install and configure Kafka. Here are the steps to install Kafka on your local machine:

  1. Download the latest version of Apache Kafka from the official website.
  2. Extract the downloaded archive to a desired location.
  3. Navigate to the Kafka directory and start the ZooKeeper server: $ bin/zookeeper-server-start.sh config/zookeeper.properties
  4. Open a new terminal window and start the Kafka server: $ bin/kafka-server-start.sh config/server.properties
  5. Create a new topic: $ bin/kafka-topics.sh --create --topic my-topic --bootstrap-server localhost:9092
  6. Verify that the topic has been created: $ bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Now that we have installed and configured Kafka, we can move on to the coding part.

Writing a Producer

In Kafka, a producer is a component that publishes messages to a Kafka topic. Here’s an example of how to write a producer using the Java client:

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import java.util.Properties;

public class MyProducer {
   public static void main(String[] args) throws Exception {
      String topicName = "my-topic";
      String key = "key1";
      String value = "value1";

      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      Producer<String, String> producer = new KafkaProducer<>(props);
      ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);

      producer.send(record);
      producer.close();
   }
}

In this example, we create a KafkaProducer instance with a set of properties, such as the bootstrap servers, and the serializers for the key and value. We then create a ProducerRecord with the topic name, key, and value, and send it to the Kafka topic.
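
Note that send() is asynchronous and the example above ignores the result. If you need to confirm delivery or log failures, you can pass a callback to send(); here is a small sketch of that pattern, dropped into the example above in place of the plain send() call:

producer.send(record, (metadata, exception) -> {
   if (exception != null) {
      // The record could not be delivered (e.g. broker unavailable, serialization error).
      System.err.println("Send failed: " + exception.getMessage());
   } else {
      System.out.println("Sent to partition " + metadata.partition()
            + " at offset " + metadata.offset());
   }
});
// flush() blocks until all buffered records have been sent.
producer.flush();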

Writing a Consumer

In Kafka, a consumer is a component that reads messages from a Kafka topic. Here’s an example of how to write a consumer using the Java client:

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class MyConsumer {
   public static void main(String[] args) throws Exception {
      String topicName = "my-topic";

      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("group.id", "test-group");

      Consumer<String, String> consumer = new KafkaConsumer<>(props);
      consumer.subscribe(Arrays.asList(topicName));

      while (true) {
         ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
         records.forEach(record -> {
            System.out.println("Received message: (" + record.key() + ", " + record.value() + ") at offset " + record.offset());
         });
      }
   }
}

In this example, we create a KafkaConsumer instance with a set of properties, such as the bootstrap servers, and the deserializers for the key and value. We then subscribe to the Kafka topic and use a while loop to continuously poll for new messages. When we receive a message, we print the key, value, and offset.
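
The example above relies on automatic offset commits, which is the Kafka consumer's default behavior. If you would rather commit offsets only after records have been processed, you can disable auto-commit and commit manually. Here is a sketch of that variation, reusing the props and topicName from the example above:

props.put("enable.auto.commit", "false");

Consumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topicName));

try {
   while (true) {
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
      records.forEach(record ->
            System.out.println("Processing " + record.key() + " at offset " + record.offset()));
      // Commit the offsets returned by the last poll, but only after processing.
      consumer.commitSync();
   }
} finally {
   consumer.close();
}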

Real-world Use Case

Now that we have learned how to write producers and consumers in Kafka, let’s take a look at a real-world use case.

Suppose you work for a large retail company that has multiple online stores. You want to keep track of the sales data in real-time to make data-driven decisions. You can use Kafka to collect and process data from different sources and make it available for analysis.

You can create a Kafka topic for each online store and use a producer to send the sales data to the respective topic. You can then use a consumer to read the data from each topic and process it in real-time. You can use tools like Apache Spark, Apache Flink, or Apache Storm to analyze the data and generate insights.
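
As a rough sketch of what the producing side could look like, here is a producer that writes a single JSON sales record to a store-specific topic. The topic name, key, and field values below are made up purely for illustration:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class SalesProducer {
   public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      // One topic per online store, e.g. "sales-store-101" (hypothetical naming scheme).
      String topic = "sales-store-101";
      // Keying by product ID keeps all events for a product in the same partition.
      String key = "product-9001";
      String value = "{\"timestamp\":\"2024-01-01T10:00:00Z\",\"store_id\":\"101\","
            + "\"product_id\":\"9001\",\"quantity\":2.0}";

      try (Producer<String, String> producer = new KafkaProducer<>(props)) {
         producer.send(new ProducerRecord<>(topic, key, value));
      }
   }
}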

An application like this will often handle large volumes of data, so it frequently makes sense to process the streams with Apache Spark. Here's an example of how we can integrate Kafka with Spark.

Using Kafka with PySpark

Here is a PySpark code example that uses Apache Kafka to process streaming data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Create a Spark session.
# Note: the "kafka" source used below requires the spark-sql-kafka connector
# package to be available (e.g. via --packages when submitting the job).
spark = SparkSession.builder \
    .appName("KafkaDemo") \
    .getOrCreate()

# Define the schema for the incoming data
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("store_id", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("quantity", DoubleType(), True)
])

# Define the Kafka consumer configuration
kafka_conf = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "sales_data",
    "startingOffsets": "earliest",
    "failOnDataLoss": "false"
}

# Create a DataFrame that reads from Kafka
df = spark \
    .readStream \
    .format("kafka") \
    .options(**kafka_conf) \
    .load()

# Convert the value column from binary to string
df = df.withColumn("value", df["value"].cast("string"))

# Parse the JSON data and apply the schema
df = df \
    .withColumn("data", from_json("value", schema)) \
    .select("data.*")

# Group the data by store and product and calculate the total quantity
result = df \
    .groupBy("store_id", "product_id") \
    .sum("quantity")

# Define the Kafka producer configuration.
# The Kafka sink requires a checkpoint location; the path below is just an example.
producer_conf = {
    "kafka.bootstrap.servers": "localhost:9092",
    "topic": "sales_summary",
    "checkpointLocation": "/tmp/sales_summary_checkpoint"
}

# Write the result DataFrame to Kafka as a new topic.
# Because the query aggregates, "update" output mode is used so that
# changed rows are re-emitted on every trigger.
result \
    .selectExpr("to_json(struct(*)) AS value") \
    .writeStream \
    .format("kafka") \
    .options(**producer_conf) \
    .outputMode("update") \
    .start()

# Wait for the streaming query to terminate
spark.streams.awaitAnyTermination()

In this example, we first create a Spark session and define the schema for the incoming data. We then define the Kafka consumer configuration and create a DataFrame that reads from Kafka.

Next, we convert the value column from binary to string, parse the JSON data, and apply the schema. We then group the data by store and product and calculate the total quantity.

Finally, we define the Kafka producer configuration and write the result DataFrame to Kafka as a new topic. Because the query performs an aggregation, it uses the "update" output mode, and the Kafka sink requires a checkpoint location to track its progress. We then start the streaming query and wait for it to terminate.

This code example demonstrates how Apache Kafka can be used to process real-time data streams in PySpark. By defining the Kafka consumer and producer configurations and using PySpark’s streaming capabilities, we can create powerful and scalable data processing pipelines.

If you are interested in Apache Kafka, we have a perfect course for you where we deep dive into more detailed examples and topics of Kafka that will help you finish any real-world application within a few hours. Find out more about it here: Apache Kafka Guru – Zero to Hero in Minutes

Conclusion

If you want to learn more about Apache Kafka, the official Apache Kafka documentation is a good place to continue.

In conclusion, Apache Kafka is a powerful tool for managing real-time data streams. With its high throughput and fault-tolerance, it’s a popular choice for organizations of all sizes. By following the steps outlined in this article and exploring the relevant resources, you can get started with Apache Kafka and use it to build robust and scalable data processing pipelines.

