Kafka - A Quick and Simple Guide

April 08, 2025

Introduction

Apache Kafka is a distributed event streaming platform. Originally developed at LinkedIn, it is now an open-source project under the Apache Software Foundation. Kafka is widely used for building real-time data pipelines and streaming applications.

What is Apache Kafka?

Kafka is a messaging system that lets applications publish (write), subscribe to (read), store, and process streams of data in real time.


Kafka Core Concepts
1. Producers and Consumers
  • Producers → Send data to Kafka topics.
  • Consumers → Read data from Kafka topics.
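As an illustration only (not the real Kafka client API), the producer and consumer roles can be sketched with a tiny in-memory "topic"; the names `InMemoryTopic`, `produce`, and `consume` are invented for this sketch:

```python
# Toy in-memory "topic" illustrating the producer/consumer roles.
# This is NOT the Kafka client API; all names here are invented.

class InMemoryTopic:
    def __init__(self, name):
        self.name = name
        self.records = []          # append-only list of records

    def produce(self, value):
        """Producer side: append a record, return its position."""
        self.records.append(value)
        return len(self.records) - 1

    def consume(self, position):
        """Consumer side: read the record at a given position."""
        return self.records[position]

topic = InMemoryTopic("orders")
topic.produce("order-1")
topic.produce("order-2")
print(topic.consume(0))  # order-1
```

The key point the sketch captures: producers only ever append, and consumers read by position rather than removing records, so many consumers can read the same data.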

2. Topics and Partitions
  • Topic → Logical channel for a specific type of data.
  • Partition → Topics are split into partitions for scalability. Each partition stores data in order.
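How records are routed to partitions can be sketched as hashing the record key, so the same key always lands in the same partition (Kafka's default partitioner actually hashes the key bytes with murmur2; the `crc32`-based function below is a simplified stand-in):

```python
import zlib

NUM_PARTITIONS = 3

def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner:
    # the same key always maps to the same partition, which is
    # what preserves per-key ordering.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = pick_partition("user-42")
p2 = pick_partition("user-42")
print(p1 == p2)  # True
```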

3. Brokers and Cluster
  • Broker → Kafka server that stores data.
  • Cluster → Group of brokers working together.

4. ZooKeeper
    Kafka uses ZooKeeper to manage broker metadata and coordination tasks.

    Note: Newer Kafka versions can run without ZooKeeper (KRaft mode).

5. Replication Factor
   The replication factor provides fault tolerance and states how many copies of each partition exist across brokers. For example, with a replication factor of 3, each partition is kept on its leader broker and replicated to 2 follower brokers.

6. Partition Rebalancing
     When consumers join or leave a consumer group, Kafka triggers a rebalance to redistribute partitions among the available consumers. This ensures load balancing, but it may temporarily pause message processing.
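The effect of a rebalance can be sketched as simply re-running the assignment over the current group members (a simplified round-robin stand-in for Kafka's pluggable assignors; all names are invented for the sketch):

```python
def assign_partitions(partitions, consumers):
    """Round-robin a list of partition ids over the sorted group members."""
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
before = assign_partitions(partitions, ["c1", "c2"])
after = assign_partitions(partitions, ["c1", "c2", "c3"])  # c3 joins, rebalance
print(before)  # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
print(after)   # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Note how partitions move between consumers when membership changes; that migration is why in-flight processing can pause briefly during a rebalance.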


How Kafka Stores Data

Kafka stores data in logs → sequence of records per partition. Logs are divided into segments (files on disk). This helps manage data efficiently and allows old data to be deleted to free up space.

Each record has a unique offset within its partition: Kafka assigns offsets sequentially as records are produced, and consumers use them to read records in order.

Data can be replicated to multiple brokers for fault tolerance. Each partition has one leader broker and several replica brokers. The leader handles all reads and writes, while followers replicate the data.
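The storage model described above can be sketched as an append-only log per partition, with the leader copying each write to its followers (a simplified illustration only; real brokers store logs as segment files, and followers fetch from the leader rather than being pushed to):

```python
class Partition:
    """Append-only log: each record gets the next sequential offset."""
    def __init__(self):
        self.log = []

    def append(self, record):
        offset = len(self.log)
        self.log.append(record)
        return offset

class ReplicatedPartition:
    """Leader handles all writes; followers mirror the leader's log."""
    def __init__(self, replication_factor=3):
        self.leader = Partition()
        self.followers = [Partition() for _ in range(replication_factor - 1)]

    def write(self, record):
        offset = self.leader.append(record)
        for f in self.followers:   # simplified: followers fetch in real Kafka
            f.append(record)
        return offset

rp = ReplicatedPartition(replication_factor=3)
print(rp.write("a"))  # 0
print(rp.write("b"))  # 1
```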


Ensuring No Duplicate Messages

Idempotent Producers : Enable them by setting enable.idempotence=true; this ensures the same message is not written more than once even if retries happen.

Exactly Once Semantics (EOS) : Use transactional.id on producers and manage offsets manually in consumers (commit offsets only after processing the record).

Consumer Offset Management : Disable auto-commit (enable.auto.commit=false) and commit offsets explicitly after successful processing. This ensures at-least-once delivery; combine it with idempotent processing logic for effectively-once delivery.
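The commit-after-processing pattern plus idempotent processing can be sketched with an in-memory loop (an illustration only, not the Kafka consumer API; the record-id dedupe set stands in for idempotent processing state, and the returned offset stands in for an explicit commit):

```python
def process_batch(records, processed_ids, results):
    """At-least-once consumer loop: process, dedupe, then 'commit'."""
    committed = -1
    for offset, (record_id, value) in enumerate(records):
        if record_id not in processed_ids:   # skip duplicates on redelivery
            results.append(value)
            processed_ids.add(record_id)
        committed = offset                   # commit only after processing
    return committed

records = [("r1", "a"), ("r2", "b")]
seen, out = set(), []
process_batch(records, seen, out)
# Redelivery of the same batch (e.g. after a crash before the commit)
process_batch(records, seen, out)
print(out)  # ['a', 'b'] (duplicates were not reprocessed)
```

This is the "effectively-once" combination: delivery is at-least-once, but idempotent processing makes the duplicates harmless.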

How Kafka Handles Multiple Consumers with the Same Group ID

Kafka uses consumer groups to let multiple consumers divide the work. Each consumer in the group reads from a unique subset of the topic's partitions, and Kafka ensures only one consumer in a group reads a given partition at a time.

Consumers in different consumer groups read the same data independently.
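That independence comes from each group tracking its own offsets, which can be sketched like this (invented names, illustration only):

```python
# Each consumer group keeps its own offset into the same log, so
# groups read the same records independently of one another.
log = ["a", "b", "c"]
group_offsets = {"group-1": 0, "group-2": 0}

def poll(group):
    """Return the next record for this group and advance its offset."""
    off = group_offsets[group]
    if off >= len(log):
        return None
    group_offsets[group] = off + 1
    return log[off]

print(poll("group-1"))  # a
print(poll("group-1"))  # b
print(poll("group-2"))  # a  (group-2 starts from its own offset)
```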

How long does Kafka store data?

By default, Kafka uses time-based retention (retention.ms=604800000, i.e. 7 days), but this can be changed to size-based retention (retention.bytes) or to keep data forever (log.retention.ms=-1).
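The time-based policy can be sketched as dropping whole segments whose newest record has aged past retention.ms (a simplification of Kafka's log cleanup, which also works segment by segment; the names below are invented):

```python
RETENTION_MS = 604_800_000  # 7 days, Kafka's default retention.ms

def apply_retention(segments, now_ms, retention_ms=RETENTION_MS):
    """Keep only segments whose last record is inside the retention window."""
    return [s for s in segments if now_ms - s["last_ts"] <= retention_ms]

now = 10 * RETENTION_MS
segments = [
    {"name": "seg-0", "last_ts": now - 2 * RETENTION_MS},  # expired
    {"name": "seg-1", "last_ts": now - 1000},              # still retained
]
kept = apply_retention(segments, now)
print([s["name"] for s in kept])  # ['seg-1']
```

Deleting at segment granularity is the design choice that makes retention cheap: Kafka removes whole files rather than scanning individual records.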


