Kafka Basics: Theory and Python Examples
Apache Kafka is a distributed platform for handling real‑time event streams. It provides three main capabilities:
- Publish and subscribe to streams of events.
- Store those streams durably and reliably for as long as needed.
- Process streams of events as they occur or later.
Events, Producers, and Consumers
An event (or record/message) captures that “something happened” in your system. Client applications that write events are called producers, while those that read events are consumers.
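Conceptually, an event is just a keyed payload plus a timestamp and optional headers. The sketch below is only an illustration: the field names mirror what the kafka-python client (used later in this tutorial) exposes on a consumed record, and the values themselves are made up.
# Illustrative only: the shape of a single event/record.
# Field names follow kafka-python's ConsumerRecord; the values are invented.
event = {
    "key": "user-42",                  # optional; drives partition assignment
    "value": '{"action": "login"}',    # the payload, often JSON or Avro bytes
    "timestamp": 1700000000000,        # milliseconds since the Unix epoch
    "headers": [("source", b"web")],   # optional producer-supplied metadata
}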
Topics and Partitions
- Events are grouped in topics, similar to folders in a filesystem.
- Topics are append-only and immutable: once an event is written it cannot be modified, only read (and eventually expired).
- Each topic is split into partitions. Messages in a partition are ordered and assigned a sequential offset.
- Kafka retains data for a configurable period (default: one week). Offsets are never reused, even if old events expire.
- Ordering is guaranteed only within a single partition. Without specifying a key, events are distributed round‑robin across partitions. Using a key ensures all events with the same key go to the same partition.
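As a minimal sketch of key-based routing (using the kafka-python client introduced later in this tutorial; the topic demo and the order-42 key are illustrative), every send with the same key lands in the same partition, so the three events below keep their relative order:
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: v.encode("utf-8"),
)

# The default partitioner hashes the key, so "order-42" always maps
# to the same partition and the events stay in order.
for event in ["created", "paid", "shipped"]:
    metadata = producer.send("demo", key="order-42", value=event).get(timeout=10)
    print(f"partition={metadata.partition} offset={metadata.offset}")

producer.flush()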
Replication and Durability
- Topics can be replicated across brokers for high availability. A replication factor of three is common in production.
- Each partition has a leader and zero or more followers. Producers always write to the leader, and consumers read from it by default.
- In‑Sync Replicas (ISR) are followers that have fully caught up with the leader. Kafka can elect a new leader from the ISR if the current leader fails.
- Producers choose the desired reliability via acknowledgements:
  - acks=0 – no confirmation (risk of loss).
  - acks=1 – leader only.
  - acks=all – leader and all ISRs.
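With the kafka-python client the acknowledgement level is a producer setting. The sketch below (broker address and topic are illustrative) trades some latency for durability by waiting for the leader and all in-sync replicas:
from kafka import KafkaProducer

# acks="all": send() only succeeds after the leader and every in-sync
# replica have stored the record; use acks=0 or acks=1 for lower latency.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    acks="all",
    retries=5,                              # retry transient broker errors
    value_serializer=lambda v: v.encode("utf-8"),
)

producer.send("demo", value="durable write").get(timeout=10)
producer.flush()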
Consumers and Groups
- Kafka uses a pull model: consumers request data from brokers.
- Consumers can form consumer groups for parallelism and fault tolerance. Each partition is consumed by only one member of a group at a time. Multiple groups can independently read the same topic.
- The broker tracks the last processed offset for each group in the __consumer_offsets internal topic.
- Delivery semantics when committing offsets manually:
- At most once – commit before processing.
- At least once – commit after processing (may lead to duplicates).
- Exactly once – requires idempotent processing and transactional APIs.
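As a rough sketch of at-least-once consumption with kafka-python (the group id is illustrative and process() stands in for whatever your application does with a record): auto-commit is disabled and the offset is committed only after processing, so a crash in between causes reprocessing rather than data loss.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "demo",
    bootstrap_servers=["localhost:9092"],
    group_id="at-least-once-app",
    enable_auto_commit=False,        # we commit offsets ourselves
    auto_offset_reset="earliest",
)

while True:
    # poll() returns {TopicPartition: [ConsumerRecord, ...]}
    batch = consumer.poll(timeout_ms=1000)
    for records in batch.values():
        for record in records:
            process(record)          # hypothetical application logic
    if batch:
        consumer.commit()            # commit only after processing succeeded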
Brokers and Coordination
- A Kafka broker hosts topic partitions and has a unique ID. After connecting to any bootstrap broker, clients discover the rest of the cluster (see the sketch after this list).
- Legacy deployments rely on ZooKeeper for coordination, but modern Kafka versions support KRaft mode, removing ZooKeeper and improving scalability.
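To see bootstrap discovery from Python, hand the client a single broker address and ask it for cluster metadata; a small sketch with kafka-python (the address and topic are illustrative):
from kafka import KafkaConsumer

# One bootstrap broker is enough: the client fetches metadata describing
# every broker, topic, and partition in the cluster.
client = KafkaConsumer(bootstrap_servers=["localhost:9092"])
print(client.topics())                      # all topics in the cluster
print(client.partitions_for_topic("demo"))  # partition ids for one topic
client.close()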
Useful CLI Commands
# Create a topic
kafka-topics.sh --create --bootstrap-server $BROKERS \
--replication-factor 3 --partitions 2 --topic demo
# Produce messages (with optional keys)
kafka-console-producer.sh --bootstrap-server $BROKERS \
--topic demo --property parse.key=true --property key.separator=:
>key1:value1
# Consume from the beginning
kafka-console-consumer.sh --bootstrap-server $BROKERS \
--topic demo --from-beginning
# Inspect consumer groups
kafka-consumer-groups.sh --bootstrap-server $BROKERS --list
# Reset offsets for a group
kafka-consumer-groups.sh --bootstrap-server $BROKERS \
--group app-1 --reset-offsets --to-earliest --topic demo --execute
Python Client Tutorial
1. Run Kafka with Docker
Create a simple docker-compose.yml:
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
Start the cluster:
docker compose up -d
2. Listen with a console consumer
In one terminal, wait for messages:
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic demo --from-beginning \
--property print.key=true --property print.value=true \
--property key.separator=:
# waiting for messages ...
3. Python producer
Install the client library and create producer.py:
pip install kafka-python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    key_serializer=lambda k: k.encode("utf-8") if k else None,
    value_serializer=lambda v: v.encode("utf-8"),
)

# Send three keyed messages to the "demo" topic.
for i in range(3):
    producer.send("demo", key=f"key-{i}", value=f"hello-{i}")

# Block until all buffered messages have been delivered.
producer.flush()
Run the script:
python producer.py
The console consumer prints:
key-0:hello-0
key-1:hello-1
key-2:hello-2
4. Python consumer
Create consumer.py:
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "demo",
    bootstrap_servers=["localhost:9092"],
    group_id="python-app",
    auto_offset_reset="earliest",   # start from the oldest message if no offset is stored
    key_deserializer=lambda k: k.decode("utf-8") if k else None,
    value_deserializer=lambda v: v.decode("utf-8"),
)

# Iterate forever, printing each record as it arrives.
for msg in consumer:
    print(
        f"partition={msg.partition} offset={msg.offset} key={msg.key} value={msg.value}"
    )
Produce a message from the CLI:
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic demo
>cli-msg
Then run the consumer:
python consumer.py
Sample output:
partition=0 offset=3 key=None value=cli-msg
These scripts show how Python clients interact with Kafka alongside the command-line tools.
Summary
Kafka’s combination of publish/subscribe, durable storage, and stream processing makes it well suited for event‑driven architectures. Understanding topics, partitions, replication, and consumer groups is essential before building applications atop Kafka.
