Advanced · Asked at Amazon, LinkedIn, Meta, Confluent

Design a Distributed Message Queue (Kafka)

Build a durable, partitioned, replicated commit log like Kafka — ordering, consumer groups, replication (ISR), and exactly-once.

21 min read · 2026-05-20 · Ironclad Academy

#interview #messaging #distributed-systems #storage

The problem

Kafka is one of the most consequential pieces of infrastructure built in the last 15 years. LinkedIn built the original version in 2010 because its activity pipeline — page views, job clicks, search queries — was drowning in ad-hoc point-to-point integrations. Every service that produced data had a custom connection to every service that consumed it. Adding one new consumer meant touching every producer. The solution was a shared, durable log that producers write to once and consumers read from independently — at their own pace, on their own schedule, without coordinating with each other.

Today that same pattern powers event streaming at companies like Netflix (user play events), Uber (location updates and trip state), and Confluent (which commercializes Kafka itself). The product analogy that maps most directly is a durable, replayable append-only log: think of it as a commit log on disk that any number of readers can walk through, each holding their own bookmark, and where the data stays available for days or weeks — not just until the first reader picks it up.

What makes this genuinely hard is the collision of four properties that each pull in different directions. You want high throughput — millions of messages per second — which demands sequential disk I/O and aggressive batching. You want durability — no acknowledged message ever disappears — which requires replication across multiple brokers. You want strict ordering within a stream, which forces you to serialize writes to a single partition. And you want flexible, independent consumption — multiple teams replaying the same data at different speeds — which rules out any design where the broker tracks per-message delivery state. Getting all four at the same time, at PB scale, is the design challenge this article unpacks.

This article builds the system from scratch, then deep-dives each component with the precision an interview panel at a distributed-systems company expects.

Functional requirements

produce(topic, key, value) — publish a message to a named topic.
consume(topic, group_id, from_offset) — read messages from a topic, resuming from a committed offset.
Messages are durable: survive broker restarts and tolerate F broker failures with replication factor RF = F + 1.
Multiple independent consumer groups each see all messages (fan-out without duplication).
Replay: a consumer group can reset to any past offset within the retention window.

Non-functional requirements

Throughput: millions of messages/sec, GB/s sustained ingest and egress.
Latency: p99 end-to-end < 100 ms with acks=1; < 500 ms with acks=all under moderate load.
Durability: with acks=all and min.insync.replicas=2, no acknowledged message is lost even on a single-broker crash.
Ordering: strict within a partition; no global ordering guarantee across partitions.
Scalability: add partitions to scale throughput; add brokers to increase capacity linearly.

Capacity estimation

Dimension	Estimate	How we got there
Ingest rate	1 GB/s	`1M msg/s × 1 KB avg`
Leader ingest per broker	~33 MB/s	`1 GB/s ÷ 30 brokers`; with RF=3 each broker also writes ~67 MB/s as a follower, so ~100 MB/s of disk writes in total
Primary bottleneck	Network, not disk	Modern NVMe sustains 3–7 GB/s sequential writes, far above ~100 MB/s per broker; with 100 consumer groups the egress fan-out is 100:1, so the NIC saturates first
Raw retention	~600 TB	`7 days × 86,400 s/day × 1 GB/s = 604,800 GB ≈ 600 TB`
Total physical storage	1.8 PB	`600 TB × replication factor 3`
Partitions	1,000 total; 33/broker	1,000 partitions across 30 brokers
Data per partition	~600 GB retained	`600 TB ÷ 1,000 partitions`
Hot segment data per broker	~33 GB	Active segment 1 GB/partition × 33 partitions/broker
Consumer groups	100	At most 1 consumer per partition per group
Aggregate egress	100 GB/s (≈ 800 Gbps)	`100 groups × 1 GB/s`
Egress per broker	~3.3 GB/s (~26 Gbps)	`100 groups × 33 MB/s local leader traffic`; each broker needs ≥ 40 Gbps NIC (25 Gbps is insufficient); zero-copy `sendfile` makes this viable when not using TLS

Takeaway: network is the primary bottleneck, not disk throughput. NVMe is fast enough that sequential writes aren't the limiter — fan-out egress to many consumer groups is.
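
The arithmetic behind the table can be checked in a few lines. This is a minimal sketch that uses only the assumptions stated above (1M msg/s, 1 KB messages, 30 brokers, RF=3, 7-day retention, 100 consumer groups) with decimal units; the variable names are illustrative.

```python
# Back-of-envelope capacity check using the assumptions from the table above.
MSGS_PER_SEC = 1_000_000
MSG_BYTES = 1_000          # 1 KB average, decimal units as in the table
BROKERS = 30
REPLICATION_FACTOR = 3
RETENTION_DAYS = 7
CONSUMER_GROUPS = 100

GB = 10**9
TB = 10**12

ingest_bytes_per_sec = MSGS_PER_SEC * MSG_BYTES
raw_retention_bytes = ingest_bytes_per_sec * 86_400 * RETENTION_DAYS
physical_bytes = raw_retention_bytes * REPLICATION_FACTOR

# Every consumer group reads the full stream, so egress is ingest x groups.
egress_bytes_per_sec = ingest_bytes_per_sec * CONSUMER_GROUPS
egress_per_broker_gbps = egress_bytes_per_sec / BROKERS * 8 / GB

print(f"ingest:            {ingest_bytes_per_sec / GB:.1f} GB/s")
print(f"raw retention:     {raw_retention_bytes / TB:.1f} TB")
print(f"physical storage:  {physical_bytes / TB:.1f} TB")
print(f"egress per broker: {egress_per_broker_gbps:.1f} Gbps")
```

The per-broker egress lands just above 25 Gbps before any replication traffic is counted, which is why the table rounds up to a 40 Gbps NIC.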

Building up to the design

V1: A single queue, single box

The simplest mental model: one process, one in-memory FIFO queue. Producers enqueue; consumers dequeue. This works at less than 10k msg/sec for a single producer/consumer pair, but a restart loses everything, a second consumer group sees nothing (a dequeued message is gone), replay is impossible, and capacity is capped by one machine's RAM. You need to persist the data.

V2: Persist to disk — the commit log insight

Write each message to an append-only log file. Assign each message a monotonically increasing offset: a logical sequence number that identifies its position in the log (not a byte position in the file). Consumers track their own read position.

offset 0: {key="user:42", value="{event: login}"}
offset 1: {key="user:17", value="{event: purchase}"}
offset 2: {key="user:42", value="{event: logout}"}

Now you have durability, replay (seek to offset 0 and re-read), and true fan-out: two consumers can independently hold different offsets in the same file. But one file lives on one disk in one machine, so every write and every read goes through that single box. At 1 GB/s ingest, a single log file on one disk is your ceiling.

V3: Partitions — the unit of parallelism

Split each topic into N partitions. Each partition is an independent commit log with its own offset counter, its own disk file, and its own set of consumer assignments.

Topic: "user-events"
  Partition 0 → broker 1 (leader)
  Partition 1 → broker 2 (leader)
  Partition 2 → broker 3 (leader)

Producers route messages to partitions by key hash (hash(key) % N — same key always lands on the same partition, preserving per-key ordering) or round-robin when keyless throughput matters more than ordering. Throughput now scales linearly with partition count. The new problem: partitions live on one broker, so a broker crash takes down every partition it leads. You need replication.
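
A minimal sketch of that routing rule follows. It is not Kafka's partitioner: the Java client hashes keys with murmur2, while this example substitutes CRC32 from the standard library, and `choose_partition` and its `counter` argument are hypothetical names.

```python
import zlib
from typing import Optional

def choose_partition(key: Optional[bytes], num_partitions: int, counter: int = 0) -> int:
    """Pick a partition: stable hash for keyed messages, rotation for keyless ones."""
    if key is None:
        # Keyless: spread messages evenly; no ordering is promised.
        return counter % num_partitions
    # Keyed: the same key always maps to the same partition (CRC32 is a stand-in hash).
    return zlib.crc32(key) % num_partitions

first = choose_partition(b"user:42", 3)
second = choose_partition(b"user:42", 3)
print(first == second)                                        # same key, same partition
print([choose_partition(None, 3, counter=i) for i in range(4)])
```

Because the result depends on `num_partitions`, adding partitions later changes where existing keys land, which is one reason to choose the partition count carefully up front.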

V4: Replication — ISR and the high-watermark

Each partition has one leader and RF-1 follower replicas on other brokers. Followers pull data from the leader and stay caught up.

The In-Sync Replica (ISR) set is the set of replicas that have caught up with the leader within the last replica.lag.time.max.ms. If a follower falls further behind, it is removed from the ISR. The high-watermark (HW) marks the end of the data replicated to every ISR member. It is an exclusive upper bound: all offsets below the HW are committed, and consumers can only read below it, so they never see uncommitted data.

With replication in place, a broker crash triggers leader election within the ISR with no data loss for committed messages. The open question is what counts as "committed" from the producer's perspective.

V5: Producer acks — the durability dial

acks=0   → fire-and-forget. No guarantee. Fastest.
acks=1   → leader wrote to its local log (OS page cache; may not be flushed
           to disk yet). Lost if leader crashes before any follower replicates.
acks=all → all ISR members confirmed. Maximum durability.
           Latency includes the follower round-trip.

Combined with min.insync.replicas=2: with acks=all the leader waits for every current ISR member, and min.insync.replicas sets the smallest ISR for which it will accept the write at all. An acknowledged write therefore exists on at least 2 replicas (including the leader). If the ISR shrinks below 2 (say, only the leader remains), the broker rejects the produce with NotEnoughReplicasException rather than accepting a write that can't be durably replicated.
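
The decision the leader makes for each acks level can be sketched as a small function. This is a simplified model under stated assumptions, not broker code: `handle_produce` and the `NotEnoughReplicas` class are hypothetical stand-ins, and the real broker replies asynchronously once followers have fetched the data.

```python
class NotEnoughReplicas(Exception):
    """Stand-in for Kafka's NotEnoughReplicasException."""

def handle_produce(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Describe when the leader would acknowledge a write (simplified model)."""
    if acks == "0":
        return "no ack: producer does not wait"
    if acks == "1":
        return "ack after leader append"
    if acks == "all":
        # min.insync.replicas is only enforced for acks=all.
        if isr_size < min_insync_replicas:
            raise NotEnoughReplicas(
                f"ISR has {isr_size} member(s), need {min_insync_replicas}"
            )
        return f"ack after all {isr_size} in-sync replicas have the write"
    raise ValueError(f"unknown acks value: {acks!r}")

print(handle_produce("all", isr_size=3, min_insync_replicas=2))
try:
    handle_produce("all", isr_size=1, min_insync_replicas=2)
except NotEnoughReplicas as err:
    print("rejected:", err)
```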

V6: Consumer groups and offset management

flowchart LR
    P0[Partition 0] --> C1[Consumer 1<br/>Group A]
    P1[Partition 1] --> C2[Consumer 2<br/>Group A]
    P2[Partition 2] --> C3[Consumer 3<br/>Group A]
    P0 --> C4[Consumer 1<br/>Group B]
    P1 --> C4
    P2 --> C4
    style C1 fill:#ff6b1a,color:#fff
    style C2 fill:#ff6b1a,color:#fff
    style C3 fill:#ff6b1a,color:#fff
    style C4 fill:#15803d,color:#fff

Each partition is consumed by at most one consumer per group at a time — that's the invariant that makes processing safe and ordered. A consumer group with more consumers than partitions leaves some consumers idle. Multiple consumer groups are fully independent: Group A and Group B each maintain their own offsets and see all messages.
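
The invariant above can be illustrated with a simple round-robin assignment. This is a sketch only: Kafka ships several assignors (range, round-robin, sticky) with more involved logic, and `assign_round_robin` is a hypothetical helper.

```python
def assign_round_robin(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one consumer in the group."""
    members = sorted(consumers)
    assignment: dict[str, list[int]] = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment

# Group A: more consumers than partitions, so one consumer sits idle.
print(assign_round_robin([0, 1, 2], ["a1", "a2", "a3", "a4"]))
# Group B: a single consumer owns every partition.
print(assign_round_robin([0, 1, 2], ["b1"]))
```

Each group runs this assignment independently, which is why Group B reading all three partitions has no effect on Group A.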

Consumers commit their offsets back to a special internal topic, __consumer_offsets. This decouples offset tracking from the data log: a partition's log carries no per-consumer delivery state, so any number of groups can read the same log independently.

V7: Exactly-once semantics

"Exactly-once" is precise language. It means each message is processed and its effects produced exactly once — not at-most-once and not at-least-once. In Kafka's model, it has two layers.

The idempotent producer eliminates at-least-once duplicates on the produce path: the broker assigns each producer a Producer ID (PID) and tracks the last sequence number received per PID per partition. When a producer retries after a network error, the broker detects the duplicate sequence number and discards it — the message is written exactly once to the log.
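
A toy model of the broker-side check makes the mechanism concrete. It is deliberately simplified (one record per sequence number, whereas a real broker tracks sequence numbers per batch), and `PartitionLog` is a hypothetical class.

```python
class PartitionLog:
    """Toy partition log that de-duplicates retries by (producer ID, sequence)."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self.last_seq: dict[int, int] = {}  # producer ID -> last sequence appended

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"              # retry of something already written
        if seq != last + 1:
            raise ValueError(f"out-of-order sequence {seq}, expected {last + 1}")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"

log = PartitionLog()
print(log.append(producer_id=7, seq=0, value="login"))
print(log.append(producer_id=7, seq=0, value="login"))   # network retry
print(log.records)
```

The retry is acknowledged as if it had succeeded, so the producer sees no error while the log holds a single copy.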

Transactions handle atomic multi-partition writes: a producer can open a transaction, write to multiple partitions (or topics), then commit atomically. Consumers with isolation.level=read_committed only see messages from committed transactions and never read partial writes. Under the hood, a transaction coordinator (a special broker role) uses a two-phase commit protocol with a durable transaction log.

One caveat worth saying clearly: Kafka's exactly-once is scoped to produce → broker → consume within the Kafka system. If your consumer writes to an external database and that write fails after the offset commit, you don't automatically get end-to-end exactly-once. That requires idempotent consumers or an external deduplication mechanism.
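
One common way to build an idempotent consumer is to record the message position in the same database transaction as the side effect. The sketch below uses an in-memory SQLite database; the table names, the `apply_once` helper, and the account example are all assumptions made for illustration.

```python
import sqlite3

def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a balance table and a dedup table."""
    db = sqlite3.connect(":memory:")
    with db:
        db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
        db.execute(
            "CREATE TABLE processed ("
            " topic TEXT, part INTEGER, msg_offset INTEGER,"
            " PRIMARY KEY (topic, part, msg_offset))"
        )
        db.execute("INSERT INTO balances VALUES ('acct-1', 0)")
    return db

def apply_once(db: sqlite3.Connection, topic: str, part: int, msg_offset: int,
               account: str, delta: int) -> bool:
    """Apply a message's effect at most once; returns False for a redelivery."""
    try:
        with db:  # one transaction: commits on success, rolls back on error
            db.execute("INSERT INTO processed VALUES (?, ?, ?)", (topic, part, msg_offset))
            db.execute("UPDATE balances SET amount = amount + ? WHERE account = ?",
                       (delta, account))
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key clash: this offset was already applied

db = open_store()
print(apply_once(db, "payments", 0, 42, "acct-1", 50))   # first delivery
print(apply_once(db, "payments", 0, 42, "acct-1", 50))   # redelivery after a crash
print(db.execute("SELECT amount FROM balances").fetchone()[0])
```

With this in place the consumer can fall back to at-least-once delivery and commit offsets after processing, because a redelivered message is detected and skipped.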

flowchart LR
    V1["V1: in-memory FIFO<br/>no durability"] --> V2["V2: commit log on disk<br/>replay, fan-out"]
    V2 --> V3["V3: + partitions<br/>parallel throughput"]
    V3 --> V4["V4: + ISR replication<br/>fault tolerance"]
    V4 --> V5["V5: + acks + HW<br/>durability dial"]
    V5 --> V6["V6: + consumer groups<br/>independent offsets"]
    V6 --> V7["V7: + idempotent producer<br/>+ transactions EOS"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

Deep dive: the commit log on disk

This is where Kafka's performance comes from. Understanding it is the difference between knowing Kafka facts and actually understanding the system.

Segment files and the index

Each partition is stored as a sequence of segment files, not one giant file. A segment is a .log file (the actual message data) paired with two index files:

.index — maps offsets to byte positions within the .log file. Sparse, not every offset.
.timeindex — maps timestamps to offsets, for time-based seek.

partition-0/
  00000000000000000000.log         ← segment starting at offset 0
  00000000000000000000.index
  00000000000000000000.timeindex
  00000000000000001000.log         ← segment starting at offset 1000
  00000000000000001000.index
  00000000000000001000.timeindex
  00000000000000002000.log         ← active (current) segment, starting at offset 2000
  00000000000000002000.index
  00000000000000002000.timeindex

Segments are rolled — closed and a new one opened — when they hit a size limit (e.g. 1 GB) or a time limit (e.g. 7 days). Old segments are then eligible for deletion under the retention policy.

flowchart LR
    ACT["Active segment<br/>(append-only writes)"]
    SEG1["Closed segment<br/>offset 0..999"]
    SEG2["Closed segment<br/>offset 1000..1999"]
    IDX1[".index + .timeindex"]
    IDX2[".index + .timeindex"]

    SEG1 --- IDX1
    SEG2 --- IDX2
    SEG2 --> ACT
    ACT -->|"roll at 1 GB or 7 days"| NEW["New segment"]

    style ACT fill:#ff6b1a,color:#fff
    style SEG1 fill:#0e7490,color:#fff
    style SEG2 fill:#0e7490,color:#fff
    style IDX1 fill:#ffaa00,color:#0a0a0f
    style IDX2 fill:#ffaa00,color:#0a0a0f

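To look up offset N: pick the segment whose base offset (its file name) is the largest one ≤ N, binary search that segment's sparse .index for the nearest entry at or below N, then scan forward in the .log file from that byte position. The scan covers at most one index interval (a few KB by default), and both reads are likely already in the OS page cache.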
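
The lookup can be sketched with two binary searches. This is a simplified in-memory model: the `segments` dictionary and the `locate` helper are hypothetical, and the real index file stores offsets relative to the segment's base offset rather than absolute ones.

```python
import bisect

def floor_index(sorted_keys: list[int], target: int) -> int:
    """Index of the largest key <= target."""
    i = bisect.bisect_right(sorted_keys, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} is below the first entry")
    return i

def locate(segments: dict[int, list[tuple[int, int]]], target_offset: int) -> tuple[int, int]:
    """Return (segment base offset, byte position to start scanning from)."""
    bases = sorted(segments)
    base = bases[floor_index(bases, target_offset)]        # pick the segment
    index = segments[base]                                 # sparse (offset, byte_pos) entries
    entry = index[floor_index([off for off, _ in index], target_offset)]
    return base, entry[1]

segments = {
    0:    [(0, 0), (250, 4096)],
    1000: [(1000, 0), (1240, 4096), (1490, 8192)],
}
print(locate(segments, 1500))   # scan segment 1000 starting at byte 8192
```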

Why sequential I/O is fast

Random writes to disk require the disk head to physically seek (HDD) or the SSD controller to manage write amplification across pages. Sequential writes avoid both. Writing to the end of a commit log is always sequential — and modern operating systems buffer these writes through the page cache, so the broker is often writing to memory first, with the OS flushing to disk asynchronously.

Reads are similar: a consumer reading sequentially forward through a segment is almost always served from the OS page cache (the data was just written). Kafka does not maintain its own in-heap cache — it delegates entirely to the OS page cache, which is why Kafka brokers benefit from large amounts of RAM even though they're "just" writing to disk.

Zero-copy transfer (sendfile)

When sending a segment to a consumer, the naive path is:

Read bytes from disk into kernel buffer.
Copy kernel buffer into userspace (JVM heap in Kafka's case).
Copy userspace buffer back to kernel socket buffer.
Send over the network.

Two unnecessary copies. Kafka uses the sendfile(2) system call (via Java NIO's FileChannel.transferTo()), which sends data directly from the kernel file buffer to the socket buffer — zero bytes enter userspace. This is the mechanism that makes high egress throughput (many consumer groups reading simultaneously) practical without saturating the broker's CPU or heap.
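
The same kernel facility is reachable from other languages. As an illustration only (Kafka itself uses the Java call named above), Python's `socket.sendfile()` uses `os.sendfile` where the platform supports it and falls back to ordinary sends otherwise; `send_segment` is a hypothetical helper.

```python
import socket
from typing import Optional

def send_segment(sock: socket.socket, path: str, offset: int = 0,
                 count: Optional[int] = None) -> int:
    """Stream (part of) a file to a connected socket; returns bytes sent.

    The file contents are never read into a Python bytes object when the
    platform supports sendfile.
    """
    with open(path, "rb") as segment:
        return sock.sendfile(segment, offset=offset, count=count)
```

The `offset` and `count` arguments mirror how a fetch for a given offset range maps to a byte range inside one segment file.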

One important caveat: zero-copy only works when the connection is unencrypted (PLAINTEXT). When TLS/SSL is enabled, Kafka must copy data into userspace for encryption before passing it to the socket — the sendfile path is bypassed entirely. Production clusters using TLS therefore pay a higher per-byte CPU cost for egress.

Deep dive: replication and leader election

sequenceDiagram
    participant P as Producer
    participant L as Leader (Broker 1)
    participant F1 as Follower (Broker 2)
    participant F2 as Follower (Broker 3)

    P->>L: Produce(offset 42, acks=all)
    L->>L: Write to local log
    F1->>L: Fetch(from offset 42)
    F2->>L: Fetch(from offset 42)
    L-->>F1: Record batch (offset 42)
    L-->>F2: Record batch (offset 42)
    F1->>L: Fetch(from offset 43) ← acknowledges receipt
    F2->>L: Fetch(from offset 43) ← acknowledges receipt
    L->>L: Advance HW to 43 (exclusive: all msgs < 43 are committed)
    L-->>P: Ack (offset 42 committed)
    P-->>P: Produce success

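Followers pull from the leader using the same Fetch API that consumers use. A follower's Fetch request for offset N tells the leader that it already holds everything below N. The leader sets the high-watermark to the smallest such position across the ISR, so once every ISR member has fetched up to N, the HW advances to N.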
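
In other words, the high-watermark is the minimum log end offset across the ISR. A minimal sketch, with hypothetical replica names and a hypothetical `high_watermark` helper:

```python
def high_watermark(log_end_offsets: dict[str, int], isr: set[str]) -> int:
    """HW = smallest log end offset (next offset to be written) among ISR members."""
    return min(log_end_offsets[replica] for replica in isr)

# Leader has appended offset 42, so its log end offset is 43.
leo = {"broker1": 43, "broker2": 43, "broker3": 42}

print(high_watermark(leo, {"broker1", "broker2", "broker3"}))  # broker3 still lacks offset 42
leo["broker3"] = 43                                            # broker3 fetches offset 42
print(high_watermark(leo, {"broker1", "broker2", "broker3"}))
```

Removing a slow follower from the ISR has the same effect as that follower catching up: the minimum is taken over fewer replicas, so the HW can move forward.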

ISR shrink and unclean leader election

If a follower falls behind (replica.lag.time.max.ms exceeded), it is removed from the ISR. If the leader fails and the ISR now has only {follower_2}, that follower becomes the new leader with no data loss.

The dangerous case: the ISR shrinks to only the leader, then the leader crashes before any replica has caught up. Every remaining replica is now out-of-sync (OSR — out-of-sync replicas).

With unclean leader election enabled (unclean.leader.election.enable=true), an OSR replica can become leader. Availability returns — but that replica is missing some messages that were acknowledged to producers, and those messages are permanently lost.

With unclean leader election disabled (false, the safer default), no OSR replica can become leader. The partition stays offline until an ISR member comes back online. Durability is preserved, but at the cost of availability.

This is the canonical availability vs. durability trade-off in Kafka. Know it cold for any interview.

Deep dive: consumer groups and rebalancing

When a consumer joins or leaves a group, the broker's group coordinator triggers a rebalance: partition ownership is re-assigned across the surviving consumers. During a rebalance, all consumers in the group pause consumption — this is the stop-the-world window, and it's the most common source of production incidents with Kafka consumers.

The typical causes: processing that takes longer than max.poll.interval.ms between poll() calls (the consumer is treated as failed and leaves the group), missed heartbeats beyond session.timeout.ms (long GC pauses, network stalls), frequent restarts during rolling deploys without static membership or a tuned session.timeout.ms, or adding/removing consumer instances too aggressively.

Two mitigations make a real difference in practice. Cooperative incremental rebalancing (Kafka 2.4+) only revokes the minimum set of partitions necessary rather than pausing everything at once. Static group membership (group.instance.id) lets a consumer that rejoins within session.timeout.ms reclaim its old partitions without triggering a full rebalance. Tuning max.poll.interval.ms to match your actual processing time and session.timeout.ms to balance responsiveness against false positives rounds out the picture.
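
A hedged configuration sketch pulls these settings together. The keys are Java consumer property names held in a Python dict purely for illustration; the group name, instance ID, and every numeric value are assumptions to be replaced with measured numbers, and `fits_poll_interval` is a hypothetical sanity check.

```python
# Java consumer property names; values are illustrative, not recommendations.
consumer_config = {
    "group.id": "orders-processor",                  # hypothetical group name
    "group.instance.id": "orders-processor-0",       # static membership: stable per instance
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 15_000,
    "max.poll.interval.ms": 300_000,
    "max.poll.records": 100,
}

def fits_poll_interval(config: dict, per_record_ms: int) -> bool:
    """True if a full batch can be processed before max.poll.interval.ms expires."""
    worst_case_batch_ms = config["max.poll.records"] * per_record_ms
    return worst_case_batch_ms < config["max.poll.interval.ms"]

print(fits_poll_interval(consumer_config, per_record_ms=500))    # 50 s per batch
print(fits_poll_interval(consumer_config, per_record_ms=5_000))  # 500 s per batch
```

When the check fails, the usual levers are a smaller max.poll.records, a larger max.poll.interval.ms, or moving slow work off the polling thread.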

Deep dive: retention and log compaction

Time/size-based retention

The simple model: delete segment files older than retention.ms (default: 7 days) or when total partition size exceeds retention.bytes. When a segment is deleted, all offsets in that segment are gone.

A consumer that has fallen behind past the retention window will find its next offset no longer exists and must restart from the earliest available offset (or the latest, if it can afford to skip); auto.offset.reset controls which of these happens automatically. This is a real production failure mode: a consumer group that is consistently slower than the ingest rate will eventually fall behind the retention window and lose data. The fix is to monitor consumer lag and alert well before it approaches the retention window.

flowchart LR
    RET["Retention window<br/>(e.g. 7 days)"]
    SAFE["Consumer offset<br/>inside window — OK"]
    DANGER["Consumer offset<br/>behind window — data gone"]
    LOG["Log: oldest ←————————→ newest"]

    LOG --> RET
    RET --> SAFE
    RET -.-> DANGER

    style SAFE fill:#15803d,color:#fff
    style DANGER fill:#ff2e88,color:#fff
    style LOG fill:#0e7490,color:#fff
    style RET fill:#ffaa00,color:#0a0a0f
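
The two states in the diagram can be told apart with three numbers per partition: the group's committed offset, the log start offset, and the log end offset. A minimal sketch, where `check_consumer` and its return shape are hypothetical:

```python
def check_consumer(committed: int, log_start: int, log_end: int) -> dict:
    """Classify a consumer's position relative to the retained log."""
    if committed < log_start:
        # Offsets in [committed, log_start) were deleted before being read.
        return {"status": "out_of_range",
                "lost": log_start - committed,
                "lag": log_end - log_start}
    return {"status": "ok", "lost": 0, "lag": log_end - committed}

print(check_consumer(committed=9_500, log_start=2_000, log_end=10_000))
print(check_consumer(committed=1_200, log_start=2_000, log_end=10_000))
```

Alerting on lag growth over time, rather than on a fixed threshold alone, gives warning before the committed offset falls below the log start.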

Log compaction

For use cases where you care about the latest value per key — not every historical event — log compaction is a better retention strategy. Kafka runs a background compaction thread that rewrites segments, keeping only the most recent message per key.

Before compaction:           After compaction:
offset 0: key=A, val=1       offset 1: key=B, val=x   ← only one B
offset 1: key=B, val=x       offset 2: key=A, val=3   ← latest A
offset 2: key=A, val=3

A message with a null value is a tombstone — it signals that the key has been deleted. The compaction thread retains tombstones for a configurable period before removing them.
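
The effect of compaction can be modelled in a few lines. This is a sketch of the outcome, not of the cleaner's implementation: surviving records keep their original offsets and relative order, and the `drop_tombstones` flag is a hypothetical stand-in for the tombstone retention period having elapsed (the topic setting delete.retention.ms).

```python
from typing import Optional

Record = tuple[int, str, Optional[str]]   # (offset, key, value); value None = tombstone

def compact(log: list[Record], drop_tombstones: bool = False) -> list[Record]:
    """Keep only the latest record per key, preserving offsets and order."""
    latest_offset = {key: offset for offset, key, _ in log}   # later entries overwrite
    kept = [rec for rec in log if latest_offset[rec[1]] == rec[0]]
    if drop_tombstones:
        kept = [rec for rec in kept if rec[2] is not None]
    return kept

log = [(0, "A", "1"), (1, "B", "x"), (2, "A", "3"), (3, "B", None)]
print(compact(log[:3]))                      # the example above
print(compact(log))                          # B's tombstone is now its latest record
print(compact(log, drop_tombstones=True))    # after the tombstone retention period
```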

Reach for compaction when a Kafka topic acts as an event-sourced state store: database changelog topics, configuration tables, materialized views in Kafka Streams. The topic becomes a compact snapshot of current state, queryable by key via log replay. For event streams where every event matters — click events, financial transactions — use time/size retention instead.

Metadata and coordination

ZooKeeper era

Historically, Kafka used ZooKeeper for storing broker membership and topic/partition metadata, electing the cluster controller (one broker that handles leader elections, ISR updates, and topic configuration), and (in very early Kafka) storing consumer group offsets — moved to __consumer_offsets in Kafka 0.9.

ZooKeeper worked but introduced operational complexity: two separate systems to monitor, size, and upgrade; ZooKeeper's throughput also bounded how fast metadata could change during large topology changes.

KRaft (Kafka without ZooKeeper)

Kafka 3.3 declared KRaft production-ready for new clusters (KIP-833), but with meaningful caveats at the time (no SCRAM for controllers, no JBOD, no ZooKeeper-to-KRaft migration path). Kafka 3.6 added SCRAM and made ZooKeeper-to-KRaft migration production-ready; JBOD in KRaft arrived as early access in 3.7. ZooKeeper support was removed entirely in Kafka 4.0.

KRaft replaces ZooKeeper with a Raft-based quorum of controller nodes that are Kafka processes themselves rather than a separate system. The cluster metadata is stored as a replicated log, in the same style as Kafka's own data. Benefits:

Simpler operations: one system.
Faster controller failover (seconds instead of tens of seconds in large clusters).
Metadata scalability: Raft log can handle hundreds of thousands of partitions more gracefully.

The key point for an interview: ZooKeeper was the coordination layer, and KRaft replaces it with a Raft quorum built into Kafka itself. Stick to that level of detail rather than guessing at KRaft internals.

Tiered storage

Storing 7 days of multi-PB data on local broker NVMe SSDs is expensive. Tiered storage (available in Confluent Cloud; added to Apache Kafka as an early-access feature in 3.6, GA in Apache Kafka 3.9.0) moves older segments to object storage (S3/GCS/ADLS) while keeping hot/recent segments on local disk.

Consumers transparently read from either local or remote storage — the fetch API is unchanged. The broker acts as a cache/proxy for remote fetches. This decouples retention duration (now limited only by object storage cost) from broker disk capacity, and allows brokers to be sized for throughput rather than retention.

Kafka vs. traditional brokers (RabbitMQ)

Dimension	Kafka (commit log)	RabbitMQ (message broker)
Consumer model	Pull	Push
Message ordering	Per-partition	Per-queue (with caveats)
Replay	Yes — any offset in retention	No — message deleted after ack
Throughput	Very high (GB/s, sequential I/O)	Moderate (random I/O per message ack)
Consumer groups / fan-out	Native — independent offsets	Exchange + routing patterns
Per-message ack	No — offset commit (bulk)	Yes — per-message
Smart broker vs. smart consumer	Dumb broker, smart consumer	Smart broker (routing, TTL, dead-letter)
Retention	Configurable — messages outlive consumers	Message deleted on ack (or TTL)
Best fit	Event streaming, audit logs, replayable pipelines	Task queues, RPC-style work distribution

The key conceptual difference: RabbitMQ is push-based — the broker pushes messages to consumers and tracks per-message delivery state. Kafka is pull-based — consumers pull at their own pace and the broker is largely stateless with respect to what consumers have processed. This makes Kafka naturally backpressure-resistant (a slow consumer just falls further behind but doesn't affect the broker or other consumers) but shifts responsibility for lag monitoring to the consumer team.

Full architecture

flowchart TD
    PG["Producer Group<br/>Key-hashed to partition"] --> LB1["Broker 1<br/>(Leader: P0, P3)"]
    PG --> LB2["Broker 2<br/>(Leader: P1, P4)"]
    PG --> LB3["Broker 3<br/>(Leader: P2, P5)"]

    LB1 <-->|"ISR replication"| LB2
    LB2 <-->|"ISR replication"| LB3
    LB1 <-->|"ISR replication"| LB3

    LB1 & LB2 & LB3 <--> CTRL["Controller<br/>(KRaft quorum)"]

    LB1 --> CGA["Consumer Group A<br/>(3 consumers, 1 per partition)"]
    LB2 --> CGA
    LB3 --> CGA

    LB1 --> CGB["Consumer Group B<br/>(1 consumer, all partitions)"]
    LB2 --> CGB
    LB3 --> CGB

    CGA & CGB --> OCT["__consumer_offsets topic<br/>(offset commits)"]

    style LB1 fill:#ff6b1a,color:#fff
    style LB2 fill:#ff6b1a,color:#fff
    style LB3 fill:#ff6b1a,color:#fff
    style CTRL fill:#a855f7,color:#fff
    style CGA fill:#15803d,color:#fff
    style CGB fill:#0e7490,color:#fff

Failure modes

Failure	Behavior	Recovery
Broker crash (not leader)	Follower removed from ISR; leader continues serving	Broker restarts, re-joins ISR by catching up
Leader crash (ISR has other members)	Controller elects a new ISR member as leader; HW may rewind slightly	New leader serves; consumers retry automatically
Leader crash (ISR = leader only)	Partition goes offline (unclean=false) or data loss (unclean=true)	Wait for leader to return, or accept data loss
Consumer group rebalance	Consumption pauses for all consumers in the group	Cooperative rebalancing + static membership minimize pause
Consumer falls behind retention	Consumer offset is below log start offset; data gone	Reset to earliest/latest; accept data loss or catch up via replay elsewhere
Hot partition	One partition receives disproportionate writes	Re-key by more granular key, or use synthetic key randomization (`key + random suffix`)
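Network partition (split-brain risk)	A leader cut off from its followers shrinks the ISR; once it drops below min.insync.replicas, acks=all writes are rejected	Wait for connectivity to return and followers to rejoin the ISR; Kafka prefers CP (consistency/durability) over availability by default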

Storage choices across the system

Component	Store	Why
Message data	Local disk (NVMe/HDD), segmented log files	Sequential I/O, OS page cache, sendfile
Old segments	Object storage (S3) with tiered storage	Cheap, infinite retention; slower access
Consumer offsets	`__consumer_offsets` Kafka topic (compacted)	Uses the same replication guarantees; decentralized
Cluster metadata	KRaft log (Raft-based) or ZooKeeper	Strong consistency for topology changes
Schema registry	Separate service (Confluent Schema Registry or equivalent)	Evolve message schemas without breaking consumers

Things to discuss in an interview

Partition count: more partitions = more parallelism + more overhead (leader elections, file handles, rebalance time). Rule of thumb: start at 3–10× the expected consumer count, but benchmark before assuming.
Replication factor vs. min.insync.replicas: RF=3, min.insync.replicas=2 is the standard production setting. Explain why: it tolerates one broker failure while still accepting acks=all writes, and every acknowledged write exists on at least two replicas.
Pull vs push trade-off: explicitly frame consumer lag as the cost of pull. Then explain why that cost is worth it (independent pace, natural backpressure, replay).
Hot partitions: a single high-volume key can saturate one broker. The fix is key diversification — adding a random suffix to the partition key and handling aggregation downstream.
Exactly-once in context: be precise. EOS in Kafka means retried produces are de-duplicated and transactional writes commit atomically inside Kafka. It does not mean the downstream database write succeeds exactly once; that's on the consumer.
Log compaction vs. time retention: knowing which to reach for is a signal you've thought about real use cases.
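
For the hot-partition point, key salting can be sketched as follows. The `#` separator, the bucket count, and both helper names are assumptions; note that spreading one key over several partitions gives up strict ordering for that key, which is the price of the fix.

```python
import random

def salted_key(key: str, buckets: int, rng: random.Random) -> str:
    """Spread one hot key across up to `buckets` partitions by adding a suffix."""
    return f"{key}#{rng.randrange(buckets)}"

def original_key(salted: str) -> str:
    """Strip the suffix so downstream aggregation can regroup by the real key."""
    return salted.rsplit("#", 1)[0]

rng = random.Random(0)
keys = {salted_key("celebrity-123", 8, rng) for _ in range(1_000)}
print(len(keys))                               # at most 8 distinct partition keys
print({original_key(k) for k in keys})         # all map back to one logical key
```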

Things you should now be able to answer

Why does Kafka guarantee ordering only per partition, not globally?
What is the high-watermark, and why can consumers not read past it?
What happens if you set unclean.leader.election.enable=true?
How does the idempotent producer eliminate duplicate writes on retry?
What is log compaction and when does it beat time-based retention?
Why is pull-based consumption naturally backpressure-resistant?
What is consumer lag and how do you detect that a consumer has fallen off the retention window?
How does KRaft differ from ZooKeeper in terms of what it stores and why that matters for operations?

Frequently asked questions

Q: What is the In-Sync Replica (ISR) set in Kafka and why does it matter?

The ISR is the set of replicas that are within replica.lag.time.max.ms of the leader. The high-watermark (HW) is the exclusive upper bound of committed offsets: consumers can fetch up to but not including the HW offset, so they never see uncommitted data. The HW advances only when all ISR members have confirmed a write. With acks=all plus min.insync.replicas=2, no acknowledged message is lost even if one broker crashes.

Q: What happens when unclean leader election is enabled versus disabled?

With unclean.leader.election.enable=true, an out-of-sync replica can become the new leader when the leader crashes and no other in-sync replica is available, restoring availability but permanently losing any messages the replica had not yet replicated. With the setting disabled (the safer default), the partition stays offline until an ISR member comes back, preserving durability at the cost of availability.

Q: How does Kafka's exactly-once semantics work, and what does it not cover?

Kafka achieves exactly-once in two layers: the idempotent producer uses a Producer ID and per-partition sequence number so the broker can detect and discard retried duplicates, and transactions let a producer atomically write across multiple partitions, with consumers using isolation.level=read_committed to avoid partial reads. Critically, this guarantee is scoped to produce-through-consume within Kafka. If the consumer then writes to an external database and that write fails after the offset commit, end-to-end exactly-once requires idempotent consumers or external deduplication.

Q: When should you use log compaction instead of time-based retention?

Use log compaction when you care only about the latest value per key (for example, database changelog topics, configuration tables, or materialized views in Kafka Streams), because the compaction thread rewrites segments to keep only the most recent message per key. Use time- or size-based retention when every event matters in sequence, such as click events or financial transactions.

Q: Why is network, not disk, the primary bottleneck in a large Kafka cluster?

At 1M messages/sec with 1 KB average message size, ingest is 1 GB/s, but with 100 consumer groups each reading at that rate, aggregate egress reaches 100 GB/s (roughly 800 Gbps). Spread over 30 brokers (about 33 partitions each), that is ~3.3 GB/s or ~26 Gbps per broker, so each broker needs at least a 40 Gbps NIC; a 25 Gbps NIC is insufficient. Modern NVMe drives sustain 3-7 GB/s sequential writes, so disk throughput is not the bottleneck; fan-out to many consumer groups is.
