Design a Distributed Message Queue (Kafka)
Build a durable, partitioned, replicated commit log like Kafka — ordering, consumer groups, replication (ISR), and exactly-once.
The problem
Kafka is one of the most consequential pieces of infrastructure built in the last 15 years. LinkedIn built the original version in 2010 because its activity pipeline — page views, job clicks, search queries — was drowning in ad-hoc point-to-point integrations. Every service that produced data had a custom connection to every service that consumed it. Adding one new consumer meant touching every producer. The solution was a shared, durable log that producers write to once and consumers read from independently — at their own pace, on their own schedule, without coordinating with each other.
Today that same pattern powers event streaming at companies like Netflix (user play events), Uber (location updates and trip state), and Confluent (which commercializes Kafka itself). The product analogy that maps most directly is a durable, replayable append-only log: think of it as a commit log on disk that any number of readers can walk through, each holding their own bookmark, and where the data stays available for days or weeks — not just until the first reader picks it up.
What makes this genuinely hard is the collision of four properties that each pull in different directions. You want high throughput — millions of messages per second — which demands sequential disk I/O and aggressive batching. You want durability — no acknowledged message ever disappears — which requires replication across multiple brokers. You want strict ordering within a stream, which forces you to serialize writes to a single partition. And you want flexible, independent consumption — multiple teams replaying the same data at different speeds — which rules out any design where the broker tracks per-message delivery state. Getting all four at the same time, at PB scale, is the design challenge this article unpacks.
This article builds the system from scratch, then deep-dives each component with the precision an interview panel at a distributed-systems company expects.
Functional requirements
produce(topic, key, value)— publish a message to a named topic.consume(topic, group_id, from_offset)— read messages from a topic, resuming from a committed offset.- Messages are durable: survive broker restarts and tolerate
Fbroker failures with replication factorRF = F + 1. - Multiple independent consumer groups each see all messages (fan-out without duplication).
- Replay: a consumer group can reset to any past offset within the retention window.
Non-functional requirements
- Throughput: millions of messages/sec, GB/s sustained ingest and egress.
- Latency: p99 end-to-end < 100 ms with acks=1; < 500 ms with acks=all under moderate load.
- Durability: with acks=all and
min.insync.replicas=2, no acknowledged message is lost even on a single-broker crash. - Ordering: strict within a partition; no global ordering guarantee across partitions.
- Scalability: add partitions to scale throughput; add brokers to increase capacity linearly.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Ingest rate | 1 GB/s | 1M msg/s × 1 KB avg |
| Write per broker | ~33 MB/s | 1 GB/s ÷ 30 brokers; with 100 consumer groups, egress fan-out is 100:1 — network, not disk, is the bottleneck |
| Disk bottleneck | Network, not disk | Modern NVMe: 3–7 GB/s sequential write — disk is NOT the bottleneck |
| Raw retention | ~600 TB | 7 days × 86,400 s/day × 1 GB/s = 604,800 GB ≈ 600 TB |
| Total physical storage | 1.8 PB | 600 TB × replication factor 3 |
| Partitions | 1,000 total; 33/broker | 1,000 partitions across 30 brokers |
| Data per partition | ~600 GB retained | 600 TB ÷ 1,000 partitions |
| Hot segment data per broker | ~33 GB | Active segment 1 GB/partition × 33 partitions/broker |
| Consumer groups | 100 | At most 1 consumer per partition per group |
| Aggregate egress | 100 GB/s (≈ 800 Gbps) | 100 groups × 1 GB/s |
| Egress per broker | ~3.3 GB/s (~26 Gbps) | 100 groups × 33 MB/s local leader traffic; each broker needs ≥ 40 Gbps NIC (25 Gbps is insufficient); zero-copy sendfile makes this viable when not using TLS |
Takeaway: network is the primary bottleneck, not disk throughput. NVMe is fast enough that sequential writes aren't the limiter — fan-out egress to many consumer groups is.
Building up to the design
V1: A single queue, single box
The simplest mental model: one process, one in-memory FIFO queue. Producers enqueue; consumers dequeue. This works at less than 10k msg/sec for a single producer/consumer pair, but restart loses everything, a second consumer group sees nothing, replay is impossible, and throughput is bounded by RAM bandwidth. You need to persist the data.
V2: Persist to disk — the commit log insight
Write each message to an append-only log file. Assign each message a monotonically increasing offset (essentially a file position). Consumers track their own read position.
offset 0: {key="user:42", value="{event: login}"}
offset 1: {key="user:17", value="{event: purchase}"}
offset 2: {key="user:42", value="{event: logout}"}
Now you have durability, replay (seek to offset 0 and re-read), and true fan-out — two consumers can independently hold different offsets in the same file. But one file means one writer and one reader. At 1 GB/s ingest, a single log file on one disk is your ceiling.
V3: Partitions — the unit of parallelism
Split each topic into N partitions. Each partition is an independent commit log with its own offset counter, its own disk file, and its own set of consumer assignments.
Topic: "user-events"
Partition 0 → broker 1 (leader)
Partition 1 → broker 2 (leader)
Partition 2 → broker 3 (leader)
Producers route messages to partitions by key hash (hash(key) % N — same key always lands on the same partition, preserving per-key ordering) or round-robin when keyless throughput matters more than ordering. Throughput now scales linearly with partition count. The new problem: partitions live on one broker, so a broker crash takes down every partition it leads. You need replication.
V4: Replication — ISR and the high-watermark
Each partition has one leader and RF-1 follower replicas on other brokers. Followers pull data from the leader and stay caught up.
The In-Sync Replica (ISR) set is the set of replicas within replica.lag.time.max.ms of the leader. If a follower falls behind, it is removed from the ISR. The high-watermark (HW) is the highest offset that has been replicated to all ISR members — consumers can only read up to the HW, so they never see uncommitted data.
With replication in place, a broker crash triggers leader election within the ISR with no data loss for committed messages. What remains unclear is: from the producer's perspective, what exactly counts as "committed"?
V5: Producer acks — the durability dial
acks=0 → fire-and-forget. No guarantee. Fastest.
acks=1 → leader wrote to its local log (OS page cache; may not be flushed
to disk yet). Lost if leader crashes before any follower replicates.
acks=all → all ISR members confirmed. Maximum durability.
Latency includes the follower round-trip.
Combined with min.insync.replicas=2: a write is only acknowledged when at least 2 replicas (including the leader) confirm it. If the ISR shrinks below 2 — say, only the leader remains — the broker rejects the produce with NotEnoughReplicasException rather than accepting a write that can't be durably replicated.
V6: Consumer groups and offset management
flowchart LR
P0[Partition 0] --> C1[Consumer 1<br/>Group A]
P1[Partition 1] --> C2[Consumer 2<br/>Group A]
P2[Partition 2] --> C3[Consumer 3<br/>Group A]
P0 --> C4[Consumer 1<br/>Group B]
P1 --> C4
P2 --> C4
style C1 fill:#ff6b1a,color:#fff
style C2 fill:#ff6b1a,color:#fff
style C3 fill:#ff6b1a,color:#fff
style C4 fill:#15803d,color:#fff
Each partition is consumed by at most one consumer per group at a time — that's the invariant that makes processing safe and ordered. A consumer group with more consumers than partitions leaves some consumers idle. Multiple consumer groups are fully independent: Group A and Group B each maintain their own offsets and see all messages.
Consumers commit their offsets back to a special internal topic — __consumer_offsets. This decouples offset tracking from the broker's data log and lets the broker stay stateless with respect to consumer position.
V7: Exactly-once semantics
"Exactly-once" is precise language. It means each message is processed and its effects produced exactly once — not at-most-once and not at-least-once. In Kafka's model, it has two layers.
The idempotent producer eliminates at-least-once duplicates on the produce path: the broker assigns each producer a Producer ID (PID) and tracks the last sequence number received per PID per partition. When a producer retries after a network error, the broker detects the duplicate sequence number and discards it — the message is written exactly once to the log.
Transactions handle atomic multi-partition writes: a producer can open a transaction, write to multiple partitions (or topics), then commit atomically. Consumers with isolation.level=read_committed only see messages from committed transactions and never read partial writes. Under the hood, a transaction coordinator (a special broker role) uses a two-phase commit protocol with a durable transaction log.
One caveat worth saying clearly: Kafka's exactly-once is scoped to produce → broker → consume within the Kafka system. If your consumer writes to an external database and that write fails after the offset commit, you don't automatically get end-to-end exactly-once. That requires idempotent consumers or an external deduplication mechanism.
flowchart LR
V1["V1: in-memory FIFO<br/>no durability"] --> V2["V2: commit log on disk<br/>replay, fan-out"]
V2 --> V3["V3: + partitions<br/>parallel throughput"]
V3 --> V4["V4: + ISR replication<br/>fault tolerance"]
V4 --> V5["V5: + acks + HW<br/>durability dial"]
V5 --> V6["V6: + consumer groups<br/>independent offsets"]
V6 --> V7["V7: + idempotent producer<br/>+ transactions EOS"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V7 fill:#a855f7,color:#fff
Deep dive: the commit log on disk
This is where Kafka's performance comes from. Understanding it is the difference between knowing Kafka facts and actually understanding the system.
Segment files and the index
Each partition is stored as a sequence of segment files, not one giant file. A segment is a .log file (the actual message data) paired with two index files:
.index— maps offsets to byte positions within the.logfile. Sparse, not every offset..timeindex— maps timestamps to offsets, for time-based seek.
partition-0/
00000000000000000000.log ← segment starting at offset 0
00000000000000000000.index
00000000000000000000.timeindex
00000000000000001000.log ← segment starting at offset 1000
00000000000000001000.index
00000000000001000000.log ← active (current) segment
Segments are rolled — closed and a new one opened — when they hit a size limit (e.g. 1 GB) or a time limit (e.g. 7 days). Old segments are then eligible for deletion under the retention policy.
flowchart LR
ACT["Active segment<br/>(append-only writes)"]
SEG1["Closed segment<br/>offset 0..999"]
SEG2["Closed segment<br/>offset 1000..1999"]
IDX1[".index + .timeindex"]
IDX2[".index + .timeindex"]
SEG1 --- IDX1
SEG2 --- IDX2
SEG2 --> ACT
ACT -->|"roll at 1 GB or 7 days"| NEW["New segment"]
style ACT fill:#ff6b1a,color:#fff
style SEG1 fill:#0e7490,color:#fff
style SEG2 fill:#0e7490,color:#fff
style IDX1 fill:#ffaa00,color:#0a0a0f
style IDX2 fill:#ffaa00,color:#0a0a0f
To look up offset N: binary search the sparse .index to find the nearest stored offset, then scan forward a few bytes in the .log file. Two reads, both likely already in the OS page cache.
Why sequential I/O is fast
Random writes to disk require the disk head to physically seek (HDD) or the SSD controller to manage write amplification across pages. Sequential writes avoid both. Writing to the end of a commit log is always sequential — and modern operating systems buffer these writes through the page cache, so the broker is often writing to memory first, with the OS flushing to disk asynchronously.
Reads are similar: a consumer reading sequentially forward through a segment is almost always served from the OS page cache (the data was just written). Kafka does not maintain its own in-heap cache — it delegates entirely to the OS page cache, which is why Kafka brokers benefit from large amounts of RAM even though they're "just" writing to disk.
Zero-copy transfer (sendfile)
When sending a segment to a consumer, the naive path is:
- Read bytes from disk into kernel buffer.
- Copy kernel buffer into userspace (JVM heap in Kafka's case).
- Copy userspace buffer back to kernel socket buffer.
- Send over the network.
Two unnecessary copies. Kafka uses the sendfile(2) system call (via Java NIO's FileChannel.transferTo()), which sends data directly from the kernel file buffer to the socket buffer — zero bytes enter userspace. This is the mechanism that makes high egress throughput (many consumer groups reading simultaneously) practical without saturating the broker's CPU or heap.
One important caveat: zero-copy only works when the connection is unencrypted (PLAINTEXT). When TLS/SSL is enabled, Kafka must copy data into userspace for encryption before passing it to the socket — the sendfile path is bypassed entirely. Production clusters using TLS therefore pay a higher per-byte CPU cost for egress.
Deep dive: replication and leader election
sequenceDiagram
participant P as Producer
participant L as Leader (Broker 1)
participant F1 as Follower (Broker 2)
participant F2 as Follower (Broker 3)
P->>L: Produce(offset 42, acks=all)
L->>L: Write to local log
F1->>L: Fetch(from offset 42)
F2->>L: Fetch(from offset 42)
L-->>F1: Record batch (offset 42)
L-->>F2: Record batch (offset 42)
F1->>L: Fetch(from offset 43) ← acknowledges receipt
F2->>L: Fetch(from offset 43) ← acknowledges receipt
L->>L: Advance HW to 43 (exclusive: all msgs < 43 are committed)
L-->>P: Ack (offset 42 committed)
P-->>P: Produce success
Followers pull from the leader using the same Fetch API that consumers use. When all ISR members have fetched past offset N, the leader advances the high-watermark to N.
ISR shrink and unclean leader election
If a follower falls behind (replica.lag.time.max.ms exceeded), it is removed from the ISR. If the leader fails and the ISR now has only {follower_2}, that follower becomes the new leader with no data loss.
The dangerous case: the ISR shrinks to only the leader, then the leader crashes before any replica has caught up. Every remaining replica is now out-of-sync (OSR — out-of-sync replicas).
With unclean leader election enabled (unclean.leader.election.enable=true), an OSR replica can become leader. Availability returns — but that replica is missing some messages that were acknowledged to producers, and those messages are permanently lost.
With unclean leader election disabled (false, the safer default), no OSR replica can become leader. The partition stays offline until an ISR member comes back online. Durability is preserved, but at the cost of availability.
This is the canonical availability vs. durability trade-off in Kafka. Know it cold for any interview.
Deep dive: consumer groups and rebalancing
When a consumer joins or leaves a group, the broker's group coordinator triggers a rebalance: partition ownership is re-assigned across the surviving consumers. During a rebalance, all consumers in the group pause consumption — this is the stop-the-world window, and it's the most common source of production incidents with Kafka consumers.
The typical causes: consumers not calling poll() fast enough (heartbeat timeout), long processing per message that makes the consumer appear dead to the coordinator, frequent restarts during rolling deploys without tuning session.timeout.ms, or adding/removing consumer instances too aggressively.
Two mitigations make a real difference in practice. Cooperative incremental rebalancing (Kafka 2.4+) only revokes the minimum set of partitions necessary rather than pausing everything at once. Static group membership (group.instance.id) lets a consumer that rejoins within session.timeout.ms reclaim its old partitions without triggering a full rebalance. Tuning max.poll.interval.ms to match your actual processing time and session.timeout.ms to balance responsiveness against false positives rounds out the picture.
Deep dive: retention and log compaction
Time/size-based retention
The simple model: delete segment files older than retention.ms (default: 7 days) or when total partition size exceeds retention.bytes. When a segment is deleted, all offsets in that segment are gone.
A consumer that has fallen behind past the retention window will find its next offset no longer exists and must restart from the earliest available offset (or the latest, if it can afford to skip). This is a real production failure mode — a consumer group that is consistently slower than the ingest rate will eventually fall behind the retention window and lose data. The fix is monitoring consumer lag actively.
flowchart LR
RET["Retention window<br/>(e.g. 7 days)"]
SAFE["Consumer offset<br/>inside window — OK"]
DANGER["Consumer offset<br/>behind window — data gone"]
LOG["Log: oldest ←————————→ newest"]
LOG --> RET
RET --> SAFE
RET -.-> DANGER
style SAFE fill:#15803d,color:#fff
style DANGER fill:#ff2e88,color:#fff
style LOG fill:#0e7490,color:#fff
style RET fill:#ffaa00,color:#0a0a0f
Log compaction
For use cases where you care about the latest value per key — not every historical event — log compaction is a better retention strategy. Kafka runs a background compaction thread that rewrites segments, keeping only the most recent message per key.
Before compaction: After compaction:
offset 0: key=A, val=1 offset 2: key=A, val=3 ← latest
offset 1: key=B, val=x offset 1: key=B, val=x ← only one B
offset 2: key=A, val=3
A message with a null value is a tombstone — it signals that the key has been deleted. The compaction thread retains tombstones for a configurable period before removing them.
Reach for compaction when a Kafka topic acts as an event-sourced state store: database changelog topics, configuration tables, materialized views in Kafka Streams. The topic becomes a compact snapshot of current state, queryable by key via log replay. For event streams where every event matters — click events, financial transactions — use time/size retention instead.
Metadata and coordination
ZooKeeper era
Historically, Kafka used ZooKeeper for storing broker membership and topic/partition metadata, electing the cluster controller (one broker that handles leader elections, ISR updates, and topic configuration), and (in very early Kafka) storing consumer group offsets — moved to __consumer_offsets in Kafka 0.9.
ZooKeeper worked but introduced operational complexity: two separate systems to monitor, size, and upgrade; ZooKeeper's throughput also bounded how fast metadata could change during large topology changes.
KRaft (Kafka without ZooKeeper)
Kafka 3.3 declared KRaft production-ready for new clusters (KIP-833), but with meaningful caveats at the time (no SCRAM for controllers, no JBOD, no ZooKeeper-to-KRaft migration path). Kafka 3.6 added SCRAM and made ZooKeeper-to-KRaft migration production-ready; JBOD in KRaft arrived as early access in 3.7. ZooKeeper support was removed entirely in Kafka 4.0. KRaft replaces ZooKeeper with a Raft-based quorum of controller nodes embedded within Kafka brokers themselves. The metadata log is stored as a Kafka log (naturally). Benefits:
- Simpler operations: one system.
- Faster controller failover (seconds instead of tens of seconds in large clusters).
- Metadata scalability: Raft log can handle hundreds of thousands of partitions more gracefully.
The key point for an interview: ZooKeeper was the coordination layer, KRaft replaces it with Raft built into Kafka itself. Don't fabricate internal KRaft implementation specifics beyond what is public.
Tiered storage
Storing 7 days of multi-PB data on local broker NVMe SSDs is expensive. Tiered storage (available in Confluent Cloud; added to Apache Kafka as an early-access feature in 3.6, GA in Apache Kafka 3.9.0) moves older segments to object storage (S3/GCS/ADLS) while keeping hot/recent segments on local disk.
Consumers transparently read from either local or remote storage — the fetch API is unchanged. The broker acts as a cache/proxy for remote fetches. This decouples retention duration (now limited only by object storage cost) from broker disk capacity, and allows brokers to be sized for throughput rather than retention.
Kafka vs. traditional brokers (RabbitMQ)
| Dimension | Kafka (commit log) | RabbitMQ (message broker) |
|---|---|---|
| Consumer model | Pull | Push |
| Message ordering | Per-partition | Per-queue (with caveats) |
| Replay | Yes — any offset in retention | No — message deleted after ack |
| Throughput | Very high (GB/s, sequential I/O) | Moderate (random I/O per message ack) |
| Consumer groups / fan-out | Native — independent offsets | Exchange + routing patterns |
| Per-message ack | No — offset commit (bulk) | Yes — per-message |
| Smart broker vs. smart consumer | Dumb broker, smart consumer | Smart broker (routing, TTL, dead-letter) |
| Retention | Configurable — messages outlive consumers | Message deleted on ack (or TTL) |
| Best fit | Event streaming, audit logs, replayable pipelines | Task queues, RPC-style work distribution |
The key conceptual difference: RabbitMQ is push-based — the broker pushes messages to consumers and tracks per-message delivery state. Kafka is pull-based — consumers pull at their own pace and the broker is largely stateless with respect to what consumers have processed. This makes Kafka naturally backpressure-resistant (a slow consumer just falls further behind but doesn't affect the broker or other consumers) but shifts responsibility for lag monitoring to the consumer team.
Full architecture
flowchart TD
PG["Producer Group<br/>Key-hashed to partition"] --> LB1["Broker 1<br/>(Leader: P0, P3)"]
PG --> LB2["Broker 2<br/>(Leader: P1, P4)"]
PG --> LB3["Broker 3<br/>(Leader: P2, P5)"]
LB1 <-->|"ISR replication"| LB2
LB2 <-->|"ISR replication"| LB3
LB1 <-->|"ISR replication"| LB3
LB1 & LB2 & LB3 <--> CTRL["Controller<br/>(KRaft quorum)"]
LB1 --> CGA["Consumer Group A<br/>(3 consumers, 1 per partition)"]
LB2 --> CGA
LB3 --> CGA
LB1 --> CGB["Consumer Group B<br/>(1 consumer, all partitions)"]
LB2 --> CGB
LB3 --> CGB
CGA & CGB --> OCT["__consumer_offsets topic<br/>(offset commits)"]
style LB1 fill:#ff6b1a,color:#fff
style LB2 fill:#ff6b1a,color:#fff
style LB3 fill:#ff6b1a,color:#fff
style CTRL fill:#a855f7,color:#fff
style CGA fill:#15803d,color:#fff
style CGB fill:#0e7490,color:#fff
Failure modes
| Failure | Behavior | Recovery |
|---|---|---|
| Broker crash (not leader) | Follower removed from ISR; leader continues serving | Broker restarts, re-joins ISR by catching up |
| Leader crash (ISR has other members) | Controller elects a new ISR member as leader; HW may rewind slightly | New leader serves; consumers retry automatically |
| Leader crash (ISR = leader only) | Partition goes offline (unclean=false) or data loss (unclean=true) | Wait for leader to return, or accept data loss |
| Consumer group rebalance | Consumption pauses for all consumers in the group | Cooperative rebalancing + static membership minimize pause |
| Consumer falls behind retention | Consumer offset is below log start offset; data gone | Reset to earliest/latest; accept data loss or catch up via replay elsewhere |
| Hot partition | One partition receives disproportionate writes | Re-key by more granular key, or use synthetic key randomization (key + random suffix) |
| Network partition (split-brain) | Minority side cannot reach ISR quorum; stops accepting writes | Wait for quorum; Kafka prefers CP (consistency/durability) over availability by default |
Storage choices across the system
| Component | Store | Why |
|---|---|---|
| Message data | Local disk (NVMe/HDD), segmented log files | Sequential I/O, OS page cache, sendfile |
| Old segments | Object storage (S3) with tiered storage | Cheap, infinite retention; slower access |
| Consumer offsets | __consumer_offsets Kafka topic (compacted) | Uses the same replication guarantees; decentralized |
| Cluster metadata | KRaft log (Raft-based) or ZooKeeper | Strong consistency for topology changes |
| Schema registry | Separate service (Confluent Schema Registry or equivalent) | Evolve message schemas without breaking consumers |
Things to discuss in an interview
- Partition count: more partitions = more parallelism + more overhead (leader elections, file handles, rebalance time). Rule of thumb: start at 3–10× the expected consumer count, but benchmark before assuming.
- Replication factor vs. min.insync.replicas: RF=3, min.insync.replicas=2 is the standard production setting. Explains why: tolerates one failure without risking a non-durable write.
- Pull vs push trade-off: explicitly frame consumer lag as the cost of pull. Then explain why that cost is worth it (independent pace, natural backpressure, replay).
- Hot partitions: a single high-volume key can saturate one broker. The fix is key diversification — adding a random suffix to the partition key and handling aggregation downstream.
- Exactly-once in context: be precise. EOS in Kafka means the log entry is written once. It does not mean the downstream database write succeeds exactly once — that's on the consumer.
- Log compaction vs. time retention: knowing which to reach for is a signal you've thought about real use cases.
Things you should now be able to answer
- Why does Kafka guarantee ordering only per partition, not globally?
- What is the high-watermark, and why can consumers not read past it?
- What happens if you set
unclean.leader.election.enable=true? - How does the idempotent producer eliminate duplicate writes on retry?
- What is log compaction and when does it beat time-based retention?
- Why is pull-based consumption naturally backpressure-resistant?
- What is consumer lag and how do you detect that a consumer has fallen off the retention window?
- How does KRaft differ from ZooKeeper in terms of what it stores and why that matters for operations?
Further reading
- "The Log: What every software engineer should know about real-time data's unifying abstraction" — Jay Kreps (LinkedIn Engineering blog)
- Apache Kafka documentation: Replication, ISR, and Controlled Shutdown — kafka.apache.org
- "Kafka: The Definitive Guide" — Gwen Shapira, Neha Narkhede, Todd Palino (O'Reilly)
- KRaft: Apache Kafka Without ZooKeeper — Kafka Improvement Proposal (KIP-500)
- Kafka tiered storage — KIP-405 and Confluent documentation
Frequently asked questions
▸What is the In-Sync Replica (ISR) set in Kafka and why does it matter?
The ISR is the set of replicas that are within replica.lag.time.max.ms of the leader. The high-watermark is the exclusive upper bound of committed offsets — consumers can fetch up to but not including the HW offset, so they never see uncommitted data. It advances only when all ISR members have confirmed a write. With acks=all plus min.insync.replicas=2, no acknowledged message is lost even if one broker crashes.
▸What happens when unclean leader election is enabled versus disabled?
With unclean.leader.election.enable=true, an out-of-sync replica can become the new leader after the ISR leader crashes, restoring availability but permanently losing any messages the replica had not yet replicated. With the setting disabled (the safer default), the partition stays offline until an ISR member comes back, preserving durability at the cost of availability.
▸How does Kafka's exactly-once semantics work, and what does it not cover?
Kafka achieves exactly-once in two layers: the idempotent producer uses a Producer ID and per-partition sequence number so the broker can detect and discard retried duplicates, and transactions let a producer atomically write across multiple partitions with consumers using isolation.level=read_committed to avoid partial reads. Critically, this guarantee is scoped to produce-through-consume within Kafka — if the consumer then writes to an external database and that write fails after the offset commit, end-to-end exactly-once requires idempotent consumers or external deduplication.
▸When should you use log compaction instead of time-based retention?
Use log compaction when you care only about the latest value per key — for example, database changelog topics, configuration tables, or materialized views in Kafka Streams — because the compaction thread rewrites segments to keep only the most recent message per key. Use time- or size-based retention when every event matters in sequence, such as click events or financial transactions.
▸Why is network, not disk, the primary bottleneck in a large Kafka cluster?
At 1M messages/sec with 1 KB average message size, ingest is 1 GB/s, but with 100 consumer groups each reading at that rate, aggregate egress reaches 100 GB/s (roughly 800 Gbps). Each broker serving 33 partitions needs at least a 40 Gbps NIC to keep up — a 25 Gbps NIC is insufficient. Modern NVMe drives sustain 3-7 GB/s sequential writes, so disk throughput is not the bottleneck; fan-out to many consumer groups is.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.