MODULE 09 / 12 · crash course

◆◆Intermediate

Message Queues and Event Streams

Queues vs streams, Kafka in depth, delivery semantics, idempotency, dead-letter queues, schema registries, the outbox pattern, and stream processing fundamentals.

17 min read · 2026-01-23 · Ironclad Academy

#messaging #asynchronous #kafka #streams #fundamentals

A message queue is a buffer between services — the trick that lets you turn a synchronous, brittle, all-or-nothing call chain into an asynchronous, decoupled, retry-able flow.

If caches give you speed, queues give you resilience.

Why a queue?

Compare two architectures.

Synchronous:

flowchart LR
    U[User] -->|"POST /signup"| S[Signup Service]
    S -->|"send welcome email"| E[Email]
    S -->|"create wallet"| W[Wallet]
    S -->|"send push"| P[Push]
    S -->|"index in search"| I[Search]
    S -->|200 OK| U
    style S fill:#ff2e88,color:#fff

The user waits for all four downstream services. If Email is slow, signup is slow. If Push is down, signup fails. Your p99 latency is owned by whatever downstream is slowest on any given day.

Asynchronous, with a queue:

flowchart LR
    U[User] -->|"POST /signup"| S[Signup Service]
    S -->|"INSERT user"| DB[(DB)]
    S -->|"publish UserCreated"| Q[Queue]
    S -->|201 Created| U
    Q --> E[Email worker]
    Q --> W[Wallet worker]
    Q --> P[Push worker]
    Q --> I[Search indexer]
    style Q fill:#15803d,color:#fff

The user waits only for the user row to be written. Everything else happens in the background, retried automatically if a worker fails.

That one change buys you five things at once: producers don't need to know about consumers (so you can add a new one without touching signup code); a downstream outage becomes "queue grows" instead of "front-end errors"; a 10× traffic spike fills the queue and drains at sustainable rate; queue depth tells you exactly how overloaded you are; and with a log-style queue you can replay historical events for new use cases. None of those are free — you're trading away simplicity and synchrony — but for most back-end fan-out work, that's a trade worth making.

Queue vs stream — the most important distinction

	Message Queue	Event Stream / Log
Examples	RabbitMQ, SQS, Cloud Tasks	Kafka, Kinesis, Pulsar, Redpanda
Lifecycle	Consumed once, deleted	Append-only, retained
Replay	No (or limited)	Yes (rewind to offset)
Consumers	Compete for messages	Each consumer reads independently
Order	Per-queue (often FIFO option)	Per-partition
Throughput	Tens of thousands msg/sec	Millions msg/sec
Use case	Task offload, RPC	Event sourcing, analytics, fanout

flowchart TD
    subgraph Queue [Traditional Queue: SQS / RabbitMQ]
    QP[Producer] --> QQ[(Queue)]
    QQ --> QC1[Consumer A]
    QQ --> QC2[Consumer B]
    Note1[Each msg goes to ONE consumer]
    end
    subgraph Stream [Log-based Stream: Kafka]
    SP[Producer] --> ST[(Topic)]
    ST --> SC1[Consumer Group 1]
    ST --> SC2[Consumer Group 2]
    Note2[Each consumer group reads<br/>the WHOLE topic independently]
    end
    style QQ fill:#ff6b1a,color:#0a0a0f
    style ST fill:#0e7490,color:#fff

A useful rule of thumb: reach for a queue when you want to say "do this work" — offload, retry, smooth bursts. Reach for a stream when you want to say "this happened" — broadcast, replay, analytics. Real systems use both. Streams hold the canonical event log; queues handle "send 100k emails in the next hour."
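To make the distinction concrete, here is a minimal in-memory sketch of the two models. ToyQueue and ToyLog are hypothetical classes invented for illustration, not any broker's API; they only show that a queue hands each message to one reader and forgets it, while a log retains messages and tracks a separate offset per consumer group.

```python
from collections import deque


class ToyQueue:
    """Queue semantics: each message goes to one consumer, then it is gone."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def receive(self):
        # Whoever calls first gets the message; nobody else will ever see it.
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Stream semantics: messages are appended and retained; readers track offsets."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group name -> next offset to read

    def publish(self, message):
        self._records.append(message)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._records):
            return None
        self._offsets[group] = offset + 1
        return self._records[offset]

    def rewind(self, group, offset=0):
        # Replay: move this group's position back to an earlier offset.
        self._offsets[group] = offset


queue, log = ToyQueue(), ToyLog()
for event in ["signup:1", "signup:2"]:
    queue.publish(event)
    log.publish(event)

# Two competing consumers drain the queue between them; a third read finds nothing.
queue_seen = [queue.receive(), queue.receive(), queue.receive()]

# Two consumer groups each see every record, independently of one another.
email_seen = [log.poll("email"), log.poll("email")]
search_seen = [log.poll("search"), log.poll("search")]

# The log still holds the data, so a group can replay from the start.
log.rewind("email")
replayed = log.poll("email")
```

The rewind call is the part a traditional queue cannot offer: once receive has returned a message, the queue no longer has it.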

The major brokers, briefly

Broker	Type	Sweet spot
Kafka	Stream	Event log, analytics, fanout, CDC, anything > 100k msg/sec
AWS Kinesis	Stream	Kafka-equivalent, fully managed
Pulsar	Stream + queue	Multi-tenant, geo-replication
Redpanda	Stream	Kafka-API, faster, single binary
RabbitMQ	Queue	Complex routing (topics, fanout, headers), small-medium scale
AWS SQS	Queue	Simple "do this work" queue, fully managed
AWS SNS	Pub/sub	Notifications fanout (mobile, email, SQS)
Cloud Tasks / GCP Pub/Sub	Queue / pub-sub	GCP equivalents
NATS / NATS JetStream	Pub/sub + queue	Lightweight, IoT, edge

For most teams, the practical default is SQS for tasks and Kafka (or Kinesis) for events.

Delivery semantics

Three flavors. Get this wrong and you'll either lose data or send duplicate emails.

Semantic	Description	When OK	When bad
At-most-once	Maybe deliver, maybe not	Telemetry, logs	Payments
At-least-once	Definitely deliver, possibly twice	Email, notifications, idempotent ops	Anything that costs money
Exactly-once	Deliver once, no duplicates	Money, inventory	(Hard to actually achieve)

The hard truth: "exactly-once delivery" is essentially impossible. What's possible is exactly-once processing, which means at-least-once delivery + idempotent consumers.

Why exactly-once delivery is impossible (informally)

There are two ways a consumer can fail, and they're mirror images of each other.

sequenceDiagram
    participant Broker
    participant Consumer

    Note over Broker,Consumer: Scenario 1 — process before ack
    Broker->>Consumer: deliver message
    Consumer->>Consumer: process (success)
    Note over Consumer: crash before ack
    Consumer--xBroker: ack lost
    Broker->>Consumer: redeliver (duplicate processing)

    Note over Broker,Consumer: Scenario 2 — ack before process
    Broker->>Consumer: deliver message
    Consumer->>Broker: ack (success)
    Consumer->>Consumer: process
    Note over Consumer: crash during processing
    Consumer--xBroker: message gone (lost work)

If you ack after processing and the ack is lost, the broker redelivers and you process twice. If you ack before processing and the work fails, the message is gone. You cannot guarantee exactly one delivery across a network with failures — that's simply the two-generals problem applied to message passing. The practical answer: accept at-least-once delivery from the broker, and make your consumers idempotent.
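The two scenarios can be reproduced in a few lines. The sketch below is a toy simulation, assuming a hypothetical ToyBroker that redelivers anything unacknowledged and a consumer that crashes exactly once; it is not how any real broker is implemented, but it shows why the ordering of ack and process decides between duplicates and loss.

```python
class ToyBroker:
    """Redelivers a message until it has been acknowledged."""

    def __init__(self, messages):
        self.unacked = list(messages)

    def deliver(self):
        return self.unacked[0] if self.unacked else None

    def ack(self, message):
        self.unacked.remove(message)


def run(ack_first):
    """Consume one message with a single simulated crash; return what got processed."""
    broker = ToyBroker(["charge-42"])
    processed = []
    crashed = False
    while (message := broker.deliver()) is not None:
        if ack_first:
            broker.ack(message)
            if not crashed:
                crashed = True  # crash after the ack, before the work happens
                continue
            processed.append(message)
        else:
            processed.append(message)
            if not crashed:
                crashed = True  # crash after the work, before the ack is sent
                continue
            broker.ack(message)
    return processed


ack_after_processing = run(ack_first=False)   # at-least-once: work done twice
ack_before_processing = run(ack_first=True)   # at-most-once: work never done
print(ack_after_processing, ack_before_processing)
```

Neither ordering yields exactly one execution when the crash lands in the gap, which is why the duplicate has to be handled on the consumer side.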

Idempotency: the only way to survive duplicates

Make every consumer idempotent. Every message needs a unique event_id, and the consumer must check whether it has already acted on that ID before doing any work.

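seen = set()  # stand-in for a durable dedup store (for example, a database table)

def handle(event):
    # Skip events we have already acted on.
    if event.event_id in seen:
        return  # already processed
    process(event)  # your side effect: send the email, charge the card, ...
    seen.add(event.event_id)  # ideally in the same transaction as process()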

That "ideally in the same transaction" note is the entire difficulty. If you do the work and then mark it done, you can fail in between and redo work. The side effect and the dedup record need to commit atomically. In practice this means one of: a database UNIQUE constraint on the action (e.g. payment_id unique — second insert fails harmlessly), a SET NX lock in Redis keyed on event_id, the outbox pattern with a dedup table, or a conditional update with WHERE current_state = 'pending' so a second attempt is a no-op.
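Here is one way the "same transaction" requirement might look, using the UNIQUE-constraint option with SQLite from the standard library. The table names (processed_events, balances) and the event shape are assumptions made for this sketch; the point is only that the dedup record and the side effect commit or roll back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (user_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO balances VALUES ('user-42', 0)")
conn.commit()


def handle(event):
    """Apply a credit at most once. Returns False when the event is a duplicate."""
    try:
        with conn:  # one transaction: commit on success, roll back on any error
            # The primary key rejects a second insert of the same event_id.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE user_id = ?",
                (event["amount"], event["user_id"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: nothing was changed


event = {"event_id": "evt-1", "user_id": "user-42", "amount": 100}
first = handle(event)
second = handle(event)  # simulated redelivery
balance = conn.execute(
    "SELECT amount FROM balances WHERE user_id = 'user-42'"
).fetchone()[0]
```

Because the dedup insert comes first inside the transaction, a duplicate fails before the balance is touched. This only works when the side effect lives in the same database; an external call such as sending an email needs the provider's own idempotency key or a similar mechanism.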

Dead-letter queues (DLQ)

What happens to a message that consistently fails to process? You don't want it to retry forever and block the queue.

flowchart LR
    Q[Main Queue] --> C[Consumer]
    C -->|success| OK[ack]
    C -->|fail| RETRY{"retry < N?"}
    RETRY -->|yes| Q
    RETRY -->|no| DLQ[Dead Letter Queue]
    DLQ --> ALERT[Alert + Manual Review]
    style DLQ fill:#ff2e88,color:#fff

Set a max-retry limit. After N failures, the message moves to a DLQ for human inspection. Always have alerts on DLQ depth — a non-empty DLQ is a bug, and a growing one is a silent incident. A common pattern is a DLQ replay tool: a CLI that reads from the DLQ, lets a human classify each message (transient error / poison message / fixable), and either requeues or discards.
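The retry-then-park flow in the diagram can be sketched with two in-memory collections. MAX_ATTEMPTS, the message shape, and the handler are all hypothetical; managed brokers usually implement this for you through a redrive or dead-letter setting, so treat this as an illustration of the control flow only.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit (the "N" in the diagram)


def drain(main_queue, dead_letters, handler):
    """Process until the main queue is empty, parking repeat failures in the DLQ."""
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message["body"])
        except Exception as error:
            message["attempts"] += 1
            message["last_error"] = repr(error)  # keep context for the human reviewer
            if message["attempts"] < MAX_ATTEMPTS:
                main_queue.append(message)   # retry later, behind other work
            else:
                dead_letters.append(message)  # give up and park it


handled = []


def handler(body):
    if body == "poison":
        raise ValueError("cannot parse payload")
    handled.append(body)


main_queue = deque({"body": b, "attempts": 0} for b in ["ok-1", "poison", "ok-2"])
dead_letters = []
drain(main_queue, dead_letters, handler)
```

Requeueing at the back rather than retrying in place is what keeps one poison message from blocking the healthy ones behind it. A real system would also add a delay between attempts.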

Backpressure

What happens when producers are faster than consumers?

producer rate > consumer rate
→ queue depth grows
→ memory/disk fills
→ broker dies

Four responses: bound the queue and reject the producer when it's full (fastest, simplest fail signal); auto-scale consumers based on queue depth; shed load at the producer by rate-limiting or dropping low-priority messages; or buffer to disk. Kafka takes the fourth approach — disk is the queue, so backpressure shows up as growing consumer lag rather than a hard rejection. SQS has a 14-day retention maximum; messages older than that are silently dropped.
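The first response, a bounded queue that rejects producers when full, is easy to demonstrate with the standard library. The capacity of 3 and the try_publish helper are assumptions for this sketch; the in-process queue.Queue stands in for a broker with a depth limit.

```python
import queue

buffer = queue.Queue(maxsize=3)  # hypothetical capacity


def try_publish(message):
    """Return True if accepted, False if the queue is full (the backpressure signal)."""
    try:
        buffer.put_nowait(message)
        return True
    except queue.Full:
        return False


# A burst of five messages against a capacity of three.
results = [try_publish(f"job-{i}") for i in range(5)]

# Once a consumer drains one item, the producer is accepted again.
buffer.get_nowait()
accepted_after_drain = try_publish("job-5")
```

The False return is the useful part: the producer learns immediately that the system is saturated and can retry with a delay, drop the message, or surface an error upstream.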

Apache Kafka in one diagram

Kafka is one of the most important pieces of infrastructure in modern data architectures. Here's how it works:

flowchart LR
    P1[Producer A] --> T
    P2[Producer B] --> T
    subgraph T [Topic: 'orders']
    P0[Partition 0]
    Pp1[Partition 1]
    Pp2[Partition 2]
    end
    P0 --> CG1A[Consumer 1<br/>in Group X]
    Pp1 --> CG1B[Consumer 2<br/>in Group X]
    Pp2 --> CG1C[Consumer 3<br/>in Group X]
    P0 --> CG2[Consumer<br/>in Group Y]
    Pp1 --> CG2
    Pp2 --> CG2
    style T fill:#0e7490,color:#fff

A few concepts to notice in that diagram. A topic is a named log of messages. It's split into N partitions for parallelism — messages within a partition are ordered, but there's no ordering guarantee across partitions. Each producer picks a partition (usually by hashing the message key). A consumer group is a set of consumers that share work: within a group, each partition is owned by exactly one consumer. Group X above has three consumers, one per partition — each consumer handles a third of the throughput. Group Y has one consumer that reads all three partitions on its own. Every consumer group reads the whole topic independently; adding a new group doesn't affect any existing group.
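The "each partition is owned by exactly one consumer in a group" rule can be sketched as a simple round-robin assignment. The assign function below is a hypothetical simplification; Kafka's built-in assignors (range, round-robin, sticky) are more involved, but the resulting invariant is the same.

```python
def assign(partitions, consumers):
    """Give each partition exactly one owner within a single consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

group_x = assign(partitions, ["consumer-1", "consumer-2", "consumer-3"])
group_y = assign(partitions, ["solo"])

# A fourth consumer in a three-partition topic has nothing to read.
oversized = assign(partitions, ["c1", "c2", "c3", "c4"])
print(group_x, group_y, oversized)
```

The oversized case is why partition count is the ceiling on parallelism within a group: extra consumers sit idle until a partition frees up.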

Two more things worth knowing. The offset is the position a consumer has reached within a partition. Committed offsets are stored in Kafka itself, and a consumer can rewind to any earlier offset that is still within the topic's retention. Each partition is replicated to N brokers (typically 3), with one leader handling writes and followers staying in sync. In modern Kafka deployments, cluster metadata and leader election run through KRaft, the replacement for ZooKeeper.

Partitioning by key

This is the single most important Kafka design decision you'll make on any project.

producer.send(
    topic='orders',
    key=order.user_id,    # determines partition
    value=order
)

All messages with the same key go to the same partition, which means they're ordered relative to each other and consumed by a single consumer in that group. So: if event ordering matters within a logical entity (all events for user 42 must be processed in sequence), use that entity's ID as the key. If ordering doesn't matter and you just want even spread, use a null key or a random one.

The common mistake is keying on something low-cardinality like country. Most partitions sit idle while three are red-hot, and you get none of the parallelism benefits of partitioning at all.
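A quick simulation shows the skew. This sketch assumes a 12-partition topic and a synthetic traffic mix, and it uses CRC32 as a stand-in hash; Kafka's default partitioner uses murmur2, so the exact partition numbers will differ, but the shape of the result should not.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 12  # hypothetical topic size


def partition_for(key):
    # Stand-in hash for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def country_of(i):
    # Synthetic traffic mix: 70% US, 20% IN, 10% DE.
    return "US" if i % 10 < 7 else ("IN" if i % 10 < 9 else "DE")


events = [{"user_id": f"user-{i}", "country": country_of(i)} for i in range(10_000)]

by_user = Counter(partition_for(e["user_id"]) for e in events)
by_country = Counter(partition_for(e["country"]) for e in events)

print("partitions used, keyed by user_id:", len(by_user))
print("partitions used, keyed by country:", len(by_country))
print("busiest partition share, keyed by country:", max(by_country.values()) / len(events))
```

With three distinct country values, at most three partitions can ever receive traffic no matter how many the topic has, and the largest country dominates one of them.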

Consumer lag and backpressure in Kafka

One more concept that's worth visualizing: consumer lag. Lag is the gap between the latest offset in a partition and the offset your consumer has actually committed.

flowchart LR
    subgraph P0 [Partition 0]
    direction LR
    O0[offset 0] --> O1[offset 1] --> O2[offset 2] --> O3[offset 3] --> O4[offset 4<br/>latest]
    end
    C[Consumer<br/>committed: offset 2] -.lag = 2.-> O4
    style C fill:#ff6b1a,color:#0a0a0f
    style O4 fill:#15803d,color:#fff

Lag of zero means your consumer is keeping up. A lag that grows steadily means your consumers are slower than your producers — you need more consumers (up to the number of partitions) or faster processing. Monitoring lag is how you detect backpressure in Kafka before it becomes an incident.
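Computing lag is a per-partition subtraction. The sketch below follows the definition used in this section (latest offset minus committed offset); the offset numbers are made up, and production tooling reports the same idea measured against each partition's log end offset.

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: how far the group's committed position trails the latest offset."""
    return {
        partition: latest - committed_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }


latest = {0: 4, 1: 120, 2: 57}      # hypothetical latest offsets per partition
committed = {0: 2, 1: 120, 2: 10}   # hypothetical committed offsets for one group

lag = consumer_lag(latest, committed)
total_lag = sum(lag.values())
worst_partition = max(lag, key=lag.get)
print(lag, total_lag, worst_partition)
```

Looking at lag per partition rather than only the total matters: a single stuck partition (here partition 2) often points at a hot key or one unhealthy consumer rather than a general capacity problem.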

Why Kafka has eaten the world

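Kafka can serve as both a queue (for tasks) and a log (for events), which means it can replace several other tools at once. It's disk-based, so retention is cheap and you can keep months of history. Throughput is high: a single broker typically handles somewhere between 100–500 MB/sec of sustained writes depending on hardware (NVMe SSDs, 10 Gbps NICs, and batching push the upper end); reads are usually faster than writes because hot partitions are served from the OS page cache. Cluster throughput scales near-linearly with broker count, so a 6-broker cluster can sustain roughly six times what one broker does. The replay capability is what enables event sourcing, CDC, and stream processing. Within a topic, you scale throughput by adding partitions: if you need 10× more, give the topic roughly 10× more partitions, along with enough brokers and consumers to serve them. Be aware that adding partitions to an existing keyed topic changes which partition a key hashes to, so events for the same key written before and after the change can land in different partitions.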

Schema management

Kafka stores raw bytes. Without discipline, producers and consumers drift, someone ships a breaking change, and suddenly yesterday's records can't be deserialized. This is the kind of incident that takes weeks to clean up.

Schema registries

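Confluent Schema Registry, AWS Glue Schema Registry, and Apicurio all work on the same idea: producers register a schema, each message carries a small schema ID alongside its body (Confluent's wire format, for example, prefixes the payload with a magic byte and a 4-byte ID rather than using a Kafka header), and consumers fetch the schema by ID to deserialize.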

flowchart LR
    P[Producer] -->|"1. register schema v3"| SR[(Schema Registry)]
    SR -->|"2. ID = 42"| P
    P -->|"3. publish<br/>(schema_id=42, body)"| K[Kafka]
    K --> C[Consumer]
    C -->|"4. fetch schema 42"| SR
    style SR fill:#a855f7,color:#fff

The registry enforces compatibility rules: BACKWARD means a new schema can read old data (add fields with defaults; remove fields); FORWARD means an old schema can read new data; FULL means both. Use Avro or Protobuf — both are compact, fast, and have explicit evolution semantics. JSON-Schema works but is larger and slower.

Schema evolution in practice

A few rules that prevent most incidents: never reuse field numbers (in Protobuf, field numbers are identity, not just metadata — Avro identifies fields by name, so the equivalent rule is never reuse a field name for a different type or purpose); never change a field's type — add a new field with a new name instead; give new fields default values so old consumers don't break when they encounter them; and document breaking changes at the schema-registry level before you ship code, not after.
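The "give new fields default values" rule can be illustrated with a toy reader. The dict-based schema format, read_with, and is_backward_compatible are all hypothetical simplifications written for this sketch; they are not the Avro or Protobuf APIs, and they ignore type changes entirely.

```python
SCHEMA_V1 = {"fields": [{"name": "order_id"}, {"name": "amount"}]}
SCHEMA_V2 = {
    "fields": [
        {"name": "order_id"},
        {"name": "amount"},
        {"name": "currency", "default": "USD"},  # new field, with a default
    ]
}
SCHEMA_V3_BROKEN = {
    "fields": [{"name": "order_id"}, {"name": "amount"}, {"name": "currency"}]
}


def read_with(schema, record):
    """Decode a record using the reader's schema, filling in defaults where needed."""
    decoded = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            decoded[name] = record[name]
        elif "default" in field:
            decoded[name] = field["default"]
        else:
            raise KeyError(f"missing field without default: {name}")
    return decoded


def is_backward_compatible(new_schema, old_schema):
    """Simplified check: can the new schema read data written with the old one?"""
    old_names = {field["name"] for field in old_schema["fields"]}
    return all(
        field["name"] in old_names or "default" in field
        for field in new_schema["fields"]
    )


old_record = {"order_id": "o-1", "amount": 5}  # written with SCHEMA_V1
decoded = read_with(SCHEMA_V2, old_record)
```

A registry running in BACKWARD mode performs a check along these lines at registration time, so SCHEMA_V3_BROKEN would be rejected before any producer could ship it.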

Common patterns built on queues/streams

Fanout (one to many)

A user posts a tweet and the tweet is delivered to millions of follower timelines. This is covered in depth in the Twitter article.

Outbox pattern (DB + queue, transactionally)

sequenceDiagram
    participant App
    participant DB
    participant Outbox as outbox table (in DB)
    participant Q as Kafka
    App->>DB: BEGIN
    App->>DB: UPDATE balances ...
    App->>Outbox: INSERT event
    App->>DB: COMMIT
    Note over Outbox,Q: separate process polls outbox<br/>and publishes to Kafka
    Outbox->>Q: publish

Without the outbox, you might write to the DB but fail to publish to Kafka, or publish but have the DB transaction roll back. With the outbox, both the business write and the event record commit in the same transaction. A separate relay process polls the outbox table and publishes to Kafka; if it crashes, it just re-reads from the last unprocessed row. That restart can publish an event twice, so the relay gives you at-least-once delivery and consumers still need to be idempotent.
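A minimal sketch of the pattern, using SQLite as the database and a plain callable in place of a Kafka producer. The table layout, the published flag, and the debit and relay_once functions are assumptions made for illustration; a production relay would also batch, handle concurrent relays, and clean up old rows.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE balances (user_id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO balances VALUES ('user-42', 100);
    """
)


def debit(user_id, amount):
    """Business write and event record commit together, or not at all."""
    with conn:  # single transaction
        conn.execute(
            "UPDATE balances SET amount = amount - ? WHERE user_id = ?",
            (amount, user_id),
        )
        event = {"type": "BalanceDebited", "user_id": user_id, "amount": amount}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(publish):
    """Publish every unpublished row in order; returns how many rows were found."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []  # stand-in for a Kafka topic
debit("user-42", 30)
first_pass = relay_once(published.append)
second_pass = relay_once(published.append)  # nothing left to send
balance = conn.execute(
    "SELECT amount FROM balances WHERE user_id = 'user-42'"
).fetchone()[0]
```

Note the gap between publish and marking the row: a crash there leads to a republish on restart, which is the at-least-once behavior consumers have to tolerate.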

CDC (Change Data Capture)

Postgres → Debezium → Kafka. Every row change becomes a Kafka event, letting you fan out DB changes to caches, search indexes, and downstream services without your application code having to publish anything explicitly.

flowchart LR
    A[Postgres] -->|"WAL stream"| D[Debezium]
    D --> K[Kafka topic per table]
    K --> S1[Search index sync]
    K --> S2[Cache invalidator]
    K --> S3[Data warehouse]
    style A fill:#ff6b1a,color:#0a0a0f
    style K fill:#0e7490,color:#fff

CDC is the foundation of the "everything is a stream" architectures you see at companies like LinkedIn and Confluent.

Saga (distributed transaction)

Orchestrate a multi-step transaction across services with compensating actions on failure.

flowchart LR
    S1[Reserve seat] -->|ok| S2[Charge card]
    S2 -->|ok| S3[Issue ticket]
    S2 -->|fail| C1[Release seat]
    style C1 fill:#ff2e88,color:#fff

Each step is a message; if a later step fails, earlier steps are undone by compensating messages. There are two ways to coordinate this. In choreography, each service reacts to events without a central coordinator — simpler initially, but the causality chain becomes hard to follow as the number of steps grows. In orchestration, a central state machine explicitly drives each step and handles failures — more code to write, but you can look at one place and see exactly what the saga is doing. For three or more steps, prefer orchestration with a workflow engine like Temporal, AWS Step Functions, or Cadence.
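The orchestration style can be sketched as a loop that remembers which steps succeeded and undoes them in reverse on failure. This in-process run_saga function and its step tuples are hypothetical; a real orchestrator such as a workflow engine would persist its state between steps and drive each one via messages.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"{name}: ok")
            completed.append((name, compensate))
        except Exception as error:
            log.append(f"{name}: failed ({error})")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return False, log
    return True, log


seats = {"12A": "free"}


def charge_card():
    raise RuntimeError("card declined")  # simulated failure of step two


steps = [
    (
        "reserve seat",
        lambda: seats.update({"12A": "reserved"}),
        lambda: seats.update({"12A": "free"}),
    ),
    ("charge card", charge_card, lambda: None),
    ("issue ticket", lambda: None, lambda: None),
]

ok, log = run_saga(steps)
print(ok, log, seats)
```

The log list is the "one place to look" advantage of orchestration: the full history of the saga, including compensations, is visible in order.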

Event sourcing

Store the events themselves, not the current state. Current state is derived by replaying events. The Kafka topic is the database.

You get a full audit log, the ability to time-travel to any past state, and the freedom to add new projections from historical data without schema migrations. The cost: every read becomes a fold over events, so you end up building read models (caches) for anything that needs to query current state efficiently, and schema evolution for old events is genuinely hard. Reach for event sourcing when audit and replay are core requirements — banking, healthcare, anything regulated.
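"Current state is a fold over events" is quite literal. The sketch below assumes a hypothetical account with Deposited and Withdrawn events and uses functools.reduce as the fold; the event names and shapes are invented for illustration.

```python
from functools import reduce


def apply(balance, event):
    """Pure transition function: previous state + one event -> next state."""
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance  # unknown event types are ignored


events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

current_balance = reduce(apply, events, 0)

# Time travel: replay only a prefix of the log to get a past state.
balance_after_two_events = reduce(apply, events[:2], 0)
print(current_balance, balance_after_two_events)
```

Replaying the whole log on every read gets slow as the log grows, which is the reason read models and snapshots exist.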

Stream processing

Once you have a stream, you'll want to process it: filter, transform, join, aggregate, window.

Tool	Style	Sweet spot
Kafka Streams	Library (JVM)	Stateful processing inside a JVM service
Apache Flink	Cluster	Heavy-duty stream processing, exactly-once, large state
Apache Spark Streaming	Micro-batch	Reuse Spark batch code as streaming
Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics)	Managed Flink	Managed flavor
Materialize / RisingWave	SQL on streams	"Just give me a continuously updated view"

A canonical streaming pipeline:

events_topic → filter (paid only)
             → enrich (join with user dim from KTable)
             → window (5-min tumbling)
             → aggregate (sum amount per region)
             → publish to dashboard topic

Three concepts that trip people up: windowing (tumbling, hopping, and session windows each have different semantics for how you group events by time); watermarks (how late can data arrive and still be assigned to its window — set this too tight and you lose late events, too loose and your latency grows); and exactly-once (Flink achieves it via two-phase-commit sinks coordinated with checkpoints; Kafka Streams wraps the full read-process-write cycle in a single Kafka transaction via processing.guarantee=exactly_once_v2). This is its own field; the news feed article and the broader literature go deeper.
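To make tumbling windows concrete, here is a batch-style sketch of the filter, window, and aggregate stages from the pipeline above. It assumes events carry an integer timestamp in seconds and arrive all at once; there is no watermark or late-data handling here, which is precisely what a real stream processor adds.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows


def tumbling_sums(events):
    """Sum paid amounts per (window start, region)."""
    totals = defaultdict(int)
    for event in events:
        if event["status"] != "paid":  # filter stage
            continue
        # Each timestamp belongs to exactly one window: [start, start + 300).
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        totals[(window_start, event["region"])] += event["amount"]
    return dict(totals)


events = [
    {"ts": 10, "region": "eu", "amount": 20, "status": "paid"},
    {"ts": 299, "region": "eu", "amount": 5, "status": "paid"},
    {"ts": 300, "region": "eu", "amount": 7, "status": "paid"},
    {"ts": 120, "region": "us", "amount": 50, "status": "refunded"},
    {"ts": 310, "region": "us", "amount": 9, "status": "paid"},
]

sums = tumbling_sums(events)
print(sums)
```

The events at ts=299 and ts=300 land in different windows, which shows the non-overlapping boundary that distinguishes tumbling windows from hopping ones.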

Pub/sub patterns

Pub/sub is "broadcast" — a publisher sends a message; many subscribers receive it.

flowchart LR
    P[Publisher] --> T[(Topic)]
    T --> S1[Subscriber A]
    T --> S2[Subscriber B]
    T --> S3[Subscriber C]
    style T fill:#ff6b1a,color:#0a0a0f

Three common implementations: Kafka with multiple consumer groups (each group is an independent subscriber, reading the same topic from its own offset); AWS SNS + SQS (SNS fans out to many SQS queues, one per subscriber — each subscriber gets its own queue with its own retry policy); and Redis pub/sub (in-memory, no persistence, simplest to set up but messages are lost if a subscriber is down). Pub/sub is the standard pattern for cross-team integration: team A publishes events, and teams B, C, and D consume independently without any coordination with A.

When NOT to use a queue

A queue adds latency, failure modes, and operational complexity. Skip it when the user needs the result right now (a search query, an autocomplete), when the work itself is trivial (don't queue a 1ms operation), when your latency budget has no room for an extra hop, or when what you actually need is durable storage with query support. For that last case, use a database, not a queue. A queue is a transport, not a database; using it as one is a common mistake that surfaces as "we need to replay but the messages expired."

Worked example: a notification service

Spec: 100M users; multi-channel (email, push, SMS, in-app); template + variable substitution; retry on failure; backpressure when a channel is degraded.

flowchart TD
    EVENTS[Application events] --> KAFKA[Kafka: notifications topic]
    KAFKA --> ROUTER[Router: pick channels]
    ROUTER --> EQ[Email queue]
    ROUTER --> PQ[Push queue]
    ROUTER --> SQ[SMS queue]
    ROUTER --> IQ[In-app queue]
    EQ --> EW[Email workers]
    PQ --> PW[Push workers]
    SQ --> SW[SMS workers]
    IQ --> IW[In-app worker]
    EW -.fail.-> DLQ[(DLQ)]
    style KAFKA fill:#0e7490,color:#fff
    style DLQ fill:#ff2e88,color:#fff

Kafka holds the canonical event log. Per-channel queues (SQS) isolate failures — a slow SMS provider backs up its own queue but doesn't touch email or push. Each worker is idempotent: the notification has a UUID and the sender refuses to act on a duplicate. DLQ depth is alerted. The full deep-dive is in the notification system article.

Things you should now be able to answer

When would you choose Kafka over SQS?
Why is "exactly-once delivery" effectively impossible? What's exactly-once processing?
What's the role of a dead-letter queue?
What problem does the outbox pattern solve?
A consumer crashes after processing 100 messages but before ack-ing them. What happens?
A producer publishes events with key=user_id. Why? What goes wrong if you used key=country instead?
A new field needs to be added to an event schema. Walk through the rollout (producer first or consumer first?).
A queue is growing faster than it's being drained. What three things should you investigate?

→ Next: CAP theorem & consistency

Frequently asked questions

What is the difference between a message queue and an event stream?

A message queue (RabbitMQ, SQS) delivers each message to exactly one consumer and deletes it after consumption; as a rule of thumb, expect throughput on the order of tens of thousands of messages per second. An event stream (Kafka, Kinesis) appends messages to a durable log that each consumer group reads independently, supports replay via offsets, and can scale to millions of messages per second. Use a queue to say "do this work"; use a stream to say "this happened."

Why is exactly-once delivery impossible, and what is exactly-once processing?

Exactly-once delivery fails at the network boundary: if a consumer acks after processing and the ack is lost, the broker redelivers and the message is processed twice; if a consumer acks before processing and then crashes, the message is gone. This is the two-generals problem applied to message passing. Exactly-once processing is the practical substitute: use at-least-once delivery from the broker and make consumers idempotent by checking a unique event_id before acting.

When should you use Kafka instead of SQS?

Reach for Kafka when you need replay (event sourcing, CDC, new downstream consumers reading historical data), throughput above roughly 100k messages per second, or a durable log that multiple independent consumer groups can read without coordination. SQS is the better default for simple task offload at moderate scale where replay and fan-out to many independent readers are not requirements.

What is the outbox pattern and what problem does it solve?

The outbox pattern writes both the business update and the event record in the same database transaction, using an outbox table inside that same DB. A separate relay process polls the outbox and publishes to Kafka. This eliminates the dual-write race where an app might commit to the DB but fail to publish, or publish to Kafka but have the DB transaction roll back.

What goes wrong if you partition a Kafka topic by a low-cardinality key like country?

All messages for the largest countries pile into a handful of partitions while most partitions sit idle. The consumers assigned those hot partitions are overwhelmed; the rest are underutilized. You lose the parallelism that partitioning is meant to provide. Use a high-cardinality key like user_id or order_id to distribute load evenly across partitions.
