
Intermediate · asked at Netflix, LinkedIn, Amazon

Backpressure & Flow Control

What happens when a fast producer overwhelms a slow consumer? Backpressure, bounded buffers, load shedding, and why unbounded queues are a trap.

17 min read · 2026-03-30 · Ironclad Academy

#distributed-systems #reliability #streaming #performance


Every real pipeline is finite: CPUs, threads, connections, and memory all have limits. Backpressure and flow control are the mechanisms by which those limits propagate through a system in a controlled way. Understanding this is the difference between a system that degrades gracefully under load and one that quietly accumulates queue lag until it falls over.

The core problem

Producer --[burst]--> [???] --> Consumer (slow)

A producer that emits 15,000 messages per second feeding a consumer that can process 10,000 messages per second is not a problem you can ignore. The 5,000 messages per second delta has to go somewhere. Your options:

1. Slow the producer: backpressure.
2. Buffer the difference: queue.
3. Drop the excess: load shedding.

Option 2 is the default and the most dangerous.

The unbounded queue is a latency bomb

An unbounded queue appears to solve the problem: the producer never blocks, the consumer never starves, and everyone is "happy." But the latency of any given message equals the time spent waiting in the queue plus the time to process it.

queue latency = queue_depth / consumer_throughput

If queue_depth is growing by 5,000 messages every second, queue latency grows by 0.5 seconds every second. After 60 seconds of a burst, the tail of the queue is sitting 30 seconds in the past. The system reports 100% availability. The user experiences a 30-second delay. Then memory runs out and the process crashes.
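The arithmetic above can be checked with a small sketch. The class and method names here are illustrative, not from any library, and the model assumes constant producer and consumer rates for the whole burst.

```java
/** Illustrative model of backlog growth under sustained overload (constant rates assumed). */
final class QueueGrowth {
    /** Messages queued after `seconds` of overload; zero if the consumer keeps up. */
    static long backlog(long producerRate, long consumerRate, long seconds) {
        return Math.max(0, producerRate - consumerRate) * seconds;
    }

    /** Seconds a message at the tail of the queue waits before the consumer reaches it. */
    static double queueLatencySeconds(long depth, long consumerRate) {
        return (double) depth / consumerRate;
    }

    public static void main(String[] args) {
        for (long t : new long[] {1, 10, 60}) {
            long depth = backlog(15_000, 10_000, t);
            System.out.printf("t=%ds depth=%d wait=%.1fs%n",
                    t, depth, queueLatencySeconds(depth, 10_000));
        }
    }
}
```

With the rates from this section, 60 seconds of burst gives a backlog of 300,000 messages and a 30-second wait at the tail.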

This failure mode has a name in networking: bufferbloat. Consumer routers with enormous packet buffers keep links fully utilized at the cost of 100–500ms of latency on what should be a 10ms connection. The fix is Active Queue Management (AQM). Algorithms like CoDel (Controlled Delay) measure each packet's sojourn time in the queue; when even the minimum sojourn time stays above a small target (5ms by default) for a full measurement interval (100ms by default), CoDel begins dropping packets to signal congestion rather than allowing the buffer to stay persistently full.

Backpressure: slowing the producer

Backpressure means the consumer (or a buffer near capacity) sends a signal upstream: "slow down." The signal propagates hop by hop until it reaches whoever is generating work. The producer then emits less, sleeps, or pauses.

TCP's receive window: the canonical example

TCP's flow control is the clearest implementation of backpressure in computing. The receiver advertises a receive window — the number of bytes it is prepared to accept — in every ACK. The sender may not have more than receive_window bytes outstanding (unacknowledged) at any time.

When the receiver's application layer is slow to drain its socket buffer, the buffer fills, the advertised window shrinks, and the sender's transmission rate drops automatically — without any application-layer code. The slow consumer has mechanically slowed the fast sender.

sequenceDiagram
    participant S as Sender
    participant R as Receiver (slow app)
    S->>R: data [seq=1..4096]
    R-->>S: ACK, window=65535
    S->>R: data [seq=4097..8192]
    Note over R: App layer slow to drain
    R-->>S: ACK, window=4096
    S->>R: data [seq=8193..12288]
    Note over R: Buffer nearly full
    R-->>S: ACK, window=0
    Note over S: Sender pauses — zero window
    Note over R: App drains buffer
    R-->>S: ACK, window=65535
    Note over S: Sender resumes

Congestion control (CUBIC, BBR) adds a second loop on top: the network itself feeds back via packet loss or delay, and the sender reduces its rate in response. These are two nested backpressure loops in one protocol stack.

Pull-based (demand-driven) flow

The cleanest way to implement backpressure in application code is inversion of control: instead of the producer pushing data to the consumer, the consumer pulls exactly as many items as it can handle.

Kafka consumer pull model. Kafka brokers do not push messages to consumers. Consumers call poll() to request a batch of records; the batch size is capped by consumer configuration such as max.poll.records and fetch.max.bytes. The broker returns up to that amount. The consumer processes the batch, commits the offset, and polls again. If the consumer slows down (because downstream storage is slow, for example), it simply polls less frequently. The broker retains the unconsumed messages on disk (within the configured retention window) and the producer continues writing to the topic. The consumer's poll rate is the throttle.

This is demand-driven flow at the application layer, and it is why Kafka architectures are naturally backpressure-friendly. The consumer dictates the pace; the broker absorbs the buffer; the producer is independent. The cost: latency grows proportional to how far the consumer falls behind. That debt must eventually be paid.

Reactive Streams. The Reactive Streams specification (now part of Java as java.util.concurrent.Flow, and implemented by Project Reactor, RxJava, Akka Streams, and others) formalizes this as a protocol. A Subscriber signals to a Publisher via request(n) how many items it wants. The Publisher emits at most n items. The Subscriber processes them, then calls request(n) again. At no point does the Publisher emit faster than the Subscriber has demanded.

sequenceDiagram
    participant PUB as Publisher
    participant SUB as Subscriber
    SUB->>PUB: subscribe()
    PUB-->>SUB: onSubscribe(subscription)
    SUB->>PUB: request(3)
    PUB-->>SUB: onNext(item 1)
    PUB-->>SUB: onNext(item 2)
    PUB-->>SUB: onNext(item 3)
    Note over SUB: Processing — not ready yet
    Note over PUB: Publisher waits; emits nothing
    SUB->>PUB: request(3)
    PUB-->>SUB: onNext(item 4)
    PUB-->>SUB: onNext(item 5)

This propagates naturally through a pipeline: each stage is both a Subscriber to its upstream and a Publisher to its downstream. A slow terminal Subscriber reduces demand, which reduces demand on the stage above it, all the way to the source. The whole pipeline slows together.
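The request(n) handshake can be sketched with the JDK's own java.util.concurrent.Flow interfaces. RangePublisher and BatchingSubscriber are hypothetical names invented for this illustration, and the sketch is deliberately single-threaded and synchronous; production libraries such as Reactor or RxJava handle concurrency and the full specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

/** Hypothetical publisher: emits 1..count, but never more items than have been requested. */
final class RangePublisher implements Flow.Publisher<Integer> {
    private final int count;
    RangePublisher(int count) { this.count = count; }

    @Override
    public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
        subscriber.onSubscribe(new Flow.Subscription() {
            private int next = 1;
            private long demand = 0;
            private boolean emitting = false;
            private boolean done = false;

            @Override
            public void request(long n) {
                if (done) return;
                if (n <= 0) {
                    done = true;
                    subscriber.onError(new IllegalArgumentException("request must be positive"));
                    return;
                }
                demand = (demand + n < 0) ? Long.MAX_VALUE : demand + n; // saturate on overflow
                if (emitting) return; // re-entrant call from onNext: the outer loop drains it
                emitting = true;
                while (demand > 0 && next <= count && !done) {
                    demand--;
                    subscriber.onNext(next++);
                }
                emitting = false;
                if (next > count && !done) {
                    done = true;
                    subscriber.onComplete();
                }
            }

            @Override
            public void cancel() { done = true; }
        });
    }
}

/** Hypothetical subscriber: asks for `batch` items and re-requests only after handling them all. */
final class BatchingSubscriber implements Flow.Subscriber<Integer> {
    final List<Integer> received = new ArrayList<>();
    final List<Integer> requestPoints = new ArrayList<>(); // items received when each request was made
    boolean completed = false;
    private final int batch;
    private int remainingInBatch;
    private Flow.Subscription subscription;

    BatchingSubscriber(int batch) { this.batch = batch; }

    @Override public void onSubscribe(Flow.Subscription s) { subscription = s; requestBatch(); }

    @Override public void onNext(Integer item) {
        received.add(item);
        if (--remainingInBatch == 0) requestBatch();
    }

    @Override public void onError(Throwable t) { throw new IllegalStateException(t); }

    @Override public void onComplete() { completed = true; }

    private void requestBatch() {
        remainingInBatch = batch;
        requestPoints.add(received.size());
        subscription.request(batch);
    }
}
```

Subscribing a BatchingSubscriber with a batch of 3 to a RangePublisher of 7 delivers all seven items, with new demand signalled only after 0, 3, and 6 items have been handled. The publisher's loop is bounded by `demand`, which is the whole point: emission can never outrun what the subscriber asked for.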

Credit-based flow control. Some protocols (HTTP/2 and therefore gRPC, AMQP, many proprietary internal transports) generalize this into credits: the receiver grants the sender a number of tokens representing how much it is prepared to receive. The sender uses a credit per message (or per byte) and stops when credits run out. The receiver replenishes credits as it processes work. Credits flow upstream; data flows downstream. This is the mechanism underlying HTTP/2's flow control, which counts bytes, replenishes them with WINDOW_UPDATE frames, and operates at both the stream level and the connection level.
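A minimal in-process sketch of the credit idea, using java.util.concurrent.Semaphore as the credit counter. CreditWindow is a hypothetical name, and this models one credit per message rather than any real wire protocol.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical credit window: the receiver grants credits, the sender spends one per message. */
final class CreditWindow {
    private final Semaphore credits;

    CreditWindow(int initialCredits) { credits = new Semaphore(initialCredits); }

    /** Sender side: consumes a credit and returns true, or returns false if none are left. */
    boolean trySend() { return credits.tryAcquire(); }

    /** Sender side: blocks until the receiver has granted at least one credit. */
    void awaitCredit() throws InterruptedException { credits.acquire(); }

    /** Receiver side: grant n more credits after processing n messages. */
    void grant(int n) { credits.release(n); }

    /** Credits currently available to the sender. */
    int available() { return credits.availablePermits(); }
}
```

A sender that calls trySend() before each message stops as soon as the receiver stops granting, so the amount of unprocessed data in flight is capped by the credits the receiver chose to hand out.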

Bounded buffers: making the constraint explicit

A bounded buffer is the forcing function that makes backpressure observable and self-regulating. Set a maximum depth. When the buffer is full, you have three choices: block the producer thread until space is available (put() on an ArrayBlockingQueue or a capacity-bounded LinkedBlockingQueue in Java, a send on a buffered channel in Go), reject the work and return an error to the caller, or drop the oldest item to make room (a ring buffer / circular buffer). Which policy to use depends on whether you can tolerate backpressure propagating to the caller or must reply immediately.
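The three full-buffer policies map onto standard java.util.concurrent.BlockingQueue calls. The helper class below is an illustrative sketch (FullBufferPolicies is not a library type), and the drop-oldest helper assumes a single producer for simplicity.

```java
import java.util.concurrent.BlockingQueue;

/** Illustrative helpers for the three things a producer can do when a bounded queue is full. */
final class FullBufferPolicies {
    /** Block: put() parks the producer until a consumer frees a slot. */
    static <T> void blockUntilSpace(BlockingQueue<T> queue, T item) throws InterruptedException {
        queue.put(item);
    }

    /** Reject: offer() returns false immediately when the queue is full. */
    static <T> boolean rejectIfFull(BlockingQueue<T> queue, T item) {
        return queue.offer(item);
    }

    /** Drop oldest: evict from the head until the new item fits; returns how many were evicted. */
    static <T> int dropOldest(BlockingQueue<T> queue, T item) {
        int evicted = 0;
        while (!queue.offer(item)) {
            if (queue.poll() != null) evicted++;
        }
        return evicted;
    }
}
```

With an ArrayBlockingQueue of capacity 2 holding "a" and "b", rejectIfFull refuses "c", while dropOldest evicts "a" and leaves "b", "c" in the queue.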

Thread pool with a bounded queue

The canonical Java ThreadPoolExecutor example makes this concrete:

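import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bounded work queue: 1000 pending tasks max.
// Threads beyond the 16 core threads start only once the queue is full.
// When the queue is full and all 32 threads are busy, the rejection
// handler runs: CallerRunsPolicy slows the submitter, while AbortPolicy
// throws RejectedExecutionException.
ExecutorService pool = new ThreadPoolExecutor(
    16,                                  // core threads
    32,                                  // max threads
    60, TimeUnit.SECONDS,                // idle timeout for threads above core
    new ArrayBlockingQueue<>(1000),      // bounded!
    new ThreadPoolExecutor.CallerRunsPolicy()
);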

CallerRunsPolicy is a form of backpressure: when the queue is full and every thread up to the maximum is busy, the submitting thread runs the task itself, which slows down whoever is submitting work. AbortPolicy is load shedding: reject work and throw, so the caller can decide what to do. An unbounded LinkedBlockingQueue (which is what Executors.newFixedThreadPool uses) silently buffers everything, and that is the classic mistake.

Choosing the queue bound

How do you size the bound?

target_latency    = 200ms
processing_rate   = 10,000 rps
max_queue_depth   = target_latency × processing_rate
                  = 0.2s × 10,000 = 2,000 tasks

A task that joins the back of a full queue of depth 2,000, drained at 10,000 rps, waits about 200ms before it even starts. If your SLA is 200ms total, the queue depth should be much smaller, because service time has to fit in the same budget. Make the bound match your latency target, not your memory limit.
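The sizing rule is a one-line calculation. In the sketch below, QueueSizing is a hypothetical helper, and the 50ms queueing budget is an assumed split of the 200ms SLA chosen only for illustration.

```java
/** Hypothetical helper: derive a queue bound from a queueing-time budget. */
final class QueueSizing {
    /** Largest depth whose worst-case wait fits in the budget at the given drain rate. */
    static long maxDepth(long queueBudgetMillis, long processingRatePerSecond) {
        if (queueBudgetMillis < 0 || processingRatePerSecond < 0) {
            throw new IllegalArgumentException("budget and rate must be non-negative");
        }
        return queueBudgetMillis * processingRatePerSecond / 1000;
    }

    public static void main(String[] args) {
        // Whole 200ms budget spent waiting: the upper bound from the formula above.
        System.out.println(maxDepth(200, 10_000));
        // Assumed split: only 50ms of the SLA may be spent in the queue.
        System.out.println(maxDepth(50, 10_000));
    }
}
```

Spending the whole 200ms on waiting gives the 2,000-task ceiling; reserving most of the SLA for actual processing shrinks the bound to 500 under the assumed split.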

Load shedding: when you cannot slow the source

Backpressure requires being able to slow the producer. For user-facing HTTP traffic — a mobile app sending requests — you cannot slow the sender. You can only accept or reject.

Load shedding is the deliberate rejection of excess work to protect the system's ability to handle the work it does accept. A server under overload that tries to process every request will process none of them well, and may crash. A server that rejects 30% of requests with 429s and processes the remaining 70% reliably is in a much better position.

flowchart TD
    IN[Incoming requests] --> ADMIT{Admission control}
    ADMIT -->|"within capacity"| WORKER[Worker pool]
    ADMIT -->|"over limit"| SHED["Reject: 429 Too Many Requests\nor 503 Service Unavailable"]
    WORKER --> RESP[Response]
    SHED --> CALLER[Caller retries with backoff]
    style ADMIT fill:#ff6b1a,color:#0a0a0f
    style SHED fill:#a855f7,color:#fff
    style WORKER fill:#15803d,color:#fff
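The admission-control step in the diagram can be sketched with a concurrency limit. AdmissionController is a hypothetical class, and capping in-flight requests with a Semaphore is just one simple admission rule; real services may also look at queue time or CPU.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical admission controller: at most maxInFlight requests run; the rest are shed. */
final class AdmissionController {
    static final int TOO_MANY_REQUESTS = 429;
    private final Semaphore inFlight;

    AdmissionController(int maxInFlight) { inFlight = new Semaphore(maxInFlight); }

    /** Runs the handler if a slot is free; otherwise returns 429 without invoking it. */
    int handle(Supplier<Integer> handler) {
        if (!inFlight.tryAcquire()) {
            return TOO_MANY_REQUESTS; // fast reject: no queueing, no downstream work
        }
        try {
            return handler.get();
        } finally {
            inFlight.release(); // free the slot even if the handler throws
        }
    }
}
```

Because tryAcquire() never blocks, a rejected request costs almost nothing, which is what keeps the admitted requests fast.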

What to drop

When you must drop, the question is which work to discard:

Strategy	What is dropped	Best when
Tail drop	Newest arrivals	Simple; protects work already queued
Head drop / oldest first	Oldest in queue	Work has a deadline; stale requests are already past SLA
Priority-based	Lowest priority	You can classify work; expendable jobs are shed to protect critical ones
Random / sampling	Random subset	Equal treatment; avoids starvation of any one client

In practice: drop the oldest or lowest-priority first, and reject fast (before touching storage or downstream services) to minimize waste. A 429 that returns in 1ms is far cheaper than a 200 that returns in 10 seconds and consumed a database connection the whole time.

Rate limiting as applied flow control

Rate limiting is load shedding with memory: a producer is allowed a quota over a time window (e.g., 1,000 requests per second per API key). Requests within the quota are admitted; requests that exceed it are rejected. See the rate limiter article for the algorithms — token bucket, leaky bucket, fixed window, sliding window. The key insight is that rate limiting is not a replacement for backpressure; it is a complement. Rate limiting at the edge protects the system from any single abusive source; backpressure within the system propagates natural load signals between stages.
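As a rough illustration of the quota idea, here is a minimal token bucket. TokenBucket is a hypothetical class, and the caller passes the current time explicitly (for example from System.nanoTime()) so the behaviour is easy to reason about and test.

```java
/** Hypothetical token bucket: `capacity` bounds bursts, `refillPerSecond` bounds sustained rate. */
final class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastNanos;

    TokenBucket(long capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastNanos = nowNanos;
    }

    /** Spends one token and returns true if the request fits the quota at time nowNanos. */
    synchronized boolean tryAcquire(long nowNanos) {
        long elapsed = Math.max(0, nowNanos - lastNanos);
        tokens = Math.min(capacity, tokens + elapsed * refillPerNano);
        lastNanos = Math.max(lastNanos, nowNanos);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // over quota: the caller should reject, e.g. with a 429
    }
}
```

A bucket with capacity 2 and a refill of one token per second admits two back-to-back requests, rejects the third, and admits one more after 1.5 seconds have passed.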

Queueing theory: why utilization near 100% is dangerous

Little's Law from queueing theory:

L = λ × W

L = average number of items in the system
λ = average arrival rate
W = average time an item spends in the system

Rearranged: W = L / λ. Define utilization as ρ = λ / μ, where μ is the service rate. As ρ approaches 1, average queue depth diverges. For an M/M/1 queue there are two related measures:

Lq = ρ² / (1 − ρ) — mean number of items waiting in queue (not counting the item in service)
L = ρ / (1 − ρ) — mean number of items in the system (queue + item in service)
Mean time in system W = 1 / (μ(1 − ρ)), so W / (1/μ) = 1 / (1 − ρ) — the latency multiplier relative to bare service time

ρ = 0.50 → Lq = 0.5     (latency ≈ 2× service time)
ρ = 0.80 → Lq = 3.2     (latency ≈ 5× service time)
ρ = 0.90 → Lq = 8.1     (latency ≈ 10× service time)
ρ = 0.95 → Lq = 18.05   (latency ≈ 20× service time)
ρ = 0.99 → Lq = 98.01   (latency ≈ 100× service time)
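The table values follow directly from the M/M/1 formulas above. This sketch recomputes them; MM1 is an illustrative class name, and the formulas hold only under M/M/1 assumptions (Poisson arrivals, exponential service times, one server).

```java
/** Illustrative M/M/1 calculator for the formulas in this section; rho is utilization. */
final class MM1 {
    /** Lq = rho^2 / (1 - rho): mean number waiting, excluding the item in service. */
    static double meanInQueue(double rho) { check(rho); return rho * rho / (1 - rho); }

    /** L = rho / (1 - rho): mean number in the system, including the item in service. */
    static double meanInSystem(double rho) { check(rho); return rho / (1 - rho); }

    /** W / (1/mu) = 1 / (1 - rho): time in system as a multiple of bare service time. */
    static double latencyMultiplier(double rho) { check(rho); return 1 / (1 - rho); }

    private static void check(double rho) {
        if (rho < 0 || rho >= 1) throw new IllegalArgumentException("need 0 <= rho < 1");
    }

    public static void main(String[] args) {
        for (double rho : new double[] {0.50, 0.80, 0.90, 0.95, 0.99}) {
            System.out.printf("rho=%.2f Lq=%.2f latency=%.0fx%n",
                    rho, meanInQueue(rho), latencyMultiplier(rho));
        }
    }
}
```

Note that the step from 0.95 to 0.99 adds only four points of utilization but multiplies latency by five.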

The takeaway is not that you need to run at 50% utilization. It is that latency grows non-linearly with utilization, and in a system with variable request sizes and arrival times the curve turns steep above roughly 80%. Target 70–80% as a steady-state operating point and treat anything above that as an alarm.

flowchart LR
    U0["Utilization 50%\nLatency ≈ 2×"] --> U80["Utilization 80%\nLatency ≈ 5×"]
    U80 --> U90["Utilization 90%\nLatency ≈ 10×"]
    U90 --> U99["Utilization 99%\nLatency ≈ 100×"]
    style U0 fill:#15803d,color:#fff
    style U80 fill:#ffaa00,color:#0a0a0f
    style U90 fill:#ff6b1a,color:#0a0a0f
    style U99 fill:#ff2e88,color:#fff

This is why a system that looks fine at 85% CPU starts returning 2-second responses: arrivals are random, service times vary, and the math is merciless.

End-to-end vs hop-by-hop flow control

Backpressure can be applied at each hop of a pipeline (hop-by-hop) or only at the edges (end-to-end). Both are needed.

Hop-by-hop control prevents any single stage from accumulating an unbounded backlog. Each intermediate stage has bounded buffers. A slow terminal stage slows the stage before it, which slows the stage before that. The pressure propagates backwards through the pipeline.

End-to-end control is needed because hop-by-hop signals can be slow to propagate. If you have 10 pipeline stages and each stage takes 100ms to respond to backpressure, it takes 1 second for a terminal slowdown to reduce the producer's rate. During that second, each intermediate stage is filling its buffer. End-to-end flow control — for example, a global rate limit at the API gateway that is informed by downstream service health — can react faster.

In practice, systems use both: bounded queues at each stage (hop-by-hop) plus a high-level admission control or rate limiter at the entry point (end-to-end).

sequenceDiagram
    participant P as Producer
    participant S1 as Stage 1
    participant S2 as Stage 2
    participant C as Consumer (slow)
    P->>S1: emit 15k/s
    S1->>S2: forward
    S2->>C: forward
    C-->>S2: queue full (block / reject)
    S2-->>S1: queue full (block / reject)
    S1-->>P: queue full — slow down
    Note over P,C: Backpressure propagated hop-by-hop

Failure modes

Retry storms

When load shedding returns 429s, clients should back off and retry later. If they do not — if they retry immediately in a tight loop — they amplify the load that caused the 429. A system at 110% capacity receiving immediate retries is now at 120%, then 130%. This is a retry storm, and it can drive a recoverable overload into a full outage.

The defense is exponential backoff with jitter. Each retry attempt waits min(cap, base × 2^attempt) + random(0, jitter) before retrying. Jitter breaks the synchronized retry thundering herd. Circuit breakers (see below) prevent retries from reaching a known-down dependency at all.
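The delay formula can be sketched as follows. Backoff is a hypothetical helper, all durations are in milliseconds, and the parameter values a client should use (base, cap, jitter) are left to the caller.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper implementing min(cap, base * 2^attempt) + random(0, jitter). */
final class Backoff {
    /** Deterministic part of the delay, computed without overflowing long. */
    static long exponentialMillis(long baseMillis, long capMillis, int attempt) {
        if (baseMillis <= 0 || capMillis <= 0 || attempt < 0) {
            throw new IllegalArgumentException("base and cap must be positive, attempt non-negative");
        }
        // If base * 2^attempt would exceed the cap (or overflow), the answer is the cap.
        if (attempt >= 63 || baseMillis > (capMillis >> attempt)) return capMillis;
        return baseMillis << attempt;
    }

    /** Full delay: the exponential part plus a uniform random jitter in [0, jitterMillis]. */
    static long delayMillis(long baseMillis, long capMillis, long jitterMillis, int attempt) {
        long jitter = jitterMillis > 0
                ? ThreadLocalRandom.current().nextLong(jitterMillis + 1)
                : 0;
        return exponentialMillis(baseMillis, capMillis, attempt) + jitter;
    }
}
```

With a base of 100ms and a cap of 10 seconds, attempts 0, 3, and 6 wait 100ms, 800ms, and 6,400ms before jitter, and every attempt from 7 onward waits the cap. The random term is what keeps clients that failed at the same instant from retrying at the same instant.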

Head-of-line blocking

In a single queue, a slow request at the front prevents all requests behind it from being served, even if those requests would complete in microseconds. HTTP/1.1 pipelining suffers from this. HTTP/2 multiplexing removes it at the HTTP layer, but a lost TCP segment still stalls every stream on the connection; QUIC delivers each stream independently to remove that transport-level blocking as well. In application queues, the mitigation is priority queues (fast tasks in a separate lane) or timeouts that abandon slow tasks and return their resources to the pool.

Bufferbloat in distributed systems

The same dynamics that cause network bufferbloat appear inside service meshes. A slow downstream service causes the upstream's HTTP client connection pool to fill (each connection is occupied waiting for the response). New requests queue waiting for a connection. Latency climbs. The upstream service reports high latency to its callers, who queue requests waiting for the upstream to respond. Eventually every service in the call graph is full of waiting threads.

The fix: timeouts at every hop, bounded connection pools, and circuit breakers. A timeout means the slow dependency gets a fixed time budget; when it expires, the waiting thread is freed and can serve other requests. A circuit breaker trips when a dependency's error rate exceeds a threshold, and fast-fails subsequent requests — returning an error immediately rather than occupying a thread for the timeout duration. This is backpressure's cousin: the circuit breaker propagates the "I am unavailable" signal without the caller having to wait for the evidence.
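A circuit breaker is a small state machine. The sketch below is a simplification: it trips on a count of consecutive failures rather than the error-rate threshold described above, CircuitBreaker is a hypothetical class, and the caller supplies the current time in milliseconds.

```java
/** Hypothetical circuit breaker that trips on consecutive failures (a simplification). */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** True if a call may proceed at nowMillis; false means fail fast without calling. */
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN; // cool-down elapsed: let exactly one probe through
            return true;
        }
        return state == State.CLOSED;
    }

    /** Report a successful call: the breaker closes and the failure count resets. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** Report a failed call: a failed probe or too many failures in a row opens the breaker. */
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }

    synchronized State state() { return state; }
}
```

While the breaker is open, allowRequest returns false immediately, so no thread is held for the timeout duration. After the cool-down a single probe decides whether the breaker closes again or reopens.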

Cascading failure

A slow database causes a slow application service, which causes a slow API gateway, which causes client retries, which add more load to the already-slow application service. Each stage's buffer absorbs the spike briefly, then fills, and the pressure propagates upstream faster than any human can intervene.

flowchart LR
    DB["Database (slow)"] -->|latency spike| APP[App service]
    APP -->|threads fill up| GW[API gateway]
    GW -->|504s| CLI[Clients]
    CLI -->|immediate retries| GW
    GW -->|more load| APP
    APP -->|more load| DB
    style DB fill:#ff2e88,color:#fff
    style APP fill:#ff6b1a,color:#0a0a0f
    style GW fill:#a855f7,color:#fff
    style CLI fill:#0e7490,color:#fff

The defense is to have all the above mechanisms in place before the failure: bounded queues, load shedding, backoff, circuit breakers, and utilization headroom.

Putting it together: a concrete example

Consider an ingestion service that reads from a Kafka topic, transforms events, and writes to a database:

flowchart LR
    KAFKA[Kafka topic] -->|"poll()"| SVC[Ingestion service\n16 worker threads]
    SVC --> BQ["Bounded transform queue\n(max 512)"]
    BQ --> DB[(Database\n5k writes/s max)]
    DB -.slow writes.-> BQ
    BQ -. "queue full - pause poll" .-> KAFKA
    style KAFKA fill:#0e7490,color:#fff
    style SVC fill:#ff6b1a,color:#0a0a0f
    style BQ fill:#15803d,color:#fff
    style DB fill:#a855f7,color:#fff

The database is the slowest stage: 5,000 writes/sec max.
The bounded transform queue holds at most 512 pending write operations.
When the queue fills (a database hiccup causes writes to slow), the worker threads block trying to enqueue.
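Blocked worker threads stop calling Kafka poll(). (If a stall can outlast max.poll.interval.ms, the group coordinator treats the consumer as failed and rebalances its partitions; calling pause() on the assigned partitions and continuing to poll avoids that.)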
Kafka's consumer lag grows — messages sit in Kafka rather than piling in memory.
Kafka retains them reliably (within its retention window).
When the database recovers, writes resume, the queue drains, threads unblock, and polling resumes.

This is the textbook backpressure success story: the capacity constraint at the database propagates cleanly back to Kafka, which acts as the durable, bounded buffer. No unbounded in-process queues, no OOM, no lost messages.

The failure mode to avoid: an unbounded in-process queue between the workers and the database. The database slows, the queue grows, the process runs out of heap, and crashes — losing all in-flight data.

Mechanism comparison

Mechanism	Scope	Best for	Cost
TCP receive window	Network	Any TCP stream	Built-in, free
Kafka consumer pull	Service-to-service	Async pipelines with durable backlog	Kafka infra + consumer lag monitoring
Reactive Streams (request(n))	In-process	JVM stream pipelines	Library complexity
Bounded blocking queue	In-process	Thread pool work dispatch	Lock contention under high load
Credit-based (HTTP/2, gRPC)	RPC	Synchronous service calls	Requires protocol support
Rate limiting (token bucket)	Edge / per-client	User-facing APIs, multi-tenant	Quota management
Load shedding (429/503)	Edge	User traffic you cannot throttle	Must handle gracefully on client
Circuit breaker	Service mesh	Downstream dependency isolation	False positives if thresholds mis-set

Things to discuss in an interview

Why is an unbounded queue not a solution? It converts a throughput problem into a latency and memory problem. The system appears healthy while queued latency silently grows.
Pull vs push: why pull-based (demand-driven) flow is easier to reason about and compose.
How do you size a bounded queue? Match the bound to your latency budget, not your memory limit.
Load shedding strategy: what to drop, how to signal the client, and how clients should behave (backoff).
Retry storms: why bounded queues + load shedding break down without exponential backoff + jitter on the client side.
Little's Law intuition: why high utilization causes super-linear latency growth, not linear.
End-to-end vs hop-by-hop: both are needed; neither alone is sufficient.

Things you should now be able to answer

Why does an unbounded queue make the problem worse, not better?
What is the difference between backpressure and load shedding, and when do you use each?
How does TCP's receive window implement backpressure?
Why does Kafka's pull model give consumers natural backpressure without explicit signaling?
What does Little's Law tell you about operating a system at 95% utilization?
How does a retry storm turn a recoverable overload into an outage?
What is the role of a circuit breaker relative to backpressure?
How do you choose what to drop when you must shed load?

Frequently asked questions

▸What is backpressure, and how does it differ from load shedding?

Backpressure is a mechanism by which a slow consumer signals upstream stages to reduce their emission rate, propagating the constraint through the pipeline until the producer slows down. Load shedding is used when you cannot slow the source — such as inbound user HTTP traffic — so instead you reject excess requests outright with 429 or 503 responses and let callers retry later. Backpressure preserves all work; load shedding deliberately discards some to protect the work it does accept.

▸Why is an unbounded queue dangerous rather than a safe buffer?

An unbounded queue converts a throughput problem into a latency and memory problem. Queue latency equals queue depth divided by consumer throughput, so if a 15k msg/s producer feeds a 10k msg/s consumer, the queue grows by 5,000 messages per second, accumulating 300,000 messages in 60 seconds (roughly 300 MB if each message averages 1 KB). The system reports full availability while queued latency silently climbs to 30 seconds, until the process runs out of memory and crashes.

▸How does Kafka's pull model implement backpressure without an explicit signal?

Kafka consumers call poll() to request a batch of records, with the batch size capped by consumer configuration such as max.poll.records and fetch.max.bytes; the broker returns at most that amount. If the consumer slows down because downstream storage is slow, it simply polls less frequently, and the broker retains unconsumed messages on disk within its retention window. The consumer's poll rate is the throttle, making the architecture naturally backpressure-friendly without a separate signal channel.

▸What does queueing theory say about running a system at 95% utilization?

For an M/M/1 queue at utilization rho = 0.95, the mean number of items waiting in queue (Lq) is 18.05 and the latency multiplier is roughly 20 times the bare service time. At rho = 0.80 latency is already 5 times service time. The article recommends targeting 70-80% utilization as a steady-state operating point and treating anything above that as an alarm, because latency grows non-linearly — not linearly — as utilization climbs.

▸How should you size a bounded queue?

Size the bound to your latency budget, not your memory limit. The formula is max_queue_depth = target_latency x processing_rate. For a 200ms latency target at 10,000 requests per second that yields a depth of 2,000 tasks — and if your SLA is 200ms total, the actual bound should be much smaller since queued tasks haven't even started processing yet.
