MODULE 11 / 12crash course

~/roadmap/11-reliability-patterns

◆◆Intermediate

Reliability and Failure Patterns

Timeouts, retries with backoff, circuit breakers, bulkheads, deadlines, hedged requests, and graceful degradation — the patterns that keep distributed systems standing.

15 min read2026-01-25Ironclad Academy

#reliability #resilience #distributed-systems #fundamentals

"Anything that can go wrong, will go wrong — at scale, simultaneously, in production, on a Friday afternoon."

This module covers the patterns engineers use to keep systems standing when their dependencies fall over. None of these patterns are magic. Each one is a small, tested idea that — applied consistently — turns "one slow downstream takes the whole site down" into "one slow downstream causes a graph spike on a dashboard somewhere."

Failure is not the exception

In a single-machine program, things mostly work. In a distributed system with N services, the probability that all of them are healthy at any given moment is (1 - p_fail)^N. With 100 services each at 99.9% availability, the system as a whole is at 90.5%.

flowchart LR
    A[Service A<br/>99.9% up] --> B[Service B<br/>99.9% up]
    B --> C[Service C<br/>99.9% up]
    C --> D[Service D<br/>99.9% up]
    Note["Combined: 0.999^4 = 99.6% — ~3 hours of downtime/month"]
    D --> Note
    style A fill:#15803d,color:#fff
    style D fill:#15803d,color:#fff

You cannot prevent failure. You can only limit its blast radius and fail fast and predictably. That's what these patterns are for.

Timeouts: the most underused pattern in software

Every network call should have a timeout. Every. Single. One.

# WRONG — hangs forever if the API is slow
resp = requests.get('https://payments.example.com/charge')

# RIGHT
resp = requests.get(
    'https://payments.example.com/charge',
    timeout=(2.0, 5.0)   # connect 2s, read 5s
)

Without a timeout, a slow downstream pins one of your worker threads for minutes. A hundred of those slow in-flight requests fill your thread pool. Now your service is down even though your code is fine — you just never told it when to give up.

Choosing timeout values

flowchart TD
    A[How fast is the dependency<br/>at p99 normally?] --> B{Set timeout to<br/>~3× p99}
    B --> C[Below: timeouts<br/>fire on healthy slow days]
    B --> D[Above: late detection<br/>of degradation]
    style B fill:#ff6b1a,color:#0a0a0f

A few rules worth internalizing. Timeouts are non-optional — no infinite waits, even in "internal" services. Start tight and loosen as needed; it's safer to fail fast than to fail slow. And always cascade timeouts: if A gives up at 5s and B waits 10s, B is doing pointless work after A has already returned an error to the user. Finally, distinguish connect vs read timeouts — a connect failure means "couldn't reach you"; a read failure means "you're slow."

Deadline propagation

Timeouts work at a single hop. The mature evolution is a deadline: the entry point sets an absolute wall-clock expiry time and passes it along on every downstream call. Each service checks whether any budget remains before making the next hop.

sequenceDiagram
    participant U as User
    participant A as "API (deadline 5s)"
    participant B as "Service B (deadline 4.5s)"
    participant C as "Service C (deadline 4s)"
    U->>A: request (5s SLA)
    A->>B: with deadline 4.5s
    B->>C: with deadline 4s
    Note over C: takes too long - aborts before A times out
    C-->>B: error (deadline exceeded)
    B-->>A: error
    A-->>U: error (fast, no thread held)

Without deadline propagation, you waste server CPU computing results the caller has already given up on. gRPC and most modern RPC frameworks have this built in — you just need to wire it through.

Retries (with care)

Retries feel obvious: you got an error, try again. But an incorrect retry policy turns a transient blip into a self-inflicted denial-of-service.

The retry algorithm

def call_with_retries(fn, max_retries=3, base_delay=0.1, max_delay=10.0):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries:
                raise
            # exponential backoff with jitter
            sleep_for = min(max_delay, base_delay * (2 ** attempt))
            sleep_for *= random.uniform(0.5, 1.5)  # jitter!
            time.sleep(sleep_for)

Three properties matter here. Exponential backoff (delay = base * 2^attempt) means you don't hammer a struggling service. Jitter — multiplying the delay by a random factor — prevents the thundering-herd problem where every client wakes up and retries at the same instant, overwhelming the server the moment it starts to recover. And a cap on attempts keeps you from retrying forever; three is usually enough.

What's retryable?

flowchart TD
    E[Error received] --> CHECK{Is it retryable?}
    CHECK -->|Network failure| RETRY[Retry]
    CHECK -->|429 rate limited| RETRY2["Retry after Retry-After"]
    CHECK -->|503 unavailable| RETRY3[Retry with backoff]
    CHECK -->|500 internal error| MAYBE{Idempotent?}
    CHECK -->|400/404/409| NO[Do not retry]
    CHECK -->|401/403| NO2[Do not retry]
    MAYBE -->|Yes| RETRY4[Retry]
    MAYBE -->|No| NO3[Do not retry]
    style RETRY fill:#15803d,color:#fff
    style NO fill:#ff2e88,color:#fff

Never retry a non-idempotent operation that may have succeeded — that's how you charge a credit card twice. Use idempotency keys (covered in APIs) so retries are always safe.

Retry storms (the classic outage)

Here's the failure mode that trips teams up. With five service layers each retrying three times, you get 3⁵ = 243× amplification at the bottom of the chain during an outage. The recovering service gets hammered into the ground again the moment it starts to come back up.

sequenceDiagram
    participant A as Service A
    participant B as Service B
    participant C as "Service C (overloaded)"
    A->>B: request
    B->>C: request
    C-->>B: fail
    B->>C: retry 1
    C-->>B: fail
    B->>C: retry 2
    C-->>B: fail
    B-->>A: error
    A->>B: retry 1
    Note over A,C: 2 calls to B x 3 calls to C each = 6x load amplification across two hops

The mitigations: retry only at the outermost layer (inner services propagate errors rather than retrying), use a token-bucket "retry budget" that caps retries to N% of normal RPS, and use circuit breakers — which is where we're headed next.

Circuit breakers

A circuit breaker watches calls to a dependency and opens — stops sending traffic — when failures exceed a threshold, giving the downstream time to recover.

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure ratio > threshold
    Open --> HalfOpen: cooldown elapsed
    HalfOpen --> Closed: probe succeeds
    HalfOpen --> Open: probe fails

The three states: Closed means traffic flows normally and you're tracking success/fail counts in a rolling window. Open means you've hit the threshold — stop calling, fail fast, and start a cooldown timer. Half-open means the cooldown elapsed; send a single probe request by default and if it works, close the breaker; if not, reopen.

Why this matters in practice:

flowchart LR
    NB[No breaker] -->|"all 1000 reqs/s<br/>hit a dying service"| F1[Service crashes harder]
    BR[Breaker open] -->|"0 reqs flow"| HEAL[Service recovers]
    HEAL --> CLOSE[Breaker closes]
    style F1 fill:#ff2e88,color:#fff
    style HEAL fill:#15803d,color:#fff

Without a breaker, every retry hammers a dying service. With one, the dying service gets a quiet window to recover.

Tuning a breaker

Parameter	Typical	Watch out for
Failure ratio threshold	50% over 1-min window	Too low = flaps; too high = slow to react
Minimum requests	20–100	Without this, 1 of 1 errors trips it
Cooldown	5–30s	Too short = thrash; too long = slow recovery
Probe count in half-open	1–5	More = safer reopen, less responsive

Most service meshes (Envoy, Istio, Linkerd) ship circuit breakers. You configure the thresholds; the proxy enforces them.

Bulkheads

The bulkheads of a ship are watertight compartments — if one floods, the rest stay dry. In software: isolate resources so failure in one part doesn't sink the whole ship.

flowchart TD
    subgraph Bad[No bulkhead]
    POOL1[(Shared connection pool: 100)]
    POOL1 --> A1[Service A]
    POOL1 --> B1[Service B]
    POOL1 --> C1[Service C - slow]
    end
    subgraph Good[With bulkheads]
    POOL2A[(Pool A: 30)] --> A2[Service A]
    POOL2B[(Pool B: 30)] --> B2[Service B]
    POOL2C[(Pool C: 40)] --> C2[Service C - slow]
    end
    style Bad fill:#ff2e88,color:#fff
    style Good fill:#15803d,color:#fff

In the bad case, Service C goes slow, monopolizes all 100 connections, and A and B starve. In the good case, C exhausts its 40 connections and A and B keep working.

Common places to draw bulkhead boundaries: per-downstream connection pools, per-tenant queues or thread pools, separate node pools for batch vs interactive workloads, and separate database connections for reads vs writes.

Hedged requests (the latency superpower)

If you have an endpoint with a long tail — say, most calls finish at 50ms but p99 is 500ms — you can cut that p99 dramatically by sending the same request to two replicas in parallel and using whichever responds first.

sequenceDiagram
    participant C as Client
    participant R1 as Replica 1
    participant R2 as Replica 2
    C->>R1: request
    Note over C: wait p95 (~150ms)
    C->>R2: hedge request
    R1-->>C: slow (450ms)
    R2-->>C: fast (50ms) ← use this

You don't fire both immediately. You wait until the first response is overdue — the p95 latency is a good threshold — and only then send the hedge. The cost is roughly 5% extra requests, and the payoff is a 5–10× reduction in p99 latency. Jeff Dean and Luiz André Barroso wrote about this in their 2013 Google paper "The Tail at Scale" — it's fundamental for low-latency search and storage systems.

Load shedding

When you can't keep up, shed load explicitly rather than letting everything slow down for everyone.

flowchart TD
    REQ[Incoming request] --> CHECK{System load<br/>> threshold?}
    CHECK -->|No| PROC[Process normally]
    CHECK -->|Yes| PRI{Priority?}
    PRI -->|High| PROC
    PRI -->|Low| SHED[Shed: 503 + Retry-After]
    style SHED fill:#ff6b1a,color:#0a0a0f

A few implementation shapes: reject immediately when the inbound queue depth exceeds N; cap in-flight requests and reject new ones once you're at the limit; use per-priority shedding to keep payments running while dropping analytics; or use adaptive concurrency (Netflix's "concurrency-limits" library) which reduces limits automatically when latency rises.

The thing to avoid is implicit queueing. A 30-second timeout on an overloaded server means 30 seconds of wasted CPU per request before saying no. Rejecting in 1ms is strictly better.

Graceful degradation

When a non-critical dependency dies, don't take the whole feature down with it. Return a degraded response.

Situation	Don't do	Do
Recommendations service down	Show error page	Show static "popular items"
Personalization down	500	Show generic homepage
Search ranking down	500	Show results in lexical order
Image CDN down	Broken UI	Show placeholders
Real-time updates down	Page won't load	Page loads with stale data + "refreshing..." indicator

The architectural shape is consistent: every call to a non-critical dependency is wrapped in a fallback. The fallback returns cached data, a default value, or a simpler computation. The caller gets something useful, not an error.

flowchart LR
    REQ[Request] --> DEP{Dependency<br/>available?}
    DEP -->|Yes| LIVE[Live response]
    DEP -->|No| FB{Fallback?}
    FB -->|Cached data| CACHE[Return stale]
    FB -->|Default| DEF[Return default]
    FB -->|Critical path| ERR[Propagate error]
    style LIVE fill:#15803d,color:#fff
    style CACHE fill:#ffaa00,color:#0a0a0f
    style DEF fill:#0e7490,color:#fff
    style ERR fill:#ff2e88,color:#fff

The key judgment call is deciding which dependencies are critical (errors propagate) vs non-critical (fallback kicks in). Make that decision explicit at code-review time, not during an incident.

Idempotency keys (revisited)

Reliability and APIs intersect at idempotency. Without idempotent operations, you cannot safely retry. Without safe retries, you cannot tolerate transient failures.

# bad: not idempotent
charge_card(amount=4200)

# good: idempotent — second call returns the same result
charge_card(amount=4200, idempotency_key='ord-7f3c2e80')

Server-side, the key is stored with the response for ~24h. Repeated calls with the same key return the cached response — no double-charge.

Health checks (and the trap)

Health checks tell load balancers whether a backend is ready for traffic. Two flavors:

Liveness (/livez): "I'm running — don't kill me." Checks that the process is up.
Readiness (/readyz): "I'm ready to serve — send me traffic." Checks that database connections, caches, etc. are reachable.

The trap is checking downstream health in your readiness probe.

flowchart LR
    A["Service A<br/>readyz checks B"] -->|"B is slow"| MARK[A marked unready]
    MARK --> CASCADE[All A pods<br/>marked unready]
    CASCADE --> OUT[Whole region down]
    style OUT fill:#ff2e88,color:#fff

If A's readiness probe fails whenever B is slow, your load balancer pulls A out of rotation — even though A could have served degraded behavior gracefully. B's outage cascades into A's outage.

The rule: liveness and readiness probes should check only what this instance can fix by restarting. Downstream health belongs behind circuit breakers, not health checks.

Rate limiting

Rate limiting is reliability for upstream callers — protecting your service from misbehaving clients, intentional or not.

Two algorithms you'll see in the wild:

Token bucket

flowchart LR
    REFILL[Refill at R per sec] --> BUCKET[(Bucket: capacity N)]
    REQ[Request] --> CHECK{Tokens?}
    BUCKET --> CHECK
    CHECK -->|Yes| ACCEPT[Accept, decrement]
    CHECK -->|No| REJECT[429 Too Many Requests]
    style ACCEPT fill:#15803d,color:#fff
    style REJECT fill:#ff2e88,color:#fff

The bucket holds N tokens and refills at R per second. Each request consumes one. You can burst up to N requests quickly, then the rate settles to R.

Sliding window

Count requests in the last 60s; reject when over limit. Smoother than fixed windows, which let callers double-burst across a window boundary, but requires slightly more state to track.

Enforce rate limits at the edge — API gateway or CDN, per IP and per API key. Rejecting there is far cheaper than letting the request propagate down before you say no.

The deep dive is in the rate limiter article.

Outboxes and the durable-write pattern

You write to your database, then publish a message. The DB write succeeds; the publish fails. Now your database and message bus disagree about what happened.

The fix is the transactional outbox.

sequenceDiagram
    participant App
    participant DB
    participant OB as outbox table
    participant Q as Kafka
    App->>DB: BEGIN
    App->>DB: UPDATE balance ...
    App->>OB: INSERT event row
    App->>DB: COMMIT (atomic!)
    Note over OB,Q: separate poller / CDC
    OB->>Q: publish, mark sent

The DB row and the outbox row are written in the same transaction. A separate process — or a CDC stream reading the DB's WAL — ships the outbox to Kafka. If the publisher crashes before sending, it picks up the unsent row on restart. If Kafka's ack is lost, the publisher retries and the consumer deduplicates on the event's stable ID.

This gives you at-least-once delivery without ever letting your DB and queue drift apart. See Message Queues for more.

Chaos engineering

You cannot be confident your reliability patterns work until you break things on purpose.

Game days are planned exercises where teams kill a service, partition a network, or fill a disk in production. Chaos monkeys do this continuously and randomly — Netflix invented the concept and has run it in production for years. Failure injection adds deliberate latency or errors at a load balancer to test downstream resilience without fully killing anything.

The point isn't sadism. Untested code paths fail in production, and your recovery paths are code paths too. Unverified replica promotion, dusty runbooks, failover scripts that haven't been run in two years — chaos engineering surfaces these before a real incident does.

SLOs, SLIs, and error budgets

Reliability is a business decision, not a binary one. The framework Google made standard:

SLI (Service Level Indicator): a measurable signal — e.g., 99th-percentile latency, success ratio, queue depth.
SLO (Service Level Objective): a target on the SLI — e.g., 99.9% of requests complete within 200ms.
Error budget: 1 - SLO. At 99.9%, you have 0.1% of requests to "spend" — roughly 43 minutes of downtime per month.

flowchart LR
    A[100% reliable] --> B[Slow feature velocity]
    B --> C[Competitor catches up]
    D[99.9% SLO] --> E[Error budget]
    E --> F[Spend on shipping]
    F --> G[Spend on chaos exercises]
    style A fill:#ff2e88,color:#fff
    style D fill:#15803d,color:#fff

When you're under budget — reliability is exceeding the SLO — you can ship faster and take more risk. When you've burned the budget, freeze risky changes until reliability recovers. The budget makes the tradeoff explicit so the conversation between engineering and product is about data, not gut feeling.

A reliability checklist

Before any service goes to production:

Each item costs an hour to implement and prevents a category of incident.

Things you should now be able to answer

A downstream service starts taking 5 seconds per request. Why does that take your service down too, even if your code is perfect?
A naive retry policy (3 retries everywhere) can make outages worse. Why? What's the fix?
Your circuit breaker is tripping every 30 seconds. What's wrong with the tuning?
Why shouldn't your readiness probe check whether downstream services are healthy?
A user gets billed twice because their phone retried the request. What's the simplest fix?
Your SLO is 99.9% and you've already burned 70% of this month's error budget. What changes about how you operate?

→ Next: Observability

// FAQ

Frequently asked questions

▸What is a circuit breaker and what are its three states?

A circuit breaker watches calls to a dependency and stops sending traffic when failures exceed a threshold. The three states are Closed (traffic flows normally while success/fail counts are tracked in a rolling window), Open (threshold exceeded — fail fast and start a cooldown timer), and Half-open (cooldown elapsed — send a single probe request and close the breaker if it succeeds, reopen if it fails).

▸What is deadline propagation and how does it differ from a timeout?

A timeout applies at a single hop. A deadline is an absolute wall-clock expiry set at the entry point and passed along on every downstream call so each service can abort before making the next hop. Without deadline propagation, downstream services waste CPU computing results the caller has already given up on; gRPC and most modern RPC frameworks have this built in.

▸Why is jitter required in exponential backoff retry logic?

Without jitter, every client that experienced the same failure wakes up and retries at the same instant, overwhelming the server the moment it starts to recover. Multiplying the backoff delay by a random factor (the article uses uniform 0.5 to 1.5) staggers the retry wave and prevents that thundering-herd problem.

▸How much request amplification can retry storms create, and what are the mitigations?

With five service layers each retrying three times, you get 3 to the power of 5, or 243 times amplification at the bottom of the chain during an outage. The article recommends retrying only at the outermost layer so inner services propagate errors instead, capping retries with a token-bucket retry budget at a percentage of normal RPS, and using circuit breakers to stop traffic to recovering services.

▸What is the error budget at a 99.9% SLO, and how should teams use it?

At a 99.9% SLO the error budget is 0.1% of requests, which translates to roughly 43 minutes of downtime per month. When reliability exceeds the SLO and budget remains, teams can ship faster and take more risk; when the budget is burned, risky changes should be frozen until reliability recovers.

← previous module

CAP, Consistency, and Replication

next module →

Observability — Logs, Metrics, Tracing