Reliability and Failure Patterns
Timeouts, retries with backoff, circuit breakers, bulkheads, deadlines, hedged requests, and graceful degradation — the patterns that keep distributed systems standing.
"Anything that can go wrong, will go wrong — at scale, simultaneously, in production, on a Friday afternoon."
This module covers the patterns engineers use to keep systems standing when their dependencies fall over. None of these patterns are magic. Each one is a small, tested idea that — applied consistently — turns "one slow downstream takes the whole site down" into "one slow downstream causes a graph spike on a dashboard somewhere."
Failure is not the exception
In a single-machine program, things mostly work. In a distributed system with N services, the probability that all of them are healthy at any given moment is (1 - p_fail)^N. With 100 services each at 99.9% availability, the system as a whole is at 90.5%.
flowchart LR
A[Service A<br/>99.9% up] --> B[Service B<br/>99.9% up]
B --> C[Service C<br/>99.9% up]
C --> D[Service D<br/>99.9% up]
Note["Combined: 0.999^4 = 99.6% — ~3 hours of downtime/month"]
D --> Note
style A fill:#15803d,color:#fff
style D fill:#15803d,color:#fff
You cannot prevent failure. You can only limit its blast radius and fail fast and predictably. That's what these patterns are for.
Timeouts: the most underused pattern in software
Every network call should have a timeout. Every. Single. One.
# WRONG — hangs forever if the API is slow
resp = requests.get('https://payments.example.com/charge')
# RIGHT
resp = requests.get(
'https://payments.example.com/charge',
timeout=(2.0, 5.0) # connect 2s, read 5s
)
Without a timeout, a slow downstream pins one of your worker threads for minutes. A hundred of those slow in-flight requests fill your thread pool. Now your service is down even though your code is fine — you just never told it when to give up.
Choosing timeout values
flowchart TD
A[How fast is the dependency<br/>at p99 normally?] --> B{Set timeout to<br/>~3× p99}
B --> C[Below: timeouts<br/>fire on healthy slow days]
B --> D[Above: late detection<br/>of degradation]
style B fill:#ff6b1a,color:#0a0a0f
A few rules worth internalizing. Timeouts are non-optional — no infinite waits, even in "internal" services. Start tight and loosen as needed; it's safer to fail fast than to fail slow. And always cascade timeouts: if A gives up at 5s and B waits 10s, B is doing pointless work after A has already returned an error to the user. Finally, distinguish connect vs read timeouts — a connect failure means "couldn't reach you"; a read failure means "you're slow."
Deadline propagation
Timeouts work at a single hop. The mature evolution is a deadline: the entry point sets an absolute wall-clock expiry time and passes it along on every downstream call. Each service checks whether any budget remains before making the next hop.
sequenceDiagram
participant U as User
participant A as "API (deadline 5s)"
participant B as "Service B (deadline 4.5s)"
participant C as "Service C (deadline 4s)"
U->>A: request (5s SLA)
A->>B: with deadline 4.5s
B->>C: with deadline 4s
Note over C: takes too long - aborts before A times out
C-->>B: error (deadline exceeded)
B-->>A: error
A-->>U: error (fast, no thread held)
Without deadline propagation, you waste server CPU computing results the caller has already given up on. gRPC and most modern RPC frameworks have this built in — you just need to wire it through.
Retries (with care)
Retries feel obvious: you got an error, try again. But an incorrect retry policy turns a transient blip into a self-inflicted denial-of-service.
The retry algorithm
def call_with_retries(fn, max_retries=3, base_delay=0.1, max_delay=10.0):
for attempt in range(max_retries + 1):
try:
return fn()
except RetryableError:
if attempt == max_retries:
raise
# exponential backoff with jitter
sleep_for = min(max_delay, base_delay * (2 ** attempt))
sleep_for *= random.uniform(0.5, 1.5) # jitter!
time.sleep(sleep_for)
Three properties matter here. Exponential backoff (delay = base * 2^attempt) means you don't hammer a struggling service. Jitter — multiplying the delay by a random factor — prevents the thundering-herd problem where every client wakes up and retries at the same instant, overwhelming the server the moment it starts to recover. And a cap on attempts keeps you from retrying forever; three is usually enough.
What's retryable?
flowchart TD
E[Error received] --> CHECK{Is it retryable?}
CHECK -->|Network failure| RETRY[Retry]
CHECK -->|429 rate limited| RETRY2["Retry after Retry-After"]
CHECK -->|503 unavailable| RETRY3[Retry with backoff]
CHECK -->|500 internal error| MAYBE{Idempotent?}
CHECK -->|400/404/409| NO[Do not retry]
CHECK -->|401/403| NO2[Do not retry]
MAYBE -->|Yes| RETRY4[Retry]
MAYBE -->|No| NO3[Do not retry]
style RETRY fill:#15803d,color:#fff
style NO fill:#ff2e88,color:#fff
Never retry a non-idempotent operation that may have succeeded — that's how you charge a credit card twice. Use idempotency keys (covered in APIs) so retries are always safe.
Retry storms (the classic outage)
Here's the failure mode that trips teams up. With five service layers each retrying three times, you get 3⁵ = 243× amplification at the bottom of the chain during an outage. The recovering service gets hammered into the ground again the moment it starts to come back up.
sequenceDiagram
participant A as Service A
participant B as Service B
participant C as "Service C (overloaded)"
A->>B: request
B->>C: request
C-->>B: fail
B->>C: retry 1
C-->>B: fail
B->>C: retry 2
C-->>B: fail
B-->>A: error
A->>B: retry 1
Note over A,C: 2 calls to B x 3 calls to C each = 6x load amplification across two hops
The mitigations: retry only at the outermost layer (inner services propagate errors rather than retrying), use a token-bucket "retry budget" that caps retries to N% of normal RPS, and use circuit breakers — which is where we're headed next.
Circuit breakers
A circuit breaker watches calls to a dependency and opens — stops sending traffic — when failures exceed a threshold, giving the downstream time to recover.
stateDiagram-v2
[*] --> Closed
Closed --> Open: failure ratio > threshold
Open --> HalfOpen: cooldown elapsed
HalfOpen --> Closed: probe succeeds
HalfOpen --> Open: probe fails
The three states: Closed means traffic flows normally and you're tracking success/fail counts in a rolling window. Open means you've hit the threshold — stop calling, fail fast, and start a cooldown timer. Half-open means the cooldown elapsed; send a single probe request by default and if it works, close the breaker; if not, reopen.
Why this matters in practice:
flowchart LR
NB[No breaker] -->|"all 1000 reqs/s<br/>hit a dying service"| F1[Service crashes harder]
BR[Breaker open] -->|"0 reqs flow"| HEAL[Service recovers]
HEAL --> CLOSE[Breaker closes]
style F1 fill:#ff2e88,color:#fff
style HEAL fill:#15803d,color:#fff
Without a breaker, every retry hammers a dying service. With one, the dying service gets a quiet window to recover.
Tuning a breaker
| Parameter | Typical | Watch out for |
|---|---|---|
| Failure ratio threshold | 50% over 1-min window | Too low = flaps; too high = slow to react |
| Minimum requests | 20–100 | Without this, 1 of 1 errors trips it |
| Cooldown | 5–30s | Too short = thrash; too long = slow recovery |
| Probe count in half-open | 1–5 | More = safer reopen, less responsive |
Most service meshes (Envoy, Istio, Linkerd) ship circuit breakers. You configure the thresholds; the proxy enforces them.
Bulkheads
The bulkheads of a ship are watertight compartments — if one floods, the rest stay dry. In software: isolate resources so failure in one part doesn't sink the whole ship.
flowchart TD
subgraph Bad[No bulkhead]
POOL1[(Shared connection pool: 100)]
POOL1 --> A1[Service A]
POOL1 --> B1[Service B]
POOL1 --> C1[Service C - slow]
end
subgraph Good[With bulkheads]
POOL2A[(Pool A: 30)] --> A2[Service A]
POOL2B[(Pool B: 30)] --> B2[Service B]
POOL2C[(Pool C: 40)] --> C2[Service C - slow]
end
style Bad fill:#ff2e88,color:#fff
style Good fill:#15803d,color:#fff
In the bad case, Service C goes slow, monopolizes all 100 connections, and A and B starve. In the good case, C exhausts its 40 connections and A and B keep working.
Common places to draw bulkhead boundaries: per-downstream connection pools, per-tenant queues or thread pools, separate node pools for batch vs interactive workloads, and separate database connections for reads vs writes.
Hedged requests (the latency superpower)
If you have an endpoint with a long tail — say, most calls finish at 50ms but p99 is 500ms — you can cut that p99 dramatically by sending the same request to two replicas in parallel and using whichever responds first.
sequenceDiagram
participant C as Client
participant R1 as Replica 1
participant R2 as Replica 2
C->>R1: request
Note over C: wait p95 (~150ms)
C->>R2: hedge request
R1-->>C: slow (450ms)
R2-->>C: fast (50ms) ← use this
You don't fire both immediately. You wait until the first response is overdue — the p95 latency is a good threshold — and only then send the hedge. The cost is roughly 5% extra requests, and the payoff is a 5–10× reduction in p99 latency. Jeff Dean and Luiz André Barroso wrote about this in their 2013 Google paper "The Tail at Scale" — it's fundamental for low-latency search and storage systems.
Load shedding
When you can't keep up, shed load explicitly rather than letting everything slow down for everyone.
flowchart TD
REQ[Incoming request] --> CHECK{System load<br/>> threshold?}
CHECK -->|No| PROC[Process normally]
CHECK -->|Yes| PRI{Priority?}
PRI -->|High| PROC
PRI -->|Low| SHED[Shed: 503 + Retry-After]
style SHED fill:#ff6b1a,color:#0a0a0f
A few implementation shapes: reject immediately when the inbound queue depth exceeds N; cap in-flight requests and reject new ones once you're at the limit; use per-priority shedding to keep payments running while dropping analytics; or use adaptive concurrency (Netflix's "concurrency-limits" library) which reduces limits automatically when latency rises.
The thing to avoid is implicit queueing. A 30-second timeout on an overloaded server means 30 seconds of wasted CPU per request before saying no. Rejecting in 1ms is strictly better.
Graceful degradation
When a non-critical dependency dies, don't take the whole feature down with it. Return a degraded response.
| Situation | Don't do | Do |
|---|---|---|
| Recommendations service down | Show error page | Show static "popular items" |
| Personalization down | 500 | Show generic homepage |
| Search ranking down | 500 | Show results in lexical order |
| Image CDN down | Broken UI | Show placeholders |
| Real-time updates down | Page won't load | Page loads with stale data + "refreshing..." indicator |
The architectural shape is consistent: every call to a non-critical dependency is wrapped in a fallback. The fallback returns cached data, a default value, or a simpler computation. The caller gets something useful, not an error.
flowchart LR
REQ[Request] --> DEP{Dependency<br/>available?}
DEP -->|Yes| LIVE[Live response]
DEP -->|No| FB{Fallback?}
FB -->|Cached data| CACHE[Return stale]
FB -->|Default| DEF[Return default]
FB -->|Critical path| ERR[Propagate error]
style LIVE fill:#15803d,color:#fff
style CACHE fill:#ffaa00,color:#0a0a0f
style DEF fill:#0e7490,color:#fff
style ERR fill:#ff2e88,color:#fff
The key judgment call is deciding which dependencies are critical (errors propagate) vs non-critical (fallback kicks in). Make that decision explicit at code-review time, not during an incident.
Idempotency keys (revisited)
Reliability and APIs intersect at idempotency. Without idempotent operations, you cannot safely retry. Without safe retries, you cannot tolerate transient failures.
# bad: not idempotent
charge_card(amount=4200)
# good: idempotent — second call returns the same result
charge_card(amount=4200, idempotency_key='ord-7f3c2e80')
Server-side, the key is stored with the response for ~24h. Repeated calls with the same key return the cached response — no double-charge.
Health checks (and the trap)
Health checks tell load balancers whether a backend is ready for traffic. Two flavors:
- Liveness (
/livez): "I'm running — don't kill me." Checks that the process is up. - Readiness (
/readyz): "I'm ready to serve — send me traffic." Checks that database connections, caches, etc. are reachable.
The trap is checking downstream health in your readiness probe.
flowchart LR
A["Service A<br/>readyz checks B"] -->|"B is slow"| MARK[A marked unready]
MARK --> CASCADE[All A pods<br/>marked unready]
CASCADE --> OUT[Whole region down]
style OUT fill:#ff2e88,color:#fff
If A's readiness probe fails whenever B is slow, your load balancer pulls A out of rotation — even though A could have served degraded behavior gracefully. B's outage cascades into A's outage.
The rule: liveness and readiness probes should check only what this instance can fix by restarting. Downstream health belongs behind circuit breakers, not health checks.
Rate limiting
Rate limiting is reliability for upstream callers — protecting your service from misbehaving clients, intentional or not.
Two algorithms you'll see in the wild:
Token bucket
flowchart LR
REFILL[Refill at R per sec] --> BUCKET[(Bucket: capacity N)]
REQ[Request] --> CHECK{Tokens?}
BUCKET --> CHECK
CHECK -->|Yes| ACCEPT[Accept, decrement]
CHECK -->|No| REJECT[429 Too Many Requests]
style ACCEPT fill:#15803d,color:#fff
style REJECT fill:#ff2e88,color:#fff
The bucket holds N tokens and refills at R per second. Each request consumes one. You can burst up to N requests quickly, then the rate settles to R.
Sliding window
Count requests in the last 60s; reject when over limit. Smoother than fixed windows, which let callers double-burst across a window boundary, but requires slightly more state to track.
Enforce rate limits at the edge — API gateway or CDN, per IP and per API key. Rejecting there is far cheaper than letting the request propagate down before you say no.
The deep dive is in the rate limiter article.
Outboxes and the durable-write pattern
You write to your database, then publish a message. The DB write succeeds; the publish fails. Now your database and message bus disagree about what happened.
The fix is the transactional outbox.
sequenceDiagram
participant App
participant DB
participant OB as outbox table
participant Q as Kafka
App->>DB: BEGIN
App->>DB: UPDATE balance ...
App->>OB: INSERT event row
App->>DB: COMMIT (atomic!)
Note over OB,Q: separate poller / CDC
OB->>Q: publish, mark sent
The DB row and the outbox row are written in the same transaction. A separate process — or a CDC stream reading the DB's WAL — ships the outbox to Kafka. If the publisher crashes before sending, it picks up the unsent row on restart. If Kafka's ack is lost, the publisher retries and the consumer deduplicates on the event's stable ID.
This gives you at-least-once delivery without ever letting your DB and queue drift apart. See Message Queues for more.
Chaos engineering
You cannot be confident your reliability patterns work until you break things on purpose.
Game days are planned exercises where teams kill a service, partition a network, or fill a disk in production. Chaos monkeys do this continuously and randomly — Netflix invented the concept and has run it in production for years. Failure injection adds deliberate latency or errors at a load balancer to test downstream resilience without fully killing anything.
The point isn't sadism. Untested code paths fail in production, and your recovery paths are code paths too. Unverified replica promotion, dusty runbooks, failover scripts that haven't been run in two years — chaos engineering surfaces these before a real incident does.
SLOs, SLIs, and error budgets
Reliability is a business decision, not a binary one. The framework Google made standard:
- SLI (Service Level Indicator): a measurable signal — e.g., 99th-percentile latency, success ratio, queue depth.
- SLO (Service Level Objective): a target on the SLI — e.g., 99.9% of requests complete within 200ms.
- Error budget:
1 - SLO. At 99.9%, you have 0.1% of requests to "spend" — roughly 43 minutes of downtime per month.
flowchart LR
A[100% reliable] --> B[Slow feature velocity]
B --> C[Competitor catches up]
D[99.9% SLO] --> E[Error budget]
E --> F[Spend on shipping]
F --> G[Spend on chaos exercises]
style A fill:#ff2e88,color:#fff
style D fill:#15803d,color:#fff
When you're under budget — reliability is exceeding the SLO — you can ship faster and take more risk. When you've burned the budget, freeze risky changes until reliability recovers. The budget makes the tradeoff explicit so the conversation between engineering and product is about data, not gut feeling.
A reliability checklist
Before any service goes to production:
- Every external call has a timeout.
- Retries use exponential backoff with jitter and a cap.
- Critical writes are idempotent.
- Circuit breakers wrap calls to dependencies that can fail independently.
- Bulkheads isolate connection pools / thread pools per dependency.
- Liveness and readiness probes don't check downstream health.
- Load shedding rejects fast under overload (no implicit queueing).
- Non-critical features have graceful-degradation paths.
- Rate limiting at the edge protects against bad clients.
- Outbox or CDC ensures DB + queue stay consistent.
- SLOs are defined; alerts fire on error budget burn rate.
- Runbooks exist for the top failure modes — and have been rehearsed.
Each item costs an hour to implement and prevents a category of incident.
Things you should now be able to answer
- A downstream service starts taking 5 seconds per request. Why does that take your service down too, even if your code is perfect?
- A naive retry policy (3 retries everywhere) can make outages worse. Why? What's the fix?
- Your circuit breaker is tripping every 30 seconds. What's wrong with the tuning?
- Why shouldn't your readiness probe check whether downstream services are healthy?
- A user gets billed twice because their phone retried the request. What's the simplest fix?
- Your SLO is 99.9% and you've already burned 70% of this month's error budget. What changes about how you operate?
→ Next: Observability
Frequently asked questions
▸What is a circuit breaker and what are its three states?
A circuit breaker watches calls to a dependency and stops sending traffic when failures exceed a threshold. The three states are Closed (traffic flows normally while success/fail counts are tracked in a rolling window), Open (threshold exceeded — fail fast and start a cooldown timer), and Half-open (cooldown elapsed — send a single probe request and close the breaker if it succeeds, reopen if it fails).
▸What is deadline propagation and how does it differ from a timeout?
A timeout applies at a single hop. A deadline is an absolute wall-clock expiry set at the entry point and passed along on every downstream call so each service can abort before making the next hop. Without deadline propagation, downstream services waste CPU computing results the caller has already given up on; gRPC and most modern RPC frameworks have this built in.
▸Why is jitter required in exponential backoff retry logic?
Without jitter, every client that experienced the same failure wakes up and retries at the same instant, overwhelming the server the moment it starts to recover. Multiplying the backoff delay by a random factor (the article uses uniform 0.5 to 1.5) staggers the retry wave and prevents that thundering-herd problem.
▸How much request amplification can retry storms create, and what are the mitigations?
With five service layers each retrying three times, you get 3 to the power of 5, or 243 times amplification at the bottom of the chain during an outage. The article recommends retrying only at the outermost layer so inner services propagate errors instead, capping retries with a token-bucket retry budget at a percentage of normal RPS, and using circuit breakers to stop traffic to recovering services.
▸What is the error budget at a 99.9% SLO, and how should teams use it?
At a 99.9% SLO the error budget is 0.1% of requests, which translates to roughly 43 minutes of downtime per month. When reliability exceeds the SLO and budget remains, teams can ship faster and take more risk; when the budget is burned, risky changes should be frozen until reliability recovers.