~/articles/design-rate-limiter
◆◆Intermediateasked at Stripeasked at Cloudflareasked at Googleasked at Amazon

Design a Rate Limiter

Token bucket, leaky bucket, fixed window, sliding window — and the distributed Redis-based limiter you can copy into production.

// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Stripe's public API serves millions of developers, and without a rate limiter, a single misconfigured client could send thousands of requests per second, starving every other caller on the platform. A rate limiter is the enforcement layer that caps how many requests any one client — identified by IP, user ID, or API key — can make within a rolling time window, rejecting the excess with a 429 Too Many Requests response instead of passing it downstream.

The rules sound simple: N requests per minute per client. The implementation is not. The moment you run a second API server, per-process counters break — each box tracks its own count, so a client effectively multiplies its allowance by the number of servers. The fix is a shared counter in something like Redis, but now you have distributed state and the attendant hazards: atomic read-modify-write, Redis downtime, and bursty traffic patterns that hammer a window boundary.

Those distributed-systems traps — not the counter arithmetic — are what make this a canonical interview question. The core tension is balancing sub-millisecond enforcement overhead against cluster-wide correctness: you need a single source of truth for the count, but querying that source on every request adds latency and creates a new single point of failure. Every design decision in this article flows from that tension.

Most teams bolt a rate limiter on after their first incident and get it subtly wrong. This article builds the design from scratch so both the algorithms and the traps are visible along the way.

Functional requirements

  • Limit a client to N requests per time window (per user, per IP, per API key, per endpoint).
  • Reject excess with 429 Too Many Requests and Retry-After header.
  • Different rules per endpoint: /login more restrictive than /search.

Non-functional requirements

  • Low overhead: <1ms added latency.
  • Highly available: a rate limiter outage shouldn't take down the API.
  • Accurate enough: small bursts above limit usually fine; never let through 10× the limit.
  • Distributed: limits enforced cluster-wide, not per-instance.

Building up to the design

Start with the simplest thing that could work, and walk forward until the design earns all its complexity.

V1: A counter in process memory

counts = defaultdict(int)
RESET_EVERY = 60   # seconds

@app.middleware
def limit(req):
    key = req.user_id
    if counts[key] > 100:
        return 429
    counts[key] += 1

A background job clears counts every minute. This limits a single-process app in 8 lines and works fine until you deploy a second server.

The moment you run two API boxes, each has its own counts dict — so a user effectively gets 2× the limit. Worse, a restart wipes all counters. And the fixed-window reset lets a motivated client fire N requests in the last second of a window and another N in the first second of the next, squeezing 2N through in two seconds.

V2: Move the counter to Redis

key = f"rate:{user_id}:{minute_bucket}"
n = redis.incr(key)
redis.expire(key, 60)
if n > 100: return 429

Now the count is shared across all your servers and survives restarts. One Redis hop per request is around 1ms — acceptable overhead. But this version has a real race: if the process crashes (or the connection drops) between INCR and EXPIRE, the key has no expiry and grows forever, permanently locking that client. (INCR itself is atomic in Redis, so two concurrent increments always produce distinct values — the race is only between the two separate commands.) The bursty-boundary problem also survives intact: the fixed-window reset still exists.

V3: Sliding window — fix the bursty boundary

Instead of one hard reset per minute, blend two adjacent buckets weighted by how much of the previous window still falls inside the current one:

estimate = current_bucket
         + previous_bucket × fraction_of_previous_still_in_window

This smooths the rate at the boundary. Two counters per key, averaging ~6% rate estimation error — but Cloudflare's analysis of 400M real requests found only 0.003% of decisions were actually wrong. That's accurate enough for almost any use case.

The sliding window still doesn't let clients burst. Real workloads want "let me send 20 requests quickly when I need to, then settle to 1/sec." For that you need tokens.

V4: Token bucket on Redis

Each user has (tokens_remaining, last_refill_ts) in Redis. On each request: refill tokens based on elapsed time, consume one, allow if any remain.

The naive implementation has a read-modify-write race: two concurrent requests both read tokens=1, both consume, both get through. The fix is to run the whole refill-consume cycle as a Lua script, which Redis executes atomically — one round-trip, no race.

This gives you bursty-but-bounded behavior at ~1ms per request. It's the design that Stripe runs in production.

V5: Failure modes and tiers

What happens when Redis goes down? Almost everyone chooses fail-open — allow the request rather than rejecting it. An API that goes dark because its rate limiter is unavailable is worse than briefly unenforced limits. You mitigate by keeping a local in-memory fallback with a stricter limit on each server, plus Redis Sentinel or Cluster for HA.

Then layer multiple limits: per-IP (loose), per-user (medium), per-API-key (strict). All three are checked; any one can reject.

V6: Production

Token bucket via Lua, layered per-(IP, user, key) limits, fail-open with degraded local limits, X-RateLimit-* and Retry-After headers, and a global ceiling in front of your database pool to protect the backend.

flowchart LR
    V1[V1: in-process counter<br/>broken with 2 boxes] --> V2[V2: Redis counter<br/>bursty edges]
    V2 --> V3[V3: sliding window<br/>smooth but no bursts]
    V3 --> V4[V4: token bucket + Lua<br/>atomic, bursty]
    V4 --> V5[V5: + tiers + fail-open<br/>resilient]
    V5 --> V6[V6: + global ceiling<br/>production]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

The four algorithms

1. Fixed window counter

Window: 1-minute buckets.
Bucket key: rate:{user}:{minute}
On request: INCR bucket; reject if > N.
flowchart LR
    R[Request at 12:34:59] --> K[bucket: rate:u123:12:34<br/>count: 19/20]
    K --> ALLOW[Allow]
    R2[Request at 12:35:00] --> K2[bucket: rate:u123:12:35<br/>count: 1/20]
    K2 --> ALLOW
    style ALLOW fill:#15803d,color:#fff

Simple — just one INCR + EXPIRE. The catch is at the boundary: a client can put 20 requests in the last second before the reset and 20 more in the first second after, landing 40 requests in a 2-second window.

2. Sliding window log

Store every request timestamp in a sorted set. On each request, drop timestamps older than now - window, count, accept or reject.

ZADD key now <unique_id>   # score=timestamp, member=unique ID (e.g. UUID)
ZREMRANGEBYSCORE key 0 (now - window)
COUNT = ZCARD key
EXPIRE key window

The member must be unique per request (e.g. a UUID), not the timestamp itself — using now as both score and member would silently deduplicate two requests arriving in the same millisecond.

Perfectly accurate, but memory grows with N. At thousands of requests per window per user, it adds up.

3. Sliding window counter (the practical winner)

Two adjacent fixed-window buckets, weight the previous one by how much of it is still in the sliding window.

N at this moment = current_minute_count
                 + previous_minute_count × overlap_fraction
flowchart LR
    PREV[Previous minute<br/>count: 18] -->|"75% overlap"| BLEND
    CUR[Current minute<br/>count: 5] -->|"100%"| BLEND
    BLEND[Estimated:<br/>5 + 18×0.75 = 18.5]
    style BLEND fill:#ff6b1a,color:#0a0a0f

Very low memory (2 counters per key); smooth without timestamp logs. The estimation error on the rate itself averages ~6%, but Cloudflare's analysis of 400M real requests found only 0.003% of decisions were wrong (allowed when should reject, or vice versa). In practice this is more than accurate enough — it's what Cloudflare documents using in their production edge rate limiter.

4. Token bucket

The bucket holds up to burst tokens. Tokens refill at rate per second. Each request consumes one. If empty, reject.

flowchart LR
    R[Refill at rate r] --> B((Bucket<br/>capacity = burst))
    B -->|tokens >= 1: consume + allow| OUT[Allow]
    B -->|tokens < 1| REJ[Reject 429]
    style B fill:#ff6b1a,color:#0a0a0f
    style OUT fill:#15803d,color:#fff
    style REJ fill:#ff2e88,color:#fff

A client can drain the bucket quickly in a burst — say, 10 requests back-to-back — and then refill at 1/sec afterward. That matches what real callers actually want. You need to track two fields (tokens and last_refill_ts), but both fit in a Redis hash. This is what Stripe uses, and what most API gateways implement.

5. Leaky bucket

A queue with a fixed drain rate. Requests enter the queue; if full, reject. The queue drains at a constant rate regardless of what's in it.

The output is perfectly smooth — no bursts can ever escape the queue. But that's also its limitation: for most API rate limiting, you want to reject excess rather than buffer it. Callers get no feedback until the queue fills. Leaky bucket is a better fit for network traffic shaping and egress control than for API protection.

Quick comparison

AlgorithmMemoryAccuracyBurstiness allowedCommon use
Fixed window1 counterBursty at edgesYes (badly)Simple cases
Sliding window logN timestampsExactNo (exact rolling limit)Compliance
Sliding window counter2 counters~6% rate est.; 0.003% wrong decisionsYes (smooth)Cloudflare edge
Token bucket2 fieldsExactConfigurableMost API gateways
Leaky bucketQueueExactNone (smoothed)Networking

Distributed token bucket on Redis (production-ready)

The naive implementation has a race: between reading the token count and decrementing, another request can steal a token. The fix is Lua scripts — Redis executes the entire script atomically.

-- KEYS[1]: rate-limit key
-- ARGV[1]: capacity (burst)
-- ARGV[2]: refill rate (tokens/sec)
-- ARGV[3]: now (ms)
-- ARGV[4]: cost of this request (1)

local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or capacity
local ts     = tonumber(data[2]) or now

-- refill
local delta = math.max(0, now - ts) / 1000.0
tokens = math.min(capacity, tokens + delta * rate)

-- consume
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil(capacity / rate * 1000) + 1000)

return { allowed, tokens }

Each request is a single round-trip to Redis (1 ms). Atomic. Cluster note: in Redis Cluster, all keys touched by a Lua script must hash to the same slot. Because this script accesses exactly one key (KEYS[1]), it is cluster-safe. If you extend it to multi-key patterns (e.g., per-IP and per-user in the same script), use hash tags — e.g., {user:42}:ip:1.2.3.4 — so both keys land on the same shard.

Architecture

flowchart TD
    U[User] --> APIGW[API Gateway]
    APIGW --> RL[Rate Limiter Middleware]
    RL --> R[(Redis Cluster)]
    RL -->|"allow"| BACKEND[Backend Service]
    RL -->|"reject"| R429[429 + Retry-After]
    style RL fill:#ff6b1a,color:#0a0a0f
    style R fill:#15803d,color:#fff
    style R429 fill:#ff2e88,color:#fff

The rate limiter middleware is part of the API gateway (or a sidecar like Envoy).

Tiered limits: how a single request gets checked

When a request arrives, it passes through several limit tiers in sequence. Any tier can reject it.

flowchart TD
    REQ[Incoming request] --> IP{IP limit?}
    IP -->|"over limit"| REJ1[429 Retry-After]
    IP -->|"ok"| USER{User limit?}
    USER -->|"over limit"| REJ2[429 Retry-After]
    USER -->|"ok"| KEY{API key limit?}
    KEY -->|"over limit"| REJ3[429 Retry-After]
    KEY -->|"ok"| BE[Backend]
    style REJ1 fill:#ff2e88,color:#fff
    style REJ2 fill:#ff2e88,color:#fff
    style REJ3 fill:#ff2e88,color:#fff
    style BE fill:#15803d,color:#fff
    style IP fill:#ff6b1a,color:#0a0a0f
    style USER fill:#ff6b1a,color:#0a0a0f
    style KEY fill:#ff6b1a,color:#0a0a0f

The tiers are checked in order from cheapest to most granular. A per-IP check catches unauthenticated flood traffic before you even spend cycles looking up the user. A per-user check catches authenticated abuse. A per-API-key check enforces your pricing tiers. All three Redis lookups happen in the same middleware pass and add roughly 1–3ms total.

What about Redis being down?

This is the failure mode most candidates skip, and it's the one interviewers care most about.

Fail-open means allow the request when Redis is unreachable. Fail-closed means reject it. The tradeoff is direct: fail-open risks briefly unenforced limits; fail-closed risks taking your API down every time the rate-limit store hiccups.

Most teams choose fail-open with alerting — your API being unavailable is a worse outcome than a brief window of unenforced limits. Three mitigations bound the blast radius.

Each API server should maintain its own token bucket at roughly 20% of the normal per-server quota. Under Redis failure the aggregate across N servers still caps total traffic at N × 20% — not perfect, but bounded. Pair that with a circuit breaker that bypasses Redis only after K consecutive failures, not on the first timeout, so brief network glitches don't disable limits entirely.

For the Redis layer itself, Sentinel provides automatic failover for a primary-replica setup: end-to-end cutover ranges from tens of seconds to over a minute (the down-after-milliseconds detection window plus replica promotion time). Redis Cluster shards data across multiple nodes so there is no single point of failure. Finally, on Redis reconnect, sync the local counter back to the recovered store so the accounting from the outage is not lost.

The failure mode worth calling out in interviews: a cascading failure where a load spike causes Redis timeouts, which triggers fail-open, which lets the spike through unchecked, which makes the spike worse. The circuit breaker + local limiter combo breaks that loop.

Identifying the client

What do you key on?

KeyProsCons
API keyPer-customer, accurate, easy to vary by tierRequires authenticated traffic
User IDAuthenticated contextDoesn't protect login itself
IP addressAlways availableNAT (carriers) puts thousands behind one IP
IP + User-AgentSlightly betterEasy to spoof
Device fingerprintAnti-abusePrivacy concerns, brittle

The practical answer is multiple tiers of limits — per IP (loose), per user (medium), per API key (strict for the endpoint). None of these keys is perfect in isolation; layered, they cover each other's blind spots.

Configuring limits

A common shape:

limits:
  - endpoint: /api/v1/login
    by: ip
    bucket: 10/minute
  - endpoint: /api/v1/login
    by: user
    bucket: 5/minute     # stricter than IP
  - endpoint: /api/v1/search
    by: api_key
    bucket: 1000/minute, burst 200
  - endpoint: /api/v1/posts
    by: api_key
    bucket: 100/second, burst 50

Stripe-style tiered limits (free / pro / enterprise) are just multiple bucket configs per API key tier.

Headers to return

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 487
X-RateLimit-Reset: 1707432600        ; Unix epoch

When rejecting:

HTTP/1.1 429 Too Many Requests
Retry-After: 30                        ; seconds
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1707432600

Retry-After is the most important header — well-behaved clients honor it and back off automatically. Without it, clients that retry immediately amplify the load spike instead of smoothing it.

Hierarchical / global limits

Sometimes you want both a per-user limit and a global ceiling on aggregate traffic to the database. Implement as two layers: per-user check (Redis token bucket), then a global counter also in Redis — or a token bucket sitting in front of the database connection pool itself.

Things to discuss in an interview

  • Name 3+ algorithms and choose one with reasons — token bucket is usually right; sliding window counter is Cloudflare's choice for memory-constrained edge nodes.
  • Explain how Lua scripts make the read-modify-write atomic and why INCR+EXPIRE without atomicity can produce a key that never expires.
  • Walk through fail-open vs fail-closed, the cascade failure scenario, and the circuit breaker + local fallback pattern.
  • Describe layered identification: per-API-key + per-IP, and the NAT/shared-IP pitfall.
  • Contrast burst handling: token bucket allows N quickly then smooths, while leaky bucket queues instead of rejecting.
  • Flag the Redis Cluster caveat: Lua scripts must touch only keys in the same hash slot; use hash tags when combining multiple key dimensions in one script.

Things you should now be able to answer

  • Why is the fixed-window algorithm bursty at the edges?
  • Why does splitting INCR and EXPIRE into two separate commands create a race condition, and what can go wrong?
  • How do you make rate-limiter logic atomic on Redis?
  • Should a rate limiter fail open or closed?
  • How do you handle bursts vs. sustained rate in token bucket?

Further reading

// FAQ

Frequently asked questions

What is a token bucket rate limiter and why does Stripe use it?

A token bucket holds up to a configured burst number of tokens, refilled at a fixed rate per second. Each request consumes one token; if the bucket is empty, the request is rejected with 429. Stripe uses it because it lets clients send a burst of requests quickly and then settle to the sustained rate — matching real caller behavior — while storing only two fields per client (tokens and last_refill_ts) in Redis.

How does a Lua script make Redis rate limiting atomic, and what race does it prevent?

Redis executes a Lua script as a single atomic operation, so the entire read-refill-consume cycle happens in one round-trip with no interleaving. Without it, two concurrent requests can both read tokens=1, both see a token available, and both get through — a classic read-modify-write race that the Lua EVAL call eliminates entirely.

Sliding window counter vs. sliding window log: what is the accuracy trade-off?

The sliding window log stores a timestamp for every request and is perfectly accurate, but memory grows proportionally with request volume. The sliding window counter uses only two fixed-window buckets weighted by overlap fraction, averaging roughly 6% error in rate estimation — but Cloudflare's analysis of 400 million real requests found only 0.003% of actual allow/reject decisions were wrong, making it accurate enough for production edge rate limiting.

Should a rate limiter fail open or fail closed when Redis goes down?

Most teams choose fail-open — allowing requests through when Redis is unreachable — because an API going dark due to rate-limiter unavailability is a worse outcome than briefly unenforced limits. The blast radius is bounded by running a local in-memory token bucket on each server at roughly 20% of the normal per-server quota, so aggregate traffic across N servers is still capped at N times 20% during a Redis outage.

Why is the fixed-window counter algorithm dangerous at window boundaries?

A client can send N requests in the last second before the minute resets and another N in the first second of the new minute, landing 2N requests in a 2-second span — double the intended limit. The fixed window resets atomically, so there is no mechanism to detect or block this boundary burst.

// RELATED

You may also like