Design a Distributed Cache (like Memcached)
A cache that scales across hundreds of nodes — consistent hashing, replication, eviction, and the operational problems you'll meet.
The problem
When Facebook's Memcached cluster handles 1 billion requests per second, or when Netflix serves recommendations to 230 million subscribers simultaneously, neither company is hammering a MySQL database on every request. They're hitting a distributed cache — a fleet of in-memory key-value stores that sits between the application and the database, absorbing the read traffic that would otherwise kill the underlying storage layer.
A distributed cache is exactly what the name says: a cache (fast, in-memory, short-lived key-value storage) that's sharded across many machines. Memcached, Redis Cluster, Hazelcast, and AWS ElastiCache all implement this pattern. The interface is minimal — GET key, SET key value, DELETE key — but the engineering underneath is not. You're building a system that needs to serve millions of operations per second at single-digit millisecond latency, survive node failures without stampeding the database, and grow or shrink the cluster without invalidating all your data at once.
The core tension has two parts. First, partitioning: how do you split the keyspace across N nodes so that adding a node doesn't move every key to a different machine? Naive modulo hashing (hash(key) % N) fails catastrophically — change N from 10 to 11 and you reroute ~91% of traffic to the wrong node instantly. Second, hot keys: once you solve partitioning, a single celebrity profile or trending post can still overload one node while the other 49 sit idle. The fix for each problem is non-obvious and the two solutions don't naturally compose, which is why this is a staple interview question.
Functional requirements
GET key→ returns value or null.SET key value [TTL].DELETE key.- (Optional)
INCR,DECRfor counters. - (Optional) atomic compare-and-swap (
CAS).
Non-functional
- p99 latency < 5ms even at high QPS.
- Throughput: 10M ops/sec across the cluster.
- High availability: a node failing should not lose more than its share of cached entries.
- Eventual consistency is fine — this is a cache, not the source of truth.
Capacity
| Dimension | Estimate | How we got there |
|---|---|---|
| Working set | 10 TB hot data | Given |
| Effective capacity per node | ~205 GB | 256 GB commodity node × 80% usable |
| Node count (capacity) | ~50 nodes | 10 TB ÷ 205 GB ≈ 49 → round to 50 |
| Cluster throughput | 10 M ops/sec | Given |
| Per-node throughput | 200 k ops/sec | 10 M ÷ 50 nodes |
| Avg value size | ~5 KB | Given |
| Network per node (peak) | ~1 GB/s | 200 k ops/s × 5 KB |
| NIC capacity (10 Gbps) | ~1.25 GB/s | Standard 10 Gbps NIC — already tight at peak (~1 GB/s load vs 1.25 GB/s capacity); consider jumbo frames or 25 Gbps NICs |
Takeaway: with 50 nodes you can absorb 10 M ops/sec, but NIC saturation — not CPU — is the first wall you'll hit; keep average value sizes small or upgrade to 25 Gbps links before scaling further.
The network bandwidth figure deserves attention: at 5 KB average values and 200 k ops/sec, you need ~1 GB/s per node. A standard 10 Gbps NIC provides ~1.25 GB/s capacity — already tight at that peak load. Staff-level answer: prefer smaller values (serialize only needed fields), or scale horizontally before hitting NIC limits.
Building up to the design
The full picture — consistent hashing, replication, slab allocator, membership service, hot-key mitigation — is a lot to absorb at once. Each layer only makes sense once you've felt the pain that justifies it, so let's walk forward and earn each piece.
Start simple: a dict and a lock
cache = {}
lock = threading.Lock()
def get(k): return cache.get(k)
def put(k,v):
with lock: cache[k] = v
This gets you ~100k ops/sec on one box in about five minutes. It also dies with the process, caps at available RAM, and can only serve a single app's workers. The moment you need multiple app servers sharing state, you need to move the cache out of process.
Put it on a separate host (Memcached)
Move the dict into a single Memcached process. App servers talk to it over the network. Now any number of app servers can share one cache, and restarting the app boxes no longer wipes cached data. That's progress. The problem is that one cache node still caps your capacity at a few hundred GB and your throughput at a few hundred thousand ops/sec. When you outgrow it — and you will — the naive next step is to add more nodes.
Add nodes: hash-based sharding
Have N cache nodes. The client picks one with hash(key) % N. Capacity and throughput scale linearly. But now try adding a node: N changes from 10 to 11, and nearly every key routes to a different node. You just invalidated almost your entire cache simultaneously. The resulting miss storm hits your database all at once. That's not a scaling strategy — that's an outage.
Consistent hashing + virtual nodes
This is the fix that makes distributed caching actually work. Hash keys and nodes onto a ring; each key lives on the first node clockwise from its hash. When you add or remove a node, only the keys that were between it and its predecessor on the ring need to move — roughly 1/N of the total. A 50-node cluster adding one more node moves about 2% of keys, not 100%.
flowchart LR
subgraph ring ["Consistent Hash Ring"]
direction LR
K1(["key: user:99<br/>hash → 12:00"]) --> NA["Node A<br/>owns 10:00–2:00"]
K2(["key: post:42<br/>hash → 3:30"]) --> NB["Node B<br/>owns 2:00–5:00"]
K3(["key: feed:7<br/>hash → 7:45"]) --> NC["Node C<br/>owns 5:00–10:00"]
end
style NA fill:#15803d,color:#fff
style NB fill:#0e7490,color:#fff
style NC fill:#ff6b1a,color:#0a0a0f
Virtual nodes (vnodes) take this one step further. Instead of placing each real server once on the ring, you place it 100–256 times. When a node fails, its keys scatter across many surviving nodes rather than piling entirely onto one neighbor. The load distribution stays even even under churn.
Graceful resizing is solved. But a node death still means its slice of the cache goes cold — DB load spikes briefly for those keys. You need somewhere to fall back.
Replication for HA
Each key is stored on N nodes (typically N=2 or 3): the primary owner and the next one or two clockwise on the ring. Writes replicate asynchronously. When a node dies, reads that would have gone there fall back to a replica. The DB never sees it. This is replication as an availability tactic, not a consistency one — you're not trying to make all copies agree instantly, just to keep the cache warm through a failure.
That handles most failure modes. But there's one more pathology: what if a single key is so popular that even one node can't absorb the reads?
Hot-key mitigation + per-node optimizations
Some keys are disproportionately popular — a trending celebrity's profile, the site homepage, a viral post. Even with 50 nodes, "Justin Bieber's Wikipedia page" lands on exactly one of them. That node gets 10× the load of its peers and starts dropping requests.
The fix is to stop treating the single-owner model as sacred for those keys. Scatter extra copies of hot keys to K randomly chosen nodes, and have clients pick one at random on each read. For the very hottest items — the top 100 or so by request rate — go further and cache them in a small in-process store on each app server. That read never even touches the network. You get microsecond latency instead of milliseconds, and the cache nodes see zero traffic for those keys.
Rounding out the per-node story: Memcached uses a slab allocator (described below) to avoid malloc fragmentation, and evicts via LRU when a slab class fills up.
Membership service + smart clients
The last piece is how clients know the ring. A small ZooKeeper or etcd cluster tracks which cache nodes are alive. Each client subscribes to watch updates; when a node joins or fails, the membership service notifies all watchers within milliseconds and clients rebuild their local ring instantly. No proxy is needed in the request path — the client routes directly to the right node based on its local view.
This is the production design: consistent hashing, async replication, hot-key scatter, in-process L1, and a membership service keeping every client in sync.
flowchart LR
V1[V1: in-process dict<br/>one app] --> V2[V2: + Memcached box<br/>shared, capped]
V2 --> V3[V3: + sharding<br/>resharding hell]
V3 --> V4[V4: consistent hash<br/>graceful resize]
V4 --> V5[V5: + replicas<br/>HA]
V5 --> V6[V6: + hot-key handling<br/>no node melts]
V6 --> V7[V7: + membership + smart clients<br/>production]
style V1 fill:#0e7490,color:#fff
style V4 fill:#15803d,color:#fff
style V6 fill:#ff6b1a,color:#0a0a0f
style V7 fill:#a855f7,color:#fff
Architecture
flowchart TD
APP1[App Server 1] --> CLIENT[Cache Client Library]
APP2[App Server 2] --> CLIENT
APP3[App Server N] --> CLIENT
CLIENT --> R{Consistent Hash Ring}
R --> N1[Cache Node 1]
R --> N2[Cache Node 2]
R --> N3[Cache Node 3]
R --> NN[Cache Node N]
N1 -.replicate.-> N2
N2 -.replicate.-> N3
M[(Membership Service<br/>Zookeeper / etcd)] --> CLIENT
style CLIENT fill:#ff6b1a,color:#0a0a0f
style R fill:#15803d,color:#fff
style M fill:#a855f7,color:#fff
Three layers:
- Cache nodes — hold key/value pairs in memory.
- Client library — embedded in every app server; knows the ring and routes requests.
- Membership service — tracks which nodes are alive and which key ranges they own.
Sharding: where does a key live?
Consistent hashing. Each node is a vnode set on a ring; each key hashes to "the next node clockwise."
def find_node(key):
h = hash(key)
idx = bisect(sorted_hashes, h) % len(sorted_hashes)
return ring[sorted_hashes[idx]]
The lookup is a binary search — fast regardless of ring size. Adding or removing a node only moves ~1/N of keys. With virtual nodes (~100–256 per real node), load stays near-even across the cluster. The ring is replicated to every client via the membership service, so clients route directly with no proxy hop.
Replication
For HA, each key is replicated to N nodes (typically N=2 or 3).
node = primary owner
node[+1] = replica
node[+2] = replica (if N=3)
Where the next two clockwise nodes hold copies. On read, the client may:
- Read primary, fall back to replica on failure.
- Or read all N and use the freshest (timestamp).
- Or read with quorum (
R + W > Nfor strong-ish consistency).
For a pure cache, async replication is fine — losing a few writes is acceptable.
Storage in a single node
Each node holds an in-memory hashmap with eviction.
Slab allocator (Memcached)
Memcached pre-allocates memory in fixed-size "slabs" to avoid fragmentation:
slab class 1: chunks of 96 bytes
slab class 2: chunks of 120 bytes
slab class 3: chunks of 152 bytes
...
A SET with a 100-byte value goes into the slab class with the smallest fitting chunk size. Internal fragmentation is the price; in return, no malloc/free hot path.
Eviction (LRU)
When a slab class is full and you need to insert, evict the least-recently-used item from that slab class (not globally). Each slab class maintains its own doubly-linked list: head = MRU, tail = LRU. A SET that needs a 100-byte slot can only evict from the slab class whose chunks fit 100 bytes — not from a slab class holding 1 KB chunks, even if those are cold. This can cause counter-intuitive evictions when one size class is hot and another is completely idle.
For Redis, eviction is global across all keys and tunable: noeviction, allkeys-lru, allkeys-lfu, volatile-lru, volatile-lfu, allkeys-random, etc. allkeys-lfu tends to outperform LRU on workloads with repeated popular keys.
The client library
Often the most complex part. Responsibilities:
- Maintain the ring locally.
- Pool TCP connections to nodes (one connection per node, multiplexed).
- Detect node failures, route around them.
- Implement retries with backoff.
- Optionally batch GETs (one round-trip, multiple keys).
async def get(key):
node = ring.lookup(key)
try:
return await node.get(key)
except NodeFailed:
replica = ring.next_replica(key)
return await replica.get(key)
Membership and failure detection
Use a small, strongly-consistent service (ZooKeeper, etcd, Consul) to track which nodes exist.
- Each node heartbeats every few seconds.
- If a node misses 3 heartbeats, it's marked dead.
- Membership change → ring rebuilt → clients informed via watch.
Alternatively: gossip-based membership (Cassandra, Riak). Each node randomly tells a few peers what it knows; failure suspicion spreads.
Failures and recovery
A node dies
Its keys are gone. New requests for those keys hash to the next node clockwise. Cache misses up — DB load spikes briefly. Within seconds, the new owner re-fills from cache misses.
Mitigation: replicas hold a copy; promote the replica to "primary" so misses don't cascade to the database.
A node comes back
It's empty. Two options:
- Cold start — accept the cache miss penalty; refill from misses.
- Warmup — copy from a peer (replica). Costs network + CPU.
Most production systems do cold start because warmup adds complexity for marginal benefit.
Network partition
Half the cluster can't talk to the other half. Each half might think the others are dead and reassign keys. When the partition heals, you have divergence — same key with different values on different sides.
For a cache, divergence is OK for a short time. Last-write-wins reconciliation, or just accept stale reads briefly.
Thundering herd / cache stampede
When a popular key expires (or a node loses it), hundreds of concurrent requests may all miss the cache simultaneously and fire DB queries in parallel. This can cause a spike of DB load 100–1000× above steady-state — and the DB returning slowly under that load means keys stay uncached longer, which makes more requests pile up. It's a feedback loop.
The mitigations each trade a different thing:
| Technique | How | Trade-off |
|---|---|---|
| Mutex / single-flight | First miss acquires a lock and fetches; waiters block on the lock. | Extra coordination; lock holder is a SPOF for that key. |
| Probabilistic Early Expiration (XFetch) | Recompute key slightly before expiry, with probability that increases as TTL approaches zero. Each request independently decides whether to refresh; no lock needed. | Small overhead on every read near expiry; parameter β controls how aggressively you recompute early. |
| Stale-while-revalidate | Return the stale value immediately; recompute in the background. | Caller sees stale data during revalidation window; fine for most caches. |
| In-process L1 (hot keys) | App servers cache the top N keys locally; DB never sees the miss. | Memory per app server; stale for up to L1 TTL. |
For most teams, stale-while-revalidate is the right default — it removes the coordination problem entirely and keeps p99 latency flat even during a miss wave.
Hot key problem
A single super-popular key (Justin Bieber's Wikipedia page) lands on one node. That node is overwhelmed.
Mitigations:
- Replicate hot keys to multiple nodes; client picks one randomly.
- Caching tiers: a small, in-process cache (~1ms network hit becomes ~10μs) absorbs the hottest 100 keys.
- Automatic detection: monitor per-key request rates; auto-promote hot keys to local cache.
Operations
Resizing
Adding 10 nodes to a 40-node cluster (50 total):
- Vnode rebalance: each new node claims ~1/50 of the ring. 10 new nodes → ~20% of keys move (10/50). Each existing node donates proportionally — no node gives up more than its fair share.
- During the move, reads to a moving key may hit either the old or new owner; write-through caches should invalidate during the window. Brief miss rate bump is expected and acceptable.
- Done online with no downtime. With virtual nodes (~100–256 per node), the ownership handoff is fine-grained rather than one large contiguous range.
Versioning
Cluster-side feature flags. Different client versions must coexist. Be conservative — wire format compatibility is forever.
Monitoring
- Hit rate (% of GETs returning a value).
- Latency p50/p99/p99.9 per op.
- Memory used, eviction rate.
- Network bytes in/out.
- Per-key hot list.
A hit rate below 90% on a properly-sized cache is a red flag — either too small, or the workload is genuinely random.
Memcached vs. Redis
| Memcached | Redis | |
|---|---|---|
| Data types | Strings only | Strings, lists, sets, sorted sets, streams, hashes, bitmaps, geo, HyperLogLog |
| Threaded | Multi-threaded: one listener + N worker threads (configurable via -t); each worker handles many connections via epoll | Command execution single-threaded; network I/O multi-threaded since Redis 6.0 (optional, off by default) |
| Persistence | None | Optional (RDB snapshots, AOF log) |
| Replication | None native — use mcrouter (Facebook) for proxy-level replication; twemproxy does NOT support replication (sharding only) | Yes (primary-replica, async) |
| Cluster | App-side hashing or mcrouter | Native Redis Cluster (CRC16 mod 16384 hash slots — not consistent hashing) |
| Best for | Pure cache, simple values, max raw throughput | Cache + rich data types (sorted sets, streams, pub/sub, scripting) |
For most "I just need a cache" workloads, Memcached's simplicity is a feature. For "I need atomic counters, leaderboards, queues, pub/sub" → Redis.
Hot path: a GET
sequenceDiagram
participant App
participant Client as Cache Client
participant Node
App->>Client: get("user:42")
Client->>Client: hash → node 17
Client->>Node: GET user:42
Node-->>Client: value (or null)
Client-->>App: value (or null)
Note over App,Node: total ~1ms in-region
Optional: if it's a hot key the client recognizes, hit the in-process cache first; if miss, network.
Cache invalidation
Caches break correctness when they hold stale data after a write to the source of truth. The two main strategies are write-through invalidation and TTL-based expiry.
With write-through, every DB write also invalidates (or updates) the cache entry. It's easy to reason about, but it adds latency to the write path, and getting the write and the invalidation to be truly atomic is harder than it looks — most systems don't provide that for free.
TTL-based expiry sidesteps the coordination entirely: accept staleness up to the TTL window. For most caches where brief staleness is acceptable, this is the right default. The tricky part is correctness when two writes race.
sequenceDiagram
participant W1 as Writer 1
participant W2 as Writer 2
participant DB as Database
participant C as Cache
Note over W1,C: Wrong order — can leave stale cache permanently
W1->>C: DELETE cache key
W2->>DB: write new value
W2->>C: DELETE cache key
W1->>DB: write old value
Note over W1,C: Cache is now empty; next read refills from stale DB row
sequenceDiagram
participant W as Writer
participant DB as Database
participant C as Cache
Note over W,C: Correct order: DB first, then invalidate
W->>DB: write new value
W->>C: DELETE cache key
Note over W,C: Next reader fetches fresh data from DB
Always write the DB first, then invalidate the cache. If you do it in the other order, a concurrent writer can restore a stale cache entry that never gets evicted again.
Things to discuss in an interview
- Consistent hashing + vnodes — the foundation.
- Client-side routing vs. proxy-based — pros/cons.
- Replication as an availability tactic, not a consistency one.
- Hot key mitigation — replicate, in-process tier.
- Failure handling — what happens when a node dies, how does the system heal?
- Cache stampede / thundering herd — what triggers it, how you prevent it.
- Cache invalidation strategy — write-through vs. TTL; why write DB before invalidating.
Things you should now be able to answer
- Why is consistent hashing critical for a distributed cache?
- What does the client library do that a "dumb client" wouldn't?
- How does a Memcached node use slab allocation for memory, and what is the per-slab-class LRU implication?
- What happens to cache entries when a node fails? When does the replica promote?
- How does Redis Cluster handle resharding — and why is it not consistent hashing?
- What causes a cache stampede and how do stale-while-revalidate and Probabilistic Early Expiration (XFetch) prevent it?
- Why does the order of write (DB) vs. cache invalidation matter for correctness?
Further reading
- "Scaling Memcache at Facebook" (NSDI 2013) — the canonical production war story; covers mcrouter, lease-based read-repair, and regional failover.
- Redis Cluster Specification — explains why Redis Cluster uses CRC16 mod 16384 hash slots rather than consistent hashing, and how resharding works online.
- Consistent Hashing and Random Trees (Karger et al., 1997) — the original paper.
- McRouter (Facebook/Meta) — the Memcached proxy that adds replication, failover, and pool routing without changing the client protocol.
- Optimal Probabilistic Cache Stampede Prevention (Vattani et al., VLDB 2015) — the XFetch algorithm; β=1 is already near-optimal for most workloads.
Frequently asked questions
▸Why does consistent hashing with virtual nodes outperform modulo hashing when resizing a cache cluster?
With modulo hashing, changing N from 10 to 11 reroutes roughly 91% of keys to the wrong node instantly, causing a cache miss storm that hits the database all at once. Consistent hashing maps keys and nodes onto a ring, so adding or removing a node only moves approximately 1/N of keys — about 2% on a 50-node cluster. Virtual nodes, placed 100 to 256 times per real server, ensure that when a node fails, its keys scatter across many surviving nodes rather than piling onto one neighbor.
▸How does Memcached's slab allocator handle eviction, and what is the key limitation compared to Redis?
Memcached pre-allocates memory in fixed-size slab classes (96 bytes, 120 bytes, 152 bytes, and so on), and when a slab class fills up it evicts the least-recently-used item from that specific slab class only — not globally. A SET needing a 100-byte slot can only evict from the matching slab class, not from an idle class holding 1 KB chunks. Redis eviction is global and tunable: allkeys-lru, allkeys-lfu, volatile-lfu, and others — allkeys-lfu tends to outperform LRU on workloads with repeated popular keys.
▸What is a cache stampede and which mitigation strategy is recommended as the right default?
A cache stampede occurs when a popular key expires or is lost, causing hundreds of concurrent requests to miss simultaneously, fire parallel DB queries, and potentially spike DB load 100 to 1000 times above steady-state — the slow DB response then prolongs the miss window, creating a feedback loop. The article recommends stale-while-revalidate as the right default for most teams: return the stale value immediately and recompute in the background, which removes coordination entirely and keeps p99 latency flat during a miss wave. Alternatives include mutex/single-flight and Probabilistic Early Expiration (XFetch), each trading different costs.
▸When should you choose Memcached over Redis for a distributed cache?
Memcached is the right choice for pure cache workloads with simple string values where maximum raw throughput is the priority; its simplicity is a feature there. Redis is the right choice when you need atomic counters, leaderboards, queues, sorted sets, pub/sub, streams, or optional persistence via RDB snapshots or AOF log. One concrete distinction: Redis Cluster uses CRC16 mod 16384 hash slots rather than consistent hashing, while Memcached relies on app-side hashing or a proxy like mcrouter.
▸Why must you write to the database before invalidating the cache, and what goes wrong if you reverse the order?
If you delete the cache entry first and then write to the database, a concurrent second writer can delete the cache entry and write its value between your two steps, and then your stale DB write lands, leaving the cache empty with a stale row underneath — subsequent reads refill the cache from that stale data, and the bad value never gets evicted. Writing the database first ensures that any reader who misses the cache during the invalidation window fetches the correct, already-committed value.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.