~/articles/design-distributed-cache

◆◆◆Advancedasked at Metaasked at Amazonasked at Twitterasked at Netflix

Design a Distributed Cache (like Memcached)

A cache that scales across hundreds of nodes — consistent hashing, replication, eviction, and the operational problems you'll meet.

18 min read2026-02-14Ironclad Academy

#interview #caching #distributed-systems #sharding #consistent-hashing

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

When Facebook's Memcached cluster handles 1 billion requests per second, or when Netflix serves recommendations to 230 million subscribers simultaneously, neither company is hammering a MySQL database on every request. They're hitting a distributed cache — a fleet of in-memory key-value stores that sits between the application and the database, absorbing the read traffic that would otherwise kill the underlying storage layer.

A distributed cache is exactly what the name says: a cache (fast, in-memory, short-lived key-value storage) that's sharded across many machines. Memcached, Redis Cluster, Hazelcast, and AWS ElastiCache all implement this pattern. The interface is minimal — GET key, SET key value, DELETE key — but the engineering underneath is not. You're building a system that needs to serve millions of operations per second at single-digit millisecond latency, survive node failures without stampeding the database, and grow or shrink the cluster without invalidating all your data at once.

The core tension has two parts. First, partitioning: how do you split the keyspace across N nodes so that adding a node doesn't move every key to a different machine? Naive modulo hashing (hash(key) % N) fails catastrophically — change N from 10 to 11 and you reroute ~91% of traffic to the wrong node instantly. Second, hot keys: once you solve partitioning, a single celebrity profile or trending post can still overload one node while the other 49 sit idle. The fix for each problem is non-obvious and the two solutions don't naturally compose, which is why this is a staple interview question.

Functional requirements

GET key → returns value or null.
SET key value [TTL].
DELETE key.
(Optional) INCR, DECR for counters.
(Optional) atomic compare-and-swap (CAS).

Non-functional

p99 latency < 5ms even at high QPS.
Throughput: 10M ops/sec across the cluster.
High availability: a node failing should not lose more than its share of cached entries.
Eventual consistency is fine — this is a cache, not the source of truth.

Capacity

Dimension	Estimate	How we got there
Working set	10 TB hot data	Given
Effective capacity per node	~205 GB	256 GB commodity node × 80% usable
Node count (capacity)	~50 nodes	`10 TB ÷ 205 GB ≈ 49` → round to 50
Cluster throughput	10 M ops/sec	Given
Per-node throughput	200 k ops/sec	`10 M ÷ 50 nodes`
Avg value size	~5 KB	Given
Network per node (peak)	~1 GB/s	`200 k ops/s × 5 KB`
NIC capacity (10 Gbps)	~1.25 GB/s	Standard 10 Gbps NIC — already tight at peak (~1 GB/s load vs 1.25 GB/s capacity); consider jumbo frames or 25 Gbps NICs

Takeaway: with 50 nodes you can absorb 10 M ops/sec, but NIC saturation — not CPU — is the first wall you'll hit; keep average value sizes small or upgrade to 25 Gbps links before scaling further.

The network bandwidth figure deserves attention: at 5 KB average values and 200 k ops/sec, you need ~1 GB/s per node. A standard 10 Gbps NIC provides ~1.25 GB/s capacity — already tight at that peak load. Staff-level answer: prefer smaller values (serialize only needed fields), or scale horizontally before hitting NIC limits.

Building up to the design

The full picture — consistent hashing, replication, slab allocator, membership service, hot-key mitigation — is a lot to absorb at once. Each layer only makes sense once you've felt the pain that justifies it, so let's walk forward and earn each piece.

Start simple: a dict and a lock

cache = {}
lock = threading.Lock()
def get(k):  return cache.get(k)
def put(k,v):
    with lock: cache[k] = v

This gets you ~100k ops/sec on one box in about five minutes. It also dies with the process, caps at available RAM, and can only serve a single app's workers. The moment you need multiple app servers sharing state, you need to move the cache out of process.

Put it on a separate host (Memcached)

Move the dict into a single Memcached process. App servers talk to it over the network. Now any number of app servers can share one cache, and restarting the app boxes no longer wipes cached data. That's progress. The problem is that one cache node still caps your capacity at a few hundred GB and your throughput at a few hundred thousand ops/sec. When you outgrow it — and you will — the naive next step is to add more nodes.

Add nodes: hash-based sharding

Have N cache nodes. The client picks one with hash(key) % N. Capacity and throughput scale linearly. But now try adding a node: N changes from 10 to 11, and nearly every key routes to a different node. You just invalidated almost your entire cache simultaneously. The resulting miss storm hits your database all at once. That's not a scaling strategy — that's an outage.

Consistent hashing + virtual nodes

This is the fix that makes distributed caching actually work. Hash keys and nodes onto a ring; each key lives on the first node clockwise from its hash. When you add or remove a node, only the keys that were between it and its predecessor on the ring need to move — roughly 1/N of the total. A 50-node cluster adding one more node moves about 2% of keys, not 100%.

flowchart LR
    subgraph ring ["Consistent Hash Ring"]
        direction LR
        K1(["key: user:99<br/>hash → 12:00"]) --> NA["Node A<br/>owns 10:00–2:00"]
        K2(["key: post:42<br/>hash → 3:30"]) --> NB["Node B<br/>owns 2:00–5:00"]
        K3(["key: feed:7<br/>hash → 7:45"]) --> NC["Node C<br/>owns 5:00–10:00"]
    end
    style NA fill:#15803d,color:#fff
    style NB fill:#0e7490,color:#fff
    style NC fill:#ff6b1a,color:#0a0a0f

Virtual nodes (vnodes) take this one step further. Instead of placing each real server once on the ring, you place it 100–256 times. When a node fails, its keys scatter across many surviving nodes rather than piling entirely onto one neighbor. The load distribution stays even even under churn.

Graceful resizing is solved. But a node death still means its slice of the cache goes cold — DB load spikes briefly for those keys. You need somewhere to fall back.

Replication for HA

Each key is stored on N nodes (typically N=2 or 3): the primary owner and the next one or two clockwise on the ring. Writes replicate asynchronously. When a node dies, reads that would have gone there fall back to a replica. The DB never sees it. This is replication as an availability tactic, not a consistency one — you're not trying to make all copies agree instantly, just to keep the cache warm through a failure.

That handles most failure modes. But there's one more pathology: what if a single key is so popular that even one node can't absorb the reads?

Hot-key mitigation + per-node optimizations

Some keys are disproportionately popular — a trending celebrity's profile, the site homepage, a viral post. Even with 50 nodes, "Justin Bieber's Wikipedia page" lands on exactly one of them. That node gets 10× the load of its peers and starts dropping requests.

The fix is to stop treating the single-owner model as sacred for those keys. Scatter extra copies of hot keys to K randomly chosen nodes, and have clients pick one at random on each read. For the very hottest items — the top 100 or so by request rate — go further and cache them in a small in-process store on each app server. That read never even touches the network. You get microsecond latency instead of milliseconds, and the cache nodes see zero traffic for those keys.

Rounding out the per-node story: Memcached uses a slab allocator (described below) to avoid malloc fragmentation, and evicts via LRU when a slab class fills up.

Membership service + smart clients

The last piece is how clients know the ring. A small ZooKeeper or etcd cluster tracks which cache nodes are alive. Each client subscribes to watch updates; when a node joins or fails, the membership service notifies all watchers within milliseconds and clients rebuild their local ring instantly. No proxy is needed in the request path — the client routes directly to the right node based on its local view.

This is the production design: consistent hashing, async replication, hot-key scatter, in-process L1, and a membership service keeping every client in sync.

flowchart LR
    V1[V1: in-process dict<br/>one app] --> V2[V2: + Memcached box<br/>shared, capped]
    V2 --> V3[V3: + sharding<br/>resharding hell]
    V3 --> V4[V4: consistent hash<br/>graceful resize]
    V4 --> V5[V5: + replicas<br/>HA]
    V5 --> V6[V6: + hot-key handling<br/>no node melts]
    V6 --> V7[V7: + membership + smart clients<br/>production]
    style V1 fill:#0e7490,color:#fff
    style V4 fill:#15803d,color:#fff
    style V6 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

Architecture

flowchart TD
    APP1[App Server 1] --> CLIENT[Cache Client Library]
    APP2[App Server 2] --> CLIENT
    APP3[App Server N] --> CLIENT
    CLIENT --> R{Consistent Hash Ring}
    R --> N1[Cache Node 1]
    R --> N2[Cache Node 2]
    R --> N3[Cache Node 3]
    R --> NN[Cache Node N]

    N1 -.replicate.-> N2
    N2 -.replicate.-> N3

    M[(Membership Service<br/>Zookeeper / etcd)] --> CLIENT

    style CLIENT fill:#ff6b1a,color:#0a0a0f
    style R fill:#15803d,color:#fff
    style M fill:#a855f7,color:#fff

Three layers:

Cache nodes — hold key/value pairs in memory.
Client library — embedded in every app server; knows the ring and routes requests.
Membership service — tracks which nodes are alive and which key ranges they own.

Sharding: where does a key live?

Consistent hashing. Each node is a vnode set on a ring; each key hashes to "the next node clockwise."

def find_node(key):
    h = hash(key)
    idx = bisect(sorted_hashes, h) % len(sorted_hashes)
    return ring[sorted_hashes[idx]]

The lookup is a binary search — fast regardless of ring size. Adding or removing a node only moves ~1/N of keys. With virtual nodes (~100–256 per real node), load stays near-even across the cluster. The ring is replicated to every client via the membership service, so clients route directly with no proxy hop.

Replication

For HA, each key is replicated to N nodes (typically N=2 or 3).

node = primary owner
node[+1] = replica
node[+2] = replica  (if N=3)

Where the next two clockwise nodes hold copies. On read, the client may:

Read primary, fall back to replica on failure.
Or read all N and use the freshest (timestamp).
Or read with quorum (R + W > N for strong-ish consistency).

For a pure cache, async replication is fine — losing a few writes is acceptable.

Storage in a single node

Each node holds an in-memory hashmap with eviction.

Slab allocator (Memcached)

Memcached pre-allocates memory in fixed-size "slabs" to avoid fragmentation:

slab class 1: chunks of 96 bytes
slab class 2: chunks of 120 bytes
slab class 3: chunks of 152 bytes
...

A SET with a 100-byte value goes into the slab class with the smallest fitting chunk size. Internal fragmentation is the price; in return, no malloc/free hot path.

Eviction (LRU)

When a slab class is full and you need to insert, evict the least-recently-used item from that slab class (not globally). Each slab class maintains its own doubly-linked list: head = MRU, tail = LRU. A SET that needs a 100-byte slot can only evict from the slab class whose chunks fit 100 bytes — not from a slab class holding 1 KB chunks, even if those are cold. This can cause counter-intuitive evictions when one size class is hot and another is completely idle.

For Redis, eviction is global across all keys and tunable: noeviction, allkeys-lru, allkeys-lfu, volatile-lru, volatile-lfu, allkeys-random, etc. allkeys-lfu tends to outperform LRU on workloads with repeated popular keys.

The client library

Often the most complex part. Responsibilities:

Maintain the ring locally.
Pool TCP connections to nodes (one connection per node, multiplexed).
Detect node failures, route around them.
Implement retries with backoff.
Optionally batch GETs (one round-trip, multiple keys).

async def get(key):
    node = ring.lookup(key)
    try:
        return await node.get(key)
    except NodeFailed:
        replica = ring.next_replica(key)
        return await replica.get(key)

Membership and failure detection

Use a small, strongly-consistent service (ZooKeeper, etcd, Consul) to track which nodes exist.

Each node heartbeats every few seconds.
If a node misses 3 heartbeats, it's marked dead.
Membership change → ring rebuilt → clients informed via watch.

Alternatively: gossip-based membership (Cassandra, Riak). Each node randomly tells a few peers what it knows; failure suspicion spreads.

Failures and recovery

A node dies

Its keys are gone. New requests for those keys hash to the next node clockwise. Cache misses up — DB load spikes briefly. Within seconds, the new owner re-fills from cache misses.

Mitigation: replicas hold a copy; promote the replica to "primary" so misses don't cascade to the database.

A node comes back

It's empty. Two options:

Cold start — accept the cache miss penalty; refill from misses.
Warmup — copy from a peer (replica). Costs network + CPU.

Most production systems do cold start because warmup adds complexity for marginal benefit.

Network partition

Half the cluster can't talk to the other half. Each half might think the others are dead and reassign keys. When the partition heals, you have divergence — same key with different values on different sides.

For a cache, divergence is OK for a short time. Last-write-wins reconciliation, or just accept stale reads briefly.

Thundering herd / cache stampede

When a popular key expires (or a node loses it), hundreds of concurrent requests may all miss the cache simultaneously and fire DB queries in parallel. This can cause a spike of DB load 100–1000× above steady-state — and the DB returning slowly under that load means keys stay uncached longer, which makes more requests pile up. It's a feedback loop.

The mitigations each trade a different thing:

Technique	How	Trade-off
Mutex / single-flight	First miss acquires a lock and fetches; waiters block on the lock.	Extra coordination; lock holder is a SPOF for that key.
Probabilistic Early Expiration (XFetch)	Recompute key slightly before expiry, with probability that increases as TTL approaches zero. Each request independently decides whether to refresh; no lock needed.	Small overhead on every read near expiry; parameter β controls how aggressively you recompute early.
Stale-while-revalidate	Return the stale value immediately; recompute in the background.	Caller sees stale data during revalidation window; fine for most caches.
In-process L1 (hot keys)	App servers cache the top N keys locally; DB never sees the miss.	Memory per app server; stale for up to L1 TTL.

For most teams, stale-while-revalidate is the right default — it removes the coordination problem entirely and keeps p99 latency flat even during a miss wave.

Hot key problem

A single super-popular key (Justin Bieber's Wikipedia page) lands on one node. That node is overwhelmed.

Mitigations:

Replicate hot keys to multiple nodes; client picks one randomly.
Caching tiers: a small, in-process cache (~1ms network hit becomes ~10μs) absorbs the hottest 100 keys.
Automatic detection: monitor per-key request rates; auto-promote hot keys to local cache.

Operations

Resizing

Adding 10 nodes to a 40-node cluster (50 total):

Vnode rebalance: each new node claims ~1/50 of the ring. 10 new nodes → ~20% of keys move (10/50). Each existing node donates proportionally — no node gives up more than its fair share.
During the move, reads to a moving key may hit either the old or new owner; write-through caches should invalidate during the window. Brief miss rate bump is expected and acceptable.
Done online with no downtime. With virtual nodes (~100–256 per node), the ownership handoff is fine-grained rather than one large contiguous range.

Versioning

Cluster-side feature flags. Different client versions must coexist. Be conservative — wire format compatibility is forever.

Monitoring

Hit rate (% of GETs returning a value).
Latency p50/p99/p99.9 per op.
Memory used, eviction rate.
Network bytes in/out.
Per-key hot list.

A hit rate below 90% on a properly-sized cache is a red flag — either too small, or the workload is genuinely random.

Memcached vs. Redis

	Memcached	Redis
Data types	Strings only	Strings, lists, sets, sorted sets, streams, hashes, bitmaps, geo, HyperLogLog
Threaded	Multi-threaded: one listener + N worker threads (configurable via `-t`); each worker handles many connections via epoll	Command execution single-threaded; network I/O multi-threaded since Redis 6.0 (optional, off by default)
Persistence	None	Optional (RDB snapshots, AOF log)
Replication	None native — use mcrouter (Facebook) for proxy-level replication; twemproxy does NOT support replication (sharding only)	Yes (primary-replica, async)
Cluster	App-side hashing or mcrouter	Native Redis Cluster (CRC16 mod 16384 hash slots — not consistent hashing)
Best for	Pure cache, simple values, max raw throughput	Cache + rich data types (sorted sets, streams, pub/sub, scripting)

For most "I just need a cache" workloads, Memcached's simplicity is a feature. For "I need atomic counters, leaderboards, queues, pub/sub" → Redis.

Hot path: a GET

sequenceDiagram
    participant App
    participant Client as Cache Client
    participant Node
    App->>Client: get("user:42")
    Client->>Client: hash → node 17
    Client->>Node: GET user:42
    Node-->>Client: value (or null)
    Client-->>App: value (or null)
    Note over App,Node: total ~1ms in-region

Optional: if it's a hot key the client recognizes, hit the in-process cache first; if miss, network.

Cache invalidation

Caches break correctness when they hold stale data after a write to the source of truth. The two main strategies are write-through invalidation and TTL-based expiry.

With write-through, every DB write also invalidates (or updates) the cache entry. It's easy to reason about, but it adds latency to the write path, and getting the write and the invalidation to be truly atomic is harder than it looks — most systems don't provide that for free.

TTL-based expiry sidesteps the coordination entirely: accept staleness up to the TTL window. For most caches where brief staleness is acceptable, this is the right default. The tricky part is correctness when two writes race.

sequenceDiagram
    participant W1 as Writer 1
    participant W2 as Writer 2
    participant DB as Database
    participant C as Cache
    Note over W1,C: Wrong order — can leave stale cache permanently
    W1->>C: DELETE cache key
    W2->>DB: write new value
    W2->>C: DELETE cache key
    W1->>DB: write old value
    Note over W1,C: Cache is now empty; next read refills from stale DB row

sequenceDiagram
    participant W as Writer
    participant DB as Database
    participant C as Cache
    Note over W,C: Correct order: DB first, then invalidate
    W->>DB: write new value
    W->>C: DELETE cache key
    Note over W,C: Next reader fetches fresh data from DB

Always write the DB first, then invalidate the cache. If you do it in the other order, a concurrent writer can restore a stale cache entry that never gets evicted again.

Things to discuss in an interview

Consistent hashing + vnodes — the foundation.
Client-side routing vs. proxy-based — pros/cons.
Replication as an availability tactic, not a consistency one.
Hot key mitigation — replicate, in-process tier.
Failure handling — what happens when a node dies, how does the system heal?
Cache stampede / thundering herd — what triggers it, how you prevent it.
Cache invalidation strategy — write-through vs. TTL; why write DB before invalidating.

Things you should now be able to answer

Why is consistent hashing critical for a distributed cache?
What does the client library do that a "dumb client" wouldn't?
How does a Memcached node use slab allocation for memory, and what is the per-slab-class LRU implication?
What happens to cache entries when a node fails? When does the replica promote?
How does Redis Cluster handle resharding — and why is it not consistent hashing?
What causes a cache stampede and how do stale-while-revalidate and Probabilistic Early Expiration (XFetch) prevent it?
Why does the order of write (DB) vs. cache invalidation matter for correctness?

Frequently asked questions

▸Why does consistent hashing with virtual nodes outperform modulo hashing when resizing a cache cluster?

With modulo hashing, changing N from 10 to 11 reroutes roughly 91% of keys to the wrong node instantly, causing a cache miss storm that hits the database all at once. Consistent hashing maps keys and nodes onto a ring, so adding or removing a node only moves approximately 1/N of keys — about 2% on a 50-node cluster. Virtual nodes, placed 100 to 256 times per real server, ensure that when a node fails, its keys scatter across many surviving nodes rather than piling onto one neighbor.

▸How does Memcached's slab allocator handle eviction, and what is the key limitation compared to Redis?

Memcached pre-allocates memory in fixed-size slab classes (96 bytes, 120 bytes, 152 bytes, and so on), and when a slab class fills up it evicts the least-recently-used item from that specific slab class only — not globally. A SET needing a 100-byte slot can only evict from the matching slab class, not from an idle class holding 1 KB chunks. Redis eviction is global and tunable: allkeys-lru, allkeys-lfu, volatile-lfu, and others — allkeys-lfu tends to outperform LRU on workloads with repeated popular keys.

▸What is a cache stampede and which mitigation strategy is recommended as the right default?

A cache stampede occurs when a popular key expires or is lost, causing hundreds of concurrent requests to miss simultaneously, fire parallel DB queries, and potentially spike DB load 100 to 1000 times above steady-state — the slow DB response then prolongs the miss window, creating a feedback loop. The article recommends stale-while-revalidate as the right default for most teams: return the stale value immediately and recompute in the background, which removes coordination entirely and keeps p99 latency flat during a miss wave. Alternatives include mutex/single-flight and Probabilistic Early Expiration (XFetch), each trading different costs.

▸When should you choose Memcached over Redis for a distributed cache?

Memcached is the right choice for pure cache workloads with simple string values where maximum raw throughput is the priority; its simplicity is a feature there. Redis is the right choice when you need atomic counters, leaderboards, queues, sorted sets, pub/sub, streams, or optional persistence via RDB snapshots or AOF log. One concrete distinction: Redis Cluster uses CRC16 mod 16384 hash slots rather than consistent hashing, while Memcached relies on app-side hashing or a proxy like mcrouter.

▸Why must you write to the database before invalidating the cache, and what goes wrong if you reverse the order?

If you delete the cache entry first and then write to the database, a concurrent second writer can delete the cache entry and write its value between your two steps, and then your stale DB write lands, leaving the cache empty with a stale row underneath — subsequent reads refill the cache from that stale data, and the bad value never gets evicted. Writing the database first ensures that any reader who misses the cache during the invalidation window fetches the correct, already-committed value.

← previous

Design a Notification System (Push, Email, SMS)

Design Search Autocomplete (Typeahead)

// RELATED

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform