Consistent Hashing
Why hash-mod-N breaks when you resize, and how Amazon Dynamo, Cassandra, and Memcached avoid it with consistent hashing and virtual nodes.
If you take only one algorithm from this whole site, take this one. Consistent hashing is the single most reused trick in distributed systems — caches, sharded databases, service meshes, CDNs, every load balancer at scale. Once you see it, you'll see it everywhere.
The problem: naive sharding breaks on resize
Suppose you have 4 cache servers and a million keys. The simplest sharding scheme:
server = hash(key) % 4
Works great. Until you need to add a 5th server. Now:
server = hash(key) % 5
Almost every key now hashes to a different server. Cache hit rate goes from 95% to ~0% overnight. Your databases melt, your latency explodes, you page someone at 3am.
flowchart LR
subgraph "Before: 4 servers"
K1[key 'apple'<br/>hash=42] -->|"42 % 4 = 2"| S2A[Server 2]
K2[key 'orange'<br/>hash=37] -->|"37 % 4 = 1"| S1A[Server 1]
end
subgraph "After: 5 servers"
K1B[key 'apple'<br/>hash=42] -->|"42 % 5 = 2"| S2B[Server 2]
K2B[key 'orange'<br/>hash=37] -->|"37 % 5 = 2"| S2B
end
style K1B fill:#ff2e88,color:#fff
style K2B fill:#ff2e88,color:#fff
Roughly N/(N+1) of keys move when you go from N to N+1 servers. With 4 → 5, that's 80% of keys remapped. Useless.
The idea: a hash ring
Consistent hashing puts both the keys and the servers on the same conceptual circle (the "ring"). You hash both into the same large integer space (say, 0 to 2^32 - 1), arrange them around the ring, and assign each key to the next server clockwise.
flowchart TD
subgraph "Hash ring (0 to 2^32 - 1)"
A[Server A<br/>at hash 100]
B[Server B<br/>at hash 1000]
C[Server C<br/>at hash 2500]
D[Server D<br/>at hash 3500]
end
K1[key 'apple'<br/>hash 50] -->|next clockwise| A
K2[key 'banana'<br/>hash 800] -->|next clockwise| B
K3[key 'cherry'<br/>hash 1500] -->|next clockwise| C
K4[key 'durian'<br/>hash 3000] -->|next clockwise| D
K5[key 'elderberry'<br/>hash 4000] -->|wraps around to| A
style A fill:#ff6b1a,color:#0a0a0f
style B fill:#ffaa00,color:#0a0a0f
style C fill:#15803d,color:#fff
style D fill:#0e7490,color:#fff
Now what happens when we add a new server E at hash 1800? Only the keys whose next-clockwise-server was C (and who hash between B and 1800) move to E. Roughly 1/(N+1) of keys move, not N/(N+1).
When we remove a server, only its keys redistribute to the next neighbor.
flowchart LR
subgraph "Add server E at 1800"
direction TB
NA[A: 100]
NB[B: 1000]
NE[E: 1800<br/>NEW]
NC[C: 2500]
ND[D: 3500]
end
NK1[apple, banana — unchanged]
NK2[cherry: 1500] -->|"used to go to C,<br/>now goes to E"| NE
style NE fill:#15803d,color:#fff
Why this matters
Adding a server moves only ~1/(N+1) keys. This is the property that makes consistent hashing usable in practice.
| Scheme | Keys remapped on N → N+1 |
|---|---|
hash(k) % N | ≈ N/(N+1) (almost all) |
| Consistent hashing | ≈ 1/(N+1) (a small slice) |
For N = 100, going from 100 → 101 servers, the naive scheme reshuffles ~99% of keys; consistent hashing reshuffles ~1%.
The unevenness problem
There's a catch. Random hashing puts servers at random positions on the ring, and random points are not uniformly spaced. With 4 servers, you might end up with one server "owning" 60% of the ring and another only 5%.
flowchart TD
subgraph "Bad luck: uneven server placement"
BA[A: 100]
BB[B: 200]
BC[C: 300]
BD[D: 3500]
end
Note["A, B, C cluster together<br/>D 'owns' ~60% of the ring"]
style BD fill:#ff2e88,color:#fff
D will get most of the load. Not what we wanted.
The fix: virtual nodes (vnodes)
Instead of placing each server once on the ring, we place it many times — commonly 100–256 virtual copies, depending on fleet size and whether the allocator picks positions randomly or deterministically. Each virtual node is at a different position derived from hash(server + "#" + i).
flowchart TD
subgraph "With virtual nodes (vnodes)"
VA1[A#1]
VA2[A#2]
VA3[A#...]
VB1[B#1]
VB2[B#2]
VB3[B#...]
VC1[C#1]
VC2[C#2]
VC3[C#...]
end
Note["100 vnodes per server<br/>distribution evens out"]
style VA1 fill:#ff6b1a,color:#0a0a0f
style VB1 fill:#ffaa00,color:#0a0a0f
style VC1 fill:#15803d,color:#fff
With 100 vnodes per server, the law of large numbers kicks in: each server's share of the ring is within ~1% of 1/N. There's a second benefit worth calling out: when a server fails, its keys get redistributed across all remaining servers (not just one neighbor), spreading the recovery load. Without vnodes, you'd flood a single node; with them, every survivor absorbs a proportional share.
Pseudocode
import bisect
import hashlib
class ConsistentHash:
def __init__(self, replicas: int = 100):
self.replicas = replicas
self.ring: dict[int, str] = {} # hash -> server
self.sorted_hashes: list[int] = [] # sorted keys of ring
def _h(self, key: str) -> int:
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_server(self, server: str):
for i in range(self.replicas):
h = self._h(f"{server}#{i}")
self.ring[h] = server
bisect.insort(self.sorted_hashes, h)
def remove_server(self, server: str):
for i in range(self.replicas):
h = self._h(f"{server}#{i}")
del self.ring[h]
self.sorted_hashes.remove(h)
def get_server(self, key: str) -> str:
if not self.ring:
return None
h = self._h(key)
idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
return self.ring[self.sorted_hashes[idx]]
That's the entire algorithm. ~30 lines.
Where consistent hashing is used in real systems
| System | Use |
|---|---|
| Amazon Dynamo / DynamoDB | Partition data across nodes |
| Apache Cassandra | Same. Each node owns a range of the ring |
| Memcached clients (e.g. Ketama) | Pick which server holds a key |
| Riak | Partition data |
| Akamai, CDN routing | Map URLs to edge servers |
| Discord's Elixir services | Map guild entities to nodes (via ex_hash_ring) |
| Many service meshes | Sticky load balancing for stateful workloads |
Variants and refinements
Rendezvous hashing (HRW)
The idea here is different: for each key, compute hash(key + server_id) for every server and pick the one that scores highest. There's no ring and no vnodes — just a list of servers and a comparison. The distribution is perfectly even, and you don't need to reason about ring positions at all.
The cost is that lookup is O(N) — you must score every server. That's fine up to a few hundred nodes (the comparisons are embarrassingly parallelizable), but it becomes impractical at thousands. Used in Microsoft's CARP and some BitTorrent trackers.
Jump consistent hash (Google, 2014)
A clever closed-form algorithm that takes a key and a bucket count and returns the chosen bucket — no data structures at all. The paper describes it as ~5 lines of code:
// Jump consistent hash, from the paper
int32_t jump_hash(uint64_t key, int32_t num_buckets) {
int64_t b = -1, j = 0;
while (j < num_buckets) {
b = j;
key = key * 2862933555777941757ULL + 1;
j = (b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1));
}
return b;
}
It runs in O(log N) and uses zero memory. The constraint is that your buckets must be numbered 0..N-1, and the only way to "remove a server" is to remove the highest-numbered bucket. You can't pull an arbitrary node out of the middle. That makes it well-suited to bulk storage systems with sequentially numbered shards (used inside Google for exactly that), but awkward for any cluster where arbitrary nodes come and go.
Anchor hashing, Multi-probe consistent hashing
Newer variants that try to combine the strengths of all the above. Mostly a research curiosity unless you're maintaining a hashing library.
Picking the right approach
flowchart TD
Q1{"Buckets numbered<br/>0..N-1 and grow<br/>only at the end?"}
Q1 -->|yes| JUMP["Jump hash<br/>O(log N), zero memory"]
Q1 -->|no| Q2{"Fleet size<br/>≤ a few hundred?"}
Q2 -->|yes| HRW["Rendezvous HRW<br/>perfect balance, O(N) lookup"]
Q2 -->|no| Q3{"Heterogeneous<br/>node capacity?"}
Q3 -->|yes| WRING[Weighted ring<br/>more vnodes to bigger nodes]
Q3 -->|no| RING[Ring + vnodes<br/>the standard answer]
style JUMP fill:#0e7490,color:#fff
style HRW fill:#ffaa00,color:#0a0a0f
style WRING fill:#a855f7,color:#fff
style RING fill:#ff6b1a,color:#0a0a0f
Trade-offs to know
Memory is essentially free. With 1000 servers × 100 vnodes you have 100k ring entries — trivially small.
Lookup speed is O(log V) binary search over the sorted vnode list, where V is the total vnode count. Fast enough at any practical scale.
Rebalancing when a node joins or leaves: ~1/(N+1) of keys move, and with vnodes those keys scatter across all surviving nodes rather than piling onto a single neighbor. That's the load-spreading property in action.
Heterogeneous servers are handled naturally by giving bigger nodes more vnodes — weighted consistent hashing. No protocol changes needed.
Vnode count tuning is worth understanding. More vnodes gives better balance but creates more data-streaming pairs at bootstrap time. Cassandra 3.x used 256 random positions per node; Cassandra 4.x dropped to 16 by adding a deterministic token allocator (allocate_tokens_for_local_replication_factor) that achieves similar balance with far less churn. The right count depends on whether your allocator is random or deterministic.
Hot-key problem is a separate concern. Consistent hashing distributes keys across nodes evenly in aggregate, but if a single key is read or written at extreme rate — a viral tweet's media object, say — it still lands on exactly one node. That requires a different remedy: key-level replication, local caching, or sharding the key's value itself.
Gossip consistency of the ring matters in practice. Every node needs an up-to-date view of ring membership for routing to work without a central coordinator. Cassandra gossips token ownership continuously. A stale ring view causes wrong-node routing, which the target node corrects with a redirect — a latency spike, not a correctness failure, but worth planning for.
Things you should now be able to answer
If you got here from that hypothetical 3am page, here is what consistent hashing gives you: the ability to add a server without melting your database. The five questions below are the ones interviewers reach for — work through each one in your own words before moving on.
- Why does
hash(key) % Nbreak when N changes? How much data moves? - What is a virtual node and why do we need them?
- Why does Cassandra use consistent hashing?
- When would jump hashing beat ring-based consistent hashing?
- A new node is added to the ring. Which keys move and which don't?
Further reading
- Karger et al., "Consistent Hashing and Random Trees" (1997) — the original paper
- DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (2007)
- Lamping & Veach, "A Fast, Minimal Memory, Consistent Hash Algorithm" (2014) — jump hashing (arXiv:1406.2294)
- Cassandra docs: "How data is distributed across a cluster (using virtual nodes)" — Datastax/Apache Cassandra documentation on vnode tuning
Frequently asked questions
▸What is consistent hashing and why is it better than hash-mod-N sharding?
Consistent hashing places both servers and keys on a shared hash ring spanning 0 to 2^32-1, assigning each key to the next clockwise server. Unlike hash(key) % N, which remaps roughly N/(N+1) of all keys when the server count changes — about 80% when going from 4 to 5 nodes — consistent hashing moves only ~1/(N+1) keys, so a cache cluster survives scaling events without a mass-miss storm.
▸What are virtual nodes and why does consistent hashing need them?
Virtual nodes (vnodes) give each physical server 100–256 positions on the ring instead of one, derived from hash(server + "#" + i). Without them, random placement clusters servers unevenly and one node can own 60% of the ring while another owns 5%. With 100 vnodes per server, the law of large numbers brings each server's share to within ~1% of 1/N, and when a server fails its keys scatter across all survivors rather than flooding a single neighbor.
▸How many virtual nodes per server do Cassandra and other real systems use?
The typical range is 100–256 vnodes per server. Cassandra 3.x used 256 random positions per node; Cassandra 4.x dropped to 16 by introducing a deterministic token allocator that achieves similar balance with far less churn at bootstrap time. The right count depends on whether the allocator picks positions randomly or deterministically.
▸When should I use jump hashing instead of ring-based consistent hashing?
Use jump hashing when your buckets are numbered 0 through N-1 and you only ever add capacity at the end — you cannot remove an arbitrary node from the middle. It runs in O(log N) with zero memory and fits in about 5 lines of C. Ring-based consistent hashing is the standard answer whenever nodes can join or leave in arbitrary order, which covers most cache and database clusters.
▸What is the lookup cost and memory cost of consistent hashing with virtual nodes?
Lookup is O(log V) binary search over the sorted vnode list, where V equals servers multiplied by vnodes per server. Memory cost is trivial: 1000 servers with 100 vnodes each produces 100,000 ring entries, well within any practical budget.
You may also like
Design a Vector Database / Semantic Search Service
Index 1 billion 768-dimensional vectors and answer top-k similarity queries in under 20 ms — the ANN indexing, sharding, and filtering architecture behind Pinecone, Weaviate, and pgvector.
Design a Social Graph Service (Facebook's TAO)
Serve billions of "who follows whom" reads over a graph of trillions of edges. The objects-and-associations model, a cache in front of sharded SQL, and the hot-vertex problem.
Design an Authorization System (Google Zanzibar / RBAC / ReBAC)
Answer "can user U do action A on resource R?" globally, in milliseconds, consistently. RBAC vs ABAC vs ReBAC, Zanzibar relation tuples, and the new-enemy problem.