~/articles/consistent-hashing

◆◆Intermediateasked at Amazonasked at Metaasked at Netflixasked at Google

Consistent Hashing

Why hash-mod-N breaks when you resize, and how Amazon Dynamo, Cassandra, and Memcached avoid it with consistent hashing and virtual nodes.

9 min read2026-02-02Ironclad Academy

#distributed-systems #hashing #sharding #load-balancing

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

If you take only one algorithm from this whole site, take this one. Consistent hashing is the single most reused trick in distributed systems — caches, sharded databases, service meshes, CDNs, every load balancer at scale. Once you see it, you'll see it everywhere.

The problem: naive sharding breaks on resize

Suppose you have 4 cache servers and a million keys. The simplest sharding scheme:

server = hash(key) % 4

Works great. Until you need to add a 5th server. Now:

server = hash(key) % 5

Almost every key now hashes to a different server. Cache hit rate goes from 95% to ~0% overnight. Your databases melt, your latency explodes, you page someone at 3am.

flowchart LR
    subgraph "Before: 4 servers"
    K1[key 'apple'<br/>hash=42] -->|"42 % 4 = 2"| S2A[Server 2]
    K2[key 'orange'<br/>hash=37] -->|"37 % 4 = 1"| S1A[Server 1]
    end
    subgraph "After: 5 servers"
    K1B[key 'apple'<br/>hash=42] -->|"42 % 5 = 2"| S2B[Server 2]
    K2B[key 'orange'<br/>hash=37] -->|"37 % 5 = 2"| S2B
    end
    style K1B fill:#ff2e88,color:#fff
    style K2B fill:#ff2e88,color:#fff

Roughly N/(N+1) of keys move when you go from N to N+1 servers. With 4 → 5, that's 80% of keys remapped. Useless.

The idea: a hash ring

Consistent hashing puts both the keys and the servers on the same conceptual circle (the "ring"). You hash both into the same large integer space (say, 0 to 2^32 - 1), arrange them around the ring, and assign each key to the next server clockwise.

flowchart TD
    subgraph "Hash ring (0 to 2^32 - 1)"
    A[Server A<br/>at hash 100]
    B[Server B<br/>at hash 1000]
    C[Server C<br/>at hash 2500]
    D[Server D<br/>at hash 3500]
    end
    K1[key 'apple'<br/>hash 50] -->|next clockwise| A
    K2[key 'banana'<br/>hash 800] -->|next clockwise| B
    K3[key 'cherry'<br/>hash 1500] -->|next clockwise| C
    K4[key 'durian'<br/>hash 3000] -->|next clockwise| D
    K5[key 'elderberry'<br/>hash 4000] -->|wraps around to| A
    style A fill:#ff6b1a,color:#0a0a0f
    style B fill:#ffaa00,color:#0a0a0f
    style C fill:#15803d,color:#fff
    style D fill:#0e7490,color:#fff

Now what happens when we add a new server E at hash 1800? Only the keys whose next-clockwise-server was C (and who hash between B and 1800) move to E. Roughly 1/(N+1) of keys move, not N/(N+1).

When we remove a server, only its keys redistribute to the next neighbor.

flowchart LR
    subgraph "Add server E at 1800"
    direction TB
    NA[A: 100]
    NB[B: 1000]
    NE[E: 1800<br/>NEW]
    NC[C: 2500]
    ND[D: 3500]
    end
    NK1[apple, banana — unchanged]
    NK2[cherry: 1500] -->|"used to go to C,<br/>now goes to E"| NE
    style NE fill:#15803d,color:#fff

Why this matters

Adding a server moves only ~1/(N+1) keys. This is the property that makes consistent hashing usable in practice.

Scheme	Keys remapped on N → N+1
`hash(k) % N`	≈ N/(N+1) (almost all)
Consistent hashing	≈ 1/(N+1) (a small slice)

For N = 100, going from 100 → 101 servers, the naive scheme reshuffles ~99% of keys; consistent hashing reshuffles ~1%.

The unevenness problem

There's a catch. Random hashing puts servers at random positions on the ring, and random points are not uniformly spaced. With 4 servers, you might end up with one server "owning" 60% of the ring and another only 5%.

flowchart TD
    subgraph "Bad luck: uneven server placement"
    BA[A: 100]
    BB[B: 200]
    BC[C: 300]
    BD[D: 3500]
    end
    Note["A, B, C cluster together<br/>D 'owns' ~60% of the ring"]
    style BD fill:#ff2e88,color:#fff

D will get most of the load. Not what we wanted.

The fix: virtual nodes (vnodes)

Instead of placing each server once on the ring, we place it many times — commonly 100–256 virtual copies, depending on fleet size and whether the allocator picks positions randomly or deterministically. Each virtual node is at a different position derived from hash(server + "#" + i).

flowchart TD
    subgraph "With virtual nodes (vnodes)"
    VA1[A#1]
    VA2[A#2]
    VA3[A#...]
    VB1[B#1]
    VB2[B#2]
    VB3[B#...]
    VC1[C#1]
    VC2[C#2]
    VC3[C#...]
    end
    Note["100 vnodes per server<br/>distribution evens out"]
    style VA1 fill:#ff6b1a,color:#0a0a0f
    style VB1 fill:#ffaa00,color:#0a0a0f
    style VC1 fill:#15803d,color:#fff

With 100 vnodes per server, the law of large numbers kicks in: each server's share of the ring is within ~1% of 1/N. There's a second benefit worth calling out: when a server fails, its keys get redistributed across all remaining servers (not just one neighbor), spreading the recovery load. Without vnodes, you'd flood a single node; with them, every survivor absorbs a proportional share.

Pseudocode

import bisect
import hashlib

class ConsistentHash:
    def __init__(self, replicas: int = 100):
        self.replicas = replicas
        self.ring: dict[int, str] = {}     # hash -> server
        self.sorted_hashes: list[int] = [] # sorted keys of ring

    def _h(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str):
        for i in range(self.replicas):
            h = self._h(f"{server}#{i}")
            self.ring[h] = server
            bisect.insort(self.sorted_hashes, h)

    def remove_server(self, server: str):
        for i in range(self.replicas):
            h = self._h(f"{server}#{i}")
            del self.ring[h]
            self.sorted_hashes.remove(h)

    def get_server(self, key: str) -> str:
        if not self.ring:
            return None
        h = self._h(key)
        idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
        return self.ring[self.sorted_hashes[idx]]

That's the entire algorithm. ~30 lines.

Where consistent hashing is used in real systems

System	Use
Amazon Dynamo / DynamoDB	Partition data across nodes
Apache Cassandra	Same. Each node owns a range of the ring
Memcached clients (e.g. Ketama)	Pick which server holds a key
Riak	Partition data
Akamai, CDN routing	Map URLs to edge servers
Discord's Elixir services	Map guild entities to nodes (via ex_hash_ring)
Many service meshes	Sticky load balancing for stateful workloads

Rendezvous hashing (HRW)

The idea here is different: for each key, compute hash(key + server_id) for every server and pick the one that scores highest. There's no ring and no vnodes — just a list of servers and a comparison. The distribution is perfectly even, and you don't need to reason about ring positions at all.

The cost is that lookup is O(N) — you must score every server. That's fine up to a few hundred nodes (the comparisons are embarrassingly parallelizable), but it becomes impractical at thousands. Used in Microsoft's CARP and some BitTorrent trackers.

Jump consistent hash (Google, 2014)

A clever closed-form algorithm that takes a key and a bucket count and returns the chosen bucket — no data structures at all. The paper describes it as ~5 lines of code:

// Jump consistent hash, from the paper
int32_t jump_hash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1));
    }
    return b;
}

It runs in O(log N) and uses zero memory. The constraint is that your buckets must be numbered 0..N-1, and the only way to "remove a server" is to remove the highest-numbered bucket. You can't pull an arbitrary node out of the middle. That makes it well-suited to bulk storage systems with sequentially numbered shards (used inside Google for exactly that), but awkward for any cluster where arbitrary nodes come and go.

Anchor hashing, Multi-probe consistent hashing

Newer variants that try to combine the strengths of all the above. Mostly a research curiosity unless you're maintaining a hashing library.

Picking the right approach

flowchart TD
    Q1{"Buckets numbered<br/>0..N-1 and grow<br/>only at the end?"}
    Q1 -->|yes| JUMP["Jump hash<br/>O(log N), zero memory"]
    Q1 -->|no| Q2{"Fleet size<br/>≤ a few hundred?"}
    Q2 -->|yes| HRW["Rendezvous HRW<br/>perfect balance, O(N) lookup"]
    Q2 -->|no| Q3{"Heterogeneous<br/>node capacity?"}
    Q3 -->|yes| WRING[Weighted ring<br/>more vnodes to bigger nodes]
    Q3 -->|no| RING[Ring + vnodes<br/>the standard answer]
    style JUMP fill:#0e7490,color:#fff
    style HRW fill:#ffaa00,color:#0a0a0f
    style WRING fill:#a855f7,color:#fff
    style RING fill:#ff6b1a,color:#0a0a0f

Trade-offs to know

Memory is essentially free. With 1000 servers × 100 vnodes you have 100k ring entries — trivially small.

Lookup speed is O(log V) binary search over the sorted vnode list, where V is the total vnode count. Fast enough at any practical scale.

Rebalancing when a node joins or leaves: ~1/(N+1) of keys move, and with vnodes those keys scatter across all surviving nodes rather than piling onto a single neighbor. That's the load-spreading property in action.

Heterogeneous servers are handled naturally by giving bigger nodes more vnodes — weighted consistent hashing. No protocol changes needed.

Vnode count tuning is worth understanding. More vnodes gives better balance but creates more data-streaming pairs at bootstrap time. Cassandra 3.x used 256 random positions per node; Cassandra 4.x dropped to 16 by adding a deterministic token allocator (allocate_tokens_for_local_replication_factor) that achieves similar balance with far less churn. The right count depends on whether your allocator is random or deterministic.

Hot-key problem is a separate concern. Consistent hashing distributes keys across nodes evenly in aggregate, but if a single key is read or written at extreme rate — a viral tweet's media object, say — it still lands on exactly one node. That requires a different remedy: key-level replication, local caching, or sharding the key's value itself.

Gossip consistency of the ring matters in practice. Every node needs an up-to-date view of ring membership for routing to work without a central coordinator. Cassandra gossips token ownership continuously. A stale ring view causes wrong-node routing, which the target node corrects with a redirect — a latency spike, not a correctness failure, but worth planning for.

Things you should now be able to answer

If you got here from that hypothetical 3am page, here is what consistent hashing gives you: the ability to add a server without melting your database. The five questions below are the ones interviewers reach for — work through each one in your own words before moving on.

Why does hash(key) % N break when N changes? How much data moves?
What is a virtual node and why do we need them?
Why does Cassandra use consistent hashing?
When would jump hashing beat ring-based consistent hashing?
A new node is added to the ring. Which keys move and which don't?

Frequently asked questions

▸What is consistent hashing and why is it better than hash-mod-N sharding?

Consistent hashing places both servers and keys on a shared hash ring spanning 0 to 2^32-1, assigning each key to the next clockwise server. Unlike hash(key) % N, which remaps roughly N/(N+1) of all keys when the server count changes — about 80% when going from 4 to 5 nodes — consistent hashing moves only ~1/(N+1) keys, so a cache cluster survives scaling events without a mass-miss storm.

▸What are virtual nodes and why does consistent hashing need them?

Virtual nodes (vnodes) give each physical server 100–256 positions on the ring instead of one, derived from hash(server + "#" + i). Without them, random placement clusters servers unevenly and one node can own 60% of the ring while another owns 5%. With 100 vnodes per server, the law of large numbers brings each server's share to within ~1% of 1/N, and when a server fails its keys scatter across all survivors rather than flooding a single neighbor.

▸How many virtual nodes per server do Cassandra and other real systems use?

The typical range is 100–256 vnodes per server. Cassandra 3.x used 256 random positions per node; Cassandra 4.x dropped to 16 by introducing a deterministic token allocator that achieves similar balance with far less churn at bootstrap time. The right count depends on whether the allocator picks positions randomly or deterministically.

▸When should I use jump hashing instead of ring-based consistent hashing?

Use jump hashing when your buckets are numbered 0 through N-1 and you only ever add capacity at the end — you cannot remove an arbitrary node from the middle. It runs in O(log N) with zero memory and fits in about 5 lines of C. Ring-based consistent hashing is the standard answer whenever nodes can join or leave in arbitrary order, which covers most cache and database clusters.

▸What is the lookup cost and memory cost of consistent hashing with virtual nodes?

Lookup is O(log V) binary search over the sorted vnode list, where V equals servers multiplied by vnodes per server. Memory cost is trivial: 1000 servers with 100 vnodes each produces 100,000 ring entries, well within any practical budget.

← previous

Database Sharding

Scale From Zero to Millions of Users

// RELATED