~/articles/consistent-hashing
◆◆Intermediateasked at Amazonasked at Metaasked at Netflixasked at Google

Consistent Hashing

Why hash-mod-N breaks when you resize, and how Amazon Dynamo, Cassandra, and Memcached avoid it with consistent hashing and virtual nodes.

9 min read2026-02-02Ironclad Academy
// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

If you take only one algorithm from this whole site, take this one. Consistent hashing is the single most reused trick in distributed systems — caches, sharded databases, service meshes, CDNs, every load balancer at scale. Once you see it, you'll see it everywhere.

The problem: naive sharding breaks on resize

Suppose you have 4 cache servers and a million keys. The simplest sharding scheme:

server = hash(key) % 4

Works great. Until you need to add a 5th server. Now:

server = hash(key) % 5

Almost every key now hashes to a different server. Cache hit rate goes from 95% to ~0% overnight. Your databases melt, your latency explodes, you page someone at 3am.

flowchart LR
    subgraph "Before: 4 servers"
    K1[key 'apple'<br/>hash=42] -->|"42 % 4 = 2"| S2A[Server 2]
    K2[key 'orange'<br/>hash=37] -->|"37 % 4 = 1"| S1A[Server 1]
    end
    subgraph "After: 5 servers"
    K1B[key 'apple'<br/>hash=42] -->|"42 % 5 = 2"| S2B[Server 2]
    K2B[key 'orange'<br/>hash=37] -->|"37 % 5 = 2"| S2B
    end
    style K1B fill:#ff2e88,color:#fff
    style K2B fill:#ff2e88,color:#fff

Roughly N/(N+1) of keys move when you go from N to N+1 servers. With 4 → 5, that's 80% of keys remapped. Useless.

The idea: a hash ring

Consistent hashing puts both the keys and the servers on the same conceptual circle (the "ring"). You hash both into the same large integer space (say, 0 to 2^32 - 1), arrange them around the ring, and assign each key to the next server clockwise.

flowchart TD
    subgraph "Hash ring (0 to 2^32 - 1)"
    A[Server A<br/>at hash 100]
    B[Server B<br/>at hash 1000]
    C[Server C<br/>at hash 2500]
    D[Server D<br/>at hash 3500]
    end
    K1[key 'apple'<br/>hash 50] -->|next clockwise| A
    K2[key 'banana'<br/>hash 800] -->|next clockwise| B
    K3[key 'cherry'<br/>hash 1500] -->|next clockwise| C
    K4[key 'durian'<br/>hash 3000] -->|next clockwise| D
    K5[key 'elderberry'<br/>hash 4000] -->|wraps around to| A
    style A fill:#ff6b1a,color:#0a0a0f
    style B fill:#ffaa00,color:#0a0a0f
    style C fill:#15803d,color:#fff
    style D fill:#0e7490,color:#fff

Now what happens when we add a new server E at hash 1800? Only the keys whose next-clockwise-server was C (and who hash between B and 1800) move to E. Roughly 1/(N+1) of keys move, not N/(N+1).

When we remove a server, only its keys redistribute to the next neighbor.

flowchart LR
    subgraph "Add server E at 1800"
    direction TB
    NA[A: 100]
    NB[B: 1000]
    NE[E: 1800<br/>NEW]
    NC[C: 2500]
    ND[D: 3500]
    end
    NK1[apple, banana — unchanged]
    NK2[cherry: 1500] -->|"used to go to C,<br/>now goes to E"| NE
    style NE fill:#15803d,color:#fff

Why this matters

Adding a server moves only ~1/(N+1) keys. This is the property that makes consistent hashing usable in practice.

SchemeKeys remapped on N → N+1
hash(k) % N≈ N/(N+1) (almost all)
Consistent hashing≈ 1/(N+1) (a small slice)

For N = 100, going from 100 → 101 servers, the naive scheme reshuffles ~99% of keys; consistent hashing reshuffles ~1%.

The unevenness problem

There's a catch. Random hashing puts servers at random positions on the ring, and random points are not uniformly spaced. With 4 servers, you might end up with one server "owning" 60% of the ring and another only 5%.

flowchart TD
    subgraph "Bad luck: uneven server placement"
    BA[A: 100]
    BB[B: 200]
    BC[C: 300]
    BD[D: 3500]
    end
    Note["A, B, C cluster together<br/>D 'owns' ~60% of the ring"]
    style BD fill:#ff2e88,color:#fff

D will get most of the load. Not what we wanted.

The fix: virtual nodes (vnodes)

Instead of placing each server once on the ring, we place it many times — commonly 100–256 virtual copies, depending on fleet size and whether the allocator picks positions randomly or deterministically. Each virtual node is at a different position derived from hash(server + "#" + i).

flowchart TD
    subgraph "With virtual nodes (vnodes)"
    VA1[A#1]
    VA2[A#2]
    VA3[A#...]
    VB1[B#1]
    VB2[B#2]
    VB3[B#...]
    VC1[C#1]
    VC2[C#2]
    VC3[C#...]
    end
    Note["100 vnodes per server<br/>distribution evens out"]
    style VA1 fill:#ff6b1a,color:#0a0a0f
    style VB1 fill:#ffaa00,color:#0a0a0f
    style VC1 fill:#15803d,color:#fff

With 100 vnodes per server, the law of large numbers kicks in: each server's share of the ring is within ~1% of 1/N. There's a second benefit worth calling out: when a server fails, its keys get redistributed across all remaining servers (not just one neighbor), spreading the recovery load. Without vnodes, you'd flood a single node; with them, every survivor absorbs a proportional share.

Pseudocode

import bisect
import hashlib

class ConsistentHash:
    def __init__(self, replicas: int = 100):
        self.replicas = replicas
        self.ring: dict[int, str] = {}     # hash -> server
        self.sorted_hashes: list[int] = [] # sorted keys of ring

    def _h(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str):
        for i in range(self.replicas):
            h = self._h(f"{server}#{i}")
            self.ring[h] = server
            bisect.insort(self.sorted_hashes, h)

    def remove_server(self, server: str):
        for i in range(self.replicas):
            h = self._h(f"{server}#{i}")
            del self.ring[h]
            self.sorted_hashes.remove(h)

    def get_server(self, key: str) -> str:
        if not self.ring:
            return None
        h = self._h(key)
        idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
        return self.ring[self.sorted_hashes[idx]]

That's the entire algorithm. ~30 lines.

Where consistent hashing is used in real systems

SystemUse
Amazon Dynamo / DynamoDBPartition data across nodes
Apache CassandraSame. Each node owns a range of the ring
Memcached clients (e.g. Ketama)Pick which server holds a key
RiakPartition data
Akamai, CDN routingMap URLs to edge servers
Discord's Elixir servicesMap guild entities to nodes (via ex_hash_ring)
Many service meshesSticky load balancing for stateful workloads

Variants and refinements

Rendezvous hashing (HRW)

The idea here is different: for each key, compute hash(key + server_id) for every server and pick the one that scores highest. There's no ring and no vnodes — just a list of servers and a comparison. The distribution is perfectly even, and you don't need to reason about ring positions at all.

The cost is that lookup is O(N) — you must score every server. That's fine up to a few hundred nodes (the comparisons are embarrassingly parallelizable), but it becomes impractical at thousands. Used in Microsoft's CARP and some BitTorrent trackers.

Jump consistent hash (Google, 2014)

A clever closed-form algorithm that takes a key and a bucket count and returns the chosen bucket — no data structures at all. The paper describes it as ~5 lines of code:

// Jump consistent hash, from the paper
int32_t jump_hash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1));
    }
    return b;
}

It runs in O(log N) and uses zero memory. The constraint is that your buckets must be numbered 0..N-1, and the only way to "remove a server" is to remove the highest-numbered bucket. You can't pull an arbitrary node out of the middle. That makes it well-suited to bulk storage systems with sequentially numbered shards (used inside Google for exactly that), but awkward for any cluster where arbitrary nodes come and go.

Anchor hashing, Multi-probe consistent hashing

Newer variants that try to combine the strengths of all the above. Mostly a research curiosity unless you're maintaining a hashing library.

Picking the right approach

flowchart TD
    Q1{"Buckets numbered<br/>0..N-1 and grow<br/>only at the end?"}
    Q1 -->|yes| JUMP["Jump hash<br/>O(log N), zero memory"]
    Q1 -->|no| Q2{"Fleet size<br/>≤ a few hundred?"}
    Q2 -->|yes| HRW["Rendezvous HRW<br/>perfect balance, O(N) lookup"]
    Q2 -->|no| Q3{"Heterogeneous<br/>node capacity?"}
    Q3 -->|yes| WRING[Weighted ring<br/>more vnodes to bigger nodes]
    Q3 -->|no| RING[Ring + vnodes<br/>the standard answer]
    style JUMP fill:#0e7490,color:#fff
    style HRW fill:#ffaa00,color:#0a0a0f
    style WRING fill:#a855f7,color:#fff
    style RING fill:#ff6b1a,color:#0a0a0f

Trade-offs to know

Memory is essentially free. With 1000 servers × 100 vnodes you have 100k ring entries — trivially small.

Lookup speed is O(log V) binary search over the sorted vnode list, where V is the total vnode count. Fast enough at any practical scale.

Rebalancing when a node joins or leaves: ~1/(N+1) of keys move, and with vnodes those keys scatter across all surviving nodes rather than piling onto a single neighbor. That's the load-spreading property in action.

Heterogeneous servers are handled naturally by giving bigger nodes more vnodes — weighted consistent hashing. No protocol changes needed.

Vnode count tuning is worth understanding. More vnodes gives better balance but creates more data-streaming pairs at bootstrap time. Cassandra 3.x used 256 random positions per node; Cassandra 4.x dropped to 16 by adding a deterministic token allocator (allocate_tokens_for_local_replication_factor) that achieves similar balance with far less churn. The right count depends on whether your allocator is random or deterministic.

Hot-key problem is a separate concern. Consistent hashing distributes keys across nodes evenly in aggregate, but if a single key is read or written at extreme rate — a viral tweet's media object, say — it still lands on exactly one node. That requires a different remedy: key-level replication, local caching, or sharding the key's value itself.

Gossip consistency of the ring matters in practice. Every node needs an up-to-date view of ring membership for routing to work without a central coordinator. Cassandra gossips token ownership continuously. A stale ring view causes wrong-node routing, which the target node corrects with a redirect — a latency spike, not a correctness failure, but worth planning for.

Things you should now be able to answer

If you got here from that hypothetical 3am page, here is what consistent hashing gives you: the ability to add a server without melting your database. The five questions below are the ones interviewers reach for — work through each one in your own words before moving on.

  • Why does hash(key) % N break when N changes? How much data moves?
  • What is a virtual node and why do we need them?
  • Why does Cassandra use consistent hashing?
  • When would jump hashing beat ring-based consistent hashing?
  • A new node is added to the ring. Which keys move and which don't?

Further reading

  • Karger et al., "Consistent Hashing and Random Trees" (1997) — the original paper
  • DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (2007)
  • Lamping & Veach, "A Fast, Minimal Memory, Consistent Hash Algorithm" (2014) — jump hashing (arXiv:1406.2294)
  • Cassandra docs: "How data is distributed across a cluster (using virtual nodes)" — Datastax/Apache Cassandra documentation on vnode tuning
// FAQ

Frequently asked questions

What is consistent hashing and why is it better than hash-mod-N sharding?

Consistent hashing places both servers and keys on a shared hash ring spanning 0 to 2^32-1, assigning each key to the next clockwise server. Unlike hash(key) % N, which remaps roughly N/(N+1) of all keys when the server count changes — about 80% when going from 4 to 5 nodes — consistent hashing moves only ~1/(N+1) keys, so a cache cluster survives scaling events without a mass-miss storm.

What are virtual nodes and why does consistent hashing need them?

Virtual nodes (vnodes) give each physical server 100–256 positions on the ring instead of one, derived from hash(server + "#" + i). Without them, random placement clusters servers unevenly and one node can own 60% of the ring while another owns 5%. With 100 vnodes per server, the law of large numbers brings each server's share to within ~1% of 1/N, and when a server fails its keys scatter across all survivors rather than flooding a single neighbor.

How many virtual nodes per server do Cassandra and other real systems use?

The typical range is 100–256 vnodes per server. Cassandra 3.x used 256 random positions per node; Cassandra 4.x dropped to 16 by introducing a deterministic token allocator that achieves similar balance with far less churn at bootstrap time. The right count depends on whether the allocator picks positions randomly or deterministically.

When should I use jump hashing instead of ring-based consistent hashing?

Use jump hashing when your buckets are numbered 0 through N-1 and you only ever add capacity at the end — you cannot remove an arbitrary node from the middle. It runs in O(log N) with zero memory and fits in about 5 lines of C. Ring-based consistent hashing is the standard answer whenever nodes can join or leave in arbitrary order, which covers most cache and database clusters.

What is the lookup cost and memory cost of consistent hashing with virtual nodes?

Lookup is O(log V) binary search over the sorted vnode list, where V equals servers multiplied by vnodes per server. Memory cost is trivial: 1000 servers with 100 vnodes each produces 100,000 ring entries, well within any practical budget.

// RELATED

You may also like