~/articles/quorum-consensus-and-read-repair

◆◆◆Advancedasked at Amazonasked at Netflixasked at Apple

Quorums, Read-Repair & Anti-Entropy (Dynamo-style)

How leaderless databases like Dynamo and Cassandra stay available and converge. Quorum R+W>N, read-repair, hinted handoff, Merkle anti-entropy, and conflict resolution.

18 min read2026-03-22Ironclad Academy

#distributed-systems #consistency #replication #availability

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

Leader-based replication (Raft, Paxos, primary-secondary) gives you a clear chain of command: one node decides write order, everyone else follows. That clarity comes at a cost — the leader is a bottleneck, leader election takes time, and if you can't reach the leader during a partition, you refuse writes. Dynamo-style systems (Apache Cassandra, Riak, and the original Amazon Dynamo) make the opposite bet: every replica is equal, every replica can accept a write, and the system converges to agreement in the background. (Note: Amazon DynamoDB, the managed cloud service, has since moved away from leaderless replication to a leader-per-partition model using Multi-Paxos for stronger consistency guarantees.) The tradeoff is genuine — but for many workloads (shopping carts, user profiles, sensor readings) brief inconsistency is far better than unavailability.

This article walks through every mechanism in that bet: quorums, sloppy quorums, hinted handoff, read-repair, Merkle anti-entropy, and conflict resolution. Contrast this with leader-based consensus at leader election and consensus, revisit partition fundamentals at CAP theorem deep dive, and see how keys are distributed across nodes at consistent hashing. The end-to-end storage design is at design key-value store.

Leaderless replication

In a leader-based model, writes flow through one designated primary. In a leaderless model, the client (or a coordinator node acting on its behalf) sends the write to multiple replicas directly. Any replica can accept any write. There is no concept of "the primary refuses you." If two nodes get different values for the same key, the system resolves it later.

flowchart TD
    subgraph Leader-based
        C1[Client] --> L[Leader]
        L -->|replicate| F1[Follower 1]
        L -->|replicate| F2[Follower 2]
    end
    subgraph Leaderless
        C2[Client] --> A[Node A]
        C2 --> B[Node B]
        C2 --> C[Node C]
    end
    style L fill:#ff6b1a,color:#fff
    style A fill:#15803d,color:#fff
    style B fill:#15803d,color:#fff
    style C fill:#15803d,color:#fff

The key consequence: write availability is decoupled from any single node's health. You can lose Node A entirely and writes flow to B and C. You pay for this with the possibility that A, when it comes back, holds stale data — hence all the convergence machinery described below.

Quorums: the R+W>N rule

Quorums are the mathematical guarantee that a read will always see the most recent write. The three numbers to know:

N — how many replicas hold a copy of the data (the replication factor).
W — how many must acknowledge a write before the client gets a success response.
R — how many must respond to a read before the client gets an answer.

If R+W>N, then the set of replicas that answered the write and the set that answered the read must overlap by at least one node. That overlapping node holds the newest value.

N = 3, W = 2, R = 2

Write quorum: {A, B}
Read quorum:  {B, C}

Overlap: {B}  ← B guaranteed to hold the new value

If the read coordinator sees multiple values, it returns the one with the highest version (or timestamp) and triggers read-repair on the stale nodes.

Tuning the knobs

The same N gives you very different behavior depending on how you set W and R:

W	R	Character	Trade-off
N	1	Write-heavy consistency	Writes must touch all N replicas (slow, unavailable if any replica down); reads are fast (any replica)
1	N	Read-heavy consistency	Writes are fast and highly available; reads must touch all N replicas
⌈N/2⌉+1	⌈N/2⌉+1	Balanced quorum	Both reads and writes survive minority failures; both pay the quorum latency tax
1	1	No guarantee	Maximum performance; staleness is certain; use only where best-effort is fine

With N=3: W=2, R=2 is the balanced sweet spot. It tolerates one replica being down on both reads and writes. If all three replicas are healthy, reads and writes both succeed with the fastest two responders — the slow third doesn't block the client.

flowchart LR
    subgraph "N=3, W=2, R=2 — one failure tolerated"
        CW[Client write] -->|"ack needed from 2"| WA[Node A]
        CW -->|"ack needed from 2"| WB[Node B]
        CW -. skipped .-> WC[Node C<br/>DOWN]
        CR[Client read] -->|"response from 2"| RB[Node B]
        CR -. fails .-> RC2[Node C<br/>DOWN]
        CR -->|"response from 2"| RA[Node A]
    end
    WA & WB -->|"W=2 met"| OK1[Write success]
    RB & RA -->|"R=2 met — B overlaps write"| OK2[Read sees latest]
    style OK1 fill:#15803d,color:#fff
    style OK2 fill:#15803d,color:#fff
    style WC fill:#ff2e88,color:#fff
    style RC2 fill:#ff2e88,color:#fff
    style CW fill:#ff6b1a,color:#fff
    style CR fill:#ff6b1a,color:#fff

What quorums do NOT guarantee

A quorum is only as strong as its clock or version tracking. Two things can quietly go wrong:

Last-write-wins with wall clocks. If two clients write concurrently to different nodes, both get W acks. Whoever has the larger timestamp "wins" on the next read. If the clocks are skewed, the wrong write wins and the other is silently discarded. No error, no conflict signal — the data is just gone. This is the most dangerous failure mode in Dynamo-style stores.

Quorums are not linearizable by default. Even with R+W>N, you can still see non-linearizable reads — one client sees a value, then another client sees an older value — unless the implementation adds extra mechanisms (like Cassandra's SERIAL consistency level, which uses Paxos for linearizable conditional writes).

Sloppy quorums break the overlap in a third way, which is the subject of the next section.

Sloppy quorums and hinted handoff

The standard quorum guarantee requires the W nodes to be from the key's home replica set — the N nodes responsible for this key range. What happens when one of those home nodes is down?

You have two choices. A strict quorum refuses the write if you can't reach W home replicas: consistent, but unavailable during any partition. A sloppy quorum writes to W live nodes from anywhere in the cluster, not necessarily the key's home replicas, and hands the data off to the home replicas when they come back online.

The temporary holder stores the value along with a hint: "this data belongs to home replica X; deliver it when X is reachable."

sequenceDiagram
    participant Client
    participant Coord as Coordinator
    participant H1 as Home 1 (up)
    participant H2 as Home 2 (DOWN)
    participant Tmp as Temp node (hint)

    Client->>Coord: write key=foo, value=bar
    Coord->>H1: write (normal replica)
    Coord->>H2: write (FAILS — node down)
    Coord->>Tmp: write + hint "deliver to H2"
    H1-->>Coord: ack
    Tmp-->>Coord: ack
    Coord-->>Client: success (W=2 met)
    Note over H2,Tmp: H2 comes back online
    Tmp->>H2: handoff: key=foo, value=bar
    H2-->>Tmp: ack
    Tmp->>Tmp: discard hint

The write succeeds even when home replicas are down. Write availability is preserved. But the overlap guarantee is broken: a read quorum of R home replicas may not include the temporary node that holds the latest value. Until the hint is handed off, a read can return the stale value while the latest value sits on a temp node.

This is a deliberate design choice, well-documented in the original Amazon Dynamo paper. The system is highly available for writes; it is NOT strongly consistent. Applications that need the "always see your own writes" guarantee must use strict quorums or a single coordinator for both read and write.

How replicas converge

When replicas hold different values for the same key — due to a network partition, a failed handoff, or concurrent writes — they need to converge. There are two complementary mechanisms: one reactive, one proactive.

Read-repair

Every quorum read touches R replicas. If those replicas return different values, the coordinator picks the winner (highest version, or per the conflict resolution policy), returns it to the client, and sends the winner value to any replica that returned a stale version — either synchronously before returning (blocking mode) or after returning (background mode), depending on configuration.

sequenceDiagram
    participant Client
    participant Coord as Coordinator
    participant R1 as Replica 1
    participant R2 as Replica 2

    Client->>Coord: read key=user:42
    Coord->>R1: read
    Coord->>R2: read
    R1-->>Coord: value=v3 (timestamp=1000)
    R2-->>Coord: value=v2 (timestamp=900)
    Coord-->>Client: return v3
    Coord->>R2: repair: write key=user:42, value=v3
    R2-->>Coord: ack

Read-repair is reactive and incremental: it only fires when a read reveals divergence. Hot keys heal very quickly — every read is a potential repair — but cold keys that are rarely read can stay stale for extended periods.

Read-repair carries a small tail-latency cost. In Cassandra 4.0+, the read_repair table option controls this: BLOCKING (the default) causes the coordinator to wait for repair writes to reach quorum before returning, providing monotonic quorum reads at the cost of slightly higher read latency; ASYNC performs repair writes in the background after returning to the client; NONE skips repair writes entirely, eliminating the extra write traffic generated during reads. Older probabilistic approaches (configuring a random fraction of reads to trigger repair) have been removed from modern Cassandra.

Anti-entropy via Merkle trees

Read-repair alone does not guarantee all data converges — cold keys may never get a read. Anti-entropy is a background process that compares the full key space between replicas and syncs anything that differs.

The naive approach — send every key-value pair from one replica to another and compare — is prohibitively expensive for a large dataset. Merkle trees (hash trees) solve this efficiently.

A Merkle tree is a binary tree where each leaf is a hash of the value for a single key (or a small key range), each internal node is a hash of its children's hashes, and the root hash summarizes the entire dataset. Two replicas compare their root hashes. If they match, the data is identical — done, zero data transferred. If they differ, descend into each subtree comparing hashes at each level. The path to disagreement narrows logarithmically. Only the key ranges with differing hashes need to sync.

flowchart TD
    ROOT["Root hash: 8f3a..."]
    L["Left: 2d11..."]
    R["Right: 2d11... ✓ MATCH"]
    LL["LL: a7f2... DIFFER"]
    LR["LR: 9c44... ✓ MATCH"]
    LLL["keys A-D"]
    LLR["keys E-H"]

    ROOT --> L
    ROOT --> R
    L --> LL
    L --> LR
    LL --> LLL
    LL --> LLR

    style ROOT fill:#ff6b1a,color:#fff
    style L fill:#0e7490,color:#fff
    style R fill:#15803d,color:#fff
    style LL fill:#a855f7,color:#fff
    style LR fill:#15803d,color:#fff
    style LLL fill:#ffaa00,color:#0a0a0f
    style LLR fill:#ffaa00,color:#0a0a0f

In this example: the right subtree matches entirely — skip it. The left subtree differs; descend. The left-right sub-subtree matches — skip it. Only the left-left sub-subtree (keys A–H) needs to be synced. O(log N) rounds of comparison to find disagreement in a dataset of N keys.

Practical numbers. A Cassandra node typically rebuilds and exchanges its Merkle trees with each peer in a process called nodetool repair. For a node holding 100 GB of data, the tree comparison itself is cheap — it hashes data into a tree structure and sends only the tree, not the data, in the comparison phase. Only the differing key ranges transfer their actual values. Repair is recommended at least once per gc_grace_seconds — typically every 10 days in a default Cassandra configuration — to ensure deleted keys (tombstones) are not resurrected.

Conflict resolution

What happens when two clients write to the same key at the same time, and those writes land on different replicas before either can propagate? You have a genuine conflict — two values that were both "the latest" from someone's perspective.

Last-write-wins (LWW)

The simplest policy: attach a timestamp to each write; the replica with the highest timestamp wins.

The danger is that wall clocks are unreliable in distributed systems. Clock skew across nodes (even with NTP) can be milliseconds to hundreds of milliseconds. If two writes arrive within a few milliseconds of each other, the "loser" by timestamp may have been logically later from the client's perspective, and it is permanently discarded with no error returned to either client.

LWW is safe when the write rate is low enough that true concurrent writes are extremely rare, overwriting with a slightly stale value is acceptable (a "last seen timestamp" for a user's activity, say), or you use a monotonic cluster-wide logical clock rather than wall time (Cassandra allows this via USING TIMESTAMP).

Version vectors

To detect concurrency without trusting clocks, attach a version vector to every value: a map from replica identifier to a sequence number.

Node A writes key=cart, value=["milk"] → version: {A:1}
Node B writes key=cart, value=["bread"] → version: {B:1}

Replica 1 holds: ["milk"]  {A:1}
Replica 2 holds: ["bread"] {B:1}

When a coordinator sees both, it compares version vectors: {A:1} does not dominate {B:1} — A doesn't know about B's write, and vice versa — so this is a genuine conflict. If one had been {A:1, B:1}, it would dominate the other — simple stale data, no conflict.

A version vector tells you whether a conflict exists. It doesn't resolve it. Your options for resolution:

Strategy	Mechanics	When to use
Last-write-wins	Attach wall-clock or logical timestamp; keep highest	Low write rate, tolerable data loss
Client-side merge	Return all conflicting versions ("siblings") to the next reader; app merges	Shopping carts, collaborative docs; app knows semantics
CRDTs	Data structure guarantees any merge order yields the same result (counters, sets)	Counters, distributed sets, growing-only structures
Server-side quorum	Use Paxos/Raft for the write; no conflicts possible	When strong consistency is non-negotiable

Riak popularized the "siblings" (multiple versions) model: on a conflict, the database returns all conflicting values to the next reader, and the application is responsible for merging them. Amazon's Dynamo paper describes doing this for the shopping cart — if you added an item in Rome while your session also wrote in London, both items end up in the cart, because union-merge (a CRDT set union) is the right semantic for "things in a cart."

CRDTs (Conflict-free Replicated Data Types) take the merge logic out of the application by choosing data structures whose merge function is commutative, associative, and idempotent. A G-Counter (grow-only counter) is the simplest CRDT: each replica keeps a per-node count; the merge of two states is the component-wise maximum; the total is the sum of all per-node maxima. No matter what order merges happen in, you get the same result.

CAP positioning

Dynamo-style stores sit firmly in the AP (Available + Partition-tolerant) camp of the CAP theorem. Writes succeed as long as enough replicas are reachable — the system doesn't refuse writes when nodes are down. Across network partitions, replicas diverge and converge later. The thing you give up is strong consistency: reads may return stale values, concurrent writes may conflict.

This is the right choice when availability matters more than perfect consistency — user-facing applications where "service degraded" is worse than "data is a few seconds stale." Raft-based systems (leader election and consensus) sit in the CP corner: they refuse writes during a partition rather than risk divergence.

flowchart LR
    subgraph AP ["AP (Dynamo / Cassandra)"]
        A1[Always available<br/>for writes] --> A2[Converge later<br/>via repair]
    end
    subgraph CP ["CP (Raft / Paxos)"]
        B1[Strong consistency<br/>linearizable] --> B2[Unavailable during<br/>partition / election]
    end
    style A1 fill:#15803d,color:#fff
    style A2 fill:#15803d,color:#fff
    style B1 fill:#0e7490,color:#fff
    style B2 fill:#0e7490,color:#fff

See CAP theorem deep dive for a rigorous treatment of this choice.

Concrete: Dynamo and Cassandra

Amazon Dynamo (described in the 2007 SOSP paper by DeCandia et al.) is the seminal leaderless system. It uses consistent hashing to partition keys across nodes, N/W/R tunable quorums, sloppy quorums with hinted handoff, read-repair on every read, and Merkle tree anti-entropy between peers. Conflict resolution was application-driven: the shopping cart service used a union-merge strategy. DynamoDB (the managed cloud service) evolved substantially from the original paper. By the time of AWS's 2022 USENIX ATC paper, DynamoDB had moved to a leader-per-partition model using Multi-Paxos, gaining strong consistency and predictable performance — it is no longer leaderless in the Dynamo sense.

Apache Cassandra (originally developed at Facebook, open-sourced in 2008) combines Dynamo's leaderless replication with a Bigtable-inspired wide-column data model. Key Cassandra specifics:

Mechanism	Cassandra implementation
Partitioning	Consistent hashing with virtual nodes (vnodes) — many small token ranges per physical node, not one large range
Replication factor	`REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3}` — replicate to N nodes, topology-aware
Consistency levels	`ONE`, `QUORUM`, `LOCAL_QUORUM`, `ALL`, `SERIAL` (linearizable via Paxos) — per-operation, not per-table
Read repair	`read_repair` table option: `BLOCKING` (default, provides monotonic quorum reads), `ASYNC` (background repair, non-blocking), or `NONE` (no repair; avoids extra write traffic during reads). The older probabilistic `read_repair_chance` parameter was removed in Cassandra 4.0.
Anti-entropy	`nodetool repair` runs Merkle tree sync; should run at least every `gc_grace_seconds` (default 10 days)
Conflict resolution	LWW by default (timestamp per cell); no built-in sibling model

Cassandra defaults to LWW, which means clock skew is a real operational concern. The recommendation is to use USING TIMESTAMP with a logical, client-generated timestamp if precise ordering matters, or to accept that writes within the clock-skew window (~few ms) may be silently overwritten.

Failure modes and what goes wrong

Failure	Mechanism triggered	Risk
One home replica down	Sloppy quorum + hinted handoff	Write succeeds; hint may be lost if hint-holder also crashes before delivery
Clock skew > write concurrency window	LWW picks wrong winner	Silent data loss — no error, no conflict signal
Quorum unavailable (floor(N/2)+1 replicas down)	Write/read fails (strict) or degrades (sloppy)	Unavailability with strict; stale reads with sloppy
Hint accumulation (many nodes down)	Disk space on hint-holder fills up	Writes rejected when hint store is full; repair needed
Repair not run for > gc_grace_seconds	Deleted tombstones may be resurrected	Deleted keys reappear; data corruption
Conflict explosion	Clients write to different replicas under partition	Version vector "sibling" count grows; app must merge many versions
Network partition healed too late	Hinted handoff TTL expires before delivery	Hint discarded; replica remains permanently stale until next repair

The tombstone resurrection failure mode is worth pausing on — it tests operational depth in interviews. When you delete a key in Cassandra, it writes a deletion marker (tombstone) rather than removing the data immediately, so other replicas know the key was deleted when they sync. After gc_grace_seconds, the tombstone is garbage-collected. If a replica was down during the delete and comes back after the tombstone was GC'd, it will see "no data" in the sync and think it simply doesn't have the key yet — and will not receive a delete. Anti-entropy repair must run within gc_grace_seconds to prevent this.

Choosing your quorum configuration

When to use which configuration, in practice:

Use case	Recommended config	Rationale
Highest write availability (metrics ingest, event logs)	N=3, W=1, R=1 or R=2	Writes never fail; staleness acceptable
Balanced: tolerate one failure each way	N=3, W=2, R=2	Classic quorum; standard for most apps
Read-your-writes consistency required	N=3, W=3 or route reads to same node	All replicas always have latest; or use sticky routing
Financial / inventory: no data loss	`SERIAL` (Cassandra) / Raft	Paxos-backed linearizability; CP, not AP
Cross-datacenter with local reads	N=3 per DC, `LOCAL_QUORUM`	Quorum within one DC; cross-DC replication async

Things to discuss in an interview

Why R+W>N works: walk through a concrete example (N=3, W=2, R=2, draw the overlap). Make sure you can also show what breaks when R+W≤N.
Sloppy vs strict quorums: when would you use each? What availability vs consistency trade-off are you making?
Read-repair vs anti-entropy: why both? (Read-repair is fast for hot keys; anti-entropy handles cold keys and full convergence.)
Why Merkle trees: what's the alternative (full scan), why it's infeasible at scale, and how tree-comparison reduces it to O(log N) rounds.
LWW vs version vectors: LWW is simpler but lossy; version vectors detect conflicts but push resolution to the application or a CRDT.
CAP trade-off: Dynamo is AP; Raft is CP. When would you choose each? (Availability-critical vs correctness-critical data.)
Tombstone resurrection: a subtle failure mode that tests depth of operational knowledge.

Things you should now be able to answer

If N=5, W=3, R=3, how many node failures can reads tolerate? Writes?
Why does sloppy quorum weaken the R+W>N overlap guarantee?
A read returns two values with version vectors {A:2, B:1} and {A:1, B:2}. Is this a conflict? What do you do?
Why is running Cassandra repair within gc_grace_seconds critical?
A client writes key=X to replica A. A partition isolates A before the write reaches B or C. The client then reads from B and C. What does it see? What happens after the partition heals?
What's a CRDT G-Counter, and why does it never conflict?
Why is LWW dangerous when writes are concurrent across replicas with ms-level clock skew?

Frequently asked questions

▸What is the R+W>N quorum rule and why does it guarantee you see the latest write?

R is the number of replicas that must respond to a read, W is the number that must acknowledge a write, and N is the total replication factor. When R+W>N, the read set and write set must share at least one replica by the pigeonhole principle, and that overlapping node is guaranteed to hold the newest value.

▸What is the difference between a sloppy quorum and a strict quorum?

A strict quorum requires that the W acknowledgments come from the key's home replica set; if too few home replicas are reachable, the write is refused. A sloppy quorum writes to any W live nodes in the cluster, and each temporary holder stores a hint directing it to forward the data to the home replica once it recovers. Sloppy quorums preserve write availability but break the overlap guarantee, so a subsequent read from home replicas may return a stale value until hinted handoff completes.

▸How do read-repair and Merkle-tree anti-entropy complement each other?

Read-repair fires inline on every quorum read: if replicas return different values, the coordinator writes the newest value back to any stale replica immediately, so hot keys heal within seconds of a partition healing. Merkle anti-entropy is a background process that compares per-key hash trees between replicas and syncs only differing key ranges in O(log N) rounds, covering cold keys that may never receive a read. Both are needed because read-repair alone cannot guarantee full convergence across the entire key space.

▸When should I use last-write-wins versus version vectors for conflict resolution?

Last-write-wins is appropriate when the write rate is low enough that true concurrent writes are rare and silently overwriting a slightly stale value is acceptable, such as a user activity timestamp. Version vectors are necessary when you need to detect genuine conflicts: two version vectors that neither dominates the other signal a true concurrent write that must be resolved by application-level merge or a CRDT. The danger of LWW is that clock skew of even a few milliseconds can cause the logically later write to be permanently discarded with no error.

▸Why must Cassandra's nodetool repair run within gc_grace_seconds, and what happens if it does not?

When a key is deleted in Cassandra, a tombstone marker is written instead of immediately removing the data, so other replicas learn about the deletion during sync. After gc_grace_seconds (default 10 days) the tombstone is garbage-collected. If a replica was down during the delete and comes back after the tombstone is gone, anti-entropy sees no data on either side and never delivers the delete, resurrecting the key silently.

← previous

Design a Video Conferencing System (Zoom)

Design Slack (team chat at scale)

// RELATED