Quorums, Read-Repair & Anti-Entropy (Dynamo-style)
How leaderless databases like Dynamo and Cassandra stay available and converge. Quorum R+W>N, read-repair, hinted handoff, Merkle anti-entropy, and conflict resolution.
Leader-based replication (Raft, Paxos, primary-secondary) gives you a clear chain of command: one node decides write order, everyone else follows. That clarity comes at a cost — the leader is a bottleneck, leader election takes time, and if you can't reach the leader during a partition, you refuse writes. Dynamo-style systems (Apache Cassandra, Riak, and the original Amazon Dynamo) make the opposite bet: every replica is equal, every replica can accept a write, and the system converges to agreement in the background. (Note: Amazon DynamoDB, the managed cloud service, has since moved away from leaderless replication to a leader-per-partition model using Multi-Paxos for stronger consistency guarantees.) The tradeoff is genuine — but for many workloads (shopping carts, user profiles, sensor readings) brief inconsistency is far better than unavailability.
This article walks through every mechanism in that bet: quorums, sloppy quorums, hinted handoff, read-repair, Merkle anti-entropy, and conflict resolution. Contrast this with leader-based consensus at leader election and consensus, revisit partition fundamentals at CAP theorem deep dive, and see how keys are distributed across nodes at consistent hashing. The end-to-end storage design is at design key-value store.
Leaderless replication
In a leader-based model, writes flow through one designated primary. In a leaderless model, the client (or a coordinator node acting on its behalf) sends the write to multiple replicas directly. Any replica can accept any write. There is no concept of "the primary refuses you." If two nodes get different values for the same key, the system resolves it later.
flowchart TD
subgraph Leader-based
C1[Client] --> L[Leader]
L -->|replicate| F1[Follower 1]
L -->|replicate| F2[Follower 2]
end
subgraph Leaderless
C2[Client] --> A[Node A]
C2 --> B[Node B]
C2 --> C[Node C]
end
style L fill:#ff6b1a,color:#fff
style A fill:#15803d,color:#fff
style B fill:#15803d,color:#fff
style C fill:#15803d,color:#fff
The key consequence: write availability is decoupled from any single node's health. You can lose Node A entirely and writes flow to B and C. You pay for this with the possibility that A, when it comes back, holds stale data — hence all the convergence machinery described below.
Quorums: the R+W>N rule
Quorums are the mathematical guarantee that a read will always see the most recent write. The three numbers to know:
- N — how many replicas hold a copy of the data (the replication factor).
- W — how many must acknowledge a write before the client gets a success response.
- R — how many must respond to a read before the client gets an answer.
If R+W>N, then the set of replicas that answered the write and the set that answered the read must overlap by at least one node. That overlapping node holds the newest value.
N = 3, W = 2, R = 2
Write quorum: {A, B}
Read quorum: {B, C}
Overlap: {B} ← B guaranteed to hold the new value
If the read coordinator sees multiple values, it returns the one with the highest version (or timestamp) and triggers read-repair on the stale nodes.
Tuning the knobs
The same N gives you very different behavior depending on how you set W and R:
| W | R | Character | Trade-off |
|---|---|---|---|
| N | 1 | Write-heavy consistency | Writes must touch all N replicas (slow, unavailable if any replica down); reads are fast (any replica) |
| 1 | N | Read-heavy consistency | Writes are fast and highly available; reads must touch all N replicas |
| ⌈N/2⌉+1 | ⌈N/2⌉+1 | Balanced quorum | Both reads and writes survive minority failures; both pay the quorum latency tax |
| 1 | 1 | No guarantee | Maximum performance; staleness is certain; use only where best-effort is fine |
With N=3: W=2, R=2 is the balanced sweet spot. It tolerates one replica being down on both reads and writes. If all three replicas are healthy, reads and writes both succeed with the fastest two responders — the slow third doesn't block the client.
flowchart LR
subgraph "N=3, W=2, R=2 — one failure tolerated"
CW[Client write] -->|"ack needed from 2"| WA[Node A]
CW -->|"ack needed from 2"| WB[Node B]
CW -. skipped .-> WC[Node C<br/>DOWN]
CR[Client read] -->|"response from 2"| RB[Node B]
CR -. fails .-> RC2[Node C<br/>DOWN]
CR -->|"response from 2"| RA[Node A]
end
WA & WB -->|"W=2 met"| OK1[Write success]
RB & RA -->|"R=2 met — B overlaps write"| OK2[Read sees latest]
style OK1 fill:#15803d,color:#fff
style OK2 fill:#15803d,color:#fff
style WC fill:#ff2e88,color:#fff
style RC2 fill:#ff2e88,color:#fff
style CW fill:#ff6b1a,color:#fff
style CR fill:#ff6b1a,color:#fff
What quorums do NOT guarantee
A quorum is only as strong as its clock or version tracking. Two things can quietly go wrong:
Last-write-wins with wall clocks. If two clients write concurrently to different nodes, both get W acks. Whoever has the larger timestamp "wins" on the next read. If the clocks are skewed, the wrong write wins and the other is silently discarded. No error, no conflict signal — the data is just gone. This is the most dangerous failure mode in Dynamo-style stores.
Quorums are not linearizable by default. Even with R+W>N, you can still see non-linearizable reads — one client sees a value, then another client sees an older value — unless the implementation adds extra mechanisms (like Cassandra's SERIAL consistency level, which uses Paxos for linearizable conditional writes).
Sloppy quorums break the overlap in a third way, which is the subject of the next section.
Sloppy quorums and hinted handoff
The standard quorum guarantee requires the W nodes to be from the key's home replica set — the N nodes responsible for this key range. What happens when one of those home nodes is down?
You have two choices. A strict quorum refuses the write if you can't reach W home replicas: consistent, but unavailable during any partition. A sloppy quorum writes to W live nodes from anywhere in the cluster, not necessarily the key's home replicas, and hands the data off to the home replicas when they come back online.
The temporary holder stores the value along with a hint: "this data belongs to home replica X; deliver it when X is reachable."
sequenceDiagram
participant Client
participant Coord as Coordinator
participant H1 as Home 1 (up)
participant H2 as Home 2 (DOWN)
participant Tmp as Temp node (hint)
Client->>Coord: write key=foo, value=bar
Coord->>H1: write (normal replica)
Coord->>H2: write (FAILS — node down)
Coord->>Tmp: write + hint "deliver to H2"
H1-->>Coord: ack
Tmp-->>Coord: ack
Coord-->>Client: success (W=2 met)
Note over H2,Tmp: H2 comes back online
Tmp->>H2: handoff: key=foo, value=bar
H2-->>Tmp: ack
Tmp->>Tmp: discard hint
The write succeeds even when home replicas are down. Write availability is preserved. But the overlap guarantee is broken: a read quorum of R home replicas may not include the temporary node that holds the latest value. Until the hint is handed off, a read can return the stale value while the latest value sits on a temp node.
This is a deliberate design choice, well-documented in the original Amazon Dynamo paper. The system is highly available for writes; it is NOT strongly consistent. Applications that need the "always see your own writes" guarantee must use strict quorums or a single coordinator for both read and write.
How replicas converge
When replicas hold different values for the same key — due to a network partition, a failed handoff, or concurrent writes — they need to converge. There are two complementary mechanisms: one reactive, one proactive.
Read-repair
Every quorum read touches R replicas. If those replicas return different values, the coordinator picks the winner (highest version, or per the conflict resolution policy), returns it to the client, and sends the winner value to any replica that returned a stale version — either synchronously before returning (blocking mode) or after returning (background mode), depending on configuration.
sequenceDiagram
participant Client
participant Coord as Coordinator
participant R1 as Replica 1
participant R2 as Replica 2
Client->>Coord: read key=user:42
Coord->>R1: read
Coord->>R2: read
R1-->>Coord: value=v3 (timestamp=1000)
R2-->>Coord: value=v2 (timestamp=900)
Coord-->>Client: return v3
Coord->>R2: repair: write key=user:42, value=v3
R2-->>Coord: ack
Read-repair is reactive and incremental: it only fires when a read reveals divergence. Hot keys heal very quickly — every read is a potential repair — but cold keys that are rarely read can stay stale for extended periods.
Read-repair carries a small tail-latency cost. In Cassandra 4.0+, the read_repair table option controls this: BLOCKING (the default) causes the coordinator to wait for repair writes to reach quorum before returning, providing monotonic quorum reads at the cost of slightly higher read latency; ASYNC performs repair writes in the background after returning to the client; NONE skips repair writes entirely, eliminating the extra write traffic generated during reads. Older probabilistic approaches (configuring a random fraction of reads to trigger repair) have been removed from modern Cassandra.
Anti-entropy via Merkle trees
Read-repair alone does not guarantee all data converges — cold keys may never get a read. Anti-entropy is a background process that compares the full key space between replicas and syncs anything that differs.
The naive approach — send every key-value pair from one replica to another and compare — is prohibitively expensive for a large dataset. Merkle trees (hash trees) solve this efficiently.
A Merkle tree is a binary tree where each leaf is a hash of the value for a single key (or a small key range), each internal node is a hash of its children's hashes, and the root hash summarizes the entire dataset. Two replicas compare their root hashes. If they match, the data is identical — done, zero data transferred. If they differ, descend into each subtree comparing hashes at each level. The path to disagreement narrows logarithmically. Only the key ranges with differing hashes need to sync.
flowchart TD
ROOT["Root hash: 8f3a..."]
L["Left: 2d11..."]
R["Right: 2d11... ✓ MATCH"]
LL["LL: a7f2... DIFFER"]
LR["LR: 9c44... ✓ MATCH"]
LLL["keys A-D"]
LLR["keys E-H"]
ROOT --> L
ROOT --> R
L --> LL
L --> LR
LL --> LLL
LL --> LLR
style ROOT fill:#ff6b1a,color:#fff
style L fill:#0e7490,color:#fff
style R fill:#15803d,color:#fff
style LL fill:#a855f7,color:#fff
style LR fill:#15803d,color:#fff
style LLL fill:#ffaa00,color:#0a0a0f
style LLR fill:#ffaa00,color:#0a0a0f
In this example: the right subtree matches entirely — skip it. The left subtree differs; descend. The left-right sub-subtree matches — skip it. Only the left-left sub-subtree (keys A–H) needs to be synced. O(log N) rounds of comparison to find disagreement in a dataset of N keys.
Practical numbers. A Cassandra node typically rebuilds and exchanges its Merkle trees with each peer in a process called nodetool repair. For a node holding 100 GB of data, the tree comparison itself is cheap — it hashes data into a tree structure and sends only the tree, not the data, in the comparison phase. Only the differing key ranges transfer their actual values. Repair is recommended at least once per gc_grace_seconds — typically every 10 days in a default Cassandra configuration — to ensure deleted keys (tombstones) are not resurrected.
Conflict resolution
What happens when two clients write to the same key at the same time, and those writes land on different replicas before either can propagate? You have a genuine conflict — two values that were both "the latest" from someone's perspective.
Last-write-wins (LWW)
The simplest policy: attach a timestamp to each write; the replica with the highest timestamp wins.
The danger is that wall clocks are unreliable in distributed systems. Clock skew across nodes (even with NTP) can be milliseconds to hundreds of milliseconds. If two writes arrive within a few milliseconds of each other, the "loser" by timestamp may have been logically later from the client's perspective, and it is permanently discarded with no error returned to either client.
LWW is safe when the write rate is low enough that true concurrent writes are extremely rare, overwriting with a slightly stale value is acceptable (a "last seen timestamp" for a user's activity, say), or you use a monotonic cluster-wide logical clock rather than wall time (Cassandra allows this via USING TIMESTAMP).
Version vectors
To detect concurrency without trusting clocks, attach a version vector to every value: a map from replica identifier to a sequence number.
Node A writes key=cart, value=["milk"] → version: {A:1}
Node B writes key=cart, value=["bread"] → version: {B:1}
Replica 1 holds: ["milk"] {A:1}
Replica 2 holds: ["bread"] {B:1}
When a coordinator sees both, it compares version vectors: {A:1} does not dominate {B:1} — A doesn't know about B's write, and vice versa — so this is a genuine conflict. If one had been {A:1, B:1}, it would dominate the other — simple stale data, no conflict.
A version vector tells you whether a conflict exists. It doesn't resolve it. Your options for resolution:
| Strategy | Mechanics | When to use |
|---|---|---|
| Last-write-wins | Attach wall-clock or logical timestamp; keep highest | Low write rate, tolerable data loss |
| Client-side merge | Return all conflicting versions ("siblings") to the next reader; app merges | Shopping carts, collaborative docs; app knows semantics |
| CRDTs | Data structure guarantees any merge order yields the same result (counters, sets) | Counters, distributed sets, growing-only structures |
| Server-side quorum | Use Paxos/Raft for the write; no conflicts possible | When strong consistency is non-negotiable |
Riak popularized the "siblings" (multiple versions) model: on a conflict, the database returns all conflicting values to the next reader, and the application is responsible for merging them. Amazon's Dynamo paper describes doing this for the shopping cart — if you added an item in Rome while your session also wrote in London, both items end up in the cart, because union-merge (a CRDT set union) is the right semantic for "things in a cart."
CRDTs (Conflict-free Replicated Data Types) take the merge logic out of the application by choosing data structures whose merge function is commutative, associative, and idempotent. A G-Counter (grow-only counter) is the simplest CRDT: each replica keeps a per-node count; the merge of two states is the component-wise maximum; the total is the sum of all per-node maxima. No matter what order merges happen in, you get the same result.
CAP positioning
Dynamo-style stores sit firmly in the AP (Available + Partition-tolerant) camp of the CAP theorem. Writes succeed as long as enough replicas are reachable — the system doesn't refuse writes when nodes are down. Across network partitions, replicas diverge and converge later. The thing you give up is strong consistency: reads may return stale values, concurrent writes may conflict.
This is the right choice when availability matters more than perfect consistency — user-facing applications where "service degraded" is worse than "data is a few seconds stale." Raft-based systems (leader election and consensus) sit in the CP corner: they refuse writes during a partition rather than risk divergence.
flowchart LR
subgraph AP ["AP (Dynamo / Cassandra)"]
A1[Always available<br/>for writes] --> A2[Converge later<br/>via repair]
end
subgraph CP ["CP (Raft / Paxos)"]
B1[Strong consistency<br/>linearizable] --> B2[Unavailable during<br/>partition / election]
end
style A1 fill:#15803d,color:#fff
style A2 fill:#15803d,color:#fff
style B1 fill:#0e7490,color:#fff
style B2 fill:#0e7490,color:#fff
See CAP theorem deep dive for a rigorous treatment of this choice.
Concrete: Dynamo and Cassandra
Amazon Dynamo (described in the 2007 SOSP paper by DeCandia et al.) is the seminal leaderless system. It uses consistent hashing to partition keys across nodes, N/W/R tunable quorums, sloppy quorums with hinted handoff, read-repair on every read, and Merkle tree anti-entropy between peers. Conflict resolution was application-driven: the shopping cart service used a union-merge strategy. DynamoDB (the managed cloud service) evolved substantially from the original paper. By the time of AWS's 2022 USENIX ATC paper, DynamoDB had moved to a leader-per-partition model using Multi-Paxos, gaining strong consistency and predictable performance — it is no longer leaderless in the Dynamo sense.
Apache Cassandra (originally developed at Facebook, open-sourced in 2008) combines Dynamo's leaderless replication with a Bigtable-inspired wide-column data model. Key Cassandra specifics:
| Mechanism | Cassandra implementation |
|---|---|
| Partitioning | Consistent hashing with virtual nodes (vnodes) — many small token ranges per physical node, not one large range |
| Replication factor | REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3} — replicate to N nodes, topology-aware |
| Consistency levels | ONE, QUORUM, LOCAL_QUORUM, ALL, SERIAL (linearizable via Paxos) — per-operation, not per-table |
| Read repair | read_repair table option: BLOCKING (default, provides monotonic quorum reads), ASYNC (background repair, non-blocking), or NONE (no repair; avoids extra write traffic during reads). The older probabilistic read_repair_chance parameter was removed in Cassandra 4.0. |
| Anti-entropy | nodetool repair runs Merkle tree sync; should run at least every gc_grace_seconds (default 10 days) |
| Conflict resolution | LWW by default (timestamp per cell); no built-in sibling model |
Cassandra defaults to LWW, which means clock skew is a real operational concern. The recommendation is to use USING TIMESTAMP with a logical, client-generated timestamp if precise ordering matters, or to accept that writes within the clock-skew window (~few ms) may be silently overwritten.
Failure modes and what goes wrong
| Failure | Mechanism triggered | Risk |
|---|---|---|
| One home replica down | Sloppy quorum + hinted handoff | Write succeeds; hint may be lost if hint-holder also crashes before delivery |
| Clock skew > write concurrency window | LWW picks wrong winner | Silent data loss — no error, no conflict signal |
| Quorum unavailable (floor(N/2)+1 replicas down) | Write/read fails (strict) or degrades (sloppy) | Unavailability with strict; stale reads with sloppy |
| Hint accumulation (many nodes down) | Disk space on hint-holder fills up | Writes rejected when hint store is full; repair needed |
| Repair not run for > gc_grace_seconds | Deleted tombstones may be resurrected | Deleted keys reappear; data corruption |
| Conflict explosion | Clients write to different replicas under partition | Version vector "sibling" count grows; app must merge many versions |
| Network partition healed too late | Hinted handoff TTL expires before delivery | Hint discarded; replica remains permanently stale until next repair |
The tombstone resurrection failure mode is worth pausing on — it tests operational depth in interviews. When you delete a key in Cassandra, it writes a deletion marker (tombstone) rather than removing the data immediately, so other replicas know the key was deleted when they sync. After gc_grace_seconds, the tombstone is garbage-collected. If a replica was down during the delete and comes back after the tombstone was GC'd, it will see "no data" in the sync and think it simply doesn't have the key yet — and will not receive a delete. Anti-entropy repair must run within gc_grace_seconds to prevent this.
Choosing your quorum configuration
When to use which configuration, in practice:
| Use case | Recommended config | Rationale |
|---|---|---|
| Highest write availability (metrics ingest, event logs) | N=3, W=1, R=1 or R=2 | Writes never fail; staleness acceptable |
| Balanced: tolerate one failure each way | N=3, W=2, R=2 | Classic quorum; standard for most apps |
| Read-your-writes consistency required | N=3, W=3 or route reads to same node | All replicas always have latest; or use sticky routing |
| Financial / inventory: no data loss | SERIAL (Cassandra) / Raft | Paxos-backed linearizability; CP, not AP |
| Cross-datacenter with local reads | N=3 per DC, LOCAL_QUORUM | Quorum within one DC; cross-DC replication async |
Things to discuss in an interview
- Why R+W>N works: walk through a concrete example (N=3, W=2, R=2, draw the overlap). Make sure you can also show what breaks when R+W≤N.
- Sloppy vs strict quorums: when would you use each? What availability vs consistency trade-off are you making?
- Read-repair vs anti-entropy: why both? (Read-repair is fast for hot keys; anti-entropy handles cold keys and full convergence.)
- Why Merkle trees: what's the alternative (full scan), why it's infeasible at scale, and how tree-comparison reduces it to O(log N) rounds.
- LWW vs version vectors: LWW is simpler but lossy; version vectors detect conflicts but push resolution to the application or a CRDT.
- CAP trade-off: Dynamo is AP; Raft is CP. When would you choose each? (Availability-critical vs correctness-critical data.)
- Tombstone resurrection: a subtle failure mode that tests depth of operational knowledge.
Things you should now be able to answer
- If N=5, W=3, R=3, how many node failures can reads tolerate? Writes?
- Why does sloppy quorum weaken the R+W>N overlap guarantee?
- A read returns two values with version vectors
{A:2, B:1}and{A:1, B:2}. Is this a conflict? What do you do? - Why is running Cassandra repair within
gc_grace_secondscritical? - A client writes key=X to replica A. A partition isolates A before the write reaches B or C. The client then reads from B and C. What does it see? What happens after the partition heals?
- What's a CRDT G-Counter, and why does it never conflict?
- Why is LWW dangerous when writes are concurrent across replicas with ms-level clock skew?
Further reading
- DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" — SOSP 2007. The foundational paper; every mechanism in this article traces back here.
- Kleppmann, Designing Data-Intensive Applications, Chapter 5 (Replication) and Chapter 9 (Consistency and Consensus). The clearest textbook treatment.
- Cassandra documentation — "How is data replicated?" and
nodetool repair— cassandra.apache.org - "Eventually Consistent" — Werner Vogels, ACM Queue 2008. The philosophy behind this design from Amazon's CTO.
- Consistent hashing — how keys are distributed across the N replicas in the first place.
- Leader election and consensus — the CP alternative: Raft and Paxos.
- CAP theorem deep dive — the formal framework for understanding the AP vs CP trade-off.
- Design key-value store — putting these mechanisms together into a full system design.
Frequently asked questions
▸What is the R+W>N quorum rule and why does it guarantee you see the latest write?
R is the number of replicas that must respond to a read, W is the number that must acknowledge a write, and N is the total replication factor. When R+W>N, the read set and write set must share at least one replica by the pigeonhole principle, and that overlapping node is guaranteed to hold the newest value.
▸What is the difference between a sloppy quorum and a strict quorum?
A strict quorum requires that the W acknowledgments come from the key's home replica set; if too few home replicas are reachable, the write is refused. A sloppy quorum writes to any W live nodes in the cluster, and each temporary holder stores a hint directing it to forward the data to the home replica once it recovers. Sloppy quorums preserve write availability but break the overlap guarantee, so a subsequent read from home replicas may return a stale value until hinted handoff completes.
▸How do read-repair and Merkle-tree anti-entropy complement each other?
Read-repair fires inline on every quorum read: if replicas return different values, the coordinator writes the newest value back to any stale replica immediately, so hot keys heal within seconds of a partition healing. Merkle anti-entropy is a background process that compares per-key hash trees between replicas and syncs only differing key ranges in O(log N) rounds, covering cold keys that may never receive a read. Both are needed because read-repair alone cannot guarantee full convergence across the entire key space.
▸When should I use last-write-wins versus version vectors for conflict resolution?
Last-write-wins is appropriate when the write rate is low enough that true concurrent writes are rare and silently overwriting a slightly stale value is acceptable, such as a user activity timestamp. Version vectors are necessary when you need to detect genuine conflicts: two version vectors that neither dominates the other signal a true concurrent write that must be resolved by application-level merge or a CRDT. The danger of LWW is that clock skew of even a few milliseconds can cause the logically later write to be permanently discarded with no error.
▸Why must Cassandra's nodetool repair run within gc_grace_seconds, and what happens if it does not?
When a key is deleted in Cassandra, a tombstone marker is written instead of immediately removing the data, so other replicas learn about the deletion during sync. After gc_grace_seconds (default 10 days) the tombstone is garbage-collected. If a replica was down during the delete and comes back after the tombstone is gone, anti-entropy sees no data on either side and never delivers the delete, resurrecting the key silently.
You may also like
Design a Shopping Cart & Checkout System
Keep a cart consistent across devices, then check out without overselling or double-charging. The available-cart vs consistent-checkout split, inventory holds, and the order saga.
Design a Social Graph Service (Facebook's TAO)
Serve billions of "who follows whom" reads over a graph of trillions of edges. The objects-and-associations model, a cache in front of sharded SQL, and the hot-vertex problem.
Design an Authorization System (Google Zanzibar / RBAC / ReBAC)
Answer "can user U do action A on resource R?" globally, in milliseconds, consistently. RBAC vs ABAC vs ReBAC, Zanzibar relation tuples, and the new-enemy problem.