~/articles/design-key-value-store
◆◆◆Advancedasked at Amazonasked at Metaasked at Google

Design a Distributed Key-Value Store (Dynamo)

Build your own DynamoDB / Cassandra. Sharding, replication, quorum reads/writes, vector clocks, conflict resolution.

// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

DynamoDB stores over a trillion items and handles millions of requests per second. Cassandra powers Discord's message history — 120 billion messages as of 2020, with writes sustaining several million per day. At the core of both systems is the same primitive: a distributed key-value store. You hand it a key, you get back a value. Simple API, brutally hard implementation at scale.

A single machine with a hash map in memory can do millions of operations per second — until it runs out of RAM, until the process crashes, or until traffic outgrows one box. The first move is sharding: split the key space across many nodes. The second move is replication: keep copies of every key on multiple nodes so one failure doesn't make data disappear. That's where the hard engineering begins.

Once you have N replicas of the same key, two problems collide. First: when is a write "done"? Wait for all replicas and a slow node stalls every client; ack on one and you risk losing the write if that node dies before the others catch up. Second: what do you do when two clients write the same key at the same time, or when a network partition lets both sides accept writes independently? Wall-clock timestamps drift and lie. You need real concurrency tracking.

This is why the distributed key-value store is one of the most revealing system design questions: it forces you to reason through consistency vs. availability, quorum math, vector clocks, and gossip — not as isolated topics but as a connected stack. The reference implementation is Amazon Dynamo (2007), which became the blueprint for Cassandra, Riak, Voldemort, and ScyllaDB. Getting this design right in an interview shows senior-level distributed-systems fluency.

Back-of-the-envelope

DimensionEstimateHow we got there
Write QPS1 M QPSTarget scale; drives sharding and quorum choices
Average value size1 KBTypical KV workload (small objects)
Write bandwidth~1 GB/s1 M QPS × 1 KB per write
Logical data1 PBAggregate stored dataset
Raw storage (N=3)~3 PB1 PB logical × 3 replicas
Keys reshuffled on node add~1%1/N for a 100-node cluster via consistent hashing (vs. ~100% with hash-mod-N)
Gossip convergence (100 nodes)~7 roundsO(log₂ 100) ≈ 6.6 rounds; ~7–14 s at 1–2 s/round

Takeaway: At 1 M QPS and 1 KB values, you need sharding from day one (~1 GB/s write bandwidth) and N=3 replication triples raw storage to ~3 PB. Consistent hashing limits reshuffling to ~1% of keys per topology change — the core scalability argument for the ring design.

Functional requirements

  • put(key, value)
  • get(key) → value or null
  • (Optional) delete(key)
  • (Optional) list_keys(prefix)

Non-functional

  • Massive scale: petabytes, billions of keys, millions of QPS.
  • Always available: writes succeed even during partition.
  • Tunable consistency: per-query, not just per-cluster.
  • Linear horizontal scaling.

Building up to the design

A distributed KV store is the single most "make sure you can walk from V1" question in the catalog. Every piece — quorum, vector clocks, gossip — earns its place by being the answer to a problem the simpler version couldn't solve. If you jump straight to Dynamo, you'll fumble the "why?" follow-ups.

V1: A hash map in one process

store = {}
def put(k, v): store[k] = v
def get(k):    return store.get(k)

Millions of ops/sec on one box, instant to write. The problem is catastrophic: process death loses everything. Memory caps your data to RAM.

V2: Persist to disk

Write each put to an append-only log, rebuild the in-memory map from that log on startup. (This is roughly Bitcask.) You now survive restarts and can make a demo with real data. But the ceiling is still one machine's disk and CPU — past a few TB or a few hundred thousand ops/sec, one box simply can't keep up.

V3: Shard across N machines

Hash the key, modulo N, send to that node. Capacity and throughput both scale linearly. The problem is what happens when you add that N+1th machine: hash % N reshuffles every key on every node. Adding one node moves roughly 99% of your data. You're also creating a single point of failure per shard — when that node dies, its entire slice disappears.

V4: Consistent hashing + replication

Hash both keys and nodes onto a ring. Each key lives on the next N nodes clockwise — typically N=3 replicas. Now adding or removing one node moves only ~1/(total nodes) of keys — about 1% for a 100-node cluster. See the consistent hashing article. One node death still leaves 2 copies of each key, so you survive it gracefully.

But now the question is: when do we ack the write? Wait for all 3 replicas and a slow one stalls every write. Ack on just 1 and you might lose data if that replica dies before the others catch up. Worse, a network partition means writes succeed on one side while reads on the other side see stale data.

V5: Quorum reads/writes (the R+W>N rule)

Let the caller decide: W is how many replicas must ack a write, R is how many must respond to a read. Set R + W > N and any read is guaranteed to see at least one node that participated in the latest write — that's read-your-writes under normal operation. Set R + W ≤ N and you get higher availability at the cost of occasional stale reads.

Under strict quorum semantics (non-sloppy), R+W>N prevents stale reads. Dynamo uses sloppy quorum, which softens this guarantee during failures — more on that below.

The remaining problem: when a partition resolves, two replicas may hold different values for the same key. Wall-clock timestamps drift, so you can't trust "most recent" to mean anything reliable. You need a partial order.

V6: Vector clocks for conflicts

Each write carries a per-node version vector. On read, if replicas have concurrent (incomparable) versions, return both and let the client — or a domain merge function — resolve them. The Dynamo shopping-cart example is the classic: concurrent adds to a cart are unioned together on read. You now handle conflicts without a global clock and survive partitions cleanly. What remains is knowing which nodes are alive in the first place.

V7: Gossip + hinted handoff + Merkle anti-entropy

  • Gossip: each node periodically exchanges liveness and ring state with random peers.
  • Hinted handoff: if replica B is down, replica A holds B's writes and replays them when B returns.
  • Merkle trees: periodic background reconciliation finds and fixes silent divergences between replicas.

This is the production Dynamo design — V4 + V5 + V6 + V7 layered together.

flowchart LR
    V1[V1: hashmap<br/>RAM, lossy] --> V2[V2: + log<br/>durable, one box]
    V2 --> V3[V3: + sharding<br/>resharding hell]
    V3 --> V4[V4: consistent hash + replicas<br/>survives 1 death]
    V4 --> V5[V5: + quorum R/W<br/>tunable consistency]
    V5 --> V6[V6: + vector clocks<br/>conflict-safe]
    V6 --> V7[V7: + gossip + Merkle<br/>production Dynamo]
    style V1 fill:#0e7490,color:#fff
    style V4 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

The rest of the article expands each box from V4 onward. If an interviewer asks "what's the simplest thing that would work?", start at V1 and walk forward until they say "good — now show me the Dynamo version."

Architecture overview

flowchart TD
    CLIENT[Client] --> COORD["Coordinator Node<br/>(any node, picked by client)"]
    COORD --> N1["Node 1<br/>holds keys [a, h)"]
    COORD --> N2["Node 2<br/>holds keys [h, p)"]
    COORD --> N3["Node 3<br/>holds keys [p, z)"]
    N1 -.replicate.-> N2
    N2 -.replicate.-> N3
    N3 -.replicate.-> N1

    GOSSIP[Gossip Protocol] -.-> N1
    GOSSIP -.-> N2
    GOSSIP -.-> N3

    style COORD fill:#ff6b1a,color:#0a0a0f
    style N1 fill:#15803d,color:#fff
    style N2 fill:#15803d,color:#fff
    style N3 fill:#15803d,color:#fff

The cluster is fully decentralized: every node knows the topology, and any node can serve as coordinator for a given request. Consistent hashing determines which N nodes hold each key, and gossip keeps the ring view synchronized across all nodes without a central registry.

Data partitioning

Each node owns a slice of the hash ring via virtual nodes. A put(key, value) request:

  1. Coordinator computes hash(key).
  2. Walks the ring clockwise to find the N nodes that should hold this key.
  3. Forwards the write to all N.
  4. Waits for W of them to ack before responding to client.
flowchart LR
    K[key 'foo'] --> H[hash → ring position]
    H --> P[Primary<br/>next clockwise node]
    P --> R1[Replica 1<br/>next after primary]
    R1 --> R2[Replica 2<br/>next after R1]
    style P fill:#ff6b1a,color:#0a0a0f
    style R1 fill:#15803d,color:#fff
    style R2 fill:#15803d,color:#fff

Quorum: tunable consistency

With N replicas, configure:

  • W = how many must acknowledge a write before it succeeds.
  • R = how many must respond to a read before we return.

The key insight is overlap. If R + W > N, then the set of nodes you read from and the set you wrote to must share at least one node — so your read always catches the latest write.

flowchart LR
    subgraph "N=3 replicas"
    W1[Write: W=2<br/>acked by N1 N2]
    R1[Read: R=2<br/>asks N2 N3]
    OVER["Overlap: at least one node<br/>is in both sets (e.g. N2 here)<br/>→ read-your-writes guaranteed"]
    end
    W1 --> OVER
    R1 --> OVER
    style OVER fill:#ff6b1a,color:#0a0a0f
    style W1 fill:#15803d,color:#fff
    style R1 fill:#0e7490,color:#fff

Rules of thumb for N=3:

SettingBehavior
R + W > NRead-your-writes (overlap guarantee — at least one R replica saw the latest W write). Not linearizable: sloppy quorums can temporarily route to non-preference-list nodes during failures.
R + W ≤ NEventual consistency, higher availability, stale reads possible
W = N, R = 1Slow writes, fast reads; maximum write durability (every replica confirms the write)
W = 1, R = NFast writes, slow reads; always reads from every replica
W = 1, R = 1Eventually consistent, both fast
W = N/2 + 1, R = N/2 + 1"Quorum" — balanced; typical default

For N=3, R=2, W=2 is the standard balanced setup.

Sloppy quorum nuance (important for staff interviews): Dynamo does not enforce strict quorum membership. During failures, a write can land on any available node that is not in the normal preference list for that key, so the W acks and R acks are not guaranteed to overlap — even with R+W>N. This is why Dynamo is classified as an AP system rather than a CP one. Hinted handoff and anti-entropy are what eventually bring the cluster back to a consistent state.

Replication and writes

sequenceDiagram
    participant Client
    participant Coord
    participant N1 as Replica 1
    participant N2 as Replica 2
    participant N3 as Replica 3
    Client->>Coord: put(k, v)
    par
      Coord->>N1: write(k, v)
      Coord->>N2: write(k, v)
      Coord->>N3: write(k, v)
    end
    N1-->>Coord: ack
    N2-->>Coord: ack
    Note over Coord: W=2 reached → respond
    Coord-->>Client: ok
    N3-->>Coord: ack (later)

The coordinator returns as soon as W replicas confirm. The remaining replicas catch up asynchronously. This is what keeps tail latency low — you're never waiting for the slowest node in the happy path.

Conflict resolution

Two clients write to the same key at the same moment. The replicas may receive those writes in different orders. Or a replica is partitioned, then comes back holding stale data. You need a principled way to figure out which version wins.

Last-write-wins (LWW)

Each value carries a timestamp. On read, keep the highest timestamp. Simple to implement and reason about — the problem is clock skew. A node whose clock is running slightly behind can have its write silently overwritten by an older write from a node whose clock is ahead. For shopping carts or counters, that's unacceptable data loss.

Vector clocks (Dynamo's approach)

Each version is tagged with a vector of (node_id, counter) pairs. When two versions are compared:

V1 = [(N1, 3), (N2, 1)]
V2 = [(N1, 3), (N2, 2)]   ← strictly greater than V1
V3 = [(N1, 4), (N2, 1)]   ← concurrent with V2 (incomparable)

If one version strictly dominates the other — every component is greater than or equal — keep the dominant version. If they're concurrent (incomparable), keep both and let the application reconcile on read. The classic Dynamo example is a shopping cart: concurrent adds are unioned together.

flowchart LR
    W1["Write A from Node1<br/>V=(N1=1)"] --> R[(Replicas)]
    W2["Write B from Node2<br/>V=(N1=1, N2=1)"] --> R
    W3["Write C from Node1<br/>V=(N1=2)"] --> R
    R --> READ[Read returns:<br/>concurrent versions B & C<br/>app reconciles]
    style READ fill:#ff2e88,color:#fff

The tradeoff: the application has to understand reconciliation, and the vector clock metadata grows over time. In a large cluster, a new entry is added for every node that ever coordinates a write, so the vector can grow without bound. The original Dynamo paper handles this with a pruning heuristic: when the vector exceeds a size threshold, the oldest (node, counter) entries are dropped — at the cost of potentially treating causally related writes as concurrent.

CRDTs

Conflict-free Replicated Data Types are math-y data structures (counters, sets, sequences) where the merge operation is commutative and associative — any two concurrent updates can always be merged automatically, without application-level logic. Used in Riak, Redis CRDT modules, and collaborative editing systems. The limitation is that they only work for specific data types; they're not a general-purpose replacement for vector clocks.

Reads and read repair

sequenceDiagram
    participant Client
    participant Coord
    participant N1
    participant N2
    participant N3
    Client->>Coord: get(k)
    par
      Coord->>N1: read(k)
      Coord->>N2: read(k)
      Coord->>N3: read(k)
    end
    N1-->>Coord: v1, t=10
    N2-->>Coord: v1, t=10
    Note over Coord: R=2 reached → respond freshest
    Coord-->>Client: v1
    N3-->>Coord: v0, t=8 (stale!)
    Coord->>N3: write(k, v1, t=10)  ← repair

When R replicas respond, the coordinator picks the freshest value and returns it. If a stale replica is detected in the process, the coordinator quietly writes the fresh version back to it. Read repair is how the cluster keeps converging even without explicit reconciliation jobs — every read is an opportunity to heal divergence.

Anti-entropy: Merkle trees

Read repair only fixes divergence you happen to stumble across on a read. For deeper repair — especially after long partitions or when a new replica is bootstrapped — nodes periodically compare hash trees ("Merkle trees") of their data ranges. If the root hashes differ, they walk the tree to find exactly which key ranges diverge and sync only those. The cost is proportional to the diff, not the full dataset, which makes it tractable even on large clusters. This is how Cassandra's nodetool repair works.

Membership: gossip

Every few seconds, each node picks a random peer and exchanges:

  • "I think these nodes are alive: A, B, C, D."
  • "I think these nodes are dead: E."
  • "Heartbeat counters."

After a few rounds, all nodes converge on the same view. No central coordinator needed — and none can be a single point of failure.

flowchart LR
    A[A] -.gossip.-> B[B]
    B -.gossip.-> C[C]
    C -.gossip.-> D[D]
    D -.gossip.-> A
    A -.-> E["E?<br/>(suspected dead)"]
    style E fill:#ff2e88,color:#fff

With N nodes and each node gossiping to one random peer per round, information spreads in O(log N) rounds. A 100-node cluster converges in roughly 7 rounds; a 10,000-node cluster in roughly 14. Each round takes on the order of a gossip interval (typically 1–2 seconds in production systems like Cassandra), so full convergence across large clusters is measured in tens of seconds, not minutes.

Dynamo uses an accrual-based failure detector (the phi-accrual model) rather than a simple timeout. Instead of a binary "dead / alive" judgment, it outputs a continuously-rising suspicion level φ. Each caller sets its own conviction threshold, trading detection speed against false-positive risk — a low threshold triggers quickly but misfires more often; a high threshold waits longer before declaring a node dead. Systems like Cassandra expose this as a tunable parameter (phi_convict_threshold, default 8; raised to 10–12 in high-latency cloud environments). This prevents thundering-herd false-evictions when a node is slow but not dead.

Hinted handoff

A replica is temporarily down. The coordinator can't write to it. Rather than failing the write, it stores a "hint" on another available node: "when X comes back, deliver this write to it." When X returns, the hints are replayed and the cluster heals without having lost any writes.

sequenceDiagram
    participant Client
    participant Coord
    participant N2 as "Replica 2 (down)"
    participant N3 as "Replica 3 (hint holder)"
    Client->>Coord: put(k, v)
    Coord->>N3: write hint "deliver to N2 when back"
    Note over N2: offline
    Coord-->>Client: ok (W acks met without N2)
    Note over N2: comes back online
    N3->>N2: replay hinted write(k, v)
    N2-->>N3: ack

Hinted handoff is designed for short outages — minutes to a few hours. For longer divergence, Merkle tree anti-entropy picks up where it leaves off.

Storage on a single node

Each node manages its own local storage. The choice of engine matters a lot at the write rates this system is designed for.

LSM tree (Log-Structured Merge): writes land in a memtable in memory, get flushed to immutable SSTables on disk, and are compacted in the background. All writes are sequential, which means the disk is never doing random I/O on the write path. Used by Cassandra, RocksDB, LevelDB, and ScyllaDB.

B-tree: the classic update-in-place structure. Every write updates a page directly on disk — random I/O, but good for read-heavy workloads. Used by LMDB and WiredTiger (MongoDB's storage engine).

Bitcask: an append-only log with an in-memory hash index mapping every live key to its file offset. All writes are sequential; reads require at most one disk seek. Very fast for workloads where the full key set fits in RAM, which is what Riak's default backend uses.

flowchart LR
    W[Write: put k v] --> MEM[Memtable<br/>in RAM]
    MEM -->|"flush when full"| SS1[SSTable L0<br/>on disk]
    SS1 -->|"compact"| SS2[SSTable L1]
    SS2 -->|"compact"| SS3[SSTable L2]
    READ[Read: get k] --> MEM
    MEM -.miss.-> SS1
    SS1 -.miss.-> SS2
    style MEM fill:#ff6b1a,color:#0a0a0f
    style SS1 fill:#15803d,color:#fff
    style SS2 fill:#0e7490,color:#fff
    style SS3 fill:#a855f7,color:#fff

For a write-heavy KV store, LSM wins. The tradeoffs:

PropertyLSM treeB-tree
Write throughputHigh (sequential I/O, batched flushes)Moderate (random in-place page updates)
Read latency (cold)Higher (may check memtable + multiple SSTable levels)Lower (single tree traversal)
Space amplificationHigher during compactionLower (overwrites in place)
Write amplificationSignificant (data rewritten at each level merge)Moderate (page COW or in-place)
Compaction costBackground CPU/disk spikes during compactionMinimal background work

LSM write amplification is a real operational concern. RocksDB's leveled compaction typically amplifies writes by more than 10×, often reaching 20–30× at high level-size ratios (the RocksDB wiki describes it as "often larger than 10"). Tuning the level size ratio and compaction strategy (leveled vs. tiered/STCS in Cassandra) is a significant operational lever. Tiered compaction has lower write amplification but higher read amplification and space usage during a merge.

Architecture summary

flowchart TD
    subgraph Cluster
    NODE1[Node 1]
    NODE2[Node 2]
    NODE3[Node 3]
    NODEN[Node N]
    end
    GOSSIP{Gossip<br/>membership} --> NODE1
    GOSSIP --> NODE2
    GOSSIP --> NODE3
    NODE1 -.replicate.-> NODE2
    NODE2 -.replicate.-> NODE3
    HINTS[(Hinted handoff queue)] --> NODE1
    MERKLE[Merkle tree comparison<br/>for anti-entropy] --> NODE1
    NODE1 --> LSM[LSM Tree on local disk]
    NODE1 --> CACHE[Block cache in RAM]
    style NODE1 fill:#ff6b1a,color:#0a0a0f
    style HINTS fill:#a855f7,color:#fff
    style MERKLE fill:#15803d,color:#fff

Failure scenarios

Single node down

As long as W replicas are alive, writes succeed. Hinted handoff stores the writes that missed the downed node, and replays them when it returns.

Whole rack / AZ down

Spread N replicas across racks and availability zones. With N=3 across 3 AZs, losing any one AZ leaves at least 2 live replicas — writes still succeed.

Network partition

Both sides keep accepting writes (AP behavior). Vector clocks record what happened on each side; anti-entropy reconciles when the partition heals.

With sloppy quorum, writes succeed on both sides of a partition by pulling in available nodes outside the normal preference list — so R+W>N does not protect you from conflicting concurrent writes during a partition. The system will surface concurrent versions (via vector clocks) on the next read and require application-level reconciliation.

With strict quorum (not Dynamo's default), the minority side would block writes rather than accept them — preserving CP semantics at the cost of availability.

Coordinated failure of N nodes

You lose data. Mitigations: spread replicas across geo regions; backups to S3.

Things to discuss in an interview

  • Consistent hashing + virtual nodes for sharding. Virtual nodes (tokens) let each physical node own multiple ring segments, which balances load across heterogeneous machines and makes node addition/removal require smaller, incremental data transfers instead of one big shift.
  • N, R, W as the three knobs of consistency — and the sloppy quorum caveat that Dynamo-style systems use.
  • Vector clocks for concurrent writes, including their pruning problem in large clusters.
  • Gossip for membership, with phi-accrual failure detection for configurable suspicion thresholds.
  • Hinted handoff & Merkle anti-entropy as short-term vs. long-term availability repair tactics (read repair is opportunistic — it fires on reads, not on a schedule).
  • Merkle trees for bounded background repair cost (tree comparison is O(diff size), not O(full data size)).
  • LSM tree as the local storage choice, and when write amplification becomes an operational issue.

Things you should now be able to answer

  • Why does R + W > N give read-your-writes under strict quorum, and why does sloppy quorum weaken that guarantee?
  • What's the alternative to vector clocks, and what does it sacrifice?
  • How does phi-accrual failure detection differ from a simple heartbeat timeout, and why does it matter?
  • What happens to writes during a network partition in Dynamo's sloppy-quorum model? How does it differ from a CP system?
  • Why is gossip preferred over a central coordinator, and what is its convergence complexity?
  • What is write amplification in an LSM tree, and how does compaction strategy affect it?
  • When would you choose Bitcask over an LSM tree?

Further reading

  • "Dynamo: Amazon's Highly Available Key-value Store" (DeCandia et al., SOSP 2007) — the foundational paper
  • "Designing Data-Intensive Applications" (Kleppmann), chapters 5-6
  • Cassandra documentation on consistency, gossip, repair
// FAQ

Frequently asked questions

What is the R+W>N quorum rule, and what does it guarantee?

With N replicas, R is how many must respond to a read and W is how many must ack a write. When R+W>N, the read set and write set must overlap by at least one node, guaranteeing that every read sees at least one replica that participated in the latest write — this is read-your-writes. For N=3, the standard balanced setup is R=2, W=2.

Why does Dynamo's sloppy quorum weaken the R+W>N guarantee?

Dynamo does not enforce strict quorum membership. During failures, a write can land on any available node outside the normal preference list for that key, so the W acks and R acks are not guaranteed to overlap even when R+W>N. This is why Dynamo is classified as an AP system; hinted handoff and Merkle tree anti-entropy are what eventually restore consistency.

How does consistent hashing with virtual nodes compare to naive hash-mod-N sharding?

With naive hash%N sharding, adding one node reshuffles nearly all keys across the cluster. Consistent hashing places both keys and nodes on a ring, so adding one node moves only about 1/N of total keys — roughly 1% for a 100-node cluster — taken evenly from existing nodes. Virtual nodes let each physical machine own multiple ring segments, balancing load across heterogeneous hardware.

When should you use vector clocks instead of last-write-wins for conflict resolution?

Use vector clocks when data loss from clock skew is unacceptable. Last-write-wins silently discards a write if the losing node's wall clock is running slightly behind, which is catastrophic for shopping carts or counters. Vector clocks track causality per node; when two versions are concurrent and incomparable, both are surfaced to the application for merge — the classic Dynamo example is unioning concurrent shopping-cart adds.

What is LSM tree write amplification, and how large does it get in practice?

Write amplification in an LSM tree means data is rewritten at each level merge during compaction, so one logical write triggers many physical disk writes. RocksDB's leveled compaction typically amplifies writes by more than 10x, often reaching 20-30x at high level-size ratios. Tiered compaction reduces write amplification but increases read amplification and space usage during a merge.

// RELATED

You may also like