Design a Distributed Key-Value Store (Dynamo)
Build your own DynamoDB / Cassandra. Sharding, replication, quorum reads/writes, vector clocks, conflict resolution.
The problem
DynamoDB stores over a trillion items and handles millions of requests per second. Cassandra powers Discord's message history — 120 billion messages as of 2020, with writes sustaining several million per day. At the core of both systems is the same primitive: a distributed key-value store. You hand it a key, you get back a value. Simple API, brutally hard implementation at scale.
A single machine with a hash map in memory can do millions of operations per second — until it runs out of RAM, until the process crashes, or until traffic outgrows one box. The first move is sharding: split the key space across many nodes. The second move is replication: keep copies of every key on multiple nodes so one failure doesn't make data disappear. That's where the hard engineering begins.
Once you have N replicas of the same key, two problems collide. First: when is a write "done"? Wait for all replicas and a slow node stalls every client; ack on one and you risk losing the write if that node dies before the others catch up. Second: what do you do when two clients write the same key at the same time, or when a network partition lets both sides accept writes independently? Wall-clock timestamps drift and lie. You need real concurrency tracking.
This is why the distributed key-value store is one of the most revealing system design questions: it forces you to reason through consistency vs. availability, quorum math, vector clocks, and gossip — not as isolated topics but as a connected stack. The reference implementation is Amazon Dynamo (2007), which became the blueprint for Cassandra, Riak, Voldemort, and ScyllaDB. Getting this design right in an interview shows senior-level distributed-systems fluency.
Back-of-the-envelope
| Dimension | Estimate | How we got there |
|---|---|---|
| Write QPS | 1 M QPS | Target scale; drives sharding and quorum choices |
| Average value size | 1 KB | Typical KV workload (small objects) |
| Write bandwidth | ~1 GB/s | 1 M QPS × 1 KB per write |
| Logical data | 1 PB | Aggregate stored dataset |
| Raw storage (N=3) | ~3 PB | 1 PB logical × 3 replicas |
| Keys reshuffled on node add | ~1% | 1/N for a 100-node cluster via consistent hashing (vs. ~100% with hash-mod-N) |
| Gossip convergence (100 nodes) | ~7 rounds | O(log₂ 100) ≈ 6.6 rounds; ~7–14 s at 1–2 s/round |
Takeaway: At 1 M QPS and 1 KB values, you need sharding from day one (~1 GB/s write bandwidth) and N=3 replication triples raw storage to ~3 PB. Consistent hashing limits reshuffling to ~1% of keys per topology change — the core scalability argument for the ring design.
Functional requirements
put(key, value)get(key)→ value or null- (Optional)
delete(key) - (Optional)
list_keys(prefix)
Non-functional
- Massive scale: petabytes, billions of keys, millions of QPS.
- Always available: writes succeed even during partition.
- Tunable consistency: per-query, not just per-cluster.
- Linear horizontal scaling.
Building up to the design
A distributed KV store is the single most "make sure you can walk from V1" question in the catalog. Every piece — quorum, vector clocks, gossip — earns its place by being the answer to a problem the simpler version couldn't solve. If you jump straight to Dynamo, you'll fumble the "why?" follow-ups.
V1: A hash map in one process
store = {}
def put(k, v): store[k] = v
def get(k): return store.get(k)
Millions of ops/sec on one box, instant to write. The problem is catastrophic: process death loses everything. Memory caps your data to RAM.
V2: Persist to disk
Write each put to an append-only log, rebuild the in-memory map from that log on startup. (This is roughly Bitcask.) You now survive restarts and can make a demo with real data. But the ceiling is still one machine's disk and CPU — past a few TB or a few hundred thousand ops/sec, one box simply can't keep up.
V3: Shard across N machines
Hash the key, modulo N, send to that node. Capacity and throughput both scale linearly. The problem is what happens when you add that N+1th machine: hash % N reshuffles every key on every node. Adding one node moves roughly 99% of your data. You're also creating a single point of failure per shard — when that node dies, its entire slice disappears.
V4: Consistent hashing + replication
Hash both keys and nodes onto a ring. Each key lives on the next N nodes clockwise — typically N=3 replicas. Now adding or removing one node moves only ~1/(total nodes) of keys — about 1% for a 100-node cluster. See the consistent hashing article. One node death still leaves 2 copies of each key, so you survive it gracefully.
But now the question is: when do we ack the write? Wait for all 3 replicas and a slow one stalls every write. Ack on just 1 and you might lose data if that replica dies before the others catch up. Worse, a network partition means writes succeed on one side while reads on the other side see stale data.
V5: Quorum reads/writes (the R+W>N rule)
Let the caller decide: W is how many replicas must ack a write, R is how many must respond to a read. Set R + W > N and any read is guaranteed to see at least one node that participated in the latest write — that's read-your-writes under normal operation. Set R + W ≤ N and you get higher availability at the cost of occasional stale reads.
Under strict quorum semantics (non-sloppy), R+W>N prevents stale reads. Dynamo uses sloppy quorum, which softens this guarantee during failures — more on that below.
The remaining problem: when a partition resolves, two replicas may hold different values for the same key. Wall-clock timestamps drift, so you can't trust "most recent" to mean anything reliable. You need a partial order.
V6: Vector clocks for conflicts
Each write carries a per-node version vector. On read, if replicas have concurrent (incomparable) versions, return both and let the client — or a domain merge function — resolve them. The Dynamo shopping-cart example is the classic: concurrent adds to a cart are unioned together on read. You now handle conflicts without a global clock and survive partitions cleanly. What remains is knowing which nodes are alive in the first place.
V7: Gossip + hinted handoff + Merkle anti-entropy
- Gossip: each node periodically exchanges liveness and ring state with random peers.
- Hinted handoff: if replica B is down, replica A holds B's writes and replays them when B returns.
- Merkle trees: periodic background reconciliation finds and fixes silent divergences between replicas.
This is the production Dynamo design — V4 + V5 + V6 + V7 layered together.
flowchart LR
V1[V1: hashmap<br/>RAM, lossy] --> V2[V2: + log<br/>durable, one box]
V2 --> V3[V3: + sharding<br/>resharding hell]
V3 --> V4[V4: consistent hash + replicas<br/>survives 1 death]
V4 --> V5[V5: + quorum R/W<br/>tunable consistency]
V5 --> V6[V6: + vector clocks<br/>conflict-safe]
V6 --> V7[V7: + gossip + Merkle<br/>production Dynamo]
style V1 fill:#0e7490,color:#fff
style V4 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V7 fill:#a855f7,color:#fff
The rest of the article expands each box from V4 onward. If an interviewer asks "what's the simplest thing that would work?", start at V1 and walk forward until they say "good — now show me the Dynamo version."
Architecture overview
flowchart TD
CLIENT[Client] --> COORD["Coordinator Node<br/>(any node, picked by client)"]
COORD --> N1["Node 1<br/>holds keys [a, h)"]
COORD --> N2["Node 2<br/>holds keys [h, p)"]
COORD --> N3["Node 3<br/>holds keys [p, z)"]
N1 -.replicate.-> N2
N2 -.replicate.-> N3
N3 -.replicate.-> N1
GOSSIP[Gossip Protocol] -.-> N1
GOSSIP -.-> N2
GOSSIP -.-> N3
style COORD fill:#ff6b1a,color:#0a0a0f
style N1 fill:#15803d,color:#fff
style N2 fill:#15803d,color:#fff
style N3 fill:#15803d,color:#fff
The cluster is fully decentralized: every node knows the topology, and any node can serve as coordinator for a given request. Consistent hashing determines which N nodes hold each key, and gossip keeps the ring view synchronized across all nodes without a central registry.
Data partitioning
Each node owns a slice of the hash ring via virtual nodes. A put(key, value) request:
- Coordinator computes
hash(key). - Walks the ring clockwise to find the N nodes that should hold this key.
- Forwards the write to all N.
- Waits for
Wof them to ack before responding to client.
flowchart LR
K[key 'foo'] --> H[hash → ring position]
H --> P[Primary<br/>next clockwise node]
P --> R1[Replica 1<br/>next after primary]
R1 --> R2[Replica 2<br/>next after R1]
style P fill:#ff6b1a,color:#0a0a0f
style R1 fill:#15803d,color:#fff
style R2 fill:#15803d,color:#fff
Quorum: tunable consistency
With N replicas, configure:
W= how many must acknowledge a write before it succeeds.R= how many must respond to a read before we return.
The key insight is overlap. If R + W > N, then the set of nodes you read from and the set you wrote to must share at least one node — so your read always catches the latest write.
flowchart LR
subgraph "N=3 replicas"
W1[Write: W=2<br/>acked by N1 N2]
R1[Read: R=2<br/>asks N2 N3]
OVER["Overlap: at least one node<br/>is in both sets (e.g. N2 here)<br/>→ read-your-writes guaranteed"]
end
W1 --> OVER
R1 --> OVER
style OVER fill:#ff6b1a,color:#0a0a0f
style W1 fill:#15803d,color:#fff
style R1 fill:#0e7490,color:#fff
Rules of thumb for N=3:
| Setting | Behavior |
|---|---|
R + W > N | Read-your-writes (overlap guarantee — at least one R replica saw the latest W write). Not linearizable: sloppy quorums can temporarily route to non-preference-list nodes during failures. |
R + W ≤ N | Eventual consistency, higher availability, stale reads possible |
W = N, R = 1 | Slow writes, fast reads; maximum write durability (every replica confirms the write) |
W = 1, R = N | Fast writes, slow reads; always reads from every replica |
W = 1, R = 1 | Eventually consistent, both fast |
W = N/2 + 1, R = N/2 + 1 | "Quorum" — balanced; typical default |
For N=3, R=2, W=2 is the standard balanced setup.
Sloppy quorum nuance (important for staff interviews): Dynamo does not enforce strict quorum membership. During failures, a write can land on any available node that is not in the normal preference list for that key, so the W acks and R acks are not guaranteed to overlap — even with R+W>N. This is why Dynamo is classified as an AP system rather than a CP one. Hinted handoff and anti-entropy are what eventually bring the cluster back to a consistent state.
Replication and writes
sequenceDiagram
participant Client
participant Coord
participant N1 as Replica 1
participant N2 as Replica 2
participant N3 as Replica 3
Client->>Coord: put(k, v)
par
Coord->>N1: write(k, v)
Coord->>N2: write(k, v)
Coord->>N3: write(k, v)
end
N1-->>Coord: ack
N2-->>Coord: ack
Note over Coord: W=2 reached → respond
Coord-->>Client: ok
N3-->>Coord: ack (later)
The coordinator returns as soon as W replicas confirm. The remaining replicas catch up asynchronously. This is what keeps tail latency low — you're never waiting for the slowest node in the happy path.
Conflict resolution
Two clients write to the same key at the same moment. The replicas may receive those writes in different orders. Or a replica is partitioned, then comes back holding stale data. You need a principled way to figure out which version wins.
Last-write-wins (LWW)
Each value carries a timestamp. On read, keep the highest timestamp. Simple to implement and reason about — the problem is clock skew. A node whose clock is running slightly behind can have its write silently overwritten by an older write from a node whose clock is ahead. For shopping carts or counters, that's unacceptable data loss.
Vector clocks (Dynamo's approach)
Each version is tagged with a vector of (node_id, counter) pairs. When two versions are compared:
V1 = [(N1, 3), (N2, 1)]
V2 = [(N1, 3), (N2, 2)] ← strictly greater than V1
V3 = [(N1, 4), (N2, 1)] ← concurrent with V2 (incomparable)
If one version strictly dominates the other — every component is greater than or equal — keep the dominant version. If they're concurrent (incomparable), keep both and let the application reconcile on read. The classic Dynamo example is a shopping cart: concurrent adds are unioned together.
flowchart LR
W1["Write A from Node1<br/>V=(N1=1)"] --> R[(Replicas)]
W2["Write B from Node2<br/>V=(N1=1, N2=1)"] --> R
W3["Write C from Node1<br/>V=(N1=2)"] --> R
R --> READ[Read returns:<br/>concurrent versions B & C<br/>app reconciles]
style READ fill:#ff2e88,color:#fff
The tradeoff: the application has to understand reconciliation, and the vector clock metadata grows over time. In a large cluster, a new entry is added for every node that ever coordinates a write, so the vector can grow without bound. The original Dynamo paper handles this with a pruning heuristic: when the vector exceeds a size threshold, the oldest (node, counter) entries are dropped — at the cost of potentially treating causally related writes as concurrent.
CRDTs
Conflict-free Replicated Data Types are math-y data structures (counters, sets, sequences) where the merge operation is commutative and associative — any two concurrent updates can always be merged automatically, without application-level logic. Used in Riak, Redis CRDT modules, and collaborative editing systems. The limitation is that they only work for specific data types; they're not a general-purpose replacement for vector clocks.
Reads and read repair
sequenceDiagram
participant Client
participant Coord
participant N1
participant N2
participant N3
Client->>Coord: get(k)
par
Coord->>N1: read(k)
Coord->>N2: read(k)
Coord->>N3: read(k)
end
N1-->>Coord: v1, t=10
N2-->>Coord: v1, t=10
Note over Coord: R=2 reached → respond freshest
Coord-->>Client: v1
N3-->>Coord: v0, t=8 (stale!)
Coord->>N3: write(k, v1, t=10) ← repair
When R replicas respond, the coordinator picks the freshest value and returns it. If a stale replica is detected in the process, the coordinator quietly writes the fresh version back to it. Read repair is how the cluster keeps converging even without explicit reconciliation jobs — every read is an opportunity to heal divergence.
Anti-entropy: Merkle trees
Read repair only fixes divergence you happen to stumble across on a read. For deeper repair — especially after long partitions or when a new replica is bootstrapped — nodes periodically compare hash trees ("Merkle trees") of their data ranges. If the root hashes differ, they walk the tree to find exactly which key ranges diverge and sync only those. The cost is proportional to the diff, not the full dataset, which makes it tractable even on large clusters. This is how Cassandra's nodetool repair works.
Membership: gossip
Every few seconds, each node picks a random peer and exchanges:
- "I think these nodes are alive: A, B, C, D."
- "I think these nodes are dead: E."
- "Heartbeat counters."
After a few rounds, all nodes converge on the same view. No central coordinator needed — and none can be a single point of failure.
flowchart LR
A[A] -.gossip.-> B[B]
B -.gossip.-> C[C]
C -.gossip.-> D[D]
D -.gossip.-> A
A -.-> E["E?<br/>(suspected dead)"]
style E fill:#ff2e88,color:#fff
With N nodes and each node gossiping to one random peer per round, information spreads in O(log N) rounds. A 100-node cluster converges in roughly 7 rounds; a 10,000-node cluster in roughly 14. Each round takes on the order of a gossip interval (typically 1–2 seconds in production systems like Cassandra), so full convergence across large clusters is measured in tens of seconds, not minutes.
Dynamo uses an accrual-based failure detector (the phi-accrual model) rather than a simple timeout. Instead of a binary "dead / alive" judgment, it outputs a continuously-rising suspicion level φ. Each caller sets its own conviction threshold, trading detection speed against false-positive risk — a low threshold triggers quickly but misfires more often; a high threshold waits longer before declaring a node dead. Systems like Cassandra expose this as a tunable parameter (phi_convict_threshold, default 8; raised to 10–12 in high-latency cloud environments). This prevents thundering-herd false-evictions when a node is slow but not dead.
Hinted handoff
A replica is temporarily down. The coordinator can't write to it. Rather than failing the write, it stores a "hint" on another available node: "when X comes back, deliver this write to it." When X returns, the hints are replayed and the cluster heals without having lost any writes.
sequenceDiagram
participant Client
participant Coord
participant N2 as "Replica 2 (down)"
participant N3 as "Replica 3 (hint holder)"
Client->>Coord: put(k, v)
Coord->>N3: write hint "deliver to N2 when back"
Note over N2: offline
Coord-->>Client: ok (W acks met without N2)
Note over N2: comes back online
N3->>N2: replay hinted write(k, v)
N2-->>N3: ack
Hinted handoff is designed for short outages — minutes to a few hours. For longer divergence, Merkle tree anti-entropy picks up where it leaves off.
Storage on a single node
Each node manages its own local storage. The choice of engine matters a lot at the write rates this system is designed for.
LSM tree (Log-Structured Merge): writes land in a memtable in memory, get flushed to immutable SSTables on disk, and are compacted in the background. All writes are sequential, which means the disk is never doing random I/O on the write path. Used by Cassandra, RocksDB, LevelDB, and ScyllaDB.
B-tree: the classic update-in-place structure. Every write updates a page directly on disk — random I/O, but good for read-heavy workloads. Used by LMDB and WiredTiger (MongoDB's storage engine).
Bitcask: an append-only log with an in-memory hash index mapping every live key to its file offset. All writes are sequential; reads require at most one disk seek. Very fast for workloads where the full key set fits in RAM, which is what Riak's default backend uses.
flowchart LR
W[Write: put k v] --> MEM[Memtable<br/>in RAM]
MEM -->|"flush when full"| SS1[SSTable L0<br/>on disk]
SS1 -->|"compact"| SS2[SSTable L1]
SS2 -->|"compact"| SS3[SSTable L2]
READ[Read: get k] --> MEM
MEM -.miss.-> SS1
SS1 -.miss.-> SS2
style MEM fill:#ff6b1a,color:#0a0a0f
style SS1 fill:#15803d,color:#fff
style SS2 fill:#0e7490,color:#fff
style SS3 fill:#a855f7,color:#fff
For a write-heavy KV store, LSM wins. The tradeoffs:
| Property | LSM tree | B-tree |
|---|---|---|
| Write throughput | High (sequential I/O, batched flushes) | Moderate (random in-place page updates) |
| Read latency (cold) | Higher (may check memtable + multiple SSTable levels) | Lower (single tree traversal) |
| Space amplification | Higher during compaction | Lower (overwrites in place) |
| Write amplification | Significant (data rewritten at each level merge) | Moderate (page COW or in-place) |
| Compaction cost | Background CPU/disk spikes during compaction | Minimal background work |
LSM write amplification is a real operational concern. RocksDB's leveled compaction typically amplifies writes by more than 10×, often reaching 20–30× at high level-size ratios (the RocksDB wiki describes it as "often larger than 10"). Tuning the level size ratio and compaction strategy (leveled vs. tiered/STCS in Cassandra) is a significant operational lever. Tiered compaction has lower write amplification but higher read amplification and space usage during a merge.
Architecture summary
flowchart TD
subgraph Cluster
NODE1[Node 1]
NODE2[Node 2]
NODE3[Node 3]
NODEN[Node N]
end
GOSSIP{Gossip<br/>membership} --> NODE1
GOSSIP --> NODE2
GOSSIP --> NODE3
NODE1 -.replicate.-> NODE2
NODE2 -.replicate.-> NODE3
HINTS[(Hinted handoff queue)] --> NODE1
MERKLE[Merkle tree comparison<br/>for anti-entropy] --> NODE1
NODE1 --> LSM[LSM Tree on local disk]
NODE1 --> CACHE[Block cache in RAM]
style NODE1 fill:#ff6b1a,color:#0a0a0f
style HINTS fill:#a855f7,color:#fff
style MERKLE fill:#15803d,color:#fff
Failure scenarios
Single node down
As long as W replicas are alive, writes succeed. Hinted handoff stores the writes that missed the downed node, and replays them when it returns.
Whole rack / AZ down
Spread N replicas across racks and availability zones. With N=3 across 3 AZs, losing any one AZ leaves at least 2 live replicas — writes still succeed.
Network partition
Both sides keep accepting writes (AP behavior). Vector clocks record what happened on each side; anti-entropy reconciles when the partition heals.
With sloppy quorum, writes succeed on both sides of a partition by pulling in available nodes outside the normal preference list — so R+W>N does not protect you from conflicting concurrent writes during a partition. The system will surface concurrent versions (via vector clocks) on the next read and require application-level reconciliation.
With strict quorum (not Dynamo's default), the minority side would block writes rather than accept them — preserving CP semantics at the cost of availability.
Coordinated failure of N nodes
You lose data. Mitigations: spread replicas across geo regions; backups to S3.
Things to discuss in an interview
- Consistent hashing + virtual nodes for sharding. Virtual nodes (tokens) let each physical node own multiple ring segments, which balances load across heterogeneous machines and makes node addition/removal require smaller, incremental data transfers instead of one big shift.
- N, R, W as the three knobs of consistency — and the sloppy quorum caveat that Dynamo-style systems use.
- Vector clocks for concurrent writes, including their pruning problem in large clusters.
- Gossip for membership, with phi-accrual failure detection for configurable suspicion thresholds.
- Hinted handoff & Merkle anti-entropy as short-term vs. long-term availability repair tactics (read repair is opportunistic — it fires on reads, not on a schedule).
- Merkle trees for bounded background repair cost (tree comparison is O(diff size), not O(full data size)).
- LSM tree as the local storage choice, and when write amplification becomes an operational issue.
Things you should now be able to answer
- Why does
R + W > Ngive read-your-writes under strict quorum, and why does sloppy quorum weaken that guarantee? - What's the alternative to vector clocks, and what does it sacrifice?
- How does phi-accrual failure detection differ from a simple heartbeat timeout, and why does it matter?
- What happens to writes during a network partition in Dynamo's sloppy-quorum model? How does it differ from a CP system?
- Why is gossip preferred over a central coordinator, and what is its convergence complexity?
- What is write amplification in an LSM tree, and how does compaction strategy affect it?
- When would you choose Bitcask over an LSM tree?
Further reading
- "Dynamo: Amazon's Highly Available Key-value Store" (DeCandia et al., SOSP 2007) — the foundational paper
- "Designing Data-Intensive Applications" (Kleppmann), chapters 5-6
- Cassandra documentation on consistency, gossip, repair
Frequently asked questions
▸What is the R+W>N quorum rule, and what does it guarantee?
With N replicas, R is how many must respond to a read and W is how many must ack a write. When R+W>N, the read set and write set must overlap by at least one node, guaranteeing that every read sees at least one replica that participated in the latest write — this is read-your-writes. For N=3, the standard balanced setup is R=2, W=2.
▸Why does Dynamo's sloppy quorum weaken the R+W>N guarantee?
Dynamo does not enforce strict quorum membership. During failures, a write can land on any available node outside the normal preference list for that key, so the W acks and R acks are not guaranteed to overlap even when R+W>N. This is why Dynamo is classified as an AP system; hinted handoff and Merkle tree anti-entropy are what eventually restore consistency.
▸How does consistent hashing with virtual nodes compare to naive hash-mod-N sharding?
With naive hash%N sharding, adding one node reshuffles nearly all keys across the cluster. Consistent hashing places both keys and nodes on a ring, so adding one node moves only about 1/N of total keys — roughly 1% for a 100-node cluster — taken evenly from existing nodes. Virtual nodes let each physical machine own multiple ring segments, balancing load across heterogeneous hardware.
▸When should you use vector clocks instead of last-write-wins for conflict resolution?
Use vector clocks when data loss from clock skew is unacceptable. Last-write-wins silently discards a write if the losing node's wall clock is running slightly behind, which is catastrophic for shopping carts or counters. Vector clocks track causality per node; when two versions are concurrent and incomparable, both are surfaced to the application for merge — the classic Dynamo example is unioning concurrent shopping-cart adds.
▸What is LSM tree write amplification, and how large does it get in practice?
Write amplification in an LSM tree means data is rewritten at each level merge during compaction, so one logical write triggers many physical disk writes. RocksDB's leveled compaction typically amplifies writes by more than 10x, often reaching 20-30x at high level-size ratios. Tiered compaction reduces write amplification but increases read amplification and space usage during a merge.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.