Database Replication
Single-leader, multi-leader, and leaderless replication. Sync vs async, replication lag, conflict resolution, and how each model trades availability for consistency.
Replication is keeping copies of the same data on multiple machines. The reason is usually one of three things: survive failure, serve more reads, or put data near users. The shape of the replication — how many writers, how syncing happens, who resolves conflicts — is one of the most consequential decisions in any distributed system.
This article walks through the three families (single-leader, multi-leader, leaderless), the sync/async trade-off that runs through all of them, and the operational pain you sign up for with each.
Why replicate
Before picking a model, it's worth being precise about why you're replicating — because the reason shapes every trade-off that follows.
If you're replicating for high availability, you need your failover path to be fast and clean. If you're replicating to scale reads, you can often tolerate stale data. If you're replicating for geo-locality — putting data physically close to the users who read it — you almost certainly need multiple write regions too, which opens a whole different can of worms.
flowchart TD
R[Why replicate?] --> HA[High availability<br/>survive node failure]
R --> RP[Read scaling<br/>more replicas = more reads]
R --> GEO[Geo locality<br/>data near users]
style HA fill:#ff6b1a,color:#0a0a0f
style RP fill:#15803d,color:#fff
style GEO fill:#0e7490,color:#fff
Replicating reads is the easy part — you just route to any copy. Coordinating writes across copies is where every interesting design choice lives.
Three replication models
There are three families worth knowing. Single-leader handles the overwhelming majority of real-world OLTP databases. Multi-leader is a specialized tool for multi-region active-active. Leaderless is the model Amazon's 2007 Dynamo paper popularized, and what Cassandra, Riak, and ScyllaDB are built on.
flowchart TD
M[Replication models] --> S[Single-leader<br/>one writer]
M --> MU[Multi-leader<br/>many writers]
M --> L[Leaderless<br/>quorum]
S --> PG[Postgres, MySQL,<br/>MongoDB, most SQL]
MU --> MM[Galera, BDR, multi-region<br/>multi-master MySQL]
L --> DY[Dynamo paper, Cassandra,<br/>Riak, ScyllaDB]
style S fill:#ff6b1a,color:#0a0a0f
style MU fill:#ffaa00,color:#0a0a0f
style L fill:#0e7490,color:#fff
Single-leader (a.k.a. primary–replica, a.k.a. master–slave)
One node handles all writes. Reads can go anywhere. The leader streams its write-ahead log to followers.
flowchart LR
C[Client writes] --> L[(Leader)]
L -->|WAL stream| F1[(Follower 1)]
L -->|WAL stream| F2[(Follower 2)]
C2[Client reads] --> F1
C3[Client reads] --> F2
C4[Client reads] --> L
style L fill:#ff6b1a,color:#0a0a0f
This is what 80% of production databases use. Postgres, MySQL, MongoDB, SQL Server, Oracle, Kafka brokers within a partition.
Synchronous vs asynchronous
The leader can ship the write to followers in two ways:
| Mode | Behavior | Trade-off |
|---|---|---|
| Sync | Wait for followers to ack before returning to client | Strong durability; slow; fails if a follower is down |
| Async | Return to client immediately; followers catch up | Fast; risk losing the last few writes on leader crash |
| Semi-sync | Wait for at least one follower, then return | Practical middle ground |
Postgres lets you mix: certain followers are sync, others async. The synchronous ones become candidates for failover.
Replication lag
Async followers are behind by some amount of time, called replication lag. Usually milliseconds. Under load: seconds. Under crisis: minutes or worse.
This is where users hit "post a comment, immediately reload, my comment is gone" bugs.
sequenceDiagram
actor U as User
participant L as Leader
participant F as Follower
U->>L: POST comment
L-->>U: 200 OK
Note over L,F: ...replication lag...
U->>F: GET comments
F-->>U: comments without yours!
Note over U: 😡
The fix isn't to always read from the leader — that defeats the whole point of having followers. Instead, you route selectively:
- Read-your-own-writes: send a user's reads to the leader for a short window after they write something.
- Monotonic reads: pin a user's reads to one follower for the duration of a session, so at least they see a consistent history.
- Versioned reads: the client passes the timestamp of its last write; the gateway routes to a follower that's at least that fresh.
flowchart TD
U[User makes a write] --> L[(Leader<br/>write succeeds)]
L --> Q{Did user just write?}
Q -->|Yes, within window| RL[Route reads to Leader]
Q -->|No| RF[Route reads to Follower]
RL --> RES[Consistent result]
RF --> RES2[Possibly stale but acceptable]
style L fill:#ff6b1a,color:#0a0a0f
style RL fill:#15803d,color:#fff
style RF fill:#0e7490,color:#fff
Failover — the hard part
When the leader dies, some follower must become the new leader. This is failover, and it's where single-leader systems earn their complexity budget.
flowchart LR
L1[(Leader<br/>💀)] -.dies.-> L2{Promote<br/>which follower?}
L2 -->|"the one closest to caught up"| NEW[(New Leader)]
F1[(Follower 1)] --> L2
F2[(Follower 2)] --> L2
style L1 fill:#ff2e88,color:#fff
style NEW fill:#15803d,color:#fff
Three things routinely go wrong:
Lost writes. Async followers are behind. Anything not yet replicated is lost when you cut over. In steady state, async lag is typically tens to hundreds of milliseconds; under write load it can spike to seconds — so failover can silently discard that window of writes.
Split brain. The old leader thinks it's still leader; the new leader is also live; both accept writes. Different rows, same key, two answers. The mitigation is fencing (STONITH — "Shoot The Other Node In The Head") and epoch/generation numbers — each leader is assigned a monotonically increasing epoch (Raft calls it a "term"; Paxos calls it a "ballot number"); followers reject requests from any leader whose epoch is lower than the current election result.
Stale clients. Apps cache the old leader's address. Short-lived DNS TTLs (30–60 s) or a proxy (PgBouncer, ProxySQL, RDS Proxy) that tracks the current leader address keep clients from hammering a dead node.
Modern systems handle failover with consensus (Raft, Paxos) to elect leaders safely. See the leader election article.
Single-leader is the right choice when you need strong consistency on writes in one region, your write QPS fits one machine, and you can tolerate brief downtime during failover — which is to say, it's the right choice for almost every OLTP database.
Multi-leader replication
Multiple nodes accept writes; each replicates to the others. The main use case is multi-region active-active — you want users in Tokyo to write to a Tokyo node at low latency, not wait for a round-trip to US-East.
flowchart LR
U1[Users US] --> L1[(Leader US)]
U2[Users EU] --> L2[(Leader EU)]
U3[Users AP] --> L3[(Leader AP)]
L1 <-->|async| L2
L2 <-->|async| L3
L1 <-->|async| L3
style L1 fill:#ff6b1a,color:#0a0a0f
style L2 fill:#ffaa00,color:#0a0a0f
style L3 fill:#0e7490,color:#fff
Tokyo writes go to the Tokyo leader (low latency), and the write propagates to US and EU asynchronously. The latency win is real and significant. The price you pay is conflicts.
The brutal problem: conflicts
Two leaders write to the same row at the same time. Now you have two histories to merge.
sequenceDiagram
actor A as Alice (US)
actor B as Bob (EU)
participant L1 as US Leader
participant L2 as EU Leader
A->>L1: SET name = 'Carol'
B->>L2: SET name = 'Dave'
L1->>L2: replicate name='Carol'
L2->>L1: replicate name='Dave'
Note over L1,L2: Both reach inconsistent state
There are four strategies, and none of them are free.
The simplest is last-write-wins (LWW): pick the version with the later timestamp. Simple to implement, but lossy — one update is silently dropped. It works fine for truly independent facts; it's terrible for anything where two concurrent writes are both supposed to survive. From LWW you can step up to application-defined merge, which gives the app both versions and lets it decide — the app sees name='Carol' and name='Dave' and picks one. This works well when the app knows what "merge" means for its domain (e.g., taking a set-union of tags).
For data types where the merge is always well-defined, CRDTs (conflict-free replicated data types) sidestep the resolution problem entirely. CRDTs are data structures where the merge function is commutative and associative — so order doesn't matter and there's nothing to resolve. Counters, sets, and maps all have CRDT implementations. If your data fits the model, this is the cleanest option.
When none of the above applies, manual resolution surfaces both conflicting versions to a user or admin. Dropbox's "conflicted copy" files are a familiar example.
Most engineers underestimate how often conflicts happen and how painful resolution is. Multi-leader is the right tool for offline-first mobile apps (where every device is effectively a leader) and for multi-region SaaS where users mostly write to their nearest region and those writes rarely contend. For traditional OLTP, avoid it unless you have a strong reason.
Leaderless replication
There is no leader. Clients send writes to multiple replicas in parallel. Reads also fan out to multiple replicas. Quorums decide what "success" means.
flowchart TD
W[Client writes] -->|"to W replicas"| R1[(Replica 1)]
W --> R2[(Replica 2)]
W --> R3[(Replica 3)]
R[Client reads] -->|"from R replicas"| R1
R --> R2
R --> R3
style R1 fill:#0e7490,color:#fff
style R2 fill:#0e7490,color:#fff
style R3 fill:#0e7490,color:#fff
This is the design from Amazon's 2007 Dynamo paper (DeCandia et al.), later adopted by Cassandra, Riak, and ScyllaDB. Note that the commercial Amazon DynamoDB product, launched in 2012, uses single-leader replication internally per partition and is a distinct system from the academic Dynamo paper.
The quorum formula
With N replicas, W required for a successful write, R required for a successful read:
If W + R > N, every read overlaps with at least one writer → strong-ish consistency.
Common configurations:
| N | W | R | Behavior |
|---|---|---|---|
| 3 | 2 | 2 | "Strong" quorum — the typical Cassandra configuration (and the Dynamo paper default) |
| 3 | 1 | 1 | Fast, eventually consistent |
| 3 | 3 | 1 | All-write, any-read. Slow writes, fast reads |
| 3 | 1 | 3 | Fast writes, slow reads |
One important nuance: W + R > N is necessary but not sufficient for linearizability. Even with this formula satisfied, two concurrent reads can still return different versions if writes are in-flight, because the quorum sets can overlap on a stale replica. Systems like Cassandra that implement quorum over async replication offer read-your-writes and monotonic reads in the common case, but not full linearizability. For linearizable leaderless reads, you need a separate protocol (e.g., Paxos read or conditional-write CAS).
Read repair and anti-entropy
Two background mechanisms keep replicas in sync over time:
- Read repair: when a read sees inconsistent replicas, write the newest value back to stale ones.
- Anti-entropy: periodic Merkle-tree comparison between replicas to find and fix divergences.
sequenceDiagram
actor C as Client
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
C->>R1: read K
C->>R2: read K
C->>R3: read K
R1-->>C: v=5 (old)
R2-->>C: v=7 (new)
R3-->>C: v=7
C->>R1: write back v=7 (repair)
Conflict resolution: vector clocks vs LWW
Two writes might disagree about which is "newer."
LWW with timestamps is simple and loses concurrent updates silently. It's Cassandra's default, which means you need to think carefully before enabling it for anything where concurrent writes matter.
Vector clocks tag each write with a (replica_id, counter) pair. Reads can see "these two versions are concurrent — the application needs to resolve them." Riak used vector clocks; since Riak 2.0 it recommends Dotted Version Vectors (DVVs) instead, which produce far fewer spurious siblings.
CRDTs — the same conflict-free data types mentioned above — sidestep the resolution problem entirely. Riak, Redis CRDTs, and distributed counters all use this approach.
Sloppy quorums and hinted handoff
What if a write can't reach W of the canonical replicas? You have two options. You can fail (strict quorum), which gives you correctness. Or you can use a sloppy quorum: write to any W available nodes, and those temporary recipients hand off to the rightful replica later (hinted handoff). This gives you availability at the cost of a window where readers looking at the canonical replicas might miss recent writes.
Leaderless wins when you need massive write throughput, high availability without failover complexity, geo-distribution, or workloads where eventual consistency is genuinely acceptable. It struggles with strong-consistency operations — money, counters with exact totals — and with workloads where the same key is frequently written from multiple nodes simultaneously.
Sync vs async: the unavoidable trade-off
Across all three models, every replication has a knob between "wait for acknowledgment" and "fire and forget."
flowchart LR
SYNC[Synchronous<br/>wait for replicas] --> SLOW[Slower writes]
SYNC --> DUR[Stronger durability]
SYNC --> BLOCK[Blocks if replicas slow]
ASYNC[Asynchronous<br/>fire-and-forget] --> FAST[Faster writes]
ASYNC --> LOSS[Possible data loss on failover]
ASYNC --> LAG[Replication lag]
style SYNC fill:#ff6b1a,color:#0a0a0f
style ASYNC fill:#0e7490,color:#fff
Production HA setups almost always settle on semi-sync: one replica acknowledges before commit, the rest catch up async. It's the best of both within reason — you get a known-safe failover candidate without blocking on the full replication fan-out. (Out of the box, most databases including PostgreSQL default to fully async; semi-sync is opt-in via configuration.)
Replication topologies
In single-leader and multi-leader setups, the physical shape of the replication graph also matters.
flowchart LR
subgraph Star
LS[(Leader)] --> SA[(R1)]
LS --> SB[(R2)]
LS --> SC[(R3)]
end
subgraph Chain
LC[(Leader)] --> CA[(R1)] --> CB[(R2)] --> CC[(R3)]
end
subgraph AllToAll
AA[(Node A)] <--> AB[(Node B)]
AB <--> AC[(Node C)]
AA <--> AC
end
style LS fill:#ff6b1a,color:#0a0a0f
style LC fill:#ffaa00,color:#0a0a0f
style AA fill:#0e7490,color:#fff
style AB fill:#0e7490,color:#fff
style AC fill:#0e7490,color:#fff
| Topology | Description | Use |
|---|---|---|
| Star | One leader, many followers | Default single-leader |
| Chain | Leader → R1 → R2 → R3 | Lower leader bandwidth, higher latency to last replica |
| All-to-all | Every node connects to every other | Multi-leader Galera; write latency grows with each added node (typically deployed as 3–5 nodes) |
| Tree | Cascaded fan-out | Wide read replicas (e.g., 50 read replicas) |
Operational pitfalls
Replication lag spirals
A burst of writes makes followers fall behind. Reads start hitting stale data. Apps routing reads to followers start showing inconsistencies. The fix is usually capacity — more or faster followers — not configuration tweaks.
Long-running transactions on the leader
A 10-minute transaction holds locks; replication backs up; followers fall further behind. Avoid large batched writes in a single transaction.
Adding a replica is expensive
Bootstrapping a new replica means copying the entire database. For TB-scale, this takes hours and stresses the network. Plan ahead; don't wait for a crisis.
Failover testing
Most teams never test failover. The first real failover is the production incident. Practice it. Quarterly.
Schema migrations under replication
Some DDL operations replicate poorly. An ALTER TABLE that takes 30 minutes on the leader takes 30 minutes per follower, serially. Use pt-online-schema-change, gh-ost, or Postgres logical replication tricks to avoid blocking the replica stream.
Worked example: Postgres + read replicas
A common starting setup:
flowchart LR
APP[App] -->|writes| L[(Postgres Primary<br/>db.r6id.4xlarge)]
L -->|stream WAL async| R1[(Replica EU)]
L -->|stream WAL sync| R2[(Replica US-W)]
APP -->|reads| L
APP -->|reads| R1
APP -->|reads| R2
style L fill:#ff6b1a,color:#0a0a0f
One sync replica gives you a durable failover candidate — if the primary dies, this replica is guaranteed to be current. The async replicas serve reads, especially region-local ones. If read traffic is 5× the write traffic, this setup absorbs that without adding any write load.
When this stops working, it's almost always because writes have maxed out the primary box. At that point, you shard (sharding article) or move to distributed SQL (CockroachDB, Spanner).
How to talk about replication in an interview
A strong answer has four ingredients:
- Pick a model: single-leader / multi-leader / leaderless, and say why.
- Pick sync mode: sync, async, semi-sync; justify the choice for your durability and latency requirements.
- Talk about failover: how a new leader is elected, what writes might be lost in the gap.
- Talk about reads: which clients might see stale data and how you route around it.
If you only memorize one default: single-leader, semi-sync, with sync to one replica for failover safety. Defend it with: "Strong consistency on writes; one machine of write throughput is enough for most OLTP; we trade slightly higher write latency for a safe, predictable failover."
Things you should now be able to answer
- What does "replication lag" mean and what bugs does it cause?
- Why is multi-leader replication operationally so painful?
- What does
W + R > Ngive you in a leaderless system? - Why is async replication a durability risk during failover?
- When would you pick leaderless over single-leader?
Further reading
- Designing Data-Intensive Applications by Martin Kleppmann (chapter 5)
- DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (2007)
- Verbitski et al., "Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases" — SIGMOD 2017
- CAP theorem deep dive on this site
Frequently asked questions
▸What is the difference between single-leader, multi-leader, and leaderless replication?
In single-leader replication one node handles all writes and streams its write-ahead log to followers — used by Postgres, MySQL, MongoDB, and Kafka partitions. Multi-leader allows multiple nodes to accept writes, primarily for multi-region active-active setups, but requires conflict resolution. Leaderless sends writes to multiple replicas in parallel and uses a quorum formula (W + R > N) to determine success, as in Cassandra and Riak (the original Dynamo paper design). Note that Amazon DynamoDB, despite the name, uses single-leader replication per partition internally.
▸What does W + R > N guarantee in a leaderless replication system?
When the number of write acknowledgments (W) plus read acknowledgments (R) exceeds the total replica count (N), every read is guaranteed to overlap with at least one replica that has the latest write. The typical Cassandra and Riak (Dynamo-paper-style) configuration is N=3, W=2, R=2. This gives strong-ish consistency but not full linearizability — two concurrent reads can still return different versions if writes are in-flight, because quorum sets can overlap on a stale replica.
▸What is replication lag and what bugs does it cause?
Replication lag is the delay between a write reaching the leader and that write becoming visible on async followers — usually milliseconds, but seconds under load and minutes under crisis. The classic symptom is a user posting a comment, immediately reloading the page, and seeing their comment missing because the read hit a follower that had not yet caught up.
▸When should I use semi-synchronous replication instead of fully synchronous or fully asynchronous?
Semi-sync — waiting for exactly one follower to acknowledge before returning to the client and letting the rest catch up asynchronously — is the recommended HA pattern because it gives a known-safe failover candidate without blocking on the full replication fan-out. Fully synchronous blocks if any follower is slow or dead; fully async risks losing the last few writes if the leader crashes before replication completes. PostgreSQL defaults to fully async and requires setting synchronous_standby_names to enable semi-sync.
▸When should I choose leaderless replication over single-leader?
Leaderless wins when you need massive write throughput, high availability without failover complexity, or geo-distribution where eventual consistency is acceptable. It struggles with strong-consistency operations such as financial counters requiring exact totals, and with workloads where the same key is frequently written concurrently from multiple nodes.
You may also like
Design a Shopping Cart & Checkout System
Keep a cart consistent across devices, then check out without overselling or double-charging. The available-cart vs consistent-checkout split, inventory holds, and the order saga.
Design a Social Graph Service (Facebook's TAO)
Serve billions of "who follows whom" reads over a graph of trillions of edges. The objects-and-associations model, a cache in front of sharded SQL, and the hot-vertex problem.
Design an Authorization System (Google Zanzibar / RBAC / ReBAC)
Answer "can user U do action A on resource R?" globally, in milliseconds, consistently. RBAC vs ABAC vs ReBAC, Zanzibar relation tuples, and the new-enemy problem.