~/articles/database-replication

◆◆Intermediateasked at Postgresasked at Cassandraasked at DynamoDB

Database Replication

Single-leader, multi-leader, and leaderless replication. Sync vs async, replication lag, conflict resolution, and how each model trades availability for consistency.

14 min read2026-02-26Ironclad Academy

#databases #distributed-systems #replication #consistency

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

Replication is keeping copies of the same data on multiple machines. The reason is usually one of three things: survive failure, serve more reads, or put data near users. The shape of the replication — how many writers, how syncing happens, who resolves conflicts — is one of the most consequential decisions in any distributed system.

This article walks through the three families (single-leader, multi-leader, leaderless), the sync/async trade-off that runs through all of them, and the operational pain you sign up for with each.

Why replicate

Before picking a model, it's worth being precise about why you're replicating — because the reason shapes every trade-off that follows.

If you're replicating for high availability, you need your failover path to be fast and clean. If you're replicating to scale reads, you can often tolerate stale data. If you're replicating for geo-locality — putting data physically close to the users who read it — you almost certainly need multiple write regions too, which opens a whole different can of worms.

flowchart TD
    R[Why replicate?] --> HA[High availability<br/>survive node failure]
    R --> RP[Read scaling<br/>more replicas = more reads]
    R --> GEO[Geo locality<br/>data near users]
    style HA fill:#ff6b1a,color:#0a0a0f
    style RP fill:#15803d,color:#fff
    style GEO fill:#0e7490,color:#fff

Replicating reads is the easy part — you just route to any copy. Coordinating writes across copies is where every interesting design choice lives.

Three replication models

There are three families worth knowing. Single-leader handles the overwhelming majority of real-world OLTP databases. Multi-leader is a specialized tool for multi-region active-active. Leaderless is the model Amazon's 2007 Dynamo paper popularized, and what Cassandra, Riak, and ScyllaDB are built on.

flowchart TD
    M[Replication models] --> S[Single-leader<br/>one writer]
    M --> MU[Multi-leader<br/>many writers]
    M --> L[Leaderless<br/>quorum]
    S --> PG[Postgres, MySQL,<br/>MongoDB, most SQL]
    MU --> MM[Galera, BDR, multi-region<br/>multi-master MySQL]
    L --> DY[Dynamo paper, Cassandra,<br/>Riak, ScyllaDB]
    style S fill:#ff6b1a,color:#0a0a0f
    style MU fill:#ffaa00,color:#0a0a0f
    style L fill:#0e7490,color:#fff

Single-leader (a.k.a. primary–replica, a.k.a. master–slave)

One node handles all writes. Reads can go anywhere. The leader streams its write-ahead log to followers.

flowchart LR
    C[Client writes] --> L[(Leader)]
    L -->|WAL stream| F1[(Follower 1)]
    L -->|WAL stream| F2[(Follower 2)]
    C2[Client reads] --> F1
    C3[Client reads] --> F2
    C4[Client reads] --> L
    style L fill:#ff6b1a,color:#0a0a0f

This is what 80% of production databases use. Postgres, MySQL, MongoDB, SQL Server, Oracle, Kafka brokers within a partition.

Synchronous vs asynchronous

The leader can ship the write to followers in two ways:

Mode	Behavior	Trade-off
Sync	Wait for followers to ack before returning to client	Strong durability; slow; fails if a follower is down
Async	Return to client immediately; followers catch up	Fast; risk losing the last few writes on leader crash
Semi-sync	Wait for at least one follower, then return	Practical middle ground

Postgres lets you mix: certain followers are sync, others async. The synchronous ones become candidates for failover.

Replication lag

Async followers are behind by some amount of time, called replication lag. Usually milliseconds. Under load: seconds. Under crisis: minutes or worse.

This is where users hit "post a comment, immediately reload, my comment is gone" bugs.

sequenceDiagram
    actor U as User
    participant L as Leader
    participant F as Follower
    U->>L: POST comment
    L-->>U: 200 OK
    Note over L,F: ...replication lag...
    U->>F: GET comments
    F-->>U: comments without yours!
    Note over U: 😡

The fix isn't to always read from the leader — that defeats the whole point of having followers. Instead, you route selectively:

Read-your-own-writes: send a user's reads to the leader for a short window after they write something.
Monotonic reads: pin a user's reads to one follower for the duration of a session, so at least they see a consistent history.
Versioned reads: the client passes the timestamp of its last write; the gateway routes to a follower that's at least that fresh.

flowchart TD
    U[User makes a write] --> L[(Leader<br/>write succeeds)]
    L --> Q{Did user just write?}
    Q -->|Yes, within window| RL[Route reads to Leader]
    Q -->|No| RF[Route reads to Follower]
    RL --> RES[Consistent result]
    RF --> RES2[Possibly stale but acceptable]
    style L fill:#ff6b1a,color:#0a0a0f
    style RL fill:#15803d,color:#fff
    style RF fill:#0e7490,color:#fff

Failover — the hard part

When the leader dies, some follower must become the new leader. This is failover, and it's where single-leader systems earn their complexity budget.

flowchart LR
    L1[(Leader<br/>💀)] -.dies.-> L2{Promote<br/>which follower?}
    L2 -->|"the one closest to caught up"| NEW[(New Leader)]
    F1[(Follower 1)] --> L2
    F2[(Follower 2)] --> L2
    style L1 fill:#ff2e88,color:#fff
    style NEW fill:#15803d,color:#fff

Three things routinely go wrong:

Lost writes. Async followers are behind. Anything not yet replicated is lost when you cut over. In steady state, async lag is typically tens to hundreds of milliseconds; under write load it can spike to seconds — so failover can silently discard that window of writes.

Split brain. The old leader thinks it's still leader; the new leader is also live; both accept writes. Different rows, same key, two answers. The mitigation is fencing (STONITH — "Shoot The Other Node In The Head") and epoch/generation numbers — each leader is assigned a monotonically increasing epoch (Raft calls it a "term"; Paxos calls it a "ballot number"); followers reject requests from any leader whose epoch is lower than the current election result.

Stale clients. Apps cache the old leader's address. Short-lived DNS TTLs (30–60 s) or a proxy (PgBouncer, ProxySQL, RDS Proxy) that tracks the current leader address keep clients from hammering a dead node.

Modern systems handle failover with consensus (Raft, Paxos) to elect leaders safely. See the leader election article.

Single-leader is the right choice when you need strong consistency on writes in one region, your write QPS fits one machine, and you can tolerate brief downtime during failover — which is to say, it's the right choice for almost every OLTP database.

Multi-leader replication

Multiple nodes accept writes; each replicates to the others. The main use case is multi-region active-active — you want users in Tokyo to write to a Tokyo node at low latency, not wait for a round-trip to US-East.

flowchart LR
    U1[Users US] --> L1[(Leader US)]
    U2[Users EU] --> L2[(Leader EU)]
    U3[Users AP] --> L3[(Leader AP)]
    L1 <-->|async| L2
    L2 <-->|async| L3
    L1 <-->|async| L3
    style L1 fill:#ff6b1a,color:#0a0a0f
    style L2 fill:#ffaa00,color:#0a0a0f
    style L3 fill:#0e7490,color:#fff

Tokyo writes go to the Tokyo leader (low latency), and the write propagates to US and EU asynchronously. The latency win is real and significant. The price you pay is conflicts.

The brutal problem: conflicts

Two leaders write to the same row at the same time. Now you have two histories to merge.

sequenceDiagram
    actor A as Alice (US)
    actor B as Bob (EU)
    participant L1 as US Leader
    participant L2 as EU Leader
    A->>L1: SET name = 'Carol'
    B->>L2: SET name = 'Dave'
    L1->>L2: replicate name='Carol'
    L2->>L1: replicate name='Dave'
    Note over L1,L2: Both reach inconsistent state

There are four strategies, and none of them are free.

The simplest is last-write-wins (LWW): pick the version with the later timestamp. Simple to implement, but lossy — one update is silently dropped. It works fine for truly independent facts; it's terrible for anything where two concurrent writes are both supposed to survive. From LWW you can step up to application-defined merge, which gives the app both versions and lets it decide — the app sees name='Carol' and name='Dave' and picks one. This works well when the app knows what "merge" means for its domain (e.g., taking a set-union of tags).

For data types where the merge is always well-defined, CRDTs (conflict-free replicated data types) sidestep the resolution problem entirely. CRDTs are data structures where the merge function is commutative and associative — so order doesn't matter and there's nothing to resolve. Counters, sets, and maps all have CRDT implementations. If your data fits the model, this is the cleanest option.

When none of the above applies, manual resolution surfaces both conflicting versions to a user or admin. Dropbox's "conflicted copy" files are a familiar example.

Most engineers underestimate how often conflicts happen and how painful resolution is. Multi-leader is the right tool for offline-first mobile apps (where every device is effectively a leader) and for multi-region SaaS where users mostly write to their nearest region and those writes rarely contend. For traditional OLTP, avoid it unless you have a strong reason.

Leaderless replication

There is no leader. Clients send writes to multiple replicas in parallel. Reads also fan out to multiple replicas. Quorums decide what "success" means.

flowchart TD
    W[Client writes] -->|"to W replicas"| R1[(Replica 1)]
    W --> R2[(Replica 2)]
    W --> R3[(Replica 3)]
    R[Client reads] -->|"from R replicas"| R1
    R --> R2
    R --> R3
    style R1 fill:#0e7490,color:#fff
    style R2 fill:#0e7490,color:#fff
    style R3 fill:#0e7490,color:#fff

This is the design from Amazon's 2007 Dynamo paper (DeCandia et al.), later adopted by Cassandra, Riak, and ScyllaDB. Note that the commercial Amazon DynamoDB product, launched in 2012, uses single-leader replication internally per partition and is a distinct system from the academic Dynamo paper.

The quorum formula

With N replicas, W required for a successful write, R required for a successful read:

If W + R > N, every read overlaps with at least one writer → strong-ish consistency.

Common configurations:

N	W	R	Behavior
3	2	2	"Strong" quorum — the typical Cassandra configuration (and the Dynamo paper default)
3	1	1	Fast, eventually consistent
3	3	1	All-write, any-read. Slow writes, fast reads
3	1	3	Fast writes, slow reads

One important nuance: W + R > N is necessary but not sufficient for linearizability. Even with this formula satisfied, two concurrent reads can still return different versions if writes are in-flight, because the quorum sets can overlap on a stale replica. Systems like Cassandra that implement quorum over async replication offer read-your-writes and monotonic reads in the common case, but not full linearizability. For linearizable leaderless reads, you need a separate protocol (e.g., Paxos read or conditional-write CAS).

Read repair and anti-entropy

Two background mechanisms keep replicas in sync over time:

Read repair: when a read sees inconsistent replicas, write the newest value back to stale ones.
Anti-entropy: periodic Merkle-tree comparison between replicas to find and fix divergences.

sequenceDiagram
    actor C as Client
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    C->>R1: read K
    C->>R2: read K
    C->>R3: read K
    R1-->>C: v=5 (old)
    R2-->>C: v=7 (new)
    R3-->>C: v=7
    C->>R1: write back v=7 (repair)

Conflict resolution: vector clocks vs LWW

Two writes might disagree about which is "newer."

LWW with timestamps is simple and loses concurrent updates silently. It's Cassandra's default, which means you need to think carefully before enabling it for anything where concurrent writes matter.

Vector clocks tag each write with a (replica_id, counter) pair. Reads can see "these two versions are concurrent — the application needs to resolve them." Riak used vector clocks; since Riak 2.0 it recommends Dotted Version Vectors (DVVs) instead, which produce far fewer spurious siblings.

CRDTs — the same conflict-free data types mentioned above — sidestep the resolution problem entirely. Riak, Redis CRDTs, and distributed counters all use this approach.

Sloppy quorums and hinted handoff

What if a write can't reach W of the canonical replicas? You have two options. You can fail (strict quorum), which gives you correctness. Or you can use a sloppy quorum: write to any W available nodes, and those temporary recipients hand off to the rightful replica later (hinted handoff). This gives you availability at the cost of a window where readers looking at the canonical replicas might miss recent writes.

Leaderless wins when you need massive write throughput, high availability without failover complexity, geo-distribution, or workloads where eventual consistency is genuinely acceptable. It struggles with strong-consistency operations — money, counters with exact totals — and with workloads where the same key is frequently written from multiple nodes simultaneously.

Sync vs async: the unavoidable trade-off

Across all three models, every replication has a knob between "wait for acknowledgment" and "fire and forget."

flowchart LR
    SYNC[Synchronous<br/>wait for replicas] --> SLOW[Slower writes]
    SYNC --> DUR[Stronger durability]
    SYNC --> BLOCK[Blocks if replicas slow]
    ASYNC[Asynchronous<br/>fire-and-forget] --> FAST[Faster writes]
    ASYNC --> LOSS[Possible data loss on failover]
    ASYNC --> LAG[Replication lag]
    style SYNC fill:#ff6b1a,color:#0a0a0f
    style ASYNC fill:#0e7490,color:#fff

Production HA setups almost always settle on semi-sync: one replica acknowledges before commit, the rest catch up async. It's the best of both within reason — you get a known-safe failover candidate without blocking on the full replication fan-out. (Out of the box, most databases including PostgreSQL default to fully async; semi-sync is opt-in via configuration.)

Replication topologies

In single-leader and multi-leader setups, the physical shape of the replication graph also matters.

flowchart LR
    subgraph Star
        LS[(Leader)] --> SA[(R1)]
        LS --> SB[(R2)]
        LS --> SC[(R3)]
    end
    subgraph Chain
        LC[(Leader)] --> CA[(R1)] --> CB[(R2)] --> CC[(R3)]
    end
    subgraph AllToAll
        AA[(Node A)] <--> AB[(Node B)]
        AB <--> AC[(Node C)]
        AA <--> AC
    end
    style LS fill:#ff6b1a,color:#0a0a0f
    style LC fill:#ffaa00,color:#0a0a0f
    style AA fill:#0e7490,color:#fff
    style AB fill:#0e7490,color:#fff
    style AC fill:#0e7490,color:#fff

Topology	Description	Use
Star	One leader, many followers	Default single-leader
Chain	Leader → R1 → R2 → R3	Lower leader bandwidth, higher latency to last replica
All-to-all	Every node connects to every other	Multi-leader Galera; write latency grows with each added node (typically deployed as 3–5 nodes)
Tree	Cascaded fan-out	Wide read replicas (e.g., 50 read replicas)

Operational pitfalls

Replication lag spirals

A burst of writes makes followers fall behind. Reads start hitting stale data. Apps routing reads to followers start showing inconsistencies. The fix is usually capacity — more or faster followers — not configuration tweaks.

Long-running transactions on the leader

A 10-minute transaction holds locks; replication backs up; followers fall further behind. Avoid large batched writes in a single transaction.

Adding a replica is expensive

Bootstrapping a new replica means copying the entire database. For TB-scale, this takes hours and stresses the network. Plan ahead; don't wait for a crisis.

Failover testing

Most teams never test failover. The first real failover is the production incident. Practice it. Quarterly.

Schema migrations under replication

Some DDL operations replicate poorly. An ALTER TABLE that takes 30 minutes on the leader takes 30 minutes per follower, serially. Use pt-online-schema-change, gh-ost, or Postgres logical replication tricks to avoid blocking the replica stream.

Worked example: Postgres + read replicas

A common starting setup:

flowchart LR
    APP[App] -->|writes| L[(Postgres Primary<br/>db.r6id.4xlarge)]
    L -->|stream WAL async| R1[(Replica EU)]
    L -->|stream WAL sync| R2[(Replica US-W)]
    APP -->|reads| L
    APP -->|reads| R1
    APP -->|reads| R2
    style L fill:#ff6b1a,color:#0a0a0f

One sync replica gives you a durable failover candidate — if the primary dies, this replica is guaranteed to be current. The async replicas serve reads, especially region-local ones. If read traffic is 5× the write traffic, this setup absorbs that without adding any write load.

When this stops working, it's almost always because writes have maxed out the primary box. At that point, you shard (sharding article) or move to distributed SQL (CockroachDB, Spanner).

How to talk about replication in an interview

A strong answer has four ingredients:

Pick a model: single-leader / multi-leader / leaderless, and say why.
Pick sync mode: sync, async, semi-sync; justify the choice for your durability and latency requirements.
Talk about failover: how a new leader is elected, what writes might be lost in the gap.
Talk about reads: which clients might see stale data and how you route around it.

If you only memorize one default: single-leader, semi-sync, with sync to one replica for failover safety. Defend it with: "Strong consistency on writes; one machine of write throughput is enough for most OLTP; we trade slightly higher write latency for a safe, predictable failover."

Things you should now be able to answer

What does "replication lag" mean and what bugs does it cause?
Why is multi-leader replication operationally so painful?
What does W + R > N give you in a leaderless system?
Why is async replication a durability risk during failover?
When would you pick leaderless over single-leader?

Frequently asked questions

▸What is the difference between single-leader, multi-leader, and leaderless replication?

In single-leader replication one node handles all writes and streams its write-ahead log to followers — used by Postgres, MySQL, MongoDB, and Kafka partitions. Multi-leader allows multiple nodes to accept writes, primarily for multi-region active-active setups, but requires conflict resolution. Leaderless sends writes to multiple replicas in parallel and uses a quorum formula (W + R > N) to determine success, as in Cassandra and Riak (the original Dynamo paper design). Note that Amazon DynamoDB, despite the name, uses single-leader replication per partition internally.

▸What does W + R > N guarantee in a leaderless replication system?

When the number of write acknowledgments (W) plus read acknowledgments (R) exceeds the total replica count (N), every read is guaranteed to overlap with at least one replica that has the latest write. The typical Cassandra and Riak (Dynamo-paper-style) configuration is N=3, W=2, R=2. This gives strong-ish consistency but not full linearizability — two concurrent reads can still return different versions if writes are in-flight, because quorum sets can overlap on a stale replica.

▸What is replication lag and what bugs does it cause?

Replication lag is the delay between a write reaching the leader and that write becoming visible on async followers — usually milliseconds, but seconds under load and minutes under crisis. The classic symptom is a user posting a comment, immediately reloading the page, and seeing their comment missing because the read hit a follower that had not yet caught up.

▸When should I use semi-synchronous replication instead of fully synchronous or fully asynchronous?

Semi-sync — waiting for exactly one follower to acknowledge before returning to the client and letting the rest catch up asynchronously — is the recommended HA pattern because it gives a known-safe failover candidate without blocking on the full replication fan-out. Fully synchronous blocks if any follower is slow or dead; fully async risks losing the last few writes if the leader crashes before replication completes. PostgreSQL defaults to fully async and requires setting synchronous_standby_names to enable semi-sync.

▸When should I choose leaderless replication over single-leader?

Leaderless wins when you need massive write throughput, high availability without failover complexity, or geo-distribution where eventual consistency is acceptable. It struggles with strong-consistency operations such as financial counters requiring exact totals, and with workloads where the same key is frequently written concurrently from multiple nodes.

← previous

Leader Election and Consensus (Raft, Paxos)

Bloom Filters

// RELATED