CAP, Consistency, and Replication
CAP and PACELC, consistency models from linearizable to eventual, replication strategies, quorums, partitioning, consensus (Raft, Paxos), CRDTs, and 2PC.
This is where distributed systems get genuinely hard. Get the vocabulary right and you'll understand why every distributed database makes the trade-offs it does. Get it wrong and you'll spend weeks chasing "ghost" bugs that turn out to be replication lag.
The CAP theorem
Stated by Eric Brewer in 2000, proven by Gilbert and Lynch in 2002:
Any distributed data store can provide at most two of the following three guarantees:
- Consistency: every read sees the most recent write.
- Availability: every request receives a (non-error) response.
- Partition tolerance: the system continues to operate despite network failures between nodes.
Network partitions happen — they are a fact of physics on the public internet, even within a single data center. So you don't really get to choose whether to be P or not. The real choice is what you do when a partition arrives:
flowchart TD
P[Partition occurs]
P --> CHOICE{What do you do?}
CHOICE -->|"choose Consistency<br/>(refuse some requests)"| CP[CP system]
CHOICE -->|"choose Availability<br/>(answer with stale data)"| AP[AP system]
CP --> CPEX[HBase, MongoDB, Spanner]
AP --> APEX[Cassandra, DynamoDB, Riak]
style CP fill:#ff6b1a,color:#0a0a0f
style AP fill:#15803d,color:#fff
A CP system refuses to serve requests on the minority side of a partition rather than risk serving stale data. An AP system keeps both sides running and accepts that they may diverge — you'll reconcile later.
CAP isn't binary
In practice, "consistency" lives on a spectrum, and even within one product the level is tunable. Cassandra lets you set consistency_level=ONE (very available, weakly consistent) or QUORUM (more consistent, less available) per query.
What CAP doesn't tell you
CAP describes behavior only during a partition. Most of the time there is no partition. So CAP is silent about the trade-offs you face under normal operation — which, in practice, dominate every real design conversation.
PACELC: the better model
CAP only describes behavior during a partition. But even when the network is healthy, you still face a trade-off:
If a Partition occurs, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
flowchart TD
NORMAL[Normal operation:<br/>EL or EC]
PARTITION[Partition:<br/>PA or PC]
NORMAL --> EL[EL: low Latency<br/>DynamoDB, Cassandra]
NORMAL --> EC[EC: strong Consistency<br/>Spanner, etcd]
PARTITION --> PA[PA: stay Available<br/>DynamoDB, Cassandra]
PARTITION --> PC[PC: stay Consistent<br/>Spanner, etcd, HBase]
style EL fill:#15803d,color:#fff
style EC fill:#ff6b1a,color:#0a0a0f
Doing strong consistency under normal conditions costs you latency — consensus requires at least one round-trip to a majority of nodes. DynamoDB defaults to "PA + EL": available and low-latency, accepting eventual consistency. Spanner is "PC + EC": consistent both during partitions and normally, paying the latency cost.
This is the four-letter framework you should reach for, not just CAP.
Consistency models, ranked
From strongest (slowest, hardest to implement) to weakest (fastest, easiest):
| Model | Guarantee | Example |
|---|---|---|
| Strict / Linearizable | Looks like a single machine; reads see the latest write globally | Spanner with TrueTime, etcd, ZooKeeper |
| Sequential | All nodes see operations in the same order, but order may not match real time | Some custom systems |
| Causal | If A happens-before B, everyone sees A before B | Riak, MongoDB causal sessions |
| Read-your-writes | A user always sees their own writes | Many DBs with sticky reads |
| Monotonic reads | Once you see a value, you won't go back to an older one | Most replicated DBs with session pinning |
| Eventual | If writes stop, reads will eventually see the latest | DynamoDB default, Cassandra default, S3 cross-region |
Most apps don't need linearizable. Most apps need read-your-writes (so the user sees their own post immediately) plus eventual elsewhere.
Linearizability vs serializability
Easy to confuse. Different things:
- Linearizability is about single-object ordering. There's a real-time order on operations on each object.
- Serializability is about multi-object transactions. Transactions execute as if they ran one at a time.
Spanner is both. Postgres' default isolation level is read committed, not serializable — serializable requires an explicit SET or per-transaction override. And even with serializable transactions, Postgres is not linearizable across replicas.
Why eventual consistency is fine, mostly
Imagine a tweet. You post it; your follower reads it 2 seconds later because the write hadn't propagated yet. Is that a bug? No — it's a 2-second delay you'd never notice. Eventual consistency works for tweets, posts, news feeds, search indexes, view counts.
It does not work for: bank balances at the moment of a transaction, inventory at checkout, seat reservations, or anything where two simultaneous decisions could conflict in costly ways.
The "session guarantees" sweet spot
For most user-facing apps, the right level is session consistency: within one user's session, they see read-your-writes and monotonic reads. Across users, eventual is fine.
The implementation is lightweight: pin the user to one replica per session (sticky reads), or carry a "version" token they expect to see and read from any replica that's caught up to that version.
flowchart LR
W[User writes post] --> P[Primary]
P -->|"replicate"| R1[Replica 1\nv=45]
P -->|"replicate"| R2[Replica 2\nv=38 — skipped]
W -->|"session token: version=42"| C[Next read]
C -->|"read from replica that has v>=42"| R1
R1 -->|"v=45 — ok"| C
style P fill:#ff6b1a,color:#0a0a0f
style R1 fill:#15803d,color:#fff
style R2 fill:#4b5563,color:#fff
style C fill:#0e7490,color:#fff
Cheap to implement: just pass a version number in the session cookie. Readers route to any replica that's sufficiently caught up.
Replication
Copying data across multiple nodes serves three goals: durability (data survives a single failure), availability (a replica can serve requests when the primary is sick), and read scaling (more nodes share the read load).
There are three architectures, each with different complexity and trade-off profiles.
Single-leader (primary-replica)
flowchart LR
W[Writes] --> P[Primary]
P -->|replicate| R1[Replica 1]
P -->|replicate| R2[Replica 2]
P -->|replicate| R3[Replica 3]
R1 --> RD[Reads]
R2 --> RD
R3 --> RD
P --> RD
style P fill:#ff6b1a,color:#0a0a0f
style R1 fill:#0e7490,color:#fff
All writes go to one primary. The primary streams its log to replicas. Reads can be served from anywhere.
How fast is the replication? That depends on what you can live with. Synchronous replication waits for at least one replica to confirm before the write returns — safer, but the write latency is now bounded by the replica's round-trip. Asynchronous replication returns immediately and lets replicas catch up in the background — faster, but if the primary dies before a replica gets the last few writes, those writes are gone. Most production systems use semi-sync: wait for one replica to confirm, let the rest catch up asynchronously. You get durability without paying for N round-trips on every write.
Failure handling: if the primary dies, an election promotes a replica. This is what tools like Patroni (Postgres) or Sentinel (Redis) automate. Replication lag — the gap between a write landing on the primary and being visible on a replica — matters when users need to read their own writes. If you write then immediately read from a lagging replica, your own change may have vanished.
The split-brain problem
A network partition isolates the primary from its replicas. Replicas can't tell whether the primary is truly dead or just temporarily unreachable. If they elect a new primary while the old one is still serving writes, you now have two primaries accepting conflicting writes — and reconciling them is painful.
sequenceDiagram
participant Old as Old Primary
participant N as Network
participant New as New Primary
participant W as Write Client
participant S as Storage
Old->>N: heartbeat (drops)
N-->>Old: (no response)
Note over N: Partition
New->>New: election — majority vote
New-->>W: I am the leader (token=2)
Old-->>W: I am the leader (token=1, stale)
Note over Old,New: Two primaries accepting writes — split-brain
W->>S: write (token=2)
S-->>W: accepted
Old->>S: write (token=1)
S-->>Old: rejected — stale token
Three mitigations:
- Quorum-based promotion (majority vote): a replica can't become primary unless more than N/2 nodes agree.
- STONITH ("Shoot The Other Node In The Head"): the new primary forcibly fences the old one — revokes its IP, powers it off via BMC — so it literally cannot accept writes.
- Fencing tokens: writes carry a monotonically-increasing generation number; the storage layer rejects any write carrying a stale token.
Split-brain is the classic distributed-database nightmare. Modern databases solve it by building consensus (Raft) directly into their replication protocol.
Multi-leader
Two or more nodes accept writes and sync to each other.
flowchart LR
W1[Writes US] <--> P1[Leader US]
W2[Writes EU] <--> P2[Leader EU]
P1 <-->|"async sync"| P2
style P1 fill:#ff6b1a,color:#0a0a0f
style P2 fill:#ff6b1a,color:#0a0a0f
This pattern shows up in multi-region active-active deployments: you want users in the US and EU to both write to a nearby node with low latency, then have the regions sync in the background. The cost is write conflicts — if two regions update the same row at the same time, something has to decide who wins.
Conflict resolution strategies:
| Strategy | How | Trade-off |
|---|---|---|
| Last-write-wins (LWW) | Compare timestamps; later wins | Lossy; clock skew matters |
| Per-field merge | Each field updated independently | Doesn't compose well |
| Application-level | Hand-written reconciliation | Most correct, most code |
| CRDTs | Math guarantees convergence | Limited data types |
Leaderless
Everyone's equal. Reads and writes go to multiple nodes; quorums (R + W > N) ensure consistency.
flowchart LR
C[Client\nW=2 acks needed] -->|write| N1[Node 1\nack]
C -->|write| N2[Node 2\nack]
C -->|write| N3[Node 3]
N1 -->|ack 1| C
N2 -->|ack 2 — write succeeds| C
style N1 fill:#15803d,color:#fff
style N2 fill:#15803d,color:#fff
style N3 fill:#4b5563,color:#fff
style C fill:#0e7490,color:#fff
Cassandra does this. The original Amazon Dynamo paper also described this model, but modern DynamoDB has evolved away from it — DynamoDB internally uses a leader-based Multi-Paxos protocol per partition, though it still exposes eventually consistent reads by default and strongly consistent reads on request. Riak is another pure leaderless example.
The parameters are tunable: R=1, W=N gives fast reads and slow writes; R=N, W=1 is the inverse; R+W > N gives quorum consistency. The math makes it work: if you write to W out of N nodes and read from R out of N nodes, and R + W > N, then at least one node in the read set must have seen the latest write — assuming no concurrent conflicting writes and no sloppy quorums. That overlap is how leaderless systems get consistency without a designated leader, though concurrent writes still require conflict resolution strategies like version vectors or LWW.
CRDTs (Conflict-free Replicated Data Types)
CRDTs are data structures designed so that any two replicas can be merged automatically, with the merge always producing the same result regardless of the order merges happen in.
flowchart LR
A["Replica A:<br/>set = {1, 2}"] -->|merge| M["Merged: {1, 2, 3, 4}"]
B["Replica B:<br/>set = {3, 4}"] -->|merge| M
style M fill:#15803d,color:#fff
A few concrete types:
- G-Counter (grow-only counter): each replica increments its own slot in a vector; merge takes the max per slot. The sum of all slots gives the global total. This is how like counts work at scale.
- PN-Counter: a G-Counter for increments plus a G-Counter for decrements. Subtract one from the other.
- G-Set (grow-only set): merge is union. Items can be added, never removed.
- OR-Set (observed-remove set): supports adds and removes with correct semantics when both happen concurrently.
- LWW-Register: last-write-wins with a monotonic tiebreaker.
CRDTs power Riak, Redis CRDTs, and the collaborative editing in Yjs and Automerge. (Figma uses a custom OT-influenced approach with a central server, not CRDTs.) The appeal is real: they remove the need for centralized coordination entirely, which means no coordinator to fail or contend on. Use them when the data type fits — you can't turn arbitrary structures into CRDTs, and the available types are a small family.
Partitioning (sharding)
Replication makes copies. Partitioning splits the data — no node has it all.
flowchart TD
DB[(All Data)] --> P1[(Shard 1<br/>users a-h)]
DB --> P2[(Shard 2<br/>users i-q)]
DB --> P3[(Shard 3<br/>users r-z)]
style DB fill:#ff6b1a,color:#0a0a0f
Strategies:
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Range | users a-h on shard 1, i-q on shard 2, r-z on shard 3 | Range queries efficient | Hot shards (e.g. all "a" usernames) |
| Hash | shard = hash(key) % N | Even distribution | No range queries, rebalancing is hard |
| Consistent hash | Hash onto a ring | Even + minimal reshuffling | Slightly more complex |
| Directory | Lookup table maps key → shard | Flexible | Lookup table is itself a bottleneck |
Most production systems use consistent hashing. See the Consistent Hashing article.
Cross-partition queries are expensive
Once data is sharded, "give me the top 10 users by post count" requires a query to every shard, then a merge. This is scatter-gather, and it's how distributed search engines (Elasticsearch) and OLAP databases (BigQuery) work — but it adds complexity and tail latency. Design your partition key so your most common queries touch one shard, not all of them.
Hot partitions
The biggest sharding pitfall. Even with hash sharding, if one key is responsible for most traffic (Justin Bieber's user_id), one shard is overloaded. Three mitigations:
- Random suffix on the key:
user:42-{0..9}— spread writes across 10 slots, read all 10 and aggregate. More writes, but the load is spread. - Caching the hot keys in front of the shard, so most reads never reach storage.
- Custom routing: detect hot keys operationally, fan out only those to dedicated capacity.
Consensus algorithms (Paxos, Raft)
How do a group of nodes agree on a single value — who's the leader, what's the next log entry — when some nodes may be failing or partitioned?
Raft is the modern, comprehensible answer. Paxos is the older, harder-to-understand one. Both are widely deployed: Raft in etcd, CockroachDB, and Consul; Paxos in Spanner and Chubby.
flowchart TD
L[Leader] -->|"AppendEntries"| F1[Follower 1]
L -->|"AppendEntries"| F2[Follower 2]
L -->|"AppendEntries"| F3[Follower 3]
L -->|"AppendEntries"| F4[Follower 4]
F1 -->|ack| L
F2 -->|ack| L
F3 -->|"ack — majority reached,<br/>entry committed"| L
style L fill:#ff6b1a,color:#0a0a0f
style F1 fill:#15803d,color:#fff
style F2 fill:#15803d,color:#fff
style F3 fill:#15803d,color:#fff
Raft's three phases
- Leader election: nodes elect a leader for a "term." An election requires majority votes; if two candidates split the vote, they retry with a randomized backoff so one eventually wins.
- Log replication: the leader takes client requests, appends to its log, replicates to followers, and commits once a majority acknowledges. Followers apply the entry after they see it committed.
- Safety: a leader can only commit entries from its current term; followers reject messages from stale leaders carrying an older term number.
You won't implement Raft yourself — you'll use it through systems that ship with it. But four things are worth knowing cold:
- Consensus requires a majority quorum. With 5 nodes you can lose 2 and still make progress; lose 3 and you're stuck.
- Each consensus decision costs at least one round-trip to a majority of nodes, often more.
- That round-trip is why globally-distributed strong consistency (Spanner) is slow — every commit must cross continents.
- Use odd numbers of nodes (3, 5, 7). With 4 nodes you can only tolerate 1 failure — the same as with 3, but you're paying for an extra node.
When you need consensus
You need a consensus system when you have leader election or strongly-consistent state to maintain: distributed locks, service discovery, configuration management, or database primary/failover decisions.
Reach for a battle-tested tool: etcd, Consul, ZooKeeper (older), Apache Kafka's KRaft mode. Do not roll your own consensus — it is genuinely hard to get right and easy to get subtly wrong.
Two-phase commit (2PC)
The classic protocol for committing a transaction that spans multiple participants.
sequenceDiagram
participant C as Coordinator
participant A as Participant A
participant B as Participant B
C->>A: prepare
C->>B: prepare
A-->>C: yes
B-->>C: yes
C->>A: commit
C->>B: commit
A-->>C: done
B-->>C: done
Two phases:
- Prepare: ask all participants if they can commit. They lock resources and reply yes or no.
- Commit (or Abort): if all said yes, tell all to commit; otherwise abort.
The fatal flaw is that the coordinator is a single point of failure. If it crashes between the prepare and commit phases, participants are stuck holding locks indefinitely — they said "yes" but never heard back. This is why 2PC is rarely the primary cross-service transaction mechanism in modern systems, replaced by sagas (compensating transactions; see Message Queues) or eventual consistency with idempotency.
2PC still lives in a few places: inside a single database engine between its own shards (CockroachDB uses this), XA transactions across databases (painful to operate), and Kafka's exactly-once producer (a variant of the idea).
Three-phase commit (3PC)
Adds an "are you committed yet?" phase to resolve some 2PC stalls. Almost never used in production — the extra round-trip cost isn't worth it, and Raft-based systems handle the hard cases more cleanly.
Putting it all together
For most production systems, the stack looks like this: one strongly-consistent source of truth for writes, a weakly-consistent layer for read throughput, a cache up front, and a stream for fanout.
Real examples:
- Stripe: Postgres for payments (CP, source of truth), DynamoDB for sessions, Redis for fraud signals, Kafka for events.
- Twitter: MySQL for users, Redis for timelines (cache + queue), Manhattan/Cassandra for tweets, Kafka for fanout.
- Google: Spanner for everything that needs strong consistency, Bigtable for big-data workloads, MapReduce / Dataflow for analytics.
The pattern is consistent — you don't pick one consistency level for everything. You pick the strongest level you can afford for each piece of state, and weaken it everywhere the business can tolerate it.
A consistency-design checklist
Before you decide what database to use:
- What does read-your-writes mean here? When does the user need to see their own change?
- What's the cost of a stale read to a different user? Bug, mild annoyance, or financial harm?
- Are there multi-row transactions that must be atomic?
- What's the latency budget for writes? Strong consistency → consensus → multi-RT.
- What's the replication topology — single-leader, multi-leader, leaderless?
- During a partition, do you prefer to stay available or stay consistent?
- What's your conflict-resolution strategy for the multi-leader / leaderless case?
Each answer constrains your data store choice and your query layer.
Things you should now be able to answer
- What does CAP actually mean? Why is the choice really "C or A during a partition"?
- Why is "exactly once" hard but "eventually consistent" easy?
- A user posts, then refreshes — can they see their post? What guarantee makes that work?
- Why does Spanner have higher write latency than DynamoDB?
- Replication vs partitioning — what does each solve?
- Why is consensus expensive, and why do we tolerate that for leader election?
- A multi-leader DB has two regions update the same row simultaneously — what happens?
- Two-phase commit gives you cross-service transactions. Why don't we use it everywhere?
🎉 You finished the theory chunk. Two more pragmatic modules and you're done.
→ Next: Reliability and Failure Patterns
Frequently asked questions
▸What is the CAP theorem and who proved it?
Stated by Eric Brewer in 2000 and proven by Gilbert and Lynch in 2002, CAP says any distributed data store can provide at most two of three guarantees: consistency (every read sees the most recent write), availability (every request gets a non-error response), and partition tolerance. Because network partitions are unavoidable, the real choice is what to do when a partition occurs: refuse requests to stay consistent (CP) or answer with stale data to stay available (AP).
▸What is PACELC and why is it a better model than CAP?
PACELC extends CAP by covering normal operation, not just partitions. During a partition you choose between availability and consistency; else (normal operation) you choose between latency and consistency. DynamoDB defaults to PA and EL, staying available and low-latency at the cost of eventual consistency. Spanner is PC and EC, staying consistent in both scenarios and paying the latency cost of cross-node consensus on every write.
▸What is the difference between linearizability and serializability?
Linearizability is about single-object ordering: operations on each object appear to execute in real-time order across all nodes. Serializability is about multi-object transactions: transactions execute as if they ran one at a time. Spanner is both linearizable and serializable. Postgres' default isolation level is read committed, not serializable — serializable is opt-in. Even with serializable transactions enabled, Postgres is not linearizable across replicas.
▸When is eventual consistency acceptable and when does it break down?
Eventual consistency works fine for tweets, posts, news feeds, search indexes, and view counts, where a 2-second propagation delay is unnoticeable. It breaks down for bank balances at the moment of a transaction, inventory at checkout, and seat reservations, where two simultaneous decisions can conflict in costly ways.
▸Why do Raft-based consensus systems require an odd number of nodes, and how many failures can they tolerate?
With 5 nodes a Raft cluster can lose 2 and still commit entries, because a majority of 3 remains. With 4 nodes it can tolerate only 1 failure, the same as 3 nodes, so the fourth node adds cost without adding fault tolerance. Using odd numbers (3, 5, 7) ensures every possible majority is exactly one more than the maximum tolerated failures.