MODULE 10 / 12crash course

~/roadmap/10-cap-and-consistency

◆◆Intermediate

CAP, Consistency, and Replication

CAP and PACELC, consistency models from linearizable to eventual, replication strategies, quorums, partitioning, consensus (Raft, Paxos), CRDTs, and 2PC.

16 min read2026-01-24Ironclad Academy

#distributed-systems #consistency #replication #consensus #fundamentals

This is where distributed systems get genuinely hard. Get the vocabulary right and you'll understand why every distributed database makes the trade-offs it does. Get it wrong and you'll spend weeks chasing "ghost" bugs that turn out to be replication lag.

The CAP theorem

Stated by Eric Brewer in 2000, proven by Gilbert and Lynch in 2002:

Any distributed data store can provide at most two of the following three guarantees:

Consistency: every read sees the most recent write.

Availability: every request receives a (non-error) response.

Partition tolerance: the system continues to operate despite network failures between nodes.

Network partitions happen — they are a fact of physics on the public internet, even within a single data center. So you don't really get to choose whether to be P or not. The real choice is what you do when a partition arrives:

flowchart TD
    P[Partition occurs]
    P --> CHOICE{What do you do?}
    CHOICE -->|"choose Consistency<br/>(refuse some requests)"| CP[CP system]
    CHOICE -->|"choose Availability<br/>(answer with stale data)"| AP[AP system]
    CP --> CPEX[HBase, MongoDB, Spanner]
    AP --> APEX[Cassandra, DynamoDB, Riak]
    style CP fill:#ff6b1a,color:#0a0a0f
    style AP fill:#15803d,color:#fff

A CP system refuses to serve requests on the minority side of a partition rather than risk serving stale data. An AP system keeps both sides running and accepts that they may diverge — you'll reconcile later.

CAP isn't binary

In practice, "consistency" lives on a spectrum, and even within one product the level is tunable. Cassandra lets you set consistency_level=ONE (very available, weakly consistent) or QUORUM (more consistent, less available) per query.

What CAP doesn't tell you

CAP describes behavior only during a partition. Most of the time there is no partition. So CAP is silent about the trade-offs you face under normal operation — which, in practice, dominate every real design conversation.

PACELC: the better model

CAP only describes behavior during a partition. But even when the network is healthy, you still face a trade-off:

If a Partition occurs, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.

flowchart TD
    NORMAL[Normal operation:<br/>EL or EC]
    PARTITION[Partition:<br/>PA or PC]
    NORMAL --> EL[EL: low Latency<br/>DynamoDB, Cassandra]
    NORMAL --> EC[EC: strong Consistency<br/>Spanner, etcd]
    PARTITION --> PA[PA: stay Available<br/>DynamoDB, Cassandra]
    PARTITION --> PC[PC: stay Consistent<br/>Spanner, etcd, HBase]
    style EL fill:#15803d,color:#fff
    style EC fill:#ff6b1a,color:#0a0a0f

Doing strong consistency under normal conditions costs you latency — consensus requires at least one round-trip to a majority of nodes. DynamoDB defaults to "PA + EL": available and low-latency, accepting eventual consistency. Spanner is "PC + EC": consistent both during partitions and normally, paying the latency cost.

This is the four-letter framework you should reach for, not just CAP.

Consistency models, ranked

From strongest (slowest, hardest to implement) to weakest (fastest, easiest):

Model	Guarantee	Example
Strict / Linearizable	Looks like a single machine; reads see the latest write globally	Spanner with TrueTime, etcd, ZooKeeper
Sequential	All nodes see operations in the same order, but order may not match real time	Some custom systems
Causal	If A happens-before B, everyone sees A before B	Riak, MongoDB causal sessions
Read-your-writes	A user always sees their own writes	Many DBs with sticky reads
Monotonic reads	Once you see a value, you won't go back to an older one	Most replicated DBs with session pinning
Eventual	If writes stop, reads will eventually see the latest	DynamoDB default, Cassandra default, S3 cross-region

Most apps don't need linearizable. Most apps need read-your-writes (so the user sees their own post immediately) plus eventual elsewhere.

Linearizability vs serializability

Easy to confuse. Different things:

Linearizability is about single-object ordering. There's a real-time order on operations on each object.
Serializability is about multi-object transactions. Transactions execute as if they ran one at a time.

Spanner is both. Postgres' default isolation level is read committed, not serializable — serializable requires an explicit SET or per-transaction override. And even with serializable transactions, Postgres is not linearizable across replicas.

Why eventual consistency is fine, mostly

Imagine a tweet. You post it; your follower reads it 2 seconds later because the write hadn't propagated yet. Is that a bug? No — it's a 2-second delay you'd never notice. Eventual consistency works for tweets, posts, news feeds, search indexes, view counts.

It does not work for: bank balances at the moment of a transaction, inventory at checkout, seat reservations, or anything where two simultaneous decisions could conflict in costly ways.

The "session guarantees" sweet spot

For most user-facing apps, the right level is session consistency: within one user's session, they see read-your-writes and monotonic reads. Across users, eventual is fine.

The implementation is lightweight: pin the user to one replica per session (sticky reads), or carry a "version" token they expect to see and read from any replica that's caught up to that version.

flowchart LR
    W[User writes post] --> P[Primary]
    P -->|"replicate"| R1[Replica 1\nv=45]
    P -->|"replicate"| R2[Replica 2\nv=38 — skipped]
    W -->|"session token: version=42"| C[Next read]
    C -->|"read from replica that has v>=42"| R1
    R1 -->|"v=45 — ok"| C
    style P fill:#ff6b1a,color:#0a0a0f
    style R1 fill:#15803d,color:#fff
    style R2 fill:#4b5563,color:#fff
    style C fill:#0e7490,color:#fff

Cheap to implement: just pass a version number in the session cookie. Readers route to any replica that's sufficiently caught up.

Replication

Copying data across multiple nodes serves three goals: durability (data survives a single failure), availability (a replica can serve requests when the primary is sick), and read scaling (more nodes share the read load).

There are three architectures, each with different complexity and trade-off profiles.

Single-leader (primary-replica)

flowchart LR
    W[Writes] --> P[Primary]
    P -->|replicate| R1[Replica 1]
    P -->|replicate| R2[Replica 2]
    P -->|replicate| R3[Replica 3]
    R1 --> RD[Reads]
    R2 --> RD
    R3 --> RD
    P --> RD
    style P fill:#ff6b1a,color:#0a0a0f
    style R1 fill:#0e7490,color:#fff

All writes go to one primary. The primary streams its log to replicas. Reads can be served from anywhere.

How fast is the replication? That depends on what you can live with. Synchronous replication waits for at least one replica to confirm before the write returns — safer, but the write latency is now bounded by the replica's round-trip. Asynchronous replication returns immediately and lets replicas catch up in the background — faster, but if the primary dies before a replica gets the last few writes, those writes are gone. Most production systems use semi-sync: wait for one replica to confirm, let the rest catch up asynchronously. You get durability without paying for N round-trips on every write.

Failure handling: if the primary dies, an election promotes a replica. This is what tools like Patroni (Postgres) or Sentinel (Redis) automate. Replication lag — the gap between a write landing on the primary and being visible on a replica — matters when users need to read their own writes. If you write then immediately read from a lagging replica, your own change may have vanished.

The split-brain problem

A network partition isolates the primary from its replicas. Replicas can't tell whether the primary is truly dead or just temporarily unreachable. If they elect a new primary while the old one is still serving writes, you now have two primaries accepting conflicting writes — and reconciling them is painful.

sequenceDiagram
    participant Old as Old Primary
    participant N as Network
    participant New as New Primary
    participant W as Write Client
    participant S as Storage
    Old->>N: heartbeat (drops)
    N-->>Old: (no response)
    Note over N: Partition
    New->>New: election — majority vote
    New-->>W: I am the leader (token=2)
    Old-->>W: I am the leader (token=1, stale)
    Note over Old,New: Two primaries accepting writes — split-brain
    W->>S: write (token=2)
    S-->>W: accepted
    Old->>S: write (token=1)
    S-->>Old: rejected — stale token

Three mitigations:

Quorum-based promotion (majority vote): a replica can't become primary unless more than N/2 nodes agree.
STONITH ("Shoot The Other Node In The Head"): the new primary forcibly fences the old one — revokes its IP, powers it off via BMC — so it literally cannot accept writes.
Fencing tokens: writes carry a monotonically-increasing generation number; the storage layer rejects any write carrying a stale token.

Split-brain is the classic distributed-database nightmare. Modern databases solve it by building consensus (Raft) directly into their replication protocol.

Multi-leader

Two or more nodes accept writes and sync to each other.

flowchart LR
    W1[Writes US] <--> P1[Leader US]
    W2[Writes EU] <--> P2[Leader EU]
    P1 <-->|"async sync"| P2
    style P1 fill:#ff6b1a,color:#0a0a0f
    style P2 fill:#ff6b1a,color:#0a0a0f

This pattern shows up in multi-region active-active deployments: you want users in the US and EU to both write to a nearby node with low latency, then have the regions sync in the background. The cost is write conflicts — if two regions update the same row at the same time, something has to decide who wins.

Conflict resolution strategies:

Strategy	How	Trade-off
Last-write-wins (LWW)	Compare timestamps; later wins	Lossy; clock skew matters
Per-field merge	Each field updated independently	Doesn't compose well
Application-level	Hand-written reconciliation	Most correct, most code
CRDTs	Math guarantees convergence	Limited data types

Leaderless

Everyone's equal. Reads and writes go to multiple nodes; quorums (R + W > N) ensure consistency.

flowchart LR
    C[Client\nW=2 acks needed] -->|write| N1[Node 1\nack]
    C -->|write| N2[Node 2\nack]
    C -->|write| N3[Node 3]
    N1 -->|ack 1| C
    N2 -->|ack 2 — write succeeds| C
    style N1 fill:#15803d,color:#fff
    style N2 fill:#15803d,color:#fff
    style N3 fill:#4b5563,color:#fff
    style C fill:#0e7490,color:#fff

Cassandra does this. The original Amazon Dynamo paper also described this model, but modern DynamoDB has evolved away from it — DynamoDB internally uses a leader-based Multi-Paxos protocol per partition, though it still exposes eventually consistent reads by default and strongly consistent reads on request. Riak is another pure leaderless example.

The parameters are tunable: R=1, W=N gives fast reads and slow writes; R=N, W=1 is the inverse; R+W > N gives quorum consistency. The math makes it work: if you write to W out of N nodes and read from R out of N nodes, and R + W > N, then at least one node in the read set must have seen the latest write — assuming no concurrent conflicting writes and no sloppy quorums. That overlap is how leaderless systems get consistency without a designated leader, though concurrent writes still require conflict resolution strategies like version vectors or LWW.

CRDTs (Conflict-free Replicated Data Types)

CRDTs are data structures designed so that any two replicas can be merged automatically, with the merge always producing the same result regardless of the order merges happen in.

flowchart LR
    A["Replica A:<br/>set = {1, 2}"] -->|merge| M["Merged: {1, 2, 3, 4}"]
    B["Replica B:<br/>set = {3, 4}"] -->|merge| M
    style M fill:#15803d,color:#fff

A few concrete types:

G-Counter (grow-only counter): each replica increments its own slot in a vector; merge takes the max per slot. The sum of all slots gives the global total. This is how like counts work at scale.
PN-Counter: a G-Counter for increments plus a G-Counter for decrements. Subtract one from the other.
G-Set (grow-only set): merge is union. Items can be added, never removed.
OR-Set (observed-remove set): supports adds and removes with correct semantics when both happen concurrently.
LWW-Register: last-write-wins with a monotonic tiebreaker.

CRDTs power Riak, Redis CRDTs, and the collaborative editing in Yjs and Automerge. (Figma uses a custom OT-influenced approach with a central server, not CRDTs.) The appeal is real: they remove the need for centralized coordination entirely, which means no coordinator to fail or contend on. Use them when the data type fits — you can't turn arbitrary structures into CRDTs, and the available types are a small family.

Partitioning (sharding)

Replication makes copies. Partitioning splits the data — no node has it all.

flowchart TD
    DB[(All Data)] --> P1[(Shard 1<br/>users a-h)]
    DB --> P2[(Shard 2<br/>users i-q)]
    DB --> P3[(Shard 3<br/>users r-z)]
    style DB fill:#ff6b1a,color:#0a0a0f

Strategies:

Strategy	How	Pros	Cons
Range	`users a-h` on shard 1, `i-q` on shard 2, `r-z` on shard 3	Range queries efficient	Hot shards (e.g. all "a" usernames)
Hash	`shard = hash(key) % N`	Even distribution	No range queries, rebalancing is hard
Consistent hash	Hash onto a ring	Even + minimal reshuffling	Slightly more complex
Directory	Lookup table maps key → shard	Flexible	Lookup table is itself a bottleneck

Most production systems use consistent hashing. See the Consistent Hashing article.

Cross-partition queries are expensive

Once data is sharded, "give me the top 10 users by post count" requires a query to every shard, then a merge. This is scatter-gather, and it's how distributed search engines (Elasticsearch) and OLAP databases (BigQuery) work — but it adds complexity and tail latency. Design your partition key so your most common queries touch one shard, not all of them.

Hot partitions

The biggest sharding pitfall. Even with hash sharding, if one key is responsible for most traffic (Justin Bieber's user_id), one shard is overloaded. Three mitigations:

Random suffix on the key: user:42-{0..9} — spread writes across 10 slots, read all 10 and aggregate. More writes, but the load is spread.
Caching the hot keys in front of the shard, so most reads never reach storage.
Custom routing: detect hot keys operationally, fan out only those to dedicated capacity.

Consensus algorithms (Paxos, Raft)

How do a group of nodes agree on a single value — who's the leader, what's the next log entry — when some nodes may be failing or partitioned?

Raft is the modern, comprehensible answer. Paxos is the older, harder-to-understand one. Both are widely deployed: Raft in etcd, CockroachDB, and Consul; Paxos in Spanner and Chubby.

flowchart TD
    L[Leader] -->|"AppendEntries"| F1[Follower 1]
    L -->|"AppendEntries"| F2[Follower 2]
    L -->|"AppendEntries"| F3[Follower 3]
    L -->|"AppendEntries"| F4[Follower 4]
    F1 -->|ack| L
    F2 -->|ack| L
    F3 -->|"ack — majority reached,<br/>entry committed"| L
    style L fill:#ff6b1a,color:#0a0a0f
    style F1 fill:#15803d,color:#fff
    style F2 fill:#15803d,color:#fff
    style F3 fill:#15803d,color:#fff

Raft's three phases

Leader election: nodes elect a leader for a "term." An election requires majority votes; if two candidates split the vote, they retry with a randomized backoff so one eventually wins.
Log replication: the leader takes client requests, appends to its log, replicates to followers, and commits once a majority acknowledges. Followers apply the entry after they see it committed.
Safety: a leader can only commit entries from its current term; followers reject messages from stale leaders carrying an older term number.

You won't implement Raft yourself — you'll use it through systems that ship with it. But four things are worth knowing cold:

Consensus requires a majority quorum. With 5 nodes you can lose 2 and still make progress; lose 3 and you're stuck.
Each consensus decision costs at least one round-trip to a majority of nodes, often more.
That round-trip is why globally-distributed strong consistency (Spanner) is slow — every commit must cross continents.
Use odd numbers of nodes (3, 5, 7). With 4 nodes you can only tolerate 1 failure — the same as with 3, but you're paying for an extra node.

When you need consensus

You need a consensus system when you have leader election or strongly-consistent state to maintain: distributed locks, service discovery, configuration management, or database primary/failover decisions.

Reach for a battle-tested tool: etcd, Consul, ZooKeeper (older), Apache Kafka's KRaft mode. Do not roll your own consensus — it is genuinely hard to get right and easy to get subtly wrong.

Two-phase commit (2PC)

The classic protocol for committing a transaction that spans multiple participants.

sequenceDiagram
    participant C as Coordinator
    participant A as Participant A
    participant B as Participant B
    C->>A: prepare
    C->>B: prepare
    A-->>C: yes
    B-->>C: yes
    C->>A: commit
    C->>B: commit
    A-->>C: done
    B-->>C: done

Two phases:

Prepare: ask all participants if they can commit. They lock resources and reply yes or no.
Commit (or Abort): if all said yes, tell all to commit; otherwise abort.

The fatal flaw is that the coordinator is a single point of failure. If it crashes between the prepare and commit phases, participants are stuck holding locks indefinitely — they said "yes" but never heard back. This is why 2PC is rarely the primary cross-service transaction mechanism in modern systems, replaced by sagas (compensating transactions; see Message Queues) or eventual consistency with idempotency.

2PC still lives in a few places: inside a single database engine between its own shards (CockroachDB uses this), XA transactions across databases (painful to operate), and Kafka's exactly-once producer (a variant of the idea).

Three-phase commit (3PC)

Adds an "are you committed yet?" phase to resolve some 2PC stalls. Almost never used in production — the extra round-trip cost isn't worth it, and Raft-based systems handle the hard cases more cleanly.

Putting it all together

For most production systems, the stack looks like this: one strongly-consistent source of truth for writes, a weakly-consistent layer for read throughput, a cache up front, and a stream for fanout.

Real examples:

Stripe: Postgres for payments (CP, source of truth), DynamoDB for sessions, Redis for fraud signals, Kafka for events.
Twitter: MySQL for users, Redis for timelines (cache + queue), Manhattan/Cassandra for tweets, Kafka for fanout.
Google: Spanner for everything that needs strong consistency, Bigtable for big-data workloads, MapReduce / Dataflow for analytics.

The pattern is consistent — you don't pick one consistency level for everything. You pick the strongest level you can afford for each piece of state, and weaken it everywhere the business can tolerate it.

A consistency-design checklist

Before you decide what database to use:

What does read-your-writes mean here? When does the user need to see their own change?
What's the cost of a stale read to a different user? Bug, mild annoyance, or financial harm?
Are there multi-row transactions that must be atomic?
What's the latency budget for writes? Strong consistency → consensus → multi-RT.
What's the replication topology — single-leader, multi-leader, leaderless?
During a partition, do you prefer to stay available or stay consistent?
What's your conflict-resolution strategy for the multi-leader / leaderless case?

Each answer constrains your data store choice and your query layer.

Things you should now be able to answer

What does CAP actually mean? Why is the choice really "C or A during a partition"?
Why is "exactly once" hard but "eventually consistent" easy?
A user posts, then refreshes — can they see their post? What guarantee makes that work?
Why does Spanner have higher write latency than DynamoDB?
Replication vs partitioning — what does each solve?
Why is consensus expensive, and why do we tolerate that for leader election?
A multi-leader DB has two regions update the same row simultaneously — what happens?
Two-phase commit gives you cross-service transactions. Why don't we use it everywhere?

🎉 You finished the theory chunk. Two more pragmatic modules and you're done.

→ Next: Reliability and Failure Patterns

// KEY TAKEAWAYS

▸CAP forces a choice only during a partition; PACELC reveals the latency-vs-consistency trade-off that dominates normal operation.
▸Most user-facing apps need session consistency (read-your-writes plus monotonic reads within a session), not linearizability across all users.
▸Semi-synchronous replication — wait for one replica to confirm, let the rest catch up asynchronously — is the standard production trade-off between durability and write latency.
▸The quorum condition R + W greater than N guarantees that at least one node in the read set has seen the latest write, but concurrent writes still require a conflict-resolution strategy like LWW or version vectors.
▸Two-phase commit is rarely the primary cross-service transaction mechanism in modern systems because a coordinator crash between prepare and commit leaves participants holding locks indefinitely.

// FAQ

Frequently asked questions

▸What is the CAP theorem and who proved it?

Stated by Eric Brewer in 2000 and proven by Gilbert and Lynch in 2002, CAP says any distributed data store can provide at most two of three guarantees: consistency (every read sees the most recent write), availability (every request gets a non-error response), and partition tolerance. Because network partitions are unavoidable, the real choice is what to do when a partition occurs: refuse requests to stay consistent (CP) or answer with stale data to stay available (AP).

▸What is PACELC and why is it a better model than CAP?

PACELC extends CAP by covering normal operation, not just partitions. During a partition you choose between availability and consistency; else (normal operation) you choose between latency and consistency. DynamoDB defaults to PA and EL, staying available and low-latency at the cost of eventual consistency. Spanner is PC and EC, staying consistent in both scenarios and paying the latency cost of cross-node consensus on every write.

▸What is the difference between linearizability and serializability?

Linearizability is about single-object ordering: operations on each object appear to execute in real-time order across all nodes. Serializability is about multi-object transactions: transactions execute as if they ran one at a time. Spanner is both linearizable and serializable. Postgres' default isolation level is read committed, not serializable — serializable is opt-in. Even with serializable transactions enabled, Postgres is not linearizable across replicas.

▸When is eventual consistency acceptable and when does it break down?

Eventual consistency works fine for tweets, posts, news feeds, search indexes, and view counts, where a 2-second propagation delay is unnoticeable. It breaks down for bank balances at the moment of a transaction, inventory at checkout, and seat reservations, where two simultaneous decisions can conflict in costly ways.

▸Why do Raft-based consensus systems require an odd number of nodes, and how many failures can they tolerate?

With 5 nodes a Raft cluster can lose 2 and still commit entries, because a majority of 3 remains. With 4 nodes it can tolerate only 1 failure, the same as 3 nodes, so the fourth node adds cost without adding fault tolerance. Using odd numbers (3, 5, 7) ensures every possible majority is exactly one more than the maximum tolerated failures.

← previous module

Message Queues and Event Streams

next module →

Reliability and Failure Patterns