~/articles/cap-theorem-deep-dive

◆◆Intermediateasked at Googleasked at Amazonasked at Meta

CAP Theorem Deep Dive

The CAP theorem, debunked myths, PACELC, and the actual trade-offs every distributed database makes.

10 min read2026-02-04Ironclad Academy

#distributed-systems #consistency #theory #databases

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The CAP theorem is the single most-quoted, most-misunderstood result in distributed systems. This article walks through what it actually says, why "CAP" is too coarse a label, and how PACELC is a more honest model. By the end, you'll be able to read a database's marketing page and translate "highly available, eventually consistent" into the real-world behavior you're going to inherit.

The theorem, formally

Stated by Eric Brewer in 2000, formalized by Seth Gilbert and Nancy Lynch in 2002:

In an asynchronous network, no shared-data system can simultaneously provide Consistency, Availability, and Partition tolerance.

You can have any two. You cannot have all three.

flowchart TD
    P[Partition occurs:<br/>nodes can't talk to each other] --> CHOICE{Choice}
    CHOICE -->|"keep serving<br/>with possibly stale data"| AP[AP system:<br/>Available + Partition-tolerant]
    CHOICE -->|"refuse on minority side<br/>to keep data correct"| CP[CP system:<br/>Consistent + Partition-tolerant]
    style AP fill:#15803d,color:#fff
    style CP fill:#ff6b1a,color:#0a0a0f

What each letter actually means

C is linearizability — every read returns the most recent committed write, ordered as if by a single global clock. This is far stronger than what people usually mean when they say "consistent" in casual conversation. ACID's "C" is about constraint preservation; "eventual consistency" is something else again. When CAP says C, it means the strongest possible flavor.

A means every non-failing node responds — not just "the system stays up." A node cut off from the rest of the cluster must still answer requests in finite time. A 99.9%-uptime system that occasionally returns 503 is not technically "available" in CAP terms.

P means the system keeps running despite arbitrary message loss. Here's the key insight: P is not optional in any real system. Networks will partition — a cable cut, a misconfigured firewall, a slow GC pause that makes a node look dead. The only question is what you do when that happens.

So CAP collapses to a binary: when partitioned, choose A or C. The diagram below shows what each looks like on the wire.

sequenceDiagram
    participant C as Client
    participant N1 as Node 1 (majority)
    participant N2 as Node 2 (minority, partitioned)
    Note over N1,N2: Network partition — N1 and N2 cannot reach each other
    C->>N1: Write x=42
    N1-->>C: OK (quorum satisfied)
    C->>N2: Read x
    Note over N2: AP system: responds with stale x=0
    Note over N2: CP system: returns error — cannot confirm latest value

This is the crux. In a CP system, the minority partition goes silent rather than risk returning old data. In an AP system, it keeps answering — knowing the value it returns might be behind.

CP vs AP — examples

System	CAP class	What happens during partition
Postgres (single primary)	CP-ish	Minority side sees errors / read-only
ZooKeeper	CP	Minority side refuses all writes; with `readonlymode.enabled=true` (opt-in since 3.4.0) minority nodes can serve stale reads
etcd	CP	Minority side refuses all reads and writes; all reads are linearizable by default (routed through leader/quorum)
MongoDB (replica set)	AP	Majority side elects primary; minority becomes read-only but can serve reads if readPreference allows it (non-default)
HBase	CP	Region server outage = unavailable for that region
Cassandra	AP (tunable)	All nodes keep accepting; conflicts reconciled via last-write-wins or LWT
DynamoDB	AP (eventually consistent default)	Reads may be stale during partition
Redis (with sentinel)	CP-ish	Minority replicas refuse writes
Spanner	CP	Uses TrueTime (GPS + atomic clocks) to order commits globally; refuses rather than serve stale

The myth: "Cassandra has no consistency"

Hear "AP" and it's tempting to picture a database that will happily return whatever garbage it has on hand. That's not what's happening. Cassandra (and DynamoDB) lets you set the consistency level per query, which means you can dial anywhere on the spectrum depending on what you need:

Cassandra read level	Meaning
`ONE`	First node that responds wins (fast, weak)
`QUORUM`	Majority — `(N/2)+1` nodes agree
`ALL`	Every replica must respond

If you use QUORUM reads and QUORUM writes such that R + W > N, you guarantee read-your-writes and dramatically reduce the chance of returning stale data — at the cost of latency. That said, R + W > N alone is not full linearizability: concurrent writes, hinted handoff, and clock skew can still cause anomalies. When you genuinely need linearizable (serial) operations, Cassandra provides Lightweight Transactions (LWT), which run Paxos under the hood (IF NOT EXISTS, IF col = val).

The "AP" label means AP by default, not AP always. LWT is the escape hatch.

flowchart LR
    W[Client write] -->|QUORUM| R1[Replica A]
    W -->|QUORUM| R2[Replica B]
    W -->|QUORUM| R3[Replica C]
    RD[Client read] -->|QUORUM: waits for 2| R1
    RD -->|QUORUM: waits for 2| R2
    R1 -. overlap .-> note1["R + W > N: overlap node<br/>has latest write"]
    style R2 fill:#15803d,color:#fff
    style note1 fill:#ffaa00,color:#0a0a0f

When the read and write quorums overlap by at least one node, that node is guaranteed to have seen the latest committed write — which is why quorum reads give you read-your-writes without needing a central primary.

PACELC — the more honest model

Daniel Abadi made an important observation in 2010 (blog post, then formalized in an IEEE Computer paper in 2012):

If a Partition occurs, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency.

That second line is the one engineers miss. Even on a healthy network, strong consistency burns latency — every consensus round-trip, every quorum, every write that must go through the primary before it can be acknowledged. PACELC makes this cost explicit.

System	PACELC class	Translation
Spanner, CockroachDB	PC/EC	Always pays latency for consistency
Postgres	PC/EC	Same
MongoDB	PA/EC	Available during partitions; consistent under normal load
DynamoDB	PA/EL	Available + low latency by default
Cassandra	PA/EL	Same

The interesting cases are worth a moment. PA/EC (MongoDB's default position) means the system stays available during a partition, but under normal operation it still pays the consistency cost — all reads go through the primary rather than being served locally from any replica. Note that MongoDB's default read concern is "local" (not linearizable), so you get the latest value on the primary but without the full CAP-linearizability guarantee; you need readConcern: "linearizable" explicitly for that. The point still stands: routing to the primary costs latency 99.9% of the time (when there's no partition) in exchange for a consistency property that partially evaporates the 0.1% of the time when the network actually fails. PA/EL is the full "fast by default, repair later" path — the dominant choice for user-facing read-heavy systems. PC/EL is rare because it's logically odd: refusing requests during partitions while accepting stale reads when the network is fine gives you the downsides of both choices.

Worked example: a payment system

Take payments. You want three things simultaneously: don't lose payments (strong durability), don't double-charge (linearizable writes), and stay up worldwide (geo-replicated). The problem is that linearizability plus geo-distribution equals high latency — consensus across continents takes time.

Real systems solve this by decomposing the problem rather than accepting a single CAP label:

Stripe keeps the ledger strongly consistent in Postgres/Aurora, localized to one region. Many other services — idempotency keys, fraud signals — read from replicas globally and accept a few seconds of staleness.

Visa/Mastercard use regional clearing systems with batch reconciliation: strong consistency locally, eventual consistency at the global level.

The lesson: you don't make one CAP choice. You break the problem into pieces, each with its own profile, and match the guarantee to what the data actually requires.

For a social feed the requirements look different. You need latency under 200ms, so you need geo-distributed reads. You want users to see their own posts immediately — read-your-writes. But other users' posts can lag; nobody expects a tweet in Tokyo to appear in Mumbai in under a second.

This is the textbook AP system: Cassandra or DynamoDB as the store, Redis caches everywhere, strong consistency only when a user reads their own data (sticky routing to a primary or regional quorum). The eventual consistency is invisible to users 99.9% of the time, and the rare "I posted but it disappeared" race is fixed with sticky reads rather than making the whole system synchronous.

Why eventual consistency usually works

Most user-facing data has weak temporal requirements. A like on Instagram doesn't need to appear globally in 100ms. A tweet propagating to followers across regions can take a few seconds. A view count can lag minutes. Users don't notice — and even when they do, the failure mode is "I saw an old count," not "the system lost my data."

The bugs that are noticeable usually have targeted fixes: sticky reads for the author's own content, local writes with async fan-out, version vectors to detect conflicts. These are precise interventions; they don't require making the entire system synchronous.

When you cannot cut corners

Some operations genuinely need strong consistency, and pretending otherwise creates bugs that are hard to reproduce and catastrophic when they occur:

Money and inventory — you cannot sell the same seat twice, cannot debit more than the balance.
Authentication and authorization — revoking a token must be visible everywhere, immediately; a stale auth cache is a security hole.
Locks and leader election — by definition, only one process should hold the lock. Without consensus, you have no lock.
Global uniqueness — usernames, email addresses, IDs. Two users claiming the same username because two nodes didn't compare notes is a bad day.

For these operations you pay the latency. For everything else, you probably don't have to.

Thinking about it in an interview

When someone asks "is this CP or AP?", walk through it in three moves:

First, identify the data. "The user's profile? Strongly consistent. Their feed? Eventually consistent." Different tables in the same system can have different profiles — that's not a cop-out, it's good engineering.

Second, justify with PACELC. "We choose PA/EL because users expect the system to stay up during regional partitions, and we want low latency on every read." This signals you understand the normal-operation cost, not just the partition scenario.

Third, acknowledge the observable trade-off. "If a follower posts in Tokyo and a viewer reads in Mumbai 100ms later, they may not see it yet. That's acceptable for a feed." Naming the failure mode and calling it acceptable is the difference between a junior answer and a senior one.

Things to discuss in an interview

Why is "CAP all three" impossible — give the intuition without the proof?
What does PACELC add that CAP misses?
A bank's ledger and a feed have very different CAP profiles. Why?
"Cassandra is AP" — what level of consistency can you actually achieve?

Things you should now be able to answer

Why is partition-tolerance non-optional?
Translate "PA + EL" into product behavior.
Why does strong consistency cost latency even when there's no partition?
Give two operations that genuinely require strong consistency.

Frequently asked questions

▸What does CAP's C actually mean, and how is it different from ACID's C or eventual consistency?

CAP's C means linearizability: every read returns the most recent committed write, ordered as if from a single global clock. ACID's C refers to constraint preservation, and eventual consistency is weaker still. Linearizability is the strongest possible flavor of consistency.

▸What does PACELC add that CAP misses?

PACELC, formalized by Daniel Abadi in a 2012 IEEE Computer paper, adds the else branch: when there is no partition, strong consistency still costs latency because every consensus round-trip, every quorum, every primary-only write takes extra time. CAP only captures the partition scenario, which is the rare case; PACELC captures the latency trade-off you face 99.9% of the time.

▸How consistent can Cassandra actually be, given that it is labeled AP?

Cassandra lets you set consistency per query. Using QUORUM reads and QUORUM writes such that R + W > N guarantees read-your-writes and reduces stale reads at the cost of latency, though this is not full linearizability due to concurrent writes, hinted handoff, and clock skew. For true linearizable operations, Cassandra provides Lightweight Transactions (LWT), which run Paxos under the hood via IF NOT EXISTS or IF col = val.

▸Which operations genuinely require strong consistency and cannot tolerate eventual consistency?

The article identifies four categories: money and inventory (cannot sell the same seat twice or overdraft a balance), authentication and authorization (a revoked token must be visible everywhere immediately), locks and leader election (only one process should hold the lock), and global uniqueness constraints such as usernames and email addresses.

▸What is Spanner's approach to achieving CP behavior globally, and what is the latency cost?

Spanner uses TrueTime, backed by GPS receivers and atomic clocks, to order commits globally, and refuses to serve stale data rather than guess. The observable cost is 5 to 30 milliseconds of extra latency compared with systems that serve locally.

← previous

Design a URL Shortener (TinyURL / bit.ly)

Database Sharding

// RELATED