~/articles/design-spanner

◆◆◆Advancedasked at Googleasked at Uberasked at CockroachDBasked at YugabyteDBasked at Amazon

Design a Globally-Distributed SQL Database (Spanner / CockroachDB)

SQL transactions that are ACID across continents. How Spanner shards into Paxos groups, runs 2PC on top, and uses TrueTime to give you external consistency — the CP counterpart to Dynamo.

13 min read2026-05-29Ironclad Academy

#interview #distributed-systems #databases #transactions #consensus #consistency #staff

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Google Spanner manages the relational data for products that touch hundreds of millions of users — ads, payments, commerce — across every continent. CockroachDB and YugabyteDB exist precisely to bring the same model to the rest of the industry. The pitch is simple: a familiar SQL database with real ACID transactions, that also scales horizontally across machines and survives the loss of an entire datacenter. The delivery is anything but simple.

The core tension is that distributed databases have always forced a choice between correctness and availability. The Dynamo-style key-value store sits at one end: always writable, eventually consistent, conflicts resolved after the fact. That tradeoff is fine for shopping cart contents or social-graph edges. It is not fine for a payment ledger, an inventory system, or a fulfillment platform. Uber's engineering team learned this the hard way when they migrated off Cassandra and Redis to Spanner because they needed strong consistency for multi-row, multi-table transactions that span regions — something you cannot retrofit onto an availability-first store after the architecture is set.

What makes this problem genuinely hard is the combination of two requirements that pull in opposite directions. External consistency — the guarantee that the entire planet's readers agree on one commit order that matches real-time order — demands synchronized clocks and commit coordination. Horizontal scale demands that data be split across thousands of shards on hundreds of machines. Every write touching more than one shard needs an atomic commit protocol. Every commit needs a globally valid timestamp. And any solution that is too slow, too fragile, or too expensive to operate doesn't ship.

This is the staff/architect-level question because it sits squarely on the CAP fault line: Spanner deliberately chooses consistency over availability and then engineers as hard as possible against the latency that choice imposes. The rest of this article is about how.

Functional requirements

SQL with schemas, secondary indexes, and joins.
ACID transactions spanning multiple rows, tables, and shards — including across regions.
Horizontal scale: capacity grows by adding nodes; data auto-splits and rebalances.
Strongly consistent reads, plus cheaper bounded-staleness/follower reads as an option.
Survive zone and region failures with automatic failover, no data loss of committed writes.

Non-functional requirements

External consistency (Google's term for strict serializability): the strongest practical guarantee — the system behaves as if every transaction executed instantaneously at a single point in time, and this serial order is consistent with real-time commit order.
Durability: a committed transaction survives f replica failures (majority quorum).
Availability within a region even if a minority of replicas fail; CP under partition (sacrifice availability of the minority side).
Latency: single-shard, single-region commits in a few ms; cross-region commits bounded by WAN RTT — and reads that don't pay write-path latency.

The crux: CAP, and why this is a CP system

CAP says under a network Partition you choose C or A. Dynamo chose A (stay writable, reconcile later). A globally-consistent SQL database chooses C: if a shard's leader can't reach a majority, it stops accepting writes there rather than risk two sides diverging. PACELC extends this: else (no partition), you still trade Latency for Consistency — quorum writes and commit-wait cost milliseconds you wouldn't pay in an eventually-consistent store.

Saying this explicitly — "this design deliberately sacrifices availability and latency for consistency, and here's exactly where" — is the move that signals staff-level judgment. The rest of the article is how you pay that bill as cheaply as possible.

Building up to the design

V1: One SQL node

Start with a single Postgres primary. You get full ACID and zero distributed-systems pain — but one machine caps your write throughput and storage, and it's a single point of failure. Lose the box, lose the database.

V2: Primary + read replicas

Add async followers and route reads to them. Read scale goes up and you have a warm standby. But writes still funnel to one primary, and async replicas are stale — read-your-writes breaks, and a primary failover can lose the un-replicated tail. Not acceptable for anything touching money.

V3: Shard the data

Split tables into ranges across many nodes; each node owns a slice. Writes now scale horizontally. Two new problems surface immediately: each shard is still a single point of failure, and a transaction touching two shards has no atomicity — you can commit on shard A and crash before shard B. You've lost ACID across shards.

V4: Replicate each shard with Paxos/Raft

Make every shard a replication group of (say) 3 replicas (single-region, across 3 zones) or 5 replicas (multi-region), kept in sync by a consensus protocol. The group elects a leader; writes are committed once a majority acknowledges. A 3-replica group tolerates 1 failure; a 5-replica group tolerates 2 (Raft's quorum property).

This is the key structural idea: consensus is per-shard, not per-cluster — thousands of independent Paxos groups, each small enough to be fast. Each shard now fails over automatically. But multi-shard atomicity is still unsolved, and you still need a global order across shards so reads are consistent.

V5: Two-phase commit across groups + TrueTime ordering

For a transaction spanning groups, run 2PC with the participant leaders as cohorts. Crucially, each participant is itself a Paxos group, so a "prepared" vote is durably replicated — the classic 2PC "coordinator dies and everyone blocks forever" failure is defanged because participants don't lose their prepared state. Then use TrueTime to assign a commit timestamp that respects real-time order, giving external consistency.

This is the production design (Google Spanner; CockroachDB and YugabyteDB are Spanner-inspired open-source systems using Raft rather than Paxos, and HLCs instead of GPS/atomic-clock TrueTime). The diagram below picks up here.

flowchart LR
    V1[V1: single SQL<br/>SPOF, no scale] --> V2[V2: read replicas<br/>stale, no write scale]
    V2 --> V3[V3: shard<br/>write scale, no cross-shard ACID]
    V3 --> V4[V4: Paxos per shard<br/>durable + HA shards]
    V4 --> V5[V5: 2PC + TrueTime<br/>global ACID + external consistency]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

Architecture

flowchart TD
    C[Client] --> QP[Query layer<br/>parse, plan, route to splits]

    subgraph "Split A (range [a,m))"
      LA[Leader A] --> A1[(Replica zone1)]
      LA --> A2[(Replica zone2)]
      LA --> A3[(Replica zone3)]
    end
    subgraph "Split B (range [m,z))"
      LB[Leader B] --> B1[(Replica zone1)]
      LB --> B2[(Replica zone2)]
      LB --> B3[(Replica zone3)]
    end

    QP --> LA
    QP --> LB
    LA <-->|"2PC: prepare/commit"| LB
    TT["TrueTime API<br/>now() → interval"] -.->|"assign commit ts"| LA
    TT -.-> LB
    PLACE[Placement driver<br/>split / rebalance / move leaders] -.-> LA
    PLACE -.-> LB

    style LA fill:#ff6b1a,color:#0a0a0f
    style LB fill:#ff6b1a,color:#0a0a0f
    style TT fill:#a855f7,color:#fff
    style PLACE fill:#0e7490,color:#fff

Four layers hold this together. The query/SQL layer is stateless — it plans queries and routes each piece to the right split. The splits are the range-partitioned shards, each a Paxos group with a leader that does the actual work. The placement driver watches for hot or oversized ranges and continuously splits them, rebalances replicas, and moves leaders closer to load. And the TrueTime service feeds bounded-uncertainty timestamps to every leader so commits land in a provably correct global order.

TrueTime: the idea that makes it work

Here's the deep problem. To impose a global order on commits, every commit needs a timestamp. But clocks on different machines disagree — not by much, but enough. If you just read local wall-clock time, two transactions can get timestamps that contradict their real-time order, breaking consistency in a way that's nearly impossible to debug.

Spanner's insight is to expose clock uncertainty as a first-class value. TrueTime.now() returns not a point but an interval [earliest, latest] guaranteed to contain the true time, with uncertainty ε kept tight by GPS receivers and atomic clocks in every datacenter — a few milliseconds in practice.

The trick is commit-wait: the leader assigns a commit timestamp t = TT.now().latest (the upper bound of the current uncertainty interval), replicates the commit via Paxos, then waits until TrueTime.now().earliest > t before releasing locks and acking the client. The wait duration is roughly 2ε. By the time any observer sees the commit, t is unambiguously in the past everywhere. The guarantee follows: if T1 commits before T2 begins in real time, t1 < t2 always — external consistency.

sequenceDiagram
    participant L as Leader
    participant TT as TrueTime
    L->>TT: now() → [earliest, latest]
    Note over L: commit ts = latest (upper bound)
    L->>L: replicate commit via Paxos
    loop commit-wait (~2ε)
        L->>TT: now() → [e', l']
        Note over L: wait until e' > commit_ts
    end
    L-->>L: release locks, ack client

The cost is real: every commit pays roughly 2ε of commit-wait latency. Google spends serious money on GPS/atomic-clock infrastructure to keep ε small (~1–7 ms, so the wait is ~2–14 ms). CockroachDB and YugabyteDB can't assume that hardware, so they use Hybrid Logical Clocks (HLC) — wall-clock combined with a logical counter. The consistency consequence: CockroachDB provides serializable isolation by default, not strict serializability / external consistency — meaning a "causal reverse" anomaly across causally-dependent transactions is theoretically possible (though rare in practice). It compensates via uncertainty restarts: if a read encounters a value whose timestamp falls within the uncertainty window, the transaction retries at a higher timestamp rather than risking a stale read. That trade — dedicated time infrastructure for external consistency vs. commodity clocks for serializability — is worth raising in any interview on this topic.

Reads: MVCC snapshots, no locks

Multi-version concurrency control stores each row as timestamped versions. A read executes "as of" a timestamp and sees the latest version ≤ t. This gives you three things.

Read-only transactions take no locks and never block writers (and vice versa) — a huge win for read-heavy workloads. A snapshot read at a past timestamp is consistent and can be served by any replica that has caught up to t, including a nearby follower, so you can dodge a cross-region hop entirely. Strong reads go to the leader; stale/bounded-staleness reads trade a little freshness for local latency.

This separation — strongly consistent writes via Paxos+2PC, lock-free consistent reads via MVCC — is what keeps a CP database usable under heavy read load rather than grinding to a halt on lock contention.

flowchart LR
    R[Read request] --> Q{"Strong or<br/>stale?"}
    Q -->|"strong read"| LEADER[Shard leader<br/>latest committed version]
    Q -->|"bounded-staleness"| FOLLOWER[Nearby follower<br/>version at ts T]
    LEADER --> MVCC[(MVCC versions<br/>row@t1, row@t2, ...)]
    FOLLOWER --> MVCC
    MVCC --> OUT[Consistent snapshot<br/>no locks held]
    style LEADER fill:#ff6b1a,color:#0a0a0f
    style FOLLOWER fill:#0e7490,color:#fff
    style MVCC fill:#15803d,color:#fff
    style OUT fill:#a855f7,color:#fff

Two-phase commit, made non-blocking

Vanilla 2PC has a notorious flaw: if the coordinator crashes after participants vote "prepared," they hold locks forever waiting for a decision. Layering 2PC on Paxos groups fixes both failure modes.

A "prepared" vote is written through the participant's Paxos group, so it survives the participant leader crashing — a new leader takes over already knowing it's prepared. The coordinator's decision is also Paxos-replicated, so a crashed coordinator is replaced by a peer that can finish the commit. Neither side can lose its state.

sequenceDiagram
    participant Co as Coordinator (Paxos group)
    participant A as Participant A (Paxos group)
    participant B as Participant B (Paxos group)
    Co->>A: prepare
    Co->>B: prepare
    A->>A: replicate "prepared" (Paxos)
    B->>B: replicate "prepared" (Paxos)
    A-->>Co: vote yes
    B-->>Co: vote yes
    Co->>Co: replicate "commit" decision (Paxos)
    Co->>A: commit @ ts
    Co->>B: commit @ ts

It's still 2PC — it still blocks the specific rows under contention, and adds a round-trip — which is exactly why data modeling matters: co-locate rows that are transacted together (Spanner's interleaved tables / table families) into the same split so the common transaction is single-group and skips 2PC entirely.

Placement, splitting, and rebalancing

Ranges aren't static. A placement driver continuously splits a range that grows too large or too hot, rebalances replicas across nodes to even out load and storage, moves leaders toward the region generating the writes (leader locality cuts latency), and repairs under-replicated groups when a node dies by re-replicating to restore the quorum margin.

This is how the system scales smoothly instead of requiring manual resharding — the operational pain that hand-rolled sharding inflicts.

Edge cases and follow-ups

Hot rows / write skew

A single hot row (a global counter) serializes on one Paxos leader. Mitigate by sharding the counter, or accept the bottleneck. Write skew (two transactions each reading-then-writing disjoint rows that violate an invariant together) requires SERIALIZABLE isolation — which these systems provide, unlike default Postgres READ COMMITTED.

Clock skew beyond ε

If a machine's true clock drifts outside the guaranteed uncertainty, the external-consistency guarantee is violated. The TrueTime architecture defends against this: each GPS/atomic clock master continuously cross-checks its own drift and evicts itself from serving time (stops advertising) if it diverges substantially — causing ε to temporarily grow on that datacenter until other masters compensate, which in turn makes Spanner's commit-wait longer, not undefined. Correctness depends on the time infrastructure actually meeting its bound — a rare but real operational assumption worth naming.

Cross-region write latency

A transaction whose leader is in us-east but must reach a majority across regions pays inter-region RTT per commit. Designs minimize this by placing all replicas of latency-sensitive data in one region (giving up some disaster tolerance), or using leader leases so reads stay local.

Schema changes without downtime

Online schema changes use a multi-version, staged protocol so old and new schema coexist during rollout — a hard sub-problem Spanner solved with timestamped schema versions.

When NOT to use this

If you don't need cross-row/cross-region atomicity, this is overkill — you're paying consensus + commit-wait latency for guarantees you don't use. The staff-level judgment is recognizing that most workloads are fine with a single-region SQL DB + read replicas, and reserving globally-consistent SQL for the genuinely transactional, genuinely global use cases (payments, inventory, fulfillment).

What interviewers look for

Interviewers probe three areas. First, framing and tradeoffs: did you open with an explicit CP / consistency-over-availability stance (CAP/PACELC), and did you name the latency cost and when the design is the wrong tool? Second, the core mechanisms: consensus per shard (not per cluster) and why that limits blast radius; 2PC layered on Paxos and why the Paxos layer makes 2PC non-blocking; TrueTime + commit-wait and what external consistency actually guarantees; MVCC snapshot reads and why they stay lock-free. Third, open-source alternatives: how CockroachDB and YugabyteDB substitute HLC for GPS/atomic-clock TrueTime and what consistency level they achieve as a result.

Things you should now be able to answer

Why can't you retrofit cross-region ACID onto a Dynamo-style AP store?
Why is consensus done per-shard instead of across the whole cluster?
What does commit-wait actually wait for (TT.now().earliest > commit_ts), and what does it buy you?
How does layering 2PC on Paxos avoid the blocking-coordinator problem?
Why can read-only transactions avoid locks entirely?
When is a globally-distributed SQL database the wrong choice?

Frequently asked questions

▸What is external consistency in Spanner, and how does TrueTime enforce it?

External consistency (Google's term for strict serializability) means the system behaves as if every transaction executed instantaneously at a single point in time, and that serial order matches real-time commit order. Spanner enforces it by assigning each commit a timestamp equal to TrueTime.now().latest, then performing commit-wait: the leader holds the commit until TrueTime.now().earliest exceeds that timestamp, guaranteeing the commit timestamp is unambiguously in the past everywhere before any observer sees it.

▸Why does Spanner run consensus per shard rather than across the whole cluster?

Each shard (split) runs its own independent Paxos group so that shards fail over independently and a failure in one range has no blast radius on others. It also keeps each consensus group small and fast: a 3-replica single-region group needs only one intra-region (inter-zone) round-trip per quorum write, whereas a cluster-wide consensus protocol would require every node to coordinate on every write.

▸How does layering 2PC on Paxos prevent the classic blocking-coordinator problem?

In vanilla 2PC, a crashed coordinator leaves participants holding locks forever because they cannot know the final decision. When each participant is itself a Paxos-replicated group, a prepared vote is durably written to the participant's log before the vote is sent, so a new participant leader can recover already knowing it voted yes. The coordinator's commit or abort decision is also Paxos-replicated, so a failed coordinator is replaced by a peer that can finish the protocol without losing state.

▸How do CockroachDB and YugabyteDB differ from Spanner on clock-based consistency?

Spanner relies on GPS receivers and atomic clocks in every datacenter to keep TrueTime uncertainty small (roughly 1-7 ms), which allows true external consistency at the cost of ~2-14 ms commit-wait per write. CockroachDB and YugabyteDB use Hybrid Logical Clocks (HLC) on commodity hardware instead; as a result they provide serializable isolation rather than strict serializability, and compensate with uncertainty restarts when a read encounters a value whose timestamp falls within the clock uncertainty window.

▸When is a globally-distributed SQL database the wrong choice?

If the workload does not require cross-row or cross-region atomicity, the consensus and commit-wait overhead is unjustified cost. The article explicitly calls out that most workloads are fine with a single-region SQL database plus read replicas, and reserves globally-consistent SQL for genuinely transactional, genuinely global use cases such as payments, inventory, and fulfillment.

← previous

Design a Distributed Job Scheduler (cron at scale)

Design a Digital Wallet (PayPal / Venmo / Paytm)

// RELATED