~/articles/leader-election-and-consensus
◆◆◆Advancedasked at Googleasked at Etcdasked at Consulasked at CockroachDBasked at Kafka

Leader Election and Consensus (Raft, Paxos)

How distributed systems agree on a single leader without splitting brains. Raft step-by-step, Paxos explained intuitively, and where consensus shows up in production.

// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

If you've ever wondered how Zookeeper, etcd, Kafka's controller, Spanner, MongoDB, and CockroachDB manage to all elect a single leader without any of them ever lying about it — the answer is a consensus protocol. Two protocols matter: Paxos (the theoretical foundation) and Raft (the one humans can actually read).

This article explains why naive leader election doesn't work, walks through Raft step-by-step, and tells you when and where you'll bump into consensus in real systems.

Why this is hard

If you have N nodes and a leader dies, "elect a new leader" sounds trivial: vote. Just count hands.

The trouble is that distributed systems don't get a clean show of hands. The network can lose, delay, or reorder messages. Nodes crash and recover, sometimes with stale data. A slow node might come back online days later and start writing as if nothing happened. And crucially, two nodes can simultaneously think they are the leader — split brain — each writing to what they believe is the authoritative state.

flowchart LR
    A[Node A<br/>thinks: I'm leader] --> X[(Shared state)]
    B[Node B<br/>thinks: I'm leader] --> X
    X --> CHAOS[Two different<br/>committed values<br/>for the same key]
    style CHAOS fill:#ff2e88,color:#fff

Once you have split brain, the damage is already done. You need a protocol that is safe (never two leaders simultaneously) and live (eventually elects one, even when the network misbehaves).

The FLP impossibility (in one paragraph)

Fischer, Lynch, Paterson (1985) proved that in a fully asynchronous network, no deterministic protocol can guarantee both safety (agreement) and liveness (termination) in the presence of even one crash failure. You always give up something. In practice, consensus protocols pick safety and assume bounded asynchrony — "we'll be safe always, and live whenever the network is reasonably stable."

This is why every production consensus system has a notion of timeouts and terms — the timing assumptions are explicit rather than hidden.

Raft (the one you should learn first)

Raft was designed in 2014 specifically to be understandable. It's used in etcd, Consul, TiKV, RethinkDB, CockroachDB, the Kafka raft controller, and dozens of others.

Every node holds one of three roles at any moment:

flowchart LR
    F[Follower] -->|"election timeout"| C[Candidate]
    C -->|"wins majority"| L[Leader]
    C -->|"sees higher term"| F
    L -->|"sees higher term"| F
    style L fill:#ff6b1a,color:#0a0a0f
    style C fill:#ffaa00,color:#0a0a0f
    style F fill:#0e7490,color:#fff

At any time there's at most one leader. All the other nodes are followers — or briefly candidates during an election.

The term

Time is divided into terms — monotonically increasing integers. Each term has at most one leader. If a node ever sees a higher term anywhere in the system, it immediately steps down to follower. No argument, no delay.

term 1: leader A
term 2: leader B  (A died, election held)
term 3: leader B  (still leader)
term 4: leader C  (B partitioned, election held)

This is the trick that kills split brain. Even if two nodes both believe they're the leader, they hold different terms. The higher term wins, and the lower-term node steps down the moment it hears about it.

Election

Followers sit quietly, receiving heartbeats from the leader. The moment a follower's election timeout fires without a heartbeat — typically randomized between 150–300ms — it assumes the leader is dead and takes action:

  1. Increments its term.
  2. Becomes a candidate.
  3. Votes for itself.
  4. Asks every other node for a vote.
sequenceDiagram
    participant A
    participant B
    participant C
    Note over A,C: Term 3. Leader was A. Network isolates A.
    Note over B: Election timeout fires on B.
    B->>B: term=4, vote for self
    B->>A: RequestVote term=4
    B->>C: RequestVote term=4
    Note over A: A still in term=3, sees term=4 — steps down to follower
    A-->>B: granted
    C-->>B: granted
    Note over B: Majority! B becomes leader in term=4.
    B->>A: Heartbeat term=4
    B->>C: Heartbeat term=4

A node grants a vote only when three conditions hold: the candidate's term is at least as high as its own; it hasn't already voted in this term; and the candidate's log is at least as up-to-date as its own. That last condition is what makes Raft safe — a node with stale data cannot win because well-informed peers won't vote for it.

What if no one wins?

Two candidates can split the vote — each gets less than a majority. They both time out, increment their terms, and try again. Because timeouts are randomized, one of them fires first next time and wins before the other even starts campaigning. In practice, elections in Raft complete in well under a second in steady-state conditions and only occur when something actually breaks.

Log replication (the other half)

Once elected, the leader is the only node that accepts writes. Every write is an entry appended to the leader's log, then shipped to followers:

flowchart LR
    C[Client write] --> L[Leader log<br/>append entry]
    L -->|AppendEntries| F1[Follower 1 log]
    L -->|AppendEntries| F2[Follower 2 log]
    F1 -.ack.-> L
    F2 -.ack.-> L
    L --> C2[Commit when<br/>majority acked]
    C2 --> APPLY[Apply to state machine]
    style L fill:#ff6b1a,color:#0a0a0f
    style APPLY fill:#15803d,color:#fff

An entry is committed when a majority of nodes have written it to their logs. The leader then notifies everyone "apply up to index N" and followers apply entries in order. If a follower falls behind, the leader walks back through its own log to find the divergence point, then ships the missing entries forward.

Safety guarantees

Raft promises four things:

  1. At most one leader per term.
  2. A leader's log only grows by appending — no rewrites.
  3. If two logs have the same entry at the same index, they have the same prefix.
  4. Once an entry is committed, it stays committed and appears at the same index in every future leader's log.

Together these mean: the state machine is deterministic across all nodes. Whatever order one node applies entries in, every other node applies in the same order. The cluster converges on a single truth.

Paxos (the theoretical foundation)

Paxos came first — Lamport's "The Part-Time Parliament" circulated as a DEC SRC technical report in 1989 (SRC-49) and was formally published in ACM TOCS in 1998. It is famously confusing. Notable variants include Basic Paxos (decide one value), Multi-Paxos (decide a sequence of values, like Raft), Fast Paxos, and EPaxos.

Basic Paxos in one diagram

sequenceDiagram
    actor C as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    Note over C: Phase 1: Prepare
    C->>A1: prepare(n)
    C->>A2: prepare(n)
    C->>A3: prepare(n)
    A1-->>C: promise(n, prev_value=null)
    A2-->>C: promise(n, prev_value=null)
    A3-->>C: promise(n, prev_value=null)
    Note over C: Got majority promises
    Note over C: Phase 2: Accept
    C->>A1: accept(n, value=X)
    C->>A2: accept(n, value=X)
    C->>A3: accept(n, value=X)
    A1-->>C: accepted
    A2-->>C: accepted
    Note over C: Majority accepted — value X is chosen

Two phases:

  1. Prepare: the proposer sends a proposal number n and asks for a promise. If a majority promise — and they report any previously-accepted value — the proposer must use that value rather than its own.
  2. Accept: the proposer asks acceptors to accept the value. If a majority accepts, the value is chosen.

Multi-Paxos amortizes Phase 1 across many decisions, leaving you with Phase 2 per decision — the same cost as Raft's normal-operation append.

Why people prefer Raft

Raft separates leader election from log replication into two explicit concerns. Paxos blurs them. Raft has fewer states and a "strong leader" model where only the leader appends; Paxos lets any node propose. The end result of either protocol is the same — but reading Raft pseudocode takes an afternoon; reading Paxos takes a year.

Other consensus protocols worth knowing

ProtocolNiche
ZabUsed by Zookeeper. Atomic broadcast, similar to Raft.
Multi-PaxosThe original "log replication" protocol. Used by Google's Chubby, Spanner.
EPaxosLeaderless Paxos. Each command can be proposed by any node. Fast when commands don't conflict.
Viewstamped ReplicationCousin of Paxos (predates it independently — Oki & Liskov published at PODC 1988, Lamport's tech report followed in 1989). Conceptually similar to Raft; MongoDB's current replication protocol (pv1, introduced in MongoDB 3.2) draws from Raft ideas but is not a pure Raft implementation.
PBFT, HotStuffByzantine fault tolerance — survives malicious nodes. Blockchains use these.

Where you'll meet consensus in production

flowchart TD
    CON[Consensus protocols] --> KV[Coordination services<br/>etcd, Zookeeper, Consul]
    CON --> DB[Distributed databases<br/>CockroachDB, Spanner, TiKV, MongoDB]
    CON --> Q[Message brokers<br/>Kafka, Pulsar]
    CON --> SM[Service mesh control planes<br/>Istio, Linkerd]
    CON --> SCH[Schedulers<br/>Kubernetes etcd-backed]
    style CON fill:#ff6b1a,color:#0a0a0f

Any time a system needs to agree on "who's the leader," "what's the next sequence number," "what's the membership of this cluster," or "what's the most recently committed transaction" — it's running consensus somewhere. Here's what that looks like in three specific systems.

etcd / Zookeeper

The classic pattern: a tiny cluster runs consensus, and everyone else asks etcd who the leader is. Kubernetes works exactly this way — your kubelets don't run Raft; etcd does, and they read from etcd.

CockroachDB

Each range of keys (512 MiB of contiguous keyspace by default, raised from 64 MiB in v22.1) has its own Raft group. A large cluster may have tens of thousands of ranges, each independently electing a leader and agreeing on its slice of the keyspace. A single node can be a member of hundreds of thousands of Raft groups simultaneously; CockroachDB coalesces heartbeats so network overhead scales with node count, not range count. This per-range consensus is how CockroachDB achieves serializable transactions across shards.

Kafka's Controller (KRaft)

Older Kafka outsourced its controller metadata to Zookeeper. Modern Kafka runs Raft internally for the controller — the component that decides partition leadership. Removing the Zookeeper dependency simplified operations and gave Kafka control over its own failover timing.

Performance: what consensus costs you

Every committed write requires a round trip to a majority of nodes. In a 3-node cluster:

Write latency ≥ network RTT to slowest of 2 followers.

Within a region that's 1–5 ms. Across regions it's 30–80 ms or more, which is why globally-consistent systems like Spanner are notably slower on writes and why most architectures keep a consensus group entirely within one region.

Throughput is bounded by leader bandwidth and the speed of fsync. Raft must fsync (or fdatasync) each batch of entries to disk before acknowledging — that's the durability promise. What the hardware actually gives you:

StorageApproximate fdatasync latencyEffective throughput per Raft group
Spinning disk (HDD)~10 ms~100 entries/sec (unbatched)
SATA SSD~0.5–2 ms~500–2,000 entries/sec (unbatched)
NVMe SSD~0.1–0.5 ms~2k–10k entries/sec (unbatched)

Production systems like etcd and TiKV batch many client requests into a single Raft round-trip to amortize the fsync cost. With batching, etcd benchmarks report 30k+ ops/sec on a 3-node cluster. The key insight for interviews: the fsync, not the network, is often the bottleneck in single-region deployments.

Common misconceptions

These come up in interviews constantly. Worth having crisp answers ready.

"Three nodes means we can lose two"

No. Quorum is (N/2)+1. For N=3, quorum is 2, so you can lose one node and still write. Lose two and the cluster stops accepting writes and linearizable reads — you lose quorum entirely. Any entries that were committed before the failure are safe and will not be lost; only entries that were in-flight and never committed may be lost.

flowchart LR
    subgraph "N=3 cluster"
        N1[Node 1<br/>Leader] --- N2[Node 2]
        N1 --- N3[Node 3]
    end
    subgraph "N=5 cluster"
        M1[Node 1<br/>Leader] --- M2[Node 2]
        M1 --- M3[Node 3]
        M1 --- M4[Node 4]
        M1 --- M5[Node 5]
    end
    N1 -->|quorum = 2| Q3[Tolerates 1 failure]
    M1 -->|quorum = 3| Q5[Tolerates 2 failures]
    style N1 fill:#ff6b1a,color:#0a0a0f
    style M1 fill:#ff6b1a,color:#0a0a0f
    style Q3 fill:#15803d,color:#fff
    style Q5 fill:#15803d,color:#fff

"More nodes means better availability"

Not for write availability. More nodes means larger quorums, which means more nodes can be slow before writes stall. Five nodes lets you survive two failures, but writes wait for three acknowledgments. Most production clusters use 3 or 5 nodes — rarely more.

"Consensus means high availability"

Consensus protocols stop accepting writes during partitions where they lack quorum. They are CP in CAP terms — consistency and partition-tolerance, at the cost of availability when a partition takes out a majority.

"Multi-region Raft is fine"

It's possible, but writes wait for cross-region RTT. Spanner does it with atomic clocks and TrueTime; most others can't. Without those tricks, multi-region Raft means writes measured in tens of milliseconds rather than single digits.

Operational pitfalls

These are the things that bite teams running Raft-based systems in production.

Cluster reconfiguration

Adding or removing a node during a partition is dangerous — you can temporarily create two disjoint majorities. Raft has a specific "joint consensus" mode for safe reconfiguration. Don't roll your own version of this.

Disk failures

Raft assumes the disk is reliable enough that committed entries don't disappear. Bit-rot, partial fsync, cosmic rays — real-world disks lose data. Production deployments need ECC RAM, careful fsync configuration, and possibly RAID or multiple disks under the WAL.

Snapshotting

The log grows forever. Every Raft system has a snapshot mechanism: periodically dump the full state machine state to disk, then truncate the log. Snapshots must be replicated to followers that are too far behind to catch up from the log alone.

Leader lease and stale reads

By default, a Raft leader can serve linearizable reads only by going through the log — a no-op commit that confirms it's still the leader. Some implementations use a leader lease instead: the leader assumes it remains leader for a bounded time interval, enabling faster local reads without a log round-trip.

The catch is clock drift. If clocks drift beyond the lease bound, a partitioned old leader may serve stale reads while a new one has already been elected. Systems that use leases — etcd, CockroachDB — must enforce bounded clock skew. CockroachDB defaults to a 500 ms maximum clock offset and will shut a node down if it exceeds 80% of that bound; Cockroach Labs recommends 250 ms for multi-region clusters. GPS/PTP gives sub-millisecond accuracy (Spanner's TrueTime) when tighter bounds are needed.

Clock drift

Raft doesn't depend on clocks for safety, but election timeouts are clock-based. A node with a wildly skewed clock can trigger spurious elections. Leader-lease correctness additionally requires bounded clock drift across all nodes — it's a softer dependency, but a real one.

A reusable diagram for interviews

When you say "we'll use a leader to coordinate," draw this:

flowchart TD
    C[Clients] --> L[(Leader)]
    L -->|Raft replicate| F1[(Follower)]
    L -->|Raft replicate| F2[(Follower)]
    L -. heartbeat .-> F1
    L -. heartbeat .-> F2
    Note["Quorum = 2 of 3<br/>Leader elected via Raft<br/>Writes durable when majority ack"]
    style L fill:#ff6b1a,color:#0a0a0f

Then explain what the leader coordinates (writes, sequence numbers, schema changes) and how a new leader is elected (term-based vote on log freshness).

Things you should now be able to answer

  • Why doesn't a simple "first one to claim leader wins" work?
  • What does a Raft term prevent?
  • Why is a Raft candidate required to have an up-to-date log to win?
  • When is a Raft entry considered committed?
  • What does consensus cost you in terms of latency and availability?
  • Where does Kubernetes run a Raft consensus?
  • What is a leader lease, and when can it serve stale reads?
  • Why does CockroachDB coalesce Raft heartbeats, and what problem does that solve?

Further reading

  • Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm (Raft)" (2014) — the paper, very readable
  • Lamport, "Paxos Made Simple" (2001) — accessible Paxos overview
  • The Raft visualizer: raft.github.io/raftscope/
  • Designing Data-Intensive Applications by Martin Kleppmann (chapter 9)
  • CAP theorem deep dive
// FAQ

Frequently asked questions

What is a Raft term and why does it prevent split brain?

A term is a monotonically increasing integer that identifies a logical epoch in the cluster. Each term has at most one leader, and any node that sees a higher term anywhere in the system immediately steps down to follower — no argument, no delay. Even if two nodes simultaneously believe they are leader, they hold different terms, so the lower-term node steps down the moment it hears the higher term.

How many node failures can a 3-node and a 5-node Raft cluster each tolerate?

A 3-node cluster requires a quorum of 2 and tolerates exactly 1 failure. A 5-node cluster requires a quorum of 3 and tolerates exactly 2 failures. Losing more nodes than that causes the cluster to stop accepting writes and linearizable reads entirely, though any entries committed before the failure are safe and will not be lost.

What is the write latency cost of running a Raft consensus group across multiple regions?

Every committed write requires a round trip to a majority of nodes, so cross-region RTT dominates write latency. The article puts that cost at 30–80 ms or more for cross-region hops versus 1–5 ms within a single region. This is why most architectures keep the entire Raft group within one region, and why globally-consistent systems like Spanner require atomic clocks to make multi-region consensus practical.

What is a leader lease in Raft and when can it serve stale reads?

A leader lease lets the current leader assume it remains leader for a bounded time interval and serve local reads without a log round-trip, avoiding the cost of a no-op commit. The risk is clock drift: if clocks skew beyond the lease bound, a partitioned old leader may serve stale reads while a new leader has already been elected. Systems using leases must enforce bounded clock skew — CockroachDB defaults to a 500 ms maximum clock offset and shuts a node down if it exceeds 80% of that bound.

When should I use EPaxos instead of Raft?

EPaxos is a leaderless variant of Paxos where any node can propose a command, making it fast when commands do not conflict. Raft uses a strong-leader model where only the elected leader accepts writes, which simplifies reasoning and implementation significantly. EPaxos trades correctness simplicity for lower write latency in workloads with low command conflict, but it is far harder to implement and reason about than Raft.

// RELATED

You may also like