~/articles/design-distributed-lock
◆◆◆Advancedasked at Googleasked at Amazon

Design a Distributed Lock / Coordination Service (ZooKeeper / etcd)

Provide mutual exclusion and coordination across machines safely. Consensus-backed locks, leases, fencing tokens, and why a lock without fencing is unsafe.

21 min read2026-04-01Ironclad Academy
// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

A distributed lock does exactly what a threading mutex does — ensure only one process can execute a critical section at a time — except the processes live on different machines and cannot share memory. etcd and ZooKeeper are the production implementations you will encounter most often: both power enormous amounts of cloud-native infrastructure, including Kubernetes' control plane (which depends on etcd for all cluster state) and Hadoop/HBase coordination (which historically relied on ZooKeeper). The pattern appears anywhere two services must not step on the same shared resource simultaneously: inventory reservation, payment processing, database schema migrations, leader election.

The conceptually simple version — "write a flag to a database, check it before acting" — breaks immediately in a distributed setting. A lock holder can pause for 30 seconds due to a JVM garbage-collection stop-the-world, its lease can expire while it is frozen, a second client can acquire the same lock, and when the first client wakes up it has no idea that anything changed. Both processes now believe they hold the lock. This paused-process problem is the core engineering tension: no lock service can reliably distinguish a slow client from a dead one.

The second tension is the service itself. If the lock service uses async replication, a leader failure can produce a split-brain where two nodes simultaneously grant the same lock. The service must be linearizable — meaning every granted lock is globally agreed upon by a majority quorum before any client is told "you have it." That requires a consensus algorithm (Raft in etcd, Zab in ZooKeeper), which makes the coordination service a CP system: under a network partition, the minority side refuses to serve any locks rather than risk granting a conflicting one.

Getting distributed locks right under partial failure requires understanding consensus, lease semantics, and the precise moment at which a lock holder is no longer authoritative. This article builds the design from first principles.

Functional requirements

  • acquire(lock_name, ttl) → granted or blocked. Returns a fencing token with every grant.
  • release(lock_name, token) → releases the lock if the caller holds it.
  • renew(lock_name, token, new_ttl) → extends the lease before it expires.
  • Automatic expiry — if the holder dies or fails to renew, the lock is released after ttl seconds.
  • Watch / notify — waiters are notified the moment the lock is released (no polling).
  • Key-value store primitives for config and service discovery (the coordination service doubles as general-purpose config store).

Non-functional requirements

  • Linearizability — there is a single, globally-agreed order of lock grants. No two clients ever simultaneously believe they hold the same lock.
  • High availability — the service must survive the failure of a minority of its nodes without operator intervention.
  • Low latency — lock acquire should complete in single-digit milliseconds in a single data center.
  • Herd management — a single lock release should not simultaneously wake up thousands of waiting clients and DDoS the service.

Capacity estimation

DimensionEstimateHow we got there
Cluster size3 or 5 nodesOdd number required for quorum majority
Lock ops/sec40,000 ops/sec4,000 services × 10 locks/sec each
Quorum acks needed (3-node)2 acks per write(n+1)/2 = (3+1)/2 = 2
Raft write latency (single DC)1–5 msSub-ms under light load on SSDs; hardware-dependent
Leader throughput40,000 writes/sec (all from leader)Followers do not serve writes; etcd benchmarks show 30,000+ req/sec under heavy load, ~44,000 QPS on leader-only writes (3-node GCE cluster)
Watch fan-out per release500 KB burst500 waiters × 1 KB/event (key + value + revision)
Peak egress from leader50 MB/s100 lock-releases/sec × 500 KB burst
Typical lease TTL10–30 secondsMust exceed expected critical-section duration; partition detectable within TTL
Max clock skew assumed< 1 s (NTP)Well within TTL margins
Live lease memory (1M leases)200 MB in-memory1M leases × 200 bytes each; replicated across quorum

Takeaway: A 3-node cluster can handle ~40,000 lock ops/sec with single-digit-ms latency and ~200 MB of in-memory state — well within reach of commodity hardware, with watch egress being the one burst to manage carefully.

Why distributed locks are subtle — the paused-process problem

Start with the obvious broken approach.

The naive approach, and why it fails

A client acquires a lock by writing SET lock:billing clientA NX EX 30 to a key-value store. The lock expires in 30 seconds. Simple. The trouble is that "30 seconds have not passed" and "this client is still doing useful work" are two entirely different things:

t=0   Client A acquires lock (TTL=30s)
t=25  Client A enters a GC pause (JVM stop-the-world)
t=31  TTL expires; client A's lease is gone
t=32  Client B acquires the same lock
t=32  Client B writes to the billing database
t=40  Client A's GC pause ends; client A resumes
t=40  Client A also writes to the billing database — DOUBLE WRITE

This scenario is not exotic. Production GC pauses of 30+ seconds are observed on JVM workloads. The process does not know time passed. It woke up believing it still holds the lock.

sequenceDiagram
    participant A as Client A
    participant Lock as Lock Service
    participant DB as Protected Resource

    A->>Lock: acquire("billing", ttl=30s)
    Lock-->>A: granted, token=41
    Note over A: *** GC pause 35 seconds ***
    Note over Lock: TTL expires at t=30
    participant B as Client B
    B->>Lock: acquire("billing", ttl=30s)
    Lock-->>B: granted, token=42
    B->>DB: write (token=42) ← succeeds
    Note over A: GC pause ends at t=40
    A->>DB: write (token=41) ← UNSAFE without fencing

The fix: fencing tokens

Martin Kleppmann's "Designing Data-Intensive Applications" (and his widely-read 2016 blog post) articulates this precisely: the protected resource must be the last line of defense. The lock service issues a monotonically increasing fencing token with every grant. The protected resource rejects any write whose token is strictly less than the maximum token it has seen — the current lock holder, with the highest token, may write freely; older tokens are rejected.

Lock service issues tokens: 39, 40, 41, 42, ...
Client A holds token 41 → pauses
Client B acquires → gets token 42 → writes to DB with token=42
DB records: max_token_seen = 42
Client A wakes → tries to write with token=41
DB: 41 < 42 → REJECT

The fencing token provides safety even when the lock service cannot reliably detect that client A is dead. The resource enforces it. This is why etcd's leases include a monotonically increasing revision number that clients can use as a fencing token — specifically, etcd's create_revision on the lock key is assigned once at lock acquisition and remains stable for the lifetime of that grant.

The token must be monotonically increasing and assigned atomically with the lock grant. When the service uses a replicated log (Raft/Zab), this comes for free — the log index is already monotone.

The service must itself be linearizable — why consensus is required

If the lock service itself can disagree about who holds the lock, all the above reasoning collapses. You cannot build a correct distributed lock on a primary with async replicas. If the primary fails mid-write and the replica hasn't received the grant, the new primary may re-grant the same lock to a different client. The two clients both have "valid" grants from different primaries — split brain.

The solution is consensus: a majority quorum must agree on every write before it is acknowledged. This is the Raft or Paxos insight: a write is only committed when floor(n/2) + 1 nodes have persisted it. A new leader is only elected when it can prove it has all committed writes. The two situations cannot co-exist.

flowchart TD
    CLIENT[Client] -->|"write: grant lock to A"| LDR[Leader]
    LDR -->|"AppendEntries RPC"| F1[Follower 1]
    LDR -->|"AppendEntries RPC"| F2[Follower 2]
    F1 -->|ack| LDR
    F2 -->|ack| LDR
    LDR -->|"quorum = 2/3 nodes (leader + 1 ack)\n→ commit"| LDR
    LDR -->|"success (token=42)"| CLIENT
    style LDR fill:#ff6b1a,color:#0a0a0f
    style F1 fill:#0e7490,color:#fff
    style F2 fill:#0e7490,color:#fff
    style CLIENT fill:#a855f7,color:#fff

This makes the coordination service CP in CAP terms — it chooses Consistency over Availability during a partition. If the leader is isolated from a majority of followers (say, a network partition separates 2 of 5 nodes), those 2 nodes cannot form a quorum, cannot elect a new leader, and cannot serve writes. Lock acquires from clients on the minority side will block or fail. This is the correct behavior: refusing a lock is far better than granting one that might be simultaneously granted on the other side.

Cluster sizing:

Cluster sizeMajority quorumCan tolerate failures
321
532
743

Larger clusters add fault tolerance but increase write latency (more followers to replicate to). 3- and 5-node clusters cover almost all production use cases.

Building up to the design

V1: Single-node key-value store with TTLs

acquire("billing") → SET lock:billing clientA NX EX 30
release("billing") → if GET lock:billing == clientA: DEL lock:billing

This works locally. The problems are immediate: no replication means a single node failure loses all lock state, there are no fencing tokens, and clients have to poll to discover when a lock is released.

V2: Add replication — and why async isn't enough

Adding a read replica for HA sounds like progress, but async replication introduces the split-brain risk described above. A failover can roll back a committed grant, leaving two clients with valid-looking leases. Async replication is unusable for correctness here.

V3: Synchronous replication — majority write quorum

Replace async replication with a Raft-based replicated state machine. A write is only committed when acknowledged by a majority of nodes. Failover is safe because the new leader is guaranteed to have all committed entries — linearizable key-value operations, no split brain.

The next gap: clients still can't find out when a lock is released without polling. Every 100ms check from 500 waiters is 5000 QPS of idle traffic, and the release-to-acquire lag is bounded by the poll interval rather than by network latency.

V4: Add server-push watches

The coordination service tracks interest: client B watches key lock:billing. When the key is deleted (lock released), the leader pushes a notification to all registered watchers. No polling, sub-millisecond notification latency.

But if 500 clients are waiting on one lock, all 500 get notified simultaneously and all try to acquire at once — 500 Raft writes in a sudden burst. The thundering herd.

V5: Herd mitigation — sequential waiters or randomized backoff

Two approaches work well:

  1. Sequential waiter queue (ZooKeeper's model): clients create sequential ephemeral nodes under the lock path. Each client watches only the node immediately before it. Only the client with the lowest sequence number holds the lock. On release, only one waiter (the next-in-queue) is notified.

  2. Randomized retry with backoff: clients receive the watch notification but retry after rand(0, base_delay * 2^attempt) ms. Simpler to implement; distributes the burst.

ZooKeeper's sequential-node approach eliminates the herd entirely. etcd clients typically implement the randomized backoff at the client library level.

flowchart LR
    V1["V1: single node + TTL\nno durability"] --> V2["V2: async replica\nsplit brain risk"]
    V2 --> V3["V3: Raft quorum\nlinearizable, no watches"]
    V3 --> V4["V4: + server watches\nno polling, herd risk"]
    V4 --> V5["V5: + sequential nodes\nor backoff — production"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f

ZooKeeper's primitives

ZooKeeper exposes a file-system-like namespace of znodes. Three properties of znodes that make coordination elegant:

Ephemeral nodes — deleted automatically when the creating session disconnects. Session liveliness is maintained via periodic heartbeats. If the client crashes, ZooKeeper's session times out and the ephemeral node is deleted. Session timeout is negotiated between client and server, bounded by minSessionTimeout (default: 2× tickTime) and maxSessionTimeout (default: 20× tickTime); with the default tickTime of 2000 ms this gives a range of 4–40 s, though both bounds are operator-configurable. This is the TTL mechanism for ZooKeeper-based locks.

Sequential nodes — appending SEQUENTIAL to a create call appends a monotonically increasing integer to the node name. create /locks/billing-, with SEQUENTIAL, might return /locks/billing-0000000042. The sequence number is the fencing token.

Watches — a one-shot notification callback. A client calls exists("/locks/billing-0000000041", watch=true). When that node is deleted, ZooKeeper delivers the notification exactly once to the watcher.

Building a distributed lock with ZooKeeper

acquire(lockPath):
  1. create(lockPath + "/lock-", EPHEMERAL | SEQUENTIAL)
     → returns e.g. /locks/billing/lock-0000000042
  2. children = getChildren(lockPath)
  3. if my node has the lowest sequence number:
       → I hold the lock. Done.
  4. else:
       find the next-lowest node (predecessor)
       set watch on predecessor: exists(predecessor, watch=true)
       wait for watch event
       goto 2

release(lockPath):
  delete(my_node)   ← triggers watch on successor

The elegance is that each waiter watches exactly one node — its immediate predecessor. No matter how many clients are queued, a lock release wakes exactly one of them. And because the ephemeral node disappears automatically when the session drops, a client crash releases the lock without any explicit cleanup.

flowchart TD
    ROOT["/locks/billing/"]
    ROOT --> N41["lock-0000000041\n(Client A — holds lock)"]
    ROOT --> N42["lock-0000000042\n(Client B — watches 41)"]
    ROOT --> N43["lock-0000000043\n(Client C — watches 42)"]
    N42 -.->|"watch on"| N41
    N43 -.->|"watch on"| N42
    style N41 fill:#ff6b1a,color:#0a0a0f
    style N42 fill:#0e7490,color:#fff
    style N43 fill:#0e7490,color:#fff
    style ROOT fill:#a855f7,color:#fff

When Client A releases and deletes lock-0000000041, only Client B is notified. Client C keeps watching its predecessor and doesn't hear a thing. One wakeup, one new Zab write.

This guarantees:

  • Only one holder (the lowest-seq node).
  • No herd — each waiter watches only its immediate predecessor.
  • Automatic release on crash (ephemeral node disappears with session).
  • Fencing token is the sequence number — the protected resource can compare them.

Other coordination patterns ZooKeeper enables

PatternZooKeeper primitive used
Distributed lockSequential + ephemeral nodes
Leader electionSame as lock — first to hold is leader
Service discoveryEphemeral nodes under a service path; watchers track live members
Config distributionPersistent nodes; watchers notify on change
Barrier / two-phase commitCombination of persistent and ephemeral nodes

etcd as the modern alternative

etcd uses Raft (not Zab), exposes a simpler key-value API, and is the backing store for Kubernetes. Its lock semantics are provided through leases (a TTL-bearing object whose expiry deletes all keys associated with it) and the Txn (transaction) API.

# Acquire lock via Compare-And-Swap
Txn:
  IF   key "lock:billing" NOT EXISTS
  THEN PUT "lock:billing" "clientA" WITH LEASE lease_id
  ELSE (fail: someone else holds it)

# Lease keeps the key alive; client renews periodically
LeaseKeepAlive(lease_id)   ← background goroutine

The create_revision of the key (the etcd-global revision at the moment the key was first created, assigned once and stable for the lifetime of a single lock grant) serves as the fencing token. The client must extract this revision and pass it explicitly to the protected resource with each write; etcd itself cannot enforce the fencing token at external resources. The protected resource is responsible for tracking the maximum revision it has processed and rejecting any operation presenting a lower one.

etcd's Watch API is range-based: watch all keys with prefix lock: and receive events when any matching key is created, modified, or deleted.

Failure mode deep dive

Client pauses (GC / OS scheduling)

As analyzed above, the correct response is fencing tokens on the protected resource. No lock service can reliably distinguish "client is slow" from "client is dead" — the fencing token offloads that burden to the resource.

Network partition — minority can't make progress

If a 3-node cluster partitions into [leader + 1 follower] and [1 follower], the minority (single follower) cannot form a quorum. It will refuse to serve writes and will not elect a new leader. Clients directed at the minority will see timeouts. This is correct behavior — safety is preserved at the cost of availability on the minority side. See the leader election and consensus article for partition behavior details.

Leader failure during a write

The leader crashes after appending to its own log but before receiving quorum acks. The entry is not committed. The new leader's first action is to bring followers up to date — uncommitted entries may be overwritten. The client's acquire request gets an error or timeout and must retry. No lock is silently lost.

Split brain (double leader)

Raft prevents this structurally: a candidate must receive votes from a majority to become leader. Two nodes cannot simultaneously each hold a majority of votes in the same term. The term number (Raft) or epoch (Zab) is included in every operation, so stale leaders that reconnect after isolation are immediately rejected.

TTL too short — premature expiry under load

If the lock service is overloaded and lease renewals are delayed, a client's lease may expire before it finishes its critical section. Set TTL generously relative to the expected critical-section duration, monitor renewal latency, and alert before TTL is breached. Even then, the fencing token is the backstop: if the lease expires, the resource rejects any write with a stale token regardless.

Watch notification storm (thundering herd)

One lock release → N watchers notified → N simultaneous acquire attempts → N Raft writes in a burst. ZooKeeper's sequential-node approach eliminates this (only 1 watcher per node). For etcd clients using key watches, the client library should implement randomized exponential backoff.

Full architecture

flowchart TD
    subgraph clients["Client Fleet"]
        C1[Service A]
        C2[Service B]
        C3[Service C]
    end

    subgraph coord["Coordination Cluster (3 nodes)"]
        LDR[Leader<br/>etcd / ZK]
        FLW1[Follower 1]
        FLW2[Follower 2]
        LDR <-->|Raft / Zab replication| FLW1
        LDR <-->|Raft / Zab replication| FLW2
    end

    subgraph protected["Protected Resources"]
        DB[(Database)]
        STORE[(Object Store)]
    end

    C1 -->|"acquire(billing, ttl=20s)"| LDR
    C2 -->|"acquire(billing) — WAIT"| LDR
    C3 -->|"watch(billing)"| LDR
    LDR -->|"granted token=42"| C1
    LDR -->|"watch event on release"| C3
    C1 -->|"write, fencing_token=42"| DB
    DB -->|"reject if token < 42"| STALE_CLIENT[Stale Client]

    style LDR fill:#ff6b1a,color:#0a0a0f
    style FLW1 fill:#0e7490,color:#fff
    style FLW2 fill:#0e7490,color:#fff
    style DB fill:#15803d,color:#fff
    style C1 fill:#a855f7,color:#fff
    style C2 fill:#a855f7,color:#fff
    style C3 fill:#a855f7,color:#fff

The Redis Redlock debate

Redis is widely used for distributed locks via the SET NX EX pattern on a single instance. When you have multiple Redis nodes, the Redlock algorithm (proposed by Salvatore Sanfilippo / antirez) extends this: acquire the lock on a majority of N independent Redis nodes using clock-based TTLs. If the majority is achieved within a time window, the lock is considered held.

Martin Kleppmann published a detailed critique arguing that Redlock is fundamentally unsafe because it provides no facility for generating fencing tokens — without fencing tokens, the protected resource cannot distinguish a stale operation from a current one, regardless of how the lock was acquired. On top of that, Redlock's safety depends on wall-clock time assumptions that do not hold in practice: a process pause (GC, OS scheduling) or a clock jump (NTP step adjustment, VM migration) can cause the client to believe its lock is still valid after it has already expired on a majority of Redis nodes.

antirez responded that Kleppmann's critique applied to an adversarial model beyond typical production use.

The practical takeaway for an interview: if you need correctness guarantees for money, inventory, or any state where double-write is catastrophic, use a proper consensus-backed service (etcd, ZooKeeper) with fencing tokens. If you need best-effort mutual exclusion for cache warming or similar idempotent work where occasional races are acceptable, Redis SET NX EX on a single primary is fast and operationally simple.

Read scaling — stale reads vs linearizable reads

All writes go through the leader. Reads are more nuanced:

Read modeWho serves itConsistencyLatency
Linearizable readLeader onlyFully linearizableLeader RTT
Serializable read (etcd: --consistency=s)Any follower (local state, no leader contact)Stale — may return data behind the current leaderSlightly lower than linearizable
ZooKeeper getDataAny serverMonotonic (client-ordered)Follower RTT

ZooKeeper followers serve reads from their local state — this means a follower may serve a stale value if replication has not caught up. ZooKeeper addresses this with the sync call (forces the server to contact the leader before serving the read) for applications that need linearizable reads. etcd's default behavior is linearizable — the leader forwards the read to itself using a lease-based read index mechanism (the leader confirms its lease is still valid before serving).

For lock-related operations (is the lock held? who holds it?), always use linearizable reads. Stale reads from a follower can give the wrong answer.

Lock lease renewal — the keep-alive loop

A well-implemented lock client runs a background goroutine to renew the lease. The renewal interval is typically TTL / 3, giving two full renewal windows of slack before the lease would expire:

renew interval = TTL / 3    # renew at 1/3 of TTL remaining

goroutine keepalive:
  while holding_lock:
    sleep(TTL / 3)
    if now() - lock_acquired > TTL * 0.8:
      panic("approaching TTL expiry, cannot guarantee exclusivity")
    renew(lock_name, token, TTL)
flowchart LR
    ACQ[Acquire lock\nt=0] --> KA1["KeepAlive\nt = TTL/3"]
    KA1 --> KA2["KeepAlive\nt = 2×TTL/3"]
    KA2 --> REL[Release lock]
    ACQ -.->|"lease expires at t=TTL\nif no renewals"| EXP[Lease expires]
    style ACQ fill:#15803d,color:#fff
    style REL fill:#15803d,color:#fff
    style EXP fill:#ff2e88,color:#fff
    style KA1 fill:#ffaa00,color:#0a0a0f
    style KA2 fill:#ffaa00,color:#0a0a0f

If the renewal fails (service unreachable), the client should stop performing writes to the protected resource immediately, attempt a few retries with exponential backoff, and if still failing before TTL expiry, abort and fail the critical section. Failing the critical section is better than silently continuing without a lock.

Storage choices

WhatStoreWhy
Lock state (live leases)In-memory on quorum nodes, replicated via Raft/Zab logFast access; durability comes from the replicated log
Watch registrationsIn-memory on leaderWatches are not persisted; reconnecting clients re-register
Raft log / WALLocal disk (SSD) on each nodeDurability, crash recovery
SnapshotsLocal disk + object storeFaster node bootstrap; WAL compaction
Historical revisions (etcd)bbolt (B+ tree embedded in etcd — a fork of Ben Johnson's BoltDB, maintained by the etcd-io organization)Supports range reads, watch history, compaction

Things to discuss in an interview

  • Why leases alone aren't enough — the paused-process problem and why fencing tokens are required at the resource level.
  • Why consensus is required — async replication allows split brain on failover; the lock service must be linearizable.
  • Quorum sizing trade-offs — 3 vs 5 nodes, failure tolerance vs write latency.
  • Herd effect on watch notifications — ZooKeeper's sequential-node pattern vs randomized backoff.
  • The Redis Redlock debate — when best-effort Redis locks are acceptable and when they are not.
  • Read linearizability — why stale reads from followers can give wrong answers for lock state.
  • Session / lease duration sizing — too short causes spurious expiry; too long delays reclamation of dead holders.

Things you should now be able to answer

  • Why can a process holding a valid lease still race with another lock holder?
  • What property must the protected resource implement to be safe against paused processes?
  • Why does a consensus-backed lock service refuse writes in a minority partition, and why is that correct?
  • What is the difference between ZooKeeper's ephemeral nodes and etcd's leases?
  • How does ZooKeeper's sequential-node lock algorithm eliminate the thundering herd?
  • Why are stale reads from a follower dangerous for lock operations?
  • When is Redis SET NX EX an acceptable lock implementation, and when is it not?

Further reading

  • Martin Kleppmann — "How to do distributed locking" (2016, martin.kleppmann.com) — the canonical treatment of fencing tokens and the Redlock critique.
  • "Designing Data-Intensive Applications," Chapter 8 and 9 — Kleppmann's book, sections on consensus and linearizability.
  • etcd documentation — "Distributed locks with etcd" and the concurrency package in the etcd Go client.
  • ZooKeeper Recipes — "Locks" section in the Apache ZooKeeper documentation — the sequential-node recipe described in this article.
  • "ZooKeeper: Wait-free coordination for internet-scale systems" — the original ZooKeeper paper (USENIX ATC 2010).
  • Leader election and consensus — the Raft and Paxos mechanics underlying the lock service.
  • "In Search of an Understandable Consensus Algorithm" (Ongaro & Ousterhout, 2014) — the Raft paper.
// FAQ

Frequently asked questions

What is a fencing token and why does a distributed lock require one?

A fencing token is a monotonically increasing number issued with every lock grant by the coordination service. It is required because a lock holder can pause — due to a JVM GC stop-the-world or OS scheduling — for longer than its TTL, wake up believing it still holds the lock, and corrupt shared state. The protected resource rejects any write whose token is strictly less than the maximum token it has seen, so even a stale client that wakes up after its lease expired is refused.

When should I use etcd or ZooKeeper instead of Redis Redlock for distributed locking?

Use a consensus-backed service like etcd or ZooKeeper when correctness is non-negotiable — money, inventory, or any state where a double write is catastrophic. Redlock provides no facility for fencing tokens and its safety depends on wall-clock assumptions that fail under process pauses or NTP clock jumps. Redis SET NX EX on a single primary is acceptable only for best-effort mutual exclusion over idempotent work like cache warming, where occasional races are tolerable.

How does ZooKeeper's sequential-node lock algorithm eliminate the thundering herd?

Each client creates a sequential ephemeral node under the lock path and then watches only the node immediately before it — its predecessor — rather than the lock key itself. When the current holder deletes its node, only the next-in-queue client is notified. No matter how many clients are queued, a single lock release produces exactly one wakeup and one new Zab write.

How many nodes does a Raft-based coordination cluster need, and how many failures can it tolerate?

A 3-node cluster requires a majority quorum of 2 and tolerates 1 failure; a 5-node cluster requires 3 and tolerates 2 failures. Larger clusters add fault tolerance but increase write latency because more followers must acknowledge each entry before it is committed. Three- and five-node clusters cover almost all production use cases.

What is the recommended lease renewal interval for a distributed lock client?

Renew at TTL divided by 3, which gives two full renewal windows of slack before the lease would expire. If renewal fails and the client cannot confirm the lease before TTL expiry, it should stop writing to the protected resource immediately, retry with exponential backoff, and abort the critical section if still failing — failing the critical section is safer than continuing without a lock.

// RELATED

You may also like