Design a Distributed Lock / Coordination Service (ZooKeeper / etcd)
Provide mutual exclusion and coordination across machines safely. Consensus-backed locks, leases, fencing tokens, and why a lock without fencing is unsafe.
The problem
A distributed lock does exactly what a threading mutex does — ensure only one process can execute a critical section at a time — except the processes live on different machines and cannot share memory. etcd and ZooKeeper are the production implementations you will encounter most often: both power enormous amounts of cloud-native infrastructure, including Kubernetes' control plane (which depends on etcd for all cluster state) and Hadoop/HBase coordination (which historically relied on ZooKeeper). The pattern appears anywhere two services must not step on the same shared resource simultaneously: inventory reservation, payment processing, database schema migrations, leader election.
The conceptually simple version ("write a flag to a database, check it before acting") breaks immediately in a distributed setting. A lock holder can pause for 30 seconds during a stop-the-world JVM garbage collection, its lease can expire while it is frozen, a second client can acquire the same lock, and when the first client wakes up it has no idea that anything changed. Both processes now believe they hold the lock. This paused-process problem is the core engineering tension: no lock service can reliably distinguish a slow client from a dead one.
The second tension is the service itself. If the lock service uses async replication, a leader failure can produce a split-brain where two nodes simultaneously grant the same lock. The service must be linearizable — meaning every granted lock is globally agreed upon by a majority quorum before any client is told "you have it." That requires a consensus algorithm (Raft in etcd, Zab in ZooKeeper), which makes the coordination service a CP system: under a network partition, the minority side refuses to serve any locks rather than risk granting a conflicting one.
Getting distributed locks right under partial failure requires understanding consensus, lease semantics, and the precise moment at which a lock holder is no longer authoritative. This article builds the design from first principles.
Functional requirements
- acquire(lock_name, ttl) → granted or blocked. Returns a fencing token with every grant.
- release(lock_name, token) → releases the lock if the caller holds it.
- renew(lock_name, token, new_ttl) → extends the lease before it expires.
- Automatic expiry: if the holder dies or fails to renew, the lock is released after ttl seconds.
- Watch / notify: waiters are notified the moment the lock is released (no polling).
- Key-value store primitives for config and service discovery (the coordination service doubles as general-purpose config store).
Non-functional requirements
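- Linearizability: there is a single, globally agreed order of lock grants, and the service never records two holders of the same lock at once. (A paused client may still believe it holds an expired lock, which is why fencing tokens are also needed.)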
- High availability — the service must survive the failure of a minority of its nodes without operator intervention.
- Low latency — lock acquire should complete in single-digit milliseconds in a single data center.
- Herd management — a single lock release should not simultaneously wake up thousands of waiting clients and DDoS the service.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
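| Cluster size | 3 or 5 nodes | Odd sizes give the most fault tolerance per node; a 4-node cluster tolerates no more failures than a 3-node one |
| Lock ops/sec | 40,000 ops/sec | 4,000 services × 10 locks/sec each |
| Quorum acks needed (3-node) | 2 acks per write | floor(n/2) + 1 = floor(3/2) + 1 = 2 (the leader's own write counts as one) |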
| Raft write latency (single DC) | 1–5 ms | Sub-ms under light load on SSDs; hardware-dependent |
| Leader throughput | 40,000 writes/sec (all from leader) | Followers do not serve writes; etcd benchmarks show 30,000+ req/sec under heavy load, ~44,000 QPS on leader-only writes (3-node GCE cluster) |
| Watch fan-out per release | 500 KB burst | 500 waiters × 1 KB/event (key + value + revision) |
| Peak egress from leader | 50 MB/s | 100 lock-releases/sec × 500 KB burst |
| Typical lease TTL | 10–30 seconds | Must exceed expected critical-section duration; partition detectable within TTL |
| Max clock skew assumed | < 1 s (NTP) | Well within TTL margins |
| Live lease memory (1M leases) | 200 MB in-memory | 1M leases × 200 bytes each; replicated across quorum |
Takeaway: A 3-node cluster can handle ~40,000 lock ops/sec with single-digit-ms latency and ~200 MB of in-memory state — well within reach of commodity hardware, with watch egress being the one burst to manage carefully.
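The arithmetic behind the table can be reproduced in a few lines. This is only a sketch: every input below is one of the table's own assumptions, and the variable names are made up for illustration.

```python
# Back-of-the-envelope figures; all inputs are the capacity table's assumptions.
services = 4_000
locks_per_service_per_sec = 10
lock_ops_per_sec = services * locks_per_service_per_sec

waiters_per_lock = 500
event_bytes = 1_000               # ~1 KB per watch event (key + value + revision)
releases_per_sec = 100
burst_bytes = waiters_per_lock * event_bytes
peak_egress_bytes_per_sec = releases_per_sec * burst_bytes

live_leases = 1_000_000
bytes_per_lease = 200
lease_memory_bytes = live_leases * bytes_per_lease

print(f"{lock_ops_per_sec:,} lock ops/sec")
print(f"{burst_bytes / 1e3:.0f} KB per release burst")
print(f"{peak_egress_bytes_per_sec / 1e6:.0f} MB/s peak egress")
print(f"{lease_memory_bytes / 1e6:.0f} MB of lease state")
```

Changing the number of waiters per lock is the quickest way to see why watch fan-out, not lease memory, is the figure to watch.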
Why distributed locks are subtle — the paused-process problem
Start with the obvious broken approach.
The naive approach, and why it fails
A client acquires a lock by writing SET lock:billing clientA NX EX 30 to a key-value store. The lock expires in 30 seconds. Simple. The trouble is that "30 seconds have not passed" and "this client is still doing useful work" are two entirely different things:
t=0 Client A acquires lock (TTL=30s)
t=25 Client A enters a GC pause (JVM stop-the-world)
t=30 TTL expires; client A's lease is gone
t=32 Client B acquires the same lock
t=32 Client B writes to the billing database
t=40 Client A's GC pause ends; client A resumes
t=40 Client A also writes to the billing database — DOUBLE WRITE
This scenario is not exotic. Production GC pauses of 30+ seconds are observed on JVM workloads. The process does not know time passed. It woke up believing it still holds the lock.
sequenceDiagram
participant A as Client A
participant Lock as Lock Service
participant DB as Protected Resource
A->>Lock: acquire("billing", ttl=30s)
Lock-->>A: granted, token=41
Note over A: *** GC pause 35 seconds ***
Note over Lock: TTL expires at t=30
participant B as Client B
B->>Lock: acquire("billing", ttl=30s)
Lock-->>B: granted, token=42
B->>DB: write (token=42) ← succeeds
Note over A: GC pause ends at t=40
A->>DB: write (token=41) ← UNSAFE without fencing
The fix: fencing tokens
Martin Kleppmann's "Designing Data-Intensive Applications" (and his widely-read 2016 blog post) articulates this precisely: the protected resource must be the last line of defense. The lock service issues a monotonically increasing fencing token with every grant. The protected resource rejects any write whose token is strictly less than the maximum token it has seen — the current lock holder, with the highest token, may write freely; older tokens are rejected.
Lock service issues tokens: 39, 40, 41, 42, ...
Client A holds token 41 → pauses
Client B acquires → gets token 42 → writes to DB with token=42
DB records: max_token_seen = 42
Client A wakes → tries to write with token=41
DB: 41 < 42 → REJECT
The fencing token provides safety even when the lock service cannot reliably detect that client A is dead, because the resource enforces it. In etcd the natural fencing token is a revision number rather than anything stored on the lease itself: the create_revision of the lock key is taken from etcd's monotonically increasing store revision, is assigned once at lock acquisition, and remains stable for the lifetime of that grant.
The token must be monotonically increasing and assigned atomically with the lock grant. When the service uses a replicated log (Raft/Zab), this comes for free — the log index is already monotone.
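A minimal sketch of the resource-side check, assuming an in-memory stand-in for the protected resource. `FencedResource` and `StaleTokenError` are hypothetical names, and a real database would have to perform the compare and the write atomically (for example inside one transaction).

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedResource:
    """Toy protected resource that enforces fencing tokens."""

    def __init__(self):
        self.max_token_seen = 0
        self.writes = []

    def write(self, token, payload):
        # Reject only tokens strictly lower than the highest one seen, so the
        # current holder can write repeatedly with the same token.
        if token < self.max_token_seen:
            raise StaleTokenError(f"token {token} < {self.max_token_seen}")
        self.max_token_seen = token
        self.writes.append((token, payload))


db = FencedResource()
db.write(42, "charge from client B")      # current holder
try:
    db.write(41, "charge from client A")  # client A wakes up after its pause
    rejected = False
except StaleTokenError:
    rejected = True
print("stale write rejected:", rejected)
```

The resource never needs to know who holds the lock or whether a lease has expired; comparing two integers is enough.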
The service must itself be linearizable — why consensus is required
If the lock service itself can disagree about who holds the lock, all the above reasoning collapses. You cannot build a correct distributed lock on a primary with async replicas. If the primary fails mid-write and the replica hasn't received the grant, the new primary may re-grant the same lock to a different client. The two clients both have "valid" grants from different primaries — split brain.
The solution is consensus: a majority quorum must agree on every write before it is acknowledged. This is the core guarantee of Raft and Paxos: a write is committed only when floor(n/2) + 1 nodes have persisted it, and a new leader takes over only after a majority has confirmed that it holds every committed write. Because any two majorities share at least one node, a committed grant and a leader that lacks it cannot coexist.
flowchart TD
CLIENT[Client] -->|"write: grant lock to A"| LDR[Leader]
LDR -->|"AppendEntries RPC"| F1[Follower 1]
LDR -->|"AppendEntries RPC"| F2[Follower 2]
F1 -->|ack| LDR
F2 -->|ack| LDR
LDR -->|"quorum = 2/3 nodes (leader + 1 ack)\n→ commit"| LDR
LDR -->|"success (token=42)"| CLIENT
style LDR fill:#ff6b1a,color:#0a0a0f
style F1 fill:#0e7490,color:#fff
style F2 fill:#0e7490,color:#fff
style CLIENT fill:#a855f7,color:#fff
This makes the coordination service CP in CAP terms: it chooses Consistency over Availability during a partition. If a network partition cuts 2 nodes of a 5-node cluster off from the other 3, those 2 nodes cannot form a quorum, cannot elect a new leader, and cannot commit writes, even if one of them was the leader before the partition. Lock acquires from clients on the minority side will block or fail. This is the correct behavior: refusing a lock is far better than granting one that might be simultaneously granted on the other side.
Cluster sizing:
| Cluster size | Majority quorum | Can tolerate failures |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
Larger clusters add fault tolerance but increase write latency (more followers to replicate to). 3- and 5-node clusters cover almost all production use cases.
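The sizing table follows directly from the majority rule. A small sketch (the function names are illustrative) also shows why even sizes buy nothing: a 4-node cluster needs 3 acks and still tolerates only 1 failure.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a quorum in an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a quorum is still reachable."""
    return n - majority(n)


for n in (3, 4, 5, 7):
    print(n, majority(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 7 4 3
```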
Building up to the design
V1: Single-node key-value store with TTLs
acquire("billing") → SET lock:billing clientA NX EX 30
release("billing") → if GET lock:billing == clientA: DEL lock:billing
This works locally. The problems are immediate: no replication means a single node failure loses all lock state, there are no fencing tokens, and clients have to poll to discover when a lock is released.
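To make V1 concrete, here is a sketch of the same semantics as an in-memory store. `SingleNodeLockStore` is a hypothetical class, not a Redis client, and the clock is injected so that expiry can be demonstrated without sleeping.

```python
class SingleNodeLockStore:
    """Toy V1 lock store: set-if-absent with a TTL, owner-checked release."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in seconds
        self._locks = {}      # name -> (owner, expires_at)

    def acquire(self, name, owner, ttl):
        now = self._clock()
        held = self._locks.get(name)
        if held is not None and held[1] > now:
            return False      # someone holds an unexpired lock
        self._locks[name] = (owner, now + ttl)
        return True

    def release(self, name, owner):
        held = self._locks.get(name)
        # Only the current, unexpired owner may delete the key.
        if held is not None and held[0] == owner and held[1] > self._clock():
            del self._locks[name]
            return True
        return False


now = [0.0]
store = SingleNodeLockStore(lambda: now[0])
a_got_it = store.acquire("billing", "clientA", ttl=30)
b_blocked = not store.acquire("billing", "clientB", ttl=30)
now[0] = 31.0                                   # client A's TTL has passed
b_got_it = store.acquire("billing", "clientB", ttl=30)
a_release = store.release("billing", "clientA") # A no longer owns the key
print(a_got_it, b_blocked, b_got_it, a_release)
```

Nothing in this store tells client A that its lock is gone, and nothing survives a restart of the process, which are exactly the gaps the later versions close.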
V2: Add replication — and why async isn't enough
Adding a replica for HA sounds like progress, but async replication introduces the split-brain risk described above. A failover can roll back a grant that the old primary already acknowledged, leaving two clients with valid-looking leases. Async replication is unusable for correctness here.
V3: Synchronous replication — majority write quorum
Replace async replication with a Raft-based replicated state machine. A write is only committed when acknowledged by a majority of nodes. Failover is safe because the new leader is guaranteed to have all committed entries — linearizable key-value operations, no split brain.
The next gap: clients still can't find out when a lock is released without polling. Every 100ms check from 500 waiters is 5000 QPS of idle traffic, and the release-to-acquire lag is bounded by the poll interval rather than by network latency.
V4: Add server-push watches
The coordination service tracks interest: client B watches key lock:billing. When the key is deleted (lock released), the service pushes a notification to every registered watcher. No polling, and the release-to-notify delay is bounded by network latency rather than by a poll interval.
But if 500 clients are waiting on one lock, all 500 get notified simultaneously and all try to acquire at once — 500 Raft writes in a sudden burst. The thundering herd.
V5: Herd mitigation — sequential waiters or randomized backoff
Two approaches work well:
- Sequential waiter queue (ZooKeeper's model): clients create sequential ephemeral nodes under the lock path. Each client watches only the node immediately before it. Only the client with the lowest sequence number holds the lock. On release, only one waiter (the next-in-queue) is notified.
- Randomized retry with backoff: clients receive the watch notification but retry after rand(0, base_delay * 2^attempt) ms. Simpler to implement; distributes the burst.
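ZooKeeper's sequential-node approach eliminates the herd entirely. The lock in etcd's Go concurrency package takes a similar approach, ordering waiters by the create revision of their keys, while clients that contend on a single lock key need randomized backoff in the client library.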
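The randomized-retry option can be sketched as a "full jitter" delay function. The 10 ms base and the 2-second cap are illustrative assumptions, and the cap itself is an addition that the formula above does not include.

```python
import random


def backoff_delay_ms(attempt, base_delay_ms=10.0, cap_ms=2_000.0, rng=random):
    """Pick a delay uniformly from [0, min(cap, base * 2^attempt)] milliseconds."""
    upper = min(cap_ms, base_delay_ms * (2 ** attempt))
    return rng.uniform(0.0, upper)


rng = random.Random(7)   # seeded only so the demo is repeatable
delays = [backoff_delay_ms(attempt, rng=rng) for attempt in range(5)]
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: retry in {delay:.1f} ms")
```

Drawing from the whole interval, rather than adding a small jitter to a fixed delay, is what spreads 500 simultaneous wakeups across the window.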
flowchart LR
V1["V1: single node + TTL\nno durability"] --> V2["V2: async replica\nsplit brain risk"]
V2 --> V3["V3: Raft quorum\nlinearizable, no watches"]
V3 --> V4["V4: + server watches\nno polling, herd risk"]
V4 --> V5["V5: + sequential nodes\nor backoff — production"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
ZooKeeper's primitives
ZooKeeper exposes a file-system-like namespace of znodes. Three properties of znodes make coordination elegant:
Ephemeral nodes: deleted automatically when the creating session ends, either because the client closes it or because it times out. Session liveness is maintained via periodic heartbeats. If the client crashes, ZooKeeper's session times out and the ephemeral node is deleted. Session timeout is negotiated between client and server, bounded by minSessionTimeout (default: 2× tickTime) and maxSessionTimeout (default: 20× tickTime); with the default tickTime of 2000 ms this gives a range of 4–40 s, though both bounds are operator-configurable. This is the TTL mechanism for ZooKeeper-based locks.
Sequential nodes: passing the SEQUENTIAL flag to a create call makes ZooKeeper append a monotonically increasing, zero-padded counter to the node name. create /locks/billing-, with SEQUENTIAL, might return /locks/billing-0000000042. The sequence number is the fencing token.
Watches: a one-shot notification callback. A client calls exists("/locks/billing-0000000041", watch=true). When that node is deleted, ZooKeeper delivers the notification exactly once to the watcher, and the client must set a new watch if it wants further events.
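The one-shot behaviour of watches is easy to model. The sketch below is a toy registry, not the ZooKeeper client API; `WatchRegistry`, `watch`, and `fire` are hypothetical names.

```python
from collections import defaultdict


class WatchRegistry:
    """Toy model of one-shot watches keyed by znode path."""

    def __init__(self):
        self._watchers = defaultdict(list)

    def watch(self, path, callback):
        self._watchers[path].append(callback)

    def fire(self, path, event):
        # Remove the watchers before calling them: each watch fires at most once.
        callbacks = self._watchers.pop(path, [])
        for callback in callbacks:
            callback(path, event)
        return len(callbacks)


events = []
registry = WatchRegistry()
path = "/locks/billing-0000000041"
registry.watch(path, lambda p, e: events.append((p, e)))
first = registry.fire(path, "deleted")    # delivers to the one watcher
second = registry.fire(path, "deleted")   # nothing registered any more
print(first, second, events)
```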
Building a distributed lock with ZooKeeper
acquire(lockPath):
  1. myNode = create(lockPath + "/lock-", EPHEMERAL | SEQUENTIAL)
       → returns e.g. /locks/billing/lock-0000000042
  2. children = getChildren(lockPath)
  3. if myNode has the lowest sequence number among children:
       → I hold the lock. Done.
  4. else:
       predecessor = child with the highest sequence number below mine
       if exists(predecessor, watch=true):    # sets a one-shot watch
         wait for the watch event
       goto 2    # re-check: the predecessor may have crashed rather than released
release(lockPath):
  delete(myNode)   ← triggers the watch held by the successor
The elegance is that each waiter watches exactly one node — its immediate predecessor. No matter how many clients are queued, a lock release wakes exactly one of them. And because the ephemeral node disappears automatically when the session drops, a client crash releases the lock without any explicit cleanup.
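The decision made in steps 2 to 4 of the recipe is a pure function of the child list, which makes it easy to sketch and test in isolation. The helper names are illustrative, and the node names are assumed to end in "-" followed by the sequence number, as in the example above.

```python
def sequence_of(node_name):
    """Extract the numeric suffix ZooKeeper appended to a sequential node."""
    return int(node_name.rsplit("-", 1)[1])


def lock_decision(children, my_node):
    """Return ("held", None) or ("wait", predecessor_to_watch)."""
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(my_node)
    if index == 0:
        return ("held", None)
    return ("wait", ordered[index - 1])


children = ["lock-0000000043", "lock-0000000041", "lock-0000000042"]
print(lock_decision(children, "lock-0000000041"))  # ('held', None)
print(lock_decision(children, "lock-0000000043"))  # ('wait', 'lock-0000000042')

# If the client that owns node 42 crashes, 43 is woken, re-reads the children,
# and ends up watching 41 instead of wrongly taking the lock.
after_crash = ["lock-0000000043", "lock-0000000041"]
print(lock_decision(after_crash, "lock-0000000043"))  # ('wait', 'lock-0000000041')
```

Sorting by the parsed integer rather than by the raw string keeps the ordering correct even if the prefixes differ between clients.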
flowchart TD
ROOT["/locks/billing/"]
ROOT --> N41["lock-0000000041\n(Client A — holds lock)"]
ROOT --> N42["lock-0000000042\n(Client B — watches 41)"]
ROOT --> N43["lock-0000000043\n(Client C — watches 42)"]
N42 -.->|"watch on"| N41
N43 -.->|"watch on"| N42
style N41 fill:#ff6b1a,color:#0a0a0f
style N42 fill:#0e7490,color:#fff
style N43 fill:#0e7490,color:#fff
style ROOT fill:#a855f7,color:#fff
When Client A releases and deletes lock-0000000041, only Client B is notified. Client C keeps watching its predecessor and doesn't hear a thing. One wakeup, one new Zab write.
This guarantees:
- Only one holder (the lowest-seq node).
- No herd — each waiter watches only its immediate predecessor.
- Automatic release on crash (ephemeral node disappears with session).
- Fencing token is the sequence number — the protected resource can compare them.
Other coordination patterns ZooKeeper enables
| Pattern | ZooKeeper primitive used |
|---|---|
| Distributed lock | Sequential + ephemeral nodes |
| Leader election | Same as lock — first to hold is leader |
| Service discovery | Ephemeral nodes under a service path; watchers track live members |
| Config distribution | Persistent nodes; watchers notify on change |
| Barrier / two-phase commit | Combination of persistent and ephemeral nodes |
etcd as the modern alternative
etcd uses Raft (not Zab), exposes a simpler key-value API, and is the backing store for Kubernetes. Its lock semantics are provided through leases (a TTL-bearing object whose expiry deletes all keys associated with it) and the Txn (transaction) API.
# Acquire lock via compare-and-swap
Txn:
  IF   key "lock:billing" NOT EXISTS
  THEN PUT "lock:billing" "clientA" WITH LEASE lease_id
  ELSE (fail: someone else holds it)
# Lease keeps the key alive; client renews periodically
LeaseKeepAlive(lease_id)   ← background goroutine
The create_revision of the key (the etcd-global revision at the moment the key was first created, assigned once and stable for the lifetime of a single lock grant) serves as the fencing token. The client must extract this revision and pass it explicitly to the protected resource with each write; etcd itself cannot enforce the fencing token at external resources. The protected resource is responsible for tracking the maximum revision it has processed and rejecting any operation presenting a lower one.
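The interplay of the create-if-absent transaction, the lease, and the revision counter can be modelled with a toy store. `MiniKV` is a hypothetical in-memory stand-in, not the etcd client API; it only mimics the idea that every mutation bumps one global revision.

```python
class MiniKV:
    """Toy revisioned key-value store with lease-attached keys."""

    def __init__(self):
        self.revision = 0
        self.keys = {}   # key -> {"value", "create_revision", "lease"}

    def create_if_absent(self, key, value, lease_id):
        """Return (succeeded, create_revision of whoever holds the key)."""
        if key in self.keys:
            return False, self.keys[key]["create_revision"]
        self.revision += 1
        self.keys[key] = {"value": value,
                          "create_revision": self.revision,
                          "lease": lease_id}
        return True, self.revision

    def expire_lease(self, lease_id):
        """Delete every key attached to the lease, as lease expiry would."""
        doomed = [k for k, v in self.keys.items() if v["lease"] == lease_id]
        if doomed:
            self.revision += 1
        for key in doomed:
            del self.keys[key]
        return doomed


kv = MiniKV()
ok_a, token_a = kv.create_if_absent("lock:billing", "clientA", lease_id=1)
ok_b, _ = kv.create_if_absent("lock:billing", "clientB", lease_id=2)
kv.expire_lease(1)                        # client A stopped renewing
ok_b2, token_b = kv.create_if_absent("lock:billing", "clientB", lease_id=2)
print(ok_a, token_a, ok_b, ok_b2, token_b)
```

Because the deletion also consumes a revision, client B's token is strictly greater than client A's, which is the property the protected resource relies on.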
etcd's Watch API is range-based: watch all keys with prefix lock: and receive events when any matching key is created, modified, or deleted.
Failure mode deep dive
Client pauses (GC / OS scheduling)
As analyzed above, the correct response is fencing tokens on the protected resource. No lock service can reliably distinguish "client is slow" from "client is dead" — the fencing token offloads that burden to the resource.
Network partition — minority can't make progress
If a 3-node cluster partitions into [leader + 1 follower] and [1 follower], the minority (single follower) cannot form a quorum. It will refuse to serve writes and will not elect a new leader. Clients directed at the minority will see timeouts. This is correct behavior — safety is preserved at the cost of availability on the minority side. See the leader election and consensus article for partition behavior details.
Leader failure during a write
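The leader crashes after appending to its own log but before receiving quorum acks. The entry is not committed, and the new leader may overwrite it when it brings the followers' logs in line with its own. If the entry did reach a follower that then wins the election, it can still be committed later. Either way the client sees an error or timeout, must treat the outcome as unknown, and should re-read the lock key before retrying. No acknowledged grant is ever lost.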
Split brain (double leader)
Raft prevents this structurally: a candidate must receive votes from a majority to become leader. Two nodes cannot simultaneously each hold a majority of votes in the same term. The term number (Raft) or epoch (Zab) is included in every operation, so stale leaders that reconnect after isolation are immediately rejected.
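The one-vote-per-term rule is small enough to sketch. This toy `Voter` is a simplification under stated assumptions: it omits Raft's check that the candidate's log is at least as up to date as the voter's, and it keeps state in memory where a real node must persist it before replying.

```python
class Voter:
    """Toy Raft-style voter: at most one vote per term, stale terms refused."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate):
        if term < self.current_term:
            return False                  # request from a stale term
        if term > self.current_term:
            self.current_term = term      # newer term: forget the old vote
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False                      # already voted for someone else


voters = [Voter() for _ in range(3)]
votes_x = sum(v.request_vote(1, "X") for v in voters[:2])  # X reaches two nodes
votes_y = sum(v.request_vote(1, "Y") for v in voters)      # Y asks everyone
stale = voters[0].request_vote(0, "Z")                     # old term is refused
print(votes_x, votes_y, stale)
```

With three voters, X's two votes are a majority and Y can collect at most one, so term 1 has exactly one leader.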
TTL too short — premature expiry under load
If the lock service is overloaded and lease renewals are delayed, a client's lease may expire before it finishes its critical section. Set TTL generously relative to the expected critical-section duration, monitor renewal latency, and alert before TTL is breached. Even then, the fencing token is the backstop: if the lease expires, the resource rejects any write with a stale token regardless.
Watch notification storm (thundering herd)
One lock release → N watchers notified → N simultaneous acquire attempts → N Raft writes in a burst. ZooKeeper's sequential-node approach eliminates this (only 1 watcher per node). For etcd clients using key watches, the client library should implement randomized exponential backoff.
Full architecture
flowchart TD
subgraph clients["Client Fleet"]
C1[Service A]
C2[Service B]
C3[Service C]
end
subgraph coord["Coordination Cluster (3 nodes)"]
LDR[Leader<br/>etcd / ZK]
FLW1[Follower 1]
FLW2[Follower 2]
LDR <-->|Raft / Zab replication| FLW1
LDR <-->|Raft / Zab replication| FLW2
end
subgraph protected["Protected Resources"]
DB[(Database)]
STORE[(Object Store)]
end
C1 -->|"acquire(billing, ttl=20s)"| LDR
C2 -->|"acquire(billing) — WAIT"| LDR
C3 -->|"watch(billing)"| LDR
LDR -->|"granted token=42"| C1
LDR -->|"watch event on release"| C3
C1 -->|"write, fencing_token=42"| DB
DB -->|"reject if token < 42"| STALE_CLIENT[Stale Client]
style LDR fill:#ff6b1a,color:#0a0a0f
style FLW1 fill:#0e7490,color:#fff
style FLW2 fill:#0e7490,color:#fff
style DB fill:#15803d,color:#fff
style C1 fill:#a855f7,color:#fff
style C2 fill:#a855f7,color:#fff
style C3 fill:#a855f7,color:#fff
The Redis Redlock debate
Redis is widely used for distributed locks via the SET NX EX pattern on a single instance. When you have multiple Redis nodes, the Redlock algorithm (proposed by Salvatore Sanfilippo / antirez) extends this: acquire the lock on a majority of N independent Redis nodes using clock-based TTLs. If the majority is achieved within a time window, the lock is considered held.
Martin Kleppmann published a detailed critique arguing that Redlock is fundamentally unsafe because it provides no facility for generating fencing tokens — without fencing tokens, the protected resource cannot distinguish a stale operation from a current one, regardless of how the lock was acquired. On top of that, Redlock's safety depends on wall-clock time assumptions that do not hold in practice: a process pause (GC, OS scheduling) or a clock jump (NTP step adjustment, VM migration) can cause the client to believe its lock is still valid after it has already expired on a majority of Redis nodes.
antirez responded that Kleppmann's critique applied to an adversarial model beyond typical production use.
The practical takeaway for an interview: if you need correctness guarantees for money, inventory, or any state where double-write is catastrophic, use a proper consensus-backed service (etcd, ZooKeeper) with fencing tokens. If you need best-effort mutual exclusion for cache warming or similar idempotent work where occasional races are acceptable, Redis SET NX EX on a single primary is fast and operationally simple.
Read scaling — stale reads vs linearizable reads
All writes go through the leader. Reads are more nuanced:
| Read mode | Who serves it | Consistency | Latency |
|---|---|---|---|
| Linearizable read | Leader (a follower must first check with the leader) | Fully linearizable | Leader RTT |
| Serializable read (etcd: --consistency=s) | Any member (local state, no leader contact) | Stale: may return data behind the current leader | Slightly lower than linearizable |
| ZooKeeper getData | Any server | Monotonic (client-ordered) | Follower RTT |
ZooKeeper followers serve reads from their local state, which means a follower may serve a stale value if replication has not caught up. For applications that need up-to-date reads, ZooKeeper provides the sync call, which makes the server catch up with the leader before it serves the read. etcd's default reads are linearizable: the member handling the request first obtains a read index from the leader, the leader confirms with a quorum that it is still the leader, and the read is served once the member's local state has applied everything up to that index.
For lock-related operations (is the lock held? who holds it?), always use linearizable reads. Stale reads from a follower can give the wrong answer.
Lock lease renewal — the keep-alive loop
A well-implemented lock client runs a background goroutine to renew the lease. The renewal interval is typically TTL / 3, giving two full renewal windows of slack before the lease would expire:
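renew_interval = TTL / 3              # renew after every third of the TTL has elapsed
goroutine keepalive:
  last_renewed = lock_acquired
  while holding_lock:
    sleep(renew_interval)
    if now() - last_renewed > TTL * 0.8:
      panic("approaching TTL expiry, cannot guarantee exclusivity")
    sent_at = now()
    renew(lock_name, token, TTL)
    last_renewed = sent_at            # measure from send time, the conservative bound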
flowchart LR
ACQ[Acquire lock\nt=0] --> KA1["KeepAlive\nt = TTL/3"]
KA1 --> KA2["KeepAlive\nt = 2×TTL/3"]
KA2 --> REL[Release lock]
ACQ -.->|"lease expires at t=TTL\nif no renewals"| EXP[Lease expires]
style ACQ fill:#15803d,color:#fff
style REL fill:#15803d,color:#fff
style EXP fill:#ff2e88,color:#fff
style KA1 fill:#ffaa00,color:#0a0a0f
style KA2 fill:#ffaa00,color:#0a0a0f
If the renewal fails (service unreachable), the client should stop performing writes to the protected resource immediately, attempt a few retries with exponential backoff, and if still failing before TTL expiry, abort and fail the critical section. Failing the critical section is better than silently continuing without a lock.
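The client-side bookkeeping can be sketched as a small guard object. `LeaseGuard` is a hypothetical helper, the 0.8 safety margin is the same illustrative threshold used in the loop above, and the clock is injected so the behaviour can be shown without sleeping.

```python
class LeaseGuard:
    """Tracks the last successful renewal and decides whether writing is still safe."""

    def __init__(self, ttl, clock, safety_margin=0.8):
        self.ttl = ttl
        self._clock = clock               # callable returning seconds
        self.safety_margin = safety_margin
        self.last_renewed = clock()       # the grant counts as the first renewal

    def renew_interval(self):
        return self.ttl / 3

    def record_renewal(self, sent_at):
        # Use the time the request was sent, not when the reply arrived,
        # so the client's view of the lease is never longer than the server's.
        self.last_renewed = sent_at

    def may_write(self):
        elapsed = self._clock() - self.last_renewed
        return elapsed < self.ttl * self.safety_margin


now = [0.0]
guard = LeaseGuard(ttl=30, clock=lambda: now[0])
now[0] = 10.0
guard.record_renewal(sent_at=10.0)        # renewal succeeded at t=10
now[0] = 33.0
ok_at_33 = guard.may_write()              # 23 s since renewal, under the 24 s margin
now[0] = 35.0
ok_at_35 = guard.may_write()              # 25 s since renewal, over the margin
print(guard.renew_interval(), ok_at_33, ok_at_35)
```

A pause between `may_write()` and the write itself can still let the lease lapse, so this check narrows the window but does not replace the fencing token at the resource.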
Storage choices
| What | Store | Why |
|---|---|---|
| Lock state (live leases) | In-memory on quorum nodes, replicated via Raft/Zab log | Fast access; durability comes from the replicated log |
| Watch registrations | In-memory on the server each client is connected to | Watches are not persisted; reconnecting clients re-register |
| Raft log / WAL | Local disk (SSD) on each node | Durability, crash recovery |
| Snapshots | Local disk + object store | Faster node bootstrap; WAL compaction |
| Historical revisions (etcd) | bbolt (B+ tree embedded in etcd — a fork of Ben Johnson's BoltDB, maintained by the etcd-io organization) | Supports range reads, watch history, compaction |
Things to discuss in an interview
- Why leases alone aren't enough — the paused-process problem and why fencing tokens are required at the resource level.
- Why consensus is required — async replication allows split brain on failover; the lock service must be linearizable.
- Quorum sizing trade-offs — 3 vs 5 nodes, failure tolerance vs write latency.
- Herd effect on watch notifications — ZooKeeper's sequential-node pattern vs randomized backoff.
- The Redis Redlock debate — when best-effort Redis locks are acceptable and when they are not.
- Read linearizability — why stale reads from followers can give wrong answers for lock state.
- Session / lease duration sizing — too short causes spurious expiry; too long delays reclamation of dead holders.
Things you should now be able to answer
- Why can a process holding a valid lease still race with another lock holder?
- What property must the protected resource implement to be safe against paused processes?
- Why does a consensus-backed lock service refuse writes in a minority partition, and why is that correct?
- What is the difference between ZooKeeper's ephemeral nodes and etcd's leases?
- How does ZooKeeper's sequential-node lock algorithm eliminate the thundering herd?
- Why are stale reads from a follower dangerous for lock operations?
- When is Redis SET NX EX an acceptable lock implementation, and when is it not?
Further reading
- Martin Kleppmann, "How to do distributed locking" (2016, martin.kleppmann.com): the canonical treatment of fencing tokens and the Redlock critique.
- "Designing Data-Intensive Applications," Chapters 8 and 9: Kleppmann's book, sections on consensus and linearizability.
- etcd documentation: "Distributed locks with etcd" and the concurrency package in the etcd Go client.
- ZooKeeper Recipes: the "Locks" section in the Apache ZooKeeper documentation, which gives the sequential-node recipe described in this article.
- "ZooKeeper: Wait-free coordination for internet-scale systems" — the original ZooKeeper paper (USENIX ATC 2010).
- Leader election and consensus — the Raft and Paxos mechanics underlying the lock service.
- "In Search of an Understandable Consensus Algorithm" (Ongaro & Ousterhout, 2014) — the Raft paper.
Frequently asked questions
What is a fencing token and why does a distributed lock require one?
A fencing token is a monotonically increasing number issued with every lock grant by the coordination service. It is required because a lock holder can pause — due to a JVM GC stop-the-world or OS scheduling — for longer than its TTL, wake up believing it still holds the lock, and corrupt shared state. The protected resource rejects any write whose token is strictly less than the maximum token it has seen, so even a stale client that wakes up after its lease expired is refused.
When should I use etcd or ZooKeeper instead of Redis Redlock for distributed locking?
Use a consensus-backed service like etcd or ZooKeeper when correctness is non-negotiable — money, inventory, or any state where a double write is catastrophic. Redlock provides no facility for fencing tokens and its safety depends on wall-clock assumptions that fail under process pauses or NTP clock jumps. Redis SET NX EX on a single primary is acceptable only for best-effort mutual exclusion over idempotent work like cache warming, where occasional races are tolerable.
How does ZooKeeper's sequential-node lock algorithm eliminate the thundering herd?
Each client creates a sequential ephemeral node under the lock path and then watches only the node immediately before it — its predecessor — rather than the lock key itself. When the current holder deletes its node, only the next-in-queue client is notified. No matter how many clients are queued, a single lock release produces exactly one wakeup and one new Zab write.
How many nodes does a Raft-based coordination cluster need, and how many failures can it tolerate?
A 3-node cluster requires a majority quorum of 2 and tolerates 1 failure; a 5-node cluster requires 3 and tolerates 2 failures. Larger clusters add fault tolerance but increase write latency because more followers must acknowledge each entry before it is committed. Three- and five-node clusters cover almost all production use cases.
What is the recommended lease renewal interval for a distributed lock client?
Renew at TTL divided by 3, which gives two full renewal windows of slack before the lease would expire. If renewal fails and the client cannot confirm the lease before TTL expiry, it should stop writing to the protected resource immediately, retry with exponential backoff, and abort the critical section if still failing — failing the critical section is safer than continuing without a lock.