Design a Distributed Job Scheduler (cron at scale)
Run millions of scheduled and recurring jobs reliably — at-least-once execution, leader election, sharded time-wheels, and exactly-once side effects via idempotency.
The problem
AWS EventBridge Scheduler, GitHub Actions scheduled workflows, Airbnb's Chronos, and every "send the weekly digest at 9 AM" feature in your product all share the same foundation: a distributed job scheduler. At its core, the system stores a set of jobs — each with a schedule and a payload — and makes sure they execute at the right time, even across a fleet of servers, even when machines crash.
The naive version is a single cron daemon on one machine. Every Unix engineer has written one. It works until that machine reboots, and then it silently misses every job due during the downtime. The natural fix — run the same cron on three machines for redundancy — makes things worse: all three daemons see the same due jobs and execute each one three times. That is how users get three payment emails.
The real engineering challenge is two-sided. First, coordination: you need multiple machines to share the work without stomping on each other, which demands leader election, lease-based locking, and sharding. Second, correctness under failure: a worker can crash after starting a job but before recording it as done, leaving the system uncertain whether to retry. Since distributed systems cannot give you exactly-once delivery for free, the scheduler guarantees at-least-once — and pushes the responsibility for safe retries onto the job itself via idempotency. That tension between availability and correctness is what makes this question hard, and it is the thread that runs through every design decision below.
This question separates candidates who have actually operated production systems from those who have not. Everyone knows how to write a setTimeout. Almost nobody has wrestled with what happens when two schedulers both think they own the same cron job, or when a worker dies with a lock held, or when DST moves a clock backward and a "once per hour" job fires twice.
Functional requirements
- POST /jobs: register a job with a schedule (ISO-8601 one-off timestamp, or cron expression).
- DELETE /jobs/:id: deregister.
- GET /jobs/:id/runs: execution history.
- The scheduler fires the job at (or within a configurable slack of) the scheduled time.
- Recurring jobs re-schedule themselves automatically.
- One-off jobs are GC'd after completion.
Non-functional requirements
- At-least-once execution — every due job must fire. Missing a job is worse than running it twice.
- Low duplicate rate — the scheduler must minimize double-executions; correctness of side effects is the job's responsibility.
- Throughput — handle thousands of jobs firing per second under normal load; absorb hot-second bursts.
- High availability — scheduler downtime means missed jobs; 99.9%+ uptime required.
- Latency — a job should start within a few seconds of its due time (not necessarily within milliseconds).
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Registered jobs | 10M | Given baseline |
| Avg firing interval | 1 hour | Given baseline |
| Steady-state throughput | ~2,800 jobs/sec | 10M ÷ 3,600 s |
| Hot-second spike (e.g. 00:00:00) | ~500k jobs in 1 second | 5% of 10M recurring jobs scheduled daily-at-midnight |
| Hot-second drain time | ~50 seconds | 500k ÷ 10k jobs/sec aggregate worker throughput; the queue absorbs the spike |
| Schedule store size | ~5 GB | 10M jobs × 512 B/row; fits on a single DB node (sharding optional until write throughput limits hit) |
| Execution history (append-only) | ~2 TB/month | 2,800 jobs/sec × 86,400 s × 30 days × 256 B/row; shard by job_id or time-partition; keep 90 days hot, archive to object storage |
| Scheduler scan latency (per shard) | <5 ms | B-tree index scan on next_run_at ≤ now() over 1M rows per shard (N=10 shards, each owning 1M jobs); scan runs every 1 s |
Takeaway: 10M jobs fits in ~5 GB at rest; the hot-second burst (500k jobs) is the throughput design point, not the steady-state 2,800 jobs/sec — the queue absorbs it and workers drain at their own rate.
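The table's arithmetic can be reproduced in a few lines. This is only a sketch of the estimate; the constant names are made up for illustration, and the inputs are the baseline figures from the table.

```python
# Back-of-the-envelope numbers from the capacity table (names are illustrative).
REGISTERED_JOBS = 10_000_000
AVG_INTERVAL_S = 3_600          # each job fires about once per hour
JOB_ROW_BYTES = 512
RUN_ROW_BYTES = 256
MIDNIGHT_PERCENT = 5            # share of jobs scheduled daily at 00:00:00
WORKER_DRAIN_PER_S = 10_000     # aggregate worker throughput, jobs/sec

steady_rate = REGISTERED_JOBS / AVG_INTERVAL_S
spike_jobs = REGISTERED_JOBS * MIDNIGHT_PERCENT // 100
drain_seconds = spike_jobs / WORKER_DRAIN_PER_S
store_gb = REGISTERED_JOBS * JOB_ROW_BYTES / 1e9
history_tb_month = steady_rate * 86_400 * 30 * RUN_ROW_BYTES / 1e12

print(f"steady state : {steady_rate:,.0f} jobs/sec")
print(f"midnight     : {spike_jobs:,} jobs, drained in {drain_seconds:.0f} s")
print(f"schedule db  : {store_gb:.2f} GB")
print(f"run history  : {history_tb_month:.2f} TB/month")
```

The history figure comes out slightly under 2 TB because the table rounds the steady-state rate up to 2,800 jobs/sec before multiplying.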
Building up to the design
Start with the simplest scheduler that could possibly work and watch it break.
V1: Single cron daemon + application database
# system crontab (single host)
* * * * * /usr/bin/python3 /opt/job_runner.py
The script queries a jobs table, runs what's due, updates last_run_at. Fits on a laptop, and it genuinely works — for a team of five with a handful of jobs.
The problem is the host. When it reboots — planned or not — every job due during that outage is silently missed. There is no alert, no retry, no record that anything was skipped. Beyond availability, the script also runs serially: a thousand jobs due at the same minute boundary run one after the other, not concurrently.
V2: Multiple cron daemons — thundering herd + double execution
The natural instinct for availability is to run the same cron daemon on three machines. This makes the problem worse. All three daemons scan the DB simultaneously and pick the same due jobs. Every job executes three times. That's not just a correctness annoyance — it means three payment emails, three database migrations, three billing charges.
You've traded a missed-job risk for a duplicate-execution risk, and duplicates are harder to explain to users.
V3: Single-leader scheduler (no sharding)
The fix for duplicate execution is leader election: only one scheduler is "active" at a time; the others wait on standby. ZooKeeper and etcd both offer ephemeral nodes for exactly this — the leader holds a lock, and if it crashes, a standby acquires the lock within a few seconds.
No more duplicates from concurrent schedulers. But now you have a throughput ceiling: one leader must scan the DB, claim up to 2,800 jobs, and enqueue them, all within one second. That's workable at this scale, but a 10× spike or a 10× growth in registered jobs breaks it.
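The failover behavior can be illustrated with a toy lease. This is a minimal in-memory sketch, not the etcd or ZooKeeper API: `LeaseLock`, `try_acquire`, and the simulated clock are all hypothetical names used only to show how a standby takes over once the leader stops renewing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """In-memory stand-in for an etcd/ZooKeeper lease (illustrative only)."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        # Free, expired, or already ours: take (or renew) the lease.
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + ttl
            return True
        return False


lock = LeaseLock()
leader_at = []  # (time, node that is leader at that time)

# Simulated clock: node "a" renews until t=2, then crashes; "b" keeps trying.
for t in range(10):
    if t <= 2 and lock.try_acquire("a", now=t, ttl=3):
        leader_at.append((t, "a"))
    elif lock.try_acquire("b", now=t, ttl=3):
        leader_at.append((t, "b"))

print(leader_at)
```

Between the crash and the lease expiry nobody is leader, which is the window in which due jobs fire late.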
V4: Sharded schedulers with per-shard leader election
Partition the job space: shard by job_id % N. Each shard runs its own leader election in etcd independently. Shard 0's leader owns jobs where job_id % 10 == 0, shard 1 owns the next tenth, and so on.
Now throughput scales linearly with the number of shards — you can add shards as the job count grows. The remaining problem is that the shard leader still executes jobs inline. One slow job blocks the scheduler loop, delaying every other job in that shard.
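Because job IDs in this design are strings such as j_8f3kd9, the modulo has to be applied to a stable hash of the ID. A minimal sketch, assuming CRC32 as the hash (any hash that is stable across processes works; Python's built-in `hash()` is randomized per process for strings and would not). The `shard_for` name is hypothetical.

```python
import zlib
from collections import Counter


def shard_for(job_id: str, shard_count: int) -> int:
    """Map a job ID to a shard with a hash that is stable across processes."""
    return zlib.crc32(job_id.encode("utf-8")) % shard_count


# Every scheduler process computes the same owner for the same job.
counts = Counter(shard_for(f"j_{i}", 10) for i in range(100_000))
for shard, n in sorted(counts.items()):
    print(f"shard {shard}: {n} jobs")
```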
V5: Decouple scheduling from execution (the production design)
The fundamental insight is that the scheduler should do exactly one thing: scan for due jobs and put them on a queue. Workers should do exactly one thing: consume from the queue and execute.
Scheduler shard → queue (Kafka / SQS) → worker pool
Workers scale independently of schedulers. The queue absorbs hot-second bursts naturally — if 500k jobs become due at midnight, the scheduler enqueues them all and workers drain the queue at whatever rate they can sustain. The scheduler's loop stays fast because it never waits on job execution.
Workers report results back to the DB (updating next_run_at for recurring jobs, inserting a run record). The scheduler is never in the execution critical path.
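The split can be sketched with the standard library, using `queue.Queue` as a stand-in for Kafka or SQS and threads as stand-ins for worker processes. All names here are illustrative; the point is only that the scheduler's tick returns as soon as the jobs are enqueued.

```python
import queue
import threading

job_queue = queue.Queue()       # stand-in for Kafka / SQS
completed = []
completed_lock = threading.Lock()


def worker() -> None:
    while True:
        job_id = job_queue.get()
        if job_id is None:              # sentinel: shut down
            job_queue.task_done()
            return
        with completed_lock:
            completed.append(job_id)    # stand-in for real job execution
        job_queue.task_done()


def scheduler_tick(due_job_ids) -> None:
    # The scheduler only enqueues; it never waits on execution.
    for job_id in due_job_ids:
        job_queue.put(job_id)


workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

scheduler_tick([f"j_{i}" for i in range(1000)])
job_queue.join()                        # wait for the burst to drain

for _ in workers:
    job_queue.put(None)
for w in workers:
    w.join()

print(len(completed), "jobs executed")
```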
V6: Add a timing wheel for the near-term horizon
Scanning the DB every second works at 10M jobs, but at 100M+ it gets expensive even with a good index. A hierarchical timing wheel keeps the near-term horizon in memory: on startup, load all jobs due within the next 10 minutes into an in-memory wheel. Each tick processes only the jobs in the current bucket — O(jobs in that bucket) per tick, with insertion and cancellation in effectively O(1) (strictly O(m) where m is the small, fixed number of wheel levels). The DB scan runs every minute just to refresh the far horizon and pick up newly registered jobs.
flowchart LR
V1["V1: single cron<br/>one host, serial"] --> V2["V2: multi-host cron<br/>double execution"]
V2 --> V3["V3: leader election<br/>one active scheduler"]
V3 --> V4["V4: sharded leaders<br/>linear throughput"]
V4 --> V5["V5: queue decoupling<br/>workers scale independently"]
V5 --> V6["V6: timing wheel<br/>O(1) near-term firing"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
The rest of this article zooms in on V5 + V6 — what it takes to run this reliably in production.
API design
POST /api/v1/jobs HTTP/1.1
Content-Type: application/json
{
"name": "send-weekly-report",
"schedule": "0 9 * * 1", // cron: every Monday at 09:00 in the job's timezone
"timezone": "America/New_York",
"payload": { "report_type": "weekly" },
"idempotency_key_template": "report-{{week}}",
"max_retries": 3,
"retry_backoff": "exponential",
"missed_window": "RUN_ONCE", // SKIP | RUN_ONCE | RUN_ALL
"timeout_seconds": 120
}
→ 201 Created
{ "job_id": "j_8f3kd9", "next_run_at": "2026-06-08T13:00:00Z" }
GET /api/v1/jobs/j_8f3kd9/runs?limit=20
→ 200 OK
[
{ "run_id": "r_abc", "scheduled_at": "...", "started_at": "...",
"finished_at": "...", "status": "success", "attempt": 1 },
...
]
The schedule store
The schedule store is a relational database — Postgres is the canonical choice. The critical column is next_run_at, indexed with a B-tree. Everything else flows from that one index.
CREATE TABLE jobs (
job_id VARCHAR(32) PRIMARY KEY,
name TEXT NOT NULL,
schedule TEXT NOT NULL, -- cron expression or ISO-8601
timezone TEXT NOT NULL DEFAULT 'UTC',
payload JSONB,
status TEXT NOT NULL DEFAULT 'active', -- active | paused | deleted
missed_window TEXT NOT NULL DEFAULT 'RUN_ONCE',
max_retries INT NOT NULL DEFAULT 3,
timeout_seconds INT NOT NULL DEFAULT 300,
next_run_at TIMESTAMPTZ NOT NULL,
locked_by TEXT, -- scheduler shard ID or NULL
locked_until TIMESTAMPTZ, -- lease expiry
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- The one index that makes everything work. The lease check stays in the query:
-- Postgres rejects now() in an index predicate because it is not IMMUTABLE.
CREATE INDEX jobs_due ON jobs (next_run_at ASC)
WHERE status = 'active';
CREATE TABLE job_runs (
run_id VARCHAR(32) PRIMARY KEY,
job_id VARCHAR(32) NOT NULL REFERENCES jobs(job_id),
scheduled_at TIMESTAMPTZ NOT NULL,
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
attempt INT NOT NULL DEFAULT 1,
status TEXT NOT NULL, -- pending | running | success | failed | dead
worker_id TEXT,
error TEXT
);
CREATE INDEX job_runs_job ON job_runs (job_id, scheduled_at DESC);
Notice the partial index jobs_due: it covers only active rows, ordered by next_run_at. The lease condition (locked_by IS NULL OR locked_until < now()) cannot be part of the index predicate, because Postgres requires functions in an index predicate to be immutable and now() is not, so the query applies it as a filter. When the scheduler queries for due jobs, Postgres walks the index from the oldest next_run_at up to the current time and touches only rows that are due or currently leased, typically a few hundred rows per shard per second rather than the full 10M.
The scheduler loop
Each shard runs this loop once per second:
from datetime import datetime, timedelta, timezone
from time import sleep

# db and queue are application-provided handles (database connection, Kafka / SQS producer).
def scheduler_loop(shard_id, shard_count, lease_ttl_seconds=30):
    while True:
        now = datetime.now(timezone.utc)
        lease_expiry = now + timedelta(seconds=lease_ttl_seconds)
        # Atomically claim up to 500 due jobs for this shard
        due_jobs = db.execute("""
            UPDATE jobs
            SET locked_by = %s, locked_until = %s
            WHERE job_id IN (
                SELECT job_id FROM jobs
                WHERE status = 'active'
                  AND next_run_at <= %s
                  -- shard predicate (Postgres hashtext; or precompute a shard_id column).
                  -- hashtext can return a negative int, so mask the sign bit before the modulo.
                  AND (hashtext(job_id) & 2147483647) %% %s = %s
                  AND (locked_by IS NULL OR locked_until < %s)
                ORDER BY next_run_at ASC
                LIMIT 500
                FOR UPDATE SKIP LOCKED  -- no blocking
            )
            RETURNING *
        """, [shard_id, lease_expiry, now,
              shard_count, shard_id, now])
        for job in due_jobs:
            queue.publish(job)  # enqueue to Kafka / SQS
        sleep(1)
FOR UPDATE SKIP LOCKED (available in Postgres 9.5+) is what makes this work at scale. It locks the rows it selects and skips rows that another transaction has already locked instead of waiting on them. The shard predicate already gives each shard a disjoint slice of the table; SKIP LOCKED covers the case where two schedulers target the same rows, for example an old and a new leader of one shard overlapping during failover, so each still claims a non-overlapping set of jobs.
The execution pipeline
Once a job is on the queue, a worker picks it up and runs it. The key thing to hold in mind is that this is a three-party interaction: the scheduler set a lease, the queue is delivering the message, and the worker needs to close the loop back to the DB.
sequenceDiagram
participant S as Scheduler Shard
participant DB as Schedule Store
participant Q as Queue (Kafka/SQS)
participant W as Worker
participant DLQ as Dead-Letter Queue
S->>DB: UPDATE ... SET locked_by=shard1 WHERE next_run_at <= now() FOR UPDATE SKIP LOCKED
DB-->>S: 200 due jobs returned
S->>Q: publish 200 job messages
W->>Q: consume job message
W->>W: execute job (with timeout)
alt success
W->>DB: UPDATE jobs SET next_run_at=next_occurrence(), locked_by=NULL
W->>DB: INSERT job_runs (status=success)
else transient failure (attempt < max_retries)
W->>Q: re-enqueue with exponential backoff delay
W->>DB: UPDATE job_runs (status=failed, attempt=N)
else permanent failure (attempt >= max_retries)
W->>DLQ: move to dead-letter queue
W->>DB: INSERT job_runs (status=dead)
end
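The retry branch of the diagram reduces to a small decision function. A sketch under stated assumptions: the `next_action` name, the 2-second base delay, and the 300-second cap are illustrative choices, not values from the design above.

```python
from typing import Optional, Tuple


def next_action(attempt: int, max_retries: int,
                base_delay_s: float = 2.0,
                max_delay_s: float = 300.0) -> Tuple[str, Optional[float]]:
    """Decide what to do after attempt number `attempt` (1-based) has failed."""
    if attempt >= max_retries:
        return ("dead_letter", None)            # permanent failure: move to DLQ
    # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay_s.
    delay = min(base_delay_s * 2 ** (attempt - 1), max_delay_s)
    return ("retry", delay)


for attempt in (1, 2, 3):
    print(attempt, next_action(attempt, max_retries=3))
```

In production the delay usually also gets some random jitter so that a batch of jobs that failed together does not retry together.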
Lease / visibility timeout mechanics
The lease does two things: while a worker is healthy it keeps the job from being claimed a second time, and when a worker crashes it makes the job eligible again. The flow is simple once you see it laid out:
1. The scheduler sets locked_until = now() + 30s when it enqueues the job.
2. The worker has 30 seconds to finish and write back. If it dies mid-execution, the lease just expires on its own.
3. On the next scheduler loop iteration, locked_until < now() is true, so the job is eligible again and gets picked up by the next available worker.
4. Because of step 3, the worker must be idempotent. The same job may execute more than once.
flowchart LR
SCHED[Scheduler] -->|"SET locked_until = now+30s"| DB[(Schedule Store)]
DB -->|enqueue| Q[Queue]
Q --> W[Worker]
W -->|"finish → SET locked_by=NULL, next_run_at=next"| DB
W -.crash.-> EXPIRE["lease expires<br/>(locked_until < now)"]
EXPIRE -->|"next scheduler scan picks it up"| SCHED
style SCHED fill:#ff6b1a,color:#0a0a0f
style DB fill:#0e7490,color:#fff
style EXPIRE fill:#ff2e88,color:#fff
style Q fill:#a855f7,color:#fff
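The lease TTL must be longer than the job's expected execution time plus the time the message may wait in the queue: during a hot-second spike that takes ~50 seconds to drain, a 30-second lease expires before some workers even start, and the scheduler enqueues the job again. For long-running jobs (minutes), either set the TTL accordingly, or have the worker heartbeat, periodically extending locked_until while executing so the lease doesn't expire on a healthy but slow job.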
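The heartbeat rule can be sketched as a conditional extend: the lease is pushed out only if the caller still owns it and it has not yet expired. This is an in-memory illustration with hypothetical names (`Lease`, `heartbeat`, a lease owner ID); against the schedule store the same check would be the WHERE clause of the heartbeat UPDATE.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    locked_by: str
    locked_until: float     # seconds on a shared clock


def heartbeat(lease: Lease, owner_id: str, now: float, ttl: float) -> bool:
    """Extend the lease only if this owner still holds an unexpired one."""
    if lease.locked_by != owner_id or now >= lease.locked_until:
        return False        # lease lost: stop working, someone else may own the job
    lease.locked_until = now + ttl
    return True


# A healthy 75-second job with a 30-second lease, heartbeating every 10 seconds.
lease = Lease(locked_by="w1", locked_until=30.0)
beats = [heartbeat(lease, "w1", float(t), 30.0) for t in range(10, 80, 10)]

# A worker that stalls past the expiry cannot resurrect its lease.
stalled = Lease(locked_by="w1", locked_until=30.0)
late = heartbeat(stalled, "w1", 35.0, 30.0)

print(beats, lease.locked_until, late)
```

A worker that gets False back should abandon the job rather than write results, since the scheduler may already have handed it to someone else.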
Exactly-once side effects via idempotency
The scheduler guarantees at-least-once delivery. Exactly-once side effects require idempotent job logic. This is not a scheduler limitation — it is a fundamental property of distributed systems under partial failure.
See idempotency and exactly-once semantics for the full treatment. The short version:
def send_invoice_job(job_payload, idempotency_key):
    # Check if this key was already processed
    if invoice_db.exists(idempotency_key):
        return "already sent — skipping"
    invoice = generate_invoice(job_payload)
    email_api.send(invoice, idempotency_key=idempotency_key)
    invoice_db.mark_sent(idempotency_key)
    return "sent"
The idempotency_key is typically f"{job_id}:{scheduled_at_unix}" — stable across retries of the same scheduled firing, but different for each separate firing window. This way, a retry of Tuesday's invoice job doesn't skip Tuesday's invoice just because Monday's already ran.
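A small sketch of how such a key behaves across retries and across firing windows. The `idempotency_key` and `run_once` helpers and the in-memory `processed` set are hypothetical stand-ins; a real job would keep the processed keys in durable storage.

```python
from datetime import datetime, timedelta, timezone


def idempotency_key(job_id: str, scheduled_at: datetime) -> str:
    """Key on the scheduled firing time, not on when the attempt actually runs."""
    if scheduled_at.tzinfo is None:
        raise ValueError("scheduled_at must be timezone-aware")
    return f"{job_id}:{int(scheduled_at.timestamp())}"


processed = set()   # stand-in for a durable table of processed keys
sent = []


def run_once(key: str) -> bool:
    if key in processed:
        return False            # retry of a firing that already succeeded
    sent.append(key)            # stand-in for the real side effect
    processed.add(key)
    return True


this_week = datetime(2026, 6, 8, 13, 0, tzinfo=timezone.utc)
next_week = this_week + timedelta(days=7)

first = run_once(idempotency_key("j_8f3kd9", this_week))
retry = run_once(idempotency_key("j_8f3kd9", this_week))
later = run_once(idempotency_key("j_8f3kd9", next_week))
print(first, retry, later, len(sent))
```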
Handling cron expressions, timezones, and DST
Parsing cron
A cron expression like 0 9 * * 1 ("9 AM every Monday") must be evaluated in the job's configured timezone, not UTC. Computing next_run_at correctly requires a timezone-aware cron library that understands DST transitions.
The tricky cases:
| Case | What happens | Correct behavior |
|---|---|---|
| Clock springs forward (DST start) | 2:00 AM → 3:00 AM; 2:30 AM doesn't exist | Do not fire at the nonexistent time; treat it as a missed window (under RUN_ONCE, fire at 3:00 AM instead) |
| Clock falls back (DST end) | 1:00–2:00 AM occurs twice | Fire once; use the first occurrence by default |
| 0 2 * * * in a timezone that skips 2:00 AM | Entire scheduled time disappears | Skip that day's nominal firing; treat as missed window |
Store next_run_at in UTC in the DB, but compute it from the cron expression + timezone every time you reschedule. Do not compute all future occurrences ahead of time — timezone rules change.
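The gap and overlap cases can be detected with the standard library's zoneinfo module (Python 3.9+). This is a sketch, assuming IANA time zone data is available on the host or via the tzdata package; `local_to_utc` is a hypothetical helper, and a real scheduler would call something like it from its cron library after computing the next wall-clock occurrence.

```python
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo


def local_to_utc(naive_local: datetime, tz_name: str) -> Optional[datetime]:
    """UTC instant for a wall-clock time, or None if it falls in a DST gap."""
    tz = ZoneInfo(tz_name)
    # fold=0 (the default) picks the first occurrence of an ambiguous time.
    aware = naive_local.replace(tzinfo=tz)
    utc = aware.astimezone(timezone.utc)
    # A nonexistent wall-clock time does not survive a round trip through UTC.
    if utc.astimezone(tz).replace(tzinfo=None) != naive_local:
        return None
    return utc


ny = "America/New_York"
normal = local_to_utc(datetime(2026, 6, 8, 9, 0), ny)      # ordinary Monday
gap = local_to_utc(datetime(2026, 3, 8, 2, 30), ny)        # spring forward
overlap = local_to_utc(datetime(2026, 11, 1, 1, 30), ny)   # fall back
print(normal, gap, overlap)
```

A None result is the signal to apply the gap policy; the overlap case returns only the first of the two instants, so the job fires once.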
Missed windows
When a scheduler comes back after downtime, it finds jobs with next_run_at in the past. What happens next depends on what the job does:
- SKIP — the moment has passed, move on. Right for bill payment alerts.
- RUN_ONCE — run now, even if late. Right for database vacuums where you want the work done, just not necessarily on-schedule.
- RUN_ALL — run once for each missed window. Right for audit log archiving where every window matters.
RUN_ALL is the most dangerous option. A 24-hour outage on an hourly job triggers 24 executions. Cap this with a max_missed setting and alert when a large catch-up batch fires.
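The three policies and the cap can be sketched for a fixed-interval job. The assumptions are loud ones: a constant interval stands in for cron evaluation, RUN_ONCE is taken to mean "run the most recent missed window", and the `catch_up_firings` name and default cap of 10 are illustrative.

```python
from datetime import datetime, timedelta, timezone


def catch_up_firings(next_run_at, now, interval, policy, max_missed=10):
    """Return (firings to enqueue now, new next_run_at) after downtime."""
    missed = []
    t = next_run_at
    while t <= now:                 # every scheduled time that has already passed
        missed.append(t)
        t += interval
    if policy == "SKIP":
        fire = []
    elif policy == "RUN_ONCE":
        fire = missed[-1:]          # only the most recent missed window
    elif policy == "RUN_ALL":
        fire = missed[max(0, len(missed) - max_missed):]   # capped catch-up
    else:
        raise ValueError(f"unknown policy: {policy}")
    return fire, t


start = datetime(2026, 6, 8, 0, 0, tzinfo=timezone.utc)
now = datetime(2026, 6, 8, 23, 30, tzinfo=timezone.utc)
hour = timedelta(hours=1)

for policy in ("SKIP", "RUN_ONCE", "RUN_ALL"):
    fire, nxt = catch_up_firings(start, now, hour, policy)
    print(policy, len(fire), nxt.isoformat())
```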
The timing wheel
For millions of jobs at steady state, scanning the DB every second works but is wasteful. A hierarchical timing wheel keeps the near-term horizon in memory:
Wheel structure (each level is an array of buckets):
Level 0: 60 buckets, each = 1 second (covers 1 minute)
Level 1: 60 buckets, each = 1 minute (covers 1 hour)
Level 2: 24 buckets, each = 1 hour (covers 1 day)
On each tick, the current bucket fires all jobs in it. Jobs farther out live in higher-level buckets and cascade down as time advances. Job insertions are O(m) where m is the number of wheel levels (a small constant, typically 3–5), which is effectively O(1) in practice. Firing is O(jobs in bucket). Memory: each shard holds ~1M jobs, but the wheel only loads the 10-minute near-term horizon (~1/6 of those), so roughly 167k × 64 B ≈ ~10 MB per shard — comfortably fits in any JVM heap.
The DB scan runs every minute to load new and modified jobs into the wheel. The wheel handles sub-minute precision; the DB remains the authoritative source, and the wheel is rebuilt from it on any restart.
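A compact sketch of the three-level wheel described above, with time as integer seconds. It is an illustration of the cascading idea rather than a production timer: the class and method names are hypothetical, there is no cancellation, and jobs beyond the one-day horizon are rejected (they would stay in the DB).

```python
class TimingWheel:
    """Three levels: 60 x 1 s, 60 x 1 min, 24 x 1 h (one-day horizon)."""
    SPANS = (1, 60, 3600)       # seconds covered by one bucket at each level
    SIZES = (60, 60, 24)        # buckets per level

    def __init__(self, start: int = 0):
        self.now = start
        self.levels = [[[] for _ in range(n)] for n in self.SIZES]

    def _place(self, job_id: str, due: int) -> None:
        delay = due - self.now
        for level, (span, size) in enumerate(zip(self.SPANS, self.SIZES)):
            if delay < span * size:
                self.levels[level][(due // span) % size].append((due, job_id))
                return
        raise ValueError("due time is beyond the wheel's one-day horizon")

    def add(self, job_id: str, due: int) -> None:
        if due <= self.now:
            raise ValueError("due time must be in the future")
        self._place(job_id, due)

    def tick(self) -> list:
        """Advance one second and return the job IDs due at the new time."""
        self.now += 1
        # Cascade: when an hour or minute bucket's span begins, re-file its
        # jobs into lower levels (or into the bucket that fires right now).
        for level in (2, 1):
            span, size = self.SPANS[level], self.SIZES[level]
            if self.now % span == 0:
                idx = (self.now // span) % size
                bucket, self.levels[level][idx] = self.levels[level][idx], []
                for due, job_id in bucket:
                    self._place(job_id, due)
        idx = self.now % self.SIZES[0]
        bucket, self.levels[0][idx] = self.levels[0][idx], []
        return [job_id for _, job_id in bucket]


wheel = TimingWheel()
for job_id, due in [("a", 5), ("b", 90), ("c", 3600), ("d", 3700)]:
    wheel.add(job_id, due)

fired_at = {}
for _ in range(4000):
    for job_id in wheel.tick():
        fired_at[job_id] = wheel.now
print(fired_at)
```

Each tick touches one second-level bucket, plus one minute bucket every 60 ticks and one hour bucket every 3,600 ticks, which is where the cost proportional to the jobs in the bucket comes from.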
flowchart TD
DB[(Schedule Store)] -->|load jobs due in next 10 min<br/>every 60 s| TW[Timing Wheel<br/>in-memory]
TW -->|tick every 1 s<br/>fire current bucket| SCHED[Scheduler Loop]
SCHED -->|enqueue| Q[Queue]
NEW[New job registered] -->|next_run_at ≤ 10 min| TW
NEW -->|next_run_at > 10 min| DB
style TW fill:#ff6b1a,color:#0a0a0f
style DB fill:#0e7490,color:#fff
style Q fill:#a855f7,color:#fff
Full architecture
flowchart TD
CLIENT[Client / API] -->|register / update / delete jobs| JAPI[Job Registry API]
JAPI --> DB[(Schedule Store<br/>Postgres — sharded by job_id)]
ETCD[(etcd / ZooKeeper<br/>leader election)] --> S0[Scheduler Shard 0<br/>+ timing wheel]
ETCD --> S1[Scheduler Shard 1<br/>+ timing wheel]
ETCD --> SN[Scheduler Shard N<br/>+ timing wheel]
S0 -->|scan due jobs| DB
S1 -->|scan due jobs| DB
SN -->|scan due jobs| DB
S0 --> MQ
S1 --> MQ
SN --> MQ
MQ[Message Queue<br/>Kafka / SQS] --> W1[Worker]
MQ --> W2[Worker]
MQ --> WN[Worker]
W1 -->|update next_run_at<br/>insert job_run| DB
W2 -->|update next_run_at<br/>insert job_run| DB
WN -->|update next_run_at<br/>insert job_run| DB
W1 -.failed after max retries.-> DLQ[Dead-Letter Queue]
W2 -.failed after max retries.-> DLQ
WN -.failed after max retries.-> DLQ
DLQ --> ALERT[Alert / oncall]
style DB fill:#0e7490,color:#fff
style MQ fill:#a855f7,color:#fff
style S0 fill:#ff6b1a,color:#0a0a0f
style S1 fill:#ff6b1a,color:#0a0a0f
style SN fill:#ff6b1a,color:#0a0a0f
style ETCD fill:#ffaa00,color:#0a0a0f
style W1 fill:#15803d,color:#fff
style W2 fill:#15803d,color:#fff
style WN fill:#15803d,color:#fff
style DLQ fill:#ff2e88,color:#fff
Job state machine
stateDiagram-v2
[*] --> Scheduled: registered
Scheduled --> Leased: scheduler picks up
Leased --> Scheduled: lease expired (worker crashed)
Leased --> Running: worker starts
Running --> Succeeded: job completes
Running --> Failed: job throws / times out
Failed --> Leased: retry enqueued (attempt < max_retries)
Failed --> Dead: max_retries exceeded, moved to DLQ
Succeeded --> Scheduled: recurring — next_run_at updated
Succeeded --> [*]: one-off job completed
Dead --> [*]
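The diagram translates directly into a transition table, which is a convenient way to reject illegal state changes in code. A minimal sketch; the event names are made up here to label the arrows in the diagram.

```python
# (current state, event) -> next state, mirroring the arrows in the diagram.
TRANSITIONS = {
    ("Scheduled", "scheduler_picks_up"): "Leased",
    ("Leased", "lease_expired"): "Scheduled",
    ("Leased", "worker_starts"): "Running",
    ("Running", "job_completes"): "Succeeded",
    ("Running", "job_fails"): "Failed",
    ("Failed", "retry_enqueued"): "Leased",
    ("Failed", "retries_exhausted"): "Dead",
    ("Succeeded", "rescheduled"): "Scheduled",   # recurring jobs only
}


def advance(state: str, event: str) -> str:
    """Return the next state, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None


# A run that fails once, is retried, and then succeeds.
state = "Scheduled"
for event in ("scheduler_picks_up", "worker_starts", "job_fails",
              "retry_enqueued", "worker_starts", "job_completes"):
    state = advance(state, event)
print(state)
```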
Storage choices
| Data | Store | Rationale |
|---|---|---|
| Job definitions | Postgres (sharded by job_id) | Strong consistency; next_run_at B-tree index; FK to runs |
| Lease state (locked_by, locked_until) | Same Postgres row | Atomic with FOR UPDATE SKIP LOCKED; no separate lock store |
| Job execution history | Postgres (time-partitioned) or Cassandra | Append-only, high write rate; partition by month |
| Leader election | etcd or ZooKeeper | Battle-tested; ephemeral leases with TTL |
| Message queue | Kafka or Amazon SQS | Kafka: replay, fan-out; SQS: managed, per-message visibility timeout |
| Timing wheel state | JVM heap (in-memory, per shard) | Rebuilt from DB on restart; ephemeral is fine |
| Dead-letter queue | Kafka DLQ topic or SQS DLQ | Inspectable; re-drive via UI |
Failure modes and mitigations
| Failure | Symptom | Mitigation |
|---|---|---|
| Worker dies mid-execution | Job lease expires; job requeued | Lease expiry + worker idempotency |
| Duplicate execution (two workers pick same job) | FOR UPDATE SKIP LOCKED prevents this at DB level; unlikely with queue | Idempotency key in job execution |
| Scheduler shard crashes | Standby acquires the etcd lock once the crashed leader's lease expires (a few seconds, set by the lease TTL) | Fast re-election; jobs due during gap fire slightly late |
| DB partition / unavailability | Scheduler cannot scan; jobs stall | Multi-AZ Postgres; read replica for status checks |
| Clock skew between scheduler nodes | Two schedulers disagree on now() | Sync via NTP; use DB server timestamp (now() in SQL) not scheduler host clock |
| Hot-second spike (00:00:00) | 500k jobs enqueued in 1 second | Queue absorbs; workers drain asynchronously; expect up to a minute of latency on spike |
| DST clock-back: job fires twice | "1:30 AM" occurs twice in one night | Cron library must detect and deduplicate using UTC next_run_at |
| RUN_ALL after long outage | Thousands of catch-up jobs flood queue | Cap max_missed; alert on large catch-up batches |
| Worker consumes message but queue ACK lost | Message redelivered after visibility timeout | Worker idempotency handles re-execution |
Things to discuss in an interview
- Why not just use SELECT FOR UPDATE without SKIP LOCKED? Because all schedulers would block on the same rows, serializing themselves into a single queue. SKIP LOCKED turns this into an efficient, parallel job dispatch.
- Why decouple the scheduler from the worker? The scheduler's loop must complete within 1 second. If it runs the job inline, one slow job blocks all other firings. The queue is the buffer.
- Why is at-least-once the right guarantee, not exactly-once? Exactly-once at the scheduler level would require distributed transactions across the queue, the DB, and the worker's external side effects, which is practically impossible without massive performance cost. Idempotent job design is the correct abstraction boundary.
- What does the job specify as its idempotency key? Typically {job_id}:{scheduled_at_unix}, which is stable across retries of the same window and unique across different windows.
- How do you handle a job that takes longer than the lease TTL? Two options: (1) extend the TTL with a heartbeat UPDATE, or (2) set a generous per-job lease TTL that exceeds the job's timeout. If using a queue like SQS, the worker extends the visibility timeout via API call.
- How do you prevent the "hot shard" problem for jobs that all hash to the same shard? Use consistent hashing rather than job_id % N; distribute by hash of job_id with virtual nodes, so that load spreads evenly and adding or removing a shard remaps only a small fraction of jobs. See consistent hashing.
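The consistent-hashing answer can be sketched as a hash ring with virtual nodes. The `HashRing` class, the SHA-256-based hash, and the choice of 100 virtual nodes per shard are illustrative assumptions, not part of the design above.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each shard owns `vnodes` points on the ring."""

    def __init__(self, shards, vnodes: int = 100):
        self._ring = sorted((_hash(f"{s}#{v}"), s)
                            for s in shards for v in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def shard_for(self, job_id: str) -> str:
        # First ring point clockwise from the job's hash, wrapping around.
        i = bisect.bisect_right(self._points, _hash(job_id)) % len(self._ring)
        return self._ring[i][1]


job_ids = [f"j_{i}" for i in range(20_000)]
before = HashRing([f"s{i}" for i in range(10)])
after = HashRing([f"s{i}" for i in range(11)])      # one shard added

moved = [j for j in job_ids if before.shard_for(j) != after.shard_for(j)]
print(f"{len(moved)} of {len(job_ids)} jobs moved after adding a shard")
```

With job_id % N, going from 10 to 11 shards would reassign most jobs; on the ring, only the jobs that now belong to the new shard move.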
Things you should now be able to answer
- Why can't you just run two cron daemons for HA, and what goes wrong?
- What does FOR UPDATE SKIP LOCKED do and why is it the right primitive here?
- How does a job lease prevent double execution, and what is the failure mode when a worker dies?
- Why is idempotency the job's responsibility rather than the scheduler's?
- How does a timing wheel work, and when does the DB scan still matter?
- What are the three strategies for handling missed job windows, and when do you use each?
- Why must you compute next_run_at from the cron expression + timezone on every reschedule, rather than pre-computing future occurrences?
Further reading
- Quartz Scheduler — widely used Java job scheduler; documentation covers clustering and misfires
- AWS EventBridge Scheduler — managed service; study its at-least-once semantics and flexible time windows
- "Distributed Periodic Scheduling with Cron" — Google SRE book, chapter 24 (the ACM Queue article covering the same material is titled "Reliable Cron across the Planet")
- Idempotency and exactly-once semantics
- Consistent hashing — for distributing job_id across shards
- "Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility" — Varghese and Lauck, 1987 (SOSP '87) — the canonical paper on timing wheels
Frequently asked questions
What is the difference between at-least-once and exactly-once in a distributed job scheduler?
The scheduler guarantees at-least-once delivery: every due job will fire, but it may fire more than once if a worker crashes mid-execution and the lease expires. Exactly-once side effects are the job's responsibility, not the scheduler's, achieved through idempotent job logic keyed on a stable idempotency key such as job_id combined with the scheduled firing timestamp.
How does FOR UPDATE SKIP LOCKED prevent double execution in the scheduler loop?
When a scheduler shard issues the UPDATE to claim due jobs, FOR UPDATE SKIP LOCKED acquires a row lock on each job row but skips any row already locked by another scheduler — without blocking or deadlocking. Multiple shards can run the same query simultaneously and naturally divide the work into non-overlapping sets, so no two schedulers claim the same job.
What are the three strategies for handling missed job windows after downtime, and when should each be used?
SKIP discards the missed firing and is right for time-sensitive alerts like bill payment reminders. RUN_ONCE executes the job immediately even if late, which is right for maintenance work like a database vacuum where the work must get done but the exact time does not matter. RUN_ALL fires once for every missed window, which is right for audit log archiving where no interval can be dropped — but it is the most dangerous option and should be capped with a max_missed setting.
How does a hierarchical timing wheel work and what problem does it solve?
A hierarchical timing wheel holds the near-term job horizon in memory across three levels: 60 one-second buckets, 60 one-minute buckets, and 24 one-hour buckets. Each tick processes only the jobs in the current bucket, so firing cost is proportional to jobs due in that second, not to the total job count. Insertions and cancellations cost O(m) where m is the number of wheel levels, a small constant, compared to the O(log N) cost of a database index scan. With N=10 shards each owning 1M jobs, the wheel only loads the 10-minute near-term horizon per shard (roughly 167k jobs at a 1-hour median interval), so memory is about 167k x 64 B = ~10 MB per shard; the DB scan runs every minute only to refresh the far horizon.
What is the hot-second problem and how does the architecture address it?
The hot-second problem occurs when a large fraction of recurring jobs share the same scheduled time, such as 500k daily-at-midnight jobs all firing at 00:00:00. The scheduler enqueues all 500k jobs immediately, and the queue absorbs the burst; workers drain at their own rate, which at an aggregate 10k jobs per second takes roughly 50 seconds. Job start latency during a spike can reach about a minute, which is acceptable because the few-seconds latency target applies to normal load, not to bursts.