Design a Distributed Job Scheduler (cron at scale)

Run millions of scheduled and recurring jobs reliably — at-least-once execution, leader election, sharded time-wheels, and exactly-once side effects via idempotency.

20 min read · 2026-05-30 · Ironclad Academy

#interview #distributed-systems #scheduling #reliability

The problem

AWS EventBridge Scheduler, GitHub Actions scheduled workflows, Airbnb's Chronos, and every "send the weekly digest at 9 AM" feature in your product all share the same foundation: a distributed job scheduler. At its core, the system stores a set of jobs — each with a schedule and a payload — and makes sure they execute at the right time, even across a fleet of servers, even when machines crash.

The naive version is a single cron daemon on one machine. Every Unix engineer has written one. It works until that machine reboots, and then it silently misses every job due during the downtime. The natural fix — run the same cron on three machines for redundancy — makes things worse: all three daemons see the same due jobs and execute each one three times. That is how users get three payment emails.

The real engineering challenge is two-sided. First, coordination: you need multiple machines to share the work without stomping on each other, which demands leader election, lease-based locking, and sharding. Second, correctness under failure: a worker can crash after starting a job but before recording it as done, leaving the system uncertain whether to retry. Since distributed systems cannot give you exactly-once delivery for free, the scheduler guarantees at-least-once — and pushes the responsibility for safe retries onto the job itself via idempotency. That tension between availability and correctness is what makes this question hard, and it is the thread that runs through every design decision below.

This question separates candidates who have actually operated production systems from those who have not. Everyone knows how to write a setTimeout. Almost nobody has wrestled with what happens when two schedulers both think they own the same cron job, or when a worker dies with a lock held, or when DST moves a clock backward and a "once per hour" job fires twice.

Functional requirements

POST /jobs — register a job with a schedule (ISO-8601 one-off timestamp, or cron expression).
DELETE /jobs/:id — deregister.
GET /jobs/:id/runs — execution history.
The scheduler fires the job at (or within a configurable slack of) the scheduled time.
Recurring jobs re-schedule themselves automatically.
One-off jobs are GC'd after completion.

Non-functional requirements

At-least-once execution — every due job must fire. Missing a job is worse than running it twice.
Low duplicate rate — the scheduler must minimize double-executions; correctness of side effects is the job's responsibility.
Throughput — handle thousands of jobs firing per second under normal load; absorb hot-second bursts.
High availability — scheduler downtime means missed jobs; 99.9%+ uptime required.
Latency — a job should start within a few seconds of its due time (not necessarily within milliseconds).

Capacity estimation

Dimension	Estimate	How we got there
Registered jobs	10M	Given baseline
Avg firing interval	1 hour	Given baseline
Steady-state throughput	~2,800 jobs/sec	`10M ÷ 3,600 s`
Hot-second spike (e.g. 00:00:00)	~500k jobs in 1 second	5% of 10M recurring jobs scheduled daily-at-midnight
Hot-second drain time	~50 seconds	`500k ÷ 10k jobs/sec` (aggregate worker throughput); the queue absorbs the spike
Schedule store size	~5 GB	`10M jobs × 512 B/row`; fits on a single DB node (sharding optional until write throughput limits hit)
Execution history (append-only)	~2 TB/month	`2,800 jobs/sec × 86,400 s × 30 days × 256 B/row`; shard by job_id or time-partition; keep 90 days hot, archive to object storage
Scheduler scan latency (per shard)	<5 ms	B-tree index scan on `next_run_at ≤ now()` over 1M rows per shard (N=10 shards, each owning 1M jobs); scan runs every 1 s

Takeaway: 10M jobs fits in ~5 GB at rest; the hot-second burst (500k jobs) is the throughput design point, not the steady-state 2,800 jobs/sec — the queue absorbs it and workers drain at their own rate.
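
The estimates in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the baseline figures given above; the constant names are illustrative, and the 10k jobs/sec drain rate is the table's assumed aggregate worker throughput.

```python
# Back-of-envelope capacity check using the baseline figures from the table.
REGISTERED_JOBS = 10_000_000
AVG_INTERVAL_S = 3_600            # one firing per job per hour
JOB_ROW_BYTES = 512
RUN_ROW_BYTES = 256
MIDNIGHT_PERCENT = 5              # share of jobs scheduled daily-at-midnight
DRAIN_RATE_JOBS_PER_S = 10_000    # assumed aggregate worker throughput

steady_jobs_per_s = REGISTERED_JOBS / AVG_INTERVAL_S
hot_second_jobs = REGISTERED_JOBS * MIDNIGHT_PERCENT // 100
drain_seconds = hot_second_jobs / DRAIN_RATE_JOBS_PER_S
schedule_store_gb = REGISTERED_JOBS * JOB_ROW_BYTES / 1e9
history_tb_per_month = steady_jobs_per_s * 86_400 * 30 * RUN_ROW_BYTES / 1e12

print(f"steady state:     {steady_jobs_per_s:,.0f} jobs/s")
print(f"hot second:       {hot_second_jobs:,} jobs, drained in {drain_seconds:.0f} s")
print(f"schedule store:   {schedule_store_gb:.2f} GB")
print(f"history / month:  {history_tb_per_month:.2f} TB")
```

The history figure comes out slightly under 2 TB because the table rounds the steady-state rate up to 2,800 jobs/sec before multiplying.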

Building up to the design

Start with the simplest scheduler that could possibly work and watch it break.

V1: Single cron daemon + application database

# system crontab (single host)
* * * * * /usr/bin/python3 /opt/job_runner.py

The script queries a jobs table, runs what's due, updates last_run_at. Fits on a laptop, and it genuinely works — for a team of five with a handful of jobs.

The problem is the host. When it reboots — planned or not — every job due during that outage is silently missed. There is no alert, no retry, no record that anything was skipped. Beyond availability, the script also runs serially: a thousand jobs due at the same minute boundary run one after the other, not concurrently.

V2: Multiple cron daemons — thundering herd + double execution

The natural instinct for availability is to run the same cron daemon on three machines. This makes the problem worse. All three daemons scan the DB simultaneously and pick the same due jobs. Every job executes three times. That's not just a correctness annoyance — it means three payment emails, three database migrations, three billing charges.

You've traded a missed-job risk for a duplicate-execution risk, and duplicates are harder to explain to users.

V3: Single-leader scheduler (no sharding)

The fix for duplicate execution is leader election: only one scheduler is "active" at a time; the others wait on standby. ZooKeeper (ephemeral nodes) and etcd (keys attached to a lease with a TTL) both support exactly this: the leader holds a lock, and if it crashes, the lock is released when its session or lease expires and a standby acquires it within a few seconds.

No more duplicates from concurrent schedulers. But now you have a throughput ceiling: one leader must scan the DB, claim up to 2,800 jobs, and enqueue them, all within one second. That's workable at this scale, but a 10× spike or a 10× growth in registered jobs breaks it.

V4: Sharded schedulers with per-shard leader election

Partition the job space: shard by hash(job_id) % N (job IDs are strings, so they are hashed first). Each shard runs its own leader election in etcd independently. With N = 10, shard 0's leader owns jobs where hash(job_id) % 10 == 0, shard 1 owns the next tenth, and so on.

Now throughput scales linearly with the number of shards — you can add shards as the job count grows. The remaining problem is that the shard leader still executes jobs inline. One slow job blocks the scheduler loop, delaying every other job in that shard.
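
A shard assignment function might look like the sketch below. It assumes string job IDs and uses SHA-256 as the hash; the function name `shard_for` is illustrative. Python's built-in `hash()` is deliberately avoided because it is salted per process for strings, so two scheduler processes would disagree about ownership.

```python
import hashlib

def shard_for(job_id: str, shard_count: int) -> int:
    """Map a job ID to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % shard_count

def owned_by(job_id: str, shard_id: int, shard_count: int) -> bool:
    """True if the given shard's leader is responsible for this job."""
    return shard_for(job_id, shard_count) == shard_id
```

Every scheduler evaluates the same pure function, so ownership needs no coordination beyond agreeing on `shard_count`.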

V5: Decouple scheduling from execution (the production design)

The fundamental insight is that the scheduler should do exactly one thing: scan for due jobs and put them on a queue. Workers should do exactly one thing: consume from the queue and execute.

Scheduler shard → queue (Kafka / SQS) → worker pool

Workers scale independently of schedulers. The queue absorbs hot-second bursts naturally — if 500k jobs become due at midnight, the scheduler enqueues them all and workers drain the queue at whatever rate they can sustain. The scheduler's loop stays fast because it never waits on job execution.

Workers report results back to the DB (updating next_run_at for recurring jobs, inserting a run record). The scheduler is never in the execution critical path.
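
The decoupling can be illustrated in a single process, with `queue.Queue` standing in for Kafka or SQS and threads standing in for the worker pool. This is only a sketch of the shape of the design; `run_pipeline` and its parameters are hypothetical names, and appending to a list stands in for executing the job.

```python
import queue
import threading

def run_pipeline(due_job_ids, worker_count=4):
    """Scheduler side only enqueues; workers drain the queue independently."""
    q = queue.Queue()
    completed = []
    lock = threading.Lock()

    def worker():
        while True:
            job_id = q.get()
            try:
                if job_id is None:          # sentinel: no more work
                    return
                with lock:
                    completed.append(job_id)  # stand-in for running the job
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()

    # The "scheduler" never waits on execution: it publishes and moves on.
    for job_id in due_job_ids:
        q.put(job_id)
    for _ in threads:
        q.put(None)

    q.join()
    for t in threads:
        t.join()
    return completed
```

Changing `worker_count` changes how fast the backlog drains without touching the enqueue loop, which is the property the production design relies on.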

V6: Add a timing wheel for the near-term horizon

Scanning the DB every second works at 10M jobs, but at 100M+ it gets expensive even with a good index. A hierarchical timing wheel keeps the near-term horizon in memory: on startup, load all jobs due within the next 10 minutes into an in-memory wheel. Each tick processes only the jobs in the current bucket — O(jobs in that bucket) per tick, with insertion and cancellation in effectively O(1) (strictly O(m) where m is the small, fixed number of wheel levels). The DB scan runs every minute just to refresh the far horizon and pick up newly registered jobs.

flowchart LR
    V1["V1: single cron<br/>one host, serial"] --> V2["V2: multi-host cron<br/>double execution"]
    V2 --> V3["V3: leader election<br/>one active scheduler"]
    V3 --> V4["V4: sharded leaders<br/>linear throughput"]
    V4 --> V5["V5: queue decoupling<br/>workers scale independently"]
    V5 --> V6["V6: timing wheel<br/>O(1) near-term firing"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

The rest of this article zooms in on V5 + V6 — what it takes to run this reliably in production.

API design

POST /api/v1/jobs HTTP/1.1
Content-Type: application/json

{
  "name": "send-weekly-report",
  "schedule": "0 9 * * 1",           // cron: every Monday at 09:00 in the job's timezone
  "timezone": "America/New_York",
  "payload": { "report_type": "weekly" },
  "idempotency_key_template": "report-{{week}}",
  "max_retries": 3,
  "retry_backoff": "exponential",
  "missed_window": "RUN_ONCE",       // SKIP | RUN_ONCE | RUN_ALL
  "timeout_seconds": 120
}

→ 201 Created
{ "job_id": "j_8f3kd9", "next_run_at": "2026-06-08T13:00:00Z" }

GET /api/v1/jobs/j_8f3kd9/runs?limit=20

→ 200 OK
[
  { "run_id": "r_abc", "scheduled_at": "...", "started_at": "...",
    "finished_at": "...", "status": "success", "attempt": 1 },
  ...
]

The schedule store

The schedule store is a relational database — Postgres is the canonical choice. The critical column is next_run_at, indexed with a B-tree. Everything else flows from that one index.

CREATE TABLE jobs (
  job_id          VARCHAR(32)   PRIMARY KEY,
  name            TEXT          NOT NULL,
  schedule        TEXT          NOT NULL,  -- cron expression or ISO-8601
  timezone        TEXT          NOT NULL DEFAULT 'UTC',
  payload         JSONB,
  status          TEXT          NOT NULL DEFAULT 'active',  -- active | paused | deleted
  missed_window   TEXT          NOT NULL DEFAULT 'RUN_ONCE',
  max_retries     INT           NOT NULL DEFAULT 3,
  timeout_seconds INT           NOT NULL DEFAULT 300,
  next_run_at     TIMESTAMPTZ   NOT NULL,
  locked_by       TEXT,                   -- scheduler shard ID or NULL
  locked_until    TIMESTAMPTZ,            -- lease expiry
  created_at      TIMESTAMPTZ   NOT NULL DEFAULT now()
);

-- The one index that makes everything work.
-- Postgres only allows IMMUTABLE functions in an index predicate, so now()
-- cannot appear here; the lease check (locked_until < now()) stays in the query.
CREATE INDEX jobs_due ON jobs (next_run_at ASC)
  WHERE status = 'active';

CREATE TABLE job_runs (
  run_id        VARCHAR(32)   PRIMARY KEY,
  job_id        VARCHAR(32)   NOT NULL REFERENCES jobs(job_id),
  scheduled_at  TIMESTAMPTZ   NOT NULL,
  started_at    TIMESTAMPTZ,
  finished_at   TIMESTAMPTZ,
  attempt       INT           NOT NULL DEFAULT 1,
  status        TEXT          NOT NULL,   -- pending | running | success | failed | dead
  worker_id     TEXT,
  error         TEXT
);

CREATE INDEX job_runs_job ON job_runs (job_id, scheduled_at DESC);

Notice the partial index on jobs_due: it only covers active rows, ordered by next_run_at. The lease condition cannot live in the index predicate because now() is not immutable, so the scheduler query applies it as an ordinary filter. When the scheduler queries for due jobs, Postgres walks the index only up to next_run_at <= now(), touching the rows that are actually due (typically a few hundred per shard per second) rather than the full 10M.

The scheduler loop

Each shard runs this loop once per second:

import time

def scheduler_loop(shard_id, shard_count, lease_ttl_seconds=30):
    # `db` and `queue` are the application's database and queue clients.
    while True:
        # Atomically claim up to 500 due jobs for this shard.
        # now() is the DB server clock, so scheduler hosts need not agree on time.
        due_jobs = db.execute("""
            UPDATE jobs
            SET locked_by = %s,
                locked_until = now() + make_interval(secs => %s)
            WHERE job_id IN (
              SELECT job_id FROM jobs
              WHERE status = 'active'
                AND next_run_at <= now()
                -- shard predicate: hashtext() can be negative, so take abs()
                -- (or precompute a shard_id column)
                AND mod(abs(hashtext(job_id)::bigint), %s) = %s
                AND (locked_by IS NULL OR locked_until < now())
              ORDER BY next_run_at ASC
              LIMIT 500
              FOR UPDATE SKIP LOCKED            -- no blocking
            )
            RETURNING *
        """, [str(shard_id), lease_ttl_seconds, shard_count, shard_id])

        for job in due_jobs:
            queue.publish(job)                  # enqueue to Kafka / SQS

        time.sleep(1)

FOR UPDATE SKIP LOCKED (available in Postgres 9.5+) is what makes this work at scale. It acquires a row lock and skips already-locked rows without blocking — no waiting, no deadlocks. Multiple shards can run this query at the same moment against the same table and they'll naturally divide the work, each grabbing a non-overlapping set of jobs.

The execution pipeline

Once a job is on the queue, a worker picks it up and runs it. The key thing to hold in mind is that this is a three-party interaction: the scheduler set a lease, the queue is delivering the message, and the worker needs to close the loop back to the DB.

sequenceDiagram
    participant S as Scheduler Shard
    participant DB as Schedule Store
    participant Q as Queue (Kafka/SQS)
    participant W as Worker
    participant DLQ as Dead-Letter Queue

    S->>DB: UPDATE ... SET locked_by=shard1 WHERE next_run_at <= now() FOR UPDATE SKIP LOCKED
    DB-->>S: 200 due jobs returned
    S->>Q: publish 200 job messages
    W->>Q: consume job message
    W->>W: execute job (with timeout)
    alt success
        W->>DB: UPDATE jobs SET next_run_at=next_occurrence(), locked_by=NULL
        W->>DB: INSERT job_runs (status=success)
    else transient failure (attempt < max_retries)
        W->>Q: re-enqueue with exponential backoff delay
        W->>DB: UPDATE job_runs (status=failed, attempt=N)
    else permanent failure (attempt >= max_retries)
        W->>DLQ: move to dead-letter queue
        W->>DB: INSERT job_runs (status=dead)
    end
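
The retry branch of the diagram can be sketched as two small functions. The backoff uses exponential growth with full jitter, which is one common choice and not something the article prescribes; `base_seconds`, `cap_seconds`, and the function names are assumptions made for illustration.

```python
import random

def backoff_delay(attempt, base_seconds=1.0, cap_seconds=300.0, rng=random.random):
    """Delay before re-enqueueing: exponential in the attempt number, with full jitter.

    attempt is 1-based; the ceiling doubles each attempt and is capped.
    """
    ceiling = min(cap_seconds, base_seconds * 2 ** (attempt - 1))
    return rng() * ceiling

def next_step(attempt, max_retries):
    """Mirror the diagram: retry while attempt < max_retries, else dead-letter."""
    return "retry" if attempt < max_retries else "dead_letter"
```

Jitter matters here because a shared downstream outage fails many jobs at once; without it, they would all retry in the same instant.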

Lease / visibility timeout mechanics

The lease is what keeps a second scheduler pass from claiming a job that is still in flight, and what makes the job claimable again if its worker crashes. The flow is simple once you see it laid out:

1. The scheduler sets locked_until = now() + 30s when it claims and enqueues the job.
2. The worker has 30 seconds to finish and write back. If it dies mid-execution, the lease just expires on its own.
3. On the next scheduler loop iteration, locked_until < now() is true, so the job is eligible again, is re-enqueued, and gets picked up by the next available worker.
4. Because of step 3, the job must be idempotent: the same firing may execute more than once.

flowchart LR
    SCHED[Scheduler] -->|"SET locked_until = now+30s"| DB[(Schedule Store)]
    DB -->|enqueue| Q[Queue]
    Q --> W[Worker]
    W -->|"finish → SET locked_by=NULL, next_run_at=next"| DB
    W -.crash.-> EXPIRE["lease expires<br/>(locked_until < now)"]
    EXPIRE -->|"next scheduler scan picks it up"| SCHED
    style SCHED fill:#ff6b1a,color:#0a0a0f
    style DB fill:#0e7490,color:#fff
    style EXPIRE fill:#ff2e88,color:#fff
    style Q fill:#a855f7,color:#fff

The lease TTL must be longer than the job's expected execution time. For long-running jobs (minutes), either set the TTL accordingly, or have the worker heartbeat — periodically extending locked_until while executing so the lease doesn't expire on a healthy but slow job.
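
The lease rules, including the heartbeat, can be modelled in memory. This sketch mirrors the `locked_by` / `locked_until` columns but is not the article's implementation: the `JobLease` class is hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobLease:
    locked_by: Optional[str] = None
    locked_until: float = 0.0

    def claimable(self, now: float) -> bool:
        # Same condition as the scheduler query: unlocked, or lease expired.
        return self.locked_by is None or self.locked_until < now

    def claim(self, owner: str, now: float, ttl: float = 30.0) -> bool:
        if not self.claimable(now):
            return False
        self.locked_by, self.locked_until = owner, now + ttl
        return True

    def heartbeat(self, owner: str, now: float, ttl: float = 30.0) -> bool:
        # Only the current holder of a still-live lease may extend it.
        if self.locked_by != owner or self.locked_until < now:
            return False
        self.locked_until = now + ttl
        return True

    def release(self, owner: str) -> bool:
        if self.locked_by != owner:
            return False
        self.locked_by = None
        return True
```

Note the failed heartbeat case: a worker whose lease has already lapsed learns that another execution may be in progress, which is exactly the situation idempotency has to cover.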

Exactly-once side effects via idempotency

The scheduler guarantees at-least-once delivery. Exactly-once side effects require idempotent job logic. This is not a scheduler limitation — it is a fundamental property of distributed systems under partial failure.

See idempotency and exactly-once semantics for the full treatment. The short version:

def send_invoice_job(job_payload, idempotency_key):
    # Check if this key was already processed
    if invoice_db.exists(idempotency_key):
        return "already sent — skipping"

    invoice = generate_invoice(job_payload)
    email_api.send(invoice, idempotency_key=idempotency_key)
    invoice_db.mark_sent(idempotency_key)
    return "sent"

The idempotency_key is typically f"{job_id}:{scheduled_at_unix}" — stable across retries of the same scheduled firing, but different for each separate firing window. This way, a retry of Tuesday's invoice job doesn't skip Tuesday's invoice just because Monday's already ran.
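
Building that key is a one-liner, but it is worth pinning down that the timestamp is the scheduled firing time, not the time of the attempt. A minimal sketch, with `idempotency_key` as an illustrative helper name:

```python
from datetime import datetime, timezone

def idempotency_key(job_id: str, scheduled_at: datetime) -> str:
    """Key for one scheduled firing: stable across retries, distinct per window."""
    if scheduled_at.tzinfo is None:
        # A naive datetime would be interpreted in the host's local timezone,
        # so different hosts could derive different keys for the same firing.
        raise ValueError("scheduled_at must be timezone-aware")
    return f"{job_id}:{int(scheduled_at.timestamp())}"
```

Using the attempt's wall-clock time instead would give every retry a fresh key and defeat the deduplication.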

Handling cron expressions, timezones, and DST

Parsing cron

A cron expression like 0 9 * * 1 ("9 AM every Monday") must be evaluated in the job's configured timezone, not UTC. Computing next_run_at correctly requires a timezone-aware cron library that understands DST transitions.

The tricky cases:

Case	What happens	Correct behavior
Clock springs forward (DST start)	2:00 AM → 3:00 AM; 2:30 AM doesn't exist	Never fire at a nonexistent time; either fire at the first valid instant after the gap (3:00 AM) or skip the firing, and apply one policy consistently
Clock falls back (DST end)	1:00–2:00 AM occurs twice	Fire once; use the first occurrence by default
`0 2 * * *` in a timezone that skips 2:00 AM	Entire scheduled time disappears	Apply the same gap policy; if the firing is skipped, treat it as a missed window

Store next_run_at in UTC in the DB, but compute it from the cron expression + timezone every time you reschedule. Do not compute all future occurrences ahead of time — timezone rules change.
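
Detecting the gap and the repeated hour is possible with the standard library alone. The sketch below uses `zoneinfo` (Python 3.9+, and it assumes IANA timezone data is available on the host); `resolve_local` is a hypothetical helper, not part of any cron library, and it leaves the gap policy to the caller by returning no UTC instant for nonexistent times.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def resolve_local(naive_local: datetime, tz_name: str):
    """Classify a wall-clock time as 'normal', 'gap', or 'fold'.

    Returns (kind, utc_instant). For 'fold' the first occurrence is used;
    for 'gap' the instant is None because the wall-clock time never happens.
    """
    tz = ZoneInfo(tz_name)
    first = naive_local.replace(tzinfo=tz, fold=0)
    second = naive_local.replace(tzinfo=tz, fold=1)

    # A nonexistent wall-clock time does not survive a round trip through UTC.
    roundtrip = first.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
    if roundtrip != naive_local:
        return "gap", None
    # An ambiguous time has two different UTC offsets depending on fold.
    if first.utcoffset() != second.utcoffset():
        return "fold", first.astimezone(timezone.utc)
    return "normal", first.astimezone(timezone.utc)
```

The gap check has to come first, because a nonexistent time also reports two different offsets for `fold=0` and `fold=1`.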

Missed windows

When a scheduler comes back after downtime, it finds jobs with next_run_at in the past. What happens next depends on what the job does:

SKIP — the moment has passed, move on. Right for bill payment alerts.
RUN_ONCE — run now, even if late. Right for database vacuums where you want the work done, just not necessarily on-schedule.
RUN_ALL — run once for each missed window. Right for audit log archiving where every window matters.

RUN_ALL is the most dangerous option. A 24-hour outage on an hourly job triggers 24 executions. Cap this with a max_missed setting and alert when a large catch-up batch fires.
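
The three policies reduce to a small selection function over the list of missed firing times. This is a sketch under stated assumptions: `firings_to_run` is a hypothetical name, RUN_ONCE is keyed to the most recent missed window, and the `max_missed` cap keeps the most recent windows. A job that must never drop an interval would need the cap to alert or block instead of truncating.

```python
def firings_to_run(missed_windows, policy, max_missed=10):
    """Pick which missed firings to execute after downtime.

    missed_windows: scheduled firing times in ascending order, all in the past.
    policy: "SKIP", "RUN_ONCE", or "RUN_ALL".
    """
    if policy not in ("SKIP", "RUN_ONCE", "RUN_ALL"):
        raise ValueError(f"unknown missed_window policy: {policy}")
    if not missed_windows or policy == "SKIP":
        return []
    if policy == "RUN_ONCE":
        # One late run, attributed to the most recent missed window.
        return [missed_windows[-1]]
    # RUN_ALL: every missed window, capped to the most recent max_missed.
    start = max(0, len(missed_windows) - max_missed)
    return missed_windows[start:]
```

Each returned firing time would then be enqueued with its own idempotency key, so catch-up runs deduplicate the same way regular runs do.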

The timing wheel

For millions of jobs at steady state, scanning the DB every second works but is wasteful. A hierarchical timing wheel keeps the near-term horizon in memory:

Wheel structure (each level is an array of buckets):
  Level 0: 60 buckets, each = 1 second  (covers 1 minute)
  Level 1: 60 buckets, each = 1 minute  (covers 1 hour)
  Level 2: 24 buckets, each = 1 hour    (covers 1 day)

On each tick, the current bucket fires all jobs in it. Jobs farther out live in higher-level buckets and cascade down as time advances. Job insertions are O(m) where m is the number of wheel levels (a small constant, typically 3–5), which is effectively O(1) in practice. Firing is O(jobs in bucket). Memory: each shard holds ~1M jobs, but the wheel only loads the 10-minute near-term horizon (~1/6 of those), so roughly 167k × 64 B ≈ ~10 MB per shard — comfortably fits in any JVM heap.

The DB scan runs every minute to load new and modified jobs into the wheel. The wheel handles sub-minute precision; the DB is the authoritative source, and the wheel gets rebuilt from it on any restart.
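
A compact version of the three-level wheel described above might look like this. It is a sketch, not a production implementation: time is an integer number of seconds, `tick()` must be called once per second, and the `TimingWheel` class and its method names are illustrative.

```python
class TimingWheel:
    SIZES = (60, 60, 24)      # buckets per level
    SPANS = (1, 60, 3600)     # seconds covered by one bucket at each level

    def __init__(self, start_time: int):
        self.now = start_time
        self.levels = [[{} for _ in range(n)] for n in self.SIZES]
        self.where = {}       # job_id -> (level, slot), for O(1) cancellation

    def _place(self, job_id, fire_at):
        delay = fire_at - self.now
        for level, (size, span) in enumerate(zip(self.SIZES, self.SPANS)):
            if delay < size * span:              # fits within this level's range
                slot = (fire_at // span) % size
                self.levels[level][slot][job_id] = fire_at
                self.where[job_id] = (level, slot)
                return
        raise ValueError("fire_at is beyond the wheel's 24-hour horizon")

    def insert(self, job_id, fire_at: int):
        self.cancel(job_id)
        # The bucket for `now` has already fired, so overdue jobs go to the next tick.
        self._place(job_id, max(fire_at, self.now + 1))

    def cancel(self, job_id) -> bool:
        loc = self.where.pop(job_id, None)
        if loc is None:
            return False
        level, slot = loc
        del self.levels[level][slot][job_id]
        return True

    def _cascade(self, level, slot):
        # Move every job in a coarse bucket down to a finer level.
        bucket, self.levels[level][slot] = self.levels[level][slot], {}
        for job_id, fire_at in bucket.items():
            self._place(job_id, fire_at)

    def tick(self):
        """Advance one second and return the job IDs due at the new time."""
        self.now += 1
        if self.now % 3600 == 0:
            self._cascade(2, (self.now // 3600) % 24)
        if self.now % 60 == 0:
            self._cascade(1, (self.now // 60) % 60)
        slot = self.now % 60
        due, self.levels[0][slot] = self.levels[0][slot], {}
        for job_id in due:
            del self.where[job_id]
        return list(due)
```

Cascading runs from the coarsest level down before the current second fires, so a job scheduled exactly on an hour or minute boundary still fires on time.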

flowchart TD
    DB[(Schedule Store)] -->|load jobs due in next 10 min<br/>every 60 s| TW[Timing Wheel<br/>in-memory]
    TW -->|tick every 1 s<br/>fire current bucket| SCHED[Scheduler Loop]
    SCHED -->|enqueue| Q[Queue]
    NEW[New job registered] -->|next_run_at ≤ 10 min| TW
    NEW -->|next_run_at > 10 min| DB
    style TW fill:#ff6b1a,color:#0a0a0f
    style DB fill:#0e7490,color:#fff
    style Q fill:#a855f7,color:#fff

Full architecture

flowchart TD
    CLIENT[Client / API] -->|register / update / delete jobs| JAPI[Job Registry API]
    JAPI --> DB[(Schedule Store<br/>Postgres — sharded by job_id)]

    ETCD[(etcd / ZooKeeper<br/>leader election)] --> S0[Scheduler Shard 0<br/>+ timing wheel]
    ETCD --> S1[Scheduler Shard 1<br/>+ timing wheel]
    ETCD --> SN[Scheduler Shard N<br/>+ timing wheel]

    S0 -->|scan due jobs| DB
    S1 -->|scan due jobs| DB
    SN -->|scan due jobs| DB

    S0 --> MQ
    S1 --> MQ
    SN --> MQ

    MQ[Message Queue<br/>Kafka / SQS] --> W1[Worker]
    MQ --> W2[Worker]
    MQ --> WN[Worker]

    W1 -->|update next_run_at<br/>insert job_run| DB
    W2 -->|update next_run_at<br/>insert job_run| DB
    WN -->|update next_run_at<br/>insert job_run| DB

    W1 -.failed after max retries.-> DLQ[Dead-Letter Queue]
    W2 -.failed after max retries.-> DLQ
    WN -.failed after max retries.-> DLQ

    DLQ --> ALERT[Alert / oncall]

    style DB fill:#0e7490,color:#fff
    style MQ fill:#a855f7,color:#fff
    style S0 fill:#ff6b1a,color:#0a0a0f
    style S1 fill:#ff6b1a,color:#0a0a0f
    style SN fill:#ff6b1a,color:#0a0a0f
    style ETCD fill:#ffaa00,color:#0a0a0f
    style W1 fill:#15803d,color:#fff
    style W2 fill:#15803d,color:#fff
    style WN fill:#15803d,color:#fff
    style DLQ fill:#ff2e88,color:#fff

Job state machine

stateDiagram-v2
    [*] --> Scheduled: registered
    Scheduled --> Leased: scheduler picks up
    Leased --> Scheduled: lease expired (worker crashed)
    Leased --> Running: worker starts
    Running --> Succeeded: job completes
    Running --> Failed: job throws / times out
    Failed --> Leased: retry enqueued (attempt < max_retries)
    Failed --> Dead: max_retries exceeded, moved to DLQ
    Succeeded --> Scheduled: recurring — next_run_at updated
    Succeeded --> [*]: one-off job completed
    Dead --> [*]
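
The same state machine can be written as a transition table, which makes illegal transitions fail loudly instead of silently corrupting job state. The state names come from the diagram; the event names (`claim`, `start`, and so on) are invented for this sketch.

```python
TRANSITIONS = {
    ("Scheduled", "claim"): "Leased",
    ("Leased", "lease_expired"): "Scheduled",
    ("Leased", "start"): "Running",
    ("Running", "succeed"): "Succeeded",
    ("Running", "fail"): "Failed",
    ("Failed", "retry"): "Leased",
    ("Failed", "exhaust_retries"): "Dead",
    ("Succeeded", "reschedule"): "Scheduled",   # recurring jobs only
}

def transition(state: str, event: str) -> str:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} from {state!r}") from None

def replay(events, state="Scheduled"):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = transition(state, event)
    return state
```

One-off jobs simply stop at Succeeded, and Dead has no outgoing edges, matching the two terminal arrows in the diagram.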

Storage choices

Data	Store	Rationale
Job definitions	Postgres (sharded by `job_id`)	Strong consistency; `next_run_at` B-tree index; FK to runs
Lease state (`locked_by`, `locked_until`)	Same Postgres row	Atomic with `FOR UPDATE SKIP LOCKED`; no separate lock store
Job execution history	Postgres (time-partitioned) or Cassandra	Append-only, high write rate; partition by month
Leader election	etcd or ZooKeeper	Battle-tested; ephemeral leases with TTL
Message queue	Kafka or Amazon SQS	Kafka: replay, fan-out; SQS: managed, per-message visibility timeout
Timing wheel state	JVM heap (in-memory, per shard)	Rebuilt from DB on restart; ephemeral is fine
Dead-letter queue	Kafka DLQ topic or SQS DLQ	Inspectable; re-drive via UI

Failure modes and mitigations

Failure	Symptom	Mitigation
Worker dies mid-execution	Job lease expires; job requeued	Lease expiry + worker idempotency
Duplicate execution (two workers pick same job)	`FOR UPDATE SKIP LOCKED` prevents this at DB level; unlikely with queue	Idempotency key in job execution
Scheduler shard crashes	Standby acquires the etcd lock once the crashed leader's lease expires (the lease TTL, typically a few seconds; this is separate from etcd's internal Raft election timeout)	Fast re-election; jobs due during gap fire slightly late
DB partition / unavailability	Scheduler cannot scan; jobs stall	Multi-AZ Postgres; read replica for status checks
Clock skew between scheduler nodes	Two schedulers disagree on `now()`	Sync via NTP; use DB server timestamp (`now()` in SQL) not scheduler host clock
Hot-second spike (00:00:00)	500k jobs enqueued in 1 second	Queue absorbs; workers drain asynchronously; expect up to a minute of latency on spike
DST clock-back: job fires twice	"2:30 AM" occurs twice in one night	Cron library must detect and deduplicate using UTC `next_run_at`
`RUN_ALL` after long outage	Thousands of catch-up jobs flood queue	Cap `max_missed`; alert on large catch-up batches
Worker consumes message but queue ACK lost	Message redelivered after visibility timeout	Worker idempotency handles re-execution

Things to discuss in an interview

Why not just use SELECT FOR UPDATE without SKIP LOCKED? Because all schedulers would block on the same rows, serializing themselves into a single queue. SKIP LOCKED turns this into an efficient, parallel job dispatch.
Why decouple the scheduler from the worker? The scheduler's loop must complete within 1 second. If it runs the job inline, one slow job blocks all other firings. The queue is the buffer.
Why is at-least-once the right guarantee, not exactly-once? Exactly-once at the scheduler level would require distributed transactions across the queue, the DB, and the worker's external side effects — practically impossible without massive performance cost. Idempotent job design is the correct abstraction boundary.
What does the job specify as its idempotency key? Typically {job_id}:{scheduled_at_unix} — stable across retries of the same window, unique across different windows.
How do you handle a job that takes longer than the lease TTL? Two options: (1) extend the TTL with a heartbeat UPDATE, or (2) set a generous per-job lease TTL that exceeds the job's timeout. If using a queue like SQS, the worker extends the visibility timeout via API call.
How do you prevent the "hot shard" problem for jobs that all hash to the same shard? Use consistent hashing rather than job_id % N; distribute by hash of job_id with virtual nodes. See consistent hashing.
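
For the last question, a hash ring with virtual nodes can be sketched in a few lines. The `HashRing` class, the `vnodes` default, and the shard naming are assumptions for illustration; SHA-256 is used only because it is stable across processes.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard owns many points on the ring, which evens out the load.
        self._points = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def shard_for(self, job_id: str) -> str:
        # The first ring point clockwise from the job's hash owns the job.
        idx = bisect.bisect_right(self._keys, _hash(job_id)) % len(self._keys)
        return self._points[idx][1]
```

The benefit over a plain modulo shows up when the shard count changes: adding a shard moves only the jobs that land on the new shard's points, while `hash % N` would remap most jobs.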

Things you should now be able to answer

Why can't you just run two cron daemons for HA, and what goes wrong?
What does FOR UPDATE SKIP LOCKED do and why is it the right primitive here?
How does a job lease prevent double execution, and what is the failure mode when a worker dies?
Why is idempotency the job's responsibility rather than the scheduler's?
How does a timing wheel work, and when does the DB scan still matter?
What are the three strategies for handling missed job windows, and when do you use each?
Why must you compute next_run_at from the cron expression + timezone on every reschedule, rather than pre-computing future occurrences?

Frequently asked questions

▸What is the difference between at-least-once and exactly-once in a distributed job scheduler?

The scheduler guarantees at-least-once delivery: every due job will fire, but it may fire more than once if a worker crashes mid-execution and the lease expires. Exactly-once side effects are the job's responsibility, not the scheduler's, achieved through idempotent job logic keyed on a stable idempotency key such as job_id combined with the scheduled firing timestamp.

▸How does FOR UPDATE SKIP LOCKED prevent double execution in the scheduler loop?

When a scheduler shard issues the UPDATE to claim due jobs, FOR UPDATE SKIP LOCKED acquires a row lock on each job row but skips any row already locked by another scheduler — without blocking or deadlocking. Multiple shards can run the same query simultaneously and naturally divide the work into non-overlapping sets, so no two schedulers claim the same job.

▸What are the three strategies for handling missed job windows after downtime, and when should each be used?

SKIP discards the missed firing and is right for time-sensitive alerts like bill payment reminders. RUN_ONCE executes the job immediately even if late, which is right for maintenance work like a database vacuum where the work must get done but the exact time does not matter. RUN_ALL fires once for every missed window, which is right for audit log archiving where no interval can be dropped — but it is the most dangerous option and should be capped with a max_missed setting.

▸How does a hierarchical timing wheel work and what problem does it solve?

A hierarchical timing wheel holds the near-term job horizon in memory across three levels: 60 one-second buckets, 60 one-minute buckets, and 24 one-hour buckets. Each tick processes only the jobs in the current bucket, so firing cost is proportional to jobs due in that second, not to the total job count. Insertions and cancellations cost O(m) where m is the number of wheel levels, a small constant, compared to the O(log N) cost of a database index scan. With N=10 shards each owning 1M jobs, the wheel only loads the 10-minute near-term horizon per shard (roughly 167k jobs at a 1-hour median interval), so memory is about 167k x 64 B = ~10 MB per shard; the DB scan runs every minute only to refresh the far horizon.

▸What is the hot-second problem and how does the architecture address it?

The hot-second problem occurs when a large fraction of recurring jobs share the same scheduled time, such as 500k daily-at-midnight jobs all firing at 00:00:00. The scheduler enqueues all 500k jobs immediately, and the queue absorbs the burst; workers drain at their own rate, which at an aggregate worker throughput of 10k jobs per second takes roughly 50 seconds. Job start latency during a spike can reach about a minute, which the article considers acceptable given that the latency requirement is a few seconds under normal load, not milliseconds.
