~/articles/design-online-judge

◆◆Intermediateasked at LeetCodeasked at HackerRankasked at Codeforces

Design an Online Code Judge (LeetCode)

Run untrusted user code safely against test cases at scale. Sandboxed execution, a judge worker queue, resource limits, and contest-time spikes.

22 min read2026-04-25Ironclad Academy

#interview #sandboxing #queues #isolation

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

LeetCode has over 100,000 daily active users submitting code. HackerRank and Codeforces run competitive contests where thousands of participants slam the submit button within seconds of each other. An online judge is the infrastructure behind all of it: accept source code, compile it, run it against a battery of hidden test cases, and return a deterministic verdict — Accepted, Wrong Answer, Time Limit Exceeded — without ever letting that untrusted code touch anything it shouldn't.

The mechanics sound straightforward until you think about what "run untrusted code" actually means. Your service is a free, anonymous remote code execution platform. Every submission is a potential attack. A submission can try to fork-bomb the host, open outbound sockets to exfiltrate data, read another user's test inputs, or mine cryptocurrency on your machines. The sandbox that wraps every execution is not an implementation detail — it is the product. Get it wrong and you hand attackers root on your infrastructure.

Alongside the security problem sits a scaling problem with a sharp, predictable shape. Steady-state load is manageable: a few hundred submissions per second, each taking 2–4 seconds to judge. But when a contest ends and ten thousand competitors submit simultaneously, the queue must absorb a 10–20× burst without dropping a job or making users wait indefinitely. Workers are stateless and can be autoscaled, but only if the architecture separates "receive the submission" from "run the code" from the start.

Those two tensions — kernel-level isolation vs. throughput, and contest-spike absorption vs. cost efficiency — are what make this a rich interview problem. Every design decision flows from them.

Functional requirements

POST /submit { problem_id, language, source_code } → returns { submission_id } immediately.
GET /submissions/:id → returns current status and, when done, the full verdict.
Verdicts: AC (all test cases pass), WA (wrong output), TLE (exceeded time limit), MLE (exceeded memory limit), RE (runtime crash), CE (compilation failed).
Problems have N hidden test cases; execution must stop (and return WA/TLE/MLE/RE) on the first failing test case.
Languages: at minimum C++, Java, Python. Realistically 15–20 languages.
Contest mode: time-boxed contest where a leaderboard updates in near-real-time.

Non-functional requirements

Security above all: untrusted code must not escape its container, access the host network, read other users' submissions, or consume unbounded resources.
Correctness: verdict must be deterministic and reproducible given the same code, test input, and resource limits.
Latency: user should see a verdict in ≤ 10 s for a typical submission under steady load.
Availability: queue absorbs bursts; a judge worker crashing mid-submission must not lose the job.
Fair scheduling: no submission should starve behind a slow one.

Capacity estimation

Dimension	Estimate	How we got there
Submissions — steady state	500/sec	Baseline load assumption
Submissions — contest peak	5,000/sec	10× burst over steady state
Compile step	~0.5–2 s	C++ slow; Python skips compilation
Run one test case	~5–100 ms	Problem-dependent
Total judge time per submission	~2–4 s wall clock	`100 test cases × 20 ms avg = 2 s execution + compile overhead`
Worker slots — steady state	~1,500	`500 sub/sec × 3 s avg = 1,500 concurrent executions` (1 sandbox per slot)
Worker slots — contest peak	~15,000	`5,000 × 3 s = 15,000` slots, or ~2,000 workers each running ~7–8 sandboxes
Source code per submission	~10 KB	Average across languages
Verdict + per-test-case detail	~40 KB	stderr snippets + per-case result
Total storage per submission	~50 KB	`10 KB + 40 KB`
Object storage — daily	~2.16 TB/day	`500/sec × 86,400 s × 50 KB` → S3/GCS
Verdict metadata in Postgres	~43M rows/day	`500/sec × 86,400 s` at ~500 bytes/row; partition by date, cold-tier after 30 days
Queue throughput — steady	~0.5 MB/s	`500 msg/sec × 1 KB` per message (submission_id + metadata) — trivial for Kafka or SQS
Queue throughput — peak	~5 MB/s	`5,000 msg/sec × 1 KB`
Queue backlog at peak	Up to a few minutes	If workers lag; autoscaling should drain within ~5 min of spike onset

Takeaway: At steady state the system is well within reach of a single Kafka cluster and a few hundred workers; the design challenge is absorbing the 10× contest burst without pre-provisioning peak capacity at all times.

Building up to the design

V1: The naive approach — run it inline

@app.post("/submit")
def submit(code, lang, problem_id):
    verdict = run_in_subprocess(code, lang, problem_id)
    return {"verdict": verdict}

Spin up a subprocess, wait for it to finish, return. Works on your laptop in 30 lines.

The moment you put this in front of real users, it falls apart in every dimension. The API blocks for 2–4 seconds per submission, so one slow submission can starve every request on that thread. The subprocess runs as the API process user — no isolation at all. A fork bomb kills the server. And there's no autoscaling lever to pull when a contest ends and thousands of submissions arrive at once.

V2: Add resource limits — necessary, but not sufficient

Wrap the subprocess with ulimit and timeout. Cap virtual address space with RLIMIT_AS. Drop privileges with setuid. Kill the child if it exceeds wall-clock time.

This gives you basic resource enforcement — the subprocess can be killed. But the code still runs on the host kernel. A kernel exploit in user code, or even a well-crafted ptrace call or /proc read, can still observe the host. You've put a fence around the yard, but left the gate unlocked.

V3: Move execution to workers and a queue

Here's the first structurally correct decision: the API should never run code. Instead, it writes a job row to Postgres and publishes a message to a queue. A separate fleet of worker processes consumes that queue, one submission at a time.

This immediately buys you three things: the API returns in milliseconds, workers are independently scalable, and a crashed worker doesn't lose the job — the message reappears after the visibility timeout expires and another worker picks it up. The queue is the shock absorber you'll thank yourself for when a contest ends.

The remaining gap: workers still run code on the host OS with only cgroup protection. Any kernel vulnerability is a full host compromise.

V4: Add a real sandbox — gVisor or Firecracker

Replace the raw subprocess with a sandboxed runtime. Two main options:

gVisor (runsc) inserts a user-space kernel (the "Sentry") between the untrusted process and the host kernel. Every syscall the user code makes is intercepted by the Sentry — via systrap (the default since mid-2023) or KVM on bare metal — and re-implemented in safe Go code before anything reaches the real kernel. Startup is ~50–100 ms. The overhead is low for CPU-bound code but meaningful for syscall-heavy code, so benchmark for your workload.

Firecracker takes a different approach: it boots a minimal Linux kernel inside a KVM-based microVM — the NSDI'20 paper and AWS Lambda production data put this at roughly 125 ms for a highly tuned minimal kernel; a more typical setup with a richer userspace takes closer to 1 s cold. Snapshot-restore (freeze the VM after init, restore on demand) cuts resume time to tens of milliseconds (~28 ms in community benchmarks). Full VM-level isolation, VMM memory overhead under 5 MiB per VM. You get a real kernel boundary, not just a syscall filter.

Layer seccomp-bpf on top of either one as a secondary defense: even if something escapes gVisor's syscall table, a tight allowlist (read, write, fstat, exit, and little else) denies network syscalls, ptrace, mount, and clone with CLONE_NEWUSER. Defeating the sandbox now requires punching through gVisor or the hypervisor boundary and the seccomp filter simultaneously.

V5: Autoscale the worker pool

Workers are stateless — they pull from the queue, judge, write results, ack. Adding more workers is safe at any time. The autoscaler watches queue depth and the age of the oldest message; when submissions are waiting more than ~5 seconds, it provisions more workers. This is how you absorb a contest spike without pre-provisioning a fleet large enough for peak at all times.

flowchart LR
    V1["V1: sync subprocess<br/>0 isolation"] --> V2["V2: + cgroups/ulimit<br/>resource limits, host kernel"]
    V2 --> V3["V3: + queue + workers<br/>async, restartable"]
    V3 --> V4["V4: + gVisor / Firecracker<br/>kernel-level isolation"]
    V4 --> V5["V5: + autoscaling<br/>contest-safe"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

The sandbox — the hardest part

The sandbox is not a detail; it is the product. If you skip it, you don't have an online judge — you have a free remote code execution service for attackers.

What you must prevent

Threat	Mechanism	Mitigation
Fork bomb (`fork()` loop)	Spawns processes until OOM	`pids.max` (cgroups v1 and v2), `RLIMIT_NPROC`
Crypto mining	CPU monopolization	CPU time quota via cgroup `cpu.cfs_quota_us`; wall-clock timeout kills the container
Network exfiltration	Open socket, send data out	Sandbox network namespace with no external routes; seccomp blocks `socket()` syscall
Filesystem writes to host	`open("/etc/passwd", O_WRONLY)`	Read-only rootfs; no `/host` mounts; overlay filesystem per-submission
Reading another user's code	`open("/submissions/other.cpp")`	Per-submission ephemeral overlay; no shared mutable state
Memory exhaustion	Allocate 100 GB	cgroup v1 `memory.limit_in_bytes` / cgroup v2 `memory.max` → OOM-killed → MLE verdict
Sandbox escape via kernel bug	Exploit kernel syscall path	gVisor intercepts all syscalls at the sentry; Firecracker provides hypervisor boundary
Infinite recursion / stack overflow	Stack grows until SIGSEGV	`RLIMIT_STACK`, default stack size limits in runtime

Layered defense-in-depth

Think of security as rings, not switches. Each layer stops a different class of attack independently — an attacker has to defeat all of them simultaneously to do real damage.

flowchart TD
    L0["Layer 0: Drop root — run as nobody"] --> L1
    L1["Layer 1: seccomp-bpf allowlist<br/>only read, write, fstat, exit…"] --> L2
    L2["Layer 2: Linux namespaces<br/>network, PID, mount, IPC, UTS isolated"] --> L3
    L3["Layer 3: cgroup v2 resource limits<br/>CPU · memory · PIDs"] --> L4
    L4["Layer 4: gVisor sentry OR<br/>Firecracker KVM boundary"] --> L5
    L5["Layer 5: Wall-clock kill timer<br/>belt-and-suspenders in the worker"]
    style L0 fill:#0e7490,color:#fff
    style L1 fill:#15803d,color:#fff
    style L2 fill:#ff6b1a,color:#0a0a0f
    style L3 fill:#ffaa00,color:#0a0a0f
    style L4 fill:#a855f7,color:#fff
    style L5 fill:#ff2e88,color:#fff

Resource limits — the numbers

CPU time limit:     Set per-problem (e.g. 1 s for C++, 5 s for Python, 3 s for Java)
                    Enforced via cgroup cpu.cfs_quota_us (CFS quota, throttles silently)
                    or RLIMIT_CPU / SIGXCPU (rlimit-based, sends signal)
                    Worker also has a wall-clock timeout = 2× CPU limit (catches spinloops)

Memory limit:       Set per-problem (e.g. 256 MB)
                    cgroup v1: memory.limit_in_bytes = 256 * 1024 * 1024
                    cgroup v2: memory.max = 268435456
                    OOM kill → worker reads exit status → MLE verdict

PID limit:          pids.max = 64 (generous; a single-threaded C++ binary uses ~1)

Filesystem:         Read-only overlayfs base image
                    Ephemeral writable layer of 64 MB max (for temp files)
                    No /proc/sysrq-trigger, /dev/mem, /dev/kmem

Timing accuracy — why TLE is hard

Wall-clock time is noisy. A loaded host slows down any process, so two identical programs can produce different wall-clock measurements depending on what else is running. There are two ways to deal with this.

The cleaner approach is measuring CPU time, not wall-clock time: getrusage(RUSAGE_CHILDREN) or cgroup v1 cpuacct.usage (cgroup v2: cpu.stat) gives you CPU time consumed, independent of scheduling jitter. This is the right number to compare against the problem's time limit.

For competitive judges where TLE precision really matters, CPU pinning via the cpuset cgroup goes further: pin each sandbox to its own physical core so no other process competes there. Wall-clock time then approximates CPU time closely enough that the simpler measurement works. The trade-off is that you need as many cores as concurrent sandboxes, which constrains packing density.

Submission flow — end to end

sequenceDiagram
    participant U as User
    participant API as API Service
    participant MQ as Message Queue
    participant W as Judge Worker
    participant SB as Sandbox
    participant DB as Verdict Store
    participant WS as WebSocket Hub

    U->>API: POST /submit {code, lang, problem_id}
    API->>DB: INSERT submission (status=PENDING)
    API->>MQ: publish SubmissionJob {submission_id}
    API->>U: 202 Accepted {submission_id}

    MQ->>W: deliver SubmissionJob (visibility_timeout=60s)
    W->>DB: UPDATE status=RUNNING
    W->>SB: create sandbox, copy source
    SB->>SB: compile (if needed)
    alt Compilation error
        SB->>W: exit code != 0, stderr
        W->>DB: UPSERT verdict=CE, detail=stderr
        W->>MQ: ack message
    else Compiled OK
        loop for each test case (stop on first failure)
            W->>SB: run with stdin=test_input, limits
            SB->>W: stdout, exit_code, cpu_ms, mem_kb
            W->>W: compare stdout to expected output
        end
        W->>DB: UPSERT verdict={AC|WA|TLE|MLE|RE}, per-case detail
        W->>MQ: ack message
    end
    DB->>WS: CDC event or polling
    WS->>U: push verdict over WebSocket/SSE

A few properties of this flow are worth calling out explicitly. The API never runs code — it's a thin write-to-DB plus enqueue. The worker acks only after writing the verdict, so a crash before that point leaves the message to be redelivered. The verdict write is an idempotent upsert on submission_id, meaning a retried worker cannot create a duplicate row or overwrite a verdict that already settled. And the user sees a 202 immediately, then waits on a WebSocket for the result.

Architecture deep dive

flowchart TD
    U[User browser] --> GW[API Gateway<br/>auth + rate limit]
    GW --> API[Submission API]
    API --> DB[(Postgres<br/>submissions + verdicts)]
    API --> MQ[Kafka / SQS<br/>submission queue]
    MQ --> WP[Worker Pool<br/>autoscaled]
    WP --> SB["Sandbox Fleet<br/>gVisor / Firecracker"]
    SB --> TS[(Test Case Store<br/>S3 / object store)]
    WP --> DB
    WP --> OBJ[(Source + Artifact Store<br/>S3)]
    DB --> CDC[CDC / Pub-Sub]
    CDC --> WS[WebSocket Hub]
    WS --> U
    WP --> CACHE[(Compile Cache<br/>Redis)]
    style GW fill:#ff6b1a,color:#0a0a0f
    style SB fill:#15803d,color:#fff
    style MQ fill:#ffaa00,color:#0a0a0f
    style WS fill:#a855f7,color:#fff
    style CACHE fill:#0e7490,color:#fff

API Service

The API is stateless and deliberately thin. It authenticates the user, applies rate limiting (say, 10 submissions per minute under normal load, tighter during contests to prevent queue flooding), validates the payload (known language, problem exists, source under 64 KB), writes a submissions row with status = PENDING, publishes a SubmissionJob message containing only the submission_id, and returns 202 { submission_id }. The message carries the ID, not the full source — workers fetch the source themselves from the DB or object store.

Message Queue

The queue is the shock absorber for contest spikes. It needs to be durable (messages survive broker restarts), at-least-once (a crashed worker's message reappears after the visibility timeout), and equipped with a dead-letter queue for messages that fail repeatedly — typically submissions that OOM-kill the worker itself, which is rare but happens.

FIFO ordering within a partition is nice but not required — verdicts are independent, so reordering doesn't matter. Kafka or AWS SQS both fit. SQS is operationally simpler; Kafka is better if you also want the submission stream for downstream analytics and plagiarism detection.

Judge Worker

A worker is a long-running process that pulls one message, fetches the source, checks the compile cache (keyed on hash(source_code + language + compiler_flags + compiler_version)), and if there's no hit, compiles in an ephemeral sandbox and stores the artifact. Then for each test case in order, it starts a fresh namespace (PID, mount, IPC) reusing the compiled binary, feeds the test input via stdin, waits for exit with select/epoll plus a timeout, reads stdout, and compares to expected output — exact match, or a custom checker for problems with multiple valid answers. On the first non-AC result it stops, writes the final verdict via idempotent upsert, and acks the message.

Workers are stateless and horizontally scalable. The autoscaler watches queue depth and the age of the oldest message; if submissions are waiting more than 5 seconds, it provisions more.

Sandbox lifecycle

stateDiagram-v2
    [*] --> Created: worker spawns sandbox
    Created --> Compiling: source injected
    Compiling --> CompileError: non-zero exit
    Compiling --> Ready: binary available
    CompileError --> [*]: worker records CE
    Ready --> Running: test input fed via stdin
    Running --> Finished: clean exit within limits
    Running --> TLE: wall-clock kill fires
    Running --> MLE: OOM kill from cgroup
    Running --> RE: non-zero exit / signal
    Finished --> Verdict: compare stdout
    TLE --> [*]: worker records TLE
    MLE --> [*]: worker records MLE
    RE --> [*]: worker records RE
    Verdict --> Ready: next test case
    Verdict --> [*]: AC or WA, done

Each test case gets a fresh namespace (PID, mount, IPC) but reuses the compiled binary. This is the key optimization: compile once per submission, execute N times. The binary lands in a read-only bind mount inside the sandbox — the fresh namespace isolates side effects between test cases without paying the compilation cost again.

Test case storage

Test cases (input + expected output) live in object storage organized as problem_id/tc_{n}.in and problem_id/tc_{n}.out. Workers download and cache them on the worker node's local disk — not inside the sandbox, where the user code would be able to see the expected output.

Warm workers that have all test cases for a popular problem already on local disk can start execution within milliseconds of picking up a job, which matters for tail latency under load.

Compile cache

For C++ a typical compile takes 1–3 seconds. For Rust it can be 5 seconds. Skipping recompilation on retries or re-judges is a real win. The cache key is SHA256(source_code + language + compiler_flags + compiler_version); a hit means the worker skips straight to execution. Redis is a reasonable store: compiled artifacts are 100 KB–2 MB, a 24-hour TTL covers the cases where it matters (same user resubmitting, re-judge after a problem update), and LRU eviction handles the working set naturally.

Storage choices

Data	Store	Why
Submission metadata + verdicts	Postgres	Strong consistency; per-submission row; FK to users and problems
Source code (raw)	S3 / object store	Blobs up to 64 KB; cheap; content-addressed by hash
Compiled artifacts (cache)	Redis + local worker disk	Ephemeral; LRU; 24h TTL
Test case inputs/outputs	S3	Read-heavy; immutable per version; pre-fetched to worker local disk
Contest leaderboard	Redis sorted set	`ZADD leaderboard {score} {user_id}`; `ZRANK` in O(log N)
Submission event stream	Kafka	Fan-out to analytics, plagiarism detection, notifications

Handling language diversity

Different languages have fundamentally different execution models. The worker must handle:

Language	Compile step	Runtime	Notes
C++	g++ / clang++ (~1–3 s)	Native binary	Fastest; strictest TL
Java	javac (~1–2 s)	JVM + JIT	JVM startup varies: ~60–500 ms depending on classpath and container constraints; account for warmup in TL
Python 3	None (or py_compile for syntax check)	CPython interpreter	Slowest per operation; TL is usually 3–5× C++
Go	go build (~1 s)	Native binary	Fast startup; GC may cause TLE edge cases
JavaScript	None	Node.js	V8 JIT; startup ~100 ms
Rust	rustc (~1–5 s for typical CP submissions)	Native binary	Longer compile time than C++; cache hit is critical

The time limit in the problem metadata must be language-aware. A typical rule: C++ gets 1 s, Java 3 s, Python 5 s for the same problem.

Contest mode and leaderboard

During a contest, the leaderboard is a critical path. Every accepted submission must update it atomically.

A Redis sorted set is the right data structure:

// score = problems_solved * 1e9 - total_penalty_seconds
// Higher score = better rank. ZRANGE REV returns highest first.
ZADD contest:{contest_id}:leaderboard <score> <user_id>
ZRANGE contest:{contest_id}:leaderboard 0 99 WITHSCORES REV  // top 100

Encoding: score = problems_solved * 1e9 - total_penalty_seconds so that more problems solved → higher score → better rank under ZRANGE … REV (highest score = rank 1). Within the same problem count, lower penalty → higher score → better rank. The 1e9 factor ensures a difference in problems solved always dominates any difference in penalty (max contest penalty is in the tens-of-thousands of seconds, far below 1e9). All values stay well within float64's exact-integer range of ±2^53, so there is no precision loss. Exact encoding depends on the contest scoring rules.

The leaderboard is eventually consistent: a verdict write triggers a Kafka event, a leaderboard updater consumes it and runs the Redis command. Lag is typically under 1 second.

Failure modes

Worker crashes mid-submission

The visibility timeout on the queue message expires (typically 60–120 s). The message reappears and another worker picks it up. Because the verdict write is an idempotent upsert keyed on submission_id, there are no duplicate rows. If the first worker had already written a partial result before crashing, the second write with the same ID overwrites it cleanly.

Sandbox escape (security-critical)

If an attacker escapes the sandbox onto the host, they have full access to the worker VM. The blast radius is bounded by two choices: workers run on dedicated instances not shared with other services, and workers hold minimal DB credentials — write access scoped only to the verdict column of their own submission row. A hardened deployment routes verdict writes through a dedicated internal API so worker processes never hold general DB write credentials.

Host-level monitoring matters too: auditd or eBPF-based tools watching for anomalous syscalls (unexpected connect() to external IPs, execve of unknown binaries) can surface an escape before it progresses. Rotating worker instances regularly — daily or after a configurable number of submissions — limits the window of a persistent compromise.

Resource exhaustion on the worker host

If a submission writes more than the ephemeral overlay limit allows (say, 10 GB of output), the writable layer must be capped. Use dm-thin or overlay2 with a size limit on the upper layer. The worker also monitors its own disk usage and terminates sandboxes that exceed a threshold.

Noisy-neighbor timing skew

On a multi-tenant worker running several sandboxes concurrently, CPU pinning (cpuset) per sandbox eliminates CPU-level interference. Memory bandwidth contention is harder to eliminate at the hardware level. For high-stakes verdicts — final contest submissions — re-judge on a pinned, isolated worker.

Test case data corruption

If the expected output for a test case is corrupted in S3, every submission for that problem returns a wrong WA. The fix is to store a SHA256 checksum alongside each test case file and have workers verify it on download. Keep objects immutable and versioned in S3 — never overwrite, only create a new version and update the problem's pointer.

Plagiarism detection

Online judges operate at a scale where plagiarism is common in contests. The Kafka submission stream makes it straightforward to hook in a background plagiarism detector: compare source code using token-based similarity (MOSS-style: Winnowing algorithm with hashed n-grams), run it asynchronously after the contest ends, and route results to a moderation queue rather than directly to users. Running this synchronously on the submission path would add latency for no benefit to the user.

Things to discuss in an interview

Why async? The API must return immediately; judge time is 2–10 s. Synchronous execution would require the client to hold a connection open (HTTP long-poll) or the server to block a thread per submission — both are expensive at scale.
Why the queue absorbs contest spikes: a tidal wave of submissions at a contest deadline all arrive in seconds. The queue buffers them; workers drain at their own pace; autoscaling adds capacity.
What makes TLE accurate: CPU time (not wall-clock) measured via cgroups or getrusage; CPU pinning eliminates scheduling noise; the time limit is language-adjusted.
Why gVisor over just cgroups: cgroups limit resources but don't prevent syscall-level exploits. gVisor's Sentry intercepts every syscall before it reaches the host kernel — via systrap (default) or KVM — so the untrusted process never touches the host kernel's syscall handlers directly.
Idempotency of verdict writes: a worker can crash and retry without corrupting data. The upsert is keyed on submission_id, which is stable across retries.
Compile cache: amortizes the expensive compile step across retries and re-judges. Also crucial for Rust (long compile times).

Things you should now be able to answer

Why is a synchronous submission API a bad idea at scale?
What is the difference between CPU time and wall-clock time, and which should you enforce for TLE?
Name three attacks a sandbox must prevent and the mechanism that stops each.
What happens if a judge worker crashes after writing a partial verdict?
How does the queue help during contest spikes, and what drives worker autoscaling?
Why does Java need a higher time limit than C++ for the same problem?
What is the difference between gVisor and Firecracker, and why might you choose one over the other?

Frequently asked questions

▸What is the difference between gVisor and Firecracker for sandbox isolation in an online judge?

gVisor inserts a user-space kernel called the Sentry that intercepts every syscall before it reaches the host kernel, adding roughly 50-100 ms startup overhead. Firecracker boots a minimal Linux kernel inside a KVM microVM, providing a full hypervisor boundary with under 5 MiB VMM memory overhead per VM and around 125 ms cold start on a highly tuned setup, or tens of milliseconds when using snapshot-restore.

▸Why should you measure CPU time rather than wall-clock time for TLE verdicts?

Wall-clock time is noisy because a loaded host slows every process, so two identical programs can produce different measurements depending on scheduling. CPU time, measured via getrusage(RUSAGE_CHILDREN) or cgroup v1 cpuacct.usage (cgroup v2: cpu.stat), reflects actual compute consumed and is independent of scheduling jitter, giving deterministic and reproducible TLE verdicts.

▸How many judge worker slots does a system handling 500 submissions per second at steady state require?

Roughly 1,500 concurrent worker slots. At 500 submissions per second with an average judge time of 3 seconds per submission, Little's Law gives 500 times 3 equals 1,500 concurrent sandboxes, with each worker slot running one sandbox at a time.

▸Why must verdict writes use idempotent upserts keyed on submission_id?

A judge worker can crash mid-submission, causing the queue message to reappear after the visibility timeout and be picked up by a second worker. Without idempotent upserts, the second worker would create a duplicate row or overwrite a verdict that already settled cleanly; an upsert keyed on submission_id makes retries safe.

▸When should you choose Kafka over SQS for the online judge submission queue?

Kafka is the better fit when you also need the submission stream for downstream consumers such as analytics and plagiarism detection, since it supports fan-out to multiple consumer groups. SQS is operationally simpler if the queue's only job is delivering jobs to judge workers.

← previous

Design a Hotel / Airbnb Booking System

Design Top-K / Trending (heavy hitters)

// RELATED

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform