Design an Online Code Judge (LeetCode)
Run untrusted user code safely against test cases at scale. Sandboxed execution, a judge worker queue, resource limits, and contest-time spikes.
The problem
LeetCode has over 100,000 daily active users submitting code. HackerRank and Codeforces run competitive contests where thousands of participants slam the submit button within seconds of each other. An online judge is the infrastructure behind all of it: accept source code, compile it, run it against a battery of hidden test cases, and return a deterministic verdict — Accepted, Wrong Answer, Time Limit Exceeded — without ever letting that untrusted code touch anything it shouldn't.
The mechanics sound straightforward until you think about what "run untrusted code" actually means. Your service is a free, anonymous remote code execution platform. Every submission is a potential attack. A submission can try to fork-bomb the host, open outbound sockets to exfiltrate data, read another user's test inputs, or mine cryptocurrency on your machines. The sandbox that wraps every execution is not an implementation detail — it is the product. Get it wrong and you hand attackers root on your infrastructure.
Alongside the security problem sits a scaling problem with a sharp, predictable shape. Steady-state load is manageable: a few hundred submissions per second, each taking 2–4 seconds to judge. But when a contest ends and ten thousand competitors submit simultaneously, the queue must absorb a 10–20× burst without dropping a job or making users wait indefinitely. Workers are stateless and can be autoscaled, but only if the architecture separates "receive the submission" from "run the code" from the start.
Those two tensions — kernel-level isolation vs. throughput, and contest-spike absorption vs. cost efficiency — are what make this a rich interview problem. Every design decision flows from them.
Functional requirements
POST /submit { problem_id, language, source_code }→ returns{ submission_id }immediately.GET /submissions/:id→ returns current status and, when done, the full verdict.- Verdicts: AC (all test cases pass), WA (wrong output), TLE (exceeded time limit), MLE (exceeded memory limit), RE (runtime crash), CE (compilation failed).
- Problems have N hidden test cases; execution must stop (and return WA/TLE/MLE/RE) on the first failing test case.
- Languages: at minimum C++, Java, Python. Realistically 15–20 languages.
- Contest mode: time-boxed contest where a leaderboard updates in near-real-time.
Non-functional requirements
- Security above all: untrusted code must not escape its container, access the host network, read other users' submissions, or consume unbounded resources.
- Correctness: verdict must be deterministic and reproducible given the same code, test input, and resource limits.
- Latency: user should see a verdict in ≤ 10 s for a typical submission under steady load.
- Availability: queue absorbs bursts; a judge worker crashing mid-submission must not lose the job.
- Fair scheduling: no submission should starve behind a slow one.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Submissions — steady state | 500/sec | Baseline load assumption |
| Submissions — contest peak | 5,000/sec | 10× burst over steady state |
| Compile step | ~0.5–2 s | C++ slow; Python skips compilation |
| Run one test case | ~5–100 ms | Problem-dependent |
| Total judge time per submission | ~2–4 s wall clock | 100 test cases × 20 ms avg = 2 s execution + compile overhead |
| Worker slots — steady state | ~1,500 | 500 sub/sec × 3 s avg = 1,500 concurrent executions (1 sandbox per slot) |
| Worker slots — contest peak | ~15,000 | 5,000 × 3 s = 15,000 slots, or ~2,000 workers each running ~7–8 sandboxes |
| Source code per submission | ~10 KB | Average across languages |
| Verdict + per-test-case detail | ~40 KB | stderr snippets + per-case result |
| Total storage per submission | ~50 KB | 10 KB + 40 KB |
| Object storage — daily | ~2.16 TB/day | 500/sec × 86,400 s × 50 KB → S3/GCS |
| Verdict metadata in Postgres | ~43M rows/day | 500/sec × 86,400 s at ~500 bytes/row; partition by date, cold-tier after 30 days |
| Queue throughput — steady | ~0.5 MB/s | 500 msg/sec × 1 KB per message (submission_id + metadata) — trivial for Kafka or SQS |
| Queue throughput — peak | ~5 MB/s | 5,000 msg/sec × 1 KB |
| Queue backlog at peak | Up to a few minutes | If workers lag; autoscaling should drain within ~5 min of spike onset |
Takeaway: At steady state the system is well within reach of a single Kafka cluster and a few hundred workers; the design challenge is absorbing the 10× contest burst without pre-provisioning peak capacity at all times.
Building up to the design
V1: The naive approach — run it inline
@app.post("/submit")
def submit(code, lang, problem_id):
verdict = run_in_subprocess(code, lang, problem_id)
return {"verdict": verdict}
Spin up a subprocess, wait for it to finish, return. Works on your laptop in 30 lines.
The moment you put this in front of real users, it falls apart in every dimension. The API blocks for 2–4 seconds per submission, so one slow submission can starve every request on that thread. The subprocess runs as the API process user — no isolation at all. A fork bomb kills the server. And there's no autoscaling lever to pull when a contest ends and thousands of submissions arrive at once.
V2: Add resource limits — necessary, but not sufficient
Wrap the subprocess with ulimit and timeout. Cap virtual address space with RLIMIT_AS. Drop privileges with setuid. Kill the child if it exceeds wall-clock time.
This gives you basic resource enforcement — the subprocess can be killed. But the code still runs on the host kernel. A kernel exploit in user code, or even a well-crafted ptrace call or /proc read, can still observe the host. You've put a fence around the yard, but left the gate unlocked.
V3: Move execution to workers and a queue
Here's the first structurally correct decision: the API should never run code. Instead, it writes a job row to Postgres and publishes a message to a queue. A separate fleet of worker processes consumes that queue, one submission at a time.
This immediately buys you three things: the API returns in milliseconds, workers are independently scalable, and a crashed worker doesn't lose the job — the message reappears after the visibility timeout expires and another worker picks it up. The queue is the shock absorber you'll thank yourself for when a contest ends.
The remaining gap: workers still run code on the host OS with only cgroup protection. Any kernel vulnerability is a full host compromise.
V4: Add a real sandbox — gVisor or Firecracker
Replace the raw subprocess with a sandboxed runtime. Two main options:
gVisor (runsc) inserts a user-space kernel (the "Sentry") between the untrusted process and the host kernel. Every syscall the user code makes is intercepted by the Sentry — via systrap (the default since mid-2023) or KVM on bare metal — and re-implemented in safe Go code before anything reaches the real kernel. Startup is ~50–100 ms. The overhead is low for CPU-bound code but meaningful for syscall-heavy code, so benchmark for your workload.
Firecracker takes a different approach: it boots a minimal Linux kernel inside a KVM-based microVM — the NSDI'20 paper and AWS Lambda production data put this at roughly 125 ms for a highly tuned minimal kernel; a more typical setup with a richer userspace takes closer to 1 s cold. Snapshot-restore (freeze the VM after init, restore on demand) cuts resume time to tens of milliseconds (~28 ms in community benchmarks). Full VM-level isolation, VMM memory overhead under 5 MiB per VM. You get a real kernel boundary, not just a syscall filter.
Layer seccomp-bpf on top of either one as a secondary defense: even if something escapes gVisor's syscall table, a tight allowlist (read, write, fstat, exit, and little else) denies network syscalls, ptrace, mount, and clone with CLONE_NEWUSER. Defeating the sandbox now requires punching through gVisor or the hypervisor boundary and the seccomp filter simultaneously.
V5: Autoscale the worker pool
Workers are stateless — they pull from the queue, judge, write results, ack. Adding more workers is safe at any time. The autoscaler watches queue depth and the age of the oldest message; when submissions are waiting more than ~5 seconds, it provisions more workers. This is how you absorb a contest spike without pre-provisioning a fleet large enough for peak at all times.
flowchart LR
V1["V1: sync subprocess<br/>0 isolation"] --> V2["V2: + cgroups/ulimit<br/>resource limits, host kernel"]
V2 --> V3["V3: + queue + workers<br/>async, restartable"]
V3 --> V4["V4: + gVisor / Firecracker<br/>kernel-level isolation"]
V4 --> V5["V5: + autoscaling<br/>contest-safe"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
The sandbox — the hardest part
The sandbox is not a detail; it is the product. If you skip it, you don't have an online judge — you have a free remote code execution service for attackers.
What you must prevent
| Threat | Mechanism | Mitigation |
|---|---|---|
Fork bomb (fork() loop) | Spawns processes until OOM | pids.max (cgroups v1 and v2), RLIMIT_NPROC |
| Crypto mining | CPU monopolization | CPU time quota via cgroup cpu.cfs_quota_us; wall-clock timeout kills the container |
| Network exfiltration | Open socket, send data out | Sandbox network namespace with no external routes; seccomp blocks socket() syscall |
| Filesystem writes to host | open("/etc/passwd", O_WRONLY) | Read-only rootfs; no /host mounts; overlay filesystem per-submission |
| Reading another user's code | open("/submissions/other.cpp") | Per-submission ephemeral overlay; no shared mutable state |
| Memory exhaustion | Allocate 100 GB | cgroup v1 memory.limit_in_bytes / cgroup v2 memory.max → OOM-killed → MLE verdict |
| Sandbox escape via kernel bug | Exploit kernel syscall path | gVisor intercepts all syscalls at the sentry; Firecracker provides hypervisor boundary |
| Infinite recursion / stack overflow | Stack grows until SIGSEGV | RLIMIT_STACK, default stack size limits in runtime |
Layered defense-in-depth
Think of security as rings, not switches. Each layer stops a different class of attack independently — an attacker has to defeat all of them simultaneously to do real damage.
flowchart TD
L0["Layer 0: Drop root — run as nobody"] --> L1
L1["Layer 1: seccomp-bpf allowlist<br/>only read, write, fstat, exit…"] --> L2
L2["Layer 2: Linux namespaces<br/>network, PID, mount, IPC, UTS isolated"] --> L3
L3["Layer 3: cgroup v2 resource limits<br/>CPU · memory · PIDs"] --> L4
L4["Layer 4: gVisor sentry OR<br/>Firecracker KVM boundary"] --> L5
L5["Layer 5: Wall-clock kill timer<br/>belt-and-suspenders in the worker"]
style L0 fill:#0e7490,color:#fff
style L1 fill:#15803d,color:#fff
style L2 fill:#ff6b1a,color:#0a0a0f
style L3 fill:#ffaa00,color:#0a0a0f
style L4 fill:#a855f7,color:#fff
style L5 fill:#ff2e88,color:#fff
Resource limits — the numbers
CPU time limit: Set per-problem (e.g. 1 s for C++, 5 s for Python, 3 s for Java)
Enforced via cgroup cpu.cfs_quota_us (CFS quota, throttles silently)
or RLIMIT_CPU / SIGXCPU (rlimit-based, sends signal)
Worker also has a wall-clock timeout = 2× CPU limit (catches spinloops)
Memory limit: Set per-problem (e.g. 256 MB)
cgroup v1: memory.limit_in_bytes = 256 * 1024 * 1024
cgroup v2: memory.max = 268435456
OOM kill → worker reads exit status → MLE verdict
PID limit: pids.max = 64 (generous; a single-threaded C++ binary uses ~1)
Filesystem: Read-only overlayfs base image
Ephemeral writable layer of 64 MB max (for temp files)
No /proc/sysrq-trigger, /dev/mem, /dev/kmem
Timing accuracy — why TLE is hard
Wall-clock time is noisy. A loaded host slows down any process, so two identical programs can produce different wall-clock measurements depending on what else is running. There are two ways to deal with this.
The cleaner approach is measuring CPU time, not wall-clock time: getrusage(RUSAGE_CHILDREN) or cgroup v1 cpuacct.usage (cgroup v2: cpu.stat) gives you CPU time consumed, independent of scheduling jitter. This is the right number to compare against the problem's time limit.
For competitive judges where TLE precision really matters, CPU pinning via the cpuset cgroup goes further: pin each sandbox to its own physical core so no other process competes there. Wall-clock time then approximates CPU time closely enough that the simpler measurement works. The trade-off is that you need as many cores as concurrent sandboxes, which constrains packing density.
Submission flow — end to end
sequenceDiagram
participant U as User
participant API as API Service
participant MQ as Message Queue
participant W as Judge Worker
participant SB as Sandbox
participant DB as Verdict Store
participant WS as WebSocket Hub
U->>API: POST /submit {code, lang, problem_id}
API->>DB: INSERT submission (status=PENDING)
API->>MQ: publish SubmissionJob {submission_id}
API->>U: 202 Accepted {submission_id}
MQ->>W: deliver SubmissionJob (visibility_timeout=60s)
W->>DB: UPDATE status=RUNNING
W->>SB: create sandbox, copy source
SB->>SB: compile (if needed)
alt Compilation error
SB->>W: exit code != 0, stderr
W->>DB: UPSERT verdict=CE, detail=stderr
W->>MQ: ack message
else Compiled OK
loop for each test case (stop on first failure)
W->>SB: run with stdin=test_input, limits
SB->>W: stdout, exit_code, cpu_ms, mem_kb
W->>W: compare stdout to expected output
end
W->>DB: UPSERT verdict={AC|WA|TLE|MLE|RE}, per-case detail
W->>MQ: ack message
end
DB->>WS: CDC event or polling
WS->>U: push verdict over WebSocket/SSE
A few properties of this flow are worth calling out explicitly. The API never runs code — it's a thin write-to-DB plus enqueue. The worker acks only after writing the verdict, so a crash before that point leaves the message to be redelivered. The verdict write is an idempotent upsert on submission_id, meaning a retried worker cannot create a duplicate row or overwrite a verdict that already settled. And the user sees a 202 immediately, then waits on a WebSocket for the result.
Architecture deep dive
flowchart TD
U[User browser] --> GW[API Gateway<br/>auth + rate limit]
GW --> API[Submission API]
API --> DB[(Postgres<br/>submissions + verdicts)]
API --> MQ[Kafka / SQS<br/>submission queue]
MQ --> WP[Worker Pool<br/>autoscaled]
WP --> SB["Sandbox Fleet<br/>gVisor / Firecracker"]
SB --> TS[(Test Case Store<br/>S3 / object store)]
WP --> DB
WP --> OBJ[(Source + Artifact Store<br/>S3)]
DB --> CDC[CDC / Pub-Sub]
CDC --> WS[WebSocket Hub]
WS --> U
WP --> CACHE[(Compile Cache<br/>Redis)]
style GW fill:#ff6b1a,color:#0a0a0f
style SB fill:#15803d,color:#fff
style MQ fill:#ffaa00,color:#0a0a0f
style WS fill:#a855f7,color:#fff
style CACHE fill:#0e7490,color:#fff
API Service
The API is stateless and deliberately thin. It authenticates the user, applies rate limiting (say, 10 submissions per minute under normal load, tighter during contests to prevent queue flooding), validates the payload (known language, problem exists, source under 64 KB), writes a submissions row with status = PENDING, publishes a SubmissionJob message containing only the submission_id, and returns 202 { submission_id }. The message carries the ID, not the full source — workers fetch the source themselves from the DB or object store.
Message Queue
The queue is the shock absorber for contest spikes. It needs to be durable (messages survive broker restarts), at-least-once (a crashed worker's message reappears after the visibility timeout), and equipped with a dead-letter queue for messages that fail repeatedly — typically submissions that OOM-kill the worker itself, which is rare but happens.
FIFO ordering within a partition is nice but not required — verdicts are independent, so reordering doesn't matter. Kafka or AWS SQS both fit. SQS is operationally simpler; Kafka is better if you also want the submission stream for downstream analytics and plagiarism detection.
Judge Worker
A worker is a long-running process that pulls one message, fetches the source, checks the compile cache (keyed on hash(source_code + language + compiler_flags + compiler_version)), and if there's no hit, compiles in an ephemeral sandbox and stores the artifact. Then for each test case in order, it starts a fresh namespace (PID, mount, IPC) reusing the compiled binary, feeds the test input via stdin, waits for exit with select/epoll plus a timeout, reads stdout, and compares to expected output — exact match, or a custom checker for problems with multiple valid answers. On the first non-AC result it stops, writes the final verdict via idempotent upsert, and acks the message.
Workers are stateless and horizontally scalable. The autoscaler watches queue depth and the age of the oldest message; if submissions are waiting more than 5 seconds, it provisions more.
Sandbox lifecycle
stateDiagram-v2
[*] --> Created: worker spawns sandbox
Created --> Compiling: source injected
Compiling --> CompileError: non-zero exit
Compiling --> Ready: binary available
CompileError --> [*]: worker records CE
Ready --> Running: test input fed via stdin
Running --> Finished: clean exit within limits
Running --> TLE: wall-clock kill fires
Running --> MLE: OOM kill from cgroup
Running --> RE: non-zero exit / signal
Finished --> Verdict: compare stdout
TLE --> [*]: worker records TLE
MLE --> [*]: worker records MLE
RE --> [*]: worker records RE
Verdict --> Ready: next test case
Verdict --> [*]: AC or WA, done
Each test case gets a fresh namespace (PID, mount, IPC) but reuses the compiled binary. This is the key optimization: compile once per submission, execute N times. The binary lands in a read-only bind mount inside the sandbox — the fresh namespace isolates side effects between test cases without paying the compilation cost again.
Test case storage
Test cases (input + expected output) live in object storage organized as problem_id/tc_{n}.in and problem_id/tc_{n}.out. Workers download and cache them on the worker node's local disk — not inside the sandbox, where the user code would be able to see the expected output.
Warm workers that have all test cases for a popular problem already on local disk can start execution within milliseconds of picking up a job, which matters for tail latency under load.
Compile cache
For C++ a typical compile takes 1–3 seconds. For Rust it can be 5 seconds. Skipping recompilation on retries or re-judges is a real win. The cache key is SHA256(source_code + language + compiler_flags + compiler_version); a hit means the worker skips straight to execution. Redis is a reasonable store: compiled artifacts are 100 KB–2 MB, a 24-hour TTL covers the cases where it matters (same user resubmitting, re-judge after a problem update), and LRU eviction handles the working set naturally.
Storage choices
| Data | Store | Why |
|---|---|---|
| Submission metadata + verdicts | Postgres | Strong consistency; per-submission row; FK to users and problems |
| Source code (raw) | S3 / object store | Blobs up to 64 KB; cheap; content-addressed by hash |
| Compiled artifacts (cache) | Redis + local worker disk | Ephemeral; LRU; 24h TTL |
| Test case inputs/outputs | S3 | Read-heavy; immutable per version; pre-fetched to worker local disk |
| Contest leaderboard | Redis sorted set | ZADD leaderboard {score} {user_id}; ZRANK in O(log N) |
| Submission event stream | Kafka | Fan-out to analytics, plagiarism detection, notifications |
Handling language diversity
Different languages have fundamentally different execution models. The worker must handle:
| Language | Compile step | Runtime | Notes |
|---|---|---|---|
| C++ | g++ / clang++ (~1–3 s) | Native binary | Fastest; strictest TL |
| Java | javac (~1–2 s) | JVM + JIT | JVM startup varies: ~60–500 ms depending on classpath and container constraints; account for warmup in TL |
| Python 3 | None (or py_compile for syntax check) | CPython interpreter | Slowest per operation; TL is usually 3–5× C++ |
| Go | go build (~1 s) | Native binary | Fast startup; GC may cause TLE edge cases |
| JavaScript | None | Node.js | V8 JIT; startup ~100 ms |
| Rust | rustc (~1–5 s for typical CP submissions) | Native binary | Longer compile time than C++; cache hit is critical |
The time limit in the problem metadata must be language-aware. A typical rule: C++ gets 1 s, Java 3 s, Python 5 s for the same problem.
Contest mode and leaderboard
During a contest, the leaderboard is a critical path. Every accepted submission must update it atomically.
A Redis sorted set is the right data structure:
// score = problems_solved * 1e9 - total_penalty_seconds
// Higher score = better rank. ZRANGE REV returns highest first.
ZADD contest:{contest_id}:leaderboard <score> <user_id>
ZRANGE contest:{contest_id}:leaderboard 0 99 WITHSCORES REV // top 100
Encoding: score = problems_solved * 1e9 - total_penalty_seconds so that more problems solved → higher score → better rank under ZRANGE … REV (highest score = rank 1). Within the same problem count, lower penalty → higher score → better rank. The 1e9 factor ensures a difference in problems solved always dominates any difference in penalty (max contest penalty is in the tens-of-thousands of seconds, far below 1e9). All values stay well within float64's exact-integer range of ±2^53, so there is no precision loss. Exact encoding depends on the contest scoring rules.
The leaderboard is eventually consistent: a verdict write triggers a Kafka event, a leaderboard updater consumes it and runs the Redis command. Lag is typically under 1 second.
Failure modes
Worker crashes mid-submission
The visibility timeout on the queue message expires (typically 60–120 s). The message reappears and another worker picks it up. Because the verdict write is an idempotent upsert keyed on submission_id, there are no duplicate rows. If the first worker had already written a partial result before crashing, the second write with the same ID overwrites it cleanly.
Sandbox escape (security-critical)
If an attacker escapes the sandbox onto the host, they have full access to the worker VM. The blast radius is bounded by two choices: workers run on dedicated instances not shared with other services, and workers hold minimal DB credentials — write access scoped only to the verdict column of their own submission row. A hardened deployment routes verdict writes through a dedicated internal API so worker processes never hold general DB write credentials.
Host-level monitoring matters too: auditd or eBPF-based tools watching for anomalous syscalls (unexpected connect() to external IPs, execve of unknown binaries) can surface an escape before it progresses. Rotating worker instances regularly — daily or after a configurable number of submissions — limits the window of a persistent compromise.
Resource exhaustion on the worker host
If a submission writes more than the ephemeral overlay limit allows (say, 10 GB of output), the writable layer must be capped. Use dm-thin or overlay2 with a size limit on the upper layer. The worker also monitors its own disk usage and terminates sandboxes that exceed a threshold.
Noisy-neighbor timing skew
On a multi-tenant worker running several sandboxes concurrently, CPU pinning (cpuset) per sandbox eliminates CPU-level interference. Memory bandwidth contention is harder to eliminate at the hardware level. For high-stakes verdicts — final contest submissions — re-judge on a pinned, isolated worker.
Test case data corruption
If the expected output for a test case is corrupted in S3, every submission for that problem returns a wrong WA. The fix is to store a SHA256 checksum alongside each test case file and have workers verify it on download. Keep objects immutable and versioned in S3 — never overwrite, only create a new version and update the problem's pointer.
Plagiarism detection
Online judges operate at a scale where plagiarism is common in contests. The Kafka submission stream makes it straightforward to hook in a background plagiarism detector: compare source code using token-based similarity (MOSS-style: Winnowing algorithm with hashed n-grams), run it asynchronously after the contest ends, and route results to a moderation queue rather than directly to users. Running this synchronously on the submission path would add latency for no benefit to the user.
Things to discuss in an interview
- Why async? The API must return immediately; judge time is 2–10 s. Synchronous execution would require the client to hold a connection open (HTTP long-poll) or the server to block a thread per submission — both are expensive at scale.
- Why the queue absorbs contest spikes: a tidal wave of submissions at a contest deadline all arrive in seconds. The queue buffers them; workers drain at their own pace; autoscaling adds capacity.
- What makes TLE accurate: CPU time (not wall-clock) measured via cgroups or
getrusage; CPU pinning eliminates scheduling noise; the time limit is language-adjusted. - Why gVisor over just cgroups: cgroups limit resources but don't prevent syscall-level exploits. gVisor's Sentry intercepts every syscall before it reaches the host kernel — via systrap (default) or KVM — so the untrusted process never touches the host kernel's syscall handlers directly.
- Idempotency of verdict writes: a worker can crash and retry without corrupting data. The upsert is keyed on
submission_id, which is stable across retries. - Compile cache: amortizes the expensive compile step across retries and re-judges. Also crucial for Rust (long compile times).
Things you should now be able to answer
- Why is a synchronous submission API a bad idea at scale?
- What is the difference between CPU time and wall-clock time, and which should you enforce for TLE?
- Name three attacks a sandbox must prevent and the mechanism that stops each.
- What happens if a judge worker crashes after writing a partial verdict?
- How does the queue help during contest spikes, and what drives worker autoscaling?
- Why does Java need a higher time limit than C++ for the same problem?
- What is the difference between gVisor and Firecracker, and why might you choose one over the other?
Further reading
- Firecracker: Lightweight Virtualization for Serverless Applications — NSDI 2020 (usenix.org)
- gVisor: Sandboxing Container Workloads — gvisor.dev
- MOSS — A System for Detecting Software Plagiarism — theory.stanford.edu/~aiken/moss
- Codeforces judge architecture — codeforces.com/blog (community posts on judge design)
- Linux cgroups v2 documentation — kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
- seccomp-bpf — kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
Frequently asked questions
▸What is the difference between gVisor and Firecracker for sandbox isolation in an online judge?
gVisor inserts a user-space kernel called the Sentry that intercepts every syscall before it reaches the host kernel, adding roughly 50-100 ms startup overhead. Firecracker boots a minimal Linux kernel inside a KVM microVM, providing a full hypervisor boundary with under 5 MiB VMM memory overhead per VM and around 125 ms cold start on a highly tuned setup, or tens of milliseconds when using snapshot-restore.
▸Why should you measure CPU time rather than wall-clock time for TLE verdicts?
Wall-clock time is noisy because a loaded host slows every process, so two identical programs can produce different measurements depending on scheduling. CPU time, measured via getrusage(RUSAGE_CHILDREN) or cgroup v1 cpuacct.usage (cgroup v2: cpu.stat), reflects actual compute consumed and is independent of scheduling jitter, giving deterministic and reproducible TLE verdicts.
▸How many judge worker slots does a system handling 500 submissions per second at steady state require?
Roughly 1,500 concurrent worker slots. At 500 submissions per second with an average judge time of 3 seconds per submission, Little's Law gives 500 times 3 equals 1,500 concurrent sandboxes, with each worker slot running one sandbox at a time.
▸Why must verdict writes use idempotent upserts keyed on submission_id?
A judge worker can crash mid-submission, causing the queue message to reappear after the visibility timeout and be picked up by a second worker. Without idempotent upserts, the second worker would create a duplicate row or overwrite a verdict that already settled cleanly; an upsert keyed on submission_id makes retries safe.
▸When should you choose Kafka over SQS for the online judge submission queue?
Kafka is the better fit when you also need the submission stream for downstream consumers such as analytics and plagiarism detection, since it supports fan-out to multiple consumer groups. SQS is operationally simpler if the queue's only job is delivering jobs to judge workers.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.