
Advanced · asked at OpenAI, Anthropic, Google, Meta, NVIDIA

Design an LLM Inference & Serving System

Q: Why is KV cache, not model weights, the binding constraint on batch size?

A 70B-parameter model with 80 layers and 64 heads of dimension 128 requires roughly 2.6 MB of KV cache per token per request. A single 4096-token request consumes about 10.7 GB just for its key-value tensors, while the weights themselves are loaded once and shared across all concurrent requests. Once the 140 GB of fp16 weights are split across two 80 GB GPUs, only about 20 GB of HBM remains, and every new request that enters the batch must claim KV storage proportional to its context length. The batch size ceiling is therefore how many simultaneous requests fit in the remaining HBM, not how many FLOPs the device can execute.

Q: What is continuous batching and why does it outperform static batching for LLM serving?

Static batching holds a fixed group of requests together until all have finished generating, so the slots of fast requests sit idle until the slowest request completes, and the GPU cannot accept new work in the meantime. Continuous batching admits new requests and removes finished ones at each decode step (roughly every 5–20 ms), so the GPU is always working on the densest feasible batch. In practice this raises effective throughput 2–4x without changing the underlying model or hardware.

Q: What is PagedAttention and what problem does it solve?

Conventional serving systems allocate each sequence's KV cache as a single contiguous memory block sized for the maximum output length, which wastes memory badly: a block reserved for a request that might generate 4096 tokens is mostly unused when the request generates only 80. PagedAttention (introduced in vLLM, Kwon et al. 2023) divides KV cache into fixed-size pages analogous to virtual memory pages and maintains a per-sequence page table. Physical KV memory is allocated one page at a time as tokens arrive, which limits waste to at most one partially filled page per sequence and lets requests with identical prefixes share physical KV pages.

Q: How does speculative decoding reduce latency without changing output distribution?

A small, fast draft model generates k candidate tokens autoregressively. The large target model then verifies all k in a single forward pass, because it can score multiple positions in parallel the same way prefill does. If the target accepts all k, you get k tokens (plus one extra token sampled from the target's own output at the final position) for roughly the cost of one target step. At the first rejection, a replacement token is sampled from a corrected target distribution, so the output distribution is provably identical to sampling from the target alone; under greedy decoding the output matches token for token. At typical k=4–8 and acceptance rates around 70–80%, wall-clock latency drops by 1.5–3x on tasks with predictable continuations.

Q: Why does tensor parallelism require high-bandwidth interconnect and when do you use pipeline parallelism instead?

Tensor parallelism splits each weight matrix across GPUs so every device participates in every matrix multiply, then an all-reduce combines the partial activations. That all-reduce sits on the critical path of every layer in every forward pass, so it needs a high-bandwidth, low-latency interconnect such as NVLink or fast InfiniBand. Pipeline parallelism assigns whole transformer layers to different GPUs and passes activations forward one micro-batch at a time, which only requires bandwidth between adjacent stages and tolerates inter-node Ethernet. Its cost is pipeline bubbles: stages sit idle while the pipeline fills and drains, and the idle fraction grows with the number of stages unless enough micro-batches are kept in flight.

Serve token generation for a 70B-parameter model at scale — where KV cache, not FLOPs, caps concurrency and continuous batching is what separates good GPU utilization from terrible utilization.

25 min read · 2026-06-18 · Ironclad Academy

#interview #ai #llm #inference #gpu #caching #scaling

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

The autoregressive decode loop is not incidental to transformers — it is fundamental to how they work. Each token depends on all prior tokens, so position 42 cannot be computed until position 41 is ready, no matter how many GPUs you have. This makes LLM serving unlike almost any other large-scale inference problem.

The business pressure on this problem is severe. A single H100 GPU costs around $2–3/hour of cloud compute. The model weights for a 70B-parameter model occupy 140 GB at fp16 — more than one 80 GB GPU can hold — so you need at least two GPUs just to load the model, before you have served a single request. If those GPUs sit idle most of the time because you are running one request at a time, the math on cost per million tokens never closes.

The core engineering tension is this: to maximize throughput you want as many requests in-flight simultaneously as possible, sharing the same GPU time-slices. But each in-flight request requires its key-value cache to live in high-bandwidth memory for the duration of the conversation, and HBM is the scarcest resource on the machine. Batching more requests means running out of KV memory faster. The whole discipline of LLM serving is finding clever ways to fit more KV into less memory, waste fewer cycles waiting, and keep GPUs saturated without blowing the per-user latency budget.

This article covers the infrastructure behind the inference API — the layer between a user's HTTP request and the model weights. For how to build the ML model itself, see the ML system design framework. For recommendation and ranking systems that sit on top of LLMs, see design a recommendation system.

Functional requirements

POST /v1/chat/completions with stream: true — return tokens as server-sent events as they are generated.
Serve multiple named models from the same fleet (e.g., a 7B model and a 70B model).
Support LoRA adapter hot-swapping — different tenants use different fine-tuned adapters on top of a shared base model.
Enforce per-tenant rate limits and token budgets. (For rate-limit design, see design a rate limiter.)
Expose model health and throughput metrics.

Non-functional requirements

p99 TTFT (time to first token) < 500 ms.
p99 ITL (inter-token latency) < 50 ms.
Fleet output throughput >= 1 M tokens/second.
Cost < $1 per million output tokens.
Multi-tenant fairness — no single tenant starves others.
GPU utilization >= 70% during business hours.

Capacity estimation

Dimension	Estimate	How we got there
Model weights (70B fp16)	140 GB	`70 × 10⁹ params × 2 bytes/param`
Min GPUs for weights alone	2× A100/H100 80 GB	`140 GB ÷ 80 GB/GPU ≈ 1.75 → 2 GPUs`
KV cache per token (70B model)	~2.6 MB	`2 [k,v] × 80 layers × 64 heads × 128 head_dim × 2 bytes = 2,621,440 bytes`
KV cache — avg 512-token request	~1.3 GB	`2.6 MB × 512 tokens`
KV cache — max 4096-token request	~10.7 GB	`2.6 MB × 4,096 tokens`
Concurrent requests per 2-GPU node (naively, avg context)	~8–15 at 512-token avg context	`~20 GB free HBM after weights ÷ ~1.3 GB per request`
Concurrent requests with PagedAttention	20–40	Near-zero fragmentation; dynamic page allocation; prefix sharing
Decode throughput per 2-GPU node	~800–1,200 tokens/sec	Empirical; memory-bandwidth-bound at ~3.35 TB/s of HBM bandwidth per H100 SXM GPU
Fleet for 1 M tokens/sec	~1,000 nodes, ~2,000 GPUs	`1 M tokens/sec ÷ 1,000 tokens/sec/node`
Cost per million output tokens	~$1.16–$1.74	`($2.50/hr per H100 × 2 H100/node) ÷ (2.88–4.32 M tokens/hr/node)` at 800–1,200 tokens/sec

Takeaway: KV cache memory, not FLOPs, is the binding constraint on batch size. A 70B model at a 4096-token context window consumes ~10.7 GB of HBM per request — over half the free memory on a 2-GPU node after weights are loaded. At a more realistic average context of 512 tokens each request uses ~1.3 GB, allowing 8–15 concurrent requests naively. PagedAttention and prefix sharing are not optional optimizations — they are what make multi-tenant serving financially viable.
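
The table's arithmetic can be reproduced in a few lines. This is a minimal sketch that uses only the constants stated above (80 layers, 64 heads, head dimension 128, fp16, $2.50 per GPU-hour); the function names are illustrative, and the concurrency figure is an upper bound that ignores fragmentation and activation memory.

```python
# Capacity arithmetic for the 70B example; constants come from the table above.
BYTES_FP16 = 2
LAYERS, HEADS, HEAD_DIM = 80, 64, 128


def kv_bytes_per_token(layers=LAYERS, heads=HEADS, head_dim=HEAD_DIM, dtype_bytes=BYTES_FP16):
    # The factor of 2 covers the key tensor and the value tensor.
    return 2 * layers * heads * head_dim * dtype_bytes


def max_concurrent_requests(free_hbm_bytes, avg_context_tokens):
    # Upper bound with zero fragmentation: free HBM divided by per-request KV.
    return free_hbm_bytes // (kv_bytes_per_token() * avg_context_tokens)


def cost_per_million_tokens(gpu_hourly_usd, gpus_per_node, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * gpus_per_node / tokens_per_hour * 1_000_000


weights_bytes = 70 * 10**9 * BYTES_FP16        # 140 GB of fp16 weights
free_hbm = 2 * 80 * 10**9 - weights_bytes      # 20 GB left on a 2-GPU node

print(kv_bytes_per_token())                                  # 2621440
print(max_concurrent_requests(free_hbm, 512))                # 14
print(round(cost_per_million_tokens(2.50, 2, 1000), 2))      # 1.39
```

Changing `avg_context_tokens` to 4096 shows why long contexts collapse the batch: only one such request fits in the 20 GB that remains.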

Building up to the design

The evolution of LLM serving is a clean story of identifying bottlenecks one by one and eliminating them.

V1: One request per GPU — the obvious baseline

Run one request at a time. Load the model weights, run the prefill pass on the prompt, then decode tokens one by one until the stop condition. GPU utilization during the decode loop is poor, because the model spends most of its time reading weights from memory to execute a single-token forward pass. Streaming 140 GB of weights through HBM to compute one token takes roughly 40–50 ms at one H100's memory bandwidth, and a 200-token response repeats that 200 times. The GPU reaches maybe 10–15% of its theoretical throughput.

V2: Static batching — pack multiple requests together

Group a fixed set of requests, run them together through each decode step. Matrix multiplications now operate on a batch of size B rather than 1, dramatically improving arithmetic intensity and GPU utilization.

The problem: requests in a static batch finish at different times. If one request asks for a 20-token answer and another asks for a 500-token answer, the 20-token request finishes early and its slot sits idle until the slow one completes. The batch cannot admit new requests until all members finish. Average GPU utilization might improve to 40–50%, but this head-of-line blocking leaves bubbles whenever output lengths vary widely.

flowchart LR
    R1[Request 1<br/>20 tokens] --> BATCH[Static Batch<br/>waits for slowest]
    R2[Request 2<br/>500 tokens] --> BATCH
    R3[Request 3<br/>120 tokens] --> BATCH
    BATCH --> R1D[done at step 20<br/>idle until step 500]
    BATCH --> R3D[done at step 120<br/>idle until step 500]
    BATCH --> R2D[done at step 500]
    style BATCH fill:#ff6b1a,color:#0a0a0f
    style R1D fill:#0e7490,color:#fff

V3: Continuous batching — the key insight

Instead of holding a fixed batch until all requests finish, insert new requests and remove finished ones at every decode step. The scheduler evaluates the queue each step, approximately every 5–20 ms. A request that finished at step 20 frees its slot immediately; a new waiting request fills it before step 21 runs.

This is iteration-level scheduling, introduced by the Orca paper (Yu et al., OSDI 2022), and it is the single most important technique in production LLM serving. Throughput improvements of 2–4x over static batching are typical.
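
The difference between the two policies can be seen in a toy simulation. The sketch below counts decode steps only, assuming every step costs the same regardless of batch occupancy and ignoring prefill and KV limits; the function names and the workload are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    # Each fixed batch runs until its longest member finishes.
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    # Free slots are refilled from the queue before every decode step.
    waiting = deque(output_lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # A request with one token left emits it this step and leaves the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


workload = [100, 10, 10, 100, 10, 10, 100, 10, 10]  # output lengths in tokens
print(static_batching_steps(workload, batch_size=3))      # 300
print(continuous_batching_steps(workload, batch_size=3))  # 130
```

On this mixed workload the same tokens are produced in 130 steps instead of 300, about 2.3x fewer, which sits inside the 2–4x range quoted above. A workload with uniform output lengths would show no gain.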

V4: PagedAttention — eliminate KV fragmentation

Continuous batching revealed a second problem: the scheduler cannot always fill freed slots. Conventional allocators reserve each request's KV cache as a single contiguous block of memory, sized at the maximum possible output length. A request that might generate up to 4096 tokens needs 4096 × 2.6 MB ≈ 10.7 GB reserved upfront, even if it only generates 80 tokens. Most of that reservation sits unused, and because it is contiguous, no other request can use the gaps.

PagedAttention (vLLM, Kwon et al. 2023) borrows virtual memory paging. Physical KV memory is divided into fixed-size pages, for example 16 tokens per page, which is ~42 MB per page for the 70B model (16 × 2.6 MB). Each sequence maintains a page table mapping logical token positions to physical page addresses. Pages are allocated one at a time as tokens arrive and returned to a free pool when the request finishes. Waste drops to at most one partially filled page per sequence at the tail. Two requests with identical prefixes can share their physical KV pages read-only, similar to copy-on-write forking.

flowchart TD
    SEQ1[Sequence A<br/>prompt tokens 0–63] --> PT1[Page Table A]
    SEQ2[Sequence B<br/>same system prompt] --> PT2[Page Table B]
    PT1 -->|pages 0-3| PPOOL[(Physical KV<br/>Page Pool)]
    PT2 -->|pages 0-3 shared| PPOOL
    PT1 -->|pages 4-6 unique| PPOOL
    PT2 -->|pages 4-5 unique| PPOOL
    style PPOOL fill:#15803d,color:#fff
    style PT1 fill:#0e7490,color:#fff
    style PT2 fill:#a855f7,color:#fff

V5: Tensor and pipeline parallelism — models bigger than one node

A 70B model needs 2 GPUs. A 405B model at fp16 would need ~810 GB — 11 GPUs minimum. Two parallelism strategies apply:

Tensor parallelism splits each weight matrix across GPUs column-wise. Every device executes part of every matrix multiply, then an all-reduce synchronization produces the correct output. The all-reduce happens inside each decoder layer, so latency is critical — this only works well with NVLink (within a node) or very fast InfiniBand. NVIDIA's Megatron-LM demonstrated this at scale.

Pipeline parallelism assigns contiguous groups of transformer layers to different devices. Device 0 processes layers 0–19, device 1 processes layers 20–39, and so on. Activations pass between stages. The bandwidth requirement is much lower (just a single activation tensor at each stage boundary), so this works over standard inter-node Ethernet. The cost is pipeline bubbles: stages sit idle while the pipeline fills at the start of each batch and drains at the end.

In practice: tensor-parallel within a node (using NVLink), pipeline-parallel across nodes (using InfiniBand or Ethernet).
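
The size of the pipeline bubble can be estimated with the standard fill-and-drain formula. The sketch below assumes a GPipe-style schedule in which every stage takes the same time per micro-batch; serving systems that stream micro-batches continuously will see a smaller bubble, so treat the result as a rough upper bound.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    # Idle share of each stage's timeline in a fill-then-drain schedule:
    # (p - 1) idle slots out of (m + p - 1) total slots.
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


print(pipeline_bubble_fraction(4, 1))    # 0.75: one micro-batch leaves 3 of 4 stages idle
print(pipeline_bubble_fraction(4, 12))   # 0.2: more micro-batches amortize fill and drain
```

The takeaway is that adding stages without adding micro-batches in flight makes utilization worse, which is one reason pipeline parallelism is reserved for models that do not fit in a single node.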

V6: Prefix caching and speculative decoding

Two orthogonal latency/throughput wins:

Prefix caching. Many API users send the same system prompt with every request. Without caching, every request runs a full prefill over that system prompt, consuming GPU time proportional to prompt length. With prefix caching, the KV pages for any prompt prefix that has been seen before are kept in a hot cache. A cache hit skips the prefill for the shared prefix entirely. vLLM's prefix caching can reduce TTFT by 2–5x for workloads with long system prompts.

Speculative decoding. A small draft model (3B–7B parameters) generates k candidate tokens in k sequential steps. The large target model then verifies all k tokens in a single forward pass — because it processes them all at once, in parallel across positions, the way prefill works. Tokens that the target accepts are emitted; the first rejected token is resampled from the target's distribution. Output quality is provably identical to running the target model alone. At k=4 and 75% acceptance rate, you emit roughly 3 tokens per target-model forward pass instead of 1, cutting wall-clock ITL by ~2–3x on typical chat workloads. (Leviathan et al., 2023.)
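
The "roughly 3 tokens per pass" figure follows from a short calculation. The sketch below assumes each draft token is accepted independently with the same probability, which is the simplification used in Leviathan et al.; real acceptance rates vary by position and by prompt.

```python
def expected_tokens_per_target_pass(acceptance_rate, k):
    # Counts accepted draft tokens plus the one token the target pass always
    # contributes (a correction on rejection, or a bonus token if all k pass).
    if acceptance_rate == 1.0:
        return float(k + 1)
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


print(round(expected_tokens_per_target_pass(0.75, 4), 2))  # 3.05
print(round(expected_tokens_per_target_pass(0.50, 4), 2))  # 1.94
```

The wall-clock speedup is lower than this ratio because the k draft steps are not free; the gain shrinks as the draft model gets slower relative to the target.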

V7: Prefill/decode disaggregation, quantization, and autoscaling

The most advanced production systems separate the prefill and decode phases onto different hardware.

Prefill is compute-bound: processing a long prompt is a dense matrix multiply that saturates GPU FLOPs. Decode is memory-bandwidth-bound: generating one token requires reading all the weights, which is limited by HBM bandwidth. Running both phases on the same GPU means the two workloads interfere — a batch of long prefills delays the decode of short responses that are already mid-generation.

Disaggregation: prefill requests go to a pool of prefill-optimized workers (possibly older GPUs with high compute density), decode requests go to a separate pool of decode workers. The KV cache tensors are transferred between pools after prefill completes. This is the frontier architecture in 2025, pioneered by systems like Splitwise and adopted internally at major providers.

Quantization. INT8 weight quantization (GPTQ, AWQ) cuts weight memory in half, so the 70B model's ~70 GB of weights fit on one 80 GB GPU rather than two, although that leaves only ~10 GB for KV cache. FP8 quantization (supported on H100) also uses 1 byte per parameter (the same footprint as INT8) but retains a floating-point exponent, recovering most of fp16 quality. FP8's advantage over INT8 is numerical accuracy, not smaller size. The trade-off for INT8 is 5–15% degradation on reasoning benchmarks, which is acceptable for many applications.

API

POST /v1/chat/completions
{
  "model": "llm-70b",
  "messages": [{ "role": "user", "content": "..." }],
  "stream": true,
  "max_tokens": 512,
  "temperature": 0.7
}

→ SSE stream:
data: {"choices": [{"delta": {"content": "The"}}]}
data: {"choices": [{"delta": {"content": " answer"}}]}
...
data: [DONE]
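
On the client side, consuming this stream amounts to reading `data:` lines until the `[DONE]` sentinel. The helper below is a minimal sketch that parses lines already read from the response body; `iter_sse_content` is a hypothetical name, and the chunk shape is assumed to match the example above.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from an iterable of SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


stream = [
    'data: {"choices": [{"delta": {"content": "The"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " answer"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(stream)))  # The answer
```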

The same endpoint accepts an optional LoRA adapter field:

POST /v1/chat/completions
{
  "model": "llm-70b",
  "lora_adapter": "customer-support-v3",
  ...
}

LoRA adapters are small rank-decomposition matrices (a few hundred MB) loaded on top of the base weights. Multiple adapters can live in GPU memory simultaneously, switching between them per-request with negligible overhead.

Architecture

flowchart TD
    CLI[Client] --> GW[API Gateway<br/>Auth / Rate-limit / Routing]
    GW --> QUEUE[Priority Queue<br/>Admission Control]
    QUEUE --> SCHED[Continuous Batch<br/>Scheduler]
    GW --> PC[(Prefix KV<br/>Cache)]
    PC -->|cache hit| SCHED

    SCHED --> PFW[Prefill Workers<br/>compute-bound]
    SCHED --> DCW[Decode Workers<br/>memory-BW-bound]

    PFW --> KVC[(Paged KV<br/>Pool)]
    DCW --> KVC
    KVC -->|KV pages| DCW

    PFW --> WS[(Weight Shards<br/>NVMe / object store)]
    DCW --> WS

    DRAFT[Draft Model<br/>Speculative Decode] --> DCW
    LORA[(LoRA Adapters)] --> DCW
    LORA --> PFW

    DCW -->|token stream| SCHED
    SCHED -->|SSE tokens| GW
    GW -->|stream| CLI

    METRICS[Metrics / Autoscaler] --> GW
    METRICS --> SCHED

    style GW fill:#ff6b1a,color:#0a0a0f
    style SCHED fill:#a855f7,color:#fff
    style PFW fill:#0e7490,color:#fff
    style DCW fill:#15803d,color:#fff
    style KVC fill:#ffaa00,color:#0a0a0f
    style PC fill:#ff2e88,color:#fff
    style DRAFT fill:#15803d,color:#fff

The autoregressive loop: prefill vs. decode

Understanding why prefill and decode are different is prerequisite to understanding every optimization in this system.

Prefill processes the entire input prompt in a single forward pass. The prompt tokens are all known upfront, so they can be processed in parallel across sequence positions — this is the same dense matrix multiply that runs during training. A 4096-token prompt on a 70B model takes roughly 200–400 ms on two H100s. The GPU is compute-bound.

Decode generates one token at a time. At each step, the model takes the last generated token, runs a forward pass, and samples the next token. The critical bottleneck: to run a single-token forward pass through a 70B model, the GPU must read ~140 GB of weight data through HBM. The arithmetic intensity (FLOPs per byte loaded) of a single-token forward pass is far too low to saturate the GPU's compute units — most of the time, the compute cores wait for HBM to deliver weight data. This is the memory-bandwidth bottleneck. Batching helps because a batch of 32 requests reads the same weight bytes once but computes 32 output tokens from them, improving arithmetic intensity by 32x.
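
A back-of-the-envelope model makes the batching effect concrete. The sketch below assumes the step time is set purely by streaming the weights once through HBM, at roughly 3.35 TB/s per H100 SXM GPU across two GPUs reading in parallel; it ignores KV cache reads, all-reduce time, and kernel overhead, so real steps are slower and real throughput lower.

```python
def decode_step_ms(weight_bytes, hbm_bytes_per_sec):
    # Lower bound: every weight byte is read once per decode step.
    return weight_bytes / hbm_bytes_per_sec * 1e3


def tokens_per_sec(batch_size, step_ms):
    # Each decode step emits one token for every request in the batch.
    return batch_size * 1e3 / step_ms


WEIGHT_BYTES = 140e9           # 70B parameters at fp16
NODE_BANDWIDTH = 2 * 3.35e12   # assumed: two GPUs, each reading its own shard

step = decode_step_ms(WEIGHT_BYTES, NODE_BANDWIDTH)
print(round(step, 1))          # 20.9
for batch in (1, 8, 32):
    print(batch, round(tokens_per_sec(batch, step)))
```

Because the step time barely changes with batch size while the bound holds, throughput scales almost linearly with the number of requests in the batch until KV memory or compute becomes the limit.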

sequenceDiagram
    participant C as Client
    participant GW as Gateway
    participant SC as Scheduler
    participant PF as Prefill Worker
    participant DC as Decode Worker

    C->>GW: POST /v1/chat/completions (stream: true)
    GW->>SC: Enqueue request (priority, token budget)
    SC->>PF: Prefill: process prompt (4096 tokens, single pass)
    Note over PF: ~200-400 ms, compute-bound
    PF-->>SC: KV pages written to pool
    SC->>DC: Add to decode batch (KV pages ref)
    loop Every ~5-20 ms (one decode step)
        DC->>DC: forward pass over batch
        DC-->>GW: next token
        GW-->>C: SSE: data: {"delta": {"content": "..."}}
    end
    DC-->>GW: stop token / max_tokens reached
    GW-->>C: data: [DONE]
    SC->>SC: Release KV pages back to pool

KV cache and PagedAttention

The attention mechanism in a transformer computes, for each layer, a query (Q), key (K), and value (V) matrix from the current token. The keys and values from all prior tokens must be available for the current attention computation — these are the KV tensors that must live in HBM for the duration of each active request.

For a 70B model with 80 layers, 64 attention heads, and a head dimension of 128:

KV cache per token = 2 (k, v) × 80 layers × 64 heads × 128 dim × 2 bytes (fp16)
                   = 2 × 80 × 64 × 128 × 2
                   = 2,621,440 bytes ≈ 2.6 MB per token

A single 1024-token conversation uses ~2.7 GB. A single 4096-token context uses ~10.7 GB. On a 2-GPU node with 160 GB total HBM and 140 GB consumed by weights, only ~20 GB remains for KV. If allocated naively, that is room for just one request at maximum context length (two would need ~21.4 GB), or about 8–15 requests at a more realistic 512-token average context.

PagedAttention solves the fragmentation problem by dividing physical KV memory into 16-token pages (~42 MB per page for the 70B model). Each request maintains a logical-to-physical page table, identical in concept to virtual memory paging. Physical pages are allocated only when tokens arrive and freed immediately when the request ends. The practical result: batch sizes jump from single digits to 20–40 on the same hardware, because:

No pre-reservation at maximum output length.
Completed requests' pages are immediately reusable.
Requests with shared system prompts can map the same physical pages read-only.
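
The three properties above fall out of a small amount of bookkeeping. The class below is a toy sketch of the page-table idea, not vLLM's implementation: `PagedKVPool` and its methods are hypothetical names, pages are plain integers rather than GPU buffers, and only completely filled pages are shared so that no copy-on-write is needed.

```python
PAGE_TOKENS = 16  # tokens per KV page, as in the text


class PagedKVPool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # physical page ids not in use
        self.refcount = {}                  # physical page id -> sequences mapping it
        self.page_tables = {}               # sequence id -> list of physical page ids
        self.lengths = {}                   # sequence id -> tokens stored

    def add_sequence(self, seq_id, shared_from=None):
        # Optionally map another sequence's full pages read-only (prefix sharing).
        table, length = [], 0
        if shared_from is not None:
            full_pages = self.lengths[shared_from] // PAGE_TOKENS
            table = self.page_tables[shared_from][:full_pages]
            for page in table:
                self.refcount[page] += 1
            length = full_pages * PAGE_TOKENS
        self.page_tables[seq_id] = list(table)
        self.lengths[seq_id] = length

    def append_token(self, seq_id):
        # A new physical page is claimed only when the last one is full.
        if self.lengths[seq_id] % PAGE_TOKENS == 0:
            if not self.free:
                raise MemoryError("KV pool exhausted; the scheduler must preempt")
            page = self.free.pop()
            self.refcount[page] = 1
            self.page_tables[seq_id].append(page)
        self.lengths[seq_id] += 1

    def release(self, seq_id):
        # Pages return to the free pool once no sequence maps them.
        for page in self.page_tables.pop(seq_id):
            self.refcount[page] -= 1
            if self.refcount[page] == 0:
                del self.refcount[page]
                self.free.append(page)
        del self.lengths[seq_id]
```

A sequence that holds 40 tokens occupies three pages; a second sequence created with `shared_from` reuses the first two and only allocates a page of its own when it appends its next token.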

FlashAttention (Dao et al.) complements PagedAttention at the kernel level: it computes attention in tiles that fit in on-chip SRAM rather than materializing the full attention score matrix in HBM, which reduces HBM traffic during the attention computation itself. For the vector index structures (HNSW, IVF-PQ) that underpin semantic search on top of LLM outputs, see Design a Vector Database.

The continuous batching scheduler

The scheduler is the central coordinator. It runs approximately every 5–20 ms (one decode step) and decides:

Which waiting requests to admit to the batch (bounded by available KV pages).
Which active requests have finished and should be evicted.
Which requests to preempt if KV memory is exhausted (swap to CPU or recompute).

The scheduling policy must balance several competing goals:

Fairness: a tenant that sends a very long request should not block shorter requests from other tenants for seconds. Token-bucket quotas per tenant limit how many KV pages any single tenant can hold simultaneously.
Priority: paid tiers get lower TTFT SLOs and therefore higher admission priority.
Anti-starvation: a low-priority request that has been waiting long enough gets a priority boost, preventing indefinite starvation.

When KV memory is exhausted and a new high-priority request must be admitted, the scheduler can preempt a lower-priority in-flight request. It has two options: swap the KV pages to CPU DRAM (fast, keeps state, resumes without recompute; vLLM calls this "swapping") or discard the KV cache and recompute from scratch when the request is rescheduled (slower but requires no CPU memory buffer). The choice depends on the latency budget for the preempted request and available CPU memory.
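
One scheduler iteration can be sketched as "evict finished requests, then admit by priority while pages last". The code below is a simplified illustration with hypothetical names (`Request`, `schedule_step`); it reserves each request's pages at admission and leaves out preemption, per-tenant quotas, and the anti-starvation boost described above.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    priority: int                       # lower value means more urgent
    arrival: int                        # tie-breaker: earlier arrival first
    req_id: str = field(compare=False)
    pages_needed: int = field(compare=False)
    remaining_tokens: int = field(compare=False)


def schedule_step(waiting, running, free_pages):
    """Return (still_waiting, running, free_pages) after one scheduling pass."""
    still_running = []
    for req in running:
        if req.remaining_tokens == 0:
            free_pages += req.pages_needed   # finished: return its KV pages
        else:
            still_running.append(req)

    still_waiting = []
    for req in sorted(waiting):              # priority first, then arrival order
        if req.pages_needed <= free_pages:
            free_pages -= req.pages_needed
            still_running.append(req)
        else:
            still_waiting.append(req)        # does not fit this step; retry next step
    return still_waiting, still_running, free_pages
```

Letting a small request jump ahead of a large one that does not fit keeps the batch dense, but it is exactly the behavior that makes an aging-based priority boost necessary in a real scheduler.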

Parallelism strategies

Strategy	Splits	Interconnect requirement	Best for
Tensor parallelism (TP)	Each weight matrix column-wise across N GPUs	NVLink or fast InfiniBand; all-reduce per layer	Models larger than one GPU; within a node
Pipeline parallelism (PP)	Contiguous layer groups to different stages	Standard Ethernet; activations passed between stages	Very large models; across nodes
Expert parallelism (EP)	MoE experts distributed across GPUs	All-to-all token dispatch and combine at each MoE layer; each token routed to 2–8 experts	Mixture-of-experts models (Mixtral, Grok)
Data parallelism (DP)	Full model replicated; requests sharded	None within replica; coordination for batching	Smaller models; multiple identical replicas

Production deployments typically use TP=8 within a DGX node (8 GPUs via NVLink) and DP (multiple independent nodes) for throughput scaling. PP is added when the model is too large for a single node at the chosen TP degree.

Speculative decoding

The intuition: a decode step on the large model is memory-bandwidth-bound, so scoring several positions in one forward pass costs about the same as scoring one. Why not use a smaller model to propose candidates and the large model to vet them in a single pass?

The draft model generates k tokens autoregressively, producing draft tokens t₁ ... tₖ and their probability distributions. The target model then runs a single forward pass over the entire prefix plus all k draft tokens simultaneously, producing reference probability distributions for each position. Draft tokens are accepted or rejected via a rejection sampling rule: token t is accepted with probability min(1, p_target(t) / p_draft(t)); otherwise a corrected residual distribution is sampled, guaranteeing the overall output distribution matches sampling from the target model alone.
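
The acceptance rule can be written out directly. The sketch below works on explicit probability tables (dicts from token to probability) rather than model outputs; `verify_draft` and `sample` are hypothetical helper names, and `p_target` is assumed to hold one more distribution than there are draft tokens, since the target pass also scores the position after the last draft token.

```python
import random


def sample(weights, rng):
    # Draw one token in proportion to its (unnormalized) weight.
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, p_draft, p_target, rng=random):
    """Return the tokens emitted for one target pass over k draft tokens."""
    emitted = []
    for i, token in enumerate(draft_tokens):
        q = p_draft[i][token]               # > 0 because the draft sampled it
        p = p_target[i].get(token, 0.0)
        if rng.random() < min(1.0, p / q):
            emitted.append(token)           # accepted
            continue
        # Rejected: sample from the residual max(0, p_target - p_draft).
        residual = {
            t: max(0.0, pt - p_draft[i].get(t, 0.0))
            for t, pt in p_target[i].items()
        }
        emitted.append(sample(residual, rng))
        return emitted                      # later draft tokens are discarded
    # All k accepted: the same pass yields one extra token from the target.
    emitted.append(sample(p_target[len(draft_tokens)], rng))
    return emitted
```

When the two distributions agree, every draft token is accepted and the call returns k + 1 tokens; when the target gives a draft token zero probability, it is always rejected and a single corrected token is returned.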

The acceptance rate depends on how predictable the continuation is. On code completion and factual lookups, acceptance rates can exceed 85%, yielding close to k tokens per target-model pass. On creative writing with high temperature, acceptance rates drop to 50–60%, and the benefit shrinks. Most production systems use k=4 to 8.

Multi-LoRA serving

Fine-tuned LoRA adapters let customers customize model behavior without serving separate full model copies. A LoRA adapter for a 70B model might be 200–500 MB — a rounding error compared to the base model.

The challenge is hot-swapping: different concurrent requests in the same batch may need different adapters. The merged weight matrix for LoRA is W + ΔW = W + AB, where A and B are the low-rank factors. Applying this per-request inside a batch requires either batching requests by adapter (degrades continuous batching) or computing the ΔW additions separately and accumulating them into the output activations — the latter is what systems like S-LoRA (Sheng et al.) implement, using unified paging for adapter weights.
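
The "compute the low-rank path separately and accumulate" option looks like this in miniature. The sketch uses plain Python lists and a per-request loop for readability, under the article's ΔW = AB convention; production systems such as S-LoRA do the same arithmetic in fused GPU kernels, and `lora_batch_forward` is a hypothetical name.

```python
def matvec(x, m):
    # Row vector x (length d) times matrix m (d rows, n columns).
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def lora_batch_forward(xs, adapter_ids, W, adapters):
    """xs: one input row per request; adapters: name -> (A, B) with delta W = A B."""
    outputs = []
    for x, name in zip(xs, adapter_ids):
        y = matvec(x, W)                      # shared base projection
        if name is not None:
            A, B = adapters[name]
            delta = matvec(matvec(x, A), B)   # (x A) B, never materializes A B
            y = [base + extra for base, extra in zip(y, delta)]
        outputs.append(y)
    return outputs


W = [[1, 0], [0, 1]]
adapters = {"tenant-1": ([[1], [1]], [[2, 0]])}   # rank-1 factors, illustrative values
print(lora_batch_forward([[1, 2], [1, 2]], ["tenant-1", None], W, adapters))
# [[7, 2], [1, 2]]
```

Both requests share the base multiply; only the first pays for the extra rank-1 path, which is why mixed-adapter batches stay cheap as long as the rank is small.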

Prefix and prompt caching

Many requests to hosted LLM APIs include a system prompt. A developer might send the same 2000-token system prompt with every API call. Without caching, every request burns prefill compute on those 2000 tokens. With prefix caching, the serving layer hashes each token prefix above some length threshold and stores the corresponding KV pages under that hash; on a cache hit, those pages are attached to the new request's page table and prefill for the shared prefix is skipped.

The cache must be invalidated when a model is updated or when the prefix itself changes. Implementation: a content-addressed store of KV pages keyed by (model_id, prefix_hash). Page eviction follows LRU, weighted by prefix frequency and length (longer prefixes are more valuable to cache).
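
A content-addressed page store can be sketched with a rolling hash over page-sized chunks of the prompt. In the code below each key covers the model id plus every token up to the end of that page, so a lookup naturally stops at the first page where two prompts diverge. The names are hypothetical, eviction is plain LRU rather than the frequency- and length-weighted policy described above, and token ids are assumed to be non-negative integers below 2^32.

```python
import hashlib
from collections import OrderedDict

PAGE_TOKENS = 16


def page_keys(model_id, token_ids):
    """One key per full page; each key commits to the whole prefix so far."""
    keys = []
    digest = hashlib.sha256(model_id.encode())
    full = len(token_ids) - len(token_ids) % PAGE_TOKENS
    for start in range(0, full, PAGE_TOKENS):
        for token in token_ids[start:start + PAGE_TOKENS]:
            digest.update(token.to_bytes(4, "little"))
        keys.append(digest.copy().hexdigest())
    return keys


class PrefixCache:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # key -> physical page id, oldest first

    def lookup(self, model_id, token_ids):
        """Return physical pages for the longest cached prefix."""
        hits = []
        for key in page_keys(model_id, token_ids):
            if key not in self.pages:
                break
            self.pages.move_to_end(key)   # mark as recently used
            hits.append(self.pages[key])
        return hits

    def insert(self, model_id, token_ids, physical_pages):
        for key, page in zip(page_keys(model_id, token_ids), physical_pages):
            self.pages[key] = page
            self.pages.move_to_end(key)
            while len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
```

Including `model_id` in the hash means a model update invalidates old entries implicitly: the new model's keys never match the old ones, and stale pages age out through eviction.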

A warming strategy matters: on cold start, the prefix cache is empty and TTFT spikes. Production systems pre-warm the cache with the top-N most common system prompts during model load.

SLO management: TTFT vs. throughput pools

TTFT and throughput are in tension. To minimize TTFT, you want small batch sizes — the prefill starts immediately without waiting for other requests to arrive. To maximize throughput, you want large batch sizes — more tokens computed per weight-loading cycle. Serving both optimally from the same pool is impossible.

The solution: separate request pools with separate SLO targets.

Interactive pool: small max-batch-size (~16–32 requests), aggressive prefill scheduling, targets p99 TTFT < 300 ms. Slightly lower throughput is acceptable.
Batch pool: large max-batch-size (~256+ requests), relaxed TTFT (seconds acceptable), optimizes for tokens/sec/$. Used for async generation, background jobs, offline evaluation.

The gateway routes requests based on stream: true (interactive pool) vs. stream: false with a large max_tokens (batch pool).

Autoscaling

GPU autoscaling is fundamentally different from scaling stateless HTTP servers, for two reasons:

Cold start takes 5–15 minutes. Downloading model weights (140 GB for a 70B model) from object storage, loading them into GPU memory, and warming the prefix cache cannot be done in seconds. A traffic spike that arrives in 60 seconds cannot be absorbed by scaling before it peaks.
The marginal GPU in a tensor-parallel group is worth nothing. A 2-GPU TP group must add 2 GPUs at once, not 1. Autoscaling must reason in terms of full model instances, not individual GPUs.

Production solutions:

Keep-warm pool: maintain N instances idle during off-peak hours. The cost is ~20–30% of peak compute — cheaper than the latency penalty of cold starts during traffic spikes.
Request queuing with admission control: under load, queue requests up to a time-based deadline (e.g., 30 seconds for interactive requests) rather than immediately returning 503. Most traffic spikes are shorter than 30 seconds.
Predictive scaling: traffic patterns are highly predictable (time-of-day, day-of-week, product event schedules). Scale out 15 minutes before expected peaks.
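
Sizing decisions therefore round up to whole model instances. The helper below is a minimal sketch with hypothetical names; the per-instance throughput and the headroom percentage are inputs you would measure or choose, not values the article prescribes.

```python
import math


def instances_needed(target_tokens_per_sec, tokens_per_sec_per_instance, headroom_pct=20):
    # Integer percentage keeps the arithmetic exact before rounding up.
    required = target_tokens_per_sec * (100 + headroom_pct)
    return math.ceil(required / (100 * tokens_per_sec_per_instance))


def gpus_to_add(current_instances, target_instances, gpus_per_instance):
    # Scale-out happens in whole tensor-parallel groups, never single GPUs.
    return max(0, target_instances - current_instances) * gpus_per_instance


print(instances_needed(1_000_000, 1000))   # 1200
print(gpus_to_add(3, 5, 2))                # 4
```

With 5–15 minute cold starts, the target passed to `instances_needed` should be the forecast load at the end of the startup window, not the current load.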

Edge cases & gotchas

Long context OOM and preemption

A request with a 32k-token context can exhaust an entire node's available KV pages mid-generation. The scheduler must detect this and choose: preempt another request (with KV swap or recompute), reject the long-context request at admission, or migrate some pages to CPU. Without admission control that accounts for maximum potential KV usage, a single large request can OOM the node and crash all concurrent requests.

Stragglers and batch interference

In a continuous batching setup, one very slow request (long output, high max_tokens) stays in the batch for many steps, contributing latency to the batch's critical path. Its KV pages occupy slots other requests could use. The fix: impose per-request token budgets at admission, and preempt requests that have exceeded an age threshold in favor of fairness.

Bursty arrivals and the admission queue

The moment a product launches or a model announcement goes viral, request rates can spike 50–100x in seconds. Without an admission queue, all requests hit GPU workers simultaneously, workers run out of memory, and in-flight requests are dropped. The queue absorbs the burst and serves requests in order; requests that would exceed their queueing deadline receive a 429 with a Retry-After header.

Tail latency from batch diversity

A batch containing a mixture of very short (4-token) and very long (512-token) responses has poor average efficiency. The scheduler can apply batch bin-packing: prefer grouping requests of similar estimated output length, using the first few decode steps' token generation rate to estimate final length dynamically. Output length prediction (a tiny classifier on the prompt) helps, but it is imprecise.

Numerical precision and quantization errors

INT8 quantization can degrade long-chain reasoning tasks noticeably. A simple detection heuristic: run shadow evals on a held-out benchmark before promoting a quantized model to production traffic. Monitor perplexity and MMLU-style benchmarks continuously, not just at model deployment time.

Trade-offs to discuss in an interview

Batch size vs. latency. Every request added to the current batch increases throughput (amortizes weight reads across more tokens) but also increases the time each request in the batch waits for its slot in the decode step. The optimal batch size is not constant — it depends on request mix, model size, and SLO targets. Continuous batching adapts dynamically, but the scheduler still needs a target concurrency ceiling.

Quantization vs. output quality. INT8 weight quantization (AWQ, GPTQ) halves memory requirements and raises effective throughput by ~30–50% on the same hardware. The quality degradation is 5–15% on reasoning benchmarks and nearly invisible on summarization and chat. Whether the trade-off is acceptable depends entirely on the use case. Always measure quality on your actual task distribution, not generic benchmarks.

Speculative decoding vs. wasted compute. A draft model adds GPU memory overhead and compute. At low acceptance rates, you run the draft model forward, generate k tokens, run the target model to verify them, reject most, and are left with 1 correct token — paying the cost of both models for less benefit than running the target alone. Speculative decoding is net-positive only when acceptance rates exceed ~50–60% and the draft model is significantly cheaper than the target. Profile on production traffic before committing.

Prefix caching vs. cache management complexity. A hot prefix cache dramatically reduces TTFT for API customers with long system prompts. The complexity: the cache must be consistent across the TP group (all GPUs in the group must have the same pages), invalidated on model update, and sized correctly (too small and hit rates collapse; too large and you are holding KV pages that could serve live requests). The operational burden is non-trivial.

Prefill/decode disaggregation vs. operational complexity. Separate hardware pools for prefill and decode let you optimize each independently, but now you have two fleets to manage, KV transfer between them introduces network overhead, and the boundary makes debugging harder. The benefit is real (20–40% throughput improvement at matched latency), but this is an advanced optimization: worth mentioning in an interview, and worth measuring before you commit to it.
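The KV transfer overhead is easy to estimate. The sketch below reuses the article's figure of roughly 2.6 MB of KV cache per token for the 70B example; the 50 GB/s link (about 400 Gb/s) is an assumed value for illustration, not a recommendation.

```python
def kv_transfer_seconds(prompt_tokens: int, kv_bytes_per_token: int,
                        link_bytes_per_s: float) -> float:
    """Time to ship one request's prefill KV cache to the decode pool."""
    return prompt_tokens * kv_bytes_per_token / link_bytes_per_s


KV_BYTES_PER_TOKEN = 2 * 80 * 64 * 128 * 2  # K and V, 80 layers, 64 heads x 128 dims, FP16
LINK_BYTES_PER_S = 50e9                     # assumed ~400 Gb/s network link

transfer_s = kv_transfer_seconds(4096, KV_BYTES_PER_TOKEN, LINK_BYTES_PER_S)
print(f"KV transfer for a 4096-token prompt: {transfer_s * 1e3:.0f} ms")
```

Under these assumptions the transfer adds on the order of a couple of hundred milliseconds to TTFT for a long prompt, which is the overhead the throughput gain has to outweigh.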

Things you should now be able to answer

Why does the autoregressive decode loop become memory-bandwidth-bound rather than compute-bound, and how does batching fix this?
What is the difference between TTFT and ITL, and which phase of the model is each most influenced by?
Why does static batching cause head-of-line blocking, and how does continuous batching eliminate it?
Explain PagedAttention's analogy to virtual memory and why it enables prefix sharing.
Why does tensor parallelism require NVLink, and when would you use pipeline parallelism instead?
How does speculative decoding improve ITL while preserving output distribution? When does it help least?
How would you design a fair multi-tenant scheduler with token-budget admission control?
What makes GPU autoscaling fundamentally harder than scaling stateless HTTP services?
What is prefill/decode disaggregation and what problem does it solve?
How does quantization (INT8/FP8) change the cost-quality trade-off, and how would you decide whether to use it?

Frequently asked questions

▸Why is KV cache, not model weights, the binding constraint on batch size?

A 70B-parameter model with 80 layers and 64 heads of dimension 128 requires roughly 2.6 MB of KV cache per token per request at FP16 (2 tensors × 80 layers × 64 heads × 128 dimensions × 2 bytes). A single 4096-token request consumes about 10.7 GB just for its key-value tensors, while the weights themselves are loaded once and shared across all concurrent requests. With 140 GB of weights split across two 80 GB GPUs, the HBM left over on each device fills up with KV cache, and each new request that enters the batch must claim KV storage proportional to its context length. The batch size ceiling is therefore how many simultaneous requests fit in the remaining HBM, not how many FLOPs the device can execute.
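The arithmetic behind these numbers can be reproduced in a few lines. This sketch assumes FP16 values and full multi-head attention (every head stores its own K and V), which is what the figures above imply; it ignores activations and runtime overhead, and the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: a K and a V vector per head per layer."""
    return 2 * layers * heads * head_dim * bytes_per_value


def max_concurrent_requests(free_hbm_bytes: float, context_tokens: int,
                            per_token_bytes: int) -> int:
    """How many full-context requests fit in the HBM left after weights."""
    return int(free_hbm_bytes // (context_tokens * per_token_bytes))


per_token = kv_bytes_per_token(layers=80, heads=64, head_dim=128)
per_request = per_token * 4096

# Two 80 GB GPUs holding 140 GB of weights leave about 20 GB in total.
free_hbm = 2 * 80e9 - 140e9
fit = max_concurrent_requests(free_hbm, 4096, per_token)

print(f"{per_token / 1e6:.1f} MB per token, {per_request / 1e9:.1f} GB per 4096-token request")
print(f"full-context requests that fit in the leftover HBM: {fit}")
```

With these inputs only a single full-length request fits alongside the weights, which is the sense in which KV cache rather than compute sets the batch ceiling.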

▸What is continuous batching and why does it outperform static batching for LLM serving?

Static batching holds a fixed group of requests together until all have finished generating, which means fast requests wait for slow ones to reach their maximum output length before the GPU can accept new work. Continuous batching inserts new requests and evicts finished ones at each decode step — typically every few milliseconds — so the GPU is always working on the densest feasible batch. In practice this raises effective throughput 2–4x without changing the underlying model or hardware.
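A toy simulation makes the difference visible. It counts decode steps only and assumes every step costs the same regardless of how full the batch is, with prefill ignored. Those are simplifications, and the function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each fixed group runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(output_lens)
    active = []  # remaining tokens to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # A request with 1 token left finishes on this step and frees its slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lens = [100, 5, 5, 5, 100, 5, 5, 5]  # two long requests mixed with short ones
static_steps = static_batching_steps(lens, batch_size=4)
continuous_steps = continuous_batching_steps(lens, batch_size=4)
print(static_steps, continuous_steps)
```

In this example the short requests no longer wait behind the long ones, and the second long request starts as soon as a slot frees up instead of waiting for the whole first group to drain.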

▸What is PagedAttention and what problem does it solve?

Standard attention requires KV cache to occupy a single contiguous memory block per sequence, which causes severe fragmentation — a block reserved for a request that might generate 4096 tokens wastes most of that allocation when the request generates only 80 tokens. PagedAttention (introduced in vLLM, Kwon et al. 2023) divides KV cache into fixed-size pages analogous to virtual memory pages, maintaining a per-sequence page table. Physical KV memory is allocated one page at a time as tokens arrive, reducing fragmentation to near zero and enabling physical KV pages to be shared across requests with identical prefixes.
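The bookkeeping can be sketched with a small allocator that tracks page ids only. This is a simplified illustration rather than vLLM's actual block manager: the class and method names are hypothetical, no tensors are stored, and `fork` shares only completely filled pages so that no copy-on-write is needed.

```python
class PagedKVAllocator:
    """Toy page-table bookkeeping for KV cache pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.seq_lens = {}     # seq_id -> number of tokens stored
        self.ref_counts = {}   # page id -> number of sequences using it

    def append_token(self, seq_id):
        table = self.page_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.page_size == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages; caller must preempt or queue")
            page = self.free_pages.pop()
            self.ref_counts[page] = 1
            table.append(page)
        self.seq_lens[seq_id] = length + 1

    def fork(self, parent_id, child_id):
        """Start a new sequence that shares the parent's full pages (prefix sharing)."""
        full = self.seq_lens[parent_id] // self.page_size
        shared = self.page_tables[parent_id][:full]
        for page in shared:
            self.ref_counts[page] += 1
        self.page_tables[child_id] = list(shared)
        self.seq_lens[child_id] = full * self.page_size

    def free(self, seq_id):
        for page in self.page_tables.pop(seq_id):
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:  # last user gone, page is reusable
                del self.ref_counts[page]
                self.free_pages.append(page)
        del self.seq_lens[seq_id]


alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(80):
    alloc.append_token("a")
pages_for_80_tokens = len(alloc.page_tables["a"])

alloc.fork("a", "b")     # "b" reuses the 80-token prefix without new pages
alloc.append_token("b")  # its first new token claims one fresh page
free_after_fork = len(alloc.free_pages)

alloc.free("a")          # shared pages stay alive while "b" still uses them
free_after_a = len(alloc.free_pages)
alloc.free("b")
free_after_b = len(alloc.free_pages)
print(pages_for_80_tokens, free_after_fork, free_after_a, free_after_b)
```

An 80-token sequence holds five 16-token pages instead of a reservation sized for its maximum possible length, and reference counts keep shared prefix pages alive until the last sequence using them finishes.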

▸How does speculative decoding reduce latency without changing output distribution?

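A small, fast draft model generates k tokens autoregressively. The large target model then scores all k positions in a single forward pass, because a transformer can process multiple known positions in parallel, as it does during prefill. Each draft token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities of that token. At the first rejection, a replacement token is sampled from the normalized residual max(0, p - q) and the remaining draft tokens are discarded; if all k are accepted, the target's pass also yields one extra token, so you get up to k+1 tokens for roughly the cost of one target step plus k cheap draft steps. This accept/reject rule makes the output distribution provably identical to greedy or temperature-sampled decoding from the target alone. At typical k=4–8 and acceptance rates around 70–80%, wall-clock latency drops by 1.5–3x on tasks with predictable continuations.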
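The verification step can be sketched with the standard speculative sampling rule: accept a draft token with probability min(1, p/q), and on the first rejection resample from the normalized residual max(0, p - q). The code below is a toy illustration over explicit probability dictionaries. The function names are hypothetical, and a real system would operate on logits from the two models.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens kept after verifying k draft tokens.

    draft_probs[i] and target_probs[i] are token -> probability dicts for
    position i; target_probs has one extra entry for the bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        q, p = draft_probs[i], target_probs[i]
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: resample from the residual distribution and stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out
    # All k accepted: the same target pass also yields one extra token.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
same = {"a": 0.5, "b": 0.5}
all_accepted = verify_draft(["a", "b"], [same, same], [same, same, same], rng)
rejected = verify_draft(["a"], [{"a": 1.0}], [{"b": 1.0}, {"b": 1.0}], rng)
print(all_accepted, rejected)
```

When draft and target agree exactly, every draft token is accepted and a bonus token is appended. When the target assigns zero probability to a draft token, it is always rejected and replaced by a sample from the target's own preference.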

▸Why does tensor parallelism require high-bandwidth interconnect and when do you use pipeline parallelism instead?

Tensor parallelism shards each layer's weight matrices across GPUs (column-wise for one matrix multiply and row-wise for the next in the common Megatron-style scheme), so every device participates in every layer, and an all-reduce then combines the partial activations. That all-reduce sits on the critical path of every layer in every forward pass, so it needs a high-bandwidth, low-latency interconnect such as NVLink, or InfiniBand across nodes. Pipeline parallelism assigns whole transformer layers to different GPUs and passes activations forward one micro-batch at a time, which only requires bandwidth between adjacent stages and tolerates inter-node Ethernet, but introduces pipeline bubbles: with p stages and m micro-batches in flight, roughly (p - 1)/(m + p - 1) of each device's time is idle.
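Two quick estimates help when comparing the strategies. The sketch below uses the usual GPipe-style bubble fraction for pipeline parallelism and counts two all-reduces per transformer layer for Megatron-style tensor parallelism. The model shape (80 layers, hidden size 64 × 128 = 8192, FP16) follows the article's 70B example, and the function names are illustrative.

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of each device's time spent idle in a simple pipeline schedule."""
    return (stages - 1) / (micro_batches + stages - 1)


def tp_allreduce_payload_per_token(layers: int, hidden_size: int,
                                   bytes_per_value: int = 2) -> int:
    """Activation bytes all-reduced per token in one forward pass.

    Assumes two all-reduces per layer (one after attention, one after the MLP),
    each over a hidden_size-wide activation vector.
    """
    all_reduces = 2 * layers
    return all_reduces * hidden_size * bytes_per_value


for m in (1, 4, 12):
    print(f"4 stages, {m:>2} micro-batches: "
          f"{pipeline_bubble_fraction(4, m):.0%} idle")

payload = tp_allreduce_payload_per_token(layers=80, hidden_size=8192)
print(f"TP all-reduce payload: {payload / 1e6:.1f} MB per token across 160 all-reduces")
```

The bubble shrinks only when many micro-batches are in flight, while the tensor-parallel cost is paid as many small synchronizations per token, which is why the latter is so sensitive to interconnect latency.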
