The ML / GenAI System Design Interview Framework
A 7-step, 45-minute framework for ML and GenAI system design rounds — covering the data/feedback loop, the candidate-generation + ranking funnel, and the GenAI decision tree from prompt-engineering to fine-tune, with a hire vs no-hire signals table.
The problem
Most system design interviews are graded on topology: did you put a cache in front of the database, did you use Kafka to decouple write bursts, did you shard by the right key. ML and GenAI rounds add a second dimension. The interviewer is also watching whether you can identify what you're learning from, where the labels come from, how you know the model is working, and — most importantly — whether the system you're designing will improve over time or silently degrade.
This matters because ML systems fail in ways that don't generate errors. A feature computed slightly differently in training versus serving produces no exception; it just makes the model worse. A feedback loop that inadvertently labels everything as positive doesn't crash anything; it just trains a model that predicts "positive" on everything. These failures are invisible to infrastructure monitoring and require specific design choices to prevent.
The ML system design interview has become a standard screen at Google, Meta, Amazon, and most AI labs. The question framing varies ("design a content ranking system," "design a search relevance model," "design a code autocomplete service"), but the underlying rubric is consistent: the interviewer is evaluating your ability to take a fuzzy product goal, frame it as a tractable ML problem, reason about data and evaluation, and close the feedback loop from production back to training.
This article is the framework for that round. It is deliberately structured as a meta-guide rather than a single-system deep dive. For the specific architecture of a recommendation system — the two-stage candidate generation and ranking funnel, the ANN index, the feature store — read Design a Recommendation System. This article teaches you how to run the interview regardless of which system you're asked about.
Why ML rounds are a distinct interview
A classic system design interview is primarily about operational reasoning: traffic patterns, failure modes, consistency models, storage choices. The relevant expertise is "I have built systems that serve a lot of requests and I know what breaks."
An ML design round adds a domain that most infrastructure engineers do not develop organically: statistical validity. The candidate must reason about whether the data being collected is representative, whether the evaluation metric actually measures what the product cares about, and whether the training distribution will match the serving distribution six months after launch.
The canonical mistakes that fail ML rounds are structural, not architectural. Proposing a sophisticated model architecture before defining what a positive label is. Claiming "we'll use AUC" to evaluate a ranking system when the product cares about the quality of the top few positions, which AUC does not weight. Designing a training pipeline that trains once offline without discussing when and how it retrains. Ignoring the feedback loop: the model you deploy influences what users do, which changes the data you collect, which changes what the next model learns.
The rest of this article walks through the seven steps that address each of these failure modes systematically.
Capacity estimation
This article is a framework guide, not a single-system deep dive. The capacity numbers below represent a typical at-scale ML system (think: a content feed ranker or a GenAI API) and exist to anchor the serving and training architecture decisions.
| Dimension | Estimate | How we got there |
|---|---|---|
| Recommendation requests | ~29 000 QPS avg (~58 000 peak) | 500 M users × 5 feed loads/day ÷ 86 400 s; 2× for peak |
| Candidate generation budget | < 20 ms | ANN retrieval over 10 M item embeddings in RAM |
| Ranking inference budget | < 80 ms | Dense model over ~500 candidates on GPU; feature store lookup ≤ 15 ms |
| Item embeddings (float32, 256-dim) | ~10 GB | 10 M × 256 × 4 B — fits one large memory node |
| Training data (daily logs) | ~5 TB/day uncompressed | ~125 B impression events × ~40 B/event, Parquet on object store |
| LLM serving — unoptimised 70B | 20–40 tok/s per request | 2× A100 80 GB (140 GB FP16 weight memory), greedy decode, no batching |
| LLM serving — vLLM + continuous batching | 300–1 000 tok/s aggregate | PagedAttention removes KV-cache fragmentation; batches requests with different lengths |
| Quantisation saving (FP16 → INT8) | ~2× memory reduction | 70B: 140 GB → ~70 GB; weights fit on 1 A100 80 GB instead of 2, leaving ~10 GB for KV cache |
| A/B test ramp | 1% → 10% → 50% → 100% | Each stage needs ~1–2 weeks for statistical significance on low-variance metrics |
Takeaway: the two-stage candidate-generation + ranking split exists because of the first three rows — you cannot run an expensive model over 10 M items at 58 000 QPS. The LLM rows anchor the GenAI serving discussion in Step 7.
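The arithmetic behind the table can be reproduced in a few lines. This is a back-of-envelope sketch, not a sizing tool: the variable names are illustrative, and the figure of roughly 50 impressions per feed load is an assumption chosen because it reproduces the ~125 B events/day in the table.

```python
# Back-of-envelope capacity arithmetic for the table above.
SECONDS_PER_DAY = 86_400

users = 500_000_000
feed_loads_per_user_per_day = 5
avg_qps = users * feed_loads_per_user_per_day / SECONDS_PER_DAY
peak_qps = 2 * avg_qps  # 2x peak factor, as in the table

items = 10_000_000
embedding_dim = 256
bytes_per_float32 = 4
embedding_gb = items * embedding_dim * bytes_per_float32 / 1e9

# Assumption: ~50 impressions per feed load (not stated in the table).
impressions_per_load = 50
events_per_day = users * feed_loads_per_user_per_day * impressions_per_load
log_tb_per_day = events_per_day * 40 / 1e12  # ~40 bytes per event

params = 70e9
fp16_weights_gb = params * 2 / 1e9  # 2 bytes per parameter
int8_weights_gb = params * 1 / 1e9  # 1 byte per parameter

print(f"QPS: avg ~{avg_qps:,.0f}, peak ~{peak_qps:,.0f}")
print(f"Item embeddings: ~{embedding_gb:.1f} GB")
print(f"Training logs: ~{log_tb_per_day:.1f} TB/day")
print(f"70B weights: {fp16_weights_gb:.0f} GB FP16, {int8_weights_gb:.0f} GB INT8")
```

Saying the multiplication out loud in the interview matters more than the exact result; the interviewer is checking that each architectural choice traces back to a number.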
Step 1: Clarify the task and map business metric → ML objective (0–5 min)
The opening move in every ML round should be the same: identify the business metric the product team cares about, and map it to the ML objective you will actually optimise.
These are rarely the same thing. "Increase watch time" (business) maps to "rank by predicted watch duration" (ML objective), with a guardrail that click-through rate does not fall below a floor. "Reduce churn" (business) maps to "predict the probability of cancellation in the next 30 days" (ML objective), with a guardrail that push notification volume stays below a cap. "Improve search quality" (business) maps to "maximise NDCG@10 on a labeled relevance judgment set" (ML objective), with a guardrail that latency p99 stays under 100 ms.
The mapping is non-trivial because you can optimise an ML objective well and move the business metric in the wrong direction. Feed and video products that optimised raw click-through rate found that it trained models toward clickbait. YouTube found that optimising watch time trained models toward content users kept watching but later regretted. The guardrail metric, the constraint you must not violate while optimising the objective, is where the product wisdom lives.
In the interview, this conversation takes two minutes and produces something concrete to write on the board:
| Role | Choice |
|---|---|
| Business metric | daily active users (DAU) |
| ML objective | predicted watch time (regression or ranking) |
| Guardrail | click-through rate floor; safety classifier pass rate |
Now every subsequent design decision has an anchor. When you propose a feature, you can ask whether it helps predict watch time. When you propose an A/B test, you know which metrics to watch.
Step 2: Data and labels (5–10 min)
This is the step most candidates rush past. It is also where interviewers catch the most signal.
Where do labels come from?
In supervised learning, labels are the thing you're teaching the model to predict. The naive assumption is that labels are collected by human annotators and stored in a database. In production, labels almost always come from user behavior. A click is an implicit positive label for the item that was shown. A skip is an implicit negative. A purchase is a strong positive. A return is a strong negative.
The problems with behavioral labels are specific and well-known.
Position bias is the first. Items shown at the top of a list are clicked more because of where they are, not because they are better. A model trained on raw clicks learns to predict "was shown first" rather than "is relevant." The fix is inverse propensity scoring (weight each clicked example by the inverse of the probability that its position was examined, estimated for instance from a small randomised-ranking experiment) or other counterfactual estimation techniques.
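A minimal sketch of inverse propensity weighting for position bias follows. It assumes the propensity is the probability that a user examines a given position; the `EXAMINATION_PROPENSITY` values, the `Impression` type, and the clipping threshold are all hypothetical and would be estimated or tuned in a real system.

```python
from dataclasses import dataclass


@dataclass
class Impression:
    position: int   # 0-based rank at which the item was displayed
    clicked: bool


# Hypothetical examination propensities per position. Illustrative numbers
# only; in practice they are estimated, e.g. from randomised rankings.
EXAMINATION_PROPENSITY = [1.0, 0.6, 0.4, 0.25, 0.15]


def ips_weight(imp: Impression, clip: float = 10.0) -> float:
    """Training weight for one impression under inverse propensity scoring.

    Clicks at rarely examined positions are up-weighted. The weight is
    clipped to limit variance. Non-clicks keep weight 1.0 in this sketch.
    """
    if not imp.clicked:
        return 1.0
    # Positions beyond the table reuse the last known propensity.
    idx = min(imp.position, len(EXAMINATION_PROPENSITY) - 1)
    return min(1.0 / EXAMINATION_PROPENSITY[idx], clip)
```

A click at position 3 counts four times as much as a click at position 0 under these numbers, which offsets the fact that far fewer users ever look that far down the list.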
Selection bias is the second. You only observe outcomes for items that were shown. The space of items never shown has no labels at all. A model trained on the observed distribution is biased toward items the current policy already likes, and the feedback loop entrenches existing preferences.
Class imbalance is the third. Negative outcomes typically outnumber positives by 100:1 or more. A model that predicts "no click" on everything is 99% accurate and completely useless. Standard fixes include downsampling negatives, upsampling positives, and focal loss. Because resampling shifts the base rate the model sees, follow it with post-training probability calibration whenever the predicted probabilities themselves are consumed downstream.
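As an illustration of the downsampling fix, the sketch below keeps all positives, samples negatives at a fixed rate, and then maps the model's output back to the original base rate. The function names are hypothetical. The correction is the standard odds adjustment for negative downsampling: keeping a fraction `keep_rate` of negatives inflates the odds by `1 / keep_rate`.

```python
import random


def downsample_negatives(examples, keep_rate, seed=0):
    """Keep every positive and a random `keep_rate` fraction of negatives.

    `examples` is an iterable of (features, label) pairs, label in {0, 1}.
    """
    rng = random.Random(seed)
    return [(x, y) for x, y in examples if y == 1 or rng.random() < keep_rate]


def recalibrate(p_sampled, keep_rate):
    """Map a probability learned on downsampled data to the true base rate.

    Sampled odds = true odds / keep_rate, so the true probability is
    p / (p + (1 - p) / keep_rate).
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)
```

With `keep_rate=0.01`, a model output of 0.5 on the resampled data corresponds to a true probability of about 1%, which is why ranking quality can survive downsampling while raw probabilities do not.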
Label leakage is the fourth, and the most dangerous. A feature that encodes information about the future label produces a model that looks great offline and fails immediately in production. The canonical example: using the total number of purchases for an item as a feature when some of those purchases happened after the impression you're training on. Time-based train/val/test splits are the most important data hygiene step in any ML system.
Label collection strategy
At the interview, be concrete. For a feed ranking system: the label is a 30-minute watch-time signal (did the user watch more than 30 minutes after clicking?), collected via a streaming event pipeline (Kafka → stream processor → label store), joined to the impression log with a 2-hour delay to allow the outcome to accumulate.
The 2-hour delay is not a detail to skip. It means your training data is always at least 2 hours stale. Your retraining cadence must account for this. Your model cannot incorporate real-time outcomes.
Step 3: Features and the feature store (10–15 min)
Online vs. offline features
ML features split cleanly into two categories based on when they can be computed.
Offline (batch) features are computed over historical data, updated on a schedule (hourly or daily), and stored for retrieval. Examples: a user's genre affinity over the last 30 days; an item's average watch time; a creator's historical engagement rate. These can be computed with a batch Spark job overnight and stored in a low-latency key-value store (Redis, DynamoDB) for serving.
Online (real-time) features change fast enough that stale values hurt model quality. Examples: a user's last three clicks in this session; an item's view count in the last 10 minutes. These must be computed from a stream (Kafka → Flink → feature store) and served with sub-5 ms latency.
Request-time features are computed inline at serving time from the raw request: device type, time of day, country from IP. No storage needed.
Training/serving skew
This is the single most important concept to volunteer in an ML design round. Training/serving skew occurs when the feature value used to train the model differs from the feature value used at serving time. It does not produce any error. It silently degrades model quality. Common causes: different timezone handling in the batch pipeline versus the serving code; aggregation windows defined differently (last-24-hours in training, last-24-hours-floor-to-midnight in serving); feature normalization applied in training but not in serving.
The fix is a feature store that serves as the single source of truth for both paths. The same transformation logic — ideally the same code — runs in the offline pipeline to populate training data and in the online path to populate serving inputs. Widely deployed examples include Uber's Michelangelo, DoorDash's Riviera, and the open-source Feast project. The recommendation system design article covers the feature store architecture in detail.
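The aggregation-window mismatch mentioned above is easy to reproduce. In this sketch, `watch_seconds_last_24h` and `watch_seconds_since_midnight` are hypothetical feature definitions; the point is that they return different values for the same events, so only one definition should exist and both the offline and online paths should call it.

```python
from datetime import datetime, timedelta, timezone


def watch_seconds_last_24h(events, as_of):
    """Sum watch seconds in the rolling 24 hours before `as_of`.

    `events` is an iterable of (timestamp, watch_seconds) pairs with
    timezone-aware timestamps. Intended as the single shared definition.
    """
    as_of = as_of.astimezone(timezone.utc)
    window_start = as_of - timedelta(hours=24)
    return sum(sec for ts, sec in events if window_start < ts <= as_of)


def watch_seconds_since_midnight(events, as_of):
    """A subtly different definition: window floored to UTC midnight."""
    as_of = as_of.astimezone(timezone.utc)
    midnight = as_of.replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(sec for ts, sec in events if midnight <= ts <= as_of)


events = [
    (datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc), 100),
    (datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc), 50),
]
now = datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc)
print(watch_seconds_last_24h(events, now))        # training-side value
print(watch_seconds_since_midnight(events, now))  # skewed serving-side value
```

Neither function raises an error, which is exactly why this class of bug is only caught by comparing logged serving features against recomputed training features.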
Point-in-time correctness
When you build a training dataset by joining impressions to features, you must use the feature value as it existed at the time of the impression — not the current value. A user's follower count at the time of a tweet impression matters; their follower count today doesn't. Failing to enforce point-in-time correctness is a form of data leakage.
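A point-in-time lookup can be sketched with a sorted history and a binary search. `FeatureHistory` is a hypothetical in-memory stand-in for what a feature store's offline API would do; it assumes values are recorded in timestamp order.

```python
from bisect import bisect_right


class FeatureHistory:
    """Append-only history of one feature for one entity, sorted by time."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def record(self, ts, value):
        # Assumes records arrive in non-decreasing timestamp order.
        self._timestamps.append(ts)
        self._values.append(value)

    def as_of(self, ts, default=None):
        """Return the latest value recorded at or before `ts`."""
        i = bisect_right(self._timestamps, ts)
        return self._values[i - 1] if i > 0 else default


followers = FeatureHistory()
followers.record(100, 10)
followers.record(200, 25)
followers.record(300, 40)

# An impression at t=250 must see 25 followers, not today's 40.
print(followers.as_of(250))
```

The training join calls `as_of(impression_ts)` for every row, so no feature value written after the impression can leak into the example.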
Step 4: Model choice and architecture (15–22 min)
The most common mistake in this phase is starting with the most sophisticated model you know. Interviewers read this as a sign you haven't thought about the problem — you've picked up a hammer and are looking for a nail.
Start with a heuristic baseline
The right opening is: what is the cheapest thing that could possibly work? For a feed ranker: sort by recency. For a search system: BM25 (TF-IDF variant). For a content moderator: a keyword blocklist. This is not a cop-out — it is the baseline every subsequent model must beat, and naming it explicitly shows you understand what you're adding when you add ML.
Then evolve:
flowchart LR
H["Heuristic<br/>(recency, popularity)"] --> LR["Logistic Regression<br/>on hand-crafted features"]
LR --> GBDT["GBDT<br/>(XGBoost / LightGBM)"]
GBDT --> DNN["Deep NN<br/>(embeddings + dense layers)"]
DNN --> TF["Transformer / LLM<br/>(self-attention, scale)"]
style H fill:#0e7490,color:#fff
style LR fill:#15803d,color:#fff
style GBDT fill:#ff6b1a,color:#0a0a0f
style DNN fill:#a855f7,color:#fff
style TF fill:#ff2e88,color:#fff
The decision of when to move right on this spectrum is concrete: does adding model complexity move the metric you defined in Step 1, on an offline evaluation, by enough to justify the serving cost? If GBDT gives you 90% of the performance gain at 10% of the inference cost of a transformer, GBDT is the right answer.
The candidate-generation + ranking funnel
For any system with millions of items and a latency budget under 200 ms, brute-force scoring every item is infeasible. The universal solution is a two-stage funnel:
flowchart LR
ALL["Full catalog<br/>10 M+ items"] --> CG["Candidate Generation<br/>ANN / BM25 / heuristics<br/>< 20 ms"]
CG --> RANK["Ranking<br/>rich feature model<br/>< 80 ms"]
RANK --> RERANK["Re-ranking<br/>diversity + rules<br/>< 20 ms"]
RERANK --> USER["Response<br/>top 20–100 items"]
style ALL fill:#0e7490,color:#fff
style CG fill:#ff6b1a,color:#0a0a0f
style RANK fill:#15803d,color:#fff
style RERANK fill:#a855f7,color:#fff
Candidate generation must be cheap — approximate nearest-neighbor search on embeddings (HNSW for high recall, IVF+PQ for memory efficiency at the cost of 5–15% lower recall), BM25 for text retrieval, or simple heuristics. Ranking applies an expensive model to the shortlist of a few hundred candidates. Re-ranking handles diversity, business rules, and safety filters without paying full model inference cost.
The recommendation system design article covers the ANN index choices (HNSW vs IVF+PQ), the two-tower embedding model, and the full ranking DNN in detail. In the interview, name the funnel and explain the latency math — the reason the funnel exists is computational, not architectural preference. For the inverted index and BM25 candidate retrieval that backs keyword-based systems, see Design a Distributed Search Engine.
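The latency math is short: scoring 10 M items for each of 58 000 requests per second would be roughly 5.8 × 10^11 model evaluations per second, so the expensive model can only ever see a shortlist. The toy sketch below shows the three stages in plain Python. The exact cosine search stands in for an ANN index, and `score_fn`, `category_of`, and the per-category cap are hypothetical placeholders.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_candidates(user_vec, item_vecs, k):
    """Stage 1: cheap similarity retrieval (exact here; ANN in production)."""
    by_similarity = sorted(
        item_vecs, key=lambda item_id: cosine(user_vec, item_vecs[item_id]), reverse=True
    )
    return by_similarity[:k]


def rank(candidates, score_fn):
    """Stage 2: the expensive model runs only on the shortlist."""
    return sorted(candidates, key=score_fn, reverse=True)


def rerank(ranked, category_of, max_per_category, n):
    """Stage 3: rule-based diversity cap, no model inference."""
    out, seen = [], {}
    for item_id in ranked:
        cat = category_of[item_id]
        if seen.get(cat, 0) < max_per_category:
            out.append(item_id)
            seen[cat] = seen.get(cat, 0) + 1
        if len(out) == n:
            break
    return out
```

Each stage narrows the set the next stage has to touch, which is the whole reason the funnel exists.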
Step 5: Training pipeline (22–28 min)
A model trained once and never updated is a liability. The world changes; the model's distribution of inputs changes; model performance degrades. The training pipeline is the mechanism that keeps the model current.
The offline training loop
flowchart TD
LOGS[(Interaction logs<br/>S3/GCS, Parquet)] --> JOIN[Feature join<br/>point-in-time correct]
JOIN --> FILTER[Filter and label<br/>dedup, position debiasing]
FILTER --> SPLIT[Train / val / test split<br/>by time — NEVER random]
SPLIT --> TRAIN[Distributed training<br/>PyTorch + GPU cluster]
TRAIN --> EVAL[Offline evaluation<br/>AUC / NDCG / recall@K]
EVAL -->|"passes threshold"| REG[Model registry<br/>versioned artifact]
REG --> SHADOW[Shadow traffic<br/>log scores, don't serve]
SHADOW -->|"quality gate"| AB[A/B test<br/>1% → 10% → 50% → 100%]
style LOGS fill:#0e7490,color:#fff
style TRAIN fill:#ff6b1a,color:#0a0a0f
style REG fill:#15803d,color:#fff
style AB fill:#ff2e88,color:#fff
The train/val/test split must be by time. Train on week N, validate on week N+1, test on week N+2. Random splits leak future information and produce offline metrics that don't correlate with production A/B results. This is not a subtle point — it is a hard correctness requirement.
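A time-based split is a few lines of code, which is part of why skipping it is hard to excuse. This sketch assumes each row is a dict with a comparable `ts` field; the field name and the boundary arguments are illustrative.

```python
def time_based_split(rows, train_end, val_end):
    """Split rows into train/val/test by timestamp, never randomly.

    train: ts < train_end
    val:   train_end <= ts < val_end
    test:  ts >= val_end
    """
    train, val, test = [], [], []
    for row in rows:
        if row["ts"] < train_end:
            train.append(row)
        elif row["ts"] < val_end:
            val.append(row)
        else:
            test.append(row)
    return train, val, test


# Days 0-6 train, 7-13 validate, 14-20 test.
rows = [{"ts": day} for day in range(21)]
train, val, test = time_based_split(rows, train_end=7, val_end=14)
```

Every validation row is strictly later than every training row, so the offline metric is measured under the same "predict the future" condition the model faces in production.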
Retraining cadence
How often to retrain depends on how fast the world changes. A news feed ranker that sees trending topics shift by the hour may retrain every few hours with a continuous training pipeline. A product recommendation model for a stable catalog may retrain weekly. The answer in the interview should always reference the feedback loop delay from Step 2 — if you have a 2-hour label delay, there is no point in a pipeline that runs more frequently than every few hours.
Distributed training matters at scale. PyTorch Distributed Data Parallel (DDP) splits batches across GPUs within or across nodes using NCCL. For very large models, model parallelism (Megatron-LM style tensor parallelism) is additionally needed. For interview purposes: name that you'd use distributed training, and know that the bottleneck is usually the data loading and feature join step, not the GPU compute step.
Step 6: Evaluation (28–35 min)
This is where ML rounds are most commonly won or lost. The failure mode is proposing a single offline metric and moving on. The hire signal is proposing a three-layer evaluation stack.
Layer 1: Offline metrics
Offline metrics measure the model's predictive quality on held-out historical data. Choose the metric that matches your objective:
| ML objective | Offline metric | Why |
|---|---|---|
| Click probability (binary) | AUC-ROC | Discrimination ability; threshold-agnostic |
| Ranking (feed, search) | NDCG@K, MRR | Measures rank-order quality, not just binary correctness |
| Retrieval recall | Recall@K | Fraction of true positives in top-K candidates |
| Regression (watch time) | MSE, MAE | Direct fit to the target value |
| Classification (N classes) | Macro F1, top-K accuracy | Handles imbalance, cares about per-class performance |
Name the offline metric and immediately name its limitation. AUC doesn't capture position. NDCG requires relevance judgments that may not exist. A model with high recall@1000 may still surface bad items as its top 20.
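For reference, the three most commonly named metrics fit in a short block. These are plain-Python sketches for small inputs (the AUC here is the O(n²) pairwise definition, and NDCG uses linear gain), not replacements for a metrics library.

```python
import math


def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """`relevances` are graded labels in the order the model ranked the items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The contrast is visible in the definitions: AUC counts every correctly ordered pair equally, while NDCG discounts by log position, so a mistake at rank 1 costs far more than a mistake at rank 50.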
Layer 2: Online A/B test
Offline metrics are a proxy. The only way to know whether a model change moves the business metric is an A/B test.
An A/B test must specify: what percent of traffic goes to the new model (start at 1%, escalate to 10% → 50% → 100%); what is the primary metric (DAU, watch time, conversion rate); what are the guardrails (latency p99, click-through floor, toxicity rate cap); and how long to run (enough to see statistical significance on the primary metric, typically 1–2 weeks for low-variance metrics).
The common trap is to over-read what the A/B test proves. It does establish causation, but only within the test period and population, and only for the metrics you measured. It does not prove the model is correct, safe, or good long-term, which is why guardrails matter.
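"How long to run" follows from how many users each arm needs. The sketch below uses the textbook normal-approximation formula for a two-sided two-proportion test; the function name, the default significance level, and the default power are assumptions, and a real experimentation platform would also account for variance reduction and repeated looks.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Detecting a 2% relative lift on a 5% conversion rate.
print(samples_per_arm(0.05, 0.02))
```

Detecting a small relative lift on a low base rate needs several hundred thousand users per arm, which is why a 1% traffic slice often has to run for days before the result means anything.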
Layer 3: Guardrails
Every optimisation objective has a dual constraint. Name it explicitly:
- Optimising watch time? Guardrail: content safety pass rate, click-through floor.
- Optimising purchase conversion? Guardrail: return rate cap, customer satisfaction floor.
- Optimising GenAI response quality? Guardrail: latency p99, hallucination rate cap, refusal rate kept within a target band.
A model that wins the primary metric while breaking a guardrail does not ship.
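The ship rule can be written down as a small gate. `Guardrail` and `ship_decision` are hypothetical names, and the thresholds in the usage example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    value: float
    limit: float
    kind: str  # "floor": value must be >= limit; "cap": value must be <= limit

    def ok(self) -> bool:
        if self.kind == "floor":
            return self.value >= self.limit
        return self.value <= self.limit


def ship_decision(primary_lift, primary_significant, guardrails):
    """Return (ship, reasons): ship only on a significant win with no breach."""
    breaches = [g.name for g in guardrails if not g.ok()]
    if breaches:
        return False, [f"guardrail breached: {name}" for name in breaches]
    if not (primary_significant and primary_lift > 0):
        return False, ["no significant primary-metric win"]
    return True, []


decision = ship_decision(
    primary_lift=0.02,
    primary_significant=True,
    guardrails=[
        Guardrail("ctr_floor", value=0.031, limit=0.030, kind="floor"),
        Guardrail("latency_p99_ms", value=120, limit=100, kind="cap"),
    ],
)
print(decision)
```

Checking guardrails before the primary metric is a deliberate ordering: a breach blocks the launch no matter how large the win.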
Step 7: Serving and monitoring — closing the feedback loop (35–43 min)
Serving architecture
Online inference has two latency-relevant concerns. The first is model complexity vs latency budget. A 7B-parameter model on a GPU takes 10–50 ms per forward pass for short sequences, depending on batch size and sequence length. A 70B model with vLLM's PagedAttention and continuous batching reaches 300–1000 tokens/second of aggregate throughput across concurrent requests, but per-request latency for a 200-token generation is still 5–10 seconds without speculative decoding. Speculative decoding — a small draft model proposes tokens that a large model verifies in a single forward pass — can cut latency 2–3× at the cost of additional GPU memory for the draft model.
The second is feature freshness. Online features (the user's last three clicks, the item's trending velocity) must already be in the online store when the model score is computed. Latency in the online feature lookup adds directly to serving latency; lag in the streaming pipeline that populates the store does not slow the request, but it means the model scores on stale values.
The feedback loop
This is the hardest concept to close in the interview and the one most candidates leave open. Describe it explicitly:
sequenceDiagram
participant M as Model
participant S as Serving
participant U as Users
participant L as Log store
participant T as Training pipeline
M->>S: Deploy v1
S->>U: Show ranked items
U->>S: Interact (click, watch, skip)
S->>L: Log impressions + outcomes
L->>T: Join labels to features
T->>M: Train v2
M->>S: Deploy v2
Note over S,U: The model changes what users see,<br/>which changes what gets logged,<br/>which shapes v3.
The feedback loop is a closed system. The model influences what users see. What users see influences what they interact with. What they interact with determines the training data. The training data determines the next model. Any bias in the model amplifies over iterations — this is the mechanism behind filter bubbles, popularity bias, and the systematic under-exposure of new content.
Name the specific mitigations: exploration policies (epsilon-greedy or bandit-based random selection for a small fraction of impressions), counterfactual logging (record the probability with which the serving policy chose each shown item, so training and offline evaluation can reweight the data toward what a different policy would have shown), and diversity constraints in re-ranking.
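Exploration and logging fit together naturally: the policy that randomises is also the one that knows how likely each choice was. A minimal epsilon-greedy sketch for a single slot follows; the function name is hypothetical, and it assumes item ids in `ranked_items` are unique.

```python
import random


def choose_with_exploration(ranked_items, epsilon, rng):
    """Pick the item for one slot and return (item, propensity).

    With probability epsilon a uniformly random item is shown; otherwise
    the top-ranked item is shown. The propensity is the probability that
    this policy shows the returned item, and it is logged with the event.
    """
    n = len(ranked_items)
    if rng.random() < epsilon:
        item = rng.choice(ranked_items)
    else:
        item = ranked_items[0]
    uniform_share = epsilon / n
    if item == ranked_items[0]:
        propensity = (1 - epsilon) + uniform_share
    else:
        propensity = uniform_share
    return item, propensity


rng = random.Random(0)
item, propensity = choose_with_exploration(["a", "b", "c", "d"], epsilon=0.1, rng=rng)
```

Because the propensity is logged next to each impression, later training runs can reweight outcomes by its inverse instead of treating every shown item as equally likely to have been shown.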
Monitoring and drift detection
After deployment, the metrics to monitor fall into three buckets.
Model quality metrics (computed daily or hourly against fresh held-out data): AUC, NDCG, prediction calibration — is the model's predicted click rate close to observed click rate for each score bucket?
Feature health metrics: feature freshness (time from event to feature availability), feature null rate (what fraction of serving requests got a null value for a required feature), feature distribution shift (did the distribution of this feature change significantly from the training distribution?).
Business metrics: the online metric (DAU, watch time, conversion) and each guardrail, tracked with statistical process control to detect a degradation before it is obvious from user complaints.
When the model quality degrades silently — no errors, just worse metrics — the root cause is almost always one of: data distribution shift (the world changed, the model is stale), feature distribution shift (a feature pipeline changed its definition), or feedback loop reinforcement (the model is training on data it influenced in a way that entrenches a bias). Name all three when discussing drift.
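Feature distribution shift is commonly tracked with the population stability index (PSI). The sketch below compares a training-time sample of one feature against a serving-time sample; the bin edges and the `eps` floor for empty bins are assumptions, and the often-quoted alert thresholds (around 0.1 to watch, 0.25 to act) are rules of thumb rather than guarantees.

```python
import math


def population_stability_index(expected, actual, edges, eps=1e-6):
    """PSI between two samples of one feature.

    `edges` are ascending bin boundaries; values beyond them fall into
    the first or last bin. Empty bins are floored at `eps`.
    """
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_sample = list(range(100))
serving_sample = [v + 30 for v in training_sample]  # simulated upstream change
psi = population_stability_index(training_sample, serving_sample, edges=[25, 50, 75])
```

Running this per feature on a schedule catches the "a pipeline changed its definition" failure before it shows up in the business metrics.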
The GenAI variant
GenAI system design rounds follow the same seven-step framework but add a layer of decisions that classic ML rounds do not require.
The build-vs-prompt-vs-fine-tune decision tree
The first question in any GenAI design is: what is the right engineering investment for this task?
flowchart TD
Q1{"Does a base LLM produce<br/>acceptable output with<br/>careful prompting?"}
Q1 -->|"Yes"| PE["Prompt engineering<br/>(cheapest, fastest to ship)"]
Q1 -->|"No — wrong facts<br/>or stale knowledge"| Q2{"Does the knowledge<br/>change frequently?"}
Q2 -->|"Yes (daily/weekly)"| RAG["RAG — retrieval-augmented<br/>generation"]
Q2 -->|"No (stable domain)"| Q3{"Wrong output style<br/>or format?"}
Q3 -->|"Yes"| FT["Fine-tune on<br/>domain examples"]
Q3 -->|"No — capability gap<br/>at foundation level"| PT["Train from scratch<br/>(almost never)"]
style PE fill:#15803d,color:#fff
style RAG fill:#ff6b1a,color:#0a0a0f
style FT fill:#a855f7,color:#fff
style PT fill:#0e7490,color:#fff
Prompt engineering is always the first thing to try. A well-constructed system prompt with few-shot examples often closes 60–80% of a quality gap. It is free at training time and adds only token overhead at inference time.
RAG (retrieval-augmented generation) is the right answer when the required knowledge changes faster than fine-tuning cycles allow, or when you need citations and verifiable grounding. The architecture retrieves relevant chunks from a vector store (dense retrieval with HNSW, or hybrid dense + BM25 sparse retrieval), injects them into the context window, and generates a response conditioned on those chunks. The key design decisions are: chunk size and overlap, embedding model choice, retrieval recall vs. context length budget, and how to handle contradictions between retrieved chunks. The dedicated RAG pipeline design article covers the full architecture; for the vector index choices see Design a Vector Database.
Fine-tuning makes sense when the base model's output style or format is wrong — the task requires a specific JSON schema, a particular code style, or a tone that the base model cannot produce with prompting. PEFT methods (LoRA, QLoRA) fine-tune a fraction of model parameters at a fraction of full fine-tune cost. Full fine-tuning is rarely justified unless the task requires fundamentally different capabilities.
Training from scratch almost never makes sense in an interview unless the question is explicitly about building a foundation model. The compute cost (hundreds of thousands to millions of GPU-hours for a modern LLM), data requirements (trillions of tokens), and infrastructure complexity are out of scope for a 45-minute interview. Say so.
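Of the RAG design decisions listed above, chunk size and overlap are the easiest to make concrete. This is a minimal sketch over an already-tokenised document; `chunk_tokens` is a hypothetical helper, and real pipelines often split on sentence or section boundaries instead of fixed token counts.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into fixed-size chunks sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
print(chunks)
```

Overlap trades index size for robustness: a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding the shared tokens twice.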
Evaluating non-deterministic GenAI systems
This is where GenAI evaluation diverges from classical ML evaluation. There is no ground-truth label on "write a good product description." The standard approaches:
LLM-as-judge. Use a capable LLM (e.g., a GPT-4-class or Claude 3-class model) to evaluate outputs on defined rubrics: faithfulness to source (does the answer contradict the retrieved document?), relevance (does the answer address the question?), coherence (is the answer well-formed?). Score each on a 1–5 scale. This scales cheaply — you can evaluate thousands of examples — but it introduces the bias and consistency issues of the judge model itself. Use multiple judges with majority voting, or a calibrated rubric with gold examples.
Golden sets. Curate 200–1000 human-labeled (question, expected answer) pairs that cover the important task categories. Run the system on every deployment and track the pass rate. This is the most robust evaluation, but the gold set requires ongoing curation as the task evolves.
Faithfulness metrics. For RAG systems specifically: what fraction of claims in the generated answer are supported by the retrieved context? FActScore and Ragas are open-source tools for this kind of evaluation. Track faithfulness as a first-class metric alongside relevance.
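A golden-set run with majority-voted judges can be sketched as a small harness. Everything here is a stand-in: `generate` represents the system under test, each judge represents an LLM or human rubric call, and the pass/fail verdict format is an assumption made for brevity (a 1–5 rubric would need a threshold first).

```python
from collections import Counter


def majority_verdict(verdicts):
    """Combine per-judge 'pass'/'fail' verdicts; ties count as 'fail'."""
    counts = Counter(verdicts)
    return "pass" if counts["pass"] > counts["fail"] else "fail"


def golden_set_pass_rate(golden_set, generate, judges):
    """Fraction of golden examples whose answer passes the judge majority.

    `generate(question)` returns the system's answer. Each judge is a
    callable judge(question, expected, answer) -> 'pass' or 'fail'.
    """
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = 0
    for question, expected in golden_set:
        answer = generate(question)
        verdicts = [judge(question, expected, answer) for judge in judges]
        passed += majority_verdict(verdicts) == "pass"
    return passed / len(golden_set)
```

Running this on every deployment turns "it seemed good" into a number that can regress, which is the property an evaluation needs.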
Latency, cost, and token budgets
In a GenAI system, tokens are the fundamental unit of cost and latency. The cost of an LLM inference call scales roughly with (input tokens + output tokens) × model size, with output tokens costing more per token because they are decoded one at a time. At serving scale, the levers are:
| Lever | Effect | Trade-off |
|---|---|---|
| Smaller model (7B vs 70B) | 5–10× cheaper, lower latency | Lower quality ceiling |
| Quantisation (FP16 → INT8) | ~2× memory reduction | Minor quality degradation |
| vLLM + PagedAttention | 10–30× throughput gain from continuous batching | Requires GPU with sufficient HBM |
| Speculative decoding | 2–3× latency reduction on generation | Adds memory for draft model |
| KV cache prefix sharing | Near-zero cost for shared system prompt tokens | Only helps when prompts share a prefix |
| Context window reduction | Linear cost reduction | Shorter context may miss relevant information |
For an interview involving LLM serving, name at least two of these. vLLM's PagedAttention is the most commonly cited — it avoids GPU memory fragmentation in the KV cache by managing it in non-contiguous physical blocks (analogous to virtual memory paging), which enables continuous batching across requests with different sequence lengths. For a full treatment of LLM inference serving, see Design an LLM Inference Serving System.
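The fragmentation argument can be made concrete with a little arithmetic. The sketch compares reserving a full maximum-length KV-cache slab per request against reserving fixed-size blocks on demand. The model shape in the usage example (80 layers, 8 KV heads, head dimension 128, FP16) is an assumption resembling a 70B-class model with grouped-query attention, and the block size of 16 tokens is likewise an assumption.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys and values for one token across all layers (factor 2 = K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def reserved_tokens_contiguous(seq_lens, max_len):
    """Static allocation: every request reserves max_len slots up front."""
    return len(seq_lens) * max_len


def reserved_tokens_paged(seq_lens, block_size=16):
    """Paged allocation: each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
live_requests = [100, 900, 37]  # current sequence lengths in tokens
static_gb = reserved_tokens_contiguous(live_requests, max_len=2048) * per_token / 1e9
paged_gb = reserved_tokens_paged(live_requests) * per_token / 1e9
print(f"static: {static_gb:.2f} GB, paged: {paged_gb:.2f} GB")
```

The memory freed by not over-reserving is what lets the scheduler admit more concurrent requests, and more concurrent requests is where the throughput gain comes from.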
Safety and guardrails for GenAI
A GenAI system that can produce arbitrary text needs input and output guardrails. Name them as a distinct design component, not an afterthought.
- Input filters classify the user's prompt for policy violations (jailbreak attempts, PII, prohibited topics) before sending to the model.
- Output filters run a safety classifier on the model's response before returning it to the user; track the refusal rate as a product metric: too high and the system is unusable, too low and policy violations go undetected.
- Grounding checks for RAG verify that the response's factual claims are supported by retrieved context, flagging or suppressing responses with low faithfulness scores.
- Rate limits and abuse detection are also required: GenAI endpoints are expensive and targeted for abuse; a prompt-injection attack that generates 10 000-token responses is both a safety and a cost incident. See Design a Rate Limiter for the infrastructure layer.
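Structurally, the guardrails wrap the model call on both sides. The sketch below shows that shape with a toy regex check standing in for real safety classifiers; `guarded_generate`, `contains_email`, and the `generate(prompt, max_tokens)` signature are all hypothetical.

```python
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def contains_email(text):
    """Toy PII check: returns a reason string if an email address is found."""
    return "pii:email" if EMAIL_PATTERN.search(text) else None


def guarded_generate(prompt, generate, input_checks, output_checks, max_output_tokens=512):
    """Run input checks, call the model with a token cap, run output checks."""
    for check in input_checks:
        reason = check(prompt)
        if reason:
            return {"status": "blocked_input", "reason": reason, "text": None}
    # The token cap bounds the cost of any single request.
    text = generate(prompt, max_output_tokens)
    for check in output_checks:
        reason = check(text)
        if reason:
            return {"status": "blocked_output", "reason": reason, "text": None}
    return {"status": "ok", "reason": None, "text": text}
```

Returning a status and reason instead of raising makes it straightforward to count blocked inputs and outputs, which is how the refusal rate becomes a tracked metric.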
Architecture
flowchart TD
subgraph Offline["Offline — Training and Evaluation"]
LOGS[(Interaction logs<br/>S3 / GCS, Parquet)] --> FJ[Feature join<br/>point-in-time correct]
FJ --> TRAIN[Distributed training<br/>PyTorch + GPU]
TRAIN --> REG[Model registry<br/>versioned artifact]
REG --> SHADOW[Shadow test<br/>then A/B ramp]
end
subgraph Online["Online — Serving Path"]
REQ[User request] --> CG[Candidate Gen<br/>ANN / BM25 / heuristics]
CG --> RANK[Ranking service<br/>GPU inference]
FS[(Feature store<br/>online — Redis)] --> RANK
RANK --> RERANK[Re-rank<br/>diversity + safety]
RERANK --> RESP[Response]
end
subgraph RT["Real-time — Signal Pipeline"]
EVENTS[User events<br/>click, skip, dwell] --> KAFKA[Kafka]
KAFKA --> FLINK[Flink stream<br/>processor]
FLINK --> FS
FLINK --> LOGS
end
REG --> RANK
FS -.->|"training read"| FJ
style TRAIN fill:#ff6b1a,color:#0a0a0f
style RANK fill:#15803d,color:#fff
style FS fill:#a855f7,color:#fff
style KAFKA fill:#ff2e88,color:#fff
style CG fill:#0e7490,color:#fff
style REG fill:#ffaa00,color:#0a0a0f
This diagram is the canonical ML system. The three subgraphs correspond to the three time horizons that matter: the offline training loop (hours), the real-time signal pipeline (minutes), and the online serving path (milliseconds). They share the feature store as the single source of truth — the component that prevents training/serving skew. The feedback arrow from the real-time pipeline back to the log store is the feedback loop made explicit.
Edge cases & gotchas
Optimising a proxy metric
The most dangerous trap in ML is optimising a metric that correlates with the business goal in the short term but diverges in the long term. Maximising clicks trains models toward clickbait. Maximising raw watch time trains models toward content that is hard to stop watching, not content that users value. Name this risk explicitly when you discuss the ML objective in Step 1 — and connect it to your guardrail metric choice.
Ignoring the feedback loop and feedback delay
Most candidates describe the training pipeline as a one-time process: collect data, train model, deploy. Production ML systems run continuously. The model changes what users see. What users see changes the training data. Without explicit mitigation (exploration policies, counterfactual logging), the system converges to a local optimum that may be far from the global optimum.
Feedback delay is a related, concrete problem. If your label requires a 24-hour outcome window (did the user return the next day?), your newest usable training example is always at least 24 hours old. Retraining every hour does not change that: each run can only add the labels that matured in the last hour, so the model never sees outcomes from today.
No baseline
Proposing a deep neural network on a problem a lookup table would solve is a red flag. Always name the baseline, measure the gap it would need to close, and justify the complexity of the model you're proposing.
Data leakage
Using a feature that encodes information about the future label — the final purchase count including purchases after the impression, the user's 30-day history computed from the current date rather than the impression date — produces a model that looks excellent offline and fails immediately in production. Point-in-time correct feature joins are the operational fix. Naming this unprompted is a strong hire signal.
Evaluating a GenAI system only on vibes
"We showed it to the team and it seemed good" is not an evaluation. Establish a golden set, run an LLM-as-judge pipeline, track faithfulness, and propose an A/B test on the business metric. All of these are tractable to propose in a 45-minute interview even if you acknowledge they take weeks to implement.
Trade-offs to discuss in an interview
Offline AUC vs online A/B. High offline AUC does not guarantee a positive A/B result. The gap is usually caused by position bias or popularity bias in the training data. Name the gap and explain why only an A/B test can confirm the effect on the online metric.
Exploration vs exploitation. A pure exploit strategy creates a feedback loop that starves the long tail and degrades long-term user satisfaction. Name an exploration policy — epsilon-greedy, Thompson sampling, or an explicit freshness budget — and state the expected cost (a 2–5% reduction in short-term engagement metrics is typically acceptable).
RAG vs fine-tune. RAG wins when knowledge changes frequently or citations are required. Fine-tune wins when output format/style is wrong or when a smaller model can match quality at lower inference cost. The decision axis is: knowledge-change frequency, latency budget, and whether grounding/faithfulness is a product requirement.
Continuous batching trade-off. Continuous batching, as implemented in vLLM, can raise aggregate GPU throughput by 10–30× over serving one request at a time, but every request now shares each decoding step with the rest of the batch, so per-token latency rises as the batch grows. For latency-sensitive applications (autocomplete, voice), discuss this trade-off explicitly: cap the batch size, add speculative decoding, or dedicate capacity to latency-critical traffic.
Feature freshness vs cost. Near-real-time features (Flink, Redis) are expensive to operate. Batch features (Spark, offline column store) are cheap but stale. The right split depends on which features are most sensitive to freshness — for most rankers, 1–5 session-level features need real-time freshness; the rest can be hourly batch.
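The exploration policy named in the exploration vs exploitation trade-off above can be sketched as epsilon-greedy slate filling. The function name `epsilon_greedy_rank`, the `(item, score)` input shape, and the `explored` flag are assumptions for illustration; the flag is there so downstream logging can tell exploratory slots from exploit slots.

```python
import random


def epsilon_greedy_rank(scored_items, epsilon, rng, k):
    """Fill k slots: take the top-scored item, or a random one with probability epsilon."""
    remaining = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    slate = []
    while remaining and len(slate) < k:
        if rng.random() < epsilon:
            choice = rng.choice(remaining)  # explore: uniform over what is left
            explored = True
        else:
            choice = remaining[0]           # exploit: best remaining score
            explored = False
        remaining.remove(choice)
        slate.append((choice[0], explored))
    return slate


candidates = [("a", 0.9), ("b", 0.7), ("c", 0.4), ("d", 0.1)]

# With epsilon = 0 the policy is pure exploitation and fully deterministic.
greedy = epsilon_greedy_rank(candidates, 0.0, random.Random(0), 2)
print(greedy)  # [('a', False), ('b', False)]

# With a small epsilon, a few slots go to items the ranker would not have shown.
mixed = epsilon_greedy_rank(candidates, 0.05, random.Random(0), 3)
```

Thompson sampling or a freshness budget would replace the random branch with a different selection rule; the logging of which slots were exploratory stays the same.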
Things you should now be able to answer
- What is the first thing you do in an ML system design round, and why does it happen before you touch architecture?
- A candidate proposes maximising click-through rate as the ML objective for a news feed. What is the likely failure mode, and how do you fix it with a guardrail?
- What causes training/serving skew, and what does a feature store do to prevent it?
- You have a training dataset of user impressions and click labels. Why must you split by time rather than randomly?
- A ranking model scores well on AUC but the A/B test shows lower watch time. What are three possible explanations?
- When would you choose RAG over fine-tuning for a GenAI feature? When would you choose fine-tuning?
- What does PagedAttention do, and why does it increase LLM serving throughput?
- How do you evaluate a GenAI system when there is no ground-truth label?
- What is the feedback loop in a recommendation system, and what two concrete mechanisms mitigate its negative effects?
- A model worked well at launch and degraded silently over six months. Name three root causes and the monitoring signal for each.
Further reading
- Chip Huyen, Designing Machine Learning Systems (O'Reilly, 2022) — the most comprehensive treatment of ML system design from a practitioner perspective; covers the full lifecycle from data collection to monitoring.
- Martin Zinkevich, "Rules of Machine Learning: Best Practices for ML Engineering" — Google internal doc, publicly released; 43 numbered rules covering the progression from heuristics to ML with specific failure modes at each stage.
- "Deep Neural Networks for YouTube Recommendations" — Covington, Adams, Sargin (RecSys 2016); the canonical two-stage paper that describes the candidate generation + ranking funnel in production.
- "Machine Learning: The High Interest Credit Card of Technical Debt" — Sculley et al. (NeurIPS 2014); the paper that named ML-specific technical debt including hidden feedback loops and unstable data dependencies.
- "Reliable Machine Learning" — Cathy Chen et al. (O'Reilly, 2022); covers production ML reliability, SLOs for model quality, and drift detection in depth.
- vLLM project documentation (vllm.ai) and the PagedAttention paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" — Kwon et al. (SOSP 2023).
- "Lessons Learned from Deploying Deep Learning for Recommendation at Scale" — engineering blogs from Netflix, DoorDash, LinkedIn, and Spotify; search each company's engineering blog for "recommendation system" or "ML platform."
- Design a Recommendation System — the canonical two-stage funnel, ANN index choices, feature store architecture.
- Design a Distributed Search Engine — inverted index, scatter-gather, BM25, the infrastructure behind keyword-based candidate retrieval.
- Design an LLM Inference Serving System — vLLM, PagedAttention, continuous batching, speculative decoding, KV cache management at serving scale.
- Design a RAG Pipeline — chunking, dense vs. hybrid retrieval, faithfulness evaluation, the full retrieval-augmented architecture.
- Design a Vector Database — HNSW, IVF+PQ, quantisation, the storage layer behind ANN retrieval.
- Design an AI Agent Platform — tool use, multi-step reasoning, agent orchestration, failure modes in agentic systems.
- How to Approach a System Design Interview — the classic system design framework this article extends.
Frequently asked questions
How is an ML system design round different from a classic system design interview?
In a classic round you are graded on boxes and arrows — the topology of services, databases, and caches. In an ML round you are additionally graded on whether you can frame the business problem as an ML objective, define a label strategy, close the feedback loop, and distinguish offline from online evaluation. Getting the architecture right but ignoring the data pipeline and retraining cadence is a no-hire at most FAANG+ teams.
What is training/serving skew and why does it matter more than model choice?
Training/serving skew occurs when a feature computed offline for training is computed differently at serving time — different aggregation windows, timezone handling, or code paths. It degrades model quality silently: logs show no errors, scores just drift. A feature store that shares transformation logic across both paths is the standard fix, and many interviewers consider naming this problem a stronger signal than any particular model architecture choice.
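The "shared transformation logic" idea can be illustrated without any feature-store product: define the feature once and call the same function from both paths. All names below (`purchases_in_window`, `build_training_features`, `build_serving_features`, the 7-day `WINDOW`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=7)  # defined once so the two paths cannot drift apart


def purchases_in_window(purchase_times, as_of, window=WINDOW):
    """Count purchases in the half-open interval (as_of - window, as_of]."""
    start = as_of - window
    return sum(1 for t in purchase_times if start < t <= as_of)


def build_training_features(logged_impression_time, purchase_times):
    # Offline path: evaluate the feature as of the logged impression time.
    return {"purchases_7d": purchases_in_window(purchase_times, logged_impression_time)}


def build_serving_features(request_time, purchase_times):
    # Online path: the same function, evaluated as of the live request time.
    return {"purchases_7d": purchases_in_window(purchase_times, request_time)}


t0 = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
history = [t0 - timedelta(days=9), t0 - timedelta(days=6), t0 - timedelta(hours=1)]
offline = build_training_features(t0, history)
online = build_serving_features(t0, history)
print(offline, online)  # {'purchases_7d': 2} {'purchases_7d': 2}
```

Skew typically appears when the two paths reimplement the window or the boundary condition separately; a single definition, with timezone-aware timestamps, removes that class of mismatch.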
When should a GenAI candidate choose RAG over fine-tuning?
RAG is preferred when the knowledge base changes frequently (daily or weekly updates would require constant retraining), when faithfulness and citation are required, or when the base model already has strong reasoning but lacks domain facts. Fine-tuning is preferred when the task requires a new output style or behaviour the base model cannot produce with prompting, or when inference latency and cost are tight enough that a smaller fine-tuned model beats a large prompted one.
What is the standard latency budget for an LLM inference call and what levers control it?
A single unbatched generation stream from a 70B-parameter model typically produces 20–40 tokens per second on two A100 80 GB GPUs (the 140 GB of FP16 weights need at least two 80 GB GPUs) without optimisation. With PagedAttention (vLLM) and continuous batching, aggregate throughput across concurrent requests rises to 300–1000 tokens per second, cutting per-request cost by roughly 10–30×. The main levers are: model size, quantisation (FP16 to INT8 cuts weight memory roughly 2×), speculative decoding (a small draft model proposes tokens that a large model verifies in a single forward pass), and KV-cache prefix sharing for repeated system prompts.
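The memory figures in this answer follow from simple arithmetic: parameters times bytes per parameter. The helper names below are hypothetical, and the GPU count is a lower bound from weights alone, since the KV cache and activations need additional room.

```python
import math


def weight_memory_gb(params_billion, bytes_per_param):
    """Weight memory in GB (10^9 bytes): billions of parameters x bytes per parameter."""
    return params_billion * bytes_per_param


def min_gpus(weight_gb, gpu_memory_gb):
    """Lower bound on GPU count from weights alone (ignores KV cache and activations)."""
    return math.ceil(weight_gb / gpu_memory_gb)


fp16 = weight_memory_gb(70, 2)  # FP16 stores 2 bytes per parameter
int8 = weight_memory_gb(70, 1)  # INT8 stores 1 byte per parameter

print(fp16, min_gpus(fp16, 80))  # 140 2
print(int8, min_gpus(int8, 80))  # 70 1
```

Quantising to INT8 halves the weight footprint, which is what allows the same model to fit on fewer GPUs or to leave more memory for the KV cache.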
What separates a hire from a no-hire in an ML design round?
The clearest hire signals are: defines the online business metric and a guardrail before touching architecture; names train/serve skew unprompted; proposes an offline eval then an A/B test with concrete metrics. The clearest no-hire signals are: jumps to model architecture before defining labels; claims offline AUC proves the product works; evaluates a GenAI system only on subjective vibes with no reproducible harness.