Design a Recommendation System (Netflix / TikTok)
Pick the best items for each user from millions of candidates in milliseconds. The two-stage candidate-generation + ranking architecture, embeddings, and feature stores.
The problem
Netflix has roughly 10 million videos in its catalog. When you open the app, the home screen loads in well under two seconds — and most of those items were handpicked for you, not for the person sitting next to you. TikTok does the same with a billion users and an endless scroll. Spotify does it with 100 million tracks. Recommendation systems are the invisible infrastructure that makes "personalized for you" feel effortless at global scale.
At the core, the job is straightforward: given a user and a surface (a home feed, a "watch next" shelf, an email digest), return an ordered list of N items the user is likely to enjoy — faster than a human can perceive the delay. The catch is the numbers. With 500 million users and 10 million items, brute-force scoring every item for every user at serving time is roughly 25 quadrillion lookups per day. That's not an engineering stretch goal; it's physically impossible with today's hardware at consumer cost.
That impossibility is what makes recommendation the canonical ML systems interview topic. The central tension is two-sided: candidate generation must be cheap enough to run in milliseconds across millions of items, while ranking must be rich enough to capture nuanced user taste. Neither stage alone is sufficient — retrieval without ranking returns mediocre results, ranking without fast retrieval is too slow to serve. The architecture that resolves this tension — a fast, approximate retrieval stage feeding a slower, high-capacity ranking model — spans distributed systems, ML engineering, and product intuition about filter bubbles and cold start. This article walks the full path.
Functional requirements
- Given a user and a surface (home feed, "related videos", email digest), return a ranked list of N items.
- Low-latency: < 150 ms p99 end-to-end (including network from the user's region).
- Personalised: rank must be driven by that user's preferences, not just global popularity.
- Fresh: new content should surface; stale signals (watched 2 years ago) should decay.
- Cold start: a new user with zero history, or a newly uploaded item, must still receive or appear in recs.
Non-functional requirements
- Scale: 500 M users, 10 M items, ~29 k recommendation requests/sec average (~58 k QPS peak).
- Availability: > 99.99% — the home feed going dark is a Severity-1 incident.
- Consistency: feature values used at serving time must match features used at training time (training/serving skew causes silent model degradation).
- Explainability (soft): the system should support "why am I seeing this?" for product compliance.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Active users | 500 M | Given |
| Catalog size | 10 M items (videos, products, songs) | Given |
| Daily recommendation requests | 2.5 B req/day | 500 M users × 5 feed loads/day |
| Average QPS | ~29 000 QPS | 2.5 B ÷ 86 400 s |
| Peak QPS | ~58 000 QPS | 2× average |
| Latency — candidate generation | ≤ 20 ms | Budget within 150 ms p99 |
| Latency — ranking inference | ≤ 80 ms | Budget (includes ≤ 15 ms feature store lookup) |
| Latency — serialisation + network | ≤ 30 ms | Budget |
| Latency — total | ~130 ms | 20 + 80 + 30 ms (the 15 ms feature-store lookup is already inside the 80 ms ranking budget) |
| Item embeddings (float32) | ≈ 10 GB | 10 M items × 256-dim × 4 B = 10 M × 1 024 B |
| Item embeddings (int8 quantised) | ≈ 2.5 GB | 10 GB ÷ 4 — fits in one large memory node |
| User embeddings (all, float32) | ≈ 512 GB | 500 M users × 256-dim × 4 B — too large for RAM |
| Active user embeddings in RAM | ~10 GB | Keep top 10 M active users; rest on SSD or recomputed on demand |
| Interaction events/sec | ~1.45 M events/sec | 29 k req/s × 50 items shown/request × 1 event |
| Daily training events | ~125 B events/day | 1.45 M × 86 400 s |
| Training data storage | ~5 TB/day compressed | Parquet on object store (S3/GCS) |
| Index refresh cadence | Incremental every 5 min; full rebuild daily | Catalog grows by ~100 k new items/day |
Takeaway: At 58 k QPS peak and a 10 M item catalog, serving requires a ~10 GB in-RAM item embedding index and a ~130 ms end-to-end latency budget split across candidate generation (≤ 20 ms), ranking inference (≤ 80 ms, including feature-store lookup), and serialisation + network (≤ 30 ms); full float32 user embeddings (≈ 512 GB) cannot fit in RAM, so only the top 10 M active users (~10 GB) are kept hot.
Building up to the design
V1: Popularity-based recs
def recommend(user_id, n=20):
return top_n_items_by_play_count(n) # same list for everyone
Every user gets the same top-20 globally popular items. Zero ML required.
This is a surprisingly strong baseline — popular content is popular for a reason, and many teams underestimate it. But it fails the moment personalisation matters: a user who watches only German documentaries gets served English action blockbusters, and engagement metrics plateau quickly.
V2: Collaborative filtering
Matrix factorisation decomposes the user-item interaction matrix R (rows = users, columns = items, values = implicit feedback like play count) into user embeddings U and item embeddings V such that R ≈ U × Vᵀ.
Training is offline (hours). At serving time: look up the user's embedding, compute dot products with all item embeddings, return the top-K. Users who liked similar things land near each other in the latent space, so a documentaries-in-German user actually gets documentaries in German.
The problem is scale. Scoring all 10 M items per request is O(N × D). At 256 dims and 58 k QPS, that's 58 k × 10 M × 256 ≈ 150 trillion FLOPs/sec. Even with BLAS-optimised matrix multiply, that's tens of racks of compute just for candidate scoring. Cold start is a separate wound: a new user has no embedding at all, and neither does a new item.
V3: Approximate nearest-neighbour (ANN) retrieval
The key insight is that you don't need the exact top-K — you need a fast, good-enough shortlist that a downstream ranker can tighten up. Build an ANN index over item embeddings (FAISS with HNSW or IVF quantisation, or Google's ScaNN). At query time, look up the user embedding; the ANN index returns the approximate top-K in O(log N) time — typically under 10 ms for 10 M items in RAM, versus the seconds a brute-force dot product would need.
Recall is typically 90–99% depending on index type and configuration: you miss a fraction of the true top-K, but those missed items are very close in score to the items you do return, and the ranking stage will reorder the shortlist anyway. The latency win is decisive.
What breaks at this point is multi-interest users. A single ANN lookup from a single user embedding misses someone who watches tech talks AND cooking AND horror. You need to retrieve from multiple interest clusters or query towers.
V4: Two-tower model + multiple retrieval sources
The two-tower model learns user and item encoders jointly, optimised for inner product similarity between positive user-item pairs versus in-batch negatives. The towers can ingest arbitrary features — not just IDs — so they handle cold start more gracefully than pure matrix factorisation.
Multiple candidate sources are merged:
| Source | Mechanism | Candidates |
|---|---|---|
| ANN on user embedding | Two-tower embeddings, FAISS | 200–300 |
| User's watch/like history → item-item similarity | Item co-watch matrix | 100 |
| Trending / new content | Time-decayed popularity score | 50 |
| Followed creators / subscriptions | Social graph | 50 |
| Geographically local trending | Per-region popularity | 30 |
Total retrieved: ~500 items, deduplicated.
These ~500 items still need to be ranked. Candidate generation scores (dot products) don't account for rich context: time of day, device type, the user's watch history in the last five minutes, item freshness. Those features are too expensive to compute at retrieval scale, which is exactly why the two-stage split exists.
V5: Ranking model over the shortlist
A ranking model (typically a deep neural network with embedding layers for sparse IDs, dense layers for numeric features, and optional cross-network layers for feature interactions) scores each of the ~500 candidates with a much richer feature set. It can be trained to predict multiple objectives simultaneously — click-through rate, watch-time, like probability — with a weighted combination.
At 500 candidates × 58 k QPS = 29 M inference calls/sec across the fleet, this is expensive but tractable with GPU serving (TensorFlow Serving, TorchServe) and model distillation.
V6: Re-ranking, feature store, online learning
The final pieces are the ones most candidates skip. A re-ranker applies diversity (max-marginal relevance), business rules (boost paying advertisers, suppress policy-flagged content), and exploration (surface a few non-obvious items to learn about user taste). A feature store ensures the model sees identical features at training time and serving time — without it, training/serving skew is nearly inevitable and degrades ranking quality silently. And a streaming pipeline feeds fresh interaction events back into the feature store so the model's inputs update within minutes of a user action, without requiring a full retrain.
flowchart LR
V1["V1: Global popularity<br/>no ML"] --> V2["V2: Matrix factorisation<br/>personalised but slow"]
V2 --> V3["V3: ANN retrieval<br/>sub-20 ms"]
V3 --> V4["V4: Two-tower + multi-source<br/>500 candidates"]
V4 --> V5["V5: + ranking model<br/>rich features"]
V5 --> V6["V6: + feature store + re-ranking<br/>production-grade"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
Full architecture
flowchart TD
subgraph Offline["Offline Training Pipeline (hours)"]
LOGS[(Interaction Logs<br/>S3 / GCS)] --> FE[Feature Engineering<br/>Spark / Flink]
FE --> TRAIN[Model Training<br/>PyTorch / TensorFlow]
TRAIN --> REG[Model Registry]
FE --> FS_BATCH[(Feature Store<br/>offline / batch)]
end
subgraph Online["Online Serving Path (< 150 ms)"]
U[User Request] --> API[Recs API]
API --> CG[Candidate Generator]
CG --> FAISS[(ANN Index<br/>FAISS / ScaNN)]
CG --> HEU[Heuristics<br/>Trending + Followed]
CG --> MERGE[Merge + Dedup<br/>~500 candidates]
MERGE --> RANKER[Ranking Service]
FS_ONLINE[(Feature Store<br/>online — Redis)] --> RANKER
RANKER --> MODEL[Ranking Model<br/>GPU inference]
MODEL --> RERANK[Re-Ranker<br/>Diversity + Rules]
RERANK --> RESP[Response]
end
subgraph RT["Real-Time Signal Pipeline (minutes)"]
EVENTS[User Events<br/>click, skip, dwell] --> KAFKA[Kafka]
KAFKA --> STREAM[Stream Processor<br/>Flink / Samza]
STREAM --> FS_ONLINE
STREAM --> LOGS
end
REG --> MODEL
FS_BATCH --> FS_ONLINE
style FAISS fill:#0e7490,color:#fff
style RANKER fill:#15803d,color:#fff
style FS_ONLINE fill:#a855f7,color:#fff
style MODEL fill:#ffaa00,color:#0a0a0f
style KAFKA fill:#ff2e88,color:#fff
style CG fill:#ff6b1a,color:#0a0a0f
Component deep-dives
Candidate generation — two-tower model and ANN
The two-tower model trains a user encoder and an item encoder end-to-end:
User tower: [user_id, age_bucket, country, recent_watch_ids, ...] → 256-dim embedding
Item tower: [item_id, category, creator_id, duration_bucket, ...] → 256-dim embedding
Training objective: maximise dot(user_emb, positive_item_emb) vs. in-batch negatives
The item embeddings are indexed into FAISS once trained. At serving time, the user tower runs on the current user context (~5 ms CPU or < 1 ms on GPU), and the ANN index is queried for the top-K nearest item embeddings by inner product (typically under 10 ms for HNSW on 10 M items in RAM, often 1–5 ms at moderate efSearch settings).
The diagram below shows how this works: both towers project into the same embedding space, and the retrieval step is a nearest-neighbor search there — not a query against a database of raw attributes.
flowchart LR
UF["User features<br/>user_id, history, country"] --> UT[User Tower<br/>MLP layers]
IF["Item features<br/>item_id, category, creator"] --> IT[Item Tower<br/>MLP layers]
UT --> UE((User<br/>embedding<br/>256-dim))
IT --> IE((Item<br/>embedding<br/>256-dim))
UE -->|"inner product similarity"| ANN[(ANN Index<br/>all item embeddings)]
ANN -->|"top-300 nearest"| CANDS[Candidate list]
style UT fill:#ff6b1a,color:#0a0a0f
style IT fill:#0e7490,color:#fff
style UE fill:#ffaa00,color:#0a0a0f
style IE fill:#ffaa00,color:#0a0a0f
style ANN fill:#15803d,color:#fff
HNSW vs IVF+PQ: HNSW (hierarchical navigable small world graphs) delivers high recall (typically 95–99%, depending on efSearch) with low latency but high memory — it stores the full-precision graph in RAM. IVF with product quantisation (PQ) compresses embeddings 16–32×, saving significant memory, but recall is typically 5–15% lower than HNSW at matched latency, and depends heavily on nprobe and PQ code size. Adding a reranking step — fetch more candidates, recompute exact distances — recovers much of the recall loss at extra latency cost, and is preferred when item catalogs exceed available RAM. In practice, many systems use IVF+PQ for the full 10 M catalog and HNSW for a "freshness index" holding only the last 48 hours of new items.
The feature store
The feature store is the piece most teams underestimate. Its job: serve pre-computed feature values — fast, consistently — to both the training pipeline and the online serving path.
flowchart LR
BATCH[Batch compute<br/>Spark] --> FSO[(Feature Store<br/>Offline tier<br/>e.g. Hive / BigQuery)]
STREAM2[Stream compute<br/>Flink] --> FSN[(Feature Store<br/>Online tier<br/>e.g. Redis / DynamoDB)]
FSO --> FSN
FSN --> RANKER2[Ranking Service]
FSO --> TRAIN2[Training Pipeline]
style FSO fill:#a855f7,color:#fff
style FSN fill:#a855f7,color:#fff
style RANKER2 fill:#15803d,color:#fff
Training/serving skew is the most dangerous failure mode in production recommenders. It happens when the feature value computed offline for training differs from the value computed at serving time — because of logic drift, timezone handling, or aggregation window differences. The feature store mitigates this by being the single source of truth: training jobs read the same store that serving reads, using the same transformation code in both paths.
Features split into two tiers:
| Tier | Examples | Staleness | Storage |
|---|---|---|---|
| Pre-computed batch | User's 30-day genre affinity, item's all-time play count | Hours | Column store (Hive, BigQuery) |
| Near-real-time | User's plays in last 30 min, item's plays in last 5 min | Minutes | Redis / DynamoDB |
| Request-time | Device type, time of day, country from IP | Milliseconds | Computed inline |
Ranking model — architecture and training
The ranking model takes ~500 candidates and scores each independently. A standard architecture (similar in spirit to widely described systems at Netflix and YouTube):
Input features per (user, item) pair:
- user_id (sparse, embedded)
- item_id (sparse, embedded)
- user's recent watch list (variable-length, pooled embeddings)
- item's pre-computed features (category, duration, age, language)
- context features (device, hour, day-of-week)
- interaction features (user has seen this creator before? user's avg watch % of similar items)
Architecture:
Sparse IDs → embedding table → 64-dim vectors
Dense features → MLP
Concatenate all → 3-4 dense layers (512 → 256 → 128 → 1 per objective)
Objectives (multi-task):
- P(click) weight 0.3
- P(watch > 50%) weight 0.5
- P(like / save) weight 0.2
Final score = weighted sum
Training runs on the logged data (user, item, label) with the features as they existed at the time of the impression. This point is subtle: you must log the feature values at impression time and train on those logged values — not recompute them later — otherwise you get future data leakage.
Cold start
Cold start comes in two flavors, and they need different fixes.
New user. Zero interaction history. The most useful first step is asking for explicit preferences during onboarding — genre interests, followed creators. Without that signal, fall back to geographically and demographically popular content. Then run a multi-armed bandit (epsilon-greedy or Thompson sampling) to explore the user's taste with a small fraction of their early impressions. This is the classic exploration-exploitation problem: you need to show some uncertain items to learn what the user likes, even at the cost of short-term click-through rate.
New item. Zero interactions, so embedding-based retrieval won't surface it. Use content-based features instead: item category, creator embeddings if the creator has prior items, title/description embeddings from a text encoder. Assign the new item an initialisation embedding based on its metadata rather than a random or zero vector. Also reserve a freshness "exploration budget" — a fraction of every user's feed allocated to recently published content regardless of predicted score.
flowchart TD
REQ[Recommendation request] --> CHECK{Has interaction<br/>history?}
CHECK -->|"Yes"| EMB[Look up user embedding<br/>ANN retrieval]
CHECK -->|"No — new user"| COLD[Cold start path]
COLD --> ONBOARD{Onboarding prefs<br/>collected?}
ONBOARD -->|"Yes"| PREF[Content-based on<br/>stated preferences]
ONBOARD -->|"No"| POP[Geo + demographic<br/>popular content]
PREF --> BANDIT[Epsilon-greedy bandit<br/>explore taste]
POP --> BANDIT
EMB --> MERGE[Merge candidates<br/>~500]
BANDIT --> MERGE
FRESHNESS[Freshness budget<br/>new items regardless of score] --> MERGE
MERGE --> RANK[Ranking Service]
style CHECK fill:#ff6b1a,color:#0a0a0f
style EMB fill:#15803d,color:#fff
style BANDIT fill:#0e7490,color:#fff
style FRESHNESS fill:#a855f7,color:#fff
style RANK fill:#ffaa00,color:#0a0a0f
Online learning and real-time signals
A full model retrain from scratch takes hours. But the user's taste right now matters. The solution is to decompose what changes.
Feature store updates happen fast, without any retraining. Kafka events update near-real-time features — what the user just watched, the item's viral momentum — within minutes. The model weights are unchanged, but its inputs are fresh. This is the most universally adopted mechanism.
Online fine-tuning goes further: the ranking model is updated with a streaming gradient update at a small learning rate. This risks training instability and is less universally adopted.
Frequent full retrains remain the main lever. Retrain the model on the last N days of data every few hours. With modern ML infrastructure (Spark for feature prep, distributed GPU training), this is achievable in well under an hour for many ranking models.
Exploration vs. exploitation — preventing filter bubbles
A pure exploit strategy — always show the highest-scoring item — creates a feedback loop. The model learns from what it shows, so it keeps showing the same narrow content. Users interact with that content, confirming the bias. Long-term user satisfaction degrades quietly.
There are several concrete mitigations. Epsilon-greedy exploration shows a randomly sampled item with probability ε (typically 2–5%), logs whether it was explored, and trains on that signal. Thompson sampling / UCB is a principled bandit approach that explores proportionally to uncertainty about an item's quality. Max-marginal relevance (MMR) re-ranking penalises items that are too similar to items already placed earlier in the list — this prevents showing 10 nearly identical videos from the same creator. And explicit business rules in re-ranking handle freshness boosts (surface content newer than 48 hours), topic diversification quotas, and policy compliance filters like age-gating and regional restrictions.
Serving path — sequence diagram
sequenceDiagram
participant User
participant API as Recs API
participant CG as Candidate Generator
participant ANN as ANN Index
participant HEU as Heuristics
participant FS as Feature Store
participant RANK as Ranking Service
participant RERANK as Re-Ranker
User->>API: GET /feed (user_id, context)
API->>CG: fetch candidates
par ANN retrieval
CG->>ANN: user_embedding → top-300
ANN-->>CG: 300 item IDs
and Heuristics
CG->>HEU: trending + subscriptions
HEU-->>CG: 200 item IDs
end
CG->>CG: merge + dedup → ~500 candidates
CG-->>API: 500 item IDs
API->>RANK: score 500 candidates
RANK->>FS: batch fetch features for 500 items + user
FS-->>RANK: feature vectors
RANK->>RANK: model inference (GPU batch)
RANK-->>API: scored list
API->>RERANK: apply diversity + rules
RERANK-->>API: final top-50
API-->>User: ranked list
The ANN retrieval and heuristic retrieval run in parallel — notice the par block above. That parallelism is what lets candidate generation stay under 20 ms despite pulling from multiple sources.
Offline training pipeline
flowchart TD
LOGS[(Interaction logs<br/>S3/GCS, Parquet)] --> JOIN[Feature join<br/>attach features logged at impression time]
JOIN --> FILTER[Filter: remove bots,<br/>deduplicate, label smoothing]
FILTER --> SPLIT[Train / val / test split<br/>by time, NOT random]
SPLIT --> TRAIN[Distributed training<br/>PyTorch + GPU cluster]
TRAIN --> EVAL[Offline evaluation<br/>AUC-ROC, NDCG, MRR]
EVAL --> SHADOW[Shadow traffic test<br/>log but don't serve]
SHADOW --> AB[A/B test<br/>engagement metrics]
AB --> DEPLOY[Deploy to production]
style TRAIN fill:#ff6b1a,color:#0a0a0f
style AB fill:#15803d,color:#fff
style EVAL fill:#0e7490,color:#fff
A critical subtlety: always split train/val/test by time — train on week N, validate on week N+1. Random splits leak future information and produce optimistic offline metrics that don't correlate with online A/B test results.
Offline vs. online metrics
| Metric | Where | What it measures | Pitfall |
|---|---|---|---|
| AUC-ROC | Offline | Model's discrimination ability | Doesn't capture position or diversity |
| NDCG@K | Offline | Ranking quality in top-K | Requires a relevance judgment, not just binary label |
| Precision@K | Offline | Fraction of top-K that are relevant | Ignores ordering within top-K |
| Click-through rate (CTR) | Online A/B | Did users click? | Clickbait-optimised models win; watch time suffers |
| Watch time / completion | Online A/B | Did users watch? | Better proxy for satisfaction |
| Long-term retention | Online (weeks) | Did users come back? | The true north-star metric; slow to measure |
The common trap: optimise offline AUC to a high value, deploy, and see CTR go down. This usually means the model learned to predict clicks from features that aren't actually causal — e.g., the item's popularity at training time, which doesn't generalise to new items.
Storage choices
| Data | Store | Why |
|---|---|---|
| Item embeddings (ANN index) | In-memory FAISS / ScaNN on inference nodes | Sub-10 ms retrieval requires RAM; index sharded across nodes |
| User embeddings (active) | Redis cluster or in-memory service | Top-10 M active users ~10 GB; LRU eviction for the rest |
| Feature store (online) | Redis / DynamoDB | < 5 ms p99 lookup; key = user_id or item_id |
| Feature store (offline) | Hive / BigQuery / S3 Parquet | Batch reads for training; columnar for scan efficiency |
| Interaction logs | Object store (S3/GCS) in Parquet | Cheap, durable; Spark-readable; append-only |
| Model artifacts | Model registry (MLflow / internal) | Versioned; enables rollback |
| Experiment results | Columnar analytics DB (ClickHouse / BigQuery) | A/B test result queries over billions of rows |
Failure modes
Training/serving skew
The most common cause of silent ranking degradation. Features computed differently in training versus serving produce garbage scores that look normal in logs — no errors, just worse recs. To detect it: log both the features used at serving time and the model's raw score, then compare offline model predictions on the same features against serving scores. The fix is the feature store as single source of truth, with shared transformation code run in both paths.
Cold start
New items get zero interaction signal, so embedding-based retrieval ignores them. The freshness exploration budget and content-based initialisation (described above) are the practical mitigations. Creator-based proxy embeddings also help for new uploads from known creators.
Feedback loop / popularity bias
The system shows popular items, users click them, they become more popular, the long tail starves. The fix is an exploration policy, diversity re-ranking, explicit long-tail exposure quotas, and debiasing during training via importance weighting by inverse propensity score.
Stale features
A real-time feature (item views in the last 5 minutes) is stale because the Kafka pipeline lagged. The model was trained with fresh data, but serving gets 30-minute-old features — behaviour shifts. Monitor feature freshness as a service-level metric; alert on lag exceeding a threshold.
Index staleness (new items not retrievable)
The ANN index is rebuilt every hour, so a newly uploaded item isn't retrievable for up to an hour. The standard fix: maintain a "hot index" of items uploaded in the last 2 hours that is rebuilt every 5 minutes and merged with main retrieval results.
Model regression after retrain
A new model version scores better offline but degrades engagement in A/B. The standard deployment sequence: shadow mode (run new model in parallel, log scores, don't serve), then gradual traffic ramp (1% → 10% → 50% → 100%) with automated rollback triggers on engagement metrics.
Things to discuss in an interview
- Why two stages? Because you can't run an expensive ranking model over 10 M items at 58 k QPS. Quantify it — the math shuts down the "why not just rank everything" question cleanly.
- What's in the feature store? Distinguish batch features (genre affinity, all-time popularity), near-real-time features (last-N-minutes activity), and request-time features (device, time). Each has a different freshness SLA and storage strategy.
- Cold start specifically. Interviewers always ask. Have a concrete plan for new-user and new-item cold start before they ask.
- Offline metrics vs. online metrics. Know that high AUC doesn't guarantee good A/B results, and explain why (position bias, popularity bias, metric-objective misalignment).
- How do you prevent filter bubbles? Exploration policy + diversity re-ranking. Name a concrete mechanism.
- Training data leakage. Time-based train/val split; logging features at impression time.
Things you should now be able to answer
- Why does approximate nearest-neighbour retrieval work well enough despite missing some fraction of the exact top-K items?
- What is training/serving skew, and how does a feature store prevent it?
- Why do you split train/test by time rather than randomly?
- How does the two-tower model handle new items that have no interaction history?
- What is the exploration-exploitation trade-off in recommendations, and what are two concrete mechanisms to address it?
- Why is watch-time a better training signal than CTR for long-term user satisfaction?
Further reading
- "Deep Neural Networks for YouTube Recommendations" — Covington, Adams, Sargin (RecSys 2016). The canonical two-stage paper.
- "Recommending What Video to Watch Next: A Multitask Ranking System" — Zhao et al. (RecSys 2019). Multi-objective ranking with position bias correction.
- "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" — Yi et al. (RecSys 2019). In-batch negative bias in two-tower training.
- FAISS documentation — faiss.ai / github.com/facebookresearch/faiss
- ScaNN (Scalable Nearest Neighbors) — Google Research, ICML 2020 paper ("Accelerating Large-Scale Inference with Anisotropic Vector Quantization"), open-sourced July 2020.
- "Building a Real-time Machine Learning Feature Store" — various engineering blogs (Uber, DoorDash, Feast open-source project).
- Design a Rate Limiter — for protecting the Recs API from traffic spikes.
- Consistent Hashing — for sharding the feature store and ANN index nodes.
Frequently asked questions
▸Why do production recommendation systems use two stages instead of scoring all items directly?
Brute-force scoring all 10 million items at 58,000 QPS peak would require roughly 150 trillion FLOPs per second just for the dot-product step — infeasible at consumer cost. The two-stage split uses a cheap ANN retrieval to narrow the catalog to roughly 500 candidates in under 20 ms, then applies an expensive ranking model only to that shortlist in under 80 ms, keeping the total end-to-end latency within the 150 ms p99 budget.
▸What is training/serving skew and why is it the most dangerous failure mode in recommenders?
Training/serving skew occurs when a feature value computed offline for training differs from the value computed at serving time — due to logic drift, timezone handling, or aggregation window differences. It degrades ranking quality silently: no errors appear in logs, scores just get worse. A centralized feature store that serves as the single source of truth, with shared transformation code run in both training and serving paths, is the standard mitigation.
▸How much memory do item and user embeddings require at scale?
At 10 million items with 256-dimensional float32 embeddings, item embeddings require approximately 10 GB — or about 2.5 GB with int8 quantization, small enough to fit on one large memory node. Full float32 user embeddings for 500 million users would require roughly 512 GB, so systems keep only the top 10 million active users in RAM (about 10 GB) and recompute or load the rest from SSD on demand.
▸When should you use HNSW versus IVF with product quantization for ANN retrieval?
HNSW delivers high recall — typically 95 to 99 percent depending on efSearch — with low latency, but stores the full-precision graph in RAM. IVF with product quantization compresses embeddings 16 to 32 times and saves significant memory, but recall is typically 5 to 15 percent lower at matched latency. In practice, many systems use IVF+PQ for the full 10 million item catalog and HNSW for a freshness index covering only the last 48 hours of new items.
▸How should you handle cold start for a new item that has no interaction history?
A new item has no embedding signal, so ANN-based retrieval will not surface it. The fix is to initialize the item's embedding using content-based features — category, creator embeddings from prior uploads by the same creator, title and description embeddings from a text encoder — rather than a random or zero vector. Systems also reserve a freshness exploration budget that allocates a fraction of every user's feed to recently published content regardless of predicted score, and maintain a hot index rebuilt every 5 minutes for items uploaded in the last 2 hours.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.