~/articles/design-recommendation-system

◆◆◆Advancedasked at Netflixasked at Amazonasked at TikTokasked at Spotify

Design a Recommendation System (Netflix / TikTok)

Q: Why do production recommendation systems use two stages instead of scoring all items directly?

Brute-force scoring all 10 million items at 58,000 QPS peak would require roughly 150 trillion FLOPs per second just for the dot-product step — infeasible at consumer cost. The two-stage split uses a cheap ANN retrieval to narrow the catalog to roughly 500 candidates in under 20 ms, then applies an expensive ranking model only to that shortlist in under 80 ms, keeping the total end-to-end latency within the 150 ms p99 budget.

Q: What is training/serving skew and why is it the most dangerous failure mode in recommenders?

Training/serving skew occurs when a feature value computed offline for training differs from the value computed at serving time — due to logic drift, timezone handling, or aggregation window differences. It degrades ranking quality silently: no errors appear in logs, scores just get worse. A centralized feature store that serves as the single source of truth, with shared transformation code run in both training and serving paths, is the standard mitigation.

Q: How much memory do item and user embeddings require at scale?

At 10 million items with 256-dimensional float32 embeddings, item embeddings require approximately 10 GB — or about 2.5 GB with int8 quantization, small enough to fit on one large memory node. Full float32 user embeddings for 500 million users would require roughly 512 GB, so systems keep only the top 10 million active users in RAM (about 10 GB) and recompute or load the rest from SSD on demand.

Q: When should you use HNSW versus IVF with product quantization for ANN retrieval?

HNSW delivers high recall — typically 95 to 99 percent depending on efSearch — with low latency, but stores the full-precision graph in RAM. IVF with product quantization compresses embeddings 16 to 32 times and saves significant memory, but recall is typically 5 to 15 percent lower at matched latency. In practice, many systems use IVF+PQ for the full 10 million item catalog and HNSW for a freshness index covering only the last 48 hours of new items.

Q: How should you handle cold start for a new item that has no interaction history?

A new item has no embedding signal, so ANN-based retrieval will not surface it. The fix is to initialize the item's embedding using content-based features — category, creator embeddings from prior uploads by the same creator, title and description embeddings from a text encoder — rather than a random or zero vector. Systems also reserve a freshness exploration budget that allocates a fraction of every user's feed to recently published content regardless of predicted score, and maintain a hot index rebuilt every 5 minutes for items uploaded in the last 2 hours.

Pick the best items for each user from millions of candidates in milliseconds. The two-stage candidate-generation + ranking architecture, embeddings, and feature stores.

22 min read2026-03-28Ironclad Academy

#interview #ml #ranking #big-data

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Netflix has roughly 10 million videos in its catalog. When you open the app, the home screen loads in well under two seconds — and most of those items were handpicked for you, not for the person sitting next to you. TikTok does the same with a billion users and an endless scroll. Spotify does it with 100 million tracks. Recommendation systems are the invisible infrastructure that makes "personalized for you" feel effortless at global scale.

At the core, the job is straightforward: given a user and a surface (a home feed, a "watch next" shelf, an email digest), return an ordered list of N items the user is likely to enjoy — faster than a human can perceive the delay. The catch is the numbers. With 500 million users and 10 million items, brute-force scoring every item for every user at serving time is roughly 25 quadrillion lookups per day. That's not an engineering stretch goal; it's physically impossible with today's hardware at consumer cost.

That impossibility is what makes recommendation the canonical ML systems interview topic. The central tension is two-sided: candidate generation must be cheap enough to run in milliseconds across millions of items, while ranking must be rich enough to capture nuanced user taste. Neither stage alone is sufficient — retrieval without ranking returns mediocre results, ranking without fast retrieval is too slow to serve. The architecture that resolves this tension — a fast, approximate retrieval stage feeding a slower, high-capacity ranking model — spans distributed systems, ML engineering, and product intuition about filter bubbles and cold start. This article walks the full path.

Functional requirements

Given a user and a surface (home feed, "related videos", email digest), return a ranked list of N items.
Low-latency: < 150 ms p99 end-to-end (including network from the user's region).
Personalised: rank must be driven by that user's preferences, not just global popularity.
Fresh: new content should surface; stale signals (watched 2 years ago) should decay.
Cold start: a new user with zero history, or a newly uploaded item, must still receive or appear in recs.

Non-functional requirements

Scale: 500 M users, 10 M items, ~29 k recommendation requests/sec average (~58 k QPS peak).
Availability: > 99.99% — the home feed going dark is a Severity-1 incident.
Consistency: feature values used at serving time must match features used at training time (training/serving skew causes silent model degradation).
Explainability (soft): the system should support "why am I seeing this?" for product compliance.

Capacity estimation

Dimension	Estimate	How we got there
Active users	500 M	Given
Catalog size	10 M items (videos, products, songs)	Given
Daily recommendation requests	2.5 B req/day	`500 M users × 5 feed loads/day`
Average QPS	~29 000 QPS	`2.5 B ÷ 86 400 s`
Peak QPS	~58 000 QPS	`2× average`
Latency — candidate generation	≤ 20 ms	Budget within 150 ms p99
Latency — ranking inference	≤ 80 ms	Budget (includes ≤ 15 ms feature store lookup)
Latency — serialisation + network	≤ 30 ms	Budget
Latency — total	~130 ms	`20 + 80 + 30 ms` (the 15 ms feature-store lookup is already inside the 80 ms ranking budget)
Item embeddings (float32)	≈ 10 GB	`10 M items × 256-dim × 4 B = 10 M × 1 024 B`
Item embeddings (int8 quantised)	≈ 2.5 GB	`10 GB ÷ 4` — fits in one large memory node
User embeddings (all, float32)	≈ 512 GB	`500 M users × 256-dim × 4 B` — too large for RAM
Active user embeddings in RAM	~10 GB	Keep top 10 M active users; rest on SSD or recomputed on demand
Interaction events/sec	~1.45 M events/sec	`29 k req/s × 50 items shown/request × 1 event`
Daily training events	~125 B events/day	`1.45 M × 86 400 s`
Training data storage	~5 TB/day compressed	Parquet on object store (S3/GCS)
Index refresh cadence	Incremental every 5 min; full rebuild daily	Catalog grows by ~100 k new items/day

Takeaway: At 58 k QPS peak and a 10 M item catalog, serving requires a ~10 GB in-RAM item embedding index and a ~130 ms end-to-end latency budget split across candidate generation (≤ 20 ms), ranking inference (≤ 80 ms, including feature-store lookup), and serialisation + network (≤ 30 ms); full float32 user embeddings (≈ 512 GB) cannot fit in RAM, so only the top 10 M active users (~10 GB) are kept hot.

Building up to the design

V1: Popularity-based recs

def recommend(user_id, n=20):
    return top_n_items_by_play_count(n)   # same list for everyone

Every user gets the same top-20 globally popular items. Zero ML required.

This is a surprisingly strong baseline — popular content is popular for a reason, and many teams underestimate it. But it fails the moment personalisation matters: a user who watches only German documentaries gets served English action blockbusters, and engagement metrics plateau quickly.

V2: Collaborative filtering

Matrix factorisation decomposes the user-item interaction matrix R (rows = users, columns = items, values = implicit feedback like play count) into user embeddings U and item embeddings V such that R ≈ U × Vᵀ.

Training is offline (hours). At serving time: look up the user's embedding, compute dot products with all item embeddings, return the top-K. Users who liked similar things land near each other in the latent space, so a documentaries-in-German user actually gets documentaries in German.

The problem is scale. Scoring all 10 M items per request is O(N × D). At 256 dims and 58 k QPS, that's 58 k × 10 M × 256 ≈ 150 trillion FLOPs/sec. Even with BLAS-optimised matrix multiply, that's tens of racks of compute just for candidate scoring. Cold start is a separate wound: a new user has no embedding at all, and neither does a new item.

V3: Approximate nearest-neighbour (ANN) retrieval

The key insight is that you don't need the exact top-K — you need a fast, good-enough shortlist that a downstream ranker can tighten up. Build an ANN index over item embeddings (FAISS with HNSW or IVF quantisation, or Google's ScaNN). At query time, look up the user embedding; the ANN index returns the approximate top-K in O(log N) time — typically under 10 ms for 10 M items in RAM, versus the seconds a brute-force dot product would need.

Recall is typically 90–99% depending on index type and configuration: you miss a fraction of the true top-K, but those missed items are very close in score to the items you do return, and the ranking stage will reorder the shortlist anyway. The latency win is decisive.

What breaks at this point is multi-interest users. A single ANN lookup from a single user embedding misses someone who watches tech talks AND cooking AND horror. You need to retrieve from multiple interest clusters or query towers.

V4: Two-tower model + multiple retrieval sources

The two-tower model learns user and item encoders jointly, optimised for inner product similarity between positive user-item pairs versus in-batch negatives. The towers can ingest arbitrary features — not just IDs — so they handle cold start more gracefully than pure matrix factorisation.

Multiple candidate sources are merged:

Source	Mechanism	Candidates
ANN on user embedding	Two-tower embeddings, FAISS	200–300
User's watch/like history → item-item similarity	Item co-watch matrix	100
Trending / new content	Time-decayed popularity score	50
Followed creators / subscriptions	Social graph	50
Geographically local trending	Per-region popularity	30

Total retrieved: ~500 items, deduplicated.

These ~500 items still need to be ranked. Candidate generation scores (dot products) don't account for rich context: time of day, device type, the user's watch history in the last five minutes, item freshness. Those features are too expensive to compute at retrieval scale, which is exactly why the two-stage split exists.

V5: Ranking model over the shortlist

A ranking model (typically a deep neural network with embedding layers for sparse IDs, dense layers for numeric features, and optional cross-network layers for feature interactions) scores each of the ~500 candidates with a much richer feature set. It can be trained to predict multiple objectives simultaneously — click-through rate, watch-time, like probability — with a weighted combination.

At 500 candidates × 58 k QPS = 29 M inference calls/sec across the fleet, this is expensive but tractable with GPU serving (TensorFlow Serving, TorchServe) and model distillation.

V6: Re-ranking, feature store, online learning

The final pieces are the ones most candidates skip. A re-ranker applies diversity (max-marginal relevance), business rules (boost paying advertisers, suppress policy-flagged content), and exploration (surface a few non-obvious items to learn about user taste). A feature store ensures the model sees identical features at training time and serving time — without it, training/serving skew is nearly inevitable and degrades ranking quality silently. And a streaming pipeline feeds fresh interaction events back into the feature store so the model's inputs update within minutes of a user action, without requiring a full retrain.

flowchart LR
    V1["V1: Global popularity<br/>no ML"] --> V2["V2: Matrix factorisation<br/>personalised but slow"]
    V2 --> V3["V3: ANN retrieval<br/>sub-20 ms"]
    V3 --> V4["V4: Two-tower + multi-source<br/>500 candidates"]
    V4 --> V5["V5: + ranking model<br/>rich features"]
    V5 --> V6["V6: + feature store + re-ranking<br/>production-grade"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

Full architecture

flowchart TD
    subgraph Offline["Offline Training Pipeline (hours)"]
        LOGS[(Interaction Logs<br/>S3 / GCS)] --> FE[Feature Engineering<br/>Spark / Flink]
        FE --> TRAIN[Model Training<br/>PyTorch / TensorFlow]
        TRAIN --> REG[Model Registry]
        FE --> FS_BATCH[(Feature Store<br/>offline / batch)]
    end

    subgraph Online["Online Serving Path (< 150 ms)"]
        U[User Request] --> API[Recs API]
        API --> CG[Candidate Generator]
        CG --> FAISS[(ANN Index<br/>FAISS / ScaNN)]
        CG --> HEU[Heuristics<br/>Trending + Followed]
        CG --> MERGE[Merge + Dedup<br/>~500 candidates]
        MERGE --> RANKER[Ranking Service]
        FS_ONLINE[(Feature Store<br/>online — Redis)] --> RANKER
        RANKER --> MODEL[Ranking Model<br/>GPU inference]
        MODEL --> RERANK[Re-Ranker<br/>Diversity + Rules]
        RERANK --> RESP[Response]
    end

    subgraph RT["Real-Time Signal Pipeline (minutes)"]
        EVENTS[User Events<br/>click, skip, dwell] --> KAFKA[Kafka]
        KAFKA --> STREAM[Stream Processor<br/>Flink / Samza]
        STREAM --> FS_ONLINE
        STREAM --> LOGS
    end

    REG --> MODEL
    FS_BATCH --> FS_ONLINE

    style FAISS fill:#0e7490,color:#fff
    style RANKER fill:#15803d,color:#fff
    style FS_ONLINE fill:#a855f7,color:#fff
    style MODEL fill:#ffaa00,color:#0a0a0f
    style KAFKA fill:#ff2e88,color:#fff
    style CG fill:#ff6b1a,color:#0a0a0f

Component deep-dives

Candidate generation — two-tower model and ANN

The two-tower model trains a user encoder and an item encoder end-to-end:

User tower:   [user_id, age_bucket, country, recent_watch_ids, ...] → 256-dim embedding
Item tower:   [item_id, category, creator_id, duration_bucket, ...]  → 256-dim embedding

Training objective: maximise dot(user_emb, positive_item_emb) vs. in-batch negatives

The item embeddings are indexed into FAISS once trained. At serving time, the user tower runs on the current user context (~5 ms CPU or < 1 ms on GPU), and the ANN index is queried for the top-K nearest item embeddings by inner product (typically under 10 ms for HNSW on 10 M items in RAM, often 1–5 ms at moderate efSearch settings).

The diagram below shows how this works: both towers project into the same embedding space, and the retrieval step is a nearest-neighbor search there — not a query against a database of raw attributes.

flowchart LR
    UF["User features<br/>user_id, history, country"] --> UT[User Tower<br/>MLP layers]
    IF["Item features<br/>item_id, category, creator"] --> IT[Item Tower<br/>MLP layers]
    UT --> UE((User<br/>embedding<br/>256-dim))
    IT --> IE((Item<br/>embedding<br/>256-dim))
    UE -->|"inner product similarity"| ANN[(ANN Index<br/>all item embeddings)]
    ANN -->|"top-300 nearest"| CANDS[Candidate list]
    style UT fill:#ff6b1a,color:#0a0a0f
    style IT fill:#0e7490,color:#fff
    style UE fill:#ffaa00,color:#0a0a0f
    style IE fill:#ffaa00,color:#0a0a0f
    style ANN fill:#15803d,color:#fff

HNSW vs IVF+PQ: HNSW (hierarchical navigable small world graphs) delivers high recall (typically 95–99%, depending on efSearch) with low latency but high memory — it stores the full-precision graph in RAM. IVF with product quantisation (PQ) compresses embeddings 16–32×, saving significant memory, but recall is typically 5–15% lower than HNSW at matched latency, and depends heavily on nprobe and PQ code size. Adding a reranking step — fetch more candidates, recompute exact distances — recovers much of the recall loss at extra latency cost, and is preferred when item catalogs exceed available RAM. In practice, many systems use IVF+PQ for the full 10 M catalog and HNSW for a "freshness index" holding only the last 48 hours of new items.

The feature store

The feature store is the piece most teams underestimate. Its job: serve pre-computed feature values — fast, consistently — to both the training pipeline and the online serving path.

flowchart LR
    BATCH[Batch compute<br/>Spark] --> FSO[(Feature Store<br/>Offline tier<br/>e.g. Hive / BigQuery)]
    STREAM2[Stream compute<br/>Flink] --> FSN[(Feature Store<br/>Online tier<br/>e.g. Redis / DynamoDB)]
    FSO --> FSN
    FSN --> RANKER2[Ranking Service]
    FSO --> TRAIN2[Training Pipeline]
    style FSO fill:#a855f7,color:#fff
    style FSN fill:#a855f7,color:#fff
    style RANKER2 fill:#15803d,color:#fff

Training/serving skew is the most dangerous failure mode in production recommenders. It happens when the feature value computed offline for training differs from the value computed at serving time — because of logic drift, timezone handling, or aggregation window differences. The feature store mitigates this by being the single source of truth: training jobs read the same store that serving reads, using the same transformation code in both paths.

Features split into two tiers:

Tier	Examples	Staleness	Storage
Pre-computed batch	User's 30-day genre affinity, item's all-time play count	Hours	Column store (Hive, BigQuery)
Near-real-time	User's plays in last 30 min, item's plays in last 5 min	Minutes	Redis / DynamoDB
Request-time	Device type, time of day, country from IP	Milliseconds	Computed inline

Ranking model — architecture and training

The ranking model takes ~500 candidates and scores each independently. A standard architecture (similar in spirit to widely described systems at Netflix and YouTube):

Input features per (user, item) pair:
  - user_id (sparse, embedded)
  - item_id (sparse, embedded)
  - user's recent watch list (variable-length, pooled embeddings)
  - item's pre-computed features (category, duration, age, language)
  - context features (device, hour, day-of-week)
  - interaction features (user has seen this creator before? user's avg watch % of similar items)

Architecture:
  Sparse IDs → embedding table → 64-dim vectors
  Dense features → MLP
  Concatenate all → 3-4 dense layers (512 → 256 → 128 → 1 per objective)

Objectives (multi-task):
  - P(click)        weight 0.3
  - P(watch > 50%)  weight 0.5
  - P(like / save)  weight 0.2
  Final score = weighted sum

Training runs on the logged data (user, item, label) with the features as they existed at the time of the impression. This point is subtle: you must log the feature values at impression time and train on those logged values — not recompute them later — otherwise you get future data leakage.

Cold start

Cold start comes in two flavors, and they need different fixes.

New user. Zero interaction history. The most useful first step is asking for explicit preferences during onboarding — genre interests, followed creators. Without that signal, fall back to geographically and demographically popular content. Then run a multi-armed bandit (epsilon-greedy or Thompson sampling) to explore the user's taste with a small fraction of their early impressions. This is the classic exploration-exploitation problem: you need to show some uncertain items to learn what the user likes, even at the cost of short-term click-through rate.

New item. Zero interactions, so embedding-based retrieval won't surface it. Use content-based features instead: item category, creator embeddings if the creator has prior items, title/description embeddings from a text encoder. Assign the new item an initialisation embedding based on its metadata rather than a random or zero vector. Also reserve a freshness "exploration budget" — a fraction of every user's feed allocated to recently published content regardless of predicted score.

flowchart TD
    REQ[Recommendation request] --> CHECK{Has interaction<br/>history?}
    CHECK -->|"Yes"| EMB[Look up user embedding<br/>ANN retrieval]
    CHECK -->|"No — new user"| COLD[Cold start path]
    COLD --> ONBOARD{Onboarding prefs<br/>collected?}
    ONBOARD -->|"Yes"| PREF[Content-based on<br/>stated preferences]
    ONBOARD -->|"No"| POP[Geo + demographic<br/>popular content]
    PREF --> BANDIT[Epsilon-greedy bandit<br/>explore taste]
    POP --> BANDIT
    EMB --> MERGE[Merge candidates<br/>~500]
    BANDIT --> MERGE
    FRESHNESS[Freshness budget<br/>new items regardless of score] --> MERGE
    MERGE --> RANK[Ranking Service]
    style CHECK fill:#ff6b1a,color:#0a0a0f
    style EMB fill:#15803d,color:#fff
    style BANDIT fill:#0e7490,color:#fff
    style FRESHNESS fill:#a855f7,color:#fff
    style RANK fill:#ffaa00,color:#0a0a0f

Online learning and real-time signals

A full model retrain from scratch takes hours. But the user's taste right now matters. The solution is to decompose what changes.

Feature store updates happen fast, without any retraining. Kafka events update near-real-time features — what the user just watched, the item's viral momentum — within minutes. The model weights are unchanged, but its inputs are fresh. This is the most universally adopted mechanism.

Online fine-tuning goes further: the ranking model is updated with a streaming gradient update at a small learning rate. This risks training instability and is less universally adopted.

Frequent full retrains remain the main lever. Retrain the model on the last N days of data every few hours. With modern ML infrastructure (Spark for feature prep, distributed GPU training), this is achievable in well under an hour for many ranking models.

Exploration vs. exploitation — preventing filter bubbles

A pure exploit strategy — always show the highest-scoring item — creates a feedback loop. The model learns from what it shows, so it keeps showing the same narrow content. Users interact with that content, confirming the bias. Long-term user satisfaction degrades quietly.

There are several concrete mitigations. Epsilon-greedy exploration shows a randomly sampled item with probability ε (typically 2–5%), logs whether it was explored, and trains on that signal. Thompson sampling / UCB is a principled bandit approach that explores proportionally to uncertainty about an item's quality. Max-marginal relevance (MMR) re-ranking penalises items that are too similar to items already placed earlier in the list — this prevents showing 10 nearly identical videos from the same creator. And explicit business rules in re-ranking handle freshness boosts (surface content newer than 48 hours), topic diversification quotas, and policy compliance filters like age-gating and regional restrictions.

Serving path — sequence diagram

sequenceDiagram
    participant User
    participant API as Recs API
    participant CG as Candidate Generator
    participant ANN as ANN Index
    participant HEU as Heuristics
    participant FS as Feature Store
    participant RANK as Ranking Service
    participant RERANK as Re-Ranker

    User->>API: GET /feed (user_id, context)
    API->>CG: fetch candidates
    par ANN retrieval
        CG->>ANN: user_embedding → top-300
        ANN-->>CG: 300 item IDs
    and Heuristics
        CG->>HEU: trending + subscriptions
        HEU-->>CG: 200 item IDs
    end
    CG->>CG: merge + dedup → ~500 candidates
    CG-->>API: 500 item IDs

    API->>RANK: score 500 candidates
    RANK->>FS: batch fetch features for 500 items + user
    FS-->>RANK: feature vectors
    RANK->>RANK: model inference (GPU batch)
    RANK-->>API: scored list

    API->>RERANK: apply diversity + rules
    RERANK-->>API: final top-50

    API-->>User: ranked list

The ANN retrieval and heuristic retrieval run in parallel — notice the par block above. That parallelism is what lets candidate generation stay under 20 ms despite pulling from multiple sources.

Offline training pipeline

flowchart TD
    LOGS[(Interaction logs<br/>S3/GCS, Parquet)] --> JOIN[Feature join<br/>attach features logged at impression time]
    JOIN --> FILTER[Filter: remove bots,<br/>deduplicate, label smoothing]
    FILTER --> SPLIT[Train / val / test split<br/>by time, NOT random]
    SPLIT --> TRAIN[Distributed training<br/>PyTorch + GPU cluster]
    TRAIN --> EVAL[Offline evaluation<br/>AUC-ROC, NDCG, MRR]
    EVAL --> SHADOW[Shadow traffic test<br/>log but don't serve]
    SHADOW --> AB[A/B test<br/>engagement metrics]
    AB --> DEPLOY[Deploy to production]
    style TRAIN fill:#ff6b1a,color:#0a0a0f
    style AB fill:#15803d,color:#fff
    style EVAL fill:#0e7490,color:#fff

A critical subtlety: always split train/val/test by time — train on week N, validate on week N+1. Random splits leak future information and produce optimistic offline metrics that don't correlate with online A/B test results.

Offline vs. online metrics

Metric	Where	What it measures	Pitfall
AUC-ROC	Offline	Model's discrimination ability	Doesn't capture position or diversity
NDCG@K	Offline	Ranking quality in top-K	Requires a relevance judgment, not just binary label
Precision@K	Offline	Fraction of top-K that are relevant	Ignores ordering within top-K
Click-through rate (CTR)	Online A/B	Did users click?	Clickbait-optimised models win; watch time suffers
Watch time / completion	Online A/B	Did users watch?	Better proxy for satisfaction
Long-term retention	Online (weeks)	Did users come back?	The true north-star metric; slow to measure

The common trap: optimise offline AUC to a high value, deploy, and see CTR go down. This usually means the model learned to predict clicks from features that aren't actually causal — e.g., the item's popularity at training time, which doesn't generalise to new items.

Storage choices

Data	Store	Why
Item embeddings (ANN index)	In-memory FAISS / ScaNN on inference nodes	Sub-10 ms retrieval requires RAM; index sharded across nodes
User embeddings (active)	Redis cluster or in-memory service	Top-10 M active users ~10 GB; LRU eviction for the rest
Feature store (online)	Redis / DynamoDB	< 5 ms p99 lookup; key = user_id or item_id
Feature store (offline)	Hive / BigQuery / S3 Parquet	Batch reads for training; columnar for scan efficiency
Interaction logs	Object store (S3/GCS) in Parquet	Cheap, durable; Spark-readable; append-only
Model artifacts	Model registry (MLflow / internal)	Versioned; enables rollback
Experiment results	Columnar analytics DB (ClickHouse / BigQuery)	A/B test result queries over billions of rows

Failure modes

Training/serving skew

The most common cause of silent ranking degradation. Features computed differently in training versus serving produce garbage scores that look normal in logs — no errors, just worse recs. To detect it: log both the features used at serving time and the model's raw score, then compare offline model predictions on the same features against serving scores. The fix is the feature store as single source of truth, with shared transformation code run in both paths.

Cold start

New items get zero interaction signal, so embedding-based retrieval ignores them. The freshness exploration budget and content-based initialisation (described above) are the practical mitigations. Creator-based proxy embeddings also help for new uploads from known creators.

Feedback loop / popularity bias

The system shows popular items, users click them, they become more popular, the long tail starves. The fix is an exploration policy, diversity re-ranking, explicit long-tail exposure quotas, and debiasing during training via importance weighting by inverse propensity score.

Stale features

A real-time feature (item views in the last 5 minutes) is stale because the Kafka pipeline lagged. The model was trained with fresh data, but serving gets 30-minute-old features — behaviour shifts. Monitor feature freshness as a service-level metric; alert on lag exceeding a threshold.

Index staleness (new items not retrievable)

The ANN index is rebuilt every hour, so a newly uploaded item isn't retrievable for up to an hour. The standard fix: maintain a "hot index" of items uploaded in the last 2 hours that is rebuilt every 5 minutes and merged with main retrieval results.

Model regression after retrain

A new model version scores better offline but degrades engagement in A/B. The standard deployment sequence: shadow mode (run new model in parallel, log scores, don't serve), then gradual traffic ramp (1% → 10% → 50% → 100%) with automated rollback triggers on engagement metrics.

Things to discuss in an interview

Why two stages? Because you can't run an expensive ranking model over 10 M items at 58 k QPS. Quantify it — the math shuts down the "why not just rank everything" question cleanly.
What's in the feature store? Distinguish batch features (genre affinity, all-time popularity), near-real-time features (last-N-minutes activity), and request-time features (device, time). Each has a different freshness SLA and storage strategy.
Cold start specifically. Interviewers always ask. Have a concrete plan for new-user and new-item cold start before they ask.
Offline metrics vs. online metrics. Know that high AUC doesn't guarantee good A/B results, and explain why (position bias, popularity bias, metric-objective misalignment).
How do you prevent filter bubbles? Exploration policy + diversity re-ranking. Name a concrete mechanism.
Training data leakage. Time-based train/val split; logging features at impression time.

Things you should now be able to answer

Why does approximate nearest-neighbour retrieval work well enough despite missing some fraction of the exact top-K items?
What is training/serving skew, and how does a feature store prevent it?
Why do you split train/test by time rather than randomly?
How does the two-tower model handle new items that have no interaction history?
What is the exploration-exploitation trade-off in recommendations, and what are two concrete mechanisms to address it?
Why is watch-time a better training signal than CTR for long-term user satisfaction?

Frequently asked questions

▸Why do production recommendation systems use two stages instead of scoring all items directly?

Brute-force scoring all 10 million items at 58,000 QPS peak would require roughly 150 trillion FLOPs per second just for the dot-product step — infeasible at consumer cost. The two-stage split uses a cheap ANN retrieval to narrow the catalog to roughly 500 candidates in under 20 ms, then applies an expensive ranking model only to that shortlist in under 80 ms, keeping the total end-to-end latency within the 150 ms p99 budget.

▸What is training/serving skew and why is it the most dangerous failure mode in recommenders?

Training/serving skew occurs when a feature value computed offline for training differs from the value computed at serving time — due to logic drift, timezone handling, or aggregation window differences. It degrades ranking quality silently: no errors appear in logs, scores just get worse. A centralized feature store that serves as the single source of truth, with shared transformation code run in both training and serving paths, is the standard mitigation.

▸How much memory do item and user embeddings require at scale?

At 10 million items with 256-dimensional float32 embeddings, item embeddings require approximately 10 GB — or about 2.5 GB with int8 quantization, small enough to fit on one large memory node. Full float32 user embeddings for 500 million users would require roughly 512 GB, so systems keep only the top 10 million active users in RAM (about 10 GB) and recompute or load the rest from SSD on demand.

▸When should you use HNSW versus IVF with product quantization for ANN retrieval?

HNSW delivers high recall — typically 95 to 99 percent depending on efSearch — with low latency, but stores the full-precision graph in RAM. IVF with product quantization compresses embeddings 16 to 32 times and saves significant memory, but recall is typically 5 to 15 percent lower at matched latency. In practice, many systems use IVF+PQ for the full 10 million item catalog and HNSW for a freshness index covering only the last 48 hours of new items.

▸How should you handle cold start for a new item that has no interaction history?

A new item has no embedding signal, so ANN-based retrieval will not surface it. The fix is to initialize the item's embedding using content-based features — category, creator embeddings from prior uploads by the same creator, title and description embeddings from a text encoder — rather than a random or zero vector. Systems also reserve a freshness exploration budget that allocates a fraction of every user's feed to recently published content regardless of predicted score, and maintain a hot index rebuilt every 5 minutes for items uploaded in the last 2 hours.

← previous

Backpressure & Flow Control

API Gateways & the Backend-for-Frontend Pattern

// RELATED

Design a Recommendation System (Netflix / TikTok)

The problem

Functional requirements

Non-functional requirements

Capacity estimation

Building up to the design

V1: Popularity-based recs

V2: Collaborative filtering

V3: Approximate nearest-neighbour (ANN) retrieval

V4: Two-tower model + multiple retrieval sources

V5: Ranking model over the shortlist

V6: Re-ranking, feature store, online learning

Full architecture

Component deep-dives

Candidate generation — two-tower model and ANN

The feature store

Ranking model — architecture and training

Cold start

Online learning and real-time signals

Exploration vs. exploitation — preventing filter bubbles

Serving path — sequence diagram

Offline training pipeline

Offline vs. online metrics

Storage choices

Failure modes

Training/serving skew

Cold start

Feedback loop / popularity bias

Stale features

Index staleness (new items not retrievable)

Model regression after retrain

Things to discuss in an interview

Things you should now be able to answer

Further reading

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform