Design TikTok / Reels (short-video platform)
A vertical-swipe video feed that feels infinite and clairvoyant. Two-tower retrieval, real-time ranking with Monolith, sub-200ms playback via aggressive preloading, and the For You Page that knows you in 90 seconds.
The problem
TikTok is a recommendation engine with a video player bolted on. A billion users open the app, swipe up, and expect the next video — perfectly matched to their taste, already playing — within 200 milliseconds. ByteDance built TikTok around this loop. Meta copied it with Reels. YouTube copied it with Shorts. The vertical-swipe feed is now the dominant video UX on mobile because of how well that loop works when the system behind it is tuned right.
The deceptive part is how many hard problems are stacked beneath that one-swipe UX. At base you have a video CDN problem: 34 million new uploads a day, encoded into multiple bitrates, distributed to edge nodes worldwide so that anyone, anywhere gets sub-second starts. That's essentially YouTube at shorter average duration but much higher upload rate. On top of that you have a real-time recommendation problem: the system must score ~100 million candidate videos against each user's preferences and return the right 10 in under 100ms. And on top of that you have a cold-start problem: a brand-new user with zero history should get a personalized feed within 90 seconds — roughly 10 to 20 swipes.
The core engineering tension is between latency and freshness. To be fast, you'd cache everything: pre-rank video lists, pre-compute user preferences, serve stale embeddings. To be fresh, you'd recompute from scratch on every swipe and retrain models constantly. TikTok's answer is a two-stage retrieval and ranking pipeline backed by Monolith — a parameter server that continuously retrains on a live Kafka stream of user actions and syncs updated embeddings to the serving layer every minute or so. Dense model weights refresh more slowly, but embeddings — which are what actually react to new users and trending videos — stay nearly real-time.
The UX trick that hides all the remaining latency is aggressive preloading: while you watch video N, the client silently downloads the first few seconds of videos N+1 and N+2. Swipe latency drops from "video has to load" (~600ms) to "video already playing" (~50ms). If you can walk through the YouTube-scale video pipeline and then describe how a continuously retrained two-tower model + preloading is bolted onto it, you've covered this question.
Functional requirements
- Upload a short video (up to 10 minutes recorded in-app, up to 60 minutes when uploaded from camera roll), with effects, music, captions.
- Watch a vertically-swiped feed of recommended videos ("For You Page").
- Browse follower-graph feed.
- Like, comment, share, save, follow.
- Live streaming (stretch).
- Direct messages (stretch — see chat system article).
We'll focus on the For You Page and the video pipeline, which is what makes TikTok TikTok.
Non-functional requirements
- 1B+ DAU.
- p99 video start time < 200ms on any swipe.
- Recommendations adapt to a new user within ~10–20 videos (~90 seconds).
- Globally consistent moderation (no NSFW slips into FYP).
- 99.99% playback availability.
- Sub-second upload acknowledgment ("posted" feels instant); processing within 1 minute.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| DAU | 1B | Baseline assumption |
| Watch time | 95B minutes/day | 1B × 95 min/day |
| FYP requests | 6.3B calls/day | 95B min ÷ 15 min per session → ~1 page-load per 15 min of session |
| FYP QPS (avg) | ~73k req/sec | 6.3B ÷ 86,400 |
| FYP QPS (peak) | ~220–250k req/sec | ~3× avg |
| Upload rate | ~34M videos/day ≈ 400/sec | Estimated from creator/DAU ratio |
| Avg raw video size | 30 MB | Typical short-video recording |
| Avg encoded size | ~150 MB | ~5× across multi-bitrate renditions |
| Storage — new per day | ~5 PB/day | 34M × 150 MB |
| Storage — hot (CDN, 30 days) | ~150 PB | 5 PB/day × 30 days |
| Storage — cold (1 year) | ~1.8 EB | 5 PB/day × 365 days |
| Concurrent viewers (avg) | ~66M | 1B DAU × 95 min/day ÷ 1440 min/day |
| Egress at peak (no CDN) | ~66 Tbps | 66M streams × 1 Mbps |
| Egress at peak (with CDN) | ~3.3 Tbps origin | ~66 Tbps × 5% miss rate; ~95% absorbed at edge |
Takeaway: The defining number isn't storage or bandwidth — it's ~73k FYP requests/sec average (peak ~220–250k), each requiring an ML ranker call against a candidate set of ~1000 videos. That's where the system spends most of its budget.
Building up to the design
The full TikTok architecture is intimidating, but each piece appears as the answer to a specific failure of the simpler version. Walk forward.
V1: Reverse-chronological friend feed
SELECT v.* FROM videos v
JOIN follows f ON f.followee = v.author
WHERE f.follower = $me
ORDER BY v.created_at DESC LIMIT 50;
This is Instagram circa 2010. It works for users who have friends on the platform. The problem is that new users have no friends and see an empty feed; existing users who follow only a handful of people run out of new content quickly. Engagement caps at whatever your social graph produces.
V2: Global trending feed
Show the top N videos by global engagement — same for everyone, regardless of who they follow. New users see something interesting right away. But it's a popularity contest: niche communities (woodworking tutorials, K-pop dance covers) get crowded out by the global median. Everyone sees identical content, so there's no reason to stay.
V3: Personalized via collaborative filtering
"Users who watched X also watched Y." Build a user-video co-engagement matrix; recommend videos that similar users liked. You get a real personalized feed, which is a big step forward. The catch is that brand-new videos with zero engagement get no exposure — they never appear in anyone's "users like you also watched" list, so they die before anyone discovers them. Hourly batch jobs also can't keep up with TikTok-speed trends that form and die in minutes.
V4: Two-tower retrieval — embeddings
Train a model that produces a vector embedding for each user (based on watch history, demographics, interactions) and each video (based on content, audio, hashtags, early engagement). Recommend videos whose embeddings are closest to the user's.
flowchart LR
USER[User features] --> UT[User Tower]
VID[Video features] --> VT[Video Tower]
UT --> UE[User embedding<br/>dense vector]
VT --> VE[Video embedding<br/>dense vector]
UE --> ANN[ANN search<br/>nearest videos]
VE --> ANN
style UT fill:#15803d,color:#fff
style VT fill:#15803d,color:#fff
style ANN fill:#ff6b1a,color:#0a0a0f
You embed once, then query millions of video vectors in milliseconds using approximate nearest-neighbor search (FAISS, ScaNN). This is fast candidate generation that scales — the computation at serving time is just the user vector lookup plus an ANN query, not a full matrix factorization. But retrieval finds similar videos; it doesn't optimize for engagement. A user who watched 50 cooking videos gets recommended cooking video #51, but the one she'd rewatch and share might be a dance video the model hasn't figured out she'd love. You need a second stage.
V5: Two-stage rank — candidates + heavy ranker
- Stage 1 — Retrieval: two-tower ANN narrows ~100M videos → ~1000 candidates per user (5–10ms).
- Stage 2 — Ranking: a heavy deep model scores each candidate on multiple objectives (watch_time, completion_rate, like_prob, share_prob, follow_prob) and combines into a final score. ~100ms for 1000 items.
This is essentially the news feed architecture but with sharper objectives — TikTok cares about watch_time and replay_rate, not "did you click." Feed quality improves dramatically and the "FYP knows me" effect appears. But models trained on yesterday's data can't react to today's trends. A dance going viral right now — the model trained four hours ago doesn't know it exists.
V6: Real-time training (Monolith — ByteDance's actual system)
Stream user actions (views, watches, completions, likes, shares, skips) into Kafka. Online-training workers consume the stream and update the embedding tables (user-side and item-side) on the order of seconds to minutes. Dense ranker layers refresh on a slower cadence (day-level, per the Monolith paper), but the embeddings — which are what react to new content and new users — are effectively real-time.
The ranker that recommends to you at 3:01pm has user/video embeddings informed by what happened at 2:58pm. Even if the model shape is 12 hours old, the embedding slots it reads from are fresh.
V7: Aggressive preloading + edge caching
When the user is watching video N, the next 2 candidates (N+1 and N+2) are already downloading their first few seconds in the background. The CDN pre-positions hot videos at every regional POP. Multi-layer cache: device → ISP edge → POP → origin.
flowchart LR
USER[User watching video 5] --> PLAY[Player]
PLAY --> N6[Preload video 6<br/>first 3s]
PLAY --> N7[Preload video 7<br/>manifest]
style PLAY fill:#ff6b1a,color:#0a0a0f
With aggressive preloading, the perceived swipe latency drops from "video has to load" (~600ms+) to "video is already playing" (~50ms — just a UI transition). This is the single most important UX decision in the platform.
V8: Production TikTok
V5 + V6 + V7 + the upload pipeline (chunked upload → distributed transcoding → multiple bitrates → CDN distribution), moderation (multimodal classifier on every upload, plus human review), engagement counters (Kafka → batched Cassandra), follower-graph services, search, and live streaming.
flowchart LR
V1[V1: chrono friend feed<br/>new users empty] --> V2[V2: global trending<br/>not personal]
V2 --> V3[V3: collab filtering<br/>cold start, batch]
V3 --> V4[V4: two-tower retrieval<br/>fast candidates]
V4 --> V5[V5: + heavy ranker<br/>FYP quality]
V5 --> V6[V6: + online training<br/>minute-level trends]
V6 --> V7[V7: + preloading + CDN<br/>zero-latency swipe]
V7 --> V8[V8: + upload + mod + counters<br/>production]
style V1 fill:#0e7490,color:#fff
style V4 fill:#15803d,color:#fff
style V6 fill:#ff6b1a,color:#0a0a0f
style V8 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
subgraph Upload
CR[Creator] -->|chunked upload| UAPI[Upload API]
UAPI --> RAW[(Raw S3)]
RAW --> TRANS[Transcoding Cluster<br/>GPU workers]
TRANS --> SEG[(HLS segments<br/>multiple bitrates)]
TRANS --> MOD[Moderation ML]
MOD --> METADB[(Video Metadata)]
TRANS --> EMBED[Content Embedder]
EMBED --> VVEC[(Video Vector Store<br/>FAISS / ScaNN)]
end
subgraph Feed
U[User] --> FAPI[Feed API]
FAPI --> RET[Retrieval<br/>two-tower ANN]
RET --> VVEC
RET --> CAND[~1000 candidates]
CAND --> RANK[Ranker<br/>heavy ML]
RANK --> FS[(Feature Store)]
RANK --> TOP[Top 10 videos]
TOP --> MANIFEST[Manifest Service]
MANIFEST --> CDN
end
subgraph Realtime
U -.events.-> KAFKA[Kafka: views, likes, watches]
KAFKA --> TRAIN[Online Training<br/>Monolith]
TRAIN --> RANK
TRAIN --> RET
KAFKA --> COUNT[Counters<br/>Cassandra]
end
SEG --> CDN
CDN --> PLAYER[Player on device]
style TRANS fill:#ff6b1a,color:#0a0a0f
style RANK fill:#15803d,color:#fff
style TRAIN fill:#a855f7,color:#fff
style CDN fill:#0e7490,color:#fff
Three loosely-coupled subsystems:
- Upload pipeline — turns a creator's video into globally-distributed playable chunks plus an embedding.
- Feed serving — retrieval + ranking + manifest assembly, ~200ms per FYP request.
- Realtime training — every user action flows through Kafka into model updates.
The upload pipeline
sequenceDiagram
participant App
participant Upload API
participant Raw S3
participant Queue
participant Transcoder
participant Moderator
participant CDN
participant Embedder
App->>Upload API: initiate (size, fmt, fingerprint)
Upload API-->>App: signed chunk URLs
App->>Raw S3: PUT chunks (parallel)
App->>Upload API: complete
Upload API->>Queue: enqueue jobs
Queue->>Transcoder: assigns to GPU worker
par
Transcoder->>Transcoder: encode 360p, 540p, 720p, 1080p (HLS chunks)
Transcoder->>Moderator: video frames + audio + caption
Transcoder->>Embedder: extract content embedding
end
Transcoder->>CDN: push segments to regional POPs
Moderator-->>Upload API: ok / quarantine
App->>App: shows "posted" — usable within ~60s
The upload is chunked so it survives flaky cellular connections — if the upload dies halfway, the app resumes from the last committed chunk. Once all chunks land in Raw S3, a job is enqueued and a GPU transcoder picks it up. Transcoding, moderation, and content embedding all run in parallel: there's no reason to wait for moderation before starting the encode, or to wait for the encode before extracting the content embedding. The three fan-out immediately.
Multiple bitrates (360p–1080p) feed adaptive bitrate streaming — HLS or DASH lets the player pick the right quality for the current network. Segments land on CDN POPs in the creator's region, then propagate to others. The creator sees "posted" as soon as upload completes; global playability comes within ~60s.
Until moderation clears a video, it's visible to its uploader but not to FYP — what ByteDance calls "limited reach" state. This means a violating video can never go viral because it can't enter the recommendation pipeline.
Two-tower retrieval
The model maps users and videos into the same vector space. Two separate neural nets:
User tower input features:
- Recent watch history (last 100 video IDs → averaged embeddings)
- Completion patterns by category
- Time of day, day of week, locale
- Device, network type
- Long-term affinity vectors
Video tower input features:
- Content embedding (from video frames + audio)
- Author identity
- Hashtags, music track, sounds used
- Early engagement (first 1000 views' watch / share rates)
- Recency
Both towers output a fixed-dimension vector (128-d is a common industry choice; ByteDance has not publicly confirmed the exact dimension used in TikTok production). The "score" is the dot product or cosine similarity.
At serving time, all video vectors live in a FAISS/ScaNN index. Each user request:
- Compute the user vector (server-side, cached).
- Approximate nearest-neighbor search returns top 1000 video vectors.
- Pass to the ranker.
This is typically 5–10ms for the ANN search over tens of millions of indexed videos. The training signal is "did the user watch this video to completion" — making the towers learn what engages a given user, not just what's textually similar.
The heavy ranker
The 1000 candidates pass through a multi-objective deep model. Inputs per (user, video) pair:
- Embedding similarity (from retrieval stage)
- Per-objective predictions:
P(complete), P(like), P(share), P(comment), P(follow_author), E(watch_time)
- Contextual: position in feed (1st item vs 5th), recent skips
The final score is a weighted combination, often something like:
score = w1·E(watch_time) + w2·P(complete) + w3·P(share) + w4·P(like) - w5·P(skip)
Weights are tuned via online experiments (A/B testing thousands of variants simultaneously). The objective ByteDance is famous for optimizing is watch time, which is harder to game than likes and correlates better with deep engagement.
The ranker has to score all 1000 candidates in ~100ms total. It gets there through a combination of pre-computed user features cached in Redis (refreshed every few seconds), pre-computed video features in a feature store (Bigtable-like), the heavy model running on GPUs/TPUs in batched mode, and deduplication — items the user has already seen in recent sessions are filtered out before they ever reach the ranker.
Monolith — real-time training
The big architectural insight: don't wait for nightly retraining. Stream every user action into the model continuously.
flowchart LR
A[User watches/likes/skips] --> K[Kafka]
K --> TW[Training Workers<br/>parameter server]
TW --> EMB[(Embedding Tables)]
TW --> WEIGHTS[(Model Weights)]
EMB --> SERVE[Serving Model]
WEIGHTS --> SERVE
SERVE --> RANK[Ranker / Retrieval]
style TW fill:#ff6b1a,color:#0a0a0f
The key design points, from ByteDance's 2022 RecSys paper:
- Parameter server architecture: embeddings stored across distributed workers; gradients accumulated per minibatch.
- Collision-free hash table for IDs: each user_id and video_id maps to a unique embedding slot — no hash collisions that would conflate two users.
- Expirable embeddings: rarely-seen IDs evict to save memory; frequency filtering keeps the working set small.
- Online + batch hybrid: streaming updates flow into embedding tables continuously; dense ranker layers refresh on a longer cycle (day-level, per the Monolith paper) for stability.
- Embedding sync to serving: serving instances pull updated embedding shards on a minute-level cadence — this is where the "real-time" feel comes from. Dense weights refresh more slowly, but the embeddings are what react to new IDs and new behaviors.
It's worth understanding the failure modes here. If the Kafka consumer group falls behind during an event spike, the training parameter server processes a stale mini-batch — but Monolith tolerates this deliberately: the sync to serving still happens on schedule even if training is a few minutes behind. Serving is never blocked waiting for training. If a serving shard restarts, it bootstraps from the last checkpoint (periodically snapshotted to durable storage) and re-applies missed embedding updates from Kafka, which is why Kafka retention of at least several hours matters. And when frequency filtering evicts a rare video ID to save memory, that video starts with a zero embedding — which is why the first 100 views of a new video are so critical for it to escape zero-vector territory.
Counters and engagement
Likes, shares, views — each is a write hot-path. The pattern (same as Instagram):
Event → Kafka → Aggregator batches per (video, counter_type) every few seconds → writes to sharded Cassandra counter. The read path puts Redis in front; for the absolute hottest viral videos (10M likes/hour against one counter row), the counter itself is sharded across 100 rows and summed on read.
flowchart LR
LIKE[Like tap] --> KAFKA[Kafka]
KAFKA --> AGG[Aggregator]
AGG -->|batched every 5s| CASS[(Cassandra)]
READ[View video] --> RED[(Redis)]
RED -.miss.-> CASS
style AGG fill:#ff6b1a,color:#0a0a0f
Preloading: the perceived-zero-latency trick
When the user watches video N, the player downloads the first few seconds of videos N+1 and N+2 in the background.
Active playback: video[5] (1080p, streaming chunks 3-6)
Background: video[6] (360p first 3 seconds, manifest)
Background: video[7] (manifest only, lightweight)
A swipe almost always lands on N+1. Even when it doesn't, the cost was a few hundred KB of cellular data. With aggressive preloading, the perceived swipe latency drops from "video has to load" (~600ms+) to "video is already playing" (~50ms — just a UI transition).
This is what makes TikTok feel different from YouTube. YouTube serves video on-demand; TikTok pre-positions the next video before you know you want it.
The interviewer will ask about the downside: preloading wastes bandwidth when the user exits the app or swipes to an unexpected video. TikTok mitigates this by preloading at low bitrate (360p first-3s costs roughly 150–200 KB, depending on codec — H.265/AV1 compression brings it closer to the lower bound) rather than the full video, and canceling outstanding fetches on exit. The upside easily outweighs the occasional wasted segment.
CDN strategy
Three cache tiers plus origin:
| Tier | TTL | Hit rate target |
|---|---|---|
| On-device cache | 1 hour | ~20% (resaves) |
| ISP / edge POP | 24 hours | ~80% |
| Regional CDN | 7+ days | ~95% of remaining |
| Origin | — | < 1% of total reads |
For viral videos, the regional CDN pre-positions chunks based on prediction signals (rising engagement curve in adjacent regions). A video taking off in Seoul will be pre-cached in Tokyo edges before Tokyo users see it surface in their FYP.
Video delivery stack
The path from a video segment stored at origin to the player on a 4G phone is worth diagramming — it's where the 95%+ CDN hit rate comes from and why origin bandwidth stays manageable despite the scale.
flowchart LR
DEV[Device player] -->|HLS playlist request| EPOP[ISP/Edge POP]
EPOP -->|cache hit ~80%| DEV
EPOP -->|miss| RCDN[Regional CDN POP]
RCDN -->|cache hit ~95% of misses| EPOP
RCDN -->|miss| ORIG[Origin S3 + servers]
ORIG --> RCDN
style EPOP fill:#0e7490,color:#fff
style RCDN fill:#0e7490,color:#fff
style ORIG fill:#15803d,color:#fff
style DEV fill:#ffaa00,color:#0a0a0f
The player requests an HLS playlist, which lists the individual segment URLs. Each segment is a 2–10 second chunk of video at a specific bitrate. The player adapts bitrate mid-stream based on download throughput — if a user goes from WiFi to 3G, the next segment request picks the lower-bitrate rendition automatically.
Storage choices
| Data | Tech | Why |
|---|---|---|
| Video segments | S3-class blob + CDN | Cheap, durable, edge-served |
| Metadata | Cassandra (sharded by video_id) | High write rate, simple access |
| User profiles | Sharded MySQL / Manhattan | Small per-user state |
| Follower graph | Cassandra / TAO-like | Read-heavy, denormalized |
| Counters | Sharded Cassandra counter | Write-heavy hot rows |
| Video vectors | FAISS / ScaNN | Vector ANN |
| Feature store | Bigtable-like | (entity, feature) → value, low-latency reads |
| Event stream | Kafka | Buffer + replay for training |
| Embeddings (model) | Parameter server | Trillion-parameter scale |
Cold-start and the "discover your taste in 90s" trick
A new user has no watch history, so the retrieval stage has nothing personal to query with. Two things happen. First, the system bootstraps the user vector from demographics, locale, and device — a generic prior that at least knows your country and what language you speak. Second, the first ~10 videos deliberately include a mix across different categories (cooking, humor, sports, beauty, education) rather than committing to one niche. Skips are extremely informative — a hard skip within 2 seconds says "not this" more clearly than a like does. After roughly 10–20 swipes (a qualitative estimate — ByteDance has not published a precise threshold), the model accumulates enough signal to lock onto a niche and the FYP starts feeling personal.
The "TikTok read me in 90 seconds" experience is this signal acquisition working as designed.
Moderation
Every upload is scored by a multimodal classifier across visual content (NSFW, violence, gore, brand logos), audio (copyrighted music via fingerprinting, hate speech), and text (caption + on-screen OCR), plus the uploader's behavioral history. High-confidence violations are blocked immediately. Medium-confidence items go into limited-reach state while human reviewers look at them. Low-confidence items are cleared.
flowchart TD
VID[Uploaded video] --> ML[Multimodal ML classifier<br/>visual + audio + text]
ML -->|"high confidence violation"| BLOCK[Blocked — not distributed]
ML -->|"medium confidence"| LIM[Limited reach<br/>human review queue]
ML -->|"low confidence / clear"| CLEAR[Full distribution / FYP eligible]
LIM -->|"reviewer: violation"| BLOCK
LIM -->|"reviewer: clear"| CLEAR
CLEAR --> FYP[Enters recommendation pipeline]
style BLOCK fill:#ff2e88,color:#fff
style LIM fill:#ffaa00,color:#0a0a0f
style CLEAR fill:#15803d,color:#fff
style ML fill:#a855f7,color:#fff
A separate trending-content review tier inspects the top N videos before they reach massive distribution — a viral hateful video is orders of magnitude more damaging than a small one, so the review queue is priority-weighted by the video's engagement trajectory.
Performance budgets
| Step | Budget |
|---|---|
| FYP API request | < 200ms total |
| user feature fetch | 10ms |
| retrieval (ANN) | 10ms |
| ranker forward pass | 100ms |
| manifest assembly | 10ms |
| network (client ↔ origin) | 70ms |
| Swipe → playback start | < 200ms (mostly via preload) |
| Upload → playable | < 60s (target), < 120s p99 |
The 200ms budget is a serial worst-case. In practice, user feature fetch can be pipelined with the ANN query — the user vector is computed first, then feature lookup and ANN search run in parallel, shaving 5–10ms. The network budget is the dominant term for users far from an edge node, which is why regional CDN POPs and anycast routing matter so much.
Edge cases
Viral video at 100M views/hour
Counter contention: the same Cassandra row gets hammered. Shard the counter across 100 rows and sum on read. On the CDN side, pre-warm all POPs as soon as the engagement curve indicates virality — rising sharply in one region is a reliable leading indicator for adjacent regions.
A user "trains" the FYP into a niche they don't actually want
Say someone watched five videos about a topic out of morbid curiosity, and now every swipe shows more of it. The explicit "Not interested" signal resets the local signal; the system also bumps the exploration rate temporarily when it detects rapid skips or a session that ends abruptly — both are signals that the feed has gone wrong.
A creator games engagement (buys views)
Fraud detection in the event stream: device fingerprints, watch patterns, and engagement curve anomalies (a real video's view count ramps gradually; a bought-views campaign often spikes unnaturally at uniform intervals). Suspicious views are downweighted before they reach the ranker, so artificially boosted videos don't propagate to FYP.
Live streaming
A separate subsystem: the streamer's video goes to a low-latency CDN (LL-HLS or WebRTC), viewers watch with ~3s delay. Engagement events flow through the same pipeline; the recommender treats live streams similarly to short videos for FYP placement.
Right to be forgotten / video takedown
Soft-delete in metadata. CDN invalidates within minutes. Vector index removes the embedding on next refresh. The original blob is purged async; backups follow regulatory retention.
Things to discuss in an interview
- Why a recommender, not a chronological feed — engagement economics.
- Two-stage ranking — retrieval (fast, broad) + ranking (slow, precise).
- Real-time training — minute-level model updates, Monolith-style parameter server.
- Preloading — the trick that makes UX feel zero-latency.
- Upload pipeline — chunked, GPU-transcoded, embedded in parallel.
- Moderation — in the critical path of distribution, not as an afterthought.
Things you should now be able to answer
- What's the difference between candidate generation and ranking, and why are both needed?
- How does Monolith (or any online-trained recommender) react to a video going viral in minutes?
- Why is preloading the next 2 videos critical for the swipe UX?
- What's the trade-off in optimizing for watch_time vs. likes as the ranker objective?
- How does the system handle a brand-new user with no history?
- What happens to a video that's pulled for moderation 1 hour after it went viral?
Further reading
- "Monolith: Real Time Recommendation System With Collisionless Embedding Table" — ByteDance, RecSys 2022 (the paper on TikTok's actual recommender)
- "Deep Neural Networks for YouTube Recommendations" (Covington et al., 2016) — the canonical two-stage paper
- Meta's "Two-Tower Retrieval at Scale" engineering posts
- "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Google, 2019)
- The YouTube design article for the video-pipeline foundations
- The news feed article for the candidate + rank pattern
Frequently asked questions
▸What is Monolith and why does TikTok use it instead of nightly batch retraining?
Monolith is ByteDance's parameter server for online recommendation training, described in their 2022 RecSys paper. It streams every user action through Kafka and continuously updates embedding tables, syncing them to the serving layer on a minute-level cadence. This means the ranker serving you at 3:01pm has embeddings informed by events at 2:58pm, so a trending video can surface in feeds within minutes rather than the next day.
▸What is the two-stage retrieval and ranking pipeline in TikTok's For You Page?
Stage one is two-tower ANN retrieval: separate neural nets embed users and videos into the same vector space, and an approximate nearest-neighbor search over a FAISS or ScaNN index narrows roughly 100 million videos to about 1000 candidates in 5 to 10ms. Stage two is a heavy multi-objective deep ranker that scores each candidate on watch time, completion rate, like probability, share probability, and follow probability, returning the top results in about 100ms total.
▸Why does TikTok optimize for watch time rather than likes as the primary ranker objective?
Watch time correlates better with deep engagement and is harder to game than likes. The final ranker score weights expected watch time most heavily in its combination of per-objective predictions.
▸How does TikTok achieve sub-200ms swipe-to-play latency given the ML pipeline and network overhead?
Aggressive preloading is the core mechanism. While the user watches video N, the player silently downloads the first few seconds of videos N+1 and N+2 in the background at low bitrate (roughly 150 to 200 KB per segment). A swipe then lands on a video already playing, dropping perceived latency from roughly 600ms to about 50ms. The CDN absorbs over 95% of bytes at the edge, and the three-tier cache hierarchy keeps origin traffic below 1% of total reads.
▸How does TikTok handle cold-start for a brand-new user with no watch history?
The system bootstraps an initial user vector from demographics, locale, and device as a generic prior. The first 10 to 20 videos deliberately span multiple categories rather than committing to one niche, and hard skips within 2 seconds are treated as strong negative signals. After roughly 90 seconds of swiping, enough signal has accumulated for the model to lock onto a niche and deliver a genuinely personalized feed.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.