~/articles/design-tiktok-short-video
◆◆◆Advancedasked at ByteDanceasked at Metaasked at YouTubeasked at Snap

Design TikTok / Reels (short-video platform)

A vertical-swipe video feed that feels infinite and clairvoyant. Two-tower retrieval, real-time ranking with Monolith, sub-200ms playback via aggressive preloading, and the For You Page that knows you in 90 seconds.

22 min read2026-03-08Ironclad Academy
// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

TikTok is a recommendation engine with a video player bolted on. A billion users open the app, swipe up, and expect the next video — perfectly matched to their taste, already playing — within 200 milliseconds. ByteDance built TikTok around this loop. Meta copied it with Reels. YouTube copied it with Shorts. The vertical-swipe feed is now the dominant video UX on mobile because of how well that loop works when the system behind it is tuned right.

The deceptive part is how many hard problems are stacked beneath that one-swipe UX. At base you have a video CDN problem: 34 million new uploads a day, encoded into multiple bitrates, distributed to edge nodes worldwide so that anyone, anywhere gets sub-second starts. That's essentially YouTube at shorter average duration but much higher upload rate. On top of that you have a real-time recommendation problem: the system must score ~100 million candidate videos against each user's preferences and return the right 10 in under 100ms. And on top of that you have a cold-start problem: a brand-new user with zero history should get a personalized feed within 90 seconds — roughly 10 to 20 swipes.

The core engineering tension is between latency and freshness. To be fast, you'd cache everything: pre-rank video lists, pre-compute user preferences, serve stale embeddings. To be fresh, you'd recompute from scratch on every swipe and retrain models constantly. TikTok's answer is a two-stage retrieval and ranking pipeline backed by Monolith — a parameter server that continuously retrains on a live Kafka stream of user actions and syncs updated embeddings to the serving layer every minute or so. Dense model weights refresh more slowly, but embeddings — which are what actually react to new users and trending videos — stay nearly real-time.

The UX trick that hides all the remaining latency is aggressive preloading: while you watch video N, the client silently downloads the first few seconds of videos N+1 and N+2. Swipe latency drops from "video has to load" (~600ms) to "video already playing" (~50ms). If you can walk through the YouTube-scale video pipeline and then describe how a continuously retrained two-tower model + preloading is bolted onto it, you've covered this question.

Functional requirements

  • Upload a short video (up to 10 minutes recorded in-app, up to 60 minutes when uploaded from camera roll), with effects, music, captions.
  • Watch a vertically-swiped feed of recommended videos ("For You Page").
  • Browse follower-graph feed.
  • Like, comment, share, save, follow.
  • Live streaming (stretch).
  • Direct messages (stretch — see chat system article).

We'll focus on the For You Page and the video pipeline, which is what makes TikTok TikTok.

Non-functional requirements

  • 1B+ DAU.
  • p99 video start time < 200ms on any swipe.
  • Recommendations adapt to a new user within ~10–20 videos (~90 seconds).
  • Globally consistent moderation (no NSFW slips into FYP).
  • 99.99% playback availability.
  • Sub-second upload acknowledgment ("posted" feels instant); processing within 1 minute.

Capacity estimation

DimensionEstimateHow we got there
DAU1BBaseline assumption
Watch time95B minutes/day1B × 95 min/day
FYP requests6.3B calls/day95B min ÷ 15 min per session → ~1 page-load per 15 min of session
FYP QPS (avg)~73k req/sec6.3B ÷ 86,400
FYP QPS (peak)~220–250k req/sec~3× avg
Upload rate~34M videos/day ≈ 400/secEstimated from creator/DAU ratio
Avg raw video size30 MBTypical short-video recording
Avg encoded size~150 MB~5× across multi-bitrate renditions
Storage — new per day~5 PB/day34M × 150 MB
Storage — hot (CDN, 30 days)~150 PB5 PB/day × 30 days
Storage — cold (1 year)~1.8 EB5 PB/day × 365 days
Concurrent viewers (avg)~66M1B DAU × 95 min/day ÷ 1440 min/day
Egress at peak (no CDN)~66 Tbps66M streams × 1 Mbps
Egress at peak (with CDN)~3.3 Tbps origin~66 Tbps × 5% miss rate; ~95% absorbed at edge

Takeaway: The defining number isn't storage or bandwidth — it's ~73k FYP requests/sec average (peak ~220–250k), each requiring an ML ranker call against a candidate set of ~1000 videos. That's where the system spends most of its budget.

Building up to the design

The full TikTok architecture is intimidating, but each piece appears as the answer to a specific failure of the simpler version. Walk forward.

V1: Reverse-chronological friend feed

SELECT v.* FROM videos v
JOIN follows f ON f.followee = v.author
WHERE f.follower = $me
ORDER BY v.created_at DESC LIMIT 50;

This is Instagram circa 2010. It works for users who have friends on the platform. The problem is that new users have no friends and see an empty feed; existing users who follow only a handful of people run out of new content quickly. Engagement caps at whatever your social graph produces.

Show the top N videos by global engagement — same for everyone, regardless of who they follow. New users see something interesting right away. But it's a popularity contest: niche communities (woodworking tutorials, K-pop dance covers) get crowded out by the global median. Everyone sees identical content, so there's no reason to stay.

V3: Personalized via collaborative filtering

"Users who watched X also watched Y." Build a user-video co-engagement matrix; recommend videos that similar users liked. You get a real personalized feed, which is a big step forward. The catch is that brand-new videos with zero engagement get no exposure — they never appear in anyone's "users like you also watched" list, so they die before anyone discovers them. Hourly batch jobs also can't keep up with TikTok-speed trends that form and die in minutes.

V4: Two-tower retrieval — embeddings

Train a model that produces a vector embedding for each user (based on watch history, demographics, interactions) and each video (based on content, audio, hashtags, early engagement). Recommend videos whose embeddings are closest to the user's.

flowchart LR
    USER[User features] --> UT[User Tower]
    VID[Video features] --> VT[Video Tower]
    UT --> UE[User embedding<br/>dense vector]
    VT --> VE[Video embedding<br/>dense vector]
    UE --> ANN[ANN search<br/>nearest videos]
    VE --> ANN
    style UT fill:#15803d,color:#fff
    style VT fill:#15803d,color:#fff
    style ANN fill:#ff6b1a,color:#0a0a0f

You embed once, then query millions of video vectors in milliseconds using approximate nearest-neighbor search (FAISS, ScaNN). This is fast candidate generation that scales — the computation at serving time is just the user vector lookup plus an ANN query, not a full matrix factorization. But retrieval finds similar videos; it doesn't optimize for engagement. A user who watched 50 cooking videos gets recommended cooking video #51, but the one she'd rewatch and share might be a dance video the model hasn't figured out she'd love. You need a second stage.

V5: Two-stage rank — candidates + heavy ranker

  • Stage 1 — Retrieval: two-tower ANN narrows ~100M videos → ~1000 candidates per user (5–10ms).
  • Stage 2 — Ranking: a heavy deep model scores each candidate on multiple objectives (watch_time, completion_rate, like_prob, share_prob, follow_prob) and combines into a final score. ~100ms for 1000 items.

This is essentially the news feed architecture but with sharper objectives — TikTok cares about watch_time and replay_rate, not "did you click." Feed quality improves dramatically and the "FYP knows me" effect appears. But models trained on yesterday's data can't react to today's trends. A dance going viral right now — the model trained four hours ago doesn't know it exists.

V6: Real-time training (Monolith — ByteDance's actual system)

Stream user actions (views, watches, completions, likes, shares, skips) into Kafka. Online-training workers consume the stream and update the embedding tables (user-side and item-side) on the order of seconds to minutes. Dense ranker layers refresh on a slower cadence (day-level, per the Monolith paper), but the embeddings — which are what react to new content and new users — are effectively real-time.

The ranker that recommends to you at 3:01pm has user/video embeddings informed by what happened at 2:58pm. Even if the model shape is 12 hours old, the embedding slots it reads from are fresh.

V7: Aggressive preloading + edge caching

When the user is watching video N, the next 2 candidates (N+1 and N+2) are already downloading their first few seconds in the background. The CDN pre-positions hot videos at every regional POP. Multi-layer cache: device → ISP edge → POP → origin.

flowchart LR
    USER[User watching video 5] --> PLAY[Player]
    PLAY --> N6[Preload video 6<br/>first 3s]
    PLAY --> N7[Preload video 7<br/>manifest]
    style PLAY fill:#ff6b1a,color:#0a0a0f

With aggressive preloading, the perceived swipe latency drops from "video has to load" (~600ms+) to "video is already playing" (~50ms — just a UI transition). This is the single most important UX decision in the platform.

V8: Production TikTok

V5 + V6 + V7 + the upload pipeline (chunked upload → distributed transcoding → multiple bitrates → CDN distribution), moderation (multimodal classifier on every upload, plus human review), engagement counters (Kafka → batched Cassandra), follower-graph services, search, and live streaming.

flowchart LR
    V1[V1: chrono friend feed<br/>new users empty] --> V2[V2: global trending<br/>not personal]
    V2 --> V3[V3: collab filtering<br/>cold start, batch]
    V3 --> V4[V4: two-tower retrieval<br/>fast candidates]
    V4 --> V5[V5: + heavy ranker<br/>FYP quality]
    V5 --> V6[V6: + online training<br/>minute-level trends]
    V6 --> V7[V7: + preloading + CDN<br/>zero-latency swipe]
    V7 --> V8[V8: + upload + mod + counters<br/>production]
    style V1 fill:#0e7490,color:#fff
    style V4 fill:#15803d,color:#fff
    style V6 fill:#ff6b1a,color:#0a0a0f
    style V8 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    subgraph Upload
    CR[Creator] -->|chunked upload| UAPI[Upload API]
    UAPI --> RAW[(Raw S3)]
    RAW --> TRANS[Transcoding Cluster<br/>GPU workers]
    TRANS --> SEG[(HLS segments<br/>multiple bitrates)]
    TRANS --> MOD[Moderation ML]
    MOD --> METADB[(Video Metadata)]
    TRANS --> EMBED[Content Embedder]
    EMBED --> VVEC[(Video Vector Store<br/>FAISS / ScaNN)]
    end

    subgraph Feed
    U[User] --> FAPI[Feed API]
    FAPI --> RET[Retrieval<br/>two-tower ANN]
    RET --> VVEC
    RET --> CAND[~1000 candidates]
    CAND --> RANK[Ranker<br/>heavy ML]
    RANK --> FS[(Feature Store)]
    RANK --> TOP[Top 10 videos]
    TOP --> MANIFEST[Manifest Service]
    MANIFEST --> CDN
    end

    subgraph Realtime
    U -.events.-> KAFKA[Kafka: views, likes, watches]
    KAFKA --> TRAIN[Online Training<br/>Monolith]
    TRAIN --> RANK
    TRAIN --> RET
    KAFKA --> COUNT[Counters<br/>Cassandra]
    end

    SEG --> CDN
    CDN --> PLAYER[Player on device]

    style TRANS fill:#ff6b1a,color:#0a0a0f
    style RANK fill:#15803d,color:#fff
    style TRAIN fill:#a855f7,color:#fff
    style CDN fill:#0e7490,color:#fff

Three loosely-coupled subsystems:

  1. Upload pipeline — turns a creator's video into globally-distributed playable chunks plus an embedding.
  2. Feed serving — retrieval + ranking + manifest assembly, ~200ms per FYP request.
  3. Realtime training — every user action flows through Kafka into model updates.

The upload pipeline

sequenceDiagram
    participant App
    participant Upload API
    participant Raw S3
    participant Queue
    participant Transcoder
    participant Moderator
    participant CDN
    participant Embedder
    App->>Upload API: initiate (size, fmt, fingerprint)
    Upload API-->>App: signed chunk URLs
    App->>Raw S3: PUT chunks (parallel)
    App->>Upload API: complete
    Upload API->>Queue: enqueue jobs
    Queue->>Transcoder: assigns to GPU worker
    par
        Transcoder->>Transcoder: encode 360p, 540p, 720p, 1080p (HLS chunks)
        Transcoder->>Moderator: video frames + audio + caption
        Transcoder->>Embedder: extract content embedding
    end
    Transcoder->>CDN: push segments to regional POPs
    Moderator-->>Upload API: ok / quarantine
    App->>App: shows "posted" — usable within ~60s

The upload is chunked so it survives flaky cellular connections — if the upload dies halfway, the app resumes from the last committed chunk. Once all chunks land in Raw S3, a job is enqueued and a GPU transcoder picks it up. Transcoding, moderation, and content embedding all run in parallel: there's no reason to wait for moderation before starting the encode, or to wait for the encode before extracting the content embedding. The three fan-out immediately.

Multiple bitrates (360p–1080p) feed adaptive bitrate streaming — HLS or DASH lets the player pick the right quality for the current network. Segments land on CDN POPs in the creator's region, then propagate to others. The creator sees "posted" as soon as upload completes; global playability comes within ~60s.

Until moderation clears a video, it's visible to its uploader but not to FYP — what ByteDance calls "limited reach" state. This means a violating video can never go viral because it can't enter the recommendation pipeline.

Two-tower retrieval

The model maps users and videos into the same vector space. Two separate neural nets:

User tower input features:
  - Recent watch history (last 100 video IDs → averaged embeddings)
  - Completion patterns by category
  - Time of day, day of week, locale
  - Device, network type
  - Long-term affinity vectors

Video tower input features:
  - Content embedding (from video frames + audio)
  - Author identity
  - Hashtags, music track, sounds used
  - Early engagement (first 1000 views' watch / share rates)
  - Recency

Both towers output a fixed-dimension vector (128-d is a common industry choice; ByteDance has not publicly confirmed the exact dimension used in TikTok production). The "score" is the dot product or cosine similarity.

At serving time, all video vectors live in a FAISS/ScaNN index. Each user request:

  1. Compute the user vector (server-side, cached).
  2. Approximate nearest-neighbor search returns top 1000 video vectors.
  3. Pass to the ranker.

This is typically 5–10ms for the ANN search over tens of millions of indexed videos. The training signal is "did the user watch this video to completion" — making the towers learn what engages a given user, not just what's textually similar.

The heavy ranker

The 1000 candidates pass through a multi-objective deep model. Inputs per (user, video) pair:

- Embedding similarity (from retrieval stage)
- Per-objective predictions:
    P(complete), P(like), P(share), P(comment), P(follow_author), E(watch_time)
- Contextual: position in feed (1st item vs 5th), recent skips

The final score is a weighted combination, often something like:

score = w1·E(watch_time) + w2·P(complete) + w3·P(share) + w4·P(like) - w5·P(skip)

Weights are tuned via online experiments (A/B testing thousands of variants simultaneously). The objective ByteDance is famous for optimizing is watch time, which is harder to game than likes and correlates better with deep engagement.

The ranker has to score all 1000 candidates in ~100ms total. It gets there through a combination of pre-computed user features cached in Redis (refreshed every few seconds), pre-computed video features in a feature store (Bigtable-like), the heavy model running on GPUs/TPUs in batched mode, and deduplication — items the user has already seen in recent sessions are filtered out before they ever reach the ranker.

Monolith — real-time training

The big architectural insight: don't wait for nightly retraining. Stream every user action into the model continuously.

flowchart LR
    A[User watches/likes/skips] --> K[Kafka]
    K --> TW[Training Workers<br/>parameter server]
    TW --> EMB[(Embedding Tables)]
    TW --> WEIGHTS[(Model Weights)]
    EMB --> SERVE[Serving Model]
    WEIGHTS --> SERVE
    SERVE --> RANK[Ranker / Retrieval]
    style TW fill:#ff6b1a,color:#0a0a0f

The key design points, from ByteDance's 2022 RecSys paper:

  • Parameter server architecture: embeddings stored across distributed workers; gradients accumulated per minibatch.
  • Collision-free hash table for IDs: each user_id and video_id maps to a unique embedding slot — no hash collisions that would conflate two users.
  • Expirable embeddings: rarely-seen IDs evict to save memory; frequency filtering keeps the working set small.
  • Online + batch hybrid: streaming updates flow into embedding tables continuously; dense ranker layers refresh on a longer cycle (day-level, per the Monolith paper) for stability.
  • Embedding sync to serving: serving instances pull updated embedding shards on a minute-level cadence — this is where the "real-time" feel comes from. Dense weights refresh more slowly, but the embeddings are what react to new IDs and new behaviors.

It's worth understanding the failure modes here. If the Kafka consumer group falls behind during an event spike, the training parameter server processes a stale mini-batch — but Monolith tolerates this deliberately: the sync to serving still happens on schedule even if training is a few minutes behind. Serving is never blocked waiting for training. If a serving shard restarts, it bootstraps from the last checkpoint (periodically snapshotted to durable storage) and re-applies missed embedding updates from Kafka, which is why Kafka retention of at least several hours matters. And when frequency filtering evicts a rare video ID to save memory, that video starts with a zero embedding — which is why the first 100 views of a new video are so critical for it to escape zero-vector territory.

Counters and engagement

Likes, shares, views — each is a write hot-path. The pattern (same as Instagram):

Event → Kafka → Aggregator batches per (video, counter_type) every few seconds → writes to sharded Cassandra counter. The read path puts Redis in front; for the absolute hottest viral videos (10M likes/hour against one counter row), the counter itself is sharded across 100 rows and summed on read.

flowchart LR
    LIKE[Like tap] --> KAFKA[Kafka]
    KAFKA --> AGG[Aggregator]
    AGG -->|batched every 5s| CASS[(Cassandra)]
    READ[View video] --> RED[(Redis)]
    RED -.miss.-> CASS
    style AGG fill:#ff6b1a,color:#0a0a0f

Preloading: the perceived-zero-latency trick

When the user watches video N, the player downloads the first few seconds of videos N+1 and N+2 in the background.

Active playback: video[5]   (1080p, streaming chunks 3-6)
Background:      video[6]   (360p first 3 seconds, manifest)
Background:      video[7]   (manifest only, lightweight)

A swipe almost always lands on N+1. Even when it doesn't, the cost was a few hundred KB of cellular data. With aggressive preloading, the perceived swipe latency drops from "video has to load" (~600ms+) to "video is already playing" (~50ms — just a UI transition).

This is what makes TikTok feel different from YouTube. YouTube serves video on-demand; TikTok pre-positions the next video before you know you want it.

The interviewer will ask about the downside: preloading wastes bandwidth when the user exits the app or swipes to an unexpected video. TikTok mitigates this by preloading at low bitrate (360p first-3s costs roughly 150–200 KB, depending on codec — H.265/AV1 compression brings it closer to the lower bound) rather than the full video, and canceling outstanding fetches on exit. The upside easily outweighs the occasional wasted segment.

CDN strategy

Three cache tiers plus origin:

TierTTLHit rate target
On-device cache1 hour~20% (resaves)
ISP / edge POP24 hours~80%
Regional CDN7+ days~95% of remaining
Origin< 1% of total reads

For viral videos, the regional CDN pre-positions chunks based on prediction signals (rising engagement curve in adjacent regions). A video taking off in Seoul will be pre-cached in Tokyo edges before Tokyo users see it surface in their FYP.

Video delivery stack

The path from a video segment stored at origin to the player on a 4G phone is worth diagramming — it's where the 95%+ CDN hit rate comes from and why origin bandwidth stays manageable despite the scale.

flowchart LR
    DEV[Device player] -->|HLS playlist request| EPOP[ISP/Edge POP]
    EPOP -->|cache hit ~80%| DEV
    EPOP -->|miss| RCDN[Regional CDN POP]
    RCDN -->|cache hit ~95% of misses| EPOP
    RCDN -->|miss| ORIG[Origin S3 + servers]
    ORIG --> RCDN
    style EPOP fill:#0e7490,color:#fff
    style RCDN fill:#0e7490,color:#fff
    style ORIG fill:#15803d,color:#fff
    style DEV fill:#ffaa00,color:#0a0a0f

The player requests an HLS playlist, which lists the individual segment URLs. Each segment is a 2–10 second chunk of video at a specific bitrate. The player adapts bitrate mid-stream based on download throughput — if a user goes from WiFi to 3G, the next segment request picks the lower-bitrate rendition automatically.

Storage choices

DataTechWhy
Video segmentsS3-class blob + CDNCheap, durable, edge-served
MetadataCassandra (sharded by video_id)High write rate, simple access
User profilesSharded MySQL / ManhattanSmall per-user state
Follower graphCassandra / TAO-likeRead-heavy, denormalized
CountersSharded Cassandra counterWrite-heavy hot rows
Video vectorsFAISS / ScaNNVector ANN
Feature storeBigtable-like(entity, feature) → value, low-latency reads
Event streamKafkaBuffer + replay for training
Embeddings (model)Parameter serverTrillion-parameter scale

Cold-start and the "discover your taste in 90s" trick

A new user has no watch history, so the retrieval stage has nothing personal to query with. Two things happen. First, the system bootstraps the user vector from demographics, locale, and device — a generic prior that at least knows your country and what language you speak. Second, the first ~10 videos deliberately include a mix across different categories (cooking, humor, sports, beauty, education) rather than committing to one niche. Skips are extremely informative — a hard skip within 2 seconds says "not this" more clearly than a like does. After roughly 10–20 swipes (a qualitative estimate — ByteDance has not published a precise threshold), the model accumulates enough signal to lock onto a niche and the FYP starts feeling personal.

The "TikTok read me in 90 seconds" experience is this signal acquisition working as designed.

Moderation

Every upload is scored by a multimodal classifier across visual content (NSFW, violence, gore, brand logos), audio (copyrighted music via fingerprinting, hate speech), and text (caption + on-screen OCR), plus the uploader's behavioral history. High-confidence violations are blocked immediately. Medium-confidence items go into limited-reach state while human reviewers look at them. Low-confidence items are cleared.

flowchart TD
    VID[Uploaded video] --> ML[Multimodal ML classifier<br/>visual + audio + text]
    ML -->|"high confidence violation"| BLOCK[Blocked — not distributed]
    ML -->|"medium confidence"| LIM[Limited reach<br/>human review queue]
    ML -->|"low confidence / clear"| CLEAR[Full distribution / FYP eligible]
    LIM -->|"reviewer: violation"| BLOCK
    LIM -->|"reviewer: clear"| CLEAR
    CLEAR --> FYP[Enters recommendation pipeline]
    style BLOCK fill:#ff2e88,color:#fff
    style LIM fill:#ffaa00,color:#0a0a0f
    style CLEAR fill:#15803d,color:#fff
    style ML fill:#a855f7,color:#fff

A separate trending-content review tier inspects the top N videos before they reach massive distribution — a viral hateful video is orders of magnitude more damaging than a small one, so the review queue is priority-weighted by the video's engagement trajectory.

Performance budgets

StepBudget
FYP API request< 200ms total
user feature fetch10ms
retrieval (ANN)10ms
ranker forward pass100ms
manifest assembly10ms
network (client ↔ origin)70ms
Swipe → playback start< 200ms (mostly via preload)
Upload → playable< 60s (target), < 120s p99

The 200ms budget is a serial worst-case. In practice, user feature fetch can be pipelined with the ANN query — the user vector is computed first, then feature lookup and ANN search run in parallel, shaving 5–10ms. The network budget is the dominant term for users far from an edge node, which is why regional CDN POPs and anycast routing matter so much.

Edge cases

Viral video at 100M views/hour

Counter contention: the same Cassandra row gets hammered. Shard the counter across 100 rows and sum on read. On the CDN side, pre-warm all POPs as soon as the engagement curve indicates virality — rising sharply in one region is a reliable leading indicator for adjacent regions.

A user "trains" the FYP into a niche they don't actually want

Say someone watched five videos about a topic out of morbid curiosity, and now every swipe shows more of it. The explicit "Not interested" signal resets the local signal; the system also bumps the exploration rate temporarily when it detects rapid skips or a session that ends abruptly — both are signals that the feed has gone wrong.

A creator games engagement (buys views)

Fraud detection in the event stream: device fingerprints, watch patterns, and engagement curve anomalies (a real video's view count ramps gradually; a bought-views campaign often spikes unnaturally at uniform intervals). Suspicious views are downweighted before they reach the ranker, so artificially boosted videos don't propagate to FYP.

Live streaming

A separate subsystem: the streamer's video goes to a low-latency CDN (LL-HLS or WebRTC), viewers watch with ~3s delay. Engagement events flow through the same pipeline; the recommender treats live streams similarly to short videos for FYP placement.

Right to be forgotten / video takedown

Soft-delete in metadata. CDN invalidates within minutes. Vector index removes the embedding on next refresh. The original blob is purged async; backups follow regulatory retention.

Things to discuss in an interview

  • Why a recommender, not a chronological feed — engagement economics.
  • Two-stage ranking — retrieval (fast, broad) + ranking (slow, precise).
  • Real-time training — minute-level model updates, Monolith-style parameter server.
  • Preloading — the trick that makes UX feel zero-latency.
  • Upload pipeline — chunked, GPU-transcoded, embedded in parallel.
  • Moderation — in the critical path of distribution, not as an afterthought.

Things you should now be able to answer

  • What's the difference between candidate generation and ranking, and why are both needed?
  • How does Monolith (or any online-trained recommender) react to a video going viral in minutes?
  • Why is preloading the next 2 videos critical for the swipe UX?
  • What's the trade-off in optimizing for watch_time vs. likes as the ranker objective?
  • How does the system handle a brand-new user with no history?
  • What happens to a video that's pulled for moderation 1 hour after it went viral?

Further reading

  • "Monolith: Real Time Recommendation System With Collisionless Embedding Table" — ByteDance, RecSys 2022 (the paper on TikTok's actual recommender)
  • "Deep Neural Networks for YouTube Recommendations" (Covington et al., 2016) — the canonical two-stage paper
  • Meta's "Two-Tower Retrieval at Scale" engineering posts
  • "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Google, 2019)
  • The YouTube design article for the video-pipeline foundations
  • The news feed article for the candidate + rank pattern
// FAQ

Frequently asked questions

What is Monolith and why does TikTok use it instead of nightly batch retraining?

Monolith is ByteDance's parameter server for online recommendation training, described in their 2022 RecSys paper. It streams every user action through Kafka and continuously updates embedding tables, syncing them to the serving layer on a minute-level cadence. This means the ranker serving you at 3:01pm has embeddings informed by events at 2:58pm, so a trending video can surface in feeds within minutes rather than the next day.

What is the two-stage retrieval and ranking pipeline in TikTok's For You Page?

Stage one is two-tower ANN retrieval: separate neural nets embed users and videos into the same vector space, and an approximate nearest-neighbor search over a FAISS or ScaNN index narrows roughly 100 million videos to about 1000 candidates in 5 to 10ms. Stage two is a heavy multi-objective deep ranker that scores each candidate on watch time, completion rate, like probability, share probability, and follow probability, returning the top results in about 100ms total.

Why does TikTok optimize for watch time rather than likes as the primary ranker objective?

Watch time correlates better with deep engagement and is harder to game than likes. The final ranker score weights expected watch time most heavily in its combination of per-objective predictions.

How does TikTok achieve sub-200ms swipe-to-play latency given the ML pipeline and network overhead?

Aggressive preloading is the core mechanism. While the user watches video N, the player silently downloads the first few seconds of videos N+1 and N+2 in the background at low bitrate (roughly 150 to 200 KB per segment). A swipe then lands on a video already playing, dropping perceived latency from roughly 600ms to about 50ms. The CDN absorbs over 95% of bytes at the edge, and the three-tier cache hierarchy keeps origin traffic below 1% of total reads.

How does TikTok handle cold-start for a brand-new user with no watch history?

The system bootstraps an initial user vector from demographics, locale, and device as a generic prior. The first 10 to 20 videos deliberately span multiple categories rather than committing to one niche, and hard skips within 2 seconds are treated as strong negative signals. After roughly 90 seconds of swiping, enough signal has accumulated for the model to lock onto a niche and deliver a genuinely personalized feed.

// RELATED

You may also like