Design YouTube / Netflix (video streaming)
How a billion users watch ~1 billion hours of video a day. Upload pipeline, transcoding, adaptive bitrate, CDN, recommendation.
The problem
YouTube serves about a billion hours of video per day. Netflix, at its 2015 peak, accounted for roughly 37% of North American fixed-network downstream traffic during evening hours — and even as the internet has grown and more services have entered the market, video streaming consistently dominates global bandwidth. Two platforms, one engineering challenge: take a single upload and make it stream smoothly to anyone on the planet, from a fiber-connected 4K TV to a phone on a shaky 3G connection.
The core of the product is deceptively simple. A creator records a video and uploads it. A viewer presses play and it starts in under two seconds. Between those two moments, the system has to transcode the source file into half a dozen resolution tiers, split every tier into thousands of short segments, push those segments across a tiered CDN to edge nodes near each viewer, and let the player switch quality in real time as network conditions change.
What makes this hard is the scale at which all of that happens simultaneously. Creators upload 500 hours of new video every minute. Viewers generate about 12 terabytes per second of egress on average, with peak traffic several times higher. Transcoding a single hour of 4K footage can take many CPU-hours on one core — which means the pipeline must fan out across hundreds of thousands of cores just to keep up with incoming uploads without making creators wait.
The three engineering tensions that run through every design decision here are: (1) write-side throughput — transcoding is slow but uploads are relentless; (2) read-side cost — CDN egress is the single largest line item; and (3) adaptive delivery — a static bitrate stream that works on fiber will buffer endlessly on mobile, so the format and player protocol have to handle mid-stream quality switching transparently.
Functional requirements
- Upload videos.
- Watch videos at multiple resolutions, with adaptive bitrate.
- Search.
- Comment, like.
- Recommendations.
Non-functional
- 2.5B+ monthly active users.
- 500 hours of video uploaded per minute.
- p99 startup latency < 2 seconds.
- 99.99% playback availability.
Capacity
| Dimension | Estimate | How we got there |
|---|---|---|
| Upload volume (raw) | ~250 PB/year | 500 h/min × 60 min × 24 h × 365 days × ~1 GB/hr avg ≈ 263M hr × 1 GB/hr ≈ 263 PB ≈ 250 PB |
| Upload volume (after encoding) | ~1 EB/year | Raw storage × 3–5× for multiple resolutions |
| Watch egress | ~1 EB/day | 1B hours/day × ~1 GB/hour (mixed bitrates) |
| Average egress bandwidth | ~12 TB/s | 1 EB ÷ 86,400 s |
| Peak egress bandwidth | Several × 12 TB/s | Typically 3–5× above average during prime time |
Takeaway: The dominant costs are storage (exabytes) and bandwidth (the CDN bill is the single largest line item).
Building up to the design
YouTube has more moving parts than almost any other system, but every piece is a response to a specific failure mode of the simpler version. The progression below makes the full architecture feel inevitable.
V1: One file per video, served from origin
POST /videos → save raw.mp4 to S3
GET /videos/X → 302 redirect to s3.example.com/X/raw.mp4
You have a working upload-and-play site in an afternoon. Streaming raw 4K video to 1,000 viewers simultaneously means 1,000 × 50 Mbps = 50 Gbps from your origin — you'll exhaust your egress budget over a weekend. And mobile clients on 4G can't consume 50 Mbps anyway; they buffer endlessly.
V2: CDN in front
Put a CDN in front of S3. Viewers pull from the closest edge, and roughly 95% of bandwidth is absorbed there, leaving origin nearly quiet. But a viewer on a 3G phone still tries to download the same 50 Mbps stream as a fiber subscriber. They buffer forever. You need different qualities for different devices and networks.
V3: Multiple resolutions (transcoding)
On upload, transcode the original into 240p, 360p, 480p, 720p, 1080p, and 4K. The player picks one based on device and measured bandwidth. A 3G phone watches 240p at a few Mbps; a fiber TV watches 4K. The remaining problem is that the network changes mid-watch — the user enters a tunnel, steps away from WiFi. Locking quality at playback start means buffering when conditions worsen and wasted bandwidth when they improve. And the player still has to buffer the whole file to seek.
V4: Adaptive bitrate streaming (HLS/DASH)
Split each resolution into 4–6 second segments and generate a "manifest" that lists the segment URLs per quality level. The player downloads the manifest first, then fetches segments one at a time, switching quality between segments as bandwidth changes.
flowchart LR
REQ[Player] --> M[Manifest]
M --> S1["segment 1 @ 720p"]
M --> S2["segment 2 @ 720p<br/>(bandwidth dropped → 480p)"]
M --> S3["segment 3 @ 480p"]
style M fill:#15803d,color:#fff
Now the player can seek instantly (request the chunk containing the seek point), start fast (grab a small low-res chunk first, upgrade when buffered), and recover gracefully from a sudden bandwidth drop. The new bottleneck is production: transcoding 500 hours per minute of uploads into 6 resolutions of HLS chunks is a massive compute job. Done synchronously, it would make creators wait hours before their video is viewable.
V5: Async distributed transcoding
The upload goes directly to "raw" object storage via signed chunked URLs. The upload API enqueues a transcoding job, which then splits into many parallel sub-jobs — one per segment range per resolution — and workers run the encoder independently. Finished segments are written to chunk storage as they complete; the manifest is updated incrementally. Low-quality versions become viewable within minutes; 4K follows later as the encode farm catches up.
With transcoding solved, the next constraint is discovery. A catalog of 800M+ videos is useless without a way for viewers to find what they actually want to watch.
V6: Recommendations + search
Search via Elasticsearch (title, description, transcript). Recs via a two-stage ML pipeline (candidate gen + ranker) — same shape as the news feed but with watch-time as the target metric. View counts via Kafka aggregation (same as Instagram likes).
V7: Production YouTube
Everything stacked: upload → transcode → HLS chunks → CDN with edge caching (hot videos cached at every POP) → adaptive playback → recommendations → search → moderation (Content ID for copyright, NSFW classifier).
flowchart LR
V1[V1: raw mp4<br/>melts your origin] --> V2[V2: + CDN<br/>cheap delivery]
V2 --> V3[V3: + resolutions<br/>mobile-friendly]
V3 --> V4[V4: + HLS/DASH<br/>adaptive bitrate]
V4 --> V5[V5: + async transcode<br/>scalable upload]
V5 --> V6[V6: + search + recs<br/>discovery]
V6 --> V7[V7: + counts + mod<br/>production YouTube]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V7 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
subgraph Upload
U1[Creator] -->|chunked upload| UAPI[Upload API]
UAPI --> UQ[Upload Buffer / S3]
UQ --> ENC[Transcoding Pipeline]
ENC --> CHUNKS[(Chunked storage<br/>HLS/DASH segments)]
ENC --> META[(Video Metadata<br/>Postgres / Spanner)]
end
subgraph Watch
V[Viewer] --> VAPI[Video API]
VAPI --> META
VAPI --> MURL[Manifest Builder]
MURL --> CDN[CDN]
CDN --> CHUNKS
end
subgraph Discovery
V --> SAPI[Search]
V --> RAPI[Recs]
end
style ENC fill:#ff6b1a,color:#0a0a0f
style CDN fill:#15803d,color:#fff
style CHUNKS fill:#0e7490,color:#fff
The upload pipeline
sequenceDiagram
participant Creator
participant Upload API
participant Object Store
participant Encoder Q
participant Encoder Workers
participant Chunk Store
Creator->>Upload API: initiate upload (size, format)
Upload API->>Creator: signed URLs for chunks
Creator->>Object Store: PUT chunk 1
Creator->>Object Store: PUT chunk 2
Creator->>Upload API: complete
Upload API->>Encoder Q: enqueue job
Encoder Workers->>Object Store: read source
Encoder Workers->>Encoder Workers: transcode to 240p, 360p, 480p, 720p, 1080p, 4K
Encoder Workers->>Encoder Workers: segment into 4-6s chunks (VOD)
Encoder Workers->>Encoder Workers: package as HLS/DASH
Encoder Workers->>Chunk Store: write segments + manifest
Encoder Workers->>Upload API: ready
Upload API->>Creator: published
Walk through the four stages and why each one exists.
Chunked upload. The client splits the file into 1–10 MB chunks and uploads each one directly to object storage using a short-lived signed URL. The Upload API never touches the video bytes — it just issues the URLs. If one chunk fails mid-transfer, only that chunk needs to be retried. For a 10 GB file over a shaky home connection this matters enormously.
Transcoding. Every video must be encoded at multiple resolutions and bitrates. This is genuinely expensive — software H.265 encoding of a 1-hour 4K source can take many times real-time on a single CPU core (hours, not minutes), depending on preset and hardware; hardware encoders such as NVENC achieve near-real-time but with quality trade-offs. The only way through is parallelism: fan the job out across many workers, each responsible for one segment range at one resolution.
flowchart TD
SRC[(Source video)] --> JOB[Transcoding job]
JOB --> W1["Worker: segs 1-20 @ 240p"]
JOB --> W2["Worker: segs 1-20 @ 720p"]
JOB --> W3["Worker: segs 1-20 @ 1080p"]
JOB --> W4["Worker: segs 21-40 @ 240p"]
JOB --> W5["Worker: segs 21-40 @ 720p"]
JOB --> W6["Worker: segs 21-40 @ 1080p"]
W1 & W2 & W3 & W4 & W5 & W6 --> STORE[(Chunk Store)]
style JOB fill:#ff6b1a,color:#0a0a0f
style W1 fill:#0e7490,color:#fff
style W2 fill:#0e7490,color:#fff
style W3 fill:#0e7490,color:#fff
style W4 fill:#0e7490,color:#fff
style W5 fill:#0e7490,color:#fff
style W6 fill:#0e7490,color:#fff
style STORE fill:#15803d,color:#fff
Segmenting. Each resolution is split into ~2–6 second chunks for VOD (2–4 s for live). Shorter segments allow faster quality switching but increase HTTP request overhead and reduce encoding efficiency (more I-frames); a 4–6 s duration is a common VOD operating point, though 2–4 s is also widely used. With short segments, the player can seek to any point without buffering the whole file, and can switch quality between segments without a glitch.
Packaging. HLS (Apple, .m3u8) or MPEG-DASH (everyone else, .mpd) wraps the chunks with a manifest that the player reads to discover which segment URLs correspond to which quality level.
Adaptive bitrate streaming
Instead of one big video file, the player downloads small chunks at multiple bitrates and switches between them based on the current connection.
flowchart LR
P[Player] -->|fetch manifest| MAN[(.m3u8)]
MAN --> P
P -->|"chunk 1 @ 1080p<br/>fetched in 2s"| Q1080[1080p chunks]
P -->|"chunk 2: bandwidth dropped<br/>switch to 720p"| Q720[720p chunks]
P -->|"chunk 3: recovered<br/>back to 1080p"| Q1080
style Q1080 fill:#15803d,color:#fff
style Q720 fill:#ffaa00,color:#0a0a0f
This is why videos sometimes look blurry then sharp again — your bandwidth changed mid-watch and the player adapted.
Storage model
Each video has:
- Metadata (title, description, owner, duration, tags) → Postgres or Spanner.
- Segments for each (resolution × bitrate) → object storage (GCS / S3).
- Manifest files (.m3u8 / .mpd) → object storage, served via CDN.
- Thumbnails → object storage, served via CDN.
Numbers for one popular 10-minute video:
Resolutions: 240p, 360p, 480p, 720p, 1080p, 4K (6 versions)
Each version: ~10 minute @ 4-6s chunks = ~100 segments
Per segment: 0.5–10 MB depending on resolution
Total storage: maybe 1–5 GB per popular video
For YouTube's library: industry estimates put total storage in the range of tens of exabytes (exact figures are not publicly disclosed).
Where the CDN does the heavy lifting
99% of bytes are served from the CDN, not origin. Cold cache happens only when:
- A new video is published (warm the popular regions).
- A long-tail video gets surprise traffic.
YouTube and Netflix run their own CDNs (Google Edge Cache, Netflix Open Connect) — ISPs install boxes in their data centers, peering directly. This eliminates inter-ISP transit costs and slashes latency.
flowchart TD
V[Viewer in Japan] --> ISP[ISP datacenter]
ISP --> OC[Netflix Open Connect box]
OC --> V
OC -.cache miss.-> ORIGIN[Netflix origin US]
style OC fill:#ff6b1a,color:#0a0a0f
style ORIGIN fill:#0e7490,color:#fff
For 99.9% of viewer requests, the bytes never leave the ISP's network. That's why Netflix is fast even on a cheap ISP plan.
Pre-positioning and caching strategy
Netflix pre-positions content based on predicted demand. Open Connect appliances receive nightly content fills; before a major new release, the relevant titles are pre-positioned across all boxes days in advance (the specific timing is not publicly documented but the mechanism is described in Netflix engineering blog posts).
YouTube's cache is more reactive (the long tail is much wider; you can't pre-position 800M+ videos), but uses tier hierarchies:
flowchart LR
V[Viewer] --> EDGE["Edge cache<br/>(metro)"]
EDGE -->|miss| REG[Regional cache]
REG -->|miss| ORIGIN[(Origin)]
style EDGE fill:#ff6b1a,color:#0a0a0f
style REG fill:#ffaa00,color:#0a0a0f
style ORIGIN fill:#15803d,color:#fff
Hot path: starting playback
sequenceDiagram
participant V as Viewer
participant API as Video API
participant CDN
participant Player
V->>API: GET /watch?v=xyz
API-->>V: HTML + player JS + manifest URL
V->>CDN: GET manifest.m3u8
CDN-->>V: manifest (list of chunks)
V->>CDN: GET first chunk (low resolution to start fast)
CDN-->>V: chunk
V->>Player: play
Player->>CDN: prefetch next chunks
CDN-->>Player: chunks
To minimize startup latency, the player starts at a low resolution so the first chunk is small enough to download quickly. Once it has built a buffer, it upgrades to the resolution the connection can sustain. Prefetching the next 1–2 chunks during playback keeps the buffer ahead of the read head, and HTTP/2 or HTTP/3 lets the player fetch multiple chunks in parallel over the same connection.
Recommendations & search
Recommendation is its own discipline. The high-level shape:
flowchart LR
USER[User signals:<br/>watches, likes, time] --> EMB[User Embedding]
VIDEO[Video signals:<br/>title, frames, audio, engagement] --> VEMB[Video Embedding]
EMB --> ANN[Approx. Nearest Neighbor]
VEMB --> ANN
ANN --> CAND[Candidates]
CAND --> RANKER[Ranker model]
RANKER --> FEED[Recommended videos]
style ANN fill:#a855f7,color:#fff
style RANKER fill:#ff6b1a,color:#0a0a0f
Two-tower neural model + vector store (e.g. ScaNN, FAISS) for retrieval; heavy ranker on top.
Search is similar but driven by query → candidates rather than user → candidates.
DRM and content protection
For Netflix and other commercial streaming services, every chunk is encrypted; players are issued time-bound license keys via Widevine (Google), FairPlay (Apple), PlayReady (Microsoft).
For YouTube user-uploads, DRM is mostly absent (creators want their videos shared) but Content ID scans uploads against a fingerprint database to detect copyrighted material.
Scale summary
| Component | Scale | Tech | Key failure mode / trade-off |
|---|---|---|---|
| Object storage | Multiple exabytes (library) | GCS, S3, custom | Erasure-coded for durability; cost-tier cold storage for long-tail |
| Transcoding | Hundreds of thousands of cores | Borg-style scheduler | Job stragglers delay availability; partial publish (low-res first) mitigates |
| Metadata DB | Billions of rows, sharded | Spanner / Postgres; Cassandra or Redis for counters | Spanner for ACID video metadata; Cassandra or Redis for high-write counters (views, likes) |
| CDN | Hundreds of POPs, exabytes/day | Custom (Open Connect, Google Edge) | Cold-start on new releases; pre-position top predicted titles before drop |
| Recommendation | Trillions of features | TFX, Beam, vector store (ScaNN) | Staleness vs. freshness trade-off; embedding model retraining lag |
| Live streaming | (Bonus topic) | WebRTC + LL-HLS | Latency vs. stability: LL-HLS at ~3 s, WebRTC at <1 s but harder to scale |
Edge cases
Live streaming (live broadcast)
Same architecture, but the encoder runs continuously and the manifest is dynamic. Latency budget is much tighter — live audiences want < 5 second delay; ultra-low-latency (sports betting) wants < 1 second. WebRTC, low-latency HLS, or custom protocols.
Mid-video seek
The player requests the chunk containing the seek point and starts from there — no need to download earlier chunks.
Bandwidth estimation (ABR algorithms)
The player measures download speed of recent chunks; if too slow, downgrade resolution. Two broad approaches exist in practice:
- Throughput-based: estimate bandwidth from recent chunk download times, select the highest bitrate within a safety margin. Simple, but reacts after congestion has already caused a slow download. Many basic HLS players use this approach.
- Buffer-based (BBA — Netflix's published approach; BOLA — an academic algorithm available in dash.js): select bitrate based on the current buffer level rather than measured throughput. A full buffer permits higher bitrate; a draining buffer forces downgrade. Smoother and more resilient to bandwidth estimation noise.
- Hybrid (e.g., DYNAMIC — the default in dash.js): use throughput-based at startup/seek when buffer is low; switch to buffer-based (BOLA) once buffer is sufficient. This is the current production default in the DASH reference player.
Both BBA and BOLA are published in the academic literature and worth naming if the interviewer asks how ABR is implemented.
Comments / likes (write-heavy)
Comments are sharded by video_id. Writes are async — a comment appearing a second late is acceptable — and they fan out to moderation queues for spam classification. Pagination uses a cursor on created_at.
View counters are a different animal. Directly incrementing a SQL row on every play of a viral video collapses under lock contention the moment a video goes genuinely viral. The solution is to never touch the database synchronously: each API server aggregates increments in memory and flushes every few seconds into a counter store like Cassandra or Redis. Exact counts aren't needed — YouTube itself shows approximate values ("1.2M views") rather than real-time exact counts, so the seconds-to-minutes display lag is intentional.
flowchart LR
PLAY[Play events] --> AGG["In-memory aggregator<br/>(per API server)"]
AGG -->|"flush every few seconds"| CNT[(Counter store<br/>Cassandra / Redis)]
CNT --> DISP[View count display<br/>"1.2M views"]
style AGG fill:#ff6b1a,color:#0a0a0f
style CNT fill:#15803d,color:#fff
style DISP fill:#0e7490,color:#fff
This is the same pattern as Instagram likes or Twitter retweet counts — any write-thundering counter problem on a popular object.
Upload resumability and idempotency
Chunked upload via signed URLs makes each chunk independently retryable. The upload API tracks which chunks have arrived (e.g., via a manifest in Redis or the object store itself). On client reconnect, it re-fetches the chunk list and uploads only missing chunks. This is why large YouTube uploads resume after a network dropout without starting over.
The transcoding job is enqueued only after the creator signals "upload complete," and the job carries a video_id idempotency key — re-enqueuing the same video twice just deduplicates at the queue consumer.
Things to discuss in an interview
- Upload pipeline: chunked upload → transcoding → segmenting → packaging; why each step is async.
- Adaptive bitrate: the killer feature; explain HLS/DASH manifests and how the player switches quality between segments.
- CDN as the dominant cost: 99%+ of bytes; why companies build their own (Open Connect, Google Edge Cache).
- Pre-positioning vs. reactive caching: Netflix pre-positions; YouTube uses reactive tier hierarchy because 800M+ videos can't be pre-warmed.
- Storage decisions: object storage for blobs, sharded SQL/NewSQL for metadata, wide-column for counters.
- Counter thundering-herd: why direct SQL increments fail on viral videos, and how in-memory aggregation + Cassandra solve it.
- ABR algorithm choice: throughput-based vs. buffer-based (BBA/BOLA) vs. hybrid; trade-offs at startup, mid-stream, and on mobile.
- DRM: per-chunk encryption with time-bound license keys; Widevine/FairPlay/PlayReady ecosystem.
Things you should now be able to answer
- Why is the same video stored at 5+ resolutions?
- What does the player do when bandwidth changes mid-watch?
- Why do Netflix and YouTube run their own CDNs?
- Why is transcoding expensive, and how is it parallelized?
- What's the difference between live and on-demand streaming architecture?
- How do you prevent a viral video from killing your view-counter database?
- What is the difference between buffer-based (BBA/BOLA) and throughput-based ABR? When does each approach win?
- Why are chunks served over HTTP with Range requests rather than a custom streaming protocol?
Further reading
- Netflix Open Connect Overview — official technical overview of appliance deployment
- "A Buffer-Based Approach to Rate Adaptation: Evidence from a Large Video Streaming Service" (Huang et al., SIGCOMM 2014) — the BBA paper authored by Netflix engineers; available via ACM DL
- "BOLA: Near-Optimal Bitrate Adaptation for Online Videos" (Spiteri et al., INFOCOM 2016) — the academic buffer-based ABR algorithm implemented in dash.js
- How Content ID works — YouTube Help (authoritative source)
- "Scaling YouTube's thumbnail pipeline" — Google Engineering talks (Google Cloud Next)
- Sandvine Global Internet Phenomena Report — annual internet traffic breakdown (confirms Netflix and YouTube traffic share figures)
Frequently asked questions
▸Why is transcoding parallelized per segment and resolution instead of per video?
Software H.265 encoding of a 1-hour 4K source can take many times real-time on a single CPU core, meaning hours of wall-clock time. Fanning the job out so each worker handles one segment range at one resolution turns an hours-long serial job into a minutes-long parallel one, letting low-quality versions go live quickly while higher resolutions catch up.
▸What is HLS/DASH adaptive bitrate streaming and why does YouTube use it instead of a single video file?
HLS and DASH split each resolution into 4-6 second segments and generate a manifest listing the segment URLs per quality level. The player downloads the manifest, then fetches one segment at a time and switches quality between segments as bandwidth changes. This lets viewers seek instantly to any chunk, start playback at low resolution for a fast startup, and recover from sudden bandwidth drops without buffering the whole file.
▸How does Netflix's Open Connect CDN differ from a conventional CDN, and why does it matter for cost?
Netflix ships physical Open Connect appliances into ISP data centers and peers with them directly. For 99.9% of viewer requests, video bytes never cross inter-ISP transit links at all. This eliminates the transit cost that would otherwise be the largest line item, and reduces latency because the content is already inside the viewer's ISP network.
▸Why does YouTube show approximate view counts like '1.2M views' instead of exact real-time numbers?
Directly incrementing a SQL counter row on every play of a viral video causes lock contention that collapses under load. Instead, each API server aggregates play events in memory and flushes every few seconds into a counter store such as Cassandra or Redis. The seconds-to-minutes display lag is intentional, not a limitation.
▸What is the difference between buffer-based and throughput-based ABR algorithms, and when does each win?
Throughput-based ABR estimates available bandwidth from recent chunk download times and picks the highest safe bitrate, but it reacts only after congestion has already slowed a chunk download. Buffer-based ABR (BBA, Netflix's published approach, and BOLA used in dash.js) selects bitrate based on current buffer level rather than measured throughput, making it smoother and more resilient to noisy bandwidth estimates. Hybrid players like the DASH reference implementation use throughput-based at startup when the buffer is empty and switch to buffer-based once the buffer is healthy.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.