~/articles/design-youtube
◆◆◆Advancedasked at Googleasked at Netflixasked at Amazon

Design YouTube / Netflix (video streaming)

How a billion users watch ~1 billion hours of video a day. Upload pipeline, transcoding, adaptive bitrate, CDN, recommendation.

17 min read2026-02-10Ironclad Academy
// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

YouTube serves about a billion hours of video per day. Netflix, at its 2015 peak, accounted for roughly 37% of North American fixed-network downstream traffic during evening hours — and even as the internet has grown and more services have entered the market, video streaming consistently dominates global bandwidth. Two platforms, one engineering challenge: take a single upload and make it stream smoothly to anyone on the planet, from a fiber-connected 4K TV to a phone on a shaky 3G connection.

The core of the product is deceptively simple. A creator records a video and uploads it. A viewer presses play and it starts in under two seconds. Between those two moments, the system has to transcode the source file into half a dozen resolution tiers, split every tier into thousands of short segments, push those segments across a tiered CDN to edge nodes near each viewer, and let the player switch quality in real time as network conditions change.

What makes this hard is the scale at which all of that happens simultaneously. Creators upload 500 hours of new video every minute. Viewers generate about 12 terabytes per second of egress on average, with peak traffic several times higher. Transcoding a single hour of 4K footage can take many CPU-hours on one core — which means the pipeline must fan out across hundreds of thousands of cores just to keep up with incoming uploads without making creators wait.

The three engineering tensions that run through every design decision here are: (1) write-side throughput — transcoding is slow but uploads are relentless; (2) read-side cost — CDN egress is the single largest line item; and (3) adaptive delivery — a static bitrate stream that works on fiber will buffer endlessly on mobile, so the format and player protocol have to handle mid-stream quality switching transparently.

Functional requirements

  • Upload videos.
  • Watch videos at multiple resolutions, with adaptive bitrate.
  • Search.
  • Comment, like.
  • Recommendations.

Non-functional

  • 2.5B+ monthly active users.
  • 500 hours of video uploaded per minute.
  • p99 startup latency < 2 seconds.
  • 99.99% playback availability.

Capacity

DimensionEstimateHow we got there
Upload volume (raw)~250 PB/year500 h/min × 60 min × 24 h × 365 days × ~1 GB/hr avg ≈ 263M hr × 1 GB/hr ≈ 263 PB ≈ 250 PB
Upload volume (after encoding)~1 EB/yearRaw storage × 3–5× for multiple resolutions
Watch egress~1 EB/day1B hours/day × ~1 GB/hour (mixed bitrates)
Average egress bandwidth~12 TB/s1 EB ÷ 86,400 s
Peak egress bandwidthSeveral × 12 TB/sTypically 3–5× above average during prime time

Takeaway: The dominant costs are storage (exabytes) and bandwidth (the CDN bill is the single largest line item).

Building up to the design

YouTube has more moving parts than almost any other system, but every piece is a response to a specific failure mode of the simpler version. The progression below makes the full architecture feel inevitable.

V1: One file per video, served from origin

POST /videos    → save raw.mp4 to S3
GET  /videos/X302 redirect to s3.example.com/X/raw.mp4

You have a working upload-and-play site in an afternoon. Streaming raw 4K video to 1,000 viewers simultaneously means 1,000 × 50 Mbps = 50 Gbps from your origin — you'll exhaust your egress budget over a weekend. And mobile clients on 4G can't consume 50 Mbps anyway; they buffer endlessly.

V2: CDN in front

Put a CDN in front of S3. Viewers pull from the closest edge, and roughly 95% of bandwidth is absorbed there, leaving origin nearly quiet. But a viewer on a 3G phone still tries to download the same 50 Mbps stream as a fiber subscriber. They buffer forever. You need different qualities for different devices and networks.

V3: Multiple resolutions (transcoding)

On upload, transcode the original into 240p, 360p, 480p, 720p, 1080p, and 4K. The player picks one based on device and measured bandwidth. A 3G phone watches 240p at a few Mbps; a fiber TV watches 4K. The remaining problem is that the network changes mid-watch — the user enters a tunnel, steps away from WiFi. Locking quality at playback start means buffering when conditions worsen and wasted bandwidth when they improve. And the player still has to buffer the whole file to seek.

V4: Adaptive bitrate streaming (HLS/DASH)

Split each resolution into 4–6 second segments and generate a "manifest" that lists the segment URLs per quality level. The player downloads the manifest first, then fetches segments one at a time, switching quality between segments as bandwidth changes.

flowchart LR
    REQ[Player] --> M[Manifest]
    M --> S1["segment 1 @ 720p"]
    M --> S2["segment 2 @ 720p<br/>(bandwidth dropped → 480p)"]
    M --> S3["segment 3 @ 480p"]
    style M fill:#15803d,color:#fff

Now the player can seek instantly (request the chunk containing the seek point), start fast (grab a small low-res chunk first, upgrade when buffered), and recover gracefully from a sudden bandwidth drop. The new bottleneck is production: transcoding 500 hours per minute of uploads into 6 resolutions of HLS chunks is a massive compute job. Done synchronously, it would make creators wait hours before their video is viewable.

V5: Async distributed transcoding

The upload goes directly to "raw" object storage via signed chunked URLs. The upload API enqueues a transcoding job, which then splits into many parallel sub-jobs — one per segment range per resolution — and workers run the encoder independently. Finished segments are written to chunk storage as they complete; the manifest is updated incrementally. Low-quality versions become viewable within minutes; 4K follows later as the encode farm catches up.

With transcoding solved, the next constraint is discovery. A catalog of 800M+ videos is useless without a way for viewers to find what they actually want to watch.

Search via Elasticsearch (title, description, transcript). Recs via a two-stage ML pipeline (candidate gen + ranker) — same shape as the news feed but with watch-time as the target metric. View counts via Kafka aggregation (same as Instagram likes).

V7: Production YouTube

Everything stacked: upload → transcode → HLS chunks → CDN with edge caching (hot videos cached at every POP) → adaptive playback → recommendations → search → moderation (Content ID for copyright, NSFW classifier).

flowchart LR
    V1[V1: raw mp4<br/>melts your origin] --> V2[V2: + CDN<br/>cheap delivery]
    V2 --> V3[V3: + resolutions<br/>mobile-friendly]
    V3 --> V4[V4: + HLS/DASH<br/>adaptive bitrate]
    V4 --> V5[V5: + async transcode<br/>scalable upload]
    V5 --> V6[V6: + search + recs<br/>discovery]
    V6 --> V7[V7: + counts + mod<br/>production YouTube]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    subgraph Upload
    U1[Creator] -->|chunked upload| UAPI[Upload API]
    UAPI --> UQ[Upload Buffer / S3]
    UQ --> ENC[Transcoding Pipeline]
    ENC --> CHUNKS[(Chunked storage<br/>HLS/DASH segments)]
    ENC --> META[(Video Metadata<br/>Postgres / Spanner)]
    end

    subgraph Watch
    V[Viewer] --> VAPI[Video API]
    VAPI --> META
    VAPI --> MURL[Manifest Builder]
    MURL --> CDN[CDN]
    CDN --> CHUNKS
    end

    subgraph Discovery
    V --> SAPI[Search]
    V --> RAPI[Recs]
    end

    style ENC fill:#ff6b1a,color:#0a0a0f
    style CDN fill:#15803d,color:#fff
    style CHUNKS fill:#0e7490,color:#fff

The upload pipeline

sequenceDiagram
    participant Creator
    participant Upload API
    participant Object Store
    participant Encoder Q
    participant Encoder Workers
    participant Chunk Store
    Creator->>Upload API: initiate upload (size, format)
    Upload API->>Creator: signed URLs for chunks
    Creator->>Object Store: PUT chunk 1
    Creator->>Object Store: PUT chunk 2
    Creator->>Upload API: complete
    Upload API->>Encoder Q: enqueue job
    Encoder Workers->>Object Store: read source
    Encoder Workers->>Encoder Workers: transcode to 240p, 360p, 480p, 720p, 1080p, 4K
    Encoder Workers->>Encoder Workers: segment into 4-6s chunks (VOD)
    Encoder Workers->>Encoder Workers: package as HLS/DASH
    Encoder Workers->>Chunk Store: write segments + manifest
    Encoder Workers->>Upload API: ready
    Upload API->>Creator: published

Walk through the four stages and why each one exists.

Chunked upload. The client splits the file into 1–10 MB chunks and uploads each one directly to object storage using a short-lived signed URL. The Upload API never touches the video bytes — it just issues the URLs. If one chunk fails mid-transfer, only that chunk needs to be retried. For a 10 GB file over a shaky home connection this matters enormously.

Transcoding. Every video must be encoded at multiple resolutions and bitrates. This is genuinely expensive — software H.265 encoding of a 1-hour 4K source can take many times real-time on a single CPU core (hours, not minutes), depending on preset and hardware; hardware encoders such as NVENC achieve near-real-time but with quality trade-offs. The only way through is parallelism: fan the job out across many workers, each responsible for one segment range at one resolution.

flowchart TD
    SRC[(Source video)] --> JOB[Transcoding job]
    JOB --> W1["Worker: segs 1-20 @ 240p"]
    JOB --> W2["Worker: segs 1-20 @ 720p"]
    JOB --> W3["Worker: segs 1-20 @ 1080p"]
    JOB --> W4["Worker: segs 21-40 @ 240p"]
    JOB --> W5["Worker: segs 21-40 @ 720p"]
    JOB --> W6["Worker: segs 21-40 @ 1080p"]
    W1 & W2 & W3 & W4 & W5 & W6 --> STORE[(Chunk Store)]
    style JOB fill:#ff6b1a,color:#0a0a0f
    style W1 fill:#0e7490,color:#fff
    style W2 fill:#0e7490,color:#fff
    style W3 fill:#0e7490,color:#fff
    style W4 fill:#0e7490,color:#fff
    style W5 fill:#0e7490,color:#fff
    style W6 fill:#0e7490,color:#fff
    style STORE fill:#15803d,color:#fff

Segmenting. Each resolution is split into ~2–6 second chunks for VOD (2–4 s for live). Shorter segments allow faster quality switching but increase HTTP request overhead and reduce encoding efficiency (more I-frames); a 4–6 s duration is a common VOD operating point, though 2–4 s is also widely used. With short segments, the player can seek to any point without buffering the whole file, and can switch quality between segments without a glitch.

Packaging. HLS (Apple, .m3u8) or MPEG-DASH (everyone else, .mpd) wraps the chunks with a manifest that the player reads to discover which segment URLs correspond to which quality level.

Adaptive bitrate streaming

Instead of one big video file, the player downloads small chunks at multiple bitrates and switches between them based on the current connection.

flowchart LR
    P[Player] -->|fetch manifest| MAN[(.m3u8)]
    MAN --> P
    P -->|"chunk 1 @ 1080p<br/>fetched in 2s"| Q1080[1080p chunks]
    P -->|"chunk 2: bandwidth dropped<br/>switch to 720p"| Q720[720p chunks]
    P -->|"chunk 3: recovered<br/>back to 1080p"| Q1080
    style Q1080 fill:#15803d,color:#fff
    style Q720 fill:#ffaa00,color:#0a0a0f

This is why videos sometimes look blurry then sharp again — your bandwidth changed mid-watch and the player adapted.

Storage model

Each video has:

  • Metadata (title, description, owner, duration, tags) → Postgres or Spanner.
  • Segments for each (resolution × bitrate) → object storage (GCS / S3).
  • Manifest files (.m3u8 / .mpd) → object storage, served via CDN.
  • Thumbnails → object storage, served via CDN.

Numbers for one popular 10-minute video:

Resolutions: 240p, 360p, 480p, 720p, 1080p, 4K (6 versions)
Each version: ~10 minute @ 4-6s chunks = ~100 segments
Per segment: 0.510 MB depending on resolution
Total storage: maybe 15 GB per popular video

For YouTube's library: industry estimates put total storage in the range of tens of exabytes (exact figures are not publicly disclosed).

Where the CDN does the heavy lifting

99% of bytes are served from the CDN, not origin. Cold cache happens only when:

  • A new video is published (warm the popular regions).
  • A long-tail video gets surprise traffic.

YouTube and Netflix run their own CDNs (Google Edge Cache, Netflix Open Connect) — ISPs install boxes in their data centers, peering directly. This eliminates inter-ISP transit costs and slashes latency.

flowchart TD
    V[Viewer in Japan] --> ISP[ISP datacenter]
    ISP --> OC[Netflix Open Connect box]
    OC --> V
    OC -.cache miss.-> ORIGIN[Netflix origin US]
    style OC fill:#ff6b1a,color:#0a0a0f
    style ORIGIN fill:#0e7490,color:#fff

For 99.9% of viewer requests, the bytes never leave the ISP's network. That's why Netflix is fast even on a cheap ISP plan.

Pre-positioning and caching strategy

Netflix pre-positions content based on predicted demand. Open Connect appliances receive nightly content fills; before a major new release, the relevant titles are pre-positioned across all boxes days in advance (the specific timing is not publicly documented but the mechanism is described in Netflix engineering blog posts).

YouTube's cache is more reactive (the long tail is much wider; you can't pre-position 800M+ videos), but uses tier hierarchies:

flowchart LR
    V[Viewer] --> EDGE["Edge cache<br/>(metro)"]
    EDGE -->|miss| REG[Regional cache]
    REG -->|miss| ORIGIN[(Origin)]
    style EDGE fill:#ff6b1a,color:#0a0a0f
    style REG fill:#ffaa00,color:#0a0a0f
    style ORIGIN fill:#15803d,color:#fff

Hot path: starting playback

sequenceDiagram
    participant V as Viewer
    participant API as Video API
    participant CDN
    participant Player
    V->>API: GET /watch?v=xyz
    API-->>V: HTML + player JS + manifest URL
    V->>CDN: GET manifest.m3u8
    CDN-->>V: manifest (list of chunks)
    V->>CDN: GET first chunk (low resolution to start fast)
    CDN-->>V: chunk
    V->>Player: play
    Player->>CDN: prefetch next chunks
    CDN-->>Player: chunks

To minimize startup latency, the player starts at a low resolution so the first chunk is small enough to download quickly. Once it has built a buffer, it upgrades to the resolution the connection can sustain. Prefetching the next 1–2 chunks during playback keeps the buffer ahead of the read head, and HTTP/2 or HTTP/3 lets the player fetch multiple chunks in parallel over the same connection.

Recommendation is its own discipline. The high-level shape:

flowchart LR
    USER[User signals:<br/>watches, likes, time] --> EMB[User Embedding]
    VIDEO[Video signals:<br/>title, frames, audio, engagement] --> VEMB[Video Embedding]
    EMB --> ANN[Approx. Nearest Neighbor]
    VEMB --> ANN
    ANN --> CAND[Candidates]
    CAND --> RANKER[Ranker model]
    RANKER --> FEED[Recommended videos]
    style ANN fill:#a855f7,color:#fff
    style RANKER fill:#ff6b1a,color:#0a0a0f

Two-tower neural model + vector store (e.g. ScaNN, FAISS) for retrieval; heavy ranker on top.

Search is similar but driven by query → candidates rather than user → candidates.

DRM and content protection

For Netflix and other commercial streaming services, every chunk is encrypted; players are issued time-bound license keys via Widevine (Google), FairPlay (Apple), PlayReady (Microsoft).

For YouTube user-uploads, DRM is mostly absent (creators want their videos shared) but Content ID scans uploads against a fingerprint database to detect copyrighted material.

Scale summary

ComponentScaleTechKey failure mode / trade-off
Object storageMultiple exabytes (library)GCS, S3, customErasure-coded for durability; cost-tier cold storage for long-tail
TranscodingHundreds of thousands of coresBorg-style schedulerJob stragglers delay availability; partial publish (low-res first) mitigates
Metadata DBBillions of rows, shardedSpanner / Postgres; Cassandra or Redis for countersSpanner for ACID video metadata; Cassandra or Redis for high-write counters (views, likes)
CDNHundreds of POPs, exabytes/dayCustom (Open Connect, Google Edge)Cold-start on new releases; pre-position top predicted titles before drop
RecommendationTrillions of featuresTFX, Beam, vector store (ScaNN)Staleness vs. freshness trade-off; embedding model retraining lag
Live streaming(Bonus topic)WebRTC + LL-HLSLatency vs. stability: LL-HLS at ~3 s, WebRTC at <1 s but harder to scale

Edge cases

Live streaming (live broadcast)

Same architecture, but the encoder runs continuously and the manifest is dynamic. Latency budget is much tighter — live audiences want < 5 second delay; ultra-low-latency (sports betting) wants < 1 second. WebRTC, low-latency HLS, or custom protocols.

Mid-video seek

The player requests the chunk containing the seek point and starts from there — no need to download earlier chunks.

Bandwidth estimation (ABR algorithms)

The player measures download speed of recent chunks; if too slow, downgrade resolution. Two broad approaches exist in practice:

  • Throughput-based: estimate bandwidth from recent chunk download times, select the highest bitrate within a safety margin. Simple, but reacts after congestion has already caused a slow download. Many basic HLS players use this approach.
  • Buffer-based (BBA — Netflix's published approach; BOLA — an academic algorithm available in dash.js): select bitrate based on the current buffer level rather than measured throughput. A full buffer permits higher bitrate; a draining buffer forces downgrade. Smoother and more resilient to bandwidth estimation noise.
  • Hybrid (e.g., DYNAMIC — the default in dash.js): use throughput-based at startup/seek when buffer is low; switch to buffer-based (BOLA) once buffer is sufficient. This is the current production default in the DASH reference player.

Both BBA and BOLA are published in the academic literature and worth naming if the interviewer asks how ABR is implemented.

Comments / likes (write-heavy)

Comments are sharded by video_id. Writes are async — a comment appearing a second late is acceptable — and they fan out to moderation queues for spam classification. Pagination uses a cursor on created_at.

View counters are a different animal. Directly incrementing a SQL row on every play of a viral video collapses under lock contention the moment a video goes genuinely viral. The solution is to never touch the database synchronously: each API server aggregates increments in memory and flushes every few seconds into a counter store like Cassandra or Redis. Exact counts aren't needed — YouTube itself shows approximate values ("1.2M views") rather than real-time exact counts, so the seconds-to-minutes display lag is intentional.

flowchart LR
    PLAY[Play events] --> AGG["In-memory aggregator<br/>(per API server)"]
    AGG -->|"flush every few seconds"| CNT[(Counter store<br/>Cassandra / Redis)]
    CNT --> DISP[View count display<br/>"1.2M views"]
    style AGG fill:#ff6b1a,color:#0a0a0f
    style CNT fill:#15803d,color:#fff
    style DISP fill:#0e7490,color:#fff

This is the same pattern as Instagram likes or Twitter retweet counts — any write-thundering counter problem on a popular object.

Upload resumability and idempotency

Chunked upload via signed URLs makes each chunk independently retryable. The upload API tracks which chunks have arrived (e.g., via a manifest in Redis or the object store itself). On client reconnect, it re-fetches the chunk list and uploads only missing chunks. This is why large YouTube uploads resume after a network dropout without starting over.

The transcoding job is enqueued only after the creator signals "upload complete," and the job carries a video_id idempotency key — re-enqueuing the same video twice just deduplicates at the queue consumer.

Things to discuss in an interview

  • Upload pipeline: chunked upload → transcoding → segmenting → packaging; why each step is async.
  • Adaptive bitrate: the killer feature; explain HLS/DASH manifests and how the player switches quality between segments.
  • CDN as the dominant cost: 99%+ of bytes; why companies build their own (Open Connect, Google Edge Cache).
  • Pre-positioning vs. reactive caching: Netflix pre-positions; YouTube uses reactive tier hierarchy because 800M+ videos can't be pre-warmed.
  • Storage decisions: object storage for blobs, sharded SQL/NewSQL for metadata, wide-column for counters.
  • Counter thundering-herd: why direct SQL increments fail on viral videos, and how in-memory aggregation + Cassandra solve it.
  • ABR algorithm choice: throughput-based vs. buffer-based (BBA/BOLA) vs. hybrid; trade-offs at startup, mid-stream, and on mobile.
  • DRM: per-chunk encryption with time-bound license keys; Widevine/FairPlay/PlayReady ecosystem.

Things you should now be able to answer

  • Why is the same video stored at 5+ resolutions?
  • What does the player do when bandwidth changes mid-watch?
  • Why do Netflix and YouTube run their own CDNs?
  • Why is transcoding expensive, and how is it parallelized?
  • What's the difference between live and on-demand streaming architecture?
  • How do you prevent a viral video from killing your view-counter database?
  • What is the difference between buffer-based (BBA/BOLA) and throughput-based ABR? When does each approach win?
  • Why are chunks served over HTTP with Range requests rather than a custom streaming protocol?

Further reading

  • Netflix Open Connect Overview — official technical overview of appliance deployment
  • "A Buffer-Based Approach to Rate Adaptation: Evidence from a Large Video Streaming Service" (Huang et al., SIGCOMM 2014) — the BBA paper authored by Netflix engineers; available via ACM DL
  • "BOLA: Near-Optimal Bitrate Adaptation for Online Videos" (Spiteri et al., INFOCOM 2016) — the academic buffer-based ABR algorithm implemented in dash.js
  • How Content ID works — YouTube Help (authoritative source)
  • "Scaling YouTube's thumbnail pipeline" — Google Engineering talks (Google Cloud Next)
  • Sandvine Global Internet Phenomena Report — annual internet traffic breakdown (confirms Netflix and YouTube traffic share figures)
// FAQ

Frequently asked questions

Why is transcoding parallelized per segment and resolution instead of per video?

Software H.265 encoding of a 1-hour 4K source can take many times real-time on a single CPU core, meaning hours of wall-clock time. Fanning the job out so each worker handles one segment range at one resolution turns an hours-long serial job into a minutes-long parallel one, letting low-quality versions go live quickly while higher resolutions catch up.

What is HLS/DASH adaptive bitrate streaming and why does YouTube use it instead of a single video file?

HLS and DASH split each resolution into 4-6 second segments and generate a manifest listing the segment URLs per quality level. The player downloads the manifest, then fetches one segment at a time and switches quality between segments as bandwidth changes. This lets viewers seek instantly to any chunk, start playback at low resolution for a fast startup, and recover from sudden bandwidth drops without buffering the whole file.

How does Netflix's Open Connect CDN differ from a conventional CDN, and why does it matter for cost?

Netflix ships physical Open Connect appliances into ISP data centers and peers with them directly. For 99.9% of viewer requests, video bytes never cross inter-ISP transit links at all. This eliminates the transit cost that would otherwise be the largest line item, and reduces latency because the content is already inside the viewer's ISP network.

Why does YouTube show approximate view counts like '1.2M views' instead of exact real-time numbers?

Directly incrementing a SQL counter row on every play of a viral video causes lock contention that collapses under load. Instead, each API server aggregates play events in memory and flushes every few seconds into a counter store such as Cassandra or Redis. The seconds-to-minutes display lag is intentional, not a limitation.

What is the difference between buffer-based and throughput-based ABR algorithms, and when does each win?

Throughput-based ABR estimates available bandwidth from recent chunk download times and picks the highest safe bitrate, but it reacts only after congestion has already slowed a chunk download. Buffer-based ABR (BBA, Netflix's published approach, and BOLA used in dash.js) selects bitrate based on current buffer level rather than measured throughput, making it smoother and more resilient to noisy bandwidth estimates. Hybrid players like the DASH reference implementation use throughput-based at startup when the buffer is empty and switch to buffer-based once the buffer is healthy.

// RELATED

You may also like