
◆◆◆ Advanced · asked at Twitch · asked at YouTube · asked at Meta

Design a Live Streaming System (Twitch)

Q: What is an ABR ladder in live streaming?

An ABR (Adaptive Bitrate) ladder is a set of parallel renditions of the same stream at different resolutions and bitrates. A typical Twitch-style ladder has five levels: 1080p60 at 6 Mbps, 720p30 at 3 Mbps, 480p at 1.5 Mbps, 360p at 800 Kbps, and 160p at 250 Kbps. The player measures download speed against playback speed each segment and switches renditions automatically.

Q: Where does the 10-30 seconds of glass-to-glass latency in standard HLS come from?

It is the sum of several sequential delays: local broadcast encode adds 0.5-1s, network ingest adds 0.1-0.5s, transcoding adds 0.5-1s, waiting for a full segment boundary adds 2-6s, CDN propagation adds 0.5-1s, and player buffering of at least two segments adds 4-12s. The segment boundary wait and player buffering account for the majority of the delay and are protocol constraints, not server capacity limits.

Q: When should you use LL-HLS instead of standard HLS?

Use LL-HLS when your product requires 2-5 seconds of glass-to-glass latency rather than the 10-30s of standard HLS. LL-HLS publishes partial segments before the full segment is complete and uses blocking playlist requests to avoid polling, but it generates roughly 3-8 times more CDN HTTP requests per viewer, which raises CDN cost and edge load proportionally. If sub-second interactivity is required instead, LL-HLS is insufficient and WebRTC is the only option.

Q: Why can WebRTC not replace CDN-based HLS delivery for millions of concurrent viewers?

WebRTC for broadcast fan-out requires a media relay server (SFU or MCU) that maintains a persistent UDP connection to every viewer simultaneously, because WebRTC traffic is not HTTP and cannot be cached. There is no CDN absorption of load, so relay capacity must scale linearly with viewer count. At 5 million concurrent viewers that capacity planning is fundamentally different and far more expensive than the CDN model where a 99.9% cache hit rate collapses origin load to only thousands of requests per second.

Q: How does DVR rewind work without additional infrastructure beyond object storage?

Because HLS segments are written to S3 as immutable files during the broadcast, DVR is purely a manifest-pointer problem. A viewer seeking 30 minutes back receives an m3u8 manifest pointing to the segments from 30 minutes ago; those segments already exist in S3 and the CDN caches them on first access. For a 4-hour DVR window with 2-second segments, that is 7,200 segments per rendition per channel stored in S3, with the real cost being storage — which reaches petabyte scale (~2.5 PB) at 100k concurrent channels.

Ingest one broadcaster and fan out to millions of viewers with seconds of latency. Transcoding ladders, HLS/DASH segmenting, CDN fan-out, and live chat.

22 min read · 2026-04-17 · Ironclad Academy

#interview #media #streaming #scale

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Twitch, YouTube Live, and Kick all solve the same core problem: one person pushes a video feed from a bedroom PC, and anywhere from a handful to half a million people watch it simultaneously, with a delay of only a few seconds. That gap between a moment happening in front of the broadcaster's camera and the same moment appearing on a viewer's screen (glass-to-glass latency) is the central engineering challenge. Everything else in the design flows from trying to close it without bankrupting the company.

Live streaming is neither on-demand video nor video conferencing, and that distinction matters enormously for the architecture. On-demand video (YouTube, Netflix) can transcode files hours before anyone watches: you pay once, serve forever. Video conferencing (Zoom, Google Meet) typically has a small number of participants and a peer-to-peer or small-relay model. Live streaming has one producer and a potentially unbounded audience that arrives all at once, demanding a fresh-encoded stream in real time. There is no pre-computation window; every segment must be encoded and published in the time it takes to fill that segment, typically two to six seconds.

The two tensions that make this interesting at interview scale are transcoding cost vs. latency and fan-out at CDN scale. Shorter segments reduce viewer latency, but smaller segments mean more HTTP requests per viewer per minute, which drives CDN cost and edge load proportionally higher. Scaling the viewer side is comparatively straightforward once you accept CDN as a primitive — a 99.9% cache hit rate means origin sees only thousands of requests even when millions of viewers are watching the same stream simultaneously. Scaling the encoder side is harder: 100k concurrent live channels each producing five renditions in real time requires roughly 10,000–20,000 GPU instances running continuously, and a single GPU crash mid-stream is visible as a stall.

The design is a four-stage pipeline: ingest over RTMP, GPU transcode into an ABR ladder, package into HLS segments written to object storage, and CDN delivery to viewers. This article walks each stage, then covers latency modes, chat, DVR, and failure handling.

Functional requirements

POST /broadcast/start — broadcaster begins a live stream.
Broadcaster pushes locally encoded video over RTMP/SRT to an ingest endpoint.
Viewers open a stream URL; a video player fetches HLS/DASH manifests and segments.
Multiple ABR renditions (e.g. 1080p, 720p, 480p, 360p, 160p) — player picks automatically.
Chat: bidirectional real-time text messages per stream.
DVR: rewind up to 4 hours on a live stream.
Recording: when a broadcast ends, archive as on-demand VOD.

Non-functional requirements

Global availability: viewers in any region see ≤5s additional latency beyond the glass-to-glass floor.
Ingest durability: if an ingest server goes down, the broadcaster can reconnect in <5s without losing the segment in flight.
Scale: 100k concurrent live channels; 5M concurrent viewers.
Standard latency target: 10–30s glass-to-glass for the default mode (more on this below).
Chat consistency: messages from any sender are delivered in-order within a room; occasional duplicate delivery is acceptable (at-least-once).

Capacity estimation

Dimension	Estimate	How we got there
Concurrent channels	100,000	given
Ingest bandwidth per channel	~8–12 Mbps (avg ~10 Mbps)	1080p60 source stream from broadcaster
Total ingest bandwidth	~1 Tbps	`100,000 × 10 Mbps`
ABR renditions per channel	5 (1080p, 720p, 480p, 360p, 160p)	standard ladder
Concurrent GPU transcode jobs	500,000	`100,000 channels × 5 renditions`; each rendition encoded at ~1× speed (GPU)
GPU instance count	~10,000–20,000 at peak	~5–10 channels per GPU (full 5-rendition ladder); NVIDIA T4: ~17–18 concurrent streams at 1080p, ~37 at 720p; lower renditions are proportionally cheaper
Segment size (1080p rendition)	~1.5 MB	`6 Mbps × 2s`
Segment size (720p rendition)	~0.75 MB	`3 Mbps × 2s`
Avg segment size across ladder	~0.7 MB/rendition	across the 5-rendition ladder
Total segment data per channel / 2s	~3.5 MB	`5 renditions × ~0.7 MB avg`
New segment data written / 2s	~350 GB	`100,000 channels × 5 renditions × ~0.7 MB`
DVR segments per rendition per channel	7,200	`4 hours × 3,600s/hr ÷ 2s per segment`
Live DVR storage footprint	~2.5 PB	`100,000 channels × 5 renditions × 7,200 segments × ~0.7 MB` = 2,520,000,000 MB ≈ 2.5 PB (in practice dominated by top-1000 channels; long-tail on hot-tier storage only)
Concurrent viewers	5M	given
CDN egress bandwidth	20 Tbps	`5M viewers × 4 Mbps avg`
CDN cache hit rate	>99%	same segment served millions of times for popular streams
Origin pull requests	only thousands/sec	`5M viewers ÷ 2s per segment ≈ 2.5M segment requests/sec`; at a ~99.9% hit rate about 0.1% of them (~2,500/sec) reach origin
Avg viewers per chat room	50	`5M viewers ÷ 100,000 rooms`
Top 100 rooms — viewer count	10,000–50,000 each	heavy-tail distribution
Top 100 rooms — message rate	1,000–10,000 msg/sec each	hot streamer chat bursts
Total chat throughput	~1M messages/sec system-wide	sum across all rooms

Takeaway: The 20 Tbps CDN egress figure is the dominant design driver — it is why CDN is non-negotiable and why a >99% cache hit rate is required to collapse origin load from millions of viewer pulls to only thousands of requests per second.

That CDN egress number is the reason the whole design looks the way it does. Without a CDN, delivering 20 Tbps from origin servers would require tens of thousands of machines just to serve bandwidth. The CDN collapses that into a tractable origin pull problem — typically only a few thousand pulls per second, because each segment gets cached after the first request per PoP.
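The arithmetic behind the table is easy to reproduce. The sketch below uses only the table's own inputs; the function and key names are illustrative. It also derives the average segment size implied by the ladder bitrates alone (about 0.58 MB), which suggests the table's ~0.7 MB is a conservative round-up. Reading the difference as headroom for audio and container overhead is an assumption, not something the table states.

```python
# Back-of-envelope capacity model. Inputs are the table's own figures.
CHANNELS = 100_000
VIEWERS = 5_000_000
INGEST_MBPS = 10                       # average source bitrate per channel
LADDER_MBPS = [6, 3, 1.5, 0.8, 0.25]   # 1080p, 720p, 480p, 360p, 160p
SEGMENT_SECONDS = 2
AVG_SEGMENT_MB = 0.7                   # the table's rounded per-rendition average
DVR_HOURS = 4
AVG_VIEWER_MBPS = 4


def capacity() -> dict:
    renditions = len(LADDER_MBPS)
    dvr_segments = DVR_HOURS * 3_600 // SEGMENT_SECONDS
    return {
        "ingest_tbps": CHANNELS * INGEST_MBPS / 1e6,
        "transcode_jobs": CHANNELS * renditions,
        "written_gb_per_interval": CHANNELS * renditions * AVG_SEGMENT_MB / 1e3,
        "dvr_segments": dvr_segments,
        "dvr_storage_pb": CHANNELS * renditions * dvr_segments * AVG_SEGMENT_MB / 1e9,
        "egress_tbps": VIEWERS * AVG_VIEWER_MBPS / 1e6,
        # Video payload implied by the ladder bitrates alone (megabits -> megabytes is / 8).
        "ladder_avg_segment_mb": sum(LADDER_MBPS) * SEGMENT_SECONDS / 8 / renditions,
    }


if __name__ == "__main__":
    for name, value in capacity().items():
        print(f"{name}: {value:,.2f}")
```

Changing `SEGMENT_SECONDS` or `DVR_HOURS` shows how quickly the storage line moves relative to everything else.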

The four stages of a live stream

Every component in this architecture exists to transform the stream into something that can be delivered cheaply at scale. Understanding those four stages is the core of the question.

Stage 1: Ingest

The broadcaster runs software (OBS, Streamlabs, a hardware encoder) that encodes video locally and pushes it to a public ingest endpoint over RTMP (Real-Time Messaging Protocol).

RTMP is TCP-based, so it handles NAT and firewalls well. Its main weakness: TCP retransmissions introduce jitter on lossy networks. SRT (Secure Reliable Transport) is a newer alternative that uses ARQ (Automatic Repeat reQuest) as its primary error recovery mechanism, with optional FEC (Forward Error Correction) and better congestion control — useful for broadcasters on unstable connections.

At the ingest server, the incoming stream is parsed just enough to extract timing information and detect keyframes, then split at keyframe boundaries into short chunks (every 1–2 seconds, aligned to segment boundaries for the next stage) and forwarded to the transcoder farm.

An ingest server should hold no durable state: it carries only the live stream in flight. If a broadcaster loses connection, they reconnect and the stream resumes from the next keyframe.
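The keyframe-aligned splitting described above can be sketched as follows. The `Frame` type and `split_at_keyframes` helper are hypothetical; a real ingest server would parse RTMP/FLV tags rather than receive tidy frame objects, and the 2-second target is just the example value used throughout this article.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Frame:
    pts: float          # presentation timestamp in seconds
    is_keyframe: bool


def split_at_keyframes(frames: Iterable[Frame],
                       target_seconds: float = 2.0) -> Iterator[List[Frame]]:
    """Yield chunks that always start on a keyframe and span at least target_seconds."""
    chunk: List[Frame] = []
    for frame in frames:
        # Cut only on a keyframe, and only once the current chunk is long enough.
        if chunk and frame.is_keyframe and frame.pts - chunk[0].pts >= target_seconds:
            yield chunk
            chunk = []
        # Before the first keyframe (e.g. right after a reconnect) nothing is decodable.
        if not chunk and not frame.is_keyframe:
            continue
        chunk.append(frame)
    if chunk:
        yield chunk  # trailing partial chunk when the input ends
```

Because every chunk starts on a keyframe, a transcoder worker can decode any chunk without needing frames from the chunk before it.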

sequenceDiagram
    participant BC as Broadcaster
    participant ING as Ingest Server
    participant TC as Transcoder
    participant ORI as Origin / S3

    BC->>ING: RTMP push (continuous)
    ING->>TC: raw video chunks at keyframes
    TC->>TC: encode 5 renditions in parallel
    TC->>ORI: write HLS segment files
    TC->>ORI: update manifest (m3u8) pointer
    ORI-->>TC: ack segment stored

Stage 2: Transcoding

The source stream from the broadcaster is typically one high-bitrate rendition (e.g. 1080p60 at 12 Mbps). Viewers have wildly different network conditions. Adaptive Bitrate (ABR) solves this: transcode the source into a ladder of renditions, and let the player pick the best one it can sustain.

A typical ABR ladder:

Rendition	Resolution	Bitrate	Use case
1080p60	1920×1080	6 Mbps	Desktop, fast broadband
720p30	1280×720	3 Mbps	Desktop, average broadband
480p	854×480	1.5 Mbps	Mobile, good LTE
360p	640×360	800 Kbps	Mobile, weak LTE
160p	284×160	250 Kbps	Lowest-quality fallback

Each rendition is encoded in parallel on the transcoder farm. GPU-accelerated H.264 encoding (NVENC on NVIDIA hardware) can handle dozens of real-time streams per GPU. H.265/HEVC offers better compression but higher decode cost on the viewer side; AV1 offers even better compression but much higher encode cost.

Transcoding is the most expensive part of the system. At 100k channels × 5 renditions, you are running half a million concurrent real-time encode jobs. A failure here is also the most disruptive in the pipeline, because viewers see it immediately as a stall, so this tier needs job-level health monitoring, not just server-level.

Stage 3: Packaging into HLS segments

After transcoding, the stream is packaged into the HLS (HTTP Live Streaming) format:

Each rendition is split into short .ts (MPEG-TS) or .m4s (fMP4) segment files, typically 2–6 seconds each.
A manifest file (.m3u8) lists the segments for each rendition, updated every segment duration.
A master manifest lists all renditions with their bitrates and resolutions.

The master manifest for a stream looks like:

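#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=854x480
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=250000,RESOLUTION=284x160
160p/playlist.m3u8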
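The ladder table and the master manifest are the same data in two shapes, so the manifest can be generated rather than hand-written. A minimal sketch, assuming the `<rendition>/playlist.m3u8` path layout shown above; `build_master_playlist` is an illustrative name, and a production playlist would usually also carry attributes such as CODECS and FRAME-RATE.

```python
from typing import List, Tuple

# (name, width, height, bits per second), copied from the ABR ladder table.
LADDER: List[Tuple[str, int, int, int]] = [
    ("1080p", 1920, 1080, 6_000_000),
    ("720p", 1280, 720, 3_000_000),
    ("480p", 854, 480, 1_500_000),
    ("360p", 640, 360, 800_000),
    ("160p", 284, 160, 250_000),
]


def build_master_playlist(ladder: List[Tuple[str, int, int, int]]) -> str:
    """Render one EXT-X-STREAM-INF entry per rendition, highest bitrate first."""
    lines = ["#EXTM3U"]
    for name, width, height, bps in ladder:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bps},RESOLUTION={width}x{height}")
        lines.append(f"{name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_master_playlist(LADDER), end="")
```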

The per-rendition playlist:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:4821
#EXTINF:2.000,
seg-4821.ts
#EXTINF:2.000,
seg-4822.ts
#EXTINF:2.000,
seg-4823.ts

The player refreshes the playlist at the segment duration interval and appends newly listed segments to its buffer. The number of segments in the playlist controls the DVR window visible at the "live edge."
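The sliding window is simple to model: append the newest segment, let the oldest fall off, and advance the media sequence number. The `LivePlaylist` class below is a hypothetical sketch, not a packager's real API. It writes an `#EXTINF` duration line before each segment URI, which HLS requires for every media segment.

```python
import math
from collections import deque


class LivePlaylist:
    """Sliding-window media playlist holding the newest `window` segments."""

    def __init__(self, window: int = 3, segment_seconds: float = 2.0,
                 first_sequence: int = 0):
        self.segment_seconds = segment_seconds
        self.segments = deque(maxlen=window)   # oldest entry is evicted automatically
        self.next_sequence = first_sequence

    def add_segment(self) -> str:
        name = f"seg-{self.next_sequence}.ts"
        self.segments.append((self.next_sequence, name))
        self.next_sequence += 1
        return name

    def render(self) -> str:
        # MEDIA-SEQUENCE is the sequence number of the first segment still listed.
        first = self.segments[0][0] if self.segments else self.next_sequence
        lines = [
            "#EXTM3U",
            "#EXT-X-VERSION:3",
            f"#EXT-X-TARGETDURATION:{math.ceil(self.segment_seconds)}",
            f"#EXT-X-MEDIA-SEQUENCE:{first}",
        ]
        for _, name in self.segments:
            lines.append(f"#EXTINF:{self.segment_seconds:.3f},")
            lines.append(name)
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    playlist = LivePlaylist(window=3, first_sequence=4820)
    for _ in range(4):
        playlist.add_segment()
    print(playlist.render(), end="")   # lists seg-4821 .. seg-4823
```

A larger `window` exposes more rewindable history at the live edge at the cost of a longer playlist on every refresh.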

DASH (Dynamic Adaptive Streaming over HTTP) is the ISO-standardized alternative. It uses .mpd manifests and .m4s segments. Both HLS and DASH work over plain HTTP, making CDN delivery straightforward.

Stage 4: Distribution via CDN

Segments are small, immutable files written once and read millions of times — the ideal workload for a CDN.

The player fetches the master manifest, then the per-rendition manifest, then individual segments. All requests go to the CDN edge PoP nearest the viewer. The first viewer to request a segment at a given PoP causes a cache miss — the PoP pulls from origin. Every subsequent viewer gets a cache hit, served from the PoP's SSD in milliseconds, never touching origin.

For popular streams, CDN cache hit rates exceed 99.9%. Consider a stream with 100k concurrent viewers all pulling segment N at nearly the same time: one PoP might serve that single segment file to 1,000 viewers, having fetched it from origin exactly once.

See design-cdn for a deeper treatment of CDN internals.

flowchart LR
    ORI[Origin / S3\nwrites segments] -->|"pull on miss"| SHIELD[Shield PoP\none per region]
    SHIELD -->|"pull on miss"| EDGE1[Edge PoP\nRegion A]
    SHIELD -->|"pull on miss"| EDGE2[Edge PoP\nRegion B]
    EDGE1 -->|"cache hit"| VW1[Viewers A]
    EDGE2 -->|"cache hit"| VW2[Viewers B]
    style ORI fill:#0e7490,color:#fff
    style SHIELD fill:#ff6b1a,color:#0a0a0f
    style EDGE1 fill:#15803d,color:#fff
    style EDGE2 fill:#15803d,color:#fff

The shield PoP (origin shielding) is what prevents a thundering-herd cache-miss storm at stream start: when hundreds of thousands of viewers tune in simultaneously, every edge PoP in a region misses at once, but those misses converge on the shield, so only one pull per region goes to origin rather than one per edge PoP.
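A toy model makes the effect of the shield tier concrete. The classes below are hypothetical and deliberately simple: requests arrive one at a time, so the sketch counts pulls per cache tier and does not model concurrent misses inside a single PoP. The PoP and viewer counts are made-up inputs.

```python
class Origin:
    """Stands in for S3; counts how many times it is asked for a segment."""

    def __init__(self) -> None:
        self.pulls = 0

    def get(self, key: str) -> bytes:
        self.pulls += 1
        return b"segment-bytes"


class Cache:
    """A cache tier that pulls from its parent on a miss and keeps the result."""

    def __init__(self, parent) -> None:
        self.parent = parent
        self.store = {}

    def get(self, key: str) -> bytes:
        if key not in self.store:
            self.store[key] = self.parent.get(key)   # pull on miss
        return self.store[key]


def origin_pulls(num_edges: int, viewers_per_edge: int, use_shield: bool) -> int:
    origin = Origin()
    upstream = Cache(origin) if use_shield else origin
    edges = [Cache(upstream) for _ in range(num_edges)]
    for edge in edges:
        for _ in range(viewers_per_edge):
            edge.get("seg-0.ts")
    return origin.pulls


if __name__ == "__main__":
    print(origin_pulls(50, 1_000, use_shield=False))  # one pull per edge PoP
    print(origin_pulls(50, 1_000, use_shield=True))   # one pull in total
```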

Building up to the design

V1: One box, RTMP in, stream out

Run an RTMP server (Nginx-RTMP or SRS) that receives the broadcast and re-streams it in HLS. One server, one channel. Works for a demo with dozens of viewers.

The problem is simple: the server's uplink becomes the bottleneck at a few hundred concurrent viewers. No ABR means mobile viewers get 1080p or nothing. You can't run 100k channels on one box.

V2: Add a transcoder and a CDN origin

Move transcoding out of the RTMP server. Run FFmpeg (or a hardware encoder) as a separate process to produce 5 renditions. Write segments to an S3 bucket. Put a CDN in front of S3 for delivery.

You now have ABR and near-unlimited viewer scalability via CDN, so this starts to look like a real streaming service. But each transcoder is still tied to the ingest server that feeds it, so one crash loses the stream. And at around 100 channels, CPU-based FFmpeg exhausts your CPU budget.

V3: Decouple ingest from transcoding; add GPU workers

Ingest servers publish raw chunks to a distributed queue (Kafka or a proprietary internal bus). A pool of GPU transcoder workers consume chunks, produce renditions, write to S3.

Now ingest and transcoding scale independently, and a transcoder crash is recoverable — another worker picks up the next chunk. The new risk is transcoder lag: if a GPU worker can't keep up, segments arrive late at S3 and the CDN serves a stale manifest, causing viewers to see their player pause at the live edge.

V4: Add stream controller, health monitoring, adaptive scheduling

A Stream Controller watches each channel's transcoding pipeline. If a GPU worker falls behind, it reschedules the rendition to a less-loaded machine. If the source stream drops, it waits for reconnect and emits a discontinuity marker (#EXT-X-DISCONTINUITY) in the HLS manifest so players don't stall.
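The controller's per-rendition decision can be sketched as a small pure function. Everything here is an assumption made for illustration: the `RenditionStatus` fields, the action names, and the threshold of two segment durations of lag before rescheduling.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESCHEDULE = "reschedule"            # move the rendition to a less-loaded GPU worker
    WAIT_FOR_SOURCE = "wait_for_source"  # broadcaster dropped; hold for a reconnect


@dataclass
class RenditionStatus:
    source_connected: bool
    last_segment_at: float   # unix time at which the newest segment was published


def decide(status: RenditionStatus, now: float,
           segment_seconds: float = 2.0, max_lag_segments: float = 2.0) -> Action:
    """Pick an action from how stale the newest published segment is."""
    if not status.source_connected:
        return Action.WAIT_FOR_SOURCE
    lag = now - status.last_segment_at
    if lag > segment_seconds * max_lag_segments:
        return Action.RESCHEDULE
    return Action.NONE
```

Keeping the decision free of I/O makes it easy to unit-test; the surrounding loop that polls workers and issues the reschedule is where the real complexity lives.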

flowchart LR
    V1["V1: single server\nRTMP + HLS out"] --> V2["V2: + transcoder\n+ CDN\nABR works"]
    V2 --> V3["V3: + GPU farm\n+ S3 origin\nhorizontal scale"]
    V3 --> V4["V4: + stream controller\nhealth, failover, DVR"]
    V4 --> V5["V5: + LL-HLS option\n+ chat\n+ VOD pipeline"]
    style V1 fill:#0e7490,color:#fff
    style V2 fill:#15803d,color:#fff
    style V3 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

V5: Low-latency option, chat, VOD pipeline

Production adds LL-HLS or LL-DASH for reduced-latency channels (see the latency section below), a dedicated Chat Service (WebSocket fan-out), and a VOD Pipeline that, on stream end, concatenates all HLS segments, generates a proper on-demand index, and moves the result to VOD storage.

Full architecture deep-dive

flowchart TD
    BC[Broadcaster\nOBS / hardware encoder] -->|"RTMP / SRT push"| INGEST[Ingest Tier\nstateless servers]

    INGEST -->|"raw keyframe chunks"| TC[Transcoder Farm\nGPU instances]
    INGEST -.stream start/stop.-> CTRL[Stream Controller]
    CTRL --> META[(Metadata Store\nstream state, settings)]
    CTRL --> NOTIF[Notification Service\nfollower alerts]

    TC -->|"HLS segments + manifests"| ORIG[Origin Store\nS3 / object storage]
    TC -.transcode failure.-> CTRL

    ORIG --> CDN[CDN Edge PoPs\nglobal]
    CDN -->|"HLS pull delivery"| VWR[Viewers\nvideo players]

    VWR -.chat messages.-> CHAT[Chat Service\nWebSocket fan-out]
    CHAT -.-> MSG[Message Bus\nKafka / Redis Pub/Sub]
    MSG -.-> CHAT
    CHAT -->|"messages"| VWR

    ORIG -.on stream end.-> VOD[VOD Pipeline\nconcatenate + re-index]
    VOD --> VODS3[(VOD Storage\nS3)]

    style INGEST fill:#ff6b1a,color:#0a0a0f
    style TC fill:#a855f7,color:#fff
    style ORIG fill:#0e7490,color:#fff
    style CDN fill:#ffaa00,color:#0a0a0f
    style CHAT fill:#15803d,color:#fff
    style CTRL fill:#ff2e88,color:#fff

Latency deep-dive: where the delay comes from

People talk about live streaming latency as though it's a single knob you can turn. It isn't. It's the sum of several physically unavoidable steps — and you can only reduce it by paying a higher cost elsewhere.

Broadcast encoder latency:  ~0.5–1s   (local encode, buffer, send)
Network ingest:             ~0.1–0.5s (RTMP to ingest server)
Transcoding:                ~0.5–1s   (real-time encode, ~1× speed)
Segmenting:                 ~2–6s     (wait for one full segment to close)
CDN propagation:            ~0.5–1s   (origin → edge pull)
Player buffering:           ~4–12s    (≥ 2 segments buffered for stable ABR)
──────────────────────────────────────────────
Standard HLS total:         ~8–20s glass-to-glass (often reported as "10–30s")

Player buffering is the biggest single contributor, with the segment-boundary wait second. Both are protocol constraints, not server capacity constraints: adding servers does not shrink them. Only shorter segments, a shallower buffer, or a different protocol does.

flowchart LR
    CAM["Camera\n0s"] --> ENC["Local encode\n+0.5–1s"]
    ENC --> NET["Network ingest\n+0.1–0.5s"]
    NET --> TXCODE["Transcode\n+0.5–1s"]
    TXCODE --> SEG["Wait for segment\n+2–6s"]
    SEG --> CDN["CDN propagation\n+0.5–1s"]
    CDN --> BUF["Player buffer\n+4–12s"]
    BUF --> EYE["Viewer sees it\n~8–20s later"]
    style SEG fill:#ff6b1a,color:#0a0a0f
    style BUF fill:#a855f7,color:#fff
    style EYE fill:#15803d,color:#fff

The two highlighted boxes, segment boundary and player buffering, account for most of the delay and are the levers LL-HLS pulls on.
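Summing the per-stage ranges from the breakdown above confirms both the total and the share owed to the two protocol-level stages. This is plain arithmetic on the article's own figures; the dictionary keys are just labels.

```python
# (min, max) seconds per stage, copied from the latency breakdown above.
STAGES = {
    "broadcast encode": (0.5, 1.0),
    "network ingest": (0.1, 0.5),
    "transcoding": (0.5, 1.0),
    "segment boundary wait": (2.0, 6.0),
    "CDN propagation": (0.5, 1.0),
    "player buffering": (4.0, 12.0),
}


def total(bound: int) -> float:
    """Sum one side of every range: bound=0 for best case, bound=1 for worst case."""
    return sum(stage[bound] for stage in STAGES.values())


def protocol_share(bound: int) -> float:
    """Fraction of the total owed to segment wait plus player buffering."""
    protocol = STAGES["segment boundary wait"][bound] + STAGES["player buffering"][bound]
    return protocol / total(bound)


if __name__ == "__main__":
    print(f"best case:  {total(0):.1f}s, protocol share {protocol_share(0):.0%}")
    print(f"worst case: {total(1):.1f}s, protocol share {protocol_share(1):.0%}")
```

The sums come to 7.6s and 21.5s, and in both cases roughly four fifths of the delay sits in the segment wait and the player buffer.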

Latency modes compared

Mode	Glass-to-glass	Segment size	CDN load	Cost
Standard HLS	10–30s	2–6s	Low (few req/viewer/min)	Low
Low-Latency HLS (LL-HLS)	2–5s	0.2–0.5s partial segments	~3–8× more requests (depends on part:segment ratio)	Medium
LL-DASH	2–5s	similar	similar	Medium
WebRTC relay	<1s	N/A (RTP, not HTTP)	Does not use CDN	High (dedicated relay servers)

LL-HLS (announced by Apple at WWDC 2019; spec revised in 2020 to drop the HTTP/2 push requirement) achieves lower latency by publishing partial segments (small chunks within a segment) before the full segment is complete. Production implementations use blocking playlist reloads: the server advertises CAN-BLOCK-RELOAD=YES in EXT-X-SERVER-CONTROL, the player names the next part it wants with the _HLS_msn and _HLS_part query parameters, and the CDN holds the HTTP response open until the playlist contains that part, which avoids polling. EXT-X-PRELOAD-HINT complements this by telling the player the URI of the next part so it can request it before it exists. The trade-off: roughly 3–8× more HTTP requests per viewer depending on the part-to-segment duration ratio, which increases CDN cost and edge server load proportionally.
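The request multiplier follows directly from the part-to-segment ratio. The model below is a simplification chosen for illustration: it assumes one playlist request plus one media request per published unit (a full segment for standard HLS, a part for LL-HLS) and ignores preload hints, retries, and rendition switches.

```python
def requests_per_second(unit_seconds: float) -> float:
    """One playlist request plus one media request per published unit."""
    return 2 / unit_seconds


def ll_hls_multiplier(segment_seconds: float, part_seconds: float) -> float:
    """How many times more requests a viewer makes when fetching parts instead of segments."""
    return requests_per_second(part_seconds) / requests_per_second(segment_seconds)


if __name__ == "__main__":
    for part in (0.5, 0.25):
        print(f"2s segments, {part}s parts: {ll_hls_multiplier(2.0, part):.0f}x")
```

Under this model, 2-second segments with 0.5-second parts give 4×, and 0.25-second parts give 8×, which is consistent with the 3–8× range quoted above.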

WebRTC for broadcast-scale fan-out requires a different architecture entirely: a media relay (SFU or MCU) that terminates WebRTC connections from one broadcaster and re-fans to viewers, all via UDP + DTLS + SRTP. There is no CDN caching because WebRTC is not HTTP. Every viewer is a persistent connection to a relay server. Sub-second latency is achievable but the relay capacity planning is completely different. Use this only when interactivity (<1s) is the product requirement, not just a nice-to-have.

Player-side: adaptive bitrate switching

The player's job is simpler than it looks. It fetches the master manifest to learn what renditions are available, starts at a conservative rendition (or estimates one from connection speed), and then on every segment compares download time against segment duration. If segments download slower than they play back, it drops a rendition. If they download consistently faster, it steps up. The player also maintains a playback buffer of typically 3–5 segments (roughly 6–15s depending on segment duration) so brief bandwidth hiccups don't stall playback.

Common ABR algorithms (BOLA, BBA, or variants) factor in buffer level — how much runway the player has before stalling — bandwidth estimate from a rolling average of segment download speeds, and rendition jump size to avoid thrashing between distant quality levels. ABR is invisible to the viewer when working well. When bandwidth degrades faster than the algorithm can react, the viewer sees a quality drop or a buffering spinner.
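The three inputs named above (buffer level, a rolling bandwidth estimate, and a cap on jump size) can be combined in a few lines. This is a simple heuristic written for illustration, not BOLA or BBA; the 0.8 safety factor, the 4-second low-buffer threshold, and the one-step-up limit are assumptions.

```python
from typing import Sequence

LADDER_KBPS = [250, 800, 1500, 3000, 6000]   # ascending: 160p .. 1080p60


def choose_rendition(current: int, samples_kbps: Sequence[float],
                     buffer_seconds: float, *, safety: float = 0.8,
                     low_buffer: float = 4.0, max_step_up: int = 1) -> int:
    """Return the ladder index to use for the next segment."""
    # Rolling average of recent segment download speeds, discounted for safety.
    budget = sum(samples_kbps) / len(samples_kbps) * safety
    # Highest rendition that fits the budget; fall back to the lowest rung.
    affordable = max((i for i, kbps in enumerate(LADDER_KBPS) if kbps <= budget),
                     default=0)
    if buffer_seconds < low_buffer:
        # Little runway left: never step up, and drop if throughput demands it.
        return min(current, affordable)
    if affordable > current:
        # Step up gradually to avoid thrashing between distant quality levels.
        return min(affordable, current + max_step_up)
    return affordable   # stay, or drop straight to what the link can sustain


if __name__ == "__main__":
    print(choose_rendition(2, [10_000, 10_000], buffer_seconds=10.0))  # 3: one step up
    print(choose_rendition(4, [2_000, 2_000], buffer_seconds=10.0))    # 2: drop to 480p
```

Dropping is immediate while climbing is gradual, which reflects that a stall is far more visible to the viewer than a few extra seconds at lower quality.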

Live chat — a separate fan-out problem

Chat is architecturally distinct from video. Video is a pull model: players request segments on their own schedule. Chat is a push model: messages must be delivered to all room members in near-real-time, without them polling for it.

The key properties: rooms range from 10 viewers to 500k viewers; messages must appear in-order within a room; the delivery latency target is under 1 second; and at-least-once semantics are acceptable (a duplicate message is annoying but tolerable; a missing one is not).

flowchart LR
    U1[Viewer A] -->|"send message"| GW[Chat Gateway\nWebSocket]
    GW --> PUB[Message Bus\nKafka / Redis Pub/Sub]
    PUB --> FANOUT[Fan-out Workers\nper room partition]
    FANOUT --> GW
    GW -->|"push message"| U2[Viewer B]
    GW -->|"push message"| U3[Viewer C]
    FANOUT --> STORE[(Message Store\nrecent history)]
    style GW fill:#15803d,color:#fff
    style PUB fill:#a855f7,color:#fff
    style FANOUT fill:#ff6b1a,color:#0a0a0f

For hot rooms — top streamers with 500k concurrent viewers — the fan-out tier becomes the bottleneck. You can't serve 500k WebSocket connections from a single process. Shard chat rooms horizontally across WebSocket gateway nodes, with Kafka partitioned by room ID mediating between them. See design-slack for a full treatment of chat architecture.
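Sharding by room ID comes down to two mappings: room to bus partition (so one room's messages stay ordered), and room to a set of gateway nodes (so a hot room's connections are spread out). The sketch below is hypothetical: the hash is not Kafka's own partitioner, and the 50,000 connections-per-gateway figure is an assumed capacity, not a number from this article.

```python
import hashlib
from typing import List


def partition_for_room(room_id: str, num_partitions: int) -> int:
    """Stable room -> partition mapping, so a room's messages share one ordered log."""
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def gateways_for_room(room_id: str, viewer_count: int, gateways: List[str],
                      max_conns_per_gateway: int = 50_000) -> List[str]:
    """Pick enough gateway nodes to hold every WebSocket connection for a room."""
    needed = max(1, -(-viewer_count // max_conns_per_gateway))   # ceiling division
    needed = min(needed, len(gateways))
    start = partition_for_room(room_id, len(gateways))
    # Take consecutive nodes starting at the room's hashed position.
    return [gateways[(start + i) % len(gateways)] for i in range(needed)]


if __name__ == "__main__":
    fleet = [f"gw-{i}" for i in range(20)]
    print(len(gateways_for_room("hot-room", 500_000, fleet)))   # 10 nodes
    print(len(gateways_for_room("quiet-room", 50, fleet)))      # 1 node
```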

Chat moderation adds a layer: messages pass through a moderation service (rule-based + ML) before being fanned out. This introduces a few hundred milliseconds of additional latency but is non-negotiable at scale.

DVR and VOD

DVR (live rewind)

DVR sounds like a hard problem. It isn't, once you realize that the segments are already in S3.

The simplest DVR implementation: keep every segment in S3 for the duration of the broadcast. The HLS manifest served to the viewer can be configured as a sliding window playlist pointing to the last N segments (where N × segment_duration = DVR window, e.g. N = 7,200 for 4 hours of 2-second segments). A viewer seeking 30 minutes back receives a manifest pointing to segments from 30 minutes ago. Those segments are already in S3 and the CDN will cache them on first access.

flowchart LR
    S3[S3: all segments\nseg-0001 ... seg-9600] --> LIVE[Live manifest\npoints to last 3 segments]
    S3 --> DVR30[DVR -30min manifest\npoints to seg-8701...8703]
    S3 --> DVR4H[DVR -4hr manifest\npoints to seg-2401...2403]
    LIVE -->|"live edge viewer"| VW1[Live viewer]
    DVR30 -->|"rewind viewer"| VW2[Rewinding viewer]
    style S3 fill:#0e7490,color:#fff
    style LIVE fill:#ff6b1a,color:#0a0a0f
    style DVR30 fill:#15803d,color:#fff
    style DVR4H fill:#a855f7,color:#fff

DVR requires no additional compute — it is entirely a manifest-pointer problem. The real cost is storage: at 100k concurrent channels the footprint runs into the petabyte range (~2.5 PB) for a 4-hour rolling window. In practice, segment retention is tiered by channel popularity: top channels get full DVR windows; long-tail channels get shorter ones.
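The manifest-pointer idea reduces to integer arithmetic on segment sequence numbers. A minimal sketch, assuming 2-second segments numbered from 1 and a three-segment manifest window; `dvr_window` is an illustrative helper, not a packager API.

```python
from typing import List


def dvr_window(live_edge_seq: int, seek_back_seconds: int, *,
               segment_seconds: int = 2, window_segments: int = 3,
               dvr_hours: int = 4) -> List[int]:
    """Return the segment numbers a manifest should list for a given rewind offset."""
    max_back = dvr_hours * 3_600 // segment_seconds       # 7,200 for 4h of 2s segments
    back = min(seek_back_seconds // segment_seconds, max_back)
    first = live_edge_seq - back + 1
    # Never point past the live edge, and never before the first segment.
    first = min(first, live_edge_seq - window_segments + 1)
    first = max(first, 1)
    last = min(first + window_segments - 1, live_edge_seq)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(dvr_window(9600, 0))           # live edge
    print(dvr_window(9600, 30 * 60))     # 30 minutes back
    print(dvr_window(9600, 4 * 3_600))   # oldest point in the DVR window
```

With the live edge at seg-9600, a 30-minute rewind is 900 segments back, and the 4-hour limit is 7,200 segments back. Seeks beyond the DVR window are clamped to its oldest point.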

VOD pipeline

When a broadcast ends:

Concatenate all HLS segments into a complete archive.
Re-index the fMP4 fragments to enable proper VOD seeking (generate a sidx segment index box or re-mux the fragmented MP4).
Write to VOD storage with a permanent manifest.
Optionally re-transcode at higher quality (offline, not real-time) for better compression.

The VOD and live pipelines share segment storage (S3) but have different manifests. The transition from "live" to "VOD" is a metadata update: the stream record changes state from live to vod, and the manifest URL resolves to the VOD index. See design-youtube for the full on-demand video serving architecture.
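At the playlist level, the difference between live and VOD is small: the VOD playlist lists every segment, declares itself as VOD, and ends with an end-of-list tag so players show a full seek bar instead of polling for updates. A minimal sketch, assuming the segments keep their live names and a uniform duration; `build_vod_playlist` is a hypothetical helper.

```python
import math
from typing import Sequence


def build_vod_playlist(segment_names: Sequence[str],
                       segment_seconds: float = 2.0) -> str:
    """Render a complete, closed playlist over already-stored segments."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{math.ceil(segment_seconds)}",
        "#EXT-X-PLAYLIST-TYPE:VOD",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for name in segment_names:
        lines.append(f"#EXTINF:{segment_seconds:.3f},")
        lines.append(name)
    lines.append("#EXT-X-ENDLIST")   # tells the player nothing more will be appended
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_vod_playlist([f"seg-{n}.ts" for n in range(1, 4)]), end="")
```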

Storage choices

Data	Store	Why
HLS segments (live, DVR)	S3 / object storage	Immutable, write-once, high durability, CDN-pullable
HLS manifests (live)	S3 (with short CDN TTL) or edge-generated	Updated every segment; must propagate quickly
Stream metadata (state, settings)	Postgres / DynamoDB	Low volume, strong consistency needed for stream start/stop
Chat messages (recent, in-flight)	Redis Pub/Sub or Kafka	Low-latency fan-out; not the primary store
Chat history (last 100 messages)	Cassandra or DynamoDB	High write throughput; time-series queries by room + timestamp
VOD assets	S3 (separate bucket/prefix)	Permanent; separate lifecycle policy from live segments
Transcoding job queue	Kafka or SQS	Durable; handles GPU worker restarts gracefully
Analytics / playback events	Kafka → S3 → OLAP	High-volume append-only; batch processed

Failure modes

Broadcaster ingest drop / reconnect

The broadcaster's app loses connectivity. The ingest server closes the RTMP connection. The transcoder sees no new chunks.

The HLS manifest stops being updated, so the CDN continues serving the last valid segments. The player, seeing no new segments, begins buffering and eventually shows a "stream is offline" state. The Stream Controller waits for a reconnect within a configurable window (e.g. 60s). On reconnect, it emits a discontinuity tag (#EXT-X-DISCONTINUITY) in the HLS manifest so players know there is a gap. After the reconnect timeout, the stream is marked offline and the DVR/VOD pipeline archives what was recorded.

Transcoder failure mid-stream

A GPU instance crashes while encoding. Segments in-flight for that rendition are lost. The transcoding job for that channel is rescheduled to a healthy GPU worker, and the chunk pipeline (Kafka) allows the new worker to resume from the last fully-acknowledged chunk. A brief rendition gap may appear in the manifest; better implementations emit a discontinuity tag.

CDN PoP failure

A CDN edge PoP becomes unavailable. Players are steered to a healthy PoP: CDN providers use GeoDNS or BGP anycast routing with health checks, so failover typically completes within 5–30s. Viewers connected to the failed PoP will see buffering during that window. Origin is unaffected.

Thundering herd on segment 0 / stream start

When a popular streamer goes live, hundreds of thousands of viewers simultaneously poll for the stream's manifest. A naively-configured CDN may not have the first segment cached yet, causing an origin-pull spike.

Three mitigations work together here. Origin shielding means the edge PoPs in a region forward misses to that region's shield PoP, which is the only one that pulls from origin, reducing the miss storm from one pull per edge PoP to one per region. Pre-warming the CDN means the Stream Controller notifies the CDN to pre-fetch the first manifest before the stream is announced as live. And a short but nonzero manifest TTL (e.g. 2s) lets the CDN serve the manifest to multiple viewers before re-validating with origin.
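The effect of the manifest TTL is easy to quantify with a toy cache. The `TtlCache` class is a hypothetical stand-in for one edge PoP; time is passed in as integer milliseconds so the example is deterministic, and the request pattern (one request per millisecond for ten seconds) is made up.

```python
from typing import Callable, Dict, Iterable, Tuple


class TtlCache:
    """Serve a stored copy until it is ttl_ms old, then refetch from upstream."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_ms: int) -> None:
        self._fetch = fetch
        self._ttl_ms = ttl_ms
        self._entries: Dict[str, Tuple[int, bytes]] = {}

    def get(self, key: str, now_ms: int) -> bytes:
        entry = self._entries.get(key)
        if entry is None or now_ms - entry[0] >= self._ttl_ms:
            entry = (now_ms, self._fetch(key))   # miss or expired: go upstream
            self._entries[key] = entry
        return entry[1]


def origin_fetches(request_times_ms: Iterable[int], ttl_ms: int) -> int:
    """Count upstream fetches for a stream of manifest requests at one PoP."""
    calls = 0

    def fetch(key: str) -> bytes:
        nonlocal calls
        calls += 1
        return b"#EXTM3U\n"

    cache = TtlCache(fetch, ttl_ms)
    for now_ms in request_times_ms:
        cache.get("live/playlist.m3u8", now_ms)
    return calls


if __name__ == "__main__":
    print(origin_fetches(range(10_000), ttl_ms=2_000))  # 5 upstream fetches
    print(origin_fetches(range(10_000), ttl_ms=0))      # 10000: every request goes upstream
```

Upstream load depends on the TTL and elapsed time, not on how many viewers are asking, which is the whole point of a nonzero TTL.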

Chat hot room

Consider a streamer with 500k concurrent viewers and messages arriving at 5,000/sec: delivering every message to every viewer would mean 2.5 billion pushes per second. Mitigations: partition WebSocket gateway nodes by room so all viewers in a room connect to a dedicated set of shards; cap message sends per user per second and filter through the moderation queue; and at extreme scale, deliver only a sample of chat messages to viewers (showing "chat is moving too fast to display all messages", which is visible behavior but preferable to dropped connections).
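Two of these mitigations, the per-user send cap and message sampling, fit in a short sketch. The token bucket is a standard rate-limiting technique; the class and function names, the rates, and the evenly spaced sampling strategy are all assumptions made for illustration.

```python
from typing import List, Sequence


class TokenBucket:
    """Per-user send cap: `rate_per_sec` sustained, bursts of up to `burst` messages."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def sample_for_display(messages: Sequence[str], max_per_second: int) -> List[str]:
    """Keep an evenly spaced subset of one second's worth of messages."""
    if len(messages) <= max_per_second:
        return list(messages)
    step = len(messages) / max_per_second
    return [messages[int(i * step)] for i in range(max_per_second)]


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=1, burst=2)
    print([bucket.allow(0.0) for _ in range(3)])   # third send in the burst is rejected
    second = [f"msg-{n}" for n in range(5_000)]
    print(len(sample_for_display(second, 100)))    # 100 of 5,000 reach viewers
```

Sampling 100 of 5,000 messages cuts the fan-out for a 500k-viewer room from 2.5 billion pushes per second to 50 million.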

Things to discuss in an interview

The latency trade-off is central: be precise about where each second of delay comes from. Interviewers ask "how do you reduce latency?" — the answer is always "segment size + buffering depth, at a CDN cost premium."
Transcoding is the most expensive component: it's the bottleneck for cost, capacity, and failure surface. Be prepared to discuss GPU utilization, parallelism, and job scheduling.
CDN makes the economics possible: without it, 20 Tbps of egress is not deliverable from origin servers. Be prepared to explain cache hit rates and origin shielding.
Chat and video are separate problems: conflating them in your architecture is a red flag. Chat is push, video delivery is pull.
DVR requires no extra compute: once you keep segments in object storage, DVR is a manifest pointer problem. Storage cost is real at scale (petabyte range at 100k channels) — be ready to discuss tiered retention by channel popularity.
Compare to on-demand video: the key difference is no pre-transcoding window — you have the duration of one segment (2–6 seconds) to transcode and publish, not hours.

Things you should now be able to answer

Why is RTMP still the dominant ingest protocol when it is 20 years old?
What is an ABR ladder and why does the player switch renditions?
Where does the 10–30s of glass-to-glass latency come from, step by step?
Why can't you just use WebRTC to fan out to millions of viewers?
How does DVR work without any special infrastructure beyond keeping segments in S3?
Why is origin shielding critical at stream start?
What is the key architectural difference between chat and video delivery?

