Design a Video Conferencing System (Zoom)

Carry live audio/video among many participants with low latency. WebRTC, the SFU vs MCU vs mesh trade-off, simulcast, and adaptive bitrate.

21 min read · 2026-03-24 · Ironclad Academy

The problem

Zoom had 300 million daily meeting participants at its 2020 peak. Google Meet, Microsoft Teams, and Cisco Webex each run tens of millions of concurrent calls. Behind every one of those calls is a real-time media pipeline that must deliver audio and video frames to all participants within 200–300 milliseconds, end-to-end — or the conversation falls apart.

A video conferencing system lets multiple people see and hear each other over the internet simultaneously. When you join a Zoom or Google Meet call, your laptop captures your camera and microphone, compresses the media into tiny packets, and fires them across the network to every other person in the room — who must receive and play them back in near-real time, even if they are on a different continent, behind a corporate firewall, or on spotty mobile data. Recording, screen sharing, and layout control all ride on top of this core media delivery problem.

Two things make this genuinely hard. First, the medium is time: unlike a database query or a file download, you cannot retry a dropped packet. A video frame that arrives 500 ms late is useless garbage — worse than silence, because the decoder may glitch. This rules out TCP's retransmission model entirely and forces a UDP-based design where congestion control and codec error-concealment replace reliability. Second, the topology problem is brutal: in a naive peer-to-peer mesh, each participant must upload a full-quality stream to every other participant simultaneously. At 10 people that's 45 connections and ~14 Mbps of upload per person; at 50 people it collapses completely. The central design challenge is finding a server-side routing architecture that kills the O(N²) fan-out cost without re-encoding video on every hop — which is the SFU/simulcast trade-off this article is built around.

Video conferencing is one of the hardest real-time system design problems in interviews because it forces you to reason about time — not just throughput. Every design decision, from transport protocol to topology, falls out of one constraint: a packet that arrives late is as bad as one that never arrives.

Functional requirements

  • POST /calls — create a call room, return a join token.
  • Participant joins via WebRTC: negotiate codecs, establish an encrypted media channel.
  • Live audio/video delivered with < 300 ms glass-to-glass latency.
  • Screen sharing as a second video track.
  • Mute/unmute, camera on/off — server-side awareness (for recording and rendering).
  • Recording: save the call to durable storage for later playback.
  • Breakout rooms, large-call webinar mode (stretch, but worth naming).

Non-functional requirements

  • Low latency: < 150–200 ms one-way media delay for interactive feel; < 300 ms tolerable.
  • Adaptive quality: degrade gracefully under packet loss and bandwidth contraction rather than freezing.
  • NAT traversal: most participants are behind home routers, mobile networks, or strict corporate firewalls.
  • Scalability: 1 M+ concurrent calls globally; a single call can have 2 to ~1000 participants.
  • High availability: SFU failure must not silently drop calls — reconnect within seconds.
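  • Security: DTLS-SRTP encryption on every media track. This protects each client-to-SFU hop; because the SFU terminates SRTP, true end-to-end encryption would need an additional frame-level layer on top.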

Capacity estimation

Assumptions: 1 M concurrent calls, 10 participants per call on average.

| Dimension | Estimate | How we got there |
| --- | --- | --- |
| Per-participant video send | ~1,500 kbps | 720p 30fps VP8/H.264 |
| Per-participant audio send | ~50 kbps | Opus codec |
| Per-participant total outbound | ~1,550 kbps ≈ 1.6 Mbps | video + audio |
| Per-call inbound to SFU | 16 Mbps | 10 senders × 1.6 Mbps |
| Per-call egress (peak) | 144 Mbps | 10 × 9 × 1.6 Mbps (each stream forwarded to 9 others) |
| Per-call egress (practical) | ~80–100 Mbps/call | ~1/3 of receivers get a degraded simulcast layer, reducing total |
| Calls per 32-core SFU node | ~40–100 concurrent calls | CPU limit (SRTP decryption); NIC I/O = 6.4–16 Gbps, well within a 40-Gbps NIC |
| SFU fleet size (1 M calls) | ~10k–25k nodes | 1 M calls ÷ 40–100 calls/node, distributed across PoPs |
| TURN clients | ~15–25% of users | Symmetric NAT / strict firewalls; STUN resolves the rest |
| TURN tier bandwidth | 2.4 Tbps | 1.5M users × 1.6 Mbps/user (15% of 10M users require TURN relay) |
| Recording storage per call | ~700 MB raw | 1 hour × 1.5 Mbps video (675 MB) + 1 hour × 50 kbps audio (~22 MB); compressed in S3-compatible object storage |

Takeaway: CPU for SRTP decryption and RTP packet processing is the real SFU bottleneck — forwarding without decoding keeps CPU far lower than an MCU, but SRTP work still dominates over raw bandwidth at typical call densities.
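
The arithmetic behind the table can be re-derived in a few lines. This is only a back-of-envelope sketch: the constant and function names are made up for illustration, and every input is one of the stated assumptions (1.6 Mbps per sender, 10 participants, 1 M calls, 15% TURN usage), not a measurement.

```python
# Back-of-envelope capacity model. All inputs are the article's assumptions.
PER_SENDER_KBPS = 1_600        # ~1,550 kbps video + audio, rounded to 1.6 Mbps
PARTICIPANTS_PER_CALL = 10
CONCURRENT_CALLS = 1_000_000
TURN_PERCENT = 15              # share of users that need a TURN relay


def capacity_estimate() -> dict:
    n = PARTICIPANTS_PER_CALL
    ingress_kbps = n * PER_SENDER_KBPS             # every sender uploads once
    egress_kbps = n * (n - 1) * PER_SENDER_KBPS    # each stream goes to 9 others
    per_call_kbps = ingress_kbps + egress_kbps
    turn_users = CONCURRENT_CALLS * n * TURN_PERCENT // 100
    return {
        "ingress_mbps": ingress_kbps // 1_000,
        "egress_peak_mbps": egress_kbps // 1_000,
        # NIC load at 40 and 100 calls per node
        "node_io_gbps": [per_call_kbps * calls / 1_000_000 for calls in (40, 100)],
        # Fleet size at 100 and 40 calls per node
        "fleet_nodes": [CONCURRENT_CALLS // calls for calls in (100, 40)],
        "turn_tbps": turn_users * PER_SENDER_KBPS / 1_000_000_000,
        # One hour of one 1.5 Mbps video stream plus 50 kbps audio, in MB
        "recording_mb_per_hour": (1_500 + 50) * 3_600 / 8 / 1_000,
    }
```

Changing a single input, for example the TURN percentage or the average call size, shows quickly which rows of the table are most sensitive to the assumptions.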

Building up to the design

Don't jump to "SFU + WebRTC + simulcast." Earn every layer.

V1: Direct P2P mesh (two participants)

For a 1:1 call, the browser can open a direct WebRTC PeerConnection to the other browser. NAT traversal happens via ICE/STUN. No server-side media at all — audio and video flow browser-to-browser.

This works beautifully for one-on-one: zero server media cost and the shortest possible path between the two people. The problem shows up the moment you add a third participant.

In a full mesh, each new participant opens PeerConnections to every other one. With N participants you have N×(N-1)/2 connections, and every sender must upload N-1 separate streams. At N=10, that's 45 connections and each participant uploads ~14 Mbps. Home connections collapse. At N=50, it's plainly untenable.
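
The mesh numbers follow directly from the two formulas in the paragraph above. A small sketch, assuming the ~1,550 kbps per-stream figure from the capacity table (`mesh_cost` is an illustrative name):

```python
def mesh_cost(n: int, per_stream_kbps: int = 1_550) -> tuple:
    """Return (total connections, upload per participant in kbps) for a full mesh."""
    connections = n * (n - 1) // 2           # one PeerConnection per pair
    upload_kbps = (n - 1) * per_stream_kbps  # a separate copy for every peer
    return connections, upload_kbps
```

`mesh_cost(10)` gives 45 connections and 13,950 kbps (about 14 Mbps) of upload per participant; `mesh_cost(50)` gives 1,225 connections and roughly 76 Mbps of upload each.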

V2: MCU (Multipoint Control Unit)

A media server decodes all incoming streams, composites them into a single mixed stream — the "Brady Bunch" grid — and sends one stream back to each participant. Now everyone sends one stream up and receives one stream down, regardless of call size. Problem solved, right?

Not quite. The MCU has to decode every video track to do its compositing. At ten participants sending 720p, that's ten decodes plus ten encodes (one for each outbound stream). This is massively CPU-intensive — a single MCU server handles perhaps 5–10 concurrent calls of that size. You can scale horizontally, but it's expensive. You also lose per-track control at the client: participants can't click to "pin" a speaker, because the layout is baked into the server's composite stream.

V3: SFU (Selective Forwarding Unit)

The SFU receives each participant's stream and forwards it, as-is, to every other participant — no decoding, just routing RTP packets. Each participant still sends one stream, but the server's CPU cost per call drops dramatically because it's not transcoding. Clients receive individual streams per participant and lay them out however they like.

The remaining gap: all participants receive full-quality video regardless of their actual bandwidth. Someone on a 500 kbps LTE connection tries to receive ten 1.5 Mbps streams. Their call freezes.

V4: SFU + Simulcast

Each sender publishes three simultaneous resolution layers — say, 720p at 1.5 Mbps, 360p at 500 kbps, and 180p at 150 kbps. The SFU monitors each receiver's available bandwidth and forwards the appropriate layer to each one. A weak receiver gets 180p; a fast receiver gets 720p. No re-encoding happens anywhere.

This is the key architectural unlock: heterogeneous network conditions are handled per-receiver, independently. The participant on slow LTE doesn't drag down anyone else.
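
As a rough illustration of per-receiver selection, the sketch below picks the highest layer that fits an equal share of the receiver's bandwidth. The equal split across streams and the name `pick_layer` are assumptions made for the example; a production SFU would likely weight the active speaker and pinned tiles differently.

```python
from typing import Optional

# Simulcast ladder from the text: (label, bitrate in kbps), lowest first.
LAYERS = [("180p", 150), ("360p", 500), ("720p", 1_500)]


def pick_layer(available_kbps: float, streams: int) -> Optional[str]:
    """Highest layer that fits an equal share of the receiver's bandwidth.

    Returns None when even the lowest layer does not fit.
    """
    if streams <= 0:
        return None
    share = available_kbps / streams
    best = None
    for label, kbps in LAYERS:
        if kbps <= share:
            best = label  # keep climbing while the layer still fits
    return best
```

With nine incoming streams, a 20 Mbps receiver gets 720p, a 5 Mbps receiver gets 360p, and a 2 Mbps receiver gets 180p. The 500 kbps LTE receiver gets `None`: even 180p for nine senders does not fit, so the SFU would also have to limit how many video streams it forwards.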

V5: SFU + Simulcast + ICE/STUN/TURN

Add the NAT traversal layer. Many corporate firewalls block UDP, and mobile networks often use symmetric NAT. ICE (Interactive Connectivity Establishment) gathers candidate addresses, tries the direct path first, then the public address discovered through a STUN server, and falls back to a TURN relay if nothing else works. Approximately 75–85% of connections succeed without TURN across typical deployments; the rest relay through TURN.

V6: Production scale

V5 plus geographic edge PoPs placing SFU fleets close to users, a dedicated signaling plane, recording pipeline, and multi-region failover.

flowchart LR
    V1["V1: P2P mesh<br/>2 people only"] --> V2["V2: MCU<br/>one stream each, CPU-heavy"]
    V2 --> V3["V3: SFU<br/>forward-only, scales better"]
    V3 --> V4["V4: + Simulcast<br/>right quality per receiver"]
    V4 --> V5["V5: + ICE/STUN/TURN<br/>NAT traversal"]
    V5 --> V6["V6: + edge PoPs + signaling<br/>+ recording + failover"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

Why UDP and RTP — and not TCP

TCP gives you reliability via retransmission. For a file, that is ideal. For live video, it actively hurts you.

When a packet carrying video frame N is lost, TCP stalls the stream and waits for a retransmission, holding back every later packet queued behind it (head-of-line blocking). The retransmitted packet arrives 1–2 RTTs later, which is typically 50–200 ms. By then the application already needed to display frames N+1, N+2, N+3. The recovered frame shows up after its moment has passed: it is a glitch, not a help.

RTP over UDP drops the lost packet and moves on. The codec's error concealment or FEC covers the gap. The viewer sees a momentary artifact rather than a frozen screen waiting for a retransmission.
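
The deadline argument can be made concrete with a little arithmetic. The sketch below assumes the pessimistic end of the 1–2 RTT recovery range and treats the receiver's jitter buffer as the only slack available; `frames_missed` and its parameters are illustrative names, not part of any API.

```python
import math


def frames_missed(rtt_ms: float, jitter_buffer_ms: float,
                  rtts_to_recover: float = 2.0, fps: int = 30) -> int:
    """Frames whose display deadline passes before a retransmission arrives."""
    recovery_ms = rtts_to_recover * rtt_ms    # time until the retransmit lands
    late_ms = recovery_ms - jitter_buffer_ms  # the buffer absorbs part of it
    if late_ms <= 0:
        return 0                              # recovered before playout
    return math.ceil(late_ms * fps / 1_000)
```

With a 100 ms RTT and a 100 ms buffer, the retransmission lands three frames too late. With a 25 ms RTT it arrives in time, which hints at why the problem is the long-haul case rather than retransmission as such.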

QUIC, which runs over UDP, can offer per-stream reliability selectively — some systems use it for signaling while keeping media on plain UDP/RTP.

WebRTC and the browser stack

WebRTC is the browser-native API for real-time media. It provides:

| Component | What it does |
| --- | --- |
| getUserMedia | Capture camera / mic |
| RTCPeerConnection | Manage ICE, DTLS, codec negotiation, SRTP |
| ICE agent | Try direct, STUN-reflexive, and TURN-relay paths |
| DTLS | Handshake that derives SRTP keys (resists MITM as long as the certificate fingerprints travel over trusted signaling) |
| SRTP | Encrypts the RTP payload and authenticates header and payload |
| RTCP | Companion control channel: loss reports, jitter, REMB/TWCC |

In a production SFU deployment, each browser opens a single RTCPeerConnection to the SFU — not to each other participant. The SFU handles fanning out to everyone else.

NAT traversal: ICE, STUN, and TURN

flowchart TD
    A[Participant A<br/>behind NAT] --> STUN[STUN Server<br/>returns reflexive IP:port]
    B[Participant B<br/>behind NAT] --> STUN
    A <-->|"direct UDP if NAT permits"| B
    TURN[TURN Relay Server<br/>allocated relay address]
    TURN <-->|"relay when direct fails"| A
    TURN <-->|"relay when direct fails"| B
    style STUN fill:#0e7490,color:#fff
    style TURN fill:#a855f7,color:#fff

ICE candidate gathering works in three stages: host candidates (local IP addresses), server-reflexive candidates (your public IP:port as seen by the STUN server), and relay candidates (addresses allocated on the TURN server). ICE tries them in priority order via connectivity checks, then settles on the best working pair.
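
The "priority order" comes from a formula in RFC 8445 that combines a type preference, a local preference, and the component ID. The sketch below uses the type preferences the RFC recommends (host 126, peer-reflexive 110, server-reflexive 100, relay 0) and assumes a single network interface, so the local preference is fixed at 65535; the function name is illustrative.

```python
# Recommended type preferences from RFC 8445.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(kind: str, local_preference: int = 65_535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return ((TYPE_PREFERENCE[kind] << 24)
            + (local_preference << 8)
            + (256 - component_id))


def order_candidates(kinds: list) -> list:
    """Sort candidate types from most to least preferred."""
    return sorted(kinds, key=candidate_priority, reverse=True)
```

Sorting `["relay", "srflx", "host"]` yields host, then server-reflexive, then relay, which is why TURN is only used when the cheaper paths fail their connectivity checks.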

If the network changes mid-call — say, WiFi drops and the client switches to 4G — ICE restart re-negotiates the transport path without tearing down the call. The signaling server triggers an ICE restart, the client re-gathers candidates on the new interface, and media briefly pauses then resumes. This is essential for mobile continuity.

TURN relay capacity is the expensive piece. TURN servers handle full media bandwidth for all relayed clients, so at a typical 15–25% relay rate they represent a significant bandwidth cost. They are typically deployed at every PoP with aggressive auto-scaling.

What the ICE negotiation looks like

Before any media flows, the two endpoints need to discover each other. Here is the full handshake from click-to-join to first audio packet:

sequenceDiagram
    participant A as Participant A
    participant SIG as Signaling Server
    participant STN as STUN Server
    participant SFU as SFU

    A->>SIG: join room "xyz" (WebSocket)
    SIG->>A: room state + SFU address
    A->>STN: binding request
    STN->>A: reflexive IP:port
    A->>SIG: SDP offer (codecs + ICE candidates)
    SIG->>SFU: relay offer
    SFU->>SIG: SDP answer
    SIG->>A: SDP answer
    Note over A,SFU: ICE connectivity checks (STUN pings)
    A-->>SFU: RTP audio+video begins
    SFU-->>A: RTP from other participants

The signaling server only carries a few kilobytes of JSON at call setup. Once ICE completes, it steps back — all media flows directly between participant and SFU over UDP.

The signaling plane vs the media plane

These are completely separate concerns that are often confused:

| Aspect | Signaling plane | Media plane |
| --- | --- | --- |
| Purpose | Who is in the call, what codecs to use, ICE candidates | Actual audio/video data |
| Protocol | WebSocket / HTTPS (your own signaling server) | RTP over UDP |
| Message type | SDP offers/answers, ICE candidates, call events | RTP packets (continuous stream) |
| Latency tolerance | Seconds (setup only) | < 20 ms jitter tolerance |
| Volume | Tiny (a few KB at call setup) | Megabits per second |

sequenceDiagram
    participant A as Participant A
    participant SIG as Signaling Server
    participant SFU as SFU
    participant B as Participant B

    A->>SIG: join room "xyz" (WebSocket)
    SIG->>A: current participants list
    A->>SIG: SDP offer (codecs, ICE candidates)
    SIG->>SFU: relay offer
    SFU->>SIG: SDP answer
    SIG->>A: SDP answer
    Note over A,SFU: ICE connectivity checks over UDP
    A-->>SFU: RTP audio+video (continuous)
    SFU-->>B: RTP audio+video (forwarded)
    B-->>SFU: RTP audio+video (continuous)
    SFU-->>A: RTP audio+video (forwarded)

The signaling server is a relatively simple service — stateful per call, but never in the media hot path. It can be a horizontally scaled WebSocket cluster backed by a call-state store (Redis or a relational DB for durability).
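
To show how little state the signaling plane really needs, here is a minimal in-memory sketch. The message shape (`type`, `room`, `sfu`, `participants`) and the class names are hypothetical, and a real deployment would keep this state in Redis or a database and deliver the replies over WebSocket.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    sfu_address: str
    participants: set = field(default_factory=set)


class SignalingState:
    """In-memory call state: room membership plus the SFU each room is pinned to."""

    def __init__(self) -> None:
        self.rooms: dict = {}

    def join(self, room_id: str, user_id: str, sfu_address: str) -> dict:
        # The first joiner pins the room to an SFU; later joiners reuse it.
        room = self.rooms.setdefault(room_id, Room(sfu_address))
        others = sorted(room.participants)
        room.participants.add(user_id)
        return {"type": "joined", "room": room_id,
                "sfu": room.sfu_address, "participants": others}

    def leave(self, room_id: str, user_id: str) -> None:
        room = self.rooms.get(room_id)
        if room is None:
            return
        room.participants.discard(user_id)
        if not room.participants:
            del self.rooms[room_id]  # drop empty rooms
```

Nothing here touches RTP: the reply to a join is a few hundred bytes naming the SFU and the current participants, and media setup proceeds from there.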

The SFU in depth

The SFU is the core of the architecture. At the packet level, here is what happens on every received RTP packet:

  1. Receive an SRTP packet from participant A on an inbound UDP socket.
  2. Identify which RTP stream (SSRC) and which simulcast layer.
  3. Determine which participants need this packet, and which simulcast layer each should receive.
  4. Forward the packet to each destination participant's outbound socket, rewriting the RTP sequence number if switching layers (to avoid decoder gaps).
  5. Collect RTCP feedback (loss, jitter, REMB/TWCC) from each receiver and make forwarding decisions.

flowchart TD
    subgraph Inbound
        A[Participant A<br/>3 simulcast layers]
        B[Participant B]
        C[Participant C]
    end

    A -->|"720p SSRC-1<br/>360p SSRC-2<br/>180p SSRC-3"| SFU
    B --> SFU
    C --> SFU

    subgraph SFU Core
        SFU[SFU Media Server] --> FWD[Forwarding Engine<br/>per-receiver layer selection]
        FWD --> CCU[Congestion Controller<br/>reads RTCP REMB/TWCC]
        CCU --> FWD
    end

    subgraph Outbound
        FWD -->|"720p → A on good link"| OUT_A[Participant A]
        FWD -->|"180p → B on weak link"| OUT_B[Participant B]
        FWD -->|"360p → C on mid link"| OUT_C[Participant C]
    end

    style SFU fill:#ff6b1a,color:#0a0a0f
    style FWD fill:#15803d,color:#fff
    style CCU fill:#0e7490,color:#fff
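
Steps 2 and 4 of the packet path can be sketched in a few lines. The header layout is the fixed 12-byte RTP header from RFC 3550, which stays readable under SRTP because only the payload is encrypted. The `SeqRewriter` class is a hypothetical helper that assumes in-order packets and only handles sequence numbers; a real SFU also rewrites SSRC and timestamps.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than an RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": b0 >> 6, "marker": b1 >> 7, "payload_type": b1 & 0x7F,
            "seq": seq, "timestamp": timestamp, "ssrc": ssrc}


class SeqRewriter:
    """Keeps one receiver's outgoing sequence numbers contiguous across layer switches."""

    def __init__(self) -> None:
        self.offset = 0
        self.last_out = None

    def switch_source(self, first_seq: int) -> None:
        # Map the new layer's first packet onto the next expected number.
        if self.last_out is not None:
            self.offset = (self.last_out + 1 - first_seq) & 0xFFFF

    def rewrite(self, seq: int) -> int:
        out = (seq + self.offset) & 0xFFFF  # 16-bit wraparound
        self.last_out = out
        return out
```

After forwarding packets 100 and 101 from the 720p layer, calling `switch_source(5000)` makes the 360p layer's packet 5000 go out as 102, so the receiver's decoder never sees a gap.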

How the SFU picks a simulcast layer for each receiver

The layer decision happens continuously, not just at call start. Here is the logic:

flowchart TD
    RTT[TWCC feedback arrives<br/>from receiver R] --> BW[Estimate available<br/>bandwidth for R]
    BW --> CHK{Bandwidth vs<br/>current layer?}
    CHK -->|"well above current"| UP[Switch up to next layer<br/>wait for keyframe]
    CHK -->|"well below current"| DOWN[Switch down to lower layer<br/>request PLI from sender]
    CHK -->|"within margin"| HOLD[Hold current layer]
    DOWN --> PLI[Send PLI to sender<br/>request immediate keyframe]
    PLI --> KF[Keyframe arrives on<br/>lower layer SSRC]
    KF --> FWD[Begin forwarding<br/>lower layer to R]
    UP --> UPKF[Wait for keyframe<br/>on higher layer SSRC]
    UPKF --> FWD2[Begin forwarding<br/>higher layer to R]
    style DOWN fill:#ff2e88,color:#fff
    style UP fill:#15803d,color:#fff
    style FWD fill:#ff6b1a,color:#0a0a0f
    style FWD2 fill:#ff6b1a,color:#0a0a0f

Switching layers without a keyframe would hand the decoder delta frames that reference pictures it never received, producing corrupted output. The PLI (Picture Loss Indication) RTCP message asks the sender to produce a keyframe immediately on the target layer so the SFU can begin forwarding cleanly.
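
The decision flow can be sketched as a small state machine. This is an illustration under assumptions, not how any particular SFU does it: the 20% headroom before switching up, the class name, and the convention of returning the layer index that needs a PLI are all invented for the example.

```python
from typing import Optional

LAYERS_KBPS = [150, 500, 1_500]  # 180p, 360p, 720p


class LayerSelector:
    """Per-receiver layer choice with hysteresis and a keyframe gate."""

    def __init__(self, current: int = 2, up_margin: float = 1.2) -> None:
        self.current = current   # layer being forwarded right now
        self.pending = None      # layer we are waiting on a keyframe for
        self.up_margin = up_margin

    def on_estimate(self, kbps: float) -> Optional[int]:
        """Return the layer index to request a keyframe (PLI) for, if any."""
        target = self.current
        if kbps < LAYERS_KBPS[self.current]:
            # Step down to the highest lower layer that still fits.
            fitting = [i for i in range(self.current) if LAYERS_KBPS[i] <= kbps]
            target = fitting[-1] if fitting else 0
        elif (self.current + 1 < len(LAYERS_KBPS)
              and kbps >= LAYERS_KBPS[self.current + 1] * self.up_margin):
            target = self.current + 1  # only step up with headroom
        if target == self.current:
            self.pending = None
            return None
        if target == self.pending:
            return None  # keyframe already requested; do not flood the sender
        self.pending = target
        return target

    def on_keyframe(self, layer: int) -> None:
        # Only now is it safe to start forwarding the new layer.
        if layer == self.pending:
            self.current = layer
            self.pending = None
```

The selector keeps forwarding the old layer until `on_keyframe` fires for the pending one, which is the keyframe-boundary rule described above.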

Simulcast vs SVC

Simulcast has the sender encode three completely independent streams. The SFU forwards the whole stream of one layer. Simple for the SFU, but each sender pays extra encoding CPU — three encodes instead of one, though the lower-resolution layers (360p, 180p) are far cheaper than the top layer, so total CPU is meaningfully more than a single encode but well under 3×. The SFU must also handle 3 streams per sender even if only one is forwarded.
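
A quick sanity check on the "well under 3×" claim: if encoder work is taken as roughly proportional to pixels processed (a crude proxy, and an assumption of this sketch), the three layers add up to only about a third more than the top layer alone. The 16:9 frame sizes are assumed.

```python
# Assumed 16:9 frame sizes for the three simulcast layers.
LAYER_SIZES = {"720p": (1280, 720), "360p": (640, 360), "180p": (320, 180)}


def relative_pixel_load(sizes: dict = LAYER_SIZES) -> float:
    """Total pixels across all layers, relative to the largest layer alone."""
    pixels = [w * h for w, h in sizes.values()]
    return sum(pixels) / max(pixels)
```

`relative_pixel_load()` returns 1.3125, because each step down the ladder quarters the pixel count. Real encoder cost has fixed per-stream overhead on top, so the true multiplier sits somewhat higher.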

SVC (Scalable Video Coding) encodes a single stream with spatial and temporal layers baked in. The SFU can forward a subset of packets to produce a lower-quality decode without the sender sending separate streams. More bandwidth-efficient, but the SFU is more complex and encoder support across client devices is less uniform. H.264 SVC and VP9 SVC are the common choices.
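
Forwarding "a subset of packets" amounts to a filter on layer IDs. In the sketch below, `SvcPacket` and its fields are hypothetical stand-ins for values an SFU would read from the codec's RTP payload descriptor, and the filter assumes higher layers depend on all lower ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SvcPacket:
    spatial_id: int    # 0 = lowest resolution
    temporal_id: int   # 0 = lowest frame rate
    payload: bytes = b""


def select_svc_packets(packets: list, max_spatial: int, max_temporal: int) -> list:
    """Keep only packets at or below the receiver's target layers."""
    return [p for p in packets
            if p.spatial_id <= max_spatial and p.temporal_id <= max_temporal]
```

The sender encodes once; the SFU gives a weak receiver `select_svc_packets(stream, 0, 0)` and a strong one the full set.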

Most real-world SFU deployments use simulcast with H.264 or VP8 because codec support across clients is more uniform. VP9 SVC is used where client support is assured.

Adaptive bitrate and congestion control

Two complementary mechanisms keep quality smooth as network conditions change.

Sender-side congestion control (GCC / TWCC)

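The SFU sits on both sides of the feedback loop. Toward each receiver it acts as the sender: it collects packet arrival timestamps from the receiver via Transport-wide Congestion Control (TWCC) RTCP feedback and estimates that receiver's available bandwidth using a delay-gradient algorithm. Toward each publisher it acts as the receiver: it returns TWCC feedback on the packets it got, and the publisher's congestion controller adjusts its encoding bitrate accordingly.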

Google Congestion Control (GCC) is the widely-implemented reference algorithm described in the IETF RMCAT working group documents. The GCC Internet-Draft expired without becoming an RFC, but its delay-gradient approach is the de facto standard implemented in Chrome and other WebRTC stacks. TWCC is now preferred over the older REMB mechanism — REMB had the receiver estimate bandwidth, while TWCC provides raw per-packet timing data to the sender for more accurate estimation.
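
The delay-gradient idea can be shown with a deliberately simplified sketch. This is not GCC itself: the real algorithm filters the gradient, adapts its threshold, and bases decreases on the measured receive rate. The fixed 5 ms threshold and the 85% / 105% step sizes here are illustrative assumptions.

```python
def delay_gradients(send_ms: list, recv_ms: list) -> list:
    """Per-packet change in one-way delay: inter-arrival gap minus inter-send gap."""
    count = min(len(send_ms), len(recv_ms))
    return [(recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
            for i in range(1, count)]


def next_bitrate_kbps(current_kbps: int, gradients: list,
                      overuse_ms: float = 5.0) -> int:
    """Back off when delay keeps growing, otherwise probe upward."""
    if not gradients:
        return current_kbps
    mean = sum(gradients) / len(gradients)
    if mean > overuse_ms:
        return current_kbps * 85 // 100   # queues are building: back off
    return current_kbps * 105 // 100      # path looks clear: probe upward
```

Packets sent 20 ms apart that arrive 28 ms apart give a gradient of +8 ms each: a queue is building somewhere on the path, and the sender should reduce its rate before loss appears.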

Receiver-side layer switching (at the SFU)

When the congestion controller determines a receiver's available bandwidth has dropped, the SFU switches from the 720p simulcast layer to the 360p layer at a keyframe boundary. The SFU sends a PLI RTCP message to request an immediate keyframe on the target layer, waits for it to arrive, then begins forwarding that lower layer, rewriting RTP sequence numbers and timestamps to avoid discontinuities.

Some SFU implementations send FIR (Full Intra Request) rather than PLI, and implementations vary in how they debounce these requests to avoid flooding the sender in large calls.

Packet loss concealment and FEC

No matter how good congestion control is, loss bursts happen. Two mitigations work together.

FEC (Forward Error Correction) has the sender add redundant packets so the receiver can reconstruct lost ones without a retransmission. RED+ULPFEC is the scheme most commonly deployed for video in WebRTC; Opus includes built-in FEC at the codec level for audio.

The jitter buffer on each receiver holds 50–150 ms of packets to smooth arrival jitter and allow FEC recovery. A bigger buffer means smoother playback but higher latency, so video conferencing uses shorter buffers than streaming because interactivity matters more than smoothness.
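
The core trick behind XOR-based FEC fits in a few lines. This sketch protects a group of packets with a single parity packet and recovers one loss. It leaves out what real ULPFEC (RFC 5109) adds, such as header protection and length recovery, so the missing packet's length is passed in explicitly.

```python
def xor_parity(packets: list) -> bytes:
    """XOR all packets together, padding shorter ones with zeros."""
    parity = bytearray(max(len(p) for p in packets))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: list, parity: bytes, missing_len: int) -> bytes:
    """Rebuild the single lost packet from the survivors and the parity packet."""
    rebuilt = bytearray(parity)
    for packet in received:
        for i, byte in enumerate(packet):
            rebuilt[i] ^= byte  # XOR-ing a survivor cancels it out of the parity
    return bytes(rebuilt[:missing_len])
```

One parity packet per group costs extra bandwidth up front but repairs any single loss in that group with no round trip, which is the trade that makes sense when retransmission would arrive too late.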

Geographic edge architecture

Placing SFUs near participants takes the biggest latency contributor, trans-continental RTT, off the client-to-SFU path.

flowchart TD
    subgraph "US-West PoP"
        SFU_W[SFU cluster<br/>US-West]
    end
    subgraph "EU-Central PoP"
        SFU_E[SFU cluster<br/>EU-Central]
    end
    subgraph "AP-Southeast PoP"
        SFU_AP[SFU cluster<br/>AP-Southeast]
    end

    UA[User in SF] --> SFU_W
    UB[User in Berlin] --> SFU_E
    UC[User in Singapore] --> SFU_AP

    SFU_W <-->|"SFU cascade<br/>inter-PoP UDP tunnel"| SFU_E
    SFU_E <-->|"SFU cascade"| SFU_AP
    SFU_W <-->|"SFU cascade"| SFU_AP

    SIG[Signaling Cluster<br/>globally load-balanced] -.-> SFU_W
    SIG -.-> SFU_E
    SIG -.-> SFU_AP

    style SFU_W fill:#ff6b1a,color:#0a0a0f
    style SFU_E fill:#ff6b1a,color:#0a0a0f
    style SFU_AP fill:#ff6b1a,color:#0a0a0f
    style SIG fill:#0e7490,color:#fff

SFU cascading is how cross-region calls work efficiently. Each PoP's SFU receives its local participants' streams and forwards one uplink copy to each other PoP's SFU. The remote PoPs then fan out to their local participants. With users in SF and Berlin, each stream crosses the Atlantic once, not once per remote participant.
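
The saving is easy to quantify. The sketch below counts stream copies on one inter-PoP link in one direction; the non-cascaded case assumes every remote participant connects to the same far-away SFU. The function name is illustrative.

```python
def inter_pop_streams(senders_here: int, receivers_there: int,
                      cascaded: bool) -> int:
    """Stream copies crossing the link from this PoP to one remote PoP."""
    if receivers_there == 0:
        return 0  # nobody to deliver to, so nothing crosses
    if cascaded:
        return senders_here  # one copy per sender, fanned out remotely
    return senders_here * receivers_there  # one copy per sender per receiver
```

For five senders in SF and five receivers in Berlin, cascading sends 5 copies across the Atlantic instead of 25.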

Recording pipeline

Server-side recording avoids requiring any client capability:

flowchart LR
    SFU[SFU] -->|"RTP copy"| REC[Recording Service]
    REC --> MUX[Mux/Remux<br/>RTP → WebM/MP4]
    MUX --> OBJ[(Object Store<br/>S3-compatible)]
    OBJ --> TRANS[Transcoder<br/>standardize codecs]
    TRANS --> CDN[CDN for playback]
    style SFU fill:#ff6b1a,color:#0a0a0f
    style OBJ fill:#15803d,color:#fff
    style CDN fill:#0e7490,color:#fff

The SFU sends a copy of RTP streams to the recording service. The recorder assembles RTP into a container format (WebM or MP4) and writes to object storage. A separate transcoding pass standardizes codecs, generates multiple resolutions, and produces HLS/DASH manifests for on-demand playback. The recording path is independent of the live media path — it adds no latency to participants.

Topology comparison

| Topology | CPU per call | Bandwidth per sender | Quality control | Practical limit |
| --- | --- | --- | --- | --- |
| Mesh (P2P) | None (server) | O(N) upload per client | Per-track at client | ~4–6 participants |
| MCU | O(N) decode + encode per server | 1 stream up + 1 down | Server controls mix | ~5–10 calls/server (10-participant calls) |
| SFU | O(1) route/forward per packet | 1 stream up, N−1 streams down | Per-track at client | ~40–100 calls/server (32-core node) |
| SFU + Simulcast | O(1) route + layer select | 3 streams up (simulcast layers), adaptive down | Per-receiver, adaptive | Production standard |

Failure modes and mitigations

SFU node failure

A participant connected to a dead SFU node must reconnect. The signaling server tracks which SFU each session uses; on health-check failure it tells the affected clients to set up a fresh connection to a healthy SFU. Reconnection target is under 5 seconds. Sessions are sticky to an SFU for their lifetime, so failover is a re-setup, not a hot migration.
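
The redirect decision itself can be very small. The sketch below picks the least-loaded healthy node; that policy, the `SfuNode` fields, and the function name are assumptions for illustration, and a real scheduler would likely also weigh region and remaining capacity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SfuNode:
    address: str
    healthy: bool
    active_calls: int


def pick_failover(nodes: list, failed_address: str) -> Optional[SfuNode]:
    """Choose the least-loaded healthy SFU, excluding the node that just failed."""
    candidates = [n for n in nodes
                  if n.healthy and n.address != failed_address]
    return min(candidates, key=lambda n: n.active_calls, default=None)
```

A `None` result means the PoP has no spare capacity, at which point the signaling server would have to send clients to another region.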

Packet loss burst

How the system reacts depends on how severe the loss is:

  • Short burst under 1%: FEC recovers without any visible artifact.
  • Medium burst between 1% and 5%: the jitter buffer and codec PLC (packet loss concealment) bridge the gap with minor quality degradation.
  • Sustained loss above 10%: the congestion controller drops to a lower simulcast layer.

If loss persists at all layers, the call freezes momentarily, then the congestion controller ramps back up when conditions improve.

A slow participant dragging quality

In an SFU without simulcast, every receiver gets the same single stream, so one participant on a slow link either freezes or forces the sender to lower its bitrate for everyone. With simulcast, the SFU forwards the high layer to fast receivers and the low layer to the slow one. Each receiver is independent. The person on spotty WiFi does not affect the experience of anyone else on the call.

Network change mid-call (WiFi → 4G)

ICE restart re-negotiates the transport path without tearing down the call. The signaling server signals an ICE restart; the client re-gathers candidates on the new interface and re-runs connectivity checks. Media interruption is typically under a few seconds with a well-implemented ICE restart, though the exact duration is highly implementation- and network-dependent.

Region failover

Signaling state (call membership, participant list) is stored in a durable, replicated store. If a region goes down, the signaling cluster in another region handles new joins. Existing sessions in the failed region reconnect as above; their state is recovered from the replicated store.

Storage choices

| Data | Store | Why |
| --- | --- | --- |
| Call state (participants, SDP) | Redis / Postgres | Low latency reads; call state is small |
| User accounts / auth | Postgres | Strong consistency; ACID for billing |
| TURN allocation state | In-memory on TURN server | Ephemeral; lost on restart harmlessly |
| SFU session state | In-memory on SFU | Packet-path must be zero-copy, not DB-bound |
| Recording (raw RTP) | Object store (S3) | Write-once, large, cheap |
| Recording (playback) | CDN + object store | High read fan-out |
| Analytics / QoS metrics | Time-series DB (e.g. ClickHouse) | High ingest rate, aggregation queries |

Things to discuss in an interview

  • Why UDP over TCP for media: retransmission latency exceeds inter-frame deadlines; late packets are useless.
  • SFU vs MCU: SFU avoids decoding so CPU scales with connections, not transcoding work.
  • Simulcast and layer selection: how the SFU picks the right quality tier per receiver using TWCC feedback.
  • ICE/STUN/TURN: the full traversal stack, TURN capacity as a cost center, ICE restart for network changes.
  • Signaling vs media plane separation: the signaling server only coordinates, never touches RTP.
  • SFU cascading: how multi-region calls share bandwidth efficiently without sending N streams per link.
  • Recording without adding latency: RTP copy, off-path assembly, object storage.

Things you should now be able to answer

  • Why does video conferencing use UDP instead of TCP?
  • What is the difference between an SFU and an MCU, and when would you use each?
  • How does simulcast let an SFU serve a heterogeneous set of receivers without re-encoding?
  • What does ICE do, and why do some clients always need a TURN relay?
  • How does the SFU switch a receiver from 720p to 360p without corrupting the decoder?
  • How does a call stay alive when a participant switches from WiFi to 4G?
  • How does server-side recording avoid adding latency to the live call?

Further reading

  • WebRTC specification — w3.org/TR/webrtc
  • RFC 3550 — RTP: A Transport Protocol for Real-Time Applications
  • RFC 5245 / RFC 8445 — Interactive Connectivity Establishment (ICE)
  • "Jitsi Meet architecture" — jitsi.org (open-source SFU reference)
  • "mediasoup" — open-source SFU implementation (Node.js + C++ routing core)
  • Google Congestion Control (GCC) — draft-ietf-rmcat-gcc-02 (expired IETF Internet-Draft, last revised 2016; not published as an RFC but widely implemented in WebRTC stacks)
  • RFC 5109 — RTP Payload Format for Generic Forward Error Correction

Frequently asked questions

What is an SFU and how does it differ from an MCU?

An SFU (Selective Forwarding Unit) receives each participant's stream and routes RTP packets to all other participants without decoding or re-encoding — CPU cost scales with connections, not transcoding work. An MCU (Multipoint Control Unit) decodes all incoming streams, composites them into a single mixed stream, and re-encodes one stream per outbound participant, making it CPU-bound to roughly 5-10 concurrent calls of ten-participant size on a single server.

Why does video conferencing use UDP instead of TCP?

When TCP retransmits a lost video packet, it arrives 1-2 RTTs (50-200 ms) later — by which point the decoder already needed to display subsequent frames, making the retransmitted packet a glitch rather than a recovery. RTP over UDP drops the lost packet and lets codec error concealment or FEC cover the gap, producing a momentary artifact rather than a frozen screen waiting on retransmission.

What is simulcast and when should an SFU use it over SVC?

Simulcast has each sender encode three independent resolution layers (e.g., 720p at 1.5 Mbps, 360p at 500 kbps, 180p at 150 kbps) so the SFU can forward the appropriate layer per receiver without re-encoding. SVC encodes a single stream with spatial and temporal layers baked in, allowing the SFU to forward a packet subset for lower quality — more bandwidth-efficient but more complex and with less uniform codec support across clients. Most real-world SFU deployments use simulcast with H.264 or VP8 because cross-client codec support is more uniform.

What fraction of clients need a TURN relay, and why is TURN expensive?

Approximately 15-25% of clients require TURN relaying because symmetric NAT or strict corporate firewalls block direct UDP paths; STUN resolves the remaining 75-85%. TURN is expensive because the relay servers carry full media bandwidth for all relayed clients: at 15% TURN usage across 1 million concurrent calls of 10 participants each, that amounts to roughly 2.4 Tbps of relay bandwidth.

How does the SFU switch a receiver from 720p to 360p without corrupting the decoder?

The SFU sends a PLI (Picture Loss Indication) RTCP message to the sender requesting an immediate keyframe on the target lower-resolution layer, waits for that keyframe to arrive, then begins forwarding the lower simulcast layer while rewriting RTP sequence numbers and timestamps to avoid discontinuities. Switching without a keyframe would hand the decoder an I-frame-less stream, producing corrupted output.
