
Advanced · asked at Meta, Discord, Slack, Telegram

Design WhatsApp / Chat System

Q: Why does a chat system at WhatsApp scale need WebSocket rather than HTTP polling?

HTTP polling introduces up to 2 seconds of delivery lag and, at 1M users polling every 2 seconds, generates 500k QPS of mostly-empty reads against the database. WebSocket holds a persistent TCP connection per user, dropping delivery latency to roughly 50ms and eliminating the empty-poll load entirely.

Q: How do you scale to 500 million concurrent WebSocket connections?

You split responsibility into gateways that own the sockets and a stateless message service. A single gateway can hold about 100k connections with tuned TCP buffers and an epoll event loop, so reaching 500M concurrent connections requires a fleet of 5,000+ gateways. A Redis session router maps each user ID to the specific gateway holding that user's socket, so any message service instance can route to any device without knowing the socket itself.

Q: When should group fanout use synchronous dispatch versus async Kafka?

Synchronous batched fanout is manageable for groups up to roughly 256 members: batch the Redis lookups with a pipeline, dispatch RPC calls concurrently, and the entire fanout completes in tens of milliseconds. At 1,024 members (WhatsApp's current maximum) even concurrent dispatch stalls the message service thread and a single slow gateway can delay delivery to all other members, so a Kafka-backed async fanout is used instead, delivering to online members within roughly 500ms.

Q: How does presence work at 17 million writes per second?

Each client sends a heartbeat every 30 seconds that executes a Redis SET with a 60-second TTL (SET online:uid 1 EX 60). No explicit disconnect event is needed: if the TTL expires without a new heartbeat, the user is considered offline, which handles crashes and network drops automatically. Read-on-demand (checking the key only when a chat is opened) is the production approach for presence fan-out, because pub/sub at 500 contacts times 500 million users would require 250 billion subscriptions.

Q: How does Signal Protocol group encryption avoid per-member re-encryption?

WhatsApp uses sender keys: the sender distributes a shared symmetric key to each group member, encrypted pairwise via the Double Ratchet. Thereafter the sender encrypts each message once with that symmetric key, and the server fans out that single ciphertext to all members. The sender key is rotated when group membership changes or periodically to preserve forward secrecy.

Realtime 1:1 and group messaging at billions-of-users scale. WebSocket gateways, message store, presence, end-to-end encryption.

16 min read · 2026-02-09 · Ironclad Academy

#interview #realtime #websocket #messaging #cassandra #kafka

The problem

WhatsApp delivers roughly 100 billion messages a day to over 2 billion users — and does it with a remarkably lean engineering team. At its core, a chat system is simple: User A types a message, User B receives it. But that simplicity is deceptive. Getting a message from one phone to another in under 200 milliseconds, reliably, while 500 million other users are doing the same thing simultaneously, is where the real engineering begins.

The central challenge is maintaining a persistent connection per user at massive scale. Unlike a web request that opens, fetches, and closes, chat clients hold a long-lived WebSocket open the entire time the app is in the foreground. That means 500 million simultaneous open TCP connections at peak — requiring thousands of gateway servers just to hold the sockets, before you write a single message to disk. Systems like Messenger, Telegram, Discord, and Slack all face this same constraint in their own way.

The second tension is durability vs. latency. Users expect messages to arrive fast, but they also expect messages to never disappear — even if the recipient's phone is off when the message is sent, or the sender's connection drops mid-send. Writing to durable storage before acknowledging the sender adds a database round-trip to every message; skipping that write risks losing data. Every major chat system resolves this with a "persist before ack" rule, but the storage and sharding choices that make that sustainable at 1.2 million messages per second are non-trivial.

Group chat adds a third dimension: fanout. Sending to one person is an O(1) lookup. Sending to a 1,024-member group means 1,024 lookups and 1,024 pushes — synchronously, that stalls the sender's request; asynchronously, you introduce delivery lag and ordering complexity. Presence, typing indicators, and read receipts each carry their own write-amplification traps. This article walks through each of those problems and shows how production systems like WhatsApp solve them.

Functional requirements

1:1 messaging.
Group messaging (up to 1,024 members; 256 is the classic interview parameter and was WhatsApp's actual limit until 2022, when it was raised to 512 and then to 1,024 later that year).
Online / typing / read indicators.
Message delivery (sent / delivered / read receipts).
Multi-device support — message arrives on all your devices.
Persistence — messages stored, last 30 days available on demand.
(Stretch) Voice/video calls, attachments, end-to-end encryption.

Non-functional requirements

p99 delivery latency < 200ms when both online.
99.99% availability.
Durable for 30 days (or until user deletes).
Encrypted in transit; ideally end-to-end.

Capacity

Dimension	Estimate	How we got there
DAU	2B users	Given
Message volume	100B/day	Given
Write throughput (avg)	~1.2M msg/s	`100B ÷ 86,400 s`
Write throughput (peak)	~3M msg/s	~2.5× average
Avg message size	100 bytes	Given
Daily write volume	10 TB/day	`100B msgs × 100 bytes`
30-day message store	300 TB	`10 TB/day × 30 days`
Peak concurrent connections	~500M	~500M online at peak → 500M open WebSockets

Takeaway: 500M open WebSockets is the central scaling challenge — it drives the entire gateway fleet design.
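
The table's arithmetic can be reproduced in a few lines. This is a back-of-envelope sketch using only the figures given above, plus the 100k-sockets-per-gateway figure the article uses later; decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Back-of-envelope capacity check using the figures from the table above.
SECONDS_PER_DAY = 86_400

messages_per_day = 100e9        # given
avg_message_bytes = 100         # given
peak_factor = 2.5               # peak-to-average ratio from the table
retention_days = 30
peak_connections = 500e6
sockets_per_gateway = 100_000   # per-gateway figure used later in the article

avg_write_rate = messages_per_day / SECONDS_PER_DAY
peak_write_rate = avg_write_rate * peak_factor
daily_bytes = messages_per_day * avg_message_bytes
store_bytes = daily_bytes * retention_days
gateways_needed = peak_connections / sockets_per_gateway

print(f"avg writes:  {avg_write_rate / 1e6:.2f}M msg/s")
print(f"peak writes: {peak_write_rate / 1e6:.2f}M msg/s")
print(f"daily volume: {daily_bytes / 1e12:.0f} TB")
print(f"30-day store: {store_bytes / 1e12:.0f} TB")
print(f"gateways:     {gateways_needed:.0f}")
```

The exact average is about 1.16M msg/s and the peak about 2.9M msg/s; the table rounds these to ~1.2M and ~3M.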

Building up to the design

Before drowning in WebSocket gateways and session routers, walk the design forward from the dumbest possible chat app and watch each piece appear when it has to.

V1: HTTP polling

Two endpoints:

POST /send          { to, text }     → INSERT INTO messages
GET  /messages?since=ts              → SELECT … WHERE recipient = me

The client polls every 2s. Server stores messages in a single Postgres table. You get a working chat app for two friends — easy to build, easy to debug, no realtime gymnastics.

What breaks it: latency (2s to "deliver") and load. At 1M users polling each 2s, that's 500k QPS of mostly-empty reads hitting the database.

V2: Long polling, then WebSockets

The client opens a connection and waits for the server to push when there's a message. Eventually you switch from long polling to WebSockets — the server holds an open TCP/TLS socket per user and pushes messages directly. Delivery latency drops to ~50ms; the empty-poll load disappears.

That immediately surfaces a new problem: one process can hold tens of thousands of sockets, not millions. And how does User A's server know which server User B is connected to?

V3: WebSocket gateways + a session router

Split into "gateways" (own the sockets) and a "message service" (owns the logic). A Redis-backed session router maps user_id → gateway_id. When A's gateway receives a message for B, the message service looks up B's gateway and forwards via internal RPC.

flowchart LR
    A[User A] --> GW1[Gateway A]
    GW1 --> MSG[Msg Svc]
    MSG --> ROUTER[(Session Router<br/>user→gateway)]
    MSG --> GW2[Gateway B]
    GW2 --> B[User B]
    style ROUTER fill:#15803d,color:#fff

With ~5,000 gateways × 100k sockets each, you reach 500M concurrent connections. The message service stays stateless: it never holds a socket, it only asks the router which gateway does.

Which exposes the next gap: messages sent while User B is offline disappear. A network blip on send means the user has to guess whether it went through.

V4: Persist before ack, idempotent send

Write every message to the messages DB before acknowledging the sender. Each message carries a client-generated UUID; resends with the same UUID are deduped. Offline recipients pull undelivered messages on reconnect. Now you have durability, retries-without-dupes, and the "sent / delivered / read" state machine becomes possible — each transition is just a metadata write.
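
A minimal sketch of the persist-before-ack rule with idempotent retries. `MessageStore` and `handle_send` are hypothetical names, and a dict stands in for the messages DB; a real store would need a durable, atomic insert-if-absent.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class MessageStore:
    """In-memory stand-in for the messages DB (hypothetical, for illustration)."""
    rows: dict = field(default_factory=dict)   # client_uuid -> stored message

    def insert_if_absent(self, client_uuid, message):
        """Return True if this call stored the message, False for a duplicate."""
        if client_uuid in self.rows:
            return False
        self.rows[client_uuid] = message
        return True


def handle_send(store, client_uuid, sender, recipient, text):
    """Persist first, then ack. A retry with the same UUID gets the same ack."""
    stored = store.insert_if_absent(
        client_uuid, {"from": sender, "to": recipient, "text": text}
    )
    # The ack is built only after the write returned; duplicates are acked
    # without a second write.
    return {"ack": client_uuid, "duplicate": not stored}


store = MessageStore()
msg_id = str(uuid.uuid4())
first = handle_send(store, msg_id, "alice", "bob", "hi")
retry = handle_send(store, msg_id, "alice", "bob", "hi")   # simulated retry
```

The retry is acknowledged like the original, so the client stops resending, but only one row exists in the store.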

That brings groups into focus. Routing one message to 256 recipients means 256 session-router lookups + 256 gateway pushes. Batched and parallelized, that's manageable. At 1,024 members (WhatsApp's current max) or larger channels, even concurrent dispatch is slow enough that synchronous fanout causes a stutter every time someone posts.

V5: Fanout service + async delivery

For groups, a separate fanout step enumerates members, dedupes per device, and pushes each one through the right gateway. Offline members get a push notification via APNS/FCM; the message is waiting when they reconnect. Now 1:1 and group chat both work, online and offline, and multi-device falls out naturally — each device is just another socket.

Presence is the last rough edge: "is X online?" and typing indicators are a huge write load if done naively. Every keystroke should not be a DB write.

V6: Presence with TTLs in Redis, plus everything else

Presence becomes SET online:user_X 1 EX 60 with a 30s heartbeat. Read receipts compress to "last-read message ID per (member, conversation)" instead of one row per message.

This is roughly where the diagram below picks up — the production system is V5 + V6 + sharded message storage + E2EE on top.

flowchart LR
    V1[V1: HTTP polling<br/>2s lag, DB melts] --> V2[V2: WebSockets<br/>one server scaling cap]
    V2 --> V3[V3: gateways + router<br/>500M sockets]
    V3 --> V4[V4: persist + idempotency<br/>offline + retries]
    V4 --> V5[V5: fanout + push<br/>groups + mobile]
    V5 --> V6[V6: presence TTLs + E2EE<br/>production WhatsApp]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    A[User A device] -->|WebSocket| GW1[WS Gateway A]
    B[User B device] -->|WebSocket| GW2[WS Gateway B]

    GW1 -->|inbound msg| MSG[Message Service]
    GW2 -->|inbound msg| MSG

    MSG -->|"lookup user→gateway"| ROUTER[(Session Router<br/>Redis)]
    MSG --> MSGDB[(Messages DB)]
    MSG --> KAFKA[Kafka — async fanout]

    KAFKA --> FAN[Fanout Workers]
    FAN -->|"lookup user→gateway"| ROUTER
    FAN -->|push| GW1
    FAN -->|push| GW2

    MSG --> PRES[Presence Service]
    PRES --> PRDB[(Presence — Redis)]

    A -.media.-> S3[(S3 / blob)]
    S3 -.notify.-> MSG

    style GW1 fill:#ff6b1a,color:#0a0a0f
    style GW2 fill:#ff6b1a,color:#0a0a0f
    style MSG fill:#0e7490,color:#fff
    style ROUTER fill:#15803d,color:#fff
    style FAN fill:#a855f7,color:#fff

The WebSocket gateway

Every online user holds an open connection to a gateway. The gateway owns the TCP/TLS connection, knows which socket belongs to which user, forwards inbound messages to the message service, and pushes outbound messages down to the right socket.

A single gateway can hold ~100k connections with tuned TCP buffers and an epoll/kqueue event loop. To handle 500M simultaneous users, you need 5,000+ gateways.

Where does each user's socket live?

A session router (Redis) maps user_id → gateway_id:

SET sess:user_42 gw-1234 EX 3600

When User A sends a message to User B:

1. Gateway A receives, forwards to Message Svc.
2. Message Svc looks up sess:user_B → gw-1234.
3. Message Svc sends to gw-1234 via internal RPC.
4. gw-1234 pushes to User B over WebSocket.

sequenceDiagram
    participant A as User A
    participant GW1 as Gateway A
    participant MSG as Msg Svc
    participant SR as Session Router
    participant GW2 as Gateway B
    participant B as User B
    A->>GW1: send(text)
    GW1->>MSG: SendMessage
    MSG->>SR: where is user B?
    SR-->>MSG: gw-1234
    MSG->>GW2: deliver
    GW2->>B: push(text)
    GW2-->>MSG: delivered
    MSG-->>GW1: ack
    GW1-->>A: ✓
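
The routing steps above can be sketched as follows. Everything here is an in-memory stand-in: `SessionRouter` plays the role of the Redis mapping, `Gateway.push` replaces the internal RPC plus WebSocket write, and all class and function names are hypothetical.

```python
class SessionRouter:
    """In-memory stand-in for the Redis session router (user_id -> gateway_id)."""

    def __init__(self):
        self._sessions = {}

    def register(self, user_id, gateway_id):
        self._sessions[user_id] = gateway_id

    def lookup(self, user_id):
        return self._sessions.get(user_id)


class Gateway:
    """Owns the sockets; a list per user stands in for a real WebSocket."""

    def __init__(self, gateway_id, router):
        self.gateway_id = gateway_id
        self.router = router
        self.sockets = {}

    def connect(self, user_id):
        self.sockets[user_id] = []
        self.router.register(user_id, self.gateway_id)

    def push(self, user_id, text):
        self.sockets[user_id].append(text)


def route_message(router, gateways, recipient, text):
    """Steps 2-4: look up the recipient's gateway, then forward to it."""
    gateway_id = router.lookup(recipient)
    if gateway_id is None:
        return False          # recipient offline; caller falls back to push/queue
    gateways[gateway_id].push(recipient, text)
    return True


router = SessionRouter()
gateways = {gid: Gateway(gid, router) for gid in ("gw-0001", "gw-1234")}
gateways["gw-1234"].connect("user_B")

delivered = route_message(router, gateways, "user_B", "hello")
missed = route_message(router, gateways, "user_C", "anyone there?")
```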

What if User B is offline?

The message is persisted before the ack goes back to A. When B reconnects, the gateway registers a new session-router entry and the message service drains any queued messages down to the fresh socket.

flowchart LR
    A[User A sends] --> MSG[Msg Svc]
    MSG --> DB[(Messages DB<br/>write first)]
    DB --> ACK[Ack to A]
    MSG -->|"B offline"| PUSH[Push Svc]
    PUSH --> APNS[APNS / FCM]
    APNS -->|"B opens app"| RECONNECT[B reconnects]
    RECONNECT --> MSG
    MSG -->|"drain queued"| B[User B]
    style DB fill:#ff6b1a,color:#0a0a0f
    style PUSH fill:#ffaa00,color:#0a0a0f
    style RECONNECT fill:#15803d,color:#fff
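
A small sketch of the queue-then-drain behaviour, assuming a per-user queue of already-persisted messages. `OfflineInbox` is a hypothetical name; a production system would clear an entry only after the device acknowledges delivery, which this sketch skips for brevity.

```python
from collections import defaultdict, deque


class OfflineInbox:
    """Per-user queue of messages that were persisted but not yet delivered."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def enqueue(self, user_id, message):
        self._pending[user_id].append(message)

    def drain(self, user_id):
        """Return queued messages oldest-first and clear the queue."""
        return list(self._pending.pop(user_id, ()))


inbox = OfflineInbox()
inbox.enqueue("user_B", "msg 1")      # B is offline: queue instead of pushing
inbox.enqueue("user_B", "msg 2")

on_reconnect = inbox.drain("user_B")  # B's new socket receives the backlog
second_drain = inbox.drain("user_B")  # nothing left to deliver
```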

Message storage

You need fast inserts (~3M/sec), fast reads (load chat history), and per-conversation locality. The schema is straightforward:

messages (
  conversation_id    -- shard key: hash to one node
  message_id         -- sortable, snowflake-style
  sender_id
  payload            -- text or media reference
  created_at
)

Sharding by conversation_id keeps all messages for a chat on one node and makes "load the latest 50 messages" a single sequential read. Cassandra is the canonical answer here — high write throughput, time-ordered access, and natural wide-row layout per conversation.
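
To illustrate the shard-key idea, here is a minimal sketch that maps a conversation to a shard by hashing its ID. The function name and the modulo scheme are assumptions for illustration; Cassandra itself places partitions on a token ring rather than using a plain modulo, which avoids reshuffling most keys when nodes are added.

```python
import hashlib


def shard_for(conversation_id: str, num_shards: int) -> int:
    """Map a conversation to a shard; every message in it lands on the same one."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


shard_a = shard_for("conv:alice:bob", 1024)
shard_b = shard_for("conv:alice:bob", 1024)   # same conversation, same shard
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()`, because the built-in is randomized per process for strings.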

WhatsApp takes a different path. They use Mnesia (Erlang's distributed in-memory store with optional disk persistence) for hot routing state — offline message queues, session tables, group membership — and a custom column store for message archives. Mnesia is primarily RAM-based; it earns its place through tight integration with the Erlang runtime, not storage capacity.

1:1 vs group

For 1:1, the recipient lookup is O(1). For groups, it's O(N) — fan out to every member.

flowchart LR
    A[Sender] --> MSG[Msg Svc]
    MSG --> GROUP[Group Service]
    GROUP -->|fetch members| GDB[(Groups DB)]
    GDB --> MSG
    MSG -->|"route to each member's gateway"| FAN[Fanout]
    FAN --> GW1[Gateway 1]
    FAN --> GW2[Gateway 2]
    FAN --> GWN[Gateway N]
    style FAN fill:#ff6b1a,color:#0a0a0f

For small groups (up to ~256 members), synchronous fanout is manageable — batch the Redis lookups with a pipeline or MGET, then dispatch the RPC calls concurrently. In that mode, the entire 256-member fanout stays in the tens of milliseconds. At 1,024 members (WhatsApp's current maximum) even concurrent dispatch is slow enough that the message service thread stalls, and a single sluggish gateway can hold up delivery to all other members.

Async fanout via Kafka fixes this for large groups:

Message service writes to Cassandra and publishes one event to Kafka.
A pool of fanout workers consumes the event, pages through group membership in batches, and dispatches per-gateway pushes.
Online members get a push within ~500ms; offline members get a push notification.

The resulting asymmetry is observable in WhatsApp itself: group messages show "sent" immediately, but "delivered" ticks roll in over ~500ms as members receive them. That's the fanout workers paging through the member list.
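
One way a fanout worker might turn a member list into per-gateway pushes is sketched below. `plan_fanout` is a hypothetical helper, and the `sessions` dict stands in for the result of a batched session-router lookup.

```python
from collections import defaultdict


def plan_fanout(member_ids, sessions):
    """Group online members by gateway; collect offline members separately.

    sessions maps user_id -> gateway_id for users who currently have a socket.
    """
    by_gateway = defaultdict(list)
    offline = []
    for member in member_ids:
        gateway_id = sessions.get(member)
        if gateway_id is None:
            offline.append(member)        # gets a push notification instead
        else:
            by_gateway[gateway_id].append(member)
    return dict(by_gateway), offline


members = ["u1", "u2", "u3", "u4"]
sessions = {"u1": "gw-1", "u2": "gw-2", "u3": "gw-1"}
pushes, offline = plan_fanout(members, sessions)
```

Grouping this way means one RPC per gateway rather than one per member, so a group whose members share gateways costs fewer calls than its size suggests.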

Multi-device

A user has phone + laptop + tablet, all needing the same message. The two approaches are per-device sockets (each device opens its own WebSocket and the fanout delivers to all of them) versus primary-device sync (your phone is the source of truth and other devices pull from it, which is how WhatsApp worked until its multi-device rollout began in 2021).

Modern systems, including Telegram, Discord, Slack, and post-2021 WhatsApp, all use per-device sockets. Each device is just another entry in the session router; fanout iterates over all devices for a user exactly as it iterates over all members of a group.

flowchart LR
    MSG[Msg Svc] -->|"lookup all devices for user B"| SR[(Session Router)]
    SR --> D1[Phone GW]
    SR --> D2[Laptop GW]
    SR --> D3[Tablet GW]
    D1 --> P[Phone]
    D2 --> L[Laptop]
    D3 --> T[Tablet]
    style SR fill:#15803d,color:#fff
    style MSG fill:#0e7490,color:#fff

Delivery guarantees

sequenceDiagram
    participant A
    participant Server
    participant B
    A->>Server: send msg(id=42)
    Server->>A: ACK_SAVED (✓)
    Server->>B: deliver msg(42)
    B->>Server: ACK_DELIVERED (✓✓)
    Server->>A: status: delivered
    Note over B: user opens chat
    B->>Server: ACK_READ (✓✓ blue)
    Server->>A: status: read

The classic three states: sent (server accepted), delivered (recipient device received), read (recipient opened the chat). Each transition is a metadata write, not a message resend.

Storing read-state for a 1,024-member group means up to 1,024 read records per message. Compress this to "last-read message ID per (member, conversation)" instead of one row per message — a single integer per member rather than a join table of every message × every member.
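
The watermark idea can be sketched as one value per (conversation, member) pair. `ReadState` is a hypothetical name, and the sketch assumes message IDs are integers that sort in send order, as the snowflake-style `message_id` does.

```python
class ReadState:
    """Stores one last-read message ID per (conversation, member)."""

    def __init__(self):
        self._last_read = {}

    def mark_read(self, conversation_id, member_id, message_id):
        key = (conversation_id, member_id)
        # The watermark only moves forward; late or repeated acks are ignored.
        if message_id > self._last_read.get(key, -1):
            self._last_read[key] = message_id

    def has_read(self, conversation_id, member_id, message_id):
        return self._last_read.get((conversation_id, member_id), -1) >= message_id


state = ReadState()
state.mark_read("group-1", "bob", 105)
state.mark_read("group-1", "bob", 101)   # stale ack, does not move the watermark
```

Reading message 105 implies everything before it in that conversation was read too, which is why a single integer per member is enough.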

Delivery semantics: at-least-once, not exactly-once

The system guarantees at-least-once delivery. Client-generated UUIDs make the server idempotent on receipt, so a retransmitted send is deduplicated at write time. Duplicates can still reach a client, though: if a delivery ACK is lost, the server redelivers the message, so the receiving client must also deduplicate on display using the UUID.

Exactly-once delivery (the message appears precisely once, end-to-end, even across crashes) requires distributed transactions across the gateway, message service, and storage. At this scale, that's impractical. At-least-once + client-side dedup is the standard tradeoff for every major chat system.
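
Client-side dedup can be as small as a set of seen UUIDs. The sketch below uses a hypothetical `ChatView` class; a real client would bound the set or persist it alongside the local message cache.

```python
class ChatView:
    """Renders each message at most once, keyed by its client-generated UUID."""

    def __init__(self):
        self._seen = set()
        self.rendered = []

    def on_message(self, message_uuid, text):
        """Return True if the message was rendered, False if it was a duplicate."""
        if message_uuid in self._seen:
            return False
        self._seen.add(message_uuid)
        self.rendered.append(text)
        return True


view = ChatView()
shown = view.on_message("a1b2", "see you at 6")
shown_again = view.on_message("a1b2", "see you at 6")   # redelivery after lost ACK
```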

Presence

"Online / offline / typing" status, refreshed by a periodic heartbeat while the app is active.

Two challenges make this interesting. First, scale: 500M users × 1 update/30s ≈ 17M presence writes/sec. Second, reads on demand: when you open a chat, you want to see whether the other person is online right now.

The solution is Redis with TTL:

SET online:user_42  1  EX 60

Each client sends a heartbeat every 30s. If no heartbeat arrives for 60s, the key expires and the user is considered offline. There's no explicit "disconnect" event to handle — TTL expiry takes care of crashes and network drops automatically.

flowchart LR
    HB[Client heartbeat<br/>every 30s] -->|"SET online:uid 1 EX 60"| RD[(Redis)]
    RD -->|"TTL expires after 60s"| OFFLINE[User shown offline]
    RD -->|"GET online:uid"| ONLINE[User shown online]
    DISCONNECT[App crash / network drop] -->|"no more heartbeats"| RD
    style RD fill:#15803d,color:#fff
    style OFFLINE fill:#ff2e88,color:#fff
    style ONLINE fill:#0e7490,color:#fff
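
The TTL behaviour can be imitated without Redis to see why no disconnect event is needed. `PresenceStore` is a hypothetical in-memory stand-in for `SET online:uid 1 EX 60` and `GET online:uid`; the injectable clock exists only so the example can jump forward in time.

```python
import time


class PresenceStore:
    """In-memory imitation of a key with a TTL that each heartbeat refreshes."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def heartbeat(self, user_id):
        # Equivalent in spirit to SET online:<uid> 1 EX <ttl>: resets the deadline.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        deadline = self._expires_at.get(user_id)
        return deadline is not None and self._clock() < deadline


now = [0.0]                                   # fake clock for the example
presence = PresenceStore(clock=lambda: now[0])
presence.heartbeat("user_42")

now[0] = 59.0
online_at_59 = presence.is_online("user_42")  # still within the 60s TTL

now[0] = 61.0
online_at_61 = presence.is_online("user_42")  # no heartbeat arrived: expired
```

With a 30s heartbeat and a 60s TTL, a client can miss one heartbeat without flapping to offline.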

Presence fan-out. When User A comes online, every contact could want to know. Two options:

Pull-on-demand (lazy) checks GET online:A only when you open a chat. Simple, scales to any contact list, at most 60s stale. Push via pub/sub subscribes each user to a Redis Pub/Sub channel per contact — but at 500 contacts × 500M users that's 250B subscriptions, which doesn't hold up.

Production systems use pull-on-demand for open-chat reads and push presence deltas only to users who are actively in a conversation with the contact who just changed state.
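
The two figures in this section follow directly from the stated assumptions, reproduced here as a quick check:

```python
online_users = 500_000_000
heartbeat_interval_s = 30
contacts_per_user = 500

# One presence write per user per heartbeat interval.
heartbeat_writes_per_s = online_users / heartbeat_interval_s

# One pub/sub subscription per (user, contact) pair.
pubsub_subscriptions = online_users * contacts_per_user

print(f"presence writes: {heartbeat_writes_per_s / 1e6:.1f}M per second")
print(f"subscriptions:   {pubsub_subscriptions / 1e9:.0f}B")
```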

Typing indicators are a special case: they expire in ~3s and must never hit the database. Route them directly over WebSocket as transient messages — no persistence, no delivery receipt.

End-to-end encryption (E2EE)

WhatsApp uses the Signal Protocol: each message is encrypted with a key only the sender and recipient know. The server cannot decrypt.

This shapes the architecture in important ways. The server never sees content, so search is client-side and spam detection uses metadata only. For groups, a sender key is used: the sender distributes a shared symmetric key to each group member (encrypted pairwise via the Double Ratchet), then encrypts each message once with that key. The server fans out the single ciphertext to all members. The sender key is rotated when membership changes, or periodically for forward secrecy.

Multi-device sync is harder — each device has its own keys, so messages must be re-encrypted on the sender side for each device.
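
The saving from sender keys is easiest to see by counting encryption operations on the sender. The sketch below is a cost model only, not cryptography: the function names are hypothetical, and it assumes one pairwise key delivery per other member per rotation.

```python
def pairwise_encryptions(members: int, messages: int) -> int:
    """Every message is encrypted separately for each other member."""
    return (members - 1) * messages


def sender_key_encryptions(members: int, messages: int, rotations: int = 1) -> int:
    """Key is sent pairwise once per rotation; each message is encrypted once."""
    key_distribution = (members - 1) * rotations
    return key_distribution + messages


pairwise_cost = pairwise_encryptions(members=1024, messages=100)
sender_key_cost = sender_key_encryptions(members=1024, messages=100)
```

For a 1,024-member group and 100 messages, that is 102,300 encryptions pairwise versus 1,123 with a sender key, and the server relays one ciphertext per message instead of 1,023.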

E2EE is a two-day topic on its own. For interviews, mention it and say "uses Signal-style ratcheting."

Push for offline users

If User B is offline, send a push notification (APNS for iOS, FCM for Android). Once B opens the app, the WebSocket reconnects and pulls undelivered messages from the message store.

flowchart LR
    MSG[Msg Svc] -->|"B offline"| PUSH[Push Svc]
    PUSH --> APNS[APNS]
    PUSH --> FCM[FCM]
    APNS --> IOS[iOS device]
    FCM --> ANDROID[Android device]
    IOS -->|"app opened"| GW[Gateway]
    GW -->|"reconnect, pull"| MSG
    style PUSH fill:#ff6b1a,color:#0a0a0f

Edge cases

Network blip mid-message

The client retries. Each message has a client-generated UUID; the server uses it for idempotency. If the same UUID arrives twice, the second write is skipped and the client still gets its ack, so it stops retrying.

Order of messages

With message_id as a snowflake (timestamp-prefixed), the database sort is correct and consistent. Across devices, the server's timestamp wins.
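
A sketch of a timestamp-prefixed ID, assuming the commonly used split of 41 timestamp bits, 10 worker bits, and 12 sequence bits. The article does not specify a layout, so the bit widths and function names here are illustrative.

```python
TIMESTAMP_SHIFT = 22            # 10 worker bits + 12 sequence bits sit below it
WORKER_SHIFT = 12
MAX_WORKER = (1 << 10) - 1
MAX_SEQUENCE = (1 << 12) - 1


def make_message_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack the fields so that integer order follows timestamp order."""
    if not 0 <= worker_id <= MAX_WORKER:
        raise ValueError("worker_id out of range")
    if not 0 <= sequence <= MAX_SEQUENCE:
        raise ValueError("sequence out of range")
    return (timestamp_ms << TIMESTAMP_SHIFT) | (worker_id << WORKER_SHIFT) | sequence


def timestamp_of(message_id: int) -> int:
    """Recover the millisecond timestamp from an ID."""
    return message_id >> TIMESTAMP_SHIFT


earlier = make_message_id(1_000, worker_id=7, sequence=4095)
later = make_message_id(1_001, worker_id=0, sequence=0)
```

Because the timestamp occupies the high bits, `earlier < later` holds even though the earlier ID has larger worker and sequence fields. Within the same millisecond, order across workers follows the worker ID rather than true arrival order.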

A user types from two devices simultaneously

Both go through gateways → server timestamps → ordering is deterministic. Some clients also rebroadcast their own messages back to themselves to confirm the server accepted them.

Message edit / delete

A second message with op: edit, target_id: 42, new_text: .... Receivers replace their cached copy. The original stays on disk.
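
On the receiving side, applying such an operation to the local cache might look like this. The `op`, `target_id`, and `new_text` fields come from the description above; the `id` and `text` fields, the `"new"` and `"delete"` op values, and the function name are assumptions for illustration.

```python
def apply_event(cache: dict, event: dict) -> None:
    """Apply a message event to a local cache of message_id -> text."""
    op = event.get("op", "new")
    if op == "new":
        cache[event["id"]] = event["text"]
    elif op == "edit":
        # Only replace a message we actually hold; unknown targets are ignored.
        if event["target_id"] in cache:
            cache[event["target_id"]] = event["new_text"]
    elif op == "delete":
        cache.pop(event["target_id"], None)


cache = {}
apply_event(cache, {"id": 42, "text": "lunch at noon"})
apply_event(cache, {"id": 43, "op": "edit", "target_id": 42, "new_text": "lunch at 1pm"})
apply_event(cache, {"id": 44, "op": "edit", "target_id": 99, "new_text": "ignored"})
```

The edit travels through the same pipeline as any other message, so it inherits persistence, fanout, and idempotent retries without special handling.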

Capacity at WhatsApp scale

WhatsApp's backend (mostly Erlang) handled ~50B messages/day with roughly 32 engineers at the time of its $19B acquisition by Facebook (now Meta) in 2014, growing to 100B messages/day by 2020. The lesson: with the right primitives (Erlang's lightweight processes per connection, Mnesia for hot state, BEAM's built-in fault isolation), you can build extremely scalable chat without microservices. The architecture described in this article is more conventional (Cassandra, Kafka, Go/Java gateways), but the routing and delivery concepts are identical.

Things to discuss in an interview

WebSocket gateways as the central scaling unit; session router for routing.
Per-conversation sharding of message storage.
Delivery semantics (sent / delivered / read) and idempotency on retry.
Presence with TTL instead of explicit "leave" events.
Push for offline users.
E2EE if asked: Signal protocol, server can't read.

Things you should now be able to answer

Why are WebSocket gateways the bottleneck and how do you horizontally scale them?
How does the system route a message from User A to User B when they are on different gateways?
What happens to a message if the recipient is offline when it is sent?
Why is synchronous batched fanout manageable for ~256-member groups but breaks down at 1,024? What does the async Kafka alternative look like?
How do you handle 1,024 read-receipts per message in a group without exploding your storage?
How do you scale presence at ~17M writes/sec? What does the fan-out-on-read vs pub/sub tradeoff look like?
Why do you use at-least-once delivery rather than exactly-once, and what is the client's responsibility?
How does sender-key group encryption differ from pairwise encryption, and why does it matter for server fanout?
