~/articles/design-slack

◆◆◆Advancedasked at Slackasked at Microsoftasked at Meta

Design Slack (team chat at scale)

Q: What is the gateway tier in Slack's architecture and why is it necessary?

The gateway tier is a set of stateful servers that own all persistent WebSocket connections — up to 10 million concurrently. It is necessary because a WebSocket connection is bound to a specific server, so any backend service that wants to push to a user must know which gateway node holds that user's socket; without a routing layer sitting in front of these gateways, fan-out becomes impossible to coordinate.

Q: How does Slack handle fan-out differently for a normal channel versus a mega-channel with 50,000 members?

For normal channels the Message Service publishes once to a pub-sub router topic keyed by channel ID; each gateway that has subscribers on that channel receives the single event and pushes locally to its connected members. For mega-channels the same single publish keeps online delivery efficient, but notifying offline members via @here requires a dedicated fan-out worker pool that reads the membership list in pages from storage and batches push notifications to APNs and FCM with a token-bucket rate limiter to avoid overwhelming those services.

Q: Why is read state stored separately from the message store rather than in Cassandra?

Cassandra is optimized for sequential appends, and mixing the high-volume random writes of read-state updates — roughly 1 million writes per second at design scale — into the message store would hurt compaction and read performance. A dedicated key-value store keyed by (user_id, channel_id) holding last_read_message_id and unread_count handles this access pattern cleanly; Redis fits the entire dataset in one node at roughly 5 GB for 100 million keys.

Q: How does Slack's presence system avoid an N-squared fan-out storm across large workspaces?

The article describes two mitigations: first, presence visibility is scoped to direct contacts, shared channel members, and a coarsened workspace-level snapshot rather than broadcasting every status change to every workspace member; second, instead of emitting one event per status change, gateways send a compact presence snapshot every 15 seconds covering all recently-changed users, trading a little immediacy for a significant reduction in event volume.

Q: What is the difference between how Slack and WhatsApp handle message retention and encryption?

Slack stores messages permanently on the server (configurable retention) and encrypts only in transit via TLS, not end-to-end by default. WhatsApp applies end-to-end encryption using the Signal Protocol and holds undelivered messages on its servers for up to 30 days only, after which messages live on devices rather than centrally.

Channels, threads, presence, and search across huge workspaces. Real-time fan-out over WebSockets, the gateway problem, and read-state per user.

24 min read2026-03-20Ironclad Academy

#interview #realtime #messaging #fan-out

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Slack is a persistent-channel messaging platform — think of it as a shared, searchable, real-time chatroom that a company runs on continuously. On a normal Tuesday morning, engineering teams at thousands of companies are posting messages to #incidents, #deploys, and #eng-general simultaneously. Each message needs to arrive on everyone's screen in under 200 milliseconds, get durably stored for future history, and bump up the unread-count badge for anyone who's not looking. That's the product in a sentence. The scale Slack operates at publicly — over five million simultaneous WebSocket connections at peak — is what makes the engineering genuinely hard.

The core tension is three-way. First, connection management: a WebSocket connection is stateful, so you can't randomly distribute messages to servers without knowing which server holds a given user's socket. Second, fan-out: a message to a 50 000-member #announcements channel has to reach thousands of simultaneously connected clients in real time without hammering every service in the stack. Third, write-amplified state: every message a user reads triggers a write to their read-state record, so what feels like a read-heavy product actually drives enormous sustained write volume. These three problems interact — the routing layer you build to solve connection management is the same layer you'll lean on to tame fan-out.

These challenges are why Slack is a favourite interview question. The system forces you to choose between different fan-out strategies (channel-level pub-sub vs. worker-pool pagination), pick the right data model for three fundamentally different access patterns (append-only messages, hot KV read-state, ephemeral presence), and reason about graceful degradation when the real-time delivery path fails but the durable storage path must not.

Functional requirements

Workspaces contain channels and members.
Channels hold an ordered, persistent message log.
Threads — any message can have a nested reply list.
POST /message → immediately delivered to all connected channel members.
History: clients can paginate backwards through a channel.
Read state: per user per channel, track the last-read message ID and derive unread counts.
Presence: online / away / do-not-disturb, visible within a workspace.
Typing indicators: ephemeral, best-effort.
Full-text search across message history.
Push / email notifications for offline members.

Non-functional requirements

p99 delivery latency < 200ms when sender and recipient are both online.
99.99% availability — a workspace going down during an incident is catastrophic.
Durable message storage: messages must not be lost.
Scale to ~10M concurrent connections globally (design headroom; publicly documented peak is ~5M+).
Search results within ~2 seconds (eventual indexing is acceptable).

Capacity estimation

Dimension	Estimate	How we got there
Concurrent WebSocket connections	10M (design target)	Slack's published peak is ~5M+; 10M is headroom
Message write rate (avg)	100k msg/sec	Design target
Message write rate (peak)	~300k msg/sec	3× avg burst
Avg message body	~250 bytes	Typical short chat message
Daily write volume	~2 TB/day	`100k × 86,400 × 250B`
1-year message store	~700 TB raw / ~250 TB compressed	`2 TB/day × 365` with ~3× compression
Avg channel members	50	—
Online fraction per channel	20% → 10 online recipients/message	50 members × 0.20
Fan-out delivery rate	1M delivery events/sec	`100k msg/sec × 10 recipients`
Online users (presence)	10M	—
Avg channels per user	10	—
Presence subscription edges	100M	`10M users × 10 channels`
Heartbeat interval	10s	—
Heartbeat write rate to Redis	1M writes/sec	`10M users ÷ 10s`
Read-state write rate	1M writes/sec	Generous: 1M message reads/sec → 1 write each
Search ingest rate	100k msg/sec	All messages flow to Elasticsearch consumers
Search index size (1 year)	~100 TB	Inverted index over 1-year corpus; typically 30–50% of source size for short messages

Takeaway: Three bottlenecks jump out immediately: WebSocket connection capacity (10M persistent connections), presence heartbeat write rate (1M writes/sec to Redis), and read-state write rate (1M random KV writes/sec). Each needs its own treatment — and we'll get to all three.

Building up to the design

Start simple: HTTP polling + Postgres

The naive starting point is a REST API where clients poll for new messages every couple of seconds:

POST /channels/{id}/messages     → INSERT into messages
GET  /channels/{id}/messages?since=ts  → SELECT ... ORDER BY ts DESC LIMIT 50

This works for a team of 10. At 10 000 users polling every 2 seconds, you have 5 000 QPS of mostly empty reads slamming the database — and delivery latency is still up to 2 seconds. No push to offline members at all.

Long-polling is the obvious first fix: the server holds the request open until a message arrives or a 30-second timeout fires. That cuts the redundant read rate by ~15×. But 10M open HTTP connections is actually harder to manage than 10M WebSockets because HTTP long-poll carries more per-connection overhead. And you still have a fan-out problem: when a message arrives you have to wake up every parked request for that channel simultaneously.

Upgrade to WebSockets: the gateway tier

Replace the polling loop with a persistent WebSocket. A dedicated gateway tier — N servers that look stateless from the outside but are actually deeply stateful — owns those connections. When a message arrives, the gateway fans it out to the right sockets locally.

Here's the gap this immediately exposes: any backend service that wants to push to user Alice needs to know which gateway node holds her socket. With hundreds of gateway servers, that's a routing problem.

Add a routing layer

Two approaches work here:

Pub-sub mesh (Redis Pub-Sub, NATS, or similar): each gateway subscribes to topics for the users it currently serves — user:{user_id} for DMs, channel:{channel_id} for every channel with at least one connected member. A backend publishes to channel:C42 once; every gateway that has a subscriber on C42 receives it and writes to the relevant sockets. Fan-out becomes O(gateways touched) rather than O(online members).

Dedicated Channel Servers (Slack's documented approach): stateful in-memory servers hold channel state and subscription lists. Each channel hashes to a specific Channel Server; Gateway Servers subscribe to the Channel Servers covering their clients' channels. Fan-out flows CS → all subscribed GSs → connected clients. This avoids a generic pub-sub cluster and keeps channel-level routing logic inside a purpose-built service.

Both solve the routing problem. The pub-sub mesh is operationally simpler; Channel Servers give you richer channel-level state. Either works in an interview if you explain the trade-off.

The mega-channel problem

Once you have the routing layer, a 50 000-member channel where 10 000 are online (the 20% online fraction from the capacity table) still means significant fan-out work in the hot path. Even with channel-level pub-sub, each gateway node might have hundreds of members of that channel and has to push to all of them simultaneously.

The solution is to let the routing layer do the heavy lifting: publish once to channel:{id}. Each gateway receives that one event and fans out locally to its connected members of that channel. For @here notifications to offline members, a dedicated fan-out worker pool reads the message from Kafka, pages through the membership list, and batches push notifications. Back-pressure controls throughput so you don't flood APNs/FCM.

flowchart LR
    V1[V1: HTTP poll<br/>Postgres] --> V2[V2: Long-poll<br/>lower load]
    V2 --> V3[V3: WebSocket gateway<br/>real-time delivery]
    V3 --> V4[V4: + Routing layer<br/>pub-sub router]
    V4 --> V5[V5: Channel pub-sub<br/>+ mega-channel workers]
    V5 --> V6[V6: + Read state + Presence<br/>+ Search + Notifications]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#fff
    style V6 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    CLIENT[Client app] -->|"WebSocket"| GW[Connection Gateway<br/>stateful — owns sockets]
    GW -->|"inbound message"| MSGSVC[Message Service]
    MSGSVC --> KAFKA[Kafka]
    KAFKA --> CASSANDRA[(Message Store<br/>Cassandra)]
    KAFKA --> SEARCH[Elasticsearch<br/>indexer worker]
    KAFKA --> NOTIF[Push Notification<br/>Service]
    KAFKA --> RS[Read-State Service]
    KAFKA --> FANOUT[Fan-out Worker<br/>mega-channels]

    ROUTER[Pub-Sub Router<br/>Redis Pub-Sub / NATS] --> GW
    MSGSVC --> ROUTER
    FANOUT --> ROUTER

    GW <-->|"heartbeat"| PRES[Presence Service<br/>Redis TTL]

    PRES -->|"presence events"| ROUTER

    style GW fill:#ff6b1a,color:#fff
    style ROUTER fill:#15803d,color:#fff
    style CASSANDRA fill:#0e7490,color:#fff
    style SEARCH fill:#a855f7,color:#fff
    style PRES fill:#ffaa00,color:#0a0a0f
    style KAFKA fill:#ff2e88,color:#fff

The connection gateway tier

The routing problem

With 10M concurrent WebSockets spread across hundreds of gateway servers, any backend service that wants to push to user Alice must know which gateway node holds her connection. The routing layer solves this.

Pub-sub approach: each gateway registers its users into a pub-sub cluster and subscribes to:

user:{user_id} — for direct messages and user-addressed notifications.
channel:{channel_id} — for every channel with at least one member connected on this gateway.

When the Message Service publishes to channel:C42, every gateway with a subscriber on C42 receives it and writes to each relevant socket. Fan-out is O(gateways touched) rather than O(online members).

Slack's documented approach: dedicated stateful Channel Servers hold channel state and manage subscriptions. Gateway Servers subscribe asynchronously to the Channel Servers covering their clients' channels (via consistent hashing). When a message arrives, the Channel Server broadcasts to all subscribed Gateway Servers worldwide, each of which pushes to its connected clients. This avoids a generic pub-sub cluster and keeps channel-level routing logic inside a purpose-built service.

sequenceDiagram
    participant Alice
    participant GW_A as Gateway A (Alice's socket)
    participant GW_B as Gateway B (Bob's socket)
    participant Router as Pub-Sub Router
    participant MsgSvc as Message Service

    Alice->>GW_A: send "hello" to #general
    GW_A->>MsgSvc: POST /message {channel, text, sender}
    MsgSvc->>Router: PUBLISH channel:general {msg}
    Router->>GW_A: deliver (Alice is also subscribed)
    Router->>GW_B: deliver (Bob is on this gateway)
    GW_A->>Alice: push via WebSocket
    GW_B->>Bob: push via WebSocket

Consistent hashing for gateway assignment

When a client first connects, a load balancer (or a connection manager service) assigns it to a gateway node. Consistent hashing on user_id keeps re-connections to the same node likely — good for local cache warming — while still distributing evenly. When a gateway node fails, its clients reconnect and are redistributed.

Gateway node failure and client reconnect

When a gateway goes down, TCP connections drop and clients detect it within seconds (via keep-alive probes or the WebSocket ping/pong mechanism). They reconnect with exponential back-off and jitter to avoid thundering herd. On reconnect, the client sends its last_seen_message_id (persisted locally). The gateway proxies a catch-up call — GET /channels/{id}/messages?after={last_seen_id} — and replays any missed messages before the live stream resumes. This cursor-based catch-up requires that the message store supports efficient range queries by (channel_id, message_id), which is exactly what Cassandra's clustering key gives us.

Message storage

Schema design

Messages are the most-read and most-written entity. The access pattern is almost entirely: write a new message, fetch the last N messages in a channel (paginate backward), or fetch a specific message by ID for threads and permalink resolution.

Cassandra (or a Cassandra-compatible store like ScyllaDB) handles this with a partition key of (workspace_id, channel_id) and a clustering key of message_id — a time-ordered ULID or Snowflake ID. Each partition is an ordered log of messages in that channel.

-- Cassandra CQL (conceptual)
CREATE TABLE messages (
  workspace_id  UUID,
  channel_id    UUID,
  message_id    TIMEUUID,       -- time-ordered for free range scans
  sender_id     UUID,
  body          TEXT,
  thread_root   TIMEUUID,       -- NULL if not a thread reply
  edited_at     TIMESTAMP,
  PRIMARY KEY ((workspace_id, channel_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

History pagination becomes: SELECT * FROM messages WHERE workspace_id=? AND channel_id=? AND message_id < ? LIMIT 50 — a single Cassandra partition scan with no cross-partition scatter.

One caveat on partition size: Cassandra recommends keeping partitions under ~100 MB. A high-traffic channel that has been active for years could accumulate millions of messages and blow past that threshold. The production fix is time-based bucketing — add a bucket field (e.g., YYYYMM or week number) to the partition key: PRIMARY KEY ((workspace_id, channel_id, bucket), message_id). History queries then span at most two partitions when crossing a bucket boundary. This schema omits bucketing for clarity; flag it in an interview.

Threads

A thread is a nested list of messages sharing a thread_root — the ID of the parent message. The two main options are storing thread replies inline (easy but scatters replies across the channel partition) or using a separate thread partition: PRIMARY KEY ((workspace_id, thread_root_id), reply_message_id). The separate partition is cleaner for large threads and makes "fetch all replies to message X" a single efficient scan. The parent message stores a reply_count and last_reply_ts for rendering thread previews without loading all replies.

Message IDs

Use a time-ordered globally unique ID. Two common choices:

Twitter Snowflake (64-bit integer):

1 bit: unused (sign bit, always 0).
41 bits: millisecond timestamp since a custom epoch.
10 bits: machine / worker ID.
12 bits: per-machine sequence number.

This fits in a 64-bit integer, encodes time so database clustering order matches chronological order without a separate timestamp column, and supports 4096 IDs per millisecond per node.

ULID (128-bit, 26-character Crockford Base32 string):

48 bits: millisecond timestamp.
80 bits: cryptographically random.

ULIDs are lexicographically sortable but larger (128 bits vs 64 bits). They trade the machine-coordination requirement of Snowflake for randomness, at the cost of roughly double storage.

For a Cassandra store, Snowflake IDs (stored as BIGINT) or Cassandra-native TIMEUUID (a 128-bit Type 1 UUID embedding a timestamp) are both practical. TIMEUUID is used in the schema above for native range-scan support.

Fan-out deep dive

Normal channels (fewer than ~1 000 members)

Here's the flow when Alice sends a message to #general:

Message Service writes to Kafka topic messages.
Message Service simultaneously publishes to the pub-sub router: PUBLISH channel:{id} {msg_payload}.
All gateways subscribed to channel:{id} receive the event and push to their connected members.
Kafka consumers — the Cassandra writer, Elasticsearch indexer, read-state service — process asynchronously.

Online delivery is driven by the synchronous pub-sub publish, not by Kafka. Kafka is the durable backbone for offline processing. If the pub-sub router hiccups, online users miss a push but catch up via the next reconnect; Cassandra is the source of truth and is never bypassed.

flowchart LR
    MSG[Message Service] -->|"PUBLISH channel:id"| ROUTER[Pub-Sub Router]
    ROUTER --> GW1[Gateway A<br/>Alice's socket]
    ROUTER --> GW2[Gateway B<br/>Bob's socket]
    ROUTER --> GW3[Gateway C<br/>Carlos' socket]
    MSG --> KAFKA[Kafka]
    KAFKA --> CASS[(Cassandra<br/>durable write)]
    KAFKA --> ES[Elasticsearch<br/>async index]
    KAFKA --> RS[Read-state<br/>service]
    style ROUTER fill:#15803d,color:#fff
    style KAFKA fill:#ff2e88,color:#fff
    style CASS fill:#0e7490,color:#fff
    style MSG fill:#ff6b1a,color:#fff

Mega-channels and @here

A channel with 50 000 members where 10 000 are online: publishing 10 000 times to the pub-sub router in the hot path is too expensive.

The fix is to publish once to channel:{id}. Each gateway receives that one event and fans out locally to its connected members. Since each gateway typically has 20–100 members of the mega-channel, each gateway does a small local loop — the fan-out cost is O(gateways) not O(online members).

For @here — notify everyone in the channel, including offline members — a fan-out worker consumes from Kafka, reads the membership list in pages (it can be millions of rows for very large shared channels), and batches push notifications to the notification service. The worker uses a token-bucket rate limiter to avoid overwhelming APNs/FCM at spike time. Some notifications are delayed by seconds. That's acceptable for most channels; critical incident channels should use a dedicated higher-priority queue.

flowchart LR
    KAFKA[Kafka<br/>message event] --> WORKER[Fan-out Worker]
    WORKER -->|"page through members"| MEMBERS[(Channel Members<br/>Postgres / DynamoDB)]
    WORKER --> NOTIF[Push Notification<br/>Service]
    NOTIF --> APNS[APNs]
    NOTIF --> FCM[FCM]
    style WORKER fill:#ff6b1a,color:#fff
    style KAFKA fill:#ff2e88,color:#fff
    style NOTIF fill:#0e7490,color:#fff

Read state and unread counts

The write-volume problem

Every time a new message arrives, the Read-State service must increment unread_count for every channel member who hasn't yet read it. At design scale — 1M message reads per second — that's 1M random KV writes per second, sustained. Separately, every time a user opens a channel, we reset their "last read position" (a MARK_READ write). Mixing those hot random writes into the message store (Cassandra's strength is sequential appends) would hurt compaction and read performance. So we separate them.

A dedicated read-state store

A simple key-value store keyed by (user_id, channel_id), holding last_read_message_id and a cached unread_count:

user_channel_state:{user_id}:{channel_id} →
  { last_read_message_id: "01HX...",
    unread_count: 5,
    mention_count: 1 }

Redis handles this comfortably: 10M users × avg 10 channels = 100M keys × ~50 bytes ≈ 5 GB per replica — fits one node. DynamoDB or a dedicated Postgres shard with PK on (user_id, channel_id) are equally valid.

When a message arrives, the Read-State service (consuming from Kafka) increments unread_count for every channel member whose last_read_message_id is older than the new message. When the user opens a channel, the client sends MARK_READ {channel_id, message_id}, which resets unread_count = 0 and updates last_read_message_id. Badge counts across all of a user's channels are then a single Redis scan over that user's keys.

flowchart LR
    KAFKA[Kafka<br/>new message] --> RSS[Read-State Service]
    RSS -->|"increment unread_count<br/>for each member"| RSDB[(Read-State Store<br/>Redis / DynamoDB)]
    CLIENT[Client opens channel] -->|"MARK_READ"| API[API]
    API -->|"reset unread_count = 0"| RSDB
    RSDB -->|"badge count query"| CLIENT
    style RSS fill:#ff6b1a,color:#fff
    style RSDB fill:#15803d,color:#fff
    style KAFKA fill:#ff2e88,color:#fff

Presence

The N² problem

If 10 000 users in a workspace each care about everyone else's presence, a single status change must be broadcast to 9 999 users. That's ~10k events per status change — and if all 10 000 users change status simultaneously (e.g., everyone coming online Monday morning), the total number of subscription edges is 10 000 × 10 000 = 100M, making the per-second delivery volume clearly infeasible at workspace scale.

Bounding fan-out

The practical fix is to scope presence visibility so "interested" means a much smaller set:

Direct contacts or DM partners: you see presence for users you've messaged directly.
Shared channel members: presence is shown for members of channels you're both in.
Workspace-level "who's online" list: coarsened to online vs offline only, refreshed every 30 seconds — no per-user change events, just a periodic snapshot.

Heartbeat mechanism

Every connected client sends a heartbeat (ping/pong) to the gateway every 10 seconds, which proxies it to the Presence Service:

Gateway → Presence Service: heartbeat(user_id, workspace_id)
Presence Service → Redis:   SET presence:{user_id} "online" EX 30

The Redis TTL (30 seconds here — a few multiples of the heartbeat interval) means: if no heartbeat arrives, the key expires and the user appears offline. No explicit "disconnect" event is required, though the gateway can send one on a clean close for faster response. Note that the TTL for the Redis key is a low-level implementation detail; the application-level "away" transition (after no activity from any device) typically happens on a longer timescale, on the order of minutes.

sequenceDiagram
    participant Client
    participant GW as Gateway
    participant PS as Presence Service
    participant RD as Redis

    loop Every 10 seconds
        Client->>GW: heartbeat ping
        GW->>PS: heartbeat(user_id)
        PS->>RD: SET presence:user_id "online" EX 30
    end

    Note over RD: After 30s with no heartbeat, key expires → user appears offline

Presence fan-out

When a user's presence changes (a key appears or expires), the Presence Service publishes a presence_changed event to the pub-sub router for each "interested" user in the scoped set. For large workspaces, presence is batch-delivered: instead of one event per status change, the gateway sends a compact presence snapshot every 15 seconds covering all recently-changed users, trading a little immediacy for a lot of throughput.

Typing indicators

Typing indicators are ephemeral and best-effort — the architecture treats them very differently from messages. When a user starts typing:

Client sends TYPING_START {channel_id} to its gateway.
Gateway publishes to channel:{channel_id}:typing on the pub-sub router.
Other gateways deliver via WebSocket.
Indicator auto-expires after 5 seconds (no TYPING_STOP needed).

These events are never written to Kafka or the message store. If one is dropped, that's fine — you just don't see the typing bubble for a moment. This is an explicit choice to keep ephemeral UX out of the durable write path entirely.

Full-text search

The pipeline

Messages flow through Kafka into an indexing worker that writes to Elasticsearch:

Parse the message body.
Extract metadata: sender, channel, workspace, timestamp.
Attach workspace-level access-control metadata as a document field (so queries can filter by channels the searching user can actually see).
Upsert using message_id as the document ID, making the operation idempotent.

Index design

Each workspace gets a separate Elasticsearch index (or data stream), or workspaces share an index with a workspace_id filter. The shared-index approach is simpler to manage but requires careful routing to prevent cross-tenant data leaks.

Every search query enforces: workspace_id = X AND channel_id IN [channels user can see]. Access control is checked both at query time (filter) and at the application layer (re-verify before returning results).

Latency

Indexing lag is typically 1–10 seconds end-to-end. Elasticsearch's default refresh interval is 1 second, so a document becomes searchable within roughly a second of being written. The Kafka consumer pipeline adds variable propagation delay on top — usually a second or two at low load, potentially more under burst. That's fine — users don't expect to find a message they sent one second ago.

Storage choices

Data	Store	Rationale
Messages	Cassandra (sharded by workspace+channel)	High append throughput; efficient range scans by time
Channel metadata, workspace config	Postgres	Relational; low volume; ACID useful
Channel membership	Postgres (or DynamoDB for large orgs)	Many-to-many join; occasional bulk reads for fan-out
Read state (last read, unread count)	Redis (or DynamoDB)	High write volume; simple KV access pattern
Presence	Redis with TTL	Ephemeral; TTL handles disconnection naturally
Routing (user → gateway)	Redis Pub-Sub / NATS	Pub-sub semantics; ephemeral; no persistence needed
Search index	Elasticsearch	Inverted index; rich query language
Event backbone	Kafka	Durable; fan-out to multiple consumers; replayable
Attachments / files	Object storage (S3)	Large blobs; CDN-served

Failure modes

Gateway node failure

A gateway crashes. Clients detect the dropped connection within seconds and reconnect with exponential back-off (with jitter to avoid thundering herd). On reconnect the client supplies its last_seen_message_id; the new gateway proxies a history catch-up before resuming the live stream. Messages sent during the gap were durably written to Cassandra via Kafka and come back in the catch-up response.

Pub-sub router failure

If the Redis pub-sub cluster fails, in-flight deliveries to online users are lost — pub-sub events are not persistent. Recovery: clients see no new messages and fall back to polling the catch-up API until WebSocket delivery resumes. Message durability is unaffected because the Kafka → Cassandra path is completely independent.

Kafka consumer lag

If a consumer falls behind — say the Elasticsearch indexer slows down — messages continue being delivered in real time via the pub-sub path. The lagging consumer eventually catches up. No data is lost as long as Kafka retention covers the lag window (typically 7 days).

Presence flapping

A mobile client on a flaky network repeatedly connects and disconnects, each reconnect triggering a presence-change event. Mitigation: debounce presence events with a 5-second grace window before broadcasting "offline." The gateway sends a clean disconnect event; the Presence Service waits 5 seconds before marking offline and publishing the change.

Mega-channel @here storm

A @here in a 50 000-member channel triggers fan-out notifications to all online members. If 10 000 are online and the push notification service is bounded, the fan-out worker's token-bucket rate limiter queues the work. Some notifications are delayed by seconds. Acceptable for most channels; critical incident channels should use a dedicated higher-priority queue.

API design (key endpoints)

-- Send a message
POST /api/v1/channels/{channel_id}/messages
Authorization: Bearer {token}
{ "text": "hello world", "thread_root": null }

→ 201 Created
{ "message_id": "01HX3AB...", "ts": "2026-03-20T12:34:56Z" }

-- Fetch history (paginate backward)
GET /api/v1/channels/{channel_id}/messages?before={message_id}&limit=50

→ 200 OK
{ "messages": [...], "has_more": true }

-- Mark channel as read
POST /api/v1/channels/{channel_id}/mark_read
{ "last_read_message_id": "01HX3AB..." }

-- WebSocket (client connects once, then receives events)
wss://gateway.slack.internal/connect?token={session_token}

← {"type": "message", "channel_id": "...", "message": {...}}
← {"type": "presence", "user_id": "...", "status": "away"}
← {"type": "typing", "channel_id": "...", "user_id": "..."}
← {"type": "read_state", "channel_id": "...", "unread_count": 3}

Sequence: message delivery end-to-end

sequenceDiagram
    participant Alice as Alice (client)
    participant GW_A as Gateway A
    participant MsgSvc as Message Service
    participant Kafka as Kafka
    participant Router as Pub-Sub Router
    participant GW_B as Gateway B
    participant Bob as Bob (client)
    participant Cass as Cassandra
    participant ES as Elasticsearch
    participant RS as Read-State Service

    Alice->>GW_A: WS: send message to #general
    GW_A->>MsgSvc: POST /message {channel, text}
    MsgSvc->>Cass: write message (sync, quorum)
    MsgSvc->>Router: PUBLISH channel:general {msg}
    MsgSvc->>Kafka: produce message event
    MsgSvc-->>GW_A: 201 Created {message_id}
    GW_A-->>Alice: WS: ack {message_id}
    Router->>GW_A: deliver to Alice (echo)
    Router->>GW_B: deliver (Bob is subscribed here)
    GW_B->>Bob: WS: push message
    Kafka->>ES: async index (~1-10s lag)
    Kafka->>RS: read-state fan-out (async)

Notice that the Cassandra write is on the synchronous path — before returning 201 — to guarantee durability. The Kafka produce is also synchronous (acks=all, meaning all in-sync replicas must acknowledge) before returning, ensuring the event is not lost. Online delivery via the pub-sub router and durable storage via Kafka/Cassandra are parallel, independent paths.

Contrast with 1:1 chat systems

This design differs from WhatsApp / 1:1 chat in several key ways:

Dimension	Slack (team chat)	WhatsApp (1:1 / small group)
Fan-out model	Channel pub-sub → gateways	Per-recipient delivery via session server
Group size	Up to 50 000 members	Up to 1 024 (raised from 512, which itself was raised from 256)
Read state	Per user, per channel, server-side	Per message, device-side + server receipts
Presence	Workspace-scoped, bounded fan-out	Contact-list scoped
Search	Server-side full-text (Elasticsearch)	Limited (client-side or minimal server search)
Message retention	Permanent (configurable)	Device-side; server holds undelivered messages up to 30 days only
Encryption	In transit (TLS), not E2E by default	End-to-end encrypted (Signal Protocol)

Things to discuss in an interview

The routing problem: how does the Message Service know which gateway holds Alice's WebSocket? Walk through the pub-sub subscription model.
Fan-out strategies: normal channel (channel-level pub-sub), mega-channel (worker pool + back-pressure), why they differ.
Read-state at write scale: why it lives in a dedicated store rather than alongside messages.
Presence N² problem: how scoping visibility and batching bounds the fan-out.
Catch-up after reconnect: cursor-based, backed by the durable Cassandra log, decoupled from the pub-sub delivery path.
Typing indicators as ephemeral pub-sub: never persisted, acceptable to lose, auto-expire by TTL.
Consistent hashing for gateway assignment: what happens on gateway failure (reconnect + catch-up), why consistent hashing minimises reshuffling.

Things you should now be able to answer

Why can't the Message Service push directly to the user's WebSocket without a routing layer?
What happens to messages sent to a channel while a user is disconnected, and how does the client recover them?
Why is presence expensive at scale, and what are two mechanisms to bound the cost?
Why are typing indicators handled differently from messages?
What is the trade-off between writing read-state synchronously vs. via Kafka?
Why is Cassandra a better fit than Postgres for the message store at this scale?
How does fan-out work differently for a 50-member channel vs. a 50 000-member channel?

Frequently asked questions

▸What is the gateway tier in Slack's architecture and why is it necessary?

The gateway tier is a set of stateful servers that own all persistent WebSocket connections — up to 10 million concurrently. It is necessary because a WebSocket connection is bound to a specific server, so any backend service that wants to push to a user must know which gateway node holds that user's socket; without a routing layer sitting in front of these gateways, fan-out becomes impossible to coordinate.

▸How does Slack handle fan-out differently for a normal channel versus a mega-channel with 50,000 members?

For normal channels the Message Service publishes once to a pub-sub router topic keyed by channel ID; each gateway that has subscribers on that channel receives the single event and pushes locally to its connected members. For mega-channels the same single publish keeps online delivery efficient, but notifying offline members via @here requires a dedicated fan-out worker pool that reads the membership list in pages from storage and batches push notifications to APNs and FCM with a token-bucket rate limiter to avoid overwhelming those services.

▸Why is read state stored separately from the message store rather than in Cassandra?

Cassandra is optimized for sequential appends, and mixing the high-volume random writes of read-state updates — roughly 1 million writes per second at design scale — into the message store would hurt compaction and read performance. A dedicated key-value store keyed by (user_id, channel_id) holding last_read_message_id and unread_count handles this access pattern cleanly; Redis fits the entire dataset in one node at roughly 5 GB for 100 million keys.

▸How does Slack's presence system avoid an N-squared fan-out storm across large workspaces?

The article describes two mitigations: first, presence visibility is scoped to direct contacts, shared channel members, and a coarsened workspace-level snapshot rather than broadcasting every status change to every workspace member; second, instead of emitting one event per status change, gateways send a compact presence snapshot every 15 seconds covering all recently-changed users, trading a little immediacy for a significant reduction in event volume.

▸What is the difference between how Slack and WhatsApp handle message retention and encryption?

Slack stores messages permanently on the server (configurable retention) and encrypts only in transit via TLS, not end-to-end by default. WhatsApp applies end-to-end encryption using the Signal Protocol and holds undelivered messages on its servers for up to 30 days only, after which messages live on devices rather than centrally.

← previous

Quorums, Read-Repair & Anti-Entropy (Dynamo-style)

Storage Engines: LSM-Trees vs B-Trees

// RELATED

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform