Design Slack (team chat at scale)
Channels, threads, presence, and search across huge workspaces. Real-time fan-out over WebSockets, the gateway problem, and read-state per user.
The problem
Slack is a persistent-channel messaging platform — think of it as a shared, searchable, real-time chatroom that a company runs on continuously. On a normal Tuesday morning, engineering teams at thousands of companies are posting messages to #incidents, #deploys, and #eng-general simultaneously. Each message needs to arrive on everyone's screen in under 200 milliseconds, get durably stored for future history, and bump up the unread-count badge for anyone who's not looking. That's the product in a sentence. The scale Slack operates at publicly — over five million simultaneous WebSocket connections at peak — is what makes the engineering genuinely hard.
The core tension is three-way. First, connection management: a WebSocket connection is stateful, so you can't randomly distribute messages to servers without knowing which server holds a given user's socket. Second, fan-out: a message to a 50 000-member #announcements channel has to reach thousands of simultaneously connected clients in real time without hammering every service in the stack. Third, write-amplified state: every message a user reads triggers a write to their read-state record, so what feels like a read-heavy product actually drives enormous sustained write volume. These three problems interact — the routing layer you build to solve connection management is the same layer you'll lean on to tame fan-out.
These challenges are why Slack is a favourite interview question. The system forces you to choose between different fan-out strategies (channel-level pub-sub vs. worker-pool pagination), pick the right data model for three fundamentally different access patterns (append-only messages, hot KV read-state, ephemeral presence), and reason about graceful degradation when the real-time delivery path fails but the durable storage path must not.
Functional requirements
- Workspaces contain channels and members.
- Channels hold an ordered, persistent message log.
- Threads — any message can have a nested reply list.
POST /message→ immediately delivered to all connected channel members.- History: clients can paginate backwards through a channel.
- Read state: per user per channel, track the last-read message ID and derive unread counts.
- Presence: online / away / do-not-disturb, visible within a workspace.
- Typing indicators: ephemeral, best-effort.
- Full-text search across message history.
- Push / email notifications for offline members.
Non-functional requirements
- p99 delivery latency < 200ms when sender and recipient are both online.
- 99.99% availability — a workspace going down during an incident is catastrophic.
- Durable message storage: messages must not be lost.
- Scale to ~10M concurrent connections globally (design headroom; publicly documented peak is ~5M+).
- Search results within ~2 seconds (eventual indexing is acceptable).
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Concurrent WebSocket connections | 10M (design target) | Slack's published peak is ~5M+; 10M is headroom |
| Message write rate (avg) | 100k msg/sec | Design target |
| Message write rate (peak) | ~300k msg/sec | 3× avg burst |
| Avg message body | ~250 bytes | Typical short chat message |
| Daily write volume | ~2 TB/day | 100k × 86,400 × 250B |
| 1-year message store | ~700 TB raw / ~250 TB compressed | 2 TB/day × 365 with ~3× compression |
| Avg channel members | 50 | — |
| Online fraction per channel | 20% → 10 online recipients/message | 50 members × 0.20 |
| Fan-out delivery rate | 1M delivery events/sec | 100k msg/sec × 10 recipients |
| Online users (presence) | 10M | — |
| Avg channels per user | 10 | — |
| Presence subscription edges | 100M | 10M users × 10 channels |
| Heartbeat interval | 10s | — |
| Heartbeat write rate to Redis | 1M writes/sec | 10M users ÷ 10s |
| Read-state write rate | 1M writes/sec | Generous: 1M message reads/sec → 1 write each |
| Search ingest rate | 100k msg/sec | All messages flow to Elasticsearch consumers |
| Search index size (1 year) | ~100 TB | Inverted index over 1-year corpus; typically 30–50% of source size for short messages |
Takeaway: Three bottlenecks jump out immediately: WebSocket connection capacity (10M persistent connections), presence heartbeat write rate (1M writes/sec to Redis), and read-state write rate (1M random KV writes/sec). Each needs its own treatment — and we'll get to all three.
Building up to the design
Start simple: HTTP polling + Postgres
The naive starting point is a REST API where clients poll for new messages every couple of seconds:
POST /channels/{id}/messages → INSERT into messages
GET /channels/{id}/messages?since=ts → SELECT ... ORDER BY ts DESC LIMIT 50
This works for a team of 10. At 10 000 users polling every 2 seconds, you have 5 000 QPS of mostly empty reads slamming the database — and delivery latency is still up to 2 seconds. No push to offline members at all.
Long-polling is the obvious first fix: the server holds the request open until a message arrives or a 30-second timeout fires. That cuts the redundant read rate by ~15×. But 10M open HTTP connections is actually harder to manage than 10M WebSockets because HTTP long-poll carries more per-connection overhead. And you still have a fan-out problem: when a message arrives you have to wake up every parked request for that channel simultaneously.
Upgrade to WebSockets: the gateway tier
Replace the polling loop with a persistent WebSocket. A dedicated gateway tier — N servers that look stateless from the outside but are actually deeply stateful — owns those connections. When a message arrives, the gateway fans it out to the right sockets locally.
Here's the gap this immediately exposes: any backend service that wants to push to user Alice needs to know which gateway node holds her socket. With hundreds of gateway servers, that's a routing problem.
Add a routing layer
Two approaches work here:
Pub-sub mesh (Redis Pub-Sub, NATS, or similar): each gateway subscribes to topics for the users it currently serves — user:{user_id} for DMs, channel:{channel_id} for every channel with at least one connected member. A backend publishes to channel:C42 once; every gateway that has a subscriber on C42 receives it and writes to the relevant sockets. Fan-out becomes O(gateways touched) rather than O(online members).
Dedicated Channel Servers (Slack's documented approach): stateful in-memory servers hold channel state and subscription lists. Each channel hashes to a specific Channel Server; Gateway Servers subscribe to the Channel Servers covering their clients' channels. Fan-out flows CS → all subscribed GSs → connected clients. This avoids a generic pub-sub cluster and keeps channel-level routing logic inside a purpose-built service.
Both solve the routing problem. The pub-sub mesh is operationally simpler; Channel Servers give you richer channel-level state. Either works in an interview if you explain the trade-off.
The mega-channel problem
Once you have the routing layer, a 50 000-member channel where 10 000 are online (the 20% online fraction from the capacity table) still means significant fan-out work in the hot path. Even with channel-level pub-sub, each gateway node might have hundreds of members of that channel and has to push to all of them simultaneously.
The solution is to let the routing layer do the heavy lifting: publish once to channel:{id}. Each gateway receives that one event and fans out locally to its connected members of that channel. For @here notifications to offline members, a dedicated fan-out worker pool reads the message from Kafka, pages through the membership list, and batches push notifications. Back-pressure controls throughput so you don't flood APNs/FCM.
flowchart LR
V1[V1: HTTP poll<br/>Postgres] --> V2[V2: Long-poll<br/>lower load]
V2 --> V3[V3: WebSocket gateway<br/>real-time delivery]
V3 --> V4[V4: + Routing layer<br/>pub-sub router]
V4 --> V5[V5: Channel pub-sub<br/>+ mega-channel workers]
V5 --> V6[V6: + Read state + Presence<br/>+ Search + Notifications]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#fff
style V6 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
CLIENT[Client app] -->|"WebSocket"| GW[Connection Gateway<br/>stateful — owns sockets]
GW -->|"inbound message"| MSGSVC[Message Service]
MSGSVC --> KAFKA[Kafka]
KAFKA --> CASSANDRA[(Message Store<br/>Cassandra)]
KAFKA --> SEARCH[Elasticsearch<br/>indexer worker]
KAFKA --> NOTIF[Push Notification<br/>Service]
KAFKA --> RS[Read-State Service]
KAFKA --> FANOUT[Fan-out Worker<br/>mega-channels]
ROUTER[Pub-Sub Router<br/>Redis Pub-Sub / NATS] --> GW
MSGSVC --> ROUTER
FANOUT --> ROUTER
GW <-->|"heartbeat"| PRES[Presence Service<br/>Redis TTL]
PRES -->|"presence events"| ROUTER
style GW fill:#ff6b1a,color:#fff
style ROUTER fill:#15803d,color:#fff
style CASSANDRA fill:#0e7490,color:#fff
style SEARCH fill:#a855f7,color:#fff
style PRES fill:#ffaa00,color:#0a0a0f
style KAFKA fill:#ff2e88,color:#fff
The connection gateway tier
The routing problem
With 10M concurrent WebSockets spread across hundreds of gateway servers, any backend service that wants to push to user Alice must know which gateway node holds her connection. The routing layer solves this.
Pub-sub approach: each gateway registers its users into a pub-sub cluster and subscribes to:
user:{user_id}— for direct messages and user-addressed notifications.channel:{channel_id}— for every channel with at least one member connected on this gateway.
When the Message Service publishes to channel:C42, every gateway with a subscriber on C42 receives it and writes to each relevant socket. Fan-out is O(gateways touched) rather than O(online members).
Slack's documented approach: dedicated stateful Channel Servers hold channel state and manage subscriptions. Gateway Servers subscribe asynchronously to the Channel Servers covering their clients' channels (via consistent hashing). When a message arrives, the Channel Server broadcasts to all subscribed Gateway Servers worldwide, each of which pushes to its connected clients. This avoids a generic pub-sub cluster and keeps channel-level routing logic inside a purpose-built service.
sequenceDiagram
participant Alice
participant GW_A as Gateway A (Alice's socket)
participant GW_B as Gateway B (Bob's socket)
participant Router as Pub-Sub Router
participant MsgSvc as Message Service
Alice->>GW_A: send "hello" to #general
GW_A->>MsgSvc: POST /message {channel, text, sender}
MsgSvc->>Router: PUBLISH channel:general {msg}
Router->>GW_A: deliver (Alice is also subscribed)
Router->>GW_B: deliver (Bob is on this gateway)
GW_A->>Alice: push via WebSocket
GW_B->>Bob: push via WebSocket
Consistent hashing for gateway assignment
When a client first connects, a load balancer (or a connection manager service) assigns it to a gateway node. Consistent hashing on user_id keeps re-connections to the same node likely — good for local cache warming — while still distributing evenly. When a gateway node fails, its clients reconnect and are redistributed.
Gateway node failure and client reconnect
When a gateway goes down, TCP connections drop and clients detect it within seconds (via keep-alive probes or the WebSocket ping/pong mechanism). They reconnect with exponential back-off and jitter to avoid thundering herd. On reconnect, the client sends its last_seen_message_id (persisted locally). The gateway proxies a catch-up call — GET /channels/{id}/messages?after={last_seen_id} — and replays any missed messages before the live stream resumes. This cursor-based catch-up requires that the message store supports efficient range queries by (channel_id, message_id), which is exactly what Cassandra's clustering key gives us.
Message storage
Schema design
Messages are the most-read and most-written entity. The access pattern is almost entirely: write a new message, fetch the last N messages in a channel (paginate backward), or fetch a specific message by ID for threads and permalink resolution.
Cassandra (or a Cassandra-compatible store like ScyllaDB) handles this with a partition key of (workspace_id, channel_id) and a clustering key of message_id — a time-ordered ULID or Snowflake ID. Each partition is an ordered log of messages in that channel.
-- Cassandra CQL (conceptual)
CREATE TABLE messages (
workspace_id UUID,
channel_id UUID,
message_id TIMEUUID, -- time-ordered for free range scans
sender_id UUID,
body TEXT,
thread_root TIMEUUID, -- NULL if not a thread reply
edited_at TIMESTAMP,
PRIMARY KEY ((workspace_id, channel_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
History pagination becomes: SELECT * FROM messages WHERE workspace_id=? AND channel_id=? AND message_id < ? LIMIT 50 — a single Cassandra partition scan with no cross-partition scatter.
One caveat on partition size: Cassandra recommends keeping partitions under ~100 MB. A high-traffic channel that has been active for years could accumulate millions of messages and blow past that threshold. The production fix is time-based bucketing — add a bucket field (e.g., YYYYMM or week number) to the partition key: PRIMARY KEY ((workspace_id, channel_id, bucket), message_id). History queries then span at most two partitions when crossing a bucket boundary. This schema omits bucketing for clarity; flag it in an interview.
Threads
A thread is a nested list of messages sharing a thread_root — the ID of the parent message. The two main options are storing thread replies inline (easy but scatters replies across the channel partition) or using a separate thread partition: PRIMARY KEY ((workspace_id, thread_root_id), reply_message_id). The separate partition is cleaner for large threads and makes "fetch all replies to message X" a single efficient scan. The parent message stores a reply_count and last_reply_ts for rendering thread previews without loading all replies.
Message IDs
Use a time-ordered globally unique ID. Two common choices:
Twitter Snowflake (64-bit integer):
- 1 bit: unused (sign bit, always 0).
- 41 bits: millisecond timestamp since a custom epoch.
- 10 bits: machine / worker ID.
- 12 bits: per-machine sequence number.
This fits in a 64-bit integer, encodes time so database clustering order matches chronological order without a separate timestamp column, and supports 4096 IDs per millisecond per node.
ULID (128-bit, 26-character Crockford Base32 string):
- 48 bits: millisecond timestamp.
- 80 bits: cryptographically random.
ULIDs are lexicographically sortable but larger (128 bits vs 64 bits). They trade the machine-coordination requirement of Snowflake for randomness, at the cost of roughly double storage.
For a Cassandra store, Snowflake IDs (stored as BIGINT) or Cassandra-native TIMEUUID (a 128-bit Type 1 UUID embedding a timestamp) are both practical. TIMEUUID is used in the schema above for native range-scan support.
Fan-out deep dive
Normal channels (fewer than ~1 000 members)
Here's the flow when Alice sends a message to #general:
- Message Service writes to Kafka topic
messages. - Message Service simultaneously publishes to the pub-sub router:
PUBLISH channel:{id} {msg_payload}. - All gateways subscribed to
channel:{id}receive the event and push to their connected members. - Kafka consumers — the Cassandra writer, Elasticsearch indexer, read-state service — process asynchronously.
Online delivery is driven by the synchronous pub-sub publish, not by Kafka. Kafka is the durable backbone for offline processing. If the pub-sub router hiccups, online users miss a push but catch up via the next reconnect; Cassandra is the source of truth and is never bypassed.
flowchart LR
MSG[Message Service] -->|"PUBLISH channel:id"| ROUTER[Pub-Sub Router]
ROUTER --> GW1[Gateway A<br/>Alice's socket]
ROUTER --> GW2[Gateway B<br/>Bob's socket]
ROUTER --> GW3[Gateway C<br/>Carlos' socket]
MSG --> KAFKA[Kafka]
KAFKA --> CASS[(Cassandra<br/>durable write)]
KAFKA --> ES[Elasticsearch<br/>async index]
KAFKA --> RS[Read-state<br/>service]
style ROUTER fill:#15803d,color:#fff
style KAFKA fill:#ff2e88,color:#fff
style CASS fill:#0e7490,color:#fff
style MSG fill:#ff6b1a,color:#fff
Mega-channels and @here
A channel with 50 000 members where 10 000 are online: publishing 10 000 times to the pub-sub router in the hot path is too expensive.
The fix is to publish once to channel:{id}. Each gateway receives that one event and fans out locally to its connected members. Since each gateway typically has 20–100 members of the mega-channel, each gateway does a small local loop — the fan-out cost is O(gateways) not O(online members).
For @here — notify everyone in the channel, including offline members — a fan-out worker consumes from Kafka, reads the membership list in pages (it can be millions of rows for very large shared channels), and batches push notifications to the notification service. The worker uses a token-bucket rate limiter to avoid overwhelming APNs/FCM at spike time. Some notifications are delayed by seconds. That's acceptable for most channels; critical incident channels should use a dedicated higher-priority queue.
flowchart LR
KAFKA[Kafka<br/>message event] --> WORKER[Fan-out Worker]
WORKER -->|"page through members"| MEMBERS[(Channel Members<br/>Postgres / DynamoDB)]
WORKER --> NOTIF[Push Notification<br/>Service]
NOTIF --> APNS[APNs]
NOTIF --> FCM[FCM]
style WORKER fill:#ff6b1a,color:#fff
style KAFKA fill:#ff2e88,color:#fff
style NOTIF fill:#0e7490,color:#fff
Read state and unread counts
The write-volume problem
Every time a new message arrives, the Read-State service must increment unread_count for every channel member who hasn't yet read it. At design scale — 1M message reads per second — that's 1M random KV writes per second, sustained. Separately, every time a user opens a channel, we reset their "last read position" (a MARK_READ write). Mixing those hot random writes into the message store (Cassandra's strength is sequential appends) would hurt compaction and read performance. So we separate them.
A dedicated read-state store
A simple key-value store keyed by (user_id, channel_id), holding last_read_message_id and a cached unread_count:
user_channel_state:{user_id}:{channel_id} →
{ last_read_message_id: "01HX...",
unread_count: 5,
mention_count: 1 }
Redis handles this comfortably: 10M users × avg 10 channels = 100M keys × ~50 bytes ≈ 5 GB per replica — fits one node. DynamoDB or a dedicated Postgres shard with PK on (user_id, channel_id) are equally valid.
When a message arrives, the Read-State service (consuming from Kafka) increments unread_count for every channel member whose last_read_message_id is older than the new message. When the user opens a channel, the client sends MARK_READ {channel_id, message_id}, which resets unread_count = 0 and updates last_read_message_id. Badge counts across all of a user's channels are then a single Redis scan over that user's keys.
flowchart LR
KAFKA[Kafka<br/>new message] --> RSS[Read-State Service]
RSS -->|"increment unread_count<br/>for each member"| RSDB[(Read-State Store<br/>Redis / DynamoDB)]
CLIENT[Client opens channel] -->|"MARK_READ"| API[API]
API -->|"reset unread_count = 0"| RSDB
RSDB -->|"badge count query"| CLIENT
style RSS fill:#ff6b1a,color:#fff
style RSDB fill:#15803d,color:#fff
style KAFKA fill:#ff2e88,color:#fff
Presence
The N² problem
If 10 000 users in a workspace each care about everyone else's presence, a single status change must be broadcast to 9 999 users. That's ~10k events per status change — and if all 10 000 users change status simultaneously (e.g., everyone coming online Monday morning), the total number of subscription edges is 10 000 × 10 000 = 100M, making the per-second delivery volume clearly infeasible at workspace scale.
Bounding fan-out
The practical fix is to scope presence visibility so "interested" means a much smaller set:
- Direct contacts or DM partners: you see presence for users you've messaged directly.
- Shared channel members: presence is shown for members of channels you're both in.
- Workspace-level "who's online" list: coarsened to online vs offline only, refreshed every 30 seconds — no per-user change events, just a periodic snapshot.
Heartbeat mechanism
Every connected client sends a heartbeat (ping/pong) to the gateway every 10 seconds, which proxies it to the Presence Service:
Gateway → Presence Service: heartbeat(user_id, workspace_id)
Presence Service → Redis: SET presence:{user_id} "online" EX 30
The Redis TTL (30 seconds here — a few multiples of the heartbeat interval) means: if no heartbeat arrives, the key expires and the user appears offline. No explicit "disconnect" event is required, though the gateway can send one on a clean close for faster response. Note that the TTL for the Redis key is a low-level implementation detail; the application-level "away" transition (after no activity from any device) typically happens on a longer timescale, on the order of minutes.
sequenceDiagram
participant Client
participant GW as Gateway
participant PS as Presence Service
participant RD as Redis
loop Every 10 seconds
Client->>GW: heartbeat ping
GW->>PS: heartbeat(user_id)
PS->>RD: SET presence:user_id "online" EX 30
end
Note over RD: After 30s with no heartbeat, key expires → user appears offline
Presence fan-out
When a user's presence changes (a key appears or expires), the Presence Service publishes a presence_changed event to the pub-sub router for each "interested" user in the scoped set. For large workspaces, presence is batch-delivered: instead of one event per status change, the gateway sends a compact presence snapshot every 15 seconds covering all recently-changed users, trading a little immediacy for a lot of throughput.
Typing indicators
Typing indicators are ephemeral and best-effort — the architecture treats them very differently from messages. When a user starts typing:
- Client sends
TYPING_START {channel_id}to its gateway. - Gateway publishes to
channel:{channel_id}:typingon the pub-sub router. - Other gateways deliver via WebSocket.
- Indicator auto-expires after 5 seconds (no
TYPING_STOPneeded).
These events are never written to Kafka or the message store. If one is dropped, that's fine — you just don't see the typing bubble for a moment. This is an explicit choice to keep ephemeral UX out of the durable write path entirely.
Full-text search
The pipeline
Messages flow through Kafka into an indexing worker that writes to Elasticsearch:
- Parse the message body.
- Extract metadata: sender, channel, workspace, timestamp.
- Attach workspace-level access-control metadata as a document field (so queries can filter by channels the searching user can actually see).
- Upsert using
message_idas the document ID, making the operation idempotent.
Index design
Each workspace gets a separate Elasticsearch index (or data stream), or workspaces share an index with a workspace_id filter. The shared-index approach is simpler to manage but requires careful routing to prevent cross-tenant data leaks.
Every search query enforces: workspace_id = X AND channel_id IN [channels user can see]. Access control is checked both at query time (filter) and at the application layer (re-verify before returning results).
Latency
Indexing lag is typically 1–10 seconds end-to-end. Elasticsearch's default refresh interval is 1 second, so a document becomes searchable within roughly a second of being written. The Kafka consumer pipeline adds variable propagation delay on top — usually a second or two at low load, potentially more under burst. That's fine — users don't expect to find a message they sent one second ago.
Storage choices
| Data | Store | Rationale |
|---|---|---|
| Messages | Cassandra (sharded by workspace+channel) | High append throughput; efficient range scans by time |
| Channel metadata, workspace config | Postgres | Relational; low volume; ACID useful |
| Channel membership | Postgres (or DynamoDB for large orgs) | Many-to-many join; occasional bulk reads for fan-out |
| Read state (last read, unread count) | Redis (or DynamoDB) | High write volume; simple KV access pattern |
| Presence | Redis with TTL | Ephemeral; TTL handles disconnection naturally |
| Routing (user → gateway) | Redis Pub-Sub / NATS | Pub-sub semantics; ephemeral; no persistence needed |
| Search index | Elasticsearch | Inverted index; rich query language |
| Event backbone | Kafka | Durable; fan-out to multiple consumers; replayable |
| Attachments / files | Object storage (S3) | Large blobs; CDN-served |
Failure modes
Gateway node failure
A gateway crashes. Clients detect the dropped connection within seconds and reconnect with exponential back-off (with jitter to avoid thundering herd). On reconnect the client supplies its last_seen_message_id; the new gateway proxies a history catch-up before resuming the live stream. Messages sent during the gap were durably written to Cassandra via Kafka and come back in the catch-up response.
Pub-sub router failure
If the Redis pub-sub cluster fails, in-flight deliveries to online users are lost — pub-sub events are not persistent. Recovery: clients see no new messages and fall back to polling the catch-up API until WebSocket delivery resumes. Message durability is unaffected because the Kafka → Cassandra path is completely independent.
Kafka consumer lag
If a consumer falls behind — say the Elasticsearch indexer slows down — messages continue being delivered in real time via the pub-sub path. The lagging consumer eventually catches up. No data is lost as long as Kafka retention covers the lag window (typically 7 days).
Presence flapping
A mobile client on a flaky network repeatedly connects and disconnects, each reconnect triggering a presence-change event. Mitigation: debounce presence events with a 5-second grace window before broadcasting "offline." The gateway sends a clean disconnect event; the Presence Service waits 5 seconds before marking offline and publishing the change.
Mega-channel @here storm
A @here in a 50 000-member channel triggers fan-out notifications to all online members. If 10 000 are online and the push notification service is bounded, the fan-out worker's token-bucket rate limiter queues the work. Some notifications are delayed by seconds. Acceptable for most channels; critical incident channels should use a dedicated higher-priority queue.
API design (key endpoints)
-- Send a message
POST /api/v1/channels/{channel_id}/messages
Authorization: Bearer {token}
{ "text": "hello world", "thread_root": null }
→ 201 Created
{ "message_id": "01HX3AB...", "ts": "2026-03-20T12:34:56Z" }
-- Fetch history (paginate backward)
GET /api/v1/channels/{channel_id}/messages?before={message_id}&limit=50
→ 200 OK
{ "messages": [...], "has_more": true }
-- Mark channel as read
POST /api/v1/channels/{channel_id}/mark_read
{ "last_read_message_id": "01HX3AB..." }
-- WebSocket (client connects once, then receives events)
wss://gateway.slack.internal/connect?token={session_token}
← {"type": "message", "channel_id": "...", "message": {...}}
← {"type": "presence", "user_id": "...", "status": "away"}
← {"type": "typing", "channel_id": "...", "user_id": "..."}
← {"type": "read_state", "channel_id": "...", "unread_count": 3}
Sequence: message delivery end-to-end
sequenceDiagram
participant Alice as Alice (client)
participant GW_A as Gateway A
participant MsgSvc as Message Service
participant Kafka as Kafka
participant Router as Pub-Sub Router
participant GW_B as Gateway B
participant Bob as Bob (client)
participant Cass as Cassandra
participant ES as Elasticsearch
participant RS as Read-State Service
Alice->>GW_A: WS: send message to #general
GW_A->>MsgSvc: POST /message {channel, text}
MsgSvc->>Cass: write message (sync, quorum)
MsgSvc->>Router: PUBLISH channel:general {msg}
MsgSvc->>Kafka: produce message event
MsgSvc-->>GW_A: 201 Created {message_id}
GW_A-->>Alice: WS: ack {message_id}
Router->>GW_A: deliver to Alice (echo)
Router->>GW_B: deliver (Bob is subscribed here)
GW_B->>Bob: WS: push message
Kafka->>ES: async index (~1-10s lag)
Kafka->>RS: read-state fan-out (async)
Notice that the Cassandra write is on the synchronous path — before returning 201 — to guarantee durability. The Kafka produce is also synchronous (acks=all, meaning all in-sync replicas must acknowledge) before returning, ensuring the event is not lost. Online delivery via the pub-sub router and durable storage via Kafka/Cassandra are parallel, independent paths.
Contrast with 1:1 chat systems
This design differs from WhatsApp / 1:1 chat in several key ways:
| Dimension | Slack (team chat) | WhatsApp (1:1 / small group) |
|---|---|---|
| Fan-out model | Channel pub-sub → gateways | Per-recipient delivery via session server |
| Group size | Up to 50 000 members | Up to 1 024 (raised from 512, which itself was raised from 256) |
| Read state | Per user, per channel, server-side | Per message, device-side + server receipts |
| Presence | Workspace-scoped, bounded fan-out | Contact-list scoped |
| Search | Server-side full-text (Elasticsearch) | Limited (client-side or minimal server search) |
| Message retention | Permanent (configurable) | Device-side; server holds undelivered messages up to 30 days only |
| Encryption | In transit (TLS), not E2E by default | End-to-end encrypted (Signal Protocol) |
Things to discuss in an interview
- The routing problem: how does the Message Service know which gateway holds Alice's WebSocket? Walk through the pub-sub subscription model.
- Fan-out strategies: normal channel (channel-level pub-sub), mega-channel (worker pool + back-pressure), why they differ.
- Read-state at write scale: why it lives in a dedicated store rather than alongside messages.
- Presence N² problem: how scoping visibility and batching bounds the fan-out.
- Catch-up after reconnect: cursor-based, backed by the durable Cassandra log, decoupled from the pub-sub delivery path.
- Typing indicators as ephemeral pub-sub: never persisted, acceptable to lose, auto-expire by TTL.
- Consistent hashing for gateway assignment: what happens on gateway failure (reconnect + catch-up), why consistent hashing minimises reshuffling.
Things you should now be able to answer
- Why can't the Message Service push directly to the user's WebSocket without a routing layer?
- What happens to messages sent to a channel while a user is disconnected, and how does the client recover them?
- Why is presence expensive at scale, and what are two mechanisms to bound the cost?
- Why are typing indicators handled differently from messages?
- What is the trade-off between writing read-state synchronously vs. via Kafka?
- Why is Cassandra a better fit than Postgres for the message store at this scale?
- How does fan-out work differently for a 50-member channel vs. a 50 000-member channel?
Further reading
- Slack Engineering — "Real-time Messaging" — the authoritative Slack post covering Channel Servers, Gateway Servers, CHARMs, and the consistent-hashing fan-out architecture.
- consistent hashing — underpins gateway assignment and sharding.
- design-chat-system — the 1:1 / small-group baseline to contrast against.
- design-notification-system — offline push delivery for Slack's fan-out worker.
- realtime-communication-patterns — WebSocket vs SSE vs long-poll trade-offs.
- Cassandra documentation — data modelling for time-series workloads.
- Elasticsearch documentation — data streams for append-heavy indexing.
Frequently asked questions
▸What is the gateway tier in Slack's architecture and why is it necessary?
The gateway tier is a set of stateful servers that own all persistent WebSocket connections — up to 10 million concurrently. It is necessary because a WebSocket connection is bound to a specific server, so any backend service that wants to push to a user must know which gateway node holds that user's socket; without a routing layer sitting in front of these gateways, fan-out becomes impossible to coordinate.
▸How does Slack handle fan-out differently for a normal channel versus a mega-channel with 50,000 members?
For normal channels the Message Service publishes once to a pub-sub router topic keyed by channel ID; each gateway that has subscribers on that channel receives the single event and pushes locally to its connected members. For mega-channels the same single publish keeps online delivery efficient, but notifying offline members via @here requires a dedicated fan-out worker pool that reads the membership list in pages from storage and batches push notifications to APNs and FCM with a token-bucket rate limiter to avoid overwhelming those services.
▸Why is read state stored separately from the message store rather than in Cassandra?
Cassandra is optimized for sequential appends, and mixing the high-volume random writes of read-state updates — roughly 1 million writes per second at design scale — into the message store would hurt compaction and read performance. A dedicated key-value store keyed by (user_id, channel_id) holding last_read_message_id and unread_count handles this access pattern cleanly; Redis fits the entire dataset in one node at roughly 5 GB for 100 million keys.
▸How does Slack's presence system avoid an N-squared fan-out storm across large workspaces?
The article describes two mitigations: first, presence visibility is scoped to direct contacts, shared channel members, and a coarsened workspace-level snapshot rather than broadcasting every status change to every workspace member; second, instead of emitting one event per status change, gateways send a compact presence snapshot every 15 seconds covering all recently-changed users, trading a little immediacy for a significant reduction in event volume.
▸What is the difference between how Slack and WhatsApp handle message retention and encryption?
Slack stores messages permanently on the server (configurable retention) and encrypts only in transit via TLS, not end-to-end by default. WhatsApp applies end-to-end encryption using the Signal Protocol and holds undelivered messages on its servers for up to 30 days only, after which messages live on devices rather than centrally.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.