Design a Notification System (Push, Email, SMS)
A reliable multi-channel notification platform — fanout, templates, dedup, rate limiting, and the realities of APNS/FCM.
The problem
Slack sends you a push when a thread you're in goes active at 2 AM. Amazon fires an email the moment your package is out for delivery. WhatsApp pings your phone, your laptop, and your watch simultaneously. These look like trivial "call APNS and move on" operations — until you're responsible for one. At 1 billion notifications a day across push, email, and SMS, the surface area for failure is enormous.
The core of this system is a multi-channel fan-out: a single upstream event (order shipped, payment received, new follower) must be routed to the right channels, for the right user, through the right external provider, with the right rendered copy in the right language. That routing pipeline has to respect opt-outs, honor quiet hours, and survive flaky provider APIs — all without ever double-sending the same message.
Two engineering tensions dominate the design. First, at-least-once delivery vs. exactly-once user experience: Kafka and retry logic guarantee you'll attempt every notification, but that means duplicates are a constant threat. Every piece of the pipeline needs idempotency, or users start filing bugs like "I got the same confirmation email six times." Second, fan-out amplification: a single celebrity-account event on a social platform can fan out to millions of followers instantly. The queueing and routing layer has to absorb that burst without the pipeline backing up or the providers rate-limiting you into a blacklist.
Functional requirements
- Send notifications via push (mobile), email, SMS, in-app.
- Support templates with variables (Hi {name}, your order #{id} shipped).
- User preferences: opt-in/out per channel, per category.
- Scheduled and triggered notifications.
- Track delivery status and engagement.
Non-functional
- 1B notifications/day, with bursts to ~50k/sec at peak.
- Delivery within 30 seconds for transactional events.
- 99.9% delivery success on healthy paths.
- Idempotent — never double-send.
- Auditable.
Capacity
| Dimension | Estimate | How we got there |
|---|---|---|
| Volume | 1B notifications/day | Given |
| Throughput (avg) | ~12k/sec | 1B ÷ 86,400 s |
| Throughput (peak) | ~50k/sec | ~4× average burst |
| Channel split | 70% push · 20% email · 5% SMS · 5% in-app | Typical app ratio |
| Per-event payload | ~2 KB | Template + variables + metadata |
| Retention | 30 days delivery audit | SLA / compliance |
| Audit log storage | 60 TB | 1B × 2 KB × 30 days |
Takeaway: the bottleneck is not storage (60 TB is manageable) — it is throughput. At 50k/sec peak across four external providers (APNS, FCM, SES/SendGrid, Twilio), the system lives or dies on queue depth management and per-provider rate limiting.
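The arithmetic behind the table can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (2 KB = 2,000 bytes) and a 4x burst factor, and the variable names are illustrative. Note that the exact result for peak is about 46k/sec; the table's ~50k comes from rounding the average up to 12k first and then rounding again.

```python
SECONDS_PER_DAY = 86_400

daily_volume = 1_000_000_000   # notifications per day (given)
payload_bytes = 2_000          # ~2 KB per event, decimal units (assumption)
retention_days = 30
peak_multiplier = 4            # assumed burst factor over the average

avg_per_sec = daily_volume / SECONDS_PER_DAY
peak_per_sec = avg_per_sec * peak_multiplier
audit_storage_tb = daily_volume * payload_bytes * retention_days / 1e12

# Channel split from the table, applied to the peak rate.
channel_split = {"push": 0.70, "email": 0.20, "sms": 0.05, "in_app": 0.05}
peak_by_channel = {ch: peak_per_sec * share for ch, share in channel_split.items()}

print(f"average   ~{avg_per_sec:,.0f}/sec")
print(f"peak      ~{peak_per_sec:,.0f}/sec")
print(f"audit log ~{audit_storage_tb:.0f} TB over {retention_days} days")
for channel, rate in peak_by_channel.items():
    print(f"  {channel:<7} ~{rate:,.0f}/sec at peak")
```

The per-channel numbers are the useful part: they are what each provider-facing worker pool has to be sized against.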
Building up to the design
A notification system is the "looks simple, fails in horrifying ways" problem. The architecture below has half a dozen pieces; each appears the moment you hit a wall. Walk forward.
V1: Synchronous send from the calling service
def on_order_shipped(order):
    user = users.get(order.user_id)
    apns.push(user.device_token, "Your order has shipped!")
    db.save_order(order)
This works for tiny apps and gets you a demo. The problem is that APNS is a third-party network call: fast on a good day and completely unreachable on a bad one. When it's slow, your order-shipped endpoint times out. When the caller retries, you send the notification twice. You've tightly coupled your order flow to a flaky external dependency, and there's no seam to pull them apart.
V2: Enqueue via a queue, send in a worker
When an order ships, the caller pushes a NotificationEvent onto a queue (Kafka, SQS, RabbitMQ) instead of calling the provider. A separate worker reads the queue and calls APNS.
Now the caller returns instantly. APNS slowness no longer blocks business logic, and failures are retryable because the event is sitting safely in the queue. The problem is that one queue plus one worker can't handle 50k notifications per second across multiple channels. And if a user opts out of marketing notifications after the event is already queued, you'll still send it — because you haven't separated the routing decision from the enqueue step.
V3: Channel-specific workers + user preference check
Split into channel-specific workers — a push worker, an email worker, an SMS worker. Each pulls from the queue, checks whether this user wants this kind of notification on this channel, then sends. Each channel now scales independently, and opt-outs are honored at delivery time, not just at enqueue. The next crack appears when you try to change notification copy: every string is hard-coded in the worker, so a wording change means a redeploy.
V4: Template engine + render at send
A separate template service stores templates by (template_id, channel, locale). The worker fetches the template, renders it with the event's variables, then sends. Now you can edit a template and all queued messages pick up the change. You also get safe variable substitution that prevents HTML or SMS injection. The next thing that bites you: services retry on network blips. You'll start sending duplicate notifications.
V5: Idempotency + dedup
Every event carries an idempotency_key (e.g. order:42:shipped). Workers check a recent-deliveries cache (Redis with a 24-hour TTL) before sending — same key in that window means skip. This handles Kafka redelivery, API retries, and the occasional provider that returns success twice.
V6: Rate limiting + provider failover
Three more things you need before calling this production-ready. First, per-user rate caps: never send the same user more than N notifications per hour for the same category. Second, per-provider rate limiting so you don't blow your Twilio quota. Third, provider failover: when a provider goes down (SendGrid having an incident, say), you want to route email through SES automatically until it recovers.
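One way to sketch the failover idea is an ordered provider list with a simple circuit breaker: after a few consecutive failures a provider is skipped for a cooldown period, then tried again. Everything here is an assumption for illustration (the `FailoverSender` name, the threshold of 3, the 60-second cooldown, and the plain callables standing in for real provider clients); it is not how any particular SDK works.

```python
class FailoverSender:
    """Try providers in priority order, skipping ones whose circuit is open."""

    def __init__(self, providers, failure_threshold=3, cooldown_seconds=60.0):
        self.providers = list(providers)          # [(name, send_fn), ...]
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._failures = {name: 0 for name, _ in self.providers}
        self._open_until = {name: 0.0 for name, _ in self.providers}

    def send(self, message, now):
        """Return the name of the provider that accepted the message."""
        for name, send_fn in self.providers:
            if now < self._open_until[name]:
                continue                          # circuit open: skip for now
            try:
                send_fn(message)
            except Exception:
                self._failures[name] += 1
                if self._failures[name] >= self.failure_threshold:
                    # Too many consecutive failures: stop calling this
                    # provider until the cooldown has passed.
                    self._open_until[name] = now + self.cooldown_seconds
                    self._failures[name] = 0
                continue
            self._failures[name] = 0              # success resets the count
            return name
        raise RuntimeError("all providers unavailable")
```

The clock is passed in as `now` so the behavior is easy to test; a real worker would read it from a monotonic clock. When every provider is down, the exception is the signal to fall back to the retry and dead-letter path.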
V7: Audit + delivery tracking
Every send, skip, and failure writes to an audit log. Failures requeue with exponential backoff up to N retries; after that, the event lands in a dead-letter queue and pages the on-call. This is the production design — every box in the diagram below earned its place by fixing something that broke.
flowchart LR
V1[V1: sync send<br/>caller blocked] --> V2[V2: + queue + worker<br/>decoupled]
V2 --> V3[V3: + channels + prefs<br/>scaled, polite]
V3 --> V4[V4: + templates<br/>no-deploy changes]
V4 --> V5[V5: + idempotency<br/>no dupes]
V5 --> V6[V6: + rate limit + failover<br/>resilient]
V6 --> V7[V7: + audit + retries<br/>production]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V7 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
SRC1[Order service] --> ENQ[Notification API]
SRC2[User service] --> ENQ
SRC3[Background jobs] --> ENQ
ENQ --> KAFKA[Kafka: notify_events]
KAFKA --> ROUTER[Router]
ROUTER --> TPL[Template Engine]
TPL --> ROUTER
ROUTER --> PREFS[(User Preferences)]
PREFS --> ROUTER
ROUTER --> CHN_PUSH[Push Worker]
ROUTER --> CHN_EMAIL[Email Worker]
ROUTER --> CHN_SMS[SMS Worker]
ROUTER --> CHN_IN[In-app Worker]
CHN_PUSH --> APNS[APNS]
CHN_PUSH --> FCM[FCM]
CHN_EMAIL --> SES[SES / SendGrid]
CHN_SMS --> TWIL[Twilio]
CHN_IN --> WS[WebSocket Gateway]
CHN_PUSH --> AUDIT[(Audit Log)]
CHN_EMAIL --> AUDIT
CHN_SMS --> AUDIT
CHN_IN --> AUDIT
style ENQ fill:#ff6b1a,color:#0a0a0f
style KAFKA fill:#a855f7,color:#fff
style ROUTER fill:#15803d,color:#fff
API
POST /notifications/send
{
"user_id": 12345,
"template_id": "order_shipped",
"data": { "order_id": 99, "tracking_url": "..." },
"channels": ["push", "email"], // or auto from prefs
"idempotency_key": "order_99_shipped",
"send_at": "2026-02-15T14:00:00Z" // optional, default now
}
Returns 202 Accepted with a notification_id.
Why a queue is mandatory
Provider API calls are fast in the happy path but highly variable and periodically unavailable. SendGrid's API accepts a message in ~22 ms p50 but spikes to hundreds of milliseconds under load, and then your email still has to be delivered to the recipient's mailbox — total end-to-end latency is seconds to minutes. Twilio's SMS API similarly accepts requests in ~65 ms p50, but actual delivery to the handset adds carrier-dependent latency (seconds to tens of seconds). APNS round-trip is typically fast (tens of milliseconds per message on a warm HTTP/2 connection), but a single persistent connection can saturate or the provider can be temporarily unreachable. None of these are reliable enough to call synchronously on the critical path.
A request hits the API → enqueue → return 202. Workers consume the queue, deliver, retry on failure.
Kafka is a strong choice: high throughput, partitioning by user_id (in-order per user), retention for replay. If a worker crashes, Kafka redelivers from the last committed offset, which is exactly why idempotency is non-negotiable.
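The per-user ordering property comes from keying each event by user_id so that all of a user's events land in the same partition. The sketch below imitates that with in-memory deques; it is not a Kafka client. `PartitionedQueue` is a hypothetical name, and the CRC32 hash is a stand-in for whatever partitioner the real producer uses.

```python
import zlib
from collections import deque


class PartitionedQueue:
    """In-memory stand-in for a topic partitioned by user_id."""

    def __init__(self, num_partitions=4):
        self.partitions = [deque() for _ in range(num_partitions)]

    def partition_for(self, user_id):
        # Stable hash: the same user always maps to the same partition,
        # which is what preserves per-user ordering.
        return zlib.crc32(str(user_id).encode()) % len(self.partitions)

    def enqueue(self, event):
        """API side: append and return at once (the 202 Accepted path)."""
        p = self.partition_for(event["user_id"])
        self.partitions[p].append(event)
        return p

    def drain(self, partition, deliver):
        """Worker side: consume one partition in arrival order."""
        count = 0
        while self.partitions[partition]:
            deliver(self.partitions[partition].popleft())
            count += 1
        return count
```

Ordering holds only within a partition. Events for two different users may be delivered in any relative order, which is usually acceptable for notifications.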
Templates
Templates separate what to say from how to render it for each channel.
template: order_shipped
versions:
push: "Your order #{order_id} shipped! 📦"
email:
subject: "Your order is on the way"
html: "Hi {name}, your order shipped. Track it at {tracking_url}."
sms: "Order #{order_id} shipped. Track: {tracking_url}"
Stored in a templates DB; rendered just-in-time with the event payload. Supports per-language variants (order_shipped:en, order_shipped:fr).
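Just-in-time rendering can be sketched as a lookup keyed by (template_id, channel, locale) followed by variable substitution. This is a minimal illustration, assuming an in-memory `TEMPLATES` dict in place of the templates DB, a fallback to English when a locale is missing, and HTML-escaping of values for the email channel only; none of these choices are prescribed by the design above.

```python
import html

# Stand-in for the templates DB; keys are (template_id, channel, locale).
TEMPLATES = {
    ("order_shipped", "push", "en"): "Your order #{order_id} shipped! 📦",
    ("order_shipped", "email", "en"): "Hi {name}, your order shipped. Track it at {tracking_url}.",
    ("order_shipped", "sms", "en"): "Order #{order_id} shipped. Track: {tracking_url}",
}


def render(template_id, channel, locale, data, default_locale="en"):
    """Render one channel's copy from the event payload."""
    template = TEMPLATES.get((template_id, channel, locale))
    if template is None:
        # Assumed policy: fall back to the default locale.
        template = TEMPLATES[(template_id, channel, default_locale)]
    if channel == "email":
        # Escape values so user-supplied text cannot inject markup.
        data = {key: html.escape(str(value)) for key, value in data.items()}
    # A missing variable raises KeyError rather than sending broken copy.
    return template.format(**data)


print(render("order_shipped", "sms", "fr",
             {"order_id": 99, "tracking_url": "https://example.com/t/99"}))
```

Failing loudly on a missing variable is deliberate: a render error can go to the dead-letter queue, whereas "Hi {name}" in a customer's inbox cannot be recalled.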
Idempotency
Every notification has an idempotency_key. Before sending, the worker checks whether we already processed that key.
def process(event):
    # SET NX claims the key atomically; ex=86400 expires it after 24 hours.
    if redis.set(f"idem:{event.idempotency_key}", 1, nx=True, ex=86400):
        send(event)
    else:
        log("duplicate, skipped")
SET NX returns True on the first call and None on every retry — both are handled correctly by the if conditional (the redis-py library translates the raw Redis "OK" response to True; on a no-op it returns None, which is falsy). This matters for three scenarios: Kafka redelivers from the last committed offset after a worker crash, a caller retries the API after a network timeout, and some providers return a success response twice. The 24-hour TTL covers all of them.
flowchart LR
EVT[Event arrives] --> CHK{"Redis SET NX<br/>idempotency_key"}
CHK -->|"OK — first time"| SEND[Send to provider]
CHK -->|"nil — seen before"| SKIP[Skip + log]
SEND --> LOG[Write audit log]
SKIP --> LOG
style CHK fill:#ff6b1a,color:#0a0a0f
style SEND fill:#15803d,color:#fff
style SKIP fill:#ffaa00,color:#0a0a0f
style LOG fill:#0e7490,color:#fff
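To experiment with the claim-then-send flow without a Redis server, the SET NX EX behavior can be imitated in memory. This is a sketch under stated assumptions: `InMemoryIdempotencyStore` is a hypothetical single-process stand-in, and releasing the claim when the send raises is a design choice added here (so that a later retry is not mistaken for a duplicate), not something the flow above specifies.

```python
import time


class InMemoryIdempotencyStore:
    """Imitates Redis `SET key value NX EX ttl` for a single process."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def claim(self, key, ttl_seconds=86_400):
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False                      # key still live: duplicate
        self._expires_at[key] = now + ttl_seconds
        return True                           # first claim in this window

    def release(self, key):
        self._expires_at.pop(key, None)


def process(event, store, send):
    key = f"idem:{event['idempotency_key']}"
    if not store.claim(key):
        return "duplicate"
    try:
        send(event)
    except Exception:
        # Assumption: give the key back so the retry path can send later.
        store.release(key)
        raise
    return "sent"
```

There is still a gap: if the worker crashes after the provider accepted the message but before the failure handling runs, the key stays claimed and the outcome is unknown. Closing that gap needs a provider-side idempotency feature or an audit-log reconciliation step.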
User preferences
Stored in a fast-access KV store:
user:42:notif_prefs = {
"push": ["transactional", "social"],
"email": ["transactional"],
"sms": ["transactional"]
}
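The Router enforces these preferences per channel. For the user above, an event with category social is sent as a push, but the email send is dropped because social is not in the user's email list (the user has not opted in to social email).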
GDPR / regulatory: every notification must respect "do not contact" lists; transactional vs. marketing categories matter for legal compliance.
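The Router's check reduces to a filter over the requested channels. The sketch below uses the preference shape shown above; `PREFS` and `DO_NOT_CONTACT` are in-memory stand-ins for the KV store and the suppression list, and treating a user with no stored preferences as opted out of everything is an assumed default.

```python
# Stand-ins for the KV store and the "do not contact" list.
PREFS = {
    42: {
        "push": ["transactional", "social"],
        "email": ["transactional"],
        "sms": ["transactional"],
    },
}
DO_NOT_CONTACT = set()


def allowed_channels(user_id, category, requested):
    """Return the subset of requested channels this user accepts for a category."""
    if user_id in DO_NOT_CONTACT:
        return []                       # suppression list wins over everything
    prefs = PREFS.get(user_id, {})      # assumed default: no prefs, no sends
    return [ch for ch in requested if category in prefs.get(ch, [])]


print(allowed_channels(42, "social", ["push", "email"]))  # ['push']
```

Because this runs at dispatch time, a preference change made after the event was enqueued is still honored.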
Push notifications: APNS / FCM
APNS (Apple Push Notification service): persistent HTTP/2 connection from your servers to Apple's servers. Apple delivers to the device. A single client instance (multiplexing concurrent HTTP/2 streams) can sustain roughly 4,000+ pushes/sec in community benchmarks (Apple does not publish a formal limit); run one instance per CPU core to scale throughput further.
FCM (Firebase Cloud Messaging): the canonical push rail for Android (successor to the deprecated GCM). FCM also delivers to iOS by tunneling through APNS and to the web via the Web Push protocol — meaning for cross-platform apps, FCM can be a single provider abstraction in front of both Android and APNS delivery.
sequenceDiagram
participant App as Push Worker
participant APNS
participant Device
App->>APNS: send(token, payload)
APNS->>Device: deliver
APNS-->>App: 200 OK / 4xx (bad token, etc.)
Device-->>APNS: feedback (token revoked, etc.)
Each app has a device token per install. Tokens become invalid when a user reinstalls the app, the OS rotates the token, or the app is deleted. APNS returns HTTP 410 (Unregistered) for an inactive token, and 400 (DeviceTokenNotForTopic) if the token's topic (bundle ID) mismatches the certificate. On a 410, remove the token from your DB immediately.
One operational note: APNS deliberately delays returning 410 on a sliding, undocumented schedule — this is a privacy-protecting design so providers cannot track app installs and uninstalls in real time. You may send to a stale token for minutes or longer before APNS starts returning 410. Your audit log and a periodic token-validity sweep help detect stale tokens over time. Do not infer an app is still installed just because APNS hasn't returned 410 yet.
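The response handling described above can be written as a small classifier that maps an APNS status to what the worker should do with the token. This is a sketch: the `TokenAction` enum and the function name are hypothetical, the 410 and 400 cases follow the text above, and treating 429 and 5xx as retryable is an assumption about sensible worker policy.

```python
from enum import Enum


class TokenAction(Enum):
    KEEP = "keep"                # accepted; token is still good
    DELETE = "delete"            # token is dead; remove it from the DB
    RETRY = "retry"              # transient; back off and try again
    INVESTIGATE = "investigate"  # likely a config or payload problem


def classify_apns_response(status, reason=""):
    """Map an APNS HTTP status (and reason string) to a token action."""
    if status == 200:
        return TokenAction.KEEP
    if status == 410:
        # Unregistered: the token is no longer active for this app.
        return TokenAction.DELETE
    if status == 429 or 500 <= status < 600:
        return TokenAction.RETRY
    # Other 4xx, e.g. 400 DeviceTokenNotForTopic: retrying will not help.
    return TokenAction.INVESTIGATE


print(classify_apns_response(410, "Unregistered"))  # TokenAction.DELETE
```

Keeping the classification separate from the send call makes it easy to log the decision in the audit trail alongside the raw status.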
Delivery guarantees
Three things worth being explicit about here. Kafka gives you at-least-once delivery, which means duplicates are possible if you don't dedup with idempotency keys. Channels have no ordering guarantees relative to each other — push may arrive before email for the same event, or after. And provider acceptance is not the same as device delivery: even after APNS or FCM returns 200 OK, a notification may never reach the user's screen if their device is offline, the app is uninstalled, or a carrier filters the SMS. Build your audit log accordingly — "accepted by provider" and "delivered to user" are two different states.
Rate limiting and quiet hours
You don't want to spam users. The Router enforces two layers of policy.
The first is a per-user rate cap: for example, max 5 marketing notifications per hour per user. This is a Redis sliding window keyed by (user_id, category) — cheap to check, no DB write per send. See the rate limiter article for the algorithm.
The second is quiet hours: 10 PM – 7 AM in the user's local timezone, suppress non-critical notifications. Quiet-hours logic must use the user's current timezone, not the timezone at enqueue time — always recompute at dispatch, because users travel.
flowchart LR
EVT2[Event ready to send] --> RC{"Rate cap OK?<br/>(sliding window)"}
RC -->|"over limit"| DROP[Drop or delay]
RC -->|"within limit"| QH{"Quiet hours<br/>in user TZ?"}
QH -->|"yes + non-critical"| SCHED[Schedule for morning]
QH -->|"yes + critical"| DISPATCH[Dispatch to worker]
QH -->|"no"| DISPATCH
style RC fill:#ff6b1a,color:#0a0a0f
style QH fill:#a855f7,color:#fff
style DISPATCH fill:#15803d,color:#fff
style SCHED fill:#ffaa00,color:#0a0a0f
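A sliding-window cap keyed by (user_id, category) can be sketched in memory as a log of recent send timestamps. This is only an illustration of the idea: `SlidingWindowLimiter` is a hypothetical name, it keeps state in one process rather than in Redis, and the timestamp is passed in so the behavior is deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends per key within the trailing window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)   # (user_id, category) -> timestamps

    def allow(self, user_id, category, now):
        hits = self._hits[(user_id, category)]
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False                  # over the cap: drop or delay
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=5, window_seconds=3600)
print([limiter.allow(42, "marketing", now=t) for t in range(6)])
```

Only allowed sends are recorded, so a user who is already over the cap does not extend their own lockout by generating more events.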
Aggregation / digesting
Aggregation deserves a closer look. The naive system fires one notification per event; a popular post can generate hundreds of likes in minutes. We collapse them:
flowchart LR
E1[Like by Alice] --> AGG[Aggregator window]
E2[Like by Bob] --> AGG
E3[Like by Carol] --> AGG
E4[5 min timer] --> AGG
AGG --> NOTIF["3 people liked your post"]
style AGG fill:#ff6b1a,color:#0a0a0f
Implementation: Redis sorted set keyed by (user_id, post_id, action), members are actor IDs. After a quiet window or N events, flush as a digest.
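The windowing logic can be sketched without Redis using a dict of buckets keyed by (user_id, post_id, action). Several details here are assumptions made for the sketch: the `LikeAggregator` name, a fixed window measured from the first event in a bucket (rather than a quiet period since the last one), and the digest wording.

```python
class LikeAggregator:
    """Collapse many per-actor events into one digest per (user, post, action)."""

    def __init__(self, window_seconds=300, max_events=50):
        self.window = window_seconds
        self.max_events = max_events
        self._buckets = {}   # key -> {"actors": {...}, "opened_at": ts}

    def add(self, user_id, post_id, action, actor, now):
        """Record an event; return a digest if the bucket hit max_events."""
        key = (user_id, post_id, action)
        bucket = self._buckets.setdefault(key, {"actors": {}, "opened_at": now})
        bucket["actors"].setdefault(actor, now)   # repeat actors count once
        if len(bucket["actors"]) >= self.max_events:
            return self._flush(key)
        return None

    def flush_due(self, now):
        """Flush every bucket whose window has elapsed."""
        due = [k for k, b in self._buckets.items()
               if now - b["opened_at"] >= self.window]
        return [self._flush(k) for k in due]

    def _flush(self, key):
        bucket = self._buckets.pop(key)
        count = len(bucket["actors"])
        noun = "person" if count == 1 else "people"
        user_id, post_id, _action = key
        return {"user_id": user_id, "post_id": post_id,
                "text": f"{count} {noun} liked your post"}
```

A timer (the "5 min timer" in the diagram) would call `flush_due` periodically; the N-events path flushes early so a viral post does not hold one enormous bucket.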
Tracking and engagement
For each notification, track:
- created_at
- dispatched_at (sent to provider)
- delivered_at (provider confirmed delivery)
- opened_at (user tapped/opened; best-effort, varies by channel)
- clicked_at (user clicked a link)
Stored in an event store (Kafka → BigQuery / Snowflake) for analytics.
A/B test templates by varying template_version and measuring open rates. Bad templates (high opt-out, low CTR) get retired.
Failures
Provider outages
When APNS, FCM, or Twilio returns 5xx or times out, the worker retries with exponential backoff. After N failed attempts, the event goes to a dead-letter queue — visible to ops, inspectable, and either re-drainable once the provider recovers or forwarded to a fallback channel (APNS down → try email). The key is that the event is never silently dropped; it's always somewhere in the system with a status you can query.
flowchart LR
SEND2[Send to provider] --> OK{"200 OK?"}
OK -->|yes| DONE[Write audit: delivered]
OK -->|no| RETRY{"Retries < N?"}
RETRY -->|yes| BACKOFF["Wait 2^n seconds<br/>retry"]
RETRY -->|no| DLQ[Dead-letter queue<br/>+ alert]
BACKOFF --> SEND2
style SEND2 fill:#0e7490,color:#fff
style OK fill:#ff6b1a,color:#0a0a0f
style DONE fill:#15803d,color:#fff
style DLQ fill:#ff2e88,color:#fff
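The retry loop in the flowchart can be sketched as follows. The names are hypothetical (`TransientProviderError`, `send_with_retry`), the dead-letter queue is a plain list, and the sleep function is injectable so the delays can be observed; a production worker would more likely requeue with a delay than block, and would usually add jitter.

```python
import time


class TransientProviderError(Exception):
    """Raised by a provider client on a 5xx response or a timeout (assumed)."""


def send_with_retry(event, send, dead_letters, max_retries=5,
                    base_delay=1.0, sleep=time.sleep):
    """Try once, then retry up to max_retries times with 2^n backoff."""
    for attempt in range(max_retries + 1):
        try:
            send(event)
            # Provider accepted the request; device delivery is a separate state.
            return "accepted"
        except TransientProviderError:
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
    # Retries exhausted: park the event where ops can inspect and re-drain it.
    dead_letters.append(event)
    return "dead_lettered"
```

Only transient errors are retried. A permanent failure such as a bad token should raise a different exception and skip the loop entirely.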
Invalid tokens
When a provider returns 410 or a similar "bad token" status, mark the device token as dead in your DB and don't try again. Retrying bad tokens wastes quota and can get your sending IP flagged.
User opts out mid-flight
A notification was queued before the opt-out arrived. Because the Router checks preferences at dispatch time — not at enqueue time — the opt-out applies even to messages already sitting in Kafka. This is the reason preference enforcement happens late in the pipeline.
Timezone changes
If a user is mid-flight between timezones, quiet-hours logic must use their current timezone at the moment of dispatch. Storing the timezone at enqueue time would suppress or send notifications at the wrong local hour. Recompute at dispatch.
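Recomputing at dispatch amounts to converting the current UTC time into whatever timezone the user has right now and checking it against the quiet window. The sketch below assumes the 10 PM to 7 AM window from earlier and takes the timezone as a `tzinfo` object; fixed offsets are used in the example to keep it self-contained, while a real system would more likely pass `zoneinfo.ZoneInfo("America/New_York")` or similar so daylight saving is handled.

```python
from datetime import datetime, time, timedelta, timezone

QUIET_START = time(22, 0)   # 10 PM local
QUIET_END = time(7, 0)      # 7 AM local


def in_quiet_hours(now_utc, user_tz):
    """True if the user's local wall-clock time falls in the quiet window."""
    local = now_utc.astimezone(user_tz).time()
    # The window wraps past midnight, so it is an OR of two ranges.
    return local >= QUIET_START or local < QUIET_END


def should_dispatch_now(now_utc, user_tz, critical):
    """Critical notifications bypass quiet hours; others wait for morning."""
    return critical or not in_quiet_hours(now_utc, user_tz)


now = datetime(2026, 2, 15, 12, 30, tzinfo=timezone.utc)
east = timezone(timedelta(hours=-5))   # 07:30 local
west = timezone(timedelta(hours=-8))   # 04:30 local
print(in_quiet_hours(now, east), in_quiet_hours(now, west))  # False True
```

The same event is dispatchable for one user and deferred for the other, which is why the timezone has to be looked up at dispatch rather than copied into the event at enqueue.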
Storage
| Data | Store |
|---|---|
| Notification event | Kafka (transient) → S3 (long-term audit) |
| User preferences | DynamoDB / Redis (read-heavy) |
| Device tokens | Postgres (small per user) |
| Templates | Postgres + cache |
| Delivery status | DynamoDB or Cassandra (write-heavy, append) |
Hot path: send a notification
sequenceDiagram
participant Caller
participant API
participant Kafka
participant Router
participant Push Worker
participant APNS
Caller->>API: POST /send
API->>Kafka: enqueue
API-->>Caller: 202
Kafka->>Router: deliver
Router->>Router: check prefs, render template
Router->>Push Worker: dispatch
Push Worker->>APNS: send
APNS-->>Push Worker: 200
Push Worker->>Audit: delivery_log
End-to-end ~1–5 seconds for transactional. Marketing / digests can be minutes to hours.
Things to discuss in an interview
- Why a queue (Kafka) between API and workers — provider latency.
- Idempotency with SET NX keys.
- Multi-channel routing based on prefs.
- Aggregation to avoid spam.
- Template engine with per-channel renderers.
- Provider quirks — token expiry, rate limits, regional outages.
Things you should now be able to answer
- Why is the API 202 Accepted instead of 200 OK?
- What does idempotency mean here, and how do you achieve it?
- How do you avoid spamming a user with 50 notifications in a minute?
- What happens when APNS goes down for an hour?
- How do you ensure quiet hours work across timezones?
Further reading
- "Notifications at Scale" — Slack engineering blog
- APNS HTTP/2 protocol docs
- Pinterest's notification system blog post
Frequently asked questions
▸Why does the Notification API return 202 Accepted instead of 200 OK?
Because the actual delivery is asynchronous. The API enqueues the event to Kafka and returns immediately — the caller should not wait for APNS, FCM, or Twilio to respond, since those providers are fast at p50 but highly variable under load and periodically unreachable. A 202 signals that the request was accepted and will be processed, not that delivery is complete.
▸How does idempotency prevent double-sends on notification retries?
Every event carries an idempotency_key (for example, order:42:shipped). Before sending, the worker executes a Redis SET NX on that key with a 24-hour TTL. The first call claims the key and proceeds; every subsequent call for the same key returns nil and skips. This covers three scenarios: Kafka redelivering from the last committed offset after a worker crash, a caller retrying the API after a network timeout, and providers that occasionally return a success response twice.
▸When should user preferences be checked — at enqueue time or at dispatch time?
At dispatch time, in the Router. If preferences are enforced at enqueue, an opt-out that arrives after the event is already sitting in Kafka will be ignored and the notification will still go out. Checking at dispatch means opt-outs take effect for all queued messages, even those already in flight.
▸What is the 30-day audit log storage estimate for 1 billion notifications per day?
Approximately 60 TB, calculated as 1 billion notifications times 2 KB per event times 30 days. The article notes the bottleneck is not storage — 60 TB is manageable — but throughput: at 50k per second peak across four external providers (APNS, FCM, SES/SendGrid, Twilio), the system lives or dies on queue depth management and per-provider rate limiting.
▸How should quiet-hours logic handle users who travel across timezones?
Always recompute the user's current timezone at the moment of dispatch, never store it at enqueue time. A notification queued while a user is in New York should use their London timezone if they have traveled by the time the Router processes it. Using a stale timezone would suppress or send notifications at the wrong local hour.