
◆◆ Intermediate · asked at Meta · asked at Google · asked at Amazon · asked at Slack

Design a Notification System (Push, Email, SMS)

A reliable multi-channel notification platform — fanout, templates, dedup, rate limiting, and the realities of APNS/FCM.

14 min read · 2026-02-15 · Ironclad Academy

#interview #notifications #messaging #asynchronous #kafka #rate-limiting


The problem

Slack sends you a push when a thread you're in goes active at 2 AM. Amazon fires an email the moment your package is out for delivery. WhatsApp pings your phone, your laptop, and your watch simultaneously. These look like trivial "call APNS and move on" operations — until you're responsible for one. At 1 billion notifications a day across push, email, and SMS, the surface area for failure is enormous.

The core of this system is a multi-channel fan-out: a single upstream event (order shipped, payment received, new follower) must be routed to the right channels, for the right user, through the right external provider, with the right rendered copy in the right language. That routing pipeline has to respect opt-outs, honor quiet hours, and survive flaky provider APIs — all without ever double-sending the same message.

Two engineering tensions dominate the design. First, at-least-once delivery vs. exactly-once user experience: Kafka and retry logic guarantee you'll attempt every notification, but that means duplicates are a constant threat. Every piece of the pipeline needs idempotency, or users start filing bugs like "I got the same confirmation email six times." Second, fan-out amplification: a single celebrity-account event on a social platform can fan out to millions of followers instantly. The queueing and routing layer has to absorb that burst without the pipeline backing up or the providers rate-limiting you into a blacklist.

Functional requirements

Send notifications via push (mobile), email, SMS, in-app.
Support templates with variables (Hi {name}, your order #{id} shipped).
User preferences: opt-in/out per channel, per category.
Scheduled and triggered notifications.
Track delivery status and engagement.

Non-functional

1B notifications/day at peak.
Delivery within 30 seconds for transactional events.
99.9% delivery success on healthy paths.
Idempotent — never double-send.
Auditable.

Capacity

Dimension	Estimate	How we got there
Volume	1B notifications/day	Given
Throughput (avg)	~12k/sec	`1B ÷ 86,400 s`
Throughput (peak)	~50k/sec	~4× average burst
Channel split	70% push · 20% email · 5% SMS · 5% in-app	Typical app ratio
Per-event payload	~2 KB	Template + variables + metadata
Retention	30 days delivery audit	SLA / compliance
Audit log storage	60 TB	`1B × 2 KB × 30 days`

Takeaway: the bottleneck is not storage (60 TB is manageable) — it is throughput. At 50k/sec peak across four external providers (APNS, FCM, SES/SendGrid, Twilio), the system lives or dies on queue depth management and per-provider rate limiting.
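The table's arithmetic is easy to re-run if you want to change an input. This is a small sketch that reproduces the estimates, assuming decimal units (1 KB = 1,000 bytes) and the 4× burst factor from the table; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 86_400

notifications_per_day = 1_000_000_000
payload_bytes = 2_000        # ~2 KB per event, decimal units (assumption)
retention_days = 30
peak_multiplier = 4          # burst factor taken from the table

avg_per_sec = notifications_per_day / SECONDS_PER_DAY
peak_per_sec = avg_per_sec * peak_multiplier
audit_storage_tb = notifications_per_day * payload_bytes * retention_days / 1e12

print(f"avg  ~ {avg_per_sec:,.0f}/sec")
print(f"peak ~ {peak_per_sec:,.0f}/sec")
print(f"audit log ~ {audit_storage_tb:.0f} TB")
```

The exact peak comes out a little under 50k/sec; the table rounds up, which is the safer direction for capacity planning.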

Building up to the design

A notification system is the "looks simple, fails in horrifying ways" problem. The architecture below has half a dozen pieces; each appears the moment you hit a wall. Walk forward.

V1: Synchronous send from the calling service

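def on_order_shipped(order):
    user = users.get(order.user_id)
    # Blocking third-party call on the request path: if APNS is slow, so is this handler.
    apns.push(user.device_token, "Your order has shipped!")
    # Persisted only after the push returns; if the push raises, the order is never saved.
    db.save_order(order)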

This works for tiny apps and gets you a demo. The problem is that APNS is a third-party network call: fast on a good day and completely unreachable on a bad one. When it's slow, your order-shipped endpoint times out. When the caller retries, you send the notification twice. You've tightly coupled your order flow to a flaky external dependency, and there's no seam to pull them apart.

V2: Enqueue via a queue, send in a worker

When an order ships, the calling service pushes a NotificationEvent to a queue (Kafka, SQS, RabbitMQ). A separate worker reads the queue and calls APNS.

Now the caller returns instantly. APNS slowness no longer blocks business logic, and failures are retryable because the event is sitting safely in the queue. The problem is that one queue plus one worker can't handle 50k notifications per second across multiple channels. And if a user opts out of marketing notifications after the event is already queued, you'll still send it — because you haven't separated the routing decision from the enqueue step.

V3: Channel-specific workers + user preference check

Split into channel-specific workers — a push worker, an email worker, an SMS worker. Each pulls from the queue, checks whether this user wants this kind of notification on this channel, then sends. Each channel now scales independently, and opt-outs are honored at delivery time, not just at enqueue. The next crack appears when you try to change notification copy: every string is hard-coded in the worker, so a wording change means a redeploy.

V4: Template engine + render at send

A separate template service stores templates by (template_id, channel, locale). The worker fetches the template, renders it with the event's variables, then sends. Now you can edit a template and all queued messages pick up the change. You also get safe variable substitution that prevents HTML or SMS injection. The next thing that bites you: services retry on network blips. You'll start sending duplicate notifications.

V5: Idempotency + dedup

Every event carries an idempotency_key (e.g. order:42:shipped). Workers check a recent-deliveries cache (Redis with a 24-hour TTL) before sending — same key in that window means skip. This handles Kafka redelivery, API retries, and the occasional provider that returns success twice.

V6: Rate limiting + provider failover

Three more things stand between this and production-ready. First, per-user rate caps: never send the same user more than N notifications per hour for the same category. Second, per-provider rate limiting, so you don't blow your Twilio quota. Third, failover: when a provider goes down (SendGrid having an incident, say), you want to route email through SES automatically until it recovers.
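The failover idea can be sketched as a priority list of providers tried in order. This is a minimal per-message version under stated assumptions: `send_with_failover` and the `(name, send)` pairs are hypothetical, and a real system would more likely track provider health with a circuit breaker than probe the failing provider on every message.

```python
def send_with_failover(message, providers):
    """Try providers in priority order; return the name of the first that accepts."""
    failed = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception:
            failed.append(name)      # remember who failed, then try the next one
    raise RuntimeError(f"all providers failed: {failed}")
```

Called with something like `[("sendgrid", sendgrid_send), ("ses", ses_send)]`, it returns `"ses"` whenever the first callable raises.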

V7: Audit + delivery tracking

Every send, skip, and failure writes to an audit log. Failures requeue with exponential backoff up to N retries; after that, the event lands in a dead-letter queue and pages the on-call. This is the production design — every box in the diagram below earned its place by fixing something that broke.

flowchart LR
    V1[V1: sync send<br/>caller blocked] --> V2[V2: + queue + worker<br/>decoupled]
    V2 --> V3[V3: + channels + prefs<br/>scaled, polite]
    V3 --> V4[V4: + templates<br/>no-deploy changes]
    V4 --> V5[V5: + idempotency<br/>no dupes]
    V5 --> V6[V6: + rate limit + failover<br/>resilient]
    V6 --> V7[V7: + audit + retries<br/>production]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    SRC1[Order service] --> ENQ[Notification API]
    SRC2[User service] --> ENQ
    SRC3[Background jobs] --> ENQ

    ENQ --> KAFKA[Kafka: notify_events]

    KAFKA --> ROUTER[Router]
    ROUTER --> TPL[Template Engine]
    TPL --> ROUTER
    ROUTER --> PREFS[(User Preferences)]
    PREFS --> ROUTER

    ROUTER --> CHN_PUSH[Push Worker]
    ROUTER --> CHN_EMAIL[Email Worker]
    ROUTER --> CHN_SMS[SMS Worker]
    ROUTER --> CHN_IN[In-app Worker]

    CHN_PUSH --> APNS[APNS]
    CHN_PUSH --> FCM[FCM]
    CHN_EMAIL --> SES[SES / SendGrid]
    CHN_SMS --> TWIL[Twilio]
    CHN_IN --> WS[WebSocket Gateway]

    CHN_PUSH --> AUDIT[(Audit Log)]
    CHN_EMAIL --> AUDIT
    CHN_SMS --> AUDIT
    CHN_IN --> AUDIT

    style ENQ fill:#ff6b1a,color:#0a0a0f
    style KAFKA fill:#a855f7,color:#fff
    style ROUTER fill:#15803d,color:#fff

API

POST /notifications/send
{
  "user_id": 12345,
  "template_id": "order_shipped",
  "data": { "order_id": 99, "tracking_url": "..." },
  "channels": ["push", "email"],            // or auto from prefs
  "idempotency_key": "order_99_shipped",
  "send_at": "2026-02-15T14:00:00Z"          // optional, default now
}

Returns 202 Accepted with a notification_id.

Why a queue is mandatory

Provider API calls are fast in the happy path but highly variable and periodically unavailable. SendGrid's API accepts a message in ~22 ms p50 but spikes to hundreds of milliseconds under load, and then your email still has to be delivered to the recipient's mailbox — total end-to-end latency is seconds to minutes. Twilio's SMS API similarly accepts requests in ~65 ms p50, but actual delivery to the handset adds carrier-dependent latency (seconds to tens of seconds). APNS round-trip is typically fast (tens of milliseconds per message on a warm HTTP/2 connection), but a single persistent connection can saturate or the provider can be temporarily unreachable. None of these are reliable enough to call synchronously on the critical path.

A request hits the API → enqueue → return 202. Workers consume the queue, deliver, retry on failure.

Kafka is a strong choice: high throughput, partitioning by user_id (in-order per user), retention for replay. If a worker crashes, Kafka redelivers from the last committed offset, which is exactly why idempotency is non-negotiable.
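Per-user ordering falls out of keyed partitioning: the same key always hashes to the same partition, and a partition is consumed in order. The sketch below illustrates the idea only. It uses CRC32 as a stand-in hash, which is an assumption; Kafka clients apply their own hash function, so the actual partition numbers will differ.

```python
import zlib

def partition_for(user_id, num_partitions):
    """Pick a partition from a stable hash of the key (CRC32 as a stand-in)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_partitions
```

Every event for user 12345 lands on one partition, so that user's notifications are consumed in the order they were produced. Changing the partition count changes the mapping, which is worth remembering before resizing a topic.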

Templates

Templates separate what to say from how to render it for each channel.

template: order_shipped
versions:
  push:    "Your order #{order_id} shipped! 📦"
  email:
    subject: "Your order is on the way"
    html:    "Hi {name}, your order shipped. Track it at {tracking_url}."
  sms:     "Order #{order_id} shipped. Track: {tracking_url}"

Stored in a templates DB; rendered just-in-time with the event payload. Supports per-language variants (order_shipped:en, order_shipped:fr).
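A rough sketch of just-in-time rendering, keeping the `{name}` placeholder syntax shown above. The in-memory `TEMPLATES` dict and the `render` function are hypothetical stand-ins for the templates DB and the template service. Values are HTML-escaped for the email channel only, which is one simple way to get the injection safety mentioned earlier.

```python
import html

# Hypothetical in-memory store keyed by (template_id, channel, locale).
TEMPLATES = {
    ("order_shipped", "email", "en"): "Hi {name}, your order shipped. Track it at {tracking_url}.",
    ("order_shipped", "sms", "en"): "Order #{order_id} shipped. Track: {tracking_url}",
}

def render(template_id, channel, locale, data):
    """Render a stored template with event data; escape values for HTML email."""
    template = TEMPLATES[(template_id, channel, locale)]
    if channel == "email":
        data = {key: html.escape(str(value)) for key, value in data.items()}
    # format_map raises KeyError on a missing variable rather than sending "{name}".
    return template.format_map(data)
```

Failing loudly on a missing variable is a deliberate choice here: a notification that reads "Hi {name}" is worse than one that lands in the dead-letter queue. Note that `str.format` is only safe when the template strings themselves are trusted, as they are when authored internally.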

Idempotency

Every notification has an idempotency_key. Before sending, the worker checks whether we already processed that key.

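import redis

r = redis.Redis()  # assumes a reachable Redis instance; connection settings omitted

def process(event):
    # SET with NX claims the key atomically; EX expires the claim after 24 hours.
    if r.set(f"idem:{event.idempotency_key}", 1, nx=True, ex=86400):
        send(event)  # send() and log() are defined elsewhere in the worker
    else:
        log("duplicate, skipped")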

SET NX returns True on the first call and None on every retry — both are handled correctly by the if conditional (the redis-py library translates the raw Redis "OK" response to True; on a no-op it returns None, which is falsy). This matters for three scenarios: Kafka redelivers from the last committed offset after a worker crash, a caller retries the API after a network timeout, and some providers return a success response twice. The 24-hour TTL covers all of them.
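One gap in claim-then-send is worth knowing: if the send raises after the key is claimed, every later retry sees the key and skips, so the notification is lost. A possible refinement, sketched below, releases the claim when the send fails. This goes beyond what the snippet above does. `FakeRedis` is an in-memory test double that imitates only the two calls used here, and `process` takes its collaborators as arguments so the example runs without a server.

```python
import time

class FakeRedis:
    """In-memory stand-in for SET (with nx/ex) and DELETE; a test double only."""
    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        expires_at = self._store.get(key, (None, 0))[1]
        if nx and key in self._store and expires_at > now:
            return None                       # already claimed and not yet expired
        ttl = ex if ex is not None else float("inf")
        self._store[key] = (value, now + ttl)
        return True

    def delete(self, key):
        return 1 if self._store.pop(key, None) is not None else 0

def process(event_key, send, r, ttl=86_400):
    """Claim the key, send once, and release the claim if the send fails."""
    key = f"idem:{event_key}"
    if not r.set(key, 1, nx=True, ex=ttl):
        return "skipped"
    try:
        send()
    except Exception:
        r.delete(key)                         # let a later retry claim the key again
        raise
    return "sent"
```

The trade-off: if the provider accepted the message but the response was lost, releasing the key allows a second send. Whether a rare duplicate or a rare drop is worse depends on the notification category.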

flowchart LR
    EVT[Event arrives] --> CHK{"Redis SET NX<br/>idempotency_key"}
    CHK -->|"OK — first time"| SEND[Send to provider]
    CHK -->|"nil — seen before"| SKIP[Skip + log]
    SEND --> LOG[Write audit log]
    SKIP --> LOG
    style CHK fill:#ff6b1a,color:#0a0a0f
    style SEND fill:#15803d,color:#fff
    style SKIP fill:#ffaa00,color:#0a0a0f
    style LOG fill:#0e7490,color:#fff

User preferences

Stored in a fast-access KV store:

user:42:notif_prefs = {
  "push":  ["transactional", "social"],
  "email": ["transactional"],
  "sms":   ["transactional"]
}

The Router enforces these lists. For the user above, an event with category social goes out as push only; the email and SMS sends are dropped because social is not in either of those lists.
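The Router's check reduces to a filter over the requested channels. This is a minimal sketch using the preference shape shown above; `allowed_channels` is a hypothetical name, and a channel missing from the preferences is treated as opted out, which is an assumption.

```python
PREFS = {
    "push": ["transactional", "social"],
    "email": ["transactional"],
    "sms": ["transactional"],
}

def allowed_channels(requested, category, prefs):
    """Keep only the requested channels on which the user accepts this category."""
    return [channel for channel in requested if category in prefs.get(channel, [])]

print(allowed_channels(["push", "email"], "social", PREFS))
```

Treating an absent channel as opted out is the conservative default: a missing preference record should never result in an unwanted message.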

GDPR / regulatory: every notification must respect "do not contact" lists; transactional vs. marketing categories matter for legal compliance.

Push notifications: APNS / FCM

APNS (Apple Push Notification service): persistent HTTP/2 connection from your servers to Apple's servers. Apple delivers to the device. A single client instance (multiplexing concurrent HTTP/2 streams) can sustain roughly 4,000+ pushes/sec in community benchmarks (Apple does not publish a formal limit); run one instance per CPU core to scale throughput further.

FCM (Firebase Cloud Messaging): the canonical push rail for Android (successor to the deprecated GCM). FCM also delivers to iOS by tunneling through APNS and to the web via the Web Push protocol — meaning for cross-platform apps, FCM can be a single provider abstraction in front of both Android and APNS delivery.

sequenceDiagram
    participant App as Push Worker
    participant APNS
    participant Device
    App->>APNS: send(token, payload)
    APNS->>Device: deliver
    APNS-->>App: 200 OK / 4xx (bad token, etc.)
    Device-->>APNS: feedback (token revoked, etc.)

Each app has a device token per install. Tokens become invalid when a user reinstalls the app, the OS rotates the token, or the app is deleted. APNS returns HTTP 410 (Unregistered) for an inactive token, and 400 (DeviceTokenNotForTopic) if the token's topic (bundle ID) mismatches the certificate. On a 410, remove the token from your DB immediately.

One operational note: APNS deliberately delays returning 410 on a sliding, undocumented schedule — this is a privacy-protecting design so providers cannot track app installs and uninstalls in real time. You may send to a stale token for minutes or longer before APNS starts returning 410. Your audit log and a periodic token-validity sweep help detect stale tokens over time. Do not infer an app is still installed just because APNS hasn't returned 410 yet.
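The status handling described above can be sketched as a small classifier the push worker runs on each response. The action names (`delete_token`, `fix_config`, and so on) are hypothetical, and treating 429 and 5xx as retryable is an assumption layered on the two codes the text covers.

```python
def classify_apns_response(status, reason=""):
    """Map an APNS HTTP status and reason string to a worker action."""
    if status == 200:
        return "accepted"
    if status == 410:
        return "delete_token"     # Unregistered: stop sending to this token
    if status == 400 and reason == "DeviceTokenNotForTopic":
        return "fix_config"       # token does not match the topic (bundle ID)
    if status == 429 or status >= 500:
        return "retry"            # assumed transient: back off and try again
    return "drop"                 # other 4xx: resending the same request will not help
```

Keeping this mapping in one place makes it easy to audit: every response ends in exactly one action, and only `retry` puts the event back in the queue.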

Delivery guarantees

Three things worth being explicit about here. Kafka gives you at-least-once delivery, which means duplicates are possible if you don't dedup with idempotency keys. Channels have no ordering guarantees relative to each other — push may arrive before email for the same event, or after. And provider acceptance is not the same as device delivery: even after APNS or FCM returns 200 OK, a notification may never reach the user's screen if their device is offline, the app is uninstalled, or a carrier filters the SMS. Build your audit log accordingly — "accepted by provider" and "delivered to user" are two different states.

Rate limiting and quiet hours

You don't want to spam users. The Router enforces two layers of policy.

The first is a per-user rate cap: for example, max 5 marketing notifications per hour per user. This is a Redis sliding window keyed by (user_id, category) — cheap to check, no DB write per send. See the rate limiter article for the algorithm.
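For intuition, here is a sliding-window log in plain Python. It is an in-memory stand-in for the Redis structure, not the Redis implementation itself; `SlidingWindowCap` and its parameters are hypothetical, and `now` is passed in as seconds so the behaviour is easy to test.

```python
from collections import defaultdict, deque

class SlidingWindowCap:
    """In-memory sliding-window log keyed by (user_id, category)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._sent = defaultdict(deque)   # (user_id, category) -> send timestamps

    def allow(self, user_id, category, now):
        timestamps = self._sent[(user_id, category)]
        # Drop timestamps that have aged out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

With `SlidingWindowCap(limit=5, window_seconds=3600)`, the sixth marketing notification inside an hour is refused, while a transactional one for the same user still passes because the key includes the category.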

The second is quiet hours: 10 PM – 7 AM in the user's local timezone, suppress non-critical notifications. Quiet-hours logic must use the user's current timezone, not the timezone at enqueue time — always recompute at dispatch, because users travel.

flowchart LR
    EVT2[Event ready to send] --> RC{"Rate cap OK?<br/>(sliding window)"}
    RC -->|"over limit"| DROP[Drop or delay]
    RC -->|"within limit"| QH{"Quiet hours<br/>in user TZ?"}
    QH -->|"yes + non-critical"| SCHED[Schedule for morning]
    QH -->|"yes + critical"| DISPATCH[Dispatch to worker]
    QH -->|"no"| DISPATCH
    style RC fill:#ff6b1a,color:#0a0a0f
    style QH fill:#a855f7,color:#fff
    style DISPATCH fill:#15803d,color:#fff
    style SCHED fill:#ffaa00,color:#0a0a0f
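The quiet-hours branch of the flowchart, sketched with the standard library. The function names and return values are hypothetical. The timezone is passed in as a `tzinfo`, so the caller looks up the user's current zone at dispatch; in production that would typically be a `zoneinfo.ZoneInfo` for the user's region.

```python
from datetime import datetime, time, timedelta, timezone

QUIET_START = time(22, 0)   # 10 PM local
QUIET_END = time(7, 0)      # 7 AM local

def in_quiet_hours(now_utc, user_tz):
    """True if the user's local wall-clock time falls in the overnight window."""
    local = now_utc.astimezone(user_tz).time()
    # The window wraps past midnight, so it is the union of two ranges.
    return local >= QUIET_START or local < QUIET_END

def decide(now_utc, user_tz, critical):
    """Critical notifications always dispatch; others wait out quiet hours."""
    if in_quiet_hours(now_utc, user_tz) and not critical:
        return "schedule_for_morning"
    return "dispatch"
```

At 02:30 UTC, a user at UTC-5 is at 21:30 local and gets the notification, while the same user after flying to a UTC+0 zone is at 02:30 local and gets it in the morning. This is why the timezone has to be resolved at dispatch.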

Aggregation / digesting

Aggregation deserves a closer look. The naive system fires one notification per event; a popular post can generate hundreds of likes in minutes. We collapse them:

flowchart LR
    E1[Like by Alice] --> AGG[Aggregator window]
    E2[Like by Bob] --> AGG
    E3[Like by Carol] --> AGG
    E4[5 min timer] --> AGG
    AGG --> NOTIF["3 people liked your post"]
    style AGG fill:#ff6b1a,color:#0a0a0f

Implementation: Redis sorted set keyed by (user_id, post_id, action), members are actor IDs. After a quiet window or N events, flush as a digest.
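The same windowing logic, sketched in memory so it runs without Redis. `Aggregator`, `max_events`, and the digest wording are hypothetical; the dict of actors plays the role of the sorted set, and `flush` is what the timer would call when the quiet window elapses.

```python
from collections import defaultdict

class Aggregator:
    """Collects distinct actors per (user_id, post_id, action) and emits a digest."""
    def __init__(self, max_events=50):
        self.max_events = max_events       # hypothetical size threshold (the "N events")
        self._actors = defaultdict(dict)   # key -> {actor_id: first_seen_timestamp}

    def add(self, user_id, post_id, action, actor_id, now):
        """Record an event; return a digest if the size threshold was reached."""
        key = (user_id, post_id, action)
        self._actors[key].setdefault(actor_id, now)   # repeat actors count once
        if len(self._actors[key]) >= self.max_events:
            return self.flush(key)
        return None

    def flush(self, key):
        """Emit one digest for the key and reset its window."""
        actors = self._actors.pop(key, {})
        if not actors:
            return None
        action = key[2]
        noun = "person" if len(actors) == 1 else "people"
        return f"{len(actors)} {noun} {action} your post"
```

Keying members by actor ID means a user who likes, unlikes, and likes again is still counted once in the digest.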

Tracking and engagement

For each notification, track:

created_at
dispatched_at (sent to provider)
delivered_at (provider confirmed delivery)
opened_at (user tapped/opened — best-effort, varies by channel)
clicked_at (user clicked a link)

Stored in an event store (Kafka → BigQuery / Snowflake) for analytics.

A/B test templates by varying template_version and measuring open rates. Bad templates (high opt-out, low CTR) get retired.

Failures

Provider outages

When APNS, FCM, or Twilio returns 5xx or times out, the worker retries with exponential backoff. After N failed attempts, the event goes to a dead-letter queue — visible to ops, inspectable, and either re-drainable once the provider recovers or forwarded to a fallback channel (APNS down → try email). The key is that the event is never silently dropped; it's always somewhere in the system with a status you can query.

flowchart LR
    SEND2[Send to provider] --> OK{"200 OK?"}
    OK -->|yes| DONE[Write audit: accepted by provider]
    OK -->|no| RETRY{"Retries < N?"}
    RETRY -->|yes| BACKOFF["Wait 2^n seconds<br/>retry"]
    RETRY -->|no| DLQ[Dead-letter queue<br/>+ alert]
    BACKOFF --> SEND2
    style SEND2 fill:#0e7490,color:#fff
    style OK fill:#ff6b1a,color:#0a0a0f
    style DONE fill:#15803d,color:#fff
    style DLQ fill:#ff2e88,color:#fff
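The retry loop in the flowchart can be sketched as follows. The `send`, `sleep`, and `dead_letter` callables are injected, which is an assumption made so the example runs without real waits; a production worker would more likely re-enqueue with a delay and add jitter than sleep in-process.

```python
def deliver_with_retry(send, max_retries, sleep, dead_letter):
    """Call send(); on failure wait 2^n seconds and retry, then dead-letter."""
    for attempt in range(max_retries + 1):
        try:
            send()
            return "accepted"
        except Exception as exc:
            if attempt == max_retries:
                dead_letter(exc)       # out of retries: park the event and alert
                return "dead_lettered"
            sleep(2 ** attempt)        # 1s, 2s, 4s, ... before the next attempt
```

With `max_retries=3`, a provider that never recovers costs four attempts and seven seconds of backoff before the event reaches the dead-letter queue.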

Invalid tokens

When a provider returns 410 or a similar "bad token" status, mark the device token as dead in your DB and don't try again. Retrying bad tokens wastes quota and can get your sending IP flagged.

User opts out mid-flight

A notification was queued before the opt-out arrived. Because the Router checks preferences at dispatch time — not at enqueue time — the opt-out applies even to messages already sitting in Kafka. This is the reason preference enforcement happens late in the pipeline.

Timezone changes

If a user is mid-flight between timezones, quiet-hours logic must use their current timezone at the moment of dispatch. Storing the timezone at enqueue time would suppress or send notifications at the wrong local hour. Recompute at dispatch.

Storage

Data	Store
Notification event	Kafka (transient) → S3 (long-term audit)
User preferences	DynamoDB / Redis (read-heavy)
Device tokens	Postgres (small per user)
Templates	Postgres + cache
Delivery status	DynamoDB or Cassandra (write-heavy, append)

Hot path: send a notification

sequenceDiagram
    participant Caller
    participant API
    participant Kafka
    participant Router
    participant Push Worker
    participant APNS
    Caller->>API: POST /send
    API->>Kafka: enqueue
    API-->>Caller: 202
    Kafka->>Router: deliver
    Router->>Router: check prefs, render template
    Router->>Push Worker: dispatch
    Push Worker->>APNS: send
    APNS-->>Push Worker: 200
    Push Worker->>Audit: delivery_log

End-to-end ~1–5 seconds for transactional. Marketing / digests can be minutes to hours.

Things to discuss in an interview

Why a queue (Kafka) between API and workers — provider latency.
Idempotency with SET NX keys.
Multi-channel routing based on prefs.
Aggregation to avoid spam.
Template engine with per-channel renderers.
Provider quirks — token expiry, rate limits, regional outages.

Things you should now be able to answer

Why is the API 202 Accepted instead of 200 OK?
What does idempotency mean here, and how do you achieve it?
How do you avoid spamming a user with 50 notifications in a minute?
What happens when APNS goes down for an hour?
How do you ensure quiet hours work across timezones?

Frequently asked questions

▸Why does the Notification API return 202 Accepted instead of 200 OK?

Because the actual delivery is asynchronous. The API enqueues the event to Kafka and returns immediately — the caller should not wait for APNS, FCM, or Twilio to respond, since those providers are fast at p50 but highly variable under load and periodically unreachable. A 202 signals that the request was accepted and will be processed, not that delivery is complete.

▸How does idempotency prevent double-sends on notification retries?

Every event carries an idempotency_key (for example, order:42:shipped). Before sending, the worker executes a Redis SET NX on that key with a 24-hour TTL. The first call claims the key and proceeds; every subsequent call for the same key returns nil and skips. This covers three scenarios: Kafka redelivering from the last committed offset after a worker crash, a caller retrying the API after a network timeout, and providers that occasionally return a success response twice.

▸When should user preferences be checked — at enqueue time or at dispatch time?

At dispatch time, in the Router. If preferences are enforced at enqueue, an opt-out that arrives after the event is already sitting in Kafka will be ignored and the notification will still go out. Checking at dispatch means opt-outs take effect for all queued messages, even those already in flight.

▸What is the 30-day audit log storage estimate for 1 billion notifications per day?

Approximately 60 TB, calculated as 1 billion notifications times 2 KB per event times 30 days. The article notes the bottleneck is not storage — 60 TB is manageable — but throughput: at 50k per second peak across four external providers (APNS, FCM, SES/SendGrid, Twilio), the system lives or dies on queue depth management and per-provider rate limiting.

▸How should quiet-hours logic handle users who travel across timezones?

Always recompute the user's current timezone at the moment of dispatch; never rely on a value stored at enqueue time. A notification queued while a user is in New York should use their London timezone if they have traveled by the time the Router processes it. Using a stale timezone would suppress or send notifications at the wrong local hour.
