Design an Email Service (Gmail)
Send, receive, store, and search email for hundreds of millions of users. SMTP ingestion, sharded mailbox storage, full-text search, and spam filtering.
The problem
Gmail handles roughly 1.8 billion active users and, by Google's own estimates, processes hundreds of billions of messages every day. Behind the clean compose window is one of the oldest distributed systems still in widespread production use — SMTP was standardized in 1982, and every email you send still touches that protocol at some point in its journey.
At its core, email is a store-and-forward messaging system. A user composes a message; your service queues it, looks up the recipient's mail server (MX) in DNS, opens an SMTP connection, and hands it off. Inbound works in reverse: a remote mail server connects to your SMTP ingestion endpoint, transfers the message, and your system writes it durably to the recipient's mailbox. The user then fetches their mail via a web API or a legacy IMAP client. Gmail, Outlook, Yahoo Mail, and Fastmail all run this same two-sided exchange.
The engineering tension comes from three places at once. First, durability is non-negotiable: the moment your SMTP server sends 250 OK, you have made a binding delivery commitment — any message lost after that point violates the protocol and destroys trust. Second, the write volume is punishing: 500 million active users receiving 25 messages a day averages 145,000 inbound messages per second, and each one must be spam-checked, body-stored in blob storage, and metadata-indexed before you say 250. Third, the data model spans wildly different access patterns — users want sub-200ms mailbox list loads, sub-second full-text search over years of archived mail, and attachment deduplication across mass CC storms — all from the same underlying store.
Understanding how to decompose those sub-problems — inbound pipeline, outbound delivery queue, mailbox storage split, spam filtering, and per-user search — is exactly what the interview tests. The individual components are not exotic; what matters is that you name the right split points and explain why.
Functional requirements
- Users send and receive email via web and mobile APIs.
- Inbound server-to-server transfer uses SMTP; users do not speak SMTP directly.
- Users can organize mail with labels and folders; mark read/unread; thread conversations.
- Full-text search over a user's mailbox.
- Attachments up to ~25 MB per message.
- (Optional) Spam/phishing filtering visible to the user; virus scanning on attachments.
Non-functional requirements
- Durability: no message loss after ingestion — a dropped email is a catastrophe.
- Availability: 99.99%+ for the read path; brief ingestion delays are tolerable.
- Inbound latency: mail delivered to inbox within seconds of SMTP acceptance.
- Read latency: message list / thread view < 200ms p99.
- Search latency: full-text results < 1s p95.
- Deliverability: outbound mail must not be flagged as spam by recipients; SPF/DKIM/DMARC must be configured correctly.
Capacity
| Dimension | Estimate | How we got there |
|---|---|---|
| Active users | 500M (design scale; Gmail's real count is ~1.8B as of 2025) | Design exercise target |
| Avg messages received | 25 msgs / user / day | Mix of personal, transactional, and newsletters; varies widely by user type |
| Inbound rate (avg) | ~145,000 emails/sec | 500M × 25 ÷ 86,400 |
| Inbound rate (peak) | ~290,000 emails/sec | 2× avg; bulk/transactional bursts arrive unevenly |
| Avg message size (raw) | 75 KB | Headers ~5 KB, body ~20 KB, remainder attachments; real-world averages shift with attachment ratio |
| Avg message size (after dedup + compression) | ~30 KB | ~2.5× savings assumed |
| Write throughput (raw) | ~10.9 GB/sec | 145,000 msg/s × 75 KB |
| Storage added per day (raw) | ~940 TB/day | 10.9 GB/s × 86,400 s |
| Storage added per day (net, after dedup + compression) | ~375 TB/day | 940 TB ÷ 2.5 |
| Per-user storage per year (uncompressed) | ~685 MB/year | 25 msg/day × 75 KB × 365 days |
| Per-user storage after 10 years | ~6.9 GB | Within real free-tier quotas |
| Outbound rate | ~15,000–45,000 sends/sec | Typically 10–30% of inbound volume |
| Search QPS (global avg) | ~2,000/sec | 0.35 searches/user/day × 500M ÷ 86,400; each query touches only one user's index, so isolation is free |
Takeaway: Storage throughput dominates: 145,000 inbound messages per second at 75 KB each means ~10.9 GB/sec raw write throughput and ~940 TB of new data every day before dedup and compression.
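The arithmetic behind the table can be reproduced in a few lines. This is only a back-of-envelope sketch: the constants are the table's own assumptions, the variable names are illustrative, and decimal units (1 GB = 1,000,000 KB) are assumed throughout.

```python
# Back-of-envelope capacity check; inputs mirror the table's assumptions.
USERS = 500_000_000
MSGS_PER_USER_PER_DAY = 25
AVG_MSG_KB = 75
DEDUP_COMPRESSION_FACTOR = 2.5
SECONDS_PER_DAY = 86_400

inbound_per_sec = USERS * MSGS_PER_USER_PER_DAY / SECONDS_PER_DAY
raw_gb_per_sec = inbound_per_sec * AVG_MSG_KB / 1_000_000   # KB -> GB (decimal)
raw_tb_per_day = raw_gb_per_sec * SECONDS_PER_DAY / 1_000   # GB -> TB
net_tb_per_day = raw_tb_per_day / DEDUP_COMPRESSION_FACTOR
per_user_mb_per_year = MSGS_PER_USER_PER_DAY * AVG_MSG_KB * 365 / 1_000

print(f"inbound:        {inbound_per_sec:,.0f} msg/s")
print(f"raw writes:     {raw_gb_per_sec:.2f} GB/s")
print(f"raw per day:    {raw_tb_per_day:.1f} TB")
print(f"net per day:    {net_tb_per_day:.1f} TB")
print(f"per user/year:  {per_user_mb_per_year:.1f} MB")
```

Without intermediate rounding the raw daily figure comes out at 937.5 TB; the table's ~940 TB comes from multiplying the already-rounded 10.9 GB/sec.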
A per-user inverted index at 500M users is large, but only a fraction of users search actively. Hot indexes live in memory; cold ones are loaded from disk on demand.
Protocol boundary — what SMTP does and doesn't do
This is consistently misunderstood in interviews. SMTP is the server-to-server transfer protocol: it is used when an external mail server delivers a message to your ingestion servers, and when your egress servers deliver outbound mail to a recipient's MX. Users of the web and mobile apps do not speak SMTP to Gmail's servers in 2025; they use the HTTPS-based web app or mobile API. Legacy clients (Outlook, Thunderbird) use IMAP (read/sync) or POP3 (download-and-delete) to access mailboxes, and authenticated SMTP submission (port 587) to send. Keep these in scope for completeness but out of the critical path.
flowchart LR
EXT[External mail server] -.SMTP port 25.-> SMTPIN[SMTP ingestion]
SMTPOUT[SMTP egress] -.SMTP port 25.-> EXT2[Recipient MX]
APP[User browser / app] -.HTTPS REST.-> API[API Gateway]
LEGACY[IMAP / POP3 clients] -.IMAP port 993.-> IMAP[IMAP server]
style SMTPIN fill:#0e7490,color:#fff
style API fill:#ff6b1a,color:#0a0a0f
For the interview, keep protocol discussion at this level: SMTP in/out at the edges, HTTP API for users, IMAP for legacy clients. Do not spend time on SMTP commands unless asked.
Building up to the design
Start with the simplest thing that could possibly work. Each step breaks for a specific, nameable reason — and naming those reasons is how you earn credibility in the room.
V1: One table, no spam filter
Store each message as a row in messages(id, user_id, from, to, subject, body, received_at). SMTP ingestion calls INSERT. The web app calls SELECT * FROM messages WHERE user_id = ? ORDER BY received_at DESC LIMIT 50. One API box, one Postgres.
This works for a handful of users, but at ~145,000 emails/sec on average (~290,000 at peak) a single Postgres cannot keep up. Worse, the body column makes rows enormous: every list query drags the full message text across the wire even when you only need the subject line. And every phishing mail lands in the inbox, because there is no spam filter.
V2: Split metadata from body; add blob storage
The fix for enormous rows is architectural: move the large body and attachments out of the database and into a blob store (object storage). The database row keeps only the metadata — from, to, subject, thread ID, labels, flags, timestamps, and a pointer to the blob.
messages row: ~1 KB
message body blob: ~74 KB (object storage)
Now reading the message list means hitting the DB only. Opening a thread fetches the blob. DB I/O drops by roughly 98%. But you are still on one shard — all 500M users' metadata in one Postgres, which means write throughput, memory, and connection limits all fail simultaneously.
V3: Shard by user_id; add delivery queue
Shard the metadata store by user_id. All operations for a user — message list, label update, search — hit a single shard, so there are no cross-shard joins. Use consistent hashing with virtual nodes so adding shards later does not require full reshuffling (see consistent hashing).
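A minimal sketch of consistent hashing with virtual nodes, assuming SHA-256 as the ring hash and 64 virtual nodes per shard. Both choices, and the `HashRing` and `shard_for` names, are illustrative rather than anything the design prescribes.

```python
import bisect
import hashlib


class HashRing:
    """Maps user_id -> shard name using a hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=64):
        ring = []
        for shard in shards:
            for i in range(vnodes):
                # Each shard owns many points on the ring to even out load.
                ring.append((self._hash(f"{shard}#{i}"), shard))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, user_id: int) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(str(user_id)))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
print(ring.shard_for(42))
```

When a fifth shard is added, only the keys whose nearest clockwise point now belongs to the new shard move; every other user stays where they were.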
While you are here, add a durable outbound delivery queue. When a user hits "Send," write to the queue immediately for a fast acknowledgment, then let egress workers drain it with retry logic. This decouples the send API response time from the reachability of the remote mail server. What remains broken: spam still lands in inboxes, there is no full-text search, and identical attachments are stored millions of times over.
V4: Add spam pipeline and attachment dedup
Route inbound messages through a spam filter between SMTP ingestion and the mailbox write. For bodies and attachments, add content-hash dedup: before writing a blob, hash the body or attachment bytes. If that hash already exists in the store, skip the write and record a pointer to the existing blob. A forwarded email or CC storm stores the body exactly once, regardless of how many recipients it reaches.
At this point, users search and get nothing — the metadata DB supports queries by user_id and received_at, but not full-text over subject or body.
V5: Per-user inverted index for search
An indexing worker subscribes to the write stream (or reads from a CDC log) and builds an inverted index per user — a map from each word to the list of message IDs containing it. The index is stored per user (isolation comes for free), built asynchronously so the write path is not blocked.
V6: Production scale
V3 + V4 + V5, plus: multi-region replication for durability, tiered storage (recent mail on fast SSD-backed blob, older mail in cheap cold storage), quota enforcement per user, a greylisting layer in the spam pipeline, and operational tooling for abuse response.
flowchart LR
V1[V1: single DB<br/>works for ~100 users] --> V2[V2: blob split<br/>DB rows are small]
V2 --> V3[V3: shard by user_id<br/>delivery queue]
V3 --> V4[V4: spam filter<br/>attachment dedup]
V4 --> V5[V5: per-user search index<br/>async indexing]
V5 --> V6[V6: multi-region<br/>tiered storage + quotas]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
INET[Internet senders] -.SMTP port 25.-> MX[MX / SMTP ingestion<br/>load-balanced pool]
MX --> SPAMPIPE[Spam + Virus Pipeline]
SPAMPIPE -->|clean| MBW[Mailbox Write Service]
SPAMPIPE -->|spam| MBW2[Mailbox Write Service<br/>spam-flagged]
MBW --> META[(Metadata store<br/>sharded by user_id)]
MBW --> BLOBW[(Blob store<br/>message bodies + attachments)]
MBW --> IDXQ[Indexing queue]
IDXQ --> IDXW[Index Worker]
IDXW --> SIDX[(Per-user search index)]
USER[User] --> GW[API Gateway + Auth]
GW --> MBRAPI[Mailbox Read API]
MBRAPI --> META
MBRAPI --> BLOBW
GW --> SRCHAPI[Search API]
SRCHAPI --> SIDX
GW --> SENDAPI[Send API]
SENDAPI --> MBW3[Mailbox Write<br/>Sent folder]
SENDAPI --> DQ[(Delivery Queue)]
DQ --> EGRESS[SMTP Egress workers]
EGRESS -.retry loop.-> DQ
EGRESS -.SMTP.-> INET2[Recipient MX servers]
style MX fill:#0e7490,color:#fff
style SPAMPIPE fill:#ff6b1a,color:#0a0a0f
style META fill:#15803d,color:#fff
style BLOBW fill:#a855f7,color:#fff
style SIDX fill:#ffaa00,color:#0a0a0f
style DQ fill:#ff2e88,color:#fff
Inbound path — SMTP to inbox
Here is what actually happens when an external mail server delivers a message to you, step by step:
sequenceDiagram
participant Sender as External MTA
participant MX as SMTP Ingestion
participant SP as Spam Pipeline
participant BS as Blob store
participant DB as Metadata shard
participant IQ as Index queue
participant IW as Index Worker
Sender->>MX: SMTP EHLO / MAIL FROM / RCPT TO / DATA
MX->>MX: SPF check at MAIL FROM, then DKIM + DMARC after DATA
MX->>SP: pass message for scoring
SP->>SP: IP reputation + content ML + header analysis
SP-->>MX: verdict — clean / spam / virus
MX->>BS: write body blob, content-hash key
MX->>DB: INSERT metadata row — blob_key, spam_score, labels
Note over MX,DB: 250 sent only after durable write completes
MX-->>Sender: 250 OK, or 5xx reject for spam/virus
MX->>IQ: emit indexing event
IQ-->>IW: async
The 250 OK back to the sender is a durability commitment. The moment you say 250, you have accepted responsibility for that message: under RFC 5321 the receiving server must either deliver it or report the failure. If you lose it after saying 250, you have violated SMTP. So: write the blob, write the metadata row, confirm both are durable, and only then say 250.
SPF is checked during the MAIL FROM / RCPT TO phase, before DATA arrives, because it only needs the client IP and the envelope-from address. DKIM verification happens after DATA is received — the ingestion server checks the DKIM-Signature header against the sending domain's public key in DNS. DMARC is evaluated after both SPF and DKIM verdicts are in. All three checks are read-only and stateless; they add no shared state to your ingestion servers.
One nuance worth calling out: a spam verdict does not have to block the 250. For borderline-spam messages, you can still accept (return 250) and route to the spam folder. Silently dropping an accepted message violates the SMTP contract — only reject outright with 5xx for the most obvious abuse.
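The ordering described above (persist first, reply second) can be sketched as a single function. Everything here is hypothetical scaffolding: `accept_message`, `TransientStorageError`, the verdict labels, and the dict and list standing in for the blob store and metadata shard are assumptions for illustration, not a real MTA API.

```python
import hashlib


class TransientStorageError(Exception):
    """Raised by a store when a write could not be made durable."""


def accept_message(raw, user_id, verdict, blobs, rows, index_events):
    """Return the SMTP reply to send after DATA for one recipient.

    verdict is one of "clean", "spam", "reject" (hypothetical labels).
    blobs is a dict-like blob store; rows is a list-like metadata store.
    """
    if verdict == "reject":
        # Obvious abuse: refuse outright, nothing is stored.
        return "550 5.7.1 message rejected"

    key = hashlib.sha256(raw).hexdigest()
    labels = ["SPAM"] if verdict == "spam" else ["INBOX"]
    try:
        blobs.setdefault(key, raw)  # body blob, content-hash key
        rows.append({"user_id": user_id, "blob_key": key, "labels": labels})
    except TransientStorageError:
        # Not durable yet, so do not commit: ask the sender to retry.
        return "451 4.3.0 temporary failure, please retry"

    # Only after both writes succeed: emit the async indexing event, say 250.
    index_events.append((user_id, key))
    return "250 2.0.0 OK"


blobs, rows, events = {}, [], []
print(accept_message(b"Subject: hi\r\n\r\nhello", 1, "clean", blobs, rows, events))
```

Borderline spam still gets a 250 and lands under the SPAM label; a storage failure produces a 4xx so the sending MTA keeps the message in its own queue.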
Mailbox storage — the schema
The two-layer design keeps the database lean: small metadata rows in a sharded DB, large blobs in object storage.
Metadata table (per shard)
CREATE TABLE messages (
message_id UUID NOT NULL,
user_id BIGINT NOT NULL, -- shard key
thread_id UUID NOT NULL,
blob_key VARCHAR(64) NOT NULL, -- SHA-256 or similar content hash
from_addr TEXT NOT NULL,
subject TEXT NOT NULL,
received_at TIMESTAMPTZ NOT NULL,
size_bytes INT NOT NULL,
labels TEXT[] NOT NULL DEFAULT '{}', -- INBOX, SPAM, SENT, user labels
is_read BOOLEAN NOT NULL DEFAULT FALSE,
is_starred BOOLEAN NOT NULL DEFAULT FALSE,
snippet TEXT, -- first ~200 chars of body, pre-extracted
PRIMARY KEY (user_id, message_id)
);
CREATE INDEX ON messages (user_id, received_at DESC);
CREATE INDEX ON messages (user_id, thread_id);
CREATE INDEX ON messages USING GIN (labels); -- label filters use array containment (labels @> ...), which a B-tree on TEXT[] cannot serve
Every query leads with user_id as the shard key. The shard router sends the query to exactly one node, and the local index handles the rest. No cross-shard queries for normal mailbox operations.
Blob store
Body and attachments are written to an object store (e.g. S3-compatible) under a content-hash key:
blob key: sha256(message_body_bytes)
value: raw MIME body bytes (optionally compressed)
Before writing, check if blob_key already exists. If it does — say, a mass-CC email going to 10 000 recipients — skip the write and record a pointer to the existing blob. For attachments, the same PDF attached by a thousand different users is one blob.
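A minimal in-memory sketch of the check-then-write dedup described above. A dict stands in for the object store, and `DedupBlobStore` is a hypothetical name; a real store would also need the existence check and the write to be safe against concurrent writers, which this sketch does not address.

```python
import hashlib


class DedupBlobStore:
    """Content-addressed blob store: identical bytes are stored once."""

    def __init__(self):
        self._blobs = {}
        self.bytes_written = 0  # physical bytes actually stored

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self._blobs:
            # First time this content is seen: pay for the write.
            self._blobs[key] = data
            self.bytes_written += len(data)
        # Either way, the caller records this key in its metadata row.
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


store = DedupBlobStore()
body = b"Quarterly report attached." * 1000
keys = {store.put(body) for _ in range(10_000)}  # mass-CC: 10,000 recipients
print(len(keys), store.bytes_written == len(body))
```

All 10,000 recipients end up with the same 64-character key, which is what the metadata row's blob_key column holds.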
Retention and tiering: recent blobs on hot (SSD) storage; blobs older than 1 year move to cold (spinning / tape / glacier-class) storage. The metadata row's blob_key is stable — the tier is transparent to the reader.
flowchart LR
NEW[New message blob] --> HOT[(Hot SSD store<br/>recent mail)]
HOT -->|"age > 1 year"| COLD[(Cold storage<br/>cheap, high-latency)]
COLD -->|"user opens old mail"| HOT
META[(Metadata row<br/>blob_key)] -.stable pointer.-> HOT
META -.stable pointer.-> COLD
style HOT fill:#ff6b1a,color:#0a0a0f
style COLD fill:#0e7490,color:#fff
style META fill:#15803d,color:#fff
When a user opens a five-year-old message, the read service fetches it from cold storage. On a nearline-class tier this takes a few hundred milliseconds more than hot, which is acceptable for infrequent access. Tape or archive-class tiers can take minutes to hours to restore, so they suit only mail that can tolerate an asynchronous fetch.
Conversation threading
Threading groups related messages into a conversation view. The canonical approach works in three steps:
- Check References and In-Reply-To headers. If they reference a known message_id in the system, assign the same thread_id.
- If no reference headers are present, try subject normalization (strip "Re:", "Fwd:", "AW:", etc.) combined with a participant-set match within a time window.
- If neither matches, create a new thread_id.
flowchart TD
MSG[Incoming message] --> CHECK1{"References /<br/>In-Reply-To header?"}
CHECK1 -->|yes| MATCH1[Look up referenced message_id]
MATCH1 -->|found| ASSIGN[Assign same thread_id]
CHECK1 -->|no| CHECK2{"Subject + participants<br/>match within window?"}
MATCH1 -->|not found| CHECK2
CHECK2 -->|yes| ASSIGN
CHECK2 -->|no| NEW[Create new thread_id]
ASSIGN --> STORE[(Store thread_id in metadata row)]
NEW --> STORE
style ASSIGN fill:#15803d,color:#fff
style NEW fill:#ff6b1a,color:#0a0a0f
style STORE fill:#a855f7,color:#fff
Threading is computed at ingestion and stored in the metadata row. A thread view is then a single index scan:
SELECT * FROM messages
WHERE user_id = ? AND thread_id = ?
ORDER BY received_at ASC;
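The three threading steps can be sketched as follows. This is an illustration under stated assumptions: the 7-day window, the prefix list in the regex, and the `ThreadAssigner` class with its in-memory dicts are hypothetical; a real system would look these up in the user's metadata shard.

```python
import re
import uuid
from datetime import datetime, timedelta

# Strips any run of reply/forward prefixes such as "Re: Fwd: AW: ".
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    return _PREFIX.sub("", subject).strip().lower()


class ThreadAssigner:
    def __init__(self, window=timedelta(days=7)):  # window is an assumption
        self.window = window
        self.by_message_id = {}  # Message-ID header -> thread_id
        self.recent = {}         # (subject, participants) -> (thread_id, last_seen)

    def assign(self, message_id, subject, participants, received_at, references=()):
        thread_id = None
        # Step 1: References / In-Reply-To pointing at a known message.
        for ref in references:
            if ref in self.by_message_id:
                thread_id = self.by_message_id[ref]
                break
        # Step 2: normalized subject + same participant set within the window.
        key = (normalize_subject(subject), frozenset(participants))
        if thread_id is None:
            hit = self.recent.get(key)
            if hit is not None and received_at - hit[1] <= self.window:
                thread_id = hit[0]
        # Step 3: nothing matched, start a new thread.
        if thread_id is None:
            thread_id = str(uuid.uuid4())
        self.by_message_id[message_id] = thread_id
        self.recent[key] = (thread_id, received_at)
        return thread_id


assigner = ThreadAssigner()
t0 = datetime(2025, 1, 1, 9, 0)
people = {"alice@example.com", "bob@example.com"}
first = assigner.assign("<1@example.com>", "Invoice", people, t0)
reply = assigner.assign("<2@example.com>", "Re: Invoice", people,
                        t0 + timedelta(hours=1), references=["<1@example.com>"])
print(first == reply)
```

The subject-based fallback is deliberately conservative: requiring the same participant set keeps two unrelated "Invoice" mails from different senders out of one conversation.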
Outbound path — the delivery queue
When a user hits Send, you want to return a fast acknowledgment. You do not want to hold the HTTP connection open while your egress server negotiates an SMTP session with a remote server that might be down for hours. So the send API writes the message to a durable delivery queue and returns immediately. Egress workers drain the queue.
sequenceDiagram
participant U as User / Send API
participant DQ as Delivery Queue
participant EW as Egress Worker
participant RMTA as Recipient MX
U->>DQ: enqueue(recipient_domain, message_id, attempt=0)
EW->>DQ: dequeue job
EW->>RMTA: SMTP connect + DATA
alt success 250
RMTA-->>EW: 250 OK
EW->>DQ: ack (delete job)
else transient 4xx
RMTA-->>EW: 421 / 450 try later
EW->>DQ: requeue with delay (exponential backoff)
else permanent 5xx
RMTA-->>EW: 550 user unknown
EW->>DQ: ack + generate NDR bounce to sender
end
The 4xx vs 5xx distinction matters enormously here. A 4xx is a transient failure — the remote server is temporarily unavailable, overloaded, or greylisting you. You requeue and retry. A 5xx is a permanent failure — the recipient address does not exist, or the server is explicitly rejecting you. You give up and generate a Non-Delivery Report (NDR) back to the sender. Treating a 5xx as transient and retrying wastes queue resources and annoys the remote server.
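The reply-class dispatch an egress worker performs can be sketched in a few lines. The function name `next_action` and the returned action strings are hypothetical; only the 2xx / 4xx / 5xx classes come from SMTP itself.

```python
def next_action(reply_code: int) -> str:
    """Map the recipient server's final SMTP reply code to a queue action."""
    if 200 <= reply_code < 300:
        return "ack"        # delivered: delete the job
    if 400 <= reply_code < 500:
        return "requeue"    # transient: retry later with backoff
    if 500 <= reply_code < 600:
        return "bounce"     # permanent: ack the job and send an NDR
    raise ValueError(f"unexpected SMTP reply code: {reply_code}")


for code in (250, 421, 450, 550):
    print(code, next_action(code))
```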
Retry schedule (SMTP convention, roughly):
| Attempt | Delay |
|---|---|
| 1 | immediate |
| 2 | 5 min |
| 3 | 30 min |
| 4 | 2 hours |
| 5 | 8 hours |
| ... | doubling |
| Final | 4–5 days — then bounce NDR |
RFC 5321 section 4.5.4 recommends retrying for at least 4–5 days before giving up. Many production MTAs use exactly 5 days as their maximum queue lifetime.
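The retry table can be expressed as a small schedule function. This sketch assumes a 5-day maximum queue lifetime (one of the values mentioned above) and that delays keep doubling after the fifth attempt; the names `delay_before_attempt` and `should_bounce` are illustrative.

```python
from datetime import timedelta

# Delays before attempts 1..5, taken from the table.
FIXED_DELAYS = [
    timedelta(0),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=8),
]
MAX_QUEUE_LIFETIME = timedelta(days=5)  # assumption: 5 days, then bounce


def delay_before_attempt(attempt: int) -> timedelta:
    """Delay to wait before the given 1-based delivery attempt."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt <= len(FIXED_DELAYS):
        return FIXED_DELAYS[attempt - 1]
    # Beyond the table: keep doubling the last fixed delay.
    return FIXED_DELAYS[-1] * 2 ** (attempt - len(FIXED_DELAYS))


def should_bounce(age_in_queue: timedelta) -> bool:
    """True once the message has been queued for the maximum lifetime."""
    return age_in_queue >= MAX_QUEUE_LIFETIME


for n in range(1, 8):
    print(n, delay_before_attempt(n))
```

A production scheduler would typically also add jitter so that many messages for the same recovering domain do not retry at the same instant.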
Partition the queue by recipient domain (or by hash of recipient address) so one slow domain does not block all outbound delivery. A small ISP's MX server going dark should not hold up email going to Gmail or Outlook. Each worker holds a limited number of concurrent SMTP connections per destination to avoid overwhelming small servers.
Deliverability — SPF, DKIM, DMARC
These three mechanisms help recipient servers trust that outbound mail from your domain is legitimate. Get them wrong and your outbound mail lands in spam for recipients — or gets rejected outright.
| Mechanism | What it does | Where it lives |
|---|---|---|
| SPF | Declares which IP addresses are authorized to send for your domain | DNS TXT record on sending domain |
| DKIM | Cryptographically signs selected message headers plus a hash of the body with a private key; recipient verifies with the public key published as a DNS TXT record | Header added by egress server |
| DMARC | Policy: what to do when a message passes neither SPF nor DKIM in alignment with the From domain (quarantine, reject, or none); also enables aggregate reports | DNS TXT record on sending domain |
In an interview: mention all three, explain SPF is an IP allowlist in DNS, DKIM is a per-message signature, and DMARC ties them together into a policy. This signals operational depth beyond "I know what SMTP is."
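How DMARC ties the two verdicts together can be sketched as below. This is a simplification: it assumes strict alignment (exact domain match), whereas real DMARC also allows relaxed alignment on the organizational domain, and the function name and return values are hypothetical.

```python
def dmarc_disposition(from_domain, spf_result, dkim_result, policy):
    """Decide what to do with a message under the sender's DMARC policy.

    spf_result and dkim_result are (passed, authenticated_domain) tuples.
    policy is the published DMARC policy: "none", "quarantine", or "reject".
    """
    def aligned(result):
        passed, domain = result
        # Strict alignment only (assumption): domains must match exactly.
        return passed and domain.lower() == from_domain.lower()

    # DMARC passes if EITHER mechanism passes and aligns with the From domain.
    if aligned(spf_result) or aligned(dkim_result):
        return "deliver"
    return {"none": "deliver", "quarantine": "spam_folder", "reject": "reject"}[policy]


# SPF passed for a bulk-mail provider's domain, but DKIM is signed by the brand.
print(dmarc_disposition("example.com",
                        (True, "mailer.example.net"),
                        (True, "example.com"),
                        "reject"))
```

The example shows why DKIM matters for mail sent through third-party providers: SPF authenticates the provider's envelope domain, and only the aligned DKIM signature lets the message pass.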
Spam and virus filtering
Spam filtering is a major subsystem in its own right. The key insight is to order your filters from cheapest to most expensive — reject obvious junk before you ever touch your ML model.
flowchart TD
MSG[Inbound message] --> IPCHECK[IP / sender reputation<br/>block known spam IPs]
IPCHECK --> DNSBL[DNSBL check<br/>real-time blackhole lists]
DNSBL --> GREY[Greylisting<br/>new sender triplet delayed]
GREY --> SPFDK[SPF / DKIM / DMARC verify]
SPFDK --> RULES[Rule engine<br/>header heuristics]
RULES --> ML[ML classifier<br/>content + metadata features]
ML --> VERDICT{Verdict}
VERDICT -->|score < threshold| INBOX[Deliver to inbox]
VERDICT -->|threshold exceeded| SPAMFOLDER[Deliver to spam folder]
VERDICT -->|virus detected| QUARANTINE[Quarantine / strip attachment]
style IPCHECK fill:#0e7490,color:#fff
style ML fill:#ff6b1a,color:#0a0a0f
style VERDICT fill:#15803d,color:#fff
Greylisting is a cheap technique that most candidates miss. When you see a new sender triplet (client IP, envelope-from, envelope-to) for the first time, respond with a temporary 4xx: not a rejection, just "try again in a moment." RFC 6647 identifies 421 or 450 as appropriate response codes; 451 is also commonly deployed in practice. Legitimate MTAs will retry within minutes. Cheap spam bots, which are optimizing for volume and usually do not implement retry logic, often do not. You trade a ~5-minute delay for new correspondents against a meaningful reduction in the spam volume that ever reaches your ML scoring stage.
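A minimal greylisting sketch, assuming a 300-second minimum delay and an in-memory dict for triplet state. The `Greylist` class and its return strings are hypothetical; a real deployment would also expire old triplets and share state across ingestion servers.

```python
class Greylist:
    """Temporarily defers mail from sender triplets not seen before."""

    def __init__(self, min_delay_seconds=300):  # ~5 minutes, an assumption
        self.min_delay_seconds = min_delay_seconds
        self._first_seen = {}  # triplet -> timestamp of first attempt

    def check(self, client_ip, mail_from, rcpt_to, now):
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        # Record the first attempt; later attempts reuse that timestamp.
        first_seen = self._first_seen.setdefault(triplet, now)
        if now - first_seen < self.min_delay_seconds:
            return "450 greylisted, try again later"
        return "accept"


g = Greylist()
args = ("203.0.113.7", "news@example.org", "alice@example.com")
print(g.check(*args, now=0))    # first contact: deferred
print(g.check(*args, now=600))  # retry after the delay: accepted
```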
The ML classifier uses features including: URL reputation in the body, sender domain age, message structure, sending rate patterns, user feedback (explicit spam reports), and language model-based content signals. The training signal comes from user "Report spam" and "Not spam" actions — that feedback loop is worth raising explicitly in an interview.
Full-text search
Per-user inverted index
An inverted index maps each word to the set of message IDs containing it:
"invoice" → [msg_11, msg_47, msg_203, ...]
"receipt" → [msg_47, msg_88, ...]
Per-user isolation is the elegant property here: each user's index is entirely independent. Search queries never touch another user's data, and you can shard the search index on user_id using the exact same key as the mailbox. Scaling is trivially proportional to users, not to global message volume.
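A toy per-user inverted index, to make the postings-list idea concrete. It assumes a tiny stopword list and no stemming, and `UserIndex` is a hypothetical name; a production index would use Lucene-style segments rather than Python sets.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "for", "of", "the", "to", "your"}  # illustrative


def tokenize(text: str):
    """Lowercase, split on non-alphanumerics, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class UserIndex:
    """Inverted index for ONE user's mailbox: term -> set of message IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, message_id: str, text: str) -> None:
        for term in tokenize(text):
            self.postings[term].add(message_id)

    def search(self, query: str) -> set:
        """AND query: messages containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


idx = UserIndex()
idx.add("msg_11", "Invoice for March")
idx.add("msg_47", "Invoice and receipt attached")
idx.add("msg_88", "Your receipt")
print(sorted(idx.search("invoice receipt")))
```

Because each `UserIndex` instance holds one user's postings only, a search never needs a user_id filter inside the index itself.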
Indexing pipeline
sequenceDiagram
participant MW as Mailbox Write
participant IQ as Index queue
participant IW as Index Worker
participant IX as Search index store
MW->>IQ: emit(user_id, message_id, subject, snippet, body_ref)
IW->>IQ: consume batch
IW->>IW: tokenize + normalize + stem
IW->>IX: merge postings list updates
Note over IW,IX: async — lag typically seconds
Tokenization: lowercase, remove punctuation, stem (or lemmatize) words, remove stopwords. For email search, also index from: and to: fields as structured filters so queries like from:alice@example.com resolve efficiently.
Index storage: write-optimized store (e.g. an LSM-tree based store, or Lucene-style segments). Per-user segments are small enough to fit in memory for active users; cold users' indexes are paged in on demand.
Consistency: search results may lag writes by a few seconds. This is acceptable — users understand that a message they just received may not yet appear in search. Do not sacrifice write path throughput for synchronous indexing.
Storage choices
| Data type | Store | Why |
|---|---|---|
| Metadata (headers, labels, thread_id, flags) | Sharded relational DB or wide-column store | Structured queries, per-user sharding, transactional label updates |
| Message bodies + MIME parts | Object store (S3-compatible) | Large blobs, cheap at scale, content-addressed dedup |
| Attachments | Object store (same or separate bucket, cold-tiered) | Same rationale; virus-scanned before write |
| Per-user search index | LSM-tree / Lucene segments, sharded by user_id | Write-optimized, compaction-friendly |
| Delivery queue (outbound) | Durable queue (Kafka or purpose-built) | Persistent, replay on failure, partitioned by domain |
| User account / auth data | Replicated relational DB | Low write volume, strong consistency needed |
| Spam model weights | ML model store (versioned artifacts) | Loaded into scoring service; updated offline |
Failure modes
Outbound delivery to a temporarily-down server
The delivery queue handles this with exponential backoff. One subtle constraint: when the queue worker retries, it must reconnect and retry from the start of the SMTP transaction — SMTP is not resumable mid-DATA. If the message carries a multi-MB attachment, this means resending the entire payload on each retry. For large messages, consider streaming directly from object storage on retry rather than buffering on the egress server.
Spam false positives (good mail flagged as spam)
False positives destroy user trust far more than letting a piece of spam through. Keep precision extremely high (near 100%) even at the cost of recall. The user "Not spam" button is the primary feedback signal — make it prominent and act on it quickly. Provide a guaranteed-delivered path for known contacts. In an interview: acknowledge the precision/recall trade-off explicitly rather than claiming the ML model just handles it.
Large attachment storms
A single email with a 25 MB attachment sent to 10 000 recipients would naively write 25 MB × 10 000 = 250 GB to blob storage. Content-hash dedup collapses this: the blob is written once (25 MB), and 10 000 metadata rows each hold a pointer to the same key — 10 000 × ~1 KB = ~10 MB — giving roughly 35 MB total, a 7 000× reduction.
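The reduction figure can be checked directly. The constants are the ones used in the paragraph above, with decimal units assumed (1 MB = 1,000 KB).

```python
ATTACHMENT_MB = 25
RECIPIENTS = 10_000
METADATA_ROW_KB = 1

naive_mb = ATTACHMENT_MB * RECIPIENTS                  # one copy per recipient
metadata_mb = RECIPIENTS * METADATA_ROW_KB / 1_000     # 10,000 pointer rows
dedup_mb = ATTACHMENT_MB + metadata_mb                 # one blob + the rows
reduction = naive_mb / dedup_mb

print(f"naive: {naive_mb / 1_000:.0f} GB, dedup: {dedup_mb:.0f} MB, "
      f"reduction: ~{reduction:,.0f}x")
```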
Search index lag
On a write burst (bulk import, mailing list storm), the indexing queue grows and search results fall behind. Mitigations: set consumer auto-scaling triggers on queue lag; prioritize indexing for active users; surface a "results may be incomplete" indicator when index lag exceeds a threshold.
Hot mailbox
A single user_id shard receiving very high write volume — a shared corporate inbox, a high-volume transactional account — becomes a write hotspot. Mitigations: virtual shards (split one logical user across sub-shards with a local fan-out service); rate-limit inbound per recipient; route to a dedicated high-throughput shard.
Quota enforcement
Every user has a storage quota (e.g. 15 GB for free tier). Track used_bytes per user in a fast counter store (Redis or a dedicated quota service). On message write, check and increment the counter in one atomic step. If used_bytes + message_size > quota, leave the counter unchanged and reject the ingestion with a 452 SMTP response ("insufficient storage"), which causes the sending MTA to retry later rather than generating a bounce. Quota counts message bytes from the user's perspective, not deduplicated blob bytes: users should not benefit from or be penalized by what other users happen to share. Usage alerts at 75% and 90% let users clean up before the hard limit hits, and a final alert at 100% tells them why new mail is being deferred.
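A sketch of the atomic check-and-increment, using a lock-protected dict as a stand-in for the counter store. `QuotaTracker`, `try_reserve`, and `reply_for` are hypothetical names; in Redis the same check-then-increment would have to run as one atomic script or transaction.

```python
import threading

FREE_TIER_QUOTA_BYTES = 15 * 1000**3  # 15 GB, decimal units assumed


class QuotaTracker:
    """Per-user byte counter with an atomic check-and-increment."""

    def __init__(self, quota_bytes=FREE_TIER_QUOTA_BYTES):
        self.quota_bytes = quota_bytes
        self._used = {}
        self._lock = threading.Lock()

    def try_reserve(self, user_id: int, message_size: int) -> bool:
        # Check and increment under one lock so concurrent deliveries
        # cannot both pass the check and overshoot the quota.
        with self._lock:
            used = self._used.get(user_id, 0)
            if used + message_size > self.quota_bytes:
                return False
            self._used[user_id] = used + message_size
            return True

    def used_bytes(self, user_id: int) -> int:
        with self._lock:
            return self._used.get(user_id, 0)


def reply_for(tracker: QuotaTracker, user_id: int, message_size: int) -> str:
    if tracker.try_reserve(user_id, message_size):
        return "proceed"  # continue to the durable write, then 250
    return "452 4.2.2 mailbox full, try again later"


tracker = QuotaTracker(quota_bytes=100)
print(reply_for(tracker, 1, 60))
print(reply_for(tracker, 1, 60))
```

The counter is charged with the message's own size, not the deduplicated blob size, matching the accounting rule above.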
Things to discuss in an interview
- Protocol boundary: SMTP is server-to-server; users hit HTTP APIs. Mention IMAP for legacy clients.
- The 250 OK commitment: durability guarantee — do not say 250 until you have persisted.
- Metadata vs blob split: why you separate the two, and how content-hash dedup works.
- Append-only write pattern: new messages are inserts; label/read-flag changes are narrow updates. No full-row rewrites.
- Per-user search isolation: why per-user inverted indexes are architecturally elegant.
- Outbound retry and SMTP bounce handling: 4xx vs 5xx, NDR generation.
- Spam as a layered pipeline: early cheap filters before expensive ML; user feedback as training signal.
- Greylisting: a cheap, effective spam reduction technique most candidates don't mention.
Things you should now be able to answer
- Why does the ingestion server say 250 OK before writing to the database? (It does not: you must write before you say 250.)
- What is the purpose of DKIM and how does it differ from SPF?
- How does content-hash deduplication reduce storage for CC storms?
- Why is the search index built asynchronously rather than synchronously?
- What is the difference between a 4xx and a 5xx SMTP response, and what does each mean for the outbound delivery queue?
- How does greylisting reduce spam without requiring ML?
- Why shard by user_id rather than by message_id?
Further reading
- RFC 5321 — Simple Mail Transfer Protocol (the canonical SMTP spec)
- RFC 6376 — DomainKeys Identified Mail (DKIM Signatures)
- RFC 7208 — Sender Policy Framework (SPF)
- RFC 7489 — Domain-based Message Authentication, Reporting, and Conformance (DMARC)
- "Lessons Learned from Scaling Email Infrastructure" — various engineering blogs (Fastmail, Mailchimp, Postmark)
- Design a distributed message queue — for delivery queue deep-dive
- Consistent hashing — for understanding the mailbox sharding strategy
Frequently asked questions
Why must an SMTP ingestion server write the message durably before sending 250 OK?
The 250 OK response is a binding delivery commitment under the SMTP protocol. The moment you send it, you have accepted responsibility for the message: the server must either deliver it or report the failure. Any message lost after saying 250 violates SMTP, so the ingestion server must write the blob and the metadata row and confirm both are durable before issuing the response.
Why shard the mailbox metadata store by user_id rather than by message_id?
All normal mailbox operations (message list, label update, thread view) are scoped to a single user, so sharding by user_id means every query hits exactly one shard with no cross-shard joins. Sharding by message_id would scatter a single user's messages across many shards, requiring expensive fan-out reads for every inbox load.
What is greylisting and how does it reduce spam without an ML classifier?
Greylisting responds to a new sender triplet (client IP, envelope-from, envelope-to) with a temporary 4xx response, asking the sender to retry in a few minutes. Legitimate MTAs implement retry logic and will succeed on the second attempt. Cheap spam bots, optimizing for volume, typically skip retries entirely, so they never deliver. The technique adds roughly a 5-minute delay for new correspondents in exchange for a meaningful reduction in spam volume that ever reaches the ML scoring stage.
How does content-hash deduplication reduce storage for large attachment storms?
Before writing a blob, the system hashes the body or attachment bytes and checks whether that key already exists in object storage. If it does, the write is skipped and a pointer to the existing blob is recorded in the metadata row instead. A 25 MB attachment sent to 10,000 recipients is written once (25 MB) and the 10,000 metadata rows each hold a ~1 KB pointer (~10 MB total), collapsing 250 GB of naive storage down to roughly 35 MB, about a 7,000x reduction.
What is the difference between a 4xx and a 5xx SMTP response from the recipient server, and what does each mean for the outbound delivery queue?
A 4xx is a transient failure indicating the remote server is temporarily unavailable or greylisting the sender; the egress worker requeues the message with exponential backoff and retries for up to 4-5 days per RFC 5321. A 5xx is a permanent failure meaning the address does not exist or the server is explicitly rejecting the message; the worker acknowledges the job as done and generates a Non-Delivery Report back to the original sender.