Scale From Zero to Millions of Users
The classic walkthrough — start with one server, add a load balancer, add caching, replicate the database, shard, geo-distribute. Every transition explained.
This is the canonical "evolution" of a web system, made famous by Alex Xu's System Design Interview and worth re-walking on the way to every interview. It covers the same sequence of upgrades that Amazon, eBay, Twitter, and Instagram all went through — sometimes painfully — as they grew.
Day 0 — single server
Everything on one box: web server, application code, database.
flowchart LR
U[Users] -->|HTTPS| S[Single Box<br/>Web + App + DB]
style S fill:#ff6b1a,color:#0a0a0f
Capacity: maybe 1,000 concurrent users; latencies fine; total monthly cost ~$50.
This is also where you start: a side project, an MVP, a launch. It is completely fine for the first month of any new product.
Step 1 — separate the database
The database (RAM-hungry, IO-hungry) and the app (CPU-hungry, network-hungry) compete for the same resources on the same box. When the DB starts paging swap and the web server is waiting on disk, both suffer. Move the database to its own machine.
flowchart LR
U[Users] --> S[App Server]
S --> DB[(Database)]
style S fill:#15803d,color:#fff
style DB fill:#ff6b1a,color:#0a0a0f
Now you can tune them independently. App server CPU pegged? Bigger app box. Database disk full? Bigger DB box.
Step 2 — vertical scale (a.k.a. "bigger box")
Bump the instance type. AWS m5.large → m5.xlarge → m5.4xlarge → m5.16xlarge. Doubling the instance roughly doubles capacity. Easy. Boring.
The ceiling: a single machine can only get so big. The largest standard AWS m5 instance (m5.24xlarge) gives you 96 vCPUs and 384 GiB RAM — impressive, but finite, and eye-wateringly expensive. Past that, you must scale horizontally.
Step 3 — load balancer + multiple app servers
Spin up a second app server. Put a load balancer in front. Now you have horizontal scaling for the app tier.
flowchart LR
U[Users] --> LB[Load Balancer]
LB --> A1[App 1]
LB --> A2[App 2]
LB --> A3[App N]
A1 --> DB[(Database)]
A2 --> DB
A3 --> DB
style LB fill:#ff6b1a,color:#0a0a0f
style DB fill:#0e7490,color:#fff
As soon as that second server is up, you'll discover that any state you kept inside the first server is invisible to the second. Sessions stored in App 1's memory are gone when the load balancer sends that user to App 2. Files uploaded to App 1's local disk don't exist on App 2. An in-process cache on App 1 is a completely separate cache from App 2's — they'll drift and contradict each other. The fix for all three is the same: move that state out of the app server entirely.
sequenceDiagram
participant U as User
participant LB as Load Balancer
participant A1 as App Server 1
participant A2 as App Server 2
participant R as Redis
U->>LB: Login (session created)
LB->>A1: Route to App 1
A1->>A1: Session stored in local memory
U->>LB: Next request
LB->>A2: Route to App 2 (different server)
A2->>A2: No session found — 401 Unauthorized
Note over A1,A2: Fix: store sessions in Redis, not in process memory
A2->>R: Session lookup
R-->>A2: Session found
- Store sessions in Redis or cookies (not in process memory).
- Store uploads in object storage like S3 or GCS (not on local disk).
- A shared cache (Redis) replaces per-process in-memory caches.
This is the stateless app tier principle: once every stateful concern lives outside the app servers, adding or removing servers is frictionless.
Two failure modes worth naming for interviews. Sticky sessions — routing a user to the same server every time — look like a quick fix but defeat horizontal scale entirely; when that server goes down, all its users lose their sessions. Fix the root cause instead. JWT tokens move session state to the client, which is great for statelessness, but you cannot revoke a token without a shared deny-list, which reintroduces server-side state. Know the trade-off. And whatever you do: give your load balancer a drain period so in-flight requests can complete before a node shuts down.
Step 4 — database replication
The DB is now the bottleneck. A typical consumer app has a 10:1 read-to-write ratio — for every order placed, there are ten times as many product pages browsed. All those reads landing on one primary is wasteful. Add read replicas.
flowchart LR
LB --> A1[App]
A1 -->|writes| P[(Primary)]
A1 -->|reads| R1[(Replica 1)]
A1 -->|reads| R2[(Replica 2)]
P -->|replicate| R1
P -->|replicate| R2
style P fill:#ff6b1a,color:#0a0a0f
style R1 fill:#15803d,color:#fff
style R2 fill:#15803d,color:#fff
Reads scale horizontally. Writes still bottleneck on the primary, but for most read-heavy workloads we just bought 10× capacity without touching the write path.
Watch out for replication lag — a read on a replica may not see a write that just committed on the primary. The right mitigation depends on how much staleness you can tolerate:
| Scenario | Approach |
|---|---|
| "Read your writes" (user just posted, refreshes) | Route that user's reads to the primary for ~1s, or use a write token |
| Stale profile data (minor) | Accept eventual consistency; TTL cache + replica |
| Monetary balance / inventory | Always read from primary; replicas for analytics only |
| Replica lags badly (seconds) | Alert and stop sending reads to lagging replica |
MySQL and Postgres async replication typically lags under 1 second in the same region under normal load; cross-region adds the network round-trip — 10–100 ms — on top.
Step 5 — caching
Even with replicas, popular queries (the front page, trending content, product listings) hit the DB millions of times a day with the exact same result each time. Put a cache in front and avoid repeating that work.
flowchart LR
A1[App] -->|1. check cache| C[(Redis)]
C -.miss.-> A1
A1 -->|"2. miss → query DB"| DB[(Replica)]
DB --> A1
A1 -->|3. write back| C
C --> A1
style C fill:#15803d,color:#fff
Cache-aside pattern. We covered eviction policies, TTLs, and the thundering-herd problem in the caching module.
A well-configured cache serves 90%+ of reads from memory — that's a 10× DB load reduction on top of the replicas. The two layers together can carry a read-heavy workload an order of magnitude beyond what a single DB could handle alone.
Step 6 — CDN
Static assets — JS bundles, CSS, images, video — are large and change infrequently. Serving them from your app server means every byte travels from your origin to users worldwide: 100–200 ms per request for a user on another continent. A CDN puts copies at 200+ edge locations, so those assets travel the last 5–20 ms instead.
flowchart LR
U[Users worldwide] --> EDGE[CDN Edges<br/>200+ locations]
EDGE -.miss.-> ORIGIN[Origin: app + S3]
style EDGE fill:#ff6b1a,color:#0a0a0f
First Contentful Paint drops dramatically. Modern stacks (Vercel, Cloudflare Workers) push this further — even server-rendered HTML can be cached or generated at the edge.
Step 7 — message queue
Some work is inherently slow: sending emails, resizing uploaded images, running ML inference, computing recommendations. If you do that work synchronously, the user waits. Don't make them wait. Queue it.
flowchart LR
U --> APP[App]
APP -->|"return 201"| U
APP -->|"publish job"| Q[Queue]
Q --> W1[Worker 1]
Q --> W2[Worker 2]
style Q fill:#0e7490,color:#fff
Moving slow work off the request path keeps API latency low. Workers scale independently of the API — you can run more workers during image-processing peaks without touching your web tier. Failures are retried automatically without re-submitting. And a queue absorbs traffic spikes, turning what would be a thundering herd into a steady drain.
Step 8 — database sharding
Replicas and caching handle read scale beautifully, but all writes still funnel through one primary. Past a certain write rate — sustained writes over roughly 10k/s, or a dataset that won't fit in a single machine's RAM — the primary becomes the ceiling. Time to shard.
flowchart TD
APP[App] --> SR[Shard Router]
SR --> S1[(Shard 1<br/>users 0-9)]
SR --> S2[(Shard 2<br/>users 10-19)]
SR --> S3[(Shard 3<br/>users 20-29)]
SR --> S4[(Shard N...)]
style SR fill:#ff6b1a,color:#0a0a0f
Each shard is itself a primary + replicas. Writes go to the right shard based on a shard key (usually user_id).
Sharding brings real complexity that you have to plan for. Cross-shard queries ("top 10 users by activity") require a scatter-gather fan-out across every shard followed by in-app sorting — these are expensive and should be avoided or served from a pre-aggregated summary table. Resharding is the painful one: if you use modulo hashing (user_id % N), adding a new shard moves most of your data (67–94% depending on the starting shard count). Consistent hashing limits movement to 1/N of keys per shard added — strongly prefer it from the start. Hot shards happen when one entity is far more active than average — a celebrity account with millions of followers generates writes an order of magnitude beyond a normal user. The shard that owns that user_id overheats. Short-term mitigations: move that user to a dedicated shard, or add read replicas to absorb the read pressure. Long-term: re-examine whether user_id was the right shard key. Cross-shard transactions are essentially impossible to make atomic; if your schema requires rows in two different shards to update together, you've likely chosen the wrong shard key. And secondary indexes get complicated fast: if you shard by user_id but need to query by email, either scatter-gather across all shards or maintain a reverse-index shard that maps email → user_id.
flowchart LR
HOT["Hot user: 1M followers"] --> SR[Shard Router]
SR --> SH1[("Shard 7<br/>overloaded")]
NRM[Normal users] --> SR
SR --> SH2[("Shard 1<br/>normal")]
SR --> SH3[("Shard 2<br/>normal")]
SH1 -.replica.-> RR1[(Read Replica)]
SH1 -.replica.-> RR2[(Read Replica)]
style SH1 fill:#ff2e88,color:#fff
style SR fill:#ff6b1a,color:#0a0a0f
style RR1 fill:#15803d,color:#fff
style RR2 fill:#15803d,color:#fff
Step 9 — split into services
A sharded monolith is workable, but once your engineering org grows past ~50–100 people, the monolith itself becomes the bottleneck — not in compute, but in coordination. Teams step on each other's deploys. Different parts of the system have wildly different scaling requirements: search needs Elasticsearch, recommendations need a GPU cluster, the notification service just needs a queue. A microservices split lets each domain team own its stack, deploy independently, and scale to its own demand curve.
flowchart TD
LB[Load Balancer / API Gateway] --> US[User Service]
LB --> PS[Posts Service]
LB --> NS[Notifications]
LB --> SS[Search]
LB --> RS[Recommendations]
US --> UDB[(Users DB)]
PS --> PDB[(Posts DB)]
NS --> Q[Queue]
SS --> ES[(Elasticsearch)]
RS --> REC[(Feature Store)]
style LB fill:#ff6b1a,color:#0a0a0f
Microservices are not free: more moving parts, more failure modes, more network hops, and a dramatically harder debugging story. But for a 100+ engineer org, the team-scaling benefit usually outweighs the operational cost.
Step 10 — multi-region
You have users on every continent. A single region adds 100–200 ms of latency for distant users, and is a single point of failure for the whole product.
flowchart TD
DNS[Geo DNS / Anycast] --> US[us-east-1]
DNS --> EU[eu-west-1]
DNS --> APAC[ap-northeast-1]
US -.async replicate.-> EU
EU -.async replicate.-> APAC
APAC -.async replicate.-> US
style DNS fill:#ff6b1a,color:#0a0a0f
Two main flavors. Active-passive sends all writes to one region; the others are warm standbys. Failover is slow — manual failover takes minutes; even automated cross-region failover (e.g., Aurora Global Database with managed promotion) is measured in tens of seconds to a minute — but data stays consistent because there's only one writer. Active-active lets every region accept writes and syncs them asynchronously. Faster for local users, but write conflicts become real. Three common approaches to those conflicts: last-writer-wins (simple, but concurrent edits lose data), region-partitioned writes where EU users always write to the EU region (avoids conflicts by design, works well when data is naturally geographic), or CRDTs and application-level merge logic (correct but complex, not used natively by DynamoDB Global Tables — which resolves conflicts via last-writer-wins timestamps by default).
A third option that often goes unmentioned: read-local, write-global — reads are served locally from async replicas, but all writes route to one primary region. Low-latency reads worldwide, with the trade-off of higher write latency for non-primary regions. Many teams start here before committing to the operational complexity of active-active.
The whole picture
After all ten steps, the final architecture for a "millions of users" consumer app:
flowchart TD
Users --> DNS[Anycast DNS]
DNS --> CDN[Global CDN]
CDN --> ALB[Regional ALB]
ALB --> APIGW[API Gateway]
APIGW --> AUTH[Auth Service]
APIGW --> US[Users svc]
APIGW --> PS[Posts svc]
APIGW --> FS[Feed svc]
APIGW --> NS[Notif svc]
APIGW --> SS[Search svc]
US --> URC[Redis<br/>users cache]
US --> UDB[(Postgres<br/>sharded)]
PS --> PRC[Redis]
PS --> PDB[(Postgres<br/>sharded)]
FS --> KFK[Kafka]
NS --> KFK
KFK --> Workers[Workers]
Workers --> NDB[(DynamoDB)]
SS --> ES[(Elasticsearch)]
style CDN fill:#ff6b1a,color:#0a0a0f
style ALB fill:#ffaa00,color:#0a0a0f
style KFK fill:#0e7490,color:#fff
Cost ballpark at each stage
| Stage | DAU | Monthly cost (rough) |
|---|---|---|
| Day 0 | <1k | $50 |
| LB + multi-app | 10k | $500 |
| + replicas + cache | 100k | $5k |
| + CDN + queue | 1M | $30k |
| + sharding | 10M | $300k |
| + multi-region + microservices | 100M | $5M+ |
These are rough order-of-magnitude. Real numbers depend wildly on workload. But the shape — exponential in users, exponential in cost, sub-linear in cost-per-user — is universal.
Lessons
The main one: don't over-engineer. A single Postgres on a beefy box gets you very far. Shard when you must, not when it sounds cool. Most scaling is reactive — you hit a bottleneck, you measure, you fix. The 10-step ladder above is what reaching limits looks like, not a prescription for how to build greenfield systems.
Operational maturity matters more than fancy tech. Monitoring, on-call, deploys, runbooks — these dominate the budget at scale, not the choice of database engine.
And: the frontier moves. What used to require a sharded MySQL cluster now fits in a single Aurora instance (up to 256 TiB storage on current engine versions, and Aurora Limitless handles horizontal write sharding natively). What requires a hand-rolled Cassandra cluster today might be a managed offering tomorrow. Always check whether a managed service has closed the gap before committing to DIY sharding.
Things to discuss in an interview
- "Which of these would you do first if your DB was at 80% CPU?" → replicas, then cache, in that order. Read replicas are a database-native horizontal scale-out; a cache layer on top then eliminates the repeated reads that still reach the replicas. If the bottleneck is specifically write throughput rather than read load, sharding is the next step — but adding a cache is reversible and fast while sharding is neither.
- "How would you decide when to shard?" → when read replicas can't relieve write load, or when the working set won't fit in one machine's RAM. A common threshold: >5 TB data or sustained writes >10k/s on a single primary.
- "What goes wrong when you go multi-region?" → write conflicts, replication lag, increased operational complexity, cost (inter-region data transfer), and regulatory concerns (GDPR — EU user data must stay in EU).
- "What's the trade-off between active-active and active-passive multi-region?" → active-active gives low write latency globally but requires conflict resolution; active-passive is simpler but all writes bear cross-region latency if the primary is far.
- "How do you know when to add a message queue?" → when a synchronous operation takes > 100–200ms (email, image resizing, ML inference) and the user doesn't need the result immediately. Also: when you have burst traffic that would overwhelm downstream systems.
- "A single shard is getting much hotter than others — what do you do?" → investigate the shard key: hot shards are often a sign of a bad partition strategy. Short-term: move hot data to a dedicated shard or add read replicas for it. Long-term: re-key.
Things you should now be able to answer
After reading this article, you should be able to explain:
- Why stateless app servers are a prerequisite to horizontal scaling, not a consequence of it.
- The difference between read replicas (read scale-out) and sharding (write scale-out), and when each applies.
- Why a 10:1 read:write ratio is the normal case for consumer apps and how that shapes architecture choices.
- Two concrete failure modes of in-memory sessions when you add a second app server.
- Why modulo sharding is painful to rebalance, and what consistent hashing does differently.
- The conflict-resolution strategies available for active-active multi-region writes.
- Why message queues improve both latency and resilience (two distinct benefits, not one).
Frequently asked questions
▸What is the stateless app tier and why is it a prerequisite for horizontal scaling?
A stateless app tier is one where sessions, uploads, and caches are stored outside the app server — in Redis, S3, or a shared cache — rather than in process memory or local disk. It is a prerequisite, not a consequence, of horizontal scaling because as soon as a second app server is added, any state stored on the first server becomes invisible to the second, causing failures like 401 errors on session lookups and cache drift between instances.
▸When should you add read replicas versus sharding?
Add read replicas when reads are the bottleneck, which is the common case given a typical consumer app's 10:1 read-to-write ratio. Shard only when write throughput saturates the primary — the article gives concrete thresholds of sustained writes over roughly 10,000 per second or a dataset that will not fit in a single machine's RAM (a common heuristic is more than 5 TB of data).
▸How much DB load can a well-configured cache layer absorb?
A well-configured cache using the cache-aside pattern serves 90 percent or more of reads from memory, which represents a 10x reduction in DB load on top of what read replicas already provide. Combined, the two layers can carry a read-heavy workload an order of magnitude beyond what a single database could handle alone.
▸What is the difference between active-active and active-passive multi-region, and what are the write-conflict options for active-active?
Active-passive sends all writes to one region while others act as warm standbys; failover is slow, taking minutes manually or tens of seconds to a minute with automation like Aurora Global Database, but data stays consistent because there is only one writer. Active-active lets every region accept writes asynchronously for lower local latency, but requires conflict resolution — the article identifies three strategies: last-writer-wins (simple but risks losing concurrent edits, and the default used by DynamoDB Global Tables via item-level timestamp comparison), region-partitioned writes where a user always writes to their home region (avoids conflicts by design), and CRDTs or application-level merge logic (correct but complex).
▸Why is modulo sharding painful and what does consistent hashing do differently?
With modulo sharding using user_id mod N, adding one new shard forces the majority of existing data to move to a different shard — in practice 67% for N=2 rising to 90%+ for larger shard counts. Consistent hashing limits data movement to 1/N of keys per shard added, making resharding far less disruptive, which is why the article recommends preferring it from the start.
You may also like
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design a Feature Store
Serve the exact same feature values to model training and online inference — eliminating training-serving skew — across batch, streaming, and on-demand tiers at sub-10ms latency and millions of reads per second. The architecture powering Uber Michelangelo, Airbnb Chronon, and DoorDash Gigascale.
Design an LLM Inference & Serving System
Serve token generation for a 70B-parameter model at scale — where KV cache, not FLOPs, caps concurrency and continuous batching is what separates good GPU utilization from terrible utilization.