~/articles/database-sharding

◆◆Intermediateasked at Metaasked at Googleasked at Uberasked at Slack

Database Sharding

When you outgrow a single database — how to split data across many machines, the strategies that work, and the operational pain you'll inherit.

12 min read2026-02-03Ironclad Academy

#databases #sharding #scalability #distributed-systems

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

A single database — even a beefy one with read replicas — has a hard ceiling. Past 1–10 TB of hot data, or when write throughput saturates the primary (the practical ceiling is roughly 10k–50k writes/s for transactional workloads with full durability, though batch-oriented workloads can push higher), vertical scaling stops being the answer. You shard.

Sharding means partitioning your data across multiple databases, each holding a slice. This article covers when you actually need it, how to choose a sharding strategy, and the operational landmines (cross-shard queries, resharding, hot shards) that come with it.

When do you actually need to shard?

Almost never as early as people think. The order of operations:

flowchart LR
    A[Single DB] -->|"hot data > RAM<br/>OR writes > 5k/s"| B[Read replicas]
    B -->|"writes still bottleneck"| C[Vertical scale<br/>biggest box you can buy]
    C -->|"hit machine ceiling<br/>or 10TB hot data"| D[Shard]
    style D fill:#ff6b1a,color:#0a0a0f

If your database fits on an r6id.32xlarge (128 vCPU, 1024 GiB RAM, 4×1.9 TB local NVMe — or the equivalent in your cloud) and your write rate is not saturating the primary, you don't need to shard. Most apps never reach that scale.

Sharding adds complexity to every part of your system. Push it as late as you reasonably can.

Shard key — the single most important decision

Pick the wrong shard key and everything else is harder. Forever.

A good shard key has four properties working together. It has high cardinality — enough distinct values that data spreads across shards rather than piling up in one. It has an even access pattern — no single value soaks up a disproportionate share of traffic. It co-locates related data — ideally, all the rows your most frequent queries touch share a key, so those queries hit exactly one shard. And it is stable — the key value for a given row rarely changes, because changing it means moving the row across shards.

The most common shard key for consumer products is user_id. Most queries are naturally scoped to one user, users tend to have similar activity levels, and everything belonging to one user lands together on one machine.

flowchart TD
    R[Row] --> Q{"hash(user_id) % N"}
    Q -->|"→ 0"| S1[Shard A]
    Q -->|"→ 1"| S2[Shard B]
    Q -->|"→ 2"| S3[Shard C]
    style Q fill:#ff6b1a,color:#0a0a0f

Here is what a real read request looks like as it flows through that routing layer:

sequenceDiagram
    participant App
    participant Router as Shard Router
    participant ShardB as Shard B
    participant ReplicaB as Replica B
    App->>Router: SELECT * FROM orders WHERE user_id=42
    Router->>Router: "hash(42) % N → shard B"
    Router->>ReplicaB: forward read
    ReplicaB-->>Router: result rows
    Router-->>App: rows
    Note over Router,ShardB: writes always go to the primary
    App->>Router: INSERT INTO orders (user_id=42, ...)
    Router->>ShardB: forward write to primary
    ShardB-->>Router: OK
    Router-->>App: OK

The router is thin — a small config table or in-memory ring. All the heavy lifting stays in the shards themselves.

Partitioning strategies

1. Range-based sharding

Split by ranges of the shard key:

Shard A: user_id 0       — 999,999
Shard B: user_id 1M      — 1,999,999
Shard C: user_id 2M      — 2,999,999

Range queries are efficient with this approach — "all users between 1M and 1.5M" hits a single shard. The downside is hot spots. New users (highest IDs) all land on the same shard, and power users may cluster within a range. Used by HBase and MongoDB (sometimes).

2. Hash-based sharding

shard = hash(user_id) % N

Load is evenly distributed because the hash randomizes placement. The tradeoff is that range queries become useless — "all users who signed up in February" forces a scatter-gather across every shard. Worse, adding shards reshuffles everything, because the modulus changes. Many DIY sharding setups start here and later regret it. See consistent hashing for the fix.

3. Consistent hashing

Hash-based, but adding or removing shards moves only ~K/N of keys rather than rehashing everything. We have a whole article on it. This is the default choice for production sharded systems. Used by Cassandra, DynamoDB, and Riak.

The key intuition: both data keys and nodes are hashed onto the same ring. A key belongs to the nearest node clockwise from it. Add a node and only the keys between it and its predecessor move — everything else stays put.

flowchart LR
    subgraph "Consistent Hash Ring"
    N1((Node A<br/>pos 10)) --> N2((Node B<br/>pos 40))
    N2 --> N3((Node C<br/>pos 80))
    N3 --> N1
    K1[key hash=25<br/>→ Node B] -.-> N2
    K2[key hash=55<br/>→ Node C] -.-> N3
    K3[key hash=95<br/>→ Node A] -.-> N1
    end
    style N1 fill:#0e7490,color:#fff
    style N2 fill:#15803d,color:#fff
    style N3 fill:#0e7490,color:#fff
    style K1 fill:#ffaa00,color:#0a0a0f
    style K2 fill:#ffaa00,color:#0a0a0f
    style K3 fill:#ffaa00,color:#0a0a0f

When you add Node D at position 60, only keys between 40 and 60 (currently owned by Node C) need to migrate — everything else is untouched.

4. Directory-based sharding

A lookup table:

user_id 12345 → shard_id 7
user_id 67890 → shard_id 2

This gives you maximum flexibility — you can move a single user to a new shard without rehashing anything, just by updating the directory. The cost is that the directory itself becomes a service you have to keep fast, consistent, and highly available. It is another database that the application depends on for every request.

Used by Slack, Figma, and many SaaS products that need per-tenant placement control.

flowchart LR
    APP[App] -->|1. lookup tenant_id| DIR[(Directory<br/>tiny DB)]
    DIR -->|2. shard 7| APP
    APP -->|3. query| S7[(Shard 7)]
    style DIR fill:#ff6b1a,color:#0a0a0f

5. Geo-based / functional sharding

Different shards for different regions or different feature areas:

Shard EU: users in Europe
Shard US: users in America

This gives you data locality (lower latency for local users) and makes regulatory compliance easier — GDPR can require that EU data stays in EU. The awkwardness is cross-region queries and users who travel or have data straddling regions.

The hard parts of sharding

1. Cross-shard queries

A query like "give me the top 10 products by sales" requires querying every shard, then merging. This is scatter-gather.

flowchart TD
    APP[App] -->|"SELECT TOP 10..."| FAN
    FAN[Scatter] -->|query| S1[Shard 1]
    FAN -->|query| S2[Shard 2]
    FAN -->|query| S3[Shard 3]
    FAN -->|query| SN[Shard N]
    S1 -->|local top 10| MERGE
    S2 -->|local top 10| MERGE
    S3 -->|local top 10| MERGE
    SN -->|local top 10| MERGE
    MERGE[Merge<br/>global top 10] --> APP
    style FAN fill:#ff6b1a,color:#0a0a0f
    style MERGE fill:#15803d,color:#fff

The latency of a scatter-gather query is not the average of each shard's response — it is the maximum. One slow shard slows the whole query. You also pay N× the network cost and have to write the merge logic yourself.

The mitigations: keep these queries off the hot path and pre-compute aggregates instead. Use a separate OLAP store (BigQuery, Snowflake, ClickHouse) for analytics. Materialized views can pre-fan-out the data so the scatter-gather runs offline.

2. Cross-shard transactions

You can't easily wrap a cross-shard write in BEGIN ... COMMIT. Two-phase commit (2PC) is the formal solution, but it is slow and fragile in ways that matter at scale.

2PC requires two full network round-trips before any participant can commit. All rows involved are locked for the entire duration of both phases. If the coordinator crashes between the prepare and commit phases, participants stay blocked indefinitely — the "blocking problem" that makes 2PC unsuitable for high-availability systems. Cross-region 2PC can easily hit 200–500 ms of latency.

Your options:

Avoid them. Design schemas so transactions stay within one shard. This was Slack's original approach: all data for a workspace was co-located on one shard (though each physical shard held many workspaces), so every workspace-scoped transaction stayed local. Transactions are simple again because there is no cross-shard case.
Sagas (covered in the message queue module) — a sequence of compensable steps, each with a compensating action for rollback. You get eventual consistency, not ACID atomicity.
NewSQL (CockroachDB, Spanner) — pay the latency cost for distributed ACID. These systems run 2PC on top of Paxos/Raft consensus groups so the coordinator is replicated and a coordinator failure does not leave participants blocked indefinitely.

3. Resharding (a.k.a. "the operations nightmare")

You have 100 shards and you've outgrown them. You need 200. Now what?

The naive answer — double the modulus, rehash everything — requires moving roughly half your data. That is a migration that touches every row in every shard simultaneously, while the system is still serving traffic. In practice, resharding involves engineers writing custom migration scripts, carefully planned downtime windows or read/write fencing, a long tail of inconsistencies that only surface under production load, and an anxious month.

flowchart LR
    OLD[100 shards] -->|"must move ~50% of data"| NEW[200 shards]
    style OLD fill:#ff6b1a,color:#0a0a0f
    style NEW fill:#15803d,color:#fff

There are better ways to structure for this from the start:

Double-write + backfill: write to both old and new shards simultaneously; backfill historical data in the background; cut reads over to the new shards once the backfill catches up.
Logical sharding (Vitess, Citus): a virtual layer presents one database to the application and lets you split underlying physical shards transparently. The application never sees the physical shard count change.
Pre-allocated vnodes (Cassandra-style): start with many logical shards on few physical nodes. Cassandra historically defaulted to 256 vnodes per physical node (reduced to 16 as of Cassandra 4.0, which improves repair and streaming performance at the cost of less granular load-balancing). To scale out, move some logical shards to the new physical node — no rehashing required, just a metadata update.

Plan for resharding from day one, even if you never do it. The routing layer you choose now determines how painful the day will be.

4. Hot shards

One shard receives more traffic than the others. It is a more common failure mode than people expect. Common causes: a celebrity user lands on shard 7 and shard 7 goes on fire; a time-based shard key pushes all writes to "today's" shard; or one tenant is simply 1000× larger than everyone else.

Your options, roughly ordered by how much machinery they require:

Salt the key (write sharding): append a random suffix — shard_key = hash(user_id + suffix_0..N) — to spread one logical entity's writes across N shards. Reads must scatter-gather those N shards and merge results. Good for write-heavy hot keys; adds read complexity.
Replicate hot keys: serve reads from multiple replicas of the hot shard. Works well for read-hot keys (celebrity profile page) with infrequent writes.
Move the hot entity to its own shard (directory-based sharding): give the celebrity user or the largest tenant a dedicated physical shard you can scale independently. Only the directory entry changes; the application doesn't notice.
Cache aggressively in front: an in-process or Redis cache absorbs read traffic before it ever reaches the shard.

Interview tip: Interviewers often push on "what if a user goes viral?" Walk through the tradeoff: salting fixes write throughput but adds read latency and scatter-gather complexity; isolation is cleaner but requires directory sharding infrastructure. Pick based on whether the bottleneck is reads or writes.

Sharded vs. distributed (the difference)

A sharded database is N independent databases with a routing layer. Each shard has its own primary, its own replication, its own transaction scope. Cross-shard semantics are ad hoc — you build them yourself with scatter-gather or saga patterns.

A distributed SQL database (Spanner, CockroachDB, Yugabyte) presents a single logical database to the application while internally sharding and replicating with consensus. Transactions span shards because the system handles 2PC with a replicated coordinator.

flowchart TD
    subgraph Sharded
    APP1[App] --> R1[Router]
    R1 --> SH1[(Shard 1)]
    R1 --> SH2[(Shard 2)]
    R1 --> SH3[(Shard 3)]
    end
    subgraph Distributed
    APP2[App] --> DSQL[Distributed SQL<br/>'one logical DB']
    DSQL --> RAFT[Raft groups + Time Service]
    end
    style R1 fill:#ff6b1a,color:#0a0a0f
    style DSQL fill:#a855f7,color:#fff

Sharded is what most companies build by hand on top of MySQL or Postgres. Distributed SQL is what you buy when you'll pay extra to skip the operational pain — both the complexity of managing N independent databases and the 2PC blocking problem.

Tools to know

Tool	Role
Vitess	MySQL sharding layer (used by YouTube, Slack, Square)
Citus	Postgres sharding extension (acquired by Microsoft in 2019; now powers Azure Cosmos DB for PostgreSQL, formerly Hyperscale)
CockroachDB / Spanner / Yugabyte	Distributed SQL — sharding + transactions built-in
DynamoDB / Cassandra	NoSQL with sharding native
PgBouncer / ProxySQL	Connection pooling, sometimes light routing

A real-world example: Slack

Slack's sharding has evolved across two major phases — a useful illustration of how requirements force re-sharding:

Phase 1 — workspace-sharded MySQL (original design). Each database shard held many complete workspaces. A metadata cluster acted as a directory: workspace_id → shard_id. All channels, messages, users, and files for a workspace lived on one shard, so transactions within a workspace never crossed shards. Classic directory-based sharding, workspace as the shard key.

Phase 2 — Vitess + channel-level sharding. As Enterprise Grid and Slack Connect (cross-workspace channels) grew, the "one workspace = one shard" invariant broke down. Large enterprise customers generated concentrated hot-shard load that couldn't be spread across the fleet. Slack migrated to Vitess and re-sharded message data by channel_id rather than workspace_id — spreading a busy workspace's load across the cluster while keeping the directory-lookup routing model.

The lesson: your first shard key choice will feel permanent but usually isn't. Design the routing layer — a directory or a proxy like Vitess — so the underlying physical mapping can change without rewriting the application.

Things to discuss in an interview

"Why is user_id usually a good shard key?" — high cardinality, even access, co-locates user data.
"What's the cost of scatter-gather queries?" — latency = max(shards), N× network calls.
"How would you reshard a system that's running?" — double-write + backfill, or logical shards.
"What's a hot shard? How do you fix one?" — salting, replication, isolating to its own shard.

Things you should now be able to answer

When is sharding the right answer vs. read replicas vs. NewSQL?
Why does shard key choice matter so much?
What problem do directory-based shards solve that hash-based shards don't?
Why are cross-shard transactions hard, and how do real systems avoid them?

// KEY TAKEAWAYS

▸Shard key choice is effectively permanent: high cardinality, even access pattern, co-location of related data, and stability are all required together.
▸Consistent hashing moves only ~K/N keys when adding a node, making it the default choice for production systems over plain hash-mod-N sharding.
▸Cross-shard 2PC is slow, blocking, and unsuitable for high-availability systems — cross-region 2PC can easily hit 200 to 500 ms of latency, so schema design should keep transactions within one shard.
▸Hot shards are common: a celebrity user on shard 7 can take it down, and the fix depends on whether the bottleneck is reads (replicate) or writes (salt the key or isolate to a dedicated shard via directory lookup).
▸Pre-allocate logical shards from day one using vnodes or a directory layer so resharding means moving metadata, not rehashing and moving half the data.

// FAQ

Frequently asked questions

▸What is database sharding?

Sharding partitions a dataset across N independent databases using a routing key, so each database holds a slice of the total data. It buys write throughput and storage headroom a single machine cannot provide, but forces you to give up cross-shard transactions and makes the shard key choice nearly impossible to change later.

▸When should I shard instead of scaling vertically or adding read replicas?

Shard only after read replicas and vertical scaling are exhausted. The article puts the practical ceiling at roughly 10k to 50k writes per second for transactional workloads with full durability, and somewhere past 1 to 10 TB of hot data. Most applications never reach that scale, and sharding adds complexity to every part of the system.

▸Hash-based sharding vs. consistent hashing: what is the difference?

Plain hash-based sharding uses shard = hash(user_id) % N, which forces a full data reshuffle whenever N changes because the modulus changes. Consistent hashing places both keys and nodes on a ring, so adding or removing a node moves only approximately K divided by N keys rather than rehashing everything. Cassandra, DynamoDB, and Riak all use consistent hashing.

▸What is the latency cost of scatter-gather queries in a sharded system?

The latency of a scatter-gather query is not the average of each shard's response time but the maximum — one slow shard delays the whole query. You also pay N times the network cost and must write the merge logic yourself, which is why the article recommends keeping these queries off the hot path and pre-computing aggregates in a separate OLAP store such as BigQuery, Snowflake, or ClickHouse.

▸How did Slack's sharding evolve, and what lesson does it teach?

Slack originally used directory-based sharding with workspace_id as the shard key, keeping all data for one workspace on one shard. When Enterprise Grid and cross-workspace channels grew, large customers created concentrated hot-shard load, so Slack migrated to Vitess and re-sharded message data by channel_id instead. The lesson is that your first shard key will feel permanent but usually is not, so design the routing layer to allow the physical mapping to change without rewriting the application.

← previous

CAP Theorem Deep Dive

Consistent Hashing

// RELATED