~/articles/design-distributed-id-generator

◆◆Intermediateasked at Twitterasked at Discordasked at Instagramasked at Stripe

Design a Distributed Unique ID Generator

How Twitter/Discord/Instagram generate billions of unique IDs per day with no central coordinator. UUIDs, snowflake, ULIDs.

16 min read2026-02-19Ironclad Academy

#interview #distributed-systems #identifiers

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Twitter mints roughly 6,000 tweets per second at peak. Each tweet needs a unique identifier the moment it's created: before it's written to any database, before any acknowledgment goes back to the client, and across hundreds of database shards that have no visibility into each other. Multiply that by likes, retweets, DMs, and ad impressions, and you're looking at hundreds of thousands of IDs per second from a single product. Discord, Instagram, and Stripe face the exact same problem at similar or larger scale.

The first instinct — auto-increment a database column — works beautifully for one database. The moment you shard, there's no sequence to share without locking across machines, and cross-machine locking on the write path is the thing you were sharding to avoid. UUIDs sidestep coordination entirely, but 128 random bits destroy B-tree index locality: every insert is a random seek into the middle of the tree rather than an append to the end. At a million inserts per minute, that difference in index behavior is the difference between acceptable and catastrophic write latency.

The core tension is fitting three constraints into one primitive: no coordination on the hot path, 64-bit width (so existing schemas don't grow), and time-sortable output (so ORDER BY id and ORDER BY created_at are the same index scan). Getting all three simultaneously is what makes this a genuine design problem rather than a lookup.

The solution Twitter shipped in 2010 — Snowflake — packs a millisecond timestamp, a machine ID, and a per-millisecond sequence counter into a single 64-bit integer. Each generator node works entirely independently; the only shared state is the machine ID, claimed once at startup. Instagram later extended the idea by embedding the shard ID directly in the bits, so the ID itself tells you which database owns the row — no routing table required. Both approaches live under the hood of products you use every day.

Capacity / back-of-the-envelope

Dimension	Estimate	How we got there
Single-node throughput	~4.1 M IDs/sec	4,096 IDs/ms × 1,000 ms = 4,096,000 IDs/sec
Cluster throughput (1,024 nodes)	~4.2 B IDs/sec	4,096 IDs/ms × 1,024 nodes × 1,000 ms = 4,194,304,000 IDs/sec
Timestamp overflow	~69.7 years from epoch	2^41 ms / (1,000 × 60 × 60 × 24 × 365.25) ≈ 69.7 years
ID storage per key	8 bytes	64-bit integer = 8 bytes (vs 16 bytes for UUID)
Instagram per-shard throughput	~1.0 M IDs/sec	1,024 IDs/ms × 1,000 ms (seq reduced from 12 to 10 bits)

Takeaway: A single Snowflake node saturates at ~4.1 M IDs/sec; a 1,024-node cluster reaches ~4.2 B IDs/sec with no per-ID coordination. The 41-bit timestamp gives about 70 years of headroom from any epoch chosen today.

Functional requirements

64-bit integer ID (or 128-bit UUID).
Globally unique.
Generated at high throughput (~10k/sec per host, billions/day cluster-wide).
(Often) sortable by creation time.
Generated without coordination on the hot path.

Non-functional

< 1ms per ID.
No central bottleneck.
Survive node failures.

Building up to the design

ID generation is the question where "just use UUIDs" is almost right and almost wrong, and the candidates who can explain the difference get the offer. Walk forward.

V1: `AUTO_INCREMENT` in one Postgres

CREATE TABLE tweets (id BIGSERIAL PRIMARY KEY, ...);

The DB hands out 1, 2, 3, … sequentially. Inserts are append-only on the index. For a single database, this is basically perfect — 64-bit, monotonic, sortable, and fast. Most products can run here for years.

The wall you hit is cross-shard. The moment you spin up a second database, there's no way to share the sequence without distributed coordination — and distributed coordination on every write is exactly what we're trying to avoid. Sequential IDs also leak business data: user #97 is obviously the 97th customer.

V2: UUIDv4 (random) — the "no coordination" answer

128 random bits, generated client-side, near-zero collision risk. Works on any node, any language, no infrastructure dependency.

The downside shows up in the database. Random IDs scatter inserts across the B-tree, page after page, causing index thrash — each insert is essentially a random seek. You also lose time-sort: ORDER BY id means nothing, so every "show recent items" query needs a separate created_at index, doubling write costs. And 128 bits doubles the size of every foreign key column in your schema, which compounds across joins.

V3: UUIDv7 / ULID — time-prefixed UUID

First 48 bits = millisecond timestamp, rest = random. Time-sortable, no coordination, and B-trees are much happier because new IDs cluster near the end of the index rather than scattering. This is a real improvement over UUIDv4 for most applications.

Still 128 bits, though. Most internal systems were built around 64-bit IDs, and doubling every key column is a real cost at scale — storage, memory in buffer pools, and wire overhead on every join.

V4: Range allocation (Flickr ticket server)

A central service hands out blocks of 10,000 IDs at a time; each app consumes its block locally. Coordination drops from once per ID to once per 10,000 IDs — a 10,000× improvement — and the IDs are 64 bits.

The remaining problems: the central service is a single point of failure, an app crash loses its unused block (creating gaps), and the IDs don't carry any time or shard information.

V5: Snowflake — Twitter's canonical answer

Pack the ID into 64 bits:

| 1 bit | 41 bits timestamp | 10 bits machine | 12 bits seq |

Each generator knows its machine_id (assigned once at startup) and composes every ID from (now_ms, machine_id, seq_within_ms). No coordination on the hot path — ever. A single node can mint 4,096 IDs per millisecond; a cluster of 1,024 nodes produces ~4.2 billion IDs per second. IDs sort chronologically because the timestamp occupies the top bits.

Two problems remain. First: clocks. If a node's clock jumps backward even one millisecond, it will mint IDs that overlap with IDs already issued, silently breaking uniqueness. Second: machine IDs must be unique across every generator in the cluster — quietly assigning the same ID to two nodes corrupts everything without a single error message.

V6: Snowflake hardening + Instagram-style shard embedding

Both problems have clean solutions:

Clock skew: refuse to mint if now_ms < last_ts; spin-wait until the clock catches up. For larger backward jumps (VM migrations, leap seconds), fall back to a logical clock — hold the timestamp at last_ts and keep incrementing seq.
Machine ID assignment: ZooKeeper or etcd ephemeral nodes. On startup, the generator acquires a free slot via compare-and-swap; the lease releases automatically if the process dies. Startup cost is ~50–100 ms, then it's gone.
Embed shard ID in the bits (Instagram's trick): swap | 41 ts | 13 shard | 10 seq |. Now the ID itself tells you which DB shard the row lives on — no separate lookup, no routing table.

This is the production-grade design — V5 + V6 — used by Twitter, Discord, Instagram, and Mastodon.

flowchart LR
    V1[V1: BIGSERIAL<br/>one DB] --> V2[V2: UUIDv4<br/>random, index thrash]
    V2 --> V3[V3: UUIDv7/ULID<br/>time-sortable, 128b]
    V3 --> V4[V4: range allocation<br/>64b, central SPOF]
    V4 --> V5[V5: Snowflake<br/>distributed, time-sortable]
    V5 --> V6[V6: + shard embed + clock safety<br/>production]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

The picks below are the same approaches mapped against different constraints — read them as "which V to use for which workload."

Approaches

1. UUIDv4 (random)

128 random bits. Zero coordination, easy, globally unique with overwhelming probability. Reach for it when you need something that just works across any node, language, or runtime — and when ordering and storage efficiency don't matter. The moment you're building clustered indexes or doing a lot of joins, the 128-bit width and random distribution start hurting you badly.

2. UUIDv7 / ULID (timestamp + random)

128 bits, but the first 48 bits are a millisecond Unix timestamp; the remaining 80 bits hold version/variant markers and random (or sequence) data.

01938F33-7234-7XXX-YYYY-ZZZZZZZZZZZZ
└── 48-bit ms ──┘└─ 4b ver | 12b rand_a | 2b var | 62b rand_b ─┘

Note on UUIDv7 structure: the 80 non-timestamp bits include a 4-bit version marker and 2-bit variant field, leaving 74 bits for random or sequence data — not strictly "all random", but effectively so for collision purposes.

Time-sortable, no coordination, near-zero collision risk. A strong choice for modern apps and frameworks that want lexicographic time-sort without external coordination. The 128-bit width is the only real cost. (Stripe's own prefixed IDs — e.g. cus_, ch_ — use a custom encoding with a shard key baked in, not UUIDv7.)

3. DB auto-increment

SERIAL/BIGSERIAL in Postgres, AUTO_INCREMENT in MySQL. Simple, monotonic, sortable, and a single point of contention. Doesn't work across shards; sequential IDs leak business data. Use this when you have one database and don't need to hide growth numbers. Stop using it the moment you shard.

4. Range-based allocation (block IDs)

A central service hands out blocks of 10,000 IDs at a time. Each app server consumes its block locally; it refills when running low.

ID Service → "you have 1,000,000 - 1,009,999"
App: uses 1,000,000, then 1,000,001, ...
App: low on IDs → ask for next block → "you have 1,010,000 - 1,019,999"

Throughput per app is limited only by the local counter — one coordinator call per 10,000 IDs rather than per ID. The cost: gaps appear when an app crashes mid-block, and the service needs to be highly available. Used by Flickr's Ticket Server and many internal systems.

5. Snowflake (Twitter, 2010 — the canonical answer)

64 bits split into:

| 1 bit | 41 bits         | 10 bits   | 12 bits |
| sign  | timestamp ms    | machine   | seq     |

timestamp_ms: milliseconds since a custom epoch — gives ~69 years of IDs.
machine_id: identifies the generating node (1024 unique nodes).
seq: sequence within the same millisecond (4096 IDs/ms per node).

flowchart LR
    NODE[Node 17, t=1707432000000] --> ID1["seq=0: 01101...01.0000010001.000000000000"]
    NODE --> ID2["seq=1: 01101...01.0000010001.000000000001"]
    NODE --> NEXT["next ms: seq resets to 0"]
    style ID1 fill:#ff6b1a,color:#0a0a0f
    style ID2 fill:#ff6b1a,color:#0a0a0f
    style NEXT fill:#15803d,color:#fff

The bit layout is worth visualizing. Those 64 bits carry everything the generator needs — and the timestamp in the upper bits means numeric sort and chronological sort are the same operation.

flowchart LR
    SIGN["Bit 63\n(sign = 0)"] --> TS["Bits 62–22\n41 bits: ms timestamp"]
    TS --> MID["Bits 21–12\n10 bits: machine_id"]
    MID --> SEQ["Bits 11–0\n12 bits: seq"]
    style SIGN fill:#0e7490,color:#fff
    style TS fill:#ff6b1a,color:#0a0a0f
    style MID fill:#a855f7,color:#fff
    style SEQ fill:#15803d,color:#fff

class Snowflake:
    def __init__(self, machine_id: int, epoch: int = 1577836800000):
        self.machine_id = machine_id   # 0..1023
        self.epoch = epoch
        self.last_ts = -1
        self.seq = 0

    def next_id(self) -> int:
        ts = int(time.time() * 1000)
        if ts < self.last_ts:
            # Clock jumped backward — refuse to mint; wait for clock to recover
            while ts < self.last_ts:
                ts = int(time.time() * 1000)
        if ts == self.last_ts:
            self.seq = (self.seq + 1) & 0xFFF  # 12 bits
            if self.seq == 0:
                # Sequence exhausted within 1 ms; spin to next ms
                while ts <= self.last_ts:
                    ts = int(time.time() * 1000)
        else:
            self.seq = 0
        self.last_ts = ts
        return ((ts - self.epoch) << 22) | (self.machine_id << 12) | self.seq

64 bits, no coordination on the hot path, sortable by time, and 4,096 IDs/ms × 1,024 machines = ~4.2 billion IDs/sec cluster-wide. The two things that can go wrong are backward clock jumps (handled by the spin-wait above) and duplicate machine IDs (handled by etcd at startup). Both are addressed in the operational section below.

Used by: Twitter, Discord, Instagram (they call theirs "Sharded ID"), Mastodon.

6. Instagram's variant (sharded auto-increment + timestamp)

Instagram needed the time-sort property of Snowflake but with shard awareness. Their ID:

| 41 bits timestamp | 13 bits shard_id | 10 bits seq |

The shard_id directly tells you which DB shard the row lives on — the ID is its own shard key. Beautiful.

CREATE FUNCTION next_id(seq_name TEXT, OUT result BIGINT) AS $$
DECLARE
    epoch_ms BIGINT := 1314220021721;
    seq_id   BIGINT;
    now_ms   BIGINT;
    shard_id INT := 5;          -- per-shard
BEGIN
    SELECT nextval(seq_name) % 1024 INTO seq_id;
    SELECT EXTRACT(EPOCH FROM clock_timestamp()) * 1000 INTO now_ms;
    result := (now_ms - epoch_ms) << 23;
    result := result | (shard_id << 10);
    result := result | seq_id;
END;
$$ LANGUAGE plpgsql;

Generated by Postgres itself, so no external service needed. Each shard gets its own function, hardcoded with its shard_id.

Comparison

Approach	Bits	Sortable	Centralized	Throughput
UUIDv4	128	No	No	∞
UUIDv7 / ULID	128	Yes	No	∞
DB auto-increment	64	Yes	Yes	~10k/s
Range allocation	64	Yes	Yes (less so)	~1M/s
Snowflake	64	Yes	No (one-time setup)	~4.2B/s
Instagram	64	Yes	No	~1M/s per shard

For most modern systems: Snowflake or ULID. Use ULID if you don't have control over machine IDs (e.g. serverless).

Operational concerns

Clock skew

Snowflake assumes monotonic, accurate clocks. If a node's clock jumps backward, IDs collide — you can get two IDs with the same (timestamp, machine_id, seq). The mitigations layer on top of each other depending on how large the backward jump is.

NTP on every node keeps drift to single-digit milliseconds under normal conditions. Most of the time, that's all you need.

For the occasional NTP correction, the standard guard is: if now_ms < last_ts, spin-wait until the clock catches up. This works cleanly for sub-millisecond corrections; it causes a brief stall for corrections in the 1–10 ms range, which is acceptable. It breaks down after a leap second or a VM live-migration, where the clock can jump back by hundreds of milliseconds.

For those cases, a logical clock fallback helps: pin timestamp at last_ts and keep incrementing seq. This works as long as the backward jump is shorter than the time needed to exhaust 4,096 sequence slots (effectively, less than 1 ms of headroom). Larger jumps still require a hard wait.

The cleanest approach combines both: use CLOCK_MONOTONIC (not CLOCK_REALTIME) for the sequence guard, and compare it to wall-clock time at the start of each millisecond to detect drift early before it becomes a problem.

flowchart TD
    CALL[next_id called] --> CMP{"now_ms < last_ts?"}
    CMP -->|yes — clock went back| WAIT[Spin-wait until now_ms >= last_ts]
    WAIT --> CMP2{"now_ms == last_ts?"}
    CMP -->|no| CMP2
    CMP2 -->|yes — same ms| INC["seq = (seq + 1) & 0xFFF"]
    INC --> FULL{"seq == 0?\n(wrapped)"}
    FULL -->|yes — seq exhausted| SPIN[Spin to next ms]
    SPIN --> EMIT
    FULL -->|no| EMIT[Emit ID]
    CMP2 -->|no — new ms| RESET[seq = 0]
    RESET --> EMIT
    style WAIT fill:#ff2e88,color:#fff
    style SPIN fill:#ffaa00,color:#0a0a0f
    style EMIT fill:#15803d,color:#fff

Machine ID assignment

Where does each generator get its 10-bit machine_id? The failure mode is silent: two nodes sharing the same machine_id will mint IDs that collide — no error, no duplicate-key violation, just silently broken uniqueness across shards.

The options range from quick-but-fragile to robust-but-heavyweight:

Static config — assign the ID at deploy time. Simple; fails with auto-scaling because a replacement pod can be assigned the same ID as its predecessor, which may still be running during a slow shutdown.

ZooKeeper / etcd ephemeral nodes — the generator acquires a free slot via a distributed compare-and-swap on startup; the lease releases automatically when the process dies. Adds ~50–100 ms to startup time, after which it's gone. Most robust approach.

Cloud instance metadata — hash the EC2 instance ID or GCP instance name into the 10-bit range. Collision probability is low but non-zero if you hash more than 1,024 instances. The same instance always gets the same machine_id, which can cause conflicts during rapid scale-in/scale-out.

Pod IP last octet — take the last 8 bits of the pod IP. Works for clusters where IPs are stable and the fleet is under 256 generators. A quick operational hack that breaks the moment your cluster crosses 256 pods or IPs get recycled.

flowchart LR
    BOOT[Generator boots] --> ETCD{etcd available?}
    ETCD -->|yes| CAS["Compare-and-swap\nfree slot in 0..1023"]
    CAS -->|acquired| LEASE["Hold ephemeral lease\nauto-released on crash"]
    LEASE --> RUN[Mint IDs locally]
    ETCD -->|no| STATIC["Fall back to\nstatic config"]
    STATIC --> RUN
    CAS -->|all slots taken| ERR[Error — too many generators]
    style LEASE fill:#15803d,color:#fff
    style RUN fill:#ff6b1a,color:#0a0a0f
    style ERR fill:#ff2e88,color:#fff

Stress: ID generation at maximum rate

A single Snowflake node maxes at 4,096 IDs/ms = ~4.1M IDs/sec. If you need more from one node, either upgrade to a 16-bit seq field (sacrifice 4 bits of machine namespace or timestamp resolution), or run multiple "logical generators" per host, each with its own virtual machine_id.

Why time-sortable IDs matter

A common pattern: store IDs as primary keys, also use them for ordering.

SELECT * FROM tweets WHERE author_id = 42 ORDER BY id DESC LIMIT 50;

Because id is a snowflake, this is a single backwards index scan. If id were random (UUIDv4), you'd need a separate created_at index, doubling write costs.

This is also why time-sortable IDs are easier on B-tree indexes — new IDs are roughly monotonic, so inserts append rather than scatter.

Things to discuss in an interview

Trade-off: 64 vs. 128 bits — and the concrete cost of wider FK columns at scale.
Time-sortable as a critical property for storage performance (B-tree locality, index scan direction).
Machine ID assignment as the hidden coordination problem — and why "just use the pod IP" breaks at > 256 pods.
Clock skew handling — what happens if NTP steps the clock back 2 ms? What about 500 ms (VM live-migration)?
Centralized fallback: range allocation when snowflake-style isn't an option (e.g., WASM, edge workers, strict multi-tenant isolation).
Instagram trick: when to embed the shard in the ID, and what you lose if you ever need to reshard.
Epoch choice: pick a recent epoch close to "now" to maximize the useful range of the 41-bit timestamp field.

Things you should now be able to answer

Why is UUIDv4 a poor primary key for indexes despite being unique?
What does each segment of a snowflake ID encode?
How does Instagram's variant let you derive the shard from the ID?
What's the failure mode if a snowflake node's clock jumps backward?
Why do block-allocation schemes still use a centralized service?

Frequently asked questions

▸What is a Snowflake ID and how are its 64 bits structured?

A Snowflake ID is a 64-bit integer composed of three fields: 41 bits of millisecond timestamp (using a custom epoch), 10 bits of machine ID, and 12 bits of per-millisecond sequence number, plus 1 unused sign bit. The timestamp occupies the top bits so numeric sort and chronological sort are the same operation, and a single node can mint 4,096 unique IDs per millisecond without any coordination.

▸Why is UUIDv4 a poor primary key for database indexes?

UUIDv4's 128 random bits scatter inserts across a B-tree index, causing each insert to be a random seek into the middle of the tree rather than an append to the end. At high write rates this creates serious index thrash. UUIDv4 also doubles the size of every foreign-key column compared to a 64-bit ID, compounding across joins.

▸How many IDs can a single Snowflake node and a full 1,024-node cluster generate per second?

A single Snowflake node produces 4,096 IDs per millisecond, or roughly 4.1 million IDs per second. A cluster of 1,024 nodes scales that to approximately 4.2 billion IDs per second. The 41-bit timestamp field overflows in about 69.7 years from the chosen epoch.

▸When should I use ULID or UUIDv7 instead of Snowflake?

Use ULID or UUIDv7 when you cannot control machine ID assignment — for example in serverless or edge environments where no coordination service like etcd is available. Both are 128-bit and time-sortable, so they are B-tree-friendly, but they cost more storage and wire overhead than 64-bit Snowflake IDs.

▸What is Instagram's shard-embedding trick and what does it cost?

Instagram restructures the 64-bit layout to 41 bits timestamp, 13 bits shard ID, and 10 bits sequence. Compared to standard Snowflake (1 sign + 41 ts + 10 machine + 12 seq = 64), the 3 extra shard bits come from absorbing the unused sign bit (+1 bit) and shrinking the sequence field from 12 to 10 bits (+2 bits). The seq reduction also cuts per-shard throughput from 4,096 IDs/ms to 1,024 IDs/ms. The benefit is that any service can derive which database shard owns a row directly from the ID itself, eliminating a routing-table lookup. The cost is that if you ever need to reshard, the shard ID embedded in existing IDs becomes stale.

← previous

Design a Distributed Key-Value Store (Dynamo)

Design Yelp / Nearby Search (proximity service)

// RELATED