~/articles/design-leaderboard

◆◆Intermediateasked at Riot Gamesasked at Activisionasked at Meta

Design a Real-Time Leaderboard (gaming)

Rank millions of players by score and answer "top N" and "my rank" instantly. Redis sorted sets, sharding by score range, and approximate ranks at scale.

22 min read2026-04-09Ironclad Academy

#interview #ranking #caching #realtime

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Riot Games' League of Legends tracks tens of millions of players across regions. At the end of a ranked match, your score updates and the global standings shift. Within seconds, you expect to see your new rank on the profile screen — and so does every other player finishing a match at the same time. Behind that instant feedback sits a system handling 50,000 score writes per second at steady state, spiking to 200,000 at tournament close, while simultaneously answering "show me the top 100" 5,000 times per second and "what's my rank?" 20,000 times per second.

A leaderboard has three core queries: write a new score, read the top-N players in order, and read one player's current rank. The first two are straightforward — a sorted index handles both. The third is where leaderboards get hard. Answering "what is my rank?" naively means counting every player with a higher score than yours. At 10 million players, a mid-ranked player triggers millions of index reads on every request. At 20,000 such requests per second, that arithmetic becomes untenable fast.

The central engineering tension is read/write asymmetry on a ranking index. Score writes are append-like and high-frequency. Rank reads need O(log N) performance — not the O(rank) cost a B-tree index delivers. That gap is why this problem is on every senior engineering interview circuit: the standard database tooling solves the easy queries and fails on the hard one.

Redis sorted sets break that impasse. A skip list with span counters can return any member's rank in O(log N), the same complexity as a lookup. The rest of the design — time-windowed boards, durability, sharding, approximate rank at half a billion players — flows from that one structural choice.

Functional requirements

POST /score — update a player's score (increment by delta, or set absolutely). Returns the new score.
GET /leaderboard/top?n=N&board=global|weekly|daily — paginated top-N list with rank, player ID, and score.
GET /player/{id}/rank?board=... — the calling player's rank on the specified board.
Multiple simultaneous boards: all-time global, weekly, daily (auto-expire).
Deterministic tie-breaking — two players on the same score must have a stable, defined relative rank.

Non-functional requirements

Score-update latency: < 10 ms p99 end-to-end.
Rank-read latency: < 20 ms p99.
Availability: 99.9%+. Brief rank staleness is acceptable; losing scores is not.
Durability: no score loss on server crash — scores must survive a Redis failure.
Scale: 10 M active players, 50 k updates/s average, 200 k/s peak.

Capacity estimation

Dimension	Estimate	How we got there
Registered players	50 M total; 10 M simultaneously active	Given
Score updates (avg)	50 k/s	Given
Score updates (peak)	200 k/s	Tournament close / viral event
Top-100 reads	5 k/s	~150 k clients viewing the lobby, polling every 30 s
Rank reads	20 k/s	~600 k clients on a profile view, refreshing every 30 s
Redis ZSET bytes/member	~150 B	member string (10 B) + score float64 (8 B) + skip-list node (~56 B) + hashtable entry (~64 B) + robj/SDS overhead (~16 B); varies with Redis version and allocator alignment
Redis memory — 10 M members	~1.5 GB	`10 M × 150 B` — fits a single Redis node easily
Redis memory — 50 M members	~7.5 GB	`50 M × 150 B` — fits a 16 GB Redis node
Write throughput ceiling	~100 k–200 k ZADD/s on one core	Redis single-threaded; 200 k/s is near the limit — sharding or write coalescing needed at sustained peak
Read throughput	25 k reads/s total (top-N + rank)	Well within one Redis node; CDN absorbs most top-N reads so raw Redis QPS is much lower
Postgres storage	~5 GB	`50 M rows × ~100 B/row` — trivial; one primary + one replica

Takeaway: One Redis primary handles the read load for 50 M players comfortably. At ~150 bytes/member, 50 M members use ~7.5 GB — budget a 16 GB instance. Sharding becomes necessary when write rates persistently saturate a single core.

Why SQL rank queries break at scale

Before introducing Redis, it helps to understand exactly where SQL fails — both to motivate the design and to answer it in an interview.

Option A — top-N query:

SELECT player_id, score FROM player_scores
ORDER BY score DESC
LIMIT 100;

With a B-tree index on (score DESC), the database returns 100 rows by walking the index forward. This is O(100) — fast, unproblematic.

Option B — rank query:

SELECT COUNT(*) + 1
FROM player_scores
WHERE score > (SELECT score FROM players WHERE player_id = ?);

The database range-scans the index from the player's score to the top. For a player ranked 500,000 out of 10 M, this reads ~500,000 index entries per query. At 20 k rank queries per second, that is 10 billion index reads per second — not viable.

The root cause: a B-tree index gives O(log N) point lookups and O(k) range scans, but it cannot answer "how many elements rank above this one" in O(log N). That requires the index to maintain subtree element counts — exactly what a skip list does.

A Redis sorted set is a skip list + hash map. The skip list's nodes maintain span counts (how many nodes each forward pointer skips). To compute a rank, the skip list traverses from the head, accumulating span counts at each level until it reaches the target node — O(log N). ZREVRANK is simply (total_members − 1) − ZRANK, both derived from the same O(log N) span traversal. That is the structural reason Redis sorted sets win here.

Building up to the design

V1: Postgres `ORDER BY score`

CREATE TABLE player_scores (
    player_id BIGINT PRIMARY KEY,
    score     BIGINT NOT NULL DEFAULT 0,
    updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_score_desc ON player_scores (score DESC);

Top-N is fine — the index walk returns 100 rows in milliseconds. Rank queries are the problem: every COUNT(*) WHERE score > X is an O(rank) index scan. A mid-ranked player in a game with 10 M players triggers half a million index reads per query. And at 200 k UPDATE statements per second, a single Postgres table builds up lock contention and WAL pressure fast.

V2: Redis ZSET as the ranking index

Keep Postgres as the source of truth. On every score update, also call ZADD leaderboard <score> <player_id> on Redis. Serve all rank and top-N queries from Redis.

ZADD is O(log N), ZREVRANK is O(log N), ZREVRANGE is O(log N + M) where M is the number of elements returned. All sub-millisecond. But Redis is in-memory — a crash or failover empties the ZSET and you've lost every rank. You need a rebuild path.

V3: Write-behind + rebuild-on-restart

Write to Postgres first (synchronously), then update Redis (synchronously, after the DB write). If Redis crashes, rebuild from Postgres:

Stream SELECT player_id, score FROM player_scores
  → batch ZADD via Redis pipeline (batches of 10 k) at ~100–300 k rows/s
  → 50 M rows rebuilt in roughly 3–8 minutes (hardware- and network-dependent)

During the rebuild, serve stale reads from a Redis replica or fall back to SQL rank queries. Once the rebuild finishes, Redis takes over again. This write ordering is what makes the system safe: Redis can always be reconstructed from Postgres, never the other way around.

V4: Time-windowed boards with namespaced ZSETs

The global all-time board is straightforward. Weekly and daily boards are trickier — you can't just reset scores at midnight or you create a thundering herd of players suddenly dropping to rank zero. The cleaner approach is to name each board by its time bucket and let it expire naturally:

lb:global          ← all-time, never expires
lb:weekly:202615   ← ISO week 15 of 2026, TTL = 14 days
lb:daily:20260409  ← day bucket, TTL = 48 hours

Score updates write to all active boards in a single pipeline:

PIPELINE
  ZADD   lb:global          <new_total_score> <player_id>
  ZINCRBY lb:weekly:202615  <delta>            <player_id>
  ZINCRBY lb:daily:20260409 <delta>            <player_id>
EXEC

Weekly and daily boards track window-relative scores via ZINCRBY. When Monday arrives, the Score Service starts writing to lb:weekly:202616 instead of lb:weekly:202615. The old key expires on its own schedule — no cron job, no reset storm.

V5: Sharding when one ZSET is too big

At ~150 bytes/member, 100 M members consume roughly 15 GB — manageable on a 32 GB node, but Redis command execution is single-threaded. At 200 k ZADD/s sustained write rates, the single core saturates before memory becomes the bottleneck. That is the signal to shard.

Score-range sharding divides the score space into bands:

shard 0: score in [0,       999]    ← casual majority, large population
shard 1: score in [1000,   4999]    ← intermediate
shard 2: score in [5000,  19999]    ← competitive
shard 3: score in [20000,     ∞]    ← elite, small population

Bands are intentionally unequal — match the score distribution so populations are roughly balanced. Computing global rank across shards is still exact:

global_rank(player in shard 2, score = 12,000)
  = ZCARD(shard 3)                  ← everyone in the elite band ranks above
  + ZREVRANK(shard 2, player_id)    ← position within this shard
  + 1                               ← convert to 1-indexed

Two Redis calls, both O(log N), to two different nodes — parallelizable.

V6: Approximate rank via score histogram (500 M+ players)

At 500 M players with high write rates, maintaining exact cross-shard rank in real time has diminishing value: a player ranked 3,482,771 vs 3,482,801 sees no meaningful difference. Most games show "top X%" instead.

Maintain N histogram buckets covering the score range:

bucket[i] = count of players with score in [lo_i, hi_i)

Approximate rank of a player with score S:

rank ≈ SUM(bucket[k] for all k where lo_k > S) + 1

Error is bounded by the width of one bucket. Size buckets so each covers ~10 k players: rank is accurate to ±10 k, which is imperceptible to most users.

flowchart LR
    V1["V1: SQL ORDER BY<br/>OK for top-N, O(rank) rank"] --> V2["V2: Redis ZSET<br/>O(log N) rank, ephemeral"]
    V2 --> V3["V3: + Postgres source of truth<br/>rebuild on restart"]
    V3 --> V4["V4: + time-windowed ZSETs<br/>daily/weekly with TTL"]
    V4 --> V5["V5: + score-range sharding<br/>exact global rank at scale"]
    V5 --> V6["V6: + histogram approximation<br/>percentile rank at 500M+"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    GS[Game Server] -->|score event| SS[Score Service]
    SS -->|"1. write player_scores"| PG[(Postgres<br/>source of truth)]
    SS -->|"2. ZADD pipeline"| RD[(Redis<br/>lb:global lb:weekly lb:daily)]

    RD -.rebuild on cold start.-> PG

    CL[Game Client] --> CDN[CDN edge<br/>top-100 cached 5s]
    CDN -->|miss| LBAPI[Leaderboard API]
    LBAPI -->|ZREVRANGE top-N| RD
    LBAPI -->|ZREVRANK player rank| RD

    HIST[Histogram updater<br/>background job] -->|read ZCOUNT per bucket| RD
    HIST --> HMAP[(Histogram buckets<br/>Redis hash)]

    style RD fill:#15803d,color:#fff
    style PG fill:#0e7490,color:#fff
    style SS fill:#ff6b1a,color:#0a0a0f
    style CDN fill:#ffaa00,color:#0a0a0f
    style LBAPI fill:#ff2e88,color:#fff

Notice the two separate read paths: the top-N query flows through the CDN so a 5-second cache absorbs the burst of lobby polls, while the personal rank query goes straight to Redis because it's different for every player and can't be shared. The Score Service always writes Postgres before Redis — if that ordering is maintained, Redis is just a view over durable data that can be rebuilt.

Redis ZSET commands reference

Command	Complexity	Purpose
`ZADD key score member`	O(log N)	Insert or update a member's score
`ZINCRBY key delta member`	O(log N)	Atomically add delta to existing score
`ZREVRANGE key 0 M-1 WITHSCORES`	O(log N + M)	Top-M members, highest score first (N = total set size, M = returned count)
`ZREVRANK key member`	O(log N)	0-indexed rank from the top
`ZSCORE key member`	O(1)	Member's current score
`ZCARD key`	O(1)	Total member count
`ZCOUNT key min max`	O(log N)	Count members in a score range (for histogram)
`ZREM key member`	O(log N)	Remove member (needed when moving between shards)

ZREVRANK returns 0 for the top player. Add 1 for the human-readable 1-indexed rank. Returns nil if the player is not yet on the leaderboard.

Score update flow

sequenceDiagram
    participant GS as Game Server
    participant SS as Score Service
    participant PG as Postgres
    participant RD as Redis
    GS->>SS: score_event(player_id=42, delta=+150)
    SS->>PG: UPDATE player_scores SET score=score+150 WHERE id=42 RETURNING score
    PG-->>SS: new_score = 7350
    SS->>RD: PIPELINE [ZADD lb:global 7350 "42", ZINCRBY lb:weekly:202615 150 "42", ZINCRBY lb:daily:20260409 150 "42"] EXEC
    RD-->>SS: OK
    SS-->>GS: {new_score: 7350, rank: 1482}

The Postgres write commits first. If Redis fails, the score is safe in Postgres and re-applied on rebuild. If Postgres fails, we return an error — no score is updated anywhere. This write ordering means Redis can always be reconstructed from Postgres, never the other way around.

Read flow — top-N

sequenceDiagram
    participant CL as Game Client
    participant CDN as CDN Edge
    participant API as Leaderboard API
    participant RD as Redis
    CL->>CDN: GET /leaderboard/top?n=100&board=global
    alt cache hit (< 5s old)
        CDN-->>CL: 200 OK (cached response)
    else cache miss
        CDN->>API: forward request
        API->>RD: ZREVRANGE lb:global 0 99 WITHSCORES
        RD-->>API: 100 (player_id, score) pairs
        API-->>CDN: 200 OK  Cache-Control: max-age=5
        CDN-->>CL: 200 OK
    end

The top-100 list is identical for every client. A 5-second CDN TTL absorbs the 5 k/s read load — each CDN PoP makes at most one Redis call per 5 seconds rather than 5,000.

Read flow — player rank

Personal rank cannot be CDN-cached — every player's rank is different. Serve from Redis directly:

rank  = ZREVRANK(lb:global, player_id) + 1   ← 1-indexed
score = ZSCORE(lb:global, player_id)

Both are O(log N) and return in under 1 ms on a warm Redis. At 20 k rank reads/s, one Redis primary handles this load with significant headroom (Redis typically sustains 100 k–200 k simple commands/s on a single core; actual throughput is hardware- and workload-dependent).

Tie-breaking

When two players share a score, ZREVRANK resolves ties lexicographically by member string — deterministic, but lexicographic ordering of numeric IDs is not always fair (low string IDs win).

A cleaner approach is to encode a composite value into the ZSET score itself. IEEE 754 float64 can represent integers exactly up to 2^53 ≈ 9.007 × 10^15. A composite encoding must stay below this ceiling. If real scores are integers in [0, 10^6], you can encode:

zset_score = real_score × 10^9 + (10^9 − seconds_since_epoch_of_first_achievement % 10^9)

This reserves 9 decimal digits for the score and 9 digits for the tiebreaker. Max composite ≈ 10^6 × 10^9 = 10^15 < 9.007 × 10^15 — safely within float64 range.

The example in many articles uses × 10^10, which limits real scores to ~900,000 before the ceiling is breached. In practice, check your game's maximum achievable score before committing to any multiplier. For games with scores in the tens of millions, composite float encoding breaks; prefer a lexicographic tiebreaker string or resolve ties at display time using a SQL query.

An alternative: store a tiebreaker column in Postgres, resolve ties only at display time using SQL, not in the ZSET. Simpler but adds one DB call to the rank read.

Persistence and rebuild

Redis provides two native persistence mechanisms; neither replaces Postgres as the source of truth.

Mechanism	Durability	Tradeoff
RDB (snapshot)	Loses up to N minutes	Fast restart; simple setup
AOF (append-only file, `everysec`)	Loses up to ~1 second	Slower restart for large datasets
Rebuild from Postgres	Authoritative; no data loss	~3–8 min for 50 M rows (pipelined); serves stale during rebuild

Enable AOF with appendfsync everysec and keep Postgres as the fallback. A crash loses at most 1 second of rank data. If the AOF is corrupted or the node is replaced cold, trigger a rebuild.

Rebuild logic (sketch):

def rebuild(pg_conn, redis_client, board_key):
    pipe = redis_client.pipeline(transaction=False)
    rows = pg_conn.execute(
        "SELECT player_id, score FROM player_scores"
    )
    for i, (player_id, score) in enumerate(rows):
        pipe.zadd(board_key, {str(player_id): score})
        if i % 10_000 == 0:
            pipe.execute()
            pipe = redis_client.pipeline(transaction=False)
    pipe.execute()

Block rank queries (or serve SQL fallback) while the rebuild runs and the ZSET is incomplete.

Here is what that failure-and-recovery arc looks like end to end:

sequenceDiagram
    participant API as Leaderboard API
    participant RD as Redis Primary
    participant PG as Postgres
    note over RD: Redis crashes
    API->>RD: ZREVRANK lb:global player_42
    RD-->>API: connection error
    API->>PG: SELECT COUNT(*)+1 FROM player_scores WHERE score > 7350
    PG-->>API: rank 1501 (SQL fallback, slower)
    note over RD: new primary elected (replica promoted)
    RD-->>API: ZSET is empty — rebuild triggered
    API->>PG: SELECT player_id, score FROM player_scores (stream)
    PG-->>API: 50M rows in batches of 10k
    API->>RD: PIPELINE ZADD lb:global ... EXEC (repeat)
    note over RD: rebuild complete (~3-8 min)
    API->>RD: ZREVRANK lb:global player_42
    RD-->>API: rank 1482 (fast path restored)

Sharding — cross-shard top-N and global rank

With score-range shards, the top-N query becomes a k-way merge across shards run in parallel, then sorted in the API layer. Because bands are ordered by score range, the elite shard almost always dominates the global top-100; only at shard boundaries do multiple shards contribute candidates.

flowchart LR
    API[Leaderboard API] -->|"ZREVRANGE 0 99"| S0[(Shard 0<br/>casual)]
    API -->|"ZREVRANGE 0 99"| S1[(Shard 1<br/>intermediate)]
    API -->|"ZREVRANGE 0 99"| S2[(Shard 2<br/>competitive)]
    API -->|"ZREVRANGE 0 99"| S3[(Shard 3<br/>elite)]
    S0 --> MERGE[API merges<br/>top-100 in memory]
    S1 --> MERGE
    S2 --> MERGE
    S3 --> MERGE
    style S3 fill:#15803d,color:#fff
    style S2 fill:#0e7490,color:#fff
    style MERGE fill:#ff6b1a,color:#0a0a0f

For global rank when a player changes score bands, remove from the old shard and add to the new one atomically — use a Lua script on the shards, or a two-phase write with idempotency keys.

Time-windowed leaderboards

Board	Redis key pattern	TTL	Score semantics
All-time global	`lb:global`	None	Absolute total score
Weekly	`lb:weekly:GGGGWW`	14 days	Window delta via `ZINCRBY`
Daily	`lb:daily:YYYYMMDD`	48 hours	Window delta via `ZINCRBY`
Tournament	`lb:tourney:T123`	Operator-set	Absolute or delta

When Monday arrives, the Score Service writes to lb:weekly:202616 instead of lb:weekly:202615. The old key expires via its TTL — no batch reset job, no thundering herd of players dropping to rank 0.

Handling hot updates — viral event / tournament end

When a tournament ends, score updates can spike to 10–20× normal rate. The mitigations work at different layers:

Problem	Mitigation
Redis CPU saturation at 200 k ZADD/s	Coalesce: buffer 500 ms of updates per player; write one ZADD with the final score
Write amplification (3 boards per update)	Pipelining: all boards in one round trip
Postgres write contention at peak	Write-behind: accept the Redis write first, flush to Postgres async (risk: 1 ZADD worth of score loss on crash)
Top-N read spike (everyone checks ranks)	CDN absorbs it; Redis primary sees only cache misses

Write coalescing is the most impactful lever. A typical match produces many incremental score events, but only the final total matters for the leaderboard. Batch and write once at match completion.

Storage choices

Data	Store	Why
Score index (current)	Redis ZSET	O(log N) rank + update; sub-ms reads
Source of truth	Postgres	Durability, rebuild, aggregations
Score history / audit log	Append-only Postgres table or Cassandra	Time-series, high write volume
Histogram buckets	Redis hash (N fields)	Lightweight; updated by background job every 30 s
Time-windowed boards	Redis ZSET with TTL	Auto-expire; separate from global board
Player profiles	Postgres	Relational, joins

Failure modes

Redis node failure

With Redis Sentinel or Redis Cluster, automatic failover to a replica typically takes 10–60 seconds, depending heavily on down-after-milliseconds (Sentinel) or cluster-node-timeout (Cluster) configuration. Default cluster-node-timeout is 15 s; Sentinel examples in the official docs range from 5–60 s (down-after-milliseconds of 5000, 10000, or 60000 are all shown). During failover:

Route reads to the replica in read-only mode (acceptable for near-real-time rank).
Queue or drop score updates (brief score lag, recoverable from Postgres on reconciliation).

If the new primary's ZSET is empty (replica was behind or blank), trigger a rebuild from Postgres. Block rank queries or serve estimated ranks from the score histogram while rebuilding.

Cross-shard player migration

A player crossing a score-range boundary must be atomically moved: ZREM from the old shard and ZADD to the new shard. Without atomicity, the player appears in both shards and gets a double-counted global rank.

Fix: execute both operations in a Lua script evaluated on a coordinator, or use an idempotency-key two-phase write. Score-band crossings are infrequent — mostly during a new player's initial climb — so even serialized coordination is acceptable.

Hot key on the global ZSET

All reads for lb:global hit one Redis key. At very high read QPS, CDN caching (for top-N) and Redis read replicas (for rank queries) distribute the load. Writes must still go to the primary.

Score injection / cheating

Game servers directly calling POST /score are a trust boundary. The Score Service should:

Validate score deltas against signed game-event records (server signs the match result; score service verifies).
Reject deltas above a per-game-mode ceiling (e.g., > 99th percentile).
Emit anomalies to a fraud queue for async review.

Exact vs approximate rank — the trade-off

Approach	Rank accuracy	Complexity	Scales to
Single ZSET	Exact	Minimal	~100 M members
Score-range shards	Exact	Shard routing, cross-shard merge	Any size
Score histogram	±bucket size	Background updater	Any size
Percentile tier (top 1%, 5%…)	Coarse	Only counts needed	Any size

A pragmatic production design: exact rank from a single ZSET for the top 10 k players (visible on the leaderboard), approximate percentile from the histogram for everyone else. Most players care more about their tier than their exact number.

Schema (Postgres)

CREATE TABLE player_scores (
    player_id   BIGINT      PRIMARY KEY,
    score       BIGINT      NOT NULL DEFAULT 0,
    updated_at  TIMESTAMPTZ DEFAULT now()
);

-- Window-scoped scores for weekly/daily boards
CREATE TABLE player_score_windows (
    player_id   BIGINT      NOT NULL,
    window_key  TEXT        NOT NULL,  -- e.g. 'weekly:202615'
    score       BIGINT      NOT NULL DEFAULT 0,
    updated_at  TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (player_id, window_key)
);

CREATE INDEX idx_player_scores_score ON player_scores (score DESC);
CREATE INDEX idx_score_windows ON player_score_windows (window_key, score DESC);

The score DESC index supports both the rebuild query and SQL rank fallback during Redis downtime.

API design

POST /api/v1/score
Authorization: Bearer <game-server-token>
Content-Type: application/json

{ "player_id": 42, "delta": 150, "board": "global", "match_id": "m_789" }

→ 200 OK
{ "player_id": 42, "new_score": 7350, "rank": 1482 }

GET /api/v1/leaderboard/top?n=100&board=global
Authorization: Bearer <client-token>

→ 200 OK
Cache-Control: max-age=5
{ "board": "global", "as_of": "2026-04-09T14:22:01Z",
  "entries": [
    { "rank": 1, "player_id": 9001, "score": 98420 },
    { "rank": 2, "player_id": 7734, "score": 97810 }
  ]
}

GET /api/v1/player/42/rank?board=weekly
Authorization: Bearer <client-token>

→ 200 OK
{ "player_id": 42, "board": "weekly", "rank": 382,
  "score": 3200, "percentile": 96.2 }

Things to discuss in an interview

Why Redis ZSET and not a sorted Postgres index? B-tree indexes cannot return a rank in O(log N) — they lack subtree element counts. The skip list's span counters are exactly what ZREVRANK uses.
How do time-windowed boards work without a cron reset? Namespaced keys with TTL. New week = new key. Old key expires naturally. No batch resets, no thundering herd.
How do you handle tie-breaking deterministically? Composite float score encoding, or secondary application-layer sort using a Postgres tiebreaker column.
What is the rebuild path if Redis crashes? Stream from Postgres, batch-ZADD via pipeline, block or degrade rank queries during rebuild.
When does exact rank become infeasible, and what do you serve instead? At very high member counts with high write rates, fall back to score-histogram percentile approximation. Exact rank still makes sense for the top few thousand visible players.
How do you prevent the global ZSET from becoming a hot key? CDN caching for top-N (5 s TTL absorbs millions of reads), Redis read replicas for rank queries.

Things you should now be able to answer

Why is SELECT COUNT(*) WHERE score > X too slow for real-time rank queries at scale?
What Redis commands give you top-N, a player's rank, and an atomic score update — and what are their time complexities?
How do you rebuild a Redis ZSET after a crash without losing scores?
How does score-range sharding let you compute exact global rank with only two Redis calls?
Why is a time-windowed leaderboard implemented with a TTL key better than a cron reset?
At what scale does exact rank become unnecessary, and what do you serve instead?

Frequently asked questions

▸Why can't a B-tree index answer 'what is my rank?' in O(log N)?

A B-tree can do O(log N) point lookups and O(k) range scans, but it cannot count how many elements rank above a given value in O(log N) because it lacks subtree element counts. Answering that query requires scanning the index from the player's score to the top, which is O(rank) — for a player ranked 500,000 out of 10 million, that is 500,000 index reads per query, and at 20,000 such requests per second the math becomes untenable.

▸What are the time complexities of ZADD, ZREVRANK, and ZREVRANGE on a Redis sorted set?

ZADD is O(log N), ZREVRANK is O(log N), and ZREVRANGE is O(log N + M) where N is the total number of members in the set and M is the number of elements returned. All three are sub-millisecond because the sorted set's skip list maintains span counters that let it traverse to any position while accumulating rank without scanning every node.

▸How do time-windowed leaderboards (weekly, daily) work without a cron reset job?

Each time window gets its own namespaced Redis key — for example, lb:weekly:202615 for ISO week 15 of 2026 — with a TTL set at creation (14 days for weekly, 48 hours for daily). When Monday arrives, the Score Service starts writing to lb:weekly:202616 instead. The old key expires on its own schedule, so there is no batch reset, no thundering herd of players dropping to rank zero.

▸How does score-range sharding compute exact global rank with only two Redis calls?

The score space is divided into ordered bands, and each band lives on a separate shard. To rank a player in shard 2, you call ZCARD on every higher shard to count all players who definitively rank above, then call ZREVRANK within the player's own shard for their position inside the band. Those two O(log N) calls to different nodes can run in parallel, giving exact rank without cross-shard joins.

▸At what scale does exact rank become impractical, and what does the article recommend instead?

Past roughly 500 million players with high write rates, maintaining exact cross-shard rank in real time has diminishing value because a player ranked 3,482,771 versus 3,482,801 sees no meaningful difference. The article recommends a score-histogram approximation: bucket the score range so each bucket covers roughly 10,000 players, sum the counts of all buckets above a player's score, and deliver percentile-accurate rank accurate to plus or minus one bucket width.

← previous

Design a Flash Sale / Seckill System

Design a Content Delivery Network (CDN)

// RELATED