Design a Real-Time Leaderboard (gaming)
Rank millions of players by score and answer "top N" and "my rank" instantly. Redis sorted sets, sharding by score range, and approximate ranks at scale.
The problem
Riot Games' League of Legends tracks tens of millions of players across regions. At the end of a ranked match, your score updates and the global standings shift. Within seconds, you expect to see your new rank on the profile screen — and so does every other player finishing a match at the same time. Behind that instant feedback sits a system handling 50,000 score writes per second at steady state, spiking to 200,000 at tournament close, while simultaneously answering "show me the top 100" 5,000 times per second and "what's my rank?" 20,000 times per second.
A leaderboard has three core queries: write a new score, read the top-N players in order, and read one player's current rank. The first two are straightforward — a sorted index handles both. The third is where leaderboards get hard. Answering "what is my rank?" naively means counting every player with a higher score than yours. At 10 million players, a mid-ranked player triggers millions of index reads on every request. At 20,000 such requests per second, that arithmetic becomes untenable fast.
The central engineering tension is read/write asymmetry on a ranking index. Score writes are append-like and high-frequency. Rank reads need O(log N) performance — not the O(rank) cost a B-tree index delivers. That gap is why this problem is on every senior engineering interview circuit: the standard database tooling solves the easy queries and fails on the hard one.
Redis sorted sets break that impasse. A skip list with span counters can return any member's rank in O(log N), the same complexity as a lookup. The rest of the design — time-windowed boards, durability, sharding, approximate rank at half a billion players — flows from that one structural choice.
Functional requirements
POST /score— update a player's score (increment by delta, or set absolutely). Returns the new score.GET /leaderboard/top?n=N&board=global|weekly|daily— paginated top-N list with rank, player ID, and score.GET /player/{id}/rank?board=...— the calling player's rank on the specified board.- Multiple simultaneous boards: all-time global, weekly, daily (auto-expire).
- Deterministic tie-breaking — two players on the same score must have a stable, defined relative rank.
Non-functional requirements
- Score-update latency: < 10 ms p99 end-to-end.
- Rank-read latency: < 20 ms p99.
- Availability: 99.9%+. Brief rank staleness is acceptable; losing scores is not.
- Durability: no score loss on server crash — scores must survive a Redis failure.
- Scale: 10 M active players, 50 k updates/s average, 200 k/s peak.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Registered players | 50 M total; 10 M simultaneously active | Given |
| Score updates (avg) | 50 k/s | Given |
| Score updates (peak) | 200 k/s | Tournament close / viral event |
| Top-100 reads | 5 k/s | ~150 k clients viewing the lobby, polling every 30 s |
| Rank reads | 20 k/s | ~600 k clients on a profile view, refreshing every 30 s |
| Redis ZSET bytes/member | ~150 B | member string (10 B) + score float64 (8 B) + skip-list node (~56 B) + hashtable entry (~64 B) + robj/SDS overhead (~16 B); varies with Redis version and allocator alignment |
| Redis memory — 10 M members | ~1.5 GB | 10 M × 150 B — fits a single Redis node easily |
| Redis memory — 50 M members | ~7.5 GB | 50 M × 150 B — fits a 16 GB Redis node |
| Write throughput ceiling | ~100 k–200 k ZADD/s on one core | Redis single-threaded; 200 k/s is near the limit — sharding or write coalescing needed at sustained peak |
| Read throughput | 25 k reads/s total (top-N + rank) | Well within one Redis node; CDN absorbs most top-N reads so raw Redis QPS is much lower |
| Postgres storage | ~5 GB | 50 M rows × ~100 B/row — trivial; one primary + one replica |
Takeaway: One Redis primary handles the read load for 50 M players comfortably. At ~150 bytes/member, 50 M members use ~7.5 GB — budget a 16 GB instance. Sharding becomes necessary when write rates persistently saturate a single core.
Why SQL rank queries break at scale
Before introducing Redis, it helps to understand exactly where SQL fails — both to motivate the design and to answer it in an interview.
Option A — top-N query:
SELECT player_id, score FROM player_scores
ORDER BY score DESC
LIMIT 100;
With a B-tree index on (score DESC), the database returns 100 rows by walking the index forward. This is O(100) — fast, unproblematic.
Option B — rank query:
SELECT COUNT(*) + 1
FROM player_scores
WHERE score > (SELECT score FROM players WHERE player_id = ?);
The database range-scans the index from the player's score to the top. For a player ranked 500,000 out of 10 M, this reads ~500,000 index entries per query. At 20 k rank queries per second, that is 10 billion index reads per second — not viable.
The root cause: a B-tree index gives O(log N) point lookups and O(k) range scans, but it cannot answer "how many elements rank above this one" in O(log N). That requires the index to maintain subtree element counts — exactly what a skip list does.
A Redis sorted set is a skip list + hash map. The skip list's nodes maintain span counts (how many nodes each forward pointer skips). To compute a rank, the skip list traverses from the head, accumulating span counts at each level until it reaches the target node — O(log N). ZREVRANK is simply (total_members − 1) − ZRANK, both derived from the same O(log N) span traversal. That is the structural reason Redis sorted sets win here.
Building up to the design
V1: Postgres ORDER BY score
CREATE TABLE player_scores (
player_id BIGINT PRIMARY KEY,
score BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_score_desc ON player_scores (score DESC);
Top-N is fine — the index walk returns 100 rows in milliseconds. Rank queries are the problem: every COUNT(*) WHERE score > X is an O(rank) index scan. A mid-ranked player in a game with 10 M players triggers half a million index reads per query. And at 200 k UPDATE statements per second, a single Postgres table builds up lock contention and WAL pressure fast.
V2: Redis ZSET as the ranking index
Keep Postgres as the source of truth. On every score update, also call ZADD leaderboard <score> <player_id> on Redis. Serve all rank and top-N queries from Redis.
ZADD is O(log N), ZREVRANK is O(log N), ZREVRANGE is O(log N + M) where M is the number of elements returned. All sub-millisecond. But Redis is in-memory — a crash or failover empties the ZSET and you've lost every rank. You need a rebuild path.
V3: Write-behind + rebuild-on-restart
Write to Postgres first (synchronously), then update Redis (synchronously, after the DB write). If Redis crashes, rebuild from Postgres:
Stream SELECT player_id, score FROM player_scores
→ batch ZADD via Redis pipeline (batches of 10 k) at ~100–300 k rows/s
→ 50 M rows rebuilt in roughly 3–8 minutes (hardware- and network-dependent)
During the rebuild, serve stale reads from a Redis replica or fall back to SQL rank queries. Once the rebuild finishes, Redis takes over again. This write ordering is what makes the system safe: Redis can always be reconstructed from Postgres, never the other way around.
V4: Time-windowed boards with namespaced ZSETs
The global all-time board is straightforward. Weekly and daily boards are trickier — you can't just reset scores at midnight or you create a thundering herd of players suddenly dropping to rank zero. The cleaner approach is to name each board by its time bucket and let it expire naturally:
lb:global ← all-time, never expires
lb:weekly:202615 ← ISO week 15 of 2026, TTL = 14 days
lb:daily:20260409 ← day bucket, TTL = 48 hours
Score updates write to all active boards in a single pipeline:
PIPELINE
ZADD lb:global <new_total_score> <player_id>
ZINCRBY lb:weekly:202615 <delta> <player_id>
ZINCRBY lb:daily:20260409 <delta> <player_id>
EXEC
Weekly and daily boards track window-relative scores via ZINCRBY. When Monday arrives, the Score Service starts writing to lb:weekly:202616 instead of lb:weekly:202615. The old key expires on its own schedule — no cron job, no reset storm.
V5: Sharding when one ZSET is too big
At ~150 bytes/member, 100 M members consume roughly 15 GB — manageable on a 32 GB node, but Redis command execution is single-threaded. At 200 k ZADD/s sustained write rates, the single core saturates before memory becomes the bottleneck. That is the signal to shard.
Score-range sharding divides the score space into bands:
shard 0: score in [0, 999] ← casual majority, large population
shard 1: score in [1000, 4999] ← intermediate
shard 2: score in [5000, 19999] ← competitive
shard 3: score in [20000, ∞] ← elite, small population
Bands are intentionally unequal — match the score distribution so populations are roughly balanced. Computing global rank across shards is still exact:
global_rank(player in shard 2, score = 12,000)
= ZCARD(shard 3) ← everyone in the elite band ranks above
+ ZREVRANK(shard 2, player_id) ← position within this shard
+ 1 ← convert to 1-indexed
Two Redis calls, both O(log N), to two different nodes — parallelizable.
V6: Approximate rank via score histogram (500 M+ players)
At 500 M players with high write rates, maintaining exact cross-shard rank in real time has diminishing value: a player ranked 3,482,771 vs 3,482,801 sees no meaningful difference. Most games show "top X%" instead.
Maintain N histogram buckets covering the score range:
bucket[i] = count of players with score in [lo_i, hi_i)
Approximate rank of a player with score S:
rank ≈ SUM(bucket[k] for all k where lo_k > S) + 1
Error is bounded by the width of one bucket. Size buckets so each covers ~10 k players: rank is accurate to ±10 k, which is imperceptible to most users.
flowchart LR
V1["V1: SQL ORDER BY<br/>OK for top-N, O(rank) rank"] --> V2["V2: Redis ZSET<br/>O(log N) rank, ephemeral"]
V2 --> V3["V3: + Postgres source of truth<br/>rebuild on restart"]
V3 --> V4["V4: + time-windowed ZSETs<br/>daily/weekly with TTL"]
V4 --> V5["V5: + score-range sharding<br/>exact global rank at scale"]
V5 --> V6["V6: + histogram approximation<br/>percentile rank at 500M+"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
GS[Game Server] -->|score event| SS[Score Service]
SS -->|"1. write player_scores"| PG[(Postgres<br/>source of truth)]
SS -->|"2. ZADD pipeline"| RD[(Redis<br/>lb:global lb:weekly lb:daily)]
RD -.rebuild on cold start.-> PG
CL[Game Client] --> CDN[CDN edge<br/>top-100 cached 5s]
CDN -->|miss| LBAPI[Leaderboard API]
LBAPI -->|ZREVRANGE top-N| RD
LBAPI -->|ZREVRANK player rank| RD
HIST[Histogram updater<br/>background job] -->|read ZCOUNT per bucket| RD
HIST --> HMAP[(Histogram buckets<br/>Redis hash)]
style RD fill:#15803d,color:#fff
style PG fill:#0e7490,color:#fff
style SS fill:#ff6b1a,color:#0a0a0f
style CDN fill:#ffaa00,color:#0a0a0f
style LBAPI fill:#ff2e88,color:#fff
Notice the two separate read paths: the top-N query flows through the CDN so a 5-second cache absorbs the burst of lobby polls, while the personal rank query goes straight to Redis because it's different for every player and can't be shared. The Score Service always writes Postgres before Redis — if that ordering is maintained, Redis is just a view over durable data that can be rebuilt.
Redis ZSET commands reference
| Command | Complexity | Purpose |
|---|---|---|
ZADD key score member | O(log N) | Insert or update a member's score |
ZINCRBY key delta member | O(log N) | Atomically add delta to existing score |
ZREVRANGE key 0 M-1 WITHSCORES | O(log N + M) | Top-M members, highest score first (N = total set size, M = returned count) |
ZREVRANK key member | O(log N) | 0-indexed rank from the top |
ZSCORE key member | O(1) | Member's current score |
ZCARD key | O(1) | Total member count |
ZCOUNT key min max | O(log N) | Count members in a score range (for histogram) |
ZREM key member | O(log N) | Remove member (needed when moving between shards) |
ZREVRANK returns 0 for the top player. Add 1 for the human-readable 1-indexed rank. Returns nil if the player is not yet on the leaderboard.
Score update flow
sequenceDiagram
participant GS as Game Server
participant SS as Score Service
participant PG as Postgres
participant RD as Redis
GS->>SS: score_event(player_id=42, delta=+150)
SS->>PG: UPDATE player_scores SET score=score+150 WHERE id=42 RETURNING score
PG-->>SS: new_score = 7350
SS->>RD: PIPELINE [ZADD lb:global 7350 "42", ZINCRBY lb:weekly:202615 150 "42", ZINCRBY lb:daily:20260409 150 "42"] EXEC
RD-->>SS: OK
SS-->>GS: {new_score: 7350, rank: 1482}
The Postgres write commits first. If Redis fails, the score is safe in Postgres and re-applied on rebuild. If Postgres fails, we return an error — no score is updated anywhere. This write ordering means Redis can always be reconstructed from Postgres, never the other way around.
Read flow — top-N
sequenceDiagram
participant CL as Game Client
participant CDN as CDN Edge
participant API as Leaderboard API
participant RD as Redis
CL->>CDN: GET /leaderboard/top?n=100&board=global
alt cache hit (< 5s old)
CDN-->>CL: 200 OK (cached response)
else cache miss
CDN->>API: forward request
API->>RD: ZREVRANGE lb:global 0 99 WITHSCORES
RD-->>API: 100 (player_id, score) pairs
API-->>CDN: 200 OK Cache-Control: max-age=5
CDN-->>CL: 200 OK
end
The top-100 list is identical for every client. A 5-second CDN TTL absorbs the 5 k/s read load — each CDN PoP makes at most one Redis call per 5 seconds rather than 5,000.
Read flow — player rank
Personal rank cannot be CDN-cached — every player's rank is different. Serve from Redis directly:
rank = ZREVRANK(lb:global, player_id) + 1 ← 1-indexed
score = ZSCORE(lb:global, player_id)
Both are O(log N) and return in under 1 ms on a warm Redis. At 20 k rank reads/s, one Redis primary handles this load with significant headroom (Redis typically sustains 100 k–200 k simple commands/s on a single core; actual throughput is hardware- and workload-dependent).
Tie-breaking
When two players share a score, ZREVRANK resolves ties lexicographically by member string — deterministic, but lexicographic ordering of numeric IDs is not always fair (low string IDs win).
A cleaner approach is to encode a composite value into the ZSET score itself. IEEE 754 float64 can represent integers exactly up to 2^53 ≈ 9.007 × 10^15. A composite encoding must stay below this ceiling. If real scores are integers in [0, 10^6], you can encode:
zset_score = real_score × 10^9 + (10^9 − seconds_since_epoch_of_first_achievement % 10^9)
This reserves 9 decimal digits for the score and 9 digits for the tiebreaker. Max composite ≈ 10^6 × 10^9 = 10^15 < 9.007 × 10^15 — safely within float64 range.
The example in many articles uses × 10^10, which limits real scores to ~900,000 before the ceiling is breached. In practice, check your game's maximum achievable score before committing to any multiplier. For games with scores in the tens of millions, composite float encoding breaks; prefer a lexicographic tiebreaker string or resolve ties at display time using a SQL query.
An alternative: store a tiebreaker column in Postgres, resolve ties only at display time using SQL, not in the ZSET. Simpler but adds one DB call to the rank read.
Persistence and rebuild
Redis provides two native persistence mechanisms; neither replaces Postgres as the source of truth.
| Mechanism | Durability | Tradeoff |
|---|---|---|
| RDB (snapshot) | Loses up to N minutes | Fast restart; simple setup |
AOF (append-only file, everysec) | Loses up to ~1 second | Slower restart for large datasets |
| Rebuild from Postgres | Authoritative; no data loss | ~3–8 min for 50 M rows (pipelined); serves stale during rebuild |
Enable AOF with appendfsync everysec and keep Postgres as the fallback. A crash loses at most 1 second of rank data. If the AOF is corrupted or the node is replaced cold, trigger a rebuild.
Rebuild logic (sketch):
def rebuild(pg_conn, redis_client, board_key):
pipe = redis_client.pipeline(transaction=False)
rows = pg_conn.execute(
"SELECT player_id, score FROM player_scores"
)
for i, (player_id, score) in enumerate(rows):
pipe.zadd(board_key, {str(player_id): score})
if i % 10_000 == 0:
pipe.execute()
pipe = redis_client.pipeline(transaction=False)
pipe.execute()
Block rank queries (or serve SQL fallback) while the rebuild runs and the ZSET is incomplete.
Here is what that failure-and-recovery arc looks like end to end:
sequenceDiagram
participant API as Leaderboard API
participant RD as Redis Primary
participant PG as Postgres
note over RD: Redis crashes
API->>RD: ZREVRANK lb:global player_42
RD-->>API: connection error
API->>PG: SELECT COUNT(*)+1 FROM player_scores WHERE score > 7350
PG-->>API: rank 1501 (SQL fallback, slower)
note over RD: new primary elected (replica promoted)
RD-->>API: ZSET is empty — rebuild triggered
API->>PG: SELECT player_id, score FROM player_scores (stream)
PG-->>API: 50M rows in batches of 10k
API->>RD: PIPELINE ZADD lb:global ... EXEC (repeat)
note over RD: rebuild complete (~3-8 min)
API->>RD: ZREVRANK lb:global player_42
RD-->>API: rank 1482 (fast path restored)
Sharding — cross-shard top-N and global rank
With score-range shards, the top-N query becomes a k-way merge across shards run in parallel, then sorted in the API layer. Because bands are ordered by score range, the elite shard almost always dominates the global top-100; only at shard boundaries do multiple shards contribute candidates.
flowchart LR
API[Leaderboard API] -->|"ZREVRANGE 0 99"| S0[(Shard 0<br/>casual)]
API -->|"ZREVRANGE 0 99"| S1[(Shard 1<br/>intermediate)]
API -->|"ZREVRANGE 0 99"| S2[(Shard 2<br/>competitive)]
API -->|"ZREVRANGE 0 99"| S3[(Shard 3<br/>elite)]
S0 --> MERGE[API merges<br/>top-100 in memory]
S1 --> MERGE
S2 --> MERGE
S3 --> MERGE
style S3 fill:#15803d,color:#fff
style S2 fill:#0e7490,color:#fff
style MERGE fill:#ff6b1a,color:#0a0a0f
For global rank when a player changes score bands, remove from the old shard and add to the new one atomically — use a Lua script on the shards, or a two-phase write with idempotency keys.
Time-windowed leaderboards
| Board | Redis key pattern | TTL | Score semantics |
|---|---|---|---|
| All-time global | lb:global | None | Absolute total score |
| Weekly | lb:weekly:GGGGWW | 14 days | Window delta via ZINCRBY |
| Daily | lb:daily:YYYYMMDD | 48 hours | Window delta via ZINCRBY |
| Tournament | lb:tourney:T123 | Operator-set | Absolute or delta |
When Monday arrives, the Score Service writes to lb:weekly:202616 instead of lb:weekly:202615. The old key expires via its TTL — no batch reset job, no thundering herd of players dropping to rank 0.
Handling hot updates — viral event / tournament end
When a tournament ends, score updates can spike to 10–20× normal rate. The mitigations work at different layers:
| Problem | Mitigation |
|---|---|
| Redis CPU saturation at 200 k ZADD/s | Coalesce: buffer 500 ms of updates per player; write one ZADD with the final score |
| Write amplification (3 boards per update) | Pipelining: all boards in one round trip |
| Postgres write contention at peak | Write-behind: accept the Redis write first, flush to Postgres async (risk: 1 ZADD worth of score loss on crash) |
| Top-N read spike (everyone checks ranks) | CDN absorbs it; Redis primary sees only cache misses |
Write coalescing is the most impactful lever. A typical match produces many incremental score events, but only the final total matters for the leaderboard. Batch and write once at match completion.
Storage choices
| Data | Store | Why |
|---|---|---|
| Score index (current) | Redis ZSET | O(log N) rank + update; sub-ms reads |
| Source of truth | Postgres | Durability, rebuild, aggregations |
| Score history / audit log | Append-only Postgres table or Cassandra | Time-series, high write volume |
| Histogram buckets | Redis hash (N fields) | Lightweight; updated by background job every 30 s |
| Time-windowed boards | Redis ZSET with TTL | Auto-expire; separate from global board |
| Player profiles | Postgres | Relational, joins |
Failure modes
Redis node failure
With Redis Sentinel or Redis Cluster, automatic failover to a replica typically takes 10–60 seconds, depending heavily on down-after-milliseconds (Sentinel) or cluster-node-timeout (Cluster) configuration. Default cluster-node-timeout is 15 s; Sentinel examples in the official docs range from 5–60 s (down-after-milliseconds of 5000, 10000, or 60000 are all shown). During failover:
- Route reads to the replica in read-only mode (acceptable for near-real-time rank).
- Queue or drop score updates (brief score lag, recoverable from Postgres on reconciliation).
If the new primary's ZSET is empty (replica was behind or blank), trigger a rebuild from Postgres. Block rank queries or serve estimated ranks from the score histogram while rebuilding.
Cross-shard player migration
A player crossing a score-range boundary must be atomically moved: ZREM from the old shard and ZADD to the new shard. Without atomicity, the player appears in both shards and gets a double-counted global rank.
Fix: execute both operations in a Lua script evaluated on a coordinator, or use an idempotency-key two-phase write. Score-band crossings are infrequent — mostly during a new player's initial climb — so even serialized coordination is acceptable.
Hot key on the global ZSET
All reads for lb:global hit one Redis key. At very high read QPS, CDN caching (for top-N) and Redis read replicas (for rank queries) distribute the load. Writes must still go to the primary.
Score injection / cheating
Game servers directly calling POST /score are a trust boundary. The Score Service should:
- Validate score deltas against signed game-event records (server signs the match result; score service verifies).
- Reject deltas above a per-game-mode ceiling (e.g., > 99th percentile).
- Emit anomalies to a fraud queue for async review.
Exact vs approximate rank — the trade-off
| Approach | Rank accuracy | Complexity | Scales to |
|---|---|---|---|
| Single ZSET | Exact | Minimal | ~100 M members |
| Score-range shards | Exact | Shard routing, cross-shard merge | Any size |
| Score histogram | ±bucket size | Background updater | Any size |
| Percentile tier (top 1%, 5%…) | Coarse | Only counts needed | Any size |
A pragmatic production design: exact rank from a single ZSET for the top 10 k players (visible on the leaderboard), approximate percentile from the histogram for everyone else. Most players care more about their tier than their exact number.
Schema (Postgres)
CREATE TABLE player_scores (
player_id BIGINT PRIMARY KEY,
score BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Window-scoped scores for weekly/daily boards
CREATE TABLE player_score_windows (
player_id BIGINT NOT NULL,
window_key TEXT NOT NULL, -- e.g. 'weekly:202615'
score BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (player_id, window_key)
);
CREATE INDEX idx_player_scores_score ON player_scores (score DESC);
CREATE INDEX idx_score_windows ON player_score_windows (window_key, score DESC);
The score DESC index supports both the rebuild query and SQL rank fallback during Redis downtime.
API design
POST /api/v1/score
Authorization: Bearer <game-server-token>
Content-Type: application/json
{ "player_id": 42, "delta": 150, "board": "global", "match_id": "m_789" }
→ 200 OK
{ "player_id": 42, "new_score": 7350, "rank": 1482 }
GET /api/v1/leaderboard/top?n=100&board=global
Authorization: Bearer <client-token>
→ 200 OK
Cache-Control: max-age=5
{ "board": "global", "as_of": "2026-04-09T14:22:01Z",
"entries": [
{ "rank": 1, "player_id": 9001, "score": 98420 },
{ "rank": 2, "player_id": 7734, "score": 97810 }
]
}
GET /api/v1/player/42/rank?board=weekly
Authorization: Bearer <client-token>
→ 200 OK
{ "player_id": 42, "board": "weekly", "rank": 382,
"score": 3200, "percentile": 96.2 }
Things to discuss in an interview
- Why Redis ZSET and not a sorted Postgres index? B-tree indexes cannot return a rank in O(log N) — they lack subtree element counts. The skip list's span counters are exactly what
ZREVRANKuses. - How do time-windowed boards work without a cron reset? Namespaced keys with TTL. New week = new key. Old key expires naturally. No batch resets, no thundering herd.
- How do you handle tie-breaking deterministically? Composite float score encoding, or secondary application-layer sort using a Postgres tiebreaker column.
- What is the rebuild path if Redis crashes? Stream from Postgres, batch-
ZADDvia pipeline, block or degrade rank queries during rebuild. - When does exact rank become infeasible, and what do you serve instead? At very high member counts with high write rates, fall back to score-histogram percentile approximation. Exact rank still makes sense for the top few thousand visible players.
- How do you prevent the global ZSET from becoming a hot key? CDN caching for top-N (5 s TTL absorbs millions of reads), Redis read replicas for rank queries.
Things you should now be able to answer
- Why is
SELECT COUNT(*) WHERE score > Xtoo slow for real-time rank queries at scale? - What Redis commands give you top-N, a player's rank, and an atomic score update — and what are their time complexities?
- How do you rebuild a Redis ZSET after a crash without losing scores?
- How does score-range sharding let you compute exact global rank with only two Redis calls?
- Why is a time-windowed leaderboard implemented with a TTL key better than a cron reset?
- At what scale does exact rank become unnecessary, and what do you serve instead?
Further reading
- Redis sorted set documentation and skip-list internals — redis.io/docs/data-types/sorted-sets
- Consistent hashing — relevant when sharding the Redis tier itself
- Database sharding — context for sharding Postgres at scale
- Design a distributed cache — cache hierarchy patterns that complement the Redis ZSET design
- "Skip Lists: A Probabilistic Alternative to Balanced Trees" — William Pugh (1990)
Frequently asked questions
▸Why can't a B-tree index answer 'what is my rank?' in O(log N)?
A B-tree can do O(log N) point lookups and O(k) range scans, but it cannot count how many elements rank above a given value in O(log N) because it lacks subtree element counts. Answering that query requires scanning the index from the player's score to the top, which is O(rank) — for a player ranked 500,000 out of 10 million, that is 500,000 index reads per query, and at 20,000 such requests per second the math becomes untenable.
▸What are the time complexities of ZADD, ZREVRANK, and ZREVRANGE on a Redis sorted set?
ZADD is O(log N), ZREVRANK is O(log N), and ZREVRANGE is O(log N + M) where N is the total number of members in the set and M is the number of elements returned. All three are sub-millisecond because the sorted set's skip list maintains span counters that let it traverse to any position while accumulating rank without scanning every node.
▸How do time-windowed leaderboards (weekly, daily) work without a cron reset job?
Each time window gets its own namespaced Redis key — for example, lb:weekly:202615 for ISO week 15 of 2026 — with a TTL set at creation (14 days for weekly, 48 hours for daily). When Monday arrives, the Score Service starts writing to lb:weekly:202616 instead. The old key expires on its own schedule, so there is no batch reset, no thundering herd of players dropping to rank zero.
▸How does score-range sharding compute exact global rank with only two Redis calls?
The score space is divided into ordered bands, and each band lives on a separate shard. To rank a player in shard 2, you call ZCARD on every higher shard to count all players who definitively rank above, then call ZREVRANK within the player's own shard for their position inside the band. Those two O(log N) calls to different nodes can run in parallel, giving exact rank without cross-shard joins.
▸At what scale does exact rank become impractical, and what does the article recommend instead?
Past roughly 500 million players with high write rates, maintaining exact cross-shard rank in real time has diminishing value because a player ranked 3,482,771 versus 3,482,801 sees no meaningful difference. The article recommends a score-histogram approximation: bucket the score range so each bucket covers roughly 10,000 players, sum the counts of all buckets above a player's score, and deliver percentile-accurate rank accurate to plus or minus one bucket width.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.