#distributed-systems
31 articles
Design a Social Graph Service (Facebook's TAO)
Serve billions of "who follows whom" reads over a graph of trillions of edges. The objects-and-associations model, a cache in front of sharded SQL, and the hot-vertex problem.
Design an Authorization System (Google Zanzibar / RBAC / ReBAC)
Answer "can user U do action A on resource R?" globally, in milliseconds, consistently. RBAC vs ABAC vs ReBAC, Zanzibar relation tuples, and the new-enemy problem.
Design a Distributed Job Scheduler (cron at scale)
Run millions of scheduled and recurring jobs reliably — at-least-once execution, leader election, sharded time-wheels, and exactly-once side effects via idempotency.
Design a Globally-Distributed SQL Database (Spanner / CockroachDB)
SQL transactions that are ACID across continents. How Spanner shards into Paxos groups, runs 2PC on top, and uses TrueTime to give you external consistency — the CP counterpart to Dynamo.
Design an Object Storage Service (S3)
Store arbitrary blobs with HTTP GET/PUT at exabyte scale and 11 nines of durability. Metadata vs data separation, erasure coding, and self-healing.
Design a Payment System (Stripe-style)
Move money correctly. Double-entry ledgers, idempotency keys, the authorize/capture/settle lifecycle, reconciliation, and why money never gets eventual consistency.
Design a Distributed Message Queue (Kafka)
Build a durable, partitioned, replicated commit log like Kafka — ordering, consumer groups, replication (ISR), and exactly-once.
Design a Distributed Search Engine (Elasticsearch)
Index billions of documents and answer full-text queries in milliseconds. Inverted indexes, sharding + replication, scatter-gather, and relevance scoring.
Design a Distributed Lock / Coordination Service (ZooKeeper / etcd)
Provide mutual exclusion and coordination across machines safely. Consensus-backed locks, leases, fencing tokens, and why a lock without fencing is unsafe.
Backpressure & Flow Control
What happens when a fast producer overwhelms a slow consumer? Backpressure, bounded buffers, load shedding, and why unbounded queues are a trap.
Quorums, Read-Repair & Anti-Entropy (Dynamo-style)
How leaderless databases like Dynamo and Cassandra stay available and converge. Quorum R+W>N, read-repair, hinted handoff, Merkle anti-entropy, and conflict resolution.
Idempotency & Exactly-Once Semantics
Networks retry, so your operations will run twice. Idempotency keys, dedup, and why "exactly-once delivery" is a myth but "exactly-once effect" is achievable.
Event Sourcing & CQRS
Store every change as an immutable event and rebuild state by replay. Event sourcing, CQRS read models, snapshots, and the trade-offs nobody warns you about.
Design a Distributed File System (GFS / HDFS)
Store petabytes of files across thousands of commodity machines for high-throughput batch reads. The single-master + chunkservers design, replication, and append-heavy workloads.
Change Data Capture (CDC) & the Outbox Pattern
Turn your database write log into a reliable event stream. Log-based CDC, the dual-write problem, and the transactional outbox.
Designing a Feature Flag Service
An in-house LaunchDarkly. Distributing config with sub-100ms freshness to thousands of services, targeting rules, and the safety properties that prevent a flag flip from taking the site down.
The Saga Pattern & Distributed Transactions
How do you keep data consistent across services with no shared database? Sagas, compensating transactions, orchestration vs choreography, and why 2PC fails at scale.
Leader Election and Consensus (Raft, Paxos)
How distributed systems agree on a single leader without splitting brains. Raft step-by-step, Paxos explained intuitively, and where consensus shows up in production.
Database Replication
Single-leader, multi-leader, and leaderless replication. Sync vs async, replication lag, conflict resolution, and how each model trades availability for consistency.
Design a Distributed Key-Value Store (Dynamo)
Build your own DynamoDB / Cassandra. Sharding, replication, quorum reads/writes, vector clocks, conflict resolution.
Design a Distributed Unique ID Generator
How Twitter/Discord/Instagram generate billions of unique IDs per day with no central coordinator. UUIDs, Snowflake, ULIDs.
Design Yelp / Nearby Search (proximity service)
Find restaurants/businesses near a location, fast. Geohash, quadtree, hexagonal cells, and the right index for "within 5 km of me".
Design Google Drive / Dropbox
File sync that works on every device: blob storage, deduplication, conflict resolution, and how to do all of it efficiently.
Design a Distributed Cache (like Memcached)
A cache that scales across hundreds of nodes — consistent hashing, replication, eviction, and the operational problems you'll meet.
Design a Web Crawler (Googlebot)
Crawl billions of URLs at petabyte scale. URL frontier, politeness, deduplication, 4xx/dead-link handling, and the realities of indexing the web.
Design a Rate Limiter
Token bucket, leaky bucket, fixed window, sliding window — and the distributed Redis-based limiter you can copy into production.
CAP Theorem Deep Dive
The CAP theorem, debunked myths, PACELC, and the actual trade-offs every distributed database makes.
Database Sharding
When you outgrow a single database — how to split data across many machines, the strategies that work, and the operational pain you'll inherit.
Consistent Hashing
Why hash-mod-N breaks when you resize, and how Amazon Dynamo, Cassandra, and Memcached avoid it with consistent hashing and virtual nodes.
CAP, Consistency, and Replication
CAP and PACELC, consistency models from linearizable to eventual, replication strategies, quorums, partitioning, consensus (Raft, Paxos), CRDTs, and 2PC.
Reliability and Failure Patterns
Timeouts, retries with backoff, circuit breakers, bulkheads, deadlines, hedged requests, and graceful degradation — the patterns that keep distributed systems standing.