#caching
15 articles
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design a Feature Store
Serve the exact same feature values to model training and online inference, eliminating training-serving skew, across batch, streaming, and on-demand tiers at sub-10ms latency and millions of reads per second. The architecture behind Uber Michelangelo, Airbnb Chronon, and DoorDash's gigascale feature store.
Design an LLM Inference & Serving System
Serve token generation for a 70B-parameter model at scale — where KV cache, not FLOPs, caps concurrency and continuous batching is what separates good GPU utilization from terrible utilization.
Design a Social Graph Service (Facebook's TAO)
Serve billions of "who follows whom" reads over a graph of trillions of edges. The objects-and-associations model, a cache in front of sharded SQL, and the hot-vertex problem.
Design a Distributed Counter (view / like counts)
Count likes and views at millions of increments per second without a single hot row melting. Sharded counters, write batching, and approximate vs exact counts.
Design a Real-Time Leaderboard (gaming)
Rank millions of players by score and answer "top N" and "my rank" instantly. Redis sorted sets, sharding by score range, and approximate ranks at scale.
Design a Content Delivery Network (CDN)
Serve content from the edge, close to users, at massive scale. Request routing (anycast & DNS), cache hierarchies, invalidation, and origin shielding.
Bloom Filters
A tiny, probabilistic data structure that says "definitely not" or "maybe" — and saves billions of disk reads. The math, the tuning, and where every big system uses one.
Design a Distributed Cache (like Memcached)
A cache that scales across hundreds of nodes — consistent hashing, replication, eviction, and the operational problems you'll meet.
Design Search Autocomplete (Typeahead)
Sub-100ms autocomplete suggestions across billions of queries — tries, top-k caching, and personalized ranking.
Design a News Feed (Facebook / Instagram)
How Meta builds the home feed: fanout, candidate generation, ranking, and how the same architecture serves billions of users.
Design Twitter / X (the home timeline)
500M users, 500M tweets/day, p99 feed loads under 200ms. The fanout-on-write vs fanout-on-read trade-off that defines the system.
Design a URL Shortener (TinyURL / bit.ly)
A classic FAANG warmup. Generate short codes, store them, redirect fast, scale to billions of URLs.
Scale From Zero to Millions of Users
The classic walkthrough — start with one server, add a load balancer, add caching, replicate the database, shard, geo-distribute. Every transition explained.
Caching
Cache hierarchies, write strategies, eviction policies, Redis data structures, the four classic cache pathologies, and how to size and warm a cache properly.