Master system design. From zero to senior.
A free field guide to building scalable systems. Beginner-friendly fundamentals, architectural deep dives, and 55+ FAANG-grade interview problems, all in one place, all with diagrams.
The crash course
No system design background? Start here. Twelve focused modules that cover the prerequisites — networking, databases, caching, queues — before you tackle full systems.
What Is System Design?
Start here. What system design actually means, why it matters, the framework that turns a vague problem into a defensible architecture, and how to read every other article in this course.
Back-of-the-Envelope Estimation
Capacity math you can do in your head — QPS, storage, bandwidth, memory — the latency numbers every engineer should know, and worked examples for Twitter, URL shorteners, and a chat system.
Networking and HTTP
TCP vs UDP, DNS, TLS, HTTP/1.1 vs HTTP/2 vs HTTP/3, anycast routing, and what actually happens between a browser and a server.
APIs and Communication Protocols
REST, gRPC, GraphQL, WebSockets, Server-Sent Events, and webhooks — when to use each, how to design them, and the patterns that keep them sane at scale.
Featured articles
Real systems — Twitter, Uber, YouTube, WhatsApp — broken down. Every article is tagged by difficulty and ships with sequence and architecture diagrams.
Model Context Protocol (MCP) and Tool-Use Infrastructure
How LLMs safely reach the outside world — from raw function calling to MCP, the open standard that collapses N×M bespoke integrations to N+M, with production-grade security, reliability, and a ~88% token reduction via deferred tool loading.
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.
Design an LLM Evaluation Platform
Build the system that tells a team whether a prompt or model change made the product better or worse — automatically. Covers offline eval with LLM-as-judge, CI regression gating, online production sampling, human annotation queues, and eval for RAG, agents, and classifiers at the scale of 450 million evaluations per month.
Design a GraphRAG System (Knowledge-Graph-Augmented Retrieval)
When vanilla vector RAG fails on "summarize the entire corpus" and multi-hop questions, you build a knowledge graph first. Covers entity extraction, Leiden community detection, map-reduce global search, and graph traversal for multi-hop queries, based on Microsoft GraphRAG and production deployments at Neo4j, LinkedIn, and Writer.
Tackle FAANG interviews
Ten classic system design problems pulled from interview reports at Meta, Google, Amazon, Netflix, and more. Each comes with requirements, capacity estimation, and an architecture deep dive.
Build it. Scale it. Ship it.
Whether you're prepping for a senior interview or just want to understand how the internet's largest systems work — start here.