#reliability
7 articles
Model Context Protocol (MCP) and Tool-Use Infrastructure
How LLMs safely reach the outside world — from raw function calling to MCP, the open standard that collapses N×M bespoke integrations to N+M, with production-grade security, reliability, and a ~88% token reduction via deferred tool loading.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an AI Agent Platform
Build a platform that runs autonomous LLM agents — each capable of planning, calling tools, and completing multi-step tasks lasting minutes to hours — with durable state, idempotent tool execution, and per-tenant safety guardrails.
Design a Distributed Job Scheduler (cron at scale)
Run millions of scheduled and recurring jobs reliably — at-least-once execution, leader election, sharded time-wheels, and exactly-once side effects via idempotency.
Backpressure & Flow Control
What happens when a fast producer overwhelms a slow consumer? Backpressure, bounded buffers, load shedding, and why unbounded queues are a trap.
Idempotency & Exactly-Once Semantics
Networks retry, so your operations can run more than once. Idempotency keys, dedup, and why "exactly-once delivery" is a myth but "exactly-once effect" is achievable.
Reliability and Failure Patterns
Timeouts, retries with backoff, circuit breakers, bulkheads, deadlines, hedged requests, and graceful degradation — the patterns that keep distributed systems standing.