// TOPIC

#reliability

7 articles

◆◆ Intermediate · Anthropic, OpenAI
01

Model Context Protocol (MCP) and Tool-Use Infrastructure

How LLMs safely reach the outside world — from raw function calling to MCP, the open standard that collapses N×M bespoke integrations to N+M, with production-grade security, reliability, and a ~88% token reduction via deferred tool loading.

#ai #llm #agents
31 min
◆◆◆ Advanced · Cloudflare, OpenAI
02

Design an LLM Gateway (AI Gateway & Model Router)

A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.

#interview #ai #llm
31 min
◆◆◆ Advanced · OpenAI, Anthropic
03

Design an AI Agent Platform

Build a platform that runs autonomous LLM agents — each capable of planning, calling tools, and completing multi-step tasks lasting minutes to hours — with durable state, idempotent tool execution, and per-tenant safety guardrails.

#interview #ai #agents
24 min
◆◆◆ Advanced · Amazon, Google
04

Design a Distributed Job Scheduler (cron at scale)

Run millions of scheduled and recurring jobs reliably — at-least-once execution, leader election, sharded time-wheels, and exactly-once side effects via idempotency.

#interview #distributed-systems #scheduling
20 min
◆◆ Intermediate · Netflix, LinkedIn
05

Backpressure & Flow Control

What happens when a fast producer overwhelms a slow consumer? Backpressure, bounded buffers, load shedding, and why unbounded queues are a trap.

#distributed-systems #reliability #streaming
17 min
◆◆ Intermediate · Stripe, Amazon
06

Idempotency & Exactly-Once Semantics

Networks retry, so your operations will run twice. Idempotency keys, dedup, and why "exactly-once delivery" is a myth but "exactly-once effect" is achievable.

#distributed-systems #reliability #messaging
18 min
◆◆ Intermediate
07

Reliability and Failure Patterns

Timeouts, retries with backoff, circuit breakers, bulkheads, deadlines, hedged requests, and graceful degradation — the patterns that keep distributed systems standing.

#reliability #resilience #distributed-systems
15 min