// TOPIC

#reliability

7 articles

Model Context Protocol (MCP) and Tool-Use Infrastructure

How LLMs safely reach the outside world — from raw function calling to MCP, the open standard that collapses N×M bespoke integrations to N+M, with production-grade security, reliability, and a ~88% token reduction via deferred tool loading.

#ai #llm #agents

31 min

◆◆◆ Advanced Cloudflare OpenAI

Design an LLM Gateway (AI Gateway & Model Router)

A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.

#interview #ai #llm

31 min

◆◆◆ Advanced OpenAI Anthropic

Design an AI Agent Platform

Build a platform that runs autonomous LLM agents — each capable of planning, calling tools, and completing multi-step tasks lasting minutes to hours — with durable state, idempotent tool execution, and per-tenant safety guardrails.

#interview #ai #agents

24 min

◆◆◆ Advanced Amazon Google

Design a Distributed Job Scheduler (cron at scale)

Run millions of scheduled and recurring jobs reliably — at-least-once execution, leader election, sharded time-wheels, and exactly-once side effects via idempotency.

#interview #distributed-systems #scheduling

20 min

◆◆ Intermediate Netflix LinkedIn

Backpressure & Flow Control

What happens when a fast producer overwhelms a slow consumer? Backpressure, bounded buffers, load shedding, and why unbounded queues are a trap.

#distributed-systems #reliability #streaming

17 min

◆◆ Intermediate Stripe Amazon

Idempotency & Exactly-Once Semantics

Networks retry, so your operations will run twice. Idempotency keys, dedup, and why "exactly-once delivery" is a myth but "exactly-once effect" is achievable.

#distributed-systems #reliability #messaging

18 min

◆◆ Intermediate

Reliability and Failure Patterns

Timeouts, retries with backoff, circuit breakers, bulkheads, deadlines, hedged requests, and graceful degradation — the patterns that keep distributed systems standing.

#reliability #resilience #distributed-systems

15 min