
Difficulty: Advanced (◆◆◆) · Asked at: Cloudflare, OpenAI, Anthropic, LiteLLM, Portkey

Design an LLM Gateway (AI Gateway & Model Router)

Q: What is an LLM gateway and how is it different from a regular API gateway?

A regular API gateway handles authentication, rate limiting, and routing for services your team controls. An LLM gateway does all of that, but the backends are third-party model providers with their own incompatible schemas, token-based quotas, non-deterministic latency, and steep per-token pricing. The gateway must normalize heterogeneous provider APIs into a unified interface, enforce token-per-minute limits (not just requests), manage dollar budgets, cache semantically equivalent prompts, and handle streaming SSE passthrough — concerns a generic gateway never needed to address.

Q: Why does rate limiting on requests-per-minute fail for LLM traffic?

A single 128k-token GPT-4o request consumes as much quota as roughly 2,500 short 50-token requests but registers as exactly one in an RPM counter. A batch job running long-context summarization can exhaust TPM quotas in seconds while staying within RPM limits, starving all real-time user traffic. Production LLM gateways must enforce TPM limits alongside RPM limits and read actual token counts (not estimates) from provider usage responses for accurate post-call metering. Kong AI Gateway reads real counts from provider responses; Azure OpenAI ties the two limits together by allocating 6 RPM per 1,000 TPM.

Q: What similarity threshold should I use for semantic caching, and why is tuning it critical?

The widely-cited starting point is 0.92 cosine similarity. Below roughly 0.90 the cache starts returning semantically different answers: in one documented financial services incident, a question about not wanting a business account matched a prior question about account closure at 88.7% similarity and incorrectly triggered the closure flow. Above 0.95–0.97 the gain over an exact hash cache diminishes sharply, since most matches at that similarity level differ only in whitespace or trivial rephrasing that a normalized exact hash would largely catch. Production hit rates are 20–45% blended, not the 95% some vendors claim; FAQ bots hit 40–60%, RAG Q&A 15–25%, and agentic tool calls only 5–15%. Domain-specific per-category thresholds outperform a single global value.

Q: How do retry, hedge, and circuit-break differ, and when should I use each?

Retries are sequential: if the first call fails or times out, you wait and try again. For LLMs where a timeout takes 8–10 seconds, three sequential retries create a ~30-second P99 latency cliff. Hedged requests fire a parallel duplicate to a second provider at the P90 latency mark — dramatically reducing P99 without waiting for failure. Circuit breakers stop routing to known-down providers entirely, preventing the retry storm. The design rule: hedge for latency reduction, circuit-break for availability, retry only for quick transient errors on idempotent requests, and cap global retry budget at 5–15% of base traffic to prevent storm amplification.

Q: What happens to streaming SSE when the gateway buffers the response?

If the gateway buffers the full response before forwarding — the default behavior for nginx with proxy_buffering on, or when gzip is enabled for text/event-stream — the client receives a complete response dump instead of a token-by-token stream. The first-token latency (TTFT) appears as the full generation time. This breaks the conversational feel that made users choose streaming in the first place. Fix: set proxy_buffering off, disable gzip for text/event-stream, and on Kong specifically enforce HTTP/1.1 because HTTP/2 multiplexing is incompatible with SSE chunk framing.

A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.

31 min read · 2026-06-25 · Ironclad Academy

#interview #ai #llm #llmops #caching #reliability

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

In 2024 Stripe's engineering team was calling five different model providers from six different microservices. Each service maintained its own provider credentials, its own retry logic, and its own token-counting approximation. When Anthropic hit a regional outage, a human had to grep the codebase for every anthropic.messages.create call and manually reroute traffic. Nobody could answer the question "how much did we spend on GPT-4o last month?" without querying five different billing dashboards and hoping nothing was double-counted.

The LLM gateway exists to solve exactly this. It is the API gateway pattern — the same one discussed in API Gateways & the Backend-for-Frontend Pattern — applied to a new class of backend where the failure modes, cost structures, and traffic shapes are completely different from anything the original API gateway designers anticipated.

Classic API gateways route to backends you control. An LLM gateway routes to black-box third-party inference endpoints that share quota across all your keys globally, charge per token rather than per call, return non-deterministic latency from 50 ms to 120 seconds, and occasionally decide that your valid request violates their content policy with no appeal path. The gateway has to absorb all of that and present a clean, reliable surface to the application layer.

The real-world implementations range from open-source self-hosted proxies (LiteLLM, Portkey's open-source gateway) to managed edge services (Cloudflare AI Gateway) to enterprise vendor stacks (Kong AI Gateway, Envoy AI Gateway from Tetrate/CNCF) to aggregator marketplaces (OpenRouter). Cloudflare proxied 2 billion requests in the first year after launching in September 2023. Portkey handles over 10 billion requests per month. OpenRouter routes approximately 100 trillion tokens per month as of 2025. These are not toy systems.

This article is about the control plane: routing, caching, rate limiting, observability, and multi-tenancy. It is deliberately not about GPU inference internals — that is Design an LLM Inference Serving System. The gateway sits upstream of inference clusters and third-party providers; it is the network policy layer, not the compute layer.

Functional requirements

Accept requests in the OpenAI chat completions format (/v1/chat/completions) and proxy them to any configured model provider or self-hosted endpoint.
Routing: select the best provider deployment per request based on cost, latency, capability tags, or explicit routing rules; fall back automatically on 429/5xx/timeout.
Virtual API keys: issue per-team keys with configurable TPM/RPM limits and dollar budget caps; a key exhausting its budget must not affect other keys.
Caching: return cached responses for exact-match requests; also return semantically equivalent cached responses within a configurable similarity threshold.
Streaming: pass streaming SSE responses through to the client without buffering the complete response.
Observability hooks: log per-request token counts, cost estimate, TTFT, finish reason, and tenant ID.
Middleware hooks: apply configurable input scanners (PII detection, prompt injection, topic denylist) and output scanners (content moderation, PII deanonymization) per route.

Non-functional requirements

P99 gateway overhead < 50 ms on the critical path (auth + rate-limit + routing, excluding network transit to provider).
Availability > 99.9% independent of any single provider's availability — the gateway should route around provider outages automatically.
Horizontal scale: the data plane must scale stateless; shared state (rate-limit counters, circuit-breaker status) lives in Redis.
Multi-tenancy: one tenant's cache namespace, budget counters, and logs must be fully isolated from all other tenants.
Audit trail: every request and budget modification must be logged for SOC 2 / GDPR compliance.

Capacity estimation

Dimension	Estimate	How we got there
Peak request rate	10,000 req/s	Portkey-scale deployment across 100 enterprise tenants
Average token size	2,500 tokens/req (2,000 in + 500 out)	Typical assistant chat workload; long-doc tasks skew to 20,000+
Token throughput	25M tokens/s	10k req/s × 2,500 tokens
Gateway data plane nodes	9–10 nodes	Each node handles ~1,000–1,200 req/s at <10 ms P99 overhead; 10k req/s ÷ 1,200 = 8.3 → 9 min, ÷ 1,000 = 10 max; Portkey claims sub-10 ms median
Redis for rate limiting	1–2 Redis nodes (or small cluster)	4–5 ops per request (RPM INCR, TPM INCR, budget INCRBYFLOAT, key lookup, optional cooldown check) × 10k req/s = 40–50k Redis ops/s — well within one node's 100k+ ops/s ceiling; budget counter atomic increment adds ~11 µs mean latency
Semantic cache embedding size	~61 GB	10M cached prompts × 1,536-dim float32 × 4 bytes; fits Redis with vector module or Qdrant
Exact cache size	~5–10 GB	10M cached prompt hashes × 500–1,000-byte response average (short completion responses); LRU eviction
Log volume	~2–5 TB/month	10k req/s × ~100 bytes per metadata log entry × 86,400 s/day × 30 days ≈ 2.6 TB/month; compliance full-content logging (25 KB/entry) is a separate opt-in path that adds ~650 TB/month and requires tiered object storage
PostgreSQL spend records	~75–80 TB/year raw; ~500 GB–2 TB hot	10k req/s × 31.5M s/year × ~250 bytes/row ≈ 79 TB/year raw insert volume; in practice, range-partition by month, keep last 90 days hot in PostgreSQL, archive older data to object storage — leaving ~500 GB–2 TB in the hot tier

Takeaway: Redis is the critical-path dependency; PostgreSQL is an async audit sink. The data plane nodes are stateless and scale horizontally. At 10k req/s the bottleneck is provider throughput and network round-trip time, not gateway CPU.
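The arithmetic behind the table is easy to re-run. The sketch below recomputes the main rows from the table's own inputs; the variable names are illustrative, and every constant is an assumption taken from the table, not a measurement.

```python
# Back-of-envelope check of the capacity table.
# Every input below is one of the table's stated assumptions.
import math

PEAK_RPS = 10_000
TOKENS_PER_REQ = 2_000 + 500           # input + output tokens
SECONDS_PER_MONTH = 86_400 * 30
SECONDS_PER_YEAR = 31_500_000          # ~31.5M s, as used in the table

token_throughput = PEAK_RPS * TOKENS_PER_REQ             # tokens/s
nodes_min = math.ceil(PEAK_RPS / 1_200)                  # optimistic per-node rate
nodes_max = math.ceil(PEAK_RPS / 1_000)                  # conservative per-node rate
redis_ops = (4 * PEAK_RPS, 5 * PEAK_RPS)                 # ops/s at 4-5 ops per request
embedding_bytes = 10_000_000 * 1_536 * 4                 # 10M float32 vectors
log_bytes_month = PEAK_RPS * 100 * SECONDS_PER_MONTH     # metadata-only log entries
spend_bytes_year = PEAK_RPS * SECONDS_PER_YEAR * 250     # raw request_log rows

print(f"token throughput: {token_throughput / 1e6:.0f}M tokens/s")
print(f"gateway nodes:    {nodes_min}-{nodes_max}")
print(f"redis ops/s:      {redis_ops[0]:,}-{redis_ops[1]:,}")
print(f"embeddings:       {embedding_bytes / 1e9:.1f} GB")
print(f"logs per month:   {log_bytes_month / 1e12:.1f} TB")
print(f"spend rows/year:  {spend_bytes_year / 1e12:.1f} TB")
```

Changing `PEAK_RPS` or the per-row byte sizes shows quickly which rows are sensitive to the assumptions: storage scales linearly with request rate, while the Redis estimate stays far below a single node's ceiling.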

Building up to the design

V1: Dumb pass-through proxy

The simplest possible gateway is a reverse proxy that rewrites the Authorization header and forwards to OpenAI. This gives you centralized key management (developers never see the actual provider key) but nothing else. When OpenAI returns a 429, the client gets a 429. When OpenAI releases a new API parameter, you update the proxy. The blast radius of a single bad API key is your entire organization's OpenAI access.

# V1: naive pass-through
def proxy(request):
    request.headers["Authorization"] = f"Bearer {OPENAI_KEY}"
    return forward_to("https://api.openai.com", request)

This breaks the moment you have two teams with different rate-limit budgets, or you want to try Claude for a workload that was costing too much on GPT-4.

V2: Add routing and multiple providers

Replace the single static target with a router that knows about multiple provider deployments. Each deployment has a model, a provider, and optionally a set of API keys to load-balance across. The router picks a deployment based on a strategy — cheapest first, lowest measured latency, or round-robin — and translates the request to that provider's schema.

flowchart LR
    CLIENT[Client<br/>OpenAI SDK] --> TRANS["Schema translator"]
    TRANS --> ROUTER["Router<br/>(cost / latency / round-robin)"]
    ROUTER --> D1["OpenAI gpt-4o<br/>key pool A"]
    ROUTER --> D2["Anthropic claude-3-5-sonnet<br/>key pool B"]
    ROUTER --> D3["Self-hosted<br/>Llama-3.1-70B"]
    style TRANS fill:#0e7490,color:#fff
    style ROUTER fill:#ff6b1a,color:#0a0a0f
    style D1 fill:#15803d,color:#fff
    style D2 fill:#15803d,color:#fff
    style D3 fill:#15803d,color:#fff

The schema translator is more work than it looks. OpenAI uses role: "assistant", finish_reason: "stop", and streaming deltas with choices[0].delta.content. Anthropic uses role: "assistant", stop_reason: "end_turn", and streaming events with content_block_delta.delta.text. Google uses candidates[0].content.parts. On the response path, the gateway normalizes all of this to the OpenAI format before returning it to the client. On the request path, it translates the inbound OpenAI-format request into each provider's format, which requires maintaining a live translation table per provider. Providers update their schemas regularly; treating this as a one-time task is the wrong mental model.
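To make the response-side translation concrete, here is a minimal sketch that converts Anthropic streaming events into OpenAI-style chunks. The function names and the `STOP_REASON_MAP` table are illustrative assumptions, not any gateway's actual code, and only text deltas and the final stop event are handled; tool calls, usage, and error events are left out.

```python
import json

# Assumed mapping from Anthropic stop_reason values to OpenAI finish_reason values.
STOP_REASON_MAP = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def anthropic_event_to_openai_chunk(event, chunk_id, model):
    """Translate one Anthropic streaming event into an OpenAI-style chunk.

    Returns None for events that have no OpenAI equivalent in this sketch.
    """
    etype = event.get("type")
    if etype == "content_block_delta" and event["delta"].get("type") == "text_delta":
        delta, finish = {"content": event["delta"]["text"]}, None
    elif etype == "message_delta":
        delta = {}
        finish = STOP_REASON_MAP.get(event["delta"].get("stop_reason"), "stop")
    else:
        return None  # ping, message_start, content_block_start/stop, ...
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish}],
    }


def to_sse(chunk):
    """Serialize a chunk as one SSE data frame."""
    return f"data: {json.dumps(chunk)}\n\n"
```

Each provider needs its own pair of translators (request and response), which is why the translation table is an ongoing maintenance cost instead of a one-time task.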

V3: Add rate limiting, budgets, and circuit breaking

V2 still has no protection against a single team exhausting all available TPM quota. Add:

Virtual keys that map to real provider credentials, each with per-key TPM/RPM limits and a dollar budget.
Redis-backed atomic counters for cross-replica enforcement. A local in-memory counter per node is not enough — you have 10 gateway replicas, and each one enforces limits independently against a 1/10th view of traffic.
A cooldown list. LiteLLM's default: if a deployment has more than 50% failure rate in the past minute, or returns a 429/401/404/408 error code, remove it from the routing pool for 5 seconds. The gateway tries a probe request after the cooldown before restoring full traffic.

This is the minimum viable production gateway. V1 through V3 is roughly what teams build internally in the first six months.
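The shared-counter enforcement from V3 can be sketched as a fixed-window limiter that checks requests and tokens together. This is an in-memory illustration: the `store` dict stands in for Redis, the class and key layout are assumptions, and the read-then-write here is not atomic the way a Redis INCR/INCRBY (or a Lua script) would be.

```python
import time


class FixedWindowLimiter:
    """Fixed-window RPM + TPM limiter.

    `store` stands in for a shared Redis instance. Passing the same store to
    every limiter models all gateway replicas enforcing one global limit.
    """

    def __init__(self, rpm_limit, tpm_limit, store=None, clock=time.time):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.store = {} if store is None else store
        self.clock = clock

    def allow(self, key_id, tokens):
        # One counter pair per key per minute. The window number is part of the
        # key here; with Redis the same effect comes from EXPIRE on the counter.
        window = int(self.clock() // 60)
        rpm_key = f"rate:{key_id}:rpm:{window}"
        tpm_key = f"rate:{key_id}:tpm:{window}"
        rpm = self.store.get(rpm_key, 0) + 1
        tpm = self.store.get(tpm_key, 0) + tokens
        if rpm > self.rpm_limit or tpm > self.tpm_limit:
            return False  # reject without consuming quota
        self.store[rpm_key] = rpm
        self.store[tpm_key] = tpm
        return True
```

Giving each replica its own private `store` reproduces the failure described above: with 10 replicas, a key limited to 10,000 TPM could consume up to 100,000 TPM before any single node noticed.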

V4: Semantic caching and observability

The LLM call dominates cost. A FAQ bot that gets the same 200 questions every day is re-paying for each one on each occurrence. The exact cache (hash of model + messages + temperature + all other params) handles truly identical requests. The semantic cache handles questions that differ in phrasing but ask for the same answer.

V4 adds an embedding service, a vector store, and asynchronous cost logging. Now the gateway can answer the CFO's question about last month's spend, and the engineering team's question about which model is cheapest for their workload.

V5: Guardrails, streaming SSE passthrough, and multi-tenancy

The final layer adds pre/post middleware hooks for PII redaction and content moderation, correct streaming passthrough (the detail everyone gets wrong initially), and per-tenant cache namespaces to prevent timing side-channel leaks between tenants. This is Portkey, LiteLLM Proxy, and Cloudflare AI Gateway feature parity.

API

The gateway exposes the standard OpenAI chat completions endpoint. Clients do not change their SDK or schema.

POST /v1/chat/completions
Authorization: Bearer vk-team-sales-abc123   # virtual key, not a real provider key
X-Route-Hint: cost                           # optional routing preference
Content-Type: application/json

{
  "model": "gpt-4o",         # logical model name; gateway resolves to a deployment
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Summarize Q3 revenue trends."}
  ],
  "stream": true,
  "temperature": 0.2,
  "max_tokens": 512
}

Response (streaming SSE):

data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant","content":"Q3"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"delta":{"content":" revenue"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":412,"completion_tokens":301,"total_tokens":713}}

data: [DONE]

The gateway adds a response header X-Gateway-Provider: anthropic/claude-3-5-sonnet so the client can see which deployment served the request. Cost and TTFT are logged asynchronously; they do not appear in the response to stay off the critical path.

The schema

Virtual key table (PostgreSQL):

CREATE TABLE virtual_keys (
    key_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hashed_key      TEXT NOT NULL UNIQUE,  -- SHA-256; plaintext shown once at creation
    tenant_id       UUID NOT NULL,
    team_id         UUID,
    budget_usd      NUMERIC(12,6),         -- NULL = unlimited
    budget_period   TEXT DEFAULT 'month',  -- second | day | week | month | year
    budget_reset_at TIMESTAMPTZ,
    tpm_limit       INTEGER,
    rpm_limit       INTEGER,
    allowed_models  TEXT[],                -- NULL = all models allowed
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    revoked_at      TIMESTAMPTZ            -- soft delete
);

Per-request spend record (PostgreSQL, written async after response):

CREATE TABLE request_log (
    request_id      UUID PRIMARY KEY,
    key_id          UUID REFERENCES virtual_keys(key_id),
    tenant_id       UUID NOT NULL,
    provider        TEXT NOT NULL,         -- 'openai', 'anthropic', 'google', ...
    model           TEXT NOT NULL,         -- 'gpt-4o', 'claude-3-5-sonnet', ...
    input_tokens    INTEGER,
    output_tokens   INTEGER,
    cached_tokens   INTEGER DEFAULT 0,     -- provider-side cache read
    cost_usd        NUMERIC(12,8),
    ttft_ms         INTEGER,
    total_latency_ms INTEGER,
    finish_reason   TEXT,
    cache_hit       TEXT,                  -- 'exact' | 'semantic' | NULL
    ts              TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON request_log (tenant_id, ts DESC);
CREATE INDEX ON request_log (key_id, ts DESC);

Redis holds the hot-path counters (no persistent state):

rate:{key_id}:rpm   -- INCR + EXPIRE, fixed-window RPM counter
rate:{key_id}:tpm   -- INCRBY <tokens> + EXPIRE, fixed-window TPM counter
budget:{key_id}:usd -- INCRBYFLOAT, budget consumed this period
cooldown:{deployment_id} -- SET with TTL = cooldown_seconds (5s default)
cache:exact:{hash}  -- GET/SET, response payload

Architecture

The gateway has two planes: a stateless data plane that handles individual requests, and a control plane (admin API + config store) that manages routes, keys, and policies.

flowchart TD
    subgraph CP["Control Plane"]
        ADMIN["Admin API<br/>(REST)"]
        PGDB[("PostgreSQL<br/>keys · routes · spend")]
        ADMIN --> PGDB
    end
    subgraph DP["Data Plane (stateless, horizontally scaled)"]
        LB["L4 Load Balancer"]
        GW1["Gateway node 1"]
        GW2["Gateway node 2"]
        GWN["Gateway node N"]
        LB --> GW1
        LB --> GW2
        LB --> GWN
    end
    subgraph SHARED["Shared State"]
        REDIS[("Redis cluster<br/>rate-limit counters<br/>circuit-breaker state<br/>exact cache")]
        QDRANT[("Vector store<br/>(Qdrant / Redis-VSS)<br/>semantic cache embeddings")]
    end
    subgraph PROVIDERS["Model Providers"]
        OAI["OpenAI"]
        ANT["Anthropic"]
        GGL["Google"]
        SELF["Self-hosted"]
    end
    GW1 --> REDIS
    GW2 --> REDIS
    GWN --> REDIS
    GW1 --> QDRANT
    GW2 --> QDRANT
    GWN --> QDRANT
    GW1 --> OAI
    GW1 --> ANT
    GW2 --> GGL
    GW2 --> SELF
    GW1 -.async log.-> PGDB
    style CP fill:#0a0a0f,color:#fff
    style DP fill:#0a0a0f,color:#fff
    style REDIS fill:#ff2e88,color:#fff
    style QDRANT fill:#15803d,color:#fff
    style LB fill:#0e7490,color:#fff
    style ADMIN fill:#ffaa00,color:#0a0a0f

Hot path sequence: a non-cached streaming request

sequenceDiagram
    participant C as Client
    participant GW as Gateway node
    participant R as Redis
    participant EMB as Embedding model
    participant VDB as Vector store
    participant PROV as "Provider (Anthropic)"
    participant PG as PostgreSQL

    C->>GW: POST /v1/chat/completions (stream=true)
    GW->>R: Validate virtual key + check budget/RPM/TPM
    R-->>GW: OK (counters updated atomically)
    GW->>GW: Run input guardrails (PII scan, injection check)
    GW->>R: Exact cache lookup (hash of model+messages+params)
    R-->>GW: MISS
    GW->>EMB: Embed prompt (async, ~10ms)
    EMB-->>GW: 1536-dim vector
    GW->>VDB: ANN search (cosine similarity ≥ 0.92?)
    VDB-->>GW: MISS (no close match)
    GW->>GW: Route selection (cost-based: Anthropic claude-3-5-sonnet cheapest)
    GW->>GW: Check cooldown list for selected deployment
    GW->>PROV: POST /v1/messages (translated schema, stream=true)
    PROV-->>GW: SSE stream begins (first token)
    Note over GW,C: Gateway forwards each SSE chunk immediately<br/>(proxy_buffering off)
    GW-->>C: SSE chunk 1 (TTFT measured here)
    PROV-->>GW: SSE chunk 2..N
    GW-->>C: SSE chunk 2..N
    PROV-->>GW: DONE + usage stats
    GW-->>C: DONE
    GW->>R: Write response to exact cache (async)
    GW->>VDB: Write embedding + response to semantic cache (async)
    GW->>PG: INSERT request_log (async, non-blocking)

Every async write after [DONE] happens off the critical path. The client receives the complete stream before the gateway does any bookkeeping.

Request routing strategies

Seven strategies are available in practice (LiteLLM names them explicitly):

Simple shuffle (weighted random). Assign each deployment a weight proportional to its RPM capacity. Requests are distributed proportionally at random. No shared state needed — each gateway replica can route independently. The right choice for homogeneous deployments of the same model.

Rate-limit-aware. Before routing, query each deployment's current counter in Redis. Skip any deployment within 10% of its TPM limit. Route to the deployment with the most remaining capacity. Requires one Redis round-trip per request (adds ~1 ms). Use when deployments have uneven quota allocation.

Latency-based. Track a rolling 5-minute P50 latency for each deployment. Route to the lowest-P50 deployment. This adapts automatically to provider performance degradation — if OpenAI's US-East region slows down, traffic shifts to EU-West without manual intervention. Stored in Redis as a sliding window.

Cost-based. Maintain a pricing table (model + provider → $ per million tokens). Route to the cheapest deployment that meets the request's capability requirements (context length, JSON mode, vision, etc.). OpenRouter's default is inverse-square price weighting: a provider at $1/M is approximately 9x more likely to be selected than one at $3/M.

Least-busy. Track the number of in-flight concurrent requests per deployment (a simple Redis counter). Route to the deployment with the fewest active calls. Useful for self-hosted endpoints where a single long generation blocks throughput for all others.

ML-based routing. RouteLLM (LMSYS, arXiv 2406.18665) and similar systems train a lightweight classifier on human preference data to route each prompt to either a strong expensive model (GPT-4o) or a weak cheap model (GPT-4o-mini) based on prompt complexity. Results: over 2x cost reduction in certain benchmark configurations. AWS Bedrock's Intelligent Prompt Routing uses a similar approach within the Anthropic Claude and Meta Llama families, cutting costs up to 30% without accuracy loss. The caveat: the routing classifier adds 90–540 ms of latency and requires at least 500 labeled training samples per model tier. Most teams start with rule-based routing and add ML routing only after they have enough data.

Canary / traffic split. Route X% of traffic to a new model and the remaining (100 − X)% to the stable one. Cloudflare AI Gateway's Dynamic Routes implement this as if/else conditional routing. LiteLLM supports traffic mirroring: the primary model serves the response while a shadow copy of the request is sent silently to the new model for evaluation, with no impact on user latency. Pitfall: canary splits must be stratified by request type. A new code model receiving only chat requests during canary will not reveal coding regressions.
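Two of the strategies above reduce to a few lines of arithmetic. The sketch below shows inverse-square price weighting (the cost-based example) and a deterministic percentage split (the canary example). Function names are illustrative, and hashing the request ID into 10,000 buckets is one assumed way to implement the split, not how any particular gateway does it.

```python
import hashlib
import random


def inverse_square_weights(prices_per_million):
    """Selection probability proportional to 1 / price^2."""
    raw = {name: 1.0 / (price ** 2) for name, price in prices_per_million.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def pick_provider(prices_per_million, rng=random):
    """Weighted random choice using the inverse-square weights."""
    weights = inverse_square_weights(prices_per_million)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def in_canary(request_id, canary_percent):
    """Deterministically place a request in the canary group.

    Hashing makes the decision stable across replicas without shared state.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


weights = inverse_square_weights({"provider_a": 1.0, "provider_b": 3.0})
print(weights["provider_a"] / weights["provider_b"])  # ratio of selection odds
```

Hashing a user or session ID instead of the request ID keeps one user on one model for the whole canary, which is usually what you want when comparing conversation quality.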

Automatic fallback and circuit breaking

Fallback chains define what happens when the primary deployment fails. LiteLLM uses an order parameter: deployment with order=1 is tried first, order=2 is the first fallback, and so on, with configurable per-tier retry counts before escalation.

The circuit breaker prevents routing to known-down deployments. LiteLLM's cooldown mechanism: if a deployment accumulates a failure rate above 50% in the past minute, or returns a specific error code (429, 401, 404, 408), it is added to a Redis key with a 5-second TTL. During that window, the router treats it as unavailable. After the TTL expires, a probe request tests the deployment; if it succeeds, the deployment is restored to the routing pool.
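A minimal sketch of that cooldown rule follows, with in-memory dicts standing in for the Redis keys. The class name and the `MIN_SAMPLES` guard are assumptions added for illustration (without a guard, a single failed request would count as a 100% failure rate), and the post-cooldown probe request is omitted.

```python
import time

COOLDOWN_SECONDS = 5
COOLDOWN_CODES = {429, 401, 404, 408}
MIN_SAMPLES = 5  # assumption: ignore the failure rate until there is enough data


class CooldownTracker:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.cooldown_until = {}  # deployment_id -> time the cooldown ends
        self.outcomes = {}        # deployment_id -> [(timestamp, ok), ...]

    def record(self, deployment_id, status_code):
        """Record one provider response and start a cooldown if warranted."""
        now = self.clock()
        ok = status_code < 400
        # Keep only the last minute of outcomes for the failure-rate check.
        window = [(t, o) for t, o in self.outcomes.get(deployment_id, []) if now - t < 60]
        window.append((now, ok))
        self.outcomes[deployment_id] = window
        failures = sum(1 for _, o in window if not o)
        too_many_failures = len(window) >= MIN_SAMPLES and failures / len(window) > 0.5
        if status_code in COOLDOWN_CODES or too_many_failures:
            self.cooldown_until[deployment_id] = now + COOLDOWN_SECONDS

    def available(self, deployment_id):
        """True when the deployment is not in cooldown."""
        return self.clock() >= self.cooldown_until.get(deployment_id, 0.0)
```

The router would call `available()` while building the candidate list and `record()` after every provider response.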

OpenRouter handles fallbacks differently: the models array in the request body defines an ordered list of model/provider combinations. If the first model returns a context-length error, a moderation flag, a 429, or 5xx, the request is automatically retried against the next model in the list. This is powerful for cross-provider fallback (Claude on Anthropic fails → try Claude on AWS Bedrock), but OpenRouter's August 2025 database outage demonstrated the single-aggregator failure mode: the entire service went down for 50 minutes, including all provider-level failover logic, because the control plane was unavailable.

Hedge requests, not just retries. A hedged request fires a parallel duplicate to a second provider at the P90 latency threshold. If the primary returns before the hedge, cancel the hedge. If the hedge returns first, cancel the primary, return the hedge result, and log the outcome. This reduces P99 latency dramatically compared to sequential retries. The cost: 5–10% additional in-flight requests. Idempotency keys are required on hedged requests to prevent duplicate side effects — the same pattern described in Idempotency and Exactly-Once Delivery.
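The hedging flow can be sketched with asyncio. Here `primary` and `hedge` are hypothetical zero-argument coroutine functions wrapping the two provider calls, and `hedge_after_s` is the P90 latency threshold. Error handling is deliberately minimal: if the first task to finish raised, the exception propagates instead of waiting for the other task.

```python
import asyncio


async def hedged_call(primary, hedge, hedge_after_s):
    """Run `primary`; if it has not finished after `hedge_after_s`, race it against `hedge`.

    Returns (result, source) where source is "primary" or "hedge".
    """
    primary_task = asyncio.create_task(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=hedge_after_s)
    if done:
        return primary_task.result(), "primary"  # fast path: no hedge was sent

    hedge_task = asyncio.create_task(hedge())
    done, pending = await asyncio.wait(
        {primary_task, hedge_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the slower call
    winner = primary_task if primary_task in done else hedge_task
    source = "primary" if winner is primary_task else "hedge"
    return winner.result(), source
```

Cancelling the local task only closes the gateway's side of the connection. Whether the provider stops generating (and billing) depends on the provider, which is one reason hedging is usually limited to the slowest tail of requests.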

Global retry budget: cap total retries across all replicas at 5–15% of base traffic. Without this cap, a provider brownout causes every gateway replica to retry simultaneously, sustaining or worsening the overload.
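One simple way to express such a cap is a ratio-based budget: a retry is allowed only while total retries stay under a fixed fraction of the requests seen. The sketch below keeps the counters in process memory and never resets them; a real deployment would hold them in Redis so all replicas share one budget, and would window or decay them over time. The class name and the 10% default are illustrative.

```python
class RetryBudget:
    """Allow retries only while retries <= ratio * requests."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one first-attempt request."""
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and consume budget if one more retry fits under the cap."""
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True


budget = RetryBudget(ratio=0.10)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(50))
print(granted)  # retries allowed out of 50 attempts
```

When the budget is exhausted, the gateway fails fast or falls back to another deployment instead of adding load to a provider that is already struggling.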

Caching architecture

Two-layer cache, checked in order:

Layer 1: Exact hash cache. Key = SHA-256(model + JSON-serialized messages + temperature + max_tokens + all other params). Hit returns the cached response in <5 ms. Stored in Redis with LRU eviction. Handles 15–30% of traffic for repetitive workloads (the same system prompt + same few questions). Exact cache is the right place for non-creative use cases (classification, entity extraction) where temperature=0 and the same input always deserves the same output.

Layer 2: Semantic vector cache. For a cache miss, embed the prompt (1 embedding call, typically 10–15 ms on a hosted model). Query the vector store for the nearest neighbor using cosine similarity. If the best match exceeds the configured threshold (default 0.92), return the cached response. Blended production hit rates: 20–45%, not the 95% some vendors advertise. The variance by workload type is large:

Workload	Typical hit rate
FAQ / support bot	40–60%
Classification tasks	50–70%
RAG question-answering	15–25%
Open-ended chat	10–20%
Agentic tool calls	5–15%

A 25% blended hit rate on $5,000/month LLM spend saves roughly $1,250/month before infrastructure costs (Redis, embedding calls). Cache hits return in <5 ms vs. 2–5 seconds for a full model call — the latency benefit is often larger than the cost benefit.
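The two-layer lookup can be sketched end to end in plain Python. A dict stands in for Redis and a list of (embedding, response) pairs stands in for the vector store; `embed` is a hypothetical callable for the embedding model, and `IGNORED_FIELDS` is an assumption about which request fields do not affect the completion. In a multi-tenant deployment the key would also carry the tenant prefix.

```python
import hashlib
import json
import math

IGNORED_FIELDS = {"stream", "user"}  # assumption: fields that do not change the output


def exact_cache_key(request):
    """SHA-256 over canonical JSON, so key order and whitespace do not matter."""
    relevant = {k: v for k, v in request.items() if k not in IGNORED_FIELDS}
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return "cache:exact:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(request, embed, exact_store, semantic_entries, threshold=0.92):
    """Check layer 1, then layer 2.

    Returns (response, layer) where layer is "exact", "semantic", or None.
    """
    hit = exact_store.get(exact_cache_key(request))
    if hit is not None:
        return hit, "exact"
    query_vec = embed(request)  # the embedding cost is paid only on an exact miss
    best_score, best_response = 0.0, None
    for vec, response in semantic_entries:  # a vector store would use ANN search
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= threshold:
        return best_response, "semantic"
    return None, None
```

The linear scan is only for illustration; at 10M entries the nearest-neighbor search has to be an ANN index in the vector store.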

Cache stampede prevention. When a popular cached response expires, hundreds of simultaneous requests all miss the cache and rush to call the LLM. Fix: use a distributed lock on cache miss. One request acquires the lock and calls the LLM; the others wait on the lock key. When the lock is released (after the response is cached), all waiting requests read from cache. The lock timeout must be longer than the expected LLM call duration.
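Within a single process, the same idea looks like the sketch below: a per-key lock ensures only one caller runs the expensive loader while the others wait and then read the cached value. This is an in-process stand-in, and the class name is illustrative. Across gateway replicas the lock would live in Redis (for example a key set with NX and a TTL), where the timeout rule above applies.

```python
import threading


class SingleFlightCache:
    """Cache where concurrent misses for one key trigger a single load."""

    def __init__(self):
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def get_or_load(self, key, loader):
        if key in self._cache:
            return self._cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check after acquiring: another caller may have filled the cache
            # while this one was waiting on the lock.
            if key not in self._cache:
                self._cache[key] = loader()
            return self._cache[key]
```

This version has no expiry and no timeout. A production variant also needs a way for waiters to give up and call the provider themselves if the lock holder dies.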

The same semantic caching pattern — embedding model, vector store, cosine threshold — is covered in full in Design a RAG Pipeline, including cache invalidation tied to document update events via CDC.

Provider-side prompt caching is complementary, not a replacement. Anthropic caches prompt prefixes marked with cache_control: {type: "ephemeral"} (minimum 1,024 tokens, 5-minute TTL refreshed on each cache hit, with an optional 1-hour TTL at a higher write price; 90% cost reduction on reads at $0.30/M vs. $3.00/M). OpenAI has offered automatic prefix caching since October 2024 for inputs over 1,024 tokens (50% discount, no write premium, 5–10-minute TTL). The gateway must track these provider-side TTLs to avoid assuming cached tokens will be present on the next request.

Virtual keys and multi-tenant budget enforcement

LiteLLM's budget hierarchy: organization → team → virtual key → end-user → tag. Each level can have independent limits; the most restrictive binding wins.

Budget enforcement uses a dual-store pattern:

Redis holds current-period spend as a float (INCRBYFLOAT). The atomic increment takes ~11 µs at 5,000 RPS. Redis is the source of truth for real-time enforcement.
PostgreSQL holds the authoritative spend history, written asynchronously after each response returns. If Redis is flushed or crashes, the gateway rehydrates Redis from PostgreSQL on startup.

Virtual key validation: hash the incoming key, look up in Redis (fast path), fall back to PostgreSQL on a Redis miss. The plaintext key is shown exactly once at creation time — the gateway stores only the SHA-256 hash.
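The validation path can be sketched as follows. A dict stands in for the Redis lookup and `pg_lookup` is a hypothetical callable for the PostgreSQL query; the key format and function names are illustrative assumptions.

```python
import hashlib
import secrets


def hash_key(plaintext):
    """SHA-256 hex digest; the only form of the key the gateway stores."""
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def new_virtual_key(team):
    """Create a key. The plaintext is returned once and never stored."""
    plaintext = f"vk-{team}-{secrets.token_urlsafe(24)}"
    return plaintext, hash_key(plaintext)


def validate(plaintext, redis_cache, pg_lookup):
    """Return the key record, or None if the key is unknown or revoked."""
    hashed = hash_key(plaintext)
    record = redis_cache.get(hashed)      # fast path
    if record is None:
        record = pg_lookup(hashed)        # slow path on a cache miss
        if record is not None:
            redis_cache[hashed] = record  # populate the cache for next time
    if record is None or record.get("revoked_at") is not None:
        return None
    return record
```

A plain SHA-256 is acceptable here because virtual keys are long random strings, not user-chosen passwords. Cached records need a short TTL or explicit invalidation so that a revocation takes effect quickly.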

Budget durations are calendar-aligned: a monthly budget resets at the start of each calendar month, not 30 days after key creation. The budget_reset_at column stores the next reset timestamp.
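For the monthly case, computing the next value of budget_reset_at is a one-liner plus a year rollover. The sketch assumes budgets reset at midnight UTC; a gateway that bills in tenant-local time would need a timezone per tenant.

```python
from datetime import datetime, timezone


def next_monthly_reset(now):
    """First instant of the next calendar month, in UTC."""
    if now.month == 12:
        year, month = now.year + 1, 1
    else:
        year, month = now.year, now.month + 1
    return datetime(year, month, 1, tzinfo=timezone.utc)


print(next_monthly_reset(datetime(2025, 1, 31, 23, 59, tzinfo=timezone.utc)))
```

A key created on January 31 therefore gets its first reset one minute later in this example, which is the intended behavior for calendar-aligned budgets and worth stating clearly to tenants.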

Provider API key load balancing across a key pool distributes TPM quota across multiple keys for the same provider. Azure OpenAI ties request and token rate limits together, allocating 6 RPM per 1,000 TPM (RPM = TPM × 6 ÷ 1,000). PTU (Provisioned Throughput Units) provide reserved throughput at lower per-token cost and stable max latency; Azure APIM routes overflow from exhausted PTU deployments to PAYG backends as a spillover pattern.

Tenant isolation in the cache. Shared caches across tenants create timing side-channel attacks: an adversarial tenant measures response latency (cache hit: <5 ms; LLM call: 2–5 s) to infer what other tenants recently prompted, without seeing actual content. Fix: per-tenant cache namespaces. Prefix all cache keys with tenant:{id}:.... The hit rate within a single tenant's namespace is lower than a cross-tenant shared cache, but this is a non-negotiable security boundary for any production multi-tenant deployment.

Observability hooks

Every request generates a structured log entry, written asynchronously after the response is returned to avoid adding latency to the critical path (Portkey exports via async OpenTelemetry; LiteLLM sends spend updates to PostgreSQL post-request).

The minimum log record per request:

request_id, tenant_id, team_id, virtual_key_id, provider, model,
input_tokens, output_tokens, cached_tokens (provider-side),
cost_usd, ttft_ms, total_latency_ms, finish_reason,
cache_hit (exact | semantic | null), routing_strategy_used,
guardrail_triggered (true | false), timestamp

Derived metrics per provider and route (aggregated at 1-minute intervals):

P50/P95/P99 TTFT and total latency
Error rate (4xx and 5xx separately)
Cache hit rate (exact and semantic)
Token volume and dollar cost per tenant/team/model
Cooldown activation frequency per deployment (a leading indicator of provider health)

Portkey tracks inter-token latency (ITL) alongside TTFT — the time between successive tokens in a stream. A high TTFT with low ITL means the provider is taking a long time to begin generating (prefill bottleneck). A low TTFT with high ITL means the provider is generating slowly (decode bottleneck). These point to different remediation strategies and are covered in detail in Design an LLM Inference Serving System.

Cloudflare AI Gateway stores up to 10 million log entries per Durable Object per gateway, with some individual log entries reaching 50 MB (full prompt + response for long-context calls). For compliance deployments, storing the full prompt content is required for audit. For cost-optimized deployments, store only the hash of the prompt, the token counts, and the cost.

Guardrail middleware

The middleware pipeline runs on two sides of the provider call. See Design an AI Guardrails System for the full treatment; the gateway's role is to define the hook interface and enforce policy routing.

Input scanners run before the request reaches the model:

PII detection and redaction. Replace PII tokens with placeholders ([PERSON_NAME], [EMAIL], [SSN]). The gateway stores the mapping so output deanonymization can restore the original values. Kong's AI PII Sanitizer covers 20 PII categories across 9 languages.
Prompt injection detection. LlamaFirewall (Meta, arXiv 2505.03574, 2025) reduced successful prompt injection attacks by 82% in production. PII-detection classifiers in the same framework achieve F1 ≈ 0.95 (see PII redaction above).
Topic denylist. Block requests matching configured topic categories (competitor mentions, off-scope domains).
Budget pre-check. Reject immediately if the estimated cost of this request would exceed the remaining key budget.
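
The budget pre-check is simple arithmetic once an input token estimate exists. The sketch below assumes a worst case in which the model emits the full `max_output_tokens`; the function names and the per-million-token prices in the usage note are illustrative, not real price-list values.

```python
def worst_case_cost_usd(est_input_tokens: int,
                        max_output_tokens: int,
                        input_price_per_mtok: float,
                        output_price_per_mtok: float) -> float:
    """Upper bound on request cost, assuming the output limit is fully used."""
    return (est_input_tokens * input_price_per_mtok
            + max_output_tokens * output_price_per_mtok) / 1_000_000


def passes_budget_precheck(est_input_tokens: int,
                           max_output_tokens: int,
                           input_price_per_mtok: float,
                           output_price_per_mtok: float,
                           remaining_budget_usd: float) -> bool:
    """Admit the request only if its worst-case cost fits the remaining budget."""
    cost = worst_case_cost_usd(est_input_tokens, max_output_tokens,
                               input_price_per_mtok, output_price_per_mtok)
    return cost <= remaining_budget_usd
```

With hypothetical prices of $2.50 and $10.00 per million input and output tokens, a 10,000-token prompt with a 1,000-token output cap has a worst-case cost of about $0.035, so it passes against a $0.05 remaining budget and is rejected against $0.03.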

Output scanners run after the provider returns:

Content moderation. Check for toxicity, hate speech, self-harm content.
PII deanonymization. Restore redacted tokens in the response using the per-request mapping.
Malicious URL detection. Scan generated text for phishing or malware URLs.
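
The redaction and deanonymization pair can be sketched as a round trip through a per-request mapping. This is a toy illustration under loud assumptions: it covers only emails and US SSNs with simple regular expressions (real scanners use trained classifiers), and it numbers the placeholders (`[EMAIL_1]`) so that each one maps back to exactly one original value.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; production PII detection is far broader.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace PII with numbered placeholders; return the text and the mapping."""
    mapping: Dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def replace(match: "re.Match[str]", label: str = label) -> str:
            placeholder = f"[{label}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(replace, text)
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Output-side deanonymization using the mapping stored for this request."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

The mapping must live only for the duration of the request and must never be written to logs, otherwise the redaction is undone at rest.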

For compliance use cases the gateway buffers the full response before forwarding, so output scanners have the complete text. For latency-critical paths, stream the response directly and run output scanning on a shadow copy. The two modes are configured per-route, not globally.

Secret management for provider keys

Provider API keys define the blast radius. A single leaked key gives an attacker access to that provider's models at your expense, limited only by the account's quota and spend cap.

The gateway stores provider keys encrypted at rest using AES-256 with envelope encryption: a data encryption key (DEK) encrypts the API key, and a key encryption key (KEK) stored in a secrets manager (AWS KMS, HashiCorp Vault, or Cloudflare Secrets Store) encrypts the DEK. Cloudflare AI Gateway uses a two-level AES key hierarchy distributed globally via Quicksilver for low-latency access.

RBAC on key management: developers can create and revoke virtual keys but cannot read the underlying provider keys. Only the gateway data plane (at request time) decrypts provider keys, and only into memory. Audit logging records every key management action (creation, rotation, revocation, read attempts) for SOC 2 compliance.

Key rotation: generate a new key version in the provider's developer portal, store it in the secrets manager, and have the gateway pick it up via the admin API without downtime. Automate rotation on a 90-day schedule.

Streaming SSE passthrough

Streaming is where most gateway implementations fail silently. The misconfiguration is almost always in the proxy layer, not the application code.

A streaming LLM response is a sequence of server-sent events:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"choices":[{"delta":{"content":"Hello"},...}]}

data: {"choices":[{"delta":{"content":" world"},...}]}

data: [DONE]
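
A gateway or client that needs to read this stream has to split on `data:` lines and stop at the sentinel. The following is a minimal sketch of that parsing, assuming the OpenAI-style delta shape shown above and assuming each event carries its payload on a single `data:` line (the SSE format itself allows multi-line data). The function name `iter_sse_deltas` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from an SSE stream until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content
```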

Pitfalls:

nginx proxy_buffering. Default is on. The proxy buffers chunks until its buffer fills (typically 8 KB or 16 KB) before forwarding to the client. The user sees silence, then a response dump. Fix: proxy_buffering off for the SSE location block.
gzip compression. nginx cannot compress a stream it has not finished receiving. If gzip is enabled for text/event-stream, the proxy buffers the whole response first. Fix: gzip off or exclude text/event-stream from gzip rules.
HTTP/2 and Kong. Kong's SSE passthrough requires HTTP/1.1; HTTP/2 multiplexing interferes with chunk framing. Enforce proxy_http_version 1.1 for LLM backends.
Client disconnect cancellation. When a client disconnects mid-stream, the gateway must cancel the upstream LLM request. A naive implementation lets the upstream call run to completion, billing for tokens the client never received. Propagate connection closure upstream; implement graceful shutdown of streaming requests.
Token counting in streaming mode. Individual chunks are not annotated with token counts. Providers that report usage at all send it once, in a final chunk just before the data: [DONE] sentinel, and some do so only when the request opts in (OpenAI's stream_options.include_usage, for example); others omit it entirely. Kong's AI Rate Limiting Advanced estimates token counts per chunk for providers that omit them, to enable TPM metering without waiting for the final event.
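
The client-disconnect pitfall is the one most easily shown in code. The sketch below models the upstream provider call as an async generator and a client disconnect as an exception raised by `send`; both are stand-ins, and the names `relay`, `fake_upstream`, and `demo` are hypothetical. The point is that the relay closes the upstream in a `finally` block instead of letting it run to completion.

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable, Dict, List, Tuple


async def relay(upstream: AsyncGenerator[str, None],
                send: Callable[[str], Awaitable[None]]) -> None:
    """Forward chunks to the client; always close the upstream on exit."""
    try:
        async for chunk in upstream:
            await send(chunk)
    finally:
        # Closing the generator runs its cleanup. In a real gateway this is
        # where the upstream HTTP connection would be closed.
        await upstream.aclose()


async def fake_upstream(state: Dict[str, object]) -> AsyncGenerator[str, None]:
    """Simulated provider stream that records how much it generated."""
    try:
        for i in range(1000):
            await asyncio.sleep(0)
            state["generated"] = int(state["generated"]) + 1
            yield f"tok{i}"
    finally:
        state["closed"] = True


async def demo() -> Tuple[Dict[str, object], List[str]]:
    state: Dict[str, object] = {"generated": 0, "closed": False}
    sent: List[str] = []

    async def send(chunk: str) -> None:
        if len(sent) >= 3:
            raise ConnectionResetError("client went away")
        sent.append(chunk)

    try:
        await relay(fake_upstream(state), send)
    except ConnectionResetError:
        pass  # expected: the simulated client disconnected
    return state, sent
```

Running `asyncio.run(demo())` stops the simulated provider after the fourth chunk instead of the thousandth, which is the billing difference the pitfall describes.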

TTFT added by the gateway: 10–50 ms for auth + rate-limit check + routing. Measure it as the time from the client's first request byte to the gateway forwarding the first SSE event, minus the provider's own time to first token. This is the gateway's contribution to the overall TTFT the user experiences.

Edge cases & gotchas

Retry storm amplification. When a provider returns 429s, all gateway replicas retry simultaneously. A 30-second provider blip becomes a 10-minute cascading incident. A naive time.sleep(1); retry() in a for-loop across 10 replicas sustains the overload instead of reducing it. Fix: use exponential backoff with full jitter (not equal jitter), respect Retry-After headers, and cap the global retry budget at 5–15% of base traffic.
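
A sketch of the fix, under stated assumptions: `backoff_delay` implements full jitter (a uniform draw between zero and the exponential ceiling) and never waits less than a provided Retry-After value, while `RetryBudget` caps retries as a fraction of observed requests. The names, the default base and cap, and the 10% ratio are illustrative, and the budget counters would need a periodic window reset in practice.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int,
                  base_s: float = 0.5,
                  cap_s: float = 30.0,
                  retry_after_s: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    delay = rng() * ceiling
    if retry_after_s is not None:
        # Never retry sooner than the provider asked.
        delay = max(delay, retry_after_s)
    return delay


class RetryBudget:
    """Allow retries only up to a fixed ratio of first-attempt requests."""

    def __init__(self, ratio: float = 0.10) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        if self.retries + 1 > self.ratio * self.requests:
            return False  # budget exhausted: fail fast instead of retrying
        self.retries += 1
        return True
```

Full jitter spreads the replicas' retries across the whole interval, whereas equal jitter keeps half of the delay fixed and so leaves more of the retries clustered.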

Semantic cache threshold misconfiguration. One documented financial services incident: a user saying "I don't want this business account anymore" matched "How do I close my business account?" at 88.7% similarity. The cache served the account-closure instructions stored for the second question, and the closure was executed for the first user, who only wanted to cancel a subscription. The fix is domain-specific per-category thresholds, not a global 0.92 default, plus a human review gate for any irreversible action.

Token-counting blindness. A team rate-limits on RPM while ignoring TPM. A batch job that submits 100 requests per minute, each with a 20,000-token input, is "within limits" by RPM but consumes 2M tokens per minute against a provider quota sized for 100 average-length requests. The morning peak of interactive users then hits an empty TPM bucket. Fix: always enforce TPM limits, and read actual token counts from provider usage responses, not estimates.
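
One way to picture TPM enforcement is a reserve-then-reconcile counter: reserve an estimate before the call, then correct it with the provider's reported usage afterwards. The sketch below is a single-process, fixed-window version with hypothetical names; a multi-replica gateway would keep the counter in a shared store, and a reconcile that lands after the window has rolled over is simply clamped here.

```python
import time
from typing import Callable


class TokenWindowLimiter:
    """Fixed 60-second window that counts tokens rather than requests."""

    def __init__(self, tpm_limit: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def _roll(self) -> None:
        now = self.clock()
        if now - self.window_start >= 60:
            self.window_start = now
            self.used = 0

    def reserve(self, estimated_tokens: int) -> bool:
        """Admit the request only if the estimate fits in the current window."""
        self._roll()
        if self.used + estimated_tokens > self.tpm_limit:
            return False
        self.used += estimated_tokens
        return True

    def reconcile(self, estimated_tokens: int, actual_tokens: int) -> None:
        """Replace the estimate with the provider-reported usage."""
        self.used = max(0, self.used + actual_tokens - estimated_tokens)
```

Against a 1M TPM limit, the batch job from the example (100 requests of 20,000 tokens) has only its first 50 requests admitted in the window, even though all 100 would pass an RPM check.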

Single API key as blast radius. One shared provider key means any team's spike — intentional or accidental — exhausts quota for all users. Pool multiple API keys per provider; load-balance across them; assign separate key pools per priority tier (real-time user traffic vs. overnight batch jobs). Overnight batch jobs can saturate a shared key pool just before morning peak users arrive.

Cost explosion from failed partial completions. An LLM call that aborts at the 90% mark bills for full input tokens and partial output tokens, then the retry bills the full input again. A 10,000-token input that fails and is retried three times is billed for 40,000 input tokens, not 10,000. Monitor and alert on partial completion failures separately from clean error metrics.

Provider API schema churn. Anthropic, OpenAI, Google, and Mistral update their API schemas, error codes, and streaming event formats regularly. The normalization layer that looked complete at launch drifts within months. Version-pin provider SDKs, run integration tests against real provider sandbox APIs in CI, and subscribe to provider changelogs. This is an ongoing maintenance commitment, not a one-time build.

Redis as hidden bottleneck. Adding Redis for distributed rate limiting solves cross-replica consistency but makes Redis a throughput ceiling. At 10k req/s with 4–5 Redis ops per request (RPM INCR, TPM INCRBY, budget INCRBYFLOAT, key lookup), Redis sees 40–50k ops/s, still within one node's 100k+ ops/s limit. At 100k req/s, the same pattern needs 400–500k ops/s and Redis becomes the bottleneck. Plan for Redis clustering and implement a local in-memory fallback during Redis outages (fail-open, not fail-closed, to avoid taking the gateway down).

Over-relying on a single aggregator for resilience. OpenRouter's August 2025 database outage took the entire service down for 50 minutes, including all provider-level failover logic, because the control plane database was unavailable. If you use an aggregator for resilience, the aggregator itself must not be a single point of failure. True resilience requires either a self-hosted gateway, multi-aggregator routing at the application layer, or a direct-to-provider fallback path.

Trade-offs to discuss in an interview

Hosted SaaS vs. self-hosted. Cloudflare AI Gateway requires no infrastructure to operate and runs at the network edge. LiteLLM requires you to operate Docker containers, PostgreSQL, and Redis. For financial services, healthcare, or any deployment where prompt content cannot leave your VPC, self-hosted is not optional. For a startup that wants to start logging costs in an afternoon, hosted SaaS wins.

Normalization vs. native passthrough. Normalizing all providers to the OpenAI schema means any existing OpenAI SDK client works unchanged — a massive developer-experience win. But Anthropic's extended thinking, Google's grounding via search, and OpenAI's structured outputs with schema enforcement are all provider-specific features that a strict normalization layer either discards or requires custom translation to expose. The practical answer: normalize for the common case, add escape hatches for provider-specific features as first-class route configuration.

Token-based vs. request-based rate limiting. Request-based is simpler to implement and easier to reason about. Token-based is the only accurate representation of the underlying resource. In a mixed workload (short classification calls + long document analysis), RPM limits are essentially meaningless. Build token-based from the start; the incremental complexity is one Redis INCRBY with the token count instead of one INCR.

Semantic cache threshold as a dial, not a constant. A single global threshold is wrong for most deployments. FAQ bots can tolerate 0.88 similarity; financial instruction execution needs 0.98. Per-route, per-category threshold configuration is the right model. The added operational complexity is real but so is the semantic cache's failure mode at a misconfigured threshold.
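
Per-route thresholds amount to a lookup before the similarity comparison. The sketch below uses the 0.88 and 0.98 values from the paragraph above and 0.92 as the fallback; the route names, the table `ROUTE_THRESHOLDS`, and the function names are assumptions, and real systems would get similarity scores from a vector index, not from a pairwise computation.

```python
import math
from typing import Dict, Sequence

ROUTE_THRESHOLDS: Dict[str, float] = {
    "faq": 0.88,                # tolerant: a near miss is low-stakes
    "financial_actions": 0.98,  # strict: a wrong hit may be irreversible
}
DEFAULT_THRESHOLD = 0.92


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return dot / norm


def is_cache_hit(route: str,
                 query_vec: Sequence[float],
                 cached_vec: Sequence[float]) -> bool:
    """Apply the route's own threshold, falling back to the global default."""
    threshold = ROUTE_THRESHOLDS.get(route, DEFAULT_THRESHOLD)
    return cosine_similarity(query_vec, cached_vec) >= threshold
```

A pair of prompts at roughly 0.90 similarity is served from cache on the FAQ route but falls through to the model on the financial route, which is the behavior the incident above lacked.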

Streaming passthrough vs. buffered response. Streaming wins for interactive applications (chatbots, code assistants, live document editing). Buffered mode is required for output moderation (you cannot moderate a response you have not finished receiving) and for semantic response caching (you cannot store an incomplete response). The correct architecture routes interactive traffic to the passthrough path and compliance/audit traffic to the buffer path, not a global setting.

ML-based routing vs. rule-based. Rule-based routing (cost threshold, latency threshold, capability tags) is deterministic, debuggable, and requires no training data. ML-based routing (RouteLLM, Bedrock IPR) can reduce cost by 30–98% by matching each prompt's complexity to the right model tier, but it adds 90–540 ms of routing latency, requires labeled training data, and creates a model-of-models maintenance problem. Start rule-based. Add ML routing when you can measure whether it improves cost per correct answer.
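
For comparison, a rule-based router can be a filter followed by a minimum. This sketch assumes each deployment advertises capability tags, an observed P95 latency, and a blended price; the `Deployment` fields and `pick_deployment` are hypothetical names, and the rule shown (cheapest deployment that satisfies the capability and latency constraints) is one reasonable policy among several.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    name: str
    cost_per_mtok_usd: float
    p95_latency_ms: float
    capabilities: FrozenSet[str]


def pick_deployment(deployments: Sequence[Deployment],
                    required: FrozenSet[str],
                    max_latency_ms: float) -> Optional[Deployment]:
    """Cheapest deployment that has every required tag and meets the latency cap."""
    eligible = [
        d for d in deployments
        if required <= d.capabilities and d.p95_latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda d: d.cost_per_mtok_usd, default=None)
```

Every decision this router makes can be reproduced from its inputs, which is the debuggability argument for starting here.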

Things you should now be able to answer

A team asks why their LLM costs spiked despite staying within their RPM limit. What is the most likely cause, and how does the gateway prevent it?
A client using the Anthropic SDK wants to route some requests through your gateway. What changes do they need to make, and why do you choose the OpenAI format as the gateway's public API?
Walk through the full request lifecycle for a streaming non-cached request, identifying every component that adds latency to TTFT.
Your gateway runs 10 replicas. A team has a 100 TPM limit. Each replica enforces this locally. What goes wrong, and how does Redis fix it?
A new model from Mistral is 40% cheaper than GPT-4o for a specific use case. How do you roll it out to 5% of traffic, measure quality, and promote to 100% if it passes?
Two tenants share a semantic cache. Describe the side-channel attack this enables and the correct mitigation.
An OpenRouter database outage takes down all provider failover. What architectural decision caused this, and how would you redesign for true resilience?
A safety team wants every response containing a phone number to be redacted before it reaches the client. Where exactly in the gateway architecture does this happen, and what is the trade-off for streaming responses?
You set the semantic cache similarity threshold to 0.85. Describe a realistic failure mode and how you would detect it in production.
A provider starts throttling all requests and returning 429. Walk through the gateway's cooldown, fallback, and retry behavior step by step.

Frequently asked questions

What is an LLM gateway and how is it different from a regular API gateway?

A regular API gateway handles authentication, rate limiting, and routing for services your team controls. An LLM gateway does all of that, but the backends are third-party model providers with their own incompatible schemas, token-based quotas, non-deterministic latency, and steep per-token pricing. The gateway must normalize heterogeneous provider APIs into a unified interface, enforce token-per-minute limits (not just requests), manage dollar budgets, cache semantically equivalent prompts, and handle streaming SSE passthrough — concerns a generic gateway never needed to address.

Why does rate limiting on requests-per-minute fail for LLM traffic?

A single 128k-token GPT-4o request consumes as much quota as roughly 2,500 short 50-token requests but registers as exactly one in an RPM counter. A batch job running long-context summarization can exhaust TPM quotas in seconds while staying within RPM limits, starving all real-time user traffic. Production LLM gateways must enforce TPM limits alongside RPM limits and read actual token counts from provider usage responses — not estimates — for accurate post-call metering. Kong AI Gateway reads real counts from provider responses; Azure APIM uses RPM = TPM ÷ 6 as a conversion heuristic.

What similarity threshold should I use for semantic caching, and why is tuning it critical?

The widely-cited starting point is 0.92 cosine similarity. Below 0.80 the cache returns semantically different answers, and even higher values can misfire: in one documented financial services incident, a question about not wanting a business account matched a prior question about account closure at 88.7% similarity and incorrectly triggered the closure flow. Above 0.95–0.97 the gain over an exact hash cache diminishes sharply, since most matches at that similarity level differ only in whitespace or trivial variations that a normalized exact hash would also catch. Production hit rates are 20–45% blended, not the 95% some vendors claim; FAQ bots hit 40–60%, RAG Q&A 15–25%, and agentic tool calls only 5–15%. Domain-specific per-category thresholds outperform a single global value.

How do retry, hedge, and circuit-break differ, and when should I use each?

Retries are sequential: if the first call fails or times out, you wait and try again. For LLMs where a timeout takes 8–10 seconds, three sequential retries create a ~30-second P99 latency cliff. Hedged requests fire a parallel duplicate to a second provider at the P90 latency mark — dramatically reducing P99 without waiting for failure. Circuit breakers stop routing to known-down providers entirely, preventing the retry storm. The design rule: hedge for latency reduction, circuit-break for availability, retry only for quick transient errors on idempotent requests, and cap global retry budget at 5–15% of base traffic to prevent storm amplification.
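
A hedged request can be sketched with asyncio: start the primary call, and only if it has not finished by the hedge deadline, start a backup and take whichever completes first. The function name `hedged_call` and the zero-argument callables are assumptions. This version propagates an exception from whichever call finishes first; it does not fall back to the other call, which a production implementation would likely add.

```python
import asyncio
from typing import Awaitable, Callable


async def hedged_call(primary: Callable[[], Awaitable[str]],
                      backup: Callable[[], Awaitable[str]],
                      hedge_after_s: float) -> str:
    """Fire the backup only if the primary is still pending at the deadline."""
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # primary beat the deadline; no hedge was sent

    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel the loser so the gateway stops paying for tokens nobody will read.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()
```

Setting `hedge_after_s` to the observed P90 latency means roughly one request in ten triggers a duplicate, which bounds the extra cost of hedging.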

What happens to streaming SSE when the gateway buffers the response?

If the gateway buffers the full response before forwarding — the default behavior for nginx with proxy_buffering on, or when gzip is enabled for text/event-stream — the client receives a complete response dump instead of a token-by-token stream. The first-token latency (TTFT) appears as the full generation time. This breaks the conversational feel that made users choose streaming in the first place. Fix: set proxy_buffering off, disable gzip for text/event-stream, and on Kong specifically enforce HTTP/1.1 because HTTP/2 multiplexing is incompatible with SSE chunk framing.
