API Gateways & the Backend-for-Frontend Pattern
The single front door to a microservice backend. What an API gateway does, why you add one, the BFF pattern, and how not to turn it into a monolith.
A microservice architecture hands you real benefits — independent deployability, technology diversity, fault isolation — but it also hands you a problem: you now have dozens of services and every client needs to talk to most of them. Without some structure, that leads to dozens of client-side round-trips, every service re-implementing auth and rate limiting, and a coupling explosion where clients must know the topology of your backend. The API gateway is the standard answer. The BFF pattern is the refinement.
The problem without a gateway
Imagine an e-commerce app where the product page needs data from four services: product details, inventory, pricing, and reviews. Without a gateway, the client does this:
sequenceDiagram
participant C as Client
participant P as Product Service
participant I as Inventory Service
participant PR as Pricing Service
participant R as Reviews Service
C->>P: GET /products/42
C->>I: GET /inventory?product=42
C->>PR: GET /prices?product=42
C->>R: GET /reviews?product=42
P-->>C: product details
I-->>C: stock level
PR-->>C: current price
R-->>C: reviews
Note over C: Client assembles page
That's four round-trips from the client. On a mobile device on a 4G network with ~60ms RTT per call, that is about 60ms of pure network overhead if the calls run fully in parallel and up to 240ms if they run serially, before any rendering and not counting service latency. Meanwhile each service has to authenticate the caller, enforce rate limits, and log the request. That logic gets duplicated across the fleet, diverges over time, and becomes impossible to audit consistently.
The coupling problem is just as bad. The client knows service hostnames. Rename or split a service and every client version in the wild breaks. Add mTLS internally and you now have to update every client SDK. Add a new service and you teach every client about it.
What an API gateway does
A gateway is a reverse proxy with a richer feature set. It sits at the north-south boundary — inbound from the internet — and handles the concerns that are common to every API call.
flowchart TD
C[Client request] --> GW[API Gateway]
GW --> TLS["TLS termination\n(HTTPS → HTTP internally)"]
GW --> AUTH["Auth/Authz\n(JWT verify, RBAC check)"]
GW --> RL["Rate limiting\n(token bucket per client key)"]
GW --> ROUTE["Routing\n(path → service mapping)"]
GW --> CACHE["Response caching\n(GET /products/42, TTL 30s)"]
GW --> TRANSFORM["Request/response transform\n(field rename, protocol translate)"]
GW --> AGG["Aggregation\n(fan-out to N services, merge)"]
GW --> OBS["Observability\n(request log, trace, latency histogram)"]
ROUTE --> SVC[Downstream services]
style GW fill:#ff6b1a,color:#0a0a0f
style SVC fill:#0e7490,color:#fff
TLS termination
The gateway holds the TLS certificate and terminates HTTPS at the edge. Internal communication between the gateway and services can be plain HTTP or gRPC on a trusted private network, or mTLS if your threat model requires it. Either way, individual services no longer have to hold or renew the public-facing certificate.
Authentication and authorization
The gateway verifies the caller's identity on every request, so downstream services can trust that anything arriving from behind the gateway is already authenticated. The practical mechanism matters a lot here.
The most common approach is JWT with local verification: the gateway validates the token's signature and expiry without making a network call. With HS256 the gateway holds a pre-configured shared secret; with RS256 or ES256 it fetches the auth service's public key from a JWKS endpoint at startup and caches it in memory, refreshing on a schedule. Either way, no outbound call is made on the hot path. It's fast (HS256 takes about 0.005ms; RS256 is slower at roughly 0.06–0.14ms), and because verification is pure CPU work it scales horizontally with the gateway fleet.
sequenceDiagram
participant C as Client
participant GW as API Gateway
participant AS as Auth Service
participant SVC as Backend Service
Note over GW,AS: Startup — key cached, not per-request
GW->>AS: fetch public key (JWKS)
AS-->>GW: public key cached in memory
C->>GW: GET /orders/99 + Bearer token
GW->>GW: verify JWT signature locally (HS256 ~0.005ms, RS256 ~0.06–0.14ms)
GW->>GW: check expiry, scopes
GW->>SVC: forward request + verified identity header
SVC-->>GW: response
GW-->>C: response
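As a rough illustration of what local verification involves, here is a minimal HS256 sketch using only the Python standard library. The helper names (`sign_hs256`, `verify_hs256`) and the demo secret are assumptions made for illustration, and `sign_hs256` merely stands in for the auth service. A production gateway would normally rely on a maintained JWT library and would also validate claims such as issuer and audience.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url_encode(raw: bytes) -> str:
    """Base64url without padding, as used in compact JWTs."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Demo helper standing in for the auth service that issues tokens."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry check out, otherwise None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        # Pin the algorithm; never let the token choose how it gets verified.
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
            return None
        claims = json.loads(b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    if not isinstance(claims, dict):
        return None
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        return None
    return claims
```

Everything here is in-memory CPU work, which is why no auth-service call appears on the request path. The `now` parameter exists only so that expiry can be exercised deterministically.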
The alternative is opaque token introspection: the gateway calls the auth service on every request to exchange an opaque token for user identity. Simple to implement, but it adds a synchronous dependency and a full network hop. At 50k req/s the auth service's throughput becomes the ceiling for the entire API unless you cache aggressively. A hybrid approach splits the difference: introspect once, issue a short-lived JWT, cache it locally for its TTL.
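The caching half of that trade-off can be sketched as a small TTL cache in front of the introspection call. This is a simplification of the hybrid described above: it caches the resolved identity directly rather than minting a short-lived JWT. `IntrospectionCache` and the injected `introspect` callable are hypothetical names; in a real gateway `introspect` would be the network call to the auth service.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class IntrospectionCache:
    """Resolve opaque tokens, hitting the auth service at most once per token per TTL."""

    def __init__(self, introspect: Callable[[str], Optional[dict]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._introspect = introspect
        self._ttl = ttl_seconds
        self._clock = clock
        # token -> (expiry time, identity)
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def identity_for(self, token: str) -> Optional[dict]:
        now = self._clock()
        cached = self._entries.get(token)
        if cached is not None and cached[0] > now:
            return cached[1]  # served locally, no network hop
        identity = self._introspect(token)  # the synchronous hop to the auth service
        if identity is None:
            self._entries.pop(token, None)  # do not keep stale entries for rejected tokens
            return None
        self._entries[token] = (now + self._ttl, identity)
        return identity
```

The cost of the cache is revocation lag: a token revoked at the auth service keeps working at this gateway until its cached entry expires, so the TTL is effectively the revocation window.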
Authorization — does this caller have permission for this endpoint? — can live at the gateway for coarse-grained checks (is this an authenticated end-user or an internal service?) but should not try to do fine-grained authorization. The gateway doesn't have your domain model. Leave "can this user read this specific order?" to the service that owns orders.
Rate limiting
The gateway is the right place for rate limiting because it sees every request before it reaches any service. A per-API-key or per-IP token-bucket counter in a shared Redis gives you cluster-wide enforcement. This protects downstream services from traffic spikes and DDoS without each service having to implement its own limiter.
As a back-of-the-envelope sanity check: a Redis INCR + EXPIRE round-trip is ~0.1–0.2ms (dominated by network latency, not Redis processing time). At 50k req/s, each hitting one rate-limit key lookup, that's 50k Redis round-trips/sec when INCR and EXPIRE are pipelined or wrapped in one script, comfortably within Redis's single-instance throughput ceiling of ~100k–200k ops/sec (Redis's own benchmarks show ~72k ops/sec with a large key space and ~180k with 50 parallel clients; pipelining or Redis 6+ threaded I/O push it higher still).
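To make the token-bucket idea concrete, here is a minimal in-process sketch. The dictionary stands in for the shared Redis store, and `TokenBucketLimiter` with its parameters is an illustrative name rather than any particular gateway's API. In a real fleet the read-modify-write below would have to run atomically inside the shared store (for example as a server-side script) so that concurrent gateway instances cannot both spend the same token.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-client token bucket: bursts up to `capacity`, sustained `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        # client key -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_key: str, cost: float = 1.0) -> bool:
        now = self._clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(client_key, (self._capacity, now))
        # Credit tokens earned since the last update, never exceeding capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._refill)
        if tokens < cost:
            self._buckets[client_key] = (tokens, now)
            return False
        self._buckets[client_key] = (tokens - cost, now)
        return True
```

With `capacity=3` and `refill_per_second=1`, a client can burst three requests, is rejected on the fourth, and earns one more request per elapsed second.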
Routing
Path-based, header-based, or weighted routing maps an incoming request to the right downstream service:
GET /products/42 → Product Service
GET /orders/99 → Order Service
POST /checkout → Checkout Service
GET /products/42?version=v2 → Product Service v2 (canary)
Weighted routing powers canary deployments: send 5% of traffic to the new version, watch error rates, ramp up. The gateway is already there; you just adjust a weight.
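A routing table with a weighted canary can be sketched in a few lines. The upstream names and the 95/5 split below are hypothetical, and real gateways express this in their own configuration formats rather than in code.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical routing table: path prefix -> weighted list of upstream names.
ROUTES: Dict[str, List[Tuple[str, int]]] = {
    "/products": [("product-service-v1", 95), ("product-service-v2", 5)],
    "/orders": [("order-service", 100)],
    "/checkout": [("checkout-service", 100)],
}


def match_prefix(path: str) -> Optional[str]:
    """Longest matching prefix wins, and only on whole path segments."""
    candidates = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    return max(candidates, key=len) if candidates else None


def pick_upstream(path: str, rng: random.Random) -> Optional[str]:
    """Resolve a request path to an upstream, drawing by weight for canaries."""
    prefix = match_prefix(path)
    if prefix is None:
        return None  # the gateway would answer 404 here
    names = [name for name, _ in ROUTES[prefix]]
    weights = [weight for _, weight in ROUTES[prefix]]
    return rng.choices(names, weights=weights, k=1)[0]
```

Drawing randomly per request means the same user can bounce between versions. Hashing a stable client identifier into the weight range instead keeps each user pinned to one version for the duration of the canary.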
Protocol translation
Clients use HTTP/JSON. Internal services might use gRPC — binary, strongly typed, more efficient over the wire. The gateway can translate: accept an HTTP/JSON POST from the client, re-encode it as a gRPC request to the downstream service, then re-encode the protobuf response back to JSON. Clients remain blissfully unaware that the backend is gRPC.
Aggregation
Instead of four client-side round-trips, the gateway fans out to the four services in parallel and merges the responses:
sequenceDiagram
participant C as Client
participant GW as API Gateway
participant P as Product Service
participant I as Inventory Service
participant PR as Pricing Service
participant R as Reviews Service
C->>GW: GET /product-page/42
par fan-out
GW->>P: GET /products/42
GW->>I: GET /inventory/42
GW->>PR: GET /prices/42
GW->>R: GET /reviews/42
end
P-->>GW: product
I-->>GW: stock
PR-->>GW: price
R-->>GW: reviews
GW-->>C: merged page payload
The client sees one request, one response. Latency is max(latencies of four services) rather than the sum(latencies) that serial calls would cost: typically 30–60ms instead of 120–240ms. Keep fan-out bounded, though: a gateway making 20 parallel calls is starting to resemble a monolith.
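The fan-out and merge step might look like the following `asyncio` sketch. The `fake_*` coroutines are hypothetical stand-ins for real HTTP or gRPC clients, and the per-call timeout with a `None` fallback is one possible degradation policy, not the only one.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Backend = Callable[[int], Awaitable[Any]]


async def call_or_fallback(backend: Backend, product_id: int,
                           timeout_s: float, fallback: Any) -> Any:
    """Run one downstream call with its own deadline; degrade instead of failing the page."""
    try:
        return await asyncio.wait_for(backend(product_id), timeout=timeout_s)
    except Exception:  # includes the timeout raised by wait_for
        return fallback


async def product_page(product_id: int, backends: Dict[str, Backend],
                       timeout_s: float = 0.2) -> Dict[str, Any]:
    names = list(backends)
    # gather starts all calls concurrently, so total latency tracks the slowest one.
    results = await asyncio.gather(
        *(call_or_fallback(backends[name], product_id, timeout_s, None) for name in names)
    )
    return dict(zip(names, results))


# Hypothetical stand-ins for real service clients.
async def fake_product(product_id: int) -> Dict[str, Any]:
    await asyncio.sleep(0.01)
    return {"id": product_id, "name": "Widget"}


async def fake_price(product_id: int) -> Dict[str, Any]:
    await asyncio.sleep(0.02)
    return {"amount": 1999, "currency": "USD"}


async def fake_reviews(product_id: int) -> list:
    await asyncio.sleep(5)  # simulates a stalled service
    return ["never returned"]
```

Calling `product_page` with these three backends returns the product and price normally and `None` for reviews once the timeout fires, so one stalled service degrades a section of the page instead of blocking all of it.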
Caching and observability
Cacheable GET responses (product details, category listings) can be cached at the gateway layer: in process for small TTLs on high-volume endpoints, or backed by Redis for shared and explicitly invalidatable entries. Every request also gets a trace ID injected if one isn't already present, and is logged at entry and exit with latency, status code, and caller identity. The result is one consistent audit log across the fleet, without instrumenting each service separately.
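Trace-ID injection is simple enough to sketch directly. The header name `x-request-id` and the log fields are assumptions for illustration; deployments that use a tracing standard would propagate that standard's headers instead.

```python
import uuid
from typing import Dict, Optional

TRACE_HEADER = "x-request-id"  # assumed header name; real deployments vary


def ensure_trace_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a lower-cased copy of the headers that always carries a trace ID."""
    normalized = {name.lower(): value for name, value in headers.items()}
    if not normalized.get(TRACE_HEADER):
        normalized[TRACE_HEADER] = uuid.uuid4().hex  # 32 hex characters
    return normalized


def access_log_entry(headers: Dict[str, str], method: str, path: str, status: int,
                     latency_ms: float, caller: Optional[str]) -> Dict[str, object]:
    """Build one structured log record for a finished request."""
    return {
        "trace_id": headers[TRACE_HEADER],
        "method": method,
        "path": path,
        "status": status,
        "latency_ms": round(latency_ms, 2),
        "caller": caller,
    }
```

An existing ID is preserved so that a trace started upstream (by a CDN or another gateway tier) stays intact; only requests arriving without one get a fresh ID.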
The Backend-for-Frontend pattern
A single general-purpose gateway that tries to serve your web app, native mobile app, and partner API integration equally well usually serves all three badly. The response shape optimal for a feature-rich desktop web page (large, deeply nested, rich metadata) is wrong for a mobile screen. Trying to satisfy all clients from one gateway contract leads to bloated responses with fields that only one client uses, versioning chaos where adding a field for mobile risks breaking web clients, and organisational friction where the mobile team can't ship until the shared gateway team approves their schema change.
The Backend-for-Frontend (BFF) pattern solves this by giving each client type its own gateway service:
flowchart LR
WEB[Web app] --> BFFW[Web BFF]
IOS[iOS app] --> BFFM[Mobile BFF]
AND[Android app] --> BFFM
PART[Partner integration] --> BFFP[Partner BFF]
BFFW --> SVC1[User Service]
BFFW --> SVC2[Order Service]
BFFW --> SVC3[Product Service]
BFFM --> SVC1
BFFM --> SVC2
BFFP --> SVC2
BFFP --> SVC4[Webhook Service]
style BFFW fill:#ff6b1a,color:#0a0a0f
style BFFM fill:#a855f7,color:#fff
style BFFP fill:#15803d,color:#fff
style SVC1 fill:#0e7490,color:#fff
style SVC2 fill:#0e7490,color:#fff
style SVC3 fill:#0e7490,color:#fff
style SVC4 fill:#0e7490,color:#fff
Each BFF is owned by the team that owns the client — the mobile team owns the Mobile BFF. They can evolve their API contract independently without coordinating through a shared team. The Mobile BFF fetches only the fields the app needs, maybe in a different structure entirely. The Partner BFF exposes a stable, versioned, documented contract that changes on a slower cadence.
The BFF is still thin: it routes, aggregates, and transforms. Business logic lives in the downstream services, not here.
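The "transform" part of a thin BFF is mostly response shaping. As a sketch, here are two views over the same upstream payload; the `PRODUCT` fields and both function names are invented for illustration.

```python
from typing import Any, Dict

# Hypothetical upstream payload, roughly what a Product Service might return.
PRODUCT: Dict[str, Any] = {
    "id": 42,
    "name": "Trail Running Shoe",
    "description": "Lightweight shoe with a rock plate and 6mm lugs.",
    "price": {"amount": 12999, "currency": "USD"},  # amount in minor units
    "images": [
        {"url": "https://cdn.example.com/42/hero.jpg", "width": 2400},
        {"url": "https://cdn.example.com/42/thumb.jpg", "width": 320},
    ],
    "specs": {"weight_g": 260, "drop_mm": 6},
}


def web_product_view(product: Dict[str, Any]) -> Dict[str, Any]:
    """Web BFF: keep the rich, nested shape the desktop page renders."""
    return {
        "id": product["id"],
        "name": product["name"],
        "description": product["description"],
        "price": product["price"],
        "images": product["images"],
        "specs": product["specs"],
    }


def mobile_product_view(product: Dict[str, Any]) -> Dict[str, Any]:
    """Mobile BFF: flat payload, the smallest image only, price preformatted."""
    smallest = min(product["images"], key=lambda image: image["width"])
    amount = product["price"]["amount"] / 100
    return {
        "id": product["id"],
        "name": product["name"],
        "price": f"{amount:.2f} {product['price']['currency']}",
        "thumbnail": smallest["url"],
    }
```

Both functions only select and reshape fields. Nothing here decides what the price is or whether the product may be sold, which is the line between a BFF and a service.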
When to reach for BFF
BFF is warranted when you have meaningfully different client types with genuinely different data needs, different teams owning different client surfaces, or one client type with a significantly different latency or payload-size budget (IoT devices, mobile on 2G). It adds operational overhead — you're now running N gateway processes instead of one. On a small team or a uniform client surface, a single gateway is simpler and sufficient.
Gateway vs service mesh
This distinction trips up most candidates.
| Dimension | API Gateway | Service Mesh |
|---|---|---|
| Traffic direction | North-south (client → services) | East-west (service → service) |
| Deployment model | One gateway fleet, run by the platform/infra team | Sidecar per service instance (e.g. Envoy) |
| Primary concerns | Auth, rate limiting, routing, aggregation, external-facing TLS | mTLS between services, retries, circuit breaking, internal observability |
| Caller | External clients (browsers, mobile, partners) | Internal services calling each other |
| Example tools | Kong, AWS API Gateway, Nginx, Envoy at the edge | Istio, Linkerd, Consul Connect |
They are complementary. A typical production setup has both: a gateway fleet at the edge handling north-south concerns, and a service mesh in the cluster handling east-west reliability and observability. You don't have to choose between them.
Edge gateways vs internal gateways
Large organisations often run two tiers. An edge gateway faces the internet and handles TLS, DDoS protection, bot mitigation, geographic routing, CDN integration, and coarse-grained auth — it runs as close to users as possible, often at PoPs in different regions. An internal gateway sits inside the private network, between internal callers (employee tools, internal services acting as API consumers) and backend services, with a lighter-weight auth surface since the threat model is different.
This mirrors the layered defence pattern: each hop trusts but verifies the previous layer.
Anti-patterns and failure modes
The god gateway
The most common mistake: business logic leaks into the gateway over time. Conditional routing based on user attributes. Feature flags checked at the gateway. Discount calculations applied at the edge. After 18 months the gateway is a stateful, untestable monolith — just distributed. The rule is simple: if it touches your domain model, it belongs in a service, not the gateway.
Single point of failure
A single gateway instance in front of all traffic is an outage waiting to happen. Run multiple instances behind an L4 load balancer (TCP/UDP level, not HTTP — simpler and faster at this position), deploy across multiple availability zones, and keep the gateway stateless. No session state, no in-memory counters that matter if lost. Any instance can handle any request.
Cascading fan-out
A gateway that fans out to 15 services per request amplifies the blast radius of any downstream slowdown. One slow service holds a gateway goroutine for its entire timeout window. At 50k req/s, a 500ms timeout on a flaky downstream means up to 50k × 0.5s = 25,000 concurrent in-flight goroutines waiting. Bounded worker pools and aggressive circuit breakers cap this.
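The 25,000 figure is Little's Law (average concurrency = arrival rate × time each request is held), and a bounded pool is what stops it from growing without limit. The sketch below uses Python's `asyncio` in place of goroutines; `BoundedCaller`, its shed-on-full policy, and the `demo` driver are assumptions for illustration, and queueing with a deadline is an equally valid policy.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Tuple


def in_flight(rate_per_s: float, hold_time_s: float) -> float:
    """Little's Law: average concurrency = arrival rate * time each request is held."""
    return rate_per_s * hold_time_s


class BoundedCaller:
    """Caps concurrent calls to one downstream; excess requests get the fallback."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self.peak = 0  # highest concurrency observed, kept for the demo

    async def call(self, fn: Callable[[], Awaitable[Any]], fallback: Any) -> Any:
        # No lock needed: asyncio runs one task at a time between awaits.
        if self._in_flight >= self._max:
            return fallback  # shed load instead of queueing behind a slow service
        self._in_flight += 1
        self.peak = max(self.peak, self._in_flight)
        try:
            return await fn()
        finally:
            self._in_flight -= 1


async def demo() -> Tuple[List[str], int]:
    async def slow_downstream() -> str:  # hypothetical slow service
        await asyncio.sleep(0.01)
        return "ok"

    caller = BoundedCaller(max_in_flight=3)
    results = await asyncio.gather(
        *(caller.call(slow_downstream, "shed") for _ in range(10))
    )
    return list(results), caller.peak
```

In the demo, ten simultaneous requests hit a pool of three: three reach the downstream and seven get the fallback immediately, so the number of waiting tasks is fixed by configuration rather than by how slow the downstream happens to be.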
flowchart TD
REQ[Request] --> GW[Gateway]
GW --> CB1[Circuit Breaker]
CB1 -->|closed| SVC1[Service A]
CB1 -->|open — fail fast| ERR1[Fallback]
GW --> CB2[Circuit Breaker]
CB2 -->|closed| SVC2[Service B]
CB2 -->|open — fail fast| ERR2[Fallback]
style CB1 fill:#ffaa00,color:#0a0a0f
style CB2 fill:#ffaa00,color:#0a0a0f
style ERR1 fill:#ff2e88,color:#fff
style ERR2 fill:#ff2e88,color:#fff
Each downstream call should have a circuit breaker: after N consecutive failures or a failure-rate threshold, open the breaker and immediately return a fallback — a cached response, an empty list, a degraded payload — instead of hammering the failing service. This protects the gateway's own thread pool as much as it protects the downstream.
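A consecutive-failure breaker can be sketched as follows. The class and parameter names are illustrative. The trial call after the reset window (often called the half-open state) is an assumption added here, since the text above only describes the open and closed states.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, fn: Callable[[], Any], fallback: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback  # open: fail fast without touching the downstream
            # Cool-down elapsed: fall through and let this one call probe the service.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe re-opens immediately; otherwise open at the threshold.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback
        # Success closes the breaker and clears the failure streak.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, `call` returns the fallback without invoking `fn` at all, which is what frees gateway workers that would otherwise sit out the full timeout.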
Added latency hop
A gateway inevitably adds a network hop. For simple routing with cached JWT verification the overhead is typically 1–5ms. With TLS termination, JWT verification, routing logic, and fan-out aggregation across multiple services, you can add 5–20ms versus a direct call — the upper end of that range is the fan-out case, not the simple-proxy case. For most product use cases this is acceptable. For ultra-low-latency internal service calls with a sub-millisecond budget, keep the gateway out of the east-west path and use the service mesh or direct gRPC instead.
Building up to the design
V1: Expose services directly
Each microservice has its own DNS name and port. The client holds a service registry or hardcoded hostnames. No extra component, simplest possible path from client to service. The problem is that clients must implement auth, retry, and service discovery themselves. Adding TLS means updating every client. Adding a new service leaks into every client.
V2: Nginx reverse proxy
A single Nginx instance routes by path prefix. All TLS terminates here. Backend services are plain HTTP. You've centralized TLS and given clients one hostname. But auth is still per-service, rate limiting is primitive or absent, and you can't easily do response aggregation or JWT verification in pure Nginx config without Lua or OpenResty.
V3: Purpose-built API gateway
Replace Nginx with a proper gateway — Kong, Envoy, AWS API Gateway, or your own service. Auth plugins, rate limiting backed by Redis, a routing table managed via API, aggregation logic, and request/response transforms all live here. The gateway is stateless; a small cluster behind an L4 LB handles peak traffic. All cross-cutting concerns are now in one place, operationally uniform.
The new problem: you have a single gateway for all clients. The mobile team ships slower because every schema change goes through the shared gateway team, and the mobile payload is bloated with fields the app doesn't need.
V4: Add BFFs
Split the gateway into one BFF per client type. Each team owns their BFF. The mobile BFF fetches less, compresses more, and returns a mobile-optimised shape. The web BFF returns rich nested objects. The partner BFF is versioned and stable.
flowchart LR
V1["V1: direct service calls\nclient knows everything"] --> V2["V2: + Nginx proxy\nTLS in one place"]
V2 --> V3["V3: + purpose-built gateway\ncross-cutting concerns centralised"]
V3 --> V4["V4: + BFFs\neach client owns its gateway"]
style V1 fill:#0e7490,color:#fff
style V2 fill:#15803d,color:#fff
style V3 fill:#ff6b1a,color:#0a0a0f
style V4 fill:#a855f7,color:#fff
Full architecture
flowchart TD
INET[Internet] --> LB4[L4 Load Balancer]
LB4 --> GW1[Gateway instance 1]
LB4 --> GW2[Gateway instance 2]
LB4 --> GW3[Gateway instance 3]
GW1 --> REDIS[(Redis\nrate limit counters)]
GW2 --> REDIS
GW3 --> REDIS
GW1 --> AUTHSVC[Auth Service\npublic key endpoint]
GW1 -->|route /users| USVC[User Service]
GW1 -->|route /orders| OSVC[Order Service]
GW1 -->|route /products| PSVC[Product Service]
GW1 -->|aggregate /page| PSVC
GW1 -->|aggregate /page| OSVC
GW1 --> OBS[Observability\nlogs + traces]
style LB4 fill:#ffaa00,color:#0a0a0f
style GW1 fill:#ff6b1a,color:#0a0a0f
style GW2 fill:#ff6b1a,color:#0a0a0f
style GW3 fill:#ff6b1a,color:#0a0a0f
style REDIS fill:#15803d,color:#fff
style USVC fill:#0e7490,color:#fff
style OSVC fill:#0e7490,color:#fff
style PSVC fill:#0e7490,color:#fff
Gateway instances are stateless; any of the three can handle any request. The L4 LB does health checks at the TCP level and routes around failed instances. Rate-limit counters live in a shared Redis cluster so limits are enforced globally across the fleet, not per-instance. JWT verification uses the auth service's cached public key — fetched at startup and refreshed on a schedule, never per-request. Observability data (logs, distributed trace spans) is emitted asynchronously so the critical path is not gated on a logging write.
Storage and component choices
| Concern | Typical choice | Reasoning |
|---|---|---|
| Rate limit counters | Redis (INCR + EXPIRE) | In-memory, atomic, fast; shared across all gateway instances |
| Auth token verification | JWT + local public key | No network call on the hot path; key rotation handled out-of-band |
| Routing config | In-process config file or control plane backed by a store (e.g. etcd) | Pushed to gateways; routing decisions require zero I/O |
| Response cache | In-process LRU + Redis for shared / invalidatable entries | In-process is fastest; Redis allows explicit invalidation across instances |
| TLS certificates | Cert manager (e.g. cert-manager on K8s, ACM on AWS) | Automated renewal; one cert per gateway cluster |
| Aggregation logic | Thin service code in gateway (Go, Node, JVM) | Needs parallel I/O, error handling, timeouts — easier in code than config |
Things to discuss in an interview
- Why not just call services directly? Enumerate the coupling and cross-cutting concerns problems.
- How do you keep the gateway thin? Rule: if it touches the domain model, it belongs in a service.
- BFF vs a single gateway? Ask about the number and variety of client types first; BFF adds operational cost.
- Gateway vs service mesh? North-south vs east-west; they are complementary.
- How does auth work? Local JWT verification vs introspection; the latency trade-off.
- How do you handle a slow downstream? Circuit breakers, bounded fan-out, timeouts, fallbacks.
- How do you deploy a gateway change without downtime? Rolling deploys work because gateways are stateless; canary via weighted routing.
Things you should now be able to answer
- Why does a microservice architecture need an API gateway?
- What is the difference between authentication at the gateway and authorization inside a service?
- What does "stateless gateway" mean and why does it matter for scaling?
- When does a single API gateway become a bottleneck, and how do you fix it?
- What is the BFF pattern and when does it justify the extra operational complexity?
- How is an API gateway different from a service mesh? Can you run both simultaneously?
- What is the "god gateway" anti-pattern and how do you avoid it?
Further reading
- Sam Newman, Building Microservices (O'Reilly) — chapters on API gateways and BFF are the canonical reference.
- Netflix Tech Blog — "Edge Gateway at Netflix" documents their Zuul/Envoy evolution.
- Phil Calçado, "The Back-end for Front-end Pattern (BFF)" (September 2015) — the original BFF writeup.
- NGINX / Kong gateway documentation — concrete configuration examples for routing, rate limiting, and auth plugins.
- Forward vs reverse proxy — foundational concepts for understanding what a gateway actually is.
- Microservices vs monolith — when you need a gateway in the first place.
- Design a rate limiter — the algorithms behind gateway-level rate limiting.
Frequently asked questions
▸What is an API gateway?
An API gateway is a stateless reverse proxy that sits at the north-south boundary between every client and your microservice fleet. It centralises cross-cutting concerns — TLS termination, JWT auth, rate limiting, routing, protocol translation, response aggregation, caching, and observability — so that individual services never have to implement them.
▸What is the Backend-for-Frontend (BFF) pattern and when should I use it?
The BFF pattern gives each client type (web, mobile, partner API) its own thin gateway service, owned by the team that owns that client. It is warranted when you have meaningfully different client types with genuinely different data needs, different teams owning different surfaces, or one client with a significantly different latency or payload-size budget such as IoT or mobile on 2G. On a small team or a uniform client surface, a single gateway is simpler and sufficient.
▸What is the difference between an API gateway and a service mesh?
An API gateway handles north-south traffic — inbound from external clients to services — and its primary concerns are auth, rate limiting, routing, aggregation, and external-facing TLS. A service mesh handles east-west traffic — service to service inside the cluster — and focuses on mTLS between services, retries, circuit breaking, and internal observability. They are complementary: a typical production setup runs both simultaneously.
▸How fast is JWT verification at the gateway, and why does it matter?
HS256 JWT verification takes roughly 0.005ms and RS256 takes roughly 0.06 to 0.14ms per request, and neither requires an outbound network call: HS256 uses a pre-configured shared secret while RS256 uses a public key fetched from a JWKS endpoint at startup and cached in memory. The alternative, opaque token introspection, adds a full synchronous network hop to the auth service on every request, so at 50k req/s the auth service's throughput becomes the ceiling unless results are aggressively cached.
▸How many gateway nodes do you need at 50k requests per second?
A single mid-spec gateway process handles roughly 10,000 to 30,000 req/s at L7 HTTP with auth, so 50k req/s needs 2 to 5 nodes running flat out; with headroom for spikes and node failures, a small fleet of 3 to 6 stateless gateway nodes behind an L4 load balancer is sufficient. At an average overhead of 4ms per request, that is only about 200 requests in flight across the fleet at any moment. Rate-limit counters live in a shared Redis cluster, which comfortably handles the resulting 50k operations per second given Redis's single-instance ceiling of around 100,000 to 200,000 ops/sec.