
Intermediate · Asked at Netflix, Amazon, Uber

API Gateways & the Backend-for-Frontend Pattern

The single front door to a microservice backend. What an API gateway does, why you add one, the BFF pattern, and how not to turn it into a monolith.

16 min read · 2026-03-26 · Ironclad Academy

#architecture #microservices #apis #networking


A microservice architecture hands you real benefits — independent deployability, technology diversity, fault isolation — but it also hands you a problem: you now have dozens of services and every client needs to talk to most of them. Without some structure, that leads to dozens of client-side round-trips, every service re-implementing auth and rate limiting, and a coupling explosion where clients must know the topology of your backend. The API gateway is the standard answer. The BFF pattern is the refinement.

The problem without a gateway

Imagine an e-commerce app where the product page needs data from four services: product details, inventory, pricing, and reviews. Without a gateway, the client does this:

sequenceDiagram
    participant C as Client
    participant P as Product Service
    participant I as Inventory Service
    participant PR as Pricing Service
    participant R as Reviews Service

    C->>P: GET /products/42
    C->>I: GET /inventory?product=42
    C->>PR: GET /prices?product=42
    C->>R: GET /reviews?product=42
    P-->>C: product details
    I-->>C: stock level
    PR-->>C: current price
    R-->>C: reviews
    Note over C: Client assembles page

That's four round-trips from the client. On a mobile device on a 4G network with ~60ms RTT, that is roughly 60ms of pure network overhead if the calls run in parallel and up to 240ms if they run serially, before any rendering and not counting service latency. Meanwhile each service has to authenticate the caller, enforce rate limits, and log the request. That logic gets duplicated across the fleet, diverges over time, and becomes impossible to audit consistently.

The coupling problem is just as bad. The client knows service hostnames. Rename or split a service and every client version in the wild breaks. Add mTLS internally and you now have to update every client SDK. Add a new service and you teach every client about it.

What an API gateway does

A gateway is a reverse proxy with a richer feature set. It sits at the north-south boundary — inbound from the internet — and handles the concerns that are common to every API call.

flowchart TD
    C[Client request] --> GW[API Gateway]

    GW --> TLS["TLS termination\n(HTTPS → HTTP internally)"]
    GW --> AUTH["Auth/Authz\n(JWT verify, RBAC check)"]
    GW --> RL["Rate limiting\n(token bucket per client key)"]
    GW --> ROUTE["Routing\n(path → service mapping)"]
    GW --> CACHE["Response caching\n(GET /products/42, TTL 30s)"]
    GW --> TRANSFORM["Request/response transform\n(field rename, protocol translate)"]
    GW --> AGG["Aggregation\n(fan-out to N services, merge)"]
    GW --> OBS["Observability\n(request log, trace, latency histogram)"]

    ROUTE --> SVC[Downstream services]

    style GW fill:#ff6b1a,color:#0a0a0f
    style SVC fill:#0e7490,color:#fff

TLS termination

The gateway holds the public TLS certificate and terminates HTTPS at the edge. Internal communication between the gateway and services can be plain HTTP or gRPC on a trusted private network, or mTLS if your threat model requires it. Either way, individual services never manage the public certificate or its renewal; with plain HTTP internally they also skip TLS handshakes altogether.

Authentication and authorization

The gateway verifies the caller's identity on every request, so downstream services can trust that anything arriving from behind the gateway is already authenticated. The practical mechanism matters a lot here.

The most common approach is JWT with local verification: the gateway validates the token's signature and expiry without making a network call. With HS256 the gateway holds a pre-configured shared secret; with RS256 or ES256 it fetches the auth service's public key from a JWKS endpoint at startup and caches it in memory, refreshing on a schedule. Either way, no outbound call is made on the hot path. It's fast (HS256 takes about 0.005ms; RS256 is slower at roughly 0.06–0.14ms), and because verification is pure CPU work on each gateway instance, it scales horizontally with the fleet.

sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant AS as Auth Service
    participant SVC as Backend Service

    Note over GW,AS: Startup — key cached, not per-request
    GW->>AS: fetch public key (JWKS)
    AS-->>GW: public key cached in memory

    C->>GW: GET /orders/99 + Bearer token
    GW->>GW: verify JWT signature locally (HS256 ~0.005ms, RS256 ~0.06–0.14ms)
    GW->>GW: check expiry, scopes
    GW->>SVC: forward request + verified identity header
    SVC-->>GW: response
    GW-->>C: response
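The local-verification step in the diagram can be sketched with nothing but the standard library. This is a minimal illustration for HS256 only, assuming a pre-shared secret; the function names (`sign_hs256`, `verify_hs256`) are hypothetical, and a production gateway would normally rely on a maintained JWT library and also validate claims such as issuer and audience.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Helper for the demo: mint a token the way an auth service might."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{b64url_encode(mac.digest())}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if signature and expiry check out, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    # Pin the algorithm: the token header must not get to choose it.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    expiry = claims.get("exp")
    if not isinstance(expiry, (int, float)) or current >= expiry:
        raise ValueError("token expired or missing exp")
    return claims
```

No network call appears anywhere in `verify_hs256`, which is the property the article relies on: the cost is one HMAC over a short string.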

The alternative is opaque token introspection: the gateway calls the auth service on every request to exchange an opaque token for user identity. It is simple to implement, but it adds a synchronous dependency and a full network hop to every request. At 50k req/s the auth service's throughput becomes the ceiling for the whole API unless you cache aggressively. A hybrid approach splits the difference: introspect once, issue a short-lived JWT, and cache it locally for its TTL.
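To show how caching takes introspection off the hot path, here is a simplified sketch that caches the introspection result itself for a fixed TTL rather than minting a JWT. The class name `CachedIntrospector` and the injected `introspect` callable are assumptions made for illustration; the cache is unbounded, which a real gateway would need to fix with eviction.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedIntrospector:
    """Wraps an expensive introspection call with a per-token TTL cache."""

    def __init__(
        self,
        introspect: Callable[[str], Optional[dict]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._introspect = introspect
        self._ttl = ttl_seconds
        self._clock = clock
        # token -> (expiry timestamp, identity or None for a rejected token)
        self._cache: Dict[str, Tuple[float, Optional[dict]]] = {}

    def identity_for(self, token: str) -> Optional[dict]:
        now = self._clock()
        hit = self._cache.get(token)
        if hit is not None and hit[0] > now:
            return hit[1]  # served locally, no network hop
        identity = self._introspect(token)  # the synchronous call to the auth service
        self._cache[token] = (now + self._ttl, identity)
        return identity
```

The trade-off is the usual one for any auth cache: a revoked token keeps working until its cache entry expires, so the TTL is effectively the revocation delay.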

Authorization — does this caller have permission for this endpoint? — can live at the gateway for coarse-grained checks (is this an authenticated end-user or an internal service?) but should not try to do fine-grained authorization. The gateway doesn't have your domain model. Leave "can this user read this specific order?" to the service that owns orders.
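A coarse-grained check of this kind can be as small as a lookup from route to required scope. The sketch below assumes the verified token carries an OAuth-style space-separated `scope` claim; the route table and scope names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical route table: (method, first path segment) -> scope the caller must hold.
REQUIRED_SCOPE: Dict[Tuple[str, str], str] = {
    ("GET", "/orders"): "orders:read",
    ("POST", "/checkout"): "checkout:write",
}


def first_segment(path: str) -> str:
    """'/orders/99?x=1' -> '/orders'."""
    parts = path.split("?", 1)[0].strip("/").split("/")
    return "/" + parts[0]


def coarse_allow(method: str, path: str, claims: dict) -> bool:
    """Gateway-level check only: does the caller hold the scope for this route?"""
    required = REQUIRED_SCOPE.get((method.upper(), first_segment(path)))
    if required is None:
        return False  # unknown routes are denied by default
    granted = set(str(claims.get("scope", "")).split())
    return required in granted
```

Note what the function never looks at: the order ID. Whether this user may read order 99 in particular stays with the Order Service.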

Rate limiting

The gateway is the right place for rate limiting because it sees every request before it reaches any service. A per-API-key or per-IP token-bucket counter in a shared Redis gives you cluster-wide enforcement. This protects downstream services from traffic spikes and DDoS without each service having to implement its own limiter.

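As a back-of-the-envelope sanity check: a Redis INCR + EXPIRE round-trip is ~0.1–0.2ms (dominated by network latency, not Redis processing time). Strictly speaking, INCR + EXPIRE implements a fixed-window counter, and EXPIRE is only needed when the counter is first created, so steady state is about one command per request; a true token bucket in Redis is typically a small Lua script at similar cost. At 50k req/s, each hitting one rate-limit key lookup, that's 50k Redis operations/sec, comfortably within Redis's single-instance throughput ceiling of ~100k–200k ops/sec (Redis's own benchmarks show ~72k ops/sec with a large key space and ~180k with 50 parallel clients; pipelining or Redis 6+ threaded I/O push it higher still).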
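The token-bucket logic itself is short. The sketch below keeps bucket state in process memory purely to show the algorithm; in the design described here that state would live in shared Redis so every gateway instance sees the same counters. The class names and the injectable clock are illustrative choices.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    capacity: float           # maximum burst size
    refill_per_second: float  # sustained request rate
    tokens: float
    updated_at: float

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on time elapsed since the last request.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per client key (API key or IP address)."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_key: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(client_key)
        if bucket is None:
            # New clients start with a full bucket.
            bucket = TokenBucket(self._capacity, self._refill, self._capacity, now)
            self._buckets[client_key] = bucket
        return bucket.allow(now)
```

A request that returns `False` would typically be answered with HTTP 429 before any downstream service is touched.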

Routing

Path-based, header-based, or weighted routing maps an incoming request to the right downstream service:

GET /products/42            → Product Service
GET /orders/99              → Order Service
POST /checkout              → Checkout Service
GET /products/42?version=v2 → Product Service v2 (canary)

Weighted routing powers canary deployments: send 5% of traffic to the new version, watch error rates, ramp up. The gateway is already there; you just adjust a weight.
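Both ideas, prefix routing and a weighted canary split, fit in a few lines. This sketch assumes a static in-process route table and picks the canary cohort by hashing a client key so a given client stays on one version; per-request random selection is an equally valid choice. Service names and function names are hypothetical.

```python
import hashlib
from typing import List, Optional, Tuple

# Hypothetical route table: path prefix -> downstream service.
ROUTES: List[Tuple[str, str]] = [
    ("/products", "product-service"),
    ("/orders", "order-service"),
    ("/checkout", "checkout-service"),
]


def match_route(path: str) -> Optional[str]:
    """Longest-prefix match on whole path segments; None if nothing matches."""
    best: Optional[Tuple[str, str]] = None
    for prefix, service in ROUTES:
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, service)
    return best[1] if best else None


def pick_version(client_key: str, canary_percent: int) -> str:
    """Send roughly canary_percent of clients to the canary, deterministically."""
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Ramping up is then a config change: raise `canary_percent` from 5 to 25 to 100 while watching error rates.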

Protocol translation

Clients use HTTP/JSON. Internal services might use gRPC — binary, strongly typed, more efficient over the wire. The gateway can translate: accept an HTTP/JSON POST from the client, re-encode it as a gRPC request to the downstream service, then re-encode the protobuf response back to JSON. Clients remain blissfully unaware that the backend is gRPC.

Aggregation

Instead of four client-side round-trips, the gateway fans out to the four services in parallel and merges the responses:

sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant P as Product Service
    participant I as Inventory Service
    participant PR as Pricing Service
    participant R as Reviews Service

    C->>GW: GET /product-page/42
    par fan-out
        GW->>P: GET /products/42
        GW->>I: GET /inventory/42
        GW->>PR: GET /prices/42
        GW->>R: GET /reviews/42
    end
    P-->>GW: product
    I-->>GW: stock
    PR-->>GW: price
    R-->>GW: reviews
    GW-->>C: merged page payload

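The client sees one request, one response. Because the fan-out runs in parallel, backend latency is max(latencies of four services) rather than the sum(latencies) a serial client would pay, typically 30–60ms instead of 120–240ms, and the client crosses the slow mobile link only once. Keep fan-out bounded, though: a gateway making 20 parallel calls is starting to resemble a monolith.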
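The fan-out and merge can be sketched with `asyncio`. The downstream calls here are stubs that just sleep; `aggregate` and `stub_service` are hypothetical names, and the choice to return `None` for a failed or slow section (instead of failing the whole page) is one possible degradation policy.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def aggregate(
    calls: Dict[str, Callable[[], Awaitable[dict]]],
    timeout_s: float,
) -> Dict[str, Optional[dict]]:
    """Run every call concurrently; a failed or timed-out call yields None."""
    names = list(calls)
    results = await asyncio.gather(
        *(asyncio.wait_for(calls[name](), timeout_s) for name in names),
        return_exceptions=True,  # one bad service must not sink the whole page
    )
    return {
        name: None if isinstance(result, BaseException) else result
        for name, result in zip(names, results)
    }


def stub_service(name: str, delay_s: float) -> Callable[[], Awaitable[dict]]:
    """Stand-in for a real downstream: answers after delay_s seconds."""
    async def call() -> dict:
        await asyncio.sleep(delay_s)
        return {"from": name}
    return call
```

With four stubs that each take 200ms, `asyncio.run(aggregate(...))` completes in roughly 200ms rather than 800ms, which is the max-versus-sum effect described above.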

Caching and observability

Cacheable GET responses (product details, category listings) can be cached at the gateway layer: in process for small TTLs on high-volume endpoints, or backed by Redis for shared and explicitly invalidatable entries. Every request also gets a trace ID injected if one isn't already present, and is logged at entry and exit with latency, status code, and caller identity. The result is one consistent audit log at the edge, without instrumenting each service separately.
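Trace ID injection is a small transformation on the request headers. The sketch below assumes the ID travels in an `x-request-id` header, which is a common convention but an assumption here; systems using W3C Trace Context would use `traceparent` instead.

```python
import uuid
from typing import Dict

TRACE_HEADER = "x-request-id"  # assumed header name, not mandated by the article


def with_trace_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that is guaranteed to carry a trace ID."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    normalized = {name.lower(): value for name, value in headers.items()}
    if not normalized.get(TRACE_HEADER):
        normalized[TRACE_HEADER] = uuid.uuid4().hex
    return normalized
```

An ID supplied by the caller is preserved, so a trace that started upstream (for example at a CDN) stays intact through the gateway.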

The Backend-for-Frontend pattern

A single general-purpose gateway that tries to serve your web app, native mobile app, and partner API integration equally well usually serves all three badly. The response shape optimal for a feature-rich desktop web page (large, deeply nested, rich metadata) is wrong for a mobile screen. Trying to satisfy all clients from one gateway contract leads to bloated responses with fields that only one client uses, versioning chaos where adding a field for mobile risks breaking web clients, and organisational friction where the mobile team can't ship until the shared gateway team approves their schema change.

The Backend-for-Frontend (BFF) pattern solves this by giving each client type its own gateway service:

flowchart LR
    WEB[Web app] --> BFFW[Web BFF]
    IOS[iOS app] --> BFFM[Mobile BFF]
    AND[Android app] --> BFFM
    PART[Partner integration] --> BFFP[Partner BFF]

    BFFW --> SVC1[User Service]
    BFFW --> SVC2[Order Service]
    BFFW --> SVC3[Product Service]

    BFFM --> SVC1
    BFFM --> SVC2

    BFFP --> SVC2
    BFFP --> SVC4[Webhook Service]

    style BFFW fill:#ff6b1a,color:#0a0a0f
    style BFFM fill:#a855f7,color:#fff
    style BFFP fill:#15803d,color:#fff
    style SVC1 fill:#0e7490,color:#fff
    style SVC2 fill:#0e7490,color:#fff
    style SVC3 fill:#0e7490,color:#fff
    style SVC4 fill:#0e7490,color:#fff

Each BFF is owned by the team that owns the client — the mobile team owns the Mobile BFF. They can evolve their API contract independently without coordinating through a shared team. The Mobile BFF fetches only the fields the app needs, maybe in a different structure entirely. The Partner BFF exposes a stable, versioned, documented contract that changes on a slower cadence.

The BFF is still thin: it routes, aggregates, and transforms. Business logic lives in the downstream services, not here.
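To make "transforms" concrete, here is a sketch of two BFFs shaping the same upstream data differently. The upstream field names (`title`, `images`, `specs`, `price_cents`) and both view functions are hypothetical; the point is that each function only selects and reshapes, with no business rules.

```python
from typing import Any, Dict, List


def web_product_view(product: Dict[str, Any], reviews: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Web BFF: rich, nested payload for a desktop product page."""
    return {
        "id": product["id"],
        "title": product["title"],
        "description": product["description"],
        "images": product["images"],
        "specs": product["specs"],
        "price": {"cents": product["price_cents"], "currency": product["currency"]},
        "reviews": reviews,
    }


def mobile_product_view(product: Dict[str, Any], reviews: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Mobile BFF: flat, small payload; one image and a review count only."""
    return {
        "id": product["id"],
        "title": product["title"],
        "thumb": product["images"][0] if product["images"] else None,
        "price_cents": product["price_cents"],
        "review_count": len(reviews),
    }
```

If either function starts computing discounts or checking entitlements, that logic has drifted out of the service that owns it.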

When to reach for BFF

BFF is warranted when you have meaningfully different client types with genuinely different data needs, different teams owning different client surfaces, or one client type with a significantly different latency or payload-size budget (IoT devices, mobile on 2G). It adds operational overhead — you're now running N gateway processes instead of one. On a small team or a uniform client surface, a single gateway is simpler and sufficient.

Gateway vs service mesh

This distinction trips up most candidates.

Dimension	API Gateway	Service Mesh
Traffic direction	North-south (client → services)	East-west (service → service)
Deployment model	One gateway fleet run by a platform/infra team	Sidecar per service instance (e.g. Envoy)
Primary concerns	Auth, rate limiting, routing, aggregation, external-facing TLS	mTLS between services, retries, circuit breaking, internal observability
Caller	External clients (browsers, mobile, partners)	Internal services calling each other
Example tools	Kong, AWS API Gateway, Nginx, Envoy at the edge	Istio, Linkerd, Consul Connect

They are complementary. A typical production setup has both: a gateway fleet at the edge handling north-south concerns, and a service mesh in the cluster handling east-west reliability and observability. You don't have to choose between them.

Edge gateways vs internal gateways

Large organisations often run two tiers. An edge gateway faces the internet and handles TLS, DDoS protection, bot mitigation, geographic routing, CDN integration, and coarse-grained auth — it runs as close to users as possible, often at PoPs in different regions. An internal gateway sits inside the private network, between internal callers (employee tools, internal services acting as API consumers) and backend services, with a lighter-weight auth surface since the threat model is different.

This mirrors the layered defence pattern: each hop trusts but verifies the previous layer.

Anti-patterns and failure modes

The god gateway

The most common mistake: business logic leaks into the gateway over time. Conditional routing based on user attributes. Feature flags checked at the gateway. Discount calculations applied at the edge. After 18 months the gateway is a stateful, untestable monolith — just distributed. The rule is simple: if it touches your domain model, it belongs in a service, not the gateway.

Single point of failure

A single gateway instance in front of all traffic is an outage waiting to happen. Run multiple instances behind an L4 load balancer (TCP/UDP level, not HTTP — simpler and faster at this position), deploy across multiple availability zones, and keep the gateway stateless. No session state, no in-memory counters that matter if lost. Any instance can handle any request.

Cascading fan-out

A gateway that fans out to 15 services per request amplifies the blast radius of any downstream slowdown. One slow service holds a gateway goroutine for its entire timeout window. At 50k req/s, a 500ms timeout on a flaky downstream means up to 50k × 0.5s = 25,000 concurrent in-flight goroutines waiting. Bounded worker pools and aggressive circuit breakers cap this.
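The 25,000 figure is Little's law (in-flight = arrival rate × time held), and a bounded pool is what stops it from growing without limit. The sketch below shows both; it uses Python's `asyncio` in place of goroutines, and `BoundedDownstream` is a hypothetical name. It sheds load immediately when the cap is reached instead of queueing, which is one of several reasonable policies.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def expected_in_flight(requests_per_second: float, seconds_held: float) -> float:
    """Little's law: average concurrency = arrival rate x time each request is held."""
    return requests_per_second * seconds_held


class BoundedDownstream:
    """Caps concurrent calls to one downstream; excess calls get the fallback."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0

    async def call(self, operation: Callable[[], Awaitable[T]], fallback: T) -> T:
        # Single-threaded event loop: no await between check and increment,
        # so the counter cannot race.
        if self._in_flight >= self._max:
            return fallback  # shed load instead of piling up waiters
        self._in_flight += 1
        try:
            return await operation()
        finally:
            self._in_flight -= 1
```

With a cap of, say, 2,000 in-flight calls per downstream, a stalled service costs the gateway at most 2,000 waiting tasks instead of 25,000.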

flowchart TD
    REQ[Request] --> GW[Gateway]
    GW --> CB1[Circuit Breaker]
    CB1 -->|closed| SVC1[Service A]
    CB1 -->|open — fail fast| ERR1[Fallback]
    GW --> CB2[Circuit Breaker]
    CB2 -->|closed| SVC2[Service B]
    CB2 -->|open — fail fast| ERR2[Fallback]

    style CB1 fill:#ffaa00,color:#0a0a0f
    style CB2 fill:#ffaa00,color:#0a0a0f
    style ERR1 fill:#ff2e88,color:#fff
    style ERR2 fill:#ff2e88,color:#fff

Each downstream call should have a circuit breaker: after N consecutive failures or a failure-rate threshold, open the breaker and immediately return a fallback — a cached response, an empty list, a degraded payload — instead of hammering the failing service. This protects the gateway's own thread pool as much as it protects the downstream.
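A consecutive-failure breaker with a timed half-open trial can be sketched as follows. This is a single-threaded illustration with hypothetical names; production breakers usually add failure-rate windows, per-call timeouts, and thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int,
        reset_after_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: fail fast, do not touch the downstream
            # Reset window elapsed: half-open, let this one call through as a trial.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial re-opens immediately; otherwise open at the threshold.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure streak.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, the failing service receives no traffic at all from this gateway instance, which gives it room to recover.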

Added latency hop

A gateway inevitably adds a network hop. For simple routing with cached JWT verification the overhead is typically 1–5ms. With TLS termination, JWT verification, routing logic, and fan-out aggregation across multiple services, you can add 5–20ms versus a direct call — the upper end of that range is the fan-out case, not the simple-proxy case. For most product use cases this is acceptable. For ultra-low-latency internal service calls with a sub-millisecond budget, keep the gateway out of the east-west path and use the service mesh or direct gRPC instead.

Building up to the design

V1: Expose services directly

Each microservice has its own DNS name and port. The client holds a service registry or hardcoded hostnames. No extra component, simplest possible path from client to service. The problem is that clients must implement auth, retry, and service discovery themselves. Adding TLS means updating every client. Adding a new service leaks into every client.

V2: Nginx reverse proxy

A single Nginx instance routes by path prefix. All TLS terminates here. Backend services are plain HTTP. You've centralized TLS and given clients one hostname. But auth is still per-service, rate limiting is primitive or absent, and you can't easily do response aggregation or JWT verification in pure Nginx config without Lua or OpenResty.

V3: Purpose-built API gateway

Replace Nginx with a proper gateway — Kong, Envoy, AWS API Gateway, or your own service. Auth plugins, rate limiting backed by Redis, a routing table managed via API, aggregation logic, and request/response transforms all live here. The gateway is stateless; a small cluster behind an L4 LB handles peak traffic. All cross-cutting concerns are now in one place, operationally uniform.

The new problem: you have a single gateway for all clients. The mobile team ships slower because every schema change goes through the shared gateway team, and the mobile payload is bloated with fields the app doesn't need.

V4: Add BFFs

Split the gateway into one BFF per client type. Each team owns their BFF. The mobile BFF fetches less, compresses more, and returns a mobile-optimised shape. The web BFF returns rich nested objects. The partner BFF is versioned and stable.

flowchart LR
    V1["V1: direct service calls\nclient knows everything"] --> V2["V2: + Nginx proxy\nTLS in one place"]
    V2 --> V3["V3: + purpose-built gateway\ncross-cutting concerns centralised"]
    V3 --> V4["V4: + BFFs\neach client owns its gateway"]
    style V1 fill:#0e7490,color:#fff
    style V2 fill:#15803d,color:#fff
    style V3 fill:#ff6b1a,color:#0a0a0f
    style V4 fill:#a855f7,color:#fff

Full architecture

flowchart TD
    INET[Internet] --> LB4[L4 Load Balancer]
    LB4 --> GW1[Gateway instance 1]
    LB4 --> GW2[Gateway instance 2]
    LB4 --> GW3[Gateway instance 3]

    GW1 --> REDIS[(Redis\nrate limit counters)]
    GW2 --> REDIS
    GW3 --> REDIS

    GW1 --> AUTHSVC[Auth Service\npublic key endpoint]

    GW1 -->|route /users| USVC[User Service]
    GW1 -->|route /orders| OSVC[Order Service]
    GW1 -->|route /products| PSVC[Product Service]
    GW1 -->|aggregate /page| PSVC
    GW1 -->|aggregate /page| OSVC

    GW1 --> OBS[Observability\nlogs + traces]

    style LB4 fill:#ffaa00,color:#0a0a0f
    style GW1 fill:#ff6b1a,color:#0a0a0f
    style GW2 fill:#ff6b1a,color:#0a0a0f
    style GW3 fill:#ff6b1a,color:#0a0a0f
    style REDIS fill:#15803d,color:#fff
    style USVC fill:#0e7490,color:#fff
    style OSVC fill:#0e7490,color:#fff
    style PSVC fill:#0e7490,color:#fff

Gateway instances are stateless; any of the three can handle any request. The L4 LB does health checks at the TCP level and routes around failed instances. Rate-limit counters live in a shared Redis cluster so limits are enforced globally across the fleet, not per-instance. JWT verification uses the auth service's cached public key — fetched at startup and refreshed on a schedule, never per-request. Observability data (logs, distributed trace spans) is emitted asynchronously so the critical path is not gated on a logging write.

Storage and component choices

Concern	Typical choice	Reasoning
Rate limit counters	Redis (INCR + EXPIRE for fixed windows, or a Lua script for a token bucket)	In-memory, atomic, fast; shared across all gateway instances
Auth token verification	JWT + local public key	No network call on the hot path; key rotation handled out-of-band
Routing config	In-process config file or control-plane API (e.g. etcd)	Pushed to gateways; routing decisions require zero I/O
Response cache	In-process LRU + Redis for shared / invalidatable entries	In-process is fastest; Redis allows explicit invalidation across instances
TLS certificates	Cert manager (e.g. cert-manager on K8s, ACM on AWS)	Automated renewal; one cert per gateway cluster
Aggregation logic	Thin service code in gateway (Go, Node, JVM)	Needs parallel I/O, error handling, timeouts — easier in code than config

Things to discuss in an interview

Why not just call services directly? Enumerate the coupling and cross-cutting concerns problems.
How do you keep the gateway thin? Rule: if it touches the domain model, it belongs in a service.
BFF vs a single gateway? Ask about the number and variety of client types first; BFF adds operational cost.
Gateway vs service mesh? North-south vs east-west; they are complementary.
How does auth work? Local JWT verification vs introspection; the latency trade-off.
How do you handle a slow downstream? Circuit breakers, bounded fan-out, timeouts, fallbacks.
How do you deploy a gateway change without downtime? Rolling deploys work because gateways are stateless; canary via weighted routing.

Things you should now be able to answer

Why does a microservice architecture need an API gateway?
What is the difference between authentication at the gateway and authorization inside a service?
What does "stateless gateway" mean and why does it matter for scaling?
When does a single API gateway become a bottleneck, and how do you fix it?
What is the BFF pattern and when does it justify the extra operational complexity?
How is an API gateway different from a service mesh? Can you run both simultaneously?
What is the "god gateway" anti-pattern and how do you avoid it?

Frequently asked questions

What is an API gateway?

An API gateway is a stateless reverse proxy that sits at the north-south boundary between every client and your microservice fleet. It centralises cross-cutting concerns — TLS termination, JWT auth, rate limiting, routing, protocol translation, response aggregation, caching, and observability — so that individual services never have to implement them.

What is the Backend-for-Frontend (BFF) pattern and when should I use it?

The BFF pattern gives each client type (web, mobile, partner API) its own thin gateway service, owned by the team that owns that client. It is warranted when you have meaningfully different client types with genuinely different data needs, different teams owning different surfaces, or one client with a significantly different latency or payload-size budget such as IoT or mobile on 2G. On a small team or a uniform client surface, a single gateway is simpler and sufficient.

What is the difference between an API gateway and a service mesh?

An API gateway handles north-south traffic — inbound from external clients to services — and its primary concerns are auth, rate limiting, routing, aggregation, and external-facing TLS. A service mesh handles east-west traffic — service to service inside the cluster — and focuses on mTLS between services, retries, circuit breaking, and internal observability. They are complementary: a typical production setup runs both simultaneously.

How fast is JWT verification at the gateway, and why does it matter?

HS256 JWT verification takes roughly 0.005ms and RS256 takes roughly 0.06 to 0.14ms per request, and neither requires an outbound network call: HS256 uses a pre-configured shared secret, while RS256 uses a public key fetched from a JWKS endpoint at startup and cached in memory. The alternative, opaque token introspection, adds a full synchronous network hop to the auth service on every request, so at 50k req/s the auth service's throughput becomes the ceiling for the whole API unless results are aggressively cached.

How many gateway nodes do you need at 50k requests per second?

A single mid-spec gateway process handles roughly 10,000 to 30,000 req/s at L7 HTTP with auth, so 50k req/s needs 2 to 5 nodes for raw throughput; with a spare for failover and rolling deploys, a small fleet of 3 to 6 stateless gateway nodes behind an L4 load balancer is sufficient. At an average gateway overhead of 4ms per request, only about 200 requests are inside the gateway layer at any moment. Rate-limit counters live in a shared Redis cluster, which comfortably handles the resulting 50k operations per second given Redis's single-instance ceiling of around 100,000 to 200,000 ops/sec.
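The sizing above can be reproduced with a few lines of arithmetic. This sketch treats the 3-to-6 range as raw throughput need plus one spare node, which is one plausible reading of the numbers and not a stated rule; the function names and the `spare_nodes` parameter are assumptions.

```python
import math


def nodes_needed(peak_rps: float, per_node_rps: float, spare_nodes: int = 1) -> int:
    """Nodes for raw throughput, plus spares for failover and rolling deploys."""
    return math.ceil(peak_rps / per_node_rps) + spare_nodes


def requests_in_flight(peak_rps: float, overhead_seconds: float) -> float:
    """Little's law: how many requests sit inside the gateway layer at once."""
    return peak_rps * overhead_seconds


# Using the article's figures: 50k req/s, 10k to 30k req/s per node, 4ms overhead.
best_case = nodes_needed(50_000, 30_000)    # fast nodes
worst_case = nodes_needed(50_000, 10_000)   # slow nodes
concurrency = requests_in_flight(50_000, 0.004)
```

Here `best_case` and `worst_case` bracket the fleet size, and `concurrency` comes out at about 200, small enough that connection and memory limits are unlikely to be the constraint at this rate.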
