MODULE 08 / 12 · crash course
Beginner

Load Balancers, Proxies, and Service Mesh

L4 vs L7 load balancing, algorithms from round robin to Maglev, health checks done right, sticky sessions, anycast, GSLB, service mesh, and DDoS mitigation.

15 min read · 2026-01-22 · Ironclad Academy

A load balancer is a piece of infrastructure that takes incoming requests and spreads them across many backend servers. Without one (or an equivalent client-side mechanism), a request-serving tier cannot scale horizontally.

This module covers what load balancers do, the algorithms they use, the layers of LBs you'll find in any real system (DNS → edge → regional → service mesh), and the pitfalls — sticky sessions, bad health checks, hash-and-modulo — that cause real-world outages.

Why load balancing exists

Imagine you have one app server. It can handle 1,000 requests per second. You hit 2,000 RPS. What do you do?

Answer: spin up a second app server, put a load balancer in front, send half the traffic to each.

flowchart LR
    U[Users] --> LB[Load Balancer]
    LB --> S1[Server 1]
    LB --> S2[Server 2]
    LB --> S3[Server N...]
    style LB fill:#ff6b1a,color:#0a0a0f

That one change gives you four things at once: horizontal scaling so you add capacity by adding boxes, high availability so one dead server doesn't take the whole system down, rolling deploys where you drain a server before touching it, and a single DNS endpoint so clients never have to know about the cluster underneath.

L4 vs L7 (the most common confusion)

| | L4 (transport) | L7 (application) |
|---|---|---|
| Looks at | TCP/UDP headers (IP, port) | HTTP headers, method, path, cookies |
| Decision based on | Source IP, port, hash | URL, host, headers, body |
| Speed | Very fast (just forward bytes) | Slower (parse HTTP) |
| Features | Basic routing | Path-based routing, rewrites, A/B, sticky sessions, TLS termination, retries |
| Examples | AWS NLB, HAProxy (TCP mode), Linux IPVS, Maglev | AWS ALB, NGINX, Envoy, Traefik, Cloudflare |

flowchart TD
    REQ[Request: GET api.example.com/users/42] --> L4
    L4{L4 LB} -->|"hash(src_ip, port) → server"| BE1[Backend 1]
    L4 --> BE2[Backend 2]
    REQ2[Request: GET api.example.com/users/42] --> L7
    L7{L7 LB} -->|"path matches /users/* → users svc"| US[Users Service]
    L7 -->|"path matches /posts/* → posts svc"| PS[Posts Service]
    style L4 fill:#0e7490,color:#fff
    style L7 fill:#ff6b1a,color:#0a0a0f

In modern web stacks, you almost always want L7 at the edge — path-based routing, TLS termination, observability — and possibly L4 inside the cluster for raw performance.

L4 wins when you have non-HTTP traffic (databases, custom TCP protocols, gRPC bidirectional streams that are easier to forward at the byte level), when you need line-rate throughput without parsing, or when memory is tight and you can't afford buffering request bodies. L7 wins when you need to make decisions on path, host, or header — API gateways, multi-tenant routing, A/B testing — or when you want retries that actually understand HTTP semantics, TLS termination, WAF rules, or per-request observability.
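
The difference in what each layer can see is easy to show in a few lines. The sketch below is illustrative only: the backend addresses, the route table, and the function names `l4_pick` and `l7_route` are assumptions made for this example, not part of any real load balancer's API.

```python
import hashlib

# Hypothetical backend pool and routing table, invented for this sketch.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ROUTES = [("/users/", "users-svc"), ("/posts/", "posts-svc")]


def l4_pick(src_ip: str, src_port: int) -> str:
    """L4: only connection metadata is visible, so hash it to pick a backend."""
    digest = hashlib.sha256(f"{src_ip}:{src_port}".encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


def l7_route(path: str) -> str:
    """L7: the parsed HTTP path is visible, so route on its prefix."""
    for prefix, service in ROUTES:
        if path.startswith(prefix):
            return service
    return "default-svc"
```

The L4 function never looks at the request, so it cannot tell `/users/42` from `/posts/7`; the L7 function can, at the cost of parsing HTTP first.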

Load balancing algorithms

Seven algorithms worth knowing. Pick the wrong one and load across your fleet becomes unbalanced even when traffic is uniform.

Round robin

Server 1, server 2, server 3, server 1, server 2, … Works when servers are identical and requests are roughly uniform. The moment requests start varying in cost — a slow POST /export lands on one server while fast GET /health checks are spread everywhere else — that slow server drowns while others stay idle.
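
A minimal sketch of round robin and of the failure mode described above. The class name and the request costs are made up for illustration; the point is only that the picker never looks at how expensive a request is.

```python
from itertools import cycle


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring request cost."""

    def __init__(self, servers):
        self._rotation = cycle(list(servers))

    def pick(self):
        return next(self._rotation)


# Hypothetical request costs (arbitrary units): every third request is slow.
costs = [100, 1, 1, 100, 1, 1]
balancer = RoundRobin(["s1", "s2", "s3"])
load = {"s1": 0, "s2": 0, "s3": 0}
for cost in costs:
    load[balancer.pick()] += cost
```

With this particular arrival pattern both slow requests land on `s1`, even though each server received exactly two requests.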

Least connections

Send to the server with the fewest active connections. Much better when request durations vary widely. The trap is startup: a freshly restarted server reports 0 connections, so it immediately receives all the traffic and can fall over before it's warmed up. Pair this with a slow-start ramp to avoid the spike.
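
The startup trap can be reproduced with a small sketch. The class below is a simplified assumption of how a least-connections picker keeps its counters, not any specific product's implementation.

```python
class LeastConnections:
    """Pick the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def add_server(self, server):
        # A new or restarted server starts with zero tracked connections.
        self.active[server] = 0

    def acquire(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


lb = LeastConnections(["a", "b"])
lb.active.update({"a": 50, "b": 50})  # two warm servers under steady load
lb.add_server("c")                    # freshly restarted, cold
picks = [lb.acquire() for _ in range(20)]
```

Every one of the next 20 connections goes to `c`, because its count stays below 50 the whole time. A slow-start ramp caps that burst.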

Least response time

Send to the server with the lowest recent response time. It adapts automatically to overloaded servers. The cold-cache trap is real though: a server with a cold cache sees high response times, gets less traffic because of those high times, stays cold because of the low traffic, and never recovers. Usually paired with least-connections as a tiebreaker to prevent that feedback loop.
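
One way to sketch this, assuming an exponentially weighted moving average (EWMA) of latency as the "recent response time" and active connections as the tiebreaker. The smoothing factor `alpha` and all names are illustrative choices, not taken from a particular load balancer.

```python
class LeastResponseTime:
    """Pick the lowest recent latency; break ties on active connections."""

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha
        # EWMA starts at 0.0, so an unmeasured server is tried first.
        self.ewma_ms = {server: 0.0 for server in servers}
        self.active = {server: 0 for server in servers}

    def record(self, server, latency_ms):
        previous = self.ewma_ms[server]
        self.ewma_ms[server] = (1 - self.alpha) * previous + self.alpha * latency_ms

    def pick(self):
        return min(self.ewma_ms, key=lambda s: (self.ewma_ms[s], self.active[s]))
```

Because the average only updates when a server receives traffic, a server that stops being picked keeps its stale high value, which is the feedback loop described above.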

IP hash / source affinity

hash(client_ip) % num_servers. Same client always lands on the same server — useful for sticky sessions and in-process caches. The problem is NAT: entire corporate networks or mobile carriers can share a single public IP, so one bucket gets hammered while the others sit quiet.
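
The NAT problem shows up immediately in a sketch. The client population below is invented for illustration (1,000 users behind one public IP plus 30 individual clients), and SHA-256 stands in for whatever hash a real balancer uses; Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import hashlib
from collections import Counter


def ip_hash(client_ip: str, num_servers: int) -> int:
    """Map a client IP to a server index with a stable hash."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_servers


# Hypothetical traffic: one NAT gateway fronting 1,000 users, plus 30 others.
clients = ["203.0.113.7"] * 1000 + [f"198.51.100.{i}" for i in range(30)]
load = Counter(ip_hash(ip, 3) for ip in clients)
```

Whichever server the NAT address hashes to carries at least 1,000 of the 1,030 requests.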

Power of two choices (P2C)

Pick two servers at random, send to the one with fewer connections.

flowchart LR
    REQ[Request] --> P[Pick 2 random]
    P --> S1[Server A: 10 conns]
    P --> S2[Server B: 3 conns]
    S1 --> CMP{Compare}
    S2 --> CMP
    CMP -->|fewer conns wins| CHOOSE[Choose B]
    style CHOOSE fill:#15803d,color:#fff

Surprisingly close to optimal least-connections, with a fraction of the bookkeeping. Used by HAProxy, NGINX (with the random directive), and many service meshes. Almost always the right default for L4-ish balancing.
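
A small simulation gives a feel for why two choices help so much. It rests on simplifying assumptions: connections never close, all requests cost the same, and the two candidates are drawn with replacement. The function name and parameters are made up for this sketch.

```python
import random


def simulate(num_servers: int, num_requests: int, choices: int, seed: int = 0) -> int:
    """Return the busiest server's connection count after assigning all requests."""
    rng = random.Random(seed)
    conns = [0] * num_servers
    for _ in range(num_requests):
        # Sample `choices` candidates and send to the least loaded of them.
        candidates = [rng.randrange(num_servers) for _ in range(choices)]
        target = min(candidates, key=lambda i: conns[i])
        conns[target] += 1
    return max(conns)


random_max = simulate(100, 10_000, choices=1)  # purely random placement
p2c_max = simulate(100, 10_000, choices=2)     # power of two choices
```

The perfectly balanced answer is 100 connections per server. Purely random placement leaves the busiest server well above that, while P2C stays close to it, and the balancer only ever compares two counters per request.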

Consistent hashing

A special version of hash-based balancing that minimizes reshuffling when servers are added or removed.

flowchart LR
    K1[key 1] -->|"hash → clockwise to A"| A[Server A]
    K2[key 2] -->|"hash → clockwise to B"| B[Server B]
    K3[key 3] -->|"hash → clockwise to C"| C[Server C]
    K4[key 4] -->|"hash → clockwise to D"| D[Server D]
    A -->|ring| B
    B -->|ring| C
    C -->|ring| D
    D -->|ring| A
    style A fill:#ff6b1a,color:#0a0a0f
    style B fill:#ff6b1a,color:#0a0a0f
    style C fill:#ff6b1a,color:#0a0a0f
    style D fill:#ff6b1a,color:#0a0a0f

The problem with hash % N is that when N changes (a server is added or removed) almost every key remaps. With consistent hashing, only about 1/N of keys move on average. That matters enormously for sharded caches (Redis cluster, Memcached) where you want the same key to keep landing on the same node so cache hits stay warm, and for stateful service routing where a user's WebSocket connection should keep going to the same server.
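
The remapping difference can be measured directly. The sketch below builds a ring with virtual nodes (an assumption; 100 per server is an arbitrary illustrative value) and compares it with hash % N when a fifth server joins. `HashRing` and the key names are hypothetical.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring; each server owns `vnodes` points on the ring."""

    def __init__(self, servers, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first point past the key, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
old_ring = HashRing(["A", "B", "C", "D"])
new_ring = HashRing(["A", "B", "C", "D", "E"])

ring_moved = sum(old_ring.lookup(k) != new_ring.lookup(k) for k in keys) / len(keys)
mod_moved = sum(stable_hash(k) % 4 != stable_hash(k) % 5 for k in keys) / len(keys)
```

Going from four to five servers, `mod_moved` should land near 80% while `ring_moved` should land near one fifth, and every key that moves on the ring moves to the new server `E`.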

We have a whole article on consistent hashing — it's foundational to caches, sharded databases, and service meshes.

Maglev (the Google one)

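Google's Maglev algorithm achieves consistent-hashing-like stability when backends change, but uses a precomputed lookup table instead of a hash ring. Each backend generates a preference permutation over M=65,537 slots (M must be prime so that each permutation visits every slot), and the backends take turns claiming their next preferred free slot until the table is full. The result is near-perfectly even load distribution and O(1) lookups per packet. The trade-off is disruption: Maglev keeps remapping small when servers are added or removed, but it does not guarantee the minimum the way a ring does, so it wins on balance and lookup speed rather than on all three dimensions.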

flowchart TD
    A[Maglev table:<br/>65537 slots] --> B[Each slot maps to<br/>a backend]
    B --> C["Lookup: hash(packet) % 65537<br/>→ slot → backend"]
    style A fill:#a855f7,color:#fff

Used in Google's Maglev (the LB itself), Cloudflare's stateless L4 LB infrastructure, Envoy, and Istio. If you want consistent hashing that load-balances really well at line rate, Maglev is what your modern L4 LB is doing under the hood.
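
A compact sketch of the table-population step, following the published description of the algorithm: each backend derives an offset and a skip from two hashes, and backends take turns claiming their next preferred free slot. The table size of 251 is a small prime chosen to keep the example readable (not the 65,537 mentioned above), and the hash function and names are assumptions for illustration.

```python
import hashlib


def salted_hash(value: str, salt: str) -> int:
    return int.from_bytes(hashlib.sha256(f"{salt}:{value}".encode()).digest()[:8], "big")


def build_table(backends, m=251):
    """Fill an m-slot lookup table; m must be prime."""
    offsets = [salted_hash(b, "offset") % m for b in backends]
    skips = [salted_hash(b, "skip") % (m - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)  # position in each backend's permutation
    table = [None] * m
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Advance along backend i's preference list to its first free slot.
            slot = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table


def lookup(table, packet_key: str) -> str:
    """O(1) per packet: hash, index, done."""
    return table[salted_hash(packet_key, "packet") % len(table)]


table = build_table(["a", "b", "c"])
slot_counts = {b: table.count(b) for b in ["a", "b", "c"]}
```

Because backends claim exactly one slot per turn, their slot counts differ by at most one, which is where the even load distribution comes from.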

Health checks

The load balancer needs to know which backends are alive. There are two flavors, and real systems use both.

Active health checks: the LB pings GET /health on each backend every N seconds. If 3 in a row fail, mark the backend as unhealthy and stop sending traffic. Passive health checks: observe live request errors — if a backend returns a wave of 5xx, drain it.
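
The "3 in a row" rule is a small state machine. The sketch below tracks one backend; the `rise` threshold (consecutive passes needed to return to rotation) is an assumption added for the example, since the text above only specifies the failure side.

```python
class HealthTracker:
    """Track one backend's health from a stream of check results."""

    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall    # consecutive failures before marking unhealthy
        self.rise = rise    # consecutive passes before marking healthy again
        self.healthy = True
        self._fails = 0
        self._passes = 0

    def record(self, check_ok: bool) -> bool:
        if check_ok:
            self._passes += 1
            self._fails = 0
            if not self.healthy and self._passes >= self.rise:
                self.healthy = True
        else:
            self._fails += 1
            self._passes = 0
            if self.healthy and self._fails >= self.fall:
                self.healthy = False
        return self.healthy
```

A single failed probe does not eject the backend, and a single passing probe does not readmit it, which keeps one dropped packet from flapping the pool. The same tracker can be fed passive signals by calling `record(False)` for each observed 5xx.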

A good /health endpoint returns 200 only if this server can actually serve traffic (for example, its own DB connection pool is open and its local caches are loaded). It is unauthenticated, exempt from rate limiting, and fast (~10ms). The most important distinction to build into it is "ready" vs "live", which is what makes graceful shutdown possible.

Liveness vs readiness (the K8s convention)

GET /livez    → 200 if process is alive
GET /readyz   → 200 if ready to serve traffic

A /livez failure tells the orchestrator "kill this pod, it's stuck." A /readyz failure says "pull this pod from the LB rotation, but don't kill it; restarting won't help." A common mistake is making /readyz check downstream services. When service B has a slow day, every replica of A suddenly fails its readiness check, and the outage cascades. The rule: /livez checks only what restarting this pod could fix, and /readyz checks only this pod's own ability to serve, never the health of shared dependencies. Covered more in Reliability Patterns.
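
A sketch of the two probes as plain functions, with the HTTP layer left out. The `PodState` fields are hypothetical; what matters is that readiness looks only at the pod's own state and never calls a downstream service.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    event_loop_responsive: bool = True  # process can still do work
    startup_complete: bool = False      # config and local caches loaded
    draining: bool = False              # graceful shutdown has begun


def livez(state: PodState) -> int:
    """Fail only when a restart is the right remedy."""
    return 200 if state.event_loop_responsive else 503


def readyz(state: PodState) -> int:
    """Fail whenever this pod should be out of rotation but left running."""
    ok = state.event_loop_responsive and state.startup_complete and not state.draining
    return 200 if ok else 503
```

During startup and during draining the pod is live but not ready, so the orchestrator removes it from rotation without restarting it.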

Slow-start and the warmup trap

A freshly-deployed instance has cold caches and JIT compilation pending. Hitting it with full production traffic immediately means slow responses or even OOM. Slow-start ramps traffic up to a new instance over a configured window — AWS ALB accepts 30–900 seconds (most teams use 30–60s), and Envoy has an equivalent slow_start_window parameter. It's one of those features easy to miss until you've seen a deploy cascade.

flowchart LR
    NEW[New instance joins] --> RAMP["Slow-start: 0% → 100%<br/>over configured window"]
    RAMP -->|"warmed up"| FULL[Full traffic weight]
    style NEW fill:#0e7490,color:#fff
    style RAMP fill:#ffaa00,color:#0a0a0f
    style FULL fill:#15803d,color:#fff
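
The ramp itself is a one-line function of instance age. The sketch below assumes a linear ramp, which is the simplest option; real load balancers may use a different curve, and the function name is made up for this example.

```python
def slow_start_weight(age_s: float, window_s: float) -> float:
    """Fraction of a full traffic share for an instance that joined age_s ago."""
    if window_s <= 0 or age_s >= window_s:
        return 1.0  # slow-start disabled, or the instance is fully warmed up
    return max(0.0, age_s / window_s)
```

With a 60-second window, an instance that joined 30 seconds ago is weighted at half of what its warmed-up peers receive.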

Connection draining (graceful shutdown)

When you're rolling a deploy:

sequenceDiagram
    participant LB
    participant S as Server (old)
    participant SN as Server (new)
    LB->>S: stop sending NEW connections
    Note over S: drain: complete in-flight reqs
    S->>S: 30s timeout
    S->>S: shutdown
    LB->>SN: send traffic to new instance

Three ingredients: mark the server unready so the LB stops sending new traffic, wait for existing in-flight requests to finish (drain timeout), then shut down. Without this, every deploy throws 5xx errors at the requests that were in flight when the process exited.
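
The three ingredients map onto a few lines of bookkeeping. This is a single-threaded sketch with hypothetical names; a real server would guard the counter with a lock or run it on one event loop, and would read the elapsed time from a monotonic clock.

```python
class DrainingServer:
    """Minimal model of a backend that can drain before shutting down."""

    def __init__(self):
        self.ready = True
        self.in_flight = 0

    def start_request(self) -> bool:
        if not self.ready:
            return False  # unready: the LB should no longer route here
        self.in_flight += 1
        return True

    def finish_request(self) -> None:
        self.in_flight -= 1

    def begin_drain(self) -> None:
        self.ready = False  # step 1: stop accepting new traffic

    def may_shutdown(self, seconds_since_drain: float, drain_timeout_s: float = 30.0) -> bool:
        # Steps 2 and 3: exit once in-flight work finishes or the timeout expires.
        if self.ready:
            return False
        return self.in_flight == 0 or seconds_since_drain >= drain_timeout_s
```

The timeout matters because one stuck request should not block a deploy forever; anything still running when it expires is the cost of the cutoff.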

Sticky sessions (session affinity)

If a user's session lives in memory on server A, they need to keep landing on server A. Cookie-based affinity lets the LB set a cookie like srv=A; subsequent requests honor it. IP affinity hashes on client IP, but it's brittle behind NAT and proxies.
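
Cookie-based affinity reduces to "honor the cookie if it names a healthy backend, otherwise pick one and set it". The sketch below uses the `srv` cookie name from the example above; the function name, return shape, and cookie attributes are assumptions for illustration.

```python
import random


def route(cookies: dict, healthy: list, rng=random):
    """Return (backend, set_cookie_header_or_None) for one request."""
    pinned = cookies.get("srv")
    if pinned in healthy:
        return pinned, None  # existing pin is still valid
    # No cookie, or the pinned backend is gone: choose again and re-pin.
    chosen = rng.choice(healthy)
    return chosen, f"srv={chosen}; Path=/; HttpOnly"
```

The second branch is where sessions are lost: when the pinned backend leaves the pool, the user is silently moved to a server that has none of their in-memory state.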

The better answer is to not use sticky sessions at all. Put session state in Redis or a signed cookie, and make every server stateless. Stateful servers are an operational headache: rolling deploys lose the session of every user pinned to the redeploying box, auto-scaling adds boxes that never get traffic until existing sessions die, and one user with an unusually busy session pins one unusually busy server.

The one exception that earns sticky sessions: WebSocket connections. The connection itself is stateful by definition — it's pinned to one server — so you accept session affinity for those connections, and pair it with consistent hashing so rebalancing on cluster resize doesn't sever too many sessions at once.

Reverse proxies vs load balancers

A reverse proxy sits in front of one or many backends and forwards requests, and it usually does more than balance: TLS termination, request logging, header rewriting, response caching, rate limiting, WAF. L7 load balancers are a subclass of reverse proxy; L4 forwarders that pass packets without terminating the TCP connection (AWS NLB with TCP listeners, Linux IPVS direct routing, Maglev) are not. A typical request path through one:

flowchart LR
    U[Client] --> RP[Reverse Proxy<br/>NGINX / Envoy / Traefik]
    RP -->|TLS terminated| AUTH[Auth]
    RP -->|cached?| CACHE[(Cache)]
    RP -->|rate limit?| RL[Rate Limiter]
    RP --> LB[Load Balancer]
    LB --> S1[Backend 1]
    LB --> S2[Backend 2]
    style RP fill:#ff6b1a,color:#0a0a0f

In practice, NGINX or Envoy plays both roles.

Common reverse proxies, briefly

| Proxy | Origin | Strengths |
|---|---|---|
| NGINX | C, 2004 | Battle-tested, easy config, very fast |
| HAProxy | C, 2000 | TCP/HTTP excellence, low resource use |
| Envoy | C++, 2016 (Lyft) | Modern, observability-first, the service-mesh data plane |
| Traefik | Go, 2015 | Auto-discovers backends in containers/K8s |
| Caddy | Go, 2015 | Automatic HTTPS via Let's Encrypt |
| Cloudflare/Fastly/CloudFront | edge | CDN + WAF + LB in one |

Where to put your load balancers

Real systems have multiple tiers of load balancing:

flowchart TD
    INET[Internet] --> DNS[DNS / Anycast]
    DNS --> EDGE[Edge LB / CDN<br/>CloudFront, Cloudflare]
    EDGE --> ALB[Regional L7 ALB]
    ALB --> SVCMESH[Service Mesh<br/>Envoy sidecars]
    SVCMESH --> S1[Service A pods]
    SVCMESH --> S2[Service B pods]
    style EDGE fill:#ff6b1a,color:#0a0a0f
    style ALB fill:#ffaa00,color:#0a0a0f
    style SVCMESH fill:#0e7490,color:#fff

| Tier | What it does | Examples |
|---|---|---|
| DNS | Routes to nearest region | Route 53, Cloudflare DNS |
| Edge | TLS termination, DDoS, static caching | CloudFront, Cloudflare |
| Regional | Routes to services within a region | ALB, NGINX, Envoy |
| Service mesh | Routes between services in a cluster | Istio, Linkerd, Consul |

Service mesh (the in-cluster LB)

In a microservices architecture, every service-to-service call needs L7 features: retries, timeouts, mTLS, observability, circuit breaking. A service mesh is "an LB sidecar at every pod, configured centrally."

flowchart LR
    A[App A] --> SA[Envoy sidecar A]
    SA --> SB[Envoy sidecar B]
    SB --> B[App B]
    CTL[Control plane<br/>Istio / Consul] -.config.-> SA
    CTL -.config.-> SB
    style SA fill:#ff6b1a,color:#0a0a0f
    style SB fill:#ff6b1a,color:#0a0a0f
    style CTL fill:#a855f7,color:#fff

What the mesh gives you, centrally configured, without app code changes: mTLS between every service pair, per-service retries and timeouts, circuit breakers, request-level metrics and tracing, and traffic shifting (canary 1% to v2). The cost is extra CPU and RAM per pod (the sidecar) and real operational complexity. Don't deploy a service mesh until you have around 10 services and actually need centralized policy — below that, libraries are simpler.

How a load balancer scales itself

If the load balancer itself becomes a bottleneck, you have a problem. The standard fix is horizontal scaling at the DNS level: DNS returns multiple A records (one per LB instance), clients pick one (or rotate). With anycast, the same IP is announced from many locations at once and the network routes you to the closest.

This is why Cloudflare's anycast network can handle global traffic at scale — there are thousands of physical machines spread across PoPs, each announcing the same IP, so the network routes each user to the nearest one. Cloudflare's 1.1.1.1 DNS resolver works exactly this way: one IP, thousands of machines.

Anycast vs DNS load balancing

flowchart LR
    A[User in Tokyo] --> LB1[Anycast IP 1.2.3.4<br/>routed to Tokyo PoP]
    B[User in NYC] --> LB2[Anycast IP 1.2.3.4<br/>routed to NYC PoP]
    style LB1 fill:#15803d,color:#fff
    style LB2 fill:#15803d,color:#fff

| | DNS-based | Anycast |
|---|---|---|
| Routes to | Different IPs per region | Same IP, BGP picks closest |
| Failover speed | TTL-bound (minutes) | Seconds (BGP withdrawal) |
| Setup | DNS + GeoIP | Same prefix announced over BGP from multiple sites |
| Used for | App traffic | DNS, CDN edge, DDoS mitigation |

Modern global load balancers (Cloudflare, AWS Global Accelerator, GCP Global LB) use anycast: when one PoP fails, BGP reroutes traffic to the next-nearest one within seconds.

Failover and disaster recovery

What happens when an entire data center goes down?

flowchart LR
    U[Users] --> DNS
    DNS -->|primary| EAST[us-east-1]
    DNS -.failover.-> WEST[us-west-2]
    EAST -->|crashed!| X((✗))
    style EAST fill:#ff2e88,color:#fff
    style WEST fill:#15803d,color:#fff

Three modes to know: active-passive has one region serving while the other sits on standby — cheaper, but failover is a real event involving DB promotion and a DNS shift. Active-active (sharded) has each region serve a shard of users, which means losing a shard if a region dies until you fail over. Active-active (replicated) has every region serve every user, requiring multi-region database replication — most resilient, most expensive.

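DNS failover is slow because it's TTL-bound. Anycast is faster: withdrawing a BGP route shifts traffic in seconds rather than waiting for TTLs to expire. GSLB (global server load balancing) products that steer traffic through DNS answers remain TTL-bound, so they rely on health-checked records with short TTLs to narrow the gap.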

DDoS mitigation

A load balancer is one of the few places where DDoS mitigation can happen at scale. Anycast itself absorbs volumetric attacks across many PoPs, spreading the blast across the whole network rather than concentrating it. SYN cookies let you respond to TCP handshakes without allocating server state, which defeats SYN floods. Rate limiting per-IP or per-ASN at the edge handles application-layer floods. WAF rules block based on URL patterns and known-bad user agents. Slow-connection attacks (slowloris) die when you enforce aggressive TCP and HTTP timeouts on idle connections.

For most teams, the practical answer is "use Cloudflare, AWS Shield, or GCP Cloud Armor." DIY DDoS mitigation is a specialty engineering problem — the infrastructure needed to absorb a serious attack costs more to build than to buy.

Rate limiting (a preview)

Rate limits at the LB protect downstream services from misbehaving clients. Two algorithms at a glance:

  • Token bucket: bucket of N tokens, refills at R per second. Each request consumes one. Allows bursts up to N.
  • Sliding window: count requests in the last 60s; reject when over the limit.
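
Both algorithms fit in a few lines each. In this sketch the caller passes the current time explicitly so the behavior is easy to reason about; in real use it would come from a monotonic clock such as `time.monotonic()`. The sliding window shown is the simple log variant that stores one timestamp per accepted request, and the class names are illustrative.

```python
from collections import deque


class TokenBucket:
    """Bucket of `capacity` tokens that refills at `refill_per_s`."""

    def __init__(self, capacity: int, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)  # start full, so bursts up to capacity pass
        self.last = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SlidingWindow:
    """Allow at most `limit` requests in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.stamps and self.stamps[0] <= now - self.window_s:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False
```

The token bucket costs two numbers per client and tolerates bursts; the sliding window log enforces a strict count but stores a timestamp per request, which adds up at high limits.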

When rate limited, return:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1736284800

Full coverage in Reliability Patterns.

A load-balancer checklist

Before any service is exposed to users:

  • L7 LB at the edge with TLS termination?
  • Healthy backends only — proper health checks, with circuit breaker semantics?
  • Connection draining configured for graceful deploys?
  • Rolling deploys with slow-start ramp?
  • No sticky sessions unless absolutely necessary?
  • Anycast or GSLB for multi-region failover?
  • DDoS mitigation at the edge (Cloudflare-class)?
  • Rate limiting per IP and per API key?
  • Observability: per-backend success rate, latency, RPS visible?
  • Path-based routing rules tested in staging?

Things you should now be able to answer

  • What's the difference between L4 and L7 load balancing?
  • Why is consistent hashing better than hash % N?
  • What's wrong with sticky sessions, and when are they unavoidable?
  • Where do you terminate TLS — at the edge, at the LB, or at the app?
  • How does a load balancer detect that a backend has failed?
  • Power-of-two-choices is "almost optimal" with much less work — why does that work?
  • A new pod joins your cluster and immediately gets full traffic — why is this bad and what's the fix?
  • Service mesh "moves LB into a sidecar at every pod." What problems does this solve that a single regional LB can't?

→ Next: Message queues & streams

Frequently asked questions

What is the difference between L4 and L7 load balancing?

L4 load balancers operate on TCP/UDP headers and forward bytes without parsing application data, making them very fast. L7 load balancers inspect HTTP headers, paths, cookies, and bodies, enabling path-based routing, TLS termination, retries, A/B testing, and per-request observability. In modern stacks, the recommendation is L7 at the edge and optionally L4 inside the cluster for raw throughput.

Why is consistent hashing better than hash modulo N for routing?

With hash mod N, adding or removing a single server remaps nearly every key to a different backend, invalidating warm caches and breaking stateful sessions. Consistent hashing limits reshuffling to only 1/N of keys when the cluster size changes. That property is critical for sharded caches like Redis or Memcached where keeping the same key on the same node preserves cache hit rates.

When should sticky sessions be used, and what is the preferred alternative?

Sticky sessions are justified for WebSocket connections, which are stateful by definition and must stay pinned to one server. For HTTP request sessions, the better approach is to store session state in Redis or a signed cookie, making every server stateless. Stateful servers complicate rolling deploys and auto-scaling, and they cause uneven load when one user's session is unusually busy.

What is the slow-start feature in load balancers and why does it matter?

Slow-start ramps traffic to a newly joined instance gradually from 0% to 100% rather than sending it full production load immediately. A freshly deployed server has cold caches and pending JIT compilation, so immediate full traffic causes slow responses or OOM crashes. AWS ALB accepts a slow-start window of 30 to 900 seconds, with most teams using 30 to 60 seconds; Envoy has an equivalent slow_start_window parameter.

How does anycast differ from DNS-based load balancing, and when is each used?

DNS-based load balancing returns different IP addresses per region, with failover speed bounded by TTL, meaning failover takes minutes. Anycast advertises the same IP from multiple locations and lets BGP route users to the nearest point of presence, with failover completing in seconds via BGP route withdrawal. Anycast is used for DNS infrastructure, CDN edges, and DDoS mitigation, while DNS-based routing handles general application traffic across regions.