Caching
Cache hierarchies, write strategies, eviction policies, Redis data structures, the four classic cache pathologies, and how to size and warm a cache properly.
A cache trades memory for speed. It is the single most powerful primitive in system design — the reason 99% of websites feel fast despite their backends being slow. A well-placed cache makes a system 100× faster; a badly-placed cache makes a system 100× more confusing.
This module covers where caches live, how to read/write through them, what to evict, the failure modes nobody warns you about, and the data structures (especially in Redis) that make modern caches more than just "hash maps."
Why caching works
Two facts about real systems:
- Access patterns are skewed. A small number of items get most of the requests.
- Compute costs more than memory. A SQL query computing aggregates costs 10–1000× more than a Redis GET.
Multiply: keep the 20% hot data in fast storage, and you serve 80% of traffic at 100× the speed for far less than 100× the cost. This is the deal caches offer, and it's why every system has at least one.
The cache hierarchy (you're already using one)
flowchart LR
U[User Browser] -->|10ms| CDN
CDN -->|5–30ms| LB[Load Balancer]
LB -->|2ms| APP[App Server]
APP -->|in-process<br/>50μs| LC[Local Cache]
LC -->|miss → 1ms| RC[Redis Cluster]
RC -->|miss → 10ms| DB[(Database)]
DB -->|miss → 50ms| OS[(Object Storage)]
style CDN fill:#ff6b1a,color:#0a0a0f
style LC fill:#ffaa00,color:#0a0a0f
style RC fill:#15803d,color:#fff
style DB fill:#0e7490,color:#fff
Every layer is there because the next layer is slower. Browser cache → CDN → Local cache → Redis → Database → Object storage. A hit at any layer skips the rest.
The key property: a cache hit at layer N is roughly 10–100× faster than a hit at layer N+1.
What lives where
| Layer | Examples | Latency | Best for |
|---|---|---|---|
| Browser | HTTP cache, IndexedDB | 0ms | Static assets, user-specific data |
| Edge / CDN | CloudFront, Cloudflare, Fastly | 5–30ms | Global static + edge logic |
| Reverse proxy | Varnish, Nginx | <1ms | API responses, rendered pages |
| App-local (in-process) | Caffeine (Java), LRU map | μs | Reference data, hot lookups |
| Distributed | Redis, Memcached | 1ms | Shared session, query cache, fanout |
| Database buffer pool | Postgres shared_buffers | μs | Hot pages |
| OS page cache | Linux | μs | All file/disk I/O |
You don't pick one — most systems use four or five of these in concert.
In-process caches
The fastest cache is the one inside your process. No network, no serialization — nanosecond access with no infrastructure to manage. The downside is that each app server has its own copy, so N servers means N copies. It doesn't survive restarts, and a write on one server won't invalidate the cache on another.
Use in-process caches for small reference data that's stable and safe to be slightly stale: feature flags, plan tiers, decoded JWT keys. Anything that has to be globally consistent belongs in a shared store instead.
Distributed caches (Redis, Memcached)
A distributed cache — Redis or Memcached — is a single shared store visible to all app instances. Because it's network-attached, you pay about 1ms per operation, and it becomes a dependency that can fail. But a single write invalidates the value for every server at once, and the cache can grow larger than any individual instance can hold.
This makes Redis the right choice for session stores, query caches, leaderboards, rate limiters that need to be globally enforced, and fanout data like social timelines.
Cache write strategies
When the data changes, what does the cache do? There are five patterns, each with a different correctness/latency profile.
Cache-aside (lazy loading) — the most common in practice
The application is responsible for both reading and writing the cache.
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: get(k)
alt cache hit
Cache-->>App: v
else cache miss
Cache-->>App: null
App->>DB: select where id = k
DB-->>App: v
App->>Cache: set(k, v, ttl)
end
For writes:
def write(key, value):
db.write(key, value)
cache.delete(key) # or set(key, value); both are common
The read is lazy — the cache only fills after the first miss. The first reader after a write sees stale or nothing; subsequent readers hit cache. This is simple, robust, and it's the default choice for most systems. The main thing to watch for is the race window between a write and the cache delete.
Write-through
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: write(k, v)
Cache->>DB: write(k, v)
DB-->>Cache: ok
Cache-->>App: ok
Write goes to both stores synchronously before acknowledging the caller. The cache is always fresh — no stale read after a write. You pay for both stores on every write, which is fine when writes are infrequent or when freshness matters more than write speed.
Write-back (write-behind)
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: write(k, v)
Cache-->>App: ok
Note over Cache,DB: async, batched
Cache->>DB: write(k, v)
Write to cache and return immediately; flush to the database later in batches. Write latency drops dramatically, and batching means fewer, larger writes to the DB. The risk: if the cache node dies before flushing, that data is gone. Reserve write-back for situations where you can afford brief data loss — telemetry pipelines, analytics counters, game state.
Write-around
The app writes directly to the database and skips the cache entirely. The cache gets populated only when a subsequent read misses. This pattern protects the cache from write-heavy data that won't be re-read soon — bulk imports, log data, cold backfills. The first read after any write will always miss.
Read-through
The cache itself fetches from the database on a miss, rather than the application. You write to the cache; it handles the DB synchronization. Common with managed caches like AWS DAX in front of DynamoDB. The application code is simpler because it only talks to the cache, but the cache and DB are now tightly coupled.
Picking a strategy
Here is a quick decision tree:
flowchart TD
Q1{Reads or writes dominant?}
Q1 -->|Reads| Q2{Tolerate stale on miss?}
Q1 -->|Writes| Q3{Can you tolerate data loss?}
Q2 -->|Yes| CA[Cache-aside]
Q2 -->|No| WT[Write-through]
Q3 -->|Yes| WB[Write-back]
Q3 -->|No| WT2[Write-through]
style CA fill:#15803d,color:#fff
style WT fill:#ff6b1a,color:#0a0a0f
style WB fill:#ffaa00,color:#0a0a0f
style WT2 fill:#ff6b1a,color:#0a0a0f
If you have to pick one default, cache-aside with TTL is what 80% of systems use.
Eviction policies
When the cache fills up, what gets thrown out?
| Policy | What it evicts | When it shines |
|---|---|---|
| LRU (Least Recently Used) | Oldest-accessed | General-purpose; popular for in-process caches |
| LFU (Least Frequently Used) | Lowest hit count | Long-tail traffic with stable popularity |
| FIFO | Oldest-inserted | Simple; rarely the right choice |
| TTL | Anything past expiry | Time-bounded freshness |
| Random | Random | Surprisingly hard to beat at high pressure |
| TinyLFU + LRU (Caffeine) | Hybrid: LFU admission, LRU eviction | Better hit rate than either alone |
Redis ships with noeviction as the default — it returns errors when full, which is almost never what you want. You almost always want allkeys-lru or volatile-lru unless you have a specific reason.
maxmemory 8gb
maxmemory-policy allkeys-lru
| Redis policy | Behavior |
|---|---|
noeviction | Return errors when full |
allkeys-lru | Evict any key by LRU |
volatile-lru | Evict only keys with TTL by LRU |
allkeys-lfu | Evict any key by LFU |
volatile-lfu | Evict only keys with TTL by LFU |
allkeys-random | Evict a random key |
volatile-ttl | Evict the key closest to expiry |
A subtle point about LRU at scale
True LRU requires keeping a sorted list of every key — expensive at scale. Redis approximates LRU by sampling N random keys and evicting the oldest among them (default N=5; tunable via maxmemory-samples). Since Redis 3.0 the algorithm also maintains a small pool of eviction candidates across sampling rounds, which makes it significantly closer to true LRU than pure random sampling. That's why "Redis LRU" is fast but not strictly LRU. In practice, the approximation is fine.
TTLs (time-to-live)
Every cache entry should have a TTL — even if "infinite is fine." TTLs serve as:
- Belt-and-suspenders freshness: even if your invalidation logic has a bug, the stale entry expires.
- Memory bound: cold keys quietly disappear instead of accumulating.
- A safety valve: a runaway cache problem self-heals after the TTL.
Don't synchronize TTLs. If you set ttl=3600 on every key at the same time, in 1 hour they all expire simultaneously and your DB melts. Always jitter: ttl = base_ttl + random(0, base_ttl × 0.1).
The 80/20 rule
flowchart LR
A[Top 20% of keys] -->|serve 80% of requests| B[Cache it]
C[Bottom 80% of keys] -->|serve 20% of requests| D[Skip the cache]
style B fill:#15803d,color:#fff
style D fill:#6b6b85,color:#0a0a0f
Caches work because access patterns are skewed (Pareto / Zipf distribution). The "head" of the long tail makes up most of your traffic and fits in cache. The "tail" rarely repeats and is wasted memory.
This is why your cache size estimate is 0.2 × daily_read_bytes, not 1.0 × daily_read_bytes.
Sizing a cache
The math:
1. Compute daily read volume in bytes.
2. Multiply by hot ratio (typically 20% for Zipf-like distributions).
3. Add 20–30% for overhead (Redis metadata, memory fragmentation).
4. Divide by per-instance RAM (e.g. 32 GB) to get cluster size.
Example: 10 TB/day read volume × 20% hot × 1.3 overhead ≈ 2.6 TB cache → 80 instances of 32 GB.
Cache hit ratio is the metric to track. A well-sized cache hits 90%+; below 80% and you're either undersized or thrashing (TTLs too short, evictions too aggressive).
Cache problems and fixes
Four classic pathologies. If you've never run into them, you will.
1. Stale data
The cache says X, the database says Y. This happens when a write lands in the database but the old value is still sitting in the cache — your invalidation logic missed a case, or there isn't any.
The standard toolkit: set TTLs so stale entries eventually expire on their own; call cache.delete(k) explicitly after every write; subscribe to a database change feed (CDC via Debezium) and drive invalidations from there; or use versioned keys (user:42:v3) so bumping the version number makes every stale cached value effectively invisible — no delete needed.
For most apps, TTLs plus explicit delete-on-write is enough. CDC-driven invalidation is the next step when "stale read" is causing real bugs.
2. Thundering herd / cache stampede
A hot key expires. Thousands of concurrent requests miss simultaneously. They all hit the database. The database falls over.
sequenceDiagram
participant R1 as Request 1
participant R2 as "Request 2...10k"
participant C as Cache
participant DB as Database
R1->>C: get(hot_key)
C-->>R1: null (expired)
R2->>C: get(hot_key)
C-->>R2: null (expired)
Note over R1,R2: All 10k requests proceed
R1->>DB: SELECT ...
R2->>DB: SELECT ...
Note over DB: 10k simultaneous queries — DB melts
style DB fill:#ff2e88,color:#fff
The cleanest fix is single-flight / request coalescing: only one request rebuilds the cache; every other concurrent request waits on that result instead of firing its own query. Alternatively, use a distributed lock (SET NX EX) around the rebuild so only one process wins the race. You can also use probabilistic early expiration — a random fraction of requests refresh slightly before the TTL — or maintain two TTLs: a soft one that triggers a background refresh while still serving stale, and a hard one that forces a fresh fetch.
3. Cache penetration
Requests for keys that don't exist — an attacker probing for /user/-1, for instance. Every request misses the cache, hits the database, gets nothing back, and skips the cache again. The DB takes a sustained beating from requests that will always be misses.
Three ways to stop it: cache the negative (null with a short TTL), so the second identical request is served from cache; put a Bloom filter in front of the cache so requests for definitively nonexistent keys never even reach it; or validate input before touching the cache or DB and reject obviously bad IDs at the edge.
4. Cache avalanche
A large number of keys expire at the same wall-clock moment — because you set them all to expire at midnight, for instance. At midnight the DB gets slammed.
Jitter fixes this almost entirely: ttl = base_ttl + random(0, 60s). If your cache supports it, also configure graceful degradation so the system serves stale values on cache failure rather than passing every miss through to the DB simultaneously.
Cache warming
A cold cache is dangerous. After a restart or a new deploy, every request misses, the DB melts, and the system may never recover because the load keeps the DB too busy to respond fast enough to warm the cache.
flowchart LR
START[Cache restart] --> COLD[Cold cache]
COLD --> MISS[All requests miss]
MISS --> DB[(Database overloaded)]
DB --> SLOW[Slow responses]
SLOW --> MORE[More concurrent requests pile up]
MORE --> DB
style COLD fill:#ff2e88,color:#fff
style DB fill:#ff6b1a,color:#0a0a0f
Strategies to prevent this: pre-load the most popular keys at startup with a background job that runs the top-N most-frequent queries; replay yesterday's traffic as "shadow" traffic against the new cache before promoting it; do a gradual cutover (1% of traffic → 10% → 50% → 100%) so the cache fills as you ramp; or run blue/green deploys where you warm the green cache before switching over.
For Redis specifically: Redis can preload from RDB/AOF on startup — if persistence is enabled, the cache comes back with most of its data after a restart.
Redis data structures (the killer feature)
Memcached is a hash map of string → string. Redis is a server-side data structure store. The data structures unlock entirely new patterns.
Strings (with INCR, EXPIRE, SETNX)
SET counter 0
INCR counter # atomic
EXPIRE counter 60 # auto-expire
SET lock:job1 1 NX EX 30 # acquire lock, 30s TTL
Used for: counters, distributed locks, rate limit token buckets, atomic flags.
Hashes (per-field reads)
HSET user:42 name "ada" email "ada@example.com" age 30
HGET user:42 name # just one field, no full-doc fetch
HGETALL user:42
Memory-efficient for objects with many small fields. Each field can be read/written independently.
Lists (queues, recent items)
LPUSH inbox:42 "msg1"
LPUSH inbox:42 "msg2"
LRANGE inbox:42 0 9 # last 10 messages
LTRIM inbox:42 0 99 # cap to 100
Used for: bounded inbox/feed, simple queues. For real queueing, prefer Redis Streams.
Sets (membership, intersection)
SADD liked:tweet:7 user:42 user:99
SISMEMBER liked:tweet:7 user:42 # O(1) check
SINTER friends:42 friends:99 # mutual friends
Used for: tags, follower sets, deduplication.
Sorted sets (leaderboards, time-ordered)
ZADD leaderboard 5000 "ada" 4500 "bob" 6000 "carol"
ZREVRANGE leaderboard 0 9 # top 10
ZRANK leaderboard "ada" # rank of a user
# Time-ordered timeline:
ZADD timeline:42 1736284800 "tweet:1" 1736284900 "tweet:2"
ZRANGEBYSCORE timeline:42 1736284800 +inf LIMIT 0 100
Possibly the most useful Redis structure. Used for: leaderboards, ranked feeds, time-windowed sliding rate limiters, priority queues.
Streams (Kafka-lite)
XADD events * action login user_id 42
XREAD COUNT 10 STREAMS events 0
XGROUP CREATE events workergrp $ # consumer groups
Append-only log with consumer groups. Smaller scale than Kafka, much simpler to operate.
HyperLogLog (cardinality estimation)
PFADD uniques:home user:42
PFADD uniques:home user:99
PFCOUNT uniques:home # ~2; ~1% error rate
Counts distinct items in constant memory (~12 KB) regardless of how many items. Used for: distinct visitor counts, cardinality of huge sets.
Bitmaps (bit-level)
SETBIT active:2026-01-26 42 1 # user 42 was active today
GETBIT active:2026-01-26 42
BITCOUNT active:2026-01-26 # how many users active
BITOP AND active:weekly active:2026-01-20 active:2026-01-21 active:2026-01-22 active:2026-01-23 active:2026-01-24 active:2026-01-25 active:2026-01-26 # intersection
Tiny memory footprint for huge user populations. Used for: presence tracking, A/B test cohorts, daily/weekly active flags.
Geospatial
GEOADD drivers 73.98 40.75 "driver:42"
GEOSEARCH drivers FROMLONLAT 73.97 40.76 BYRADIUS 1 km ASC COUNT 10
Used for: ride-hailing nearby search, "find friends near me." See the Yelp-style proximity article.
CDNs (the cache at the edge)
A CDN is a globally-distributed cache that sits between the user and your origin. It serves static (and increasingly dynamic) content from the edge — a server geographically near the user.
flowchart LR
U1[User in Tokyo] --> E1[Edge: Tokyo]
U2[User in São Paulo] --> E2[Edge: São Paulo]
U3[User in Berlin] --> E3[Edge: Berlin]
E1 -.miss.-> O[Origin: Virginia]
E2 -.miss.-> O
E3 -.miss.-> O
style E1 fill:#ff6b1a,color:#0a0a0f
style E2 fill:#ff6b1a,color:#0a0a0f
style E3 fill:#ff6b1a,color:#0a0a0f
style O fill:#0e7490,color:#fff
What you cache at the edge:
- Images, video, JS, CSS, fonts (every static asset).
- Public API responses (with
Cache-Control: public). - Sometimes server-rendered HTML (Vercel ISR, Next.js).
Cache-Control headers
The CDN respects what you tell it:
Cache-Control: public, max-age=31536000, immutable
↑ everyone may cache, for 1 year, content never changes
(use this for hashed asset filenames)
Cache-Control: public, max-age=300, s-maxage=3600, stale-while-revalidate=86400
↑ browsers cache 5min, CDN caches 1hr,
and may serve stale up to 1 day while refreshing in background
Cache-Control: private, max-age=0, no-cache
↑ user-specific; never share, validate on each request
Cache-Control: no-store
↑ don't even store. Use for credit card responses, auth tokens.
Vary: Cookie makes a cache key include the cookie value — important for sites where a logged-in user sees different HTML.
Cache hit ratio is the metric
A well-tuned CDN serves 95%+ of bytes from the edge. Below 80% means your Cache-Control is too restrictive, your URLs aren't cache-friendly, or your origin is sending different responses to "the same" request.
Watch out for query string parameters: /page?utm_source=foo and /page are different cache keys by default. Strip tracking params at the edge before caching.
Browser HTTP cache
The smallest, fastest cache — and the one users love because it makes "back" instant.
Cache-Control: public, max-age=86400
ETag: "a8c7..."
The browser stores the response. On revisit:
- Within
max-age: serve from local cache, no request at all. - Past
max-age: sendIf-None-Match: "a8c7..."to validate. Server returns304 Not Modifiedif unchanged.
The biggest perf win for any site: make your hashed assets immutable, max-age=31536000 — /assets/main.a8c7.js, /img/logo.b3f0.png. The hash in the filename means the URL changes when content changes, so "cache forever" is safe.
When NOT to cache
- The data changes more often than it's read.
- The data is so cheap to compute that the cache doesn't help.
- You can't tolerate any staleness (rare in practice — usually you can).
- The data is per-user and the user only sees it once (no reuse).
- Strong-consistency-required reads (financial balances at the moment of a transaction).
Cache coherence (the hard part)
A single cache is easy. Many caches consistent with one DB is hard. The general patterns:
flowchart TD
DB[(Source of truth)] --> CDC[Change feed CDC]
CDC --> C1[Cache 1 invalidates]
CDC --> C2[Cache 2 invalidates]
CDC --> C3[Cache 3 invalidates]
style DB fill:#ff6b1a,color:#0a0a0f
style CDC fill:#15803d,color:#fff
- Single source publishes invalidations to all caches (via Kafka, Redis pub/sub, or DB CDC like Debezium).
- Versioned keys so a stale cache hit at most returns "old version, key is
v3" → consumer rejects, refetches. - TTLs as a backstop against missed invalidations.
For most apps, TTLs plus explicit invalidation on write are sufficient. CDC-driven invalidation is the next step when stale reads become a production issue.
Worked example: caching layers for an article reader
For a system like this site:
| Layer | What | TTL | Why |
|---|---|---|---|
| Browser HTTP cache | Static assets (JS, CSS) | 1 year, immutable | Hashed filenames |
| CDN | HTML pages (with stale-while-revalidate) | 1 hour, swr 1 day | Articles change rarely |
| Server in-process | Article metadata | 5 min | Page generation |
| Redis | Search index | 1 hour | Fuzzy search results |
| Postgres buffer pool | Article rows | (managed) | Always-hot pages |
Most readers never touch Postgres at all. Most readers don't touch our origin. Most readers don't even fetch from the CDN — they hit their browser cache.
Things you should now be able to answer
- What's the difference between write-through and write-back caching?
- A hot key expires under load. What is the failure pattern and how do you avoid it?
- Why is
allkeys-lruusually the right Redis eviction policy? - Why does a CDN improve perceived latency more than a faster origin?
- When is a cache the wrong solution?
- A user updates their profile but reloads and sees the old version. What's wrong, and what are three ways to fix it?
- You restart your Redis cluster and the database falls over from cold-cache traffic. What should have happened instead?
- Sorted sets vs sets in Redis — when do you reach for which?
→ Next: Load balancing & proxies
Frequently asked questions
▸What is cache-aside (lazy loading) and why is it the most common cache write strategy?
In cache-aside, the application checks the cache first and populates it only on a miss, then writes to the database directly and deletes or overwrites the cache entry on updates. It is the most common strategy because it is simple and robust — the cache only holds data that has actually been requested — and it is what 80% of systems use, typically paired with a TTL.
▸What is a thundering herd (cache stampede) and how do you prevent it?
A thundering herd happens when a hot key expires and thousands of concurrent requests simultaneously miss the cache, all hitting the database at once and potentially taking it down. The cleanest fix is single-flight or request coalescing, where only one request rebuilds the cache and every other concurrent request waits on that result. Alternatives include a distributed lock using Redis SET NX EX, probabilistic early expiration, or a soft TTL that triggers a background refresh while still serving stale data.
▸What is the difference between write-through and write-back caching?
Write-through writes to both the cache and the database synchronously before acknowledging the caller, so the cache is always fresh but every write pays the cost of two stores. Write-back writes to the cache and returns immediately, then flushes to the database asynchronously in batches, dramatically lowering write latency at the cost of potential data loss if the cache node dies before the flush completes.
▸Why is allkeys-lru the recommended Redis eviction policy, and what does Redis ship with by default?
Redis ships with noeviction as the default, which simply returns errors when the cache is full — almost never what you want in production. Allkeys-lru evicts the least-recently-used key across the entire keyspace, which aligns with the reality that a small fraction of keys absorb most traffic. Volatile-lru is the alternative when you only want to evict keys that have a TTL set.
▸How do you size a distributed cache, and what hit ratio signals a well-tuned cache?
Start with daily read volume in bytes, multiply by the hot data ratio (typically 20% for Zipf-like distributions), add 20-30% for Redis metadata and memory fragmentation overhead, then divide by per-instance RAM to get cluster size. For example, 10 TB per day times 20% times 1.3 overhead is roughly 2.6 TB, or about 80 instances of 32 GB. A well-sized cache sustains a hit ratio of 90% or higher; below 80% signals the cache is undersized, TTLs are too short, or evictions are too aggressive.