Design a News Feed (Facebook / Instagram)
How Meta builds the home feed — fanout, ranking, candidate generation, and how the same architecture serves billions.
The problem
Facebook's home feed serves roughly 2 billion people each day. On each feed load, the system needs to pull together posts from hundreds of friends, dozens of pages, and a handful of celebrities — then sort them by what the user is most likely to engage with, not by the time they were posted. Instagram does the same for photos and Reels; LinkedIn does it for professional updates. The product looks deceptively simple: a scrollable list. The engineering behind it is not.
The first hard problem is fanout. When a regular user publishes a post, that content needs to propagate to hundreds of followers' feeds within seconds. Multiply that by 1 billion posts per day and you have roughly 12,000 writes per second flowing into fan-out queues — with peaks three times higher. For celebrities with 50 million followers, naive fanout-on-write would mean 50 million cache insertions per post, which quickly becomes untenable.
The second hard problem is ranking. A chronological feed is easy — sort by timestamp. A ranked feed requires the system to score each candidate post against a model that considers hundreds of signals: how often the viewer interacts with this author, how quickly similar content has gathered engagement, whether the post contains a video or a link, how old it is. Running a heavyweight ML model on 1,000 candidates per request, at 700,000 feed loads per second, means the inference fleet alone could consume thousands of A100-class GPUs.
These two problems — write amplification from fanout and inference cost from ranking — are what make this question interesting. The architecture exists not to store or retrieve posts (that part is solved), but to answer the question: "What 50 things should I show this specific user, right now, in under 500ms?"
Functional requirements
- Post status updates, photos, videos, links.
- View home feed: posts from friends/people you follow.
- Feed is ranked — most relevant first, not chronological.
- React, comment, share.
- Notifications on engagement.
Non-functional requirements
- 2B DAU (Facebook scale).
- p99 feed load < 500ms.
- Posts durable forever.
- Ranking quality must improve over time (ML feedback loops).
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Users | 2B DAU | given |
| Post writes (avg) | ~12k/sec | 2B × 0.5 posts/day = 1B/day ÷ 86,400 s ≈ 12k/s |
| Post writes (peak) | ~35k/sec | ~3× avg burst factor |
| Feed reads (avg) | ~700k/sec | 2B × 30 feeds/day = 60B/day ÷ 86,400 s ≈ 700k/s |
| Feed reads (peak) | ~2M/sec | ~3× avg burst factor |
| Avg post size | ~1 KB | 200 chars text + media reference |
| Text storage | ~1 TB/day | 1B posts/day × 1 KB/post |
| Media storage | 100s of TB/day | video/image payloads at scale |
Takeaway: The numbers are in the same league as Twitter. The architecture rhymes.
Building up to the design
The full Facebook feed has many moving parts: candidate generation, light rank, heavy ML rank, diversity rules, engagement counters. Walking forward from a college-project version makes it obvious why each one exists.
V1: SQL JOIN
SELECT p.*
FROM posts p JOIN friends f ON f.b = p.author
WHERE f.a = $me
ORDER BY p.created_at DESC LIMIT 50;
This works for 100 users with no infrastructure. The moment you hit real scale — where each user follows 200+ people and the posts table has billions of rows — this JOIN degrades into a full table scan on every feed load. Latency explodes well before you hit 10k posts/sec write load.
V2: Materialize timelines on write
The fix is to move the work to write time. When a user posts, push the post ID into every follower's pre-built feed:user_X Redis sorted set — fanout-on-write. Now a feed read is just ZRANGE ... REV on a single Redis key, which takes about a millisecond regardless of how many people the user follows.
The price: a user with 10 million followers triggers 10 million Redis writes per post. That's fine for ordinary users; it's catastrophic for celebrities.
V3: Hybrid fanout
Above some follower threshold — Meta historically uses around 10k — you skip the write-time fanout entirely. On read, the feed service merges the user's pushed candidates with recent posts from the celebrities they follow. You cap the write amplification per post; you cap the per-read celebrity fetch. This is the same trick Twitter uses, and it holds up at every scale.
But now you notice that chronological order isn't actually what users want. A friend's wedding photo from six hours ago is more important than a cousin's meme from thirty minutes ago. Fanout gave you speed. It didn't give you quality.
V4: Ranking on top of candidates
The insight: stop thinking of that cached per-user list as "the feed." It's a candidate set — three hundred to a thousand items — from which the ranker will choose the fifty to show. At read time, score each candidate with a model (recency, affinity, predicted engagement) and return the top 50.
flowchart LR
CACHE[Candidate Cache] --> R[Ranker]
R --> FS[(Feature Store)]
R --> M[ML Model]
R --> FEED[Top 50]
style R fill:#ff6b1a,color:#0a0a0f
style FS fill:#0e7490,color:#fff
Feed quality jumps. Engagement metrics improve. Then you do the math: 700k feed fetches per second, each scoring 1000 candidates through a heavy ML model. That's 700M model evaluations per second — thousands of A100-class GPUs just for inference.
V5: Two-stage ranking + feature store
You need a cheaper first filter. A lightweight linear model runs on pre-computed features and prunes the full raw pool (~5k–10k) down to ~1000 in a few milliseconds. Only those survivors go to the heavy model, which scores them and hands ~100 to a re-ranking layer that applies diversity rules and returns 50. A precomputed feature store serves user features and per-post features at scoring time, so the ranker is mostly a join plus a forward pass. Latency comes back under 100ms; the expensive model still runs, just on a much smaller window.
V6: Production-grade everything else
Add engagement counters (Kafka → batched Cassandra), notifications (Kafka consumer per event type), moderation in the post path, multi-region replication for the feed cache, and deduplication/diversity rules in re-rank.
flowchart LR
V1[V1: SQL JOIN<br/>100 users] --> V2[V2: + fanout on write<br/>fast reads]
V2 --> V3[V3: + hybrid push/pull<br/>celebs handled]
V3 --> V4[V4: + ML rank<br/>quality]
V4 --> V5[V5: + light/heavy rank<br/>latency]
V5 --> V6[V6: + counters, mod, multi-region]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
U[User] --> CDN
CDN --> APIGW[API Gateway]
APIGW --> POST[Post Service]
APIGW --> FEED[Feed Service]
APIGW --> ENG[Engagement Service]
APIGW --> NOTIF[Notification Service]
POST --> POSTDB[(Posts DB<br/>sharded)]
POST --> KAFKA[Kafka events]
KAFKA --> FANOUT[Fanout Workers]
FANOUT --> CAND[(Candidate Cache<br/>per user)]
FEED --> CAND
FEED --> RANKER[Ranker Service]
RANKER --> FS[(Feature Store)]
RANKER --> ML[ML Model Server]
KAFKA --> NOTIF
KAFKA --> COUNTERS[Engagement Counters]
style POST fill:#ff6b1a,color:#0a0a0f
style FEED fill:#0e7490,color:#fff
style RANKER fill:#a855f7,color:#fff
Three things happen in parallel:
- Ingestion: posts come in, get stored, fanned out to candidate caches.
- Candidate generation: each user has a precomputed list of "things to maybe show."
- Ranking (at read time): the model decides what to show first.
Candidate generation
The fanout decision happens post-write. After a post lands in the Posts DB and a NewPost event appears on Kafka, the Fanout Workers pick it up and check the author's follower count.
flowchart TD
K[Kafka: NewPost event] --> FW[Fanout Worker]
FW --> CHK{follower count}
CHK -->|"< 10k followers"| PUSH["Push post ID into<br/>each follower's candidate cache<br/>(fanout-on-write)"]
CHK -->|"> 10k followers"| SKIP[Skip fanout<br/>mark as celebrity]
PUSH --> CC[(Candidate Cache<br/>Redis sorted sets)]
SKIP --> CP[(Celebrity Posts<br/>fetched at read time)]
style FW fill:#ff6b1a,color:#0a0a0f
style CC fill:#15803d,color:#fff
style CP fill:#0e7490,color:#fff
At read time, the Feed Service merges both pools: the pre-built candidate list from the sorted set plus recent posts from any celebrities the user follows. Together these become the 300–1000 item candidate set the ranker works with. The candidates aren't the final feed — they're an ordered superset from which the ranker will choose ~50 to show.
sequenceDiagram
participant User
participant Feed Svc
participant Candidate Cache
participant Celeb Cache
participant Ranker
participant Model
User->>Feed Svc: GET /feed
Feed Svc->>Candidate Cache: get top-1000 candidates
Feed Svc->>Celeb Cache: get celeb posts
Candidate Cache-->>Feed Svc: candidates
Celeb Cache-->>Feed Svc: celeb posts
Feed Svc->>Ranker: rank(candidates, user_features)
Ranker->>Model: predict_engagement
Model-->>Ranker: scores
Ranker-->>Feed Svc: ranked top 50
Feed Svc-->>User: feed
Ranking
This is where Meta spends most of its compute budget.
What signals matter?
| Feature group | Examples |
|---|---|
| User → post | does the user click on this author often? does the post topic match user interest? |
| User → user | how strong is the friendship (likes exchanged, messages, time on profile)? |
| Post quality | engagement rate so far, content-policy score, video watch-through |
| Recency | post is 2 hours old → fresh; 2 days → stale |
| Diversity | don't show 5 posts from one author in a row |
A mature ranking model (DLRM, transformer over user history) takes hundreds of features and emits a probability of "user will engage with this if shown."
Multi-stage ranking
You can't run the heavy model on everything. Before any filtering, the raw pool is actually larger than the candidate cache alone — when you union the per-user sorted set with celebrity pulls, you might start with ~5,000–10,000 candidates. Running the full DLRM on all of them, for every user, at 700k req/sec, is not a budget you can afford.
flowchart LR
C[~10000 raw candidates] --> RR[Light retrieval rank]
RR --> CC[~1000 candidates]
CC --> HR[Heavy ML model]
HR --> CCC[~100 ranked]
CCC --> RB[Re-rank + diversity rules]
RB --> F[~50 final feed]
style RR fill:#ffaa00,color:#0a0a0f
style HR fill:#ff6b1a,color:#0a0a0f
style RB fill:#15803d,color:#fff
The light retrieval rank is cheap on purpose: simple heuristics or a lightweight linear model filter the large pool down to ~1000 using signals like raw recency, social proximity, and content-type preference. Fast enough that the overhead is negligible.
The heavy ranking stage gets the 1000 survivors. A full DLRM or transformer scores them with hundreds of features. This is where the compute budget goes; embedding tables alone can exceed memory on a single host, so model servers are sharded.
Re-ranking is the step the ML model shouldn't have to learn — business rules enforced after scoring. Don't show five posts from the same author back-to-back. Inject ads at the right cadence. Suppress content that failed a policy check since the model last saw it. Mix Reels, Stories, and text posts in the right ratio.
Why does the two-stage split matter so much? At 700k feed fetches/sec with 1000 candidates each, running the heavy model end-to-end means 700M model inferences/sec — roughly thousands of A100-class GPUs just for the inference fleet. The light-rank stage cuts that by 5–10× at a fraction of the cost.
Operational failure modes staff engineers should know:
Feature store staleness: if the feature store is lagging — say a pipeline is delayed — the ranker silently degrades to stale signals. Instrument feature freshness as a first-class metric and alarm on p99 age of user features.
Model/serving skew: training uses batch-computed features; serving uses real-time features. If these drift (different aggregation windows, different defaults for missing keys), model performance degrades invisibly. Track training-serving skew in shadow mode.
Ranker latency spike cascade: the feed request blocks on the ranker. If a model server is slow, p99 feed latency blows up. Mitigate with hedged requests (send to two shards, take the first response) and per-request timeouts with fallback to light-rank-only output.
Storage choices
| Data | Store | Why |
|---|---|---|
| Posts | Cassandra / TAO | High write, append-mostly |
| Friend graph | TAO (Meta), MyRocks (MySQL+RocksDB) | TAO is the graph layer; MyRocks is the storage engine underneath — LSM-tree reduces storage by ~62% vs InnoDB for social-graph workloads |
| Candidate cache | Redis sorted sets | In-memory ranking |
| Feature store | RocksDB / Bigtable | High-throughput key-value lookup |
| Engagement counters | Cassandra / Manhattan | Eventually consistent, fast incr |
| Model embeddings | Specialized stores (FAISS / vector DB) | Approximate nearest neighbor |
| Search index | Elasticsearch / Meta's Unicorn | Full-text |
The friend graph (TAO)
Meta's TAO is a graph database built on MySQL + Memcache, optimized for an "objects + associations" model.
object: user, post, photo
association: friend(user_a, user_b), liked(user, post), commented_on(user, post)
Queries look like:
TAO.assoc_get(user_a, "friend", limit 100) → my friends
TAO.assoc_get(post, "liked", limit 100) → who liked this
TAO.obj_get(post_id) → post payload
TAO caches aggressively in Memcache, with cache invalidation tied to write paths. The production system achieves a 96.4% cache hit rate (per the 2013 USENIX ATC paper), and reads vastly outnumber writes — more than 99:1 — so the database is almost never on the hot path.
Hot path: viewing the feed
Step by step, with the latency budget:
- Auth check (5ms)
- Read candidate cache (10ms — Redis sorted set ZRANGE)
- Hydrate candidates with post bodies (15ms — Memcache MGET)
- Read user features (10ms — feature store)
- Rank (50ms — model server, with vector embedding lookups)
- Apply re-ranking rules (5ms)
- Serialize + return (10ms)
Total: ~105ms. Add network RTT and TLS on top of that.
Hot path: posting
sequenceDiagram
participant U as User
participant API
participant POST as Post Svc
participant DB as Posts DB
participant K as Kafka
participant FAN as Fanout
participant CC as Candidate Cache
U->>API: POST /post
API->>POST: create
POST->>DB: INSERT
POST->>K: NewPost event
POST-->>API: 201
Note over K,FAN: async
K->>FAN: deliver
FAN->>CC: ZADD candidate for each follower (within reason)
For a power user with millions of followers, fanout writes happen in batches and may take seconds. The rest of the system shouldn't block on it — the Kafka event decouples the fast write acknowledgment from the slow propagation.
Trade-offs worth knowing
Push vs. pull model
| Push (fanout-on-write) | Pull (fanout-on-read) | |
|---|---|---|
| Read latency | Low (precomputed) | High (compute at read) |
| Write cost | Proportional to followers | Constant |
| Storage | High (per-user lists) | Low |
| Fits | Most users | Celebs / inactive followers |
Use both. That's what every production system does.
Eventual consistency
Your friend posts and it might take 2–10 seconds before it appears in your feed. That's fine. But if you post, you should see your own post on next refresh — otherwise the product feels broken even if it's technically consistent. The solution is to maintain a "self" cache updated synchronously on post creation; on feed load, the Feed Service merges it with the regular candidate set.
flowchart LR
ME[Your POST /post] --> POSTDB[(Posts DB)]
ME --> SELF[(Self cache<br/>sync write)]
ME --> K2[Kafka → async fanout<br/>for your followers]
SELF --> FEED[Your own feed load]
FEED -->|merge| CAND[(Candidate Cache)]
style SELF fill:#ffaa00,color:#0a0a0f
style FEED fill:#0e7490,color:#fff
Storage cost
Storing per-user candidate lists for 2B users gets expensive quickly. Raw post IDs alone are 2B users × 1000 candidates × 8B ≈ 16 TB — manageable in isolation. But Redis sorted sets use skiplist encoding once a key exceeds 128 members, and the overhead per entry (skiplist node + hashtable entry + robj wrapper, excluding the member string itself) is roughly 136 bytes. For 8-byte post IDs that's ~144 bytes per entry total, pushing the full in-memory figure to 2B × 1000 × 144 B ≈ 288 TB — far beyond what you'd want warm in Redis.
The practical mitigations: only store candidates for active users (last 30 days, roughly 20–30% of total users), compute lazily for stale users, and consider a more compact off-heap store like a RocksDB-backed sorted set for the long tail.
Ranking infrastructure (the part nobody talks about)
A feed ranker is itself a small distributed system worth understanding:
flowchart LR
REQ[Feed request] --> SHARDED[Sharded model servers]
SHARDED --> EMB[Embedding lookup<br/>vector DB]
SHARDED --> COUNTERS[Counter lookup<br/>'how many likes does this post have?']
SHARDED --> SCORE[Score]
style SHARDED fill:#ff6b1a,color:#0a0a0f
style EMB fill:#a855f7,color:#fff
The model has billions of parameters, so it's sharded across hosts. Embeddings — per-user, per-post — live in a vector store and are batched for lookup. Engagement counters are read with eventual consistency; a post's like count being off by a few hundred for a few seconds is not a correctness problem.
Edge cases
Feed update mid-scroll
The user is reading. Five new posts arrive at the top. Inject them inline? Reorder everything they've already read? The practical answer is a banner: "5 new posts — tap to refresh." Don't reflow what they're already reading.
Re-rank on every load vs. cache the ranked feed?
Cache the candidates, re-rank fresh on every load. Ranking signals — recency, recent interactions, engagement rates — change too quickly to cache the outputs. For pull-to-refresh, the same applies: re-rank, but make sure to include the latest posts.
Ad insertion
Ads are interleaved into the ranked feed by the same ranker, with a separate "expected revenue" signal layered in. Real systems jointly optimize "user value × ad value" — the re-ranking stage is where that tradeoff gets resolved.
Things you should now be able to answer
- Why is candidate generation separate from ranking?
- What's the difference between push and pull, and why use both?
- Why is multi-stage ranking necessary? What does the light rank actually save?
- Why is TAO a graph store and not a SQL DB?
- What happens between the user clicking "post" and seeing it in their friend's feed?
- Why do you cache candidates but re-rank on every load?
- What three failure modes can silently degrade ranking quality without throwing errors?
- How does the storage cost of per-user candidate lists change when you account for Redis overhead vs. raw ID size?
Further reading
- "TAO: Facebook's Distributed Data Store for the Social Graph" (USENIX 2013)
- "Deep Learning Recommendation Model for Personalization and Recommendation Systems" (Meta, 2019 — the DLRM paper)
- Instagram engineering on feed ranking / Explore page
Frequently asked questions
▸What is the hybrid fanout strategy and when does it switch from push to pull?
Meta fans post IDs into per-user Redis sorted sets at write time for authors with fewer than 10k followers. For anyone above that threshold — celebrities — fanout-on-write is skipped entirely, and the Feed Service fetches their recent posts directly at read time and merges them with the prebuilt candidate set. This caps write amplification per post without blowing up per-request read cost.
▸Why are candidates cached but the ranked feed never cached?
Ranking signals like recency, recent interactions, and engagement rates change too quickly for the ranked output to stay valid between requests. The candidates — post IDs stored in Redis sorted sets — are stable enough to precompute, but the final ordering must be produced fresh on every feed load to reflect the latest signals.
▸What is the multi-stage ranking pipeline and what latency does each stage target?
The pipeline has three stages: a light retrieval rank prunes roughly 5,000-10,000 raw candidates down to about 1,000 using cheap heuristics in milliseconds; a heavy ML model (DLRM or transformer) scores those 1,000 survivors and produces about 100 ranked results; a re-rank layer applies diversity and business rules to return the final 50. The full ranking step is budgeted at 50ms, keeping the total hot-path feed response around 105ms before network overhead.
▸What cache hit rate does TAO achieve, and why does that matter for the social graph?
TAO, Meta's graph store built on MySQL and Memcache, achieves a 96.4% cache hit rate per the 2013 USENIX ATC paper. Reads outnumber writes by more than 99:1, so the underlying database is almost never on the hot path — the aggressive Memcache layer absorbs virtually all graph traversal traffic.
▸How large is the candidate cache in memory, and what is the practical mitigation for its cost?
Storing raw 8-byte post IDs for 2 billion users at 1,000 candidates each is 16 TB, but Redis sorted-set overhead (skiplist node plus hashtable entry plus robj wrapper) adds roughly 136 bytes per entry, pushing the true in-memory footprint to around 288 TB. The practical fix is to keep only active users warm — roughly 20-30% of total users active in the last 30 days — and lazily compute candidates for the rest, with a compact off-heap store like a RocksDB-backed sorted set for the long tail.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.