~/articles/design-instagram

◆◆◆Advancedasked at Metaasked at Instagram

Design Instagram (photo sharing)

A photo-first social network — uploads, image processing, feed, stories, and global delivery.

18 min read2026-02-17Ironclad Academy

#interview #feed #storage #cdn #media #ranking

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Instagram is a photo and video sharing platform where users post to a personal grid, follow other accounts, and scroll a ranked home feed. At its peak it serves around 500 million people every day. A user opens the app, sees photos from the 500 people they follow — ranked to surface the most relevant ones first — and the images load in well under a second, whether the user is in Tokyo or São Paulo.

The media side alone is a distinct engineering problem from any text-based social feed. Every upload must be re-encoded into multiple fixed sizes, stripped of EXIF metadata, run through a moderation classifier, and pushed to a globally distributed CDN before it can appear in anyone's feed. Instagram re-compresses all uploads — a 10 MB RAW photo lands on the CDN as a set of JPEG variants totaling under 1 MB — and generates roughly 95 million such posts every single day.

On top of that sits the feed problem, which is essentially Twitter's fan-out problem wearing a different shirt. Each user follows hundreds of accounts; each post from a celebrity can have 100 million followers waiting for it. Push the post to every follower's cache on write and you incinerate your Kafka and Redis clusters. Pull everything at read time and feed load blows past any reasonable latency budget. The answer, as with Twitter, is a push/pull hybrid — and then a two-stage ML ranker that re-orders candidates before the user ever sees them.

The interview question is interesting because it forces you to design two things simultaneously: a media ingestion pipeline that handles raw binary at petabyte scale, and a social feed with celeb fan-out, ranked delivery, and engagement counters that must absorb millions of increments per hour without row-level lock contention.

Functional requirements

Post a photo / video (with caption, location, tags).
Follow / unfollow.
View home feed (chronological + ranked).
View user profile / grid.
Like, comment, save.
Stories (24-hour ephemeral content).

Non-functional

~3B MAU, ~500M–700M DAU.
p99 feed load < 500ms.
Photos served globally with low latency.
Images stored forever.

Capacity

Dimension	Estimate	How we got there
DAU	~500M–700M	Analyst estimates; Meta reports Instagram inside "Family of Apps" DAU, not separately
Post writes	~1,100/s avg	`95M posts ÷ 86,400 s/day`
Feed loads	~175k/s avg	`500M DAU × 30 opens/day = 15B/day ÷ 86,400`
Per-post CDN size	~1 MB	~5 variants × ~200 KB avg (150px ~20 KB, 640px ~150 KB, 1080px ~300 KB); original ~1–2 MB held in cold tier
Daily CDN storage added	~95 TB/day	`95M posts × ~1 MB`
Daily durable writes (3× replication)	~285 TB/day (CDN variants) + ~95–190 TB/day (originals)	CDN tier `95 TB × 3`; originals retained before cold-tier move
Annual CDN tier growth (new media only)	~35 PB/year	`~95 TB/day × 365`

Takeaway: ~1,100 post writes/s and ~175k feed reads/s, adding ~95 TB of CDN media per day — the storage and bandwidth demands dwarf any compute concern, which is why S3 + CDN + async processing are non-negotiable from the first design revision.

Building up to the design

Instagram is two systems stapled together: a feed (like Twitter) and a media pipeline. Both have a clean evolution path from "five lines of code" to production. Walking that path makes each layer earn its keep.

V1: Photos on disk, posts in Postgres

@app.post("/posts")
def create_post(file, caption):
    fname = save_to_disk(file)               # e.g. /var/data/abc.jpg
    db.execute("INSERT INTO posts(user, file, caption) ...")

@app.get("/feed")
def feed(user_id):
    return db.query("""
      SELECT p.* FROM posts p
      JOIN follows f ON f.followee = p.user
      WHERE f.follower = %s
      ORDER BY p.created_at DESC LIMIT 50
    """, user_id)

This works for a college dorm — reads and writes both go through, end-to-end. The first wall you hit is serving a full-resolution image from your origin server to a million users. Bandwidth exhausts the host before the database does, and the JOIN-and-sort on posts gets expensive past ~100k rows.

V2: Move media to S3 + CDN

The fix is to stop serving images yourself. Upload directly from the client to S3 via a signed URL — the raw file never transits your fleet. Store only the S3 key in the database. Put a CDN in front of S3.

Image-serving cost drops roughly 10×, global latency falls to under 100ms via CDN edges, and your origin bandwidth drops to nearly zero. But now every device is requesting the full-resolution original. A grid thumbnail should be ~20 KB, not a megabyte. And serving the raw original leaks EXIF metadata — GPS coordinates, device serial number — and skips moderation entirely.

V3: Async media processing pipeline

On upload, enqueue a job. A pool of processor workers picks it up, generates fixed-size variants (150, 320, 640, 1080, 2048 px wide), strips EXIF, runs the moderation classifier, and writes each variant back to S3. The original goes to cold storage as a backup.

flowchart LR
    APP[App] -->|signed URL| S3R[(Raw S3)]
    S3R --> Q[Queue]
    Q --> W[Processor Worker]
    W --> S3V[(Variants S3)]
    W --> META[(metadata DB)]
    S3V --> CDN
    style W fill:#ff6b1a,color:#0a0a0f

Each device now fetches exactly the right size. CDN cache hit rates jump because a 640px JPEG for one user is byte-for-byte the same as the 640px JPEG for another. Moderation is in the critical path of being indexed, not of delivery — a piece of content can't appear in feeds until the pipeline clears it.

The remaining problem is the feed JOIN. It scales the same way Twitter's does — and hits the same walls.

V4: Feed evolution (push, pull, hybrid)

The evolution is identical to the news feed and Twitter: start with a SQL JOIN, cache the per-user timeline, fanout-on-write to maintain it, then handle celebrities with fanout-on-read.

The Instagram-specific wrinkle is that the feed is mostly ranked (ML), not chronological. So fanout pushes candidates into the cache; ranking re-orders them at read time. That separation matters — the fanout service doesn't need to know anything about ranking; it just keeps the candidate pool fresh.

V5: Engagement counters and stories

Likes on a popular post create hot row contention the moment a post goes viral. The solution is to decouple the write from the counter update: emit a like event to Kafka, have an aggregator batch up increments, and write to Cassandra every few seconds. The counter is a few seconds stale, which is invisible to users.

Stories add a TTL'd parallel feed — 24-hour expiry — with its own precomputed "story tray" per user. Architecturally simpler than the main feed but reusing the same fanout idea, with Redis TTLs doing the expiry automatically.

V6: Production Instagram

V3 + V4 + V5 stacked: media pipeline → S3 → CDN; feed with hybrid fanout; counters via Kafka + Cassandra; stories as a parallel TTL'd subsystem.

flowchart LR
    V1[V1: SQL + disk<br/>1k users] --> V2[V2: + S3 + CDN<br/>cheap delivery]
    V2 --> V3[V3: + async variants<br/>right size, moderation]
    V3 --> V4[V4: feed cache + fanout<br/>fast home page]
    V4 --> V5[V5: counters + stories<br/>engagement at scale]
    V5 --> V6[V6: ranker + cross-region<br/>production Instagram]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

Architecture

flowchart TD
    USER[Mobile app] --> CDN[CDN]
    CDN --> APIGW[API Gateway]
    APIGW --> FEED[Feed Service]
    APIGW --> POST[Post Service]
    APIGW --> PROFILE[Profile Service]
    APIGW --> ENG[Engagement]
    APIGW --> STORY[Stories Service]

    POST --> PROC[Media Processor]
    RAW[(Raw upload bucket)] --> PROC
    PROC --> S3[(S3: variants)]
    PROC --> METADB[(Metadata DB)]

    FEED --> CAND[(Candidate Cache)]
    FEED --> RANKER[Ranker]

    POST --> KAFKA[Kafka events]
    KAFKA --> FANOUT[Fanout]
    FANOUT --> CAND

    S3 --> CDN

    style PROC fill:#ff6b1a,color:#0a0a0f
    style FEED fill:#0e7490,color:#fff
    style CDN fill:#15803d,color:#fff

The feed plumbing is essentially the same as in the news feed article. The interesting Instagram-specific pieces are the media pipeline and stories.

Media pipeline

The upload flow is worth spelling out in detail because the sequence matters — three actors are involved (app, upload API, and the async processor), and the client signals "done" to the server only after S3 has confirmed receipt.

sequenceDiagram
    participant App
    participant Upload API
    participant Raw S3
    participant Processor
    participant Variants S3
    participant Metadata
    App->>Upload API: initiate upload
    Upload API-->>App: signed S3 URL
    App->>Raw S3: PUT image
    App->>Upload API: complete
    Upload API->>Processor: enqueue
    Processor->>Raw S3: read original
    Processor->>Processor: orient, strip EXIF, generate variants
    Processor->>Processor: ML moderation, watermark, virtual try-on, etc.
    Processor->>Variants S3: write thumbnails (e.g. 150, 320, 640, 1080, 2048)
    Processor->>Metadata: register post
    Metadata-->>App: published

For each upload, the processor generates a set of variants:

Use case	Size
Thumbnail (grid view)	150 × 150
Feed (mobile, low DPR)	320 wide
Feed (mobile, standard)	640 wide
Feed (high DPR / desktop)	1080 wide
Max resolution / zoom	2048 wide

Pre-generating these means the CDN serves them directly — no on-demand resizing on the hot path. The CDN URL encodes the variant size, so a grid view and a feed view of the same post are separate cache entries.

For videos: same idea but with multiple resolution variants (240p, 480p, 720p, 1080p) and HLS chunks. See the YouTube article.

Storage

Data	Tech	Notes
User accounts, follows	Postgres / TAO	TAO is Meta's graph-aware cache + MySQL backend, used across all Meta properties including Instagram post-acquisition
Posts metadata	Cassandra (sharded by post_id)	Instagram runs Rocksandra — a Cassandra fork that swaps the JVM storage engine for RocksDB (C++); P99 reads dropped from ~60ms to ~20ms, GC stalls reduced ~10×
Feed candidates	Redis sorted sets	Per-user timeline: sorted by score or recency, trimmed to a few hundred entries
Engagement counters	Cassandra counter columns	Aggregated via Kafka workers before write to avoid hot-row contention
Media originals	S3 Standard → cheaper tier (S3 IA or Glacier) after some inactivity window	Kept for story-archive features and right-to-erasure workflows; exact tiering threshold is not publicly disclosed
Variants	S3 (Standard), served via CDN	5–6 fixed-width JPEG variants per post; CDN edge hit rate >95% for popular content
Search	Elasticsearch	Inverted index on caption tokens + hashtags; trending computed via Kafka streaming counters

Cassandra earns its place here on two counts. First, its write path (LSM-tree, memtable → SSTable) is append-optimized and naturally distributed — exactly what you want for a workload that is almost entirely inserts. Second, partitioning by post_id means any post's metadata lives on a predictable set of nodes, so there are no cross-partition joins on the read path. The Rocksandra storage engine (Instagram's contribution to the Cassandra ecosystem) replaced the Java LSM implementation with RocksDB, shaving significant GC overhead at Instagram's write rate — P99 reads dropped from ~60ms to ~20ms.

Hashtags & search

Hashtags are just inverted indexes on tag tokens.

posts_with_tag(#sunset) → list of post_ids, sorted by recency / popularity

Stored in Cassandra (composite key: (tag, time_bucket, post_id)) or Elasticsearch.

For trending hashtags, a separate streaming pipeline (Kafka → counters) computes top tags every minute.

Stories (the ephemeral part)

Stories are photos/videos that auto-expire after 24 hours. The media pipeline and CDN delivery work identically to regular posts. What differs is the metadata lifecycle and how your client learns which friends have active stories.

Rather than scanning for "which of my 500 follows posted a story in the last 24 hours" at open time, Instagram precomputes a story tray per user — a short list of friend IDs who have unviewed stories. The tray lives in Redis with a 24-hour TTL. When a friend posts a story, the fanout service writes into that friend's followers' trays. When the 24-hour window passes, the TTL expires the tray entry automatically — no cron job needed to clean up.

View receipts (each viewer is recorded so the poster can see who watched) are a high-write-rate counter problem, handled the same way as likes: Kafka → aggregator → Cassandra.

flowchart TD
    A[Friend posts story] --> KAFKA[Kafka]
    KAFKA --> ST[Story Service]
    ST --> SCACHE[(Story Cache<br/>per-user tray<br/>precomputed: friends with unviewed stories)]
    USER[User opens app] --> ST
    ST --> SCACHE
    SCACHE --> USER
    style ST fill:#ff6b1a,color:#0a0a0f
    style SCACHE fill:#0e7490,color:#fff

The clean separation from the main feed index is intentional. If you put stories in the same candidate cache as regular posts, you'd need custom logic to expire them at the 24-hour mark — and you'd pollute the ranker's signal space with a completely different engagement model.

Ranking

When your feed is chronological, "ranking" is just a sort. Instagram stopped being chronological at scale because recency alone is a poor proxy for what you actually want to see. The solution is a two-stage ranking pipeline, where each stage is progressively more expensive but operates on fewer candidates.

flowchart TD
    POOL["Candidate pool<br/>(thousands from fanout cache)"]
    POOL --> S1["Stage 1 — two-tower model<br/>User tower: your history, follows, interaction patterns<br/>Media tower: post features, caption, author stats<br/>→ keep top ~100–500 candidates"]
    S1 --> S2["Stage 2 — heavy multi-task model<br/>Predicts: like, comment, save, share, dwell time<br/>Each gets a weight → final score = weighted sum"]
    S2 --> RERANK["Re-rank with diversity rules<br/>Deduplicate same author in a row<br/>Insert Stories / Reels / ads at fixed slots<br/>Apply safety filters"]
    style POOL fill:#0e7490,color:#fff
    style S1 fill:#ffaa00,color:#0a0a0f
    style S2 fill:#ff6b1a,color:#0a0a0f
    style RERANK fill:#a855f7,color:#fff

The first-stage model is deliberately trained to predict the second-stage model's output (knowledge distillation), so the cheap model behaves like the expensive one on the vast majority of candidates. This lets you run the expensive model only on the top 100–500 items rather than the full pool of thousands. The two-tower + knowledge-distillation architecture is explicitly documented for the Explore ranker; secondary sources confirm the same multi-stage pattern applies to the main feed, though Meta has not published an equivalent deep-dive for the home feed ranking stack.

Instagram-specific engagement signals worth calling out:

Time spent on past posts from this author — a stronger signal than a quick like.
Likes and comments exchanged directly — the mutual-interaction signal.
Save and reshare rates — stronger indicators of intent than a double-tap.
Time-of-day patterns — you scroll differently at 7am versus 11pm.

The Explore page runs a separate ranker optimized for discovery. You haven't followed these accounts, so the two-tower must rely on content embeddings and interest clusters rather than interaction history. That's why Explore feels qualitatively different from your home feed even though both are ML-ranked.

Engagement counters

A viral post accumulates millions of likes. The naive approach — incrementing a counter column in the database on every like — turns the row into a hot spot: every increment acquires a lock, serializing writes from across the globe. At a few hundred likes per second, the lock queue grows faster than it drains.

The fix is to decouple the user action from the counter update:

flowchart LR
    LIKE[User likes] --> KAFKA[Kafka]
    KAFKA --> AGG[Aggregator]
    AGG -->|"increment in batches every 5s"| CASS[(Cassandra counter)]
    READ[User reads count] --> CACHE[(Redis cache)]
    CACHE -.miss.-> CASS
    style AGG fill:#ff6b1a,color:#0a0a0f
    style CACHE fill:#15803d,color:#fff

The like event hits Kafka immediately, the aggregator batches up increments across a five-second window, and Cassandra gets one write per batch rather than one per like. The counter is a few seconds stale — imperceptible to users. For the hottest posts (Kim K's photo at 10M likes per hour), even Cassandra's write path can bottleneck a single counter row, so the counter is sharded across N rows with a sum on read.

Dual-write: "did I like this?" vs. "how many likes?"

These look like the same question but they have different consistency requirements, which means they need different stores.

"Did I like this post?" requires per-user accuracy. You don't want a false negative — clicking like, then seeing the heart un-filled because the aggregation lagged. This write goes directly to a likes(user_id, post_id) table in Postgres, sharded by user_id. It's a small row and a point lookup; strong consistency is easy to deliver.

"How many likes total?" can tolerate a few seconds of lag. That counter lives in Cassandra, incremented via batched Kafka aggregation.

The flow is: Postgres write succeeds → like event emitted to Kafka → Cassandra counter increment (batched). If the Kafka consumer lags, the total count drifts slightly — but your own like state is always correct because it came from Postgres.

The thundering herd on hot posts. When a viral post gets a new like, the cached counter is deleted. Multiple servers may simultaneously experience a cache miss and race to reload from Cassandra. Instagram historically solved this with a memcache lease pattern: on a miss, the first server gets a short-lived token; only the token holder fetches from the database and repopulates the cache. Other servers wait briefly rather than dogpiling Cassandra.

Hot path: viewing the feed

The target is a p99 feed load under 500ms globally, with a rough 100ms budget for the serving layer. The split looks like this: ~10ms to read per-user candidate sets from Redis sorted sets, ~30–50ms for Stage 1 (two-tower inference on the full candidate pool), ~20–40ms for Stage 2 (heavy multi-task model on the top 100–500), and the remainder for hydration — fetching post metadata, author info, and CDN URLs. The tail-latency risk is the heavy Stage 2 model: if it runs synchronously on the critical path, a slow GPU inference spills the entire budget. In practice, Instagram caps Stage 2 to a fixed top-N and enforces a timeout so a slow model call falls back to the Stage 1 ordering rather than stalling the response.

Hot path: uploading a photo

Step	Latency
App: capture, compress	(client)
App: PUT to S3 (signed URL)	500ms – several seconds (depends on network)
Server: enqueue processing	50ms
Processor: variants + moderation	1–10s (background)
Server: 201 to client	(after enqueue) ~100ms

The user sees "posted" immediately; variants and moderation finish in the background over the next few seconds. The post is not yet discoverable in other users' feeds until the processor marks it as complete — so moderation has a window to act before any audience sees it.

Edge cases

Sensitive content

An ML classifier runs on every upload — NSFW, violence, hate speech. High-confidence violations are blocked synchronously within the processing pipeline. Uncertain cases are queued for human review and held back from feeds until cleared.

Right to be forgotten

GDPR / CCPA: when a user deletes their account, all their posts must be removed. This cascades across S3 (media), CDN (purge), Cassandra (metadata), feed caches, and search indexes. Implemented as a soft-delete plus async purge with a 30-day grace window so users can recover an accidentally deleted account.

Bot / fake accounts

Detection lives upstream of post creation: rate limits, CAPTCHA on signup, behavioral analytics, and ML on activity patterns. Letting a bot reach the post-creation endpoint is already a failure of the upstream layer.

Cross-region consistency

A user in the EU posts a photo. A friend in the US opens the app. The post must reach the US within seconds. Metadata replicates asynchronously across regions (~1–2s lag). Media chunks are pre-positioned to global CDN POPs. For a fresh post, the CDN miss path falls back to the origin region — adding a single extra round trip, not a user-visible delay.

Things to discuss in an interview

Media pipeline: variants, async processing, CDN — why pre-generate vs. on-demand resize.
Feed plumbing: candidates + two-stage ranking, push/pull hybrid, celebrity cutoff.
Counters at scale: Kafka aggregation, sharded Cassandra counters, memcache lease for thundering herd.
Stories as a separate TTL'd sub-system — what would break if you put stories in the main feed index?
Moderation as part of ingestion, not as an afterthought — and the latency trade-off of blocking vs. async.
Right-to-erasure: cascading async deletes from S3, CDN purge, feed caches, search indexes.

Things you should now be able to answer

Why pre-generate image variants instead of resizing on demand?
Why store likes in both Postgres and Cassandra, and what happens if the Kafka consumer lags?
How do stories differ architecturally from regular posts?
What goes wrong if you increment a counter row directly under heavy write load, and how does a memcache lease fix it?
How does the CDN serve images for an account followed mostly in another region?
Why does the first-stage ranker get trained on second-stage labels rather than raw engagement signals?
What is the tail latency risk if the heavy ML ranker is on the critical path of every feed load?

Frequently asked questions

▸Why does Instagram pre-generate fixed image variants instead of resizing on demand?

Pre-generating 5-6 fixed-width JPEG variants (150, 320, 640, 1080, 2048 px wide) means the CDN serves each size directly with no on-demand resizing on the hot path. CDN cache hit rates also jump because the 640px JPEG for one user is byte-for-byte identical to the 640px JPEG for another, making the cache far more effective.

▸How does Instagram handle like counts on viral posts without hitting database lock contention?

Like events are emitted to Kafka, an aggregator batches them over a five-second window, and a single batched write goes to a Cassandra counter rather than one write per like. For the hottest posts the counter is further sharded across N rows with a sum on read, and a memcache lease prevents cache-miss thundering herds where multiple servers race to reload the count from Cassandra simultaneously.

▸What is the push/pull hybrid fanout strategy Instagram uses for the home feed?

For regular users, new posts are pushed directly into followers' Redis candidate caches on write, keeping feed load latency low. For celebrity accounts with millions of followers, fanning out on every post would overwhelm Kafka and Redis, so their posts are pulled into the feed at read time instead.

▸Why does Instagram store like state in both Postgres and Cassandra?

The two questions have different consistency requirements. Whether a specific user liked a post requires per-user accuracy with no false negatives, so that write goes to a sharded Postgres table for a reliable point lookup. The total like count can tolerate a few seconds of lag, so it lives in Cassandra and is updated via batched Kafka aggregation.

▸What is Rocksandra and what performance improvement did it deliver for Instagram?

Rocksandra is Instagram's fork of Cassandra that replaces the JVM storage engine with RocksDB, a C++ LSM implementation. It reduced P99 read latency from roughly 60ms to roughly 20ms and cut garbage-collection stalls by approximately 10x at Instagram's write rate.

← previous

Design Yelp / Nearby Search (proximity service)

Design Google Drive / Dropbox

// RELATED