~/articles/design-social-graph

◆◆◆Advancedasked at Metaasked at LinkedInasked at Twitter

Design a Social Graph Service (Facebook's TAO)

Q: What is TAO, and why did Facebook build it instead of using a general-purpose graph database?

TAO (The Associations and Objects) is a two-tier write-through cache in front of sharded MySQL, purpose-built for social-graph access at Facebook scale. General-purpose graph databases handle multi-hop traversal well but cannot absorb the read volume TAO handles — roughly 1 billion reads per second at peak — and social-graph queries are almost always single-hop, not deep traversals. MySQL's operational maturity, backup tooling, and ACID writes make it the better foundation when traversal depth is shallow.

Q: What is the hot-vertex problem and how does the TAO design address it?

The hot-vertex problem is when a celebrity node — followed by 100 million people — concentrates read and write traffic on a single database shard. TAO mitigates it through four mechanisms: a hard per-query limit on assoc_range calls (e.g., 6,000 rows max), a separately maintained association_counts table so counts are a single-row lookup rather than a COUNT(*) scan, leader-level request coalescing so only one DB read fires during a cache miss storm, and approximate or counter-sharded counts for viral posts receiving thousands of write increments per second.

Q: Why are bidirectional friendships stored as two directed association rows instead of one undirected row?

Storing two directed rows — (A, FRIEND, B) and (B, FRIEND, A) — keeps the query for each user's friend list to a single shard, since the associations table is sharded by id1. A single undirected row would require a cross-shard join or a UNION across two shards to reconstruct both directions. The TAO paper reports the async window for the second row to land is below 1 second for 85% of observations and below 3 seconds at p99, which the product tolerates.

Q: What is the difference between the leader cache and the follower cache in TAO?

The leader cache owns one database shard, serializes all writes to that shard, coalesces concurrent read misses so only one DB query fires per key, and sends invalidations to follower caches. The follower cache is a stateless read tier — app servers write to their local follower, which forwards to the leader; read misses go to the leader rather than directly to MySQL, fully insulating the database from the follower tier's miss traffic. Because followers are stateless, scaling out is as simple as adding nodes.

Q: How does the write-through cache pattern preserve durability, and why does TAO prefer it over write-back?

Write-through means the leader writes to MySQL synchronously before acknowledging success to the caller; the cache is updated as the acknowledgment propagates back up the chain. Write-back buffers writes in the cache and flushes them to the DB later, which risks losing those writes if a cache node crashes before the flush. TAO uses write-through to ensure every acknowledged write is already durable on disk.

Serve billions of "who follows whom" reads over a graph of trillions of edges. The objects-and-associations model, a cache in front of sharded SQL, and the hot-vertex problem.

22 min read2026-06-10Ironclad Academy

#interview #graphs #caching #distributed-systems

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Facebook has roughly 3 billion users. Each of those users has friends, follows pages, likes posts, and comments on photos — and every one of those actions creates a directed edge in a graph. At Facebook scale, that graph holds on the order of 1–2 trillion edges. Instagram, LinkedIn, and Twitter sit at the same order of magnitude. Serving reads over that graph — "who does this user follow?", "does user A have user B in their friend list?", "how many likes does this post have?" — is one of the hardest infrastructure problems in social computing.

The naive approach stores relationships in a relational table: two columns, one row per friendship. This works fine through the tens of millions of rows. Past that, it buckles. No single machine holds 1 trillion rows. A SELECT COUNT(*) over a celebrity's 100-million-follower list locks out the entire database for the duration. And the access pattern almost never wants the full list — it wants the 25 most recent, or a yes/no point lookup, or a count. A general-purpose graph database would handle the traversal well but struggles with the read volume: TAO production sees roughly 1 billion reads per second at peak, with reads outnumbering writes by 500:1 or more.

The two engineering tensions that make this problem interesting are read/write asymmetry at extreme scale and the hot-vertex problem. The asymmetry demands a caching tier that absorbs 99%+ of reads so the database only sees the cold path. The hot vertex — a celebrity node followed by 100 million people — concentrates write traffic on one shard and can cascade into a cache stampede on any viral post. Both problems require specific design choices in how data is modeled, sharded, and cached.

Facebook's published answer is TAO (The Associations and Objects), a 2013 USENIX ATC paper describing a two-tier write-through cache in front of sharded MySQL, purpose-built around a small set of query primitives that dominate real social-graph access. This article designs a system in the same spirit, walking from a single SQL table up through the full architecture so each design decision earns its place.

Functional requirements

Store objects: typed nodes with a unique ID and a property bag (JSON or thrift-encoded blob). Types: User, Post, Page, Comment, Photo, Event, …
Store associations: typed directed edges between two object IDs, with a timestamp and optional data. Types: FRIEND, LIKE, FOLLOW, COMMENT, TAG, …
Expose a small set of query primitives:
- assoc_get(id1, atype, id2_set) — point lookup: does edge (id1, atype, id2) exist? returns edge data.
- assoc_range(id1, atype, pos, limit) — range query: get limit edges of type atype from id1, newest-first, starting at offset pos.
- assoc_count(id1, atype) — count: how many edges of type atype does id1 have?
- assoc_time_range(id1, atype, high, low, limit) — time-bounded range.
- obj_get(id_set) — multi-get objects by ID.

Non-functional requirements

Read-heavy, massively so. Reads outnumber writes by 500:1 or more. Every page load fires dozens of read queries; writes (new posts, new likes) are comparatively rare.
Low latency — p99 < 10 ms globally; p50 < 1 ms for cached hits.
High availability — degraded reads (stale data) are preferable to errors.
Eventual consistency — cross-region stale reads are acceptable. Within a region, read-your-writes is desirable.
Extreme scale — billions of nodes, trillions of edges, millions of QPS.

Capacity estimation

Dimension	Estimate	How we got there
Users	~3 B	Facebook scale
Object nodes	~O(100 B)	users + posts + comments + pages
Association edges (lower bound)	~1 trillion	friendships + likes + follows + comments + tags; conservative floor
Association edges (upper bound)	several trillion	—
Storage per association row	~92 bytes	`id1(8B) + id2(8B) + atype(4B) + ts(8B) + data(avg 64B)`
Total association storage	~92 TB	`1 trillion edges × 92 B` → needs sharding across many MySQL instances
Read QPS (lower-bound estimate)	~1.7 M reads/sec	`3B DAU × 50 reads/day ÷ 86,400 s`
Read QPS (TAO production peak)	~1 billion reads/sec	Many more reads per user per active session than daily average captures
DB queries/sec at 99% cache hit	~17k/sec	`1.7M reads/sec × 1%` (from lower-bound estimate)
Write QPS (estimate)	~1M–5M writes/sec globally	Active users ≈ 1–5% of DAU → ~30M–150M concurrent; ~1 write/30 s each → `30M–150M ÷ 30 ≈ 1M–5M/sec`; TAO paper reports millions/sec globally
Read:write ratio	~500:1	Confirmed in TAO paper; reads dominate by orders of magnitude

The 99%+ cache hit rate is not aspirational — it is a hard architectural requirement. Without the cache tier, the DB layer cannot absorb even a small fraction of the read QPS at Facebook scale. The cache makes the DB tractable.

Building up to the design

The naive path starts where most interview candidates start: a friendships table with two columns. Let's walk through why that breaks and what each fix unlocks, so by the time we arrive at the two-tier cache the design feels inevitable rather than handed down from on high.

V1: A single relational table

CREATE TABLE friendships (
    user_a BIGINT,
    user_b BIGINT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_a, user_b)
);

Store friendships as undirected pairs (min(a,b), max(a,b)). "Get friends of user X" is SELECT user_b FROM friendships WHERE user_a = X UNION SELECT user_a WHERE user_b = X.

This works fine to tens of millions of rows. The UNION across two index ranges starts to show its seams around that point, and there's a deeper problem: there's no concept of edge types here. Adding "likes" means a separate likes table, "follows" means another one. You end up with a zoo of relationship tables with no common query interface.

V2: A typed adjacency list (edge table)

CREATE TABLE associations (
    id1       BIGINT       NOT NULL,
    atype     INT          NOT NULL,
    id2       BIGINT       NOT NULL,
    ts        BIGINT       NOT NULL,   -- unix milliseconds
    data      BLOB,
    PRIMARY KEY (id1, atype, id2),
    KEY       (id1, atype, ts DESC)    -- for range queries
);

Now every relationship type goes in one table. "Friends of X, newest-first" is a single index scan:

SELECT id2 FROM associations
WHERE id1=X AND atype=FRIEND
ORDER BY ts DESC LIMIT 25;

The secondary index on (id1, atype, ts DESC, id2) means the query engine satisfies this entirely via index, without touching the heap. Clean.

What breaks here is scale. At 1 trillion rows, no single MySQL instance holds this. And every read hits the database — there's no caching.

V3: Shard by id1 + separate count cache

Shard the associations table horizontally by id1. All edges from a given node land on one shard, which keeps "friends of X" as a local, single-shard query. This is the foundational sharding decision: cross-shard queries become expensive application-level scatter-gather, so the data model must ensure the most common access pattern stays on one machine.

Counts are now a problem. SELECT COUNT(*) WHERE id1=X AND atype=FRIEND on a celebrity with 50M followers requires scanning 50M index entries. Instead, maintain a separate table updated transactionally with every insert and delete:

CREATE TABLE association_counts (
    id1    BIGINT  NOT NULL,
    atype  INT     NOT NULL,
    count  BIGINT  NOT NULL DEFAULT 0,
    PRIMARY KEY (id1, atype)
);

Now assoc_count is a single-row point lookup. But every read still hits the database.

V4: A cache tier (single region)

Put a memcached-style cache in front of MySQL. Reads check cache first; misses fall through to DB and populate the cache on return. Writes go directly to the DB and then invalidate the relevant cache entries — a write-around-plus-invalidation pattern that improves read performance but leaves a stale-cache window between the DB write and the invalidation.

This eliminates ~99% of DB reads. The stickier issue is cache invalidation correctness — with many cache nodes and one DB, there's a window between a write completing in MySQL and the stale cache entry being evicted. Within a single region this is usually milliseconds and most products accept it. But expand to multiple regions and you have two compounding problems: the cross-region network RTT for every write, and a single cache tier that becomes a single point of failure.

V5: Two-tier cache (leader + follower) per region — TAO's model

Separate the cache into two layers: a leader (one per DB shard, serializes writes) and a follower (many nodes per region, serves the read volume). App servers write to their local follower, which forwards to the leader, which writes through to MySQL synchronously. As the acknowledgment travels back up the chain the caches are updated. Reads that miss the follower go to the leader, not directly to the DB — this insulates MySQL from thundering-herd misses.

flowchart LR
    V1["V1: single SQL table\nworks at millions of rows"] --> V2["V2: typed edge table\nall types in one schema"]
    V2 --> V3["V3: shard by id1 + count cache\nmultiple machines"]
    V3 --> V4["V4: + cache (write-around + invalidate)\n99% read hit rate"]
    V4 --> V5["V5: leader/follower cache tiers\nmulti-region, HA"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f

High-level architecture

flowchart TD
    APP["App servers (web/mobile API)"]

    subgraph Region["One region (e.g., US-East)"]
        FT["Follower cache tier\n(many nodes, serve reads + forward writes)"]
        LT["Leader cache tier\n(one per DB shard, serializes writes)"]
        DB[("Sharded MySQL\nobjects + assocs + counts")]
    end

    subgraph OtherRegion["Another region (e.g., EU-West)"]
        FT2["Follower cache tier"]
        LT2["Leader cache tier"]
        DB2[("MySQL replicas\nasync from primary)")]
    end

    APP --> FT
    FT -->|"read miss"| LT
    LT -->|"read miss"| DB
    APP -->|"write"| FT
    FT -->|"write forward"| LT
    LT -->|"write-through"| DB
    LT -.async invalidate.-> FT
    DB -.async replication.-> DB2
    LT -.async invalidate.-> LT2

    style FT fill:#15803d,color:#fff
    style LT fill:#ff6b1a,color:#0a0a0f
    style DB fill:#0e7490,color:#fff
    style FT2 fill:#15803d,color:#fff
    style LT2 fill:#ff6b1a,color:#0a0a0f
    style DB2 fill:#0e7490,color:#fff

The leader cache owns one DB shard. All writes arrive at the leader, which issues the MySQL write synchronously before acknowledging. This serialization means two concurrent writes to the same cache entry never produce a torn state — exactly one wins, and the loser retries. The leader also coalesces simultaneous read misses for the same key: the first miss fires a DB read, the rest wait for the reply. This is what prevents a cache miss on a viral post from spawning thousands of simultaneous MySQL queries.

The follower cache is where the actual read volume lives. Because followers are stateless beyond their in-memory cache, scaling out is as simple as adding nodes. A read miss at the follower goes to the leader — not to MySQL — so the follower tier is fully insulated from direct DB contact. When the leader invalidates a follower after a write, the next follower miss re-fetches from the leader cache, which by then already has the fresh data.

The write-through contract (not write-back) is important: writes are durable in MySQL before the caller sees a success response. This rules out the classic write-back failure mode where a node crash loses buffered writes.

The objects-and-associations data model

TAO defines two entity types:

Objects

Object: { id: uint64, otype: uint32, data: blob }

An object is a node in the graph. otype identifies the kind of entity (user = 1, post = 2, photo = 3, page = 4, …). data is an encoded property bag. The id is a globally unique 64-bit integer, typically derived from a distributed ID generator (see distributed ID generator design).

Objects are stored in MySQL:

CREATE TABLE objects (
    id     BIGINT UNSIGNED NOT NULL,
    otype  INT UNSIGNED    NOT NULL,
    data   MEDIUMBLOB,
    PRIMARY KEY (id)
);

Sharded by id. A point lookup obj_get(id) is always a single-shard query.

Associations

Association: { id1: uint64, atype: uint32, id2: uint64,
               ts: int64, data: blob }

An association is a directed, typed edge from id1 to id2. atype identifies the relationship (FRIEND=1, LIKE=2, FOLLOW=3, COMMENT=4, …). ts is a unix timestamp in milliseconds; edges are returned newest-first by default — the social-graph access pattern is almost always "show me the most recent N."

CREATE TABLE associations (
    id1    BIGINT UNSIGNED NOT NULL,
    atype  INT UNSIGNED    NOT NULL,
    id2    BIGINT UNSIGNED NOT NULL,
    ts     BIGINT          NOT NULL,
    data   BLOB,
    PRIMARY KEY (id1, atype, id2),
    KEY    idx_range (id1, atype, ts DESC, id2)
);

CREATE TABLE association_counts (
    id1    BIGINT UNSIGNED NOT NULL,
    atype  INT UNSIGNED    NOT NULL,
    count  BIGINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id1, atype)
);

The idx_range index is what makes assoc_range fast: the database engine satisfies WHERE id1=X AND atype=T ORDER BY ts DESC LIMIT N via a pure index scan, reading N leaf pages without touching the rest of the table.

The data model — objects, associations, counts — maps cleanly onto the graph:

flowchart LR
    U["Object\n(User, id=42)"]
    P["Object\n(Post, id=99)"]
    U2["Object\n(User, id=7)"]

    U -->|"LIKE\n(id1=42, atype=2, id2=99, ts=...)"| P
    U -->|"FRIEND\n(id1=42, atype=1, id2=7, ts=...)"| U2
    U2 -->|"FRIEND\n(id1=7, atype=1, id2=42, ts=...)"| U

    style U fill:#0e7490,color:#fff
    style U2 fill:#0e7490,color:#fff
    style P fill:#a855f7,color:#fff

Notice the friendship between users 42 and 7: two directed association rows, one on each user's shard. The LIKE from user 42 to post 99 is a single directed row — no reverse edge needed for that relationship.

Key sharding decision: shard by id1. This guarantees all associations originating from a given node are on one shard, making assoc_range and assoc_count single-shard operations. The tradeoff: "friends of friends" or "who liked X, and who are their friends?" requires application-level scatter-gather across shards — accepted because those queries are rare compared to single-hop lookups.

Bidirectional edges

A "friendship" is conceptually undirected, but stored as two directed association rows: (A, FRIEND, B) and (B, FRIEND, A). This keeps all queries to single-shard reads (querying A's shard gives A's friends without joining B's shard).

The two rows are not written in a single transaction — they land on different shards, which may be different MySQL instances. TAO maintains eventual consistency of the pair: when (A, FRIEND, B) is written, an async task enqueues writing (B, FRIEND, A). If the async write fails, a repair job re-applies it later. The TAO paper reports DB slave lag below 1 second for 85% of observations and below 3 seconds at p99; under hot-spot conditions the paper notes the window can extend to minutes. The product tolerates this one-direction window: "you can see B's pending friend request before B's friend list shows A."

This is the right trade-off: strict cross-shard consistency would require distributed transactions (2PC), which adds latency and a new class of failure modes at trillion-edge scale.

Query API deep dive

assoc_range — the workhorse

assoc_range(id1, atype, pos, limit) → list<Association>

"Give me limit edges of type atype from node id1, starting at offset pos, ordered newest-first."

Used for: "Show 25 friends of user X", "Show 10 newest likes on post Y", "Show next page of comments."

Cache key: (id1, atype, pos, limit) — or, more efficiently, cache the full sorted list and serve slices from the cached copy. TAO also enforces a per-atype upper bound (typically 6,000) on the actual limit used per query; if an association list is longer than this cap, the client issues multiple paginated queries using pos or high as a cursor. Requests that fall within the cached portion are served in-memory; requests beyond the cached portion fall through to DB.

sequenceDiagram
    participant App
    participant Follower as "Follower Cache"
    participant Leader as "Leader Cache"
    participant DB as "MySQL Shard"

    App->>Follower: assoc_range(userId, FRIEND, 0, 25)
    alt Cache hit
        Follower-->>App: list of 25 friends
    else Cache miss
        Follower->>Leader: assoc_range(userId, FRIEND, 0, 25)
        alt Leader cache hit
            Leader-->>Follower: list
            Follower-->>App: list
        else Leader miss
            Leader->>DB: SELECT id2,ts,data FROM assocs WHERE id1=userId AND atype=FRIEND ORDER BY ts DESC LIMIT 25
            DB-->>Leader: rows
            Leader->>Leader: populate cache
            Leader-->>Follower: list
            Follower->>Follower: populate cache
            Follower-->>App: list
        end
    end

assoc_count — likes at a glance

assoc_count(id1, atype) → uint64

Never implemented as SELECT COUNT(*) on the associations table — that requires a full index scan over potentially millions of rows. Instead, the count is maintained in the association_counts table, updated atomically with every insert/delete, and cached separately. The cached count is a single integer — it fits in a small cache entry and is invalidated only when the count changes.

assoc_get — "did A like this post?"

assoc_get(id1, atype, id2_set) → list<Association>

A multi-point lookup. Used for access control ("does user A follow user B?"), button state ("is this post liked by me?"), and deduplication. Since the primary key is (id1, atype, id2), this is a single-shard primary key lookup per id2.

The hot-vertex problem

A celebrity node — a public figure followed by 100 million people — concentrates read and write traffic on a single shard. Without specific countermeasures, it brings down that shard for everyone else. There are four distinct failure modes to think through.

The adjacency list is too large to fetch. A naive SELECT * FROM associations WHERE id1=celebrity_id AND atype=FOLLOWER returns 100M rows. That query will time out, exhaust DB buffer pool memory, and degrade the shard for every other user whose data lives on it. The fix is simple: enforce a hard limit on every assoc_range call (e.g., max 6,000 per call). The product never shows all 100M followers — it shows a count and a sample. The limit keeps the query bounded regardless of how large the list grows.

The count cache becomes a write hot spot. Every new follower invalidates the follower count cache entry. On a viral post gaining 10,000 likes per second, the count cache entry is invalidated thousands of times per second, causing repeated DB reads that negate the point of caching. Two approaches work here: serve approximate counts (cache the count and invalidate asynchronously in batches — coalesce 100 increments into one update) or use counter sharding (split the count into K shards, each holding roughly 1/K of the total; reads sum K shards, writes go to a random shard). Users do not need exact like counts in real time, so approximate counts are usually the right default. Counter sharding adds read complexity but eliminates the write hot spot entirely. See distributed counter design for the mechanics.

Cache stampede on a viral object. A viral post gains millions of views in seconds. Before its cache entry is warm, thousands of requests miss simultaneously and pile into MySQL. TAO's leader tier naturally handles this: the leader serializes concurrent misses for the same key, so only one DB read fires while the rest wait for the reply. For TTL-based caches more generally, probabilistic early expiration (PER) is a well-known mitigation — start refreshing an entry during the tail of its TTL window so expiration is staggered rather than synchronized.

Shard hotness. Even with good caching, the shard holding a celebrity's node receives a disproportionate fraction of the write traffic (every new follower is a write to that shard) and the cache-miss traffic for cold queries. The operational answer is to route reads for that node to MySQL read replicas rather than the primary, and spin up dedicated follower cache nodes that serve only that node's data. This is operationally expensive but sometimes necessary at the extreme end.

Storage modeling trade-offs

Model	Description	Read performance	Write complexity	When to use
Adjacency list (TAO model)	One row per directed edge, sharded by `id1`	O(log N) for range; single shard	Two rows for undirected edges	Default: social-graph reads are single-hop from a known node
Edge table (global)	One unsharded table, `(id1, id2, type)`	Full table scan or complex index	Simple single write	Only viable at small scale
Adjacency matrix	N×N dense matrix in a 2D array	O(1) point lookup	O(1) write	Dense graphs (< millions of nodes); impractical at social scale
Native graph DB (Neo4j, etc.)	Nodes and edges as first-class primitives	Fast multi-hop traversal	Complex ACID	When traversal depth > 2 is frequent (fraud, recommendations)
Separate objects + assocs tables	TAO: objects table + associations table + counts table	O(1) count, O(log N) range	Transactional count maintenance	Production social graph

The key observation here: native graph databases excel at multi-hop traversal — "friends of friends of friends who like X." The social graph at scale is almost never traversed beyond 2 hops in real time. It's overwhelmingly single-hop reads. A sharded relational table with the right index is faster and more operationally mature for that workload. Consistency, backup/restore tooling, and operational familiarity with MySQL beat graph-native semantics when traversal depth is shallow.

Consistency model

stateDiagram-v2
    [*] --> FollowerReceived: "app writes to local follower"
    FollowerReceived --> LeaderReceived: "follower forwards write to leader"
    LeaderReceived --> DBPersisted: "leader writes to MySQL (synchronous)"
    DBPersisted --> LeaderCacheUpdated: "cache updated as ack propagates back"
    LeaderCacheUpdated --> FollowerInvalidated: "leader sends invalidation to followers"
    FollowerInvalidated --> FollowerRepopulated: "first follower miss re-fetches from leader"
    FollowerRepopulated --> [*]

    DBPersisted --> ReplicaLag: "async replication to other regions"
    ReplicaLag --> OtherRegionVisible: "seconds to sub-minute lag"
    OtherRegionVisible --> [*]

Within a region: writes go from app server → local follower → leader → MySQL (synchronous). As the write acknowledgment propagates back down the chain, caches are updated. App servers that just wrote can re-read from their local follower (which now has the updated entry) — read-your-writes is achieved. Followers see invalidations within milliseconds.

Cross-region: DB replication is asynchronous. Another region's replica may lag by seconds. App servers in that region may briefly see stale data. This is accepted: "you posted a comment in the US; your friend in Europe sees it a second later" is not a correctness failure, it's an expected eventual-consistency delay.

What is not tolerated: permanent divergence or lost writes. The write path (app → follower → leader → MySQL primary) is synchronous and durable. Cross-region replication provides eventual convergence, not the ability to permanently disagree.

Failure modes

Failure	Impact	Mitigation
Follower cache node dies	Read QPS shifts to remaining followers or to leader	Auto-healing; follower tier is stateless — any miss routes to leader
Leader cache node dies	Reads fall through to DB; writes from followers have nowhere to go until a replacement leader is up	Leader is stateful but replaceable — new leader cache is rebuilt from DB; short period of degraded write availability and higher DB load
MySQL shard primary dies	Writes stall; reads continue from replica	MySQL failover (semi-synchronous replication, MHA/Orchestrator); typically < 30s failover
Hot-vertex cache stampede	DB shard spike on viral content	Leader coalesces concurrent misses (one DB read, rest wait); count approximation
Bidirectional edge inconsistency	A follows B, but B's follower list doesn't show A	Async reconciler replays failed edge writes; idempotent re-application
Cross-shard query	Expensive scatter-gather (e.g., "which of X's friends like Y?")	Application-level fan-out with timeout; product limits depth and set sizes
Cache poisoning on a write	Stale cache entry served after write	Write-through: leader writes MySQL synchronously, then cache is updated on the ack reply path; followers invalidated within ms

Things to discuss in an interview

Why not a graph database? The workload is single-hop, read-dominated, and at a scale where operational maturity of MySQL + a known caching model beats the traversal advantages of a native graph DB.
How do you handle the hot-vertex problem? Count caching, pagination with hard limits, replica reads, leader-level request coalescing (prevents stampedes), counter sharding.
How do you maintain bidirectional edges? Async two-row writes on different shards; eventual consistency with a reconciler.
What is the sharding key and why? id1 (the source node). Makes all single-hop reads from a node local to one shard — the dominant query pattern.
How does read-your-writes work? App servers write to their local follower, which forwards to the leader. As the write ack propagates back, the follower's cache is updated. A subsequent read from the same follower hits the fresh entry — no stale read window.
How do you count likes on a viral post without hitting the DB? Pre-maintained association_counts table + cache; approximate counts tolerated; counter sharding if write rate is extreme.
What happens when a leader cache dies? Writes from followers have no leader to forward to until a replacement is provisioned. Reads fall through to DB. The new leader cache is rebuilt from DB. The system degrades gracefully but DB takes on extra load until the cache warms.
What is the difference between assoc_range and assoc_time_range? assoc_range uses offset-based pagination (useful for stable lists); assoc_time_range uses time bounds (useful for "new events since last check"). Both use the same underlying index.

Things you should now be able to answer

Why does sharding by id1 make the association table efficient for social graph queries?
What is the difference between a leader cache and a follower cache in TAO's model?
How does a count cache prevent SELECT COUNT(*) from hitting the DB on every like?
Why are bidirectional friendships stored as two directed association rows rather than one undirected row?
What is a cache stampede, and how does TAO's leader tier mitigate it through request coalescing?
When would you choose a native graph database over a sharded SQL adjacency list?
What is the write-through cache pattern and why does TAO use it rather than write-back?

Frequently asked questions

▸What is TAO, and why did Facebook build it instead of using a general-purpose graph database?

TAO (The Associations and Objects) is a two-tier write-through cache in front of sharded MySQL, purpose-built for social-graph access at Facebook scale. General-purpose graph databases handle multi-hop traversal well but cannot absorb the read volume TAO handles — roughly 1 billion reads per second at peak — and social-graph queries are almost always single-hop, not deep traversals. MySQL's operational maturity, backup tooling, and ACID writes make it the better foundation when traversal depth is shallow.

▸What is the hot-vertex problem and how does the TAO design address it?

The hot-vertex problem is when a celebrity node — followed by 100 million people — concentrates read and write traffic on a single database shard. TAO mitigates it through four mechanisms: a hard per-query limit on assoc_range calls (e.g., 6,000 rows max), a separately maintained association_counts table so counts are a single-row lookup rather than a COUNT(*) scan, leader-level request coalescing so only one DB read fires during a cache miss storm, and approximate or counter-sharded counts for viral posts receiving thousands of write increments per second.

▸Why are bidirectional friendships stored as two directed association rows instead of one undirected row?

Storing two directed rows — (A, FRIEND, B) and (B, FRIEND, A) — keeps the query for each user's friend list to a single shard, since the associations table is sharded by id1. A single undirected row would require a cross-shard join or a UNION across two shards to reconstruct both directions. The TAO paper reports the async window for the second row to land is below 1 second for 85% of observations and below 3 seconds at p99, which the product tolerates.

▸What is the difference between the leader cache and the follower cache in TAO?

The leader cache owns one database shard, serializes all writes to that shard, coalesces concurrent read misses so only one DB query fires per key, and sends invalidations to follower caches. The follower cache is a stateless read tier — app servers write to their local follower, which forwards to the leader; read misses go to the leader rather than directly to MySQL, fully insulating the database from the follower tier's miss traffic. Because followers are stateless, scaling out is as simple as adding nodes.

▸How does the write-through cache pattern preserve durability, and why does TAO prefer it over write-back?

Write-through means the leader writes to MySQL synchronously before acknowledging success to the caller; the cache is updated as the acknowledgment propagates back up the chain. Write-back buffers writes in the cache and flushes them to the DB later, which risks losing those writes if a cache node crashes before the flush. TAO uses write-through to ensure every acknowledged write is already durable on disk.

← previous

Design a Shopping Cart & Checkout System

Design an Authorization System (Google Zanzibar / RBAC / ReBAC)

// RELATED

Design a Social Graph Service (Facebook's TAO)

The problem

Functional requirements

Non-functional requirements

Capacity estimation

Building up to the design

V1: A single relational table

V2: A typed adjacency list (edge table)

V3: Shard by id1 + separate count cache

V4: A cache tier (single region)

V5: Two-tier cache (leader + follower) per region — TAO's model

High-level architecture

The objects-and-associations data model

Objects

Associations

Bidirectional edges

Query API deep dive

assoc_range — the workhorse

assoc_count — likes at a glance

assoc_get — "did A like this post?"

The hot-vertex problem

Storage modeling trade-offs

Consistency model

Failure modes

Things to discuss in an interview

Things you should now be able to answer

Further reading

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform