~/articles/design-search-autocomplete
◆◆Intermediateasked at Googleasked at Amazonasked at Metaasked at Twitter

Design Search Autocomplete (Typeahead)

Sub-100ms autocomplete suggestions across billions of queries — tries, top-k caching, and personalized ranking.

15 min read2026-02-13Ironclad Academy
// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Google's search box handles roughly 8.5 billion queries a day, and the autocomplete dropdown has to respond to each keystroke before the user finishes typing — under 100 ms, globally, for everyone. Amazon, YouTube, Twitter, and virtually every search surface at scale face the same problem. The product name is "typeahead" or "search autocomplete," and it is deceptively simple: as you type, show the 10 most relevant completions for whatever you've typed so far.

The hard part is the word "relevant." There are about a billion distinct queries in Google's index. For a given prefix like am, the subtree of completions is enormous — and the top 10 depends on global popularity, how recently a query spiked (breaking news can make a phrase trend in minutes), and what that specific user has searched before. You need to return the right 10 in under 100 ms on every keystroke from anywhere on the planet.

Two core tensions define this problem. First, the read load is extreme: 10 billion queries a day translates to roughly 600k autocomplete lookups per second on average, and 2 million per second at peak (each query triggers about five keystrokes). That is nearly real-time at internet scale. Second, writes and reads conflict: every search is a signal that could update popularity scores, but updating the index in-place on every search at ~120k QPS would cause catastrophic write contention on hot nodes. The write path and read path must be fully decoupled.

The solution combines a classic data structure (trie with precomputed top-k at every node), a distributed-systems pattern (sharding by prefix so the index spans a cluster), a streaming pipeline to keep the index fresh, and personalization layered on top without slowing down the hot path. Each of those choices exists because the previous naive version breaks at scale — the design builds up naturally from a single SQL query to a production-grade distributed system.

Functional requirements

  • As the user types, return up to 10 suggestions for the current prefix.
  • Suggestions ranked by popularity, recency, and personalization.
  • Update suggestions in near-real-time as new queries are made.

Non-functional

  • p99 latency < 100ms (every keystroke!).
  • 10B queries/day.
  • Personalized: each user sees somewhat different suggestions.
  • Available worldwide.

Capacity

At 10 billion queries per day, each triggering roughly five autocomplete lookups (one per character typed), the read load dominates everything else in this design.

DimensionEstimateHow we got there
Search QPS (avg)~120k/s10B ÷ 86,400 s
Autocomplete lookups (avg)~600k/s120k × 5 keystrokes
Autocomplete lookups (peak)~2M/s~3× average peak multiplier
Distinct queries to index~1BLong-tail query distribution

Takeaway: 2M autocomplete lookups/s is the dominant constraint — the entire architecture (trie, sharding, edge caching) exists to serve that number within 100 ms.

Building up to the design

Autocomplete is the cleanest "data structure earns its keep" problem in the catalog. The naive version makes the trie feel obvious, and the trie version makes the precomputed top-k feel obvious. Walk the path and each decision arrives naturally.

V1: LIKE 'prefix%' on a SQL table

SELECT query FROM popular_queries
WHERE query LIKE $1 || '%'
ORDER BY count DESC LIMIT 10;

Works on day one, especially with a B-tree index on query — Postgres can serve LIKE 'prefix%' straight from the index. But at 2M lookups/sec, even an indexed range scan is too expensive. Each one involves a sorted scan and an ORDER BY, and p99 climbs well above the 100ms budget. We need something designed for prefix lookups, not a general-purpose range index.

V2: In-memory trie

Load all popular queries into a trie. Each lookup walks down len(prefix) nodes and returns the children sorted by popularity — O(prefix_length) in descent, microseconds. A huge improvement.

The catch: each lookup still sorts its children by popularity. For a hot prefix like a with 100k children, that's a full sort on every request. And a naive trie for 1B unique queries is large — each node stores 26 child pointers at 8 bytes each (208 bytes/node), so even with significant prefix sharing, the node count can reach hundreds of millions, putting the index at tens to hundreds of GB. Compressed tries (DAWG, HAT-trie) bring that down significantly, but the full global index still won't fit in one machine's RAM affordably. Two problems left: sorting on the read path, and sheer size.

V3: Precompute top-k at every node

Fix the sorting problem at index-build time. At each trie node, store the top-10 completions, pre-sorted. Now a lookup is: walk to the node, return its list. That's it — O(prefix_length), sub-millisecond, completely independent of how large the subtree below that node is. The sort cost moves entirely off the read path.

The trie is still too big for one box, and the counts go stale as new queries pour in. Those are the next two problems.

V4: Shard the trie by prefix

Shard by leading characters: a-d → shard 1, e-h → shard 2, and so on. Or hash-shard by the first 2 characters for more even distribution. Each shard now fits in RAM; routing a request is just prefix[0..2] → shard_id. The cluster scales horizontally and every request hits exactly one shard.

Building the trie nightly is fine for stable global popularity, but trending queries can spike in minutes. If "breaking news" starts trending at 2pm, waiting until midnight to rebuild means six hours of stale results.

V5: Streaming query counts + incremental updates

A Kafka stream of search performed events feeds aggregator workers that batch-update counts. An index builder periodically refreshes top-k caches at trie nodes — every few minutes for hot prefixes, nightly for cold ones. New trending queries now surface within 5 minutes, not hours.

What's still missing: everything above serves the same 10 results for "bay" to everyone. "Bay" for someone in California should surface "Bay Area"; for someone in New York, "Bay Ridge."

V6: Personalization on top

The global trie returns 50 candidates instead of 10. A per-request re-ranker blends in personal features — recent queries, location, language — and trims back to 10. The trie stays simple and fast; the personalization complexity lives in the re-ranker, only on the logged-in hot path.

V7: Production

V3 + V4 + V5 + V6, plus a hot-prefix cache at the edge (CDN serves the top-1000 prefixes that get hammered globally), plus a spam and safety filter on completions.

flowchart LR
    V1[V1: LIKE query<br/>100ms+] --> V2[V2: in-memory trie<br/>sorted per req]
    V2 --> V3[V3: + precomputed top-k<br/>sub-ms]
    V3 --> V4[V4: + shard by prefix<br/>fits in RAM]
    V4 --> V5[V5: + streaming counts<br/>trends in minutes]
    V5 --> V6[V6: + personalization<br/>relevant]
    V6 --> V7[V7: + edge cache + safety<br/>production]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

Architecture

Two pipelines run in parallel and never block each other.

The read path is what the user experiences: a keystroke hits the CDN edge, likely returns from cache, and lands at the autocomplete API only on a miss. The API checks its own in-memory prefix cache, and only on a second miss reaches the sharded trie index for the authoritative answer.

The write path runs quietly in the background: every search event flows into Kafka, aggregator workers batch the counts, and an index builder periodically refreshes the top-k lists at each trie node. A fresh build propagates to the live shards in minutes without blocking a single read.

flowchart TD
    USER[User] --> CDN[Edge / CDN]
    CDN --> ACAPI[Autocomplete API]
    ACAPI --> CACHE[(Per-prefix Cache<br/>top-10 per prefix)]
    CACHE -.miss.-> SHARDS[(Sharded prefix index<br/>tries / sorted sets)]

    SAMPLES[Query Stream] --> KAFKA[Kafka]
    KAFKA --> AGG[Aggregator Workers]
    AGG --> COUNTS[(Counts DB)]
    COUNTS --> BUILDER[Index Builder<br/>nightly + incremental]
    BUILDER --> SHARDS

    style ACAPI fill:#ff6b1a,color:#0a0a0f
    style CACHE fill:#15803d,color:#fff
    style BUILDER fill:#a855f7,color:#fff

Data structure: trie + cached top-k

A trie (also called a prefix tree) stores strings so that all words sharing a prefix share a path. cat, car, card, and care all pass through the nodes c → ca before branching. This makes prefix lookup essentially free: to find all completions for ca, you walk two nodes and you're already standing at the right subtree.

flowchart TD
    R[root]
    R --> C[c]
    C --> CA[ca]
    CA --> CAR["car<br/>top: cars 100, carbon 80, card 70"]
    CA --> CAT["cat<br/>top: cat 90, catalog 50"]
    CAR --> CARD["card<br/>top: card 70, cards 60"]
    CAR --> CARE["care<br/>top: care 50, careers 40"]
    style CAR fill:#ff6b1a,color:#0a0a0f
    style CAT fill:#ff6b1a,color:#0a0a0f

Notice the top: list cached at each node. That's the key insight: at index-build time, every node precomputes and stores the top-10 completions reachable from it. Lookup is then "walk len(prefix) steps, return the cached list" — O(prefix_length), no traversal of the subtree, no sort. Without this cache, serving "top 10 for prefix ca" would require walking every node below ca and sorting — far too slow at request time.

Sharding

A single trie can't hold 1B distinct queries in one machine's memory. The fix is to split the trie by prefix: one range of leading characters per shard.

flowchart LR
    REQ["Prefix 'amazon'"] --> ROUTE[Router]
    ROUTE -->|"starts with 'a'"| S1["Shard 1: a–d"]
    ROUTE -->|"starts with 'g'"| S2["Shard 2: e–l"]
    ROUTE -->|"starts with 'p'"| S3["Shard 3: m–r"]
    ROUTE -->|"starts with 'y'"| S4["Shard 4: s–z"]
    style S1 fill:#15803d,color:#fff

Each shard holds a slice of the trie and is replicated for HA. Every request hits exactly one shard, so there's no scatter-gather.

Range-shard vs. hash-shard. Range sharding (a–d, e–h, ...) is simple to reason about and operationally easy, but creates uneven load — s subtrees (search, show, sport, shop, ...) are far hotter than x or z. Hash-sharding by hash(prefix[0:2]) % N distributes evenly, at the cost of losing prefix locality — but that's fine since each request targets exactly one prefix anyway. In practice, many systems use range sharding at the top level and sub-shard hot letters by their second character (sa–sd, se–sk, ...) as load dictates.

Updating the index

With ~120k QPS and 1B distinct queries, you can't update the trie in-place on every search — the write contention on hot nodes would be crushing. Instead, decouple the write stream entirely.

flowchart LR
    Q[Searches] --> KAFKA[Kafka topic]
    KAFKA --> AGG[Aggregator]
    AGG --> COUNT[(Counts DB<br/>"query → count, last_seen")]
    COUNT --> BLD[Index Builder<br/>builds top-k cache per node]
    BLD --> NEWIDX[New trie shards]
    NEWIDX --> SWAP[Atomic swap to live]
    style KAFKA fill:#a855f7,color:#fff
    style BLD fill:#ff6b1a,color:#0a0a0f

Search events land in Kafka. Aggregators consume them and batch-update counts in Cassandra or BigTable. An index builder periodically reads those counts, recomputes the top-k lists at each node, builds a fresh set of trie shards, and then swaps them into the live serving layer atomically — a blue/green deploy of the index itself. Reads never see partial results.

For trending queries — a sudden spike from breaking news — a real-time layer maintains a sliding-window count over the last few minutes and blends that score into ranking, even before the next full rebuild runs.

If the index builder falls behind or crashes, the live trie keeps serving its last-known-good top-k lists. Users see slightly stale results rather than errors. Kafka provides durable replay, so the builder catches up from its last committed offset. Hot-prefix cache TTLs (5 min) bound how long a stale result persists after a shard refreshes.

Ranking

The score blended at each node — and in the re-ranker — combines four signals:

score(query | prefix, user)
  = w1 × log(popularity)
  + w2 × recency_decay(last_seen)
  + w3 × personalization(user, query)
  + w4 × trending_score(query)

For 99% of traffic the global top-10 cached at the trie node is good enough. Personalization is layered on top: the trie returns 50 candidates, and a lightweight server-side re-ranker blends the user's history and location to trim back to 10.

Caching layers

Every layer between the user and the trie store exists to absorb the load before it reaches the next layer.

LayerWhatTTL
BrowserLast result for last prefixclient-controlled
CDNPublic top-10 for popular prefixes60s
Server in-memoryPer-prefix top-105 min
Trie storeAuthoritative24h rebuild

For the most common prefixes — a, am, ama, and so on — 99% of requests are served from CDN or the server-side in-memory cache and the trie never sees the request.

Personalization

The split comes down to logged-in vs. anonymous traffic.

Edge serves the global top-10 to anonymous users — no session to look up, no added latency. For logged-in users, the autocomplete API fetches the user's recent query history from a per-user store (DynamoDB works well here — point lookups by user ID), blends a personal score onto the trie's top-50 candidates, and returns the final top-10. The extra lookup adds roughly 5–10 ms, well within budget from a nearby PoP.

sequenceDiagram
    participant U as User (logged in)
    participant CDN as CDN edge
    participant API as Autocomplete API
    participant TRIE as Trie shard
    participant HIST as User history store
    U->>CDN: prefix "bay"
    CDN-->>U: cache miss (personalized, can't cache)
    CDN->>API: forward request + session token
    API->>TRIE: top-50 for prefix "bay"
    TRIE-->>API: "bay area, baylor, baywatch ..."
    API->>HIST: recent queries for user_id
    HIST-->>API: ["bay ridge", "brooklyn", ...]
    API->>API: re-rank 50 candidates with personal scores
    API-->>U: top-10 (bay ridge first for NYC user)

Notice that CDN can't serve a cached response here — the result is user-specific. That's the cost of personalization: the hot-prefix cache only helps anonymous traffic, so the logged-in path always touches the API layer.

Filtering

Not every popular query should be shown. A blocklist applied at index-build time removes offensive, illegal, or restricted completions. A real-time filter catches anything that slips through or becomes newly problematic. Regular human review keeps both layers calibrated.

Latency budget

For 100ms p99 from a regional PoP:

Stepms
Network (user → edge/server)20–30
Cache lookup (in-memory)1–5
Personalization re-rank5–10
Network (server → user)20–30
Browser render5
Total~51–80 ms

The 100ms target is achievable only when users hit a nearby edge. Cross-continental round trips range from ~70–120ms RTT for trans-Atlantic to 150–300ms for trans-Pacific routes — either of those already consumes most or all of the budget before any processing happens. This is why a CDN with regional PoPs is essential, not optional. The in-memory cache lookup must also be fast: a cross-datacenter Redis call costs 20–30ms on its own, which is why the server-side cache needs to live on the same rack as the API.

Edge cases

Spelling correction

If a user types goggle, a trie alone is helpless: the first character is wrong, there's no shared prefix, and you never reach the right subtree. Two approaches work in practice.

BK-trees build an index on edit distance. Lookup uses the triangle inequality to prune irrelevant subtrees — in practice, a distance-1 query visits only 5–8% of the tree, a distance-2 query visits 17–25%. Performance degrades toward linear as the allowed edit distance increases. SymSpell trades space for speed: it pre-generates all candidates within edit distance 2 at index time so that query-time lookup is effectively O(1) — 100x to ~1,000x faster than BK-tree in Wolf Garbe's benchmarks (the exact multiplier varies by dictionary size and edit-distance threshold).

In production, spell correction runs as a separate parallel service. Its suggestions merge into the trie's prefix results before the response goes back to the client, so it adds nothing to critical-path latency.

Multi-language

Different scripts (Latin, Cyrillic, CJK), different tokenization rules, different popularity distributions per locale. Index per locale; pick the right one at request time based on the user's language setting.

Long queries

For something like "best italian restaurants near me," a prefix trie is no longer the right structure — you're matching tokens, not characters. An inverted index plus n-gram indexing handles this better.

Mobile keyboards

Autocomplete on mobile fires more aggressively — sometimes after every keystroke, sometimes only on word boundaries. The API and ranking are identical; the client tuning controls when to call.

Client-side debounce and prefix caching

A well-implemented client never fires on every single keystroke. Typical optimizations: debounce (wait 50–100ms after the last keystroke), local prefix cache (if "sea" results are cached, the client can filter them for "sear" without a network call), and request cancellation (discard a slow "sea" response if a faster "sear" result arrived first). Together these meaningfully reduce server load compared to a naive fire-on-every-key approach — in practice, fast typists produce a fraction of the network calls they otherwise would.

Storage choices

ComponentTech
Trie shardsCustom in-memory service, or Redis sorted sets per node
CountsCassandra or BigTable (write-heavy, append-only)
Personal historyDynamoDB (per-user lookup)
Query logsKafka → S3/Bigquery
BlocklistPostgres (small, simple)

Things to discuss in an interview

  • Trie + cached top-k: explain why a plain trie doesn't work alone.
  • Sharding by prefix vs. hash: trade-offs.
  • Read path vs. write path clearly separated.
  • Personalization layered on top of global top-k.
  • Caching at multiple levels including CDN.

Things you should now be able to answer

  • Why is a trie a natural fit for prefix-based autocomplete?
  • Why do you cache top-k at each node rather than traversing on every query?
  • How do you keep the index fresh without overwhelming write paths?
  • Where does personalization happen in the pipeline?
  • How does the system handle a sudden trend (e.g. breaking news)?

Further reading

  • "Search Autocomplete @ Scale" — Engineering at Meta
  • "How We Built Twitter Search Autocomplete" — Twitter Engineering
  • Lucene's SuggestField and related autocomplete docs
// FAQ

Frequently asked questions

Why does a trie alone fail for autocomplete at scale, and what fixes it?

A plain trie still requires sorting children by popularity on every lookup. For a hot prefix like 'a' with 100k children, that is a full sort on every request. The fix is to precompute and store the top-10 completions at each node at index-build time, so a lookup is just walk len(prefix) steps and return the cached list — O(prefix_length), sub-millisecond, with zero traversal of the subtree below.

What is the peak autocomplete lookup rate in this design, and why does it differ from the raw search QPS?

Peak autocomplete lookups reach roughly 2 million per second, versus about 120k searches per second on average. The multiplier comes from two factors: each query triggers approximately five keystrokes, pushing average lookups to 600k per second, and a 3x peak multiplier accounts for traffic spikes.

When should you use range sharding versus hash sharding for the trie?

Range sharding by leading characters (a-d, e-h, ...) is simpler to operate but creates uneven load because some letter subtrees are far hotter than others. Hash sharding by the first two characters distributes load evenly at the cost of losing prefix locality, which is acceptable since each request targets exactly one prefix anyway. In practice, many systems range-shard at the top level and sub-shard hot letters by their second character as load dictates.

How does personalization work without slowing down the hot path?

The global trie returns 50 candidates instead of 10, and a lightweight server-side re-ranker blends the user's recent query history and location — fetched from a per-user store like DynamoDB — to trim back to 10. The extra lookup adds roughly 5-10 ms, which fits within the 100ms budget from a nearby PoP. Anonymous traffic bypasses this entirely and is served the global top-10 from CDN.

How does the system surface a trending query within minutes without updating the trie on every write?

Search events flow into a Kafka topic rather than directly updating the trie. Aggregator workers batch-update counts in Cassandra or BigTable, and an index builder periodically recomputes top-k lists and swaps fresh trie shards into the live serving layer atomically. A separate real-time layer also maintains a sliding-window count over the last few minutes and blends a trending score into ranking, so new queries can surface within 5 minutes without waiting for the next full nightly rebuild.

// RELATED

You may also like