~/articles/design-distributed-search

◆◆◆Advancedasked at Elasticasked at Amazonasked at Algolia

Design a Distributed Search Engine (Elasticsearch)

Q: What is an inverted index and why does it make full-text search tractable at scale?

An inverted index maps each term in the corpus to a sorted posting list of document IDs (and term frequencies and positions) that contain it. Instead of scanning every document for a query term — O(N) with a relational LIKE scan — lookup becomes a single key fetch returning the posting list in O(1) plus O(posting list size). Boolean queries like AND reduce to a two-pointer intersection of two sorted lists, and phrase queries use stored position data to verify adjacency.

Q: How does BM25 improve on TF-IDF for relevance scoring?

TF-IDF has two weaknesses BM25 fixes: raw term frequency grows unboundedly, so a document mentioning a term 100 times scores 10x more than one with 10 mentions even though the signal diminishes. BM25 applies a saturation function (controlled by k1, typically 1.2) that caps TF's contribution. It also normalizes TF by document length (controlled by b, typically 0.75), so a long document mentioning a term once is not treated as stronger evidence than a short document doing the same.

Q: Why is hash-on-doc-id sharding preferred over term-based sharding in distributed search?

Term-based sharding would let each shard answer queries independently without scatter-gather, but indexing a single document would touch dozens of shards — one per unique term — making writes complex and transactionally painful. Hash on doc_id keeps each shard as a self-contained index over a slice of the document space, making writes simple (one shard per document) at the cost of scatter-gather on queries.

Q: Why is deep pagination expensive in a distributed search engine, and how should it be handled?

To serve results at offset 10,000 with page size 10, the coordinator must ask every shard for its top 10,010 results, merge them all, and discard the first 10,000. On a 50-shard cluster that means collecting and sorting roughly 500,000 candidates to return 10. The recommended fix is cursor-based pagination using search_after, which passes the last-seen sort key so each shard returns only results beyond that key at O(size) cost. Elasticsearch caps from + size at 10,000 by default for this reason.

Q: What is the difference between a segment refresh and a flush in Lucene-based search engines?

A refresh writes the in-memory buffer to a new immutable segment in the OS filesystem buffer (not fsynced to durable storage) and makes it visible to search — this happens roughly every 1 second by default and is what near-real-time indexing means. A flush commits segments to durable storage via fsync and clears the transaction log; it is more expensive and triggers either every 30 minutes or when the transaction log exceeds 10 GB. Reducing the refresh interval makes documents searchable faster but generates more small segments and increases merge overhead.

Index billions of documents and answer full-text queries in milliseconds. Inverted indexes, sharding + replication, scatter-gather, and relevance scoring.

21 min read2026-05-07Ironclad Academy

#interview #search #indexing #distributed-systems

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Elasticsearch powers the search bar at GitHub, the log aggregation at Uber, and the product discovery at Shopify. The common thread: a corpus too large for any single machine, users who expect answers in under 100 milliseconds, and an index that must stay fresh as documents arrive by the thousands every second.

A WHERE body LIKE '%kubernetes%' scan reads every row in the table — O(N) per query. At a billion documents that takes minutes, not milliseconds. The inverted index solves this by flipping the access pattern: instead of asking "which terms appear in this document?", you pre-build a map of "which documents contain this term?" Term lookup becomes a single key fetch returning a sorted posting list, and boolean queries (AND, OR, NOT) reduce to fast list intersections and unions. That is the core data structure that makes full-text search tractable.

The distributed version of this problem introduces a second layer of complexity. A single machine cannot hold the inverted index for 100 TB of raw text — so the index must be sharded across hundreds of nodes. Every query then needs to fan out to all shards in parallel, collect each shard's local top results, and merge them into a globally correct ranking before returning. This scatter-gather pattern is both the source of the system's horizontal scale and the root of its hardest trade-offs: relevance accuracy across shards with different term statistics, deep-pagination costs that grow with shard count, and near-real-time indexing lag driven by Lucene's segment-based write path.

This article is about indexing and querying a large document corpus — not crawling the web. For web-scale crawling, see design a web crawler. For the typeahead autocomplete problem, see design search autocomplete.

Functional requirements

POST /index — accept a document (id, title, body, metadata), analyze it, and add it to the index.
GET /search?q=<query>&from=0&size=10 — return ranked results for a free-text query.
Support boolean expressions (AND / OR / NOT) and phrase queries.
(Optional) Faceted filtering, range queries on numeric fields, aggregations.
(Optional) Autocomplete / prefix matching.

Non-functional requirements

Low query latency — p99 < 100 ms for most queries.
Near-real-time indexing — a newly indexed document should appear in search within ~30 seconds (ideally ~1 second).
High availability — a single node failure must not take search offline.
Horizontal scale — both corpus size and QPS must scale by adding nodes.
Relevance — results must be ordered by how well they match the query, not arbitrarily.

Capacity estimation

Dimension	Estimate	How we got there
Documents	10 billion	Given
Avg document size	10 KB	Title + body + metadata
Raw text total	100 TB	`10B × 10 KB`
Index storage (typical)	~100 TB	Elasticsearch stores inverted index + `_source` by default; indexed-to-raw ratio ranges 0.57–1.39× depending on compression codec, field mappings, and doc value settings; conservative baseline is ~1× raw
Index storage range	50–130 TB	Best case (best_compression, `_source` disabled): ~50%; default mixed workload: ~100–130%
Target shard size	50 GB	Fits comfortably in a mid-range instance's RAM + NVMe SSD
Shard count (full corpus)	~2,000 primary shards	`100 TB ÷ 50 GB`
Total shard copies (1 primary + 1 replica, 2 total copies)	~4,000	`2,000 × 2`
Nodes (full corpus, 500 GB usable NVMe each)	~400 nodes	`4,000 ÷ (500 GB ÷ 50 GB)`
Hot-tier practical cluster size	~40 nodes	Shrink corpus to 10 TB of hot/recent data or apply tiered storage
Write throughput	5,000 indexing events/sec	Given; each doc passes tokenize → lowercase → stem → stop-word filter; I/O + CPU bound
Query throughput	50,000 QPS	Given; coordinator is stateless and scales horizontally; tiered deployment queries hot shards only

Takeaway: At 100 TB raw corpus a full production cluster needs ~400 nodes; shard count is fixed at index creation, so plan for 3–5× expected data size up front to avoid a costly full re-index as the corpus grows.

The inverted index — the core data structure

A relational WHERE body LIKE '%kubernetes%' performs a full table scan — every row is read. That's O(N) per query. For billions of documents, it's completely impractical.

The inverted index flips the access pattern. During indexing, for every term in every document, we record the document ID (and optionally the term's positions within the document). At query time, looking up a term is a single key lookup into a hash or B-tree, returning the posting list of all matching doc IDs in O(1) + O(|posting list|).

Structure

term          →  posting list
"kubernetes"  →  [(doc:1, tf:3, pos:[4,17,90]), (doc:5, tf:1, pos:[2]), (doc:9, tf:2, pos:[11,44])]
"docker"      →  [(doc:1, tf:1, pos:[6]), (doc:3, tf:4, pos:[1,8,22,31]), ...]
"container"   →  [(doc:1, tf:2, pos:[7,91]), (doc:3, tf:1, pos:[9]), ...]

Each entry in the posting list stores at minimum the doc_id. It typically also stores:

Term frequency (tf): how many times the term appears in the document. Used in relevance scoring.
Positions: the offsets of each occurrence within the document. Required for phrase queries (e.g. "kubernetes cluster" — the two terms must appear adjacent).

Posting lists are stored in sorted order by doc_id. This makes boolean operations efficient.

Analysis pipeline (how text becomes terms)

Before a document reaches the inverted index, it goes through an analysis pipeline:

raw text → tokenizer → lowercase filter → stop-word filter → stemmer → terms

"Running Kubernetes Containers in the Cloud"
  → tokenize  : ["Running", "Kubernetes", "Containers", "in", "the", "Cloud"]
  → lowercase  : ["running", "kubernetes", "containers", "in", "the", "cloud"]
  → stop-words : ["running", "kubernetes", "containers", "cloud"]   (dropped "in", "the")
  → stem       : ["run", "kubernete", "container", "cloud"]

The same pipeline runs on both documents (at index time) and queries (at query time). If you stem "containers" to "container" at index time, you must also stem the query term so they match.

Stop-words (common words like "the", "in", "a") are dropped because they'd appear in nearly every document, bloating the posting list without improving relevance.

flowchart LR
    RAW["Raw text<br/>e.g. 'Running Kubernetes Containers'"] --> TOK[Tokenizer]
    TOK --> LC[Lowercase filter]
    LC --> SW["Stop-word filter<br/>(drop 'the', 'in', 'a')"]
    SW --> STEM["Stemmer<br/>'containers' → 'container'"]
    STEM --> IDX[(Inverted Index<br/>term → posting list)]
    style TOK fill:#0e7490,color:#fff
    style STEM fill:#ff6b1a,color:#0a0a0f
    style IDX fill:#15803d,color:#fff

The same pipeline runs at query time — so when a user searches "Containers", it gets stemmed to "container" and matches the same posting list entry created during indexing.

Boolean query execution

AND query (kubernetes AND docker): intersect the two posting lists. Since both are sorted by doc_id, a two-pointer merge runs in O(|A| + |B|):

A: [1, 3, 5, 9, 14]
B: [1, 3, 7, 14, 20]
result: [1, 3, 14]   ← docs containing both terms

OR query (kubernetes OR docker): union of posting lists. Same merge, take all IDs from either.

Phrase query ("kubernetes cluster"): first intersect like AND to find docs with both terms, then for each candidate, check the position lists — the position of "cluster" must equal position of "kubernetes" + 1 (adjacent in the document). This is why position data is stored.

NOT (kubernetes NOT docker): compute the AND of kubernetes posting list with the complement of docker. In practice, NOT is computed as a filter applied after retrieval, not by materializing a giant "docs without docker" list.

Relevance scoring — TF-IDF and BM25

Getting the right results is necessary; returning them in the right order is what makes search feel good.

TF-IDF (the classic baseline)

The intuition: a term is relevant to a document if it appears often in that document (term frequency, TF) but rarely across all documents (inverse document frequency, IDF). A word like "the" has high TF everywhere and nearly zero IDF — useless as a signal. A word like "photolithography" is rare and meaningful.

TF(t, d)  = count of term t in document d / total terms in document d
IDF(t)    = log(N / df(t))
              N    = total documents in the corpus
              df(t) = documents containing term t

Score(t, d) = TF(t, d) × IDF(t)

For multi-term queries: Score(q, d) = Σ TF-IDF(t, d) for each term t in query

BM25 (Best Match 25) — the modern default

BM25 is the successor to TF-IDF used in most production search engines (including Lucene/Elasticsearch's default scorer). It fixes two weaknesses in TF-IDF.

First, raw TF grows unboundedly. A document with a term 100 times shouldn't score 10× more than one with 10 occurrences — returns diminish. BM25 applies a saturation function that caps TF's contribution. Second, a long document containing a term once is weaker evidence than a short document doing the same. BM25 normalizes TF by document length.

BM25(t, d) = IDF(t) × [ TF(t,d) × (k1 + 1) ]
                         --------------------------
                         [ TF(t,d) + k1 × (1 - b + b × dl/avgdl) ]

k1 ∈ [1.2, 2.0]   — controls TF saturation
b  ∈ [0, 1]       — controls length normalization (0 = off, 1 = full)
dl  = document length
avgdl = average document length in the corpus

Typical defaults: k1 = 1.2, b = 0.75. These are tunable. The score for a multi-term query sums the BM25 contributions of each term, and documents are ranked descending by that score.

Building up to the design

V1: One node, one Lucene index

Start with a single machine running a single Apache Lucene index. You get full inverted-index semantics, BM25 scoring, near-real-time refresh — a working search engine on one box that handles a few hundred GB of documents comfortably.

The wall you hit is storage. At 100 TB of raw corpus, the total on-disk index footprint is roughly 100 TB or more (Elasticsearch stores the inverted index plus the original document source by default). A single machine can't hold that — even an i3en.24xlarge tops out at 60 TB NVMe and 768 GB RAM. You have to split the index.

V2: Shard the index

Partition documents across N shards. Each shard is a complete, self-contained inverted index over its slice of the document space. A query fans out to all shards in parallel, each scores locally, and the coordinator merges the results. Now corpus size scales horizontally, and so does read throughput.

The next fragility: each shard has exactly one copy. If that node goes down, those documents are temporarily unsearchable.

V3: Replicate each shard

Give every shard a primary and one or more replicas. Writes go to the primary; the primary replicates to replicas before acknowledging. Reads can go to either. Replica promotion handles primary failures — the cluster manager detects a lost node and promotes a replica within seconds. You also get read QPS scale-out as a bonus.

At this point the coordinator — the node that receives queries and fans them out — has become a potential bottleneck and single point of failure.

V4: Stateless coordinator tier

The coordinator holds no index data. It reads the shard map from a metadata store (ZooKeeper or etcd), routes queries to the right shards, and merges results. Because it's stateless, you run as many as you need behind a load balancer.

V5: Production-grade additions

Segment merging, field-level index types, multi-tenancy (multiple indices), warm/cold storage tiering for old documents. This is where Elasticsearch and OpenSearch live.

flowchart LR
    V1["V1: single Lucene node<br/>GB scale"] --> V2["V2: shard across nodes<br/>TB scale, no HA"]
    V2 --> V3["V3: + replication<br/>fault tolerant"]
    V3 --> V4["V4: + stateless coordinator<br/>no coordinator SPOF"]
    V4 --> V5["V5: + segments, merging, tiering<br/>production search cluster"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    CLI[Client] --> LB[Load Balancer]
    LB --> COORD1[Coordinator / Search Node]
    LB --> COORD2[Coordinator / Search Node]

    COORD1 -->|scatter| SH0P[Shard 0 Primary]
    COORD1 -->|scatter| SH1P[Shard 1 Primary]
    COORD1 -->|scatter| SH2P["Shard 2–N Primary"]

    SH0P -.replicate.-> SH0R[Shard 0 Replica]
    SH1P -.replicate.-> SH1R[Shard 1 Replica]

    ING[Indexing Service] -->|write docs| SH0P
    ING -->|write docs| SH1P
    ING -->|write docs| SH2P

    META[(Shard Metadata<br/>ZooKeeper / etcd)] --> COORD1
    META --> COORD2

    style COORD1 fill:#ff6b1a,color:#0a0a0f
    style COORD2 fill:#ff6b1a,color:#0a0a0f
    style SH0P fill:#15803d,color:#fff
    style SH1P fill:#15803d,color:#fff
    style SH2P fill:#15803d,color:#fff
    style SH0R fill:#a855f7,color:#fff
    style SH1R fill:#a855f7,color:#fff
    style ING fill:#0e7490,color:#fff
    style META fill:#ffaa00,color:#0a0a0f

Two coordinator nodes sit behind the load balancer — stateless, horizontally scalable. Each knows the shard map from ZooKeeper/etcd and routes accordingly. Green nodes are primaries; purple are replicas. The indexing service writes to primaries and lets replication take care of the rest.

Sharding strategy

Each shard is a self-contained inverted index covering a subset of documents. The two main strategies:

Strategy	How	Trade-offs
Hash on `doc_id`	`shard = hash(doc_id) % N`	Even distribution; random access to any doc by ID; simple to implement
Range on `doc_id`	Shard 0 owns IDs 0–99M, shard 1 owns 100M–199M	Ordered scans; potential hot shards if recent docs are accessed more
Term-based	Each shard owns a disjoint set of terms	No scatter-gather needed; writing a doc touches multiple shards; very complex

Hash on doc_id is the right default. Term-based sharding is tempting because it would let each shard answer a query independently — no scatter-gather needed — but writing a single document would touch dozens of shards (one per unique term), making indexing complex and transactionally painful.

Fixed shard count: the number of shards is typically fixed at index creation. Elasticsearch defaults to 1 primary shard per index since version 7.0 (it was 5 in older versions). Changing the shard count requires re-indexing — expensive. Plan for 3–5× expected data size so you don't need to re-shard for years.

Write path — indexing a document

sequenceDiagram
    participant C as Client
    participant ING as Indexing Service
    participant PRI as Primary Shard
    participant REP as Replica Shard

    C->>ING: POST /index { id, title, body }
    ING->>ING: analyze (tokenize, lowercase, stem)
    ING->>PRI: write analyzed doc to segment
    PRI->>REP: replicate segment operation
    REP-->>PRI: ack
    PRI-->>ING: ack
    ING-->>C: 200 OK (indexed, visible at next refresh)

A few things to notice here. The primary writes to an in-memory buffer first, then flushes to a new segment on disk. Replication happens at the operation level — the replica replays the same write. The client's acknowledgment waits for the replica to confirm, which means a slow replica raises write latency. Elasticsearch exposes wait_for_active_shards to tune this: acknowledge after just the primary (faster, less durable) or after N replicas (slower, more durable).

Near-real-time indexing — segments and the refresh cycle

Lucene-style engines use immutable segments as the on-disk unit. A segment is a mini inverted index written once and never modified. Documents are "deleted" by a tombstone, not by in-place removal. Segments accumulate and are periodically merged by a background process.

                     in-memory buffer
                          │
                    [refresh ~1s]
                          ↓
              ┌── segment 1 (on disk, searchable) ──┐
              ├── segment 2 (on disk, searchable) ──┤
              ├── segment 3 (on disk, searchable) ──┤
              └── segment 4 (newly flushed)         ┘
                          │
                   [merge background]
                          ↓
              └── merged segment (fewer, larger)

Why immutable? The OS page cache serves read-heavy workloads on immutable files with no locking complexity during queries. Merging is a background copy operation; search continues without interruption.

The refresh interval defaults to once per second in Elasticsearch. After a refresh, the newly written segment becomes visible to search — this is what "near-real-time" (~1 s latency) means. A flush is different: it commits segments to durable storage via fsync and clears the transaction log. Refreshes are cheap and happen every ~1 s; flushes are more expensive and are triggered either every ~30 minutes (by default) or when the transaction log grows past 10 GB — whichever comes first.

The trade-off is straightforward: reduce the refresh interval and documents appear faster, but you generate more small segments and drive more merging work. Increase it to 30 s during bulk loads and indexing throughput improves significantly, at the cost of delayed searchability.

Query path — scatter-gather

The coordinator receives a query and executes it in two phases.

Phase 1: Query phase (scatter, score, collect top-K)

sequenceDiagram
    participant C as Client
    participant CO as Coordinator
    participant S0 as Shard 0
    participant S1 as Shard 1
    participant SN as "Shard N"

    C->>CO: GET /search?q=kubernetes+docker&size=10
    CO->>S0: query("kubernetes AND docker", size=10)
    CO->>S1: query("kubernetes AND docker", size=10)
    CO->>SN: query("kubernetes AND docker", size=10)
    S0-->>CO: [(doc:42, score:3.1), (doc:101, score:2.8), ...]
    S1-->>CO: [(doc:500, score:3.3), (doc:211, score:2.5), ...]
    SN-->>CO: [(doc:991, score:2.9), ...]
    CO->>CO: merge all local top-10 results, re-rank globally
    CO->>CO: final global top-10: [doc:500, doc:42, doc:991, ...]

Each shard returns its local top-K — just doc IDs and scores, not the full documents. The coordinator merges all N_shards × size results by score and identifies the global top-K. Only scores and IDs travel across the network in this phase.

Phase 2: Fetch phase

sequenceDiagram
    participant C as Client
    participant CO as Coordinator
    participant S0 as Shard 0
    participant S1 as Shard 1

    CO->>S0: fetch docs [42]
    CO->>S1: fetch docs [500]
    S0-->>CO: { doc 42: title, body, metadata }
    S1-->>CO: { doc 500: title, body, metadata }
    CO-->>C: final response with ranked docs

Full document content is only fetched for the final top-K results, not for every candidate. This saves enormous bandwidth on large corpora.

The local top-K subtlety

There is a correctness subtlety worth raising in an interview: a shard's local top-K might not contain documents that would be globally relevant.

If shard 0 has 1000 highly relevant documents and shard 1 has 10, asking each shard for top-10 and merging gives a good global answer. But suppose shard 0's 11th-best document is globally better than shard 1's best — it's dropped because we only asked for top-10 per shard.

The fix is to request a larger local K (e.g., size × num_shards), but that balloons the merge cost. In practice, the error is small when documents are distributed randomly — the IDF statistics per shard are similar and scores cluster predictably. You can also use a "DFS" (distributed frequency search) mode where shards first exchange term statistics so each shard scores using global IDF rather than local IDF, improving ranking accuracy at the cost of an extra round-trip.

Replication and the write path

Primary shard receives write
    │
    ├── writes to in-memory buffer + transaction log (WAL)
    │
    ├── forwards operation to each replica
    │       each replica applies the same write
    │       each replica acks
    │
    └── acks client after configurable quorum

When a primary node fails, the cluster manager detects it via heartbeats and promotes one replica to primary within seconds. Search remains available throughout from the remaining replicas. The minimum for HA is 1 primary + 1 replica. Adding a second replica (1 + 2 total) tolerates two simultaneous node failures.

Pagination and the deep paging problem

from=0, size=10 is cheap: each shard returns its top-10, the coordinator takes the best 10 of the merged list.

from=10000, size=10 is expensive: to find results 10,001–10,010, the coordinator must ask each shard for its top 10,010 results, merge them all, and discard the first 10,000. With N shards (even a modest cluster of 50), that's collecting and sorting 50 × 10,010 ≈ 500K results to return 10. At hundreds of shards the coordinator memory and CPU cost becomes prohibitive.

Three ways to handle it:

Cursor / search_after: instead of offset-based pagination, the client passes the last-seen sort key. Each shard returns results after that key, bounded by size. Scales to arbitrary depth with O(size) cost.
Limit max depth: most systems cap from + size at 10,000 (Elasticsearch default). Deep pagination is an anti-pattern for full-text search; if you need it, use cursors.
Scroll API (for bulk export): keeps a consistent snapshot of the index open, then pages through it. Not for real-time UX; for ETL.

Autocomplete (type-ahead) builds on the same infrastructure but uses different index structures. See design search autocomplete for a dedicated treatment. The short version:

Edge n-gram tokens: "cloud" → {"c", "cl", "clo", "clou", "cloud"} indexed separately. A prefix query becomes an exact term match.
Completion suggester (Lucene): a specialized in-memory FST (finite-state transducer) structure that answers prefix queries in O(prefix length) time, much faster than a general inverted index scan.
Autocomplete must be very fast (< 20 ms) because it fires on every keystroke. It typically runs against a separate, smaller "suggestions" index — often restricted to popular queries or document titles only.

Storage choices

Data	Store	Why
Inverted index segments	NVMe SSD local to each node	Random I/O on posting lists; local avoids network round-trips
Transaction log (WAL)	Local disk, durably flushed	Crash recovery; re-play ops lost from in-memory buffer
Document source (_source)	Stored alongside index, compressed	Returned in phase 2 of scatter-gather; optional if not needed
Shard metadata / cluster state	ZooKeeper or Raft-based consensus (etcd)	Needs strong consistency; coordinator reads this to route queries
Index configurations, mappings	Cluster state store	Small, mutates rarely
Cold / archived indices	Object storage (S3-compatible)	Index files are immutable; searchable cold storage via "searchable snapshots"

Failure modes

Failure	Impact	Mitigation
Primary shard node fails	That shard's writes pause; reads served by replica	Cluster manager promotes replica to primary in seconds
Replica node fails	No data loss; reads fall to primary only	Replace node; primary syncs replica automatically
Hot shard (uneven load)	One shard gets 10× queries	Avoid query patterns that always hit the same shard; use `routing` deliberately; scale that shard's replicas
Expensive query (many high-frequency terms)	O(large posting list) intersection; coordinator timeout	Circuit-breaker on query size; cache frequent query results at coordinator; limit result windows
Deep pagination	Coordinator OOM collecting millions of candidates	Cap `from + size`; enforce cursor-based pagination
Indexing lag spike	Documents not appearing for minutes	Monitor segment count + merge queue; reduce refresh rate during bulk loads; scale indexing nodes
Cluster split-brain	Two master candidates each think they're master	Require quorum ≥ ⌊N/2⌋+1 master-eligible nodes for any election (odd number of master nodes)
Index refresh lag under heavy write	Fresh documents miss SLA	Pre-split into more shards; throttle indexing rate; increase refresh interval during batch loads

Things to discuss in an interview

Why an inverted index, not a B-tree or hash table on documents? The inverted index inverts the lookup — term → docs instead of doc → terms. This makes any term lookup O(log V) or O(1) where V is the vocabulary size, not O(N) where N is the corpus size.
How does BM25 improve on TF-IDF? TF saturation (diminishing returns on term frequency) and document length normalization. TF-IDF rewards very long documents that happen to mention a term many times.
Why is scatter-gather used instead of term-based sharding? Term-based sharding would let each shard answer queries independently, but writing a document would touch dozens of shards (one per unique term), making indexing complex and transactionally difficult.
What's the two-phase query-fetch path and why? Phase 1 collects lightweight (doc_id, score) tuples from all shards to find global top-K without network-transferring full documents. Phase 2 fetches only the ~10 winning documents' content.
How do you handle relevance scoring across shards? Normally each shard scores using its own local IDF, which can skew results if the corpus isn't uniformly distributed. DFS mode collects global term statistics first and applies them uniformly.
Trade-off between refresh interval and indexing freshness? Shorter refresh → documents appear sooner, more small segments, more merging overhead. Longer refresh → documents lag, but indexing throughput improves.

Things you should now be able to answer

Why does a LIKE '%term%' query not scale to billions of documents, but an inverted index does?
How do positions in a posting list enable phrase queries?
Why are Lucene segments immutable, and how does this help query performance?
What happens to search availability when a primary shard node fails?
Why does deep pagination (from=100000) cost so much in a distributed search engine?
How is autocomplete different from full-text search at the data structure level?
Why can a local top-K result set miss globally relevant documents, and when does it matter?

Frequently asked questions

▸What is an inverted index and why does it make full-text search tractable at scale?

An inverted index maps each term in the corpus to a sorted posting list of document IDs (and term frequencies and positions) that contain it. Instead of scanning every document for a query term — O(N) with a relational LIKE scan — lookup becomes a single key fetch returning the posting list in O(1) plus O(posting list size). Boolean queries like AND reduce to a two-pointer intersection of two sorted lists, and phrase queries use stored position data to verify adjacency.

▸How does BM25 improve on TF-IDF for relevance scoring?

TF-IDF has two weaknesses BM25 fixes: raw term frequency grows unboundedly, so a document mentioning a term 100 times scores 10x more than one with 10 mentions even though the signal diminishes. BM25 applies a saturation function (controlled by k1, typically 1.2) that caps TF's contribution. It also normalizes TF by document length (controlled by b, typically 0.75), so a long document mentioning a term once is not treated as stronger evidence than a short document doing the same.

▸Why is hash-on-doc-id sharding preferred over term-based sharding in distributed search?

Term-based sharding would let each shard answer queries independently without scatter-gather, but indexing a single document would touch dozens of shards — one per unique term — making writes complex and transactionally painful. Hash on doc_id keeps each shard as a self-contained index over a slice of the document space, making writes simple (one shard per document) at the cost of scatter-gather on queries.

▸Why is deep pagination expensive in a distributed search engine, and how should it be handled?

To serve results at offset 10,000 with page size 10, the coordinator must ask every shard for its top 10,010 results, merge them all, and discard the first 10,000. On a 50-shard cluster that means collecting and sorting roughly 500,000 candidates to return 10. The recommended fix is cursor-based pagination using search_after, which passes the last-seen sort key so each shard returns only results beyond that key at O(size) cost. Elasticsearch caps from + size at 10,000 by default for this reason.

▸What is the difference between a segment refresh and a flush in Lucene-based search engines?

A refresh writes the in-memory buffer to a new immutable segment in the OS filesystem buffer (not fsynced to durable storage) and makes it visible to search — this happens roughly every 1 second by default and is what near-real-time indexing means. A flush commits segments to durable storage via fsync and clears the transaction log; it is more expensive and triggers either every 30 minutes or when the transaction log exceeds 10 GB. Reducing the refresh interval makes documents searchable faster but generates more small segments and increases merge overhead.

← previous

Design a Calendar System (Google Calendar)

Design an Email Service (Gmail)

// RELATED

Design a Distributed Search Engine (Elasticsearch)

The problem

Functional requirements

Non-functional requirements

Capacity estimation

The inverted index — the core data structure

Structure

Analysis pipeline (how text becomes terms)

Boolean query execution

Relevance scoring — TF-IDF and BM25

TF-IDF (the classic baseline)

BM25 (Best Match 25) — the modern default

Building up to the design

V1: One node, one Lucene index

V2: Shard the index

V3: Replicate each shard

V4: Stateless coordinator tier

V5: Production-grade additions

High-level architecture

Sharding strategy

Write path — indexing a document

Near-real-time indexing — segments and the refresh cycle

Query path — scatter-gather

Phase 1: Query phase (scatter, score, collect top-K)

Phase 2: Fetch phase

The local top-K subtlety

Replication and the write path

Storage choices

Failure modes

Things to discuss in an interview

Things you should now be able to answer

Further reading

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform