Design a Distributed Search Engine (Elasticsearch)
Index billions of documents and answer full-text queries in milliseconds. Inverted indexes, sharding + replication, scatter-gather, and relevance scoring.
The problem
Elasticsearch powers the search bar at GitHub, the log aggregation at Uber, and the product discovery at Shopify. The common thread: a corpus too large for any single machine, users who expect answers in under 100 milliseconds, and an index that must stay fresh as documents arrive by the thousands every second.
A WHERE body LIKE '%kubernetes%' scan reads every row in the table — O(N) per query. At a billion documents that takes minutes, not milliseconds. The inverted index solves this by flipping the access pattern: instead of asking "which terms appear in this document?", you pre-build a map of "which documents contain this term?" Term lookup becomes a single key fetch returning a sorted posting list, and boolean queries (AND, OR, NOT) reduce to fast list intersections and unions. That is the core data structure that makes full-text search tractable.
The distributed version of this problem introduces a second layer of complexity. A single machine cannot hold the inverted index for 100 TB of raw text — so the index must be sharded across hundreds of nodes. Every query then needs to fan out to all shards in parallel, collect each shard's local top results, and merge them into a globally correct ranking before returning. This scatter-gather pattern is both the source of the system's horizontal scale and the root of its hardest trade-offs: relevance accuracy across shards with different term statistics, deep-pagination costs that grow with shard count, and near-real-time indexing lag driven by Lucene's segment-based write path.
This article is about indexing and querying a large document corpus — not crawling the web. For web-scale crawling, see design a web crawler. For the typeahead autocomplete problem, see design search autocomplete.
Functional requirements
POST /index— accept a document (id, title, body, metadata), analyze it, and add it to the index.GET /search?q=<query>&from=0&size=10— return ranked results for a free-text query.- Support boolean expressions (AND / OR / NOT) and phrase queries.
- (Optional) Faceted filtering, range queries on numeric fields, aggregations.
- (Optional) Autocomplete / prefix matching.
Non-functional requirements
- Low query latency — p99 < 100 ms for most queries.
- Near-real-time indexing — a newly indexed document should appear in search within ~30 seconds (ideally ~1 second).
- High availability — a single node failure must not take search offline.
- Horizontal scale — both corpus size and QPS must scale by adding nodes.
- Relevance — results must be ordered by how well they match the query, not arbitrarily.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Documents | 10 billion | Given |
| Avg document size | 10 KB | Title + body + metadata |
| Raw text total | 100 TB | 10B × 10 KB |
| Index storage (typical) | ~100 TB | Elasticsearch stores inverted index + _source by default; indexed-to-raw ratio ranges 0.57–1.39× depending on compression codec, field mappings, and doc value settings; conservative baseline is ~1× raw |
| Index storage range | 50–130 TB | Best case (best_compression, _source disabled): ~50%; default mixed workload: ~100–130% |
| Target shard size | 50 GB | Fits comfortably in a mid-range instance's RAM + NVMe SSD |
| Shard count (full corpus) | ~2,000 primary shards | 100 TB ÷ 50 GB |
| Total shard copies (1 primary + 1 replica, 2 total copies) | ~4,000 | 2,000 × 2 |
| Nodes (full corpus, 500 GB usable NVMe each) | ~400 nodes | 4,000 ÷ (500 GB ÷ 50 GB) |
| Hot-tier practical cluster size | ~40 nodes | Shrink corpus to 10 TB of hot/recent data or apply tiered storage |
| Write throughput | 5,000 indexing events/sec | Given; each doc passes tokenize → lowercase → stem → stop-word filter; I/O + CPU bound |
| Query throughput | 50,000 QPS | Given; coordinator is stateless and scales horizontally; tiered deployment queries hot shards only |
Takeaway: At 100 TB raw corpus a full production cluster needs ~400 nodes; shard count is fixed at index creation, so plan for 3–5× expected data size up front to avoid a costly full re-index as the corpus grows.
The inverted index — the core data structure
A relational WHERE body LIKE '%kubernetes%' performs a full table scan — every row is read. That's O(N) per query. For billions of documents, it's completely impractical.
The inverted index flips the access pattern. During indexing, for every term in every document, we record the document ID (and optionally the term's positions within the document). At query time, looking up a term is a single key lookup into a hash or B-tree, returning the posting list of all matching doc IDs in O(1) + O(|posting list|).
Structure
term → posting list
"kubernetes" → [(doc:1, tf:3, pos:[4,17,90]), (doc:5, tf:1, pos:[2]), (doc:9, tf:2, pos:[11,44])]
"docker" → [(doc:1, tf:1, pos:[6]), (doc:3, tf:4, pos:[1,8,22,31]), ...]
"container" → [(doc:1, tf:2, pos:[7,91]), (doc:3, tf:1, pos:[9]), ...]
Each entry in the posting list stores at minimum the doc_id. It typically also stores:
- Term frequency (tf): how many times the term appears in the document. Used in relevance scoring.
- Positions: the offsets of each occurrence within the document. Required for phrase queries (e.g.
"kubernetes cluster"— the two terms must appear adjacent).
Posting lists are stored in sorted order by doc_id. This makes boolean operations efficient.
Analysis pipeline (how text becomes terms)
Before a document reaches the inverted index, it goes through an analysis pipeline:
raw text → tokenizer → lowercase filter → stop-word filter → stemmer → terms
"Running Kubernetes Containers in the Cloud"
→ tokenize : ["Running", "Kubernetes", "Containers", "in", "the", "Cloud"]
→ lowercase : ["running", "kubernetes", "containers", "in", "the", "cloud"]
→ stop-words : ["running", "kubernetes", "containers", "cloud"] (dropped "in", "the")
→ stem : ["run", "kubernete", "container", "cloud"]
The same pipeline runs on both documents (at index time) and queries (at query time). If you stem "containers" to "container" at index time, you must also stem the query term so they match.
Stop-words (common words like "the", "in", "a") are dropped because they'd appear in nearly every document, bloating the posting list without improving relevance.
flowchart LR
RAW["Raw text<br/>e.g. 'Running Kubernetes Containers'"] --> TOK[Tokenizer]
TOK --> LC[Lowercase filter]
LC --> SW["Stop-word filter<br/>(drop 'the', 'in', 'a')"]
SW --> STEM["Stemmer<br/>'containers' → 'container'"]
STEM --> IDX[(Inverted Index<br/>term → posting list)]
style TOK fill:#0e7490,color:#fff
style STEM fill:#ff6b1a,color:#0a0a0f
style IDX fill:#15803d,color:#fff
The same pipeline runs at query time — so when a user searches "Containers", it gets stemmed to "container" and matches the same posting list entry created during indexing.
Boolean query execution
AND query (kubernetes AND docker): intersect the two posting lists. Since both are sorted by doc_id, a two-pointer merge runs in O(|A| + |B|):
A: [1, 3, 5, 9, 14]
B: [1, 3, 7, 14, 20]
result: [1, 3, 14] ← docs containing both terms
OR query (kubernetes OR docker): union of posting lists. Same merge, take all IDs from either.
Phrase query ("kubernetes cluster"): first intersect like AND to find docs with both terms, then for each candidate, check the position lists — the position of "cluster" must equal position of "kubernetes" + 1 (adjacent in the document). This is why position data is stored.
NOT (kubernetes NOT docker): compute the AND of kubernetes posting list with the complement of docker. In practice, NOT is computed as a filter applied after retrieval, not by materializing a giant "docs without docker" list.
Relevance scoring — TF-IDF and BM25
Getting the right results is necessary; returning them in the right order is what makes search feel good.
TF-IDF (the classic baseline)
The intuition: a term is relevant to a document if it appears often in that document (term frequency, TF) but rarely across all documents (inverse document frequency, IDF). A word like "the" has high TF everywhere and nearly zero IDF — useless as a signal. A word like "photolithography" is rare and meaningful.
TF(t, d) = count of term t in document d / total terms in document d
IDF(t) = log(N / df(t))
N = total documents in the corpus
df(t) = documents containing term t
Score(t, d) = TF(t, d) × IDF(t)
For multi-term queries: Score(q, d) = Σ TF-IDF(t, d) for each term t in query
BM25 (Best Match 25) — the modern default
BM25 is the successor to TF-IDF used in most production search engines (including Lucene/Elasticsearch's default scorer). It fixes two weaknesses in TF-IDF.
First, raw TF grows unboundedly. A document with a term 100 times shouldn't score 10× more than one with 10 occurrences — returns diminish. BM25 applies a saturation function that caps TF's contribution. Second, a long document containing a term once is weaker evidence than a short document doing the same. BM25 normalizes TF by document length.
BM25(t, d) = IDF(t) × [ TF(t,d) × (k1 + 1) ]
--------------------------
[ TF(t,d) + k1 × (1 - b + b × dl/avgdl) ]
k1 ∈ [1.2, 2.0] — controls TF saturation
b ∈ [0, 1] — controls length normalization (0 = off, 1 = full)
dl = document length
avgdl = average document length in the corpus
Typical defaults: k1 = 1.2, b = 0.75. These are tunable. The score for a multi-term query sums the BM25 contributions of each term, and documents are ranked descending by that score.
Building up to the design
V1: One node, one Lucene index
Start with a single machine running a single Apache Lucene index. You get full inverted-index semantics, BM25 scoring, near-real-time refresh — a working search engine on one box that handles a few hundred GB of documents comfortably.
The wall you hit is storage. At 100 TB of raw corpus, the total on-disk index footprint is roughly 100 TB or more (Elasticsearch stores the inverted index plus the original document source by default). A single machine can't hold that — even an i3en.24xlarge tops out at 60 TB NVMe and 768 GB RAM. You have to split the index.
V2: Shard the index
Partition documents across N shards. Each shard is a complete, self-contained inverted index over its slice of the document space. A query fans out to all shards in parallel, each scores locally, and the coordinator merges the results. Now corpus size scales horizontally, and so does read throughput.
The next fragility: each shard has exactly one copy. If that node goes down, those documents are temporarily unsearchable.
V3: Replicate each shard
Give every shard a primary and one or more replicas. Writes go to the primary; the primary replicates to replicas before acknowledging. Reads can go to either. Replica promotion handles primary failures — the cluster manager detects a lost node and promotes a replica within seconds. You also get read QPS scale-out as a bonus.
At this point the coordinator — the node that receives queries and fans them out — has become a potential bottleneck and single point of failure.
V4: Stateless coordinator tier
The coordinator holds no index data. It reads the shard map from a metadata store (ZooKeeper or etcd), routes queries to the right shards, and merges results. Because it's stateless, you run as many as you need behind a load balancer.
V5: Production-grade additions
Segment merging, field-level index types, multi-tenancy (multiple indices), warm/cold storage tiering for old documents. This is where Elasticsearch and OpenSearch live.
flowchart LR
V1["V1: single Lucene node<br/>GB scale"] --> V2["V2: shard across nodes<br/>TB scale, no HA"]
V2 --> V3["V3: + replication<br/>fault tolerant"]
V3 --> V4["V4: + stateless coordinator<br/>no coordinator SPOF"]
V4 --> V5["V5: + segments, merging, tiering<br/>production search cluster"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
CLI[Client] --> LB[Load Balancer]
LB --> COORD1[Coordinator / Search Node]
LB --> COORD2[Coordinator / Search Node]
COORD1 -->|scatter| SH0P[Shard 0 Primary]
COORD1 -->|scatter| SH1P[Shard 1 Primary]
COORD1 -->|scatter| SH2P["Shard 2–N Primary"]
SH0P -.replicate.-> SH0R[Shard 0 Replica]
SH1P -.replicate.-> SH1R[Shard 1 Replica]
ING[Indexing Service] -->|write docs| SH0P
ING -->|write docs| SH1P
ING -->|write docs| SH2P
META[(Shard Metadata<br/>ZooKeeper / etcd)] --> COORD1
META --> COORD2
style COORD1 fill:#ff6b1a,color:#0a0a0f
style COORD2 fill:#ff6b1a,color:#0a0a0f
style SH0P fill:#15803d,color:#fff
style SH1P fill:#15803d,color:#fff
style SH2P fill:#15803d,color:#fff
style SH0R fill:#a855f7,color:#fff
style SH1R fill:#a855f7,color:#fff
style ING fill:#0e7490,color:#fff
style META fill:#ffaa00,color:#0a0a0f
Two coordinator nodes sit behind the load balancer — stateless, horizontally scalable. Each knows the shard map from ZooKeeper/etcd and routes accordingly. Green nodes are primaries; purple are replicas. The indexing service writes to primaries and lets replication take care of the rest.
Sharding strategy
Each shard is a self-contained inverted index covering a subset of documents. The two main strategies:
| Strategy | How | Trade-offs |
|---|---|---|
Hash on doc_id | shard = hash(doc_id) % N | Even distribution; random access to any doc by ID; simple to implement |
Range on doc_id | Shard 0 owns IDs 0–99M, shard 1 owns 100M–199M | Ordered scans; potential hot shards if recent docs are accessed more |
| Term-based | Each shard owns a disjoint set of terms | No scatter-gather needed; writing a doc touches multiple shards; very complex |
Hash on doc_id is the right default. Term-based sharding is tempting because it would let each shard answer a query independently — no scatter-gather needed — but writing a single document would touch dozens of shards (one per unique term), making indexing complex and transactionally painful.
Fixed shard count: the number of shards is typically fixed at index creation. Elasticsearch defaults to 1 primary shard per index since version 7.0 (it was 5 in older versions). Changing the shard count requires re-indexing — expensive. Plan for 3–5× expected data size so you don't need to re-shard for years.
Write path — indexing a document
sequenceDiagram
participant C as Client
participant ING as Indexing Service
participant PRI as Primary Shard
participant REP as Replica Shard
C->>ING: POST /index { id, title, body }
ING->>ING: analyze (tokenize, lowercase, stem)
ING->>PRI: write analyzed doc to segment
PRI->>REP: replicate segment operation
REP-->>PRI: ack
PRI-->>ING: ack
ING-->>C: 200 OK (indexed, visible at next refresh)
A few things to notice here. The primary writes to an in-memory buffer first, then flushes to a new segment on disk. Replication happens at the operation level — the replica replays the same write. The client's acknowledgment waits for the replica to confirm, which means a slow replica raises write latency. Elasticsearch exposes wait_for_active_shards to tune this: acknowledge after just the primary (faster, less durable) or after N replicas (slower, more durable).
Near-real-time indexing — segments and the refresh cycle
Lucene-style engines use immutable segments as the on-disk unit. A segment is a mini inverted index written once and never modified. Documents are "deleted" by a tombstone, not by in-place removal. Segments accumulate and are periodically merged by a background process.
in-memory buffer
│
[refresh ~1s]
↓
┌── segment 1 (on disk, searchable) ──┐
├── segment 2 (on disk, searchable) ──┤
├── segment 3 (on disk, searchable) ──┤
└── segment 4 (newly flushed) ┘
│
[merge background]
↓
└── merged segment (fewer, larger)
Why immutable? The OS page cache serves read-heavy workloads on immutable files with no locking complexity during queries. Merging is a background copy operation; search continues without interruption.
The refresh interval defaults to once per second in Elasticsearch. After a refresh, the newly written segment becomes visible to search — this is what "near-real-time" (~1 s latency) means. A flush is different: it commits segments to durable storage via fsync and clears the transaction log. Refreshes are cheap and happen every ~1 s; flushes are more expensive and are triggered either every ~30 minutes (by default) or when the transaction log grows past 10 GB — whichever comes first.
The trade-off is straightforward: reduce the refresh interval and documents appear faster, but you generate more small segments and drive more merging work. Increase it to 30 s during bulk loads and indexing throughput improves significantly, at the cost of delayed searchability.
Query path — scatter-gather
The coordinator receives a query and executes it in two phases.
Phase 1: Query phase (scatter, score, collect top-K)
sequenceDiagram
participant C as Client
participant CO as Coordinator
participant S0 as Shard 0
participant S1 as Shard 1
participant SN as "Shard N"
C->>CO: GET /search?q=kubernetes+docker&size=10
CO->>S0: query("kubernetes AND docker", size=10)
CO->>S1: query("kubernetes AND docker", size=10)
CO->>SN: query("kubernetes AND docker", size=10)
S0-->>CO: [(doc:42, score:3.1), (doc:101, score:2.8), ...]
S1-->>CO: [(doc:500, score:3.3), (doc:211, score:2.5), ...]
SN-->>CO: [(doc:991, score:2.9), ...]
CO->>CO: merge all local top-10 results, re-rank globally
CO->>CO: final global top-10: [doc:500, doc:42, doc:991, ...]
Each shard returns its local top-K — just doc IDs and scores, not the full documents. The coordinator merges all N_shards × size results by score and identifies the global top-K. Only scores and IDs travel across the network in this phase.
Phase 2: Fetch phase
sequenceDiagram
participant C as Client
participant CO as Coordinator
participant S0 as Shard 0
participant S1 as Shard 1
CO->>S0: fetch docs [42]
CO->>S1: fetch docs [500]
S0-->>CO: { doc 42: title, body, metadata }
S1-->>CO: { doc 500: title, body, metadata }
CO-->>C: final response with ranked docs
Full document content is only fetched for the final top-K results, not for every candidate. This saves enormous bandwidth on large corpora.
The local top-K subtlety
There is a correctness subtlety worth raising in an interview: a shard's local top-K might not contain documents that would be globally relevant.
If shard 0 has 1000 highly relevant documents and shard 1 has 10, asking each shard for top-10 and merging gives a good global answer. But suppose shard 0's 11th-best document is globally better than shard 1's best — it's dropped because we only asked for top-10 per shard.
The fix is to request a larger local K (e.g., size × num_shards), but that balloons the merge cost. In practice, the error is small when documents are distributed randomly — the IDF statistics per shard are similar and scores cluster predictably. You can also use a "DFS" (distributed frequency search) mode where shards first exchange term statistics so each shard scores using global IDF rather than local IDF, improving ranking accuracy at the cost of an extra round-trip.
Replication and the write path
Primary shard receives write
│
├── writes to in-memory buffer + transaction log (WAL)
│
├── forwards operation to each replica
│ each replica applies the same write
│ each replica acks
│
└── acks client after configurable quorum
When a primary node fails, the cluster manager detects it via heartbeats and promotes one replica to primary within seconds. Search remains available throughout from the remaining replicas. The minimum for HA is 1 primary + 1 replica. Adding a second replica (1 + 2 total) tolerates two simultaneous node failures.
Pagination and the deep paging problem
from=0, size=10 is cheap: each shard returns its top-10, the coordinator takes the best 10 of the merged list.
from=10000, size=10 is expensive: to find results 10,001–10,010, the coordinator must ask each shard for its top 10,010 results, merge them all, and discard the first 10,000. With N shards (even a modest cluster of 50), that's collecting and sorting 50 × 10,010 ≈ 500K results to return 10. At hundreds of shards the coordinator memory and CPU cost becomes prohibitive.
Three ways to handle it:
- Cursor / search_after: instead of offset-based pagination, the client passes the last-seen sort key. Each shard returns results after that key, bounded by
size. Scales to arbitrary depth with O(size) cost. - Limit max depth: most systems cap
from + sizeat 10,000 (Elasticsearch default). Deep pagination is an anti-pattern for full-text search; if you need it, use cursors. - Scroll API (for bulk export): keeps a consistent snapshot of the index open, then pages through it. Not for real-time UX; for ETL.
Autocomplete as a related feature
Autocomplete (type-ahead) builds on the same infrastructure but uses different index structures. See design search autocomplete for a dedicated treatment. The short version:
- Edge n-gram tokens: "cloud" → {"c", "cl", "clo", "clou", "cloud"} indexed separately. A prefix query becomes an exact term match.
- Completion suggester (Lucene): a specialized in-memory FST (finite-state transducer) structure that answers prefix queries in O(prefix length) time, much faster than a general inverted index scan.
- Autocomplete must be very fast (< 20 ms) because it fires on every keystroke. It typically runs against a separate, smaller "suggestions" index — often restricted to popular queries or document titles only.
Storage choices
| Data | Store | Why |
|---|---|---|
| Inverted index segments | NVMe SSD local to each node | Random I/O on posting lists; local avoids network round-trips |
| Transaction log (WAL) | Local disk, durably flushed | Crash recovery; re-play ops lost from in-memory buffer |
| Document source (_source) | Stored alongside index, compressed | Returned in phase 2 of scatter-gather; optional if not needed |
| Shard metadata / cluster state | ZooKeeper or Raft-based consensus (etcd) | Needs strong consistency; coordinator reads this to route queries |
| Index configurations, mappings | Cluster state store | Small, mutates rarely |
| Cold / archived indices | Object storage (S3-compatible) | Index files are immutable; searchable cold storage via "searchable snapshots" |
Failure modes
| Failure | Impact | Mitigation |
|---|---|---|
| Primary shard node fails | That shard's writes pause; reads served by replica | Cluster manager promotes replica to primary in seconds |
| Replica node fails | No data loss; reads fall to primary only | Replace node; primary syncs replica automatically |
| Hot shard (uneven load) | One shard gets 10× queries | Avoid query patterns that always hit the same shard; use routing deliberately; scale that shard's replicas |
| Expensive query (many high-frequency terms) | O(large posting list) intersection; coordinator timeout | Circuit-breaker on query size; cache frequent query results at coordinator; limit result windows |
| Deep pagination | Coordinator OOM collecting millions of candidates | Cap from + size; enforce cursor-based pagination |
| Indexing lag spike | Documents not appearing for minutes | Monitor segment count + merge queue; reduce refresh rate during bulk loads; scale indexing nodes |
| Cluster split-brain | Two master candidates each think they're master | Require quorum ≥ ⌊N/2⌋+1 master-eligible nodes for any election (odd number of master nodes) |
| Index refresh lag under heavy write | Fresh documents miss SLA | Pre-split into more shards; throttle indexing rate; increase refresh interval during batch loads |
Things to discuss in an interview
- Why an inverted index, not a B-tree or hash table on documents? The inverted index inverts the lookup — term → docs instead of doc → terms. This makes any term lookup O(log V) or O(1) where V is the vocabulary size, not O(N) where N is the corpus size.
- How does BM25 improve on TF-IDF? TF saturation (diminishing returns on term frequency) and document length normalization. TF-IDF rewards very long documents that happen to mention a term many times.
- Why is scatter-gather used instead of term-based sharding? Term-based sharding would let each shard answer queries independently, but writing a document would touch dozens of shards (one per unique term), making indexing complex and transactionally difficult.
- What's the two-phase query-fetch path and why? Phase 1 collects lightweight (doc_id, score) tuples from all shards to find global top-K without network-transferring full documents. Phase 2 fetches only the ~10 winning documents' content.
- How do you handle relevance scoring across shards? Normally each shard scores using its own local IDF, which can skew results if the corpus isn't uniformly distributed. DFS mode collects global term statistics first and applies them uniformly.
- Trade-off between refresh interval and indexing freshness? Shorter refresh → documents appear sooner, more small segments, more merging overhead. Longer refresh → documents lag, but indexing throughput improves.
Things you should now be able to answer
- Why does a
LIKE '%term%'query not scale to billions of documents, but an inverted index does? - How do positions in a posting list enable phrase queries?
- Why are Lucene segments immutable, and how does this help query performance?
- What happens to search availability when a primary shard node fails?
- Why does deep pagination (
from=100000) cost so much in a distributed search engine? - How is autocomplete different from full-text search at the data structure level?
- Why can a local top-K result set miss globally relevant documents, and when does it matter?
Further reading
- Lucene's documentation and source — the reference implementation of segment-based inverted indexing.
- "Introduction to Information Retrieval" — Manning, Raghavan, Schutze (freely available online) — canonical textbook on IR, TF-IDF, BM25, and Boolean retrieval.
- Elasticsearch documentation on "Inverted index" and "Relevance scoring" — practical applied material.
- Doug Cutting and Mike Cafarella's work on Nutch and early Hadoop/Lucene — historical context for distributed IR.
- Design search autocomplete — related problem: prefix-aware suggestions.
- Design a web crawler — how to build the corpus this engine indexes.
Frequently asked questions
▸What is an inverted index and why does it make full-text search tractable at scale?
An inverted index maps each term in the corpus to a sorted posting list of document IDs (and term frequencies and positions) that contain it. Instead of scanning every document for a query term — O(N) with a relational LIKE scan — lookup becomes a single key fetch returning the posting list in O(1) plus O(posting list size). Boolean queries like AND reduce to a two-pointer intersection of two sorted lists, and phrase queries use stored position data to verify adjacency.
▸How does BM25 improve on TF-IDF for relevance scoring?
TF-IDF has two weaknesses BM25 fixes: raw term frequency grows unboundedly, so a document mentioning a term 100 times scores 10x more than one with 10 mentions even though the signal diminishes. BM25 applies a saturation function (controlled by k1, typically 1.2) that caps TF's contribution. It also normalizes TF by document length (controlled by b, typically 0.75), so a long document mentioning a term once is not treated as stronger evidence than a short document doing the same.
▸Why is hash-on-doc-id sharding preferred over term-based sharding in distributed search?
Term-based sharding would let each shard answer queries independently without scatter-gather, but indexing a single document would touch dozens of shards — one per unique term — making writes complex and transactionally painful. Hash on doc_id keeps each shard as a self-contained index over a slice of the document space, making writes simple (one shard per document) at the cost of scatter-gather on queries.
▸Why is deep pagination expensive in a distributed search engine, and how should it be handled?
To serve results at offset 10,000 with page size 10, the coordinator must ask every shard for its top 10,010 results, merge them all, and discard the first 10,000. On a 50-shard cluster that means collecting and sorting roughly 500,000 candidates to return 10. The recommended fix is cursor-based pagination using search_after, which passes the last-seen sort key so each shard returns only results beyond that key at O(size) cost. Elasticsearch caps from + size at 10,000 by default for this reason.
▸What is the difference between a segment refresh and a flush in Lucene-based search engines?
A refresh writes the in-memory buffer to a new immutable segment in the OS filesystem buffer (not fsynced to durable storage) and makes it visible to search — this happens roughly every 1 second by default and is what near-real-time indexing means. A flush commits segments to durable storage via fsync and clears the transaction log; it is more expensive and triggers either every 30 minutes or when the transaction log exceeds 10 GB. Reducing the refresh interval makes documents searchable faster but generates more small segments and increases merge overhead.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.