~/articles/design-vector-database

◆◆◆Advancedasked at Pineconeasked at Googleasked at Metaasked at Microsoftasked at Spotify

Design a Vector Database / Semantic Search Service

Q: Why is exact nearest-neighbor search over 1 billion high-dimensional vectors impractical?

Exact nearest-neighbor search requires computing the distance from the query vector to every vector in the index — O(N × D) work per query. At 1 billion vectors with 768 dimensions, that is roughly 768 billion floating-point multiply-add operations per query. Even on a single GPU executing 20 teraFLOPS, one query takes about 38 milliseconds before any I/O. At 10,000 QPS the compute bill is prohibitive, which is why every production system trades a small amount of recall for orders-of-magnitude speed using Approximate Nearest Neighbor algorithms like HNSW or IVF.

Q: What is HNSW and what are its most important tuning parameters?

HNSW (Hierarchical Navigable Small World) is a graph-based ANN index that organizes vectors into a multi-layer proximity graph. During search it enters at a fixed entry point on the top layer and greedily descends to find the approximate nearest neighbors. The key parameters are M (the maximum number of bidirectional edges per node, typically 16 to 64 — higher M improves recall but increases memory roughly proportionally), efConstruction (the beam width used while building the graph, typically 100 to 400 — larger values improve index quality at the cost of build time), and efSearch (the beam width at query time, typically 64 to 256 — larger values improve recall but increase latency proportionally). efSearch is the most useful production knob because you can raise or lower it per request without rebuilding the index.

Q: How much memory does a 1-billion-vector HNSW index require, and how does quantization help?

Raw float32 vectors at 1 billion × 768 dimensions × 4 bytes consume roughly 3 TB. HNSW adds a neighbor graph — with M=16 and 8-byte node IDs, the bottom layer alone costs about 128 GB and all layers combined add roughly 200 GB (the upper layers form a geometric series that adds ~72 GB on top of the 128 GB bottom layer), bringing the uncompressed total to about 3.2 TB. Product Quantization (PQ) compresses each vector from 768 × 4 bytes to 96 bytes by splitting dimensions into sub-vectors and encoding each with a codebook, reducing the 3 TB vector payload to about 90 GB — a 32x reduction. Combined with the graph links, the PQ-compressed HNSW index totals roughly 290 GB, small enough to distribute across a few dozen 16 GB nodes.

Q: What is the metadata filtering problem in vector search, and what are the three main approaches?

The metadata filtering problem occurs when a query adds a predicate such as tenant=A or date>2024-01-01 alongside the top-k ANN request. Post-filtering (run ANN first, then discard non-matching results) loses recall when the filter is selective. Pre-filtering (build a bitmap of matching doc IDs, then restrict graph traversal) degrades graph quality when the filter matches few nodes. Filtered-HNSW threads through both, annotating each graph edge with metadata and skipping edges whose target does not match the filter. Most production systems use pre-filtering for selectivity above roughly 10 percent of the corpus and fall back to brute force on the filtered set when selectivity drops below 1 percent.

Index 1 billion 768-dimensional vectors and answer top-k similarity queries in under 20 ms — the ANN indexing, sharding, and filtering architecture behind Pinecone, Weaviate, and pgvector.

23 min read2026-06-18Ironclad Academy

#interview #ai #vector-db #ann #embeddings #search #sharding

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Pinecone, Weaviate, Qdrant, and Milvus all exist to answer the same question: given a query vector representing a document, image, or search intent, find the K most semantically similar vectors in a corpus of billions. This is the retrieval step inside every RAG pipeline (see Design a RAG Pipeline), the candidate generation layer of recommendation systems (see Design a Recommendation System), and the core of image and audio search. The key distinction from keyword-based retrieval is that there is no term to look up in an inverted index — the match is purely geometric, in a space of 768 or more dimensions.

The engineering tension is straightforward: exact nearest-neighbor search is O(N × D) per query, where N is the number of indexed vectors and D is the dimensionality. At N = 1 billion and D = 768, a single query touches ~768 billion floating-point operations. FAISS benchmarks on modern hardware put exact search throughput well below 1,000 QPS at this scale — and the cost per query makes it untenable at production traffic. The solution is Approximate Nearest Neighbor (ANN) search: accept a controlled drop in recall (say, 95% recall@10 instead of 100%) to gain 100–1000x lower latency. The rest of this article is about building the infrastructure that wraps that ANN index, makes it filterable, shards it across nodes, keeps it fresh, and doesn't lose data when a node dies.

This article covers the store-and-index layer. If you want to understand how text becomes a vector — embedding models, chunking strategy, retrieval-augmented generation — that is Design a RAG Pipeline. If you want full-text keyword search with inverted indexes and BM25, that is Design a Distributed Search Engine. The two problems overlap at the query layer — hybrid search systems run both in parallel and fuse the results — but the indexing architectures are completely different.

Functional requirements

Upsert: insert or update a vector and its metadata (POST /vectors).
Delete: remove a vector by ID.
Query: given a query vector and K, return the top-K approximate nearest neighbors by cosine similarity, dot product, or L2 distance. Optionally apply a metadata filter.
Fetch: retrieve a vector by ID.
Namespaces: logical partitioning of the index so different tenants or use cases share infrastructure without seeing each other's data.

Non-functional requirements

Recall@10 >= 0.95 at p99 latency < 20 ms for queries.
p99 upsert acknowledged within 100 ms; vector visible in query results within 5 seconds.
99.99% availability for queries; the index must survive a node failure without going dark.
Horizontally scalable to 1 billion+ vectors.
Durable: upserted vectors survive process crashes.

Capacity estimation

Dimension	Estimate	How we got there
Vector count	1 billion	Given (interview hypothesis)
Dimensionality	768	Standard OpenAI/BERT embedding size
Raw vector storage (float32)	~3 TB	`1B × 768 × 4 bytes`
HNSW graph links (M=16, 8-byte IDs)	~200 GB	Bottom layer: `1B × 16 × 8 bytes = 128 GB`; upper layers form a geometric series (each successive layer holds ~1/ln(M) the nodes of the layer below); total ≈ 200 GB
Total without compression	~3.2 TB	`3 TB vectors + ~200 GB graph`
After PQ compression (96 sub-vectors, 8-bit codes)	~90 GB vectors	`1B × 96 bytes`; 32x reduction vs raw float32
Compressed HNSW footprint (PQ vectors + graph links)	~290 GB	`90 GB + ~200 GB`
Memory per shard (16 GB node, 70% usable for index)	~11 GB usable	30% headroom for OS + metadata + writable segment
Primary shards from memory	~26	`290 GB ÷ 11 GB`
QPS per shard (HNSW, efSearch=128)	~150–500	Benchmarked at p50 5–10 ms; degrades to 20 ms p99 above ~300 QPS
Primary shards from QPS (10k QPS, 300 QPS/shard)	~34	`10,000 ÷ 300`
Governing constraint	QPS-dominant	Memory math gives ~26 shards; QPS math gives ~34 shards; QPS is the binding constraint — plan for ~35 primary shards with headroom

Takeaway: memory is the binding cost driver. Product Quantization is not optional at billion-vector scale — it is what makes the index fit in RAM. Every recall vs. memory decision ultimately comes down to how aggressively you configure PQ's codebook size.

Building up to the design

V1: Flat index (brute force)

The simplest correct solution is a flat list of vectors with a linear scan. Every query computes the distance to every stored vector and returns the K smallest. FAISS calls this IndexFlatL2 or IndexFlatIP. It delivers perfect recall by definition, works well up to a few million vectors on a single machine, and is the right choice for evaluation or a cold-start MVP.

The wall appears around 10–50 million vectors. At 10M × 768 floats, even a fully parallelized GPU scan takes ~10 ms per query — acceptable. At 100M vectors you are at 100 ms, and at 1B vectors you are over a second. The flat index does not scale.

V2: HNSW — bring the graph

HNSW (Hierarchical Navigable Small World, Malkov and Yashunin 2018) organizes vectors into a multi-layer proximity graph. The top layer holds a sparse random sample of nodes with long-range edges; each successive layer adds more nodes with shorter-range edges; the bottom layer contains all nodes. A query enters the top layer at a fixed entry point, greedily navigates toward the nearest neighbors, descends to the next layer, and repeats. This greedy descent is O(log N) expected.

flowchart TD
    QUERY[Query vector] --> L2[Layer 2<br/>few nodes, long edges<br/>coarse navigation]
    L2 --> L1[Layer 1<br/>more nodes<br/>finer navigation]
    L1 --> L0[Layer 0<br/>all vectors<br/>final neighborhood search]
    L0 --> TOPK[Top-k results]
    style QUERY fill:#ff6b1a,color:#0a0a0f
    style L2 fill:#a855f7,color:#fff
    style L1 fill:#0e7490,color:#fff
    style L0 fill:#15803d,color:#fff
    style TOPK fill:#ffaa00,color:#0a0a0f

Three parameters control the recall-memory-latency trade-off. M is the maximum number of bidirectional edges per node — M=16 is a common default, M=32 or M=64 improves recall but increases graph memory roughly proportionally. efConstruction is the beam width during index construction; values in the range 100–400 improve index quality at the cost of build time, but do not affect query latency once the graph is built. efSearch is the beam width at query time and is the most important production knob, since you can raise or lower it per request without rebuilding the graph. efSearch=64 typically delivers ~90% recall; efSearch=256 pushes toward 98% at roughly 3–5x the latency cost.

At 1 billion vectors with M=16, the raw float32 payload is 3 TB and the graph adds roughly 200 GB, for a total uncompressed index of about 3.2 TB. That requires ~200 nodes at 16 GB each — workable but expensive. The path forward is compression.

V3: IVF + Product Quantization — make it fit in RAM

Inverted File Index (IVF) partitions the vector space into K clusters using K-means (typically K=4096 or K=65536). Each vector is assigned to its nearest cluster centroid. A query finds the nprobe nearest centroids and scans only those clusters' posting lists — reducing the search space from N to roughly N / K × nprobe vectors. IVF alone often achieves 85–92% recall.

Product Quantization (PQ) compresses vectors by splitting the D dimensions into m sub-vectors of D/m dimensions each, then quantizing each sub-vector to one of 256 codewords via a learned codebook (m here is the PQ sub-vector count — a different parameter from the HNSW edge count M). A 768-dimensional vector is split into 96 sub-vectors of 8 dimensions each; each sub-vector index is 1 byte, so the full vector is 96 bytes rather than 768 × 4 = 3,072 bytes — a 32x compression. Distance computation uses precomputed sub-distance lookup tables, which is cache-friendly and fast.

IVF-PQ combines both: coarse IVF partitioning to narrow the search space, then PQ-compressed distance computation within the probe list. At billion scale, FAISS (Meta) ships this combination with 95%+ recall@10 at ~5 ms query latency on CPU. ScaNN (Google) applies a similar idea with an additional asymmetric quantization step that improves recall at the same bit width, particularly for maximum inner product search.

The practical comparison:

Index type	Recall@10	Query latency	Memory (1B × 768)	Build time
Flat (exact)	100%	~1,000 ms	3 TB	Minutes
HNSW (M=16, ef=128)	~96%	~5–10 ms	~3.2 TB	Hours
IVF-PQ (nprobe=64)	~90–94%	~3–8 ms	~100 GB	Hours
IVF-PQ (nprobe=128)	~93–97%	~8–15 ms	~100 GB	Hours
ScaNN (AH+SQ, reorder=150)	~97–99%	~5–12 ms	~300 GB	Hours
DiskANN (SSD-resident)	~95%	~15–30 ms	~400 GB RAM + SSD	Days

DiskANN (Microsoft Research) is worth singling out for the case where even PQ-compressed HNSW is too large: it stores the full vectors on NVMe SSD and uses a small in-memory PQ-compressed graph (the Vamana algorithm) for fast approximation, then fetches exact vectors from SSD only for the top candidates. SSD IOPS is the bottleneck; performance depends heavily on the storage tier. The appeal is handling datasets 10–100x too large for RAM at ~95% recall and 15–30 ms latency.

For this design, we use HNSW with PQ for sealed immutable segments (best recall per query latency once built) and a flat index for the small in-memory writable segment (tiny enough for brute force to be fast).

V4: Sharding — horizontal scale

Once the index exceeds one machine's RAM, shard the vector space across N nodes. Partition vectors randomly by vector ID (modulo hash) so each shard gets a uniform slice of the corpus. Each shard runs its own HNSW index independently.

A query fans out to every shard (scatter-gather), each shard returns its local top-k results with distances, and the coordinator merges the lists by distance and returns the global top-k.

sequenceDiagram
    participant C as Client
    participant R as Router
    participant S0 as Shard 0
    participant S1 as Shard 1
    participant SN as Shard N
    C->>R: query(vector, k=10, filter)
    R->>S0: local_top_k(vector, k=10, filter)
    R->>S1: local_top_k(vector, k=10, filter)
    R->>SN: local_top_k(vector, k=10, filter)
    S0-->>R: [(id, score), ...]
    S1-->>R: [(id, score), ...]
    SN-->>R: [(id, score), ...]
    R->>R: merge N lists, take global top-10
    R-->>C: top-10 results

The merge is exact: the global top-10 is simply the 10 smallest-distance entries across all the local top-10 lists. No additional approximation is introduced at the merge step — the only approximation is within each shard's ANN search.

Each primary shard has one or more read replicas that serve query scatter. Writes go to the primary and are replicated asynchronously; query reads are load-balanced across primary and replicas. Vector databases tolerate eventual read consistency — a freshly upserted vector visible on the primary but not yet on a replica is fine, as long as it becomes visible within the freshness SLA.

V5: Metadata filtering

This is the most nuanced part of the design. Applications rarely want "the globally nearest 10 vectors with no constraints." More often they want "the nearest 10 vectors in namespace tenant=acme where document_date >= 2024-01-01."

There are three approaches, each with a failure mode.

Post-filtering runs ANN search first, then discards results that do not match the predicate. Simple to implement, but if the filter matches only 0.1% of the corpus, the ANN stage must over-fetch by roughly 1,000x to guarantee finding 10 matching results — it effectively becomes brute force with extra steps.

Pre-filtering builds a bitmap of vector IDs that match the metadata predicate (using a secondary metadata index), then restricts HNSW graph traversal to only visit nodes in that bitmap. The graph was not built with this constraint in mind, so the traversal may get stuck in regions where no bitmap-matching neighbor is nearby. With a highly selective filter (<1% of vectors match), this can push recall below 50%.

Filtered-HNSW annotates each node during traversal: when a candidate's metadata does not match, skip it but continue graph traversal rather than stopping. This is the approach used by Weaviate and Qdrant. It handles moderate filter selectivity (>5% of corpus matches) well but degrades gracefully under high selectivity.

The practical production strategy is a three-tier fallback:

Filter matches >10% of the corpus: use pre-filter + HNSW with the bitmap.
Filter matches 1–10%: use filtered-HNSW (annotated traversal).
Filter matches <1%: fall back to brute-force distance computation over the filtered set. At <1% of 1B vectors, that is at most 10M vectors — fast enough for a 20 ms budget.

flowchart TD
    Q[Query + filter predicate] --> SEL{Estimate filter<br/>selectivity}
    SEL -->|>10% match| PRE[Pre-filter bitmap<br/>+ HNSW traversal]
    SEL -->|1-10% match| FHNSW[Filtered-HNSW<br/>skip non-matching nodes]
    SEL -->|<1% match| BF[Brute force over<br/>filtered set only]
    PRE --> MERGE[Merge top-k]
    FHNSW --> MERGE
    BF --> MERGE
    style SEL fill:#ffaa00,color:#0a0a0f
    style PRE fill:#15803d,color:#fff
    style FHNSW fill:#0e7490,color:#fff
    style BF fill:#ff6b1a,color:#0a0a0f

V6: Freshness — immutable segments and compaction

HNSW graphs are expensive to update incrementally. Inserting a new vector requires navigating the graph to find its nearest neighbors and drawing bidirectional edges — viable for a few thousand inserts but too slow for continuous high-throughput ingestion when you want to rebuild only periodically.

The solution borrows from LSM-tree storage engines: separate fresh writes from the sealed HNSW index using an immutable segment structure.

New upserts land in a writable in-memory segment immediately and are queried with brute force (the segment stays small, so this is fast). Deletes write tombstones here. Every write is also appended to a write-ahead log for crash recovery — acknowledgment to the caller can wait for WAL flush (~100 ms) without waiting for the full index build. When the writable segment reaches a threshold size (e.g., 1M vectors), it is sealed and a background job builds a full HNSW index over it. Sealed segments are immutable and serve queries with full HNSW performance. A background compaction process periodically merges multiple sealed segments, rebuilding a new HNSW index over the combined set and discarding tombstoned vectors. This is exactly the same compaction-on-merge pattern described in LSM Trees vs B-Trees — never update in place; compact on the merge pass.

At query time, a query hits all segments simultaneously: sealed segments via HNSW, the writable in-memory segment via brute force. Results are merged. This is how Pinecone's Pods architecture and Weaviate's LSM-segment design achieve sub-5-second write freshness while maintaining HNSW query performance on the bulk of the corpus.

API

# Upsert vectors
POST /vectors
{ "namespace": "acme", "vectors": [{ "id": "doc:42", "values": [0.1, 0.2, ...], "metadata": { "tenant": "acme", "date": "2024-05-01" } }] }
→ 200 { "upserted_count": 1 }

# Query
POST /query
{ "namespace": "acme", "vector": [0.1, ...], "top_k": 10, "filter": { "tenant": "acme", "date": { "$gte": "2024-01-01" } }, "metric": "cosine" }
→ 200 { "matches": [{ "id": "doc:42", "score": 0.91, "metadata": {...} }] }

# Delete
DELETE /vectors
{ "namespace": "acme", "ids": ["doc:42"] }
→ 200 { "deleted_count": 1 }

The schema

Each shard's metadata store (used for pre-filtering and result decoration):

vectors(
  id          TEXT PRIMARY KEY,
  namespace   TEXT NOT NULL,
  shard_id    INT  NOT NULL,
  metadata    JSONB,               -- indexed for filter predicates
  is_deleted  BOOL DEFAULT false,  -- tombstone
  inserted_at TIMESTAMPTZ
)

-- Payload blobs kept separately (large)
vector_data(
  id          TEXT PRIMARY KEY,
  values_pq   BYTEA,               -- PQ-compressed codes, 96 bytes
  values_raw  BYTEA,               -- optional: full float32 for reranking
)

WAL: append-only log on durable storage (S3 or local NVMe with replicated copies). Each entry is a typed record: {UPSERT, id, values, metadata} or {DELETE, id}. On crash recovery, replay the WAL into the writable in-memory segment.

Architecture

flowchart TD
    CLIENT[Client / Application] --> APIGW[API Gateway<br/>+ Auth]
    APIGW --> ROUTER[Query Router<br/>stateless, reads shard map]
    META[(Shard Map<br/>etcd / ZooKeeper)] --> ROUTER

    ROUTER -->|"scatter to shards"| SH0[Shard 0 Primary]
    ROUTER -->|"scatter to shards"| SH1[Shard 1 Primary]
    ROUTER -->|"scatter to shards"| SHN[Shard N Primary]

    SH0 -.async replicate.-> R0[Shard 0 Replica]
    SH1 -.async replicate.-> R1[Shard 1 Replica]

    subgraph shard_internals["Inside each shard"]
        WS[Writable in-memory segment<br/>brute-force queries]
        WAL_S[WAL on durable storage]
        SEG1[Sealed Segment 1<br/>HNSW + PQ]
        SEG2[Sealed Segment 2<br/>HNSW + PQ]
        COMPACT[Compaction job<br/>background]
        WS --> WAL_S
        WS -->|"seal at threshold"| SEG1
        SEG1 --> COMPACT
        SEG2 --> COMPACT
    end

    INGEST[Ingest API<br/>+ embedding pipeline] -->|upsert / delete| SH0
    INGEST -->|upsert / delete| SH1
    INGEST -->|upsert / delete| SHN

    SH0 --> ROUTER
    SH1 --> ROUTER
    SHN --> ROUTER
    ROUTER -->|merge top-k| CLIENT

    style ROUTER fill:#ff6b1a,color:#0a0a0f
    style SH0 fill:#15803d,color:#fff
    style SH1 fill:#15803d,color:#fff
    style SHN fill:#15803d,color:#fff
    style R0 fill:#a855f7,color:#fff
    style R1 fill:#a855f7,color:#fff
    style META fill:#ffaa00,color:#0a0a0f
    style INGEST fill:#0e7490,color:#fff
    style COMPACT fill:#ff2e88,color:#fff

Deep dive: ANN index families

The choice of ANN index is not one-size-fits-all. The relevant axes are recall, query latency, memory footprint, and build time.

HNSW is the right choice when you can afford the memory and want the best recall-per-latency tradeoff. The graph structure means each query traverses only O(log N) nodes; efSearch is a real-time knob that does not require an index rebuild. It is the dominant algorithm in managed vector databases — Pinecone, Qdrant, and Weaviate all use HNSW variants — because its parameters are intuitive to tune and its recall curve degrades smoothly as efSearch decreases.

IVF-Flat (no PQ) does exact distance computation on a narrowed subset of the corpus. It preserves recall near 100% on the probed clusters at the cost of storing full float32 vectors — roughly 32x more per-vector memory than PQ codes, so about 3 TB for our 1B × 768 corpus. Useful for mid-size corpora (100M vectors, a few hundred GB) where you have RAM to spare and want exact results within the probe window.

IVF-PQ (Jégou et al., 2011) combines IVF coarse partitioning with PQ compression. The recall deficit from PQ is approximately 2–4% at equivalent nprobe settings. The main risk is an under-trained codebook — you need roughly 256× more training vectors than codebook entries (PQ with 256 codewords per sub-quantizer needs ~65k training vectors per sub-space, or ~6.24M total training vectors for 96 sub-quantizers: 65,536 × 96 ≈ 6.3M). This is usually not a problem at billion scale, but matters for smaller datasets where you want to use a fine-grained codebook.

ScaNN (Guo et al., ICML 2020) introduces anisotropic quantization, weighting quantization error more heavily for dimensions that contribute most to inner product. This is measurably better than isotropic PQ for maximum inner product search and is the basis of Google's production embedding retrieval.

DiskANN (Jayaram Subramanya et al., NeurIPS 2019) keeps the full vectors on NVMe SSD and uses a small in-memory PQ-compressed graph (the Vamana index) for approximation, then fetches exact vectors from SSD only for the top candidates. SSD IOPS is the bottleneck; target roughly 100k IOPS per query node. The appeal is handling datasets that genuinely cannot fit in RAM even compressed.

Deep dive: the filtered query hot path

Understanding the per-shard sequence for a filtered query is worth walking through explicitly, because this is where most implementations introduce subtle recall bugs.

sequenceDiagram
    participant C as Client
    participant R as Router
    participant S as Shard (one of N)
    participant META_IDX as Metadata Index
    participant HNSW_IDX as HNSW Graph
    participant WS as Writable Segment
    C->>R: query(vector, k=10, filter={tenant:acme, date>2024})
    R->>S: local_query(vector, k=10, filter)
    S->>META_IDX: build_bitmap(filter={tenant:acme, date>2024})
    META_IDX-->>S: bitmap (2.7M of 27M vectors match = 10%)
    Note over S: selectivity 10% — use pre-filter + HNSW
    S->>HNSW_IDX: search(vector, ef=128, allowed_bitmap)
    HNSW_IDX-->>S: candidate_ids with distances
    S->>WS: brute_force_search(vector, k=10, filter)
    WS-->>S: candidates from fresh writes
    S->>S: merge HNSW + writable candidates, remove tombstones
    S-->>R: local top-10 (id, score)
    R->>R: merge from all shards, global top-10
    R-->>C: results

The metadata bitmap comes from a secondary B-tree or roaring bitmap index on the metadata fields (column store or a Postgres-style JSONB index). A very selective filter — say, tenant=acme where only 0.01% of vectors belong to this tenant — makes the bitmap nearly empty. HNSW traversal constrained to that bitmap explores dead end after dead end; at that point the fallback to brute force on the filtered 0.01% is both faster and more correct.

Edge cases and gotchas

The recall cliff at high filter selectivity. An HNSW graph built over the full corpus assumes uniform access to all nodes. A filter that restricts to 0.1% of nodes means 99.9% of graph edges lead somewhere irrelevant. Recall drops precipitously below about 1% selectivity. The fallback to brute force is not a performance hack — it is the only way to preserve correctness.

Deletes degrade the graph over time. Tombstoning keeps the node in the graph. Future traversals still visit it, wasting beam width on a node whose result will be filtered. Over time this reduces effective graph connectivity and degrades recall. Compaction (rebuilding the segment without tombstoned vectors) is the only fix. Monitor tombstone ratio per segment and trigger compaction when it exceeds ~10–15%.

Compaction memory spikes. Rebuilding an HNSW graph over 100M vectors takes 2–4 hours with efConstruction=200. During compaction the old segment must remain live for queries, roughly doubling that shard's memory temporarily. Build compaction headroom into your shard sizing — do not let shards run at 90% memory utilization.

New shard cold start. When you add a shard to scale out, it is initially empty. Using consistent hashing for the shard map makes re-sharding less disruptive than naive modulo partitioning, but the new shard still needs to receive migrated vector data before it can contribute to queries. During migration, keep both old and new shards in the scatter-gather pool so results remain correct.

Rebuild vs. incremental insert. The hnswlib implementation supports incremental inserts (necessary for the writable segment path), but a graph built incrementally has lower quality than one built in a single batch over the complete dataset — edges are drawn based on the partial graph state at insert time rather than the final distribution. At segment seal time, building from scratch yields a better index than inheriting the incrementally grown structure. Accept the build cost.

Distance metric mismatch. Vectors from OpenAI text-embedding-3 are unit-normalized; for unit-normalized vectors, cosine similarity and dot product give identical rankings. Mixing metrics between indexing and querying produces silently wrong results. Record the distance metric at namespace creation time and enforce it on every query.

Trade-offs to discuss in an interview

efSearch: recall vs. latency. efSearch is a per-query knob — raising it from 64 to 256 typically lifts recall from ~90% to ~97% but triples query time. The right value depends on the application: a Q&A system might accept 30 ms for 97% recall; an autocomplete surface needs 5 ms and accepts 90% recall. The fact that you can tune this per query without a rebuild is one of HNSW's biggest operational advantages.

PQ compression: memory vs. recall. Compressing from float32 to 8-bit PQ codes saves ~32x memory but loses 2–4% recall at matched efSearch. For applications where the embedding model itself produces noisy representations (e.g., very short documents), 2–4% recall loss is irrelevant. For precise semantic search over long documents, it may matter. The standard mitigation is a rerank pass: fetch the top-50 ANN candidates, compute exact float32 distances for them, and return the true top-10. This recovers most of the recall deficit at the cost of fetching ~50 raw vectors per query.

M and efConstruction: build quality vs. build time and memory. Higher M produces a denser graph with better recall but more memory per node. M=64 with efConstruction=400 produces near-perfect recall but uses 4x the graph memory of M=16 and takes 2–3x longer to build. Find the M that gives recall ≥ your target and leave it there — there is no free lunch.

Immutable segments vs. mutable HNSW. The segment-based approach adds query-time complexity (merge from multiple segments) and periodic compaction cost. The alternative is a single mutable HNSW with incremental inserts, which avoids compaction but produces a suboptimal graph and creates reader/writer contention on a central structure. The segment approach trades operational complexity for clean read/write separation and predictable query latency.

Replication consistency. Managed vector databases like Pinecone use async replication. A freshly upserted vector might not appear on a replica for 1–5 seconds. Applications that need read-after-write guarantees — "I just uploaded a document, now let me search for it" — should either query the primary directly or implement client-side consistency tokens that route to the primary until the write propagates.

Things you should now be able to answer

Why is exact nearest-neighbor search infeasible at 1 billion vectors, and what recall-latency tradeoff does ANN make?
What is HNSW, what are the roles of M, efConstruction, and efSearch, and why is efSearch the most useful per-query knob in production?
How does Product Quantization compress a 768-dimensional float32 vector to 96 bytes, and what are the risks of an under-trained codebook?
Why does metadata post-filtering fail on selective predicates, and what is the three-tier filtering fallback strategy?
How does the immutable-segment + WAL write path achieve write freshness without sacrificing HNSW query performance?
What happens to recall when tombstones accumulate, and why is compaction the only correct fix?
How does scatter-gather work across N shards, and is there an approximation error introduced at the merge step?
When should you choose DiskANN over HNSW+PQ, and what is the limiting resource in each case?
How does vector search fit into a RAG pipeline, and where does the vector database's responsibility end?

Frequently asked questions

▸Why is exact nearest-neighbor search over 1 billion high-dimensional vectors impractical?

Exact nearest-neighbor search requires computing the distance from the query vector to every vector in the index — O(N × D) work per query. At 1 billion vectors with 768 dimensions, that is roughly 768 billion floating-point multiply-add operations per query. Even on a single GPU executing 20 teraFLOPS, one query takes about 38 milliseconds before any I/O. At 10,000 QPS the compute bill is prohibitive, which is why every production system trades a small amount of recall for orders-of-magnitude speed using Approximate Nearest Neighbor algorithms like HNSW or IVF.

▸What is HNSW and what are its most important tuning parameters?

HNSW (Hierarchical Navigable Small World) is a graph-based ANN index that organizes vectors into a multi-layer proximity graph. During search it enters at a fixed entry point on the top layer and greedily descends to find the approximate nearest neighbors. The key parameters are M (the maximum number of bidirectional edges per node, typically 16 to 64 — higher M improves recall but increases memory roughly proportionally), efConstruction (the beam width used while building the graph, typically 100 to 400 — larger values improve index quality at the cost of build time), and efSearch (the beam width at query time, typically 64 to 256 — larger values improve recall but increase latency proportionally). efSearch is the most useful production knob because you can raise or lower it per request without rebuilding the index.

▸How much memory does a 1-billion-vector HNSW index require, and how does quantization help?

Raw float32 vectors at 1 billion × 768 dimensions × 4 bytes consume roughly 3 TB. HNSW adds a neighbor graph — with M=16 and 8-byte node IDs, the bottom layer alone costs about 128 GB and all layers combined add roughly 200 GB (the upper layers form a geometric series that adds ~72 GB on top of the 128 GB bottom layer), bringing the uncompressed total to about 3.2 TB. Product Quantization (PQ) compresses each vector from 768 × 4 bytes to 96 bytes by splitting dimensions into sub-vectors and encoding each with a codebook, reducing the 3 TB vector payload to about 90 GB — a 32x reduction. Combined with the graph links, the PQ-compressed HNSW index totals roughly 290 GB, small enough to distribute across a few dozen 16 GB nodes.

▸What is the metadata filtering problem in vector search, and what are the three main approaches?

The metadata filtering problem occurs when a query adds a predicate such as tenant=A or date>2024-01-01 alongside the top-k ANN request. Post-filtering (run ANN first, then discard non-matching results) loses recall when the filter is selective. Pre-filtering (build a bitmap of matching doc IDs, then restrict graph traversal) degrades graph quality when the filter matches few nodes. Filtered-HNSW threads through both, annotating each graph edge with metadata and skipping edges whose target does not match the filter. Most production systems use pre-filtering for selectivity above roughly 10 percent of the corpus and fall back to brute force on the filtered set when selectivity drops below 1 percent.

▸How do vector databases handle deletes without corrupting the HNSW graph?

HNSW graphs are built by inserting nodes and drawing bidirectional edges; removing a node while leaving its neighbors coherent requires expensive graph surgery. Production systems instead use tombstones — the deleted vector is marked as deleted in a metadata store but its graph node remains. At query time, tombstoned results are filtered out of the top-k response. Background compaction periodically rebuilds sealed segments from scratch, excluding tombstoned vectors and reclaiming their memory. This is the same pattern LSM-tree storage engines use for deletes, as described in [LSM Trees vs B-Trees](/articles/lsm-trees-vs-b-trees).

← previous

The ML / GenAI System Design Interview Framework

Design a RAG (Retrieval-Augmented Generation) Pipeline

// RELATED