Advanced · asked at Microsoft, Neo4j, LlamaIndex, Google

Design a GraphRAG System (Knowledge-Graph-Augmented Retrieval)

Q: When should I use GraphRAG instead of vanilla RAG?

GraphRAG earns its cost on two failure modes of vanilla RAG: multi-hop questions (where the answer requires connecting information from two or more separate documents) and aggregative questions (corpus-wide themes, summaries, or trend analysis). On simple single-hop fact retrieval, basic RAG actually scores higher — 60.92% vs 49.29% for MS-GraphRAG on NQ benchmarks (Exact Match metric; F1 scores are 71.70% vs 63.01% per the same study, arXiv:2502.11371). Use GraphRAG when your users ask things like "what are all the strategic risks mentioned across these 200 earnings calls?" and accept the indexing cost, which starts at roughly $5–6 per 1,000 documents at GPT-4o-mini pricing.

Q: How does global search differ from local search in Microsoft GraphRAG?

Local search finds the entities most semantically similar to the query via vector lookup, then traverses their 1–2 hop neighborhood in the graph to collect related chunks and community reports. It is the cheaper mode (5–10 seconds hosted, 30–34 seconds on local hardware) and precise for entity-centric questions. Global search skips graph traversal entirely and instead runs a map-reduce pass over the pre-computed community summaries: each summary batch produces a partial answer scored 0–100 for helpfulness, zero-score batches are dropped, and the remainder is synthesized into a final answer. Global search processes up to 1,500 community reports per query and costs 5–10x more tokens, but it handles "what are the main themes" questions that top-K chunk retrieval cannot answer.

Q: What is the Leiden algorithm and why does GraphRAG use it instead of Louvain?

Both are modularity-maximizing community detection algorithms that partition a graph into clusters. Louvain is fast but can produce disconnected communities — a cluster where some nodes have no internal path to others — which creates incoherent community summaries. Leiden adds a refinement phase that guarantees every detected community is internally well-connected, yielding more topically coherent groups. Microsoft GraphRAG uses a hierarchical Leiden run producing four levels: C0 (macro-topics, 26,657 tokens of summaries for one podcast corpus) through C3 (leaf-level topics, 746,100 tokens). Finer levels answer more specific questions; coarser levels cover broader themes at lower token cost.

Q: How expensive is GraphRAG indexing, and how has the cost changed?

The original 2024 Microsoft GraphRAG paper (Edge et al., arXiv:2404.16130) took 281 minutes to index a ~1M token podcast corpus using GPT-4-turbo. A 5GB legal dataset in early 2024 cost roughly $33,000 in LLM API calls. By mid-2025, LazyGraphRAG (which defers all LLM summarization to query time and uses NLP noun-phrase co-occurrence for indexing) brings that to 0.1% of the original cost — roughly $33 for the same legal dataset. At GPT-4o-mini pricing for a standard 1,000-document enterprise corpus (full GraphRAG, both input and output tokens), expect around $5–6 in LLM tokens; a 10,000-document corpus runs $50–60, dropping to ~$9–11 with KET-RAG selective extraction (18.3% of full cost). The entity extraction prompt accounts for approximately 75% of total indexing cost.

Q: What is DRIFT Search and when should I use it?

DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) is a three-phase hybrid that starts global and drills local. Phase one uses HyDE to expand the query and matches it against community reports (global breadth). Phase two generates follow-up sub-questions and runs two iterations of local search on them (graph precision). Phase three produces a ranked Q&A hierarchy. In Microsoft Research evaluations on the 5,590-article AP news corpus, DRIFT outperformed Local Search on comprehensiveness 78% of the time and diversity 81% of the time. Use it when queries are thematic but still benefit from entity-level grounding — between pure local and pure global.

When vanilla vector RAG fails on "summarize the entire corpus" and multi-hop questions, you build a knowledge graph first. This article covers entity extraction, Leiden community detection, map-reduce global search, and graph traversal for multi-hop questions, based on Microsoft GraphRAG and production deployments at Neo4j, LinkedIn, and Writer.

27 min read · 2026-06-25 · Ironclad Academy

#interview #ai #rag #llm #search #embeddings

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

In 2024, a team at Microsoft Research asked a deceptively simple question: given a podcast transcript about hundreds of topics and personalities, can you ask the system "who are the most influential people mentioned, and how are they connected?" A standard RAG pipeline — embed the transcript in chunks, retrieve by cosine similarity, hand to GPT-4 — gives you a partial answer from whatever chunk happens to be most similar to your question. It has no mechanism to aggregate across the full corpus. It cannot trace that person A mentors person B who co-founded a company with person C unless all three facts appear in the same retrieved chunk.

This is the structural failure of vector retrieval. Chunks are isolated. The retrieval step finds the most similar fragment, not the most connected answer. For single-hop factual questions — "what is the boiling point of water?" or "what does section 4.2 of the contract say?" — vector similarity is excellent. The moment you need information that lives across multiple disconnected documents, or the moment you want to ask a question about the corpus as a whole, retrieval fails at the architecture level.

Microsoft published GraphRAG in April 2024 (Edge et al., arXiv:2404.16130) and open-sourced it in July 2024, where it drew 20,000+ GitHub stars within weeks. On global query types it demonstrated win rates over vector RAG of 72–83% on comprehensiveness and 62–75% on diversity. The insight is simple: before you retrieve, build a knowledge graph. After you build the graph, run community detection and pre-write summaries for each community. Now you have two search modes: traverse the graph for multi-hop precision, or map-reduce over community summaries for corpus-wide synthesis.

This article is about designing that system end-to-end. If you need the baseline RAG plumbing — chunking strategy, hybrid retrieval, cross-encoder reranking, RAGAS evaluation — start with Design a RAG Pipeline. The ANN index internals are in Design a Vector Database. This article assumes you know both and asks: when do you build a graph on top, and how?

Functional requirements

Ingest a document corpus (PDFs, wikis, contracts, transcripts) and extract a queryable knowledge graph from the text.
Local search: answer entity-centric and multi-hop questions by traversing the graph neighborhood of relevant entities.
Global search: answer aggregation and summarization questions ("what are the main risks across this corpus?") by synthesizing pre-computed community summaries.
Hybrid mode: fall back to vector similarity for single-hop fact retrieval where graph search regresses in quality.
Cite sources: every answer must trace back to specific source chunks and the entities within them.
Incremental ingestion: add new documents without re-running full Leiden community detection (or explicitly budget a daily full rebuild).

Non-functional requirements

Indexing cost must be bounded and predictable per document; unpredictable LLM call explosion is a showstopper.
Global search query latency under 60 seconds for a 10,000-document corpus (it is expensive — users must understand the trade-off).
Local search query latency under 10 seconds for the same corpus.
Graph storage must scale to 1M+ nodes and 5M+ edges without full in-memory loading.
Schema-guided entity extraction to prevent hallucinated relationship types polluting the graph.

Capacity estimation

Dimension	Estimate	How we got there
Source corpus	10,000 docs × 5,000 tokens = 50M tokens	Typical enterprise knowledge base: contracts, wikis, transcripts
Chunks created	50M tokens ÷ 600 tokens/chunk = 83,333 chunks	Microsoft GraphRAG default: 600-token chunks with 100-token overlap (overlap ignored in this estimate; a 500-token stride would give ~100,000 chunks)
LLM extraction calls	83,333 chunks × 1 call/chunk + gleanings = ~125,000 LLM calls	Gleanings self-reflection adds ~50% more calls for complex domains
Extraction token cost	~17,600 tokens processed/doc × 10,000 docs = 176M tokens	75% of total indexing cost; at GPT-4o-mini ~$0.15/1M = ~$26
Community summaries	~12,000 communities × 4 levels × ~500 tokens/summary = ~24M tokens	2.6% (C0) to 74.6% (C3) of raw source text per Microsoft's podcast corpus
Total indexing cost	~$50–60 in LLM API calls	Extraction (~$26 input) + output tokens (~$21) + community summarization input (~$3.60) + output (~$5) ≈ $55; drops to ~$9–11 with KET-RAG selective extraction (18.3% of $50–60)
Graph nodes	~100,000 unique entities + 83,333 chunk nodes + ~12,000 community nodes	After dedup from ~830,000 raw entity mentions
Graph edges	~500,000–1M edges	HAS_ENTITY + RELATED + IN_COMMUNITY across the corpus
Vector index size	100,000 entities × 1,536 dims × 4 bytes = ~600 MB	Entity description embeddings for local search entry point
Global search cost	~470 reports × 800 tokens × $0.0025/1K = ~$0.94/query	Dynamic community selection; static Level-1 processes ~1,500 reports ≈ $3/query
Local search latency	5–10 seconds (hosted), 30–34 seconds (local hardware)	Vector lookup + 1–2 hop Cypher traversal + LLM synthesis

Takeaway: Graph construction costs roughly 10–50x more than vector-only indexing for the same corpus. That cost is justified only for corpora where multi-hop or aggregative queries are a primary use case. For a high-churn corpus, LazyGraphRAG's deferred approach reduces indexing to parity with vector RAG.
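
The arithmetic behind the table can be reproduced with a small estimator. This is a rough sketch, not a pricing tool: the function name and parameters are hypothetical, the $0.15/1M input price and the two output-cost figures are copied from the table rather than derived, and chunk overlap is ignored just as the table ignores it.

```python
from dataclasses import dataclass


@dataclass
class IndexingEstimate:
    chunks: int
    llm_calls: int
    extraction_input_usd: float
    total_usd: float


def estimate_indexing(
    docs: int = 10_000,
    tokens_per_doc: int = 5_000,
    chunk_tokens: int = 600,
    gleaning_overhead: float = 0.5,          # ~50% extra calls from gleanings
    extraction_tokens_per_doc: int = 17_600,
    input_price_per_million: float = 0.15,   # assumed GPT-4o-mini input price
    extraction_output_usd: float = 21.0,     # copied from the table, not derived
    summary_tokens: int = 24_000_000,
    summary_output_usd: float = 5.0,         # copied from the table, not derived
) -> IndexingEstimate:
    # Chunk count ignores overlap, matching the table's simplification.
    chunks = docs * tokens_per_doc // chunk_tokens
    llm_calls = round(chunks * (1 + gleaning_overhead))
    extraction_input = docs * extraction_tokens_per_doc / 1e6 * input_price_per_million
    summary_input = summary_tokens / 1e6 * input_price_per_million
    total = extraction_input + extraction_output_usd + summary_input + summary_output_usd
    return IndexingEstimate(chunks, llm_calls, extraction_input, total)


est = estimate_indexing()
print(est)
```

With the defaults this lands at about $56, inside the table's $50–60 range. Changing `chunk_tokens` or `gleaning_overhead` shows how quickly the call count moves.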

Building up to the design

Start with the naive version. You have 10,000 documents. You chunk them, embed the chunks, load them into a vector index, and answer questions with standard RAG. This works for 80% of queries. Then your first enterprise customer sends: "Can you tell me which regulatory risks appear across our entire contract portfolio, grouped by risk type?" The top-5 similar chunks cover one or two contracts. The LLM answers from those and misses 40 other contracts that mention the same risks. The answer is confidently wrong by omission.

V1 — Graph extraction only. You add an LLM extraction step: for each chunk, prompt the model to emit named entities (companies, people, concepts, regulations) and relationships between them as typed triples — (ACME Corp, EXPOSED_TO, GDPR Article 17), (GDPR Article 17, REQUIRES, Data Deletion). You store these triples in Neo4j. On a query, you parse entity names from the question, look them up in the graph, and collect their neighbors. The result is better multi-hop coverage. But you cannot do global search — "what are all the risk themes?" still has no answer because you have no summary of the entire graph.

V2 — Community detection + summaries. After building the graph, you run Leiden community detection. The algorithm partitions the entity nodes into clusters of tightly-connected entities — each cluster tends to represent a coherent topic (e.g., "EU data privacy regulations," "competitor landscape," "supply chain partners"). You then prompt an LLM to write a natural-language report for each community: "This cluster covers GDPR compliance obligations, including Articles 17, 20, and 83, affecting seven contracts with EU-based customers." Now you have a hierarchical index of pre-computed summaries. A global search query becomes a map-reduce over these summaries: send each summary to the LLM and ask it to score its relevance (0–100) and extract a partial answer; filter zero-score results; synthesize the remainder. This works. But it is expensive — 1,500 community reports per query at Level 1.

V3: Dynamic community selection + hybrid routing. Two refinements come from Microsoft's 2024 follow-up work. First, instead of processing all community summaries regardless of relevance, use HyDE (hypothetical document embedding) to generate a synthetic answer to the query, embed it, and match it against community report embeddings. Process only the top-matched communities. This drops the average from ~1,500 to ~470 reports per query (~69% fewer reports; Microsoft reports ~77% token savings because the discarded reports are disproportionately the longest ones) at equivalent quality. Second, add a query classifier at the front: simple fact questions route to a standard vector RAG path (cheaper, faster, higher quality for that query type); entity-centric questions route to local graph search; thematic questions route to global search with dynamic selection. The result is the production GraphRAG architecture laid out in the Architecture section.

flowchart LR
    V1["V1: Graph extraction<br/>entity triples in Neo4j"] -->|no global search| V2
    V2["V2: Leiden communities<br/>+ LLM summaries<br/>enables global search"] -->|1500 reports per query| V3
    V3["V3: Dynamic selection<br/>+ query router<br/>470 reports avg, correct path per query"]
    style V1 fill:#0e7490,color:#fff
    style V2 fill:#15803d,color:#fff
    style V3 fill:#ff6b1a,color:#0a0a0f

API

The external API surface is simpler than the internals:

POST /index
{
  "corpus_id": "contracts-v3",
  "documents": [{ "id": "doc_001", "content": "...", "metadata": { "date": "2024-01-15" } }],
  "extraction_schema": {
    "entity_types": ["Organization", "Regulation", "Obligation", "Risk"],
    "relation_types": ["EXPOSED_TO", "REQUIRES", "MITIGATED_BY", "PARTY_TO"]
  },
  "chunk_size": 600,
  "community_levels": 4
}

Response: { "job_id": "idx_8f2a3c", "estimated_cost_usd": 52.40, "estimated_duration_minutes": 45 }

POST /query
{
  "corpus_id": "contracts-v3",
  "question": "Which regulatory risks appear most frequently across the EU contracts?",
  "mode": "auto",          // "local" | "global" | "drift" | "vector" | "auto"
  "max_community_reports": 500,
  "community_levels": [1, 2]
}

Response: {
  "answer": "The three most prevalent regulatory risks across 43 EU contracts are...",
  "sources": [
    { "type": "community_report", "community_id": "c_142", "level": 1, "title": "EU Data Privacy Cluster" },
    { "type": "chunk", "doc_id": "doc_023", "chunk_id": "chnk_0891", "snippet": "..." }
  ],
  "mode_used": "global",
  "tokens_used": 387420,
  "latency_ms": 28400
}

The extraction_schema is the key production parameter. Schema-guided extraction constrains the LLM to a defined ontology of entity and relation types, preventing hallucinated edges. Free-form extraction discovers unexpected relationships but produces noisy graphs; the tradeoff matters more for domains with specialized vocabulary (legal, medical) where a generic LLM invents plausible-sounding but incorrect relation types.
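
One way to enforce the schema is a post-extraction filter that rejects any triple whose entity or relation types fall outside the declared ontology. The sketch below is a minimal illustration; the `Triple` shape and `filter_by_schema` helper are hypothetical, and the entity types assigned to the example triples are assumptions made for the demo.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    source: str
    source_type: str
    relation: str
    target: str
    target_type: str


def filter_by_schema(triples, entity_types, relation_types):
    """Split triples into (accepted, rejected) using the declared ontology."""
    allowed_entities = set(entity_types)
    allowed_relations = set(relation_types)
    accepted, rejected = [], []
    for t in triples:
        ok = (
            t.relation in allowed_relations
            and t.source_type in allowed_entities
            and t.target_type in allowed_entities
        )
        (accepted if ok else rejected).append(t)
    return accepted, rejected


schema = {
    "entity_types": ["Organization", "Regulation", "Obligation", "Risk"],
    "relation_types": ["EXPOSED_TO", "REQUIRES", "MITIGATED_BY", "PARTY_TO"],
}
extracted = [
    Triple("ACME Corp", "Organization", "EXPOSED_TO", "GDPR Article 17", "Regulation"),
    # PARTNERS_WITH is not in the schema, so this edge never reaches the graph.
    Triple("ACME Corp", "Organization", "PARTNERS_WITH", "GlobalBank", "Organization"),
]
accepted, rejected = filter_by_schema(extracted, **schema)
```

Keeping the rejected list, instead of silently dropping it, gives you a cheap signal for when the ontology itself is too narrow.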

The schema

The knowledge graph uses a property graph model. In Neo4j, three node types and three relationship types cover the full structure:

// Entity node
{
  "id": "ent_gdpr_art17",
  "type": "__Entity__",
  "name": "GDPR Article 17",
  "description": "Right to erasure ('right to be forgotten') under the General Data Protection Regulation",
  "description_embedding": [0.023, -0.187, ...],   // 1536-dim, cosine similarity index
  "rank": 142,                                       // node degree (relationship count), used to rank entities at retrieval
  "community_ids": ["c_049", "c_012", "c_002"]      // membership at each Leiden level
}

// Chunk node
{
  "id": "chnk_4821",
  "type": "__Chunk__",
  "doc_id": "doc_023",
  "content": "Supplier agrees to comply with GDPR Article 17...",
  "position": 14
}

// Community node
{
  "id": "c_049",
  "type": "__Community__",
  "level": 2,
  "title": "EU Data Privacy Obligations",
  "summary": "This community covers...",
  "summary_embedding": [0.041, -0.092, ...],
  "rank": 8.7,            // LLM-assigned importance rating, 0-10
  "entity_count": 34
}

// Relationships:
// (:__Chunk__)-[:HAS_ENTITY]->(:__Entity__)
// (:__Entity__)-[:RELATED { description: "exposed to", weight: 0.87 }]->(:__Entity__)
// (:__Entity__)-[:IN_COMMUNITY { level: 2 }]->(:__Community__)

FalkorDB is an alternative to Neo4j for this schema — it uses sparse matrix operations internally and claims 700x speedup for complex multi-hop traversals, which matters when your queries involve 3+ hops across a dense entity graph. For smaller corpora under 100,000 nodes, networkx in-memory is sufficient and avoids operational overhead entirely.

Architecture

The full system has two distinct phases: index build and query serving.

flowchart TD
    subgraph BUILD["Index Build Pipeline (offline / async)"]
        D[Documents] --> CH[Chunker<br/>600 tok / 100 overlap]
        CH --> EXTR[LLM Extractor<br/>schema-guided + gleanings]
        EXTR --> DEDUP[Entity Resolver<br/>embed + cluster merge]
        DEDUP --> GDB[(Graph DB<br/>Neo4j / FalkorDB)]
        CH --> CEMB[Chunk Embedder]
        CEMB --> VDB[(Vector Index<br/>entity descriptions)]
        GDB --> LEI[Leiden Community<br/>Detection C0-C3]
        LEI --> CSUMM[LLM Community<br/>Summarizer bottom-up]
        CSUMM --> GDB
        CSUMM --> SVDB[(Summary Vector<br/>Index for DRIFT)]
    end
    subgraph SERVE["Query Serving (online)"]
        Q[User Query] --> CLASS[Query Classifier<br/>local / global / drift / vector]
        CLASS -->|local or multi-hop| LS[Local Search<br/>vector entry + Cypher hop]
        CLASS -->|aggregative| GS[Global Search<br/>dynamic community select<br/>map-reduce]
        CLASS -->|hybrid| DR[DRIFT<br/>HyDE + primer + local drill]
        CLASS -->|fact lookup| VR[Vector RAG<br/>BM25 + ANN + rerank]
        LS --> CTX[Context Assembler]
        GS --> CTX
        DR --> CTX
        VR --> CTX
        CTX --> GEN[LLM Generation<br/>grounded + cited]
        GEN --> OUT[Answer + sources]
        GDB --> LS
        GDB --> DR
        SVDB --> GS
        SVDB --> DR
        VDB --> LS
        VDB --> DR
    end
    style BUILD fill:none
    style SERVE fill:none
    style EXTR fill:#0e7490,color:#fff
    style GDB fill:#15803d,color:#fff
    style VDB fill:#15803d,color:#fff
    style LEI fill:#a855f7,color:#fff
    style CSUMM fill:#a855f7,color:#fff
    style CLASS fill:#ff6b1a,color:#0a0a0f
    style GEN fill:#ffaa00,color:#0a0a0f
    style DR fill:#ff2e88,color:#fff

The local search hot path in more detail:

sequenceDiagram
    participant U as User
    participant R as Query Router
    participant V as Vector Index
    participant G as Graph DB
    participant L as LLM

    U->>R: "Which investors backed companies connected to Sequoia portfolio firms?"
    R->>R: classify -- local or multi-hop
    R->>V: embed query, top-K entity similarity
    V-->>R: top-10 entity nodes (Sequoia, portfolio co A, portfolio co B...)
    R->>G: Cypher 1-2 hop traversal from top_entities via RELATED edges
    G-->>R: neighbors + relationship descriptions + connected chunks
    R->>G: fetch community reports for matched entities
    G-->>R: 3 community reports (VC ecosystem, tech startup landscape, ...)
    R->>L: context = chunks + relationships + reports, answer the question
    L-->>U: "Sequoia portfolio firms that received follow-on from firms also backing..."

The indexing pipeline in depth

The extraction prompt is the most expensive component — roughly 75% of total indexing cost by token count. A typical prompt sends a 600-token chunk and instructs the LLM to return all entities in the format ("entity"<|>name<|>type<|>description) and all relationships as ("relationship"<|>source<|>target<|>description<|>strength). Microsoft's extraction prompt then runs a "gleanings" self-reflection loop: after the initial extraction, it re-prompts the LLM with "MANY entities and relations were missed. What did you miss?" and parses the delta. This catches entities that appear only once or in unusual syntactic positions. In practice it adds 40–60% more entities for dense technical documents at ~50% more token cost for that chunk.
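
Parsing that output format is mostly string splitting. The sketch below assumes one record per line, which is an assumption of this example (the text above only specifies the `<|>` field delimiter), and the `parse_extraction` helper is hypothetical. Malformed records are dropped, not guessed at.

```python
TUPLE_DELIM = "<|>"


def parse_extraction(raw: str):
    """Parse delimiter-separated entity and relationship records from LLM output."""
    entities, relationships = [], []
    for line in raw.splitlines():
        line = line.strip()
        if not (line.startswith("(") and line.endswith(")")):
            continue  # skip anything that is not a parenthesized record
        fields = [f.strip() for f in line[1:-1].split(TUPLE_DELIM)]
        kind = fields[0].strip('"')
        if kind == "entity" and len(fields) == 4:
            _, name, etype, desc = fields
            entities.append({"name": name, "type": etype, "description": desc})
        elif kind == "relationship" and len(fields) == 5:
            _, src, tgt, desc, strength = fields
            try:
                weight = float(strength)
            except ValueError:
                continue  # non-numeric strength: drop the record
            relationships.append(
                {"source": src, "target": tgt, "description": desc, "strength": weight}
            )
    return entities, relationships


sample = """
("entity"<|>ACME Corp<|>Organization<|>A supplier bound by the contract)
("entity"<|>GDPR Article 17<|>Regulation<|>Right to erasure)
("relationship"<|>ACME Corp<|>GDPR Article 17<|>ACME must honor erasure requests<|>8)
Some stray model commentary that should be ignored.
"""
entities, relationships = parse_extraction(sample)
```

Running the same parser over the gleanings response and merging by entity name is one simple way to compute the delta.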

Chunk size is the key tension here: smaller chunks extract more entities per token of source text (a 600-token chunk size yields ~2x more entity references than a 2,400-token chunk size with the same number of gleaning rounds), but they multiply the number of LLM calls proportionally. For a 50M-token corpus, moving from 2,400-token to 600-token chunks quadruples the number of extraction calls, and because every call repeats the fixed instruction prompt, extraction cost climbs with it. That said, smaller chunks also yield richer multi-hop connections because cross-document references appear at finer granularity.

After extraction, entity resolution is the step most implementations skip and later regret. Raw extraction produces ("GDPR Article 17"), ("GDPR Art. 17"), ("Art 17 GDPR"), and ("right to erasure") as four separate entity nodes. These are the same entity. If they are not merged, the graph has four disconnected subgraphs where there should be one hub. The naive fix is string normalization and exact matching. The correct fix is to embed all entity descriptions, cluster with DBSCAN or hierarchical clustering at a cosine distance threshold around 0.12, then prompt the LLM to confirm merges within each cluster. This adds cost but directly determines graph connectivity — and graph connectivity is the entire value proposition of GraphRAG.
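
The clustering step can be sketched without any ML library. The example below swaps DBSCAN for a simpler technique, union-find over every pair closer than the cosine distance threshold (single-linkage grouping), and uses hand-made 3-dimensional vectors as stand-ins for real description embeddings. The function names are hypothetical, and the output is only a list of candidate merge groups for the LLM confirmation pass.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def candidate_merge_groups(embeddings, threshold=0.12):
    """Group entity names whose embeddings are linked by pairs under the threshold."""
    names = list(embeddings)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    # O(n^2) pairwise scan: fine for a demo, too slow for 100,000 entities.
    for a, b in combinations(names, 2):
        if cosine_distance(embeddings[a], embeddings[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Toy vectors chosen by hand; real embeddings would come from an embedding model.
toy_embeddings = {
    "GDPR Article 17": [0.90, 0.10, 0.05],
    "GDPR Art. 17": [0.88, 0.12, 0.06],
    "right to erasure": [0.85, 0.20, 0.05],
    "ACME Corp": [0.05, 0.10, 0.95],
}
groups = candidate_merge_groups(toy_embeddings)
```

At corpus scale the pairwise scan would likely be replaced by an ANN neighbor query per entity, with the same union-find merge on top.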

Leiden community detection runs on the entity graph after deduplication. The algorithm is non-deterministic; different runs produce slightly different communities. In practice, Microsoft runs it at four granularity levels parameterized by a resolution parameter. At C0 (lowest resolution, largest communities), you get broad macro-topics with short summaries — 26,657 tokens total for the podcast corpus. At C3 (highest resolution, smallest communities), you have fine-grained leaf clusters — 746,100 tokens of summaries. Community summaries are generated bottom-up: leaf communities are summarized first, and their summaries are fed into the parent community prompts, letting parent summaries synthesize rather than repeat.
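
Leiden optimizes a partition quality score; the sketch below computes plain modularity for a given partition so you can compare candidate partitions or sanity-check a run. It assumes an unweighted, undirected edge list and no resolution parameter, which is a simplification of what a hierarchical Leiden run actually optimizes. The `modularity` helper is hypothetical, not part of any GraphRAG package.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2."""
    m = len(edges)
    internal = defaultdict(int)     # L_c: edges with both endpoints in community c
    degree_sum = defaultdict(int)   # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {n: 0 for n in "abcdef"}

q_split = modularity(edges, two_clusters)
q_merged = modularity(edges, one_cluster)
```

Splitting the two triangles scores higher than lumping everything together, which is the behavior a modularity-maximizing algorithm exploits.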

The two query paths

Local search is the entity-first path. The query is embedded and matched against entity description embeddings via ANN (the same vector index used for baseline RAG, but over entity descriptions rather than chunk text). The top-K entities form the starting set. From each starting entity, a Cypher traversal collects the 1–2 hop neighborhood: directly connected entities, the chunks containing those entities (via HAS_ENTITY edges), and the community reports those entities belong to. All of this — entities, relationship descriptions, chunks, community reports — is assembled into an LLM context window and synthesized into an answer. The total context is typically 5,000–15,000 tokens. On optimized hosted deployments, end-to-end latency is 5–10 seconds; on consumer hardware with a local Llama 3.1 model, 30–34 seconds consistently across benchmarks (arXiv:2605.20815).
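
The traversal half of local search can be sketched in memory with a breadth-first expansion. This example assumes the vector lookup has already produced the seed entities, and it uses plain dictionaries in place of a graph database; `local_search_context` and the toy data are hypothetical.

```python
from collections import deque


def local_search_context(seed_entities, related, chunk_ids_by_entity,
                         community_by_entity, max_hops=2):
    """Collect entities within max_hops of the seeds, plus their chunks and communities."""
    visited = set(seed_entities)
    frontier = deque((e, 0) for e in seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in related.get(entity, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    chunks = sorted({c for e in visited for c in chunk_ids_by_entity.get(e, ())})
    communities = sorted({community_by_entity[e] for e in visited
                          if e in community_by_entity})
    return {"entities": sorted(visited), "chunks": chunks, "communities": communities}


# Toy graph: Sequoia - PortCo A - Investor X - Fund Y (RELATED edges, undirected).
related = {
    "Sequoia": ["PortCo A"],
    "PortCo A": ["Sequoia", "Investor X"],
    "Investor X": ["PortCo A", "Fund Y"],
    "Fund Y": ["Investor X"],
}
chunk_ids_by_entity = {
    "Sequoia": ["chnk_1"],
    "PortCo A": ["chnk_1", "chnk_2"],
    "Investor X": ["chnk_3"],
    "Fund Y": ["chnk_4"],
}
community_by_entity = {
    "Sequoia": "c_vc", "PortCo A": "c_startups",
    "Investor X": "c_vc", "Fund Y": "c_vc",
}
ctx = local_search_context(["Sequoia"], related, chunk_ids_by_entity, community_by_entity)
```

The hop limit is what bounds the context size: Fund Y sits three hops from the seed, so it stays out of the prompt.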

Global search ignores the graph entirely at query time and works on the pre-computed community summary index. The naive version processes every community report at a chosen Leiden level: split all summaries into token-sized batches, send each batch to the LLM with a prompt asking "given these community summaries, generate a partial answer to the query and rate its helpfulness 0–100," collect all partial answers, filter zero-scored ones, sort by score, fill the context window from the highest score down, and synthesize a final answer. At Level 1, this touches ~1,500 reports. Microsoft's dynamic community selection improvement (2024) first embeds the query and matches it against community report embeddings, reducing the processed set to ~470 on average (~69% fewer reports; Microsoft measures ~77% token reduction because the reports eliminated tend to be the longer, lower-relevance ones). For GPT-4o at $2.50/1M input tokens and ~800 tokens per report, that works out to about $0.94 per global query (dynamic) versus $3.00 (static, ~1,500 reports), which is not trivial at scale.
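
The map-reduce loop described above fits in a few lines once the LLM calls are abstracted away. In this sketch `map_fn` and `reduce_fn` are hypothetical callables standing in for the two LLM prompts, token counts are assumed to be precomputed on each report, and the fake functions at the bottom exist only to make the example runnable.

```python
def batch_reports(reports, batch_tokens):
    """Greedily pack reports into batches that stay under the token limit."""
    batches, current, used = [], [], 0
    for report in reports:
        if current and used + report["tokens"] > batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(report)
        used += report["tokens"]
    if current:
        batches.append(current)
    return batches


def global_search(reports, map_fn, reduce_fn, batch_tokens=8000,
                  context_tokens=4000, max_reports=500):
    reports = reports[:max_reports]                  # hard cap on reports per query
    partials = [map_fn(b) for b in batch_reports(reports, batch_tokens)]
    partials = [p for p in partials if p["score"] > 0]   # drop zero-score batches
    partials.sort(key=lambda p: p["score"], reverse=True)
    selected, budget = [], context_tokens
    for p in partials:                               # fill context, best score first
        if p["tokens"] <= budget:
            selected.append(p)
            budget -= p["tokens"]
    return reduce_fn(selected)


# Deterministic stand-ins for the map and reduce LLM calls.
def fake_map(batch):
    hits = [r["id"] for r in batch if "GDPR" in r["text"]]
    return {"answer": ";".join(hits), "score": 100 if hits else 0, "tokens": 50}


def fake_reduce(partials):
    return " | ".join(p["answer"] for p in partials)


reports = [
    {"id": "c1", "text": "GDPR erasure obligations", "tokens": 800},
    {"id": "c2", "text": "supply chain partners", "tokens": 800},
    {"id": "c3", "text": "GDPR fines under Article 83", "tokens": 800},
]
answer = global_search(reports, fake_map, fake_reduce, batch_tokens=800)
```

The `max_reports` argument plays the same role as `max_community_reports` in the query API: it bounds the worst-case cost of a single query.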

The benchmark results are clear about when each mode wins. On Microsoft's 5,590-article AP news corpus (Edge et al., 2024), GraphRAG global search beat vector RAG on comprehensiveness at a 72–83% win rate and diversity at 62–75% (all p < 0.001). But on Natural Questions (single-hop fact retrieval), basic RAG scores 71.70% F1 vs Community-GraphRAG Local at 63.01% F1 — and 60.92% vs 49.29% on Exact Match (arXiv:2502.11371). The graph's strength is synthesis; its weakness is precision retrieval. A production system needs both, hence the query classifier.
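
A query classifier can start as a keyword heuristic before graduating to an LLM or trained model. The sketch below is that placeholder: the cue lists are invented for illustration and would need tuning against real traffic, and the returned mode names mirror the `mode` values in the query API.

```python
import re

# Hypothetical cue lists; a production router would likely use an LLM or a trained model.
GLOBAL_CUES = re.compile(
    r"\b(main themes?|across|overall|summar(?:y|ize|ise)|trends?)\b", re.IGNORECASE
)
LOCAL_CUES = re.compile(
    r"\b(connected to|related to|linked to|relationship between)\b", re.IGNORECASE
)


def route(question: str) -> str:
    """Return 'global', 'local', or 'vector' for a user question."""
    if GLOBAL_CUES.search(question):
        return "global"   # aggregative or thematic question
    if LOCAL_CUES.search(question):
        return "local"    # entity-centric or multi-hop question
    return "vector"       # default: single-hop fact lookup


examples = [
    "Which regulatory risks appear most frequently across the EU contracts?",
    "Which investors backed companies connected to Sequoia portfolio firms?",
    "What does section 4.2 of the contract say?",
]
routes = [route(q) for q in examples]
```

Defaulting to the vector path is deliberate: it is the cheapest mode and, per the benchmark numbers above, the strongest one for single-hop questions.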

DRIFT search is the third mode. Introduced by Microsoft Research in late 2024, it targets queries that start thematic but need entity-level grounding. Phase one: expand the query using HyDE (generate a synthetic answer, embed it, match against community report embeddings) to find a broad starting point — the "primer." Phase two: generate 3–5 follow-up sub-questions from the primer and run two iterations of local search on each sub-question, expanding the graph neighborhood on each iteration. Phase three: aggregate sub-answers into a ranked Q&A hierarchy, filter for quality, and synthesize a final answer. On 50 queries over 5,590 AP news articles, DRIFT outperformed Local Search on comprehensiveness 78% of the time and diversity 81% of the time.

Alternatives and cost-reduction strategies

Five systems sit on the cost-vs-quality spectrum and represent real production choices:

Full Microsoft GraphRAG amortizes all LLM cost at index time. You pay once; every query benefits from pre-computed community summaries and a rich entity graph. Best for static corpora where you query heavily over time. The 281-minute, 654,673-token index build for a ~1M-token corpus (Edge et al., arXiv:2404.16130) is the baseline.

LazyGraphRAG (Microsoft Research, late 2024) defers all LLM summarization to query time and uses NLP noun-phrase co-occurrence for cheap indexing — zero LLM calls during index build. Indexing cost becomes 0.1% of full GraphRAG (comparable to vector RAG). At query time, it uses the same co-occurrence graph to select relevant text passages for on-the-fly summarization. At a 500-relevance-test query budget, LazyGraphRAG's query cost is 4% of GraphRAG Global Search while matching or exceeding quality on both local and global benchmarks over the same 5,590-article corpus. The tradeoff: per-query latency is higher because summarization happens live.

LightRAG (arXiv:2410.05779) offers incremental document insertion without full re-indexing, which is its key differentiator over full GraphRAG. It uses dual-level retrieval (low-level for specific entities, high-level for broad themes) and integrates both vector embeddings and graph structure. Index construction was slower (710 seconds vs 292 for MS-GraphRAG on the same benchmark), but per-query token cost was substantially lower (~100,832 tokens vs ~331,375 for global search). The best fit is a corpus that changes frequently.

KET-RAG (arXiv:2502.09304) uses PageRank to identify "core chunks" (the β-fraction of chunks with the highest centrality in an initial co-occurrence graph) and runs LLM extraction only on those. For the remaining chunks, it builds a lightweight text-keyword bipartite graph without LLM calls. On HotpotQA, KET-RAG achieves 81.6% of full GraphRAG coverage at 18.3% of the indexing cost. For scale: full KG-GraphRAG cost $21 on a 3.2MB dataset, and the same rate applied to a 5GB legal dataset gives the early estimate of roughly $33,000 with GPT-4, which drops to ~$33 with LazyGraphRAG by mid-2025.

HippoRAG (Ohio State, NeurIPS 2024, arXiv:2405.14831) takes a neuroscience-inspired approach: LLM extracts knowledge graph triples (neocortex role), then Personalized PageRank spreads activation across graph edges at retrieval time (hippocampus role). Single-step retrieval outperforms iterative IRCoT-style methods on multi-hop QA benchmarks by up to 20%, while being 10–30x cheaper and 6–13x faster. HippoRAG2's clustering coefficient of 0.657 vs HippoRAG's 0.100 on the same corpus demonstrates that graph topology quality — not just node count — directly determines retrieval accuracy.
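
Personalized PageRank itself is a short power iteration. The sketch below is a generic textbook version, not HippoRAG's implementation: it assumes an unweighted adjacency dict in which every neighbor also appears as a key, restarts uniformly on the seed entities, and returns dangling-node mass to the seeds.

```python
def personalized_pagerank(adjacency, seeds, damping=0.85, iterations=50):
    """Power-iteration PPR; scores sum to 1 and concentrate near the seed nodes."""
    nodes = list(adjacency)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += damping * scores[n] / len(seeds)
                continue
            share = damping * scores[n] / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        scores = nxt
    return scores


# Path graph A - B - C - D plus an isolated node Z; the query mentions entity A.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "Z": [],
}
scores = personalized_pagerank(adjacency, seeds={"A"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

Scores decay with graph distance from the seed, and a node with no path to any seed receives nothing. That is the sense in which activation "spreads" across edges in a single retrieval step.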

Incremental updates: the unsolved problem

Full GraphRAG has no clean answer for document additions. Adding a new document requires re-running entity extraction on the new chunks (cheap), merging new entities into the graph (medium), then re-running Leiden community detection on the modified graph (expensive — the algorithm processes the whole graph), then re-generating community summaries for affected communities (expensive — requires determining which communities changed, which propagates up the hierarchy).

Microsoft measured this directly: full re-indexing costs 30M prompt tokens + 7.8M completion tokens. An ancestor-node-only incremental update, where only the community hierarchy above changed entities is re-summarized, costs 1.6M prompt + 2M completion tokens. That is roughly 19x fewer prompt tokens and about 10x fewer tokens overall (37.8M vs 3.6M). LightRAG implements something close to this. LazyGraphRAG sidesteps it entirely.
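
The ancestor-only update reduces to walking up the community hierarchy from each changed entity. The sketch below assumes community membership is kept as-is (no Leiden rerun) and that the hierarchy is stored as a child-to-parent map; both data structures and the helper name are hypothetical.

```python
def communities_to_resummarize(changed_entities, leaf_community_of, parent_of):
    """Return the leaf communities of changed entities plus all their ancestors."""
    dirty = set()
    for entity in changed_entities:
        community = leaf_community_of.get(entity)
        # Stop early at an already-dirty community: its ancestors are marked too.
        while community is not None and community not in dirty:
            dirty.add(community)
            community = parent_of.get(community)
    return dirty


# Toy hierarchy: three leaf communities under two parents under one root.
parent_of = {
    "c3_a": "c2_a", "c3_b": "c2_a", "c3_c": "c2_b",
    "c2_a": "c1_a", "c2_b": "c1_a",
}
leaf_community_of = {
    "GDPR Article 17": "c3_a",
    "ACME Corp": "c3_b",
    "GlobalBank": "c3_c",
}
dirty = communities_to_resummarize(["GDPR Article 17"], leaf_community_of, parent_of)
```

Here one changed entity marks three of six communities for re-summarization. Sibling branches keep their existing summaries, which is where the token savings come from.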

The production consequence is that most GraphRAG deployments accept a daily full rebuild cadence. This makes GraphRAG unsuitable for corpora where documents are added or updated more frequently than once per day and answer freshness matters. For high-churn environments — news ingestion, real-time support documentation, streaming event logs — either LazyGraphRAG's deferred model or a pure vector RAG system with good chunking and incremental vector upserts will serve better. The Design a RAG Pipeline article covers CDC-based incremental vector updates that achieve sub-minute freshness.

Real production deployments

Writer deployed a production knowledge graph system that scored 86.31% on RobustQA, versus 59–75% for seven competing vector RAG implementations. This is the strongest published accuracy benchmark comparing graph-augmented vs vector-only retrieval on a real-world QA benchmark.

LinkedIn uses an internal knowledge graph connecting professionals, companies, skills, and job postings (the economic graph). It has reported that graph-based retrieval in its support tooling cut median per-issue ticket resolution time by 27.5% (40 hours to ~29 hours); one internal deployment reported a drop from 40 to 15 hours (63%) for specific support categories.

Data.world ran a benchmark of 43 business questions comparing standard RAG to a Neo4j-based graph RAG implementation and found 3x accuracy improvement. Neo4j's GraphRAG integration uses Cypher queries of the form MATCH (n:__Entity__)<-[:HAS_ENTITY]-(c:__Chunk__) WHERE n.name IN $entity_names to collect text, combined with vector similarity on entity description embeddings, then assembles community reports from the matching entity clusters.

LlamaIndex Property Graph Index supports three extraction modes in production: schema-guided (explicit entity/relationship type definitions), implicit (PREVIOUS/NEXT/SOURCE structural links), and free-form (LLM-inferred). Its four retrieval methods are composable: keyword/synonym expansion, vector similarity on embedded nodes, Cypher query, and custom traversal subclasses. That composability is the differentiator: you can combine vector entry with graph expansion without writing a custom retrieval pipeline.

Edge cases & gotchas

Entity deduplication failure cascades. String normalization catches ("GDPR Article 17") and ("gdpr article 17") but not ("Art. 17 GDPR"), ("right to erasure"), or German ("Artikel 17 DSGVO") when processing multilingual corpora. Each unmerged duplicate becomes an isolated node. The entity that should be the most-connected hub in the graph — appearing in 50 contracts — becomes 12 nodes each with 4–5 edges instead of one node with 50+ edges. Multi-hop traversal starting from a sparse node finds far fewer connected facts than a dense hub would, directly degrading recall. Fix: embedding-based clustering plus LLM-assisted confirmation within each cluster, not string normalization alone.

Global search token explosion on complex queries. A simple global query at dynamic selection processes ~470 reports. A complex multi-part global query can trigger broader community matching and process 1,200+ reports. Without explicit max_community_reports caps at the API layer and per-query token budgets in the map-reduce loop, a single misbehaving query can cost more than 10,000 typical vector RAG queries. Rate limiting and per-query cost caps are load-bearing, not optional.

Stale community summaries. A company mentioned in 5 contracts is part of a well-summarized community. A new contract is added that dramatically changes the narrative about this company. The daily rebuild hasn't run yet. Global search returns the pre-update community summary, which now actively misleads. The gap between "corpus updated" and "graph rebuilt" is a data freshness SLA you must communicate to users — GraphRAG is not a real-time retrieval system.

LLM hallucination in relationship extraction. Free-form extraction with few-shot examples encourages the model to invent plausible-sounding relationships not present in the source text. A prompt that says "extract all relationships" may yield ("ACME Corp", "PARTNERS_WITH", "GlobalBank") when the text only says "ACME Corp mentioned GlobalBank's lending rates in a footnote." Schema-guided extraction with a defined relationship type ontology constrains this. Validation passes that check extracted relationships against source chunk text catch the worst hallucinations before they enter the graph.
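A validation pass can start as a cheap lexical check, sketched below under stated assumptions: the cue-word table is hypothetical and would come from your own relationship ontology, and substring matching is a deliberately crude stand-in for a stronger check such as an LLM or entailment-based verifier.

```python
from typing import Dict, Iterable, List, NamedTuple


class Triple(NamedTuple):
    source: str
    relation: str
    target: str


# Hypothetical cue words per relationship type, for illustration only.
RELATION_CUES: Dict[str, List[str]] = {
    "PARTNERS_WITH": ["partner", "partnership", "joint venture"],
    "APPROVED_BY": ["approved", "approval"],
}


def is_supported(triple: Triple, chunk_text: str) -> bool:
    """True if both entities and at least one cue word appear in the chunk."""
    text = chunk_text.lower()
    if triple.source.lower() not in text or triple.target.lower() not in text:
        return False
    # Relation types missing from the table have no cues and are rejected.
    cues = RELATION_CUES.get(triple.relation, [])
    return any(cue in text for cue in cues)


def filter_triples(triples: Iterable[Triple], chunk_text: str) -> List[Triple]:
    """Keep only triples that pass the lexical check against their source chunk."""
    return [t for t in triples if is_supported(t, chunk_text)]


footnote = "ACME Corp mentioned GlobalBank's lending rates in a footnote."
claimed = Triple("ACME Corp", "PARTNERS_WITH", "GlobalBank")
kept = filter_triples([claimed], footnote)  # empty: no partnership cue in the text
```

Both entity names appear in the footnote, so a check on entity mentions alone would let this triple through; it is the missing relationship cue that rejects it.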

Text-to-Cypher silent failures. When using LangChain's GraphCypherQAChain or LlamaIndex's Cypher query retriever, the LLM generating the Cypher can produce syntactically valid but semantically wrong traversals that silently return empty results. Unlike vector search — which always returns K results even if they are irrelevant — an empty Cypher result produces an answer of "I found no information," which the user interprets as a knowledge gap rather than a query translation error. Instrument every Cypher query with result-count logging and alert on unexpectedly empty results.
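The instrumentation can be a thin wrapper around whatever executes the generated Cypher. In this sketch, run_query is a hypothetical callable (for example, a function wrapping your graph driver's session) that returns a list of row dictionaries; the logger name and exception class are also illustrative.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("graph_retrieval")


class EmptyCypherResult(Exception):
    """Raised when a generated Cypher query returns no rows."""


def run_instrumented(
    run_query: Callable[[str], List[Dict[str, Any]]],  # hypothetical executor
    cypher: str,
    raise_on_empty: bool = True,
) -> List[Dict[str, Any]]:
    """Execute generated Cypher, log the row count, and flag empty results."""
    rows = run_query(cypher)
    logger.info("cypher_rows=%d query=%s", len(rows), cypher)
    if not rows:
        # An empty result may be a translation error, not a knowledge gap.
        logger.warning("empty Cypher result, possible translation error: %s", cypher)
        if raise_on_empty:
            raise EmptyCypherResult(cypher)
    return rows
```

Raising on empty results gives the caller a hook to fall back to vector search or to tell the user the query could not be translated, instead of answering "I found no information."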

Graph density mismatch between domains. A knowledge graph built from legal contracts (dense, entity-rich, explicit relationships) will have very different connectivity properties than one built from customer support tickets (sparse, with entities that are mostly product names and issue categories). The same Leiden resolution parameter that yields 50-entity communities for legal text yields 3-entity communities for support tickets. Tune resolution per corpus type; what works out of the box for one domain will produce either over-aggregated or over-fragmented communities in another.

Adversarial graph injection. Documents injected into the corpus can contain fabricated entity descriptions that, once absorbed into community summaries, influence all future global searches touching that community. A poisoned document claiming "RegulatorX has formally approved Product Y" propagates through entity extraction into the APPROVED_BY relationship, into the community summary for the regulatory cluster, and from there into every global query about product approvals. This is document poisoning with graph amplification — one injected document can corrupt an entire community summary rather than just one chunk's context. Input validation, source trust scoring, and human review for documents from untrusted sources are the mitigations.

Trade-offs to discuss in an interview

GraphRAG vs long-context LLM. GPT-4o's 128K context window and Gemini 1.5's 1M context window invite the question: why not just stuff all 10,000 documents into the context? For small corpora, you should — it is cheaper and faster than building a graph. The break-even is roughly 500–1,000 documents: beyond that, the LLM's attention degrades over irrelevant content, the per-query cost exceeds the amortized graph build cost, and latency becomes prohibitive. GraphRAG is specifically the answer to "I have more data than fits in any context window, and I need global queries."

Global search quality vs cost. GraphRAG's 72–83% comprehensiveness win over vector RAG comes from the map-reduce over community summaries. That same mechanism costs 5–10x more per query. In a product where 95% of user queries are single-hop fact lookups and 5% are aggregation queries, routing fact queries to vector RAG and reserving global GraphRAG search for aggregation queries cuts the average per-query cost substantially. Query classifiers that route with 90%+ accuracy pay for themselves immediately.
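The routing argument is easy to check with arithmetic. The sketch below uses the 95/5 query mix and the 5–10x multiplier from the paragraph above, measures cost in units of one vector RAG query, and assumes a perfect classifier with zero overhead, which a real router will not achieve.

```python
def blended_cost(global_share: float, global_multiplier: float, route: bool) -> float:
    """Average per-query cost, in units of one vector RAG query.

    global_share: fraction of queries that need global search (0.05 here).
    global_multiplier: cost of one global query relative to vector RAG.
    route: if False, every query is sent through global search.
    """
    if not route:
        return global_multiplier
    return (1.0 - global_share) * 1.0 + global_share * global_multiplier


for multiplier in (5.0, 10.0):
    routed = blended_cost(0.05, multiplier, route=True)
    unrouted = blended_cost(0.05, multiplier, route=False)
    saving = 1.0 - routed / unrouted
    print(f"{multiplier:.0f}x global: routed {routed:.2f}, unrouted {unrouted:.2f}, saving {saving:.1%}")
```

Under these assumptions the routed average works out to about 1.20x to 1.45x the cost of a vector query, against 5x to 10x when everything goes through global search. Misrouted queries eat into that margin, which is why classifier accuracy matters.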

Schema-guided vs open-IE extraction. Schema-guided extraction (defined entity types and relationship types) produces a clean, consistent graph where every edge has a meaningful type. Open-IE discovers unexpected relationships but produces noisy, inconsistent type labels ("is_related_to," "mentioned_alongside," "appears_with" as separate relationship types covering the same meaning). For domains with well-understood ontologies (legal, medical, financial), schema-guided is strictly better. For exploratory research over unknown corpora, open-IE reveals structure you didn't know to look for.

Neo4j vs FalkorDB vs in-memory. Neo4j has the richest tooling, best LLM integration (LangChain, LlamaIndex native connectors), and a proven production track record. FalkorDB claims a 700x speedup for complex multi-hop traversals from its sparse matrix operations, which is meaningful if your queries regularly involve 4–5 hop paths across dense entity graphs. In-memory networkx is a sound choice for corpora under ~100,000 nodes and eliminates operational overhead entirely. The right choice depends on traversal depth and available engineering resources.

Things you should now be able to answer

Why does vanilla vector RAG fail on "what are all the themes across this corpus?" and how does GraphRAG's community summary architecture specifically fix this?
What is the Leiden algorithm, how does it differ from Louvain, and why does hierarchical community structure matter for serving queries at different granularities?
Walk through the entity extraction pipeline: chunking, LLM extraction prompt structure, gleanings self-reflection loop, and the deduplication step — and explain why deduplication is the most common production pitfall.
Explain local search vs global search: what query types each serves, their latency/cost profiles, and where each one regresses.
How does dynamic community selection reduce the number of processed reports from ~1,500 to ~470 (~69% fewer) and achieve ~77% token savings (per Microsoft), and what mechanism does it use to select relevant communities without processing all of them?
Compare full GraphRAG, LazyGraphRAG, LightRAG, and KET-RAG on the axes of indexing cost, query cost, and suitability for high-churn corpora.
What is DRIFT search and when does it outperform pure local or global search?
Why does GraphRAG regress on single-hop fact retrieval compared to vector RAG, and what does a query classifier do to prevent that regression in a production system?
What makes incremental GraphRAG updates expensive, and what are the current strategies for avoiding a full daily rebuild?
Describe two adversarial or failure scenarios specific to GraphRAG that do not exist in a standard vector RAG system.

Frequently asked questions

▸When should I use GraphRAG instead of vanilla RAG?

GraphRAG earns its cost on two failure modes of vanilla RAG: multi-hop questions (where the answer requires connecting information from two or more separate documents) and aggregative questions (corpus-wide themes, summaries, or trend analysis). On simple single-hop fact retrieval, basic RAG actually scores higher — 60.92% vs 49.29% for MS-GraphRAG on NQ benchmarks (Exact Match metric; F1 scores are 71.70% vs 63.01% per the same study, arXiv:2502.11371). Use GraphRAG when your users ask things like "what are all the strategic risks mentioned across these 200 earnings calls?" and accept the indexing cost, which starts at roughly $5–6 per 1,000 documents at GPT-4o-mini pricing.

▸How does global search differ from local search in Microsoft GraphRAG?

Local search finds the most semantically similar entities to the query via vector lookup, then traverses their 1–2 hop neighborhood in the graph to collect related chunks and community reports — fast (30–34 seconds on local hardware) and precise for entity-centric questions. Global search ignores the graph traversal entirely and instead runs a map-reduce pass over all pre-computed community summaries: each summary batch produces a partial answer scored 0–100 for helpfulness, zero-score batches are dropped, and the remainder is synthesized into a final answer. Global search processes up to 1,500 community reports per query, costs 5–10x more tokens, but handles "what are the main themes" questions no retrieval index can answer.

▸What is the Leiden algorithm and why does GraphRAG use it instead of Louvain?

Both are modularity-maximizing community detection algorithms that partition a graph into clusters. Louvain is fast but can produce disconnected communities — a cluster where some nodes have no internal path to others — which creates incoherent community summaries. Leiden adds a refinement phase that guarantees every detected community is internally well-connected, yielding more topically coherent groups. Microsoft GraphRAG uses a hierarchical Leiden run producing four levels: C0 (macro-topics, 26,657 tokens of summaries for one podcast corpus) through C3 (leaf-level topics, 746,100 tokens). Finer levels answer more specific questions; coarser levels cover broader themes at lower token cost.

▸How expensive is GraphRAG indexing, and how has the cost changed?

The original 2024 Microsoft GraphRAG paper (Edge et al., arXiv:2404.16130) took 281 minutes to index a ~1M token podcast corpus using GPT-4-turbo. A 5GB legal dataset in early 2024 cost roughly $33,000 in LLM API calls. By mid-2025, LazyGraphRAG (which defers all LLM summarization to query time and uses NLP noun-phrase co-occurrence for indexing) brings that to 0.1% of the original cost — roughly $33 for the same legal dataset. At GPT-4o-mini pricing for a standard 1,000-document enterprise corpus (full GraphRAG, both input and output tokens), expect around $5–6 in LLM tokens; a 10,000-document corpus runs $50–60, dropping to ~$9–11 with KET-RAG selective extraction (18.3% of full cost). The entity extraction prompt accounts for approximately 75% of total indexing cost.
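The figures above can be reproduced with a small estimator. It assumes cost scales linearly with document count, which matches the $5–6 and $50–60 figures quoted here but is a simplification; the function name and defaults are illustrative.

```python
from typing import Tuple


def indexing_cost_range(
    num_docs: int,
    usd_per_1k_docs: Tuple[float, float] = (5.0, 6.0),  # range quoted in the text
    fraction_of_full: float = 1.0,  # e.g. 0.183 for KET-RAG per the text
) -> Tuple[float, float]:
    """Return a (low, high) USD estimate, scaled linearly by corpus size."""
    scale = num_docs / 1000 * fraction_of_full
    low, high = usd_per_1k_docs
    return (low * scale, high * scale)


full = indexing_cost_range(10_000)                         # full GraphRAG
ket = indexing_cost_range(10_000, fraction_of_full=0.183)  # KET-RAG selective extraction
lazy_legal = 33_000 * 0.001                                # LazyGraphRAG at 0.1% of $33,000
```

Here full comes out at $50–60, ket at roughly $9.15–10.98, and lazy_legal at about $33, consistent with the rounded numbers in the answer.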

▸What is DRIFT Search and when should I use it?

DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) is a three-phase hybrid that starts global and drills local. Phase one uses HyDE to expand the query and matches it against community reports (global breadth). Phase two generates follow-up sub-questions and runs two iterations of local search on them (graph precision). Phase three produces a ranked Q&A hierarchy. In Microsoft Research evaluations on the 5,590-article AP news corpus, DRIFT outperformed Local Search on comprehensiveness 78% of the time and diversity 81% of the time. Use it when queries are thematic but still benefit from entity-level grounding — between pure local and pure global.
