Design an AI Coding Assistant (Copilot / Cursor)
Architect a system that delivers inline ghost-text completions in under 200ms and drives an autonomous agent that edits dozens of files — the two-product architecture behind GitHub Copilot, Cursor, and Sourcegraph Cody at billions of completions per day.
The problem
In November 2023, GitHub reported that developers using Copilot completed tasks 55% faster in a controlled study. The tool had 1.3 million paid users at the time. By mid-2025, it was serving 400 million completion requests per day, handling 8,000 requests per second at peak. Cursor, a full VS Code fork built around AI editing, was processing 1 million transactions per second at peak and had crossed $500 million ARR in under two years.
The engineering problem is harder than it looks. The product surface is simple: a ghost text overlay in an editor. The constraint is brutal: you have roughly 200 milliseconds between the last keystroke and the moment the suggestion must appear, or the developer ignores it. At GitHub Copilot's scale, that means serving the equivalent of the world's largest code autocomplete system, globally, with sub-200ms p50 latency, while running a completely different system — an autonomous multi-file editing agent — on the same codebase, with the same IDE surface, from the same backend.
The core architectural insight is that these are two distinct products that share a UI. Inline completion is a latency problem: a small model, a carefully packed prompt, and an aggressive cancellation strategy. Agent mode is a reliability and context problem: a frontier model, a tool execution loop, and a compaction strategy to prevent the context window from collapsing on a 20-file refactor. Conflating them — or trying to build one model that serves both — is the mistake that leads to a system that is mediocre at both.
This article focuses on the coding-specific latency, context retrieval, and edit-application problems. For the broader mechanics of agentic tool loops and durable state, see Design an AI Agent Platform.
Functional requirements
- Inline completions: show single-line or multi-line ghost text within 200ms of the last keystroke; support fill-in-the-middle (cursor in the middle of existing code).
- Context retrieval: use the current file, open tabs, recently edited files, and a semantic index of the repository to assemble relevant context.
- Cancellation: when the user types past the suggestion window, cancel the in-flight request immediately without a TCP teardown.
- Agent mode: accept a natural-language task; plan steps; call tools (read file, write file, run bash command, search codebase); apply multi-file diffs; observe test/lint output; loop until done or budget exhausted.
- Model routing: dispatch ghost-text requests to a fast small model and agent turns to a large frontier model.
- Streaming: stream both completion tokens and agent turn output to the IDE in real time.
- Privacy: customer code must not be used to train the base model; enterprise deployments must support fully on-prem inference with zero data retention.
- Telemetry: capture accept/reject per completion, token-level survival, per-turn latency, agent task completion rate, and error codes.
Non-functional requirements
- Ghost-text p50 latency < 200ms end-to-end (proxy → model → proxy → IDE); p99 < 500ms.
- Agent turn p50 latency < 5s (time to first meaningful output streamed); total task p99 < 5 minutes.
- Cancellation latency (HTTP/2 stream reset acknowledged) < 10ms from the proxy.
- Index freshness: repo embedding index updated within 60 seconds of a file save.
- Availability: > 99.9% for the completion path; > 99.5% for agent (longer tasks tolerate queuing).
- Multi-tenancy: per-organization token quotas; one org's traffic spike must not degrade another's latency.
- On-prem: enterprise customers can run inference on their own H100s with the same latency SLOs.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Completion requests/day | 400 million | GitHub Copilot production figure (InfoQ/QCon SF 2024) |
| Peak RPS | 8,000 | ~1.7x average (4,630 RPS average → 8,000 peak); bursty during business hours across time zones |
| Effective completions/sec (post-cancel) | ~4,000 at peak | 50% abandonment rate applied to 8,000 peak RPS; average is ~2,300 |
| Tokens per completion request | ~4,096 prompt + ~50 output | FIM prompt fills the context window; short output by design |
| Throughput per H100 (decode) | ~3,000 tokens/sec at batch size 32 | Conservative; continuous batching; small model (4B params) |
| GPU count for decode | ~80 H100s | 4,000 req/s × 50 tokens ÷ 3,000 tokens/s/GPU = 66.7; add ~20% buffer for prefill + overhead → 66.7 × 1.20 ≈ 80 |
| Embedding index size | ~820 MB per 100k-file repo | 100k files × 4 chunks × 512 dims × 4 bytes = 819 MB; larger repos proportional |
| Re-indexing latency | <60 seconds | GitHub Copilot March 2025 GA; Merkle-tree diff tracking limits what must be re-embedded |
| Agent task QPS | ~40-80 RPS | ~100x fewer than completions; longer per turn |
| Context window per agent turn | 8,000-200,000 tokens | Depends on frontier model; Claude Sonnet 4.5 at 200K, latest OpenAI frontier models at 1M+ |
Takeaway: The completion path is a high-throughput, low-latency serving problem identical to bulk inference — continuous batching and prefix caching are mandatory, not optional. The agent path is a low-QPS, high-context problem where context management and diff application are the failure modes to design against.
Building up to the design
V1: naive inline completion
The simplest possible system: the IDE sends the text before the cursor to a cloud API, the model returns a completion, the IDE inserts it. Three things kill it in the first week of production:
First, latency. A round trip to a 70B model takes 2-5 seconds for the first token. The developer has moved on. At GitHub Copilot's launch, the product team found that completions shown after 1,500ms had near-zero acceptance rate — developers had already written the next line themselves.
Second, cancellation. The user types fast. By the time the model responds, they have typed 5 more characters. The model is still computing a completion for a context that no longer exists. Without cancellation, every keystroke queues a completion request and the GPU farm is burning compute on stale prompts.
Third, context. The current file prefix is 20% of what matters. The other 80% is: what is imported, what is in the open tabs, what functions exist elsewhere in the repository. Without that context, the model suggests user_id when the codebase calls it userId, or imports a package that does not exist in this project.
V2: debounce + small model + basic context
Fix the latency by switching to a 1-7B FIM-tuned model (examples: JetBrains Mellum at 4B parameters, Cursor's Supermaven-derived model). Fix cancellation by adding a debounce: wait ~100ms after the last keystroke before firing the request (configurable in the 75-150ms range). If a new keystroke arrives, reset the timer and cancel the in-flight HTTP/2 stream — not a TCP teardown, just a stream reset, which the server handles in microseconds.
Fix basic context by adding the FIM format. Instead of just the prefix, send:
<fim_prefix>{current file content above cursor}<fim_suffix>{current file content below cursor}<fim_middle>
The model learns to generate the middle token sequence. Add the content of open editor tabs as additional context, ranked by Jaccard similarity (60-line sliding windows) to the cursor neighborhood. This gets acceptance rate to 20-25%.
The debounce itself filters ~40% of requests before they even leave the IDE — the 400M/day proxy count is the post-debounce figure, meaning debounce has already been applied upstream. The HTTP/2 cancellation catches another ~50% of those post-debounce requests that do reach the proxy. GitHub Copilot's production telemetry (David Cheney, QCon SF 2024) shows roughly half of proxy-received requests are cancelled this way (so ~50% of 400M complete) — a number that shapes every capacity estimate.
V3: semantic repo index + prompt budget controller
The remaining quality gap is cross-file context. The open tabs are fine when the developer has the right files open. They do not help when the right file is 3 hops away in the import graph, or when the developer is implementing an interface defined in a type file they never opened.
The fix is a background indexer. On first open of a repository, the system chunks the code into 100-250 token segments, embeds them into a 512-dimensional vector space, and stores them in a purpose-built vector store (Turbopuffer for Cursor, LanceDB for Continue). Merkle-tree diff tracking means that on each file save, only the modified chunks are re-embedded — GitHub's March 2025 launch of instant semantic indexing brought re-index latency from 5 minutes to under 60 seconds.
At query time, the context retrieval layer runs in parallel: a BM25 lexical search (fast, ~7ms, good for identifiers and exact names), a dense ANN query (~20ms, good for semantic similarity), and an LSP/tree-sitter query for the symbol definition at the cursor. Results are merged, ranked by relevance, and passed to the prompt budget controller.
The prompt budget controller has a fixed token budget — typically 2,048-8,192 tokens depending on the model. It fills slots in priority order:
- Current file prefix and suffix (non-negotiable)
- Import statements and type signatures from referenced modules
- Open tabs, ranked by Jaccard similarity
- Retrieved snippets from the semantic index
In practice, 40-60% of the token window is consumed by context the user did not explicitly supply. The priority order matters more than the window size for final quality.
This gets acceptance rate to 27-34% and closes most of the "wrong identifier name" class of errors.
V4: agent mode — plan, tool, observe, loop
flowchart TD
TASK["Developer task<br/>'refactor auth service'"] --> PLAN[Planner<br/>frontier model]
PLAN -->|"step list"| EX[Tool Executor]
EX -->|"ReadFile"| FS[File System]
EX -->|"EditFile"| DA[Diff Applier]
EX -->|"Bash"| SHELL[Shell<br/>tests / lint / build]
EX -->|"Search"| IDX[Repo Index]
SHELL -->|"test output"| OBS[Observation<br/>Assembler]
FS -->|"file content"| OBS
DA -->|"diff result"| OBS
IDX -->|"snippets"| OBS
OBS -->|"next prompt"| PLAN
PLAN -->|"DONE or FAILED"| USER[IDE surface]
style PLAN fill:#a855f7,color:#fff
style EX fill:#15803d,color:#fff
style DA fill:#ff2e88,color:#fff
style OBS fill:#ff6b1a,color:#0a0a0f
style SHELL fill:#0e7490,color:#fff
Agent mode takes a natural-language task and runs a ReAct loop: the planner emits a thought and a tool call, the executor runs the tool, the result goes back into the context, and the planner sees the full history and decides the next step. The loop continues until the planner emits a DONE signal or the budget is exhausted.
This is where the hard problems are. See the deep-dive sections below.
API
Completion endpoint
POST /v1/completions
Content-Type: application/json
Authorization: Bearer <token>
X-Request-ID: <uuid>
{
"editor_version": "vscode/1.89.0",
"file_path": "src/auth/service.ts",
"language_id": "typescript",
"prefix": "export class AuthService {\n async validateToken(",
"suffix": "\n ) {\n return this.jwt.verify(token);\n }\n}",
"open_files": [
{ "path": "src/auth/types.ts", "content": "..." },
{ "path": "src/auth/jwt.ts", "content": "..." }
],
"cursor_position": { "line": 2, "character": 28 },
"max_tokens": 100
}
HTTP/2 200 OK
Content-Type: text/event-stream
data: {"token": "token: string", "finish_reason": null}
data: {"token": ",\n options?: ValidateOptions", "finish_reason": null}
data: {"token": "", "finish_reason": "stop"}
Cancellation: the IDE sends an HTTP/2 RST_STREAM frame. The proxy acknowledges and releases the model slot immediately.
Agent task endpoint
POST /v1/agent/tasks
Content-Type: application/json
{
"task": "Refactor the AuthService to use the new JWTv2 library. Run tests after each file change.",
"repo_context": {
"root": "/workspace/myapp",
"active_file": "src/auth/service.ts"
},
"model": "claude-sonnet-4-5",
"budget": { "max_turns": 30, "max_tokens": 500000 }
}
# Response: task_id for SSE polling
# SSE stream emits: plan_step, tool_call, tool_result, diff_preview, task_complete
The schema
The completion system is largely stateless on the serving path. The persistent state is the semantic index.
-- Per-repository embedding index metadata
CREATE TABLE repo_index (
repo_id UUID PRIMARY KEY,
org_id UUID NOT NULL,
last_full_scan TIMESTAMPTZ,
chunk_count INTEGER,
index_version INTEGER,
merkle_root TEXT -- tracks which files need re-indexing
);
-- Individual chunk records (vector stored in Turbopuffer/LanceDB separately)
CREATE TABLE code_chunks (
chunk_id UUID PRIMARY KEY,
repo_id UUID REFERENCES repo_index(repo_id),
file_path TEXT NOT NULL,
start_line INTEGER,
end_line INTEGER,
language TEXT,
content TEXT,
embedding VECTOR(512),
indexed_at TIMESTAMPTZ DEFAULT now()
);
-- Agent task state
CREATE TABLE agent_tasks (
task_id UUID PRIMARY KEY,
org_id UUID NOT NULL,
status TEXT CHECK (status IN ('queued','running','done','failed','cancelled')),
task_prompt TEXT NOT NULL,
model TEXT NOT NULL,
turn_count INTEGER DEFAULT 0,
token_count BIGINT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT now(),
completed_at TIMESTAMPTZ,
transcript JSONB -- full message history, compacted at context thresholds
);
Architecture
The completion hot path
sequenceDiagram
participant IDE as IDE
participant PX as Edge Proxy
participant CTX as Context Layer
participant IDX as Repo Index
participant MOD as Completion Model
IDE->>PX: keystroke (char N)
Note over PX: debounce timer reset; cancel prior stream
PX-->>MOD: HTTP/2 RST_STREAM (prior request)
IDE->>PX: keystroke (char N+1, then idle 100ms)
PX->>CTX: assemble context request
CTX->>IDX: BM25 + ANN query (parallel, ~20ms)
IDX-->>CTX: top-10 relevant chunks
CTX->>CTX: rank + trim to token budget
CTX-->>PX: assembled FIM prompt
PX->>MOD: stream completion (HTTP/2)
MOD-->>PX: token stream
PX-->>IDE: ghost text overlay (first token ~50ms after model receives prompt)
IDE->>PX: user presses Tab (accept)
PX->>PX: log accept event (telemetry)
The geographic routing strategy matters here. GitHub Copilot rejected a CDN PoP model after discovering that traffic from Singapore was routing to West Coast US model servers anyway, adding two extra intercontinental hops. The fix is to co-locate the proxy and model in the same region — copilot-proxy uses octoDNS with continent/country/state resolution to route directly to the nearest region where both proxy and model run. Measuring SLO at the proxy layer (not the upstream model's metrics) is what surfaces this class of trombone routing.
The agent loop
The agent runs on a frontier model (Claude Sonnet 4.5, GPT-5.4, Gemini 2.5 Pro) with a large context window. Each turn: the planner sees the full transcript of prior steps plus the current tool results, emits a thought and a tool call (or DONE), the tool executor runs it, and the result is appended to the transcript.
Claude Code's implementation is instructive: a pure ReAct while-loop (queryLoop() in query.ts), no DAG orchestrator, 54 tools including Bash (universal shell execution), Read, Write, Edit, Grep, Glob, and Task (sub-agent spawning). Read-only tools run concurrently via a StreamingToolExecutor; write tools run sequentially to avoid races. On Bash errors, a sibling abort controller cancels other running subprocesses.
Cursor's approach differs: up to 8 parallel agents each isolated in a separate Git worktree, coordinated by a Composer model (RL-trained on edit sequences, 250 tokens/sec, <30s per turn). Parallel worktrees mean the agents can run test suites simultaneously without interfering with each other's file state.
For the diff application, the model emits either SEARCH/REPLACE blocks, unified diff hunks, or full file rewrites. Full file rewrite achieves the highest benchmark scores but is expensive for large files. SEARCH/REPLACE is efficient but requires exact string matching — a trailing space or CRLF/LF mismatch causes an apply failure. The apply layer should attempt exact match first, then fuzzy match, then fall back to requesting a whole-file rewrite before declaring failure.
FIM prompting and KV cache architecture
The most concrete performance win in the completion path is getting KV cache to work correctly with FIM prompts.
Standard prefix caching works by pinning stable context (system prompt, repo metadata) as a shared KV prefix. Any request that starts with the same prefix hits the cache and skips the prefill computation for those tokens. But FIM prompts break this: the format is <fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>. The suffix appears in the middle of the prompt and changes with every cursor position — so the suffix token range never matches a prior cached prefix, and cache hit rates drop to near zero.
EFIM (arXiv 2505.21889) restructures the format so the suffix appears as the first stable section and the prefix delta (only the tokens added since the last request) appears at the end. Because the suffix is stable across keystrokes — the code below the cursor does not change when the user types above it — it becomes an ideal cache anchor. Measured effect: 52% reduction in mean latency, 98% increase in throughput, from cache alone. No model change required.
JetBrains Mellum (arXiv 2510.05788) documents the SPM format (Suffix-Prefix-Middle) as slightly superior to PSM in FIM quality benchmarks, because the model generates the middle after seeing the suffix left-to-right, which is more natural. It also documents the alignment step: a base 4B LLaMA-style model trained on ~4 trillion tokens (The Stack v1/v2, StarCoder data) is SFT-fine-tuned on syntactic-boundary examples to avoid completions that break existing code structure, then DPO-aligned with an LLM-as-judge to suppress over-generated, multi-paragraph completions. Before DPO, the base model averaged 16 lines per completion; after, 7-8. Over-generated completions are rejected by users at higher rates and consume context window faster.
For serving, a speculative decoding setup pairs a tiny draft model (121M parameters) with the production 4B model. The draft proposes 5-8 tokens; the production model verifies in one parallel pass. At the acceptance rates typical of code completion (repetitive patterns, known API names), draft acceptance runs at ~51.5%, yielding 2-3x throughput improvement with no change to output quality. Note: arXiv 2603.05974 (MCCom) covers confidence-based local-cloud cascading, not speculative decoding — the speculative decoding figures cited here reflect the general technique applied to small-code-model serving.
Context retrieval layer
Context retrieval is a three-tier pipeline that runs every time a completion is requested.
Tier 1 — immediate context. The IDE sends the current file prefix and suffix, the language identifier, and the full content of open editor tabs. The proxy ranks open tabs by Jaccard similarity using 60-line sliding windows against the cursor neighborhood. This takes <1ms and is always included.
Tier 2 — local repo index. Code is chunked into 100-250 token segments at syntactic boundaries (tree-sitter parses the AST to find function and class boundaries so chunks do not split mid-expression). Chunks are embedded into 512-dimensional vectors and stored in a local vector store — Turbopuffer for Cursor, LanceDB (embedded TypeScript library, no separate process) for Continue. BM25 lexical search runs in parallel with ANN dense search. Results are fused and re-ranked. BM25 latency: ~7ms. ANN latency: ~20ms. Both run on the completion serving machines, not a separate service, to avoid an extra network hop.
Continue's granularity is illustrative: 10-line chunks produce roughly 1 million vectors for a 10 million line codebase (LanceDB blog). At 512 dims × 4 bytes = 2KB per vector, that is 2GB of raw embedding data — manageable in memory on a serving node.
Tier 3 — remote repo index. For cross-repository context (a monorepo with 100,000 files, or a workspace spanning multiple repos), a cloud index runs Zoekt trigram/regex/structural search plus a server-side embedding index. Sourcegraph Cody uses this for chat context (up to 10 remote repos simultaneously) but deliberately skips it for autocomplete — the added latency (network round trip + larger index) exceeds the completion latency budget. Only the fast tiers run on the hot path.
The prompt budget controller receives the merged candidates and fills the FIM prompt's context section in priority order: (1) current file prefix/suffix, (2) import statements and type signatures referenced at the cursor, (3) open tabs by Jaccard rank, (4) retrieved semantic snippets. It trims to fit the model's context window (typically 2,048-8,192 tokens for the completion model). In practice, 40-60% of the window is consumed by implicit context.
For a deeper treatment of hybrid retrieval with BM25 and dense ANN, see Design a RAG Pipeline and Design a Vector Database.
Model routing and cascading
The simplest routing strategy is latency class: ghost text → small model, chat → medium model, agent → large model. This is what GitHub Copilot does — a custom completion model (35% lower latency, 3x higher throughput, 12% higher acceptance rate vs. the prior model per the 2024 GitHub Blog post) handles all inline completions, while agent mode switches to Claude 3.7 Sonnet or GPT-5.4.
A more sophisticated strategy is confidence-based cascading, documented in MCCom (arXiv 2603.05974). A 121M local model handles completions first; if the confidence on the first 3 output tokens falls below a threshold (~0.8), the request escalates to a 7B cloud model. Production results: roughly 60% of requests stay local (235ms avg latency), ~40% escalate to cloud (675ms avg latency), blended average ~407ms (0.61 × 235 + 0.39 × 675) — compared to ~675ms if all requests hit the cloud. Cloud API costs drop ~60% with only 8.9% accuracy loss vs. cloud-only.
The local model also enables an enterprise on-prem option without the penalty of round-tripping to a remote endpoint. For Codeium/Windsurf's VMware Private AI Foundation deployment, the proprietary completion model runs on customer-provisioned hardware with zero data retention. The serving infrastructure is vLLM or TensorRT-LLM on NVIDIA H100s. The critical underestimation: a single 7B model on one A100 without continuous batching cannot hit sub-500ms p90 at production QPS; the enterprise cluster needs the same batching infrastructure as the cloud service.
Agent mode deep dive
Context management across a long task
Every tool call appends its result to the agent's transcript. A 20-file refactor with test runs generates 50-100 turns before it is done. Without compaction, a 200K-token context window is exhausted in 10-15 turns on verbose tool outputs.
Claude Code implements a 5-stage pipeline triggered at different thresholds:
- Budget Reduction: at 80% context fill, drop the least recent non-critical tool results.
- Snip: truncate very long tool outputs (e.g., a test run with 10,000 lines of output) to a tail summary.
- Microcompact: summarize completed subtask groups into a paragraph.
- Context Collapse: aggressive summarization of the full prior history into a structured summary.
- Auto-Compact: triggered automatically if the model would otherwise overflow; replaces the entire transcript with a dense summary.
The core trade-off is that aggressive compaction loses detail. An agent that compacted the first 30 turns may not remember that it already tried one approach and found it failing, causing it to re-attempt the same approach on the next planning step. Arize AI's production tracking of millions of agent decisions found hallucinations beginning to appear at ~70% context capacity and becoming severe at ~85% — meaning compaction should trigger at 65-70%, not 90%.
CLAUDE.md (a project-level instruction file loaded lazily at run start) and per-sub-agent sidechain transcripts are practical mitigations: they let the main agent externalize stable context (project conventions, architectural constraints) so it does not need to keep them in the rolling window.
The self-correction loop
The distinguishing feature of agent mode over a plain "generate a diff" command is that the agent observes tool output and adapts. The pattern:
- Planner emits a plan: "Edit AuthService to replace jwt.verify with jwtv2.verify."
- Tool executor applies the edit, then runs:
npx tsc --noEmit. - Compiler output: 3 errors — one import is missing, one type changed, one test mock needs updating.
- Compiler output goes back into the context.
- Planner sees the errors, emits targeted edits for each.
- Re-run tests: all pass.
- Planner emits DONE.
Aider's Architect/Editor split makes this concrete: the Architect model (e.g., o1-preview, which is good at reasoning but poor at formatting exact diffs) plans the approach and describes what to change; the Editor model (e.g., DeepSeek or o1-mini, which is faster and better at structured output) formats the actual SEARCH/REPLACE blocks. Measured result: o1-preview + DeepSeek achieves 85% pass rate on aider's benchmark, beating o1-preview solo (leaderboard figure at time of writing: ~77%; see aider.chat/docs/leaderboards, which updates over time). The cost is two API calls per agent turn — acceptable for batch tasks, prohibitive for interactive completions.
The self-correction loop has a stopping condition: a maximum turn count (typically 20-30) and a dollar budget per task. Without these guards, a confused agent that keeps failing the same test can run for hours. GitHub Copilot Agent Mode hit 56% on SWE-bench Verified (Claude 3.7 Sonnet, 2025); Claude Sonnet 4.5 solo hits 77.2% — meaning roughly 1 in 4 tasks still requires human intervention.
Tool execution and diff application
Tools available to the agent: ReadFile, WriteFile, EditFile (patch), Bash (shell command), Grep, Glob, WebFetch, and MCP servers (for external integrations). Claude Code runs concurrent read-only operations via a StreamingToolExecutor and serializes write operations to avoid write races. On a Bash error, a sibling abort controller cancels all other running subprocesses from the same planning step.
Diff application is the single most common failure mode. The four formats, from most fragile to most robust:
| Format | Example | Fragility | Best for |
|---|---|---|---|
| SEARCH/REPLACE | Match exact string, replace | Breaks on whitespace, CRLF, minor model drift | Small precise edits |
| Unified diff | @@ -3,7 +3,8 @@ hunks | Requires correct line numbers | Multi-line edits |
| Editor-diff | Simplified diff for Editor role | Requires model trained on format | Architect/Editor split |
| Whole file | Full file rewrite | Never fails to apply; expensive for large files | Files <500 lines |
Apply failure accounted for 60% of human interventions in Claude Code production data. The apply layer should: attempt exact SEARCH/REPLACE, then fuzzy match (normalize whitespace/CRLF), then ask the model to retry in whole-file format. Three retries covers >95% of apply failures.
For MCP-based integrations (connecting the agent to external systems like Linear, Jira, or database schemas), the Model Context Protocol provides a standard tool registration and invocation interface. See Model Context Protocol (MCP) for the protocol internals.
Evaluation and telemetry
The metrics are divided into online (production traffic) and offline (benchmark).
Online metrics:
- Acceptance Rate (AR): 27-34% for GitHub Copilot inline completions. Rises from ~29% in the first 3 months of use to ~34% after 6+ months (ACM Communications research) as the model learns the user's style and the user learns what to expect from the tool.
- Code survival rate: 88% of accepted Copilot suggestions survive to final commit. This is the real productivity signal; AR alone is gameable by producing short obvious completions.
- Ratio of Completed Code (RoCC): fraction of the final submitted code that was AI-generated; JetBrains lifted this from 0.25 to 0.39 on Python after their custom Mellum model launch.
- Strong AR: accepted AND not deleted within 30 seconds AND less than 50% edited. Filters out "accept and immediately rewrite" cases.
- Agent task completion rate: fraction of agent tasks completed without human intervention; SWE-bench Verified is the industry benchmark.
Offline benchmarks:
- HumanEval Infilling, RepoCoder, RepoEval, StmtEval for FIM quality.
- Pass@k (fraction of tasks where at least 1 of k samples passes all tests) for agent mode.
- Exact Match and token-level edit distance for completion precision.
A/B testing gates new models before fleet-wide rollout. The Copilot custom model (2024) was validated with a controlled rollout showing 35% latency reduction, 12% AR lift, and 20% more accepted-and-retained characters before replacing the prior model.
For the LLM inference serving infrastructure that underlies both the completion model and the agent model, see Design an LLM Inference & Serving System.
Edge cases and gotchas
Context rot in long agent sessions. Accuracy degrades before the context window is full. Arize AI's production tracking found hallucinations beginning at ~70% context fill and becoming severe by ~85%. An agent in a long refactor session may re-introduce code it deleted 30 turns ago, contradict a decision it already made, or loop on a sub-problem it already solved. Design compaction to trigger at 65%, not 90%.
Phantom package hallucination. About 20% of AI-generated code references non-existent libraries or APIs. In completion mode, the developer catches this when the import fails. In agent mode, the agent may run pip install nonexistent-package, observe the failure, then hallucinate an alternative package name, cycling through a cascade of wrong dependencies. Mitigation: before any install command, verify the package exists in the registry with a WebFetch call; lock to known-valid package lists for common languages.
Apply failure brittleness. SEARCH/REPLACE exact-match breaks on trailing whitespace, CRLF vs. LF, tab vs. space, or any minor model drift in how the search string was generated. In Claude Code production data, this was the cause of 60% of human interventions. The fix: fuzzy normalization first (strip trailing whitespace, normalize line endings), then retry with whole-file format. Do not present apply failure to the user until 3 retry strategies have been exhausted.
Stale index after large refactors. The semantic index reflects the file state at indexing time. If the agent renames 50 functions and then queries the semantic index, it may retrieve chunks referencing the old names, producing completions that re-introduce renamed identifiers. Mitigation: after each WriteFile or EditFile tool call, invalidate the index entries for the modified file and trigger a synchronous re-embed before the next query.
Cancelled-request waste without HTTP/2 stream reset. At Copilot's scale, 50% of completion requests are abandoned. Without stream reset, the model finishes generating a completion no one will see, consuming GPU-seconds and inflating cost. Teams instrumenting at the model API layer (not the proxy) miss this entirely — they see 100% utilization, attribute it to demand, and provision more GPUs, when the real fix is a proper cancellation path. The copilot-proxy in Go handles this at the HAProxy GLB layer, not the application layer.
Permission fatigue in agent mode. Claude Code production data shows 93% of permission prompts are approved by users. Over extended use, auto-approve rates grow from 20% to 40%+ of sessions. This is efficient but dangerous: an agent with auto-approve and a misguided plan can delete test fixtures, push to a main branch, or run a destructive migration. Solution: scope auto-approve to read-only operations by default; require explicit confirmation for write operations outside the active workspace.
Traffic trombone in global edge routing. GitHub experimented with routing completions to the nearest CDN PoP, expecting to reduce latency. In testing, Singapore traffic routed to a West Coast US PoP — which then forwarded to the actual model running in a US region anyway, adding two intercontinental hops compared to direct routing. The fix: co-locate proxy and model in the same cloud region; use octoDNS geo-routing to select the nearest region where both run, not the nearest network PoP.
On-prem latency shock. Moving from a managed cloud inference endpoint (H100 clusters, continuous batching, speculative decoding at scale) to a single customer-run server typically quadruples latency and halves throughput. A 7B model on one A100 without batching cannot meet sub-500ms p90 SLOs at production QPS. Enterprise on-prem requires the same continuous batching infrastructure — vLLM or TensorRT-LLM, multi-GPU tensor parallelism — not a naive single-process server.
Trade-offs to discuss in an interview
Plugin vs. IDE fork. Copilot and Sourcegraph Cody are IDE plugins — broad distribution, easy updates, constrained by plugin API. Cursor and Windsurf are full VS Code forks — direct file system access, terminal control, full editor event stream for training data, lower-overhead diff application. The fork gives a 2-3x advantage in diff application speed and richer telemetry, at the cost of a harder update cycle and smaller initial reach. Cursor's TAB model latency dropped measurably after the Supermaven acquisition — internal telemetry cited a roughly 40–45% reduction — partly attributable to the lower overhead of a fork vs. plugin API.
All-cloud vs. local+cloud cascade. GitHub Copilot runs all inference in Azure. MCCom and JetBrains Mellum run a local model for most completions, escalating to cloud only on low confidence. All-cloud is simpler operationally but adds ~200ms network round-trip. Local+cloud reduces latency and cost (~60% fewer cloud calls when ~60% of requests stay local) but requires model deployment on developer machines — fine for an M-series Mac, painful on a low-end Windows laptop.
Embedding-based retrieval vs. lexical search. Sourcegraph Cody originally used OpenAI's text-embedding-ada-002 for semantic search. They switched to native Sourcegraph search (BM25 + Zoekt trigram) to eliminate third-party code exposure and scale to 100,000+ file monorepos. Embeddings outperform lexical search on semantic queries; lexical search outperforms on exact identifier names and regex patterns. Hybrid retrieval wins on most benchmarks but adds operational complexity: two indexes to maintain and a fusion layer.
Architect/Editor split vs. single model. Separating reasoning (Architect: large model, extended thinking) from edit formatting (Editor: fast model, structured output) yields measurably better benchmark scores (85% vs. ~77% for o1-preview solo on Aider's benchmark; see aider.chat/docs/leaderboards). The cost is two API calls per agent turn — higher latency and cost. This is worth it for batch or async agent tasks; it is not worth it for interactive completions where every turn is user-facing.
Context window size vs. compaction quality. A 1M-token context window (GPT-5.4/5.5, Gemini 2.5 Pro) allows an agent to hold a large codebase in context without compaction. But larger context means slower prefill (quadratic attention cost before flash attention optimizations), higher cost per turn, and context rot effects that begin earlier in absolute token terms. Compaction is not just a workaround for small windows — it is also a quality intervention that focuses the model on the relevant recent history.
Things you should now be able to answer
- Why does inline completion need a different model than agent mode, and what are the specific latency targets for each?
- Describe the FIM prompt format (PSM vs. SPM) and explain why EFIM's restructuring achieves a 52% latency reduction without changing the model.
- How does the debounce-plus-HTTP/2-stream-reset cancellation pipeline work, and why does instrumenting at the model API layer miss the 50% cancellation signal?
- Walk through the three tiers of context retrieval and explain which tiers run on the hot path vs. the cold path, and why.
- What does the prompt budget controller prioritize, and what fraction of the context window is typically consumed by implicit context?
- Explain the self-correction loop in agent mode and name two concrete stopping conditions that must be enforced to prevent runaway cost.
- What are the four diff application formats, which is most fragile and why, and what is the retry strategy for apply failures?
- Why does context rot begin before the context window is full, and at what approximate fill level do hallucinations become severe in production agents?
- What is the trade-off between all-cloud inference and a local-plus-cloud cascade model, and what efficiency gains does MCCom demonstrate?
- How does "code never trains the base model" work architecturally, and what additional infrastructure is required for a fully on-prem enterprise deployment?
Further reading
Papers
- arXiv 2505.21889 (2025) — "EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse." FIM format restructuring that achieves 52% latency reduction and 98% throughput increase.
- arXiv 2510.05788 (JetBrains, 2025) — "Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding." 4B-param FIM model, SFT+DPO alignment, RoCC metrics, p90 latency 500ms on H100.
- arXiv 2603.05974 (2026) — "Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading (MCCom)." ~60% of requests handled locally, ~40% blended latency reduction vs. cloud-only (0.61 × 235 ms + 0.39 × 675 ms = 407 ms vs. 675 ms all-cloud), with only 8.9% accuracy loss.
- arXiv 2604.14228 (VILA Lab, 2026) — "Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems." Full architecture walkthrough: 54 tools, 7 permission modes, 5-layer context shaping.
- arXiv 2408.05344 (Sourcegraph, 2024) — "AI-assisted Coding with Cody: Lessons from Context Retrieval and Evaluation for Code Recommendations." Hybrid retrieval ablations, token budget handling.
- arXiv 2303.12570 (EMNLP 2023) — "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation."
Engineering blog posts
- David Cheney, GitHub / InfoQ, QCon SF Nov 2024 — "How GitHub Copilot Serves 400 Million Completion Requests a Day." Covers sub-200ms architecture, HTTP/2, proxy design, cancellation, geographic routing.
- GitHub Blog, 2024 — "The Road to Better Completions: Building a Faster, Smarter GitHub Copilot with a New Custom Model." 35% latency reduction, 12% AR lift, RL training details.
- Sourcegraph Blog — "How Cody Understands Your Codebase." BM25+embedding hybrid, Zoekt, tree-sitter for autocomplete.
- ByteByteGo Blog — "How Cursor Serves Billions of AI Code Completions Every Day." Turbopuffer, Merkle-tree re-indexing, Fireworks inference, AWS multi-region.
- LanceDB Blog — "The Future of AI-Native Development is Local: Inside Continue's LanceDB-Powered Evolution." 10-line chunks, 1M vectors for 10M-line codebase.
- Aider, Sep 2024 — "Separating Code Reasoning and Editing." o1-preview + DeepSeek 85% pass rate, four edit formats. https://aider.chat/2024/09/26/architect.html
Related articles
- Design an AI Agent Platform — the general-purpose agentic loop, durable state, idempotent tool execution, and safety guardrails that agent mode builds on.
- Design an LLM Inference & Serving System — continuous batching, PagedAttention, speculative decoding, and prefill/decode disaggregation — the serving infrastructure for both the fast completion model and the frontier agent model.
- Design a RAG Pipeline — end-to-end retrieval-augmented generation: chunking, hybrid retrieval, cross-encoder reranking, and RAGAS evaluation — the retrieval patterns the context layer builds on.
- Design a Vector Database — HNSW, IVF+PQ, quantization, filtered queries — the ANN internals behind the semantic repo index.
- Model Context Protocol (MCP) — the standard protocol for connecting agents to external tools and data sources, used by Cursor, Cody, and Claude Code.
- Design an LLM Gateway — routing, rate-limiting, cost tracking, and fallback across multiple LLM providers — the infrastructure the edge proxy builds on for multi-model routing.
Frequently asked questions
▸Why does inline completion need a different model than agent mode?
Ghost text must appear within 200ms or the developer has already typed past the suggestion. A 70B frontier model on a cold prompt takes 2-5 seconds for the first token. Completion uses a 1B-7B FIM-tuned model with greedy decoding, designed for sub-300ms server latency. Agent mode tolerates seconds to minutes per turn because the user asked for a multi-step task; it gets a frontier model (Claude Sonnet, GPT-5.4) with extended thinking and tool-use. Routing by latency class — not capability — is the core infrastructure decision.
▸What is fill-in-the-middle (FIM) prompting and why does format matter?
FIM prompts the model with a prefix (code before the cursor) and a suffix (code after) separately, asking it to generate the middle span. The two formats differ in order: PSM sends Prefix-Suffix-Middle, SPM sends Suffix-Prefix-Middle. SPM trains the model to generate left-to-right after seeing the suffix, which tests show performs slightly better. For serving, format matters for KV cache reuse: EFIM restructures prompts so the stable suffix appears first as a cacheable prefix — the suffix is stable across keystrokes because the code below the cursor does not change when the user types above it — yielding 52% latency reduction and 98% throughput increase from cache hits on the suffix alone.
▸How do you handle the 50% of completion requests that are abandoned mid-flight?
At GitHub Copilot scale (400 million requests/day), roughly half of requests are abandoned because the user kept typing before the model responds. Cancelling a request via HTTP/2 stream reset — not a TCP teardown — releases the server-side compute immediately. Without this, the model finishes generating a completion no one will see, burning GPU cycles and inflating cost. The Go-based proxy (copilot-proxy) tracks in-flight stream IDs and issues resets on new keystrokes. Teams instrumenting only at the model API layer miss this signal and over-estimate GPU utilization by nearly 2x.
▸What does "code never trains the base model" mean architecturally?
For enterprise customers, it is both a contractual and an architectural guarantee. User code is not written to any persistent training store; the inference path reads the code from the IDE, assembles a prompt, calls the model, and discards everything. Fine-tuning pipelines for the base completion model train only on licensed public code corpora. Enterprise air-gap goes further: inference runs on a customer-provisioned GPU cluster (vLLM or TensorRT-LLM on H100s), so code never leaves the customer network. Sourcegraph Cody dropped OpenAI embeddings specifically to eliminate this third-party exposure path.
▸How do you evaluate an AI coding assistant beyond acceptance rate?
Acceptance Rate (AR, typically 27-34%) is gameable — models that output short, obvious completions get accepted but provide little value. Better metrics: code survival rate (88% of accepted Copilot suggestions survive to final commit), Ratio of Completed Code (RoCC, used by JetBrains — lifted from 0.25 to 0.39 post-model update), and Strong AR (accepted AND not deleted AND less than 50% edited afterward). For agent mode: SWE-bench Verified pass rate (GitHub Copilot hit 56% with Claude 3.7 Sonnet) and per-task human-intervention rate are the real proxies for productivity.
You may also like
Model Context Protocol (MCP) and Tool-Use Infrastructure
How LLMs safely reach the outside world — from raw function calling to MCP, the open standard that collapses N×M bespoke integrations to N+M, with production-grade security, reliability, and a ~88% token reduction via deferred tool loading.
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.