
Advanced · asked at GitHub, Cursor, Sourcegraph, OpenAI, Anthropic

Design an AI Coding Assistant (Copilot / Cursor)

Architect a system that delivers inline ghost-text completions in under 200ms and drives an autonomous agent that edits dozens of files — the two-product architecture behind GitHub Copilot, Cursor, and Sourcegraph Cody at billions of completions per day.

31 min read · 2026-06-25 · Ironclad Academy

#interview #ai #llm #agents #search


The problem

In November 2023, GitHub reported that developers using Copilot completed tasks 55% faster in a controlled study. The tool had 1.3 million paid users at the time. By mid-2025, it was serving 400 million completion requests per day, handling 8,000 requests per second at peak. Cursor, a full VS Code fork built around AI editing, was processing 1 million transactions per second at peak and had crossed $500 million ARR in under two years.

The engineering problem is harder than it looks. The product surface is simple: a ghost text overlay in an editor. The constraint is brutal: you have roughly 200 milliseconds between the last keystroke and the moment the suggestion must appear, or the developer ignores it. At GitHub Copilot's scale, that means serving the equivalent of the world's largest code autocomplete system, globally, with sub-200ms p50 latency, while running a completely different system — an autonomous multi-file editing agent — on the same codebase, with the same IDE surface, from the same backend.

The core architectural insight is that these are two distinct products that share a UI. Inline completion is a latency problem: a small model, a carefully packed prompt, and an aggressive cancellation strategy. Agent mode is a reliability and context problem: a frontier model, a tool execution loop, and a compaction strategy to prevent the context window from collapsing on a 20-file refactor. Conflating them — or trying to build one model that serves both — is the mistake that leads to a system that is mediocre at both.

This article focuses on the coding-specific latency, context retrieval, and edit-application problems. For the broader mechanics of agentic tool loops and durable state, see Design an AI Agent Platform.

Functional requirements

Inline completions: show single-line or multi-line ghost text within 200ms of the last keystroke; support fill-in-the-middle (cursor in the middle of existing code).
Context retrieval: use the current file, open tabs, recently edited files, and a semantic index of the repository to assemble relevant context.
Cancellation: when the user types past the suggestion window, cancel the in-flight request immediately without a TCP teardown.
Agent mode: accept a natural-language task; plan steps; call tools (read file, write file, run bash command, search codebase); apply multi-file diffs; observe test/lint output; loop until done or budget exhausted.
Model routing: dispatch ghost-text requests to a fast small model and agent turns to a large frontier model.
Streaming: stream both completion tokens and agent turn output to the IDE in real time.
Privacy: customer code must not be used to train the base model; enterprise deployments must support fully on-prem inference with zero data retention.
Telemetry: capture accept/reject per completion, token-level survival, per-turn latency, agent task completion rate, and error codes.

Non-functional requirements

Ghost-text p50 latency < 200ms end-to-end (proxy → model → proxy → IDE); p99 < 500ms.
Agent turn p50 latency < 5s (time to first meaningful output streamed); total task p99 < 5 minutes.
Cancellation latency (HTTP/2 stream reset received to model slot released) < 10ms at the proxy.
Index freshness: repo embedding index updated within 60 seconds of a file save.
Availability: > 99.9% for the completion path; > 99.5% for agent (longer tasks tolerate queuing).
Multi-tenancy: per-organization token quotas; one org's traffic spike must not degrade another's latency.
On-prem: enterprise customers can run inference on their own H100s with the same latency SLOs.

Capacity estimation

Dimension	Estimate	How we got there
Completion requests/day	400 million	GitHub Copilot production figure (InfoQ/QCon SF 2024)
Peak RPS	8,000	~1.7x average (4,630 RPS average → 8,000 peak); bursty during business hours across time zones
Effective completions/sec (post-cancel)	~4,000 at peak	50% abandonment rate applied to 8,000 peak RPS; average is ~2,300
Tokens per completion request	~4,096 prompt + ~50 output	FIM prompt fills the context window; short output by design
Throughput per H100 (decode)	~3,000 tokens/sec at batch size 32	Conservative; continuous batching; small model (4B params)
GPU count for decode	~80 H100s	4,000 req/s × 50 tokens ÷ 3,000 tokens/s/GPU = 66.7; add ~20% buffer for prefill + overhead → 66.7 × 1.20 ≈ 80
Embedding index size	~820 MB per 100k-file repo	100k files × 4 chunks × 512 dims × 4 bytes = 819 MB; larger repos proportional
Re-indexing latency	<60 seconds	GitHub Copilot March 2025 GA; Merkle-tree diff tracking limits what must be re-embedded
Agent task QPS	~40-80 RPS	~100x fewer than completions; longer per turn
Context window per agent turn	8,000-200,000 tokens	Depends on frontier model; Claude Sonnet 4.5 at 200K, latest OpenAI frontier models at 1M+

Takeaway: The completion path is a high-throughput, low-latency serving problem with the token volume of bulk inference, so continuous batching and prefix caching are mandatory. The agent path is a low-QPS, high-context problem where context management and diff application are the failure modes to design against.
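The arithmetic behind the table can be re-run in a few lines. This is a sketch that uses only the figures quoted above; the variable names are illustrative.

```python
# Back-of-envelope capacity math using only the figures from the table above.
SECONDS_PER_DAY = 86_400

requests_per_day = 400_000_000
avg_rps = requests_per_day / SECONDS_PER_DAY          # ~4,630
peak_rps = 8_000
cancel_rate = 0.50                                    # share of proxy requests cancelled
effective_peak_rps = peak_rps * (1 - cancel_rate)     # 4,000

output_tokens = 50
decode_tokens_per_gpu = 3_000                         # tokens/sec per H100 (table estimate)
gpus_raw = effective_peak_rps * output_tokens / decode_tokens_per_gpu
gpus_with_buffer = gpus_raw * 1.20                    # +20% for prefill and overhead

files, chunks_per_file, dims, bytes_per_float = 100_000, 4, 512, 4
index_bytes = files * chunks_per_file * dims * bytes_per_float

print(f"avg RPS {avg_rps:,.0f}, peak/avg {peak_rps / avg_rps:.2f}x")
print(f"decode GPUs: {gpus_raw:.1f} raw, {gpus_with_buffer:.0f} with buffer")
print(f"index size: {index_bytes / 1e6:.0f} MB")
```

Changing `cancel_rate` or `decode_tokens_per_gpu` shows how sensitive the GPU count is to the two least certain inputs.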

Building up to the design

V1: naive inline completion

The simplest possible system: the IDE sends the text before the cursor to a cloud API, the model returns a completion, the IDE inserts it. Three things kill it in the first week of production:

First, latency. A round trip to a 70B model takes 2-5 seconds for the first token. The developer has moved on. At GitHub Copilot's launch, the product team found that completions shown after 1,500ms had near-zero acceptance rate — developers had already written the next line themselves.

Second, cancellation. The user types fast. By the time the model responds, they have typed 5 more characters. The model is still computing a completion for a context that no longer exists. Without cancellation, every keystroke queues a completion request and the GPU farm is burning compute on stale prompts.

Third, context. The current file prefix is 20% of what matters. The other 80% is: what is imported, what is in the open tabs, what functions exist elsewhere in the repository. Without that context, the model suggests user_id when the codebase calls it userId, or imports a package that does not exist in this project.

V2: debounce + small model + basic context

Fix the latency by switching to a 1-7B FIM-tuned model (examples: JetBrains Mellum at 4B parameters, Cursor's Supermaven-derived model). Fix cancellation by adding a debounce: wait ~100ms after the last keystroke before firing the request (configurable in the 75-150ms range). If a new keystroke arrives, reset the timer and cancel the in-flight HTTP/2 stream — not a TCP teardown, just a stream reset, which the server handles in microseconds.
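The debounce-and-cancel behavior can be sketched with `asyncio`. This is a minimal illustration, not any editor's actual client: `Debouncer` and the `send` coroutine are hypothetical names, and cancelling the task stands in for the HTTP/2 stream reset a real client would issue.

```python
import asyncio

class Debouncer:
    """Fire `send` only after `delay` seconds without a new keystroke."""

    def __init__(self, send, delay=0.1):
        self._send = send          # hypothetical coroutine that issues the request
        self._delay = delay
        self._task = None

    def on_keystroke(self, prefix):
        # Cancelling covers both a pending timer and an in-flight request.
        # A real client would translate the latter into an RST_STREAM frame.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._run(prefix))

    async def _run(self, prefix):
        await asyncio.sleep(self._delay)
        await self._send(prefix)


async def demo():
    sent = []

    async def fake_send(prefix):
        sent.append(prefix)

    debouncer = Debouncer(fake_send, delay=0.05)
    for text in ["d", "de", "def"]:      # three fast keystrokes
        debouncer.on_keystroke(text)
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.1)             # idle long enough for the timer to fire
    return sent
```

Running `asyncio.run(demo())` issues a single request for the final text, because each keystroke cancels the task started by the previous one.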

Fix basic context by adding the FIM format. Instead of just the prefix, send:

<fim_prefix>{current file content above cursor}<fim_suffix>{current file content below cursor}<fim_middle>

The model learns to generate the middle token sequence. Add the content of open editor tabs as additional context, ranked by Jaccard similarity (60-line sliding windows) to the cursor neighborhood. This gets acceptance rate to 20-25%.
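The tab ranking can be sketched as follows. The tokenization (identifier-like words) and the function names are assumptions for illustration; the source only specifies Jaccard similarity over 60-line sliding windows.

```python
import re

def tokens(lines):
    """Set of identifier-like tokens in a list of source lines."""
    return set(re.findall(r"[A-Za-z_]\w*", "\n".join(lines)))

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def best_window_score(cursor_lines, tab_lines, window=60):
    """Highest Jaccard score of any `window`-line slice of a tab vs. the cursor area."""
    target = tokens(cursor_lines)
    last_start = max(1, len(tab_lines) - window + 1)
    return max(
        jaccard(target, tokens(tab_lines[i:i + window]))
        for i in range(last_start)
    )

def rank_tabs(cursor_lines, tabs, window=60):
    """Return tab paths ordered from most to least similar."""
    scored = {path: best_window_score(cursor_lines, lines, window)
              for path, lines in tabs.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

A tab that shares identifiers such as `userId` with the code around the cursor ranks above an unrelated stylesheet, which is the property that steers the model toward the project's own naming.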

The debounce itself filters ~40% of requests before they leave the IDE, so the 400M/day proxy count is already a post-debounce figure. HTTP/2 cancellation then removes roughly half of the requests that do reach the proxy: GitHub Copilot's production telemetry (David Cheney, QCon SF 2024) shows about 50% of proxy-received requests are cancelled this way, leaving ~200M/day that run to completion. That number shapes every capacity estimate.

V3: semantic repo index + prompt budget controller

The remaining quality gap is cross-file context. The open tabs are fine when the developer has the right files open. They do not help when the right file is 3 hops away in the import graph, or when the developer is implementing an interface defined in a type file they never opened.

The fix is a background indexer. On first open of a repository, the system chunks the code into 100-250 token segments, embeds them into a 512-dimensional vector space, and stores them in a purpose-built vector store (Turbopuffer for Cursor, LanceDB for Continue). Merkle-tree diff tracking means that on each file save, only the modified chunks are re-embedded — GitHub's March 2025 launch of instant semantic indexing brought re-index latency from 5 minutes to under 60 seconds.
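Merkle-tree diff tracking can be illustrated with a small in-memory tree. This is a sketch under simplifying assumptions: the repository is a nested dict with file contents as string leaves, and the helper names are hypothetical.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build(tree):
    """tree: nested dict; str leaves are file contents. Returns (hash, children)."""
    if isinstance(tree, str):
        return (h(tree.encode()), None)
    children = {name: build(sub) for name, sub in sorted(tree.items())}
    digest = h("".join(f"{n}:{c[0]};" for n, c in children.items()).encode())
    return (digest, children)

def changed_files(old, new, prefix=""):
    """Paths in `new` whose content differs from `old`; identical subtrees are skipped.
    Deleted files are not reported here and would need a separate pass."""
    if old is not None and old[0] == new[0]:
        return []                      # same hash: nothing below needs re-embedding
    if new[1] is None:
        return [prefix.rstrip("/")]    # changed or newly added file
    old_children = old[1] if old is not None and old[1] is not None else {}
    out = []
    for name, node in new[1].items():
        out += changed_files(old_children.get(name), node, f"{prefix}{name}/")
    return out
```

Because a directory's hash covers everything beneath it, an unchanged directory is dismissed with one comparison, which is what keeps re-indexing proportional to the size of the edit instead of the size of the repository.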

At query time, the context retrieval layer runs in parallel: a BM25 lexical search (fast, ~7ms, good for identifiers and exact names), a dense ANN query (~20ms, good for semantic similarity), and an LSP/tree-sitter query for the symbol definition at the cursor. Results are merged, ranked by relevance, and passed to the prompt budget controller.
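The source says only that results are merged and ranked. One common way to combine ranked lists from different retrievers is reciprocal rank fusion (RRF), used here as an assumed stand-in; the constant `k=60` is the conventional default, not a figure from the article.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Merge ranked lists of chunk ids; each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]
```

RRF needs only ranks, not comparable scores, which is convenient when BM25 scores and vector distances live on different scales.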

The prompt budget controller has a fixed token budget — typically 2,048-8,192 tokens depending on the model. It fills slots in priority order:

Current file prefix and suffix (non-negotiable)
Import statements and type signatures from referenced modules
Open tabs, ranked by Jaccard similarity
Retrieved snippets from the semantic index

In practice, 40-60% of the token window is consumed by context the user did not explicitly supply. The priority order matters more than the window size for final quality.
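The priority-order filling can be sketched as a greedy packer. This is a simplified illustration: `count_tokens` is a crude word counter standing in for a real tokenizer, and the sketch assumes the current-file slot fits within the budget, whereas a production controller would truncate it instead of dropping it.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())

def pack_prompt(slots, budget):
    """slots: list of (name, [snippets]) in priority order.
    Returns the chosen (name, snippet) pairs and the tokens used."""
    chosen, used = [], 0
    for name, snippets in slots:
        for snippet in snippets:
            cost = count_tokens(snippet)
            if used + cost > budget:
                continue            # skip what does not fit; a smaller item may still fit
            chosen.append((name, snippet))
            used += cost
    return chosen, used
```

Because slots are visited in priority order, a retrieved snippet can only take space that higher-priority context left unused.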

This gets acceptance rate to 27-34% and closes most of the "wrong identifier name" class of errors.

V4: agent mode — plan, tool, observe, loop

flowchart TD
    TASK["Developer task<br/>'refactor auth service'"] --> PLAN[Planner<br/>frontier model]
    PLAN -->|"step list"| EX[Tool Executor]
    EX -->|"ReadFile"| FS[File System]
    EX -->|"EditFile"| DA[Diff Applier]
    EX -->|"Bash"| SHELL[Shell<br/>tests / lint / build]
    EX -->|"Search"| IDX[Repo Index]
    SHELL -->|"test output"| OBS[Observation<br/>Assembler]
    FS -->|"file content"| OBS
    DA -->|"diff result"| OBS
    IDX -->|"snippets"| OBS
    OBS -->|"next prompt"| PLAN
    PLAN -->|"DONE or FAILED"| USER[IDE surface]
    style PLAN fill:#a855f7,color:#fff
    style EX fill:#15803d,color:#fff
    style DA fill:#ff2e88,color:#fff
    style OBS fill:#ff6b1a,color:#0a0a0f
    style SHELL fill:#0e7490,color:#fff

Agent mode takes a natural-language task and runs a ReAct loop: the planner emits a thought and a tool call, the executor runs the tool, the result goes back into the context, and the planner sees the full history and decides the next step. The loop continues until the planner emits a DONE signal or the budget is exhausted.
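The loop itself is small. The sketch below assumes a hypothetical `planner` callable that returns either a tool call or a done signal, and a `tools` dict of plain functions; it is not the implementation of any named product.

```python
def run_agent(task, planner, tools, max_turns=30):
    """planner(transcript) -> ("tool", name, arg) or ("done", summary)."""
    transcript = [("task", task)]
    for _ in range(max_turns):
        action = planner(transcript)
        if action[0] == "done":
            return "DONE", transcript
        _, name, arg = action
        try:
            observation = tools[name](arg)
        except Exception as exc:          # feed failures back instead of crashing
            observation = f"error: {exc}"
        transcript.append(("tool_call", name, arg))
        transcript.append(("observation", observation))
    return "FAILED", transcript            # turn budget exhausted
```

Tool errors become observations, so the planner can react to them on the next turn, and the turn cap guarantees the loop terminates.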

This is where the hard problems are. See the deep-dive sections below.

API

Completion endpoint

POST /v1/completions
Content-Type: application/json
Authorization: Bearer <token>
X-Request-ID: <uuid>

{
  "editor_version": "vscode/1.89.0",
  "file_path": "src/auth/service.ts",
  "language_id": "typescript",
  "prefix": "export class AuthService {\n  async validateToken(",
  "suffix": "\n  ) {\n    return this.jwt.verify(token);\n  }\n}",
  "open_files": [
    { "path": "src/auth/types.ts", "content": "..." },
    { "path": "src/auth/jwt.ts", "content": "..." }
  ],
  "cursor_position": { "line": 2, "character": 28 },
  "max_tokens": 100
}

HTTP/2 200 OK
Content-Type: text/event-stream

data: {"token": "token: string", "finish_reason": null}
data: {"token": ",\n  options?: ValidateOptions", "finish_reason": null}
data: {"token": "", "finish_reason": "stop"}

Cancellation: the IDE sends an HTTP/2 RST_STREAM frame. RST_STREAM has no acknowledgment frame, so the proxy simply stops forwarding tokens and releases the model slot immediately.
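On the client side, the streamed events are concatenated into ghost text. A minimal sketch, assuming the event shape shown in the example response above; `assemble_completion` is a hypothetical helper name.

```python
import json

def assemble_completion(sse_lines):
    """Concatenate `token` fields from `data:` lines until a finish_reason arrives."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue                     # ignore blank lines, comments, other fields
        event = json.loads(line[len("data:"):].strip())
        parts.append(event["token"])
        if event["finish_reason"] is not None:
            break
    return "".join(parts)
```

A real client would render each token as it arrives instead of waiting for the final event; the accumulation logic is the same.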

Agent task endpoint

POST /v1/agent/tasks
Content-Type: application/json

{
  "task": "Refactor the AuthService to use the new JWTv2 library. Run tests after each file change.",
  "repo_context": {
    "root": "/workspace/myapp",
    "active_file": "src/auth/service.ts"
  },
  "model": "claude-sonnet-4-5",
  "budget": { "max_turns": 30, "max_tokens": 500000 }
}

# Response: a task_id, used to subscribe to the task's SSE stream
# SSE stream emits: plan_step, tool_call, tool_result, diff_preview, task_complete

The schema

The completion system is largely stateless on the serving path. The persistent state is the semantic index.

-- Per-repository embedding index metadata
CREATE TABLE repo_index (
  repo_id       UUID PRIMARY KEY,
  org_id        UUID NOT NULL,
  last_full_scan TIMESTAMPTZ,
  chunk_count   INTEGER,
  index_version INTEGER,
  merkle_root   TEXT  -- tracks which files need re-indexing
);

-- Individual chunk records. The embedding column assumes the pgvector extension; drop it if vectors live only in Turbopuffer/LanceDB.
CREATE TABLE code_chunks (
  chunk_id    UUID PRIMARY KEY,
  repo_id     UUID REFERENCES repo_index(repo_id),
  file_path   TEXT NOT NULL,
  start_line  INTEGER,
  end_line    INTEGER,
  language    TEXT,
  content     TEXT,
  embedding   VECTOR(512),
  indexed_at  TIMESTAMPTZ DEFAULT now()
);

-- Agent task state
CREATE TABLE agent_tasks (
  task_id       UUID PRIMARY KEY,
  org_id        UUID NOT NULL,
  status        TEXT CHECK (status IN ('queued','running','done','failed','cancelled')),
  task_prompt   TEXT NOT NULL,
  model         TEXT NOT NULL,
  turn_count    INTEGER DEFAULT 0,
  token_count   BIGINT DEFAULT 0,
  created_at    TIMESTAMPTZ DEFAULT now(),
  completed_at  TIMESTAMPTZ,
  transcript    JSONB  -- full message history, compacted at context thresholds
);

Architecture

The completion hot path

sequenceDiagram
    participant IDE as IDE
    participant PX as Edge Proxy
    participant CTX as Context Layer
    participant IDX as Repo Index
    participant MOD as Completion Model

    Note over IDE: keystroke (char N), debounce timer reset
    IDE->>PX: HTTP/2 RST_STREAM (prior request)
    PX-->>MOD: cancel prior generation
    Note over IDE: keystroke (char N+1), then idle 100ms
    IDE->>PX: completion request (prefix, suffix, open tabs)
    PX->>CTX: assemble context request
    CTX->>IDX: BM25 + ANN query (parallel, ~20ms)
    IDX-->>CTX: top-10 relevant chunks
    CTX->>CTX: rank + trim to token budget
    CTX-->>PX: assembled FIM prompt
    PX->>MOD: stream completion (HTTP/2)
    MOD-->>PX: token stream
    PX-->>IDE: ghost text overlay (first token ~50ms after model receives prompt)
    IDE->>PX: user presses Tab (accept)
    PX->>PX: log accept event (telemetry)

The geographic routing strategy matters here. GitHub Copilot rejected a CDN PoP model after discovering that traffic from Singapore was routing to West Coast US model servers anyway, adding two extra intercontinental hops. The fix is to co-locate the proxy and model in the same region — copilot-proxy uses octoDNS with continent/country/state resolution to route directly to the nearest region where both proxy and model run. Measuring SLO at the proxy layer (not the upstream model's metrics) is what surfaces this class of trombone routing.

The agent loop

The agent runs on a frontier model (Claude Sonnet 4.5, GPT-5.4, Gemini 2.5 Pro) with a large context window. Each turn: the planner sees the full transcript of prior steps plus the current tool results, emits a thought and a tool call (or DONE), the tool executor runs it, and the result is appended to the transcript.

Claude Code's implementation is instructive: a pure ReAct while-loop (queryLoop() in query.ts), no DAG orchestrator, 54 tools including Bash (universal shell execution), Read, Write, Edit, Grep, Glob, and Task (sub-agent spawning). Read-only tools run concurrently via a StreamingToolExecutor; write tools run sequentially to avoid races. On Bash errors, a sibling abort controller cancels other running subprocesses.

Cursor's approach differs: up to 8 parallel agents each isolated in a separate Git worktree, coordinated by a Composer model (RL-trained on edit sequences, 250 tokens/sec, <30s per turn). Parallel worktrees mean the agents can run test suites simultaneously without interfering with each other's file state.

For the diff application, the model emits SEARCH/REPLACE blocks, unified diff hunks, or full file rewrites. Full file rewrite achieves the highest benchmark scores but is expensive for large files. SEARCH/REPLACE is efficient but requires exact string matching: a trailing space or CRLF/LF mismatch causes an apply failure. The apply layer should attempt exact match first, then fuzzy match, then fall back to requesting a whole-file rewrite before declaring failure.

FIM prompting and KV cache architecture

The most concrete performance win in the completion path is getting KV cache to work correctly with FIM prompts.

Standard prefix caching works by pinning stable context (system prompt, repo metadata) as a shared KV prefix. Any request that starts with the same prefix hits the cache and skips the prefill computation for those tokens. But FIM prompts break this: the format is <fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>. The suffix sits after the prefix, and the prefix grows with every keystroke, so the KV entries computed for the suffix depend on tokens that have changed. Even though the suffix text is the same, its token range never matches a prior cached prefix, and cache hit rates for that portion drop to near zero.

EFIM (arXiv 2505.21889) restructures the format so the suffix appears as the first stable section and the prefix delta (only the tokens added since the last request) appears at the end. Because the suffix is stable across keystrokes — the code below the cursor does not change when the user types above it — it becomes an ideal cache anchor. Measured effect: 52% reduction in mean latency, 98% increase in throughput, from cache alone. No model change required.
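The effect of prompt ordering on prefix-cache reuse can be seen in a toy example. This is an illustration only: prompts are lists of pre-split tokens, `suffix_first` is a simplified reordering and not EFIM's exact prompt layout, and reuse is measured as the length of the shared leading token run between two consecutive requests.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def psm(prefix, suffix):
    return ["<fim_prefix>", *prefix, "<fim_suffix>", *suffix, "<fim_middle>"]

def suffix_first(prefix, suffix):
    # Toy reordering for illustration only; not EFIM's exact prompt layout.
    return ["<fim_suffix>", *suffix, "<fim_prefix>", *prefix, "<fim_middle>"]

suffix = ["return", "x", "}"]
before = ["def", "f", "(", "x"]
after = before + [")"]                  # the user typed one more token

for layout in (psm, suffix_first):
    old, new = layout(before, suffix), layout(after, suffix)
    reused = shared_prefix_len(old, new)
    print(f"{layout.__name__}: reuse {reused} of {len(new)} prompt tokens")
```

In the prefix-first layout the shared run ends where the new token was typed, so the suffix is always recomputed; in the suffix-first layout the shared run covers the suffix and the old prefix, leaving only the newly typed tokens to prefill.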

JetBrains Mellum (arXiv 2510.05788) documents the SPM format (Suffix-Prefix-Middle) as slightly superior to PSM in FIM quality benchmarks, because the model generates the middle after seeing the suffix left-to-right, which is more natural. It also documents the alignment step: a base 4B LLaMA-style model trained on ~4 trillion tokens (The Stack v1/v2, StarCoder data) is SFT-fine-tuned on syntactic-boundary examples to avoid completions that break existing code structure, then DPO-aligned with an LLM-as-judge to suppress over-generated, multi-paragraph completions. Before DPO, the base model averaged 16 lines per completion; after, 7-8. Over-generated completions are rejected by users at higher rates and consume context window faster.

For serving, a speculative decoding setup pairs a tiny draft model (121M parameters) with the production 4B model. The draft proposes 5-8 tokens; the production model verifies them in one parallel pass. On code completion workloads (repetitive patterns, known API names), draft acceptance runs at ~51.5%, yielding a 2-3x throughput improvement with no change to output quality. Note: arXiv 2603.05974 (MCCom) covers confidence-based local-cloud cascading, not speculative decoding; the speculative decoding figures cited here reflect the general technique applied to small-code-model serving.
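A rough way to relate acceptance rate to speedup is the standard expected-tokens-per-pass formula from the speculative decoding literature. The sketch assumes each draft token is accepted independently with the same probability, which real workloads only approximate, and it ignores the cost of running the draft model.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model pass, assuming each draft token
    is accepted independently with probability `accept_rate`."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

for draft_len in (5, 8):
    tokens = expected_tokens_per_pass(0.515, draft_len)
    print(f"draft_len={draft_len}: {tokens:.2f} tokens per target pass")
```

Under this simplified model, lengthening the draft beyond a handful of tokens adds little at a ~51.5% acceptance rate, because long runs of accepted tokens are rare.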

Context retrieval layer

Context retrieval is a three-tier pipeline that runs every time a completion is requested.

Tier 1 — immediate context. The IDE sends the current file prefix and suffix, the language identifier, and the full content of open editor tabs. The proxy ranks open tabs by Jaccard similarity using 60-line sliding windows against the cursor neighborhood. This takes <1ms and is always included.

Tier 2: local repo index. Code is chunked into 100-250 token segments at syntactic boundaries (tree-sitter parses the AST to find function and class boundaries so chunks do not split mid-expression). Chunks are embedded into 512-dimensional vectors and stored in a vector store: Turbopuffer (a hosted service) for Cursor, LanceDB (embedded TypeScript library, no separate process) for Continue. BM25 lexical search runs in parallel with ANN dense search. Results are fused and re-ranked. BM25 latency: ~7ms. ANN latency: ~20ms. Both run on the completion serving machines, not a separate service, to avoid an extra network hop.

Continue's granularity is illustrative: 10-line chunks produce roughly 1 million vectors for a 10 million line codebase (LanceDB blog). At 512 dims × 4 bytes = 2KB per vector, that is 2GB of raw embedding data — manageable in memory on a serving node.

Tier 3 — remote repo index. For cross-repository context (a monorepo with 100,000 files, or a workspace spanning multiple repos), a cloud index runs Zoekt trigram/regex/structural search plus a server-side embedding index. Sourcegraph Cody uses this for chat context (up to 10 remote repos simultaneously) but deliberately skips it for autocomplete — the added latency (network round trip + larger index) exceeds the completion latency budget. Only the fast tiers run on the hot path.

The prompt budget controller receives the merged candidates and fills the FIM prompt's context section in priority order: (1) current file prefix/suffix, (2) import statements and type signatures referenced at the cursor, (3) open tabs by Jaccard rank, (4) retrieved semantic snippets. It trims to fit the model's context window (typically 2,048-8,192 tokens for the completion model). In practice, 40-60% of the window is consumed by implicit context.

For a deeper treatment of hybrid retrieval with BM25 and dense ANN, see Design a RAG Pipeline and Design a Vector Database.

Model routing and cascading

The simplest routing strategy is latency class: ghost text → small model, chat → medium model, agent → large model. This is what GitHub Copilot does — a custom completion model (35% lower latency, 3x higher throughput, 12% higher acceptance rate vs. the prior model per the 2024 GitHub Blog post) handles all inline completions, while agent mode switches to Claude 3.7 Sonnet or GPT-5.4.

A more sophisticated strategy is confidence-based cascading, documented in MCCom (arXiv 2603.05974). A 121M local model handles completions first; if the confidence on the first 3 output tokens falls below a threshold (~0.8), the request escalates to a 7B cloud model. Production results: roughly 61% of requests stay local (235ms avg latency) and ~39% escalate to cloud (675ms avg latency), for a blended average of ~407ms (0.61 × 235 + 0.39 × 675), compared to ~675ms if all requests hit the cloud. Cloud API costs drop ~60% with only 8.9% accuracy loss vs. cloud-only.
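The escalation rule and the blended-latency arithmetic can be sketched as below. How confidence is aggregated over the first tokens is not specified in the text above, so the mean of the token probabilities is an assumption, as are the function names.

```python
def should_escalate(token_probs, threshold=0.8, window=3):
    """Escalate when the mean probability of the first `window` tokens is below threshold.
    An empty output also escalates."""
    head = token_probs[:window]
    return not head or sum(head) / len(head) < threshold

def blended_latency_ms(local_share, local_ms, cloud_ms):
    """Average latency when `local_share` of requests are answered locally."""
    return local_share * local_ms + (1 - local_share) * cloud_ms

print(f"{blended_latency_ms(0.61, 235, 675):.1f} ms")
```

Raising the threshold sends more traffic to the cloud model, trading latency and cost for accuracy; the blended figure moves linearly with the local share.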

The local model also enables an enterprise on-prem option without the penalty of round-tripping to a remote endpoint. For Codeium/Windsurf's VMware Private AI Foundation deployment, the proprietary completion model runs on customer-provisioned hardware with zero data retention. The serving infrastructure is vLLM or TensorRT-LLM on NVIDIA H100s. The critical underestimation: a single 7B model on one A100 without continuous batching cannot hit sub-500ms p90 at production QPS; the enterprise cluster needs the same batching infrastructure as the cloud service.

Agent mode deep dive

Context management across a long task

Every tool call appends its result to the agent's transcript. A 20-file refactor with test runs generates 50-100 turns before it is done. Without compaction, a 200K-token context window is exhausted in 10-15 turns on verbose tool outputs.

Claude Code implements a 5-stage pipeline triggered at different thresholds:

Budget Reduction: at 80% context fill, drop the least recent non-critical tool results.
Snip: truncate very long tool outputs (e.g., a test run with 10,000 lines of output) to a tail summary.
Microcompact: summarize completed subtask groups into a paragraph.
Context Collapse: aggressive summarization of the full prior history into a structured summary.
Auto-Compact: triggered automatically if the model would otherwise overflow; replaces the entire transcript with a dense summary.

The core trade-off is that aggressive compaction loses detail. An agent that compacted the first 30 turns may not remember that it already tried one approach and found it failing, causing it to re-attempt the same approach on the next planning step. Arize AI's production tracking of millions of agent decisions found hallucinations beginning to appear at ~70% context capacity and becoming severe at ~85% — meaning compaction should trigger at 65-70%, not 90%.
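Two of the ideas above, snipping long tool output and compacting once a fill threshold is crossed, can be sketched in a few lines. This is not Claude Code's pipeline: the function names, the `summarize` callback, and the choice to keep the last four turns verbatim are all assumptions for illustration.

```python
def snip(output, keep_lines=20):
    """Keep only the tail of a long tool output, noting how much was dropped."""
    lines = output.splitlines()
    if len(lines) <= keep_lines:
        return output
    dropped = len(lines) - keep_lines
    return f"[{dropped} lines omitted]\n" + "\n".join(lines[-keep_lines:])

def maybe_compact(transcript, used_tokens, window_tokens, summarize,
                  trigger=0.65, keep_recent=4):
    """Replace older turns with one summary once context fill crosses `trigger`."""
    if used_tokens / window_tokens < trigger or len(transcript) <= keep_recent:
        return transcript
    old, recent = transcript[:-keep_recent], transcript[-keep_recent:]
    return [summarize(old)] + recent
```

Keeping recent turns verbatim preserves the detail the planner is most likely to need next, while the summary carries forward what was tried earlier.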

CLAUDE.md (a project-level instruction file loaded lazily at run start) and per-sub-agent sidechain transcripts are practical mitigations: they let the main agent externalize stable context (project conventions, architectural constraints) so it does not need to keep them in the rolling window.

The self-correction loop

The distinguishing feature of agent mode over a plain "generate a diff" command is that the agent observes tool output and adapts. The pattern:

Planner emits a plan: "Edit AuthService to replace jwt.verify with jwtv2.verify."
Tool executor applies the edit, then runs: npx tsc --noEmit.
Compiler output: 3 errors — one import is missing, one type changed, one test mock needs updating.
Compiler output goes back into the context.
Planner sees the errors, emits targeted edits for each.
Re-run the type check and the tests: all pass.
Planner emits DONE.

Aider's Architect/Editor split makes this concrete: the Architect model (e.g., o1-preview, which is good at reasoning but poor at formatting exact diffs) plans the approach and describes what to change; the Editor model (e.g., DeepSeek or o1-mini, which is faster and better at structured output) formats the actual SEARCH/REPLACE blocks. Measured result: o1-preview + DeepSeek achieves 85% pass rate on aider's benchmark, beating o1-preview solo (leaderboard figure at time of writing: ~77%; see aider.chat/docs/leaderboards, which updates over time). The cost is two API calls per agent turn — acceptable for batch tasks, prohibitive for interactive completions.

The self-correction loop has a stopping condition: a maximum turn count (typically 20-30) and a dollar budget per task. Without these guards, a confused agent that keeps failing the same test can run for hours. GitHub Copilot Agent Mode hit 56% on SWE-bench Verified (Claude 3.7 Sonnet, 2025); Claude Sonnet 4.5 solo hits 77.2% — meaning roughly 1 in 4 tasks still requires human intervention.

Tool execution and diff application

Tools available to the agent: ReadFile, WriteFile, EditFile (patch), Bash (shell command), Grep, Glob, WebFetch, and MCP servers (for external integrations). Claude Code runs concurrent read-only operations via a StreamingToolExecutor and serializes write operations to avoid write races. On a Bash error, a sibling abort controller cancels all other running subprocesses from the same planning step.

Diff application is the single most common failure mode. The four formats, from most fragile to most robust:

Format	Example	Fragility	Best for
SEARCH/REPLACE	Match exact string, replace	Breaks on whitespace, CRLF, minor model drift	Small precise edits
Unified diff	`@@ -3,7 +3,8 @@` hunks	Requires correct line numbers	Multi-line edits
Editor-diff	Simplified diff for Editor role	Requires model trained on format	Architect/Editor split
Whole file	Full file rewrite	Never fails to apply; expensive for large files	Files <500 lines

Apply failure accounted for 60% of human interventions in Claude Code production data. The apply layer should: attempt exact SEARCH/REPLACE, then fuzzy match (normalize whitespace/CRLF), then ask the model to retry in whole-file format. These three strategies cover >95% of apply failures.
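The first two stages of that fallback chain can be sketched as below. The function names are hypothetical, and requiring the search text to match exactly once is a design choice made here to avoid editing the wrong occurrence, not something the source specifies.

```python
def normalize(text):
    """Normalize line endings and strip trailing whitespace on every line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines)

def apply_search_replace(source, search, replace):
    """Return (new_source, strategy). A strategy of None tells the caller to
    fall back to asking the model for a whole-file rewrite."""
    if search and source.count(search) == 1:
        return source.replace(search, replace), "exact"
    norm_source, norm_search = normalize(source), normalize(search)
    if norm_search and norm_source.count(norm_search) == 1:
        # Note: the result uses normalized line endings and trailing whitespace.
        return norm_source.replace(norm_search, normalize(replace)), "fuzzy"
    return source, None
```

A production apply layer would likely restore the file's original line-ending style after a fuzzy match; this sketch leaves the normalized form in place to stay short.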

For MCP-based integrations (connecting the agent to external systems like Linear, Jira, or database schemas), the Model Context Protocol provides a standard tool registration and invocation interface. See Model Context Protocol (MCP) for the protocol internals.

Evaluation and telemetry

The metrics are divided into online (production traffic) and offline (benchmark).

Online metrics:

Acceptance Rate (AR): 27-34% for GitHub Copilot inline completions. Rises from ~29% in the first 3 months of use to ~34% after 6+ months (Communications of the ACM research) as the model learns the user's style and the user learns what to expect from the tool.
Code survival rate: 88% of accepted Copilot suggestions survive to final commit. This is the real productivity signal; AR alone is gameable by producing short obvious completions.
Ratio of Completed Code (RoCC): fraction of the final submitted code that was AI-generated; JetBrains lifted this from 0.25 to 0.39 on Python after their custom Mellum model launch.
Strong AR: accepted AND not deleted within 30 seconds AND less than 50% edited. Filters out "accept and immediately rewrite" cases.
Agent task completion rate: fraction of agent tasks completed without human intervention; SWE-bench Verified is the industry benchmark.
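The difference between AR and Strong AR is easy to see in code. This sketch assumes a hypothetical `CompletionEvent` record with the three fields the definition needs; real telemetry schemas will differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompletionEvent:
    accepted: bool
    seconds_until_deleted: Optional[float]   # None if never deleted
    edited_fraction: float                    # share of accepted characters later changed

def is_strong_accept(event):
    """Accepted, not deleted within 30 seconds, and less than 50% edited."""
    if not event.accepted:
        return False
    deleted_fast = (event.seconds_until_deleted is not None
                    and event.seconds_until_deleted <= 30)
    return not deleted_fast and event.edited_fraction < 0.5

def rates(events):
    """Return (acceptance rate, strong acceptance rate) over shown completions."""
    shown = len(events)
    ar = sum(e.accepted for e in events) / shown
    strong_ar = sum(is_strong_accept(e) for e in events) / shown
    return ar, strong_ar
```

Tracking both numbers side by side shows how much of the headline acceptance rate is "accept and immediately rewrite".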

Offline benchmarks:

HumanEval Infilling, RepoCoder, RepoEval, StmtEval for FIM quality.
Pass@k (fraction of tasks where at least 1 of k samples passes all tests) for agent mode.
Exact Match and token-level edit distance for completion precision.
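Pass@k is usually computed per task with the unbiased estimator introduced alongside HumanEval: draw n samples, count the c that pass, and estimate the chance that a random subset of k contains at least one pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed (requires k <= n)."""
    if n - c < k:
        return 1.0                      # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all tasks. Sampling n larger than k and using the estimator gives a lower-variance result than drawing exactly k samples per task.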

A/B testing gates new models before fleet-wide rollout. The Copilot custom model (2024) was validated with a controlled rollout showing 35% latency reduction, 12% AR lift, and 20% more accepted-and-retained characters before replacing the prior model.

For the LLM inference serving infrastructure that underlies both the completion model and the agent model, see Design an LLM Inference & Serving System.

Edge cases and gotchas

Context rot in long agent sessions. Accuracy degrades before the context window is full. Arize AI's production tracking found hallucinations beginning at ~70% context fill and becoming severe by ~85%. An agent in a long refactor session may re-introduce code it deleted 30 turns ago, contradict a decision it already made, or loop on a sub-problem it already solved. Design compaction to trigger at 65%, not 90%.

Phantom package hallucination. About 20% of AI-generated code references non-existent libraries or APIs. In completion mode, the developer catches this when the import fails. In agent mode, the agent may run pip install nonexistent-package, observe the failure, then hallucinate an alternative package name, cycling through a cascade of wrong dependencies. Mitigation: before any install command, verify the package exists in the registry with a WebFetch call; lock to known-valid package lists for common languages.
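The allowlist half of that mitigation can be sketched as a pre-flight check on the agent's shell command. This is a simplified illustration with hypothetical function names: it handles only the plain `pip install <names>` form and does not interpret options that take values, such as `-r requirements.txt`.

```python
import shlex

def requested_packages(command):
    """Package names in a `pip install ...` command; empty list for anything else."""
    parts = shlex.split(command)
    if parts[:2] != ["pip", "install"]:
        return []
    names = []
    for arg in parts[2:]:
        if arg.startswith("-"):
            continue                      # skip flags; options taking values are not handled
        for sep in ("==", ">=", "<=", "~=", "!=", "<", ">", "["):
            arg = arg.split(sep)[0]       # drop version specifiers and extras
        names.append(arg.lower())
    return names

def unknown_packages(command, known):
    """Packages the command would install that are not in the `known` set."""
    return [p for p in requested_packages(command) if p not in known]
```

If `unknown_packages` returns anything, the executor can refuse the command and feed the list back to the planner as an observation instead of letting the install fail and the guessing cascade begin.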

Apply failure brittleness. SEARCH/REPLACE exact-match breaks on trailing whitespace, CRLF vs. LF, tab vs. space, or any minor model drift in how the search string was generated. In Claude Code production data, this was the cause of 60% of human interventions. The fix: fuzzy normalization first (strip trailing whitespace, normalize line endings), then retry with whole-file format. Do not present apply failure to the user until 3 retry strategies have been exhausted.

Stale index after large refactors. The semantic index reflects the file state at indexing time. If the agent renames 50 functions and then queries the semantic index, it may retrieve chunks referencing the old names, producing completions that re-introduce renamed identifiers. Mitigation: after each WriteFile or EditFile tool call, invalidate the index entries for the modified file and trigger a synchronous re-embed before the next query.

Cancelled-request waste without HTTP/2 stream reset. At Copilot's scale, 50% of completion requests are abandoned. Without stream reset, the model finishes generating a completion no one will see, consuming GPU-seconds and inflating cost. Teams instrumenting at the model API layer (not the proxy) miss this entirely — they see 100% utilization, attribute it to demand, and provision more GPUs, when the real fix is a proper cancellation path. The copilot-proxy in Go handles this at the HAProxy GLB layer, not the application layer.

Permission fatigue in agent mode. Claude Code production data shows 93% of permission prompts are approved by users. Over extended use, auto-approve rates grow from 20% to 40%+ of sessions. This is efficient but dangerous: an agent with auto-approve and a misguided plan can delete test fixtures, push to a main branch, or run a destructive migration. Solution: scope auto-approve to read-only operations by default; require explicit confirmation for write operations outside the active workspace.
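One way to read the scoping rule above, sketched as a policy function. The tool names, the `decide` helper, and the treatment of in-workspace writes (allowed only when the session has opted in to auto-approve) are assumptions. Paths are assumed to be absolute POSIX paths, and symlinks are not resolved.

```python
import posixpath

READ_ONLY_TOOLS = frozenset({"ReadFile", "ListDir", "Grep"})  # hypothetical names


def inside_workspace(path: str, workspace: str) -> bool:
    """True if path, after collapsing '..' segments, lies under workspace."""
    path = posixpath.normpath(path)
    workspace = posixpath.normpath(workspace)
    return posixpath.commonpath([path, workspace]) == workspace


def decide(tool: str, path: str, workspace: str,
           session_auto_approve: bool = False) -> str:
    """Return 'allow' or 'confirm' for a single tool call."""
    if tool in READ_ONLY_TOOLS:
        return "allow"        # read-only operations are auto-approved by default
    if not inside_workspace(path, workspace):
        return "confirm"      # writes outside the workspace always prompt
    return "allow" if session_auto_approve else "confirm"


print(decide("ReadFile", "/etc/hosts", "/repo"))                   # allow
print(decide("WriteFile", "/repo/src/a.py", "/repo", True))        # allow
print(decide("WriteFile", "/repo/../etc/passwd", "/repo", True))   # confirm
```

Normalizing the path before the containment check matters: without it, a `..` segment would let a write escape the workspace while still appearing to start inside it.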

Traffic trombone in global edge routing. GitHub experimented with routing completions to the nearest CDN PoP, expecting to reduce latency. In testing, Singapore traffic routed to a West Coast US PoP — which then forwarded to the actual model running in a US region anyway, adding two intercontinental hops compared to direct routing. The fix: co-locate proxy and model in the same cloud region; use octoDNS geo-routing to select the nearest region where both run, not the nearest network PoP.

On-prem latency shock. Moving from a managed cloud inference endpoint (H100 clusters, continuous batching, speculative decoding at scale) to a single customer-run server typically quadruples latency and halves throughput. A 7B model on one A100 without batching cannot meet sub-500ms p90 SLOs at production QPS. Enterprise on-prem requires the same continuous batching infrastructure — vLLM or TensorRT-LLM, multi-GPU tensor parallelism — not a naive single-process server.

Trade-offs to discuss in an interview

Plugin vs. IDE fork. Copilot and Sourcegraph Cody are IDE plugins: broad distribution, easy updates, constrained by the plugin API. Cursor and Windsurf are full VS Code forks: direct file system access, terminal control, a full editor event stream for training data, and lower-overhead diff application. The fork gives a 2-3x advantage in diff application speed and richer telemetry, at the cost of a harder update cycle and smaller initial reach. Cursor's Tab model latency dropped measurably after the Supermaven acquisition (internal telemetry cited a roughly 40–45% reduction), partly attributable to the lower overhead of a fork compared with the plugin API.

All-cloud vs. local+cloud cascade. GitHub Copilot runs all inference in Azure. MCCom and JetBrains Mellum run a local model for most completions, escalating to cloud only on low confidence. All-cloud is simpler operationally but adds ~200ms network round-trip. Local+cloud reduces latency and cost (~60% fewer cloud calls when ~60% of requests stay local) but requires model deployment on developer machines — fine for an M-series Mac, painful on a low-end Windows laptop.
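The escalation rule can be sketched as a confidence-gated router. This is not MCCom's or Mellum's actual logic: `cascade`, the `(text, confidence)` return shape of the local model, and the 0.8 threshold are assumptions for illustration.

```python
from typing import Callable, Tuple

LocalModel = Callable[[str], Tuple[str, float]]  # returns (completion, confidence)
CloudModel = Callable[[str], str]


def cascade(prompt: str, local_model: LocalModel, cloud_model: CloudModel,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Serve from the local model when it is confident, else escalate."""
    text, confidence = local_model(prompt)
    if confidence >= min_confidence:
        return text, "local"
    return cloud_model(prompt), "cloud"


# Fake models: the local one is confident only on short prompts.
def fake_local(prompt: str) -> Tuple[str, float]:
    return prompt + "()", 0.9 if len(prompt) < 10 else 0.3


def fake_cloud(prompt: str) -> str:
    return prompt + "(args)"


print(cascade("foo", fake_local, fake_cloud))               # ('foo()', 'local')
print(cascade("some_long_call", fake_local, fake_cloud))    # ('some_long_call(args)', 'cloud')
```

The threshold is the tuning knob: raising it sends more traffic to the cloud and improves quality on hard completions, while lowering it cuts cloud calls at the risk of showing weaker local suggestions.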

Embedding-based retrieval vs. lexical search. Sourcegraph Cody originally used OpenAI's text-embedding-ada-002 for semantic search. They switched to native Sourcegraph search (BM25 + Zoekt trigram) to eliminate third-party code exposure and scale to 100,000+ file monorepos. Embeddings outperform lexical search on semantic queries; lexical search outperforms on exact identifier names and regex patterns. Hybrid retrieval wins on most benchmarks but adds operational complexity: two indexes to maintain and a fusion layer.
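The text does not say which fusion method any of these products use. Reciprocal rank fusion (RRF) is one common choice and is sketched here as an assumption; the file names and the conventional constant `k = 60` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical = ["auth.py", "user.py", "db.py"]        # e.g. trigram/BM25 results
semantic = ["session.py", "auth.py", "user.py"]  # e.g. embedding results
print(reciprocal_rank_fusion([lexical, semantic]))
# ['auth.py', 'user.py', 'session.py', 'db.py']
```

RRF uses only rank positions, so it needs no score calibration between the lexical and embedding indexes, which keeps the fusion layer small.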

Architect/Editor split vs. single model. Separating reasoning (Architect: large model, extended thinking) from edit formatting (Editor: fast model, structured output) yields measurably better benchmark scores (85% vs. ~77% for o1-preview solo on Aider's benchmark; see aider.chat/docs/leaderboards). The cost is two API calls per agent turn — higher latency and cost. This is worth it for batch or async agent tasks; it is not worth it for interactive completions where every turn is user-facing.

Context window size vs. compaction quality. A 1M-token context window (GPT-5.4/5.5, Gemini 2.5 Pro) allows an agent to hold a large codebase in context without compaction. But larger context means slower prefill (attention compute grows quadratically with sequence length; FlashAttention reduces memory traffic, not the quadratic FLOP count), higher cost per turn, and context rot that still sets in well before the window is full. Compaction is not just a workaround for small windows; it is also a quality intervention that focuses the model on the relevant recent history.

Things you should now be able to answer

Why does inline completion need a different model than agent mode, and what are the specific latency targets for each?
Describe the FIM prompt format (PSM vs. SPM) and explain why EFIM's restructuring achieves a 52% latency reduction without changing the model.
How does the debounce-plus-HTTP/2-stream-reset cancellation pipeline work, and why does instrumenting at the model API layer miss the 50% cancellation signal?
Walk through the three tiers of context retrieval and explain which tiers run on the hot path vs. the cold path, and why.
What does the prompt budget controller prioritize, and what fraction of the context window is typically consumed by implicit context?
Explain the self-correction loop in agent mode and name two concrete stopping conditions that must be enforced to prevent runaway cost.
What are the four diff application formats, which is most fragile and why, and what is the retry strategy for apply failures?
Why does context rot begin before the context window is full, and at what approximate fill level do hallucinations become severe in production agents?
What is the trade-off between all-cloud inference and a local-plus-cloud cascade model, and what efficiency gains does MCCom demonstrate?
How does "code never trains the base model" work architecturally, and what additional infrastructure is required for a fully on-prem enterprise deployment?

Frequently asked questions

▸Why does inline completion need a different model than agent mode?

Ghost text must appear within 200ms or the developer has already typed past the suggestion. A 70B frontier model on a cold prompt takes 2-5 seconds for the first token. Completion uses a 1B-7B FIM-tuned model with greedy decoding, designed for sub-300ms server latency. Agent mode tolerates seconds to minutes per turn because the user asked for a multi-step task; it gets a frontier model (Claude Sonnet, GPT-5.4) with extended thinking and tool-use. Routing by latency class — not capability — is the core infrastructure decision.

▸What is fill-in-the-middle (FIM) prompting and why does format matter?

FIM prompts the model with a prefix (code before the cursor) and a suffix (code after) separately, asking it to generate the middle span. The two formats differ in order: PSM sends Prefix-Suffix-Middle, SPM sends Suffix-Prefix-Middle. SPM trains the model to generate left-to-right after seeing the suffix, which tests show performs slightly better. For serving, format matters for KV cache reuse: EFIM restructures prompts so the stable suffix appears first as a cacheable prefix — the suffix is stable across keystrokes because the code below the cursor does not change when the user types above it — yielding 52% latency reduction and 98% throughput increase from cache hits on the suffix alone.
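A sketch of the two orderings and of why suffix-first helps cache reuse. The sentinel strings are placeholders (each model defines its own special tokens and exact layout), this is not EFIM's actual prompt format, and shared characters stand in for shared tokens.

```python
import os

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # placeholder sentinels, not real tokens


def psm_prompt(prefix: str, suffix: str) -> str:
    """Prefix-Suffix-Middle: the model generates the middle after <MID>."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


def spm_prompt(prefix: str, suffix: str) -> str:
    """Suffix-Prefix-Middle: the suffix leads, so it stays at the front."""
    return f"{SUF}{suffix}{PRE}{prefix}{MID}"


def reusable_chars(old_prompt: str, new_prompt: str) -> int:
    """Length of the shared leading span, a rough proxy for KV cache reuse."""
    return len(os.path.commonprefix([old_prompt, new_prompt]))


suffix = "\n\nprint(add(1, 2))\n"
before = "def add(a, b):\n    return a"
after = before + " +"   # the user typed two more characters at the cursor

psm_reuse = reusable_chars(psm_prompt(before, suffix), psm_prompt(after, suffix))
spm_reuse = reusable_chars(spm_prompt(before, suffix), spm_prompt(after, suffix))
print(psm_reuse, spm_reuse)
```

In the prefix-first layout the suffix sits after the edit point, so its cached state is invalidated on every keystroke. In the suffix-first layout the suffix precedes the edit point and its cache survives.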

▸How do you handle the 50% of completion requests that are abandoned mid-flight?

At GitHub Copilot scale (400 million requests/day), roughly half of requests are abandoned because the user keeps typing before the model responds. Cancelling a request via HTTP/2 stream reset, rather than a TCP teardown, releases the server-side compute immediately while leaving the connection open for the next request. Without this, the model finishes generating a completion no one will see, burning GPU cycles and inflating cost. The Go-based proxy (copilot-proxy) tracks in-flight stream IDs and issues resets on new keystrokes. Teams instrumenting only at the model API layer miss this signal: they count the abandoned work as real demand and overestimate the GPU capacity they need by nearly 2x.

▸What does "code never trains the base model" mean architecturally?

For enterprise customers, it is both a contractual and an architectural guarantee. User code is not written to any persistent training store; the inference path reads the code from the IDE, assembles a prompt, calls the model, and discards everything. Fine-tuning pipelines for the base completion model train only on licensed public code corpora. Enterprise air-gap goes further: inference runs on a customer-provisioned GPU cluster (vLLM or TensorRT-LLM on H100s), so code never leaves the customer network. Sourcegraph Cody dropped OpenAI embeddings specifically to eliminate the third-party exposure created by sending code to an external embedding API.

▸How do you evaluate an AI coding assistant beyond acceptance rate?

Acceptance Rate (AR, typically 27-34%) is gameable — models that output short, obvious completions get accepted but provide little value. Better metrics: code survival rate (88% of accepted Copilot suggestions survive to final commit), Ratio of Completed Code (RoCC, used by JetBrains — lifted from 0.25 to 0.39 post-model update), and Strong AR (accepted AND not deleted AND less than 50% edited afterward). For agent mode: SWE-bench Verified pass rate (GitHub Copilot hit 56% with Claude 3.7 Sonnet) and per-task human-intervention rate are the real proxies for productivity.
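The difference between AR and Strong AR can be made concrete with a small calculation. The `Suggestion` record and its field names are hypothetical; the Strong AR rule (accepted, not deleted, less than 50% edited) follows the definition above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Suggestion:
    """One shown completion and what happened to it (illustrative fields)."""
    accepted: bool = False
    deleted_later: bool = False
    edited_fraction: float = 0.0  # share of accepted text changed afterward


def acceptance_rate(shown: Iterable[Suggestion]) -> float:
    items: List[Suggestion] = list(shown)
    return sum(s.accepted for s in items) / len(items) if items else 0.0


def strong_acceptance_rate(shown: Iterable[Suggestion]) -> float:
    items: List[Suggestion] = list(shown)
    strong = sum(
        s.accepted and not s.deleted_later and s.edited_fraction < 0.5
        for s in items
    )
    return strong / len(items) if items else 0.0


log = [
    Suggestion(accepted=True),                        # kept as-is
    Suggestion(accepted=True, edited_fraction=0.7),   # mostly rewritten
    Suggestion(accepted=True, deleted_later=True),    # accepted, then removed
    Suggestion(accepted=False),                       # dismissed
]
print(acceptance_rate(log), strong_acceptance_rate(log))  # 0.75 0.25
```

In this toy log the plain rate counts three acceptances, but only one suggestion delivered code the developer actually kept, which is the gap Strong AR is designed to expose.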

// RELATED