
◆◆◆ Advanced · asked at OpenAI, LangChain, Braintrust, Arize

Design an LLM Evaluation Platform

Build the system that tells a team whether a prompt or model change made the product better or worse — automatically. Covers offline eval with LLM-as-judge, CI regression gating, online production sampling, human annotation queues, and eval for RAG, agents, and classifiers at the scale of 450 million evaluations per month.

30 min read · 2026-06-25 · Ironclad Academy

#interview #ai #llm #llmops #evaluation


The problem

When Voiceflow upgraded to a new GPT model version in 2023, they caught a 10% classification accuracy regression before it shipped. The catch was not luck — it was a test suite of 200 labeled examples that ran automatically in CI. Most teams do not have this. They ship prompt changes because "it feels better in the playground," discover hallucination regressions three days after deploy from user complaints, and have no systematic way to verify whether a model upgrade improved anything for their specific use case.

This is not a new problem. Software engineering solved it with test suites and CI. LLM products need the same thing, but the assertion layer is fundamentally different: you cannot compare an LLM output to a hardcoded string and declare it pass or fail. The output space is open-ended. Meaning matters. A three-sentence answer and a twelve-sentence answer to the same question can both be correct or both be wrong. The eval platform's job is to turn "did this change make the product better?" from a subjective debate into an automated measurement.

There are two regimes. Offline eval catches intentional changes — things you did on purpose, like editing a system prompt or swapping from GPT-4o to Claude Sonnet 4. Online eval catches environmental drift — things that happen to you, like an upstream model API changing its behavior, user query distributions shifting, or a silent infrastructure change altering context assembly. You need both. Teams that run only offline eval discover the second category from their users. Teams that skip offline eval are shipping blind.

The design in this article powers a multi-tenant eval platform at the scale of 450 million evaluations per month — the tier of platform used by enterprises running continuous LLMOps at scale (Arize Phoenix, whose customers include DoorDash, Reddit, Instacart, and Uber, publicly reports 50+ million evaluations/month with 1 trillion spans processed). See design-llm-observability-platform for the production tracing infrastructure that feeds the online pipeline, and design-rag-pipeline for the RAG-specific metrics (faithfulness, context recall) that plug into the scorer layer described here.

Functional requirements

Run a versioned golden dataset through any version of a system under test (SUT), score every row with one or more scorers, store the results keyed by (run ID, dataset version, scorer, row), and emit an aggregate report.
Support three scorer types: deterministic/code-based (exact match, regex, JSON schema validation, F1), LLM-as-judge (rubric-prompted model returning score + chain-of-thought explanation), and human annotation.
CI integration: compare a run's aggregate and per-slice scores to a baseline using a statistical test; post results as a PR comment; block merge on hard regressions.
Online eval: consume production traces from a message queue, run scorers asynchronously, feed results into the same scoring store, and alert when rolling quality metrics drop.
Human annotation queue: surface low-confidence judge cases and random calibration samples to annotators; measure Cohen's kappa; accept approved labels back into the golden dataset.
Dataset versioning: content-addressed snapshots so every run is reproducible against the exact data it was scored on; slice queries by metadata tags (topic, difficulty, user segment).
Experiment tracking: link each eval run to (prompt version, model ID, dataset version, code commit SHA).
Specialized eval modes for RAG systems (RAGAS-style reference-free metrics) and agents (task success, step trajectory scoring).

Non-functional requirements

Offline eval on a 500-row dataset completes within 5 minutes (CI gate wall-time target) using a cheap judge model.
Scoring throughput: 450 million evaluations/month sustained; columnar storage with sub-second analytical query latency for dashboard slice aggregations.
Judge calls are retried with exponential backoff; transient API failures must not poison an entire run's results; partial run results are recoverable.
PII in production traces must be scrubbed before entering the dataset store or being sent to judge model APIs.
All eval runs are reproducible: given (run ID), any engineer can replay the exact inputs, model versions, and judge prompts that produced the stored scores.
Human annotation throughput sufficient to keep calibration current as the judge model changes: target 200–500 annotations/month per active eval dimension.

Capacity estimation

Dimension	Estimate	How we got there
Offline evaluations/day	5 M	200 teams × 50 runs/day × 500 rows/run
Online evaluations/day	10 M	100 M production LLM calls/day × 10% sample
Total evaluations/month	450 M	(5M + 10M) × 30 days
Judge API calls/month	~15 M	~33% of evals use LLM-as-judge (~148 M rows); batch prompting groups ~10 rows per judge call, so 148 M / 10 ≈ 15 M calls
LLM-judge cost/month	~$300K	15M calls × $0.02 average (GPT-4o-mini for CI, GPT-4o for nightly)
Score record storage	~450 GB/month	450M records × ~1 KB JSON per record (450M × 1 KB = 450 GB)
CI eval wall time (500 rows)	~50 s	500 rows / 20 parallel workers × 2 s/LLM call
Human annotations/month	~5,000	200 teams × 25 calibration + low-confidence samples/month each; assuming ~10–25 shared eval dimensions platform-wide, that is ~200–500 annotations per dimension/month, matching the NFR

Takeaway: The LLM-judge cost is the dominant operating expense — $300K/month is a real number for a platform at this scale. Judge model selection (mini vs. full) and sampling rates are cost-control levers, not just performance knobs. At the same time, the storage volume is manageable in a columnar store; the engineering challenge is latency of analytical queries, not raw data size.
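The table's arithmetic can be re-derived in a few lines, which makes it easier to see how the judge bill moves when a lever such as the sampling rate or batch size changes. This is only a sketch: the variable names are illustrative, and every input is a planning assumption copied from the table above.

```python
# Inputs copied from the capacity table; all are planning assumptions.
TEAMS, RUNS_PER_TEAM_PER_DAY, ROWS_PER_RUN = 200, 50, 500
PROD_CALLS_PER_DAY, ONLINE_SAMPLE_PERCENT = 100_000_000, 10
JUDGE_SHARE = 0.33            # fraction of evaluations scored by an LLM judge
ROWS_PER_JUDGE_CALL = 10      # batch prompting
COST_PER_JUDGE_CALL = 0.02    # USD, blended across cheap and full judge models
BYTES_PER_RECORD = 1_000      # ~1 KB of JSON per score record

offline_per_day = TEAMS * RUNS_PER_TEAM_PER_DAY * ROWS_PER_RUN
online_per_day = PROD_CALLS_PER_DAY * ONLINE_SAMPLE_PERCENT // 100
evals_per_month = (offline_per_day + online_per_day) * 30

judge_calls_per_month = evals_per_month * JUDGE_SHARE / ROWS_PER_JUDGE_CALL
judge_cost_per_month = judge_calls_per_month * COST_PER_JUDGE_CALL
storage_gb_per_month = evals_per_month * BYTES_PER_RECORD / 1e9

print(f"evaluations/month: {evals_per_month:,}")
print(f"judge calls/month: {judge_calls_per_month:,.0f}")
print(f"judge cost/month:  ${judge_cost_per_month:,.0f}")
print(f"storage/month:     {storage_gb_per_month:,.0f} GB")
```

Halving `ONLINE_SAMPLE_PERCENT` or doubling `ROWS_PER_JUDGE_CALL` changes the cost line directly, which is the sense in which these are cost-control levers.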

Building up to the design

Start with the obvious version: a script that loads a CSV of test cases, calls your model for each one, checks if the output contains a keyword, and prints a pass rate. This is what most teams ship first. Four iterations are needed to fix it: hardcoded string checks fail on open-ended tasks; LLM judges introduce rubric inconsistency and position/self-enhancement bias; unsupervised judges can't be trusted for gating until calibrated; and manual eval scripts are forgotten and bypassed. The versioned narrative below walks through each fix in turn.

V1: hardcoded string checks. One-line assertions like assert "Paris" in output work for deterministic lookup tasks. They fail as soon as you have an open-ended question where ten different phrasings of the correct answer are all valid. The moment you add a chatbot that explains why Paris is the capital of France, string matching becomes meaningless.

V2: add LLM-as-judge for open-ended tasks. You write a prompt that tells GPT-4 to score responses 1–5 on helpfulness. It works better than string matching. Two problems emerge immediately. First, different engineers write different rubrics for the same dimension and get inconsistent scores — you have no idea whether "helpfulness 3.8" from last week and "helpfulness 3.7" from this week represent a regression or noise. Second, the judge shows position bias in pairwise comparisons and self-enhancement bias when you use the same model family as the one you're evaluating. You notice this when Claude judging Claude gives your Claude-powered chatbot a mysteriously high win-rate against GPT-4.

V3: bias-corrected, calibrated judges. Fix the rubric problem by using binary pass/fail initially (Hamel Husain's field recommendation after working with 30+ companies). Fix position bias with swap augmentation. Require a held-out calibration set showing Spearman rho > 0.7 against human labels before any judge is trusted for CI gating. Now you have trustworthy per-run scores.

But the scoring infrastructure is still a script you run manually. Engineers forget to run it. Prompt changes ship without it.

V4: offline eval runner with CI gate. The eval runner is a service, not a script. Every PR triggers a GitHub Actions step that (a) materializes the current system prompt + model config, (b) loads the pinned dataset version, (c) fans out to scorers in parallel, (d) runs a Welch's t-test comparing this run's scores to the baseline, and (e) posts the per-scorer delta table as a PR comment, blocking merge if any delta exceeds the regression threshold. This is what Braintrust's braintrustdata/eval-action provides. Notion reports a 10x improvement in issue-resolution throughput (from 3 to 30 eval-driven fixes per day) after adopting automated eval in CI.

flowchart LR
    PR["Pull Request"] --> CI["CI step<br/>GitHub Actions"]
    CI --> OFR["Offline eval runner"]
    OFR --> LOAD["Load dataset v1.4.2<br/>500 rows JSONL"]
    LOAD --> FANOUT["Fan-out worker pool<br/>20 parallel workers"]
    FANOUT --> SUT["System under test<br/>new prompt + model"]
    SUT --> SCORE_D["Deterministic scorers<br/>exact match · JSON schema · F1"]
    SUT --> SCORE_J["LLM-as-judge scorers<br/>faithfulness · tone · safety"]
    SCORE_D --> AGG["Aggregator<br/>per-slice means"]
    SCORE_J --> AGG
    AGG --> STAT["Statistical test<br/>Welch t-test / z-test"]
    STAT --> GATE{"Regression?"}
    GATE -->|"No: drop within threshold"| PASS["Post green table<br/>PR passes"]
    GATE -->|"Yes: p<0.05 AND drop>0.03"| FAIL["Post red table<br/>Block merge"]
    style OFR fill:#ff6b1a,color:#0a0a0f
    style SCORE_J fill:#a855f7,color:#fff
    style SCORE_D fill:#0e7490,color:#fff
    style GATE fill:#ff2e88,color:#fff
    style FAIL fill:#ff2e88,color:#fff

You still need the online pipeline, because offline eval only catches changes you made on purpose.

V5: online eval and dataset curation loops. Add an async consumer on production traces. Run cheap heuristics (PII regex, output schema validation, blocked-content list) on 100% of traces, LLM-judge on the sampled 5–10%. Low-scoring traces automatically get routed to the human annotation queue. Approved labels feed back into the golden dataset as new hard cases. Now the dataset evolves with the product's actual failure modes rather than staying frozen at launch.

API

# Create an eval run
POST /v1/eval-runs
{
  "dataset_id": "golden-v1.4.2",
  "dataset_version": "sha256:a3f9...",
  "experiment": {
    "prompt_version": "support-v23",
    "model": "gpt-4o-mini",
    "commit": "abc1234"
  },
  "scorers": [
    { "type": "exact_match", "field": "category" },
    { "type": "llm_judge", "judge_id": "faithfulness-v4", "model": "gpt-4o" },
    { "type": "llm_judge", "judge_id": "tone-v2", "model": "gpt-4o-mini" }
  ],
  "parallelism": 20
}

# Response: 202 Accepted
{ "run_id": "run_7f8d...", "status": "running", "eta_seconds": 50 }

# Get run results
GET /v1/eval-runs/run_7f8d.../results?slice_by=topic

# Response
{
  "run_id": "run_7f8d...",
  "status": "complete",
  "aggregate": {
    "faithfulness": { "mean": 0.891, "baseline": 0.923, "delta": -0.032, "p_value": 0.011 },
    "tone": { "mean": 0.774, "baseline": 0.761, "delta": 0.013, "p_value": 0.24 }
  },
  "slices": {
    "topic=medical": { "faithfulness": { "mean": 0.712, "baseline": 0.901, "delta": -0.189 } },
    "topic=billing": { "faithfulness": { "mean": 0.944, "baseline": 0.937, "delta": 0.007 } }
  },
  "gate_decision": "block",
  "gate_reason": "faithfulness delta -0.032 below threshold -0.03 with p=0.011 < 0.05"
}

# Submit a human annotation
POST /v1/annotations
{
  "run_id": "run_7f8d...",
  "example_id": "ex_3a2b...",
  "scorer_id": "faithfulness-v4",
  "annotator_id": "ann_012",
  "label": "fail",
  "critique": "Response claims the plan covers dental, but context says dental is excluded."
}

The schema

-- Datasets and versions
CREATE TABLE datasets (
  id           TEXT PRIMARY KEY,        -- e.g. "golden-support-v1"
  display_name TEXT NOT NULL,
  created_at   TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE dataset_versions (
  version_sha  TEXT PRIMARY KEY,        -- sha256 of file content
  dataset_id   TEXT REFERENCES datasets(id),
  row_count    INT NOT NULL,
  storage_path TEXT NOT NULL,           -- s3://... or GCS path to JSONL/Parquet
  created_at   TIMESTAMPTZ DEFAULT now()
);

-- Eval runs
CREATE TABLE eval_runs (
  run_id           TEXT PRIMARY KEY,
  dataset_id       TEXT NOT NULL,
  dataset_version  TEXT NOT NULL,        -- content-addressed sha
  prompt_version   TEXT,
  model_id         TEXT NOT NULL,
  commit_sha       TEXT,
  status           TEXT NOT NULL,        -- running | complete | failed
  started_at       TIMESTAMPTZ,
  completed_at     TIMESTAMPTZ
);

-- Score records (high-volume; use ClickHouse or TimescaleDB in production)
CREATE TABLE score_records (
  run_id       TEXT NOT NULL,
  example_id   TEXT NOT NULL,
  scorer_id    TEXT NOT NULL,
  score        FLOAT NOT NULL,           -- 0.0 to 1.0 (or binary 0/1)
  explanation  TEXT,                     -- judge chain-of-thought
  latency_ms   INT,
  cost_usd     FLOAT,
  judge_model  TEXT,
  ts           TIMESTAMPTZ DEFAULT now(),
  PRIMARY KEY (run_id, example_id, scorer_id)
);

-- Human annotations
CREATE TABLE annotations (
  annotation_id TEXT PRIMARY KEY,
  run_id        TEXT,
  example_id    TEXT NOT NULL,
  scorer_id     TEXT NOT NULL,
  annotator_id  TEXT NOT NULL,
  label         TEXT NOT NULL,           -- pass | fail
  critique      TEXT,
  created_at    TIMESTAMPTZ DEFAULT now()
);

-- Judge definitions
CREATE TABLE judges (
  judge_id       TEXT PRIMARY KEY,       -- e.g. "faithfulness-v4"
  dimension      TEXT NOT NULL,          -- faithfulness | tone | safety | ...
  prompt_text    TEXT NOT NULL,
  few_shot_json  JSONB,
  calibration_rho FLOAT,                 -- Spearman rho vs. held-out human labels
  model          TEXT NOT NULL,
  promoted_at    TIMESTAMPTZ             -- NULL until calibration passes
);

Architecture

Using the scoring store as the single write target for all four subsystems — offline runner, online consumer, human annotation, and CI gate — means every analytical query (slice aggregation, drift detection, calibration check) runs against the same data without ETL.

flowchart TD
    subgraph OFFLINE["Offline eval subsystem"]
        DS[("Dataset store<br/>S3 + metadata DB")] --> RUN["Eval runner<br/>fan-out worker pool"]
        JR[("Judge registry<br/>prompts + calibration")] --> RUN
        EXPR[("Experiment tracker<br/>Brainstore / MLflow / W&B Weave")] --> RUN
    end
    subgraph SCORE["Scorer layer"]
        RUN --> DET["Deterministic scorers<br/>exact-match · regex · schema · F1"]
        RUN --> JUDG["LLM-as-judge scorers<br/>faithfulness · tone · safety · format"]
        RUN --> HAQ["Human annotation queue<br/>Label Studio / Scale AI"]
    end
    subgraph STORE["Scoring store"]
        DET --> SS[("ClickHouse<br/>score_records")]
        JUDG --> SS
        HAQ --> SS
        SS --> DS
    end
    subgraph ONLINE["Online eval subsystem"]
        PTRACE["Production traces<br/>OTel / OpenInference"] --> SAMP["Stratified sampler<br/>5-10% by route/segment"]
        SAMP --> OCONS["Async eval consumer<br/>Kafka worker"]
        OCONS --> DET
        OCONS --> JUDG
        SS --> DRIFT["Drift detector<br/>rolling 15-min window"]
        DRIFT --> ALT["Alerting<br/>PagerDuty / Slack"]
    end
    subgraph GATE["CI gate"]
        SS --> STAT["Statistical comparator<br/>Welch t-test / z-test"]
        STAT --> PRCOMM["PR comment<br/>per-scorer + per-slice table"]
    end
    style RUN fill:#ff6b1a,color:#0a0a0f
    style JUDG fill:#a855f7,color:#fff
    style SS fill:#15803d,color:#fff
    style OCONS fill:#0e7490,color:#fff
    style DRIFT fill:#ff2e88,color:#fff
    style HAQ fill:#ffaa00,color:#0a0a0f

Offline runner. The runner is a horizontally scalable worker pool. When a run is created, it materializes the dataset rows into a work queue (one message per row). Workers pull rows, call the SUT (the live model endpoint or a stubbed version for speed), then fan out to all configured scorers in parallel. Each (run_id, example_id, scorer_id) triplet is written atomically to the scoring store. The runner tracks completion count and marks the run complete when every row is either scored or marked failed. Transient failures (timeouts, rate limits) are retried with exponential backoff; rows that still fail after the retry budget are recorded as failed instead of failing the whole run, and the orchestrator exposes a "re-score failed rows" endpoint for them.
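The retry behavior that the non-functional requirements call for (exponential backoff on judge calls, with failures recorded per row so that one bad call does not poison the run) might look like the sketch below. `call_with_backoff`, `TransientError`, and the default delays are hypothetical names and values chosen for illustration; they are not part of any particular SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 429s, 5xx)."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on TransientError wait roughly base_delay * 2**attempt, then retry.

    Returns ("ok", result) or ("failed", last_error), so the caller can write
    a failed row and leave it for a later re-score pass.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return "ok", fn()
        except TransientError as err:
            last_error = err
            if attempt < max_attempts - 1:
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Jitter keeps many workers from retrying in lockstep.
                sleep(delay * random.uniform(0.5, 1.0))
    return "failed", last_error
```

Passing `sleep` as a parameter is a small design choice that lets tests run without real delays.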

Scoring store. ClickHouse is the right engine here. Eval writes are append-only at high volume; the read pattern is analytical aggregation (mean score over the last 30 runs for dataset v3, sliced by topic). Braintrust's Brainstore is a purpose-built columnar store that claims 80x faster query performance than Postgres for this workload. TimescaleDB on Postgres is a viable open-source alternative. The key schema decision: one row per (run, example, scorer), not per run — this allows arbitrary per-scorer and per-slice aggregation without schema changes as new scorers are added.

Online eval consumer. An async Kafka consumer (or SQS equivalent) subscribes to the production trace stream. It applies stratified sampling, drawing more heavily from routes with historically low scores or high risk, and fans out sampled traces to the scorer layer. Critically, the consumer does not block the production request path: the trace is emitted as a side effect after the response is sent to the user. For the observability tracing infrastructure that produces this trace stream, see design-llm-observability-platform.

CI gate. For continuous metric scorers, the gate runs a Welch's t-test comparing the current run's scores to the baseline (the last green run on the main branch). It fails only if p < 0.05 AND the score dropped by more than the configured threshold (e.g., a faithfulness delta below -0.03). For binary pass/fail scorers, a two-proportion z-test is used instead. Requiring both conditions prevents the gate from blocking on statistically significant but practically irrelevant fluctuations.
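The two-condition rule can be sketched for the binary pass/fail case using only the standard library. The function names and the default `max_drop` and `alpha` values are illustrative assumptions; for continuous scores the same gate shape would apply with Welch's test in place of the z-test (for example `scipy.stats.ttest_ind(..., equal_var=False)`).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference in pass rates. Returns (delta, p_value)."""
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    delta = rate_b - rate_a
    if se == 0:
        return delta, 1.0  # every row passed (or failed) in both runs
    z = delta / se
    return delta, 2 * (1 - NormalDist().cdf(abs(z)))


def gate(baseline_pass, baseline_n, run_pass, run_n, max_drop=0.03, alpha=0.05):
    """Block only when the drop is both large enough and statistically significant."""
    delta, p_value = two_proportion_z_test(baseline_pass, baseline_n, run_pass, run_n)
    blocked = delta < -max_drop and p_value < alpha
    return {"delta": delta, "p_value": p_value, "decision": "block" if blocked else "pass"}
```

With 500 rows per run, a fall from 460 to 430 passes is blocked, while the same six-point fall on a 50-row dataset (46 to 43) is not, because the sample is too small to separate it from noise.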

Hot path: an online evaluation.

sequenceDiagram
    participant USER as User
    participant APP as Application
    participant LLM as LLM API
    participant OT as OTel Collector
    participant KAF as Kafka / SQS
    participant SAMP as Sampler
    participant DET as Deterministic scorer
    participant JUDG as LLM-as-judge
    participant SS as Scoring store
    participant DRIFT as Drift detector

    USER->>APP: Send message
    APP->>LLM: Call model (system prompt + context)
    LLM-->>APP: Response tokens
    APP-->>USER: Stream response to user
    APP->>OT: Emit trace span (input, output, latency, cost)
    OT->>KAF: Publish trace event
    KAF->>SAMP: Deliver trace
    SAMP->>DET: 100% — run heuristic checks (PII, schema)
    SAMP->>JUDG: 10% — run LLM-as-judge (faithfulness, tone)
    DET->>SS: Write score record
    JUDG->>SS: Write score record + explanation
    SS->>DRIFT: Update rolling mean
    DRIFT-->>APP: Alert if mean drops > 2–3pp over 15 min

The LLM-as-judge problem

This section is the heart of the platform. Everything else is plumbing. Getting judges right is where teams fail.

Calibration before deployment. The single most important practice is not shipping a judge until it has been calibrated against human labels on a held-out set. Concretely: collect 30–50 diverse examples (Hamel Husain recommends starting with 30), hand-label each one with the binary pass/fail judgment and a written critique, then measure the judge's agreement with those labels using Spearman rho. A rho below 0.7 means the judge is measuring something different from what the human labels measure — do not trust it for gating. The critique shadowing technique (write the judge rubric by reading human critiques, not from first principles) reliably produces rubrics that pass calibration faster.
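As a rough illustration of the calibration check, the sketch below computes Spearman rho between human labels and judge scores with the standard library only. Tied values receive average ranks, which matters here because binary labels are almost entirely ties. The function names are hypothetical, and the 0.7 cut-off is the one stated above.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors (Spearman's rho with ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one side is constant")
    return cov / sqrt(var_x * var_y)


def ready_for_gating(human_labels, judge_scores, min_rho=0.7):
    """True when the judge clears the calibration bar on the held-out set."""
    return spearman_rho(human_labels, judge_scores) > min_rho
```

A judge that returns the same verdict for every example raises an error instead of producing a number, which is deliberate: a constant judge carries no signal to calibrate.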

Self-enhancement bias. GPT-4 assigned to judge GPT-4 outputs shows a +10% win-rate boost for those outputs over equivalent outputs from other models. Claude-v1 (2023) judging Claude-v1 outputs shows +25%. This is documented by Eugene Yan (2024) drawing on Zheng et al. and is the reason you should not use the same model family as both the SUT and the judge. Use a different family, or average two judges from different families.

Position bias and swap augmentation. MT-Bench (Zheng et al., 2023, arXiv:2306.05685) quantified position bias across models. GPT-4 showed 65% swap-consistency: it reached the same verdict in 65% of cases when A and B were swapped. GPT-3.5 showed 46.2%, and Claude-v1 only 23.8%, meaning its verdict followed the position of an answer more often than its content. The fix: for pairwise comparisons, always run both A vs. B and B vs. A, and accept a verdict only when the two orders agree. If the judge disagrees with itself on swap, mark the case as "inconclusive" and route it to the human annotation queue rather than forcing a verdict.
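A minimal sketch of swap augmentation follows. It assumes a hypothetical `judge(prompt, first, second)` callable that returns the string "first" or "second"; a real judge would wrap a model call and parse its output.

```python
def pairwise_with_swap(judge, prompt, answer_a, answer_b):
    """Ask the judge in both orders; accept a verdict only if the orders agree.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first" or "second".
    """
    forward = judge(prompt, answer_a, answer_b)    # A is shown first
    backward = judge(prompt, answer_b, answer_a)   # B is shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    if winner_forward == winner_backward:
        return winner_forward
    return "inconclusive"  # route to the human annotation queue


def always_first(prompt, first, second):
    """Toy judge with pure position bias, used only to demonstrate the check."""
    return "first"


def prefers_longer(prompt, first, second):
    """Toy judge whose verdict depends on content (length), not position."""
    return "first" if len(first) >= len(second) else "second"
```

The position-biased toy judge always lands on "inconclusive", while the content-based one gives the same winner in both orders. The price is two judge calls per comparison.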

Verbosity/conciseness bias shift. Models from 2022–2023 preferred longer responses more than 90% of the time. Some 2025-era models have partially reversed this: in the Judging the Judges paper (arXiv:2604.23178), Claude Sonnet 4 shows a conciseness preference, while older models still trend verbose. Bias direction is not stable within a model family, so it must be re-measured per model version whenever the judge is updated. The same paper measured that a combined debiasing budget strategy improves Claude Sonnet 4's judging accuracy by +11.5 percentage points (p < 0.0001).

Binary vs. numeric scales. Hamel Husain and Shreya Shankar both strongly recommend binary pass/fail for initial judge calibration. A 1–5 scale introduces ambiguity at the margins: reasonable people disagree about whether something is a 3 or a 4, which inflates apparent judge disagreement and makes calibration harder. Start binary. Graduate to numeric only once binary judgments are saturated and you need gradient signal for optimization. LMArena's pairwise approach is a variation: rather than assigning a number, ask "which is better?" and accumulate comparisons into a Bradley-Terry (BT) model. A BT fit uses all comparisons at once, so the ratings do not depend on the order in which comparisons arrived, and it supports confidence intervals (for example by bootstrapping), which sequential Elo updates do not provide.

One judge per dimension. An omnibus prompt asking the judge to score faithfulness, tone, format, and safety simultaneously produces conflated scores that are hard to calibrate and even harder to act on. When a score drops, you cannot tell which dimension caused it. Build one judge per dimension. Yes, this multiplies judge call count. The alternative is a single metric that hides the root cause of every regression.

Chain-of-thought is required. A judge that outputs only "fail" cannot be calibrated, debugged, or explained to a product manager. Every judge prompt must require the model to produce a chain-of-thought reasoning trace before the score. This enables: (a) engineers to read why the judge scored a case a certain way, (b) annotators to agree or disagree with the reasoning rather than the raw score, and (c) rubric iteration based on the failure modes the judge itself identifies.

Online eval and production monitoring

Offline eval catches what you do on purpose. Online eval catches what happens to you.

Sampling strategy. Run cheap heuristic checks (PII regex, output schema validation, blocked-word lists, response length out of range) on 100% of production traces — these cost microseconds. Run LLM-as-judge on 5–10% of traces, stratified by route and user segment. Arize Phoenix and Langfuse both report this as the standard production pattern. Stratified sampling matters: if you sample uniformly, rare but high-risk routes (medical, financial, legal) may appear only a handful of times per day even at 10% sampling. Oversample those routes.
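One way to implement stratified sampling without shared state is to hash the trace ID into the unit interval and compare it with a per-route rate. The sketch below assumes hypothetical route names and rates; only the 5–10% range comes from the text above.

```python
import hashlib


def sample_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_judge(trace_id, route, route_rates, default_rate=0.05):
    """Decide whether a trace is sent to the LLM judge.

    route_rates maps a route name to its sampling rate; routes that are not
    listed fall back to default_rate. Both are illustrative assumptions.
    """
    rate = route_rates.get(route, default_rate)
    return sample_fraction(trace_id) < rate


# Hypothetical configuration: oversample rare, high-risk routes.
ROUTE_RATES = {"medical": 1.0, "financial": 0.5, "smalltalk": 0.02}
```

Because the decision depends only on the trace ID and the configured rate, every consumer replica makes the same choice for the same trace, and a replayed trace is not judged twice by accident.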

Drift detection. Compute a rolling mean of each score metric over a 15–60 minute window. Alert when the rolling mean drops more than 2–3 percentage points below the established baseline for the same time-of-day distribution (to avoid false positives from normal traffic variation). The threshold must be tuned per metric: faithfulness has low natural variance, so a 2pp drop is significant; tone has higher natural variance and may need a 5pp threshold. Arize customers such as DoorDash use this pattern for continuous quality monitoring, on a platform that reports 1 trillion spans processed.
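A rolling-window detector of this kind can be sketched in a few lines. The class name, the `min_samples` guard, and the single fixed baseline are simplifying assumptions; a production version would look up a baseline per time-of-day bucket as described above.

```python
from collections import deque


class RollingMeanDrift:
    """Track a rolling mean of scores and flag drops below a baseline.

    Assumes timestamps (in seconds) arrive in non-decreasing order.
    """

    def __init__(self, baseline, max_drop, window_seconds=900, min_samples=30):
        self.baseline = baseline
        self.max_drop = max_drop            # e.g. 0.02 for a 2pp threshold
        self.window_seconds = window_seconds
        self.min_samples = min_samples      # avoid alerting on a handful of traces
        self._events = deque()
        self._total = 0.0

    def add(self, ts, score):
        """Record one score; return True if the rolling mean has drifted."""
        self._events.append((ts, score))
        self._total += score
        # Evict scores that have fallen out of the window.
        while self._events and self._events[0][0] <= ts - self.window_seconds:
            _, old = self._events.popleft()
            self._total -= old
        count = len(self._events)
        if count < self.min_samples:
            return False
        return self.baseline - self._total / count > self.max_drop
```

The `min_samples` guard is there because a low-traffic route can swing several points on two or three traces, which would otherwise page someone for noise.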

Guardrail layer (synchronous, not a judge). Do not confuse online eval with synchronous safety guardrails. Guardrails live in the production critical path and must complete in under 50ms. An LLM judge call takes 500ms to 3 seconds — it cannot be synchronous. Guardrails use: (a) deterministic checks (PII regex, blocked-content list, output schema validation) and (b) a fine-tuned binary classifier trained on labeled examples. The classifier provides <50ms latency and costs a fraction of a cent per call. Eugene Yan's analysis shows that a fine-tuned NLI model for factual consistency in summarization reaches ROC-AUC 0.85 with around 1,000 labeled samples — that is your target before trusting a classifier for production guardrails. Reserve the LLM judge for asynchronous evaluation.

Eval for RAG, agents, and classifiers

Each system type has distinct eval requirements. A single "quality" metric obscures what is actually broken.

RAG evaluation. A RAG system has two independent failure modes: retrieval failure (the right chunk was not retrieved) and generation failure (the right chunk was retrieved but the answer still hallucinated). Conflating them into one score hides the root cause. The RAGAS framework (Es et al., EACL 2024) provides four metrics, most of them computed by an LLM judge against the retrieved context rather than a ground-truth answer, which makes them usable without a large labeled dataset:

Faithfulness (0–1): every factual claim in the answer must be supported by the retrieved context. Decompose the answer into claims, then ask a judge whether each claim is entailed by the context. RAGAS reports example scores of 0.936 on technical documentation.
Answer relevance (0–1): have the judge generate pseudo-questions from the answer, embed them alongside the original question, then compute cosine similarity. If the answer drifts off-topic, the pseudo-questions diverge from the original.
Context precision (0–1): fraction of retrieved chunks that are actually relevant to the question.
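Context recall (0–1): fraction of the needed information that appears somewhere in the retrieved context. Unlike faithfulness and answer relevance, deciding what was "needed" generally requires a reference answer to compare against.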

For the design of the retrieval and generation pipeline these metrics evaluate, see design-rag-pipeline.

Agent evaluation. End-to-end task success — did the agent complete the task? — is necessary but not sufficient. If task success is 60%, you need to know whether the failure is in planning, tool selection, tool execution, or final answer synthesis. Trajectory-level eval examines each step of the agent's run: was the tool call appropriate given the prior context? Did the agent correctly interpret the tool result? Was the final synthesis faithful to the trajectory?

For a RAG agent, this means separate eval passes for: retrieval recall (did it retrieve the right documents?), reasoning (did it correctly identify which retrieved facts answer the question?), and synthesis (did the final answer accurately reflect those facts?). A good platform-level eval system for agents stores the full step trajectory per run — as described in design-ai-agent-platform — and allows scorers to evaluate individual steps rather than only the terminal output.

Classifier evaluation. LLM-based classifiers (intent routing, topic tagging, sentiment) have well-established metrics: precision, recall, F1, and confusion matrix per class. These are deterministic and need no LLM judge. The critical point: evaluate per class, not just in aggregate. A 90% aggregate F1 can hide a class with 40% recall, especially when that class is rare and the aggregate is micro-averaged, or when there are many classes to average over. Per-class thresholds in the CI gate (e.g., every class must have recall >= 0.80) catch this. Voiceflow's 10% regression catch used classification-specific evals, not open-ended quality metrics.
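The per-class gate is simple to express. In this sketch the function names and the intent labels are made up for illustration; the 0.80 recall floor is the example threshold from the paragraph above.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in the ground-truth labels."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def recall_gate(y_true, y_pred, min_recall=0.80):
    """Return ("block" | "pass", failing_classes) using a per-class recall floor."""
    recalls = per_class_recall(y_true, y_pred)
    failing = {label: r for label, r in recalls.items() if r < min_recall}
    return ("block" if failing else "pass"), failing


# Hypothetical intent router: 95 "billing" rows and 5 rare "cancel" rows.
y_true = ["billing"] * 95 + ["cancel"] * 5
y_pred = ["billing"] * 95 + ["cancel"] * 2 + ["billing"] * 3
decision, failing = recall_gate(y_true, y_pred)
```

Overall accuracy in this toy example is 97%, yet the gate blocks because the rare `cancel` class is recalled only two times out of five.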

Human annotation queues and calibration

Human labels are the ground truth. Every other part of the eval platform is an approximation of human judgment. The platform's job is to use human labeling resources efficiently.

Routing to the queue. Three categories of cases go to human annotation: (a) low-confidence judge cases: any evaluation where the judge score falls near the decision boundary (e.g., 0.4–0.6 on a 0–1 scale) or where the judge's chain-of-thought explicitly flags uncertainty; (b) random calibration samples: a small uniform random sample of evaluations, sized to the annotation budget (about 5,000 annotations per month platform-wide in the capacity estimate, not a fixed percentage of the 450 M monthly evaluations), to maintain a running measure of judge accuracy; and (c) flagged production failures: traces that triggered drift alerts or guardrail blocks.

Inter-annotator agreement. When a case is sent to the queue, assign it to two independent annotators. Measure Cohen's kappa on their labels. A kappa below 0.7 across a dimension indicates the rubric is underspecified — annotators are applying different interpretations of the same criterion. A kappa below 0.4 indicates the task is ambiguous enough that even well-specified binary judgment is unreliable; consider decomposing the criterion into smaller, more concrete sub-questions. Lilian Weng's analysis of annotation quality (Lil'Log, 2024) documents the techniques for handling disagreements: adjudication rounds, majority vote with weighted annotator reliability (MACE), and iterative rubric revision.
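Cohen's kappa compares observed agreement with the agreement two annotators would reach by chance given their own label frequencies. A minimal standard-library sketch, with a hypothetical function name:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


ann_1 = ["pass", "pass", "fail", "fail"]
ann_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(ann_1, ann_2)  # observed 0.75, chance 0.50
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5, well under the 0.7 bar.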

Closing the loop. Approved human annotations do two things: they update the calibration measure for the judge prompt (recompute Spearman rho on the growing calibration set), and they optionally get added to the golden dataset as hard cases. The second is the more valuable in the long term. A production failure that was flagged, annotated, and added to the golden set will be caught by CI on every future change. Over time, the golden dataset becomes a comprehensive regression suite built from real failure modes rather than launch-time thought experiments.

PII before annotation. Production traces contain real user data. Before any trace enters the annotation queue or the golden dataset, run a PII scrubber (regex for SSN/email/phone patterns, plus an NER model for names and addresses). Log the scrubbing actions. This is a GDPR/CCPA compliance requirement, and it also prevents PII from being sent to judge model APIs from third-party providers.

Experiment tracking and scoring storage

Every eval run must be reproducible. That requires storing four things for every run: the exact prompt version, the model version, the content-addressed dataset version (a sha256 hash of the JSONL file), and the judge prompt versions used for scoring. Without all four, you cannot reproduce the run six months later when someone asks "why did we change this prompt?"
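Content addressing can be sketched as hashing the dataset rows in a canonical serialization. The article describes hashing the JSONL file itself; canonicalizing each row first (sorted keys, compact separators) is an added assumption here, made so that a cosmetic change in key order does not produce a new version. The function name is hypothetical.

```python
import hashlib
import json


def dataset_version(rows):
    """Return a content-addressed version string for a list of dict rows.

    Row order is part of the identity; key order within a row is not.
    """
    hasher = hashlib.sha256()
    for row in rows:
        line = json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        hasher.update(line.encode("utf-8"))
        hasher.update(b"\n")
    return "sha256:" + hasher.hexdigest()


rows = [{"id": "ex_1", "input": "Is dental covered?", "expected": "no"}]
version = dataset_version(rows)
```

Storing this string on every run means a later replay can verify it loaded the same rows before comparing any scores.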

Brainstore (Braintrust), W&B Weave, Langfuse, and MLflow are all viable experiment tracking backends. Langfuse (acquired by ClickHouse in January 2026) runs on ClickHouse natively, which gives it a structural advantage for co-locating experiment metadata and score records in the same columnar store. Arize Phoenix uses OpenInference (OpenTelemetry-compatible) as its trace format, which allows eval spans to plug into existing APM infrastructure — reducing the number of systems engineers must maintain.

The key query patterns the scoring store must serve efficiently:

Mean faithfulness over the last 30 runs for dataset v3 (time-series drill-down per experiment).
Distribution of faithfulness scores for "topic=medical" rows this week vs. last week (slice regression detection).
All runs where faithfulness dropped below 0.80 and the judge explanation contains "hallucinated" (failure mode clustering).

These are analytical aggregations on high-cardinality dimensions. A row-oriented database (Postgres without TimescaleDB) struggles with the third query at 450M rows/month. ClickHouse handles it with columnar scans and vectorized execution.

Edge cases and gotchas

Metric sprawl. The most common LLMOps failure mode Hamel Husain documented across 30+ companies: teams add metrics faster than they calibrate them. After a year, they have 15 metrics, none of them correlated to human labels, all drifting independently. The discipline is to add a new metric only when you can show that the existing set does not detect a known real failure. Every metric must pass calibration (rho > 0.7) before it enters the CI gate.

Static golden datasets. A dataset created at product launch does not contain the hard cases users discover in month six. Teams ship regressions on queries absent from their eval set. The fix is continuous mining: every low-scoring production trace is a candidate for golden set inclusion after human annotation. Run a deduplication step before adding (embedding-based nearest-neighbor check against the existing dataset) to avoid accumulating many similar examples of the same failure.
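The deduplication check could be sketched as follows, assuming each example already has an embedding vector from whatever embedding model the team uses. The brute-force scan stands in for a real nearest-neighbor index, and the 0.92 cosine threshold is an illustrative value to tune, not a recommendation from the source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_near_duplicate(candidate, existing, threshold=0.92):
    """True if any existing embedding is at least `threshold` similar."""
    return any(cosine(candidate, e) >= threshold for e in existing)

def admit(candidates, golden_embeddings, threshold=0.92):
    """Return the candidates that add a new failure mode to the golden set."""
    pool = list(golden_embeddings)
    admitted = []
    for emb in candidates:
        if not is_near_duplicate(emb, pool, threshold):
            admitted.append(emb)
            pool.append(emb)  # later candidates are checked against this one too
    return admitted

golden = [[1.0, 0.0]]
new_cases = [[0.99, 0.05], [0.0, 1.0], [0.01, 1.0]]
print(admit(new_cases, golden))  # [[0.0, 1.0]]
```

Checking each candidate against previously admitted candidates, not only the existing golden set, prevents a burst of similar production failures from all being added in the same batch.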

Aggregate-score-only CI gates. A prompt change that improves average faithfulness by 2% while degrading faithfulness on medical queries by 15% passes an aggregate threshold gate. Per-slice thresholds on all major capability dimensions (topic, user segment, query type) must be independently enforced. The LLM Readiness Harness paper (arXiv:2603.27355) demonstrates this with concrete thresholds: Groundedness ≥ 0.85, ContextRelevance ≥ 0.80, Completeness ≥ 0.75, CitationValidity ≥ 0.99 — enforced per route, not just overall.
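A per-slice gate can be sketched as below. The function name and row shape are assumptions for illustration; the point is that the aggregate check and each slice check are evaluated independently, and any single failure blocks the merge.

```python
from collections import defaultdict
from statistics import mean

def check_gate(rows, overall_min, slice_min, slice_key="topic"):
    """Return a list of failures; an empty list means the gate passes.

    rows: dicts with a 'score' and a slice label under `slice_key`.
    slice_min: per-slice floors; slices not listed fall back to overall_min.
    """
    failures = []
    overall = mean(r["score"] for r in rows)
    if overall < overall_min:
        failures.append(f"overall: {overall:.3f} < {overall_min}")
    by_slice = defaultdict(list)
    for r in rows:
        by_slice[r[slice_key]].append(r["score"])
    for name in sorted(by_slice):
        floor = slice_min.get(name, overall_min)
        slice_mean = mean(by_slice[name])
        if slice_mean < floor:
            failures.append(f"{slice_key}={name}: {slice_mean:.3f} < {floor}")
    return failures

# 90 general rows at 0.90 and 10 medical rows at 0.70 average 0.88 overall,
# so an aggregate-only gate at 0.85 would pass this change.
rows = ([{"topic": "general", "score": 0.90}] * 90
        + [{"topic": "medical", "score": 0.70}] * 10)
print(check_gate(rows, overall_min=0.85, slice_min={}))
# ['topic=medical: 0.700 < 0.85']
```

In practice each slice also needs a minimum row count before its mean is enforced, otherwise a handful of noisy examples can block every merge.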

Judge calibration on training data. Engineers iterate the judge prompt, check agreement against a set of examples, improve the prompt, check again — using the same examples throughout. The resulting judge achieves high apparent agreement because it has been optimized on that set. The held-out calibration set (never shown during prompt development) must be used for the final rho measurement. This is analogous to train/test leakage in ML.
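One way to keep the held-out set honest is to assign each labeled example to a split deterministically from its ID, so an example can never migrate from held-out to development between iterations. This is a sketch under that assumption; the function name and the 30% hold-out fraction are illustrative.

```python
import hashlib

def calibration_split(example_id: str, holdout_fraction: float = 0.3) -> str:
    """Assign an example to 'dev' or 'holdout' based only on its ID."""
    digest = hashlib.sha256(example_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "dev"

ids = [f"example-{i}" for i in range(1000)]
dev = [i for i in ids if calibration_split(i) == "dev"]
holdout = [i for i in ids if calibration_split(i) == "holdout"]
print(len(dev), len(holdout))
```

Engineers iterate the judge prompt against `dev` only; the final rho is measured once on `holdout`, and new annotations land in whichever split their ID hashes to.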

Verbosity bias reversal. Older LLM judges preferred verbose responses more than 90% of the time. Mitigation strategies encoded "penalize length" in rubrics. Some 2025-era models (notably Claude Sonnet 4) show the opposite — conciseness preference — so those mitigation instructions now actively harm judge quality for those models. Other models may still favor length. Bias must be re-measured per model version whenever the judge is updated, not assumed stable.
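Re-measuring the bias can be as simple as the sketch below. It assumes a probe set of response pairs that humans rated as equal in quality but that differ in length, so any systematic preference is attributable to length alone. The function name and tuple layout are hypothetical.

```python
def longer_preference_rate(comparisons):
    """Fraction of decided comparisons in which the judge picked the longer response.

    comparisons: (len_a, len_b, winner) tuples, winner in {'a', 'b', 'tie'}.
    Returns None when no comparison is decided between unequal lengths.
    """
    decided = [(la, lb, w) for la, lb, w in comparisons
               if w != "tie" and la != lb]
    if not decided:
        return None
    longer_wins = sum(1 for la, lb, w in decided if (w == "a") == (la > lb))
    return longer_wins / len(decided)

probe = [(100, 20, "a"), (30, 90, "b"), (50, 80, "a"), (40, 40, "a"), (10, 70, "tie")]
print(longer_preference_rate(probe))  # 2 of 3 decided pairs favored the longer response
```

A rate near 0.5 on quality-matched pairs suggests no length bias; a rate well above or below it indicates which direction the rubric needs to correct for with this judge version.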

Synchronous judge calls in production. Teams add an LLM judge call to the production request path as a "quality check before responding." This adds 500ms to 3 seconds to user-facing latency. Users notice. The correct architecture is asynchronous: emit the trace, let the online eval consumer score it afterward. Only fine-tuned binary classifiers (sub-50ms) belong in the synchronous path.
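The asynchronous shape can be illustrated with a toy in-process version. Everything here is a stand-in: `asyncio.Queue` plays the role of the real trace pipeline, `slow_judge` simulates the judge call with a short sleep, and the handler returns a canned string instead of calling a model.

```python
import asyncio

async def handle_request(query, trace_queue):
    """Request path: answer the user and emit a trace without waiting on a judge."""
    answer = f"answer to: {query}"  # stand-in for the real model call
    trace_queue.put_nowait({"query": query, "answer": answer})
    return answer

async def slow_judge(trace):
    """Stand-in for an LLM judge call; the sleep simulates judge latency."""
    await asyncio.sleep(0.05)
    return 1.0 if trace["answer"] else 0.0

async def eval_consumer(trace_queue, scores):
    """Online eval consumer: scores traces after the response has been sent."""
    while True:
        trace = await trace_queue.get()
        scores.append(await slow_judge(trace))
        trace_queue.task_done()

async def main():
    trace_queue = asyncio.Queue()
    scores = []
    consumer = asyncio.create_task(eval_consumer(trace_queue, scores))
    answers = [await handle_request(q, trace_queue) for q in ("a", "b", "c")]
    scored_at_response_time = len(scores)  # judge has not run yet
    await trace_queue.join()               # wait here only for the demo
    consumer.cancel()
    return answers, scored_at_response_time, len(scores)

print(asyncio.run(main()))
```

All three answers are returned before any score exists, which is the property that matters: judge latency is paid by the consumer, not by the user.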

Missing agent trajectory evals. Teams deploy agent evals that measure only final task success. A 60% task success rate tells you nothing about where the agent fails. Is it using the wrong tool? Misinterpreting tool results? Hallucinating in the synthesis step? Trajectory-level scoring on each step is required to diagnose root causes. End-to-end success alone is useful only for tracking, not for debugging.
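A minimal sketch of trajectory-level attribution, assuming each step of a trajectory has already been scored pass/fail by a step-level evaluator. The step names and data layout are hypothetical. The breakdown counts the first failing step per trajectory, which is usually the root cause rather than a downstream symptom.

```python
from collections import Counter

def first_failure(trajectory):
    """Return the name of the first failed step, or None if every step passed."""
    for step in trajectory:
        if not step["ok"]:
            return step["step"]
    return None

def failure_breakdown(trajectories):
    """Count trajectories by the step at which they first went wrong."""
    counts = Counter()
    for trajectory in trajectories:
        failed = first_failure(trajectory)
        if failed is not None:
            counts[failed] += 1
    return counts

runs = [
    [{"step": "tool_selection", "ok": True}, {"step": "interpret_result", "ok": False}],
    [{"step": "tool_selection", "ok": False}],
    [{"step": "tool_selection", "ok": True}, {"step": "interpret_result", "ok": True},
     {"step": "synthesis", "ok": True}],
    [{"step": "tool_selection", "ok": True}, {"step": "interpret_result", "ok": False},
     {"step": "synthesis", "ok": False}],
]
print(failure_breakdown(runs).most_common())
# [('interpret_result', 2), ('tool_selection', 1)]
```

The same 75% failure rate in this toy data now reads as "most failures start at result interpretation," which is something a team can act on.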

Trade-offs to discuss in an interview

Pairwise (Elo/Bradley-Terry) vs. absolute scoring. Pairwise is more reliable for subjective quality because it anchors to a concrete alternative rather than an abstract scale. LMArena's BT model on 5.6M+ human votes is the gold standard for model-level leaderboards. But pairwise requires O(N²) comparisons as the set of alternatives grows and cannot give you a monotone threshold for CI gating. Use pairwise for subjective tasks and leaderboard construction; use absolute reference-based scoring with explicit thresholds for CI gates.
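To make the pairwise side concrete, here is a small sketch of fitting Bradley-Terry strengths from win counts with the standard minorization-maximization update, plus the pair count behind the O(N²) cost. It illustrates the technique only and makes no claim about how any particular leaderboard implements it; the function names are hypothetical.

```python
def pair_count(n):
    """Number of distinct head-to-head pairings among n alternatives."""
    return n * (n - 1) // 2

def fit_bradley_terry(wins, iterations=200):
    """wins[(a, b)] = number of times a beat b. Returns strengths that sum to 1.

    Uses the MM update p_i <- W_i / sum_j n_ij / (p_i + p_j), where W_i is
    i's total wins and n_ij is the number of comparisons between i and j.
    """
    items = sorted({x for pair in wins for x in pair})
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {i: s / norm for i, s in updated.items()}
    return strength

print(pair_count(10))  # 45
print(fit_bradley_terry({("A", "B"): 3, ("B", "A"): 1}))
```

Under the model, the probability that A beats B is `strength[A] / (strength[A] + strength[B])`. That yields a ranking and win probabilities, but no absolute score to compare against a fixed CI threshold.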

LLM judge vs. fine-tuned classifier. LLM judges (GPT-4 class) give high accuracy, explainable chain-of-thought, and generalize to new criteria without training data. They cost $0.01–$0.10 per call and add 500ms–3s latency. Fine-tuned classifiers add under 50ms and cost fractions of a cent but require ~1,000 labeled examples to train, have narrow generalization, and cannot explain their verdicts. Use LLM judges for offline eval, calibration, and low-volume annotation; use classifiers for synchronous production guardrails.

Synthetic vs. production-mined test data. Synthetic data (LLM-generated hard cases) covers known failure modes quickly and cheaply. Production-mined data surfaces unknown failure modes that your synthetic generation missed. The right answer is both, in sequence: start synthetic to bootstrap coverage, then continuously replace or supplement with production-mined cases as the system matures. Treat the fraction of production-derived examples as a maturity metric.

Same judge model family vs. independent. Using the same model as judge and SUT risks self-enhancement bias (+10–25% win-rate inflation). Using a different family introduces potential misalignment in evaluation criteria. Averaging two judges from different families is the most reliable approach, at 2x the judge cost. For cost-sensitive deployments, prefer the independent-family judge and accept some evaluation style mismatch over the systematic inflation of self-judging.

OpenTelemetry standard vs. proprietary trace format. Arize Phoenix (OpenInference) and Langfuse adopt OpenTelemetry, allowing eval spans to appear in existing APM dashboards alongside infrastructure metrics. Braintrust and early LangSmith used proprietary formats with richer LLM-specific metadata (token counts per model, tool call arguments). The industry trend is toward OpenTelemetry as the substrate, with LLM-specific semantic conventions layered on top. For new systems, prefer OTel and avoid lock-in.

Things you should now be able to answer

A team ships a new system prompt. Their LLM judge shows a 1-point improvement on a 5-point helpfulness scale. Why might this be meaningless, and what would you require instead?
Walk through how swap augmentation cancels position bias. At what cost, and what do you do with inconclusive swap results?
Your CI gate uses a single faithfulness metric with a threshold of 0.85. A PR passes with score 0.87, but later you discover medical queries regressed by 18%. What is the architectural fix?
What is the difference between a guardrail and an online eval judge? Why can't the same component serve both roles?
A team uses GPT-4o to judge GPT-4o outputs. They notice their model consistently beats Claude in head-to-head evals. What bias explains this, and how do you detect it?
Describe the dataset curation loop: how does a production failure become a golden test case?
Why is binary pass/fail recommended for initial judge calibration rather than a 1-5 scale?
An agent achieves 65% end-to-end task success. Product wants to know why it fails. What eval does the team need to add?
What storage engine would you choose for 450 million score records per month and why?
A judge's Spearman rho drops from 0.78 to 0.51 on the weekly calibration check. What are the three most likely causes?

Frequently asked questions

▸Why not just use ROUGE or BLEU to evaluate LLM outputs?

ROUGE and BLEU measure n-gram overlap with a reference string. Even for translation, where the reference is a known correct target, they correlate poorly with human judgment at the sentence level and rank near the bottom of WMT22/23 leaderboards, where COMET and chrF have replaced them. For open-ended generation (summarization, chatbots, agents), the correlation is essentially zero. A model can score 0.1 ROUGE-L while producing a factually correct, well-written answer in different words. LLM-as-judge and task-specific metrics like faithfulness replace them.

▸What is position bias in LLM-as-judge and how do you fix it?

When asked to compare two responses (A vs B), an LLM judge can systematically favor a position, most often whichever response appears first. Zheng et al. (2023, MT-Bench) measured swap consistency: whether a judge gives the same verdict when A and B are reversed. GPT-4 was 65% swap-consistent, GPT-3.5 only 46%, and Claude-v1 only 23.8%, meaning it changed its verdict after swapping more than three-quarters of the time. That pattern is systematic bias rather than random noise: Claude-v1 was biased toward the first-listed response 75% of the time, GPT-4 only 30% of the time. The standard fix is swap augmentation: run the comparison twice with the A/B order swapped, then either average the scores or count a win only when both orderings agree and treat a disagreement as a tie. This doubles judge call cost but cancels the position effect in the final verdict. Alternatively, use reference-anchored scoring where each response is scored independently against a rubric rather than compared pairwise.
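Swap augmentation can be sketched as a thin wrapper around any pairwise judge. This version uses the conservative rule described by Zheng et al., which counts a win only when both orderings agree and otherwise records a tie. The `judge` callable and its return values are assumptions made for illustration.

```python
def swap_augmented_verdict(judge, question, resp_1, resp_2):
    """Run the judge in both orders and keep only order-independent verdicts.

    judge(question, first, second) must return 'first', 'second', or 'tie'.
    Returns 'resp_1', 'resp_2', or 'tie'.
    """
    forward = judge(question, resp_1, resp_2)   # resp_1 shown first
    backward = judge(question, resp_2, resp_1)  # resp_2 shown first
    # Translate positional verdicts back into response identities.
    forward_pick = {"first": "resp_1", "second": "resp_2"}.get(forward, "tie")
    backward_pick = {"first": "resp_2", "second": "resp_1"}.get(backward, "tie")
    if forward_pick == backward_pick:
        return forward_pick
    return "tie"  # verdict flipped with the order: inconclusive

def always_first(question, first, second):
    """A maximally position-biased toy judge."""
    return "first"

def prefers_longer(question, first, second):
    """A toy judge whose verdict depends on content, not position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

print(swap_augmented_verdict(always_first, "q", "short", "a longer answer"))    # tie
print(swap_augmented_verdict(prefers_longer, "q", "short", "a longer answer"))  # resp_2
```

Tracking the fraction of comparisons that end in an order-induced tie doubles as a running swap-consistency measurement for the judge.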

▸How do you prevent an LLM from judging its own outputs?

Self-enhancement bias is well-documented: GPT-4 assigned to judge GPT-4 outputs shows a +10% win-rate for those outputs in head-to-head comparisons; Claude-v1 (2023) judging Claude-v1 outputs shows +25% (Zheng et al., MT-Bench). The fix is to use a judge from a different model family than the system under test, or average verdicts from two judges from different families. Then calibrate the judge against human labels on a held-out set and require Spearman rho > 0.7 before that judge is trusted for CI gating.

▸When should you use pairwise (Elo/Bradley-Terry) scoring versus absolute scoring?

Pairwise comparison, which asks a judge "which response is better, A or B?", is more reliable for subjective quality tasks (tone, helpfulness, creativity) because it anchors to a concrete alternative rather than an abstract scale. LMArena fits the Bradley-Terry model on 5.6M+ human pairwise votes because BT estimates all ratings jointly, so the result does not depend on the order in which votes arrive (unlike online Elo updates), and it supports bootstrapped confidence intervals. Use pairwise when building leaderboards or comparing prompt versions head-to-head. Use absolute reference-based scoring when you need monotone thresholds for CI gates, such as "faithfulness must be >= 0.85 to merge."

▸What is the difference between offline eval and online eval?

Offline eval runs a versioned golden dataset through the system under test in a controlled, reproducible batch — before a change ships. It catches intentional regressions from prompt edits, model swaps, or code changes. Online eval runs scorers asynchronously against sampled live production traces after a change ships. It catches environmental drift: upstream model API changes, shifting user query distributions, or infrastructure changes that offline datasets do not contain. Both are required; offline without online leaves you blind to what happens to you; online without offline means you are shipping changes without pre-screening them.
