Advanced · asked at LangChain, Langfuse, Datadog, Arize

Design an LLM Observability Platform

Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.

30 min read · 2026-06-25 · Ironclad Academy

The problem

In late 2023, the team behind a customer-support chatbot built on GPT-4 and a retrieval pipeline noticed their CSAT score dropping. Their Datadog dashboard showed all green: p99 latency was fine, error rate was 0.1%. What they could not see was that a prompt change shipped three weeks earlier had silently degraded answer quality on billing questions — the model was now confidently wrong instead of escalating to a human. They had no way to detect it until customers complained.

This is the core problem with LLM observability. Conventional monitoring tracks operational health — is the service up, is it fast, are requests failing? LLM applications have a third axis: quality. A response can be fast, non-erroring, and factually wrong. The model can ignore a retrieved document, hallucinate a policy that does not exist, or enter a reasoning loop that costs $50 before a budget cap fires. None of those events register as errors in a conventional APM tool.

LLM apps also differ structurally from conventional services. A single user request may fan out across a retrieval step, two embedding calls, a tool invocation, and three sequential LLM calls on different models with different prompts. Each hop adds latency, cost, and a potential failure mode. Understanding why a response was bad requires inspecting the entire execution tree — which prompt was used, which documents were retrieved, what the model was told at each step, how many tokens it consumed, and what the cost was. The instrumentation problem is a distributed tracing problem.

The systems built specifically for this space — LangSmith, Langfuse, Arize Phoenix, Helicone, Datadog LLM Observability — all converge on the same architecture: nested spans modeled on OpenTelemetry, a columnar store for fast aggregation, async ingest with PII controls, and a specialized UI for inspecting execution trees. This article is a ground-up design for that architecture.

For general metrics infrastructure see design-metrics-monitoring-system and for log aggregation see design-log-aggregation — this article assumes you know those systems and focuses on what is specific to LLM workloads.

Functional requirements

  • Accept telemetry from any LLM framework (LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, custom code) via OTLP or a vendor SDK.
  • Model each request as a trace with nested spans: root span for the user-facing request; child spans for each LLM call, retrieval step, tool invocation, embedding call, and agent reasoning cycle.
  • Each span records: operation type, model name, input tokens, output tokens, cached tokens, cost in nanodollars, latency, model parameters (temperature, top-p, max-tokens), error info, and a reference to stored prompt/completion content.
  • Tag every span with user_id, feature, deployment, prompt_version, and agent_run_id for attribution.
  • Support streaming spans: record TTFT on first chunk, accumulate token counts across chunks, close span on final chunk.
  • Prompt version management: store prompt text with version IDs in a registry; link spans to versions; expose per-version latency, cost, and quality metrics.
  • Attach online quality signals to spans: LLM-as-judge scores, guardrail hit events, explicit user feedback.
  • Query interface: trace tree viewer for single-run debugging; time-series dashboards for cost, latency, and quality trends; alerting on TTFT SLO, error rate, and cost spikes.
  • PII redaction before storage; opt-in content capture; configurable sampling rates.

Non-functional requirements

  • The SDK OTLP export call completes in under 2 ms on the local thread; all network I/O is dispatched by a background exporter thread so application code is never blocked on observability.
  • 90% of spans written to the trace store within 10 seconds of being emitted.
  • Full-trace retrieval (reconstruct entire execution tree for one trace ID) in under 200 ms.
  • Dashboard queries for weekly aggregations (cost by feature, TTFT by model) in under 1 second.
  • At least 30 days of trace retention with configurable TTL; 90 days for cost attribution data.
  • Horizontally scalable ingest: handle 10× traffic spikes without data loss (Kafka backpressure absorbs bursts).
  • Multi-tenant: cost, token usage, and trace data isolated per tenant; no cross-tenant data leakage.

Capacity estimation

| Dimension | Estimate | How we got there |
|---|---|---|
| Span ingest rate | 60,000 spans/sec | 10,000 req/sec × 6 spans/req avg (1 root + 1 LLM + 2 retrieval + 1 tool + 1 eval) |
| Daily span volume | 5.2 billion spans | 60,000 × 86,400 sec |
| Uncompressed span size | ~200 bytes/span (metadata only) | TraceId + SpanId + ParentSpanId + timestamps + ~20 gen_ai.* attributes at avg 8 bytes each |
| Daily storage (metadata) | ~1 TB uncompressed → ~100 GB compressed | 5.2B × 200 bytes; ZSTD(1) achieves 9–10x on trace data |
| Content capture (5% sample) | 86 GB/day to S3 | 10,000 req/sec × 5% × 2 KB avg prompt+completion × 86,400 |
| Kafka partition throughput | 60,000 × 200 bytes = 12 MB/sec | Well within 30-partition topic at ~400 KB/sec per partition (12 MB/sec ÷ 30) |
| ClickHouse worker pool | ~15 ECS tasks | 5 original tasks × 3 consumers each at 50 MB batch; headroom for 3× spikes |
| Full-trace retrieval latency | 110–197 ms | ClickHouse OTel demo: bloom filter on TraceId + materialized view for min/max timestamp (200M spans in 13.68 GiB compressed / ~125 GB uncompressed, i.e. ~625 bytes/span in the full OTel demo schema; this article's optimized subset targets ~200 bytes/span, which would compress further) |

Takeaway: At 10,000 LLM req/sec, you are ingesting roughly 60K spans/sec. Kafka absorbs burst; ClickHouse compresses 1 TB/day to ~100 GB; content goes to S3 where it is cheap. The expensive mistake is Postgres — IOPS exhaustion at this volume caused 50-second latency spikes in production at Langfuse before they migrated to ClickHouse in 2024.
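
The arithmetic behind these estimates is easy to re-run against your own traffic. The sketch below is a minimal back-of-envelope model: every constant is an assumption copied from the table (request rate, spans per request, span size, compression ratio, sample rate), and the function name capacity is illustrative rather than part of any tool.

```python
# Back-of-envelope capacity model. Every constant is an assumption copied
# from the estimates in this section; change them to match your own traffic.
REQ_PER_SEC = 10_000
SPANS_PER_REQ = 6          # 1 root + 1 LLM + 2 retrieval + 1 tool + 1 eval
SPAN_BYTES = 200           # metadata only, uncompressed
COMPRESSION_RATIO = 10     # upper end of the 9-10x ZSTD(1) estimate
CONTENT_SAMPLE_PCT = 5     # percent of requests with content captured
CONTENT_BYTES = 2_000      # avg prompt + completion size
KAFKA_PARTITIONS = 30
SECONDS_PER_DAY = 86_400


def capacity() -> dict:
    """Return the derived capacity figures as plain numbers."""
    spans_per_sec = REQ_PER_SEC * SPANS_PER_REQ
    spans_per_day = spans_per_sec * SECONDS_PER_DAY
    raw_bytes_per_day = spans_per_day * SPAN_BYTES
    ingest_bytes_per_sec = spans_per_sec * SPAN_BYTES
    sampled_req_per_sec = REQ_PER_SEC * CONTENT_SAMPLE_PCT // 100
    return {
        "spans_per_sec": spans_per_sec,
        "spans_per_day": spans_per_day,
        "raw_tb_per_day": raw_bytes_per_day / 1e12,
        "compressed_gb_per_day": raw_bytes_per_day / COMPRESSION_RATIO / 1e9,
        "content_gb_per_day": sampled_req_per_sec * CONTENT_BYTES * SECONDS_PER_DAY / 1e9,
        "kafka_mb_per_sec": ingest_bytes_per_sec / 1e6,
        "kafka_kb_per_sec_per_partition": ingest_bytes_per_sec / KAFKA_PARTITIONS / 1e3,
    }


if __name__ == "__main__":
    for name, value in capacity().items():
        print(f"{name}: {value:,.1f}")
```

Doubling SPANS_PER_REQ (a deeper agent loop, say) doubles every storage and Kafka figure, which is usually the first sensitivity worth checking.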

Building up to the design

V1: Log everything to Postgres. The first instinct is to write a middleware that logs each LLM call's prompt, response, and token count to a Postgres table. This works fine at modest traffic — a few hundred spans per second. Once you cross into tens of millions of rows per day (roughly 500+ rows/sec sustained), synchronous writes start adding measurable latency to every user request, IOPS saturate on write-heavy Postgres instances, and the query planner struggles with the wide jsonb column you inevitably add for attributes. Langfuse ran this setup in V1 and V2, hit 50-second ingestion latency spikes as IOPS saturated, and rebuilt the entire ingest path.

V2: Async ingest with a queue, still Postgres as store. Decouple the application from the storage write: the SDK enqueues events to Redis or Kafka and returns immediately, and a worker consumes them and inserts to Postgres. The storage write no longer adds latency to the request path. But Postgres still struggles with analytical queries: a query like "show me p99 TTFT by model for last week" requires scanning millions of rows, and any time-series aggregation with a jsonb attribute filter is painful. Aggregate tables help but require careful pre-planning of every dimension you might want.

V3: ClickHouse as the trace store. Replace Postgres-for-traces with ClickHouse. The query that took 45 seconds on Postgres runs in under a second on ClickHouse because columnar storage reads only the columns it needs (e.g., gen_ai.usage.input_tokens and Timestamp) without touching prompt text or other wide fields. ZSTD compression brings 200M spans to 13.68 GiB. Partition by day for TTL deletion. Bloom filter indexes on TraceId make single-trace reconstruction fast. Keep Postgres only for transactional data: accounts, project config, prompt definitions, billing metadata. This is the Langfuse V3 architecture, shipped December 2024.

flowchart TD
    A[Application] -->|fire-and-forget OTLP| B[OTel Collector<br/>tail sampling · PII redaction]
    B -->|batched OTLP| C[(Kafka<br/>30 partitions · 7-day TTL)]
    B -->|metric counters| D[(Prometheus)]
    C --> E[Worker pool<br/>state compute · dedup · S3 offload]
    E -->|INSERT with version| F[(ClickHouse<br/>ReplacingMergeTree)]
    E -->|content reference| G[(S3<br/>prompt and completion bodies)]
    F --> H[Trace tree UI]
    F --> I[Cost dashboard]
    D --> J[TTFT SLO alert]
    style B fill:#ff6b1a,color:#0a0a0f
    style C fill:#0e7490,color:#fff
    style F fill:#15803d,color:#fff
    style E fill:#ffaa00,color:#0a0a0f
    style G fill:#a855f7,color:#fff
    style J fill:#ff2e88,color:#fff

V4: Add the evaluation loop and prompt management. The trace store is now operational. The next gap is quality: you can see how much the model cost and how long it took, but not whether it answered correctly. Add an async evaluation pipeline that samples 10–20% of completed traces, runs LLM-as-judge evaluators, and writes scores back to ClickHouse as score rows linked to trace IDs. Add a prompt version registry in Postgres, tag every span with prompt_version, and build a query that shows cost/quality trade-offs across versions. Now you can compare prompt v7 versus v8 on real production traffic before declaring v8 the winner.

API

The SDK surface is intentionally thin — the heavy lifting happens server-side.

# Auto-instrumentation (OpenLLMetry, zero-code change)
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()

# Manual span wrapping for business context
from opentelemetry import trace
tracer = trace.get_tracer("my-app")

with tracer.start_as_current_span("handle-support-ticket") as span:
    span.set_attribute("user.id", user_id)
    span.set_attribute("feature", "support-chat")
    span.set_attribute("prompt_version", "v12")
    span.set_attribute("agent_run_id", run_id)
    # LLM calls are instrumented automatically by OpenLLMetry
    response = openai_client.chat.completions.create(...)

Ingest endpoint (OTLP over HTTP, standard):

POST /v1/traces
Content-Type: application/x-protobuf
Body: ExportTraceServiceRequest (OTel proto)

POST /v1/metrics  
Body: ExportMetricsServiceRequest

POST /v1/logs
Body: ExportLogsServiceRequest

Score ingestion (vendor-specific, Langfuse-style):

POST /api/public/scores
{
  "traceId": "abc123",
  "name": "faithfulness",
  "value": 0.87,
  "dataType": "NUMERIC",
  "comment": "All claims appear in retrieved context"
}

Trace retrieval:

GET /api/traces/{traceId}
→ {
    "traceId": "abc123",
    "rootSpan": { "spanId": "...", "operationName": "handle-support-ticket",
      "startTime": "...", "endTime": "...", "durationMs": 1823,
      "attributes": { "user.id": "u-99", "feature": "support-chat" },
      "children": [
        { "operationName": "openai.chat", "gen_ai.request.model": "gpt-4o",
          "gen_ai.usage.input_tokens": 1247, "gen_ai.usage.output_tokens": 183,
          "cost_usd": 0.00421, "ttft_ms": 312, "contentRef": "s3://traces/abc123/llm-0" },
        { "operationName": "pinecone.query", "db.vector.query.top_k": 10,
          "durationMs": 38 }
      ]
    },
    "scores": [{ "name": "faithfulness", "value": 0.87 }],
    "totalCost": 0.00421,
    "totalInputTokens": 1247,
    "totalOutputTokens": 183
  }

The schema

ClickHouse traces table (simplified from OTel schema):

CREATE TABLE otel_traces (
    Timestamp          DateTime64(9)   CODEC(Delta(8), ZSTD(1)),
    TraceId            String          CODEC(ZSTD(1)),
    SpanId             String          CODEC(ZSTD(1)),
    ParentSpanId       String          CODEC(ZSTD(1)),
    ServiceName        LowCardinality(String),
    SpanName           LowCardinality(String),
    SpanKind           LowCardinality(String),  -- LLM, Tool, Retrieval, Embedding, Agent, Workflow
    Duration           Int64           CODEC(ZSTD(1)),  -- nanoseconds
    StatusCode         LowCardinality(String),
    -- gen_ai.* attributes (materialized for fast aggregation)
    GenAIModel         LowCardinality(String),
    GenAIInputTokens   UInt32          CODEC(ZSTD(1)),
    GenAIOutputTokens  UInt32          CODEC(ZSTD(1)),
    GenAICachedTokens  UInt32          CODEC(ZSTD(1)),
    CostNanodollars    UInt64          CODEC(ZSTD(1)),
    TTFT_ms            UInt32          CODEC(ZSTD(1)),
    -- high-cardinality user dimensions (trace attributes only, NOT metric labels)
    UserId             String          CODEC(ZSTD(1)),
    Feature            LowCardinality(String),
    PromptVersion      LowCardinality(String),
    AgentRunId         String          CODEC(ZSTD(1)),
    ContentRef         String          CODEC(ZSTD(1)),  -- S3 URL if content captured
    -- raw attribute map for flexible querying
    SpanAttributes     Map(String, String) CODEC(ZSTD(1)),
    ResourceAttributes Map(String, String) CODEC(ZSTD(1)),
    -- deduplication version (for ReplacingMergeTree)
    Version            UInt32
) ENGINE = ReplacingMergeTree(Version)
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId, SpanId)
TTL toDate(Timestamp) + INTERVAL 30 DAY
SETTINGS index_granularity = 8192;

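-- Bloom filter skip indexes for TraceId and attribute lookups
-- (a Map column is indexed through mapKeys/mapValues, not directly)
ALTER TABLE otel_traces ADD INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1;
ALTER TABLE otel_traces ADD INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1;
ALTER TABLE otel_traces ADD INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1;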

-- Materialized view: min/max timestamps per TraceId for full-trace reconstruction
CREATE MATERIALIZED VIEW otel_traces_trace_id_ts
ENGINE = AggregatingMergeTree()
ORDER BY (TraceId)
AS SELECT TraceId, minState(Timestamp) AS Start, maxState(Timestamp) AS End
FROM otel_traces GROUP BY TraceId;

Postgres prompt registry:

CREATE TABLE prompt_versions (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id  UUID NOT NULL REFERENCES projects(id),
    name        TEXT NOT NULL,
    version     INT NOT NULL,
    label       TEXT,                   -- 'production', 'staging', 'canary'
    content     TEXT NOT NULL,          -- the prompt template
    variables   JSONB,                  -- declared variable names
    commit_sha  TEXT,                   -- GitHub webhook source
    created_at  TIMESTAMPTZ DEFAULT now(),
    UNIQUE (project_id, name, version)
);

Architecture

The production system has four logical zones: instrumentation, ingest, storage, and query. Here is the full architecture:

flowchart LR
    subgraph INST["Instrumentation"]
        AI[Auto-instrumentation<br/>OpenLLMetry · ddtrace · openlit]
        MAN[Manual spans<br/>business context tags]
        PROXY[HTTP proxy<br/>Helicone / LiteLLM gateway]
    end
    subgraph INGEST["Ingest pipeline"]
        GW[OTel Collector<br/>gateway mode]
        PIIRED[PII redaction<br/>regex · NER service]
        SAMP[Tail-based sampler<br/>100% errors · 5–10% content]
        KAF[(Kafka<br/>30 partitions · 15 DLQ · 7-day TTL)]
        WPOOL[Worker pool<br/>ECS tasks · 3 consumers each]
        REDIS[(Redis<br/>live span state cache)]
    end
    subgraph STORE["Storage"]
        CH[(ClickHouse<br/>ReplacingMergeTree)]
        S3[(S3<br/>prompt and completion bodies)]
        PG[(Postgres<br/>prompts · accounts · billing)]
        PROM[(Prometheus)]
    end
    subgraph QUERY["Query and action"]
        TREEUI[Trace tree UI]
        COSTDASH[Cost dashboard]
        EVALPIPE[Eval pipeline<br/>LLM-as-judge · 10–20% sample]
        ALERT[Alert engine<br/>TTFT · cost · error rate]
        PROMPTUI[Prompt hub<br/>version diff · label assignment]
    end
    AI --> GW
    MAN --> GW
    PROXY --> GW
    GW --> PIIRED
    PIIRED --> SAMP
    SAMP --> KAF
    SAMP --> PROM
    KAF --> WPOOL
    WPOOL --> REDIS
    WPOOL --> CH
    WPOOL --> S3
    PG --> PROMPTUI
    CH --> TREEUI
    CH --> COSTDASH
    CH --> EVALPIPE
    EVALPIPE --> CH
    PROM --> ALERT
    CH --> ALERT
    style GW fill:#ff6b1a,color:#0a0a0f
    style KAF fill:#0e7490,color:#fff
    style CH fill:#15803d,color:#fff
    style WPOOL fill:#ffaa00,color:#0a0a0f
    style EVALPIPE fill:#a855f7,color:#fff
    style ALERT fill:#ff2e88,color:#fff

Hot path: a single LLM call through the system.

sequenceDiagram
    participant APP as Application
    participant SDK as OpenLLMetry SDK
    participant LLM as LLM Provider API
    participant COL as OTel Collector
    participant KAF as Kafka
    participant WRK as Worker
    participant CH as ClickHouse
    participant S3 as S3

    APP->>SDK: openai.chat.completions.create(messages, stream=True)
    SDK->>LLM: POST /v1/chat/completions (span opens)
    LLM-->>SDK: SSE chunk 1 (first token)
    SDK->>SDK: Record TTFT = now() - span_start
    LLM-->>SDK: SSE chunks 2..N
    LLM-->>SDK: SSE DONE with usage object
    SDK->>SDK: Close span; set input_tokens, output_tokens, cost
    SDK->>COL: OTLP export (fire-and-forget, async)
    COL->>COL: PII redaction; tail sampling decision
    COL->>KAF: Produce span metadata record (with content payload if sampled)
    KAF->>WRK: Consumer batch (up to 10,000 msgs / 50 MB)
    WRK->>S3: Write prompt+completion body (if sampled)
    WRK->>CH: INSERT INTO otel_traces (version=N, contentRef=S3 URL)
    CH->>CH: Materialized view auto-updates TraceId min/max
    Note over WRK,CH: Background merge deduplicates via ReplacingMergeTree

The application sees negligible added latency because the SDK export to the OTel Collector is a background async call. The OTel Collector's tail sampler holds completed traces in memory (bounded buffer, typically 5 minutes) and makes a sampling decision based on the outcome. Metadata is kept for 100% of traces; content is kept for 100% of errors, 100% of traces above the TTFT p99 threshold, and 5–10% of successful traces. Deciding after the outcome is known is the key difference from head-based sampling, which has to decide when the trace is created.

Instrumentation layer

Auto-instrumentation intercepts LLM provider HTTP calls via monkey-patching (OpenLLMetry) or ddtrace integration hooks. You get gen_ai.* attributes on every call with no code changes. The OTel GenAI SIG (formed April 2024) standardizes these: gen_ai.operation.name (chat / embeddings / invoke_agent / execute_tool), gen_ai.provider.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons. For streaming, the instrumentation wraps the response iterator: it records gen_ai.client.time_to_first_token_ms on the first chunk and reads token counts from the provider's usage metadata, which typically arrives with the final chunk.

Auto-instrumentation is zero-code but sees only one hop. A LangChain chain that calls a retriever, then an LLM, then a summarizer needs framework-level callbacks to produce the correct parent-child span hierarchy. LangSmith uses a RunTree data model where each LangChain callback emits start/end events that the SDK assembles into a tree. LangChain, LlamaIndex, and LangGraph all expose callback/event interfaces for this.

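Manual span wrapping adds the business-context attributes that frameworks cannot know: user.id, feature (which product surface triggered this call), deployment, prompt_version, and agent_run_id. These are the attribution dimensions you need for cost rollups and quality correlation. Set them before the LLM call, not afterwards. Note that OpenTelemetry does not copy a parent span's attributes onto its children: to stamp every child span, carry the values in OTel baggage (or another context-scoped store) and have a span processor copy them onto each span as it starts. Child spans that were already created will not pick up values set later.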

Proxy-based instrumentation (Helicone, LiteLLM) intercepts at the HTTP layer without SDK integration. Helicone runs on Cloudflare Workers with V8 isolates; their published benchmarks show mean overhead of roughly 10 ms — effectively negligible for most production workloads. The advantage is language-agnostic coverage and zero application changes; the disadvantage is you see only one hop — no retrieval spans, no tool call spans, no inter-agent messages. For multi-step agents, SDK tracing is necessary.

Token and cost accounting

Every span close triggers a cost computation. The platform maintains a pricing registry: model name → price per input token, price per output token, price per cache-read token, price per cache-write token. Datadog's implementation covers 800+ models with costs in nanodollars to avoid floating-point precision loss. The formula:

cost_nanodollars = (input_tokens × price_in_nano)
                 + (output_tokens × price_out_nano)
                 + (cache_read_tokens × price_cache_read_nano)
                 + (cache_write_tokens × price_cache_write_nano)

OpenAI reasoning models report reasoning_tokens under completion_tokens_details; those tokens are already included in completion_tokens and are billed as output tokens, so they must not be added a second time. Anthropic's cache_write tokens cost 25% more than regular input for the 5-minute TTL tier (2× base input for the 1-hour TTL tier); cache_read tokens are 90% cheaper. The pricing registry handles these per-provider conventions.
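
A minimal sketch of the formula with integer nanodollar arithmetic follows. The registry entry "example-model" and its prices are hypothetical placeholders (chosen so cache reads cost 10% and cache writes 125% of base input); ModelPrice, cost_nanodollars, and uncached_input_tokens are illustrative names, not any vendor's API. It assumes input_tokens excludes cached tokens; where a provider's prompt token count already includes cache hits, subtract them first to avoid double-billing.

```python
from dataclasses import dataclass

NANO_PER_USD = 1_000_000_000


@dataclass(frozen=True)
class ModelPrice:
    """Per-token prices in integer nanodollars, so sums never hit float rounding."""
    input_nano: int
    output_nano: int
    cache_read_nano: int
    cache_write_nano: int


# Hypothetical registry entry: 3,000 nanodollars/token is $3 per million tokens.
PRICING = {
    "example-model": ModelPrice(
        input_nano=3_000, output_nano=15_000,
        cache_read_nano=300, cache_write_nano=3_750,
    ),
}


def cost_nanodollars(model: str, input_tokens: int, output_tokens: int,
                     cache_read_tokens: int = 0, cache_write_tokens: int = 0) -> int:
    """Apply the four-term cost formula; input_tokens must exclude cached tokens."""
    price = PRICING[model]
    return (input_tokens * price.input_nano
            + output_tokens * price.output_nano
            + cache_read_tokens * price.cache_read_nano
            + cache_write_tokens * price.cache_write_nano)


def uncached_input_tokens(prompt_tokens: int, cached_tokens: int) -> int:
    """Normalize providers whose prompt token count already includes cache hits."""
    return prompt_tokens - cached_tokens


def to_usd(nanodollars: int) -> float:
    """Convert to dollars only at display time."""
    return nanodollars / NANO_PER_USD
```

Keeping the stored value as an integer and converting with to_usd only in the UI is what makes per-user and per-feature sums reproducible across billions of spans.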

The cost is tagged with every attribution dimension at span close: user_id, feature, deployment, prompt_version, agent_run_id, customer_id. The worker aggregates these into five ClickHouse materialized views for reporting:

  • Cost per user per day (billing dashboards, per-user spend limits)
  • Cost per feature per day (ROI analysis: does the summarization feature pay for itself?)
  • Cost per agent run (budget cap enforcement; detect runaway loops before they reach $50)
  • Cost per successful eval (quality-adjusted cost: how much does it cost to get a correct answer?)
  • Cost per customer (B2B: which enterprise customer is consuming the most tokens?)

The critical attribute cardinality rule: user_id, agent_run_id, trace_id, and prompt_text must never become Prometheus metric label dimensions. These are unbounded cardinalities that break time-series stores. They belong only in ClickHouse span attributes and the materialized views above. Safe Prometheus label dimensions are bounded: model, model_family, endpoint, region, status_code, deployment, feature. The PromQL for the TTFT SLO (p95 per model and feature) must query a time-to-first-token histogram, not the total operation duration. The metric name below assumes the OTel gen_ai.server.time_to_first_token histogram exported with a seconds suffix; adjust it to whatever your exporter emits:

histogram_quantile(0.95,
  sum(rate(gen_ai_server_time_to_first_token_seconds_bucket[5m]))
  by (le, gen_ai_request_model, feature)
)

PII redaction and content sampling

Prompts are PII minefields. A customer-support app's prompts contain names, account numbers, addresses, order histories. A medical chatbot's prompts may contain diagnoses and medication lists. The OTel GenAI spec is privacy-first by design: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT defaults to false. Operators must explicitly opt in to content capture.

The OTel Collector pipeline is the enforcement point. Before any content leaves the collector, it runs through two stages:

  1. Regex patterns for structured PII: email addresses, US SSNs (\d{3}-\d{2}-\d{4}), credit card numbers (Luhn check), phone numbers, AWS access keys, JWT tokens. These are fast and deterministic.

  2. NER microservice for unstructured PII: a small model (DistilBERT-NER at ~40ms per call) identifies PERSON, ORG, LOC, and GPE entities and replaces them with placeholders ([PERSON], [ORG]). This runs only on the sampled content fraction to keep latency acceptable.
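
The first, regex-based stage can be sketched in a few lines. This is a minimal illustration rather than a production redactor: the patterns are deliberately simple, the placeholder tokens ([EMAIL], [SSN], [CARD]) and the function names are assumptions, and the NER stage is not shown. Card candidates are only replaced when they pass the Luhn checksum, which keeps order numbers and other long digit runs intact.

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def _redact_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    # Leave the text untouched unless the digits form a plausible card number.
    return "[CARD]" if luhn_valid(digits) else match.group()


def redact(text: str) -> str:
    """Replace structured PII with placeholders before content leaves the collector."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    return CARD_RE.sub(_redact_card, text)
```

Running emails and SSNs before the card pattern avoids the card regex consuming digits that belong to a more specific match.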

After redaction, content goes to S3 (not ClickHouse inline). The span in ClickHouse holds only a reference URL: s3://traces-bucket/tenant-id/trace-id/span-id/content.json. The S3 bucket has a separate IAM policy from the ClickHouse cluster, so GDPR deletion requests for content become S3 object deletions rather than ClickHouse mutations. This is the pattern used in production at Langfuse, and keeping regulated content behind its own access boundary makes a SOC 2 / HIPAA audit far easier to pass.

Sampling rates in practice: 100% of spans emit metadata (tokens, cost, latency, model) — this is cheap and essential for alerting. For content (prompt + completion text): 5–10% for successful traces, 100% for errors, 100% for tail-latency traces (above p99 TTFT), and 0% for any trace that matches a PII-detection pattern on the first chunk of the response.

Streaming spans

Most production LLM applications stream responses. Span instrumentation for streaming is subtler than for non-streaming because the response body arrives in chunks and the usage metadata arrives last.

The pattern is to start the span before calling the LLM, wrap the streaming iterator, and close the span after the final chunk:

import time

with tracer.start_as_current_span("openai.chat") as span:
    span.set_attribute("gen_ai.request.model", model)
    span_start = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    stream = openai_client.chat.completions.create(
        model=model,
        messages=messages,  # caller-supplied chat messages
        stream=True,
        stream_options={"include_usage": True},  # ask OpenAI to send usage in the final chunk
    )
    first_chunk = True
    for chunk in stream:
        if first_chunk:
            ttft_ms = (time.monotonic() - span_start) * 1000
            span.set_attribute("gen_ai.client.time_to_first_token_ms", ttft_ms)
            first_chunk = False
        if chunk.usage:  # only the final chunk carries usage
            span.set_attribute("gen_ai.usage.input_tokens", chunk.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", chunk.usage.completion_tokens)
    # span closes here; end_time = after final chunk

Three metrics come from streaming that are invisible in non-streaming: TTFT (time to first token — the user-perceived latency), ITL (inter-token latency — variance between chunks, measured as p99 of inter-chunk gaps), and generation speed (output tokens per second). TTFT is the key SLO metric for chat applications; the target is under 1 second for good UX. Be careful: TTFT p99 reported without model or input-length segmentation is misleading — the same model on the same infrastructure can show 800ms to 28s depending on queue depth and input length.
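
All three metrics can be derived from the arrival time of each chunk. The sketch below assumes the instrumentation has collected monotonic timestamps in seconds; streaming_metrics is an illustrative name, the p99 uses the nearest-rank method, and generation speed is defined here as output tokens divided by the first-to-last chunk window, which is one of several conventions in use.

```python
import math


def streaming_metrics(request_start: float, chunk_times: list,
                      output_tokens: int) -> dict:
    """Derive TTFT, p99 inter-token latency, and tokens/sec from chunk arrival times."""
    if not chunk_times:
        raise ValueError("stream produced no chunks")
    ttft_s = chunk_times[0] - request_start
    # Gaps between consecutive chunks, sorted for the percentile lookup.
    gaps = sorted(later - earlier
                  for earlier, later in zip(chunk_times, chunk_times[1:]))
    # Nearest-rank p99; a single-chunk stream has no gaps.
    itl_p99_s = gaps[math.ceil(0.99 * len(gaps)) - 1] if gaps else 0.0
    generation_s = chunk_times[-1] - chunk_times[0]
    tokens_per_s = output_tokens / generation_s if generation_s > 0 else None
    return {
        "ttft_s": ttft_s,
        "itl_p99_s": itl_p99_s,
        "tokens_per_s": tokens_per_s,
    }
```

On short streams the nearest-rank p99 is simply the largest gap, so per-request ITL is best aggregated again across requests before alerting on it.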

Agent trajectory debugging

This is where LLM observability diverges most sharply from conventional tracing. A conventional trace shows you one execution path. An agent run is a non-deterministic, branching trajectory where the same input can produce different tool call sequences across runs. A study of 38 practitioners (arXiv 2503.06745) found a 63% mean coefficient of variation in execution flow across identical inputs on the same agent. You cannot debug a failed run by looking at one trace.

The trace tree viewer reconstructs the hierarchical execution from ClickHouse: root agent span → workflow spans → LLM call spans, tool call spans, retrieval spans, embedding spans. Each node in the tree shows:

  • Input messages and output (prompt + completion, loaded from S3 on demand)
  • Token counts and cost for that node
  • Latency breakdown (time in queue vs. time in model)
  • Model parameters at call time (temperature, top-p, max-tokens, system prompt version)
  • Error or exception if the step failed
  • Quality scores attached to this span
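
Reconstructing the tree from flat span rows is a grouping step followed by a recursive walk. The sketch below assumes each row is a dict with span_id, parent_span_id, name, start_ns, and cost_nanodollars keys (illustrative field names, not the exact column names above) and returns a forest, so spans whose parent has not arrived yet surface as extra roots instead of being dropped.

```python
from collections import defaultdict


def build_trace_forest(spans: list) -> list:
    """Turn flat span rows into nested nodes with per-subtree cost rollups."""
    known_ids = {span["span_id"] for span in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent_id = span.get("parent_span_id")
        if parent_id and parent_id in known_ids:
            children[parent_id].append(span)
        else:
            roots.append(span)  # true root, or orphan whose parent is missing

    def to_node(span: dict) -> dict:
        # Children are ordered by start time so the UI reads top to bottom.
        kids = sorted(children.get(span["span_id"], []),
                      key=lambda child: child["start_ns"])
        child_nodes = [to_node(kid) for kid in kids]
        own_cost = span.get("cost_nanodollars", 0)
        return {
            "span_id": span["span_id"],
            "name": span["name"],
            "cost_nanodollars": own_cost,
            "subtree_cost_nanodollars": own_cost + sum(
                node["subtree_cost_nanodollars"] for node in child_nodes),
            "children": child_nodes,
        }

    return [to_node(root) for root in sorted(roots, key=lambda s: s["start_ns"])]
```

The subtree cost on the root is the number a cost-per-agent-run view would display; prompt and completion bodies stay out of the node and are fetched by reference only when a node is expanded.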

The trajectory debugger goes further. It compares many runs of the same logical task using an aggregate task-flow DAG: which tool call sequences are most common, which paths correlate with high quality scores, which paths terminate in errors. This is the AgenTracer approach (arXiv 2509.03312).
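
A simplified version of that aggregation is sketched below. It is not the method from the cited papers, only an illustration of the idea: each run is reduced to its ordered list of tool names plus a success flag (both assumed inputs), transitions are counted into a DAG-style edge table, and success rates are computed per distinct path.

```python
from collections import Counter, defaultdict


def aggregate_task_flow(runs: list) -> tuple:
    """runs: list of (tool_sequence, succeeded) pairs for one logical task."""
    edge_counts = Counter()
    path_stats = defaultdict(lambda: [0, 0])  # path -> [run count, success count]
    for tools, succeeded in runs:
        steps = ["START", *tools, "END"]
        # Count every transition between consecutive steps.
        edge_counts.update(zip(steps, steps[1:]))
        stats = path_stats[tuple(tools)]
        stats[0] += 1
        stats[1] += int(succeeded)
    path_success = {path: wins / total
                    for path, (total, wins) in path_stats.items()}
    return edge_counts, path_success
```

Edges that are rare overall but common among failed runs (a repeated search step, for instance) are the first places to look when a trajectory goes wrong.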

For root-cause attribution, the most powerful technique is counterfactual replay (AgenTracer, arXiv 2509.03312): take a failed trajectory and replace each action with an oracle-corrected alternative, re-simulate the agent's subsequent reasoning, and find the earliest decision flip that changes the outcome. The minimum event set for replay is every LLM request with full prompt+response, every tool call with inputs+outputs, and every agent state snapshot at handoffs. Without content capture — even sampled — this is impossible, which is why 100% content capture for failed traces is non-negotiable.

For the design-ai-agent-platform use case specifically: the LLM observability platform is the external debugger that the agent platform writes telemetry to. The agent platform owns durable run state and checkpoints; the observability platform owns the queryable trace history, cost attribution, and quality analysis across many runs.

Prompt management as first-class infrastructure

Treating prompts as inline strings edited directly in code is one of the most documented sources of silent production regressions. A one-word change to a system prompt can degrade answer quality by 20% across an entire product surface, with no deployment artifact, no reviewer, no rollback path.

The prompt hub solves this by treating prompts as versioned artifacts in Postgres:

  • Every prompt has a name, version number, and label (production / staging / canary).
  • The SDK resolves the label at runtime: langfuse.get_prompt("support-system", label="production") fetches the current production version.
  • Every span is tagged with prompt_version at instrumentation time.
  • Deployment is a label reassignment, not a code push: langfuse.update_prompt(name="support-system", version=8, new_labels=["production"]) moves the production label to v8, and new traffic follows as soon as clients re-resolve the label.
  • Rollback is the same operation: reassign the label to v7.
  • GitHub webhook integration: every merge to main that touches a prompt file creates a new version in the registry with the commit SHA as audit trail.

The power of this infrastructure is the comparison query: given that prompt_version v7 and v8 are both in production (via A/B traffic split), what is the difference in mean faithfulness score, mean cost per request, and mean TTFT? This query runs entirely from ClickHouse: cost and TTFT are columns on every span, every span carries prompt_version, and faithfulness scores join back by trace ID.
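
In miniature, the comparison looks like the sketch below. It assumes the rows have already been joined (one dict per trace with prompt_version, faithfulness, cost_nanodollars, and ttft_ms, all illustrative field names); in the real system the same grouping would be a ClickHouse GROUP BY rather than Python.

```python
from collections import defaultdict
from statistics import mean


def compare_versions(rows: list) -> dict:
    """Group joined trace/score rows by prompt version and average each metric."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prompt_version"]].append(row)
    return {
        version: {
            "n": len(version_rows),
            "mean_faithfulness": mean(r["faithfulness"] for r in version_rows),
            "mean_cost_nanodollars": mean(r["cost_nanodollars"] for r in version_rows),
            "mean_ttft_ms": mean(r["ttft_ms"] for r in version_rows),
        }
        for version, version_rows in grouped.items()
    }
```

Reporting n alongside the means matters: a canary label that has served a few hundred requests should not be compared to production on means alone.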

Online evaluation pipeline

Token counts tell you how expensive a response was; latency tells you how fast; neither tells you whether it was correct. Online evaluation closes that gap.

Online evaluation runs on sampled production traces (10–20% to avoid doubling inference cost). The evaluators are asynchronous — they read completed traces from ClickHouse, run their logic, and write scores back as score rows. The score schema matches Langfuse's: name, value, data_type (NUMERIC 0.0–1.0, CATEGORICAL pass/fail/unclear, BOOLEAN, TEXT), linked to a trace_id and optionally a span_id.

Common online evaluators:

  • Faithfulness (LLM-as-judge): does every factual claim in the answer appear in the retrieved context? This catches hallucination. See design-llm-eval-platform for full eval platform design.
  • Answer relevance (LLM-as-judge): does the answer actually address the question asked?
  • Guardrail hits (rule-based or model-based): did the input trigger an injection detection rule? Did the output contain toxic content, PII, or confidential patterns? Guardrail events are emitted as telemetry spans themselves — pre-LLM (input check) and post-LLM (output check) — with gen_ai.operation.name = "guardrail".
  • User feedback: explicit thumbs-up/down or implicit engagement signals (user rephrased the question = implicit negative). Both are stored as score objects on the trace. Explicit feedback at a 2% engagement rate on 10,000 req/sec = 200 explicit signals per second — high enough to detect regressions within an hour.

LLM judges are noisy. Before trusting a judge for alerting, validate it against a human-labeled test set. Common calibration steps: use pairwise comparisons rather than Likert scales, reverse option order to check position bias, require chain-of-thought reasoning in the judge prompt, allow ties. Eugene Yan et al. (O'Reilly, May 2024) document a 5–10% baseline hallucination rate on simple summarization tasks; getting below 2% is "challenging even with prompt engineering." This baseline is why you need to trend quality metrics rather than set a fixed pass/fail threshold. For judge calibration methodology, benchmark datasets, and regression detection pipelines, see Design an LLM Eval Platform.
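
Two of these calibration steps, order reversal and validation against human labels, can be sketched without committing to a particular judge. In the code below the judge is any callable taking (question, answer_a, answer_b) and returning "A", "B", or "tie"; that signature, and the names debiased_verdict and agreement_rate, are assumptions made for illustration.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # returns "A", "B", or "tie"


def debiased_verdict(judge: Judge, question: str, first: str, second: str) -> str:
    """Ask the judge in both orders; any disagreement is recorded as a tie."""
    forward = judge(question, first, second)
    backward = judge(question, second, first)
    # Translate slot labels back to the underlying answers.
    forward_pick = {"A": "first", "B": "second", "tie": "tie"}[forward]
    backward_pick = {"A": "second", "B": "first", "tie": "tie"}[backward]
    return forward_pick if forward_pick == backward_pick else "tie"


def agreement_rate(judge_verdicts: list, human_labels: list) -> float:
    """Fraction of items where the judge matches the human-labeled verdict."""
    if len(judge_verdicts) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)
```

A judge that always prefers slot A collapses to "tie" on every item under this scheme, which makes position bias visible as an unusually high tie rate.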

Edge cases & gotchas

1. Postgres IOPS exhaustion at scale. This is not a theoretical concern — it is what killed Langfuse's V1/V2 architecture. At tens of millions of span events per day, synchronous Postgres writes saturate IOPS and cause cascading latency: ingestion p99 hit 50 seconds. The fix is architectural: async queue (Redis or Kafka) absorbs the write burst; workers drain to ClickHouse. Do not try to tune your way out with Postgres read replicas or connection poolers — the problem is write throughput, not read scaling.

2. Row update anti-pattern in ClickHouse. Spans arrive as a stream of partial events: span_start, then token counts after streaming completes, then LLM-as-judge scores minutes later. Naively issuing ALTER TABLE ... UPDATE in ClickHouse for each event is catastrophically expensive because each mutation rewrites every data part that contains an affected row. The correct pattern: every event is an INSERT with an incrementing version number, and ReplacingMergeTree keeps the highest version per sorting key (the ORDER BY columns) during background merges. Until the merge runs, Redis caches the current state so the worker can compute the next version by reading from Redis rather than querying ClickHouse.
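
The insert-with-version pattern in item 2 can be sketched in memory. This is a simulation under stated assumptions, not Langfuse's implementation: a plain dict stands in for Redis, a list stands in for the ClickHouse table, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpanRow:
    span_id: str
    version: int
    attrs: dict


@dataclass
class VersionedSpanWriter:
    """`cache` plays the role of Redis, `table` the append-only ClickHouse table."""
    cache: dict = field(default_factory=dict)  # span_id -> latest SpanRow
    table: list = field(default_factory=list)  # every row ever inserted

    def apply_event(self, span_id, partial_attrs):
        previous = self.cache.get(span_id)
        merged = {**(previous.attrs if previous else {}), **partial_attrs}
        next_version = previous.version + 1 if previous else 1
        row = SpanRow(span_id, next_version, merged)
        self.table.append(row)      # always an INSERT, never an UPDATE
        self.cache[span_id] = row   # next event reads state from the cache
        return row


def merge_like_replacing_merge_tree(rows):
    """Keep the highest-version row per span_id, as a background merge would."""
    latest = {}
    for row in rows:
        if row.span_id not in latest or row.version > latest[row.span_id].version:
            latest[row.span_id] = row
    return latest
```

Three partial events for one span produce three inserted rows, and the merge step leaves a single row carrying version 3 and the union of all fields.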

3. LLM body size in the queue. A vision model request with a high-resolution image can produce a JSON body of several megabytes. Putting bodies of that size in Kafka messages fills broker storage quickly and increases consumer lag. Helicone's documented pattern: write the body to S3 immediately at ingest time and put only the S3 reference URL in the Kafka message. This keeps Kafka messages small (under 10 KB) and lets the worker retrieve the body from S3 only when needed for content processing.
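
A minimal sketch of the body-to-object-store pattern in item 3. A dict stands in for S3, and the bucket name, key layout, and message fields are hypothetical, not Helicone's actual schema.

```python
import hashlib
import json

MAX_MESSAGE_BYTES = 10 * 1024  # mirrors the "under 10 KB" target above


def build_queue_message(trace_id, metadata, body_bytes, object_store, bucket="llm-bodies"):
    """Store the raw body out of band and return a small message that references it.

    `object_store` is a dict standing in for S3 (hypothetical).
    """
    digest = hashlib.sha256(body_bytes).hexdigest()
    key = f"{trace_id}/{digest}.json"
    object_store[(bucket, key)] = body_bytes  # the body never enters the queue
    message = json.dumps({
        "trace_id": trace_id,
        "metadata": metadata,
        "body_ref": f"s3://{bucket}/{key}",
        "body_size_bytes": len(body_bytes),
    }).encode("utf-8")
    if len(message) > MAX_MESSAGE_BYTES:
        raise ValueError("queue message exceeds size budget; trim metadata")
    return message
```

A 5 MB body yields a message of a few hundred bytes, and the consumer dereferences `body_ref` only when it actually needs the content.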

4. Head-based sampling discarding failures. Configuring the OTel SDK to sample 1% of traces at trace creation time means 99% of error traces are gone before you know they are interesting. For LLM workloads where production failures are rare and expensive, this is fatal to debugging. The correct architecture is tail-based sampling at the OTel Collector gateway: the collector buffers complete traces in memory (with a configurable timeout and memory limit), then applies the sampling decision based on the outcome. A trace with a non-2xx status or a timeout always gets sampled. A successful trace gets sampled at the configured rate.
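
The tail-based decision in item 4 is made only after the whole trace has been buffered. The function below sketches that decision logic in plain Python. It is not the OTel Collector's configuration or API, and the span dict shape (`http_status`, `timed_out`) is an assumption.

```python
import hashlib


def keep_trace(trace_id, spans, success_sample_rate=0.01):
    """Decide whether to keep a completed trace.

    Each span is a dict with optional 'http_status' and 'timed_out' keys (hypothetical shape).
    """
    for span in spans:
        status = span.get("http_status")
        failed_status = status is not None and not 200 <= status < 300
        if span.get("timed_out") or failed_status:
            return True  # failures and timeouts are always kept
    # Hash the trace ID so every collector replica reaches the same decision.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < round(success_sample_rate * 10_000)
```

Because the decision keys off the outcome, a 1% rate applies only to successful traces, whereas head-based sampling at 1% would drop failures at the same rate as everything else.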

5. Cardinality explosion in metric labels. Adding user_id or trace_id as a Prometheus label dimension creates a new time series per unique value — millions of time series — which OOM-kills the Prometheus instance and makes dashboards unusable. The collector-level attribute filter is the enforcement mechanism: a filter/drop_high_cardinality processor strips these fields from the metric export pipeline while leaving them intact in the trace pipeline where ClickHouse handles them correctly.
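
To make item 5 concrete, the sketch below shows the same filtering idea in application code, plus the series-count arithmetic behind the explosion. The key names in `HIGH_CARDINALITY_KEYS` are assumptions chosen to match the fields named in this article.

```python
HIGH_CARDINALITY_KEYS = {"user_id", "trace_id", "prompt"}  # assumed attribute names


def metric_labels(span_attributes):
    """Return only the attributes that are safe to use as metric labels."""
    return {k: v for k, v in span_attributes.items() if k not in HIGH_CARDINALITY_KEYS}


def series_count(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total
```

With 5 models and 20 features the bound is 100 series. Adding a `user_id` label with one million distinct values raises it to 100 million.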

6. Missing streaming instrumentation. If your span closes on the first chunk of a streaming response (a common instrumentation bug), you lose TTFT and all inter-token latency data. TTFT is measured from request start to first token arrival; ITL is the gap between subsequent chunks, usually reported at p99. Without wrapping the async iterator correctly (as shown in the Streaming Spans section), these metrics are invisible and you are flying blind on the user-facing UX metric that matters most.

7. Ignoring the cost of observability itself. Full content capture at 10,000 req/sec with 2 KB average prompt+completion = 20 MB/sec of content, or 1,728 GB/day (20 MB/sec × 86,400 sec/day). At S3 Standard's roughly $0.023 per GB-month, each day of capture adds about $40 to the monthly storage bill (1,728 GB × $0.023 ≈ $40), and with 30-day retention the steady-state cost is roughly $1,200/month. That is modest in absolute terms, but it adds up. A 5–10% content sampling rate cuts the per-day figure to $2–4, which is the real reason the tiered strategy (metadata always, content sampled) is the production-appropriate approach, not merely a compromise. Add TTL policies (30 days for traces, 90 days for cost aggregates) to keep storage costs bounded.
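
The arithmetic in item 7 is easy to rerun for your own traffic. This sketch treats $0.023 as a per-GB-month storage price and 2 KB as 2,000 bytes. Both are assumptions about how the figures above are meant, and the function name is hypothetical.

```python
def content_storage_cost(req_per_sec, avg_bytes, sample_rate=1.0,
                         price_per_gb_month=0.023, retention_days=30):
    """Return (GB captured per day, monthly cost of one day's data, steady-state monthly cost)."""
    gb_per_day = req_per_sec * avg_bytes * 86_400 * sample_rate / 1e9
    # Monthly storage charge incurred by a single day's worth of captured content.
    cost_of_one_day = gb_per_day * price_per_gb_month
    # Monthly bill once `retention_days` of content is resident at all times.
    steady_state_month = gb_per_day * retention_days * price_per_gb_month
    return gb_per_day, cost_of_one_day, steady_state_month


full = content_storage_cost(10_000, 2_000)
sampled = content_storage_cost(10_000, 2_000, sample_rate=0.05)
```

Here `full` comes out at 1,728 GB/day, about $39.74 per day of data, and about $1,192 per month at steady state. `sampled` drops the per-day figure to about $1.99.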

8. Non-determinism masking root cause. With 63% coefficient of variation in agent execution flow (arXiv 2503.06745), the same agent on the same input may take completely different tool call paths on different runs. A single failed trace might look like a tool failure, but on inspection of 100 similar runs, only 5% hit that path — the failure is elsewhere. Build aggregate task-flow DAGs from ClickHouse and analyze failure rates by path segment, not just by individual run.
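
One way to do the per-path-segment analysis from item 8 is to count failures per edge of the tool-call path across many runs. The sketch assumes the runs have already been pulled from the trace store as (path, failed) pairs. That input shape and the function name are hypothetical.

```python
from collections import defaultdict


def failure_rate_by_edge(runs):
    """Aggregate failure rates per path segment.

    `runs` is a list of (tool_path, failed) pairs, where tool_path is the ordered
    list of step names in one agent run.
    Returns {(from_step, to_step): (run_count, failure_rate)}.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for path, failed in runs:
        for edge in set(zip(path, path[1:])):  # count each edge once per run
            totals[edge] += 1
            failures[edge] += int(failed)
    return {edge: (totals[edge], failures[edge] / totals[edge]) for edge in totals}
```

An edge that appears in few runs but fails in nearly all of them is a stronger root-cause candidate than a tool that happened to appear in the one failed trace you opened first.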

Trade-offs to discuss in an interview

Proxy vs. SDK instrumentation. A proxy (Helicone, LiteLLM gateway) requires zero application code changes and works across all languages. It adds roughly 10 ms of overhead per call (edge-network latency; negligible relative to LLM inference time) and sees only the single LLM hop. SDK instrumentation requires code changes but captures the full call graph: retrieval, tool execution, inter-agent messages, and business-context attributes. For a simple single-model chatbot, the proxy is often sufficient. For a multi-step agent, the SDK is necessary. Most production systems end up with both: the proxy for baseline coverage, the SDK for attribution and full-tree visibility.

All-in-one platform vs. composable stack. LangSmith, Langfuse, and Arize Phoenix are all-in-one: ingest, trace store, eval pipeline, prompt hub, dashboards, all under one roof. The composable alternative is OpenLLMetry → OTel Collector → ClickHouse + Prometheus + Grafana + custom eval pipeline. All-in-one is faster to ship; composable avoids vendor lock-in and is required when you have strict data-residency requirements. The migration path matters: both LangSmith and Langfuse support OTLP ingest, so you can start with the hosted platform and export to your own ClickHouse later.

Tail-based vs. head-based sampling. Tail sampling requires the OTel Collector to buffer entire traces in memory before making the sampling decision, adding memory pressure and a configuration-dependent delay (typically 30 seconds to 5 minutes). Head sampling is simpler and has no memory overhead, but discards the most valuable traces — errors, outliers — before you know they are interesting. For LLM workloads, tail-based sampling for content is strongly preferred; metadata should always be 100% sampled.

ReplacingMergeTree vs. VersionedCollapsingMergeTree. Langfuse uses ReplacingMergeTree for simplicity: INSERT with version, background merge deduplicates. Reads that must be correct before the merge runs require FINAL (which deduplicates at query time, at extra read cost) or a LIMIT BY workaround. Helicone uses VersionedCollapsingMergeTree with explicit sign-based collapse/upsert: more predictable reads, more complex writes. Both are correct; choose based on your team's ClickHouse familiarity and read SLA requirements.

LLM-as-judge sampling rate. Running the judge on 100% of traces doubles your inference cost and adds 1–3 seconds of latency to the eval pipeline. Running it on 10% gives statistically meaningful signal for most workloads; at high traffic (10,000 req/sec), even a 1% eval rate produces 100 evaluated traces/sec — more than enough to surface a failure mode present in 1% of requests within minutes. The concern about detection lag is real only at low traffic: at 10 req/sec, a 10% eval rate yields one evaluated trace per second, and a 1% failure mode may take hours to accumulate enough samples to confirm. Scale your eval rate inversely with traffic volume.
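
The detection-lag argument above reduces to one division. The sketch below assumes you want a fixed number of failing evaluated traces before you call a regression confirmed. The threshold of 100 is an illustrative choice, not a figure from this article.

```python
def seconds_to_collect(failing_samples_needed, req_per_sec, eval_rate, failure_rate):
    """Expected seconds until the judge has seen the required number of failing traces."""
    failing_evals_per_sec = req_per_sec * eval_rate * failure_rate
    if failing_evals_per_sec <= 0:
        raise ValueError("req_per_sec, eval_rate, and failure_rate must all be positive")
    return failing_samples_needed / failing_evals_per_sec


high_traffic = seconds_to_collect(100, req_per_sec=10_000, eval_rate=0.01, failure_rate=0.01)
low_traffic = seconds_to_collect(100, req_per_sec=10, eval_rate=0.10, failure_rate=0.01)
```

Under these assumptions the high-traffic case needs about 100 seconds and the low-traffic case about 10,000 seconds (roughly 2.8 hours), which is why the eval rate should rise as traffic falls.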

Things you should now be able to answer

  1. Why does Langfuse use ClickHouse instead of Postgres for trace storage, and what was the failure mode that drove the migration?
  2. Draw the sequence diagram for a streaming LLM call through the observability system, from SDK instrumentation to ClickHouse insert. Where is TTFT recorded?
  3. Why must user_id, trace_id, and raw prompt text never become Prometheus metric label dimensions? What happens if they do?
  4. A span in your system arrives as three separate events: span start, token usage (after stream completes), and a quality score (3 minutes later). How do you handle this in ClickHouse without using UPDATE?
  5. What is tail-based sampling, how does it differ from head-based sampling, and why does it matter specifically for LLM workloads?
  6. How do you implement PII redaction in the ingest pipeline? Name two specific techniques and explain where in the pipeline each runs.
  7. Describe the cost attribution model: what five dimensions do you roll up costs to, and why does the unit need to be nanodollars rather than dollars?
  8. A customer reports that "the agent gave a wrong answer on Tuesday at 2pm." Walk through exactly how you would debug this using the observability platform — which queries you would run, what data you would inspect, and how you would identify root cause.
  9. What are the schema requirements for a ClickHouse table to support both full-trace reconstruction (given a TraceId) and cost aggregations (sum cost by feature for last 30 days) in under 200 ms each?
  10. A team is building a multi-tenant LLM product. What attributes must they set on every span to enable correct per-customer cost reporting and data isolation? What happens if they forget to set customer_id on 5% of spans?

Further reading

  • Langfuse Blog — "Langfuse V3 Infrastructure Evolution" (Langfuse team, Dec 2024) — the canonical account of the Postgres → ClickHouse migration, including the 7s → 100ms prompt API improvement and the ReplacingMergeTree decision.
  • ClickHouse Blog — "Building an Observability Solution with ClickHouse – Part 2: Traces" (ClickHouse, 2024) — full OTel schema DDL, bloom filter index setup, compression numbers, and trace reconstruction queries.
  • OpenTelemetry — "An Introduction to Observability for LLM-based Applications Using OpenTelemetry" (CNCF, 2024) — the official OTel GenAI SIG guidance, gen_ai.* attribute definitions, and MCP span conventions.
  • Upstash Blog — "Handling Billions of LLM Logs with Upstash Kafka and Cloudflare Workers" (Upstash, 2024) — Helicone's Kafka topic design, ECS consumer configuration, and body-to-S3 pattern.
  • Rost Glukhov — "Observability for LLM Systems: Metrics, Traces, Logs, and Testing in Production" (glukhov.org) — TTFT/ITL definitions, cardinality trap list, PromQL examples for TTFT SLO.
  • Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu, Shreya Shankar — "What We Learned from a Year of Building with LLMs (Part I)" (O'Reilly, May 2024) — LLM-as-judge pitfalls, hallucination baseline rates, and practical logging recommendations.
  • arXiv 2503.06745 — "Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems" (2025) — 63% mean coefficient-of-variation in execution flow, user study of 38 practitioners, task-flow DAG analytics.
  • arXiv 2603.29848 — "AgentFixer: From Failure Detection to Fix Recommendations in LLM Agentic Systems" (2026) — systematic failure detection using 15 rule-based and LLM-as-judge tools, applied to multi-agent systems; produces actionable fix recommendations. For counterfactual replay specifically, see AgenTracer (arXiv 2509.03312).
  • arXiv 2509.03312 — "AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?" (2025) — aggregate task-flow DAG analytics and counterfactual replay methodology for root-cause attribution in failed agent trajectories; the basis for the trajectory debugger approach described in this article.
  • Design an AI Agent Platform — the platform this observability system monitors; covers durable run state, idempotent tool execution, and per-tenant safety guardrails.
  • Design an LLM Eval Platform — pairs with this article: quality measurement, offline eval pipelines, benchmark datasets, and regression detection (the evaluation side of the quality loop).
  • Design an LLM Gateway — the proxy layer that sits between application and LLM provider; complements observability with rate limiting, routing, and cost controls.
  • Design a Data Pipeline / ETL — the columnar ingest patterns (Kafka → Spark/Flink → ClickHouse) are the same foundation this observability pipeline uses.
  • Design a Metrics Monitoring System — the general-purpose metrics infrastructure (Prometheus, Grafana, alerting) that the TTFT SLO layer of this system runs on.
  • Design a Log Aggregation System — the structured logging backbone; LLM observability is a specialized extension of log aggregation with trace context.
// FAQ

Frequently asked questions

Why do conventional APM tools fall short for LLM applications?

Conventional APM tools like Datadog or New Relic track latency, error rate, and throughput as numeric time-series metrics on well-typed operations. An LLM application has none of those properties: the "output" is an unbounded string, quality is not a boolean, a single user request may fan out across retrieval, embedding, tool calls, and multiple model invocations with different models and prompts, and cost is a function of token counts rather than request count. Capturing this requires a trace model that nests heterogeneous span types, records prompt and completion text, and computes cost inline — none of which standard APM instrumentations provide out of the box.

What is TTFT and why is p99 without distribution context misleading?

TTFT (time to first token) is the latency from when a request reaches the LLM API until the first token of the streaming response arrives. It is the dominant factor in perceived responsiveness for chat applications; users can tolerate slow streaming if the first word arrives within a second. The p99 is misleading in isolation because LLM latency has a wide, heavy-tailed distribution — on a single model the TTFT range can span 800 ms to 28 seconds depending on server load, input length, and queue depth. Always report p50/p95/p99 together and segment by model, input token bucket, and time of day to get actionable signal.
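
Reporting the three percentiles together, segmented by model, might look like the sketch below. It uses the nearest-rank percentile definition and assumes TTFT samples arrive as (model, milliseconds) pairs. Both are choices made for illustration.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list (p in (0, 100])."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def ttft_summary(samples):
    """`samples` is a list of (model, ttft_ms) pairs; returns per-model p50/p95/p99."""
    by_model = defaultdict(list)
    for model, ttft_ms in samples:
        by_model[model].append(ttft_ms)
    return {
        model: {f"p{p}": percentile(sorted(values), p) for p in (50, 95, 99)}
        for model, values in by_model.items()
    }
```

The same grouping extends to input-token bucket or hour of day by widening the grouping key from `model` to a tuple.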

How do you handle PII in prompt and completion capture?

Content capture must be opt-in, not opt-out. The OTel GenAI spec defaults to NOT recording prompts and completions; a developer must explicitly set OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true. In practice, production systems store raw content in object storage (S3 or GCS) with strict IAM policies and put only a reference URL in the trace span. The ingest pipeline runs a PII redaction pass at the OTel Collector layer — regex patterns for known PII formats (email, SSN, credit card) plus an optional NER microservice for free-form names and addresses — before content reaches the trace store. GDPR and HIPAA teams then govern the S3 bucket independently of the observability data.
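
The regex half of that redaction pass can be sketched as follows. The patterns are deliberately simple illustrations (US-style SSNs, 13 to 16 digit card numbers) and would need hardening and locale-specific formats before production use. The NER pass for names and addresses is not shown.

```python
import re

# Ordered list of (placeholder label, pattern). These patterns are illustrative only.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
]


def redact(text):
    """Replace each match with a bracketed placeholder naming the PII type."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Reach me at jane.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."))
# prints: Reach me at [EMAIL], SSN [SSN], card [CARD].
```

Typed placeholders, rather than blanket deletion, keep the redacted prompt readable enough for debugging.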

What is the ReplacingMergeTree pattern and why does LLM telemetry need it?

LLM spans arrive as a stream of partial events: the span opens at request time, token counts are known only after streaming completes, and evaluation scores may arrive minutes later as async LLM-as-judge pipelines finish. A naive design tries to UPDATE a row in the trace store as each piece arrives, but UPDATE is extremely expensive in ClickHouse because it rewrites entire parts. ReplacingMergeTree accepts INSERTs with a version column and deduplicates in the background during merge, keeping only the highest-version row per sorting key (the ORDER BY columns, not necessarily the primary key). The trade-off is a read-time inconsistency window until the merge runs, handled by using FINAL or LIMIT BY queries, or by caching current event state in Redis during the active window and reading from ClickHouse only for completed spans.

How do you attribute LLM cost to individual users, features, and agent runs?

Cost attribution starts at instrumentation time: every span must be tagged with user_id, feature, deployment, prompt_version, and agent_run_id before the trace is shipped. At span close, the platform looks up provider pricing (Datadog maintains a registry of 800+ models, priced in nanodollars) and computes cost = (input_tokens × price_in) + (output_tokens × price_out), with separate rates for cached-read tokens. These tagged cost events then roll up into five reporting dimensions: cost-per-user-per-day (billing), cost-per-feature-request (ROI), cost-per-agent-run (budget caps), cost-per-successful-eval (quality-adjusted cost), and cost-per-customer (for B2B margin analysis).
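
The cost formula above can be sketched with integer nanodollars. The prices below are made-up illustrative values, not any provider's real rates. The sketch also assumes that `input_tokens` includes the cached-read tokens, which varies by provider.

```python
from dataclasses import dataclass

NANODOLLARS_PER_DOLLAR = 1_000_000_000


@dataclass(frozen=True)
class ModelPrice:
    """Integer nanodollars per token (hypothetical values in the example below)."""
    input_nd: int
    output_nd: int
    cached_input_nd: int


def span_cost_nanodollars(price, input_tokens, output_tokens, cached_input_tokens=0):
    """Cost of one LLM span; assumes input_tokens already includes cached-read tokens."""
    uncached = input_tokens - cached_input_tokens
    if uncached < 0:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    return (uncached * price.input_nd
            + cached_input_tokens * price.cached_input_nd
            + output_tokens * price.output_nd)


price = ModelPrice(input_nd=2_500, output_nd=10_000, cached_input_nd=1_250)
cost = span_cost_nanodollars(price, input_tokens=1_000, output_tokens=200, cached_input_tokens=400)
```

Here `cost` is 4,000,000 nanodollars ($0.004). Keeping the unit integral means summing millions of spans per user or per feature never accumulates floating-point rounding error, and conversion to dollars happens once at display time.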
