// TOPIC

#observability

4 articles

Design an LLM Observability Platform

Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.

#interview #ai #llm

30 min

◆◆◆ Advanced · Elastic · Splunk

Design a Centralized Log Aggregation System (ELK / Splunk)

Collect, store, and search logs from thousands of services. Covers collection agents, a buffered ingestion pipeline, time-based inverted indices, hot-warm-cold storage tiers, and cost control.

#interview #observability #search

25 min

◆◆◆ Advanced · Datadog · Google

Design a Metrics & Monitoring System (Prometheus / Datadog)

Ingest billions of time-series points, store them cheaply, and answer dashboard and alerting queries fast. Covers TSDB internals, cardinality, downsampling, and pull vs. push collection.

#interview #observability #time-series

22 min

◆◆ Intermediate

Observability — Logs, Metrics, Tracing

The three pillars of observability, structured logging, metric cardinality, distributed tracing, and how to find the needle when production catches fire at 3am.

#observability #monitoring #operations

13 min