Design a Metrics & Monitoring System (Prometheus / Datadog)
Ingest billions of time-series points, store them cheaply, and answer dashboard + alerting queries fast. TSDB internals, cardinality, downsampling, and pull vs push.
The problem
Production systems lie to you constantly — not on purpose, but because without continuous measurement, everything is invisible. Prometheus crossed one billion active time-series in large enterprise deployments. Datadog processes trillions of data points per day from millions of hosts. At that scale, the "monitoring" problem stops being about collecting numbers and starts being about three deeply conflicting sub-problems: writing data at hundreds of thousands of points per second, storing it cheaply enough that you can afford years of history, and querying it fast enough that a dashboard loads before an engineer loses patience.
The concrete system looks like this. Every host in your fleet runs a metrics agent that exposes CPU, memory, request rate, error rate, and hundreds of custom business counters. A monitoring backend scrapes all of them every 15 seconds, ingests the results, and keeps them queryable. An engineer opens a dashboard during an incident and types rate(http_errors[5m]) — that expression has to scan millions of stored data points and return an answer in under half a second. Prometheus and Datadog both solve this problem; so does Victoria Metrics, Thanos, Grafana Mimir, and a dozen internal systems at companies like Google and Uber.
What makes this hard is a combination of write volume and query shape. At 5 million active time series on a 15-second scrape interval, you are ingesting roughly 333,000 data points per second, steady state. Storing them as raw floats would consume around 5.3 MB/s — roughly 458 GB/day per replica uncompressed. Gorilla-style columnar compression brings that down to ~1.5 bytes per sample, reducing the on-disk footprint to around 43 GB/day per replica. The write problem is solved by this compression scheme and append-only chunk files. But compression alone doesn't help if you can't find the right series fast: a label-selector query like {service="payments", status="500"} might need to intersect posting lists across 5 million series in microseconds. That's an inverted-index problem, not a database problem.
The second tension is cardinality. Every unique combination of label values is a distinct time series. One developer adding user_id as a label on a high-traffic metric can turn 5 million series into 5 billion overnight — filling every ingester's memory, making WAL replay take hours, and crashing the cluster. The monitoring system must enforce cardinality limits at ingestion time, before the damage propagates. That constraint shapes nearly every API and operational decision in the design.
Functional requirements
write_metric(name, labels, timestamp, value)— ingest a float64 value for a labeled time series.query_range(expr, start, end, step)— evaluate a PromQL-like expression over a time window; return a matrix of(series, [(ts, val)]).query_instant(expr, timestamp)— evaluate at a single point in time (for alert rule evaluation).- Alerting: define rules like
rate(http_errors[5m]) > 0.01; evaluate on a schedule; route notifications; support silences and inhibition. - (Optional) Multi-tenancy: data isolation between customers.
- (Optional) Exemplars: link a metric point to a specific trace.
Non-functional requirements
- Write throughput: hundreds of thousands of samples/second, scaling to millions for large deployments.
- Ingest latency: data visible for query within seconds of being written.
- Query latency: dashboard queries (covering hours of data, millions of points) at < 500ms p99; instant queries for alerting at < 100ms p99.
- Retention: raw data for days to weeks; downsampled long-term for months to years.
- Durability: no data loss on single-node failure (WAL + replication).
- Availability: 99.9%+ for both ingest and query paths.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Active time series | 5M | Mid-size deployment; Datadog processes trillions of points/day across millions of hosts |
| Scrape interval | 15 s | Standard Prometheus default |
| Ingest rate | ~333K samples/s | 5M ÷ 15 |
| Raw point size | 16 bytes | 8 B (float64) + 8 B (int64 timestamp) |
| Compressed point size | ~1.5 bytes | Gorilla-style; real deployments get 1.3–1.7; Gorilla paper reports 1.37 B/point on Facebook's workload |
| Compression ratio | ~10–12× | 16 B ÷ 1.5 B |
| Ingest bandwidth (uncompressed) | ~5.3 MB/s | 333K × 16 B |
| Ingest bandwidth (compressed, per replica) | ~500 KB/s | 333K × 1.5 B |
| Daily storage (one replica) | ~43 GB/day | 333K × 1.5 B × 86,400 |
| Daily storage (3× replication) | ~130 GB/day | 43 GB × 3 |
| Retention: raw (15 days) | ~645 GB | 43 GB × 15 |
| Retention: 5-min rollup (90 days) | ~600–800 GB | 43 GB × (15s ÷ 300s) × 90 ≈ 194 GB raw; each downsampled point stores min/max/sum/count → 3–4× higher |
| Retention: 1-hr rollup (2 years) | ~500 GB total | 43 GB × (15s ÷ 3600s) × 730 ≈ 130 GB raw; ×4 for min/max/sum/count |
| Raw label data (index) | ~1 GB serialized | 5M series × ~200 B (metric name + 6 labels avg) |
| In-memory index | ~1–2 GB | Posting lists + symbol tables + chunk refs for 5M series |
| Head block (in-memory) | ~15–40 GB | Index + 2h of chunk data per series at 5M series |
| Points per series in a 24h query | 5,760 | 24 × 3600 ÷ 15 |
| Total points for a 10K-series query | 57.6M | 10K × 5,760 — must be read from compressed chunks |
Takeaway: At 5M active series on a 15-second scrape interval, Gorilla-style compression reduces daily ingest from ~458 GB/day raw to ~43 GB/day per replica (~130 GB with 3× replication); the entire label index for all 5M series fits in 1–2 GB of RAM, keeping label-selector queries microsecond-fast.
A raw scrape of 5M series every 15 seconds generates ~333K points/second. That's achievable with a small number of ingestion nodes — a modern ingester (TSDB-backed, NVMe storage) can sustain 100K–400K samples/sec per node depending on label cardinality and write concurrency, so 333K samples/sec needs only 2–4 nodes in practice. The number goes up as cardinality rises. Every inefficiency in the write path multiplies badly.
Building up to the design
V1: Write timestamps and values into Postgres
CREATE TABLE metrics (
name TEXT,
labels JSONB,
ts TIMESTAMPTZ,
value DOUBLE PRECISION
);
CREATE INDEX ON metrics (name, ts);
Works at demo scale. At 333K inserts/sec a single Postgres dies instantly, and WHERE labels @> '{"host":"web01"}' with JSONB gets slow well before you hit that limit.
V2: Separate series identity from samples
Split into two tables: series(id, name, labels) and samples(series_id, ts, value). The series table is small and mostly static — resolve label selectors to a set of series IDs there, then look up samples in the narrow table with a (series_id, ts) index. Dramatically fewer index entries; label resolution is O(series) not O(samples).
The new bottleneck is disk access pattern. A range query reading 10K series × 5760 points each → 57M random B-tree seeks. B-trees with random I/O are the wrong shape for time-series data.
V3: Append-only chunk files, one per series
Keep a write buffer of the last N samples in memory for each series. When the buffer fills — typically 120 samples, about 30 minutes at a 15s scrape interval — compress the chunk and flush it to disk as one sequential write. This is the key insight behind Prometheus's TSDB and Facebook's Gorilla paper: sequential append access to a columnar layout beats general-purpose databases for time-series at scale. You get sequential disk writes, sequential reads for range queries, and 10–12× compression.
The next problem is crash recovery. If the node dies before the in-memory buffer flushes, you lose up to 30 minutes of data.
V4: Add a write-ahead log and replicate
Before accepting a sample into memory, append it to a WAL on disk. On crash, replay the WAL from the last checkpoint to rebuild the in-memory state. Replicate WAL segments to at least one standby. Now you're crash-safe with zero data loss on a single-node failure — but you still have a single writer per series set. At 5M series across multiple ingesters, queries that cross shard boundaries need a federation layer.
V5: Shard ingesters + distributed query layer
Hash each (metric_name, sorted_label_fingerprint) to one of N ingester shards. A stateless query layer routes sub-queries to the right shards and merges the results. Long-term data gets compacted and offloaded to object storage (S3/GCS). This is the production shape.
flowchart LR
V1["V1: Postgres table\ndemo only"] --> V2["V2: series + samples\nbetter index"]
V2 --> V3["V3: chunk files\n10-12x compression"]
V3 --> V4["V4: + WAL + replicas\ncrash-safe"]
V4 --> V5["V5: sharded ingesters\n+ object store\nproduction scale"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
The data model
Every metric is a time series — a stream of (timestamp, float64) pairs uniquely identified by a metric name plus a set of key-value labels (also called tags):
http_requests_total{service="payments", method="POST", status="200", region="us-east-1"}
This is called a label set. Two series that share the same metric name but differ in any label value are different series. The set of all unique label sets for a metric is the cardinality of that metric.
A single sample (data point) is: (int64 timestamp in milliseconds, float64 value).
Prometheus-style systems use 64-bit millisecond Unix timestamps and IEEE 754 float64 values. This is the atomic unit of storage.
flowchart TD
MN["metric name\ne.g. http_requests_total"]
L1["label: service=payments"]
L2["label: status=200"]
L3["label: region=us-east-1"]
SID["Series ID (fingerprint of name + sorted labels)"]
S1["sample: ts=1700000015000, val=4821.0"]
S2["sample: ts=1700000030000, val=4836.0"]
S3["sample: ts=1700000045001, val=4850.0"]
MN --> SID
L1 --> SID
L2 --> SID
L3 --> SID
SID --> S1
SID --> S2
SID --> S3
style SID fill:#ff6b1a,color:#0a0a0f
style S1 fill:#0e7490,color:#fff
style S2 fill:#0e7490,color:#fff
style S3 fill:#0e7490,color:#fff
The cardinality problem
The #1 way a monitoring system falls over in production is a developer adding a label like user_id or request_id to a metric. Each unique value of that label becomes a separate time series, so the label multiplies the total series count by the number of distinct values — even a label with 1,000 unique values turns 5M series into 5B.
# Safe: bounded cardinality
http_latency_ms{service="checkout", status_code="200"} # ~10 combinations
# Dangerous: unbounded cardinality
http_latency_ms{service="checkout", user_id="u_19283764"} # 10M+ combinations
At 5M series, each active series costs roughly 700 bytes–1 KB in index and metadata structures alone (label postings, symbol tables, chunk references), plus several KB more for the in-memory chunk data. Jump to 50M series and you're looking at tens of GB just for the index; at 5B series — one badly chosen label away — the monitoring cluster has no chance. This is not a theoretical concern; unbounded labels have taken down production monitoring clusters.
The fix is ingestion-time enforcement: maintain a counter per (metric_name, label_name) and reject any new series that would push a label past a configured limit (say, 10K unique values per label). Drop or hash high-cardinality labels at the agent before they reach the central store. Expose cardinality stats to developers so they catch their own mistakes early.
Collection: pull vs push
Pull (Prometheus model)
The monitoring server maintains a list of targets — hosts and services to scrape. On a schedule (every 15s), it sends GET /metrics to each target. The target exposes its metrics in a text or protobuf format; the server parses and stores them.
sequenceDiagram
participant SD as Service discovery
participant SCR as Scraper
participant SVC as "Service /metrics"
participant ING as Ingester
SD->>SCR: "targets: [web01:9100, web02:9100, ...]"
loop every 15s per target
SCR->>SVC: "GET /metrics (e.g., :9100/metrics for node_exporter)"
SVC-->>SCR: "http_requests_total 4821\ncpu_usage 0.43\n..."
SCR->>ING: write_batch(samples)
end
Pull's main advantage is control: the monitoring system owns the scrape timing, so you get aligned timestamps. It also makes dead targets visible instantly — a failed scrape means the up metric drops to 0, which you can alert on. And you can debug a target just by curling its /metrics endpoint yourself.
The downside is topology: pull requires network access from the scraper to every target, which gets awkward across firewalls or NAT. It also misses ephemeral jobs that finish before the next scrape fires.
Push (StatsD / Telegraf / agents model)
Services and agents push metric payloads to a central ingest endpoint whenever they have data. This works for any process lifetime — a batch job can push its final counters just before exiting — and it works across NAT because the agent initiates the connection. Agents can also buffer locally and retry, which helps when the ingest tier has a brief outage.
The trade-off is that silence becomes ambiguous. With pull, a missing scrape means something is wrong. With push, a sender that goes quiet might be dead, or might just have nothing new to report.
In practice, production systems use both. Prometheus-style scraping handles long-lived services with dynamic service discovery; a push gateway or agent handles batch jobs and environments where the network topology prevents pull.
Service discovery
For pull-based scraping to work at scale, you can't maintain a static list of 10,000 hosts. The scrape coordinator integrates with service discovery backends — Kubernetes pod annotations, Consul catalogs, EC2/GCE instance tags — so that when a new pod starts it appears as a scrape target within seconds, and when it dies it disappears automatically. This is "zero-touch" observability: no engineer has to update a config file when the fleet changes.
TSDB internals
This is the core of the system. Understanding it lets you reason about performance, capacity, and failure modes.
Series chunks: columnar, compressed, append-only
Each unique time series gets its own chunk — a contiguous byte block storing up to N samples. Prometheus's TSDB uses 120-sample chunks; Gorilla (Facebook's in-memory TSDB) uses 2-hour blocks. Both achieve similar compression via two complementary tricks.
1. Delta-of-delta timestamp encoding
Scrape intervals are regular. If the interval is 15s, consecutive timestamps differ by 15000ms, 15000ms, 15001ms, … The second-order deltas — differences of differences — are almost always 0 or ±1.
Timestamps: 1700000000000, 1700000015000, 1700000030001, 1700000045000
Deltas: 15000, 15001, 14999
Delta-of-deltas: 1, -2
Encode these delta-of-deltas with a variable-length scheme: a few bits for small values, more bits for rare large deviations. The Gorilla paper reports that ~96% of timestamps compress to a single bit (delta-of-delta = 0). The weighted average is roughly 1–2 bits per timestamp for regular scrape intervals.
2. XOR float compression (Gorilla-style)
Consecutive metric values for a single series tend to be similar. XOR two adjacent float64 values and you get a bitstring where the leading and trailing bits are mostly zero — only the "meaningful" middle changes. Encode just those bits.
val_1 = 0.372 → 0x3FD7CED916872B02
val_2 = 0.374 → 0x3FD7EF9DB22D0E56 (XOR has 18 leading zeros)
For slowly-changing metrics (CPU usage hovering around 40%), XOR compression can reach 1.1 bits/sample. Spiky metrics are harder but still achieve 1.5–2 bits/sample. The practical average across mixed workloads is ~1.5 bytes/sample — the original Gorilla paper reports 1.37 bytes/point on Facebook's workload, a 12× reduction from 16 bytes raw.
Inverted index: label → series
To answer http_requests_total{status="500"}, you need to find all series whose label set contains status="500". Scanning all 5M series is not viable. Instead, for each label_name=label_value pair, maintain a posting list of series IDs:
"service=payments" → [series_42, series_43, series_44, ...]
"status=200" → [series_42, series_45, ...]
"status=500" → [series_43, ...]
A multi-label query is an intersection of posting lists:
{service="payments", status="500"} → series_43 ∩ ... → [series_43]
Posting lists use compressed integer sequences so intersections are fast even across millions of IDs. Prometheus TSDB stores posting lists as sorted arrays of uint64 series references on disk, with delta-compression applied to the sorted IDs; other systems such as Elasticsearch and ClickHouse use full Roaring Bitmaps for their posting lists.
The entire index for 5M series fits comfortably in memory (~1–2 GB), making label-selector queries microsecond-fast.
Block structure and compaction
flowchart TD
HEAD["Head block\n(in-memory, 2h)"]
B1["Block 0–2h\n(flushed, immutable)"]
B2["Block 2–4h"]
B3["Block 0–4h\n(compacted)"]
B4["Block 0–8h\n(compacted again)"]
OBJ["Object store\n(blocks > 1 day)"]
HEAD -->|"flush at 2h"| B1
B1 -->|"compact"| B3
B2 -->|"compact"| B3
B3 -->|"compact"| B4
B4 -->|"ship"| OBJ
style HEAD fill:#ff6b1a,color:#0a0a0f
style OBJ fill:#ffaa00,color:#0a0a0f
style B4 fill:#15803d,color:#fff
The head block is mutable, stored in memory plus WAL. Every ~2 hours it is sealed into an immutable block on disk. Immutable blocks are periodically compacted — merging smaller blocks into larger ones reduces the number of files and improves query performance. Long-term blocks are uploaded to object storage.
This is the same pattern as LSM-tree compaction in LevelDB/RocksDB — optimize writes by making storage mostly-sequential, pay a periodic cost for compaction.
Full architecture
flowchart TD
SD["Service discovery\n(Kubernetes / Consul / EC2)"]
SCR["Scrape coordinators"]
AGT["Push agents\n(for batch jobs)"]
LB["Ingest load balancer"]
ING1["Ingester 0"]
ING2["Ingester 1"]
ING3["Ingester N"]
WAL1["WAL"]
WAL2["WAL"]
CHUNK["On-disk TSDB blocks"]
OBJ["Object store\n(S3 / GCS)"]
IDX["Label index\n(in-memory)"]
QFE["Query frontend\n(sharding + caching)"]
QBE1["Querier 0"]
QBE2["Querier 1"]
STORE["Store gateway\n(reads from object store)"]
DASH["Dashboards"]
ARULE["Alert rule evaluator"]
NROUTE["Notification router\n(email / PD / Slack)"]
SD --> SCR
SCR --> LB
AGT --> LB
LB --> ING1
LB --> ING2
LB --> ING3
ING1 --> WAL1
ING2 --> WAL2
ING1 --> CHUNK
CHUNK --> OBJ
CHUNK --> IDX
IDX --> QBE1
IDX --> QBE2
OBJ --> STORE
STORE --> QBE1
QFE --> QBE1
QFE --> QBE2
QFE --> DASH
QFE --> ARULE
ARULE --> NROUTE
style SCR fill:#ff6b1a,color:#0a0a0f
style ING1 fill:#0e7490,color:#fff
style ING2 fill:#0e7490,color:#fff
style ING3 fill:#0e7490,color:#fff
style OBJ fill:#ffaa00,color:#0a0a0f
style QFE fill:#15803d,color:#fff
style ARULE fill:#a855f7,color:#fff
style NROUTE fill:#ff2e88,color:#fff
Ingestion path
- A scrape coordinator resolves the target list from service discovery and fires HTTP GET requests every 15s.
- Parsed samples are hashed by
(metric_name, sorted_label_fingerprint)and routed to the responsible ingester shard. - The ingester appends the sample to the in-memory head block and to the WAL on disk.
- Every 2 hours, the head block is flushed to an immutable on-disk block; the WAL segments up to that point can be discarded.
- A background compactor merges smaller blocks into larger ones and ships aged-out blocks to object storage.
Query path
sequenceDiagram
participant User
participant QFE as Query frontend
participant QBE as Querier
participant ING as Ingester
participant STORE as Store gateway
User->>QFE: "rate(http_errors[5m])"
QFE->>QFE: parse + plan
QFE->>QBE: sub-query (last 24h)
par fetch recent data
QBE->>ING: series for last 2h (from head block)
and fetch older data
QBE->>STORE: series for 2h–24h (from object store)
end
QBE->>QFE: merged result
QFE->>User: evaluated metric matrix
The query frontend is stateless and handles three jobs: it splits a long query (say, 7 days) into 24-hour sub-queries that run in parallel; it caches sub-query results aligned to fixed time boundaries so repeated dashboard loads skip the queriers entirely; and it shards high-cardinality queries across multiple querier nodes, each handling a slice of the series space.
Downsampling and retention tiers
Keeping raw 15-second samples for 2 years is prohibitive. The solution is to downsample older data periodically.
| Tier | Resolution | Retention | Storage (5M series) |
|---|---|---|---|
| Raw | 15 s | 15 days | ~645 GB |
| Downsampled | 5 min | 90 days | ~600–800 GB (20× fewer points × 4 values each) |
| Long-term | 1 hour | 2 years | ~500 GB total (~250 GB/year) |
Each downsampled point stores not just value but {min, max, sum, count} — so you can reconstruct averages, min, max, and approximate percentiles even from the coarser resolution. A background compactor runs over aged-out blocks, computes rollups, and writes the downsampled blocks to object storage. Queriers transparently use the highest-resolution tier available for the requested time range.
The trade-off is precision: a 1-hour data point cannot tell you whether there was a 5-second spike within that hour. For dashboards covering months, that's fine. For alerting, you always run against raw data.
flowchart LR
RAW["Raw data\n15s resolution\nkeep 15 days"]
D5["5-min rollup\nkeep 90 days"]
D60["1-hr rollup\nkeep 2 years"]
OBJ["Object store\n(S3 / GCS)"]
RAW -->|"compact + downsample"| D5
D5 -->|"compact + downsample"| D60
D5 --> OBJ
D60 --> OBJ
style RAW fill:#ff6b1a,color:#0a0a0f
style D5 fill:#0e7490,color:#fff
style D60 fill:#15803d,color:#fff
style OBJ fill:#ffaa00,color:#0a0a0f
Alerting
Alert evaluation is a separate subsystem — decoupled from the query path and the storage path.
flowchart LR
RULES["Alert rules\n(PromQL expressions)"] --> EVAL["Rule evaluator\n(runs every 15–60 s)"]
QFE["Query frontend"] --> EVAL
EVAL -->|"active alert"| AM["Alert manager"]
AM --> DEDUP["Dedup + grouping"]
DEDUP --> INH["Inhibition rules"]
INH --> SIL["Silences"]
SIL --> ROUTE["Routing rules"]
ROUTE --> EMAIL["Email"]
ROUTE --> PD["PagerDuty"]
ROUTE --> SLACK["Slack"]
style EVAL fill:#a855f7,color:#fff
style AM fill:#ff6b1a,color:#0a0a0f
style ROUTE fill:#15803d,color:#fff
An alert rule looks like:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
for: 2m # must be true for 2 consecutive minutes before firing
labels:
severity: critical
annotations:
summary: "Error rate > 1% for {{ $labels.service }}"
The evaluator runs every evaluation_interval (typically 15–60 seconds), executes each rule expression against the query frontend, and compares the result to the rule's threshold. If the condition holds for the for duration, the alert fires.
The alert manager then takes over. It deduplicates alerts — the same condition may fire from multiple evaluator replicas, so it collapses by (alertname, labels). It groups related alerts (all high-error-rate alerts for a service) into a single notification so on-call engineers don't get paged 50 times for one incident. It applies inhibition rules to suppress lower-severity alerts when a higher-severity one is already firing for the same system. Silences let you time-bound suppression during planned maintenance. Finally, routing rules send the alert to the right team based on label matchers.
Storage choices
| Data | Store | Why |
|---|---|---|
| Active samples (last 2h) | In-memory head block + WAL | Sub-millisecond write; recoverable on crash |
| Recent blocks (2h–15d) | Local NVMe / SSD | Sequential reads; warm for dashboard queries |
| Long-term blocks | Object store (S3/GCS) | Cheap, durable, effectively unlimited; queried less often |
| Label index | In-memory | 1–2 GB for 5M series; must be fast |
| Alert rules + routing config | Config file / version-controlled YAML | Simple; rarely changes |
| Notification state | In-memory + durable KV (etcd/Postgres) | For dedup across alert manager replicas |
Failure modes
Cardinality explosion
A single deployment adds a high-cardinality label. Series count grows from 5M to 5B. Memory on ingesters exhausts; ingesters OOM and restart; on restart, WAL replay takes hours because there are 5B series to reconstruct. Cascading failure.
The fix is hard per-metric cardinality limits enforced at ingestion, plus dashboards showing the top-N metrics by series count so developers catch explosions before they reach the ingesters.
Scrape gaps
The scraper loses connectivity to a target. Samples stop for that target. Alert rules using rate(metric[5m]) return no data (empty/NaN) rather than 0 once the look-back window contains no samples. A threshold rule like rate(errors[5m]) > 0 won't fire because there's no value to compare — silence is the most dangerous failure mode in monitoring. Worse, a low-error-rate assertion like rate(errors[5m]) < 0.01 will silently stop firing, giving a false green.
The defense is an explicit up metric: the scraper sets up=1 on a successful scrape and up=0 on failure. Alert on up == 0 to detect dead targets. Also use absent(metric) in alert rules to catch missing time series explicitly.
Ingester failure (WAL replay)
An ingester crashes. On restart it replays the WAL from the last checkpoint to rebuild the in-memory head block. With large WAL segments — say, 2 hours of data across 1M series — replay can take minutes. During replay the ingester is not available for writes or queries.
WAL segment size limits reduce the maximum replay window. Replication (two ingesters handling the same series fingerprints) means the replica covers reads while the primary replays. Checkpoint the WAL more frequently to shrink recovery time further.
Query DoS
A dashboard engineer writes rate(some_metric[30d]) — a 30-day query with 15s resolution, touching 2 billion data points. It saturates the query layer and starves other users.
Mitigate with a query timeout (typically 2 minutes), a max result set size in bytes, query cost estimation before execution, and per-tenant query concurrency limits.
Hot tenant (multi-tenant deployments)
A single tenant ingests 10× their share due to a runaway agent or a new high-cardinality metric. They saturate their ingester shard, affecting neighbors. Per-tenant ingestion rate limits, tenant-to-shard isolation, and autoscaling ingesters all help here.
Things to discuss in an interview
- Pull vs push and why you'd use both: pull for long-running services with service discovery, push gateway for batch jobs.
- Cardinality as the #1 failure mode: every label is a multiplier on series count; enforce hard limits at ingest.
- Why TSDB uses chunks not rows: sequential I/O access pattern; enables compression; aligns with time-range query patterns.
- How compression works: delta-of-delta for timestamps (regular intervals → tiny second-order deltas); XOR for floats (similar consecutive values → high-bit XOR runs).
- Downsampling trade-off: you lose intra-period precision; always query recent data from raw, old data from rollups.
- Alert evaluation is a read path, not a write path: run it as a separate periodic job so it doesn't block ingestion; dedup in the alert manager.
Things you should now be able to answer
- Why does adding
user_idas a label break a monitoring system? - How does delta-of-delta encoding achieve near-1 bit per timestamp for regular scrape intervals, and what happens when a scrape fires late?
- Why do monitoring systems flush the head block to an immutable on-disk block every 2 hours rather than writing to a single large file?
- How does scrape silence differ from a metric dropping to zero — and how do you alert on each?
- What does a query frontend do that a querier cannot?
- Why does downsampled storage keep min/max/sum/count rather than a single average?
Further reading
- "Gorilla: A Fast, Scalable, In-Memory Time Series Database" — Pelkonen et al., VLDB 2015 (the foundational paper for modern TSDB compression)
- Prometheus TSDB documentation — prometheus.io/docs/prometheus/latest/storage/
- "Thanos: Highly available Prometheus setup with long term storage capabilities" — thanos.io (open-source reference for multi-tenant, object-store-backed Prometheus)
- "Cortex: A multi-tenant, horizontally scalable Prometheus as a Service" — CNCF project, similar design space to Thanos
- Datadog Engineering Blog — engineering.datadoghq.com (search for "metrics", "Monocle", "Husky", or "time series" for relevant architecture deep-dives)
Frequently asked questions
▸What is cardinality in a metrics system, and why is it the number-one failure mode?
Cardinality is the count of unique label-set combinations for a metric — every distinct (metric_name, label_key=label_value) permutation is a separate time series. It is the top failure mode because a single high-cardinality label like user_id can multiply 5 million series into 5 billion overnight, exhausting ingester memory, causing OOM crashes, and making WAL replay take hours. The fix is hard per-metric limits enforced at ingestion time before the damage propagates.
▸How does Gorilla-style compression achieve roughly 1.5 bytes per sample from 16 bytes raw?
It combines two tricks: delta-of-delta encoding on timestamps (regular 15-second scrape intervals produce second-order deltas of nearly zero, so about 96% of timestamps compress to a single bit) and XOR encoding on float64 values (adjacent samples that are close in value produce XOR results with long leading and trailing zero runs, encoding only the meaningful changed bits). The Gorilla paper reports 1.37 bytes per point on Facebook workloads; mixed production workloads average around 1.5 bytes, a 10-12x reduction from 16 bytes raw.
▸When should a metrics system use pull scraping versus push, and can you use both?
Pull (Prometheus-style) is preferred for long-running services because the monitoring server owns scrape timing, timestamps are aligned, and a failed scrape immediately surfaces as an up=0 alert. Push is necessary for ephemeral batch jobs that may finish before the next scrape fires, and works across NAT because the agent initiates the connection. Production systems use both: pull via service-discovery-driven scraping for services, a push gateway for short-lived processes.
▸What are the three retention tiers and how much storage does each consume at 5 million active series?
Raw 15-second data is kept for 15 days at roughly 645 GB. Five-minute rollups are kept for 90 days at 600-800 GB (far fewer points but each stores min, max, sum, and count). One-hour rollups are kept for 2 years at around 500 GB total. Each downsampled point stores four values rather than one to allow reconstruction of averages, min, max, and approximate percentiles from coarser resolution.
▸Why does scrape silence produce a worse alert outcome than a metric dropping to zero?
When a target goes silent, rate() expressions return no data (NaN) rather than zero, so threshold rules like rate(errors[5m]) > 0 never fire because there is no value to compare against. A low-error-rate assertion like rate(errors[5m]) < 0.01 silently stops firing, giving a false green signal. The defense is an explicit up metric set to 0 on scrape failure, plus absent() in alert rules to catch missing series explicitly.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.