◆◆◆ Advanced · asked at Datadog, Google, Amazon, Uber

Design a Metrics & Monitoring System (Prometheus / Datadog)

Ingest billions of time-series points, store them cheaply, and answer dashboard + alerting queries fast. TSDB internals, cardinality, downsampling, and pull vs push.

22 min read · 2026-03-12 · Ironclad Academy

#interview #observability #time-series #big-data

The problem

Production systems lie to you constantly. Not on purpose, but because without continuous measurement everything is invisible. Prometheus-compatible backends have been run at one billion active time series in the largest deployments. Datadog processes trillions of data points per day from millions of hosts. At that scale, the "monitoring" problem stops being about collecting numbers and starts being about three deeply conflicting sub-problems: writing data at hundreds of thousands of points per second, storing it cheaply enough that you can afford years of history, and querying it fast enough that a dashboard loads before an engineer loses patience.

The concrete system looks like this. Every host in your fleet runs a metrics agent that exposes CPU, memory, request rate, error rate, and hundreds of custom business counters. A monitoring backend scrapes all of them every 15 seconds, ingests the results, and keeps them queryable. An engineer opens a dashboard during an incident and types rate(http_errors[5m]); that expression has to scan millions of stored data points and return an answer in under half a second. Prometheus and Datadog both solve this problem; so do VictoriaMetrics, Thanos, Grafana Mimir, and a dozen internal systems at companies like Google and Uber.

What makes this hard is a combination of write volume and query shape. At 5 million active time series on a 15-second scrape interval, you are ingesting roughly 333,000 data points per second, steady state. Storing them as raw floats would consume around 5.3 MB/s — roughly 458 GB/day per replica uncompressed. Gorilla-style columnar compression brings that down to ~1.5 bytes per sample, reducing the on-disk footprint to around 43 GB/day per replica. The write problem is solved by this compression scheme and append-only chunk files. But compression alone doesn't help if you can't find the right series fast: a label-selector query like {service="payments", status="500"} might need to intersect posting lists across 5 million series in microseconds. That's an inverted-index problem, not a database problem.

The second tension is cardinality. Every unique combination of label values is a distinct time series. One developer adding user_id as a label on a high-traffic metric can turn 5 million series into 5 billion overnight — filling every ingester's memory, making WAL replay take hours, and crashing the cluster. The monitoring system must enforce cardinality limits at ingestion time, before the damage propagates. That constraint shapes nearly every API and operational decision in the design.

Functional requirements

write_metric(name, labels, timestamp, value) — ingest a float64 value for a labeled time series.
query_range(expr, start, end, step) — evaluate a PromQL-like expression over a time window; return a matrix of (series, [(ts, val)]).
query_instant(expr, timestamp) — evaluate at a single point in time (for alert rule evaluation).
Alerting: define rules like rate(http_errors[5m]) > 0.01; evaluate on a schedule; route notifications; support silences and inhibition.
(Optional) Multi-tenancy: data isolation between customers.
(Optional) Exemplars: link a metric point to a specific trace.

Non-functional requirements

Write throughput: hundreds of thousands of samples/second, scaling to millions for large deployments.
Ingest latency: data visible for query within seconds of being written.
Query latency: dashboard queries (covering hours of data, millions of points) at < 500ms p99; instant queries for alerting at < 100ms p99.
Retention: raw data for days to weeks; downsampled long-term for months to years.
Durability: no data loss on single-node failure (WAL + replication).
Availability: 99.9%+ for both ingest and query paths.

Capacity estimation

Dimension	Estimate	How we got there
Active time series	5M	Mid-size deployment; Datadog processes trillions of points/day across millions of hosts
Scrape interval	15 s	Standard Prometheus default
Ingest rate	~333K samples/s	`5M ÷ 15`
Raw point size	16 bytes	`8 B (float64) + 8 B (int64 timestamp)`
Compressed point size	~1.5 bytes	Gorilla-style; real deployments get 1.3–1.7; Gorilla paper reports 1.37 B/point on Facebook's workload
Compression ratio	~10–12×	`16 B ÷ 1.5 B`
Ingest bandwidth (uncompressed)	~5.3 MB/s	`333K × 16 B`
Ingest bandwidth (compressed, per replica)	~500 KB/s	`333K × 1.5 B`
Daily storage (one replica)	~43 GB/day	`333K × 1.5 B × 86,400`
Daily storage (3× replication)	~130 GB/day	`43 GB × 3`
Retention: raw (15 days)	~645 GB	`43 GB × 15`
Retention: 5-min rollup (90 days)	~600–800 GB	`43 GB × (15s ÷ 300s) × 90 ≈ 194 GB raw`; each downsampled point stores min/max/sum/count → 3–4× higher
Retention: 1-hr rollup (2 years)	~500 GB total	`43 GB × (15s ÷ 3600s) × 730 ≈ 130 GB raw`; ×4 for min/max/sum/count
Raw label data (index)	~1 GB serialized	`5M series × ~200 B (metric name + 6 labels avg)`
In-memory index	~1–2 GB	Posting lists + symbol tables + chunk refs for 5M series
Head block (in-memory)	~15–40 GB	Index + 2h of chunk data per series at 5M series
Points per series in a 24h query	5,760	`24 × 3600 ÷ 15`
Total points for a 10K-series query	57.6M	`10K × 5,760` — must be read from compressed chunks

Takeaway: At 5M active series on a 15-second scrape interval, Gorilla-style compression reduces daily ingest from ~458 GB/day raw to ~43 GB/day per replica (~130 GB with 3× replication); the entire label index for all 5M series fits in 1–2 GB of RAM, keeping label-selector queries microsecond-fast.
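The arithmetic behind the table can be reproduced in a few lines. This is a back-of-envelope sketch using the figures assumed above (5M series, 15 s interval, 16 B raw, ~1.5 B compressed, 3 replicas); the function and key names are illustrative, not part of any real tool.

```python
SECONDS_PER_DAY = 86_400


def capacity(active_series: int, scrape_interval_s: float,
             raw_bytes: float = 16.0, compressed_bytes: float = 1.5,
             replicas: int = 3) -> dict:
    """Back-of-envelope ingest and storage numbers for a metrics cluster."""
    samples_per_s = active_series / scrape_interval_s
    raw_gb_day = samples_per_s * raw_bytes * SECONDS_PER_DAY / 1e9
    comp_gb_day = samples_per_s * compressed_bytes * SECONDS_PER_DAY / 1e9
    return {
        "samples_per_s": samples_per_s,
        "raw_mb_per_s": samples_per_s * raw_bytes / 1e6,
        "raw_gb_per_day": raw_gb_day,
        "compressed_gb_per_day": comp_gb_day,
        "replicated_gb_per_day": comp_gb_day * replicas,
        "raw_tier_15d_gb": comp_gb_day * 15,
    }


est = capacity(5_000_000, 15)
for name, value in est.items():
    print(f"{name:>24}: {value:,.1f}")
```

Computed without intermediate rounding, the results land slightly above the table's (about 461 GB/day raw and 648 GB for the 15-day tier), because the table rounds at each step.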

A raw scrape of 5M series every 15 seconds generates ~333K points/second. That's achievable with a small number of ingestion nodes — a modern ingester (TSDB-backed, NVMe storage) can sustain 100K–400K samples/sec per node depending on label cardinality and write concurrency, so 333K samples/sec needs only 2–4 nodes in practice. The number goes up as cardinality rises. Every inefficiency in the write path multiplies badly.

Building up to the design

V1: Write timestamps and values into Postgres

CREATE TABLE metrics (
  name        TEXT,
  labels      JSONB,
  ts          TIMESTAMPTZ,
  value       DOUBLE PRECISION
);
CREATE INDEX ON metrics (name, ts);

Works at demo scale. At 333K inserts/sec a single Postgres dies instantly, and WHERE labels @> '{"host":"web01"}' with JSONB gets slow well before you hit that limit.

V2: Separate series identity from samples

Split into two tables: series(id, name, labels) and samples(series_id, ts, value). The series table is small and mostly static — resolve label selectors to a set of series IDs there, then look up samples in the narrow table with a (series_id, ts) index. Dramatically fewer index entries; label resolution is O(series) not O(samples).

The new bottleneck is disk access pattern. A range query reading 10K series × 5,760 points each touches 57.6M samples, and in the worst case each one is a separate random B-tree lookup. B-trees with random I/O are the wrong shape for time-series data.

V3: Append-only chunk files, one per series

Keep a write buffer of the last N samples in memory for each series. When the buffer fills — typically 120 samples, about 30 minutes at a 15s scrape interval — compress the chunk and flush it to disk as one sequential write. This is the key insight behind Prometheus's TSDB and Facebook's Gorilla paper: sequential append access to a columnar layout beats general-purpose databases for time-series at scale. You get sequential disk writes, sequential reads for range queries, and 10–12× compression.

The next problem is crash recovery. If the node dies before the in-memory buffer flushes, you lose up to 30 minutes of data.

V4: Add a write-ahead log and replicate

Before accepting a sample into memory, append it to a WAL on disk. On crash, replay the WAL from the last checkpoint to rebuild the in-memory state. Replicate WAL segments to at least one standby. Now you're crash-safe with zero data loss on a single-node failure — but you still have a single writer per series set. At 5M series across multiple ingesters, queries that cross shard boundaries need a federation layer.
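The append-then-replay cycle can be sketched as follows. This is a minimal illustration, not how any particular TSDB lays out its log: the fixed 24-byte record format, the WriteAheadLog class, and the replay function are all assumptions made for the example, and checkpoints and segment rotation are left out.

```python
import os
import struct
import tempfile

# Fixed-width record: series id (uint64), timestamp in ms (int64), value (float64).
RECORD = struct.Struct("<Qqd")


class WriteAheadLog:
    def __init__(self, path: str):
        self._file = open(path, "ab")

    def append(self, series_id: int, ts_ms: int, value: float) -> None:
        self._file.write(RECORD.pack(series_id, ts_ms, value))
        self._file.flush()
        os.fsync(self._file.fileno())  # durable before the write is acknowledged

    def close(self) -> None:
        self._file.close()


def replay(path: str) -> dict[int, list[tuple[int, float]]]:
    """Rebuild the in-memory head (series id -> samples) from the log."""
    head: dict[int, list[tuple[int, float]]] = {}
    with open(path, "rb") as f:
        while True:
            buf = f.read(RECORD.size)
            if len(buf) < RECORD.size:
                break  # end of log, or a torn final record from a crash mid-write
            series_id, ts_ms, value = RECORD.unpack(buf)
            head.setdefault(series_id, []).append((ts_ms, value))
    return head


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = WriteAheadLog(path)
    wal.append(42, 1700000015000, 4821.0)
    wal.append(42, 1700000030000, 4836.0)
    wal.close()
    print(replay(path))
```

Calling fsync on every sample is the simplest way to show the durability guarantee, but it would be far too slow at 333K samples/s; a practical log would batch many samples per fsync.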

V5: Shard ingesters + distributed query layer

Hash each (metric_name, sorted_label_fingerprint) to one of N ingester shards. A stateless query layer routes sub-queries to the right shards and merges the results. Long-term data gets compacted and offloaded to object storage (S3/GCS). This is the production shape.
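A sketch of the routing step, assuming a simple hash-modulo scheme. The function names and the choice of BLAKE2 are illustrative; the only property that matters is that the same name and labels always map to the same shard, regardless of label order.

```python
import hashlib


def series_fingerprint(name: str, labels: dict[str, str]) -> int:
    """Stable 64-bit fingerprint of a metric name plus its sorted labels."""
    h = hashlib.blake2b(digest_size=8)
    h.update(name.encode())
    for key in sorted(labels):
        # Separator bytes keep ("a", "bc") and ("ab", "c") from colliding.
        h.update(b"\x00" + key.encode() + b"\x01" + labels[key].encode())
    return int.from_bytes(h.digest(), "big")


def shard_for(name: str, labels: dict[str, str], num_shards: int) -> int:
    return series_fingerprint(name, labels) % num_shards


a = shard_for("http_requests_total", {"service": "payments", "status": "200"}, 8)
b = shard_for("http_requests_total", {"status": "200", "service": "payments"}, 8)
print(a, b)  # same shard for both label orderings
```

Plain modulo remaps most series whenever the shard count changes. A consistent-hashing ring is the usual way to limit that movement, at the cost of extra bookkeeping.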

flowchart LR
    V1["V1: Postgres table\ndemo only"] --> V2["V2: series + samples\nbetter index"]
    V2 --> V3["V3: chunk files\n10-12x compression"]
    V3 --> V4["V4: + WAL + replicas\ncrash-safe"]
    V4 --> V5["V5: sharded ingesters\n+ object store\nproduction scale"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

The data model

Every metric is a time series — a stream of (timestamp, float64) pairs uniquely identified by a metric name plus a set of key-value labels (also called tags):

http_requests_total{service="payments", method="POST", status="200", region="us-east-1"}

This is called a label set. Two series that share the same metric name but differ in any label value are different series. The number of unique label sets for a metric is the cardinality of that metric.

A single sample (data point) is: (int64 timestamp in milliseconds, float64 value).

Prometheus-style systems use 64-bit millisecond Unix timestamps and IEEE 754 float64 values. This is the atomic unit of storage.

flowchart TD
    MN["metric name\ne.g. http_requests_total"]
    L1["label: service=payments"]
    L2["label: status=200"]
    L3["label: region=us-east-1"]
    SID["Series ID (fingerprint of name + sorted labels)"]
    S1["sample: ts=1700000015000, val=4821.0"]
    S2["sample: ts=1700000030000, val=4836.0"]
    S3["sample: ts=1700000045001, val=4850.0"]

    MN --> SID
    L1 --> SID
    L2 --> SID
    L3 --> SID
    SID --> S1
    SID --> S2
    SID --> S3

    style SID fill:#ff6b1a,color:#0a0a0f
    style S1 fill:#0e7490,color:#fff
    style S2 fill:#0e7490,color:#fff
    style S3 fill:#0e7490,color:#fff

The cardinality problem

The #1 way a monitoring system falls over in production is a developer adding a label like user_id or request_id to a metric. Each unique value of that label becomes a separate time series, so the label multiplies the series count of every metric that carries it by the number of distinct values. Applied fleet-wide, even a label with 1,000 unique values turns 5M series into 5B.

# Safe: bounded cardinality
http_latency_ms{service="checkout", status_code="200"}  # ~10 combinations

# Dangerous: unbounded cardinality
http_latency_ms{service="checkout", user_id="u_19283764"}  # 10M+ combinations

At 5M series, each active series costs roughly 700 bytes–1 KB in index and metadata structures alone (label postings, symbol tables, chunk references), plus several KB more for the in-memory chunk data. Jump to 50M series and you're looking at tens of GB just for the index; at 5B series — one badly chosen label away — the monitoring cluster has no chance. This is not a theoretical concern; unbounded labels have taken down production monitoring clusters.

The fix is ingestion-time enforcement: maintain a counter per (metric_name, label_name) and reject any new series that would push a label past a configured limit (say, 10K unique values per label). Drop or hash high-cardinality labels at the agent before they reach the central store. Expose cardinality stats to developers so they catch their own mistakes early.
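The per-(metric_name, label_name) counter described above might look like the sketch below. CardinalityLimiter and its admit method are hypothetical names, and it tracks exact value sets in memory for clarity; a real ingester would likely want something cheaper, such as an approximate distinct counter.

```python
from collections import defaultdict


class CardinalityLimiter:
    def __init__(self, max_values_per_label: int = 10_000):
        self.max_values = max_values_per_label
        # (metric name, label name) -> set of label values seen so far
        self._seen: dict[tuple[str, str], set[str]] = defaultdict(set)

    def admit(self, metric: str, labels: dict[str, str]) -> bool:
        """Return True if the series may be ingested, False if it would
        push any of its labels past the configured limit."""
        # Check every label before recording anything, so a rejected
        # series does not consume budget for its other labels.
        for name, value in labels.items():
            seen = self._seen[(metric, name)]
            if value not in seen and len(seen) >= self.max_values:
                return False
        for name, value in labels.items():
            self._seen[(metric, name)].add(value)
        return True


limiter = CardinalityLimiter(max_values_per_label=2)
for user in ("u_1", "u_2", "u_3", "u_1"):
    ok = limiter.admit("http_latency_ms", {"service": "checkout", "user_id": user})
    print(user, "accepted" if ok else "rejected")
```

Series that were already admitted keep flowing after the limit is hit; only label values never seen before are turned away.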

Collection: pull vs push

Pull (Prometheus model)

The monitoring server maintains a list of targets — hosts and services to scrape. On a schedule (every 15s), it sends GET /metrics to each target. The target exposes its metrics in a text or protobuf format; the server parses and stores them.

sequenceDiagram
    participant SD as Service discovery
    participant SCR as Scraper
    participant SVC as "Service /metrics"
    participant ING as Ingester
    SD->>SCR: "targets: [web01:9100, web02:9100, ...]"
    loop every 15s per target
        SCR->>SVC: "GET /metrics  (e.g., :9100/metrics for node_exporter)"
        SVC-->>SCR: "http_requests_total 4821\ncpu_usage 0.43\n..."
        SCR->>ING: write_batch(samples)
    end

Pull's main advantage is control: the monitoring system owns the scrape timing, so you get aligned timestamps. It also makes dead targets visible instantly — a failed scrape means the up metric drops to 0, which you can alert on. And you can debug a target just by curling its /metrics endpoint yourself.

The downside is topology: pull requires network access from the scraper to every target, which gets awkward across firewalls or NAT. It also misses ephemeral jobs that finish before the next scrape fires.

Push (StatsD / Telegraf / agents model)

Services and agents push metric payloads to a central ingest endpoint whenever they have data. This works for any process lifetime — a batch job can push its final counters just before exiting — and it works across NAT because the agent initiates the connection. Agents can also buffer locally and retry, which helps when the ingest tier has a brief outage.

The trade-off is that silence becomes ambiguous. With pull, a missing scrape means something is wrong. With push, a sender that goes quiet might be dead, or might just have nothing new to report.

In practice, production systems use both. Prometheus-style scraping handles long-lived services with dynamic service discovery; a push gateway or agent handles batch jobs and environments where the network topology prevents pull.

Service discovery

For pull-based scraping to work at scale, you can't maintain a static list of 10,000 hosts. The scrape coordinator integrates with service discovery backends — Kubernetes pod annotations, Consul catalogs, EC2/GCE instance tags — so that when a new pod starts it appears as a scrape target within seconds, and when it dies it disappears automatically. This is "zero-touch" observability: no engineer has to update a config file when the fleet changes.

TSDB internals

This is the core of the system. Understanding it lets you reason about performance, capacity, and failure modes.

Series chunks: columnar, compressed, append-only

Each unique time series gets its own chunk — a contiguous byte block storing up to N samples. Prometheus's TSDB uses 120-sample chunks; Gorilla (Facebook's in-memory TSDB) uses 2-hour blocks. Both achieve similar compression via two complementary tricks.

1. Delta-of-delta timestamp encoding

Scrape intervals are regular. If the interval is 15s, consecutive timestamps differ by 15000ms, 15000ms, 15001ms, … The second-order deltas — differences of differences — are almost always 0 or ±1.

Timestamps:      1700000000000, 1700000015000, 1700000030001, 1700000045000
Deltas:                         15000,          15001,          14999
Delta-of-deltas:                                    1,             -2

Encode these delta-of-deltas with a variable-length scheme: a few bits for small values, more bits for rare large deviations. The Gorilla paper reports that ~96% of timestamps compress to a single bit (delta-of-delta = 0). The weighted average is roughly 1–2 bits per timestamp for regular scrape intervals.
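To make the cost concrete, the sketch below computes delta-of-deltas and the number of bits each would take under the bucket sizes given in the Gorilla paper. The function names are illustrative. The paper's buckets were chosen for second-resolution timestamps, so millisecond jitter like the example above falls into a wider bucket than a perfectly regular series does.

```python
def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """Second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def gorilla_timestamp_bits(dod: int) -> int:
    """Bits used for one delta-of-delta: control prefix plus payload,
    using the ranges listed in the Gorilla paper."""
    if dod == 0:
        return 1            # single '0' bit
    if -63 <= dod <= 64:
        return 2 + 7        # '10' + 7-bit value
    if -255 <= dod <= 256:
        return 3 + 9        # '110' + 9-bit value
    if -2047 <= dod <= 2048:
        return 4 + 12       # '1110' + 12-bit value
    return 4 + 32           # '1111' + 32-bit value


regular = [0, 15, 30, 45, 60]  # seconds, perfectly regular scrape
jittery = [1700000000000, 1700000015000, 1700000030001, 1700000045000]  # ms

for name, ts in (("regular", regular), ("jittery", jittery)):
    dods = delta_of_deltas(ts)
    bits = [gorilla_timestamp_bits(d) for d in dods]
    print(name, dods, bits)
```

This also shows what a late scrape costs: a nonzero delta-of-delta takes at least 9 bits instead of 1, and the following sample pays again as the delta returns to normal.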

2. XOR float compression (Gorilla-style)

Consecutive metric values for a single series tend to be similar. XOR two adjacent float64 values and you get a bitstring where the leading and trailing bits are mostly zero — only the "meaningful" middle changes. Encode just those bits.

val_1 = 0.372 → 0x3FD7CED916872B02
val_2 = 0.374 → 0x3FD7EF9DB22D0E56  (XOR has 18 leading zeros)

XOR compression works best when values barely move: an unchanged value (a flat gauge, an idle counter) costs a single bit. Spiky metrics are harder, because more mantissa bits change between samples, and typically cost 1.5–2 bytes/sample. The practical average across mixed workloads is ~1.5 bytes/sample for timestamp and value together; the original Gorilla paper reports 1.37 bytes/point on Facebook's workload, a 12× reduction from 16 bytes raw.
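The XOR step can be checked directly with the standard library. The sketch below only measures the window of changed bits; it does not implement the full bit-packing, and the helper names are illustrative.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit IEEE 754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_window(prev: float, cur: float) -> tuple[int, int, int]:
    """Return (leading_zeros, trailing_zeros, meaningful_bits) of prev XOR cur."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 64, 0, 0  # identical values: the encoder writes a single '0' bit
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return leading, trailing, 64 - leading - trailing


print(hex(float_bits(0.372)), hex(float_bits(0.374)))
print(xor_window(0.372, 0.374))
print(xor_window(4821.0, 4821.0))
```

For 0.372 and 0.374 this reports 18 leading zero bits, matching the worked example above. Only the bits between the leading and trailing zero runs need to be written out.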

Inverted index: label → series

To answer http_requests_total{status="500"}, you need to find all series whose label set contains status="500". Scanning all 5M series is not viable. Instead, for each label_name=label_value pair, maintain a posting list of series IDs:

"service=payments"   → [series_42, series_43, series_44, ...]
"status=200"         → [series_42, series_45, ...]
"status=500"         → [series_43, ...]

A multi-label query is an intersection of posting lists:

{service="payments", status="500"} → [series_42, series_43, series_44, ...] ∩ [series_43, ...] → [series_43]

Posting lists are kept sorted by series ID, so intersecting two of them is a single linear merge even across millions of IDs. Prometheus's TSDB stores each on-disk posting list as a sorted array of fixed-width 4-byte series references; other engines use compressed bitmap formats such as Roaring bitmaps for the same job.

The entire index for 5M series fits comfortably in memory (~1–2 GB), making label-selector queries microsecond-fast.
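A minimal sketch of a label-selector lookup over sorted posting lists, using the toy series IDs from the example above. The postings dictionary and the select function are illustrative and handle equality matchers only.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted lists of series IDs with a linear merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# (label name, label value) -> sorted series IDs
postings = {
    ("service", "payments"): [42, 43, 44],
    ("status", "200"): [42, 45],
    ("status", "500"): [43],
}


def select(matchers: list[tuple[str, str]]) -> list[int]:
    """Series IDs matching every label=value pair in `matchers`."""
    if not matchers:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((postings.get(m, []) for m in matchers), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


print(select([("service", "payments"), ("status", "500")]))
```

Starting with the shortest list is a small design choice that pays off when one matcher is highly selective: the merge can never produce more IDs than the smallest input.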

Block structure and compaction

flowchart TD
    HEAD["Head block\n(in-memory, 2h)"]
    B1["Block 0–2h\n(flushed, immutable)"]
    B2["Block 2–4h"]
    B3["Block 0–4h\n(compacted)"]
    B4["Block 0–8h\n(compacted again)"]
    OBJ["Object store\n(blocks > 1 day)"]

    HEAD -->|"flush at 2h"| B1
    B1 -->|"compact"| B3
    B2 -->|"compact"| B3
    B3 -->|"compact"| B4
    B4 -->|"ship"| OBJ
    style HEAD fill:#ff6b1a,color:#0a0a0f
    style OBJ fill:#ffaa00,color:#0a0a0f
    style B4 fill:#15803d,color:#fff

The head block is mutable, stored in memory plus WAL. Every ~2 hours it is sealed into an immutable block on disk. Immutable blocks are periodically compacted — merging smaller blocks into larger ones reduces the number of files and improves query performance. Long-term blocks are uploaded to object storage.

This is the same pattern as LSM-tree compaction in LevelDB/RocksDB — optimize writes by making storage mostly-sequential, pay a periodic cost for compaction.

Full architecture

flowchart TD
    SD["Service discovery\n(Kubernetes / Consul / EC2)"]
    SCR["Scrape coordinators"]
    AGT["Push agents\n(for batch jobs)"]
    LB["Ingest load balancer"]
    ING1["Ingester 0"]
    ING2["Ingester 1"]
    ING3["Ingester N"]
    WAL1["WAL"]
    WAL2["WAL"]
    CHUNK["On-disk TSDB blocks"]
    OBJ["Object store\n(S3 / GCS)"]
    IDX["Label index\n(in-memory)"]
    QFE["Query frontend\n(sharding + caching)"]
    QBE1["Querier 0"]
    QBE2["Querier 1"]
    STORE["Store gateway\n(reads from object store)"]
    DASH["Dashboards"]
    ARULE["Alert rule evaluator"]
    NROUTE["Notification router\n(email / PD / Slack)"]

    SD --> SCR
    SCR --> LB
    AGT --> LB
    LB --> ING1
    LB --> ING2
    LB --> ING3
    ING1 --> WAL1
    ING2 --> WAL2
    ING1 --> CHUNK
    CHUNK --> OBJ
    CHUNK --> IDX
    IDX --> QBE1
    IDX --> QBE2
    OBJ --> STORE
    STORE --> QBE1
    QFE --> QBE1
    QFE --> QBE2
    QFE --> DASH
    QFE --> ARULE
    ARULE --> NROUTE

    style SCR fill:#ff6b1a,color:#0a0a0f
    style ING1 fill:#0e7490,color:#fff
    style ING2 fill:#0e7490,color:#fff
    style ING3 fill:#0e7490,color:#fff
    style OBJ fill:#ffaa00,color:#0a0a0f
    style QFE fill:#15803d,color:#fff
    style ARULE fill:#a855f7,color:#fff
    style NROUTE fill:#ff2e88,color:#fff

Ingestion path

A scrape coordinator resolves the target list from service discovery and fires HTTP GET requests every 15s.
Parsed samples are hashed by (metric_name, sorted_label_fingerprint) and routed to the responsible ingester shard.
The ingester appends the sample to the in-memory head block and to the WAL on disk.
Every 2 hours, the head block is flushed to an immutable on-disk block; the WAL segments up to that point can be discarded.
A background compactor merges smaller blocks into larger ones and ships aged-out blocks to object storage.

Query path

sequenceDiagram
    participant User
    participant QFE as Query frontend
    participant QBE as Querier
    participant ING as Ingester
    participant STORE as Store gateway
    User->>QFE: "rate(http_errors[5m])"
    QFE->>QFE: parse + plan
    QFE->>QBE: sub-query (last 24h)
    par fetch recent data
        QBE->>ING: series for last 2h (from head block)
    and fetch older data
        QBE->>STORE: series for 2h–24h (from object store)
    end
    QBE->>QFE: merged result
    QFE->>User: evaluated metric matrix

The query frontend is stateless and handles three jobs: it splits a long query (say, 7 days) into 24-hour sub-queries that run in parallel; it caches sub-query results aligned to fixed time boundaries so repeated dashboard loads skip the queriers entirely; and it shards high-cardinality queries across multiple querier nodes, each handling a slice of the series space.
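The splitting step might be sketched like this, assuming millisecond timestamps and day-aligned boundaries. The function name is illustrative; the alignment is what makes the sub-query results cacheable, since two dashboards with different start times still share the same full-day pieces.

```python
DAY_MS = 24 * 3600 * 1000


def split_aligned(start_ms: int, end_ms: int,
                  interval_ms: int = DAY_MS) -> list[tuple[int, int]]:
    """Split [start, end) into sub-ranges cut at fixed interval boundaries."""
    ranges = []
    cursor = start_ms
    while cursor < end_ms:
        next_boundary = (cursor // interval_ms + 1) * interval_ms
        upper = min(next_boundary, end_ms)
        ranges.append((cursor, upper))
        cursor = upper
    return ranges


# A query from day 1.5 to day 3.25 becomes one partial day, one full day,
# and another partial day.
for lo, hi in split_aligned(DAY_MS * 3 // 2, DAY_MS * 13 // 4):
    full_day = (hi - lo) == DAY_MS
    print(lo, hi, "cacheable full day" if full_day else "partial")
```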

Downsampling and retention tiers

Keeping raw 15-second samples for 2 years is prohibitive. The solution is to downsample older data periodically.

Tier	Resolution	Retention	Storage (5M series)
Raw	15 s	15 days	~645 GB
Downsampled	5 min	90 days	~600–800 GB (20× fewer points × 4 values each)
Long-term	1 hour	2 years	~500 GB total (~250 GB/year)

Each downsampled point stores not just a value but {min, max, sum, count}, so you can reconstruct averages (sum ÷ count), minimums, and maximums even from the coarser resolution. Percentiles cannot be recovered from those four values alone; they come from histogram bucket series, which are downsampled like any other series. A background compactor runs over aged-out blocks, computes rollups, and writes the downsampled blocks to object storage. Queriers transparently use the highest-resolution tier available for the requested time range.

The trade-off is precision: a 1-hour data point records the maximum, but it cannot tell you when within that hour a spike happened or how long it lasted. For dashboards covering months, that's fine. For alerting, you always run against raw data.
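The reason for storing {min, max, sum, count} rather than a single average shows up as soon as rollups are merged into coarser ones. In this sketch the Rollup type and its field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Rollup:
    minimum: float
    maximum: float
    total: float
    count: int

    @property
    def avg(self) -> float:
        return self.total / self.count


def rollup(values: list[float]) -> Rollup:
    """Summarise the raw samples of one downsampling window."""
    return Rollup(min(values), max(values), sum(values), len(values))


def merge(parts: list[Rollup]) -> Rollup:
    """Combine finer rollups into a coarser one without the raw samples."""
    return Rollup(
        min(p.minimum for p in parts),
        max(p.maximum for p in parts),
        sum(p.total for p in parts),
        sum(p.count for p in parts),
    )


quiet = rollup([1.0, 1.0, 1.0, 1.0])  # four samples, avg 1.0
spike = rollup([9.0])                 # one sample (gap in scraping), avg 9.0

merged = merge([quiet, spike])
print(merged.avg)                     # weighted by count
print((quiet.avg + spike.avg) / 2)    # average of averages, which is wrong
```

Because sum and count add up exactly, the 1-hour tier can be built from the 5-minute tier and still give the same average as the raw data would. An average of averages gives a different answer whenever windows hold different numbers of samples.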

flowchart LR
    RAW["Raw data\n15s resolution\nkeep 15 days"]
    D5["5-min rollup\nkeep 90 days"]
    D60["1-hr rollup\nkeep 2 years"]
    OBJ["Object store\n(S3 / GCS)"]

    RAW -->|"compact + downsample"| D5
    D5 -->|"compact + downsample"| D60
    D5 --> OBJ
    D60 --> OBJ

    style RAW fill:#ff6b1a,color:#0a0a0f
    style D5 fill:#0e7490,color:#fff
    style D60 fill:#15803d,color:#fff
    style OBJ fill:#ffaa00,color:#0a0a0f

Alerting

Alert evaluation is a separate subsystem, decoupled from ingestion: it reads through the query path on its own schedule and never sits in the write path.

flowchart LR
    RULES["Alert rules\n(PromQL expressions)"] --> EVAL["Rule evaluator\n(runs every 15–60 s)"]
    QFE["Query frontend"] --> EVAL
    EVAL -->|"active alert"| AM["Alert manager"]
    AM --> DEDUP["Dedup + grouping"]
    DEDUP --> INH["Inhibition rules"]
    INH --> SIL["Silences"]
    SIL --> ROUTE["Routing rules"]
    ROUTE --> EMAIL["Email"]
    ROUTE --> PD["PagerDuty"]
    ROUTE --> SLACK["Slack"]
    style EVAL fill:#a855f7,color:#fff
    style AM fill:#ff6b1a,color:#0a0a0f
    style ROUTE fill:#15803d,color:#fff

An alert rule looks like:

- alert: HighErrorRate
  expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.01
  for: 2m         # must be true for 2 consecutive minutes before firing
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 1% for {{ $labels.service }}"

The evaluator runs every evaluation_interval (typically 15–60 seconds), executes each rule expression against the query frontend, and compares the result to the rule's threshold. If the condition holds for the for duration, the alert fires.
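The for-duration behaviour is a small state machine per label set. The sketch below models it under simplifying assumptions: the query result is reduced to the set of label keys for which the expression currently returns a value, and AlertRule is a hypothetical class, not the evaluator of any specific system.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    name: str
    for_seconds: float
    # label key -> time at which the condition first held without interruption
    pending_since: dict[str, float] = field(default_factory=dict)

    def evaluate(self, now: float, active: set[str]) -> set[str]:
        """Record one evaluation and return the label keys that are firing."""
        # Forget series whose condition stopped holding; their timer resets.
        for key in list(self.pending_since):
            if key not in active:
                del self.pending_since[key]
        firing = set()
        for key in active:
            started = self.pending_since.setdefault(key, now)
            if now - started >= self.for_seconds:
                firing.add(key)
        return firing


rule = AlertRule("HighErrorRate", for_seconds=120)
for t, active in [(0, {"payments"}), (60, {"payments"}), (120, {"payments"}),
                  (180, set()), (240, {"payments"})]:
    print(t, sorted(rule.evaluate(t, active)))
```

A single evaluation where the condition does not hold resets the timer, which is what keeps brief blips from paging anyone.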

The alert manager then takes over. It deduplicates alerts — the same condition may fire from multiple evaluator replicas, so it collapses by (alertname, labels). It groups related alerts (all high-error-rate alerts for a service) into a single notification so on-call engineers don't get paged 50 times for one incident. It applies inhibition rules to suppress lower-severity alerts when a higher-severity one is already firing for the same system. Silences let you time-bound suppression during planned maintenance. Finally, routing rules send the alert to the right team based on label matchers.

Storage choices

Data	Store	Why
Active samples (last 2h)	In-memory head block + WAL	Sub-millisecond write; recoverable on crash
Recent blocks (2h–15d)	Local NVMe / SSD	Sequential reads; warm for dashboard queries
Long-term blocks	Object store (S3/GCS)	Cheap, durable, effectively unlimited; queried less often
Label index	In-memory	1–2 GB for 5M series; must be fast
Alert rules + routing config	Config file / version-controlled YAML	Simple; rarely changes
Notification state	In-memory + durable KV (etcd/Postgres)	For dedup across alert manager replicas

Failure modes

Cardinality explosion

A single deployment adds a high-cardinality label. Series count grows from 5M to 5B. Memory on ingesters exhausts; ingesters OOM and restart; on restart, WAL replay takes hours because there are 5B series to reconstruct. Cascading failure.

The fix is hard per-metric cardinality limits enforced at ingestion, plus dashboards showing the top-N metrics by series count so developers catch explosions before they reach the ingesters.

Scrape gaps

The scraper loses connectivity to a target. Samples stop for that target. Once the look-back window contains no samples, rate(metric[5m]) returns an empty result rather than 0. A threshold rule like rate(errors[5m]) > 0 cannot fire because there is no value to compare, so a dead target looks exactly like a healthy one: a false green. Silence is the most dangerous failure mode in monitoring. Health checks written the other way round have the same problem: an expression like rate(errors[5m]) < 0.01 simply stops returning results, and nothing notices unless something alerts on the absence.

The defense is an explicit up metric: the scraper sets up=1 on a successful scrape and up=0 on failure. Alert on up == 0 to detect dead targets. Also use absent(metric) in alert rules to catch missing time series explicitly.
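The difference between "zero" and "no data" can be modelled in a few lines. This is a deliberately simplified stand-in for a rate function (no counter-reset handling, no extrapolation), and the three helper names are illustrative.

```python
from typing import Optional

Sample = tuple[float, float]  # (timestamp in seconds, counter value)


def rate(samples: list[Sample], window_start: float,
         window_end: float) -> Optional[float]:
    """Per-second increase over the window, or None when the window
    holds fewer than two samples (the 'no data' case)."""
    pts = [(t, v) for t, v in samples if window_start < t <= window_end]
    if len(pts) < 2:
        return None
    (t0, v0), (t1, v1) = pts[0], pts[-1]
    return (v1 - v0) / (t1 - t0)


def threshold_alert(value: Optional[float], limit: float) -> bool:
    # With no value there is nothing to compare, so the rule stays quiet.
    return value is not None and value > limit


def absent_alert(value: Optional[float]) -> bool:
    return value is None


# One error every 15 s until t=300, then the target goes silent.
errors = [(float(t), t / 15) for t in range(0, 301, 15)]

for now in (300.0, 900.0):
    r = rate(errors, now - 300, now)
    print(now, r, threshold_alert(r, 0.01), absent_alert(r))
```

At t=900 the threshold rule is quiet even though the target was erroring right up to the moment it vanished; only the absence check notices.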

Ingester failure (WAL replay)

An ingester crashes. On restart it replays the WAL from the last checkpoint to rebuild the in-memory head block. With large WAL segments — say, 2 hours of data across 1M series — replay can take minutes. During replay the ingester is not available for writes or queries.

WAL segment size limits reduce the maximum replay window. Replication (two ingesters handling the same series fingerprints) means the replica covers reads while the primary replays. Checkpoint the WAL more frequently to shrink recovery time further.

Query DoS

A dashboard engineer writes rate(some_metric[30d]), a 30-day window over raw 15s samples. That is 172,800 points per series; across roughly 10K matching series it touches about 1.7 billion data points. It saturates the query layer and starves other users.

Mitigate with a query timeout (typically 2 minutes), a max result set size in bytes, query cost estimation before execution, and per-tenant query concurrency limits.
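Cost estimation before execution can be as simple as multiplying matched series by points per series. In this sketch the 100M-point budget and both function names are assumptions chosen for illustration; the series count would come from the label index before any chunk is read.

```python
def estimated_points(num_series: int, range_seconds: int,
                     scrape_interval_s: int = 15) -> int:
    """Upper bound on raw samples a range query has to read."""
    return num_series * (range_seconds // scrape_interval_s)


def admit_query(num_series: int, range_seconds: int,
                max_points: int = 100_000_000) -> bool:
    """Reject queries whose estimated cost exceeds the per-query budget."""
    return estimated_points(num_series, range_seconds) <= max_points


DAY = 24 * 3600
print(estimated_points(10_000, DAY), admit_query(10_000, DAY))
print(estimated_points(10_000, 30 * DAY), admit_query(10_000, 30 * DAY))
```

A rejected query does not have to fail outright: the frontend could instead route it to a downsampled tier, where the same 30-day range costs a small fraction of the points.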

Hot tenant (multi-tenant deployments)

A single tenant ingests 10× their share due to a runaway agent or a new high-cardinality metric. They saturate their ingester shard, affecting neighbors. Per-tenant ingestion rate limits, tenant-to-shard isolation, and autoscaling ingesters all help here.

Things to discuss in an interview

Pull vs push and why you'd use both: pull for long-running services with service discovery, push gateway for batch jobs.
Cardinality as the #1 failure mode: every label is a multiplier on series count; enforce hard limits at ingest.
Why TSDB uses chunks not rows: sequential I/O access pattern; enables compression; aligns with time-range query patterns.
How compression works: delta-of-delta for timestamps (regular intervals → tiny second-order deltas); XOR for floats (similar consecutive values → long runs of leading and trailing zero bits in the XOR).
Downsampling trade-off: you lose intra-period precision; always query recent data from raw, old data from rollups.
Alert evaluation is a read path, not a write path: run it as a separate periodic job so it doesn't block ingestion; dedup in the alert manager.

Things you should now be able to answer

Why does adding user_id as a label break a monitoring system?
How does delta-of-delta encoding achieve near-1 bit per timestamp for regular scrape intervals, and what happens when a scrape fires late?
Why do monitoring systems flush the head block to an immutable on-disk block every 2 hours rather than writing to a single large file?
How does scrape silence differ from a metric dropping to zero — and how do you alert on each?
What does a query frontend do that a querier cannot?
Why does downsampled storage keep min/max/sum/count rather than a single average?

Frequently asked questions

▸What is cardinality in a metrics system, and why is it the number-one failure mode?

Cardinality is the count of unique label sets for a metric: every distinct combination of metric name and label key=value pairs is a separate time series. It is the top failure mode because a single high-cardinality label like user_id can multiply 5 million series into 5 billion overnight, exhausting ingester memory, causing OOM crashes, and making WAL replay take hours. The fix is hard per-metric limits enforced at ingestion time before the damage propagates.

▸How does Gorilla-style compression achieve roughly 1.5 bytes per sample from 16 bytes raw?

It combines two tricks: delta-of-delta encoding on timestamps (regular 15-second scrape intervals produce second-order deltas of nearly zero, so about 96% of timestamps compress to a single bit) and XOR encoding on float64 values (adjacent samples that are close in value produce XOR results with long leading and trailing zero runs, encoding only the meaningful changed bits). The Gorilla paper reports 1.37 bytes per point on Facebook workloads; mixed production workloads average around 1.5 bytes, a 10-12x reduction from 16 bytes raw.

▸When should a metrics system use pull scraping versus push, and can you use both?

Pull (Prometheus-style) is preferred for long-running services because the monitoring server owns scrape timing, timestamps are aligned, and a failed scrape immediately surfaces as an up=0 alert. Push is necessary for ephemeral batch jobs that may finish before the next scrape fires, and works across NAT because the agent initiates the connection. Production systems use both: pull via service-discovery-driven scraping for services, a push gateway for short-lived processes.

▸What are the three retention tiers and how much storage does each consume at 5 million active series?

Raw 15-second data is kept for 15 days at roughly 645 GB. Five-minute rollups are kept for 90 days at 600-800 GB (far fewer points but each stores min, max, sum, and count). One-hour rollups are kept for 2 years at around 500 GB total. Each downsampled point stores four values rather than one so that averages, minimums, and maximums can be reconstructed from the coarser resolution; percentiles come from histogram bucket series, which are downsampled the same way.

▸Why does scrape silence produce a worse alert outcome than a metric dropping to zero?

When a target goes silent, rate() expressions return an empty result rather than zero, so threshold rules like rate(errors[5m]) > 0 never fire because there is no value to compare against. The alert list stays empty and the dead target looks healthy, a false green signal. The defense is an explicit up metric set to 0 on scrape failure, plus absent() in alert rules to catch missing series explicitly.
