MODULE 12 / 12 · crash course

◆◆Intermediate

Observability — Logs, Metrics, Tracing

The three pillars of observability, structured logging, metric cardinality, distributed tracing, and how to find the needle when production catches fire at 3am.

13 min read · 2026-01-26 · Ironclad Academy

#observability #monitoring #operations #fundamentals

You can build the cleanest system in the world. The first time it misbehaves at 3am with a vague Slack alert, you'll wish you'd spent more time thinking about how you'd see what's happening. That seeing is observability.

Observability is the property of a system that lets you ask arbitrary questions about its internal state from the outside, without having to push new code or attach a debugger. If you can only answer the questions you anticipated when you wrote the dashboards, you have monitoring, not observability.

The three pillars

flowchart TD
    O[Observability] --> L[Logs<br/>discrete events with context]
    O --> M[Metrics<br/>numbers aggregated over time]
    O --> T[Traces<br/>request paths across services]
    L --> LE[ELK, Loki, CloudWatch]
    M --> ME[Prometheus, Datadog, CloudWatch Metrics]
    T --> TE[Jaeger, Tempo, Honeycomb, X-Ray]
    style L fill:#ff6b1a,color:#0a0a0f
    style M fill:#0e7490,color:#fff
    style T fill:#15803d,color:#fff

Each pillar answers a different question:

Pillar	Best for	Granularity
Logs	"What happened in this request?"	Per-event
Metrics	"Is the system healthy overall?"	Aggregated
Traces	"Where did this slow request spend its time?"	Per-request, per-span

In modern stacks these merge — you can derive metrics from logs (Loki), or attach trace IDs to every log line (correlation). The distinction is becoming blurry, but the mental model still helps.

Logs

A log is a record of an event. The right way to write logs in 2026:

Write structured logs, not strings

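# Wrong: the facts are baked into an opaque string
log.info("User 42 placed order 7 for $50")

# Right: a stable event name plus key-value fields
# (keyword-argument style, as in structlog; stdlib logging would pass these via extra={...})
log.info("order_placed", user_id=42, order_id=7, amount_cents=5000)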

Structured logs are emitted as JSON:

{
  "ts": "2026-01-26T14:23:01.123Z",
  "level": "info",
  "msg": "order_placed",
  "service": "checkout",
  "trace_id": "01HXM2A7ZQ8...",
  "user_id": 42,
  "order_id": 7,
  "amount_cents": 5000,
  "currency": "usd"
}

The difference is searchability. A string log forces you to grep against an opaque blob. A JSON log lets you filter on service:checkout AND level:error AND user_id:42, aggregate sum(amount_cents) by service, and join across services by trace_id. Once your fleet grows past a few services, string logs become archaeology.
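
If your logger does not emit JSON out of the box, a formatter can do it. Here is a minimal sketch using only Python's standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the `fields` key are names made up for this example, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # static field stamped on every line

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": self.service,
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout")
log.info("order_placed", extra={"fields": {"user_id": 42, "order_id": 7, "amount_cents": 5000}})
```

Keeping the event name (`order_placed`) constant and moving everything variable into fields is what makes the line filterable later.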

Log levels

Level	When
`debug`	Verbose diagnostic output, rarely on in production
`info`	Normal operations: started, finished, sent, received
`warn`	Unusual but recoverable — fallback used, deprecation hit
`error`	Failed operations a human should look at
`fatal`	Process exiting because of unrecoverable state

A lot of production noise comes from info being too chatty and warn being treated as error. Be deliberate.

What to log — and what not to

For every request, you want four things: the entry (method, path, user/tenant ID, request ID, key params), the exit (status code, latency, bytes sent), any outbound calls (target, latency, status), and caught exceptions with stack trace and the inputs that caused them. State transitions are worth a log line too — "order state changed; payment authorized."

Keep secrets out entirely. Tokens, passwords, full credit card numbers, API keys — logs are widely accessible, and any logged secret should be treated as already leaked. PII (email, full name, IP) belongs in logs only when there's a clear, regulated reason, not by default. And never log inside tight loops; 1M log lines per request will wreck your pipeline cost fast. Use a redaction layer in your logger so secrets are stripped automatically before anything hits the wire.
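
A redaction layer can be as small as one function that every log call passes its fields through. The sketch below is illustrative only: the key list and the card-number pattern are assumptions you would tune for your own data, and a real layer would need broader coverage.

```python
import re

# Illustrative key names; extend for your own payloads.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "card_number"}
# Rough heuristic: any bare run of 13 to 19 digits is treated as a card number.
CARD_PATTERN = re.compile(r"\b\d{13,19}\b")


def redact(fields: dict) -> dict:
    """Return a copy of `fields` that is safe to log."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested structures
        elif isinstance(value, str):
            clean[key] = CARD_PATTERN.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean
```

Calling `redact` inside the logger, rather than at each call site, means a forgotten call cannot leak a secret.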

Log retention

Tier	Retention	Cost
Hot (searchable)	7–14 days	Highest
Warm (indexed but slow)	30–90 days	Medium
Cold (S3 / Glacier)	1+ years	Cheapest

A common pattern: hot in Elasticsearch / Datadog, warm in S3 queried through Athena, cold in Glacier for compliance. As a rule of thumb, each tier is roughly 10× cheaper than the one above.

Metrics

A metric is a named numeric value with labels (dimensions), sampled at intervals. Four canonical types:

flowchart TD
    M[Metric types] --> C[Counter<br/>monotonically increasing]
    M --> G[Gauge<br/>can go up or down]
    M --> H[Histogram<br/>distribution of values]
    M --> S[Summary<br/>pre-computed quantiles]
    C --> CEX["http_requests_total<br/>orders_created_total"]
    G --> GEX["queue_depth<br/>memory_bytes_used<br/>active_connections"]
    H --> HEX["http_request_duration_seconds<br/>response_size_bytes"]
    style C fill:#ff6b1a,color:#0a0a0f
    style G fill:#0e7490,color:#fff
    style H fill:#15803d,color:#fff

The four golden signals

Google's SRE book canonized these as the metrics that matter for any user-facing service:

Signal	What	Why
Latency	Time to serve a request	Slow → bad UX
Traffic	Requests per second	Unexpected drops/spikes signal trouble
Errors	Failed requests	Direct impact on users
Saturation	How full your resources are (CPU, memory, queues)	Predicts impending failure

If you instrument nothing else, instrument these for every service.

USE method (for resources)

For each resource (CPU, disk, network, pool):

Utilization — how busy is it?
Saturation — is there queueing/waiting?
Errors — error events?

CPU at 80% is fine; at 80% with queue depth growing, you're about to fall over.

Cardinality: the metric trap

Every label combination is a separate time series. A metric with 10 services × 5 status codes × 100 endpoints × 1000 customer IDs = 5 million series. Each series costs storage and query time.

flowchart LR
    A[Low cardinality:<br/>service, status, endpoint] --> OK[Healthy: ~thousands of series]
    B[High cardinality:<br/>+ user_id, request_id, trace_id] --> BAD[Cost explosion]
    style OK fill:#15803d,color:#fff
    style BAD fill:#ff2e88,color:#fff

Never label a metric with anything that has unbounded values — user IDs, request IDs, IPs, URLs with query parameters. If you need that level of detail, use traces or logs, which are built for high cardinality.
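
The arithmetic is worth running before you add a label. A small sketch, with the label names and counts taken from the example above:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


low = {"service": 10, "status": 5, "endpoint": 100}
high = {**low, "customer_id": 1000}

print(series_count(low))   # 5000
print(series_count(high))  # 5000000
```

One extra label turned thousands of series into millions, which is why a single innocent-looking change can dominate the bill.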

Counters and rates

A counter only goes up: requests_total. The interesting quantity is its derivative:

rate(http_requests_total[5m])

That gives you requests/second over the last 5 minutes. Counters are robust against process restarts (Prometheus handles the reset detection); always prefer counters to gauges for "things that happen."
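
To see why restarts do not break the math, here is a simplified sketch of a rate calculation with reset detection. It is not Prometheus's implementation (the real `rate()` also extrapolates to the edges of the window); the function names are just for illustration.

```python
def increase(samples: list[tuple[float, float]]) -> float:
    """Total increase across (timestamp, value) samples, tolerating counter resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went down, so the process restarted from zero.
            # Everything counted since the restart is new.
            total += curr
    return total


def rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase over the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else 0.0
```

A gauge offers no such recovery: if the process restarts between scrapes, whatever happened in between is simply gone.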

Histograms for latency

histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)

This gives you p99 latency over the last 5 minutes. The histogram has buckets (0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s, 1s, 2.5s, 5s, 10s by default); buckets are cumulative, so each request increments every bucket whose upper bound is at least its duration. From the bucket counts you can estimate any quantile, with accuracy limited by the bucket widths.
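
The estimate itself is just linear interpolation inside the bucket that contains the target rank. A simplified sketch of that idea follows, assuming cumulative buckets sorted by upper bound and ending with a +Inf bucket; it mirrors the general approach, not Prometheus's exact code.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Past the last finite bucket (or an empty one): the lower edge
                # is the best available guess.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]`, the p99 lands in the 0.25s to 0.5s bucket and comes out near 0.417s. The true value could be anywhere in that bucket, which is why bucket boundaries should sit close to the latencies you care about.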

The most important metric is p99 latency, not average. Averages hide tail latency. A service with 95ms average latency might have 500ms p99 and 5s p99.9 — and those tail users are the ones telling friends not to use you.

flowchart LR
    A[p50: 50ms] --> B[p95: 200ms]
    B --> C[p99: 500ms]
    C --> D[p99.9: 5s]
    Note["Look at all of them.<br/>Never just the average."]
    style A fill:#15803d,color:#fff
    style C fill:#ff6b1a,color:#0a0a0f
    style D fill:#ff2e88,color:#fff

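For consumer-facing services, p99 is the tail most users will actually feel. For high-frequency callers (mobile apps making 100 requests per session), p99.9 starts to matter too: roughly one session in ten contains a p99.9-slow request, so every regular user hits it sooner or later.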
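
How often a session meets the tail depends on how many requests it makes. A rough sketch, assuming requests are independent (real traffic only approximates this):

```python
def chance_of_hitting_tail(percentile: float, requests: int) -> float:
    """Probability that at least one of `requests` is slower than `percentile`."""
    p_slow = 1 - percentile / 100
    return 1 - (1 - p_slow) ** requests


print(f"{chance_of_hitting_tail(99, 100):.0%}")    # 63%
print(f"{chance_of_hitting_tail(99.9, 100):.0%}")  # 10%
```

Over 100 requests, most sessions see at least one p99-slow response and about one in ten sees a p99.9-slow one. Across repeated sessions, nearly every regular user does.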

RED method (for services)

For each service:

Rate — requests per second.
Errors — failures per second (or error rate %).
Duration — latency distribution.

A simple "RED dashboard per service" gives 80% of monitoring value with 20% of the work.
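
To make the three numbers concrete, here is a sketch that computes them from a window of raw request records. In practice your metrics backend does this from counters and histograms; the record shape (`status`, `duration_ms`) and the choice to count only 5xx as errors are assumptions for the example.

```python
import math


def red_summary(requests: list[dict], window_seconds: float) -> dict:
    """Rate, Errors, Duration for one service over one time window."""
    durations = sorted(r["duration_ms"] for r in requests)

    def percentile(p: float) -> float:
        # Nearest-rank percentile over the raw durations.
        index = max(0, math.ceil(p * len(durations) / 100) - 1)
        return durations[index]

    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "p50_ms": percentile(50) if durations else None,
        "p99_ms": percentile(99) if durations else None,
    }
```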

Traces

A trace follows one request through every service it touches. Each unit of work is a span; spans nest into a tree.

gantt
    title One request: total 250 ms
    dateFormat X
    axisFormat %L
    section API
    API handler           :a1, 0, 250
    section Auth
    auth.validate          :a2, 5, 30
    section DB
    db.users.get           :a3, 35, 60
    section Cache
    cache.timeline         :a4, 60, 80
    section External
    fanout.publish         :a5, 200, 250

Trace anatomy:

trace_id: same across the whole request.
span_id: unique per span.
parent_span_id: builds the tree.
start time, duration, attributes (HTTP method, target service, error info).
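
Those fields are enough to rebuild the tree. A minimal sketch, with a hypothetical `Span` dataclass standing in for whatever your tracing backend stores:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int
    attributes: dict = field(default_factory=dict)


def render_tree(spans: list[Span]) -> list[str]:
    """Return indented 'name (duration)' lines, parents before children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines
```

Feeding it the spans from the chart above yields the API handler at the root with auth, DB, cache, and fanout indented beneath it, which is the waterfall view a tracing UI draws.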

Why traces beat logs for slow-request debugging

Logs tell you a request happened in service A and a request happened in service B and a request happened in service C. They don't tell you the requests in B and C were caused by the request in A.

A trace makes the causal structure visible: "A's request triggered 3 calls to B, which each triggered 2 to C; this one took 500ms because the second call to B was slow."

flowchart LR
    REQ[Request] --> A[Service A: 50ms]
    A --> B1[B: cache check 5ms]
    A --> B2[B: db read 80ms]
    A --> C[C: external call 410ms ← slow!]
    style C fill:#ff2e88,color:#fff

Without distributed tracing, finding C as the slow part takes hours of cross-team grep. With it: one click on the slow span.

Trace propagation

The trace_id has to flow with the request. The W3C standard (traceparent header):

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             version (00) - trace_id (32 hex chars) - parent-id (16 hex chars) - flags (01 = sampled)

Tracing libraries such as OpenTelemetry, Datadog APM, and the Jaeger clients inject and read this header automatically for HTTP, gRPC, Kafka, and other transports.
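
You rarely parse this header by hand, but seeing it done makes the format less mysterious. A sketch for the version-00 layout only; `TraceContext`, `parse_traceparent`, and `child_traceparent` are names invented for this example.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext, new_span_id: str) -> str:
    """Header for an outbound call: same trace_id, this service's span as parent."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"
```

The trace_id never changes along the way; each hop only swaps in its own span ID as the parent for the next call.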

Sampling

Tracing every request is expensive. Most stacks sample.

Strategy	Description
Head sampling	Decide at the start of the request (e.g. 1%) — lose information about rare events
Tail sampling	Buffer all spans, decide after request finishes (keep all errors, slow requests, plus 1% of normal)
Adaptive sampling	Sample more from low-volume endpoints, less from high-volume

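Tail sampling keeps the traces you actually care about, but the collector has to buffer every span until the request finishes, which costs memory and infrastructure. Honeycomb and Datadog APM do this; the OpenTelemetry Collector can too.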
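
The difference between the two decisions fits in a few lines. This is a sketch of the decision logic only, not of any vendor's sampler; the span dictionaries, the 500 ms threshold, and the hash-based head sampler are assumptions for illustration.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide up front, deterministically, so every service agrees on a given trace."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 2**32
    return bucket < rate


def tail_sample(spans: list[dict], slow_ms: float = 500, baseline_rate: float = 0.01) -> bool:
    """Decide once the trace is complete: keep errors and slow traces, sample the rest."""
    if not spans:
        return False
    if any(span.get("error") for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    return head_sample(spans[0]["trace_id"], baseline_rate)
```

Head sampling needs only the trace_id, so it costs nothing to run at the edge. Tail sampling needs every span of the trace in one place first, and that buffering is where the memory goes.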

Correlation: the killer feature

When all three pillars share a trace_id, you can pivot freely:

flowchart LR
    A[Alert: p99 latency spike] --> B[Find slow trace]
    B --> C[Look at logs for that trace_id]
    C --> D[See exception with stack]
    D --> E[Fix]
    style A fill:#ff2e88,color:#fff
    style E fill:#15803d,color:#fff

Without correlation, an alert triggers a 30-minute hunt across three tools. With correlation, the trace_id is the same string in your alert, your traces UI, and your log search — and you find the exception in 60 seconds.

Make every log line carry the trace_id. It's the single most valuable observability investment you'll ever make.
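
One way to do that without touching every log call is a logging filter that reads the trace ID from request-scoped context. A sketch using the standard library; the `current_trace_id` variable and `handle_request` function are hypothetical, and OTel's logging integrations can do the equivalent for you.

```python
import contextvars
import logging

# Holds the trace ID for whatever request the current task or thread is handling.
current_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamp every record with the current trace ID so formatters can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))

log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)


def handle_request(trace_id: str) -> None:
    # In a real service the ID would come from the incoming traceparent header.
    token = current_trace_id.set(trace_id)
    try:
        log.info("order_placed")
    finally:
        current_trace_id.reset(token)
```

Because the ID lives in a context variable, code deep in the call stack logs it without anyone passing it down as an argument.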

How a real incident plays out

To make the three pillars concrete, here's the flow from alert to fix:

sequenceDiagram
    participant Alert as Alerting System
    participant ONCALL as On-call engineer
    participant Dash as Metrics dashboard
    participant Trace as Tracing UI
    participant Logs as Log search
    Alert->>ONCALL: p99 latency > 500ms for checkout
    ONCALL->>Dash: Open RED dashboard
    Dash-->>ONCALL: Error rate normal, traffic normal, duration spiked
    ONCALL->>Trace: Find a slow trace from last 5 min
    Trace-->>ONCALL: "checkout → payment-svc: 450ms on db.charge"
    ONCALL->>Logs: Search trace_id from slow span
    Logs-->>ONCALL: "DB pool exhausted — waiting for connection"
    ONCALL->>ONCALL: Scale up DB pool, deploy fix

Notice the path: metrics told you something's wrong, the trace told you where, and logs told you why. Each pillar does one job, and the trace_id is the thread connecting them.

Alerting

Metrics + thresholds = alerts. The art is in what to alert on.

Page on symptoms, not causes

flowchart LR
    A["Symptom alert:<br/>p99 latency > 500ms"] -->|"page!"| ONCALL[On-call human]
    B["Cause alert:<br/>CPU > 80%"] -.-> X[FYI dashboard]
    style A fill:#ff6b1a,color:#0a0a0f
    style B fill:#6b6b85,color:#0a0a0f

Page on the things that affect users. High CPU might be fine if latency is OK — it's not an incident until users feel it.

SLO-based alerting

Better than threshold alerts: alert on error budget burn rate.

If your SLO is 99.9% (43 min downtime/month), and you're currently consuming budget at 14× normal, you'll exhaust it in ~2 days. Alert on that, not on the raw error rate.

# burn rate over last hour
(1 - sli_ratio_1h) / (1 - SLO)

A burn rate of 1 means you'll exactly meet your SLO; 14 means you're going to blow it in days; 100 means hours.
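
The numbers above fall out of two one-line functions. A worked sketch, where the 98.6% success ratio is a made-up reading chosen to produce the 14× case:

```python
def burn_rate(sli_ratio: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (1 - sli_ratio) / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """How long a full budget lasts if the current burn rate holds."""
    return window_days / rate


slo = 0.999
sli_last_hour = 0.986  # hypothetical: 98.6% of requests succeeded in the last hour

rate = burn_rate(sli_last_hour, slo)
print(round(rate, 1))                        # 14.0
print(round(days_until_exhausted(rate), 1))  # 2.1
```

The same formula at a burn rate of 100 gives about 7 hours, which is why high burn rates page immediately and low ones can wait for business hours.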

Alert fatigue

Pages that fire every night without action drain your team. Every alert should have a clear runbook — "if this fires, do these steps." Alerts that fire and require no action should be deleted or rewritten. A monthly "alert review" helps: which alerts fired, which actually mattered. Bad alerts are worse than no alerts; people learn to ignore them, and then the real ones get missed.

Dashboards

Each service deserves a small set of dashboards:

The 4 golden signals — latency, traffic, errors, saturation.
Per-dependency — calls to each downstream, latency, error rate.
Business metrics — orders/sec, revenue/min, signups/hour.
SLO budget — are we tracking?

Avoid the temptation of the "100-graph dashboard." A good dashboard fits on one screen and shows you a problem in 10 seconds.

OpenTelemetry: the unification

A few years ago, every vendor had a proprietary agent. OpenTelemetry (OTel) is now the consensus standard:

OTel SDK in your code emits spans, metrics, logs.
OTel Collector receives, processes (sampling, batching), and forwards to any backend.
Backend can be Jaeger, Tempo, Honeycomb, Datadog, New Relic, anything.

flowchart LR
    A[Your service<br/>OTel SDK] --> COL[OTel Collector]
    B[Other service<br/>OTel SDK] --> COL
    COL --> J[Jaeger]
    COL --> P[Prometheus]
    COL --> D[Datadog]
    style COL fill:#ff6b1a,color:#0a0a0f

The benefit: you instrument once, change backends without rewriting code.

Logging at scale

Some numbers to set expectations:

A medium-traffic service (1,000 RPS) at 5 log lines per request = 5,000 log lines/sec (the "5 lines" is an estimate; your service may vary).
At 200 bytes/line, that is 1 MB/sec ≈ 86 GB/day ≈ 31 TB/year.
At Datadog's log ingestion pricing of ~$0.10/GB, that's roughly $3,000/year in ingestion fees alone for one service — before indexing costs ($1.70/million events, charged separately and typically the larger bill).

Multiply by 50 services in your fleet and logging is a real line item. The levers: sample debug logs at 1% and keep all warns/errors; tier logs to S3 after 7 days; move high-cardinality detail (request payloads) to traces instead of logs; and aggregate before logging where you can ("processed 1000 events in batch" beats 1000 individual lines).
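
It is worth plugging your own numbers into the same arithmetic. A small sketch using the figures from this section; the price is the approximate ingestion rate quoted above, not a current quote.

```python
def log_volume(rps: float, lines_per_request: float, bytes_per_line: float) -> dict:
    """Estimate log volume for one service."""
    lines_per_sec = rps * lines_per_request
    gb_per_day = lines_per_sec * bytes_per_line * 86_400 / 1e9
    return {
        "lines_per_sec": lines_per_sec,
        "gb_per_day": gb_per_day,
        "tb_per_year": gb_per_day * 365 / 1000,
    }


volume = log_volume(rps=1_000, lines_per_request=5, bytes_per_line=200)
ingest_cost_per_year = volume["gb_per_day"] * 365 * 0.10  # ~$0.10/GB ingestion

print(volume)                       # about 86.4 GB/day and 31.5 TB/year
print(round(ingest_cost_per_year))  # about 3154 dollars
```

Halving lines per request or bytes per line halves every number downstream, which is why the cheapest lever is usually logging less.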

What good observability looks like in practice

For a typical web service:

Logs: structured JSON, every line carries service, trace_id, user_id. Sent to a log aggregator (Loki, ES, Datadog Logs).
Metrics: Prometheus / OTel scrapes /metrics. Dashboards built on RED + the 4 golden signals.
Traces: OTel SDK auto-instruments HTTP and DB calls. Tail-sampled at the collector. ~1% kept normally, 100% kept for errors.
Alerts: SLO-based, paging on symptom, with runbooks linked.
Pivot story: alert fires → click trace ID → see waterfall → click "view related logs" → see exception.

Set this up before your service has its first incident, and the first incident becomes a 10-minute fix instead of a postmortem.

Things you should now be able to answer

Why are structured logs (JSON) better than string-formatted logs?
A new metric you added causes your bill to triple. What likely happened?
Average latency is 50ms and the dashboard looks fine. Why might users still be unhappy?
A request is slow. You have logs from three services. How do traces help that logs don't?
Why is "alert on CPU > 80%" worse than "alert on p99 latency > 500ms"?
Your team gets paged 30 times a week and ignores most. What's the cheapest fix?

Frequently asked questions

What is the difference between observability and monitoring?

Monitoring lets you answer questions you anticipated when you wrote the dashboards. Observability lets you ask arbitrary questions about a system's internal state from the outside, without pushing new code or attaching a debugger. If your tooling can only answer pre-planned questions, you have monitoring, not observability.

Why should you never use a user ID or request ID as a metric label?

Every unique label combination creates a separate time series. A metric with 10 services, 5 status codes, 100 endpoints, and 1,000 customer IDs produces 5 million series, each costing storage and query time. High-cardinality identifiers like user IDs, request IDs, and IPs cause cost explosions; use traces or logs for that level of detail instead.

Why is p99 latency more important than average latency?

Averages hide tail latency. A service can show 95ms average latency while its p99 sits at 500ms and its p99.9 at 5 seconds. For high-frequency callers such as mobile apps making 100 requests per session, p99.9 matters too, because every user will eventually hit it.

What is tail sampling in distributed tracing, and how does it differ from head sampling?

Head sampling decides at the start of a request whether to record it, typically keeping a fixed percentage like 1%, which means rare events and errors are discarded at the same rate as normal traffic. Tail sampling buffers all spans and decides after the request finishes, keeping 100% of errors and slow requests plus a small fraction of normal traffic. Tail sampling is better but costs more memory and infrastructure; Honeycomb, Datadog APM, and the OpenTelemetry Collector all support it.

What does SLO-based alerting on burn rate mean, and why is it better than raw threshold alerts?

Instead of alerting when an error rate crosses a fixed threshold, you alert when you are consuming your error budget faster than sustainable. For a 99.9% SLO (43 minutes of downtime per month), a burn rate of 14 means you will exhaust that budget in roughly 2 days. Alerting on burn rate catches problems proportional to their actual user impact, not arbitrary metric thresholds.
