Design a Centralized Log Aggregation System (ELK / Splunk)
Collect, store, and search logs from thousands of services. Collection agents, a buffered ingestion pipeline, time-based inverted indices, hot-warm-cold tiers, and cost control.
The problem
Netflix runs thousands of microservices. On any given afternoon, tens of millions of log lines per second pour out of containers across tens of thousands of hosts. Buried in that stream is the one line that matters: one slow upstream call, one unexpected null pointer, one rejected payment. The only way an engineer finds that needle is if every log line from every service has already been collected, structured, and indexed somewhere central, ready for a sub-second query at 2 AM during an outage.
That is what a log aggregation system does. It replaces the old workflow — SSH into a server, grep through /var/log/app.log — with a unified pipeline: lightweight agents tail log files on every host, ship the events to a durable buffer, and a fleet of indexers writes them into a full-text search engine. Products like the ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, and Datadog are all industrial implementations of this pattern.
The core engineering tension is that logs are a write-dominated, high-cardinality, unbounded stream. Unlike metrics — which are pre-aggregated numbers on known label sets — every log line is a unique text event that may carry arbitrary JSON fields. You must index it fast enough to keep up with the firehose, make it searchable within seconds, and store months of history without the bill overwhelming the engineering budget. Backpressure during traffic spikes and cost at petabyte scale are the two problems that will break a naive design.
A second tension is the gap between what engineers want ("show me all errors for request ID abc123 in the last hour") and what the storage system can deliver cheaply. Full-text search across terabytes of cold history is expensive; scanning is even worse. The architecture that follows — agents, Kafka, Elasticsearch with time-based daily indices, and hot/warm/cold tiering — exists specifically to collapse that gap.
Functional requirements
- Collect logs from all hosts and containers without changing application code.
- Accept both structured logs (JSON with known fields) and unstructured log lines.
- Parse, enrich (add hostname, region, severity), and index every event.
- Expose full-text search and structured-field filter queries via an API and a UI.
- Alert when a log pattern crosses a threshold (e.g., ERROR rate > 1 % for 5 min).
- Correlate log events with distributed traces via a shared trace_id field.
- Support multiple teams in a shared cluster with access control per log stream.
Non-functional requirements
- Durability: no log line dropped during an indexer outage or traffic spike.
- Ingestion latency: logs visible for search within 5–15 seconds of emission (near-real-time, not real-time).
- Query latency: p95 < 5 s for queries over the last 24 h; < 60 s over 30-day history.
- Retention: 7-day hot, 30-day warm, 90-day cold, 1–2-year frozen — tiered by access frequency.
- Availability: 99.9 % for the query path; ingestion can tolerate brief queue build-up.
- Cost-aware: the platform must provide controls to cap storage spend; logs scale with traffic, not headcount.
Capacity estimation
Large fleet (10,000 hosts, sustained peak):
| Dimension | Estimate | How we got there |
|---|---|---|
| Fleet size | 10,000 hosts | — |
| Log rate (peak) | 5,000 lines/host/sec (typical ~500) | Sustained peak assumption |
| Total ingest rate | 50 M lines/sec | 10,000 × 5,000 |
| Avg line size | 500 bytes | JSON with stack traces can be larger; short access logs shorter |
| Raw ingest bandwidth | 25 GB/s | 50 M × 500 B |
| Compression ratio | ~5× | LZ4 / Zstd typical for log data |
| Compressed ingest | 5 GB/s | 25 GB/s ÷ 5 |
| Compressed per day | 432 TB/day | 5 GB/s × 86,400 s |
| 7-day hot tier | ~3 PB | 432 TB × 7 |
| 30-day warm tier | ~10 PB (additional days) | 432 TB × 23 |
Mid-scale (1,000 hosts, 500 lines/host/sec average):
| Dimension | Estimate | How we got there |
|---|---|---|
| Total ingest rate | 500,000 lines/sec | 1,000 × 500 |
| Raw ingest bandwidth | 250 MB/s | 500,000 × 500 B |
| Compressed ingest | 50 MB/s | 250 MB/s ÷ 5 |
| Compressed per day | 4.3 TB/day | 50 MB/s × 86,400 s |
| 7-day hot tier | ~30 TB | Fits a small Elasticsearch cluster (6–10 nodes) |
Both scenarios have the same structure; only the cluster size changes. The math is what matters in an interview — show the multiplication, then say "so we need time-based sharding and tiering."
Takeaway: at 10,000 hosts the uncompressed firehose is 25 GB/s — you need Kafka as a write buffer, daily time-based indices for query pruning, and hot/warm/cold tiering to make the storage bill survivable.
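The multiplication in both tables can be reproduced in a few lines of Python. This is a minimal sketch: `Fleet` and `estimate` are illustrative names, and the 500-byte line size and 5× compression ratio are the same assumptions the tables use, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    hosts: int
    lines_per_host_per_sec: int
    avg_line_bytes: int = 500        # assumption taken from the table
    compression_ratio: float = 5.0   # assumption taken from the table


def estimate(fleet: Fleet, hot_days: int = 7) -> dict:
    """Back-of-envelope ingest and hot-tier sizing for one fleet."""
    lines_per_sec = fleet.hosts * fleet.lines_per_host_per_sec
    raw_bytes_per_sec = lines_per_sec * fleet.avg_line_bytes
    compressed_bytes_per_sec = raw_bytes_per_sec / fleet.compression_ratio
    compressed_per_day = compressed_bytes_per_sec * 86_400
    return {
        "lines_per_sec": lines_per_sec,
        "raw_gb_per_sec": raw_bytes_per_sec / 1e9,
        "compressed_gb_per_sec": compressed_bytes_per_sec / 1e9,
        "compressed_tb_per_day": compressed_per_day / 1e12,
        "hot_tier_tb": compressed_per_day * hot_days / 1e12,
    }


large = estimate(Fleet(hosts=10_000, lines_per_host_per_sec=5_000))
mid = estimate(Fleet(hosts=1_000, lines_per_host_per_sec=500))
print(large)
print(mid)
```

Changing `lines_per_host_per_sec` or `compression_ratio` shows how sensitive the hot-tier size is to each assumption.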
Logs vs. metrics: why they're different systems
This distinction comes up in every observability interview.
| Property | Logs | Metrics |
|---|---|---|
| Data shape | Discrete text events, variable fields | Numeric values (gauge, counter, histogram) with labels |
| Cardinality | Unbounded — any JSON field can appear | Controlled — label sets are predefined |
| Query type | Full-text search + field filter + aggregation | Time series aggregation + alerting |
| Volume | Very high (MB–GB/s raw) | Low (thousands of series) |
| Storage | Inverted index (Lucene/Elasticsearch) | TSDB (Prometheus, InfluxDB, Thanos) |
| Retention | Often tiered, expensive to store long-term | Aggregated; cheap at 15-second granularity |
| Trace link | Ship trace_id as a field | Not directly applicable |
A metrics system gives you "how many?" cheaply and reliably. Logs give you "what happened, to which request, in what context?" That richness is also why logs are so much more expensive to store and query.
Building up to the design
V1: Syslog to a file on a shared NFS mount
Every Unix host has syslog. Redirect it to a shared NFS volume. grep to search.
Zero infrastructure, and it works on day one for three services. The problems arrive fast: at ten services, concurrent writes corrupt the file. At a hundred, grep over terabytes is unusable. And NFS is a single point of failure for every application that tries to write a log.
V2: Centralized log server (rsyslog / syslog-ng)
Run a log aggregation daemon. Applications forward via UDP/TCP syslog; the server writes to local disk. You search with grep or awk.
Centralizing removes the NFS mess. But search is still O(bytes scanned), and because search and ingestion share the same box, one bad query can stall ingestion. The log server itself becomes a single point of failure. And every log line is still a flat string with no structure, so parsing is a later problem.
V3: Add structure and a search index
Require services to emit JSON. Parse, enrich, and write to Elasticsearch. Now a query like:
service:payment AND level:ERROR AND @timestamp:[now-1h TO now]
hits an inverted index and returns in milliseconds.
Near-real-time structured search is a huge leap. The new problem shows up during deploys: a service that starts logging 100× its normal rate will overwhelm the Elasticsearch indexer. When the indexer's queue fills, it rejects _bulk requests with HTTP 429, and the agent has nowhere to put the events. They get dropped.
V4: Add Kafka as a durable buffer
Insert Kafka between agents and Elasticsearch. This is the most important architectural decision in the whole system.
Agent → Kafka topic (replicated) → Logstash / OpenSearch Ingest → ES
The agent publishes and moves on; Kafka durably stores the event. The indexer pulls at its own pace — if it lags, Kafka holds the backlog. An indexer restart or redeployment doesn't lose events; the consumer offset tracks exactly where to resume. You can also add multiple consumer groups: one for indexing, one for real-time alerting, one for archival — all reading the same stream independently.
The remaining pain after this is cost: old data on high-IOPS SSDs is expensive. A month of logs on the same nodes as today's data is unnecessary for almost every query.
V5: Time-based indices + hot/warm/cold tiering
Elasticsearch's Index Lifecycle Management (ILM), and its OpenSearch counterpart Index State Management (ISM), automate this:
- Each day creates a new index: logs-2026.06.01, logs-2026.06.02, …
- After 7 days, the index moves to warm nodes (HDDs, fewer replicas).
- After 30 days, the index is snapshotted to S3 and mounted as a searchable snapshot (cold or frozen tier).
- After the retention window, the index is deleted.
Time-range queries prune index selection automatically: a "last 2 hours" query touches only the current and previous day's index — never the terabytes on warm/cold.
V6: Cost controls, multi-tenancy, alerting
The production system adds sampling and drop rules (noisy DEBUG logs from a batch job are sampled at 1%; health-check endpoints are dropped at the agent), index templates with explicit field mappings to prevent mapping explosion from ad-hoc JSON fields, tenant isolation via separate indices per team or document-level security, and a stream-processing alerting engine that consumes a dedicated Kafka consumer group and triggers PagerDuty when an ERROR rate crosses a threshold — never polling Elasticsearch.
flowchart LR
V1["V1: syslog to NFS<br/>3 services, grep"] --> V2["V2: rsyslog server<br/>centralized, no index"]
V2 --> V3["V3: + Elasticsearch<br/>structured, searchable"]
V3 --> V4["V4: + Kafka buffer<br/>backpressure solved"]
V4 --> V5["V5: + ILM tiering<br/>cost controlled"]
V5 --> V6["V6: + sampling, ACLs, alerts<br/>production-ready"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
Collection agents
Every host runs a lightweight collection agent. Its job is to tail log files (or read from Docker/containerd stdout), do only light parsing and tagging (for example, attaching the hostname), and publish to Kafka. It must be low CPU and low memory because it shares the machine with production services.
Agent options
| Agent | Language | Strengths | Watch out for |
|---|---|---|---|
| Filebeat (Elastic) | Go | Near-zero CPU, simple config, native Elasticsearch output | Limited pipeline transforms; needs Logstash for complex parsing |
| Fluentd | Ruby/C | Huge plugin ecosystem, flexible routing | Higher memory footprint (>60 MB baseline per official docs) |
| Fluent Bit | C | Ultra-lightweight (~450 KB binary, ~5–10 MB RSS in production, zero external dependencies), built-in Kafka output | Smaller plugin ecosystem than Fluentd |
| Vector (Datadog) | Rust | Best-in-class throughput, rich transforms, low memory | Relatively new; fewer community integrations |
For most deployments: Fluent Bit on every node (resource-efficient), with Kafka as the output. Heavy parsing and enrichment happen at the indexer tier, not the agent, to keep agent CPU usage minimal.
Structured vs. unstructured logs
# Unstructured (legacy)
2026-06-01 14:23:11 ERROR Payment service failed for user 42: timeout
# Structured (JSON — preferred)
{"timestamp":"2026-06-01T14:23:11Z","level":"ERROR","service":"payment",
"user_id":42,"error":"timeout","trace_id":"abc123","duration_ms":5001}
Structured logs eliminate the regex-parsing step. The agent forwards the JSON blob directly; the indexer extracts fields without a Grok pattern. This is why every modern service should write JSON to stdout — the logging infrastructure can be a "dumb pipe" rather than a fragile parsing layer.
For legacy unstructured logs, the indexer tier (Logstash / OpenSearch's ingest pipelines) runs Grok patterns:
%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
Grok is expensive at scale — a complex pattern on a hot log stream can consume 20–30 % of an indexer's CPU. Migrate services to structured output wherever possible.
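For readers who have not used Grok, the pattern above is roughly a regular expression with named groups. The sketch below is a Python approximation, not Grok's actual pattern definitions: the timestamp and log-level sub-patterns are simplified assumptions, and `parse_line` is an illustrative helper.

```python
import re

# Simplified stand-ins for TIMESTAMP_ISO8601, LOGLEVEL and GREEDYDATA.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
    r"(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?)"
    r" (?P<level>TRACE|DEBUG|INFO|WARN|WARNING|ERROR|FATAL)"
    r" (?P<message>.*)$"
)


def parse_line(line: str) -> dict:
    """Turn an unstructured log line into fields, or flag it as unparsed."""
    match = LINE_RE.match(line)
    if match is None:
        # Keep the raw text so nothing is lost when the pattern does not fit.
        return {"message": line, "parse_failed": True}
    return match.groupdict()


event = parse_line(
    "2026-06-01 14:23:11 ERROR Payment service failed for user 42: timeout"
)
print(event["level"], "|", event["message"])
```

Every line pays for this regex match, which is where the CPU cost comes from; a service that emits JSON skips the step entirely.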
Agent backpressure and at-least-once delivery
The agent maintains a cursor (file offset, inode) in a state file on local disk. If Kafka is unavailable, the agent stops advancing its cursor and waits. The application continues writing to the log file on disk — the log file itself is the implicit buffer, bounded by available disk space. On reconnect, the agent resumes from the saved offset — no events lost, but delivery may be delayed.
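This gives at-least-once semantics. Duplicates are possible on agent restart, so the indexer pipeline should be idempotent. In Elasticsearch that means assigning each event a deterministic document ID (for example, a hash of host, file, and offset) so a re-sent event overwrites the existing document instead of creating a second one; with auto-generated IDs, duplicates are indexed twice.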
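The cursor-and-retry behaviour can be sketched in Python. This is an illustration of the idea, not how Fluent Bit or Filebeat are implemented: `ship`, `load_offset`, `save_offset`, and `doc_id` are hypothetical helpers, `publish` stands in for a Kafka producer call that returns True once the broker has acknowledged the batch, and inode tracking for log rotation is left out for brevity.

```python
import hashlib
import json
import os
from typing import Callable, Dict, List


def load_offset(state_path: str) -> int:
    """Return the last checkpointed byte offset, or 0 on first run."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)["offset"]


def save_offset(state_path: str, offset: int) -> None:
    """Write the checkpoint via a temp file and rename, so a crash cannot leave it half-written."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, state_path)


def doc_id(host: str, path: str, offset: int) -> str:
    """Deterministic ID: the same line always maps to the same document."""
    return hashlib.sha1(f"{host}|{path}|{offset}".encode("utf-8")).hexdigest()


def ship(log_path: str, state_path: str, host: str,
         publish: Callable[[List[Dict]], bool]) -> int:
    """Publish complete lines past the cursor; advance the cursor only on ack."""
    start = load_offset(state_path)
    with open(log_path, "rb") as f:
        f.seek(start)
        data = f.read()
    end = data.rfind(b"\n") + 1          # ignore a trailing partial line
    docs, pos = [], start
    for raw in data[:end].split(b"\n")[:-1]:
        docs.append({"_id": doc_id(host, log_path, pos),
                     "message": raw.decode("utf-8", errors="replace")})
        pos += len(raw) + 1
    if not docs or not publish(docs):
        return 0                         # nothing new, or no ack: cursor stays put
    save_offset(state_path, start + end)
    return len(docs)
```

If the agent crashes after the broker accepted a batch but before `save_offset` ran, the next run re-sends the same lines. Because each `_id` is derived from host, path, and offset, the re-sent events carry the same IDs and can be deduplicated downstream.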
The Kafka buffer: why it's non-negotiable
Without Kafka, the write path is:
Agent → Elasticsearch _bulk API
Elasticsearch applies backpressure by rejecting _bulk requests with HTTP 429 when its indexing queue fills. The agent must either block (slowing down the application if log writes are synchronous) or drop events. Neither is acceptable.
With Kafka:
Agent → Kafka partition → Indexer consumer
Kafka's retention window (e.g., 24–48 hours) means an indexer can be restarted, upgraded, or replaced without losing a single event. Consumer group offset tracking means an indexer picks up exactly where it left off. Partition count sets the ceiling on parallelism: 100 partitions allow up to 100 indexer consumers in one group to read in parallel. And multiple consumer groups decouple indexing from alerting from archival, since each reads the same stream independently.
sequenceDiagram
participant A as Agent (host 1..N)
participant K as Kafka
participant I as Indexer
participant E as Elasticsearch
participant AL as Alert engine
A->>K: produce log event (ack required)
K-->>A: ack (replicated to 2 brokers)
Note over K: event durable in Kafka for 48 h
I->>K: poll batch (consumer group "indexers")
K-->>I: batch of events
I->>E: _bulk index
E-->>I: OK
AL->>K: poll batch (consumer group "alerts")
K-->>AL: same events, independently
Kafka topic partitioning strategy: partition by service_name (keeps a service's events ordered) or by host_id (uniform distribution). Do not partition by timestamp — that creates a hot partition for the current second.
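The trade-off between the two keys is easy to see with a small simulation. This is a sketch under stated assumptions: `partition_for` uses SHA-256 as a stand-in for the producer's real hash function, and the traffic mix (one noisy service producing 90 % of events across 200 hosts) is invented for illustration.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash of the key, reduced to a partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical traffic: "payment" is noisy, "search" is quiet, 200 hosts.
events = [("payment", f"host-{i % 200}") for i in range(9_000)]
events += [("search", f"host-{i % 200}") for i in range(1_000)]

by_service = Counter(partition_for(service, 8) for service, _ in events)
by_host = Counter(partition_for(host, 8) for _, host in events)

print("keyed by service_name:", dict(by_service))
print("keyed by host_id:     ", dict(by_host))
```

Keying by service sends every event from the noisy service to one partition, which preserves per-service ordering at the cost of a hot partition. Keying by host spreads the same load across all eight.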
Indexing: inverted indices and time-based partitioning
How Elasticsearch / Lucene indexes text
When a log line is indexed, Lucene tokenizes each text field and builds an inverted index: a mapping from each token to the list of document IDs containing it.
"payment timeout error" →
payment: [doc1, doc5, doc9, ...]
timeout: [doc1, doc3, ...]
error: [doc1, doc2, ...]
A full-text query for "payment AND timeout" intersects two posting lists. This is fast even over hundreds of millions of documents because posting lists are sorted and can be intersected in O(x + y) where x and y are the lengths of the two lists, not the total corpus size. Lucene further accelerates this with skip lists and block-based compression (PFOR), so in practice it avoids decompressing blocks that cannot match.
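Structured fields are handled by different Lucene structures. Exact-match filters on keyword fields (e.g., level:ERROR) still use the inverted index, numeric and date ranges (e.g., user_id:42 or a @timestamp range) use BKD trees, and doc_values, a per-field columnar store, serve sorting and aggregations.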
This is why distributed search systems built on Lucene can answer "find all ERROR logs in the last hour" in milliseconds across billions of documents, but "count all errors in the last year" is slower — it must scan more posting list entries.
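A toy version of the posting-list mechanics fits in a few lines of Python. This is a teaching sketch, not Lucene's implementation: it tokenizes on whitespace, keeps posting lists as plain sorted lists, and omits skip lists and compression. The document IDs are chosen to match the example above.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each token to the sorted list of doc IDs that contain it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for doc_id in sorted(docs):                      # ascending IDs keep lists sorted
        for token in set(docs[doc_id].lower().split()):
            index[token].append(doc_id)
    return index


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Two-pointer merge of two sorted posting lists: O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


docs = {
    1: "payment timeout error",
    2: "login error",
    3: "search timeout",
    5: "payment accepted",
    9: "payment refund",
}
index = build_index(docs)
hits = intersect(index["payment"], index["timeout"])
print(hits)  # [1]
```

The loop never looks at documents 2, 5-only tokens aside, that are absent from both lists, which is why the cost tracks the posting-list lengths rather than the corpus size.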
flowchart LR
LOG["Log line: 'payment timeout error'"] --> TOK[Tokenizer]
TOK --> INV[(Inverted index\npayment → doc1,5,9\ntimeout → doc1,3\nerror → doc1,2)]
LOG --> DV[(doc_values\nlevel, user_id\nfast field queries)]
Q["Query: payment AND timeout"] --> INT[Posting list intersection\nO of x plus y]
INV --> INT
INT --> HITS[Matching docs]
style INV fill:#15803d,color:#fff
style DV fill:#0e7490,color:#fff
style INT fill:#ff6b1a,color:#0a0a0f
style HITS fill:#a855f7,color:#fff
Time-based index partitioning
The most important physical design decision in a log system is indexing by time window.
logs-2026.06.02 ← today's index, hot, SSD
logs-2026.06.01 ← yesterday's index, hot, SSD
...
logs-2026.05.26 ← 7 days ago, moving to warm nodes
logs-2026.05.01 ← 30+ days ago, searchable snapshot on S3
This works well for three reasons. First, query pruning: a "last 2 hours" query selects only today's index (plus yesterday's if the window crosses midnight), because Elasticsearch resolves which indices cover the requested time range before scanning any documents. Second, cheap data management: deleting 30-day-old data is DELETE /logs-2026.05.01, which is one index drop with no per-document deletions and no reindexing. Third, tiering alignment: an index is the unit that moves to warm nodes, so all of its shard files relocate together.
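The pruning effect can be mimicked with the daily naming convention alone. The sketch below is an assumption-laden illustration: `indices_for_range` is a hypothetical helper that derives index names from a time range, whereas Elasticsearch makes the equivalent decision internally from index metadata.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def indices_for_range(start: datetime, end: datetime,
                      prefix: str = "logs-") -> List[str]:
    """List the daily indices that can contain events in [start, end]."""
    names = []
    day = start.date()
    while day <= end.date():
        names.append(f"{prefix}{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


now = datetime(2026, 6, 2, 14, 0, tzinfo=timezone.utc)
print(indices_for_range(now - timedelta(hours=2), now))
print(len(indices_for_range(now - timedelta(days=7), now)), "indices for 7 days")
```

A two-hour window resolves to a single index, while a seven-day window resolves to eight, so the amount of data a query can touch is bounded before any shard is searched.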
Index rollover (either by time or by size — whichever comes first):
Rollover condition: age > 1 day OR primary size > 50 GB
Capping index size prevents single-shard hotspots in high-volume services.
Shard sizing
Each Elasticsearch shard is a self-contained Lucene index, which itself consists of multiple immutable segments. Rule of thumb: 20–50 GB per shard for log use cases. Too small → too many shards, high cluster state overhead. Too large → slow segment merges, slow shard recovery after node failure.
Example: 50 TB/day ingest on hot nodes (a mid-to-large fleet after tiering/compression)
50 GB per shard → 1,000 primary shards/day
Replicas: 1 → 2,000 physical shards/day
Over 7 hot days: 14,000 shards in cluster state
At the 432 TB/day compressed rate from the capacity section:
432 TB / 50 GB ≈ 8,600 primary shards/day → 17,200 physical shards/day
Over 7 days: ~120,000 shards — requires multiple clusters or aggressive tiering
At 10,000+ shards, Elasticsearch's master node can struggle. This is why large deployments split into multiple clusters, or use data streams (OpenSearch/Elasticsearch ≥7.9) which consolidate rollover bookkeeping.
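The shard arithmetic above is worth scripting so the inputs can be varied. A minimal sketch, assuming decimal units (1 TB = 1,000 GB) and the 50 GB target shard size from the rule of thumb; `shard_counts` is an illustrative name.

```python
import math
from typing import Tuple


def shard_counts(tb_per_day: float, shard_gb: int = 50,
                 replicas: int = 1, hot_days: int = 7) -> Tuple[int, int, int]:
    """Return (primaries per day, physical shards per day, shards over the hot window)."""
    primaries = math.ceil(tb_per_day * 1000 / shard_gb)
    physical = primaries * (1 + replicas)
    return primaries, physical, physical * hot_days


print(shard_counts(50))    # (1000, 2000, 14000)
print(shard_counts(432))   # (8640, 17280, 120960)
```

The second result is the unrounded form of the "~8,600 / 17,200 / ~120,000" figures, and it makes the case for splitting clusters: halving the hot window or dropping a replica changes the total far more than tuning shard size within the 20–50 GB band.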
Hot/warm/cold tiering and index lifecycle management
stateDiagram-v2
[*] --> Hot: index created (rollover)
Hot --> Warm: age > 7 days
Warm --> Cold: age > 30 days
Cold --> Frozen: age > 90 days
Frozen --> Deleted: past retention window
Hot: Hot tier\nSSD nodes, full replicas\nread + write
Warm: Warm tier\nHDD nodes, 1 replica\nread only, force-merged
Cold: Cold tier\nFully-mounted searchable snapshot\nall segment data stored locally on cold nodes, no replicas
Frozen: Frozen tier\nPartially-mounted searchable snapshot\non-demand fetch from S3, shared partial cache
| Tier | Storage | Replicas | Query latency | Cost vs. Hot |
|---|---|---|---|---|
| Hot | SSD (NVMe) | 2 | < 2 s | 1× (baseline) |
| Warm | HDD (SATA) | 1 | 5–30 s | ~5–10× cheaper |
| Cold / Searchable snapshot | S3 / GCS, fully mounted on cold nodes (data stored locally) | 0 | 10–60 s | ~20–50× cheaper |
| Frozen | S3, partially cached (shared cache on frozen nodes) | 0 | 30 s – minutes | ~100× cheaper |
Elasticsearch searchable snapshots come in two mounting modes. In the cold tier, indices are fully mounted: all segment data is stored locally on dedicated cold nodes (no on-demand S3 fetching), so queries are slower than warm but complete without remote round-trips. In the frozen tier, indices are partially mounted: only recently searched segments are kept in a shared local cache; cache misses trigger on-demand fetches from S3, which is why frozen queries can take 30 seconds or more. Both tiers need zero replicas — the snapshot in object storage is the durable copy.
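The lifecycle in the state diagram reduces to a lookup by index age. The sketch below is not ILM policy syntax: `tier_for_age` is a hypothetical helper, the 7/30/90-day thresholds come from the diagram, and the 365-day retention default is an assumption within the 1–2-year range given in the requirements.

```python
def tier_for_age(age_days: float, retention_days: int = 365) -> str:
    """Return the tier an index of the given age should live on."""
    if age_days > retention_days:
        return "deleted"
    if age_days > 90:
        return "frozen"   # partially mounted, fetched from object storage on demand
    if age_days > 30:
        return "cold"     # fully mounted searchable snapshot
    if age_days > 7:
        return "warm"     # HDD nodes, read only
    return "hot"          # SSD nodes, read and write


for age in (1, 10, 45, 120, 400):
    print(age, "days ->", tier_for_age(age))
```

In a real cluster the policy engine evaluates these conditions per index and triggers the move; the point of the sketch is that placement is a pure function of age, which is what makes daily indices and tiering fit together.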
The volume and cost problem
Logs grow with traffic, not with headcount. Without controls, log storage will exceed the compute budget eventually — and often sooner than teams expect.
Sampling
Not every log line is equally valuable. Drop rules and sampling are applied at the agent or at the Kafka consumer before indexing:
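# Fluent Bit throttle and drop rules (illustrative; tune values per stream)
[FILTER]
Name throttle
Match kube.*
# Allow an average of 1000 events per Interval, measured over a sliding
# Window of 5 intervals. Events above that rate are dropped.
Rate 1000
Window 5
Interval 1s
[FILTER]
Name grep
Match app.healthcheck
# Drop records whose path field matches /health (health-check endpoint logs).
Exclude path /health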
Structured sampling: keep 100 % of ERROR/WARN; sample INFO at 10 %; drop DEBUG entirely in production.
Head-based sampling: at the agent, randomly drop a fraction of events before they enter Kafka. Fast, but loses visibility into rare events.
Tail-based sampling: downstream consumer decides which events to keep based on their content (e.g., always keep events with error:true). More accurate, but requires buffering.
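The structured sampling policy (keep all ERROR/WARN, 10 % of INFO, no DEBUG) can be sketched as a content-based filter. This is one possible design, not a feature of any particular agent: `keep` is a hypothetical function, and hashing the `trace_id` is an assumption chosen so that all INFO lines of a sampled request are kept or dropped together.

```python
import hashlib

# Events kept per 10,000, by level. Values mirror the policy in the text.
KEEP_PER_10K = {"ERROR": 10_000, "WARN": 10_000, "INFO": 1_000, "DEBUG": 0}


def keep(event: dict) -> bool:
    """Decide deterministically whether to forward an event for indexing."""
    # Unknown levels are kept in full rather than silently dropped.
    threshold = KEEP_PER_10K.get(event.get("level", "INFO"), 10_000)
    key = str(event.get("trace_id") or event.get("message", ""))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < threshold


sample = [{"level": "INFO", "trace_id": f"trace-{i}"} for i in range(10_000)]
kept = sum(keep(e) for e in sample)
print(f"kept {kept} of {len(sample)} INFO events")
```

Because the decision depends only on the event, every consumer that applies the same rule reaches the same verdict, which random sampling would not guarantee.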
Field limits and mapping explosion
Elasticsearch dynamically maps new JSON fields — convenient but dangerous. A service that starts emitting request.headers.* as top-level fields can create thousands of mappings overnight, bloating the cluster state and degrading indexing throughput.
Mitigations: enforce strict index templates that define allowed fields explicitly and reject or route unknown fields to a generic labels keyword field. Alert when a new index's field count exceeds a threshold (e.g., 500). Disable dynamic mapping on write indices; only pre-approved fields are indexed.
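The "route unknown fields to a generic labels field" mitigation might look like the following at the indexer tier. This is a sketch: the `ALLOWED_FIELDS` set is an example allowlist built from the field names used in this article, and `normalize` is a hypothetical pipeline step, not an Elasticsearch API.

```python
ALLOWED_FIELDS = {
    "timestamp", "level", "service", "message",
    "trace_id", "span_id", "user_id", "error", "duration_ms",
}


def normalize(event: dict) -> dict:
    """Keep approved fields; fold everything else into one 'labels' list."""
    out = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}
    extras = sorted(f"{k}={v}" for k, v in event.items() if k not in ALLOWED_FIELDS)
    if extras:
        out["labels"] = extras   # a single keyword field, however many keys arrive
    return out


noisy = {
    "level": "ERROR",
    "service": "payment",
    "request.headers.x-foo": "1",
    "request.headers.x-bar": "2",
}
print(normalize(noisy))
```

However many ad-hoc keys a service invents, the index mapping grows by at most one field, and the values stay searchable as `key=value` terms.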
Retention tiers as cost levers
A practical tiering policy for a mid-size company:
| Tier | Retention | Who queries it? |
|---|---|---|
| Hot | Last 7 days | On-call engineers, dashboards, alerts |
| Warm | 7–30 days | Incident post-mortems, compliance audits |
| Cold (S3 snapshot) | 30 days – 1 year | Rare forensic queries, compliance |
| Deleted | > 1–2 years | — |
Moving from "keep everything hot for 90 days" to this tier structure typically reduces storage costs by 5–10×.
Full architecture
flowchart TD
SVC1[Service A<br/>container stdout] -->|"JSON logs"| AGT1[Fluent Bit agent<br/>node 1]
SVC2[Service B<br/>legacy syslog] -->|"unstructured"| AGT2[Fluent Bit agent<br/>node 2]
SVCN[Service N...] --> AGTN[Fluent Bit agent<br/>node N]
AGT1 & AGT2 & AGTN -->|"produce, ack required"| KFK[Kafka<br/>N partitions, RF=3]
KFK -->|"consumer group: indexers"| IDX[Indexer fleet<br/>Logstash / OS ingest node]
KFK -->|"consumer group: alerts"| ALE[Alert evaluator<br/>stream processing]
KFK -->|"consumer group: archive"| ARC[S3 archiver<br/>raw Parquet / ORC]
IDX --> HOT[(Hot nodes<br/>SSD, today − 7d)]
HOT -.ILM rollover.-> WARM[(Warm nodes<br/>HDD, 7 − 30d)]
WARM -.ILM snapshot.-> COLD[(Cold: S3<br/>30d − 1yr)]
ALE --> PGD[PagerDuty / Slack]
ARC --> LAKE[(Data lake<br/>for ad-hoc analytics)]
QFE[Query frontend<br/>Kibana / Grafana] --> COORD[Coordinating nodes]
COORD --> HOT
COORD --> WARM
COORD -.searchable snapshot.-> COLD
style KFK fill:#ff6b1a,color:#0a0a0f
style HOT fill:#15803d,color:#fff
style WARM fill:#0e7490,color:#fff
style COLD fill:#a855f7,color:#fff
style AGT1 fill:#ffaa00,color:#0a0a0f
style AGT2 fill:#ffaa00,color:#0a0a0f
style AGTN fill:#ffaa00,color:#0a0a0f
style IDX fill:#ff2e88,color:#fff
Query path
A user submits:
service:payment AND level:ERROR AND @timestamp:[now-7d TO now]
The coordinating node:
- Resolves the time range to a list of index names: logs-2026.05.26 through logs-2026.06.02.
- Determines which are on hot nodes vs. warm nodes vs. cold snapshots.
- Fans the query out to the primary (or replica) shards holding those indices.
- Merges and sorts results; returns the top-N hits.
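The final merge step can be sketched as a k-way merge over per-shard result lists. This is a simplification of what a coordinating node does: the hit structure (`ts`, `id`) and `merge_top_n` are illustrative, and each shard is assumed to return its hits already sorted newest first.

```python
import heapq
from itertools import islice
from typing import Dict, List


def merge_top_n(shard_results: List[List[Dict]], n: int) -> List[Dict]:
    """Merge per-shard hit lists (each sorted by ts descending) and keep the top n."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit["ts"], reverse=True)
    return list(islice(merged, n))   # stops after n hits, never sorts everything


hot = [{"ts": 100, "id": "h1"}, {"ts": 90, "id": "h2"}]
warm = [{"ts": 95, "id": "w1"}, {"ts": 50, "id": "w2"}]
cold = [{"ts": 10, "id": "c1"}]

top = merge_top_n([hot, warm, cold], 3)
print([hit["id"] for hit in top])  # ['h1', 'w1', 'h2']
```

Each shard only needs to return its own top N, so the coordinating node handles N hits per shard regardless of how many documents matched.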
sequenceDiagram
participant U as User (Kibana)
participant C as Coordinating node
participant H as Hot shard (today)
participant W as Warm shard (3 days ago)
participant S as S3 (cold snapshot)
U->>C: query { service:payment, level:ERROR, last 7d }
C->>H: scatter query (today's index)
C->>W: scatter query (3-day-old index)
C->>S: scatter query (cold — load from S3 cache)
H-->>C: top 200 hits
W-->>C: top 200 hits
S-->>C: top 200 hits (slower — 30s)
C-->>U: merged top 10 hits + total count
Query DoS (wide-range queries): a query over 365 days hits every index. If 100 engineers run such queries simultaneously, the cluster can become unresponsive for interactive queries. Route queries older than 30 days to a separate coordinating tier with lower priority, enforce a maximum time range in the query UI (e.g., cap at 90 days without explicit override), and circuit-break queries with an estimated shard count above a threshold.
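These guards amount to a small admission check in front of the cluster. The sketch below is hypothetical throughout: `admit_query` is not an Elasticsearch feature, the 90-day cap and 30-day routing boundary come from the paragraph above, and the 5,000-shard fan-out limit is an invented placeholder.

```python
def admit_query(range_days: int, shards_per_day: int, *,
                max_days: int = 90, max_shards: int = 5_000,
                override: bool = False) -> str:
    """Decide whether to reject a query or which coordinating tier should run it."""
    if range_days > max_days and not override:
        return "reject: time range exceeds cap"
    if range_days * shards_per_day > max_shards:
        return "reject: estimated shard fan-out too high"
    if range_days > 30:
        return "route: low-priority historical tier"
    return "route: interactive tier"


print(admit_query(1, 100))
print(admit_query(60, 50))
print(admit_query(365, 10))
```

Estimating fan-out from the time range alone is crude, but it is cheap enough to run on every request before any shard is contacted.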
Alerting on log patterns
Alerting runs as a separate consumer group on Kafka, before indexing. This is important: do not alert off search queries on Elasticsearch — polling every 30 seconds adds query load and introduces a 30-second minimum latency. Stream-process the Kafka topic instead.
Kafka topic → Flink / Kafka Streams job →
window: 5 min tumbling
group by: service, level
filter: level = ERROR
aggregate: count
condition: count > 100 → emit alert
Stream processing over the log stream gives low-latency alerting: stateless per-event rules fire in seconds, while windowed aggregations (like the 5-minute tumbling window above) emit results as each window closes. For complex multi-event patterns (e.g., "error A followed by error B within 60 seconds on the same trace"), use a stateful CEP (Complex Event Processing) engine such as Flink's CEP library.
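The windowed rule above can be expressed in plain Python to make the bucketing explicit. This is a batch sketch of the logic, not Flink or Kafka Streams code: `window_alerts` is an illustrative function, timestamps are assumed to be epoch seconds, and a real streaming job would emit each result when its window closes rather than after reading all events.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 300   # 5-minute tumbling window
THRESHOLD = 100        # alert when ERROR count exceeds this


def window_alerts(events: Iterable[Dict], window: int = WINDOW_SECONDS,
                  threshold: int = THRESHOLD) -> List[Tuple[int, str]]:
    """Return (window_start, service) pairs whose ERROR count exceeds the threshold."""
    counts: Counter = Counter()
    for event in events:
        if event["level"] != "ERROR":
            continue
        window_start = event["ts"] - event["ts"] % window   # align to window boundary
        counts[(window_start, event["service"])] += 1
    return sorted(key for key, count in counts.items() if count > threshold)


events = [{"ts": i, "service": "payment", "level": "ERROR"} for i in range(101)]
events += [{"ts": 300 + i, "service": "search", "level": "ERROR"} for i in range(100)]
events.append({"ts": 5, "service": "payment", "level": "INFO"})

print(window_alerts(events))  # [(0, 'payment')]
```

The search service logs exactly 100 errors in its window and stays below the "count > 100" condition, so only the payment window fires.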
Multi-tenancy and access control
A shared logging cluster across teams requires isolation:
| Layer | Mechanism |
|---|---|
| Index isolation | Separate index per team (logs-team-payments-*) |
| Query isolation | Kibana Spaces / tenant-scoped API keys |
| Document-level security | Elasticsearch DLS: filter by team field at query time |
| Write isolation | Each team's agents publish to that team's own Kafka topic; the indexer routes each topic to the team's index |
| Cost attribution | Per-index storage and query metrics billed back to teams |
Document-level security is flexible but carries a query-time cost (every query gets an implicit filter injected). For high-QPS tenants, index-level isolation (separate index per team) is cheaper.
Correlating logs with traces
Every log event should carry the distributed trace ID when one exists:
{"timestamp":"2026-06-01T14:23:11Z","level":"ERROR","service":"payment",
"trace_id":"7b9a2c1d3e4f5a6b","span_id":"a1b2c3d4","user_id":42,
"error":"upstream timeout","duration_ms":5001}
With trace_id as a keyword field, clicking it in Kibana jumps directly to Jaeger / Tempo / Zipkin to see the full distributed trace. In an alert, including the trace_id in the notification payload means on-call engineers land directly on the relevant trace without hunting.
This correlation is the backbone of the "three pillars of observability" (logs, metrics, traces) — each pillar becomes much more useful when they share a common identifier.
Storage choices summary
| Data type | Store | Rationale |
|---|---|---|
| Raw log stream (in-flight) | Kafka (replicated, 48-hour retention) | Durable buffer; decouples agents from indexers |
| Hot log events (0–7 days) | Elasticsearch / OpenSearch on SSD | Fast full-text + field queries; p95 < 2 s |
| Warm log events (7–30 days) | Elasticsearch on HDD, force-merged | Reduced replica count; 5–10× cheaper than SSD |
| Cold log events (30 d – 1 yr) | Searchable snapshot on S3 | Low storage cost; no replicas needed because the S3 snapshot is the durable copy |
| Raw archive (for replay / analytics) | S3 in Parquet / ORC format | Queryable by Athena / BigQuery; cheap long-term |
| Agent state (file offsets) | Local disk on each host | Survives agent restart; required for at-least-once |
| Index metadata / cluster state | Elasticsearch master nodes (in-memory) | Limits shard count; keep below ~10k shards/cluster |
Failure modes
Indexer overload / backpressure
Symptom: Kafka consumer lag grows; Elasticsearch returns HTTP 429.
A deployment causes an ERROR storm; the fleet emits 10× normal log volume for 5 minutes. The lag accumulates in Kafka; agents keep producing; applications are unaffected. Once the storm subsides, indexers process the backlog. No events lost.
Where it fails: if the storm persists and Kafka's retention window (e.g., 48 hours) is exhausted, the oldest events are evicted before indexing. Prevention: autoscale the indexer fleet based on consumer lag, or increase Kafka retention for the log topics.
Index hotspots
Symptom: one shard in today's index receives 80 % of writes; that node's CPU pegs at 100 %.
Usually caused by all agents routing to the same Kafka partition, or Elasticsearch's indexing being imbalanced across shards. Fix: increase Kafka partition count; use a hash of host_id as the partition key for uniform distribution. Elasticsearch primary shard count is fixed at index creation — size shards correctly upfront.
Mapping explosion
Symptom: cluster state grows to gigabytes; indexing throughput drops; master nodes run out of memory and are killed.
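A service started emitting deeply nested dynamic JSON (e.g., HTTP request headers as first-class fields). Fix: set dynamic: strict in index templates so documents with unmapped fields are rejected, or dynamic: false so unmapped fields stay in _source without being indexed; have the ingest pipeline fold unknown fields into a generic string field. Reindex affected indices after fixing the mapping.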
Expensive wide-range queries (query DoS)
Symptom: one grep-style query over a 90-day range consumes all coordinating node resources, timing out concurrent queries.
Rate-limit queries at the API gateway; enforce max time range in the query UI; route historical queries to a dedicated coordinating tier with resource limits.
Cost explosion
Symptom: log storage bill doubles quarter-over-quarter with no corresponding traffic increase.
A noisy service (e.g., a batch job emitting DEBUG for every record processed) started logging heavily. Fix: per-service ingest rate dashboards; alerts when a service's log volume exceeds a budget; sampling and drop rules deployed at the agent within minutes.
Things to discuss in an interview
- Why Kafka and not a direct agent → Elasticsearch connection? The buffer is the key insight. Draw the failure scenario without it.
- Why time-based indices? Tie it to query pruning, cheap retention management, and tiering alignment. An interviewer who sees "daily indices" and doesn't hear "because time range prunes index selection automatically" will mark it down.
- Inverted index internals: you don't need to know Lucene segment merging deeply, but knowing that full-text search is a posting-list intersection (not a scan) and that doc_values enables fast field-level aggregations differentiates senior candidates.
- Mapping explosion: surprisingly common in production, rarely mentioned by candidates. Mentioning it signals operational experience.
- Logs vs. metrics: if asked to design an "observability system," always clarify: are we building logging, metrics, or tracing? Each has fundamentally different storage requirements. Metrics monitoring uses a time-series database (Prometheus, Thanos), not an inverted index.
- Cost as a first-class concern: at scale, log storage often costs more than compute. Mentioning tiering, sampling, and retention policies distinguishes candidates who've operated these systems from those who've only read about them.
- Trace correlation: mentioning trace_id as a structured field and linking to the distributed tracing system shows end-to-end observability thinking.
Things you should now be able to answer
- Why does Kafka sit between agents and Elasticsearch, and what failure mode does it prevent?
- What is an inverted index, and why does it make full-text search fast?
- Why are log indices organized by day rather than by service or severity?
- What happens to shard count and cluster stability when you keep all data on hot nodes for 90 days?
- How does a time-range query avoid scanning all indices?
- What is mapping explosion, and how do you prevent it?
- How do you prevent a single wide-range query from degrading interactive dashboards?
- How is a log aggregation system architecturally different from a metrics monitoring system?
Further reading
- Elasticsearch documentation — Index Lifecycle Management (ILM): elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
- OpenSearch documentation — Index State Management: opensearch.org/docs/latest/im-plugin/ism
- "The Log: What every software engineer should know about real-time data's unifying abstraction" — Jay Kreps, LinkedIn Engineering Blog (foundational reading on Kafka's design philosophy)
- Fluent Bit documentation — High-performance log forwarding: docs.fluentbit.io
- Lucene's inverted index — "Introduction to Information Retrieval" (Manning, Raghavan, Schütze), Chapter 1 (free online at nlp.stanford.edu/IR-book)
- Design a Distributed Search System — the indexing and query mechanics that underpin Elasticsearch
- Design a Metrics & Monitoring System — the complementary numeric time-series system; know when to use each
Frequently asked questions
▸Why does a log aggregation pipeline put Kafka between collection agents and Elasticsearch?
Without Kafka, Elasticsearch applies backpressure via HTTP 429 when its indexing queue fills, and agents must either block or drop events. Kafka's durable retention window (typically 24-48 hours) lets indexers fall behind during traffic spikes and catch up without losing a single event; consumer group offset tracking means an indexer restart resumes exactly where it left off.
▸What is an inverted index and why does it make log search fast?
Lucene tokenizes each log line and builds a mapping from every token to the sorted list of document IDs containing it (a posting list). A query like 'payment AND timeout' intersects two posting lists in O(x + y) time proportional to the matching list lengths, not the total corpus size, so it returns results across hundreds of millions of documents in milliseconds rather than scanning raw bytes.
▸Why organize log indices by day rather than by service or severity?
Time-based daily indices enable automatic query pruning: a 'last 2 hours' query touches only today's index, never the terabytes on warm or cold tiers. They also make retention management trivial (deleting 30-day-old data is a single index drop, not per-document deletions) and align perfectly with ILM tiering, since an entire index moves atomically to warm or cold nodes.
▸What is mapping explosion in Elasticsearch and how do you prevent it?
Elasticsearch dynamically maps every new JSON field it sees; a service that suddenly emits HTTP request headers as top-level fields can create thousands of new mappings overnight, bloating cluster state, degrading indexing throughput, and in the worst case pushing master nodes out of memory. Prevention requires enforcing strict index templates that define allowed fields explicitly and routing unknown fields to a single generic keyword field, combined with alerts when a new index's field count exceeds a threshold such as 500.
▸How much cheaper is cold object storage compared to keeping logs on hot SSD nodes?
Cold searchable snapshots on S3 or GCS cost roughly 20-50 times less than hot SSD nodes, and the frozen tier (partially cached, on-demand S3 fetch) costs approximately 100 times less. Moving from 90-day hot retention to the full hot-warm-cold tier structure typically reduces storage costs by 5-10 times for a mid-size company.