~/articles/design-log-aggregation

◆◆◆Advancedasked at Elasticasked at Splunkasked at Datadogasked at Amazon

Design a Centralized Log Aggregation System (ELK / Splunk)

Q: Why does a log aggregation pipeline put Kafka between collection agents and Elasticsearch?

Without Kafka, Elasticsearch applies backpressure via HTTP 429 when its indexing queue fills, and agents must either block or drop events. Kafka's durable retention window (typically 24-48 hours) lets indexers fall behind during traffic spikes and catch up without losing a single event; consumer group offset tracking means an indexer restart resumes exactly where it left off.

Q: What is an inverted index and why does it make log search fast?

Lucene tokenizes each log line and builds a mapping from every token to the sorted list of document IDs containing it (a posting list). A query like 'payment AND timeout' intersects two posting lists in O(x + y) time proportional to the matching list lengths, not the total corpus size, so it returns results across hundreds of millions of documents in milliseconds rather than scanning raw bytes.

Q: Why organize log indices by day rather than by service or severity?

Time-based daily indices enable automatic query pruning: a 'last 2 hours' query touches only today's index, never the terabytes on warm or cold tiers. They also make retention management trivial (deleting 30-day-old data is a single index drop, not per-document deletions) and align perfectly with ILM tiering, since an entire index moves atomically to warm or cold nodes.

Q: What is mapping explosion in Elasticsearch and how do you prevent it?

Elasticsearch dynamically maps every new JSON field it sees; a service that suddenly emits HTTP request headers as top-level fields can create thousands of new mappings overnight, bloating cluster state and degrading indexing throughput to the point of master node OOM. Prevention requires enforcing strict index templates that define allowed fields explicitly and routing unknown fields to a single generic keyword field, combined with alerts when a new index's field count exceeds a threshold such as 500.

Q: How much cheaper is cold object storage compared to keeping logs on hot SSD nodes?

Cold searchable snapshots on S3 or GCS cost roughly 20-50 times less than hot SSD nodes, and the frozen tier (partially cached, on-demand S3 fetch) costs approximately 100 times less. Moving from 90-day hot retention to the full hot-warm-cold tier structure typically reduces storage costs by 5-10 times for a mid-size company.

Collect, store, and search logs from thousands of services. Collection agents, a buffered ingestion pipeline, time-based inverted indices, hot-warm-cold tiers, and cost control.

25 min read2026-06-16Ironclad Academy

#interview #observability #search #big-data

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Netflix runs thousands of microservices. On any given afternoon, tens of millions of log lines per second pour out of containers across tens of thousands of hosts — one slow upstream call, one unexpected null pointer, one rejected payment. The only way an engineer finds that needle is if every log line from every service has already been collected, structured, and indexed somewhere central, ready for a sub-second query at 2 AM during an outage.

That is what a log aggregation system does. It replaces the old workflow — SSH into a server, grep through /var/log/app.log — with a unified pipeline: lightweight agents tail log files on every host, ship the events to a durable buffer, and a fleet of indexers writes them into a full-text search engine. Products like the ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, and Datadog are all industrial implementations of this pattern.

The core engineering tension is that logs are a write-dominated, high-cardinality, unbounded stream. Unlike metrics — which are pre-aggregated numbers on known label sets — every log line is a unique text event that may carry arbitrary JSON fields. You must index it fast enough to keep up with the firehose, make it searchable within seconds, and store months of history without the bill overwhelming the engineering budget. Backpressure during traffic spikes and cost at petabyte scale are the two problems that will break a naive design.

A second tension is the gap between what engineers want ("show me all errors for request ID abc123 in the last hour") and what the storage system can deliver cheaply. Full-text search across terabytes of cold history is expensive; scanning is even worse. The architecture that follows — agents, Kafka, Elasticsearch with time-based daily indices, and hot/warm/cold tiering — exists specifically to collapse that gap.

Functional requirements

Collect logs from all hosts and containers without changing application code.
Accept both structured logs (JSON with known fields) and unstructured log lines.
Parse, enrich (add hostname, region, severity), and index every event.
Expose full-text search and structured-field filter queries via an API and a UI.
Alert when a log pattern crosses a threshold (e.g., ERROR rate > 1 % for 5 min).
Correlate log events with distributed traces via a shared trace_id field.
Support multiple teams in a shared cluster with access control per log stream.

Non-functional requirements

Durability: no log line dropped during an indexer outage or traffic spike.
Ingestion latency: logs visible for search within 5–15 seconds of emission (near-real-time, not real-time).
Query latency: p95 < 5 s for queries over the last 24 h; < 60 s over 30-day history.
Retention: 7-day hot, 30-day warm, 90-day cold, 1–2-year frozen — tiered by access frequency.
Availability: 99.9 % for the query path; ingestion can tolerate brief queue build-up.
Cost-aware: the platform must provide controls to cap storage spend; logs scale with traffic, not headcount.

Capacity estimation

Large fleet (10,000 hosts, sustained peak):

Dimension	Estimate	How we got there
Fleet size	10,000 hosts	—
Log rate (peak)	5,000 lines/host/sec (typical ~500)	Sustained peak assumption
Total ingest rate	50 M lines/sec	`10,000 × 5,000`
Avg line size	500 bytes	JSON with stack traces can be larger; short access logs shorter
Raw ingest bandwidth	25 GB/s	`50 M × 500 B`
Compression ratio	~5×	LZ4 / Zstd typical for log data
Compressed ingest	5 GB/s	`25 GB/s ÷ 5`
Compressed per day	432 TB/day	`5 GB/s × 86,400 s`
7-day hot tier	~3 PB	`432 TB × 7`
30-day warm tier	~10 PB (additional days)	`432 TB × 23`

Mid-scale (1,000 hosts, 500 lines/host/sec average):

Dimension	Estimate	How we got there
Total ingest rate	500,000 lines/sec	`1,000 × 500`
Raw ingest bandwidth	250 MB/s	`500,000 × 500 B`
Compressed ingest	50 MB/s	`250 MB/s ÷ 5`
Compressed per day	4.3 TB/day	`50 MB/s × 86,400 s`
7-day hot tier	~30 TB	Fits a small Elasticsearch cluster (6–10 nodes)

Both scenarios have the same structure; only the cluster size changes. The math is what matters in an interview — show the multiplication, then say "so we need time-based sharding and tiering."

Takeaway: at 10,000 hosts the uncompressed firehose is 25 GB/s — you need Kafka as a write buffer, daily time-based indices for query pruning, and hot/warm/cold tiering to make the storage bill survivable.

Logs vs. metrics: why they're different systems

This distinction comes up in every observability interview.

Property	Logs	Metrics
Data shape	Discrete text events, variable fields	Numeric values (gauge, counter, histogram) with labels
Cardinality	Unbounded — any JSON field can appear	Controlled — label sets are predefined
Query type	Full-text search + field filter + aggregation	Time series aggregation + alerting
Volume	Very high (MB–GB/s raw)	Low (thousands of series)
Storage	Inverted index (Lucene/Elasticsearch)	TSDB (Prometheus, InfluxDB, Thanos)
Retention	Often tiered, expensive to store long-term	Aggregated; cheap at 15-second granularity
Trace link	Ship `trace_id` as a field	Not directly applicable

A metrics system gives you "how many?" cheaply and reliably. Logs give you "what happened, to which request, in what context?" That richness is also why logs are so much more expensive to store and query.

Building up to the design

V1: Syslog to a file on a shared NFS mount

Every Unix host has syslog. Redirect it to a shared NFS volume. grep to search.

Zero infrastructure, and it works on day one for three services. The problems arrive fast: at ten services, concurrent writes corrupt the file. At a hundred, grep over terabytes is unusable. And NFS is a single point of failure for every application that tries to write a log.

V2: Centralized log server (rsyslog / syslog-ng)

Run a log aggregation daemon. Applications forward via UDP/TCP syslog; the server writes to local disk. You search with grep or awk.

Centralizing removes the NFS mess. But search is still O(bytes scanned) — one bad query pauses ingestion because they share the same box. The log server itself becomes a single point of failure. And every log line is still a flat string with no structure, so parsing is a later problem.

V3: Add structure and a search index

Require services to emit JSON. Parse, enrich, and write to Elasticsearch. Now a query like:

service:payment AND level:ERROR AND @timestamp:[now-1h TO now]

hits an inverted index and returns in milliseconds.

Near-real-time structured search is a huge leap. The new problem shows up during deploys: a service that starts logging 100× its normal rate will overwhelm the Elasticsearch indexer. When the indexer's queue fills, it rejects _bulk requests with HTTP 429, and the agent has nowhere to put the events. They get dropped.

V4: Add Kafka as a durable buffer

Insert Kafka between agents and Elasticsearch. This is the most important architectural decision in the whole system.

Agent  →  Kafka topic (replicated)  →  Logstash / OpenSearch Ingest  →  ES

The agent publishes and moves on; Kafka durably stores the event. The indexer pulls at its own pace — if it lags, Kafka holds the backlog. An indexer restart or redeployment doesn't lose events; the consumer offset tracks exactly where to resume. You can also add multiple consumer groups: one for indexing, one for real-time alerting, one for archival — all reading the same stream independently.

The remaining pain after this is cost: old data on high-IOPS SSDs is expensive. A month of logs on the same nodes as today's data is unnecessary for almost every query.

V5: Time-based indices + hot/warm/cold tiering

Elasticsearch's Index Lifecycle Management (ILM) — and its open-source equivalent in OpenSearch — automates this:

Each day creates a new index: logs-2026.06.01, logs-2026.06.02, …
After 7 days, the index moves to warm nodes (HDDs, fewer replicas).
After 30 days, the index is snapshotted to S3 and mounted as a searchable snapshot (cold or frozen tier).
After the retention window, the index is deleted.

Time-range queries prune index selection automatically: a "last 2 hours" query touches only the current and previous day's index — never the terabytes on warm/cold.

V6: Cost controls, multi-tenancy, alerting

The production system adds sampling and drop rules (noisy DEBUG logs from a batch job are sampled at 1%; health-check endpoints are dropped at the agent), index templates with explicit field mappings to prevent mapping explosion from ad-hoc JSON fields, tenant isolation via separate indices per team or document-level security, and a stream-processing alerting engine that consumes a dedicated Kafka consumer group and triggers PagerDuty when an ERROR rate crosses a threshold — never polling Elasticsearch.

flowchart LR
    V1["V1: syslog to NFS<br/>3 services, grep"] --> V2["V2: rsyslog server<br/>centralized, no index"]
    V2 --> V3["V3: + Elasticsearch<br/>structured, searchable"]
    V3 --> V4["V4: + Kafka buffer<br/>backpressure solved"]
    V4 --> V5["V5: + ILM tiering<br/>cost controlled"]
    V5 --> V6["V6: + sampling, ACLs, alerts<br/>production-ready"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

Collection agents

Every host runs a lightweight collection agent. Its job is to tail log files (or read from Docker/containerd stdout), parse and enrich events, and publish to Kafka. It must be low CPU and low memory — it shares the machine with production services.

Agent options

Agent	Language	Strengths	Watch out for
Filebeat (Elastic)	Go	Near-zero CPU, simple config, native Elasticsearch output	Limited pipeline transforms; needs Logstash for complex parsing
Fluentd	Ruby/C	Huge plugin ecosystem, flexible routing	Higher memory footprint (>60 MB baseline per official docs)
Fluent Bit	C	Ultra-lightweight (~450 KB binary, ~5–10 MB RSS in production, zero external dependencies), built-in Kafka output	Smaller plugin ecosystem than Fluentd
Vector (Datadog)	Rust	Best-in-class throughput, rich transforms, low memory	Relatively newer; fewer community integrations

For most deployments: Fluent Bit on every node (resource-efficient), with Kafka as the output. Parsing and enrichment happens at the indexer tier, not the agent, to keep agent CPU usage minimal.

Structured vs. unstructured logs

# Unstructured (legacy)
2026-06-01 14:23:11 ERROR Payment service failed for user 42: timeout

# Structured (JSON — preferred)
{"timestamp":"2026-06-01T14:23:11Z","level":"ERROR","service":"payment",
 "user_id":42,"error":"timeout","trace_id":"abc123","duration_ms":5001}

Structured logs eliminate the regex-parsing step. The agent forwards the JSON blob directly; the indexer extracts fields without a Grok pattern. This is why every modern service should write JSON to stdout — the logging infrastructure can be a "dumb pipe" rather than a fragile parsing layer.

For legacy unstructured logs, the indexer tier (Logstash / OpenSearch's ingest pipelines) runs Grok patterns:

%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}

Grok is expensive at scale — a complex pattern on a hot log stream can consume 20–30 % of an indexer's CPU. Migrate services to structured output wherever possible.

Agent backpressure and at-least-once delivery

The agent maintains a cursor (file offset, inode) in a state file on local disk. If Kafka is unavailable, the agent stops advancing its cursor and waits. The application continues writing to the log file on disk — the log file itself is the implicit buffer, bounded by available disk space. On reconnect, the agent resumes from the saved offset — no events lost, but delivery may be delayed.

This gives at-least-once semantics. Duplicates are possible on agent restart; the indexer pipeline should be idempotent (Elasticsearch's document ID deduplication handles this for most cases).

The Kafka buffer: why it's non-negotiable

Without Kafka, the write path is:

Agent → Elasticsearch _bulk API

Elasticsearch applies backpressure at ~HTTP 429 when its indexing queue fills. The agent must either block (slowing down the application if log writes are synchronous) or drop events. Neither is acceptable.

With Kafka:

Agent → Kafka partition → Indexer consumer

Kafka's retention window (e.g., 24–48 hours) means an indexer can be restarted, upgraded, or replaced without losing a single event. Consumer group offset tracking means an indexer picks up exactly where it left off. Partition count scales throughput linearly: 100 partitions → 100 parallel indexer threads. And multiple consumer groups decouple indexing from alerting from archival — each reads the same stream independently.

sequenceDiagram
    participant A as Agent (host 1..N)
    participant K as Kafka
    participant I as Indexer
    participant E as Elasticsearch
    participant AL as Alert engine

    A->>K: produce log event (ack required)
    K-->>A: ack (replicated to 2 brokers)
    Note over K: event durable in Kafka for 48 h
    I->>K: poll batch (consumer group "indexers")
    K-->>I: batch of events
    I->>E: _bulk index
    E-->>I: OK
    AL->>K: poll batch (consumer group "alerts")
    K-->>AL: same events, independently

Kafka topic partitioning strategy: partition by service_name (keeps a service's events ordered) or by host_id (uniform distribution). Do not partition by timestamp — that creates a hot partition for the current second.

Indexing: inverted indices and time-based partitioning

How Elasticsearch / Lucene indexes text

When a log line is indexed, Lucene tokenizes each text field and builds an inverted index: a mapping from each token to the list of document IDs containing it.

"payment timeout error" →
  payment: [doc1, doc5, doc9, ...]
  timeout: [doc1, doc3, ...]
  error:   [doc1, doc2, ...]

A full-text query for "payment AND timeout" intersects two posting lists. This is fast even over hundreds of millions of documents because posting lists are sorted and can be intersected in O(x + y) where x and y are the lengths of the two lists, not the total corpus size. Lucene further accelerates this with skip lists and block-based compression (PFOR), so in practice it avoids decompressing blocks that cannot match.

Structured field queries (e.g., level:ERROR, user_id:42) hit a separate doc_values column store — like a columnar index, fast for range and exact-match queries.

This is why distributed search systems built on Lucene can answer "find all ERROR logs in the last hour" in milliseconds across billions of documents, but "count all errors in the last year" is slower — it must scan more posting list entries.

flowchart LR
    LOG["Log line: 'payment timeout error'"] --> TOK[Tokenizer]
    TOK --> INV[(Inverted index\npayment → doc1,5,9\ntimeout → doc1,3\nerror → doc1,2)]
    LOG --> DV[(doc_values\nlevel, user_id\nfast field queries)]
    Q["Query: payment AND timeout"] --> INT[Posting list intersection\nO of x plus y]
    INV --> INT
    INT --> HITS[Matching docs]
    style INV fill:#15803d,color:#fff
    style DV fill:#0e7490,color:#fff
    style INT fill:#ff6b1a,color:#0a0a0f
    style HITS fill:#a855f7,color:#fff

Time-based index partitioning

The most important physical design decision in a log system is indexing by time window.

logs-2026.06.01    ← today's index, hot, SSD
logs-2026.06.02    ← yesterday's index, hot, SSD
...
logs-2026.05.26    ← 7 days ago, moving to warm nodes
logs-2026.05.01    ← 30+ days ago, searchable snapshot on S3

Three reasons this works so well together. First, query pruning: a "last 2 hours" query selects only today's index — Elasticsearch resolves which indices cover the requested time range before scanning any documents. Second, cheap data management: deleting 30-day-old data is DELETE /logs-2026.05.01 — one index drop, no per-document deletions, no reindexing. Third, tiering alignment: moving an index to warm nodes is an atomic operation; the shard files move together.

Index rollover (either by time or by size — whichever comes first):

Rollover condition:  age > 1 day  OR  primary size > 50 GB

Capping index size prevents single-shard hotspots in high-volume services.

Shard sizing

Each Elasticsearch shard is a self-contained Lucene index, which itself consists of multiple immutable segments. Rule of thumb: 20–50 GB per shard for log use cases. Too small → too many shards, high cluster state overhead. Too large → slow segment merges, slow shard recovery after node failure.

Example: 50 TB/day ingest on hot nodes (a mid-to-large fleet after tiering/compression)
50 GB per shard → 1,000 primary shards/day
Replicas: 1 → 2,000 physical shards/day
Over 7 hot days: 14,000 shards in cluster state

At the 432 TB/day compressed rate from the capacity section:
432 TB / 50 GB ≈ 8,600 primary shards/day → 17,200 physical shards/day
Over 7 days: ~120,000 shards — requires multiple clusters or aggressive tiering

At 10,000+ shards, Elasticsearch's master node can struggle. This is why large deployments split into multiple clusters, or use data streams (OpenSearch/Elasticsearch ≥7.9) which consolidate rollover bookkeeping.

Hot/warm/cold tiering and index lifecycle management

stateDiagram-v2
    [*] --> Hot: index created (rollover)
    Hot --> Warm: age > 7 days
    Warm --> Cold: age > 30 days
    Cold --> Frozen: age > 90 days
    Frozen --> Deleted: past retention window

    Hot: Hot tier\nSSD nodes, full replicas\nread + write
    Warm: Warm tier\nHDD nodes, 1 replica\nread only, force-merged
    Cold: Cold tier\nFully-mounted searchable snapshot\nall segment data stored locally on cold nodes, no replicas
    Frozen: Frozen tier\nPartially-mounted searchable snapshot\non-demand fetch from S3, shared partial cache

Tier	Storage	Replicas	Query latency	Cost vs. Hot
Hot	SSD (NVMe)	2	< 2 s	1× (baseline)
Warm	HDD (SATA)	1	5–30 s	~5–10× cheaper
Cold / Searchable snapshot	S3 / GCS, fully mounted on cold nodes (data stored locally)	0	10–60 s	~20–50× cheaper
Frozen	S3, partially cached (shared cache on frozen nodes)	0	30 s – minutes	~100× cheaper

Elasticsearch searchable snapshots come in two mounting modes. In the cold tier, indices are fully mounted: all segment data is stored locally on dedicated cold nodes (no on-demand S3 fetching), so queries are slower than warm but complete without remote round-trips. In the frozen tier, indices are partially mounted: only recently searched segments are kept in a shared local cache; cache misses trigger on-demand fetches from S3, which is why frozen queries can take 30 seconds or more. Both tiers need zero replicas — the snapshot in object storage is the durable copy.

The volume and cost problem

Logs grow with traffic, not with headcount. Without controls, log storage will exceed the compute budget eventually — and often sooner than teams expect.

Sampling

Not every log line is equally valuable. Drop rules and sampling are applied at the agent or at the Kafka consumer before indexing:

# Fluent Bit sampling config (pseudocode)
[FILTER]
    Name     throttle
    Match    kube.*
    Rate     1000          # pass at most 1000 events/sec per stream
    Window   5
    Action   drop

[FILTER]
    Name     grep
    Match    app.healthcheck
    Exclude  path /health   # drop health-check endpoint logs entirely

Structured sampling: keep 100 % of ERROR/WARN; sample INFO at 10 %; drop DEBUG entirely in production.

Head-based sampling: at the agent, randomly drop a fraction of events before they enter Kafka. Fast, but loses visibility into rare events.

Tail-based sampling: downstream consumer decides which events to keep based on their content (e.g., always keep events with error:true). More accurate, but requires buffering.

Field limits and mapping explosion

Elasticsearch dynamically maps new JSON fields — convenient but dangerous. A service that starts emitting request.headers.* as top-level fields can create thousands of mappings overnight, bloating the cluster state and degrading indexing throughput.

Mitigations: enforce strict index templates that define allowed fields explicitly and reject or route unknown fields to a generic labels keyword field. Alert when a new index's field count exceeds a threshold (e.g., 500). Disable dynamic mapping on write indices; only pre-approved fields are indexed.

Retention tiers as cost levers

A practical tiering policy for a mid-size company:

Tier	Retention	Who queries it?
Hot	Last 7 days	On-call engineers, dashboards, alerts
Warm	7–30 days	Incident post-mortems, compliance audits
Cold (S3 snapshot)	30 days – 1 year	Rare forensic queries, compliance
Deleted	> 1–2 years	—

Moving from "keep everything hot for 90 days" to this tier structure typically reduces storage costs by 5–10×.

Full architecture

flowchart TD
    SVC1[Service A<br/>container stdout] -->|"JSON logs"| AGT1[Fluent Bit agent<br/>node 1]
    SVC2[Service B<br/>legacy syslog] -->|"unstructured"| AGT2[Fluent Bit agent<br/>node 2]
    SVCN[Service N...] --> AGTN[Fluent Bit agent<br/>node N]

    AGT1 & AGT2 & AGTN -->|"produce, ack required"| KFK[Kafka<br/>N partitions, RF=3]

    KFK -->|"consumer group: indexers"| IDX[Indexer fleet<br/>Logstash / OS ingest node]
    KFK -->|"consumer group: alerts"| ALE[Alert evaluator<br/>stream processing]
    KFK -->|"consumer group: archive"| ARC[S3 archiver<br/>raw Parquet / ORC]

    IDX --> HOT[(Hot nodes<br/>SSD, today − 7d)]
    HOT -.ILM rollover.-> WARM[(Warm nodes<br/>HDD, 7 − 30d)]
    WARM -.ILM snapshot.-> COLD[(Cold: S3<br/>30d − 1yr)]

    ALE --> PGD[PagerDuty / Slack]
    ARC --> LAKE[(Data lake<br/>for ad-hoc analytics)]

    QFE[Query frontend<br/>Kibana / Grafana] --> COORD[Coordinating nodes]
    COORD --> HOT
    COORD --> WARM
    COORD -.searchable snapshot.-> COLD

    style KFK fill:#ff6b1a,color:#0a0a0f
    style HOT fill:#15803d,color:#fff
    style WARM fill:#0e7490,color:#fff
    style COLD fill:#a855f7,color:#fff
    style AGT1 fill:#ffaa00,color:#0a0a0f
    style AGT2 fill:#ffaa00,color:#0a0a0f
    style AGTN fill:#ffaa00,color:#0a0a0f
    style IDX fill:#ff2e88,color:#fff

Query path

A user submits:

service:payment AND level:ERROR AND @timestamp:[now-7d TO now]

The coordinating node:

Resolves the time range to a list of index names: logs-2026.05.26 through logs-2026.06.02.
Determines which are on hot nodes vs. warm nodes vs. cold snapshots.
Fans the query out to the primary (or replica) shards holding those indices.
Merges and sorts results; returns the top-N hits.

sequenceDiagram
    participant U as User (Kibana)
    participant C as Coordinating node
    participant H as Hot shard (today)
    participant W as Warm shard (3 days ago)
    participant S as S3 (cold snapshot)

    U->>C: query { service:payment, level:ERROR, last 7d }
    C->>H: scatter query (today's index)
    C->>W: scatter query (3-day-old index)
    C->>S: scatter query (cold — load from S3 cache)
    H-->>C: top 200 hits
    W-->>C: top 200 hits
    S-->>C: top 200 hits (slower — 30s)
    C-->>U: merged top 10 hits + total count

Query DoS (wide-range queries): a query over 365 days hits every index. If 100 engineers run such queries simultaneously, the cluster can become unresponsive for interactive queries. Route queries older than 30 days to a separate coordinating tier with lower priority, enforce a maximum time range in the query UI (e.g., cap at 90 days without explicit override), and circuit-break queries with an estimated shard count above a threshold.

Alerting on log patterns

Alerting runs as a separate consumer group on Kafka, before indexing. This is important: do not alert off search queries on Elasticsearch — polling every 30 seconds adds query load and introduces a 30-second minimum latency. Stream-process the Kafka topic instead.

Kafka topic → Flink / Kafka Streams job →
  window: 5 min tumbling
  group by: service, level
  filter: level = ERROR
  aggregate: count
  condition: count > 100 → emit alert

Stream processing over the log stream gives low-latency alerting: stateless per-event rules fire in seconds, while windowed aggregations (like the 5-minute tumbling window above) emit results as each window closes. For complex multi-line patterns (e.g., "error A followed by error B within 60 seconds on the same trace"), use a stateful CEP (Complex Event Processing) engine like Flink.

Multi-tenancy and access control

A shared logging cluster across teams requires isolation:

Layer	Mechanism
Index isolation	Separate index per team (`logs-team-payments-*`)
Query isolation	Kibana Spaces / tenant-scoped API keys
Document-level security	Elasticsearch DLS: filter by `team` field at query time
Write isolation	Each team's agent uses its own Kafka topic partition group; indexer routes to team's index
Cost attribution	Per-index storage and query metrics billed back to teams

Document-level security is flexible but carries a query-time cost (every query gets an implicit filter injected). For high-QPS tenants, index-level isolation (separate index per team) is cheaper.

Correlating logs with traces

Every log event should carry the distributed trace ID when one exists:

{"timestamp":"2026-06-01T14:23:11Z","level":"ERROR","service":"payment",
 "trace_id":"7b9a2c1d3e4f5a6b","span_id":"a1b2c3d4","user_id":42,
 "error":"upstream timeout","duration_ms":5001}

With trace_id as a keyword field, clicking it in Kibana jumps directly to Jaeger / Tempo / Zipkin to see the full distributed trace. In an alert, including the trace_id in the notification payload means on-call engineers land directly on the relevant trace without hunting.

This correlation is the backbone of the "three pillars of observability" (logs, metrics, traces) — each pillar becomes much more useful when they share a common identifier.

Storage choices summary

Data type	Store	Rationale
Raw log stream (in-flight)	Kafka (replicated, 48-hour retention)	Durable buffer; decouples agents from indexers
Hot log events (0–7 days)	Elasticsearch / OpenSearch on SSD	Fast full-text + field queries; p95 < 2 s
Warm log events (7–30 days)	Elasticsearch on HDD, force-merged	Reduced replica count; 5–10× cheaper than SSD
Cold log events (30 d – 1 yr)	Searchable snapshot on S3	Near-zero storage cost; queries load on demand
Raw archive (for replay / analytics)	S3 in Parquet / ORC format	Queryable by Athena / BigQuery; cheap long-term
Agent state (file offsets)	Local disk on each host	Survives agent restart; required for at-least-once
Index metadata / cluster state	Elasticsearch master nodes (in-memory)	Limits shard count; keep below ~10k shards/cluster

Failure modes

Indexer overload / backpressure

Symptom: Kafka consumer lag grows; Elasticsearch returns HTTP 429.

A deployment causes an ERROR storm; the fleet emits 10× normal log volume for 5 minutes. The lag accumulates in Kafka; agents keep producing; applications are unaffected. Once the storm subsides, indexers process the backlog. No events lost.

Where it fails: if the storm persists and Kafka's retention window (e.g., 48 hours) is exhausted, the oldest events are evicted before indexing. Prevention: autoscale the indexer fleet based on consumer lag, or increase Kafka retention for the log topics.

Index hotspots

Symptom: one shard in today's index receives 80 % of writes; that node's CPU pegs at 100 %.

Usually caused by all agents routing to the same Kafka partition, or Elasticsearch's indexing being imbalanced across shards. Fix: increase Kafka partition count; use a hash of host_id as the partition key for uniform distribution. Elasticsearch primary shard count is fixed at index creation — size shards correctly upfront.

Mapping explosion

Symptom: cluster state grows to GB; indexing throughput drops; master OOM-kills.

A service started emitting deeply nested dynamic JSON (e.g., HTTP request headers as first-class fields). Fix: enforce dynamic: strict on index templates; route unknown fields to a generic string field. Reindex affected indices after fixing the mapping.

Expensive wide-range queries (query DoS)

Symptom: one grep-style query over a 90-day range consumes all coordinating node resources, timing out concurrent queries.

Rate-limit queries at the API gateway; enforce max time range in the query UI; route historical queries to a dedicated coordinating tier with resource limits.

Cost explosion

Symptom: log storage bill doubles quarter-over-quarter with no corresponding traffic increase.

A noisy service (e.g., a batch job emitting DEBUG for every record processed) started logging heavily. Fix: per-service ingest rate dashboards; alerts when a service's log volume exceeds a budget; sampling and drop rules deployed at the agent within minutes.

Things to discuss in an interview

Why Kafka and not a direct agent → Elasticsearch connection? The buffer is the key insight. Draw the failure scenario without it.
Why time-based indices? Tie it to query pruning, cheap retention management, and tiering alignment. An interviewer who sees "daily indices" and doesn't hear "because time range prunes index selection automatically" will mark it down.
Inverted index internals: you don't need to know Lucene segment merging deeply, but knowing that full-text search is a posting-list intersection (not a scan) and that doc_values enables fast field-level aggregations differentiates senior candidates.
Mapping explosion: surprisingly common in production, rarely mentioned by candidates. Mentioning it signals operational experience.
Logs vs. metrics: if asked to design an "observability system," always clarify: are we building logging, metrics, or tracing? Each has fundamentally different storage requirements. Metrics monitoring uses a time-series database (Prometheus, Thanos), not an inverted index.
Cost as a first-class concern: at scale, log storage often costs more than compute. Mentioning tiering, sampling, and retention policies distinguishes candidates who've operated these systems from those who've only read about them.
Trace correlation: mentioning trace_id as a structured field and linking to the distributed tracing system shows end-to-end observability thinking.

Things you should now be able to answer

Why does Kafka sit between agents and Elasticsearch, and what failure mode does it prevent?
What is an inverted index, and why does it make full-text search fast?
Why are log indices organized by day rather than by service or severity?
What happens to shard count and cluster stability when you keep all data on hot nodes for 90 days?
How does a time-range query avoid scanning all indices?
What is mapping explosion, and how do you prevent it?
How do you prevent a single wide-range query from degrading interactive dashboards?
How is a log aggregation system architecturally different from a metrics monitoring system?

Frequently asked questions

▸Why does a log aggregation pipeline put Kafka between collection agents and Elasticsearch?

Without Kafka, Elasticsearch applies backpressure via HTTP 429 when its indexing queue fills, and agents must either block or drop events. Kafka's durable retention window (typically 24-48 hours) lets indexers fall behind during traffic spikes and catch up without losing a single event; consumer group offset tracking means an indexer restart resumes exactly where it left off.

▸What is an inverted index and why does it make log search fast?

Lucene tokenizes each log line and builds a mapping from every token to the sorted list of document IDs containing it (a posting list). A query like 'payment AND timeout' intersects two posting lists in O(x + y) time proportional to the matching list lengths, not the total corpus size, so it returns results across hundreds of millions of documents in milliseconds rather than scanning raw bytes.

▸Why organize log indices by day rather than by service or severity?

Time-based daily indices enable automatic query pruning: a 'last 2 hours' query touches only today's index, never the terabytes on warm or cold tiers. They also make retention management trivial (deleting 30-day-old data is a single index drop, not per-document deletions) and align perfectly with ILM tiering, since an entire index moves atomically to warm or cold nodes.

▸What is mapping explosion in Elasticsearch and how do you prevent it?

Elasticsearch dynamically maps every new JSON field it sees; a service that suddenly emits HTTP request headers as top-level fields can create thousands of new mappings overnight, bloating cluster state and degrading indexing throughput to the point of master node OOM. Prevention requires enforcing strict index templates that define allowed fields explicitly and routing unknown fields to a single generic keyword field, combined with alerts when a new index's field count exceeds a threshold such as 500.

▸How much cheaper is cold object storage compared to keeping logs on hot SSD nodes?

Cold searchable snapshots on S3 or GCS cost roughly 20-50 times less than hot SSD nodes, and the frozen tier (partially cached, on-demand S3 fetch) costs approximately 100 times less. Moving from 90-day hot retention to the full hot-warm-cold tier structure typically reduces storage costs by 5-10 times for a mid-size company.

← previous

Design an AI Agent Platform

Design a Large-Scale Data Pipeline (ETL / Batch + Streaming)

// RELATED

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform