~/articles/design-ad-click-aggregator

◆◆◆Advancedasked at Metaasked at Googleasked at Amazonasked at TikTok

Design an Ad Click Aggregator (real-time analytics)

Q: What is a watermark in the context of Flink stream processing?

A watermark is the stream processor's signal that all events with a timestamp earlier than a given value have arrived. Flink advances it as max_seen_event_time minus an allowed lateness; with a 5-minute allowed lateness, events up to 5 minutes late are folded into the correct window, while events arriving later are emitted to a side output and handled by the Spark batch path instead.

Q: Why can't the billing service read directly from the Flink streaming output?

Three failure modes make the stream path unsuitable for billing: Flink recovers from checkpoints by replaying Kafka, potentially double-writing partial window counts; late events past the watermark are silently dropped from streaming aggregation; and the bloom-filter dedup at the collector has a non-zero false-positive rate, meaning some legitimate clicks are discarded before reaching Kafka. Only the Spark batch job, which does exact dedup via GROUP BY event_id over the full S3 raw log, is suitable for charging advertisers.

Q: When should you choose a lambda architecture over a kappa architecture for ad click aggregation?

Choose lambda when the approximation logic and the exactness logic are genuinely different. In this system the stream path uses a probabilistic bloom-filter dedup optimized for latency, while the batch path uses full GROUP BY event_id dedup optimized for correctness — different enough that maintaining two codebases is worthwhile. Kappa is preferable when the same aggregation logic serves both purposes and you can achieve correctness simply by replaying the event log through the stream processor.

Q: How does key salting prevent a viral ad from collapsing one Flink partition?

The partition key is computed as ad_id plus a salt derived from event_id_hash modulo S (typically 8 or 16), spreading a single ad's events across S partitions. A two-stage Flink topology then sums the partial counts from each salted partition at window-close time before writing to the OLAP store, adding roughly one intra-cluster network round trip of extra latency but preventing any single task from being overwhelmed.

Q: What are the raw storage numbers for this system at 1 million events per second?

At sustained 1 M events/sec with 500 bytes per event, Kafka ingests 500 MB/s on average and needs about 100 partitions. S3 accumulates 43 TB/day uncompressed; stored as Snappy-compressed Parquet at roughly 6x compression, that is about 7 TB/day, or 210 TB over the 30-day retention window. ClickHouse minute-level rows account for roughly 345 GB/day raw, collapsing to a few GB/day after rollups and compression.

Ingest billions of ad events, serve per-minute metrics in near-real-time, and produce exact totals for billing — the canonical streaming + lambda/kappa problem.

21 min read2026-05-24Ironclad Academy

#interview #streaming #analytics #big-data

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Google runs roughly 8.5 billion ad auctions per day. Every time a user clicks a banner on a publisher site, that click event needs to be counted — accurately enough to charge the advertiser, quickly enough to update a live dashboard, and reliably enough to survive crashes, retries, and mobile clients that connect hours after the click actually happened. The system that does this is an ad click aggregator.

At its core the job is simple: receive events, count them, serve the counts. Meta's ad platform, Google Ads, and TikTok's commercial feed all run systems of this shape. The volume is not the interesting part — Kafka handles ingestion volume easily. The interesting part is that two very different consumers need those counts. A campaign dashboard refreshing every 10 seconds can tolerate a count that's 99% right. A billing system that charges an advertiser for 1.4 million clicks per month cannot: overcount by 0.1% and you issue mass refunds; undercount and you lose real revenue. One consumer wants speed, the other wants exactness, and they are in direct tension.

This tension is the core engineering problem. A pure streaming system gives you speed but not exactness — stream processors lose events during crash recovery, and late-arriving events (a mobile click buffered while underground) silently miss their window. A pure batch system gives you exactness but not speed — running a full Spark job every 10 seconds is wildly impractical at terabyte scale. The canonical answer is two parallel paths over a single durable event log: one fast and approximate, one slow and exact.

The secondary challenges are just as real: a Super Bowl ad can drive 100× normal click volume through a single partition and overwhelm one stream-processor task; bots and click-fraud actors inflate counts before they're ever stored; and duplicate events from retry-happy mobile SDKs can turn one real click into two billing records. Getting this system right in an interview means addressing all of it — not just the happy path.

Functional requirements

Ingest ad click and impression events from web, mobile, and third-party ad networks.
Serve "clicks per ad per minute" for real-time dashboards; acceptable latency: a few seconds.
Produce exact click counts per campaign for billing cycles (daily/monthly); no double-counting.
Allow querying aggregates at minute, hour, and day granularity.
Filter invalid/fraudulent clicks before they enter counts.
Late events (up to 1 hour delayed) must be reflected in billing totals.

Non-functional requirements

Throughput: 1 M events/sec sustained; 3× peak.
Dashboard query latency: < 1 s for per-ad, per-minute counts over a rolling 24-hour window.
Billing accuracy: zero tolerance for double-counting; late events must be included.
Durability: no event loss after the collector acknowledges receipt.
Fault tolerance: stream processor crash must not lose aggregated state or require manual intervention.
Retention: raw events 30 days; aggregates 2 years.

Capacity estimation

Dimension	Estimate	How we got there
Peak event rate	3 M events/sec	3× headroom over 1 M/sec sustained
Per-event size	~500 bytes	`event_id, ad_id, campaign_id, user_id, timestamp, IP, user_agent, click_type, placement`
Ingestion bandwidth (peak)	1.5 GB/s	`3 M × 500 B`
Kafka throughput (avg)	500 MB/s	`1 M events/sec × 500 B`
Kafka retention (raw, pre-replication)	~300 TB	`500 MB/s × 86 400 s × 7 days`
Kafka retention (with 3× replication)	~900 TB	`300 TB × 3`
Kafka partitions	~100	`500 MB/s ÷ 25 MB/s per partition = 20` at avg; `×3 headroom for 3 M/sec peak (1 500 MB/s ÷ 25 MB/s = 60 partitions minimum)`; round up to 100 for operational flexibility and Flink parallelism match
S3 raw archive (uncompressed/day)	43 TB/day	`86.4 B events × 500 B`
S3 raw archive (Snappy/Parquet, ~6× compression)	~7 TB/day	`43 TB ÷ 6`
S3 raw archive (30-day window)	~210 TB	`7 TB/day × 30`
ClickHouse minute-level rows/day (raw)	~345 GB/day	`10 M ads × 1 440 min/day × 24 B/row` (8 B count + 8 B ad_id + 8 B bucket_ts)
ClickHouse after rollups + compression	a few GB/day	Rollup to hour/day granularity; fits in a few TB over 2-year retention
Dashboard QPS	5 000 QPS	`50 000 concurrent ads × 1 query / 10 s`
Billing QPS	negligible	Batch reads, few queries/day; not latency-sensitive

Takeaway: At 1 M events/sec sustained (3 M peak), the system ingests 500 MB/s into ~100 Kafka partitions, archives 43 TB/day uncompressed (~7 TB/day compressed) to S3, and keeps ClickHouse aggregates well under 1 TB/day — storage is dominated by the raw S3 archive.

Building up to the design

V1: A single API writing to Postgres

Every click hits one endpoint, which writes to a clicks table. A dashboard queries SELECT COUNT(*) FROM clicks WHERE ad_id=? AND ts BETWEEN ? AND ?.

This works at demo scale. At 1 M events/sec, though, a single Postgres can't keep up with writes (realistic transactional throughput — individual row inserts with full ACID guarantees and WAL fsync — tops out around tens of thousands per second on typical hardware; bulk COPY is faster but not what a per-click write path does), queries contend with ingestion, and no amount of indexing makes COUNT(*) over billions of rows sub-second.

V2: Decouple ingestion from storage with Kafka

A stateless collector fleet accepts events and publishes them to Kafka, which buffers millions of events/sec durably. Downstream consumers read at their own pace.

The collector and storage are now decoupled. Kafka acts as both the shock absorber and the replay log — if a downstream consumer falls behind or crashes, it can rewind and re-read. What this doesn't solve: something still has to read Kafka and compute aggregates. A simple consumer that increments Redis counters is fast but has no notion of time windows — it cannot answer "clicks in the minute between 14:03:00 and 14:04:00."

V3: Add a stream processor for windowed aggregation

A stream processor (Flink or Kafka Streams) reads Kafka, assigns events to 1-minute tumbling windows keyed by ad_id, and emits window results to the serving store. Now the dashboard query is a point-lookup: "give me the row for ad_id=X, minute=14:03."

Sub-second dashboard reads, horizontal scale, back-pressure handled by Kafka. But there's a problem that surfaces once you start running this in production: stream processor crashes. Flink checkpoints its window state every N seconds using a configurable state backend (EmbeddedRocksDBStateBackend is the production recommendation for large state; HashMapStateBackend is the default but limited by heap) and writes the checkpoint data to a durable remote store such as S3 or HDFS. On recovery it replays Kafka from the saved offset. Events that arrived between the last checkpoint and the crash are replayed — potentially double-written to the OLAP store. And a click arriving 10 minutes late (mobile phone in a tunnel) may land after its window has already been emitted and closed.

V4: Add a batch reprocessing path for billing correctness

Instead of trying to make the stream path exactly-correct (extremely hard), accept that the stream path is approximate and add a second path: the raw S3 event archive. A Spark (or Trino/BigQuery) job reads the full day's raw events, deduplicates by event_id, and writes exact counts. Billing reads those exact rows.

This is the lambda architecture: two paths, one raw truth, one fast path, one slow-but-exact path. The modern alternative (kappa) uses a single stream that you replay for correctness. The trade-offs are real — more on this in the lambda vs. kappa section below.

V5: Deduplication, hot-key mitigation, fraud filtering

Add per-event deduplication at the collector, two-stage aggregation for hot ad IDs, and a fraud-filtering stage inline before events land in the main Kafka topic.

flowchart LR
    V1["V1: Postgres write-per-click\n~50k events/sec cap"] --> V2["V2: + Kafka ingestion\n1M+ events/sec, decoupled"]
    V2 --> V3["V3: + Flink windows\nnear-real-time counts"]
    V3 --> V4["V4: + S3 archive + Spark\nbilling-grade exact counts"]
    V4 --> V5["V5: + dedup + hot-key fix\n+ fraud filter\nproduction-ready"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

Event schema

Every event must carry a globally unique event_id (used for dedup), the event's own timestamp (for event-time windowing), and enough dimensions for grouping.

{
  "event_id":    "01J9K4X7ZP-abcd-1234-ef56",  // UUID or ULID
  "event_type":  "click",                         // or "impression"
  "ad_id":       "ad_8371920",
  "campaign_id": "cmp_10042",
  "user_id":     "u_9182736",
  "session_id":  "sess_abc123",
  "timestamp":   "2026-05-24T14:03:27.412Z",     // event time (client-generated)
  "received_at": "2026-05-24T14:03:27.891Z",     // processing time (collector-stamped)
  "ip":          "198.51.100.42",
  "user_agent":  "Mozilla/5.0 ...",
  "placement":   "homepage_banner"
}

event_id is the deduplication key. timestamp is the event time used for windowing. received_at lets us measure how late an event is (lag = received_at - timestamp).

Detailed architecture

flowchart TD
    SDK[Web/Mobile SDKs\nThird-party ad networks] --> COL[Collector Fleet\nstateless, horizontally scaled]

    COL -->|deduplicate event_id\nwrite to Kafka| KAFKA_RAW[Kafka: raw-events\n~100 partitions, key=ad_id]
    COL -->|archive all events| S3_RAW[(S3 / GCS\nraw Parquet archive\n30-day retention)]

    KAFKA_RAW --> FRAUD[Fraud Filter\nFlink job\nIP rate-limit, bot UA, click velocity]
    FRAUD -->|clean events| KAFKA_CLEAN[Kafka: clean-events]

    KAFKA_CLEAN --> FLINK[Flink Streaming\n1-min tumbling windows\nevent-time + watermarks]

    FLINK -->|pre-aggregated counts| OLAP[(ClickHouse / Druid\nrollup: minute → hour → day)]

    S3_RAW --> SPARK[Spark Batch\nhourly / daily reprocess\nexact dedup by event_id]
    SPARK -->|exact counts upsert| OLAP

    OLAP --> DASH[Dashboard API\n< 1s query latency]
    OLAP --> BILL[Billing Service\nreads exact-count rows only]

    style COL fill:#ff6b1a,color:#0a0a0f
    style KAFKA_RAW fill:#a855f7,color:#fff
    style KAFKA_CLEAN fill:#a855f7,color:#fff
    style FLINK fill:#15803d,color:#fff
    style OLAP fill:#0e7490,color:#fff
    style S3_RAW fill:#ffaa00,color:#0a0a0f
    style SPARK fill:#ff2e88,color:#fff
    style FRAUD fill:#15803d,color:#fff

The diagram shows the two-path split clearly. Every event hits the collector and fans out: one path flows through Kafka into Flink for near-real-time counts, the other lands in S3 for later batch processing. The OLAP store serves both consumers from the same table — differentiated by the is_exact flag on each row.

Stream processing deep dive

Event time vs. processing time

A click happens at 14:03:27 on a user's phone. The phone is underground, so the event doesn't reach the collector until 14:07:12. Which timestamp should the stream processor use?

Concept	Definition	When it matters
Event time	Timestamp on the event itself (when the click actually happened)	Always — it's what the advertiser cares about
Processing time	When the event arrives at the stream processor	Internal metrics, but wrong for ad aggregation
Ingestion time	When the event enters Kafka	A compromise; better than processing time, worse than event time

Aggregations must be keyed on event time. A click at 14:03:45 belongs in the 14:03 window, even if the stream processor sees it at 14:07.

Watermarks and late events

The problem is that the stream processor doesn't know when all events for a given minute have arrived. It needs a signal: "I believe all events with timestamp < W have now arrived." That signal is called a watermark.

Flink advances the watermark as events flow in, typically as max_seen_event_time - allowed_lateness. With an allowed lateness of 5 minutes, events up to 5 minutes late are accepted and folded into the correct window. Events more than 5 minutes late are emitted as a side output (a late data stream) — the stream path ignores them for approximate counts. The batch reprocessing path handles events that are hours or days late.

sequenceDiagram
    participant K as Kafka
    participant F as Flink
    participant O as OLAP Store

    K->>F: event ts=14:03:27 (on time)
    K->>F: event ts=14:01:45 (2 min late)
    Note over F: watermark now at ~14:02\nwindow 14:01 still open (within 5 min lateness)
    F->>O: window 14:01 emitted (preliminary)
    K->>F: event ts=14:01:12 (7 min late — past watermark)
    F-->>F: side output: late event
    Note over F: late event NOT in streaming count
    Note over O: Spark batch will pick this up from S3

Tumbling windows and pre-aggregation

Flink groups events into non-overlapping 1-minute tumbling windows keyed by (ad_id, window_start). Within each window, events are counted locally per Flink task, then a final aggregation step merges partial counts from all tasks before writing to the OLAP store.

This two-level aggregation matters at scale. With 1 M events/sec and 100 Flink tasks, each task handles ~10 000 events/sec. Local counts within a task produce a single number per (ad_id, window); only that number crosses the network at window close time, not raw events.

Output schema written to ClickHouse per window close:

INSERT INTO ad_clicks_1min
  (ad_id, campaign_id, window_start, click_count, impression_count, is_exact)
VALUES
  ('ad_8371920', 'cmp_10042', '2026-05-24 14:03:00', 1423, 8192, false);

is_exact = false for stream-path rows. The Spark batch job writes is_exact = true rows (or upserts existing rows). Billing queries WHERE is_exact = true.

Lambda vs. kappa architecture

This problem is the textbook case for choosing between them.

Dimension	Lambda (two paths)	Kappa (single stream)
Correctness	Batch path is exact ground truth	Stream path must be re-run for corrections
Latency	Streaming path is low-latency; batch is hours	Single path can be both with careful design
Complexity	Two codebases to maintain, two outputs to reconcile	One codebase; reprocessing is just replaying Kafka/S3
Reprocessing	Spark batch reads S3; straightforward	Replay the entire event log through the stream processor from offset 0
Late events	Batch path naturally handles them	Requires long Kafka retention or reading from S3
When to use	Different logic needed for approx vs. exact	Same aggregation logic; correctness via replay

The two topologies look like this side by side:

flowchart LR
    subgraph LAMBDA["Lambda (two paths)"]
        direction TB
        KL[Kafka] --> FL[Flink\napprox counts]
        KL --> SL[(S3 raw log)]
        FL --> OL[(OLAP\nis_exact=false)]
        SL --> SPL[Spark\nexact dedup]
        SPL --> OL2[(OLAP\nis_exact=true)]
    end
    subgraph KAPPA["Kappa (single stream)"]
        direction TB
        KK[Kafka\nlong retention] --> FK[Flink\nor replay job]
        FK --> OK[(OLAP)]
        KK -.replay from offset 0.-> FK
    end
    style FL fill:#15803d,color:#fff
    style SPL fill:#ff2e88,color:#fff
    style FK fill:#ff6b1a,color:#0a0a0f
    style KL fill:#a855f7,color:#fff
    style KK fill:#a855f7,color:#fff
    style SL fill:#ffaa00,color:#0a0a0f

For billing, lambda is the safer choice. The batch and stream paths can use different dedup and aggregation logic because their requirements are genuinely different — the stream path optimizes for latency (relaxed dedup with bloom filter); the batch path optimizes for correctness (exact dedup by scanning event_id in the S3 partition).

Why billing cannot trust the streaming path alone:

Flink crashes and recovers from a checkpoint — on restart, Flink re-reads Kafka from the checkpointed offset, but the already-emitted partial window counts are not automatically corrected. Late events past the watermark are silently dropped from streaming aggregation. Network retries from collectors can produce duplicate events; bloom-filter dedup is probabilistic (non-zero false-positive rate — a new, unique event_id is incorrectly classified as a duplicate and dropped). Bloom filters have no false negatives, but false positives mean some legitimate events are silently discarded before Kafka.

All three causes lead to undercounting or overcounting that the stream path alone cannot self-detect.

Deduplication

Collectors retry on network failure. A click may arrive 2–5 times. The event_id is the idempotency key, and the defense is layered.

At the collector (best-effort, fast): A bloom filter per 5-minute bucket holds event_ids seen in that window. Before publishing to Kafka, check the bloom. The filter is tuned to a < 0.01% false-positive rate — meaning at most 0.01% of legitimate, unique events are incorrectly flagged as duplicates and dropped. Because the bloom filter has no false negatives, every true duplicate (an event_id already inserted) is caught with certainty and blocked from reaching Kafka.

At the batch path (exact, slow): The Spark job reads all S3 files for the day, groups by event_id, and takes one row per event. GROUP BY event_id on Parquet with predicate pushdown on date partition is fast even over terabytes.

Idempotent writes to the OLAP store: Both Flink and Spark write with upsert semantics. In ClickHouse, the ReplacingMergeTree engine deduplicates inserts with the same primary key at merge time; use the FINAL keyword on reads to force dedup before the query returns. In Druid, enable rollup at ingestion time and include event_id as a dimension; a count metric over the rollup group naturally deduplicates within a segment's granularity.

Hot-key skew

A viral ad — say, a Super Bowl spot — can drive 100× normal click volume through one ad_id. If Kafka is partitioned by ad_id % N, one partition becomes 100× hotter and one Flink task ends up drowning.

The fix is key salting combined with two-stage aggregation. In Stage 1, salt the partition key so events for the same ad spread across S partitions:

partition_key = ad_id + "_" + (event_id_hash % S)   // S = 8 or 16

Each Flink partition computes a partial count independently. In Stage 2, a second Flink operator reads partial counts keyed by ad_id and sums them at window-close time before writing to the OLAP store. This adds a small amount of extra latency — roughly one intra-cluster network round trip, typically under 10–20 ms depending on hardware — but prevents a hot partition from stalling the entire topology.

flowchart LR
    AD[Hot ad_id: 100k events/min] --> P1[Partition ad_X_0\n12 500 events/min]
    AD --> P2[Partition ad_X_1\n12 500 events/min]
    AD --> P3[Partition ad_X_2\n12 500 events/min]
    AD --> P4[Partition ad_X_3..7\n...]
    P1 --> MERGE[Merge operator\nsum partial counts]
    P2 --> MERGE
    P3 --> MERGE
    P4 --> MERGE
    MERGE --> OLAP[(OLAP Store)]
    style AD fill:#ff6b1a,color:#0a0a0f
    style MERGE fill:#15803d,color:#fff
    style OLAP fill:#0e7490,color:#fff

Fraud and invalid click filtering

A bot can inflate an advertiser's click count (click fraud) or a competitor's (ad sabotage). Filtering before counting is critical, and the right place to do it is in a separate Flink job between raw-events and clean-events — not at the collector where the logic is harder to update and audit.

An inline Flink fraud-filter job reads from raw-events and publishes clean events to clean-events, dropping or flagging:

Signal	Rule	Action
IP click velocity	> 100 clicks/min from same IP	Drop
User agent	Known bot UA strings	Drop
Click-impression gap	Click without preceding impression in last 24h	Flag / drop
Repeated user+ad pair	Same user_id clicks same ad_id > N times/hour	Drop duplicates beyond threshold
Geographic anomaly	IP geo vs. campaign target region mismatch	Flag for human review

Flagged events are written to a separate suspect-events Kafka topic for offline analysis. Advertisers can dispute counts and request a re-audit from the S3 raw log — with suspect events excluded or included, depending on the outcome.

Serving store: ClickHouse / Druid

Both are OLAP databases designed for aggregation queries on append-heavy time-series data.

Feature	ClickHouse	Apache Druid
Ingestion	Kafka consumer built-in; batch INSERT	Kafka supervisor; batch ingestion
Rollup	Materialized views; `AggregatingMergeTree`	Native rollup at ingestion time
Query language	Full SQL	Druid SQL (subset) or native JSON
Real-time ingestion	Yes (Kafka table engine)	Yes (real-time ingestion segments)
Dedup on write	`ReplacingMergeTree`	Per-segment uniqueness
Horizontal scale	Distributed tables, sharding	Segment-based replication
Latency (p99 agg query)	< 100 ms on indexed columns	< 100 ms
Operational complexity	Lower (single binary family)	Higher (many components)

Either is correct. ClickHouse is typically simpler operationally; Druid is battle-tested at very large deployments. Pinterest uses Druid specifically for ad analytics (documented in their engineering blog). Airbnb uses Druid for general real-time platform analytics and dashboards, not specifically for ad analytics. Other large-scale Druid users include Wikimedia, Netflix, and Lyft (listed on druid.apache.org/druid-powered).

Rollup tiers

Store minute-level data for 7 days, hour-level for 90 days, day-level for 2 years. A materialized view rolls up minute rows to hour rows nightly.

-- ClickHouse: minute-level table
CREATE TABLE ad_clicks_1min (
    ad_id        String,
    campaign_id  String,
    window_start DateTime,
    click_count  UInt64,
    impression_count UInt64,
    is_exact     UInt8
) ENGINE = ReplacingMergeTree(is_exact)
ORDER BY (ad_id, window_start);

-- Hour-level rollup via a materialized view over the minute table
CREATE TABLE ad_clicks_1hour (
    ad_id            String,
    campaign_id      String,
    window_start     DateTime,   -- truncated to the hour boundary
    click_count      AggregateFunction(sum, UInt64),
    impression_count AggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY (ad_id, window_start);

CREATE MATERIALIZED VIEW ad_clicks_1hour_mv TO ad_clicks_1hour AS
SELECT
    ad_id,
    campaign_id,
    toStartOfHour(window_start) AS window_start,
    sumState(click_count)       AS click_count,
    sumState(impression_count)  AS impression_count
FROM ad_clicks_1min
WHERE is_exact = 1   -- roll up only exact rows for billing; include 0 for dashboards
GROUP BY ad_id, campaign_id, window_start;

Failure modes

Failure	Symptom	Mitigation
Collector crash mid-flight	Events in flight are lost	SDK retries with exponential backoff; `event_id` dedup on replay
Kafka partition leader failure	Ingest stalls for seconds	Kafka leader election (<30 s typical); collector buffers locally
Flink job crash	Streaming counts stall; window state is partially computed	Flink restarts from last checkpoint; replays Kafka from saved offset
Flink emits partial window on restart	OLAP row contains a lower-than-correct count	`is_exact=false` flag; billing ignores these rows; Spark batch overwrites
Late event past watermark	Event excluded from streaming count	Written to S3; Spark batch includes it in exact counts
OLAP store slow / backpressure	Flink's output buffer fills; checkpoint latency rises	Flink applies backpressure to Kafka consumer; Kafka buffers the delay
Duplicate events in billing	Advertiser overcharged	Spark `GROUP BY event_id` dedup before aggregation; idempotent OLAP writes
Hot partition	One Flink task overwhelmed; checkpoint lag spikes	Key salting + two-stage aggregation
S3 write failure	Raw event not archived; batch path loses coverage	Collector retries S3 write; fallback: read directly from Kafka if within retention

Storage choices summary

Data	Store	Retention	Why
Raw events (durable log)	Kafka (7 days) + S3 Parquet (30 days)	30 days	Replay for reprocessing; S3 for long-term
Streaming aggregates	ClickHouse (minute/hour)	7 days / 90 days	Low-latency dashboard reads
Exact billing aggregates	ClickHouse (same table, `is_exact=true`)	2 years	Billing audit trail
Fraud suspect events	Kafka topic + S3	90 days	Dispute resolution
Bloom filter state	Redis (per 5-min bucket with TTL)	10 minutes	Collector-level dedup
Flink checkpoints	S3 / HDFS	Last 3 checkpoints	Job recovery

Things to discuss in an interview

Why two paths? The stream path is inherently approximate — late events, at-least-once delivery, probabilistic dedup. Billing requires exact counts. The only way to be exact is to read the complete, deduplicated raw event log, which is the batch path.
Lambda vs. kappa trade-off. Kappa simplifies the codebase (one streaming job that you replay for corrections). Lambda is safer when the approximation logic and exact logic are genuinely different, as they are here. Be explicit about which you'd choose and why.
Watermarks and lateness. An interviewer will ask what happens when a mobile click arrives 20 minutes late. Your answer: the streaming path silently misses it (past watermark); the Spark batch picks it up from S3 and corrects the billing row. Streaming count stays approximate.
Hot-key skew. The interviewer may say "a Super Bowl ad gets 10 M clicks in one minute." Your answer: salt the partition key, do two-stage aggregation in Flink, merge before writing to the OLAP store.
Dedup and billing. Double-counting is the nightmare scenario. Explain the three-layer defense: bloom filter at the collector, GROUP BY event_id in Spark, and idempotent upsert writes to the OLAP store.
Fraud filtering placement. Place it in a separate Flink job between raw-events and clean-events so the filtering logic can be updated and re-run independently. Do not do it at the collector — too close to the network edge, harder to update and audit.

Things you should now be able to answer

Why can't the billing service simply read the streaming (Flink) output?
What is a watermark, and what does it control in a windowed aggregation?
What happens to a click event that arrives 2 hours after it was generated?
Why is the partition key salted, and how does two-stage aggregation compensate?
What does ReplacingMergeTree in ClickHouse give you, and when do you need FINAL?
Why is the bloom filter used at the collector rather than inside Flink?
What is the difference between a tumbling window and a sliding window, and which is appropriate here?

Frequently asked questions

▸What is a watermark in the context of Flink stream processing?

A watermark is the stream processor's signal that all events with a timestamp earlier than a given value have arrived. Flink advances it as max_seen_event_time minus an allowed lateness; with a 5-minute allowed lateness, events up to 5 minutes late are folded into the correct window, while events arriving later are emitted to a side output and handled by the Spark batch path instead.

▸Why can't the billing service read directly from the Flink streaming output?

Three failure modes make the stream path unsuitable for billing: Flink recovers from checkpoints by replaying Kafka, potentially double-writing partial window counts; late events past the watermark are silently dropped from streaming aggregation; and the bloom-filter dedup at the collector has a non-zero false-positive rate, meaning some legitimate clicks are discarded before reaching Kafka. Only the Spark batch job, which does exact dedup via GROUP BY event_id over the full S3 raw log, is suitable for charging advertisers.

▸When should you choose a lambda architecture over a kappa architecture for ad click aggregation?

Choose lambda when the approximation logic and the exactness logic are genuinely different. In this system the stream path uses a probabilistic bloom-filter dedup optimized for latency, while the batch path uses full GROUP BY event_id dedup optimized for correctness — different enough that maintaining two codebases is worthwhile. Kappa is preferable when the same aggregation logic serves both purposes and you can achieve correctness simply by replaying the event log through the stream processor.

▸How does key salting prevent a viral ad from collapsing one Flink partition?

The partition key is computed as ad_id plus a salt derived from event_id_hash modulo S (typically 8 or 16), spreading a single ad's events across S partitions. A two-stage Flink topology then sums the partial counts from each salted partition at window-close time before writing to the OLAP store, adding roughly one intra-cluster network round trip of extra latency but preventing any single task from being overwhelmed.

▸What are the raw storage numbers for this system at 1 million events per second?

At sustained 1 M events/sec with 500 bytes per event, Kafka ingests 500 MB/s on average and needs about 100 partitions. S3 accumulates 43 TB/day uncompressed; stored as Snappy-compressed Parquet at roughly 6x compression, that is about 7 TB/day, or 210 TB over the 30-day retention window. ClickHouse minute-level rows account for roughly 345 GB/day raw, collapsing to a few GB/day after rollups and compression.

← previous

Design an Object Storage Service (S3)

Design a Payment System (Stripe-style)

// RELATED

Design an Ad Click Aggregator (real-time analytics)

The problem

Functional requirements

Non-functional requirements

Capacity estimation

Building up to the design

V1: A single API writing to Postgres

V2: Decouple ingestion from storage with Kafka

V3: Add a stream processor for windowed aggregation

V4: Add a batch reprocessing path for billing correctness

V5: Deduplication, hot-key mitigation, fraud filtering

Event schema

Detailed architecture

Stream processing deep dive

Event time vs. processing time

Watermarks and late events

Tumbling windows and pre-aggregation

Lambda vs. kappa architecture

Deduplication

Hot-key skew

Fraud and invalid click filtering

Serving store: ClickHouse / Druid

Rollup tiers

Failure modes

Storage choices summary

Things to discuss in an interview

Things you should now be able to answer

Further reading

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform