Design an Ad Click Aggregator (real-time analytics)
Ingest billions of ad events, serve per-minute metrics in near-real-time, and produce exact totals for billing — the canonical streaming + lambda/kappa problem.
The problem
Google runs roughly 8.5 billion ad auctions per day. Every time a user clicks a banner on a publisher site, that click event needs to be counted — accurately enough to charge the advertiser, quickly enough to update a live dashboard, and reliably enough to survive crashes, retries, and mobile clients that connect hours after the click actually happened. The system that does this is an ad click aggregator.
At its core the job is simple: receive events, count them, serve the counts. Meta's ad platform, Google Ads, and TikTok's commercial feed all run systems of this shape. The volume is not the interesting part — Kafka handles ingestion volume easily. The interesting part is that two very different consumers need those counts. A campaign dashboard refreshing every 10 seconds can tolerate a count that's 99% right. A billing system that charges an advertiser for 1.4 million clicks per month cannot: overcount by 0.1% and you issue mass refunds; undercount and you lose real revenue. One consumer wants speed, the other wants exactness, and they are in direct tension.
This tension is the core engineering problem. A pure streaming system gives you speed but not exactness — stream processors lose events during crash recovery, and late-arriving events (a mobile click buffered while underground) silently miss their window. A pure batch system gives you exactness but not speed — running a full Spark job every 10 seconds is wildly impractical at terabyte scale. The canonical answer is two parallel paths over a single durable event log: one fast and approximate, one slow and exact.
The secondary challenges are just as real: a Super Bowl ad can drive 100× normal click volume through a single partition and overwhelm one stream-processor task; bots and click-fraud actors inflate counts before they're ever stored; and duplicate events from retry-happy mobile SDKs can turn one real click into two billing records. Getting this system right in an interview means addressing all of it — not just the happy path.
Functional requirements
- Ingest ad click and impression events from web, mobile, and third-party ad networks.
- Serve "clicks per ad per minute" for real-time dashboards; acceptable latency: a few seconds.
- Produce exact click counts per campaign for billing cycles (daily/monthly); no double-counting.
- Allow querying aggregates at minute, hour, and day granularity.
- Filter invalid/fraudulent clicks before they enter counts.
- Late events (up to 1 hour delayed) must be reflected in billing totals.
Non-functional requirements
- Throughput: 1 M events/sec sustained; 3× peak.
- Dashboard query latency: < 1 s for per-ad, per-minute counts over a rolling 24-hour window.
- Billing accuracy: zero tolerance for double-counting; late events must be included.
- Durability: no event loss after the collector acknowledges receipt.
- Fault tolerance: stream processor crash must not lose aggregated state or require manual intervention.
- Retention: raw events 30 days; aggregates 2 years.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Peak event rate | 3 M events/sec | 3× headroom over 1 M/sec sustained |
| Per-event size | ~500 bytes | event_id, ad_id, campaign_id, user_id, timestamp, IP, user_agent, click_type, placement |
| Ingestion bandwidth (peak) | 1.5 GB/s | 3 M × 500 B |
| Kafka throughput (avg) | 500 MB/s | 1 M events/sec × 500 B |
| Kafka retention (raw, pre-replication) | ~300 TB | 500 MB/s × 86 400 s × 7 days |
| Kafka retention (with 3× replication) | ~900 TB | 300 TB × 3 |
| Kafka partitions | ~100 | 500 MB/s ÷ 25 MB/s per partition = 20 at avg; ×3 headroom for 3 M/sec peak (1 500 MB/s ÷ 25 MB/s = 60 partitions minimum); round up to 100 for operational flexibility and Flink parallelism match |
| S3 raw archive (uncompressed/day) | 43 TB/day | 86.4 B events × 500 B |
| S3 raw archive (Snappy/Parquet, ~6× compression) | ~7 TB/day | 43 TB ÷ 6 |
| S3 raw archive (30-day window) | ~210 TB | 7 TB/day × 30 |
| ClickHouse minute-level rows/day (raw) | ~345 GB/day | 10 M ads × 1 440 min/day × 24 B/row (8 B count + 8 B ad_id + 8 B bucket_ts) |
| ClickHouse after rollups + compression | a few GB/day | Rollup to hour/day granularity; fits in a few TB over 2-year retention |
| Dashboard QPS | 5 000 QPS | 50 000 concurrent ads × 1 query / 10 s |
| Billing QPS | negligible | Batch reads, few queries/day; not latency-sensitive |
Takeaway: At 1 M events/sec sustained (3 M peak), the system ingests 500 MB/s into ~100 Kafka partitions, archives 43 TB/day uncompressed (~7 TB/day compressed) to S3, and keeps ClickHouse aggregates well under 1 TB/day — storage is dominated by the raw S3 archive.
Building up to the design
V1: A single API writing to Postgres
Every click hits one endpoint, which writes to a clicks table. A dashboard queries SELECT COUNT(*) FROM clicks WHERE ad_id=? AND ts BETWEEN ? AND ?.
This works at demo scale. At 1 M events/sec, though, a single Postgres can't keep up with writes (realistic transactional throughput — individual row inserts with full ACID guarantees and WAL fsync — tops out around tens of thousands per second on typical hardware; bulk COPY is faster but not what a per-click write path does), queries contend with ingestion, and no amount of indexing makes COUNT(*) over billions of rows sub-second.
V2: Decouple ingestion from storage with Kafka
A stateless collector fleet accepts events and publishes them to Kafka, which buffers millions of events/sec durably. Downstream consumers read at their own pace.
The collector and storage are now decoupled. Kafka acts as both the shock absorber and the replay log — if a downstream consumer falls behind or crashes, it can rewind and re-read. What this doesn't solve: something still has to read Kafka and compute aggregates. A simple consumer that increments Redis counters is fast but has no notion of time windows — it cannot answer "clicks in the minute between 14:03:00 and 14:04:00."
V3: Add a stream processor for windowed aggregation
A stream processor (Flink or Kafka Streams) reads Kafka, assigns events to 1-minute tumbling windows keyed by ad_id, and emits window results to the serving store. Now the dashboard query is a point-lookup: "give me the row for ad_id=X, minute=14:03."
Sub-second dashboard reads, horizontal scale, back-pressure handled by Kafka. But there's a problem that surfaces once you start running this in production: stream processor crashes. Flink checkpoints its window state every N seconds using a configurable state backend (EmbeddedRocksDBStateBackend is the production recommendation for large state; HashMapStateBackend is the default but limited by heap) and writes the checkpoint data to a durable remote store such as S3 or HDFS. On recovery it replays Kafka from the saved offset. Events that arrived between the last checkpoint and the crash are replayed — potentially double-written to the OLAP store. And a click arriving 10 minutes late (mobile phone in a tunnel) may land after its window has already been emitted and closed.
V4: Add a batch reprocessing path for billing correctness
Instead of trying to make the stream path exactly-correct (extremely hard), accept that the stream path is approximate and add a second path: the raw S3 event archive. A Spark (or Trino/BigQuery) job reads the full day's raw events, deduplicates by event_id, and writes exact counts. Billing reads those exact rows.
This is the lambda architecture: two paths, one raw truth, one fast path, one slow-but-exact path. The modern alternative (kappa) uses a single stream that you replay for correctness. The trade-offs are real — more on this in the lambda vs. kappa section below.
V5: Deduplication, hot-key mitigation, fraud filtering
Add per-event deduplication at the collector, two-stage aggregation for hot ad IDs, and a fraud-filtering stage inline before events land in the main Kafka topic.
flowchart LR
V1["V1: Postgres write-per-click\n~50k events/sec cap"] --> V2["V2: + Kafka ingestion\n1M+ events/sec, decoupled"]
V2 --> V3["V3: + Flink windows\nnear-real-time counts"]
V3 --> V4["V4: + S3 archive + Spark\nbilling-grade exact counts"]
V4 --> V5["V5: + dedup + hot-key fix\n+ fraud filter\nproduction-ready"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
Event schema
Every event must carry a globally unique event_id (used for dedup), the event's own timestamp (for event-time windowing), and enough dimensions for grouping.
{
"event_id": "01J9K4X7ZP-abcd-1234-ef56", // UUID or ULID
"event_type": "click", // or "impression"
"ad_id": "ad_8371920",
"campaign_id": "cmp_10042",
"user_id": "u_9182736",
"session_id": "sess_abc123",
"timestamp": "2026-05-24T14:03:27.412Z", // event time (client-generated)
"received_at": "2026-05-24T14:03:27.891Z", // processing time (collector-stamped)
"ip": "198.51.100.42",
"user_agent": "Mozilla/5.0 ...",
"placement": "homepage_banner"
}
event_id is the deduplication key. timestamp is the event time used for windowing. received_at lets us measure how late an event is (lag = received_at - timestamp).
Detailed architecture
flowchart TD
SDK[Web/Mobile SDKs\nThird-party ad networks] --> COL[Collector Fleet\nstateless, horizontally scaled]
COL -->|deduplicate event_id\nwrite to Kafka| KAFKA_RAW[Kafka: raw-events\n~100 partitions, key=ad_id]
COL -->|archive all events| S3_RAW[(S3 / GCS\nraw Parquet archive\n30-day retention)]
KAFKA_RAW --> FRAUD[Fraud Filter\nFlink job\nIP rate-limit, bot UA, click velocity]
FRAUD -->|clean events| KAFKA_CLEAN[Kafka: clean-events]
KAFKA_CLEAN --> FLINK[Flink Streaming\n1-min tumbling windows\nevent-time + watermarks]
FLINK -->|pre-aggregated counts| OLAP[(ClickHouse / Druid\nrollup: minute → hour → day)]
S3_RAW --> SPARK[Spark Batch\nhourly / daily reprocess\nexact dedup by event_id]
SPARK -->|exact counts upsert| OLAP
OLAP --> DASH[Dashboard API\n< 1s query latency]
OLAP --> BILL[Billing Service\nreads exact-count rows only]
style COL fill:#ff6b1a,color:#0a0a0f
style KAFKA_RAW fill:#a855f7,color:#fff
style KAFKA_CLEAN fill:#a855f7,color:#fff
style FLINK fill:#15803d,color:#fff
style OLAP fill:#0e7490,color:#fff
style S3_RAW fill:#ffaa00,color:#0a0a0f
style SPARK fill:#ff2e88,color:#fff
style FRAUD fill:#15803d,color:#fff
The diagram shows the two-path split clearly. Every event hits the collector and fans out: one path flows through Kafka into Flink for near-real-time counts, the other lands in S3 for later batch processing. The OLAP store serves both consumers from the same table — differentiated by the is_exact flag on each row.
Stream processing deep dive
Event time vs. processing time
A click happens at 14:03:27 on a user's phone. The phone is underground, so the event doesn't reach the collector until 14:07:12. Which timestamp should the stream processor use?
| Concept | Definition | When it matters |
|---|---|---|
| Event time | Timestamp on the event itself (when the click actually happened) | Always — it's what the advertiser cares about |
| Processing time | When the event arrives at the stream processor | Internal metrics, but wrong for ad aggregation |
| Ingestion time | When the event enters Kafka | A compromise; better than processing time, worse than event time |
Aggregations must be keyed on event time. A click at 14:03:45 belongs in the 14:03 window, even if the stream processor sees it at 14:07.
Watermarks and late events
The problem is that the stream processor doesn't know when all events for a given minute have arrived. It needs a signal: "I believe all events with timestamp < W have now arrived." That signal is called a watermark.
Flink advances the watermark as events flow in, typically as max_seen_event_time - allowed_lateness. With an allowed lateness of 5 minutes, events up to 5 minutes late are accepted and folded into the correct window. Events more than 5 minutes late are emitted as a side output (a late data stream) — the stream path ignores them for approximate counts. The batch reprocessing path handles events that are hours or days late.
sequenceDiagram
participant K as Kafka
participant F as Flink
participant O as OLAP Store
K->>F: event ts=14:03:27 (on time)
K->>F: event ts=14:01:45 (2 min late)
Note over F: watermark now at ~14:02\nwindow 14:01 still open (within 5 min lateness)
F->>O: window 14:01 emitted (preliminary)
K->>F: event ts=14:01:12 (7 min late — past watermark)
F-->>F: side output: late event
Note over F: late event NOT in streaming count
Note over O: Spark batch will pick this up from S3
Tumbling windows and pre-aggregation
Flink groups events into non-overlapping 1-minute tumbling windows keyed by (ad_id, window_start). Within each window, events are counted locally per Flink task, then a final aggregation step merges partial counts from all tasks before writing to the OLAP store.
This two-level aggregation matters at scale. With 1 M events/sec and 100 Flink tasks, each task handles ~10 000 events/sec. Local counts within a task produce a single number per (ad_id, window); only that number crosses the network at window close time, not raw events.
Output schema written to ClickHouse per window close:
INSERT INTO ad_clicks_1min
(ad_id, campaign_id, window_start, click_count, impression_count, is_exact)
VALUES
('ad_8371920', 'cmp_10042', '2026-05-24 14:03:00', 1423, 8192, false);
is_exact = false for stream-path rows. The Spark batch job writes is_exact = true rows (or upserts existing rows). Billing queries WHERE is_exact = true.
Lambda vs. kappa architecture
This problem is the textbook case for choosing between them.
| Dimension | Lambda (two paths) | Kappa (single stream) |
|---|---|---|
| Correctness | Batch path is exact ground truth | Stream path must be re-run for corrections |
| Latency | Streaming path is low-latency; batch is hours | Single path can be both with careful design |
| Complexity | Two codebases to maintain, two outputs to reconcile | One codebase; reprocessing is just replaying Kafka/S3 |
| Reprocessing | Spark batch reads S3; straightforward | Replay the entire event log through the stream processor from offset 0 |
| Late events | Batch path naturally handles them | Requires long Kafka retention or reading from S3 |
| When to use | Different logic needed for approx vs. exact | Same aggregation logic; correctness via replay |
The two topologies look like this side by side:
flowchart LR
subgraph LAMBDA["Lambda (two paths)"]
direction TB
KL[Kafka] --> FL[Flink\napprox counts]
KL --> SL[(S3 raw log)]
FL --> OL[(OLAP\nis_exact=false)]
SL --> SPL[Spark\nexact dedup]
SPL --> OL2[(OLAP\nis_exact=true)]
end
subgraph KAPPA["Kappa (single stream)"]
direction TB
KK[Kafka\nlong retention] --> FK[Flink\nor replay job]
FK --> OK[(OLAP)]
KK -.replay from offset 0.-> FK
end
style FL fill:#15803d,color:#fff
style SPL fill:#ff2e88,color:#fff
style FK fill:#ff6b1a,color:#0a0a0f
style KL fill:#a855f7,color:#fff
style KK fill:#a855f7,color:#fff
style SL fill:#ffaa00,color:#0a0a0f
For billing, lambda is the safer choice. The batch and stream paths can use different dedup and aggregation logic because their requirements are genuinely different — the stream path optimizes for latency (relaxed dedup with bloom filter); the batch path optimizes for correctness (exact dedup by scanning event_id in the S3 partition).
Why billing cannot trust the streaming path alone:
Flink crashes and recovers from a checkpoint — on restart, Flink re-reads Kafka from the checkpointed offset, but the already-emitted partial window counts are not automatically corrected. Late events past the watermark are silently dropped from streaming aggregation. Network retries from collectors can produce duplicate events; bloom-filter dedup is probabilistic (non-zero false-positive rate — a new, unique event_id is incorrectly classified as a duplicate and dropped). Bloom filters have no false negatives, but false positives mean some legitimate events are silently discarded before Kafka.
All three causes lead to undercounting or overcounting that the stream path alone cannot self-detect.
Deduplication
Collectors retry on network failure. A click may arrive 2–5 times. The event_id is the idempotency key, and the defense is layered.
At the collector (best-effort, fast): A bloom filter per 5-minute bucket holds event_ids seen in that window. Before publishing to Kafka, check the bloom. The filter is tuned to a < 0.01% false-positive rate — meaning at most 0.01% of legitimate, unique events are incorrectly flagged as duplicates and dropped. Because the bloom filter has no false negatives, every true duplicate (an event_id already inserted) is caught with certainty and blocked from reaching Kafka.
At the batch path (exact, slow): The Spark job reads all S3 files for the day, groups by event_id, and takes one row per event. GROUP BY event_id on Parquet with predicate pushdown on date partition is fast even over terabytes.
Idempotent writes to the OLAP store: Both Flink and Spark write with upsert semantics. In ClickHouse, the ReplacingMergeTree engine deduplicates inserts with the same primary key at merge time; use the FINAL keyword on reads to force dedup before the query returns. In Druid, enable rollup at ingestion time and include event_id as a dimension; a count metric over the rollup group naturally deduplicates within a segment's granularity.
Hot-key skew
A viral ad — say, a Super Bowl spot — can drive 100× normal click volume through one ad_id. If Kafka is partitioned by ad_id % N, one partition becomes 100× hotter and one Flink task ends up drowning.
The fix is key salting combined with two-stage aggregation. In Stage 1, salt the partition key so events for the same ad spread across S partitions:
partition_key = ad_id + "_" + (event_id_hash % S) // S = 8 or 16
Each Flink partition computes a partial count independently. In Stage 2, a second Flink operator reads partial counts keyed by ad_id and sums them at window-close time before writing to the OLAP store. This adds a small amount of extra latency — roughly one intra-cluster network round trip, typically under 10–20 ms depending on hardware — but prevents a hot partition from stalling the entire topology.
flowchart LR
AD[Hot ad_id: 100k events/min] --> P1[Partition ad_X_0\n12 500 events/min]
AD --> P2[Partition ad_X_1\n12 500 events/min]
AD --> P3[Partition ad_X_2\n12 500 events/min]
AD --> P4[Partition ad_X_3..7\n...]
P1 --> MERGE[Merge operator\nsum partial counts]
P2 --> MERGE
P3 --> MERGE
P4 --> MERGE
MERGE --> OLAP[(OLAP Store)]
style AD fill:#ff6b1a,color:#0a0a0f
style MERGE fill:#15803d,color:#fff
style OLAP fill:#0e7490,color:#fff
Fraud and invalid click filtering
A bot can inflate an advertiser's click count (click fraud) or a competitor's (ad sabotage). Filtering before counting is critical, and the right place to do it is in a separate Flink job between raw-events and clean-events — not at the collector where the logic is harder to update and audit.
An inline Flink fraud-filter job reads from raw-events and publishes clean events to clean-events, dropping or flagging:
| Signal | Rule | Action |
|---|---|---|
| IP click velocity | > 100 clicks/min from same IP | Drop |
| User agent | Known bot UA strings | Drop |
| Click-impression gap | Click without preceding impression in last 24h | Flag / drop |
| Repeated user+ad pair | Same user_id clicks same ad_id > N times/hour | Drop duplicates beyond threshold |
| Geographic anomaly | IP geo vs. campaign target region mismatch | Flag for human review |
Flagged events are written to a separate suspect-events Kafka topic for offline analysis. Advertisers can dispute counts and request a re-audit from the S3 raw log — with suspect events excluded or included, depending on the outcome.
Serving store: ClickHouse / Druid
Both are OLAP databases designed for aggregation queries on append-heavy time-series data.
| Feature | ClickHouse | Apache Druid |
|---|---|---|
| Ingestion | Kafka consumer built-in; batch INSERT | Kafka supervisor; batch ingestion |
| Rollup | Materialized views; AggregatingMergeTree | Native rollup at ingestion time |
| Query language | Full SQL | Druid SQL (subset) or native JSON |
| Real-time ingestion | Yes (Kafka table engine) | Yes (real-time ingestion segments) |
| Dedup on write | ReplacingMergeTree | Per-segment uniqueness |
| Horizontal scale | Distributed tables, sharding | Segment-based replication |
| Latency (p99 agg query) | < 100 ms on indexed columns | < 100 ms |
| Operational complexity | Lower (single binary family) | Higher (many components) |
Either is correct. ClickHouse is typically simpler operationally; Druid is battle-tested at very large deployments. Pinterest uses Druid specifically for ad analytics (documented in their engineering blog). Airbnb uses Druid for general real-time platform analytics and dashboards, not specifically for ad analytics. Other large-scale Druid users include Wikimedia, Netflix, and Lyft (listed on druid.apache.org/druid-powered).
Rollup tiers
Store minute-level data for 7 days, hour-level for 90 days, day-level for 2 years. A materialized view rolls up minute rows to hour rows nightly.
-- ClickHouse: minute-level table
CREATE TABLE ad_clicks_1min (
ad_id String,
campaign_id String,
window_start DateTime,
click_count UInt64,
impression_count UInt64,
is_exact UInt8
) ENGINE = ReplacingMergeTree(is_exact)
ORDER BY (ad_id, window_start);
-- Hour-level rollup via a materialized view over the minute table
CREATE TABLE ad_clicks_1hour (
ad_id String,
campaign_id String,
window_start DateTime, -- truncated to the hour boundary
click_count AggregateFunction(sum, UInt64),
impression_count AggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY (ad_id, window_start);
CREATE MATERIALIZED VIEW ad_clicks_1hour_mv TO ad_clicks_1hour AS
SELECT
ad_id,
campaign_id,
toStartOfHour(window_start) AS window_start,
sumState(click_count) AS click_count,
sumState(impression_count) AS impression_count
FROM ad_clicks_1min
WHERE is_exact = 1 -- roll up only exact rows for billing; include 0 for dashboards
GROUP BY ad_id, campaign_id, window_start;
Failure modes
| Failure | Symptom | Mitigation |
|---|---|---|
| Collector crash mid-flight | Events in flight are lost | SDK retries with exponential backoff; event_id dedup on replay |
| Kafka partition leader failure | Ingest stalls for seconds | Kafka leader election (<30 s typical); collector buffers locally |
| Flink job crash | Streaming counts stall; window state is partially computed | Flink restarts from last checkpoint; replays Kafka from saved offset |
| Flink emits partial window on restart | OLAP row contains a lower-than-correct count | is_exact=false flag; billing ignores these rows; Spark batch overwrites |
| Late event past watermark | Event excluded from streaming count | Written to S3; Spark batch includes it in exact counts |
| OLAP store slow / backpressure | Flink's output buffer fills; checkpoint latency rises | Flink applies backpressure to Kafka consumer; Kafka buffers the delay |
| Duplicate events in billing | Advertiser overcharged | Spark GROUP BY event_id dedup before aggregation; idempotent OLAP writes |
| Hot partition | One Flink task overwhelmed; checkpoint lag spikes | Key salting + two-stage aggregation |
| S3 write failure | Raw event not archived; batch path loses coverage | Collector retries S3 write; fallback: read directly from Kafka if within retention |
Storage choices summary
| Data | Store | Retention | Why |
|---|---|---|---|
| Raw events (durable log) | Kafka (7 days) + S3 Parquet (30 days) | 30 days | Replay for reprocessing; S3 for long-term |
| Streaming aggregates | ClickHouse (minute/hour) | 7 days / 90 days | Low-latency dashboard reads |
| Exact billing aggregates | ClickHouse (same table, is_exact=true) | 2 years | Billing audit trail |
| Fraud suspect events | Kafka topic + S3 | 90 days | Dispute resolution |
| Bloom filter state | Redis (per 5-min bucket with TTL) | 10 minutes | Collector-level dedup |
| Flink checkpoints | S3 / HDFS | Last 3 checkpoints | Job recovery |
Things to discuss in an interview
- Why two paths? The stream path is inherently approximate — late events, at-least-once delivery, probabilistic dedup. Billing requires exact counts. The only way to be exact is to read the complete, deduplicated raw event log, which is the batch path.
- Lambda vs. kappa trade-off. Kappa simplifies the codebase (one streaming job that you replay for corrections). Lambda is safer when the approximation logic and exact logic are genuinely different, as they are here. Be explicit about which you'd choose and why.
- Watermarks and lateness. An interviewer will ask what happens when a mobile click arrives 20 minutes late. Your answer: the streaming path silently misses it (past watermark); the Spark batch picks it up from S3 and corrects the billing row. Streaming count stays approximate.
- Hot-key skew. The interviewer may say "a Super Bowl ad gets 10 M clicks in one minute." Your answer: salt the partition key, do two-stage aggregation in Flink, merge before writing to the OLAP store.
- Dedup and billing. Double-counting is the nightmare scenario. Explain the three-layer defense: bloom filter at the collector,
GROUP BY event_idin Spark, and idempotent upsert writes to the OLAP store. - Fraud filtering placement. Place it in a separate Flink job between
raw-eventsandclean-eventsso the filtering logic can be updated and re-run independently. Do not do it at the collector — too close to the network edge, harder to update and audit.
Things you should now be able to answer
- Why can't the billing service simply read the streaming (Flink) output?
- What is a watermark, and what does it control in a windowed aggregation?
- What happens to a click event that arrives 2 hours after it was generated?
- Why is the partition key salted, and how does two-stage aggregation compensate?
- What does
ReplacingMergeTreein ClickHouse give you, and when do you needFINAL? - Why is the bloom filter used at the collector rather than inside Flink?
- What is the difference between a tumbling window and a sliding window, and which is appropriate here?
Further reading
- "The Log: What every software engineer should know about real-time data" — Jay Kreps, Confluent (foundational reading on Kafka's log-as-truth model)
- "Questioning the Lambda Architecture" — Jay Kreps, O'Reilly (the original argument for kappa)
- Apache Flink documentation: Event Time and Watermarks — flink.apache.org
- ClickHouse documentation: ReplacingMergeTree — clickhouse.com/docs
- "Unified Batch and Streaming with Apache Flink" — various Apache Flink conference talks (public)
- "Druid: A Real-time Analytical Data Store" — Yang et al., SIGMOD 2014 (foundational paper)
Frequently asked questions
▸What is a watermark in the context of Flink stream processing?
A watermark is the stream processor's signal that all events with a timestamp earlier than a given value have arrived. Flink advances it as max_seen_event_time minus an allowed lateness; with a 5-minute allowed lateness, events up to 5 minutes late are folded into the correct window, while events arriving later are emitted to a side output and handled by the Spark batch path instead.
▸Why can't the billing service read directly from the Flink streaming output?
Three failure modes make the stream path unsuitable for billing: Flink recovers from checkpoints by replaying Kafka, potentially double-writing partial window counts; late events past the watermark are silently dropped from streaming aggregation; and the bloom-filter dedup at the collector has a non-zero false-positive rate, meaning some legitimate clicks are discarded before reaching Kafka. Only the Spark batch job, which does exact dedup via GROUP BY event_id over the full S3 raw log, is suitable for charging advertisers.
▸When should you choose a lambda architecture over a kappa architecture for ad click aggregation?
Choose lambda when the approximation logic and the exactness logic are genuinely different. In this system the stream path uses a probabilistic bloom-filter dedup optimized for latency, while the batch path uses full GROUP BY event_id dedup optimized for correctness — different enough that maintaining two codebases is worthwhile. Kappa is preferable when the same aggregation logic serves both purposes and you can achieve correctness simply by replaying the event log through the stream processor.
▸How does key salting prevent a viral ad from collapsing one Flink partition?
The partition key is computed as ad_id plus a salt derived from event_id_hash modulo S (typically 8 or 16), spreading a single ad's events across S partitions. A two-stage Flink topology then sums the partial counts from each salted partition at window-close time before writing to the OLAP store, adding roughly one intra-cluster network round trip of extra latency but preventing any single task from being overwhelmed.
▸What are the raw storage numbers for this system at 1 million events per second?
At sustained 1 M events/sec with 500 bytes per event, Kafka ingests 500 MB/s on average and needs about 100 partitions. S3 accumulates 43 TB/day uncompressed; stored as Snappy-compressed Parquet at roughly 6x compression, that is about 7 TB/day, or 210 TB over the 30-day retention window. ClickHouse minute-level rows account for roughly 345 GB/day raw, collapsing to a few GB/day after rollups and compression.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.