Design an A/B Testing & Experimentation Platform
Run thousands of controlled experiments at once. Deterministic bucketing, exposure logging, the metrics pipeline, statistical significance, and the peeking problem.
The problem
Netflix, Airbnb, and Google collectively run tens of thousands of experiments per year — every button color, ranking tweak, and pricing change gets tested before shipping. An A/B testing platform is the infrastructure that makes this possible: it assigns each user to a variant of an experiment, records who actually encountered the change, collects what they did afterward, and then applies statistics to figure out whether the change helped or hurt. The platform is the arbiter of truth for every product decision at a data-driven company.
The naive design — put users in a bucket, measure a metric, compare averages — falls apart the moment you run more than a handful of experiments. The core engineering tension is threefold. First, assignment must be deterministic and sub-millisecond with no database on the hot path, which rules out the obvious approach of storing assignments in a lookup table. Second, the exposure log must be complete and correct — a missing or mislabeled event poisons the entire analysis population. Third, the statistical layer must protect against the ways teams naturally lie to themselves: checking results before you have enough data, running experiments that interfere with each other, and trusting allocation ratios that have silently drifted from what you specified.
The platform must also be correct statistically, not just correct architecturally. A system that assigns variants flawlessly and logs exposures perfectly can still ship a regression if the stats layer lets engineers peek at p-values daily and call it done when a number dips below 0.05. Senior interviewers will push hard on the statistics and on how you manage thousands of concurrent experiments without them polluting each other's results.
Functional requirements
- Experiment CRUD: Create experiments with a name, description, variants (control + 1–N treatments), allocation percentages, targeting/eligibility rules (country, platform, user segment), and a set of metrics (primary, secondary, guardrail).
- Assignment: Given a unit ID (user, session, device, request), return a stable variant for each active experiment that the unit is eligible for. Must be deterministic, consistent across calls, and not require a database lookup.
- Exposure logging: Record, for each unit, which variant they were actually exposed to and when. This is distinct from assignment (a user may be assigned to treatment but never actually view the affected UI).
- Outcome tracking: Ingest arbitrary product events (purchases, clicks, errors, latency observations) tagged with unit IDs and timestamps.
- Metrics computation: Join exposure and outcome streams, compute per-variant metric values, and surface aggregate statistics.
- Statistical analysis: Compute p-values, confidence intervals, and sample-size estimates. Flag winners/losers. Support variance reduction (CUPED).
- Guardrail enforcement: Auto-pause experiments that cause statistically significant degradation on guardrail metrics (latency, error rate, revenue).
- Concurrent experiments: Run thousands of non-interfering experiments simultaneously with a clear framework for managing overlaps.
Non-functional requirements
- Assignment latency < 5 ms p99 (ideally < 1 ms) — this is on the critical path of every page render.
- Assignment consistency — a given unit_id must land in the same variant across services, calls, and devices for the lifetime of the experiment.
- Exposure log completeness — missing exposures bias the analysis. The pipeline must be at-least-once with deduplication.
- High availability — a broken assignment path means every feature gated on an experiment stops working.
- Metric pipeline freshness — guardrail checks should surface within ~1 hour; full analysis can be daily.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Concurrent experiments | 10,000 | Given |
| DAU | 500M | Given |
| Assignment throughput (avg) | ~29,000 req/sec | 500M × 5 page views ÷ 86,400 s |
| Assignment throughput (peak) | ~100k req/sec | Spiky traffic multiplier |
| Hash computation latency | ~1–15 ns per call | MurmurHash3 / xxHash on a ~20-byte key in-process; 100k/sec is trivially fast |
| Exposure events (sustained) | ~5,800/sec | 500M exposures/day ÷ 86,400 s |
| Exposure events (peak) | ~50k/sec | Spiky; not every assignment triggers an exposure |
| Exposure ingest bandwidth (sustained) | ~1.2 MB/s | 5,800/s × 200 B/event |
| Exposure ingest bandwidth (peak) | ~10 MB/s | 50k/s × 200 B/event |
| Outcome events (sustained) | ~60k–290k/sec | 5B–25B/day ÷ 86,400 s; 10–50 outcomes/user/day |
| Exposure log storage | ~100 GB/day | 500M rows/day × 200 B |
| Outcome log storage | ~1 TB/day | 10B rows/day × 100 B |
| Data warehouse (90-day rolling, raw) | ~100 TB | (100 GB + 1 TB) × 90 days |
| Data warehouse (90-day rolling, compressed) | ~20 TB | Columnar compression ~5× |
Takeaway: Assignment throughput is trivially cheap because it is pure in-process hashing. The hard numbers are on the data side: ~100 GB/day of exposure logs and ~1 TB/day of outcome events flowing into Kafka, settling into roughly 20 TB of compressed columnar storage over a 90-day window.
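The table's arithmetic can be reproduced in a few lines. This is a back-of-envelope sketch that uses only the inputs stated in the table (500M DAU, 5 page views per user, 200 B exposure events, 10B outcome rows of 100 B each, ~5× compression); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def per_second(events_per_day: float) -> float:
    """Convert a daily event count into a sustained per-second rate."""
    return events_per_day / SECONDS_PER_DAY

# Inputs taken from the capacity table.
dau = 500_000_000
page_views_per_user = 5
exposure_bytes = 200
outcome_rows_per_day = 10_000_000_000
outcome_bytes = 100
compression_ratio = 5

assignment_rps = per_second(dau * page_views_per_user)  # average assignment rate
exposure_rps = per_second(dau)                          # one exposure per user per day
exposure_gb_per_day = dau * exposure_bytes / 1e9
outcome_tb_per_day = outcome_rows_per_day * outcome_bytes / 1e12

# 90-day rolling window, raw and compressed, in TB.
raw_90d_tb = (exposure_gb_per_day / 1000 + outcome_tb_per_day) * 90
compressed_90d_tb = raw_90d_tb / compression_ratio
```

The raw 90-day figure comes out at 99 TB, which the table rounds to ~100 TB.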
Building up to the design
Let's earn each component. Start with the simplest possible thing and grow it until it breaks in a specific, nameable way.
V1: A flag in the app and a spreadsheet
# Hardcoded in the app
SHOW_NEW_CHECKOUT = user_id % 2 == 0 # 50/50 split
# Track outcomes however you can
This is a real A/B test, and you can manually compute conversion rates from it. But every new experiment requires a code deploy. There's no centralized record of who saw what. Analysts query production databases ad hoc and disagree on numbers. You cannot run 10 experiments simultaneously without collision.
V2: A config service + server-side bucketing
Pull experiment definitions from a config store and do the bucketing in the application server.
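import zlib

def get_variant(user_id, experiment_name, config):
    exp = config[experiment_name]
    # Built-in hash() is salted per process for strings, so use a stable hash instead.
    key = (str(user_id) + exp["salt"]).encode("utf-8")
    bucket = zlib.crc32(key) % 100
    for variant in exp["variants"]:
        if bucket < variant["allocation_end"]:
            return variant["name"]
    return None  # no variant covers this bucket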
Now experiments are data, not code. Non-engineers can change allocations without deploys. Multiple experiments run concurrently. But the config is fetched on every request, adding latency. And multiple services bucket independently without sharing experiment context. Worse: you are computing a p-value by looking at your dashboard every morning and calling it done when it dips below 0.05 — more on that peeking problem shortly.
V3: In-process SDK with pushed config + exposure log
App boot: SDK loads full experiment config into memory (~100 kB compressed).
Runtime: assignment is a pure hash call — no I/O.
Exposure: SDK fires a non-blocking event to Kafka when user actually sees the treatment.
Config: background thread polls/receives config updates; ~30 s staleness.
Now assignment is sub-millisecond with no external I/O on the hot path, and you have a correct exposure population for analysis. But you now have accurate exposure counts with no metrics pipeline behind them. Guardrails don't exist. You ship a bug that raises crash rate 20% and don't know for 24 hours.
V4: Streaming metrics pipeline + guardrails
Add a near-real-time pipeline that joins the last hour of exposure events with the last hour of outcome events. Compute per-variant metric deltas. Alert if any guardrail metric (crash rate, p50 latency, revenue per user) degrades significantly. Harm is now detectable within ~1 hour, and experiments auto-pause on guardrail breach.
You're still doing repeated significance testing (peeking), though. And with 500 experiments running, some of them interact — the checkout button color experiment and the checkout flow redesign experiment affect the same users at the same time.
V5: Statistical rigor + experiment layers
Add fixed-horizon sample-size planning before launch, sequential testing (mSPRT) for valid early stopping, and a layering system (orthogonal salts per layer) to isolate cross-experiment interference. This is the production design.
flowchart LR
V1["V1: hardcoded flag"] --> V2["V2: config service + server bucketing"]
V2 --> V3["V3: in-process SDK + exposure log"]
V3 --> V4["V4: streaming pipeline + guardrails"]
V4 --> V5["V5: sequential testing + layers"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
Assignment: deterministic hashing
This is the foundation. The entire system rests on one equation:
bucket = hash(unit_id + experiment_salt) % 100
unit_id is the stable identifier for the randomization unit: usually user_id, sometimes session_id, device_id, or request_id depending on what you want to measure. experiment_salt is a unique string per experiment (typically the experiment name or a UUID assigned at creation). The salt ensures that a user in bucket 42 for experiment A is not necessarily in bucket 42 for experiment B; the salts break correlation. The + here means string concatenation, and in practice the two parts are joined with a delimiter such as | so that different (unit_id, salt) pairs cannot collapse into the same input string. hash() itself is a fast, uniform, non-cryptographic hash; MurmurHash3 and xxHash are common choices. MD5 works but is slower than necessary. Whatever the function, its output must be uniformly distributed modulo 100.
Three things make this approach powerful. First, no per-user storage: the assignment for user 12345 in experiment "checkout_v2" is always hash("12345|checkout_v2") % 100, computed fresh every time. Second, consistent across services: the iOS app, the Android app, and the backend API all independently compute the same answer, with no coordination. Third, debuggable: "why did user X get variant B?" — just recompute the hash.
flowchart LR
UID["user_id: '12345'"] --> CONCAT["concat with salt<br/>'12345|checkout_v2'"]
SALT["experiment_salt:<br/>'checkout_v2'"] --> CONCAT
CONCAT --> HASH["MurmurHash3<br/>→ 2847342"]
HASH --> MOD["% 100 → bucket 42"]
MOD --> CHECK{"bucket < 50?"}
CHECK -->|yes| CTRL["control"]
CHECK -->|no| TREAT["treatment_a"]
style HASH fill:#ff6b1a,color:#0a0a0f
style CTRL fill:#15803d,color:#fff
style TREAT fill:#0e7490,color:#fff
Here is the allocation code in full:
from typing import Optional

import mmh3  # third-party MurmurHash3 binding

def assign(user_id: str, experiment: dict) -> Optional[str]:
    salt = experiment["salt"]
    # Unsigned 32-bit hash of "user_id|salt", reduced to a bucket in 0..99.
    bucket = mmh3.hash(user_id + "|" + salt, signed=False) % 100
    cumulative = 0
    for variant in experiment["variants"]:
        cumulative += variant["allocation_pct"]
        if bucket < cumulative:
            return variant["name"]
    return None  # unallocated bucket: the user is not in the experiment
If allocations sum to less than 100, users in the unallocated buckets are not in the experiment at all — useful for gradual ramp-up (start at 1%, grow to 50%).
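Bucket uniformity is cheap to check empirically. The sketch below is an illustration under stated assumptions: SHA-256 from the standard library stands in for MurmurHash3 so the block runs without third-party packages, the user IDs are synthetic (`user_0`, `user_1`, ...), and the helper names are hypothetical.

```python
import hashlib
from collections import Counter

def bucket(unit_id: str, salt: str, buckets: int = 100) -> int:
    """Deterministic bucket in [0, buckets) from a stable hash of 'unit_id|salt'."""
    digest = hashlib.sha256(f"{unit_id}|{salt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def bucket_counts(salt: str, n_users: int) -> Counter:
    """Count how many synthetic users land in each bucket."""
    return Counter(bucket(f"user_{i}", salt) for i in range(n_users))

n_users = 100_000
counts = bucket_counts("checkout_v2", n_users)
expected = n_users / 100

# Chi-squared statistic against a uniform distribution (99 degrees of freedom).
# Values near 99 are what a uniform hash produces; values far above suggest skew.
chi2 = sum((counts[b] - expected) ** 2 / expected for b in range(100))
```

Running the same check on a sample of real production IDs is more informative than synthetic ones, because real IDs can carry structure (sequential ranges, shared prefixes) that a weak hash fails to break up.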
Randomization unit choice matters:
| Unit | Use when | Watch out for |
|---|---|---|
| user_id | Measuring per-user outcomes over sessions | Logged-out users need a stable fallback (device ID or cookie) |
| session_id | Testing session-level features (search ranking) | Same user sees different variants; contamination if they compare |
| device_id | Pre-login experiments | Device-sharing (family computers) pollutes user-level analysis |
| request_id | Infrastructure / backend experiments (caching, routing) | Only valid for metrics that don't require user-level aggregation |
Experiment config and the config service
Assignment is in-process, but the config must flow from the experiment management service to every running SDK instance. This is structurally identical to a feature flag service — A/B testing is feature flags plus measurement.
Config payload per experiment (~100 bytes each):
- experiment_id, salt, status (running/paused/concluded)
- variants: [{name, allocation_pct}]
- eligibility rules: [{dimension, operator, values}]
- start_time, end_time
Total config: 10,000 experiments × 100 bytes = 1 MB uncompressed; ~100 kB gzip.
The SDK fetches the full config on startup and polls every 30 seconds (or receives a push notification of changes). Config staleness is at most ~30 seconds — acceptable for most experiments. Emergency pauses propagate within one polling interval.
Eligibility filtering runs the hash only if the user qualifies:
def is_eligible(user: dict, rules: list) -> bool:
    # A user is eligible only if every rule passes.
    for rule in rules:
        user_val = user.get(rule["dimension"])
        # "operator" matches the key name in the config payload above.
        if rule["operator"] == "IN" and user_val not in rule["values"]:
            return False
        if rule["operator"] == "NOT_IN" and user_val in rule["values"]:
            return False
    return True
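If the user is not eligible, they are not assigned and not exposed. This is how you target "only US users on iOS version >= 16" without any server round-trip, although a version bound like that needs a comparison operator in addition to the IN and NOT_IN checks shown.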
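A rule such as "iOS version >= 16" can be handled by adding a comparison operator to the rule evaluator. The sketch below is one possible extension, not the platform's actual rule engine: the `GTE` operator name, the `os_major` dimension, and the fail-closed handling of missing attributes are all assumptions made for illustration.

```python
from typing import Any

def rule_matches(user_val: Any, operator: str, values: list) -> bool:
    """Evaluate a single eligibility rule against one user attribute."""
    if user_val is None:
        return False  # assumption: a missing attribute fails closed
    if operator == "IN":
        return user_val in values
    if operator == "NOT_IN":
        return user_val not in values
    if operator == "GTE":  # hypothetical operator for numeric lower bounds
        return user_val >= values[0]
    raise ValueError(f"unknown operator: {operator}")

def is_eligible(user: dict, rules: list) -> bool:
    """A user qualifies only if every rule matches."""
    return all(
        rule_matches(user.get(r["dimension"]), r["operator"], r["values"])
        for r in rules
    )

us_ios16_rules = [
    {"dimension": "country", "operator": "IN", "values": ["US"]},
    {"dimension": "platform", "operator": "IN", "values": ["ios"]},
    {"dimension": "os_major", "operator": "GTE", "values": [16]},
]
```

Raising on an unknown operator, rather than silently passing the rule, keeps a typo in the config from widening the eligible population.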
Exposure logging — the most important log in the system
Assignment ≠ Exposure. A user assigned to "treatment" might never see the affected UI if the experiment is behind a feature gate that only fires on checkout. If you analyze "all assigned users," you dilute your treatment effect with users who got no treatment — you'll miss real effects or manufacture phantom ones.
The exposure event fires at the point where the variants diverge, in control as well as treatment, so both arms are logged by the same rule:
{
"event_type": "experiment_exposure",
"timestamp": "2026-06-06T14:23:01.452Z",
"unit_id": "user_12345",
"experiment_id": "checkout_v2",
"variant": "treatment_a",
"client": "web",
"session_id": "sess_abc"
}
This is a fire-and-forget write to Kafka from the caller's point of view: the SDK enqueues the event without blocking the request path. If the SDK cannot reach Kafka (client-side SDK), it buffers locally and flushes on the next network opportunity, so retries can produce duplicates. The pipeline deduplicates on (unit_id, experiment_id) per analysis window, so a user exposed multiple times still counts once per experiment.
High-level architecture
flowchart TD
EXP_DEF[Experiment author] -->|create / edit| EXP_SVC[Experiment service<br/>CRUD + validation]
EXP_SVC --> CFGDB[(Config store<br/>Postgres)]
CFGDB -->|push on change| CFGSVC[Config delivery<br/>CDN / edge]
CFGSVC -->|poll 30 s| SDK[Client SDK<br/>in-process assignment]
SDK -->|exposure event| KAFKA_EXP[Kafka<br/>exposures topic]
APP[Product app] -->|outcome events| KAFKA_OUT[Kafka<br/>outcomes topic]
KAFKA_EXP --> ETL[Metrics pipeline<br/>Spark batch / Flink streaming]
KAFKA_OUT --> ETL
ETL --> DW[(Data warehouse<br/>columnar store)]
DW --> STATS[Stats engine<br/>p-values · CIs · CUPED]
STATS --> DASH[Experiment dashboard]
STATS --> GUARD[Guardrail service<br/>auto-pause]
GUARD -->|pause experiment| EXP_SVC
style SDK fill:#ff6b1a,color:#0a0a0f
style KAFKA_EXP fill:#a855f7,color:#fff
style ETL fill:#0e7490,color:#fff
style STATS fill:#15803d,color:#fff
style GUARD fill:#ffaa00,color:#0a0a0f
The metrics pipeline
The pipeline must join two streams: exposure events (who was in which variant, when) and outcome events (what actions did users take — revenue, clicks, errors, latency).
sequenceDiagram
participant SDK as SDK
participant K as Kafka
participant SP as Spark job
participant DW as Data warehouse
participant STATS as Stats engine
SDK->>K: exposure event — user, experiment, variant, ts
SDK->>K: outcome event — user, purchase, amount, ts
Note over K: buffered, low-latency write
SP->>K: consume both topics, hourly or daily
SP->>SP: join on unit_id where outcome_ts >= exposure_ts
SP->>SP: aggregate per variant: mean, variance, count
SP->>DW: write experiment_metrics table
DW->>STATS: read for analysis
STATS->>STATS: compute p-value, CI, CUPED adjustment
Three join rules matter. Only count outcome events that occurred after the user's exposure time — pre-exposure behavior is not affected by the treatment. If a user is exposed multiple times, use the first exposure timestamp. The join is a left join from exposures — every exposed user appears in the analysis, even those with zero outcome events (zero is a valid data point for metrics like "revenue").
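The three join rules can be sketched in plain Python before committing to a Spark or Flink job. This is a toy in-memory version under stated assumptions: timestamps are plain numbers, outcomes carry an `amount` field, and the function and field names are illustrative rather than a real schema.

```python
from collections import defaultdict

def first_exposures(exposures):
    """Keep the earliest exposure per (unit_id, experiment_id)."""
    first = {}
    for e in exposures:
        key = (e["unit_id"], e["experiment_id"])
        if key not in first or e["ts"] < first[key]["ts"]:
            first[key] = e
    return first

def revenue_rows(exposures, outcomes):
    """Left join from exposures: one row per exposed unit, zero revenue included."""
    outcomes_by_unit = defaultdict(list)
    for o in outcomes:
        outcomes_by_unit[o["unit_id"]].append(o)
    rows = []
    for (unit_id, experiment_id), e in first_exposures(exposures).items():
        revenue = sum(
            o["amount"]
            for o in outcomes_by_unit.get(unit_id, [])
            if o["ts"] >= e["ts"]  # ignore pre-exposure behavior
        )
        rows.append({"unit_id": unit_id, "experiment_id": experiment_id,
                     "variant": e["variant"], "revenue": revenue})
    return rows

exposures = [
    {"unit_id": "u1", "experiment_id": "checkout_v2", "variant": "treatment_a", "ts": 100},
    {"unit_id": "u1", "experiment_id": "checkout_v2", "variant": "treatment_a", "ts": 300},
    {"unit_id": "u2", "experiment_id": "checkout_v2", "variant": "control", "ts": 200},
]
outcomes = [
    {"unit_id": "u1", "amount": 5.0, "ts": 50},    # before exposure: excluded
    {"unit_id": "u1", "amount": 20.0, "ts": 150},  # after first exposure: counted
]
rows = revenue_rows(exposures, outcomes)
```

User `u2` has no outcome events and still appears with zero revenue, which is what keeps the per-variant mean honest.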
| Approach | Latency | Use for |
|---|---|---|
| Spark batch (hourly) | ~1 hour | Standard analysis, sample ratio mismatch checks |
| Flink streaming (windowed) | ~5 minutes | Guardrail metrics, early harm detection |
| Real-time (pure stream) | ~1 minute | Emergency use; harder to ensure join correctness |
Most teams run both: a near-real-time Flink job for guardrails on a narrow set of key metrics, and a daily Spark job that produces the authoritative numbers for decision-making.
Statistics done correctly
This is where most interview answers and most production platforms fall short.
Hypothesis testing
For a metric with observed mean μ_c (control) and μ_t (treatment) and sample sizes n_c, n_t:
Δ = μ_t - μ_c
SE = sqrt(σ_c²/n_c + σ_t²/n_t)
z = Δ / SE
p-value = 2 × Φ(-|z|) # two-sided test
At α = 0.05, reject H₀ (no effect) if p < 0.05. The 95% confidence interval for Δ is:
CI = [Δ - 1.96 × SE, Δ + 1.96 × SE]
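The formulas above translate directly into code. A minimal sketch using only the standard library, assuming the per-variant mean, variance, and count have already been aggregated by the pipeline; the function name and argument order are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean_c, var_c, n_c, mean_t, var_t, n_t, alpha=0.05):
    """Two-sided z-test on the difference in means, with a (1 - alpha) CI."""
    delta = mean_t - mean_c
    se = sqrt(var_c / n_c + var_t / n_t)
    z = delta / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    ci = (delta - z_crit * se, delta + z_crit * se)
    return delta, se, z, p_value, ci

# Example: sigma = 50 in both arms, 10,000 users per arm, observed lift of 2.
delta, se, z, p_value, ci = two_sample_z_test(10.0, 2500.0, 10_000,
                                              12.0, 2500.0, 10_000)
```

The normal approximation is reasonable at the sample sizes typical of online experiments; small samples or heavily skewed metrics may call for a different test.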
Sample size planning
Before you launch, compute the minimum sample size required to detect the minimum detectable effect (MDE) with acceptable power (1 - β, typically 0.8) at significance level α (typically 0.05):
n = 2 × (z_α/2 + z_β)² × σ² / MDE²
For α = 0.05, β = 0.2: z_α/2 ≈ 1.96, z_β ≈ 0.84
→ n ≈ 2 × (1.96 + 0.84)² × σ² / MDE²
≈ 15.7 × σ² / MDE²
If σ = 50 (dollars) and MDE = 2 dollars:
n ≈ 15.7 × 2500 / 4 ≈ 9,813 users per variant
This tells you your experiment needs ~10k users per arm before any analysis. Run it long enough to hit that sample size; do not decide earlier.
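The same calculation as a function, so it can be rerun for different MDEs. This is a sketch assuming equal-sized arms and a two-sided test, as in the formula above; the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(sigma: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per variant needed to detect `mde` at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

n_for_2_dollars = sample_size_per_arm(sigma=50, mde=2)
n_for_1_dollar = sample_size_per_arm(sigma=50, mde=1)
```

Because the unrounded z-values are used, the result lands within a few users of the hand calculation. Halving the MDE roughly quadruples the required sample, which is why the MDE is worth negotiating before launch.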
The peeking problem
This is a classic trap and interviewers will ask about it.
Suppose you check significance every day for a 2-week experiment. You make roughly 14 tests. Even with a true null effect (no real difference between control and treatment), the probability that at least one of those 14 daily checks gives p < 0.05 is far higher than 5%. Simulations show that even with a modest number of interim looks, the effective false-positive rate rises well above 5% — continuous monitoring with early stopping at p < 0.05 can push it to roughly 26% (Johari et al. 2022). The exact inflation depends on how many looks you take and your stopping rule, but the direction is unambiguous: repeated looks at a p-value without adjustment inflate Type I errors.
flowchart LR
START["Experiment launched<br/>day 0"] --> D3["Day 3 check:<br/>p = 0.08 — still running"]
D3 --> D7["Day 7 check:<br/>p = 0.04 — ship it!"]
D7 --> WRONG["False positive shipped<br/>— true effect was zero"]
NOTE["Multiple looks inflate<br/>cumulative Type I error<br/>above nominal α"] -.-> D7
style WRONG fill:#ff2e88,color:#fff
style D7 fill:#ffaa00,color:#0a0a0f
style NOTE fill:#0e7490,color:#fff
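The inflation is easy to reproduce with a small A/A simulation. The sketch below assumes unit-variance normal metrics, 1,000 new users per arm per day, and 14 daily looks; to stay fast it draws each day's treatment-minus-control total directly from its normal distribution instead of simulating individual users. All parameter values are illustrative.

```python
import random
from math import sqrt

def simulate_false_positive_rates(n_sims=2000, looks=14, users_per_day=1000,
                                  z_crit=1.96, seed=7):
    """A/A test: returns (rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    var_diff = 2.0  # variance of (treatment - control) for unit-variance arms
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        crossed = False
        z = 0.0
        for _day in range(looks):
            # Daily sum of differences under the null (true effect is zero).
            total += rng.gauss(0.0, sqrt(users_per_day * var_diff))
            n += users_per_day
            z = total / sqrt(n * var_diff)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and shipped here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit  # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate_false_positive_rates()
```

With a single final look the rate stays near the nominal 5%, while stopping at the first daily crossing lands several times higher.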
There are three principled fixes:
- Fixed horizon: commit to a sample size before launch, run the experiment to that size, look exactly once. Mathematically clean but inflexible: you cannot stop early even if the treatment is obviously winning or clearly harmful.
- Sequential testing: use a test that is valid at any stopping time. The mixture Sequential Probability Ratio Test (mSPRT) provides an "always-valid p-value": you can check at any time without inflating false positives, and you can stop early when the result is clear. The cost is a larger final sample than a fixed-horizon test to achieve the same power.
- Bonferroni correction: if you must look K times, divide your α threshold by K. Simple but conservative; it kills power if K is large.
Most production platforms (including those at Airbnb, Netflix, and Microsoft) use a combination: a fixed horizon for the primary analysis, and sequential testing to enable principled early stopping when warranted.
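For intuition, here is a minimal mSPRT sketch for the simplest case: normally distributed treatment-minus-control differences with known variance and a zero-mean normal mixing prior of variance `tau2` over the effect. The known-variance assumption, the value of `tau2`, and the simulation parameters are all illustrative choices, not what any particular platform uses.

```python
import random
from math import exp, log, sqrt

def msprt_p_values(cum_sums, ns, var_diff, tau2):
    """Always-valid p-values from cumulative sums of pairwise differences."""
    p, out = 1.0, []
    for s, n in zip(cum_sums, ns):
        denom = var_diff + n * tau2
        # Log of the mixture likelihood ratio against H0: effect = 0.
        log_lr = 0.5 * log(var_diff / denom) + tau2 * s * s / (2 * var_diff * denom)
        p = min(p, exp(-log_lr))  # running minimum keeps p valid at every look
        out.append(p)
    return out

def simulate_msprt(effect=0.0, n_sims=2000, looks=14, users_per_day=1000,
                   alpha=0.05, tau2=0.01, seed=11):
    """Fraction of simulated experiments rejected at any of the daily looks."""
    rng = random.Random(seed)
    var_diff = 2.0  # unit-variance arms
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        sums, ns = [], []
        for _day in range(looks):
            total += rng.gauss(effect * users_per_day, sqrt(users_per_day * var_diff))
            n += users_per_day
            sums.append(total)
            ns.append(n)
        if msprt_p_values(sums, ns, var_diff, tau2)[-1] <= alpha:
            rejections += 1
    return rejections / n_sims

null_rate = simulate_msprt(effect=0.0)
power_at_small_effect = simulate_msprt(effect=0.1)
```

Under the null the rejection rate stays at or below α despite 14 looks, and a real effect is still detected; the price is a stricter effective threshold at each look than the fixed-horizon 1.96.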
Variance reduction: CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique. Much of the variance in a metric like "revenue this week" can be predicted from "revenue last week." Subtract out that predictable component and the residual variance is smaller — giving you more statistical power, or equivalently, needing fewer users to detect the same effect.
Y_cuped = Y - θ × X
where:
Y = outcome metric during the experiment
X = the same metric from a pre-experiment window (same users, week before)
θ = Cov(Y, X) / Var(X) ← computed from the control group
Analysis proceeds on Y_cuped instead of Y.
CUPED does not change the expected value of the treatment effect — the adjustment is mean-preserving and only reduces variance. A variance reduction of 50% is equivalent to doubling your sample size. Variance reduction depends almost entirely on how strongly the pre-experiment covariate correlates with the outcome metric. The original Microsoft paper on Bing reported around 50% reduction using queries-per-user as the covariate; in practice, reported reductions across teams and products range from roughly 20% (weak covariate correlation) to 70% (strong correlation, e.g., booking frequency predicting revenue).
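A small numeric sketch makes the link between correlation and variance reduction concrete. It uses synthetic data (the distribution parameters are arbitrary assumptions) and centers X on its pooled mean, which shifts every user by the same constant and so leaves the treatment-minus-control difference unchanged while keeping the metric on its original scale.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance; cov(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def cuped_adjust(y, x):
    """Return (adjusted outcomes, theta) with theta = Cov(Y, X) / Var(X)."""
    theta = cov(y, x) / cov(x, x)
    mx = mean(x)
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)], theta

# Synthetic users: pre-period metric X, in-experiment metric Y correlated with X.
rng = random.Random(3)
x = [rng.gauss(100, 20) for _ in range(5000)]
y = [0.8 * xi + rng.gauss(0, 12) for xi in x]

y_adj, theta = cuped_adjust(y, x)
variance_reduction = 1 - cov(y_adj, y_adj) / cov(y, y)
rho_squared = cov(x, y) ** 2 / (cov(x, x) * cov(y, y))
```

The fraction of variance removed equals the squared correlation between X and Y, which is why a weakly correlated covariate buys little and a strongly correlated one buys a lot.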
Sample Ratio Mismatch (SRM)
Before trusting any result, check: did treatment and control receive users in approximately the allocation ratio?
Expected ratio: 50% control, 50% treatment.
Observed: control n=1,002,000; treatment n=987,000.
Total N: 1,989,000. Expected per arm: 994,500.
Chi-squared: Σ (observed - expected)² / expected
= (1,002,000 - 994,500)² / 994,500 + (987,000 - 994,500)² / 994,500
= 56,250,000 / 994,500 + 56,250,000 / 994,500
≈ 56.56 + 56.56 ≈ 113.1 (df=1, p ≪ 0.001) → SRM detected.
An SRM means the randomization or logging is broken. Common causes: the SDK only loads for some users (e.g., late-loading JavaScript), bots are filtered from one variant but not the other, or the assignment hash function is not uniform.
flowchart TD
RESULTS["Experiment results ready"] --> SRM_CHECK["SRM check:<br/>chi-squared test on observed vs expected allocation"]
SRM_CHECK -->|"p > 0.05 — allocation looks fine"| METRICS["Show metric results<br/>p-values, CIs, CUPED"]
SRM_CHECK -->|"p < 0.001 — allocation is broken"| BLOCK["Block results display<br/>raise data-quality alarm"]
BLOCK --> DEBUG["Debug: SDK load order?<br/>bot filtering? hash uniformity?"]
style BLOCK fill:#ff2e88,color:#fff
style METRICS fill:#15803d,color:#fff
style SRM_CHECK fill:#ff6b1a,color:#0a0a0f
Never trust experimental results when an SRM is detected. Surface it as a data-quality alarm on the dashboard before showing any metric results.
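The check itself is a few lines. This sketch covers the two-arm case only, where the chi-squared p-value with one degree of freedom has a closed form via the complementary error function; the function name and the 0.001 alarm threshold are assumptions that mirror the flowchart above.

```python
from math import erfc, sqrt

def srm_check(observed, expected_ratios, threshold=0.001):
    """Chi-squared goodness-of-fit test for a two-arm experiment (df = 1)."""
    if len(observed) != 2 or len(expected_ratios) != 2:
        raise ValueError("this sketch only handles two arms")
    total = sum(observed)
    chi2 = sum(
        (obs - total * ratio) ** 2 / (total * ratio)
        for obs, ratio in zip(observed, expected_ratios)
    )
    # For df = 1, P(chi-squared > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold

chi2, p_value, srm_detected = srm_check([1_002_000, 987_000], [0.5, 0.5])
```

More than two arms needs the general chi-squared survival function, which is available in statistics libraries but not in the standard library.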
Concurrent experiments: layers and orthogonal salts
The naive approach of assigning users to experiments independently has a subtle problem. Suppose experiment A tests a new ranking algorithm and experiment B tests a new UI. With a 50/50 split in each, 25% of users get both treatments simultaneously. If the two treatments interact (the new ranking algorithm works better with the new UI), the effect of experiment A is confounded with experiment B. You cannot tell which change caused what.
Google's overlapping experiment framework (described in their 2010 paper "Overlapping Experiment Infrastructure") solves this with layers. Each layer contains a set of mutually-exclusive experiments. Experiments in different layers can overlap freely.
flowchart TD
USER[User ID] --> L1[Layer 1: Ranking experiments<br/>salt: rank_layer]
USER --> L2[Layer 2: UI experiments<br/>salt: ui_layer]
USER --> L3[Layer 3: Pricing experiments<br/>salt: price_layer]
L1 --> R1[Exp A: new ranking]
L1 --> R2[Exp B: old ranking]
L2 --> U1[Exp C: new button]
L2 --> U2[Exp D: old button]
L3 --> P1[Exp E: free shipping test]
style L1 fill:#ff6b1a,color:#0a0a0f
style L2 fill:#0e7490,color:#fff
style L3 fill:#a855f7,color:#fff
Each layer uses a different salt:
layer_bucket = hash(user_id + layer_salt) % 100
A user's layer 1 bucket is independent of their layer 2 bucket, by the properties of a good hash function. Within a layer, experiments are allocated as contiguous ranges of the 0–99 bucket space, so they are mutually exclusive. Across layers, users are orthogonally randomized.
Mutually-exclusive experiments — when interactions are known to be a concern — go in the same layer, where they cannot overlap. If you have only 100 bucket slots per layer and you launch many experiments, layers fill up. Most platforms manage this by expiring concluded experiments quickly to free up slots, and by using more layers.
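A sketch of layer-based routing, under stated assumptions: SHA-256 stands in for the platform's hash so the block is self-contained, the `LAYERS` table and experiment names are hypothetical, and the code only decides which experiment a user falls into within each layer (variant assignment inside the experiment would use the experiment's own salt, as described earlier).

```python
import hashlib

def layer_bucket(user_id: str, layer_salt: str, buckets: int = 100) -> int:
    """Bucket in [0, buckets) derived from the user and the layer's salt."""
    digest = hashlib.sha256(f"{user_id}|{layer_salt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

# Hypothetical layout: each experiment owns a half-open bucket range in its layer.
LAYERS = {
    "rank_layer": [("exp_a", 0, 50), ("exp_b", 50, 100)],
    "ui_layer": [("exp_c", 0, 50), ("exp_d", 50, 100)],
}

def experiments_for(user_id: str) -> dict:
    """At most one experiment per layer, chosen by the layer-specific bucket."""
    chosen = {}
    for layer_salt, ranges in LAYERS.items():
        b = layer_bucket(user_id, layer_salt)
        for name, start, end in ranges:
            if start <= b < end:
                chosen[layer_salt] = name
                break
    return chosen

# Fraction of synthetic users who land in both exp_a and exp_c.
n_users = 20_000
both = sum(
    1 for i in range(n_users)
    if experiments_for(f"user_{i}") == {"rank_layer": "exp_a", "ui_layer": "exp_c"}
)
overlap_fraction = both / n_users
```

With independent layer salts the overlap comes out close to 0.5 × 0.5, so each experiment sees a balanced mix of the other layer's experiments.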
Staged ramp-up and feature flag integration
A/B testing and feature flag delivery share nearly identical infrastructure. The key difference is measurement.
A staged ramp-up is a special case: start an experiment at 1% allocation (just enough to catch crashes and SRMs quickly), verify guardrails, then ramp to 10%, 50%, 100%. Each ramp step is a config change — no code deploy.
stateDiagram-v2
[*] --> Draft: experiment created
Draft --> Running: launched at N%
Running --> Ramping: allocation increased
Ramping --> Running: stable
Running --> Paused: guardrail breach or manual
Paused --> Running: issue resolved
Running --> Concluded: sample size reached
Concluded --> Shipped: winner deployed
Concluded --> Reverted: loser discarded
Shipped --> [*]
Reverted --> [*]
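The lifecycle above can be enforced in the experiment service with a small transition table. This sketch encodes exactly the arrows in the state diagram; the `transition` helper and its error handling are illustrative.

```python
# Allowed transitions, copied from the state diagram.
ALLOWED_TRANSITIONS = {
    "Draft": {"Running"},
    "Running": {"Ramping", "Paused", "Concluded"},
    "Ramping": {"Running"},
    "Paused": {"Running"},
    "Concluded": {"Shipped", "Reverted"},
    "Shipped": set(),
    "Reverted": set(),
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the diagram has no such arrow."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = "Draft"
for step in ["Running", "Ramping", "Running", "Concluded", "Shipped"]:
    state = transition(state, step)
```

Rejecting illegal transitions at write time keeps, for example, a paused experiment from being marked as shipped without going back through Running and Concluded.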
Failure modes
| Failure | Symptom | Mitigation |
|---|---|---|
| SRM (Sample Ratio Mismatch) | Treatment/control counts don't match allocation | Automatic SRM check before surfacing results; block analysis until resolved |
| Peeking / repeated testing | False positive shipped as a winner | Fixed-horizon commitment; sequential tests for early stopping |
| Assignment skew | Hash function not uniform; correlated with user attributes | Use well-tested hash functions (MurmurHash3, xxHash); test uniformity on real IDs |
| Config staleness | Old config served after experiment paused | Aggressive polling (30 s) + push notifications for pauses; SDK caches pause state with short TTL |
| Exposure log loss | Kafka consumer falls behind; events dropped | At-least-once delivery; deduplicate downstream; monitor consumer lag |
| Metric pipeline lag | Guardrails fire hours late | Separate fast-path Flink job for guardrail metrics; alert on pipeline lag itself |
| Interaction effects | Two interacting experiments in different layers pollute each other | Layer enforcement at config creation time; lint rules require conflicting experiments to share a layer, where they are mutually exclusive |
| Novelty effect | Treatment looks good for 3 days then regresses | Run experiments for at least 1–2 full behavioral cycles (typically 1–2 weeks) before concluding |
Storage choices
| Data | Store | Rationale |
|---|---|---|
| Experiment config | Postgres | Low write rate, complex queries, needs ACID for allocation changes |
| Config delivery cache | Redis / CDN | Fast reads for SDK polling; tolerate ~30 s stale |
| Exposure events (raw) | Kafka → S3 / object store | Append-only, high-volume; cheap cold storage |
| Outcome events (raw) | Kafka → S3 / object store | Same pattern; separate topics per event type |
| Joined experiment results | BigQuery / Snowflake (columnar) | Aggregate queries over billions of rows; columnar scans are often 10–100× faster than row-oriented storage for this access pattern |
| Computed statistics | Postgres / document store | Low volume; read by dashboard |
Things to discuss in an interview
- Why hash-based assignment and not a lookup table? No per-user storage, no DB on the hot path, deterministic, portable across services. Trade-off: you cannot un-assign a user mid-experiment without changing the salt (which re-randomizes everyone — rarely desirable).
- Why exposure logging and not assignment logging? Many assigned users never encounter the treatment. Analyzing on assignment inflates N and dilutes the effect. In clinical-trial terms, assignment-based analysis is the intent-to-treat (ITT) approach; exposure-based analysis is closer to per-protocol — it restricts to participants who actually received the treatment.
- How do you handle the peeking problem in practice? Fixed horizon for primary decisions, sequential testing (mSPRT) for guardrails and early stopping. Brief the stakeholders before launch on the sample-size commitment — this is a cultural problem as much as a technical one.
- How does CUPED work? Regress out pre-experiment variance using the same metric from a window before the experiment started. Does not bias the estimate; reduces variance roughly 20–70% depending on the covariate correlation.
- What is SRM and what causes it? Mismatch between expected and observed allocation ratios. Caused by logging bugs, bot filtering, or hash non-uniformity. Always check before trusting results.
- How do you scale to 10k concurrent experiments? Layers with orthogonal salts. Experiments within a layer are mutually exclusive; experiments across layers are independent. Slot management (expiring old experiments) keeps layers from overfilling.
- How are A/B testing and feature flags related? Same assignment and delivery infrastructure; A/B adds measurement, exposure logging, and statistics on top. See feature flag service.
Things you should now be able to answer
- Why does the deterministic hash use unit_id + experiment_salt rather than just unit_id?
- What is the difference between assignment and exposure, and which one drives the analysis population?
- Why does checking p-values every day inflate the false-positive rate, and what are two principled fixes?
- What does CUPED do and how does it not bias the result?
- What is an SRM and why does it invalidate an experiment's results?
- How do experiment layers prevent cross-experiment contamination without requiring exclusion of overlapping users?
- What happens if you change the experiment salt mid-experiment?
Further reading
- Kohavi, Longbotham, Sommerfield, Henne — "Controlled Experiments on the Web: Survey and Practical Guide" (2009). The canonical reference.
- Tang, Agarwal, O'Brien, Meyer — "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation" (Google, KDD 2010). The layers framework.
- Deng, Xu, Kohavi, Walker — "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data" (CUPED, WSDM 2013).
- Johari, Pekelis, Walsh — "Always Valid Inference: Bringing Sequential Analysis to A/B Testing" (mSPRT / sequential testing, arXiv 2015, Operations Research 2022).
- Airbnb Engineering: "Experiment Reporting Framework" — medium.com/airbnb-engineering.
- Netflix Technology Blog: "It's All A/Bout Testing: The Netflix Experimentation Platform".
- Microsoft ExP (Experimentation Platform) papers — various, available on exp-platform.com.
Frequently asked questions
▸What is the peeking problem in A/B testing and how do you fix it?
Peeking means checking a p-value repeatedly before the experiment reaches its planned sample size. Even with a true null effect, every additional look raises the chance of a spurious p < 0.05; continuous monitoring with early stopping at p < 0.05 can push the effective false-positive rate to roughly 26% (Johari et al. 2022). The two principled fixes are a fixed horizon (commit to a sample size before launch and look exactly once) and sequential testing using mSPRT, which provides an always-valid p-value you can check at any time without inflating Type I error.
▸Why use hash-based assignment instead of storing variant assignments in a database?
The formula bucket = hash(unit_id + experiment_salt) % 100 is computed entirely in-process, requiring no database lookup on the hot path and achieving sub-millisecond latency. It is deterministic forever — user 12345 always lands in the same bucket for a given experiment — and any service (iOS, Android, backend) independently computes the same answer with no coordination. The trade-off is that changing the salt mid-experiment re-randomizes every user, which is rarely desirable.
▸What is CUPED and how much variance does it reduce?
CUPED (Controlled-experiment Using Pre-Experiment Data) regresses out the predictable component of the outcome metric using the same metric measured in a pre-experiment window, then runs the analysis on the residual. It does not change the expected treatment effect — the adjustment is mean-preserving — but reduces variance. The original Microsoft paper on Bing reported roughly 50% reduction; in practice teams see 20% to 70% depending on how strongly the pre-experiment covariate correlates with the outcome.
▸What is a Sample Ratio Mismatch (SRM) and when does it occur?
An SRM is a statistically significant difference between the observed allocation ratio and the expected one. For example, in a 50/50 experiment where control receives 1,002,000 users and treatment receives only 987,000, the chi-squared statistic is about 113.1 (p much less than 0.001). Common causes are late-loading SDKs that only fire for some users, bot filtering applied to one variant but not the other, and a non-uniform hash function. Any SRM invalidates the experiment's results and must be treated as a data-quality alarm before showing any metrics.
▸How do experiment layers prevent cross-experiment interference when running thousands of concurrent experiments?
Based on Google's overlapping experiment framework (KDD 2010), each layer uses a distinct salt so that a user's bucket assignment in layer 1 is statistically independent of their bucket in layer 2 or layer 3. Within a layer, experiments occupy contiguous, mutually exclusive bucket ranges so they never overlap. Experiments that are known to interact go in the same layer where they cannot co-occur; experiments in different layers can freely overlap across the full user population without confounding each other.