Design an A/B Testing & Experimentation Platform

Run thousands of controlled experiments at once. Deterministic bucketing, exposure logging, the metrics pipeline, statistical significance, and the peeking problem.

23 min read · 2026-06-06 · Ironclad Academy

The problem

Netflix, Airbnb, and Google collectively run tens of thousands of experiments per year — every button color, ranking tweak, and pricing change gets tested before shipping. An A/B testing platform is the infrastructure that makes this possible: it assigns each user to a variant of an experiment, records who actually encountered the change, collects what they did afterward, and then applies statistics to figure out whether the change helped or hurt. The platform is the arbiter of truth for every product decision at a data-driven company.

The naive design — put users in a bucket, measure a metric, compare averages — falls apart the moment you run more than a handful of experiments. The core engineering tension is threefold. First, assignment must be deterministic and sub-millisecond with no database on the hot path, which rules out the obvious approach of storing assignments in a lookup table. Second, the exposure log must be complete and correct — a missing or mislabeled event poisons the entire analysis population. Third, the statistical layer must protect against the ways teams naturally lie to themselves: checking results before you have enough data, running experiments that interfere with each other, and trusting allocation ratios that have silently drifted from what you specified.

The platform must also be correct statistically, not just correct architecturally. A system that assigns variants flawlessly and logs exposures perfectly can still ship a regression if the stats layer lets engineers peek at p-values daily and call it done when a number dips below 0.05. Senior interviewers will push hard on the statistics and on how you manage thousands of concurrent experiments without them polluting each other's results.

Functional requirements

  • Experiment CRUD: Create experiments with a name, description, variants (control + 1–N treatments), allocation percentages, targeting/eligibility rules (country, platform, user segment), and a set of metrics (primary, secondary, guardrail).
  • Assignment: Given a unit ID (user, session, device, request), return a stable variant for each active experiment that the unit is eligible for. Must be deterministic, consistent across calls, and not require a database lookup.
  • Exposure logging: Record, for each unit, which variant they were actually exposed to and when. This is distinct from assignment (a user may be assigned to treatment but never actually view the affected UI).
  • Outcome tracking: Ingest arbitrary product events (purchases, clicks, errors, latency observations) tagged with unit IDs and timestamps.
  • Metrics computation: Join exposure and outcome streams, compute per-variant metric values, and surface aggregate statistics.
  • Statistical analysis: Compute p-values, confidence intervals, and sample-size estimates. Flag winners/losers. Support variance reduction (CUPED).
  • Guardrail enforcement: Auto-pause experiments that cause statistically significant degradation on guardrail metrics (latency, error rate, revenue).
  • Concurrent experiments: Run thousands of non-interfering experiments simultaneously with a clear framework for managing overlaps.

Non-functional requirements

  • Assignment latency < 5 ms p99 (ideally < 1 ms) — this is on the critical path of every page render.
  • Assignment consistency — a given unit_id must land in the same variant across services, calls, and devices for the lifetime of the experiment.
  • Exposure log completeness — missing exposures bias the analysis. The pipeline must be at-least-once with deduplication.
  • High availability — a broken assignment path means every feature gated on an experiment stops working.
  • Metric pipeline freshness — guardrail checks should surface within ~1 hour; full analysis can be daily.

Capacity estimation

| Dimension | Estimate | How we got there |
| --- | --- | --- |
| Concurrent experiments | 10,000 | Given |
| DAU | 500M | Given |
| Assignment throughput (avg) | ~29,000 req/sec | 500M × 5 page views ÷ 86,400 s |
| Assignment throughput (peak) | ~100k req/sec | Spiky traffic multiplier |
| Hash computation latency | ~1–15 ns per call | MurmurHash3 / xxHash on a ~20-byte key in-process; 100k/sec is trivially fast |
| Exposure events (sustained) | ~5,800/sec | 500M exposures/day ÷ 86,400 s |
| Exposure events (peak) | ~50k/sec | Spiky; not every assignment triggers an exposure |
| Exposure ingest bandwidth (sustained) | ~1.2 MB/s | 5,800/s × 200 B/event |
| Exposure ingest bandwidth (peak) | ~10 MB/s | 50k/s × 200 B/event |
| Outcome events (sustained) | ~60k–290k/sec | 5B–25B/day ÷ 86,400 s; 10–50 outcomes/user/day |
| Exposure log storage | ~100 GB/day | 500M rows/day × 200 B |
| Outcome log storage | ~1 TB/day | 10B rows/day × 100 B |
| Data warehouse (90-day rolling, raw) | ~100 TB | (100 GB + 1 TB) × 90 days |
| Data warehouse (90-day rolling, compressed) | ~20 TB | Columnar compression ~5× |

Takeaway: Assignment throughput is trivially cheap because it is pure in-process hashing. The hard numbers are on the data side: ~100 GB/day of exposure logs and ~1 TB/day of outcome events flowing into Kafka, settling into roughly 20 TB of compressed columnar storage over a 90-day window.
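
The table's arithmetic can be reproduced in a few lines. This is a back-of-the-envelope sketch using only the inputs stated above (500M DAU, 5 page views per user, 200 B exposure events, 10B outcome rows at 100 B each, 90-day retention, roughly 5× compression); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

dau = 500_000_000
page_views_per_user = 5
exposure_event_bytes = 200
outcome_rows_per_day = 10_000_000_000
outcome_event_bytes = 100
retention_days = 90
compression_ratio = 5

# Request and event rates, averaged over a day
assignments_per_sec = dau * page_views_per_user / SECONDS_PER_DAY
exposures_per_sec = dau / SECONDS_PER_DAY

# Daily log volume (decimal units: 1 GB = 1e9 bytes, 1 TB = 1e12 bytes)
exposure_gb_per_day = dau * exposure_event_bytes / 1e9
outcome_tb_per_day = outcome_rows_per_day * outcome_event_bytes / 1e12

# 90-day rolling window, raw and after columnar compression
raw_tb = (exposure_gb_per_day / 1000 + outcome_tb_per_day) * retention_days
compressed_tb = raw_tb / compression_ratio

print(f"assignments/sec (avg): {assignments_per_sec:,.0f}")
print(f"exposures/sec (avg):   {exposures_per_sec:,.0f}")
print(f"exposure log:          {exposure_gb_per_day:,.0f} GB/day")
print(f"outcome log:           {outcome_tb_per_day:,.1f} TB/day")
print(f"90-day raw:            {raw_tb:,.0f} TB")
print(f"90-day compressed:     {compressed_tb:,.1f} TB")
```

The raw window comes out just under the table's rounded ~100 TB, and the compressed figure just under ~20 TB.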

Building up to the design

Let's earn each component. Start with the simplest possible thing and grow it until it breaks in a specific, nameable way.

V1: A flag in the app and a spreadsheet

# Hardcoded in the app
SHOW_NEW_CHECKOUT = user_id % 2 == 0   # 50/50 split

# Track outcomes however you can

This is a real A/B test, and you can manually compute conversion rates from it. But every new experiment requires a code deploy. There's no centralized record of who saw what. Analysts query production databases ad hoc and disagree on numbers. You cannot run 10 experiments simultaneously without collision.

V2: A config service + server-side bucketing

Pull experiment definitions from a config store and do the bucketing in the application server.

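import hashlib

def get_variant(user_id, experiment_name, config):
    exp = config[experiment_name]
    # Python's built-in hash() is randomized per process for strings, so use a stable digest.
    digest = hashlib.md5((str(user_id) + exp["salt"]).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    for variant in exp["variants"]:
        if bucket < variant["allocation_end"]:
            return variant["name"]
    return None  # bucket falls outside every variant: user is not in the experiment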

Now experiments are data, not code. Non-engineers can change allocations without deploys. Multiple experiments run concurrently. But the config is fetched on every request, adding latency. And multiple services bucket independently without sharing experiment context. Worse: you are computing a p-value by looking at your dashboard every morning and calling it done when it dips below 0.05 — more on that peeking problem shortly.

V3: In-process SDK with pushed config + exposure log

App boot:  SDK loads full experiment config into memory (~100 kB compressed).
Runtime:   assignment is a pure hash call — no I/O.
Exposure:  SDK fires a non-blocking event to Kafka when user actually sees the treatment.
Config:    background thread polls/receives config updates; ~30 s staleness.

Now assignment is sub-millisecond with no external I/O on the hot path, and you have a correct exposure population for analysis. But you now have accurate exposure counts with no metrics pipeline behind them. Guardrails don't exist. You ship a bug that raises crash rate 20% and don't know for 24 hours.

V4: Streaming metrics pipeline + guardrails

Add a near-real-time pipeline that joins the last hour of exposure events with the last hour of outcome events. Compute per-variant metric deltas. Alert if any guardrail metric (crash rate, p50 latency, revenue per user) degrades significantly. Harm is now detectable within ~1 hour, and experiments auto-pause on guardrail breach.

You're still doing repeated significance testing (peeking), though. And with 500 experiments running, some of them interact — the checkout button color experiment and the checkout flow redesign experiment affect the same users at the same time.

V5: Statistical rigor + experiment layers

Add fixed-horizon sample-size planning before launch, sequential testing (mSPRT) for valid early stopping, and a layering system (orthogonal salts per layer) to isolate cross-experiment interference. This is the production design.

flowchart LR
    V1["V1: hardcoded flag"] --> V2["V2: config service + server bucketing"]
    V2 --> V3["V3: in-process SDK + exposure log"]
    V3 --> V4["V4: streaming pipeline + guardrails"]
    V4 --> V5["V5: sequential testing + layers"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

Assignment: deterministic hashing

This is the foundation. The entire system rests on one equation:

bucket = hash(unit_id + experiment_salt) % 100

unit_id is the stable identifier for the randomization unit — usually user_id, sometimes session_id, device_id, or request_id depending on what you want to measure. experiment_salt is a unique string per experiment (typically the experiment name or a UUID assigned at creation). This ensures that a user in bucket 42 for experiment A is not necessarily in bucket 42 for experiment B — the salts break correlation. hash() itself is a fast, uniform, non-cryptographic hash — MurmurHash3 and xxHash are common choices. MD5 works but is slower than necessary; the output must be uniformly distributed modulo 100.

Three things make this approach powerful. First, no per-user storage: the assignment for user 12345 in experiment "checkout_v2" is always hash("12345|checkout_v2") % 100, computed fresh every time. Second, consistent across services: the iOS app, the Android app, and the backend API all independently compute the same answer, with no coordination. Third, debuggable: "why did user X get variant B?" — just recompute the hash.

flowchart LR
    UID["user_id: '12345'"] --> CONCAT["concat with salt<br/>'12345|checkout_v2'"]
    SALT["experiment_salt:<br/>'checkout_v2'"] --> CONCAT
    CONCAT --> HASH["MurmurHash3<br/>→ 2847342"]
    HASH --> MOD["% 100 → bucket 42"]
    MOD --> CHECK{"bucket < 50?"}
    CHECK -->|yes| CTRL["control"]
    CHECK -->|no| TREAT["treatment_a"]
    style HASH fill:#ff6b1a,color:#0a0a0f
    style CTRL fill:#15803d,color:#fff
    style TREAT fill:#0e7490,color:#fff

Here is the allocation code in full:

import mmh3  # third-party MurmurHash3 binding
from typing import Optional

def assign(user_id: str, experiment: dict) -> Optional[str]:
    salt = experiment["salt"]
    # Unsigned 32-bit MurmurHash3 of "user_id|salt", reduced to a bucket in 0..99
    bucket = mmh3.hash(user_id + "|" + salt, signed=False) % 100
    cumulative = 0
    for variant in experiment["variants"]:
        cumulative += variant["allocation_pct"]
        if bucket < cumulative:
            return variant["name"]
    return None   # unallocated bucket: the user is not in the experiment

If allocations sum to less than 100, users in the unallocated buckets are not in the experiment at all — useful for gradual ramp-up (start at 1%, grow to 50%).
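
The properties claimed in this section (same answer on every call, partial allocation leaving most users out, different salts decorrelating buckets) can be checked empirically. The sketch below is a minimal illustration, not the platform's SDK: it swaps in `hashlib.blake2b` from the standard library as a stand-in for MurmurHash3 so it runs without third-party packages, and the names `stable_bucket`, `assign_variant`, and the salt `search_rank_v3` are hypothetical.

```python
import hashlib
from typing import Optional

def stable_bucket(unit_id: str, salt: str, buckets: int = 100) -> int:
    """Map (unit_id, salt) to a bucket in [0, buckets) using a stable digest."""
    digest = hashlib.blake2b(f"{unit_id}|{salt}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets

def assign_variant(unit_id: str, salt: str, variants: list) -> Optional[str]:
    """variants is a list of (name, allocation_pct); returns None if unallocated."""
    bucket = stable_bucket(unit_id, salt)
    cumulative = 0
    for name, pct in variants:
        cumulative += pct
        if bucket < cumulative:
            return name
    return None

users = [f"user_{i}" for i in range(10_000)]
ramp = [("control", 5), ("treatment_a", 5)]  # 10% of traffic is in the experiment

first = [assign_variant(u, "checkout_v2", ramp) for u in users]
second = [assign_variant(u, "checkout_v2", ramp) for u in users]
in_experiment = sum(v is not None for v in first)

# Number of users who land in the same bucket under two different salts
same_bucket = sum(
    stable_bucket(u, "checkout_v2") == stable_bucket(u, "search_rank_v3") for u in users
)

print(first == second)              # determinism: identical on every call
print(in_experiment / len(users))   # should be close to 0.10
print(same_bucket / len(users))     # should be close to 0.01 if salts decorrelate buckets
```

With 100 buckets, two independent bucketings agree about 1% of the time by chance, so a share near 0.01 is what "the salts break correlation" looks like in practice.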

Randomization unit choice matters:

| Unit | Use when | Watch out for |
| --- | --- | --- |
| user_id | Measuring per-user outcomes over sessions | Logged-out users need a stable fallback (device ID or cookie) |
| session_id | Testing session-level features (search ranking) | Same user sees different variants; contamination if they compare |
| device_id | Pre-login experiments | Device-sharing (family computers) pollutes user-level analysis |
| request_id | Infrastructure / backend experiments (caching, routing) | Only valid for metrics that don't require user-level aggregation |

Experiment config and the config service

Assignment is in-process, but the config must flow from the experiment management service to every running SDK instance. This is structurally identical to a feature flag service — A/B testing is feature flags plus measurement.

Config payload per experiment (~100 bytes each):
  - experiment_id, salt, status (running/paused/concluded)
  - variants: [{name, allocation_pct}]
  - eligibility rules: [{dimension, operator, values}]
  - start_time, end_time

Total config: 10,000 experiments × 100 bytes = 1 MB uncompressed; ~100 kB gzip.

The SDK fetches the full config on startup and polls every 30 seconds (or receives a push notification of changes). Config staleness is at most ~30 seconds — acceptable for most experiments. Emergency pauses propagate within one polling interval.

Eligibility filtering runs the hash only if the user qualifies:

def is_eligible(user: dict, rules: list) -> bool:
    # Every rule must pass; keys match the config payload ({dimension, operator, values}).
    for rule in rules:
        user_val = user.get(rule["dimension"])
        if rule["operator"] == "IN" and user_val not in rule["values"]:
            return False
        if rule["operator"] == "NOT_IN" and user_val in rule["values"]:
            return False
    return True

If the user is not eligible, they are not assigned and not exposed. This is how you target "only US users on iOS version >= 16" without any server round-trip.
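
A rule like "iOS version >= 16" needs a comparison operator in addition to set membership. One way to sketch that is an operator table keyed by the rule's `operator` field. The `GTE` operator, the `os_major_version` dimension, and the fail-closed handling of unknown operators are assumptions made for illustration, not part of the config format described above.

```python
from typing import Any, Callable, Dict, List

# IN / NOT_IN come from the text; GTE is a hypothetical addition.
OPERATORS: Dict[str, Callable[[Any, List[Any]], bool]] = {
    "IN": lambda value, values: value in values,
    "NOT_IN": lambda value, values: value not in values,
    "GTE": lambda value, values: value is not None and value >= values[0],
}

def is_eligible_extended(user: dict, rules: list) -> bool:
    """Return True only if the user passes every rule."""
    for rule in rules:
        check = OPERATORS.get(rule["operator"])
        if check is None:
            return False  # fail closed: an unknown operator means not eligible
        if not check(user.get(rule["dimension"]), rule["values"]):
            return False
    return True

rules = [
    {"dimension": "country", "operator": "IN", "values": ["US"]},
    {"dimension": "platform", "operator": "IN", "values": ["ios"]},
    {"dimension": "os_major_version", "operator": "GTE", "values": [16]},
]

print(is_eligible_extended({"country": "US", "platform": "ios", "os_major_version": 17}, rules))
print(is_eligible_extended({"country": "US", "platform": "ios", "os_major_version": 15}, rules))
```

Failing closed on an unrecognized operator keeps an older SDK from enrolling users under a rule it cannot evaluate; whether that is the right default depends on how SDK versions are rolled out.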

Exposure logging — the most important log in the system

Assignment ≠ Exposure. A user assigned to "treatment" might never see the affected UI if the experiment is behind a feature gate that only fires on checkout. If you analyze "all assigned users," you dilute your treatment effect with users who got no treatment — you'll miss real effects or manufacture phantom ones.

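The exposure event fires precisely when the user reaches the point where the variants diverge, and it fires for control users as well as treatment users so that both arms are counted the same way: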

{
  "event_type": "experiment_exposure",
  "timestamp": "2026-06-06T14:23:01.452Z",
  "unit_id": "user_12345",
  "experiment_id": "checkout_v2",
  "variant": "treatment_a",
  "client": "web",
  "session_id": "sess_abc"
}

This is a non-blocking write to Kafka: the request path never waits on it. If the SDK cannot reach Kafka (client-side SDK), it buffers locally and flushes on the next network opportunity. The pipeline deduplicates on (unit_id, experiment_id) per analysis window, so a user exposed multiple times still counts once per experiment.
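
The deduplication step can be sketched as keeping the earliest event per (unit_id, experiment_id). This is a minimal in-memory illustration; a real pipeline would do the same thing as a keyed aggregation in Spark or Flink. It assumes timestamps are ISO-8601 strings in UTC with a fixed format, which makes string comparison match chronological order. The function name and sample events are hypothetical.

```python
def first_exposures(events):
    """Keep the earliest exposure per (unit_id, experiment_id)."""
    first = {}
    for event in events:
        key = (event["unit_id"], event["experiment_id"])
        kept = first.get(key)
        # Fixed-format UTC ISO-8601 strings sort chronologically
        if kept is None or event["timestamp"] < kept["timestamp"]:
            first[key] = event
    return first

events = [
    {"unit_id": "user_12345", "experiment_id": "checkout_v2",
     "variant": "treatment_a", "timestamp": "2026-06-06T14:23:01.452Z"},
    {"unit_id": "user_12345", "experiment_id": "checkout_v2",
     "variant": "treatment_a", "timestamp": "2026-06-06T09:10:44.010Z"},  # earlier, arrived late
    {"unit_id": "user_12345", "experiment_id": "checkout_v2",
     "variant": "treatment_a", "timestamp": "2026-06-06T14:23:01.452Z"},  # at-least-once duplicate
    {"unit_id": "user_67890", "experiment_id": "checkout_v2",
     "variant": "control", "timestamp": "2026-06-06T11:00:00.000Z"},
]

deduped = first_exposures(events)
print(len(deduped))
print(deduped[("user_12345", "checkout_v2")]["timestamp"])
```

Keeping the earliest timestamp rather than an arbitrary one matters downstream, because the join only counts outcomes that happened after first exposure.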

High-level architecture

flowchart TD
    EXP_DEF[Experiment author] -->|create / edit| EXP_SVC[Experiment service<br/>CRUD + validation]
    EXP_SVC --> CFGDB[(Config store<br/>Postgres)]
    CFGDB -->|push on change| CFGSVC[Config delivery<br/>CDN / edge]
    CFGSVC -->|poll 30 s| SDK[Client SDK<br/>in-process assignment]

    SDK -->|exposure event| KAFKA_EXP[Kafka<br/>exposures topic]
    APP[Product app] -->|outcome events| KAFKA_OUT[Kafka<br/>outcomes topic]

    KAFKA_EXP --> ETL[Metrics pipeline<br/>Spark batch / Flink streaming]
    KAFKA_OUT --> ETL

    ETL --> DW[(Data warehouse<br/>columnar store)]
    DW --> STATS[Stats engine<br/>p-values · CIs · CUPED]
    STATS --> DASH[Experiment dashboard]
    STATS --> GUARD[Guardrail service<br/>auto-pause]
    GUARD -->|pause experiment| EXP_SVC

    style SDK fill:#ff6b1a,color:#0a0a0f
    style KAFKA_EXP fill:#a855f7,color:#fff
    style ETL fill:#0e7490,color:#fff
    style STATS fill:#15803d,color:#fff
    style GUARD fill:#ffaa00,color:#0a0a0f

The metrics pipeline

The pipeline must join two streams: exposure events (who was in which variant, when) and outcome events (what actions did users take — revenue, clicks, errors, latency).

sequenceDiagram
    participant SDK as SDK
    participant K as Kafka
    participant SP as Spark job
    participant DW as Data warehouse
    participant STATS as Stats engine

    SDK->>K: exposure event — user, experiment, variant, ts
    SDK->>K: outcome event — user, purchase, amount, ts
    Note over K: buffered, low-latency write
    SP->>K: consume both topics, hourly or daily
    SP->>SP: join on unit_id where outcome_ts >= exposure_ts
    SP->>SP: aggregate per variant: mean, variance, count
    SP->>DW: write experiment_metrics table
    DW->>STATS: read for analysis
    STATS->>STATS: compute p-value, CI, CUPED adjustment

Three join rules matter. Only count outcome events that occurred after the user's exposure time — pre-exposure behavior is not affected by the treatment. If a user is exposed multiple times, use the first exposure timestamp. The join is a left join from exposures — every exposed user appears in the analysis, even those with zero outcome events (zero is a valid data point for metrics like "revenue").
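
The three join rules can be illustrated with a small in-memory version of the join. This is a sketch of the logic only, with hypothetical names and integer timestamps; the production job would express the same thing as a left join in Spark or Flink.

```python
from collections import defaultdict

def per_variant_revenue(exposures, outcomes):
    """
    exposures: dict unit_id -> (variant, first_exposure_ts), already deduplicated
    outcomes:  list of (unit_id, ts, amount)
    Returns variant -> list of per-user revenue, one entry per exposed user.
    """
    # Left join from exposures: every exposed user starts at zero
    totals = {unit: 0.0 for unit in exposures}
    for unit, ts, amount in outcomes:
        # Ignore users who were never exposed, and outcomes that precede first exposure
        if unit in exposures and ts >= exposures[unit][1]:
            totals[unit] += amount
    by_variant = defaultdict(list)
    for unit, (variant, _first_ts) in exposures.items():
        by_variant[variant].append(totals[unit])
    return dict(by_variant)

exposures = {
    "u1": ("control", 100),
    "u2": ("treatment_a", 200),
    "u3": ("treatment_a", 300),   # exposed, never purchased
}
outcomes = [
    ("u1", 150, 20.0),
    ("u2", 100, 50.0),   # before u2's exposure: excluded
    ("u2", 250, 30.0),
    ("u9", 400, 99.0),   # never exposed: excluded
]

print(per_variant_revenue(exposures, outcomes))
```

User u3 contributes a zero to the treatment arm. Dropping that row (an inner join) would overstate revenue per user in whichever arm has more non-purchasers.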

| Approach | Latency | Use for |
| --- | --- | --- |
| Spark batch (hourly) | ~1 hour | Standard analysis, sample ratio mismatch checks |
| Flink streaming (micro-batch) | ~5 minutes | Guardrail metrics, early harm detection |
| Real-time (pure stream) | ~1 minute | Emergency use; harder to ensure join correctness |

Most teams run both: a near-real-time Flink job for guardrails on a narrow set of key metrics, and a daily Spark job that produces the authoritative numbers for decision-making.

Statistics done correctly

This is where most interview answers and most production platforms fall short.

Hypothesis testing

For a metric with observed mean μ_c (control) and μ_t (treatment) and sample sizes n_c, n_t:

Δ = μ_t - μ_c

SE = sqrt(σ_c²/n_c + σ_t²/n_t)

z = Δ / SE

p-value = 2 × Φ(-|z|)   # two-sided test

At α = 0.05, reject H₀ (no effect) if p < 0.05. The 95% confidence interval for Δ is:

CI = [Δ - 1.96 × SE,  Δ + 1.96 × SE]
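
The formulas above translate directly into code. The sketch below uses only the standard library and relies on the identity 2 × Φ(−|z|) = erfc(|z| / √2); the function name and the example inputs are illustrative.

```python
import math

def two_sample_z(mean_c, var_c, n_c, mean_t, var_t, n_t):
    """Two-sided z-test for a difference in means, plus a 95% confidence interval."""
    delta = mean_t - mean_c
    se = math.sqrt(var_c / n_c + var_t / n_t)
    z = delta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * Phi(-|z|)
    ci = (delta - 1.96 * se, delta + 1.96 * se)
    return delta, se, z, p_value, ci

# Illustrative inputs: sigma = 50 in both arms, 10,000 users per arm, a 2-dollar lift
delta, se, z, p_value, ci = two_sample_z(
    mean_c=100.0, var_c=2500.0, n_c=10_000,
    mean_t=102.0, var_t=2500.0, n_t=10_000,
)
print(f"delta={delta:.2f} se={se:.3f} z={z:.2f} p={p_value:.4f} ci=({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval excludes zero exactly when p < 0.05, so the dashboard can show either one without the two disagreeing.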

Sample size planning

Before you launch, compute the minimum sample size required to detect the minimum detectable effect (MDE) with acceptable power (1 - β, typically 0.8) at significance level α (typically 0.05):

n = 2 × (z_α/2 + z_β)² × σ² / MDE²

For α = 0.05, β = 0.2:  z_α/2 ≈ 1.96, z_β ≈ 0.84
→ n ≈ 2 × (1.96 + 0.84)² × σ² / MDE²
  ≈ 15.7 × σ² / MDE²

If σ = 50 (dollars) and MDE = 2 dollars:
n ≈ 15.7 × 2500 / 4 ≈ 9,813 users per variant

This tells you your experiment needs ~10k users per arm before any analysis. Run it long enough to hit that sample size; do not decide earlier.
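
The same calculation as a reusable function, using `statistics.NormalDist` (Python 3.8+) for the normal quantiles instead of the rounded 1.96 and 0.84. The function name is illustrative. Because the quantiles are not rounded, the result lands within a few users of the hand calculation rather than matching it exactly.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per variant for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for power = 0.8
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

print(sample_size_per_arm(sigma=50, mde=2))   # the worked example above
print(sample_size_per_arm(sigma=50, mde=1))   # halving the MDE roughly quadruples n
```

The second call is the part worth internalizing: sample size scales with 1/MDE², so asking for half the detectable effect costs about four times the traffic.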

The peeking problem

This is a classic trap and interviewers will ask about it.

Suppose you check significance every day for a 2-week experiment. You make roughly 14 tests. Even with a true null effect (no real difference between control and treatment), the probability that at least one of those 14 daily checks gives p < 0.05 is far higher than 5%. Simulations show that even with a modest number of interim looks, the effective false-positive rate rises well above 5% — continuous monitoring with early stopping at p < 0.05 can push it to roughly 26% (Johari et al. 2022). The exact inflation depends on how many looks you take and your stopping rule, but the direction is unambiguous: repeated looks at a p-value without adjustment inflate Type I errors.

flowchart LR
    START["Experiment launched<br/>day 0"] --> D3["Day 3 check:<br/>p = 0.08 — still running"]
    D3 --> D7["Day 7 check:<br/>p = 0.04 — ship it!"]
    D7 --> WRONG["False positive shipped<br/>— true effect was zero"]
    NOTE["Multiple looks inflate<br/>cumulative Type I error<br/>above nominal α"] -.-> D7
    style WRONG fill:#ff2e88,color:#fff
    style D7 fill:#ffaa00,color:#0a0a0f
    style NOTE fill:#0e7490,color:#fff

There are three principled fixes:

  • Fixed horizon: commit to a sample size before launch, run the experiment to that size, look exactly once. Mathematically clean but inflexible: you cannot stop early even if the treatment is obviously winning or clearly harmful.
  • Sequential testing: use a test that is valid at any stopping time. The mixture Sequential Probability Ratio Test (mSPRT) provides an "always-valid p-value", so you can check any time without inflating false positives, and you can stop early when the result is clear. The cost is a larger final sample than a fixed-horizon test to achieve the same power.
  • Bonferroni correction: if you must look K times, divide your α threshold by K. Simple but conservative; it kills power if K is large.

Most production platforms (including those at Airbnb, Netflix, and Microsoft) use a combination: a fixed horizon for the primary analysis, and sequential testing to enable principled early stopping when warranted.
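
The inflation from peeking is easy to see in a simulation of A/A tests, where the true effect is zero by construction. The sketch below is an illustration under simplifying assumptions: outcomes are unit-variance normal, the variance is treated as known, and each day's users are drawn as a single normal sum rather than one by one. All names and parameter values are hypothetical.

```python
import math
import random

def peeking_false_positive_rates(n_experiments=2000, looks=14, users_per_look=500, seed=7):
    """Simulate A/A tests; return (rate with a check at every look, rate at final look only)."""
    rng = random.Random(seed)
    daily_sd = math.sqrt(users_per_look)   # sd of a sum of unit-variance outcomes
    any_look = 0
    final_look = 0
    for _ in range(n_experiments):
        sum_t = sum_c = 0.0
        n = 0
        crossed = False
        z = 0.0
        for _ in range(looks):
            sum_t += rng.gauss(0.0, daily_sd)
            sum_c += rng.gauss(0.0, daily_sd)
            n += users_per_look
            # z = (mean_t - mean_c) / sqrt(2 / n), rewritten in terms of running sums
            z = (sum_t - sum_c) / math.sqrt(2 * n)
            if abs(z) > 1.96:
                crossed = True             # a peeker would have stopped and shipped here
        any_look += crossed
        final_look += abs(z) > 1.96
    return any_look / n_experiments, final_look / n_experiments

peeking_rate, fixed_rate = peeking_false_positive_rates()
print(f"false-positive rate, stop at first p < 0.05: {peeking_rate:.3f}")
print(f"false-positive rate, single look at the end: {fixed_rate:.3f}")
```

The single-look rate stays near the nominal 5%, while the stop-at-first-crossing rate comes out several times higher, even though nothing differs between the arms.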

Variance reduction: CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique. Much of the variance in a metric like "revenue this week" can be predicted from "revenue last week." Subtract out that predictable component and the residual variance is smaller — giving you more statistical power, or equivalently, needing fewer users to detect the same effect.

Y_cuped = Y - θ × X

where:
  Y = outcome metric during the experiment
  X = the same metric from a pre-experiment window (same users, week before)
  θ = Cov(Y, X) / Var(X)  ← computed from the control group

Analysis proceeds on Y_cuped instead of Y.

CUPED does not change the expected value of the treatment effect — the adjustment is mean-preserving and only reduces variance. A variance reduction of 50% is equivalent to doubling your sample size. Variance reduction depends almost entirely on how strongly the pre-experiment covariate correlates with the outcome metric. The original Microsoft paper on Bing reported around 50% reduction using queries-per-user as the covariate; in practice, reported reductions across teams and products range from roughly 20% (weak covariate correlation) to 70% (strong correlation, e.g., booking frequency predicting revenue).
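
A small synthetic example makes the variance reduction concrete. This is a sketch under stated assumptions: the data is simulated so that the pre-period metric explains about 64% of the outcome's variance, θ is estimated on the whole simulated sample for brevity (the text above computes it from the control group), and X is centered before subtraction so the mean of Y is preserved exactly. Centering changes Y − θX only by a constant, so the difference between arms is unaffected.

```python
import random
from statistics import mean, pvariance

def cuped_adjust(y, x):
    """Return (adjusted outcomes, theta) with theta = Cov(Y, X) / Var(X)."""
    mean_x, mean_y = mean(x), mean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    theta = cov_xy / pvariance(x, mean_x)
    # Subtract the centered covariate so the mean of Y does not move
    return [b - theta * (a - mean_x) for a, b in zip(x, y)], theta

rng = random.Random(42)
pre = [rng.gauss(100, 20) for _ in range(5_000)]        # X: pre-experiment metric
post = [0.8 * x + rng.gauss(20, 12) for x in pre]       # Y: in-experiment metric, correlated with X

adjusted, theta = cuped_adjust(post, pre)
reduction = 1 - pvariance(adjusted) / pvariance(post)

print(f"theta = {theta:.2f}")
print(f"variance reduction = {reduction:.0%}")
print(f"mean before = {mean(post):.3f}, mean after = {mean(adjusted):.3f}")
```

The reduction equals the squared correlation between X and Y, which is why covariate choice dominates how much CUPED helps.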

Sample Ratio Mismatch (SRM)

Before trusting any result, check: did treatment and control receive users in approximately the allocation ratio?

Expected ratio: 50% control, 50% treatment.
Observed:       control n=1,002,000; treatment n=987,000.
Total N: 1,989,000. Expected per arm: 994,500.

Chi-squared: Σ (observed - expected)² / expected
  = (1,002,000 - 994,500)² / 994,500  +  (987,000 - 994,500)² / 994,500
  = 56,250,000 / 994,500  +  56,250,000 / 994,500
  ≈ 56.56 + 56.56 ≈ 113.1   (df=1, p ≪ 0.001) → SRM detected.

An SRM means the randomization or logging is broken. Common causes: the SDK only loads for some users (e.g., late-loading JavaScript), bots are filtered from one variant but not the other, or the assignment hash function is not uniform.

flowchart TD
    RESULTS["Experiment results ready"] --> SRM_CHECK["SRM check:<br/>chi-squared test on observed vs expected allocation"]
    SRM_CHECK -->|"p > 0.05 — allocation looks fine"| METRICS["Show metric results<br/>p-values, CIs, CUPED"]
    SRM_CHECK -->|"p < 0.001 — allocation is broken"| BLOCK["Block results display<br/>raise data-quality alarm"]
    BLOCK --> DEBUG["Debug: SDK load order?<br/>bot filtering? hash uniformity?"]
    style BLOCK fill:#ff2e88,color:#fff
    style METRICS fill:#15803d,color:#fff
    style SRM_CHECK fill:#ff6b1a,color:#0a0a0f

Never trust experimental results when an SRM is detected. Surface it as a data-quality alarm on the dashboard before showing any metric results.
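
The two-arm SRM check needs no statistics library: with one degree of freedom, the chi-squared tail probability is P(χ² > x) = erfc(√(x/2)). The sketch below applies that to the worked example; the function name and the 0.001 alarm threshold are assumptions taken from the flowchart above, and the p-value formula is valid only for exactly two arms.

```python
import math

def srm_check(observed, expected_ratios, alpha=0.001):
    """Chi-squared goodness-of-fit for a two-arm experiment (df = 1)."""
    if len(observed) != 2:
        raise ValueError("this sketch only handles two arms (df = 1)")
    total = sum(observed)
    chi2 = sum(
        (obs - total * ratio) ** 2 / (total * ratio)
        for obs, ratio in zip(observed, expected_ratios)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))   # survival function of chi-squared, df = 1
    return chi2, p_value, p_value < alpha

chi2, p_value, srm_detected = srm_check([1_002_000, 987_000], [0.5, 0.5])
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}, SRM detected: {srm_detected}")

chi2_ok, p_ok, srm_ok = srm_check([500_200, 499_800], [0.5, 0.5])
print(f"chi2 = {chi2_ok:.2f}, p = {p_ok:.3g}, SRM detected: {srm_ok}")
```

Note how small the imbalance in the first example is in relative terms (about 50.4% vs 49.6%). At this sample size that is far outside what chance produces, which is why the check must be statistical and not an eyeball comparison of percentages.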

Concurrent experiments: layers and orthogonal salts

The naive approach, assigning users to experiments independently, has a subtle problem. Suppose experiment A tests a new ranking algorithm and experiment B tests a new UI, each as a 50/50 split. Then 25% of users get both treatments simultaneously. If the two treatments interact (the new ranking algorithm works better with the new UI), the measured effect of experiment A depends on what experiment B is doing, and you cannot cleanly attribute the result to either change.

Google's overlapping experiment framework (described in their 2010 paper "Overlapping Experiment Infrastructure") solves this with layers. Each layer contains a set of mutually-exclusive experiments. Experiments in different layers can overlap freely.

flowchart TD
    USER[User ID] --> L1[Layer 1: Ranking experiments<br/>salt: rank_layer]
    USER --> L2[Layer 2: UI experiments<br/>salt: ui_layer]
    USER --> L3[Layer 3: Pricing experiments<br/>salt: price_layer]

    L1 --> R1[Exp A: new ranking]
    L1 --> R2[Exp B: old ranking]

    L2 --> U1[Exp C: new button]
    L2 --> U2[Exp D: old button]

    L3 --> P1[Exp E: free shipping test]

    style L1 fill:#ff6b1a,color:#0a0a0f
    style L2 fill:#0e7490,color:#fff
    style L3 fill:#a855f7,color:#fff

Each layer uses a different salt:

layer_bucket = hash(user_id + layer_salt) % 100

A user's layer 1 bucket is independent of their layer 2 bucket, by the properties of a good hash function. Within a layer, experiments are allocated as contiguous ranges of the 0–99 bucket space, so they are mutually exclusive. Across layers, users are orthogonally randomized.

Mutually-exclusive experiments — when interactions are known to be a concern — go in the same layer, where they cannot overlap. If you have only 100 bucket slots per layer and you launch many experiments, layers fill up. Most platforms manage this by expiring concluded experiments quickly to free up slots, and by using more layers.
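
The layer mechanics can be sketched as a lookup from layer bucket to experiment. This is an illustration, not Google's implementation: the layer salts follow the diagram above, but the bucket ranges and experiment keys are hypothetical, `hashlib.blake2b` stands in for MurmurHash3 so the sketch runs on the standard library alone, and variant selection inside each experiment is omitted.

```python
import hashlib
from typing import Dict, Optional

def layer_bucket(user_id: str, layer_salt: str) -> int:
    """Bucket in 0..99 for this user within one layer."""
    digest = hashlib.blake2b(f"{user_id}|{layer_salt}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % 100

# layer salt -> list of (experiment, first_bucket, end_bucket_exclusive); ranges are hypothetical
LAYERS = {
    "rank_layer": [("exp_a", 0, 50), ("exp_b", 50, 100)],
    "ui_layer": [("exp_c", 0, 50), ("exp_d", 50, 100)],
    "price_layer": [("exp_e", 0, 10)],   # buckets 10..99 are unallocated
}

def experiments_for(user_id: str) -> Dict[str, Optional[str]]:
    """At most one experiment per layer, chosen by the user's bucket in that layer."""
    result = {}
    for salt, ranges in LAYERS.items():
        bucket = layer_bucket(user_id, salt)
        result[salt] = next((name for name, lo, hi in ranges if lo <= bucket < hi), None)
    return result

users = [f"user_{i}" for i in range(20_000)]
assignments = [experiments_for(u) for u in users]

both = sum(a["rank_layer"] == "exp_a" and a["ui_layer"] == "exp_c" for a in assignments)
pricing = sum(a["price_layer"] == "exp_e" for a in assignments)

print(both / len(users))      # close to 0.25 when the two layers are independent
print(pricing / len(users))   # close to 0.10
```

Because ranges within a layer do not overlap, a user can never be in exp_a and exp_b at once, while the share in both exp_a and exp_c sits near the product of the two allocations.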

Staged ramp-up and feature flag integration

A/B testing and feature flag delivery share nearly identical infrastructure. The key difference is measurement.

A staged ramp-up is a special case: start an experiment at 1% allocation (just enough to catch crashes and SRMs quickly), verify guardrails, then ramp to 10%, 50%, 100%. Each ramp step is a config change — no code deploy.

stateDiagram-v2
    [*] --> Draft: experiment created
    Draft --> Running: launched at N%
    Running --> Ramping: allocation increased
    Ramping --> Running: stable
    Running --> Paused: guardrail breach or manual
    Paused --> Running: issue resolved
    Running --> Concluded: sample size reached
    Concluded --> Shipped: winner deployed
    Concluded --> Reverted: loser discarded
    Shipped --> [*]
    Reverted --> [*]
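
The lifecycle above maps naturally onto a transition table that the experiment service can enforce on every status change. This is a minimal sketch: the states and edges are copied from the diagram, while the function name and error handling are assumptions.

```python
# Allowed next states, taken from the state diagram
ALLOWED_TRANSITIONS = {
    "Draft": {"Running"},
    "Running": {"Ramping", "Paused", "Concluded"},
    "Ramping": {"Running"},
    "Paused": {"Running"},
    "Concluded": {"Shipped", "Reverted"},
    "Shipped": set(),     # terminal
    "Reverted": set(),    # terminal
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = "Draft"
for step in ["Running", "Ramping", "Running", "Paused", "Running", "Concluded", "Shipped"]:
    state = transition(state, step)
print(state)
```

Rejecting edges that are not in the table (for example Paused to Concluded) forces a paused experiment back through Running, so it cannot be concluded on data collected before the guardrail breach was resolved.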

Failure modes

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| SRM (Sample Ratio Mismatch) | Treatment/control counts don't match allocation | Automatic SRM check before surfacing results; block analysis until resolved |
| Peeking / repeated testing | False positive shipped as a winner | Fixed-horizon commitment; sequential tests for early stopping |
| Assignment skew | Hash function not uniform; correlated with user attributes | Use well-tested hash functions (MurmurHash3, xxHash); test uniformity on real IDs |
| Config staleness | Old config served after experiment paused | Aggressive polling (30 s) + push notifications for pauses; SDK caches pause state with short TTL |
| Exposure log loss | Kafka consumer falls behind; events dropped | At-least-once delivery; deduplicate downstream; monitor consumer lag |
| Metric pipeline lag | Guardrails fire hours late | Separate fast-path Flink job for guardrail metrics; alert on pipeline lag itself |
| Interaction effects | Two interacting experiments in different layers pollute each other | Layer enforcement at config creation time; lint rules require conflicting experiments to share a layer |
| Novelty effect | Treatment looks good for 3 days then regresses | Run experiments for at least 1–2 full behavioral cycles (typically 1–2 weeks) before concluding |

Storage choices

| Data | Store | Rationale |
| --- | --- | --- |
| Experiment config | Postgres | Low write rate, complex queries, needs ACID for allocation changes |
| Config delivery cache | Redis / CDN | Fast reads for SDK polling; tolerate ~30 s stale |
| Exposure events (raw) | Kafka → S3 / object store | Append-only, high-volume; cheap cold storage |
| Outcome events (raw) | Kafka → S3 / object store | Same pattern; separate topics per event type |
| Joined experiment results | BigQuery / Snowflake (columnar) | Aggregate queries over billions of rows; columnar format is 10–100× faster than row-oriented |
| Computed statistics | Postgres / document store | Low volume; read by dashboard |

Things to discuss in an interview

  • Why hash-based assignment and not a lookup table? No per-user storage, no DB on the hot path, deterministic, portable across services. Trade-off: you cannot un-assign a user mid-experiment without changing the salt (which re-randomizes everyone — rarely desirable).
  • Why exposure logging and not assignment logging? Many assigned users never encounter the treatment. Analyzing on assignment inflates N and dilutes the effect. In clinical-trial terms, assignment-based analysis is the intent-to-treat (ITT) approach; exposure-based analysis is closer to per-protocol — it restricts to participants who actually received the treatment.
  • How do you handle the peeking problem in practice? Fixed horizon for primary decisions, sequential testing (mSPRT) for guardrails and early stopping. Brief the stakeholders before launch on the sample-size commitment — this is a cultural problem as much as a technical one.
  • How does CUPED work? Regress out pre-experiment variance using the same metric from a window before the experiment started. Does not bias the estimate; reduces variance roughly 20–70% depending on the covariate correlation.
  • What is SRM and what causes it? Mismatch between expected and observed allocation ratios. Caused by logging bugs, bot filtering, or hash non-uniformity. Always check before trusting results.
  • How do you scale to 10k concurrent experiments? Layers with orthogonal salts. Experiments within a layer are mutually exclusive; experiments across layers are independent. Slot management (expiring old experiments) keeps layers from overfilling.
  • How are A/B testing and feature flags related? Same assignment and delivery infrastructure; A/B adds measurement, exposure logging, and statistics on top. See feature flag service.

Things you should now be able to answer

  • Why does the deterministic hash use unit_id + experiment_salt rather than just unit_id?
  • What is the difference between assignment and exposure, and which one drives the analysis population?
  • Why does checking p-values every day inflate the false-positive rate, and what are two principled fixes?
  • What does CUPED do and how does it not bias the result?
  • What is an SRM and why does it invalidate an experiment's results?
  • How do experiment layers prevent cross-experiment contamination without requiring exclusion of overlapping users?
  • What happens if you change the experiment salt mid-experiment?

Further reading

  • Kohavi, Longbotham, Sommerfield, Henne — "Controlled Experiments on the Web: Survey and Practical Guide" (2009). The canonical reference.
  • Tang, Agarwal, O'Brien, Meyer — "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation" (Google, KDD 2010). The layers framework.
  • Deng, Xu, Kohavi, Walker — "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data" (CUPED, WSDM 2013).
  • Johari, Pekelis, Walsh — "Always Valid Inference: Bringing Sequential Analysis to A/B Testing" (mSPRT / sequential testing, arXiv 2015, Operations Research 2022).
  • Airbnb Engineering: "Experiment Reporting Framework" — medium.com/airbnb-engineering.
  • Netflix Technology Blog: "It's All A/Bout Testing: The Netflix Experimentation Platform".
  • Microsoft ExP (Experimentation Platform) papers — various, available on exp-platform.com.
// FAQ

Frequently asked questions

What is the peeking problem in A/B testing and how do you fix it?

Peeking means checking a p-value repeatedly before the experiment reaches its planned sample size. Even with a true null effect, every extra look is another chance to cross p < 0.05, so 14 daily checks push the effective false-positive rate well above the nominal 5%; continuous monitoring with early stopping can push it to roughly 26% (Johari et al. 2022). The two principled fixes are a fixed horizon (commit to a sample size before launch and look exactly once) and sequential testing using mSPRT, which provides an always-valid p-value you can check at any time without inflating Type I error.

Why use hash-based assignment instead of storing variant assignments in a database?

The formula bucket = hash(unit_id + experiment_salt) % 100 is computed entirely in-process, requiring no database lookup on the hot path and achieving sub-millisecond latency. It is deterministic forever — user 12345 always lands in the same bucket for a given experiment — and any service (iOS, Android, backend) independently computes the same answer with no coordination. The trade-off is that changing the salt mid-experiment re-randomizes every user, which is rarely desirable.

What is CUPED and how much variance does it reduce?

CUPED (Controlled-experiment Using Pre-Experiment Data) regresses out the predictable component of the outcome metric using the same metric measured in a pre-experiment window, then runs the analysis on the residual. It does not change the expected treatment effect — the adjustment is mean-preserving — but reduces variance. The original Microsoft paper on Bing reported roughly 50% reduction; in practice teams see 20% to 70% depending on how strongly the pre-experiment covariate correlates with the outcome.

What is a Sample Ratio Mismatch (SRM) and when does it occur?

An SRM is a statistically significant difference between the observed allocation ratio and the expected one. For example, a 50/50 experiment where control receives 1,002,000 users and treatment receives only 987,000 yields a chi-squared statistic of about 113.1 (p much less than 0.001). Common causes are late-loading SDKs that only fire for some users, bot filtering applied to one variant but not the other, and a non-uniform hash function. Any SRM invalidates the experiment's results and must be treated as a data-quality alarm before showing any metrics.

How do experiment layers prevent cross-experiment interference when running thousands of concurrent experiments?

Based on Google's overlapping experiment framework (KDD 2010), each layer uses a distinct salt so that a user's bucket assignment in layer 1 is statistically independent of their bucket in layer 2 or layer 3. Within a layer, experiments occupy contiguous, mutually exclusive bucket ranges so they never overlap. Experiments that are known to interact go in the same layer where they cannot co-occur; experiments in different layers can freely overlap across the full user population without confounding each other.
