
Advanced · asked at Stripe, PayPal, Meta, Uber

Design a Real-Time Fraud Detection System

Score transactions for fraud inline in milliseconds. Feature stores, streaming velocity features, rules + ML hybrids, graph fraud rings, and the label-delay problem.

23 min read · 2026-06-04 · Ironclad Academy

#interview #ml #streaming #risk


The problem

Stripe blocks roughly 1.5 billion fraudulent dollars per year on behalf of its merchants. PayPal's fraud team scores hundreds of millions of transactions daily. Both systems face the same brutal constraint: the decision — allow, deny, or challenge — must arrive before the payment authorization completes, which means you have at most 100 ms from the moment a customer clicks "Pay."

That timing constraint is what makes this problem hard. You can't batch decisions, you can't afford expensive graph traversals on the critical path, and you can't wait for clean labels. A card number stolen in a data breach today will be tested against your system tonight. By the time a chargeback confirms the fraud 30–90 days later, the attack pattern has already evolved. Your model is perpetually chasing yesterday's fraud with yesterday's labels.

The system also straddles two failure modes with opposite costs. Miss a fraudulent transaction (false negative) and you lose the charged-back dollars plus the chargeback fee, and you risk Visa or Mastercard putting you on their excessive-chargeback watchlists if rates climb above 1.5%. Block a legitimate transaction (false positive) and you lose a customer, possibly permanently. No threshold setting drives both error types to zero. The system has to be tunable, per merchant and per risk tier, so the business (not the ML team) decides where on the precision-recall curve to operate.

This is a different problem from the payment system, which is about moving money correctly. Fraud detection is about whether to move it at all. The two systems run side by side in the authorization hot path, with distributed systems, ML engineering, and adversarial game theory all in play at once.

Functional requirements

Score payment events (card transactions, ACH, peer-to-peer transfers) with a risk label: allow, deny, or challenge (step-up authentication).
Score account events: signups (new-account fraud), logins (account takeover).
Support a human review queue for borderline cases — analysts can flip decisions.
Provide an explanation for every decision (audit trail, regulatory requirement, analyst tooling).
Accept feedback: when a review analyst overrides a decision, or a chargeback arrives, feed that signal back into the system.

Non-functional requirements

Latency: p99 < 100 ms inline with payment authorization. p50 < 20 ms.
Availability: 99.99% uptime on the scoring path. A scoring failure must not block payments — fall back to rules-only, never to a hard error.
Throughput: 1,000–5,000 events/sec sustained; spikes of 2–3× during sales events.
Consistency: a "deny" decision must be deterministic given the same inputs — regulatory audits will ask why a transaction was blocked.
Freshness: velocity features (e.g., "how many transactions on this card in the last 5 minutes") must reflect events from at most a few seconds ago.

Capacity estimation

Dimension	Estimate	How we got there
Peak write throughput	5,000 events/sec	Payment gateway scale target
Daily events scored	86.4M events/day	`1,000 events/sec sustained × 86,400 sec/day`
Redis reads/sec (feature lookups)	150,000 reads/sec	`5,000 events/sec × 30 features/event`
Redis cluster headroom	Well within a 3-node cluster	Each Redis node handles ~100k ops/sec
Feature store active keys	~50M keys	Cards + users + devices + merchant pairs
Feature store memory (raw)	10 GB	`50M keys × 200 bytes/key`
Feature store memory (with replication)	30 GB RAM	`10 GB × 3× replication`
Model inference latency (native XGBoost)	~0.7–2 ms per prediction	1,000-tree GBT on a modern CPU core
Model inference latency (optimized runtime)	< 0.5 ms	Treelite or ONNX Runtime
CPU cores needed @ 0.5 ms/prediction	~3 cores	`5,000/sec × 0.5 ms = 2.5 cores`
CPU cores needed @ 2 ms/prediction (unoptimized)	~10 cores	`5,000/sec × 2 ms = 10 cores`
Raw training rows/year	~31.5B rows	`86.4M events/day × 365`
Training rows after negative sampling	~1B labeled rows	10:1 negative:positive ratio applied
Training data storage/year	~500 GB	`~1B rows × 500 bytes/row compressed (Parquet on S3)`
Label delay (chargeback confirmation)	30–90 days post-transaction	Visa/Mastercard dispute resolution rules
Surrogate label signal available	T+1 to T+7	Early dispute signals used for faster feedback

Takeaway: The inline scoring path — 150k Redis reads/sec and ~10 CPU cores for model inference — is modest. The hard constraint is latency (< 100 ms end-to-end) and label freshness: production training data for a given week is not fully labeled for 90+ days, so you retrain on early dispute surrogates in the meantime.
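The arithmetic behind the table can be reproduced in a few lines. This is a sketch that uses only the table's own inputs; the constant and function names are illustrative.

```python
# Back-of-envelope capacity math; every input comes from the table above.
PEAK_EVENTS_PER_SEC = 5_000
SUSTAINED_EVENTS_PER_SEC = 1_000
FEATURES_PER_EVENT = 30
ACTIVE_KEYS = 50_000_000
BYTES_PER_KEY = 200
REPLICATION_FACTOR = 3

daily_events = SUSTAINED_EVENTS_PER_SEC * 86_400
redis_reads_per_sec = PEAK_EVENTS_PER_SEC * FEATURES_PER_EVENT
raw_memory_gb = ACTIVE_KEYS * BYTES_PER_KEY / 1e9
replicated_memory_gb = raw_memory_gb * REPLICATION_FACTOR


def cores_needed(events_per_sec: float, ms_per_prediction: float) -> float:
    """CPU-seconds of inference work arriving per wall-clock second."""
    return events_per_sec * ms_per_prediction / 1_000


print(daily_events, redis_reads_per_sec, raw_memory_gb, replicated_memory_gb)
print(cores_needed(PEAK_EVENTS_PER_SEC, 0.5), cores_needed(PEAK_EVENTS_PER_SEC, 2.0))
```

The core counts are a lower bound: they assume perfectly even load and no headroom, so a real deployment would provision above them.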

Building up to the design

V1: Hard rules only

Every fraud system starts here: a list of known-bad signals turned into code.

def score(txn):
    if txn.card in blocklist:         return DENY
    if txn.amount > country_limit[txn.country]: return DENY
    if txn.ip in known_bad_ips:       return DENY
    return ALLOW

This is interpretable, fast, and auditable — a junior analyst can explain every decision. The problem is that rules are static. Fraudsters learn your rules by probing the system. A card not yet on any blocklist passes freely. Velocity attacks — spreading $9.99 across 1,000 cards in an hour — are completely invisible, because no single transaction triggers anything.

V2: Add velocity counters

"This card has been used 8 times in the last 3 minutes" is one of the strongest fraud signals you have, regardless of any individual transaction amount. So we start computing counts over rolling windows and checking them at scoring time.

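def get_velocity(card_id, window_seconds):
    # A missing key means no recent activity on this card.
    key = f"vel:{card_id}:{window_seconds}"
    return int(redis.get(key) or 0)   # Redis returns bytes/str, so cast before comparing

def velocity_rule(txn):
    if get_velocity(txn.card_id, 300) > 5:   # more than 5 txns in the last 5 minutes
        return DENY
    return None   # no match: fall through to the remaining rules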

This catches velocity attacks rules couldn't see. But now we have a new problem: a card first seen today has zero history. The rules fire based on counts, but the pattern matters — is 3 transactions in 5 minutes abnormal for this particular cardholder? Rules can't answer that. We need something that learns.

V3: Add an ML model for the gray zone

A gradient-boosted tree (GBT) trained on historical labeled transactions learns the joint distribution of features that separates fraud from legitimate activity. It handles the gray zone that rules cannot.

def decide(feature_vector):
    score = model.predict(feature_vector)   # fraud probability, 0.0 to 1.0
    if score > 0.85: return DENY
    if score > 0.4:  return CHALLENGE       # step-up auth
    return ALLOW

The model can learn non-obvious combinations. A $15 transaction at a foreign gas station is fine, unless the card was used in New York 20 minutes ago. No single rule captures that joint signal, but a model trained on labeled history can learn it.

What breaks next: the model is trained on yesterday's fraud patterns. Fraudsters adapt. The model drifts. And the model's features have to be precomputed before inference time, because naive real-time feature computation blows the 100 ms budget.

V4: A proper feature store

Feature computation moves out of the hot path. A streaming pipeline (Kafka → Flink) continuously maintains velocity windows. A batch pipeline (daily Spark job) updates longer-horizon profile features. Both write to a shared feature store — Redis for real-time reads, a columnar store for offline training.

At inference time you fetch precomputed features rather than recomputing them. Every feature lookup becomes a single Redis GET: ~1 ms flat.

V5: Graph-based ring detection (async)

Connected fraud rings — multiple accounts sharing a device, IP, or card — are invisible to per-transaction scoring. An asynchronous graph analysis layer links entities and propagates risk scores across the graph. This doesn't sit on the inline path; it feeds into feature store updates and review queue prioritization.

V6: Full production system

V3 + V4 + rules engine + delayed-label retraining + human review queue + graph analysis + champion/challenger deployment.

flowchart LR
    V1["V1: hard rules\nfast, brittle"] --> V2["V2: + velocity counters\ngray zone visible"]
    V2 --> V3["V3: + ML model\ngray zone handled"]
    V3 --> V4["V4: + feature store\n< 20ms p50 / < 25ms p99"]
    V4 --> V5["V5: + graph detection\nring fraud caught"]
    V5 --> V6["V6: + retraining pipeline\nadversarial drift handled"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    TXN[Transaction event] --> GW[API Gateway / Scoring endpoint]
    GW --> ENRICH[Feature enrichment service]

    ENRICH --> RFS[(Real-time feature store\nRedis: velocity, device, session)]
    ENRICH --> OFS[(Offline feature store\nS3 + Redis: historical profiles)]

    ENRICH --> ENGINE[Decision engine]
    ENGINE --> RULES[Rules engine\nhard deny / allow]
    ENGINE --> ML[ML model server\nXGBoost / LightGBM]

    RULES -->|deny| DEC{Decision}
    ML -->|score| DEC

    DEC -->|allow| OK[Authorize]
    DEC -->|deny| BLOCK[Block]
    DEC -->|challenge| STEP[Step-up auth]
    DEC -->|borderline| QUEUE[Human review queue]

    TXN --> KAFKA[Kafka: raw event stream]
    KAFKA --> FLINK[Flink: velocity aggregation]
    FLINK --> RFS

    LABELS[Chargeback / dispute labels] --> DELAY[Delayed label pipeline]
    DELAY --> TRAIN[Training pipeline\nSpark + MLflow]
    OFS --> TRAIN
    TRAIN --> CHAL[Challenger model]
    CHAL --> SHADOW[Shadow scoring\nA/B eval]
    SHADOW --> ML

    style ENGINE fill:#ff6b1a,color:#0a0a0f
    style RFS fill:#15803d,color:#fff
    style ML fill:#a855f7,color:#fff
    style FLINK fill:#0e7490,color:#fff
    style DELAY fill:#ffaa00,color:#0a0a0f

The inline scoring pipeline

The inline path is everything. Every millisecond here is a millisecond the payment gateway is waiting.

sequenceDiagram
    participant GW as Payment Gateway
    participant SCORE as Scoring Service
    participant FEAT as Feature Enrichment
    participant RED as Redis (real-time)
    participant MODEL as Model Server
    participant AUTH as Authorization

    GW->>SCORE: ScoreRequest (card, amount, merchant, device)
    SCORE->>FEAT: enrich(event)
    FEAT->>RED: MGET [vel:card:60, vel:card:300, vel:card:3600, device_age, ip_country_mismatch, ...]
    RED-->>FEAT: ~20 feature values (< 2 ms)
    FEAT-->>SCORE: feature_vector
    SCORE->>SCORE: run rules engine (< 1 ms)
    alt hard rule match
        SCORE-->>GW: DecisionResponse: DENY + explanation (immediate)
    else gray zone
        SCORE->>MODEL: predict(feature_vector)
        MODEL-->>SCORE: risk_score (< 5 ms)
        SCORE-->>GW: DecisionResponse: allow / deny / challenge + explanation
    end
    note over AUTH: GW proceeds with\nauthorization based on decision

Latency budget breakdown:

Step	Target p99
Network (gateway → scoring service)	5 ms
Feature lookup (Redis MGET, ~20–30 keys)	5 ms
Rules engine evaluation	1 ms
Model inference (GBT)	5 ms
Network (scoring service → gateway)	5 ms
Total	< 25 ms (leaves margin for p99 tail)

Keep rules and model inference on the same host as the scoring service to avoid extra network hops. Use Redis pipelining (batch all MGET in a single round trip).
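A minimal sketch of the single-round-trip lookup follows. It assumes a client whose `mget(keys)` returns values in key order with `None` for missing keys, which is how redis-py's `Redis.mget` behaves; `FakeRedis`, `fetch_features`, and the feature names are hypothetical stand-ins so the example runs without a server.

```python
class FakeRedis:
    """In-memory stand-in for a Redis client; only mget is implemented."""

    def __init__(self, data):
        self.data = data

    def mget(self, keys):
        return [self.data.get(k) for k in keys]


def fetch_features(client, key_by_feature, defaults):
    """Fetch all features in one MGET and fill misses with defaults."""
    names = list(key_by_feature)
    raw = client.mget([key_by_feature[n] for n in names])  # one round trip
    features = {}
    for name, value in zip(names, raw):
        if value is None:
            features[name] = defaults[name]  # e.g. a card with no history
        else:
            text = value.decode() if isinstance(value, bytes) else value
            features[name] = float(text)
    return features


client = FakeRedis({"vel:card:42:300:count": b"7"})
keys = {
    "txn_count_5m": "vel:card:42:300:count",
    "txn_count_1h": "vel:card:42:3600:count",
}
features = fetch_features(client, keys, {"txn_count_5m": 0.0, "txn_count_1h": 0.0})
print(features)
```

Filling misses with explicit defaults keeps the feature vector a fixed shape, so the model server never has to handle a partially populated request.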

Feature store and velocity features

Velocity features are the single most valuable feature class in fraud detection. They answer: "Is this card / user / device behaving unusually right now?"

Example velocity features:

Feature	Window	Entity	Redis key pattern
Transaction count	1 min, 5 min, 1 hr, 24 hr	card_id	`vel:card:{id}:{window}:count`
Transaction amount sum	5 min, 1 hr	card_id	`vel:card:{id}:{window}:sum`
Distinct merchants	1 hr	card_id	`vel:card:{id}:1h:merchants` (HyperLogLog)
Distinct IPs	24 hr	user_id	`vel:user:{id}:24h:ips`
Distinct devices	7 days	user_id	`vel:user:{id}:7d:devices`
Failed auth attempts	30 min	device_id	`vel:device:{id}:30m:fails`
Cross-border txns	1 hr	card_id	`vel:card:{id}:1h:countries` (HyperLogLog)

How velocity features are maintained (the streaming pipeline):

flowchart LR
    KAFKA[Kafka\ntxn events] --> FLINK[Flink job\nwindowed aggregation]
    FLINK --> RED[(Redis\nvelocity counters)]
    FLINK --> TSCALE[Time-series store\nfor audit / backfill]

    style KAFKA fill:#0e7490,color:#fff
    style FLINK fill:#ff6b1a,color:#0a0a0f
    style RED fill:#15803d,color:#fff

Flink maintains sliding windows using its native event-time processing. Each transaction event increments the relevant counters. Redis TTLs are set to slightly longer than the window to handle late events. For cardinality estimates ("distinct IPs this card has used in 24h"), use a HyperLogLog in Redis — constant memory, 0.81% standard error, exactly what you need.
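The 0.81% figure follows from the usual HyperLogLog error estimate of 1.04 / sqrt(m), where m is the number of registers. The short check below assumes Redis's layout of 16,384 six-bit registers per key; the variable names are illustrative.

```python
import math

REGISTERS = 16_384        # 2**14 registers per Redis HyperLogLog key
BITS_PER_REGISTER = 6

standard_error = 1.04 / math.sqrt(REGISTERS)
dense_size_bytes = REGISTERS * BITS_PER_REGISTER // 8

print(f"standard error: {standard_error:.2%}")
print(f"dense encoding: {dense_size_bytes} bytes")
```

That works out to roughly 12 KB per key at most, independent of how many distinct values are added, which is why it suits per-card and per-user cardinality features.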

Why not compute velocity at scoring time? Because computing "count of transactions on this card in the last 5 minutes" requires scanning recent history — that's O(events in window) per lookup, not O(1). Pre-aggregation by the streaming pipeline makes every feature lookup a single Redis GET: O(1), ~1 ms.
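To make the pre-aggregation idea concrete, here is an in-memory sketch of a bucketed sliding-window counter. It is not the Flink job: `BucketedCounter`, the 10-second bucket size, and the method names are assumptions for illustration. Each write updates one bucket, and each read sums a bounded number of buckets no matter how many events arrived.

```python
from collections import defaultdict


class BucketedCounter:
    """Approximate sliding-window count using fixed-size time buckets."""

    def __init__(self, window_seconds: int, bucket_seconds: int = 10):
        self.window = window_seconds
        self.bucket = bucket_seconds
        self.buckets = defaultdict(dict)  # entity_id -> {bucket_start: count}

    def _live(self, bucket_start: float, now: float) -> bool:
        # A bucket counts if any part of it overlaps (now - window, now].
        return bucket_start + self.bucket > now - self.window

    def record(self, entity_id: str, ts: float) -> None:
        start = int(ts // self.bucket) * self.bucket
        per_entity = self.buckets[entity_id]
        per_entity[start] = per_entity.get(start, 0) + 1
        # Evict buckets that have aged out of the window.
        for old in [b for b in per_entity if not self._live(b, ts)]:
            del per_entity[old]

    def count(self, entity_id: str, now: float) -> int:
        per_entity = self.buckets.get(entity_id, {})
        return sum(c for b, c in per_entity.items() if self._live(b, now))


counter = BucketedCounter(window_seconds=300)
for ts in (0, 5, 100, 290):
    counter.record("card-1", ts)
print(counter.count("card-1", now=299))
```

The count is approximate because whole buckets are included, so it can overcount by up to one bucket's worth of events at the trailing edge of the window. Smaller buckets tighten the estimate at the cost of more state per entity.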

Rules engine

Rules run before the ML model. They are fast, deterministic, and explainable — critical for regulatory compliance.

Rule categories:

Category	Examples	Action
Blocklist	Card in stolen-card list, IP in botnet list	Hard deny
Limit rules	Amount > daily card limit, > country limit	Hard deny
Impossible travel	Last txn in London 5 min ago, now in Sydney	Deny / challenge
Velocity threshold	> 10 txns on card in last 5 min	Challenge
Device mismatch	Device country ≠ billing country	Challenge
New entity	Card created < 2h ago + high amount	Challenge

Rules are stored as data (YAML/DB rows), not hardcoded. An analyst can add a new rule without a code deploy. A rules engine service evaluates them in priority order; the first matching terminal rule wins.
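A rules-as-data evaluator might look like the sketch below. The rule schema (`id`, `priority`, `field`, `op`, `value`, `action`) and the feature names are hypothetical; a real engine would load these rows from YAML or a database rather than a literal list.

```python
import operator

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "==": operator.eq,
}

# Rules expressed as plain data, lowest priority number evaluated first.
RULES = [
    {"id": "velocity_5m", "priority": 2, "field": "txn_count_5m",
     "op": ">", "value": 10, "action": "challenge"},
    {"id": "blocklisted_card", "priority": 1, "field": "card_blocklisted",
     "op": "==", "value": True, "action": "deny"},
]


def evaluate(rules, features):
    """Return (action, rule_id) for the first matching rule, else None."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        value = features.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            return rule["action"], rule["id"]
    return None  # no rule matched: hand off to the model


print(evaluate(RULES, {"card_blocklisted": True, "txn_count_5m": 50}))
```

Returning the rule id alongside the action gives the audit log its explanation for free: the decision record can state exactly which rule fired.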

Rules are not enough on their own. They are binary and static — they require explicit enumeration of every bad pattern. ML fills the gap for the gray zone, where combinations of moderately suspicious signals are what matter, not any single rule.

The ML model

Feature vector

At inference time the feature vector includes:

Velocity features (from Redis): counts, sums, cardinalities over multiple windows
Transaction features: amount, currency, merchant category code (MCC), cross-border flag
Entity profile features (from offline store): average spend by MCC for this card, historical dispute rate, account age
Device/session features: device age, IP risk score, browser fingerprint consistency
Derived features: amount relative to historical average, time-of-day normalized by timezone

The diagram below shows how these sources converge at inference time:

flowchart LR
    VEL[(Redis\nvelocity counters)] --> FV[Feature vector]
    PROF[(Offline store\nhistorical profiles)] --> FV
    TXN2[Transaction fields\namount / MCC / currency] --> FV
    DEV[Device / session\nfingerprint, IP risk] --> FV
    FV --> MODEL[GBT model\nXGBoost / LightGBM]
    MODEL --> SCORE2[Risk score 0.0–1.0]
    SCORE2 --> THR{Threshold}
    THR -->|"< 0.40"| ALLOW2[Allow]
    THR -->|"0.40–0.85"| CHAL2[Challenge]
    THR -->|"> 0.85"| DENY2[Deny]
    style FV fill:#ff6b1a,color:#0a0a0f
    style MODEL fill:#a855f7,color:#fff
    style ALLOW2 fill:#15803d,color:#fff
    style DENY2 fill:#ff2e88,color:#fff
    style CHAL2 fill:#ffaa00,color:#0a0a0f

Model choice

Gradient-boosted trees (XGBoost or LightGBM) are the standard choice for tabular fraud data. They handle mixed feature types (numeric, categorical, binary) without extensive preprocessing, tolerate missing values natively, and model non-linear interactions between features. Inference is fast: a 500-tree model scores a single row in under 1 ms on a single CPU core with an optimized inference runtime (Treelite, ONNX Runtime). Native XGBoost predict() is slower, typically around 0.7–2 ms for a single row depending on tree count and threading configuration, because Python and threading overhead dominate on single rows.

Neural networks show up as ensembles or for specific sub-problems — graph embeddings for fraud ring detection, sequence models for transaction history — but they don't replace GBTs for the primary tabular scoring task. The interpretability cost is high and the accuracy gain on structured tabular data is marginal compared to a well-tuned GBT.

Class imbalance

Fraud rates are typically 0.1%–1% of transactions. Training naively on this distribution produces a model that predicts "legitimate" for everything and hits 99.5% accuracy — useless. The standard fix: undersample negatives. Keep all positives, sample negatives to a 10:1 or 20:1 ratio. Simple and effective. You can also penalize false negatives more heavily in the loss function via scale_pos_weight in XGBoost.
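Both fixes are small in code. The sketch below assumes training rows are dicts with an `is_fraud` flag, which is a made-up shape for illustration; the weight helper computes the negative-to-positive count ratio, the commonly suggested starting value for XGBoost's `scale_pos_weight`.

```python
import random


def undersample(rows, ratio=10, seed=0):
    """Keep every fraud row; sample legitimate rows down to `ratio` per fraud row."""
    positives = [r for r in rows if r["is_fraud"]]
    negatives = [r for r in rows if not r["is_fraud"]]
    keep = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return positives + rng.sample(negatives, keep)


def pos_weight(rows):
    """Negative-to-positive count ratio for the given rows."""
    pos = sum(1 for r in rows if r["is_fraud"])
    return (len(rows) - pos) / pos


# 10,000 synthetic rows at a 0.5% fraud rate.
rows = [{"is_fraud": i < 50} for i in range(10_000)]
sampled = undersample(rows, ratio=10)
print(len(sampled), pos_weight(rows), pos_weight(sampled))
```

The two techniques are usually alternatives rather than a pair: applying a full class weight on top of heavy undersampling corrects for the same imbalance twice.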

Evaluate with precision, recall, and AUC-PR — never accuracy. The area under the precision-recall curve is the right summary metric for imbalanced datasets because accuracy tells you nothing about how well you're catching the rare class.
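A tiny worked example shows why. With synthetic labels at a 0.5% fraud rate, a model that always predicts "legitimate" scores high accuracy and zero recall. The helper below is a plain-Python sketch, not a library API.

```python
def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


y_true = [1] * 5 + [0] * 995     # 5 fraud cases in 1,000 transactions
always_legit = [0] * 1000        # the "predict legitimate for everything" model

accuracy, precision, recall = metrics(y_true, always_legit)
print(accuracy, precision, recall)
```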

Threshold calibration

The model outputs a probability: 0.0 (definitely legitimate) to 1.0 (definitely fraud). The threshold that maps score to action is a business decision, not a model decision:

Thresholds: deny_above = 0.85, challenge_above = 0.40

Score range	Action
score < 0.40	allow
0.40 ≤ score < 0.85	challenge
score ≥ 0.85	deny

Different contexts get different thresholds: high-value merchants use a lower challenge threshold; subscription renewals use a higher deny threshold because recurring charges have strong history; new accounts get tightened thresholds for first transactions.
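One way to keep thresholds out of the model and in policy is a per-context lookup. In this sketch the context names and every override value are invented for illustration; only the 0.40 and 0.85 defaults come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    challenge_above: float = 0.40
    deny_above: float = 0.85


# Hypothetical overrides; the numbers are illustrative, not recommendations.
POLICY = {
    "default": Thresholds(),
    "high_value_merchant": Thresholds(challenge_above=0.30),
    "subscription_renewal": Thresholds(deny_above=0.92),
    "new_account": Thresholds(challenge_above=0.25, deny_above=0.75),
}


def decide(score: float, context: str = "default") -> str:
    """Map a model score to an action using the context's thresholds."""
    t = POLICY.get(context, POLICY["default"])
    if score >= t.deny_above:
        return "deny"
    if score >= t.challenge_above:
        return "challenge"
    return "allow"


print(decide(0.35), decide(0.35, "high_value_merchant"))
```

Because the policy is data, changing an operating point is a configuration change owned by the risk team, not a model retrain.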

The label-delay problem

This is the hardest operational problem in fraud ML — and largely unique to this domain.

The timeline:

T+0:   Transaction occurs. You score it. Maybe you're right, maybe wrong.
T+1–7: Customer disputes the charge (early dispute signal).
T+30:  Issuing bank files chargeback against acquiring bank.
T+60:  Acquiring bank processes the chargeback; merchant notified.
T+90:  Chargeback resolved (or escalated to arbitration); you get the final "fraud confirmed" label.

Training on T+90 labels means your training data for last month isn't fully labeled for another two months. Meanwhile, a new fraud pattern that emerged three weeks ago is hurting you with no label signal yet.

The mitigations work in layers:

Early-signal surrogates: disputes filed within 7 days are a reliable leading indicator. Use them as provisional labels for recent transactions. Retrain weekly on T+7 surrogates; retrain monthly on T+90 confirmed labels.
Decoupled training runs: maintain two models — one trained on recent-but-noisy T+7 data (catches new patterns fast), one on older-but-clean T+90 data (more accurate baseline). Ensemble the two.
Label pipeline with event sourcing: every fraud signal (chargeback notification, dispute file, manual analyst decision) is an event on a Kafka topic. A label service joins them to the original transaction by txn_id and writes the labeled record to the training store when the label is confident.
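The label join in the last mitigation can be sketched as a fold over label events. The event shape, the `Label` record, and the precedence order in `SOURCE_RANK` are all assumptions made for this example; in particular, ranking a chargeback above an analyst decision is a policy choice, not something the card networks dictate.

```python
from dataclasses import dataclass

# Assumed precedence: a higher rank overrides a lower one for the same txn_id.
SOURCE_RANK = {"dispute": 1, "analyst": 2, "chargeback": 3}


@dataclass
class Label:
    txn_id: str
    is_fraud: bool
    source: str
    final: bool  # True only once the chargeback outcome is known


def resolve_labels(events):
    """Fold label events into one label per transaction."""
    labels = {}
    for e in events:
        current = labels.get(e["txn_id"])
        if current is None or SOURCE_RANK[e["source"]] >= SOURCE_RANK[current.source]:
            labels[e["txn_id"]] = Label(
                e["txn_id"], e["is_fraud"], e["source"],
                final=(e["source"] == "chargeback"),
            )
    return labels


EVENTS = [
    {"txn_id": "t1", "is_fraud": True, "source": "dispute"},
    {"txn_id": "t2", "is_fraud": True, "source": "dispute"},
    {"txn_id": "t1", "is_fraud": True, "source": "chargeback"},
    {"txn_id": "t1", "is_fraud": False, "source": "dispute"},  # late, lower rank
]
labels = resolve_labels(EVENTS)
print(labels["t1"], labels["t2"])
```

The weekly job can then train on every resolved label, while the monthly job filters to `final` labels only.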

sequenceDiagram
    participant TXN as Transaction (T+0)
    participant SCORE as Scoring Service
    participant STORE as Event Store
    participant DISPUTE as Dispute event (T+7)
    participant CHARGEBACK as Chargeback (T+90)
    participant LABEL as Label Pipeline
    participant TRAIN as Training Pipeline

    TXN->>SCORE: score event
    SCORE->>STORE: store {txn_id, features, decision, score}
    DISPUTE->>LABEL: dispute filed → provisional label
    LABEL->>STORE: join to txn_id → labeled record (surrogate)
    CHARGEBACK->>LABEL: confirmed fraud label
    LABEL->>STORE: update label (final)
    STORE->>TRAIN: weekly training job reads labeled records
    TRAIN->>SCORE: deploy retrained model

Graph-based fraud ring detection

Individual transaction scoring misses coordinated rings — groups of accounts sharing phones, email domains, IPs, or cards that each look slightly suspicious but together are obviously fraudulent.

The entity graph:

flowchart LR
    U1[User A] --- DEV1((Device X))
    U2[User B] --- DEV1
    U3[User C] --- DEV1
    U2 --- IP1((IP 1.2.3.4))
    U4[User D] --- IP1
    U3 --- CARD1((Card ending 7812))
    U5[User E] --- CARD1
    DEV1 --- EMAIL1((email domain\nfakemail.io))
    U4 --- EMAIL1

    style DEV1 fill:#ff2e88,color:#fff
    style IP1 fill:#ff2e88,color:#fff
    style CARD1 fill:#ff2e88,color:#fff
    style EMAIL1 fill:#ff2e88,color:#fff
    style U1 fill:#0e7490,color:#fff
    style U2 fill:#0e7490,color:#fff
    style U3 fill:#0e7490,color:#fff
    style U4 fill:#0e7490,color:#fff
    style U5 fill:#0e7490,color:#fff

Nodes are users, cards, devices, IPs, email addresses, phone numbers. Edges represent "this user has used this device," "this card was used from this IP," and so on.

The detection approach runs in three steps. First, a batch or micro-batch job (Spark, or a graph DB like Amazon Neptune / JanusGraph) builds the entity graph from the past N days of transaction events. Second, connected-component analysis finds clusters of linked entities. Third, if a cluster contains confirmed-fraud nodes, the other nodes in the cluster get elevated risk scores, even if their individual transaction history looks clean. Those cluster-level risk scores then become features in the feature store, read by the inline scoring pipeline.

This is not on the inline path — graph queries are expensive. It runs on a schedule (every few hours) and surfaces results as precomputed features.
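At small scale the same idea fits in a union-find. The sketch below uses the edges from the diagram above plus one unrelated pair; the `user:` node prefix and the scoring rule (share of confirmed-fraud users in the cluster) are assumptions for illustration, not a standard formula.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into clusters with a union-find over the edge list."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


def propagate_risk(edges, confirmed_fraud):
    """Give each user the share of confirmed-fraud users in its cluster."""
    risk = {}
    for cluster in connected_components(edges):
        users = {n for n in cluster if n.startswith("user:")}
        share = len(users & confirmed_fraud) / len(users) if users else 0.0
        for user in users:
            risk[user] = share
    return risk


EDGES = [
    ("user:A", "device:X"), ("user:B", "device:X"), ("user:C", "device:X"),
    ("user:B", "ip:1.2.3.4"), ("user:D", "ip:1.2.3.4"),
    ("user:C", "card:7812"), ("user:E", "card:7812"),
    ("device:X", "email:fakemail.io"), ("user:D", "email:fakemail.io"),
    ("user:F", "device:Y"),  # unrelated user, separate cluster
]
risk = propagate_risk(EDGES, confirmed_fraud={"user:A"})
print(risk["user:E"], risk["user:F"])
```

User E never touched Device X, yet inherits risk through the shared card with User C. The resulting per-user value is what would be written to the feature store for the inline path to read.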

Why this matters: a fraud ring typically has one "burned" account that gets caught first. Without graph analysis, the other accounts in the ring (99 of them in a 100-account ring) transact freely until each one burns individually. With graph analysis, catching one propagates risk to the whole cluster on the next graph refresh.

Precision/recall trade-off and human review

Every threshold setting is a point on the precision-recall curve. Lowering the deny threshold catches more fraud but blocks more legitimate customers. Raising it passes more legitimate customers but lets more fraud through. Neither extreme is right.

The optimal operating point depends on:

Fraud loss per transaction (how expensive is a miss?)
Customer lifetime value (how expensive is a false block?)
Chargeback rate limits imposed by card networks. Visa's VAMP program flags merchants above a 1.5% combined fraud-plus-dispute ratio (TC40 + TC15 counts over settled CNP transactions; the 1.5% threshold applies to the US, Canada, EU, and APAC as of April 2026, while CEMEA stays at 2.2%). Mastercard's Excessive Chargeback Program triggers at a 1.5% chargeback-to-transaction ratio. Excessive rates lead to losing card acceptance, which is catastrophic.
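The first two factors can be turned into a simple cost comparison across candidate thresholds. This sketch uses five made-up scored transactions and flat per-error costs; it does not model the card-network rate caps, which would act as a hard constraint on top.

```python
def expected_cost(threshold, scored, fraud_loss, false_block_cost):
    """Total cost of blocking everything with score >= threshold."""
    cost = 0.0
    for score, is_fraud in scored:
        blocked = score >= threshold
        if is_fraud and not blocked:
            cost += fraud_loss          # missed fraud
        elif not is_fraud and blocked:
            cost += false_block_cost    # blocked a good customer
    return cost


def best_threshold(scored, fraud_loss, false_block_cost, candidates):
    return min(
        candidates,
        key=lambda t: expected_cost(t, scored, fraud_loss, false_block_cost),
    )


# (score, is_fraud) pairs; illustrative data only.
SCORED = [(0.95, True), (0.55, True), (0.60, False), (0.30, False), (0.10, False)]
CANDIDATES = [0.5, 0.58, 0.9]

print(best_threshold(SCORED, fraud_loss=100, false_block_cost=20, candidates=CANDIDATES))
print(best_threshold(SCORED, fraud_loss=10, false_block_cost=200, candidates=CANDIDATES))
```

The same model and the same scores produce different operating points once the costs change, which is the argument for letting the business own the threshold.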

The human review queue sits between auto-allow and auto-deny. Transactions with scores in a borderline range (e.g., 0.40–0.55) are sent to analysts. Analysts see the full feature set, linked transactions, and can flip the decision. Every manual decision becomes a high-quality training label.

stateDiagram-v2
    [*] --> Scored
    Scored --> AutoAllow: score < low_threshold
    Scored --> AutoDeny: score > high_threshold
    Scored --> PendingReview: low_threshold ≤ score ≤ high_threshold
    PendingReview --> AnalystApprove: analyst reviews → approve
    PendingReview --> AnalystDeny: analyst reviews → deny
    AnalystApprove --> LabeledLegit: label stored for training
    AnalystDeny --> LabeledFraud: label stored for training
    AutoAllow --> [*]
    AutoDeny --> [*]
    LabeledLegit --> [*]
    LabeledFraud --> [*]

Adversarial drift and champion/challenger deployment

Fraudsters are adversarial agents. They probe your system and adapt. A model that caught 95% of fraud three months ago may catch only 85% now because the attack patterns it learned no longer match what fraudsters are doing.

The deployment pattern that handles this is champion/challenger. The champion is the currently-live production model. A challenger is a freshly retrained version running in shadow mode: it scores every transaction and logs its decision, but the champion's decision is what actually executes.

flowchart LR
    TXN3[Transaction] --> CHAMP[Champion model\nlive decision]
    TXN3 --> CHAL3[Challenger model\nshadow scoring]
    CHAMP --> LIVE[Execute decision]
    CHAL3 --> LOG[Log score only]
    LOG --> COMP[Compare AUC-PR\nover labeled transactions]
    COMP -->|challenger wins| PROMO[Promote challenger\nzero-downtime swap]
    COMP -->|champion holds| HOLD[Champion stays]
    style CHAMP fill:#ff6b1a,color:#0a0a0f
    style CHAL3 fill:#a855f7,color:#fff
    style PROMO fill:#15803d,color:#fff

Compare champion and challenger AUC-PR over a week on transactions that have since received labels. When the challenger statistically outperforms the champion, promote it — zero-downtime model swap.
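A rough version of that comparison is sketched below. `average_precision` is a plain-Python approximation of AUC-PR that ignores tied scores, and the fixed `min_gain` margin is a placeholder assumption standing in for a proper significance test such as a bootstrap over the labeled set.

```python
def average_precision(y_true, scores):
    """Mean of precision@k taken at each rank k where a fraud case appears."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


def should_promote(y_true, champion_scores, challenger_scores, min_gain=0.01):
    gain = average_precision(y_true, challenger_scores) - average_precision(
        y_true, champion_scores
    )
    return gain >= min_gain


y_true = [1, 0, 1, 0, 0]
champion = [0.9, 0.8, 0.3, 0.2, 0.1]     # ranks one legit txn above a fraud
challenger = [0.9, 0.2, 0.8, 0.3, 0.1]   # ranks both frauds first

print(average_precision(y_true, champion), average_precision(y_true, challenger))
print(should_promote(y_true, champion, challenger))
```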

Track model output distributions continuously alongside this. If the distribution of risk scores shifts materially — say, far more predictions in the 0.3–0.7 gray zone than baseline — that's a signal the model is becoming uncertain, possibly because the feature distribution has shifted. Alert before the model silently degrades.
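The gray-zone check described here reduces to comparing one ratio against a baseline. In this sketch the 1.5× alert multiplier is an arbitrary assumption and the score lists are synthetic.

```python
def gray_zone_share(scores, low=0.3, high=0.7):
    """Fraction of scores that fall inside the uncertain band."""
    scores = list(scores)
    return sum(1 for s in scores if low <= s <= high) / len(scores)


def drift_alert(baseline_scores, current_scores, max_ratio=1.5):
    """Alert when the gray-zone share grows past max_ratio times baseline."""
    base = gray_zone_share(baseline_scores)
    now = gray_zone_share(current_scores)
    return now > base * max_ratio


baseline = [0.05] * 90 + [0.5] * 10   # 10% of scores in the gray zone
current = [0.05] * 70 + [0.5] * 30    # 30% of scores in the gray zone

print(gray_zone_share(baseline), gray_zone_share(current))
print(drift_alert(baseline, current))
```

Unlike AUC-PR, this signal needs no labels, so it can fire within hours instead of waiting on the dispute pipeline.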

Storage choices

Data	Store	Why
Velocity counters (real-time)	Redis	Sub-millisecond reads, TTL support, HyperLogLog for cardinality
Historical feature profiles	Redis (hot) + S3/Parquet (cold)	Hot: inference-time read; cold: training data generation
Raw transaction events	Kafka (7-day retention) + S3 (archive)	Durable event source for reprocessing
Labeled training data	S3 Parquet + Delta Lake	Cheap, columnar, versioned for point-in-time correct training
Model artifacts	MLflow / S3	Versioned, reproducible, metadata tracked
Decisions / audit log	Append-only store (Postgres / S3)	Regulatory requirement: every decision must be explainable
Review queue	Postgres + task queue (e.g., Celery)	Low volume, rich queries, analyst workflow
Graph data	Graph DB (Neptune) or Spark adjacency lists	Connected components, not relational joins

Failure modes

Failure	Impact	Mitigation
Feature pipeline lag (Kafka / Flink falls behind)	Velocity features are stale — model makes decisions on outdated counts	Freshness check per feature; fall back to conservative rules if `feature_age > threshold`
Redis cluster failure	All feature lookups fail — cannot enrich events	Rules-only fallback mode (pre-loaded in scoring service memory); alert on-call
Model server crash	No ML scoring	Fall back to rules-only; do not block payments; page ML on-call
False-positive storm	Model triggers on a new legitimate pattern (e.g., first day of a major holiday promotion)	Real-time precision/recall dashboard; circuit breaker that widens thresholds if false-positive rate spikes beyond 3× baseline
Model staleness / drift	Fraud rates spike undetected	Weekly automated retraining; AUC-PR monitoring; chargeback rate alert
Fraud ring evasion	Ring rotates devices/IPs faster than graph jobs run	Decrease graph refresh interval; use streaming graph updates (Flink + graph DB CDC)
Cold start on new user/card	No velocity or profile features	Default features (population averages for new entities); higher challenge rate for first transactions
Label pipeline failure	No new labels reach training	Training halts; champion model ages; alert and skip the training run rather than train on partial data
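The Redis and model-server rows share one principle: degrade to rules, never to an error. A minimal sketch of that control flow follows; the function name, the injected callables, and the reason strings are hypothetical, and catching a broad `Exception` is a simplification that a real service would narrow to timeouts and connection errors.

```python
def decide_with_fallback(txn, fetch_features, run_rules, predict, memory_rules):
    """Return (decision, reason) without ever raising to the payment gateway."""
    try:
        features = fetch_features(txn)
    except Exception:
        # Feature store unavailable: use only rules held in process memory.
        decision = memory_rules(txn) or "allow"
        return decision, "fallback:no_features"

    verdict = run_rules(txn, features)
    if verdict is not None:
        return verdict, "rules"

    try:
        score = predict(features)
    except Exception:
        # Model server unavailable and no rule matched: do not block payment.
        return "allow", "fallback:rules_only"

    if score >= 0.85:
        return "deny", "model"
    if score >= 0.40:
        return "challenge", "model"
    return "allow", "model"
```

Logging the reason string with every decision makes fallback periods easy to find later, both for audit and for excluding degraded decisions from model evaluation.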

Things to discuss in an interview

Why inline, not async? Post-hoc fraud detection can reverse transactions after the fact, but chargebacks are expensive and card network rules penalize high chargeback rates — you need to block at authorization.
Rules vs. ML trade-off: rules for known-bad patterns and regulatory clarity; ML for the gray zone where no explicit rule exists. Both are necessary; neither alone is sufficient.
The feature store architecture: real-time (Kafka/Flink → Redis) for velocity; offline (batch Spark → S3 → Redis hot cache) for historical profiles. Why are these separate? Latency and cost.
The label-delay problem and what it does to your training loop: this is the answer that separates senior candidates. The model you deploy today is trained on fraud patterns from 90 days ago.
Precision vs. recall and who sets the threshold: the ML team sets the curve; the business decides which point on that curve to operate at. Engineers should not be setting fraud policy.
Champion/challenger: how do you A/B test a new model without risking production fraud rates? Shadow scoring: the challenger scores the same traffic, but its output is only logged and compared, never executed.
Graph detection: why can't you do this inline? What's the right refresh cadence?

Things you should now be able to answer

What is a velocity feature and why is it the most important feature class in fraud?
Why does fraud detection need a dedicated feature store rather than computing features at query time?
What is the label-delay problem and how do you mitigate it?
Why is accuracy the wrong metric for fraud models?
What is champion/challenger deployment and why does it matter for adversarial systems?
What happens to your scoring system when the ML model is unavailable?
How do you detect a coordinated fraud ring with a per-transaction scorer?

Frequently asked questions

▸Why must fraud scoring happen inline before payment authorization, not asynchronously after the fact?

Post-hoc detection can reverse transactions, but chargebacks are expensive and card networks penalize merchants that exceed chargeback thresholds — Visa's VAMP program flags merchants above a 1.5% combined fraud-plus-dispute ratio, and Mastercard's Excessive Chargeback Program triggers at 1.5%. Losing card acceptance entirely is catastrophic, so the decision must block authorization rather than unwind it.

▸What is the label-delay problem in fraud detection, and how do you mitigate it?

A transaction that occurs today may not receive a confirmed fraud label until a chargeback is resolved 30 to 90 days later, meaning the model deployed this month is trained on fraud patterns from three months ago. The standard mitigation is to use early dispute signals available at T+1 to T+7 as surrogate labels for weekly retraining, while reserving the confirmed T+90 labels for a more accurate monthly retrain, and optionally ensembling both models.

▸Why are gradient-boosted trees the primary model choice for fraud scoring rather than neural networks?

GBTs handle mixed feature types and missing values natively, produce interpretable outputs required for regulatory audit, and are fast at inference — under 1 ms per row with an optimized runtime like Treelite or ONNX, and roughly 0.7 to 2 ms with native XGBoost predict(). Neural networks offer marginal accuracy gains on structured tabular data and impose a high interpretability cost that is difficult to justify to regulators.

▸Why is accuracy the wrong evaluation metric for a fraud model, and what should you use instead?

Fraud rates are typically 0.1% to 1% of transactions, so a model that predicts every transaction as legitimate scores 99.5% accuracy while catching zero fraud. The correct metrics are precision, recall, and AUC-PR — the area under the precision-recall curve — because they measure performance on the rare positive class rather than the overwhelming majority.

▸How does graph-based fraud ring detection work, and why can it not run on the inline scoring path?

A batch or micro-batch job builds an entity graph linking users, cards, devices, IPs, and email addresses, then runs connected-component analysis to find clusters containing confirmed-fraud nodes and propagates elevated risk scores to the whole cluster. Graph queries are too expensive to run in the 100 ms authorization window, so the job runs on a schedule every few hours and surfaces cluster-level risk scores as precomputed features in Redis, where the inline pipeline reads them as a single O(1) lookup.
