Design a Real-Time Fraud Detection System
Score transactions for fraud inline in milliseconds. Feature stores, streaming velocity features, rules + ML hybrids, graph fraud rings, and the label-delay problem.
The problem
Stripe blocks roughly 1.5 billion fraudulent dollars per year on behalf of its merchants. PayPal's fraud team scores hundreds of millions of transactions daily. Both systems face the same brutal constraint: the decision — allow, deny, or challenge — must arrive before the payment authorization completes, which means you have at most 100 ms from the moment a customer clicks "Pay."
That timing constraint is what makes this problem hard. You can't batch decisions, you can't afford expensive graph traversals on the critical path, and you can't wait for clean labels. A card number stolen in a data breach today will be tested against your system tonight. By the time a chargeback confirms the fraud 30–90 days later, the attack pattern has already evolved. Your model is perpetually chasing yesterday's fraud with yesterday's labels.
The system also straddles two failure modes with opposite costs. Miss a fraudulent transaction (false negative) and you lose the charged-back amount plus the chargeback fee, and you risk Visa or Mastercard putting you on their excessive-chargeback watchlists if rates climb above 1.5%. Block a legitimate transaction (false positive) and you lose a customer, possibly permanently. No threshold setting eliminates both kinds of error. The system has to be tunable, per merchant and per risk tier, so the business (not the ML team) decides where on the precision-recall curve to operate.
This is a different problem from the payment system, which is about moving money correctly. Fraud detection is about whether to move it at all. The two systems run side by side in the authorization hot path, with distributed systems, ML engineering, and adversarial game theory all in play at once.
Functional requirements
- Score payment events (card transactions, ACH, peer-to-peer transfers) with a risk label: allow, deny, or challenge (step-up authentication).
- Score account events: signups (new-account fraud), logins (account takeover).
- Support a human review queue for borderline cases — analysts can flip decisions.
- Provide an explanation for every decision (audit trail, regulatory requirement, analyst tooling).
- Accept feedback: when a review analyst overrides a decision, or a chargeback arrives, feed that signal back into the system.
Non-functional requirements
- Latency: p99 < 100 ms inline with payment authorization. p50 < 20 ms.
- Availability: 99.99% uptime on the scoring path. A scoring failure must not block payments — fall back to rules-only, never to a hard error.
- Throughput: 1,000–5,000 events/sec sustained; spikes of 2–3× during sales events.
- Consistency: a "deny" decision must be deterministic given the same inputs — regulatory audits will ask why a transaction was blocked.
- Freshness: velocity features (e.g., "how many transactions on this card in the last 5 minutes") must reflect events from at most a few seconds ago.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Peak write throughput | 5,000 events/sec | Payment gateway scale target |
| Daily events scored | 86.4M events/day | 1,000 × 86,400 |
| Redis reads/sec (feature lookups) | 150,000 reads/sec | 5,000 events/sec × 30 features/event |
| Redis cluster headroom | Well within a 3-node cluster | Each Redis node handles ~100k ops/sec |
| Feature store active keys | ~50M keys | Cards + users + devices + merchant pairs |
| Feature store memory (raw) | 10 GB | 50M keys × 200 bytes/key |
| Feature store memory (with replication) | 30 GB RAM | 10 GB × 3× replication |
| Model inference latency (native XGBoost) | ~0.7–2 ms per prediction | 1,000-tree GBT on a modern CPU core |
| Model inference latency (optimized runtime) | < 0.5 ms | Treelite or ONNX Runtime |
| CPU cores needed @ 0.5 ms/prediction | ~3 cores | 5,000/sec × 0.5 ms = 2.5 cores |
| CPU cores needed @ 2 ms/prediction (unoptimized) | ~10 cores | 5,000/sec × 2 ms = 10 cores |
| Raw training rows/year | ~31.5B rows | 86.4M events/day × 365 |
| Training rows after negative sampling | ~1B labeled rows | 10:1 negative:positive ratio applied |
| Training data storage/year | ~500 GB | ~1B rows × 500 bytes/row compressed (Parquet on S3) |
| Label delay (chargeback confirmation) | 30–90 days post-transaction | Visa/Mastercard dispute resolution rules |
| Surrogate label signal available | T+1 to T+7 | Early dispute signals used for faster feedback |
Takeaway: The inline scoring path is modest: 150k Redis reads/sec and at most ~10 CPU cores for model inference. The hard constraints are latency (< 100 ms end-to-end) and label freshness: production training data for a given week is not fully labeled for 90+ days, so you retrain on early dispute surrogates in the meantime.
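The arithmetic behind the table can be reproduced in a few lines. This is only a back-of-envelope sketch using the estimates stated above; the constant and function names are illustrative.

```python
# Back-of-envelope capacity math, using only the estimates from the table above.
PEAK_EVENTS_PER_SEC = 5_000
SUSTAINED_EVENTS_PER_SEC = 1_000
FEATURES_PER_EVENT = 30
ACTIVE_KEYS = 50_000_000
BYTES_PER_KEY = 200
REPLICATION_FACTOR = 3

redis_reads_per_sec = PEAK_EVENTS_PER_SEC * FEATURES_PER_EVENT
daily_events = SUSTAINED_EVENTS_PER_SEC * 86_400
feature_store_gb = ACTIVE_KEYS * BYTES_PER_KEY / 1e9
replicated_gb = feature_store_gb * REPLICATION_FACTOR


def cores_needed(events_per_sec, ms_per_prediction):
    """CPU cores kept fully busy by inference (no headroom included)."""
    return events_per_sec * ms_per_prediction / 1_000


print(redis_reads_per_sec)                     # 150000
print(daily_events)                            # 86400000
print(feature_store_gb, replicated_gb)         # 10.0 30.0
print(cores_needed(PEAK_EVENTS_PER_SEC, 0.5))  # 2.5
print(cores_needed(PEAK_EVENTS_PER_SEC, 2.0))  # 10.0
```

The core counts are the bare minimum to keep up at peak; a real deployment would add headroom for spikes and tail latency.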
Building up to the design
V1: Hard rules only
Every fraud system starts here: a list of known-bad signals turned into code.
def score(txn):
    # Each check is a known-bad signal; the first match denies the transaction.
    if txn.card in blocklist: return DENY
    if txn.amount > country_limit[txn.country]: return DENY
    if txn.ip in known_bad_ips: return DENY
    return ALLOW
This is interpretable, fast, and auditable — a junior analyst can explain every decision. The problem is that rules are static. Fraudsters learn your rules by probing the system. A card not yet on any blocklist passes freely. Velocity attacks — spreading $9.99 across 1,000 cards in an hour — are completely invisible, because no single transaction triggers anything.
V2: Add velocity counters
"This card has been used 8 times in the last 3 minutes" is one of the strongest fraud signals you have, regardless of any individual transaction amount. So we start computing counts over rolling windows and checking them at scoring time.
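def get_velocity(card_id, window_seconds):
    key = f"vel:{card_id}:{window_seconds}"
    # Redis returns a string/bytes value (or None for a missing key), so cast before comparing.
    return int(redis.get(key) or 0)

# Inside the scoring function:
if get_velocity(txn.card, 300) > 5:
    return DENY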
This catches velocity attacks rules couldn't see. But now we have a new problem: a card first seen today has zero history. The rules fire based on counts, but the pattern matters — is 3 transactions in 5 minutes abnormal for this particular cardholder? Rules can't answer that. We need something that learns.
V3: Add an ML model for the gray zone
A gradient-boosted tree (GBT) model trained on historical labeled transactions learns which combinations of feature values separate fraud from legitimate activity. It handles the gray zone that rules cannot.
score = model.predict(feature_vector) # → 0.0 to 1.0
if score > 0.85: return DENY
if score > 0.4: return CHALLENGE # step-up auth
return ALLOW
The model can learn non-obvious combinations. A $15 transaction at a foreign gas station is fine, unless the card was used in New York 20 minutes ago. No single rule captures that joint signal, but a trained model picks it up from labeled examples.
What breaks next: the model is trained on yesterday's fraud patterns. Fraudsters adapt, and the model drifts. The model's features also have to be ready before inference: computing them naively at request time blows the 100 ms budget.
V4: A proper feature store
Feature computation moves out of the hot path. A streaming pipeline (Kafka → Flink) continuously maintains velocity windows. A batch pipeline (daily Spark job) updates longer-horizon profile features. Both write to a shared feature store — Redis for real-time reads, a columnar store for offline training.
At inference time you fetch precomputed features rather than recomputing them. Every feature lookup becomes a single Redis GET: ~1 ms flat.
V5: Graph-based ring detection (async)
Connected fraud rings — multiple accounts sharing a device, IP, or card — are invisible to per-transaction scoring. An asynchronous graph analysis layer links entities and propagates risk scores across the graph. This doesn't sit on the inline path; it feeds into feature store updates and review queue prioritization.
V6: Full production system
V3 + V4 + rules engine + delayed-label retraining + human review queue + graph analysis + champion/challenger deployment.
flowchart LR
V1["V1: hard rules\nfast, brittle"] --> V2["V2: + velocity counters\ngray zone visible"]
V2 --> V3["V3: + ML model\ngray zone handled"]
V3 --> V4["V4: + feature store\n< 20ms p50 / < 25ms p99"]
V4 --> V5["V5: + graph detection\nring fraud caught"]
V5 --> V6["V6: + retraining pipeline\nadversarial drift handled"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
TXN[Transaction event] --> GW[API Gateway / Scoring endpoint]
GW --> ENRICH[Feature enrichment service]
ENRICH --> RFS[(Real-time feature store\nRedis: velocity, device, session)]
ENRICH --> OFS[(Offline feature store\nS3 + Redis: historical profiles)]
ENRICH --> ENGINE[Decision engine]
ENGINE --> RULES[Rules engine\nhard deny / allow]
ENGINE --> ML[ML model server\nXGBoost / LightGBM]
RULES -->|deny| DEC{Decision}
ML -->|score| DEC
DEC -->|allow| OK[Authorize]
DEC -->|deny| BLOCK[Block]
DEC -->|challenge| STEP[Step-up auth]
DEC -->|borderline| QUEUE[Human review queue]
TXN --> KAFKA[Kafka: raw event stream]
KAFKA --> FLINK[Flink: velocity aggregation]
FLINK --> RFS
LABELS[Chargeback / dispute labels] --> DELAY[Delayed label pipeline]
DELAY --> TRAIN[Training pipeline\nSpark + MLflow]
OFS --> TRAIN
TRAIN --> CHAL[Challenger model]
CHAL --> SHADOW[Shadow scoring\nA/B eval]
SHADOW --> ML
style ENGINE fill:#ff6b1a,color:#0a0a0f
style RFS fill:#15803d,color:#fff
style ML fill:#a855f7,color:#fff
style FLINK fill:#0e7490,color:#fff
style DELAY fill:#ffaa00,color:#0a0a0f
The inline scoring pipeline
The inline path is everything. Every millisecond here is a millisecond the payment gateway is waiting.
sequenceDiagram
participant GW as Payment Gateway
participant SCORE as Scoring Service
participant FEAT as Feature Enrichment
participant RED as Redis (real-time)
participant MODEL as Model Server
participant AUTH as Authorization
GW->>SCORE: ScoreRequest (card, amount, merchant, device)
SCORE->>FEAT: enrich(event)
FEAT->>RED: MGET [vel:card:60, vel:card:300, vel:card:3600, device_age, ip_country_mismatch, ...]
RED-->>FEAT: ~20 feature values (< 2 ms)
FEAT-->>SCORE: feature_vector
SCORE->>SCORE: run rules engine (< 1 ms)
alt hard rule match
SCORE-->>GW: DecisionResponse: DENY + explanation (immediate)
else gray zone
SCORE->>MODEL: predict(feature_vector)
MODEL-->>SCORE: risk_score (< 5 ms)
SCORE-->>GW: DecisionResponse: allow / deny / challenge + explanation
end
note over AUTH: GW proceeds with\nauthorization based on decision
Latency budget breakdown:
| Step | Target p99 |
|---|---|
| Network (gateway → scoring service) | 5 ms |
| Feature lookup (Redis MGET, ~20–30 keys) | 5 ms |
| Rules engine evaluation | 1 ms |
| Model inference (GBT) | 5 ms |
| Network (scoring service → gateway) | 5 ms |
| Total | < 25 ms (leaves margin for p99 tail) |
Keep rules and model inference on the same host as the scoring service to avoid extra network hops. Fetch all feature keys in a single round trip (one MGET, or a pipelined batch of commands) rather than issuing one GET per feature.
Feature store and velocity features
Velocity features are the single most valuable feature class in fraud detection. They answer: "Is this card / user / device behaving unusually right now?"
Example velocity features:
| Feature | Window | Entity | Redis key pattern |
|---|---|---|---|
| Transaction count | 1 min, 5 min, 1 hr, 24 hr | card_id | vel:card:{id}:{window}:count |
| Transaction amount sum | 5 min, 1 hr | card_id | vel:card:{id}:{window}:sum |
| Distinct merchants | 1 hr | card_id | vel:card:{id}:1h:merchants (HyperLogLog) |
| Distinct IPs | 24 hr | user_id | vel:user:{id}:24h:ips |
| Distinct devices | 7 days | user_id | vel:user:{id}:7d:devices |
| Failed auth attempts | 30 min | device_id | vel:device:{id}:30m:fails |
| Cross-border txns | 1 hr | card_id | vel:card:{id}:1h:countries (HyperLogLog) |
How velocity features are maintained (the streaming pipeline):
flowchart LR
KAFKA[Kafka\ntxn events] --> FLINK[Flink job\nwindowed aggregation]
FLINK --> RED[(Redis\nvelocity counters)]
FLINK --> TSCALE[Time-series store\nfor audit / backfill]
style KAFKA fill:#0e7490,color:#fff
style FLINK fill:#ff6b1a,color:#0a0a0f
style RED fill:#15803d,color:#fff
Flink maintains sliding windows using its native event-time processing. Each transaction event increments the relevant counters. Redis TTLs are set slightly longer than the window to handle late events. For cardinality estimates ("distinct IPs this user has used in 24h"), use a HyperLogLog in Redis: constant memory per key and a 0.81% standard error, which is more than precise enough for a velocity signal.
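The 0.81% figure follows from the usual HyperLogLog error bound of roughly 1.04/√m for m registers. A quick check, assuming the 2^14 six-bit registers of Redis's dense HyperLogLog encoding:

```python
import math

REGISTERS = 2 ** 14      # register count in Redis's HyperLogLog
BITS_PER_REGISTER = 6    # dense encoding

standard_error = 1.04 / math.sqrt(REGISTERS)
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

print(f"{standard_error:.4%}")  # 0.8125%
print(dense_bytes)              # 12288
```

So a dense counter tops out around 12 KB per key no matter how many distinct values it has seen. That is constant, but not negligible if every card and user carries several of them.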
Why not compute velocity at scoring time? Because computing "count of transactions on this card in the last 5 minutes" requires scanning recent history — that's O(events in window) per lookup, not O(1). Pre-aggregation by the streaming pipeline makes every feature lookup a single Redis GET: O(1), ~1 ms.
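To make the pre-aggregation idea concrete, here is a minimal in-memory sketch. It is not Flink or Redis code: the class and method names are hypothetical, and a plain dictionary stands in for the counter store. Events are counted into fixed time buckets on the write path, so a read sums a bounded number of buckets regardless of how many events fell inside the window.

```python
from collections import defaultdict


class BucketedVelocityCounter:
    """In-memory stand-in for pre-aggregated velocity counters."""

    def __init__(self, bucket_seconds=10):
        self.bucket_seconds = bucket_seconds
        # entity_id -> {bucket_index: event_count}
        self._buckets = defaultdict(lambda: defaultdict(int))

    def record(self, entity_id, event_ts):
        """Write path: called once per event by the aggregation job."""
        self._buckets[entity_id][int(event_ts // self.bucket_seconds)] += 1

    def count(self, entity_id, window_seconds, now):
        """Read path: count over the last window_seconds, accurate to one bucket."""
        newest = int(now // self.bucket_seconds)
        oldest = int((now - window_seconds) // self.bucket_seconds) + 1
        buckets = self._buckets.get(entity_id, {})
        return sum(buckets.get(b, 0) for b in range(oldest, newest + 1))


counter = BucketedVelocityCounter(bucket_seconds=10)
for ts in (900, 1100, 1250, 1290):
    counter.record("card_1", ts)

print(counter.count("card_1", window_seconds=300, now=1300))  # 3
print(counter.count("card_1", window_seconds=60, now=1300))   # 2
```

The sketch never evicts old buckets. In the design above that job falls to the Redis TTLs, and the read cost depends on window/bucket size rather than on traffic volume.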
Rules engine
Rules run before the ML model. They are fast, deterministic, and explainable — critical for regulatory compliance.
Rule categories:
| Category | Examples | Action |
|---|---|---|
| Blocklist | Card in stolen-card list, IP in botnet list | Hard deny |
| Limit rules | Amount > daily card limit, > country limit | Hard deny |
| Impossible travel | Last txn in London 5 min ago, now in Sydney | Deny / challenge |
| Velocity threshold | > 10 txns on card in last 5 min | Challenge |
| Device mismatch | Device country ≠ billing country | Challenge |
| New entity | Card created < 2h ago + high amount | Challenge |
Rules are stored as data (YAML/DB rows), not hardcoded. An analyst can add a new rule without a code deploy. A rules engine service evaluates them in priority order; the first matching terminal rule wins.
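A rules-as-data evaluator can be sketched in a few lines. Everything here is illustrative: the rule names, fields, priorities, and limits are made up, every rule is treated as terminal, and a real system would load the rows from YAML or a database rather than a Python list.

```python
import operator

OPS = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    "in": lambda value, allowed: value in allowed,
}

# Rules as data: each rule fires only when all of its conditions hold.
RULES = [
    {"name": "stolen_card", "priority": 10, "action": "DENY",
     "when": [("card_id", "in", {"card_stolen_1"})]},
    {"name": "card_velocity_5m", "priority": 20, "action": "CHALLENGE",
     "when": [("vel_card_5m", ">", 10)]},
    {"name": "new_card_high_amount", "priority": 30, "action": "CHALLENGE",
     "when": [("card_age_hours", "<", 2), ("amount", ">", 500)]},
]


def condition_holds(event, field, op, value):
    # A missing field never triggers a rule.
    return field in event and OPS[op](event[field], value)


def evaluate(rules, event):
    """Return (action, rule_name) for the first match in priority order."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if all(condition_holds(event, *cond) for cond in rule["when"]):
            return rule["action"], rule["name"]
    return None, None  # no rule fired: hand off to the ML model


print(evaluate(RULES, {"card_id": "card_ok", "vel_card_5m": 12, "amount": 20}))
# ('CHALLENGE', 'card_velocity_5m')
```

Returning the rule name alongside the action gives the audit trail an explanation for free.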
Rules are not enough on their own. They are binary and static — they require explicit enumeration of every bad pattern. ML fills the gap for the gray zone, where combinations of moderately suspicious signals are what matter, not any single rule.
The ML model
Feature vector
At inference time the feature vector includes:
- Velocity features (from Redis): counts, sums, cardinalities over multiple windows
- Transaction features: amount, currency, merchant category code (MCC), cross-border flag
- Entity profile features (from offline store): average spend by MCC for this card, historical dispute rate, account age
- Device/session features: device age, IP risk score, browser fingerprint consistency
- Derived features: amount relative to historical average, time-of-day normalized by timezone
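As an illustration of the derived features in the last bullet, the sketch below computes an amount-to-average ratio and a local hour of day. The function and argument names are hypothetical, and a fixed UTC offset stands in for proper timezone handling.

```python
import math
from datetime import datetime, timedelta, timezone


def derived_features(amount, card_avg_amount, event_time_utc, utc_offset_hours):
    """Compute two derived features; all argument names are hypothetical."""
    # NaN marks "no history yet"; GBT libraries such as XGBoost treat NaN as missing.
    if card_avg_amount:
        amount_ratio = amount / card_avg_amount
    else:
        amount_ratio = math.nan
    local_time = event_time_utc + timedelta(hours=utc_offset_hours)
    return {"amount_to_avg_ratio": amount_ratio, "local_hour": local_time.hour}


features = derived_features(
    amount=150.0,
    card_avg_amount=50.0,
    event_time_utc=datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc),
    utc_offset_hours=-5,
)
print(features)  # {'amount_to_avg_ratio': 3.0, 'local_hour': 18}
```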
The diagram below shows how these sources converge at inference time:
flowchart LR
VEL[(Redis\nvelocity counters)] --> FV[Feature vector]
PROF[(Offline store\nhistorical profiles)] --> FV
TXN2[Transaction fields\namount / MCC / currency] --> FV
DEV[Device / session\nfingerprint, IP risk] --> FV
FV --> MODEL[GBT model\nXGBoost / LightGBM]
MODEL --> SCORE2[Risk score 0.0–1.0]
SCORE2 --> THR{Threshold}
THR -->|"< 0.40"| ALLOW2[Allow]
THR -->|"0.40–0.85"| CHAL2[Challenge]
THR -->|"> 0.85"| DENY2[Deny]
style FV fill:#ff6b1a,color:#0a0a0f
style MODEL fill:#a855f7,color:#fff
style ALLOW2 fill:#15803d,color:#fff
style DENY2 fill:#ff2e88,color:#fff
style CHAL2 fill:#ffaa00,color:#0a0a0f
Model choice
Gradient-boosted trees (XGBoost or LightGBM) are the standard choice for tabular fraud data. They handle mixed feature types (numeric, categorical, binary) without extensive preprocessing, tolerate missing values natively, and model non-linear interactions between features. Inference is fast: a 500-tree model scores a single row in under 1 ms on a single CPU core with an optimized inference runtime (Treelite, ONNX Runtime). Native XGBoost predict() is slower, typically around 0.7–2 ms for a single row depending on tree count and threading configuration, because Python and threading overhead dominate on single rows.
Neural networks show up as ensembles or for specific sub-problems — graph embeddings for fraud ring detection, sequence models for transaction history — but they don't replace GBTs for the primary tabular scoring task. The interpretability cost is high and the accuracy gain on structured tabular data is marginal compared to a well-tuned GBT.
Class imbalance
Fraud rates are typically 0.1%–1% of transactions. Training naively on this distribution can produce a model that predicts "legitimate" for everything and still hits 99% accuracy or better, which is useless. The standard fix is to undersample negatives: keep all positives and sample negatives down to a 10:1 or 20:1 ratio. It is simple and effective. You can also penalize false negatives more heavily in the loss function via scale_pos_weight in XGBoost.
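A minimal sketch of negative undersampling follows. One caveat it also illustrates: a model trained on downsampled negatives sees a higher fraud rate than production, so its scores are inflated relative to the true base rate. The recalibration formula used here, q = p / (p + (1 − p) / w) with w the fraction of negatives kept, is the one described in He et al. (2014), listed under Further reading. The function names are illustrative, and the sketch assumes at least one positive row.

```python
import random


def downsample_negatives(rows, negatives_per_positive=10, seed=0):
    """Keep every fraud row; sample legitimate rows down to the target ratio.

    rows: list of (features, label) with label 1 = fraud, 0 = legitimate.
    Returns (sampled_rows, w) where w is the fraction of negatives kept.
    """
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    keep = min(len(negatives), negatives_per_positive * len(positives))
    rng = random.Random(seed)
    sampled = positives + rng.sample(negatives, keep)
    rng.shuffle(sampled)
    w = keep / len(negatives) if negatives else 1.0
    return sampled, w


def recalibrate(p, w):
    """Map a score from the downsampled space back to the original base rate."""
    return p / (p + (1.0 - p) / w)


# 10 fraud rows and 1,000 legitimate rows -> 10 + 100 rows after sampling.
rows = [((i,), 1) for i in range(10)] + [((i,), 0) for i in range(1000)]
sampled, w = downsample_negatives(rows, negatives_per_positive=10)
print(len(sampled), w)                 # 110 0.1
print(round(recalibrate(0.5, w), 4))   # 0.0909
```

Whether you recalibrate or simply pick thresholds directly on the model's raw scores is a design choice; what matters is that thresholds and scores live in the same space.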
Evaluate with precision, recall, and AUC-PR — never accuracy. The area under the precision-recall curve is the right summary metric for imbalanced datasets because accuracy tells you nothing about how well you're catching the rare class.
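A small worked example shows why accuracy misleads. Assume 1,000 transactions with a 0.5% fraud rate and a "model" that always predicts legitimate; the helper name is illustrative.

```python
def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


y_true = [1] * 5 + [0] * 995   # 5 fraud cases in 1,000 transactions
always_legit = [0] * 1000

print(metrics(y_true, always_legit))  # (0.995, 0.0, 0.0)
```

The accuracy looks excellent while recall is zero: every fraudulent transaction got through.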
Threshold calibration
The model outputs a probability: 0.0 (definitely legitimate) to 1.0 (definitely fraud). The threshold that maps score to action is a business decision, not a model decision:
Threshold: deny_above = 0.85, challenge_above = 0.40

| Score range | Action |
|---|---|
| score < 0.40 | allow |
| 0.40 ≤ score < 0.85 | challenge |
| score ≥ 0.85 | deny |
Different contexts get different thresholds: high-value merchants use a lower challenge threshold; subscription renewals use a higher deny threshold because recurring charges have strong history; new accounts get tightened thresholds for first transactions.
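Per-context thresholds can live in configuration rather than in the model. The sketch below uses the default cut-offs from this section; the override values and context names are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    challenge_above: float = 0.40
    deny_above: float = 0.85


DEFAULT = Thresholds()

# Hypothetical per-context overrides owned by the business, not the ML team.
OVERRIDES = {
    "high_value_merchant": Thresholds(challenge_above=0.30, deny_above=0.85),
    "subscription_renewal": Thresholds(challenge_above=0.40, deny_above=0.92),
    "new_account": Thresholds(challenge_above=0.30, deny_above=0.75),
}


def decide(score, context=None):
    """Map a risk score to an action using the thresholds for this context."""
    t = OVERRIDES.get(context, DEFAULT)
    if score >= t.deny_above:
        return "deny"
    if score >= t.challenge_above:
        return "challenge"
    return "allow"


print(decide(0.88))                          # deny
print(decide(0.88, "subscription_renewal"))  # challenge
print(decide(0.35, "high_value_merchant"))   # challenge
```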
The label-delay problem
This is the hardest operational problem in fraud ML — and largely unique to this domain.
The timeline:
T+0: Transaction occurs. You score it. Maybe you're right, maybe wrong.
T+1–7: Customer disputes the charge (early dispute signal).
T+30: Issuing bank files chargeback against acquiring bank.
T+60: Acquiring bank processes the chargeback; merchant notified.
T+90: Chargeback resolved (or escalated to arbitration); you get the final "fraud confirmed" label.
Training on T+90 labels means your training data for last month isn't fully labeled for another two months. Meanwhile, a new fraud pattern that emerged three weeks ago is hurting you with no label signal yet.
The mitigations work in layers:
- Early-signal surrogates: disputes filed within 7 days are a reliable leading indicator. Use them as provisional labels for recent transactions. Retrain weekly on T+7 surrogates; retrain monthly on T+90 confirmed labels.
- Decoupled training runs: maintain two models, one trained on recent-but-noisy T+7 data (catches new patterns fast) and one on older-but-clean T+90 data (more accurate baseline). Ensemble the two.
- Label pipeline with event sourcing: every fraud signal (chargeback notification, dispute file, manual analyst decision) is an event on a Kafka topic. A label service joins them to the original transaction by txn_id and writes the labeled record to the training store when the label is confident.
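The join performed by the label service can be sketched as follows. The record shapes and the precedence between signal sources are assumptions made for illustration: here a confirmed chargeback outranks an analyst decision, which outranks an early dispute.

```python
# Assumed precedence when several signals exist for one transaction.
SOURCE_RANK = {"dispute": 1, "analyst": 2, "chargeback": 3}


def build_labeled_records(scored_txns, label_events):
    """Join label events to scored transactions by txn_id.

    scored_txns: {txn_id: {"features": ..., "score": ...}} as logged at T+0.
    label_events: iterable of {"txn_id", "source", "is_fraud"} dicts.
    """
    best = {}
    for event in label_events:
        txn_id = event["txn_id"]
        if txn_id not in scored_txns:
            continue  # label for a transaction we never scored
        current = best.get(txn_id)
        if current is None or SOURCE_RANK[event["source"]] >= SOURCE_RANK[current["source"]]:
            best[txn_id] = event
    return [
        {
            "txn_id": txn_id,
            "features": scored_txns[txn_id]["features"],
            "is_fraud": event["is_fraud"],
            "label_source": event["source"],
            "final": event["source"] == "chargeback",  # provisional otherwise
        }
        for txn_id, event in best.items()
    ]
```

Keeping `label_source` and `final` on each record lets the weekly job train on provisional labels while the monthly job filters to confirmed ones.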
sequenceDiagram
participant TXN as Transaction (T+0)
participant SCORE as Scoring Service
participant STORE as Event Store
participant DISPUTE as Dispute event (T+7)
participant CHARGEBACK as Chargeback (T+90)
participant LABEL as Label Pipeline
participant TRAIN as Training Pipeline
TXN->>SCORE: score event
SCORE->>STORE: store {txn_id, features, decision, score}
DISPUTE->>LABEL: dispute filed → provisional label
LABEL->>STORE: join to txn_id → labeled record (surrogate)
CHARGEBACK->>LABEL: confirmed fraud label
LABEL->>STORE: update label (final)
STORE->>TRAIN: weekly training job reads labeled records
TRAIN->>SCORE: deploy retrained model
Graph-based fraud ring detection
Individual transaction scoring misses coordinated rings — groups of accounts sharing phones, email domains, IPs, or cards that each look slightly suspicious but together are obviously fraudulent.
The entity graph:
flowchart LR
U1[User A] --- DEV1((Device X))
U2[User B] --- DEV1
U3[User C] --- DEV1
U2 --- IP1((IP 1.2.3.4))
U4[User D] --- IP1
U3 --- CARD1((Card ending 7812))
U5[User E] --- CARD1
DEV1 --- EMAIL1((email domain\nfakemail.io))
U4 --- EMAIL1
style DEV1 fill:#ff2e88,color:#fff
style IP1 fill:#ff2e88,color:#fff
style CARD1 fill:#ff2e88,color:#fff
style EMAIL1 fill:#ff2e88,color:#fff
style U1 fill:#0e7490,color:#fff
style U2 fill:#0e7490,color:#fff
style U3 fill:#0e7490,color:#fff
style U4 fill:#0e7490,color:#fff
style U5 fill:#0e7490,color:#fff
Nodes are users, cards, devices, IPs, email addresses, phone numbers. Edges represent "this user has used this device," "this card was used from this IP," and so on.
The detection approach runs in three steps. First, a batch or micro-batch job (Spark, or a graph DB like Amazon Neptune / JanusGraph) builds the entity graph from the past N days of transaction events. Second, connected-component analysis finds clusters of linked entities. Third, if a cluster contains confirmed-fraud nodes, the other nodes in the cluster get elevated risk scores, even if their individual transaction history looks clean. Those cluster-level risk scores then become features in the feature store, read by the inline scoring pipeline.
This is not on the inline path — graph queries are expensive. It runs on a schedule (every few hours) and surfaces results as precomputed features.
Why this matters: a fraud ring typically has one "burned" account that gets caught first. Without graph analysis, the remaining accounts in the ring transact freely until each one burns individually. With graph analysis, catching one account propagates risk to the whole cluster at the next graph refresh.
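The connected-component and risk-propagation steps can be sketched with a union-find structure over the edges from the entity graph above. The risk score used here (the share of confirmed-fraud nodes in a component) is one simple assumption; a production job would likely weight edges and node types, and would run in Spark or a graph database rather than in-process.

```python
from collections import defaultdict


def find(parent, x):
    """Return the component root of x, shortening paths along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def connected_components(edges):
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = defaultdict(set)
    for node in parent:
        groups[find(parent, node)].add(node)
    return list(groups.values())


def cluster_risk(edges, confirmed_fraud):
    """Map each node to the share of confirmed-fraud nodes in its component."""
    risk = {}
    for component in connected_components(edges):
        share = len(component & confirmed_fraud) / len(component)
        for node in component:
            risk[node] = share
    return risk


EDGES = [
    ("user_a", "device_x"), ("user_b", "device_x"), ("user_c", "device_x"),
    ("user_b", "ip_1"), ("user_d", "ip_1"),
    ("user_c", "card_7812"), ("user_e", "card_7812"),
    ("device_x", "fakemail.io"), ("user_d", "fakemail.io"),
    ("user_z", "device_y"),  # an unrelated, clean pair
]
risk = cluster_risk(EDGES, confirmed_fraud={"user_a"})
print(round(risk["user_e"], 3))  # 0.111
print(risk["user_z"])            # 0.0
```

User E never touched User A's device, yet inherits risk through the shared card and device chain, which is the effect a per-transaction scorer cannot see.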
Precision/recall trade-off and human review
Every threshold setting is a point on the precision-recall curve. Lowering the deny threshold catches more fraud but blocks more legitimate customers. Raising it passes more legitimate customers but lets more fraud through. Neither extreme is right.
The optimal operating point depends on:
- Fraud loss per transaction (how expensive is a miss?)
- Customer lifetime value (how expensive is a false block?)
- Chargeback rate limits imposed by card networks. Visa's VAMP program flags merchants above a 1.5% combined fraud-plus-dispute ratio (TC40 + TC15 vs. settled CNP transactions; the 1.5% threshold applies to the US, Canada, EU, and APAC as of April 2026, while CEMEA stays at 2.2%). Mastercard's Excessive Chargeback Program triggers at a 1.5% chargeback-to-transaction ratio. Excessive rates lead to losing card acceptance, which is catastrophic.
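One way to turn the first two factors into a threshold is to minimize expected cost on labeled validation scores. This is a simplified sketch: the costs are flat per-transaction placeholders, the function names are illustrative, and the card-network rate limit is left out (it would act as an additional constraint on the candidates).

```python
def expected_cost(scored, threshold, cost_missed_fraud, cost_false_block):
    """scored: list of (score, is_fraud). A transaction is denied when score >= threshold."""
    cost = 0.0
    for score, is_fraud in scored:
        denied = score >= threshold
        if is_fraud and not denied:
            cost += cost_missed_fraud
        elif denied and not is_fraud:
            cost += cost_false_block
    return cost


def best_threshold(scored, candidates, cost_missed_fraud, cost_false_block):
    return min(
        candidates,
        key=lambda t: expected_cost(scored, t, cost_missed_fraud, cost_false_block),
    )


validation = [
    (0.95, True), (0.80, False), (0.70, True), (0.60, False),
    (0.40, True), (0.30, False), (0.10, False),
]
candidates = [0.2, 0.5, 0.65, 0.9]

# Missed fraud is expensive -> a low deny threshold wins.
print(best_threshold(validation, candidates, cost_missed_fraud=100, cost_false_block=10))  # 0.2
# False blocks are expensive -> a high deny threshold wins.
print(best_threshold(validation, candidates, cost_missed_fraud=10, cost_false_block=100))  # 0.9
```

The same scores yield opposite operating points once the cost ratio flips, which is why the threshold is a business input.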
The human review queue sits between auto-allow and auto-deny. Transactions with scores in a borderline range (e.g., 0.40–0.55) are sent to analysts. Analysts see the full feature set, linked transactions, and can flip the decision. Every manual decision becomes a high-quality training label.
stateDiagram-v2
[*] --> Scored
Scored --> AutoAllow: score < low_threshold
Scored --> AutoDeny: score > high_threshold
Scored --> PendingReview: low_threshold ≤ score ≤ high_threshold
PendingReview --> AnalystApprove: analyst reviews → approve
PendingReview --> AnalystDeny: analyst reviews → deny
AnalystApprove --> LabeledLegit: label stored for training
AnalystDeny --> LabeledFraud: label stored for training
AutoAllow --> [*]
AutoDeny --> [*]
LabeledLegit --> [*]
LabeledFraud --> [*]
Adversarial drift and champion/challenger deployment
Fraudsters are adversarial agents. They probe your system and adapt. A model that caught 95% of fraud three months ago may catch only 85% now because the attack patterns it learned no longer match what fraudsters are doing.
The deployment pattern that handles this is champion/challenger. The champion is the currently-live production model. A challenger is a freshly retrained version running in shadow mode: it scores every transaction and logs its decision, but the champion's decision is what actually executes.
flowchart LR
TXN3[Transaction] --> CHAMP[Champion model\nlive decision]
TXN3 --> CHAL3[Challenger model\nshadow scoring]
CHAMP --> LIVE[Execute decision]
CHAL3 --> LOG[Log score only]
LOG --> COMP[Compare AUC-PR\nover labeled transactions]
COMP -->|challenger wins| PROMO[Promote challenger\nzero-downtime swap]
COMP -->|champion holds| HOLD[Champion stays]
style CHAMP fill:#ff6b1a,color:#0a0a0f
style CHAL3 fill:#a855f7,color:#fff
style PROMO fill:#15803d,color:#fff
Compare champion and challenger AUC-PR over a week on transactions that have since received labels. When the challenger outperforms the champion by a statistically significant margin, promote it with a zero-downtime model swap.
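The comparison itself can be sketched with average precision, a common estimator of the area under the precision-recall curve. The scores below are toy values, ties are broken arbitrarily by sort order, and the sketch omits the significance test a real promotion gate would need.

```python
def average_precision(scores, labels):
    """Average precision over a ranked list; labels use 1 = fraud, 0 = legitimate."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / total_positives


labels = [1, 0, 1, 0, 0]
champion_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
challenger_scores = [0.9, 0.3, 0.8, 0.2, 0.1]

print(round(average_precision(champion_scores, labels), 3))  # 0.833
print(average_precision(challenger_scores, labels))          # 1.0
```

Both models must be evaluated on the same labeled transactions, otherwise the difference reflects the sample rather than the model.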
Track model output distributions continuously alongside this. If the distribution of risk scores shifts materially — say, far more predictions in the 0.3–0.7 gray zone than baseline — that's a signal the model is becoming uncertain, possibly because the feature distribution has shifted. Alert before the model silently degrades.
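A minimal version of that check compares the share of scores in the gray zone against a baseline window. The 0.3–0.7 band comes from the paragraph above; the alert ratio of 2× is a placeholder you would tune.

```python
def gray_zone_share(scores, low=0.3, high=0.7):
    """Fraction of scores that fall inside the gray zone [low, high]."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if low <= s <= high) / len(scores)


def drift_alert(baseline_scores, recent_scores, max_ratio=2.0):
    """True when the recent gray-zone share exceeds max_ratio times the baseline share."""
    baseline = gray_zone_share(baseline_scores)
    recent = gray_zone_share(recent_scores)
    if baseline == 0.0:
        return recent > 0.0
    return recent / baseline > max_ratio


baseline = [0.05] * 90 + [0.5] * 10   # 10% of scores in the gray zone
shifted = [0.05] * 70 + [0.5] * 30    # 30% in the gray zone
steady = [0.05] * 85 + [0.5] * 15     # 15% in the gray zone

print(drift_alert(baseline, shifted))  # True
print(drift_alert(baseline, steady))   # False
```

Because it needs no labels, this check can fire within hours, long before chargebacks reveal a problem.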
Storage choices
| Data | Store | Why |
|---|---|---|
| Velocity counters (real-time) | Redis | Sub-millisecond reads, TTL support, HyperLogLog for cardinality |
| Historical feature profiles | Redis (hot) + S3/Parquet (cold) | Hot: inference-time read; cold: training data generation |
| Raw transaction events | Kafka (7-day retention) + S3 (archive) | Durable event source for reprocessing |
| Labeled training data | S3 Parquet + Delta Lake | Cheap, columnar, versioned for point-in-time correct training |
| Model artifacts | MLflow / S3 | Versioned, reproducible, metadata tracked |
| Decisions / audit log | Append-only store (Postgres / S3) | Regulatory requirement: every decision must be explainable |
| Review queue | Postgres + task queue (e.g., Celery) | Low volume, rich queries, analyst workflow |
| Graph data | Graph DB (Neptune) or Spark adjacency lists | Connected components, not relational joins |
Failure modes
| Failure | Impact | Mitigation |
|---|---|---|
| Feature pipeline lag (Kafka / Flink falls behind) | Velocity features are stale — model makes decisions on outdated counts | Freshness check per feature; fall back to conservative rules if feature_age > threshold |
| Redis cluster failure | All feature lookups fail — cannot enrich events | Rules-only fallback mode (pre-loaded in scoring service memory); alert on-call |
| Model server crash | No ML scoring | Fall back to rules-only; do not block payments; page ML on-call |
| False-positive storm | Model triggers on a new legitimate pattern (e.g., first day of a major holiday promotion) | Real-time dashboard of deny and challenge rates (a proxy, because confirmed labels lag); circuit breaker that relaxes thresholds if the estimated false-positive rate spikes beyond 3× baseline |
| Model staleness / drift | Fraud rates spike undetected | Weekly automated retraining; AUC-PR monitoring; chargeback rate alert |
| Fraud ring evasion | Ring rotates devices/IPs faster than graph jobs run | Decrease graph refresh interval; use streaming graph updates (Flink + graph DB CDC) |
| Cold start on new user/card | No velocity or profile features | Default features (population averages for new entities); higher challenge rate for first transactions |
| Label pipeline failure | No new labels reach training | Training halts; champion model ages; alert and skip the training run rather than train on partial data |
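Several of these mitigations reduce to the same control flow: try the full path, and degrade to rules-only when a dependency fails or features are stale. The sketch below injects the dependencies as plain callables. All names are hypothetical, and the choices of `allow` as the fallback decision and 5 seconds as the freshness limit are assumptions, not prescriptions.

```python
def score_with_fallback(event, fetch_features, run_rules, predict,
                        max_feature_age_s=5.0, fallback_decision="allow"):
    """Return (decision, path). Dependency failures degrade to rules-only.

    fetch_features(event) -> (features, age_seconds)
    run_rules(event, features) -> "deny" / "challenge" / None
    predict(features) -> risk score in [0, 1]
    """
    try:
        features, age_s = fetch_features(event)
        fresh = age_s <= max_feature_age_s
    except Exception:  # feature store unreachable
        features, fresh = {}, False

    rule_decision = run_rules(event, features)
    if rule_decision is not None:
        return rule_decision, "rules"
    if not fresh:
        return fallback_decision, "rules_only"  # skip the model on stale or missing features
    try:
        score = predict(features)
    except Exception:  # model server down
        return fallback_decision, "rules_only"
    if score >= 0.85:
        return "deny", "model"
    if score >= 0.40:
        return "challenge", "model"
    return "allow", "model"
```

Returning the path taken alongside the decision makes degraded-mode traffic easy to count and alert on.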
Things to discuss in an interview
- Why inline, not async? Post-hoc fraud detection can reverse transactions after the fact, but chargebacks are expensive and card network rules penalize high chargeback rates — you need to block at authorization.
- Rules vs. ML trade-off: rules for known-bad patterns and regulatory clarity; ML for the gray zone where no explicit rule exists. Both are necessary; neither alone is sufficient.
- The feature store architecture: real-time (Kafka/Flink → Redis) for velocity; offline (batch Spark → S3 → Redis hot cache) for historical profiles. Why are these separate? Latency and cost.
- The label-delay problem and what it does to your training loop: this is the answer that separates senior candidates. The model you deploy today is trained on fraud patterns from 90 days ago.
- Precision vs. recall and who sets the threshold: the ML team sets the curve; the business decides which point on that curve to operate at. Engineers should not be setting fraud policy.
- Champion/challenger: how do you A/B test a new model without risking production fraud rates? Shadow scoring: the challenger scores live traffic, but only its score is logged and the champion's decision executes.
- Graph detection: why can't you do this inline? What's the right refresh cadence?
Things you should now be able to answer
- What is a velocity feature and why is it the most important feature class in fraud?
- Why does fraud detection need a dedicated feature store rather than computing features at query time?
- What is the label-delay problem and how do you mitigate it?
- Why is accuracy the wrong metric for fraud models?
- What is champion/challenger deployment and why does it matter for adversarial systems?
- What happens to your scoring system when the ML model is unavailable?
- How do you detect a coordinated fraud ring with a per-transaction scorer?
Further reading
- "Practical Lessons from Predicting Clicks on Ads at Facebook" (He et al., 2014) — influential paper showing GBTs as feature transformers feeding logistic regression; widely cited for demonstrating GBT superiority on large-scale tabular data at Facebook
- "Graph Neural Networks for Fraud Detection in E-Commerce" — Alibaba AntGroup (2019) — GNN-based approach to fraud ring detection
- Stripe's engineering blog on ML platform and feature stores — stripe.com/blog
- "The Imbalanced Dataset Problem" — scikit-learn documentation on class_weight and sampling strategies
- Visa / Mastercard chargeback dispute resolution timelines — publicly documented in their merchant operating guides
- Related: Design a Payment System — the correct-money-movement layer that this system gates
- Related: Design a Recommendation System — shares feature store and ML serving infrastructure patterns
Frequently asked questions
▸Why must fraud scoring happen inline before payment authorization, not asynchronously after the fact?
Post-hoc detection can reverse transactions, but chargebacks are expensive and card networks penalize merchants that exceed chargeback thresholds — Visa's VAMP program flags merchants above a 1.5% combined fraud-plus-dispute ratio, and Mastercard's Excessive Chargeback Program triggers at 1.5%. Losing card acceptance entirely is catastrophic, so the decision must block authorization rather than unwind it.
▸What is the label-delay problem in fraud detection, and how do you mitigate it?
A transaction that occurs today may not receive a confirmed fraud label until a chargeback is resolved 30 to 90 days later, meaning the model deployed this month is trained on fraud patterns from three months ago. The standard mitigation is to use early dispute signals available at T+1 to T+7 as surrogate labels for weekly retraining, while reserving the confirmed T+90 labels for a more accurate monthly retrain, and optionally ensembling both models.
▸Why are gradient-boosted trees the primary model choice for fraud scoring rather than neural networks?
GBTs handle mixed feature types and missing values natively, produce interpretable outputs required for regulatory audit, and are fast at inference — under 1 ms per row with an optimized runtime like Treelite or ONNX, and roughly 0.7 to 2 ms with native XGBoost predict(). Neural networks offer marginal accuracy gains on structured tabular data and impose a high interpretability cost that is difficult to justify to regulators.
▸Why is accuracy the wrong evaluation metric for a fraud model, and what should you use instead?
Fraud rates are typically 0.1% to 1% of transactions, so a model that predicts every transaction as legitimate scores 99% accuracy or better while catching zero fraud. The correct metrics are precision, recall, and AUC-PR (the area under the precision-recall curve), because they measure performance on the rare positive class rather than the overwhelming majority.
▸How does graph-based fraud ring detection work, and why can it not run on the inline scoring path?
A batch or micro-batch job builds an entity graph linking users, cards, devices, IPs, and email addresses, then runs connected-component analysis to find clusters containing confirmed-fraud nodes and propagates elevated risk scores to the whole cluster. Graph queries are too expensive to run in the 100 ms authorization window, so the job runs on a schedule every few hours and surfaces cluster-level risk scores as precomputed features in Redis, where the inline pipeline reads them as a single O(1) lookup.