Design a Feature Store
Serve the exact same feature values to model training and online inference — eliminating training-serving skew — across batch, streaming, and on-demand tiers at sub-10ms latency and millions of reads per second. The architecture powering Uber Michelangelo, Airbnb Chronon, and DoorDash Gigascale.
The problem
In 2017, Uber's ML team announced Michelangelo, their internal ML platform. The biggest source of model degradation wasn't bad model architecture or insufficient training data. It was features. Teams wrote one version of a feature computation in Python for training, and a subtly different version in Java for the serving microservice. Window sizes differed by an hour. Null handling diverged. Timezone conventions conflicted. The model learned one distribution; production served another.
This is training-serving skew, and it is the most common silent killer in ML systems. No exception is thrown. The model still returns a score. But the predictions are off in ways that take weeks to trace to root cause — weeks in which a fraud model lets through more transactions, a ranking model shows users less relevant content, or a pricing model leaves money on the table.
A feature store is the infrastructure that makes this failure mode impossible by design. It stores the full history of computed feature values, serves them to prediction services at low latency, and — most critically — generates them using a single shared definition that is used identically in training and serving.
The concept showed up at scale first in Big Tech: Uber Michelangelo's Palette (2017), Airbnb's Zipline / Chronon (2020-2022), DoorDash's Gigascale Feature Store (2020), LinkedIn's Feathr (2022). Open-source projects like Feast and commercial platforms like Tecton followed. Every serious ML platform above a certain scale either builds one or buys one.
This article is about the full system: offline store architecture, online store optimization, materialization, point-in-time joins, the three feature tiers, and the monitoring you need to detect when it's failing. For the recommendation use case that motivates most feature store deployments, see Design a Recommendation System. For fraud detection — the tightest latency constraint use case — see Design a Fraud Detection System. For the ML interview framing, see ML System Design Interview Framework.
Functional requirements
- Compute and store feature values for any entity (user, item, merchant, driver) from batch, streaming, and request-time sources.
- Serve the latest feature values for a set of entity IDs to a prediction service in under 10ms at the 99th percentile.
- Generate point-in-time correct training datasets: for each labeled example at time T, return the feature value that was available at time T and no later.
- Provide a feature registry: define features as code, track ownership, schema, staleness SLA, and which models consume each feature.
- Support backfills: given a new feature definition, compute its historical values so the feature can be used to retrain models on historical data.
Non-functional requirements
- Online serving: p99 latency < 10ms; p50 < 2ms. Availability 99.99%+.
- Throughput: support 20M+ reads/second for high-traffic use cases (restaurant ranking, real-time fraud).
- Freshness SLA per tier: batch features stale < 1 hour; streaming features stale < 60 seconds; on-demand features have zero staleness by definition.
- Training-serving consistency: zero tolerance for silent divergence between training and serving transformation logic.
- Horizontal scalability: online store read throughput and offline store compute must both scale independently.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Active entities (users) | 100 M | Given — a large consumer platform |
| Features per entity | 200 | Typical for a mature recommendation or fraud model |
| Bytes per feature value | 100 bytes | Mix of floats (8 bytes) and categorical encodings; average ~100 bytes with overhead |
| Online store size | 2 TB | 100M × 200 × 100 bytes = 2 TB; sharded Redis cluster, ~20 nodes at 100 GB usable each |
| Online read throughput | 2 M reads/sec (steady), 20 M/sec (peak) | 200 features × 10K predictions/sec steady → 2M features/sec; 100K predictions/sec peak → 20M/sec; pipelined Redis fetches collapse this to ~1 round-trip per prediction |
| Streaming feature lag | < 30 seconds | Flink processing time + Kafka consumer lag; critical for fraud, moderate for ranking |
| Batch pipeline runtime | 2–4 hours/day | 10 B events/day on a 200-node Spark cluster, reading from Hive partitioned by date |
| Training dataset gen | 90 minutes | 50 M labeled examples × 200 features AS-OF join on a 50-node Spark cluster |
| Offline store size | ~1.3 PB | 3 years × 365 days × 100 M entities × 200 features × 100 bytes = ~2.2 PB raw; compressed ~5:1 in Parquet ≈ 440 TB compressed; with 3× cross-region replication ~1.3 PB total |
Takeaway: The online store is small enough (~2 TB) to live entirely in Redis. The offline store is orders of magnitude larger (~1.3 PB with replication) and lives on object storage. The hard engineering is keeping them synchronized without diverging.
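The arithmetic behind the table can be reproduced in a few lines. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and one full snapshot of every feature per day for the offline store, and the variable names are illustrative rather than part of any feature store API.

```python
# Back-of-the-envelope sizing using the figures from the capacity table.
TB = 10**12
PB = 10**15

ENTITIES = 100_000_000          # active users
FEATURES_PER_ENTITY = 200
BYTES_PER_VALUE = 100           # average, including overhead

# Online store: latest value only.
online_bytes = ENTITIES * FEATURES_PER_ENTITY * BYTES_PER_VALUE
redis_nodes = online_bytes / (100 * 10**9)      # 100 GB usable per node

# Read throughput, counted in individual feature values.
steady_reads = FEATURES_PER_ENTITY * 10_000     # 10K predictions/sec
peak_reads = FEATURES_PER_ENTITY * 100_000      # 100K predictions/sec

# Offline store: assumed one daily snapshot for 3 years.
offline_raw = 3 * 365 * online_bytes
offline_compressed = offline_raw / 5            # ~5:1 Parquet compression
offline_replicated = offline_compressed * 3     # 3x cross-region replication

print(f"online store:       {online_bytes / TB:.1f} TB on {redis_nodes:.0f} nodes")
print(f"reads/sec:          {steady_reads:,} steady, {peak_reads:,} peak")
print(f"offline raw:        {offline_raw / PB:.2f} PB")
print(f"offline compressed: {offline_compressed / TB:.0f} TB")
print(f"offline replicated: {offline_replicated / PB:.2f} PB")
```

The unrounded figures (2.19 PB raw, 438 TB compressed, 1.31 PB replicated) match the table's rounded values.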
Building up to the design
V1: Feature code in the model service. The simplest approach is to compute features inline in the prediction service. When a request arrives, the service queries the database, runs the aggregation, and feeds the result to the model. This works for a single team and a handful of features. It fails the moment two teams need the same feature — they each write their own version, and they diverge. It also fails at scale: a prediction service under load cannot afford to run a 30-day rolling COUNT(DISTINCT) query against a Postgres instance on every request.
V2: Precomputed feature tables in the data warehouse. Teams write Spark jobs that precompute features nightly and write them to Hive tables. The prediction service reads from those tables at startup and caches values in memory. This is better: the computation is shared, and the prediction service doesn't hit a live database on every request. But three problems remain. First, features are hours stale by the time the batch job finishes and the cache refreshes. Second, the training code still re-implements the same aggregation in a different Spark job or SQL query — skew lives on. Third, there is no registry: no one knows which model uses which feature, or what the expected freshness contract is.
flowchart LR
DW[("Data Warehouse")] --> SPARK_BATCH["Nightly Spark Job"]
SPARK_BATCH --> HIVE[("Hive Feature Table")]
HIVE --> SVC["Prediction Service<br/>in-memory cache"]
TRAIN_JOB["Separate Training Job"] --> HIVE2[("Another Hive Table")]
HIVE2 --> MODEL["Trained Model"]
SVC --> MODEL
style SPARK_BATCH fill:#0e7490,color:#fff
style HIVE fill:#15803d,color:#fff
style HIVE2 fill:#15803d,color:#fff
style SVC fill:#ff6b1a,color:#0a0a0f
V2 still has two separate pipelines — the batch pipeline feeding the prediction service and the training job feeding a separate Hive table — each re-implementing the same logic. Any difference between them is silent skew.
V3: Shared feature definition with offline + online stores. The key insight is to invert the architecture. Instead of letting teams write feature code wherever they need it, mandate that all feature logic lives in one place — the feature definition layer — and have the framework generate both the training pipeline and the serving-time code from that single definition. The offline store (Hive, BigQuery, Parquet on S3) stores the full history for training. A separate online store (Redis, Cassandra) stores only the latest value per entity for sub-10ms serving. A materialization engine keeps the two in sync.
This is the architecture that Feast, Tecton, Airbnb Chronon, and Uber Palette all converge on, with different engineering tradeoffs in each component.
API
The feature store exposes two primary APIs: an offline API for training dataset generation and an online API for prediction serving.
Online serving (gRPC/HTTP)
# Request: fetch features for a set of entities
POST /feature-service/get-online-features
{
"feature_service": "restaurant_ranking_v3",
"entities": {
"restaurant_id": ["r_123", "r_456", "r_789"],
"user_id": ["u_999"]
}
}
# Response: feature matrix, one row per entity combination
{
"results": [
{
"entities": {"restaurant_id": "r_123", "user_id": "u_999"},
"features": {
"restaurant__order_count_7d": 142,
"restaurant__avg_prep_time_30d": 18.3,
"user__order_count_30d": 7,
"distance_km": 1.4 // on-demand: computed from request payload
},
"statuses": {
"restaurant__order_count_7d": "PRESENT",
"distance_km": "PRESENT"
}
}
],
"metadata": {
"feature_service_version": "3.1.2",
"latency_ms": 4.2
}
}
Offline training dataset generation (Python SDK)
# Point-in-time correct training dataset
training_df = store.get_historical_features(
    entity_df=labeled_events,  # columns: user_id, restaurant_id, event_timestamp, label
    features=[
        "restaurant_features:order_count_7d",
        "restaurant_features:avg_rating_90d",
        "user_features:order_count_30d",
        "user_features:cuisine_affinity_italian",
    ],
).to_df()
# Returns: entity_df joined with feature values as of each row's event_timestamp
# No future values: guaranteed by AS-OF join semantics
Feature definition (Python SDK / YAML — Tecton-style decorator)
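# Tecton-style decorator definition
# (Feast uses FeatureView objects; see the registry section below)
from datetime import timedelta

from pyspark.sql import DataFrame
from pyspark.sql.functions import avg, count
from tecton import batch_feature_view

# restaurant_orders_source and restaurant are assumed to be defined elsewhere.
# Parameter names follow this article's example and may differ across SDK versions.
@batch_feature_view(
    sources=[restaurant_orders_source],
    entities=[restaurant],
    mode="pyspark",
    batch_schedule=timedelta(hours=1),
    ttl=timedelta(days=2),
    owner="ml-platform@company.com",
    tags={"domain": "restaurant", "sensitivity": "internal"},
)
def restaurant_features(df: DataFrame) -> DataFrame:
    # Nothing here filters by time: the 7d and 30d windows are assumed to be
    # applied to the source data before this function receives it.
    return df.groupby("restaurant_id").agg(
        count("order_id").alias("order_count_7d"),
        avg("prep_time_minutes").alias("avg_prep_time_30d"),
    )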
The schema
The offline store uses a partitioned Parquet schema optimized for time-range scans:
-- Offline feature table (Hive / Iceberg)
CREATE TABLE restaurant_features (
restaurant_id STRING NOT NULL,
event_timestamp TIMESTAMP NOT NULL, -- when the feature value was valid
created_timestamp TIMESTAMP, -- when this row was written (for dedup)
order_count_7d BIGINT,
avg_prep_time_30d DOUBLE,
avg_rating_90d DOUBLE
)
PARTITIONED BY (date DATE) -- partition by day for efficient pruning
STORED AS PARQUET
TBLPROPERTIES ('history.expire.max-snapshot-age-ms' = '2592000000'); -- 30 days in ms
The online store uses Redis Hashes keyed by entity ID. DoorDash found that storing all features for an entity in a single Hash (read with HGET/HMGET) rather than as separate string keys reduced memory by 60% and CPU by 65%. Small Hashes use a compact listpack encoding (called ziplist before Redis 7.0) as long as the field count and value sizes stay under the configurable hash-max-listpack-entries and hash-max-listpack-value thresholds (512 fields and 64 bytes by default).
# Redis key structure
KEY: fstore:{entity_type}:{entity_id}
TYPE: Hash
FIELD: {feature_name_hash} (xxHash32 of feature name, 4 bytes vs. 40 bytes)
VALUE: {serialized value} (Protocol Buffers or MessagePack)
# Example
HGETALL fstore:restaurant:r_123
→ {
"a3f2c1d0": <protobuf: order_count_7d=142>,
"b7e9a2f1": <protobuf: avg_prep_time_30d=18.3>,
}
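A minimal sketch of how the key and field names above could be built. It uses `zlib.crc32` from the standard library as a stand-in for xxHash32 (which needs a third-party package), and the helper names `redis_key`, `field_id`, and `build_field_map` are hypothetical. Because a 4-byte hash can collide, the sketch checks for collisions when the field map is built rather than at read time.

```python
import zlib


def redis_key(entity_type: str, entity_id: str) -> str:
    """Key layout from the schema above: fstore:{entity_type}:{entity_id}."""
    return f"fstore:{entity_type}:{entity_id}"


def field_id(feature_name: str) -> bytes:
    """4-byte field identifier. CRC32 is a stand-in for xxHash32 here."""
    return zlib.crc32(feature_name.encode("utf-8")).to_bytes(4, "big")


def build_field_map(feature_names):
    """Map field id -> feature name, failing loudly on a hash collision."""
    mapping = {}
    for name in feature_names:
        fid = field_id(name)
        if fid in mapping and mapping[fid] != name:
            raise ValueError(f"field id collision: {name!r} vs {mapping[fid]!r}")
        mapping[fid] = name
    return mapping


features = ["order_count_7d", "avg_prep_time_30d", "avg_rating_90d"]
field_map = build_field_map(features)
key = redis_key("restaurant", "r_123")

print(key)
for fid, name in field_map.items():
    print(fid.hex(), "->", name)
```

Running the collision check in CI whenever a feature is registered would keep a clash from ever reaching the online store.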
Architecture
The full production architecture has five planes: source systems, the offline store, the online store, the feature serving layer, and the registry.
flowchart TD
subgraph SOURCES["Source systems"]
DWH[("Data warehouse<br/>Hive / BigQuery")]
KAFKA["Kafka topics<br/>clickstream and orders"]
REQPAY["Request payload"]
end
subgraph PIPELINES["Feature pipelines"]
SPARK["Batch Spark<br/>hourly / daily"]
FLINK["Flink streaming<br/>seconds latency"]
ODFN["On-demand fn<br/>Python / Java"]
end
subgraph STORES["Stores"]
ICEBERG[("Offline store<br/>Iceberg / Parquet / S3")]
REDIS[("Online store<br/>Redis cluster")]
end
subgraph SERVING["Feature server"]
FSERVER["Feature serving<br/>gRPC and HTTP"]
CACHE["Client-side<br/>cache TTL 5s"]
end
subgraph REG["Registry"]
REGISTRY[("Metadata store<br/>Postgres / DynamoDB")]
CATALOG["Feature catalog<br/>UI and API"]
end
subgraph TRAIN["Training path"]
PITJOIN["Point-in-time<br/>join engine"]
TRAINDS["Training dataset<br/>Parquet"]
end
DWH --> SPARK
KAFKA --> FLINK
SPARK --> ICEBERG
SPARK --> REDIS
FLINK --> REDIS
FLINK --> ICEBERG
ICEBERG --> PITJOIN
PITJOIN --> TRAINDS
REDIS --> FSERVER
REQPAY --> ODFN
ODFN --> FSERVER
FSERVER --> CACHE
CACHE --> PRED["Prediction<br/>service"]
REGISTRY --> SPARK
REGISTRY --> FLINK
REGISTRY --> ODFN
REGISTRY --> CATALOG
style REDIS fill:#ff2e88,color:#fff
style ICEBERG fill:#15803d,color:#fff
style FLINK fill:#0e7490,color:#fff
style SPARK fill:#0e7490,color:#fff
style FSERVER fill:#ff6b1a,color:#0a0a0f
style REGISTRY fill:#a855f7,color:#fff
style PITJOIN fill:#ffaa00,color:#0a0a0f
Online lookup sequence diagram — the hot path (target: <10ms total)
sequenceDiagram
participant PS as Prediction Service
participant FS as Feature Server
participant CC as Client Cache
participant RD as Redis Cluster
participant OD as On-Demand Engine
PS->>FS: GetFeatures entity_ids and feature_service name
FS->>CC: Check client-side cache
CC-->>FS: Partial hit — warm batch features returned (TTL 5s); streaming features missing
FS->>RD: HMGET pipeline remaining entity_ids and features batched
RD-->>FS: Feature vectors 2-4ms round-trip
FS->>OD: Compute on-demand features from request payload
OD-->>FS: On-demand features 1ms
FS-->>PS: Assembled feature vector 4-8ms total
PS->>PS: Run ML model inference
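The Redis lookup is pipelined: the feature server sends every HMGET for the request without waiting for individual replies, so all entity/feature combinations are fetched in a single round-trip. Pipelining is a client-side technique rather than a Redis command, and when the entities span several nodes of a Redis Cluster it becomes one pipelined round-trip per node, issued in parallel. For a restaurant ranking call with 23 features across 50 candidate restaurants, that is 1,150 field values returned in roughly one round-trip, typically in 2–4ms including network.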
The three feature tiers
Batch features come from a scheduled Spark job that reads from the data warehouse, computes windowed aggregations (sum, count, average, quantile over 7d / 30d / 90d windows), and writes results to both the offline store (for training) and the online store (for serving). The job runs hourly or daily. Freshness: up to one schedule interval stale, so meeting the 1-hour batch SLA requires an hourly job. Cost: cheap, because the job only consumes cluster resources while it runs. Examples: user 30-day order count, restaurant average rating, merchant historical fraud rate.
Streaming features come from an always-on Flink job consuming a Kafka topic. The job maintains windowed state and emits updated feature values within seconds of each event, writing to both the Redis online store and an Iceberg/Parquet table in the offline store (so training data can use the streaming feature with training parity). Freshness: 5–60 seconds stale depending on Kafka lag and Flink checkpoint interval. Cost: 10–100x more expensive than batch per feature — Flink jobs run continuously, consume cluster resources 24/7, and require exactly-once semantics with careful checkpoint management. Use streaming features only when seconds-level freshness genuinely changes model outcomes. Examples: transactions-in-last-60-seconds (fraud), user's last-N-click sequence (recommendation), current order backlog for a restaurant (ETA prediction).
On-demand features are computed inside the feature server at request time, using data from the live request payload combined with pre-fetched batch/streaming features. They cannot be precomputed because they depend on request-context data. Examples: distance between user's current GPS coordinates (in the request) and the restaurant's stored location (from Redis); ratio of the current cart value (request) to the user's historical average order size (Redis); whether the current time is within the restaurant's historical peak hours (request timestamp + batch feature). On-demand features are defined as Python functions and executed identically at training time (where they receive the historical request context) and serving time (where they receive the live context), preserving parity.
The choice of tier is an explicit engineering decision, not an accident. Uber Michelangelo documented a useful rule of thumb: ask how much model quality degrades per hour of staleness for each feature, then pick the cheapest tier that keeps degradation below your threshold.
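The rule of thumb can be sketched as a small selection function. Everything numeric here is an assumption for illustration: degradation is modeled as linear in staleness, and the relative costs are placeholders, not measured figures. On-demand features are left out because they are chosen for their dependence on request data, not for freshness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    staleness_hours: float   # worst-case staleness the tier delivers
    relative_cost: float     # illustrative cost relative to a daily batch job


# Hypothetical tiers; the staleness values follow the freshness SLAs in this article.
TIERS = [
    Tier("daily_batch", staleness_hours=24.0, relative_cost=1.0),
    Tier("hourly_batch", staleness_hours=1.0, relative_cost=3.0),
    Tier("streaming", staleness_hours=1 / 60, relative_cost=30.0),
]


def pick_tier(degradation_per_hour: float, max_degradation: float) -> Tier:
    """Return the cheapest tier whose worst-case staleness keeps the
    (assumed linear) quality loss at or below max_degradation."""
    for tier in sorted(TIERS, key=lambda t: t.relative_cost):
        if degradation_per_hour * tier.staleness_hours <= max_degradation:
            return tier
    raise ValueError("no precomputed tier meets the threshold")


slow_moving = pick_tier(degradation_per_hour=0.0001, max_degradation=0.01)
moderate = pick_tier(degradation_per_hour=0.005, max_degradation=0.01)
fast_moving = pick_tier(degradation_per_hour=0.5, max_degradation=0.01)
print(slow_moving.name, moderate.name, fast_moving.name)
```

In practice the degradation-per-hour number would come from an offline experiment that replays the model against deliberately stale feature values.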
The offline store and point-in-time joins
The offline store is a columnar lakehouse — Hive, BigQuery, Snowflake, or Parquet files on S3 organized as Apache Iceberg or Delta Lake tables. It stores the full history of every feature value with two timestamps: event_timestamp (when the underlying event that generated this value occurred) and created_timestamp (when the value was written to the store).
Training dataset generation runs an AS-OF join (also called a point-in-time join) against this history:
For each row in entity_df (entity_id, event_timestamp, label):
Find the row in feature_table WHERE:
feature_table.entity_id = entity_df.entity_id
AND feature_table.event_timestamp <= entity_df.event_timestamp
AND feature_table.event_timestamp = MAX(such timestamp)
In SQL terms this is an ASOF JOIN in ClickHouse or a range-join pattern in Spark (illustrative pseudocode — production Spark uses Feast's get_historical_features() or a native range-join):
-- Illustrative two-step range-join pattern used in Spark
-- Step 1: for each (user_id, label_timestamp) pair, find the most recent feature row
SELECT
e.user_id,
e.event_timestamp AS label_ts,
e.label,
f.order_count_7d AS order_count_7d_at_event_time
FROM labeled_events e
JOIN (
SELECT f1.user_id,
e2.event_timestamp AS label_ts, -- carry the label timestamp through
f1.order_count_7d,
ROW_NUMBER() OVER (
PARTITION BY f1.user_id, e2.event_timestamp
ORDER BY f1.event_timestamp DESC -- pick the most recent feature row
) AS rn
FROM feature_history f1
JOIN labeled_events e2
ON f1.user_id = e2.user_id
AND f1.event_timestamp <= e2.event_timestamp -- only rows available at label time
) f ON e.user_id = f.user_id
AND e.event_timestamp = f.label_ts -- match back to each labeled event
AND f.rn = 1 -- keep only the closest prior feature row
-- ClickHouse: ASOF JOIN handles this natively in a single pass
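The same AS-OF semantics can be shown on a handful of rows in plain Python. This is a minimal in-memory sketch rather than how any particular feature store implements the join: timestamps are plain integers, and the function name `point_in_time_join` is hypothetical. Unlike the inner join in the SQL above, it keeps labeled events that have no earlier feature row and returns `None` for them.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(labeled_events, feature_history):
    """Attach to each labeled event the latest feature value whose
    event_timestamp is <= the label's event_timestamp (None if there is none).

    labeled_events:  iterable of (entity_id, event_timestamp, label)
    feature_history: iterable of (entity_id, event_timestamp, value)
    """
    # Per-entity timestamps and values, kept in timestamp order.
    timestamps = defaultdict(list)
    values = defaultdict(list)
    for entity_id, ts, value in sorted(feature_history, key=lambda r: (r[0], r[1])):
        timestamps[entity_id].append(ts)
        values[entity_id].append(value)

    joined = []
    for entity_id, label_ts, label in labeled_events:
        ts_list = timestamps.get(entity_id, [])
        # Everything left of idx has a timestamp <= label_ts.
        idx = bisect_right(ts_list, label_ts)
        value = values[entity_id][idx - 1] if idx > 0 else None
        joined.append((entity_id, label_ts, label, value))
    return joined


feature_history = [("u1", 100, 3), ("u1", 200, 5), ("u1", 300, 9), ("u2", 150, 1)]
labeled_events = [("u1", 250, 1), ("u1", 100, 0), ("u1", 50, 0), ("u2", 400, 1)]

rows = point_in_time_join(labeled_events, feature_history)
for row in rows:
    print(row)
```

The label at time 250 for `u1` receives the value written at 200 (5), never the later value written at 300 (9); using the later value is exactly the leakage the join exists to prevent.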
Airbnb Chronon implements this more efficiently using tile-based pre-aggregation: instead of storing raw event rows, it pre-computes partial aggregates at geometrically increasing granularities (1-minute tiles, 5-minute tiles, 1-hour tiles, 1-day tiles). A 90-day rolling sum requires reading only ~200 tiles rather than 90 days of raw rows. This makes backfills for large temporal windows tractable.
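To make the tile idea concrete, here is a sketch that covers a 90-day window with aligned tiles, always taking the largest tile that fits. It illustrates the general technique under simple assumptions (time in whole minutes, the four tile sizes named above) and is not Chronon's actual algorithm; `cover_with_tiles` is a hypothetical helper.

```python
TILE_SIZES_MIN = (1440, 60, 5, 1)   # 1 day, 1 hour, 5 minutes, 1 minute


def cover_with_tiles(start_min: int, end_min: int):
    """Cover [start_min, end_min) with aligned tiles, preferring large ones.

    A tile of size s may only start at a multiple of s, which is what lets
    the same precomputed tile be reused by many overlapping windows.
    """
    tiles = []
    cursor = start_min
    while cursor < end_min:
        for size in TILE_SIZES_MIN:
            if cursor % size == 0 and cursor + size <= end_min:
                tiles.append((cursor, size))
                cursor += size
                break   # the 1-minute tile always fits, so the loop advances
    return tiles


# A 90-day window ending at day 100, 13:37 (minutes since an arbitrary epoch).
end_min = 100 * 1440 + 13 * 60 + 37
start_min = end_min - 90 * 1440

tiles = cover_with_tiles(start_min, end_min)
raw_minutes = end_min - start_min
print(f"{len(tiles)} tiles instead of {raw_minutes} one-minute buckets")
```

For this window the cover needs 128 tiles (89 daily, 23 hourly, 11 five-minute, 5 one-minute), in line with the "~200 tiles" order of magnitude quoted above.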
Netflix found that replaying from Apache Iceberg snapshots costs ~$2M/year versus ~$93M/year if the approach relied on 30-day Kafka retention for historical replay. The offline store format is a first-order cost decision.
Feature registry and definitions as code
Every feature store worth operating has a registry: a central catalog of what features exist, who owns them, what they compute, what their freshness SLA is, and which models consume them. Without a registry, every team rediscovers and re-implements the same signals, and nobody knows what will break if a source table schema changes.
The registry is implemented as code in modern feature stores. Feast declares FeatureView objects in Python; Tecton uses Python decorators; LinkedIn Feathr uses YAML configs. The definition is checked into git, goes through code review, and runs through CI/CD pipelines that validate schema, test transformation logic, and deploy the updated pipeline.
# Feast-style feature view definition
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float64, Int64

restaurant = Entity(name="restaurant", join_keys=["restaurant_id"])

# restaurant_orders_batch_source is assumed to be defined elsewhere
restaurant_features = FeatureView(
    name="restaurant_features",
    entities=[restaurant],
    ttl=timedelta(days=2),
    schema=[
        Field(name="order_count_7d", dtype=Int64),
        Field(name="avg_prep_time_30d", dtype=Float64),
        Field(name="avg_rating_90d", dtype=Float64),
    ],
    source=restaurant_orders_batch_source,
    tags={
        "owner": "ml-restaurant-team@company.com",
        "freshness_sla": "1h",
        "consumers": "restaurant_ranking_v3,eta_prediction_v2",
    },
)
The registry entry also tracks model consumers. When a source table is scheduled for deprecation, the platform can enumerate which models depend on which features and alert their owners before the change lands.
Uber's Palette meta-store refactor (2023) demonstrates the value at scale: by centralizing feature metadata — previously scattered across 20,000+ features with no common catalog — they cut deployment time by 95% and reduced Cassandra migration time by 90%.
Feature freshness and TTL
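Every feature in the online store has a TTL. A feature whose source pipeline has failed should not serve a value that is 72 hours old to a fraud model that expects 60-second freshness. A Redis TTL enforces a hard expiry: once it elapses, the key is deleted and the feature server returns MISSING (not zero and not null, but an explicit missing signal). Note that EXPIRE applies to a whole key, so with one Hash per entity it expires all of that entity's features together; per-feature freshness has to come from a timestamp stored alongside each value, or from hash-field expiration (HEXPIRE, available from Redis 7.4).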
The feature server must communicate this status to the prediction service so the model can handle it correctly. Most production systems have three statuses: PRESENT, MISSING (the key expired or was never written), and OUTSIDE_TTL (the key exists but its event_timestamp is older than the feature's declared freshness SLA). The model contract must specify how to handle each case — typically by substituting a fallback value defined in the feature registry, or by routing the prediction to a degraded model that does not use that feature.
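The three statuses can be sketched as a small resolution function. This is an illustration under assumptions: the stored record is taken to be a `(value, event_timestamp)` pair with timestamps in epoch seconds, and `resolve_feature` along with its parameters are hypothetical names, not part of any library.

```python
from enum import Enum


class FeatureStatus(Enum):
    PRESENT = "PRESENT"
    MISSING = "MISSING"
    OUTSIDE_TTL = "OUTSIDE_TTL"


def resolve_feature(stored, now_s, freshness_sla_s, fallback):
    """Return (value, status) for one feature.

    stored is None when the key expired or was never written; otherwise it
    is a (value, event_timestamp_s) pair read from the online store.
    """
    if stored is None:
        return fallback, FeatureStatus.MISSING
    value, event_ts = stored
    if now_s - event_ts > freshness_sla_s:
        # The value exists but is older than the declared freshness SLA.
        return fallback, FeatureStatus.OUTSIDE_TTL
    return value, FeatureStatus.PRESENT


now = 1_000_000
fresh = resolve_feature((142, now - 20), now, freshness_sla_s=60, fallback=None)
stale = resolve_feature((142, now - 3600), now, freshness_sla_s=60, fallback=None)
absent = resolve_feature(None, now, freshness_sla_s=60, fallback=None)
print(fresh, stale, absent)
```

Returning the status next to the value lets the model contract decide between a registry-defined fallback and a degraded model, instead of silently seeing a 0.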
Freshness SLA targets vary by use case:
| Use case | Streaming freshness target | Staleness alert threshold |
|---|---|---|
| Fraud detection | 30 seconds | 2 minutes |
| Ride-sharing ETA | 60 seconds | 5 minutes |
| Content recommendation | 5 minutes | 15 minutes |
| Credit scoring | 30 minutes | 2 hours |
| Churn prediction | 4 hours | 24 hours |
Lambda vs. Kappa: the pipeline architecture choice
Lambda architecture uses two separate pipelines: a batch path (Spark, running hourly/daily) and a streaming path (Flink, running continuously). The batch path is accurate and handles large historical windows. The streaming path is fast but may use approximations or limited history. The outputs are reconciled in the online store, with the batch path correcting any streaming approximation errors.
Uber Michelangelo and Airbnb Zipline both use Lambda. The problem Lambda introduces is precisely what a feature store is supposed to prevent: two separate code paths, two chances for logic to diverge.
Kappa architecture uses a single streaming pipeline for everything. Batch processing is replaced by replaying historical data through the same Flink job. Backfills are implemented by rewinding the Kafka topic (or replaying from an Iceberg snapshot). The streaming job writes to both the online store and the offline lakehouse via the Flink Iceberg connector (or Confluent Tableflow for Confluent-managed deployments).
Kappa eliminates the dual-pipeline divergence problem but requires mature exactly-once stream processing and an efficient historical replay mechanism. Netflix is moving in this direction, replacing expensive Kafka retention-based replay with Apache Iceberg snapshot replay at 46x lower cost.
The practical choice for most teams: start with Lambda (Spark batch + Flink streaming, separate paths, accepted divergence risk), then consolidate to Kappa as streaming infrastructure matures and the cost of maintaining two pipelines exceeds the operational cost of a more sophisticated single pipeline.
Training-serving skew: detection and prevention
Prevention happens at the source: single feature definition, shared transformation code, same runtime for both paths. This is the primary mechanism.
Detection catches the skew that slips through anyway — due to infrastructure differences, version mismatches, or subtle semantic differences in how the same code behaves under different data distributions.
Shadow comparison pipeline. At serving time, log the feature vector that was actually served. After the fact (when labels are available), run the same entity + timestamp through the training pipeline and compare values. Compute per-feature: exact match %, mean difference, P99 difference. Alert when exact match rate drops below 99%.
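The per-feature metrics can be computed as follows. This sketch assumes numeric features already aligned by entity and timestamp, uses the nearest-rank method for P99, and treats missing values as out of scope; `skew_report` is a hypothetical helper name.

```python
def skew_report(served, recomputed, tolerance=0.0):
    """Compare served values with values recomputed by the training pipeline.

    Both lists must be aligned: position i refers to the same
    (entity, timestamp) pair in each.
    """
    if not served or len(served) != len(recomputed):
        raise ValueError("need two non-empty, equally sized value lists")
    diffs = sorted(abs(s - r) for s, r in zip(served, recomputed))
    n = len(diffs)
    p99_rank = (99 * n + 99) // 100     # ceil(0.99 * n) in integer arithmetic
    return {
        "exact_match": sum(d <= tolerance for d in diffs) / n,
        "mean_diff": sum(diffs) / n,
        "p99_diff": diffs[p99_rank - 1],
    }


# 200 logged predictions; the training pipeline disagrees on three of them.
served = [1.0] * 200
recomputed = list(served)
recomputed[10] += 0.5
recomputed[20] += 2.0
recomputed[30] += 4.0

report = skew_report(served, recomputed)
should_alert = report["exact_match"] < 0.99
print(report, "alert:", should_alert)
```

Here the exact match rate is 98.5%, which falls below the 99% threshold and triggers the alert even though the mean difference is small.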
Distribution monitoring. At regular intervals, compare the distribution of each feature in the last N training datasets against the live serving distribution (sampled from serving logs). For each feature, compute a z-score for the shift in the serving mean relative to the training mean and standard deviation, and flag features whose absolute z-score exceeds 3 for manual review. This catches semantic drift: cases where the transformation is technically the same code but the underlying data distribution has shifted in a way that creates training-serving mismatch.
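One possible reading of the z-score check is sketched below: it scores the serving sample mean against the training mean using the standard error (training standard deviation divided by the square root of the serving sample size). That choice of statistic is an assumption; a production monitor might instead use a population stability index or a KS test. `mean_shift_z` is a hypothetical helper.

```python
import math
import statistics


def mean_shift_z(training, serving):
    """Absolute z-score of the serving sample mean under the training distribution."""
    mu = statistics.fmean(training)
    sigma = statistics.pstdev(training)
    serving_mean = statistics.fmean(serving)
    if sigma == 0:
        # A constant training feature: any movement at all is a shift.
        return 0.0 if serving_mean == mu else math.inf
    standard_error = sigma / math.sqrt(len(serving))
    return abs(serving_mean - mu) / standard_error


training = [8.0, 10.0, 12.0] * 100              # mean 10
serving_ok = [8.0, 10.0, 12.0, 10.0] * 4        # mean 10, 16 samples
serving_shifted = [10.0, 12.0, 14.0, 12.0] * 4  # mean 12, 16 samples

z_ok = mean_shift_z(training, serving_ok)
z_shifted = mean_shift_z(training, serving_shifted)
print(f"z_ok={z_ok:.2f} z_shifted={z_shifted:.2f} flag={z_shifted > 3}")
```

Because the standard error shrinks with sample size, very large serving samples will flag tiny shifts, so the sample size (or the threshold) should be fixed per feature.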
Null vs. zero enforcement. The most common silent skew is not computational — it is semantic. Training data uses NULL for missing features (user has no transactions in the 30-day window). The serving code substitutes 0 on a Redis miss. The model learns that NULL means "new user with no history" and 0 means "user with zero transactions" — but at serving time, it always sees 0 for both. Fix: define the missing value semantics explicitly in the feature registry and enforce them in both paths.
Nubank's engineering team documented this pattern specifically: their shadow comparison pipeline computes exact match %, mean difference, and P99 difference for every feature after each model deployment, providing a direct measure of training-serving consistency rather than relying on downstream model metrics to catch it.
Edge cases & gotchas
1. Silent freshness failure on infrastructure incidents. A batch pipeline that depends on an upstream warehouse table fails when that table's schema changes during a data engineering migration. The online store continues serving values that are now 48 hours stale. Models degrade gradually because the feature values are numerically valid, just old. Fix: monitor feature_last_updated_at per feature group with alerting at 2× the expected refresh interval; return MISSING or OUTSIDE_TTL, rather than the stale value, for features older than the maximum acceptable staleness.
2. Zombie writes from out-of-order streaming events. Flink processes a Kafka event that arrived 10 minutes late (due to a network partition) and writes a stale feature value to Redis, overwriting a fresher value computed from a more recent event. The model now reads a transaction count that is 10 minutes out of date from a slot that should hold the current one. Fix: implement conditional writes. Before writing to Redis, compare the event's event_timestamp against the stored value's timestamp, and only overwrite if the new value is fresher.
3. Hot-key skew in the online store. The top 100 restaurants in New York City receive 10,000× the traffic of a median restaurant. Every lookup for a hot entity lands on the one shard that owns its key, so the shards holding those entities see P99 latency spike to 800ms while median latency stays at 2ms. Fix: consistent hashing with virtual nodes spreads entities evenly across shards but cannot split the traffic of a single key, so pair it with read replicas for hot shards and client-side caching with a 5-second TTL to absorb hot-key reads. DoorDash found client-side caching reduced ElastiCache costs by 50%.
4. First-deploy skew for new streaming features. A new streaming feature (user click count in the last 7 days) is deployed and starts accumulating values. A model is retrained on a backfilled training dataset covering the last 12 months of history. At training time, every example sees a fully populated 7-day window; at deployment, the online store has only 6 hours of accumulated events, so the served counts are far lower than anything the model saw in training. Fix: warm up the online store, by backfilling it or by letting the stream run for a full window, before deploying the model that depends on the new feature; or cap the training lookback to match the production warm-up period.
5. Approximate features introducing systematic bias. High-cardinality features like "distinct IP addresses used by this account in the last 7 days" cannot be computed exactly in streaming without unbounded state. Production uses HyperLogLog (±2% error). The training pipeline uses exact COUNT(DISTINCT). The values differ systematically for accounts with many IPs — exactly the accounts a fraud model cares most about. Fix: use the same HyperLogLog approximation in the training pipeline; document in the feature registry which features are approximate and what their error bounds are.
6. Non-idempotent backfill jobs creating duplicate aggregations. A backfill Spark job fails on partition 47 of 200 and is re-run from scratch. The job does not check for existing output and double-writes rows to the offline store. The training dataset now contains duplicate events, corrupting rolling window aggregations. Fix: use Iceberg's upsert semantics keyed on (entity_id, event_timestamp) for the offline store; validate output row counts against expected values before marking the backfill complete.
7. Materialization lag causing deployment race conditions. A model is retrained on today's feature data and deployed at 14:00. The batch materialization job that writes today's features to the online store does not complete until 15:30. Between 14:00 and 15:30, the deployed model expects today's features but the online store serves yesterday's. Fix: gate model deployment on materialization completion; or use streaming-based materialization to eliminate the batch lag window entirely.
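The conditional write from gotcha 2 can be sketched with an in-memory stand-in for the online store. The class and method names are hypothetical. Against a real Redis deployment the compare-and-set has to be atomic, for example inside a Lua script or a WATCH/MULTI transaction, which this sketch does not attempt to show.

```python
class InMemoryOnlineStore:
    """Toy online store that refuses to overwrite fresher values."""

    def __init__(self):
        self._data = {}   # key -> (value, event_timestamp)

    def write_if_fresher(self, key, value, event_ts) -> bool:
        """Store the value only if it is newer than what is already there."""
        current = self._data.get(key)
        if current is not None and current[1] >= event_ts:
            return False          # late event: keep the fresher value
        self._data[key] = (value, event_ts)
        return True

    def read(self, key):
        return self._data.get(key)


store = InMemoryOnlineStore()
key = "fstore:user:u_999:txn_count_60s"

first = store.write_if_fresher(key, value=4, event_ts=1_000)
second = store.write_if_fresher(key, value=7, event_ts=1_600)
# An event delayed by a network partition arrives after the newer one.
late = store.write_if_fresher(key, value=5, event_ts=1_100)

print(first, second, late, store.read(key))
```

Storing the event timestamp next to each value is also what makes the OUTSIDE_TTL check possible at read time.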
Trade-offs to discuss in an interview
Lambda vs. Kappa. Lambda (separate batch + streaming paths) is operationally simpler to reason about — each path can be optimized independently. Kappa (single streaming pipeline for both historical replay and live serving) eliminates dual-pipeline divergence but requires exactly-once Flink with efficient replay. Most teams start with Lambda and migrate to Kappa as engineering maturity grows. The right answer depends on the team's streaming infrastructure investment.
Physical feature store vs. virtual/literal. A physical store (Michelangelo, Chronon, Tecton) owns the compute engine and maximizes consistency guarantees, but requires teams to adopt a new DSL or framework. A literal store (Feast) is storage-only (you bring your own pipelines), with the lowest adoption cost but the highest risk of skew, since the pipelines remain separate. A virtual store (Featureform) sits as a metadata/orchestration layer over existing infrastructure, offering a middle path. Lyft deliberately avoided DSL lock-in by using SparkSQL + JSON configs, trading some consistency guarantees for portability.
Redis vs. DynamoDB vs. Cassandra for online storage. Redis is fastest but memory-bound; DynamoDB is managed and cost-effective at moderate QPS; Cassandra handles multi-region writes well. Tecton benchmarks show Redis Enterprise is 3× faster and 14× less expensive than DynamoDB for high-throughput feature serving. The right choice depends on whether you're memory-constrained (Aerospike hybrid DRAM+SSD), latency-constrained (Redis), cost-constrained at moderate QPS (DynamoDB), or multi-region write-heavy (Cassandra).
Streaming freshness vs. cost. Always-on Flink jobs cost 10–100× more per feature than hourly Spark jobs. Most features do not need sub-minute freshness — user demographic features, historical aggregations, and item metadata are fine at hourly or daily cadence. Reserve streaming pipelines for signals where staleness directly degrades model quality in a measurable way: fraud velocity signals, real-time inventory, current location-based features. Everything else should be batch.
Centralized registry vs. federated ownership. A single central registry (Palette, Feast registry) provides discoverability and prevents duplication, but becomes a bottleneck for large organizations with hundreds of ML teams. Federated ownership (each team owns a feature group namespace, a central catalog aggregates) scales better but complicates cross-team feature sharing and access control. Most mature platforms (Uber Palette, Airbnb Chronon) trend toward centralized governance with team-level namespacing.
Things you should now be able to answer
- What is training-serving skew? Name two concrete mechanisms by which it occurs, even when teams use a shared feature store.
- Walk through a point-in-time join. What goes wrong if you join only on entity ID without timestamp filtering? Give a specific example with numbers.
- A fraud model's AUC dropped from 0.91 to 0.83 after a seemingly unrelated data warehouse migration. How would you diagnose this? What would you check first?
- Explain the difference between a batch feature, a streaming feature, and an on-demand feature. Give a real example of each and justify why each belongs in its tier.
- DoorDash's Redis Hashes optimization delivered 60% memory reduction. Explain why storing features as a Hash rather than separate string keys achieves this.
- What is the Kappa architecture and why does it eliminate one class of training-serving skew that Lambda architecture does not?
- A new streaming feature is deployed today. The model that will use it is retrained on 12 months of historical data. Describe the startup skew problem and two ways to mitigate it.
- What is TTL in the context of a feature store, and what should the feature server return when a key's TTL has expired? Why is returning 0 wrong?
- Netflix found that replaying from Iceberg snapshots costs $2M/year vs. $93M/year from Kafka retention. Why is Iceberg so much cheaper? What are its limitations?
- A Redis shard is seeing 100× the traffic of neighboring shards. Diagnose the root cause and describe three mitigations at different layers of the stack.
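For the TTL question above, the null-versus-zero distinction is easier to reason about with a toy model. The sketch below is an in-memory stand-in for an online store with per-key expiry, not a real Redis client; the class name `OnlineFeatureCache`, the key format, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class OnlineFeatureCache:
    """Toy in-memory online store with per-key expiry (hypothetical stand-in for Redis TTLs)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._data: Dict[str, Tuple[float, float]] = {}
        self._clock = clock  # injectable so expiry can be exercised without sleeping

    def put(self, key: str, value: float, ttl_seconds: float) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[float]:
        entry = self._data.get(key)
        if entry is None:
            return None  # never written: missing
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None  # expired: report "missing", never a fabricated 0
        return value


# Usage with a fake clock: a genuine 0.0 and an expired key stay distinguishable.
now = [1000.0]
store = OnlineFeatureCache(clock=lambda: now[0])
store.put("user:42:txn_count_60s", 0.0, ttl_seconds=60)
fresh = store.get("user:42:txn_count_60s")    # a real observation of zero transactions
now[0] = 1061.0
expired = store.get("user:42:txn_count_60s")  # None: the store no longer knows the value
```

Returning `None` leaves the decision to the caller (impute, fall back to a default model, or reject), whereas returning 0 would silently assert that the user had no recent transactions.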
Further reading
Primary sources
- Hermann, Jeremy, and Mike Del Balso. "Meet Michelangelo: Uber's Machine Learning Platform." Uber Engineering Blog, 2017. eng.uber.com/michelangelo-machine-learning-platform/ — the canonical reference; P95 <10ms, Cassandra online store, 250K preds/sec, DSL approach.
- Kothapalli, Sagar, et al. "Palette Meta Store Journey." Uber Engineering Blog, 2023. uber.com/blog/palette-meta-store-journey/ — 20,000+ features, 95% deployment time reduction, centralized metadata architecture.
- Wang, Nikhil, et al. "Chronon: A Declarative Feature Engineering Framework." Airbnb Engineering Blog, 2022. medium.com/airbnb-engineering — point-in-time correct backfills, Kafka/Spark/Flink/Hive stack, 10,000+ features, under-a-week feature development.
- Abdelbaki, Hany, et al. "Building a Gigascale ML Feature Store with Redis." DoorDash Engineering Blog, 2020. doordash.engineering/2020/11/19 — 20M+ reads/second, Redis Hashes data type optimization, 35.7% mismatch discovered in dual-pipeline setup.
- "ML Feature Serving Infrastructure at Lyft." Lyft Engineering Blog, 2021. eng.lyft.com — single-digit ms latency, 99.99% availability, Flyte + Flink architecture, deliberate choice to avoid DSL lock-in.
- Huyen, Chip. "Self-Serve Feature Platforms: Architectures and APIs." huyenchip.com, 2023 — authoritative survey of batch/streaming/on-demand tiers, storage options, backfill strategies.
- "How to Eliminate Training-Serving Skew in MLOps." Confluent Engineering Blog. confluent.io — Kappa architecture with Flink + Iceberg; DoorDash 35.7% mismatch case study; Netflix $93M→$2M backfill cost analysis.
- "Dealing with Train-Serve Skew in Real-Time ML Models." Nubank Engineering Blog. building.nubank.com — shadow comparison pipeline, null vs. zero distinction, per-feature skew monitoring.
- AWS Blog. "Build an Ultra-Low Latency Online Feature Store Using Amazon ElastiCache for Redis." aws.amazon.com/blogs/database — 72% throughput increase, 71% P99 reduction with enhanced I/O multiplexing; 500-shard cluster specs.
- Feast Documentation. docs.feast.dev — registry, offline/online store options, get_historical_features() AS-OF join, materialize-incremental command, Java gRPC server sub-ms latency.
Related articles
- Design a Recommendation System — the primary consumer of a feature store; two-stage candidate generation + ranking, training/serving skew in the ranking context.
- Design a Fraud Detection System — tightest latency use case for a feature store; streaming velocity features, 100ms authorization deadline, Flink + Redis architecture.
- ML System Design Interview Framework — how to frame feature store questions in an ML system design interview.
- Design a Data Pipeline (ETL) — the batch ingestion infrastructure that feeds the offline store; Spark, Airflow, warehouse patterns.
- Change Data Capture (CDC) — how source system changes propagate to the feature store's batch and streaming pipelines without full table scans.
- Design a Distributed Cache — the Redis layer that backs the online store; eviction, sharding, hot-key mitigations at depth.
- Design a Vector Database — embedding features are one category of feature stored here; HNSW, IVF+PQ internals, sharding.
Frequently asked questions
What is training-serving skew and why is it the most dangerous failure mode in ML systems?
Training-serving skew occurs when a feature value computed offline for training differs from the value the model sees at inference time — due to divergent code paths, different aggregation windows, or inconsistent null handling. It is dangerous because it is silent: no errors appear in logs, model inference still returns a number, but the predictions are subtly wrong. DoorDash measured a 35.7% feature-value mismatch between their training and serving pipelines before consolidating them. The standard mitigation is a single transformation definition that generates both the training data and the serving-time compute, executed by the same runtime in both contexts.
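A mismatch figure like the one DoorDash reported comes from comparing, key by key, the value the training pipeline produced with the value the serving path returned. The following is a minimal sketch of such a comparison, assuming both sides have been loaded into plain dictionaries; the function names and the feature name `txn_count_1h` are hypothetical, and a missing value (`None`) is deliberately treated as different from `0.0`.

```python
import math
from typing import Dict, Optional

Row = Dict[str, Optional[float]]


def values_match(offline: Optional[float], online: Optional[float],
                 rel_tol: float = 1e-6) -> bool:
    """Compare two feature values; None (missing) only matches None, never 0.0."""
    if offline is None or online is None:
        return offline is None and online is None
    return math.isclose(offline, online, rel_tol=rel_tol, abs_tol=1e-9)


def mismatch_rate(offline_rows: Dict[str, Row], online_rows: Dict[str, Row],
                  feature: str) -> float:
    """Fraction of entity keys present on both sides whose values disagree."""
    shared = offline_rows.keys() & online_rows.keys()
    if not shared:
        return 0.0
    mismatches = sum(
        not values_match(offline_rows[k].get(feature), online_rows[k].get(feature))
        for k in shared
    )
    return mismatches / len(shared)


# Illustrative data only: u2 differs by null handling, u3 by value.
offline = {"u1": {"txn_count_1h": 3.0}, "u2": {"txn_count_1h": None},
           "u3": {"txn_count_1h": 7.0}, "u4": {"txn_count_1h": 2.0}}
online = {"u1": {"txn_count_1h": 3.0}, "u2": {"txn_count_1h": 0.0},
          "u3": {"txn_count_1h": 8.0}, "u4": {"txn_count_1h": 2.0}}
rate = mismatch_rate(offline, online, "txn_count_1h")
```

Tracking this rate per feature, rather than as one global number, makes it easier to see which definition has diverged.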
What is the difference between batch features, streaming features, and on-demand features?
Batch features are precomputed by a scheduled Spark job from a data warehouse and written to the online store on an hourly or daily cadence — good for stable signals like a user's 30-day spend total. Streaming features are maintained by a Flink job consuming a Kafka topic, updated in seconds — good for velocity signals like transactions-in-last-60-seconds that are critical for fraud detection. On-demand features are computed at inference time inside the feature server using data from the live request payload — good for signals that combine request context with stored data, like the distance between a user's current GPS location and their stored home coordinates. Most production systems use all three tiers.
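The on-demand tier is the easiest of the three to show in a few lines, because it is just a pure function of the request payload and already-stored values. A minimal sketch of the GPS example, assuming hypothetical field names (`lat`, `lon`, `home_lat`, `home_lon`) and a spherical-Earth haversine distance:

```python
import math
from typing import Dict, Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_from_home_km(request: Dict[str, float],
                          stored: Dict[str, Optional[float]]) -> Optional[float]:
    """On-demand feature: request-time GPS combined with stored home coordinates."""
    home_lat, home_lon = stored.get("home_lat"), stored.get("home_lon")
    if home_lat is None or home_lon is None:
        return None  # a missing stored value stays missing rather than becoming 0
    return haversine_km(request["lat"], request["lon"], home_lat, home_lon)


# Usage: the same function should run in the feature server and in training-set generation.
feature = distance_from_home_km({"lat": 0.0, "lon": 1.0}, {"home_lat": 0.0, "home_lon": 0.0})
```

Because the function has no side effects and no I/O, running the identical code against logged request payloads is what keeps this tier free of skew.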
What is a point-in-time join and why does training dataset generation require it?
A point-in-time join (also called an AS-OF join) finds, for each labeled training example at time T, the feature value that was actually available at time T — not the value that was updated afterward. Without it, a naive left-join on entity ID alone would pull in future feature values, leaking information the model could not have had at prediction time. A classic example: joining a 'customer lifetime value' feature — updated nightly — to a churn label from six months ago would pull in knowledge from the future. The symptom is inflated offline AUC (0.95) that collapses at deployment (0.78). Feast implements this via get_historical_features(); Airbnb Chronon calls it window-accurate backfill.
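The AS-OF lookup can be sketched with nothing more than a binary search over each entity's feature history. This is an illustration of the idea, not how Feast or Chronon implement it; the data layout, the integer timestamps, and the choice to treat a value written exactly at the event time as available are all assumptions.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Label = Tuple[str, int, int]      # (entity_id, event_ts, label)
History = List[Tuple[int, float]]  # [(feature_ts, value), ...] sorted by feature_ts


def point_in_time_join(labels: List[Label],
                       feature_history: Dict[str, History]
                       ) -> List[Tuple[str, int, Optional[float], int]]:
    """For each label, attach the latest feature value with feature_ts <= event_ts."""
    rows = []
    for entity_id, event_ts, label in labels:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, event_ts)  # count of values known at event_ts
        value = history[idx - 1][1] if idx > 0 else None  # None: nothing known yet
        rows.append((entity_id, event_ts, value, label))
    return rows


# Illustrative data: the value 900.0 was written after the first label's event time.
history = {"u1": [(10, 100.0), (20, 250.0), (30, 900.0)]}
labels = [("u1", 25, 1), ("u1", 5, 0)]
joined = point_in_time_join(labels, history)
naive_value = history["u1"][-1][1]  # what a join on entity ID alone would pick up
```

Here the label at time 25 receives 250.0, while a join on entity ID alone would attach 900.0, a value from the future. The label at time 5 receives `None` because no feature value existed yet.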
How do you choose between Redis, DynamoDB, and Cassandra for the online store?
Redis delivers sub-millisecond median latency with the highest throughput per node, but is memory-bound — DoorDash achieved 20M+ reads/second with Redis Hashes, at 60% lower memory and 65% lower CPU than a naive implementation. DynamoDB is managed and scales automatically, but single-digit millisecond latency and per-request cost make it better for sporadic reads than sustained high-throughput serving. Cassandra handles high write throughput and multi-region replication well (Uber Michelangelo's original choice), but its read latency is slightly higher than Redis and operational complexity is significant. For most new feature stores, Redis is the default online store; DynamoDB is reasonable at moderate QPS; Aerospike is worth evaluating when features exceed available DRAM since it uses SSD-backed storage at near-DRAM latency.
How should you handle backfills when deploying a new feature?
When a new feature is deployed, the online store accumulates values from launch onward. If you retrain a model on a backfilled historical dataset that spans years of history, the training distribution covers the full history while production at launch only has hours or days of values — introducing a startup skew. The fix is to either cap the training lookback window to match the production warm-up period, or run a historical replay through the feature pipeline before launch so the online store is pre-populated. Airbnb Chronon's tile-based pre-aggregation makes this replay efficient by storing partial aggregates rather than re-reading full raw history.
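The reason partial aggregates make replay cheap is that a windowed value can be rebuilt by combining a handful of small tiles instead of rescanning raw events. The sketch below illustrates that idea only; it is not Chronon's implementation, and the one-hour tile size, the function names, and the simplification of ignoring the partially filled current tile are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

TILE_SECONDS = 3600  # assumed tile width: one hour


def build_tiles(events: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Collapse raw (ts_seconds, amount) events into per-tile partial sums."""
    tiles: Dict[int, float] = defaultdict(float)
    for ts, amount in events:
        tiles[ts - ts % TILE_SECONDS] += amount  # key each tile by its start time
    return dict(tiles)


def windowed_sum(tiles: Dict[int, float], as_of: int, window_tiles: int) -> float:
    """Sum the `window_tiles` complete tiles that end at the start of as_of's tile."""
    end = as_of - as_of % TILE_SECONDS
    start = end - window_tiles * TILE_SECONDS
    return sum(tiles.get(t, 0.0) for t in range(start, end, TILE_SECONDS))


# Raw history is read once to build tiles; every replayed timestamp then reads tiles only.
events = [(100, 5.0), (3700, 2.0), (3800, 3.0), (7300, 10.0)]
tiles = build_tiles(events)
two_hour_sum = windowed_sum(tiles, as_of=7250, window_tiles=2)
```

Sums compose trivially across tiles; aggregations such as averages need each tile to carry both a sum and a count so they can be merged the same way.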