Designing a Feature Flag Service
An in-house LaunchDarkly. Distributing config with seconds-level freshness to thousands of services, targeting rules, and the safety properties that prevent a flag flip from taking the site down.
The problem
LaunchDarkly processes billions of flag evaluations a day across tens of thousands of customer services. The entire product rests on a deceptively simple idea: let engineers ship code with a behavior gated behind a boolean, then flip that boolean in production without a deployment. Dark launches, gradual rollouts, kill switches, A/B tests, per-customer feature gating — all of it reduces to "evaluate this rule and return a value."
The naive implementation is a database column. An app reads SELECT enabled FROM flags WHERE name = ? on every request. At 100k requests per second across your fleet, that's 100k database queries per second for what is essentially a constant, and each one adds ~5ms of latency. Add a cache and you solve the load problem — but now a kill switch change takes up to 30 seconds to propagate to every server. In an incident, 30 seconds is a lifetime.
The core engineering tension is this: sub-millisecond evaluation on every hot-path request, rule changes propagated globally within seconds, and the system keeps working even when the flag service itself is completely down. Those three requirements each individually push you toward a different architecture. Getting all three simultaneously is what makes this problem interesting — and what separates a toy implementation from how LaunchDarkly, Statsig, Flagsmith, and Unleash actually work.
The resolution is a shift in where evaluation happens. Instead of evaluating flags on a remote server, you ship the ruleset into the application process. The in-process SDK evaluates locally in microseconds. The flag service's only job becomes streaming rule updates to connected SDKs. One flag change propagates to every running server in 1–3 seconds. And if the flag service goes dark, SDKs keep running on their last-known ruleset — applications never degrade.
Capacity
| Dimension | Estimate | How we got there |
|---|---|---|
| Flag evaluations | 1B/day (~11,600/sec) | 1,000 services × ~12 req/sec average × 1 flag eval/req |
| Flag writes | ~1,000/day (~0.01/sec) | Human-driven changes; teams rarely change more than a few flags per hour |
| SSE connections | ~3,000–30,000 globally (15 edge nodes) | 1,000 services × 1–10 pods/service × 3 regions; ~200–2,000 per edge node |
| Ruleset size per environment | ~1–5 MB | Hundreds of flags × a few KB of JSON rules each |
| Audit log growth | ~1,000 rows/day | One row per flag change; small indefinitely |
Takeaway: The flag service is write-tiny and read-free. Evaluations happen entirely in-process, so the service handles only about a thousand rule-update events per day, not billions of reads. A single Postgres instance and a handful of edge distributor nodes carry the full production load.
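The per-second rates in the table follow from dividing the daily totals by the number of seconds in a day. A quick check, using only the table's own figures:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

evals_per_day = 1_000_000_000   # "Flag evaluations" row
writes_per_day = 1_000          # "Flag writes" row

evals_per_sec = evals_per_day / SECONDS_PER_DAY
writes_per_sec = writes_per_day / SECONDS_PER_DAY

# Reads outnumber writes by six orders of magnitude, which is why
# evaluation has to stay out of the service entirely.
read_write_ratio = evals_per_day // writes_per_day

print(f"{evals_per_sec:,.0f} evals/sec, "
      f"{writes_per_sec:.3f} writes/sec, "
      f"ratio {read_write_ratio:,}:1")
```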
Requirements
Functional
- Define a flag: new-checkout-flow. Type: boolean / string / number / JSON.
- Define targeting rules: "true for users in the EU"; "50% rollout to free-tier users"; "true if user_id is in this list."
- Evaluate a flag in client code: flags.is_on("new-checkout-flow", user).
- Update flags from a dashboard, propagate to all running services in seconds.
- Audit log: who changed what, when.
- Environment scoping: dev / staging / prod separate.
Non-functional
- Evaluation latency: < 1ms p99 (this is in the application's hot path).
- Propagation latency: < 5s from "save flag" to "all servers see new value."
- Availability: 99.99%. A flag service outage must not take down the apps using it.
- Throughput: 1B flag evaluations/day across all apps; the service itself sees 1000 writes/day.
- Consistency: eventually consistent is fine; never serve a corrupt value.
Out of scope (for this article)
- Experimentation analytics (Statsig territory).
- Per-user audit at the eval call (impossibly expensive at scale).
Building up to the design
A feature flag service looks like "a config table" until you actually try to use it on the hot path. Walking the evolution makes the central insight — SDK-side evaluation — obvious rather than magical.
V1: A row in Postgres
CREATE TABLE flags (name TEXT PRIMARY KEY, enabled BOOL);
App code: SELECT enabled FROM flags WHERE name = ?.
That SELECT is on the hot path of every request. At 100k req/sec across your app fleet, you're doing 100k QPS to a database for what is essentially a constant. Each lookup adds ~5ms of latency, and when the flag DB hiccups, every app stalls.
V2: Cache the flag in process
The app reads the flag once at startup and refreshes every 30 seconds in the background. Zero per-request DB load, and evaluation drops to ~100ns — a plain in-process hash map lookup.
The problem is propagation speed. A 30-second refresh interval is fine for a configuration change. It is not fine when your checkout flow is broken and you need to flip a kill switch right now. It also means different boxes see the new value at different moments, giving the same user inconsistent behavior across requests.
V3: Push updates via pub/sub
Add a pub/sub channel — Redis Streams, Kafka, or NATS. Each app subscribes. When a flag is updated, the service publishes the new value and every subscriber refreshes within milliseconds. Now you have both fast propagation and cheap evaluation.
The next thing you want to add is targeting rules: enable for 10% of users, enable for users in the EU, enable for a specific list of user IDs. The moment you have rules, you face a choice. Do you send every evaluation call to a server that applies the rules? That's the V1 latency problem again — you've just moved it. Or do you ship the rules themselves to the app so they evaluate locally?
V4: SDK with in-process evaluation
Ship a client library that holds the full ruleset in memory, not just the boolean answer. The SDK evaluates locally — boolean, percentage, list-match, attribute-equals, all in microseconds.
flags.is_on("new-checkout-flow", user={"country": "DE", "tier": "free"})
# ↓ in-process evaluation
# match rules → return True
Rich rules, microsecond evaluation, propagation in seconds. The remaining gap: a new app process starts up but can't reach the flag service. What does it do?
V5: Local snapshot + safe defaults + edge distributors
Three pieces close the gap. The SDK persists the last good ruleset to local disk, so a cold-start without network connectivity still has a ruleset to work from. Every flag carries a compiled-in safe default for when the SDK truly has nothing. And instead of SDKs connecting directly to the flag service, a small pool of edge distributors hold the latest ruleset and stream updates to connected SDKs — decoupling app boot from flag service availability entirely.
V6: Production
V4 + V5 plus audit log, environment scoping (dev/staging/prod), gradual rollouts with deterministic hashing (so user X stays in the same bucket across requests), and observability: eval counters and exposure events fed back to compute experiment metrics.
flowchart LR
V1[V1: SQL row<br/>5ms per call] --> V2[V2: + in-process cache<br/>30s lag]
V2 --> V3[V3: + pub/sub<br/>fast prop]
V3 --> V4[V4: + SDK with rules<br/>local eval]
V4 --> V5[V5: + snapshot + edge<br/>survives outage]
V5 --> V6[V6: + audit + experiments<br/>production]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
The trick that makes the math work
Every production flag service shares one core insight:
Move evaluation into the application process, not into the flag service.
flowchart LR
APP[App Process] --> SDK[Flag SDK<br/>has full ruleset in memory]
SDK --> EVAL{evaluate locally}
EVAL --> RES[true/false]
SDK -.stream updates.-> FLAG[Flag Service]
style SDK fill:#15803d,color:#fff
style FLAG fill:#ff6b1a,color:#0a0a0f
The app embeds an SDK. The SDK holds the entire flag ruleset in memory and evaluates locally. The service's job is to stream rule updates to the SDK as they change.
Why this works: evaluation is free (it's a memory lookup inside the process). The SDK keeps a ruleset snapshot on disk, so cold starts work even if the flag service is unreachable. And the service only handles about a thousand rule updates per day, not a billion evaluations.
This is how LaunchDarkly, Statsig, Flagsmith, Unleash, and most internal flag systems are built.
High-level architecture
flowchart TB
UI[Admin UI] --> API[Control Plane API]
API --> DB[(Postgres<br/>flag definitions, rules)]
API --> PUB[Pub/Sub<br/>'flag-updates' topic]
PUB --> ED[Edge Distributors]
ED -->|"SSE / WS / poll"| SDK1[SDK in App 1]
ED -->|"SSE / WS / poll"| SDK2[SDK in App 2]
ED -->|"SSE / WS / poll"| SDKN[SDK in App N]
SDK1 -->|cache snapshot| DISK1[(Local disk)]
SDK2 -->|cache snapshot| DISK2[(Local disk)]
style API fill:#ff6b1a,color:#0a0a0f
style ED fill:#15803d,color:#fff
style SDK1 fill:#0e7490,color:#fff
Four layers:
- Control plane: dashboard + API + database. Source of truth for flag definitions.
- Event bus: notifies "something changed" — Kafka, Redis Streams, or NATS. (Postgres LISTEN/NOTIFY can work for very small, single-region deployments but loses messages if no listener is active, can't cross DB instance boundaries, and doesn't replicate to standby nodes — so avoid it for multi-region setups.)
- Edge distributors: thin servers (often regional) that hold the latest ruleset and stream to SDKs.
- SDK: in-process library that fetches the ruleset, holds it in memory, evaluates locally.
The flag data model
A flag is small but rich:
{
  "key": "new-checkout-flow",
  "type": "boolean",
  "default": false,
  "environments": {
    "prod": {
      "enabled": true,
      "rules": [
        {
          "if": { "attribute": "country", "op": "in", "values": ["US","CA"] },
          "serve": true
        },
        {
          "if": { "attribute": "plan", "op": "eq", "value": "enterprise" },
          "serve": true
        },
        {
          "rollout": { "true": 30, "false": 70 }
        }
      ]
    }
  }
}
Rules evaluate top to bottom. First match wins. A rollout at the end gives a deterministic split by hashing user.id.
flowchart TD
REQ["evaluate(flag, user)"] --> R1{"Rule 1: country in US/CA?"}
R1 -->|yes| V1[serve true]
R1 -->|no| R2{"Rule 2: plan == enterprise?"}
R2 -->|yes| V2[serve true]
R2 -->|no| R3["Rollout: hash(flag+user) % 100"]
R3 -->|"bucket < 30"| V3[serve true]
R3 -->|"bucket >= 30"| V4[serve false]
style R1 fill:#0e7490,color:#fff
style R2 fill:#0e7490,color:#fff
style R3 fill:#ff6b1a,color:#0a0a0f
style V1 fill:#15803d,color:#fff
style V2 fill:#15803d,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff2e88,color:#fff
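The first-match-wins walk shown in the diagram can be sketched in a few lines. This is a minimal illustration over the JSON model above, not any vendor's evaluator. The function names (`bucket_for`, `matches`, `evaluate_flag`) are hypothetical, SHA-256 stands in for MurmurHash3 only so the sketch has no third-party dependency, and two behaviors are assumptions the article does not specify: a disabled or missing environment serves the flag default, and rollout keys `"true"`/`"false"` are converted to booleans for boolean flags.

```python
import hashlib

FLAG = {
    "key": "new-checkout-flow",
    "type": "boolean",
    "default": False,
    "environments": {"prod": {"enabled": True, "rules": [
        {"if": {"attribute": "country", "op": "in", "values": ["US", "CA"]}, "serve": True},
        {"if": {"attribute": "plan", "op": "eq", "value": "enterprise"}, "serve": True},
        {"rollout": {"true": 30, "false": 70}},
    ]}},
}

def bucket_for(flag_key: str, user_id: str) -> int:
    # Stdlib stand-in hash (assumption); salted with the flag key.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def matches(cond: dict, user: dict) -> bool:
    actual = user.get(cond["attribute"])
    if cond["op"] == "in":
        return actual in cond["values"]
    if cond["op"] == "eq":
        return actual == cond["value"]
    raise ValueError(f"unknown op: {cond['op']}")

def evaluate_flag(flag: dict, env: str, user: dict):
    config = flag["environments"].get(env)
    if config is None or not config.get("enabled", False):
        return flag["default"]          # assumption: disabled env serves default
    for rule in config["rules"]:        # top to bottom, first match wins
        if "if" in rule:
            if matches(rule["if"], user):
                return rule["serve"]
        elif "rollout" in rule:
            bucket = bucket_for(flag["key"], str(user["id"]))
            cumulative = 0
            for value, pct in rule["rollout"].items():
                cumulative += pct
                if bucket < cumulative:
                    # JSON object keys are strings; map back for boolean flags
                    return value == "true" if flag["type"] == "boolean" else value
    return flag["default"]

us_user = evaluate_flag(FLAG, "prod", {"id": "u1", "country": "US", "plan": "free"})
ent_user = evaluate_flag(FLAG, "prod", {"id": "u2", "country": "DE", "plan": "enterprise"})
rollout_user = evaluate_flag(FLAG, "prod", {"id": "u3", "country": "DE", "plan": "free"})
```

Here `us_user` is decided by rule 1 and `ent_user` by rule 2; `rollout_user` falls through to the percentage split, so its value depends on the hash bucket.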
Deterministic rollouts
A 30% rollout must be stable per user — the same user should always get the same answer, or the UI flickers on every page load.
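import mmh3  # MurmurHash3: fast, uniform, no crypto overhead

def hash_to_bucket(flag_key: str, user_id: str) -> int:
    # Salt with flag_key so each flag independently distributes users
    return abs(mmh3.hash(f"{flag_key}:{user_id}")) % 100  # bucket 0..99

def evaluate_rollout(flag_key, user, rollout: dict) -> str:
    # rollout maps value -> percentage, e.g. {"true": 30, "false": 70}
    bucket = hash_to_bucket(flag_key, user.id)
    cumulative = 0
    for value, pct in rollout.items():
        cumulative += pct
        if bucket < cumulative:
            return value
    # Percentages summed to less than 100: surface the misconfiguration
    # so the caller can fall back to the flag's default.
    raise ValueError(f"rollout for {flag_key} does not cover bucket {bucket}")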
Use a fast, non-cryptographic hash; MurmurHash3 (via mmh3) is a good default. MD5 and SHA-1 also work, but they are cryptographic hashes, slower than needed, and add no value for a task that has no security requirement. For reference: Optimizely and Unleash both use MurmurHash for deterministic bucketing; LaunchDarkly uses SHA-1 (taking the first 15 hex characters and mapping to a 100,000-bucket space), which works, but MurmurHash is faster and sufficient. What matters is that the function is deterministic and distributes users uniformly across buckets, not which specific hash you pick.
Salting with the flag key ensures different flags independently distribute users. Without the salt, the same 30% of users receive every new feature — correlation makes experiment results meaningless.
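The effect of the salt is easy to measure. The sketch below compares the 30% cohorts of two flags, once hashing `flag_key:user_id` and once hashing the user ID alone. The flag names and user IDs are made up, and SHA-256 is used as a stdlib stand-in for MurmurHash3 so the example runs without extra packages; the correlation argument is the same for any uniform hash.

```python
import hashlib

def bucket(key: str) -> int:
    # Stand-in for MurmurHash3 (assumption, chosen to avoid a dependency).
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

users = [f"user-{i}" for i in range(10_000)]

def cohort(flag_key: str, pct: int, salted: bool = True) -> set:
    # Users whose bucket falls below the rollout percentage.
    return {u for u in users
            if bucket(f"{flag_key}:{u}" if salted else u) < pct}

salted_a = cohort("flag-a", 30)
salted_b = cohort("flag-b", 30)
unsalted_a = cohort("flag-a", 30, salted=False)
unsalted_b = cohort("flag-b", 30, salted=False)

# Fraction of flag-a's cohort that is also in flag-b's cohort.
overlap_salted = len(salted_a & salted_b) / len(salted_a)
overlap_unsalted = len(unsalted_a & unsalted_b) / len(unsalted_a)

print(f"salted overlap:   {overlap_salted:.2f}")
print(f"unsalted overlap: {overlap_unsalted:.2f}")
```

Without the salt the two cohorts are identical by construction, so the overlap is exactly 1.0. With the salt, the overlap should land near 0.30, which is what independent 30% samples would produce.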
Propagation: how updates fan out
The simplest mechanism, and the slowest, is polling:
flowchart LR
SDK1[SDK 1] -->|"GET /flags<br/>every 30s"| API
SDK2[SDK 2] -->|"GET /flags<br/>every 30s"| API
SDKN[SDK N] -->|"GET /flags<br/>every 30s"| API
style API fill:#ff6b1a,color:#0a0a0f
Polling is stupid simple and works through any firewall, but you get up to 30s lag and high baseline load. It's a fine place to start and a useful fallback.
The fast mechanism is streaming — Server-Sent Events (SSE) or WebSocket from each SDK to the nearest edge distributor:
sequenceDiagram
participant Dash as Dashboard
participant API as Control Plane
participant DB as DB
participant Bus as Pub/Sub
participant Edge as Edge Distributor
participant SDK as SDK in app
Dash->>API: update flag
API->>DB: persist
API->>Bus: publish "flag changed"
Bus->>Edge: notify
Edge->>SDK: stream new ruleset
SDK->>SDK: swap in-memory ruleset
End-to-end, a dashboard click reaches every connected SDK in 1–3 seconds at p99 for a well-built in-house system. Commercial services that pre-deploy globally can tighten this further — LaunchDarkly publishes 200ms or less via their CDN-backed streaming infrastructure — but that requires 100+ global PoPs, which is out of scope for an in-house design.
Why an edge distributor instead of SDKs connecting to Kafka directly?
Apps often live in restricted networks, and Kafka speaks a complex binary protocol. Kafka also isn't sized for hundreds of thousands of consumer connections. The edge tier translates the message bus into simple SSE or WebSocket, can deduplicate and batch events for the SDKs, and can be globally deployed for latency. LaunchDarkly's edge nodes do exactly this.
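To make "simple SSE" concrete, here is a minimal sketch of the framing an edge distributor might emit and an SDK might parse. The event name `put` and the payload shape are assumptions for illustration; the framing itself (`id:`, `event:`, and `data:` fields, one per line, terminated by a blank line) is the standard Server-Sent Events text format.

```python
import json
from typing import Optional

def format_sse(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Serialize one event in the text/event-stream format."""
    lines = []
    if event_id is not None:
        # Clients echo the last id on reconnect (Last-Event-ID header),
        # so the server can tell how far behind they are.
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits no raw newlines by default, so one data line suffices.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"   # blank line ends the event

def parse_sse(frame: str) -> dict:
    """Parse a single event. Simplified: ignores multi-line data fields."""
    fields = {}
    for line in frame.splitlines():
        if not line or line.startswith(":"):   # blank line or comment
            continue
        name, _, value = line.partition(":")
        fields[name] = value[1:] if value.startswith(" ") else value
    return fields

frame = format_sse("put", {"version": 7, "flags": {"new-checkout-flow": {}}}, event_id="7")
parsed = parse_sse(frame)
payload = json.loads(parsed["data"])
```

Using the ruleset version as the event ID is one possible design choice: a reconnecting SDK then tells the edge which version it last saw.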
SDK design
The SDK is where most of the complexity lives.
Bootstrapping
On startup, the SDK tries three things in order: fetch the latest ruleset from the edge, fall back to the last ruleset saved to local disk, and if neither is available, use compiled-in defaults.
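def init_flag_sdk():
    try:
        ruleset = fetch_from_edge(timeout=2)
    except Exception:
        # Edge unreachable or timed out: last disk snapshot, then compiled defaults
        return SDK(load_from_disk() or COMPILED_DEFAULTS)
    try:
        save_to_disk(ruleset)
    except OSError:
        pass  # a failed snapshot write must not discard a good live ruleset
    return SDK(ruleset)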
flowchart TD
START([SDK init]) --> TRY[Try fetch from edge]
TRY -->|success| SAVE[Save to disk]
SAVE --> USE1[Use live ruleset]
TRY -->|timeout / error| DISK{Disk cache exists?}
DISK -->|yes| USE2[Use disk snapshot]
DISK -->|no| USE3[Use compiled defaults]
USE1 --> RDY([SDK ready])
USE2 --> RDY
USE3 --> RDY
style USE1 fill:#15803d,color:#fff
style USE2 fill:#ffaa00,color:#0a0a0f
style USE3 fill:#ff2e88,color:#fff
This is why a flag service outage doesn't bring down customers. Apps that were already running have a cached ruleset. New apps starting up during the outage load the last snapshot from local disk; if there is none (a fresh host, for example), they fall back to compiled defaults, so those defaults must be safe (typically false for kill switches, false for unreleased features).
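The disk snapshot is only useful if it is never left half-written. One common way to get that, sketched here under assumptions, is to write to a temporary file and atomically rename it over the old snapshot. The explicit `path` parameter and the JSON encoding are choices made for this illustration; the article's `save_to_disk` and `load_from_disk` do not specify either.

```python
import json
import os
import tempfile

def save_to_disk(ruleset: dict, path: str) -> None:
    """Write the snapshot so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(ruleset, f)
            f.flush()
            os.fsync(f.fileno())      # make sure bytes reach the disk
        os.replace(tmp_path, path)    # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

def load_from_disk(path: str):
    """Return the snapshot, or None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return None

with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = os.path.join(tmp, "flags.json")
    missing = load_from_disk(snapshot_path)
    save_to_disk({"new-checkout-flow": {"default": False}}, snapshot_path)
    restored = load_from_disk(snapshot_path)
    with open(snapshot_path, "w") as f:
        f.write("{not json")          # simulate a corrupted file
    corrupted = load_from_disk(snapshot_path)
```

Returning `None` for a corrupt file lets the caller treat it the same as a missing one and move on to compiled defaults.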
Hot reload
When the edge pushes an update, the SDK atomically swaps the in-memory ruleset. The key is that ruleset_ref is a single pointer to an immutable snapshot. Readers grab the current pointer at the start of an evaluation and hold it for the duration — if a new ruleset is swapped in concurrently, the in-flight evaluation completes against the snapshot it started with. No mid-evaluation flag state change, no locking on the read path.
self.ruleset_ref = new_ruleset  # one reference assignment (atomic in CPython); readers keep the snapshot they already grabbed
This is a copy-on-write / read-copy-update (RCU) pattern, applicable in any language with reference semantics (Python, Go, Java, Node.js).
One failure mode worth calling out: what if the edge pushes an update while the SDK is mid-startup, with the ruleset only partially loaded? The answer is to always apply updates as a complete atomic swap of a fully-validated snapshot — never patch individual rules in place. Validate the full ruleset before swapping.
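A minimal sketch of "validate, then swap the whole snapshot" might look like the following. The class name `RulesetHolder` and the injected `validate` callable are hypothetical, and the read-only wrapper is shallow (it protects the top-level mapping only), so treat this as an illustration of the pattern and not a complete implementation.

```python
import threading
from types import MappingProxyType

class RulesetHolder:
    def __init__(self, initial: dict, validate):
        validate(initial)
        self._validate = validate
        self._write_lock = threading.Lock()   # serializes writers only
        self._ref = MappingProxyType(dict(initial))

    def current(self):
        # Readers take one reference and use it for the whole evaluation.
        return self._ref

    def swap(self, new_ruleset: dict) -> bool:
        try:
            self._validate(new_ruleset)
        except Exception:
            return False                      # keep serving the old snapshot
        snapshot = MappingProxyType(dict(new_ruleset))
        with self._write_lock:
            self._ref = snapshot              # single reference assignment
        return True

def require_flags_key(ruleset: dict) -> None:
    # Toy validator for the demo; a real one would check every rule.
    if not isinstance(ruleset.get("flags"), dict):
        raise ValueError("ruleset must contain a 'flags' mapping")

holder = RulesetHolder({"flags": {"new-checkout-flow": False}}, require_flags_key)
in_flight = holder.current()                  # an evaluation in progress
accepted = holder.swap({"flags": {"new-checkout-flow": True}})
rejected = holder.swap({"partial": "payload"})
```

The in-flight reader still sees the snapshot it started with, and the rejected payload never becomes visible to anyone.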
Eval safety
A bad rule shouldn't crash your app. Wrap every evaluation and fall back to the flag's default:
def evaluate(self, flag_key, user):
    try:
        return self._eval(flag_key, user)
    except Exception as e:
        # Never let a bad rule propagate into application code
        log.warning(f"flag eval failed {flag_key}: {e}")
        return self.ruleset.get(flag_key, {}).get('default')
Eval logs
For experimentation analytics, you need to record "this user saw value X at time T." You can't emit one event per evaluation — at 1B/day that's ~11,600/sec of write traffic. Instead: buffer locally and send batched events every 30 seconds, reuse the same evaluation for the same user within a session, or aggregate counts per minute per (flag, value) at the SDK so you send rates rather than individual events.
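The third option, aggregating counts per (flag, value) in the SDK, can be sketched as below. The class name `ExposureAggregator` and the batch shape are assumptions; a real SDK would call `flush()` from a background timer and send the result to its events endpoint.

```python
import threading
from collections import Counter

class ExposureAggregator:
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def record(self, flag_key: str, value) -> None:
        # Called on every evaluation: one dictionary increment, no I/O.
        with self._lock:
            self._counts[(flag_key, value)] += 1

    def flush(self) -> list:
        # Swap in an empty counter so recording continues while we send.
        with self._lock:
            counts, self._counts = self._counts, Counter()
        return [{"flag": flag, "value": value, "count": n}
                for (flag, value), n in counts.items()]

agg = ExposureAggregator()
for _ in range(3):
    agg.record("new-checkout-flow", True)
agg.record("new-checkout-flow", False)

batch = agg.flush()
empty = agg.flush()
```

Four evaluations collapse into two rows, and the payload size depends on the number of distinct (flag, value) pairs, not on request volume. The lock here is a simplification; a high-throughput SDK might use per-thread counters instead.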
Control plane storage
CREATE TABLE flags (
    key VARCHAR PRIMARY KEY,
    type VARCHAR, -- bool / string / number / json
    default_value JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE flag_environments (
    flag_key VARCHAR REFERENCES flags(key),
    env_key VARCHAR,
    rules JSONB, -- full rule definition
    version INT, -- increments on each change
    updated_at TIMESTAMP,
    updated_by VARCHAR,
    PRIMARY KEY (flag_key, env_key)
);

CREATE TABLE flag_audit (
    id BIGSERIAL PRIMARY KEY,
    flag_key VARCHAR,
    env_key VARCHAR,
    version INT,
    rules_before JSONB,
    rules_after JSONB,
    changed_by VARCHAR,
    changed_at TIMESTAMP
);
Postgres handles this easily — 1,000 writes/day with rich JSONB rules. The audit table is append-only. The version integer lets SDKs ask "anything newer than version N?" for efficient incremental sync.
Safety: don't break production
A feature-flag service has unusual blast radius. One bad value can take down every app at once. Real services invest in several safety layers.
Production flag changes require N approvals before taking effect — mirror the code-review gate your team already uses. Even approved changes should ramp gradually: 1% → 10% → 50% → 100%, with a mandatory bake window at each step enforced by the dashboard so no one can skip straight to a full rollout.
When something goes wrong, a one-click revert to the previous version must always be available, with no approval gate blocking it. Separately, edge distributors should continuously checksum the ruleset they serve; any divergence from the source of truth triggers an immediate alert.
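For the checksum comparison to be meaningful, the control plane and every edge node must hash the same bytes for the same ruleset. One way to get that, assuming rulesets are JSON-serializable, is to hash a canonical encoding with sorted keys and fixed separators. The function name `ruleset_checksum` and the choice of SHA-256 are illustrative.

```python
import hashlib
import json

def ruleset_checksum(ruleset: dict) -> str:
    # Canonical form: sorted keys and no insignificant whitespace, so two
    # logically equal rulesets always serialize to identical bytes.
    canonical = json.dumps(ruleset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

source_of_truth = {"new-checkout-flow": {"default": False, "version": 7}}
edge_copy = {"new-checkout-flow": {"version": 7, "default": False}}   # same data, different key order
drifted_copy = {"new-checkout-flow": {"default": True, "version": 7}}

in_sync = ruleset_checksum(edge_copy) == ruleset_checksum(source_of_truth)
diverged = ruleset_checksum(drifted_copy) != ruleset_checksum(source_of_truth)
```

An edge node could report this digest alongside its health check, and the control plane would alert on any mismatch.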
Stale-write protection. When the dashboard saves, it sends the expected_version. If another user updated the flag in the meantime, the write is rejected — optimistic locking.
UPDATE flag_environments
SET rules = $new_rules, version = version + 1
WHERE flag_key = $key AND env_key = $env AND version = $expected;
-- 0 rows updated → 409 Conflict
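The same compare-and-set update can be exercised end to end with an in-memory SQLite database standing in for Postgres. The trimmed table and the helper name `save_rules` are assumptions for the demo; the `WHERE ... AND version = ?` clause is the part that carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE flag_environments (
        flag_key TEXT, env_key TEXT, rules TEXT, version INTEGER,
        PRIMARY KEY (flag_key, env_key)
    )
""")
conn.execute(
    "INSERT INTO flag_environments VALUES ('new-checkout-flow', 'prod', '[]', 1)"
)

def save_rules(conn, flag_key, env_key, new_rules, expected_version) -> bool:
    """Return True if the write won; False means the caller should get a 409."""
    cur = conn.execute(
        "UPDATE flag_environments "
        "SET rules = ?, version = version + 1 "
        "WHERE flag_key = ? AND env_key = ? AND version = ?",
        (new_rules, flag_key, env_key, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1

# Two dashboard users both loaded version 1, then both click save.
first = save_rules(conn, "new-checkout-flow", "prod",
                   '[{"rollout": {"true": 10, "false": 90}}]', 1)
second = save_rules(conn, "new-checkout-flow", "prod",
                    '[{"rollout": {"true": 50, "false": 50}}]', 1)

stored_version = conn.execute(
    "SELECT version FROM flag_environments WHERE flag_key = 'new-checkout-flow'"
).fetchone()[0]
```

The first save matches version 1 and bumps it to 2; the second still expects version 1, matches no rows, and is rejected without overwriting the first user's change.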
Finally, rules must parse and validate before they are saved — refusing a malformed ruleset at write time is far cheaper than letting every SDK fetch a bad payload and fall back to defaults.
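A write-time validator for the rule format shown earlier could look like this sketch. The specific checks (known operators, required fields, rollout percentages summing to 100, rollout placed last) are assumptions inferred from the example flag, not an exhaustive schema.

```python
# Maps each supported operator to the field that must carry its operand.
ALLOWED_OPS = {"in": "values", "eq": "value"}

def validate_rules(rules: list) -> list:
    """Return a list of human-readable errors; empty means the rules are valid."""
    errors = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            errors.append(f"rule {i}: must be an object")
        elif "if" in rule:
            cond = rule["if"]
            op = cond.get("op")
            if op not in ALLOWED_OPS:
                errors.append(f"rule {i}: unknown op {op!r}")
            elif ALLOWED_OPS[op] not in cond:
                errors.append(f"rule {i}: op {op!r} needs '{ALLOWED_OPS[op]}'")
            if "attribute" not in cond:
                errors.append(f"rule {i}: missing 'attribute'")
            if "serve" not in rule:
                errors.append(f"rule {i}: missing 'serve'")
        elif "rollout" in rule:
            rollout = rule["rollout"]
            if not isinstance(rollout, dict) or not all(
                isinstance(p, int) and p >= 0 for p in rollout.values()
            ):
                errors.append(f"rule {i}: rollout must map values to non-negative integers")
            elif sum(rollout.values()) != 100:
                errors.append(f"rule {i}: rollout sums to {sum(rollout.values())}, expected 100")
            if i != len(rules) - 1:
                errors.append(f"rule {i}: rollout must be the last rule")
        else:
            errors.append(f"rule {i}: needs 'if' or 'rollout'")
    return errors

good = [
    {"if": {"attribute": "country", "op": "in", "values": ["US", "CA"]}, "serve": True},
    {"rollout": {"true": 30, "false": 70}},
]
bad = [
    {"if": {"attribute": "country", "op": "regex", "value": "^U"}, "serve": True},
    {"rollout": {"true": 30, "false": 60}},
]

good_errors = validate_rules(good)
bad_errors = validate_rules(bad)
```

Collecting every error instead of stopping at the first lets the dashboard show the author all problems in one pass.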
Multi-region considerations
flowchart LR
DB[(Primary DB<br/>us-east)] --> KAFKA[Kafka<br/>us-east]
KAFKA --> ED_US[Edge<br/>us-east]
KAFKA -->|mirror| ED_EU[Edge<br/>eu-west]
KAFKA -->|mirror| ED_AP[Edge<br/>ap-southeast]
ED_US --> SDK_US[SDKs US]
ED_EU --> SDK_EU[SDKs EU]
ED_AP --> SDK_AP[SDKs AP]
style KAFKA fill:#ff6b1a,color:#0a0a0f
One write region, many read regions. This is acceptable because flag updates run at ~1,000/day, not 1,000/sec — a single writer eliminates split-brain and keeps the audit log simple. Even with cross-region replication adding 50–150ms to propagation, the human-perceptible latency from "click save" to "all apps updated globally" stays well under 5 seconds.
Why not multi-master writes? Flag updates carry an implicit ordering requirement: "turn on at 1%", "ramp to 10%", "ramp to 50%" must be applied in order. Multi-master without a global sequencer risks reordering. CRDTs handle concurrent-write convergence but don't enforce strict sequential ordering, so they're the wrong tool here. Single-writer gives you total order at trivial cost because the write rate is so low.
Stale-read window. Between "update written in us-east" and "Kafka mirror delivers to eu-west edge", SDKs in that region serve the previous ruleset. For gradual rollouts and kill switches at this write rate, a 200–500ms stale window is acceptable. For hard legal flags ("show EU disclosure"), consider a synchronous cross-region write before the API returns 200.
Cost analysis
For 1B evaluations/day across 1,000 client services:
- Evaluations: free (in-process, no network).
- SDK ↔ edge connections: ~1,000–10,000 long-lived SSE connections per region (1,000 services × 1–10 pods each). Trivial.
- Edge nodes: 5 per region × 3 regions = 15 small machines ($20–40/machine/month on any cloud). ~$300–600/month.
- Control plane: 1 Postgres + 1 small API service. ~$100–200/month.
- Pub/sub: self-hosted Kafka (3 brokers on small VMs) ~$150–300/month; managed Kafka (MSK, Confluent) starts at $300–500/month for a minimal cluster.
Total: roughly $550–1,100/month with self-hosted pub/sub, or about $700–1,300/month and up with managed Kafka, at 1B evals/day, mostly fixed cost. Hosted services charge per MAU because they're charging for SDK metadata and analytics, not per-eval compute; the in-process evaluation model means raw eval cost is effectively zero.
Variants you might be asked about
If you need A/B testing, add per-flag analytics tracking which variant each user saw and what conversion rate each arm achieved. That requires event collection, attribution windows, and statistical significance — a distinct problem that Statsig and Optimizely specialize in.
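For per-user manual overrides ("set new-checkout-flow = true for user_id=12345 specifically"), store overrides in a targets table and have the SDK check it before evaluating rules. Related: sticky users. Once someone is assigned to "true" during a rollout, they should stay there. The deterministic hash above handles this across requests and as you ramp up, with no server-side state: a user in bucket 12 is in at 30% and still in at 50%. It does not cover a ramp-down: dropping from 50% to 30% removes buckets 30 to 49, so keeping those users on "true" requires recording their assignment, for example as overrides in the targets table.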
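The override check is a lookup that runs before any rule. A minimal sketch, assuming overrides are shipped to the SDK as a mapping keyed by (flag key, user ID) and that `evaluate_rules` is whatever rule evaluator the SDK already has (both names are hypothetical):

```python
def evaluate_with_overrides(flag_key, user_id, overrides, evaluate_rules):
    # Explicit per-user targets take priority over every rule and rollout.
    key = (flag_key, user_id)
    if key in overrides:
        return overrides[key]
    return evaluate_rules(flag_key, user_id)

overrides = {("new-checkout-flow", "12345"): True}

def rules_say_off(flag_key, user_id):
    # Stand-in for the real rule evaluator; everyone gets False here.
    return False

pinned = evaluate_with_overrides("new-checkout-flow", "12345", overrides, rules_say_off)
regular = evaluate_with_overrides("new-checkout-flow", "67890", overrides, rules_say_off)
```

Checking membership with `in` instead of `.get()` matters when an override pins a user to `False` or `None`, which a truthiness test would skip.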
Some flags carry legal weight ("show this EU disclosure to all EU users"). Those environments need immutable audit logs (write-once S3 or similar), four-eye approvals, and cryptographically signed rule deployments to satisfy compliance requirements.
Failure modes
| Failure | Behavior |
|---|---|
| Control plane down | SDKs keep using cached rules. New SDK boots still fetch from the edge, or fall back to the disk snapshot and then compiled defaults. No flag changes until it recovers. |
| Edge node down | SDKs reconnect to a healthy node. |
| Pub/sub down | Edge nodes serve stale rules; updates delayed. Recovers when bus returns. |
| Bad rule deployed | Revert button. SDKs swap back in seconds. |
| SDK loses its cache file | Falls back to compiled defaults. |
| App can't reach edge at boot | Falls back to disk cache, then compiled defaults. |
In every case: applications keep running. That's the design property that matters.
Trade-offs to discuss in an interview
- Why is evaluation in-process, not server-side? Latency, scale, blast radius. A remote call adds network overhead on every request and creates a new SPOF; in-process costs nothing and removes the dependency.
- Why does the SDK cache rules on disk? So a cold start survives a flag service outage. Without this, every new pod started during an outage would run on compiled-in defaults alone, silently reverting any flag that has since been turned on.
- Why is sub-5-second propagation achievable even though CAP says you give up consistency? Eventual consistency is fine here. The system never needs to be globally consistent at an instant — it just needs to converge within seconds, which SSE from regional edge nodes delivers reliably.
- Why not just a config file reloaded on a cron? It works, but propagation lag is unacceptable for kill switches. A 1-minute cron means up to 60 seconds of a broken feature still running in production.
Things you should now be able to answer
- Why is the SDK + edge distributor pattern universal for flag services?
- How do you make a 30% rollout sticky for individual users?
- Why must compiled-in defaults be "safe"?
- What happens to clients when your control plane is down?
- How does the audit table work, and why is version important on writes?
Further reading
- LaunchDarkly's engineering blog
- Flagsmith open-source code — a working implementation
- OpenFeature spec (openfeature.dev) — a CNCF incubating project; vendor-agnostic standard SDK API, providers available for LaunchDarkly, Flagsmith, Unleash, and more
- "Experimentation at Scale" — Statsig and Eppo blogs
- Real-time communication patterns on this site
Frequently asked questions
What is the core architectural insight behind LaunchDarkly, Statsig, Flagsmith, and Unleash?
All of them move flag evaluation into the application process rather than routing calls to a remote server. Each service embeds an SDK that holds the full ruleset in memory and evaluates locally in microseconds. The flag service's only job is to stream rule updates, not answer individual evaluation calls.
How does a feature flag service achieve sub-5-second propagation to every running service?
When a flag changes, the control plane writes to Postgres and publishes to an event bus (Kafka or Redis Streams). Regional edge distributors consume that event and push the new ruleset to every connected SDK via SSE or WebSocket. End-to-end round trip for an in-house system is 1-3 seconds at p99; LaunchDarkly achieves 200ms or less via CDN-backed streaming across 100-plus global PoPs.
What happens to applications when the flag service goes completely down?
Applications keep running without degradation. SDKs that were already running continue using their last-known in-memory ruleset. New processes starting during the outage fall back to the most recent ruleset saved to local disk, and if that is absent, fall back to compiled-in safe defaults (typically false for kill switches and unreleased features).
How do deterministic percentage rollouts work, and why is the flag key used as a hash salt?
The SDK computes murmur3(flag_key + ":" + user_id) % 100 to assign each user a stable bucket from 0 to 99. Salting with the flag key ensures different flags independently distribute users; without the salt, the same 30% of users would receive every new feature simultaneously, making experiment results meaningless due to correlated assignments.
Why use a single write region for flag updates even in a multi-region deployment?
Flag updates carry an implicit ordering requirement — ramp steps like 1% to 10% to 50% must be applied in sequence. A single primary writer eliminates split-brain and enforces total order at trivial cost because the write rate is only roughly 1,000 flag changes per day, not 1,000 per second. Cross-region Kafka mirroring adds 50-150ms to propagation but the human-perceptible latency from save to global update still stays well under 5 seconds.