Design a Shopping Cart & Checkout System
Keep a cart consistent across devices, then check out without overselling or double-charging. The available-cart vs consistent-checkout split, inventory holds, and the order saga.
The problem
Amazon's cart holds billions of active line items at any moment. Shopify powers millions of storefronts, each running its own cart-and-checkout pipeline. At their core, both do the same thing: let a shopper collect items over time, then exchange money for those items in a single atomic step. The cart half is casual and forgiving. The checkout half is unforgiving and exact.
The cart looks simple — it is just a list of SKUs and quantities. But the moment you add multi-device sync (phone adds a shoe, laptop removes it, tablet reads the result) and anonymous-to-logged-in transitions (guest session merges into a real account on login), a plain SQL row is no longer the right tool. The cart must stay available even during partial outages; a shopper who can't add an item is a lost sale.
Checkout is the opposite problem. Once a shopper hits "Place Order," two shoppers cannot both buy the last unit, and a payment that succeeds cannot silently produce a failed order. The system must span three separate services — inventory, order creation, and a payment processor — and if any step fails, it must undo the preceding ones cleanly. This is where exactly-once semantics, distributed sagas, and idempotency keys matter.
The core engineering tension is that these two halves demand fundamentally different storage strategies. The cart wants high availability and eventual consistency; checkout wants strong consistency and atomic multi-step commits across services that don't share a database. Getting both right in one user flow, and getting the handoff between them right, is what makes this a compelling interview problem.
Functional requirements
- PUT /cart/items: add or update a line item (user or guest session).
- GET /cart: fetch current cart; works on any device, any session.
- DELETE /cart/items/{sku}: remove a line item.
- On login: merge guest cart into user cart.
- POST /orders: place an order (validate prices and promotions, reserve inventory, charge payment, create order record).
- Order history and status via GET /orders.
Non-functional requirements
- Cart availability over consistency — a user must be able to add to cart even when downstream services are degraded. Stale cart data is acceptable; lost carts are not.
- Checkout must be strongly consistent — no overselling, no double-charges, no "payment succeeded but order failed" ghosts.
- Idempotency — "place order" must be safe to retry (network drops, double-clicks).
- Low checkout latency — p99 under 3 seconds including payment authorization.
- Abandoned cart handling — inventory held at checkout must be released if the user doesn't complete payment.
- Scale — 50M DAU, read/write ratio on carts ~10:1, checkout rate ~1–2% of cart events.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| DAU | 50M users | baseline assumption |
| Add-to-cart rate (avg) | ~2,800/sec | 50M × 1 / (5 × 3,600) — one add-to-cart per user per 5 hours |
| Add-to-cart rate (peak) | ~10,000/sec | 2–4× evening spike |
| Checkout rate (peak) | ~150/sec | 10,000 × 1.5% — ~1.5% of add-to-cart events complete as orders |
| Cart size | 800 B per cart | avg 4 line items × 200 B each |
| Active cart data | 8 GB raw · ~24 GB with replication | 10M carts × 800 B = 8 GB; 20% of DAU have a live cart; 3× replication |
| Inventory reads | ~600/sec | 150 checkouts/sec × 4 items — negligible for sharded Postgres; cache aggressively |
| Inventory writes (reservations) | ~600/sec typical · 10,000+/sec flash sale | same as inventory reads; hot SKUs spike dramatically — see failure modes |
| Orders write throughput | ~300 KB/sec | 150 orders/sec × 2 KB — trivial |
| Orders volume | ~13M/day · ~5B/year | 150 × 86,400 ≈ 13M/day if the peak rate were sustained all day, so treat it as an upper bound; shard by user_id or order_id |
Takeaway: Cart storage (8 GB raw, 24 GB replicated) fits comfortably in a mid-size Redis cluster; checkout throughput is modest at ~150/sec — the dominant scaling pressure is hot-SKU lock contention during flash sales, not raw write volume.
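The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch that only restates the table's own assumptions (50M DAU, one add per user per 5 hours, 1.5% checkout conversion, 4 items of 200 B per cart, 2 KB per order); the variable names are illustrative.

```python
# Back-of-envelope capacity numbers, using the assumptions from the table.
DAU = 50_000_000
ADD_WINDOW_SECONDS = 5 * 3_600                  # one add-to-cart per user per 5 hours

avg_adds_per_sec = DAU / ADD_WINDOW_SECONDS     # about 2,778
peak_adds_per_sec = 10_000                      # the table's rounded peak figure
peak_checkouts_per_sec = peak_adds_per_sec * 15 // 1000   # 1.5% of adds

active_carts = DAU * 20 // 100                  # 20% of DAU have a live cart
cart_bytes = 4 * 200                            # 4 line items x 200 B
raw_cart_gb = active_carts * cart_bytes / 1e9
replicated_cart_gb = raw_cart_gb * 3            # 3x replication

inventory_ops_per_sec = peak_checkouts_per_sec * 4    # 4 items per checkout
orders_kb_per_sec = peak_checkouts_per_sec * 2        # 2 KB per order
orders_per_day = peak_checkouts_per_sec * 86_400      # if peak were sustained

print(round(avg_adds_per_sec), peak_checkouts_per_sec, raw_cart_gb, replicated_cart_gb)
```

Because the daily order count multiplies the peak rate by a full day, it is best read as a ceiling for sizing rather than an expected volume.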
Building up to the design
The interesting thing about this problem is that it's really two problems wearing the same clothes. The cart and the checkout look like a single flow to the user, but they need completely different storage strategies. Walking the evolution makes that split obvious.
V1: One database, one cart table
CREATE TABLE cart_items (
user_id BIGINT,
sku VARCHAR(64),
qty INT,
PRIMARY KEY (user_id, sku)
);
On POST /orders, read cart, check inventory in code, debit inventory, insert order, clear cart — all in one transaction. This is correct, atomic, and handles thousands of users without breaking a sweat.
The problem shows up when you need the cart to be available during partial outages. Postgres replication lags, or a primary failover happens, and suddenly your "simple list of items" is unavailable to users who just want to browse and add things. There's also no clean story for multi-device sync or conflict resolution built into SQL transactions.
V2: Move the cart to Redis
Cart items go into a Redis hash: HSET cart:user:{user_id} {sku} {qty}. Reads and writes are in-memory and sub-millisecond. If Redis has problems, you fall back to a cookie or a degraded mode: the user might lose some in-flight state, but they can keep shopping. In V1, a Postgres outage would have taken the whole cart experience down.
Multi-device sync also becomes trivial: every device reads the same Redis key.
The new problem is the anonymous user. A guest fills a cart keyed by session_id. When they log in, you now have two carts — cart:guest:{session_id} and cart:user:{user_id} — and you need a merge strategy.
V3: Guest cart merge on login
The merge is a per-line-item union. For each SKU in the guest cart, if the user cart already has that SKU, you need a reconciliation rule; if it doesn't, you copy the guest item over.
The right rule is max-register merge on quantity — not "add quantities." A shopper who added 2 pairs of shoes on their phone and already has 1 pair in their saved cart probably wants 2, not 3. They changed their mind on quantity, not added a second purchase. So:
merged_qty(sku) = max(guest_qty, user_qty) # larger quantity wins, regardless of which edit is newer
This is structurally the same as a CRDT merge — take the max, which is monotonically non-decreasing, so concurrent edits from two devices can never cause a quantity to silently go backward. Product teams sometimes surface this as a UI prompt ("Your guest cart has item X (qty 2), your saved cart has qty 1 — keep 2?"), but as a background merge, taking the higher quantity is the conservative safe choice.
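As a minimal sketch of the rule above, the merge can be written as a pure function over two dictionaries. The function name and the sample SKUs are illustrative, not taken from any particular cart service.

```python
def merge_carts(guest: dict[str, int], user: dict[str, int]) -> dict[str, int]:
    """Union of SKUs; on a conflict keep the larger quantity (max-register)."""
    merged = dict(user)
    for sku, qty in guest.items():
        merged[sku] = max(qty, merged.get(sku, 0))
    return merged


guest = {"SKU-SHOE-RED-10": 2, "SKU-SOCK-WHT-10": 3}
user = {"SKU-SHOE-RED-10": 1, "SKU-BELT-BLK-M": 1}
merged = merge_carts(guest, user)
print(merged)
```

Swapping the arguments or merging the same guest cart twice gives the same result, which is what makes the rule safe to retry. The cost is that a merge can never lower a quantity: a saved cart with 3 and a guest cart with 1 merges to 3, so deliberate reductions have to happen through an explicit cart update rather than through the merge.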
The remaining gap: checkout. The single-transaction approach from V1 can't span Redis, Postgres, and a payment processor.
V4: The checkout saga
Once you fan out across services, a single database transaction is no longer available to you. Checkout becomes a sequence of local transactions — reserve inventory, create order, charge payment — and any one of them can fail. If payment fails after inventory was reserved, you must release the reservation. This is the saga pattern: a sequence of forward steps with compensating transactions that undo each one if something goes wrong later.
flowchart LR
V1["V1: SQL cart + checkout<br/>Simple, single DB"] --> V2["V2: Redis cart<br/>Available, fast reads"]
V2 --> V3["V3: + guest merge<br/>max-register per-item reconciliation"]
V3 --> V4["V4: Checkout saga<br/>Reserve → charge → confirm"]
V4 --> V5["V5: + idempotency keys<br/>+ abandoned-cart TTL<br/>+ promotions engine"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
C[Client: browser / mobile] --> GW[API Gateway + Auth]
GW --> CART[Cart Service]
GW --> CHK[Checkout Service]
CART --> REDIS[(Redis<br/>cart by user_id → items)]
CART -.guest merge.-> REDIS
CHK --> PRICE[Pricing + Promotions Engine]
CHK --> INV[Inventory Service]
CHK --> ORD[Order Service]
CHK --> PAY[Payment Service]
INV --> IDB[(Inventory DB<br/>Postgres — sharded by SKU)]
ORD --> ODB[(Orders DB<br/>Postgres — sharded by user_id)]
PAY --> PSP[Payment Processor<br/>Stripe / Adyen]
CHK --> SAGA[Saga Coordinator]
SAGA -.compensate.-> INV
SAGA -.compensate.-> PAY
SAGA --> KAFKA[Kafka: order events]
KAFKA --> NOTIFY[Notification Service]
KAFKA --> ANA[Analytics]
style CART fill:#0e7490,color:#fff
style REDIS fill:#15803d,color:#fff
style CHK fill:#ff6b1a,color:#0a0a0f
style SAGA fill:#a855f7,color:#fff
style INV fill:#ffaa00,color:#0a0a0f
style PAY fill:#ff2e88,color:#fff
The cart: available by design
Storage schema (Redis)
Each user's cart is a Redis hash. Line items are fields, quantities are values:
HSET cart:user:9182736 "SKU-SHOE-RED-10" "2"
HSET cart:user:9182736 "SKU-BELT-BLK-M" "1"
HSET cart:user:9182736 "SKU-SOCK-WHT-10" "3"
EXPIRE cart:user:9182736 2592000 # 30-day TTL; refresh on activity
For guest sessions:
HSET cart:guest:sess_abc123 "SKU-SHOE-RED-10" "1"
EXPIRE cart:guest:sess_abc123 86400 # 1-day TTL for anonymous sessions
Why Redis over Postgres for the cart? The cart is read many times for every checkout — every page view, every ad re-targeting event, every "you left something in your cart" nudge. Reads outnumber checkouts by 50–100:1. Redis gives sub-millisecond reads, and its native hash operations make per-item updates atomic at the field level without locking.
Eventual consistency is acceptable here because the cart is not money. If two devices add the same item concurrently and one update is delayed, the worst outcome is a transient inconsistency in quantity, resolved on the next read. The authoritative quantity check happens at checkout, not in the cart. Losing an entire cart (data loss) is unacceptable; losing one concurrent update is tolerable.
Guest-to-user cart merge
sequenceDiagram
participant App
participant CartSvc as Cart Service
participant Redis
App->>CartSvc: POST /login (user logs in)
CartSvc->>Redis: HGETALL cart:guest:{session_id}
Redis-->>CartSvc: guest cart items
CartSvc->>Redis: HGETALL cart:user:{user_id}
Redis-->>CartSvc: user cart items
CartSvc->>CartSvc: merge: union of SKUs, max-register qty on conflict
CartSvc->>Redis: HSET cart:user:{user_id} merged items
CartSvc->>Redis: DEL cart:guest:{session_id}
CartSvc-->>App: merged cart
The merge runs as a short Lua script on Redis to keep it atomic, so no other reader sees a partial merge. Redis Lua scripts are atomic on a single node, but in Redis Cluster mode every key the script touches must hash to the same slot. Hash tags control this: when a key contains a {...} segment, only the text inside the braces is hashed. Keys written as cart:guest:{sess_abc} and cart:user:{9182736} carry different tags and will generally land in different slots, so a two-key script only works if both keys share a tag or the deployment is not clustered. When a shared tag is not practical (the session ID and user ID are unrelated values), one alternative is to read the guest cart in the Cart Service and pass its items as arguments to a script that touches only the user cart key. The two-key version looks like this:
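-- KEYS[1] = guest cart key, KEYS[2] = user cart key
-- Both keys must live in the same hash slot when running on Redis Cluster.
local guest = redis.call('HGETALL', KEYS[1])  -- flat array: field, value, field, value, ...
for i = 1, #guest, 2 do
  local sku = guest[i]
  local qty = tonumber(guest[i + 1])
  -- HGET yields false for a missing field, so fall back to 0
  local existing = tonumber(redis.call('HGET', KEYS[2], sku) or 0)
  if qty > existing then
    redis.call('HSET', KEYS[2], sku, qty)  -- max-register: keep the larger quantity
  end
end
redis.call('DEL', KEYS[1])  -- the guest cart is consumed by the merge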
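Whether two keys share a slot can be checked offline. The sketch below follows the Redis Cluster rule of CRC16 (the XMODEM variant) modulo 16384, hashing only the hash-tag contents when a non-empty tag is present. The tagged key names using `u9182736` are hypothetical; the article's own examples do not share a tag.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 with polynomial 0x1021 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Slot for a key: hash only the first non-empty {...} segment if present."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Hypothetical key layout in which both carts carry the same tag.
same_slot = hash_slot("cart:guest:{u9182736}") == hash_slot("cart:user:{u9182736}")
print(same_slot)
```

A shared tag requires knowing the user ID when the guest key is created, which is usually not the case for anonymous sessions. That is why reading the guest cart in the service and writing only to the user key is often the more practical route.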
Checkout: strongly consistent by requirement
Pricing and promotions at checkout
Never trust a price from the client. At checkout:
- Fetch current prices for all SKUs from the Pricing Service.
- Apply promotions (coupon codes, buy-2-get-1, category discounts) in a deterministic order.
- Compute tax by shipping jurisdiction (typically a call to a tax service like TaxJar or Avalara).
- Present the final total to the user for confirmation before capturing payment.
Many systems hold the computed price for a short window (15 minutes) so the user doesn't see it change mid-checkout if a sale ends while they're entering their card details.
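The recomputation steps above can be sketched as follows. This is an illustration under stated assumptions: `CATALOG` stands in for the Pricing Service, promotions are modelled as (priority, function) pairs, and tax is a flat rate rather than a call to a tax service. None of these names come from a real API.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical server-side prices; the client never supplies these.
CATALOG = {"SKU-SHOE-RED-10": Decimal("80.00"), "SKU-BELT-BLK-M": Decimal("25.00")}


def percent_off(pct):
    return lambda subtotal: subtotal * (Decimal(100) - Decimal(pct)) / Decimal(100)


def amount_off(amount):
    return lambda subtotal: max(subtotal - Decimal(amount), Decimal("0"))


def price_cart(cart, promotions, tax_rate):
    """Recompute the total from server-side prices and a fixed promotion order."""
    subtotal = sum((CATALOG[sku] * qty for sku, qty in cart.items()), Decimal("0"))
    # Sort by priority so the same inputs always produce the same total.
    for _, apply in sorted(promotions, key=lambda p: p[0]):
        subtotal = apply(subtotal)
    tax = (subtotal * Decimal(tax_rate)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return (subtotal + tax).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


cart = {"SKU-SHOE-RED-10": 2, "SKU-BELT-BLK-M": 1}
promos = [(2, amount_off("10.00")), (1, percent_off(10))]
total = price_cart(cart, promos, "0.08")
print(total)
```

Sorting by an explicit priority matters because percentage and fixed-amount discounts do not commute: applying the 10.00 discount before the 10% discount would give a different subtotal.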
Inventory reservation: atomic conditional decrement
The core operation that prevents oversell:
UPDATE inventory
SET reserved = reserved + :qty,
available = stock - reserved - :qty
WHERE sku = :sku
AND (stock - reserved) >= :qty -- only succeeds if sufficient stock remains
RETURNING available;
If the UPDATE affects 0 rows, stock was insufficient — return an out-of-stock error. If it succeeds, the reservation is live. This single statement is atomic in Postgres (row-level locking on the sku row), so two concurrent checkouts for the last unit compete at the database level: exactly one wins.
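Each reservation row carries a reserved_until timestamp. A background job (for example, one scheduled with pg_cron) sweeps expired reservations and returns their quantity to available stock. The sweep sums released quantities per SKU first, because a Postgres UPDATE ... FROM applies only one joined row to each target row:
WITH released AS (
  UPDATE reservations
  SET status = 'RELEASED'
  WHERE reserved_until < now()
    AND status = 'PENDING'
  RETURNING sku, qty
),
totals AS (
  -- one row per SKU, even when several reservations expired together
  SELECT sku, SUM(qty) AS qty FROM released GROUP BY sku
)
UPDATE inventory
SET reserved = reserved - totals.qty,
    available = stock - (reserved - totals.qty)  -- keep the derived column in sync
FROM totals
WHERE inventory.sku = totals.sku;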
flowchart TD
CHK[Checkout Service] -->|"UPDATE ... WHERE (stock - reserved) >= qty"| INVDB[(Inventory DB)]
INVDB -->|0 rows updated| OOS[Return OUT_OF_STOCK]
INVDB -->|1 row updated| RES[Reservation created<br/>reserved_until = now + TTL]
RES --> SAGA[Continue saga]
SWEEP[Background sweeper<br/>pg_cron every minute] -->|"reserved_until < now()"| INVDB2[(Inventory DB)]
INVDB2 -->|"reserved -= qty"| FREE[Stock returned to available]
style CHK fill:#ff6b1a,color:#0a0a0f
style INVDB fill:#15803d,color:#fff
style INVDB2 fill:#15803d,color:#fff
style OOS fill:#ff2e88,color:#fff
style SWEEP fill:#0e7490,color:#fff
This is the same inventory-hold pattern used in flash sale systems — the "reserve now, release if not purchased" model that prevents oversell without holding stock permanently.
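The reserve-then-sweep cycle can be exercised end to end with an in-memory database. This is a sketch only: SQLite stands in for Postgres, timestamps are plain integers passed in by the caller, and `reserve` and `sweep` are hypothetical helper names. SQLite serializes writers, so it shows the logic of the conditional update, not Postgres's row-lock behaviour under real concurrency.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INT NOT NULL,
                        reserved INT NOT NULL DEFAULT 0);
CREATE TABLE reservations (id INTEGER PRIMARY KEY, sku TEXT, qty INT,
                           reserved_until INT, status TEXT DEFAULT 'PENDING');
INSERT INTO inventory (sku, stock) VALUES ('SKU-SHOE-RED-10', 1);
""")


def reserve(sku, qty, now, ttl=900):
    """Conditional decrement. Returns a reservation id, or None if out of stock."""
    with db:  # one transaction covers the inventory update and the reservation row
        cur = db.execute(
            "UPDATE inventory SET reserved = reserved + ? "
            "WHERE sku = ? AND (stock - reserved) >= ?", (qty, sku, qty))
        if cur.rowcount == 0:
            return None
        cur = db.execute(
            "INSERT INTO reservations (sku, qty, reserved_until) VALUES (?, ?, ?)",
            (sku, qty, now + ttl))
        return cur.lastrowid


def sweep(now):
    """Release PENDING reservations whose hold has expired; returns the count."""
    with db:
        expired = db.execute(
            "SELECT id, sku, qty FROM reservations "
            "WHERE status = 'PENDING' AND reserved_until < ?", (now,)).fetchall()
        for res_id, sku, qty in expired:
            db.execute("UPDATE reservations SET status = 'RELEASED' WHERE id = ?",
                       (res_id,))
            db.execute("UPDATE inventory SET reserved = reserved - ? WHERE sku = ?",
                       (qty, sku))
        return len(expired)


first = reserve("SKU-SHOE-RED-10", 1, now=0)     # takes the last unit
second = reserve("SKU-SHOE-RED-10", 1, now=1)    # no stock left
released = sweep(now=901)                        # the first hold ran out at t=900
third = reserve("SKU-SHOE-RED-10", 1, now=902)   # the unit is available again
```

Writing the reservation row in the same transaction as the inventory update is what lets the sweeper later find and undo the hold.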
The checkout saga
The saga runs these steps in order, with compensations for each:
sequenceDiagram
participant CHK as Checkout Service
participant INV as Inventory Service
participant ORD as Order Service
participant PAY as Payment Service
participant PSP as Payment Processor
CHK->>INV: Reserve stock (conditional decrement, TTL ~10–15 min)
INV-->>CHK: reservation_id or OUT_OF_STOCK
CHK->>ORD: Create order (status=PENDING, reservation_id)
ORD-->>CHK: order_id
CHK->>PAY: Authorize payment (order_id, amount, idempotency_key)
PAY->>PSP: Charge card
PSP-->>PAY: auth_code or DECLINED
PAY-->>CHK: auth_code or DECLINED
alt Payment declined
CHK->>ORD: Cancel order (status=CANCELLED)
CHK->>INV: Release reservation (reservation_id)
else Payment authorized
CHK->>ORD: Confirm order (status=CONFIRMED)
CHK->>INV: Commit reservation (convert reserved → sold)
CHK->>PAY: Capture payment (auth_code)
CHK-->>Client: order_id, confirmation
end
| Step | Forward action | Compensation on failure |
|---|---|---|
| Reserve inventory | Conditional decrement + TTL | Release reservation (reserved -= qty) |
| Create order record | INSERT order (PENDING) | UPDATE order status = CANCELLED |
| Authorize payment | PSP authorization call | Void authorization if already issued |
| Confirm order | UPDATE order status = CONFIRMED | (terminal; payment capture follows) |
| Commit reservation | UPDATE reserved → sold in inventory | Reverse commit (sold -= qty, reserved += qty) |
| Capture payment | PSP capture call | Refund if capture already processed |
The saga can be implemented as orchestration (the Checkout Service drives each step synchronously and calls compensations on failure) or choreography (each service listens to events and publishes results). For checkout, orchestration is almost always cleaner. The Checkout Service needs to decide which compensations to call after a partial failure; choreography scatters that decision across event handlers in several services, which makes it hard to reason about which steps have completed and which still need undoing.
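A minimal orchestrator along these lines can be sketched in a few lines. The step names mirror the table, but `run_saga`, `StepFailed`, and the no-op actions are illustrative stand-ins for real service calls.

```python
class StepFailed(Exception):
    """Raised by a forward action that could not complete."""


def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On a failure, run the compensations of the completed steps in reverse
    order. Returns (succeeded, log).
    """
    log, done = [], []
    for name, action, compensate in steps:
        try:
            action()
        except StepFailed:
            log.append(f"{name}: failed")
            for prev_name, prev_compensate in reversed(done):
                prev_compensate()
                log.append(f"{prev_name}: compensated")
            return False, log
        log.append(f"{name}: ok")
        done.append((name, compensate))
    return True, log


def declined():
    raise StepFailed("card declined")


def noop():
    return None


steps = [
    ("reserve_inventory", noop, noop),      # compensation: release the reservation
    ("create_order", noop, noop),           # compensation: mark the order CANCELLED
    ("authorize_payment", declined, noop),  # fails, so nothing of its own to undo
]
ok, log = run_saga(steps)
print(ok, log)
```

In a real system each compensation would need to be idempotent and retried until it succeeds, and the orchestrator's progress would be persisted so that a crash mid-saga can be resumed.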
For deeper background on the pattern, see the saga pattern article and the payment system design.
Idempotency: protecting against double-clicks and retries
The client generates a UUID before submitting the checkout form — the idempotency key — and includes it in the request header:
POST /orders HTTP/1.1
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json
{ "cart_id": "...", "payment_method_id": "..." }
The Checkout Service inserts the key into a durable table before it starts the saga, leaving result NULL while the request is in flight, and fills in the result before returning:
CREATE TABLE idempotency_keys (
key UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
result JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
expires_at TIMESTAMPTZ DEFAULT now() + INTERVAL '24 hours'
);
On a duplicate request with the same key, the service returns the cached result immediately — no second charge, no second order. Here's what that lookup flow looks like in practice:
flowchart LR
REQ[POST /orders<br/>Idempotency-Key: abc123] --> LOOKUP{Key in<br/>idempotency_keys?}
LOOKUP -->|Yes| CACHED[Return cached result<br/>no charge, no order created]
LOOKUP -->|No| SAGA[Run checkout saga]
SAGA --> STORE[Store result in<br/>idempotency_keys]
STORE --> RESP[Return result to client]
style LOOKUP fill:#ff6b1a,color:#0a0a0f
style CACHED fill:#15803d,color:#fff
style SAGA fill:#a855f7,color:#fff
This covers three failure modes that would otherwise bite you. A double-click submits a second request before the first response arrives: both requests carry the same key, so the second finds the key already recorded instead of starting a second saga (this holds only if the key row is written before the saga starts, not after it finishes). A network timeout causes the client to retry: same key, same cached response, no second charge. A mobile app reconnects after a brief disconnect and resubmits: again, same key.
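The lookup flow can be sketched with an in-memory stand-in for the idempotency_keys table. The class and function names are hypothetical, and answering an in-flight duplicate with a 409 is one possible policy rather than something the article prescribes; waiting for the first request to finish is another.

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for the idempotency_keys table (illustration only)."""

    IN_PROGRESS = object()

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def claim(self, key):
        """Atomically insert the key.

        Returns None if this caller now owns the key, otherwise the existing
        value (IN_PROGRESS or a stored result).
        """
        with self._lock:
            if key in self._rows:
                return self._rows[key]
            self._rows[key] = self.IN_PROGRESS
            return None

    def complete(self, key, result):
        with self._lock:
            self._rows[key] = result


def place_order(store, key, run_checkout):
    existing = store.claim(key)
    if existing is IdempotencyStore.IN_PROGRESS:
        return {"status": 409, "error": "request in progress"}
    if existing is not None:
        return existing              # replay the stored result
    result = run_checkout()          # the saga runs once per key
    store.complete(key, result)
    return result


calls = []


def fake_checkout():
    calls.append(1)
    return {"status": 201, "order_id": "ord_1"}


store = IdempotencyStore()
key = "550e8400-e29b-41d4-a716-446655440000"
first = place_order(store, key, fake_checkout)
retry = place_order(store, key, fake_checkout)
print(first == retry, len(calls))
```

Claiming the key before running the saga is what closes the double-click race. The sketch leaves out one case a production version must handle: if the checkout crashes, the key stays in progress until something expires or repairs it.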
The order state machine
stateDiagram-v2
[*] --> PENDING : create order
PENDING --> CONFIRMED : payment authorized + captured
PENDING --> CANCELLED : payment declined or timeout
CONFIRMED --> PROCESSING : warehouse picks order
PROCESSING --> SHIPPED : fulfillment ships
SHIPPED --> DELIVERED : delivery confirmed
DELIVERED --> RETURN_REQUESTED : customer initiates return
RETURN_REQUESTED --> REFUNDED : return approved + refund issued
CONFIRMED --> CANCELLED : customer cancels before processing
CANCELLED --> [*]
REFUNDED --> [*]
DELIVERED --> [*]
Every state transition emits a Kafka event. Downstream services (notifications, analytics, fulfillment, returns) consume those events rather than polling the Orders DB. The Order Service is the single source of truth for order status.
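One way to enforce the diagram is a transition table checked on every status change. This is a sketch under assumptions: `publish` is a stand-in for a Kafka producer call, and the order is a plain dictionary rather than a database row.

```python
ALLOWED = {
    "PENDING": {"CONFIRMED", "CANCELLED"},
    "CONFIRMED": {"PROCESSING", "CANCELLED"},
    "PROCESSING": {"SHIPPED"},
    "SHIPPED": {"DELIVERED"},
    "DELIVERED": {"RETURN_REQUESTED"},
    "RETURN_REQUESTED": {"REFUNDED"},
    "CANCELLED": set(),
    "REFUNDED": set(),
}


def transition(order, new_status, publish):
    """Apply a status change if the diagram allows it, then emit an event."""
    current = order["status"]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {new_status}")
    order["status"] = new_status
    publish({"order_id": order["id"], "from": current, "to": new_status})
    return order


events = []
order = {"id": "ord_1", "status": "PENDING"}
transition(order, "CONFIRMED", events.append)
transition(order, "PROCESSING", events.append)
print(order["status"], len(events))
```

In a real service the status update and the event would need to be made atomic, for example with a transactional outbox, so that a crash between the two cannot leave consumers out of sync with the Orders DB.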
Storage choices
| Data | Store | Reason |
|---|---|---|
| Cart items | Redis (AP) | Available, fast, trivially sharded by user_id |
| Inventory (available / reserved / sold counts) | Postgres (sharded by SKU range) | Needs strong consistency; conditional-write support; row-level locking |
| Reservations | Postgres (collocated with inventory) | Same transaction scope as inventory update |
| Orders | Postgres (sharded by user_id or order_id) | Strong consistency, rich queries, foreign keys |
| Idempotency keys | Postgres or Redis (with TTL) | Short-lived; Redis TTL is operationally simpler |
| Promotions / pricing | Postgres + read-through cache | Changes infrequently; cache with 60-second TTL |
| Order events | Kafka → S3 / data warehouse | Fan-out to notifications, analytics, fulfillment |
| Session / auth tokens | Redis | Short-lived, high-read |
Failure modes
Oversell on hot SKUs
A limited-edition product launches and 5,000 users click "Buy Now" simultaneously. All 5,000 hit the inventory service with UPDATE ... WHERE (stock - reserved) >= 1. Postgres row-level lock on that SKU row serializes them — the first N succeed (N = stock), the rest get 0 rows updated and see "out of stock." The database is doing exactly the right thing.
The risk at extreme concurrency is lock contention: wait queues build up and checkout latency spikes. You can attack this from several directions. Pre-shard inventory by SKU range so a hot item doesn't contend with unrelated stock. Read the available count from cache first and return "out of stock" early for obvious cases before touching the DB. Rate-limit checkout attempts per SKU at the API gateway. For true flash sales (think limited-edition sneakers with 10,000+ concurrent buyers), see Design a Flash Sale System, which uses a Redis DECR counter as a first-pass gate before writing to Postgres.
Double-charge (payment-succeeded, order-failed)
The payment processor returns success, but the network drops before the Checkout Service receives the response. The service retries, and the customer gets charged twice.
The fix: pass the idempotency key to the payment processor's API. Stripe and Adyen both support this natively. The second call with the same key returns the first charge's result without issuing a new charge — the deduplication happens on their side, not just yours.
Payment succeeded but order not created (orphan payment)
A step after payment authorization fails. In the saga above that is the Confirm Order update, since the PENDING order row is created before authorization; say Postgres is temporarily unavailable. Now money is authorized but no confirmed order exists.
This is why authorization and capture are separate steps. The payment is only authorized at step 3 and is not captured until the final step, after the order is confirmed and the reservation committed. An authorization that is never captured eventually expires on its own, after a window that varies by card network and acquirer (Visa card-not-present transactions: 10 calendar days; Visa card-present: 5 calendar days; Mastercard final authorizations: 7 calendar days; specialized merchant preauthorizations such as lodging or vehicle rental: up to 30 days). If the system detects the failure in real time, the compensation step explicitly voids the authorization immediately rather than waiting for that expiry.
Reservation leak (inventory held forever)
Checkout fails after reserving inventory, but the compensation message is lost because the saga coordinator crashes. Stock is permanently held.
This is why the reservation has a reserved_until TTL. The background sweeper unconditionally releases reservations past their TTL, regardless of whether the saga fired its compensation. The compensation path is the fast release; the sweeper is the safety net. Neither one alone is sufficient — together they give you defense in depth.
Cart data loss on Redis failure
When Redis is down, cart reads fail. The options, roughly in order of complexity:
- Redis Sentinel or Cluster with replicas promotes a replica on primary failure with acceptable brief inconsistency. This is what most teams run.
- Write-through to Postgres as a durable fallback: on Redis miss, read from Postgres; write to both on update. More complex to operate but zero data loss.
- Cookie fallback embeds a truncated cart in a signed cookie. For small carts, this is enough for a degraded mode while Redis recovers.
Most production implementations use Redis Cluster (option 1) with Postgres write-through as a recovery path on reconnect, not as a live read path.
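The cookie fallback can be sketched with an HMAC signature so the server can detect tampering. This is an assumption-laden illustration: `SECRET`, the function names, and the 10-item cap are hypothetical, and a real deployment would also add an expiry and key rotation.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-rotate-me"  # hypothetical; load from a secret store in practice


def encode_cart_cookie(cart, max_items=10):
    """Serialize at most max_items line items and append an HMAC-SHA256 signature."""
    items = dict(list(cart.items())[:max_items])
    payload = base64.urlsafe_b64encode(json.dumps(items, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def decode_cart_cookie(cookie):
    """Return the cart dict, or None if the signature does not match."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_cart_cookie({"SKU-SHOE-RED-10": 2, "SKU-BELT-BLK-M": 1})
restored = decode_cart_cookie(cookie)
print(restored)
```

The signature only proves the cookie was issued by the server. Prices and stock are still rechecked at checkout, so a stale cookie cart is harmless.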
Abandoned cart reservation
A user reaches the payment step with inventory reserved, then abandons the browser tab. The reservation TTL — 15 minutes is a common choice — must be long enough for the user to complete payment but short enough that stock isn't tied up indefinitely. After TTL expiry, the sweeper releases the hold. If the user returns and tries to complete checkout after the TTL, they must re-reserve — which may fail if stock ran out while they were away.
Things to discuss in an interview
- The available/consistent split: the cart is AP (Redis); checkout is CP (Postgres with conditional writes). Name this explicitly and justify the choice.
- Max-register cart merge vs. additive merge: why you take the max quantity rather than summing, and the CRDT analogy.
- The idempotency key flow: where it's generated, where it's stored, what it prevents — especially the double-charge scenario.
- Saga orchestration vs. choreography: for checkout, orchestration is simpler to reason about; name the trade-off.
- Inventory reservation TTL: why you use TTL rather than relying on compensations alone — defense in depth against saga coordinator failures.
- Hot-SKU contention: how Postgres row locks serialize access, and when you need to escalate to a Redis-based pre-gate (flash sale pattern).
- Price recomputation at checkout: why you can't trust client-sent prices, and the price-locking window.
Things you should now be able to answer
- Why is the cart stored in Redis rather than Postgres, and what are the trade-offs?
- What merge strategy do you use when a guest cart and a user cart have the same SKU?
- How does the conditional UPDATE ... WHERE (stock - reserved) >= qty prevent overselling without a distributed lock?
- What is the idempotency key, where does it come from, and what failure modes does it prevent?
- Walk me through the compensation steps if payment authorization fails after inventory is reserved.
- How does the reservation TTL protect against inventory leaks when the saga coordinator crashes?
- What happens to a payment authorization if the order creation step fails?
Further reading
- Design a Payment System — payment authorization, capture, and idempotency in depth.
- Saga Pattern for Distributed Transactions — orchestration vs. choreography, failure handling.
- Design a Flash Sale System — inventory pre-gating with Redis, handling 10,000+ concurrent buyers for a limited item.
- "Applying the Saga Pattern" — Caitie McCaffrey, GOTO Chicago 2015 (public talk); an expanded version titled "Distributed Sagas: A Protocol for Coordinating Microservices" was delivered at JOTB 2017.
- Stripe API docs on idempotency keys — stripe.com/docs/api/idempotent_requests.
- "Eventually Consistent" — Werner Vogels, ACM Queue 2008 — the original framing of the AP vs. CP trade-off in practice.
Frequently asked questions
Why is the cart stored in Redis instead of Postgres?
Cart reads outnumber checkouts by 50-100:1, so sub-millisecond in-memory reads matter far more than strong consistency. Redis also keeps the cart available during partial outages where a Postgres primary failover would take the entire cart experience down. Eventual consistency is acceptable for the cart because the authoritative stock check happens at checkout, not at add-to-cart time.
How does the conditional inventory UPDATE prevent overselling without a distributed lock?
The Checkout Service runs a single Postgres statement: UPDATE inventory SET reserved = reserved + qty WHERE sku = :sku AND (stock - reserved) >= qty. If the UPDATE affects 0 rows, stock was insufficient and the caller gets an out-of-stock error. Because Postgres uses row-level locking on the SKU row, two concurrent checkouts for the last unit race at the database level and exactly one wins — no distributed lock required.
What merge strategy resolves a conflict when a guest cart and a user cart contain the same SKU?
The system applies a max-register merge on quantity using max(guest_qty, user_qty), not an additive sum. A shopper who set qty to 2 on their phone and has qty 1 in their saved cart most likely changed their mind rather than intending a combined order of 3. This is structurally equivalent to a CRDT max-register merge, which is monotonically non-decreasing and safe under concurrent edits.
What failure modes does the idempotency key on POST /orders prevent?
It prevents double-charges from three concrete scenarios: a double-click that submits before the first response arrives, a network timeout that causes the client to retry, and a mobile app that resubmits after a brief disconnect. The client generates a UUID before submitting, the Checkout Service stores the (key, result) pair in a durable table, and any duplicate request with the same key returns the cached result with no second order or charge created. The idempotency key is also passed to Stripe or Adyen so deduplication happens on the payment processor side as well.
Why does checkout use saga orchestration rather than choreography?
For the linear reserve-create-authorize-capture flow, the Checkout Service needs to make precise decisions about which compensation to call given partial failure — choreography makes that hard to reason about when failures happen in either direction. Orchestration keeps all the control logic in one place: the Checkout Service drives each step synchronously and explicitly calls compensations (release reservation, void authorization) on failure.