Design Ticketmaster (seat booking / reservations)
Sell limited inventory to a stampede of buyers without double-booking a seat. Reservation holds, a conditional-update concurrency guard, and the read-vs-write consistency split.
The problem
Ticketmaster, StubHub, and similar platforms solve a deceptively simple problem: sell a finite number of numbered seats to a much larger crowd of buyers, without ever selling the same seat twice. At 10 AM when a Taylor Swift on-sale drops, half a million people arrive at once for 20,000 seats. The system has to hand out exactly 20,000 confirmations — not 19,999 (inventory loss) and not 20,001 (double-booking).
The seat map is small. A typical venue has 10,000–80,000 rows in a database table, totalling a few megabytes. What makes this hard is the traffic shape: a single point in time produces an enormous write spike against a tiny, highly contended dataset. Every buyer wants the same popular rows. A naive implementation will either bottleneck on the database primary under that write load or — far worse — issue two confirmations for the same seat when two buyers race through an insufficiently guarded code path.
There are two orthogonal tensions to resolve. First, read/write asymmetry: for every seat booked, hundreds of users are just browsing the seat map. The read path needs aggressive caching and can tolerate short staleness; the write path (the actual hold) needs strong consistency and cannot use the same shortcuts. Second, the stampede problem: half a million simultaneous HTTP requests will collapse a booking API long before a single database lock is even contended. Metering the inbound traffic is as important as getting the locking right.
The ticket-booking problem is the inventory consistency interview in disguise. The seat map is small — 20k rows. The throughput during a hot on-sale is enormous. And the penalty for a mistake — two confirmation emails, one seat, two furious people at the gate — is unacceptable. Getting this right requires separating what must be strongly consistent (the booking write) from what can be eventually consistent (the browsing read), and building a coherent failure story around the boundary.
Functional requirements
GET /events/:id/seatmap— return seat availability for an event.POST /bookings/hold— atomically reserve one or more seats for the authenticated user; returns abooking_idwith a TTL.POST /bookings/:id/confirm— charge the user's payment method and confirm the booking.DELETE /bookings/:id— explicitly release a hold before TTL expiry.- Seat holds expire automatically after ~10 minutes if confirm is never called.
- A confirmed seat cannot be cancelled by another user; cancellation is handled separately (refunds, resale).
Non-functional requirements
- Core invariant: a seat is sold to exactly one buyer — never zero (inventory lost) and never two (double-booking).
- High availability: the event on-sale cannot go down during peak traffic.
- Read latency: seat map page loads < 200ms p99 — many concurrent browsers.
- Write correctness: seat reservation must be strongly consistent — no eventual-consistency shortcuts.
- Fair access: users who arrive earlier (in wall-clock time) should have a higher probability of getting a seat.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Seats per event | 20,000 | Typical large venue |
| Concurrent users (hot on-sale) | 500,000 | Given demand assumption |
| Browse reads (peak) | ~40,000 reads/sec | 500k users × ~5 seat-map refreshes ÷ 60 s on-sale window |
| Seat map response size | ~50 KB | JSON with all seat statuses |
| Read bandwidth | ~2 GB/s | 40,000 reads/s × 50 KB — CDN / cache mandatory |
| Reservation writes (peak) | ~5,000 hold attempts/sec | 500k users attempting in first ~100 s rush window; most fail (only 20k seats, 500k buyers) |
| Row size | ~200 bytes | event_id, seat_id, section, row, number, status, user_id, expires_at |
| Storage per event | ~4 MB | 20k rows × 200 B |
| Total seat storage (annual) | ~800 GB | 200k events/year × 4 MB — fits in a sharded relational DB |
| Active holds per event | up to 20,000 | All seats could be in HOLD simultaneously |
| Expiry scan cadence | every 30 seconds | Index on (event_id, status, expires_at) |
Takeaway: the read path (40k reads/sec, 2 GB/s) is nearly an order of magnitude larger than the write path (5k writes/sec), so caching seat maps is not an optimisation — it is a prerequisite for survival.
5k writes/sec for one event approaches or exceeds single-primary Postgres limits under realistic OLTP conditions (WAL flush is a hard serial bottleneck; practical throughput is ~1,000–2,500 TPS on well-tuned hardware). Sharding by event or vertical scaling is worth flagging here. The read path at 40k/sec is the traffic that needs caching.
Building up to the design
V1: A status column and a SELECT … FOR UPDATE
The minimal correct booking system is a single Postgres table with a status column per seat, and a transaction that uses a row-level lock to prevent races.
CREATE TABLE seats (
id BIGSERIAL PRIMARY KEY,
event_id BIGINT NOT NULL,
section VARCHAR(20) NOT NULL,
row_label VARCHAR(5) NOT NULL,
seat_num INT NOT NULL,
status VARCHAR(12) NOT NULL DEFAULT 'AVAILABLE',
user_id BIGINT,
booking_id UUID,
expires_at TIMESTAMPTZ,
CONSTRAINT seats_event_seat UNIQUE (event_id, section, row_label, seat_num)
);
CREATE INDEX seats_event_status ON seats (event_id, status, expires_at);
A reservation transaction:
BEGIN;
SELECT id FROM seats
WHERE event_id = $1
AND status = 'AVAILABLE'
AND section = $2
AND row_label = $3
AND seat_num = $4
FOR UPDATE; -- row-level lock; blocks concurrent transactions on this row
UPDATE seats
SET status = 'HOLD',
user_id = $user_id,
booking_id = $booking_id,
expires_at = now() + interval '10 minutes'
WHERE id = $seat_id
AND status = 'AVAILABLE'; -- defensive guard even with the lock held
COMMIT;
If two buyers request the same seat simultaneously, one transaction wins the lock; the other blocks and then re-reads status = 'HOLD' after commit, returning a "seat no longer available" error to the second buyer.
This is correct at small scale — zero double-bookings, understandable code. The problem shows up at 5k writes/sec across a hot event: a single Postgres primary is likely at or beyond its practical write ceiling (WAL flush bottleneck; realistic throughput ~1,000–2,500 TPS on well-tuned hardware). Worse, 500k users hitting the server simultaneously — many not even buying, just browsing — will overwhelm the database before a single lock is contended. The read and write traffic need to be separated.
V2: Separate the read and write paths
The seat map browse — "show me which seats are green/red" — doesn't need to read from the primary. It can tolerate 10–30 seconds of staleness. A seat transitioning from AVAILABLE to HOLD will become visible soon enough; users learn it's unavailable when they actually try to reserve it.
Cache the seat map in Redis:
key: seatmap:{event_id}
value: JSON payload of all seats and their statuses
TTL: 30 seconds (or invalidate on every seat status change)
Now the 40k read requests/sec hit Redis, not Postgres. Only the 5k hold/confirm writes reach the primary. Read latency drops to under 5ms, and the database is no longer in the path of every curious browser on the page.
That still leaves the stampede problem. The 500k users at on-sale open don't just browse — they all hit the booking endpoint at the same moment, and the API servers and load balancer saturate before the database even sees the traffic.
V3: Virtual waiting room
Instead of letting every buyer hammer the reservation endpoint directly, put a queue at ingress. When the on-sale opens:
- Each incoming user request gets a queue token — a position in a virtual line.
- The waiting room service maintains a maximum concurrent user count in the active booking flow — say, 2,000 simultaneous checkout sessions. As users complete or abandon checkout, new users are admitted from the front of the queue.
- The remaining users see a "You are in the queue — position ~4,200" page, which updates every few seconds.
- A user who reaches the front gets a short-lived JWT ("admission token") to actually call the booking API.
This converts a thundering-herd stampede into steady, metered load on the booking system. The waiting room itself stores almost nothing durable — queue position is a Redis counter. If the waiting room service restarts, users re-join the queue, which is an acceptable trade-off compared to the complexity of durable queue state.
The booking system never sees more than ~2k concurrent checkout sessions at once, instead of 500k.
V4: Booking saga (hold → charge → confirm)
Payment is a network call to a third-party processor. It can fail, time out, or succeed but have the acknowledgment lost in transit. The booking saga handles this explicitly:
HOLD seat (DB write, strongly consistent)
↓ success
CHARGE payment method (external call, may take 2-5s)
↓ success ↓ failure / timeout
CONFIRM booking RELEASE hold (status → AVAILABLE)
(HOLD → PURCHASED) + return error to user
Every step uses an idempotency key (the booking_id UUID) so retries are safe. If the API server dies between CHARGE and CONFIRM, a reconciliation job compares payment processor records to booking records and resolves the ambiguity: if payment succeeded, confirm; if payment failed, release.
V5: Production Ticketmaster
Row-level locking from V1 + read/write separation from V2 + waiting room from V3 + payment saga from V4 + hold expiry workers + idempotent APIs. Each layer earns its place.
flowchart LR
V1["V1: SELECT FOR UPDATE<br/>correct, single box"] --> V2["V2: + Redis cache<br/>reads isolated"]
V2 --> V3["V3: + waiting room<br/>metered ingress"]
V3 --> V4["V4: + booking saga<br/>payment integrated"]
V4 --> V5["V5: + expiry workers<br/>+ idempotency + sharding"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
U[Browser / App] -->|"on-sale traffic"| WR[Waiting Room Service]
WR -->|"admission token"| LB[Load Balancer]
LB --> API[Booking API cluster]
API -->|"GET seatmap"| CACHE[(Redis<br/>seat map cache)]
CACHE -.miss.-> REPLICA[("DB read replica")]
API -->|"HOLD / CONFIRM"| PRIMARY[("DB primary<br/>Postgres")]
PRIMARY --> REPLICA
API -->|"charge"| PAYGATE[Payment Gateway]
PAYGATE -.failure.-> API
EXPIRE[Hold Expiry<br/>Worker] -->|"status=HOLD AND expires_at < now()"| PRIMARY
EXPIRE -->|"invalidate"| CACHE
RECON[Reconciliation<br/>Worker] -.compares.-> PAYGATE
RECON -.resolves.-> PRIMARY
PRIMARY --> KAFKA[Kafka events]
KAFKA --> NOTIFY[Notification Service]
KAFKA --> ANA[Analytics]
style WR fill:#ff6b1a,color:#0a0a0f
style CACHE fill:#15803d,color:#fff
style PRIMARY fill:#0e7490,color:#fff
style PAYGATE fill:#ffaa00,color:#0a0a0f
style EXPIRE fill:#a855f7,color:#fff
style RECON fill:#ff2e88,color:#fff
Two things to notice here. First, the read replica absorbs all seat-map browsing — the primary only sees hold and confirm writes. Second, there are two background workers: the hold expiry worker recycles abandoned carts, and the reconciliation worker is the safety net for the specific nightmare of a payment succeeding right as the API server crashes.
Seat status lifecycle
The heart of the system is a three-state machine on each seat row:
stateDiagram-v2
[*] --> AVAILABLE : event created
AVAILABLE --> HOLD : reserve (lock acquired)
HOLD --> AVAILABLE : hold expires (TTL) OR user releases OR payment fails
HOLD --> PURCHASED : payment confirmed
PURCHASED --> AVAILABLE : cancellation / refund (future feature)
PURCHASED --> [*]
State transitions are the only place strong consistency is required. Every transition is a single SQL UPDATE with a WHERE status = 'AVAILABLE' (or 'HOLD') predicate — this is the conditional update / optimistic lock that prevents races without requiring explicit SELECT … FOR UPDATE in some implementations.
Alternative implementation using optimistic concurrency:
-- Atomic AVAILABLE → HOLD using conditional update
UPDATE seats
SET status = 'HOLD',
user_id = $user_id,
booking_id = $booking_id,
expires_at = now() + interval '10 minutes'
WHERE event_id = $event_id
AND id = $seat_id
AND status = 'AVAILABLE'; -- the guard predicate
-- rows_affected = 1 → success; = 0 → someone else got there first
No separate lock-acquisition step is required before the write. Under Postgres's default READ COMMITTED isolation, if two UPDATE transactions race on the same row, one acquires the implicit row-level exclusive lock and commits; the second briefly waits for that lock, then re-evaluates the WHERE predicate against the newly committed row version, sees status='HOLD', and gets rows_affected=0 — returning an error to the client. The wait is bounded by the duration of the first transaction (milliseconds for a simple DB write), not an open-ended block as with a SELECT … FOR UPDATE that holds the lock across a long-running external call. This approach avoids the two-round-trip overhead (SELECT then UPDATE) and is preferable to SELECT … FOR UPDATE when the contention rate is low — which it almost always is, since most seats are not contested simultaneously.
The waiting room in depth
A virtual waiting room is the single biggest reliability lever for an on-sale. Without it, the booking API receives 500k requests simultaneously and either returns 503 to most users or passes the full load to the database. Both outcomes are disasters. With a waiting room, the booking system never sees more than ~2k concurrent checkout sessions at once.
sequenceDiagram
participant U as User browser
participant WR as Waiting Room
participant Q as Redis queue counter
participant API as Booking API
U->>WR: GET /onsale/event:123 (at 10:00:00)
WR->>Q: INCR queue:event:123:position → returns position P
WR->>U: "You are position 4,200 — estimated wait 3 min"
loop every 5 sec
U->>WR: GET /queue/status
WR->>Q: GET queue:event:123:admitted_position
WR->>U: "Position 4,200 — now serving 2,100"
end
Note over Q: new users admitted as active sessions complete (target: ~2k concurrent)
WR->>U: "You are next — here is your admission token (JWT, 5 min TTL)"
U->>API: POST /bookings/hold [Authorization: Bearer <admission_token>]
API->>U: booking_id, seat held for 10 minutes
Admission tokens are short-lived JWTs signed by the waiting room service. The booking API verifies the signature before processing any hold request — no valid token, no hold. A user without a valid token gets a 403.
The waiting room stores nothing durable. Queue position is a Redis counter; if the waiting room service restarts, users re-join. This is a deliberate trade-off: durable queue state would require a substantially more complex system, and the UX cost of re-queuing is far lower than the reliability cost of a complex waiting room.
Hold expiry
Holds that are never confirmed — abandoned carts, browser crashes, payment timeouts — must be released back to AVAILABLE so other buyers can claim the seat.
The expiry worker runs every 30 seconds:
flowchart LR
CRON[Expiry worker<br/>every 30s] -->|"WHERE status=HOLD AND expires_at < now()"| DB[("Postgres primary")]
DB -->|"RETURNING event_id, seat_id"| WORKER[Worker]
WORKER -->|"publish invalidation"| CACHE[(Redis seat-map<br/>cache)]
WORKER -->|"optional: emit event"| KAFKA[Kafka<br/>BookingExpired]
style CRON fill:#a855f7,color:#fff
style DB fill:#0e7490,color:#fff
style CACHE fill:#15803d,color:#fff
style KAFKA fill:#ff6b1a,color:#0a0a0f
UPDATE seats
SET status = 'AVAILABLE',
user_id = NULL,
booking_id = NULL,
expires_at = NULL
WHERE status = 'HOLD'
AND expires_at < now()
RETURNING event_id, id;
For each released seat, the worker also publishes an invalidation event to the seat-map cache so Redis reflects the new availability. The window between a hold expiring and the expiry worker running (up to 30 seconds) is acceptable — a browsing user may see a seat as unavailable for up to 30 extra seconds, but no seat is ever permanently lost.
There is one race worth calling out: what if the payment is in-flight when the hold expires? The booking saga must check that the hold is still valid before confirming a payment. If the payment succeeds but the hold has already expired, the correct behavior is an immediate refund — the seat may already be held by someone else.
-- CONFIRM is only valid if the hold still belongs to this booking
UPDATE seats
SET status = 'PURCHASED',
expires_at = NULL
WHERE booking_id = $booking_id
AND status = 'HOLD' -- guard: hold must still be active
AND expires_at > now(); -- guard: not expired
-- rows_affected = 0 → hold expired → must refund payment
Payment saga and compensation
The booking saga has three happy-path steps and multiple compensation paths:
sequenceDiagram
participant API as Booking API
participant DB as Inventory DB
participant PAY as Payment Gateway
participant RECON as Reconciliation Worker
API->>DB: UPDATE seat → HOLD (idempotency key: booking_id)
DB-->>API: rows_affected=1 (success)
API->>PAY: charge(amount, card_token, idempotency_key=booking_id)
alt payment success
PAY-->>API: charge_id, status=succeeded
API->>DB: UPDATE seat → PURCHASED (guard: still HOLD + not expired)
DB-->>API: rows_affected=1 (success)
API->>API: emit BookingConfirmed event
else payment declined
PAY-->>API: status=declined
API->>DB: UPDATE seat → AVAILABLE (release hold)
API->>API: return error to user
else API server crashes between charge and confirm
Note over RECON: Runs every 60 seconds
RECON->>PAY: list charges with idempotency_key=booking_id
PAY-->>RECON: status=succeeded
RECON->>DB: confirm booking (HOLD → PURCHASED)
end
The reconciliation worker is the safety net for the most dangerous failure mode: payment succeeds at the processor, but the API server dies before writing PURCHASED to the database. Without reconciliation, the seat would expire back to AVAILABLE while the user's card was charged — a support nightmare. Most candidates forget this piece; mentioning it in an interview signals that you think through the failure boundaries, not just the happy path.
API design
POST /api/v1/bookings/hold
Authorization: Bearer <admission_token>
Idempotency-Key: <client-generated UUID>
Content-Type: application/json
{
"event_id": "evt_9kx2",
"seats": [{"section": "Floor", "row": "A", "num": 12}]
}
→ 200 OK
{
"booking_id": "bkg_7f3a",
"seats": [{"id": 84221, "section": "Floor", "row": "A", "num": 12}],
"expires_at": "2026-06-01T10:12:00Z",
"total_price_cents": 18500
}
POST /api/v1/bookings/bkg_7f3a/confirm
Authorization: Bearer <user_jwt>
Idempotency-Key: <same UUID as hold, or new UUID>
{
"payment_method_id": "pm_card_abc"
}
→ 200 OK
{
"booking_id": "bkg_7f3a",
"status": "PURCHASED",
"confirmation_number": "TM-9981234"
}
DELETE /api/v1/bookings/bkg_7f3a
Authorization: Bearer <user_jwt>
→ 204 No Content
The DELETE endpoint releases the hold immediately (sets status = AVAILABLE) without waiting for TTL expiry. It is a no-op if the booking is already PURCHASED or already AVAILABLE.
Idempotency on hold: a second request with the same Idempotency-Key returns the existing booking instead of creating a new one, as long as it is within the TTL window. Idempotency on confirm: a second confirm request on the same booking is a no-op if the booking is already PURCHASED — it does not re-charge the card.
Storage choices
| Data | Store | Rationale |
|---|---|---|
| Seat inventory (status, holds) | Postgres (primary) | Row-level locking, ACID transactions, conditional updates; money-level consistency |
| Seat map read cache | Redis (hash by event_id) | Sub-millisecond reads; invalidated on status change; 30s TTL fallback |
| Waiting room queue positions | Redis counters | Single INCR/GET per user; ephemeral, re-join on restart is acceptable |
| User accounts | Postgres | Standard relational |
| Booking records (archive) | Postgres → S3/data warehouse | Operational DB is append-only for completed bookings; archive for analytics |
| Booking events | Kafka | Notification, analytics, reconciliation consumers |
| Payment records | Postgres (separate schema) | Strong consistency required; never co-mingle with seat table writes |
Why NOT eventual consistency for seat inventory
Eventual consistency means accepting a window where the system believes something is true that isn't. For seat inventory, that window produces double-bookings. Consider:
Two buyers each read seat A12 as AVAILABLE. Both issue a reserve. An eventually-consistent store may accept both writes, resolving the conflict "later" by keeping one — but both users received a confirmation. That is a double-booking.
Strong consistency for the inventory write is non-negotiable. You may sacrifice consistency in adjacent systems — browsing, notifications, analytics — but not on the status column of the seat row.
This is a fundamental difference from systems like shopping carts (where a product going slightly out-of-stock after you add it to a cart is recoverable) or social feeds (where eventual consistency is the intended behavior). Ticket seats are zero-sum finite inventory with no meaningful recovery path after double-booking.
See also: Design a Flash Sale for the closely related inventory-depletion pattern at even higher write throughput (flash sales often use pre-decrement counters in Redis rather than row-level locks, which trades a small oversell risk for throughput — a trade-off that is acceptable for fungible goods but not for assigned seating).
Failure modes and mitigations
| Failure | Symptom | Mitigation |
|---|---|---|
| Double-booking race | Two users confirm the same seat | Conditional UPDATE with WHERE status='AVAILABLE' guard; only one gets rows_affected=1 |
| Hold expiry vs payment in-flight | Seat released; payment succeeds | Confirm checks expires_at > now() before writing PURCHASED; immediate refund if check fails |
| Payment succeeds, API server crashes | User charged, seat not confirmed | Reconciliation worker matches charge records to booking records every 60s |
| Queue fairness (jumping the line) | Users refresh-spam for better position | Admission tokens are position-bound and short-lived; queue position is assigned at first connection |
| Inventory oversell (many concurrent holds) | All seats HOLD, but some expire | Holds expire back to AVAILABLE; browsing users see updated availability after expiry sweep |
| DB primary failure | Writes fail during on-sale | Promote a replica (seconds with streaming replication + automatic failover e.g. Patroni); in-flight holds may be lost; waiting room buffers users |
| Waiting room service failure | Users lose queue position | Stateless; users re-queue; on-sale may need to be restarted; lower stakes than DB failure |
| Idempotency key collision | Duplicate charge | Client-generated UUID + server-side deduplication table (keyed by idempotency key, TTL 24h) |
Sharding considerations
For a single event at peak, 5k writes/sec to one Postgres table exceeds what a typical single primary can sustain under realistic OLTP workloads. Benchmarks on high-end hardware (fast NVMe, large RAM, tuned connection pooling) show realistic row-level-locking write throughput in the 1,000–2,500 TPS range; WAL flushing is a hard serialization bottleneck. At 5k hold attempts/sec the primary will likely be the bottleneck, even before lock contention on hot seats. Most events are far below this peak — a single well-tuned primary handles the common case — but you should flag vertical scaling limits and be prepared to discuss connection pooling (PgBouncer), write batching, or sharding for a true blockbuster show.
If sharding is required, the two realistic options are:
flowchart LR
REQ["Hold request<br/>event_id=99"] --> ROUTE{hash(event_id)}
ROUTE -->|"events 0-999"| SH0[("Shard 0<br/>all seats for events 0–999")]
ROUTE -->|"events 1000-1999"| SH1[("Shard 1<br/>all seats for events 1000–1999")]
ROUTE -->|"events 2000+"| SH2[("Shard 2<br/>...")]
SH0 --> NOTE["All lock contention<br/>for one event stays<br/>on one shard"]
style ROUTE fill:#ff6b1a,color:#0a0a0f
style SH0 fill:#0e7490,color:#fff
style SH1 fill:#0e7490,color:#fff
style SH2 fill:#0e7490,color:#fff
style NOTE fill:#ffaa00,color:#0a0a0f
Shard by event_id: all seats for one event live on one shard, keeping all seat-lock contention local to one node and avoiding distributed transactions. The downside is that a blockbuster event is always hot on one shard — a classic hot partition. Shard by (event_id, section) spreads a large venue across multiple shards, but buying multiple seats in one transaction now requires a distributed transaction or saga — substantial added complexity.
For most realistic interview scopes, sharding by event_id with vertical scaling of the primary (plus read replicas for the read path) is the right answer. Only revisit if you are told the system handles 1000+ simultaneous hot on-sales.
Things to discuss in an interview
- Why strong consistency here: the cost of a double-booking (legal liability, refund cost, support cost, reputational damage) far exceeds the cost of slightly lower write throughput.
- The read/write split: seat map browsing is roughly 8× more frequent than bookings at peak (40k reads/sec vs 5k writes/sec); keeping reads on Redis/replicas is what makes the system survive the on-sale.
- Hold TTL: what's the right value? Too short (2 min) and users fail checkout under normal conditions. Too long (30 min) and scarce inventory is locked by abandoning users. Ticketmaster's published checkout window is approximately 10 minutes, though it varies by demand; a design choice of 5–10 minutes is a reasonable range that balances checkout conversion against inventory lock-up.
- The reconciliation worker: the most important piece most candidates forget. Without it, a server crash between payment-success and DB-confirm is a money bug.
- Fairness vs throughput: the waiting room makes the on-sale fair but adds ~100ms latency per user-facing request. Is that trade-off worth it? (Yes, almost always.)
- Cancellation and resale: once PURCHASED, releasing a seat is a separate business workflow (refund policy, resale marketplace). Out of scope for the booking system but worth mentioning.
Things you should now be able to answer
- Why can't you use eventual consistency for seat reservations?
- What does
SELECT … FOR UPDATEdo, and when would you use a conditional UPDATE instead? - A user's payment succeeds but the API server crashes before confirming the booking. How do you detect and resolve this?
- How does the waiting room prevent the DB from being overwhelmed during an on-sale?
- How do you ensure a hold that expires while payment is in-flight doesn't produce a double-booking?
- What happens if the hold expiry worker runs late — can inventory be permanently lost?
Further reading
- Martin Fowler — "Saga Pattern" (martinfowler.com) — the compensating-transaction pattern used in the booking saga.
- PostgreSQL docs — "Explicit Locking" —
SELECT … FOR UPDATE,NOWAIT, andSKIP LOCKEDsemantics;SKIP LOCKEDis particularly useful for building a hold-expiry queue without lock contention. - Design a Flash Sale — the higher-throughput cousin of this problem; explores Redis-based inventory counters and the oversell trade-off.
- "Database Internals" (Alex Petrov, O'Reilly) — Chapter on transaction isolation and MVCC; essential background for understanding why conditional updates work without explicit locks.
Frequently asked questions
▸Why can't seat reservations use eventual consistency?
Eventual consistency allows a window where the system accepts two writes for the same seat before resolving the conflict, meaning both buyers receive a confirmation — a double-booking. Ticket seats are zero-sum finite inventory with no meaningful recovery path after the fact, unlike fungible goods such as shopping cart items where a slight oversell is recoverable.
▸What is the difference between SELECT FOR UPDATE and a conditional UPDATE for reserving a seat?
SELECT FOR UPDATE acquires a row-level lock in a separate round-trip before the write, holding that lock across any external calls made in the same transaction. A conditional UPDATE atomically flips status from AVAILABLE to HOLD in a single statement using a WHERE status = 'AVAILABLE' guard predicate; if two transactions race, one commits and the second re-evaluates the predicate, finds status = 'HOLD', and returns rows_affected = 0. The conditional UPDATE is preferred when contention is low because it avoids the two-round-trip overhead and does not hold locks across long-running calls.
▸What hold TTL does Ticketmaster use, and how should you choose it?
Ticketmaster's published checkout window is approximately 10 minutes, though it varies by demand. A design choice of 5 to 10 minutes is a reasonable range: too short (2 minutes) causes legitimate users to fail checkout, while too long (30 minutes) locks scarce inventory for abandoning users.
▸How does the system handle a payment succeeding right before the API server crashes?
A reconciliation worker runs every 60 seconds, comparing payment processor charge records to booking records using the booking_id as the idempotency key. If a charge shows status = succeeded but the seat row is still HOLD (not PURCHASED), the worker writes PURCHASED to the database. Without this worker, the hold would expire back to AVAILABLE while the user's card remained charged.
▸How many concurrent users does a virtual waiting room protect the booking system from, and what is the target checkout concurrency?
A hot on-sale can produce 500,000 simultaneous users. The waiting room meters ingress to a target of approximately 2,000 concurrent checkout sessions at once; users beyond that limit receive a queue position and a polling page updated every 5 seconds. Admission to the booking API requires a short-lived JWT signed by the waiting room service.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.