~/articles/design-ticketmaster

◆◆◆Advancedasked at Amazonasked at Bookingasked at StubHub

Design Ticketmaster (seat booking / reservations)

Sell limited inventory to a stampede of buyers without double-booking a seat. Reservation holds, a conditional-update concurrency guard, and the read-vs-write consistency split.

21 min read2026-06-01Ironclad Academy

#interview #concurrency #inventory #consistency

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Ticketmaster, StubHub, and similar platforms solve a deceptively simple problem: sell a finite number of numbered seats to a much larger crowd of buyers, without ever selling the same seat twice. At 10 AM when a Taylor Swift on-sale drops, half a million people arrive at once for 20,000 seats. The system has to hand out exactly 20,000 confirmations — not 19,999 (inventory loss) and not 20,001 (double-booking).

The seat map is small. A typical venue has 10,000–80,000 rows in a database table, totalling a few megabytes. What makes this hard is the traffic shape: a single point in time produces an enormous write spike against a tiny, highly contended dataset. Every buyer wants the same popular rows. A naive implementation will either bottleneck on the database primary under that write load or — far worse — issue two confirmations for the same seat when two buyers race through an insufficiently guarded code path.

There are two orthogonal tensions to resolve. First, read/write asymmetry: for every seat booked, hundreds of users are just browsing the seat map. The read path needs aggressive caching and can tolerate short staleness; the write path (the actual hold) needs strong consistency and cannot use the same shortcuts. Second, the stampede problem: half a million simultaneous HTTP requests will collapse a booking API long before a single database lock is even contended. Metering the inbound traffic is as important as getting the locking right.

The ticket-booking problem is the inventory consistency interview in disguise. The seat map is small — 20k rows. The throughput during a hot on-sale is enormous. And the penalty for a mistake — two confirmation emails, one seat, two furious people at the gate — is unacceptable. Getting this right requires separating what must be strongly consistent (the booking write) from what can be eventually consistent (the browsing read), and building a coherent failure story around the boundary.

Functional requirements

GET /events/:id/seatmap — return seat availability for an event.
POST /bookings/hold — atomically reserve one or more seats for the authenticated user; returns a booking_id with a TTL.
POST /bookings/:id/confirm — charge the user's payment method and confirm the booking.
DELETE /bookings/:id — explicitly release a hold before TTL expiry.
Seat holds expire automatically after ~10 minutes if confirm is never called.
A confirmed seat cannot be cancelled by another user; cancellation is handled separately (refunds, resale).

Non-functional requirements

Core invariant: a seat is sold to exactly one buyer — never zero (inventory lost) and never two (double-booking).
High availability: the event on-sale cannot go down during peak traffic.
Read latency: seat map page loads < 200ms p99 — many concurrent browsers.
Write correctness: seat reservation must be strongly consistent — no eventual-consistency shortcuts.
Fair access: users who arrive earlier (in wall-clock time) should have a higher probability of getting a seat.

Capacity estimation

Dimension	Estimate	How we got there
Seats per event	20,000	Typical large venue
Concurrent users (hot on-sale)	500,000	Given demand assumption
Browse reads (peak)	~40,000 reads/sec	`500k users × ~5 seat-map refreshes ÷ 60 s on-sale window`
Seat map response size	~50 KB	JSON with all seat statuses
Read bandwidth	~2 GB/s	`40,000 reads/s × 50 KB` — CDN / cache mandatory
Reservation writes (peak)	~5,000 hold attempts/sec	`500k users attempting in first ~100 s rush window`; most fail (only 20k seats, 500k buyers)
Row size	~200 bytes	`event_id, seat_id, section, row, number, status, user_id, expires_at`
Storage per event	~4 MB	`20k rows × 200 B`
Total seat storage (annual)	~800 GB	`200k events/year × 4 MB` — fits in a sharded relational DB
Active holds per event	up to 20,000	All seats could be in HOLD simultaneously
Expiry scan cadence	every 30 seconds	Index on `(event_id, status, expires_at)`

Takeaway: the read path (40k reads/sec, 2 GB/s) is nearly an order of magnitude larger than the write path (5k writes/sec), so caching seat maps is not an optimisation — it is a prerequisite for survival.

5k writes/sec for one event approaches or exceeds single-primary Postgres limits under realistic OLTP conditions (WAL flush is a hard serial bottleneck; practical throughput is ~1,000–2,500 TPS on well-tuned hardware). Sharding by event or vertical scaling is worth flagging here. The read path at 40k/sec is the traffic that needs caching.

Building up to the design

V1: A status column and a SELECT … FOR UPDATE

The minimal correct booking system is a single Postgres table with a status column per seat, and a transaction that uses a row-level lock to prevent races.

CREATE TABLE seats (
  id         BIGSERIAL PRIMARY KEY,
  event_id   BIGINT      NOT NULL,
  section    VARCHAR(20) NOT NULL,
  row_label  VARCHAR(5)  NOT NULL,
  seat_num   INT         NOT NULL,
  status     VARCHAR(12) NOT NULL DEFAULT 'AVAILABLE',
  user_id    BIGINT,
  booking_id UUID,
  expires_at TIMESTAMPTZ,
  CONSTRAINT seats_event_seat UNIQUE (event_id, section, row_label, seat_num)
);

CREATE INDEX seats_event_status ON seats (event_id, status, expires_at);

A reservation transaction:

BEGIN;
SELECT id FROM seats
  WHERE event_id = $1
    AND status = 'AVAILABLE'
    AND section = $2
    AND row_label = $3
    AND seat_num = $4
  FOR UPDATE;          -- row-level lock; blocks concurrent transactions on this row

UPDATE seats
  SET status     = 'HOLD',
      user_id    = $user_id,
      booking_id = $booking_id,
      expires_at = now() + interval '10 minutes'
  WHERE id = $seat_id
    AND status = 'AVAILABLE';  -- defensive guard even with the lock held
COMMIT;

If two buyers request the same seat simultaneously, one transaction wins the lock; the other blocks and then re-reads status = 'HOLD' after commit, returning a "seat no longer available" error to the second buyer.

This is correct at small scale — zero double-bookings, understandable code. The problem shows up at 5k writes/sec across a hot event: a single Postgres primary is likely at or beyond its practical write ceiling (WAL flush bottleneck; realistic throughput ~1,000–2,500 TPS on well-tuned hardware). Worse, 500k users hitting the server simultaneously — many not even buying, just browsing — will overwhelm the database before a single lock is contended. The read and write traffic need to be separated.

V2: Separate the read and write paths

The seat map browse — "show me which seats are green/red" — doesn't need to read from the primary. It can tolerate 10–30 seconds of staleness. A seat transitioning from AVAILABLE to HOLD will become visible soon enough; users learn it's unavailable when they actually try to reserve it.

Cache the seat map in Redis:

key:   seatmap:{event_id}
value: JSON payload of all seats and their statuses
TTL:   30 seconds (or invalidate on every seat status change)

Now the 40k read requests/sec hit Redis, not Postgres. Only the 5k hold/confirm writes reach the primary. Read latency drops to under 5ms, and the database is no longer in the path of every curious browser on the page.

That still leaves the stampede problem. The 500k users at on-sale open don't just browse — they all hit the booking endpoint at the same moment, and the API servers and load balancer saturate before the database even sees the traffic.

V3: Virtual waiting room

Instead of letting every buyer hammer the reservation endpoint directly, put a queue at ingress. When the on-sale opens:

Each incoming user request gets a queue token — a position in a virtual line.
The waiting room service maintains a maximum concurrent user count in the active booking flow — say, 2,000 simultaneous checkout sessions. As users complete or abandon checkout, new users are admitted from the front of the queue.
The remaining users see a "You are in the queue — position ~4,200" page, which updates every few seconds.
A user who reaches the front gets a short-lived JWT ("admission token") to actually call the booking API.

This converts a thundering-herd stampede into steady, metered load on the booking system. The waiting room itself stores almost nothing durable — queue position is a Redis counter. If the waiting room service restarts, users re-join the queue, which is an acceptable trade-off compared to the complexity of durable queue state.

The booking system never sees more than ~2k concurrent checkout sessions at once, instead of 500k.

V4: Booking saga (hold → charge → confirm)

Payment is a network call to a third-party processor. It can fail, time out, or succeed but have the acknowledgment lost in transit. The booking saga handles this explicitly:

HOLD seat (DB write, strongly consistent)
  ↓ success
CHARGE payment method (external call, may take 2-5s)
  ↓ success          ↓ failure / timeout
CONFIRM booking    RELEASE hold (status → AVAILABLE)
(HOLD → PURCHASED) + return error to user

Every step uses an idempotency key (the booking_id UUID) so retries are safe. If the API server dies between CHARGE and CONFIRM, a reconciliation job compares payment processor records to booking records and resolves the ambiguity: if payment succeeded, confirm; if payment failed, release.

V5: Production Ticketmaster

Row-level locking from V1 + read/write separation from V2 + waiting room from V3 + payment saga from V4 + hold expiry workers + idempotent APIs. Each layer earns its place.

flowchart LR
    V1["V1: SELECT FOR UPDATE<br/>correct, single box"] --> V2["V2: + Redis cache<br/>reads isolated"]
    V2 --> V3["V3: + waiting room<br/>metered ingress"]
    V3 --> V4["V4: + booking saga<br/>payment integrated"]
    V4 --> V5["V5: + expiry workers<br/>+ idempotency + sharding"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    U[Browser / App] -->|"on-sale traffic"| WR[Waiting Room Service]
    WR -->|"admission token"| LB[Load Balancer]
    LB --> API[Booking API cluster]

    API -->|"GET seatmap"| CACHE[(Redis<br/>seat map cache)]
    CACHE -.miss.-> REPLICA[("DB read replica")]

    API -->|"HOLD / CONFIRM"| PRIMARY[("DB primary<br/>Postgres")]
    PRIMARY --> REPLICA

    API -->|"charge"| PAYGATE[Payment Gateway]
    PAYGATE -.failure.-> API

    EXPIRE[Hold Expiry<br/>Worker] -->|"status=HOLD AND expires_at < now()"| PRIMARY
    EXPIRE -->|"invalidate"| CACHE

    RECON[Reconciliation<br/>Worker] -.compares.-> PAYGATE
    RECON -.resolves.-> PRIMARY

    PRIMARY --> KAFKA[Kafka events]
    KAFKA --> NOTIFY[Notification Service]
    KAFKA --> ANA[Analytics]

    style WR fill:#ff6b1a,color:#0a0a0f
    style CACHE fill:#15803d,color:#fff
    style PRIMARY fill:#0e7490,color:#fff
    style PAYGATE fill:#ffaa00,color:#0a0a0f
    style EXPIRE fill:#a855f7,color:#fff
    style RECON fill:#ff2e88,color:#fff

Two things to notice here. First, the read replica absorbs all seat-map browsing — the primary only sees hold and confirm writes. Second, there are two background workers: the hold expiry worker recycles abandoned carts, and the reconciliation worker is the safety net for the specific nightmare of a payment succeeding right as the API server crashes.

Seat status lifecycle

The heart of the system is a three-state machine on each seat row:

stateDiagram-v2
    [*] --> AVAILABLE : event created
    AVAILABLE --> HOLD : reserve (lock acquired)
    HOLD --> AVAILABLE : hold expires (TTL) OR user releases OR payment fails
    HOLD --> PURCHASED : payment confirmed
    PURCHASED --> AVAILABLE : cancellation / refund (future feature)
    PURCHASED --> [*]

State transitions are the only place strong consistency is required. Every transition is a single SQL UPDATE with a WHERE status = 'AVAILABLE' (or 'HOLD') predicate — this is the conditional update / optimistic lock that prevents races without requiring explicit SELECT … FOR UPDATE in some implementations.

Alternative implementation using optimistic concurrency:

-- Atomic AVAILABLE → HOLD using conditional update
UPDATE seats
  SET status     = 'HOLD',
      user_id    = $user_id,
      booking_id = $booking_id,
      expires_at = now() + interval '10 minutes'
  WHERE event_id = $event_id
    AND id       = $seat_id
    AND status   = 'AVAILABLE';   -- the guard predicate

-- rows_affected = 1 → success; = 0 → someone else got there first

No separate lock-acquisition step is required before the write. Under Postgres's default READ COMMITTED isolation, if two UPDATE transactions race on the same row, one acquires the implicit row-level exclusive lock and commits; the second briefly waits for that lock, then re-evaluates the WHERE predicate against the newly committed row version, sees status='HOLD', and gets rows_affected=0 — returning an error to the client. The wait is bounded by the duration of the first transaction (milliseconds for a simple DB write), not an open-ended block as with a SELECT … FOR UPDATE that holds the lock across a long-running external call. This approach avoids the two-round-trip overhead (SELECT then UPDATE) and is preferable to SELECT … FOR UPDATE when the contention rate is low — which it almost always is, since most seats are not contested simultaneously.

The waiting room in depth

A virtual waiting room is the single biggest reliability lever for an on-sale. Without it, the booking API receives 500k requests simultaneously and either returns 503 to most users or passes the full load to the database. Both outcomes are disasters. With a waiting room, the booking system never sees more than ~2k concurrent checkout sessions at once.

sequenceDiagram
    participant U as User browser
    participant WR as Waiting Room
    participant Q as Redis queue counter
    participant API as Booking API

    U->>WR: GET /onsale/event:123 (at 10:00:00)
    WR->>Q: INCR queue:event:123:position → returns position P
    WR->>U: "You are position 4,200 — estimated wait 3 min"
    loop every 5 sec
      U->>WR: GET /queue/status
      WR->>Q: GET queue:event:123:admitted_position
      WR->>U: "Position 4,200 — now serving 2,100"
    end
    Note over Q: new users admitted as active sessions complete (target: ~2k concurrent)
    WR->>U: "You are next — here is your admission token (JWT, 5 min TTL)"
    U->>API: POST /bookings/hold  [Authorization: Bearer <admission_token>]
    API->>U: booking_id, seat held for 10 minutes

Admission tokens are short-lived JWTs signed by the waiting room service. The booking API verifies the signature before processing any hold request — no valid token, no hold. A user without a valid token gets a 403.

The waiting room stores nothing durable. Queue position is a Redis counter; if the waiting room service restarts, users re-join. This is a deliberate trade-off: durable queue state would require a substantially more complex system, and the UX cost of re-queuing is far lower than the reliability cost of a complex waiting room.

Hold expiry

Holds that are never confirmed — abandoned carts, browser crashes, payment timeouts — must be released back to AVAILABLE so other buyers can claim the seat.

The expiry worker runs every 30 seconds:

flowchart LR
    CRON[Expiry worker<br/>every 30s] -->|"WHERE status=HOLD AND expires_at < now()"| DB[("Postgres primary")]
    DB -->|"RETURNING event_id, seat_id"| WORKER[Worker]
    WORKER -->|"publish invalidation"| CACHE[(Redis seat-map<br/>cache)]
    WORKER -->|"optional: emit event"| KAFKA[Kafka<br/>BookingExpired]
    style CRON fill:#a855f7,color:#fff
    style DB fill:#0e7490,color:#fff
    style CACHE fill:#15803d,color:#fff
    style KAFKA fill:#ff6b1a,color:#0a0a0f

UPDATE seats
  SET status     = 'AVAILABLE',
      user_id    = NULL,
      booking_id = NULL,
      expires_at = NULL
  WHERE status     = 'HOLD'
    AND expires_at < now()
  RETURNING event_id, id;

For each released seat, the worker also publishes an invalidation event to the seat-map cache so Redis reflects the new availability. The window between a hold expiring and the expiry worker running (up to 30 seconds) is acceptable — a browsing user may see a seat as unavailable for up to 30 extra seconds, but no seat is ever permanently lost.

There is one race worth calling out: what if the payment is in-flight when the hold expires? The booking saga must check that the hold is still valid before confirming a payment. If the payment succeeds but the hold has already expired, the correct behavior is an immediate refund — the seat may already be held by someone else.

-- CONFIRM is only valid if the hold still belongs to this booking
UPDATE seats
  SET status     = 'PURCHASED',
      expires_at = NULL
  WHERE booking_id = $booking_id
    AND status     = 'HOLD'      -- guard: hold must still be active
    AND expires_at > now();      -- guard: not expired

-- rows_affected = 0 → hold expired → must refund payment

Payment saga and compensation

The booking saga has three happy-path steps and multiple compensation paths:

sequenceDiagram
    participant API as Booking API
    participant DB as Inventory DB
    participant PAY as Payment Gateway
    participant RECON as Reconciliation Worker

    API->>DB: UPDATE seat → HOLD (idempotency key: booking_id)
    DB-->>API: rows_affected=1 (success)
    API->>PAY: charge(amount, card_token, idempotency_key=booking_id)

    alt payment success
        PAY-->>API: charge_id, status=succeeded
        API->>DB: UPDATE seat → PURCHASED (guard: still HOLD + not expired)
        DB-->>API: rows_affected=1 (success)
        API->>API: emit BookingConfirmed event
    else payment declined
        PAY-->>API: status=declined
        API->>DB: UPDATE seat → AVAILABLE (release hold)
        API->>API: return error to user
    else API server crashes between charge and confirm
        Note over RECON: Runs every 60 seconds
        RECON->>PAY: list charges with idempotency_key=booking_id
        PAY-->>RECON: status=succeeded
        RECON->>DB: confirm booking (HOLD → PURCHASED)
    end

The reconciliation worker is the safety net for the most dangerous failure mode: payment succeeds at the processor, but the API server dies before writing PURCHASED to the database. Without reconciliation, the seat would expire back to AVAILABLE while the user's card was charged — a support nightmare. Most candidates forget this piece; mentioning it in an interview signals that you think through the failure boundaries, not just the happy path.

API design

POST /api/v1/bookings/hold
Authorization: Bearer <admission_token>
Idempotency-Key: <client-generated UUID>
Content-Type: application/json

{
  "event_id": "evt_9kx2",
  "seats": [{"section": "Floor", "row": "A", "num": 12}]
}

→ 200 OK
{
  "booking_id": "bkg_7f3a",
  "seats": [{"id": 84221, "section": "Floor", "row": "A", "num": 12}],
  "expires_at": "2026-06-01T10:12:00Z",
  "total_price_cents": 18500
}

POST /api/v1/bookings/bkg_7f3a/confirm
Authorization: Bearer <user_jwt>
Idempotency-Key: <same UUID as hold, or new UUID>

{
  "payment_method_id": "pm_card_abc"
}

→ 200 OK
{
  "booking_id": "bkg_7f3a",
  "status": "PURCHASED",
  "confirmation_number": "TM-9981234"
}

DELETE /api/v1/bookings/bkg_7f3a
Authorization: Bearer <user_jwt>

→ 204 No Content

The DELETE endpoint releases the hold immediately (sets status = AVAILABLE) without waiting for TTL expiry. It is a no-op if the booking is already PURCHASED or already AVAILABLE.

Idempotency on hold: a second request with the same Idempotency-Key returns the existing booking instead of creating a new one, as long as it is within the TTL window. Idempotency on confirm: a second confirm request on the same booking is a no-op if the booking is already PURCHASED — it does not re-charge the card.

Storage choices

Data	Store	Rationale
Seat inventory (status, holds)	Postgres (primary)	Row-level locking, ACID transactions, conditional updates; money-level consistency
Seat map read cache	Redis (hash by event_id)	Sub-millisecond reads; invalidated on status change; 30s TTL fallback
Waiting room queue positions	Redis counters	Single INCR/GET per user; ephemeral, re-join on restart is acceptable
User accounts	Postgres	Standard relational
Booking records (archive)	Postgres → S3/data warehouse	Operational DB is append-only for completed bookings; archive for analytics
Booking events	Kafka	Notification, analytics, reconciliation consumers
Payment records	Postgres (separate schema)	Strong consistency required; never co-mingle with seat table writes

Why NOT eventual consistency for seat inventory

Eventual consistency means accepting a window where the system believes something is true that isn't. For seat inventory, that window produces double-bookings. Consider:

Two buyers each read seat A12 as AVAILABLE. Both issue a reserve. An eventually-consistent store may accept both writes, resolving the conflict "later" by keeping one — but both users received a confirmation. That is a double-booking.

Strong consistency for the inventory write is non-negotiable. You may sacrifice consistency in adjacent systems — browsing, notifications, analytics — but not on the status column of the seat row.

This is a fundamental difference from systems like shopping carts (where a product going slightly out-of-stock after you add it to a cart is recoverable) or social feeds (where eventual consistency is the intended behavior). Ticket seats are zero-sum finite inventory with no meaningful recovery path after double-booking.

See also: Design a Flash Sale for the closely related inventory-depletion pattern at even higher write throughput (flash sales often use pre-decrement counters in Redis rather than row-level locks, which trades a small oversell risk for throughput — a trade-off that is acceptable for fungible goods but not for assigned seating).

Failure modes and mitigations

Failure	Symptom	Mitigation
Double-booking race	Two users confirm the same seat	Conditional UPDATE with `WHERE status='AVAILABLE'` guard; only one gets `rows_affected=1`
Hold expiry vs payment in-flight	Seat released; payment succeeds	Confirm checks `expires_at > now()` before writing PURCHASED; immediate refund if check fails
Payment succeeds, API server crashes	User charged, seat not confirmed	Reconciliation worker matches charge records to booking records every 60s
Queue fairness (jumping the line)	Users refresh-spam for better position	Admission tokens are position-bound and short-lived; queue position is assigned at first connection
Inventory oversell (many concurrent holds)	All seats HOLD, but some expire	Holds expire back to AVAILABLE; browsing users see updated availability after expiry sweep
DB primary failure	Writes fail during on-sale	Promote a replica (seconds with streaming replication + automatic failover e.g. Patroni); in-flight holds may be lost; waiting room buffers users
Waiting room service failure	Users lose queue position	Stateless; users re-queue; on-sale may need to be restarted; lower stakes than DB failure
Idempotency key collision	Duplicate charge	Client-generated UUID + server-side deduplication table (keyed by idempotency key, TTL 24h)

Sharding considerations

For a single event at peak, 5k writes/sec to one Postgres table exceeds what a typical single primary can sustain under realistic OLTP workloads. Benchmarks on high-end hardware (fast NVMe, large RAM, tuned connection pooling) show realistic row-level-locking write throughput in the 1,000–2,500 TPS range; WAL flushing is a hard serialization bottleneck. At 5k hold attempts/sec the primary will likely be the bottleneck, even before lock contention on hot seats. Most events are far below this peak — a single well-tuned primary handles the common case — but you should flag vertical scaling limits and be prepared to discuss connection pooling (PgBouncer), write batching, or sharding for a true blockbuster show.

If sharding is required, the two realistic options are:

flowchart LR
    REQ["Hold request<br/>event_id=99"] --> ROUTE{hash(event_id)}
    ROUTE -->|"events 0-999"| SH0[("Shard 0<br/>all seats for events 0–999")]
    ROUTE -->|"events 1000-1999"| SH1[("Shard 1<br/>all seats for events 1000–1999")]
    ROUTE -->|"events 2000+"| SH2[("Shard 2<br/>...")]
    SH0 --> NOTE["All lock contention<br/>for one event stays<br/>on one shard"]
    style ROUTE fill:#ff6b1a,color:#0a0a0f
    style SH0 fill:#0e7490,color:#fff
    style SH1 fill:#0e7490,color:#fff
    style SH2 fill:#0e7490,color:#fff
    style NOTE fill:#ffaa00,color:#0a0a0f

Shard by event_id: all seats for one event live on one shard, keeping all seat-lock contention local to one node and avoiding distributed transactions. The downside is that a blockbuster event is always hot on one shard — a classic hot partition. Shard by (event_id, section) spreads a large venue across multiple shards, but buying multiple seats in one transaction now requires a distributed transaction or saga — substantial added complexity.

For most realistic interview scopes, sharding by event_id with vertical scaling of the primary (plus read replicas for the read path) is the right answer. Only revisit if you are told the system handles 1000+ simultaneous hot on-sales.

Things to discuss in an interview

Why strong consistency here: the cost of a double-booking (legal liability, refund cost, support cost, reputational damage) far exceeds the cost of slightly lower write throughput.
The read/write split: seat map browsing is roughly 8× more frequent than bookings at peak (40k reads/sec vs 5k writes/sec); keeping reads on Redis/replicas is what makes the system survive the on-sale.
Hold TTL: what's the right value? Too short (2 min) and users fail checkout under normal conditions. Too long (30 min) and scarce inventory is locked by abandoning users. Ticketmaster's published checkout window is approximately 10 minutes, though it varies by demand; a design choice of 5–10 minutes is a reasonable range that balances checkout conversion against inventory lock-up.
The reconciliation worker: the most important piece most candidates forget. Without it, a server crash between payment-success and DB-confirm is a money bug.
Fairness vs throughput: the waiting room makes the on-sale fair but adds ~100ms latency per user-facing request. Is that trade-off worth it? (Yes, almost always.)
Cancellation and resale: once PURCHASED, releasing a seat is a separate business workflow (refund policy, resale marketplace). Out of scope for the booking system but worth mentioning.

Things you should now be able to answer

Why can't you use eventual consistency for seat reservations?
What does SELECT … FOR UPDATE do, and when would you use a conditional UPDATE instead?
A user's payment succeeds but the API server crashes before confirming the booking. How do you detect and resolve this?
How does the waiting room prevent the DB from being overwhelmed during an on-sale?
How do you ensure a hold that expires while payment is in-flight doesn't produce a double-booking?
What happens if the hold expiry worker runs late — can inventory be permanently lost?

Frequently asked questions

▸Why can't seat reservations use eventual consistency?

Eventual consistency allows a window where the system accepts two writes for the same seat before resolving the conflict, meaning both buyers receive a confirmation — a double-booking. Ticket seats are zero-sum finite inventory with no meaningful recovery path after the fact, unlike fungible goods such as shopping cart items where a slight oversell is recoverable.

▸What is the difference between SELECT FOR UPDATE and a conditional UPDATE for reserving a seat?

SELECT FOR UPDATE acquires a row-level lock in a separate round-trip before the write, holding that lock across any external calls made in the same transaction. A conditional UPDATE atomically flips status from AVAILABLE to HOLD in a single statement using a WHERE status = 'AVAILABLE' guard predicate; if two transactions race, one commits and the second re-evaluates the predicate, finds status = 'HOLD', and returns rows_affected = 0. The conditional UPDATE is preferred when contention is low because it avoids the two-round-trip overhead and does not hold locks across long-running calls.

▸What hold TTL does Ticketmaster use, and how should you choose it?

Ticketmaster's published checkout window is approximately 10 minutes, though it varies by demand. A design choice of 5 to 10 minutes is a reasonable range: too short (2 minutes) causes legitimate users to fail checkout, while too long (30 minutes) locks scarce inventory for abandoning users.

▸How does the system handle a payment succeeding right before the API server crashes?

A reconciliation worker runs every 60 seconds, comparing payment processor charge records to booking records using the booking_id as the idempotency key. If a charge shows status = succeeded but the seat row is still HOLD (not PURCHASED), the worker writes PURCHASED to the database. Without this worker, the hold would expire back to AVAILABLE while the user's card remained charged.

▸How many concurrent users does a virtual waiting room protect the booking system from, and what is the target checkout concurrency?

A hot on-sale can produce 500,000 simultaneous users. The waiting room meters ingress to a target of approximately 2,000 concurrent checkout sessions at once; users beyond that limit receive a queue position and a polling page updated every 5 seconds. Admission to the booking API requires a short-lived JWT signed by the waiting room service.

← previous

Design a Real-Time Fraud Detection System

Design a Distributed Job Scheduler (cron at scale)

// RELATED