~/articles/design-stock-exchange

◆◆◆Advancedasked at Bloombergasked at Citadelasked at Robinhoodasked at Nasdaq

Design a Stock Exchange (matching engine)

Q: What is price-time priority and what data structure implements it?

Price-time priority means the best price is always matched first, and within a price level, orders are filled in the order they arrived — strictly first in, first out. The order book implements this as a TreeMap of price levels (sorted descending for bids, ascending for asks), where each price level is a FIFO queue. Insert is O(log P), cancel is O(1) via an orders_by_id HashMap index, and best bid/ask lookup is O(1).

Q: Why does a matching engine use a single thread per symbol instead of multiple threads?

Multiple threads touching the same order book require locks, and lock contention under load can reorder operations — meaning two simultaneously arriving orders could be serialized in either order, which is a fairness violation. A single thread eliminates the ambiguity: fairness falls out of the architecture for free, with no coordination overhead. LMAX's published benchmark measured 6 million orders per second on a single thread on 2010-era hardware, and a busy symbol at a typical exchange handles around 50k events per second — well within one core's capacity.

Q: What is the sequencer's role in a stock exchange, and why is the sequence number called a fairness primitive?

The sequencer stamps each incoming order with a monotonically increasing sequence number and appends it to a durable log before the matching engine ever sees the event. That sequence number defines the canonical, legally auditable order of all events — once stamped, an order's place in the queue is permanent. Fairness is provable from the log because the sequence number is the permanent record of who arrived first.

Q: How does event sourcing enable crash recovery without putting disk I/O on the matching engine's hot path?

The matching engine is a pure function of its event stream: its in-memory state is just a materialized view of the log, never persisted directly. After a crash, the engine replays the event log from the last checkpoint and arrives at exactly the same state — no distributed handshake or state reconciliation needed. The engine itself never writes to disk; all I/O is in the sequencer, which writes append-only and leaves the matching path untouched.

Q: Why does market data fan-out use multicast instead of unicast from the matching engine?

A single multicast packet reaches all subscribers simultaneously, so the engine's output I/O stays constant regardless of how many clients are watching. Sending unicast from the engine would mean send cost scales with subscriber count — at roughly 1 million subscribed connections for a consolidated feed, that is unworkable. The matching engine emits fill events to a separate market-data publisher, which handles fan-out via multicast and keeps the engine oblivious to subscriber count.

Match buy and sell orders deterministically with microsecond latency and perfect fairness. The single-threaded matching engine, the order book, and event sourcing for recovery.

22 min read2026-03-16Ironclad Academy

#interview #low-latency #consistency #finance

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Nasdaq processed roughly 29 billion shares on a single peak day in 2021. Behind every one of those transactions sits the same piece of machinery: a matching engine that takes a buy order and a sell order, confirms they agree on price, and produces a legally binding fill — in under a hundred microseconds. Robinhood, Citadel, NYSE, and every other modern venue all build around the same core idea, differing mainly in scale and the extremity of their latency targets.

At its heart, a stock exchange is a marketplace for orders. A limit order says "I will buy 200 shares of AAPL at no more than $150.05." A matching engine's job is to find the counterpart — a seller willing to part with shares at $150.05 or less — and execute the trade. Simple in concept; brutal in practice. Wall Street's tolerance for bugs or ambiguity is approximately zero: a wrong match is a regulatory violation, a lost order is a lawsuit, and a microsecond of unnecessary latency is money left on the table.

Two tensions define the hard parts. The first is correctness vs speed: every match must be deterministic and fair (price-time priority, strictly enforced — first in, first out at each price level), yet latency budgets are in the tens of microseconds for co-located clients. Any locking or coordination on the hot path kills both. The second is fan-out vs isolation: a single fill event must reach thousands of market-data subscribers simultaneously, but the engine that produced it must remain oblivious to subscriber count — otherwise, subscriber churn bleeds into match latency. Every architectural decision in a real exchange is driven by one goal: the guarantee that replaying the same inputs in the same order produces exactly the same outputs, every time.

Functional requirements

Accept limit orders: buy or sell N shares of symbol X at price P or better.
Accept market orders: buy or sell N shares at the current best available price.
Accept cancellations and modifications of open orders.
Match orders when a buyer's price meets or exceeds a seller's price.
Generate fill notifications to both parties when a match occurs.
Publish a market data feed: every order, cancel, and fill, in sequence.
Provide a public order book snapshot (best N price levels on each side).

Non-functional requirements

Correctness first: no lost orders, no duplicate fills, no out-of-order matches.
Deterministic: given the same event log, the system always arrives at the same state.
Low latency: p99 match latency < 100 µs gate-to-gate (order received → fill sent) for co-located clients; < 1 ms for ordinary API clients.
High throughput: sustain 500 k order events/sec across all symbols.
High availability: 99.99% uptime; failover in < 1 second without losing or reordering events.
Fair: price-time priority, enforced globally and provably from the event log.

Capacity estimation

Illustrative figures for a mid-size exchange or a single major symbol. A large venue like Nasdaq processes multiple millions of messages/sec across all engines; 500 k/sec is a reasonable interview target.

Dimension	Estimate	How we got there
Order event throughput (peak)	500,000 events/sec	Orders + cancels + modifies combined
Incoming event bandwidth	~100 MB/s	`500,000 events/s × 200 B/event`
Event log retention	~43 TB	`500k events/s × 86,400 s/day × 5 days × 200 B = 500k × 432,000 × 200 B` — feasible on NVMe arrays or object storage
Active symbols	~10,000	Split across matching engine processes
Top-symbol throughput	~50k events/sec	One core handles this comfortably
Bottom-80% symbols	< 500 events/sec each	Long-tail distribution
Order book memory (per busy symbol)	~1 MB	`2 sides × 100 price levels/side × 50 orders/level = 10,000 orders × 100 B`
Total order book memory (all symbols)	~10 GB	`10,000 symbols × 1 MB` — fits in one machine
Market data event size	~1,000 bytes serialized	Per event after serialization
Outbound market data bandwidth	~500 MB/s	`500k events/s × 1,000 B`; fills are a subset — not every event produces a fill
Fan-out delivery	Multicast / pub-sub	Avoids O(N) unicast cost at the engine

Takeaway: 43 TB of log retention at 500 k events/sec fits on NVMe arrays; the entire live order book for 10,000 symbols fits in 10 GB of RAM on a single machine — the architecture is I/O-bound at the sequencer, not at the engine.

The limit order book — the central data structure

Before any architecture discussion, you must understand what the matching engine is actually computing.

An order book for symbol AAPL at a snapshot in time:

ASKS (sell orders, sorted ascending by price)
  Price $150.05  → [sell 200 @ 09:31:04.000001, sell 100 @ 09:31:04.000312]
  Price $150.10  → [sell 500 @ 09:31:03.999812]

  ← SPREAD ($0.05) →

BIDS (buy orders, sorted descending by price)
  Price $150.00  → [buy 300 @ 09:31:04.000100, buy 100 @ 09:31:03.998000]
  Price $149.95  → [buy 1000 @ 09:31:03.997500]

Price-time priority: within a price level, orders are filled in the order they arrived — first in, first out. Across price levels, the best price is always matched first (lowest ask, highest bid).

The data structure

OrderBook {
    bids: TreeMap<Price, Queue<Order>>   // sorted descending
    asks: TreeMap<Price, Queue<Order>>   // sorted ascending
    orders_by_id: HashMap<OrderId, (side, price, qty_remaining)>
}

Insert (limit order): O(log P) to find the price level, O(1) to enqueue. P = number of distinct price levels, typically small.
Cancel: O(1) lookup by order_id in orders_by_id, O(1) removal from the queue (with a doubly-linked list inside the queue).
Best bid / ask: O(1) — peek the first key of each TreeMap.
Match: O(1) per fill — always matching against the front of the best-price queue.

flowchart TD
    OB["Order Book (AAPL)"]
    OB --> BIDS["BIDS (TreeMap desc)"]
    OB --> ASKS["ASKS (TreeMap asc)"]
    OB --> IDX["orders_by_id (HashMap)"]
    BIDS --> B1["$150.00 → [buy 300 t=04.000100, buy 100 t=03.998000]"]
    BIDS --> B2["$149.95 → [buy 1000 t=03.997500]"]
    ASKS --> A1["$150.05 → [sell 200 t=04.000001, sell 100 t=04.000312]"]
    ASKS --> A2["$150.10 → [sell 500 t=03.999812]"]
    IDX --> ID1["order_42 → (BUY, $150.00, qty=300)"]

    style OB fill:#0e7490,color:#fff
    style BIDS fill:#15803d,color:#fff
    style ASKS fill:#ff6b1a,color:#0a0a0f
    style IDX fill:#a855f7,color:#fff

Notice that the orders_by_id index is what makes O(1) cancellation possible. Without it, cancelling an order would require scanning the price level's queue — a linear search. With it, you jump directly to the order's position in constant time.

The matching algorithm

An incoming order arrives. The engine runs this algorithm (pseudocode):

def process_order(incoming):
    if incoming.type == CANCEL:
        cancel(incoming.order_id)
        return

    if incoming.side == BUY:
        opposite = asks          # match against asks
        price_ok = lambda level: level <= incoming.price  # for market orders: always True
    else:
        opposite = bids
        price_ok = lambda level: level >= incoming.price

    while incoming.qty > 0 and opposite and price_ok(opposite.best_price()):
        resting = opposite.front()
        fill_qty = min(incoming.qty, resting.qty_remaining)

        emit_fill(incoming.order_id, resting.order_id, fill_qty, resting.price)

        incoming.qty -= fill_qty
        resting.qty_remaining -= fill_qty
        if resting.qty_remaining == 0:
            opposite.remove_front()

    if incoming.type == LIMIT and incoming.qty > 0:
        # remaining quantity rests on the book
        bids_or_asks[incoming.price].enqueue(incoming)

The match price is always the resting order's price, not the aggressive order's price — this is standard in price-time priority matching and is called the resting price rule. It guarantees the resting order gets the price it asked for.

sequenceDiagram
    participant GW as Gateway
    participant SEQ as Sequencer
    participant ME as Matching Engine
    participant MD as Market Data

    GW->>SEQ: "BUY 100 AAPL @ $150.05" (seq not assigned)
    SEQ->>SEQ: append to log, assign seq#1042
    SEQ->>ME: deliver event seq#1042
    ME->>ME: crosses spread → match 100 vs resting ask@$150.05
    ME->>GW: Fill ACK (order_id=42, qty=100, px=$150.05)
    ME->>MD: publish Fill event (seq#1042)
    ME->>MD: publish updated best bid/ask

Why single-threaded? The determinism argument

This is the most important design decision in a matching engine. Every production exchange matching engine that the author is aware of processes orders for a given symbol on a single thread.

The correctness argument is simple. Two threads touching the same order book need a lock. Under load, lock contention is unpredictable and can reorder operations — two requests that "simultaneously" arrive could be serialized in either order, and that order determines who gets filled first, which is a fairness violation. A single thread eliminates the ambiguity entirely. Fairness is not something you verify after the fact; it falls out of the architecture for free.

The performance argument is equally compelling. A well-optimized single-threaded matching engine can process millions of order events per second on a single core — LMAX's own published benchmark measured 6 million orders/sec on a single thread on 2010-era hardware; modern CPUs do better. Even the busiest symbol at a typical exchange — say 50k events/sec — uses only a small fraction of one core. And because all order-book state for one symbol fits comfortably in cache (~1 MB fits in the L2 of current-generation Intel Xeon Sapphire Rapids cores, and easily in any server's L3), you rarely pay a cache miss on the critical path. This is the mechanical sympathy principle: write software that matches the hardware's strengths, and the hardware rewards you.

The LMAX Disruptor pattern: LMAX Exchange popularized the pattern of a single-threaded business logic processor consuming from a pre-allocated ring buffer. The ring buffer is shared between the input thread (writing) and the processing thread (reading) using memory barriers rather than locks. This achieves sub-microsecond handoff between threads without any heap allocation on the hot path.

flowchart LR
    NET[Network thread<br/>receives order bytes] -->|writes to ring| RB[(Ring Buffer<br/>lock-free SPMC)]
    RB -->|reads in sequence| ME[Matching Engine<br/>single thread]
    ME --> OUT1[Fill notifier thread]
    ME --> OUT2[Market data thread]

    style RB fill:#0e7490,color:#fff
    style ME fill:#15803d,color:#fff

The ring buffer is pre-allocated — no garbage collection pauses — and the single-consumer guarantee means the matching engine thread never competes with anyone for any data structure.

Sequencing — the consistency primitive

Before the matching engine sees an order, a sequencer stamps it with a monotonically increasing sequence number and appends it to a durable log. This moment is the exchange's point of truth.

The sequence number defines the canonical order of all events. Once an event is stamped and written, its place in the universe is settled. The log becomes the source of truth; the in-memory order book is just a materialized view of that log — an optimized read cache, if you will. Recovery becomes trivial: replay the log from the beginning (or from the last snapshot checkpoint) and you arrive at exactly the same state. No state reconciliation, no guesswork.

The sequencer is deliberately simple. Its only job is write-and-stamp. Any logic beyond that belongs in the gateway or the engine, not here.

flowchart TD
    E1[Event arrives from gateway] --> SEQ{Sequencer}
    SEQ -->|"append to write-ahead log<br/>(NVMe or replicated log)"| LOG[(Durable Event Log)]
    SEQ -->|"deliver seq#N to engine"| ME[Matching Engine]
    LOG -.replicate asynchronously.-> HS[Hot Standby<br/>replays same events]

    style SEQ fill:#ff6b1a,color:#0a0a0f
    style LOG fill:#0e7490,color:#fff
    style HS fill:#a855f7,color:#fff

Event sourcing for durability and recovery

Recovery flow

flowchart TD
    CRASH[Engine crashes at seq#8452] --> CHECK{Last checkpoint?}
    CHECK -->|"yes, checkpoint at seq#8000"| SNAP[Load snapshot at seq#8000]
    CHECK -->|no checkpoint| INIT[Start from empty state]
    SNAP --> REPLAY["Replay events seq#8001 → seq#8452"]
    INIT --> REPLAY2["Replay all events from beginning"]
    REPLAY --> LIVE[Engine live at seq#8452]
    REPLAY2 --> LIVE2[Engine live at seq#8452]

    style SNAP fill:#15803d,color:#fff
    style LIVE fill:#ff6b1a,color:#0a0a0f
    style LIVE2 fill:#ff6b1a,color:#0a0a0f

Snapshot frequency is a trade-off: more frequent means faster recovery but more CPU and storage overhead. Production systems typically checkpoint every few minutes, or as frequently as every few seconds for the highest-volume symbols.

Order lifecycle — state diagram

stateDiagram-v2
    [*] --> Pending: received at gateway
    Pending --> Sequenced: sequencer stamps seq#
    Sequenced --> PartiallyFilled: partial match
    Sequenced --> Filled: full match
    Sequenced --> Open: no immediate match (limit order rests)
    Open --> PartiallyFilled: later incoming order matches
    Open --> Cancelled: client cancels
    Open --> Expired: order past GTD time
    PartiallyFilled --> Filled: remaining qty matched
    PartiallyFilled --> Cancelled: client cancels remainder
    Filled --> [*]
    Cancelled --> [*]
    Expired --> [*]

Surrounding system components

The matching engine is not a complete exchange — it is the core. Everything else handles the work the engine should never touch.

flowchart TD
    CLIENT[Client / Broker] --> GW[Order Gateway<br/>FIX / binary protocol]
    GW --> RISK[Pre-trade Risk Engine<br/>margin, position limits, fat-finger]
    RISK -->|approved order| SEQ[Sequencer]
    SEQ --> ME[Matching Engine Cluster<br/>one process per symbol]
    ME --> FILL[Fill Router<br/>sends fills back to clients]
    ME --> MDF[Market Data Feed<br/>L1 / L2 / L3 multicast]
    ME --> AUDIT[Audit Log Writer<br/>regulatory]
    ME --> PT[Post-Trade Clearing<br/>CCP integration]
    MDF --> FEED[(Market Data<br/>subscribers)]
    PT --> CCP[Central Counterparty<br/>Clearinghouse]

    style GW fill:#ff6b1a,color:#0a0a0f
    style SEQ fill:#0e7490,color:#fff
    style ME fill:#15803d,color:#fff
    style MDF fill:#ffaa00,color:#0a0a0f
    style PT fill:#a855f7,color:#fff

Gateway and risk pre-checks

Before an order reaches the sequencer, the gateway validates it and the risk engine runs a rapid pre-trade check: is the message well-formed, is the symbol valid, is this client authorized to trade here? Then the tighter checks: is the price within reasonable bounds of the last trade (fat-finger protection, typically ±10%), and does this order keep the client within their margin and position limits?

All of this happens before the sequencer. An order that fails risk checks is rejected without ever touching the log — which means it never consumes a sequence number, never occupies the hot path, and never needs to be undone. Keeping risk at the gateway is what lets the engine itself stay pure.

Market data feed

Every fill, every order arrival, and every cancel generates market data events. The publisher distributes these at three levels of granularity:

flowchart LR
    ME[Matching Engine] --> MDP[Market Data Publisher]
    MDP --> L1["L1 — best bid and ask only<br/>(most retail clients)"]
    MDP --> L2["L2 — best N price levels each side<br/>(algorithmic traders)"]
    MDP --> L3["L3 — every individual order<br/>(high-frequency traders)"]
    L1 --> MC[Multicast / pub-sub]
    L2 --> MC
    L3 --> MC
    MC --> SUBS[~1M subscribers]

    style MDP fill:#ffaa00,color:#0a0a0f
    style ME fill:#15803d,color:#fff
    style MC fill:#0e7490,color:#fff

Fan-out to a million subscribers uses multicast, not unicast from the engine. A single multicast packet reaches all subscribers simultaneously, so the engine's output I/O stays constant regardless of how many clients are watching. Sending unicast from the engine would mean the engine's send cost scales with subscriber count — unworkable.

Post-trade: clearing and settlement

A fill is a legal obligation. After a match, the clearing system records the trade (counterparties, quantity, price), and the central counterparty clearinghouse (CCP) steps in as the buyer to every seller and the seller to every buyer, eliminating counterparty risk. Settlement — the actual transfer of shares and cash — happens T+1 for US equities (the US moved from T+2 to T+1 in May 2024); many non-US markets still settle T+2. Settlement is entirely off the latency-critical path; it runs asynchronously after the fill is confirmed.

Partitioning by symbol

Horizontal scale is straightforward: each symbol is an independent partition. AAPL events never touch the MSFT order book. Route events to the correct engine process using hash(symbol_id) % N — or a static routing table for operational simplicity.

flowchart LR
    SEQ[Sequencer] --> ROUTE{Symbol Router}
    ROUTE -->|AAPL| ME1[Engine: AAPL]
    ROUTE -->|MSFT| ME2[Engine: MSFT]
    ROUTE -->|TSLA| ME3[Engine: TSLA]
    ME1 -.event log.-> LOG1[(Log: AAPL)]
    ME2 -.event log.-> LOG2[(Log: MSFT)]
    ME3 -.event log.-> LOG3[(Log: TSLA)]

    style ROUTE fill:#ff6b1a,color:#0a0a0f
    style ME1 fill:#15803d,color:#fff
    style ME2 fill:#15803d,color:#fff
    style ME3 fill:#15803d,color:#fff

The implication: you can add symbols cheaply (new partition), and a burst in one symbol doesn't affect any other. The top 10 symbols can be pinned to dedicated cores. The remaining coordination point is the sequencer — for cross-symbol events like index arbitrage, some venues use a global sequence number; others allow per-symbol logs with reconciliation at the clearing layer.

Low-latency techniques

The single-threaded architecture alone gets you to respectable throughput with sub-millisecond latency on ordinary hardware. Getting below 10 µs gate-to-gate requires a different class of techniques:

Technique	What it does	Latency impact
Kernel bypass (DPDK / RDMA)	NIC writes directly to user-space ring buffer, bypassing the OS kernel network stack	Saves ~10–50 µs per packet
Pre-allocated ring buffers	No heap allocation on the hot path; no GC pauses	Eliminates GC jitter (critical for Java-based engines)
Busy-spin (no sleep/yield)	Processing thread spins on the ring-buffer tail pointer; never blocks	Sub-microsecond wake latency vs ~10–30 µs for OS wakeup
CPU isolation + NUMA pinning	Dedicated cores for the engine; memory allocated on the same NUMA node as the NIC	Avoids cache-line thrashing and NUMA penalty
Hugepages	2 MB pages instead of 4 KB; fewer TLB misses	Reduces memory latency on large order books
Binary wire protocol	Proprietary binary format (similar to ITCH/OUCH); no parsing overhead	vs FIX ASCII: ~5–10 µs

Kernel bypass and busy-spin are co-location techniques. They only matter when you're chasing single-digit microsecond latencies. For a retail-facing exchange where p99 < 1 ms is the target, a well-written Java or Go matching engine on a normal Linux box achieves that without any of these.

Failure modes

Engine crash

The hot standby has been tailing the same event log in real time. When the primary goes down, the standby is promoted — it is at most a few events behind. In-flight events that were sequenced but not yet delivered to the primary are re-delivered to the standby. Recovery is deterministic; no manual state reconciliation is needed. If both primary and standby fail simultaneously (datacenter power failure, for example), recovery falls back to replaying the full log from the last checkpoint. Slower, but always correct.

Sequencer failure

This is more serious: the sequencer is the single point of ordering. Without a sequence number, you cannot determine fairness. Production approaches run the sequencer on hardware with redundant power and redundant network paths, and often deploy a replicated log — a two- or three-node Raft group — so the sequencer can tolerate a node failure. The trade-off: Raft consensus adds roughly 1–5 ms of commit latency per event in a same-datacenter deployment (network round-trip + follower fsync). Acceptable for many exchange classes; unacceptable for co-location-sensitive HFT. Some exchanges accept the risk of a non-replicated sequencer and rely on rapid hardware restart combined with full log replay.

Bad-order storm (fat-finger or rogue algo)

A client's algorithm sends 100,000 orders in 50 milliseconds at absurd prices. The first line of defense is the pre-trade risk checks at the gateway — rate limit per client, fat-finger price bands. Beyond that, circuit breakers in the matching engine halt trading for a symbol if the price moves more than X% in Y seconds. The exchange can also pull a kill switch that immediately cancels all open orders from a specific client ID.

Fairness under load

Under extreme load, the sequencer's queue may build up. The order of service in the queue matters — orders earlier in the queue have an earlier sequence number and will be processed first. The only way to "jump the queue" is to arrive before others on the network, which is what co-location achieves legally. The sequence number is the permanent, auditable record of who was first.

Storage choices

Data	Store	Why
Event log (orders/fills in sequence)	Append-only log on NVMe; replicated via Raft or async to standby	Source of truth; needs sequential write throughput, not random access
Order book (live state)	In-memory only (single process per symbol)	All reads are sub-microsecond; durability comes from the log
Order snapshots (checkpoints)	Local NVMe + object storage (S3/GCS)	Fast local read for recovery; remote copy for disaster recovery
Trade records (filled orders)	Postgres (relational, transactional)	Regulatory queries, joins with account data
Market data history	Time-series DB (InfluxDB, Kdb+, or columnar store)	Backtesting, analytics, charting
Client account state	Postgres + risk engine cache	Consistency needed; not on the sub-millisecond path

Building up to the design

It helps to see how the complexity earns its keep at each step.

V1: A sorted list per symbol, in-process

bids = SortedList()   # price descending
asks = SortedList()   # price ascending

def place_order(side, price, qty):
    if side == BUY and asks and asks[0].price <= price:
        # match
        ...
    else:
        bids.add(Order(price, qty, timestamp.now()))

This works at a few thousand orders per second and takes maybe a weekend to write. The fatal flaw is obvious: process restart wipes every open order. For a real exchange, that is catastrophic — traders have legal claims against those orders. You need durability before you ship anything.

V2: Write each order to a database before matching

Now every order is durably written to Postgres before the engine touches it. Recovery: query all open orders, rebuild the book. Durability is solved, but two new problems appear. First, Postgres write latency is roughly 1 ms, and it sits on the hot path — at 500k events/sec the database saturates almost immediately. Second, when you try to recover, Postgres INSERT timestamps don't give you microsecond-precision ordering, so you cannot guarantee you rebuild the book in exactly the right sequence.

V3: Write-ahead log + in-memory engine

Decouple durability from matching. The sequencer appends to a write-ahead log (append-only to NVMe, which is fast). The engine reads from the log and matches in memory. Recovery means replaying the log. Engine latency drops by 10–100× because the database is no longer in the hot path, and ordering is now exact — the log's sequence number is the order. The remaining gap: failover for a busy symbol means replaying minutes of events, which takes minutes. Not good enough.

V4: Hot standby + checkpoints

The standby tails the same log in real time. Periodic checkpoints — every 60 seconds or so — mean replay starts from a recent snapshot, not from open of day. Failover now takes seconds, not minutes. The next thing that breaks is fan-out: every fill needs to reach thousands of market-data subscribers, and if the engine handles that itself, subscriber count starts affecting match latency.

V5: Dedicated market-data publisher + multicast

The engine emits fill events to a separate market-data process that handles fan-out via multicast. The engine is now pure matching — it doesn't know or care how many clients are watching. This is the production-ready architecture.

flowchart LR
    V1["V1: in-memory sorted list<br/>no durability"] --> V2["V2: + DB writes<br/>durable, slow"]
    V2 --> V3["V3: + WAL, in-memory engine<br/>fast, hard failover"]
    V3 --> V4["V4: + hot standby + checkpoints<br/>fast failover"]
    V4 --> V5["V5: + market-data publisher<br/>production-ready"]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

V5 is the architecture this article has been describing. Everything beyond this is operational tuning and low-latency engineering.

Things to discuss in an interview

Why single-threaded? Determinism, fairness, cache efficiency. Tie it to the mechanical sympathy principle.
Event sourcing as the durability primitive. Not "save the order book to a database" — "the log IS the database."
The sequencer as the fairness primitive. Sequence number = provable order of arrival. Auditable.
Price-time priority and its data structure. TreeMap of queues; FIFO within a level. Be able to trace through a sample match.
Partitioning by symbol. Why it works, what it enables (linear scale), what the remaining coordination point is (the sequencer for cross-symbol events like index arbitrage).
Hot standby vs cold standby. Hot standby tails the live log; cold standby replays from snapshot. Trade-off: resource cost vs failover speed.
Pre-trade risk vs post-trade. Risk checks at the gateway, before the log, keep the hot path clean.
Market-data fan-out. Multicast, not unicast. The matching engine should not care about subscriber count.

Things you should now be able to answer

What is price-time priority, and what data structure implements it?
Why does every matching engine use a single thread per symbol?
What is the role of the sequencer, and why is the sequence number a fairness primitive?
How does event sourcing enable crash recovery without putting durability on the hot path?
What does a hot standby do differently from a cold standby?
What pre-trade checks happen before an order hits the matching engine?
Why is multicast used for market data fan-out?
What is a circuit breaker in a matching engine context?

Frequently asked questions

▸What is price-time priority and what data structure implements it?

Price-time priority means the best price is always matched first, and within a price level, orders are filled in the order they arrived — strictly first in, first out. The order book implements this as a TreeMap of price levels (sorted descending for bids, ascending for asks), where each price level is a FIFO queue. Insert is O(log P), cancel is O(1) via an orders_by_id HashMap index, and best bid/ask lookup is O(1).

▸Why does a matching engine use a single thread per symbol instead of multiple threads?

Multiple threads touching the same order book require locks, and lock contention under load can reorder operations — meaning two simultaneously arriving orders could be serialized in either order, which is a fairness violation. A single thread eliminates the ambiguity: fairness falls out of the architecture for free, with no coordination overhead. LMAX's published benchmark measured 6 million orders per second on a single thread on 2010-era hardware, and a busy symbol at a typical exchange handles around 50k events per second — well within one core's capacity.

▸What is the sequencer's role in a stock exchange, and why is the sequence number called a fairness primitive?

The sequencer stamps each incoming order with a monotonically increasing sequence number and appends it to a durable log before the matching engine ever sees the event. That sequence number defines the canonical, legally auditable order of all events — once stamped, an order's place in the queue is permanent. Fairness is provable from the log because the sequence number is the permanent record of who arrived first.

▸How does event sourcing enable crash recovery without putting disk I/O on the matching engine's hot path?

The matching engine is a pure function of its event stream: its in-memory state is just a materialized view of the log, never persisted directly. After a crash, the engine replays the event log from the last checkpoint and arrives at exactly the same state — no distributed handshake or state reconciliation needed. The engine itself never writes to disk; all I/O is in the sequencer, which writes append-only and leaves the matching path untouched.

▸Why does market data fan-out use multicast instead of unicast from the matching engine?

A single multicast packet reaches all subscribers simultaneously, so the engine's output I/O stays constant regardless of how many clients are watching. Sending unicast from the engine would mean send cost scales with subscriber count — at roughly 1 million subscribed connections for a consolidated feed, that is unworkable. The matching engine emits fill events to a separate market-data publisher, which handles fan-out via multicast and keeps the engine oblivious to subscriber count.

← previous

Storage Engines: LSM-Trees vs B-Trees

Idempotency & Exactly-Once Semantics

// RELATED

Design a Stock Exchange (matching engine)

The problem

Functional requirements

Non-functional requirements

Capacity estimation

The limit order book — the central data structure

The data structure

The matching algorithm

Why single-threaded? The determinism argument

Sequencing — the consistency primitive

Event sourcing for durability and recovery

Recovery flow

Order lifecycle — state diagram

Surrounding system components

Gateway and risk pre-checks

Market data feed

Post-trade: clearing and settlement

Partitioning by symbol

Low-latency techniques

Failure modes

Engine crash

Sequencer failure

Bad-order storm (fat-finger or rogue algo)

Fairness under load

Storage choices

Building up to the design

V1: A sorted list per symbol, in-process

V2: Write each order to a database before matching

V3: Write-ahead log + in-memory engine

V4: Hot standby + checkpoints

V5: Dedicated market-data publisher + multicast

Things to discuss in an interview

Things you should now be able to answer

Further reading

Frequently asked questions

You may also like

Design an LLM Observability Platform

Design an LLM Gateway (AI Gateway & Model Router)

Design an LLM Fine-Tuning Platform