Design a Stock Exchange (matching engine)
Match buy and sell orders deterministically with microsecond latency and perfect fairness. The single-threaded matching engine, the order book, and event sourcing for recovery.
The problem
Nasdaq processed roughly 29 billion shares on a single peak day in 2021. Behind every one of those transactions sits the same piece of machinery: a matching engine that takes a buy order and a sell order, confirms they agree on price, and produces a legally binding fill — in under a hundred microseconds. Robinhood, Citadel, NYSE, and every other modern venue all build around the same core idea, differing mainly in scale and the extremity of their latency targets.
At its heart, a stock exchange is a marketplace for orders. A limit order says "I will buy 200 shares of AAPL at no more than $150.05." A matching engine's job is to find the counterpart — a seller willing to part with shares at $150.05 or less — and execute the trade. Simple in concept; brutal in practice. Wall Street's tolerance for bugs or ambiguity is approximately zero: a wrong match is a regulatory violation, a lost order is a lawsuit, and a microsecond of unnecessary latency is money left on the table.
Two tensions define the hard parts. The first is correctness vs speed: every match must be deterministic and fair (price-time priority, strictly enforced — first in, first out at each price level), yet latency budgets are in the tens of microseconds for co-located clients. Any locking or coordination on the hot path kills both. The second is fan-out vs isolation: a single fill event must reach thousands of market-data subscribers simultaneously, but the engine that produced it must remain oblivious to subscriber count — otherwise, subscriber churn bleeds into match latency. Every architectural decision in a real exchange is driven by one goal: the guarantee that replaying the same inputs in the same order produces exactly the same outputs, every time.
Functional requirements
- Accept limit orders: buy or sell N shares of symbol X at price P or better.
- Accept market orders: buy or sell N shares at the current best available price.
- Accept cancellations and modifications of open orders.
- Match orders when a buyer's price meets or exceeds a seller's price.
- Generate fill notifications to both parties when a match occurs.
- Publish a market data feed: every order, cancel, and fill, in sequence.
- Provide a public order book snapshot (best N price levels on each side).
Non-functional requirements
- Correctness first: no lost orders, no duplicate fills, no out-of-order matches.
- Deterministic: given the same event log, the system always arrives at the same state.
- Low latency: p99 match latency < 100 µs gate-to-gate (order received → fill sent) for co-located clients; < 1 ms for ordinary API clients.
- High throughput: sustain 500 k order events/sec across all symbols.
- High availability: 99.99% uptime; failover in < 1 second without losing or reordering events.
- Fair: price-time priority, enforced globally and provably from the event log.
Capacity estimation
Illustrative figures for a mid-size exchange or a single major symbol. A large venue like Nasdaq processes multiple millions of messages/sec across all engines; 500 k/sec is a reasonable interview target.
| Dimension | Estimate | How we got there |
|---|---|---|
| Order event throughput (peak) | 500,000 events/sec | Orders + cancels + modifies combined |
| Incoming event bandwidth | ~100 MB/s | 500,000 events/s × 200 B/event |
| Event log retention | ~43 TB | 500k events/s × 86,400 s/day × 5 days × 200 B = 500k × 432,000 × 200 B — feasible on NVMe arrays or object storage |
| Active symbols | ~10,000 | Split across matching engine processes |
| Top-symbol throughput | ~50k events/sec | One core handles this comfortably |
| Bottom-80% symbols | < 500 events/sec each | Long-tail distribution |
| Order book memory (per busy symbol) | ~1 MB | 2 sides × 100 price levels/side × 50 orders/level = 10,000 orders × 100 B |
| Total order book memory (all symbols) | ~10 GB | 10,000 symbols × 1 MB — fits in one machine |
| Market data event size | ~1,000 bytes serialized | Per event after serialization |
| Outbound market data bandwidth | ~500 MB/s | 500k events/s × 1,000 B; fills are a subset — not every event produces a fill |
| Fan-out delivery | Multicast / pub-sub | Avoids O(N) unicast cost at the engine |
Takeaway: 43 TB of log retention at 500 k events/sec fits on NVMe arrays; the entire live order book for 10,000 symbols fits in 10 GB of RAM on a single machine — the architecture is I/O-bound at the sequencer, not at the engine.
The limit order book — the central data structure
Before any architecture discussion, you must understand what the matching engine is actually computing.
An order book for symbol AAPL at a snapshot in time:
ASKS (sell orders, sorted ascending by price)
Price $150.05 → [sell 200 @ 09:31:04.000001, sell 100 @ 09:31:04.000312]
Price $150.10 → [sell 500 @ 09:31:03.999812]
← SPREAD ($0.05) →
BIDS (buy orders, sorted descending by price)
Price $150.00 → [buy 300 @ 09:31:04.000100, buy 100 @ 09:31:03.998000]
Price $149.95 → [buy 1000 @ 09:31:03.997500]
Price-time priority: within a price level, orders are filled in the order they arrived — first in, first out. Across price levels, the best price is always matched first (lowest ask, highest bid).
The data structure
OrderBook {
bids: TreeMap<Price, Queue<Order>> // sorted descending
asks: TreeMap<Price, Queue<Order>> // sorted ascending
orders_by_id: HashMap<OrderId, (side, price, qty_remaining)>
}
- Insert (limit order): O(log P) to find the price level, O(1) to enqueue. P = number of distinct price levels, typically small.
- Cancel: O(1) lookup by
order_idinorders_by_id, O(1) removal from the queue (with a doubly-linked list inside the queue). - Best bid / ask: O(1) — peek the first key of each TreeMap.
- Match: O(1) per fill — always matching against the front of the best-price queue.
flowchart TD
OB["Order Book (AAPL)"]
OB --> BIDS["BIDS (TreeMap desc)"]
OB --> ASKS["ASKS (TreeMap asc)"]
OB --> IDX["orders_by_id (HashMap)"]
BIDS --> B1["$150.00 → [buy 300 t=04.000100, buy 100 t=03.998000]"]
BIDS --> B2["$149.95 → [buy 1000 t=03.997500]"]
ASKS --> A1["$150.05 → [sell 200 t=04.000001, sell 100 t=04.000312]"]
ASKS --> A2["$150.10 → [sell 500 t=03.999812]"]
IDX --> ID1["order_42 → (BUY, $150.00, qty=300)"]
style OB fill:#0e7490,color:#fff
style BIDS fill:#15803d,color:#fff
style ASKS fill:#ff6b1a,color:#0a0a0f
style IDX fill:#a855f7,color:#fff
Notice that the orders_by_id index is what makes O(1) cancellation possible. Without it, cancelling an order would require scanning the price level's queue — a linear search. With it, you jump directly to the order's position in constant time.
The matching algorithm
An incoming order arrives. The engine runs this algorithm (pseudocode):
def process_order(incoming):
if incoming.type == CANCEL:
cancel(incoming.order_id)
return
if incoming.side == BUY:
opposite = asks # match against asks
price_ok = lambda level: level <= incoming.price # for market orders: always True
else:
opposite = bids
price_ok = lambda level: level >= incoming.price
while incoming.qty > 0 and opposite and price_ok(opposite.best_price()):
resting = opposite.front()
fill_qty = min(incoming.qty, resting.qty_remaining)
emit_fill(incoming.order_id, resting.order_id, fill_qty, resting.price)
incoming.qty -= fill_qty
resting.qty_remaining -= fill_qty
if resting.qty_remaining == 0:
opposite.remove_front()
if incoming.type == LIMIT and incoming.qty > 0:
# remaining quantity rests on the book
bids_or_asks[incoming.price].enqueue(incoming)
The match price is always the resting order's price, not the aggressive order's price — this is standard in price-time priority matching and is called the resting price rule. It guarantees the resting order gets the price it asked for.
sequenceDiagram
participant GW as Gateway
participant SEQ as Sequencer
participant ME as Matching Engine
participant MD as Market Data
GW->>SEQ: "BUY 100 AAPL @ $150.05" (seq not assigned)
SEQ->>SEQ: append to log, assign seq#1042
SEQ->>ME: deliver event seq#1042
ME->>ME: crosses spread → match 100 vs resting ask@$150.05
ME->>GW: Fill ACK (order_id=42, qty=100, px=$150.05)
ME->>MD: publish Fill event (seq#1042)
ME->>MD: publish updated best bid/ask
Why single-threaded? The determinism argument
This is the most important design decision in a matching engine. Every production exchange matching engine that the author is aware of processes orders for a given symbol on a single thread.
The correctness argument is simple. Two threads touching the same order book need a lock. Under load, lock contention is unpredictable and can reorder operations — two requests that "simultaneously" arrive could be serialized in either order, and that order determines who gets filled first, which is a fairness violation. A single thread eliminates the ambiguity entirely. Fairness is not something you verify after the fact; it falls out of the architecture for free.
The performance argument is equally compelling. A well-optimized single-threaded matching engine can process millions of order events per second on a single core — LMAX's own published benchmark measured 6 million orders/sec on a single thread on 2010-era hardware; modern CPUs do better. Even the busiest symbol at a typical exchange — say 50k events/sec — uses only a small fraction of one core. And because all order-book state for one symbol fits comfortably in cache (~1 MB fits in the L2 of current-generation Intel Xeon Sapphire Rapids cores, and easily in any server's L3), you rarely pay a cache miss on the critical path. This is the mechanical sympathy principle: write software that matches the hardware's strengths, and the hardware rewards you.
The LMAX Disruptor pattern: LMAX Exchange popularized the pattern of a single-threaded business logic processor consuming from a pre-allocated ring buffer. The ring buffer is shared between the input thread (writing) and the processing thread (reading) using memory barriers rather than locks. This achieves sub-microsecond handoff between threads without any heap allocation on the hot path.
flowchart LR
NET[Network thread<br/>receives order bytes] -->|writes to ring| RB[(Ring Buffer<br/>lock-free SPMC)]
RB -->|reads in sequence| ME[Matching Engine<br/>single thread]
ME --> OUT1[Fill notifier thread]
ME --> OUT2[Market data thread]
style RB fill:#0e7490,color:#fff
style ME fill:#15803d,color:#fff
The ring buffer is pre-allocated — no garbage collection pauses — and the single-consumer guarantee means the matching engine thread never competes with anyone for any data structure.
Sequencing — the consistency primitive
Before the matching engine sees an order, a sequencer stamps it with a monotonically increasing sequence number and appends it to a durable log. This moment is the exchange's point of truth.
The sequence number defines the canonical order of all events. Once an event is stamped and written, its place in the universe is settled. The log becomes the source of truth; the in-memory order book is just a materialized view of that log — an optimized read cache, if you will. Recovery becomes trivial: replay the log from the beginning (or from the last snapshot checkpoint) and you arrive at exactly the same state. No state reconciliation, no guesswork.
The sequencer is deliberately simple. Its only job is write-and-stamp. Any logic beyond that belongs in the gateway or the engine, not here.
flowchart TD
E1[Event arrives from gateway] --> SEQ{Sequencer}
SEQ -->|"append to write-ahead log<br/>(NVMe or replicated log)"| LOG[(Durable Event Log)]
SEQ -->|"deliver seq#N to engine"| ME[Matching Engine]
LOG -.replicate asynchronously.-> HS[Hot Standby<br/>replays same events]
style SEQ fill:#ff6b1a,color:#0a0a0f
style LOG fill:#0e7490,color:#fff
style HS fill:#a855f7,color:#fff
Event sourcing for durability and recovery
See also: Event sourcing and CQRS.
The matching engine is a pure function of its event stream: State_N = f(State_0, events[0..N]). The in-memory state is never persisted directly — only the events are persisted.
This is the right model for four interlocking reasons. First, recovery needs no coordination: crash the engine, restart it, replay the log, and you're back. No distributed handshake, no state reconciliation. Second, the output is bit-for-bit identical: any node that replays the same log produces the same fills, the same order book, the same market data events — which is exactly what regulators need to audit a disputed trade. Third, the hot standby comes for free: it just tails the same log, staying a few events behind the primary; failover is a promotion, not a rebuild. Fourth, and most importantly for latency, the engine never writes to disk. All disk I/O lives in the sequencer, which writes append-only and can saturate NVMe bandwidth without touching the matching path at all.
Recovery flow
flowchart TD
CRASH[Engine crashes at seq#8452] --> CHECK{Last checkpoint?}
CHECK -->|"yes, checkpoint at seq#8000"| SNAP[Load snapshot at seq#8000]
CHECK -->|no checkpoint| INIT[Start from empty state]
SNAP --> REPLAY["Replay events seq#8001 → seq#8452"]
INIT --> REPLAY2["Replay all events from beginning"]
REPLAY --> LIVE[Engine live at seq#8452]
REPLAY2 --> LIVE2[Engine live at seq#8452]
style SNAP fill:#15803d,color:#fff
style LIVE fill:#ff6b1a,color:#0a0a0f
style LIVE2 fill:#ff6b1a,color:#0a0a0f
Snapshot frequency is a trade-off: more frequent means faster recovery but more CPU and storage overhead. Production systems typically checkpoint every few minutes, or as frequently as every few seconds for the highest-volume symbols.
Order lifecycle — state diagram
stateDiagram-v2
[*] --> Pending: received at gateway
Pending --> Sequenced: sequencer stamps seq#
Sequenced --> PartiallyFilled: partial match
Sequenced --> Filled: full match
Sequenced --> Open: no immediate match (limit order rests)
Open --> PartiallyFilled: later incoming order matches
Open --> Cancelled: client cancels
Open --> Expired: order past GTD time
PartiallyFilled --> Filled: remaining qty matched
PartiallyFilled --> Cancelled: client cancels remainder
Filled --> [*]
Cancelled --> [*]
Expired --> [*]
Surrounding system components
The matching engine is not a complete exchange — it is the core. Everything else handles the work the engine should never touch.
flowchart TD
CLIENT[Client / Broker] --> GW[Order Gateway<br/>FIX / binary protocol]
GW --> RISK[Pre-trade Risk Engine<br/>margin, position limits, fat-finger]
RISK -->|approved order| SEQ[Sequencer]
SEQ --> ME[Matching Engine Cluster<br/>one process per symbol]
ME --> FILL[Fill Router<br/>sends fills back to clients]
ME --> MDF[Market Data Feed<br/>L1 / L2 / L3 multicast]
ME --> AUDIT[Audit Log Writer<br/>regulatory]
ME --> PT[Post-Trade Clearing<br/>CCP integration]
MDF --> FEED[(Market Data<br/>subscribers)]
PT --> CCP[Central Counterparty<br/>Clearinghouse]
style GW fill:#ff6b1a,color:#0a0a0f
style SEQ fill:#0e7490,color:#fff
style ME fill:#15803d,color:#fff
style MDF fill:#ffaa00,color:#0a0a0f
style PT fill:#a855f7,color:#fff
Gateway and risk pre-checks
Before an order reaches the sequencer, the gateway validates it and the risk engine runs a rapid pre-trade check: is the message well-formed, is the symbol valid, is this client authorized to trade here? Then the tighter checks: is the price within reasonable bounds of the last trade (fat-finger protection, typically ±10%), and does this order keep the client within their margin and position limits?
All of this happens before the sequencer. An order that fails risk checks is rejected without ever touching the log — which means it never consumes a sequence number, never occupies the hot path, and never needs to be undone. Keeping risk at the gateway is what lets the engine itself stay pure.
Market data feed
Every fill, every order arrival, and every cancel generates market data events. The publisher distributes these at three levels of granularity:
flowchart LR
ME[Matching Engine] --> MDP[Market Data Publisher]
MDP --> L1["L1 — best bid and ask only<br/>(most retail clients)"]
MDP --> L2["L2 — best N price levels each side<br/>(algorithmic traders)"]
MDP --> L3["L3 — every individual order<br/>(high-frequency traders)"]
L1 --> MC[Multicast / pub-sub]
L2 --> MC
L3 --> MC
MC --> SUBS[~1M subscribers]
style MDP fill:#ffaa00,color:#0a0a0f
style ME fill:#15803d,color:#fff
style MC fill:#0e7490,color:#fff
Fan-out to a million subscribers uses multicast, not unicast from the engine. A single multicast packet reaches all subscribers simultaneously, so the engine's output I/O stays constant regardless of how many clients are watching. Sending unicast from the engine would mean the engine's send cost scales with subscriber count — unworkable.
Post-trade: clearing and settlement
A fill is a legal obligation. After a match, the clearing system records the trade (counterparties, quantity, price), and the central counterparty clearinghouse (CCP) steps in as the buyer to every seller and the seller to every buyer, eliminating counterparty risk. Settlement — the actual transfer of shares and cash — happens T+1 for US equities (the US moved from T+2 to T+1 in May 2024); many non-US markets still settle T+2. Settlement is entirely off the latency-critical path; it runs asynchronously after the fill is confirmed.
Partitioning by symbol
Horizontal scale is straightforward: each symbol is an independent partition. AAPL events never touch the MSFT order book. Route events to the correct engine process using hash(symbol_id) % N — or a static routing table for operational simplicity.
flowchart LR
SEQ[Sequencer] --> ROUTE{Symbol Router}
ROUTE -->|AAPL| ME1[Engine: AAPL]
ROUTE -->|MSFT| ME2[Engine: MSFT]
ROUTE -->|TSLA| ME3[Engine: TSLA]
ME1 -.event log.-> LOG1[(Log: AAPL)]
ME2 -.event log.-> LOG2[(Log: MSFT)]
ME3 -.event log.-> LOG3[(Log: TSLA)]
style ROUTE fill:#ff6b1a,color:#0a0a0f
style ME1 fill:#15803d,color:#fff
style ME2 fill:#15803d,color:#fff
style ME3 fill:#15803d,color:#fff
The implication: you can add symbols cheaply (new partition), and a burst in one symbol doesn't affect any other. The top 10 symbols can be pinned to dedicated cores. The remaining coordination point is the sequencer — for cross-symbol events like index arbitrage, some venues use a global sequence number; others allow per-symbol logs with reconciliation at the clearing layer.
Low-latency techniques
The single-threaded architecture alone gets you to respectable throughput with sub-millisecond latency on ordinary hardware. Getting below 10 µs gate-to-gate requires a different class of techniques:
| Technique | What it does | Latency impact |
|---|---|---|
| Kernel bypass (DPDK / RDMA) | NIC writes directly to user-space ring buffer, bypassing the OS kernel network stack | Saves ~10–50 µs per packet |
| Pre-allocated ring buffers | No heap allocation on the hot path; no GC pauses | Eliminates GC jitter (critical for Java-based engines) |
| Busy-spin (no sleep/yield) | Processing thread spins on the ring-buffer tail pointer; never blocks | Sub-microsecond wake latency vs ~10–30 µs for OS wakeup |
| CPU isolation + NUMA pinning | Dedicated cores for the engine; memory allocated on the same NUMA node as the NIC | Avoids cache-line thrashing and NUMA penalty |
| Hugepages | 2 MB pages instead of 4 KB; fewer TLB misses | Reduces memory latency on large order books |
| Binary wire protocol | Proprietary binary format (similar to ITCH/OUCH); no parsing overhead | vs FIX ASCII: ~5–10 µs |
Kernel bypass and busy-spin are co-location techniques. They only matter when you're chasing single-digit microsecond latencies. For a retail-facing exchange where p99 < 1 ms is the target, a well-written Java or Go matching engine on a normal Linux box achieves that without any of these.
Failure modes
Engine crash
The hot standby has been tailing the same event log in real time. When the primary goes down, the standby is promoted — it is at most a few events behind. In-flight events that were sequenced but not yet delivered to the primary are re-delivered to the standby. Recovery is deterministic; no manual state reconciliation is needed. If both primary and standby fail simultaneously (datacenter power failure, for example), recovery falls back to replaying the full log from the last checkpoint. Slower, but always correct.
Sequencer failure
This is more serious: the sequencer is the single point of ordering. Without a sequence number, you cannot determine fairness. Production approaches run the sequencer on hardware with redundant power and redundant network paths, and often deploy a replicated log — a two- or three-node Raft group — so the sequencer can tolerate a node failure. The trade-off: Raft consensus adds roughly 1–5 ms of commit latency per event in a same-datacenter deployment (network round-trip + follower fsync). Acceptable for many exchange classes; unacceptable for co-location-sensitive HFT. Some exchanges accept the risk of a non-replicated sequencer and rely on rapid hardware restart combined with full log replay.
Bad-order storm (fat-finger or rogue algo)
A client's algorithm sends 100,000 orders in 50 milliseconds at absurd prices. The first line of defense is the pre-trade risk checks at the gateway — rate limit per client, fat-finger price bands. Beyond that, circuit breakers in the matching engine halt trading for a symbol if the price moves more than X% in Y seconds. The exchange can also pull a kill switch that immediately cancels all open orders from a specific client ID.
Fairness under load
Under extreme load, the sequencer's queue may build up. The order of service in the queue matters — orders earlier in the queue have an earlier sequence number and will be processed first. The only way to "jump the queue" is to arrive before others on the network, which is what co-location achieves legally. The sequence number is the permanent, auditable record of who was first.
Storage choices
| Data | Store | Why |
|---|---|---|
| Event log (orders/fills in sequence) | Append-only log on NVMe; replicated via Raft or async to standby | Source of truth; needs sequential write throughput, not random access |
| Order book (live state) | In-memory only (single process per symbol) | All reads are sub-microsecond; durability comes from the log |
| Order snapshots (checkpoints) | Local NVMe + object storage (S3/GCS) | Fast local read for recovery; remote copy for disaster recovery |
| Trade records (filled orders) | Postgres (relational, transactional) | Regulatory queries, joins with account data |
| Market data history | Time-series DB (InfluxDB, Kdb+, or columnar store) | Backtesting, analytics, charting |
| Client account state | Postgres + risk engine cache | Consistency needed; not on the sub-millisecond path |
Building up to the design
It helps to see how the complexity earns its keep at each step.
V1: A sorted list per symbol, in-process
bids = SortedList() # price descending
asks = SortedList() # price ascending
def place_order(side, price, qty):
if side == BUY and asks and asks[0].price <= price:
# match
...
else:
bids.add(Order(price, qty, timestamp.now()))
This works at a few thousand orders per second and takes maybe a weekend to write. The fatal flaw is obvious: process restart wipes every open order. For a real exchange, that is catastrophic — traders have legal claims against those orders. You need durability before you ship anything.
V2: Write each order to a database before matching
Now every order is durably written to Postgres before the engine touches it. Recovery: query all open orders, rebuild the book. Durability is solved, but two new problems appear. First, Postgres write latency is roughly 1 ms, and it sits on the hot path — at 500k events/sec the database saturates almost immediately. Second, when you try to recover, Postgres INSERT timestamps don't give you microsecond-precision ordering, so you cannot guarantee you rebuild the book in exactly the right sequence.
V3: Write-ahead log + in-memory engine
Decouple durability from matching. The sequencer appends to a write-ahead log (append-only to NVMe, which is fast). The engine reads from the log and matches in memory. Recovery means replaying the log. Engine latency drops by 10–100× because the database is no longer in the hot path, and ordering is now exact — the log's sequence number is the order. The remaining gap: failover for a busy symbol means replaying minutes of events, which takes minutes. Not good enough.
V4: Hot standby + checkpoints
The standby tails the same log in real time. Periodic checkpoints — every 60 seconds or so — mean replay starts from a recent snapshot, not from open of day. Failover now takes seconds, not minutes. The next thing that breaks is fan-out: every fill needs to reach thousands of market-data subscribers, and if the engine handles that itself, subscriber count starts affecting match latency.
V5: Dedicated market-data publisher + multicast
The engine emits fill events to a separate market-data process that handles fan-out via multicast. The engine is now pure matching — it doesn't know or care how many clients are watching. This is the production-ready architecture.
flowchart LR
V1["V1: in-memory sorted list<br/>no durability"] --> V2["V2: + DB writes<br/>durable, slow"]
V2 --> V3["V3: + WAL, in-memory engine<br/>fast, hard failover"]
V3 --> V4["V4: + hot standby + checkpoints<br/>fast failover"]
V4 --> V5["V5: + market-data publisher<br/>production-ready"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
V5 is the architecture this article has been describing. Everything beyond this is operational tuning and low-latency engineering.
Things to discuss in an interview
- Why single-threaded? Determinism, fairness, cache efficiency. Tie it to the mechanical sympathy principle.
- Event sourcing as the durability primitive. Not "save the order book to a database" — "the log IS the database."
- The sequencer as the fairness primitive. Sequence number = provable order of arrival. Auditable.
- Price-time priority and its data structure. TreeMap of queues; FIFO within a level. Be able to trace through a sample match.
- Partitioning by symbol. Why it works, what it enables (linear scale), what the remaining coordination point is (the sequencer for cross-symbol events like index arbitrage).
- Hot standby vs cold standby. Hot standby tails the live log; cold standby replays from snapshot. Trade-off: resource cost vs failover speed.
- Pre-trade risk vs post-trade. Risk checks at the gateway, before the log, keep the hot path clean.
- Market-data fan-out. Multicast, not unicast. The matching engine should not care about subscriber count.
Things you should now be able to answer
- What is price-time priority, and what data structure implements it?
- Why does every matching engine use a single thread per symbol?
- What is the role of the sequencer, and why is the sequence number a fairness primitive?
- How does event sourcing enable crash recovery without putting durability on the hot path?
- What does a hot standby do differently from a cold standby?
- What pre-trade checks happen before an order hits the matching engine?
- Why is multicast used for market data fan-out?
- What is a circuit breaker in a matching engine context?
Further reading
- "LMAX Architecture" — Martin Fowler's blog (martinfowler.com) — the seminal description of the Disruptor pattern.
- "Exchange Connectivity" — Nasdaq ITCH and OUCH protocol specifications (public, nasdaqtrader.com) — study a real binary protocol.
- "How Modern Matching Engines Work" — Jane Street tech blog and similar — firm-level write-ups, though internal specifics vary.
- Event sourcing and CQRS — the broader pattern that the event log implements.
- "Mechanical Sympathy" — Martin Thompson's blog — CPU cache hierarchy, NUMA, and how to write code that the hardware likes.
- "The Art of Multiprocessor Programming" — Herlihy & Shavit — formal treatment of lock-free data structures used in the Disruptor ring buffer.
Frequently asked questions
▸What is price-time priority and what data structure implements it?
Price-time priority means the best price is always matched first, and within a price level, orders are filled in the order they arrived — strictly first in, first out. The order book implements this as a TreeMap of price levels (sorted descending for bids, ascending for asks), where each price level is a FIFO queue. Insert is O(log P), cancel is O(1) via an orders_by_id HashMap index, and best bid/ask lookup is O(1).
▸Why does a matching engine use a single thread per symbol instead of multiple threads?
Multiple threads touching the same order book require locks, and lock contention under load can reorder operations — meaning two simultaneously arriving orders could be serialized in either order, which is a fairness violation. A single thread eliminates the ambiguity: fairness falls out of the architecture for free, with no coordination overhead. LMAX's published benchmark measured 6 million orders per second on a single thread on 2010-era hardware, and a busy symbol at a typical exchange handles around 50k events per second — well within one core's capacity.
▸What is the sequencer's role in a stock exchange, and why is the sequence number called a fairness primitive?
The sequencer stamps each incoming order with a monotonically increasing sequence number and appends it to a durable log before the matching engine ever sees the event. That sequence number defines the canonical, legally auditable order of all events — once stamped, an order's place in the queue is permanent. Fairness is provable from the log because the sequence number is the permanent record of who arrived first.
▸How does event sourcing enable crash recovery without putting disk I/O on the matching engine's hot path?
The matching engine is a pure function of its event stream: its in-memory state is just a materialized view of the log, never persisted directly. After a crash, the engine replays the event log from the last checkpoint and arrives at exactly the same state — no distributed handshake or state reconciliation needed. The engine itself never writes to disk; all I/O is in the sequencer, which writes append-only and leaves the matching path untouched.
▸Why does market data fan-out use multicast instead of unicast from the matching engine?
A single multicast packet reaches all subscribers simultaneously, so the engine's output I/O stays constant regardless of how many clients are watching. Sending unicast from the engine would mean send cost scales with subscriber count — at roughly 1 million subscribed connections for a consolidated feed, that is unworkable. The matching engine emits fill events to a separate market-data publisher, which handles fan-out via multicast and keeps the engine oblivious to subscriber count.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.