~/articles/lsm-trees-vs-b-trees

◆◆◆Advancedasked at Googleasked at Amazonasked at Meta

Storage Engines: LSM-Trees vs B-Trees

Why does Postgres read fast and Cassandra write fast? The two storage-engine families that underpin every database — and their write/read/space amplification trade-offs.

19 min read2026-03-18Ironclad Academy

#databases #storage #performance #internals

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

Every database you've ever used is built on one of two foundational storage-engine designs. Knowing them at the mechanism level — not just "Postgres uses B-trees" but why and what that costs — is what separates a deliberate database choice from a guess.

The unifying lens: three amplification factors

Before we open the hood on either design, we need a vocabulary for comparing them. The database research community uses three amplification factors as the canonical lens.

Write amplification (WA) is the ratio of bytes written to disk versus bytes written by the application. A WA of 10 means that inserting 1 GB of data causes 10 GB of actual disk writes. High WA reduces disk lifetime, saturates I/O bandwidth, and burns CPU.

Read amplification (RA) is the number of disk reads required per logical read the application requested. A RA of 5 means fetching one value touches 5 disk pages. High RA means slow point lookups and slow range scans — or heavy reliance on caches.

Space amplification (SA) is how many bytes sit on disk relative to bytes of live data. SA of 2 means 1 GB of actual data occupies 2 GB on disk. High SA means bigger disks, more backup bandwidth, and higher cloud storage costs.

The central insight: you cannot minimize all three simultaneously. B-trees optimize for low read and space amplification at the cost of higher random-write pressure. LSM-trees minimize write latency and maximize write throughput at the cost of higher write amplification and a more complex read path.

B-trees: in-place, balanced, predictable

The core structure

A B-tree is a self-balancing tree of fixed-size pages — typically 4 KB or 8 KB on disk — where each interior page holds N key-pointer pairs and N+1 child pointers. Leaf pages hold the actual key-value pairs (or key-rowid pairs pointing to a heap file).

Root page (level 0)
├── Interior pages (level 1)
│     keys: [...], child pointers: [...]
└── Leaf pages (level 2)
      keys: [k1, k2, k3], values: [v1, v2, v3]

For a typical production database with branching factor ~200 and 4 KB pages:

A 3-level tree supports ~200² = 40,000 leaf pages × (leaf entries per page) ≈ tens of millions of rows, depending on row size.
A point lookup traverses exactly 3 pages (root → interior → leaf). A range scan reads the leaf level sequentially until the range is exhausted.

That 3-page depth is why B-tree reads are fast: the tree height grows logarithmically, and in practice a busy database's root page — and often the upper levels — lives permanently in the buffer pool (page cache) in RAM. A hot-path read often touches only 1 disk read (the leaf), with root and interior pages served from cache.

The write path: WAL + in-place update

When you INSERT or UPDATE a row:

Write the change to the WAL (Write-Ahead Log) — a sequential append to a file on disk. This is the durability guarantee: if the machine crashes before the page is flushed, the WAL lets the engine replay the change.
Modify the page in the buffer pool — the page is now "dirty."
Eventually, flush the dirty page to its fixed location on disk. This is a random write — the page lives at a specific offset determined by the tree structure.

Write amplification for a simple update comes to at least ~2 physical writes (WAL record + dirty page flush). When a leaf page is full and must split into two, the WA can be higher, and the split must also be logged.

sequenceDiagram
    participant App
    participant WAL as WAL (sequential)
    participant BP as Buffer Pool
    participant Disk as B-Tree Pages (random)
    App->>WAL: append change record
    App->>BP: modify page in memory
    Note over BP,Disk: background checkpoint
    BP->>Disk: flush dirty page to fixed offset

Range scans and the read advantage

B-trees store leaf pages in key order, and adjacent leaves are (usually) physically nearby on disk. A range query like SELECT * FROM orders WHERE created_at BETWEEN t1 AND t2 traverses the leaf level sequentially — an access pattern that saturates disk read bandwidth. LSM-trees can also serve range scans (SSTables are sorted), but doing so requires merging multiple SSTable iterators, which adds overhead.

Crash recovery

The WAL is the safety net. At startup after a crash, the engine replays WAL records forward from the last checkpoint, re-applying any writes that didn't make it to the page files. Pages are always written as full page images (or with checksums) to detect torn writes. Most B-tree engines recover in seconds to minutes depending on WAL size.

LSM-trees: append-only, sorted, compacted

The Log-Structured Merge-tree was introduced by O'Neil et al. in 1996 specifically to solve the random-write bottleneck of traditional storage engines on spinning disk. The design answer: never do a random write to a data file; only ever append.

The write path: memtable → SSTable

sequenceDiagram
    participant App
    participant WAL as WAL (sequential)
    participant MT as Memtable (RAM)
    participant L0 as L0 SSTables (disk)
    participant LN as L1...LN SSTables (disk)
    App->>WAL: append WAL record (durability)
    App->>MT: insert into in-memory sorted structure
    Note over MT: when memtable is full (~64 MB–256 MB)
    MT->>L0: flush as immutable SSTable (sequential write)
    Note over L0,LN: background compaction thread
    L0->>LN: merge and compact SSTables

Here's how each step works:

WAL write — same as B-tree: the write is logged first for crash durability.
Memtable insert — the key-value pair is inserted into an in-memory sorted data structure (commonly a skip list or a red-black tree). This is purely a memory operation and is very fast.
Memtable flush — when the memtable reaches a configured size limit (often 64–256 MB), it is flushed to disk as an SSTable (Sorted String Table): a file where key-value pairs are written in sorted key order, with a sparse index at the start. This is a large sequential write — maximally efficient on disk.
SSTable is immutable — once written, an SSTable is never modified. Overwrites and deletes write new entries; older versions are cleaned up during compaction.

SSTables: the unit of LSM storage

An SSTable has a specific structure:

┌─────────────────────────────────────────────┐
│ Data blocks (sorted key-value pairs)        │
│   block 0: [k1,v1], [k2,v2], [k3,v3], ...  │
│   block 1: [k100,v100], ...                 │
├─────────────────────────────────────────────┤
│ Index block (sparse: first key per block)   │
│   k1 → offset 0, k100 → offset 4096, ...   │
├─────────────────────────────────────────────┤
│ Bloom filter (one per SSTable)              │
│   "does key K exist in this file?"         │
├─────────────────────────────────────────────┤
│ Metadata (min key, max key, level, etc.)    │
└─────────────────────────────────────────────┘

The sparse index and the min/max key metadata let the engine quickly skip SSTables that cannot contain a queried key. The bloom filter is the decisive optimization for point reads — more on that below.

The read path: newest wins

To read a key, the engine checks in order:

Memtable — if the key is there, return it (most recent write).
L0 SSTables — may overlap in key range; check each in reverse chronological order.
L1, L2, ... SSTables — at each level below L0, SSTables are non-overlapping; binary search into the correct SSTable, then look up within it.

Without any optimization, a cold read on an LSM with many levels could touch every SSTable. Two mechanisms make this manageable in practice:

Bloom filters — each SSTable carries a bloom filter. Before performing any disk I/O on an SSTable, the engine checks the bloom filter: "does this SSTable possibly contain key K?" If the answer is no (true negative is guaranteed), skip the file entirely. The false-positive rate is tunable: at ~0.8% FPR (~10 bits per key, 7 hash functions), ~99% of unnecessary reads are eliminated. See bloom filters for the mechanics.

Key range pruning — each SSTable's metadata includes the min and max key it covers. Before checking the bloom filter, the engine can skip any SSTable whose key range doesn't include the query key.

With both optimizations active, a point read in a well-tuned LSM typically touches 1–2 SSTables, approaching B-tree read performance for hot data.

Tombstones: the LSM delete

Because SSTables are immutable, you cannot delete a record by removing it. A delete instead inserts a tombstone: a special marker record for the key that signals "this key is deleted." Readers that encounter a tombstone during their newest-first sweep return "not found." The actual removal happens during compaction.

This has an important implication: a key that was inserted and then deleted still occupies disk space until a compaction sweeps both the original record and the tombstone into the same output run and discards both. In extreme cases — high delete rates without sufficient compaction throughput — tombstones accumulate and both space amplification and read latency grow. Cassandra requires tuning of gc_grace_seconds precisely to control this.

flowchart LR
    A["Write: key='user:42', value='alice'"] --> SST1["SSTable (L1): user:42=alice"]
    B["Delete: key='user:42'"] --> MT["Memtable: user:42=TOMBSTONE"]
    MT --> SST2["SSTable (L0): user:42=TOMBSTONE"]
    SST1 & SST2 --> COMP["Compaction"]
    COMP --> OUT["Output SSTable: (user:42 discarded)"]
    style COMP fill:#ff6b1a,color:#fff
    style OUT fill:#15803d,color:#fff

Compaction: the heart of LSM-tree maintenance

Compaction is the background process that merges SSTables, removes stale versions, removes tombstone-covered records, and reorganizes data across levels. It is why LSM write amplification is high, and why understanding its strategies matters for capacity planning.

Size-tiered compaction

SSTables of similar size are grouped into tiers. When a tier accumulates N files (commonly N=4), they are merged into one larger file at the next tier.

flowchart TB
    T0A["Tier 0: 64 MB"] & T0B["64 MB"] & T0C["64 MB"] & T0D["64 MB"] --> MERGE1["Merge → Tier 1: 256 MB"]
    T1A["Tier 1: 256 MB"] & T1B["256 MB"] & T1C["256 MB"] & T1D["256 MB"] --> MERGE2["Merge → Tier 2: 1 GB"]
    style MERGE1 fill:#ffaa00,color:#0a0a0f
    style MERGE2 fill:#ff6b1a,color:#fff

Size-tiered compaction writes less data overall — each record is rewritten fewer times before reaching its final tier — which is great for write-heavy workloads. The catch is space amplification: all N source files must exist simultaneously during a merge, so you can briefly need up to N× the space of a single tier. Multiple versions of the same key may also exist across many SSTables, making reads slower than they would be with leveled compaction. Cassandra and Apache HBase use this strategy by default.

Leveled compaction (LevelDB / RocksDB default)

Data is organized into levels (L0, L1, L2, ...). Each level has a size limit that grows by a fixed multiplier (commonly 10×):

L0: ~4 files trigger compaction (overlapping ranges allowed; stall at 20, stop at 36 by default in RocksDB)
L1: ~10 MB  (non-overlapping, sorted by key)  [illustrative; RocksDB default L1 is 256 MB]
L2: ~100 MB (non-overlapping, sorted by key)
L3: ~1 GB
L4: ~10 GB
...

When L0 accumulates enough files, a compaction picks one L0 file and merges it with the overlapping files in L1, producing new L1 files. Similarly, when L1 exceeds its limit, files are pushed down to L2. At every level below L0, the key ranges of SSTables are non-overlapping, which means a point read in L1+ always touches at most one SSTable (after the range check).

The trade-off is inverted compared to size-tiered: leveled compaction keeps reads fast and space usage low, but a record may be rewritten many times as it cascades from L0 → L1 → L2 → ... → LN. Measured write amplification for leveled compaction is commonly 10–30× in practice, depending on level count and size ratio.

Factor	Size-tiered	Leveled
Write amplification	Low (~5–10×)	High (~10–30×)
Read amplification (worst-case)	High	Low (~levels count)
Space amplification	High (up to N× during merge)	Low (~1.1–1.3×)
Typical use case	Write-heavy, Cassandra, HBase	Mixed, RocksDB, LevelDB

A third strategy, FIFO compaction, simply discards the oldest SSTables once a size limit is reached — suitable only for time-series data with a known retention window where old data is intentionally evicted.

Amplification numbers: a worked example

Suppose you are ingesting 100 MB/s of key-value writes into each engine.

B-tree engine:

WAL writes: 100 MB/s (sequential)
Dirty page flushes: ~100 MB/s (random, but only ~1 page per ~few KB of data depending on fill factor)
Write amplification: ~2–4×
Total disk write bandwidth: ~200–400 MB/s
At 100 MB/s incoming, this is achievable on commodity NVMe.

LSM engine (leveled compaction):

WAL writes: 100 MB/s (sequential)
Memtable flush (L0): 100 MB/s (sequential)
Compaction L0→L1: rewrites L0 files plus overlapping L1 files ≈ 200–300 MB/s
Compaction L1→L2: ~300–500 MB/s
Write amplification: 10–30× → total ~1,000–3,000 MB/s
This requires serious sequential I/O bandwidth — which NVMe SSDs (at 3–7 GB/s) provide, explaining why LSM-tree databases are more competitive today than on spinning disk where random and sequential bandwidth are both constrained.

The bottom line: LSM's higher write amplification is "paid for" with sequential I/O that saturates fast storage, while B-tree random writes are gated by IOPS — often the binding constraint on spinning disk and still relevant on NVMe under heavy random-write pressure.

Full architecture comparison

flowchart TD
    subgraph BTREE["B-Tree Engine (e.g. Postgres/InnoDB)"]
        BW["Write arrives"] --> BWAL["WAL append\n(sequential)"]
        BWAL --> BBP["Modify page in buffer pool"]
        BBP --> BDISK["Flush dirty page to B-tree node\n(random write to fixed offset)"]
        BR["Read arrives"] --> BCACHE{"Buffer pool\nhit?"}
        BCACHE -->|yes| BRET["Return value"]
        BCACHE -->|no| BPAGE["Read page from disk\n(O(log N) pages)"]
        BPAGE --> BRET
    end
    subgraph LSMTREE["LSM-Tree Engine (e.g. RocksDB/Cassandra)"]
        LW["Write arrives"] --> LWAL["WAL append\n(sequential)"]
        LWAL --> LMT["Insert into memtable\n(sorted, in-memory)"]
        LMT -->|"full (64-256 MB)"| LFLUSH["Flush to SSTable\n(large sequential write)"]
        LR["Read arrives"] --> LMT2["Check memtable"]
        LMT2 -->|miss| LBF["Check bloom filter\nper SSTable"]
        LBF -->|"may contain"| LREAD["Binary search in SSTable"]
        LREAD --> LRET["Return newest version"]
        LFLUSH -.background.-> LCOMP["Compaction:\nmerge + sort SSTables"]
    end
    style BWAL fill:#a855f7,color:#fff
    style BBP fill:#ff2e88,color:#fff
    style BDISK fill:#0e7490,color:#fff
    style LWAL fill:#a855f7,color:#fff
    style LMT fill:#ff6b1a,color:#fff
    style LFLUSH fill:#ffaa00,color:#0a0a0f
    style LCOMP fill:#15803d,color:#fff
    style LBF fill:#ff2e88,color:#fff

Bloom filters: LSM's read equalizer

A bloom filter is a probabilistic data structure that answers "is element X possibly in this set?" in O(1) time and O(bits-per-element) space, with a controllable false positive rate and zero false negatives.

In an LSM engine, each SSTable carries one bloom filter built at flush time. A point read queries the bloom filter before touching the SSTable's index or data blocks:

Bloom filter says "no": guaranteed safe to skip this SSTable entirely. No disk I/O.
Bloom filter says "maybe yes": proceed to check the sparse index and data blocks.

At 10 bits per key and 7 hash functions, the false positive rate is ~0.8% (from the formula (1 − e^(−kn/m))^k). For 100 M keys, that is ~125 MB of bloom filter data — easily held in memory. Without bloom filters, a read missing all SSTables would require checking every one; with bloom filters, only ~0.8% of SSTables trigger a false positive disk read.

flowchart LR
    READ["Point read: key K"] --> MT["Check memtable"]
    MT -->|miss| BF{"Bloom filter:\n'is K in this SSTable?'"}
    BF -->|"no (true negative)"| SKIP["Skip SSTable entirely\n(zero disk I/O)"]
    BF -->|"maybe (check index)"| IDX["Read sparse index"]
    IDX --> DATA["Read data block"]
    DATA --> RET["Return value"]
    SKIP --> BF2{"Next SSTable\nbloom filter"}
    BF2 -->|"no"| SKIP2["Skip"]
    BF2 -->|"maybe"| IDX2["Read & return"]
    style BF fill:#ff2e88,color:#fff
    style SKIP fill:#15803d,color:#fff
    style DATA fill:#ff6b1a,color:#fff
    style BF2 fill:#ff2e88,color:#fff

Notice that for a key absent from the entire LSM tree, only ~0.8% of SSTables cause an unnecessary disk read — the rest are eliminated in memory. This is why bloom filters aren't an optional optimization in LSM systems; they're load-bearing.

Read the bloom filter article for the full probability derivation and implementation.

Storage engine to database mapping

Database	Engine	Family	Notes
PostgreSQL	Custom heap + B-tree indexes	B-tree	WAL-based crash recovery; MVCC for concurrency
MySQL InnoDB	InnoDB B+ tree	B-tree	Clustered primary key index; change buffer
SQLite	Custom B-tree	B-tree	Single-writer; WAL mode available in v3.7+
LevelDB	LevelDB	LSM	Google's foundational LSM library; leveled compaction
RocksDB	RocksDB (LevelDB fork)	LSM	Meta's production-hardened fork; widely embedded
Cassandra	Storage engine based on LSM	LSM	Size-tiered compaction by default; SSTable per node
HBase	Inspired by Bigtable; LSM via MemStore+HFile	LSM	HDFS-backed SSTables
ScyllaDB	Native C++ LSM engine	LSM	Cassandra-compatible; shard-per-core design
Google Bigtable	Bigtable tablet server (GFS-backed)	LSM	Original paper describes memtable + SSTable design
MongoDB (WiredTiger)	WiredTiger B-tree	B-tree	Default engine since MongoDB 3.2
CockroachDB	Pebble (RocksDB-compatible)	LSM	Distributed SQL on top of LSM; Pebble is a Go-native LSM engine inspired by LevelDB/RocksDB, developed by Cockroach Labs
TiKV	RocksDB	LSM	Storage layer of TiDB; RocksDB for key-value

See how to choose a database and SQL vs NoSQL for the higher-level selection framework. For applying these concepts to a key-value store design, see design a key-value store.

When to choose which

B-tree engines shine on read-heavy OLTP — user profile lookups, product catalog queries, order status checks. The O(log N) point reads with a hot buffer pool are hard to beat, and for complex range queries ("orders placed between Jan and March, sorted by total amount"), the B-tree leaf traversal is a natural fit. Most B-tree engines also ship with mature MVCC implementations, making ACID transactions first-class. If your workload is mixed read/write with stable, in-place updates and you're not overwhelming disk IOPS, a B-tree engine is the right default.

LSM engines earn their keep under sustained write pressure. Time-series metrics, event logs, sensor data, and clickstream all involve sequential appending where sequential memtable flushes saturate disk write bandwidth. If your data model is wide-column or sparse — Cassandra's partition key → clustering key → column layout maps naturally onto SSTable sorted order — an LSM engine feels native. Systems like key-value stores with high write rates and bloom-filter-backed reads are another natural fit.

There are a few situations worth calling out explicitly where the default choice bites you:

High delete rates with LSM. Tombstones accumulate until compaction catches up. If you're building a TTL-heavy system — expiring sessions, expiring cache entries — either tune compaction aggressively or reconsider the engine. Cassandra's gc_grace_seconds exists precisely for this reason.

B-tree under write saturation. If you're sustaining millions of random writes per second, the IOPS ceiling on the underlying storage becomes a hard wall. Batching, bulk-load APIs, or switching to an LSM engine are the escape hatches.

Cold reads on a large LSM. If reads are rare and bloom filters are not loaded in memory — for example, after a restart on a large dataset — the first wave of reads will be slow as bloom filter data pages in from disk. This is usually short-lived but worth knowing if you run large, mostly-idle LSM deployments.

Failure modes

Failure	B-tree behavior	LSM-tree behavior
Process crash mid-write	WAL replay on restart; page state restored	WAL replay; memtable rebuilt from WAL; SSTables are durable (immutable)
Torn write (partial page)	Page checksum detects corruption; restore from WAL	Not possible for completed SSTables (immutable); in-progress flush is aborted and replayed
Disk full	Writes fail; DB goes read-only	Writes can fail; compaction falls behind; read amplification increases; tombstones accumulate
Compaction falls behind (LSM)	N/A	L0 file count grows; reads degrade; engine may throttle or pause writes ("write stall")
Large index rebuild	Expensive; full tree traversal + rewrite	"Compaction" is effectively continuous; a full rewrite is a bottom-level compaction

The write stall in LSM deserves a closer look. If the application writes faster than compaction can keep up — typically when L0 file count exceeds a threshold (RocksDB defaults to stalling at 20 L0 files and stopping at 36) — the engine intentionally slows or pauses writes to let compaction catch up. This produces a sudden latency spike that B-tree engines don't exhibit; they slow gradually under IOPS pressure instead. Operational monitoring of L0 file count, pending compaction bytes, and compaction I/O rate is essential for LSM-backed services in production.

Things to discuss in an interview

The three amplification factors as a unifying framework — immediately shows you think structurally, not just "B-tree good, LSM bad."
Why LSM's sequential writes matter: the gap between sequential and random I/O (especially on spinning disk) and why that drove the original LSM design.
Compaction strategy trade-offs: size-tiered vs leveled. Show you know this is a dial, not a fixed property.
Bloom filters as a first-class design component: not an afterthought, but a load-bearing piece of the read path. Mentioning bits-per-key and false-positive rate signals depth.
Tombstone accumulation in LSM: a real operational risk that interviewers love to probe with "what happens when you have a lot of deletes?"
Write stalls: LSM engines can suddenly pause writes when compaction falls behind — a production risk that B-tree engines don't have.
Connection to database choice: when you are designing a system with a write-heavy workload, name the storage-engine family, not just the brand.

Things you should now be able to answer

Why does Cassandra have lower write latency than Postgres under sustained high write throughput?
What is the role of the memtable in an LSM engine, and what happens to it on a crash?
Why do LSM-tree databases need bloom filters, and what is the cost of not having them?
What is a tombstone, and why does it matter for LSM databases with heavy delete workloads?
When would you prefer size-tiered over leveled compaction, and what do you give up?
What is write amplification, and why does leveled compaction produce more of it than size-tiered?
Why can an LSM engine stall writes, and how would you monitor and prevent it?
Given a time-series ingestion pipeline writing 500 MB/s of new data, which engine family would you reach for and why?

Frequently asked questions

▸What is write amplification, and how does it differ between B-trees and LSM-trees?

Write amplification is the ratio of bytes written to disk versus bytes written by the application. B-trees produce a write amplification of roughly 2-4x (WAL record plus dirty page flush). LSM-trees with leveled compaction produce 10-30x because each record is rewritten repeatedly as it cascades from L0 through successive levels during compaction.

▸Why do LSM-tree databases like Cassandra and RocksDB use bloom filters?

Without bloom filters, a point read that misses the memtable must check every SSTable on disk, producing unbounded read amplification. Each SSTable carries a bloom filter so the engine can skip it entirely — at 10 bits per key and 7 hash functions, the false positive rate is roughly 0.8%, eliminating about 99% of unnecessary disk reads. For 100 million keys this costs only about 125 MB of memory, making bloom filters load-bearing rather than optional.

▸What is a tombstone in an LSM-tree and why does it matter for delete-heavy workloads?

Because SSTables are immutable, a delete writes a tombstone marker record rather than removing the key in place. The original record and tombstone both occupy disk until a compaction merges them and discards both. Under high delete rates, tombstones accumulate faster than compaction can clear them, increasing both space amplification and read latency — which is why Cassandra exposes a gc_grace_seconds tuning parameter specifically to control this.

▸When should I choose a B-tree engine over an LSM-tree engine?

Choose a B-tree engine for read-heavy OLTP workloads such as user profile lookups, product catalog queries, and order status checks where O(log N) point reads with a warm buffer pool are hard to beat. B-tree engines are also the natural choice when your workload involves complex range queries, in-place updates, and you need mature ACID transaction support without hitting the disk IOPS ceiling.

▸What is an LSM write stall and what causes it?

A write stall occurs when the application writes faster than the compaction thread can merge SSTables, causing the L0 file count to grow past a threshold. RocksDB defaults to throttling writes at 20 L0 files and halting them at 36. This produces a sudden latency spike — unlike B-tree engines, which slow gradually under IOPS pressure — making L0 file count and pending compaction bytes essential metrics to monitor in production LSM deployments.

← previous

Design Slack (team chat at scale)

Design a Stock Exchange (matching engine)

// RELATED