Database Indexing
B-trees, LSM-trees, hash indexes, covering indexes, partial indexes — what they do, when they hurt, and how to read an EXPLAIN plan with confidence.
An index is the single biggest performance lever in a database. It's also the single biggest source of "why is this query suddenly slow?" pages. Most engineers know to "add an index." Far fewer know which kind, on which columns, in what order, or why a perfectly good index sometimes makes things worse.
This article builds the mental model from the ground up: what indexes actually are, the two main families (B-tree, LSM), the variants worth knowing (hash, bitmap, GIN, GiST, BRIN), and the trade-offs that decide whether your INSERT just got 10× slower.
Why indexes exist
A table is a pile of rows on disk. Without an index, finding "all rows where email = 'a@b.com'" is a full table scan.
flowchart LR
Q["SELECT * WHERE email = 'a@b.com'"] --> SCAN[Full scan]
SCAN -->|10M rows| ROW1[row 1]
SCAN --> ROW2[row 2]
SCAN --> DOTS[...]
SCAN --> ROWN[row 10M]
ROWN --> R[1 match]
style SCAN fill:#ff2e88,color:#fff
That's 10M reads to find 1 row. Indexes turn this into:
flowchart LR
Q["SELECT * WHERE email = 'a@b.com'"] --> IDX[(Index on email)]
IDX -->|"O(log N)"| PTR[row pointer]
PTR --> ROW[1 row]
style IDX fill:#15803d,color:#fff
An index is a separate data structure that maps from values to row locations. Finding by indexed value goes from O(N) to O(log N) or O(1) depending on the kind.
You pay for this in two ways: storage (indexes can equal or exceed the size of the table) and write cost (every INSERT/UPDATE/DELETE has to update every index that covers a changed column). That's the fundamental trade-off — and it's why you can't just index everything.
The two big families
B-tree (and B+ tree) indexes
The default in Postgres, MySQL, SQL Server, Oracle, MongoDB.
A B-tree is a balanced multi-way tree. Each internal node holds many keys (typically 100–1000) and points to children. Leaf nodes hold the actual values or pointers.
flowchart TD
R["Root:<br/>[M, T]"] -->|< M| N1["[E, J]"]
R -->|M to T| N2["[O, Q]"]
R -->|> T| N3["[V, Z]"]
N1 -->|< E| L1["[A, C, D]"]
N1 -->|E to J| L2["[F, G, H]"]
N1 -->|> J| L3["[K, L]"]
N2 -->|< O| L4["[M, N]"]
N2 -->|O to Q| L5["[P]"]
N2 -->|> Q| L6["[R, S]"]
N3 -->|< V| L7["[T, U]"]
N3 -->|V to Z| L8["[W, X]"]
N3 -->|> Z| L9["[Z+]"]
style R fill:#ff6b1a,color:#0a0a0f
Lookups, inserts, and deletes are all O(log N). Range queries are efficient because leaf nodes are linked left-to-right — once you find your starting row, the rest are sequential reads. Pages are typically 8 KB (Postgres) or 16 KB (MySQL).
A B+ tree stores all values in leaves; internal nodes are just navigation. Almost all "B-tree" indexes you'll encounter are actually B+ trees.
B-trees balance reads and writes well, making them the right default for OLTP workloads. Their weakness shows under heavy random writes: each insert touches a leaf page, possibly splitting it, and over time pages fragment. If your workload is primarily writing, that steady fragmentation cost adds up.
LSM-tree indexes
The default in Cassandra, RocksDB, LevelDB, BigTable, ScyllaDB, and most modern key-value stores.
LSM stands for Log-Structured Merge-Tree. The core insight is simple: sequential writes are much faster than random writes, so instead of updating a tree in place, you always write sequentially and sort things out later.
Writes go to an in-memory sorted structure (the memtable). When the memtable fills up, it flushes to a sorted file on disk (an SSTable). Periodically, old SSTables merge together in the background.
flowchart TD
W[Write] --> MT[Memtable<br/>in RAM, sorted]
MT -->|flush when full| L0[SSTable L0]
L0 -->|compact| L1[SSTable L1<br/>larger, sorted]
L1 -->|compact| L2[SSTable L2]
style MT fill:#15803d,color:#fff
style L0 fill:#0e7490,color:#fff
The tradeoff: reads may have to check multiple SSTables to piece together the current value of a key. Bloom filters let each SSTable quickly say "this key definitely isn't here," so in practice most SSTables are skipped. Compaction runs in the background and costs CPU and I/O — but it keeps the number of SSTables bounded.
LSM-trees shine at write-dominated, append-heavy workloads. The cost is read amplification (multiple files to check) and write amplification from compaction rewriting data. Deletes are also awkward — a tombstone marker is written, and the actual data disappears only when a compaction sweeps the relevant SSTables.
B-tree vs LSM at a glance
| Aspect | B-tree | LSM |
|---|---|---|
| Writes | Random I/O | Sequential I/O |
| Reads | One tree traversal | Multiple SSTable checks |
| Range queries | Excellent | Excellent (sorted SSTables) |
| Write amplification | Low¹ | Compaction-strategy-dependent: leveled compaction (RocksDB default) typically 10–30×; size-tiered compaction (Cassandra default) typically 3–10× |
| Read amplification | Low | Medium (Bloom filters reduce SSTable checks) |
| Space amplification | Low | Higher (multiple copies during compaction; typically 1.1–3× data size) |
| Typical use | OLTP | Write-heavy, append-mostly |
¹ B-tree write amplification is low per logical write (one leaf-page write) but can spike under heavy random-write workloads that cause page splits and fragmentation. LSM trades that spike for a steady background compaction cost.
Index types beyond B-tree / LSM
Most databases offer several index types. Postgres alone has 7: B-tree, Hash, GiST, SP-GiST, GIN, BRIN, and bloom (the last via the bloom extension). The sections below cover the most interview-relevant ones.
Hash indexes
Map key → bucket by hash. O(1) lookup. No range queries — hashing destroys order, so there's no way to say "give me all keys between A and B."
Use hash indexes for exact-match lookups only. Some KV stores use them internally. In production OLTP, a B-tree is nearly as fast and supports far more query shapes, so hash indexes rarely earn their keep.
Bitmap indexes
For very-low-cardinality columns (gender, status, country), store a bitmap per value: bit i = 1 if row i has that value. Bitwise AND/OR across bitmaps makes multi-condition queries fast.
These work well in data warehouses and analytics on columns with fewer than ~100 distinct values that are mostly read-only. In OLTP with frequent updates, bitmaps don't update efficiently — each write touches multiple bitmaps.
GIN (Generalized Inverted Index) — Postgres
For columns where each row has many values: arrays, JSONB, full-text.
CREATE INDEX idx_tags ON posts USING GIN (tags);
SELECT * FROM posts WHERE tags @> ARRAY['system-design'];
Use GIN for full-text search, JSONB key lookups, and array containment queries. It's the right tool any time you need to ask "which rows contain this value?" rather than "what value does this row have?"
GiST (Generalized Search Tree) — Postgres
A framework for indexing unusual data: geometric data (PostGIS), ranges, similarity.
CREATE INDEX idx_location ON places USING GIST (location);
SELECT * FROM places WHERE location <-> POINT(0,0) < 10;
SP-GiST (Space-Partitioned GiST) — Postgres
Like GiST but for space-partitioned structures: quadtrees, k-d trees, radix trees. Useful when the data naturally partitions into non-overlapping regions (point data, prefix-searchable strings). Rarely comes up in interviews; know it exists.
BRIN (Block Range Index) — Postgres
Tiny indexes that summarize ranges of pages. Instead of indexing every row, BRIN stores the min and max value for each block of pages. A query for created_at > yesterday can skip all blocks whose max timestamp is before yesterday.
This makes BRIN extremely space-efficient — often hundreds of times smaller than a B-tree — but it's only effective when data is naturally ordered on disk, which is typical of time-series tables where rows are inserted in timestamp order. If the physical row order has nothing to do with the query column, BRIN won't help.
Use BRIN for time-series tables with billions of rows where queries ask about recent dates. Don't use it for columns that rows are scattered in.
Spatial / R-tree
For 2D/3D queries. PostGIS, MongoDB geospatial.
Vector
For embeddings (HNSW, IVF). pgvector, Pinecone, Weaviate.
Index design — what to actually index
Rules of thumb
- Index every foreign key. Joins use them.
- Index columns used in WHERE clauses with high selectivity.
- Index ORDER BY columns so the sort is free.
- Don't index low-cardinality columns (boolean, status with 3 values) — the optimizer often ignores them.
- Composite index column order matters: leftmost columns are usable for partial matches.
Composite indexes (the most-misunderstood)
An index on (a, b, c):
| Query | Uses index? |
|---|---|
WHERE a = ? | Yes |
WHERE a = ? AND b = ? | Yes |
WHERE a = ? AND b = ? AND c = ? | Yes |
WHERE b = ? | No (no leading column) |
WHERE c = ? | No |
WHERE a = ? AND c = ? | Partial — uses a, scans on c |
WHERE a = ? ORDER BY b | Yes (sort is free) |
The ordering rule: put equality predicates first (highest selectivity among them), then the range predicate, then ORDER BY columns. A range condition on column B stops the optimizer from using the index for column C, so range columns must come after all equality columns.
Here's a concrete example. For the query WHERE user_id = ? AND created_at > ? ORDER BY created_at, the right index order is (user_id, created_at) — equality on user_id first, range and sort on created_at second. Reversing to (created_at, user_id) makes the equality on user_id unusable for most range queries.
flowchart LR
Q["WHERE user_id=42\nAND created_at > T\nORDER BY created_at"] --> GOOD["Index: (user_id, created_at)\nSeek → user_id=42 block\nScan sorted created_at leaf pages"]
Q --> BAD["Index: (created_at, user_id)\nRange scan on created_at\nuser_id filter applied after"]
style GOOD fill:#15803d,color:#fff
style BAD fill:#ff2e88,color:#fff
Covering indexes
An index that contains all the columns a query needs — the database never has to read the table.
-- Query: SELECT email FROM users WHERE id = 42;
CREATE INDEX idx_users_id_email ON users(id) INCLUDE (email);
The INCLUDE clause (Postgres 11+) adds non-key columns to the index. The read becomes a single index lookup with no table access. For hot queries that run millions of times a day, eliminating the heap fetch is a big deal.
Partial indexes
Index only some rows. Useful for skewed data: most rows are "done," queries ask about "pending."
CREATE INDEX idx_orders_pending
ON orders(created_at) WHERE status = 'pending';
The index is smaller, updates are faster, and the matching queries are just as fast. If 99% of your rows are in a state that no hot query ever touches, why pay to index them?
Expression indexes
Index the result of a function or expression.
CREATE INDEX idx_email_lower ON users (LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'a@b.com';
Without this, the query scans the whole table because LOWER(email) doesn't match a regular index on email. Whenever you find yourself calling a function on a column in a WHERE clause, ask whether you should be indexing the expression instead.
How to read EXPLAIN
Reading query plans is the interview-grade skill that separates engineers who guess from engineers who know.
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'a@b.com';
Sample output:
Index Scan using idx_users_email on users (cost=0.42..8.44 rows=1 width=120)
(actual time=0.03..0.04 rows=1)
Index Cond: (email = 'a@b.com')
What to look for:
| Plan node | What it means |
|---|---|
| Seq Scan | Full table scan. Bad if table is big and you expected an index. |
| Index Scan | Used the index, fetched matching rows. |
| Index Only Scan | Used a covering index, no table access. Best. |
| Bitmap Scan | Multiple indexes combined. OK for many-condition queries. |
| Nested Loop | Joins by lookups. Good for small N. |
| Hash Join | Builds hash table; good for medium joins. |
| Merge Join | Both inputs sorted. Best for huge joins. |
Reading the cost numbers. cost=0.42..8.44 means: startup cost (time before first row) = 0.42 units, total cost = 8.44 units. These are in abstract planner units, not milliseconds — but they are comparable within the same plan. A high startup cost matters for LIMIT queries (the planner optimizes for total cost by default; hint it with cursor_tuple_fraction — set it lower, e.g. 0.1, to tell the planner you only want a fraction of rows — during debugging only). rows=1 is the estimate; actual rows=1 is reality. A large gap (e.g., rows=1 actual rows=50000) means stale statistics — run ANALYZE.
If you see Seq Scan on a giant table for a query you thought would use an index, work through this checklist:
- Does the index actually exist?
- Is it on the right column(s)?
- Is the column wrapped in a function (
LOWER(email)) that defeats it? - Are statistics stale? Run
ANALYZE. - Is the planner choosing seq scan because it estimates most rows match? (Sometimes correctly — if a query returns 40% of a table, a seq scan is cheaper than 400k random index lookups.)
- Is
random_page_costset appropriately for your storage? On SSDs, drop it from the default 4.0 to ~1.1; an inflated value makes the planner undervalue index scans.
Costs you pay for every index
Adding an index isn't free. Each one you create shows up on every write path from that point on.
| Cost | What it means |
|---|---|
| Storage | Each index is its own data structure. 5 indexes on a table can double table size. |
| Write amplification | Every INSERT touches every index. 5 indexes ≈ 5× write cost. |
| VACUUM cost | Postgres has to clean up index pages too. |
| Lock contention | Index updates take locks. |
| Cache pressure | More indexes in RAM means less room for table data. |
Every index should pay for itself by accelerating a query that runs often. Indexes that no query uses are pure tax — they slow down every write while helping no read.
You can find those unused indexes in Postgres:
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
Common pitfalls
Indexing every column "just in case"
You multiply write cost with every index you add — five indexes means roughly 5× the write overhead. Don't.
Function in WHERE defeats the index
WHERE DATE(created_at) = '2026-01-01' ignores an index on created_at. Rewrite as a range: created_at >= '2026-01-01' AND created_at < '2026-01-02'.
Type mismatch
Comparing varchar to int can silently force a cast that disables the index. Watch out in heterogeneous schemas.
Bloat from frequent updates
A table with constant updates accumulates dead tuples. Indexes bloat too. VACUUM FULL or pg_repack reclaim space. Hot updates (HOT updates in Postgres) help when no indexed column changes.
Index on a hot-spot column
If everyone writes to status = 'pending' then transitions to status = 'complete', the index entries pile up at the bottom of the B-tree. Partial indexes help.
Forgetting that NULL is weird
WHERE x = NULL is always false — use IS NULL instead. Both Postgres and MySQL InnoDB include NULL values in B-tree secondary indexes (the old "MySQL doesn't index NULLs" claim was true of MyISAM but not InnoDB). Both Postgres (by default, pre-15) and MySQL permit multiple NULLs in a UNIQUE index because NULLs are not considered equal to each other — this is SQL-standard behavior. Postgres 15 added a NULLS NOT DISTINCT option if you want stricter enforcement. Read the docs for your specific engine.
A real-world example: speeding up a slow query
You have:
SELECT id, title FROM articles
WHERE user_id = 42 AND status = 'published'
ORDER BY created_at DESC
LIMIT 10;
Without an index: seq scan, 10M rows, slow.
A single column index on user_id helps — it returns the matching rows — but then the database still has to sort them. Depending on how many articles user_id=42 has, that sort touches a lot of data.
The right index collapses all three steps into one lookup:
CREATE INDEX idx_user_status_created
ON articles(user_id, status, created_at DESC)
INCLUDE (title);
Now the index seeks directly to the user_id=42 + status='published' block. The leaves are already sorted by created_at DESC, so no sort step. title is in the index itself, so no heap fetch. One index seek and 10 leaf reads, done.
When to not add an index
A write-heavy table where the query is rare? Pay the write cost every second for a query that runs once a day? Probably not. A tiny table under 1000 rows? A full scan is faster anyway — the optimizer will choose it. A low-cardinality column queried with low selectivity? The optimizer will likely ignore the index and scan anyway, so you're just paying write cost for nothing. When in doubt: measure with EXPLAIN ANALYZE before and after, not before.
Things you should now be able to answer
- Why does a heavy-write workload prefer LSM over B-tree?
- What does column order in a composite index control?
- What is a covering index and what does
INCLUDEgive you? - Why might
WHERE LOWER(email) = 'a@b.com'skip your index? - How do you find unused indexes in Postgres?
- When does BRIN make sense and when does it not?
Further reading
- Use The Index, Luke! (use-the-index-luke.com) — the canonical guide
- Postgres documentation chapter on indexes
- "How Cassandra writes" — Cassandra docs
- Petrov, Database Internals — chapter on storage engines
- Databases — SQL, NoSQL, NewSQL on this site
Frequently asked questions
▸What is a covering index and what does the INCLUDE clause give you?
A covering index contains all the columns a query needs, so the database never touches the table. The INCLUDE clause in Postgres 11+ adds non-key columns to the index leaf pages without making them part of the sort key. The result is an Index Only Scan — one index lookup, no heap fetch.
▸B-tree vs LSM-tree: which should you choose and why?
Use a B-tree for mixed OLTP workloads — it balances reads and writes and is the default in Postgres, MySQL, SQL Server, and Oracle. Use an LSM-tree (Cassandra, RocksDB) when writes dominate, because it converts random writes into sequential memtable flushes and SSTable compactions. The trade-off is write amplification: leveled compaction in RocksDB typically runs 10–30x, size-tiered in Cassandra 3–10x, versus low per-write cost in B-trees.
▸What is the column ordering rule for composite indexes?
Put equality predicates first, ordered by highest selectivity, then the range predicate, then ORDER BY columns. A range condition stops the optimizer from using the index for any column that follows it. For a query on (user_id = ? AND created_at > ?), the correct index order is (user_id, created_at) — reversing it makes the user_id equality filter unusable for most range queries.
▸When should you use a BRIN index instead of a B-tree?
Use BRIN for massive, append-only tables where rows are physically inserted in the order you query them — time-series tables are the canonical example. BRIN stores only the min and max value per block range, making it hundreds of times smaller than a B-tree. It is useless when the physical row order has no correlation with the query column, because it cannot skip irrelevant blocks.
▸Why might a perfectly reasonable index get ignored by the query planner?
Common causes: the column is wrapped in a function (WHERE LOWER(email) = ... defeats a plain index on email), statistics are stale so the planner misestimates selectivity (fix with ANALYZE), the planner estimates that most rows match making a sequential scan cheaper than many random index lookups, or random_page_cost is set too high for SSD storage (the default is 4.0; drop it to ~1.1 on SSDs). Always verify with EXPLAIN ANALYZE.
You may also like
Design a Globally-Distributed SQL Database (Spanner / CockroachDB)
SQL transactions that are ACID across continents. How Spanner shards into Paxos groups, runs 2PC on top, and uses TrueTime to give you external consistency — the CP counterpart to Dynamo.
Design a Distributed Search Engine (Elasticsearch)
Index billions of documents and answer full-text queries in milliseconds. Inverted indexes, sharding + replication, scatter-gather, and relevance scoring.
Backpressure & Flow Control
What happens when a fast producer overwhelms a slow consumer? Backpressure, bounded buffers, load shedding, and why unbounded queues are a trap.