MODULE 05 / 12crash course
~/roadmap/05-databases
Beginner

Databases — SQL, NoSQL, NewSQL

Indexes, transactions, isolation levels, MVCC, WAL, LSM trees vs B-trees, replication, sharding, and the trade-offs that decide where your data lives.

19 min read2026-01-19Ironclad Academy

There is no "best database." There is only "best database for this workload." Picking right requires knowing what each kind is good at — and what each kind is bad at.

This module gives you a working mental model of relational databases, six flavors of NoSQL, the newer "NewSQL" hybrids, and the internals (indexes, MVCC, write-ahead logs) that determine why each performs the way it does.

The big picture

flowchart TD
    DB[Databases] --> SQL[Relational / SQL]
    DB --> NS[NoSQL]
    DB --> NEW[NewSQL]
    SQL --> PG[Postgres / MySQL]
    SQL --> SS[SQL Server / Oracle]
    NS --> KV[Key-Value<br/>Redis, DynamoDB, Memcached]
    NS --> DOC[Document<br/>MongoDB, Couchbase]
    NS --> COL[Column Family<br/>Cassandra, HBase, Scylla]
    NS --> GR[Graph<br/>Neo4j, Neptune]
    NS --> TS[Time Series<br/>InfluxDB, TimescaleDB]
    NS --> SR[Search<br/>Elasticsearch, OpenSearch]
    NEW --> SP[Spanner / CockroachDB / Yugabyte / TiDB]
    style SQL fill:#ff6b1a,color:#0a0a0f
    style NS fill:#0e7490,color:#fff
    style NEW fill:#a855f7,color:#fff

Relational databases (SQL)

The 50-year-old workhorse. Postgres, MySQL, SQL Server, Oracle.

ACID transactions let you change multiple rows across multiple tables and guarantee the whole thing succeeds or none of it does. The joins and schema enforcement that SQL is famous for flow from that same foundation — a constraint violation gets caught at write time, before bad data can propagate. Add 50 years of tooling (backups, replication, query planners, observability), and you have a platform that very few new systems can beat on pure reliability.

The ceiling is real, though. One Postgres instance has no hard core or RAM limit in its documentation, but practically speaking most teams hit diminishing returns somewhere in the range of 32–96 cores — parallelism contention, lock management, and connection overhead all grow. Hot working sets beyond a few hundred GB to a few TB start demanding careful tuning. Horizontal scaling through sharding is possible but you own most of the plumbing yourself. Schema migrations on huge tables are painful — holding a lock on a billion-row table while you add a column can mean planned maintenance windows.

Use Postgres unless you have a reason not to. This is genuine production advice, not a meme. Postgres has gotten very good — JSONB columns with GIN indexes, full-text search, partitioning, logical replication, vector search (pgvector), and now sane partitioning for large tables.

ACID

Stands for:

LetterMeaningExample
AtomicityAll or nothingTransfer money: either both accounts update, or neither does
ConsistencyConstraints always holdForeign keys, unique indexes are never violated
IsolationConcurrent txns don't interfereTwo simultaneous transfers don't double-spend
DurabilityOnce committed, persistsPower failure doesn't lose the transfer

The "C" in ACID is not the same as the "C" in CAP — it means "no constraint violations," not "all replicas agree."

Isolation levels

Higher isolation = more correctness, less concurrency. Most DBs default to Read Committed:

LevelDirty ReadNon-Repeatable ReadPhantom ReadLost Update
Read Uncommittedpossiblepossiblepossiblepossible
Read Committed (default)safepossiblepossiblepossible
Repeatable Readsafesafepossiblepossible
Snapshotsafesafesafesafe
Serializablesafesafesafesafe

Definitions:

  • Dirty read: see another transaction's uncommitted change.
  • Non-repeatable read: same row read twice gives different values.
  • Phantom read: same query rerun returns different rows.
  • Lost update: two transactions read, modify, write — one's change is silently overwritten.

If you've never had to think about isolation levels, you've probably had bugs that were caused by them and didn't realize. Bank-of-America-style "I see the wrong balance" bugs are isolation-level bugs.

Postgres tip: READ COMMITTED is the default but REPEATABLE READ is often what you want for analytical reads — it uses snapshot isolation internally, giving you a stable view of the database for the transaction's lifetime.

MVCC: how Postgres avoids most locks

Most modern relational databases use Multi-Version Concurrency Control: every row has a version. Writers create new versions; readers see whichever version was current when they started.

flowchart LR
    W[Writer txn 100] -->|"creates row v2"| ROW[(Row, versions: v1, v2)]
    R[Reader txn 99] -->|"sees v1"| ROW
    R2[Reader txn 101] -->|"sees v2 if W committed"| ROW
    style W fill:#ff6b1a,color:#0a0a0f
    style R fill:#15803d,color:#fff

The consequence: readers don't block writers, and writers don't block readers. But old versions accumulate and need cleanup — Postgres calls this VACUUM, MySQL/InnoDB has the purge thread. Long-running transactions hold those old versions alive, which means your "30-minute analytics query" can cause table bloat for everyone else on that box.

Write-ahead log (WAL)

Durability comes from WAL. Every change is appended to a log on disk before the actual data file is updated.

flowchart LR
    TXN[INSERT/UPDATE] --> WAL[(WAL on disk<br/>append-only)]
    WAL -->|fsync| DURABLE[Committed]
    WAL -.async.-> PAGES[Data pages<br/>updated in background]
    style WAL fill:#ff6b1a,color:#0a0a0f
    style DURABLE fill:#15803d,color:#fff

WAL writes are append-only — sequential on disk and fast (~hundreds of MB/sec on NVMe). The actual data-page updates happen in the background and can be batched. Crash recovery replays the WAL from the last checkpoint, getting the database back to a consistent state in seconds. Replication in Postgres is exactly this: stream the WAL to replicas.

The WAL is also why fsync matters — until the WAL record is fsynced, the change isn't durable. fsync is slow (a few ms on SSD), so DBs batch commits to amortize it (group commit).

Indexes: where reads come from

Without an index, finding a row in a 1B-row table is O(n) — read every row. Slow.

A B-tree index (the default) gives you O(log n) lookups, sorted scans, and range queries. It costs storage (~10–30% of the data) and slows writes a bit (each insert updates the tree).

flowchart TD
    ROOT["root: 50, 100, 150"]
    ROOT --> N1["node: 25, 40, 47"]
    ROOT --> N2["node: 60, 80, 99"]
    ROOT --> N3["node: 110, 130, 145"]
    N1 --> L1[(leaf: 1...49)]
    N2 --> L2[(leaf: 51...99)]
    N3 --> L3[(leaf: 100...149)]
    style ROOT fill:#ff6b1a,color:#0a0a0f
    style L2 fill:#15803d,color:#fff

Index types worth knowing

IndexGood forBad for
B-tree (default)Equality, range, sorted scansSubstring search
HashEquality onlyRange queries (no order)
GIN (Postgres)JSONB, arrays, full-textHigh write rate (slow inserts)
GiST (Postgres)Geometric data, rangesGeneral-purpose
BRIN (Postgres)Huge tables with natural ordering (e.g. time series)Random access
Inverted (search engines)Text search ("contains 'cat'")Equality on structured data

Composite indexes (the leading-column rule)

A composite index (a, b, c) is sorted lexicographically. So it can serve queries on:

  • a (WHERE a = 5)
  • (a, b) (WHERE a = 5 AND b = 10)
  • (a, b, c)
  • Range on the last used column: WHERE a = 5 AND b > 10

Not for queries on just b, just c, or (b, c) — those would need a different index.

Covering indexes

If your index includes all columns the query needs, the DB never reads the table — it serves the answer from the index alone. Postgres calls this INCLUDE:

CREATE INDEX idx_users_email ON users(email) INCLUDE (name, created_at);

Now SELECT name, created_at FROM users WHERE email = ? reads only the index. Massive speedup for read-heavy workloads — at the cost of a fatter index.

Rules of thumb

  • Index every column you WHERE or JOIN on.
  • Don't over-index — every index is a write tax.
  • Watch for unused indexes — Postgres has pg_stat_user_indexes to find them.
  • Concurrent index creation: CREATE INDEX CONCURRENTLY doesn't lock writes — use it in production.

Query plans (read your plans)

When the DB chooses how to execute a query, it picks among:

  • Sequential scan: read the whole table. Cheap if table is small or you need most rows.
  • Index scan: walk the index, look up rows. Cheap if you need few rows.
  • Index-only scan: serve from a covering index, no table lookup.
  • Bitmap scan: hybrid — use index to build a bitmap of pages, then read pages.
  • Hash join / Merge join / Nested loop: three join algorithms with different cost models.

EXPLAIN ANALYZE shows the chosen plan and actual times. The single best skill for fixing slow queries is reading these plans.

EXPLAIN ANALYZE
SELECT u.name, count(o.id)
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.country = 'US'
GROUP BY u.name;

Look for:

  • Seq Scan on a big table with a low row-count filter → missing index.
  • Sort with high memory → consider a sort-supporting index.
  • Nested Loop on big tables → bad join order; check your join keys are indexed.

When SQL stops being enough

The path from one Postgres to a full distributed stack is a ladder — climb only as high as you need.

flowchart LR
    A[Single Postgres] -->|"hit ~1TB hot data<br/>or ~50k QPS write"| B[Read Replicas]
    B -->|"reads still slow<br/>writes still bottleneck"| C[Vertical Scale]
    C -->|"machine maxed out"| D[Sharding]
    D -->|"need cross-shard txns"| E[Distributed SQL<br/>NewSQL]
    style A fill:#15803d,color:#fff
    style E fill:#a855f7,color:#fff

Order of escalation:

  1. Tune queries and indexes. 80% of "slow database" problems are missing indexes.
  2. Read replicas. Cheap horizontal read scaling. Watch out for replication lag.
  3. Connection pooling. PgBouncer or RDS Proxy. Postgres backend processes are heavy; pool aggressively.
  4. Vertical scale. More RAM, more cores. Often cheaper than re-architecting.
  5. Partition the table (declarative partitioning). Time-series naturally partitions by month.
  6. Shard across multiple DB clusters by user_id / tenant_id.
  7. Move hot collection to a specialty store (cache, search index, time-series DB).
  8. Adopt distributed SQL (CockroachDB, Spanner) — last resort.

Key-value stores

The simplest NoSQL: key → value, no schema, no joins. The tradeoff is deliberate — by throwing out everything except point lookups, a key-value store can saturate a network link before it breaks a sweat.

RedisMemcachedDynamoDB
PersistenceOptional (RDB/AOF)NoneYes
Data structuresMany (lists, sets, sorted sets, streams, hyperloglog)Strings onlyDocuments (JSON)
Single-threadedYes (data ops)NoN/A (managed)
TransactionsMULTI/EXEC, LuaNonePer-item, conditional writes
Use asCache, session store, rate limiter, queue, leaderboardPure cachePrimary store

Redis (the swiss army knife)

Redis is wildly versatile — sorted sets implement leaderboards in 3 lines; streams can replace small Kafkas; HyperLogLog gives you cardinality estimation. The data structures are first-class:

SET    user:42:name "ada"
GET    user:42:name

INCR   pageviews:home

LPUSH  inbox:42 "msg1"
LRANGE inbox:42 0 9

SADD   friends:42 7 12 99
SINTER friends:42 friends:99   # mutual friends

ZADD   leaderboard 5000 "ada"
ZREVRANGE leaderboard 0 9      # top 10

XADD   events * action login user_id 42
XREAD  COUNT 10 STREAMS events 0   # like Kafka

PFADD  uniques:home "user:42"
PFCOUNT uniques:home          # approximate cardinality

Redis as cache vs Redis as database: when you use Redis as a cache, persistence is off and it's fine to lose the data on restart — the primary store is the source of truth. When you use it as a database, you turn on AOF plus periodic RDB, replicate it, and need a recovery plan if it goes down. People do this successfully for stateful apps with smallish hot datasets.

Redis Cluster shards data across nodes by hash slot (16,384 slots). Some operations (multi-key, transactions across keys) only work within a slot.

DynamoDB

Amazon's hosted KV/document store: effectively infinite scale, predictable latency (single-digit milliseconds for typical singleton operations), pay-per-request pricing.

The tradeoff: you must design for the access pattern up front. There are no ad-hoc queries. The data model is:

  • Partition key (PK): hashed to determine which physical partition stores the item.
  • Sort key (SK) (optional): items with the same PK are sorted by SK. Range queries within a PK are efficient.
  • Global Secondary Indexes (GSI): alternate PK/SK pairs for different access patterns.

Common pattern — single-table design: one DynamoDB table holds many entity types, with PK/SK chosen to make every needed query a "get items where PK = X" or a range scan. It's powerful but the modeling has a learning curve.

Document stores

key → JSON. The schema is in the JSON itself, often inconsistent across documents.

MongoDB is the canonical one. A document store shines when the natural unit of your application really is a document — a product listing, a blog post, a user profile — and that shape changes often enough that a fixed schema would slow you down. Cross-document transactions in MongoDB exist since version 4.0, but if your workload leans on them heavily, that's usually a signal the data wants to live in Postgres instead.

Postgres has JSONB with GIN indexes — for many "I want a document store" use cases, just use Postgres and put the document in a JSONB column. You get joins, transactions, and the relational tooling, without sacrificing schema flexibility.

Column-family / wide-column stores (and LSM trees)

Cassandra, HBase, ScyllaDB, Bigtable. Optimized for massive write throughput and time-series workloads.

Row key → column family → (column name, value, timestamp)

The key insight: rows with the same key are stored on the same machine, sorted by column. Time-series queries — "all temperatures for sensor 42 in the last hour" — become sequential disk reads.

LSM trees (the engine)

Cassandra and Scylla (and RocksDB, LevelDB, the storage engines under DynamoDB and many others) use log-structured merge trees instead of B-trees.

flowchart TD
    W[Write] --> MEM[(MemTable<br/>in RAM)]
    MEM -->|flush when full| L0[(SSTable L0)]
    L0 -->|compact| L1[(SSTable L1)]
    L1 -->|compact| L2[(SSTable L2)]
    L2 -->|compact| L3[(SSTable L3)]
    style MEM fill:#15803d,color:#fff
    style L0 fill:#ff6b1a,color:#0a0a0f
    style L3 fill:#a855f7,color:#fff

The flow:

  1. Write goes to an in-memory buffer (MemTable) and a WAL.
  2. When the MemTable fills, it's flushed to disk as an immutable SSTable (sorted string table).
  3. Compaction merges SSTables in the background — kicking out tombstones and old versions, organizing into levels.
  4. Reads check MemTable, then SSTables level by level (with bloom filters to skip levels that can't have the key).
B-tree (Postgres)LSM (Cassandra)
Write amplificationHigher (in-place page updates)Lower at ingestion (sequential append); higher during compaction
Read amplificationLowHigher (multi-level lookup)
Random write throughputBound by disk seeksSequential, very high
Range scansExcellentGood
Storage overhead~10–30%~10–100% (varies by compaction strategy and compression)
Best forMixed read/write OLTPWrite-heavy, append-mostly

Reach for Cassandra-class when you have millions of writes per second (telemetry, logging, IoT), your queries are append-and-scan rather than arbitrary, you can live with eventual consistency (it's tunable), and you need multi-datacenter replication out of the box.

Graph databases

Neo4j, Neptune, ArangoDB. Optimized for "find a path" or "friends of friends" queries.

Storing a social graph in Postgres works — until "friends of friends of friends" becomes a 3-way self-join over a table of billions of rows. Graph DBs walk the edges natively, often O(degree) per hop instead of O(N log N) per join.

For most apps, a friendships(a, b) table in Postgres with proper indexes is plenty — don't reach for a graph DB until that's actually slow.

Time-series databases

InfluxDB, TimescaleDB (a Postgres extension), Prometheus, M3, ClickHouse. Optimized for append-only, time-ordered writes and aggregate queries over time ranges ("avg over the last hour"). They also handle downsampling natively — keep raw data for 7 days, hourly rollups for a year, without writing that logic yourself.

If you're storing metrics or sensor data, a TSDB beats Postgres on storage cost (10–100×) and query speed for time-range aggregations. For application logs at scale, ClickHouse is the rising star.

Search engines

Elasticsearch, OpenSearch, Solr. Built on inverted indexes — for every word, store a list of documents containing it.

Doc 1: "the cat sat on the mat"
Doc 2: "the dog ate the cat"

Inverted index:
  "cat" -> [1, 2]
  "dog" -> [2]
  "mat" -> [1]
  "sat" -> [1]

Queries become set intersections / unions of these postings lists. Add scoring (TF-IDF, BM25), faceting, fuzzy matching, and you have full-text search.

Anti-pattern: using Elasticsearch as your source of truth. It's eventually consistent, refresh-driven, and not great at durability. Use it as a secondary index, fed from your primary DB via CDC or a sync job.

NewSQL: the best of both

Spanner (Google), CockroachDB, Yugabyte, TiDB.

The pitch: SQL on top of a horizontally-scalable, geo-distributed consensus backbone (Paxos for Spanner; Raft for CockroachDB and TiDB). ACID transactions across regions. SQL queries. No sharding by hand.

The cost: higher write latency (consensus across regions costs round-trips). Less mature ecosystem than Postgres. Steeper bill.

Reach for it when you've genuinely outgrown vanilla Postgres and still need transactions and SQL. Don't reach for it on day one.

Replication

Copying data across multiple nodes for durability (data survives a single failure), availability (a replica can serve), and read scaling (more nodes serving reads).

The full coverage is in CAP and Consistency; a quick preview here:

Single-leader (primary-replica)

All writes go to one primary. The primary streams its log to replicas. Reads can be served from anywhere.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica1
    participant Replica2
    Client->>Primary: Write (INSERT/UPDATE)
    Primary->>Replica1: Stream WAL
    Primary->>Replica2: Stream WAL
    Primary-->>Client: Committed
    Client->>Replica1: Read
    Replica1-->>Client: Result (may lag)

Three modes for confirming a write:

  • Sync replication: write returns only after at least one replica confirms. Slower, safer.
  • Async replication: write returns immediately; replicas catch up. Faster, risk of data loss on primary failure.
  • Semi-sync: write returns after one replica confirms (the rest async).

Replication lag is the latency between a write to the primary and that write being visible on a replica. Important for "read your writes" — if you write then immediately read from a lagging replica, you may not see your own change.

Multi-leader and leaderless

Covered in module 10 — short version: multi-leader = "two or more nodes accept writes, sync to each other"; leaderless (Cassandra, DynamoDB) = "everyone's equal, quorums for consistency."

Partitioning (sharding)

Replication makes copies. Partitioning splits the data — no node has it all.

flowchart TD
    DB[(All Data)] --> P1[(Shard 1<br/>users a-h)]
    DB --> P2[(Shard 2<br/>users i-q)]
    DB --> P3[(Shard 3<br/>users r-z)]
    style DB fill:#ff6b1a,color:#0a0a0f

Strategies:

StrategyHowProsCons
Rangeusers a-h on shard 1, i-q on shard 2, r-z on shard 3Range queries efficientHot shards (e.g. all "a" usernames)
Hashshard = hash(key) % NEven distributionNo range queries, rebalancing is hard
Consistent hashHash onto a ringEven + minimal reshufflingSlightly more complex
DirectoryLookup table maps key → shardFlexibleLookup table is itself a bottleneck

Most production systems use consistent hashing. See the Consistent Hashing article.

Cross-partition queries are expensive

Once data is sharded, "give me the top 10 users by post count" requires a query to every shard, then a merge. This is scatter-gather, and it's how distributed search engines (Elasticsearch) and OLAP databases (BigQuery) work — but it adds complexity.

Connection pooling

Postgres uses one process per connection. Memory overhead per connection varies: idle connections cost roughly 1.5–5 MB; connections that have run queries can climb to 10–15 MB depending on work_mem settings and query type. Apps with 1000 concurrent requests cannot open 1000 raw connections to Postgres without melting it.

Two layers:

  • Application-side pool: the SQL driver maintains a small pool (e.g. 10 conns per app instance).
  • Server-side pooler: PgBouncer / RDS Proxy in front of Postgres, with thousands of client connections multiplexed onto fewer DB connections (transaction pooling).

Without a pooler, your "scale to 100 app servers" plan becomes "die at 100 app servers."

Picking a database

A decision tree that holds up most of the time:

flowchart TD
    Q1{Strong consistency<br/>and joins required?}
    Q1 -->|Yes| Q2{Need infinite scale?}
    Q1 -->|No| Q3{Mostly key lookups?}
    Q2 -->|No| PG[Postgres / MySQL]
    Q2 -->|Yes| NEW[Spanner / CockroachDB]
    Q3 -->|Yes| Q4{Hot in memory?}
    Q3 -->|No| Q5{Massive writes?}
    Q4 -->|Yes| RD[Redis]
    Q4 -->|No| DD[DynamoDB]
    Q5 -->|Yes| CA[Cassandra]
    Q5 -->|No| MO[MongoDB]
    style PG fill:#ff6b1a,color:#0a0a0f
    style RD fill:#15803d,color:#fff
    style CA fill:#a855f7,color:#fff

But databases are rarely the choice. Real systems use polyglot persistence: Postgres for the source of truth, Redis for caching, Cassandra for time-series telemetry, S3 for blobs, Elasticsearch for search.

A worked example: choosing storage for a SaaS app

Spec: B2B SaaS, 10k tenants, ~1M users total, mixed CRUD + analytics + search.

NeedPickWhy
Transactional CRUDPostgresStrong consistency, joins, mature
User session cacheRedisSub-ms reads, TTL built-in
Full-text searchElasticsearchInverted index, ranking, faceting
Analytics dashboardsClickHouse / BigQueryColumnar, fast aggregates
Activity log (audit)Cassandra or S3 + AthenaAppend-heavy, rarely read
File uploadsS3Cheap, durable, CDN-friendly
Async jobsPostgres + a queue table, or SQSSimple at this scale

That's six different data stores for one app. Each one is justified by a different access pattern. Resist the urge to make all of them the same DB — and resist the urge to add ones you don't need. Start with Postgres + Redis + S3 and add the rest only when you're forced to.

Things you should now be able to answer

  • Why is "use Postgres unless you have a reason not to" reasonable advice?
  • What's the difference between Read Committed and Serializable? Give an example bug each level allows.
  • Why does Postgres still need VACUUM after every workload?
  • A query is slow on a 100M-row table. What's the first thing you check?
  • When would you choose Cassandra over Postgres? Over MongoDB?
  • What's the read/write trade-off of adding an index?
  • Your read replica is 30 seconds behind. What types of queries are safe to send there?
  • A Cassandra LSM tree can handle 10× the writes of a Postgres B-tree. What does it give up?

→ Next: Storage Systems

// FAQ

Frequently asked questions

What is MVCC and why does Postgres need VACUUM?

Multi-Version Concurrency Control gives every row a version number. Writers create new versions while readers see the version that was current when their transaction started, so readers never block writers. Old versions accumulate on disk and must be cleaned up — Postgres calls this process VACUUM. Long-running transactions hold old versions alive, which causes table bloat for every other query on that server.

What is the leading-column rule for composite indexes?

A composite index on (a, b, c) is sorted lexicographically, so it can only accelerate queries that filter on a, or on (a, b), or on (a, b, c). Queries that filter on b or c alone, or on (b, c) without a, cannot use the index and need a separate index built for that column order.

When should you choose Cassandra over Postgres?

Reach for Cassandra when you have millions of writes per second — telemetry, logging, IoT — your queries are append-and-scan rather than arbitrary, you can tolerate eventual consistency (it is tunable), and you need multi-datacenter replication out of the box. Its LSM-tree engine handles sequential writes far better than Postgres B-trees, but at the cost of higher read amplification and more storage overhead.

What is the difference between replication and sharding?

Replication makes full copies of the data across multiple nodes for durability and read scaling — every replica holds all the data. Sharding (partitioning) splits the data so that no single node holds it all, enabling writes to scale horizontally across the cluster.

Why is Elasticsearch an anti-pattern as a source of truth?

Elasticsearch is eventually consistent, refresh-driven, and not designed for strong durability guarantees. It should be used as a secondary index fed from a primary database via change-data-capture or a sync job, not as the authoritative store for data that cannot be lost or must be immediately consistent.