Databases — SQL, NoSQL, NewSQL
Indexes, transactions, isolation levels, MVCC, WAL, LSM trees vs B-trees, replication, sharding, and the trade-offs that decide where your data lives.
There is no "best database." There is only "best database for this workload." Picking right requires knowing what each kind is good at — and what each kind is bad at.
This module gives you a working mental model of relational databases, six flavors of NoSQL, the newer "NewSQL" hybrids, and the internals (indexes, MVCC, write-ahead logs) that determine why each performs the way it does.
The big picture
flowchart TD
DB[Databases] --> SQL[Relational / SQL]
DB --> NS[NoSQL]
DB --> NEW[NewSQL]
SQL --> PG[Postgres / MySQL]
SQL --> SS[SQL Server / Oracle]
NS --> KV[Key-Value<br/>Redis, DynamoDB, Memcached]
NS --> DOC[Document<br/>MongoDB, Couchbase]
NS --> COL[Column Family<br/>Cassandra, HBase, Scylla]
NS --> GR[Graph<br/>Neo4j, Neptune]
NS --> TS[Time Series<br/>InfluxDB, TimescaleDB]
NS --> SR[Search<br/>Elasticsearch, OpenSearch]
NEW --> SP[Spanner / CockroachDB / Yugabyte / TiDB]
style SQL fill:#ff6b1a,color:#0a0a0f
style NS fill:#0e7490,color:#fff
style NEW fill:#a855f7,color:#fff
Relational databases (SQL)
The 50-year-old workhorse. Postgres, MySQL, SQL Server, Oracle.
ACID transactions let you change multiple rows across multiple tables and guarantee the whole thing succeeds or none of it does. The joins and schema enforcement that SQL is famous for flow from that same foundation — a constraint violation gets caught at write time, before bad data can propagate. Add 50 years of tooling (backups, replication, query planners, observability), and you have a platform that very few new systems can beat on pure reliability.
The ceiling is real, though. One Postgres instance has no hard core or RAM limit in its documentation, but practically speaking most teams hit diminishing returns somewhere in the range of 32–96 cores — parallelism contention, lock management, and connection overhead all grow. Hot working sets beyond a few hundred GB to a few TB start demanding careful tuning. Horizontal scaling through sharding is possible but you own most of the plumbing yourself. Schema migrations on huge tables are painful — holding a lock on a billion-row table while you add a column can mean planned maintenance windows.
Use Postgres unless you have a reason not to. This is genuine production advice, not a meme. Postgres has gotten very good — JSONB columns with GIN indexes, full-text search, partitioning, logical replication, vector search (pgvector), and now sane partitioning for large tables.
ACID
Stands for:
| Letter | Meaning | Example |
|---|---|---|
| Atomicity | All or nothing | Transfer money: either both accounts update, or neither does |
| Consistency | Constraints always hold | Foreign keys, unique indexes are never violated |
| Isolation | Concurrent txns don't interfere | Two simultaneous transfers don't double-spend |
| Durability | Once committed, persists | Power failure doesn't lose the transfer |
The "C" in ACID is not the same as the "C" in CAP — it means "no constraint violations," not "all replicas agree."
Isolation levels
Higher isolation = more correctness, less concurrency. Most DBs default to Read Committed:
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Lost Update |
|---|---|---|---|---|
| Read Uncommitted | possible | possible | possible | possible |
| Read Committed (default) | safe | possible | possible | possible |
| Repeatable Read | safe | safe | possible | possible |
| Snapshot | safe | safe | safe | safe |
| Serializable | safe | safe | safe | safe |
Definitions:
- Dirty read: see another transaction's uncommitted change.
- Non-repeatable read: same row read twice gives different values.
- Phantom read: same query rerun returns different rows.
- Lost update: two transactions read, modify, write — one's change is silently overwritten.
If you've never had to think about isolation levels, you've probably had bugs that were caused by them and didn't realize. Bank-of-America-style "I see the wrong balance" bugs are isolation-level bugs.
Postgres tip:
READ COMMITTEDis the default butREPEATABLE READis often what you want for analytical reads — it uses snapshot isolation internally, giving you a stable view of the database for the transaction's lifetime.
MVCC: how Postgres avoids most locks
Most modern relational databases use Multi-Version Concurrency Control: every row has a version. Writers create new versions; readers see whichever version was current when they started.
flowchart LR
W[Writer txn 100] -->|"creates row v2"| ROW[(Row, versions: v1, v2)]
R[Reader txn 99] -->|"sees v1"| ROW
R2[Reader txn 101] -->|"sees v2 if W committed"| ROW
style W fill:#ff6b1a,color:#0a0a0f
style R fill:#15803d,color:#fff
The consequence: readers don't block writers, and writers don't block readers. But old versions accumulate and need cleanup — Postgres calls this VACUUM, MySQL/InnoDB has the purge thread. Long-running transactions hold those old versions alive, which means your "30-minute analytics query" can cause table bloat for everyone else on that box.
Write-ahead log (WAL)
Durability comes from WAL. Every change is appended to a log on disk before the actual data file is updated.
flowchart LR
TXN[INSERT/UPDATE] --> WAL[(WAL on disk<br/>append-only)]
WAL -->|fsync| DURABLE[Committed]
WAL -.async.-> PAGES[Data pages<br/>updated in background]
style WAL fill:#ff6b1a,color:#0a0a0f
style DURABLE fill:#15803d,color:#fff
WAL writes are append-only — sequential on disk and fast (~hundreds of MB/sec on NVMe). The actual data-page updates happen in the background and can be batched. Crash recovery replays the WAL from the last checkpoint, getting the database back to a consistent state in seconds. Replication in Postgres is exactly this: stream the WAL to replicas.
The WAL is also why fsync matters — until the WAL record is fsynced, the change isn't durable. fsync is slow (a few ms on SSD), so DBs batch commits to amortize it (group commit).
Indexes: where reads come from
Without an index, finding a row in a 1B-row table is O(n) — read every row. Slow.
A B-tree index (the default) gives you O(log n) lookups, sorted scans, and range queries. It costs storage (~10–30% of the data) and slows writes a bit (each insert updates the tree).
flowchart TD
ROOT["root: 50, 100, 150"]
ROOT --> N1["node: 25, 40, 47"]
ROOT --> N2["node: 60, 80, 99"]
ROOT --> N3["node: 110, 130, 145"]
N1 --> L1[(leaf: 1...49)]
N2 --> L2[(leaf: 51...99)]
N3 --> L3[(leaf: 100...149)]
style ROOT fill:#ff6b1a,color:#0a0a0f
style L2 fill:#15803d,color:#fff
Index types worth knowing
| Index | Good for | Bad for |
|---|---|---|
| B-tree (default) | Equality, range, sorted scans | Substring search |
| Hash | Equality only | Range queries (no order) |
| GIN (Postgres) | JSONB, arrays, full-text | High write rate (slow inserts) |
| GiST (Postgres) | Geometric data, ranges | General-purpose |
| BRIN (Postgres) | Huge tables with natural ordering (e.g. time series) | Random access |
| Inverted (search engines) | Text search ("contains 'cat'") | Equality on structured data |
Composite indexes (the leading-column rule)
A composite index (a, b, c) is sorted lexicographically. So it can serve queries on:
a(WHERE a = 5)(a, b)(WHERE a = 5 AND b = 10)(a, b, c)- Range on the last used column:
WHERE a = 5 AND b > 10
Not for queries on just b, just c, or (b, c) — those would need a different index.
Covering indexes
If your index includes all columns the query needs, the DB never reads the table — it serves the answer from the index alone. Postgres calls this INCLUDE:
CREATE INDEX idx_users_email ON users(email) INCLUDE (name, created_at);
Now SELECT name, created_at FROM users WHERE email = ? reads only the index. Massive speedup for read-heavy workloads — at the cost of a fatter index.
Rules of thumb
- Index every column you
WHEREorJOINon. - Don't over-index — every index is a write tax.
- Watch for unused indexes — Postgres has
pg_stat_user_indexesto find them. - Concurrent index creation:
CREATE INDEX CONCURRENTLYdoesn't lock writes — use it in production.
Query plans (read your plans)
When the DB chooses how to execute a query, it picks among:
- Sequential scan: read the whole table. Cheap if table is small or you need most rows.
- Index scan: walk the index, look up rows. Cheap if you need few rows.
- Index-only scan: serve from a covering index, no table lookup.
- Bitmap scan: hybrid — use index to build a bitmap of pages, then read pages.
- Hash join / Merge join / Nested loop: three join algorithms with different cost models.
EXPLAIN ANALYZE shows the chosen plan and actual times. The single best skill for fixing slow queries is reading these plans.
EXPLAIN ANALYZE
SELECT u.name, count(o.id)
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.country = 'US'
GROUP BY u.name;
Look for:
- Seq Scan on a big table with a low row-count filter → missing index.
- Sort with high memory → consider a sort-supporting index.
- Nested Loop on big tables → bad join order; check your join keys are indexed.
When SQL stops being enough
The path from one Postgres to a full distributed stack is a ladder — climb only as high as you need.
flowchart LR
A[Single Postgres] -->|"hit ~1TB hot data<br/>or ~50k QPS write"| B[Read Replicas]
B -->|"reads still slow<br/>writes still bottleneck"| C[Vertical Scale]
C -->|"machine maxed out"| D[Sharding]
D -->|"need cross-shard txns"| E[Distributed SQL<br/>NewSQL]
style A fill:#15803d,color:#fff
style E fill:#a855f7,color:#fff
Order of escalation:
- Tune queries and indexes. 80% of "slow database" problems are missing indexes.
- Read replicas. Cheap horizontal read scaling. Watch out for replication lag.
- Connection pooling. PgBouncer or RDS Proxy. Postgres backend processes are heavy; pool aggressively.
- Vertical scale. More RAM, more cores. Often cheaper than re-architecting.
- Partition the table (declarative partitioning). Time-series naturally partitions by month.
- Shard across multiple DB clusters by user_id / tenant_id.
- Move hot collection to a specialty store (cache, search index, time-series DB).
- Adopt distributed SQL (CockroachDB, Spanner) — last resort.
Key-value stores
The simplest NoSQL: key → value, no schema, no joins. The tradeoff is deliberate — by throwing out everything except point lookups, a key-value store can saturate a network link before it breaks a sweat.
| Redis | Memcached | DynamoDB | |
|---|---|---|---|
| Persistence | Optional (RDB/AOF) | None | Yes |
| Data structures | Many (lists, sets, sorted sets, streams, hyperloglog) | Strings only | Documents (JSON) |
| Single-threaded | Yes (data ops) | No | N/A (managed) |
| Transactions | MULTI/EXEC, Lua | None | Per-item, conditional writes |
| Use as | Cache, session store, rate limiter, queue, leaderboard | Pure cache | Primary store |
Redis (the swiss army knife)
Redis is wildly versatile — sorted sets implement leaderboards in 3 lines; streams can replace small Kafkas; HyperLogLog gives you cardinality estimation. The data structures are first-class:
SET user:42:name "ada"
GET user:42:name
INCR pageviews:home
LPUSH inbox:42 "msg1"
LRANGE inbox:42 0 9
SADD friends:42 7 12 99
SINTER friends:42 friends:99 # mutual friends
ZADD leaderboard 5000 "ada"
ZREVRANGE leaderboard 0 9 # top 10
XADD events * action login user_id 42
XREAD COUNT 10 STREAMS events 0 # like Kafka
PFADD uniques:home "user:42"
PFCOUNT uniques:home # approximate cardinality
Redis as cache vs Redis as database: when you use Redis as a cache, persistence is off and it's fine to lose the data on restart — the primary store is the source of truth. When you use it as a database, you turn on AOF plus periodic RDB, replicate it, and need a recovery plan if it goes down. People do this successfully for stateful apps with smallish hot datasets.
Redis Cluster shards data across nodes by hash slot (16,384 slots). Some operations (multi-key, transactions across keys) only work within a slot.
DynamoDB
Amazon's hosted KV/document store: effectively infinite scale, predictable latency (single-digit milliseconds for typical singleton operations), pay-per-request pricing.
The tradeoff: you must design for the access pattern up front. There are no ad-hoc queries. The data model is:
- Partition key (PK): hashed to determine which physical partition stores the item.
- Sort key (SK) (optional): items with the same PK are sorted by SK. Range queries within a PK are efficient.
- Global Secondary Indexes (GSI): alternate PK/SK pairs for different access patterns.
Common pattern — single-table design: one DynamoDB table holds many entity types, with PK/SK chosen to make every needed query a "get items where PK = X" or a range scan. It's powerful but the modeling has a learning curve.
Document stores
key → JSON. The schema is in the JSON itself, often inconsistent across documents.
MongoDB is the canonical one. A document store shines when the natural unit of your application really is a document — a product listing, a blog post, a user profile — and that shape changes often enough that a fixed schema would slow you down. Cross-document transactions in MongoDB exist since version 4.0, but if your workload leans on them heavily, that's usually a signal the data wants to live in Postgres instead.
Postgres has JSONB with GIN indexes — for many "I want a document store" use cases, just use Postgres and put the document in a JSONB column. You get joins, transactions, and the relational tooling, without sacrificing schema flexibility.
Column-family / wide-column stores (and LSM trees)
Cassandra, HBase, ScyllaDB, Bigtable. Optimized for massive write throughput and time-series workloads.
Row key → column family → (column name, value, timestamp)
The key insight: rows with the same key are stored on the same machine, sorted by column. Time-series queries — "all temperatures for sensor 42 in the last hour" — become sequential disk reads.
LSM trees (the engine)
Cassandra and Scylla (and RocksDB, LevelDB, the storage engines under DynamoDB and many others) use log-structured merge trees instead of B-trees.
flowchart TD
W[Write] --> MEM[(MemTable<br/>in RAM)]
MEM -->|flush when full| L0[(SSTable L0)]
L0 -->|compact| L1[(SSTable L1)]
L1 -->|compact| L2[(SSTable L2)]
L2 -->|compact| L3[(SSTable L3)]
style MEM fill:#15803d,color:#fff
style L0 fill:#ff6b1a,color:#0a0a0f
style L3 fill:#a855f7,color:#fff
The flow:
- Write goes to an in-memory buffer (MemTable) and a WAL.
- When the MemTable fills, it's flushed to disk as an immutable SSTable (sorted string table).
- Compaction merges SSTables in the background — kicking out tombstones and old versions, organizing into levels.
- Reads check MemTable, then SSTables level by level (with bloom filters to skip levels that can't have the key).
| B-tree (Postgres) | LSM (Cassandra) | |
|---|---|---|
| Write amplification | Higher (in-place page updates) | Lower at ingestion (sequential append); higher during compaction |
| Read amplification | Low | Higher (multi-level lookup) |
| Random write throughput | Bound by disk seeks | Sequential, very high |
| Range scans | Excellent | Good |
| Storage overhead | ~10–30% | ~10–100% (varies by compaction strategy and compression) |
| Best for | Mixed read/write OLTP | Write-heavy, append-mostly |
Reach for Cassandra-class when you have millions of writes per second (telemetry, logging, IoT), your queries are append-and-scan rather than arbitrary, you can live with eventual consistency (it's tunable), and you need multi-datacenter replication out of the box.
Graph databases
Neo4j, Neptune, ArangoDB. Optimized for "find a path" or "friends of friends" queries.
Storing a social graph in Postgres works — until "friends of friends of friends" becomes a 3-way self-join over a table of billions of rows. Graph DBs walk the edges natively, often O(degree) per hop instead of O(N log N) per join.
For most apps, a friendships(a, b) table in Postgres with proper indexes is plenty — don't reach for a graph DB until that's actually slow.
Time-series databases
InfluxDB, TimescaleDB (a Postgres extension), Prometheus, M3, ClickHouse. Optimized for append-only, time-ordered writes and aggregate queries over time ranges ("avg over the last hour"). They also handle downsampling natively — keep raw data for 7 days, hourly rollups for a year, without writing that logic yourself.
If you're storing metrics or sensor data, a TSDB beats Postgres on storage cost (10–100×) and query speed for time-range aggregations. For application logs at scale, ClickHouse is the rising star.
Search engines
Elasticsearch, OpenSearch, Solr. Built on inverted indexes — for every word, store a list of documents containing it.
Doc 1: "the cat sat on the mat"
Doc 2: "the dog ate the cat"
Inverted index:
"cat" -> [1, 2]
"dog" -> [2]
"mat" -> [1]
"sat" -> [1]
Queries become set intersections / unions of these postings lists. Add scoring (TF-IDF, BM25), faceting, fuzzy matching, and you have full-text search.
Anti-pattern: using Elasticsearch as your source of truth. It's eventually consistent, refresh-driven, and not great at durability. Use it as a secondary index, fed from your primary DB via CDC or a sync job.
NewSQL: the best of both
Spanner (Google), CockroachDB, Yugabyte, TiDB.
The pitch: SQL on top of a horizontally-scalable, geo-distributed consensus backbone (Paxos for Spanner; Raft for CockroachDB and TiDB). ACID transactions across regions. SQL queries. No sharding by hand.
The cost: higher write latency (consensus across regions costs round-trips). Less mature ecosystem than Postgres. Steeper bill.
Reach for it when you've genuinely outgrown vanilla Postgres and still need transactions and SQL. Don't reach for it on day one.
Replication
Copying data across multiple nodes for durability (data survives a single failure), availability (a replica can serve), and read scaling (more nodes serving reads).
The full coverage is in CAP and Consistency; a quick preview here:
Single-leader (primary-replica)
All writes go to one primary. The primary streams its log to replicas. Reads can be served from anywhere.
sequenceDiagram
participant Client
participant Primary
participant Replica1
participant Replica2
Client->>Primary: Write (INSERT/UPDATE)
Primary->>Replica1: Stream WAL
Primary->>Replica2: Stream WAL
Primary-->>Client: Committed
Client->>Replica1: Read
Replica1-->>Client: Result (may lag)
Three modes for confirming a write:
- Sync replication: write returns only after at least one replica confirms. Slower, safer.
- Async replication: write returns immediately; replicas catch up. Faster, risk of data loss on primary failure.
- Semi-sync: write returns after one replica confirms (the rest async).
Replication lag is the latency between a write to the primary and that write being visible on a replica. Important for "read your writes" — if you write then immediately read from a lagging replica, you may not see your own change.
Multi-leader and leaderless
Covered in module 10 — short version: multi-leader = "two or more nodes accept writes, sync to each other"; leaderless (Cassandra, DynamoDB) = "everyone's equal, quorums for consistency."
Partitioning (sharding)
Replication makes copies. Partitioning splits the data — no node has it all.
flowchart TD
DB[(All Data)] --> P1[(Shard 1<br/>users a-h)]
DB --> P2[(Shard 2<br/>users i-q)]
DB --> P3[(Shard 3<br/>users r-z)]
style DB fill:#ff6b1a,color:#0a0a0f
Strategies:
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Range | users a-h on shard 1, i-q on shard 2, r-z on shard 3 | Range queries efficient | Hot shards (e.g. all "a" usernames) |
| Hash | shard = hash(key) % N | Even distribution | No range queries, rebalancing is hard |
| Consistent hash | Hash onto a ring | Even + minimal reshuffling | Slightly more complex |
| Directory | Lookup table maps key → shard | Flexible | Lookup table is itself a bottleneck |
Most production systems use consistent hashing. See the Consistent Hashing article.
Cross-partition queries are expensive
Once data is sharded, "give me the top 10 users by post count" requires a query to every shard, then a merge. This is scatter-gather, and it's how distributed search engines (Elasticsearch) and OLAP databases (BigQuery) work — but it adds complexity.
Connection pooling
Postgres uses one process per connection. Memory overhead per connection varies: idle connections cost roughly 1.5–5 MB; connections that have run queries can climb to 10–15 MB depending on work_mem settings and query type. Apps with 1000 concurrent requests cannot open 1000 raw connections to Postgres without melting it.
Two layers:
- Application-side pool: the SQL driver maintains a small pool (e.g. 10 conns per app instance).
- Server-side pooler: PgBouncer / RDS Proxy in front of Postgres, with thousands of client connections multiplexed onto fewer DB connections (transaction pooling).
Without a pooler, your "scale to 100 app servers" plan becomes "die at 100 app servers."
Picking a database
A decision tree that holds up most of the time:
flowchart TD
Q1{Strong consistency<br/>and joins required?}
Q1 -->|Yes| Q2{Need infinite scale?}
Q1 -->|No| Q3{Mostly key lookups?}
Q2 -->|No| PG[Postgres / MySQL]
Q2 -->|Yes| NEW[Spanner / CockroachDB]
Q3 -->|Yes| Q4{Hot in memory?}
Q3 -->|No| Q5{Massive writes?}
Q4 -->|Yes| RD[Redis]
Q4 -->|No| DD[DynamoDB]
Q5 -->|Yes| CA[Cassandra]
Q5 -->|No| MO[MongoDB]
style PG fill:#ff6b1a,color:#0a0a0f
style RD fill:#15803d,color:#fff
style CA fill:#a855f7,color:#fff
But databases are rarely the choice. Real systems use polyglot persistence: Postgres for the source of truth, Redis for caching, Cassandra for time-series telemetry, S3 for blobs, Elasticsearch for search.
A worked example: choosing storage for a SaaS app
Spec: B2B SaaS, 10k tenants, ~1M users total, mixed CRUD + analytics + search.
| Need | Pick | Why |
|---|---|---|
| Transactional CRUD | Postgres | Strong consistency, joins, mature |
| User session cache | Redis | Sub-ms reads, TTL built-in |
| Full-text search | Elasticsearch | Inverted index, ranking, faceting |
| Analytics dashboards | ClickHouse / BigQuery | Columnar, fast aggregates |
| Activity log (audit) | Cassandra or S3 + Athena | Append-heavy, rarely read |
| File uploads | S3 | Cheap, durable, CDN-friendly |
| Async jobs | Postgres + a queue table, or SQS | Simple at this scale |
That's six different data stores for one app. Each one is justified by a different access pattern. Resist the urge to make all of them the same DB — and resist the urge to add ones you don't need. Start with Postgres + Redis + S3 and add the rest only when you're forced to.
Things you should now be able to answer
- Why is "use Postgres unless you have a reason not to" reasonable advice?
- What's the difference between Read Committed and Serializable? Give an example bug each level allows.
- Why does Postgres still need
VACUUMafter every workload? - A query is slow on a 100M-row table. What's the first thing you check?
- When would you choose Cassandra over Postgres? Over MongoDB?
- What's the read/write trade-off of adding an index?
- Your read replica is 30 seconds behind. What types of queries are safe to send there?
- A Cassandra LSM tree can handle 10× the writes of a Postgres B-tree. What does it give up?
→ Next: Storage Systems
Frequently asked questions
▸What is MVCC and why does Postgres need VACUUM?
Multi-Version Concurrency Control gives every row a version number. Writers create new versions while readers see the version that was current when their transaction started, so readers never block writers. Old versions accumulate on disk and must be cleaned up — Postgres calls this process VACUUM. Long-running transactions hold old versions alive, which causes table bloat for every other query on that server.
▸What is the leading-column rule for composite indexes?
A composite index on (a, b, c) is sorted lexicographically, so it can only accelerate queries that filter on a, or on (a, b), or on (a, b, c). Queries that filter on b or c alone, or on (b, c) without a, cannot use the index and need a separate index built for that column order.
▸When should you choose Cassandra over Postgres?
Reach for Cassandra when you have millions of writes per second — telemetry, logging, IoT — your queries are append-and-scan rather than arbitrary, you can tolerate eventual consistency (it is tunable), and you need multi-datacenter replication out of the box. Its LSM-tree engine handles sequential writes far better than Postgres B-trees, but at the cost of higher read amplification and more storage overhead.
▸What is the difference between replication and sharding?
Replication makes full copies of the data across multiple nodes for durability and read scaling — every replica holds all the data. Sharding (partitioning) splits the data so that no single node holds it all, enabling writes to scale horizontally across the cluster.
▸Why is Elasticsearch an anti-pattern as a source of truth?
Elasticsearch is eventually consistent, refresh-driven, and not designed for strong durability guarantees. It should be used as a secondary index fed from a primary database via change-data-capture or a sync job, not as the authoritative store for data that cannot be lost or must be immediately consistent.