~/articles/how-to-choose-a-database
Beginnerasked at Stripeasked at Shopifyasked at Airbnbasked at Discordasked at Uber

How to Choose a Database

A decision framework for picking the right database. Workload shape, access patterns, consistency needs, and the operational costs nobody warns you about.

// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

"Which database should we use?" is the most-asked, worst-answered question in system design. People reach for whatever they used last time, or whatever was on the front page of Hacker News. Neither is a strategy. This article gives you a framework for making the decision deliberately — and for defending it in an interview.

The short version before we go any further: use Postgres unless you have a written-down reason not to. Most of what follows is about identifying when that reason actually exists.

What "right" means

A database is "right" for a workload when it gets you four things at acceptable cost:

flowchart TD
    R[Right database] --> P[Performance<br/>at your QPS/latency]
    R --> C[Correctness<br/>at your consistency need]
    R --> O[Operability<br/>your team can run it]
    R --> M[Money<br/>fits the budget]
    style R fill:#ff6b1a,color:#0a0a0f

The trap is optimizing hard for performance and forgetting operability. A database your team can't confidently operate at 3am is worse than a slower one they can.

Six questions that drive the decision

Before picking anything, write down answers to these. The answers do most of the work.

What's the shape of the data?

ShapeExamplesNatural fit
Relational rowsUsers, orders, transactionsSQL (Postgres / MySQL)
Self-contained documentsProduct page, blog post, user profileDocument (MongoDB) or jsonb in Postgres
Wide rows / time-seriesSensor data, logs, eventsCassandra, Scylla, Timescale
Key → valueSession, cache, tokenRedis, DynamoDB
Graph (nodes + edges)Social network, knowledge graphNeo4j, Neptune
Full-textSearch, log analysisElasticsearch, OpenSearch
Vectors / embeddingsSemantic search, RAGpgvector, Pinecone, Weaviate

Data shape is usually the fastest filter. If every entity is a row with well-defined foreign keys, you're in SQL territory. If your "documents" look like deeply nested JSON that changes shape per record, a document store starts to look attractive — though Postgres jsonb handles a lot of this without leaving the SQL world.

What's the access pattern?

Single row by primary key? Almost any database handles that well. Range queries on a sort key? You want a store with a friendly B-tree or clustering key (SQL, Cassandra with clustering columns). Ad-hoc joins across many tables? That's strong SQL territory — NoSQL doesn't join. Massive analytical scans over billions of rows? That's columnar (BigQuery, Snowflake, ClickHouse), not OLTP.

Proximity and geo queries deserve a call-out: PostGIS on Postgres, MongoDB's geo indexes, and Redis GEO all work, and each has a reason to prefer it over the others depending on your broader stack.

If your access pattern is "we don't know yet, we'll figure it out as we build" — that's actually a strong vote for SQL, because SQL is the most flexible query surface you can start with.

What's the read/write ratio?

flowchart LR
    A[Read-heavy<br/>100:1+] --> CACHE[Cache + replicas<br/>any DB works]
    B[Balanced] --> SQL[Postgres / MySQL]
    C[Write-heavy<br/>1:1 or worse] --> LSM[LSM-tree DBs<br/>Cassandra, Scylla, RocksDB-based]
    style C fill:#ff2e88,color:#fff

The underlying reason matters here. Postgres and MySQL use B-trees, which perform random in-place page updates for every write. Under sustained high write load, that randomness saturates disk I/O — and you're also paying WAL + heap page + index pages on every write. Cassandra, ScyllaDB, and RocksDB-based stores use LSM-trees instead, which buffer writes in memory (the memtable), flush sorted runs to disk sequentially, and compact in the background. The writes are sequential and predictable rather than random.

There's a nuance worth knowing: LSM-trees actually have higher total write amplification (often 10–30× vs 3–5× for B-trees) because compaction rewrites data repeatedly. But that amplification is sequential I/O, not random — and on flash storage the gap narrows further. The practical advantage of LSM shows up at very high sustained write QPS where random I/O exhausts your disk IOPS budget.

flowchart LR
    subgraph "B-tree write (Postgres)"
        WA[Write arrives] --> WAL[Write WAL]
        WAL --> HEAP[Update heap page]
        HEAP --> IDX[Update index pages]
        IDX --> DONE1[Done — random I/O]
    end
    subgraph "LSM-tree write (Cassandra)"
        WB[Write arrives] --> MEM[Buffer in memtable]
        MEM --> FLUSH[Flush sorted SSTable]
        FLUSH --> COMPACT[Compact in background]
        COMPACT --> DONE2[Done — sequential I/O]
    end
    style DONE1 fill:#ff2e88,color:#fff
    style DONE2 fill:#15803d,color:#fff
    style MEM fill:#ff6b1a,color:#0a0a0f

One thing people underestimate: compaction is not free. Cassandra and ScyllaDB background compaction consumes disk I/O and CPU. Under a poorly tuned compaction strategy — say, SizeTiered under heavy random deletes — you accumulate tombstones and space amplification. That's one reason Cassandra operations require real expertise. If nobody on your team has managed compaction strategy at 3am, you should think hard before choosing it.

For event ingestion at hundreds of thousands of writes per second, append-mostly: LSM wins. For balanced OLTP under a few thousand writes per second: SQL wins on simplicity.

What consistency do you actually need?

Be honest here. Most things people think need strong consistency actually don't.

Use caseReal requirement
Money, wallets, billingStrong / linearizable — no exceptions
Account state (profile, settings)Strong, usually
Friend list, follow graphEventual is fine (a delay of seconds is unnoticed)
Feed, timeline, recommendationsEventual, often very eventual
Like count, view countEventually consistent; approximate is fine
Search indexStale by seconds is normal

If eventual consistency is genuinely fine, your database choice opens way up. If strong is required, you've narrowed to Postgres, MySQL, Spanner, CockroachDB, or DynamoDB with transactions.

There's a common mistake worth naming explicitly: "eventual is fine" gets said about the wrong things. A classic error is deciding an account balance or inventory count is eventually consistent because the PM said "a few seconds of lag is OK." A few seconds of lag in a read is fine. But if you're doing read-modify-write without atomicity on a NoSQL store — say, reading a balance, subtracting 100, and writing it back — you get lost updates under concurrent writes. That's a correctness bug, not a latency trade-off.

Three terms you should be comfortable using in an interview:

  • Linearizability — every read sees the most recent write, globally. Expensive. Required for leader election, locks, monotonic counters.
  • Read-your-own-writes — you always see your own writes; others may lag. Good enough for most user-facing state.
  • Eventual — all replicas converge eventually; reads may return stale data. Fine for feeds, recommendations, like counts.

What scale, today and in two years?

flowchart TD
    S[Scale check] --> N{"Hot data fits<br/>on a single box?<br/>(< ~1 TB)"}
    N -->|Yes| OK[Single Postgres / MySQL<br/>+ read replicas]
    N -->|No| BIG{Sharding tolerable?}
    BIG -->|Yes| SH[Sharded SQL<br/>Vitess, Citus, manual]
    BIG -->|No, need ACID across shards| NEW[NewSQL<br/>Spanner, CockroachDB]
    BIG -->|No, eventual OK| NS[NoSQL<br/>Cassandra, DynamoDB]
    style OK fill:#15803d,color:#fff
    style NEW fill:#a855f7,color:#fff

Don't pick a database for a scale you won't reach. Future-proofing with Cassandra when you'll never hit 10k writes per second is a great way to burn two engineers operating it indefinitely.

The real threshold to watch is hot working set, not total data size. A Postgres instance with 10 TB total data but a 50 GB hot working set that fits in the RAM buffer pool will serve reads from memory at microsecond latency. The ~1 TB rule of thumb is for when hot data stops fitting in the buffer pool and every cache miss becomes a disk seek. If you can add RAM cheaply, push that threshold further.

Vertical scaling is also underrated. A well-tuned Postgres instance on large hardware (say, 96-core, 768 GB RAM) can sustain tens to hundreds of thousands of simple OLTP QPS depending on workload — point lookups, connection pooling, and read replicas push the ceiling much higher. Many teams shard prematurely. Always calculate "what does the biggest box handle?" before designing a distributed system.

What can your team actually run?

Postgres has 30 years of tooling, dozens of cloud-managed offerings, framework support everywhere, and every senior engineer has debugged it at some point. When something breaks at 3am, the answers are a search away.

Cassandra requires understanding tombstones, compaction strategies, replication factors, consistency levels, and repair operations. If nobody on your team has held the pager for a Cassandra cluster through an incident, it is a bad choice no matter how well it matches the workload on paper.

This is the single most underweighted factor in interviews and in real engineering decisions alike.

The default ladder

flowchart LR
    PG[Postgres] -->|"reads slow, hot data > RAM"| RR[+ Read replicas]
    RR -->|"writes still bottleneck"| BIG[+ Bigger box<br/>96-core / 768 GB RAM]
    BIG -->|"hit machine ceiling"| SHD[Shard / NewSQL<br/>Vitess, CockroachDB]
    BIG -->|"workload shape doesn't fit SQL"| NS[NoSQL<br/>Cassandra, DynamoDB]
    PG -->|"sub-ms TTL reads needed"| RD[+ Redis cache]
    PG -->|"full-text / faceted search"| ES[+ Elasticsearch]
    PG -->|"analytical scans > billions rows"| OLAP[+ ClickHouse / BigQuery]
    style PG fill:#ff6b1a,color:#0a0a0f
    style SHD fill:#a855f7,color:#fff
    style NS fill:#0e7490,color:#fff

Postgres is the boring, correct answer for most companies. You add other stores around it as specific needs emerge — Redis for cache, Elasticsearch for search, ClickHouse for analytics, S3 for blobs. Each specialised store is added when a concrete constraint forces it, not in anticipation of a constraint that might never arrive.

Decision tree

flowchart TD
    Q1{Need ACID across<br/>multiple rows?}
    Q1 -->|Yes| Q2{Fits in a single<br/>region?}
    Q1 -->|No| Q3{Mostly key lookups?}

    Q2 -->|Yes, hot fits in RAM| PG[Postgres / MySQL]
    Q2 -->|Yes, hot exceeds RAM| SHD[Sharded SQL<br/>Vitess / Citus]
    Q2 -->|Multi-region writes| NEW[Spanner / CockroachDB]

    Q3 -->|Yes, hot| Q4{Need durability?}
    Q3 -->|No| Q5{Time-series or<br/>events?}

    Q4 -->|No, cache only| MC[Memcached / Redis no-persist]
    Q4 -->|Yes, fits one instance| RD[Redis with AOF]
    Q4 -->|Yes, massive scale| DD[DynamoDB / FoundationDB]

    Q5 -->|Yes| TS[TimescaleDB / Influx / Clickhouse]
    Q5 -->|No| Q6{Append-mostly,<br/>massive write QPS?}

    Q6 -->|Yes| CA[Cassandra / Scylla]
    Q6 -->|No| Q7{Document-shaped?}

    Q7 -->|Yes| MO[MongoDB or Postgres jsonb]
    Q7 -->|No| Q8{Graph-shaped?}

    Q8 -->|Yes| NEO[Neo4j / Neptune]
    Q8 -->|No| BACK[Postgres + jsonb]

    style PG fill:#ff6b1a,color:#0a0a0f
    style RD fill:#15803d,color:#fff
    style NEW fill:#a855f7,color:#fff
    style CA fill:#0e7490,color:#fff

Worked examples

E-commerce orders

The data is relational — orders, order items, users, products all live in separate tables and join together constantly. Access patterns are a mix of single-row lookups by order ID and range queries by user. The read/write ratio is roughly balanced, and because this touches money, consistency must be strong. At one million orders a day, the data volume is modest and the team already knows Postgres.

The answer is Postgres, with Redis for product catalog caching. If someone in the meeting tries to steer toward Cassandra for "scale," point out that one million orders a day is roughly 12 writes per second — a single Postgres box handles that in its sleep.

IoT sensor ingestion

Five hundred thousand writes per second, all append-only, from thousands of sensors. The data is time-series. Reads aggregate by device and time range. Consistency can be eventual.

At that write rate, random B-tree page updates will exhaust disk IOPS and the WAL becomes a bottleneck long before the hardware is saturated. If the scale fits a few boxes, TimescaleDB keeps you in SQL while adding time-series partitioning and compression. If you need to go further, Cassandra or ScyllaDB handle this naturally. Vanilla Postgres is the wrong answer here — not because it "doesn't scale," but because the write pattern specifically stress-tests B-tree's weakest point.

Social feed

Posts are document-ish. The follow graph is a true graph. Timeline reads vastly outnumber writes. Consistency can be eventual. Read QPS is massive.

This is the canonical polyglot persistence case. Postgres (or sharded MySQL) holds the canonical post records. Redis holds precomputed timelines for active users because a timeline is a hot key-value lookup, not a relational query. The follow graph either lives in sharded SQL or a dedicated graph DB depending on query complexity. No single store handles all three patterns well; that's not a failure of the design, it's the design.

Search across catalogs

The search experience needs full-text, faceted filtering, and relevance ranking — things SQL LIKE queries can't do at scale. But you also need a system of record with real ACID guarantees for creates and updates.

The right split: Postgres as the source of truth, Elasticsearch as the search index, kept in sync via CDC or a sidecar write pipeline. Elasticsearch returns results fast; Postgres holds the authoritative data. The two are complementary, not competing.

Anti-patterns

"We'll need NoSQL someday" is a common one. If you're at 100k rows today, you don't need it today. Migrate when a specific constraint forces you to, not in anticipation of one.

"We picked MongoDB because it's faster" — for most workloads, Postgres is faster than MongoDB. The benchmarks people remember are from 2010.

"Cassandra because Facebook uses it" — Facebook also has 50 engineers operating it. If your team is three people, this comparison isn't relevant.

"Let's use Elasticsearch as our database" — Elasticsearch is a search index, not a system of record. Its near-real-time index (default 1-second refresh interval), lack of multi-document ACID transactions, and Lucene segment design are not built for the update/delete patterns a transactional store handles routinely. Run it as a derived read replica synced from your source of truth via CDC, and treat it as you would a cache: rebuilt from the primary if needed.

"One DB for everything" — real production stacks use five to ten different stores. Forcing search, OLTP, analytics, and caching into one system makes all of them slower and harder to reason about.

Polyglot persistence — what real stacks look like

flowchart LR
    subgraph "Typical production stack"
        SRC[(Postgres<br/>source of truth)] -->|CDC| SRCH[(Elasticsearch<br/>search)]
        SRC -->|write-through| CACHE[(Redis<br/>cache / sessions)]
        SRC -->|event stream| OLAP[(ClickHouse / BigQuery<br/>analytics)]
        SRC -->|event stream| TIMESERIES[(Cassandra / ScyllaDB<br/>events / time-series)]
    end
    style SRC fill:#ff6b1a,color:#0a0a0f
    style SRCH fill:#0e7490,color:#fff
    style CACHE fill:#15803d,color:#fff
    style OLAP fill:#a855f7,color:#fff
    style TIMESERIES fill:#ffaa00,color:#0a0a0f
CompanyStack snippet
StripePostgres (money), Redis (rate limits), Snowflake (analytics)
ShopifyMySQL (sharded by shop_id via pod architecture; Vitess used for the Shop consumer app), Redis, BigQuery
UberSchemaless on MySQL, Cassandra, Redis, Pinot (OLAP)
AirbnbMySQL (Amazon RDS) + StarRocks (real-time analytics, replaced Druid) + Elasticsearch
DiscordScyllaDB (messages — migrated from Cassandra in 2023), Postgres (guild/user data), Redis

Each store solves one problem. The goal isn't to collect databases; it's to reach for a new one only when the current one has a specific, documented limitation.

The cost nobody talks about

Operational cost dwarfs storage cost. A managed Postgres at $500 a month is cheaper than a self-hosted Cassandra cluster that consumes half an SRE's time to maintain.

Before picking anything off the beaten path, work through:

  • Cost of running it 24×7 (cloud bill plus on-call load)
  • Cost of the team learning its failure modes
  • Cost of backup and disaster recovery
  • Cost of migrating data into it
  • Cost of migrating out if it turns out to be the wrong choice

The fancier the database, the higher all of these costs.

How to answer "why this database" in an interview

A reusable script:

"Given the workload — [X reads/sec, Y writes/sec], data shape is [Z], we need [strong/eventual] consistency on [these operations] — I'd pick [DB]. The reason it fits is [shape + scale]. The cost is [list 2 trade-offs]. If we later needed [other property], the migration path is [pattern]."

That answer in 30 seconds is hireable. "Cassandra because it scales" is not.

Things you should now be able to answer

  • Why is Postgres the right default for most projects?
  • When would you reach for an LSM-tree DB over Postgres?
  • Why is "we'll need NoSQL someday" a poor reason?
  • What does polyglot persistence look like in practice?
  • Why is your team's operational maturity a database-choice factor?

Further reading

// FAQ

Frequently asked questions

When should I use an LSM-tree database like Cassandra or ScyllaDB instead of Postgres?

Reach for an LSM-tree store when you have sustained high-volume append-only writes — the article puts the threshold at roughly 50k writes per second sustained. Postgres and MySQL use B-trees, which perform random in-place page updates that exhaust disk IOPS under that load. LSM-tree stores like Cassandra buffer writes in memory and flush sequentially, making the writes predictable even though total write amplification is higher (10-30x for LSM vs 3-5x for B-tree).

What is the practical scale threshold where Postgres stops being the right default?

The article's rule of thumb is hot data approaching 1 TB, at which point the hot working set stops fitting in the buffer pool and every cache miss becomes a disk seek. A well-tuned Postgres instance on large hardware (the article cites 96-core, 768 GB RAM) can sustain tens to hundreds of thousands of simple OLTP QPS, so vertical scaling should be evaluated before any sharding design.

What is the difference between linearizability, read-your-own-writes, and eventual consistency?

Linearizability means every read sees the most recent write globally — required for leader election, distributed locks, and monotonic counters. Read-your-own-writes guarantees you always see your own writes while others may lag, which is sufficient for most user-facing state. Eventual consistency means all replicas converge eventually and reads may return stale data, which is acceptable for feeds, recommendations, and like counts.

Why should Elasticsearch not be used as a primary database?

Elasticsearch is a search index, not a system of record. Its default 1-second refresh interval, lack of multi-document ACID transactions, and Lucene segment design are not built for the update and delete patterns a transactional store handles routinely. The article recommends running it as a derived read replica synced from a source of truth via CDC, treating it like a cache that can be rebuilt from the primary if needed.

What database stacks do companies like Stripe, Discord, and Shopify actually use?

Stripe uses Postgres for money, Redis for rate limits, and Snowflake for analytics. Discord migrated from Cassandra to ScyllaDB for messages in 2023, keeping Postgres for guild and user data plus Redis. Shopify shards MySQL by shop_id using a pod architecture (with Vitess for the Shop consumer app), alongside Redis and BigQuery.

// RELATED

You may also like