~/articles/event-sourcing-and-cqrs
◆◆◆Advancedasked at Amazonasked at Microsoftasked at Nasdaq

Event Sourcing & CQRS

Store every change as an immutable event and rebuild state by replay. Event sourcing, CQRS read models, snapshots, and the trade-offs nobody warns you about.

// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

Event sourcing and CQRS are among the most misunderstood patterns in distributed systems. They are powerful and well-suited to specific problems. They are also routinely over-applied, turning straightforward CRUD into a graveyard of projections and replay pipelines. This article explains both ideas precisely, shows the mechanics, and is honest about the cost.

The core idea: state as a fold over events

In a conventional database, you store the current state of an entity and mutate it in place:

UPDATE accounts SET balance = balance - 50 WHERE id = 'acc_123';

The previous balance is gone. To understand how the account reached its current state, you need an audit table you might have remembered to add, or application logs that may or may not exist.

Event sourcing reverses the model. Instead of storing current state, you store the sequence of things that happened:

AccountOpened   { account_id: acc_123, initial_balance: 200 }
MoneyDeposited  { account_id: acc_123, amount: 100 }
MoneyWithdrawn  { account_id: acc_123, amount: 50 }

Current state is derived by replaying that sequence. For a bank account, the balance is a fold over the events:

state_0 = {}
state_1 = apply(state_0, AccountOpened)   → { balance: 200 }
state_2 = apply(state_1, MoneyDeposited)  → { balance: 300 }
state_3 = apply(state_2, MoneyWithdrawn)  → { balance: 250 }

The event log is the source of truth. The balance of 250 is a derived fact — a materialized view you recompute on demand. You could have computed a different fact from the same log: "total number of withdrawals" or "balance as of yesterday at 9am." Same log, different fold.

flowchart LR
    E1[AccountOpened<br/>balance: 200] --> F1[fold]
    F1 --> S1[state: 200]
    E2[MoneyDeposited<br/>amount: 100] --> F2[fold]
    S1 --> F2
    F2 --> S2[state: 300]
    E3[MoneyWithdrawn<br/>amount: 50] --> F3[fold]
    S2 --> F3
    F3 --> S3[state: 250]
    style E1 fill:#0e7490,color:#fff
    style E2 fill:#0e7490,color:#fff
    style E3 fill:#0e7490,color:#fff
    style S3 fill:#15803d,color:#fff
    style F1 fill:#ff6b1a,color:#0a0a0f
    style F2 fill:#ff6b1a,color:#0a0a0f
    style F3 fill:#ff6b1a,color:#0a0a0f

Each event is immutable and cannot be changed after it is written. Any downstream view — the current balance, a historical chart, a fraud signal — is just a different fold over the same immutable log.

Aggregates: the unit of consistency

The log is partitioned by aggregate — the unit that groups related events and enforces invariants. An Account is an aggregate; an Order is an aggregate. The aggregate loads its own events from the store, enforces business rules, and appends new events. Two separate accounts may be modified concurrently with no coordination. Two events on the same account are serialized.

events = store.load("acc_123")         // load all events for this stream
account = Account.replay(events)       // fold to current state
account.withdraw(50)                   // check: balance >= 50? yes.
store.append("acc_123", MoneyWithdrawn { amount: 50 })

Optimistic concurrency is the standard: the append includes the expected stream version. If another writer appended first, the version won't match, the write is rejected, and the handler retries.

What the event store must provide

The event store has a straightforward contract:

  • Append events to a stream (by aggregate ID), with an expected version for optimistic concurrency.
  • Read events from a stream, starting from position 0 or any checkpoint.
  • Subscribe to the global event stream (all streams, in append order) — this is how projections are built.

Nothing else is strictly required. You can implement an event store in:

  • Postgres — an events(stream_id, position, event_type, payload JSONB, created_at) table with a unique constraint on (stream_id, position). A serial global_position column drives the subscription feed.
  • Kafka — a topic per aggregate type, partitioned by aggregate ID so events for the same aggregate land in the same partition in order. Do not enable log compaction on event-sourcing topics: compaction retains only the latest record per key, destroying all intermediate events and breaking the replay-from-beginning guarantee that event sourcing depends on. Use time-based or size-based retention instead, and move cold events to long-term archival (e.g. S3) separately.
  • EventStoreDB / KurrentDB — a purpose-built database that makes streams and subscriptions first-class. Its native projection engine can derive new streams from others. (The product is being renamed to KurrentDB following the company's December 2024 rebrand from Event Store to Kurrent.)

Each has trade-offs:

StoreStrong pointsLimitations
PostgresTransactional appends; easy to operateGlobal subscription requires polling or logical replication; not ideal above ~50k events/sec
KafkaMassive throughput; built-in fan-out to many consumersRetention policy matters — long-term storage needs separate archival (never log compaction, which destroys event history); no per-stream optimistic concurrency natively
EventStoreDB / KurrentDBBuilt for this exact model; rich subscription semanticsSmaller ecosystem; one more piece of infrastructure to operate

Building up to a production design

Start simple: fold on every load

The simplest version: every time you need to process a command for acc_123, load all events for that stream and replay them.

events = store.load("acc_123")        // could be 10,000 events
account = Account.replay(events)      // fold all of them

This gives you complete history, correct state, and zero extra infrastructure. There is nothing wrong with starting here.

The problem surfaces once a single aggregate accumulates enough events. An account with 10,000 transactions takes meaningful time to replay. Even with fast in-memory fold logic, deserializing and applying 10,000 events adds up to tens or hundreds of milliseconds per command before any business logic runs — and that's before accounting for the I/O cost of fetching all those events from the store. For high-frequency streams (an exchange order book, a shopping cart with many events), that overhead adds up fast. You need a way to not replay from the very beginning every time.

Add snapshots

Every N events (common choices: 50, 100, 500), persist the current folded state alongside the event log:

snapshot = { version: 500, balance: 1240, state_as_of: "..." }

On next load:

snapshot = store.load_snapshot("acc_123")        // version 500
events = store.load_since("acc_123", version=500) // only events 501..current
account = Account.replay_from(snapshot, events)

Replay is now bounded to at most N events, regardless of how old the aggregate is. The snapshot is just a cache of the folded state at a checkpoint — you can always discard it and rebuild from the raw event log if needed. The log remains the source of truth; the snapshot is just an optimization.

CQRS — separate read models

Loading the aggregate and replaying events works well for the write side (command processing). It works poorly for the query side. "Show me all withdrawals over $100 across all accounts in the last 30 days" is not a fold over a single aggregate's stream — it's a cross-aggregate query, and forcing it through the event-replay model is painful.

CQRS solves this by separating the write and read models. A background projection subscribes to the global event stream and builds a denormalized read model optimized for queries:

event: MoneyWithdrawn { account_id: acc_123, amount: 50, timestamp: T }
   projection writes: INSERT INTO withdrawals(account_id, amount, ts) VALUES (...)

The read model can be any store that suits the query: a Redis hash for real-time account balance lookups, a Postgres table for SQL reporting, Elasticsearch for full-text search over transaction notes.

sequenceDiagram
    participant C as Client
    participant CH as Command Handler
    participant ES as Event Store
    participant P as Projection Worker
    participant R as Read Model (Postgres)
    C->>CH: POST /accounts/acc_123/withdraw { amount: 50 }
    CH->>ES: load stream "acc_123"
    ES-->>CH: [AccountOpened, MoneyDeposited, ...]
    CH->>CH: replay + validate business rule
    CH->>ES: append MoneyWithdrawn { amount: 50 }
    ES-->>C: 202 Accepted
    Note over ES,P: async — projection lags slightly
    ES->>P: MoneyWithdrawn event (via subscription)
    P->>R: UPDATE balances SET balance=250 WHERE account_id='acc_123'
    C->>R: GET /accounts/acc_123/balance
    R-->>C: { balance: 250 }

Notice the timing here: the client gets a 202 immediately after the event is appended. The projection catches up asynchronously. A query arriving before the projection catches up will see the previous balance. This is eventual consistency — it is the central trade-off of CQRS, and it must be communicated to every team building on this stack.

The cost of eventual consistency

This is the thing nobody warns you about in the blog posts.

When a command completes and the client receives a success response, the event has been durably appended to the event store. The read model has not yet been updated. How long before it is? In a low-load system on the same data center, typically 10–100 ms. Under load or with a slow projection, potentially seconds. On a deployment with replay-from-beginning (after a schema change forces a full projection rebuild), potentially hours.

Your product team must consciously accept the consequences. A user withdraws money and immediately refreshes their balance page — they may see the old balance, so you need a UI strategy (optimistic updates, polling until the version advances, "processing…" states). A command that reads from the read model and writes back based on it is not safe because the read model may be stale, and business invariants must be enforced on the write side against the event stream, never against the read model. Projection rebuilds either take down the read model or run in parallel on a shadow table, and for large event stores a full rebuild of a complex projection can take hours.

Failure modes

Event schema evolution

Events are immutable. You cannot change a MoneyWithdrawn event that was written two years ago. But your code will change. You will rename fields, add required fields, split events into two. When you replay the stream today, those old events still exist in their original shape.

The standard strategy is upcasting: at read time, before the event reaches application code, a chain of transformers converts old shapes to the current shape:

MoneyWithdrawn_v1 { amount: 50 }
  →  MoneyWithdrawn_v2 { amount: 50, currency: "USD" }   (default added)
  →  MoneyWithdrawn_v3 { debit: 50, currency: "USD" }    (field renamed)

Each upcast is a deterministic, pure function. You accumulate them forever. There is no migration script that "fixes" the old events — the log is sealed.

This is the most underestimated operational burden of event sourcing. A three-year-old event store may have five versions of a given event type, with a chain of upcasters that must all be maintained and tested. Any bug in the upcast chain corrupts every replay from that point onward.

flowchart LR
    OLD["MoneyWithdrawn_v1<br/>{ amount: 50 }"] --> UC1["Upcast v1→v2<br/>add currency: USD"]
    UC1 --> MID["MoneyWithdrawn_v2<br/>{ amount: 50, currency: USD }"]
    MID --> UC2["Upcast v2→v3<br/>rename amount → debit"]
    UC2 --> NEW["MoneyWithdrawn_v3<br/>{ debit: 50, currency: USD }"]
    NEW --> APP[Application code]
    style OLD fill:#0e7490,color:#fff
    style MID fill:#ffaa00,color:#0a0a0f
    style NEW fill:#15803d,color:#fff
    style UC1 fill:#ff6b1a,color:#0a0a0f
    style UC2 fill:#ff6b1a,color:#0a0a0f
    style APP fill:#a855f7,color:#fff

Every upcast in this chain is permanent. As the system ages, chains grow. That's the deal.

Projection rebuild cost

When you need a new projection — a new query shape, a new read model — you must replay the entire event history from position 0 to build it. A mature event store might contain billions of events. At 100,000 events/sec throughput, 1 billion events takes about 3 hours of wall-clock replay time. You need to run the new projection in parallel (not replacing the old one) until it catches up, have enough compute and I/O to sustain replay rate without impacting live traffic, and handle the window where the new projection exists but is not yet current.

For event stores that have been running for years, this is a real operational constraint. Plan for it from the start: choose a store with efficient sequential scan, monitor projection lag as a first-class metric, and keep individual projection logic as cheap as possible.

Eventual-consistency surprises

The most common class of production incident: a developer writes code that reads from the read model, makes a decision, and then issues a command based on that decision. The read model was stale. The invariant that was checked no longer holds.

Here is a concrete example. A fraud check reads from the read model: "this account has balance $500, so a $400 withdrawal is fine." The read model is 200 ms stale. A concurrent withdrawal of $300 was processed in that 200ms. The command goes through. The account goes negative.

The fix is non-negotiable: all invariant checks happen on the write side, by loading and replaying the aggregate's event stream — never against a read model. The read model is for queries only.

Relationship to streaming and CDC

Event sourcing is a natural producer for Change Data Capture (CDC) pipelines and streaming systems. The event store's global subscription feed is structurally identical to a Kafka topic: an ordered, durable log of changes. Any downstream that can consume a Kafka topic can consume the event store's feed.

This is why event-sourced systems compose well with stream processing (Kafka Streams, Flink) and real-time analytics. The event log is already the data you'd want to put into a stream — you don't need to bolt on CDC after the fact.

It is also why event-sourced financial systems pair naturally with audit and regulatory requirements: the immutable log is the audit trail, not a separate table you might forget to write to.

A full architecture

flowchart TD
    CLI[Client] -->|POST command| GW[API Gateway]
    GW --> CMD[Command Service]
    CMD -->|load stream| ESTORE[(Event Store<br/>append-only)]
    CMD -->|append event + version| ESTORE
    ESTORE -->|global feed| BUS[Event Bus / Kafka]

    BUS --> P1[Account Balance<br/>Projection]
    BUS --> P2[Transaction History<br/>Projection]
    BUS --> P3[Notification<br/>Service]
    BUS --> P4[Fraud Detection<br/>Model]

    P1 --> REDIS[(Redis)]
    P2 --> PG[(Postgres)]
    P3 --> EMAIL[Email / Push]
    P4 --> ML[(ML Store)]

    GW2[Query Gateway] -->|GET balance| REDIS
    GW2 -->|GET history| PG

    SNAP[Snapshot Worker] -.every N events.-> ESTORE

    style CMD fill:#ff6b1a,color:#0a0a0f
    style ESTORE fill:#15803d,color:#fff
    style BUS fill:#0e7490,color:#fff
    style P1 fill:#ffaa00,color:#0a0a0f
    style P2 fill:#ffaa00,color:#0a0a0f
    style P3 fill:#ffaa00,color:#0a0a0f
    style P4 fill:#ffaa00,color:#0a0a0f
    style REDIS fill:#a855f7,color:#fff
    style PG fill:#a855f7,color:#fff
    style GW2 fill:#ff2e88,color:#fff

The command service and the query gateway are two separate entry points. Everything on the write side flows through the Event Store and out to the bus. Everything on the read side queries one of the materialized projections, none of which is the Event Store itself. The Snapshot Worker is a background process that periodically writes checkpoint state back into the store so replays don't have to scan from position 0.

State machine: a command's lifecycle

stateDiagram-v2
    [*] --> Received: client sends command
    Received --> Validating: command handler loads aggregate
    Validating --> Rejected: business rule violated
    Validating --> Appending: rule passes, emit event
    Appending --> Conflict: optimistic concurrency conflict
    Conflict --> Validating: reload + retry
    Appending --> Committed: event durably stored
    Committed --> Projecting: projection worker receives event
    Projecting --> ReadModelUpdated: projection applies event
    ReadModelUpdated --> [*]
    Rejected --> [*]

The gap between Committed and ReadModelUpdated is where eventual consistency lives. A query arriving during that gap sees the pre-command state.

Concrete example: a payment ledger

The payment system design is a canonical use case. Consider a transfer between two accounts:

Command:  TransferMoney { from: acc_A, to: acc_B, amount: 100 }

The command handler must enforce that acc_A has sufficient funds — an invariant that must be checked atomically. With event sourcing, you load acc_A's stream, replay to get the current balance, check balance >= 100, then append:

MoneyTransferInitiated { transfer_id: txn_9, from: acc_A, to: acc_B, amount: 100 }

A saga or process manager listens for MoneyTransferInitiated and orchestrates the debit and credit as separate commands on each account aggregate. Each step appends its own event. If the credit on acc_B fails, a compensating event MoneyTransferFailed triggers a reversal on acc_A. The event log contains the complete story of every attempt and every compensation — invaluable for debugging and regulatory audit.

Similar logic applies to stock exchange matching engines: every order placement, cancellation, and fill is an event. The order book is a projection. The trade blotter is another projection. Regulatory reporting consumes the same log.

When not to reach for this pattern

This section matters as much as everything above.

For most CRUD applications — a content management system, a settings page, a blog editor — the extra machinery here buys you nothing. A standard relational database with updated_at and updated_by columns covers 90% of the audit trail requirements most teams actually have. The complexity is not worth it.

If your queries need strong consistency — the read must reflect the most recently committed command, right now, with zero lag — CQRS with eventually consistent read models is the wrong tool. You need the command and the query to land in the same transaction. A single Postgres instance with good indexing is often the right answer and it is dramatically simpler to operate.

Do not underestimate what you are signing up for operationally. You now have the event store, the snapshot store, projection workers with lag monitoring, multiple read model databases, schema versioning with upcasters, and replay infrastructure. For a small team or a service with simple requirements, that is a large surface area in exchange for benefits you may not actually need.

The pattern earns its keep when you need a complete, tamper-evident audit trail (financial, medical, legal), when multiple query models need different shapes (balance lookup vs. transaction history vs. fraud graph), when you need temporal queries ("what was the state at 9am on March 3rd?"), when you need to replay history into new projections as business requirements change, or when events are already a natural first-class concept in the domain — order processing, payment ledgers, booking systems, matching engines. If several of those apply to your problem, event sourcing and CQRS will serve you well. If none of them do, reach for something simpler.

Storage choices

DataStoreWhy
Event streamsPostgres (events table) or EventStoreDB / KurrentDBTransactional append with optimistic concurrency; efficient stream scan
High-throughput event fan-outKafkaDurable, partitioned, many consumers without coordination
SnapshotsSame store as events, or separate blob storageCollocated is simpler; blob storage (S3) is cheaper for large snapshots
Account balance (read model)RedisSub-millisecond point lookup by account ID
Transaction history (read model)PostgresRange queries, filtering, SQL joins
Full-text / search (read model)ElasticsearchNarrative search over transaction notes or descriptions
Archival eventsS3 + ParquetCompress cold events; queryable via Athena / BigQuery

Things to discuss in an interview

  • Why the read model is eventually consistent and what that means for product behavior — this shows you understand the real trade-off, not just the pattern.
  • Snapshot strategy: when to take one, what happens if the snapshot is corrupt, how to rebuild.
  • Schema evolution: upcasting vs. copy-transform, why you can't just migrate events in place.
  • Projection rebuild: how you handle a new projection on a large event store, parallel vs. cutover.
  • When not to use it: interviewers respect engineers who know when a pattern is wrong for the problem.
  • Idempotent projections: if a projection worker crashes mid-batch and restarts, it replays some events twice. Projection handlers must be idempotent (INSERT ... ON CONFLICT DO NOTHING or a position checkpoint).

Things you should now be able to answer

  • What is the difference between event sourcing and simply having an audit log table?
  • Why does replaying a large event stream require snapshots, and when would you take one?
  • A developer reads from a CQRS read model and uses that data to validate a command — what can go wrong?
  • How do you handle a field rename in an event that was appended two years ago?
  • A new business requirement needs a projection that never existed before. How do you build it?
  • What makes the payment ledger and matching engine good candidates for event sourcing?
  • Name two scenarios where you would explicitly choose a standard CRUD model over event sourcing.

Further reading

  • Martin Fowler, "Event Sourcing" — martinfowler.com (the canonical definition)
  • Greg Young, "CQRS Documents" — cqrs.wordpress.com/documents/ (the original articulation of CQRS as a pattern; PDF directly at cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf)
  • "Designing Data-Intensive Applications" by Martin Kleppmann — Chapter 11 (stream processing and event logs)
  • KurrentDB documentation — docs.kurrent.io (canonical documentation; the product was renamed from EventStoreDB to KurrentDB following the December 2024 rebrand from Event Store to Kurrent)
  • "The Log: What every software engineer should know about real-time data's unifying abstraction" — Jay Kreps, engineering.linkedin.com
// FAQ

Frequently asked questions

What is event sourcing, and how does it differ from a conventional database?

Event sourcing stores every state change as an immutable, ordered event appended to a log rather than overwriting a row in place. Current state is derived by replaying that log using a left-fold: apply each event in sequence to an initial state. In a conventional database, when a balance changes from $200 to $150 the previous value is gone; in an event-sourced system the MoneyWithdrawn event remains permanently, and the balance is a derived fact computed on demand.

What is CQRS, and why does it pair naturally with event sourcing?

CQRS (Command Query Responsibility Segregation) splits the write model (commands that validate business rules and append events) from the read model (denormalized projections built by consuming the event stream). It pairs naturally with event sourcing because the event store's global subscription feed is already an ordered log that any number of downstream projections can consume independently, each optimized for its own query shape — Redis for point lookups, Postgres for range queries, Elasticsearch for full-text search.

When should you use event sourcing, and when should you avoid it?

Reach for event sourcing when you need a complete tamper-evident audit trail (financial, medical, legal), multiple query models with different shapes, temporal queries such as reconstructing state at a specific past timestamp, or when events are already first-class domain concepts like order processing or matching engines. Avoid it for standard CRUD applications — a content management system or settings page — where a relational database with updated_at and updated_by columns covers 90% of real audit requirements at a fraction of the operational complexity.

What are snapshots in event sourcing, and how often should you take them?

A snapshot is the folded aggregate state persisted at a checkpoint version so that replaying on the next load only requires events from that checkpoint forward, bounding replay cost to at most N events regardless of aggregate age. Common snapshot intervals are every 50, 100, or 500 events. The snapshot is purely an optimization — the raw event log remains the source of truth and can always be used to rebuild it if the snapshot is lost or corrupt.

How do you handle schema changes to events that were written years ago?

The standard strategy is upcasting: at read time, before an event reaches application code, a chain of pure deterministic transformer functions converts old event shapes to the current shape — for example, MoneyWithdrawn_v1 with an amount field is promoted to v2 by adding a default currency, then to v3 by renaming amount to debit. These upcast chains accumulate permanently because the log is sealed and old events cannot be rewritten. A bug anywhere in the chain corrupts every replay from that point forward, making schema versioning the most underestimated operational burden in long-lived event-sourced systems.

// RELATED

You may also like