~/articles/saga-pattern-distributed-transactions

◆◆◆Advancedasked at Amazonasked at Uberasked at Netflix

The Saga Pattern & Distributed Transactions

How do you keep data consistent across services with no shared database? Sagas, compensating transactions, orchestration vs choreography, and why 2PC fails at scale.

17 min read2026-03-02Ironclad Academy

#distributed-systems #transactions #microservices #consistency

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

Distributed transactions are the point where microservices get genuinely hard. Every service-decomposition tutorial tells you to split your monolith. None of them explains what happens when you need one logical operation to span three of those services — and one of them fails halfway through. That is exactly what the Saga pattern addresses.

Why a single transaction across services is impossible

In a monolith with one relational database, you write:

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = :customer;
  UPDATE inventory SET stock = stock - 1 WHERE sku = :sku;
  INSERT INTO shipments (order_id, address) VALUES (:oid, :addr);
COMMIT;

Either all three changes land, or none do. The database's transaction manager owns the ACID contract.

Microservices break this. Payment Service owns accounts. Inventory Service owns inventory. Shipping Service owns shipments. They are separate processes with separate databases — possibly separate technologies (Postgres, DynamoDB, MongoDB). There is no single transaction coordinator that owns all three.

The naive fix — "just give them a shared database" — defeats the point of microservices. Service autonomy and independent deployability require independent schemas.

The wrong fix: Two-Phase Commit (2PC)

Two-Phase Commit is the textbook protocol for distributed atomic commit. It does work correctly under certain assumptions. Understanding it exactly is important before explaining why it is usually avoided at scale.

How 2PC works

A coordinator (transaction manager) drives two phases:

Phase 1 — Prepare. The coordinator sends PREPARE to every participant. Each participant checks whether it can commit: acquires all locks it needs, writes a durable log entry, and replies YES or NO. A YES vote is a promise: "I can commit this and will not abort unilaterally."

Phase 2 — Commit or Abort. If all participants voted YES, the coordinator writes a commit record and sends COMMIT to all. If any participant voted NO, the coordinator sends ABORT to all. Participants apply or roll back and release locks.

sequenceDiagram
    participant C as Coordinator
    participant P1 as Payment DB
    participant P2 as Inventory DB
    participant P3 as Shipping DB
    C->>P1: PREPARE
    C->>P2: PREPARE
    C->>P3: PREPARE
    P1-->>C: YES
    P2-->>C: YES
    P3-->>C: YES
    C->>P1: COMMIT
    C->>P2: COMMIT
    C->>P3: COMMIT

Why 2PC breaks at scale

The flaw is not in the protocol's logic — it's in what happens the moment the coordinator goes quiet.

If the coordinator crashes after all three participants vote YES but before it sends COMMIT or ABORT, those participants are stuck. They cannot unilaterally abort — they promised. They cannot commit — they never received the decision. So they sit there holding their locks, waiting for a coordinator that isn't coming back. This is the fundamental blocking problem: participant availability is now tied to coordinator availability.

The locking cost compounds under load. A row locked during Phase 1 stays locked until Phase 2 completes — across at least two network round-trips. For cross-datacenter calls (10–50 ms each), transactions waiting on those rows pile up quickly, cascading into latency spikes under any sustained traffic.

There is also a compatibility wall. 2PC requires every participant to speak the same distributed transaction protocol (XA in the JEE world). Many modern data stores — DynamoDB, Cassandra, Kafka — simply do not support XA. You cannot 2PC across a relational database and a cloud message queue.

2PC is used in practice, and it works well inside a single database cluster or across a small number of tightly-controlled homogeneous nodes. It breaks down when applied across multiple independently-operated services with heterogeneous storage.

The Saga pattern

The Saga pattern, introduced by Garcia-Molina and Salem in 1987 and widely popularized in the microservices era, replaces a single distributed transaction with a sequence of local transactions, each with an explicit compensating transaction.

The structure is simple: a saga is a sequence of steps T1, T2, … Tn. Each Ti is a local ACID transaction against one service's database. Each Ti has a corresponding compensation Ci that semantically undoes Ti's effect. If step Tk fails, the saga executes Ck-1, Ck-2, … C1 in reverse order.

The key insight is that Ti has already committed by the time Tk fails. You cannot roll it back. You can only compensate for it — issue a new transaction that negates the business effect.

The order example: happy path

sequenceDiagram
    participant SAGA as Saga Coordinator
    participant PAY as Payment Service
    participant INV as Inventory Service
    participant SHIP as Shipping Service

    SAGA->>PAY: T1 — charge customer $49.99
    PAY-->>SAGA: ok, payment_id=P42
    SAGA->>INV: T2 — reserve 1× SKU-8801
    INV-->>SAGA: ok, reservation_id=R77
    SAGA->>SHIP: T3 — create shipment
    SHIP-->>SAGA: ok, shipment_id=S19
    SAGA->>SAGA: saga complete

All three local transactions commit independently. The saga is done.

The order example: failure and compensation

sequenceDiagram
    participant SAGA as Saga Coordinator
    participant PAY as Payment Service
    participant INV as Inventory Service
    participant SHIP as Shipping Service

    SAGA->>PAY: T1 — charge customer $49.99
    PAY-->>SAGA: ok, payment_id=P42
    SAGA->>INV: T2 — reserve 1× SKU-8801
    INV-->>SAGA: ok, reservation_id=R77
    SAGA->>SHIP: T3 — create shipment
    SHIP-->>SAGA: FAIL — address undeliverable
    SAGA->>INV: C2 — release reservation R77
    INV-->>SAGA: ok
    SAGA->>PAY: C1 — refund payment P42
    PAY-->>SAGA: ok
    SAGA->>SAGA: saga compensated

The customer ends up neither charged nor with inventory reserved — the intended semantic outcome. But between T1 committing and C1 completing, the customer was charged. That intermediate state was real and visible.

Orchestration vs choreography

There are two structural approaches to driving a saga's steps.

Orchestration

A dedicated Saga Orchestrator (or Saga Manager) is responsible for the control flow. It calls each service in turn, waits for a response, and decides what to do next — including triggering compensations on failure. The orchestrator persists its own state durably (in a database) so it can survive restarts.

flowchart TD
    ORC[Saga Orchestrator<br/>persists state in DB]
    ORC -->|"1. POST /charge"| PAY[Payment Service]
    PAY -->|"200 ok"| ORC
    ORC -->|"2. POST /reserve"| INV[Inventory Service]
    INV -->|"200 ok"| ORC
    ORC -->|"3. POST /ship"| SHIP[Shipping Service]
    SHIP -->|"500 fail"| ORC
    ORC -.->|"C2. POST /release"| INV
    ORC -.->|"C1. POST /refund"| PAY
    style ORC fill:#ff6b1a,color:#0a0a0f
    style PAY fill:#15803d,color:#fff
    style INV fill:#0e7490,color:#fff
    style SHIP fill:#a855f7,color:#fff

The flow is explicit and centralized — easy to read, debug, and audit. Adding a step or changing order requires touching one place. The orchestrator's persisted state gives you real observability: "Order #1234 is at step 2 of 4." Timeouts and retries are equally straightforward to handle in one place.

The cost is coupling. If the orchestrator isn't carefully scoped, it gradually accumulates knowledge of every service's internals and becomes a god service. Every step also routes through an extra network hop. And the orchestrator's database is now a critical dependency — it needs its own HA story.

Choreography

No central coordinator. Each service listens for events and reacts by doing its local step and publishing the next event.

flowchart LR
    API[Order API] -->|OrderPlaced event| BUS[Event Bus<br/>Kafka]
    BUS --> PAY[Payment Service]
    PAY -->|PaymentCharged event| BUS
    BUS --> INV[Inventory Service]
    INV -->|InventoryReserved event| BUS
    BUS --> SHIP[Shipping Service]
    SHIP -->|ShipmentFailed event| BUS
    BUS -.-> INV2[Inventory Service<br/>listens for ShipmentFailed]
    INV2 -.->|InventoryReleased event| BUS
    BUS -.-> PAY2[Payment Service<br/>listens for InventoryReleased]
    style BUS fill:#ffaa00,color:#0a0a0f
    style PAY fill:#15803d,color:#fff
    style INV fill:#0e7490,color:#fff
    style SHIP fill:#a855f7,color:#fff

No central coordinator means no SPOF. Services are maximally decoupled — Shipping Service genuinely does not know Payment Service exists. Adding consumers to the event bus scales naturally.

What you give up is legibility. The full saga flow is implicit, spread across multiple codebases. Debugging a stuck saga means correlating events from four different services using a correlation ID, which requires discipline across teams. Cycle detection — accidentally publishing an event that triggers a loop — requires explicit care. And adding a new step usually means changing the event contracts of at least two adjacent services.

Which to use. Prefer choreography for simple flows where services are genuinely independent and the saga has no branching logic. Switch to orchestration once the flow has branching, conditional retry, or more than a handful of steps — the traceability and explicit control flow outweigh the coupling cost. The deciding factors are complexity and observability requirements, not step count alone.

Many production systems use both: high-level workflow orchestration while individual service steps use event-driven patterns internally.

Semantic vs syntactic rollback

This is the conceptual linchpin that trips up most candidates.

In a standard database transaction, a ROLLBACK is syntactic: the database discards every pending write as if it never happened. No external effect occurred.

Saga compensations are semantic: the transaction already committed and may have had external effects. A REFUND transaction does not erase the charge — it creates a new credit. If the customer refreshed their bank app between T1 and C1, they saw the charge. That observation was real.

Three consequences follow directly. First, compensations may themselves fail — a refund call to a payment provider can time out, so your saga must handle compensation failure, usually by retrying idempotently and eventually alerting a human operator if retries are exhausted. Second, some steps are inherently non-compensatable: once a package ships, you cannot un-ship it. You can issue a return label and refund, but the shipment happened. Design your saga so non-compensatable steps come last, after all compensatable steps have succeeded. Third, compensations are not necessarily instantaneous — a bank refund may take hours, so the saga can mark itself "compensating" and poll or await a callback.

Isolation anomalies and countermeasures

Standard ACID transactions provide isolation: concurrent transactions do not see each other's intermediate state. Sagas do not. Each local transaction commits immediately, and other reads can observe any intermediate saga state.

Three anomalies come up regularly. A dirty-read equivalent: Service D reads data written by saga step T2, but the saga later compensates T2 — D now holds stale data. A lost update: two concurrent sagas both read a value, both decide to modify it, and one compensation undoes a modification the other saga depended on. A phantom: a saga reads "seats available = 3," another saga reserves the last seat and compensates concurrently, and the first saga proceeds on a stale count.

Countermeasures

Semantic locks. Add a status field to each entity touched by a saga. While a saga is in progress, mark the record as PENDING. Other transactions that see PENDING know to wait or fail-fast. This is an application-level lock, not a database lock — it participates in the saga's compensation: the compensation clears the PENDING flag.

-- T1: charge customer, mark order PENDING
UPDATE orders SET status='PENDING' WHERE id=:oid;
INSERT INTO payments ...;

-- C1 compensation: refund, clear flag
UPDATE orders SET status='CANCELLED' WHERE id=:oid;
INSERT INTO refunds ...;

Commutative updates. Design updates so their order doesn't matter. credit(account, +100) and debit(account, -30) are commutative — applying them in either order yields the same balance. If all saga steps are commutative, concurrent execution is safe regardless of interleaving.

Pivot transaction. Divide saga steps into a sequence of compensatable steps, followed by a single "pivot" step (the go/no-go point of no return), followed by retryable steps. Once the pivot commits, the saga completes forward — retrying indefinitely if needed — and never compensates backward. The pivot itself can be the last compensatable step (a clean hand-off point) or the first non-compensatable one; what matters is that everything before it can be undone and everything after it must succeed.

Reread before update. Before a step applies a decision based on data it read earlier in the saga, re-read the data within the local transaction to verify the assumption still holds. This is the optimistic locking equivalent at the saga level.

Idempotency: the non-negotiable requirement

Every step and every compensation must be idempotent — executing it twice must produce the same result as executing it once. This is not optional: network failures, timeouts, and orchestrator restarts will cause retries.

Implement idempotency with idempotency keys:

POST /payments/charge
Idempotency-Key: saga-1234-step-1

→ first call: charges $49.99, returns payment_id=P42
→ duplicate call: returns same payment_id=P42 without double-charging

The payment service stores (idempotency_key → result) durably. On a duplicate request, it returns the cached result without re-executing the side effect. See idempotency and exactly-once delivery for the full pattern.

This matters doubly for compensations: if C1 is called twice (once by retry, once by a recovered orchestrator), the customer must not be refunded twice.

Reliable event publishing: the outbox pattern

In choreography, a service completes its local transaction and publishes an event. If it commits its database transaction and then crashes before publishing the event, the saga stalls — the next service never receives the trigger.

The outbox pattern solves this:

Within the same local transaction as the business change, write the outgoing event to an outbox table in the same database.
A separate CDC (Change Data Capture) process — typically via Debezium reading the Postgres WAL — reads new rows from the outbox table and publishes them to Kafka.
Once published, the outbox row is marked delivered (or deleted).

flowchart LR
    SVC[Service] -->|"same transaction"| DB[(Service DB<br/>+ outbox table)]
    CDC[CDC Process<br/>Debezium] -->|reads WAL| DB
    CDC -->|publishes| KAFKA[Kafka]
    KAFKA --> NEXT[Next service]
    style DB fill:#0e7490,color:#fff
    style KAFKA fill:#ffaa00,color:#0a0a0f
    style CDC fill:#ff2e88,color:#fff

Because the event write and the business write are in the same local ACID transaction, they commit or fail together. The CDC process delivers at-least-once; idempotency in the consumer handles duplicates. See change data capture for the full implementation.

In orchestration, a similar concern applies: the orchestrator must durably record that it issued a command before issuing it, so that after a restart it can resume from the correct step rather than re-issuing an already-completed step.

Building the design: evolution from a naive starting point

Start with the simplest thing that could work and walk forward until the design earns its complexity.

V1: Synchronous service calls, no saga

Order Service calls Payment Service
           calls Inventory Service
           calls Shipping Service

Simple. Works until a step fails halfway. Inventory decremented, payment charged, customer angry — no rollback mechanism of any kind.

V2: Manual compensation in application code

Add try/catch blocks: if Shipping fails, call inventoryService.release() and paymentService.refund(). Manually coded, one-off, fragile. What if the compensation call itself fails? What if the Order Service crashes mid-compensation? The compensation logic is ad hoc and not durable — a crash leaves a partially compensated saga with no way to resume.

V3: Choreography with Kafka

Replace direct calls with events on Kafka. Each service listens, acts, publishes. Add the outbox pattern for reliable delivery. Services are decoupled and independently deployable, and event delivery is reliable.

The catch is visibility. Debugging saga #1234 requires correlating events across 5 Kafka topics. Adding a 6th step touches 3 service codebases. Timeouts need saga-level tracking with no obvious owner.

V4: Orchestrated saga with durable state

A Saga Coordinator service drives all steps. Its state machine is persisted in a database. On restart it replays from the last persisted step. Timeouts are owned by the coordinator — it schedules a deadline event. Compensation is triggered explicitly.

stateDiagram-v2
    [*] --> PaymentPending
    PaymentPending --> InventoryPending: payment ok
    InventoryPending --> ShippingPending: inventory ok
    ShippingPending --> Completed: shipping ok
    ShippingPending --> CompensatingInventory: shipping fail
    CompensatingInventory --> CompensatingPayment: inventory released
    CompensatingPayment --> Failed: payment refunded
    Failed --> [*]
    Completed --> [*]

The saga state is observable ("order 1234 is in CompensatingInventory"), resumable after crash, timeout management lives in one place, and it's testable as a state machine. This is the production design for anything beyond a simple 2-step saga.

Saga steps mapped to compensations

To make this concrete, here is what each step and its compensation look like for a typical order saga:

flowchart LR
    T1["T1: Charge customer<br/>$49.99"] -.->|"C1: Refund $49.99"| T1
    T2["T2: Reserve inventory<br/>SKU-8801 × 1"] -.->|"C2: Release reservation"| T2
    T3["T3: Create shipment<br/>at fulfillment center"] -.->|"C3: Cancel shipment<br/>(if not yet picked)"| T3
    T4["T4: Send confirmation<br/>email"] -.->|"No compensation<br/>(non-reversible)"| T4
    T1 --> T2 --> T3 --> T4
    style T1 fill:#15803d,color:#fff
    style T2 fill:#0e7490,color:#fff
    style T3 fill:#a855f7,color:#fff
    style T4 fill:#ffaa00,color:#0a0a0f

Notice that T4 (send confirmation email) has no meaningful compensation. Once an email is sent, you can't un-send it. This is why non-compensatable steps belong at the end of the chain — once you reach T4, all prior steps have committed and the saga is effectively complete. If T4 fails, you retry it; you never compensate backward past the pivot.

Failure modes

Compensation itself fails

The refund call times out. The compensation is now stuck. The approach is to retry with exponential backoff — most transient failures resolve within seconds. If retries are exhausted, transition the saga to a NEEDS_HUMAN_REVIEW state and alert an operator. Do not silently swallow the failure. For the specific case of payment refunds, payment processors support async refund confirmation via webhook or polling — the saga waits for the callback rather than requiring synchronous success.

Non-compensatable step fails

Scenario: T3 (ship) succeeds, but T4 (send email confirmation) fails. Email is non-compensatable. The order is shipped. The correct response is to retry T4, not compensate backward. Put non-compensatable steps last, and make them retryable. If they fail terminally, accept the inconsistency and alert — the customer shipped but didn't get an email, so an operator re-sends.

Partial visibility during compensation

Between T1 (payment charged) and C1 (refund issued), another system reads the account balance and sends "payment confirmed" to the customer. C1 then runs. The customer received a confirmation they shouldn't have.

Mitigation: semantic locks — mark the order as PENDING during the saga. The email service checks order status before sending confirmation. If PENDING, it waits or polls.

Orchestrator itself crashes

The orchestrator must be crash-safe. Its state machine transitions are written durably before it makes the outbound call:

Write "about to call step N" to DB.
Call step N.
Write "step N succeeded" to DB.

On restart, the orchestrator reads its persisted state and re-issues any step it has recorded as "about to call" but not yet "succeeded." Combined with idempotency on every step, this is safe.

Storage choices

Data	Store	Why
Saga state (orchestrated)	Postgres / DynamoDB	Durable, consistent; the saga state machine must not be lost
Outbox events	Same DB as business data	Atomic write in same transaction as business change
Published event log	Kafka	High-throughput, ordered per partition, replayable
Idempotency keys	Redis (TTL ~24h) or same SQL DB	Fast lookup; short-lived after the saga completes
Compensation audit log	Append-only table (Postgres/S3)	Regulatory requirement in financial systems; immutable history

Things to discuss in an interview

Why 2PC fails at scale: blocking under coordinator failure, locks held across network, heterogeneous store compatibility.
Saga vs 2PC trade-off: sagas give up isolation in exchange for availability and autonomy; 2PC gives isolation but blocks under failure.
Orchestration vs choreography: when each is appropriate; how observability and coupling differ.
Semantic vs syntactic rollback: compensations are new transactions, not undos; they may be visible and can fail.
Isolation anomalies: dirty reads of intermediate saga state; semantic locks and commutative updates as mitigations.
Outbox pattern: why you can't just "commit then publish" without it; how CDC bridges the gap.
Idempotency: why every step and compensation must be idempotent; implementation with idempotency keys.
Non-compensatable steps: design them last; accept the forward-only commitment.

Things you should now be able to answer

Why can't you run a single ACID transaction across three microservices?
What is the fundamental problem with Two-Phase Commit when a coordinator crashes after participants vote YES?
What is the difference between a compensating transaction and a database rollback?
What isolation anomaly can occur in a saga that cannot occur in a 2PC transaction?
When would you choose choreography over orchestration for a saga?
Why does the outbox pattern exist, and what problem does it solve?
What happens if a compensation fails? Walk through the handling options.

Frequently asked questions

▸What is the Saga pattern in distributed systems?

The Saga pattern replaces a single distributed transaction with a sequence of local ACID transactions, each paired with a compensating transaction that semantically undoes it if a later step fails. Unlike a database rollback, compensations are new transactions — a refund for a charge, a re-increment for a decrement — because each local transaction has already committed by the time failure occurs.

▸Why is Two-Phase Commit (2PC) avoided at scale in microservices?

If the 2PC coordinator crashes after all participants vote YES but before sending COMMIT or ABORT, every participant is stuck holding its locks indefinitely — it cannot abort (it promised) and cannot commit (no decision arrived). Beyond that failure mode, row locks acquired in Phase 1 are held across at least two network round-trips, and modern stores like DynamoDB, Cassandra, and Kafka do not support the XA protocol 2PC requires.

▸Saga orchestration vs choreography: when should I use each?

Prefer choreography for simple, linear flows where services are genuinely independent and there is no branching logic. Switch to orchestration once the flow has branching, conditional retry, or more than a handful of steps, because the centralized Saga Coordinator gives explicit control flow, a single place to manage timeouts, and observable state such as 'Order 1234 is at step 2 of 4.' The deciding factors are complexity and observability requirements, not step count alone.

▸How does the outbox pattern prevent lost events in a saga?

Instead of committing a database transaction and then publishing to a message broker in a separate call (which leaves a crash window between the two), the outbox pattern writes the outgoing event into an outbox table inside the same local ACID transaction as the business change. A separate CDC process such as Debezium reads new rows from the outbox table via the Postgres WAL and publishes them to Kafka, so the event and the business write are atomic.

▸What is the approximate happy-path latency for a 4-step order saga?

The article gives a concrete estimate: 4 steps multiplied by 15 ms per step (5 ms local commit plus 10 ms network round-trip) equals approximately 60 ms. Sagas add latency proportional to the number of steps because each step requires its own network round-trip and local commit.

← previous

Forward vs Reverse Proxy (and API Gateway)

Designing Pastebin

// RELATED