Every team that splits a monolith into services eventually hits the same wall: an operation that used to be one BEGIN TRANSACTION ... COMMIT now spans four services and four databases. The instinct is to reach for a distributed transaction. Don’t. This article walks through modeling that operation as a saga – a sequence of local transactions with compensating actions – and the two ways to coordinate it: choreography and orchestration. We will build both for a concrete order flow, then dig into the parts everyone gets wrong: idempotent compensations, durable state, and the dual-write trap hiding inside the saga itself.
Why two-phase commit fails at scale
Two-phase commit (2PC) is correct in theory. A coordinator asks every participant to prepare, and once all vote yes, it tells them to commit. The problem is what happens between those phases. Each participant holds locks on its rows from prepare until it hears the final decision. If the coordinator crashes after participants have voted yes but before it broadcasts commit, every participant is stuck in-doubt: it cannot release the locks (the transaction might commit) and cannot roll back (it might not). They block, holding locks, until the coordinator recovers.
That is the fatal flaw at scale:
- It is a blocking protocol. Cross-service calls measured in tens of milliseconds now hold row locks for that entire window. Throughput collapses under contention.
- It couples availability multiplicatively. A 2PC across four services with 99.9% availability each caps the transaction’s success ceiling near 99.6%, and any participant being down blocks the whole thing.
- It assumes XA everywhere. Your message broker, your cloud datastore, and your third-party payment API almost certainly do not speak XA. The moment one participant cannot enlist, 2PC is off the table.
The saga reframes the problem. Instead of one atomic transaction across services, you accept a sequence of local transactions, each committing independently and immediately. You give up isolation – intermediate states are visible – in exchange for availability and no distributed locks. The price is that you now own rollback in application code. There is no
ROLLBACK; there is only “do something that semantically undoes the previous step.”
Step 1 – Model the saga: steps, compensations, and the pivot
Start by writing the happy path as an ordered list of local transactions (T), then define a compensating transaction © for each step that can be undone. Take an order-placement flow:
| # | Forward step (T) | Compensation © |
|---|---|---|
| 1 | CreateOrder (status = PENDING) |
CancelOrder (status = CANCELLED) |
| 2 | ReserveInventory |
ReleaseInventory |
| 3 | AuthorizePayment (hold funds) |
VoidPayment (release hold) |
| 4 | CapturePayment + ConfirmOrder |
– (pivot; no compensation) |
The critical concept is the pivot transaction: the point of no return. Everything before the pivot is compensatable – if a later step fails, you walk backward running compensations. The pivot itself either commits the whole saga or is the last thing that can fail cleanly. Steps after the pivot are retriable – they must eventually succeed, so you retry them forever rather than compensating.
In our flow, CapturePayment is the pivot. Once funds are captured we do not want to “un-capture” as part of normal flow (that is a refund – a new business process with its own rules). So we design the saga so that everything reversible happens first, the pivot commits, and ConfirmOrder is retriable.
Concretely, a saga step is a tuple you can persist and reason about:
{
"step": "AuthorizePayment",
"type": "compensatable",
"command": { "topic": "payment.authorize", "payload": { "orderId": "o-8821", "amount": 4999 } },
"compensation": { "topic": "payment.void", "payload": { "orderId": "o-8821" } },
"retryPolicy": { "maxAttempts": 5, "backoff": "exponential" }
}
Two rules fall out of this model and they drive everything later:
- Compensations must be commutative with failures – running a compensation for a step that never fully applied must be safe (release inventory that was never reserved = no-op).
- Order matters on the way back. Compensations run in reverse: void payment, then release inventory, then cancel order.
Step 2 – Choreography: a loosely coupled event flow
In choreography there is no central coordinator. Each service listens for events, does its local work, and publishes the next event. The saga’s “logic” is the emergent sum of these reactions. For the order flow:
order-svc: receives PlaceOrder -> CreateOrder -> emits OrderCreated
inventory-svc: on OrderCreated -> ReserveInventory -> emits InventoryReserved
payment-svc: on InventoryReserved -> AuthorizePayment -> emits PaymentAuthorized
payment-svc: on PaymentAuthorized -> CapturePayment -> emits PaymentCaptured
order-svc: on PaymentCaptured -> ConfirmOrder -> emits OrderConfirmed
Compensation is also event-driven. A failure publishes a failure event, and each upstream service has a handler that runs its compensation:
payment-svc: AuthorizePayment fails -> emits PaymentDeclined
inventory-svc: on PaymentDeclined -> ReleaseInventory -> emits InventoryReleased
order-svc: on InventoryReleased -> CancelOrder
A consumer handler in this style is small and explicit. Using a typical broker subscription:
# inventory-svc: reacts to OrderCreated, compensates on PaymentDeclined
@subscribe("order.created")
def on_order_created(evt, ctx):
# Idempotency: dedupe on the saga/order id (see Step 5)
if already_processed(evt.order_id, "reserve"):
return
with db.transaction():
reserve(evt.order_id, evt.items) # local commit
mark_processed(evt.order_id, "reserve")
publish("inventory.reserved",
{"order_id": evt.order_id, "items": evt.items})
@subscribe("payment.declined")
def on_payment_declined(evt, ctx):
if already_processed(evt.order_id, "release"):
return
with db.transaction():
release(evt.order_id) # compensation, local commit
mark_processed(evt.order_id, "release")
publish("inventory.released", {"order_id": evt.order_id})
Choreography is genuinely loosely coupled – adding a fraud-svc that also reacts to OrderCreated requires no change to existing services. But it has two real costs at scale. First, no single place describes the workflow; the order-of-operations lives implicitly across N codebases, and answering “what happens after payment?” means grepping subscriptions. Second, cyclic dependencies and event storms are easy to create accidentally. Reserve choreography for short flows (3-4 steps) with stable logic.
Step 3 – Orchestration: a durable state machine and command dispatch
In orchestration a single component – the orchestrator – owns the workflow. It sends a command to a service, waits for a reply, decides the next step, and persists its position after every transition. Services no longer know about each other; they only handle commands and emit replies. The orchestrator is the only thing that knows the order.
The orchestrator is a state machine. Model it explicitly:
states: STARTED -> ORDER_CREATED -> INVENTORY_RESERVED
-> PAYMENT_AUTHORIZED -> PAYMENT_CAPTURED -> COMPLETED
compensating: COMPENSATING_PAYMENT -> COMPENSATING_INVENTORY
-> COMPENSATING_ORDER -> FAILED
The dispatch loop is deliberately boring – on each reply, advance or begin compensating, then persist:
def handle_reply(saga_id, reply):
saga = store.load(saga_id) # durable state, see Step 4
if reply.success:
saga.record(reply.step) # mark step done, store its undo data
nxt = saga.next_forward_step()
if nxt is None:
saga.state = "COMPLETED"
else:
dispatch_command(saga, nxt) # send next command
else:
saga.state = "COMPENSATING"
comp = saga.next_compensation() # reverse order
if comp is None:
saga.state = "FAILED" # nothing left to undo
else:
dispatch_command(saga, comp)
store.save(saga) # persist AFTER deciding, see Step 7
Choosing a runtime matters more than the diagram. You can hand-roll this, but durable-workflow engines exist precisely to make the state machine and its persistence crash-safe. A few that map cleanly to this pattern:
- Temporal / Cadence – workflow code is your orchestrator; the engine persists every step via event sourcing and replays on recovery.
- AWS Step Functions – the state machine is declared in Amazon States Language; the service durably tracks execution.
- Azure Durable Functions – orchestrator functions with automatic checkpointing of the await history.
A Step Functions definition makes the forward/compensation structure first-class with Catch:
{
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "reserve-inventory" },
"Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelOrder" }],
"Next": "AuthorizePayment"
},
"AuthorizePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "authorize-payment" },
"Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory" }],
"Next": "CapturePayment"
},
"CapturePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "capture-payment" },
"Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 6, "BackoffRate": 2.0, "IntervalSeconds": 2 }],
"End": true
},
"ReleaseInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "release-inventory" }, "Next": "CancelOrder" },
"CancelOrder": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "cancel-order" }, "Next": "OrderFailed" },
"OrderFailed": { "Type": "Fail", "Error": "OrderSagaFailed" }
}
}
Note the asymmetry that encodes the pivot: compensatable steps use Catch (fail forward into compensation), while CapturePayment – the pivot and first retriable step – uses Retry instead, because past this point we drive to completion, not rollback.
Step 4 – Persist saga state and recover after orchestrator crashes
The orchestrator’s in-memory position is worthless if the process dies. The saga must be durable: every transition is written to a store before the next command goes out, so a restarted orchestrator can resume exactly where it left off.
A minimal saga-state table:
CREATE TABLE saga_instance (
saga_id UUID PRIMARY KEY,
saga_type TEXT NOT NULL,
current_state TEXT NOT NULL,
current_step TEXT,
payload JSONB NOT NULL, -- accumulated data + per-step undo info
is_compensating BOOLEAN NOT NULL DEFAULT FALSE,
last_updated TIMESTAMPTZ NOT NULL DEFAULT now(),
version INT NOT NULL DEFAULT 0 -- optimistic concurrency
);
Recovery has two requirements people overlook:
- Resume in-flight sagas on startup. On boot, scan for instances not in a terminal state and re-drive them. A saga stuck in
PAYMENT_AUTHORIZEDwith no recent activity needs its next command re-sent (which is why commands must be idempotent – Step 5). - Guard against duplicate orchestrators. If two instances both pick up the same saga after a crash, you get double dispatch. Use optimistic concurrency on
versionso the loser’s write fails:
UPDATE saga_instance
SET current_state = $1, current_step = $2, payload = $3,
is_compensating = $4, version = version + 1, last_updated = now()
WHERE saga_id = $5 AND version = $6; -- 0 rows updated => another worker won; abort
Durable engines give you this for free. Temporal replays the workflow’s event history to reconstruct in-memory state after a crash, so your orchestrator code looks like a straight-line function even though it survives restarts:
func OrderSaga(ctx workflow.Context, order Order) error {
ao := workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Second,
RetryPolicy: &temporal.RetryPolicy{MaximumAttempts: 5},
}
ctx = workflow.WithActivityOptions(ctx, ao)
if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, nil); err != nil {
return err // nothing to compensate yet
}
if err := workflow.ExecuteActivity(ctx, AuthorizePayment, order).Get(ctx, nil); err != nil {
// compensate inventory, then fail
_ = workflow.ExecuteActivity(ctx, ReleaseInventory, order).Get(ctx, nil)
return err
}
// pivot: retried to completion by the RetryPolicy above
return workflow.ExecuteActivity(ctx, CapturePayment, order).Get(ctx, nil)
}
Step 5 – Idempotent and semantically reversible compensations
This is where most saga bugs live. Two independent properties must hold.
Idempotency. Every command and compensation will be delivered more than once – brokers are at-least-once, and crash recovery re-sends. Handlers must produce the same result whether they run once or five times. The standard mechanism is a dedup key (the saga id plus step name) recorded inside the same local transaction as the work:
BEGIN;
INSERT INTO processed_commands (saga_id, step) VALUES ($1, 'authorize')
ON CONFLICT DO NOTHING; -- already done? row count tells you
-- only if the insert succeeded, perform the side effect
-- e.g. UPDATE accounts SET held = held + $2 WHERE id = $3;
COMMIT;
If the broker hands you a payment authorization a second time, the ON CONFLICT DO NOTHING makes the insert a no-op and you skip the side effect. The dedup record and the side effect commit atomically because they share one local transaction – no distributed coordination required.
Semantic reversibility. A compensation is rarely a literal undo; it is a new transaction that brings the system to an equivalent state. VoidPayment does not delete the authorization row – it issues a void and records why. This has subtle implications:
- Compensations can fail too.
ReleaseInventorymight hit a down database. Compensations therefore need their own retries and must be idempotent. Retry compensations effectively forever (or until human escalation) – a saga that cannot finish compensating is a stuck-money incident. - Forward steps must capture undo data. To void a payment you need the
authorizationIdthe authorize step returned. Persist it into the saga payload when the forward step succeeds, or compensation has nothing to act on. - Beware non-reversible side effects. If a step sends a “your order shipped” email, you cannot un-send it. Push such effects past the pivot, or design the compensation as a forward correction (“we’re sorry, your order was cancelled”).
Rule of thumb: if you cannot write the compensation, you have found the pivot. Order your steps so everything before the pivot is cleanly reversible and everything after is something you are willing to retry until it succeeds.
Step 6 – The dual-write trap inside the saga
Here is the failure that silently corrupts sagas. A handler does two things: commits to its database, then publishes the next event/reply. Those are two separate systems with no shared transaction. If the process crashes after the DB commit but before the publish, the local work happened but nobody downstream ever hears about it. The saga stalls forever. Publish first and crash before the commit, and you get the opposite: a downstream step running against work that was rolled back.
You cannot fix this with try/catch. The robust fix is the transactional outbox: write the outgoing message into an outbox table in the same local transaction as the business change, then a separate relay reads the outbox and publishes.
BEGIN;
UPDATE inventory SET reserved = reserved + $1 WHERE sku = $2; -- business change
INSERT INTO outbox (id, topic, payload, created_at)
VALUES (gen_random_uuid(), 'inventory.reserved', $3, now()); -- message
COMMIT; -- both or neither
A relay then drains the outbox and publishes – at-least-once, which is exactly why Step 5’s idempotency is mandatory rather than optional:
-- relay: claim a batch, publish, then delete (or mark sent)
WITH claimed AS (
SELECT id, topic, payload FROM outbox
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 100
)
SELECT * FROM claimed; -- publish each to the broker, then:
-- DELETE FROM outbox WHERE id = ANY($1);
In production you usually back the relay with change data capture (Debezium tailing the DB write-ahead log) so the outbox is drained with low latency and no polling load. The takeaway: a saga step’s “do work and tell the next service” is itself a mini distributed-transaction problem, and the outbox is how you make it atomic without 2PC.
Step 7 – Timeouts and partial failures
Distributed steps don’t only succeed or fail – they hang. A ReserveInventory reply that never arrives must not freeze the saga forever. Two controls handle this:
- Per-step timeout. Each command carries a deadline. If no reply arrives, the orchestrator treats it as a failure and begins compensation – but because the command might still be in flight, the eventual late reply must be discarded by step-level idempotency. This is the ambiguous-outcome problem: a timeout means “I don’t know if it worked,” so compensation must be safe to run against a step that actually succeeded (release inventory that did get reserved) and the late success must be deduped.
- Saga-level timeout. A ceiling on the whole saga. If a flow normally finishes in seconds but one has been open for an hour, alert and escalate – it is almost certainly stuck mid-compensation holding resources.
In Step Functions, attach TimeoutSeconds per task and HeartbeatSeconds for long activities; in Temporal, StartToCloseTimeout and HeartbeatTimeout do the same. The non-negotiable design rule: timeout-triggered compensation plus idempotency must be safe under the late-success race, or you will leak reserved inventory and authorized funds.
Step 8 – Testing sagas: deterministic simulation and chaos
Sagas fail in the gaps between steps, and those gaps are exactly what normal tests skip. Test the seams deliberately.
Deterministic simulation. Drive the orchestrator with a scripted sequence of replies including failures and duplicates, asserting the final state and the exact compensations run:
def test_payment_decline_compensates_in_reverse():
bus = FakeBus()
saga = OrderOrchestrator(bus, store=InMemoryStore())
saga.start(order_id="o-1", items=[("sku-9", 2)], amount=4999)
bus.reply("inventory.reserve", success=True)
bus.reply("payment.authorize", success=False, reason="insufficient_funds")
# duplicate decline must be a no-op
bus.reply("payment.authorize", success=False, reason="insufficient_funds")
assert bus.commands_sent == [
"inventory.reserve", "payment.authorize",
"inventory.release", "order.cancel", # reverse-order compensation
]
assert saga.state(order_id="o-1") == "FAILED"
Crash-injection. Kill the orchestrator between the DB commit and the publish (the Step 6 window) and assert recovery re-drives correctly with no duplicate side effects. Temporal ships a test framework that fast-forwards timers and can simulate worker restarts; Step Functions has a local test runner for mocked task outcomes.
Chaos in staging. Inject broker delays, drop a percentage of replies, and slow a downstream so step timeouts fire. The behaviors to assert: no orphaned reservations after compensation, no double charges under duplicate delivery, and every started saga eventually reaching a terminal state (COMPLETED or FAILED) – a saga that lingers in a compensating state is a leak.
Enterprise scenario
A retail platform team ran their checkout as a 5-step choreography. Under a flash sale, payment-svc slowed to 8-second responses. Their inventory consumer had no idempotency: when the broker redelivered OrderCreated after a visibility timeout, it reserved stock twice for the same order. Oversell alarms fired, and because compensation was also choreographed, the flood of PaymentDeclined events triggered duplicate ReleaseInventory calls that drove some counts negative. They had no single place to see how many sagas were stuck.
They made two changes. First, every consumer got a dedup table keyed on (saga_id, step), written in the same transaction as the side effect – so redelivery became a no-op:
BEGIN;
INSERT INTO processed (saga_id, step) VALUES ($1, 'reserve')
ON CONFLICT DO NOTHING;
-- reservation runs only when the insert affected a row
COMMIT;
Second, they moved the compensation path to an orchestrator (Temporal) while leaving the forward path event-driven – a pragmatic hybrid. The orchestrator gave them a durable record of every in-flight and compensating saga, a query to list stuck instances, and reverse-order compensation that could not double-fire. Oversells went to zero, and mean time to detect a stuck saga dropped from “a customer complained” to a dashboard query. The lesson they wrote into their design guide: choreography is fine for the happy path, but compensation deserves an orchestrator because that is where you most need to see and control what is happening.
Verify
Confirm the saga is actually correct, not just green on a sunny day:
# 1. Forward path completes and reaches a terminal state
curl -s localhost:8080/sagas/o-8821 | jq '.state' # expect "COMPLETED"
# 2. Force a downstream failure, assert reverse-order compensation
FAIL_STEP=payment.authorize ./scripts/run-saga.sh o-9001
curl -s localhost:8080/sagas/o-9001 | jq '.state, .compensationsRun'
# expect "FAILED", ["payment.void","inventory.release","order.cancel"] (or none-applied)
# 3. Duplicate delivery is a no-op: replay the same event twice
./scripts/replay-event.sh order.created o-9001
psql -c "SELECT reserved FROM inventory WHERE sku='sku-9';" # unchanged
# 4. No orphaned holds after compensation
psql -c "SELECT count(*) FROM inventory WHERE reserved < 0;" # expect 0
psql -c "SELECT count(*) FROM payment_holds h
LEFT JOIN saga_instance s ON s.saga_id = h.saga_id
WHERE s.current_state='FAILED' AND h.status='ACTIVE';" # expect 0
# 5. No saga stuck mid-flight
psql -c "SELECT saga_id, current_state FROM saga_instance
WHERE current_state NOT IN ('COMPLETED','FAILED')
AND last_updated < now() - interval '15 minutes';" # expect 0 rows