Architecture Multi-Cloud

Implementing Distributed Transactions with Sagas: Orchestration vs Choreography in Depth

Every team that splits a monolith into services eventually hits the same wall: an operation that used to be one BEGIN TRANSACTION ... COMMIT now spans four services and four databases. The instinct is to reach for a distributed transaction. Don’t. This article walks through modeling that operation as a saga – a sequence of local transactions with compensating actions – and the two ways to coordinate it: choreography and orchestration. We will build both for a concrete order flow, then dig into the parts everyone gets wrong: idempotent compensations, durable state, and the dual-write trap hiding inside the saga itself.

Why two-phase commit fails at scale

Two-phase commit (2PC) is correct in theory. A coordinator asks every participant to prepare, and once all vote yes, it tells them to commit. The problem is what happens between those phases. Each participant holds locks on its rows from prepare until it hears the final decision. If the coordinator crashes after participants have voted yes but before it broadcasts commit, every participant is stuck in-doubt: it cannot release the locks (the transaction might commit) and cannot roll back (it might not). They block, holding locks, until the coordinator recovers.

That is the fatal flaw at scale:

The saga reframes the problem. Instead of one atomic transaction across services, you accept a sequence of local transactions, each committing independently and immediately. You give up isolation – intermediate states are visible – in exchange for availability and no distributed locks. The price is that you now own rollback in application code. There is no ROLLBACK; there is only “do something that semantically undoes the previous step.”

Step 1 – Model the saga: steps, compensations, and the pivot

Start by writing the happy path as an ordered list of local transactions (T), then define a compensating transaction © for each step that can be undone. Take an order-placement flow:

# Forward step (T) Compensation ©
1 CreateOrder (status = PENDING) CancelOrder (status = CANCELLED)
2 ReserveInventory ReleaseInventory
3 AuthorizePayment (hold funds) VoidPayment (release hold)
4 CapturePayment + ConfirmOrder – (pivot; no compensation)

The critical concept is the pivot transaction: the point of no return. Everything before the pivot is compensatable – if a later step fails, you walk backward running compensations. The pivot itself either commits the whole saga or is the last thing that can fail cleanly. Steps after the pivot are retriable – they must eventually succeed, so you retry them forever rather than compensating.

In our flow, CapturePayment is the pivot. Once funds are captured we do not want to “un-capture” as part of normal flow (that is a refund – a new business process with its own rules). So we design the saga so that everything reversible happens first, the pivot commits, and ConfirmOrder is retriable.

Concretely, a saga step is a tuple you can persist and reason about:

{
  "step": "AuthorizePayment",
  "type": "compensatable",
  "command": { "topic": "payment.authorize", "payload": { "orderId": "o-8821", "amount": 4999 } },
  "compensation": { "topic": "payment.void", "payload": { "orderId": "o-8821" } },
  "retryPolicy": { "maxAttempts": 5, "backoff": "exponential" }
}

Two rules fall out of this model and they drive everything later:

  1. Compensations must be commutative with failures – running a compensation for a step that never fully applied must be safe (release inventory that was never reserved = no-op).
  2. Order matters on the way back. Compensations run in reverse: void payment, then release inventory, then cancel order.

Step 2 – Choreography: a loosely coupled event flow

In choreography there is no central coordinator. Each service listens for events, does its local work, and publishes the next event. The saga’s “logic” is the emergent sum of these reactions. For the order flow:

order-svc:      receives PlaceOrder  -> CreateOrder      -> emits OrderCreated
inventory-svc:  on OrderCreated      -> ReserveInventory -> emits InventoryReserved
payment-svc:    on InventoryReserved -> AuthorizePayment -> emits PaymentAuthorized
payment-svc:    on PaymentAuthorized -> CapturePayment   -> emits PaymentCaptured
order-svc:      on PaymentCaptured   -> ConfirmOrder      -> emits OrderConfirmed

Compensation is also event-driven. A failure publishes a failure event, and each upstream service has a handler that runs its compensation:

payment-svc:    AuthorizePayment fails -> emits PaymentDeclined
inventory-svc:  on PaymentDeclined     -> ReleaseInventory -> emits InventoryReleased
order-svc:      on InventoryReleased   -> CancelOrder

A consumer handler in this style is small and explicit. Using a typical broker subscription:

# inventory-svc: reacts to OrderCreated, compensates on PaymentDeclined
@subscribe("order.created")
def on_order_created(evt, ctx):
    # Idempotency: dedupe on the saga/order id (see Step 5)
    if already_processed(evt.order_id, "reserve"):
        return
    with db.transaction():
        reserve(evt.order_id, evt.items)         # local commit
        mark_processed(evt.order_id, "reserve")
    publish("inventory.reserved",
            {"order_id": evt.order_id, "items": evt.items})

@subscribe("payment.declined")
def on_payment_declined(evt, ctx):
    if already_processed(evt.order_id, "release"):
        return
    with db.transaction():
        release(evt.order_id)                    # compensation, local commit
        mark_processed(evt.order_id, "release")
    publish("inventory.released", {"order_id": evt.order_id})

Choreography is genuinely loosely coupled – adding a fraud-svc that also reacts to OrderCreated requires no change to existing services. But it has two real costs at scale. First, no single place describes the workflow; the order-of-operations lives implicitly across N codebases, and answering “what happens after payment?” means grepping subscriptions. Second, cyclic dependencies and event storms are easy to create accidentally. Reserve choreography for short flows (3-4 steps) with stable logic.

Step 3 – Orchestration: a durable state machine and command dispatch

In orchestration a single component – the orchestrator – owns the workflow. It sends a command to a service, waits for a reply, decides the next step, and persists its position after every transition. Services no longer know about each other; they only handle commands and emit replies. The orchestrator is the only thing that knows the order.

The orchestrator is a state machine. Model it explicitly:

states: STARTED -> ORDER_CREATED -> INVENTORY_RESERVED
        -> PAYMENT_AUTHORIZED -> PAYMENT_CAPTURED -> COMPLETED
compensating: COMPENSATING_PAYMENT -> COMPENSATING_INVENTORY
        -> COMPENSATING_ORDER -> FAILED

The dispatch loop is deliberately boring – on each reply, advance or begin compensating, then persist:

def handle_reply(saga_id, reply):
    saga = store.load(saga_id)                  # durable state, see Step 4

    if reply.success:
        saga.record(reply.step)                 # mark step done, store its undo data
        nxt = saga.next_forward_step()
        if nxt is None:
            saga.state = "COMPLETED"
        else:
            dispatch_command(saga, nxt)         # send next command
    else:
        saga.state = "COMPENSATING"
        comp = saga.next_compensation()         # reverse order
        if comp is None:
            saga.state = "FAILED"               # nothing left to undo
        else:
            dispatch_command(saga, comp)

    store.save(saga)                            # persist AFTER deciding, see Step 7

Choosing a runtime matters more than the diagram. You can hand-roll this, but durable-workflow engines exist precisely to make the state machine and its persistence crash-safe. A few that map cleanly to this pattern:

A Step Functions definition makes the forward/compensation structure first-class with Catch:

{
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "reserve-inventory" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelOrder" }],
      "Next": "AuthorizePayment"
    },
    "AuthorizePayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "authorize-payment" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory" }],
      "Next": "CapturePayment"
    },
    "CapturePayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "capture-payment" },
      "Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 6, "BackoffRate": 2.0, "IntervalSeconds": 2 }],
      "End": true
    },
    "ReleaseInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "release-inventory" }, "Next": "CancelOrder" },
    "CancelOrder": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "cancel-order" }, "Next": "OrderFailed" },
    "OrderFailed": { "Type": "Fail", "Error": "OrderSagaFailed" }
  }
}

Note the asymmetry that encodes the pivot: compensatable steps use Catch (fail forward into compensation), while CapturePayment – the pivot and first retriable step – uses Retry instead, because past this point we drive to completion, not rollback.

Step 4 – Persist saga state and recover after orchestrator crashes

The orchestrator’s in-memory position is worthless if the process dies. The saga must be durable: every transition is written to a store before the next command goes out, so a restarted orchestrator can resume exactly where it left off.

A minimal saga-state table:

CREATE TABLE saga_instance (
    saga_id        UUID PRIMARY KEY,
    saga_type      TEXT NOT NULL,
    current_state  TEXT NOT NULL,
    current_step   TEXT,
    payload        JSONB NOT NULL,        -- accumulated data + per-step undo info
    is_compensating BOOLEAN NOT NULL DEFAULT FALSE,
    last_updated   TIMESTAMPTZ NOT NULL DEFAULT now(),
    version        INT NOT NULL DEFAULT 0 -- optimistic concurrency
);

Recovery has two requirements people overlook:

  1. Resume in-flight sagas on startup. On boot, scan for instances not in a terminal state and re-drive them. A saga stuck in PAYMENT_AUTHORIZED with no recent activity needs its next command re-sent (which is why commands must be idempotent – Step 5).
  2. Guard against duplicate orchestrators. If two instances both pick up the same saga after a crash, you get double dispatch. Use optimistic concurrency on version so the loser’s write fails:
UPDATE saga_instance
   SET current_state = $1, current_step = $2, payload = $3,
       is_compensating = $4, version = version + 1, last_updated = now()
 WHERE saga_id = $5 AND version = $6;   -- 0 rows updated => another worker won; abort

Durable engines give you this for free. Temporal replays the workflow’s event history to reconstruct in-memory state after a crash, so your orchestrator code looks like a straight-line function even though it survives restarts:

func OrderSaga(ctx workflow.Context, order Order) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{MaximumAttempts: 5},
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, nil); err != nil {
        return err // nothing to compensate yet
    }
    if err := workflow.ExecuteActivity(ctx, AuthorizePayment, order).Get(ctx, nil); err != nil {
        // compensate inventory, then fail
        _ = workflow.ExecuteActivity(ctx, ReleaseInventory, order).Get(ctx, nil)
        return err
    }
    // pivot: retried to completion by the RetryPolicy above
    return workflow.ExecuteActivity(ctx, CapturePayment, order).Get(ctx, nil)
}

Step 5 – Idempotent and semantically reversible compensations

This is where most saga bugs live. Two independent properties must hold.

Idempotency. Every command and compensation will be delivered more than once – brokers are at-least-once, and crash recovery re-sends. Handlers must produce the same result whether they run once or five times. The standard mechanism is a dedup key (the saga id plus step name) recorded inside the same local transaction as the work:

BEGIN;
  INSERT INTO processed_commands (saga_id, step) VALUES ($1, 'authorize')
  ON CONFLICT DO NOTHING;          -- already done? row count tells you
  -- only if the insert succeeded, perform the side effect
  -- e.g. UPDATE accounts SET held = held + $2 WHERE id = $3;
COMMIT;

If the broker hands you a payment authorization a second time, the ON CONFLICT DO NOTHING makes the insert a no-op and you skip the side effect. The dedup record and the side effect commit atomically because they share one local transaction – no distributed coordination required.

Semantic reversibility. A compensation is rarely a literal undo; it is a new transaction that brings the system to an equivalent state. VoidPayment does not delete the authorization row – it issues a void and records why. This has subtle implications:

Rule of thumb: if you cannot write the compensation, you have found the pivot. Order your steps so everything before the pivot is cleanly reversible and everything after is something you are willing to retry until it succeeds.

Step 6 – The dual-write trap inside the saga

Here is the failure that silently corrupts sagas. A handler does two things: commits to its database, then publishes the next event/reply. Those are two separate systems with no shared transaction. If the process crashes after the DB commit but before the publish, the local work happened but nobody downstream ever hears about it. The saga stalls forever. Publish first and crash before the commit, and you get the opposite: a downstream step running against work that was rolled back.

You cannot fix this with try/catch. The robust fix is the transactional outbox: write the outgoing message into an outbox table in the same local transaction as the business change, then a separate relay reads the outbox and publishes.

BEGIN;
  UPDATE inventory SET reserved = reserved + $1 WHERE sku = $2;   -- business change
  INSERT INTO outbox (id, topic, payload, created_at)
       VALUES (gen_random_uuid(), 'inventory.reserved', $3, now()); -- message
COMMIT;   -- both or neither

A relay then drains the outbox and publishes – at-least-once, which is exactly why Step 5’s idempotency is mandatory rather than optional:

-- relay: claim a batch, publish, then delete (or mark sent)
WITH claimed AS (
  SELECT id, topic, payload FROM outbox
  ORDER BY created_at
  FOR UPDATE SKIP LOCKED
  LIMIT 100
)
SELECT * FROM claimed;     -- publish each to the broker, then:
-- DELETE FROM outbox WHERE id = ANY($1);

In production you usually back the relay with change data capture (Debezium tailing the DB write-ahead log) so the outbox is drained with low latency and no polling load. The takeaway: a saga step’s “do work and tell the next service” is itself a mini distributed-transaction problem, and the outbox is how you make it atomic without 2PC.

Step 7 – Timeouts and partial failures

Distributed steps don’t only succeed or fail – they hang. A ReserveInventory reply that never arrives must not freeze the saga forever. Two controls handle this:

In Step Functions, attach TimeoutSeconds per task and HeartbeatSeconds for long activities; in Temporal, StartToCloseTimeout and HeartbeatTimeout do the same. The non-negotiable design rule: timeout-triggered compensation plus idempotency must be safe under the late-success race, or you will leak reserved inventory and authorized funds.

Step 8 – Testing sagas: deterministic simulation and chaos

Sagas fail in the gaps between steps, and those gaps are exactly what normal tests skip. Test the seams deliberately.

Deterministic simulation. Drive the orchestrator with a scripted sequence of replies including failures and duplicates, asserting the final state and the exact compensations run:

def test_payment_decline_compensates_in_reverse():
    bus = FakeBus()
    saga = OrderOrchestrator(bus, store=InMemoryStore())
    saga.start(order_id="o-1", items=[("sku-9", 2)], amount=4999)

    bus.reply("inventory.reserve",  success=True)
    bus.reply("payment.authorize",  success=False, reason="insufficient_funds")

    # duplicate decline must be a no-op
    bus.reply("payment.authorize",  success=False, reason="insufficient_funds")

    assert bus.commands_sent == [
        "inventory.reserve", "payment.authorize",
        "inventory.release", "order.cancel",       # reverse-order compensation
    ]
    assert saga.state(order_id="o-1") == "FAILED"

Crash-injection. Kill the orchestrator between the DB commit and the publish (the Step 6 window) and assert recovery re-drives correctly with no duplicate side effects. Temporal ships a test framework that fast-forwards timers and can simulate worker restarts; Step Functions has a local test runner for mocked task outcomes.

Chaos in staging. Inject broker delays, drop a percentage of replies, and slow a downstream so step timeouts fire. The behaviors to assert: no orphaned reservations after compensation, no double charges under duplicate delivery, and every started saga eventually reaching a terminal state (COMPLETED or FAILED) – a saga that lingers in a compensating state is a leak.

Enterprise scenario

A retail platform team ran their checkout as a 5-step choreography. Under a flash sale, payment-svc slowed to 8-second responses. Their inventory consumer had no idempotency: when the broker redelivered OrderCreated after a visibility timeout, it reserved stock twice for the same order. Oversell alarms fired, and because compensation was also choreographed, the flood of PaymentDeclined events triggered duplicate ReleaseInventory calls that drove some counts negative. They had no single place to see how many sagas were stuck.

They made two changes. First, every consumer got a dedup table keyed on (saga_id, step), written in the same transaction as the side effect – so redelivery became a no-op:

BEGIN;
  INSERT INTO processed (saga_id, step) VALUES ($1, 'reserve')
  ON CONFLICT DO NOTHING;
  -- reservation runs only when the insert affected a row
COMMIT;

Second, they moved the compensation path to an orchestrator (Temporal) while leaving the forward path event-driven – a pragmatic hybrid. The orchestrator gave them a durable record of every in-flight and compensating saga, a query to list stuck instances, and reverse-order compensation that could not double-fire. Oversells went to zero, and mean time to detect a stuck saga dropped from “a customer complained” to a dashboard query. The lesson they wrote into their design guide: choreography is fine for the happy path, but compensation deserves an orchestrator because that is where you most need to see and control what is happening.

Verify

Confirm the saga is actually correct, not just green on a sunny day:

# 1. Forward path completes and reaches a terminal state
curl -s localhost:8080/sagas/o-8821 | jq '.state'        # expect "COMPLETED"

# 2. Force a downstream failure, assert reverse-order compensation
FAIL_STEP=payment.authorize ./scripts/run-saga.sh o-9001
curl -s localhost:8080/sagas/o-9001 | jq '.state, .compensationsRun'
# expect "FAILED", ["payment.void","inventory.release","order.cancel"]  (or none-applied)

# 3. Duplicate delivery is a no-op: replay the same event twice
./scripts/replay-event.sh order.created o-9001
psql -c "SELECT reserved FROM inventory WHERE sku='sku-9';"   # unchanged

# 4. No orphaned holds after compensation
psql -c "SELECT count(*) FROM inventory WHERE reserved < 0;"   # expect 0
psql -c "SELECT count(*) FROM payment_holds h
         LEFT JOIN saga_instance s ON s.saga_id = h.saga_id
         WHERE s.current_state='FAILED' AND h.status='ACTIVE';" # expect 0

# 5. No saga stuck mid-flight
psql -c "SELECT saga_id, current_state FROM saga_instance
         WHERE current_state NOT IN ('COMPLETED','FAILED')
           AND last_updated < now() - interval '15 minutes';"   # expect 0 rows

Checklist

sagamicroservicesdistributed-transactionsevent-drivenresilience

Comments

Keep Reading