
AWS Step Functions in Production: Express vs Standard, Distributed Map, and Resilient Error Handling

A Lambda function that calls three other services is not a workflow — it is a distributed monolith with a 15-minute timeout and no audit trail. The moment a business process spans retries, branches, human approval, or thousands of parallel items, you want an orchestrator that owns the state so your code does not have to. AWS Step Functions is that orchestrator: a serverless state machine engine where you describe a workflow in Amazon States Language (ASL) — a JSON DSL of states, transitions, retries and catches — and the service durably executes it, remembering exactly where every run is. It is also a place where teams quietly burn money on the wrong workflow type, melt downstream services with unbounded fan-out, and write Retry blocks that re-amplify the exact outage they were meant to absorb.

This is how I design Step Functions workflows that are durable, that scale cleanly, and that fail in ways an on-call engineer can actually reason about. We will treat the four hard problems as one connected system: choosing the execution model (Standard’s exactly-once durability versus Express’s at-least-once throughput), fanning out at scale (inline Map’s ceiling of 40 concurrent iterations versus Distributed Map’s 10,000 child executions over an S3 dataset), error handling that absorbs rather than amplifies (Retry with jittered backoff, Catch that routes, TimeoutSeconds that bounds), and compensation (the saga pattern, because there is no distributed transaction to roll back). Every decision is laid out as a scannable matrix you can keep open at 02:00, alongside the ASL and CLI that implement it.

By the end you will stop reaching for Parallel when you mean Map, stop paying Standard prices for a hot 200 ms loop, and stop shipping a compensation path you have never exercised. Assume a recent CLI (aws --version >= 2.x), familiarity with ASL, and IAM roles already scoped per state machine.

What problem this solves

In production, the pain is not “I cannot call three services in a row” — a Lambda does that. The pain is everything that happens when the third call fails after the first two committed real side effects: a card charged, an inventory item reserved, an email sent. Without an orchestrator that owns the state, your recovery logic lives inside the same function that just died, the audit trail is whatever you remembered to log, and a transient 429 from a downstream takes the whole business transaction down because nothing knew to retry just that step.

What breaks without Step Functions: teams build distributed monoliths — one fat Lambda that calls everything, hits the 15-minute wall on a slow downstream, and leaves you with no record of which side effects completed. Or they hand-roll orchestration in SQS + DynamoDB “state” tables and reinvent retries, timeouts, and idempotency badly. Or they fan out with an unbounded for loop over a Lambda and take down a rate-limited internal API the moment volume spikes. The failure modes are always the same three: wrong durability model (double-charges from at-least-once, or a transition bill from running Standard on a firehose), unbounded fan-out (a self-inflicted downstream outage), and retry storms (lockstep backoff that re-hammers a recovering service).

Who hits this: anyone running an order pipeline, a media-processing batch, an ETL fan-out, a human-approval flow, or any multi-service saga. It bites hardest on high-volume idempotent processing (where Express is right but at-least-once double-counts if a Task is not idempotent), large-dataset fan-out (where inline Map silently caps you at 40 concurrent and overflows the 256 KB state payload), and workflows with non-replayable side effects (where the absence of a saga means a partial failure leaves money and inventory in an inconsistent state).

To frame the whole field before the deep dive, here is every problem class this article covers, the symptom it produces, and the lever that fixes it:

| Problem class | What it looks like in production | First question to ask | The lever that fixes it |
| --- | --- | --- | --- |
| Wrong workflow type | Double-charges (Express) or a huge transition bill (Standard on a firehose) | Are the side effects replayable, and how hot is the traffic? | Standard for durable orchestration; Express for hot idempotent loops |
| Fan-out melts downstream | A rate-limited API/DB falls over the moment volume spikes | Is MaxConcurrency capped to the downstream’s safe limit? | Distributed Map with a pinned MaxConcurrency + ItemBatcher |
| State payload overflow | Inline Map wedges on a large object list | Does the whole array fit in one 256 KB payload? | Distributed Map (each child gets its own 256 KB budget) |
| Retry storm | A recovering service is re-hammered in lockstep | Do all executions back off by the same intervals? | JitterStrategy: FULL + MaxDelaySeconds cap |
| Hung execution | A Task hangs for hours/days on a stuck downstream | Is TimeoutSeconds set on every external Task? | TimeoutSeconds on every Task; HeartbeatSeconds on callbacks |
| No rollback after partial failure | Card charged, shipment failed, money stuck | How far did the workflow get before it failed? | Saga: Catch into a reverse compensation chain |
| Opaque Express failures | An Express run fails and you cannot tell why | Is CloudWatch logging enabled on the state machine? | loggingConfiguration at ALL/ERROR + X-Ray |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the serverless building blocks Step Functions orchestrates: Lambda (the unit of business logic — see AWS Lambda deep dive: runtimes, triggers, layers, concurrency), S3 (the dataset Distributed Map reads — S3 deep dive), DynamoDB for state and idempotency keys (DynamoDB single-table design), and IAM roles and policy evaluation (IAM fundamentals), because every Task assumes the state machine’s execution role. You should be comfortable reading JSON and running aws from a shell.

This sits in the serverless orchestration track. It is downstream of raw messaging — if your problem is fan-out/buffering rather than stateful orchestration, SQS, SNS & EventBridge messaging fundamentals and SQS/SNS fan-out, FIFO & DLQ handling come first. It is the engine behind the event-driven order-processing saga and a core piece of event-driven serverless architecture. For debugging across services it pairs with X-Ray service map & tracing and CloudWatch & CloudTrail observability.

A quick map of where each moving part lives and who usually owns it during an incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Trigger (EventBridge / API / SDK) | StartExecution, execution name | App / platform | Duplicate starts, throttling on the start API |
| State machine definition (ASL) | States, retries, catches, timeouts | App / dev team | Retry storms, missing timeouts, bad Map vs Parallel |
| Execution role (IAM) | Permissions for every Task + child exec | Platform / security | AccessDenied on first Distributed Map run, Task failures |
| Task targets (Lambda / SDK integ) | The actual side effect | App / dev team | Throttling, idempotency bugs, downstream outages |
| Distributed Map child executions | Per-batch Express/Standard runs | App / platform | Fan-out saturation, partial-batch failures |
| Observability (CloudWatch / X-Ray) | History, metrics, traces, Map Run | Platform / SRE | Blind Express failures, missed alarms |

Core concepts

Five mental models make every later decision obvious.

The orchestrator owns the state, not your code. A Lambda that calls three services holds the “where am I” in local variables that vanish when it dies. A Step Functions execution holds it durably: the service knows precisely which state ran, what it returned, and what is next. That is why recovery, retries, and compensation are declarative — the workflow already knows how far it got. This single property is the reason to reach for an orchestrator at all.

The workflow type is a durability contract chosen once. Standard gives exactly-once execution semantics, a durable queryable history, and a 1-year ceiling, billed per state transition. Express gives at-least-once semantics, no durable history (logs only), a 5-minute ceiling, billed per request plus duration. You choose at creation and cannot flip a state machine between them — you create a new one. Standard is a durable state machine you query later; Express is a streaming transform you fire and forget.

ASL is small — five state types carry almost every workflow. Task does work and is the only state with side effects. Choice branches on input. Parallel runs a fixed set of different branches concurrently and joins on all. Map runs the same sub-workflow over each element of an array. Pass / Wait / Succeed / Fail shape data, sleep, and terminate. Reaching for Parallel when you mean Map (or vice versa) is the most common structural mistake.

Fan-out has two execution models with very different ceilings. Inline Map runs inside the parent execution: capped at 40 concurrent iterations, sharing the parent’s one 256 KB state payload. Fine for dozens of items. Distributed Map runs each iteration (or batch) as its own child workflow execution with its own history and its own 256 KB budget, scaling to up to 10,000 parallel child executions over datasets of millions of items. The whole list never has to fit in one payload.

Failure is the design surface, not an afterthought. A Task with no Retry fails the whole execution on the first transient blip; a Task with no TimeoutSeconds can hang to the execution limit (a year on Standard). Retries that all back off by identical intervals re-hammer a recovering service in lockstep — the thundering herd. And because Step Functions has no distributed transaction, a partial failure cannot be rolled back; it must be compensated with an inverse action per completed step.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| State machine | The workflow definition (ASL) | Per region/account | The thing you version and deploy |
| Execution | One run of a state machine | Triggered per event | What you query, retry, and bill on |
| ASL | Amazon States Language (JSON DSL) | The definition | Declares states, retries, catches |
| Standard | Exactly-once, durable, 1-year type | Chosen at creation | Orchestration with non-replayable effects |
| Express | At-least-once, logs-only, 5-min type | Chosen at creation | High-volume idempotent processing |
| Task | The only state with side effects | A state | Invokes Lambda / SDK / nested SM |
| Choice | Branch on input comparison | A state | Routing logic |
| Parallel | Fixed set of different branches | A state | “Do these N different things at once” |
| Map (inline) | Same sub-workflow per array item | A state | ≤40 concurrent, shares 256 KB |
| Distributed Map | Per-item/batch child executions | A state | ≤10k children, own 256 KB each |
| Retry | Backoff-on-error rule on a Task | On a Task/state | Absorbs transient failures |
| Catch | Routes a failure to a handler state | On a Task/state | Implements compensation |
| Saga | Reverse compensation chain | Your workflow shape | “Undo what committed”; no rollback exists |
| waitForTaskToken | Pause until an external callback | An integration pattern | Human approval, async jobs |
| Context object ($$) | Execution/Task/State metadata | At runtime | Idempotency keys, callback tokens |

Standard vs Express: pick by durability, not habit

The first decision is the workflow type, and it is irreversible after creation — you cannot flip a state machine between Standard and Express, you create a new one. They share ASL but differ in their execution guarantees, duration limits, and billing model.

| Property | Standard | Express |
| --- | --- | --- |
| Max duration | 1 year | 5 minutes |
| Execution semantics | Exactly-once | At-least-once (asynchronous); at-most-once (synchronous) |
| Execution history | Durable, queryable for 90 days | Sent to CloudWatch Logs only |
| Pricing model | Per state transition ($0.000025 each, us-east-1) | Per request + GB-second of duration |
| Throughput | Up to thousands of starts/sec | Effectively unbounded, very high rates |
| waitForTaskToken / human approval | Yes | No |
| .sync (run a job and wait) | Yes | No |
| Result visible in describe-execution | Yes (durable) | No (logs/synchronous return only) |

The pricing models invert depending on workload shape. Standard bills $0.000025 per state transition, so a workflow with 10 states costs $0.00025 per execution regardless of how long it waits — a 6-hour wait for an approval costs nothing extra. Express bills $1.00 per million executions plus $0.00001667 per GB-second of duration; a short, hot, high-volume workflow that finishes in 200 ms is dramatically cheaper there, while a long-running or sparse one is cheaper on Standard.
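To make the inversion concrete, here is a rough worked comparison using only the list prices quoted above. It is a sketch under stated assumptions: the function names and the 64 MB memory figure are illustrative, and it ignores the free tier, volume tiers, and the rounding that real Express billing applies to duration and memory.

```python
# Rough per-execution cost estimate from the us-east-1 list prices quoted above.
# Assumptions: no free tier, no volume tiers, no billing-increment rounding.
STANDARD_PER_TRANSITION = 0.000025       # USD per state transition
EXPRESS_PER_REQUEST = 1.00 / 1_000_000   # USD per execution
EXPRESS_PER_GB_SECOND = 0.00001667       # USD per GB-second of duration


def standard_cost(transitions: int) -> float:
    """Standard bills only on transitions; wall-clock time is free."""
    return transitions * STANDARD_PER_TRANSITION


def express_cost(duration_s: float, memory_gb: float) -> float:
    """Express bills one request plus memory-weighted duration."""
    return EXPRESS_PER_REQUEST + duration_s * memory_gb * EXPRESS_PER_GB_SECOND


# A 10-transition workflow that finishes in 200 ms at an assumed 64 MB.
hot_standard = standard_cost(10)
hot_express = express_cost(0.2, 64 / 1024)
print(f"standard=${hot_standard:.8f}  express=${hot_express:.8f}")
```

The same 10 transitions with a 6-hour approval wait still cost the Standard figure, and could not run on Express at all because of the 5-minute ceiling.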

Mental model: Standard is a durable state machine you query later; Express is a streaming transform you fire and forget. Use Standard for orchestration with side effects you cannot replay; use Express for high-volume, idempotent event processing.

The trap is at-least-once on Express. Express can run a state more than once on internal retry, so every Task it invokes must be idempotent. If an Express workflow charges a credit card or increments a counter without an idempotency key, you will eventually double-charge. A nested pattern is common and correct: a Standard parent that orchestrates the durable, exactly-once business steps, invoking Express child workflows (via startExecution.sync) for the hot inner loops.

Choosing by workload shape

Match the workload to the type before you write a line of ASL. The decision is almost always made by two axes: are the side effects replayable, and how hot is the traffic?

| If the workload is… | Side effects | Traffic shape | Choose | Why |
| --- | --- | --- | --- | --- |
| Order/payment orchestration | Non-replayable (charges, shipments) | Sparse, long-lived | Standard | Exactly-once + durable audit; waits are free |
| Human-approval flow | Non-replayable | Hours–days paused | Standard | Only Standard supports waitForTaskToken |
| Per-event enrichment/transform | Idempotent | Very high, short | Express | Cheapest per item; history not needed |
| IoT / clickstream processing | Idempotent | Firehose | Express | Unbounded rate; logs suffice |
| Batch fan-out inner loop | Idempotent | Bursty, short | Express child | Cheap per item under a Standard parent |
| ETL with a long Glue/EMR step | Replayable jobs | Sparse | Standard (.sync) | Needs .sync to wait on the job |
| Saga with compensation | Non-replayable | Any | Standard | Durable state is what makes the saga reliable |
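The same decision can be written down as a small function, which is handy as a review checklist. This is a sketch of the rules in the table, not an official selection algorithm; the function name and parameters are made up for illustration, and it deliberately leaves cost out of the decision.

```python
EXPRESS_MAX_SECONDS = 5 * 60   # Express duration ceiling from the comparison above


def choose_workflow_type(replayable: bool,
                         needs_callback_or_sync: bool,
                         max_duration_s: float) -> str:
    """Return 'STANDARD' or 'EXPRESS' from the two axes plus the hard limits."""
    if needs_callback_or_sync:                # waitForTaskToken and .sync are Standard-only
        return "STANDARD"
    if max_duration_s > EXPRESS_MAX_SECONDS:  # would hard-fail on Express
        return "STANDARD"
    if not replayable:                        # non-replayable effects want exactly-once
        return "STANDARD"
    return "EXPRESS"


print(choose_workflow_type(replayable=True, needs_callback_or_sync=False, max_duration_s=0.2))
print(choose_workflow_type(replayable=False, needs_callback_or_sync=False, max_duration_s=30))
```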

Synchronous vs asynchronous Express

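Express has two start modes, and the difference decides whether you can read the result inline. The trap is assuming a synchronous Express call gives you Standard-grade exactly-once. It does not: synchronous Express executions are at-most-once and asynchronous ones are at-least-once, so in either mode the caller may need to retry and every Task must stay idempotent.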

| Mode | How you start it | You get back | Use when |
| --- | --- | --- | --- |
| Asynchronous Express | StartExecution | Just an execution ARN | Fire-and-forget event processing |
| Synchronous Express | StartSyncExecution | The full result inline | An API-Gateway-fronted request needing the answer now |
| .sync from a parent | arn:aws:states:::states:startExecution.sync | Parent waits for child terminal state | Nested fan-out where the parent must join |

The cost shape, expanded — what actually drives each bill and the lever to pull:

| Cost driver | Standard | Express | Lever to reduce it |
| --- | --- | --- | --- |
| Number of state transitions | $0.000025 each | Not billed per transition | Collapse trivial Pass states; use direct SDK integrations |
| Number of executions | Not billed per exec | $1.00 / million | Batch items so fewer executions run |
| Duration (GB-seconds) | Not billed on duration | $0.00001667 / GB-s | Faster Tasks, smaller memory in the child |
| Long waits | Free (no transition runs) | N/A (5-min cap) | Use Standard for anything that waits |
| CloudWatch Logs ingestion | Optional | Often required | Log at ERROR not ALL in steady state |

State machine design: the core state types

ASL is small. Five state types carry almost every real workflow.

Here is the full state-type reference — what each does, whether it has side effects, and the field that controls it:

| State type | Purpose | Side effects? | Key fields | Common gotcha |
| --- | --- | --- | --- | --- |
| Task | Invoke Lambda / SDK / nested SM | Yes | Resource, Parameters, Retry, Catch, TimeoutSeconds | No timeout → hangs to execution limit |
| Choice | Branch on input | No | Choices, Default | No Default → States.NoChoiceMatched error |
| Parallel | N fixed different branches, join on all | Via its Tasks | Branches, ResultPath | One branch failing fails the whole Parallel |
| Map (inline) | Same sub-workflow per item | Via its Tasks | ItemsPath, MaxConcurrency, ItemProcessor | 40-concurrency cap; shares 256 KB |
| Map (distributed) | Per-item child executions | Via its Tasks | ItemReader, ItemBatcher, ResultWriter | Needs states:StartExecution IAM |
| Pass | Inject/reshape data, no work | No | Result, Parameters, ResultPath | Counts as a transition (Standard cost) |
| Wait | Sleep for time/until timestamp | No | Seconds, Timestamp, SecondsPath | On Express, counts against the 5-min cap |
| Succeed | Terminate successfully | No | | |
| Fail | Terminate with an error | No | Error, Cause | Error string is what Catch matches upstream |

A common mistake is reaching for Parallel when you mean Map. Parallel is for “do these three different things at once” (validate, enrich, score). Map is for “do this one thing to each of these items.” The decision in one table:

| You want to… | The items are… | Concurrency you need | Use |
| --- | --- | --- | --- |
| Validate, enrich, and score in parallel | A fixed, named set of different tasks | The number of branches (small, fixed) | Parallel |
| Process each line item of an order | A variable array of the same thing | ≤40 | Inline Map |
| Transform every object under an S3 prefix | Tens of thousands of the same thing | Hundreds–thousands | Distributed Map |
| Run one of several routes by input value | N/A (just routing) | N/A | Choice |
| Aggregate results then continue | N/A | N/A | Pass (with ResultPath) |

Below, a Choice routes by order value, and a Map (inline mode) processes line items with bounded concurrency.

{
  "Comment": "Order processing",
  "StartAt": "RouteByValue",
  "States": {
    "RouteByValue": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.order.totalUsd",
          "NumericGreaterThan": 10000,
          "Next": "ManualReview"
        }
      ],
      "Default": "ProcessLineItems"
    },
    "ManualReview": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "request-approval",
        "Payload": {
          "orderId.$": "$.order.id",
          "taskToken.$": "$$.Task.Token"
        }
      },
      "ResultPath": "$.approval",
      "Next": "ProcessLineItems"
    },
    "ProcessLineItems": {
      "Type": "Map",
      "ItemsPath": "$.order.lineItems",
      "MaxConcurrency": 5,
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "INLINE" },
        "StartAt": "Fulfil",
        "States": {
          "Fulfil": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": { "FunctionName": "fulfil-line-item", "Payload.$": "$" },
            "End": true
          }
        }
      },
      "End": true
    }
  }
}

Note $$ — the context object, distinct from $ (state input). $$.Task.Token is how a Task hands its callback token to an external system. $$.Execution.Name and $$.State.EnteredTime are invaluable for idempotency keys and logging. The fields of the context object you will actually use:

| Context path | What it holds | Typical use |
| --- | --- | --- |
| $$.Execution.Name | The unique execution name | Stable idempotency key for compensations |
| $$.Execution.Id | The execution ARN | Correlation in logs / DynamoDB |
| $$.Execution.StartTime | When the run started | SLA/timeout math in-flow |
| $$.State.Name | Current state name | Structured logging |
| $$.State.EnteredTime | When this state began | Latency attribution |
| $$.Task.Token | The callback token | waitForTaskToken handoff |
| $$.Map.Item.Index | Item index inside a Map | Per-item logging/keys |
| $$.Map.Item.Value | The item itself | Pass the raw item to a Task |
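One way these paths combine in practice is a per-item idempotency key built from the execution name and the Map index. The sketch below generates such a Task state as a Python dict so it can be dumped to ASL. The helper name is hypothetical; the `States.Format` intrinsic and the `.$` suffix convention are standard ASL, but treat the exact payload shape as an assumption about what your Lambda expects.

```python
import json


def fulfil_task_state(function_name: str) -> dict:
    """Build a Task state (for use inside a Map) that passes a per-item key."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": function_name,
            "Payload": {
                # Keys ending in ".$" are resolved as paths at runtime.
                "item.$": "$$.Map.Item.Value",
                "idempotencyKey.$": (
                    "States.Format('{}-{}', $$.Execution.Name, $$.Map.Item.Index)"
                ),
            },
        },
        "End": True,
    }


state = fulfil_task_state("fulfil-line-item")
print(json.dumps(state, indent=2))
```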

Choice comparators and data flow

Choice is more capable than people expect; knowing the comparators saves a pile of pass-through Lambdas. And the input/output processing fields (InputPath, Parameters, ResultSelector, ResultPath, OutputPath) are where most “why is my state getting the wrong input” bugs live.

| Choice comparator family | Examples | Notes |
| --- | --- | --- |
| Numeric | NumericGreaterThan, NumericEquals, NumericLessThanEquals | Plus ...Path variants comparing two fields |
| String | StringEquals, StringMatches (wildcards), StringLessThan | StringMatches supports * globbing |
| Boolean | BooleanEquals | Common for feature flags |
| Timestamp | TimestampGreaterThan, TimestampEquals | ISO-8601 comparisons |
| Presence | IsPresent, IsNull, IsString, IsNumeric | Guard against missing fields before comparing |
| Logical | And, Or, Not | Nest the above into compound rules |

| Field | When it applies | What it does | Order of evaluation |
| --- | --- | --- | --- |
| InputPath | Before processing | Selects a sub-node of the raw input | 1 |
| Parameters | Task/Map | Builds the payload sent to the resource | 2 |
| ResultSelector | After the result | Reshapes the raw result | 3 |
| ResultPath | After the result | Where to put the result in the state | 4 |
| OutputPath | Last | Selects what passes to the next state | 5 |
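The five-step order is easier to internalize by running it. Below is a deliberately simplified simulation, not the service's implementation: it supports only `$` and dotted paths such as `$.a.b`, ignores intrinsic functions, context-object paths, and `ResultPath: null`, and all function names are made up for illustration.

```python
import copy


def get_path(doc, path):
    """Resolve '$' or a dotted path such as '$.a.b' (no arrays or filters)."""
    if path == "$":
        return doc
    node = doc
    for key in path[2:].split("."):
        node = node[key]
    return node


def set_path(doc, path, value):
    """Return a copy of doc with value written at path ('$' replaces doc)."""
    if path == "$":
        return value
    out = copy.deepcopy(doc)
    node = out
    keys = path[2:].split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return out


def apply_template(template, doc):
    """Build a payload: keys ending in '.$' hold paths, other values are literals."""
    out = {}
    for key, value in template.items():
        if key.endswith(".$"):
            out[key[:-2]] = get_path(doc, value)
        elif isinstance(value, dict):
            out[key] = apply_template(value, doc)
        else:
            out[key] = value
    return out


def run_task_state(state, raw_input, task):
    selected = get_path(raw_input, state.get("InputPath", "$"))           # step 1
    payload = selected
    if "Parameters" in state:                                              # step 2
        payload = apply_template(state["Parameters"], selected)
    result = task(payload)
    if "ResultSelector" in state:                                          # step 3
        result = apply_template(state["ResultSelector"], result)
    merged = set_path(raw_input, state.get("ResultPath", "$"), result)     # step 4
    return get_path(merged, state.get("OutputPath", "$"))                  # step 5


state = {
    "InputPath": "$.order",
    "Parameters": {"id.$": "$.id", "source": "web"},
    "ResultSelector": {"status.$": "$.Payload.status"},
    "ResultPath": "$.fulfilment",
}
raw_input = {"order": {"id": "o-1", "totalUsd": 50}, "traceId": "t-9"}
seen = []


def fake_task(payload):
    seen.append(payload)            # record what the resource actually received
    return {"StatusCode": 200, "Payload": {"status": "OK"}}


output = run_task_state(state, raw_input, fake_task)
print(seen[0])
print(output)
```

Note that step 4 merges into the raw state input, not the InputPath-filtered view, which is why `traceId` survives. Dropping `ResultPath` from `state` makes the result replace the whole input, the usual cause of a later state "getting the wrong input".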

Distributed Map: fan-out over S3 with real concurrency control

Inline Map runs inside the parent execution and is capped at 40 concurrent iterations, and the whole thing shares one 256 KB state payload. That is fine for dozens of items. For tens of thousands — every object under an S3 prefix, every row of a large CSV — you need Distributed mode, which is a different execution model: each iteration (or batch) becomes its own child workflow execution with its own history and its own 256 KB budget. Distributed Map scales to up to 10,000 parallel child executions and can iterate datasets of millions of items.

The two modes side by side — this table decides which one your workload needs:

| Dimension | Inline Map | Distributed Map |
| --- | --- | --- |
| Where iterations run | Inside the parent execution | Separate child executions |
| Max concurrency | 40 | Up to 10,000 |
| State payload per item | Shares parent’s 256 KB | Own 256 KB per child |
| Dataset size | Dozens–hundreds | Millions |
| Item source | An array in the input (ItemsPath) | S3 (objects, CSV, JSON, manifest) via ItemReader |
| Batching | No | ItemBatcher |
| Partial-failure tolerance | All-or-nothing | ToleratedFailurePercentage / ToleratedFailureCount |
| Results handling | In the state payload | ResultWriter → S3 |
| Extra IAM | None | states:StartExecution, S3 read/write |
| Console triage | Standard execution view | Map Run aggregate view |

Set Mode to DISTRIBUTED, point ItemReader at an S3 source, and you get three controls that matter at scale: MaxConcurrency (how hard you hit downstream), ItemBatcher (amortize per-invocation overhead), and ToleratedFailurePercentage (do not fail 9,999 good items because 1 was malformed).

{
  "Type": "Map",
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "Transform",
    "States": {
      "Transform": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "transform-batch", "Payload.$": "$" },
        "End": true
      }
    }
  },
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": { "Bucket": "raw-events-prod", "Prefix": "2026/06/" }
  },
  "ItemBatcher": {
    "MaxItemsPerBatch": 100,
    "MaxInputBytesPerBatch": 262144
  },
  "MaxConcurrency": 500,
  "ToleratedFailurePercentage": 2,
  "ResultWriter": {
    "Resource": "arn:aws:states:::s3:putObject",
    "Parameters": { "Bucket": "map-results-prod", "Prefix": "runs/" }
  },
  "End": true
}

Several decisions are load-bearing here:

The Distributed Map control surface, option by option

Every field that shapes a Distributed Map run, its default, and when you change it:

| Field | What it controls | Default | When to change | Gotcha |
| --- | --- | --- | --- | --- |
| ProcessorConfig.Mode | Inline vs distributed | INLINE | Always set DISTRIBUTED for S3/large datasets | Distributed needs extra IAM |
| ProcessorConfig.ExecutionType | Child type (Express/Standard) | none; required in DISTRIBUTED mode | Set EXPRESS for cheap idempotent items | Express children are at-least-once |
| MaxConcurrency | Parallel child executions | 0 = unlimited (up to 10k) | Always cap to the downstream’s safe limit | 0 can melt a rate-limited API |
| ItemBatcher.MaxItemsPerBatch | Items per child invocation | 1 (no batching) | Raise to amortize per-call overhead | Lambda must loop + report partial failure |
| ItemBatcher.MaxInputBytesPerBatch | Byte ceiling per batch | | Cap so a batch fits the Lambda payload | 256 KB child / 6 MB Lambda sync limit |
| ToleratedFailurePercentage | % of items allowed to fail | 0 (any failure fails the run) | Raise to quarantine a few bad records | Too high hides a systemic break |
| ToleratedFailureCount | Absolute failure count allowed | 0 | Alternative to percentage on small sets | Use one or the other |
| Label | Prefix for child execution names | state name | Disambiguate concurrent Map Runs | Keep it short |
| ItemReader.ReaderConfig.MaxItems | Cap items read from source | all | Throttle a test run | Useful for dry runs |
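Three of these knobs reduce to simple arithmetic, sketched below. This is an approximation for reasoning about settings, not the service's algorithm: the byte accounting ignores the batch envelope, the concurrency formula assumes each child makes a steady number of downstream calls over its lifetime, and every function name here is hypothetical.

```python
import json
import math


def batch_items(items, max_items_per_batch, max_input_bytes_per_batch=None):
    """Greedy batching by item count and, optionally, serialized size in bytes."""
    batches, current, current_bytes = [], [], 0
    for item in items:
        size = len(json.dumps(item).encode("utf-8"))
        too_many = len(current) >= max_items_per_batch
        too_big = (max_input_bytes_per_batch is not None
                   and current
                   and current_bytes + size > max_input_bytes_per_batch)
        if too_many or too_big:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(item)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def run_fails(failed, total, tolerated_failure_percentage):
    """The run fails once the failed share exceeds the tolerated percentage."""
    return total > 0 and (failed / total) * 100 > tolerated_failure_percentage


def safe_max_concurrency(downstream_calls_per_s, seconds_per_child, calls_per_child):
    """Children that can run at once without exceeding the downstream rate."""
    per_child_rate = calls_per_child / seconds_per_child
    return max(1, math.floor(downstream_calls_per_s / per_child_rate))


items = [{"key": f"obj-{i}"} for i in range(250)]
print([len(b) for b in batch_items(items, 100)])
print(run_fails(failed=1, total=10_000, tolerated_failure_percentage=2))
print(safe_max_concurrency(downstream_calls_per_s=1000, seconds_per_child=5, calls_per_child=100))
```

With a downstream that tolerates 1,000 calls per second and children that each make 100 calls over 5 seconds, the cap comes out at 50, far below the 10,000 ceiling you get by leaving MaxConcurrency at 0.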

ItemReader sources

Distributed Map reads more than a flat object list. Pick the reader that matches your data shape:

| ItemReader.Resource | Reads | Each item is | Use when |
| --- | --- | --- | --- |
| s3:listObjectsV2 | Object keys under a prefix | One S3 object reference | “Process every file under prefix/” |
| s3:getObject (CSV) | Rows of a CSV file | One CSV row (object) | A big CSV export to fan over |
| s3:getObject (JSON) | Elements of a JSON array | One array element | A large JSON array of records |
| s3:getObject (JSON Lines) | Lines of a JSONL file | One JSON object per line | Streaming/event exports |
| S3 inventory manifest | Files listed in a manifest | One referenced object | Inventory-driven reprocessing at huge scale |

Distributed Map also needs IAM permission to start its own child executions and to read/write S3 — states:StartExecution, s3:GetObject, s3:ListBucket, and s3:PutObject on the relevant resources. This is the most common reason a freshly built Distributed Map fails on its first run. The exact permission set:

| Permission | Why Distributed Map needs it | Symptom if missing |
| --- | --- | --- |
| states:StartExecution | Launch each child execution | First run fails immediately, no children start |
| states:DescribeExecution / states:StopExecution | Manage child lifecycle | Children orphaned; Map Run cannot stop them |
| s3:ListBucket | listObjectsV2 enumeration | Reader returns zero items |
| s3:GetObject | Read CSV/JSON item content | Reader fails to parse the dataset |
| s3:PutObject | ResultWriter manifest write | Run completes but no results manifest |
| lambda:InvokeFunction | The Task inside the child | Every child fails with AccessDenied |
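As a starting point, the table can be turned into a policy document. The sketch below covers only the actions named above and is not a complete or audited policy: the account ID, region, state machine name, and ARN scoping are placeholder assumptions, and the bucket and function names are borrowed from the earlier example. Check the resource patterns against the current Step Functions IAM documentation before using it.

```python
import json


def distributed_map_policy(state_machine_arn, execution_arn_pattern,
                           source_bucket, results_bucket, function_arn):
    """Build an execution-role policy from the permission table above."""
    def allow(actions, resource):
        return {"Effect": "Allow", "Action": actions, "Resource": [resource]}

    return {
        "Version": "2012-10-17",
        "Statement": [
            allow(["states:StartExecution"], state_machine_arn),
            allow(["states:DescribeExecution", "states:StopExecution"],
                  execution_arn_pattern),
            allow(["s3:ListBucket"], f"arn:aws:s3:::{source_bucket}"),
            allow(["s3:GetObject"], f"arn:aws:s3:::{source_bucket}/*"),
            allow(["s3:PutObject"], f"arn:aws:s3:::{results_bucket}/*"),
            allow(["lambda:InvokeFunction"], function_arn),
        ],
    }


# Placeholder identifiers: replace with your own account, region, and names.
policy = distributed_map_policy(
    state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:transform-events",
    execution_arn_pattern="arn:aws:states:us-east-1:123456789012:execution:transform-events/*",
    source_bucket="raw-events-prod",
    results_bucket="map-results-prod",
    function_arn="arn:aws:lambda:us-east-1:123456789012:function:transform-batch",
)
print(json.dumps(policy, indent=2))
```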

Error handling: Retry, Catch, and backoff with jitter

A Task without a Retry block fails the whole execution on the first transient blip. The fix is not “retry everything forever” — it is to retry the retryable errors with bounded, jittered backoff, and to Catch the rest into a handler.

Retry matches on error names and applies exponential backoff. The full field reference, with defaults and the trade-off of each:

| Field | What it does | Default | Set it to… | Trade-off |
| --- | --- | --- | --- | --- |
| ErrorEquals | Errors this rule matches | (required) | Specific error names per rule | States.ALL must be alone in its retrier |
| IntervalSeconds | First wait before retry | 1 | 1–2 for rate limits | Too low re-hammers; too high slows recovery |
| BackoffRate | Multiplier per attempt | 2.0 | 2.0 typical | >2 grows fast; pair with MaxDelaySeconds |
| MaxAttempts | Retries before giving up | 3 | 5–6 for transient, 1–2 for timeouts | More = longer to surface a real failure |
| MaxDelaySeconds | Cap on any single interval | none | 20–60 | Without it, backoff balloons to hours |
| JitterStrategy | Spread retries randomly | NONE | FULL anywhere you fan out | NONE causes lockstep retry storms |

The thundering-herd problem is concrete: if a downstream API returns 429 to 2,000 concurrent executions and they all back off by exactly 2 s, 4 s, 8 s, they retry in lockstep and re-hammer the recovering service at the same instants. JitterStrategy: FULL spreads each retry randomly across its backoff window, smearing the load.

"CallPaymentApi": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "charge-card", "Payload.$": "$" },
  "Retry": [
    {
      "ErrorEquals": ["Lambda.TooManyRequestsException", "PaymentApi.RateLimited"],
      "IntervalSeconds": 1,
      "BackoffRate": 2.0,
      "MaxAttempts": 6,
      "MaxDelaySeconds": 20,
      "JitterStrategy": "FULL"
    },
    {
      "ErrorEquals": ["States.Timeout"],
      "IntervalSeconds": 2,
      "MaxAttempts": 2
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "CompensateCharge"
    }
  ],
  "Next": "ConfirmOrder"
}

Two details people miss. First, retriers are evaluated in order, and each rule has its own counter — so split rate-limit retries (aggressive, many attempts) from timeout retries (cautious, few). Second, Catch uses ResultPath: "$.error" to merge the error into the existing input rather than replacing it, so the handler still has the order context. Set TimeoutSeconds on every Task that calls something external; a Task with no timeout can hang until the execution-level limit, and on Standard that limit is a year.

The error-and-limit reference

The predefined error names you Retry/Catch on, what triggers each, and whether it is worth retrying:

| Error name | Raised when | Retryable? | Typical handling |
| --- | --- | --- | --- |
| States.ALL | Catch-all (any error) | n/a | Catch of last resort; alone in its retrier |
| States.TaskFailed | A Task returned a failure | Often | Retry transient; catch permanent |
| States.Timeout | TimeoutSeconds/HeartbeatSeconds hit | Cautiously | Few retries, then catch |
| States.Permissions | Execution role lacks a permission | No | Fix IAM; do not retry |
| States.DataLimitExceeded | Output exceeded 256 KB | No | Offload to S3 (ResultWriter/payload trimming) |
| States.Runtime | Internal runtime error (e.g. bad JSONPath) | No | Fix the definition |
| States.HeartbeatTimeout | Worker stopped sending heartbeats | Yes | Catch → compensate/alert |
| Lambda.TooManyRequestsException | Lambda throttled (429) | Yes | Aggressive jittered retry |
| Lambda.ServiceException | Transient Lambda service error | Yes | Retry with backoff |
| Lambda.Unknown | Unhandled Lambda fault | Sometimes | Retry once, then catch |

The service quotas that shape design — the numbers you must respect:

| Limit | Standard | Express | Notes |
| --- | --- | --- | --- |
| Max execution duration | 1 year | 5 minutes | Express hard-fails at 5 min |
| State payload size | 256 KB | 256 KB | Offload large data to S3 |
| Inline Map concurrency | 40 | 40 | Per parent execution |
| Distributed Map child executions | up to 10,000 | up to 10,000 | The fan-out ceiling |
| StateTransition / StartExecution rates | Account/region quotas | Very high | ExecutionThrottled when exceeded |
| Execution history retention | 90 days | none (logs only) | Express needs CloudWatch logs |
| Max state machine definition size | ~1 MB | ~1 MB | Large defs → modularize |
| Open executions per account | Soft quota | Soft quota | Request increase for big fan-out |

Worked backoff math

To make MaxDelaySeconds concrete, here is how the interval grows with IntervalSeconds: 1, BackoffRate: 2.0, capped at 20:

| Attempt | Uncapped interval | With MaxDelaySeconds: 20 | With JitterStrategy: FULL |
| --- | --- | --- | --- |
| 1 | 1 s | 1 s | random in [0, 1] s |
| 2 | 2 s | 2 s | random in [0, 2] s |
| 3 | 4 s | 4 s | random in [0, 4] s |
| 4 | 8 s | 8 s | random in [0, 8] s |
| 5 | 16 s | 16 s | random in [0, 16] s |
| 6 | 32 s | 20 s (capped) | random in [0, 20] s |
| 7 | 64 s | 20 s (capped) | random in [0, 20] s |
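The table can be reproduced with a few lines of arithmetic, which is useful when tuning your own IntervalSeconds and BackoffRate. This is a sketch of the formula as described here, modeling full jitter as a uniform draw between zero and the capped interval; the service's exact random distribution is an assumption, and the function names are illustrative.

```python
import random


def backoff_schedule(interval_seconds=1, backoff_rate=2.0, attempts=7,
                     max_delay_seconds=None):
    """Interval before each retry: interval * rate**n, optionally capped."""
    delays = []
    for attempt in range(attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)
        delays.append(delay)
    return delays


def with_full_jitter(delays, rng):
    """Replace each interval with a uniform draw from [0, interval]."""
    return [rng.uniform(0, d) for d in delays]


uncapped = backoff_schedule()
capped = backoff_schedule(max_delay_seconds=20)
jittered = with_full_jitter(capped, random.Random(42))  # seeded for repeatability

print("uncapped:", uncapped)
print("capped:  ", capped)
print("total wait without cap:", sum(uncapped), "s; with cap:", sum(capped), "s")
```

The cap matters more as MaxAttempts grows: seven uncapped attempts wait 127 s in total, while the capped schedule waits 71 s.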

Compensation and the saga pattern

Step Functions has no distributed transaction. When step 3 of 5 fails after steps 1 and 2 committed real side effects, you cannot roll back — you must compensate, running an inverse action for each completed step. That is the saga pattern, and Step Functions expresses it naturally because the workflow already knows exactly how far it got.

The structure: each forward Task has a Catch that routes to a compensation chain, and the chain undoes completed work in reverse order. Reserve inventory -> charge card -> create shipment; if shipment creation fails, refund the card, then release the inventory.

"CreateShipment": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "create-shipment", "Payload.$": "$" },
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RefundCharge" }
  ],
  "Next": "OrderComplete"
},
"RefundCharge": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "refund-charge",
    "Payload": { "chargeId.$": "$.chargeId", "idempotencyKey.$": "$$.Execution.Name" }
  },
  "Next": "ReleaseInventory"
},
"ReleaseInventory": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "release-inventory", "Payload.$": "$.reservation" },
  "Next": "OrderFailed"
},
"OrderFailed": { "Type": "Fail", "Error": "OrderFailed", "Cause": "Compensated after shipment failure" }

Compensation actions must themselves be idempotent and retryable — a refund that runs twice must refund once, hence the idempotencyKey derived from the execution name (which is unique and stable for the run). Compensation that fails is the worst case; give compensation Tasks their own Retry and route a final failure to an alarm and a dead-letter store for human cleanup. A saga is only as reliable as its weakest undo.
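To make "a refund that runs twice must refund once" concrete, here is a minimal Python sketch of the handler side. The RefundLedger class, the handler signature, and the issue_refund callable are hypothetical; a real implementation would replace the in-memory dict with a durable store and a conditional write so that two concurrent attempts cannot both pass the check.

```python
class RefundLedger:
    """In-memory stand-in for a durable idempotency store (illustration only)."""

    def __init__(self):
        self._by_key = {}

    def refund_once(self, idempotency_key, charge_id, issue_refund):
        # A replay of a key we have already seen returns the recorded result
        # and does not call the payment provider again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        result = issue_refund(charge_id)
        self._by_key[idempotency_key] = result
        return result


def handler(event, ledger, issue_refund):
    """Shape matches the RefundCharge payload: chargeId plus idempotencyKey."""
    return ledger.refund_once(event["idempotencyKey"], event["chargeId"], issue_refund)


# Dry run with a fake payment call that counts how often it is invoked.
refund_calls = []


def fake_issue_refund(charge_id):
    refund_calls.append(charge_id)
    return {"refunded": charge_id}


ledger = RefundLedger()
event = {"chargeId": "ch_42", "idempotencyKey": "order-2024-0001"}
first = handler(event, ledger, fake_issue_refund)
second = handler(event, ledger, fake_issue_refund)  # simulated retry
print(first, second, len(refund_calls))
```

The retry returns the same result as the first call while the payment provider is only invoked once, which is the property the compensation chain depends on.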

The forward-to-compensation map

For each forward step, name the inverse and the idempotency strategy before you write the workflow. This table is the saga design itself:

| Forward step | Side effect | Compensating action | Idempotency key | If compensation fails |
| --- | --- | --- | --- | --- |
| Reserve inventory | Holds stock | Release reservation | reservationId | Alarm; manual stock reconcile |
| Charge card | Moves money | Refund charge | $$.Execution.Name | DLQ; finance review |
| Create shipment | Books a carrier | Cancel shipment | shipmentId | Alarm; ops cancels manually |
| Send confirmation email | Notifies customer | Send correction email | messageId | Best-effort; log only |
| Write order record | Persists state | Mark order FAILED | order PK | Retry; never delete the record |
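The reverse-order rule is easy to check mechanically. The Python sketch below derives the compensation chain for a failure at any forward step from a forward-to-undo list; the state names ReserveInventory, ChargeCard, and CancelShipment are illustrative stand-ins for the rows above, not names taken from a real definition.

```python
# (forward state, compensating state) in forward execution order.
FORWARD_STEPS = [
    ("ReserveInventory", "ReleaseInventory"),
    ("ChargeCard", "RefundCharge"),
    ("CreateShipment", "CancelShipment"),
]


def compensation_chain(steps, failed_step):
    """Undo states for every step that completed before `failed_step`, newest first."""
    names = [forward for forward, _ in steps]
    if failed_step not in names:
        raise ValueError(f"unknown forward step: {failed_step}")
    completed = steps[: names.index(failed_step)]
    return [undo for _, undo in reversed(completed)]


print(compensation_chain(FORWARD_STEPS, "CreateShipment"))
print(compensation_chain(FORWARD_STEPS, "ChargeCard"))
print(compensation_chain(FORWARD_STEPS, "ReserveInventory"))
```

A failure at the first step yields an empty chain, which is the reminder that its Catch can route straight to the Fail state.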

Saga design rules as a checklist table:

| Rule | Why it matters | What breaks if you ignore it |
| --- | --- | --- |
| Every forward Task has a Catch to compensation | The workflow must route on failure | Partial failure leaves committed side effects |
| Compensations run in reverse order | Undo the last commit first | Releasing inventory before refunding can race |
| Every undo is idempotent | Compensations themselves get retried | Double-refund / double-release |
| Every undo is retryable with its own Retry | A failed undo is the worst case | Money stuck with no recovery path |
| Failed compensation → alarm + DLQ | Humans must clean up the residue | Silent inconsistency in production |
| Use a stable idempotency key ($$.Execution.Name) | Same key across retries of the run | Non-deterministic keys defeat idempotency |
| Exercise the path before production | Untested undo = untested code on your worst day | “Safety net” that does not catch |

Optimized integrations and the callback (waitForTaskToken) pattern

Step Functions has three integration patterns, and the difference is real money and latency.

The three patterns side by side — this table decides how you wire a Task:

| Pattern | ARN suffix | Behaviour | Bills (Standard) | Use when |
| --- | --- | --- | --- | --- |
| Request/Response | (none) | Call, get immediate API response, continue | 1 transition | Fire-and-forget; fast SDK calls |
| Run a Job (.sync) | .sync / .sync:2 | Wait for the underlying job to finish | Transitions only (wait is free) | ECS task, Glue/EMR job, nested SM |
| Callback (.waitForTaskToken) | .waitForTaskToken | Pause until external SendTaskSuccess | Transitions only (pause is free) | Human approval, third-party webhook |

Prefer optimized SDK integrations (arn:aws:states:::dynamodb:putItem) over wrapping every call in a Lambda. They run inside the service, so you pay no Lambda invocation, no cold start, and no code to maintain. Use Lambda only for genuine business logic, not for shuttling a value into DynamoDB. A sampler of optimized integrations and what they replace:

| Optimized integration | What it does | Replaces this Lambda |
| --- | --- | --- |
| dynamodb:putItem / getItem / updateItem | Direct DynamoDB write/read | “Lambda that just writes a row” |
| sns:publish | Publish to a topic | “Lambda that just publishes” |
| sqs:sendMessage | Enqueue a message | “Lambda that just enqueues” |
| lambda:invoke | Invoke a function (business logic) | (legitimate use) |
| states:startExecution.sync | Run a nested state machine and wait | Hand-rolled polling loop |
| ecs:runTask.sync | Run an ECS/Fargate task to completion | Poll-for-task-status Lambda |
| glue:startJobRun.sync | Run a Glue job and wait | Poll-for-job Lambda |
| bedrock:invokeModel | Call a foundation model | Lambda wrapper around Bedrock |

The callback pattern is how you model anything asynchronous or human-driven — an approval, a third-party webhook, a long external job. The execution sits paused (free, on Standard, for up to a year) holding a token; the external actor completes it later:

# External system resumes the paused execution
aws stepfunctions send-task-success \
  --task-token "$TASK_TOKEN" \
  --task-output '{"approved": true, "approver": "vinod"}'

Always set HeartbeatSeconds on a waitForTaskToken Task and have the worker call SendTaskHeartbeat. Without a heartbeat, a worker that dies silently leaves the execution paused until the (possibly year-long) timeout. With one, Step Functions fails the Task promptly when heartbeats stop, and your Catch can compensate or alert. The callback timeout/heartbeat knobs:

| Setting / API call | What it does | Default | Set it / call it when |
| --- | --- | --- | --- |
| TimeoutSeconds | Max time the Task may run/pause | none (→ execution limit) | Always, to bound a paused callback |
| HeartbeatSeconds | Max gap between worker heartbeats | none | The worker can die silently |
| SendTaskSuccess | Resume the execution with output | n/a | The work completed |
| SendTaskFailure | Fail the Task with an error name | n/a | The work failed (so Catch fires) |
| SendTaskHeartbeat | Reset the heartbeat clock | n/a | Long-running work; prove liveness |
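On the worker side, the heartbeat usually runs on a background thread while the real work proceeds. The Python sketch below shows one way to structure that. It takes the client as a parameter, so in production you would pass boto3.client("stepfunctions"), whose send_task_heartbeat, send_task_success, and send_task_failure methods use the keyword arguments shown. The run_callback_task function and the RecordingClient stand-in are my own illustration, not part of any SDK.

```python
import json
import threading


def run_callback_task(sfn, task_token, work, heartbeat_every=10.0):
    """Run work() while a background thread heartbeats, then report the result."""
    done = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once `done` is set.
        while not done.wait(heartbeat_every):
            sfn.send_task_heartbeat(taskToken=task_token)

    beater = threading.Thread(target=beat, daemon=True)
    beater.start()
    try:
        result = work()
    except Exception as exc:
        outcome = ("failure", exc)
    else:
        outcome = ("success", result)
    finally:
        done.set()
        beater.join()

    if outcome[0] == "success":
        sfn.send_task_success(taskToken=task_token, output=json.dumps(outcome[1]))
        return True
    error = outcome[1]
    sfn.send_task_failure(taskToken=task_token, error=type(error).__name__, cause=str(error))
    return False


class RecordingClient:
    """Dry-run stand-in for the Step Functions client; records calls only."""

    def __init__(self):
        self.calls = []

    def send_task_heartbeat(self, taskToken):
        self.calls.append(("heartbeat", taskToken))

    def send_task_success(self, taskToken, output):
        self.calls.append(("success", output))

    def send_task_failure(self, taskToken, error, cause):
        self.calls.append(("failure", error))


def slow_approval():
    threading.Event().wait(0.2)  # pretend to do 200 ms of work
    return {"approved": True}


client = RecordingClient()
run_callback_task(client, "token-123", slow_approval, heartbeat_every=0.02)
print(client.calls[-1])
```

Keep heartbeat_every comfortably below the Task's HeartbeatSeconds (a third is a common rule of thumb) so one slow call does not trip the timeout. A production worker would also handle the error the heartbeat call raises once the Task has already timed out.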

Observability: history, X-Ray, and the metrics that matter

Standard workflows keep a full, durable execution history — every state entry/exit, input, output, and error — queryable for 90 days. This is the single best debugging artifact in serverless; get-execution-history reconstructs exactly what happened, in order.

# List failure events, newest first (error/cause are populated for TaskFailed events)
aws stepfunctions get-execution-history \
  --execution-arn "$EXEC_ARN" \
  --reverse-order \
  --query 'events[?contains(type, `Failed`)].[type, taskFailedEventDetails.error, taskFailedEventDetails.cause]' \
  --output table

Enable X-Ray on the state machine (tracingConfiguration.enabled = true) to get an end-to-end trace across the workflow and every downstream it calls — the fastest way to find the one Task adding 4 seconds of tail latency. For Express workflows, which have no durable history, you must enable CloudWatch Logs (loggingConfiguration at ALL or ERROR); without logs an Express failure is nearly opaque.

The observability surface — what each tool gives you and where it shines:

| Tool / signal | What it shows | Standard | Express | Best for |
| --- | --- | --- | --- | --- |
| Execution history | Every event, input/output, error | Durable 90 days | None | Post-mortem replay |
| get-execution-history CLI | History as queryable JSON | Yes | No | Scripted triage |
| CloudWatch Logs | Per-execution log events | Optional | Required | The only window into Express |
| X-Ray service map | End-to-end trace + latency | Yes | Yes | Tail-latency hunting |
| Map Run view | Child success/failure aggregate | Yes | Yes | Triaging a fan-out |
| CloudWatch metrics | Counts/latency by state machine | Yes | Yes | Alarms |
| Execution event history (console) | Visual graph + per-state detail | Yes | Limited | Eyeballing a single run |

The CloudWatch metrics I alarm on:

| Metric | Why it matters | Alarm on |
| --- | --- | --- |
| ExecutionsFailed | Hard failures | A sustained nonzero rate → page |
| ExecutionsTimedOut | Workflows hitting their timeout | Any nonzero → stuck callback / slow downstream |
| ExecutionThrottled | Exceeding StartExecution/transition quotas | Any nonzero → back off or raise quota |
| ExecutionsAborted | Manually/forcibly stopped runs | Spikes → operator intervention or bug |
| ExecutionTime (p99) | Latency regressions | Rising p99 → creeping Wait/retry inflation |
| ExecutionsStarted vs Succeeded | Throughput vs completion | Gap → silent failures |
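As an illustration of the first row, the Python sketch below builds the keyword arguments for a CloudWatch put_metric_alarm call on ExecutionsFailed. The helper name, alarm naming scheme, period, and evaluation count are assumptions to adapt; the namespace (AWS/States) and the StateMachineArn dimension are the ones Step Functions publishes under.

```python
def failed_executions_alarm(state_machine_arn, sns_topic_arn,
                            period_seconds=300, evaluation_periods=3):
    """Keyword arguments for cloudwatch.put_metric_alarm (values are starting points)."""
    name = state_machine_arn.rsplit(":", 1)[-1]
    return {
        "AlarmName": f"sfn-{name}-executions-failed",
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": period_seconds,
        # Several consecutive breaching periods approximates "sustained nonzero".
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points means no failures, so do not alarm on missing data.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical ARNs for illustration.
params = failed_executions_alarm(
    "arn:aws:states:us-east-1:123456789012:stateMachine:sfn-lab",
    "arn:aws:sns:us-east-1:123456789012:oncall-page",
)
print(params["AlarmName"])

# With credentials configured, the alarm would be created like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

For ExecutionsTimedOut and ExecutionThrottled, where any nonzero value matters, the same shape works with MetricName swapped and evaluation_periods set to 1.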

The logging levels and what each captures (cost vs visibility):

| loggingConfiguration level | Logs | Cost | Use when |
| --- | --- | --- | --- |
| OFF | Nothing | None | Never in production |
| ERROR | Failed/aborted execution events | Low | Steady-state Express |
| FATAL | Only execution-terminating errors | Lowest non-off | Very high volume Express |
| ALL | Every event | Highest | Debugging / low volume |
| includeExecutionData | Input/output payloads | + payload size | Deep debugging (watch for PII) |

For Distributed Map specifically, the Map Run in the console aggregates child-execution success/failure counts and links straight to failed children — that view is where you triage a fan-out that came back 98% green.

Architecture at a glance

The diagram traces a real production saga left to right, then maps the five failure-or-decision points onto the exact hop where each bites. Read it as the path an order takes. A trigger — an EventBridge rule or an API/SDK StartExecution with an idempotent name — starts a Standard parent state machine, the durable, exactly-once orchestrator that owns the saga state for up to a year and bills only per transition. Inside the parent, a Choice routes, Tasks carry Retry/Catch/TimeoutSeconds, and a waitForTaskToken Task can pause for free holding a heartbeat-guarded callback. When the parent needs to process tens of thousands of items, it enters a Distributed Map: ItemReader lists objects under an S3 prefix, each batch spawns an Express child execution (cheap, 5-minute, at-least-once), and ResultWriter persists a manifest to S3 so large outputs never blow the 256 KB payload cap. The per-item side-effect Tasks (reserve → charge → ship) commit real state; when one fails, the parent’s Catch routes into the compensation chain (refund → release) keyed on $$.Execution.Name. Every hop streams into CloudWatch and X-Ray — durable history, the Map Run aggregate, and the ExecutionsFailed/Throttled alarms.

The five badges narrate where this design earns its keep, each as symptom · confirm · fix. (1) choosing the wrong workflow type double-charges (Express on non-replayable effects) or explodes the transition bill (Standard on a firehose); (2) a retrier with JitterStrategy: NONE or no TimeoutSeconds turns a blip into a lockstep storm or a year-long hang; (3) an uncapped Distributed Map MaxConcurrency melts a rate-limited downstream, or missing states:StartExecution/S3 IAM fails the first run; (4) a partial failure mid-saga cannot roll back and must be compensated in reverse with idempotent undos; (5) an Express failure is near-blind without CloudWatch logging, and ExecutionThrottled is the quota smoke alarm. The whole method: localise the symptom to a hop, read the badge, run the named confirm, apply the fix.

Figure: AWS Step Functions production architecture. EventBridge rule or API/SDK StartExecution trigger → Standard parent saga (Choice, Retry/Catch Tasks, waitForTaskToken callback) → Distributed Map over an S3 prefix with Express children and an S3 ResultWriter manifest → reserve/charge/ship Tasks with a refund/release compensation chain → CloudWatch and X-Ray. Numbered badges: (1) wrong workflow type, (2) retry storm / no timeout, (3) fan-out saturation / IAM, (4) no-rollback saga, (5) opaque Express failure.

Real-world scenario

Lumira Media runs a nightly pipeline that transcodes every asset uploaded that day — typically 60,000 objects under an S3 prefix — into three renditions each. The platform team is five engineers; the workload is in us-east-1 and the original design was an inline Map that read the object list into the parent execution and fanned out. It worked at a few thousand items and then wedged: the parent execution’s 256 KB state payload overflowed on the object list well before they reached peak volume, and the inline 40-concurrency cap meant the few runs that did start took most of the night.

The constraints were hard. The full batch had to finish inside a 6-hour window before downstream publishing began. A handful of corrupt source files were expected nightly and must not fail the whole run. And the transcoder was a rate-limited internal service that fell over above ~400 concurrent jobs — so “just raise concurrency” was the exact move that would cause an outage. The first attempt to fix the wedge had made it worse: an engineer set inline Map MaxConcurrency to 0 (unlimited), which immediately saturated the transcoder and triggered a cascading failure that took down an unrelated service sharing the same backend pool.

The redesign moved to Distributed Map reading the prefix via s3:listObjectsV2, with ExecutionType: EXPRESS children, MaxConcurrency pinned to 400 to respect the transcoder, an ItemBatcher of 20 to amortize invocation cost, and ToleratedFailurePercentage: 1 so a few bad files were quarantined rather than fatal. ResultWriter wrote a per-item manifest to S3 that the publishing stage consumed directly. Critically, the per-item Task carried a jittered retrier — JitterStrategy: FULL, MaxDelaySeconds: 30 — because the first un-jittered version had produced a secondary thundering herd: when the transcoder briefly 503’d, 400 children retried in lockstep and re-toppled it.

"TranscodeAll": {
  "Type": "Map",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": { "Bucket.$": "$.bucket", "Prefix.$": "$.todayPrefix" }
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "Transcode",
    "States": {
      "Transcode": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "transcode-batch", "Payload.$": "$" },
        "Retry": [
          { "ErrorEquals": ["Transcoder.Throttled"], "IntervalSeconds": 2,
            "BackoffRate": 2.0, "MaxAttempts": 5, "MaxDelaySeconds": 30, "JitterStrategy": "FULL" }
        ],
        "End": true
      }
    }
  },
  "ItemBatcher": { "MaxItemsPerBatch": 20 },
  "MaxConcurrency": 400,
  "ToleratedFailurePercentage": 1,
  "ResultWriter": {
    "Resource": "arn:aws:states:::s3:putObject",
    "Parameters": { "Bucket.$": "$.resultsBucket", "Prefix": "transcode-runs/" }
  },
  "End": true
}

A CloudWatch alarm on the Map Run failed-child count caught the rare night when corruption spiked past the 1% tolerance, and an ExecutionThrottled alarm on the parent would have caught the original unlimited-concurrency mistake before it cascaded. The pipeline now finishes a 60,000-object night in under two hours, the transcoder stays under its concurrency ceiling, and corrupt files land in a results manifest for morning review instead of failing the batch. No new infrastructure — just the right Map mode, an honest concurrency cap, and jittered retries against the one service that could not be rushed.
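The numbers in this redesign can be sanity-checked with a little arithmetic. The Python sketch below is a rough model only: it treats the fan-out as fixed "waves" of MaxConcurrency children and uses the five-minute Express limit as the worst-case duration per child, ignoring start-up overhead and throttling. The function name and the wave model are my own simplifications.

```python
import math

EXPRESS_MAX_SECONDS = 5 * 60  # an Express child execution is cut off at five minutes


def fanout_plan(items, batch_size, max_concurrency):
    """Child-execution count and a worst-case duration bound for a Distributed Map."""
    child_executions = math.ceil(items / batch_size)
    waves = math.ceil(child_executions / max_concurrency)
    return {
        "child_executions": child_executions,
        "waves": waves,
        "worst_case_seconds": waves * EXPRESS_MAX_SECONDS,
    }


plan = fanout_plan(items=60_000, batch_size=20, max_concurrency=400)
print(plan)
print("fits 6-hour window:", plan["worst_case_seconds"] <= 6 * 3600)
```

Under these assumptions the 60,000-object night becomes 3,000 child executions in 8 waves, bounded at 40 minutes even if every child ran to the Express limit, which is consistent with the observed finish in under two hours.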

The incident-and-redesign as a timeline, because the order of moves is the lesson:

| Stage | Symptom | Action | Effect | What it should have been |
| --- | --- | --- | --- | --- |
| Original | Inline Map wedges ~3k items | (none) | Payload overflow + 40-cap | Distributed Map from the start |
| Panic fix | “Raise concurrency” | Inline Map MaxConcurrency: 0 | Transcoder saturated; cascade | Cap to the downstream’s 400 limit |
| Redesign | Need 10k+ scale | Distributed Map + Express children | Scales to 60k | |
| First run | Transcoder 503s briefly | Un-jittered retry | Lockstep herd re-topples it | JitterStrategy: FULL + cap |
| Stabilized | A few corrupt files | ToleratedFailurePercentage: 1 | Bad items quarantined | |
| Steady state | 60k in <2h | Map Run + ExecutionThrottled alarms | Visible, bounded, safe | The actual fix |

Advantages and disadvantages

Owning orchestration in a managed state machine both solves this class of problem and introduces its own sharp edges. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
| --- | --- |
| The service owns durable state: recovery, retries, and compensation are declarative, not hand-rolled | The workflow type is immutable; a wrong Standard/Express choice means recreating the state machine |
| Standard’s exactly-once history is the best debugging artifact in serverless | Express has no durable history; a failure is near-blind without CloudWatch logging |
| Distributed Map fans out to 10,000 children over millions of items with no servers | Uncapped MaxConcurrency melts a rate-limited downstream; the fan-out is a loaded gun |
| Retry/Catch/TimeoutSeconds are first-class, declarative resilience | Defaults are unsafe: JitterStrategy: NONE, no TimeoutSeconds, MaxConcurrency: 0 |
| Optimized SDK integrations cut Lambda invocations, cost, and cold starts | The saga has no rollback; you must design every inverse action yourself |
| waitForTaskToken models human/async steps with free, year-long pauses (Standard) | A waitForTaskToken Task with no HeartbeatSeconds can pause for a year on a dead worker |
| Per-transition billing makes long waits effectively free on Standard | Standard on a hot, short, high-volume firehose racks up a large transition bill |
| The Map Run view triages a fan-out at a glance | Distributed Map needs extra IAM (states:StartExecution, S3); a missing permission surfaces only when the first run fails |

The model is right when you have a multi-step business process with non-replayable side effects, a need for durable audit, fan-out at real scale, or human-in-the-loop steps. It is the wrong tool for a single fast transform (just use a Lambda) or pure buffering/fan-out without state (use SQS/SNS/EventBridge). The disadvantages are all manageable — but only if you know they exist, which is the point of every table above.

Hands-on lab

Build, run, and deliberately break a small Standard workflow with retry and catch — all free-tier-friendly (Step Functions Standard includes 4,000 free state transitions/month). Run in CloudShell or any shell with the AWS CLI configured.

Step 1 — Variables and an execution role.

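REGION=us-east-1
export AWS_DEFAULT_REGION="$REGION"   # every CLI call below uses this region
ACCT=$(aws sts get-caller-identity --query Account --output text)
# Execution role that Step Functions assumes when running the state machine
ROLE_ARN=$(aws iam create-role --role-name sfn-lab-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"states.amazonaws.com"},"Action":"sts:AssumeRole"}]}' \
  --query 'Role.Arn' --output text)
# AWSLambdaRole lives under the service-role/ path; it allows invoking any Lambda (lab only)
aws iam attach-role-policy --role-name sfn-lab-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaRole
echo "$ROLE_ARN"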

Expected: a role ARN like arn:aws:iam::<acct>:role/sfn-lab-role.

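Step 2 — A definition with a retrier and a catch. Save as lab.asl.json. It calls a (nonexistent) function so you can watch the Catch fire. The retrier matches only Lambda.TooManyRequestsException, so the "function not found" error skips it and goes straight to the States.ALL catcher.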

{
  "Comment": "Retry + Catch lab",
  "StartAt": "DoWork",
  "States": {
    "DoWork": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "does-not-exist", "Payload.$": "$" },
      "TimeoutSeconds": 30,
      "Retry": [
        { "ErrorEquals": ["Lambda.TooManyRequestsException"], "IntervalSeconds": 1,
          "BackoffRate": 2.0, "MaxAttempts": 3, "MaxDelaySeconds": 10, "JitterStrategy": "FULL" }
      ],
      "Catch": [ { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Handled" } ],
      "Next": "Done"
    },
    "Handled": { "Type": "Pass", "Result": { "handled": true }, "End": true },
    "Done": { "Type": "Succeed" }
  }
}

Step 3 — Statically validate before creating anything (no resources made).

aws stepfunctions validate-state-machine-definition \
  --definition file://lab.asl.json \
  --query '{result:result,diagnostics:diagnostics}'

Expected: "result": "OK" with an empty diagnostics array.

Step 4 — Create the Standard state machine.

SM_ARN=$(aws stepfunctions create-state-machine \
  --name sfn-lab --type STANDARD \
  --definition file://lab.asl.json --role-arn "$ROLE_ARN" \
  --query 'stateMachineArn' --output text)
echo "$SM_ARN"

Step 5 — Start an execution and capture its ARN.

EXEC_ARN=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" \
  --input '{"hello":"world"}' --query 'executionArn' --output text)

Step 6 — Poll to terminal status (expect SUCCEEDED via the Catch).

aws stepfunctions describe-execution --execution-arn "$EXEC_ARN" \
  --query '{status:status,output:output}'

Expected: "status": "SUCCEEDED" and an output containing "handled": true — the Task failed (no such function), the Catch routed to Handled, and the run succeeded gracefully.

Step 7 — Confirm the failure/catch actually happened in the history.

aws stepfunctions get-execution-history --execution-arn "$EXEC_ARN" \
  --query 'events[?type==`TaskFailed` || type==`PassStateEntered`].type'

Expected: a TaskFailed followed by a PassStateEntered — proof the catch fired.

Step 8 — Teardown.

aws stepfunctions delete-state-machine --state-machine-arn "$SM_ARN"
aws iam detach-role-policy --role-name sfn-lab-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaRole
aws iam delete-role --role-name sfn-lab-role

Common mistakes & troubleshooting

The differentiator. Each row is a real failure mode: the symptom you see, the root cause, the exact command/console path to confirm it, and the fix. Scan to your symptom, then read the detail below.

| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Double-charges / duplicate side effects | Express at-least-once + non-idempotent Task | describe-state-machine shows "type":"EXPRESS"; Task has no idempotency key | Add idempotency key, or move side effect to a Standard parent |
| 2 | Huge Step Functions bill on a hot workflow | Standard on a short high-volume firehose | Cost Explorer → StateTransition; describe-state-machine type=STANDARD | Recreate as Express (type is immutable) |
| 3 | Inline Map wedges on a big list | 256 KB payload overflow / 40-concurrency cap | get-execution-history → States.DataLimitExceeded | Switch to Distributed Map (S3 ItemReader) |
| 4 | Distributed Map fails on first run | Missing states:StartExecution / S3 IAM | get-execution-history → States.Permissions / AccessDenied | Add child-exec + S3 perms to the role |
| 5 | A rate-limited downstream falls over | MaxConcurrency uncapped (0) | Map Run shows full concurrency; downstream throttle metrics spike | Pin MaxConcurrency to the downstream’s safe limit |
| 6 | Recovering service re-toppled by retries | JitterStrategy: NONE → lockstep storm | get-execution-history shows synchronized retry waits | Set JitterStrategy: FULL + MaxDelaySeconds |
| 7 | Execution hangs for hours/days | No TimeoutSeconds on a Task | Execution RUNNING far past expected; no progress events | Add TimeoutSeconds to every external Task |
| 8 | Callback paused forever | waitForTaskToken worker died, no heartbeat | Execution RUNNING; no SendTaskSuccess ever arrives | Add HeartbeatSeconds + worker SendTaskHeartbeat |
| 9 | Partial failure left money stuck | No saga / compensation chain | Execution FAILED mid-flow with committed side effects | Add Catch → reverse compensation chain |
| 10 | Express failure is unexplained | No CloudWatch logging on the state machine | describe-state-machine → loggingConfiguration OFF | Enable loggingConfiguration ALL/ERROR |
| 11 | Parallel fails when one branch fails | Parallel semantics: any branch failure fails all | History shows one branch Failed aborting the rest | Use Map with tolerance, or Catch per branch |
| 12 | States.NoChoiceMatched error | Choice with no Default and no match | History → States.NoChoiceMatched | Add a Default state to every Choice |
| 13 | Output truncated / States.DataLimitExceeded | A Task output exceeded 256 KB | History → States.DataLimitExceeded | Offload to S3; trim with ResultSelector/ResultPath |
| 14 | ExecutionThrottled spikes | Exceeding StartExecution/transition quota | CloudWatch ExecutionThrottled metric nonzero | Back off the trigger; request a quota increase |

Detail on the costly ones

1 — Express double-charges. At-least-once means a state can run more than once on internal retry. Confirm the type with aws stepfunctions describe-state-machine --state-machine-arn $SM_ARN --query 'type'; if it is EXPRESS and the Task moves money or increments a counter without an idempotency key derived from a stable value ($$.Execution.Name, an order ID), you will eventually double-apply. Fix by making the Task idempotent, or by hoisting the non-replayable step into a Standard parent and keeping only the idempotent inner loop on Express.

5 — Fan-out melts the downstream. MaxConcurrency: 0 means “unlimited up to 10,000.” Against a Lambda transform you are bounded by Lambda concurrency, but against a database or third-party API, unlimited is a self-inflicted outage. Confirm by correlating the Map Run’s concurrency with the downstream’s saturation metric (RDS connections, API 429 rate). Fix by pinning MaxConcurrency to the downstream’s tested safe limit, and raise it only while watching that downstream metric rather than the Step Functions console.

6 — Retry storm. Confirm by pulling history and looking for retries clustered at identical intervals: get-execution-history ... --query 'events[?type==`TaskFailed`].timestamp' across many executions shows the same timestamps. JitterStrategy: FULL smears each retry randomly across its window; MaxDelaySeconds stops exponential growth from ballooning to hours.
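Once the TaskFailed timestamps from many executions are pulled, the lockstep check itself is a one-liner. The Python sketch below buckets failure times and reports what share landed in the busiest bucket. The function name, the one-second bucket width, and any threshold you apply to the result are assumptions to tune against your own traffic.

```python
from collections import Counter


def lockstep_share(failure_times, bucket_seconds=1.0):
    """Fraction of failures that fall in the single busiest time bucket."""
    if not failure_times:
        return 0.0
    buckets = Counter(int(t // bucket_seconds) for t in failure_times)
    return max(buckets.values()) / len(failure_times)


# Hypothetical epoch-second timestamps of one retry round across four executions.
lockstep = [100.1, 100.2, 100.3, 100.4]   # un-jittered: all retry together
jittered = [100.1, 101.7, 103.2, 104.9]   # FULL jitter: spread across the window

print(lockstep_share(lockstep))
print(lockstep_share(jittered))
```

A share near 1.0 means the retries are arriving together. After enabling JitterStrategy: FULL the same measurement should drop toward an even spread.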

9 — No saga. Confirm by checking a FAILED execution’s last successful state in the history — if it is past a side-effect Task (charge, reservation), that effect is committed and orphaned. Fix by giving each forward Task a Catch into a reverse compensation chain whose every undo is idempotent and retryable, with a final failure routed to a DLQ + alarm.

Best practices

Security notes

Step Functions runs as an identity and touches many services; least privilege is the whole game.

Cost & sizing

What drives the bill depends entirely on the type. Standard bills per state transition ($0.000025 each, us-east-1); Express bills $1.00 per million executions plus $0.00001667 per GB-second of duration. The free tier includes 4,000 Standard state transitions per month.

| Workload | Best type | Rough monthly cost (us-east-1) | Why |
| --- | --- | --- | --- |
| 100k orders/mo, 12 states each | Standard | ~$30 (1.2M transitions × $0.000025) | Durable, exactly-once; cheap at this volume |
| 50M events/mo, 200 ms each, 128 MB | Express | ~$50 exec + ~$21 duration ≈ $71 | Per-item Express is far cheaper than Standard here |
| Nightly 60k-item fan-out (batched ×20) | Standard parent + Express children | a few dollars/night | Batching cuts executions 20× |
| Human-approval flow, paused 2 days | Standard | ~$0.0003 / execution | Pauses are free; only transitions bill |
| Same 50M events on Standard (anti-pattern) | (don’t) | ~$25,000 (1B transitions, assuming ~20 per event) | The cautionary cost of the wrong type |
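The rows above are straightforward to recompute for your own volumes. This Python sketch applies the us-east-1 list prices quoted in this section. It deliberately ignores the free tier and the tiered Express duration discounts, and the 20-transitions-per-event figure in the last example is an assumption chosen to match the anti-pattern row.

```python
STANDARD_PER_TRANSITION = 0.000025      # USD per state transition
EXPRESS_PER_MILLION_REQUESTS = 1.00     # USD per million executions
EXPRESS_PER_GB_SECOND = 0.00001667      # USD per GB-second of duration


def standard_cost(executions, transitions_per_execution):
    """Monthly Standard cost: every state transition is billed."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_cost(executions, duration_seconds, memory_mb):
    """Monthly Express cost: request charge plus duration charge."""
    requests = executions / 1_000_000 * EXPRESS_PER_MILLION_REQUESTS
    gb_seconds = executions * duration_seconds * (memory_mb / 1024)
    return requests + gb_seconds * EXPRESS_PER_GB_SECOND


print(round(standard_cost(100_000, 12), 2))           # orders workload on Standard
print(round(express_cost(50_000_000, 0.2, 128), 2))   # event firehose on Express
print(round(standard_cost(50_000_000, 20), 2))        # same firehose on Standard
```

Running both functions against the same workload before creating the state machine is cheap insurance, since the type cannot be changed afterwards.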

Sizing levers, ranked by impact:

| Lever | Effect on cost | Effort | Trade-off |
| --- | --- | --- | --- |
| Right type (Standard vs Express) | Can be 100–1000× | One decision (at creation) | Irreversible; recreate to change |
| ItemBatcher (batch items) | Cuts executions/transitions N× | Low (Lambda loops over $.Items) | Lambda must report partial failure |
| Collapse trivial Pass states | Fewer transitions (Standard) | Low | Slightly less explicit data shaping |
| Direct SDK integrations vs Lambda | Removes invocation cost | Low | Only for non-business-logic calls |
| Smaller child memory/duration (Express) | Lower GB-seconds | Medium | Profile first; don’t starve the Task |
| Log at ERROR not ALL | Lower CloudWatch ingestion | Trivial | Less detail when debugging |

In INR terms, a typical order-orchestration workload at ~100k executions/month runs on the order of ₹2,000–3,000/month all-in (transitions + Lambda + logs) — Step Functions itself is rarely the dominant line item; the Lambdas and downstream services usually are. The expensive mistake is not the per-transition price, it is running the wrong type: Express-priced volume on a Standard machine, as the anti-pattern row shows.

Interview & exam questions

Q1. When would you choose Standard over Express, and why is the choice important? Standard for workflows needing exactly-once semantics, durable queryable history, long duration (up to a year), or waitForTaskToken/.sync — i.e. orchestration with non-replayable side effects. Express for high-volume, short, idempotent processing. It matters because the type is immutable after creation and because Express’s at-least-once semantics will double-apply non-idempotent side effects. (SAA-C03, DVA-C02)

Q2. Why must every Task in an Express workflow be idempotent? Express guarantees at-least-once, so the engine can run a state more than once on internal retry. A non-idempotent Task (charge a card, increment a counter) will eventually execute twice. You make it idempotent with a stable key — typically derived from $$.Execution.Name. (DVA-C02)

Q3. What is the difference between inline Map and Distributed Map? Inline Map runs iterations inside the parent execution, capped at 40 concurrent and sharing one 256 KB payload — good for dozens of items. Distributed Map runs each item/batch as its own child execution (own 256 KB, own history), scaling to 10,000 concurrent over millions of items read from S3. (SAA-C03, DOP-C02)

Q4. How do you stop a Distributed Map from overwhelming a rate-limited downstream? Pin MaxConcurrency to the downstream’s tested safe limit (not 0/unlimited), use ItemBatcher to reduce invocation count, and raise concurrency only while watching the downstream’s saturation metric. (DOP-C02)

Q5. What does JitterStrategy: FULL solve? The thundering herd: without jitter, many executions back off by identical intervals and retry in lockstep, re-hammering a recovering service. FULL randomizes each retry across its backoff window, smearing the load. Always pair with MaxDelaySeconds so exponential growth does not balloon to hours. (DOP-C02)

Q6. Step Functions has no distributed transaction — how do you handle a partial failure? With the saga pattern: each forward Task has a Catch that routes to a compensation chain undoing completed work in reverse order. Every undo must be idempotent and retryable; a failed compensation routes to a DLQ and an alarm. (SAA-C03, DOP-C02)

Q7. What are the three service-integration patterns and when do you use each? Request/Response (call and continue, fire-and-forget); .sync (run a job — ECS/Glue/nested SM — and wait without polling); .waitForTaskToken (pause until an external SendTaskSuccess, for human approval or async webhooks). (DVA-C02)

Q8. Why set HeartbeatSeconds on a waitForTaskToken Task? Without it, a worker that dies silently leaves the execution paused until the (possibly year-long) TimeoutSeconds/execution limit. With heartbeats, Step Functions fails the Task promptly when they stop, so your Catch can compensate or alert. (DVA-C02)

Q9. How do you debug a failed Express execution? Express keeps no durable history, so you must enable loggingConfiguration (ALL/ERROR) and ideally X-Ray. Without logs, an Express failure is near-opaque. For Standard, get-execution-history replays every event. (DOP-C02)

Q10. Why prefer optimized SDK integrations over Lambda? They run inside the target service (dynamodb:putItem, sns:publish), so you pay no Lambda invocation, no cold start, and maintain no code. Use Lambda only for genuine business logic, not for shuttling a value into another service. (DVA-C02)

Q11. What CloudWatch metrics do you alarm on for a state machine? ExecutionsFailed (hard failures, page on a sustained rate), ExecutionsTimedOut (stuck callback/slow downstream), and ExecutionThrottled (exceeding StartExecution/transition quotas — back off or raise the quota). (DOP-C02)

Q12. How do you triage a Distributed Map run that came back 98% green? Use the Map Run view in the console: it aggregates child-execution success/failure counts and links straight to the failed children, so you can open the specific failures rather than scanning thousands of green ones. (DOP-C02)

Quick check

  1. You need a workflow that pauses for a two-day human approval and must record an exact audit trail. Standard or Express, and why?
  2. An inline Map over 80,000 S3 objects keeps failing with States.DataLimitExceeded. What is the fix?
  3. A retrier uses IntervalSeconds: 1, BackoffRate: 2.0, MaxAttempts: 8 and no MaxDelaySeconds. What is the risk?
  4. Your saga charged a card, then shipment creation failed. What pattern recovers consistency, and what property must the refund Task have?
  5. An Express workflow is failing in production and you “can’t see anything.” What did you forget to enable?

Answers

  1. Standard — only Standard offers durable history, exactly-once semantics, and the long-duration waitForTaskToken needed for a multi-day human approval; the pause is free (only transitions bill).
  2. Switch from inline Map to Distributed Map with an S3 ItemReader — each item/batch becomes its own child execution with its own 256 KB budget, so the full object list never has to fit in the parent’s payload.
  3. The interval doubles on every retry (1, 2, 4, 8, 16, 32, 64, 128 s): the eighth retry alone waits over two minutes and the sequence adds up to 255 s, and a higher MaxAttempts or BackoffRate pushes single waits into hours. MaxDelaySeconds caps that growth; without JitterStrategy: FULL the retries also land in lockstep and re-hammer the downstream.
  4. The saga pattern: Catch into a reverse compensation chain (refund, then release inventory). The refund Task must be idempotent (keyed on $$.Execution.Name) so a retried compensation refunds exactly once.
  5. CloudWatch logging (loggingConfiguration at ALL/ERROR) — Express keeps no durable execution history, so without logs (and ideally X-Ray) the failure is near-opaque.

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
