A Lambda function that calls three other services is not a workflow — it is a distributed monolith with a 15-minute timeout and no audit trail. The moment a business process spans retries, branches, human approval, or thousands of parallel items, you want an orchestrator that owns the state so your code does not have to. Step Functions is that orchestrator, but it is also a place where teams quietly burn money on the wrong workflow type, melt downstream services with unbounded fan-out, and write Retry blocks that re-amplify the exact outage they were meant to absorb. This is how I design Step Functions workflows that are durable, that scale cleanly, and that fail in ways an on-call engineer can actually reason about.
Assume a recent CLI (aws --version >= 2.x), the Amazon States Language (ASL), and IAM roles already scoped per state machine.
1. Standard vs Express: pick by durability, not habit
The first decision is the workflow type, and it is irreversible after creation — you cannot flip a state machine between Standard and Express, you create a new one. They share ASL but differ in their execution guarantees, duration limits, and billing model.
| Property | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution semantics | Exactly-once | At-least-once |
| Execution history | Durable, queryable for 90 days | Sent to CloudWatch Logs only |
| Pricing model | Per state transition ($0.000025 each, us-east-1) | Per request + GB-second of duration |
| Throughput | Up to thousands of starts/sec | Effectively unbounded, very high rates |
| waitForTaskToken / human approval | Yes | No |
The pricing models invert depending on workload shape. Standard bills $0.000025 per state transition, so a workflow with 10 states costs $0.00025 per execution regardless of how long it waits — a 6-hour wait for an approval costs nothing extra. Express bills $1.00 per million executions plus $0.00001667 per GB-second of duration; a short, hot, high-volume workflow that finishes in 200ms is dramatically cheaper there, while a long-running or sparse one is cheaper on Standard.
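To make the crossover concrete, here is a rough cost model built only from the list prices quoted above. It is a sketch, not a billing calculator: it assumes no free tier, no tiered Express discounts, and no rounding of duration. The 64 MB memory figure is an illustrative assumption.

```python
# Rough per-execution cost model using the us-east-1 list prices quoted above.
# Assumptions: no free tier, no tiered discounts, duration billed without rounding.

STANDARD_PER_TRANSITION = 0.000025      # USD per state transition
EXPRESS_PER_REQUEST = 1.00 / 1_000_000  # USD per execution
EXPRESS_PER_GB_SECOND = 0.00001667      # USD per GB-second of duration


def standard_cost(transitions: int) -> float:
    """Standard cost depends only on transitions, never on wall-clock time."""
    return transitions * STANDARD_PER_TRANSITION


def express_cost(duration_seconds: float, memory_gb: float = 0.064) -> float:
    """Express cost is one request charge plus a memory x duration charge."""
    return EXPRESS_PER_REQUEST + duration_seconds * memory_gb * EXPRESS_PER_GB_SECOND


if __name__ == "__main__":
    executions = 1_000_000
    print(f"Standard, 10 transitions: ${standard_cost(10) * executions:,.2f} per million")
    print(f"Express, 200 ms at 64 MB: ${express_cost(0.2) * executions:,.2f} per million")
```

Under these assumptions, a 10-transition workflow costs about $250 per million executions on Standard and a little over $1 per million on Express at 200 ms. The gap narrows as Express duration and memory grow.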
Mental model: Standard is a durable state machine you query later; Express is a streaming transform you fire and forget. Use Standard for orchestration with side effects you cannot replay; use Express for high-volume, idempotent event processing.
The trap is at-least-once on Express. Express can run a state more than once on internal retry, so every Task it invokes must be idempotent. If an Express workflow charges a credit card or increments a counter without an idempotency key, you will eventually double-charge. A nested pattern is common and correct: a Standard parent that orchestrates the durable, exactly-once business steps, invoking Express child workflows (via startExecution.sync) for the hot inner loops.
2. State machine design: the core state types
ASL is small. Five groups of state types carry almost every real workflow.
- Task — does work: invokes a Lambda, an SDK action, or another state machine. The only state with side effects.
- Choice — branches on input using comparison rules. Your routing logic.
- Parallel — runs a fixed set of branches concurrently, joins on all. Use when you have N known, distinct sub-workflows.
- Map — runs the same sub-workflow over each element of an array. Use for a variable-length collection of homogeneous items.
- Pass / Wait / Succeed / Fail — shape data, sleep, and terminate.
A common mistake is reaching for Parallel when you mean Map. Parallel is for “do these three different things at once” (validate, enrich, score). Map is for “do this one thing to each of these items.” Below, a Choice routes by order value, and a Map (inline mode) processes line items with bounded concurrency.
{
"Comment": "Order processing",
"StartAt": "RouteByValue",
"States": {
"RouteByValue": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.order.totalUsd",
"NumericGreaterThan": 10000,
"Next": "ManualReview"
}
],
"Default": "ProcessLineItems"
},
"ManualReview": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"Parameters": {
"FunctionName": "request-approval",
"Payload": {
"orderId.$": "$.order.id",
"taskToken.$": "$$.Task.Token"
}
},
"ResultPath": "$.approval",
"Next": "ProcessLineItems"
},
"ProcessLineItems": {
"Type": "Map",
"ItemsPath": "$.order.lineItems",
"MaxConcurrency": 5,
"ItemProcessor": {
"ProcessorConfig": { "Mode": "INLINE" },
"StartAt": "Fulfil",
"States": {
"Fulfil": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "fulfil-line-item", "Payload.$": "$" },
"End": true
}
}
},
"End": true
}
}
}
Note $$ — the context object, distinct from $ (state input). $$.Task.Token is how a Task hands its callback token to an external system. $$.Execution.Name and $$.State.EnteredTime are invaluable for idempotency keys and logging.
3. Distributed Map: fan-out over S3 with real concurrency control
Inline Map runs inside the parent execution and is capped at 40 concurrent iterations, and the whole thing shares one 256 KB state payload. That is fine for dozens of items. For tens of thousands — every object under an S3 prefix, every row of a large CSV — you need Distributed mode, which is a different execution model: each iteration (or batch) becomes its own child workflow execution with its own history and its own 256 KB budget. Distributed Map scales to up to 10,000 parallel child executions and can iterate datasets of millions of items.
Set Mode to DISTRIBUTED, point ItemReader at an S3 source, and you get three controls that matter at scale: MaxConcurrency (how hard you hit downstream), ItemBatcher (amortize per-invocation overhead), and ToleratedFailurePercentage (do not fail 9,999 good items because 1 was malformed).
{
"Type": "Map",
"ItemProcessor": {
"ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
"StartAt": "Transform",
"States": {
"Transform": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "transform-batch", "Payload.$": "$" },
"End": true
}
}
},
"ItemReader": {
"Resource": "arn:aws:states:::s3:listObjectsV2",
"Parameters": { "Bucket": "raw-events-prod", "Prefix": "2026/06/" }
},
"ItemBatcher": {
"MaxItemsPerBatch": 100,
"MaxInputBytesPerBatch": 262144
},
"MaxConcurrency": 500,
"ToleratedFailurePercentage": 2,
"ResultWriter": {
"Resource": "arn:aws:states:::s3:putObject",
"Parameters": { "Bucket": "map-results-prod", "Prefix": "runs/" }
},
"End": true
}
Several decisions are load-bearing here:
- ExecutionType: EXPRESS for the child workflows is the right choice for high-volume, idempotent item processing, and it is far cheaper per item than Standard children. The field has no default in Distributed mode, so set it explicitly. Use Standard children only when an individual item needs a long-running or human-in-the-loop step.
- MaxConcurrency: 500 is a throttle on your blast radius. With a Lambda transform you are bounded by Lambda's account concurrency; with a database or third-party API behind it, this number is the difference between steady throughput and a self-inflicted outage. Start conservative and raise it while watching the downstream's saturation metrics, not the Step Functions console.
- ItemBatcher turns 50,000 single-item invocations into 500 batches of 100. That cuts the number of invocations by two orders of magnitude, and with it the per-invocation overhead, but your Lambda must now loop over $.Items and, critically, report partial batch failure rather than failing the whole batch on one bad record.
- ResultWriter persists per-item results to S3. Without it, large outputs blow the 256 KB limit; with it, you get a manifest you can audit and reprocess.
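A batched handler might look roughly like the sketch below. The transform function and the succeeded/failed result shape are hypothetical conventions, not something Step Functions prescribes. The only part taken from the service is that a batch arrives as an Items array on the event.

```python
# Sketch of a Lambda handler for an ItemBatcher batch. `transform` and the
# result shape are illustrative assumptions.
from typing import Any, Dict


def transform(item: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical per-item work; raises ValueError on a malformed record."""
    if "Key" not in item:
        raise ValueError("item has no Key")
    return {"key": item["Key"], "status": "ok"}


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Process every item in the batch and report failures per item."""
    succeeded, failed = [], []
    for item in event.get("Items", []):
        try:
            succeeded.append(transform(item))
        except ValueError as exc:
            # One bad record is recorded, not raised, so the batch continues.
            failed.append({"item": item, "error": str(exc)})
    return {"succeeded": succeeded, "failed": failed}
```

One consequence of this convention: a batch that returns a non-empty failed list still counts as a successful child execution, so whatever consumes the results has to read that list rather than rely on the Map Run's failure count alone.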
Distributed Map also needs IAM permission to start its own child executions and to read/write S3 — states:StartExecution, s3:GetObject, s3:ListBucket, and s3:PutObject on the relevant resources. This is the most common reason a freshly built Distributed Map fails on its first run.
4. Error handling: Retry, Catch, and backoff with jitter
A Task without a Retry block fails the whole execution on the first transient blip. The fix is not “retry everything forever” — it is to retry the retryable errors with bounded, jittered backoff, and to Catch the rest into a handler.
Retry matches on error names and applies exponential backoff. The fields that matter:
- ErrorEquals: which errors this rule catches. States.TaskFailed, Lambda.TooManyRequestsException, or your own thrown error names. States.ALL is a catch-all; never combine it with specific rules in the same retrier.
- IntervalSeconds / BackoffRate / MaxAttempts: first delay, multiplier, and maximum number of retries.
- MaxDelaySeconds: caps how large any single interval can grow. Without it, exponential backoff can balloon to hours.
- JitterStrategy: FULL randomizes each interval; the default is NONE. This is not optional at scale.
The thundering-herd problem is concrete: if a downstream API returns 429 to 2,000 concurrent executions and they all back off by exactly 2s, 4s, 8s, they retry in lockstep and re-hammer the recovering service at the same instants. JitterStrategy: FULL spreads each retry randomly across its backoff window, smearing the load.
"CallPaymentApi": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "charge-card", "Payload.$": "$" },
"Retry": [
{
"ErrorEquals": ["Lambda.TooManyRequestsException", "PaymentApi.RateLimited"],
"IntervalSeconds": 1,
"BackoffRate": 2.0,
"MaxAttempts": 6,
"MaxDelaySeconds": 20,
"JitterStrategy": "FULL"
},
{
"ErrorEquals": ["States.Timeout"],
"IntervalSeconds": 2,
"MaxAttempts": 2
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"ResultPath": "$.error",
"Next": "CompensateCharge"
}
],
"Next": "ConfirmOrder"
}
Two details people miss. First, retriers are evaluated in order, and each rule has its own counter — so split rate-limit retries (aggressive, many attempts) from timeout retries (cautious, few). Second, Catch uses ResultPath: "$.error" to merge the error into the existing input rather than replacing it, so the handler still has the order context. Set TimeoutSeconds on every Task that calls something external; a Task with no timeout can hang until the execution-level limit, and on Standard that limit is a year.
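The effect of jitter on the retry schedule can be sketched in a few lines. This is an approximation for building intuition, not the service's exact algorithm: it assumes FULL jitter draws uniformly between zero and the capped exponential delay.

```python
# Approximate model of a retrier's wait schedule. Assumption: FULL jitter picks
# a uniform random delay in [0, capped_delay]; the service's exact draw may differ.
import random
from typing import List, Optional


def retry_delays(interval: float, backoff_rate: float, max_attempts: int,
                 max_delay: Optional[float] = None, jitter: str = "NONE",
                 rng: Optional[random.Random] = None) -> List[float]:
    """Return the wait before each retry attempt, in seconds."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        delay = interval * (backoff_rate ** attempt)  # exponential growth
        if max_delay is not None:
            delay = min(delay, max_delay)             # MaxDelaySeconds cap
        if jitter == "FULL":
            delay = rng.uniform(0, delay)             # spread across the window
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print("no jitter:", retry_delays(1, 2.0, 6, max_delay=20))
    print("full jitter:", retry_delays(1, 2.0, 6, max_delay=20, jitter="FULL"))
```

Without jitter, every execution configured like the rate-limit retrier above waits 1, 2, 4, 8, 16, then 20 seconds, so they all retry together. With jitter, each execution lands somewhere different inside the same windows.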
5. Compensation and the saga pattern
Step Functions has no distributed transaction. When step 3 of 5 fails after steps 1 and 2 committed real side effects, you cannot roll back — you must compensate, running an inverse action for each completed step. That is the saga pattern, and Step Functions expresses it naturally because the workflow already knows exactly how far it got.
The structure: each forward Task has a Catch that routes to a compensation chain, and the chain undoes completed work in reverse order. Reserve inventory -> charge card -> create shipment; if shipment creation fails, refund the card, then release the inventory.
"CreateShipment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "create-shipment", "Payload.$": "$" },
"Catch": [
{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RefundCharge" }
],
"Next": "OrderComplete"
},
"RefundCharge": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "refund-charge",
"Payload": { "chargeId.$": "$.chargeId", "idempotencyKey.$": "$$.Execution.Name" }
},
"ResultPath": "$.refund",
"Next": "ReleaseInventory"
},
"ReleaseInventory": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "release-inventory", "Payload.$": "$.reservation" },
"Next": "OrderFailed"
},
"OrderFailed": { "Type": "Fail", "Error": "OrderFailed", "Cause": "Compensated after shipment failure" }
Compensation actions must themselves be idempotent and retryable — a refund that runs twice must refund once, hence the idempotencyKey derived from the execution name (which is unique and stable for the run). Compensation that fails is the worst case; give compensation Tasks their own Retry and route a final failure to an alarm and a dead-letter store for human cleanup. A saga is only as reliable as its weakest undo.
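Stripped of ASL, the control flow of a saga is small enough to sketch directly. The code below is a language-level illustration of the pattern, not how Step Functions implements it; the step names and the run_saga helper are hypothetical.

```python
# Minimal saga runner: forward steps in order, undo completed steps in reverse
# when one fails. Names are illustrative only.
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, undo)


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order and return an ordered log of what happened."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate only what actually committed, newest first.
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

The failed step itself is never compensated, only the steps that completed before it. That matches the ASL above, where a CreateShipment failure routes to the refund and then the inventory release.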
6. Optimized integrations and the callback (waitForTaskToken) pattern
Step Functions has three integration patterns, and the difference is real money and latency.
- Request/Response (default) — call the service, move on immediately. For fire-and-forget actions.
- .sync: call the service and wait for the underlying job to finish (an ECS task, a Glue job, a nested execution) without you polling. Step Functions watches for you.
- .waitForTaskToken: pause the execution and resume only when an external system calls SendTaskSuccess / SendTaskFailure with the token.
Prefer optimized SDK integrations (arn:aws:states:::dynamodb:putItem) over wrapping every call in a Lambda. They run inside the service, so you pay no Lambda invocation, no cold start, and no code to maintain. Use Lambda only for genuine business logic, not for shuttling a value into DynamoDB.
The callback pattern is how you model anything asynchronous or human-driven — an approval, a third-party webhook, a long external job. The execution sits paused (free, on Standard, for up to a year) holding a token; the external actor completes it later:
# External system resumes the paused execution
aws stepfunctions send-task-success \
--task-token "$TASK_TOKEN" \
--task-output '{"approved": true, "approver": "vinod"}'
Always set HeartbeatSeconds on a waitForTaskToken Task and have the worker call SendTaskHeartbeat. Without a heartbeat, a worker that dies silently leaves the execution paused until the (possibly year-long) timeout. With one, Step Functions fails the Task promptly when heartbeats stop, and your Catch can compensate or alert.
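A worker that honors the heartbeat contract might be shaped like this sketch. The run_with_heartbeat helper and the idea of splitting work into chunks are assumptions for illustration. The client is passed in and is expected to be a boto3 Step Functions client, whose send_task_heartbeat, send_task_success, and send_task_failure calls are used here.

```python
# Sketch of a callback worker. `sfn` is expected to be
# boto3.client("stepfunctions"); it is injected so the logic can be tested.
import json
from typing import Any, Callable, Iterable


def run_with_heartbeat(sfn: Any, task_token: str,
                       work_chunks: Iterable[Callable[[], Any]]) -> None:
    """Run work in chunks, heartbeating after each, then report the outcome."""
    results = []
    try:
        for chunk in work_chunks:
            results.append(chunk())
            # Each chunk must finish well inside the Task's HeartbeatSeconds.
            sfn.send_task_heartbeat(taskToken=task_token)
    except Exception as exc:
        sfn.send_task_failure(taskToken=task_token,
                              error=type(exc).__name__, cause=str(exc))
        return
    sfn.send_task_success(taskToken=task_token,
                          output=json.dumps({"results": results}))
```

The design choice worth copying is that failure is reported explicitly. A worker that only ever calls SendTaskSuccess leaves the workflow to discover every failure by heartbeat timeout, which is slower and carries no error detail.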
7. Observability: history, X-Ray, and the metrics that matter
Standard workflows keep a full, durable execution history — every state entry/exit, input, output, and error — queryable for 90 days. This is the single best debugging artifact in serverless; get-execution-history reconstructs exactly what happened, in order.
# Replay what actually happened, newest event detail first
aws stepfunctions get-execution-history \
--execution-arn "$EXEC_ARN" \
--reverse-order \
--query 'events[?contains(type, `Failed`)].[type, taskFailedEventDetails.error, taskFailedEventDetails.cause]' \
--output table
Enable X-Ray on the state machine (tracingConfiguration.enabled = true) to get an end-to-end trace across the workflow and every downstream it calls — the fastest way to find the one Task adding 4 seconds of tail latency. For Express workflows, which have no durable history, you must enable CloudWatch Logs (loggingConfiguration at ALL or ERROR); without logs an Express failure is nearly opaque.
The CloudWatch metrics I alarm on:
| Metric | Why it matters |
|---|---|
| ExecutionsFailed | Hard failures: page on a sustained nonzero rate. |
| ExecutionsTimedOut | Workflows hitting their timeout, usually a stuck callback or slow downstream. |
| ExecutionThrottled | You are exceeding StartExecution / state-transition quotas; back off or request a limit increase. |
| ExecutionTime (p99) | Latency regressions and creeping Wait/retry inflation. |
For Distributed Map specifically, the Map Run in the console aggregates child-execution success/failure counts and links straight to failed children — that view is where you triage a fan-out that came back 98% green.
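As one way to turn "page on a sustained nonzero rate" into configuration, the sketch below builds arguments for CloudWatch's put_metric_alarm. The helper name, the five-minute period, and the three-period window are assumptions to tune, not recommendations from the service.

```python
# Builds keyword arguments for a CloudWatch alarm on failed executions.
# Period and evaluation window are illustrative; adjust to your paging policy.
from typing import Any, Dict


def failed_executions_alarm(state_machine_arn: str, alarm_name: str) -> Dict[str, Any]:
    """Return kwargs suitable for cloudwatch.put_metric_alarm(**kwargs)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,               # seconds per datapoint
        "EvaluationPeriods": 3,      # three consecutive periods = "sustained"
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no executions is not an incident
    }
```

With boto3 installed, this would be applied as `boto3.client("cloudwatch").put_metric_alarm(**failed_executions_alarm(arn, "order-workflow-failed"))`, where the alarm name is a placeholder.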
Verify
Confirm the workflow before you trust it with production traffic.
# 1. Statically validate the ASL definition before deploying (no resources created)
aws stepfunctions validate-state-machine-definition \
--definition file://order-workflow.asl.json \
--query '{result:result,diagnostics:diagnostics}'
# 2. Start a real execution and capture its ARN
EXEC_ARN=$(aws stepfunctions start-execution \
--state-machine-arn "$SM_ARN" \
--input '{"order":{"id":"o-123","totalUsd":250,"lineItems":[{"sku":"a"}]}}' \
--query 'executionArn' --output text)
# 3. Poll to terminal status
aws stepfunctions describe-execution --execution-arn "$EXEC_ARN" \
--query '{status:status,error:error,cause:cause}'
# 4. Confirm retry/jitter actually engaged (look for TaskFailed -> retry waits in history)
aws stepfunctions get-execution-history --execution-arn "$EXEC_ARN" \
--query 'events[?type==`TaskScheduled` || type==`TaskFailed`].type'
Then deliberately break a downstream in a non-prod copy and confirm two things: the Catch routes into the compensation chain, and the compensation runs every undo in order. A saga whose compensation path you have never exercised is not a safety net — it is untested code on your worst day.
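Step 3 above describes the execution once. In a test harness you usually want to block until it reaches a terminal state. A minimal polling sketch follows, assuming a boto3 Step Functions client is passed in and that the workflow is Standard, since describe_execution applies to Standard executions. The helper name and defaults are hypothetical.

```python
# Poll a Standard execution until it reaches a terminal status.
# `sfn` is expected to be boto3.client("stepfunctions").
import time
from typing import Any

TERMINAL = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(sfn: Any, execution_arn: str,
                       poll_seconds: float = 2.0, max_polls: int = 150) -> str:
    """Return the terminal status, or raise TimeoutError if polling runs out."""
    for _ in range(max_polls):
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{execution_arn} still running after {max_polls} polls")
```

A bounded loop matters here for the same reason TimeoutSeconds matters on a Task: a verification script that waits forever on a stuck execution hides exactly the failure it was written to catch.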
Enterprise scenario
A media company ran a nightly pipeline that transcoded every asset uploaded that day — typically 60,000 objects under an S3 prefix — into three renditions each. The original design was an inline Map that read the object list into the parent execution and fanned out. It worked at a few thousand items and then wedged: the parent execution’s 256 KB state payload overflowed on the object list well before they reached peak volume, and the inline 40-concurrency cap meant the few runs that did start took most of the night.
The constraint was hard: the full batch had to finish inside a 6-hour window before downstream publishing began, a handful of corrupt source files were expected nightly and must not fail the whole run, and the transcoder was a rate-limited internal service that fell over above ~400 concurrent jobs.
They moved to Distributed Map reading the prefix via s3:listObjectsV2, with ExecutionType: EXPRESS children, MaxConcurrency pinned to 400 to respect the transcoder, ItemBatcher of 20 to amortize invocation cost, and ToleratedFailurePercentage: 1 so a few bad files were quarantined rather than fatal. ResultWriter wrote a per-item manifest to S3 that the publishing stage consumed directly, and a CloudWatch alarm on the Map Run’s failed-child count caught the rare night when corruption spiked past tolerance.
"TranscodeAll": {
"Type": "Map",
"ItemReader": {
"Resource": "arn:aws:states:::s3:listObjectsV2",
"Parameters": { "Bucket.$": "$.bucket", "Prefix.$": "$.todayPrefix" }
},
"ItemProcessor": {
"ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
"StartAt": "Transcode",
"States": {
"Transcode": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": { "FunctionName": "transcode-batch", "Payload.$": "$" },
"Retry": [
{ "ErrorEquals": ["Transcoder.Throttled"], "IntervalSeconds": 2,
"BackoffRate": 2.0, "MaxAttempts": 5, "MaxDelaySeconds": 30, "JitterStrategy": "FULL" }
],
"End": true
}
}
},
"ItemBatcher": { "MaxItemsPerBatch": 20 },
"MaxConcurrency": 400,
"ToleratedFailurePercentage": 1,
"ResultWriter": {
"Resource": "arn:aws:states:::s3:putObject",
"Parameters": { "Bucket.$": "$.resultsBucket", "Prefix": "transcode-runs/" }
},
"End": true
}
The pipeline now finishes a 60,000-object night in under two hours, the transcoder stays under its concurrency ceiling, and corrupt files land in a results manifest for morning review instead of failing the batch. No new infrastructure — just the right Map mode, an honest concurrency cap, and jittered retries against the one service that could not be rushed.
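The sizing behind those numbers is simple arithmetic, sketched here. The item count, batch size, and concurrency come from the scenario; the per-batch duration is an assumption, set to 300 seconds because that is the Express duration ceiling, and the model ignores retries and scheduling overhead.

```python
# Back-of-envelope fan-out sizing. seconds_per_batch is an assumed figure,
# not a measurement from the scenario.
import math


def fan_out_estimate(items: int, batch_size: int, max_concurrency: int,
                     seconds_per_batch: float) -> dict:
    """Estimate batches, concurrency-limited waves, and rough wall-clock time."""
    batches = math.ceil(items / batch_size)
    waves = math.ceil(batches / max_concurrency)  # full rounds at the cap
    return {
        "batches": batches,
        "waves": waves,
        "estimated_seconds": waves * seconds_per_batch,
    }


if __name__ == "__main__":
    print(fan_out_estimate(60_000, 20, 400, 300))
```

For the scenario's figures this gives 3,000 batches in 8 waves, or about 40 minutes if every batch ran for the full five minutes. That leaves headroom for jittered retries within the reported two-hour finish.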