Quick take — A reusable Terraform module for aws_sfn_state_machine that wires up Standard/Express workflows with CloudWatch logging, X-Ray tracing, an execution IAM role, and safe publish-and-version semantics. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "step_functions" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"
name = "..." # State machine name; prefix for the IAM role and log gro…
definition = "..." # Amazon States Language JSON (validated with `jsondecode…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
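As a rough illustration of what the two placeholders might hold, the sketch below fills them with a single-state workflow. The name hello-world and the state name Hello are made up for this example; the Pass state and its fields are standard Amazon States Language.

```hcl
module "step_functions" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  # Hypothetical name, chosen only for this example.
  name = "hello-world"

  # One Pass state that returns a fixed result and ends the execution.
  definition = jsonencode({
    Comment = "Single-state smoke test"
    StartAt = "Hello"
    States = {
      Hello = {
        Type   = "Pass"
        Result = { message = "hello" }
        End    = true
      }
    }
  })
}
```

Because a Pass state calls no downstream service, this variant needs no policy_statements, which makes it a convenient smoke test before wiring in real tasks.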
What this module is
AWS Step Functions is a serverless orchestrator: you describe a workflow as a state machine in Amazon States Language (ASL), and the service runs it state-by-state, handling retries, error catching, parallelism, timeouts, and human-approval waits without you running any coordination code. The core Terraform resource is aws_sfn_state_machine, and on its own it hides a lot of sharp edges — the state machine is useless without an IAM role that lets it call the downstream services (Lambda, ECS, SNS, DynamoDB, SQS), and in production you almost always want CloudWatch Logs delivery, X-Ray tracing, and the STANDARD vs EXPRESS decision made deliberately rather than by accident.
This module wraps aws_sfn_state_machine together with the three things teams forget the first time: a scoped execution role (aws_iam_role + inline policy), a dedicated CloudWatch log group with a retention policy, and tracing_configuration / logging_configuration blocks driven by variables. It also exposes publish and version_description so that each definition change yields an immutable, addressable version ARN, which is what you point an alias or an EventBridge rule at. You hand it a definition (your rendered ASL JSON) and a list of IAM statements; it returns the state machine ARN, the published version ARN, and the role ARN.
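Pointing an alias at the published version could look like the sketch below. It assumes the module is instantiated as module.step_functions with publish = true, and the alias name live is hypothetical. aws_sfn_alias is a separate AWS provider resource and is not part of this module.

```hcl
# Alias that routes all traffic to the version produced by the latest apply.
resource "aws_sfn_alias" "live" {
  name = "live" # hypothetical alias name

  routing_configuration {
    state_machine_version_arn = module.step_functions.state_machine_version_arn
    weight                    = 100
  }
}
```

Callers that start executions through the alias ARN follow whichever version the alias currently routes to, so a rollback becomes a change to the alias rather than to the callers.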
When to use it
Reach for this module when you have more than one Step Functions workflow and you are tired of copy-pasting the same log group, role, and tracing boilerplate per workflow. Concretely:
- Multi-workflow estates — order processing, ETL orchestration, nightly batch, approval flows — where each needs identical observability but a different ASL definition and a different downstream permission set.
- Express workflows fronting high-volume events (streaming ingest, IoT, API fan-out) where you need the EXPRESS type plus log_level = "ALL" to CloudWatch, because Step Functions keeps no execution history for Express workflows and the logs are the only record of each run.
- Workflows that must be auditable and versioned: you want publish = true so every definition change yields a new version ARN you can pin EventBridge rules or aliases to and roll back to.
- Regulated environments that mandate encryption-at-rest with a customer-managed KMS key on the state machine definition and on the logs.
If you only ever have a single throwaway workflow, inline aws_sfn_state_machine is fine — the module earns its keep at scale and under audit.
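For the Express case in the list above, a call might look like the following sketch. The module name clickstream_enrich and aws_lambda_function.enrich are hypothetical and assumed to be defined elsewhere in your configuration; only the inputs come from this module.

```hcl
module "clickstream_enrich" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name = "clickstream-enrich" # hypothetical name
  type = "EXPRESS"

  # Express runs are only observable through CloudWatch Logs.
  log_level              = "ALL"
  include_execution_data = false
  log_retention_in_days  = 14

  definition = jsonencode({
    StartAt = "Enrich"
    States = {
      Enrich = {
        Type     = "Task"
        Resource = aws_lambda_function.enrich.arn # hypothetical Lambda
        End      = true
      }
    }
  })

  policy_statements = [
    {
      sid       = "InvokeEnrich"
      actions   = ["lambda:InvokeFunction"]
      resources = [aws_lambda_function.enrich.arn]
    },
  ]
}
```

Setting log_level explicitly documents the intent, even though the module would upgrade OFF to ALL for this workflow type anyway.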
Module structure
terraform-module-aws-step-functions/
├── versions.tf # provider + Terraform version pins
├── main.tf # log group, IAM role + policy, state machine
├── variables.tf # var-driven inputs with validations
└── outputs.tf # ARNs, name, version ARN, role ARN
# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# main.tf
locals {
# Step Functions keeps no execution history for Express workflows, so
# CloudWatch Logs is their only record: OFF is always upgraded to ALL for EXPRESS.
effective_log_level = var.type == "EXPRESS" && var.log_level == "OFF" ? "ALL" : var.log_level
log_group_name = coalesce(
var.log_group_name,
"/aws/vendedlogs/states/${var.name}"
)
tags = merge(
var.tags,
{
"ManagedBy" = "Terraform"
"Module" = "terraform-module-aws-step-functions"
}
)
}
# Dedicated log group. AWS recommends the /aws/vendedlogs/ prefix so that the
# CloudWatch Logs resource policy granting log delivery stays within its size limit.
resource "aws_cloudwatch_log_group" "this" {
name = local.log_group_name
retention_in_days = var.log_retention_in_days
kms_key_id = var.logs_kms_key_arn
tags = local.tags
}
data "aws_iam_policy_document" "assume" {
statement {
sid = "StepFunctionsAssume"
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["states.amazonaws.com"]
}
}
}
resource "aws_iam_role" "this" {
name = "${var.name}-sfn-exec"
assume_role_policy = data.aws_iam_policy_document.assume.json
permissions_boundary = var.permissions_boundary_arn
tags = local.tags
}
# Caller-supplied least-privilege statements for downstream service calls
# (lambda:InvokeFunction, dynamodb:PutItem, sns:Publish, ...).
data "aws_iam_policy_document" "exec" {
dynamic "statement" {
for_each = var.policy_statements
content {
sid = statement.value.sid
effect = statement.value.effect
actions = statement.value.actions
resources = statement.value.resources
}
}
}
resource "aws_iam_role_policy" "exec" {
count = length(var.policy_statements) > 0 ? 1 : 0
name = "${var.name}-sfn-exec"
role = aws_iam_role.this.id
policy = data.aws_iam_policy_document.exec.json
}
# Permissions the state machine needs to deliver its own logs and X-Ray traces.
data "aws_iam_policy_document" "observability" {
statement {
sid = "CloudWatchLogsDelivery"
effect = "Allow"
actions = [
"logs:CreateLogDelivery",
"logs:GetLogDelivery",
"logs:UpdateLogDelivery",
"logs:DeleteLogDelivery",
"logs:ListLogDeliveries",
"logs:PutResourcePolicy",
"logs:DescribeResourcePolicies",
"logs:DescribeLogGroups",
]
resources = ["*"]
}
dynamic "statement" {
for_each = var.enable_tracing ? [1] : []
content {
sid = "XRayTracing"
effect = "Allow"
actions = [
"xray:PutTraceSegments",
"xray:PutTelemetryRecords",
"xray:GetSamplingRules",
"xray:GetSamplingTargets",
]
resources = ["*"]
}
}
}
resource "aws_iam_role_policy" "observability" {
name = "${var.name}-sfn-observability"
role = aws_iam_role.this.id
policy = data.aws_iam_policy_document.observability.json
}
resource "aws_sfn_state_machine" "this" {
name = var.name
type = var.type
role_arn = aws_iam_role.this.arn
# Rendered Amazon States Language JSON supplied by the caller.
definition = var.definition
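# Publish an immutable, addressable version when the definition changes.
publish = var.publish
# Check that your pinned AWS provider release accepts version_description as an
# argument; if it only exposes it as a read-only attribute, drop this line.
version_description = var.version_description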
# Optional customer-managed key for definition encryption-at-rest.
dynamic "encryption_configuration" {
for_each = var.kms_key_id == null ? [] : [1]
content {
type = "CUSTOMER_MANAGED_KMS_KEY"
kms_key_id = var.kms_key_id
kms_data_key_reuse_period_seconds = var.kms_data_key_reuse_period_seconds
}
}
logging_configuration {
log_destination = "${aws_cloudwatch_log_group.this.arn}:*"
include_execution_data = var.include_execution_data
level = local.effective_log_level
}
tracing_configuration {
enabled = var.enable_tracing
}
tags = local.tags
depends_on = [
aws_iam_role_policy.observability,
]
}
# variables.tf
variable "name" {
description = "Name of the state machine and prefix for its IAM role and log group."
type = string
validation {
condition = can(regex("^[A-Za-z0-9_-]{1,80}$", var.name))
error_message = "name must be 1-80 chars of letters, digits, hyphen, or underscore."
}
}
variable "definition" {
description = "Amazon States Language (ASL) definition of the workflow, as a JSON string (use jsonencode() or templatefile())."
type = string
validation {
condition = can(jsondecode(var.definition))
error_message = "definition must be valid JSON (Amazon States Language)."
}
}
variable "type" {
description = "Workflow type: STANDARD (durable, exactly-once, up to 1 year) or EXPRESS (high-volume, at-least-once, up to 5 minutes)."
type = string
default = "STANDARD"
validation {
condition = contains(["STANDARD", "EXPRESS"], var.type)
error_message = "type must be STANDARD or EXPRESS."
}
}
variable "policy_statements" {
description = "Least-privilege IAM statements granting the state machine access to downstream services (Lambda, DynamoDB, SNS, etc.)."
type = list(object({
sid = string
effect = optional(string, "Allow")
actions = list(string)
resources = list(string)
}))
default = []
}
variable "publish" {
description = "Whether to publish a new immutable version of the state machine on each definition change."
type = bool
default = true
}
variable "version_description" {
description = "Description attached to the published version (e.g. a git SHA or release tag). Ignored when publish = false."
type = string
default = null
}
variable "enable_tracing" {
description = "Enable AWS X-Ray tracing for the state machine."
type = bool
default = true
}
variable "log_level" {
description = "CloudWatch Logs level: ALL, ERROR, FATAL, or OFF. EXPRESS workflows are upgraded to ALL when set to OFF."
type = string
default = "ERROR"
validation {
condition = contains(["ALL", "ERROR", "FATAL", "OFF"], var.log_level)
error_message = "log_level must be ALL, ERROR, FATAL, or OFF."
}
}
variable "include_execution_data" {
description = "Whether to include execution input/output and state payloads in the logs (disable if payloads contain sensitive data)."
type = bool
default = false
}
variable "log_group_name" {
description = "Override the CloudWatch log group name. Defaults to /aws/vendedlogs/states/<name>."
type = string
default = null
}
variable "log_retention_in_days" {
description = "Retention period in days for the workflow's CloudWatch log group."
type = number
default = 30
validation {
condition = contains(
[0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653],
var.log_retention_in_days
)
error_message = "log_retention_in_days must be a value CloudWatch Logs accepts (e.g. 1, 7, 14, 30, 90, 365, ... or 0 for never expire)."
}
}
variable "kms_key_id" {
description = "Customer-managed KMS key ARN/ID for encrypting the state machine definition at rest. Null uses the AWS-owned key."
type = string
default = null
}
variable "kms_data_key_reuse_period_seconds" {
description = "How long (60-900s) Step Functions reuses a data key before calling KMS again. Only used when kms_key_id is set."
type = number
default = 300
validation {
condition = var.kms_data_key_reuse_period_seconds >= 60 && var.kms_data_key_reuse_period_seconds <= 900
error_message = "kms_data_key_reuse_period_seconds must be between 60 and 900."
}
}
variable "logs_kms_key_arn" {
description = "Optional KMS key ARN to encrypt the CloudWatch log group."
type = string
default = null
}
variable "permissions_boundary_arn" {
description = "Optional IAM permissions boundary ARN to attach to the execution role."
type = string
default = null
}
variable "tags" {
description = "Tags applied to the state machine, IAM role, and log group."
type = map(string)
default = {}
}
# outputs.tf
output "id" {
description = "ARN of the state machine (its id)."
value = aws_sfn_state_machine.this.id
}
output "arn" {
description = "ARN of the state machine."
value = aws_sfn_state_machine.this.arn
}
output "name" {
description = "Name of the state machine."
value = aws_sfn_state_machine.this.name
}
output "state_machine_version_arn" {
description = "ARN of the published immutable version (null when publish = false)."
value = aws_sfn_state_machine.this.state_machine_version_arn
}
output "creation_date" {
description = "Date the state machine was created."
value = aws_sfn_state_machine.this.creation_date
}
output "role_arn" {
description = "ARN of the execution IAM role assumed by the state machine."
value = aws_iam_role.this.arn
}
output "role_name" {
description = "Name of the execution IAM role."
value = aws_iam_role.this.name
}
output "log_group_name" {
description = "Name of the CloudWatch log group receiving workflow logs."
value = aws_cloudwatch_log_group.this.name
}
output "log_group_arn" {
description = "ARN of the CloudWatch log group."
value = aws_cloudwatch_log_group.this.arn
}
How to use it
Here a STANDARD order-fulfilment workflow invokes two Lambdas and publishes to an SNS topic. The ASL is rendered with jsonencode, and the execution role is scoped to exactly those resources. An EventBridge rule downstream targets the published version ARN, so it always invokes the exact definition this apply produced.
module "step_functions" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"
name = "order-fulfilment"
type = "STANDARD"
enable_tracing = true
log_level = "ERROR"
include_execution_data = false
log_retention_in_days = 90
version_description = "release-2026.06.09"
definition = jsonencode({
Comment = "Validate, charge, then notify on order placement"
StartAt = "ValidateOrder"
States = {
ValidateOrder = {
Type = "Task"
Resource = aws_lambda_function.validate.arn
Retry = [{ ErrorEquals = ["States.TaskFailed"], MaxAttempts = 2, BackoffRate = 2.0, IntervalSeconds = 1 }]
Next = "ChargePayment"
}
ChargePayment = {
Type = "Task"
Resource = aws_lambda_function.charge.arn
Catch = [{ ErrorEquals = ["States.ALL"], Next = "NotifyFailure" }]
Next = "NotifySuccess"
}
NotifySuccess = {
Type = "Task"
Resource = "arn:aws:states:::sns:publish"
Parameters = {
TopicArn = aws_sns_topic.orders.arn
"Message.$" = "$.orderId"
}
End = true
}
NotifyFailure = {
Type = "Fail"
Error = "PaymentFailed"
Cause = "Payment could not be captured"
}
}
})
policy_statements = [
{
sid = "InvokeOrderLambdas"
actions = ["lambda:InvokeFunction"]
resources = [aws_lambda_function.validate.arn, aws_lambda_function.charge.arn]
},
{
sid = "PublishOrderEvents"
actions = ["sns:Publish"]
resources = [aws_sns_topic.orders.arn]
},
]
tags = {
Environment = "prod"
Team = "commerce"
}
}
# Downstream: trigger the exact published version on a schedule / event.
resource "aws_cloudwatch_event_rule" "nightly_reconcile" {
name = "order-fulfilment-nightly"
schedule_expression = "cron(0 2 * * ? *)"
}
resource "aws_cloudwatch_event_target" "to_sfn" {
rule = aws_cloudwatch_event_rule.nightly_reconcile.name
arn = module.step_functions.state_machine_version_arn
role_arn = aws_iam_role.eventbridge_invoke.arn
}
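The target above references aws_iam_role.eventbridge_invoke, which the example does not define. One way it could be written is sketched below; the role and policy names are illustrative, and the "${arn}:*" resource pattern is an assumption intended to cover published version ARNs, which append a version number to the state machine ARN.

```hcl
# Trust policy: let EventBridge assume the role.
data "aws_iam_policy_document" "eventbridge_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "eventbridge_invoke" {
  name               = "order-fulfilment-eventbridge-invoke" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.eventbridge_assume.json
}

# Permission to start executions of the state machine and its versions.
data "aws_iam_policy_document" "eventbridge_start" {
  statement {
    effect  = "Allow"
    actions = ["states:StartExecution"]
    resources = [
      module.step_functions.arn,
      "${module.step_functions.arn}:*", # assumption: matches version ARNs
    ]
  }
}

resource "aws_iam_role_policy" "eventbridge_start" {
  name   = "start-order-fulfilment" # hypothetical name
  role   = aws_iam_role.eventbridge_invoke.id
  policy = data.aws_iam_policy_document.eventbridge_start.json
}
```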
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...S3 state bucket + key per path...
}
}
2. Module config — live/prod/step_functions/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"
}
inputs = {
name = "..."
definition = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/step_functions && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)          # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
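To make the per-environment override concrete, a dev counterpart to the prod config might look like the sketch below. The path live/dev/step_functions, the file definition.asl.json, and every input value are assumptions for illustration; get_terragrunt_dir() and file() are built-in functions available in Terragrunt configs.

```hcl
# live/dev/step_functions/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"
}

inputs = {
  name = "order-fulfilment-dev" # hypothetical name

  # Hypothetical ASL file kept next to this terragrunt.hcl.
  definition = file("${get_terragrunt_dir()}/definition.asl.json")

  # Dev trades log volume for visibility and keeps logs only briefly.
  log_level             = "ALL"
  log_retention_in_days = 7

  tags = {
    Environment = "dev"
  }
}
```

Only the inputs block differs between environments; the source line stays pinned to the same module ref, so dev and prod run identical module code.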
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | — | yes | State machine name; prefix for the IAM role and log group (1-80 chars, [A-Za-z0-9_-]). |
| definition | string | — | yes | Amazon States Language JSON (validated with jsondecode); build with jsonencode() or templatefile(). |
| type | string | "STANDARD" | no | STANDARD or EXPRESS. |
| policy_statements | list(object) | [] | no | Least-privilege IAM statements for downstream service calls (sid, effect, actions, resources). |
| publish | bool | true | no | Publish a new immutable version on each definition change. |
| version_description | string | null | no | Description for the published version (e.g. git SHA / release tag). |
| enable_tracing | bool | true | no | Enable AWS X-Ray tracing. |
| log_level | string | "ERROR" | no | ALL, ERROR, FATAL, or OFF; EXPRESS is auto-upgraded from OFF to ALL. |
| include_execution_data | bool | false | no | Include input/output/state payloads in logs (leave off for sensitive data). |
| log_group_name | string | null | no | Override log group name; defaults to /aws/vendedlogs/states/<name>. |
| log_retention_in_days | number | 30 | no | CloudWatch Logs retention (must be an accepted value; 0 = never expire). |
| kms_key_id | string | null | no | Customer-managed KMS key for definition encryption at rest. |
| kms_data_key_reuse_period_seconds | number | 300 | no | KMS data-key reuse window (60-900s); only used with kms_key_id. |
| logs_kms_key_arn | string | null | no | KMS key ARN to encrypt the CloudWatch log group. |
| permissions_boundary_arn | string | null | no | IAM permissions boundary ARN for the execution role. |
| tags | map(string) | {} | no | Tags applied to all created resources. |
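The definition input accepts any JSON string, so a template file is an alternative to jsonencode(). The sketch below assumes a hypothetical template workflows/ingest.asl.json.tftpl and a hypothetical aws_lambda_function.ingest; templatefile() is a built-in Terraform function.

```hcl
# Assumed contents of workflows/ingest.asl.json.tftpl (hypothetical file):
#
# {
#   "StartAt": "Ingest",
#   "States": {
#     "Ingest": { "Type": "Task", "Resource": "${ingest_lambda_arn}", "End": true }
#   }
# }

module "ingest" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name = "ingest" # hypothetical name

  # Render the template, substituting the Lambda ARN into the ASL document.
  definition = templatefile("${path.module}/workflows/ingest.asl.json.tftpl", {
    ingest_lambda_arn = aws_lambda_function.ingest.arn
  })

  policy_statements = [
    {
      sid       = "InvokeIngest"
      actions   = ["lambda:InvokeFunction"]
      resources = [aws_lambda_function.ingest.arn]
    },
  ]
}
```

Keeping the ASL in its own file lets editors and linters treat it as JSON, while the module's jsondecode validation still catches a template that renders to invalid JSON at plan time.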
Outputs
| Name | Description |
|---|---|
| id | ARN of the state machine (its id). |
| arn | ARN of the state machine. |
| name | Name of the state machine. |
| state_machine_version_arn | ARN of the published immutable version (null when publish = false). |
| creation_date | Creation timestamp of the state machine. |
| role_arn | ARN of the execution IAM role. |
| role_name | Name of the execution IAM role. |
| log_group_name | Name of the CloudWatch log group receiving workflow logs. |
| log_group_arn | ARN of the CloudWatch log group. |
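As one example of consuming these outputs, the sketch below raises a CloudWatch alarm when any execution fails. aws_sns_topic.alerts is a hypothetical topic defined elsewhere; the AWS/States namespace, the ExecutionsFailed metric, and the StateMachineArn dimension are the standard Step Functions metric names.

```hcl
resource "aws_cloudwatch_metric_alarm" "sfn_failed" {
  alarm_name          = "${module.step_functions.name}-executions-failed"
  namespace           = "AWS/States"
  metric_name         = "ExecutionsFailed"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # No datapoints means no executions failed in the period.
  treat_missing_data = "notBreaching"

  dimensions = {
    StateMachineArn = module.step_functions.arn
  }

  alarm_actions = [aws_sns_topic.alerts.arn] # hypothetical SNS topic
}
```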
Enterprise scenario
A retail platform runs its checkout pipeline as a STANDARD workflow (validate → reserve inventory → charge → fulfil) and a separate EXPRESS workflow that handles 4,000 clickstream events/second for real-time personalisation. Both are stood up from this single module: the checkout workflow sets publish = true with version_description set to the CI release tag so the on-call team can roll back to the previous version ARN in seconds, while the Express workflow runs log_level = "ALL" with include_execution_data = false because its payloads contain PII. Audit gets the CloudWatch log groups and X-Ray service maps for free, and platform engineering maintains one module instead of two hand-rolled state machines.
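A scenario like this could be expressed with a single module block and for_each, as in the sketch below. The workflow names, the workflows/<name>.asl.json file layout, and var.release_tag are assumptions for illustration; policy_statements are omitted for brevity and would be supplied per workflow in practice.

```hcl
variable "release_tag" {
  description = "CI release tag recorded on published versions (hypothetical input)."
  type        = string
}

locals {
  # One entry per workflow; only the settings that differ are listed.
  workflows = {
    checkout = {
      type      = "STANDARD"
      log_level = "ERROR"
    }
    clickstream = {
      type      = "EXPRESS"
      log_level = "ALL"
    }
  }
}

module "workflows" {
  for_each = local.workflows
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name                   = each.key
  type                   = each.value.type
  log_level              = each.value.log_level
  include_execution_data = false
  publish                = true
  version_description    = var.release_tag

  # Hypothetical layout: workflows/checkout.asl.json, workflows/clickstream.asl.json
  definition = file("${path.module}/workflows/${each.key}.asl.json")
}

# Version ARN per workflow, keyed by workflow name, for rollback runbooks.
output "workflow_version_arns" {
  value = { for name, wf in module.workflows : name => wf.state_machine_version_arn }
}
```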
Best practices
- Right-size the type up front. STANDARD is durable and exactly-once but billed per state transition, which suits long, low-volume business workflows. EXPRESS is at-least-once, capped at 5 minutes, and billed by request count, duration, and memory; use it for high-throughput event processing, and make tasks idempotent because it can re-run them.
- Keep the execution role least-privilege. Pass only the actions/resources each workflow truly calls via policy_statements; never attach a wildcard lambda:* action or a * resource. The module already scopes log and X-Ray permissions separately so your business policy stays clean.
- Pin EventBridge, aliases, and SDK callers to state_machine_version_arn, not the bare ARN. With publish = true every definition change yields a new immutable version, giving you instant, auditable rollback without a redeploy.
- Guard sensitive payloads. Leave include_execution_data = false (the default) when inputs carry PII or secrets, and set kms_key_id + logs_kms_key_arn in regulated environments so both the definition and the logs are encrypted with a key you control.
- Set retention and naming deliberately. Use the /aws/vendedlogs/states/<name> convention the module defaults to, which AWS recommends to keep the log-delivery resource policy within its size limit, and pick log_retention_in_days per compliance need rather than letting logs accumulate forever and inflate the CloudWatch bill.
- Make reliability explicit in the ASL. Add Retry with BackoffRate and Catch blocks on every Task state so transient downstream failures self-heal and terminal ones route to a defined failure path; the module gives you the observability to see when they fire.
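The encryption guidance above could be wired up roughly as follows. The two key variables, the workflow name, and the ASL file are hypothetical. The sketch also assumes the execution role needs kms:Decrypt and kms:GenerateDataKey on the definition key, which the module does not grant by itself, and that the logs key's own key policy already allows the CloudWatch Logs service to use it.

```hcl
variable "sfn_kms_key_arn" {
  description = "ARN of the customer-managed key for the state machine (hypothetical input)."
  type        = string
}

variable "logs_kms_key_arn" {
  description = "ARN of the customer-managed key for the log group (hypothetical input)."
  type        = string
}

module "claims_intake" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name       = "claims-intake"                                # hypothetical name
  definition = file("${path.module}/claims-intake.asl.json") # hypothetical file

  # Encrypt the definition and the logs with keys you control.
  kms_key_id                        = var.sfn_kms_key_arn
  kms_data_key_reuse_period_seconds = 300
  logs_kms_key_arn                  = var.logs_kms_key_arn

  # Keep payloads out of the logs entirely.
  include_execution_data = false

  policy_statements = [
    {
      # Assumption: the execution role must be able to use the definition key.
      sid       = "UseDefinitionKey"
      actions   = ["kms:Decrypt", "kms:GenerateDataKey"]
      resources = [var.sfn_kms_key_arn]
    },
  ]
}
```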