Terraform Module: AWS Step Functions — Versioned State Machines with Logging, Tracing, and Least-Privilege IAM

Quick take — A reusable Terraform module for aws_sfn_state_machine that wires up Standard/Express workflows with CloudWatch logging, X-Ray tracing, an execution IAM role, and safe publish-and-version semantics. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "step_functions" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name       = "..."  # State machine name; prefix for the IAM role and log gro…
  definition = "..."  # Amazon States Language JSON (validated with `jsondecode…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
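To get a first apply through, a trivial workflow is enough to stand in for the definition placeholder. The sketch below is an illustration only: the hello_definition local and the Hello state are hypothetical names, not part of the module. It builds a single Pass state with jsonencode, so the result should satisfy the module's jsondecode validation.

```hcl
locals {
  # Hypothetical one-state workflow. A Pass state calls no downstream
  # services, so policy_statements can stay at its default of [].
  hello_definition = jsonencode({
    Comment = "Smoke-test workflow with a single Pass state"
    StartAt = "Hello"
    States = {
      Hello = {
        Type   = "Pass"
        Result = { message = "hello" }
        End    = true
      }
    }
  })
}
```

Set definition = local.hello_definition in the module block, then swap in the real workflow once the plumbing is confirmed.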

What this module is

AWS Step Functions is a serverless orchestrator: you describe a workflow as a state machine in Amazon States Language (ASL), and the service runs it state-by-state, handling retries, error catching, parallelism, timeouts, and human-approval waits without you running any coordination code. The core Terraform resource is aws_sfn_state_machine, and on its own it hides a lot of sharp edges — the state machine is useless without an IAM role that lets it call the downstream services (Lambda, ECS, SNS, DynamoDB, SQS), and in production you almost always want CloudWatch Logs delivery, X-Ray tracing, and the STANDARD vs EXPRESS decision made deliberately rather than by accident.

This module wraps aws_sfn_state_machine together with the three things teams forget the first time: a scoped execution role (aws_iam_role + inline policy), a dedicated CloudWatch log group with a retention policy, and tracing_configuration / logging_configuration blocks driven by variables. It also exposes publish and version_description, so each definition change yields an immutable, addressable version ARN, which is what you point an alias or an EventBridge rule at. You hand it a definition (your rendered ASL JSON) and a list of IAM statements; it returns the state machine ARN, the published version ARN, and the role ARN.
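As a sketch of the alias case, the block below routes all traffic for an alias to the version the module last published. It assumes the module call is named step_functions, that publish = true, and that the installed AWS provider release includes the aws_sfn_alias resource; the alias name "live" is a hypothetical choice.

```hcl
# Hypothetical alias; callers start executions against the alias ARN
# instead of tracking individual version ARNs.
resource "aws_sfn_alias" "live" {
  name = "live"

  routing_configuration {
    state_machine_version_arn = module.step_functions.state_machine_version_arn
    weight                    = 100
  }
}
```

An alias also accepts two routing_configuration blocks whose weights sum to 100, which is one way to shift traffic gradually between versions.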

When to use it

Reach for this module when you have more than one Step Functions workflow and you are tired of copy-pasting the same log group, role, and tracing boilerplate per workflow.

If you only ever have a single throwaway workflow, inline aws_sfn_state_machine is fine — the module earns its keep at scale and under audit.

Module structure

terraform-module-aws-step-functions/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # log group, IAM role + policy, state machine
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # ARNs, name, version ARN, role ARN

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  # Express workflows keep no execution history in Step Functions itself;
  # CloudWatch Logs is the only record, so OFF is always upgraded to ALL.
  effective_log_level = var.type == "EXPRESS" && var.log_level == "OFF" ? "ALL" : var.log_level

  log_group_name = coalesce(
    var.log_group_name,
    "/aws/vendedlogs/states/${var.name}"
  )

  tags = merge(
    var.tags,
    {
      "ManagedBy" = "Terraform"
      "Module"    = "terraform-module-aws-step-functions"
    }
  )
}

# Dedicated log group. AWS recommends the /aws/vendedlogs/ prefix for vended
# logs so deliveries do not hit the CloudWatch Logs resource policy size limit.
resource "aws_cloudwatch_log_group" "this" {
  name              = local.log_group_name
  retention_in_days = var.log_retention_in_days
  kms_key_id        = var.logs_kms_key_arn
  tags              = local.tags
}

data "aws_iam_policy_document" "assume" {
  statement {
    sid     = "StepFunctionsAssume"
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["states.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "this" {
  name                 = "${var.name}-sfn-exec"
  assume_role_policy   = data.aws_iam_policy_document.assume.json
  permissions_boundary = var.permissions_boundary_arn
  tags                 = local.tags
}

# Caller-supplied least-privilege statements for downstream service calls
# (lambda:InvokeFunction, dynamodb:PutItem, sns:Publish, ...).
data "aws_iam_policy_document" "exec" {
  dynamic "statement" {
    for_each = var.policy_statements
    content {
      sid       = statement.value.sid
      effect    = statement.value.effect
      actions   = statement.value.actions
      resources = statement.value.resources
    }
  }
}

resource "aws_iam_role_policy" "exec" {
  count  = length(var.policy_statements) > 0 ? 1 : 0
  name   = "${var.name}-sfn-exec"
  role   = aws_iam_role.this.id
  policy = data.aws_iam_policy_document.exec.json
}

# Permissions the state machine needs to deliver its own logs and X-Ray traces.
data "aws_iam_policy_document" "observability" {
  statement {
    sid    = "CloudWatchLogsDelivery"
    effect = "Allow"
    actions = [
      "logs:CreateLogDelivery",
      "logs:GetLogDelivery",
      "logs:UpdateLogDelivery",
      "logs:DeleteLogDelivery",
      "logs:ListLogDeliveries",
      "logs:PutResourcePolicy",
      "logs:DescribeResourcePolicies",
      "logs:DescribeLogGroups",
    ]
    resources = ["*"]
  }

  dynamic "statement" {
    for_each = var.enable_tracing ? [1] : []
    content {
      sid    = "XRayTracing"
      effect = "Allow"
      actions = [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords",
        "xray:GetSamplingRules",
        "xray:GetSamplingTargets",
      ]
      resources = ["*"]
    }
  }
}

resource "aws_iam_role_policy" "observability" {
  name   = "${var.name}-sfn-observability"
  role   = aws_iam_role.this.id
  policy = data.aws_iam_policy_document.observability.json
}

resource "aws_sfn_state_machine" "this" {
  name     = var.name
  type     = var.type
  role_arn = aws_iam_role.this.arn

  # Rendered Amazon States Language JSON supplied by the caller.
  definition = var.definition

  # Publish an immutable, addressable version on every definition change.
  publish             = var.publish
  version_description = var.version_description

  # Optional customer-managed key for definition encryption-at-rest.
  dynamic "encryption_configuration" {
    for_each = var.kms_key_id == null ? [] : [1]
    content {
      type                              = "CUSTOMER_MANAGED_KMS_KEY"
      kms_key_id                        = var.kms_key_id
      kms_data_key_reuse_period_seconds = var.kms_data_key_reuse_period_seconds
    }
  }

  logging_configuration {
    log_destination        = "${aws_cloudwatch_log_group.this.arn}:*"
    include_execution_data = var.include_execution_data
    level                  = local.effective_log_level
  }

  tracing_configuration {
    enabled = var.enable_tracing
  }

  tags = local.tags

  depends_on = [
    aws_iam_role_policy.observability,
  ]
}

# variables.tf

variable "name" {
  description = "Name of the state machine and prefix for its IAM role and log group."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9_-]{1,80}$", var.name))
    error_message = "name must be 1-80 chars of letters, digits, hyphen, or underscore."
  }
}

variable "definition" {
  description = "Amazon States Language (ASL) definition of the workflow, as a JSON string (use jsonencode() or templatefile())."
  type        = string

  validation {
    condition     = can(jsondecode(var.definition))
    error_message = "definition must be valid JSON (Amazon States Language)."
  }
}

variable "type" {
  description = "Workflow type: STANDARD (durable, exactly-once, up to 1 year) or EXPRESS (high-volume, at-least-once, up to 5 minutes)."
  type        = string
  default     = "STANDARD"

  validation {
    condition     = contains(["STANDARD", "EXPRESS"], var.type)
    error_message = "type must be STANDARD or EXPRESS."
  }
}

variable "policy_statements" {
  description = "Least-privilege IAM statements granting the state machine access to downstream services (Lambda, DynamoDB, SNS, etc.)."
  type = list(object({
    sid       = string
    effect    = optional(string, "Allow")
    actions   = list(string)
    resources = list(string)
  }))
  default = []
}

variable "publish" {
  description = "Whether to publish a new immutable version of the state machine on each definition change."
  type        = bool
  default     = true
}

variable "version_description" {
  description = "Description attached to the published version (e.g. a git SHA or release tag). Ignored when publish = false."
  type        = string
  default     = null
}

variable "enable_tracing" {
  description = "Enable AWS X-Ray tracing for the state machine."
  type        = bool
  default     = true
}

variable "log_level" {
  description = "CloudWatch Logs level: ALL, ERROR, FATAL, or OFF. EXPRESS workflows are upgraded to ALL when set to OFF."
  type        = string
  default     = "ERROR"

  validation {
    condition     = contains(["ALL", "ERROR", "FATAL", "OFF"], var.log_level)
    error_message = "log_level must be ALL, ERROR, FATAL, or OFF."
  }
}

variable "include_execution_data" {
  description = "Whether to include execution input/output and state payloads in the logs (disable if payloads contain sensitive data)."
  type        = bool
  default     = false
}

variable "log_group_name" {
  description = "Override the CloudWatch log group name. Defaults to /aws/vendedlogs/states/<name>."
  type        = string
  default     = null
}

variable "log_retention_in_days" {
  description = "Retention period in days for the workflow's CloudWatch log group."
  type        = number
  default     = 30

  validation {
    condition = contains(
      [0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653],
      var.log_retention_in_days
    )
    error_message = "log_retention_in_days must be a value CloudWatch Logs accepts (e.g. 1, 7, 14, 30, 90, 365, ... or 0 for never expire)."
  }
}

variable "kms_key_id" {
  description = "Customer-managed KMS key ARN/ID for encrypting the state machine definition at rest. Null uses the AWS-owned key."
  type        = string
  default     = null
}

variable "kms_data_key_reuse_period_seconds" {
  description = "How long (60-900s) Step Functions reuses a data key before calling KMS again. Only used when kms_key_id is set."
  type        = number
  default     = 300

  validation {
    condition     = var.kms_data_key_reuse_period_seconds >= 60 && var.kms_data_key_reuse_period_seconds <= 900
    error_message = "kms_data_key_reuse_period_seconds must be between 60 and 900."
  }
}

variable "logs_kms_key_arn" {
  description = "Optional KMS key ARN to encrypt the CloudWatch log group."
  type        = string
  default     = null
}

variable "permissions_boundary_arn" {
  description = "Optional IAM permissions boundary ARN to attach to the execution role."
  type        = string
  default     = null
}

variable "tags" {
  description = "Tags applied to the state machine, IAM role, and log group."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "id" {
  description = "ARN of the state machine (its id)."
  value       = aws_sfn_state_machine.this.id
}

output "arn" {
  description = "ARN of the state machine."
  value       = aws_sfn_state_machine.this.arn
}

output "name" {
  description = "Name of the state machine."
  value       = aws_sfn_state_machine.this.name
}

output "state_machine_version_arn" {
  description = "ARN of the published immutable version (null when publish = false)."
  value       = aws_sfn_state_machine.this.state_machine_version_arn
}

output "creation_date" {
  description = "Date the state machine was created."
  value       = aws_sfn_state_machine.this.creation_date
}

output "role_arn" {
  description = "ARN of the execution IAM role assumed by the state machine."
  value       = aws_iam_role.this.arn
}

output "role_name" {
  description = "Name of the execution IAM role."
  value       = aws_iam_role.this.name
}

output "log_group_name" {
  description = "Name of the CloudWatch log group receiving workflow logs."
  value       = aws_cloudwatch_log_group.this.name
}

output "log_group_arn" {
  description = "ARN of the CloudWatch log group."
  value       = aws_cloudwatch_log_group.this.arn
}
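
The outputs are meant to be consumed by other resources. As one illustration, the sketch below alarms on failed executions using the arn and name outputs. It assumes the module call is named step_functions; the alert_topic_arn variable and the alarm name are hypothetical. AWS/States, ExecutionsFailed, and the StateMachineArn dimension are the standard Step Functions CloudWatch metric names.

```hcl
# Hypothetical input: an existing SNS topic that receives alarm notifications.
variable "alert_topic_arn" {
  type = string
}

resource "aws_cloudwatch_metric_alarm" "sfn_failures" {
  alarm_name          = "${module.step_functions.name}-executions-failed"
  namespace           = "AWS/States"
  metric_name         = "ExecutionsFailed"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    StateMachineArn = module.step_functions.arn
  }

  alarm_actions = [var.alert_topic_arn]
}
```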

How to use it

Here a STANDARD order-fulfilment workflow invokes two Lambdas and publishes to an SNS topic. The ASL is rendered with jsonencode, and the execution role is scoped to exactly those resources. An EventBridge rule downstream targets the published version ARN, so it always invokes the exact definition this apply produced.

module "step_functions" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name                   = "order-fulfilment"
  type                   = "STANDARD"
  enable_tracing         = true
  log_level              = "ERROR"
  include_execution_data = false
  log_retention_in_days  = 90
  version_description    = "release-2026.06.09"

  definition = jsonencode({
    Comment = "Validate, charge, then notify on order placement"
    StartAt = "ValidateOrder"
    States = {
      ValidateOrder = {
        Type     = "Task"
        Resource = aws_lambda_function.validate.arn
        Retry    = [{ ErrorEquals = ["States.TaskFailed"], MaxAttempts = 2, BackoffRate = 2.0, IntervalSeconds = 1 }]
        Next     = "ChargePayment"
      }
      ChargePayment = {
        Type     = "Task"
        Resource = aws_lambda_function.charge.arn
        Catch    = [{ ErrorEquals = ["States.ALL"], Next = "NotifyFailure" }]
        Next     = "NotifySuccess"
      }
      NotifySuccess = {
        Type     = "Task"
        Resource = "arn:aws:states:::sns:publish"
        Parameters = {
          TopicArn    = aws_sns_topic.orders.arn
          "Message.$" = "$.orderId"
        }
        End = true
      }
      NotifyFailure = {
        Type  = "Fail"
        Error = "PaymentFailed"
        Cause = "Payment could not be captured"
      }
    }
  })

  policy_statements = [
    {
      sid       = "InvokeOrderLambdas"
      actions   = ["lambda:InvokeFunction"]
      resources = [aws_lambda_function.validate.arn, aws_lambda_function.charge.arn]
    },
    {
      sid       = "PublishOrderEvents"
      actions   = ["sns:Publish"]
      resources = [aws_sns_topic.orders.arn]
    },
  ]

  tags = {
    Environment = "prod"
    Team        = "commerce"
  }
}

# Downstream: trigger the exact published version on a schedule / event.
resource "aws_cloudwatch_event_rule" "nightly_reconcile" {
  name                = "order-fulfilment-nightly"
  schedule_expression = "cron(0 2 * * ? *)"
}

resource "aws_cloudwatch_event_target" "to_sfn" {
  rule     = aws_cloudwatch_event_rule.nightly_reconcile.name
  arn      = module.step_functions.state_machine_version_arn
  role_arn = aws_iam_role.eventbridge_invoke.arn
}
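
The event target above references aws_iam_role.eventbridge_invoke, which the example does not define. A minimal sketch of that role follows, assuming publish = true so that the version ARN is populated. The role and policy names are hypothetical; the events.amazonaws.com principal and the states:StartExecution action are the standard ones for this integration.

```hcl
data "aws_iam_policy_document" "eventbridge_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "eventbridge_invoke" {
  name               = "order-fulfilment-eventbridge-invoke" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.eventbridge_assume.json
}

# Allow EventBridge to start executions of this state machine only,
# either unqualified or pinned to the published version.
data "aws_iam_policy_document" "eventbridge_start" {
  statement {
    sid     = "StartOrderFulfilment"
    effect  = "Allow"
    actions = ["states:StartExecution"]
    resources = [
      module.step_functions.arn,
      module.step_functions.state_machine_version_arn,
    ]
  }
}

resource "aws_iam_role_policy" "eventbridge_start" {
  name   = "start-order-fulfilment"
  role   = aws_iam_role.eventbridge_invoke.id
  policy = data.aws_iam_policy_document.eventbridge_start.json
}
```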

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config: live/prod/step_functions/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"
}

inputs = {
  name       = "..."
  definition = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/step_functions && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply          # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
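To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might look like. The path live/dev/step_functions, the workflow name, and the workflow.asl.json file are all assumptions for illustration; get_terragrunt_dir() and file() are standard functions available in Terragrunt configs.

```hcl
# live/dev/step_functions/terragrunt.hcl (hypothetical dev environment)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"
}

inputs = {
  name = "order-fulfilment-dev"

  # Assumes an ASL file named workflow.asl.json sits next to this config.
  definition = file("${get_terragrunt_dir()}/workflow.asl.json")

  # Dev-only choices: verbose logs, short retention, no tracing.
  log_level             = "ALL"
  log_retention_in_days = 7
  enable_tracing        = false
}
```

Only the inputs block differs from prod; the module source and version pin stay identical, so both environments run the same code.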

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | | yes | State machine name; prefix for the IAM role and log group (1-80 chars, [A-Za-z0-9_-]). |
| definition | string | | yes | Amazon States Language JSON (validated with jsondecode); build with jsonencode() or templatefile(). |
| type | string | "STANDARD" | no | STANDARD or EXPRESS. |
| policy_statements | list(object) | [] | no | Least-privilege IAM statements for downstream service calls (sid, effect, actions, resources). |
| publish | bool | true | no | Publish a new immutable version on each definition change. |
| version_description | string | null | no | Description for the published version (e.g. git SHA / release tag). |
| enable_tracing | bool | true | no | Enable AWS X-Ray tracing. |
| log_level | string | "ERROR" | no | ALL, ERROR, FATAL, or OFF; EXPRESS is auto-upgraded from OFF to ALL. |
| include_execution_data | bool | false | no | Include input/output/state payloads in logs (leave off for sensitive data). |
| log_group_name | string | null | no | Override log group name; defaults to /aws/vendedlogs/states/<name>. |
| log_retention_in_days | number | 30 | no | CloudWatch Logs retention (must be an accepted value; 0 = never expire). |
| kms_key_id | string | null | no | Customer-managed KMS key for definition encryption at rest. |
| kms_data_key_reuse_period_seconds | number | 300 | no | KMS data-key reuse window (60-900s); only used with kms_key_id. |
| logs_kms_key_arn | string | null | no | KMS key ARN to encrypt the CloudWatch log group. |
| permissions_boundary_arn | string | null | no | IAM permissions boundary ARN for the execution role. |
| tags | map(string) | {} | no | Tags applied to all created resources. |

Outputs

| Name | Description |
|---|---|
| id | ARN of the state machine (its id). |
| arn | ARN of the state machine. |
| name | Name of the state machine. |
| state_machine_version_arn | ARN of the published immutable version (null when publish = false). |
| creation_date | Creation timestamp of the state machine. |
| role_arn | ARN of the execution IAM role. |
| role_name | Name of the execution IAM role. |
| log_group_name | Name of the CloudWatch log group receiving workflow logs. |
| log_group_arn | ARN of the CloudWatch log group. |

Enterprise scenario

A retail platform runs its checkout pipeline as a STANDARD workflow (validate → reserve inventory → charge → fulfil) and a separate EXPRESS workflow that handles 4,000 clickstream events/second for real-time personalisation. Both are stood up from this single module: the checkout one sets publish = true with version_description set to the CI release tag so the on-call team can roll back to the previous version ARN in seconds, while the Express one runs log_level = "ALL" with include_execution_data = false because its payloads contain PII. Audit gets the CloudWatch log groups and X-Ray service maps with no extra wiring, and platform engineering maintains one module instead of two hand-rolled state machines.
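The two module calls in this scenario might look roughly like the sketch below. The module labels, workflow names, the release_tag variable, and the .asl.json file paths are hypothetical; only the input names and values come from the module and the scenario described here.

```hcl
# Hypothetical input: release tag injected by the CI pipeline.
variable "release_tag" {
  type = string
}

module "checkout" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name                = "checkout"
  type                = "STANDARD"
  publish             = true
  version_description = var.release_tag
  definition          = file("${path.module}/checkout.asl.json") # assumed file
}

module "clickstream" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-step-functions?ref=v1.0.0"

  name                   = "clickstream-personalisation"
  type                   = "EXPRESS"
  log_level              = "ALL"
  include_execution_data = false # payloads contain PII
  definition             = file("${path.module}/clickstream.asl.json") # assumed file
}
```

Each call would also need policy_statements for whatever downstream services its workflow touches; they are omitted here because the scenario does not name them.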

Best practices

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
