Terraform Module: Azure Chaos Studio — codify resilience experiments as version-controlled fault injection

Quick take: a reusable Terraform module, built on the hashicorp/azurerm provider, for Azure Chaos Studio. It provisions experiments with selectors, steps, branches, fault actions and a system-assigned identity, so chaos engineering becomes repeatable, reviewable IaC. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal configuration: drop this in a .tf file and replace each "..." placeholder (every required input is commented):

provider "azurerm" {
  features {}
  # azurerm 4.x also needs subscription_id here, or ARM_SUBSCRIPTION_ID in the environment.
}

module "chaos_studio" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-chaos-studio?ref=v1.0.0"

  name                = "..." # Experiment name (2–64 chars, alphanumeric start/end, allows . _ -).
  location            = "..." # Azure region for the experiment resource.
  resource_group_name = "..." # Resource group that holds the experiment.

  # Named groups of Chaos Studio *target* resource IDs that steps reference by name.
  selectors = [{ name = "...", chaos_studio_target_ids = ["..."] }]

  # Ordered steps → parallel branches → actions. Each action is continuous/discrete (with a fault urn) or delay.
  steps = [{
    name = "..."
    branches = [{
      name    = "..."
      actions = [{ action_type = "continuous", urn = "...", duration = "...", selector_name = "..." }]
    }]
  }]
}

Then run terraform init && terraform apply. The only other input, tags, defaults to {}; see Inputs below for the full list.

What this module is

Azure Chaos Studio is a managed fault-injection service for chaos engineering. You enable a target on a resource (a VM, an AKS cluster, a Cosmos DB account), attach one or more capabilities (the specific faults that resource can absorb — CPU pressure, shutdown, network latency, failover), and then author an experiment that orchestrates those faults across steps, branches, and selectors to validate that your system degrades gracefully instead of falling over.

The piece worth wrapping in Terraform is the azurerm_chaos_studio_experiment resource. An experiment is a non-trivial nested document: it has a selectors block (named groups of target resource IDs), and a steps → branches → actions hierarchy where each action references a fault urn, a duration, fault-specific parameters, and the selector it applies to. Hand-clicking that in the portal is fine for a one-off, but it is exactly the kind of artifact you want under review: a GameDay scenario is a hypothesis about your system, and a hypothesis belongs in version control where it can be diffed, peer-reviewed, and re-run identically every quarter.

This module takes a list of selectors and a list of steps as variables, wires up the experiment with a system-assigned managed identity (Chaos Studio uses that identity to actually execute faults against your targets, so it needs RBAC on each target), and emits the experiment ID, name, and principal ID so a downstream module can grant the role assignments the experiment needs. It turns “we did some chaos testing once” into a durable, repeatable resilience asset.

When to use it

You run GameDays / resilience drills and want each scenario defined as code so it is reproducible run-to-run and reviewable in a PR.
You are validating multi-region or zone-redundant architectures and need to repeatably kill a zone, fail over a database, or inject latency on a dependency.
You want chaos experiments to live next to the workload’s Terraform, so the blast radius (which exact resource IDs) is tracked alongside the infrastructure it targets.
You need the experiment’s managed identity principal ID as an output so another module can assign it the precise reader/operator roles on each target — without that RBAC, faults fail at runtime.
You are wiring chaos into a pipeline (run an experiment in a pre-prod stage, gate the release on the system surviving) and need the experiment to exist deterministically before the run.

If you only ever need a single ad-hoc experiment and will never re-run it, the portal is faster. The module pays off the moment an experiment becomes a recurring, audited part of your reliability practice.

Module structure

terraform-module-azure-chaos-studio/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

locals {
  # Single place to derive the experiment name. Today this is a plain
  # pass-through of var.name, which is validated in variables.tf.
  experiment_name = var.name
}

resource "azurerm_chaos_studio_experiment" "this" {
  name                = local.experiment_name
  location            = var.location
  resource_group_name = var.resource_group_name

  # Chaos Studio executes faults using this identity. It must hold the right
  # roles on every target resource (e.g. "Reader" + a fault-specific operator
  # role). Grant those downstream using the principal_id output.
  identity {
    type = "SystemAssigned"
  }

  # Named groups of target resource IDs that steps reference by name.
  dynamic "selectors" {
    for_each = var.selectors
    content {
      name                    = selectors.value.name
      chaos_studio_target_ids = selectors.value.chaos_studio_target_ids
    }
  }

  # The experiment graph: ordered steps -> parallel branches -> actions.
  dynamic "steps" {
    for_each = var.steps
    content {
      name = steps.value.name

      dynamic "branch" {
        for_each = steps.value.branches
        content {
          name = branch.value.name

          dynamic "actions" {
            for_each = branch.value.actions
            content {
              action_type   = actions.value.action_type
              urn           = actions.value.urn
              duration      = actions.value.duration
              selector_name = actions.value.selector_name
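              # The provider takes parameters as a map(string); convert the { key, value } list.
              parameters    = { for p in actions.value.parameters : p.key => p.value }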
            }
          }
        }
      }
    }
  }

  tags = var.tags
}

variables.tf

variable "name" {
  description = "Name of the Chaos Studio experiment."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9][a-zA-Z0-9._-]{0,62}[a-zA-Z0-9]$", var.name))
    error_message = "name must be 2-64 chars, start/end alphanumeric, and contain only letters, numbers, '.', '_' or '-'."
  }
}

variable "location" {
  description = "Azure region for the experiment resource (e.g. westeurope)."
  type        = string
}

variable "resource_group_name" {
  description = "Resource group that will hold the experiment."
  type        = string
}

variable "selectors" {
  description = <<-EOT
    Named selectors. Each selector is a group of Chaos Studio *target* resource IDs
    (the .../providers/Microsoft.Chaos/targets/... IDs created when you onboard a
    resource as a chaos target). Steps reference selectors by name.
  EOT
  type = list(object({
    name                    = string
    chaos_studio_target_ids = list(string)
  }))

  validation {
    condition     = length(var.selectors) > 0
    error_message = "At least one selector is required so steps have a target to act on."
  }

  validation {
    condition = alltrue([
      for s in var.selectors : length(s.chaos_studio_target_ids) > 0
    ])
    error_message = "Each selector must contain at least one chaos_studio_target_id."
  }
}

variable "steps" {
  description = <<-EOT
    Ordered experiment steps. Each step contains one or more parallel branches,
    and each branch contains one or more actions. An action is either a fault
    ("continuous"/"discrete") referencing a fault urn, or a "delay".
    'parameters' is a list of { key, value } pairs passed to the fault.
  EOT
  type = list(object({
    name = string
    branches = list(object({
      name = string
      actions = list(object({
        action_type   = string
        urn           = optional(string)
        duration      = optional(string)
        selector_name = optional(string)
        parameters    = optional(list(object({
          key   = string
          value = string
        })), [])
      }))
    }))
  }))

  validation {
    condition     = length(var.steps) > 0
    error_message = "At least one step is required."
  }

  validation {
    condition = alltrue(flatten([
      for st in var.steps : [
        for br in st.branches : [
          for a in br.actions : contains(["continuous", "discrete", "delay"], a.action_type)
        ]
      ]
    ]))
    error_message = "Every action_type must be one of: continuous, discrete, delay."
  }

  validation {
    condition = alltrue(flatten([
      for st in var.steps : [
        for br in st.branches : [
          # Fault actions (continuous/discrete) must carry a urn; delays must not.
          for a in br.actions :
          a.action_type == "delay" ? a.urn == null : a.urn != null
        ]
      ]
    ]))
    error_message = "continuous/discrete actions require a 'urn'; 'delay' actions must omit 'urn' (use 'duration' only)."
  }
}

variable "tags" {
  description = "Tags applied to the experiment."
  type        = map(string)
  default     = {}
}
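
To make the steps → branches → actions shape concrete, here is a sketch of a steps value as it might appear inside the module block. It has two sequential steps, the first of which runs two branches in parallel. The step, branch and selector names (aks-pods, worker-vms) are hypothetical. The pod chaos URN is the one used elsewhere in this article; the VM shutdown URN and its abruptShutdown parameter are assumed from the Chaos Studio fault library and should be checked against the current fault list before use.

```hcl
steps = [
  {
    # Step 1: both branches start together and run in parallel.
    name = "step-1-degrade"
    branches = [
      {
        name = "branch-pods"
        actions = [
          {
            action_type   = "continuous"
            urn           = "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2"
            duration      = "PT5M"
            selector_name = "aks-pods" # hypothetical selector name
            parameters = [
              {
                key   = "jsonSpec"
                value = jsonencode({ action = "pod-failure", mode = "one", selector = { namespaces = ["orders"] } })
              },
            ]
          },
        ]
      },
      {
        name = "branch-vms"
        actions = [
          # Actions inside one branch run in order: wait first, then shut down.
          { action_type = "delay", duration = "PT2M" },
          {
            action_type   = "continuous"
            urn           = "urn:csci:microsoft:virtualMachine:shutdown/1.0" # assumed URN
            duration      = "PT3M"
            selector_name = "worker-vms" # hypothetical selector name
            parameters    = [{ key = "abruptShutdown", value = "false" }]
          },
        ]
      },
    ]
  },
  {
    # Step 2 starts only after every branch in step 1 has finished.
    name = "step-2-settle"
    branches = [
      { name = "branch-settle", actions = [{ action_type = "delay", duration = "PT5M" }] },
    ]
  },
]
```

The delay actions omit urn, selector_name and parameters, which is what the third validation rule on var.steps expects.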

outputs.tf

output "id" {
  description = "Resource ID of the Chaos Studio experiment."
  value       = azurerm_chaos_studio_experiment.this.id
}

output "name" {
  description = "Name of the Chaos Studio experiment."
  value       = azurerm_chaos_studio_experiment.this.name
}

output "principal_id" {
  description = <<-EOT
    Object (principal) ID of the experiment's system-assigned managed identity.
    Use this to grant the experiment the RBAC roles it needs on each target,
    otherwise fault execution fails at run time.
  EOT
  value       = azurerm_chaos_studio_experiment.this.identity[0].principal_id
}

output "tenant_id" {
  description = "Tenant ID of the experiment's system-assigned managed identity."
  value       = azurerm_chaos_studio_experiment.this.identity[0].tenant_id
}

How to use it

Below, an AKS cluster has already been onboarded as a Chaos Studio target with the pod-chaos capability. The configuration defines an experiment that runs a pod-failure fault for 10 minutes, then grants the experiment’s managed identity the role it needs on the cluster.

module "chaos_studio" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-chaos-studio?ref=v1.0.0"

  name                = "exp-aks-pod-failure-prod"
  location            = "westeurope"
  resource_group_name = azurerm_resource_group.reliability.name

  selectors = [
    {
      name = "aks-pods"
      chaos_studio_target_ids = [
        azurerm_chaos_studio_target.aks.id,
      ]
    }
  ]

  steps = [
    {
      name = "Kill pods in the orders namespace"
      branches = [
        {
          name = "branch-pod-failure"
          actions = [
            {
              action_type   = "continuous"
              urn           = "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2"
              duration      = "PT10M"
              selector_name = "aks-pods"
              parameters = [
                {
                  key   = "jsonSpec"
                  value = jsonencode({
                    action = "pod-failure"
                    mode   = "all"
                    selector = {
                      namespaces = ["orders"]
                    }
                  })
                }
              ]
            }
          ]
        }
      ]
    }
  ]

  tags = {
    environment = "prod"
    owner       = "sre"
    purpose     = "resilience-gameday"
  }
}

# Downstream: use the principal_id output so Chaos Studio's identity can
# actually execute faults against the AKS cluster. Without this the run fails.
resource "azurerm_role_assignment" "chaos_on_aks" {
  scope                = azurerm_kubernetes_cluster.prod.id
  role_definition_name = "Azure Kubernetes Service Cluster Admin Role"
  principal_id         = module.chaos_studio.principal_id
}
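
The example above references azurerm_chaos_studio_target.aks without defining it. A minimal sketch of that onboarding step follows, assuming the same azurerm_kubernetes_cluster.prod resource and a cluster that already has Chaos Mesh installed. The capability_type value is an assumption; it should correspond to the fault URN version you pin.

```hcl
# Enable the AKS cluster as a Chaos Studio target (service-direct, Chaos Mesh).
resource "azurerm_chaos_studio_target" "aks" {
  location           = azurerm_kubernetes_cluster.prod.location
  target_resource_id = azurerm_kubernetes_cluster.prod.id
  target_type        = "Microsoft-AzureKubernetesServiceChaosMesh"
}

# Attach the pod-chaos capability to that target.
resource "azurerm_chaos_studio_capability" "pod_chaos" {
  chaos_studio_target_id = azurerm_chaos_studio_target.aks.id
  capability_type        = "PodChaos-2.2" # assumed to pair with podChaos/2.2
}
```

Because the module only references the target ID, Terraform sees no implicit dependency on the capability. Adding depends_on = [azurerm_chaos_studio_capability.pod_chaos] to the module block is one way to make sure the capability exists before the experiment is created.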

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm storage account/container + key per path...
  }
}
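
The remote_state block above covers only the backend. To keep the provider in the root config as well, one option is a Terragrunt generate block. This is a sketch; the file name provider.tf is an arbitrary choice.

```hcl
# Writes provider.tf into every module's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  features {}
  # azurerm 4.x: set subscription_id here or export ARM_SUBSCRIPTION_ID.
}
EOF
}
```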

2. Module config — live/prod/chaos_studio/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-chaos-studio?ref=v1.0.0"
}

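inputs = {
  name                = "..."
  location            = "..."
  resource_group_name = "..."
  selectors           = [{ name = "...", chaos_studio_target_ids = ["..."] }]
  steps = [{
    name = "..."
    branches = [{
      name    = "..."
      actions = [{ action_type = "continuous", urn = "...", duration = "...", selector_name = "..." }]
    }]
  }]
}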

3. Deploy one environment, or roll out all modules together:

(cd live/prod/chaos_studio && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)        # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
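
As a sketch of the cross-module wiring that run-all orders, the chaos_studio config could read the target ID from a sibling module through a dependency block. The ../aks path and the chaos_target_id output name are hypothetical; substitute whatever your AKS module actually exposes. These blocks would sit in live/prod/chaos_studio/terragrunt.hcl in place of the placeholder selectors input.

```hcl
dependency "aks" {
  config_path = "../aks" # hypothetical sibling module

  # Well-formed placeholder so validate/plan work before the AKS module is applied.
  mock_outputs = {
    chaos_target_id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mock/providers/Microsoft.ContainerService/managedClusters/mock/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  selectors = [
    {
      name                    = "aks-pods"
      chaos_studio_target_ids = [dependency.aks.outputs.chaos_target_id] # hypothetical output name
    },
  ]
  # name, location, resource_group_name and steps are supplied as before.
}
```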

Inputs

Name	Type	Default	Required	Description
`name`	`string`	—	Yes	Experiment name (2–64 chars, alphanumeric start/end, allows `. _ -`).
`location`	`string`	—	Yes	Azure region for the experiment resource.
`resource_group_name`	`string`	—	Yes	Resource group that holds the experiment.
`selectors`	`list(object({ name, chaos_studio_target_ids }))`	—	Yes	Named groups of Chaos Studio target resource IDs that steps reference by name. At least one required.
`steps`	`list(object({ name, branches[...] }))`	—	Yes	Ordered steps → parallel branches → actions. Each action is `continuous`/`discrete` (with a fault `urn`) or `delay`. At least one required.
`tags`	`map(string)`	`{}`	No	Tags applied to the experiment.

Outputs

Name	Description
`id`	Resource ID of the Chaos Studio experiment.
`name`	Name of the experiment.
`principal_id`	Object ID of the experiment’s system-assigned managed identity; use it to grant RBAC on each target.
`tenant_id`	Tenant ID of the experiment’s system-assigned managed identity.

Enterprise scenario

A payments platform runs a quarterly regional-failover GameDay for its PCI-scoped order service on AKS. The SRE team keeps three experiment modules in the workload repo — pod-failure, node CPU pressure, and a Cosmos DB regional failover — each instantiated from this module and pinned to ?ref=v1.0.0. A pre-production pipeline stage applies the Terraform, triggers the pod-failure experiment via the Azure CLI, and fails the release if the synthetic checkout probe drops below its SLO during fault injection. Because the blast radius lives in code, auditors can see exactly which resource IDs were targeted in each drill, and the principal_id output feeds a single RBAC module that scopes Chaos Studio’s identity to only the resources under test.

Best practices

Least-privilege the experiment identity. The principal_id output exists so you can grant only the minimum roles, on only the specific targets a fault needs (e.g. Azure Kubernetes Service Cluster Admin Role for pod chaos, or a narrowly scoped Virtual Machine Contributor for VM shutdown). Never grant roles subscription-wide; an over-privileged chaos identity enlarges both the blast radius and the attack surface.
Pin the fault urn version and keep experiments in source control. Fault URNs are versioned (e.g. podChaos/2.2); pinning makes runs reproducible, and a PR-reviewed experiment means the hypothesis and blast radius are auditable rather than ad-hoc clicks in the portal.
Cost is mostly in what you break. An idle experiment resource costs nothing, but Chaos Studio bills experiment runs (per target action-minute at the time of writing; check current pricing), and the faults themselves consume real capacity (CPU pressure, extra failover traffic, restarted nodes). Bound impact with a conservative duration (ISO 8601, e.g. PT10M) and start in pre-prod before promoting a scenario to production.
Use selectors to constrain the blast radius explicitly. List exact target IDs in selectors rather than broad groups, so a single experiment can never accidentally fault resources outside the namespace/zone/account you intended to test.
Name for the scenario, not the service. exp-<workload>-<fault>-<env> (e.g. exp-aks-pod-failure-prod) makes intent obvious in alerts, the portal, and activity logs when a GameDay is in flight.
Always have an abort path and steady-state checks. Pair experiments with monitoring/SLO probes so you can confirm the system is healthy before injecting and halt the experiment if a real customer-facing metric breaches threshold.
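
The least-privilege advice above can be sketched as a map of scope/role pairs driven by for_each, so every grant shows up as its own line in review. The resource addresses (azurerm_kubernetes_cluster.prod, azurerm_linux_virtual_machine.worker) and the role choices are illustrative assumptions; confirm the role each fault requires in the Chaos Studio documentation.

```hcl
locals {
  # One entry per target the experiment is allowed to touch.
  chaos_role_assignments = {
    aks = {
      scope = azurerm_kubernetes_cluster.prod.id
      role  = "Azure Kubernetes Service Cluster Admin Role"
    }
    vm = {
      scope = azurerm_linux_virtual_machine.worker.id # hypothetical VM
      role  = "Virtual Machine Contributor"
    }
  }
}

# Scope each role to a single resource, never to the subscription.
resource "azurerm_role_assignment" "chaos" {
  for_each = local.chaos_role_assignments

  scope                = each.value.scope
  role_definition_name = each.value.role
  principal_id         = module.chaos_studio.principal_id
}
```

Keeping the map keys static (aks, vm) lets Terraform plan the assignments even when the scopes and principal ID are not known until apply.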