Every resilience claim I have ever reviewed was theoretical until someone broke the system on purpose. Zone-redundant SKUs, three replicas, retry policies, multi-AZ node pools – all of it is a hypothesis until a real fault hits and you watch what actually happens. Azure Chaos Studio is Microsoft’s managed fault-injection service, and the value is not “it can kill VMs.” Anyone can kill a VM. The value is that it forces you to write down a steady-state hypothesis (“p99 latency stays under 400 ms and error rate under 1% while one zone is down”), inject a controlled, time-boxed, blast-radius-limited fault, and either validate the hypothesis or find the gap before a customer does.
This article is the playbook I use: how the two fault classes actually work, how to wire least-privilege RBAC, how to author experiments with steps/branches/parallelism, the specific fault libraries for networking, VMSS, and AKS, how to encode safety with hypotheses and abort criteria, how to gate releases on experiment results, and how to correlate the blast with Azure Monitor. Every command and JSON snippet here is real and current against the 2023-11-01 Chaos API.
1. Architecture: agent-based vs service-direct faults
Chaos Studio injects two fundamentally different classes of fault, and the distinction drives everything downstream – setup, RBAC, and blast radius.
| | Service-direct | Agent-based |
|---|---|---|
| Mechanism | Chaos Studio calls the Azure resource provider directly (ARM control plane) | A VM extension (the chaos agent) runs inside the guest OS and injects the fault locally |
| Target type | Microsoft-{ResourceProvider} (e.g. Microsoft-AzureKubernetesServiceChaosMesh, Microsoft-VirtualMachineScaleSet) | Microsoft-Agent |
| Examples | VMSS shutdown, AKS Chaos Mesh, NSG rule injection, Cosmos DB failover | CPU/memory/disk pressure, kill process, network latency, network disconnect via firewall |
| Setup cost | Enable target + capability only | Enable target + capability, assign a managed identity, install the agent VM extension |
| RBAC needed | Resource-specific (e.g. AKS Cluster Admin Role, VM Contributor) | Reader on the target VM/VMSS |
The mental model: service-direct faults are things you could do from the control plane (shut down an instance, flip a firewall). Agent-based faults are things that have to happen inside the box (burn CPU, drop packets in the guest's network stack). Network disconnect via firewall is agent-based because it programs the in-guest firewall; a service-direct “block traffic” exists only via NSG manipulation, which is coarser.
Before any of this works, the resource must be onboarded as a Chaos Studio target, and the specific faults you want must be enabled as capabilities on that target. An un-onboarded resource is invisible to experiments – this is the first safety rail.
2. Enable targets and capabilities with least-privilege RBAC
Service-direct: VMSS shutdown
Onboarding is two REST calls – create the target, then enable each capability. Use az rest so this is scriptable and reviewable in PRs.
SUBSCRIPTION_ID="<sub-guid>"
RG="rg-resilience-prod"
VMSS_ID="/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-web"
# 1. Create the service-direct target on the VMSS
az rest --method put \
--uri "https://management.azure.com${VMSS_ID}/providers/Microsoft.Chaos/targets/Microsoft-VirtualMachineScaleSet?api-version=2023-11-01" \
--body '{"properties":{}}'
# 2. Enable the Shutdown capability (version 2.0)
az rest --method put \
--uri "https://management.azure.com${VMSS_ID}/providers/Microsoft.Chaos/targets/Microsoft-VirtualMachineScaleSet/capabilities/Shutdown-2.0?api-version=2023-11-01" \
--body '{"properties":{}}'
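Both calls above follow one URL pattern: the target nests under the resource, and each capability nests under the target. A minimal Python sketch of that pattern, assuming the layout shown in the commands; the helper names (target_url, capability_url, onboarding_calls) are hypothetical and not part of any Azure SDK.

```python
ARM = "https://management.azure.com"
API_VERSION = "2023-11-01"


def target_url(resource_id: str, target_type: str) -> str:
    """URL of the Chaos Studio target nested under an Azure resource."""
    return (f"{ARM}{resource_id}/providers/Microsoft.Chaos/targets/"
            f"{target_type}?api-version={API_VERSION}")


def capability_url(resource_id: str, target_type: str, capability: str) -> str:
    """URL of one capability nested under that target."""
    return (f"{ARM}{resource_id}/providers/Microsoft.Chaos/targets/"
            f"{target_type}/capabilities/{capability}?api-version={API_VERSION}")


def onboarding_calls(resource_id: str, target_type: str, capabilities: list) -> list:
    """Ordered PUT URLs: the target first, then each capability."""
    return [target_url(resource_id, target_type)] + [
        capability_url(resource_id, target_type, c) for c in capabilities
    ]
```

Generating the list once and reviewing it in a PR makes it obvious which faults a resource is being opened up to before anything is applied.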
Agent-based: CPU/network faults on the same VMSS
Agent-based onboarding requires a user-assigned managed identity bound to the scale set, a Microsoft-Agent target, and the chaos agent extension. The agent authenticates to Chaos Studio using that identity.
UAMI_ID="/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-chaos-agent"
UAMI_CLIENT_ID="<client-id-of-uami>"
TENANT_ID="<tenant-guid>"
# 1. Bind the user-assigned identity to the VMSS
az vmss identity assign --ids "$VMSS_ID" --identities "$UAMI_ID"
# 2. Create the Microsoft-Agent target referencing that identity
cat > agent-target.json <<EOF
{
"properties": {
"identities": [
{ "clientId": "$UAMI_CLIENT_ID", "tenantId": "$TENANT_ID", "type": "AzureManagedIdentity" }
]
}
}
EOF
AGENT_PROFILE_ID=$(az rest --method put \
--uri "https://management.azure.com${VMSS_ID}/providers/Microsoft.Chaos/targets/Microsoft-Agent?api-version=2023-11-01" \
--body @agent-target.json --query properties.agentProfileId -o tsv)
# 3. Enable the capabilities you need
for CAP in CPUPressure-1.0 NetworkDisconnectViaFirewall-1.1 NetworkLatency-1.2; do
az rest --method put \
--uri "https://management.azure.com${VMSS_ID}/providers/Microsoft.Chaos/targets/Microsoft-Agent/capabilities/${CAP}?api-version=2023-11-01" \
--body '{"properties":{}}'
done
# 4. Install the chaos agent extension (Linux), wiring agentProfileId + identity
az vmss extension set \
--resource-group "$RG" --vmss-name "vmss-web" \
--name ChaosLinuxAgent --publisher Microsoft.Azure.Chaos --version 1.0 \
--settings "{\"profile\":\"$AGENT_PROFILE_ID\",\"auth.msi.clientid\":\"$UAMI_CLIENT_ID\"}"
# 5. Roll the new model to all instances
az vmss update-instances -g "$RG" -n "vmss-web" --instance-ids "*"
The Windows agent is ChaosWindowsAgent (currently --version 1.1). Add "appinsightskey":"<key>" to the settings to stream agent diagnostics into Application Insights – invaluable when an experiment “did nothing” and you need to know whether the fault even fired.
Least-privilege for the experiment identity
When you create an experiment, Chaos Studio mints a system-assigned managed identity for it. That identity – not your user – executes faults, and it needs the minimum role on each target. This is the RBAC rail that bounds what a runaway experiment can touch.
# Look up exactly which role a capability requires before assigning anything
az rest --method get \
--uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/providers/Microsoft.Chaos/locations/eastus/targetTypes/Microsoft-VirtualMachineScaleSet/capabilityTypes/Shutdown-2.0?api-version=2024-01-01" \
--query "properties.requiredAzureRoleDefinitionIds"
Practical role map:
- Agent-based faults (CPU, network): Reader on the VM/VMSS. Counterintuitive, but correct – the agent does the work locally, so the control-plane identity only needs read. Roles like Virtual Machine Contributor that lack */read will not work.
- VMSS Shutdown (service-direct): Virtual Machine Contributor scoped to the scale set.
- AKS Chaos Mesh (service-direct): Azure Kubernetes Service Cluster Admin Role on the cluster.
Scope every assignment to the resource, never the resource group or subscription. A chaos experiment with subscription-level Contributor is a self-inflicted incident waiting to happen.
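The scoping rule is easy to check mechanically. Here is a small Python sketch that classifies an RBAC scope string by breadth and flags anything wider than a single resource. It assumes each assignment is a dict with a "scope" key, which is the shape az role assignment list emits; scope_level and too_broad are hypothetical helper names.

```python
def scope_level(scope: str) -> str:
    """Classify an Azure RBAC scope string by how broad it is."""
    parts = [p for p in scope.strip("/").split("/") if p]
    lowered = [p.lower() for p in parts]
    if len(parts) == 2 and lowered[0] == "subscriptions":
        return "subscription"
    if len(parts) == 4 and lowered[2] == "resourcegroups":
        return "resourceGroup"
    # A resource ID continues past the resource group with providers/<ns>/<type>/<name>.
    if len(parts) >= 8 and "providers" in lowered:
        return "resource"
    return "other"  # e.g. management-group scopes


def too_broad(assignments: list) -> list:
    """Return the assignments whose scope is wider than a single resource."""
    return [a for a in assignments if scope_level(a["scope"]) != "resource"]
```

Run it over the role assignments of the experiment's managed identity; an empty result is what you want to see before the first prod run.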
3. Designing experiments: steps, branches, and parallel faults
An experiment is itself an ARM resource. Its anatomy:
- Selectors – named, reusable groups of target resources. Define once, reference from many actions.
- Steps – run sequentially. Step 2 starts only after Step 1 completes.
- Branches – inside a step, run in parallel. This is how you model “zone fails AND a dependency goes slow simultaneously.”
- Actions – the leaves: a fault (continuous with an ISO 8601 duration, or discrete for a one-shot event such as an Azure Cache for Redis reboot) or a delay.
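These sequencing rules make the wall-clock length of a run computable before you start it: steps add up, and branches inside a step cost as much as the slowest branch. A minimal Python sketch, assuming actions inside one branch run one after another and handling only the PT#H#M#S subset of ISO 8601; the function names are hypothetical.

```python
import re

_ISO = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def seconds(duration: str) -> int:
    """Parse the PT#H#M#S subset of ISO 8601 durations into seconds."""
    m = _ISO.match(duration)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    h, mi, s = (int(g or 0) for g in m.groups())
    return h * 3600 + mi * 60 + s


def branch_seconds(branch: dict) -> int:
    # Assumption: actions in a branch run sequentially; actions
    # without a duration (discrete faults) contribute nothing.
    return sum(seconds(a["duration"]) for a in branch["actions"] if "duration" in a)


def experiment_seconds(steps: list) -> int:
    # Steps are sequential (sum); branches inside a step are parallel (max).
    return sum(max(branch_seconds(b) for b in step["branches"]) for step in steps)
```

A three-minute delay step followed by a step with two parallel ten-minute faults comes out at 780 seconds, which is the number to size pipeline timeouts and alert windows against.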
Here is a two-step experiment. Step 1 warms a steady-state observation window with a delay; Step 2 runs CPU pressure and network latency in parallel against the same scale set instances.
{
"identity": { "type": "SystemAssigned" },
"location": "eastus",
"properties": {
"selectors": [
{
"id": "vmssAgentSelector",
"type": "List",
"targets": [
{
"id": "/subscriptions/<sub>/resourceGroups/rg-resilience-prod/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-web/providers/Microsoft.Chaos/targets/Microsoft-Agent",
"type": "ChaosTarget"
}
]
}
],
"steps": [
{
"name": "Step 1 - establish steady state",
"branches": [
{ "name": "warmup", "actions": [ { "type": "delay", "name": "urn:csci:microsoft:chaosStudio:timedDelay/1.0", "duration": "PT3M" } ] }
]
},
{
"name": "Step 2 - parallel pressure + latency",
"branches": [
{
"name": "cpu",
"actions": [
{
"type": "continuous",
"name": "urn:csci:microsoft:agent:cpuPressure/1.0",
"duration": "PT10M",
"selectorId": "vmssAgentSelector",
"parameters": [
{ "key": "pressureLevel", "value": "90" },
{ "key": "virtualMachineScaleSetInstances", "value": "[0,1]" }
]
}
]
},
{
"name": "latency",
"actions": [
{
"type": "continuous",
"name": "urn:csci:microsoft:agent:networkLatency/1.2",
"duration": "PT10M",
"selectorId": "vmssAgentSelector",
"parameters": [
{ "key": "latencyInMilliseconds", "value": "200" },
{ "key": "destinationFilters", "value": "[{\"address\":\"10.0.2.0\",\"subnetMask\":\"255.255.255.0\",\"portLow\":1433,\"portHigh\":1433}]" },
{ "key": "virtualMachineScaleSetInstances", "value": "[0,1]" }
]
}
]
}
]
}
]
}
}
Note the fault names are versioned URNs (urn:csci:microsoft:agent:cpuPressure/1.0). parameters values are always strings, even when they encode JSON arrays – that escaping trips people up constantly. Create it with:
az rest --method put \
--uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Chaos/experiments/exp-vmss-pressure?api-version=2023-11-01" \
--body @experiment.json
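The string-only rule for parameter values is less error-prone when a script does the encoding. A short Python sketch, assuming you author parameters as a plain dict and let json.dumps produce the text; to_parameters is a hypothetical helper name.

```python
import json


def to_parameters(values: dict) -> list:
    """Turn a plain dict into the [{key, value}] list with string-only values."""
    params = []
    for key, value in values.items():
        if isinstance(value, str):
            text = value
        else:
            # Numbers, booleans, lists and dicts become compact JSON text.
            text = json.dumps(value, separators=(",", ":"))
        params.append({"key": key, "value": text})
    return params


if __name__ == "__main__":
    print(json.dumps(to_parameters({
        "pressureLevel": 90,
        "virtualMachineScaleSetInstances": [0, 1],
    }), indent=2))
```

Serializing the whole experiment with json.dumps afterwards applies the second layer of quote escaping for you, which is where hand-edited files usually go wrong.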
4. Network, VMSS shutdown, and AKS fault libraries
The faults I reach for most, with their exact URNs and required parameters:
| Fault | URN | Class | Key parameters |
|---|---|---|---|
| VMSS Shutdown | urn:csci:microsoft:virtualMachineScaleSet:shutdown/2.0 | service-direct | abruptShutdown (bool, optional) |
| Network Disconnect via Firewall | urn:csci:microsoft:agent:networkDisconnectViaFirewall/1.1 | agent-based | destinationFilters (array) |
| Network Latency | urn:csci:microsoft:agent:networkLatency/1.2 | agent-based | latencyInMilliseconds, destinationFilters/inboundDestinationFilters |
| CPU Pressure | urn:csci:microsoft:agent:cpuPressure/1.0 | agent-based | pressureLevel (1-99) |
| Kill Process | urn:csci:microsoft:agent:killProcess/1.0 | agent-based | processName, killIntervalInMilliseconds |
| AKS Chaos Mesh Pod | urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2 | service-direct | jsonSpec |
| AKS Chaos Mesh Network | urn:csci:microsoft:azureKubernetesServiceChaosMesh:networkChaos/2.2 | service-direct | jsonSpec |
Zone-loss simulation with VMSS shutdown
To validate zone-redundancy, shut down the instances in a single zone and confirm the app stays up on the survivors. Shutdown-2.0 supports dynamic targeting, so combine a List selector on the scale set's Microsoft-VirtualMachineScaleSet target with a filter on zones to pin the fault to zone-1 instances. Set abruptShutdown to true to model a hard power loss rather than a graceful drain – that is the failure you actually fear.
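One way to express the zone pin is a selector that carries a filter. The Python sketch below builds such a selector, assuming the Simple filter type with a zones parameter that the fault library documents for Shutdown 2.0; check that schema against the API version you deploy with. The helper name zone_selector and the selector id are hypothetical.

```python
import json


def zone_selector(selector_id: str, vmss_target_id: str, zones: list) -> dict:
    """Build a List selector on a VMSS chaos target, filtered to given zones."""
    return {
        "id": selector_id,
        "type": "List",
        "targets": [{"id": vmss_target_id, "type": "ChaosTarget"}],
        # Assumed schema: a Simple filter whose zones are given as strings.
        "filter": {
            "type": "Simple",
            "parameters": {"zones": [str(z) for z in zones]},
        },
    }


if __name__ == "__main__":
    target = ("/subscriptions/<sub>/resourceGroups/rg-resilience-prod/providers/"
              "Microsoft.Compute/virtualMachineScaleSets/vmss-web/providers/"
              "Microsoft.Chaos/targets/Microsoft-VirtualMachineScaleSet")
    print(json.dumps(zone_selector("zone1Selector", target, [1]), indent=2))
```

The output drops into the selectors array of the experiment, and the shutdown action references it through selectorId.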
AKS pod and network faults via Chaos Mesh
AKS faults are service-direct but delegate to Chaos Mesh, which must be installed on the cluster first. Chaos Studio drives it through the AKS control plane.
# One-time: install Chaos Mesh on a Linux node pool
az aks get-credentials --admin -g rg-aks -n aks-prod
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock
# Onboard the cluster as a service-direct Chaos Studio target + capability
AKS_ID="/subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-aks/providers/Microsoft.ContainerService/managedClusters/aks-prod"
az rest --method put \
--uri "https://management.azure.com${AKS_ID}/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh?api-version=2023-11-01" \
--body '{"properties":{}}'
az rest --method put \
--uri "https://management.azure.com${AKS_ID}/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/PodChaos-2.2?api-version=2023-11-01" \
--body '{"properties":{}}'
The jsonSpec parameter is the spec block of a Chaos Mesh CRD, flattened and minified to JSON. Take a PodChaos YAML, strip everything outside spec, drop duration (Chaos Studio supplies it), and convert. This makes half of the pods carrying app: checkout in the payments namespace unavailable for the duration of the fault:
{
"type": "continuous",
"name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2",
"duration": "PT5M",
"selectorId": "aksSelector",
"parameters": [
{ "key": "jsonSpec", "value": "{\"action\":\"pod-failure\",\"mode\":\"fixed-percent\",\"value\":\"50\",\"selector\":{\"namespaces\":[\"payments\"],\"labelSelectors\":{\"app\":\"checkout\"}}}" }
]
}
For a NetworkChaos fault (partition, delay, loss), swap the URN to .../networkChaos/2.2 and supply the corresponding Chaos Mesh NetworkChaos spec. The mode: fixed-percent with value: 50 is itself a blast-radius control – you take down half the matched pods, not all of them.
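The manifest-to-jsonSpec conversion is mechanical, so I prefer to script it rather than hand-minify. A minimal Python sketch, assuming the manifest has already been loaded into a dict (for example with a YAML parser of your choice); to_json_spec and the sample manifest's metadata values are hypothetical.

```python
import json


def to_json_spec(manifest: dict) -> str:
    """Keep only `spec`, drop `duration`, and minify to a JSON string."""
    spec = {k: v for k, v in manifest["spec"].items() if k != "duration"}
    return json.dumps(spec, separators=(",", ":"))


pod_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "checkout-failure", "namespace": "chaos-testing"},
    "spec": {
        "action": "pod-failure",
        "mode": "fixed-percent",
        "value": "50",
        "duration": "5m",  # removed: the experiment action carries the duration
        "selector": {
            "namespaces": ["payments"],
            "labelSelectors": {"app": "checkout"},
        },
    },
}

if __name__ == "__main__":
    print(to_json_spec(pod_chaos))
```

The returned string is the unescaped form of the jsonSpec value; serializing the surrounding experiment document adds the backslash escaping.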
5. Steady-state hypotheses and abort criteria for safety
A fault without a hypothesis is just vandalism. Chaos Studio does not have a native “assertion engine,” so the discipline lives in how you structure the run: define the steady state in your observability stack, gate the experiment on it before you inject, and wire an automated abort.
The pattern I enforce:
- Pre-flight gate. Before the first fault is injected, the pipeline queries Azure Monitor for the steady-state metric. If the system is already unhealthy, abort – never inject chaos into a sick system.
- Abort criteria as an alert + automation. Create a metric alert (e.g. availability < 99% over 1 minute) whose action group invokes an Azure Function/Logic App that calls the experiment cancel API.
# Emergency stop -- the single most important command to have wired and tested
az rest --method post \
--uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Chaos/experiments/exp-vmss-pressure/cancel?api-version=2023-11-01"
Have this on a hotkey, in the runbook, and wired to an alert. When a network-disconnect fault accidentally severs the path the agent itself uses to receive the “stop” signal, your only recourse is the service-direct cancel plus rolling the VMSS model – so test your abort path on a non-prod target before you ever touch prod.
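The pre-flight gate and the abort rule are the same comparison applied at two different moments. A small Python sketch of that decision, using the example thresholds from the introduction (p99 under 400 ms, error rate under 1%); the Hypothesis class and decide function are hypothetical names, and fetching the metrics is left to your monitoring client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Steady-state bounds; defaults mirror the article's example hypothesis."""
    max_p99_ms: float = 400.0
    max_error_rate_pct: float = 1.0


def holds(h: Hypothesis, p99_ms: float, error_rate_pct: float) -> bool:
    """True while both observed signals are inside the hypothesis bounds."""
    return p99_ms < h.max_p99_ms and error_rate_pct < h.max_error_rate_pct


def decide(h: Hypothesis, phase: str, p99_ms: float, error_rate_pct: float) -> str:
    """Return 'proceed', 'skip' (unhealthy before injection) or 'cancel' (breach mid-run)."""
    if phase not in ("preflight", "running"):
        raise ValueError(f"unknown phase: {phase!r}")
    if holds(h, p99_ms, error_rate_pct):
        return "proceed"
    return "skip" if phase == "preflight" else "cancel"
```

A "cancel" result is the point at which the automation should call the cancel endpoint shown above.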
6. Blast-radius control with selectors and time-boxing
Blast radius is controlled on four axes – enforce all of them:
- Scope (selectors). virtualMachineScaleSetInstances: [0,1] hits two instances, not the fleet. For AKS, mode: fixed-percent / value: 50 and tight labelSelectors bound the pod set. Never use mode: all in production.
- Time (duration). Every continuous fault carries an ISO 8601 duration (PT10M). The fault self-terminates – there is no “infinite” chaos. Keep prod runs short; you are sampling, not load-testing.
- Identity (RBAC). From section 2: the experiment identity holds the minimum role, scoped to the resource. It physically cannot touch anything you did not grant.
- Targeting (onboarding). Only explicitly-onboarded targets with explicitly-enabled capabilities are reachable. The intersection of “onboarded” and “RBAC-granted” is your hard ceiling.
A useful convention: separate Chaos Studio resource groups per environment (rg-chaos-staging, rg-chaos-prod) with distinct RBAC, and graduate experiments staging -> prod only after they pass clean in staging. The blast radius of a mistake in authoring is then confined to staging.
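Graduation from staging to prod is a natural place for an automated lint on the scope and time axes. The Python sketch below checks a single action for an over-long duration, a Chaos Mesh mode of all, and an agent fault that does not pin scale set instances. The ten-minute ceiling and the function name lint_action are hypothetical policy choices, and the instance check only makes sense when the target is a scale set.

```python
import json
import re

_ISO = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def _seconds(duration: str) -> int:
    """Parse the PT#H#M#S subset of ISO 8601 durations into seconds."""
    m = _ISO.match(duration)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    h, mi, s = (int(g or 0) for g in m.groups())
    return h * 3600 + mi * 60 + s


def lint_action(action: dict, max_seconds: int = 600) -> list:
    """Return human-readable blast-radius findings for one fault action."""
    findings = []
    params = {p["key"]: p["value"] for p in action.get("parameters", [])}
    if action.get("type") == "continuous":
        if "duration" not in action:
            findings.append("continuous fault without a duration")
        elif _seconds(action["duration"]) > max_seconds:
            findings.append(f"duration exceeds {max_seconds}s")
    if "jsonSpec" in params and json.loads(params["jsonSpec"]).get("mode") == "all":
        findings.append("Chaos Mesh mode is 'all'")
    # Hypothetical policy for VMSS targets: agent faults must name instances.
    if ":agent:" in action.get("name", "") and "virtualMachineScaleSetInstances" not in params:
        findings.append("agent fault does not pin scale set instances")
    return findings
```

Run it over every action in the experiment JSON and fail the promotion if any list comes back non-empty.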
7. Integrating experiments into release pipelines as gates
The highest-value use of Chaos Studio is a resilience gate in the deployment pipeline: after deploying to a pre-prod/canary slice, run the experiment, assert steady state held, and only then promote. Here is the gate as an Azure DevOps stage.
- stage: ResilienceGate
  dependsOn: DeployCanary
  jobs:
    - job: ChaosExperiment
      steps:
        - task: AzureCLI@2
          displayName: "Run chaos experiment and gate on steady state"
          inputs:
            azureSubscription: "sc-resilience-prod"
            scriptType: bash
            scriptLocation: inlineScript
            inlineScript: |
              set -euo pipefail
              EXP="exp-vmss-pressure"
              BASE="https://management.azure.com/subscriptions/$(SUB)/resourceGroups/$(RG)/providers/Microsoft.Chaos/experiments/$EXP"
              # Start
              az rest --method post --uri "$BASE/start?api-version=2023-11-01"
              # Poll until terminal. The 2023-11-01 API lists runs under /executions;
              # value[0] is assumed to be the run just started, so verify the ordering.
              for i in $(seq 1 60); do
                STATUS=$(az rest --method get --uri "$BASE/executions?api-version=2023-11-01" \
                  --query "value[0].properties.status" -o tsv)
                echo "Experiment status: $STATUS"
                [[ "$STATUS" =~ ^(Success|Failed|Cancelled)$ ]] && break
                sleep 15
              done
              # Assert steady state held DURING the run via Azure Monitor (see section 8)
              FAILED=$(az monitor metrics list \
                --resource "$(APPGW_ID)" --metric "FailedRequests" \
                --aggregation Total --interval PT1M \
                --start-time "$(date -u -d '15 minutes ago' '+%Y-%m-%dT%H:%M:%SZ')" \
                --query "max(value[0].timeseries[0].data[].total)" -o tsv)
              echo "Peak failed requests during experiment: ${FAILED:-0}"
              if (( $(printf '%.0f' "${FAILED:-0}") > 50 )); then
                echo "##vso[task.logissue type=error]Steady-state breach -- failing the gate"
                exit 1
              fi
              [[ "$STATUS" == "Success" ]] || { echo "Experiment did not succeed"; exit 1; }
A red gate means the canary is less resilient than your bar, and promotion stops. This converts resilience from an annual game-day into a per-release regression check.
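The gate boils down to two pieces of logic: poll until a terminal status, then combine that status with the steady-state metric. A Python sketch of both, assuming the terminal states Success, Failed and Cancelled and the 50-failed-request ceiling used in the stage above; fetch_status is a hypothetical callable you would back with the REST call, which also keeps the logic testable without Azure.

```python
import time

TERMINAL = {"Success", "Failed", "Cancelled"}


def wait_for_terminal(fetch_status, attempts: int = 60, delay_s: float = 15.0,
                      sleep=time.sleep) -> str:
    """Poll fetch_status() until it returns a terminal state or attempts run out."""
    status = None
    for _ in range(attempts):
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(delay_s)
    raise TimeoutError(f"experiment still {status!r} after {attempts} polls")


def gate_passes(status: str, peak_failed_requests: float, limit: float = 50) -> bool:
    """Green only if the run succeeded and the steady-state ceiling held."""
    return status == "Success" and peak_failed_requests <= limit
```

Raising on timeout instead of falling through makes a hung experiment fail loudly rather than leaving the last non-terminal status to be interpreted downstream.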
8. Observability correlation with Azure Monitor during runs
An experiment is only as good as your ability to see the blast. Chaos Studio emits experiment lifecycle events, but the signal you correlate against lives in Azure Monitor metrics, Log Analytics, and Application Insights. Two correlation techniques:
Time-window overlay. Every experiment run has a precise start/stop timestamp. Pull the run window and overlay it on your golden-signal dashboards. In a Log Analytics workbook, this KQL surfaces error-rate and latency for the App Gateway behind the workload, bucketed so you can line it up against the fault window:
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS"
| where Category == "ApplicationGatewayAccessLog"
| where TimeGenerated between (datetime(2026-06-08T10:00:00Z) .. datetime(2026-06-08T10:20:00Z))
| summarize
p99_latency_ms = percentile(timeTaken_d * 1000, 99),
error_rate = 100.0 * countif(httpStatus_d >= 500) / count()
by bin(TimeGenerated, 30s)
| order by TimeGenerated asc
AKS active pod count. For pod-failure experiments, the cleanest live signal is the container-insights pod count – you should watch it drop and recover within the fault window, then return to baseline:
KubePodInventory
| where ClusterName == "aks-prod" and Namespace == "payments"
| where TimeGenerated between (datetime(2026-06-08T10:05:00Z) .. datetime(2026-06-08T10:15:00Z))
| summarize ReadyPods = dcountif(Name, PodStatus == "Running") by bin(TimeGenerated, 1m)
| render timechart
The recovery curve is the resilience evidence. If pods drop and the survivors absorb the load with error rate flat, the hypothesis holds. If error rate spikes and stays elevated after the fault ends, you have found a recovery bug – exactly the class of defect that turns a transient blip into a multi-hour outage in production.
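To turn the recovery curve into a number you can track release over release, measure how long the error rate takes to settle after the fault ends. A minimal Python sketch, assuming you have exported the bucketed error-rate series as (timestamp, percent) pairs; the function name, the half-point tolerance and the definition of recovered (this bucket and all later ones back under the limit) are my assumptions.

```python
from datetime import datetime, timedelta


def recovery_seconds(buckets, fault_end: datetime, baseline_pct: float,
                     tolerance_pct: float = 0.5):
    """Seconds from fault_end until the error rate is back within baseline + tolerance.

    buckets: iterable of (bucket_start, error_rate_pct) pairs in any order.
    Returns None if the series never settles after the fault ends.
    """
    limit = baseline_pct + tolerance_pct
    after = sorted((t, e) for t, e in buckets if t >= fault_end)
    for i, (t, _) in enumerate(after):
        # Recovered means this bucket and every later one stay under the limit.
        if all(rate <= limit for _, rate in after[i:]):
            return (t - fault_end).total_seconds()
    return None


if __name__ == "__main__":
    end = datetime(2026, 6, 8, 10, 15)
    series = [(end + timedelta(seconds=30 * i), r)
              for i, r in enumerate([6.0, 3.0, 0.4, 0.2])]
    print(recovery_seconds(series, end, baseline_pct=0.2))
```

A None result is the recovery-bug signal described above: the fault is over and the system has not come back.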
Verify
Confirm the whole pipeline is sound before trusting any result:
- az rest --method get --uri ".../Microsoft.Chaos/targets/Microsoft-Agent?api-version=2023-11-01" returns an agentProfileId – the target is onboarded.
- kubectl get po -n chaos-testing shows chaos-controller-manager and chaos-daemon pods Running (AKS path).
- Run the experiment once against a staging target and confirm the execution status reaches Success, not Failed (a Failed status with “permission” in the detail means RBAC on the experiment identity is wrong).
- During that run, confirm the fault actually fired: CPU pressure should show as a CPU spike in VM metrics; pod-failure as a dip in KubePodInventory ready count. No movement means the fault did not land.
- Trigger your abort alert manually and confirm the experiment moves to Cancelled within your SLA.
- Inspect post-run: every metric returned to baseline. Lingering degradation is a finding, not noise.
Enterprise scenario
A payments platform team I worked with ran their checkout service on a zone-redundant AKS cluster fronted by an internal Standard Load Balancer, and every architecture review asserted “we survive a zone failure.” Their constraint: they could not actually fail a production zone to prove it, and a prior manual game-day had been called off when an engineer accidentally cordoned nodes across all three zones at once and triggered a real partial outage. Leadership banned ad-hoc chaos. They needed proof without the risk.
We rebuilt it as a tightly-bounded Chaos Studio experiment. Rather than a brute zone shutdown, we modeled the symptom of a zone going dark from the pods’ perspective: a NetworkChaos partition isolating exactly the checkout pods scheduled in one zone. The selector pairs the app label with the zone of the hosting node, so mode: all covers only that matched subset – never every checkout pod in the cluster. The blast radius was fixed by selector, the duration time-boxed to five minutes, and a metric alert on the load balancer’s health probe status wired to a Logic App that called the experiment cancel endpoint if availability dropped below 99.5%.
{
"type": "continuous",
"name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:networkChaos/2.2",
"duration": "PT5M",
"selectorId": "aksSelector",
"parameters": [
{ "key": "jsonSpec", "value": "{\"action\":\"partition\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"payments\"],\"labelSelectors\":{\"app\":\"checkout\"},\"nodeSelectors\":{\"topology.kubernetes.io/zone\":\"eastus-1\"}},\"direction\":\"both\"}" }
]
}
The first run found the gap immediately: their PodDisruptionBudget allowed too many concurrent evictions, and connection-draining on the load balancer was misconfigured, so in-flight checkout requests to the partitioned zone returned 502s for roughly 40 seconds instead of failing over cleanly. None of this was visible in steady state – the dashboards were green every single day. They fixed the PDB and probe timing, re-ran the experiment in the release pipeline as a gate, and the second run held steady state flat. The experiment now runs on every deploy to the checkout service, and zone-resilience stopped being a slide and became a green check in CI.