Terraform Module: Azure Monitor Metric Alert — static & dynamic thresholds with action-group routing

Quick take — A reusable hashicorp/azurerm module for azurerm_monitor_metric_alert: multi-criteria static and dynamic-threshold metric alerts, dimension splitting, and action-group wiring for production-grade Azure monitoring. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal configuration: drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "monitor_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "..."           # Name of the metric alert rule.
  resource_group_name = "..."           # Resource group for the alert (may differ from the target's RG).
  scopes              = ["...", "..."]  # Resource IDs the alert evaluates; multiple require type/location.
}

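Then run terraform init && terraform apply. You also need to set static_criteria or dynamic_criteria, because the underlying resource requires a criterion; every other input has a sensible default. See Inputs below to override behaviour.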
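One caveat on the minimal call: azurerm_monitor_metric_alert expects a criterion, and a scopes list with more than one ID also needs target_resource_type and target_resource_location. A fuller sketch, assuming two virtual machines in the same region and the platform metric Percentage CPU (the region, threshold, and action group placeholder are illustrative assumptions, not values taken from the module):

```hcl
module "monitor_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "..."
  resource_group_name = "..."
  scopes              = ["...", "..."]

  # Needed because scopes lists more than one resource.
  target_resource_type     = "Microsoft.Compute/virtualMachines"
  target_resource_location = "westeurope" # assumed region

  # One static criterion; the threshold is only an example value.
  static_criteria = [{
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 80
  }]

  # Without an action group the alert fires but notifies nobody.
  action_group_ids = ["..."]
}
```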

What this module is

An Azure Monitor metric alert continuously evaluates a platform or custom metric (CPU percentage on a VM, 5xx rate on an App Service, available memory on AKS nodes, queue depth on a Service Bus) against a threshold over a sliding window, and fires an action — page an on-call engineer, post to a webhook, or trigger an Azure Function — when the condition is breached. The underlying resource is azurerm_monitor_metric_alert, and while a single alert looks trivial in the portal, doing it consistently in code is where teams stumble: every alert needs the right scopes, a correct metric_namespace, the proper aggregation, a sensible frequency/window_size pair, and one or more action blocks pointing at action groups.

Wrapping it in a module gives you one place to encode those decisions. This module supports both static-threshold criteria (criteria blocks with an explicit threshold) and dynamic-threshold criteria (dynamic_criteria, which machine-learns a baseline so you don’t hand-tune numbers), lets you split a single alert by dimensions (e.g. one alert that evaluates per InstanceId or per HTTP status code), and wires up action groups plus an optional webhook_properties payload. The result is that a metric alert becomes a 12-line module call instead of a 60-line resource that every team copy-pastes and subtly gets wrong.

When to use it

You manage more than a handful of metric alerts and want naming, severity, and action-group routing standardised across subscriptions.
You need dynamic thresholds for spiky or seasonal workloads where a fixed number produces constant false positives.
You want a single alert to fan out per dimension (per instance, per region, per response code) rather than maintaining one resource per value.
You are codifying an SRE alerting catalogue and want each alert reviewed in a PR with its threshold, window, and severity visible in the diff.

Reach for a different tool when you need log-based alerting (use azurerm_monitor_scheduled_query_rules_alert_v2 against Log Analytics), activity-log alerts on control-plane events like resource deletion (azurerm_monitor_activity_log_alert), or Prometheus-style rules on Azure Monitor managed Prometheus (azurerm_monitor_alert_prometheus_rule_group).

Module structure

terraform-module-azure-monitor-alert/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

resource "azurerm_monitor_metric_alert" "this" {
  name                = var.name
  resource_group_name = var.resource_group_name
  scopes              = var.scopes
  description         = var.description

  enabled       = var.enabled
  auto_mitigate = var.auto_mitigate
  severity      = var.severity
  frequency     = var.frequency
  window_size   = var.window_size

  # Required only when scopes span multiple resources of the same type.
  target_resource_type     = var.target_resource_type
  target_resource_location = var.target_resource_location

  # Static-threshold criteria (one or more).
  dynamic "criteria" {
    for_each = var.static_criteria
    content {
      metric_namespace       = criteria.value.metric_namespace
      metric_name            = criteria.value.metric_name
      aggregation            = criteria.value.aggregation
      operator               = criteria.value.operator
      threshold              = criteria.value.threshold
      skip_metric_validation = try(criteria.value.skip_metric_validation, false)

      dynamic "dimension" {
        for_each = try(criteria.value.dimensions, [])
        content {
          name     = dimension.value.name
          operator = dimension.value.operator
          values   = dimension.value.values
        }
      }
    }
  }

  # Dynamic-threshold criteria (machine-learned baseline). At most one.
  dynamic "dynamic_criteria" {
    for_each = var.dynamic_criteria == null ? [] : [var.dynamic_criteria]
    content {
      metric_namespace  = dynamic_criteria.value.metric_namespace
      metric_name       = dynamic_criteria.value.metric_name
      aggregation       = dynamic_criteria.value.aggregation
      operator          = dynamic_criteria.value.operator
      alert_sensitivity = dynamic_criteria.value.alert_sensitivity

      evaluation_total_count   = try(dynamic_criteria.value.evaluation_total_count, 4)
      evaluation_failure_count = try(dynamic_criteria.value.evaluation_failure_count, 4)
      ignore_data_before       = try(dynamic_criteria.value.ignore_data_before, null)
      skip_metric_validation   = try(dynamic_criteria.value.skip_metric_validation, false)

      dynamic "dimension" {
        for_each = try(dynamic_criteria.value.dimensions, [])
        content {
          name     = dimension.value.name
          operator = dimension.value.operator
          values   = dimension.value.values
        }
      }
    }
  }

  dynamic "action" {
    for_each = var.action_group_ids
    content {
      action_group_id    = action.value
      webhook_properties = var.webhook_properties
    }
  }

  tags = var.tags
}
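Nothing in main.tf stops a caller from passing both static_criteria and dynamic_criteria, or neither. As far as I understand the provider, the resource accepts only one kind of criterion, so a guard along these lines could surface the mistake with a clearer message. This is a sketch of an addition, not part of the published module:

```hcl
resource "azurerm_monitor_metric_alert" "this" {
  # ... arguments and dynamic blocks exactly as in main.tf ...

  lifecycle {
    # True when exactly one of the two inputs is populated (boolean XOR).
    precondition {
      condition     = (length(var.static_criteria) > 0) != (var.dynamic_criteria != null)
      error_message = "Set either static_criteria or dynamic_criteria, but not both and not neither."
    }
  }
}
```

Preconditions are available from Terraform 1.2, so this fits within the module's required_version of >= 1.5.0.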

variables.tf

variable "name" {
  type        = string
  description = "Name of the metric alert rule."
}

variable "resource_group_name" {
  type        = string
  description = "Resource group in which to create the alert (does not need to match the target's RG)."
}

variable "scopes" {
  type        = list(string)
  description = "Resource IDs the alert evaluates. Multiple IDs require target_resource_type/location."

  validation {
    condition     = length(var.scopes) > 0
    error_message = "At least one scope (resource ID) must be provided."
  }
}

variable "description" {
  type        = string
  description = "Human-readable description shown in the alert payload and portal."
  default     = "Managed by Terraform."
}

variable "enabled" {
  type        = bool
  description = "Whether the alert rule is enabled."
  default     = true
}

variable "auto_mitigate" {
  type        = bool
  description = "Resolve the alert automatically when the condition clears."
  default     = true
}

variable "severity" {
  type        = number
  description = "Alert severity: 0 (Critical) to 4 (Verbose)."
  default     = 3

  validation {
    condition     = var.severity >= 0 && var.severity <= 4
    error_message = "severity must be between 0 (Critical) and 4 (Verbose)."
  }
}

variable "frequency" {
  type        = string
  description = "How often the metric is evaluated (ISO 8601): PT1M, PT5M, PT15M, PT30M, PT1H."
  default     = "PT1M"

  validation {
    condition     = contains(["PT1M", "PT5M", "PT15M", "PT30M", "PT1H"], var.frequency)
    error_message = "frequency must be one of PT1M, PT5M, PT15M, PT30M, PT1H."
  }
}

variable "window_size" {
  type        = string
  description = "Lookback window over which the metric is aggregated (ISO 8601). Must be >= frequency."
  default     = "PT5M"

  validation {
    condition = contains(
      ["PT1M", "PT5M", "PT15M", "PT30M", "PT1H", "PT6H", "PT12H", "P1D"],
      var.window_size
    )
    error_message = "window_size must be one of PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H, P1D."
  }
}

variable "target_resource_type" {
  type        = string
  description = "ARM type (e.g. Microsoft.Compute/virtualMachines). Required for multi-resource scopes."
  default     = null
}

variable "target_resource_location" {
  type        = string
  description = "Azure region of the targets. Required for multi-resource scopes."
  default     = null
}

variable "static_criteria" {
  description = "One or more static-threshold criteria. Leave empty to use only dynamic_criteria."
  type = list(object({
    metric_namespace       = string
    metric_name            = string
    aggregation            = string # Average, Minimum, Maximum, Total, Count
    operator               = string # Equals, NotEquals, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual
    threshold              = number
    skip_metric_validation = optional(bool, false)
    dimensions = optional(list(object({
      name     = string
      operator = string # Include, Exclude, StartsWith
      values   = list(string)
    })), [])
  }))
  default = []

  validation {
    condition = alltrue([
      for c in var.static_criteria :
      contains(["Average", "Minimum", "Maximum", "Total", "Count"], c.aggregation)
    ])
    error_message = "Each static_criteria.aggregation must be Average, Minimum, Maximum, Total, or Count."
  }
}

variable "dynamic_criteria" {
  description = "Optional single dynamic-threshold criterion (machine-learned baseline)."
  type = object({
    metric_namespace         = string
    metric_name              = string
    aggregation              = string
    operator                 = string # GreaterThan, LessThan, GreaterOrLessThan
    alert_sensitivity        = string # Low, Medium, High
    evaluation_total_count   = optional(number, 4)
    evaluation_failure_count = optional(number, 4)
    ignore_data_before       = optional(string)
    skip_metric_validation   = optional(bool, false)
    dimensions = optional(list(object({
      name     = string
      operator = string
      values   = list(string)
    })), [])
  })
  default = null

  validation {
    condition = var.dynamic_criteria == null ? true : contains(
      ["Low", "Medium", "High"], var.dynamic_criteria.alert_sensitivity
    )
    error_message = "dynamic_criteria.alert_sensitivity must be Low, Medium, or High."
  }
}

variable "action_group_ids" {
  type        = list(string)
  description = "Action group resource IDs notified when the alert fires."
  default     = []
}

variable "webhook_properties" {
  type        = map(string)
  description = "Custom key/value pairs appended to every action group webhook payload."
  default     = {}
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the alert rule."
  default     = {}
}
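The window_size description says the window must be at least as long as frequency, but the validation blocks above only check each value on its own. One way to enforce the relationship, sketched here as an assumption rather than as part of the module, is a lookup map plus a precondition on the resource (cross-variable references inside a variable validation block need a newer Terraform than the 1.5.0 floor):

```hcl
locals {
  # Minutes for every ISO 8601 duration the two variables accept.
  duration_minutes = {
    PT1M  = 1
    PT5M  = 5
    PT15M = 15
    PT30M = 30
    PT1H  = 60
    PT6H  = 360
    PT12H = 720
    P1D   = 1440
  }
}

resource "azurerm_monitor_metric_alert" "this" {
  # ... arguments exactly as in main.tf ...

  lifecycle {
    precondition {
      condition     = local.duration_minutes[var.window_size] >= local.duration_minutes[var.frequency]
      error_message = "window_size must be greater than or equal to frequency."
    }
  }
}
```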

outputs.tf

output "id" {
  description = "Resource ID of the metric alert rule."
  value       = azurerm_monitor_metric_alert.this.id
}

output "name" {
  description = "Name of the metric alert rule."
  value       = azurerm_monitor_metric_alert.this.name
}

output "severity" {
  description = "Configured severity (0-4) of the alert rule."
  value       = azurerm_monitor_metric_alert.this.severity
}

output "scopes" {
  description = "Resource IDs the alert evaluates."
  value       = azurerm_monitor_metric_alert.this.scopes
}

output "enabled" {
  description = "Whether the alert rule is currently enabled."
  value       = azurerm_monitor_metric_alert.this.enabled
}

How to use it

A static-threshold alert that pages on-call when an App Service returns more than 10 HTTP 5xx responses within a five-minute window, split per instance:

module "monitor_metric_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "alert-web-prod-http5xx"
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [azurerm_linux_web_app.api.id]
  description         = "Production API sustaining elevated HTTP 5xx responses."
  severity            = 1
  frequency           = "PT1M"
  window_size         = "PT5M"

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10
    dimensions = [{
      name     = "Instance"
      operator = "Include"
      values   = ["*"]
    }]
  }]

  action_group_ids   = [azurerm_monitor_action_group.oncall.id]
  webhook_properties = { team = "platform", runbook = "RB-API-5XX" }

  tags = { env = "prod", workload = "api" }
}

# Downstream reference: feed the alert ID into an Azure Policy / dashboard
# or assert it exists in a smoke test using the module output.
output "api_5xx_alert_id" {
  value = module.monitor_metric_alert.id
}

A dynamic-threshold variant for a metric with no obvious fixed limit (e.g. message ingress that varies by time of day) simply swaps static_criteria for dynamic_criteria:

dynamic_criteria = {
  metric_namespace  = "Microsoft.ServiceBus/namespaces"
  metric_name       = "IncomingMessages"
  aggregation       = "Total"
  operator          = "GreaterThan"
  alert_sensitivity = "Medium"
}
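The optional fields of dynamic_criteria control how eagerly the learned baseline fires. A sketch with illustrative values (the 3-of-4 rule and the cutover date are assumptions chosen for the example):

```hcl
dynamic_criteria = {
  metric_namespace  = "Microsoft.ServiceBus/namespaces"
  metric_name       = "IncomingMessages"
  aggregation       = "Total"
  operator          = "GreaterThan"
  alert_sensitivity = "Medium"

  # Fire only when 3 of the last 4 evaluated windows breach the baseline.
  evaluation_total_count   = 4
  evaluation_failure_count = 3

  # Hypothetical cutover date: data before it is excluded from the baseline.
  ignore_data_before = "2024-01-01T00:00:00Z"
}
```

Lowering evaluation_failure_count below the module default of 4 makes the alert react sooner, at the cost of more noise on bursty metrics.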

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm storage account/container + key per path...
  }
}
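The root config above only shows the backend. If you also want the provider defined once, Terragrunt's generate block can write it into every module directory. A minimal sketch, assuming the same empty features block as the Quickstart:

```hcl
# live/terragrunt.hcl (continued)
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  features {}
}
EOF
}
```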

2. Module config — live/prod/monitor_alert/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
}

inputs = {
  name                = "..."
  resource_group_name = "..."
  scopes              = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/monitor_alert && terragrunt apply   # this module only
cd .. && terragrunt run-all apply                # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
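As an illustration of per-environment overrides, a dev copy of the module config might relax severity and cadence while reusing the same source. The path, threshold, and values below are hypothetical:

```hcl
# live/dev/monitor_alert/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
}

inputs = {
  name                = "..."
  resource_group_name = "..."
  scopes              = ["..."]

  # Dev is informational: lower severity, slower cadence, looser threshold.
  severity    = 3
  frequency   = "PT5M"
  window_size = "PT15M"

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 50
  }]
}
```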

Inputs

Name	Type	Default	Required	Description
`name`	`string`	—	Yes	Name of the metric alert rule.
`resource_group_name`	`string`	—	Yes	Resource group for the alert (may differ from the target’s RG).
`scopes`	`list(string)`	—	Yes	Resource IDs the alert evaluates; multiple require type/location.
`description`	`string`	`"Managed by Terraform."`	No	Description shown in payload and portal.
`enabled`	`bool`	`true`	No	Whether the rule is enabled.
`auto_mitigate`	`bool`	`true`	No	Auto-resolve when the condition clears.
`severity`	`number`	`3`	No	Severity 0 (Critical) to 4 (Verbose).
`frequency`	`string`	`"PT1M"`	No	Evaluation frequency (ISO 8601).
`window_size`	`string`	`"PT5M"`	No	Aggregation lookback window (>= frequency).
`target_resource_type`	`string`	`null`	No	ARM type; required for multi-resource scopes.
`target_resource_location`	`string`	`null`	No	Region of targets; required for multi-resource scopes.
`static_criteria`	`list(object)`	`[]`	No	Static-threshold criteria with optional dimension splitting.
`dynamic_criteria`	`object`	`null`	No	Single machine-learned dynamic-threshold criterion.
`action_group_ids`	`list(string)`	`[]`	No	Action group IDs notified on fire.
`webhook_properties`	`map(string)`	`{}`	No	Custom key/value pairs added to webhook payloads.
`tags`	`map(string)`	`{}`	No	Tags applied to the rule.

Outputs

Name	Description
`id`	Resource ID of the metric alert rule.
`name`	Name of the metric alert rule.
`severity`	Configured severity (0-4) of the alert rule.
`scopes`	Resource IDs the alert evaluates.
`enabled`	Whether the alert rule is currently enabled.

Enterprise scenario

A retail platform runs its checkout API across three regional App Service plans behind Front Door. The SRE team uses this module from a for_each map to stamp out one Http5xx static alert (severity 1, dimension-split per instance) and one IncomingMessages dynamic-threshold alert per region, all routed to a tiered action group that pages PagerDuty for severity 0-1 and posts to a Teams channel for the rest. Because thresholds, windows, and webhook_properties (carrying the runbook ID) live in version control, an auditor can trace every production page back to a reviewed pull request, and Black Friday capacity changes are a one-line edit to the alert catalogue.
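A sketch of how the per-region stamping could look with for_each on the module call. It assumes a resource azurerm_linux_web_app.checkout that is itself created with for_each and keyed by region name, and it reuses the resource group and action group references from the usage example; none of these names come from the module itself:

```hcl
module "checkout_http5xx" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
  for_each = azurerm_linux_web_app.checkout # assumed: one web app per region

  name                = "alert-checkout-${each.key}-http5xx"
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [each.value.id]
  severity            = 1

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10
    dimensions = [{
      name     = "Instance"
      operator = "Include"
      values   = ["*"]
    }]
  }]

  action_group_ids   = [azurerm_monitor_action_group.oncall.id]
  webhook_properties = { region = each.key, runbook = "RB-API-5XX" }
}
```

Adding a region then means adding one key to the web app map; the alert follows automatically.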

Best practices

Match window_size to the metric’s cadence. A PT1M frequency with a PT5M window smooths transient spikes; never set window_size smaller than frequency, and widen the window for low-volume metrics so a single data point can’t trip the alert.
Prefer dynamic thresholds for variable workloads. For seasonal or bursty signals, dynamic_criteria with alert_sensitivity = "Medium" reduces the retuning that static thresholds demand and cuts alert fatigue; reserve static thresholds for hard limits (e.g. disk > 90%).
Reserve severity 0-1 for paging. Map severity to action-group routing so only genuine customer-impacting conditions wake someone; default informational alerts to severity 3-4 and route them to chat, not a pager.
Mind alert-rule cost and dimension fan-out. Each evaluated time series is billed; a wildcard Include * dimension on a high-cardinality metric multiplies cost and noise — scope dimension values deliberately rather than always using ["*"].
Keep metric_namespace/metric_name exact and let validation run. Leave skip_metric_validation = false so a misspelled metric is rejected when the rule is created instead of producing an alert that never fires; only enable it for custom metrics that don’t yet exist when the alert is created.
Name for routing, tag for ownership. Encode workload, environment, and signal in the name (alert-web-prod-http5xx) and carry env/workload/team in tags and webhook_properties so downstream automation and on-call dashboards can filter without parsing free text.
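The severity-to-routing advice can be encoded in the calling configuration rather than left to convention. A minimal sketch, where the two action group variables and the cut-off at severity 1 are assumptions made for illustration:

```hcl
variable "severity" { type = number }
variable "pager_action_group_id" { type = string } # hypothetical: paging group
variable "chat_action_group_id" { type = string }  # hypothetical: chat group

locals {
  # Severity 0-1 pages on-call; everything else goes to chat.
  action_group_ids = var.severity <= 1 ? [var.pager_action_group_id] : [var.chat_action_group_id]
}

module "monitor_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "..."
  resource_group_name = "..."
  scopes              = ["..."]
  severity            = var.severity
  action_group_ids    = local.action_group_ids

  # static_criteria or dynamic_criteria omitted here for brevity.
}
```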

Written by Vinod
