Quick take — A reusable hashicorp/azurerm module for azurerm_monitor_metric_alert: multi-criteria static and dynamic-threshold metric alerts, dimension splitting, and action-group wiring for production-grade Azure monitoring. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "azurerm" {
features {}
}
module "monitor_alert" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
name = "..." # Name of the metric alert rule.
resource_group_name = "..." # Resource group for the alert (may differ from the targe…
scopes = ["...", "..."] # Resource IDs the alert evaluates; multiple require type…
}
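Then run terraform init && terraform apply. Check two things first. The azurerm provider requires every metric alert to carry a criterion, so set static_criteria or dynamic_criteria alongside the three required inputs. And azurerm 4.x needs a subscription ID, either as subscription_id in the provider block or through the ARM_SUBSCRIPTION_ID environment variable. If scopes lists more than one resource ID, also set target_resource_type and target_resource_location. Every remaining input has a sensible default; see Inputs below to override behaviour.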
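The provider rejects a metric alert that has no criterion, so a first deployment usually needs one more input than the three shown above. A minimal sketch, assuming a single virtual machine as the target; the alert name and resource group name are hypothetical, the resource ID stays a placeholder, and the 80% threshold is only an illustration:

```hcl
module "monitor_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "alert-vm-example-cpu"  # hypothetical name
  resource_group_name = "rg-monitoring-example" # hypothetical resource group
  scopes              = ["..."]                 # one resource ID, so no target_resource_type/location

  # One static criterion: average CPU above 80% across the default PT5M window.
  static_criteria = [{
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 80 # illustrative value, not a recommendation
  }]
}
```

Without action_group_ids the rule still evaluates and shows fired alerts in the portal, but nobody is notified.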
What this module is
An Azure Monitor metric alert continuously evaluates a platform or custom metric (CPU percentage on a VM, 5xx rate on an App Service, available memory on AKS nodes, queue depth on a Service Bus) against a threshold over a sliding window, and fires an action — page an on-call engineer, post to a webhook, or trigger an Azure Function — when the condition is breached. The underlying resource is azurerm_monitor_metric_alert, and while a single alert looks trivial in the portal, doing it consistently in code is where teams stumble: every alert needs the right scopes, a correct metric_namespace, the proper aggregation, a sensible frequency/window_size pair, and one or more action blocks pointing at action groups.
Wrapping it in a module gives you one place to encode those decisions. This module supports both static-threshold criteria (criteria blocks with an explicit threshold) and dynamic-threshold criteria (dynamic_criteria, which machine-learns a baseline so you don’t hand-tune numbers), lets you split a single alert by dimensions (e.g. one alert that evaluates per InstanceId or per HTTP status code), and wires up action groups plus an optional webhook_properties payload. The result is that a metric alert becomes a 12-line module call instead of a 60-line resource that every team copy-pastes and subtly gets wrong.
When to use it
- You manage more than a handful of metric alerts and want naming, severity, and action-group routing standardised across subscriptions.
- You need dynamic thresholds for spiky or seasonal workloads where a fixed number produces constant false positives.
- You want a single alert to fan out per dimension (per instance, per region, per response code) rather than maintaining one resource per value.
- You are codifying an SRE alerting catalogue and want each alert reviewed in a PR with its threshold, window, and severity visible in the diff.
Reach for a different tool when you need log-based alerting (use azurerm_monitor_scheduled_query_rules_alert_v2 against Log Analytics), activity-log alerts on control-plane events like resource deletion (azurerm_monitor_activity_log_alert), or Prometheus-style rules on Azure Monitor managed Prometheus (azurerm_monitor_alert_prometheus_rule_group).
Module structure
terraform-module-azure-monitor-alert/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
resource "azurerm_monitor_metric_alert" "this" {
name = var.name
resource_group_name = var.resource_group_name
scopes = var.scopes
description = var.description
enabled = var.enabled
auto_mitigate = var.auto_mitigate
severity = var.severity
frequency = var.frequency
window_size = var.window_size
# Required only when scopes span multiple resources of the same type.
target_resource_type = var.target_resource_type
target_resource_location = var.target_resource_location
# Static-threshold criteria (one or more).
dynamic "criteria" {
for_each = var.static_criteria
content {
metric_namespace = criteria.value.metric_namespace
metric_name = criteria.value.metric_name
aggregation = criteria.value.aggregation
operator = criteria.value.operator
threshold = criteria.value.threshold
skip_metric_validation = try(criteria.value.skip_metric_validation, false)
dynamic "dimension" {
for_each = try(criteria.value.dimensions, [])
content {
name = dimension.value.name
operator = dimension.value.operator
values = dimension.value.values
}
}
}
}
# Dynamic-threshold criteria (machine-learned baseline). At most one.
dynamic "dynamic_criteria" {
for_each = var.dynamic_criteria == null ? [] : [var.dynamic_criteria]
content {
metric_namespace = dynamic_criteria.value.metric_namespace
metric_name = dynamic_criteria.value.metric_name
aggregation = dynamic_criteria.value.aggregation
operator = dynamic_criteria.value.operator
alert_sensitivity = dynamic_criteria.value.alert_sensitivity
evaluation_total_count = try(dynamic_criteria.value.evaluation_total_count, 4)
evaluation_failure_count = try(dynamic_criteria.value.evaluation_failure_count, 4)
ignore_data_before = try(dynamic_criteria.value.ignore_data_before, null)
skip_metric_validation = try(dynamic_criteria.value.skip_metric_validation, false)
dynamic "dimension" {
for_each = try(dynamic_criteria.value.dimensions, [])
content {
name = dimension.value.name
operator = dimension.value.operator
values = dimension.value.values
}
}
}
}
dynamic "action" {
for_each = var.action_group_ids
content {
action_group_id = action.value
webhook_properties = var.webhook_properties
}
}
tags = var.tags
}
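The resource above renders both criteria kinds from independent inputs and accepts target_resource_type and target_resource_location as plain optionals, so an invalid combination only surfaces when the provider or Azure rejects it. A hedged sketch of a guard that could be merged into the resource block, assuming (as the provider documentation indicates) that a rule takes either static criteria or a dynamic criterion but not both. The lifecycle block and its messages are an addition for illustration, not part of the published module:

```hcl
resource "azurerm_monitor_metric_alert" "this" {
  # ... arguments and dynamic blocks exactly as in main.tf above ...

  lifecycle {
    # Exactly one criteria kind: the two booleans must differ (logical XOR).
    precondition {
      condition     = (length(var.static_criteria) > 0) != (var.dynamic_criteria != null)
      error_message = "Set either static_criteria or dynamic_criteria, not both and not neither."
    }

    # Multi-resource scopes need the target type and region.
    precondition {
      condition = length(var.scopes) == 1 || (
        var.target_resource_type != null && var.target_resource_location != null
      )
      error_message = "Multiple scopes require target_resource_type and target_resource_location."
    }
  }
}
```

Preconditions are evaluated as soon as the input values are known, which for literal inputs means during terraform plan, before any API call is made.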
variable "name" {
type = string
description = "Name of the metric alert rule."
}
variable "resource_group_name" {
type = string
description = "Resource group in which to create the alert (does not need to match the target's RG)."
}
variable "scopes" {
type = list(string)
description = "Resource IDs the alert evaluates. Multiple IDs require target_resource_type/location."
validation {
condition = length(var.scopes) > 0
error_message = "At least one scope (resource ID) must be provided."
}
}
variable "description" {
type = string
description = "Human-readable description shown in the alert payload and portal."
default = "Managed by Terraform."
}
variable "enabled" {
type = bool
description = "Whether the alert rule is enabled."
default = true
}
variable "auto_mitigate" {
type = bool
description = "Resolve the alert automatically when the condition clears."
default = true
}
variable "severity" {
type = number
description = "Alert severity: 0 (Critical) to 4 (Verbose)."
default = 3
validation {
condition = var.severity >= 0 && var.severity <= 4
error_message = "severity must be between 0 (Critical) and 4 (Verbose)."
}
}
variable "frequency" {
type = string
description = "How often the metric is evaluated (ISO 8601): PT1M, PT5M, PT15M, PT30M, PT1H."
default = "PT1M"
validation {
condition = contains(["PT1M", "PT5M", "PT15M", "PT30M", "PT1H"], var.frequency)
error_message = "frequency must be one of PT1M, PT5M, PT15M, PT30M, PT1H."
}
}
variable "window_size" {
type = string
description = "Lookback window over which the metric is aggregated (ISO 8601). Must be >= frequency."
default = "PT5M"
validation {
condition = contains(
["PT1M", "PT5M", "PT15M", "PT30M", "PT1H", "PT6H", "PT12H", "P1D"],
var.window_size
)
error_message = "window_size must be one of PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H, P1D."
}
}
variable "target_resource_type" {
type = string
description = "ARM type (e.g. Microsoft.Compute/virtualMachines). Required for multi-resource scopes."
default = null
}
variable "target_resource_location" {
type = string
description = "Azure region of the targets. Required for multi-resource scopes."
default = null
}
variable "static_criteria" {
description = "One or more static-threshold criteria. Leave empty to use only dynamic_criteria."
type = list(object({
metric_namespace = string
metric_name = string
aggregation = string # Average, Minimum, Maximum, Total, Count
operator = string # Equals, NotEquals, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual
threshold = number
skip_metric_validation = optional(bool, false)
dimensions = optional(list(object({
name = string
operator = string # Include, Exclude, StartsWith
values = list(string)
})), [])
}))
default = []
validation {
condition = alltrue([
for c in var.static_criteria :
contains(["Average", "Minimum", "Maximum", "Total", "Count"], c.aggregation)
])
error_message = "Each static_criteria.aggregation must be Average, Minimum, Maximum, Total, or Count."
}
}
variable "dynamic_criteria" {
description = "Optional single dynamic-threshold criterion (machine-learned baseline)."
type = object({
metric_namespace = string
metric_name = string
aggregation = string
operator = string # GreaterThan, LessThan, GreaterOrLessThan
alert_sensitivity = string # Low, Medium, High
evaluation_total_count = optional(number, 4)
evaluation_failure_count = optional(number, 4)
ignore_data_before = optional(string)
skip_metric_validation = optional(bool, false)
dimensions = optional(list(object({
name = string
operator = string
values = list(string)
})), [])
})
default = null
validation {
condition = var.dynamic_criteria == null ? true : contains(
["Low", "Medium", "High"], var.dynamic_criteria.alert_sensitivity
)
error_message = "dynamic_criteria.alert_sensitivity must be Low, Medium, or High."
}
}
variable "action_group_ids" {
type = list(string)
description = "Action group resource IDs notified when the alert fires."
default = []
}
variable "webhook_properties" {
type = map(string)
description = "Custom key/value pairs appended to every action group webhook payload."
default = {}
}
variable "tags" {
type = map(string)
description = "Tags applied to the alert rule."
default = {}
}
output "id" {
description = "Resource ID of the metric alert rule."
value = azurerm_monitor_metric_alert.this.id
}
output "name" {
description = "Name of the metric alert rule."
value = azurerm_monitor_metric_alert.this.name
}
output "severity" {
description = "Configured severity (0-4) of the alert rule."
value = azurerm_monitor_metric_alert.this.severity
}
output "scopes" {
description = "Resource IDs the alert evaluates."
value = azurerm_monitor_metric_alert.this.scopes
}
output "enabled" {
description = "Whether the alert rule is currently enabled."
value = azurerm_monitor_metric_alert.this.enabled
}
How to use it
A static-threshold alert that pages on-call when an App Service returns more than 10 HTTP 5xx responses within a five-minute window, split per instance:
module "monitor_metric_alert" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
name = "alert-web-prod-http5xx"
resource_group_name = azurerm_resource_group.monitoring.name
scopes = [azurerm_linux_web_app.api.id]
description = "Production API sustaining elevated HTTP 5xx responses."
severity = 1
frequency = "PT1M"
window_size = "PT5M"
static_criteria = [{
metric_namespace = "Microsoft.Web/sites"
metric_name = "Http5xx"
aggregation = "Total"
operator = "GreaterThan"
threshold = 10
dimensions = [{
name = "Instance"
operator = "Include"
values = ["*"]
}]
}]
action_group_ids = [azurerm_monitor_action_group.oncall.id]
webhook_properties = { team = "platform", runbook = "RB-API-5XX" }
tags = { env = "prod", workload = "api" }
}
# Downstream reference: feed the alert ID into an Azure Policy / dashboard
# or assert it exists in a smoke test using the module output.
output "api_5xx_alert_id" {
value = module.monitor_metric_alert.id
}
A dynamic-threshold variant for a metric with no obvious fixed limit (e.g. message ingress that varies by time of day) simply swaps static_criteria for dynamic_criteria:
dynamic_criteria = {
metric_namespace = "Microsoft.ServiceBus/namespaces"
metric_name = "IncomingMessages"
aggregation = "Total"
operator = "GreaterThan"
alert_sensitivity = "Medium"
}
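The same call shape extends to multi-resource scopes, where target_resource_type and target_resource_location become necessary. A minimal sketch, assuming two virtual machines in the same region; the azurerm_linux_virtual_machine resources, the westeurope region, and the 1 GiB threshold are illustrative assumptions rather than anything the module prescribes:

```hcl
module "vm_memory_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "alert-vm-prod-lowmem"
  resource_group_name = azurerm_resource_group.monitoring.name

  # Hypothetical VMs; all scopes must share one resource type and region.
  scopes = [
    azurerm_linux_virtual_machine.web_a.id,
    azurerm_linux_virtual_machine.web_b.id,
  ]
  target_resource_type     = "Microsoft.Compute/virtualMachines"
  target_resource_location = "westeurope"

  severity    = 2
  frequency   = "PT5M"
  window_size = "PT15M"

  static_criteria = [{
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Available Memory Bytes"
    aggregation      = "Average"
    operator         = "LessThan"
    threshold        = 1073741824 # 1 GiB, illustrative
  }]

  action_group_ids = [azurerm_monitor_action_group.oncall.id]
}
```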
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm state storage account/container + key per path...
}
}
2. Module config — live/prod/monitor_alert/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
}
inputs = {
name = "..."
resource_group_name = "..."
scopes = ["...", "..."]
}
3. Deploy this module on its own, or roll out every module in the environment together (both commands run from the repository root):
(cd live/prod/monitor_alert && terragrunt apply) # this module
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider configuration live in one place instead of being copy-pasted into every module; the inputs block is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
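Because run-all orders modules by their declared dependencies, the alert's action group can come from a sibling unit instead of a hard-coded ID. A sketch, assuming a hypothetical ../action_group unit whose module exposes an output named id; the dev directory, the severity, and the criterion values are likewise illustrative:

```hcl
# live/dev/monitor_alert/terragrunt.hcl (hypothetical dev environment)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
}

# Hypothetical sibling unit that creates the action group; Terragrunt
# applies it first and exposes its outputs to this unit.
dependency "action_group" {
  config_path = "../action_group"
}

inputs = {
  name                = "..."
  resource_group_name = "..."
  scopes              = ["..."]
  severity            = 3 # keep dev alerts out of the paging tier
  action_group_ids    = [dependency.action_group.outputs.id] # assumes an output named "id"

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 50 # looser than prod, illustrative
  }]
}
```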
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | n/a | Yes | Name of the metric alert rule. |
| resource_group_name | string | n/a | Yes | Resource group for the alert (may differ from the target’s RG). |
| scopes | list(string) | n/a | Yes | Resource IDs the alert evaluates; multiple require type/location. |
| description | string | "Managed by Terraform." | No | Description shown in payload and portal. |
| enabled | bool | true | No | Whether the rule is enabled. |
| auto_mitigate | bool | true | No | Auto-resolve when the condition clears. |
| severity | number | 3 | No | Severity 0 (Critical) to 4 (Verbose). |
| frequency | string | "PT1M" | No | Evaluation frequency (ISO 8601). |
| window_size | string | "PT5M" | No | Aggregation lookback window (>= frequency). |
| target_resource_type | string | null | No | ARM type; required for multi-resource scopes. |
| target_resource_location | string | null | No | Region of targets; required for multi-resource scopes. |
| static_criteria | list(object) | [] | No | Static-threshold criteria with optional dimension splitting. |
| dynamic_criteria | object | null | No | Single machine-learned dynamic-threshold criterion. |
| action_group_ids | list(string) | [] | No | Action group IDs notified on fire. |
| webhook_properties | map(string) | {} | No | Custom key/value pairs added to webhook payloads. |
| tags | map(string) | {} | No | Tags applied to the rule. |
Outputs
| Name | Description |
|---|---|
| id | Resource ID of the metric alert rule. |
| name | Name of the metric alert rule. |
| severity | Configured severity (0-4) of the alert rule. |
| scopes | Resource IDs the alert evaluates. |
| enabled | Whether the alert rule is currently enabled. |
Enterprise scenario
A retail platform runs its checkout API across three regional App Service plans behind Front Door. The SRE team uses this module from a for_each map to stamp out one Http5xx static alert (severity 1, dimension-split per instance) and one IncomingMessages dynamic-threshold alert per region, all routed to a tiered action group that pages PagerDuty for severity 0-1 and posts to a Teams channel for the rest. Because thresholds, windows, and webhook_properties (carrying the runbook ID) live in version control, an auditor can trace every production page back to a reviewed pull request, and Black Friday capacity changes are a one-line edit to the alert catalogue.
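The per-region stamping in this scenario could look roughly like the sketch below. It assumes the regional web apps are themselves created with for_each and keyed by a short region code; the azurerm_linux_web_app.checkout and azurerm_monitor_action_group.tiered resources, the naming pattern, and the runbook ID are hypothetical:

```hcl
module "checkout_http5xx" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  # Assumed: web apps declared with for_each, so each.key is the region code
  # and each.value is the web app object.
  for_each = azurerm_linux_web_app.checkout

  name                = "alert-checkout-${each.key}-http5xx"
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [each.value.id]
  severity            = 1

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10
    dimensions = [{
      name     = "Instance"
      operator = "Include"
      values   = ["*"]
    }]
  }]

  action_group_ids   = [azurerm_monitor_action_group.tiered.id]
  webhook_properties = { runbook = "RB-API-5XX", region = each.key }
}

# Map of region code to alert ID, for dashboards or smoke tests.
output "checkout_http5xx_alert_ids" {
  value = { for region, alert in module.checkout_http5xx : region => alert.id }
}
```

Adding a region then means adding one entry to the web app map; the alert follows automatically and the change shows up as a single new module instance in the plan.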
Best practices
- Match window_size to the metric’s cadence. A PT1M frequency with a PT5M window smooths transient spikes; never set window_size smaller than frequency, and widen the window for low-volume metrics so a single data point can’t trip the alert.
- Prefer dynamic thresholds for variable workloads. For seasonal or bursty signals, dynamic_criteria with alert_sensitivity = "Medium" removes most of the retuning that static thresholds demand and cuts alert fatigue; reserve static thresholds for hard SLOs (e.g. disk > 90%).
- Reserve severity 0-1 for paging. Map severity to action-group routing so only genuine customer-impacting conditions wake someone; default informational alerts to severity 3-4 and route them to chat, not a pager.
- Mind alert-rule cost and dimension fan-out. Each evaluated time series is billed; a wildcard Include * dimension on a high-cardinality metric multiplies cost and noise, so scope dimension values deliberately rather than always using ["*"].
- Keep metric_namespace/metric_name exact and let validation run. Leave skip_metric_validation = false so Azure rejects a mistyped metric when the rule is created (at apply, not at plan); only enable it for custom metrics that don’t yet exist when the alert is created.
- Name for routing, tag for ownership. Encode workload, environment, and signal in the name (alert-web-prod-http5xx) and carry env/workload/team in tags and webhook_properties so downstream automation and on-call dashboards can filter without parsing free text.
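One way to keep the name, tags, and webhook payload from drifting apart is to derive all three from a single set of locals. A minimal sketch; the local names and their values are hypothetical, and the resource references mirror the earlier usage example:

```hcl
locals {
  # Hypothetical ownership metadata, defined once per alert.
  workload = "web"
  env      = "prod"
  signal   = "http5xx"
  team     = "platform"

  # Evaluates to "alert-web-prod-http5xx".
  alert_name = "alert-${local.workload}-${local.env}-${local.signal}"
  ownership  = { env = local.env, workload = local.workload, team = local.team }
}

module "monitor_metric_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = local.alert_name
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [azurerm_linux_web_app.api.id]
  severity            = 1

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10
  }]

  action_group_ids = [azurerm_monitor_action_group.oncall.id]

  # Same ownership keys in both places, plus the runbook for responders.
  tags               = local.ownership
  webhook_properties = merge(local.ownership, { runbook = "RB-API-5XX" })
}
```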