
Terraform Module: Azure Monitor Metric Alert — static & dynamic thresholds with action-group routing

Quick take — A reusable hashicorp/azurerm module for azurerm_monitor_metric_alert: multi-criteria static and dynamic-threshold metric alerts, dimension splitting, and action-group wiring for production-grade Azure monitoring. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "monitor_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "..."           # Name of the metric alert rule.
  resource_group_name = "..."           # Resource group for the alert (may differ from the targe…
  scopes              = ["...", "..."]  # Resource IDs the alert evaluates; multiple require type…
}

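Then terraform init && terraform apply. Every other input has a default, but the alert rule still needs something to evaluate: set static_criteria or dynamic_criteria (both default to empty) before applying, as in the examples under How to use it. If you list more than one resource in scopes, also set target_resource_type and target_resource_location. See Inputs below for the remaining overrides.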

What this module is

An Azure Monitor metric alert continuously evaluates a platform or custom metric (CPU percentage on a VM, 5xx rate on an App Service, available memory on AKS nodes, queue depth on a Service Bus) against a threshold over a sliding window, and fires an action — page an on-call engineer, post to a webhook, or trigger an Azure Function — when the condition is breached. The underlying resource is azurerm_monitor_metric_alert, and while a single alert looks trivial in the portal, doing it consistently in code is where teams stumble: every alert needs the right scopes, a correct metric_namespace, the proper aggregation, a sensible frequency/window_size pair, and one or more action blocks pointing at action groups.

Wrapping it in a module gives you one place to encode those decisions. This module supports both static-threshold criteria (criteria blocks with an explicit threshold) and dynamic-threshold criteria (dynamic_criteria, which machine-learns a baseline so you don’t hand-tune numbers), lets you split a single alert by dimensions (e.g. one alert that evaluates per InstanceId or per HTTP status code), and wires up action groups plus an optional webhook_properties payload. The result is that a metric alert becomes a 12-line module call instead of a 60-line resource that every team copy-pastes and subtly gets wrong.

When to use it

Reach for a different tool when you need log-based alerting (use azurerm_monitor_scheduled_query_rules_alert_v2 against Log Analytics), activity-log alerts on control-plane events like resource deletion (azurerm_monitor_activity_log_alert), or Prometheus-style rules on Azure Monitor managed Prometheus (azurerm_monitor_alert_prometheus_rule_group).

Module structure

terraform-module-azure-monitor-alert/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

resource "azurerm_monitor_metric_alert" "this" {
  name                = var.name
  resource_group_name = var.resource_group_name
  scopes              = var.scopes
  description         = var.description

  enabled       = var.enabled
  auto_mitigate = var.auto_mitigate
  severity      = var.severity
  frequency     = var.frequency
  window_size   = var.window_size

  # Required only when scopes span multiple resources of the same type.
  target_resource_type     = var.target_resource_type
  target_resource_location = var.target_resource_location

  # Static-threshold criteria (one or more).
  dynamic "criteria" {
    for_each = var.static_criteria
    content {
      metric_namespace       = criteria.value.metric_namespace
      metric_name            = criteria.value.metric_name
      aggregation            = criteria.value.aggregation
      operator               = criteria.value.operator
      threshold              = criteria.value.threshold
      skip_metric_validation = try(criteria.value.skip_metric_validation, false)

      dynamic "dimension" {
        for_each = try(criteria.value.dimensions, [])
        content {
          name     = dimension.value.name
          operator = dimension.value.operator
          values   = dimension.value.values
        }
      }
    }
  }

  # Dynamic-threshold criteria (machine-learned baseline). At most one.
  dynamic "dynamic_criteria" {
    for_each = var.dynamic_criteria == null ? [] : [var.dynamic_criteria]
    content {
      metric_namespace  = dynamic_criteria.value.metric_namespace
      metric_name       = dynamic_criteria.value.metric_name
      aggregation       = dynamic_criteria.value.aggregation
      operator          = dynamic_criteria.value.operator
      alert_sensitivity = dynamic_criteria.value.alert_sensitivity

      evaluation_total_count   = try(dynamic_criteria.value.evaluation_total_count, 4)
      evaluation_failure_count = try(dynamic_criteria.value.evaluation_failure_count, 4)
      ignore_data_before       = try(dynamic_criteria.value.ignore_data_before, null)
      skip_metric_validation   = try(dynamic_criteria.value.skip_metric_validation, false)

      dynamic "dimension" {
        for_each = try(dynamic_criteria.value.dimensions, [])
        content {
          name     = dimension.value.name
          operator = dimension.value.operator
          values   = dimension.value.values
        }
      }
    }
  }

  dynamic "action" {
    for_each = var.action_group_ids
    content {
      action_group_id    = action.value
      webhook_properties = var.webhook_properties
    }
  }

  tags = var.tags
}
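As far as I know, the azurerm provider expects exactly one kind of criteria on a metric alert, so a call that sets neither static_criteria nor dynamic_criteria (or both) is rejected with a fairly generic provider error. A minimal sketch, assuming you are free to edit main.tf, of a lifecycle precondition that reports the problem in the module's own terms; the error text is illustrative and not part of the published module:

```hcl
# Sketch: place this block inside resource "azurerm_monitor_metric_alert" "this",
# for example just above `tags = var.tags`.
lifecycle {
  precondition {
    # True only when exactly one of the two inputs is populated:
    # the two booleans must differ.
    condition     = (length(var.static_criteria) > 0) != (var.dynamic_criteria != null)
    error_message = "Set either static_criteria or dynamic_criteria, not both and not neither."
  }
}
```

Preconditions need Terraform 1.2 or later, which the required_version in versions.tf already guarantees.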

variables.tf

variable "name" {
  type        = string
  description = "Name of the metric alert rule."
}

variable "resource_group_name" {
  type        = string
  description = "Resource group in which to create the alert (does not need to match the target's RG)."
}

variable "scopes" {
  type        = list(string)
  description = "Resource IDs the alert evaluates. Multiple IDs require target_resource_type/location."

  validation {
    condition     = length(var.scopes) > 0
    error_message = "At least one scope (resource ID) must be provided."
  }
}

variable "description" {
  type        = string
  description = "Human-readable description shown in the alert payload and portal."
  default     = "Managed by Terraform."
}

variable "enabled" {
  type        = bool
  description = "Whether the alert rule is enabled."
  default     = true
}

variable "auto_mitigate" {
  type        = bool
  description = "Resolve the alert automatically when the condition clears."
  default     = true
}

variable "severity" {
  type        = number
  description = "Alert severity: 0 (Critical) to 4 (Verbose)."
  default     = 3

  validation {
    condition     = var.severity >= 0 && var.severity <= 4
    error_message = "severity must be between 0 (Critical) and 4 (Verbose)."
  }
}

variable "frequency" {
  type        = string
  description = "How often the metric is evaluated (ISO 8601): PT1M, PT5M, PT15M, PT30M, PT1H."
  default     = "PT1M"

  validation {
    condition     = contains(["PT1M", "PT5M", "PT15M", "PT30M", "PT1H"], var.frequency)
    error_message = "frequency must be one of PT1M, PT5M, PT15M, PT30M, PT1H."
  }
}

variable "window_size" {
  type        = string
  description = "Lookback window over which the metric is aggregated (ISO 8601). Must be >= frequency."
  default     = "PT5M"

  validation {
    condition = contains(
      ["PT1M", "PT5M", "PT15M", "PT30M", "PT1H", "PT6H", "PT12H", "P1D"],
      var.window_size
    )
    error_message = "window_size must be one of PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H, P1D."
  }
}
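The two validation blocks above check frequency and window_size independently, so nothing stops a caller from pairing frequency = "PT1H" with window_size = "PT5M". A hedged sketch of a cross-variable check, assuming it is added to the module alongside the variables; the local name duration_minutes and the check name are hypothetical. It uses a check block (Terraform 1.5+), which reports a warning rather than failing the run:

```hcl
locals {
  # Minutes represented by each ISO 8601 duration the module accepts.
  duration_minutes = {
    PT1M  = 1
    PT5M  = 5
    PT15M = 15
    PT30M = 30
    PT1H  = 60
    PT6H  = 360
    PT12H = 720
    P1D   = 1440
  }
}

check "window_covers_frequency" {
  assert {
    # Both lookups are safe because the variable validations
    # already restrict the inputs to keys of this map.
    condition     = local.duration_minutes[var.window_size] >= local.duration_minutes[var.frequency]
    error_message = "window_size must be greater than or equal to frequency."
  }
}
```

If you would rather block the apply than warn, the same condition can be moved into a lifecycle precondition on the alert resource.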

variable "target_resource_type" {
  type        = string
  description = "ARM type (e.g. Microsoft.Compute/virtualMachines). Required for multi-resource scopes."
  default     = null
}

variable "target_resource_location" {
  type        = string
  description = "Azure region of the targets. Required for multi-resource scopes."
  default     = null
}
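To illustrate when these two inputs matter, here is a sketch of a multi-resource call that watches CPU on two virtual machines with a single rule. The VM resources (app01, app02), the resource group, the action group, the region, and the threshold of 85 are all assumptions made for the example:

```hcl
module "vm_cpu_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "alert-vm-prod-cpu"
  resource_group_name = azurerm_resource_group.monitoring.name

  # Hypothetical VMs; all scopes must share one resource type and one region.
  scopes = [
    azurerm_linux_virtual_machine.app01.id,
    azurerm_linux_virtual_machine.app02.id,
  ]
  target_resource_type     = "Microsoft.Compute/virtualMachines"
  target_resource_location = "westeurope"

  static_criteria = [{
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85 # illustrative value
  }]

  action_group_ids = [azurerm_monitor_action_group.oncall.id]
}
```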

variable "static_criteria" {
  description = "One or more static-threshold criteria. Leave empty to use only dynamic_criteria."
  type = list(object({
    metric_namespace       = string
    metric_name            = string
    aggregation            = string # Average, Minimum, Maximum, Total, Count
    operator               = string # Equals, NotEquals, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual
    threshold              = number
    skip_metric_validation = optional(bool, false)
    dimensions = optional(list(object({
      name     = string
      operator = string # Include, Exclude, StartsWith
      values   = list(string)
    })), [])
  }))
  default = []

  validation {
    condition = alltrue([
      for c in var.static_criteria :
      contains(["Average", "Minimum", "Maximum", "Total", "Count"], c.aggregation)
    ])
    error_message = "Each static_criteria.aggregation must be Average, Minimum, Maximum, Total, or Count."
  }
}

variable "dynamic_criteria" {
  description = "Optional single dynamic-threshold criterion (machine-learned baseline)."
  type = object({
    metric_namespace         = string
    metric_name              = string
    aggregation              = string
    operator                 = string # GreaterThan, LessThan, GreaterOrLessThan
    alert_sensitivity        = string # Low, Medium, High
    evaluation_total_count   = optional(number, 4)
    evaluation_failure_count = optional(number, 4)
    ignore_data_before       = optional(string)
    skip_metric_validation   = optional(bool, false)
    dimensions = optional(list(object({
      name     = string
      operator = string
      values   = list(string)
    })), [])
  })
  default = null

  validation {
    condition = var.dynamic_criteria == null ? true : contains(
      ["Low", "Medium", "High"], var.dynamic_criteria.alert_sensitivity
    )
    error_message = "dynamic_criteria.alert_sensitivity must be Low, Medium, or High."
  }
}

variable "action_group_ids" {
  type        = list(string)
  description = "Action group resource IDs notified when the alert fires."
  default     = []
}

variable "webhook_properties" {
  type        = map(string)
  description = "Custom key/value pairs appended to every action group webhook payload."
  default     = {}
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the alert rule."
  default     = {}
}

outputs.tf

output "id" {
  description = "Resource ID of the metric alert rule."
  value       = azurerm_monitor_metric_alert.this.id
}

output "name" {
  description = "Name of the metric alert rule."
  value       = azurerm_monitor_metric_alert.this.name
}

output "severity" {
  description = "Configured severity (0-4) of the alert rule."
  value       = azurerm_monitor_metric_alert.this.severity
}

output "scopes" {
  description = "Resource IDs the alert evaluates."
  value       = azurerm_monitor_metric_alert.this.scopes
}

output "enabled" {
  description = "Whether the alert rule is currently enabled."
  value       = azurerm_monitor_metric_alert.this.enabled
}

How to use it

A static-threshold alert that pages on-call when an App Service instance returns more than 10 HTTP 5xx responses within a five-minute window, evaluated every minute and split per instance:

module "monitor_metric_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"

  name                = "alert-web-prod-http5xx"
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [azurerm_linux_web_app.api.id]
  description         = "Production API sustaining elevated HTTP 5xx responses."
  severity            = 1
  frequency           = "PT1M"
  window_size         = "PT5M"

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10
    dimensions = [{
      name     = "Instance"
      operator = "Include"
      values   = ["*"]
    }]
  }]

  action_group_ids   = [azurerm_monitor_action_group.oncall.id]
  webhook_properties = { team = "platform", runbook = "RB-API-5XX" }

  tags = { env = "prod", workload = "api" }
}

# Downstream reference: feed the alert ID into an Azure Policy / dashboard
# or assert it exists in a smoke test using the module output.
output "api_5xx_alert_id" {
  value = module.monitor_metric_alert.id
}
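The example above references azurerm_resource_group.monitoring and azurerm_monitor_action_group.oncall without defining them. A minimal sketch of what they might look like, assuming an email receiver; the names, region, and address are placeholders rather than anything the module prescribes:

```hcl
resource "azurerm_resource_group" "monitoring" {
  name     = "rg-monitoring-prod" # placeholder name
  location = "westeurope"         # placeholder region
}

resource "azurerm_monitor_action_group" "oncall" {
  name                = "ag-platform-oncall"
  resource_group_name = azurerm_resource_group.monitoring.name
  short_name          = "oncall" # at most 12 characters

  email_receiver {
    name                    = "platform-oncall"
    email_address           = "oncall@example.com" # placeholder address
    use_common_alert_schema = true
  }
}
```

The action group's ID is what the module expects in action_group_ids; any mix of receivers (SMS, webhook, Azure Function) can be attached to the same group without changing the module call.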

A dynamic-threshold variant for a metric with no obvious fixed limit (e.g. message ingress that varies by time of day) simply swaps static_criteria for dynamic_criteria:

dynamic_criteria = {
  metric_namespace  = "Microsoft.ServiceBus/namespaces"
  metric_name       = "IncomingMessages"
  aggregation       = "Total"
  operator          = "GreaterThan"
  alert_sensitivity = "Medium"
}
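The optional fields declared on the dynamic_criteria variable control how eagerly the learned baseline fires. A sketch, with illustrative values, that alerts on deviation in either direction and requires 3 violations out of the last 5 evaluations:

```hcl
dynamic_criteria = {
  metric_namespace  = "Microsoft.ServiceBus/namespaces"
  metric_name       = "IncomingMessages"
  aggregation       = "Total"
  operator          = "GreaterOrLessThan" # fire on spikes and on drops
  alert_sensitivity = "Low"               # wider band, fewer alerts

  # Fire when 3 of the last 5 evaluations fall outside the baseline.
  evaluation_total_count   = 5
  evaluation_failure_count = 3

  # Ignore metric history before this timestamp when learning the baseline
  # (illustrative date, e.g. the day after a migration).
  ignore_data_before = "2024-01-01T00:00:00Z"
}
```

Omitting the last three fields falls back to the module defaults of 4 out of 4 evaluations and no cut-off date.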

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config in live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm storage account/container + key per path...
  }
}

2. Module config in live/prod/monitor_alert/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
}

inputs = {
  name                = "..."
  resource_group_name = "..."
  scopes              = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/monitor_alert && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)         # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
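The root config shown in step 1 only generates the backend. One way to also keep the provider in the root, sketched here under the assumption that every child module can share the same azurerm settings, is a Terragrunt generate block placed next to remote_state:

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
# azurerm 4.x also needs a subscription ID, for example from the
# ARM_SUBSCRIPTION_ID environment variable.
provider "azurerm" {
  features {}
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module source itself stays free of provider configuration.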

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | | Yes | Name of the metric alert rule. |
| resource_group_name | string | | Yes | Resource group for the alert (may differ from the target’s RG). |
| scopes | list(string) | | Yes | Resource IDs the alert evaluates; multiple require type/location. |
| description | string | "Managed by Terraform." | No | Description shown in payload and portal. |
| enabled | bool | true | No | Whether the rule is enabled. |
| auto_mitigate | bool | true | No | Auto-resolve when the condition clears. |
| severity | number | 3 | No | Severity 0 (Critical) to 4 (Verbose). |
| frequency | string | "PT1M" | No | Evaluation frequency (ISO 8601). |
| window_size | string | "PT5M" | No | Aggregation lookback window (>= frequency). |
| target_resource_type | string | null | No | ARM type; required for multi-resource scopes. |
| target_resource_location | string | null | No | Region of targets; required for multi-resource scopes. |
| static_criteria | list(object) | [] | No | Static-threshold criteria with optional dimension splitting. |
| dynamic_criteria | object | null | No | Single machine-learned dynamic-threshold criterion. |
| action_group_ids | list(string) | [] | No | Action group IDs notified on fire. |
| webhook_properties | map(string) | {} | No | Custom key/value pairs added to webhook payloads. |
| tags | map(string) | {} | No | Tags applied to the rule. |

Outputs

| Name | Description |
|---|---|
| id | Resource ID of the metric alert rule. |
| name | Name of the metric alert rule. |
| severity | Configured severity (0-4) of the alert rule. |
| scopes | Resource IDs the alert evaluates. |
| enabled | Whether the alert rule is currently enabled. |

Enterprise scenario

A retail platform runs its checkout API across three regional App Service plans behind Front Door. The SRE team uses this module from a for_each map to stamp out one Http5xx static alert (severity 1, dimension-split per instance) and one IncomingMessages dynamic-threshold alert per region, all routed to a tiered action group that pages PagerDuty for severity 0-1 and posts to a Teams channel for the rest. Because thresholds, windows, and webhook_properties (carrying the runbook ID) live in version control, an auditor can trace every production page back to a reviewed pull request, and Black Friday capacity changes are a one-line edit to the alert catalogue.
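The for_each pattern in this scenario could be sketched as follows. The variable checkout_apps, its shape, the alert names, and the action group reference are assumptions for illustration; only the module inputs come from this page:

```hcl
variable "checkout_apps" {
  description = "Hypothetical alert catalogue: one entry per region."
  type = map(object({
    app_id            = string # App Service resource ID in that region
    http5xx_threshold = number # allowed 5xx responses per window
  }))
}

module "checkout_5xx_alert" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-monitor-alert?ref=v1.0.0"
  for_each = var.checkout_apps

  name                = "alert-checkout-${each.key}-http5xx"
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [each.value.app_id]
  severity            = 1

  static_criteria = [{
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = each.value.http5xx_threshold
    dimensions = [{
      name     = "Instance"
      operator = "Include"
      values   = ["*"]
    }]
  }]

  action_group_ids   = [azurerm_monitor_action_group.oncall.id]
  webhook_properties = { runbook = "RB-API-5XX", region = each.key }
}
```

With this layout, raising a regional threshold for a peak event is a change to one value in the checkout_apps map, and module.checkout_5xx_alert["<region>"].id gives the per-region alert ID.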

Best practices
