Terraform Module: AWS CloudWatch Alarm — standardized metric alarms with SNS notifications and treat-missing-data guardrails

Quick take — Build a reusable Terraform module for aws_cloudwatch_metric_alarm: a var-driven CloudWatch alarm with SNS actions, configurable evaluation/datapoints, anomaly-detection support, and safe missing-data handling for hashicorp/aws ~> 5.0. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal configuration: drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "cloudwatch_alarm" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix = "..."  # Prefix prepended to `alarm_name` (e.g. `prod-payments`)…
  alarm_name  = "..."  # Short alarm name; combined with `name_prefix`. Validate…
}

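Then run terraform init && terraform apply. The two inputs above are the only ones without defaults, but the alarm still needs something to watch: set namespace, metric_name, and threshold (or metric_queries) before applying. Every other input has a sensible default; see Inputs below to override behaviour.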

What this module is

Amazon CloudWatch metric alarms watch a single metric (or a math expression over several metrics) and transition between OK, ALARM, and INSUFFICIENT_DATA states based on a threshold you define. When an alarm crosses into a state, it fires alarm actions — most commonly publishing to an SNS topic that fans out to PagerDuty, email, Slack, or an Auto Scaling / EC2 recovery action.
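
As a rough illustration of the fan-out side, the sketch below creates a topic and one email subscriber. The topic name mirrors the one used in the usage example later in this article; the subscription resource name and the address are placeholders, not part of the module.

```hcl
# Topic that alarm_actions / ok_actions would point at.
resource "aws_sns_topic" "alerts" {
  name = "prod-payments-oncall"
}

# Hypothetical subscriber. Email endpoints must confirm the subscription
# before SNS starts delivering alarm notifications to them.
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.com" # placeholder address
}
```

Other integrations (PagerDuty, chat tools) typically subscribe to the same topic with an https endpoint, so the alarm itself only ever needs the topic ARN.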

The raw aws_cloudwatch_metric_alarm resource has more than two dozen arguments, and the ones that actually determine whether an alarm is useful (evaluation_periods, datapoints_to_alarm, treat_missing_data, and the period/statistic pairing) are exactly the ones teams get wrong. An alarm with treat_missing_data = "missing" on a metric that only emits when something happens spends most of its life in INSUFFICIENT_DATA and can fail to reach ALARM when it matters; an alarm with evaluation_periods = 1 on a noisy metric will page you all night.
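
To make the interaction of those arguments concrete, here is a sketch of an M-of-N alarm on a sparse error metric. The function name and thresholds are illustrative assumptions, and aws_sns_topic.alerts is assumed to be defined elsewhere in the configuration.

```hcl
module "lambda_errors" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix = "prod-payments"
  alarm_name  = "lambda-errors"

  namespace   = "AWS/Lambda"
  metric_name = "Errors"
  statistic   = "Sum" # count errors per period rather than averaging them
  period      = 60
  dimensions  = { FunctionName = "payments-authorizer" } # hypothetical function

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  evaluation_periods  = 5              # look at the last five one-minute periods
  datapoints_to_alarm = 3              # alarm only if three of those five breach
  treat_missing_data  = "notBreaching" # periods with no datapoints count as healthy

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```

The 3-of-5 setting absorbs a single bad minute, while notBreaching keeps the alarm in OK instead of INSUFFICIENT_DATA when the function is idle.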

Wrapping it in a module lets you encode your organization’s conventions once: a consistent naming prefix, a default treat_missing_data policy, mandatory ok_actions so on-call knows when an incident clears, and tagging. Every team then gets a battle-tested alarm by passing five or six variables instead of copy-pasting a 30-line resource and silently dropping the datapoints_to_alarm line.
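
A minimal sketch of how such conventions could be encoded inside a module follows. It is a stricter variant than the module listed in this article (which leaves ok_actions optional), and the default tag keys are hypothetical.

```hcl
# Stricter variant: no default, so callers must pass at least one ARN.
variable "ok_actions" {
  description = "ARNs notified when the alarm returns to OK."
  type        = list(string)

  validation {
    condition     = length(var.ok_actions) > 0
    error_message = "ok_actions must contain at least one ARN so recoveries are announced."
  }
}

variable "tags" {
  description = "Caller-supplied tags."
  type        = map(string)
  default     = {}
}

locals {
  # Hypothetical organization-wide tags; caller-supplied tags win on conflict
  # because merge() lets later arguments override earlier ones.
  default_tags = {
    ManagedBy = "terraform"
    Module    = "terraform-module-aws-cloudwatch-alarm"
  }
  tags = merge(local.default_tags, var.tags)
}
```

The resource would then use tags = local.tags instead of var.tags.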

When to use it

Module structure

terraform-module-aws-cloudwatch-alarm/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  alarm_name = "${var.name_prefix}-${var.alarm_name}"

  # When a raw metric is supplied (namespace + metric_name), we drive the
  # alarm with the top-level metric arguments. When metric_query blocks are
  # supplied, those take over and the top-level metric fields must be null.
  use_metric_query = length(var.metric_queries) > 0
}

resource "aws_cloudwatch_metric_alarm" "this" {
  alarm_name        = local.alarm_name
  alarm_description = var.alarm_description
  actions_enabled   = var.actions_enabled

  comparison_operator = var.comparison_operator
  evaluation_periods  = var.evaluation_periods
  datapoints_to_alarm = var.datapoints_to_alarm
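  # A static threshold applies to both the single-metric and metric-math paths;
  # only anomaly-detection alarms (threshold_metric_id set) must leave it null.
  threshold           = var.threshold_metric_id != null ? null : var.threshold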

  # Single-metric path: only set when no metric_query blocks are provided.
  namespace          = local.use_metric_query ? null : var.namespace
  metric_name        = local.use_metric_query ? null : var.metric_name
  period             = local.use_metric_query ? null : var.period
  statistic          = local.use_metric_query || var.extended_statistic != null ? null : var.statistic
  extended_statistic = local.use_metric_query ? null : var.extended_statistic
  unit               = local.use_metric_query ? null : var.unit
  dimensions         = local.use_metric_query ? null : var.dimensions

  # Metric-math / anomaly-detection path.
  threshold_metric_id = var.threshold_metric_id

  dynamic "metric_query" {
    for_each = var.metric_queries
    content {
      id          = metric_query.value.id
      # optional() attributes are always present (null when unset), so they
      # can be read directly instead of through lookup().
      expression  = metric_query.value.expression
      label       = metric_query.value.label
      return_data = metric_query.value.return_data
      account_id  = metric_query.value.account_id
      period      = metric_query.value.period

      dynamic "metric" {
        for_each = metric_query.value.metric != null ? [metric_query.value.metric] : []
        content {
          namespace   = metric.value.namespace
          metric_name = metric.value.metric_name
          period      = metric.value.period
          stat        = metric.value.stat
          unit        = metric.value.unit
          dimensions  = metric.value.dimensions
        }
      }
    }
  }

  treat_missing_data        = var.treat_missing_data
  alarm_actions             = var.alarm_actions
  ok_actions                = var.ok_actions
  insufficient_data_actions = var.insufficient_data_actions

  tags = var.tags
}

# variables.tf

variable "name_prefix" {
  description = "Prefix prepended to alarm_name, e.g. \"prod-payments\". Keeps alarm names unique and searchable across the account."
  type        = string
}

variable "alarm_name" {
  description = "Short, descriptive alarm name (combined with name_prefix to form the full CloudWatch alarm name)."
  type        = string

  validation {
    condition     = length(var.alarm_name) > 0 && length(var.alarm_name) <= 200
    error_message = "alarm_name must be 1-200 characters."
  }
}

variable "alarm_description" {
  description = "Human-readable description shown in the console and notifications. Include the runbook URL."
  type        = string
  default     = null
}

variable "actions_enabled" {
  description = "Whether actions (alarm/ok/insufficient) are executed on state change. Set false to silence without deleting."
  type        = bool
  default     = true
}

variable "comparison_operator" {
  description = "How the metric is compared to the threshold."
  type        = string
  default     = "GreaterThanOrEqualToThreshold"

  validation {
    condition = contains([
      "GreaterThanOrEqualToThreshold",
      "GreaterThanThreshold",
      "LessThanThreshold",
      "LessThanOrEqualToThreshold",
      "LessThanLowerOrGreaterThanUpperThreshold",
      "LessThanLowerThreshold",
      "GreaterThanUpperThreshold",
    ], var.comparison_operator)
    error_message = "comparison_operator must be a valid CloudWatch comparison operator."
  }
}

variable "evaluation_periods" {
  description = "Number of periods over which data is evaluated to set alarm state."
  type        = number
  default     = 3

  validation {
    condition     = var.evaluation_periods >= 1
    error_message = "evaluation_periods must be >= 1."
  }
}

variable "datapoints_to_alarm" {
  description = "M of N: number of breaching datapoints within evaluation_periods required to trigger ALARM. Null means it equals evaluation_periods."
  type        = number
  default     = null

  validation {
    condition     = var.datapoints_to_alarm == null || var.datapoints_to_alarm >= 1
    error_message = "datapoints_to_alarm must be null or >= 1."
  }
}

variable "threshold" {
  description = "Static threshold value. Leave null for anomaly-detection alarms, which use threshold_metric_id instead."
  type        = number
  default     = null
}

variable "namespace" {
  description = "Metric namespace for the single-metric path, e.g. \"AWS/EC2\". Ignored when metric_queries are supplied."
  type        = string
  default     = null
}

variable "metric_name" {
  description = "Metric name for the single-metric path, e.g. \"CPUUtilization\". Ignored when metric_queries are supplied."
  type        = string
  default     = null
}

variable "period" {
  description = "Granularity in seconds for the single-metric path. Must be a multiple of 60 (1, 10, or 30 are only valid for high-resolution custom metrics)."
  type        = number
  default     = 300

  validation {
    condition     = contains([1, 10, 30], var.period) || (var.period >= 60 && var.period % 60 == 0)
    error_message = "period must be 1, 10, 30, or a multiple of 60 seconds."
  }
}

variable "statistic" {
  description = "Standard statistic for the single-metric path (SampleCount, Average, Sum, Minimum, Maximum). Mutually exclusive with extended_statistic."
  type        = string
  default     = "Average"

  validation {
    condition     = var.statistic == null || contains(["SampleCount", "Average", "Sum", "Minimum", "Maximum"], var.statistic)
    error_message = "statistic must be one of SampleCount, Average, Sum, Minimum, Maximum."
  }
}

variable "extended_statistic" {
  description = "Percentile statistic for the single-metric path, e.g. \"p99\" or \"p95.5\". When set, statistic is ignored."
  type        = string
  default     = null

  validation {
    condition     = var.extended_statistic == null || can(regex("^p(100(\\.0+)?|\\d{1,2}(\\.\\d+)?)$", var.extended_statistic))
    error_message = "extended_statistic must be a percentile like p99, p95, or p99.9."
  }
}

variable "unit" {
  description = "Optional metric unit (e.g. \"Percent\", \"Bytes\"). Leave null unless the metric is published with a specific unit you must match."
  type        = string
  default     = null
}

variable "dimensions" {
  description = "Map of dimension name/value pairs for the single-metric path, e.g. { InstanceId = \"i-0abc\" }."
  type        = map(string)
  default     = {}
}

variable "metric_queries" {
  description = <<-EOT
    List of metric_query blocks for metric math or anomaly detection. When non-empty, the single-metric
    arguments are ignored. Each item supports: id (required), expression, label, return_data, account_id,
    period, and an optional metric = { namespace, metric_name, period, stat, unit, dimensions }.
  EOT
  type = list(object({
    id          = string
    expression  = optional(string)
    label       = optional(string)
    return_data = optional(bool)
    account_id  = optional(string)
    period      = optional(number)
    metric = optional(object({
      namespace   = string
      metric_name = string
      period      = number
      stat        = string
      unit        = optional(string)
      dimensions  = optional(map(string))
    }))
  }))
  default = []
}

variable "threshold_metric_id" {
  description = "ID of the ANOMALY_DETECTION_BAND metric_query used as the threshold. Set this instead of threshold for anomaly-detection alarms."
  type        = string
  default     = null
}

variable "treat_missing_data" {
  description = "How missing datapoints are treated: missing, notBreaching, breaching, or ignore. Choose deliberately per metric."
  type        = string
  default     = "missing"

  validation {
    condition     = contains(["missing", "notBreaching", "breaching", "ignore"], var.treat_missing_data)
    error_message = "treat_missing_data must be one of missing, notBreaching, breaching, ignore."
  }
}

variable "alarm_actions" {
  description = "List of ARNs (typically SNS topics) to notify when the alarm enters ALARM state."
  type        = list(string)
  default     = []
}

variable "ok_actions" {
  description = "List of ARNs to notify when the alarm returns to OK. Strongly recommended so on-call knows an incident cleared."
  type        = list(string)
  default     = []
}

variable "insufficient_data_actions" {
  description = "List of ARNs to notify when the alarm enters INSUFFICIENT_DATA."
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Tags applied to the alarm."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "alarm_id" {
  description = "The ID of the CloudWatch metric alarm (equals the alarm name)."
  value       = aws_cloudwatch_metric_alarm.this.id
}

output "alarm_arn" {
  description = "The ARN of the CloudWatch metric alarm, e.g. for cross-account policies or composite alarms."
  value       = aws_cloudwatch_metric_alarm.this.arn
}

output "alarm_name" {
  description = "The full alarm name (name_prefix + alarm_name)."
  value       = aws_cloudwatch_metric_alarm.this.alarm_name
}

output "comparison_operator" {
  description = "The comparison operator the alarm was created with."
  value       = aws_cloudwatch_metric_alarm.this.comparison_operator
}
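
The variable validations above each check one input in isolation, but the single-metric and metric_query paths are mutually exclusive as a pair. One possible way to catch a bad combination at plan time is a guard resource with preconditions; this is a sketch that is not part of the module as listed, and the resource name input_guard is hypothetical. It relies on terraform_data, which is built into Terraform 1.4 and later, so it fits the required_version in versions.tf.

```hcl
# Hypothetical addition to main.tf: fail the plan early on inconsistent inputs.
resource "terraform_data" "input_guard" {
  lifecycle {
    # Exactly one of the two metric paths must be in use.
    precondition {
      condition     = (var.metric_name != null) != (length(var.metric_queries) > 0)
      error_message = "Set either namespace/metric_name or metric_queries, not both and not neither."
    }

    # A static threshold and an anomaly-detection band cannot be combined.
    precondition {
      condition     = (var.threshold != null) != (var.threshold_metric_id != null)
      error_message = "Set exactly one of threshold or threshold_metric_id."
    }
  }
}
```

Comparing two booleans with != acts as an exclusive-or, so each precondition passes only when exactly one side is set. On newer Terraform releases that allow a validation block to reference other variables, the same checks could live in variables.tf instead.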

How to use it

A static-threshold CPU alarm for an EC2 instance, wired to an SNS topic, with ok_actions so the incident clears cleanly:

resource "aws_sns_topic" "alerts" {
  name = "prod-payments-oncall"
}

module "cloudwatch_alarm" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "ec2-cpu-high"
  alarm_description = "CPU >= 80% for 15m on the payments API host. Runbook: https://wiki.internal/runbooks/payments-cpu"

  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  statistic   = "Average"
  period      = 300
  dimensions  = { InstanceId = aws_instance.payments_api.id }

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 80
  evaluation_periods  = 3
  datapoints_to_alarm = 3
  treat_missing_data  = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = {
    Team        = "payments"
    Environment = "prod"
  }
}

# Downstream reference: feed the alarm name into a composite alarm rule so the
# "service unhealthy" page only fires when CPU AND latency are both breaching.
# Assumes an alarm named "prod-payments-latency-high" already exists.
resource "aws_cloudwatch_composite_alarm" "service_unhealthy" {
  alarm_name = "prod-payments-service-unhealthy"
  alarm_rule = "ALARM(\"${module.cloudwatch_alarm.alarm_name}\") AND ALARM(\"prod-payments-latency-high\")"

  alarm_actions = [aws_sns_topic.alerts.arn]
}
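
The composite rule above refers to an alarm named prod-payments-latency-high that is not defined in this article. One way it could be declared with the same module is sketched here, using the percentile path (extended_statistic) rather than a standard statistic. The 1.5 second threshold is an illustrative assumption, and aws_lb.payments is assumed to exist in the calling configuration.

```hcl
module "latency_high" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "latency-high" # full name: prod-payments-latency-high
  alarm_description = "p99 target response time >= 1.5s for 3 of 5 minutes."

  namespace          = "AWS/ApplicationELB"
  metric_name        = "TargetResponseTime" # reported in seconds
  extended_statistic = "p99"                # the module nulls out statistic when this is set
  period             = 60
  dimensions         = { LoadBalancer = aws_lb.payments.arn_suffix }

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1.5 # illustrative value
  evaluation_periods  = 5
  datapoints_to_alarm = 3
  treat_missing_data  = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```

Alarming on p99 rather than Average keeps a slow tail visible even when most requests are fast.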

For an anomaly-detection alarm, omit threshold and instead pass two metric_queries — the raw metric and an ANOMALY_DETECTION_BAND expression — then point threshold_metric_id at the band and use a band comparison operator:

module "latency_anomaly" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "alb-latency-anomaly"
  alarm_description = "ALB target response time outside the expected band (2 stddev)."

  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "e1"
  treat_missing_data  = "notBreaching"

  metric_queries = [
    {
      id          = "m1"
      return_data = true
      metric = {
        namespace   = "AWS/ApplicationELB"
        metric_name = "TargetResponseTime"
        period      = 300
        stat        = "Average"
        dimensions  = { LoadBalancer = aws_lb.payments.arn_suffix }
      }
    },
    {
      id          = "e1"
      expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
      label       = "TargetResponseTime (expected)"
      return_data = true
    },
  ]

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
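
metric_queries also covers plain metric math with a static threshold. The sketch below alarms on an ALB 5xx error rate. It assumes the module passes threshold through to the resource whenever threshold_metric_id is null; the 5 percent threshold and the query ids are illustrative.

```hcl
module "alb_5xx_rate" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "alb-5xx-rate-high"
  alarm_description = "Target 5xx responses >= 5% of requests for 15m."

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 5 # percent; assumes threshold is honoured with metric_queries
  evaluation_periods  = 3
  treat_missing_data  = "notBreaching"

  metric_queries = [
    {
      id = "errors"
      metric = {
        namespace   = "AWS/ApplicationELB"
        metric_name = "HTTPCode_Target_5XX_Count"
        period      = 300
        stat        = "Sum"
        dimensions  = { LoadBalancer = aws_lb.payments.arn_suffix }
      }
    },
    {
      id = "requests"
      metric = {
        namespace   = "AWS/ApplicationELB"
        metric_name = "RequestCount"
        period      = 300
        stat        = "Sum"
        dimensions  = { LoadBalancer = aws_lb.payments.arn_suffix }
      }
    },
    {
      # Only the expression returns data; the alarm evaluates this series.
      id          = "rate"
      expression  = "100 * errors / requests"
      label       = "5xx rate (%)"
      return_data = true
    },
  ]

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```

Unlike the anomaly-detection case, exactly one query returns data here and threshold_metric_id stays unset.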

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config, live/prod/cloudwatch_alarm/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"
}

inputs = {
  name_prefix = "..."
  alarm_name  = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/cloudwatch_alarm && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
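
The root config shown earlier only covers remote state. A sketch of how the provider could be generated from the same root file follows; the region simply mirrors the Quickstart and is an assumption about your setup.

```hcl
# live/terragrunt.hcl (root), next to the remote_state block.
# Terragrunt writes provider.tf into each module's working directory,
# so the module source itself never needs a provider block.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

Each environment folder then differs only in its inputs map, for example setting actions_enabled = false in dev to keep test alarms from paging anyone.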

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | - | yes | Prefix prepended to alarm_name (e.g. prod-payments) for unique, searchable names. |
| alarm_name | string | - | yes | Short alarm name; combined with name_prefix. Validated to 1–200 chars. |
| alarm_description | string | null | no | Description shown in console/notifications. Include the runbook URL. |
| actions_enabled | bool | true | no | Whether state-change actions execute. Set false to silence without deleting. |
| comparison_operator | string | GreaterThanOrEqualToThreshold | no | Comparison of metric to threshold; validated against the seven valid operators. |
| evaluation_periods | number | 3 | no | Periods evaluated to set state. Must be ≥ 1. |
| datapoints_to_alarm | number | null | no | M-of-N breaching datapoints to trigger ALARM. Null = equals evaluation_periods. |
| threshold | number | null | no | Static threshold value. Leave null for anomaly-detection alarms (use threshold_metric_id). |
| namespace | string | null | no | Metric namespace for the single-metric path (e.g. AWS/EC2). |
| metric_name | string | null | no | Metric name for the single-metric path (e.g. CPUUtilization). |
| period | number | 300 | no | Granularity in seconds. Must be 1, 10, 30, or a multiple of 60. |
| statistic | string | Average | no | Standard statistic. Mutually exclusive with extended_statistic. |
| extended_statistic | string | null | no | Percentile statistic (e.g. p99). Overrides statistic when set. |
| unit | string | null | no | Optional metric unit; set only when you must match the published unit. |
| dimensions | map(string) | {} | no | Dimension name/value pairs for the single-metric path. |
| metric_queries | list(object) | [] | no | metric_query blocks for metric math / anomaly detection. Disables the single-metric path. |
| threshold_metric_id | string | null | no | ID of the ANOMALY_DETECTION_BAND query used as the threshold. |
| treat_missing_data | string | missing | no | Missing-data policy: missing, notBreaching, breaching, or ignore. |
| alarm_actions | list(string) | [] | no | ARNs notified on ALARM (typically an SNS topic). |
| ok_actions | list(string) | [] | no | ARNs notified on return to OK. |
| insufficient_data_actions | list(string) | [] | no | ARNs notified on INSUFFICIENT_DATA. |
| tags | map(string) | {} | no | Tags applied to the alarm. |

Outputs

| Name | Description |
|---|---|
| alarm_id | The ID of the metric alarm (equals the alarm name). |
| alarm_arn | The ARN of the alarm; use for composite alarms or cross-account policies. |
| alarm_name | The full alarm name (name_prefix + alarm_name). |
| comparison_operator | The comparison operator the alarm was created with. |

Enterprise scenario

A payments platform runs roughly 400 CloudWatch alarms across 30 microservices in three AWS accounts. Before standardizing, each squad hand-wrote aws_cloudwatch_metric_alarm resources, and a postmortem found that 12% of “critical” alarms used the default treat_missing_data = "missing" against sparse error-count metrics — so they had never paged during real outages. The platform team published this module with notBreaching chosen per metric, mandatory ok_actions, and a naming convention enforced by name_prefix; squads now declare alarms in five variables, the ARNs flow into per-service composite alarms for “service unhealthy” paging, and a Terraform policy check rejects any alarm missing a runbook URL in alarm_description.
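
The scenario does not say which tool implements the runbook check. As a lightweight stand-in, the same rule can be approximated inside the module with a variable validation; this is a sketch of that swapped-in technique, not the platform team's actual policy, and it makes alarm_description a required input.

```hcl
# Stricter variant of alarm_description: no default, and it must contain a URL.
variable "alarm_description" {
  description = "Human-readable description; must include a runbook URL."
  type        = string

  validation {
    # can() turns a failed regex match (or a null value) into false,
    # so missing URLs fail validation instead of raising an error.
    condition     = can(regex("https?://\\S+", var.alarm_description))
    error_message = "alarm_description must include a runbook URL (http:// or https://)."
  }
}
```

A policy engine run against the plan can enforce the same rule across raw resources as well, which a module-level validation cannot.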

Best practices


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
