Terraform Module: AWS CloudWatch Alarm — standardized metric alarms with SNS notifications and treat-missing-data guardrails

Quick take — Build a reusable Terraform module for aws_cloudwatch_metric_alarm: a var-driven CloudWatch alarm with SNS actions, configurable evaluation/datapoints, anomaly-detection support, and safe missing-data handling for hashicorp/aws ~> 5.0. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

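module "cloudwatch_alarm" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix = "..."  # Prefix prepended to `alarm_name` (e.g. `prod-payments`)…
  alarm_name  = "..."  # Short alarm name; combined with `name_prefix`. Validate…

  # The alarm also needs a metric and a threshold; these values are illustrative.
  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  threshold   = 80
}

Then run terraform init && terraform apply. Besides the two name inputs, an alarm needs a metric (namespace and metric_name, or metric_queries) and a threshold (or threshold_metric_id); every remaining input has a default. See Inputs below to override behaviour.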

What this module is

Amazon CloudWatch metric alarms watch a single metric (or a math expression over several metrics) and transition between OK, ALARM, and INSUFFICIENT_DATA states based on a threshold you define. When an alarm crosses into a state, it fires alarm actions — most commonly publishing to an SNS topic that fans out to PagerDuty, email, Slack, or an Auto Scaling / EC2 recovery action.

The raw aws_cloudwatch_metric_alarm resource has more than two dozen arguments, and the ones that actually determine whether an alarm is useful (evaluation_periods, datapoints_to_alarm, treat_missing_data, and the period/statistic pairing) are exactly the ones teams get wrong. An alarm with treat_missing_data = "missing" on a metric that only emits when something happens can sit in INSUFFICIENT_DATA and never page; an alarm with evaluation_periods = 1 on a noisy metric will page you all night.
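To make those four knobs concrete, here is a minimal sketch of a raw alarm on a sparse error-count metric. The namespace, metric name, alarm name, threshold, and the alert_topic_arn variable are all hypothetical and stand in for your own values; it is not taken from the module.

```hcl
# Sketch only: names and numbers below are illustrative assumptions.
variable "alert_topic_arn" {
  description = "ARN of an existing SNS topic (assumed to exist)."
  type        = string
}

resource "aws_cloudwatch_metric_alarm" "checkout_errors" {
  alarm_name          = "prod-payments-checkout-errors"
  namespace           = "Custom/Payments" # hypothetical custom namespace
  metric_name         = "CheckoutErrors"  # emitted only when an error occurs
  statistic           = "Sum"
  period              = 60
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 5

  # M of N: 3 breaching minutes within the last 5 move the alarm to ALARM.
  evaluation_periods  = 5
  datapoints_to_alarm = 3

  # No datapoint means no errors here, so missing periods count as healthy
  # and the alarm can return to OK once errors stop.
  treat_missing_data = "notBreaching"

  alarm_actions = [var.alert_topic_arn]
  ok_actions    = [var.alert_topic_arn]
}
```

Sum with a 60-second period suits a counter; a gauge such as CPU would more likely use Average over a longer period.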

Wrapping it in a module lets you encode your organization’s conventions once: a consistent naming prefix, a default treat_missing_data policy, mandatory ok_actions so on-call knows when an incident clears, and tagging. Every team then gets a battle-tested alarm by passing five or six variables instead of copy-pasting a 30-line resource and silently dropping the datapoints_to_alarm line.
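As published below, the module gives treat_missing_data and ok_actions defaults, so neither is strictly enforced. If your organization wants them mandatory, one possible approach is to drop the defaults and validate the inputs. The following is a hedged sketch of stricter variants of those two variables, not the module's actual code:

```hcl
# Hypothetical stricter variants of two module inputs.
variable "treat_missing_data" {
  description = "Missing-data policy. No default, so every caller must choose one."
  type        = string
  nullable    = false

  validation {
    condition     = contains(["missing", "notBreaching", "breaching", "ignore"], var.treat_missing_data)
    error_message = "treat_missing_data must be one of missing, notBreaching, breaching, ignore."
  }
}

variable "ok_actions" {
  description = "ARNs notified when the alarm returns to OK. At least one is required."
  type        = list(string)
  nullable    = false

  validation {
    condition     = length(var.ok_actions) > 0
    error_message = "Provide at least one ok_actions ARN so on-call sees when an incident clears."
  }
}
```

Removing a default is a breaking change for existing callers, so it would normally ship with a new major version tag.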

When to use it

You manage more than a handful of alarms and want naming, tagging, and notification routing to be identical across them.
You want to support both static-threshold alarms (CPU > 80%) and anomaly-detection band alarms from the same module without two code paths in every stack.
You need alarms wired to SNS for incident tooling, and optionally to EC2 auto-recovery or Auto Scaling actions.
You want guardrails — e.g. enforcing a valid comparison_operator, a sane period, and an explicit missing-data policy — so a typo cannot ship a dead alarm.
Reach for AWS’s own managed alerting (for example Application Signals SLOs, or the alarm recommendations CloudWatch offers for AWS services) when you want best-practice defaults you don’t control; use this module when you need per-team consistency and Terraform-managed state.

Module structure

terraform-module-aws-cloudwatch-alarm/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  alarm_name = "${var.name_prefix}-${var.alarm_name}"

  # When a raw metric is supplied (namespace + metric_name), we drive the
  # alarm with the top-level metric arguments. When metric_query blocks are
  # supplied, those take over and the top-level metric fields must be null.
  use_metric_query = length(var.metric_queries) > 0
}

resource "aws_cloudwatch_metric_alarm" "this" {
  alarm_name        = local.alarm_name
  alarm_description = var.alarm_description
  actions_enabled   = var.actions_enabled

  comparison_operator = var.comparison_operator
  evaluation_periods  = var.evaluation_periods
  datapoints_to_alarm = var.datapoints_to_alarm
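  # A static threshold applies to single-metric and metric-math alarms alike;
  # it is dropped only when an anomaly-detection band supplies the threshold.
  threshold           = var.threshold_metric_id != null ? null : var.threshold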

  # Single-metric path: only set when no metric_query blocks are provided.
  namespace          = local.use_metric_query ? null : var.namespace
  metric_name        = local.use_metric_query ? null : var.metric_name
  period             = local.use_metric_query ? null : var.period
  statistic          = local.use_metric_query || var.extended_statistic != null ? null : var.statistic
  extended_statistic = local.use_metric_query ? null : var.extended_statistic
  unit               = local.use_metric_query ? null : var.unit
  dimensions         = local.use_metric_query ? null : var.dimensions

  # Metric-math / anomaly-detection path.
  threshold_metric_id = var.threshold_metric_id

  dynamic "metric_query" {
    for_each = var.metric_queries
    content {
      id          = metric_query.value.id
      expression  = metric_query.value.expression
      label       = metric_query.value.label
      return_data = metric_query.value.return_data
      account_id  = metric_query.value.account_id
      period      = metric_query.value.period

      # optional() attributes are null when omitted, so direct access is safe.
      dynamic "metric" {
        for_each = metric_query.value.metric != null ? [metric_query.value.metric] : []
        content {
          namespace   = metric.value.namespace
          metric_name = metric.value.metric_name
          period      = metric.value.period
          stat        = metric.value.stat
          unit        = metric.value.unit
          dimensions  = metric.value.dimensions
        }
      }
    }
  }

  treat_missing_data        = var.treat_missing_data
  alarm_actions             = var.alarm_actions
  ok_actions                = var.ok_actions
  insufficient_data_actions = var.insufficient_data_actions

  tags = var.tags
}

# variables.tf

variable "name_prefix" {
  description = "Prefix prepended to alarm_name, e.g. \"prod-payments\". Keeps alarm names unique and searchable across the account."
  type        = string
}

variable "alarm_name" {
  description = "Short, descriptive alarm name (combined with name_prefix to form the full CloudWatch alarm name)."
  type        = string

  validation {
    condition     = length(var.alarm_name) > 0 && length(var.alarm_name) <= 200
    error_message = "alarm_name must be 1-200 characters."
  }
}

variable "alarm_description" {
  description = "Human-readable description shown in the console and notifications. Include the runbook URL."
  type        = string
  default     = null
}

variable "actions_enabled" {
  description = "Whether actions (alarm/ok/insufficient) are executed on state change. Set false to silence without deleting."
  type        = bool
  default     = true
}

variable "comparison_operator" {
  description = "How the metric is compared to the threshold."
  type        = string
  default     = "GreaterThanOrEqualToThreshold"

  validation {
    condition = contains([
      "GreaterThanOrEqualToThreshold",
      "GreaterThanThreshold",
      "LessThanThreshold",
      "LessThanOrEqualToThreshold",
      "LessThanLowerOrGreaterThanUpperThreshold",
      "LessThanLowerThreshold",
      "GreaterThanUpperThreshold",
    ], var.comparison_operator)
    error_message = "comparison_operator must be a valid CloudWatch comparison operator."
  }
}

variable "evaluation_periods" {
  description = "Number of periods over which data is evaluated to set alarm state."
  type        = number
  default     = 3

  validation {
    condition     = var.evaluation_periods >= 1
    error_message = "evaluation_periods must be >= 1."
  }
}

variable "datapoints_to_alarm" {
  description = "M of N: number of breaching datapoints within evaluation_periods required to trigger ALARM. Null means it equals evaluation_periods."
  type        = number
  default     = null

  validation {
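    # Conditional form avoids comparing null: older Terraform releases evaluate both operands of ||.
    condition     = var.datapoints_to_alarm == null ? true : var.datapoints_to_alarm >= 1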
    error_message = "datapoints_to_alarm must be null or >= 1."
  }
}

variable "threshold" {
  description = "Value to compare the metric or metric-math expression against. Leave null for anomaly-detection alarms, which use threshold_metric_id instead."
  type        = number
  default     = null
}

variable "namespace" {
  description = "Metric namespace for the single-metric path, e.g. \"AWS/EC2\". Ignored when metric_queries are supplied."
  type        = string
  default     = null
}

variable "metric_name" {
  description = "Metric name for the single-metric path, e.g. \"CPUUtilization\". Ignored when metric_queries are supplied."
  type        = string
  default     = null
}

variable "period" {
  description = "Granularity in seconds for the single-metric path. Must be a multiple of 60; 10 and 30 are valid only for high-resolution custom metrics."
  type        = number
  default     = 300

  validation {
    condition     = contains([10, 30], var.period) || (var.period > 0 && var.period % 60 == 0)
    error_message = "period must be 10, 30, or a positive multiple of 60 seconds."
  }
}

variable "statistic" {
  description = "Standard statistic for the single-metric path (SampleCount, Average, Sum, Minimum, Maximum). Mutually exclusive with extended_statistic."
  type        = string
  default     = "Average"

  validation {
    condition     = var.statistic == null ? true : contains(["SampleCount", "Average", "Sum", "Minimum", "Maximum"], var.statistic)
    error_message = "statistic must be one of SampleCount, Average, Sum, Minimum, Maximum."
  }
}

variable "extended_statistic" {
  description = "Percentile statistic for the single-metric path, e.g. \"p99\" or \"p95.5\". When set, statistic is ignored."
  type        = string
  default     = null

  validation {
    condition     = var.extended_statistic == null || can(regex("^p(100(\\.0+)?|\\d{1,2}(\\.\\d+)?)$", var.extended_statistic))
    error_message = "extended_statistic must be a percentile like p99, p95, or p99.9."
  }
}

variable "unit" {
  description = "Optional metric unit (e.g. \"Percent\", \"Bytes\"). Leave null unless the metric is published with a specific unit you must match."
  type        = string
  default     = null
}

variable "dimensions" {
  description = "Map of dimension name/value pairs for the single-metric path, e.g. { InstanceId = \"i-0abc\" }."
  type        = map(string)
  default     = {}
}

variable "metric_queries" {
  description = <<-EOT
    List of metric_query blocks for metric math or anomaly detection. When non-empty, the single-metric
    arguments are ignored. Each item supports: id (required), expression, label, return_data, account_id,
    period, and an optional metric = { namespace, metric_name, period, stat, unit, dimensions }.
  EOT
  type = list(object({
    id          = string
    expression  = optional(string)
    label       = optional(string)
    return_data = optional(bool)
    account_id  = optional(string)
    period      = optional(number)
    metric = optional(object({
      namespace   = string
      metric_name = string
      period      = number
      stat        = string
      unit        = optional(string)
      dimensions  = optional(map(string))
    }))
  }))
  default = []
}

variable "threshold_metric_id" {
  description = "ID of the ANOMALY_DETECTION_BAND metric_query used as the threshold. Set this instead of threshold for anomaly-detection alarms."
  type        = string
  default     = null
}

variable "treat_missing_data" {
  description = "How missing datapoints are treated: missing, notBreaching, breaching, or ignore. Choose deliberately per metric."
  type        = string
  default     = "missing"

  validation {
    condition     = contains(["missing", "notBreaching", "breaching", "ignore"], var.treat_missing_data)
    error_message = "treat_missing_data must be one of missing, notBreaching, breaching, ignore."
  }
}

variable "alarm_actions" {
  description = "List of ARNs (typically SNS topics) to notify when the alarm enters ALARM state."
  type        = list(string)
  default     = []
}

variable "ok_actions" {
  description = "List of ARNs to notify when the alarm returns to OK. Strongly recommended so on-call knows an incident cleared."
  type        = list(string)
  default     = []
}

variable "insufficient_data_actions" {
  description = "List of ARNs to notify when the alarm enters INSUFFICIENT_DATA."
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Tags applied to the alarm."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "alarm_id" {
  description = "The ID of the CloudWatch metric alarm (equals the alarm name)."
  value       = aws_cloudwatch_metric_alarm.this.id
}

output "alarm_arn" {
  description = "The ARN of the CloudWatch metric alarm, e.g. for cross-account policies or composite alarms."
  value       = aws_cloudwatch_metric_alarm.this.arn
}

output "alarm_name" {
  description = "The full alarm name (name_prefix + alarm_name)."
  value       = aws_cloudwatch_metric_alarm.this.alarm_name
}

output "comparison_operator" {
  description = "The comparison operator the alarm was created with."
  value       = aws_cloudwatch_metric_alarm.this.comparison_operator
}
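
The validations above check each variable in isolation. Cross-variable mistakes, such as a band operator without threshold_metric_id, would slip through. One way to catch them on the module's minimum Terraform version is a set of preconditions. The file name, the terraform_data resource, and the specific rules below are assumptions for illustration, not part of the published module:

```hcl
# guardrails.tf (hypothetical file) - relies on the variables declared above.
locals {
  band_operators = [
    "LessThanLowerOrGreaterThanUpperThreshold",
    "LessThanLowerThreshold",
    "GreaterThanUpperThreshold",
  ]
  uses_band_operator = contains(local.band_operators, var.comparison_operator)
}

resource "terraform_data" "guardrails" {
  lifecycle {
    # M must not exceed N; a null M falls back to N via coalesce().
    precondition {
      condition     = coalesce(var.datapoints_to_alarm, var.evaluation_periods) <= var.evaluation_periods
      error_message = "datapoints_to_alarm must not exceed evaluation_periods."
    }

    # Band operators and threshold_metric_id only make sense together.
    precondition {
      condition     = local.uses_band_operator == (var.threshold_metric_id != null)
      error_message = "Use a band comparison operator if and only if threshold_metric_id is set."
    }

    # Exactly one of threshold / threshold_metric_id must be provided.
    precondition {
      condition     = (var.threshold != null) != (var.threshold_metric_id != null)
      error_message = "Set exactly one of threshold or threshold_metric_id."
    }
  }
}
```

A failed precondition stops the plan with the given message, so a misconfigured alarm is rejected before anything is created.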

How to use it

A static-threshold CPU alarm for an EC2 instance, wired to an SNS topic, with ok_actions so the incident clears cleanly:

resource "aws_sns_topic" "alerts" {
  name = "prod-payments-oncall"
}

module "cloudwatch_alarm" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "ec2-cpu-high"
  alarm_description = "CPU >= 80% for 15m on the payments API host. Runbook: https://wiki.internal/runbooks/payments-cpu"

  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  statistic   = "Average"
  period      = 300
  dimensions  = { InstanceId = aws_instance.payments_api.id }

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 80
  evaluation_periods  = 3
  datapoints_to_alarm = 3
  treat_missing_data  = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = {
    Team        = "payments"
    Environment = "prod"
  }
}

# Downstream reference: feed the alarm name into a composite alarm so the
# "service unhealthy" page only fires when CPU AND latency are both breaching.
resource "aws_cloudwatch_composite_alarm" "service_unhealthy" {
  alarm_name = "prod-payments-service-unhealthy"
  alarm_rule = "ALARM(\"${module.cloudwatch_alarm.alarm_name}\") AND ALARM(\"prod-payments-latency-high\")"

  alarm_actions = [aws_sns_topic.alerts.arn]
}

For an anomaly-detection alarm, omit threshold and instead pass two metric_queries — the raw metric and an ANOMALY_DETECTION_BAND expression — then point threshold_metric_id at the band and use a band comparison operator:

module "latency_anomaly" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "alb-latency-anomaly"
  alarm_description = "ALB target response time outside the expected band (2 stddev)."

  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "e1"
  treat_missing_data  = "notBreaching"

  metric_queries = [
    {
      id          = "m1"
      return_data = true
      metric = {
        namespace   = "AWS/ApplicationELB"
        metric_name = "TargetResponseTime"
        period      = 300
        stat        = "Average"
        dimensions  = { LoadBalancer = aws_lb.payments.arn_suffix }
      }
    },
    {
      id          = "e1"
      expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
      label       = "TargetResponseTime (expected)"
      return_data = true
    },
  ]

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
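
A third variation, sketched here under assumptions, combines a percentile statistic with M-of-N evaluation. The alarm name, the one-second threshold, and the 3-of-5 window are illustrative; aws_lb.payments and aws_sns_topic.alerts are assumed to be the resources referenced in the examples above.

```hcl
module "latency_p99" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "alb-latency-p99-high"
  alarm_description = "p99 target response time >= 1s in 3 of the last 5 minutes."

  namespace          = "AWS/ApplicationELB"
  metric_name        = "TargetResponseTime"
  extended_statistic = "p99" # the module sets statistic to null when this is given
  period             = 60
  dimensions         = { LoadBalancer = aws_lb.payments.arn_suffix }

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1 # seconds; illustrative value
  evaluation_periods  = 5
  datapoints_to_alarm = 3
  treat_missing_data  = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```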

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}

2. Module config — live/prod/cloudwatch_alarm/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"
}

inputs = {
  name_prefix = "..."
  alarm_name  = "..."
}

3. Deploy one environment, or roll out all modules together:

# Both commands are run from the repository root.
(cd live/prod/cloudwatch_alarm && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
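
The root config shown earlier only generates the backend. To keep the provider in one place as well, a generate block can be added next to remote_state. This is a minimal sketch; the region value is an assumption carried over from the Quickstart.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module itself stays free of provider configuration.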

Inputs

Name	Type	Default	Required	Description
name_prefix	string	—	yes	Prefix prepended to `alarm_name` (e.g. `prod-payments`) for unique, searchable names.
alarm_name	string	—	yes	Short alarm name; combined with `name_prefix`. Validated to 1–200 chars.
alarm_description	string	null	no	Description shown in console/notifications. Include the runbook URL.
actions_enabled	bool	true	no	Whether state-change actions execute. Set false to silence without deleting.
comparison_operator	string	GreaterThanOrEqualToThreshold	no	Comparison of metric to threshold; validated against the seven valid operators.
evaluation_periods	number	3	no	Periods evaluated to set state. Must be ≥ 1.
datapoints_to_alarm	number	null	no	M-of-N breaching datapoints to trigger ALARM. Null = equals `evaluation_periods`.
threshold	number	null	no	Static threshold value. Leave null for anomaly-detection alarms, which use `threshold_metric_id`.
namespace	string	null	no	Metric namespace for the single-metric path (e.g. `AWS/EC2`).
metric_name	string	null	no	Metric name for the single-metric path (e.g. `CPUUtilization`).
period	number	300	no	Granularity in seconds. Must be 10, 30, or a multiple of 60.
statistic	string	Average	no	Standard statistic. Mutually exclusive with `extended_statistic`.
extended_statistic	string	null	no	Percentile statistic (e.g. `p99`). Overrides `statistic` when set.
unit	string	null	no	Optional metric unit; set only when you must match the published unit.
dimensions	map(string)	{}	no	Dimension name/value pairs for the single-metric path.
metric_queries	list(object)	[]	no	metric_query blocks for metric math / anomaly detection. Disables the single-metric path.
threshold_metric_id	string	null	no	ID of the `ANOMALY_DETECTION_BAND` query used as the threshold.
treat_missing_data	string	missing	no	Missing-data policy: `missing`, `notBreaching`, `breaching`, or `ignore`.
alarm_actions	list(string)	[]	no	ARNs notified on ALARM (typically an SNS topic).
ok_actions	list(string)	[]	no	ARNs notified on return to OK.
insufficient_data_actions	list(string)	[]	no	ARNs notified on INSUFFICIENT_DATA.
tags	map(string)	{}	no	Tags applied to the alarm.

Outputs

Name	Description
alarm_id	The ID of the metric alarm (equals the alarm name).
alarm_arn	The ARN of the alarm — use for composite alarms or cross-account policies.
alarm_name	The full alarm name (`name_prefix` + `alarm_name`).
comparison_operator	The comparison operator the alarm was created with.

Enterprise scenario

A payments platform runs roughly 400 CloudWatch alarms across 30 microservices in three AWS accounts. Before standardizing, each squad hand-wrote aws_cloudwatch_metric_alarm resources, and a postmortem found that 12% of “critical” alarms used the default treat_missing_data = "missing" against sparse error-count metrics — so they had never paged during real outages. The platform team published this module with notBreaching chosen per metric, mandatory ok_actions, and a naming convention enforced by name_prefix; squads now declare alarms in five variables, the ARNs flow into per-service composite alarms for “service unhealthy” paging, and a Terraform policy check rejects any alarm missing a runbook URL in alarm_description.
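
The scenario describes a policy check for the runbook URL. A lighter-weight alternative, swapped in here purely for illustration, is a variable validation inside the module itself. It is a sketch: the regex only checks that some URL is present, not that it points at a real runbook.

```hcl
# Hypothetical stricter variant of the alarm_description input.
variable "alarm_description" {
  description = "Human-readable description. Must contain a runbook URL."
  type        = string

  validation {
    # can() turns a failed regex() call (no match, or a null value) into false.
    condition     = can(regex("https?://\\S+", var.alarm_description))
    error_message = "alarm_description must include a runbook URL (http:// or https://)."
  }
}
```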

Best practices

Choose treat_missing_data per metric, never by default. For sparse metrics (error counts, queue ages that vanish when empty) missing means the alarm goes silent exactly when you need it; prefer notBreaching when no data means nothing went wrong (sparse error counts), and breaching for heartbeat-style metrics whose absence is itself a failure.
Use M-of-N (datapoints_to_alarm < evaluation_periods) to tame noise. A 3 of 5 configuration tolerates transient spikes while still firing on sustained problems — far better than a single-period alarm that pages on one bad scrape.
Always set ok_actions, and put runbook links in alarm_description. On-call needs to know when an incident clears as much as when it starts, and the description is the one piece of context that travels into every SNS/PagerDuty notification.
Match period and statistic to the metric’s resolution and the alarm intent. High-resolution custom metrics support 10s/30s periods; standard AWS metrics are 60s minimum. Use extended_statistic (e.g. p99) for latency SLOs rather than Average, which hides tail latency.
Scope SNS topic access tightly and tag for cost/ownership. Grant CloudWatch only sns:Publish to the alert topic via the topic policy, and tag every alarm with Team/Environment so you can attribute the (small but real) per-alarm cost and find orphaned alarms.
Roll up related alarms with aws_cloudwatch_composite_alarm. Page humans on a composite “service unhealthy” signal built from this module’s alarm_arn/alarm_name outputs, and keep individual alarms on notify-only channels to cut alert fatigue.
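
The topic-policy advice above can be sketched as follows. The topic is restated so the snippet is self-contained, and the statement ID and the account-scoping condition are assumptions about how you might want to restrict access, not requirements of the module.

```hcl
data "aws_caller_identity" "current" {}

resource "aws_sns_topic" "alerts" {
  name = "prod-payments-oncall"
}

data "aws_iam_policy_document" "alerts_publish" {
  statement {
    sid       = "AllowCloudWatchAlarmsPublish"
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["cloudwatch.amazonaws.com"]
    }

    # Only alarms owned by this account may publish to the topic.
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [data.aws_caller_identity.current.account_id]
    }
  }
}

resource "aws_sns_topic_policy" "alerts" {
  arn    = aws_sns_topic.alerts.arn
  policy = data.aws_iam_policy_document.alerts_publish.json
}
```

Attaching a policy replaces the topic's default one, so add further statements if other publishers or cross-account subscribers need access.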

Written by Vinod
