Quick take — Build a reusable Terraform module for aws_cloudwatch_metric_alarm: a var-driven CloudWatch alarm with SNS actions, configurable evaluation/datapoints, anomaly-detection support, and safe missing-data handling for hashicorp/aws ~> 5.0. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "cloudwatch_alarm" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"
name_prefix = "..." # Prefix prepended to `alarm_name` (e.g. `prod-payments`)…
alarm_name = "..." # Short alarm name; combined with `name_prefix`. Validate…
}
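Then run terraform init && terraform apply. name_prefix and alarm_name are the only inputs without a default, but an alarm also needs something to watch: set namespace, metric_name, and threshold (or metric_queries) as shown under How to use it before applying. The remaining inputs have sensible defaults; see Inputs below to override behaviour.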
What this module is
Amazon CloudWatch metric alarms watch a single metric (or a math expression over several metrics) and transition between OK, ALARM, and INSUFFICIENT_DATA states based on a threshold you define. When an alarm enters a state, it runs the actions configured for that state. The most common action is publishing to an SNS topic that fans out to PagerDuty, email, or Slack; an action can also be an Auto Scaling policy or an EC2 recovery action.
The raw aws_cloudwatch_metric_alarm resource has more than two dozen arguments, and the ones that actually determine whether an alarm is useful (evaluation_periods, datapoints_to_alarm, treat_missing_data, and the period/statistic pairing) are exactly the ones teams get wrong. An alarm with treat_missing_data = "missing" on a metric that only emits when something happens can sit in INSUFFICIENT_DATA instead of reliably firing or clearing; an alarm with evaluation_periods = 1 on a noisy metric will page you all night.
Wrapping it in a module lets you encode your organization’s conventions once: a consistent naming prefix, a default treat_missing_data policy, mandatory ok_actions so on-call knows when an incident clears, and tagging. Every team then gets a battle-tested alarm by passing five or six variables instead of copy-pasting a 30-line resource and silently dropping the datapoints_to_alarm line.
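As a rough sketch of what "mandatory ok_actions" could look like in practice, the ok_actions input could drop its default and carry a validation. This is an assumption about how one might tighten the module, not how the module as listed behaves (there, ok_actions defaults to an empty list):

```hcl
# Hypothetical stricter variant of the module's ok_actions input.
# With no default, callers must pass it; the validation rejects an empty list.
variable "ok_actions" {
  description = "List of ARNs to notify when the alarm returns to OK."
  type        = list(string)

  validation {
    condition     = length(var.ok_actions) > 0
    error_message = "Set ok_actions to at least one ARN so on-call is told when the alarm clears."
  }
}
```

The trade-off is that every caller, including notify-only or experimental alarms, then has to supply a topic, so some teams may prefer to keep the default and enforce the rule in a policy check instead.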
When to use it
- You manage more than a handful of alarms and want naming, tagging, and notification routing to be identical across them.
- You want to support both static-threshold alarms (CPU > 80%) and anomaly-detection band alarms from the same module without two code paths in every stack.
- You need alarms wired to SNS for incident tooling, and optionally to EC2 auto-recovery or Auto Scaling actions.
- You want guardrails (e.g. enforcing a valid comparison_operator, a sane period, and an explicit missing-data policy) so a typo cannot ship a dead alarm.
- Reach for AWS’s own managed alerting (e.g. Application Signals SLOs, or CloudWatch Alarms recommended by a service) when you want best-practice defaults you don’t control; use this module when you need per-team consistency and Terraform-managed state.
Module structure
terraform-module-aws-cloudwatch-alarm/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf
# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# main.tf
locals {
alarm_name = "${var.name_prefix}-${var.alarm_name}"
# When a raw metric is supplied (namespace + metric_name), we drive the
# alarm with the top-level metric arguments. When metric_query blocks are
# supplied, those take over and the top-level metric fields must be null.
use_metric_query = length(var.metric_queries) > 0
}
resource "aws_cloudwatch_metric_alarm" "this" {
alarm_name = local.alarm_name
alarm_description = var.alarm_description
actions_enabled = var.actions_enabled
comparison_operator = var.comparison_operator
evaluation_periods = var.evaluation_periods
datapoints_to_alarm = var.datapoints_to_alarm
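# threshold and threshold_metric_id are mutually exclusive: keep the static
# threshold for metric-math alarms and drop it only for anomaly-detection bands.
threshold = var.threshold_metric_id != null ? null : var.threshold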
# Single-metric path: only set when no metric_query blocks are provided.
namespace = local.use_metric_query ? null : var.namespace
metric_name = local.use_metric_query ? null : var.metric_name
period = local.use_metric_query ? null : var.period
statistic = local.use_metric_query || var.extended_statistic != null ? null : var.statistic
extended_statistic = local.use_metric_query ? null : var.extended_statistic
unit = local.use_metric_query ? null : var.unit
dimensions = local.use_metric_query ? null : var.dimensions
# Metric-math / anomaly-detection path.
threshold_metric_id = var.threshold_metric_id
dynamic "metric_query" {
for_each = var.metric_queries
content {
id = metric_query.value.id
expression = lookup(metric_query.value, "expression", null)
label = lookup(metric_query.value, "label", null)
return_data = lookup(metric_query.value, "return_data", null)
account_id = lookup(metric_query.value, "account_id", null)
period = lookup(metric_query.value, "period", null)
dynamic "metric" {
for_each = lookup(metric_query.value, "metric", null) != null ? [metric_query.value.metric] : []
content {
namespace = metric.value.namespace
metric_name = metric.value.metric_name
period = metric.value.period
stat = metric.value.stat
unit = lookup(metric.value, "unit", null)
dimensions = lookup(metric.value, "dimensions", null)
}
}
}
}
treat_missing_data = var.treat_missing_data
alarm_actions = var.alarm_actions
ok_actions = var.ok_actions
insufficient_data_actions = var.insufficient_data_actions
tags = var.tags
}
# variables.tf
variable "name_prefix" {
description = "Prefix prepended to alarm_name, e.g. \"prod-payments\". Keeps alarm names unique and searchable across the account."
type = string
}
variable "alarm_name" {
description = "Short, descriptive alarm name (combined with name_prefix to form the full CloudWatch alarm name)."
type = string
validation {
condition = length(var.alarm_name) > 0 && length(var.alarm_name) <= 200
error_message = "alarm_name must be 1-200 characters."
}
}
variable "alarm_description" {
description = "Human-readable description shown in the console and notifications. Include the runbook URL."
type = string
default = null
}
variable "actions_enabled" {
description = "Whether actions (alarm/ok/insufficient) are executed on state change. Set false to silence without deleting."
type = bool
default = true
}
variable "comparison_operator" {
description = "How the metric is compared to the threshold."
type = string
default = "GreaterThanOrEqualToThreshold"
validation {
condition = contains([
"GreaterThanOrEqualToThreshold",
"GreaterThanThreshold",
"LessThanThreshold",
"LessThanOrEqualToThreshold",
"LessThanLowerOrGreaterThanUpperThreshold",
"LessThanLowerThreshold",
"GreaterThanUpperThreshold",
], var.comparison_operator)
error_message = "comparison_operator must be a valid CloudWatch comparison operator."
}
}
variable "evaluation_periods" {
description = "Number of periods over which data is evaluated to set alarm state."
type = number
default = 3
validation {
condition = var.evaluation_periods >= 1
error_message = "evaluation_periods must be >= 1."
}
}
variable "datapoints_to_alarm" {
description = "M of N: number of breaching datapoints within evaluation_periods required to trigger ALARM. Null means it equals evaluation_periods."
type = number
default = null
validation {
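# Conditional form avoids comparing null on Terraform versions where || evaluates both operands.
condition = var.datapoints_to_alarm == null ? true : var.datapoints_to_alarm >= 1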
error_message = "datapoints_to_alarm must be null or >= 1."
}
}
variable "threshold" {
description = "Static value to compare the metric against. Leave null for anomaly-detection alarms, which set threshold_metric_id instead."
type = number
default = null
}
variable "namespace" {
description = "Metric namespace for the single-metric path, e.g. \"AWS/EC2\". Ignored when metric_queries are supplied."
type = string
default = null
}
variable "metric_name" {
description = "Metric name for the single-metric path, e.g. \"CPUUtilization\". Ignored when metric_queries are supplied."
type = string
default = null
}
variable "period" {
description = "Granularity in seconds for the single-metric path. Must be a multiple of 60 (1, 10, or 30 are only valid for high-resolution custom metrics)."
type = number
default = 300
validation {
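# Require a positive multiple of 60; a bare modulo check would also accept 0 and negative values.
condition = contains([1, 10, 30], var.period) || (var.period >= 60 && var.period % 60 == 0)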
error_message = "period must be 1, 10, 30, or a multiple of 60 seconds."
}
}
variable "statistic" {
description = "Standard statistic for the single-metric path (SampleCount, Average, Sum, Minimum, Maximum). Mutually exclusive with extended_statistic."
type = string
default = "Average"
validation {
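# Conditional form avoids passing null to contains() on Terraform versions where || evaluates both operands.
condition = var.statistic == null ? true : contains(["SampleCount", "Average", "Sum", "Minimum", "Maximum"], var.statistic)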
error_message = "statistic must be one of SampleCount, Average, Sum, Minimum, Maximum."
}
}
variable "extended_statistic" {
description = "Percentile statistic for the single-metric path, e.g. \"p99\" or \"p95.5\". When set, statistic is ignored."
type = string
default = null
validation {
condition = var.extended_statistic == null || can(regex("^p(100(\\.0+)?|\\d{1,2}(\\.\\d+)?)$", var.extended_statistic))
error_message = "extended_statistic must be a percentile like p99, p95, or p99.9."
}
}
variable "unit" {
description = "Optional metric unit (e.g. \"Percent\", \"Bytes\"). Leave null unless the metric is published with a specific unit you must match."
type = string
default = null
}
variable "dimensions" {
description = "Map of dimension name/value pairs for the single-metric path, e.g. { InstanceId = \"i-0abc\" }."
type = map(string)
default = {}
}
variable "metric_queries" {
description = <<-EOT
List of metric_query blocks for metric math or anomaly detection. When non-empty, the single-metric
arguments are ignored. Each item supports: id (required), expression, label, return_data, account_id,
period, and an optional metric = { namespace, metric_name, period, stat, unit, dimensions }.
EOT
type = list(object({
id = string
expression = optional(string)
label = optional(string)
return_data = optional(bool)
account_id = optional(string)
period = optional(number)
metric = optional(object({
namespace = string
metric_name = string
period = number
stat = string
unit = optional(string)
dimensions = optional(map(string))
}))
}))
default = []
}
variable "threshold_metric_id" {
description = "ID of the ANOMALY_DETECTION_BAND metric_query used as the threshold. Set this instead of threshold for anomaly-detection alarms."
type = string
default = null
}
variable "treat_missing_data" {
description = "How missing datapoints are treated: missing, notBreaching, breaching, or ignore. Choose deliberately per metric."
type = string
default = "missing"
validation {
condition = contains(["missing", "notBreaching", "breaching", "ignore"], var.treat_missing_data)
error_message = "treat_missing_data must be one of missing, notBreaching, breaching, ignore."
}
}
variable "alarm_actions" {
description = "List of ARNs (typically SNS topics) to notify when the alarm enters ALARM state."
type = list(string)
default = []
}
variable "ok_actions" {
description = "List of ARNs to notify when the alarm returns to OK. Strongly recommended so on-call knows an incident cleared."
type = list(string)
default = []
}
variable "insufficient_data_actions" {
description = "List of ARNs to notify when the alarm enters INSUFFICIENT_DATA."
type = list(string)
default = []
}
variable "tags" {
description = "Tags applied to the alarm."
type = map(string)
default = {}
}
# outputs.tf
output "alarm_id" {
description = "The ID of the CloudWatch metric alarm (equals the alarm name)."
value = aws_cloudwatch_metric_alarm.this.id
}
output "alarm_arn" {
description = "The ARN of the CloudWatch metric alarm, e.g. for cross-account policies or composite alarms."
value = aws_cloudwatch_metric_alarm.this.arn
}
output "alarm_name" {
description = "The full alarm name (name_prefix + alarm_name)."
value = aws_cloudwatch_metric_alarm.this.alarm_name
}
output "comparison_operator" {
description = "The comparison operator the alarm was created with."
value = aws_cloudwatch_metric_alarm.this.comparison_operator
}
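The validation blocks in variables.tf each inspect a single input. With required_version = ">= 1.5.0" a variable validation cannot reference other variables (that is only accepted from Terraform 1.9 onward), so rules that span inputs need another home. One option, sketched here as an assumption rather than part of the module as listed, is a pair of lifecycle preconditions merged into the existing resource in main.tf:

```hcl
# Fragment: merge the lifecycle block into the existing resource in main.tf.
resource "aws_cloudwatch_metric_alarm" "this" {
  # ... all arguments exactly as in main.tf ...

  lifecycle {
    # M-of-N only makes sense when M <= N.
    precondition {
      condition     = var.datapoints_to_alarm == null ? true : var.datapoints_to_alarm <= var.evaluation_periods
      error_message = "Set datapoints_to_alarm to a value no greater than evaluation_periods."
    }

    # An alarm needs something to watch: the single-metric pair or metric_queries.
    precondition {
      condition     = local.use_metric_query || (var.namespace != null && var.metric_name != null)
      error_message = "Provide namespace and metric_name, or at least one metric_queries entry."
    }
  }
}
```

Preconditions are evaluated at plan time when their inputs are known, so a misconfigured alarm fails before anything reaches CloudWatch.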
How to use it
A static-threshold CPU alarm for an EC2 instance, wired to an SNS topic, with ok_actions so the incident clears cleanly:
resource "aws_sns_topic" "alerts" {
name = "prod-payments-oncall"
}
module "cloudwatch_alarm" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"
name_prefix = "prod-payments"
alarm_name = "ec2-cpu-high"
alarm_description = "CPU >= 80% for 15m on the payments API host. Runbook: https://wiki.internal/runbooks/payments-cpu"
namespace = "AWS/EC2"
metric_name = "CPUUtilization"
statistic = "Average"
period = 300
dimensions = { InstanceId = aws_instance.payments_api.id }
comparison_operator = "GreaterThanOrEqualToThreshold"
threshold = 80
evaluation_periods = 3
datapoints_to_alarm = 3
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
tags = {
Team = "payments"
Environment = "prod"
}
}
# Downstream reference: feed the alarm name into a composite alarm so the
# "service unhealthy" page only fires when CPU AND latency are both breaching.
# (Assumes an alarm named prod-payments-latency-high is defined elsewhere.)
resource "aws_cloudwatch_composite_alarm" "service_unhealthy" {
alarm_name = "prod-payments-service-unhealthy"
alarm_rule = "ALARM(\"${module.cloudwatch_alarm.alarm_name}\") AND ALARM(\"prod-payments-latency-high\")"
alarm_actions = [aws_sns_topic.alerts.arn]
}
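For the alarm above to notify anyone, CloudWatch must be allowed to publish to the topic. A topic's default policy normally permits this within the owning account; the sketch below is one way to state the grant explicitly and narrow it to alarms in the current account. The data source and resource names (current, alerts_publish) are illustrative, and attaching aws_sns_topic_policy replaces the default policy, so add any other publishers you rely on:

```hcl
data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}

# Allow only CloudWatch alarms from this account to publish to the alert topic.
data "aws_iam_policy_document" "alerts_publish" {
  statement {
    sid       = "AllowCloudWatchAlarmsToPublish"
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["cloudwatch.amazonaws.com"]
    }

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values   = ["arn:${data.aws_partition.current.partition}:cloudwatch:*:${data.aws_caller_identity.current.account_id}:alarm:*"]
    }
  }
}

resource "aws_sns_topic_policy" "alerts" {
  arn    = aws_sns_topic.alerts.arn
  policy = data.aws_iam_policy_document.alerts_publish.json
}
```

If the topic is encrypted with a customer-managed KMS key, the key policy needs a matching grant for CloudWatch as well; that part is not shown here.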
For an anomaly-detection alarm, omit threshold and instead pass two metric_queries — the raw metric and an ANOMALY_DETECTION_BAND expression — then point threshold_metric_id at the band and use a band comparison operator:
module "latency_anomaly" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"
name_prefix = "prod-payments"
alarm_name = "alb-latency-anomaly"
alarm_description = "ALB target response time outside the expected band (2 stddev)."
comparison_operator = "GreaterThanUpperThreshold"
evaluation_periods = 3
threshold_metric_id = "e1"
treat_missing_data = "notBreaching"
metric_queries = [
{
id = "m1"
return_data = true
metric = {
namespace = "AWS/ApplicationELB"
metric_name = "TargetResponseTime"
period = 300
stat = "Average"
dimensions = { LoadBalancer = aws_lb.payments.arn_suffix }
}
},
{
id = "e1"
expression = "ANOMALY_DETECTION_BAND(m1, 2)"
label = "TargetResponseTime (expected)"
return_data = true
},
]
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/cloudwatch_alarm/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"
}
inputs = {
name_prefix = "..."
alarm_name = "..."
}
3. Deploy one environment, or roll out all modules together:
# Option A: apply only this module
(cd live/prod/cloudwatch_alarm && terragrunt apply)
# Option B: apply every module under live/prod
(cd live/prod && terragrunt run-all apply)
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
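The root config shown in step 1 covers only the backend. If the provider should also live in one place, a generate block in the same root file is one way to do it. This is a minimal sketch that assumes every environment uses the us-east-1 region from the Quickstart; the file name provider.tf is an arbitrary choice:

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block.
# Terragrunt writes provider.tf into each module's working directory before running Terraform.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

With this in place the module itself stays provider-free, which is also what allows callers to use it with count or for_each.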
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | — | yes | Prefix prepended to alarm_name (e.g. prod-payments) for unique, searchable names. |
| alarm_name | string | — | yes | Short alarm name; combined with name_prefix. Validated to 1–200 chars. |
| alarm_description | string | null | no | Description shown in console/notifications. Include the runbook URL. |
| actions_enabled | bool | true | no | Whether state-change actions execute. Set false to silence without deleting. |
| comparison_operator | string | GreaterThanOrEqualToThreshold | no | Comparison of metric to threshold; validated against the seven valid operators. |
| evaluation_periods | number | 3 | no | Periods evaluated to set state. Must be ≥ 1. |
| datapoints_to_alarm | number | null | no | M-of-N breaching datapoints to trigger ALARM. Null = equals evaluation_periods. |
| threshold | number | null | no | Static threshold value. Leave null for anomaly-detection alarms (set threshold_metric_id instead). |
| namespace | string | null | no | Metric namespace for the single-metric path (e.g. AWS/EC2). |
| metric_name | string | null | no | Metric name for the single-metric path (e.g. CPUUtilization). |
| period | number | 300 | no | Granularity in seconds. Must be 1, 10, 30, or a multiple of 60. |
| statistic | string | Average | no | Standard statistic. Mutually exclusive with extended_statistic. |
| extended_statistic | string | null | no | Percentile statistic (e.g. p99). Overrides statistic when set. |
| unit | string | null | no | Optional metric unit; set only when you must match the published unit. |
| dimensions | map(string) | {} | no | Dimension name/value pairs for the single-metric path. |
| metric_queries | list(object) | [] | no | metric_query blocks for metric math / anomaly detection. Disables the single-metric path. |
| threshold_metric_id | string | null | no | ID of the ANOMALY_DETECTION_BAND query used as the threshold. |
| treat_missing_data | string | missing | no | Missing-data policy: missing, notBreaching, breaching, or ignore. |
| alarm_actions | list(string) | [] | no | ARNs notified on ALARM (typically an SNS topic). |
| ok_actions | list(string) | [] | no | ARNs notified on return to OK. |
| insufficient_data_actions | list(string) | [] | no | ARNs notified on INSUFFICIENT_DATA. |
| tags | map(string) | {} | no | Tags applied to the alarm. |
Outputs
| Name | Description |
|---|---|
| alarm_id | The ID of the metric alarm (equals the alarm name). |
| alarm_arn | The ARN of the alarm — use for composite alarms or cross-account policies. |
| alarm_name | The full alarm name (name_prefix + alarm_name). |
| comparison_operator | The comparison operator the alarm was created with. |
Enterprise scenario
A payments platform runs roughly 400 CloudWatch alarms across 30 microservices in three AWS accounts. Before standardizing, each squad hand-wrote aws_cloudwatch_metric_alarm resources, and a postmortem found that 12% of “critical” alarms used the default treat_missing_data = "missing" against sparse error-count metrics — so they had never paged during real outages. The platform team published this module with notBreaching chosen per metric, mandatory ok_actions, and a naming convention enforced by name_prefix; squads now declare alarms in five variables, the ARNs flow into per-service composite alarms for “service unhealthy” paging, and a Terraform policy check rejects any alarm missing a runbook URL in alarm_description.
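Declaring many similar alarms from one definition might look like the sketch below, which drives the module with for_each. The cpu_alarms map, its keys, and the instance IDs are hypothetical placeholders, and aws_sns_topic.alerts is the topic from the earlier example:

```hcl
locals {
  # Hypothetical per-host settings; replace the IDs and thresholds with real values.
  cpu_alarms = {
    api    = { instance_id = "i-0123456789abcdef0", threshold = 80 }
    worker = { instance_id = "i-0fedcba9876543210", threshold = 90 }
  }
}

module "cpu_alarm" {
  for_each = local.cpu_alarms
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix = "prod-payments"
  alarm_name  = "${each.key}-cpu-high"

  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  dimensions  = { InstanceId = each.value.instance_id }
  threshold   = each.value.threshold

  treat_missing_data = "notBreaching"
  alarm_actions      = [aws_sns_topic.alerts.arn]
  ok_actions         = [aws_sns_topic.alerts.arn]
}

# Map of alarm ARNs keyed by host, e.g. for building composite alarms.
output "cpu_alarm_arns" {
  value = { for key, alarm in module.cpu_alarm : key => alarm.alarm_arn }
}
```

Adding a host then becomes a one-line change to the map, and every alarm inherits the same naming, missing-data policy, and notification routing.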
Best practices
- Choose treat_missing_data per metric, never by default. For sparse metrics (error counts, queue ages that vanish when empty) missing means the alarm goes silent exactly when you need it; prefer notBreaching for health metrics and breaching for metrics whose absence is itself a failure.
- Use M-of-N (datapoints_to_alarm < evaluation_periods) to tame noise. A 3 of 5 configuration tolerates transient spikes while still firing on sustained problems, which is far better than a single-period alarm that pages on one bad scrape.
- Always set ok_actions, and put runbook links in alarm_description. On-call needs to know when an incident clears as much as when it starts, and the description is the one piece of context that travels into every SNS/PagerDuty notification.
- Match period and statistic to the metric’s resolution and the alarm intent. High-resolution custom metrics support 10s/30s periods; standard AWS metrics are 60s minimum. Use extended_statistic (e.g. p99) for latency SLOs rather than Average, which hides tail latency.
- Scope SNS topic access tightly and tag for cost/ownership. Grant CloudWatch only sns:Publish to the alert topic via the topic policy, and tag every alarm with Team/Environment so you can attribute the (small but real) per-alarm and per-evaluation cost and find orphaned alarms.
- Roll up related alarms with aws_cloudwatch_composite_alarm. Page humans on a composite “service unhealthy” signal built from this module’s alarm_arn/alarm_name outputs, and keep individual alarms on notify-only channels to cut alert fatigue.
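Several of these practices can be combined in one call. The sketch below is a hypothetical p99 latency alarm that uses a 3-of-5 M-of-N window and an explicit missing-data policy; the 1-second threshold and the alarm name are illustrative values, and aws_lb.payments and aws_sns_topic.alerts are assumed to exist as in the earlier examples:

```hcl
module "latency_p99" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-cloudwatch-alarm?ref=v1.0.0"

  name_prefix       = "prod-payments"
  alarm_name        = "alb-latency-p99-high"
  alarm_description = "p99 target response time >= 1s in 3 of the last 5 minutes. Runbook: <runbook-url>"

  namespace          = "AWS/ApplicationELB"
  metric_name        = "TargetResponseTime" # reported in seconds
  extended_statistic = "p99"                # tail latency rather than Average
  period             = 60
  dimensions         = { LoadBalancer = aws_lb.payments.arn_suffix }

  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  evaluation_periods  = 5 # N: look at the last five periods
  datapoints_to_alarm = 3 # M: alarm when any three of them breach
  treat_missing_data  = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```

Because extended_statistic is set, the module's main.tf nulls out statistic, so the default of Average does not need to be overridden.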