Terraform Module: GCP Cloud Monitoring Alert — codified alert policies with thresholds, channels, and severity

Quick take: A reusable Terraform module for google_monitoring_alert_policy on hashicorp/google ~> 5.0, covering filter-based conditions, threshold and absence triggers, notification channel wiring, auto-close, and documentation. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "monitoring_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

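  project_id   = "..." # GCP project ID hosting the alert policy and its metrics.
  display_name = "..." # Human-readable policy name (1–512 chars) shown in console and incidents.

  # One or more threshold/absence conditions; see object schema in variables.tf.
  # Each entry is an object, not a string.
  conditions = [
    {
      display_name    = "..." # Condition name shown on incidents.
      filter          = "..." # Monitoring filter string selecting the metric.
      threshold_value = 0     # Placeholder; set your threshold, or omit for an absence condition.
    },
  ]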
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Google Cloud Monitoring alert policies (google_monitoring_alert_policy) are the rules that decide when an on-call human or a webhook gets paged. A policy is a set of one or more conditions — each evaluating a metric filter (or a Monitoring Query Language / PromQL query) against a threshold over a rolling window — combined by a combiner (AND, OR, AND_WITH_MATCHING_RESOURCE) and routed to one or more notification channels. It also carries the human context that makes an alert actionable: severity, documentation (the runbook), labels, and auto-close behaviour.

Authoring these by hand in the console is where reliability quietly erodes. The fields are fiddly and easy to get subtly wrong: duration must be a string of seconds with an s suffix ("300s"), comparison is an enum like COMPARISON_GT, alignment_period and per_series_aligner interact in non-obvious ways, and a missing trigger block silently changes whether any time series breaching or all of them breaching fires the alert. A reusable module pins those mechanics once, exposes only the knobs that vary per alert (metric, threshold, duration, severity, channels), and validates them so a typo fails at plan instead of at 3 a.m. It also makes alerting reviewable: the threshold for “P99 latency too high” lives in a pull request, not in one engineer’s browser history.
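To make those pitfalls concrete, here is a rough sketch of a single hand-written policy, outside the module. The project ID, metric filter, and threshold are illustrative assumptions rather than values taken from this module.

```hcl
# Hand-written policy, shown only to highlight the fiddly fields.
# Project, filter, and threshold are illustrative assumptions.
resource "google_monitoring_alert_policy" "cpu_high" {
  project      = "my-project"
  display_name = "VM CPU utilisation high"
  combiner     = "OR"

  conditions {
    display_name = "CPU > 80% for 5m"

    condition_threshold {
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT" # an enum, not ">"
      threshold_value = 0.8             # this metric is reported as a fraction
      duration        = "300s"          # seconds with an "s" suffix, not 300 or "5m"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN" # must suit the metric kind (GAUGE here)
      }

      # Fire as soon as one matching time series breaches.
      trigger {
        count = 1
      }
    }
  }
}
```

Every quoted duration, enum, and aligner above is a place where a hand-edited policy can drift; the module's variables and validations exist to pin them down once.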

When to use it

You manage alerting for many services or environments and want every alert policy to share the same shape — consistent labels, severity taxonomy, auto-close, and documentation format.
You want metric-threshold and metric-absence alerts (latency, error rate, CPU, queue depth, “no heartbeat in 5 minutes”) defined as code and code-reviewed alongside the workload they watch.
You need alerts to route to the right channels per severity (PagerDuty for CRITICAL, Slack/email for WARNING) without copy-pasting channel IDs.
You are codifying an SLO/error-budget practice and want burn-rate or threshold alerts versioned next to the SLOs.
Skip this module if you only need a single throwaway alert. If your condition depends on a log-based metric or a notification channel that does not exist yet, create the google_logging_metric and google_monitoring_notification_channel first, then feed their outputs in.
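The "create them first, then feed their outputs in" ordering might look like the sketch below. The metric name, log filter, email address, and threshold are hypothetical; only the module inputs come from this article.

```hcl
# Hypothetical log-based metric counting ERROR entries for one Cloud Run service.
resource "google_logging_metric" "payments_errors" {
  project = "kv-prod-001"
  name    = "payments_api_errors"
  filter  = "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"payments-api\" AND severity>=ERROR"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}

# Hypothetical email channel; the address is a placeholder.
resource "google_monitoring_notification_channel" "team_email" {
  project      = "kv-prod-001"
  display_name = "Payments team email"
  type         = "email"
  labels = {
    email_address = "payments-team@example.com"
  }
}

module "payments_error_log_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "kv-prod-001"
  display_name = "payments-api error logs elevated"

  # Referencing the resources makes Terraform create them before the policy.
  notification_channel_ids = [google_monitoring_notification_channel.team_email.id]

  conditions = [
    {
      display_name = "Error log rate > 1/s for 5m"
      # User-defined log-based metrics live under logging.googleapis.com/user/<name>.
      filter          = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.payments_errors.name}\" AND resource.type=\"cloud_run_revision\""
      threshold_value = 1 # per second, because the module's default aligner is ALIGN_RATE
    },
  ]
}
```

Because the filter and channel list interpolate resource attributes, Terraform infers the dependency order and no explicit depends_on is needed.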

Module structure

terraform-module-gcp-monitoring-alert/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_monitoring_alert_policy with dynamic conditions
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # policy id/name + condition names

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  # Channels are passed as short IDs (the "ABC123" part) OR as full resource
  # names. Normalise to the full "projects/<p>/notificationChannels/<id>" form
  # that the API requires.
  notification_channels = [
    for ch in var.notification_channel_ids :
    can(regex("^projects/", ch)) ? ch : "projects/${var.project_id}/notificationChannels/${ch}"
  ]
}

resource "google_monitoring_alert_policy" "this" {
  project      = var.project_id
  display_name = var.display_name
  combiner     = var.combiner
  enabled      = var.enabled
  severity     = var.severity

  user_labels = var.user_labels

  notification_channels = local.notification_channels

  # One condition per entry in var.conditions. Each is either a threshold
  # condition (filter + comparison + threshold) or, when threshold_value is
  # null, an absence condition ("this metric stopped reporting").
  dynamic "conditions" {
    for_each = var.conditions
    content {
      display_name = conditions.value.display_name

      # Threshold condition
      dynamic "condition_threshold" {
        for_each = conditions.value.threshold_value != null ? [conditions.value] : []
        content {
          filter          = condition_threshold.value.filter
          comparison      = condition_threshold.value.comparison
          threshold_value = condition_threshold.value.threshold_value
          duration        = "${condition_threshold.value.duration_seconds}s"

          aggregations {
            alignment_period     = "${condition_threshold.value.alignment_period_seconds}s"
            per_series_aligner   = condition_threshold.value.per_series_aligner
            cross_series_reducer = condition_threshold.value.cross_series_reducer
            group_by_fields      = condition_threshold.value.group_by_fields
          }

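          # count and percent are mutually exclusive in the API. trigger_count
          # defaults to 1, so drop it whenever trigger_percent is set.
          trigger {
            count   = condition_threshold.value.trigger_percent == null ? condition_threshold.value.trigger_count : null
            percent = condition_threshold.value.trigger_percent
          }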
        }
      }

      # Absence condition (metric stopped reporting for `duration`)
      dynamic "condition_absent" {
        for_each = conditions.value.threshold_value == null ? [conditions.value] : []
        content {
          filter   = condition_absent.value.filter
          duration = "${condition_absent.value.duration_seconds}s"

          aggregations {
            alignment_period   = "${condition_absent.value.alignment_period_seconds}s"
            per_series_aligner = condition_absent.value.per_series_aligner
          }

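          # count and percent are mutually exclusive in the API. trigger_count
          # defaults to 1, so drop it whenever trigger_percent is set.
          trigger {
            count   = condition_absent.value.trigger_percent == null ? condition_absent.value.trigger_count : null
            percent = condition_absent.value.trigger_percent
          }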
        }
      }
    }
  }

  # Runbook / context shown on every incident for this policy.
  dynamic "documentation" {
    for_each = var.documentation_content != null ? [1] : []
    content {
      content   = var.documentation_content
      mime_type = "text/markdown"
      subject   = var.documentation_subject
    }
  }

  # Auto-close stale open incidents so a fixed-but-never-acked alert
  # does not linger forever.
  alert_strategy {
    auto_close = var.auto_close_duration

    dynamic "notification_rate_limit" {
      for_each = var.notification_rate_limit_period != null ? [1] : []
      content {
        period = var.notification_rate_limit_period
      }
    }
  }
}

# variables.tf

variable "project_id" {
  description = "GCP project ID that hosts the alert policy and its metrics."
  type        = string
}

variable "display_name" {
  description = "Human-readable name of the alert policy as it appears in the console and incidents."
  type        = string

  validation {
    condition     = length(var.display_name) > 0 && length(var.display_name) <= 512
    error_message = "display_name must be between 1 and 512 characters."
  }
}

variable "combiner" {
  description = "How multiple conditions combine: AND, OR, or AND_WITH_MATCHING_RESOURCE."
  type        = string
  default     = "OR"

  validation {
    condition     = contains(["AND", "OR", "AND_WITH_MATCHING_RESOURCE"], var.combiner)
    error_message = "combiner must be one of AND, OR, or AND_WITH_MATCHING_RESOURCE."
  }
}

variable "enabled" {
  description = "Whether the policy is active. Set false to stage a policy without paging anyone."
  type        = bool
  default     = true
}

variable "severity" {
  description = "Incident severity surfaced to notification channels: CRITICAL, ERROR, or WARNING."
  type        = string
  default     = "WARNING"

  validation {
    condition     = contains(["CRITICAL", "ERROR", "WARNING"], var.severity)
    error_message = "severity must be one of CRITICAL, ERROR, or WARNING."
  }
}

variable "user_labels" {
  description = "Key/value labels attached to the policy (e.g. team, service, slo) for routing and search."
  type        = map(string)
  default     = {}

  validation {
    # User label keys must be lowercase and start with a letter (GCP constraint).
    condition     = alltrue([for k in keys(var.user_labels) : can(regex("^[a-z][a-z0-9_-]*$", k))])
    error_message = "user_labels keys must be lowercase and start with a letter ([a-z][a-z0-9_-]*)."
  }
}

variable "notification_channel_ids" {
  description = "Notification channel IDs (short form like 'ABC123' or full 'projects/<p>/notificationChannels/<id>')."
  type        = list(string)
  default     = []
}

variable "conditions" {
  description = <<-EOT
    List of conditions. Each item is a threshold condition when threshold_value is
    set, or an absence condition when threshold_value is null (metric stopped
    reporting). filter is a Monitoring filter string.
  EOT
  type = list(object({
    display_name             = string
    filter                   = string
    comparison               = optional(string, "COMPARISON_GT")
    threshold_value          = optional(number) # null => absence condition
    duration_seconds         = optional(number, 300)
    alignment_period_seconds = optional(number, 60)
    per_series_aligner       = optional(string, "ALIGN_RATE")
    cross_series_reducer     = optional(string, "REDUCE_NONE")
    group_by_fields          = optional(list(string), [])
    trigger_count            = optional(number, 1)
    trigger_percent          = optional(number)
  }))

  validation {
    condition     = length(var.conditions) >= 1
    error_message = "At least one condition must be provided."
  }

  validation {
    condition = alltrue([
      for c in var.conditions :
      c.comparison == null || contains(
        ["COMPARISON_GT", "COMPARISON_GE", "COMPARISON_LT", "COMPARISON_LE", "COMPARISON_EQ", "COMPARISON_NE"],
        c.comparison
      )
    ])
    error_message = "Each condition.comparison must be a valid COMPARISON_* enum."
  }

  validation {
    condition     = alltrue([for c in var.conditions : c.duration_seconds >= 0])
    error_message = "condition.duration_seconds must be zero or positive."
  }
}

variable "documentation_content" {
  description = "Markdown runbook shown on every incident. Set null to omit documentation."
  type        = string
  default     = null
}

variable "documentation_subject" {
  description = "Subject line / title for the documentation block."
  type        = string
  default     = null
}

variable "auto_close_duration" {
  description = "How long an open incident waits with no data before auto-closing, e.g. '1800s'. Min 1800s per API."
  type        = string
  default     = "1800s"

  validation {
    condition     = can(regex("^[0-9]+s$", var.auto_close_duration))
    error_message = "auto_close_duration must be a duration string of seconds, e.g. '1800s'."
  }
}

variable "notification_rate_limit_period" {
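  description = "Minimum spacing between notifications for the same incident, e.g. '300s'. Null disables rate limiting. The API documents this limit for log-based (LogMatch) policies, so verify it takes effect for metric conditions."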
  type        = string
  default     = null

  validation {
    condition     = var.notification_rate_limit_period == null || can(regex("^[0-9]+s$", var.notification_rate_limit_period))
    error_message = "notification_rate_limit_period must be null or a duration string of seconds, e.g. '300s'."
  }
}

# outputs.tf

output "id" {
  description = "Full resource ID of the alert policy (projects/<project>/alertPolicies/<id>)."
  value       = google_monitoring_alert_policy.this.id
}

output "name" {
  description = "API name of the alert policy, identical to its id."
  value       = google_monitoring_alert_policy.this.name
}

output "display_name" {
  description = "Display name of the alert policy."
  value       = google_monitoring_alert_policy.this.display_name
}

output "creation_record" {
  description = "Provenance record (mutate time and the identity that created the policy)."
  value       = google_monitoring_alert_policy.this.creation_record
}

output "condition_names" {
  description = "Server-assigned full names of each condition, useful for cross-referencing in incidents."
  value       = [for c in google_monitoring_alert_policy.this.conditions : c.name]
}
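The auto_close_duration description in variables.tf mentions a 1800s minimum, but the validation shown there only checks the string format. As a sketch, a stricter variant could also enforce that floor at plan time. It is a possible drop-in replacement for the variable block, not the module's published code.

```hcl
variable "auto_close_duration" {
  description = "How long an open incident waits with no data before auto-closing, e.g. '1800s'. Min 1800s per API."
  type        = string
  default     = "1800s"

  validation {
    condition     = can(regex("^[0-9]+s$", var.auto_close_duration))
    error_message = "auto_close_duration must be a duration string of seconds, e.g. '1800s'."
  }

  validation {
    # try() falls back to 0 when the value is not a plain "<digits>s" string,
    # so a malformed value fails this check too instead of erroring out.
    condition     = try(tonumber(trimsuffix(var.auto_close_duration, "s")), 0) >= 1800
    error_message = "auto_close_duration must be at least 1800s (30 minutes)."
  }
}
```

With this in place, a value such as "600s" is rejected during terraform plan rather than by the API during apply.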

How to use it

# PagerDuty channel, defined here for completeness; in practice it often
# lives elsewhere in the stack and only its id is passed in.
resource "google_monitoring_notification_channel" "pagerduty" {
  project      = "kv-prod-001"
  display_name = "Payments On-Call (PagerDuty)"
  type         = "pagerduty"

  # The PagerDuty integration key is a secret, so it belongs in
  # sensitive_labels.service_key rather than in plain labels.
  sensitive_labels {
    service_key = var.pagerduty_service_key
  }
}

module "cloud_monitoring_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "kv-prod-001"
  display_name = "payments-api — P99 latency high"
  severity     = "CRITICAL"
  combiner     = "OR"

  notification_channel_ids = [
    google_monitoring_notification_channel.pagerduty.id,
  ]

  user_labels = {
    team    = "payments"
    service = "payments-api"
    slo     = "latency"
  }

  conditions = [
    {
      display_name = "P99 request latency > 800ms for 5m"
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_latencies\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.\"service_name\"=\"payments-api\"",
      ])
      comparison               = "COMPARISON_GT"
      threshold_value          = 800
      duration_seconds         = 300
      alignment_period_seconds = 60
      per_series_aligner       = "ALIGN_PERCENTILE_99"
      cross_series_reducer     = "REDUCE_MEAN"
      group_by_fields          = ["resource.label.service_name"]
    },
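    {
      # threshold_value omitted -> absence condition: the service stopped emitting metrics.
      display_name = "payments-api stopped reporting latency (10m)"
      # Scope the filter to payments-api; without the service_name clause it
      # would match every Cloud Run revision in the project.
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_latencies\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.\"service_name\"=\"payments-api\"",
      ])
      duration_seconds         = 600
      alignment_period_seconds = 60
      per_series_aligner       = "ALIGN_PERCENTILE_99"
    },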
  ]

  documentation_subject = "Payments API latency breach"
  documentation_content = <<-EOT
    ## Payments API latency runbook
    1. Check the Cloud Run revision dashboard for traffic spikes.
    2. Inspect downstream Spanner / Pub/Sub latency.
    3. If a bad deploy is suspected, roll back to the previous revision.
  EOT

  auto_close_duration            = "1800s"
  notification_rate_limit_period = "300s"
}

# Downstream reference: feed the policy id into a dashboard's incident list,
# or simply expose it for audit tooling.
output "payments_latency_alert_id" {
  value = module.cloud_monitoring_alert.id
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}
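The root config above shows only the remote_state half. The provider half could be generated from the same file with a generate block, roughly as sketched here; the project and region are placeholders carried over from the Quickstart.

```hcl
# live/terragrunt.hcl (continued): hypothetical provider generation.
# The project and region values are placeholders.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module itself stays free of provider blocks.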

2. Module config — live/prod/monitoring_alert/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
}

inputs = {
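  project_id   = "..."
  display_name = "..."

  # Each condition is an object, not a string; see the schema in variables.tf.
  conditions = [
    {
      display_name    = "..."
      filter          = "..."
      threshold_value = 0 # Placeholder; set your threshold, or omit for an absence condition.
    },
  ]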
}

3. Deploy this module on its own, or roll out every module in the environment together (both commands start from the repository root):

cd live/prod/monitoring_alert && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
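As an illustration of run-all ordering modules by their dependencies, the module config could pull a channel ID from a sibling Terragrunt module. The ../notification_channels path and its pagerduty_channel_id output are hypothetical names, not part of this repository.

```hcl
# live/prod/monitoring_alert/terragrunt.hcl, extended with a dependency.
# "../notification_channels" and "pagerduty_channel_id" are hypothetical.
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
}

dependency "channels" {
  config_path = "../notification_channels"

  # Stand-in value used only while the dependency has no state yet.
  mock_outputs = {
    pagerduty_channel_id = "projects/mock/notificationChannels/0"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  project_id   = "..."
  display_name = "..."
  severity     = "CRITICAL"

  # run-all applies ../notification_channels first because of this reference.
  notification_channel_ids = [dependency.channels.outputs.pagerduty_channel_id]

  conditions = [
    {
      display_name    = "..."
      filter          = "..."
      threshold_value = 0 # placeholder threshold
    },
  ]
}
```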

Inputs

Name	Type	Default	Required	Description
`project_id`	`string`	—	Yes	GCP project ID hosting the alert policy and its metrics.
`display_name`	`string`	—	Yes	Human-readable policy name (1–512 chars) shown in console and incidents.
`combiner`	`string`	`"OR"`	No	How conditions combine: `AND`, `OR`, or `AND_WITH_MATCHING_RESOURCE`.
`enabled`	`bool`	`true`	No	Whether the policy is active; `false` stages it without paging.
`severity`	`string`	`"WARNING"`	No	Incident severity: `CRITICAL`, `ERROR`, or `WARNING`.
`user_labels`	`map(string)`	`{}`	No	Labels on the policy (keys lowercase, letter-led) for routing/search.
`notification_channel_ids`	`list(string)`	`[]`	No	Channel IDs (short `ABC123` or full resource name); normalised internally.
`conditions`	`list(object)`	—	Yes	One or more threshold/absence conditions; see object schema in `variables.tf`.
`documentation_content`	`string`	`null`	No	Markdown runbook attached to every incident.
`documentation_subject`	`string`	`null`	No	Subject/title for the documentation block.
`auto_close_duration`	`string`	`"1800s"`	No	Idle time before an open incident auto-closes (min `1800s`).
`notification_rate_limit_period`	`string`	`null`	No	Minimum spacing between notifications per incident; `null` disables.

Outputs

Name	Description
`id`	Full resource ID of the alert policy (`projects/<project>/alertPolicies/<id>`).
`name`	API name of the policy (identical to `id`).
`display_name`	Display name of the alert policy.
`creation_record`	Provenance: mutate time and the identity that created the policy.
`condition_names`	Server-assigned full names of each condition.

Enterprise scenario

A fintech platform runs ~40 microservices on Cloud Run across dev, staging, and prod projects. The SRE team consumes this module from a single alerts/ Terraform stack so every service ships with the same three baseline alerts — P99 latency breach, 5xx error-rate breach, and a metric-absence “service went dark” alert — all tagged with team and slo labels. severity drives routing: CRITICAL policies attach the PagerDuty channel and a 5-minute rate limit, while WARNING policies go to Slack only, so a noisy staging deploy never pages on-call. Because the thresholds and runbooks live in version control, an auditor can diff exactly when an SLO threshold changed and who approved the pull request.
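One way such a stack could be laid out is a service map driving for_each, with severity selecting the channel list. This is a sketch under assumptions: the service names, thresholds, and the two channel variables are invented for illustration, and only the latency alert of the three baseline alerts is shown.

```hcl
# Hypothetical channel IDs supplied by the caller.
variable "pagerduty_channel_id" {
  type = string
}

variable "slack_channel_id" {
  type = string
}

locals {
  # Hypothetical service catalogue.
  services = {
    "payments-api" = { team = "payments", severity = "CRITICAL", latency_ms = 800 }
    "ledger-api"   = { team = "ledger", severity = "WARNING", latency_ms = 1500 }
  }

  # Severity decides the channel set: paging for CRITICAL, chat otherwise.
  channels_by_severity = {
    CRITICAL = [var.pagerduty_channel_id]
    ERROR    = [var.slack_channel_id]
    WARNING  = [var.slack_channel_id]
  }
}

module "latency_alert" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
  for_each = local.services

  project_id   = "kv-prod-001"
  display_name = "${each.key} P99 latency high"
  severity     = each.value.severity
  user_labels  = { team = each.value.team, service = each.key, slo = "latency" }

  notification_channel_ids = local.channels_by_severity[each.value.severity]

  conditions = [
    {
      display_name       = "P99 request latency > ${each.value.latency_ms}ms for 5m"
      filter             = "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.label.\"service_name\"=\"${each.key}\""
      threshold_value    = each.value.latency_ms
      per_series_aligner = "ALIGN_PERCENTILE_99"
    },
  ]
}
```

Adding a service then becomes a one-line change to the map, and the diff shows exactly which threshold and severity it ships with.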

Best practices

Match severity to channel routing, not just colour. Reserve CRITICAL for policies wired to paging channels with a notification_rate_limit_period; send WARNING to chat/email only. This reserves pages for incidents that need a human right away and curbs alert fatigue.
Always set a runbook via documentation_content. An alert without context is just an interruption — put the first three triage steps and the rollback command directly on the incident so responders start acting, not searching.
Use absence conditions for liveness, and align the window to reality. A condition_absent with a 10-minute duration_seconds catches a service that silently stopped emitting metrics, but set it longer than your metric’s normal sampling gap or batch cadence to avoid flapping during quiet traffic.
Pick per_series_aligner and threshold together. ALIGN_RATE on a DELTA/CUMULATIVE metric yields per-second rates — your threshold must be in those units. Mismatched aligners are the most common reason an alert “never fires” or “fires constantly.”
Tag every policy with user_labels (team, service, slo). Labels are how you bulk-silence, build per-team dashboards, and answer “which alerts does this squad own?” without scraping display names.
Keep auto_close_duration modest (≈30 min). It governs how long an open incident lingers once its metric stops reporting data; a long window leaves stale incidents open and can mask the next real breach. The API minimum of 1800s is a sane default for most workloads.
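To illustrate the aligner-and-threshold point, here is a sketch of a 5xx error-rate policy built on the module. The service name and the 5 requests-per-second threshold are assumptions chosen for the example.

```hcl
# Hypothetical 5xx error-rate policy; service name and threshold are assumptions.
module "payments_5xx_rate_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "kv-prod-001"
  display_name = "payments-api 5xx rate high"
  severity     = "WARNING"

  conditions = [
    {
      display_name = "5xx responses > 5/s for 5m"
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.\"service_name\"=\"payments-api\"",
        "metric.label.\"response_code_class\"=\"5xx\"",
      ])

      # request_count is a DELTA counter. ALIGN_RATE converts each 60s bucket
      # into requests per second, so the threshold is 5 (req/s), which equals
      # roughly 300 requests per minute. Writing 300 here would be 60x too high.
      per_series_aligner       = "ALIGN_RATE"
      alignment_period_seconds = 60
      cross_series_reducer     = "REDUCE_SUM"
      group_by_fields          = ["resource.label.service_name"]
      threshold_value          = 5
    },
  ]
}
```

Stating the unit in the condition's display_name, as done here, makes a mismatch between aligner and threshold easier to spot in review.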


Written by Vinod
