Terraform Module: GCP Cloud Monitoring Alert — codified alert policies with thresholds, channels, and severity

Quick take: A reusable Terraform module for google_monitoring_alert_policy on hashicorp/google ~> 5.0, covering filter-based conditions, threshold and absence triggers, notification channel wiring, auto-close, and documentation. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal configuration: drop this in a .tf file and fill in the "..." placeholders. Each required input is commented; note that every entry in conditions is an object (schema in variables.tf below), not a plain string:

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "monitoring_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "..."           # GCP project ID hosting the alert policy and its metrics.
  display_name = "..."           # Human-readable policy name (1–512 chars) shown in conso…
  conditions   = ["...", "..."]  # One or more threshold/absence conditions; see object sc…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
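
For illustration, a filled-in version of the Quickstart might look like the sketch below. The project ID, the metric (Compute Engine CPU utilisation), and the 0.8 threshold are assumptions chosen for the example rather than values the module prescribes; the attribute names follow the conditions object schema in variables.tf.

```hcl
module "monitoring_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "my-project" # hypothetical project ID
  display_name = "VM CPU utilisation high"

  conditions = [
    {
      display_name    = "VM CPU utilisation > 80% for 5m"
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      threshold_value = 0.8 # utilisation is reported as a fraction between 0.0 and 1.0

      # CPU utilisation is a gauge, so the module's default aligner
      # (ALIGN_RATE) does not fit; average within each alignment period instead.
      per_series_aligner = "ALIGN_MEAN"

      # comparison, duration_seconds and alignment_period_seconds are left at
      # the module defaults: COMPARISON_GT, 300 and 60.
    },
  ]
}
```

Leaving threshold_value out of an entry would turn it into an absence condition, as described in the module source below.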

What this module is

Google Cloud Monitoring alert policies (google_monitoring_alert_policy) are the rules that decide when an on-call human or a webhook gets paged. A policy is a set of one or more conditions — each evaluating a metric filter (or a Monitoring Query Language / PromQL query) against a threshold over a rolling window — combined by a combiner (AND, OR, AND_WITH_MATCHING_RESOURCE) and routed to one or more notification channels. It also carries the human context that makes an alert actionable: severity, documentation (the runbook), labels, and auto-close behaviour.

Authoring these by hand in the console is where reliability quietly erodes. The fields are fiddly and easy to get subtly wrong: duration must be a string of seconds with an s suffix ("300s"), comparison is an enum like COMPARISON_GT, alignment_period and per_series_aligner interact in non-obvious ways, and a missing trigger block silently changes whether any time series breaching or all of them breaching fires the alert. A reusable module pins those mechanics once, exposes only the knobs that vary per alert (metric, threshold, duration, severity, channels), and validates them so a typo fails at plan instead of at 3 a.m. It also makes alerting reviewable: the threshold for “P99 latency too high” lives in a pull request, not in one engineer’s browser history.
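
To make those fiddly fields concrete, here is a minimal sketch of the same kind of policy written directly against the resource, without the module. The project ID, metric, and threshold are illustrative assumptions; the comments mark the spots called out above.

```hcl
# Hand-written policy, for comparison with the module. Values are illustrative.
resource "google_monitoring_alert_policy" "example" {
  project      = "my-project" # hypothetical project ID
  display_name = "CPU high (hand-written)"
  combiner     = "OR"

  conditions {
    display_name = "CPU > 80% for 5m"

    condition_threshold {
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT" # an enum value, not ">"
      threshold_value = 0.8
      duration        = "300s" # string of seconds with an "s" suffix, not 300 or "5m"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }

      # Fire as soon as at least one time series breaches the threshold.
      trigger {
        count = 1
      }
    }
  }
}
```

Every one of these literals is something the module either derives (the "s" suffixes) or validates (the enums), which is the main argument for wrapping the resource.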

When to use it

Module structure

terraform-module-gcp-monitoring-alert/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_monitoring_alert_policy with dynamic conditions
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # policy id/name + condition names

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  # Channels are passed as short IDs (the "ABC123" part) OR as full resource
  # names. Normalise to the full "projects/<p>/notificationChannels/<id>" form
  # that the API requires.
  notification_channels = [
    for ch in var.notification_channel_ids :
    can(regex("^projects/", ch)) ? ch : "projects/${var.project_id}/notificationChannels/${ch}"
  ]
}

resource "google_monitoring_alert_policy" "this" {
  project      = var.project_id
  display_name = var.display_name
  combiner     = var.combiner
  enabled      = var.enabled
  severity     = var.severity

  user_labels = var.user_labels

  notification_channels = local.notification_channels

  # One condition per entry in var.conditions. Each is either a threshold
  # condition (filter + comparison + threshold) or, when threshold_value is
  # null, an absence condition ("this metric stopped reporting").
  dynamic "conditions" {
    for_each = var.conditions
    content {
      display_name = conditions.value.display_name

      # Threshold condition
      dynamic "condition_threshold" {
        for_each = conditions.value.threshold_value != null ? [conditions.value] : []
        content {
          filter          = condition_threshold.value.filter
          comparison      = condition_threshold.value.comparison
          threshold_value = condition_threshold.value.threshold_value
          duration        = "${condition_threshold.value.duration_seconds}s"

          aggregations {
            alignment_period     = "${condition_threshold.value.alignment_period_seconds}s"
            per_series_aligner   = condition_threshold.value.per_series_aligner
            cross_series_reducer = condition_threshold.value.cross_series_reducer
            group_by_fields      = condition_threshold.value.group_by_fields
          }

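          trigger {
            # count and percent are mutually exclusive in the API, so count
            # is dropped whenever trigger_percent is set.
            count   = condition_threshold.value.trigger_percent == null ? condition_threshold.value.trigger_count : null
            percent = condition_threshold.value.trigger_percent
          }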
        }
      }

      # Absence condition (metric stopped reporting for `duration`)
      dynamic "condition_absent" {
        for_each = conditions.value.threshold_value == null ? [conditions.value] : []
        content {
          filter   = condition_absent.value.filter
          duration = "${condition_absent.value.duration_seconds}s"

          aggregations {
            alignment_period   = "${condition_absent.value.alignment_period_seconds}s"
            per_series_aligner = condition_absent.value.per_series_aligner
          }

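          trigger {
            # count and percent are mutually exclusive in the API, so count
            # is dropped whenever trigger_percent is set.
            count   = condition_absent.value.trigger_percent == null ? condition_absent.value.trigger_count : null
            percent = condition_absent.value.trigger_percent
          }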
        }
      }
    }
  }

  # Runbook / context shown on every incident for this policy.
  dynamic "documentation" {
    for_each = var.documentation_content != null ? [1] : []
    content {
      content   = var.documentation_content
      mime_type = "text/markdown"
      subject   = var.documentation_subject
    }
  }

  # Auto-close stale open incidents so a fixed-but-never-acked alert
  # does not linger forever.
  alert_strategy {
    auto_close = var.auto_close_duration

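    # The provider docs describe notification_rate_limit as implemented only
    # for log-based (LogMatch) policies; leave the period null for
    # metric-based policies unless you have confirmed the API accepts it.
    dynamic "notification_rate_limit" {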
      for_each = var.notification_rate_limit_period != null ? [1] : []
      content {
        period = var.notification_rate_limit_period
      }
    }
  }
}

# variables.tf

variable "project_id" {
  description = "GCP project ID that hosts the alert policy and its metrics."
  type        = string
}

variable "display_name" {
  description = "Human-readable name of the alert policy as it appears in the console and incidents."
  type        = string

  validation {
    condition     = length(var.display_name) > 0 && length(var.display_name) <= 512
    error_message = "display_name must be between 1 and 512 characters."
  }
}

variable "combiner" {
  description = "How multiple conditions combine: AND, OR, or AND_WITH_MATCHING_RESOURCE."
  type        = string
  default     = "OR"

  validation {
    condition     = contains(["AND", "OR", "AND_WITH_MATCHING_RESOURCE"], var.combiner)
    error_message = "combiner must be one of AND, OR, or AND_WITH_MATCHING_RESOURCE."
  }
}

variable "enabled" {
  description = "Whether the policy is active. Set false to stage a policy without paging anyone."
  type        = bool
  default     = true
}

variable "severity" {
  description = "Incident severity surfaced to notification channels: CRITICAL, ERROR, or WARNING."
  type        = string
  default     = "WARNING"

  validation {
    condition     = contains(["CRITICAL", "ERROR", "WARNING"], var.severity)
    error_message = "severity must be one of CRITICAL, ERROR, or WARNING."
  }
}

variable "user_labels" {
  description = "Key/value labels attached to the policy (e.g. team, service, slo) for routing and search."
  type        = map(string)
  default     = {}

  validation {
    # User label keys must be lowercase and start with a letter (GCP constraint).
    condition     = alltrue([for k in keys(var.user_labels) : can(regex("^[a-z][a-z0-9_-]*$", k))])
    error_message = "user_labels keys must be lowercase and start with a letter ([a-z][a-z0-9_-]*)."
  }
}

variable "notification_channel_ids" {
  description = "Notification channel IDs (short form like 'ABC123' or full 'projects/<p>/notificationChannels/<id>')."
  type        = list(string)
  default     = []
}

variable "conditions" {
  description = <<-EOT
    List of conditions. Each item is a threshold condition when threshold_value is
    set, or an absence condition when threshold_value is null (metric stopped
    reporting). filter is a Monitoring filter string.
  EOT
  type = list(object({
    display_name             = string
    filter                   = string
    comparison               = optional(string, "COMPARISON_GT")
    threshold_value          = optional(number) # null => absence condition
    duration_seconds         = optional(number, 300)
    alignment_period_seconds = optional(number, 60)
    per_series_aligner       = optional(string, "ALIGN_RATE")
    cross_series_reducer     = optional(string, "REDUCE_NONE")
    group_by_fields          = optional(list(string), [])
    trigger_count            = optional(number, 1)
    trigger_percent          = optional(number)
  }))

  validation {
    condition     = length(var.conditions) >= 1
    error_message = "At least one condition must be provided."
  }

  validation {
    condition = alltrue([
      for c in var.conditions :
      c.comparison == null || contains(
        ["COMPARISON_GT", "COMPARISON_GE", "COMPARISON_LT", "COMPARISON_LE", "COMPARISON_EQ", "COMPARISON_NE"],
        c.comparison
      )
    ])
    error_message = "Each condition.comparison must be a valid COMPARISON_* enum."
  }

  validation {
    condition     = alltrue([for c in var.conditions : c.duration_seconds >= 0])
    error_message = "condition.duration_seconds must be zero or positive."
  }
}

variable "documentation_content" {
  description = "Markdown runbook shown on every incident. Set null to omit documentation."
  type        = string
  default     = null
}

variable "documentation_subject" {
  description = "Subject line / title for the documentation block."
  type        = string
  default     = null
}

variable "auto_close_duration" {
  description = "How long an open incident waits with no data before auto-closing, e.g. '1800s'. Min 1800s per API."
  type        = string
  default     = "1800s"

  validation {
    condition     = can(regex("^[0-9]+s$", var.auto_close_duration))
    error_message = "auto_close_duration must be a duration string of seconds, e.g. '1800s'."
  }
}

variable "notification_rate_limit_period" {
  description = "Minimum spacing between notifications for the same incident, e.g. '300s'. Null disables rate limiting."
  type        = string
  default     = null

  validation {
    condition     = var.notification_rate_limit_period == null || can(regex("^[0-9]+s$", var.notification_rate_limit_period))
    error_message = "notification_rate_limit_period must be null or a duration string of seconds, e.g. '300s'."
  }
}

# outputs.tf

output "id" {
  description = "Full resource ID of the alert policy (projects/<project>/alertPolicies/<id>)."
  value       = google_monitoring_alert_policy.this.id
}

output "name" {
  description = "API name of the alert policy, identical to its id."
  value       = google_monitoring_alert_policy.this.name
}

output "display_name" {
  description = "Display name of the alert policy."
  value       = google_monitoring_alert_policy.this.display_name
}

output "creation_record" {
  description = "Provenance record (mutate time and the identity that created the policy)."
  value       = google_monitoring_alert_policy.this.creation_record
}

output "condition_names" {
  description = "Server-assigned full names of each condition, useful for cross-referencing in incidents."
  value       = [for c in google_monitoring_alert_policy.this.conditions : c.name]
}
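
One way to check that the validations and the channel normalisation behave as described is a small terraform test file. The sketch below is an assumption-laden example, not part of the published module: the file name is hypothetical, and it relies on Terraform 1.7 or later for mock_provider (the module itself only pins >= 1.5.0), so that the tests can plan without credentials or real API calls.

```hcl
# tests/validation.tftest.hcl (hypothetical file inside the module directory)

# Assumption: Terraform >= 1.7, so the google provider can be mocked.
mock_provider "google" {}

# Shared inputs; individual run blocks override what they need.
variables {
  project_id   = "my-project"
  display_name = "test policy"
  conditions = [
    {
      display_name    = "CPU > 80%"
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      threshold_value = 0.8
    },
  ]
}

run "rejects_unknown_severity" {
  command = plan

  variables {
    severity = "SEV1" # not CRITICAL, ERROR or WARNING
  }

  # The plan is expected to fail on the severity validation block.
  expect_failures = [var.severity]
}

run "rejects_malformed_auto_close" {
  command = plan

  variables {
    auto_close_duration = "30m" # must be seconds with an "s" suffix
  }

  expect_failures = [var.auto_close_duration]
}

run "normalises_short_channel_ids" {
  command = plan

  variables {
    notification_channel_ids = ["ABC123"]
  }

  # The short ID should be expanded using var.project_id.
  assert {
    condition = contains(
      google_monitoring_alert_policy.this.notification_channels,
      "projects/my-project/notificationChannels/ABC123"
    )
    error_message = "Short channel IDs should be expanded to full resource names."
  }
}
```

Run it with terraform init && terraform test from the module directory. Because the provider is mocked, this only exercises the module's own logic; it says nothing about whether the Monitoring API would accept the resulting policy.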

How to use it

# PagerDuty channel for the payments on-call rotation. In a real stack this
# often lives elsewhere and is passed in by ID.
resource "google_monitoring_notification_channel" "pagerduty" {
  project      = "kv-prod-001"
  display_name = "Payments On-Call (PagerDuty)"
  type         = "pagerduty"

  # service_key is a secret, so it belongs in sensitive_labels rather than
  # labels (auth_token is the Slack equivalent and does not apply here).
  sensitive_labels {
    service_key = var.pagerduty_service_key
  }
}

module "cloud_monitoring_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "kv-prod-001"
  display_name = "payments-api — P99 latency high"
  severity     = "CRITICAL"
  combiner     = "OR"

  notification_channel_ids = [
    google_monitoring_notification_channel.pagerduty.id,
  ]

  user_labels = {
    team    = "payments"
    service = "payments-api"
    slo     = "latency"
  }

  conditions = [
    {
      display_name = "P99 request latency > 800ms for 5m"
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_latencies\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.\"service_name\"=\"payments-api\"",
      ])
      comparison               = "COMPARISON_GT"
      threshold_value          = 800
      duration_seconds         = 300
      alignment_period_seconds = 60
      per_series_aligner       = "ALIGN_PERCENTILE_99"
      cross_series_reducer     = "REDUCE_MEAN"
      group_by_fields          = ["resource.label.service_name"]
    },
    {
      # threshold_value omitted -> absence condition: revision stopped emitting metrics.
      display_name             = "payments-api stopped reporting latency (10m)"
      filter                   = "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.label.\"service_name\"=\"payments-api\""
      duration_seconds         = 600
      alignment_period_seconds = 60
      per_series_aligner       = "ALIGN_PERCENTILE_99"
    },
  ]

  documentation_subject = "Payments API latency breach"
  documentation_content = <<-EOT
    ## Payments API latency runbook
    1. Check the Cloud Run revision dashboard for traffic spikes.
    2. Inspect downstream Spanner / Pub/Sub latency.
    3. If a bad deploy is suspected, roll back to the previous revision.
  EOT

  auto_close_duration            = "1800s"
  notification_rate_limit_period = "300s"
}

# Downstream reference: feed the policy id into a dashboard's incident list,
# or simply expose it for audit tooling.
output "payments_latency_alert_id" {
  value = module.cloud_monitoring_alert.id
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config, live/prod/monitoring_alert/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
}

inputs = {
  project_id   = "..."
  display_name = "..."
  conditions   = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

# Each command is run from the repository root.
cd live/prod/monitoring_alert && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
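
As a sketch of what "overridden per environment" can look like, a dev counterpart to the prod config might differ only in its inputs. The path, the kv-dev-001 project ID, and the looser threshold are assumptions made for this example; the include and source lines mirror the prod file above.

```hcl
# live/dev/monitoring_alert/terragrunt.hcl (hypothetical dev counterpart)

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
}

inputs = {
  project_id   = "kv-dev-001" # assumed dev project ID
  display_name = "payments-api P99 latency high (dev)"
  severity     = "WARNING" # dev should never page on-call

  conditions = [
    {
      display_name         = "P99 request latency > 1500ms for 10m"
      filter               = "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.label.\"service_name\"=\"payments-api\""
      threshold_value      = 1500 # looser than prod; illustrative value
      duration_seconds     = 600
      per_series_aligner   = "ALIGN_PERCENTILE_99"
      cross_series_reducer = "REDUCE_MEAN"
      group_by_fields      = ["resource.label.service_name"]
    },
  ]
}
```

The module source and version stay identical across environments, so promoting a threshold change from dev to prod is a one-line diff in the prod inputs.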

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | - | Yes | GCP project ID hosting the alert policy and its metrics. |
| display_name | string | - | Yes | Human-readable policy name (1–512 chars) shown in console and incidents. |
| combiner | string | "OR" | No | How conditions combine: AND, OR, or AND_WITH_MATCHING_RESOURCE. |
| enabled | bool | true | No | Whether the policy is active; false stages it without paging. |
| severity | string | "WARNING" | No | Incident severity: CRITICAL, ERROR, or WARNING. |
| user_labels | map(string) | {} | No | Labels on the policy (keys lowercase, letter-led) for routing/search. |
| notification_channel_ids | list(string) | [] | No | Channel IDs (short ABC123 or full resource name); normalised internally. |
| conditions | list(object) | - | Yes | One or more threshold/absence conditions; see object schema in variables.tf. |
| documentation_content | string | null | No | Markdown runbook attached to every incident. |
| documentation_subject | string | null | No | Subject/title for the documentation block. |
| auto_close_duration | string | "1800s" | No | Idle time before an open incident auto-closes (min 1800s). |
| notification_rate_limit_period | string | null | No | Minimum spacing between notifications per incident; null disables. |

Outputs

| Name | Description |
| --- | --- |
| id | Full resource ID of the alert policy (projects/<project>/alertPolicies/<id>). |
| name | API name of the policy (identical to id). |
| display_name | Display name of the alert policy. |
| creation_record | Provenance: mutate time and the identity that created the policy. |
| condition_names | Server-assigned full names of each condition. |

Enterprise scenario

A fintech platform runs ~40 microservices on Cloud Run across dev, staging, and prod projects. The SRE team consumes this module from a single alerts/ Terraform stack so every service ships with the same three baseline alerts — P99 latency breach, 5xx error-rate breach, and a metric-absence “service went dark” alert — all tagged with team and slo labels. severity drives routing: CRITICAL policies attach the PagerDuty channel and a 5-minute rate limit, while WARNING policies go to Slack only, so a noisy staging deploy never pages on-call. Because the thresholds and runbooks live in version control, an auditor can diff exactly when an SLO threshold changed and who approved the pull request.
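
A rough sketch of how such a stack could stamp out one baseline alert per service and route by severity is shown below. The service names, teams, and the two channel variables are hypothetical, and only the latency alert is shown; the error-rate and absence alerts would follow the same for_each pattern.

```hcl
# Hypothetical inputs: channel IDs are supplied by whoever owns the channels.
variable "pagerduty_channel_id" {
  type = string
}

variable "slack_channel_id" {
  type = string
}

locals {
  # Illustrative service catalogue; a real stack would list all ~40 services.
  services = {
    payments-api = { team = "payments", severity = "CRITICAL" }
    ledger-api   = { team = "ledger", severity = "WARNING" }
  }
}

module "latency_alert" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
  for_each = local.services

  project_id   = "kv-prod-001"
  display_name = "${each.key}: P99 latency high"
  severity     = each.value.severity

  # CRITICAL pages PagerDuty; everything else goes to Slack only.
  notification_channel_ids = each.value.severity == "CRITICAL" ? [var.pagerduty_channel_id] : [var.slack_channel_id]

  user_labels = {
    team    = each.value.team
    service = each.key
    slo     = "latency"
  }

  conditions = [
    {
      display_name         = "P99 request latency > 800ms for 5m"
      filter               = "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.label.\"service_name\"=\"${each.key}\""
      threshold_value      = 800
      per_series_aligner   = "ALIGN_PERCENTILE_99"
      cross_series_reducer = "REDUCE_MEAN"
      group_by_fields      = ["resource.label.service_name"]
    },
  ]
}
```

Adding a service then means adding one line to local.services, and the resulting policies are addressable as module.latency_alert["payments-api"] and so on.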

Best practices
