Quick take: A reusable Terraform module for google_monitoring_alert_policy on hashicorp/google ~> 5.0, covering filter-based threshold and absence conditions, triggers, notification channel wiring, auto-close, and documentation. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "monitoring_alert" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
project_id = "..." # GCP project ID hosting the alert policy and its metrics.
display_name = "..." # Human-readable policy name (1–512 chars) shown in conso…
conditions = [{ display_name = "...", filter = "..." }] # One or more threshold/absence conditions; see object sc…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
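The conditions placeholder above takes objects rather than plain strings, and the defaults (ALIGN_RATE in particular) suit counter-style metrics. As a hedged illustration, a filled-in call for a gauge metric might look like the sketch below. The Pub/Sub backlog metric, the 1000-message threshold, and the display names are assumptions chosen for the example, not values prescribed by the module.

```hcl
# Hypothetical filled-in Quickstart; metric, threshold, and names are illustrative.
module "monitoring_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "my-project"
  display_name = "orders subscription backlog high"

  conditions = [
    {
      display_name = "Undelivered messages above 1000 for 5m"
      filter = join(" AND ", [
        "metric.type=\"pubsub.googleapis.com/subscription/num_undelivered_messages\"",
        "resource.type=\"pubsub_subscription\"",
      ])
      comparison         = "COMPARISON_GT"
      threshold_value    = 1000
      duration_seconds   = 300
      per_series_aligner = "ALIGN_MEAN" # gauge metric, so average it instead of the ALIGN_RATE default
    },
  ]
}
```

Omitting threshold_value from the object would turn the same entry into an absence condition, as described in the conditions variable.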
What this module is
Google Cloud Monitoring alert policies (google_monitoring_alert_policy) are the rules that decide when an on-call human or a webhook gets paged. A policy is a set of one or more conditions — each evaluating a metric filter (or a Monitoring Query Language / PromQL query) against a threshold over a rolling window — combined by a combiner (AND, OR, AND_WITH_MATCHING_RESOURCE) and routed to one or more notification channels. It also carries the human context that makes an alert actionable: severity, documentation (the runbook), labels, and auto-close behaviour.
Authoring these by hand in the console is where reliability quietly erodes. The fields are fiddly and easy to get subtly wrong: duration must be a string of seconds with an s suffix ("300s"), comparison is an enum like COMPARISON_GT, alignment_period and per_series_aligner interact in non-obvious ways, and a missing trigger block silently changes whether any time series breaching or all of them breaching fires the alert. A reusable module pins those mechanics once, exposes only the knobs that vary per alert (metric, threshold, duration, severity, channels), and validates them so a typo fails at plan instead of at 3 a.m. It also makes alerting reviewable: the threshold for “P99 latency too high” lives in a pull request, not in one engineer’s browser history.
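To make those pitfalls concrete, here is a minimal sketch of the raw resource written by hand, without the module. The project ID, metric, and threshold are placeholder assumptions; the point is only to show where the string durations, the comparison enum, and the trigger block sit.

```hcl
# Hand-written policy, shown only to illustrate the fields the module wraps.
# Project ID, metric, and threshold are placeholder assumptions.
resource "google_monitoring_alert_policy" "hand_written" {
  project      = "my-project"
  display_name = "CPU high (hand-written example)"
  combiner     = "OR"

  conditions {
    display_name = "CPU above 80% for 5m"

    condition_threshold {
      filter          = "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT" # enum value, not ">"
      threshold_value = 0.8             # utilization is reported as a 0-1 fraction
      duration        = "300s"          # seconds with an "s" suffix, not 300 or "5m"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }

      # Fire as soon as one time series breaches; set `percent` instead
      # to require a share of the series.
      trigger {
        count = 1
      }
    }
  }
}
```

Every quoted literal here is a place where a typo would only surface at apply time, which is the gap the module's variable validations are meant to close.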
When to use it
- You manage alerting for many services or environments and want every alert policy to share the same shape — consistent labels, severity taxonomy, auto-close, and documentation format.
- You want metric-threshold and metric-absence alerts (latency, error rate, CPU, queue depth, “no heartbeat in 5 minutes”) defined as code and code-reviewed alongside the workload they watch.
- You need alerts to route to the right channels per severity (PagerDuty for CRITICAL, Slack/email for WARNING) without copy-pasting channel IDs.
- You are codifying an SLO/error-budget practice and want burn-rate or threshold alerts versioned next to the SLOs.
- Skip this module if you only need a single throwaway alert, or if your condition is a log-based metric that does not yet exist: create the google_logging_metric and google_monitoring_notification_channel first, then feed their outputs in.
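The per-severity routing mentioned in the list above could be sketched as a lookup map in the calling stack. The map name, the variable, and the channel IDs below are hypothetical; the module itself only receives the resulting list through notification_channel_ids.

```hcl
# Hypothetical channel lookup keyed by severity; the IDs are placeholders.
locals {
  channels_by_severity = {
    CRITICAL = ["PAGERDUTY_CHANNEL_ID"]
    ERROR    = ["PAGERDUTY_CHANNEL_ID", "SLACK_CHANNEL_ID"]
    WARNING  = ["SLACK_CHANNEL_ID"]
  }
}

variable "alert_severity" {
  description = "Severity for this alert; also selects the notification channels."
  type        = string
  default     = "WARNING"
}

module "routed_alert" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"

  project_id   = "my-project"
  display_name = "..."
  severity     = var.alert_severity

  # Short IDs are expanded to full resource names inside the module.
  notification_channel_ids = local.channels_by_severity[var.alert_severity]

  conditions = [
    { display_name = "...", filter = "..." }, # fill in as in the Quickstart
  ]
}
```

Keeping the map in one place means a channel rotation is a one-line change rather than an edit to every alert definition.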
Module structure
terraform-module-gcp-monitoring-alert/
├── versions.tf # provider + Terraform version pins
├── main.tf # google_monitoring_alert_policy with dynamic conditions
├── variables.tf # var-driven inputs with validations
└── outputs.tf # policy id/name + condition names
# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
# main.tf
locals {
# Channels are passed as short IDs (the "ABC123" part) OR as full resource
# names. Normalise to the full "projects/<p>/notificationChannels/<id>" form
# that the API requires.
notification_channels = [
for ch in var.notification_channel_ids :
can(regex("^projects/", ch)) ? ch : "projects/${var.project_id}/notificationChannels/${ch}"
]
}
resource "google_monitoring_alert_policy" "this" {
project = var.project_id
display_name = var.display_name
combiner = var.combiner
enabled = var.enabled
severity = var.severity
user_labels = var.user_labels
notification_channels = local.notification_channels
# One condition per entry in var.conditions. Each is either a threshold
# condition (filter + comparison + threshold) or, when threshold_value is
# null, an absence condition ("this metric stopped reporting").
dynamic "conditions" {
for_each = var.conditions
content {
display_name = conditions.value.display_name
# Threshold condition
dynamic "condition_threshold" {
for_each = conditions.value.threshold_value != null ? [conditions.value] : []
content {
filter = condition_threshold.value.filter
comparison = condition_threshold.value.comparison
threshold_value = condition_threshold.value.threshold_value
duration = "${condition_threshold.value.duration_seconds}s"
aggregations {
alignment_period = "${condition_threshold.value.alignment_period_seconds}s"
per_series_aligner = condition_threshold.value.per_series_aligner
cross_series_reducer = condition_threshold.value.cross_series_reducer
group_by_fields = condition_threshold.value.group_by_fields
}
trigger {
count = condition_threshold.value.trigger_count
percent = condition_threshold.value.trigger_percent
}
}
}
# Absence condition (metric stopped reporting for `duration`)
dynamic "condition_absent" {
for_each = conditions.value.threshold_value == null ? [conditions.value] : []
content {
filter = condition_absent.value.filter
duration = "${condition_absent.value.duration_seconds}s"
aggregations {
alignment_period = "${condition_absent.value.alignment_period_seconds}s"
per_series_aligner = condition_absent.value.per_series_aligner
}
trigger {
count = condition_absent.value.trigger_count
percent = condition_absent.value.trigger_percent
}
}
}
}
}
# Runbook / context shown on every incident for this policy.
dynamic "documentation" {
for_each = var.documentation_content != null ? [1] : []
content {
content = var.documentation_content
mime_type = "text/markdown"
subject = var.documentation_subject
}
}
# Auto-close stale open incidents so a fixed-but-never-acked alert
# does not linger forever.
alert_strategy {
auto_close = var.auto_close_duration
dynamic "notification_rate_limit" {
for_each = var.notification_rate_limit_period != null ? [1] : []
content {
period = var.notification_rate_limit_period
}
}
}
}
# variables.tf
variable "project_id" {
description = "GCP project ID that hosts the alert policy and its metrics."
type = string
}
variable "display_name" {
description = "Human-readable name of the alert policy as it appears in the console and incidents."
type = string
validation {
condition = length(var.display_name) > 0 && length(var.display_name) <= 512
error_message = "display_name must be between 1 and 512 characters."
}
}
variable "combiner" {
description = "How multiple conditions combine: AND, OR, or AND_WITH_MATCHING_RESOURCE."
type = string
default = "OR"
validation {
condition = contains(["AND", "OR", "AND_WITH_MATCHING_RESOURCE"], var.combiner)
error_message = "combiner must be one of AND, OR, or AND_WITH_MATCHING_RESOURCE."
}
}
variable "enabled" {
description = "Whether the policy is active. Set false to stage a policy without paging anyone."
type = bool
default = true
}
variable "severity" {
description = "Incident severity surfaced to notification channels: CRITICAL, ERROR, or WARNING."
type = string
default = "WARNING"
validation {
condition = contains(["CRITICAL", "ERROR", "WARNING"], var.severity)
error_message = "severity must be one of CRITICAL, ERROR, or WARNING."
}
}
variable "user_labels" {
description = "Key/value labels attached to the policy (e.g. team, service, slo) for routing and search."
type = map(string)
default = {}
validation {
# User label keys must be lowercase and start with a letter (GCP constraint).
condition = alltrue([for k in keys(var.user_labels) : can(regex("^[a-z][a-z0-9_-]*$", k))])
error_message = "user_labels keys must be lowercase and start with a letter ([a-z][a-z0-9_-]*)."
}
}
variable "notification_channel_ids" {
description = "Notification channel IDs (short form like 'ABC123' or full 'projects/<p>/notificationChannels/<id>')."
type = list(string)
default = []
}
variable "conditions" {
description = <<-EOT
List of conditions. Each item is a threshold condition when threshold_value is
set, or an absence condition when threshold_value is null (metric stopped
reporting). filter is a Monitoring filter string.
EOT
type = list(object({
display_name = string
filter = string
comparison = optional(string, "COMPARISON_GT")
threshold_value = optional(number) # null => absence condition
duration_seconds = optional(number, 300)
alignment_period_seconds = optional(number, 60)
per_series_aligner = optional(string, "ALIGN_RATE")
cross_series_reducer = optional(string, "REDUCE_NONE")
group_by_fields = optional(list(string), [])
trigger_count = optional(number, 1)
trigger_percent = optional(number)
}))
validation {
condition = length(var.conditions) >= 1
error_message = "At least one condition must be provided."
}
validation {
condition = alltrue([
for c in var.conditions :
c.comparison == null || contains(
["COMPARISON_GT", "COMPARISON_GE", "COMPARISON_LT", "COMPARISON_LE", "COMPARISON_EQ", "COMPARISON_NE"],
c.comparison
)
])
error_message = "Each condition.comparison must be a valid COMPARISON_* enum."
}
validation {
condition = alltrue([for c in var.conditions : c.duration_seconds >= 0])
error_message = "condition.duration_seconds must be zero or positive."
}
}
variable "documentation_content" {
description = "Markdown runbook shown on every incident. Set null to omit documentation."
type = string
default = null
}
variable "documentation_subject" {
description = "Subject line / title for the documentation block."
type = string
default = null
}
variable "auto_close_duration" {
description = "How long an open incident waits with no data before auto-closing, e.g. '1800s'. Min 1800s per API."
type = string
default = "1800s"
validation {
condition = can(regex("^[0-9]+s$", var.auto_close_duration))
error_message = "auto_close_duration must be a duration string of seconds, e.g. '1800s'."
}
}
variable "notification_rate_limit_period" {
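description = "Minimum spacing between notifications for the same incident, e.g. '300s'. Null disables rate limiting. The Cloud Monitoring API documents this limit for log-based (LogMatch) policies, so confirm it is accepted for your condition types."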
type = string
default = null
validation {
condition = var.notification_rate_limit_period == null || can(regex("^[0-9]+s$", var.notification_rate_limit_period))
error_message = "notification_rate_limit_period must be null or a duration string of seconds, e.g. '300s'."
}
}
# outputs.tf
output "id" {
description = "Full resource ID of the alert policy (projects/<project>/alertPolicies/<id>)."
value = google_monitoring_alert_policy.this.id
}
output "name" {
description = "API name of the alert policy, identical to its id."
value = google_monitoring_alert_policy.this.name
}
output "display_name" {
description = "Display name of the alert policy."
value = google_monitoring_alert_policy.this.display_name
}
output "creation_record" {
description = "Provenance record (mutate time and the identity that created the policy)."
value = google_monitoring_alert_policy.this.creation_record
}
output "condition_names" {
description = "Server-assigned full names of each condition, useful for cross-referencing in incidents."
value = [for c in google_monitoring_alert_policy.this.conditions : c.name]
}
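The auto_close_duration description above mentions a 1800s API minimum, but the validation as listed only checks the format. A stricter variant is sketched below as an optional hardening step; it is an assumption about how one might tighten the check, not part of the module as published.

```hcl
# Optional stricter validation (a sketch, not the module's shipped code).
variable "auto_close_duration" {
  description = "How long an open incident waits with no data before auto-closing, e.g. '1800s'."
  type        = string
  default     = "1800s"

  validation {
    condition = (
      can(regex("^[0-9]+s$", var.auto_close_duration)) &&
      # try() guards tonumber() in case the format check above already failed.
      try(tonumber(trimsuffix(var.auto_close_duration, "s")) >= 1800, false)
    )
    error_message = "auto_close_duration must be whole seconds with an 's' suffix and at least 1800s."
  }
}
```

With this in place a value such as "600s" would be rejected at plan time instead of by the API at apply time.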
How to use it
# A pre-existing PagerDuty channel created elsewhere in the stack.
resource "google_monitoring_notification_channel" "pagerduty" {
project = "kv-prod-001"
display_name = "Payments On-Call (PagerDuty)"
type = "pagerduty"
# The PagerDuty integration key is a secret, so it belongs in
# sensitive_labels (as service_key) rather than in plain labels.
sensitive_labels {
service_key = var.pagerduty_service_key
}
}
module "cloud_monitoring_alert" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
project_id = "kv-prod-001"
display_name = "payments-api — P99 latency high"
severity = "CRITICAL"
combiner = "OR"
notification_channel_ids = [
google_monitoring_notification_channel.pagerduty.id,
]
user_labels = {
team = "payments"
service = "payments-api"
slo = "latency"
}
conditions = [
{
display_name = "P99 request latency > 800ms for 5m"
filter = join(" AND ", [
"metric.type=\"run.googleapis.com/request_latencies\"",
"resource.type=\"cloud_run_revision\"",
"resource.label.\"service_name\"=\"payments-api\"",
])
comparison = "COMPARISON_GT"
threshold_value = 800
duration_seconds = 300
alignment_period_seconds = 60
per_series_aligner = "ALIGN_PERCENTILE_99"
cross_series_reducer = "REDUCE_MEAN"
group_by_fields = ["resource.label.service_name"]
},
{
# threshold_value omitted -> absence condition: revision stopped emitting metrics.
display_name = "payments-api stopped reporting latency (10m)"
filter = "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.label.\"service_name\"=\"payments-api\""
duration_seconds = 600
alignment_period_seconds = 60
per_series_aligner = "ALIGN_PERCENTILE_99"
},
]
documentation_subject = "Payments API latency breach"
documentation_content = <<-EOT
## Payments API latency runbook
1. Check the Cloud Run revision dashboard for traffic spikes.
2. Inspect downstream Spanner / Pub/Sub latency.
3. If a bad deploy is suspected, roll back to the previous revision.
EOT
auto_close_duration = "1800s"
notification_rate_limit_period = "300s"
}
# Downstream reference: feed the policy id into a dashboard's incident list,
# or simply expose it for audit tooling.
output "payments_latency_alert_id" {
value = module.cloud_monitoring_alert.id
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...gcs state bucket/container + key per path...
}
}
2. Module config — live/prod/monitoring_alert/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
}
inputs = {
project_id = "..."
display_name = "..."
conditions = [{ display_name = "...", filter = "..." }]
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/monitoring_alert && terragrunt apply # this module
terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
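As a rough illustration of the per-environment override described above, a dev counterpart to the prod config might look like this. The live/dev path and the choice to downgrade severity are assumptions for the example; the "..." placeholders are left for you to fill in.

```hcl
# live/dev/monitoring_alert/terragrunt.hcl (hypothetical dev counterpart)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
}

inputs = {
  project_id   = "..."
  display_name = "..."

  # Only the values that differ from prod need to change here.
  severity = "WARNING" # dev alerts should not page on-call
  enabled  = true

  conditions = [
    { display_name = "...", filter = "..." },
  ]
}
```

The module source and version stay identical across environments, so a diff between the dev and prod files shows only the intended behavioural differences.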
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | (none) | Yes | GCP project ID hosting the alert policy and its metrics. |
| display_name | string | (none) | Yes | Human-readable policy name (1–512 chars) shown in console and incidents. |
| combiner | string | "OR" | No | How conditions combine: AND, OR, or AND_WITH_MATCHING_RESOURCE. |
| enabled | bool | true | No | Whether the policy is active; false stages it without paging. |
| severity | string | "WARNING" | No | Incident severity: CRITICAL, ERROR, or WARNING. |
| user_labels | map(string) | {} | No | Labels on the policy (keys lowercase, letter-led) for routing/search. |
| notification_channel_ids | list(string) | [] | No | Channel IDs (short ABC123 or full resource name); normalised internally. |
| conditions | list(object) | (none) | Yes | One or more threshold/absence conditions; see object schema in variables.tf. |
| documentation_content | string | null | No | Markdown runbook attached to every incident. |
| documentation_subject | string | null | No | Subject/title for the documentation block. |
| auto_close_duration | string | "1800s" | No | Idle time before an open incident auto-closes (min 1800s). |
| notification_rate_limit_period | string | null | No | Minimum spacing between notifications per incident; null disables. |
Outputs
| Name | Description |
|---|---|
| id | Full resource ID of the alert policy (projects/<project>/alertPolicies/<id>). |
| name | API name of the policy (identical to id). |
| display_name | Display name of the alert policy. |
| creation_record | Provenance: mutate time and the identity that created the policy. |
| condition_names | Server-assigned full names of each condition. |
Enterprise scenario
A fintech platform runs ~40 microservices on Cloud Run across dev, staging, and prod projects. The SRE team consumes this module from a single alerts/ Terraform stack so every service ships with the same three baseline alerts — P99 latency breach, 5xx error-rate breach, and a metric-absence “service went dark” alert — all tagged with team and slo labels. severity drives routing: CRITICAL policies attach the PagerDuty channel and a 5-minute rate limit, while WARNING policies go to Slack only, so a noisy staging deploy never pages on-call. Because the thresholds and runbooks live in version control, an auditor can diff exactly when an SLO threshold changed and who approved the pull request.
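One way such a stack could stamp out the same latency alert for many services is a for_each over a service catalogue. The sketch below is an assumption about how the alerts/ stack might be organised; the locals.services map, the second service name, and the per-service thresholds are hypothetical.

```hcl
locals {
  # Hypothetical service catalogue; names and thresholds are illustrative.
  services = {
    "payments-api" = { team = "payments", latency_ms = 800 }
    "ledger-api"   = { team = "ledger", latency_ms = 1200 }
  }
}

module "latency_alert" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-monitoring-alert?ref=v1.0.0"
  for_each = local.services

  project_id   = "kv-prod-001"
  display_name = "${each.key}: P99 latency high"
  severity     = "CRITICAL"

  user_labels = {
    team    = each.value.team
    service = each.key
    slo     = "latency"
  }

  conditions = [
    {
      display_name = "P99 request latency > ${each.value.latency_ms}ms for 5m"
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_latencies\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.\"service_name\"=\"${each.key}\"",
      ])
      threshold_value    = each.value.latency_ms
      per_series_aligner = "ALIGN_PERCENTILE_99"
    },
  ]
}
```

Adding a service then becomes a one-entry change to the map, and module.latency_alert["payments-api"].id addresses an individual policy.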
Best practices
- Match severity to channel routing, not just colour. Reserve CRITICAL for policies wired to paging channels with a notification_rate_limit_period; send WARNING to chat/email only. This keeps your error budget for actual pages and stops alert fatigue.
- Always set a runbook via documentation_content. An alert without context is just an interruption; put the first three triage steps and the rollback command directly on the incident so responders start acting, not searching.
- Use absence conditions for liveness, and align the window to reality. A condition_absent with a 10-minute duration_seconds catches a service that silently stopped emitting metrics, but set it longer than your metric's normal sampling gap or batch cadence to avoid flapping during quiet traffic.
- Pick per_series_aligner and threshold together. ALIGN_RATE on a DELTA/CUMULATIVE metric yields per-second rates, so your threshold must be in those units. Mismatched aligners are the most common reason an alert "never fires" or "fires constantly."
- Tag every policy with user_labels (team, service, slo). Labels are how you bulk-silence, build per-team dashboards, and answer "which alerts does this squad own?" without scraping display names.
- Keep auto_close_duration modest (≈30 min). Long auto-close windows leave stale incidents open after a transient blip self-heals, masking the next real breach; the API minimum of 1800s is a sane default for most workloads.
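To illustrate the aligner-and-threshold point from the list above, here is a hedged sketch of a 5xx condition on a Cloud Run request counter. The local name and the 5 requests-per-second threshold are assumptions; the object is meant to be passed as an element of the module's conditions input.

```hcl
locals {
  # Hypothetical 5xx condition; use as: conditions = [local.error_rate_condition]
  error_rate_condition = {
    display_name = "5xx rate above 5 req/s for 5m"
    filter = join(" AND ", [
      "metric.type=\"run.googleapis.com/request_count\"",
      "resource.type=\"cloud_run_revision\"",
      "metric.label.\"response_code_class\"=\"5xx\"",
    ])
    comparison               = "COMPARISON_GT"
    alignment_period_seconds = 60
    per_series_aligner       = "ALIGN_RATE" # DELTA count becomes requests per second
    cross_series_reducer     = "REDUCE_SUM" # add up all revisions of a service
    group_by_fields          = ["resource.label.service_name"]

    # The threshold is per second because of ALIGN_RATE:
    # 5 req/s is roughly 300 failed requests per 60s window.
    threshold_value  = 5
    duration_seconds = 300
  }
}
```

Had the aligner been ALIGN_DELTA instead, the same intent would need a threshold of about 300, which is the kind of unit mismatch that makes an alert fire constantly or never.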