It is 02:14 on a Saturday and the on-call SRE for a payments platform gets paged for the fourth time in twenty minutes — same Kubernetes node, same KubeNodeNotReady alert, fanned out into four separate PagerDuty incidents because four workloads on that node each tripped their own rule. By the time she has acknowledged, silenced, and stitched them together by hand, the actual customer-facing symptom — a climbing checkout error rate — is buried under the noise. The post-incident review lands on a single, unglamorous conclusion: the alerts were correct, but the routing was dumb. There was no grouping by root cause, no suppression of dependent alerts, no severity-aware escalation, and crucially no runbook attached to the page, so a sleepy engineer had to remember from scratch how to cordon and drain a node. This guide fixes exactly that. We will put PagerDuty Event Orchestration in front of Prometheus Alertmanager so that alerts are deduplicated, grouped, suppressed when they are merely symptoms, routed to the right responders by severity, and — the part that saves the 02:14 engineer — auto-enriched with a direct link to the runbook for that specific alert.
Prerequisites. Before you start, have the following in place:
- A running Prometheus + Alertmanager stack (this guide assumes the kube-prometheus-stack Helm chart on Kubernetes, but the Alertmanager config applies anywhere).
- A PagerDuty account on a plan that includes Event Orchestration (Business or Digital Operations / AIOps), with admin rights to create services and Event Orchestrations.
- Terraform ≥ 1.5 and the PagerDuty/pagerduty provider ≥ 3.0. We manage orchestration rules as code so they are reviewable and revertable, not clicked together in a UI nobody can audit.
- The PagerDuty REST API token stored in HashiCorp Vault (we read it at plan/apply time; it never lands in a .tfvars file or a CI log).
- A place to host runbook markdown that returns a stable URL per alert: a Git-backed wiki, Backstage TechDocs, or a runbooks/ folder served by your docs site.
- SSO already federating Okta (or Entra ID) into PagerDuty over SAML/SCIM, so responder identity and on-call schedules map to real humans and groups.
Target topology
The flow is a single, mostly one-directional pipeline with a few enrichment side-channels. Prometheus evaluates rules and pushes firing alerts to Alertmanager, which does first-pass grouping and inhibition and then sends a structured webhook to a PagerDuty Events API v2 integration. That integration is owned not by a plain service but by a PagerDuty Event Orchestration, which is where the real intelligence lives: a tree of rules that inspects each event’s severity, namespace, alertname, and custom details, then decides whether to suppress it (it is a dependent symptom), route it to a specific PagerDuty service and escalation policy, set its priority/urgency, and attach runbook links and custom fields. Routed events become incidents that page the right on-call, defined by an Okta-fed schedule. Two side-channels enrich without changing the routing decision: Dynatrace (or Datadog) posts problem context back onto the incident via the PagerDuty events API, and a high-priority incident opens a ServiceNow change/incident record through the native integration so ITSM has a system of record. Everything in the dashed box — services, escalation policies, the orchestration and its rules, and the integrations — is provisioned by Terraform.
1. Stand up the PagerDuty service, escalation policy, and Event Orchestration in Terraform
Start with infrastructure-as-code so the whole routing brain is in version control. First, read the API token from Vault rather than hardcoding it. Configure the provider with a token sourced from the environment, and populate that environment variable from Vault in your shell or CI:
# Pull the PagerDuty REST token from Vault into the env the provider reads.
# (Vault stores third-party tokens; nothing sensitive touches the repo.)
export PAGERDUTY_TOKEN="$(vault kv get -field=api_token secret/pagerduty/terraform)"
# providers.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    pagerduty = {
      source  = "PagerDuty/pagerduty"
      version = "~> 3.6"
    }
  }
}
# Reads PAGERDUTY_TOKEN from the environment (set from Vault above).
provider "pagerduty" {}
Now define the escalation policy, the schedule it points at, and the service. The schedule references on-call users that Okta provisions into PagerDuty via SCIM, so you look each one up by email with a data source and reference the resulting PagerDuty user ID (which maps 1:1 to an Okta identity):
# service.tf
data "pagerduty_user" "primary_oncall" {
  email = "sre-primary@kloudvin.io" # provisioned from Okta via SCIM
}
resource "pagerduty_schedule" "sre_rotation" {
  name      = "SRE Primary Rotation"
  time_zone = "Asia/Kolkata"
  layer {
    name                         = "Weekly"
    start                        = "2026-06-10T00:00:00+05:30"
    rotation_virtual_start       = "2026-06-10T00:00:00+05:30"
    rotation_turn_length_seconds = 604800 # 7 days
    users                        = [data.pagerduty_user.primary_oncall.id]
  }
}
resource "pagerduty_escalation_policy" "platform" {
  name      = "Platform — Sev-aware"
  num_loops = 2
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_rotation.id
    }
  }
}
resource "pagerduty_service" "platform_prod" {
  name                    = "platform-prod"
  escalation_policy       = pagerduty_escalation_policy.platform.id
  alert_creation          = "create_alerts_and_incidents"
  acknowledgement_timeout = 600   # seconds (10 minutes)
  auto_resolve_timeout    = 14400 # seconds (4 hours)
}
Create the Events API v2 integration on that service — this generates the Integration Key (routing key) Alertmanager will post to:
# integration.tf
data "pagerduty_vendor" "prometheus" {
  name = "Prometheus"
}
resource "pagerduty_service_integration" "prometheus" {
  name    = "Prometheus via Alertmanager"
  service = pagerduty_service.platform_prod.id
  vendor  = data.pagerduty_vendor.prometheus.id
}
output "routing_key" {
  value     = pagerduty_service_integration.prometheus.integration_key
  sensitive = true
}
Apply it and capture the key:
terraform init
terraform plan -out=tfplan
terraform apply tfplan
terraform output -raw routing_key # prints the secret in clear text; step 2 reads it via command substitution, so keep this out of CI logs
2. Point Alertmanager at PagerDuty and do first-pass grouping/inhibition there
Alertmanager is the first noise filter and the right place for cheap, local deduplication and inhibition; PagerDuty Event Orchestration is the second, smarter layer. Do not skip Alertmanager’s inhibit_rules — they kill obvious symptom storms before they ever cost you a PagerDuty event.
Store the routing key as a Kubernetes secret so it is not in the values file:
kubectl create secret generic pagerduty-key \
--namespace monitoring \
--from-literal=routingKey="$(terraform output -raw routing_key)"
Configure Alertmanager (here as the alertmanager.config block of the kube-prometheus-stack Helm values). Note the grouping, the severity-based routing, and the inhibition that suppresses everything else on a node when that node is NotReady:
# values-alertmanager.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: pagerduty-default
      group_by: ['alertname', 'namespace', 'cluster']
      group_wait: 30s # let related alerts batch before first page
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - matchers: [ 'severity = critical' ]
          receiver: pagerduty-critical
          continue: false
        - matchers: [ 'severity = warning' ]
          receiver: pagerduty-warning
    inhibit_rules:
      # If a node is down, do not also page for every pod on that node.
      - source_matchers: [ 'alertname = KubeNodeNotReady' ]
        target_matchers: [ 'severity =~ "warning|critical"' ]
        equal: ['node']
    receivers:
      - name: pagerduty-default
        pagerduty_configs:
          - routing_key_file: /etc/alertmanager/secrets/pagerduty-key/routingKey
            severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
            details:
              alertname: '{{ .CommonLabels.alertname }}'
              namespace: '{{ .CommonLabels.namespace }}'
              cluster: '{{ .CommonLabels.cluster }}'
              summary: '{{ .CommonAnnotations.summary }}'
              runbook_url: '{{ .CommonAnnotations.runbook_url }}' # passed through to orchestration
      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key_file: /etc/alertmanager/secrets/pagerduty-key/routingKey
            severity: critical
            details: &common_details
              alertname: '{{ .CommonLabels.alertname }}'
              namespace: '{{ .CommonLabels.namespace }}'
              cluster: '{{ .CommonLabels.cluster }}'
              summary: '{{ .CommonAnnotations.summary }}'
              runbook_url: '{{ .CommonAnnotations.runbook_url }}'
      - name: pagerduty-warning
        pagerduty_configs:
          - routing_key_file: /etc/alertmanager/secrets/pagerduty-key/routingKey
            severity: warning
            details: *common_details
Mount the secret and apply the chart:
helm upgrade --install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values values-alertmanager.yaml \
--set alertmanager.alertmanagerSpec.secrets='{pagerduty-key}'
Two things to verify here: the runbook_url annotation flows through as a custom detail (we set it on the alert rules in step 5), and the severity label is reliably present on every rule — Event Orchestration’s routing decisions key off it.
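To make the inhibit_rules entry in the Alertmanager config above concrete, here is a minimal Python sketch of the matching logic. It is a simplified model of Alertmanager's documented inhibition behaviour, not its implementation; the names InhibitRule and is_inhibited are hypothetical, and it assumes every alert carries a node label.

```python
import re
from dataclasses import dataclass


@dataclass
class InhibitRule:
    source: dict        # label -> exact value a source alert must carry
    target_regex: dict  # label -> fully anchored regex a target alert must match
    equal: tuple        # labels that must be identical on source and target


# Mirrors the rule in the config: KubeNodeNotReady silences warning/critical
# alerts that share the same node label.
NODE_DOWN = InhibitRule(
    source={"alertname": "KubeNodeNotReady"},
    target_regex={"severity": r"warning|critical"},
    equal=("node",),
)


def matches_source(labels, rule):
    return all(labels.get(k) == v for k, v in rule.source.items())


def matches_target(labels, rule):
    return all(
        re.fullmatch(pattern, labels.get(k, "")) is not None
        for k, pattern in rule.target_regex.items()
    )


def is_inhibited(alert, firing, rule=NODE_DOWN):
    """Return True if some firing source alert silences `alert` under `rule`."""
    if not matches_target(alert, rule):
        return False
    # Simplification: an alert matching both sides is never inhibited,
    # which keeps KubeNodeNotReady from silencing itself.
    if matches_source(alert, rule):
        return False
    return any(
        matches_source(other, rule)
        and all(alert.get(l, "") == other.get(l, "") for l in rule.equal)
        for other in firing
    )


node_down = {"alertname": "KubeNodeNotReady", "severity": "critical", "node": "ip-10-0-3-12"}
pod_same_node = {"alertname": "KubePodNotReady", "severity": "warning", "node": "ip-10-0-3-12"}
pod_other_node = {"alertname": "KubePodNotReady", "severity": "warning", "node": "ip-10-0-9-9"}
firing = [node_down, pod_same_node, pod_other_node]
```

With these sample alerts, only the pod alert on the failed node is silenced; the node alert itself and the pod alert on a healthy node still notify. That is the four-pages-to-one collapse the opening scenario needed.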
3. Build the Event Orchestration rule tree (route, suppress, set priority)
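This is the core of the integration. An Event Orchestration sits in front of one or more services and runs a top-down rule set against every incoming event. We define it in Terraform so the logic is reviewable. Create the orchestration, then its service rules. Because the routing key from step 1 belongs to a service integration, events land on that service and hit the service-level rules below directly; a router on the global orchestration only comes into play if you send events to the orchestration’s own integration key instead.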
# orchestration.tf
resource "pagerduty_event_orchestration" "platform" {
name = "Platform Event Orchestration"
description = "Routes, suppresses, and enriches Prometheus/Alertmanager events"
}
Attach the orchestration’s service rules to the production service. These rules run on events routed into that service and are where suppression, priority, and enrichment happen:
# orchestration_service.tf
resource "pagerduty_event_orchestration_service" "platform_prod" {
  service                                = pagerduty_service.platform_prod.id
  enable_event_orchestration_for_service = true
  set {
    id = "start"
    # Rule 1: SUPPRESS dependent watchdog/heartbeat noise outright.
    rule {
      label = "Drop Watchdog heartbeat"
      condition { expression = "event.custom_details.alertname matches 'Watchdog'" }
      actions { suppress = true }
    }
    # Rule 2: SUPPRESS prod pod warnings (symptom, not cause).
    # Separate condition blocks are OR-ed, so every clause lives in one expression.
    # Note: this matches all KubePod* warnings in prod; it does not detect a drain.
    rule {
      label = "Suppress pod warnings when node is draining"
      condition {
        expression = "event.custom_details.alertname matches part 'KubePod' and event.severity matches 'warning' and event.custom_details.cluster matches 'prod'"
      }
      actions {
        suppress = true # no route_to: a suppressed symptom must not reach the P1 set
      }
    }
    # Rule 3: ROUTE critical events to the high-priority handling set.
    rule {
      label = "Critical to high-priority set"
      condition { expression = "event.severity matches 'critical'" }
      actions { route_to = "high-priority" }
    }
  }
  # The high-priority set: set PD priority, urgency, and attach the runbook.
  set {
    id = "high-priority"
    rule {
      label = "Tag P1, raise urgency, attach runbook"
      actions {
        priority = data.pagerduty_priority.p1.id
        annotate = "Auto-enriched by Event Orchestration. Runbook attached below."
        severity = "critical"
        # The provider's block name is extraction (not extract).
        extraction {
          target = "event.custom_details.runbook"
          source = "event.custom_details.runbook_url"
          regex  = "(.*)"
        }
      }
    }
  }
  catch_all {
    actions {} # let anything unmatched create a normal incident
  }
}
data "pagerduty_priority" "p1" {
  name = "P1"
}
A few decisions worth the why:
- Suppress, do not delete. suppress = true still records the event as a suppressed alert in PagerDuty (so you can see the symptom storm during review) but does not create an incident or page anyone. That is exactly what the 02:14 engineer needed: the pod-eviction noise visible but silent while the node alert pages.
- Route by severity, not by hardcoding services in Alertmanager. Keeping the routing decision in PagerDuty means SREs change escalation behavior with a Terraform PR, not a Prometheus redeploy.
- The extraction block pulls the runbook URL out of the event details into a named field the responder UI surfaces prominently; this is covered fully in step 5.
Apply:
terraform apply -target=pagerduty_event_orchestration.platform \
-target=pagerduty_event_orchestration_service.platform_prod
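The rule tree in this step can be reasoned about with a small Python model. This is a sketch under stated assumptions, not PagerDuty's engine: it assumes rules are tried top-down, the first match in a set wins, a rule's conditions are OR-ed, and route_to continues evaluation in the named set. The helper names (details, rule_matches, evaluate) are hypothetical, and the pod-warning rule is modelled as a single condition that requires all three clauses.

```python
def details(event, key):
    """Read a custom_details field, defaulting to an empty string."""
    return event.get("custom_details", {}).get(key, "")


def pod_warning_in_prod(e):
    return (
        "KubePod" in details(e, "alertname")
        and e.get("severity") == "warning"
        and details(e, "cluster") == "prod"
    )


SETS = {
    "start": [
        {"label": "Drop Watchdog heartbeat",
         "conditions": [lambda e: details(e, "alertname") == "Watchdog"],
         "actions": {"suppress": True}},
        {"label": "Suppress pod warnings",
         "conditions": [pod_warning_in_prod],
         "actions": {"suppress": True}},
        {"label": "Critical to high-priority set",
         "conditions": [lambda e: e.get("severity") == "critical"],
         "actions": {"route_to": "high-priority"}},
    ],
    "high-priority": [
        # No conditions: always matches once an event is routed here.
        {"label": "Tag P1", "conditions": [], "actions": {"priority": "P1"}},
    ],
}


def rule_matches(rule, event):
    conds = rule["conditions"]
    return not conds or any(cond(event) for cond in conds)  # OR across conditions


def evaluate(event, sets=SETS, start="start"):
    """Return (labels of matched rules, merged actions) for one event."""
    labels, applied, current = [], {}, start
    while current is not None:
        matched = next((r for r in sets[current] if rule_matches(r, event)), None)
        if matched is None:
            break
        labels.append(matched["label"])
        actions = dict(matched["actions"])
        current = actions.pop("route_to", None)  # follow route_to into the next set
        applied.update(actions)
    if not labels:
        applied["catch_all"] = True  # nothing matched in the start set
    return labels, applied


critical_node = {"severity": "critical",
                 "custom_details": {"alertname": "KubeNodeNotReady", "cluster": "prod"}}
pod_warning = {"severity": "warning",
               "custom_details": {"alertname": "KubePodCrashLooping", "cluster": "prod"}}
```

The model also shows why splitting the pod-warning clauses across two conditions is dangerous: under OR semantics, a rule whose second condition only checks the cluster would match every prod event, including critical_node.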
4. Add severity-aware escalation and a ServiceNow change record for P1s
For genuinely high-priority incidents you usually need two more things: a tighter escalation policy and an ITSM record. Tie a P1 to ServiceNow so change management has a system of record, using PagerDuty’s native ServiceNow integration (configured once in the PagerDuty UI under Integrations → ServiceNow, then referenced here). The integration auto-creates a ServiceNow incident when a PagerDuty incident reaches the mapped priority and syncs status bidirectionally — so when the SRE resolves in PagerDuty, the ServiceNow ticket closes too.
In ServiceNow’s PagerDuty mapping, set the trigger to priority = P1 and turn work-note sync on. Then make sure your orchestration actually stamps P1 (done in step 3). To verify the binding end to end, you will fire a synthetic P1 in step 6 and confirm an INC record appears.
For escalation, add a second, faster policy for P1 and point the high-priority orchestration set’s service at it if you split services — or keep one service and rely on PagerDuty’s priority to drive a response play (an automation that adds responders, posts to a Slack/Teams war room, and notifies stakeholders):
# Faster escalation policy for P1: 5-minute steps, repeated three times.
# (A response play that pulls in the incident commander and the database SME
# is configured separately; this resource only defines the escalation path.)
resource "pagerduty_escalation_policy" "p1_fast" {
  name      = "P1 — fast escalation + IC"
  num_loops = 3
  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_rotation.id
    }
  }
  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "user_reference"
      # IC backup. Placeholder: this reuses the primary on-call lookup;
      # point it at your incident commander's user instead.
      id = data.pagerduty_user.primary_oncall.id
    }
  }
}
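To sanity-check how long an unacknowledged P1 keeps escalating under a policy like the one above, a rough Python sketch of the timeline follows. It assumes num_loops counts repeats after the first pass through the rules and that each rule hands over after its escalation_delay_in_minutes; the function name escalation_timeline is hypothetical.

```python
def escalation_timeline(delays_minutes, num_loops):
    """Return (minute, rule_number) pairs for an incident nobody acknowledges."""
    timeline, minute = [], 0
    for _ in range(num_loops + 1):  # first pass plus num_loops repeats (assumption)
        for rule_number, delay in enumerate(delays_minutes, start=1):
            timeline.append((minute, rule_number))
            minute += delay
    return timeline


# Two rules at 5 minutes each with three loops, as in the P1 policy above.
p1_fast = escalation_timeline([5, 5], num_loops=3)
# One rule at 10 minutes with two loops, as in the step 1 policy.
platform = escalation_timeline([10], num_loops=2)
```

Under these assumptions the P1 policy notifies the schedule at minutes 0, 10, 20, and 30 and the backup at minutes 5, 15, 25, and 35, so the last notification goes out 35 minutes in.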
5. Wire runbooks into the alert so the page is actionable
The single highest-leverage change for the 02:14 engineer: every paged alert links straight to its runbook. There are two halves — author the runbook annotation on the Prometheus rule, and surface it as a clickable field on the PagerDuty incident.
Author the runbook_url annotation on each PrometheusRule. Host the runbook markdown anywhere that yields a stable URL (a Git wiki, Backstage TechDocs, or your docs site’s runbooks/ path):
# prometheusrule-node.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-rules
  namespace: monitoring
spec:
  groups:
    - name: node.rules
      rules:
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} not ready for 5m"
            runbook_url: "https://docs.kloudvin.io/runbooks/kube-node-not-ready"
That runbook_url rides through Alertmanager (step 2 puts it in details) into the PagerDuty event. The orchestration’s extract action (step 3) lifts it into a custom field. Surface it on the incident as a custom field so responders see a labeled link rather than hunting through event details:
# Define an incident custom field that holds the runbook link.
resource "pagerduty_incident_custom_field" "runbook" {
  name         = "runbook"
  display_name = "Runbook"
  data_type    = "string"
  field_type   = "single_value"
}
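Then have the orchestration’s high-priority set populate it. The extraction block in step 3 writes the URL to event.custom_details.runbook, which places it in the alert’s details; to fill the incident custom field itself, the same rule also needs an incident_custom_field_update action that references this field’s id (check the provider documentation for your version). The result: the page body contains a one-click Runbook link to the exact cordon-and-drain procedure, with no recall required.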
For richer enrichment, let Dynatrace (or Datadog) attach live problem context. Dynatrace’s PagerDuty integration posts the detected problem, affected entities, and a deep link back onto the incident via the Events API, so the responder sees the observability platform’s root-cause guess next to the runbook. Configure it in Dynatrace under Settings → Integration → Problem notifications → PagerDuty, using the same service routing key.
6. Validation
Validate every layer with a synthetic event rather than waiting for a real outage. First, fire a test event straight at the Events API to confirm the orchestration routes, prioritizes, and attaches the runbook:
ROUTING_KEY="$(terraform output -raw routing_key)"
curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "'"$ROUTING_KEY"'",
"event_action": "trigger",
"dedup_key": "synthetic-kube-node-not-ready",
"payload": {
"summary": "SYNTHETIC: Node ip-10-0-3-12 not ready",
"source": "alertmanager-test",
"severity": "critical",
"custom_details": {
"alertname": "KubeNodeNotReady",
"cluster": "prod",
"runbook_url": "https://docs.kloudvin.io/runbooks/kube-node-not-ready"
}
}
}'
Confirm, in order:
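# 1. The incident exists, is P1, and carries the runbook field.
pd incident:list --statuses=triggered --total # PagerDuty CLI
# 2. Validate the Alertmanager config and routing tree before relying on it.
# amtool expects a raw Alertmanager config, not Helm values, so extract the
# alertmanager.config block first (assumes yq is installed).
yq '.alertmanager.config' values-alertmanager.yaml > alertmanager.yaml
amtool check-config alertmanager.yaml
amtool config routes test --config.file=alertmanager.yaml \
severity=critical alertname=KubeNodeNotReady
# 3. Confirm a pod warning resolves to the pagerduty-warning receiver.
amtool config routes test --config.file=alertmanager.yaml \
severity=warning alertname=KubePodNotReady node=ip-10-0-3-12
# 4. routes test does not evaluate inhibit_rules. To confirm a node-down +
# pod-warning pair yields ONE page, query the live Alertmanager for inhibited
# alerts (assumes a port-forward to localhost:9093).
amtool alert query --inhibited --alertmanager.url=http://localhost:9093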
Then verify the side-channels: a P1 should produce a ServiceNow INC record (check the ServiceNow incident list filtered by the PagerDuty integration user) and, if Dynatrace is wired, a problem-context note on the incident. Resolve the synthetic incident and confirm the ServiceNow ticket auto-closes:
curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{"routing_key":"'"$ROUTING_KEY"'","event_action":"resolve","dedup_key":"synthetic-kube-node-not-ready"}'
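If you would rather script the synthetic trigger and resolve than paste curl commands, the same two events can be built in Python. This is a minimal sketch using only the standard library; build_event and send are hypothetical helper names, and the payload fields mirror the curl bodies above. The key point is that the resolve event reuses the trigger's dedup_key, which is how PagerDuty pairs them.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
DEDUP_KEY = "synthetic-kube-node-not-ready"


def build_event(routing_key, action, dedup_key, summary=None,
                severity="critical", details=None):
    """Build an Events API v2 body; only trigger events carry a payload."""
    event = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        event["payload"] = {
            "summary": summary,
            "source": "alertmanager-test",
            "severity": severity,
            "custom_details": details or {},
        }
    return event


def send(event):
    """POST one event; returns (HTTP status, decoded JSON response)."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, json.load(response)


trigger = build_event(
    "ROUTING_KEY_FROM_TERRAFORM",  # placeholder: never hardcode the real key
    "trigger",
    DEDUP_KEY,
    summary="SYNTHETIC: Node ip-10-0-3-12 not ready",
    details={
        "alertname": "KubeNodeNotReady",
        "cluster": "prod",
        "runbook_url": "https://docs.kloudvin.io/runbooks/kube-node-not-ready",
    },
)
resolve = build_event("ROUTING_KEY_FROM_TERRAFORM", "resolve", DEDUP_KEY)
```

Calling send(trigger) and later send(resolve) with a real routing key exercises the same path as the curl commands; the sketch deliberately does not call send on import so it is safe to run anywhere.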
7. Rollback / teardown
Because the routing brain is in Terraform, rollback is a git revert plus an apply — but stage it so you never lose paging coverage. To revert a single bad orchestration rule, revert the commit and re-apply just the orchestration resource:
git revert <bad-commit-sha>
terraform plan -target=pagerduty_event_orchestration_service.platform_prod
terraform apply -target=pagerduty_event_orchestration_service.platform_prod
To safely detach Event Orchestration entirely and fall back to the plain service (alerts still page, just without smart routing), flip the per-service flag rather than destroying anything:
resource "pagerduty_event_orchestration_service" "platform_prod" {
enable_event_orchestration_for_service = false # events bypass the rule tree
# ...
}
Full teardown, in dependency order so nothing is orphaned:
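# Point Alertmanager away first so no live traffic hits a dying integration.
# Caveats: a receiver named blackhole (a name with no *_configs) must already
# exist under receivers, and the severity child routes need the same repointing
# in values-alertmanager.yaml, or critical/warning alerts still reach PagerDuty.
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --reuse-values \
--set-string alertmanager.config.route.receiver=blackhole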
# Then destroy PagerDuty objects (orchestration → integration → service → policy).
terraform destroy \
-target=pagerduty_event_orchestration_service.platform_prod \
-target=pagerduty_event_orchestration.platform \
-target=pagerduty_service_integration.prometheus \
-target=pagerduty_service.platform_prod
Keep the Vault token and Okta/SCIM provisioning intact unless you are decommissioning PagerDuty altogether — they are shared with other services.
Common pitfalls
- Missing severity label. If a PrometheusRule omits severity, the orchestration’s event.severity matches 'critical' condition never fires and the event falls to catch_all, paging at default urgency with no priority. Lint every rule for a severity label in CI.
- Suppressing too aggressively. An overly broad suppress rule (e.g. matching all warning events) silences alerts you actually wanted. Always narrow suppression with an additional clause and-ed into the same expression (cluster or namespace in the orchestration, or an equal join in Alertmanager’s inhibit_rules), and replay a synthetic event to confirm what is and is not suppressed.
- Runbook URL dropped between layers. The annotation must survive Prometheus → Alertmanager details → orchestration extraction → incident field. Break any link and the page arrives runbook-less. The step-6 synthetic event is the cheap way to catch this.
- Double-suppression confusion. Alertmanager inhibit_rules and PagerDuty suppress overlap. Keep node-level symptom storms in Alertmanager (cheaper, closer to source) and cross-service / priority logic in Event Orchestration. Document which layer owns what.
- Routing key in plaintext. Never commit the integration key. Read it from Vault at apply time and mount it as a Kubernetes secret with routing_key_file, as above.
- Editing rules in the UI and in Terraform. Pick one. UI edits drift from state and the next apply reverts them. Terraform is the source of truth here.
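The CI lint suggested in the first pitfall could look roughly like the following Python sketch. It operates on an already-parsed PrometheusRule document (a plain dict, however you load the YAML) and flags alerting rules that lack a severity label or a runbook_url annotation. The function name lint_prometheus_rule and the allowed severity set are assumptions; adjust them to your own taxonomy.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "info"}  # assumption, not a standard


def lint_prometheus_rule(doc):
    """Return a list of human-readable problems found in one PrometheusRule."""
    problems = []
    groups = (doc.get("spec") or {}).get("groups") or []
    for group in groups:
        for rule in group.get("rules") or []:
            name = rule.get("alert")
            if name is None:
                continue  # recording rules never page, so skip them
            severity = (rule.get("labels") or {}).get("severity")
            if severity not in ALLOWED_SEVERITIES:
                problems.append(
                    f"{name}: missing or unknown severity label ({severity!r})"
                )
            runbook = (rule.get("annotations") or {}).get("runbook_url", "")
            if not runbook.startswith("https://"):
                problems.append(f"{name}: missing https runbook_url annotation")
    return problems


good = {"spec": {"groups": [{"name": "node.rules", "rules": [{
    "alert": "KubeNodeNotReady",
    "labels": {"severity": "critical"},
    "annotations": {
        "runbook_url": "https://docs.kloudvin.io/runbooks/kube-node-not-ready"},
}]}]}}
bad = {"spec": {"groups": [{"name": "pod.rules", "rules": [
    {"alert": "PodCrashLooping", "annotations": {"summary": "no labels here"}},
    {"record": "job:up:sum", "expr": "sum(up) by (job)"},
]}]}}
```

A CI job would fail the build whenever lint_prometheus_rule returns a non-empty list, which catches both the missing-severity and the dropped-runbook pitfalls before they reach production.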
Security notes
Treat the integration key as a credential: it can create incidents and, abused, page your whole org at 03:00. Store it in HashiCorp Vault, inject it as a short-lived Kubernetes secret, and never log it in CI. Lock PagerDuty access behind Okta (or Entra ID) SSO with SCIM provisioning, so on-call membership and responder identity are governed centrally and deprovisioned the moment someone leaves — no orphaned local PagerDuty logins. Scope the Terraform REST token to the minimum (services, escalation policies, orchestrations) rather than a global admin key, and rotate it on the same cadence as your other API secrets. If your posture tooling (Wiz) or runtime security (CrowdStrike Falcon) raises a finding, route those into the same PagerDuty pipeline so security incidents inherit the runbook-and-escalation discipline you just built for reliability.
Cost notes
Event Orchestration is part of the higher PagerDuty tiers (Business / AIOps), so the line item is per-seat, not per-event — which means the cheapest optimization is suppressing noise so you do not need more responder seats to keep up with pages. Alertmanager-side grouping and inhibition cost nothing and cut event volume before it reaches PagerDuty, so do as much first-pass deduplication there as is sane. The ServiceNow and Dynatrace integrations are included in their respective platforms’ licenses; the only real incremental spend is responder seats and any AIOps add-on. Measure the win in the metric that matters: pages per incident should drop sharply (the 02:14 four-pages-one-cause pattern collapsing to a single actionable page), and mean time to acknowledge should fall once the runbook is one click from the alert.
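The two success metrics named above are easy to compute once you export incident data. A small Python sketch, assuming you have a count of pages and incidents plus (triggered, acknowledged) timestamp pairs in ISO-8601 form; the function names and the sample timestamps are illustrative only, not real measurements.

```python
from datetime import datetime
from statistics import mean


def pages_per_incident(pages, incidents):
    """Average number of pages sent for each distinct incident."""
    return pages / incidents if incidents else 0.0


def mtta_seconds(pairs):
    """Mean time to acknowledge, from (triggered_at, acknowledged_at) pairs."""
    return mean(
        (datetime.fromisoformat(acked) - datetime.fromisoformat(triggered)).total_seconds()
        for triggered, acked in pairs
    )


# Illustrative values: four pages for one cause, then two acknowledgements.
before = pages_per_incident(pages=4, incidents=1)
after = pages_per_incident(pages=1, incidents=1)
sample_mtta = mtta_seconds([
    ("2026-06-13T02:14:00+05:30", "2026-06-13T02:19:00+05:30"),
    ("2026-06-13T02:30:00+05:30", "2026-06-13T02:31:00+05:30"),
])
```

Tracking both numbers before and after the rollout gives you a concrete way to show whether the grouping, suppression, and runbook enrichment paid off.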