
Integrate PagerDuty Event Orchestration with Prometheus Alertmanager and Runbooks

It is 02:14 on a Saturday and the on-call SRE for a payments platform gets paged for the fourth time in twenty minutes — same Kubernetes node, same KubeNodeNotReady alert, fanned out into four separate PagerDuty incidents because four workloads on that node each tripped their own rule. By the time she has acknowledged, silenced, and stitched them together by hand, the actual customer-facing symptom — a climbing checkout error rate — is buried under the noise. The post-incident review lands on a single, unglamorous conclusion: the alerts were correct, but the routing was dumb. There was no grouping by root cause, no suppression of dependent alerts, no severity-aware escalation, and crucially no runbook attached to the page, so a sleepy engineer had to remember from scratch how to cordon and drain a node. This guide fixes exactly that. We will put PagerDuty Event Orchestration in front of Prometheus Alertmanager so that alerts are deduplicated, grouped, suppressed when they are merely symptoms, routed to the right responders by severity, and — the part that saves the 02:14 engineer — auto-enriched with a direct link to the runbook for that specific alert.

Prerequisites. Before you start, have the following in place:

Target topology

[Figure: target topology diagram for the PagerDuty Event Orchestration, Alertmanager, and runbook pipeline]

The flow is a single, mostly one-directional pipeline with a few enrichment side-channels. Prometheus evaluates rules and pushes firing alerts to Alertmanager, which does first-pass grouping and inhibition and then sends a structured event to a PagerDuty Events API v2 integration. That integration is owned not by a plain service but by a PagerDuty Event Orchestration, which is where the real intelligence lives: a tree of rules that inspects each event’s severity, namespace, alertname, and custom details, then decides whether to suppress it (it is a dependent symptom), route it to a specific PagerDuty service and escalation policy, set its priority and urgency, and attach runbook links and custom fields. Routed events become incidents that page the right on-call, defined by an Okta-fed schedule. Two side-channels enrich without changing the routing decision: Dynatrace (or Datadog) posts problem context back onto the incident via the PagerDuty Events API, and a high-priority incident opens a ServiceNow incident record through the native integration so ITSM has a system of record. Everything in the dashed box (services, escalation policies, the orchestration and its rules, and the integrations) is provisioned by Terraform.

1. Stand up the PagerDuty service, escalation policy, and Event Orchestration in Terraform

Start with infrastructure-as-code so the whole routing brain is in version control. First, read the API token from Vault rather than hardcoding it. Configure the provider with a token sourced from the environment, and populate that environment variable from Vault in your shell or CI:

# Pull the PagerDuty REST token from Vault into the env the provider reads.
# (Vault stores third-party tokens; nothing sensitive touches the repo.)
export PAGERDUTY_TOKEN="$(vault kv get -field=api_token secret/pagerduty/terraform)"

# providers.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    pagerduty = {
      source  = "PagerDuty/pagerduty"
      version = "~> 3.6"
    }
  }
}

# Reads PAGERDUTY_TOKEN from the environment (set from Vault above).
provider "pagerduty" {}

Now define the escalation policy, the schedule it points at, and the service. The schedule references on-call users that Okta provisions into PagerDuty via SCIM. Look each one up by email with a data source and pass the resulting PagerDuty user ID (which maps 1:1 to an Okta identity) to the schedule:

# service.tf
data "pagerduty_user" "primary_oncall" {
  email = "sre-primary@kloudvin.io"   # provisioned from Okta via SCIM
}

resource "pagerduty_schedule" "sre_rotation" {
  name      = "SRE Primary Rotation"
  time_zone = "Asia/Kolkata"

  layer {
    name                         = "Weekly"
    start                        = "2026-06-10T00:00:00+05:30"
    rotation_virtual_start       = "2026-06-10T00:00:00+05:30"
    rotation_turn_length_seconds = 604800
    users                        = [data.pagerduty_user.primary_oncall.id]
  }
}

resource "pagerduty_escalation_policy" "platform" {
  name      = "Platform — Sev-aware"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_rotation.id
    }
  }
}

resource "pagerduty_service" "platform_prod" {
  name                    = "platform-prod"
  escalation_policy       = pagerduty_escalation_policy.platform.id
  alert_creation          = "create_alerts_and_incidents"
  acknowledgement_timeout = 600
  auto_resolve_timeout    = 14400
}

Create the Events API v2 integration on that service — this generates the Integration Key (routing key) Alertmanager will post to:

# integration.tf
data "pagerduty_vendor" "prometheus" {
  name = "Prometheus"
}

resource "pagerduty_service_integration" "prometheus" {
  name    = "Prometheus via Alertmanager"
  service = pagerduty_service.platform_prod.id
  vendor  = data.pagerduty_vendor.prometheus.id
}

output "routing_key" {
  value     = pagerduty_service_integration.prometheus.integration_key
  sensitive = true
}

Apply it and capture the key:

terraform init
terraform plan -out=tfplan
terraform apply tfplan
terraform output -raw routing_key   # copy into the Alertmanager secret in step 2
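
Before copying the key anywhere, it can be worth a quick shape check, since an empty or truncated `terraform output` is easy to miss. The sketch below is a heuristic only: PagerDuty documents the Events API v2 routing key as a 32-character integration key, while the alphanumeric-only rule and the function name `looks_like_routing_key` are assumptions made for illustration.

```shell
# Heuristic shape check for an Events API v2 routing key.
# Assumption: 32 characters, letters and digits only.
looks_like_routing_key() {
  key=$1
  # Length must be exactly 32.
  [ "${#key}" -eq 32 ] || return 1
  # Reject anything that is not a letter or digit.
  case "$key" in
    *[!A-Za-z0-9]*) return 1 ;;
  esac
  return 0
}

# Example use (commented out so the sketch has no side effects):
# key="$(terraform output -raw routing_key)"
# looks_like_routing_key "$key" || { echo "routing key looks wrong" >&2; exit 1; }
```

A check like this does not prove the key is valid, only that nothing obviously went wrong between Terraform and the secret in step 2.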

2. Point Alertmanager at PagerDuty and do first-pass grouping/inhibition there

Alertmanager is the first noise filter and the right place for cheap, local deduplication and inhibition; PagerDuty Event Orchestration is the second, smarter layer. Do not skip Alertmanager’s inhibit_rules — they kill obvious symptom storms before they ever cost you a PagerDuty event.
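
To build intuition for what grouping buys you, here is a small offline sketch of the bucketing that a `group_by` of alertname, namespace, and cluster (the setting used in this step) implies. It is not Alertmanager code; the function name `group_alerts` and the whitespace-separated input format are made up for illustration.

```shell
# Bucket alerts the way group_by: [alertname, namespace, cluster] would.
# Input on stdin: "alertname namespace cluster pod", one alert per line.
# Output: "<count> <alertname>/<namespace>/<cluster>", one line per group.
group_alerts() {
  awk '{ key = $1 "/" $2 "/" $3; n[key]++ }
       END { for (k in n) print n[k], k }' | sort -k2
}

# Example: two pods in "payments" share a group; the "ledger" pod does not.
# printf '%s\n' \
#   'KubePodNotReady payments prod pod-a' \
#   'KubePodNotReady payments prod pod-b' \
#   'KubePodNotReady ledger prod pod-c' | group_alerts
```

Note what the sketch makes visible: alerts that differ only by pod collapse into one notification, but the same failure spread across namespaces still produces one group per namespace. Grouping alone does not fix a node-level storm, which is why the inhibit rule in this step matters.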

Store the routing key as a Kubernetes secret so it is not in the values file:

kubectl create secret generic pagerduty-key \
  --namespace monitoring \
  --from-literal=routingKey="$(terraform output -raw routing_key)"

Configure Alertmanager (here as the alertmanager.config block of the kube-prometheus-stack Helm values). Note the grouping, the severity-based routing, and the inhibition that suppresses everything else on a node when that node is NotReady:

# values-alertmanager.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: pagerduty-default
      group_by: ['alertname', 'namespace', 'cluster']
      group_wait: 30s          # let related alerts batch before first page
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - matchers: [ 'severity = critical' ]
          receiver: pagerduty-critical
          continue: false
        - matchers: [ 'severity = warning' ]
          receiver: pagerduty-warning

    inhibit_rules:
      # If a node is down, do not also page for every pod on that node.
      - source_matchers: [ 'alertname = KubeNodeNotReady' ]
        target_matchers: [ 'severity =~ "warning|critical"' ]
        equal: ['node']

    receivers:
      - name: pagerduty-default
        pagerduty_configs:
          - routing_key_file: /etc/alertmanager/secrets/pagerduty-key/routingKey
            severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
            details:
              alertname:  '{{ .CommonLabels.alertname }}'
              namespace:  '{{ .CommonLabels.namespace }}'
              cluster:    '{{ .CommonLabels.cluster }}'
              summary:    '{{ .CommonAnnotations.summary }}'
              runbook_url: '{{ .CommonAnnotations.runbook_url }}'   # passed through to orchestration
      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key_file: /etc/alertmanager/secrets/pagerduty-key/routingKey
            severity: critical
            details: &common_details
              alertname:  '{{ .CommonLabels.alertname }}'
              namespace:  '{{ .CommonLabels.namespace }}'
              cluster:    '{{ .CommonLabels.cluster }}'
              summary:    '{{ .CommonAnnotations.summary }}'
              runbook_url: '{{ .CommonAnnotations.runbook_url }}'
      - name: pagerduty-warning
        pagerduty_configs:
          - routing_key_file: /etc/alertmanager/secrets/pagerduty-key/routingKey
            severity: warning
            details: *common_details
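
The inhibit rule above can be reasoned about offline. The following is a simplified sketch of its effect, not Alertmanager's implementation: a firing KubeNodeNotReady alert mutes warning and critical alerts that carry the same `node` label. The function name `apply_inhibition` and the three-column input format are assumptions for illustration.

```shell
# Simulate the effect of the KubeNodeNotReady inhibit rule with equal: [node].
# Input on stdin: "alertname severity node", one firing alert per line.
# Output: the alerts that would still notify.
apply_inhibition() {
  awk '
    {
      name[NR] = $1; sev[NR] = $2; node[NR] = $3
      # Remember every node that has a firing source alert.
      if ($1 == "KubeNodeNotReady") down[$3] = 1
    }
    END {
      for (i = 1; i <= NR; i++) {
        is_source = (name[i] == "KubeNodeNotReady")
        is_target = (sev[i] == "warning" || sev[i] == "critical")
        # Simplification: source alerts are never inhibited here.
        if (is_target && !is_source && (node[i] in down)) continue
        print name[i], sev[i], node[i]
      }
    }'
}
```

Fed a node-down alert plus pod alerts on the same node, only the node-down alert survives, while pod alerts on other nodes pass through untouched. That is the single-page outcome the 02:14 scenario was missing.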

Mount the secret and apply the chart:

helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-alertmanager.yaml \
  --set alertmanager.alertmanagerSpec.secrets='{pagerduty-key}'

Two things to verify here: the runbook_url annotation flows through as a custom detail (we set it on the alert rules in step 5), and the severity label is reliably present on every rule — Event Orchestration’s routing decisions key off it.
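
Both checks can be roughed out before anything is deployed. The sketch below is a text heuristic rather than a YAML parser: it compares the number of `- alert:` entries in a rule file with the number of `severity:` and `runbook_url:` lines. The function name `rules_have_required_fields` is hypothetical, and a rule file with unusual formatting could fool it.

```shell
# Heuristic lint: does every alert in a rule file carry a severity label
# and a runbook_url annotation? Returns 0 if the counts line up.
# Usage: rules_have_required_fields <rule-file>
rules_have_required_fields() {
  file=$1
  # grep -c exits non-zero on zero matches, hence the "|| true".
  alerts=$(grep -c -E '^[[:space:]]*-[[:space:]]*alert:' "$file" || true)
  sevs=$(grep -c -E '^[[:space:]]*severity:' "$file" || true)
  books=$(grep -c -E '^[[:space:]]*runbook_url:' "$file" || true)
  [ "$alerts" -eq "$sevs" ] && [ "$alerts" -eq "$books" ]
}

# Example use in CI (commented out so the sketch has no side effects):
# rules_have_required_fields prometheusrule-node.yaml || echo "missing severity or runbook_url" >&2
```

A count mismatch only tells you that something is missing, not which rule; treat it as a cheap gate and fall back to a proper parser if it fires.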

3. Build the Event Orchestration rule tree (route, suppress, set priority)

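This is the core of the integration. An Event Orchestration sits in front of one or more services and runs a top-down rule set against every incoming event. We define it in Terraform so the logic is reviewable. Create the orchestration, then its service rules. This guide does not define a router: Alertmanager posts to the service's own integration key from step 1, so the service-level rules below are what evaluate each event. A router is only needed if you send events to the orchestration's own integration key and want it to pick the service.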

# orchestration.tf
resource "pagerduty_event_orchestration" "platform" {
  name        = "Platform Event Orchestration"
  description = "Routes, suppresses, and enriches Prometheus/Alertmanager events"
}

Attach the orchestration’s service rules to the production service. These rules run on events routed into that service and are where suppression, priority, and enrichment happen:

# orchestration_service.tf
resource "pagerduty_event_orchestration_service" "platform_prod" {
  service                                = pagerduty_service.platform_prod.id
  enable_event_orchestration_for_service = true

  # Rule 1 — SUPPRESS dependent watchdog/heartbeat noise outright.
  set {
    id = "start"
    rule {
      label = "Drop Watchdog heartbeat"
      condition { expression = "event.custom_details.alertname matches 'Watchdog'" }
      actions { suppress = true }
    }

    # Rule 2: SUPPRESS KubePod* warnings in prod (symptom, not cause).
    # Multiple condition blocks are ORed, so all three tests live in one expression.
    # This matches on name, severity, and cluster only; it does not detect a drain.
    rule {
      label = "Suppress prod pod warnings (symptoms)"
      condition { expression = "event.custom_details.alertname matches part 'KubePod' and event.severity matches 'warning' and event.custom_details.cluster matches 'prod'" }
      actions { suppress = true }   # no route_to: a suppressed symptom must not enter the P1 set
    }

    # Rule 3 — ROUTE critical events to the high-priority handling set.
    rule {
      label = "Critical to high-priority set"
      condition { expression = "event.severity matches 'critical'" }
      actions { route_to = "high-priority" }
    }
  }

  # The high-priority set: set PD priority, urgency, and attach the runbook.
  set {
    id = "high-priority"
    rule {
      label = "Tag P1, raise urgency, attach runbook"
      actions {
        priority = data.pagerduty_priority.p1.id
        annotate = "Auto-enriched by Event Orchestration. Runbook attached below."
        severity = "critical"
        extract {
          target = "event.custom_details.runbook"
          source = "event.custom_details.runbook_url"
          regex  = "(.*)"
        }
      }
    }
  }

  catch_all {
    actions {}   # let anything unmatched create a normal incident
  }
}

data "pagerduty_priority" "p1" {
  name = "P1"
}
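
Rules in a set are evaluated top-down and the first match wins, so rule order is part of the logic. As a rough mental model, the decisions the rule labels describe can be sketched as a plain function. This is not PagerDuty's evaluator; the name `orchestration_decision` and its three positional arguments are hypothetical.

```shell
# First-match model of the "start" rule set.
# Usage: orchestration_decision <severity> <alertname> <cluster>
# Prints one of: suppress, high-priority, catch_all
orchestration_decision() {
  severity=$1 alertname=$2 cluster=$3

  # Rule 1: drop the Watchdog heartbeat (exact match).
  if [ "$alertname" = "Watchdog" ]; then echo suppress; return; fi

  # Rule 2: suppress KubePod* warnings in prod (substring match on the name).
  case "$alertname" in
    *KubePod*)
      if [ "$severity" = "warning" ] && [ "$cluster" = "prod" ]; then
        echo suppress; return
      fi ;;
  esac

  # Rule 3: anything critical goes to the high-priority set.
  if [ "$severity" = "critical" ]; then echo high-priority; return; fi

  # Nothing matched: the catch-all creates a normal incident.
  echo catch_all
}
```

Walking a few events through it (a prod pod warning, a critical node alert, a staging warning) is a quick way to review a rule change before it reaches production.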

A few decisions worth the why:

Apply:

terraform apply -target=pagerduty_event_orchestration.platform \
                -target=pagerduty_event_orchestration_service.platform_prod

4. Add severity-aware escalation and a ServiceNow change record for P1s

For genuinely high-priority incidents you usually need two more things: a tighter escalation policy and an ITSM record. Tie a P1 to ServiceNow so change management has a system of record, using PagerDuty’s native ServiceNow integration (configured once in the PagerDuty UI under Integrations → ServiceNow, then referenced here). The integration auto-creates a ServiceNow incident when a PagerDuty incident reaches the mapped priority and syncs status bidirectionally — so when the SRE resolves in PagerDuty, the ServiceNow ticket closes too.

In ServiceNow’s PagerDuty mapping, set the trigger to priority = P1 and turn work-note sync on. Then make sure your orchestration actually stamps P1 (done in step 3). To verify the binding end to end, you will fire a synthetic P1 in step 6 and confirm an INC record appears.

For escalation, add a second, faster policy for P1 and point the high-priority orchestration set’s service at it if you split services — or keep one service and rely on PagerDuty’s priority to drive a response play (an automation that adds responders, posts to a Slack/Teams war room, and notifies stakeholders):

# A faster escalation policy for P1: five-minute steps and three loops.
# The second rule is a placeholder; point it at your incident commander's
# user or schedule rather than the primary on-call shown here.
resource "pagerduty_escalation_policy" "p1_fast" {
  name      = "P1 — fast escalation + IC"
  num_loops = 3

  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_rotation.id
    }
  }
  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "user_reference"
      id   = data.pagerduty_user.primary_oncall.id   # IC backup
    }
  }
}

5. Wire runbooks into the alert so the page is actionable

The single highest-leverage change for the 02:14 engineer: every paged alert links straight to its runbook. There are two halves — author the runbook annotation on the Prometheus rule, and surface it as a clickable field on the PagerDuty incident.

Author the runbook_url annotation on each PrometheusRule. Host the runbook markdown anywhere that yields a stable URL (a Git wiki, Backstage TechDocs, or your docs site’s runbooks/ path):

# prometheusrule-node.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-rules
  namespace: monitoring
spec:
  groups:
    - name: node.rules
      rules:
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} not ready for 5m"
            runbook_url: "https://docs.kloudvin.io/runbooks/kube-node-not-ready"

That runbook_url rides through Alertmanager (step 2 puts it in details) into the PagerDuty event. The orchestration’s extract action (step 3) lifts it into a custom field. Surface it on the incident as a custom field so responders see a labeled link rather than hunting through event details:

# Define an incident custom field that holds the runbook link.
resource "pagerduty_incident_custom_field" "runbook" {
  name         = "runbook"
  display_name = "Runbook"
  data_type    = "string"
  field_type   = "single_value"
}

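Then have the orchestration’s high-priority set populate it. The extract block in step 3 only copies the URL into event.custom_details.runbook, which shows up in the alert's details; to fill the incident custom field itself, the service orchestration's actions need an incident_custom_field_update entry that references the field's id (check that your provider version supports it). The result: the page carries a one-click Runbook link to the exact cordon-and-drain procedure, with no recall required.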

For richer enrichment, let Dynatrace (or Datadog) attach live problem context. Dynatrace’s PagerDuty integration posts the detected problem, affected entities, and a deep link back onto the incident via the Events API, so the responder sees the observability platform’s root-cause guess next to the runbook. Configure it in Dynatrace under Settings → Integration → Problem notifications → PagerDuty, using the same service routing key.

6. Validation

Validate every layer with a synthetic event rather than waiting for a real outage. First, fire a test event straight at the Events API to confirm the orchestration routes, prioritizes, and attaches the runbook:

ROUTING_KEY="$(terraform output -raw routing_key)"

curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "'"$ROUTING_KEY"'",
    "event_action": "trigger",
    "dedup_key": "synthetic-kube-node-not-ready",
    "payload": {
      "summary": "SYNTHETIC: Node ip-10-0-3-12 not ready",
      "source": "alertmanager-test",
      "severity": "critical",
      "custom_details": {
        "alertname": "KubeNodeNotReady",
        "cluster": "prod",
        "runbook_url": "https://docs.kloudvin.io/runbooks/kube-node-not-ready"
      }
    }
  }'

Confirm, in order:

# 1. The incident exists, is P1, and carries the runbook field.
pd incident:list --statuses=triggered --total   # PagerDuty CLI

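# 2. Validate the Alertmanager config and routing tree before relying on it.
#    amtool expects a bare Alertmanager config, not Helm values, so extract it
#    first (assumes mikefarah yq v4 is installed).
yq '.alertmanager.config' values-alertmanager.yaml > alertmanager.yaml
amtool check-config alertmanager.yaml
amtool config routes test --config.file=alertmanager.yaml \
  severity=critical alertname=KubeNodeNotReady

# 3. Confirm the warning route. Route tests do not evaluate inhibit_rules;
#    check inhibition on a live Alertmanager with `amtool alert query --inhibited`.
amtool config routes test --config.file=alertmanager.yaml \
  severity=warning alertname=KubePodNotReady node=ip-10-0-3-12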

Then verify the side-channels: a P1 should produce a ServiceNow INC record (check the ServiceNow incident list filtered by the PagerDuty integration user) and, if Dynatrace is wired, a problem-context note on the incident. Resolve the synthetic incident and confirm the ServiceNow ticket auto-closes:

curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{"routing_key":"'"$ROUTING_KEY"'","event_action":"resolve","dedup_key":"synthetic-kube-node-not-ready"}'
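
The trigger and resolve calls only pair up because they share a `dedup_key`. A small helper makes that coupling explicit and keeps the two payloads from drifting apart. This is a minimal sketch: the function name `pd_event_json` is hypothetical, the source and severity values are hard-coded to match the synthetic test above, and no JSON escaping is attempted.

```shell
# Build an Events API v2 body for a trigger or its matching resolve.
# Usage: pd_event_json <routing_key> <trigger|resolve> <dedup_key> [summary]
# Assumption: values contain no quotes or backslashes (no escaping is done).
pd_event_json() {
  routing_key=$1 action=$2 dedup_key=$3 summary=${4:-}
  case "$action" in
    trigger)
      printf '{"routing_key":"%s","event_action":"trigger","dedup_key":"%s","payload":{"summary":"%s","source":"alertmanager-test","severity":"critical"}}\n' \
        "$routing_key" "$dedup_key" "$summary" ;;
    resolve)
      # A resolve needs no payload, only the same dedup_key as the trigger.
      printf '{"routing_key":"%s","event_action":"resolve","dedup_key":"%s"}\n' \
        "$routing_key" "$dedup_key" ;;
    *)
      echo "unknown action: $action" >&2
      return 1 ;;
  esac
}

# Example use (commented out so the sketch makes no network calls):
# pd_event_json "$ROUTING_KEY" resolve synthetic-kube-node-not-ready |
#   curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
#     -H 'Content-Type: application/json' -d @-
```

If the resolve uses a different key than the trigger, PagerDuty has nothing to match it against and the synthetic incident stays open, which is the most common reason the ServiceNow auto-close check appears to fail.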

7. Rollback / teardown

Because the routing brain is in Terraform, rollback is a git revert plus an apply — but stage it so you never lose paging coverage. To revert a single bad orchestration rule, revert the commit and re-apply just the orchestration resource:

git revert <bad-commit-sha>
terraform plan -target=pagerduty_event_orchestration_service.platform_prod
terraform apply -target=pagerduty_event_orchestration_service.platform_prod

To safely detach Event Orchestration entirely and fall back to the plain service (alerts still page, just without smart routing), flip the per-service flag rather than destroying anything:

resource "pagerduty_event_orchestration_service" "platform_prod" {
  enable_event_orchestration_for_service = false   # events bypass the rule tree
  # ...
}

Full teardown, in dependency order so nothing is orphaned:

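# Point Alertmanager away first so no live traffic hits a dying integration.
# Assumes a receiver named "blackhole" (with no *_configs) is already defined;
# the severity child routes must also be removed or repointed, or they keep paging.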
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values \
  --set-string alertmanager.config.route.receiver=blackhole

# Then destroy PagerDuty objects (orchestration → integration → service → policy).
terraform destroy \
  -target=pagerduty_event_orchestration_service.platform_prod \
  -target=pagerduty_event_orchestration.platform \
  -target=pagerduty_service_integration.prometheus \
  -target=pagerduty_service.platform_prod

Keep the Vault token and Okta/SCIM provisioning intact unless you are decommissioning PagerDuty altogether — they are shared with other services.

Common pitfalls

Security notes

Treat the integration key as a credential: it can create incidents and, abused, page your whole org at 03:00. Store it in HashiCorp Vault, inject it as a short-lived Kubernetes secret, and never log it in CI. Lock PagerDuty access behind Okta (or Entra ID) SSO with SCIM provisioning, so on-call membership and responder identity are governed centrally and deprovisioned the moment someone leaves — no orphaned local PagerDuty logins. Scope the Terraform REST token to the minimum (services, escalation policies, orchestrations) rather than a global admin key, and rotate it on the same cadence as your other API secrets. If your posture tooling (Wiz) or runtime security (CrowdStrike Falcon) raises a finding, route those into the same PagerDuty pipeline so security incidents inherit the runbook-and-escalation discipline you just built for reliability.

Cost notes

Event Orchestration is part of the higher PagerDuty tiers (Business / AIOps), so the line item is per-seat, not per-event — which means the cheapest optimization is suppressing noise so you do not need more responder seats to keep up with pages. Alertmanager-side grouping and inhibition cost nothing and cut event volume before it reaches PagerDuty, so do as much first-pass deduplication there as is sane. The ServiceNow and Dynatrace integrations are included in their respective platforms’ licenses; the only real incremental spend is responder seats and any AIOps add-on. Measure the win in the metric that matters: pages per incident should drop sharply (the 02:14 four-pages-one-cause pattern collapsing to a single actionable page), and mean time to acknowledge should fall once the runbook is one click from the alert.
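
Pages per incident is easy to compute once each page is tagged with the underlying incident it belongs to. The sketch below assumes a hypothetical export with one line per page, where the first field is an identifier for the root-cause incident; how you produce that export (PagerDuty analytics, a post-incident spreadsheet) is up to you, and the function name `pages_per_incident` is made up for illustration.

```shell
# Compute pages per incident from a list of pages.
# Input on stdin: one line per page; field 1 is the underlying incident id.
# Output: the ratio with two decimals.
pages_per_incident() {
  awk 'NF {
         pages++
         if (!($1 in seen)) { seen[$1] = 1; incidents++ }
       }
       END {
         if (incidents) printf "%.2f\n", pages / incidents
         else print "0.00"
       }'
}

# Example: the 02:14 pattern, four pages for one cause.
# printf '%s\n' node-down node-down node-down node-down | pages_per_incident
```

Tracked week over week, the number should trend toward 1.00 as grouping, inhibition, and suppression take effect.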
