
Designing Alertmanager Routing Trees: Grouping, Inhibition, Silences, and Dedup

Prometheus decides what is wrong. Alertmanager decides who hears about it, when, and how loudly. That second job is where most teams lose the plot: a flat list of receivers, no grouping, inhibition rules that suppress the wrong things, and a config nobody dares touch because there is no way to test it. The result is the same failure mode as bad threshold tuning - too many pages or too few - except now it is a routing problem, not a query problem.

This guide builds an Alertmanager configuration the way you would build any other piece of production infrastructure: as a tree you can reason about, with rules that compose, and a CI gate that fails before a typo silences your database alerts. We will go through the anatomy of a route, construct a real team-to-receiver tree, wire up inhibition and time-based muting, integrate PagerDuty and Slack, run it in HA, and validate the whole thing with amtool.

Versions referenced: Alertmanager 0.27/0.28. The schema below is stable across that range; the one moving target is matchers, covered in step 1.

1. Anatomy of a route

Every alert that arrives from Prometheus enters at the top of a single routing tree and walks down it depth-first. A route node has two responsibilities: decide whether an alert matches it, and if so, control how matching alerts are batched and how often they re-notify.

route:
  receiver: default-receiver        # fallback if nothing more specific matches
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - team = payments
      receiver: payments-slack

These four settings (one grouping key and three timers) are the part people get wrong, so be precise about what each one does:

| Parameter | What it controls | Typical value |
| --- | --- | --- |
| group_by | The label set that defines a notification group. Alerts sharing these label values are batched into one notification. | ['alertname', 'cluster', 'service'] |
| group_wait | How long to wait after the first alert in a new group arrives, before sending, so siblings can join the same notification. | 30s |
| group_interval | Minimum time before sending an updated notification for a group that already fired (e.g. a new alert joined the group). | 5m |
| repeat_interval | How long before re-sending a notification for a group whose contents have not changed (the “you still have an open page” reminder). | 3h to 4h |
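As a sketch of how these settings interact across levels of the tree: child routes inherit group_by and the three timers from their parent unless they override them, and the special group_by value '...' groups by every label. The receiver name oncall-pager and the specific durations are illustrative assumptions, not recommendations.

```yaml
route:
  receiver: default-receiver
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Inherits group_by, group_interval and repeat_interval from the root;
    # only group_wait is overridden so pages go out sooner.
    - matchers:
        - severity = critical
      receiver: oncall-pager        # hypothetical receiver name
      group_wait: 10s
    # '...' groups by all labels, which effectively disables aggregation:
    # every distinct alert gets its own notification.
    - matchers:
        - alertname = Watchdog
      receiver: deadmansswitch
      group_by: ['...']
```

Overriding only what differs keeps the child routes short and makes the root the single place where defaults are tuned.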

Two non-obvious rules:

Matchers, not match

Older configs used match (equality) and match_re (regex) maps. Those are deprecated. Use the matchers list with the PromQL-style operators:

matchers:
  - severity =~ "critical|page"     # regex match
  - team = payments                 # exact match
  - environment != staging          # negative match

Operators are =, !=, =~, !~. Quote the value when it contains regex metacharacters or spaces. If you are migrating, run amtool config routes (step 7) after the change - a silently mis-parsed matcher is the classic way to drop alerts on the floor.
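For a migration, a before/after pair along these lines may help. It is a sketch of a single child route (it would sit under routes:), and the receiver name is reused from the examples in this guide purely for illustration.

```yaml
# Before: deprecated map syntax
- match:
    team: payments
  match_re:
    severity: critical|page
  receiver: payments-pagerduty

# After: the equivalent matchers list
- matchers:
    - team = payments
    - severity =~ "critical|page"   # regex is anchored: matches only 'critical' or 'page'
  receiver: payments-pagerduty
```

Because regex matchers are fully anchored, "critical|page" does not match a value such as critical-db; that would need an explicit "critical.*".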

2. Build a routing tree that maps teams to receivers

The mental model: each alert tries to find the most specific route that wants it. Two decisions drive the tree shape - ordering and continue.

route:
  receiver: catch-all
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # 1. Critical, anywhere: mirror to the incident bridge, then keep going.
    - matchers:
        - severity = critical
      receiver: incident-bridge
      continue: true
      group_wait: 10s            # critical should batch less aggressively

    # 2. Team-scoped subtrees. Each team owns its routing below this node.
    - matchers:
        - team = payments
      receiver: payments-slack
      routes:
        - matchers:
            - severity = critical
          receiver: payments-pagerduty

    - matchers:
        - team = platform
      receiver: platform-slack
      routes:
        - matchers:
            - severity = critical
          receiver: platform-pagerduty
        # Watchdog/heartbeat alerts go to a dead-man's-switch receiver only.
        - matchers:
            - alertname = Watchdog
          receiver: deadmansswitch
          group_wait: 0s
          group_interval: 1m
          repeat_interval: 1m

Read route 1 carefully: because of continue: true, a critical payments alert hits incident-bridge, then falls through to the team = payments subtree, then matches the nested severity = critical child and pages payments-pagerduty. One alert, two destinations, expressed declaratively. Without continue, it would have stopped at incident-bridge and the on-call engineer would never get paged - a subtle and dangerous mistake.

Design rule: give each team a single top-level subtree keyed on a team label that you enforce in your Prometheus alert rules. Teams own everything below their node and you own the root. This keeps merge conflicts local and makes ownership obvious in review.
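Enforcing the team label happens on the Prometheus side. A minimal rule-file sketch might look like the following; the metric http_requests_total, the service name, and the threshold are assumptions made up for illustration, and only the labels block matters for routing.

```yaml
# Prometheus rule file (not alertmanager.yml); names and expression are hypothetical.
groups:
  - name: payments-api
    rules:
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payments-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="payments-api"}[5m])) > 0.05
        for: 10m
        labels:
          team: payments        # selects the team = payments subtree
          severity: critical    # selects the PagerDuty child route
          service: payments-api # used by group_by and inhibition
        annotations:
          summary: 'payments-api 5xx ratio above 5% for 10 minutes'
```

An alert that arrives without a team label matches none of the team subtrees and lands on the root receiver, so a rule linter that rejects rules missing team is a cheap safeguard.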

3. Inhibition rules to suppress downstream noise

When a whole cluster’s network partitions, you do not want 400 “service X cannot reach service Y” pages. You want one “cluster network down” page and silence on the rest. That is inhibition: while a source alert is firing, matching target alerts are suppressed.

inhibit_rules:
  # If a node is down, suppress the per-service alerts on that same node.
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity =~ "warning|critical"
    equal: ['cluster', 'node']

  # If anything is already 'critical' for a service, mute its 'warning'
  # for the same service so on-call sees one severity, not two.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'cluster', 'service']

The equal list is the part that makes inhibition safe. It says “only suppress targets whose listed label values match the source’s.” Without it, a critical alert in us-east would mute warnings in us-west. Three things to keep in mind:

4. Silences and time-based muting

There are two distinct mechanisms, and conflating them is a common error.

Silences are imperative and temporary - you create them at runtime (UI, API, or amtool) for a maintenance window or a known-noisy alert. They expire. They are stored in Alertmanager’s state, not the config file.

# Silence all payments alerts in staging for 2 hours during a deploy.
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="vinod" \
  --comment="payments staging deploy" \
  --duration=2h \
  team=payments environment=staging

# List and expire silences
amtool silence query --alertmanager.url=http://alertmanager:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093

Time intervals are declarative and recurring - defined in the config and referenced from routes to mute them, or to make them active only, on a schedule: business hours, weekends, quiet hours.

time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        location: 'Europe/London'
      - weekdays: ['saturday', 'sunday']
        location: 'Europe/London'

Reference it from a route with mute_time_intervals (suppress notifications during the window) or active_time_intervals (only notify during the window):

routes:
  # Low-priority info alerts: only deliver during business hours.
  - matchers:
      - severity = info
    receiver: platform-slack
    mute_time_intervals:
      - outside-business-hours

A few correctness notes: end_time: '24:00' is the legal way to express midnight-end; always set location to an IANA timezone or you get UTC, which will surprise you twice a year. The older top-level key was mute_time_intervals: (still accepted); time_intervals: is the current name and is what active_time_intervals requires.
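The same intent can be expressed positively with active_time_intervals. The sketch below assumes a second interval named business-hours (a name invented here) and reuses the catch-all and platform-slack receivers from earlier sections.

```yaml
time_intervals:
  - name: business-hours            # hypothetical interval name
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
        location: 'Europe/London'

route:
  receiver: catch-all
  # Note: the root route itself may not carry mute or active time intervals.
  routes:
    # Notify only while the business-hours interval is active.
    - matchers:
        - severity = info
      receiver: platform-slack
      active_time_intervals:
        - business-hours
```

Which form to use is mostly a readability choice: mute_time_intervals reads naturally for "quiet hours", active_time_intervals for "office hours only".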

5. Receiver integrations and templating

A receiver is a named bundle of one or more notification integrations. The big three plus a webhook escape hatch:

receivers:
  - name: default-receiver

  - name: payments-pagerduty
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_payments_key
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'

  - name: payments-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#payments-alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}*{{ .Labels.service }}* - {{ .Annotations.summary }}
          {{ if .Annotations.runbook_url }}<{{ .Annotations.runbook_url }}|runbook>{{ end }}
          {{ end }}

  - name: incident-bridge
    webhook_configs:
      - url: http://incident-router.internal:8080/alertmanager
        send_resolved: true
        max_alerts: 0          # 0 = no truncation

Templating essentials:

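# alertmanager.yml: load every template file in the directory
templates:
  - '/etc/alertmanager/templates/*.tmpl'

{{/* In a .tmpl file matched by the glob above */}}
{{ define "slack.title" }}{{ .Status | toUpper }}: {{ .CommonLabels.alertname }} ({{ .Alerts | len }}){{ end }}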

send_resolved: true is the difference between an on-call engineer manually checking whether the thing recovered and getting a green message automatically. Turn it on everywhere except integrations that handle resolution themselves.
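Once a named template such as slack.title is defined in a loaded template file, a receiver can reference it instead of repeating the inline expression. A minimal sketch, reusing the payments-slack receiver from above:

```yaml
receivers:
  - name: payments-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#payments-alerts'
        send_resolved: true
        # Render the shared title template with the full notification data.
        title: '{{ template "slack.title" . }}'
```

Keeping the formatting in one template means every team's Slack receiver changes together when the title format changes.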

6. Running Alertmanager in HA mode

A single Alertmanager is a single point of failure for notification - Prometheus may still be evaluating rules, but nobody hears them. Run at least two (three is better) in a cluster. They gossip over a mesh protocol to share silences and the notification log. Each instance receives the same alerts from Prometheus and evaluates grouping and inhibition itself, and the shared notification log is what deduplicates: only one notification goes out per group.

Point every Prometheus at all Alertmanager peers. This is the part that confuses people: you do not load-balance Prometheus across Alertmanagers. Each Prometheus sends to all of them; the cluster dedupes.

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
            - alertmanager-2:9093

Start each peer with the gossip flags:

# alertmanager-0
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094

How dedup works: when a peer is about to send a notification, it waits a position-based offset (--cluster.peer-timeout, default 15s, multiplied by its index in the cluster). The instance at position 0 sends first and gossips “I sent this”; the others see that and suppress their copy. If that gossip does not arrive before a peer's offset expires, as happens during a partition, the peer sends anyway. So momentary duplicate notifications during a partition are expected and safe - the system is designed to favor an occasional duplicate over a missed page.

Operational checks:

# Confirm all peers see each other (run against each pod)
curl -s http://alertmanager-0:9093/metrics | grep '^alertmanager_cluster_members'
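The manual check can also be turned into a Prometheus alert. The rule below is a sketch that assumes the three-peer cluster from the flags above and that Prometheus scrapes each Alertmanager; the alert name and the team/severity labels are hypothetical.

```yaml
# Prometheus rule file; alert name and labels are illustrative.
groups:
  - name: alertmanager-ha
    rules:
      - alert: AlertmanagerClusterDegraded
        # Each peer exports how many cluster members it currently sees.
        expr: alertmanager_cluster_members < 3
        for: 5m
        labels:
          team: platform
          severity: critical
        annotations:
          summary: '{{ $labels.instance }} sees only {{ $value }} Alertmanager peers'
```

The for: 5m clause is there so a rolling restart does not page; adjust the threshold if the cluster size differs.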

7. Testing routes with amtool and validating in CI

This is the step that turns “config nobody touches” into “config under change control.” Three layers.

Syntax check - does the file even parse?

amtool check-config /etc/alertmanager/alertmanager.yml

Route simulation - given a set of labels, which receiver(s) would fire? This catches ordering and continue mistakes before they reach production.

# Where does a critical payments alert actually land?
amtool config routes test \
  --config.file=alertmanager.yml \
  severity=critical team=payments alertname=ApiHighErrorRate
# Expected output (comma-separated receivers): incident-bridge,payments-pagerduty

# Visualize the whole tree as text
amtool config routes show --config.file=alertmanager.yml

CI gate - wire both into your pipeline so a bad matcher fails the merge, not the on-call rotation:

# .github/workflows/alertmanager.yml
name: validate-alertmanager
on:
  pull_request:
    paths: ['alertmanager/**']
jobs:
  amtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install amtool
        run: |
          VER=0.28.0
          curl -sSL -o am.tgz \
            https://github.com/prometheus/alertmanager/releases/download/v${VER}/alertmanager-${VER}.linux-amd64.tar.gz
          tar -xzf am.tgz
          sudo mv alertmanager-${VER}.linux-amd64/amtool /usr/local/bin/
      - name: Validate config
        run: amtool check-config alertmanager/alertmanager.yml
      - name: Assert critical routing
        run: |
          OUT=$(amtool config routes test --config.file=alertmanager/alertmanager.yml \
                  severity=critical team=payments alertname=Synthetic)
          echo "$OUT" | grep -q payments-pagerduty || { echo "critical payments not paging!"; exit 1; }

That last grep assertion is the highest-leverage line in the whole pipeline: it encodes the invariant “critical payments alerts page payments” as an executable test. Add one per team for every routing guarantee you actually care about.
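A stricter variant of that assertion is sketched below as an extra workflow step. It assumes the routing tree from step 2 and relies on the --verify.receivers flag of amtool config routes test, which compares the full resolved receiver list rather than grepping for one name; confirm the flag with amtool config routes test --help on your version.

```yaml
      - name: Assert routing invariants
        run: |
          set -euo pipefail
          CFG=alertmanager/alertmanager.yml
          # amtool exits non-zero when the resolved receivers differ from
          # the expected comma-separated list.
          amtool config routes test --config.file="$CFG" \
            --verify.receivers=incident-bridge,payments-pagerduty \
            severity=critical team=payments alertname=Synthetic
          amtool config routes test --config.file="$CFG" \
            --verify.receivers=incident-bridge,platform-pagerduty \
            severity=critical team=platform alertname=Synthetic
          amtool config routes test --config.file="$CFG" \
            --verify.receivers=payments-slack \
            severity=warning team=payments alertname=Synthetic
```

Matching the whole list also catches the opposite failure: an alert that starts reaching a receiver it should not.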

Verify

Run these against a staging Alertmanager before trusting the config:

# 1. Config parses and matchers are valid.
amtool check-config alertmanager.yml

# 2. Routing lands where you expect for representative label sets.
amtool config routes test --config.file=alertmanager.yml \
  severity=critical team=platform alertname=NodeDown
amtool config routes test --config.file=alertmanager.yml \
  severity=info team=platform alertname=DiskFillingSlowly

# 3. Fire a synthetic alert end to end and watch it route.
amtool alert add --alertmanager.url=http://alertmanager:9093 \
  alertname=Synthetic severity=critical team=platform \
  --annotation=summary="routing smoke test"

# 4. Confirm HA peers are all joined.
curl -s http://alertmanager:9093/metrics | grep alertmanager_cluster_members

# 5. Inspect inhibition: an inhibited alert is still firing but not notified.
amtool alert query --inhibited --alertmanager.url=http://alertmanager:9093

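Expected results: step 2 returns incident-bridge,platform-pagerduty for the NodeDown case (the critical mirror plus the team pager) and platform-slack for the info case; amtool config routes test resolves receivers only and does not evaluate time intervals, so quiet-hours muting has to be checked against a running Alertmanager. Step 3 produces exactly one notification per receiver despite multiple peers; step 4 reports the same member count on every pod.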

Enterprise scenario

A payments platform team running three regional Kubernetes clusters had a recurring 3am problem. Whenever a regional API gateway degraded, the blast radius lit up: the gateway alert paged, but so did ~30 downstream service alerts (elevated latency, retry storms, queue depth) plus the synthetic-probe failures - all to the same PagerDuty service. On-call would acknowledge 30 incidents to find the one root cause, and the noise had trained them to auto-ack, which is how they missed a real database failover two weeks running.

The constraint: they could not change the alert rules quickly (those were owned by a dozen service teams across repos), and they could not simply drop the downstream alerts because in other failure modes those were the signal. The fix had to live entirely in Alertmanager and be provably correct, because the team that owned it was not the team that owned the rules.

They solved it with a scoped inhibition rule keyed on the region label, plus a continue-based mirror, and locked the behavior with a CI assertion:

inhibit_rules:
  - source_matchers:
      - alertname = GatewayDegraded
      - severity = critical
    target_matchers:
      - tier = downstream            # service teams agreed to add this one label
    equal: ['region']                # never cross-suppress between regions

# CI guard so a future edit can't silently re-enable the 3am storm.
# amtool config routes test ... must still page on the gateway itself:
#   severity=critical alertname=GatewayDegraded region=eu-west -> payments-pagerduty

The only ask of the service teams was a single tier: downstream label on their alerts - cheap, reviewable, and enforced in their own rule linting. Post-change, a regional gateway outage produced one page (the gateway) with the downstream alerts visible-but-muted in the UI for context. Mean time to acknowledge the correct incident dropped from minutes-of-triage to seconds, and the auto-ack habit died because the pages were trustworthy again. The equal: ['region'] clause was the load-bearing detail: an earlier draft without it had suppressed a genuine eu-west outage during a us-east incident. That is exactly the kind of regression to guard against; note that amtool config routes test evaluates routing only, not inhibit_rules, so inhibition changes also need a check against a staging Alertmanager, as in step 5 of Verify.

Checklist
