Prometheus decides what is wrong. Alertmanager decides who hears about it, when, and how loudly. That second job is where most teams lose the plot: a flat list of receivers, no grouping, inhibition rules that suppress the wrong things, and a config nobody dares touch because there is no way to test it. The result is the same failure mode as bad threshold tuning - too many pages or too few - except now it is a routing problem, not a query problem.
This guide builds an Alertmanager configuration the way you would build any other piece of production infrastructure: as a tree you can reason about, with rules that compose, and a CI gate that fails before a typo silences your database alerts. We will go through the anatomy of a route, construct a real team-to-receiver tree, wire up inhibition and time-based muting, integrate PagerDuty and Slack, run it in HA, and validate the whole thing with amtool.
Versions referenced: Alertmanager 0.27/0.28. The schema below is stable across that range; the one moving target is matchers, covered in step 1.
1. Anatomy of a route
Every alert that arrives from Prometheus enters at the top of a single routing tree and walks down it depth-first. A route node has two responsibilities: decide whether an alert matches it, and if so, control how matching alerts are batched and how often they re-notify.
route:
  receiver: default-receiver   # fallback if nothing more specific matches
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - team = payments
      receiver: payments-slack
The four timing knobs are the part people get wrong, so be precise about what each one does:
| Parameter | What it controls | Typical value |
|---|---|---|
| group_by | The label set that defines a notification group. Alerts sharing these label values are batched into one notification. | ['alertname', 'cluster', 'service'] |
| group_wait | How long to wait after the first alert in a new group arrives, before sending, so siblings can join the same notification. | 30s |
| group_interval | Minimum time before sending an updated notification for a group that already fired (e.g. a new alert joined the group). | 5m |
| repeat_interval | How long before re-sending a notification for a group whose contents have not changed (the “you still have an open page” reminder). | 3h to 4h |
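To make the group_by row concrete, here is a minimal Python sketch of how label values decide which alerts share a notification. It is an illustration, not Alertmanager's implementation: the group_key and group_alerts helpers and the sample alerts are hypothetical, and real grouping also happens per route.

```python
from collections import defaultdict

GROUP_BY = ["alertname", "cluster", "service"]

def group_key(labels, group_by=GROUP_BY):
    """Return the tuple of label values that identifies an alert's group.

    A label the alert does not carry is treated as the empty string.
    """
    return tuple(labels.get(name, "") for name in group_by)

def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by group key; each bucket becomes one notification."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)

# Hypothetical alerts: two instances of one problem, plus an unrelated alert.
alerts = [
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "api", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "api", "instance": "b"},
    {"alertname": "DiskFull", "cluster": "eu-1", "service": "db", "instance": "c"},
]

groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Adding instance to the group_by list in this sketch splits the two HighLatency alerts into separate groups, which is the trade-off the parameter controls: fewer labels mean bigger batches, more labels mean more notifications.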
Two non-obvious rules:
- group_by takes a list of label names, and alerts that share values for all of those labels are batched into one notification. To disable grouping entirely, so that every distinct alert is its own group, use the reserved value '...' as the sole entry: group_by: ['...']. Use it sparingly.
- Timing fields are inherited down the tree. A child route that does not set group_wait uses the parent’s. This is the single biggest lever for keeping configs short: set sane defaults at the root, override only where a team genuinely needs different behavior.
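The inheritance rule can be pictured as a root-to-leaf merge in which deeper routes win. The following Python sketch assumes that simplified model; effective_settings is a hypothetical helper, not part of Alertmanager.

```python
ROOT_DEFAULTS = {"group_wait": "30s", "group_interval": "5m", "repeat_interval": "4h"}

def effective_settings(path):
    """Resolve timing fields along a root-to-leaf path of route settings.

    `path` is a list of dicts, root first; a deeper route overrides only the
    fields it sets and inherits everything else.
    """
    resolved = {}
    for overrides in path:
        resolved.update(overrides)
    return resolved

# The root sets defaults; a child route overrides only group_wait.
critical_child = {"group_wait": "10s"}
settings = effective_settings([ROOT_DEFAULTS, critical_child])
print(settings)
```

The child ends up with its own group_wait and the root's group_interval and repeat_interval, which is why most child routes need no timing fields at all.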
Matchers, not match
Older configs used match (equality) and match_re (regex) maps. Those are deprecated. Use the matchers list with the PromQL-style operators:
matchers:
  - severity =~ "critical|page"   # regex match
  - team = payments               # exact match
  - environment != staging        # negative match
Operators are =, !=, =~, !~. Quote the value when it contains regex metacharacters or spaces. If you are migrating, run amtool config routes (step 7) after the change - a silently mis-parsed matcher is the classic way to drop alerts on the floor.
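As a rough model of the four operators, the Python sketch below evaluates matchers against a label set. It assumes two behaviors worth knowing when you migrate: regex matchers are anchored to the whole label value, and a label the alert does not carry compares as the empty string. The helper names are hypothetical, and Python's re module is only an approximation of the regex syntax Alertmanager uses.

```python
import re

def matches(labels, name, op, value):
    """Evaluate one matcher against an alert's labels.

    A missing label is treated as the empty string, and regex operators must
    match the whole value (re.fullmatch), not just a substring.
    """
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")

def matches_all(labels, matchers):
    """A route's matchers are ANDed: every one must match."""
    return all(matches(labels, *m) for m in matchers)

MATCHERS = [
    ("severity", "=~", "critical|page"),
    ("team", "=", "payments"),
    ("environment", "!=", "staging"),
]
alert = {"severity": "critical", "team": "payments", "environment": "prod"}
print(matches_all(alert, MATCHERS))
```

One consequence of the missing-label rule in this model: an alert with no environment label still passes environment != staging, so negative matchers do not require the label to exist.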
2. Build a routing tree that maps teams to receivers
The mental model: each alert tries to find the most specific route that wants it. Two decisions drive the tree shape - ordering and continue.
- Sibling routes are evaluated top to bottom, and by default an alert stops at the first matching route. So order matters: put specific routes above general ones.
- continue: true tells Alertmanager to keep evaluating siblings after a match. This is how you deliver one alert to multiple receivers (e.g. the owning team and a central incident channel).
route:
  receiver: catch-all
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # 1. Critical, anywhere: mirror to the incident bridge, then keep going.
    - matchers:
        - severity = critical
      receiver: incident-bridge
      continue: true
      group_wait: 10s   # critical should batch less aggressively
    # 2. Team-scoped subtrees. Each team owns its routing below this node.
    - matchers:
        - team = payments
      receiver: payments-slack
      routes:
        - matchers:
            - severity = critical
          receiver: payments-pagerduty
    - matchers:
        - team = platform
      receiver: platform-slack
      routes:
        - matchers:
            - severity = critical
          receiver: platform-pagerduty
    # Watchdog/heartbeat alerts go to a dead-man's-switch receiver only.
    - matchers:
        - alertname = Watchdog
      receiver: deadmansswitch
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 1m
Read route 1 carefully: because of continue: true, a critical payments alert hits incident-bridge, then falls through to the team = payments subtree, then matches the nested severity = critical child and pages payments-pagerduty. One alert, two destinations, expressed declaratively. Without continue, it would have stopped at incident-bridge and the on-call engineer would never get paged - a subtle and dangerous mistake.
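The walk described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the Route class and resolve function are hypothetical, only equality matchers are supported, and the tree mirrors the example configuration rather than parsing it.

```python
class Route:
    def __init__(self, receiver, match=None, cont=False, routes=()):
        self.receiver = receiver
        self.match = match or {}   # equality matchers only, for brevity
        self.cont = cont           # stands in for `continue: true`
        self.routes = list(routes)

    def matches(self, labels):
        return all(labels.get(k, "") == v for k, v in self.match.items())

def resolve(route, labels):
    """Return the receivers an alert reaches, walking the tree depth-first."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        if hit and not child.cont:
            break                     # first match wins unless continue is set
    return found or [route.receiver]  # no child matched: this node is the match

TREE = Route("catch-all", routes=[
    Route("incident-bridge", {"severity": "critical"}, cont=True),
    Route("payments-slack", {"team": "payments"}, routes=[
        Route("payments-pagerduty", {"severity": "critical"}),
    ]),
    Route("platform-slack", {"team": "platform"}, routes=[
        Route("platform-pagerduty", {"severity": "critical"}),
    ]),
    Route("deadmansswitch", {"alertname": "Watchdog"}),
])

print(resolve(TREE, {"team": "payments", "severity": "critical"}))
print(resolve(TREE, {"team": "payments", "severity": "warning"}))
```

Flipping cont to False on the first route in this sketch makes the critical payments alert resolve to incident-bridge alone, which is the missed-page failure the paragraph warns about.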
Design rule: give each team a single top-level subtree keyed on a team label that you enforce in your Prometheus alert rules. Teams own everything below their node and you own the root. This keeps merge conflicts local and makes ownership obvious in review.
3. Inhibition rules to suppress downstream noise
When a whole cluster’s network partitions, you do not want 400 “service X cannot reach service Y” pages. You want one “cluster network down” page and silence on the rest. That is inhibition: while a source alert is firing, matching target alerts are suppressed.
inhibit_rules:
  # If a node is down, suppress the per-service alerts on that same node.
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity =~ "warning|critical"
    equal: ['cluster', 'node']
  # If anything is already 'critical' for a service, mute its 'warning'
  # for the same service so on-call sees one severity, not two.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'cluster', 'service']
The equal list is the part that makes inhibition safe. It says “only suppress targets whose listed label values match the source’s.” Without it, a critical alert in us-east would mute warnings in us-west. Three things to keep in mind:
- Inhibition is evaluated globally, across the whole alert set, independent of routing. A source in one subtree can inhibit a target in another.
- A muted alert is still held by Alertmanager and still visible in the UI and API, where it is reported in the suppressed state along with the alert that inhibits it - it is just not notified. This matters for dashboards built on the Alertmanager API.
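- Avoid matchers where a single alert matches both the source and the target side. Alertmanager guards against the simplest case (an alert that matches both sides cannot be inhibited by another alert that also matches both sides, itself included), but overlapping sets are still hard to reason about. Choose distinct matchers and enough equal labels that the source set and target set are disjoint for a single alert.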
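The role of the equal list is easiest to see in a small model. The Python sketch below is a simplification, not Alertmanager's code: it supports only equality matchers, treats a missing label as the empty string, and skips the case where an alert is compared with itself. The rule and alerts are hypothetical.

```python
def label_match(labels, required):
    """True if the alert carries every required label value."""
    return all(labels.get(k, "") == v for k, v in required.items())

def is_inhibited(target, firing, rule):
    """Return True if some firing source alert suppresses `target` under `rule`."""
    if not label_match(target, rule["target"]):
        return False
    for source in firing:
        if source is target or not label_match(source, rule["source"]):
            continue
        # Suppress only when every `equal` label has the same value on both sides.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False

RULE = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["alertname", "cluster", "service"],
}

base = {"alertname": "HighErrorRate", "service": "api"}
crit_east = {**base, "severity": "critical", "cluster": "us-east"}
warn_east = {**base, "severity": "warning", "cluster": "us-east"}
warn_west = {**base, "severity": "warning", "cluster": "us-west"}
firing = [crit_east, warn_east, warn_west]

print(is_inhibited(warn_east, firing, RULE))  # same cluster as the critical alert
print(is_inhibited(warn_west, firing, RULE))  # different cluster
```

With an empty equal list in this model, the us-west warning is suppressed as well, which is the cross-region muting the text describes.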
4. Silences and time-based muting
There are two distinct mechanisms, and conflating them is a common error.
Silences are imperative and temporary - you create them at runtime (UI, API, or amtool) for a maintenance window or a known-noisy alert. They expire. They are stored in Alertmanager’s state, not the config file.
# Silence all payments alerts in staging for 2 hours during a deploy.
amtool silence add \
--alertmanager.url=http://alertmanager:9093 \
--author="vinod" \
--comment="payments staging deploy" \
--duration=2h \
team=payments environment=staging
# List and expire silences
amtool silence query --alertmanager.url=http://alertmanager:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093
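Conceptually, a silence is a set of matchers plus a validity window. The Python sketch below models that idea only; make_silence and is_silenced are hypothetical helpers with equality matchers, not the Alertmanager API.

```python
from datetime import datetime, timedelta, timezone

def make_silence(matchers, created_at, duration):
    """Build a silence: equality matchers plus a start and an expiry time."""
    return {
        "matchers": matchers,
        "starts_at": created_at,
        "ends_at": created_at + duration,
    }

def is_silenced(labels, silence, now):
    """An alert is silenced only while the silence is active and every matcher matches."""
    active = silence["starts_at"] <= now < silence["ends_at"]
    return active and all(labels.get(k, "") == v for k, v in silence["matchers"].items())

# Hypothetical deploy window: 2 hours starting at 10:00 UTC.
start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
silence = make_silence({"team": "payments", "environment": "staging"}, start, timedelta(hours=2))

staging_alert = {"alertname": "ApiHighErrorRate", "team": "payments", "environment": "staging"}
prod_alert = {**staging_alert, "environment": "prod"}

print(is_silenced(staging_alert, silence, start + timedelta(hours=1)))
print(is_silenced(staging_alert, silence, start + timedelta(hours=3)))
print(is_silenced(prod_alert, silence, start + timedelta(hours=1)))
```

The second and third checks are the useful ones: the silence stops applying once it expires, and it never touches the production alert because one matcher differs.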
Time intervals are declarative and recurring - defined in the config and referenced from routes to mute (or actively-route) on a schedule: business hours, weekends, quiet hours.
time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        location: 'Europe/London'
      - weekdays: ['saturday', 'sunday']
        location: 'Europe/London'
Reference it from a route with mute_time_intervals (suppress notifications during the window) or active_time_intervals (only notify during the window):
routes:
  # Low-priority info alerts: only deliver during business hours.
  - matchers:
      - severity = info
    receiver: platform-slack
    mute_time_intervals:
      - outside-business-hours
A few correctness notes: end_time: '24:00' is the legal way to express midnight-end; always set location to an IANA timezone or you get UTC, which will surprise you twice a year. The older top-level key was mute_time_intervals: (still accepted); time_intervals: is the current name and is what active_time_intervals requires.
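To check the boundary behavior of the outside-business-hours interval, here is a small Python model. It assumes the weekday and time have already been converted to the interval's location, that start_time is inclusive and end_time exclusive, and it expands the monday:friday range by hand. The helper names are hypothetical.

```python
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight; '24:00' becomes 1440."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def in_interval(weekday, hhmm, interval):
    """Check a local weekday name and wall-clock time against one interval entry."""
    if weekday not in interval["weekdays"]:
        return False
    times = interval.get("times")
    if not times:
        return True  # no `times` listed: the whole day matches
    now = to_minutes(hhmm)
    # start is inclusive, end is exclusive
    return any(to_minutes(t["start"]) <= now < to_minutes(t["end"]) for t in times)

OUTSIDE_BUSINESS_HOURS = [
    {"weekdays": WEEKDAYS[:5],  # monday:friday
     "times": [{"start": "18:00", "end": "24:00"},
               {"start": "00:00", "end": "09:00"}]},
    {"weekdays": WEEKDAYS[5:]},  # saturday, sunday: all day
]

def is_muted(weekday, hhmm):
    """True if any entry of the named interval covers this local time."""
    return any(in_interval(weekday, hhmm, i) for i in OUTSIDE_BUSINESS_HOURS)

for when in [("monday", "08:59"), ("monday", "09:00"), ("monday", "18:00"), ("saturday", "12:00")]:
    print(when, "muted" if is_muted(*when) else "delivered")
```

Under these assumptions a notification at 09:00 on a weekday is delivered and one at 18:00 is muted, so the two ranges together cover the whole night without a gap at midnight.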
5. Receiver integrations and templating
A receiver is a named bundle of one or more notification integrations. The big three plus a webhook escape hatch:
receivers:
  - name: default-receiver
  - name: payments-pagerduty
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_payments_key
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
  - name: payments-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#payments-alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}*{{ .Labels.service }}* - {{ .Annotations.summary }}
          {{ if .Annotations.runbook_url }}<{{ .Annotations.runbook_url }}|runbook>{{ end }}
          {{ end }}
  - name: incident-bridge
    webhook_configs:
      - url: http://incident-router.internal:8080/alertmanager
        send_resolved: true
        max_alerts: 0   # 0 = no truncation
Templating essentials:
- The notification body operates on a group of alerts. .Alerts is the slice; .Alerts.Firing and .Alerts.Resolved partition it. .CommonLabels / .CommonAnnotations hold labels shared by every alert in the group - do not use them when a group can span services, because the differing label will simply not appear.
- Prefer the *_file variants (api_url_file, routing_key_file) so secrets live on disk (mounted from a Kubernetes Secret or Vault), not in the config you commit. Alertmanager added these to keep webhook URLs and integration keys out of source control.
- Factor repeated markup into named templates and load them with templates: at the top level:
templates:
  - '/etc/alertmanager/templates/*.tmpl'
{{ define "slack.title" }}{{ .Status | toUpper }}: {{ .CommonLabels.alertname }} ({{ .Alerts | len }}){{ end }}
send_resolved: true is the difference between an on-call engineer manually checking whether the thing recovered and getting a green message automatically. Turn it on everywhere except integrations that handle resolution themselves.
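The warning about .CommonLabels in the templating notes is easier to trust once you see how a shared label set is derived. This Python sketch models the idea only (common_labels is a hypothetical helper, and the alerts are invented): a label survives only if every alert in the group carries the same value.

```python
def common_labels(alerts):
    """Return only the label pairs that every alert in the group shares."""
    if not alerts:
        return {}
    common = dict(alerts[0])
    for labels in alerts[1:]:
        common = {k: v for k, v in common.items() if labels.get(k) == v}
    return common

# Hypothetical group that spans two services.
group = [
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "checkout"},
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "ledger"},
]
shared = common_labels(group)
print(shared)
```

The service label is absent from the result, so a title built from the shared labels would render an empty service name for this group. Iterating over the individual alerts, as the Slack text template above does, avoids that.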
6. Running Alertmanager in HA mode
A single Alertmanager is a single point of failure for notification - Prometheus may still be evaluating rules, but nobody hears them. Run at least two (three is better) in a cluster. They gossip over a mesh protocol so that grouping, silences, and inhibition state are shared, and - critically - deduplicated: each instance receives the same alerts from Prometheus, but only one notification goes out per group.
Point every Prometheus at all Alertmanager peers. This is the part that confuses people: you do not load-balance Prometheus across Alertmanagers. Each Prometheus sends to all of them; the cluster dedupes.
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
            - alertmanager-2:9093
Start each peer with the gossip flags:
# alertmanager-0
alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=alertmanager-1:9094 \
--cluster.peer=alertmanager-2:9094
How dedup works: when a peer is about to send a notification, it waits a position-based offset (--cluster.peer-timeout, default 15s, multiplied by its index in the cluster). The instance at position 0 sends first and gossips “I sent this”; the others see that and suppress their copy. So momentary duplicate notifications during a partition are expected and safe - the system is designed to favor an occasional duplicate over a missed page.
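The position-based wait can be modeled in a few lines of Python. This is a toy simulation under stated assumptions, not the real gossip protocol: list position stands in for cluster position, one notification group is tracked, and shared_log=False models a partition in which the "I sent this" message never reaches the other peers.

```python
PEER_TIMEOUT = 15  # seconds; the default quoted above

def simulate(peers, shared_log=True):
    """Return the (wait_seconds, peer) pairs that actually send a notification.

    `peers` is an ordered list of healthy peer names. With shared_log=False
    each peer sees only its own notification log.
    """
    sent = []
    logs = {peer: set() for peer in peers}
    for position, peer in enumerate(peers):
        wait = position * PEER_TIMEOUT
        if "group-1" in logs[peer]:
            continue  # another peer already notified: suppress this copy
        sent.append((wait, peer))
        recipients = peers if shared_log else [peer]
        for name in recipients:
            logs[name].add("group-1")
    return sent

print(simulate(["am-0", "am-1", "am-2"]))                    # healthy cluster
print(simulate(["am-0", "am-1", "am-2"], shared_log=False))  # partitioned
print(simulate(["am-1", "am-2"]))                            # am-0 is down
```

In the healthy case only the first peer sends. In the partitioned case every peer sends after its own offset, which is the occasional duplicate the design accepts in exchange for never dropping a page.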
Operational checks:
- --cluster.listen-address= (empty) disables clustering. If you see duplicate pages from a “single” deployment, confirm you did not accidentally start two un-clustered instances.
- On Kubernetes, the Prometheus Operator’s Alertmanager CRD with replicas: 3 wires the --cluster.peer flags via a headless service for you. Verify peers are connected: the /#/status page lists cluster members, and the metric alertmanager_cluster_members should equal your replica count on every pod.
# Confirm all peers see each other (run against each pod)
curl -s http://alertmanager-0:9093/metrics | grep '^alertmanager_cluster_members'
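If you want to automate that check rather than eyeball the grep output, a sketch like the following parses the metric from each pod's /metrics payload and compares it with the replica count. The helper names and the sample payloads are hypothetical; fetching the payloads is left out so the example stays self-contained.

```python
METRIC = "alertmanager_cluster_members"

def cluster_members(metrics_text):
    """Extract the alertmanager_cluster_members value from a /metrics payload."""
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == METRIC:
            return int(float(parts[1]))
    return None  # metric not present in this payload

def all_joined(payloads, replicas):
    """True only if every pod reports the full member count."""
    return all(cluster_members(text) == replicas for text in payloads)

# Hypothetical payloads from three pods; the last one has lost its peers.
healthy = "# TYPE alertmanager_cluster_members gauge\nalertmanager_cluster_members 3\n"
isolated = "# TYPE alertmanager_cluster_members gauge\nalertmanager_cluster_members 1\n"

print(all_joined([healthy, healthy, healthy], replicas=3))
print(all_joined([healthy, healthy, isolated], replicas=3))
```

Checking every pod matters because a partitioned pod can report a lower count while the others still look healthy.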
7. Testing routes with amtool and validating in CI
This is the step that turns “config nobody touches” into “config under change control.” Three layers.
Syntax check - does the file even parse?
amtool check-config /etc/alertmanager/alertmanager.yml
Route simulation - given a set of labels, which receiver(s) would fire? This catches ordering and continue mistakes before they reach production.
# Where does a critical payments alert actually land?
amtool config routes test \
--config.file=alertmanager.yml \
severity=critical team=payments alertname=ApiHighErrorRate
# Expected output (comma-separated receivers): incident-bridge,payments-pagerduty
# Visualize the whole tree as text
amtool config routes show --config.file=alertmanager.yml
CI gate - wire both into your pipeline so a bad matcher fails the merge, not the on-call rotation:
# .github/workflows/alertmanager.yml
name: validate-alertmanager
on:
  pull_request:
    paths: ['alertmanager/**']
jobs:
  amtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install amtool
        run: |
          VER=0.28.0
          curl -sSL -o am.tgz \
            https://github.com/prometheus/alertmanager/releases/download/v${VER}/alertmanager-${VER}.linux-amd64.tar.gz
          tar -xzf am.tgz
          sudo mv alertmanager-${VER}.linux-amd64/amtool /usr/local/bin/
      - name: Validate config
        run: amtool check-config alertmanager/alertmanager.yml
      - name: Assert critical routing
        run: |
          OUT=$(amtool config routes test --config.file=alertmanager/alertmanager.yml \
            severity=critical team=payments alertname=Synthetic)
          echo "$OUT" | grep -q payments-pagerduty || { echo "critical payments not paging!"; exit 1; }
That last grep assertion is the highest-leverage line in the whole pipeline: it encodes the invariant “critical payments alerts page payments” as an executable test. Add one per team for every routing guarantee you actually care about.
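Once you have more than a couple of guarantees, a table-driven check scales better than one grep per line. The Python sketch below shows only the comparison logic: it takes the text that amtool config routes test printed and reports which expected receivers are missing. The helper names and the guarantee table are hypothetical, and the parser accepts comma- or whitespace-separated names so it does not depend on the exact output separator.

```python
import re

def parse_receivers(output):
    """Split route-test output into receiver names (comma or whitespace separated)."""
    return [name for name in re.split(r"[,\s]+", output.strip()) if name]

def missing_receivers(output, expected):
    """Return the expected receivers that the route test did not resolve."""
    resolved = set(parse_receivers(output))
    return [name for name in expected if name not in resolved]

# Hypothetical routing guarantees: label set -> receivers that must be reached.
GUARANTEES = {
    "severity=critical team=payments": ["incident-bridge", "payments-pagerduty"],
    "severity=critical team=platform": ["incident-bridge", "platform-pagerduty"],
}

# Sample outputs standing in for real amtool runs; the second one is a regression.
SAMPLE_OUTPUT = {
    "severity=critical team=payments": "incident-bridge,payments-pagerduty\n",
    "severity=critical team=platform": "incident-bridge\n",
}

failures = {
    labels: missing_receivers(SAMPLE_OUTPUT[labels], expected)
    for labels, expected in GUARANTEES.items()
    if missing_receivers(SAMPLE_OUTPUT[labels], expected)
}
print(failures)
```

Unlike a substring grep, this compares whole receiver names, so a receiver called payments-pagerduty-legacy would not satisfy a guarantee for payments-pagerduty.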
Verify
Run these against a staging Alertmanager before trusting the config:
# 1. Config parses and matchers are valid.
amtool check-config alertmanager.yml
# 2. Routing lands where you expect for representative label sets.
amtool config routes test --config.file=alertmanager.yml \
severity=critical team=platform alertname=NodeDown
amtool config routes test --config.file=alertmanager.yml \
severity=info team=platform alertname=DiskFillingSlowly
# 3. Fire a synthetic alert end to end and watch it route.
amtool alert add --alertmanager.url=http://alertmanager:9093 \
alertname=Synthetic severity=critical team=platform \
--annotation=summary="routing smoke test"
# 4. Confirm HA peers are all joined.
curl -s http://alertmanager:9093/metrics | grep alertmanager_cluster_members
# 5. Inspect inhibition: an inhibited alert is still listed but not notified.
amtool alert query --inhibited --alertmanager.url=http://alertmanager:9093
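Expected results: step 2 returns incident-bridge,platform-pagerduty for the NodeDown case (the critical mirror plus the team pager) and platform-slack for the info case. Route testing resolves receivers only; it does not evaluate mute or active time intervals, so confirm quiet-hours behavior with a real notification. Step 3 produces exactly one notification per matched receiver despite multiple peers; step 4 reports the same member count on every pod.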
Enterprise scenario
A payments platform team running three regional Kubernetes clusters had a recurring 3am problem. Whenever a regional API gateway degraded, the blast radius lit up: the gateway alert paged, but so did ~30 downstream service alerts (elevated latency, retry storms, queue depth) plus the synthetic-probe failures - all to the same PagerDuty service. On-call would acknowledge 30 incidents to find the one root cause, and the noise had trained them to auto-ack, which is how they missed a real database failover two weeks running.
The constraint: they could not change the alert rules quickly (those were owned by a dozen service teams across repos), and they could not simply drop the downstream alerts because in other failure modes those were the signal. The fix had to live entirely in Alertmanager and be provably correct, because the team that owned it was not the team that owned the rules.
They solved it with a scoped inhibition rule keyed on the region label, plus a continue-based mirror, and locked the behavior with a CI assertion:
inhibit_rules:
  - source_matchers:
      - alertname = GatewayDegraded
      - severity = critical
    target_matchers:
      - tier = downstream   # service teams agreed to add this one label
    equal: ['region']       # never cross-suppress between regions
# CI guard so a future edit can't silently re-enable the 3am storm.
# amtool config routes test ... must still page on the gateway itself:
# severity=critical alertname=GatewayDegraded region=eu-west -> payments-pagerduty
The only ask of the service teams was a single tier: downstream label on their alerts - cheap, reviewable, and enforced in their own rule linting. Post-change, a regional gateway outage produced one page (the gateway) with the downstream alerts visible-but-muted in the UI for context. Mean time to acknowledge the correct incident dropped from minutes-of-triage to seconds, and the auto-ack habit died because the pages were trustworthy again. The equal: ['region'] clause was the load-bearing detail: an earlier draft without it had suppressed a genuine eu-west outage during a us-east incident, which is exactly the kind of bug the routing-test CI gate now catches.