Designing Alertmanager Routing Trees: Grouping, Inhibition, Silences, and Dedup

Prometheus decides what is wrong. Alertmanager decides who hears about it, when, and how loudly. That second job is where most teams quietly lose the plot. A flat list of receivers with no grouping. Inhibition rules that suppress the wrong things — or, worse, suppress a real outage in one region because an unrelated incident is firing in another. Time-based muting that silences at UTC when the team lives in Asia/Kolkata. And a config nobody dares touch, because there is no way to test whether a change routes a critical database page to PagerDuty or drops it silently on the floor. The failure mode is exactly the one you get from bad threshold tuning — too many pages, or too few — except now it is a routing problem, not a query problem, and it hides behind YAML instead of PromQL.

This guide builds an Alertmanager configuration the way you would build any other piece of production infrastructure: as a tree you can reason about, with rules that compose predictably, and a CI gate that fails a pull request before a typo’d matcher silences your on-call rotation. We will walk the full alert lifecycle — from a firing alert leaving Prometheus, through the routing tree, grouping, inhibition and silence checks, deduplication across a gossip cluster, into a rendered notification, and finally to the resolved message. We will enumerate every routing knob (route, routes, matchers, continue, group_by, group_wait, group_interval, repeat_interval), wire up receivers (PagerDuty, Slack, Opsgenie, email, webhook) with templated notifications, build inhibition rules that are provably safe, distinguish silences from time intervals, run Alertmanager in HA so a single node failure never silences your alerts, and validate the whole thing with amtool.

By the end you will stop guessing. When someone asks “if a critical payments alert fires in eu-west, which receivers get it?” you will answer with a command, not a shrug. When the 3am storm of 400 downstream alerts arrives, you will have already suppressed it down to the one root-cause page. And when you edit the config, a pipeline will prove — before merge — that every routing guarantee you care about still holds. Versions referenced throughout: Alertmanager 0.27 / 0.28. The schema is stable across that range; the one moving target is matchers versus the deprecated match/match_re, which we cover first.

What problem this solves

Prometheus is superb at detecting problems and hopeless at communicating them. An alerting rule fires an alert with a set of labels and annotations and ships it to Alertmanager over HTTP; that is the entire extent of Prometheus’s involvement in your pager going off. Everything downstream — batching related alerts so one incident is one notification, deciding whether the payments team or the platform team owns it, suppressing the 30 symptom alerts when the one cause is already firing, muting low-priority noise overnight, making sure the page actually goes out even when an Alertmanager node dies — is Alertmanager’s job. Get it wrong and you get the two classic on-call diseases: alert fatigue (so many pages the team auto-acknowledges and misses the real one) or silent gaps (a whole class of alerts that never reaches anyone because a matcher didn’t match).

What breaks without a deliberate design: a network partition in one cluster fires the gateway-down alert plus 400 “service X cannot reach service Y” alerts, all as separate PagerDuty incidents, all to the same on-call engineer, who acknowledges 400 incidents to find the one that matters and learns to auto-ack — which is precisely how the next real incident gets missed. A config change adds a team = payemnts matcher (note the typo) and every payments critical alert now falls through to the catch-all Slack channel nobody watches, undetected until an outage. A “temporary” silence created during a deploy is set for 30 days instead of 30 minutes and mutes a genuine incident three weeks later. A single Alertmanager instance is restarted during a node drain at the exact moment a database fails over, and nobody is paged because the one thing that turns alerts into pages was briefly down.

Who hits this: every team running Prometheus at more than toy scale. It bites hardest on organisations where the team that owns Alertmanager is not the team that owns the alert rules (a central SRE or platform team routing alerts for a dozen product teams), because the routing has to be provably correct without the ability to change the rules that feed it. It bites container-native shops where alerts fan out from cluster-wide failures. And it bites anyone who has ever said “we’ll clean up the alerting config later” — later never comes, and the config calcifies into something nobody understands and everybody fears.

To frame the whole field before the deep dive, here is every job Alertmanager does, the pain it exists to prevent, and the mechanism that does it:

Job	Pain it prevents	Mechanism	Config surface
Routing	Alerts reaching the wrong team, or no team	The routing tree — depth-first match on labels	`route` / `routes` / `matchers`
Grouping	400 separate notifications for one incident	Batching alerts that share `group_by` labels	`group_by` / `group_wait` / `group_interval`
Deduplication	Duplicate pages from HA replicas	Gossip-coordinated single send per group	`--cluster.*` flags
Inhibition	Symptom alerts drowning the root cause	Suppress targets while a source fires	`inhibit_rules`
Silencing	Known-noisy alerts paging during maintenance	Runtime, time-bounded mutes by matcher	Silences (UI / API / `amtool`)
Scheduled muting	Low-priority alerts paging at 3am	Recurring mute/active windows	`time_intervals` + route refs
Repeat control	Re-paging a still-open incident too often	Re-notify cadence per group	`repeat_interval`
Formatting	Cryptic pages with no runbook or context	Go-template notification bodies	`templates` + receiver `*_configs`

Learning objectives

By the end of this article you can:

Trace an alert through the full lifecycle — firing in Prometheus, through routing, grouping, inhibition and silence checks, dedup across a cluster, into a notification, and out again as resolved — and name what happens at each stage.
Read and write a routing tree: order sibling routes correctly, use continue: true to deliver one alert to multiple receivers, and choose group_by / group_wait / group_interval / repeat_interval for a given severity.
Migrate legacy match / match_re to the current matchers syntax and confirm the parse with amtool before it reaches production.
Build inhibition rules with a correct equal: scope so a source alert never cross-suppresses an unrelated target in another region or cluster.
Distinguish silences (imperative, runtime, expiring) from time intervals (declarative, recurring, in-config) and apply each to the right problem, with explicit IANA time zones.
Configure receivers for PagerDuty, Slack, Opsgenie, email and webhook, keep integration keys out of source control with *_file variants, and template notification bodies that carry the runbook link and the right severity.
Run Alertmanager in HA: why you run more than one, how the gossip mesh dedupes, why every Prometheus targets all peers, and how to confirm the cluster is healthy.
Validate the config in CI with amtool check-config plus per-team amtool config routes test assertions, and diagnose the common misroutes when routing goes wrong.

Prerequisites & where this fits

You should already run Prometheus and understand that an alerting rule (an alert: in a rules file) fires an alert with a set of labels (identity: alertname, severity, cluster, plus whatever you attach) and annotations (human context: summary, description, runbook_url). You should know that Prometheus sends alerts to Alertmanager over HTTP and re-sends them roughly every evaluation interval while the underlying condition holds — Alertmanager is a stateful consumer of a repeating stream, not a one-shot event bus. Comfort with YAML, PromQL-style label matchers, and running a binary with command-line flags is assumed. Basic familiarity with your paging tool (PagerDuty / Opsgenie) and a chat tool (Slack) helps but is not required.
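
For orientation, a rule of the kind this article assumes might look like the sketch below. The group name, alert name, metric name, threshold, and runbook URL are all hypothetical. What matters is that the labels block carries the routing identity (severity, team, service) and the annotations block carries the human context. The cluster label is assumed to come from Prometheus external_labels rather than from the rule itself.

```yaml
# rules/payments.yml (illustrative only; every name here is a placeholder)
groups:
  - name: payments-availability
    rules:
      - alert: PaymentsHighErrorRate
        # Ratio of 5xx responses to all responses over five minutes.
        expr: |
          sum(rate(http_requests_total{service="payments-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="payments-api"}[5m])) > 0.05
        for: 5m                      # must hold this long before the alert fires
        labels:                      # identity: what routing and inhibition match on
          severity: critical
          team: payments
          service: payments-api
        annotations:                 # human context: rendered into the notification
          summary: "High 5xx rate on {{ $labels.service }}"
          description: "More than 5% of requests have failed for 5 minutes."
          runbook_url: "https://runbooks.example.com/payments/high-error-rate"
```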

This sits in the Observability / alerting track, downstream of metric collection and alert authoring and upstream of your human on-call process. The alert rules are covered by PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work and the SLO-driven variants in SLOs and Error Budgets in Practice: Defining SLIs and Building Multi-Window Burn-Rate Alerts — those decide what fires; this article decides what happens to it next. The human side — escalation policies, on-call schedules, and runbooks — is Building an On-Call Practice: PagerDuty Escalation, Alert Routing, and Actionable Runbooks; Alertmanager is the machine that feeds that practice. If you are scaling Prometheus itself, Scaling Prometheus: Recording Rules, Remote-Write, and Long-Term Storage with Thanos and Mimir and Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus are the adjacent concerns.

A quick map of who owns what during an alerting incident, so you route the fix to the right place, not just the alert:

Layer	What lives here	Who usually owns it	Alerting failure it can cause
Alert rules (Prometheus)	`alert:` expressions, labels, annotations	Product / service teams	Missing/extra labels break routing & inhibition
Prometheus → AM send	`alerting.alertmanagers` targets	Platform / SRE	Sending to one AM only → duplicate/missed pages
Routing tree	`route`, matchers, `continue`	Platform / SRE (root), teams (subtrees)	Misroute; alert to wrong or no receiver
Grouping / timing	`group_by`, `group_wait`, `repeat_interval`	Platform / SRE	Storm of notifications, or delayed pages
Inhibition	`inhibit_rules`, `equal` scope	Platform / SRE	Cross-region suppression of a real outage
Receivers / secrets	`*_configs`, integration keys	Platform + security	Key leaked in git, or page never delivered
Notification templates	`templates`, `.tmpl` files	Platform / SRE	Page with no runbook, wrong severity
HA cluster	`--cluster.*` flags, peers	Platform / SRE	SPOF; no pages when one node dies

Core concepts

Six mental models make every later decision obvious.

An alert is a repeating stream with a lifecycle, not an event. Prometheus evaluates rules on a loop and, for every alert currently satisfying its condition, sends it to Alertmanager on roughly each evaluation cycle: the same alert, over and over, while the condition holds. Alertmanager holds each alert with a state. It is active (firing) while Prometheus keeps sending it and its endsAt is in the future. It becomes resolved when Prometheus sends it with an endsAt in the past, or stops sending it and the last endsAt it supplied lapses. The resolve_timeout setting (default 5m) is only the fallback for alerts that arrive with no endsAt at all, which Prometheus does not send. Grouping, inhibition, silences, and repeat timing all operate on this held, stateful view, which is why Alertmanager keeps a --storage.path and why the same alert does not page you a hundred times a minute.

Routing is a single tree walked depth-first. Every alert enters at the root route and descends. At each node it checks the node’s matchers; a match means “this route is a candidate,” and the alert continues down into that node’s child routes. The alert lands on the deepest matching route (the most specific one), which supplies the receiver and — inheriting from ancestors where not overridden — the timing. By default an alert stops at the first matching sibling; continue: true overrides that and lets it also match later siblings. Ordering and continue are the two levers that shape the whole tree.

Grouping turns many alerts into one notification. group_by names the labels that define a group: all currently-firing alerts sharing those exact label values are one notification group and produce one notification carrying all of them. This is the single most important defence against alert storms — one “cluster network down” page listing 40 affected services beats 40 pages. The timing knobs (group_wait, group_interval, repeat_interval) then govern when that group notifies.

Inhibition is global suppression, independent of routing. While a source alert matches its source_matchers, any alert that matches the target_matchers and agrees with the source on every label in equal is muted (held active, but not notified). It is evaluated across the entire alert set, so a source in one subtree can inhibit a target in a completely different subtree. The equal list is the safety rail: it scopes suppression to the same region/cluster/service so a critical alert in us-east never mutes a warning in us-west.

Silences and time intervals are two different mutes. A silence is imperative and temporary: you create it at runtime (UI, API, amtool) with a matcher and a duration, for a maintenance window or a known-noisy alert; it lives in Alertmanager’s state and expires. A time interval is declarative and recurring: defined in the config file and referenced from routes to mute (or exclusively allow) notifications on a schedule — quiet hours, weekends, business hours. Conflating them is a common error; silences are for “shut this up now,” time intervals are for “shut this up every night.”

Running one Alertmanager is a single point of failure for notification. Prometheus can be evaluating rules perfectly, but if the one Alertmanager is down, nobody is paged. So you run two or three in a cluster. They gossip over a mesh to share silences and the log of notifications already sent (the “I already notified this group” state), and every Prometheus sends every alert to every replica, so each replica evaluates inhibition against the same alert set. The shared notification log is what deduplicates: in normal operation the replicas between them send one notification per group. The system deliberately favours an occasional duplicate page, during a partition for example, over a missed one.
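
On the Prometheus side, that means listing every replica explicitly rather than pointing at a load balancer, so each Alertmanager sees every alert. A minimal sketch of the relevant prometheus.yml fragment follows; the hostnames are hypothetical.

```yaml
# prometheus.yml (fragment). Hostnames are placeholders for your own peers.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # List every peer. Do not put a load balancer in front of them:
            # each replica needs the full alert stream to dedupe correctly.
            - alertmanager-0.example.internal:9093
            - alertmanager-1.example.internal:9093
            - alertmanager-2.example.internal:9093
```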

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary at the end repeats these for lookup; this is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
Alert	A firing condition + labels + annotations from Prometheus	Sent over HTTP, held in AM state	The unit everything acts on
Route	A tree node: matchers + timing + receiver	`route` / `routes`	Decides who gets the alert
Matcher	A label predicate (`=`, `!=`, `=~`, `!~`)	`matchers:` list	Whether a route/rule applies
`continue`	Keep matching sibling routes after a hit	On a route	One alert → multiple receivers
`group_by`	Labels that define a notification group	On a route	Batches related alerts into one page
`group_wait`	Delay before first send of a new group	On a route	Lets siblings join the same page
`group_interval`	Min gap before an updated group notifies	On a route	Throttles updates to an open group
`repeat_interval`	Gap before re-sending an unchanged group	On a route	The “still open” reminder cadence
Receiver	A named bundle of notification integrations	`receivers:`	Where the notification is sent
Inhibition rule	Suppress targets while a source fires	`inhibit_rules:`	Kills symptom-alert noise
`equal`	Labels source & target must share to suppress	On an inhibit rule	Prevents cross-region suppression
Silence	Runtime, expiring mute by matcher	AM state (UI/API/amtool)	Maintenance / known-noise mutes
Time interval	In-config recurring mute/active window	`time_intervals:`	Quiet hours, weekends
Gossip cluster	HA peers sharing state over a mesh	`--cluster.*` flags	Dedup + no SPOF
`amtool`	The CLI: check, route-test, silence, alert	Binary	Testing & CI

The alert lifecycle end to end

Before touching config, hold the whole pipeline in your head. An alert makes this journey every time it fires, and every configuration knob plugs into one of these stages. Knowing the order is what lets you reason about why a page was late, duplicated, or missing.

Fire. A Prometheus alerting rule’s expression becomes true. After any for: duration elapses, Prometheus marks the alert firing and, on each evaluation, POSTs it (and every other firing alert) to the /api/v2/alerts endpoint of every Alertmanager it is configured with. The alert carries its labels, annotations, startsAt, and an endsAt a few intervals in the future.
Receive & dedupe-at-ingest. Each Alertmanager receives the alert. If several Prometheus servers or several sends deliver the same alert (same label fingerprint), Alertmanager treats them as one and stores the latest.
Route. The alert walks the routing tree depth-first, landing on the deepest matching route(s). This yields the receiver(s) and the effective timing (inherited from ancestors unless overridden).
Inhibition & silence check. Before anything notifies, Alertmanager asks: is this alert inhibited by a currently-firing source? Is it covered by an active silence? Is its route inside a mute time interval right now? If any is true, the alert stays active but is not notified (it is still visible in the UI/API).
Group & wait. The alert joins a notification group defined by its route’s group_by. If the group is brand new, Alertmanager waits group_wait so siblings can join. If the group already exists and this alert changes it, it waits at most group_interval.
Deduplicate across the cluster. In HA, the peers coordinate over gossip so exactly one of them sends the notification for this group (position-based timing, covered later). The others suppress their copy.
Notify. The winning peer renders the notification through the receiver’s template and delivers it to PagerDuty / Slack / etc. Success or failure is recorded in alertmanager_notifications_total / alertmanager_notifications_failed_total.
Repeat. While the group’s contents do not change, Alertmanager re-sends every repeat_interval (the “you still have an open page” reminder).
Resolve. Prometheus sends the alert with a past endsAt once the condition clears, or stops sending it so that the last endsAt it supplied lapses; resolve_timeout applies only to alerts that carry no endsAt. Alertmanager then marks the alert resolved and, if the receiver has send_resolved: true, sends a resolution notification. The group empties; when the last alert resolves, the group disappears.

Here is each stage mapped to the config that governs it and the metric that proves it happened:

#	Stage	Governed by	Confirm with
1	Fire	Prometheus rule `expr` + `for`	Prometheus `/alerts`; `ALERTS` metric
2	Receive / ingest-dedupe	`alerting.alertmanagers` (Prom side)	`alertmanager_alerts` (state=active)
3	Route	`route` / `routes` / `matchers`	`amtool config routes test …`
4	Inhibit / silence / mute	`inhibit_rules`, silences, `time_intervals`	`amtool alert query --inhibited`; silences UI
5	Group & wait	`group_by` / `group_wait` / `group_interval`	notification timing; group size in payload
6	Cluster dedupe	`--cluster.*` flags	`alertmanager_cluster_members`
7	Notify	`receivers`, `templates`	`alertmanager_notifications_total{integration}`
8	Repeat	`repeat_interval`	re-notification cadence
9	Resolve	`resolve_timeout`, `send_resolved`	`alertmanager_alerts{state="active"}` drops
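
The resolve stage hangs on two small settings that are easy to overlook. The sketch below shows where they live. The receiver name matches the examples in this article, while the channel name and the secret file path are placeholders.

```yaml
global:
  resolve_timeout: 5m        # the default, written out for visibility

receivers:
  - name: payments-slack
    slack_configs:
      - channel: '#payments-alerts'                        # hypothetical channel
        api_url_file: /etc/alertmanager/secrets/slack-url  # hypothetical path; keeps the webhook URL out of git
        send_resolved: true                                # Slack integrations default to false
```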

Anatomy of a route

Every alert enters at the top of a single routing tree and walks down it depth-first. A route node has two responsibilities: decide whether an alert matches it, and if so, control how matching alerts are batched and how often they re-notify.

route:
  receiver: default-receiver          # fallback if nothing more specific matches
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - team = payments
      receiver: payments-slack

group_by and the three timing knobs are the part people consistently get wrong, so be precise about what each one does. The distinction between group_wait, group_interval, and repeat_interval in particular is where most misconfigurations live:

Parameter	What it controls	Applies when	Typical value	Common mistake
`group_by`	The label set that defines a notification group; alerts sharing these values batch into one notification	Always	`['alertname','cluster','service']`	Grouping by a high-cardinality label (e.g. `instance`) → one page per host
`group_wait`	How long to wait after the first alert in a new group before sending, so siblings can join	New group	`30s` (critical: `10s`)	Setting it high → slow first page
`group_interval`	Minimum time before sending an updated notification for a group that already fired (a new alert joined, or one resolved)	Existing group changed	`5m`	Confusing it with `repeat_interval`
`repeat_interval`	How long before re-sending a notification for a group whose contents have not changed	Unchanged group still firing	`3h`–`4h`	Setting it low → re-paging an open incident every few minutes

Two non-obvious rules govern grouping and inheritance:

The special ['...'] value disables grouping. Normally group_by lists labels and alerts sharing those values batch together. The literal reserved value group_by: ['...'] means “do not aggregate — every distinct alert is its own group and its own notification.” Use it sparingly (a route where you genuinely want one page per alert); it is the fastest way to recreate an alert storm:
```
group_by: ['...']   # every distinct alert notifies separately — use with care
```
Timing fields are inherited down the tree. A child route that omits group_wait, group_interval, repeat_interval, or group_by inherits the parent’s value. This is the single biggest lever for keeping a config short and consistent: set sane defaults at the root and override only where a team genuinely needs different behaviour. A critical-severity subtree might override group_wait to 10s; a noisy info subtree might override repeat_interval to 24h.
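
To make the inheritance rule concrete, here is a small sketch using the two overrides mentioned above. The receiver names are hypothetical; the comments spell out which values each child ends up with.

```yaml
route:
  receiver: default-receiver
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: oncall-pagerduty   # hypothetical receiver
      group_wait: 10s              # overridden here
      # effective: group_by from root, group_interval 5m, repeat_interval 4h

    - matchers:
        - severity = info
      receiver: info-slack         # hypothetical receiver
      repeat_interval: 24h         # overridden here
      # effective: group_by from root, group_wait 30s, group_interval 5m
```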

Matchers, not match

Older configs used match (equality map) and match_re (regex map). Both are deprecated in favour of the matchers list with PromQL-style operators. The new syntax is not just cosmetic — it disambiguates quoting and is what current tooling validates:

matchers:
  - severity =~ "critical|page"      # regex match
  - team = payments                  # exact match
  - environment != staging           # negation
  - service !~ "test-.*"             # negative regex

The four operators and when to reach for each:

Operator	Meaning	Example	Note
`=`	Label equals value exactly	`team = payments`	Most common; fastest to reason about
`!=`	Label does not equal value	`environment != staging`	Excludes; also matches alerts that lack the label (a missing label counts as empty)
`=~`	Label matches the regex (fully anchored)	`severity =~ "critical|page"`	Regex is anchored: `warn` won’t match `warning`
`!~`	Label does not match the regex	`service !~ "canary-.*"`	Exclude a family of services

Quoting rules bite the unwary: quote the value when it contains regex metacharacters, spaces, or a pipe. Regexes are fully anchored (as if wrapped in ^(?:...)$), so severity =~ "warn" does not match warning. PromQL label matchers anchor the same way, so the surprise mostly hits people bringing habits from grep or general-purpose regex libraries, where a pattern matches anywhere in the string by default. When migrating from match/match_re, run amtool config routes (covered later) immediately after the change: a silently mis-parsed matcher is the classic way to drop alerts on the floor with zero error at load time.

Here is the migration mapping at a glance, so a bulk conversion is mechanical:

Legacy	Equivalent matcher	Watch-out
`match: {team: payments}`	`matchers: ['team = payments']`	Straightforward
`match: {severity: critical}`	`matchers: ['severity = critical']`	Straightforward
`match_re: {severity: "critical|page"}`	`matchers: ['severity =~ "critical|page"']`	Quote the value; the regex stays fully anchored, as it was under `match_re`
`match_re: {service: "api-.*"}`	`matchers: ['service =~ "api-.*"']`	Anchored in both syntaxes: matches `api-x`, not `my-api-x`
`match: {}` (empty, matches all)	omit `matchers` entirely	An empty matcher list matches everything
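
As a worked instance of the table, the sketch below shows one legacy route and its converted form side by side. The receiver name is borrowed from the examples in this article; keep only the second entry in a real config.

```yaml
routes:
  # Before: deprecated keys, shown for comparison only.
  - match:
      team: payments
    match_re:
      severity: critical|page
    receiver: payments-pagerduty

  # After: the same route expressed with matchers.
  - matchers:
      - team = payments
      - severity =~ "critical|page"   # quoted because the value contains a pipe
    receiver: payments-pagerduty
```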

Building a routing tree that maps teams to receivers

The mental model: each alert tries to find the most specific route that wants it. Two decisions drive the shape of the tree — sibling ordering and continue.

Sibling routes are evaluated top to bottom, and by default an alert stops at the first matching sibling. So order matters: put specific routes above general ones, or the general one swallows the alert first.
continue: true tells Alertmanager to keep evaluating siblings after a match. This is how you deliver one alert to multiple receivers — the owning team and a central incident channel.

A real team-to-receiver tree:

route:
  receiver: catch-all
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # 1. Critical, anywhere: mirror to the incident bridge, then keep going.
    - matchers:
        - severity = critical
      receiver: incident-bridge
      continue: true
      group_wait: 10s            # critical should batch less aggressively

    # 2. Team-scoped subtrees. Each team owns its routing below this node.
    - matchers:
        - team = payments
      receiver: payments-slack
      routes:
        - matchers:
            - severity = critical
          receiver: payments-pagerduty

    - matchers:
        - team = platform
      receiver: platform-slack
      routes:
        - matchers:
            - severity = critical
          receiver: platform-pagerduty
        # Heartbeat / dead-man's-switch: fast, frequent, its own receiver.
        - matchers:
            - alertname = Watchdog
          receiver: deadmansswitch
          group_wait: 0s
          group_interval: 1m
          repeat_interval: 1m

    # 3. Low-priority info: only during business hours (see time intervals).
    - matchers:
        - severity = info
      receiver: platform-slack
      mute_time_intervals:
        - outside-business-hours

Read route 1 carefully. Because of continue: true, a critical payments alert first matches route 1 and hits incident-bridge, then — instead of stopping — falls through to route 2’s team = payments subtree, matches the nested severity = critical child, and pages payments-pagerduty. One alert, two destinations, declared once. Without continue, the alert would stop at incident-bridge and the on-call engineer would never be paged — a subtle and dangerous mistake that a routing test (later) catches immediately.

The decision table for how an alert resolves against the tree:

Situation	What happens	Config lever
Alert matches no route	Delivered to the root `receiver` (catch-all)	Root `receiver:` must be a real, watched destination
Alert matches one sibling, no `continue`	Stops there; descends into that sibling’s children	Default behaviour
Alert matches a sibling with `continue: true`	Also evaluated against later siblings	`continue: true`
Alert matches a parent and a child	Lands on the child (deepest match) for receiver/timing	Nest specific routes as children
Two siblings both match, first has no `continue`	Only the first fires	Order + `continue`
Child omits a timing field	Inherits it from the nearest ancestor that sets it	Set defaults at root

A design rule that keeps large configs maintainable: give each team a single top-level subtree keyed on a team label that you enforce in your Prometheus alert rules (make every rule carry team: <name>; lint for it). Teams own everything below their node; you own the root and the cross-cutting routes (critical mirror, watchdog). This keeps merge conflicts local to one team’s block and makes ownership obvious in code review. The alternative — a flat list of severity-and-service matchers at the root — becomes an unreadable, conflict-prone mess by about the fifth team.

Inhibition rules: suppressing downstream noise

When a whole cluster’s network partitions, you do not want 400 “service X cannot reach service Y” pages. You want one “cluster network down” page and nothing from the rest. That is inhibition: while a source alert is firing, matching target alerts are suppressed.

inhibit_rules:
  # If a node is down, suppress the per-service alerts on that same node.
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity =~ "warning|critical"
    equal: ['cluster', 'node']

  # If anything is already 'critical' for a service, mute its 'warning' for the
  # same service so on-call sees one severity, not two.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'cluster', 'service']

The three fields and what each contributes:

Field	Role	Failure if you get it wrong
`source_matchers`	Selects the alert(s) whose presence causes suppression	Too broad → suppresses more than intended; too narrow → storm not caught
`target_matchers`	Selects the alert(s) that get suppressed while a source fires	Overlaps the source → alerts matching both sides are exempt from inhibiting each other, so the rule mutes less than you expect
`equal`	Labels the target must share with the source to be suppressed	Omit it and suppression goes global — a critical in `us-east` mutes warnings in `us-west`

The equal list is the part that makes inhibition safe. It says: only suppress a target whose listed label values match the source’s. It is the single most important line in any inhibition rule, and the most common thing missing from a broken one. Three rules to internalise:

Inhibition is global and routing-independent. It is evaluated across the entire alert set. A source in the platform subtree can inhibit a target in the payments subtree; routing has nothing to do with it. This is a feature (cluster-down suppresses everything) but a trap if your matchers are broad.
A muted alert is still active. It is visible in the UI and the API and counts in dashboards; it is simply not notified. If you build dashboards on the Alertmanager API, filter on inhibited accordingly — a “firing but muted” alert is not the same as “not firing.”
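Overlapping source and target sets do not behave the way you might guess. Alertmanager guards against self-inhibition: an alert that matches both source_matchers and target_matchers cannot be inhibited by any alert that also matches both sides, itself included. An alert therefore never mutes its own notification, but alerts caught in the overlap also stop muting one another, so the rule suppresses less than it appears to. Keep the source set and target set disjoint for any single alert, usually by making the source a higher severity or a distinct alertname than the target, and choose equal labels that both alerts carry.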

Common inhibition patterns worth having in your config from day one:

Pattern	Source	Target	`equal` scope	Effect
Severity collapse	`severity = critical`	`severity = warning`	`alertname, cluster, service`	One severity per issue
Node down mutes its services	`alertname = NodeDown`	`severity =~ "warning|critical"`	`cluster, node`	Root cause, not symptoms
Cluster down mutes everything in it	`alertname = ClusterUnreachable`	`severity =~ "warning|critical"`	`cluster`	One page per cluster outage
Gateway down mutes downstream tier	`alertname = GatewayDegraded, severity = critical`	`tier = downstream`	`region`	Kill fan-out from one gateway
Maintenance flag mutes region	`alertname = MaintenanceMode`	`severity =~ "warning|critical"`	`region`	Planned work quiets that region
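
Two of the patterns in the table, written out as a sketch. The alert names and the tier and region labels are the table's own and are assumed to exist on your alerts. The extra alertname != ClusterUnreachable target matcher is an addition of this example, included to keep the source and target sets disjoint.

```yaml
inhibit_rules:
  # Cluster down mutes everything else in that cluster.
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - severity =~ "warning|critical"
      - alertname != ClusterUnreachable   # the source itself is never a target
    equal: ['cluster']

  # A critical gateway alert mutes the downstream tier in the same region.
  - source_matchers:
      - alertname = GatewayDegraded
      - severity = critical
    target_matchers:
      - tier = downstream
    equal: ['region']
```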

Silences and time-based muting

There are two distinct mechanisms, and conflating them is a common error. Both suppress notifications; the difference is who creates them, when, and how they end.

Silences are imperative and temporary. You create them at runtime — the UI, the API, or amtool — for a maintenance window or a known-noisy alert. They have an author, a comment, and a duration; they expire; they are stored in Alertmanager’s state (and gossiped across the cluster), not in the config file. Use them for “shut this specific thing up for the next two hours.”

# Silence all payments alerts in staging for 2 hours during a deploy.
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="vinod" \
  --comment="payments staging deploy #4821" \
  --duration=2h \
  team=payments environment=staging

# List active silences, then expire one early.
amtool silence query --alertmanager.url=http://alertmanager:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093

Time intervals are declarative and recurring. They are defined in the config and referenced from routes to mute (or exclusively allow) notifications on a schedule: business hours, weekends, quiet hours. They never expire; they take effect every time the wall clock enters the window.

time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        location: 'Asia/Kolkata'
      - weekdays: ['saturday', 'sunday']
        location: 'Asia/Kolkata'

Reference an interval from a route with mute_time_intervals (suppress notifications during the window) or active_time_intervals (only notify during the window — everything outside is muted):

routes:
  # Low-priority info alerts: do NOT page outside business hours.
  - matchers:
      - severity = info
    receiver: platform-slack
    mute_time_intervals:
      - outside-business-hours

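  # A "business-hours dashboards" webhook that should ONLY fire 09:00–18:00.
  # NOTE: `business-hours` must also be defined under time_intervals (only
  # `outside-business-hours` is shown above); an undefined name fails config load.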
  - matchers:
      - notify = bi-dashboard
    receiver: bi-webhook
    active_time_intervals:
      - business-hours

The full set of sub-fields you can specify inside a time_intervals block, and the gotchas:

Sub-field	Format / example	Notes & gotcha
`times`	`start_time: '18:00'`, `end_time: '24:00'`	24-hour clock; `end_time` is exclusive; use `'24:00'` for midnight-end
`weekdays`	`['monday:friday']` or `['saturday','sunday']`	Ranges use a colon; lowercase names
`days_of_month`	`['1:5', '-1']`	`-1` = last day of month; ranges with colon
`months`	`['january:march', '12']`	Names or numbers
`years`	`['2026:2027']`	Rarely needed
`location`	`'Asia/Kolkata'`	IANA name; omit it and you get UTC, which surprises you twice a year at DST

A few correctness notes learned the hard way: end_time: '24:00' is the legal way to express “up to and including midnight”; '00:00' as an end means the start of that day, not the end. Always set location to an IANA timezone; the default is UTC, and a quiet-hours window authored without a location will mute at the wrong local time and, in zones that observe daylight saving, drift by an hour across the changes. The older top-level key mute_time_intervals: (a list of intervals at the config root) is deprecated but still accepted; time_intervals: is the current name, so standardise on it.
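
The location and end_time points above are easier to see with numbers. This is a minimal Python sketch of the `outside-business-hours` window, assuming inclusive start and exclusive end as described in the table. It is not Alertmanager's code: `outside_business_hours` is a hypothetical helper, and a fixed +05:30 offset stands in for `Asia/Kolkata` (which has no daylight saving) so the example needs no timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offset standing in for the IANA zone Asia/Kolkata.
IST = timezone(timedelta(hours=5, minutes=30))

# The weekday ranges from the outside-business-hours interval.
WEEKDAY_TIMES = [("18:00", "24:00"), ("00:00", "09:00")]


def to_minutes(hhmm):
    """'18:00' -> 1080; '24:00' -> 1440, i.e. the end of the day."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def outside_business_hours(instant, tz):
    """Evaluate the window in `tz`: weekends all day, weekdays outside 09:00-18:00."""
    local = instant.astimezone(tz)
    if local.weekday() >= 5:  # Saturday or Sunday
        return True
    minute_of_day = local.hour * 60 + local.minute
    return any(
        to_minutes(start) <= minute_of_day < to_minutes(end)  # end is exclusive
        for start, end in WEEKDAY_TIMES
    )


# Monday 2025-01-06, 04:00 UTC is 09:30 in India: the team is at work.
monday_morning = datetime(2025, 1, 6, 4, 0, tzinfo=timezone.utc)
print(outside_business_hours(monday_morning, IST))           # False: notifications flow
print(outside_business_hours(monday_morning, timezone.utc))  # True: muted if location is omitted
```

The same instant is muted or not depending only on the zone the window is evaluated in, which is the failure a missing location produces.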

Silences versus time intervals, side by side, so you never reach for the wrong one:

Dimension	Silence	Time interval
Nature	Imperative, one-off	Declarative, recurring
Created via	UI / API / `amtool` at runtime	Config file (`time_intervals`)
Lifetime	Fixed duration, then expires	Forever; fires on schedule
Stored in	Alertmanager state (gossiped)	The config (version-controlled)
Scoping	Matcher on labels	`weekdays`/`times`/`location` + referenced from routes
Typical use	“Mute DB alerts for this 2h deploy”	“Never page info alerts overnight”
Audit trail	Author + comment on the silence	Git history of the config
Risk	Over-long duration mutes a real incident	Wrong `location` mutes at the wrong hour

Receivers and integrations

A receiver is a named bundle of one or more notification integrations. An alert routed to a receiver notifies through every integration in it. The big three paging/chat integrations plus email and a webhook escape hatch:

receivers:
  - name: catch-all            # a receiver with no config = notify nothing (valid)

  - name: payments-pagerduty
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_payments_key
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'

  - name: payments-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#payments-alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }} ({{ .Alerts | len }})'
        text: |-   # literal block keeps one alert per line; a folded '>-' joins them with spaces
          {{ range .Alerts }}*{{ .Labels.service }}* — {{ .Annotations.summary }}
          {{ if .Annotations.runbook_url }}<{{ .Annotations.runbook_url }}|runbook>{{ end }}
          {{ end }}

  - name: platform-opsgenie
    opsgenie_configs:
      - api_key_file: /etc/alertmanager/secrets/opsgenie_key
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
        message: '{{ .CommonLabels.alertname }} on {{ .CommonLabels.cluster }}'

  - name: dba-email
    email_configs:
      - to: 'dba-oncall@example.com'
        send_resolved: true
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'

  - name: incident-bridge
    webhook_configs:
      - url: http://incident-router.internal:8080/alertmanager
        send_resolved: true
        max_alerts: 0          # 0 = send all, no truncation

The integrations you will actually use, what they page, and the one field that most often trips people up:

Integration	Best for	Key field	`_file` variant	Gotcha
`pagerduty_configs`	Paging on-call	`routing_key` (Events API v2)	`routing_key_file`	Map `severity` explicitly, or PD defaults to `error`
`opsgenie_configs`	Paging on-call	`api_key`	`api_key_file`	Set `priority` from severity for correct escalation
`slack_configs`	Team chat, triage	`api_url` (incoming webhook)	`api_url_file`	Webhook is per-channel; `channel:` can override for a bot token
`email_configs`	Low-urgency, records	`to` + global SMTP	(SMTP creds via `global`)	Requires `global.smtp_*`; easy to forget
`webhook_configs`	Anything custom / bridge	`url`	`url_file`	`max_alerts` defaults to `0` (send all); any non-zero value truncates the payload
`victorops_configs` / `wechat_configs` / `pushover_configs` / `telegram_configs` / `sns_configs` / `msteams_configs`	Niche destinations	integration-specific	most have `_file`	Check per-integration required fields

Two operational disciplines that separate a safe receiver config from a dangerous one:

Keep secrets out of git with the *_file variants. The main integrations that take a key or URL have a _file twin (routing_key_file, api_url_file, api_key_file, url_file). Point them at files mounted from a Kubernetes Secret or Vault so the committed config carries paths, not credentials. Committing a PagerDuty routing key or a Slack webhook URL is a real incident: the webhook can be abused to spam a channel, the routing key to spoof pages.
Turn on send_resolved: true almost everywhere. It is the difference between an on-call engineer manually checking whether the thing recovered and getting an automatic green “resolved” message. Enable it on Slack, email, and webhook receivers universally; the only exception is an integration that manages resolution itself (some incident tools prefer to close their own incidents), where a resolved notification would be redundant or confusing.
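
The first discipline is easy to enforce mechanically. The sketch below is one possible lint, written in Python over an already-parsed config (a plain dict, as a YAML loader would return). The function name `inline_secrets` and the list of field names treated as secrets are assumptions for illustration, not an official or exhaustive list.

```python
# Assumption: the inline fields we treat as credentials. Extend for your integrations.
INLINE_SECRET_FIELDS = {"routing_key", "service_key", "api_key", "api_url"}


def inline_secrets(config):
    """Return (receiver, integration, field) for every credential written inline."""
    findings = []
    for receiver in config.get("receivers", []):
        for key, integrations in receiver.items():
            if not key.endswith("_configs"):
                continue  # skip 'name' and anything that is not an integration list
            for integration in integrations:
                for field in sorted(INLINE_SECRET_FIELDS.intersection(integration)):
                    findings.append((receiver["name"], key, field))
    return findings


config = {
    "receivers": [
        {"name": "catch-all"},
        {"name": "payments-pagerduty",
         "pagerduty_configs": [{"routing_key_file": "/etc/alertmanager/secrets/pagerduty_payments_key"}]},
        {"name": "payments-slack",
         "slack_configs": [{"api_url": "https://hooks.example.invalid/placeholder", "channel": "#payments-alerts"}]},
    ]
}

print(inline_secrets(config))
# [('payments-slack', 'slack_configs', 'api_url')]
```

A non-empty result would fail the build; the `_file` variants pass because only the exact inline field names are flagged.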

Templating notifications

The notification body is a Go template evaluated over a group of alerts. Getting the data model right is what turns a cryptic page into an actionable one:

.Alerts is the slice of all alerts in the group; .Alerts.Firing and .Alerts.Resolved partition it. Range over .Alerts to list each one.
.CommonLabels / .CommonAnnotations hold labels/annotations shared by every alert in the group. They are convenient for a title — but a label that differs across the group (e.g. service, when the group spans services) simply will not appear in .CommonLabels. Do not build a title from .CommonLabels.service if a group can contain multiple services; it silently renders empty.
.Status is firing or resolved for the group; .GroupLabels are the group_by labels; .ExternalURL links back to Alertmanager.
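
The `.CommonLabels` pitfall follows directly from how the common set is derived: it is the intersection of the label pairs across the group. A small Python sketch of that intersection, using a hypothetical helper `common_labels` rather than anything from Alertmanager itself:

```python
def common_labels(alerts):
    """Keep only the label pairs that are identical on every alert in the group."""
    if not alerts:
        return {}
    common = dict(alerts[0])
    for labels in alerts[1:]:
        common = {k: v for k, v in common.items() if labels.get(k) == v}
    return common


group = [
    {"alertname": "HighLatency", "cluster": "ap-south", "service": "checkout"},
    {"alertname": "HighLatency", "cluster": "ap-south", "service": "ledger"},
]

print(common_labels(group))
# {'alertname': 'HighLatency', 'cluster': 'ap-south'}
print(common_labels(group).get("service", ""))
# prints an empty line: 'service' differs across the group, so a title built on it is blank
```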

Factor repeated markup into named templates and load them at the top level:

templates:
  - '/etc/alertmanager/templates/*.tmpl'

{{ define "slack.title" }}{{ .Status | toUpper }}: {{ .CommonLabels.alertname }} ({{ .Alerts | len }}){{ end }}

{{ define "slack.text" }}
{{ range .Alerts -}}
• *{{ .Labels.service }}* [{{ .Labels.severity }}] {{ .Annotations.summary }}
  {{ if .Annotations.runbook_url }}<{{ .Annotations.runbook_url }}|runbook> · {{ end }}{{ .Labels.cluster }}
{{ end }}
{{- end }}

The template fields you will reach for most, and what each returns:

Field / function	Returns	Use for
`.Status`	`firing` / `resolved` (group-level)	Title, colour, subject
`.Alerts`	All alerts in the group	Range to list them
`.Alerts.Firing` / `.Alerts.Resolved`	Firing / resolved subsets	Separate sections
`.Alerts | len`	Count of alerts in the group	“(3)” in the title
`.CommonLabels`	Labels shared by all in the group	Group title — only truly-common labels
`.CommonAnnotations`	Annotations shared by all	`summary` when identical
`.GroupLabels`	The `group_by` label set	What defines this group
`.Labels.<x>` (inside range)	One alert’s label	Per-alert detail
`.Annotations.runbook_url`	One alert’s runbook link	Actionable page
`.ExternalURL`	Alertmanager’s external URL	“Open in Alertmanager” link
`{{ … | toUpper }}`	Uppercased string	`FIRING`

Running Alertmanager in HA

A single Alertmanager is a single point of failure for notification. Prometheus may still be evaluating rules perfectly, but if the one Alertmanager is down (restart, node drain, OOM), nobody hears about the alerts. So you run at least two (three is better) in a cluster. The replicas gossip over a mesh protocol so that silences and the notification log are shared, and notifications are deduplicated: each replica receives the same alerts from every Prometheus and evaluates inhibition locally, but only one notification goes out per group.

The rule people get wrong: point every Prometheus at all Alertmanager peers. You do not load-balance Prometheus across Alertmanagers or put them behind a single VIP. Each Prometheus sends every firing alert to every peer; the cluster dedupes the resulting notification. Sending to only one peer defeats HA (that peer’s failure = no pages); sending through a load balancer means a given alert reaches only one peer and the dedup coordination breaks.

# prometheus.yml — every Prometheus lists ALL peers
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
            - alertmanager-2:9093

Start each peer with the gossip flags, listing the other peers:

# alertmanager-0
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094

How deduplication works. When a peer is about to send a notification for a group, it waits a position-based offset: --cluster.peer-timeout (default 15s) multiplied by that peer’s index in the sorted cluster membership. The peer at position 0 sends first and gossips a notification-log entry (“I sent this group”); the peers at positions 1 and 2 see that entry before their own timer elapses and suppress their copy. During a network partition, peers can’t see each other’s log entries, so more than one may send — a brief duplicate notification. That is by design: the system favours an occasional duplicate page over a missed one. The relevant flags:

Flag	Purpose	Default	Notes
`--cluster.listen-address`	Address/port for gossip (typically `:9094`)	`0.0.0.0:9094`	Set empty (`--cluster.listen-address=`) to disable clustering
`--cluster.peer`	A peer’s gossip address (repeat per peer)	none	List the other peers; can be a headless-service DNS name
`--cluster.peer-timeout`	Per-position send offset for dedup	`15s`	Multiplied by peer index; raise on slow networks
`--cluster.gossip-interval`	How often peers gossip	`200ms`	Rarely tuned
`--cluster.pushpull-interval`	Full-state sync interval	`1m`	Rarely tuned
`--cluster.reconnect-timeout`	How long to try re-reaching a lost peer	`6h`	Rarely tuned
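
The position-based offset is simple arithmetic, and a toy model makes the partition behaviour concrete. This Python sketch is an approximation under stated assumptions: hostnames stand in for the internal peer identifiers that Alertmanager actually sorts, and `send_offsets` and `notifying_peers` are hypothetical helpers that ignore how each side's membership view changes over time.

```python
def send_offsets(peer_ids, peer_timeout_s=15):
    """Offset in seconds per peer: peer-timeout x position in the sorted membership."""
    return {peer: index * peer_timeout_s for index, peer in enumerate(sorted(peer_ids))}


def notifying_peers(peer_ids, partitions, peer_timeout_s=15):
    """In each partition, the peer with the smallest offset sends; the rest see its log entry."""
    offsets = send_offsets(peer_ids, peer_timeout_s)
    return sorted(min(group, key=offsets.__getitem__) for group in partitions)


peers = ["alertmanager-2", "alertmanager-0", "alertmanager-1"]

print(send_offsets(peers))
# {'alertmanager-0': 0, 'alertmanager-1': 15, 'alertmanager-2': 30}

# Healthy cluster: everyone shares one notification log, so one notification.
print(notifying_peers(peers, [set(peers)]))
# ['alertmanager-0']

# Partition: alertmanager-0 is cut off from the other two, so each side sends once.
print(notifying_peers(peers, [{"alertmanager-0"}, {"alertmanager-1", "alertmanager-2"}]))
# ['alertmanager-0', 'alertmanager-1']
```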

Operational checks that catch the two most common HA mistakes (accidentally un-clustered, or peers not actually connected):

# Confirm every peer sees the full membership (run against EACH pod/host).
# This value must equal your replica count on every peer.
curl -s http://alertmanager-0:9093/metrics | grep '^alertmanager_cluster_members'

# The status page also lists cluster members and their addresses.
curl -s http://alertmanager-0:9093/api/v2/status | jq '.cluster'

On Kubernetes, the Prometheus Operator’s Alertmanager CRD with replicas: 3 wires the --cluster.peer flags via a headless service for you, so you rarely touch the flags directly. But you must still verify: alertmanager_cluster_members should equal your replica count on every pod, and the /#/status page should show the cluster status as ready and list every peer. The classic failure: --cluster.listen-address= (empty) silently disables clustering, so two “replicas” run as two independent, un-clustered instances; each sends its own copy and you get duplicate pages from a deployment you thought was HA. If you see consistent duplicates, check clustering is actually on before anything else.
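
The "equals replica count on every peer" check can be scripted. The sketch below, in Python, parses the metrics text you would get from each peer's /metrics endpoint and reports the peers whose view is incomplete. The helper names and the sample payloads are hypothetical; fetching the text (for example with curl, as above) is left out so the example stays self-contained.

```python
def cluster_members(metrics_text):
    """Extract the alertmanager_cluster_members gauge from Prometheus text format."""
    for line in metrics_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "alertmanager_cluster_members":
            return int(float(value))
    return None  # metric absent


def peers_out_of_sync(metrics_by_peer, replicas):
    """Names of peers whose membership count differs from the expected replica count."""
    return sorted(
        peer for peer, text in metrics_by_peer.items()
        if cluster_members(text) != replicas
    )


# Illustrative payloads, not captured from a real cluster.
healthy = "# TYPE alertmanager_cluster_members gauge\nalertmanager_cluster_members 3\n"
isolated = "# TYPE alertmanager_cluster_members gauge\nalertmanager_cluster_members 1\n"

scraped = {"alertmanager-0": healthy, "alertmanager-1": healthy, "alertmanager-2": isolated}
print(peers_out_of_sync(scraped, replicas=3))
# ['alertmanager-2']
```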

HA sizing and behaviour at a glance:

Replicas	Notification availability	Dedup behaviour	Verdict
1	SPOF — any restart = no pages	N/A	Not for production
2	Survives one failure	One sends, one suppresses	Minimum viable
3	Survives two failures; any one live peer can still notify	Position 0 sends; others suppress	Recommended
3, partitioned	Both sides still page	Brief duplicates until healed	By design — dup > miss

Testing routes with amtool and validating in CI

This is the step that turns “config nobody touches” into “config under change control.” Three layers, each catching a different class of mistake.

Layer 1 — Syntax check. Does the file even parse, and are the matchers valid?

amtool check-config /etc/alertmanager/alertmanager.yml
# Checking '/etc/alertmanager/alertmanager.yml'  SUCCESS
# Found:
#  - global config
#  - route
#  - 2 inhibit rules
#  - 5 receivers
#  - 1 time interval

Layer 2 — Route simulation. Given a set of labels, which receiver(s) would fire? This is the single highest-value command in the whole toolchain — it catches ordering and continue mistakes before they reach production, with no running Alertmanager required:

# Where does a critical payments alert actually land?
amtool config routes test \
  --config.file=alertmanager.yml \
  severity=critical team=payments alertname=ApiHighErrorRate
# Expected output (matched receivers on one comma-separated line):
#   incident-bridge,payments-pagerduty

# Visualise the whole tree as text.
amtool config routes show --config.file=alertmanager.yml
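
To see why ordering and continue change the answer, it helps to model the walk that the route test performs. The Python sketch below is a toy version under stated assumptions: matchers are plain equality dicts, timing and receiver inheritance are ignored, and the tree is a hypothetical one shaped like the examples in this article (a mirror route with continue, then a payments subtree).

```python
def route_matches(route, labels):
    """A route matches when every one of its matchers equals the alert's label."""
    return all(labels.get(k) == v for k, v in route.get("matchers", {}).items())


def resolve(route, labels):
    """Depth-first walk: stop at the first matching child unless it sets continue."""
    if not route_matches(route, labels):
        return []
    found = []
    for child in route.get("routes", []):
        hit = resolve(child, labels)
        found.extend(hit)
        if hit and not child.get("continue", False):
            break
    return found or [route["receiver"]]  # no child matched: this route handles it


TREE = {
    "receiver": "catch-all",
    "routes": [
        {"matchers": {"severity": "critical"}, "receiver": "incident-bridge", "continue": True},
        {
            "matchers": {"team": "payments"},
            "receiver": "payments-slack",
            "routes": [{"matchers": {"severity": "critical"}, "receiver": "payments-pagerduty"}],
        },
    ],
}
# The same tree with continue removed from the mirror route.
BROKEN = {**TREE, "routes": [{**TREE["routes"][0], "continue": False}, TREE["routes"][1]]}

critical_payments = {"severity": "critical", "team": "payments", "alertname": "ApiHighErrorRate"}
print(resolve(TREE, critical_payments))    # ['incident-bridge', 'payments-pagerduty']
print(resolve(BROKEN, critical_payments))  # ['incident-bridge']
print(resolve(TREE, {"severity": "info"})) # ['catch-all']
```

The second result is the silent failure the route test exists to catch: the mirror still fires, the pager never does.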

Layer 3 — Live checks against a running cluster. Fire a synthetic alert end to end, inspect silences and inhibitions:

# Fire a synthetic alert and watch it route (against staging).
amtool alert add \
  --alertmanager.url=http://alertmanager:9093 \
  alertname=Synthetic severity=critical team=platform \
  --annotation=summary="routing smoke test"

# List currently-inhibited alerts (active but muted).
# Alert state is selected with a flag; a bare inhibited=true would be parsed as a label matcher.
amtool alert query --alertmanager.url=http://alertmanager:9093 --inhibited

The amtool subcommands you will use, mapped to the lifecycle stage they exercise:

Command	What it does	Lifecycle stage	When to run
`amtool check-config <file>`	Parse & validate the config	Load-time	CI, pre-commit
`amtool config routes show`	Render the routing tree as text	Route	Reviewing a tree
`amtool config routes test <labels>`	Show which receiver(s) a label set hits	Route	CI assertions, debugging misroutes
`amtool alert add <labels>`	Inject a synthetic alert	Fire → Notify	Smoke test staging
`amtool alert query [filters]`	List current alerts (e.g. `--inhibited`)	Inhibit / state	Debugging suppression
`amtool silence add <matchers>`	Create a silence	Silence	Maintenance windows
`amtool silence query` / `expire`	List / end silences	Silence	Ops hygiene

The CI gate wires layers 1 and 2 into your pipeline so a bad matcher fails the merge, not the on-call rotation:

# .github/workflows/alertmanager.yml
name: validate-alertmanager
on:
  pull_request:
    paths: ['alertmanager/**']
jobs:
  amtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install amtool
        run: |
          VER=0.28.0
          curl -sSL -o am.tgz \
            https://github.com/prometheus/alertmanager/releases/download/v${VER}/alertmanager-${VER}.linux-amd64.tar.gz
          tar -xzf am.tgz
          sudo mv alertmanager-${VER}.linux-amd64/amtool /usr/local/bin/
      - name: Validate config parses
        run: amtool check-config alertmanager/alertmanager.yml
      - name: Assert critical payments pages PagerDuty
        run: |
          OUT=$(amtool config routes test --config.file=alertmanager/alertmanager.yml \
                  severity=critical team=payments alertname=Synthetic)
          echo "$OUT" | grep -q payments-pagerduty \
            || { echo "REGRESSION: critical payments not paging!"; exit 1; }
      - name: Assert info alerts never reach a pager
        run: |
          OUT=$(amtool config routes test --config.file=alertmanager/alertmanager.yml \
                  severity=info team=platform alertname=Synthetic)
          # Fail the job if any pager receiver shows up for an info alert.
          if echo "$OUT" | grep -q pagerduty; then
            echo "REGRESSION: info alert reaching pager!"; exit 1
          fi

That grep assertion is the highest-leverage line in the whole pipeline: it encodes the invariant “critical payments alerts page payments” as an executable test. Add one per team, per routing guarantee you actually care about — one for “critical → pages,” one for “info → never pages,” one for “the mirror still fires.” When someone edits the tree six months from now, these tests fail the PR instead of the rotation.

Architecture at a glance

Picture the alert flow as a pipeline with two halves. On the left sit your Prometheus servers — two or three of them for HA, each independently evaluating the same alert rules against the same targets and, on every evaluation cycle, POSTing the full set of currently-firing alerts to every Alertmanager peer. There is no load balancer between Prometheus and Alertmanager; the fan-out is deliberate and total, because deduplication happens downstream, not upstream.

On the right sit the Alertmanager peers: a gossip cluster of two or three nodes connected on their cluster port (:9094), sharing silences and the notification log. Each peer runs the identical pipeline on the alerts it receives. An alert first hits the routing tree: it enters the root route and walks depth-first, checking matchers at each node, until it settles on the deepest matching route (or several, where continue: true lets it match siblings too). That resolves the receiver(s) and the effective timing (inherited down the tree). Next come the suppression gates, three of them, evaluated before anything notifies: is this alert inhibited by a currently-firing source (per inhibit_rules, scoped by equal, and computed locally by each peer from the alerts it holds)? Is it covered by an active silence? Is its route inside a mute time interval right now? An alert that trips any gate stays active but un-notified: visible in the UI and API, silent on the pager. An alert that passes joins its notification group (defined by group_by), waits group_wait if the group is new, and becomes eligible to notify.

The final stage is cluster deduplication and delivery. Before a peer sends a group’s notification, it waits a position-based offset (--cluster.peer-timeout × peer index) and checks the shared notification log; the peer at position 0 sends first and records it, and the others, seeing that record, suppress their copies. The winning peer renders the notification through the receiver’s Go template (title, per-alert lines, runbook links) and delivers it to PagerDuty, Opsgenie, Slack, email, or a webhook. While the group’s contents stay unchanged, the notification is re-sent every repeat_interval; when Prometheus reports the alerts as resolved (or, for a client that sets no end time, once resolve_timeout elapses without an update), a send_resolved notification closes the loop. Read left to right, the system is: fan-out from Prometheus → route → suppress → group → dedupe → notify → repeat → resolve, and every knob in this article plugs into exactly one of those arrows.

Real-world scenario

A payments platform team — call them Northwind Pay — ran three regional Kubernetes clusters (ap-south, eu-west, us-east) with a shared Prometheus + Alertmanager stack the central SRE team owned. They had a recurring 3am problem. Whenever a regional API gateway degraded, the blast radius lit up: the gateway alert paged, but so did roughly 30 downstream service alerts (elevated latency, retry storms, queue depth) plus a dozen synthetic-probe failures — all to the same PagerDuty service, all as separate incidents. On-call would acknowledge 30-plus incidents to find the one root cause. The noise had trained them to auto-ack, which is exactly how they missed a genuine database failover two weeks running: it arrived as incident number 34 in a wall of gateway-fan-out noise and got acked with the rest.

The constraint was organisational, not technical. The SRE team owned Alertmanager but could not change the alert rules — those lived in a dozen service-team repos and moved at the speed of a dozen backlogs. And they could not simply drop the downstream alerts, because in other failure modes (a slow database, a bad deploy in one service) those downstream alerts were the signal. The fix had to live entirely in Alertmanager, be provably correct, and ask as little of the service teams as possible.

They solved it with a scoped inhibition rule keyed on the region label, plus a continue-based mirror to a read-only incident channel for context, and locked the behaviour with a CI assertion. The only ask of the service teams was a single tier: downstream label on their alerts — cheap, reviewable, and enforced in each team’s own rule linting:

inhibit_rules:
  - source_matchers:
      - alertname = GatewayDegraded
      - severity = critical
    target_matchers:
      - tier = downstream            # the one label service teams agreed to add
    equal: ['region']                # never cross-suppress between regions

# CI guard so a future edit can't silently re-enable the 3am storm.
# The gateway itself must STILL page, and downstream must NOT, in the same region:
#   amtool config routes test … severity=critical alertname=GatewayDegraded region=eu-west
#     -> expects: payments-pagerduty

Post-change, a regional gateway outage produced one page (the gateway) with the 30 downstream alerts visible-but-muted in the Alertmanager UI for context. Mean time to acknowledge the correct incident dropped from minutes of triage to seconds, and the auto-ack habit died because the pages were trustworthy again. The database failover that had been missed twice now stood alone as the only page during its window. The load-bearing detail was equal: ['region']: an earlier draft without it had suppressed a genuine eu-west outage during an unrelated us-east incident, a real alert muted across regions, which is precisely the class of bug the CI gate now guards against. Total cost of the fix: one label, one inhibition rule, and two lines of CI.
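
Why equal: ['region'] was load-bearing is easiest to see by evaluating the rule with and without it. The Python sketch below is a simplified model, not Alertmanager's inhibitor: matchers are plain equality dicts, and `is_inhibited` is a hypothetical helper. The alerts mirror the scenario: a gateway outage in us-east and a downstream alert in each of two regions.

```python
def labels_match(matchers, labels):
    """True when every matcher equals the corresponding label on the alert."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """A target is muted if some firing source matches and agrees on every `equal` label."""
    if not labels_match(rule["target_matchers"], target):
        return False
    for source in firing:
        if source is target or not labels_match(rule["source_matchers"], source):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False


scoped = {
    "source_matchers": {"alertname": "GatewayDegraded", "severity": "critical"},
    "target_matchers": {"tier": "downstream"},
    "equal": ["region"],
}
unscoped = {**scoped, "equal": []}  # the earlier draft: no region scoping

gateway_us = {"alertname": "GatewayDegraded", "severity": "critical", "region": "us-east"}
latency_us = {"alertname": "HighLatency", "tier": "downstream", "region": "us-east"}
latency_eu = {"alertname": "HighLatency", "tier": "downstream", "region": "eu-west"}
firing = [gateway_us, latency_us, latency_eu]

print(is_inhibited(latency_us, firing, scoped))    # True: same region as the gateway outage
print(is_inhibited(latency_eu, firing, scoped))    # False: eu-west still pages
print(is_inhibited(latency_eu, firing, unscoped))  # True: the cross-region bug
```

Note that a route test exercises routing only; a check like this one (or a review rule that every inhibit rule carries an `equal` list) is what covers inhibition scope.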

Advantages and disadvantages

The routing-tree-plus-inhibition model is powerful and, in the same breath, easy to misconfigure into silence. Weigh it honestly:

Advantages (why this model works)	Disadvantages (why it bites)
One declarative tree expresses complex team/severity routing that you can review in a PR	The tree is walked depth-first with `continue`/ordering subtleties — a misordered sibling silently swallows alerts
Grouping collapses an alert storm into one actionable notification	Wrong `group_by` (high-cardinality label) recreates the storm; no error, just noise
Inhibition kills symptom fan-out so on-call sees the root cause	A missing `equal:` scope suppresses real alerts across regions/clusters — a silent outage
Time intervals + silences give both scheduled and ad-hoc muting	An over-long silence or a UTC-defaulted interval mutes a genuine incident
HA gossip cluster removes the notification SPOF and dedupes replicas	Misconfigured clustering (`--cluster.listen-address=` empty) yields duplicate pages from “one” deployment
`amtool` makes routing testable and CI-gateable	Without the CI gate, a typo’d matcher drops alerts with zero load-time error
`*_file` secrets keep integration keys out of git	Forget them and a routing key / webhook lands in source control — a real security incident
Templating produces rich, runbook-carrying pages	`.CommonLabels.<x>` renders empty when the group spans that label — a subtle “why is my title blank” bug

The model is right for any team past toy scale that wants routing as reviewable code rather than tribal knowledge, and it scales cleanly to dozens of teams via per-team subtrees. It bites hardest where the failure modes are silent: a dropped-alert misroute, a cross-region inhibition, an over-broad silence — none throw an error, all of them just quietly fail to page. Every one of those is preventable, but only if you know it exists and you gate the config in CI, which is the entire thesis of this article.

Hands-on lab

Run a real Alertmanager locally, prove a continue-based mirror routes correctly, inject an alert, apply a silence, and tear down. No cloud account needed; Docker is the only prerequisite.

Step 1 — Write a minimal but real alertmanager.yml.

mkdir -p ~/am-lab && cd ~/am-lab
cat > alertmanager.yml <<'YAML'
route:
  receiver: catch-all
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - matchers: ['severity = critical']
      receiver: incident-bridge
      continue: true
    - matchers: ['team = payments']
      receiver: payments-slack
      routes:
        - matchers: ['severity = critical']
          receiver: payments-pagerduty

inhibit_rules:
  - source_matchers: ['severity = critical']
    target_matchers: ['severity = warning']
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: catch-all
  - name: incident-bridge
  - name: payments-slack
  - name: payments-pagerduty
YAML

Step 2 — Validate it before running anything. If you have the amtool binary locally, run it directly; otherwise use the container image:

# The image's entrypoint is the alertmanager binary, so override it to run amtool.
docker run --rm --entrypoint amtool -v "$PWD/alertmanager.yml:/etc/am/am.yml" \
  prom/alertmanager:v0.28.0 check-config /etc/am/am.yml
# Expected: "SUCCESS", listing 1 route, 1 inhibit rule, 4 receivers.

Step 3 — Prove the routing before it ever runs. This is the assertion you would put in CI:

docker run --rm --entrypoint amtool -v "$PWD/alertmanager.yml:/etc/am/am.yml" \
  prom/alertmanager:v0.28.0 \
  config routes test --config.file=/etc/am/am.yml \
  severity=critical team=payments alertname=ApiDown
# Expected output (two receivers on one comma-separated line, proving continue works):
#   incident-bridge,payments-pagerduty

If you see only incident-bridge, your continue: true is missing — exactly the bug this test exists to catch.

Step 4 — Run Alertmanager and open the UI.

docker run -d --name am -p 9093:9093 \
  -v "$PWD/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager:v0.28.0
# Browse to http://localhost:9093  → the Alertmanager UI (Alerts / Silences / Status).

Step 5 — Inject two alerts and watch inhibition suppress the warning.

# A critical and a matching warning for the same service — the warning should be inhibited.
docker exec am amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=ApiDown severity=critical cluster=ap-south service=checkout \
  --annotation=summary="checkout API down"

docker exec am amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=ApiDown severity=warning cluster=ap-south service=checkout \
  --annotation=summary="checkout API elevated errors"

# List inhibited alerts: the warning should appear here.
docker exec am amtool alert query --alertmanager.url=http://localhost:9093 --inhibited
# Expected: the severity=warning ApiDown alert is listed (active, muted by the critical).

Step 6 — Apply and inspect a silence.

docker exec am amtool silence add --alertmanager.url=http://localhost:9093 \
  --author="lab" --comment="deploy window" --duration=30m \
  service=checkout

docker exec am amtool silence query --alertmanager.url=http://localhost:9093
# Expected: one active silence matching service=checkout, expiring in ~30m.

Validation checklist. You validated a config before running it, proved a continue-based mirror sends to two receivers with amtool config routes test, saw an inhibition rule mute a warning while its critical fired, and created a runtime silence. Each step maps to a lifecycle stage:

Step	What you did	Lifecycle stage it proves	Real-world analogue
2	`check-config`	Load-time validation	The CI parse gate
3	`routes test` shows two receivers	Route + `continue`	The per-team CI assertion
5	Warning shown as inhibited	Inhibition (with `equal` scope)	Killing symptom fan-out
6	Silence created & listed	Silence (runtime mute)	The deploy-window mute

Teardown.

docker rm -f am
rm -rf ~/am-lab

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable symptom→cause→confirm→fix table, then the reasoning underneath the ones that bite hardest.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	An alert never pages, no error anywhere	Misordered/typo’d matcher; a general sibling swallowed it, or it hit the catch-all	`amtool config routes test <labels>` shows the wrong/`catch-all` receiver	Reorder siblings (specific above general); fix the matcher; add a CI assertion
2	Critical alert hits the incident channel but doesn’t page the team	`continue: true` missing on the mirror route, so the alert stopped at the first match	`amtool config routes test` returns only the mirror receiver	Add `continue: true` to the mirror route
3	Storm of separate notifications for one incident	`group_by` includes a high-cardinality label (e.g. `instance`), or is `['...']`	Inspect route’s `group_by`; count notifications vs alerts	Group by `['alertname','cluster','service']`; remove `instance`
4	A real alert in one region is silently muted	Inhibition rule missing `equal:` (or with too few labels in `equal`) → global suppression	`amtool alert query --inhibited` shows the real alert; read the `inhibit_rules`	Add `equal:` or add the scoping label to it (`region`, `cluster`)
5	Duplicate pages from an HA deployment	Clustering disabled (`--cluster.listen-address=` empty) or peers not connected	`alertmanager_cluster_members` ≠ replica count on some pod	Fix cluster flags; verify metric equals replicas on every peer
6	Info/low-priority alerts page at 3am	Route missing `mute_time_intervals`, or interval has no `location` (defaulted to UTC)	Read the route; check the `time_intervals` `location`	Add `mute_time_intervals`; set IANA `location`
7	Migrated config drops alerts after `match`→`matchers` change	A matcher mis-parsed (regex anchoring, unquoted metachar)	`amtool check-config`; `amtool config routes test` on a known label set	Quote values; remember regexes are anchored; re-test
8	Notification title/field renders empty	`.CommonLabels.<x>` used where the group spans that label	Look at the template; check if the group contains >1 value for that label	Use `.GroupLabels`/range per-alert; only `.CommonLabels` for truly-common labels
9	No “resolved” message ever arrives	`send_resolved: false` (default on some integrations) or alert never resolves	Check receiver config; `alertmanager_alerts{state="active"}` stays high	Set `send_resolved: true`; verify Prometheus stops sending
10	Secret leaked in git / rejected by scanner	Integration key/URL inlined instead of `*_file`	`grep -R 'routing_key\|api_url\|api_key' alertmanager.yml` finds a literal	Move to `routing_key_file`/`api_url_file`; rotate the leaked key
11	A silence isn’t muting the alert	Silence matcher doesn’t match the alert’s actual labels	Compare silence matchers to the alert’s labels in the UI	Fix the matchers to the real label values; every matcher in a silence must match
12	Every re-page comes minutes apart on an open incident	`repeat_interval` set too low (or confused with `group_interval`)	Read the route’s `repeat_interval`	Raise `repeat_interval` to hours; understand it re-sends unchanged groups
13	Alerts fire in Prometheus but Alertmanager shows nothing	Prometheus not sending, or sending to the wrong/one AM	Prometheus `/status` → runtime; `alertmanager_alerts` flat on AM	Fix `alerting.alertmanagers` to list all peers
14	Time-interval mute never triggers (or always does)	Wrong `end_time` (`'00:00'` vs `'24:00'`) or timezone confusion	Read the interval; reason in the configured `location`	Use `'24:00'` for midnight-end; set the right IANA `location`

The expanded reasoning for the entries that cause the most 3am confusion:

1. An alert never pages and nothing errors. The most dangerous failure because it is silent. Either a general sibling route sits above the specific one and swallows the alert (default stop-at-first-match), or a matcher has a typo (team = payemnts) so the specific route never matches and the alert lands on the catch-all. Confirm: amtool config routes test <the alert's labels> and read which receiver comes back — if it’s catch-all or the wrong team, that’s your bug. Fix: put specific routes above general ones, correct the matcher, and add a CI assertion so it can’t regress.

2. The mirror fires but the team isn’t paged. You built a continue: true mirror to an incident channel, but forgot the continue, so the alert stopped at the mirror and never fell through to the team’s paging subtree. Confirm: amtool config routes test returns only the mirror receiver, not the pager. Fix: add continue: true to the mirror route — and test that the pager still appears.

4. A real alert is silently muted by inhibition. An inhibition rule without an equal: scope (or with too few labels in it) suppresses globally: a critical in one region mutes warnings — or worse, criticals — everywhere. Confirm: amtool alert query inhibited=true lists a real alert you expected to page; read the inhibit_rules and find the one whose source_matchers are firing. Fix: add or narrow equal: to the scoping label (region, cluster, service) so suppression only applies within the same scope.

5. Duplicate pages from “one” HA deployment. Two replicas that are not actually clustered each send their own notification. Almost always --cluster.listen-address= was set empty (disabling clustering) or the peers never settled. Confirm: curl .../metrics | grep alertmanager_cluster_members on every pod — it must equal the replica count everywhere; a pod reporting 1 is un-clustered. Fix: correct the cluster flags (or the Operator CRD replicas), and verify the metric on all peers.

8. A template field renders blank. You wrote {{ .CommonLabels.service }} in a title, but the notification group spans multiple services, so service is not a common label and renders empty. .CommonLabels only contains labels identical across every alert in the group. Confirm: check whether the group’s group_by (or its contents) includes more than one value for that label. Fix: use .GroupLabels for group-defining labels, range over .Alerts for per-alert detail, and reserve .CommonLabels for labels you know are shared.
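
Entries 1 and 2 both come down to sibling order and continue. A minimal sketch of a tree that gets both right, assuming hypothetical receiver names (catch-all, incident-slack, payments-slack, payments-pagerduty) and a team label value of payments; the timing values are illustrative, not a recommendation:

```yaml
# alertmanager.yml (fragment). Receiver names and label values are hypothetical.
route:
  receiver: catch-all              # anything no child matches lands here
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Mirror route: every critical is also posted to the incident channel.
    # continue: true lets the alert keep matching the siblings below;
    # without it the alert would stop here and never reach the pager.
    - matchers:
        - severity = "critical"
      receiver: incident-slack
      continue: true

    # Team subtree: the specific child sits inside the team route, so the
    # team-wide receiver is only the fallback when no child matches.
    - matchers:
        - team = "payments"
      receiver: payments-slack
      routes:
        - matchers:
            - severity =~ "critical|page"   # anchored: whole values only
          receiver: payments-pagerduty
          group_wait: 10s

receivers:
  - name: catch-all
  - name: incident-slack
  - name: payments-slack
  - name: payments-pagerduty
```

Running amtool config routes test --config.file=alertmanager.yml severity=critical team=payments against this fragment should list both incident-slack and payments-pagerduty; if only the mirror comes back, the continue is missing.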

Best practices

Set sane defaults at the root, override sparingly. Put group_by, group_wait, group_interval, and repeat_interval on the root route; children inherit them. Only override where a severity or team genuinely needs different behaviour (critical → group_wait: 10s; noisy info → repeat_interval: 24h). This keeps the tree short and consistent.
One top-level subtree per team, keyed on an enforced team label. Make every Prometheus alert rule carry team: <name> and lint for it. Teams own their subtree; you own the root. Merge conflicts stay local, and ownership is obvious in review.
Order specific routes above general ones, and be deliberate about continue. Default is stop-at-first-match; a general sibling above a specific one silently swallows alerts. Use continue: true intentionally for mirrors, and test that the downstream page still fires.
Always scope inhibition with equal:. Never ship an inhibit_rule without an equal list containing the region/cluster/service that must match. This one line is the difference between “suppress the symptoms of this outage” and “silently mute a real alert in another region.”
Group by identity, never by high cardinality. ['alertname','cluster','service'] is a good default. Never group by instance, pod, or anything per-host — you recreate the storm you were trying to prevent.
Migrate to matchers, quote regexes, and remember anchoring. Drop match/match_re. Quote any value with metacharacters or spaces. Regexes are fully anchored, so =~ "warn" does not match warning. PromQL anchors its regex matchers the same way, but the habit of grep-style partial matching still makes this a frequent bug.
Keep secrets in *_file variants; never inline a key or webhook. Mount integration keys from a Secret/Vault. Grep the committed config for literal routing_key/api_url/api_key values in CI and fail the build if any appear.
Turn on send_resolved: true everywhere it makes sense. Slack and email default to false, so set it explicitly there; PagerDuty, Opsgenie, and webhook already default to true. On-call should get automatic recovery messages, not have to poll.
Use explicit IANA location on every time interval. The default is UTC. A quiet-hours window without a location mutes at the wrong local time and drifts across DST. Use '24:00' (not '00:00') for a midnight end.
Run three Alertmanager replicas; point every Prometheus at all of them. Never load-balance Prometheus across peers. Verify alertmanager_cluster_members equals the replica count on every peer, and expect brief duplicates during a partition (by design).
Gate the config in CI. amtool check-config plus one amtool config routes test assertion per routing guarantee you care about (critical pages, info never pages, mirrors fire). A typo’d matcher should fail the PR, not the rotation.
Prefer silences with tight durations and clear comments. A silence needs an author and a reason; keep durations to the actual maintenance window. Audit long-lived silences regularly — a 30-day silence set “temporarily” is a real-incident-in-waiting.
Alert on Alertmanager itself. Watch alertmanager_notifications_failed_total (a page that didn’t deliver), alertmanager_cluster_members (peers dropping), and run a Watchdog / dead-man’s-switch alert that always fires and pages a separate system if it stops — that is how you detect the entire pipeline being down.
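
The Watchdog practice above can be sketched as a Prometheus rule plus a dedicated Alertmanager route. This is an illustration under assumptions: the group name, the deadman-heartbeat receiver, and the heartbeat URL are hypothetical, and the intervals should be tuned to whatever your external heartbeat service expects.

```yaml
# --- Prometheus rule file (fragment) ---
groups:
  - name: meta-monitoring            # hypothetical group name
    rules:
      - alert: Watchdog
        expr: vector(1)              # always true, so the alert always fires
        labels:
          severity: none
        annotations:
          summary: Heartbeat proving the alerting pipeline is alive.
---
# --- alertmanager.yml (fragment) ---
route:
  receiver: catch-all
  routes:
    - matchers:
        - alertname = "Watchdog"
      receiver: deadman-heartbeat
      group_wait: 0s                 # send the first heartbeat immediately
      group_interval: 1m
      repeat_interval: 5m            # re-send the unchanged group as a ping
receivers:
  - name: catch-all
  - name: deadman-heartbeat
    webhook_configs:
      # Hypothetical endpoint of an external heartbeat service that pages
      # through a separate path when the pings stop arriving.
      - url: https://heartbeat.example.invalid/ping/alertmanager
        send_resolved: false
```

The point of the design is that the paging path for a missing heartbeat lives outside this Prometheus and Alertmanager pair, so it still works when both are down.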

Security notes

Never commit integration secrets. PagerDuty routing keys, Opsgenie API keys, and Slack incoming-webhook URLs are credentials — a leaked Slack webhook lets anyone spam that channel; a leaked routing key lets anyone spoof or suppress pages. Use the *_file variants and mount from a Kubernetes Secret (ideally sealed / external-secrets) or Vault. Rotate immediately if one lands in git history.
Lock down the Alertmanager API and UI. The API can create silences — anyone who can reach it can mute your production alerts. Put Alertmanager behind an authenticating reverse proxy or network policy; do not expose :9093 to the internet or to an untrusted internal network. Silence creation is an audited, powerful action.
Protect the gossip port. The cluster port (:9094) exchanges state between peers; restrict it to the peers themselves with a network policy or security group. An attacker on the gossip mesh could inject silences or corrupt notification state.
Least privilege for the receivers’ credentials. A PagerDuty integration (Events API v2) routing key is scoped to one service — prefer per-service keys over an account-wide token, so a leak blasts one service, not all of them. Same for Opsgenie API integrations.
Do not leak internal topology in notifications. Templates that dump every label can expose internal hostnames, cluster names, and service topology to a chat channel or a third-party paging tool. Include what on-call needs (service, severity, runbook) and omit what an attacker would find useful.
Treat silences as an audit surface. Every silence has an author and a comment — enforce that they are meaningful, and review long-lived silences. A silence is an intentional blind spot; unexplained ones are a smell.
Pin and verify the Alertmanager image. Run a known version (0.27/0.28), pin by digest in production, and pull from a trusted registry — Alertmanager holds your paging credentials in memory.

The security controls mapped to what they protect and the incident they prevent:

Control	Mechanism	Protects against	Also prevents
`*_file` secrets	`routing_key_file`, `api_url_file`	Keys in source control	Credential-scanner build failures
Authenticated API/UI	Reverse proxy + authn / NetworkPolicy	Unauthorised silence creation	Malicious muting of prod alerts
Gossip port restriction	NetworkPolicy / SG on `:9094`	State injection on the mesh	Silence/notification-log tampering
Per-service routing keys	Scoped PD/Opsgenie integrations	Blast radius of one leaked key	Account-wide page spoofing
Minimal templates	Curated notification fields	Topology disclosure to chat/3rd-party	Recon via alert payloads
Silence audit	Enforced author + comment; review	Unexplained blind spots	“Temporary” silences muting real incidents
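
A minimal sketch of the first and fifth controls together: secrets read from mounted files, and a curated template rather than a full label dump. The mount paths, receiver names, and channel are assumptions for illustration.

```yaml
# alertmanager.yml (fragment). Paths, names, and channel are hypothetical.
receivers:
  - name: payments-pagerduty
    pagerduty_configs:
      # The key is read from a mounted Secret, never inlined in git.
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-payments-routing-key

  - name: payments-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack-payments-webhook-url
        channel: '#payments-alerts'
        send_resolved: true
        # .GroupLabels is safe in a title: it holds the group_by labels,
        # assuming alertname is one of them.
        title: '{{ .GroupLabels.alertname }} ({{ .Status }})'
        # Per-alert detail comes from ranging over .Alerts; only the
        # fields on-call needs are rendered.
        text: '{{ range .Alerts }}{{ .Labels.service }}: {{ .Annotations.summary }}{{ "\n" }}{{ end }}'
```

Ranging over .Alerts also avoids the blank-field problem with .CommonLabels when a group spans several services.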

Cost & sizing

Alertmanager itself is free (open-source) and cheap to run — the cost is in the destinations and the operational discipline, not the binary. What actually drives spend and how to size it:

Compute footprint is tiny. Alertmanager is a small Go binary; three replicas each need on the order of 128–256 MB RAM and a fraction of a CPU for typical alert volumes (thousands of active alerts). It is essentially free next to Prometheus. Storage (--storage.path) holds silences and the notification log — megabytes, not gigabytes.
The bill is downstream. PagerDuty and Opsgenie charge per user (seat), roughly a few thousand rupees per responder per month depending on tier, not per alert. That is the real cost lever, and it is a strong argument for inhibition and grouping: fewer, higher-quality pages do not lower the per-seat price, but noisy pages that push you to add responders (or lose them to burnout) raise the total.
Grouping and inhibition are cost controls, not just noise controls. A storm that pages 400 times still costs the same per-seat, but the human cost — burnout, attrition, and the on-call premium you pay to retain people — is real. Collapsing 400 pages into one is a retention-and-cost lever as much as a hygiene one.
Slack/email are effectively free for alert volumes; the constraint there is channel noise, not money.
HA replicas cost three small pods. Running three Alertmanagers instead of one roughly triples a negligible number — still negligible. There is no reason to run one to “save money”; the SPOF risk dwarfs the pod cost.

A rough monthly picture for a mid-size team, in INR:

Cost driver	What you pay for	Rough INR / month	Sized by	Watch-out
Alertmanager compute (3 replicas)	~256 MB + fractional CPU each	~₹1,500–3,000 (on shared k8s)	Alert volume (tiny)	Negligible; don’t skimp to one replica
PagerDuty / Opsgenie seats	Per responder	~₹1,500–4,000 per responder	Number of on-call people	The real cost — noise pushes you to add seats
Slack / email	Incoming webhooks / SMTP	~₹0 (existing tooling)	Channel volume	Noise, not cost
Persistent storage	Silences + notification log	~₹0 (megabytes)	N/A	Trivial
Engineering time	Config, CI gate, on-call tuning	The dominant real cost	Team maturity	Under-investing here costs far more in missed/noisy pages

The honest takeaway: the software is free and the compute is nearly free; you are really paying for responder seats and engineering discipline. Every rupee spent on grouping, inhibition, and CI-tested routing pays back in fewer responders burned out and fewer incidents missed — which is a far larger number than the Alertmanager pods will ever cost.

Interview & exam questions

1. Explain the difference between group_wait, group_interval, and repeat_interval. group_wait is the delay before the first notification for a new group, so sibling alerts can join it (typically 30s). group_interval is the minimum gap before sending an updated notification for a group that already fired because its contents changed (a new alert joined or one resolved; typically 5m). repeat_interval is how long before re-sending a notification for a group whose contents have not changed — the “still open” reminder (typically hours). Confusing group_interval with repeat_interval is the classic mistake.

2. What does continue: true do, and give a case where forgetting it is dangerous. By default an alert stops at the first matching sibling route. continue: true makes it also evaluate later siblings. The dangerous case: a top-level “critical → incident channel” mirror route without continue means a critical alert stops there and never falls through to the team’s paging subtree, so the incident is posted to a channel but nobody is paged. Always run amtool config routes test to confirm the pager still appears.

3. Why is the equal: list critical in an inhibition rule? Inhibition is evaluated globally, so without equal: a source alert suppresses every matching target regardless of scope — a critical in us-east would mute warnings (or criticals) in us-west. equal: restricts suppression to targets whose listed label values match the source’s (e.g. equal: ['region']), so a source only inhibits within the same region/cluster/service. Omitting it is the most common way inhibition silently swallows real alerts.

4. Distinguish a silence from a time interval. A silence is imperative, created at runtime (UI/API/amtool) with a matcher and a fixed duration for a maintenance window or known-noisy alert; it lives in Alertmanager state and expires. A time interval is declarative, defined in the config and referenced from routes (mute_time_intervals/active_time_intervals) to mute or allow notifications on a recurring schedule (quiet hours, weekends); it never expires. Silences = “shut this up now”; time intervals = “shut this up every night.”

5. Why do you run more than one Alertmanager, and why does every Prometheus target all of them? A single Alertmanager is a SPOF for notification: if it’s down, nobody is paged even though Prometheus is fine. You run 2–3 in a gossip cluster that shares silences and the notification log and deduplicates so one notification goes out per group. Alerts themselves are not gossiped, so every Prometheus sends every alert to all peers (no load balancer), and each peer evaluates routing and inhibition on the same set of alerts. Sending to one peer reintroduces the SPOF, and load-balancing leaves peers with different views of the alerts, which breaks the dedup coordination.

6. How does Alertmanager deduplicate notifications across HA peers? Before sending a group’s notification, each peer waits a position-based offset (--cluster.peer-timeout, default 15s, times its index in the sorted membership) and checks the shared notification log. The peer at position 0 sends first and records it; the others see that record before their timer elapses and suppress their copy. During a partition, peers can’t see each other’s records, so brief duplicates occur — deliberately, since a duplicate page is safer than a missed one.

7. You migrated match_re: {severity: "critical|page"} to a matcher and alerts stopped routing. What’s the likely bug? Matcher regexes are fully anchored, and a value containing a pipe is safest quoted: severity =~ "critical|page". If you wrote it unquoted, used = where the old match_re needs =~, or assumed grep-style partial matching (PromQL anchors its regexes too, so this is not a PromQL difference), the matcher ends up as something that doesn’t match your alerts. Confirm with amtool check-config and amtool config routes test on a known label set; a matcher that is valid but wrong throws no load-time error, it just drops alerts.

8. Your template shows a blank service in the title. Why? The title uses .CommonLabels.service, but the notification group spans multiple services, so service is not common to every alert in the group and renders empty. .CommonLabels holds only labels identical across the whole group. Use .GroupLabels for the grouping labels or range over .Alerts for per-alert detail; reserve .CommonLabels for labels you know are shared.

9. An HA deployment sends duplicate pages. First thing you check? Whether clustering is actually enabled and settled. --cluster.listen-address= set empty disables clustering, turning “replicas” into independent instances that each notify. Run curl .../metrics | grep alertmanager_cluster_members on every pod — it must equal the replica count everywhere; a pod reporting fewer members is un-clustered. Fix the cluster flags (or Operator CRD) and re-verify on all peers.

10. How would you make routing changes safe to merge? Gate them in CI: amtool check-config to catch parse/matcher errors, plus amtool config routes test <labels> assertions that encode each routing invariant you care about (“critical payments → pages payments”, “info → never pages”). A grep on the output fails the PR if the invariant breaks. This turns silent misroutes into failing builds instead of failing rotations.

11. What is a dead-man’s-switch (Watchdog) alert and why does it matter for Alertmanager? It’s an alert engineered to always fire (a rule with a constantly-true expression) and route to a special receiver that pings an external heartbeat system. If the whole monitoring pipeline — Prometheus, Alertmanager, or the network between them — goes down, the Watchdog stops arriving, and the external system pages you about the absence. It’s how you detect that your alerting itself is dead, which no internal alert can tell you.

12. When would you set group_by: ['...'] and what’s the risk? The literal ['...'] disables aggregation — every distinct alert becomes its own group and its own notification. Use it only on a narrow route where you genuinely want one page per alert (e.g. a deduped webhook consumer that handles its own grouping). The risk is obvious: applied broadly, it recreates the alert storm that grouping exists to prevent.

These map most directly to the Prometheus Certified Associate (PCA), where alerting and Alertmanager configuration are a core domain, and to SRE / observability interview loops generally. The HA, dedup, and CI-testing themes are exactly what distinguish an engineer who has operated Alertmanager from one who has only read its docs.

Question theme	Where it’s tested	Depth signal
Timing knobs (`group_wait` vs `repeat_interval`)	PCA; SRE interviews	Fundamentals
`continue` and routing order	PCA; SRE interviews	Have you debugged a misroute
Inhibition `equal` scoping	SRE interviews	Have you caused/fixed a cross-region mute
Silences vs time intervals	PCA	Conceptual clarity
HA / dedup / gossip	SRE / platform interviews	Have you run it in production
`amtool` + CI gating	Platform / DevOps interviews	Do you treat alerting as code
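
For the HA questions above (5, 6, and 9), the two pieces of wiring can be sketched as follows. The hostnames assume a three-replica StatefulSet named alertmanager behind a headless service in a monitoring namespace; those names are hypothetical, while the flags are the ones discussed in the answers.

```yaml
# --- prometheus.yml (fragment) ---
# Every peer is listed directly; there is no load balancer in between.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0.alertmanager.monitoring.svc:9093
            - alertmanager-1.alertmanager.monitoring.svc:9093
            - alertmanager-2.alertmanager.monitoring.svc:9093
---
# --- Alertmanager container args in the pod spec (fragment) ---
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --storage.path=/alertmanager
  # A non-empty listen address keeps clustering on; an empty value disables it.
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-0.alertmanager.monitoring.svc:9094
  - --cluster.peer=alertmanager-1.alertmanager.monitoring.svc:9094
  - --cluster.peer=alertmanager-2.alertmanager.monitoring.svc:9094
  # Each peer waits its position in the membership times this value
  # before sending, which is what makes the dedup work.
  - --cluster.peer-timeout=15s
```

With this in place, alertmanager_cluster_members should report 3 on every pod; any pod reporting less is the first thing to investigate.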

Quick check

1. A critical alert is posted to your incident Slack channel but the on-call engineer is never paged. What’s the single most likely config mistake, and how do you confirm it in one command?
2. You add an inhibition rule to suppress downstream service alerts when a gateway is down, and a week later a real outage in another region goes unpaged. What line did you forget?
3. True or false: to run Alertmanager in HA you should put the peers behind a load balancer and point Prometheus at the load balancer’s address.
4. Your low-priority info alerts are paging at 3am despite a mute_time_intervals reference. Name the two most likely causes.
5. What is the difference between group_interval and repeat_interval?

Answers

1. The mirror route is missing continue: true, so the alert stopped at the incident-channel route and never fell through to the team’s paging subtree. Confirm with amtool config routes test severity=critical team=<team> alertname=<x>; it will return only the Slack receiver, not the pager. Add continue: true and re-test.
2. The equal: list on the inhibition rule (e.g. equal: ['region']). Without it, inhibition is global: the gateway-down source in one region suppressed matching targets everywhere, including a genuine critical in another region. Scope suppression with equal to the label that must match.
3. False. Each Prometheus must send to all peers directly (list every peer in alerting.alertmanagers), and the peers gossip to deduplicate. A load balancer sends a given alert to only one peer, which reintroduces the SPOF and breaks the cluster’s dedup coordination.
4. (a) The time interval has no location, so it defaults to UTC and the “quiet hours” window is at the wrong local time. (b) The end_time is wrong (e.g. '00:00' instead of '24:00' for a midnight end), so the window doesn’t cover the hours you think. Read the interval and reason in the configured timezone; fix location and use '24:00' for midnight.
5. group_interval is the minimum gap before sending an updated notification for a group whose contents changed (a new alert joined, or one resolved), typically 5m. repeat_interval is how long before re-sending a notification for a group whose contents have not changed (the “still open” reminder), typically hours. One reacts to change; the other reminds about no-change.
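
The fixes behind the inhibition and quiet-hours answers can be sketched in one fragment. The alert name GatewayDown, the receiver names, and the 22:00 to 07:00 window are assumptions for illustration; equal, location, and the '24:00' end time are the parts that matter.

```yaml
# alertmanager.yml (fragment). Alert, receiver, and interval names are hypothetical.
inhibit_rules:
  - source_matchers:
      - alertname = "GatewayDown"
    target_matchers:
      - severity = "warning"
    # Suppress only targets that share the source's region and cluster.
    # Without this list the rule applies across every region.
    equal: ['region', 'cluster']

time_intervals:
  - name: quiet-hours-ist
    time_intervals:
      # A range cannot wrap past midnight, so an overnight window is
      # written as two ranges; '24:00' marks the midnight end.
      - times:
          - start_time: '22:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '07:00'
        location: 'Asia/Kolkata'     # without this the window is read as UTC

route:
  receiver: catch-all
  routes:
    - matchers:
        - severity = "info"
      receiver: info-slack
      mute_time_intervals: ['quiet-hours-ist']

receivers:
  - name: catch-all
  - name: info-slack
```

Info alerts still group and resolve during the muted window; only the notifications are held back.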

Glossary

Alert — a firing condition from Prometheus, carrying labels (identity) and annotations (context), sent to Alertmanager over HTTP and held in its state until resolved.
Route — a node in the routing tree: a set of matchers, timing knobs, a receiver, and optional child routes; an alert descends the tree to the deepest matching route.
Matcher — a label predicate using =, !=, =~, or !~; the current syntax replacing the deprecated match/match_re. Regexes are fully anchored.
continue — a per-route flag; true keeps evaluating sibling routes after a match, enabling one alert to reach multiple receivers.
group_by — the labels defining a notification group; alerts sharing these values batch into one notification. ['...'] disables grouping.
group_wait — delay before the first notification of a new group, so sibling alerts can join it (default 30s).
group_interval — minimum time before an updated notification for a group whose contents changed (default 5m).
repeat_interval — time before re-sending a notification for an unchanged firing group (the “still open” reminder; default 4h).
Receiver — a named bundle of notification integrations (pagerduty_configs, slack_configs, etc.); an alert routed to it notifies through every integration in it.
Inhibition rule — suppresses target alerts (target_matchers) while a source alert (source_matchers) fires, scoped by equal.
equal — the list of labels a target must share with the source for inhibition to apply; the safety rail that prevents cross-scope suppression.
Silence — a runtime, expiring mute created via UI/API/amtool with a matcher and duration; stored in Alertmanager state, gossiped across the cluster.
Time interval — an in-config recurring schedule (time_intervals) referenced from routes via mute_time_intervals (suppress during) or active_time_intervals (only notify during).
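resolve_timeout — how long Alertmanager waits without updates before marking an alert resolved when the alert carries no end time (default 5m); alerts from Prometheus always include one, so this setting rarely affects them.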
send_resolved — a per-integration flag; true sends a notification when the group’s alerts resolve.
Gossip cluster — two or more Alertmanager peers connected on the cluster port (:9094) sharing silences and notification-log state, and deduplicating notifications.
Deduplication — the cluster mechanism (position-based --cluster.peer-timeout offset + shared notification log) ensuring one notification per group despite every peer receiving the alert.
Watchdog / dead-man’s-switch — an always-firing alert routed to an external heartbeat system; its absence signals the monitoring pipeline is down.
amtool — Alertmanager’s CLI: check-config, config routes test/show, alert add/query, silence add/query/expire.

Next steps

You can now design a routing tree, suppress noise safely, run Alertmanager in HA, and gate the config in CI. Build outward:

Next: Building an On-Call Practice: PagerDuty Escalation, Alert Routing, and Actionable Runbooks — the human side Alertmanager feeds: escalation policies, schedules, and runbooks.
Related: SLOs and Error Budgets in Practice: Defining SLIs and Building Multi-Window Burn-Rate Alerts — author the alerts Alertmanager routes, driven by error budgets rather than static thresholds.
Related: PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work — write the expressions and labels that make routing and inhibition possible.
Related: Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus — control the labels upstream so grouping and matchers stay sane.
Related: Scaling Prometheus: Recording Rules, Remote-Write, and Long-Term Storage with Thanos and Mimir — scale the metrics pipeline that feeds your alerting.
Related: SLOs as Code: Authoring SLIs with OpenSLO and Generating Burn-Rate Alerts via Sloth and Pyrra — generate the burn-rate alert rules that land in this routing tree.