Hand-written burn-rate alerts are a maintenance liability. Google’s multi-window multi-burn-rate recipe is four alert rules per objective, each depending on several recording rules across four window lengths, and every one of them has to recompute the right threshold from the SLO target. Author that by hand for one service and it is a fun afternoon; author it for sixty services and you have a pile of copy-paste with subtle per-service bugs that nobody finds until the page fails to fire.
The fix is to treat the objective as the source of truth and let a tool emit the PromQL. This article builds that pipeline: describe SLIs and targets once in the vendor-neutral OpenSLO spec, compile them into Prometheus recording and burn-rate rules with Sloth, run Pyrra for a live error-budget UI and an alternative rule generator, wire the alerts into Alertmanager, and gate all of it behind CI. The assumption throughout is Prometheus (or a compatible store like Mimir/Thanos) with the Prometheus Operator, Sloth v0.11+, and Pyrra v0.7+.
1. Choose SLIs that move when users hurt
An SLI is a ratio of good events over valid events. The discipline is entirely in the nouns, and three indicator families cover almost every request-driven service:
- Availability - fraction of valid requests served without a server-side error. good = requests that are not 5xx, valid = all requests that reached the service.
- Latency - fraction of valid requests faster than a threshold T. This is a count-based ratio, not a percentile: good = requests with duration <= T, computed from the histogram bucket at le="T". Never build a latency SLI on an average; the mean hides the tail where users actually suffer.
- Quality / correctness - fraction of requests that returned a usable answer. For a recommendation API that degrades to an empty list under load, a 200 with no results is a failure the availability SLI cannot see. Emit an explicit result="degraded" label and count it as bad.
Two rules keep an SLI honest:
- Measure as close to the user as you own the signal. The ingress or load balancer sees what the client experiences; the app’s own histogram only sees requests that already arrived. Prefer the edge for the user-facing SLO and keep deeper signals for debugging.
- Exclude what is not yours. A 400 is the client’s fault - do not count 4xx as failures, or you burn budget for malformed requests; dropping them from the valid-request total as well keeps them out of the ratio entirely. Drop health checks and synthetics entirely; they inflate the denominator and mask real pain.
If you can imagine the SLI looking healthy while users are unhappy - or the reverse - the indicator is wrong. Fix it before you argue about the target.
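As a rough sketch of the accounting rules above, the classification can be written out per request. The Request type, its field names, and the is_probe flag are hypothetical, invented for illustration; in practice this logic lives in PromQL label matchers rather than in application code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    """One observed request. All fields are hypothetical, for illustration only."""
    status: int
    is_probe: bool = False  # health check or synthetic traffic


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Return good / valid, or None when there were no valid requests."""
    valid = good = 0
    for r in requests:
        # Exclude what is not yours: probes and client errors are not valid events.
        if r.is_probe or 400 <= r.status < 500:
            continue
        valid += 1
        # Good means anything that is not a server-side error.
        if r.status < 500:
            good += 1
    return good / valid if valid else None


sample = [
    Request(200), Request(200), Request(200),
    Request(503),                 # server error: valid and bad
    Request(400),                 # client error: excluded
    Request(200, is_probe=True),  # health check: excluded
]
print(availability_sli(sample))  # 3 good out of 4 valid requests
```

Returning None for an empty denominator is a deliberate choice here: "no traffic" is not the same thing as "100% available", and the caller should decide how to treat it.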
2. Model objectives, windows, and targets in OpenSLO
OpenSLO is a vendor-neutral YAML spec for SLOs. The value is decoupling: the SLI and target live in one document, independent of whichever engine compiles them. An SLI can be defined inline or as a reusable kind: SLI referenced by many SLOs.
Define the availability SLI once as its own object:
# slis/checkout-availability.openslo.yaml
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  description: Fraction of checkout requests not served as 5xx, measured at ingress.
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(nginx_ingress_controller_requests{service="checkout", status!~"5.."}[1m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(nginx_ingress_controller_requests{service="checkout"}[1m]))
Then bind it to a target and window in an SLO object. The budgetingMethod: Occurrences consumes budget per bad event (the right default for request SLIs); Timeslices consumes per bad window and is for time-based services. The window is a rolling 28 days - long enough that a single failure does not blow a tight target, short enough to stay relevant.
# slos/checkout-availability.openslo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  service: checkout
  indicatorRef: checkout-availability
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: 99.9% of checkout requests succeed
      target: 0.999
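The difference between the two budgeting methods is easiest to see on numbers. The sketch below is illustrative only: the per-minute counts are made up, the function names are mine, and the slice_target parameter stands in for what OpenSLO calls timeSliceTarget.

```python
from typing import List, Tuple

# (good, total) request counts per one-minute slice; invented sample data.
Slice = Tuple[int, int]


def occurrences_compliance(slices: List[Slice]) -> float:
    """Occurrences: every request counts equally, regardless of when it happened."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total


def timeslices_compliance(slices: List[Slice], slice_target: float) -> float:
    """Timeslices: a slice is good when its own ratio meets slice_target.

    Compliance is the fraction of good slices. Assumes at least one slice;
    slices without traffic are counted as bad, a simplifying assumption.
    """
    good_slices = sum(1 for g, t in slices if t > 0 and g / t >= slice_target)
    return good_slices / len(slices)


# Nine quiet minutes and one busy minute in which half the requests fail.
minutes = [(10, 10)] * 9 + [(500, 1000)]

print(occurrences_compliance(minutes))       # dominated by the busy minute
print(timeslices_compliance(minutes, 0.95))  # one bad slice out of ten
```

The same incident looks far worse under Occurrences because the failing minute carried most of the traffic, which is exactly why it is the better default for request-driven SLIs.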
For the latency SLO, the SLI counts requests below the threshold by reading the histogram bucket directly. The le="0.3" selector means “300 ms or faster”, and the denominator is the histogram’s _count series:
# slis/checkout-latency.openslo.yaml
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-latency-300ms
spec:
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.3"}[1m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(http_request_duration_seconds_count{service="checkout"}[1m]))
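To see why the bucket ratio is the right shape and the mean is not, the following sketch computes both from the same data. The durations are invented, and the cumulative_buckets helper only mimics how a Prometheus histogram counts observations with le (less than or equal) semantics; it is not a client-library API.

```python
from typing import Dict, List


def cumulative_buckets(durations: List[float], bounds: List[float]) -> Dict[float, int]:
    """Count observations <= each upper bound, like Prometheus *_bucket series."""
    return {le: sum(1 for d in durations if d <= le) for le in bounds}


def latency_sli(durations: List[float], threshold: float) -> float:
    """good = observations in the le=threshold bucket, total = the _count series."""
    buckets = cumulative_buckets(durations, [threshold])
    return buckets[threshold] / len(durations)


# 95 fast requests and 5 very slow ones (seconds); invented sample data.
durations = [0.05] * 95 + [2.0] * 5

mean = sum(durations) / len(durations)
print(f"mean latency: {mean:.4f}s")                    # well under the 0.3s threshold
print(f"latency SLI:  {latency_sli(durations, 0.3)}")  # yet 5% of requests missed it
```

Note that the threshold has to coincide with an actual bucket boundary of the histogram; if no le="0.3" bucket is exported, the good query returns nothing.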
A target of 0.999 over 28 days is an error budget of 0.1%. Against traffic of roughly 50M checkout requests a month, that is a budget of ~50,000 bad requests - a quantity you can spend and track, not just “40 minutes of downtime” (0.1% of 28 days; the familiar 43-minute figure belongs to a 30-day window). That subtraction, budget = 1 - target, is the whole point of the exercise.
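The arithmetic is worth writing down once, because the request-count view and the downtime view come from the same number. A minimal sketch using the 50M-request figure from the text; the function names are illustrative, and the downtime figure assumes perfectly uniform traffic.

```python
def error_budget(target: float) -> float:
    """budget = 1 - target, as a fraction of valid events."""
    return 1.0 - target


def allowed_bad_requests(target: float, total_requests: int) -> float:
    """How many bad events the budget covers for a given request volume."""
    return error_budget(target) * total_requests


def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Equivalent full-outage time, assuming perfectly uniform traffic."""
    return error_budget(target) * window_days * 24 * 60


print(round(allowed_bad_requests(0.999, 50_000_000)))  # bad requests the budget covers
print(round(allowed_downtime_minutes(0.999, 28), 2))   # minutes over a 28-day window
print(round(allowed_downtime_minutes(0.999, 30), 2))   # minutes over a 30-day window
```

The 28-day and 30-day results differ by about three minutes, which is where the two commonly quoted downtime figures for 99.9% come from.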
3. Generate Prometheus rules with Sloth
Sloth turns an SLO spec into the full set of Prometheus recording rules (SLI ratios at multiple windows) and the multi-window multi-burn-rate alerts: the four window pairs, folded into one page alert and one ticket alert per SLO. It speaks its own prometheus/v1 format and also reads OpenSLO directly.
Sloth’s native spec is denser and gives you direct control over the alert names and labels. Note the {{.window}} templating - Sloth fills it with each window length it needs (5m, 30m, 1h, 2h, 6h, 1d, and 3d; the full-period budget series is derived as well), so you write the query shape once:
# sloth/checkout.sloth.yaml
version: "prometheus/v1"
service: "checkout"
labels:
  team: "payments"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "Checkout availability measured at ingress."
    sli:
      events:
        error_query: |
          sum(rate(nginx_ingress_controller_requests{service="checkout", status=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(nginx_ingress_controller_requests{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailabilityBurn
      labels:
        category: availability
      annotations:
        summary: "Checkout is burning its availability error budget."
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
Generate the PrometheusRule-ready output:
# Native Sloth spec -> Prometheus rules
sloth generate -i sloth/checkout.sloth.yaml -o rules/checkout.rules.yaml
# OpenSLO spec -> the same output, no rewrite needed
sloth generate \
--input slos/checkout-availability.openslo.yaml \
--out rules/checkout-availability.rules.yaml
Validate before you ship - this is your CI gate:
# Validate every spec under a directory; non-zero exit on failure
sloth validate -i ./sloth/
What Sloth emits, per SLO: recording rules for the SLI error ratio at each window (slo:sli_error:ratio_rate5m, ...rate30m, ...rate1h, ...rate2h, ...rate6h, ...rate1d, ...rate3d), metadata rules carrying the objective and error budget (slo:objective:ratio, slo:error_budget:ratio), and the page and ticket alerts. One gap to be aware of: a SLOMetricAbsent alert that fires when the SLI series disappears entirely - a failure mode hand-rolled rules almost always forget - is something Pyrra generates; to my knowledge Sloth does not, so add your own absent() rule if you rely on Sloth alone.
In a cluster, skip the CLI and run the controller. It reconciles a sloth.slok.dev/v1 PrometheusServiceLevel CR into a managed monitoring.coreos.com/v1 PrometheusRule:
# k8s/checkout-psl.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout
  namespace: monitoring
spec:
  service: "checkout"
  labels:
    team: "payments"
  slos:
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          error_query: sum(rate(nginx_ingress_controller_requests{service="checkout", status=~"5.."}[{{.window}}]))
          total_query: sum(rate(nginx_ingress_controller_requests{service="checkout"}[{{.window}}]))
      alerting:
        name: CheckoutAvailabilityBurn
        page_alert:
          labels: {severity: page}
        ticket_alert:
          labels: {severity: ticket}
kubectl apply -f k8s/checkout-psl.yaml
# The controller (deployed once, cluster-wide) writes the PrometheusRule:
kubectl get prometheusrules -n monitoring
4. Run Pyrra for live error-budget dashboards
Sloth is a compiler; Pyrra is a compiler and a UI. It reads its own ServiceLevelObjective CRD, generates the equivalent burn-rate recording rules, and serves a dashboard showing remaining budget, current burn rate, and projected exhaustion per objective. Many teams run both: Sloth as the canonical rule generator, Pyrra for the operator-facing view.
The Pyrra objective is intentionally terse - target is a string percentage and window accepts week units:
# pyrra/checkout-availability.yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-availability
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: "99.9"
  window: 4w
  description: Checkout availability at ingress.
  indicator:
    ratio:
      errors:
        metric: nginx_ingress_controller_requests{service="checkout", status=~"5.."}
      total:
        metric: nginx_ingress_controller_requests{service="checkout"}
The latency indicator uses the histogram bucket for success and the _count series for total:
# pyrra/checkout-latency.yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-latency
  namespace: monitoring
spec:
  target: "99.5"
  window: 4w
  indicator:
    latency:
      success:
        metric: http_request_duration_seconds_bucket{service="checkout", le="0.3"}
      total:
        metric: http_request_duration_seconds_count{service="checkout"}
Pyrra runs as a split deployment: an api process serving the UI, plus a backend that watches your source of truth. The Kubernetes backend reconciles the CRD into PrometheusRule objects; the filesystem backend watches a directory and writes rule files (use it when you are not on the Operator):
# UI + aggregating API
pyrra api --prometheus-url=http://prometheus.monitoring:9090
# Backend (one of):
pyrra kubernetes # watches ServiceLevelObjective CRs, writes PrometheusRules
pyrra filesystem \
--config-files="/etc/pyrra/*.yaml" \
--prometheus-folder="/etc/prometheus/pyrra/"
Pyrra’s generated recording rules are named after the source metric and carry burn rates at the windows it needs - for example nginx_ingress_controller_requests:burnrate5m, :burnrate1h, :burnrate6h, and so on - alongside :increase series over the full window that power the dashboard’s budget bar.
5. Wire multi-window, multi-burn-rate alerts to Alertmanager
The reason for multiple windows is that one threshold cannot be both fast and quiet. A high burn rate over a short window catches a sharp outage in minutes; a lower burn rate over a long window catches a slow bleed without paging on every transient blip. Google’s canonical table pairs a long and short window per severity so an alert only fires when both agree there is sustained burn:
| Severity | Long window | Short window | Burn rate | Budget consumed if sustained |
|---|---|---|---|---|
| Page | 1h | 5m | 14.4 | ~2% in 1h |
| Page | 6h | 30m | 6 | ~5% in 6h |
| Ticket | 1d (24h) | 2h | 3 | ~10% in 1d |
| Ticket | 3d (72h) | 6h | 1 | ~10% in 3d |
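The burn rates in the table are not magic numbers; each follows from the share of budget you are willing to lose in the long window. Below is a small sketch of that derivation and of the two-window firing condition, assuming a 30-day SLO period (the period the table’s factors correspond to). The function names are illustrative, not taken from Sloth or Pyrra.

```python
def burn_rate(budget_consumed: float, window_hours: float, period_days: int = 30) -> float:
    """Burn rate that spends `budget_consumed` of the budget in `window_hours`."""
    return budget_consumed * (period_days * 24) / window_hours


def should_fire(long_ratio: float, short_ratio: float, rate: float, target: float) -> bool:
    """Fire only when BOTH windows show an error ratio above rate * budget."""
    threshold = rate * (1.0 - target)
    return long_ratio > threshold and short_ratio > threshold


# (budget share, long window in hours) for the four rows of the table.
for share, hours in [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]:
    print(f"{hours:>2}h: burn rate {burn_rate(share, hours):.1f}")

# 2% errors over the last hour, but already recovered in the last 5 minutes: no page.
print(should_fire(long_ratio=0.02, short_ratio=0.0005, rate=14.4, target=0.999))
```

With period_days=28 the same budget shares give 13.44, 5.6, 2.8 and roughly 0.93, which is why the "budget consumed" column is only approximate for a 28-day window.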
Both Sloth and Pyrra emit exactly this structure, so you do not write the PromQL. Your job is routing. Sloth labels page alerts severity: page and ticket alerts severity: ticket (per your spec); route on those:
# alertmanager.yaml
route:
  receiver: default
  group_by: ["sloth_service", "sloth_slo"]
  routes:
    - matchers:
        - severity = "page"
      receiver: pagerduty-payments
      group_wait: 30s
      repeat_interval: 4h
    - matchers:
        - severity = "ticket"
      receiver: slack-payments
      repeat_interval: 24h
inhibit_rules:
  # A firing page silences the matching ticket for the same SLO -
  # do not open a Jira while someone is already paged.
  - source_matchers: [severity = "page"]
    target_matchers: [severity = "ticket"]
    equal: ["sloth_service", "sloth_slo"]
receivers:
  - name: default
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: "<integration-key>"
  - name: slack-payments
    slack_configs:
      - api_url: "<webhook-url>"
        channel: "#payments-slo"
The inhibition rule matters: during a real outage both the page (1h/5m) and ticket (1d/2h) alerts will eventually fire for the same SLO. Inhibiting the ticket while the page is active stops the on-call from drowning in duplicate noise about a single incident.
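The matching logic behind that rule is simple enough to sketch. The following is a simplified model, not Alertmanager’s implementation: alerts are plain label dictionaries, and a ticket is muted when some firing page carries the same values for every label listed in equal.

```python
from typing import Dict, List

Alert = Dict[str, str]


def inhibited(target: Alert, firing: List[Alert], equal: List[str]) -> bool:
    """True when a firing page alert shares all `equal` labels with a ticket alert."""
    if target.get("severity") != "ticket":
        return False
    return any(
        src.get("severity") == "page"
        and all(src.get(label) == target.get(label) for label in equal)
        for src in firing
    )


equal = ["sloth_service", "sloth_slo"]
page = {"severity": "page", "sloth_service": "checkout", "sloth_slo": "requests-availability"}
ticket_same = {"severity": "ticket", "sloth_service": "checkout", "sloth_slo": "requests-availability"}
ticket_other = {"severity": "ticket", "sloth_service": "search", "sloth_slo": "requests-availability"}

print(inhibited(ticket_same, [page], equal))   # same SLO as the firing page
print(inhibited(ticket_other, [page], equal))  # different service, not muted
```

The equal list is what keeps the rule narrow: without it, one paging service would mute every ticket in the organization.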
6. Version SLO definitions in Git and validate in CI
The specs are code, so they get the same treatment: pull requests, review, and a gate that refuses anything that will not compile. Lay the repo out by object kind and keep generated rules out of source control - regenerate them in CI so the spec is unambiguously the source of truth.
slo-definitions/
  slis/    # reusable OpenSLO SLI objects
  slos/    # OpenSLO SLO objects (one per objective)
  sloth/   # native Sloth specs
  pyrra/   # Pyrra ServiceLevelObjective CRs
  tests/   # promtool rule unit tests
  .github/workflows/slo.yaml
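A GitHub Actions job that validates with Sloth and dry-run-renders the output catches the failure that happens most often: a malformed spec or a template that will not render. It cannot catch a spec that compiles but references a metric or label that does not exist; neither Sloth nor promtool knows what your Prometheus actually scrapes, so that check needs a query against a live server.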
# .github/workflows/slo.yaml
name: validate-slos
on: [pull_request]
jobs:
  sloth:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Sloth
        run: |
          # -f makes curl fail on an HTTP error instead of saving the error page as the binary
          curl -fsSL \
            https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64 \
            -o /tmp/sloth
          # /usr/local/bin is root-owned on hosted runners
          sudo install -m 0755 /tmp/sloth /usr/local/bin/sloth
      - name: Validate specs
        run: sloth validate -i ./slos/
      - name: Render rules (fails on bad templates)
        run: |
          for f in slos/*.openslo.yaml; do
            sloth generate -i "$f" -o /dev/null
          done
Add a promtool step to assert the rules are not just well-formed YAML but valid Prometheus rules, and unit-test the burn-rate logic with sample series so a refactor cannot silently break a page:
# Render, then let Prometheus itself check the rules
sloth generate -i slos/checkout-availability.openslo.yaml -o /tmp/checkout.rules.yaml
promtool check rules /tmp/checkout.rules.yaml
promtool test rules tests/checkout_burnrate_test.yaml
7. Report budget consumption and gate releases
The error budget only changes behavior if someone looks at it on a cadence and it has teeth. Two consumers matter: a weekly stakeholder report, and the deploy pipeline.
For reporting, query the budget directly. Sloth’s metadata rules expose the objective and the remaining budget as series, so a single PromQL expression gives you “fraction of budget left”:
# Fraction of 28d error budget remaining for the availability SLO.
# 1 = full budget, 0 = exhausted, < 0 = over budget.
1 - (
slo:sli_error:ratio_rate28d{sloth_service="checkout", sloth_slo="requests-availability"}
/ on(sloth_service, sloth_slo)
slo:error_budget:ratio{sloth_service="checkout", sloth_slo="requests-availability"}
)
For release gating, the pipeline asks the same question before promoting. If the budget is exhausted, the policy is to freeze risky changes and spend the next cycle on reliability - the budget is the negotiated line between feature velocity and stability, enforced automatically instead of in a meeting:
# Returns the remaining-budget fraction; gate promotion on it.
QUERY='1 - (slo:sli_error:ratio_rate28d{sloth_service="checkout",sloth_slo="requests-availability"} / on(sloth_service,sloth_slo) slo:error_budget:ratio{sloth_service="checkout",sloth_slo="requests-availability"})'
REMAINING=$(curl -fsSG http://prometheus:9090/api/v1/query \
  --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[0].value[1] // empty')
# Fail closed: an empty result means the SLO series is missing, not that the budget is fine.
[ -n "$REMAINING" ] || { echo "No budget data - prod promotion blocked"; exit 1; }
# Block prod promotion if less than 10% of the budget remains.
if awk -v r="$REMAINING" 'BEGIN { exit !(r + 0 < 0.10) }'; then
  echo "Budget below 10% - prod promotion blocked"
  exit 1
fi
echo "Budget OK ($REMAINING remaining)"
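The same decision is easy to express as a pure function, which makes the edge cases testable without a Prometheus server. This is a sketch under stated assumptions: the function names and the fail-closed handling of a missing result are illustrative choices, not part of Sloth or Prometheus.

```python
from typing import Optional


def remaining_budget(error_ratio: float, error_budget: float) -> float:
    """1 = full budget, 0 = exhausted, < 0 = over budget."""
    return 1.0 - error_ratio / error_budget


def may_promote(remaining: Optional[float], minimum: float = 0.10) -> bool:
    """Allow promotion only when a budget figure exists and is at or above `minimum`.

    None models an empty query result; treating it as a block means a missing
    SLI series cannot silently open the gate.
    """
    return remaining is not None and remaining >= minimum


# 0.04% errors over the window against a 0.1% budget.
left = remaining_budget(error_ratio=0.0004, error_budget=0.001)
print(f"{left:.2f}", may_promote(left))
print(may_promote(None))
```

Whether a missing series should block or allow a release is a policy decision; blocking is the conservative default, but it does mean a broken scrape can hold up deploys.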
Enterprise scenario
A payments platform team I worked with ran 70+ services across three Prometheus shards behind Thanos. They had migrated to burn-rate alerting but authored every rule by hand in Jsonnet, and a quiet template bug had given the fast-burn page alert a 6h long window where the spec called for 1h. The result: a checkout degradation burned 30% of the monthly budget over a weekend before the page fired, because the longer window averaged the spike down below the threshold.
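The smoothing effect is plain arithmetic. As a hypothetical illustration (the numbers below are invented, not taken from that incident), consider a 45-minute spike burning at 20x under steady traffic: averaged over a 1h window it clears a 14.4x threshold, while averaged over 6h it does not come close.

```python
def windowed_burn(spike_burn: float, spike_minutes: float, window_minutes: float) -> float:
    """Average burn rate over a window that contains the spike.

    Assumes steady traffic and zero errors outside the spike.
    """
    covered = min(spike_minutes, window_minutes)
    return spike_burn * covered / window_minutes


PAGE_THRESHOLD = 14.4

for window in (60, 360):
    burn = windowed_burn(spike_burn=20.0, spike_minutes=45, window_minutes=window)
    print(f"{window:>3}m window: burn {burn:.1f}, pages: {burn > PAGE_THRESHOLD}")
```

The spike is identical in both cases; only the denominator changed, which is how a single wrong window length turns a page into silence.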
The constraint was that they could not rip out Jsonnet wholesale - dashboards, mixins, and CI all depended on it. So they inverted the relationship. SLO definitions moved to OpenSLO specs in a dedicated repo, and a CI job compiled them with Sloth into the PrometheusRule objects that Jsonnet previously hand-built. Jsonnet kept owning dashboards but consumed the Sloth-generated objective and budget series instead of redefining them. The bug class disappeared, because no human typed a window length again - the four-window table is baked into Sloth’s generator and unit-tested upstream.
The piece that sold it to leadership was the gate. They wired the remaining-budget query into the Argo CD sync hook so a service whose budget was exhausted could not promote a non-hotfix change to prod:
# A PrometheusRule that flips an inhibiting label when budget is gone.
# Argo CD's pre-sync gate reads this alert's state via the Alertmanager API.
- alert: CheckoutBudgetExhausted
  expr: |
    (slo:sli_error:ratio_rate28d{sloth_service="checkout", sloth_slo="requests-availability"}
    / on(sloth_service, sloth_slo)
    slo:error_budget:ratio{sloth_service="checkout", sloth_slo="requests-availability"}) >= 1
  for: 15m
  labels:
    severity: ticket
    gate: freeze
  annotations:
    summary: "Checkout availability budget exhausted; non-hotfix promotions frozen."
Within a quarter the page-on-noise rate dropped sharply and, more importantly, the two budget-blown incidents that followed were caught by the short window in minutes, not over a weekend.
Verify
Confirm the pipeline end to end before trusting it:
- Rules compiled and loaded. sloth validate -i ./slos/ exits zero, and kubectl get prometheusrules -n monitoring shows one rule object per service. In the Prometheus UI, /rules lists the slo:sli_error:ratio_rate* recording rules with no evaluation errors.
- SLI series exist and are sane. Query slo:sli_error:ratio_rate5m{sloth_service="checkout"} - it should be a small number near zero in steady state, never negative, never above 1.
- Burn-rate math is right. Run promtool test rules against sample series that inject a known error rate and assert the page alert fires at 14.4x and stays silent below it.
- Alerts route correctly. Fire a synthetic burn (drop the error query threshold in a staging copy) and confirm the page lands in PagerDuty and the matching ticket is inhibited, not duplicated, in the Alertmanager UI.
- Pyrra reflects reality. The Pyrra dashboard’s remaining-budget bar for checkout-availability matches the PromQL budget query within rounding.
- Gate actually blocks. Force the budget query to return < 0.10 in a test and confirm the pipeline step exits non-zero.