
SLOs as Code: Authoring SLIs with OpenSLO and Generating Burn-Rate Alerts via Sloth and Pyrra

Hand-written burn-rate alerts are a maintenance liability. Google’s multi-window multi-burn-rate recipe is four alert conditions per objective, which between them depend on recording rules at seven window lengths, and every one of them has to derive the right threshold from the SLO target. Author that by hand for one service and it is a fun afternoon; author it for sixty services and you have a generator’s worth of copy-paste with subtle per-service bugs nobody finds until the page does not fire.

The fix is to treat the objective as the source of truth and let a tool emit the PromQL. This article builds that pipeline: describe SLIs and targets once in the vendor-neutral OpenSLO spec, compile them into Prometheus recording and burn-rate rules with Sloth, run Pyrra for a live error-budget UI and an alternative rule generator, wire the alerts into Alertmanager, and gate all of it behind CI. The assumption throughout is Prometheus (or a compatible store like Mimir/Thanos) with the Prometheus Operator, Sloth v0.11+, and Pyrra v0.7+.

1. Choose SLIs that move when users hurt

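An SLI is a ratio of good events to valid events. The discipline is entirely in the nouns: what counts as an event, and what counts as good. Three indicator families cover almost every request-driven service; this article works through two of them, availability and latency.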

Two rules keep an SLI honest:

Measure as close to the user as you can while still owning the signal. The ingress or load balancer sees what the client experiences; the app’s own histogram only sees requests that already arrived. Prefer the edge for the user-facing SLO and keep deeper signals for debugging.

Exclude what is not yours. A 400 is the client’s fault - do not count 4xx responses as failures, or you burn budget for malformed requests. Drop health checks and synthetics entirely; they inflate the denominator and mask real pain.

If you can imagine the SLI looking healthy while users are unhappy - or the reverse - the indicator is wrong. Fix it before you argue about the target.
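As a minimal sketch of the two rules, assume a hypothetical list of request records with `path` and `status` fields (the record type and the probe paths are illustrative, not taken from any real exporter). The ratio below drops probes from the valid set and counts only 5xx as bad:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical access-log record; field names are illustrative."""
    path: str
    status: int


# Assumed probe endpoints; substitute whatever your health checks hit.
PROBE_PATHS = {"/healthz", "/readyz"}


def availability_sli(requests: list[Request]) -> float | None:
    """Return good/valid, or None when there were no valid events."""
    valid = [r for r in requests if r.path not in PROBE_PATHS]
    if not valid:
        return None  # no traffic is "no data", not "100% healthy"
    good = [r for r in valid if r.status < 500]  # 4xx is the client's fault
    return len(good) / len(valid)


sample = [
    Request("/checkout", 200),
    Request("/checkout", 400),  # malformed request: valid, not a failure
    Request("/checkout", 503),  # server error: burns budget
    Request("/healthz", 200),   # probe: excluded entirely
]
print(availability_sli(sample))
```

Returning `None` for an empty valid set is a deliberate choice in this sketch: a service with no traffic has no SLI, which is different from a perfect one.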

2. Model objectives, windows, and targets in OpenSLO

OpenSLO is a vendor-neutral YAML spec for SLOs. The value is decoupling: the SLI and target live in one document, independent of whichever engine compiles them. An SLI can be defined inline or as a reusable kind: SLI referenced by many SLOs.

Define the availability SLI once as its own object:

# slis/checkout-availability.openslo.yaml
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  description: Fraction of checkout requests not served as 5xx, measured at ingress.
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(nginx_ingress_controller_requests{service="checkout", status!~"5.."}[1m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(nginx_ingress_controller_requests{service="checkout"}[1m]))

Then bind it to a target and window in an SLO object. The budgetingMethod: Occurrences consumes budget per bad event (the right default for request SLIs); Timeslices consumes per bad window and is for time-based services. The window is a rolling 28 days - long enough that a single failure does not blow a tight target, short enough to stay relevant.

# slos/checkout-availability.openslo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  service: checkout
  indicatorRef: checkout-availability
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: 99.9% of checkout requests succeed
      target: 0.999

For the latency SLO, the SLI counts requests below the threshold by reading the histogram bucket directly. The le="0.3" selector means “300 ms or faster”, and the denominator is the histogram’s _count series:

# slis/checkout-latency.openslo.yaml
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-latency-300ms
spec:
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.3"}[1m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(http_request_duration_seconds_count{service="checkout"}[1m]))

A target of 0.999 over 28 days is an error budget of 0.1%. If checkout serves roughly 50M requests over that window, the budget is ~50,000 bad requests - a quantity you can spend and track, not just “40 minutes of downtime”. That subtraction, budget = 1 - target, is the whole point of the exercise.
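The arithmetic is small enough to check by hand, but a sketch makes the units explicit. The function names here are illustrative. Note that a 28-day window gives about 40 minutes of full outage; the commonly quoted 43 minutes corresponds to a 30-day window.

```python
def error_budget(target: float) -> float:
    """Allowed failure fraction: budget = 1 - target."""
    return 1.0 - target


def allowed_bad_events(target: float, total_events: int) -> float:
    """Budget expressed as a count of bad events over the window."""
    return error_budget(target) * total_events


def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Budget expressed as minutes of total outage over the window."""
    return error_budget(target) * window_days * 24 * 60


target = 0.999
print(round(allowed_bad_events(target, 50_000_000)))    # 50000
print(round(allowed_downtime_minutes(target, 28), 1))   # 40.3
print(round(allowed_downtime_minutes(target, 30), 1))   # 43.2
```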

3. Generate Prometheus rules with Sloth

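Sloth turns an SLO spec into the full set of Prometheus recording rules (SLI ratios at multiple windows) and the multi-window multi-burn-rate alerts: one page alert and one ticket alert, each combining two of the four window pairs. It speaks its own prometheus/v1 format and also reads OpenSLO. Sloth’s OpenSLO support has historically targeted the openslo/v1alpha dialect with inline, {{.window}}-templated queries, so confirm that your Sloth version accepts the openslo/v1 objects above before relying on that path.

Sloth’s native spec is denser and gives you direct control over the alert names and labels. Note the {{.window}} templating - Sloth fills it with each window length it needs (5m, 30m, 1h, 2h, 6h, 1d, 3d, and the full SLO period), so you write the query shape once: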

# sloth/checkout.sloth.yaml
version: "prometheus/v1"
service: "checkout"
labels:
  team: "payments"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "Checkout availability measured at ingress."
    sli:
      events:
        error_query: |
          sum(rate(nginx_ingress_controller_requests{service="checkout", status=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(nginx_ingress_controller_requests{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailabilityBurn
      labels:
        category: availability
      annotations:
        summary: "Checkout is burning its availability error budget."
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
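To see what "write the query shape once" buys you, here is an illustration of the substitution. This is only a stand-in: Sloth renders its templates in Go, and the plain string replacement below is an assumption made for the sake of the example.

```python
# Illustration only: Sloth's real rendering happens inside the tool.
ERROR_QUERY = (
    "sum(rate(nginx_ingress_controller_requests"
    '{service="checkout", status=~"5.."}[{{.window}}]))'
)

# Window lengths used by the burn-rate alerts described in this article.
WINDOWS = ["5m", "30m", "1h", "2h", "6h", "1d", "3d"]


def render(query: str, window: str) -> str:
    """Substitute one window length into the templated query."""
    return query.replace("{{.window}}", window)


for window in WINDOWS:
    print(f"{window:>3}: {render(ERROR_QUERY, window)}")
```

One templated query becomes seven concrete range selectors, which is exactly the copy-paste a hand-written rule file would carry.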

Generate the PrometheusRule-ready output:

# Native Sloth spec -> Prometheus rules
sloth generate -i sloth/checkout.sloth.yaml -o rules/checkout.rules.yaml

# OpenSLO spec -> the same output, no rewrite needed
sloth generate \
  --input slos/checkout-availability.openslo.yaml \
  --out rules/checkout-availability.rules.yaml

Validate before you ship - this is your CI gate:

# Validate every spec under a directory; non-zero exit on failure
sloth validate -i ./sloth/

What Sloth emits, per SLO: recording rules for the SLI error ratio at each window (slo:sli_error:ratio_rate5m, ...rate30m, ...rate1h, ...rate2h, ...rate6h, ...rate1d, ...rate3d), metadata rules carrying the objective and error budget (slo:objective:ratio, slo:error_budget:ratio), and the page and ticket burn-rate alerts. One failure mode hand-rolled rules almost always forget is the SLI series disappearing entirely. As far as I know, Sloth does not generate an alert for that; Pyrra does (SLOMetricAbsent), or you can add an absent() rule of your own.

In a cluster, skip the CLI and run the controller. It reconciles a sloth.slok.dev/v1 PrometheusServiceLevel CR into a managed monitoring.coreos.com/v1 PrometheusRule:

# k8s/checkout-psl.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout
  namespace: monitoring
spec:
  service: "checkout"
  labels:
    team: "payments"
  slos:
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          error_query: sum(rate(nginx_ingress_controller_requests{service="checkout", status=~"5.."}[{{.window}}]))
          total_query: sum(rate(nginx_ingress_controller_requests{service="checkout"}[{{.window}}]))
      alerting:
        name: CheckoutAvailabilityBurn
        page_alert:
          labels: {severity: page}
        ticket_alert:
          labels: {severity: ticket}

kubectl apply -f k8s/checkout-psl.yaml
# The controller (deployed once, cluster-wide) writes the PrometheusRule:
kubectl get prometheusrules -n monitoring

4. Run Pyrra for live error-budget dashboards

Sloth is a compiler; Pyrra is a compiler and a UI. It reads its own ServiceLevelObjective CRD, generates the equivalent burn-rate recording rules, and serves a dashboard showing remaining budget, current burn rate, and projected exhaustion per objective. Many teams run both: Sloth as the canonical rule generator, Pyrra for the operator-facing view.

The Pyrra objective is intentionally terse - target is a string percentage and window accepts week units:

# pyrra/checkout-availability.yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-availability
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: "99.9"
  window: 4w
  description: Checkout availability at ingress.
  indicator:
    ratio:
      errors:
        metric: nginx_ingress_controller_requests{service="checkout", status=~"5.."}
      total:
        metric: nginx_ingress_controller_requests{service="checkout"}

The latency indicator uses the histogram bucket for success and the _count series for total:

# pyrra/checkout-latency.yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-latency
  namespace: monitoring
spec:
  target: "99.5"
  window: 4w
  indicator:
    latency:
      success:
        metric: http_request_duration_seconds_bucket{service="checkout", le="0.3"}
      total:
        metric: http_request_duration_seconds_count{service="checkout"}

Pyrra runs as a split deployment: an api process serving the UI, plus a backend that watches your source of truth. The Kubernetes backend reconciles the CRD into PrometheusRule objects; the filesystem backend watches a directory and writes rule files (use it when you are not on the Operator):

# UI + aggregating API
pyrra api --prometheus-url=http://prometheus.monitoring:9090

# Backend (one of):
pyrra kubernetes      # watches ServiceLevelObjective CRs, writes PrometheusRules
pyrra filesystem \
  --config-files="/etc/pyrra/*.yaml" \
  --prometheus-folder="/etc/prometheus/pyrra/"

Pyrra’s generated recording rules carry burn rates at the windows it needs, named after the underlying metric - for example http_requests:burnrate5m, :burnrate1h, :burnrate6h, and so on for an http_requests_total counter - alongside :increase series over the full window that power the dashboard’s budget bar.
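A projected-exhaustion figure follows from two numbers a dashboard like this already has: the remaining budget and the current burn rate. The sketch below is one plausible way to compute it, under the assumption that a burn rate of 1 spends the whole budget in exactly one SLO window. It is not taken from Pyrra's source.

```python
import math


def hours_to_exhaustion(remaining_fraction: float, burn_rate: float,
                        window_hours: float) -> float:
    """Hours until the budget hits zero if the current burn rate holds."""
    if remaining_fraction <= 0:
        return 0.0  # already exhausted
    if burn_rate <= 0:
        return math.inf  # nothing is being spent
    # Burn rate 1 spends the full budget over window_hours.
    return remaining_fraction * window_hours / burn_rate


FOUR_WEEKS_HOURS = 4 * 7 * 24  # 672h, matching "window: 4w"
print(hours_to_exhaustion(0.5, 2.0, FOUR_WEEKS_HOURS))  # 168.0
```

Half the budget left at twice the sustainable burn rate gives one week of runway, which is usually the number people want from the dashboard.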

5. Wire multi-window, multi-burn-rate alerts to Alertmanager

The reason for multiple windows is that one threshold cannot be both fast and quiet. A high burn rate over a short window catches a sharp outage in minutes; a lower burn rate over a long window catches a slow bleed without paging on every transient blip. Google’s canonical table pairs a long and short window per severity so an alert only fires when both agree there is sustained burn:

Severity  Long window  Short window  Burn rate  Budget consumed if sustained
Page      1h           5m            14.4       ~2% in 1h
Page      6h           30m           6          ~5% in 6h
Ticket    1d (24h)     2h            3          ~10% in 1d
Ticket    3d (72h)     6h            1          ~10% in 3d
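The numbers in the table are not arbitrary. Budget consumed is burn rate times window length divided by the SLO period, and the error-ratio threshold an alert compares against is burn rate times the error budget. The sketch below reproduces the table under the assumption of a 30-day period, which is what the round percentages correspond to; with 28 days the first row comes out nearer 2.1%.

```python
PERIOD_HOURS_30D = 30 * 24  # assumption: the table's percentages use 30 days

# (severity, long window in hours, burn rate), copied from the table.
ROWS = [("page", 1, 14.4), ("page", 6, 6.0), ("ticket", 24, 3.0), ("ticket", 72, 1.0)]


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float) -> float:
    """Fraction of the period's budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def error_ratio_threshold(burn_rate: float, target: float) -> float:
    """Error ratio at which the SLI burns at burn_rate times the budget."""
    return burn_rate * (1.0 - target)


for severity, hours, rate in ROWS:
    spent = budget_consumed(rate, hours, PERIOD_HOURS_30D)
    threshold = error_ratio_threshold(rate, 0.999)
    print(f"{severity:6} {hours:>3}h  x{rate:<5} spends {spent:.1%}, "
          f"fires above {threshold:.4%} errors")
```

For a 99.9% target the fast page therefore fires at a 1.44% error ratio, sustained across both its long and short windows.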

Both Sloth and Pyrra generate this structure (Pyrra scales its windows to the SLO window, so its exact numbers can differ), so you do not write the PromQL. Your job is routing. Sloth labels page alerts severity: page and ticket alerts severity: ticket (per your spec); route on those:

# alertmanager.yaml
route:
  receiver: default
  group_by: ["sloth_service", "sloth_slo"]
  routes:
    - matchers:
        - severity = "page"
      receiver: pagerduty-payments
      group_wait: 30s
      repeat_interval: 4h
    - matchers:
        - severity = "ticket"
      receiver: slack-payments
      repeat_interval: 24h

inhibit_rules:
  # A firing page silences the matching ticket for the same SLO -
  # do not open a Jira while someone is already paged.
  - source_matchers: [severity = "page"]
    target_matchers: [severity = "ticket"]
    equal: ["sloth_service", "sloth_slo"]

receivers:
  - name: default
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: "<integration-key>"
  - name: slack-payments
    slack_configs:
      - api_url: "<webhook-url>"
        channel: "#payments-slo"

The inhibition rule matters: during a real outage both the page (1h/5m) and ticket (1d/2h) alerts will eventually fire for the same SLO. Inhibiting the ticket while the page is active stops the on-call from drowning in duplicate noise about a single incident.
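The matching logic behind that rule can be modelled in a few lines. This is a simplified model of inhibition, not Alertmanager's implementation, and it treats each alert as a flat dictionary of labels for illustration:

```python
from __future__ import annotations

# Labels that must match between the page and the ticket ("equal" in the rule).
EQUAL_LABELS = ("sloth_service", "sloth_slo")


def inhibited(alerts: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return ticket alerts muted by a firing page for the same SLO."""
    paged = {
        tuple(a.get(k) for k in EQUAL_LABELS)
        for a in alerts
        if a.get("severity") == "page"
    }
    return [
        a for a in alerts
        if a.get("severity") == "ticket"
        and tuple(a.get(k) for k in EQUAL_LABELS) in paged
    ]


firing = [
    {"severity": "page", "sloth_service": "checkout", "sloth_slo": "requests-availability"},
    {"severity": "ticket", "sloth_service": "checkout", "sloth_slo": "requests-availability"},
    {"severity": "ticket", "sloth_service": "cart", "sloth_slo": "requests-availability"},
]
for alert in inhibited(firing):
    print("muted:", alert["sloth_service"], alert["sloth_slo"])
```

The ticket for a different service stays audible, which is why the `equal` list matters: without it, one page would mute every ticket in the system.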

6. Version SLO definitions in Git and validate in CI

The specs are code, so they get the same treatment: pull requests, review, and a gate that refuses anything that will not compile. Lay the repo out by kind and keep generated rules out of source control - regenerate them in CI so the spec is unambiguously the source of truth.

slo-definitions/
  slis/        # reusable OpenSLO SLI objects
  slos/        # OpenSLO SLO objects (one per objective)
  pyrra/       # Pyrra ServiceLevelObjective CRs
  .github/workflows/slo.yaml

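A GitHub Actions job that validates with Sloth and dry-run-renders the output catches the failures that are cheap to catch: a malformed spec, and a spec whose templates do not render. It cannot tell you that a spec compiles but references a metric or label that does not exist, because neither command queries Prometheus; that needs a check against a live Prometheus.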

# .github/workflows/slo.yaml
name: validate-slos
on: [pull_request]
jobs:
  sloth:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
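      - name: Install Sloth
        run: |
          # Install into a user-writable directory and put it on PATH.
          # Pin a release tag instead of "latest" for reproducible CI.
          mkdir -p "$HOME/bin"
          curl -fsSL \
            https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64 \
            -o "$HOME/bin/sloth"
          chmod +x "$HOME/bin/sloth"
          echo "$HOME/bin" >> "$GITHUB_PATH"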
      - name: Validate specs
        run: sloth validate -i ./slos/
      - name: Render rules (fails on bad templates)
        run: |
          for f in slos/*.openslo.yaml; do
            sloth generate -i "$f" -o /dev/null
          done

Add a promtool step to assert the rules are not just well-formed YAML but valid Prometheus rules, and unit-test the burn-rate logic with sample series so a refactor cannot silently break a page:

# Render, then let Prometheus itself check the rules
sloth generate -i slos/checkout-availability.openslo.yaml -o /tmp/checkout.rules.yaml
promtool check rules /tmp/checkout.rules.yaml
promtool test rules tests/checkout_burnrate_test.yaml
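The property such a test should pin down is the two-window agreement: the alert fires only when both the long and the short window exceed the threshold. A plain-Python sketch of that property, separate from the generated rules and using made-up sample ratios, looks like this:

```python
def burn_alert_fires(long_ratio: float, short_ratio: float,
                     target: float, burn_rate: float) -> bool:
    """True only when both windows exceed burn_rate times the error budget."""
    threshold = burn_rate * (1.0 - target)
    return long_ratio > threshold and short_ratio > threshold


TARGET, FAST_BURN = 0.999, 14.4  # page row: 1h long window, 5m short window

# Sustained outage: 5% errors in both windows, so the page fires.
assert burn_alert_fires(0.05, 0.05, TARGET, FAST_BURN)
# Already recovered: the 1h window still remembers it, the 5m window is clean.
assert not burn_alert_fires(0.05, 0.0, TARGET, FAST_BURN)
# Brief blip: 5m spikes, but the 1h average stays under the threshold.
assert not burn_alert_fires(0.005, 0.05, TARGET, FAST_BURN)
print("burn-rate cases hold")
```

A `promtool test rules` file would assert the same three cases against the rendered rules by feeding input series that produce those ratios.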

7. Report budget consumption and gate releases

The error budget only changes behavior if someone looks at it on a cadence and it has teeth. Two consumers matter: a weekly stakeholder report, and the deploy pipeline.

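For reporting, query the budget directly. Sloth’s metadata rules expose the objective and the error budget as series, and the period-long error ratio is recorded alongside them, so a single PromQL expression gives you “fraction of budget left”. The query below assumes Sloth generated rules for a 28-day SLO period; with Sloth’s default 30-day period the series is slo:sli_error:ratio_rate30d instead: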

# Fraction of 28d error budget remaining for the availability SLO.
# 1 = full budget, 0 = exhausted, < 0 = over budget.
1 - (
  slo:sli_error:ratio_rate28d{sloth_service="checkout", sloth_slo="requests-availability"}
  / on(sloth_service, sloth_slo)
  slo:error_budget:ratio{sloth_service="checkout", sloth_slo="requests-availability"}
)
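The same expression in plain arithmetic, with illustrative error ratios, shows how the result moves from full budget through zero to negative:

```python
def remaining_budget(error_ratio: float, target: float) -> float:
    """1 = full budget, 0 = exhausted, negative = over budget."""
    return 1.0 - error_ratio / (1.0 - target)


for observed in (0.0, 0.0005, 0.001, 0.002):
    left = remaining_budget(observed, 0.999)
    print(f"error ratio {observed:.4f} -> {left:+.2f} of budget left")
```

An observed error ratio equal to the budget (0.001 for a 99.9% target) leaves nothing; twice the budget leaves minus one full budget, which is the "over budget" case the PromQL comment describes.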

For release gating, the pipeline asks the same question before promoting. If the budget is exhausted, the policy is to freeze risky changes and spend the next cycle on reliability - the budget is the negotiated line between feature velocity and stability, enforced automatically instead of in a meeting:

# Returns the remaining-budget fraction; gate promotion on it.
REMAINING=$(curl -sfG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=1 - (slo:sli_error:ratio_rate28d{sloth_service="checkout",sloth_slo="requests-availability"} / on(sloth_service,sloth_slo) slo:error_budget:ratio{sloth_service="checkout",sloth_slo="requests-availability"})' \
  | jq -r '.data.result[0].value[1] // empty')

# Fail closed: an empty, null, or non-numeric result must not read as "budget OK".
case "$REMAINING" in
  ''|*[!0-9.eE+-]*) echo "No usable budget data - prod promotion blocked"; exit 1 ;;
esac

# Block prod promotion if less than 10% of the budget remains.
if awk -v r="$REMAINING" 'BEGIN { exit !(r < 0.10) }'; then
  echo "Budget below 10% - prod promotion blocked"; exit 1
fi
echo "Budget OK ($REMAINING remaining)"
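If the pipeline already runs Python, the same decision can be made on the raw response body. This sketch parses the standard Prometheus instant-query JSON shape and fails closed when there is no usable sample. The function names and the 10% floor are illustrative, and fetching the body is left to whatever HTTP client the pipeline uses.

```python
from __future__ import annotations

import json
import math


def remaining_from_response(body: str) -> float | None:
    """Extract the first sample value from an instant-query response."""
    payload = json.loads(body)
    results = payload.get("data", {}).get("result", [])
    if payload.get("status") != "success" or not results:
        return None
    value = float(results[0]["value"][1])  # samples arrive as strings
    return None if math.isnan(value) else value


def promotion_allowed(remaining: float | None, floor: float = 0.10) -> bool:
    """Fail closed: missing data blocks promotion just like a spent budget."""
    return remaining is not None and remaining >= floor


body = ('{"status":"success","data":{"resultType":"vector",'
        '"result":[{"metric":{},"value":[1700000000,"0.42"]}]}}')
print(promotion_allowed(remaining_from_response(body)))  # True
```

Treating "no data" as "blocked" is the important design choice: a renamed series or an unreachable Prometheus should stop a release, not wave it through.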

Enterprise scenario

A payments platform team I worked with ran 70+ services across three Prometheus shards behind Thanos. They had migrated to burn-rate alerting but authored every rule by hand in Jsonnet, and a quiet template bug had the fast page alert evaluating its long window over 6h where the spec called for 1h. The result: a checkout degradation burned 30% of the monthly budget over a weekend before the page fired, because the slower window smoothed the spike below threshold.

The constraint was that they could not rip out Jsonnet wholesale - dashboards, mixins, and CI all depended on it. So they inverted the relationship. SLO definitions moved to OpenSLO specs in a dedicated repo, and a CI job compiled them with Sloth into the PrometheusRule objects that Jsonnet previously hand-built. Jsonnet kept owning dashboards but consumed the Sloth-generated objective and budget series instead of redefining them. The bug class disappeared, because no human typed a window length again - the four-window table is baked into Sloth’s generator and unit-tested upstream.

The piece that sold it to leadership was the gate. They wired the remaining-budget query into the Argo CD sync hook so a service whose budget was exhausted could not promote a non-hotfix change to prod:

# A rule (placed in a PrometheusRule group) that fires with a gate label when budget is gone.
# Argo CD's pre-sync gate reads this alert's state via the Alertmanager API.
- alert: CheckoutBudgetExhausted
  expr: |
    (slo:sli_error:ratio_rate28d{sloth_service="checkout", sloth_slo="requests-availability"}
      / on(sloth_service, sloth_slo)
     slo:error_budget:ratio{sloth_service="checkout", sloth_slo="requests-availability"}) >= 1
  for: 15m
  labels:
    severity: ticket
    gate: freeze
  annotations:
    summary: "Checkout availability budget exhausted; non-hotfix promotions frozen."
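How the team's hook consumed the alert is not described here, so the following is a hypothetical sketch of the decision step only. It assumes the hook has already fetched the JSON array returned by Alertmanager's v2 alerts endpoint and looks for the `gate: freeze` label. It deliberately ignores the alert's state, because a ticket-severity alert may show as suppressed while a page for the same SLO is firing.

```python
from __future__ import annotations

import json


def frozen_services(alerts_body: str) -> set[str]:
    """Services that currently have a gate=freeze alert, in any state."""
    frozen = set()
    for alert in json.loads(alerts_body):
        labels = alert.get("labels", {})
        if labels.get("gate") == "freeze":
            frozen.add(labels.get("sloth_service", "unknown"))
    return frozen


def may_sync(service: str, alerts_body: str, hotfix: bool = False) -> bool:
    """Hotfixes always pass; everything else waits for the freeze to lift."""
    return hotfix or service not in frozen_services(alerts_body)


alerts_body = json.dumps([
    {"labels": {"alertname": "CheckoutBudgetExhausted", "gate": "freeze",
                "sloth_service": "checkout"},
     "status": {"state": "suppressed"}},
])
print(may_sync("checkout", alerts_body))               # False
print(may_sync("checkout", alerts_body, hotfix=True))  # True
print(may_sync("cart", alerts_body))                   # True
```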

Within a quarter the page-on-noise rate dropped sharply and, more importantly, the two budget-blown incidents that followed were caught by the short window in minutes, not over a weekend.

Verify

Confirm the pipeline end to end before trusting it:

Checklist
