SLOs as Code: Authoring SLIs with OpenSLO and Generating Burn-Rate Alerts via Sloth and Pyrra

Hand-written burn-rate alerts are a maintenance liability. Google’s multi-window multi-burn-rate recipe is four alert rules per objective, each depending on recording rules spread across seven window lengths, and every one of them has to recompute the right threshold from the SLO target. Author that by hand for one service and it is a pleasant afternoon; author it for sixty services and you have a generator’s worth of copy-paste with subtle per-service bugs nobody finds until the page fails to fire. The threshold 14.4 in a page rule is not a number you should ever type twice, let alone two hundred times.

The fix is to treat the objective as the source of truth and let a tool emit the PromQL. This article builds that pipeline end to end. You describe the SLI and the target once in the vendor-neutral OpenSLO spec, a Kubernetes-style YAML with kind: SLI, kind: SLO, and kind: AlertPolicy. From there you compile those documents into Prometheus recording and burn-rate alert rules with Sloth, run Pyrra for a live error-budget UI and an alternative rule generator, wire the resulting alerts into Alertmanager with proper routing and inhibition, and gate all of it behind CI so a malformed spec or a bad metric reference never reaches production. The whole thing lives in Git, gets pull-request review like any other code, and turns “reliability policy” from a slide deck into a set of files that a machine enforces.

The assumption throughout is Prometheus (or a compatible store like Grafana Mimir or Thanos) with the Prometheus Operator managing PrometheusRule objects, Sloth v0.11+, Pyrra v0.7+, and OpenSLO spec v1. Where the mechanics of the burn-rate math and the meaning of an error budget are covered in depth elsewhere — see SLOs and Error Budgets in Practice: Defining SLIs and Building Multi-Window Burn-Rate Alerts — this article is about the specification and generation layer: the spec formats, the generators, the diffs between them, and the pipeline that ties it all together. You will finish able to author an SLI as a reusable object, compile it three different ways, choose deliberately between Sloth, Pyrra, and raw OpenSLO, and defend that choice in a design review.

What problem this solves

The burn-rate alerting pattern is now standard practice, and that is exactly the problem. The pattern is correct and mechanical, which means humans should not be executing it. A single availability SLO at 99.9% over 28 days expands, under the canonical Google SRE workbook recipe, into roughly a dozen Prometheus rules: SLI error-ratio recording rules at seven window lengths (5m, 30m, 1h, 2h, 6h, 1d, 3d), two metadata rules carrying the objective and the error budget, and four alert rules, each pairing a long and a short window. Each alert rule embeds a burn-rate threshold (14.4, 6, 3, or 1) derived arithmetically from the share of the budget that alert may consume within its window; multiplied by the error budget, it becomes the error ratio the rule compares against. Get one threshold wrong and the alert either pages on noise or, far worse, stays silent through a real burn.
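The thresholds quoted above fall out of a one-line formula. The sketch below assumes the SRE workbook's convention of a 30-day period and its budget-consumption targets (2% in 1h, 5% in 6h, 10% in 1d, 10% in 3d). The function and variable names are illustrative and do not belong to any tool.

```python
# Burn-rate threshold = fraction of budget consumed * (SLO period / alert window).
# Assumption: a 30-day period and the workbook's budget-consumption targets.

PERIOD_HOURS = 30 * 24  # assumed SLO period; a generator may use a different one


def burn_rate_threshold(budget_consumed: float, window_hours: float,
                        period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate at which `budget_consumed` of the budget is spent in `window_hours`."""
    return budget_consumed * period_hours / window_hours


def error_ratio_threshold(burn_rate: float, target: float) -> float:
    """Error ratio an alert expression compares against: burn rate times the budget."""
    return burn_rate * (1 - target)


# (fraction of budget consumed, long window in hours)
RECIPE = [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]
THRESHOLDS = [round(burn_rate_threshold(b, w), 4) for b, w in RECIPE]

if __name__ == "__main__":
    for (consumed, window), rate in zip(RECIPE, THRESHOLDS):
        ratio = error_ratio_threshold(rate, target=0.999)
        print(f"{window:>2}h window: burn rate {rate}, error ratio above {ratio:.5f}")
```

With a 28-day period the same formula gives 13.44 for the first row rather than 14.4, so it is worth checking which period your generator assumes before comparing numbers.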

What breaks without a generator is not any single rule; it is consistency at scale. Team A writes their rules in Jsonnet, team B in raw YAML, team C copy-pastes from a wiki that is two revisions stale. The long-window page threshold that should key off a 1h window keys off 6h in one service because someone fat-fingered a mixin parameter, and a checkout degradation burns a third of the monthly budget over a weekend before the smoothed slow window crosses threshold. Nobody audits four hundred alert rules by hand. The result is an alerting estate that looks uniform in the runbook and is quietly divergent in production — the single most dangerous state for an on-call system, because you trust it precisely when you should not.

Who hits this: any platform or SRE team running SLO-based alerting across more than a handful of services on Prometheus. It bites hardest on organizations that adopted burn-rate alerting manually (the pattern spread faster than the tooling), teams running multiple Prometheus shards behind Thanos or Mimir where rule drift is invisible across shards, and anyone whose “SLO definitions” live in dashboards and tribal memory rather than in reviewable files. The fix is not “write better rules.” It is “stop writing rules” — declare the objective, generate the rules, and make the generator the only thing that ever types a window length or a burn threshold.

To frame the whole field before the deep dive, here is the layered model this article builds, from the human-authored top to the machine-consumed bottom:

Layer	What lives here	Who authors it	Format	Consumed by
Intent	“Checkout should succeed 99.9% of the time”	Product + SRE	English (in the spec’s `description`)	Humans in review
Specification	SLI query, target, window, alert policy	SRE / service owner	OpenSLO / Sloth / Pyrra YAML	The generator
Generation	Recording + burn-rate alert rules	A tool (Sloth / Pyrra)	Prometheus rule YAML	Prometheus
Evaluation	Rule execution, series, alert firing	Prometheus	PromQL	Alertmanager + dashboards
Routing	Page vs ticket, grouping, inhibition	SRE	Alertmanager config	PagerDuty / Slack
Enforcement	Budget check gating a deploy	CI/CD	Pipeline step	Argo CD / GitHub Actions

Learning objectives

By the end of this article you can:

Author an SLI as a reusable declarative object — availability, latency, and quality/correctness — choosing correctly between a ratio indicator (good/total events) and a threshold indicator, and know why latency SLIs are count-based ratios over histogram buckets, never percentiles.
Model objectives, rolling windows, targets, and budgeting methods (Occurrences vs Timeslices) in the OpenSLO kind: SLO and kind: SLI documents, and attach a kind: AlertPolicy with burn-rate conditions.
Compile SLO definitions into Prometheus recording rules and the four multi-window multi-burn-rate alert rules using Sloth — via the CLI, via OpenSLO input, and via the PrometheusServiceLevel Kubernetes controller — and read the generated slo:sli_error:ratio_rate* series.
Run Pyrra for a live error-budget UI and as an alternative rule generator through its ServiceLevelObjective CRD, and know when to run Sloth and Pyrra together.
Choose deliberately between OpenSLO, Sloth, and Pyrra on the axes that matter — portability, generation, UI, ecosystem — and justify the choice.
Wire the generated severity: page / severity: ticket alerts into Alertmanager with grouping, routing, and an inhibition rule that suppresses tickets during an active page.
Version SLO specs in Git, validate them in CI with sloth validate, promtool check rules, and promtool test rules, keep generated rules out of source control, and gate release promotion on remaining error budget.

Prerequisites & where this fits

You should already be fluent with Prometheus and PromQL — rate(), sum(), histogram _bucket/_count series, and on()/group_left vector matching. If any of that is shaky, PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work is the upstream read, and Scaling Prometheus: Recording Rules, Remote-Write, and Long-Term Storage with Thanos and Mimir covers the recording-rule machinery these tools generate. You should understand what an SLI, SLO, and error budget are, and the intuition behind multi-window multi-burn-rate alerting; this article does not re-derive the burn-rate math, and SLOs and Error Budgets in Practice is the conceptual companion. Familiarity with the Prometheus Operator (PrometheusRule, ServiceMonitor) and basic kubectl helps, because both Sloth and Pyrra ship Kubernetes controllers.

This sits in the Observability & SRE track, specifically the reliability engineering layer that sits on top of raw metrics and below incident response. Upstream of it is the metrics pipeline (scraping, recording rules, long-term storage). Downstream of it is on-call — Building an On-Call Practice: PagerDuty Escalation, Alert Routing, and Actionable Runbooks — and the routing layer, Designing Alertmanager Routing Trees: Grouping, Inhibition, Silences, and Dedup, which consumes the labels the generators emit. The dashboards that visualize the generated budget series are their own discipline: Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code. Where this fits in one line: it is the compiler between reliability policy and the alerting/dashboard estate.

A quick map of who owns what across the pipeline, so responsibilities are clear before you build it:

Concern	Artifact	Owner	Reviewed by	Changes how often
What “good” means	OpenSLO `kind: SLI`	Service owner	SRE	Rarely (indicator is stable)
The target and window	OpenSLO `kind: SLO`	Product + SRE	Both	Quarterly, deliberately
The alert policy	`kind: AlertPolicy` / Sloth `alerting:`	SRE	SRE	Rarely (standard windows)
The generated rules	`PrometheusRule`	The generator (nobody)	CI	Every spec change
Routing	Alertmanager config	SRE	On-call leads	As teams/receivers change
The budget policy	Runbook + gate script	Eng leadership + SRE	Both	Set once, enforced always

Core concepts

Six mental models make every later section obvious.

An SLI is a ratio of good events over valid events — the discipline is entirely in the nouns. An SLI answers “what fraction of the things users asked for went well?” The numerator is good events, the denominator is valid events, and the entire art is choosing those two sets so the ratio moves when — and only when — users hurt. Everything downstream (the target, the budget, the alert) is arithmetic on this ratio; if the ratio is wrong, no amount of generation saves you.

“As code” means the objective is the source of truth, and rules are a build artifact. The inversion at the heart of this article: you do not author Prometheus rules and hope they encode the objective; you author the objective and generate the rules. Rules become like compiled binaries — regenerated deterministically, never edited by hand, never committed to source control. The spec is the source; the rules are dist/. This is what makes drift impossible: no human types a window length or a burn threshold, ever.

Ratio SLIs and threshold SLIs are the two indicator shapes. A ratio metric SLI divides one counter (good, or bad) by another (total) — availability and count-based latency both fit. A threshold metric SLI compares a single gauge against a bound and counts the fraction of time (or samples) it stays within — useful for queue depth, saturation, or freshness where there is no natural “good count / total count”. OpenSLO models both (ratioMetric and thresholdMetric); Sloth and Pyrra are ratio-first. Knowing which shape your indicator is determines the whole spec.

A latency SLI is a count-based ratio over a histogram bucket, never a percentile. The trap: “99.5% of requests under 300ms” is not histogram_quantile(0.995, ...) < 0.3. It is good = count of requests with duration <= 300ms (read straight from the le="0.3" bucket) over total = all requests. A percentile answers “how slow is the 99.5th-slowest request”; the SLI answers “what fraction were fast enough” — different questions, and only the second one composes into an error budget. Build a latency SLI on a percentile and the budget arithmetic is meaningless.

Multi-window multi-burn-rate is a fixed table the generator owns. One threshold cannot be both fast and quiet. A high burn rate over a short window catches a sharp outage in minutes; a low burn rate over a long window catches a slow bleed without paging on every transient blip. The canonical recipe pairs a long and short window per severity so an alert fires only when both agree there is sustained burn. This table — windows and burn-rate thresholds — is identical across every SLO of a given target, which is precisely why it belongs baked into a generator and unit-tested upstream, not retyped per service.
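The "both windows must agree" rule is easy to state and easy to get subtly wrong, so here is a minimal sketch of it. The window pairs and thresholds are the canonical ones; the `WindowPair` type, the `firing_severities` function, and the input dictionary of error ratios keyed by window are hypothetical names for illustration.

```python
# Multi-window multi-burn-rate check: a pair fires only when BOTH its long
# and its short window exceed the same burn-rate threshold.
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPair:
    long: str
    short: str
    burn_rate: float
    severity: str


# The canonical table; window names are plain dictionary keys here.
TABLE = [
    WindowPair("1h", "5m", 14.4, "page"),
    WindowPair("6h", "30m", 6.0, "page"),
    WindowPair("1d", "2h", 3.0, "ticket"),
    WindowPair("3d", "6h", 1.0, "ticket"),
]


def firing_severities(error_ratio: dict[str, float], target: float) -> set[str]:
    """Return the severities for which at least one window pair is firing."""
    budget = 1 - target
    fired = set()
    for pair in TABLE:
        limit = pair.burn_rate * budget
        if error_ratio[pair.long] > limit and error_ratio[pair.short] > limit:
            fired.add(pair.severity)
    return fired
```

The short window is what lets the alert reset quickly: once the last five minutes are clean, the 1h/5m pair stops firing even though the one-hour ratio is still elevated.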

The budgeting method decides how budget is consumed. Occurrences (the default for request-driven SLIs) counts every bad event against the budget — one failed request spends one request’s worth of budget. Timeslices counts every bad window — if a one-minute slice violates the target, the whole minute is bad, regardless of how many requests it held. Occurrences is right for high-throughput APIs; Timeslices suits low-traffic or time-based services (a batch that must finish by 06:00) where “the slice was bad” is the meaningful unit.
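The two methods can disagree on the same traffic, which a small worked example makes visible. The numbers below are invented, and the per-slice target (0.99 here) is an assumption about how a slice is judged good or bad. Slices with no traffic are treated as good in this sketch.

```python
# Occurrences vs Timeslices on the same synthetic traffic (illustrative numbers).
# Each tuple is one one-minute slice: (total requests, bad requests).

def occurrences_bad_fraction(slices: list[tuple[int, int]]) -> float:
    """Bad events divided by all events across the whole period."""
    total = sum(t for t, _ in slices)
    bad = sum(b for _, b in slices)
    return bad / total


def timeslices_bad_fraction(slices: list[tuple[int, int]], slice_target: float) -> float:
    """Fraction of slices whose own success ratio fell below `slice_target`."""
    bad_slices = sum(1 for t, b in slices if t > 0 and (t - b) / t < slice_target)
    return bad_slices / len(slices)


# 98 quiet healthy minutes, then two busy minutes with 5% errors.
slices = [(100, 0)] * 98 + [(10_000, 500)] * 2

occ = occurrences_bad_fraction(slices)       # weighted by request volume
ts = timeslices_bad_fraction(slices, 0.99)   # each bad minute counts once
```

Here Occurrences reports about 3.4% bad because the failures landed in the busiest minutes, while Timeslices reports 2% because only two of a hundred minutes were bad. Had the same two minutes been quiet ones, the ordering would flip.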

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the model side by side:

Concept	One-line definition	Where it lives	Why it matters
SLI	Ratio of good events over valid events	`kind: SLI` / `sli:` block	The measurement; wrong SLI = wrong everything
SLO	An SLI plus a target over a window	`kind: SLO`	The promise; source of the budget
Objective / target	The number (e.g. 0.999) the SLI must meet	`objectives:` / `objective:`	`budget = 1 - target`
Error budget	Allowed bad fraction = `1 - target`	Derived; a metadata rule	The currency you spend
Ratio metric	good/total counter division	`ratioMetric`	Availability, count-based latency
Threshold metric	A gauge compared to a bound	`thresholdMetric`	Saturation, freshness, depth
Budgeting method	How budget is consumed	`budgetingMethod`	`Occurrences` vs `Timeslices`
Burn rate	How fast budget is spent vs. the flat rate	Generated alert expr	14.4× = budget gone in ~2 days
AlertPolicy	Conditions that trigger notification	`kind: AlertPolicy`	Ties the SLO to paging
Recording rule	Precomputed SLI ratio at a window	Generated `PrometheusRule`	`slo:sli_error:ratio_rate5m`
Sloth	OpenSLO/native → Prometheus rule compiler	CLI + K8s controller	The canonical generator
Pyrra	Compiler + live error-budget UI	CRD + UI + filesystem	Operator-facing view
OpenSLO	Vendor-neutral SLO spec	YAML documents	The portable source format

Authoring SLIs: ratio vs threshold, good vs total

The SLI is the load-bearing decision. Get it right and the target is a negotiation; get it wrong and every alert lies. Three indicator families cover almost every request-driven service, and each has a canonical shape.

Availability — the ratio of not-errors

Availability is the fraction of valid requests served without a server-side error. The clean definition is good = requests that are not 5xx, total = all requests that reached the service. Two subtleties that separate a correct SLI from a plausible-looking wrong one:

Measure as close to the user as you own the signal. The ingress or load balancer sees what the client experiences; the app’s own histogram only sees requests that already arrived and got dispatched. A request that the app never processed (connection dropped at the LB, worker saturated) is invisible to the app-level metric but very visible to the user. Prefer the edge signal (nginx_ingress_controller_requests, Envoy, an ALB metric) for the user-facing SLO; keep deeper app signals for debugging.
Exclude what is not yours. A 4xx is the client’s fault — a malformed request, a 404 for a resource that never existed. Counting 4xx as bad burns budget for the client’s mistakes; counting it as good but keeping it in the denominator dilutes the ratio. The usual choice is to exclude 4xx from the failure numerator while deciding deliberately whether it stays in the denominator. Always drop health checks and synthetics — they inflate total and mask real pain.
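To see how much the treatment of 4xx moves the number, here is a small sketch over invented per-status-class counts. The `availability` function and its flag are hypothetical. It only illustrates the denominator decision described above.

```python
# Availability from per-status-class request counts (illustrative numbers).
# Shows how keeping or dropping 4xx from the denominator changes the ratio.

def availability(counts: dict[str, int], include_4xx_in_total: bool) -> float:
    """Bad = 5xx; valid = all requests, optionally minus 4xx; good = valid - bad."""
    bad = counts.get("5xx", 0)
    valid = sum(counts.values())
    if not include_4xx_in_total:
        valid -= counts.get("4xx", 0)
    if valid == 0:
        raise ValueError("no valid requests")
    return (valid - bad) / valid


counts = {"2xx": 9_000, "4xx": 900, "5xx": 100}

with_4xx = availability(counts, include_4xx_in_total=True)      # 9900 / 10000
without_4xx = availability(counts, include_4xx_in_total=False)  # 9000 / 9100
```

The same hundred failures read as 99.0% with 4xx in the denominator and about 98.9% without it. Against a 99.9% objective that gap is a meaningful slice of the budget, which is why the choice should be deliberate and written into the spec's description.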

Latency — the count-based ratio you must not turn into a percentile

Latency is the fraction of valid requests at or under a threshold T. This is a count, read from the histogram bucket at le="T", divided by the total request count:

# GOOD latency SLI: fraction of requests at or under 300ms.
# Numerator reads the cumulative bucket at le="0.3"; denominator is _count.
sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.3"}[5m]))
  /
sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))

Never build a latency SLI on histogram_quantile(). The percentile tells you how slow the tail is; the SLI must tell you what fraction was fast enough, and only the latter composes into a budget you can spend and track. If your metric is a native histogram (Prometheus 2.40+) rather than classic buckets, the bucket selection differs but the principle is identical — count the fast requests, divide by all requests.
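The count-based definition reduces to a single division once the bucket counts are in hand. The sketch below uses invented cumulative bucket counts and a hypothetical `latency_sli` helper; the one real constraint it encodes is that the threshold must sit on an existing bucket boundary.

```python
# Count-based latency SLI from cumulative histogram buckets (Prometheus "le" semantics).
# The bucket counts below are made-up illustration data.

def latency_sli(cumulative_buckets: dict[float, int], total: int, threshold: float) -> float:
    """Fraction of requests with duration <= threshold.

    `cumulative_buckets` maps an upper bound (le) to the count of requests at
    or under that bound, so the good count is read directly from one bucket.
    """
    if threshold not in cumulative_buckets:
        raise KeyError(f"no bucket boundary at le={threshold}; pick T on a bucket edge")
    if total == 0:
        raise ValueError("no requests observed")
    return cumulative_buckets[threshold] / total


buckets = {0.1: 9_000, 0.3: 9_930, 1.0: 9_990}  # cumulative counts per upper bound
total_requests = 10_000

sli = latency_sli(buckets, total_requests, threshold=0.3)
budget_spent = (1 - sli) / (1 - 0.995)  # share of a 99.5% objective's budget used
```

In this sample 99.3% of requests were fast enough, which is 1.4 times the budget a 99.5% objective allows. A percentile over the same data would give a duration, and there is no way to subtract a duration from a budget.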

Quality / correctness — the failures a status code cannot see

The SLI that most teams miss. A request can return 200 and still be a failure: a recommendation API that degrades to an empty list under load, a search that returns stale results, a payment that “succeeds” but writes an inconsistent record. The availability SLI cannot see any of these because HTTP is happy. The fix is to emit an explicit signal — a result="degraded" label, a correct="false" counter — and count it as bad. This is where the good/total framing earns its keep: good is not “returned 200”, it is “returned a usable answer”, and you get to define usable.

The indicator families side by side, with the canonical query shape for each:

SLI family	Good events	Valid (total) events	Metric type	Common mistake
Availability	Requests not `5xx`	All requests at the edge	Counter ratio	Counting `4xx` as failures
Latency	Requests `<= T` (bucket `le="T"`)	All requests (`_count`)	Counter ratio (histogram)	Using a percentile instead of a bucket count
Quality / correctness	Requests with usable result	All requests	Counter ratio (custom label)	Trusting `200` as “good”
Freshness / lag	Time since update `<= T`	Sampling windows	Threshold (gauge)	Ratio-ing a gauge that has no “total”
Throughput / saturation	Fraction of time under limit	Sampling windows	Threshold (gauge)	Alerting on the raw gauge, not the SLO

And the two indicator shapes, so you pick the right OpenSLO block:

Property	Ratio metric	Threshold metric
OpenSLO block	`ratioMetric` (`good`/`total` or `bad`/`total`)	`thresholdMetric` + `op` + `value`
Natural for	Availability, latency, correctness	Queue depth, freshness, saturation
Underlying data	Two counters	One gauge
“Good” defined as	Count in the good set	Samples satisfying the comparison
Sloth support	Native (`events` / `raw`)	Not first-class (model as ratio)
Pyrra support	`ratio` / `latency` indicators	`latency`/`bool_gauge` variants
Budget meaning	Fraction of bad events	Fraction of bad windows/samples

One test keeps any SLI honest, and it cuts both ways. Run it mentally before writing a line of YAML:

If you can imagine the SLI looking healthy while users are unhappy — or looking sick while users are fine — the indicator is wrong. A latency SLI that ignores the empty-result degradation, an availability SLI that counts synthetic health checks, a correctness SLI missing entirely — each passes review and fails users. Fix the indicator before you argue about the target.

Modeling objectives in the OpenSLO spec

OpenSLO is a vendor-neutral, Kubernetes-style YAML specification for SLOs. Its value is decoupling: the SLI and target live in documents that are independent of whichever engine compiles them, so you can move from Sloth to Pyrra to a commercial platform (Nobl9, which co-authored the spec) without rewriting your intent. The current stable version is openslo/v1, and it defines several kinds that compose: DataSource, SLI, SLO, Service, AlertPolicy, AlertCondition, and AlertNotificationTarget.

The `kind: SLI` — a reusable indicator

Define the availability SLI once as its own object so many SLOs can reference it. The ratioMetric with counter: true tells consumers these are monotonic counters to be rate()-d:

# slis/checkout-availability.openslo.yaml
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  description: Fraction of checkout requests not served as 5xx, measured at ingress.
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(nginx_ingress_controller_requests{service="checkout", status!~"5.."}[1m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(nginx_ingress_controller_requests{service="checkout"}[1m]))

You can express the same indicator as bad/total instead of good/total — Sloth internally works in error ratio, so a bad definition maps most directly, but either is valid and the compiler handles the arithmetic.

The `kind: SLO` — target, window, and budgeting

Bind the SLI to a target and a window in an SLO object. indicatorRef points at the reusable SLI by name; budgetingMethod: Occurrences consumes budget per bad event (the right default for request SLIs); the window is a rolling 28 days — long enough that a single failure does not blow a tight target, short enough to stay relevant to a monthly cadence:

# slos/checkout-availability.openslo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  service: checkout
  indicatorRef: checkout-availability
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: 99.9% of checkout requests succeed
      target: 0.999

For the latency SLO, the SLI counts requests below the threshold by reading the histogram bucket directly. The le="0.3" selector means “300 ms or faster”, and the denominator is the histogram’s _count series:

# slos/checkout-latency.openslo.yaml
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-latency-300ms
spec:
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.3"}[1m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(http_request_duration_seconds_count{service="checkout"}[1m]))
---
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-latency
spec:
  service: checkout
  indicatorRef: checkout-latency-300ms
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: 99.5% of checkout requests under 300ms
      target: 0.995

The `kind: AlertPolicy` — connecting the SLO to paging

OpenSLO models alerting as its own kind, so the policy (burn-rate conditions, notification targets) is decoupled from the SLO and reusable across many objectives. An AlertCondition expresses a burn-rate threshold over a lookback window, plus an alertAfter duration the condition must hold before it triggers; an AlertPolicy bundles conditions with notification targets:

# alerts/fast-burn.openslo.yaml
apiVersion: openslo/v1
kind: AlertCondition
metadata:
  name: fast-burn
spec:
  description: Page when burning budget fast enough to exhaust it in ~2 days.
  severity: page
  condition:
    kind: burnrate
    op: gte
    threshold: 14.4
    lookbackWindow: 1h
    alertAfter: 5m
---
apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: checkout-alerts
spec:
  alertWhenBreaching: true
  conditions:
    - conditionRef: fast-burn
  notificationTargets:
    - targetRef: payments-pagerduty
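What a consumer does with such a condition can be sketched in a few lines. This is a hypothetical evaluator, not the behavior of any particular OpenSLO backend: the dictionary mirrors the condition fields above, and the list of samples stands in for burn rates observed across the alertAfter period.

```python
# Evaluating an OpenSLO-style burn-rate condition against observed burn rates.
# The dict mirrors the AlertCondition fields; the evaluator itself is hypothetical.
import operator

OPS = {"lt": operator.lt, "lte": operator.le, "gt": operator.gt, "gte": operator.ge}

FAST_BURN = {"kind": "burnrate", "op": "gte", "threshold": 14.4}


def condition_holds(condition: dict, burn_rate: float) -> bool:
    """True when a single burn-rate sample satisfies the condition."""
    if condition["kind"] != "burnrate":
        raise ValueError(f"unsupported condition kind: {condition['kind']}")
    return OPS[condition["op"]](burn_rate, condition["threshold"])


def should_alert(condition: dict, samples: list[float]) -> bool:
    """True when every sample across the alertAfter span satisfies the condition.

    An empty list never alerts, so missing data does not page by accident.
    """
    return bool(samples) and all(condition_holds(condition, s) for s in samples)
```

Requiring every sample to hold is the strictest reading of alertAfter; a real backend may evaluate it differently, so treat this only as a way to reason about the fields.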

Not every generator consumes every OpenSLO kind. Sloth reads SLI and SLO and emits its own alerting from the spec’s target; the AlertPolicy/AlertCondition kinds are most fully honored by the reference oslo CLI and commercial backends. This is the first place where “portable spec” and “what a given tool actually does with it” diverge — a theme the tooling-comparison section makes explicit. The OpenSLO kinds and their support status, so you author only what your generator reads:

OpenSLO kind	Purpose	Sloth reads it	Pyrra equivalent	`oslo` CLI
`SLI`	Reusable indicator (good/total)	Yes	Inline in CRD	Yes
`SLO`	Target + window + budgeting	Yes	`ServiceLevelObjective`	Yes
`Service`	Groups SLOs under a service	As a label	`metadata.labels`	Yes
`AlertPolicy`	Bundles alert conditions	Uses target, not policy	Generated from target	Yes
`AlertCondition`	A single burn-rate rule	Implicit (fixed table)	Implicit (fixed table)	Yes
`AlertNotificationTarget`	Where alerts go	No (Alertmanager’s job)	No (Alertmanager’s job)	Yes
`DataSource`	Reusable metric source	Partially	`prometheus-url` flag	Yes

The budgeting-method choice, concretely, because it changes what a bad minute costs:

Aspect	`Occurrences`	`Timeslices`
Unit of consumption	One bad event	One bad time-slice
Right for	High-throughput request APIs	Low-traffic / time-based services
Budget of 0.1% means	0.1% of requests may fail	0.1% of slices may violate
A 1-min total outage at 1000 rps	Spends ~60,000 events of budget	Spends 1 slice of budget
Sensitivity to volume	Scales with traffic	Independent of traffic
Sloth default	Yes (events-based)	Not the native model

A target of 0.999 over 28 days is an error budget of 0.1%. Against a checkout that does roughly 50M requests a month, that is a budget of about 50,000 bad requests — a quantity you can spend, track, and reason about, not a vague “43 minutes of downtime”. That subtraction, budget = 1 - target, and its translation from a percentage into a count of events, is the whole point of modeling the objective as data.
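The translation from a target into something countable is simple enough to keep in a helper. The function names below are illustrative; the inputs are the figures from the paragraph above.

```python
# Turning a target into an error budget, then into events and minutes.

def error_budget(target: float) -> float:
    """Allowed bad fraction: 1 - target."""
    return 1 - target


def budget_events(target: float, requests_in_window: int) -> float:
    """How many bad requests the budget allows over the window."""
    return error_budget(target) * requests_in_window


def budget_minutes(target: float, window_days: int) -> float:
    """The same budget expressed as minutes of total outage over the window."""
    return error_budget(target) * window_days * 24 * 60


events = round(budget_events(0.999, 50_000_000))  # bad requests allowed
minutes_28d = round(budget_minutes(0.999, 28), 2)  # minutes over a 28-day window
minutes_30d = round(budget_minutes(0.999, 30), 2)  # minutes over a 30-day window
```

The familiar 43 minutes corresponds to a 30-day window; over 28 days the same target allows about 40. The event count is the more useful figure for a request-driven service because it does not pretend that every minute carries the same traffic.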

Generating Prometheus rules with Sloth

Sloth turns an SLO spec into the full set of Prometheus recording rules and the four multi-window multi-burn-rate alert rules. It speaks its own denser prometheus/v1 format and also reads OpenSLO directly, so you can standardize on OpenSLO as the source and use Sloth purely as the compiler.

The native Sloth spec

Sloth’s native format gives direct control over alert names and labels. The {{.window}} templating is the key move — Sloth fills it with each window length it needs (5m, 30m, 1h, 2h, 6h, 1d, 3d), so you write the query shape once and the generator produces every windowed variant:

# sloth/checkout.sloth.yaml
version: "prometheus/v1"
service: "checkout"
labels:
  team: "payments"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "Checkout availability measured at ingress."
    sli:
      events:
        error_query: |
          sum(rate(nginx_ingress_controller_requests{service="checkout", status=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(nginx_ingress_controller_requests{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailabilityBurn
      labels:
        category: availability
      annotations:
        summary: "Checkout is burning its availability error budget."
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket

Sloth offers three SLI input styles, and choosing the right one avoids fighting the tool:

Sloth SLI style	Shape	Use when	Note
`events`	`error_query` + `total_query`	You have separate bad and total counters	Most common; Sloth divides them
`raw`	A single `error_ratio_query`	You already compute the error ratio	You own the `[{{.window}}]` placement
Plugin (`plugin`)	Named reusable Go plugin	Standard SLIs repeated across services	Central library of indicator shapes

Compiling and validating

Generate the PrometheusRule-ready output from either the native spec or an OpenSLO document — the output is identical:

# Native Sloth spec -> Prometheus rules
sloth generate -i sloth/checkout.sloth.yaml -o rules/checkout.rules.yaml

# OpenSLO spec -> the same output, no rewrite needed
sloth generate \
  --input slos/checkout-availability.openslo.yaml \
  --out rules/checkout-availability.rules.yaml

Validate before you ship — this is your CI gate. validate walks a directory and exits non-zero on any malformed spec:

# Validate every spec under a directory; non-zero exit on failure
sloth validate -i ./sloth/

What Sloth emits

Per SLO, Sloth produces three tiers of rules. Understanding the generated series names is essential because your dashboards and budget queries reference them directly:

Rule group	Example series	What it holds	Windows
SLI error ratios	`slo:sli_error:ratio_rate5m`	Bad/total error ratio at a window	5m, 30m, 1h, 2h, 6h, 1d, 3d
Metadata	`slo:objective:ratio`	The target (e.g. 0.999) as a series	Constant
Metadata	`slo:error_budget:ratio`	The budget (`1 - target`) as a series	Constant
Metadata	`slo:time_period:days`	The window length in days	Constant
Metadata	`slo:current_burn_rate:ratio`	Live burn rate for dashboards	Recomputed
Metadata	`slo:period_burn_rate:ratio`	Burn rate over the whole window	Recomputed
Metadata	`slo:period_error_budget_remaining:ratio`	Fraction of budget left	Recomputed
Alerts	`CheckoutAvailabilityBurn` (page/ticket)	The four MWMBR alert rules	1h/5m, 6h/30m, 1d/2h, 3d/6h
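The relationship between the period burn rate and the remaining budget is worth seeing as arithmetic. The sketch assumes the period burn rate is the error ratio over the whole window divided by the budget, and that the remaining budget is one minus that. This is my reading of what the series names imply, so check the generated rules for the exact expressions.

```python
# Period burn rate and remaining budget, under the assumptions stated above.

def period_burn_rate(period_error_ratio: float, target: float) -> float:
    """Error ratio over the whole window, expressed as a multiple of the budget."""
    return period_error_ratio / (1 - target)


def budget_remaining(period_error_ratio: float, target: float) -> float:
    """Fraction of the budget left; negative once the budget is overspent."""
    return 1 - period_burn_rate(period_error_ratio, target)


healthy = round(budget_remaining(0.0004, 0.999), 4)    # 0.04% errors so far
overspent = round(budget_remaining(0.0015, 0.999), 4)  # 0.15% errors so far
```

A burn rate of 1 over the period means the budget lands at exactly zero when the window closes. The healthy case leaves 60% of the budget, while the overspent case is already 50% past it.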

Two details that hand-rolled rules almost always miss and Sloth gets right for free:

A SLOMetricAbsent alert that fires when the SLI series disappears entirely. A rule that silently stops evaluating because the metric was renamed is worse than a broken rule — it looks healthy and pages never. Sloth’s absent-metric alert catches exactly this.
Consistent labels across every rule (sloth_service, sloth_slo, sloth_id, plus your custom labels), which is what makes routing, grouping, and inhibition in Alertmanager tractable — you route on sloth_service, not on a fragile alertname regex.

The Kubernetes controller

In a cluster, skip the CLI and run the controller. It reconciles a sloth.slok.dev/v1 PrometheusServiceLevel custom resource into a managed monitoring.coreos.com/v1 PrometheusRule that the Prometheus Operator then loads:

# k8s/checkout-psl.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout
  namespace: monitoring
spec:
  service: "checkout"
  labels:
    team: "payments"
  slos:
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          error_query: sum(rate(nginx_ingress_controller_requests{service="checkout", status=~"5.."}[{{.window}}]))
          total_query: sum(rate(nginx_ingress_controller_requests{service="checkout"}[{{.window}}]))
      alerting:
        name: CheckoutAvailabilityBurn
        page_alert:
          labels: {severity: page}
        ticket_alert:
          labels: {severity: ticket}

kubectl apply -f k8s/checkout-psl.yaml
# The controller (deployed once, cluster-wide) writes the PrometheusRule:
kubectl get prometheusrules -n monitoring
# checkout   1   30s

The two Sloth delivery modes, and when each fits:

Mode	Input	Output	Best for	Trade-off
CLI (`sloth generate`)	Spec file(s)	Rule YAML on disk	CI pipelines, GitOps repos	You wire the deploy of the output
Controller (`PrometheusServiceLevel`)	CR in the cluster	Managed `PrometheusRule`	Operator-based clusters	Runtime dependency on the controller

Running Pyrra for live error-budget dashboards

Sloth is a compiler; Pyrra is a compiler and a UI. It reads its own ServiceLevelObjective CRD, generates the equivalent burn-rate recording rules, and serves a dashboard showing remaining budget, current burn rate, and projected exhaustion per objective. Many teams run both: Sloth as the canonical rule generator that CI enforces, Pyrra for the operator-facing view during incidents.

The Pyrra `ServiceLevelObjective`

The Pyrra objective is intentionally terse — target is a string percentage and window accepts week units. Note that Pyrra takes the metric selector directly and builds the rate()/sum() itself, where Sloth takes the full query:

# pyrra/checkout-availability.yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-availability
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: "99.9"
  window: 4w
  description: Checkout availability at ingress.
  indicator:
    ratio:
      errors:
        metric: nginx_ingress_controller_requests{service="checkout", status=~"5.."}
      total:
        metric: nginx_ingress_controller_requests{service="checkout"}
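Because the objective carries only selectors, the generator has to wrap them itself. The sketch below shows roughly what that expansion looks like. The function is hypothetical and the output shape is an assumption for illustration, not Pyrra's exact generated PromQL.

```python
# Hypothetical expansion of bare metric selectors into an error-ratio expression.
# The output shape is an assumption, not Pyrra's exact generated PromQL.

ERRORS = 'nginx_ingress_controller_requests{service="checkout", status=~"5.."}'
TOTAL = 'nginx_ingress_controller_requests{service="checkout"}'


def ratio_expr(errors_selector: str, total_selector: str, window: str) -> str:
    """Wrap both selectors in sum(rate(...)) over `window` and divide them."""
    return (
        f"sum(rate({errors_selector}[{window}]))"
        f" / sum(rate({total_selector}[{window}]))"
    )


expr_5m = ratio_expr(ERRORS, TOTAL, "5m")
```

The practical consequence is that anything beyond a selector, such as a label_replace or a join against another series, has no place to go in this style of spec and would need to happen in a recording rule first.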

The latency indicator uses the histogram bucket for success and the _count series for total — the same count-based approach, expressed in Pyrra’s latency indicator:

# pyrra/checkout-latency.yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-latency
  namespace: monitoring
spec:
  target: "99.5"
  window: 4w
  indicator:
    latency:
      success:
        metric: http_request_duration_seconds_bucket{service="checkout", le="0.3"}
      total:
        metric: http_request_duration_seconds_count{service="checkout"}
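
The count-based approach reduces to simple arithmetic on two counters, which the following Python sketch works through. The request counts are made up for illustration, and latency_error_ratio is a hypothetical helper, not something Pyrra exposes.

```python
def latency_error_ratio(fast_requests: float, total_requests: float) -> float:
    """Fraction of requests slower than the threshold.

    fast_requests: requests counted in the cumulative le="0.3" bucket.
    total_requests: value of the matching _count series.
    """
    if total_requests <= 0:
        raise ValueError("no traffic: the ratio is undefined")
    return 1.0 - fast_requests / total_requests


# Hypothetical counts over one window: 9,960 of 10,000 requests under 300 ms.
ratio = latency_error_ratio(9_960, 10_000)
budget = 1.0 - 0.995  # a 99.5% target leaves a 0.5% error budget
print(f"error ratio {ratio:.4f}, burn rate {ratio / budget:.2f}")
```

Because the result is a ratio of counts, it divides cleanly by the budget to give a burn rate, which a percentile value cannot do.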

Pyrra supports several indicator kinds, and picking the right one matters because each generates different PromQL:

Pyrra indicator	Fields	Models	Notes
`ratio`	`errors`, `total`	Availability, correctness	Errors as a subset of total
`latency`	`success` (bucket), `total` (`_count`)	Count-based latency	Reads the `le` bucket
`latencyNative`	`latency` (threshold), `total` (native histogram)	Latency on native histograms	Prometheus 2.40+
`boolGauge`	`metric`	Up/down or boolean state	For a 0/1 signal

The Pyrra deployment topology

Pyrra runs as a split deployment: an api process serving the UI and aggregating across objectives, plus a backend that watches your source of truth and writes rules. The Kubernetes backend reconciles the CRD into PrometheusRule objects; the filesystem backend watches a directory and writes rule files — use it when you are not on the Operator:

# UI + aggregating API (talks to Prometheus for live budget numbers)
pyrra api --prometheus-url=http://prometheus.monitoring:9090

# Backend (choose one):
pyrra kubernetes      # watches ServiceLevelObjective CRs, writes PrometheusRules
pyrra filesystem \
  --config-files="/etc/pyrra/*.yaml" \
  --prometheus-folder="/etc/prometheus/pyrra/"

Pyrra’s generated recording rules carry burn rates at the windows it needs — http_requests:burnrate5m, :burnrate1h, :burnrate6h, and so on — alongside :increase series over the full window that power the dashboard’s budget bar. The naming differs from Sloth’s, which matters if you point a dashboard at one and switch to the other.

The Pyrra process model, so you deploy the right pieces:

Component	Command	Responsibility	Runs where
API / UI	`pyrra api`	Serves UI, aggregates, queries Prometheus live	One deployment
Kubernetes backend	`pyrra kubernetes`	Reconciles CRDs → `PrometheusRule`	One deployment
Filesystem backend	`pyrra filesystem`	Watches files → writes rule files	Sidecar to Prometheus

Sloth vs Pyrra vs OpenSLO: choosing deliberately

These three are not competitors so much as three layers, and confusing them is the most common design mistake. OpenSLO is a specification — a portable file format with no runtime. Sloth is a generator — CLI plus controller, no UI. Pyrra is a generator plus a UI — its own CRD, a dashboard, and a filesystem/Kubernetes backend. The clean decision is: author intent in OpenSLO if portability matters; generate rules with Sloth (most flexible, CI-friendly); add Pyrra when operators want a live budget dashboard without building Grafana panels by hand.

The head-to-head on the axes that decide a real adoption:

Axis	OpenSLO	Sloth	Pyrra
What it is	Spec / file format	Rule generator	Generator + UI
Runtime component	None	CLI + optional controller	API + backend (always running)
Reads OpenSLO	(is OpenSLO)	Yes	No (own CRD)
Generates Prometheus rules	No (needs a tool)	Yes	Yes
Multi-window multi-burn-rate	Modeled, not generated	Yes (fixed table)	Yes (fixed table)
Live error-budget UI	No	No	Yes (built-in)
Absent-metric alert	Modeled	Yes (`SLOMetricAbsent`)	No (as of v0.7)
Kubernetes CRD	No (plain YAML)	`PrometheusServiceLevel`	`ServiceLevelObjective`
Filesystem (non-K8s) mode	N/A	CLI writes files	`pyrra filesystem`
Threshold (gauge) SLIs	Yes (`thresholdMetric`)	Model as ratio	`boolGauge`
SLI plugins / reuse	`DataSource` refs	Go `sli_plugin` library	Metric-selector reuse
Portability to other backends	High (Nobl9, etc.)	Sloth-specific spec	Pyrra-specific CRD
Best single use	Source of truth for intent	The canonical compiler	The operator dashboard

The generated-rule differences that bite when you mix them:

Detail	Sloth	Pyrra
SLI ratio series name	`slo:sli_error:ratio_rate5m`	`http_requests:burnrate5m` (per-metric)
Budget series	`slo:error_budget:ratio`	`:increase` + computed in UI
Standard labels	`sloth_service`, `sloth_slo`, `sloth_id`	`slo`, plus your CRD labels
Alert naming	You set it (`alerting.name`)	Derived from the objective
Window set	5m,30m,1h,2h,6h,1d,3d	5m,30m,1h,2h,6h,1d,4d (config-dependent)
Dashboard coupling	Bring your own Grafana	Built-in Pyrra UI

The common architectures, from simplest to most complete:

Setup	Source of truth	Generator	UI	When it fits
Sloth-only	Sloth or OpenSLO spec	Sloth	Grafana (hand-built)	You already have Grafana SLO dashboards
Pyrra-only	Pyrra CRD	Pyrra	Pyrra	You want a turnkey budget UI
OpenSLO → Sloth	OpenSLO spec	Sloth	Grafana	Portability + CI flexibility
Sloth (rules) + Pyrra (UI)	OpenSLO / Sloth spec	Sloth canonical	Pyrra	CI-enforced rules and a live dashboard
Commercial (Nobl9)	OpenSLO spec	Vendor	Vendor	Managed SLO platform, OpenSLO in

The recommendation for most teams: standardize the source on OpenSLO (portable, reviewable), make Sloth the canonical generator that CI enforces (it has the absent-metric alert and the widest SLI flexibility), and, if operators want a dashboard, run Pyrra for the UI only, feeding it the same objectives and treating its generated rules as non-canonical. This gives you one source format, one enforced generator, and a live UI, with no single-vendor lock-in.

Wiring multi-window multi-burn-rate alerts to Alertmanager

The reason for multiple windows is that one threshold cannot be both fast and quiet. A high burn rate over a short window catches a sharp outage in minutes; a lower burn rate over a long window catches a slow bleed without paging on every transient blip. The canonical table pairs a long and short window per severity so an alert fires only when both agree: the long window confirms the burn is sustained, and the short window confirms it is still happening now, which also lets the alert clear quickly once the burn stops. The budget percentages assume a 30-day period:

Severity	Long window	Short window	Burn rate	Budget consumed if sustained	Detects
Page	1h	5m	14.4	~2% in 1h	Sharp, fast outage
Page	6h	30m	6	~5% in 6h	Sustained moderate burn
Ticket	1d (24h)	2h	3	~10% in 1d	Slow bleed
Ticket	3d (72h)	6h	1	~10% in 3d	Very slow chronic degradation
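
The burn-rate and budget columns are tied together by one line of arithmetic, sketched here in Python. The 30-day period is an assumption (it is the period for which 14.4 works out to 2%), and both function names are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # assumption: the table is computed for a 30-day period


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the period's error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def error_ratio_threshold(burn_rate: float, target: float) -> float:
    """Error ratio at which an SLO with this target burns at burn_rate."""
    return burn_rate * (1.0 - target)


rows = [("page", 1, 14.4), ("page", 6, 6), ("ticket", 24, 3), ("ticket", 72, 1)]
for severity, hours, burn in rows:
    print(f"{severity:6} {hours:>3}h burn {burn:>4}: "
          f"{budget_consumed(burn, hours):.0%} of budget, "
          f"error ratio above {error_ratio_threshold(burn, 0.999):.4%}")
```

For a 99.9% target the 14.4 page threshold corresponds to an error ratio of about 1.44%, which is the number the generated alert expression actually compares against.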

Both Sloth and Pyrra emit this four-alert structure (their window sets differ slightly, as the comparison table above shows), so you never write the PromQL. Your job is routing. Sloth labels page alerts severity: page and ticket alerts severity: ticket (per your spec); route on those, and group by the service/SLO labels the generator provides:

# alertmanager.yaml
route:
  receiver: default
  group_by: ["sloth_service", "sloth_slo"]
  routes:
    - matchers:
        - severity = "page"
      receiver: pagerduty-payments
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    - matchers:
        - severity = "ticket"
      receiver: slack-payments
      group_wait: 1m
      repeat_interval: 24h

inhibit_rules:
  # A firing page silences the matching ticket for the same SLO —
  # do not open a Jira while someone is already paged for it.
  - source_matchers: [severity = "page"]
    target_matchers: [severity = "ticket"]
    equal: ["sloth_service", "sloth_slo"]

receivers:
  - name: default
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: "<integration-key>"
  - name: slack-payments
    slack_configs:
      - api_url: "<webhook-url>"
        channel: "#payments-slo"

The inhibition rule matters because during a real outage both the page (1h/5m) and the ticket (1d/2h) alerts will eventually fire for the same SLO — the fast page trips first, and as the burn sustains, the slower ticket condition crosses too. Inhibiting the ticket while the page is active stops the on-call from drowning in duplicate noise about a single incident. The full routing behavior of Alertmanager (grouping timers, silence semantics, dedup) is its own topic — see Designing Alertmanager Routing Trees: Grouping, Inhibition, Silences, and Dedup — but the generator hands you clean, consistent labels precisely so this routing is a few matchers, not a regex nightmare.
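
The matching logic behind that inhibit rule is small enough to model directly. The Python sketch below is a simplified model of the semantics, not Alertmanager's implementation; the alert names and the inhibited helper are hypothetical.

```python
def inhibited(alerts, source_severity="page", target_severity="ticket",
              equal=("sloth_service", "sloth_slo")):
    """Return the target alerts muted by a firing source alert.

    A target is muted when some source alert carries identical values
    for every label listed in `equal`.
    """
    source_keys = {
        tuple(a.get(label) for label in equal)
        for a in alerts if a.get("severity") == source_severity
    }
    return [
        a for a in alerts
        if a.get("severity") == target_severity
        and tuple(a.get(label) for label in equal) in source_keys
    ]


firing = [
    {"alertname": "CheckoutBurn", "severity": "page",
     "sloth_service": "checkout", "sloth_slo": "requests-availability"},
    {"alertname": "CheckoutBurn", "severity": "ticket",
     "sloth_service": "checkout", "sloth_slo": "requests-availability"},
    {"alertname": "SearchBurn", "severity": "ticket",
     "sloth_service": "search", "sloth_slo": "requests-availability"},
]
muted = inhibited(firing)
print([a["sloth_service"] for a in muted])
```

Only the checkout ticket is muted; the search ticket still notifies because no page is firing for that service, which is why the equal list must name both the service and the SLO label.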

The labels the generators emit that you route and group on:

Label	Emitted by	Value	Use in Alertmanager
`severity`	Both (your spec)	`page` / `ticket`	Route to PagerDuty vs Slack
`sloth_service`	Sloth	Service name	`group_by`, inhibition `equal`
`sloth_slo`	Sloth	SLO name	`group_by`, inhibition `equal`
`sloth_id`	Sloth	`service-slo`	Unique key
`slo`	Pyrra	Objective name	`group_by` (Pyrra)
`category`	Your spec	e.g. `availability`	Optional routing/dashboards
`team`	Your spec	e.g. `payments`	Route to the owning team

Versioning SLO definitions in Git and validating in CI

The specs are code, so they get the same treatment: pull requests, review, and a gate that refuses anything that will not compile. Lay the repo out by concern and keep generated rules out of source control — regenerate them in CI so the spec is unambiguously the source of truth and nobody is ever tempted to hand-edit a generated PrometheusRule:

slo-definitions/
  slis/        # reusable OpenSLO SLI objects
  slos/        # OpenSLO SLO objects (one per objective)
  alerts/      # OpenSLO AlertPolicy / AlertCondition
  pyrra/       # Pyrra ServiceLevelObjective CRs (if using Pyrra UI)
  tests/       # promtool test-rules fixtures
  .github/workflows/slo.yaml

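Two failures actually happen in practice: a malformed spec (bad YAML, wrong field), and a spec that compiles but references a metric or label that does not exist (a renamed metric, a typo in a selector). A GitHub Actions job that validates with Sloth and dry-run-renders the output catches the first. The second passes both steps, because neither one queries Prometheus, and needs the optional metric-existence check listed in the gate table below. The job: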

# .github/workflows/slo.yaml
name: validate-slos
on: [pull_request]
jobs:
  sloth:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Sloth
        run: |
          curl -fsSL \
            https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64 \
            -o /usr/local/bin/sloth
          chmod +x /usr/local/bin/sloth
      - name: Validate specs
        run: sloth validate -i ./slos/
      - name: Render rules (fails on bad templates)
        run: |
          for f in slos/*.openslo.yaml; do
            sloth generate -i "$f" -o /dev/null
          done

Add a promtool step to assert the rules are not just well-formed YAML but valid Prometheus rules, and unit-test the burn-rate logic with sample series so a refactor cannot silently break a page. promtool test rules is the step most teams skip and most regret skipping — it is the only check that proves the alert fires when it should and stays silent when it should not:

# Render, then let Prometheus itself check the rules are valid
sloth generate -i slos/checkout-availability.openslo.yaml -o /tmp/checkout.rules.yaml
promtool check rules /tmp/checkout.rules.yaml

# Unit-test the burn-rate math against synthetic series
promtool test rules tests/checkout_burnrate_test.yaml

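A minimal promtool test rules fixture that injects a known error rate and asserts the page fires. promtool compares expected labels and annotations exactly, so extend exp_labels (and add exp_annotations) to match everything the generated alert carries: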

# tests/checkout_burnrate_test.yaml
rule_files:
  - /tmp/checkout.rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 15% error rate — well above the 14.4x page threshold for a 99.9% SLO
      - series: 'nginx_ingress_controller_requests{service="checkout", status="500"}'
        values: '0+150x60'
      - series: 'nginx_ingress_controller_requests{service="checkout", status="200"}'
        values: '0+850x60'
    alert_rule_test:
      - eval_time: 10m
        alertname: CheckoutAvailabilityBurn
        exp_alerts:
          - exp_labels:
              severity: page
              sloth_service: checkout
              sloth_slo: requests-availability
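
When you design fixtures like this, it helps to compute the burn rate the synthetic series imply rather than eyeball it. The Python sketch below does that arithmetic; both helper names are hypothetical, and the second one suggests per-step increments for a "stays silent" case below the page threshold.

```python
def fixture_burn_rate(errors_per_step: float, ok_per_step: float, target: float) -> float:
    """Burn rate implied by constant per-step counter increments."""
    error_ratio = errors_per_step / (errors_per_step + ok_per_step)
    return error_ratio / (1.0 - target)


def increments_for_burn(burn_rate: float, total_per_step: int, target: float):
    """Per-step (errors, ok) increments that produce roughly burn_rate."""
    errors = round(burn_rate * (1.0 - target) * total_per_step)
    return errors, total_per_step - errors


burn = fixture_burn_rate(150, 850, 0.999)
print(f"fixture burns at about {burn:.0f}x against a 14.4x page threshold")

# A 10x burn sits below the 14.4x page threshold for a 99.9% target.
errors, ok = increments_for_burn(10, 1000, 0.999)
print(f"silent case: '0+{errors}x60' errors, '0+{ok}x60' successes")
```

A second test case built from the silent-case increments, with an empty exp_alerts list for the page alert, covers the "stays silent when it should" half of the claim.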

The CI gates and what each one actually catches — layer them, because each catches a class the others miss:

CI step	Command	Catches	Misses
Spec validation	`sloth validate`	Malformed spec, bad fields	Wrong metric names
Render check	`sloth generate -o /dev/null`	Bad templates, unresolvable refs	Semantic errors
Rule lint	`promtool check rules`	Invalid PromQL, bad rule structure	Wrong thresholds
Rule unit test	`promtool test rules`	Alert fires/silent at wrong burn rate	Missing SLIs entirely
Metric existence	Query Prometheus in CI (optional)	Selector matches no series	Rules that never fire
Diff review	PR review of the spec	Wrong target, wrong SLI intent	Everything a human misses
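
The optional metric-existence gate can be a short script. The sketch below uses only the Python standard library and the Prometheus instant-query endpoint (/api/v1/query); the Prometheus URL, the selector list, and both function names are assumptions, and the demonstration runs against a stub so it works offline.

```python
import json
import urllib.parse
import urllib.request


def series_count(prometheus_url: str, selector: str) -> int:
    """Count series matching a selector via the Prometheus instant-query API."""
    query = urllib.parse.urlencode({"query": f"count({selector})"})
    url = f"{prometheus_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    # count() over a selector that matches nothing returns no sample, not 0.
    return int(float(result[0]["value"][1])) if result else 0


def missing_selectors(selectors, count_fn):
    """Return the selectors that match no series, using count_fn(selector) -> int."""
    return [s for s in selectors if count_fn(s) == 0]


# Offline demonstration with a stubbed counter. In CI you would pass
# functools.partial(series_count, "http://prometheus:9090") instead (assumed URL).
known = {'nginx_ingress_controller_requests{service="checkout"}': 12}
missing = missing_selectors(
    ['nginx_ingress_controller_requests{service="checkout"}',
     'nginx_ingress_requests{service="checkout"}'],  # hypothetical typo
    lambda selector: known.get(selector, 0),
)
print(missing)
```

Failing the job when the returned list is non-empty turns a renamed metric into a red pull request instead of a silent alert.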

The pipeline stages end to end, from commit to enforcement:

Stage	Trigger	Action	Failure blocks
Author	Developer edits spec	Write OpenSLO YAML	—
PR open	Push to branch	`sloth validate` + render	Merge
PR review	Reviewer	Human reviews intent + target	Merge
Merge	Merge to main	Regenerate rules, apply to cluster	Deploy
Runtime	Prometheus loads rules	Evaluate SLI + burn rate	—
Release gate	Deploy pipeline	Query remaining budget	Promotion

Reporting budget consumption and gating releases

The error budget only changes behavior if someone looks at it on a cadence and it has teeth. Two consumers matter: a weekly stakeholder report, and the deploy pipeline. The budget policy — what happens when it is exhausted — is the negotiated line between feature velocity and stability, and enforcing it automatically is what stops it from being re-litigated in every planning meeting.

For reporting, query the budget directly. Sloth’s metadata rules expose the objective and the SLI error ratio as series, so a single PromQL expression gives you “fraction of budget remaining”:

# Fraction of 28d error budget remaining for the availability SLO.
# 1 = full budget, 0 = exhausted, < 0 = over budget.
1 - (
  slo:sli_error:ratio_rate28d{sloth_service="checkout", sloth_slo="requests-availability"}
  / on(sloth_service, sloth_slo)
  slo:error_budget:ratio{sloth_service="checkout", sloth_slo="requests-availability"}
)

If you emitted the slo:period_error_budget_remaining:ratio metadata rule, you can read remaining budget directly without the subtraction — Sloth precomputes it. Either way, the number is a first-class series you can graph, alert on, and query from a pipeline.

For release gating, the pipeline asks the same question before promoting. If the budget is exhausted, the policy is to freeze risky changes and spend the next cycle on reliability. The gate turns policy into a non-negotiable check:

# Returns the remaining-budget fraction; gate promotion on it.
REMAINING=$(curl -fsG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=1 - (slo:sli_error:ratio_rate28d{sloth_service="checkout",sloth_slo="requests-availability"} / on(sloth_service,sloth_slo) slo:error_budget:ratio{sloth_service="checkout",sloth_slo="requests-availability"})' \
  | jq -r '.data.result[0].value[1]')

# Fail closed: an empty or non-numeric value (jq prints "null" when the
# query matched nothing) means the budget is unknown, not healthy.
case "$REMAINING" in
  ''|*[!0-9.eE+-]*) echo "No usable budget value ('$REMAINING'): prod promotion blocked"; exit 1 ;;
esac

# Block prod promotion if less than 10% of the budget remains.
if awk -v r="$REMAINING" 'BEGIN { exit !(r < 0.10) }'; then
  echo "Budget below 10%: prod promotion blocked"; exit 1
fi
echo "Budget OK ($REMAINING remaining)"

An error-budget policy is a table of thresholds and consequences, agreed once and enforced by the gate:

Budget remaining	Policy	Enforcement
> 50%	Ship freely	None
20–50%	Ship, but review risky changes	Advisory PR label
10–20%	Feature flags only; harden	Gate warns, requires ack
0–10%	Freeze non-hotfix promotions	Gate blocks deploy
< 0% (over budget)	Reliability sprint; incident review	Gate blocks + page eng lead
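
Encoding the policy table as a function keeps the gate and the weekly report on the same thresholds. In this Python sketch the tier names are hypothetical, and treating each band's lower bound as inclusive is an assumption the table itself leaves open.

```python
def remaining_budget(error_ratio: float, target: float) -> float:
    """1 = full budget, 0 = exhausted, below 0 = over budget."""
    return 1.0 - error_ratio / (1.0 - target)


def policy_for(remaining: float) -> str:
    """Map a remaining-budget fraction to an enforcement tier."""
    if remaining < 0:
        return "block+page"   # over budget
    if remaining < 0.10:
        return "block"        # freeze non-hotfix promotions
    if remaining < 0.20:
        return "warn"         # gate warns, requires ack
    if remaining <= 0.50:
        return "advisory"     # advisory PR label
    return "none"             # ship freely


# Hypothetical reading: 0.04% errors over the window against a 99.9% target.
left = remaining_budget(0.0004, 0.999)
print(f"{left:.0%} remaining -> {policy_for(left)}")
```

Keeping the thresholds in one function means a renegotiated policy is a one-line diff that goes through the same review as the SLO specs.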

The budget series you can query and what each answers:

Series / expression	Answers	Used by
`slo:error_budget:ratio`	What is the total budget?	Reference denominator
`slo:sli_error:ratio_rate28d`	Error ratio over the window	Budget consumption
`slo:period_error_budget_remaining:ratio`	Fraction of budget left	Report + gate
`slo:current_burn_rate:ratio`	How fast are we burning now?	Live dashboard
Projected exhaustion (remaining budget × window ÷ burn rate)	When will it run out?	Pyrra UI / capacity planning
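
Projected exhaustion follows from the definition of burn rate: at a burn rate of 1 the full budget lasts exactly one window, so the time left is the remaining fraction times the window, divided by the current burn rate. The Python sketch below is a linear estimate under that assumption; it ignores old errors ageing out of a rolling window and is not a description of how any particular UI computes its projection.

```python
def hours_to_exhaustion(remaining: float, burn_rate: float,
                        period_hours: float = 28 * 24):
    """Hours until the budget reaches zero at the current burn rate.

    Returns 0.0 when the budget is already gone and None when nothing
    is burning (the budget never runs out at this rate).
    """
    if remaining <= 0:
        return 0.0
    if burn_rate <= 0:
        return None
    return remaining * period_hours / burn_rate


# Half the budget left, burning at twice the sustainable rate.
print(hours_to_exhaustion(0.5, 2.0))
```

Feeding it slo:period_error_budget_remaining:ratio and slo:current_burn_rate:ratio gives a rough "days left" figure for a report.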

Architecture at a glance

Read the pipeline as a left-to-right flow from human intent to enforced behavior, with a single source of truth at the top feeding three consumers at the bottom. On the left, a service owner authors two kinds of document in Git: reusable kind: SLI objects that define good-and-total for availability, latency, and correctness, and kind: SLO objects that bind each SLI to a target (0.999) and a rolling 28-day window with a budgeting method. These plain-YAML documents are the only thing a human ever writes; they carry the intent in a description and the math in target.

In the middle sits the generation layer, and this is the pivot of the whole design: the specs flow into Sloth, the canonical compiler, which — via CLI in CI or via the PrometheusServiceLevel controller in-cluster — expands each SLO into a full PrometheusRule: SLI error-ratio recording rules at seven window lengths (slo:sli_error:ratio_rate5m through rate3d), metadata rules carrying the objective and the error budget as series, an absent-metric alert, and the four multi-window multi-burn-rate alert rules. No human types a window or a 14.4; the four-window table is baked into the generator and unit-tested upstream. Optionally, the same objectives feed Pyrra, which generates its own equivalent burn-rate rules and, more importantly, serves a live error-budget dashboard.

On the right are the three consumers of what the generator produced. Prometheus evaluates the recording rules and the burn-rate alerts, and when a burn is both fast and sustained it fires an alert carrying clean severity: page / severity: ticket and sloth_service / sloth_slo labels. Alertmanager routes on those labels — page to PagerDuty, ticket to Slack — with an inhibition rule that silences the ticket while the page is active, so one incident is one notification stream. Dashboards and the release gate read the budget series directly: Grafana (or the Pyrra UI) graphs remaining budget and projected exhaustion, and the CI/CD pipeline queries 1 - (error_ratio / budget) before every promotion, blocking risky deploys when less than 10% of the budget remains. The whole diagram is one arrow: intent in Git → rules by machine → alerts, dashboards, and gates that enforce reliability policy without a human ever retyping a threshold. Notice that every downstream consumer keys off labels and series the generator produced — which is exactly why generating them consistently, rather than hand-writing them per service, is the point.

Real-world scenario

Meridian Payments ran 70+ services across three Prometheus shards behind Thanos, serving a card-processing platform with a hard regulatory reporting obligation on availability. They had migrated to burn-rate alerting two years earlier — good instinct — but authored every rule by hand in Jsonnet mixins. The SRE team was six engineers; the observability stack cost roughly ₹9,00,000/month across compute and storage.

The incident that forced the change was quiet, which was the worst part. A checkout dependency began returning intermittent 5xx on a Friday evening. The burn was real — about 8% error rate — but it did not page. The post-mortem found why: a Jsonnet template bug had set the long-window page threshold to key off a 6h window where the spec called for 1h. The slower window smoothed the Friday-evening spike below the 14.4× threshold, so the page condition never crossed. By the time the ticket alert (1d/2h) fired on Saturday morning, the degradation had burned 30% of the monthly error budget over roughly fourteen hours. Nobody had audited the generated rules against the spec because there were four hundred of them and they all “looked right.”
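
Why the window length matters so much can be shown with a few lines of arithmetic. The Python sketch below assumes a step change from zero burn to a constant burn rate; the numbers are hypothetical and are not Meridian's data.

```python
def hours_until_window_crosses(burn_rate: float, threshold: float,
                               window_hours: float):
    """Hours after a step to `burn_rate` until the window average exceeds `threshold`.

    With zero burn before the step, the window average after t hours is
    burn_rate * t / window_hours (for t <= window_hours). Returns None
    when the burn never exceeds the threshold at all.
    """
    if burn_rate <= threshold:
        return None
    return window_hours * threshold / burn_rate


for window in (1, 6):
    t = hours_until_window_crosses(burn_rate=20, threshold=14.4, window_hours=window)
    print(f"{window}h window crosses 14.4x after {t:.2f}h of a steady 20x burn")
```

The same burn takes six times as long to register on the 6h window, and any spike shorter than that never registers, which is the smoothing effect the post-mortem describes.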

The constraint was that they could not rip out Jsonnet wholesale — dashboards, mixins, and CI all depended on it, and a big-bang migration was itself a reliability risk. So they inverted the relationship instead of replacing the stack. SLO definitions moved to OpenSLO specs in a dedicated slo-definitions repo, one kind: SLI and kind: SLO per objective, reviewed like code. A CI job compiled them with Sloth into the PrometheusRule objects that Jsonnet had previously hand-built. Jsonnet kept owning dashboards but consumed the Sloth-generated slo:objective:ratio and slo:error_budget:ratio series instead of redefining them. The window-length bug class disappeared entirely, because no human typed a window length again — the four-window table lives in Sloth’s generator and is unit-tested upstream, and promtool test rules in their CI now asserts the page fires at 14.4× and stays silent below it.

The piece that sold it to leadership was the gate. They wired the remaining-budget query into the Argo CD pre-sync hook so a service whose budget was exhausted could not promote a non-hotfix change to production. A PrometheusRule flipped a gate: freeze label when budget hit zero, and the sync hook read that alert’s state via the Alertmanager API:

# An alert rule that carries a gate: freeze label once the budget is gone.
# Argo CD's pre-sync gate reads this alert's state via the Alertmanager API.
- alert: CheckoutBudgetExhausted
  expr: |
    (slo:sli_error:ratio_rate28d{sloth_service="checkout", sloth_slo="requests-availability"}
      / on(sloth_service, sloth_slo)
     slo:error_budget:ratio{sloth_service="checkout", sloth_slo="requests-availability"}) >= 1
  for: 15m
  labels:
    severity: ticket
    gate: freeze
  annotations:
    summary: "Checkout availability budget exhausted; non-hotfix promotions frozen."
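
The hook's decision can be sketched as a pure function over the JSON that Alertmanager's GET /api/v2/alerts endpoint returns. The payload below is a trimmed, hypothetical example, and promotion_frozen is an illustrative name; how Meridian's actual hook was written is not described in the source.

```python
import json


def promotion_frozen(alerts, service: str) -> bool:
    """True if a firing alert with gate=freeze exists for the service.

    Suppressed alerts count as well: the freeze alert is a ticket, so a
    page-inhibits-ticket rule may suppress it while a page is firing.
    """
    for alert in alerts:
        labels = alert.get("labels", {})
        state = alert.get("status", {}).get("state")
        if (labels.get("gate") == "freeze"
                and labels.get("sloth_service") == service
                and state in ("active", "suppressed")):
            return True
    return False


# Hypothetical, trimmed response body from GET /api/v2/alerts.
payload = json.loads("""
[
  {"labels": {"alertname": "CheckoutBudgetExhausted", "gate": "freeze",
              "severity": "ticket", "sloth_service": "checkout",
              "sloth_slo": "requests-availability"},
   "status": {"state": "suppressed"}}
]
""")
print(promotion_frozen(payload, "checkout"))
```

Treating a suppressed alert as frozen is a deliberate choice: otherwise the gate would reopen at exactly the moment a page for the same SLO starts firing.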

Within a quarter, two things changed measurably. The page-on-noise rate dropped sharply, because the generated multi-window logic was correct and consistent where the hand-rolled Jsonnet had drifted. And the two budget-blown incidents that occurred after the migration were both caught by the short window in minutes, not over a weekend — the exact failure mode that had cost them 30% of a month’s budget was now impossible to reproduce, because the thresholds were generated, tested, and identical across all 70 services. The migration took one engineer about six weeks, running Sloth-generated rules side-by-side with the Jsonnet ones for two weeks to confirm parity before cutting over.

The incident and the fix as a timeline, because the order is the lesson:

Time	Event	Root cause / action	Effect
Fri 19:00	Checkout dependency 5xx begins	~8% error rate	Budget burning, no page
Fri 19:00–Sat 09:00	Burn continues, page silent	Page keyed off 6h window (spec: 1h)	30% of month’s budget gone
Sat 09:00	Ticket alert finally fires	1d/2h condition crosses	Investigation starts, late
Post-mortem	Root cause found	Jsonnet template window bug	Decision: generate, don’t hand-write
Week 1–2	OpenSLO specs authored	One SLI+SLO per objective	Intent in reviewable files
Week 3–4	Sloth in CI, side-by-side	Rules generated, parity checked	Confidence before cutover
Week 5	Cutover + Argo gate	Sloth rules canonical; budget gate live	Drift impossible; freezes automatic
+1 quarter	Two more budget incidents	Caught by short window in minutes	Weekend-long burn now impossible

Advantages and disadvantages

Generating SLO rules from a spec both solves the drift problem and introduces a build dependency you must operate. Weigh it honestly:

Advantages (why generate)	Disadvantages (the cost)
No human ever types a window length or a burn threshold — the drift bug class disappears	A generator is a new dependency to install, version, and keep current across CI and cluster
The four-window MWMBR table is baked in and unit-tested upstream, not re-derived per service	The abstraction hides the PromQL; engineers must learn to read generated rules to debug
One spec change regenerates all rules consistently across 70+ services	A generator bug affects every SLO at once (mitigated by pinning versions + testing)
Specs are reviewable in PRs; intent and target are visible, not buried in Jsonnet	Two spec formats (Sloth native vs OpenSLO) and multiple tools invite confusion
Absent-metric alerts (Sloth) catch the “silently stopped evaluating” failure hand-rolled rules miss	Portability is partial: `AlertPolicy` and threshold SLIs aren’t uniformly supported
Budget series are first-class — report, graph, and gate on them with one query	Live UIs (Pyrra) add an always-running component and its own CRD to operate
Portable source (OpenSLO) decouples intent from any single vendor	The generated series names differ by tool; switching generators breaks dashboards

The model is right for any team running SLO-based alerting across more than a handful of services, where consistency and reviewability matter more than the marginal control of hand-written PromQL. It is overkill for a single service with one SLO that one engineer owns end to end — there, a hand-written rule you fully understand may be simpler than installing a generator. It bites hardest when a team adopts both Sloth and Pyrra without deciding which is canonical, or when they commit generated rules to Git and then hand-edit them, defeating the entire “spec is source of truth” premise. The disadvantages are all manageable — pin versions, pick one canonical generator, never commit generated output, and test the rules — but only if you know they exist.

Hands-on lab

Author an OpenSLO availability SLI and SLO, generate the full rule set with Sloth, load it into a local Prometheus, and unit-test the burn-rate alert — all locally, no cloud, free. You need Docker and about 20 minutes.

Step 1 — Install Sloth and set up a workspace.

mkdir -p slo-lab/{slis,slos,rules,tests} && cd slo-lab
# Assumes an amd64 host; prefix the install with sudo if /usr/local/bin is not writable.
curl -fsSL https://github.com/slok/sloth/releases/latest/download/sloth-$(uname -s | tr '[:upper:]' '[:lower:]')-amd64 \
  -o /usr/local/bin/sloth && chmod +x /usr/local/bin/sloth
sloth version

Expected: a version string like v0.11.0.

Step 2 — Author the OpenSLO SLI and SLO. Write a single file with both kinds:

cat > slos/checkout.openslo.yaml <<'EOF'
apiVersion: openslo/v1
kind: SLI
metadata:
  name: checkout-availability
spec:
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: sum(rate(http_requests_total{job="checkout", code!~"5.."}[{{.window}}]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
---
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
spec:
  service: checkout
  indicatorRef: checkout-availability
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: 99.9% availability
      target: 0.999
EOF

Step 3 — Validate and generate the rules.

sloth validate -i ./slos/
sloth generate -i slos/checkout.openslo.yaml -o rules/checkout.rules.yaml

Expected: validate prints a success line and exits 0; generate writes rules/checkout.rules.yaml. Inspect it — you should see recording rules named slo:sli_error:ratio_rate5m through rate3d, metadata rules slo:objective:ratio and slo:error_budget:ratio, a SLOMetricAbsent alert, and the multi-window burn-rate alert group.

grep -E "record:|alert:" rules/checkout.rules.yaml | head -30

Step 4 — Lint the generated rules with promtool.

# The image's entrypoint is prometheus, so override it to run promtool.
docker run --rm --entrypoint promtool -v "$PWD/rules:/rules" prom/prometheus:latest \
  check rules /rules/checkout.rules.yaml

Expected: SUCCESS: N rules found.

Step 5 — Unit-test the burn-rate alert. Create a fixture that injects a 15% error rate and asserts the page fires:

cat > tests/burnrate_test.yaml <<'EOF'
rule_files:
  - /rules/checkout.rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="checkout", code="500"}'
        values: '0+150x120'
      - series: 'http_requests_total{job="checkout", code="200"}'
        values: '0+850x120'
    alert_rule_test:
      - eval_time: 10m
        alertname: checkout-availability
        exp_alerts:
          - exp_labels:
              severity: page
              sloth_service: checkout
              sloth_slo: checkout-availability
EOF

docker run --rm --entrypoint promtool -v "$PWD/rules:/rules" -v "$PWD/tests:/tests" prom/prometheus:latest \
  test rules /tests/burnrate_test.yaml

Expected: SUCCESS. If the alertname or labels differ, run grep "alert:" rules/checkout.rules.yaml to read the exact generated names and adjust the fixture — this is the real skill: reading what the generator produced.

Step 6 — Load the rules into a live Prometheus (optional).

cat > prometheus.yml <<'EOF'
global:
  evaluation_interval: 15s
rule_files:
  - /rules/checkout.rules.yaml
EOF
docker run --rm -d --name prom -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$PWD/rules:/rules" prom/prometheus:latest
# Open http://localhost:9090/rules — the SLO recording/alert rules appear with no eval errors.

Validation checklist. You authored intent as OpenSLO, generated a dozen correct rules from a two-field target, linted them with the same promtool Prometheus uses, and proved the page fires at a known burn rate — without writing a single threshold or window length. The steps mapped to what each proves:

Step	What you did	What it proves
2	Author OpenSLO SLI + SLO	Intent is a small, reviewable file
3	`sloth validate` + `generate`	A dozen rules come from two fields
4	`promtool check rules`	The output is valid Prometheus, not just YAML
5	`promtool test rules`	The alert fires at the right burn rate
6	Load into Prometheus	The rules evaluate live with no errors

Teardown.

docker rm -f prom 2>/dev/null
cd .. && rm -rf slo-lab

Cost note. Entirely local and free — Docker pulls the Prometheus image once (~250 MB); no cloud resources are created.

Common mistakes & troubleshooting

The failure modes of an SLO-as-code pipeline, as a symptom → cause → confirm → fix table you can scan mid-incident:

#	Symptom	Root cause	Confirm	Fix
1	Alert never fires despite a real outage	SLI selector matches no series (renamed metric)	Query the `good`/`total` PromQL directly in Prometheus — empty result	Fix the selector; add a metric-existence check in CI
2	SLI ratio > 1 or negative	`good` and `total` count different sets (good not a subset of total)	Graph `slo:sli_error:ratio_rate5m` — should be 0–1	Ensure `good` ⊆ `total`; use `bad`/`total` if clearer
3	Latency SLO passes while the tail is terrible	Built on a percentile, not a bucket count	Read the SLI query — if it uses `histogram_quantile`, it’s wrong	Rewrite as `bucket{le="T"}` / `_count`
4	Page fires on every transient blip	Only a short window used, or `alertAfter`/`for` too low	Check the alert expr for a missing long-window `and`	Use the generator’s MWMBR output; don’t hand-trim it
5	Slow burn never pages, only tickets late	Long-window threshold on the wrong window (the Meridian bug)	Diff generated rule windows against the spec	Regenerate; never hand-edit generated rules
6	`sloth generate` fails on a valid-looking spec	`{{.window}}` missing from the query, or wrong SLI style	Run `sloth validate -i` for the exact error	Add `[{{.window}}]`; match `events`/`raw` correctly
7	Rules load but nothing appears in `/rules`	`PrometheusRule` label doesn’t match the Prometheus ruleSelector	`kubectl get prometheusrule -o yaml` — check labels	Add the label the Prometheus CR selects (e.g. `role: alert-rules`)
8	Budget query returns no data	Wrong `sloth_service`/`sloth_slo` label values	List labels: `count by(sloth_service,sloth_slo)(slo:objective:ratio)`	Use the exact generated label values
9	Both page and ticket page the on-call for one incident	Missing inhibition rule in Alertmanager	Check `amtool config routes` / firing alerts	Add the `page` → `ticket` inhibit with `equal` on service+slo
10	Alert fires but `severity` label absent → routes to default	Sloth spec missing `page_alert`/`ticket_alert` labels	Grep the generated alert for `severity`	Add the labels to the spec’s `alerting:` block
11	Pyrra dashboard budget disagrees with the Grafana panel	Grafana panel points at Sloth series, Pyrra computes its own	Compare the two series names	Pick one generator’s series as canonical for dashboards
12	Rules silently stopped evaluating after a metric rename	No absent-metric alert	Check for a `SLOMetricAbsent` in the rule file	Use Sloth (emits it); or add an `absent()` alert manually
13	CI passes but the alert never fires in prod	`promtool check rules` passed, but no `promtool test rules` covers the alert logic	Search the CI test files for an `alert_rule_test` block	Add `promtool test rules` with a known-burn fixture
14	Generated rules drifted from the spec in Git	Someone committed and hand-edited generated output	`git blame` the `PrometheusRule`	Remove generated rules from source control; regenerate in CI

The entries that bite hardest, expanded:

1. The alert never fires because the SLI matches nothing. This is the single most dangerous failure, because it looks healthy. A metric got renamed (http_requests_total → http_server_requests_total), and the good and total queries now return empty results. Dividing one empty result by another yields no series at all rather than a zero or a NaN, so the alert expression has nothing to compare against: no alert, no page, ever. Confirm by pasting the SLI’s good and total queries straight into the Prometheus expression browser; an empty result is the smoking gun. Fix the selector, and add a CI step that queries Prometheus to assert each SLI’s total matches at least one series. It is the only check that catches a rename before production does.

3. The latency SLO passes while the tail is a disaster. The SLI was written as histogram_quantile(0.995, ...) < 0.3, which is a boolean comparison on a percentile, not a ratio of good events to total events, so it does not compose into a budget. It “passes” because the expression is true most of the time, but it is measuring the wrong thing entirely. Confirm by reading the SLI query: any histogram_quantile in an SLI definition is a bug. Fix by rewriting as a count-based ratio, where w is the window: sum(rate(..._bucket{le="0.3"}[w])) / sum(rate(..._count[w])).

5. The slow burn never pages — the Meridian failure. A hand-edited or mis-templated rule has the long-window condition keyed off the wrong window (6h instead of 1h), so a sustained-but-moderate burn stays smoothed below threshold and only the ticket fires, late. Confirm by diffing the windows in the generated alert against the canonical MWMBR table. Fix by regenerating from the spec and — critically — never hand-editing generated rules, so the generator’s tested table is always what runs.

12. Rules silently stopped evaluating. A metric rename or a scrape config change means the SLI series simply stops existing, and without an absent-metric alert, nothing tells you. Confirm by checking whether the rule file contains a SLOMetricAbsent alert. Fix by using Sloth (which emits it automatically) or, if you must hand-roll, adding absent(slo:sli_error:ratio_rate5m{...}) as an alert of its own.
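
The CI existence check from entry 1 might look like the following minimal Python sketch. It assumes a Prometheus HTTP API reachable at a base URL you supply, and SLI total queries that have already been rendered with a concrete window. The function names and the dictionary layout are illustrative assumptions, not part of Sloth, Pyrra, or OpenSLO.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, List


def prometheus_series_count(base_url: str, expr: str, timeout: float = 10.0) -> int:
    """Run an instant query against /api/v1/query and count the series returned.

    Assumes `expr` evaluates to an instant vector (as a sum(rate(...)) does).
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return len(body["data"]["result"])


def slis_matching_nothing(
    total_queries: Dict[str, str],
    count_series: Callable[[str], int],
) -> List[str]:
    """Return the names of SLIs whose total query matches zero series.

    `total_queries` maps a (hypothetical) SLI name to its rendered total query.
    `count_series` is injected so the check can be unit-tested without a server.
    """
    return [name for name, expr in total_queries.items() if count_series(expr) == 0]
```

In a pipeline, one would pass `lambda q: prometheus_series_count(prom_url, q)` as `count_series` and fail the job when the returned list is non-empty. Injecting the query function keeps the check itself testable offline.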

Best practices

The spec is the only source of truth; generated rules are dist/. Never commit generated PrometheusRule output to Git, and never hand-edit it. Regenerate in CI on every merge so the spec and the running rules can never diverge.
Never type a window length or a burn threshold by hand. The MWMBR table (14.4, 6, 3, 1 across 1h/5m, 6h/30m, 1d/2h, 3d/6h) belongs in the generator, tested upstream. If you find yourself typing 14.4, you are doing it wrong.
Pick one canonical generator. Run Sloth or Pyrra as the authority for the rules that page; if you want the Pyrra UI too, run it read-mostly on the same objectives, but let one tool own the alerts. Two generators owning the same alerts is drift by another name.
Measure SLIs at the edge, exclude 4xx and synthetics. The user-facing SLO reads the ingress/LB signal; health checks and client-fault 4xx stay out of the failure numerator, or you burn budget for things that are not outages.
Latency SLIs are count-based bucket ratios, never percentiles. bucket{le="T"} / _count, full stop. A percentile-based latency SLI does not compose into a budget.
Define SLIs as reusable objects. One kind: SLI referenced by many kind: SLO beats inlining the same query five times — one place to fix a selector when the metric changes.
Unit-test the burn-rate logic in CI. promtool test rules with a known-burn fixture is the only check that proves the page fires when it should and stays silent when it should not. promtool check rules alone is necessary but not sufficient.
Emit and monitor an absent-metric alert. A rule that silently stops evaluating is worse than a broken one. Sloth’s SLOMetricAbsent is free; use it.
Give the budget teeth. A weekly report surfaces remaining budget per SLO, and the release pipeline blocks risky promotions when the budget is exhausted — enforced by a script, not a meeting.
Route on the generator’s labels, group by service+SLO, inhibit ticket during page. The generators emit clean, consistent labels precisely so routing is a few matchers. Use them; do not route on brittle alertname regexes.
Pin generator versions across CI and cluster. A generator upgrade can change series names or window sets; pin the version, upgrade deliberately, and re-run the rule tests before rolling it out.
Keep the SLO count small and meaningful. Two to four SLOs per service (availability, latency, maybe correctness) that map to user pain beat a dozen vanity SLIs. Every SLO is a page you might get at 03:00 — make it earn that.
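
To make the "never type 14.4" practice concrete, here is a small Python sketch of the idea behind a generator: the burn-rate table is held once, and every error-ratio threshold is derived from the SLO target. The names (BurnWindow, MWMBR_TABLE, error_ratio_thresholds, days_to_exhaustion) are hypothetical and do not mirror Sloth's or Pyrra's internal code.

```python
from typing import List, NamedTuple, Tuple


class BurnWindow(NamedTuple):
    severity: str
    long_window: str
    short_window: str
    burn_rate: float


# The table quoted above: burn rate per long/short window pair.
MWMBR_TABLE = [
    BurnWindow("page", "1h", "5m", 14.4),
    BurnWindow("page", "6h", "30m", 6.0),
    BurnWindow("ticket", "1d", "2h", 3.0),
    BurnWindow("ticket", "3d", "6h", 1.0),
]


def error_ratio_thresholds(target: float) -> List[Tuple[str, str, str, float]]:
    """Derive the error-ratio threshold for each window pair from the SLO target.

    threshold = burn_rate * error_budget, where error_budget = 1 - target.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    budget = 1.0 - target
    return [
        (w.severity, w.long_window, w.short_window, w.burn_rate * budget)
        for w in MWMBR_TABLE
    ]


def days_to_exhaustion(burn_rate: float, window_days: float) -> float:
    """Days until the whole budget is spent at a constant burn rate."""
    return window_days / burn_rate


for severity, long_w, short_w, threshold in error_ratio_thresholds(0.999):
    print(f"{severity:6} {long_w:>3}/{short_w:<3} error ratio > {threshold:.4f}")
```

Changing the target from 0.999 to 0.995 changes all four thresholds in one place, which is the property hand-written rules lack.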

Security notes

Guard the metric source’s credentials. The generators query or reference Prometheus; if they authenticate (Mimir/Thanos with auth), keep the token in a Kubernetes Secret or CI secret, not in the spec. The spec is code and gets committed — never put a bearer token in it.
Scope the generator’s RBAC in-cluster. The Sloth/Pyrra controllers need to create and update PrometheusRule objects in specific namespaces — grant exactly that (a Role on monitoring.coreos.com/prometheusrules), not cluster-admin. A generator that can write any object anywhere is a needless blast radius.
Do not leak topology through SLO metadata. SLO description fields and generated alert annotations can end up in PagerDuty, Slack, and public dashboards. Keep internal hostnames, secret paths, and architecture details out of them — an annotation is a status, not a system map.
Protect the budget gate from being bypassed. If a deploy freeze depends on a budget query, ensure the gate fails closed — if Prometheus is unreachable, the gate should block (or require explicit human override), never silently pass. A gate that fails open is not a gate.
Review spec changes like security-sensitive code. A malicious or careless spec change that loosens a target from 0.999 to 0.99 quietly widens the budget 10× and mutes alerts. Require PR review on the SLO repo, and treat target changes as a deliberate, logged decision.
Least privilege for the CI job. The pipeline that regenerates and applies rules needs write access to the cluster’s monitoring namespace only. Use a scoped service account or OIDC federation, not a long-lived admin kubeconfig in a CI secret.

The security posture of the pipeline, control by control:

Control	Mechanism	Protects against
Metric-source auth	Token in Secret, not spec	Credential leak via committed YAML
Generator RBAC	Namespaced `Role` on `PrometheusRule`	Over-privileged controller
Annotation hygiene	Review `description`/annotations	Topology leak to Slack/PagerDuty
Fail-closed gate	Gate blocks on query failure	Silent deploy of unreliable code
PR review on targets	Branch protection on SLO repo	Quiet target-loosening
Scoped CI identity	OIDC / scoped service account	Cluster-wide compromise from CI
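
The fail-closed behaviour described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name, the 10% default threshold, and the shape of the injected fetch_remaining_budget callable are all hypothetical, and a real gate would wrap an actual Prometheus query.

```python
import math
from typing import Callable, Optional


def promotion_allowed(
    fetch_remaining_budget: Callable[[], Optional[float]],
    min_remaining: float = 0.10,
    human_override: bool = False,
) -> bool:
    """Decide whether a release may be promoted, failing closed.

    `fetch_remaining_budget` returns the remaining budget as a fraction
    (1.0 = untouched, 0.0 = exhausted), returns None when the query
    produced no data, or raises when the metric source is unreachable.
    """
    if human_override:
        # An explicit, logged human decision is the only way past a blocked gate.
        return True
    try:
        remaining = fetch_remaining_budget()
    except Exception:
        # Metric source unreachable or query error: block, never pass silently.
        return False
    if remaining is None or math.isnan(remaining):
        # No data is not evidence of health.
        return False
    return remaining >= min_remaining
```

The design choice worth noting is that every uncertain outcome (exception, empty result, NaN) maps to "block", so the only path to True is a real number at or above the threshold or an explicit override.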

Cost & sizing

The cost of SLO-as-code is almost entirely cardinality and rule-evaluation load, not licensing — Sloth, Pyrra, and OpenSLO are all open source and free. What drives the bill:

Recording-rule series. Each SLO generates roughly 7 windowed SLI-ratio series plus 5–6 metadata series — call it ~13 series per SLO. At 70 services × 3 SLOs, that is ~2,700 new series, plus the alert-state series. This is modest, but at hundreds of services it adds up, and every series consumes ingestion, storage, and query resources. See Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus for controlling the blast radius.
Rule evaluation CPU. Multi-window rules are evaluated at every rule evaluation interval (not the scrape interval) across many windows. The 28-day (rate28d) recording rules in particular scan a large range; on a busy Prometheus these can dominate rule-evaluation time. Long-window recording rules are the reason to consider Thanos/Mimir downsampling; see Thanos in Production: Global Query View, Deduplication, and Object-Storage Downsampling.
Pyrra’s always-on components. If you run Pyrra for the UI, the api and backend are two small deployments (a few hundred MB RAM each) plus their live queries against Prometheus. Sloth’s controller is similarly lightweight; the CLI is free (runs in CI).
Storage of budget history. To report on budget trends, you retain the metadata series long-term — cheap per-series, but a 28-day-window SLO implies you keep at least that window queryable, usually longer for reporting.

Rough figures for a mid-size estate (70 services, ~200 SLOs) on self-managed Prometheus + Thanos: the incremental compute for rule evaluation and the extra series is on the order of ₹8,000–20,000/month in additional Prometheus/Thanos capacity, and Pyrra’s components add perhaps ₹2,000–4,000/month in pod resources. Against the cost of a single budget-blown incident that a correct, generated alert would have caught in minutes, this is negligible — the Meridian weekend burn alone was worth far more than a year of the pipeline’s overhead. The cost drivers and what each buys:

Cost driver	What you pay for	Rough INR/month (mid estate)	Reduce it by
Extra recording-rule series	~13 series/SLO ingestion + storage	₹5,000–12,000	Fewer SLOs; drop unused windows
Rule evaluation CPU	Long-window (`rate28d`) scans	₹3,000–8,000	Downsampling (Thanos/Mimir)
Pyrra components	`api` + backend pods + live queries	₹2,000–4,000	Run Pyrra read-only, or skip it
Sloth controller	One lightweight pod	< ₹1,000	Use CLI-in-CI instead
Long-term budget history	Retained metadata series	₹1,000–3,000	Shorter retention on non-reporting series
Tooling licenses	Nothing — all OSS	₹0	—
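
The series estimate in this section is simple arithmetic, and it is worth keeping as a function so the number can be re-derived as the estate grows. The sketch below uses the section's own rough figures (7 windowed series plus about 6 metadata series per SLO) as defaults; the function name is hypothetical and the per-SLO counts are approximations, not exact generator output.

```python
def estimated_slo_series(
    services: int,
    slos_per_service: int,
    windowed_series_per_slo: int = 7,
    metadata_series_per_slo: int = 6,
) -> int:
    """Rough count of recording-rule series added by generated SLO rules.

    Excludes alert-state series, which come on top of this figure.
    """
    per_slo = windowed_series_per_slo + metadata_series_per_slo
    return services * slos_per_service * per_slo


# The mid-size estate from the text: 70 services with 3 SLOs each.
print(estimated_slo_series(70, 3))
```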

Interview & exam questions

1. Why generate SLO rules from a spec instead of writing PromQL by hand? Because the multi-window multi-burn-rate pattern is mechanical and consistent across services, and humans introduce drift at scale — a mistyped window length or burn threshold in one of hundreds of rules can silence a real page. A generator bakes the tested table in once, so no human types a threshold, and one spec change regenerates every rule identically. Consistency at scale is the win, not saving keystrokes.

2. What is the difference between a ratio SLI and a threshold SLI? A ratio SLI divides one counter (good or bad) by another (total) — availability and count-based latency fit. A threshold SLI compares a single gauge against a bound and counts the fraction satisfying it — useful for saturation, freshness, or queue depth where there is no natural good/total count. OpenSLO models both (ratioMetric, thresholdMetric); Sloth and Pyrra are ratio-first.

3. Why must a latency SLI be a count-based bucket ratio and not a percentile? A percentile answers “how slow is the 99.5th-slowest request”; the SLI must answer “what fraction of requests were fast enough”. Only the second composes into an error budget you can spend and track. A latency SLI is sum(rate(bucket{le="T"}[w])) / sum(rate(_count[w])) — the count at or under the threshold over the total — never histogram_quantile().

4. Explain budgetingMethod: Occurrences vs Timeslices. Occurrences consumes budget per bad event — one failed request spends one request’s worth, right for high-throughput APIs where volume is the meaningful unit. Timeslices consumes per bad window — a violating one-minute slice spends one slice regardless of request count, right for low-traffic or time-based services where “the slice was bad” is the unit. Sloth is events/Occurrences-native.

5. What does Sloth generate from a single SLO, and what would you miss writing it by hand? SLI error-ratio recording rules at ~7 windows (5m to 3d), metadata rules for the objective and error budget, the four MWMBR alert rules (page and ticket), and a SLOMetricAbsent alert. The absent-metric alert and consistent labels are what hand-rolled rules almost always miss — the former catches a rule that silently stopped evaluating; the latter makes routing and inhibition tractable.

6. How do OpenSLO, Sloth, and Pyrra relate? OpenSLO is a spec (a portable file format, no runtime). Sloth is a generator (CLI + controller, no UI) that reads OpenSLO or its own format. Pyrra is a generator plus a UI with its own CRD, a live budget dashboard, and filesystem/Kubernetes backends. A common architecture is OpenSLO as source, Sloth as the canonical generator, Pyrra for the operator dashboard.

7. Your burn-rate alert never fired during a real outage — what is the first thing you check? Whether the SLI’s good/total queries match any series. A renamed metric leaves the ratio computing on nothing (0/0), so no alert ever fires and everything looks healthy. Paste the SLI queries into Prometheus; an empty result is the cause. Prevent it with a CI check that asserts each SLI’s total matches at least one series, and with a SLOMetricAbsent alert.

8. Why keep generated rules out of source control? So the spec is unambiguously the source of truth and nobody can hand-edit a generated rule into drift. Committing generated output invites someone to tweak the PrometheusRule directly, silently diverging it from the spec — exactly the drift the generator was meant to eliminate. Regenerate in CI on merge instead.

9. How do you route the generated alerts, and why is inhibition needed? Route on the severity: page / severity: ticket labels the generator emits (page → PagerDuty, ticket → Slack), grouping by sloth_service/sloth_slo. Inhibition is needed because during a real outage both the fast page (1h/5m) and the slower ticket (1d/2h) conditions fire for the same SLO; an inhibit rule silencing the ticket while the page is active keeps one incident to one notification stream.

10. What CI gates would you put on an SLO-definitions repo? sloth validate (malformed specs), sloth generate -o /dev/null (bad templates/refs), promtool check rules (invalid PromQL/structure), and — the one most skip — promtool test rules with a known-burn fixture that proves the page fires at the right burn rate and stays silent below it. Optionally, a query against Prometheus to confirm each SLI selector matches real series.

11. How does the error budget gate a release, and what is the failure mode to avoid? The pipeline queries remaining budget (1 - error_ratio/budget) before promoting and blocks if it is below a threshold (e.g. 10%). The failure mode to avoid is a gate that fails open — if Prometheus is unreachable the gate must block or require explicit override, never silently pass, or it is not a gate.

12. When is generating SLO rules overkill? For a single service with one SLO owned end to end by one engineer who fully understands the hand-written rule — there, installing and operating a generator may add more complexity than it removes. The value of generation is consistency across many services; with one SLO there is nothing to be consistent with.
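
The Occurrences versus Timeslices distinction from question 4 is easiest to see on the same data computed both ways. The following Python sketch is a simplified model, not OpenSLO reference code: each slice is a (good, total) pair for one time slice, the function names are hypothetical, and treating empty slices as good is an assumption made here for illustration.

```python
from typing import List, Tuple

Slice = Tuple[int, int]  # (good events, total events) in one time slice


def occurrences_error_ratio(slices: List[Slice]) -> float:
    """Budget consumed per bad event: bad events over all events."""
    total = sum(t for _, t in slices)
    bad = sum(t - g for g, t in slices)
    return bad / total if total else 0.0


def timeslices_error_ratio(slices: List[Slice], slice_target: float) -> float:
    """Budget consumed per bad slice: bad slices over all slices.

    A slice is bad when its own good/total ratio falls below `slice_target`.
    Slices with no traffic are counted as good (an assumption of this sketch).
    """
    if not slices:
        return 0.0
    bad = sum(1 for g, t in slices if t and g / t < slice_target)
    return bad / len(slices)


# Three busy, healthy minutes and one quiet minute where 1 of 2 requests failed.
minutes = [(1000, 1000), (1000, 1000), (1, 2), (1000, 1000)]
print(occurrences_error_ratio(minutes))        # one bad event out of 3002
print(timeslices_error_ratio(minutes, 0.99))   # one bad minute out of four
```

The same single failed request is almost invisible under Occurrences and costs a quarter of the slices under Timeslices, which is why the choice should follow traffic shape.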

These map to SRE and platform-engineering interviews rather than a single vendor cert, though the Prometheus/PromQL depth aligns with the PCA (Prometheus Certified Associate) and the SLO/error-budget material with Google’s SRE body of knowledge. A compact revision map:

Question theme	Relevant body of knowledge
SLI/SLO/error-budget definitions	Google SRE Workbook; PCA
PromQL for ratios/histograms	Prometheus Certified Associate
Multi-window burn-rate alerting	SRE Workbook (Alerting on SLOs)
Generators (Sloth/Pyrra) + OpenSLO	Platform/SRE tooling
CI/CD gating on budgets	DevOps / release-engineering

Quick check

1. You need “99.5% of requests under 300ms” as an SLI. Write the shape of the good and total queries, and name the mistake you must avoid.
2. True or false: you should commit the generated PrometheusRule files to Git so they are versioned.
3. Sloth and Pyrra both generate burn-rate rules. What does Pyrra add that Sloth does not, and what does Sloth emit that Pyrra (as of v0.7) does not?
4. Your burn-rate page alert never fired during a genuine 8% error outage, and the SLI dashboard shows no data. What is the single most likely cause?
5. What is the difference between budgetingMethod: Occurrences and Timeslices, in one sentence each?

Answers

1. good = sum(rate(http_request_duration_seconds_bucket{le="0.3"}[w])), total = sum(rate(http_request_duration_seconds_count[w])): a count-based ratio over the le="0.3" histogram bucket. The mistake to avoid is using histogram_quantile() (a percentile), which measures how slow the tail is, not what fraction was fast enough, and does not compose into a budget.
2. False. Generated rules are a build artifact; committing them invites hand-editing and drift. Keep only the spec in Git and regenerate the rules in CI on every merge, so the spec is the unambiguous source of truth.
3. Pyrra adds a live error-budget UI (dashboard of remaining budget, burn rate, projected exhaustion) via its own ServiceLevelObjective CRD. Sloth emits a SLOMetricAbsent alert, firing when the SLI series disappears entirely, which Pyrra does not, and which catches a rule that has silently stopped evaluating.
4. The SLI’s good/total queries match no series, almost certainly because of a renamed or mis-selected metric, so the error ratio computes on nothing and no alert can fire. Paste the SLI queries into Prometheus; an empty result confirms it. Prevent it with a CI metric-existence check and an absent-metric alert.
5. Occurrences consumes budget per bad event (one failed request spends one request’s worth), which suits high-throughput APIs. Timeslices consumes budget per bad time-slice (a violating minute spends one slice regardless of request count), which suits low-traffic or time-based services.
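
As a companion to answer 1, the count-based latency ratio can be worked through on concrete numbers. This Python sketch assumes cumulative histogram bucket counts keyed by their le label, as Prometheus histograms expose them, with the "+Inf" bucket equal to the total request count; the function name and the sample figures are made up for illustration.

```python
from typing import Mapping


def latency_sli(cumulative_buckets: Mapping[str, float], threshold_le: str) -> float:
    """Fraction of requests at or under the threshold bucket.

    `cumulative_buckets` maps an `le` label to the cumulative request count,
    so the "+Inf" entry is the total number of requests observed.
    """
    total = cumulative_buckets["+Inf"]
    if total == 0:
        # No traffic means the SLI is undefined; do not report it as 100%.
        raise ValueError("no requests observed")
    return cumulative_buckets[threshold_le] / total


# Hypothetical window: 1000 requests, 990 of them finished within 300ms.
buckets = {"0.1": 900, "0.3": 990, "+Inf": 1000}
sli = latency_sli(buckets, "0.3")
print(sli, sli >= 0.995)
```

With these figures the SLI is 0.99, below the 99.5% target, and the shortfall (0.5% of requests) is a quantity that can be charged against a budget, which a percentile value cannot be.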

Glossary

SLI (Service Level Indicator) — a ratio of good events over valid events that measures one dimension of user experience (availability, latency, correctness).
SLO (Service Level Objective) — an SLI plus a target over a time window (e.g. 99.9% over 28 days); the promise you make and the source of the error budget.
Error budget — the allowed fraction of bad events, 1 - target; the “currency” you can spend on failures and change velocity before you must slow down.
Ratio metric — an SLI shape dividing one counter (good or bad) by a total counter; the natural form for availability and count-based latency.
Threshold metric — an SLI shape comparing a single gauge against a bound and counting the fraction satisfying it; for saturation, freshness, or queue depth.
Budgeting method — how budget is consumed: Occurrences (per bad event) or Timeslices (per bad time-window).
Burn rate — how fast the error budget is being spent relative to the flat rate that would exhaust it exactly over the window; 14.4× spends a 28-day budget in about two days.
Multi-window multi-burn-rate (MWMBR) — the alerting pattern pairing a long and short window per severity so an alert fires only when both agree on sustained burn; catches fast outages quickly and slow bleeds without noise.
OpenSLO — a vendor-neutral, Kubernetes-style YAML specification for SLOs (kind: SLI, SLO, AlertPolicy, etc.); a portable source format with no runtime of its own.
Sloth — an open-source generator (CLI + PrometheusServiceLevel controller) that compiles an SLO spec (native or OpenSLO) into Prometheus recording and burn-rate alert rules.
Pyrra — an open-source generator and UI with its own ServiceLevelObjective CRD, generating equivalent rules and serving a live error-budget dashboard.
PrometheusServiceLevel — Sloth’s Kubernetes CRD, reconciled by its controller into a managed PrometheusRule.
ServiceLevelObjective — Pyrra’s Kubernetes CRD, reconciled into PrometheusRule objects and surfaced in the Pyrra UI.
slo:sli_error:ratio_rate5m — a Sloth-generated recording rule holding the SLI error ratio over a 5-minute window; similar series exist for 30m, 1h, 2h, 6h, 1d, 3d.
SLOMetricAbsent — a Sloth-generated alert that fires when the SLI series disappears entirely, catching a rule that has silently stopped evaluating.
Occurrences / Timeslices — the two OpenSLO budgeting methods: consume budget per bad event, or per bad time-window, respectively.
Error-budget policy — the agreed table of consequences as budget depletes (ship freely, flags only, freeze), enforced by a report and a release gate rather than a meeting.
promtool test rules — the Prometheus tool that unit-tests alerting rules against synthetic input series, proving an alert fires (and stays silent) at the intended thresholds.

Next steps

You can now author SLIs and SLOs as code, generate correct multi-window burn-rate rules from them, and enforce the budget in CI. Build outward:

Next: SLOs and Error Budgets in Practice: Defining SLIs and Building Multi-Window Burn-Rate Alerts — the conceptual foundation: why the burn-rate math works and how to reason about budgets.
Related: PromQL in Anger: Rate, Histograms, and Aggregation Patterns That Actually Work — the query language the generators emit; read generated rules fluently.
Related: Designing Alertmanager Routing Trees: Grouping, Inhibition, Silences, and Dedup — the routing layer that consumes the labels these generators produce.
Related: Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code — build the budget-burn dashboards that visualize the generated series.
Related: Scaling Prometheus: Recording Rules, Remote-Write, and Long-Term Storage with Thanos and Mimir — the recording-rule and long-term-storage machinery under the generated rules.
Related: Building an On-Call Practice: PagerDuty Escalation, Alert Routing, and Actionable Runbooks — turn the pages these SLOs raise into an actionable on-call practice.