Grafana as Code: Provisioning Dashboards, Folders, and Unified Alerting with Terraform

Clicking a dashboard into existence is fine until the third person edits it, the alert someone hand-tuned at 2 a.m. never gets reviewed, and nobody can answer “is staging configured the same as prod?” Grafana has two complementary answers to this, and most teams pick the wrong one for the wrong layer. File-based provisioning, loaded from YAML and JSON on disk, is the right tool for bootstrap config that ships with the instance. The Grafana Terraform provider is the right tool for everything that has a lifecycle - folders, permissions, library panels, alert rules, notification policies - because Terraform tracks state, computes drift, and plans changes before they land. This article builds the full pipeline: where each model belongs, how to parameterize dashboards across environments, how to express unified alerting as code, and how to wire plan, lint, and drift detection into CI.

The whole thing assumes Grafana 11 or later (unified alerting is the only alerting system now; legacy alerting was removed) and the grafana/grafana Terraform provider 3.x.

1. Pick the right provisioning model per layer

The two mechanisms are not competitors. They own different layers of the stack.

| Concern | File-based provisioning | Terraform provider |
| --- | --- | --- |
| Data sources (bootstrap) | Yes - ships with the image | Possible, but state churns on secrets |
| Folders and folder permissions | No native file support | Yes - first-class resource |
| Dashboards | Yes (JSON on disk) | Yes (grafana_dashboard) |
| Library panels | No | Yes (grafana_library_panel) |
| Contact points / notification policies | Yes (alerting YAML) | Yes |
| Alert rules | Yes (alerting YAML) | Yes (grafana_rule_group) |
| Cross-environment templating | Limited (env vars) | Full - variables, workspaces, modules |
| Drift detection | None | terraform plan |

The rule I apply on every platform team: anything that must exist before Grafana can serve a single request goes in file-based provisioning; anything with a review-and-promote lifecycle goes in Terraform. Data sources are the classic boundary case. The Prometheus data source the instance cannot start usefully without belongs in a provisioning file baked into the container. A team’s dashboards and their alert rules belong in Terraform so a pull request gates every change.

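File-based provisioning is declarative and idempotent: Grafana reconciles the on-disk files into its database on startup, and dashboard providers additionally rescan their directory on a configurable interval (updateIntervalSeconds). Critically, a resource provisioned from a file is read-only in the UI by default - a data source with editable: false cannot be changed, and a provisioned dashboard cannot be saved over - which is exactly the guarantee you want for the bootstrap layer.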

Here is the minimal data source file. It lives at /etc/grafana/provisioning/datasources/:

# datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090
    uid: prometheus-main        # pin the uid - dashboards reference it
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: 30s
    version: 1
    editable: false

The single most important field is uid. Pin it. Dashboards and alert rules reference data sources by UID, not name, and if you let Grafana auto-generate one it will differ between dev and prod and every panel will say “datasource not found.” A pinned, environment-stable UID is what makes a dashboard portable.
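A pinned UID is easy to verify mechanically. The sketch below walks a dashboard's JSON and reports any data source UID that is not in the pinned set. The function names and the sample dashboard are hypothetical, and it assumes panels and targets reference data sources as an object with a uid key, as in the examples in this article.

```python
import json
from typing import Any, Iterator


def iter_datasource_uids(node: Any) -> Iterator[str]:
    """Yield every datasource UID found anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and isinstance(ds.get("uid"), str):
            yield ds["uid"]
        for value in node.values():
            yield from iter_datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_datasource_uids(item)


def unknown_datasource_uids(dashboard: dict, allowed: set[str]) -> set[str]:
    """Return UIDs the dashboard references that are not in the pinned set.

    References starting with "$" are dashboard variables or template tokens
    (for example "${datasource_uid}") and are skipped.
    """
    found = {u for u in iter_datasource_uids(dashboard) if not u.startswith("$")}
    return found - allowed


# Hypothetical dashboard: one panel uses the pinned UID, one an auto-generated one.
sample = json.loads("""
{
  "uid": "service-overview",
  "panels": [
    {"title": "Request rate",
     "datasource": {"type": "prometheus", "uid": "prometheus-main"},
     "targets": [{"datasource": {"type": "prometheus", "uid": "prometheus-main"}}]},
    {"title": "Latency",
     "datasource": {"type": "prometheus", "uid": "P1809F7CD0C75ACF3"}}
  ]
}
""")

print(unknown_datasource_uids(sample, {"prometheus-main"}))
# {'P1809F7CD0C75ACF3'}
```

Running a check like this in CI catches a dashboard that was exported from an instance with auto-generated UIDs before it reaches another environment. Built-in data sources would need to be added to the allowed set.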

2. Bootstrap the provider and manage folders as immutable resources

Folders are the unit of organization and the unit of permission in Grafana, so they are the first thing Terraform should own. Configure the provider with a service account token (API keys are deprecated; use service accounts):

# providers.tf
terraform {
  required_version = ">= 1.6"
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}

provider "grafana" {
  url  = var.grafana_url               # e.g. https://grafana.stage.internal
  auth = var.grafana_service_account_token
}

Never put the token in a .tf file or terraform.tfvars that gets committed. Source it from the environment (TF_VAR_grafana_service_account_token) backed by your secrets manager - Vault, AWS Secrets Manager, or the CI provider’s secret store.

Now the folder and its permissions:

# folders.tf
resource "grafana_folder" "platform" {
  title = "Platform"
  uid   = "platform"          # stable across environments
}

resource "grafana_folder_permission" "platform" {
  folder_uid = grafana_folder.platform.uid

  permissions {
    team_id    = grafana_team.sre.id
    permission = "Edit"
  }
  permissions {
    role       = "Viewer"
    permission = "View"
  }
}

Treat folder UIDs the same way you treat data source UIDs: pin them, keep them identical across environments, and never change a folder's UID once it is in use - the title can be edited in place, but a new UID means a new folder, and every dashboard inside has to be re-homed. Permissions are deliberately coarse here. Grafana’s folder model gives you View/Edit/Admin, and pushing all fine-grained access through team membership (rather than per-folder Terraform sprawl) keeps the blast radius of any permission change small and auditable.

3. Import existing dashboards and parameterize them

Almost nobody starts from a blank dashboard. You have dozens already built in the UI, and the migration path is: export the JSON, strip the instance-specific bits, and feed it to Terraform.

Export from the API (not the UI “Share” dialog, which adds export-only metadata):

curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/dashboards/uid/abc123" \
  | jq '.dashboard' > dashboards/service-overview.json

Two fields must be removed from the exported JSON before Terraform manages it:

jq 'del(.id, .version)' dashboards/service-overview.json \
  > dashboards/service-overview.clean.json \
  && mv dashboards/service-overview.clean.json dashboards/service-overview.json

The id is the database primary key - it is instance-local and meaningless in another Grafana. The version is server-managed; leaving it in causes a perpetual diff. Drop both. Keep the uid - that is the stable, portable identifier you want consistent across environments.

The hardcoded data source UID inside the panels is the next problem. Replace it with a template token so the same JSON works everywhere:

# dashboards.tf
resource "grafana_dashboard" "service_overview" {
  folder      = grafana_folder.platform.uid
  config_json = templatefile("${path.module}/dashboards/service-overview.json", {
    datasource_uid = var.prometheus_datasource_uid
  })
  overwrite = true
}

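Inside the JSON, panels reference "datasource": { "uid": "${datasource_uid}" }, and templatefile substitutes the per-environment value at plan time. One caveat: templatefile treats every ${...} and %{...} sequence in the file as template syntax, so a Grafana variable written with braces (for example ${service}) must be escaped as $${service}, or the plan fails on an unknown template variable. overwrite = true lets Terraform reconcile a dashboard that already exists with the same UID rather than failing - essential when you import into an instance that already has the dashboard.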
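Turning an exported dashboard into a templatefile-ready file can be scripted. The following is a minimal sketch, not a definitive tool: it drops id and version, swaps one hardcoded data source UID for the ${datasource_uid} token, and escapes every other ${ and %{ so Terraform renders them literally. The function names, the sentinel string, and the sample dashboard are all hypothetical.

```python
import json

# Hypothetical sentinel; it must not occur anywhere in the real dashboard.
PLACEHOLDER = "__DATASOURCE_UID__"


def escape_for_templatefile(text: str) -> str:
    """Escape Terraform template sequences so they are rendered literally."""
    return text.replace("${", "$${").replace("%{", "%%{")


def tokenize_datasource(node, hardcoded_uid: str) -> None:
    """Recursively replace the hardcoded datasource UID with the sentinel."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") == hardcoded_uid:
            ds["uid"] = PLACEHOLDER
        for value in node.values():
            tokenize_datasource(value, hardcoded_uid)
    elif isinstance(node, list):
        for item in node:
            tokenize_datasource(item, hardcoded_uid)


def to_template(dashboard: dict, hardcoded_uid: str) -> str:
    """Return dashboard JSON text that is safe to pass through templatefile."""
    dashboard = json.loads(json.dumps(dashboard))  # deep copy, input stays intact
    dashboard.pop("id", None)       # instance-local primary key
    dashboard.pop("version", None)  # server-managed counter
    tokenize_datasource(dashboard, hardcoded_uid)
    text = json.dumps(dashboard, indent=2, sort_keys=True)
    text = escape_for_templatefile(text)  # protect Grafana's own ${...} variables
    return text.replace(PLACEHOLDER, "${datasource_uid}")


exported = {
    "id": 42,
    "version": 7,
    "uid": "service-overview",
    "panels": [{
        "id": 1,
        "datasource": {"type": "prometheus", "uid": "abc123xyz"},
        "targets": [{"expr": "up", "legendFormat": "${service}"}],
    }],
}

print(to_template(exported, "abc123xyz"))
```

The escaping runs before the token is inserted, so the only unescaped ${...} left in the output is the one Terraform is meant to fill.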

For dashboards that change often, hand-editing 1,500 lines of JSON does not scale. This is where Grafonnet earns its keep. Grafonnet is a Jsonnet library that renders dashboard JSON from composable functions, so a panel becomes a call you parameterize and reuse:

// service.jsonnet
local g = import 'g.libsonnet';
local prometheus = g.query.prometheus;

g.dashboard.new('Service Overview')
+ g.dashboard.withUid('service-overview')
+ g.dashboard.withPanels([
  g.panel.timeSeries.new('Request rate')
  + g.panel.timeSeries.queryOptions.withTargets([
    prometheus.new(
      '$datasource',
      'sum(rate(http_requests_total{service="$service"}[$__rate_interval]))'
    ),
  ]),
])

Render it to JSON in CI, then hand the output to the same grafana_dashboard resource:

jsonnet -J vendor service.jsonnet > dashboards/service-overview.json

The payoff is that a fleet of services share one Jsonnet template and differ only by their $service variable, instead of forty copy-pasted JSON files that drift apart panel by panel.

4. Express unified alerting as code

Unified alerting has three moving parts, and each maps to a Terraform resource: contact points (where a notification goes), notification policies (how alerts route to contact points), and rule groups (what fires). Manage all three in Terraform and the entire alerting surface becomes reviewable.

Contact point first - the destination:

# alerting_contactpoints.tf
resource "grafana_contact_point" "platform_oncall" {
  name = "platform-oncall"

  slack {
    url   = var.slack_webhook_url
    title = "{{ .CommonLabels.alertname }}"
    text  = "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
  }
}

The notification policy tree decides routing. There is exactly one root policy per Grafana instance, so it is a singleton resource:

# alerting_policies.tf
resource "grafana_notification_policy" "root" {
  contact_point = grafana_contact_point.platform_oncall.name
  group_by      = ["alertname", "service"]

  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "4h"

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.platform_oncall.name
    group_wait      = "10s"
    repeat_interval = "1h"
  }
}

group_wait is how long Grafana buffers the first alert in a new group before sending, so a burst of related failures arrives as one notification rather than forty. repeat_interval is the re-notify cadence for an alert that stays firing. Tightening both for severity=critical in the nested policy is the standard pattern: page faster and re-page sooner for the things that matter.
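To make the timing concrete, here is a deliberately simplified model of Alertmanager-style grouping, which Grafana's notification policies follow. It assumes alerts never resolve, ignores silences and mute timings, and treats all durations as seconds. The function name and the sample timings are illustrative only.

```python
def notification_times(alert_starts, group_wait, group_interval,
                       repeat_interval, horizon):
    """Return (time, alert_count) for each notification one group would send.

    alert_starts: seconds at which each alert in the group began firing.
    The first flush happens group_wait after the first alert; later flushes
    happen every group_interval. A flush sends a notification only if the
    set of alerts changed or repeat_interval has elapsed since the last send.
    """
    flush = min(alert_starts) + group_wait
    sent = []
    notified = set()   # alerts included in the last notification
    last_sent = None
    while flush <= horizon:
        active = {i for i, start in enumerate(alert_starts) if start <= flush}
        changed = active != notified
        due = last_sent is not None and flush - last_sent >= repeat_interval
        if changed or due:
            sent.append((flush, len(active)))
            notified = active
            last_sent = flush
        flush += group_interval
    return sent


# Root policy from above: group_wait 30s, group_interval 5m, repeat_interval 4h.
# Three alerts arrive within 20 seconds, a fourth arrives much later.
print(notification_times([0, 5, 20, 400], 30, 300, 4 * 3600, 3600))
# [(30, 3), (630, 4)]
```

The burst of three arrives as one notification at the 30-second mark. The late fourth alert waits for the next group_interval flush that sees it, and nothing repeats inside the first hour because repeat_interval is four hours.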

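Now the rule group. A rule group is the unit of alert evaluation - every rule in it shares one evaluation interval. Do not rely on evaluation order inside the group: unlike data-source-managed (Prometheus-style) groups, Grafana-managed rules in a group are not evaluated one after another: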

# alerting_rules.tf
resource "grafana_rule_group" "service_health" {
  name             = "service-health"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60

  rule {
    name      = "HighErrorRatio"
    condition = "C"
    for       = "5m"

    data {
      ref_id         = "A"
      datasource_uid = var.prometheus_datasource_uid
      relative_time_range {
        from = 600
        to   = 0
      }
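      model = jsonencode({
        # "by (service)" keeps the service label that the summary annotation uses
        expr    = "sum by (service) (rate(http_requests_total{code=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m]))"
        instant = true # the threshold expression needs one value per series, not a range
        refId   = "A"
      })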
    }

    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type       = "threshold"
        expression = "A"
        refId      = "C"
        conditions = [{
          evaluator = { type = "gt", params = [0.05] }
        }]
      })
    }

    labels = {
      severity = "critical"
    }
    annotations = {
      summary = "Error ratio above 5% for {{ $labels.service }}"
    }
  }
}

Two things trip people up here. First, the special data source UID __expr__ is Grafana’s server-side expression engine: query A pulls the raw ratio from Prometheus, and the threshold expression C is what the condition points at. The query does not fire the alert; the expression does. Second, for = "5m" is the pending period - the condition must hold continuously for five minutes before the alert transitions from Pending to Firing, which is your defense against single-scrape blips. The interval_seconds on the group (how often it evaluates) and for on the rule (how long it must stay true) are independent knobs; get either one wrong and you either page on noise or react too slowly.
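The interaction between the two knobs is easier to see as a small state machine. This is a simplified sketch, assuming one evaluation per interval and only the Normal, Pending, and Firing states; it ignores NoData and Error handling, and the function name is hypothetical.

```python
def simulate(condition_by_tick, interval_seconds, for_seconds):
    """Return the alert state after each evaluation.

    condition_by_tick: one boolean per evaluation, True if the threshold
    expression was breached on that evaluation.
    """
    states = []
    pending_since = None  # time at which the current breach streak started
    for tick, breached in enumerate(condition_by_tick):
        now = tick * interval_seconds
        if not breached:
            pending_since = None  # any clean evaluation resets the clock
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = now
        if now - pending_since >= for_seconds:
            states.append("Firing")
        else:
            states.append("Pending")
    return states


# interval_seconds = 60 and for = "5m", as in the rule group above.
# A two-evaluation blip, one clean evaluation, then a sustained breach.
blip_then_outage = [True, True, False, True, True, True, True, True, True]
print(simulate(blip_then_outage, 60, 300))
# ['Pending', 'Pending', 'Normal', 'Pending', 'Pending', 'Pending',
#  'Pending', 'Pending', 'Firing']
```

The blip never fires, and the sustained breach fires on the sixth consecutive breached evaluation, five minutes after the streak began. Raising interval_seconds to 300 with the same for would mean only two evaluations separate Pending from Firing, which is much coarser protection.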

5. Build reusable library panels and dashboard modules

A library panel is a single panel definition stored once and referenced by many dashboards. Edit it in one place and every dashboard that embeds it updates. Terraform owns the definition:

# library_panels.tf
resource "grafana_library_panel" "error_ratio" {
  name       = "Error Ratio"
  folder_uid = grafana_folder.platform.uid
  model_json = jsonencode({
    title = "Error Ratio"
    type  = "timeseries"
    targets = [{
      expr = "sum(rate(http_requests_total{code=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total[$__rate_interval]))"
    }]
  })
}

The higher-leverage abstraction is a Terraform module that packages a folder, its standard dashboards, its library panels, and its alert rules into one callable unit. A platform team exposes a grafana-service module and every product team instantiates it with a few variables:

# teams/checkout/main.tf
module "checkout" {
  source                  = "../../modules/grafana-service"
  service_name            = "checkout"
  folder_title            = "Checkout"
  prometheus_datasource   = var.prometheus_datasource_uid
  oncall_contact_point    = "checkout-oncall"
  error_ratio_threshold   = 0.02
}

This is the difference between forty teams each reinventing a dashboard and forty teams inheriting a vetted standard. The module enforces that every service gets the same RED panels, the same naming, the same alert structure - and an improvement to the module propagates everywhere on the next apply.

6. Promote across environments with workspace isolation

The same code must produce dev, stage, and prod without copy-paste. Two patterns work; pick one and do not mix them.

The first is one Terraform workspace per environment with a tfvars file each:

terraform workspace new prod
terraform workspace select prod
terraform plan  -var-file=env/prod.tfvars
terraform apply -var-file=env/prod.tfvars

# env/prod.tfvars
grafana_url                = "https://grafana.prod.internal"
prometheus_datasource_uid  = "prometheus-prod"
error_ratio_threshold      = 0.02     # tighter in prod

The second pattern - which I prefer at scale - drops Terraform CLI workspaces in favor of one directory per environment, each with its own backend state and a shared module. The directory-per-environment layout makes the state boundary explicit on disk and removes the foot-gun of running apply against the wrong selected workspace:

environments/
  dev/    -> backend "dev", calls module "platform"
  stage/  -> backend "stage", calls module "platform"
  prod/   -> backend "prod", calls module "platform"
modules/
  platform/

Either way, the non-negotiable rule is separate state per environment. Shared state means a terraform apply aimed at dev can corrupt prod’s resource tracking. Separate backends - separate S3 keys or separate Terraform Cloud workspaces - give each environment an isolated blast radius. Promotion is then a git merge: a dashboard change lands in the dev directory, gets validated, and the same commit is promoted to stage and prod through your normal PR flow, with each environment’s apply running against its own state.

7. Wire the CI pipeline: plan, lint, drift detection

The pipeline has three stages. The first two gate every merge; the third runs plan on the pull request, apply on merge, and a drift check on a schedule.

Format and validate catch the cheap mistakes before anything touches Grafana:

# .github/workflows/grafana.yml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate

Lint the dashboards. This is the step most teams skip and most regret. A dashboard JSON that is structurally valid can still hardcode a data source, leave panels untitled, or use a fixed rate window instead of $__rate_interval. dashboard-linter (the Grafana Labs tool) checks dashboards against a set of best-practice rules - template variable usage, panel titles, target configuration:

# lint every dashboard JSON in the repo
go install github.com/grafana/dashboard-linter@latest
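for f in dashboards/*.json; do
  # --strict makes the linter exit non-zero on errors or warnings;
  # stop at the first failing dashboard so the CI step fails
  dashboard-linter lint --strict "$f" || exit 1
done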
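Alongside the linter, a small repo-specific check can enforce the conventions from section 3 that dashboard-linter does not know about. This is a sketch with a hypothetical function name. It only inspects top-level fields: the uid must be pinned, and the instance-local id and version must be gone.

```python
def dashboard_problems(dashboard: dict) -> list[str]:
    """Return a list of convention violations for one dashboard JSON object."""
    problems = []
    if not dashboard.get("uid"):
        problems.append("missing pinned uid")
    for field in ("id", "version"):
        if field in dashboard:
            problems.append(f"instance-local field '{field}' still present")
    if not dashboard.get("title"):
        problems.append("missing title")
    return problems


# Hypothetical inputs: a cleaned dashboard and a raw API export.
cleaned = {"uid": "service-overview", "title": "Service Overview", "panels": []}
raw_export = {"id": 42, "version": 7, "title": "Service Overview", "panels": []}

print(dashboard_problems(cleaned))     # []
print(dashboard_problems(raw_export))
# ['missing pinned uid', "instance-local field 'id' still present",
#  "instance-local field 'version' still present"]
```

In CI this would run over every file in dashboards/ and fail the job when any list is non-empty.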

Plan on PR, apply on merge, and run drift detection on a schedule. The plan output posted to the pull request is what reviewers actually read - it shows exactly which folders, dashboards, and alert rules change. Drift detection is a scheduled plan that should always be empty; a non-empty plan on the nightly run means someone edited Grafana through the UI behind Terraform’s back:

  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -detailed-exitcode
        # exit 0 = no drift, 2 = drift detected, 1 = error

The -detailed-exitcode flag is the trick: plan returns exit code 2 when there are changes to apply, so the scheduled job fails loudly the moment configuration drifts from code. Wire that failure to the same Slack channel your alerts go to.
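If the scheduled job needs to do more than fail, for example post a different message for drift than for a broken plan, the exit code can be classified in a small wrapper. This is a minimal sketch; the PlanResult names and the run_plan helper are hypothetical, and the working directory is an assumption about the repository layout.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    CLEAN = "no changes"
    DRIFT = "changes present"
    ERROR = "plan failed"


def classify(exit_code: int) -> PlanResult:
    """Map a `terraform plan -detailed-exitcode` exit code to a result."""
    if exit_code == 0:
        return PlanResult.CLEAN
    if exit_code == 2:
        return PlanResult.DRIFT
    return PlanResult.ERROR  # 1, or anything unexpected


def run_plan(workdir: str, extra_args: tuple[str, ...] = ()) -> PlanResult:
    """Run the plan in workdir (an already-initialized Terraform directory)."""
    cmd = ["terraform", "plan", "-detailed-exitcode",
           "-input=false", "-no-color", *extra_args]
    proc = subprocess.run(cmd, cwd=workdir, check=False)
    return classify(proc.returncode)


print(classify(0), classify(2), classify(1))
# PlanResult.CLEAN PlanResult.DRIFT PlanResult.ERROR
```

A call such as run_plan("environments/prod") would then let the job treat DRIFT and ERROR as separate failure modes, which matters because a credentials error should not be reported as someone editing the UI.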

Verify

Confirm the system end to end, not just that apply exited zero.

Check Terraform’s view of the world matches reality:

terraform plan -detailed-exitcode   # must report "No changes" (exit 0)

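Confirm the dashboard exists with the UID you pinned. Expect .meta.provisioned to be false here: that flag marks dashboards loaded by file-based provisioning, not ones Terraform created through the API: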

curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/dashboards/uid/service-overview" \
  | jq '.dashboard.title, .meta.provisioned'

Confirm the alert rule is loaded and evaluating - the rules API returns every rule’s current state and health:

curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/prometheus/grafana/api/v1/rules" \
  | jq '.data.groups[].rules[] | {name: .name, state: .state, health: .health}'

A healthy rule reports health: "ok" and a state of "inactive" (condition not met), "pending" (met, inside the for window), or "firing". A health of "error" means the query or expression is broken - usually a wrong data source UID.
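The same check can run as a post-apply gate. The sketch below assumes the response follows the Prometheus rules API shape (data.groups[].rules[] with name, state, and health fields). The payload shown is a hand-written sample, not captured output, and the function name is hypothetical.

```python
def unhealthy_rules(payload: dict) -> list[tuple[str, str, str]]:
    """Return (group, rule, health) for every rule whose health is not "ok"."""
    problems = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            health = rule.get("health", "unknown")
            if health != "ok":
                problems.append((group.get("name", ""), rule.get("name", ""), health))
    return problems


# Hand-written sample in the assumed response shape.
sample_payload = {
    "status": "success",
    "data": {
        "groups": [{
            "name": "service-health",
            "rules": [
                {"name": "HighErrorRatio", "state": "inactive", "health": "ok"},
                {"name": "HighLatency", "state": "inactive", "health": "error"},
            ],
        }],
    },
}

print(unhealthy_rules(sample_payload))
# [('service-health', 'HighLatency', 'error')]
```

Failing the pipeline when this list is non-empty catches a rule that applied cleanly but cannot evaluate.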

Test the contact point without waiting for a real alert. The UI’s “Test” button on the contact point sends a synthetic notification; use it to confirm the Slack webhook actually delivers before you depend on it at 3 a.m.

Finally, prove drift detection works: edit a dashboard panel in the UI, then run terraform plan. It must show the panel reverting on the next apply. If it shows nothing, your state is not tracking what you think it is.

Enterprise scenario

A payments platform ran a single Grafana Enterprise instance behind their SRE team and let product teams self-serve dashboards through the UI. It worked until an auditor asked a simple question during a SOC 2 review: “show me the change history and approver for every production alert rule in the last quarter.” There was none. Alert thresholds had been edited live, dozens of times, with no record of who or why - and two of those edits had silenced a critical database-saturation alert that later contributed to an outage.

The constraint they could not move was organizational: 30+ product teams, none of whom would tolerate filing a ticket with SRE for every dashboard tweak. A central team owning all of Grafana through one Terraform state would have become the bottleneck the teams were trying to escape, and a single state file touched by 30 teams is a merge-conflict and blast-radius nightmare.

The solution was a directory-per-team layout over a shared module, with separate backend state per team, and Grafana’s UI editing disabled for everything Terraform managed. Each team got a directory, owned its own state, and instantiated the same vetted grafana-service module. The CI pipeline ran plan on every PR (posted for the team to self-review), apply on merge, and a nightly drift check across all teams. The audit answer became “every change is a git commit with an approver, here is the log” - and because alert rules created through the Terraform provider are marked as provisioned and locked in the UI by default, the live-editing that caused the outage was structurally impossible.

The load-bearing piece was making UI edits fail rather than relying on policy. A team’s apply step verifies provenance before it runs:

# CI gate: refuse to apply if anything was edited outside Terraform
terraform plan -detailed-exitcode -refresh-only
# exit code 2 here means state drifted from real Grafana ->
# someone edited the UI; fail the pipeline and require a git change

The -refresh-only plan compares real Grafana against state without proposing config changes, so exit code 2 specifically flags out-of-band edits (exit code 1 is still an ordinary plan error). That one gate turned “trust people not to click” into a machine-enforced invariant, and the next audit took an afternoon instead of a week.
