Self-managing Prometheus on Kubernetes means you own the storage, the HA pair, the WAL, the cardinality blast radius, and the 2 a.m. page when the TSDB fills a PVC. Azure Monitor managed service for Prometheus moves ingestion, storage, query, and rule evaluation into a PaaS plane while keeping the parts you want to own: PromQL, recording rules, alert rules, and Grafana dashboards. The catch is that the control surface is no longer prometheus.yml and CRDs. It is a metrics add-on, a Data Collection Rule, an Azure Monitor workspace, a ConfigMap for scrape customization, and ARM resources for rules. This walkthrough wires all of it together correctly, then closes on the cost model that decides whether this is cheaper than running it yourself.
We assume an existing AKS cluster on a managed identity, Azure CLI 2.x, and kubectl context set.
1. Register providers and enable the metrics add-on
Managed Prometheus is delivered by the metrics add-on (the ama-metrics agent), not by Container Insights. They are independent: you can run one, the other, or both. Four resource providers must be registered in the cluster’s subscription before enabling.
for ns in Microsoft.Insights Microsoft.AlertsManagement Microsoft.Monitor Microsoft.Dashboard; do
az provider register --namespace "$ns"
done
# poll until all report "Registered"
az provider show --namespace Microsoft.Monitor --query registrationState -o tsv
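The polling step can be scripted rather than re-run by hand. The following is a minimal sketch, assuming bash; the function name wait_for_providers and its retry parameters are hypothetical helpers, not part of the Azure CLI. It only wraps the az provider show call already used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until every listed provider reports "Registered".
# Usage: wait_for_providers <max_attempts> <sleep_seconds> <namespace>...
wait_for_providers() {
  local attempts="$1" delay="$2"
  shift 2
  local ns state i
  for ns in "$@"; do
    i=0
    while :; do
      state="$(az provider show --namespace "$ns" --query registrationState -o tsv)"
      [ "$state" = "Registered" ] && break
      i=$((i + 1))
      if [ "$i" -ge "$attempts" ]; then
        echo "timed out waiting for $ns (last state: $state)" >&2
        return 1
      fi
      sleep "$delay"
    done
    echo "$ns: Registered"
  done
}

# Example call (commented out so sourcing this file has no side effects):
# wait_for_providers 30 10 Microsoft.Insights Microsoft.AlertsManagement \
#   Microsoft.Monitor Microsoft.Dashboard
```

Failing with a non-zero exit after a bounded number of attempts lets a pipeline stop before az aks update is attempted against an unregistered provider.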
Create (or reference) an Azure Monitor workspace — the Prometheus-native store, distinct from a Log Analytics workspace — then enable the add-on on the cluster, passing both the workspace and a Managed Grafana instance so the data source and dashboards are provisioned for you.
AMW_ID=$(az monitor account create \
--name amw-platform-prod --resource-group rg-observability \
--query id -o tsv)
GRAFANA_ID=$(az grafana create \
--name graf-platform-prod --resource-group rg-observability \
--query id -o tsv)
az aks update \
--name aks-platform-prod --resource-group rg-platform \
--enable-azure-monitor-metrics \
--azure-monitor-workspace-resource-id "$AMW_ID" \
--grafana-resource-id "$GRAFANA_ID"
If you previously installed the aks-preview extension, remove it first with az extension remove --name aks-preview. A stale preview extension is the single most common cause of --enable-azure-monitor-metrics failing or silently using old defaults.
You can omit --azure-monitor-workspace-resource-id to land on a default workspace per region, but in a platform setting always pin an explicit workspace you own. Likewise omit --grafana-resource-id if Grafana is managed separately (section 4).
2. What the add-on plumbs: DCR, DCE, and the agent pods
--enable-azure-monitor-metrics is not just a pod install. Behind it Azure creates the data-collection plumbing that routes scraped samples into the workspace.
| Resource | Name pattern | Purpose |
|---|---|---|
| Data Collection Rule | MSProm-<region>-<clusterName> | Defines the Prometheus stream and the workspace destination. |
| Data Collection Endpoint | MSProm-<region>-<clusterName> | Regional ingestion endpoint the agent writes to. |
| DCR association | on the cluster | Binds the DCR to the AKS resource. |
The DCR/DCE live in the cluster’s resource group and are managed by the add-on. You rarely edit them directly; the knobs you actually turn live in a ConfigMap (section 3). On the cluster side the add-on deploys a fixed set of workloads in kube-system:
kubectl get pods -n kube-system -l rsName=ama-metrics
kubectl get ds -n kube-system ama-metrics-node
- ama-metrics: a ReplicaSet (2 replicas) that scrapes cluster-wide targets (kube-state-metrics, the control-plane jobs, custom replica jobs).
- ama-metrics-node: a DaemonSet that scrapes per-node targets (cadvisor, kubelet, node-exporter) so node-local series stay node-local.
- ama-metrics-ksm: the bundled kube-state-metrics deployment.
- ama-metrics-operator-targets: reconciles PodMonitor/ServiceMonitor CRDs and custom scrape config into the collector.
This split matters for cost and correctness: anything node-scoped is sharded across the DaemonSet, while cluster-scoped scrapes are centralized and HA on the ReplicaSet.
3. Customize scrape targets with the ama-metrics ConfigMap
By default the add-on collects a minimal ingestion profile — a curated keep-list per target chosen to power the built-in dashboards and recording rules without exploding your series count. To change what is scraped, apply ama-metrics-settings-configmap in kube-system. It does not exist until you create it; the absence of the ConfigMap means “use defaults.”
Schema note: the current ConfigMap is schema v2. Target settings are split into two top-level sections, cluster-metrics and controlplane-metrics, so you can govern node/cluster ingestion separately from control-plane ingestion. Older guides using a single flat default-scrape-settings-enabled are v1; do not mix them.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-settings-configmap
  namespace: kube-system
data:
  prometheus-collector-settings: |-
    cluster_alias = "platform-prod-eastus"
  cluster-metrics: |-
    default-targets-scrape-enabled: |-
      kubelet = true
      cadvisor = true
      kube-state-metrics = true
      nodeexporter = true
      coredns = false
      kubeproxy = false
    default-targets-scrape-interval-settings: |-
      kubelet = "30s"
      cadvisor = "60s"
    minimal-ingestion-profile: |-
      enabled = true
    default-targets-metrics-keep-list: |-
      kubelet = "kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes"
      kube-state-metrics = ".*"
  controlplane-metrics: |-
    default-targets-scrape-enabled: |-
      apiserver = true
      etcd = false
Three levers do almost all the work:
- cluster_alias rewrites the cluster label on every series from this cluster. Set it deliberately: it is the join key across clusters in one workspace, and it must match the clusterName you put on rule groups later (section 5). If you set an alias here and forget it there, your cluster-scoped rules evaluate against nothing.
- minimal-ingestion-profile: enabled = true keeps you on the curated keep-list. Set false only when you truly need full metrics from a default target; it can multiply that target’s series count.
- default-targets-metrics-keep-list is a per-target regex of metric names to keep. This is your primary cardinality control for built-in targets: drop here, not downstream.
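Before shipping a keep-list, it can help to check locally which metric names a regex would retain. The sketch below is an approximation only: it assumes the regex is matched against the whole metric name (as Prometheus relabeling does), it uses grep's extended regex syntax rather than RE2, and the function name keeplist_filter is hypothetical. It does not model the minimal-profile defaults that the add-on applies on top.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read metric names on stdin, print those the regex keeps.
keeplist_filter() {
  local regex="$1"
  # -E: extended regex; -x: whole-line match, approximating anchored matching.
  # "|| true" keeps an empty result from being treated as a failure.
  grep -Ex -- "$regex" || true
}

printf '%s\n' \
  kubelet_volume_stats_available_bytes \
  kubelet_volume_stats_capacity_bytes \
  kubelet_running_pods |
  keeplist_filter 'kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes'
```

With the sample input, the two kubelet_volume_stats_* names are printed and kubelet_running_pods is filtered out, which mirrors the kubelet entry in the ConfigMap above.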
Apply it and watch the agent reload:
kubectl apply -f ama-metrics-settings-configmap.yaml
# ama-metrics pods pick up changes and restart within ~2-3 minutes
kubectl rollout status deploy/ama-metrics -n kube-system
For your own apps, the cleanest path is the Prometheus operator CRDs the add-on already watches. A PodMonitor is scraped without touching any ConfigMap:
apiVersion: azmonitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: checkout-api
  namespace: payments
spec:
  selector:
    matchLabels:
      app: checkout-api
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
Note the API group azmonitoring.coreos.com/v1 — the add-on ships its own CRDs so it can coexist with a real Prometheus operator without colliding on monitoring.coreos.com. For static or kubernetes_sd targets that do not fit CRDs, drop a raw Prometheus config under the key prometheus-config in the ama-metrics-prometheus-config ConfigMap; the supported discovery methods there are static_configs and kubernetes_sd_configs.
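For the raw-config path, a sketch of what that might look like follows. It assumes a single static target; the job name, target address, and the request_id label are hypothetical placeholders, and the function write_scrape_config is an illustrative helper. The file is named prometheus-config so that the resulting ConfigMap key matches the one described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a custom scrape config into <dir>/prometheus-config.
write_scrape_config() {
  local dir="$1"
  cat > "$dir/prometheus-config" <<'EOF'
scrape_configs:
  - job_name: checkout-static          # hypothetical job name
    scrape_interval: 60s
    static_configs:
      - targets: ['checkout-api.payments.svc.cluster.local:9090']  # hypothetical target
    metric_relabel_configs:
      # Drop an unbounded label before samples leave the cluster.
      - action: labeldrop
        regex: request_id              # hypothetical label name
EOF
}

# Example (commented out so sourcing this file has no side effects):
# write_scrape_config .
# kubectl create configmap ama-metrics-prometheus-config \
#   --from-file=prometheus-config -n kube-system
```

One caveat on the labeldrop rule: if removing the label leaves two series with identical label sets in the same scrape, they collide, so fixing the instrumentation is usually the cleaner remedy when that is possible.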
4. Connect Azure Managed Grafana with managed identity and RBAC
If you passed --grafana-resource-id in section 1, the data source and Kubernetes dashboards already exist. The mechanism underneath is worth understanding because it is also how you connect Grafana created out-of-band.
Grafana queries the Azure Monitor workspace using its own system-assigned managed identity, which must hold the Monitoring Data Reader role on the workspace. There is no API key, no bearer token to rotate.
GRAFANA_MI=$(az grafana show --name graf-platform-prod \
--resource-group rg-observability \
--query identity.principalId -o tsv)
az role assignment create \
--assignee "$GRAFANA_MI" \
--role "Monitoring Data Reader" \
--scope "$AMW_ID"
Then attach the workspace (idempotent if the add-on already did it):
az grafana integrations monitor add \
--name graf-platform-prod --resource-group rg-observability \
--monitor-name amw-platform-prod --monitor-resource-group-name rg-observability
Human and pipeline access is a separate RBAC plane. Logging into the Grafana UI is governed by Azure built-in roles on the Grafana resource, mapped to Grafana org roles, all backed by Microsoft Entra ID. Assign the least-privilege role that fits:
| Azure role | Grafana capability | Role definition ID |
|---|---|---|
| Grafana Admin | Manage data sources, dashboards, and role assignments | 22926164-76b3-42b3-bc55-97df8dab3e41 |
| Grafana Editor | View and edit dashboards and alerts | a79a5197-3a5c-4973-a920-486035ffd60f |
| Grafana Viewer | Read-only dashboards and alerts | 60921a7e-fef1-4a43-9b16-a26c52ad4769 |
# Grant an Entra group read-only dashboard access
az role assignment create \
--assignee "<group-object-id>" \
--role "Grafana Viewer" \
--scope "$GRAFANA_ID"
Keep the two planes straight: the managed identity reads metrics from the workspace (Monitoring Data Reader on the AMW); users read dashboards (Grafana Viewer/Editor/Admin on the Grafana resource). Granting a user Monitoring Data Reader does not let them open Grafana, and granting Grafana Admin does not let the identity query the workspace.
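Both planes can be checked from the CLI. This is a minimal sketch, assuming the variables set earlier in the walkthrough; has_role is a hypothetical helper that wraps az role assignment list, which by default returns only assignments made at exactly the given scope (add --include-inherited if roles are granted higher up).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if <assignee> holds <role> at <scope>.
has_role() {
  local assignee="$1" scope="$2" role="$3" count
  count="$(az role assignment list \
    --assignee "$assignee" --scope "$scope" \
    --query "length([?roleDefinitionName=='$role'])" -o tsv)" || return 2
  [ "${count:-0}" -gt 0 ]
}

# Examples (commented out so sourcing this file has no side effects):
# has_role "$GRAFANA_MI" "$AMW_ID" "Monitoring Data Reader" \
#   || echo "Grafana identity cannot read the workspace"
# has_role "<group-object-id>" "$GRAFANA_ID" "Grafana Viewer" \
#   || echo "group cannot open Grafana"
```

Running one check per plane makes the failure mode explicit: an empty dashboard usually points at the first check, a login error at the second.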
5. Recording and alert rules as ARM/Bicep
This is where managed Prometheus diverges most from open source. Rules are not loaded from rule files on disk. A rule group is an Azure resource — Microsoft.AlertsManagement/prometheusRuleGroups — evaluated by the managed service against your workspace. That means rules are governed exactly like the rest of your infrastructure: Bicep in a repo, deployed through a pipeline, diffed in PRs.
The shape mirrors open-source rule groups (a group has an interval and an ordered rules[] of record or alert entries) but adds Azure semantics: scopes (which workspace, optionally which cluster), clusterName, alert severity, auto-resolution, and actions pointing at action groups.
@description('Azure Monitor workspace resource ID')
param amwId string
@description('AKS cluster resource ID')
param clusterId string
@description('Action group resource ID for paging')
param actionGroupId string
param location string = resourceGroup().location
resource platformRules 'Microsoft.AlertsManagement/prometheusRuleGroups@2023-03-01' = {
  name: 'platform-prod-rules'
  location: location
  properties: {
    description: 'Recording + alert rules for platform-prod'
    scopes: [
      amwId
      clusterId
    ]
    clusterName: 'platform-prod-eastus' // MUST match cluster_alias from section 3
    interval: 'PT1M'
    rules: [
      // --- recording rule: precompute per-node CPU utilisation ---
      {
        record: 'instance:node_cpu_utilisation:rate5m'
        expression: '1 - avg without (cpu) (sum without (mode)(rate(node_cpu_seconds_total{job="node", mode=~"idle|iowait|steal"}[5m])))'
        labels: {
          source: 'platform-recording'
        }
        enabled: true
      }
      // --- alert rule: pods stuck not-ready ---
      {
        alert: 'KubePodNotReady'
        expression: 'sum by (namespace, pod, cluster) (max by (namespace, pod, cluster) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod, cluster) group_left(owner_kind) topk by(namespace, pod, cluster) (1, max by(namespace, pod, owner_kind, cluster)(kube_pod_owner{owner_kind!="Job"}))) > 0'
        for: 'PT15M'
        severity: 2
        labels: {
          team: 'platform'
        }
        annotations: {
          summary: 'Pod has been in a non-ready state for more than 15 minutes.'
          description: 'Namespace {{ $labels.namespace }} pod {{ $labels.pod }} is not ready.'
        }
        resolveConfiguration: {
          autoResolved: true
          timeToResolve: 'PT10M'
        }
        actions: [
          {
            actionGroupId: actionGroupId
          }
        ]
        enabled: true
      }
    ]
  }
}
Several properties carry sharp edges:
- scopes must include the workspace ID. Adding the cluster ID narrows evaluation to one cluster; this is how you avoid running one rule set against every cluster’s data in a shared workspace and tripping query throttling.
- clusterName must equal cluster_alias. The managed service filters cluster-scoped rules by the cluster label. If you set an alias in the ConfigMap, that string, not the AKS resource name, is what belongs here. A mismatch means rules that fire on nothing.
- severity is 0–4 (0 critical, 4 verbose; default 3). It maps to Azure Monitor alert severity, which your action-group routing and on-call tooling key off.
- Recording rules feed back in. instance:node_cpu_utilisation:rate5m is ingested as a new series and is queryable from Grafana and from other rules: same as OSS, billed as ingested samples.
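Because the alias and clusterName live in different files, a pre-deploy check can catch a mismatch before rules silently evaluate against nothing. The sketch below assumes the ConfigMap YAML and the Bicep file are both in the repo and that clusterName is a quoted string literal, as in the example above; check_alias_match is a hypothetical helper, and its text parsing is deliberately naive.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy check: compare cluster_alias (ConfigMap YAML)
# with clusterName (Bicep). Returns 0 on match, 1 on mismatch, 2 if not found.
check_alias_match() {
  local configmap="$1" bicep="$2" cm_alias rule_name
  cm_alias="$(sed -n 's/^[[:space:]]*cluster_alias[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$configmap" | head -n 1)"
  rule_name="$(sed -n "s/^[[:space:]]*clusterName:[[:space:]]*'\([^']*\)'.*/\1/p" "$bicep" | head -n 1)"
  if [ -z "$cm_alias" ] || [ -z "$rule_name" ]; then
    echo "could not find cluster_alias or clusterName" >&2
    return 2
  fi
  if [ "$cm_alias" != "$rule_name" ]; then
    echo "mismatch: cluster_alias=$cm_alias clusterName=$rule_name" >&2
    return 1
  fi
  echo "ok: $cm_alias"
}

# Example (file names are placeholders):
# check_alias_match ama-metrics-settings-configmap.yaml rules.bicep
```

A parameterized clusterName (for example one supplied by a loop variable) will not match the pattern and yields exit code 2, which is a prompt to compare the parameter values instead.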
If you already maintain OSS rule YAML, you do not have to hand-translate it: the az-prom-rules-converter utility takes a standard Prometheus rules file plus the Azure metadata (subscription, RG, workspace, cluster, action groups) and emits a deployable ARM template.
6. Route Prometheus alerts through action groups
A fired alert from a managed Prometheus rule is a first-class Azure Monitor alert. Notification and integration are therefore delegated to action groups, the same primitive used by metric and log alerts — which is the real payoff: one notification fabric (PagerDuty, email, webhook, Logic App, Teams, Functions) across every alert source.
resource pagePlatform 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-platform-oncall'
  location: 'global'
  properties: {
    groupShortName: 'platOncall'
    enabled: true
    emailReceivers: [
      {
        name: 'platform-dl'
        emailAddress: 'platform-oncall@example.com'
        useCommonAlertSchema: true
      }
    ]
    webhookReceivers: [
      {
        name: 'pagerduty'
        serviceUri: 'https://events.pagerduty.com/integration/<key>/enqueue'
        useCommonAlertSchema: true
      }
    ]
  }
}
Reference pagePlatform.id as the actionGroupId in the rule group from section 5. Set useCommonAlertSchema: true so every receiver gets the normalized Common Alert Schema payload — webhooks parse one shape regardless of source. An alert rule can list multiple action groups; route by severity (page on severity <= 1, ticket-only on severity >= 3) by pointing those rules at different groups.
Verify
Confirm each layer independently, top of stack to bottom.
# 1. Agent healthy and scraping
kubectl get pods -n kube-system -l rsName=ama-metrics
kubectl get ds -n kube-system ama-metrics-node
# transient agent issues surface in logs
kubectl logs -n kube-system -l rsName=ama-metrics -c prometheus-collector --tail=50
# 2. DCR association exists on the cluster
az monitor data-collection rule association list \
--resource "$(az aks show -n aks-platform-prod -g rg-platform --query id -o tsv)" \
--query "[].name" -o tsv
# 3. Rule group deployed and enabled
az alerts-management prometheus-rule-group show \
--name platform-prod-rules --resource-group rg-observability \
--query "{enabled:enabled, rules:length(rules)}" -o table
For the data path, open Grafana and run a query in Explore against the Managed Prometheus data source — up should return your scraped jobs, and instance:node_cpu_utilisation:rate5m should return your recording rule’s output (proof rules are evaluating). You can also confirm ingestion from the workspace side with a PromQL query in the Azure portal’s metrics explorer for the Azure Monitor workspace. To verify alert routing without waiting for a real incident, temporarily lower an alert’s threshold so it fires, watch it appear under Monitor > Alerts, and confirm the action group delivered.
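The same data-path check can be run without Grafana by calling the workspace's Prometheus-compatible query API directly. This is a sketch under assumptions: that the caller is logged in with az and holds Monitoring Data Reader on the workspace, that the token audience is https://prometheus.monitor.azure.com, and that the query endpoint is exposed as metrics.prometheusQueryEndpoint on the workspace resource. amw_query is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an instant PromQL query against a workspace endpoint.
# Usage: amw_query <query_endpoint> <promql>
amw_query() {
  local endpoint="$1" promql="$2" token
  token="$(az account get-access-token \
    --resource https://prometheus.monitor.azure.com \
    --query accessToken -o tsv)" || return 1
  # ${endpoint%/} strips one trailing slash so the path is not doubled.
  curl -sS --get "${endpoint%/}/api/v1/query" \
    -H "Authorization: Bearer $token" \
    --data-urlencode "query=$promql"
}

# Example (commented out so sourcing this file has no side effects):
# ENDPOINT="$(az monitor account show --name amw-platform-prod \
#   --resource-group rg-observability \
#   --query metrics.prometheusQueryEndpoint -o tsv)"
# amw_query "$ENDPOINT" 'up'
# amw_query "$ENDPOINT" 'instance:node_cpu_utilisation:rate5m'
```

A non-empty result for the recording-rule series is the quickest evidence that both ingestion and rule evaluation are working end to end.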
Enterprise scenario
A platform team ran one shared Azure Monitor workspace behind 40+ AKS clusters to keep Grafana and rule management centralized. Within weeks, ingestion costs roughly tripled versus their old self-hosted Prometheus, and cluster-scoped alert rules began flapping into a Degraded resource-health state. Two root causes, both structural.
First, cardinality. Several teams had set minimal-ingestion-profile: enabled = false to “see everything,” and a few apps emitted a per-request-ID label. Active time series — the real cost driver — ballooned. Second, rule evaluation. Every rule group scoped only to the workspace re-scanned all 40 clusters’ data each minute, and the heavy kube_pod_owner join was throttled at that volume, which is exactly what the Degraded health state was reporting.
The fix was to treat ingestion and evaluation as governed surfaces, not defaults. They enforced the minimal profile and per-target keep-lists through a baseline ConfigMap shipped by their cluster bootstrap, and dropped the offending high-cardinality label at the source. For rules, they generated one rule group per cluster, each pinned to that cluster via both scopes and clusterName, so a rule only ever scanned one cluster’s series. A small Bicep loop produced all of them from a single rule definition:
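// Assumed declarations so the loop compiles on its own; sharedRuleSet holds rules[] entries shaped like those in section 5.
param amwId string
param location string = resourceGroup().location
param sharedRuleSet array
param clusters array // [{ name: 'platform-prod-eastus', id: '/subscriptions/.../managedClusters/...' }]
resource perClusterRules 'Microsoft.AlertsManagement/prometheusRuleGroups@2023-03-01' = [for c in clusters: {
  name: 'rules-${c.name}'
  location: location
  properties: {
    scopes: [ amwId, c.id ]
    clusterName: c.name // must equal that cluster's cluster_alias
    interval: 'PT1M'
    rules: sharedRuleSet // identical PromQL, scoped per cluster
  }
}]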
Throttling disappeared because no single group spanned all clusters, and ingestion dropped by more than half once the keep-lists and the dropped label took effect — without losing a single dashboard or alert.
Cost model: samples, retention, and active time series
Managed Prometheus bills on two axes that you must reason about separately, because they are optimized by different levers.
| Dimension | What it is | Primary lever |
|---|---|---|
| Ingested samples | Every datapoint written: series x (60s / scrape_interval) x minutes of collection | Scrape interval, keep-lists, target enable/disable |
| Query | Samples processed by rule evaluation and dashboard/API queries | Rule scoping, dashboard refresh, query breadth |
Retention is effectively fixed (managed Prometheus stores metrics for 18 months), so unlike self-hosted Prometheus or Mimir you do not tune storage class or block retention — you tune what you ingest in the first place. That reframes optimization: there is no cheap long-term tier to lean on, so the cost is decided at the scrape.
The dominant variable is active time series — distinct label-set combinations being written. Cost scales with series count far more than with raw metric count, because one metric with a high-cardinality label (a request ID, a full URL, a pod-hash gone wrong) is thousands of series. Practical controls, in order of impact:
- Stay on the minimal ingestion profile and expand keep-lists deliberately. Flipping minimal-ingestion-profile to false across fleets is the most expensive single mistake.
- Kill cardinality at the source. Drop unbounded labels via metric_relabel_configs in custom scrape config, or fix the instrumentation. A keep-list filters by metric name; it does not save you from a bad label on a metric you do want.
- Right-size scrape intervals. 60s instead of 30s halves the sample volume for a target with no loss for slow-moving gauges. Reserve 30s for things you alert on at tight windows.
- Disable targets you do not use (kubeproxy, etcd, coredns are off by default for a reason) and shard correctly: node targets on the DaemonSet, cluster targets on the ReplicaSet.
- Scope rule groups per cluster in shared workspaces. This is a query-cost and throttling control, not an ingestion one, but in a busy workspace it is the difference between rules that evaluate and rules that sit Degraded.
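The ingested-samples arithmetic is simple enough to work through directly. The sketch below assumes a constant active-series count and a single scrape interval for all of them; the 100,000-series figure is illustrative, samples_per_day is a hypothetical helper, and no pricing is implied.

```shell
#!/usr/bin/env bash
# Hypothetical helper: samples written per day for a steady series count.
# Usage: samples_per_day <active_series> <scrape_interval_seconds>
samples_per_day() {
  local series="$1" interval="$2"
  # 86400 seconds per day divided by the interval = scrapes per series per day.
  echo $(( series * 86400 / interval ))
}

samples_per_day 100000 30   # prints 288000000
samples_per_day 100000 60   # prints 144000000
```

Doubling the interval halves the sample volume, while a single unbounded label multiplies the series term, which is why label hygiene comes before interval tuning in the list above.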
The honest comparison: managed Prometheus is rarely the cheapest option at low scale, where a single self-hosted Prometheus on a spot node costs almost nothing. It wins on total cost of ownership at fleet scale — no storage to operate, no HA pair to babysit, 18-month retention for free, and rules and dashboards governed as Azure resources. Model it on your active series, not on metric counts, and the line item stays a line item instead of becoming the crisis you adopted PaaS to avoid.