Cardinality is the single number that decides whether your Prometheus stack is a quiet utility or a recurring incident. It is the count of unique time series — every distinct combination of a metric name and its label values — and it governs three things at once: the RAM the head block needs to hold its index, how many samples a query must touch to compute an answer, and, the moment you remote-write to a vendor, the line item on your bill. One badly chosen label — a user ID, a full request URL, a Kubernetes pod name churning under an autoscaler — can multiply your series count by orders of magnitude in an afternoon. This is the working playbook: how to find the offenders, how to drop and rewrite labels with relabeling, how to set hard guardrails so a bad exporter can’t take the cluster down, and how to govern cardinality per team so it stays controlled.
1. Why cardinality is the load, not sample rate
Prometheus keeps an in-memory index of every active series and the most recent block of samples in a head block before flushing to disk. The cost that dominates is not how fast samples arrive — it is how many distinct series exist. A metric scraped once a minute with two million label combinations is far more expensive than a metric scraped every second with fifty.
The reason is multiplication. Total series for one metric is the product of the cardinalities of its labels:
http_requests_total{method, status, handler, instance}
= |method| x |status| x |handler| x |instance|
= 5 x 6 x 40 x 200 = 240,000 series
Add one unbounded label — user_id with 100,000 values — and that metric alone explodes past the point where the node survives. The damage shows up in three places:
- Memory. The head index holds every active series. OOM kills during head compaction are almost always a cardinality event, not a sample-rate event.
- Query latency. A sum by (...) over the rate() of a high-cardinality histogram across 30 days touches enormous sample counts on every dashboard refresh.
- Bill. Remote-write vendors (Grafana Cloud, Amazon Managed Service for Prometheus, Chronosphere) price on active series or samples ingested. Cardinality is the meter.
The mental model to keep: a label is a dimension you slice by, not a field you store data in. If a value is unbounded or high-cardinality, it does not belong in a label.
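The multiplication above is easy to sanity-check before a label ships. The following is a minimal Python sketch, not anything Prometheus provides: the function name `series_count` is made up for illustration, and the cardinalities are the assumed values from the example above.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of its label cardinalities."""
    return prod(label_cardinalities.values())

# Cardinalities taken from the http_requests_total example above.
labels = {"method": 5, "status": 6, "handler": 40, "instance": 200}
baseline = series_count(labels)

# Hypothetical change: someone adds a user_id label with 100,000 values.
with_user_id = series_count({**labels, "user_id": 100_000})

print(f"baseline:     {baseline:,}")
print(f"with user_id: {with_user_id:,}")
```

The product is a worst case, since not every combination of values occurs in practice, but it is the number to plan memory against.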
2. Diagnose the offenders before you touch a config
Never relabel blind. Find what is actually expensive first. The fastest source of truth is the built-in TSDB status page at http://<prometheus>:9090/tsdb-status, which surfaces the head’s cardinality stats directly. The same data is exposed via the API:
# Top label names by number of distinct series they participate in,
# plus the most expensive label-value pairs and metric names.
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data'
That returns seriesCountByMetricName, seriesCountByLabelValuePair, labelValueCountByLabelName, and memoryInBytesByLabelName — the four lists that tell you exactly where the series live.
For ad-hoc hunting, PromQL answers the same questions interactively. These are the queries I run first on any unfamiliar cluster:
# Top 10 metric names by series count - the headline offenders
topk(10, count by (__name__)({__name__=~".+"}))
# Total active series in the head - your headline number
prometheus_tsdb_head_series
# Which job is generating the most series? Find the bad exporter.
topk(10, count by (job)({__name__=~".+"}))
To find which label on a specific metric is doing the damage, count distinct values per label:
# How many distinct values does each label of this metric carry?
count(count by (handler)(http_request_duration_seconds_bucket))
count(count by (user_id)(http_request_duration_seconds_bucket))
If that user_id line returns 80,000 and handler returns 40, you have found your fuse. Offline, promtool tsdb analyze reads a block on disk and prints the same breakdown without touching the running server — useful in CI against a snapshot:
promtool tsdb analyze /prometheus/data --limit=20
It prints the highest-cardinality labels, the label pairs with the most series, and the label names with the most unique values. That last list is the one that catches unbounded labels.
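If you want to turn that "most unique values" list into an automated check, one possible approach is to filter the labelValueCountByLabelName list from the TSDB status API. The sketch below assumes the response has already been fetched and parsed into a dict; the helper name, the threshold of 1,000, and the sample payload are all made up for illustration.

```python
def unbounded_label_suspects(tsdb_status: dict, max_values: int = 1000) -> list[tuple[str, int]]:
    """Return (label, distinct_values) pairs above max_values, largest first."""
    entries = tsdb_status["data"]["labelValueCountByLabelName"]
    suspects = [(e["name"], e["value"]) for e in entries if e["value"] > max_values]
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)

# Made-up payload shaped like the /api/v1/status/tsdb response.
sample = {
    "status": "success",
    "data": {
        "labelValueCountByLabelName": [
            {"name": "handler", "value": 40},
            {"name": "user_id", "value": 80000},
            {"name": "path", "value": 12500},
        ]
    },
}

suspects = unbounded_label_suspects(sample)
print(suspects)
```

The right threshold depends on your environment; the point is only that any label whose distinct-value count keeps growing deserves a look.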
3. Drop and rewrite labels with metric_relabel_configs
The most important distinction in Prometheus relabeling is where it runs:
- relabel_configs runs before the scrape, against target meta-labels (the __address__, discovery labels). Use it to decide what to scrape and to shape target identity.
- metric_relabel_configs runs after the scrape, against every sample’s label set, before ingestion. This is the lever for cardinality: it drops metrics and strips labels so they never reach the TSDB.
Dropping a whole noisy metric you never query:
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
    metric_relabel_configs:
      # Drop the Go GC duration summary that nobody dashboards on
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds.*"
        action: drop
Stripping a single high-cardinality label while keeping the metric — the labeldrop action removes a label by name, which collapses series that differ only in that dimension:
metric_relabel_configs:
  # Remove the unbounded user_id label from everything in this job.
  # Series collapse: 80,000 -> ~40 once user_id is gone.
  - regex: "user_id"
    action: labeldrop
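A warning on labeldrop: removing a label does not aggregate anything. If two series in the same scrape become identical once the label is gone, Prometheus does not sum them; it keeps one sample and rejects the others as duplicates, so the surviving value no longer means what the metric name says. Make sure the label you drop is genuinely extra detail, not part of the series identity you need. If it is the only thing separating the series, aggregate it away with a recording rule (Section 6) or fix the instrumentation instead.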
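One way to check for this before shipping a labeldrop rule is to simulate it against a sample of label sets from one scrape. The sketch below is a standalone Python illustration, not Prometheus code; the function name and the three sample series are hypothetical.

```python
from collections import Counter

def labeldrop_collisions(series: list[dict[str, str]], label: str) -> dict[tuple, int]:
    """Drop `label` from every series and report label sets that now collide."""
    stripped = [
        tuple(sorted((k, v) for k, v in s.items() if k != label))
        for s in series
    ]
    counts = Counter(stripped)
    return {labels: n for labels, n in counts.items() if n > 1}

# Hypothetical series from a single scrape.
scrape = [
    {"__name__": "http_requests_total", "handler": "/orders", "user_id": "1"},
    {"__name__": "http_requests_total", "handler": "/orders", "user_id": "2"},
    {"__name__": "http_requests_total", "handler": "/cart", "user_id": "1"},
]

collisions = labeldrop_collisions(scrape, "user_id")
print(collisions)
```

Any non-empty result means the dropped label was part of the series identity, and the rule would produce duplicates rather than a clean aggregate.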
Keeping only an allow-list of metrics from a chatty exporter — invert the logic with keep so you ingest a known set and discard the rest:
metric_relabel_configs:
  # Keep only the four metrics we actually use from kube-state-metrics
  - source_labels: [__name__]
    regex: "kube_pod_status_phase|kube_deployment_status_replicas|kube_node_status_condition|kube_pod_container_resource_requests"
    action: keep
Normalizing a high-cardinality label value instead of dropping it: rewrite path so /api/v1/orders/8a3f... becomes /api/v1/orders/:id:
metric_relabel_configs:
  # Collapse UUID path segments into a placeholder
  - source_labels: [path]
    regex: "(/api/v1/orders/)[0-9a-f-]+"
    target_label: path
    replacement: "${1}:id"
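It helps to test a rewrite rule like this against real path values before deploying it. The sketch below approximates the rule in Python: Prometheus anchors relabel regexes at both ends, so `re.fullmatch` is the closest analogue, and Python's `\g<1>` stands in for `${1}`. The function name and test paths are made up, and Python's regex engine is not identical to the RE2 syntax Prometheus uses, so treat this as a rough check only.

```python
import re

ORDER_PATH = re.compile(r"(/api/v1/orders/)[0-9a-f-]+")

def normalize_path(path: str) -> str:
    """Approximate the anchored `replace` rule applied to the path label."""
    match = ORDER_PATH.fullmatch(path)
    if match is None:
        return path  # no full match: the label value is left untouched
    return match.expand(r"\g<1>:id")

for raw in ["/api/v1/orders/8a3f-12", "/api/v1/orders/8a3f/items", "/healthz"]:
    print(raw, "->", normalize_path(raw))
```

Note that the second path is not rewritten: because the regex is anchored, anything after the ID segment prevents a match, so nested routes need their own rule.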
metric_relabel_configs is your cheapest, most surgical control because it acts on data that has already been scraped but is not yet stored. Whatever you drop here still costs the scrape itself, but it costs zero TSDB memory, zero query time, and zero remote-write bill.
4. Enforce hard limits so one bad target can’t win
Relabeling is a scalpel; limits are the circuit breaker. A new exporter pushed by a team that did not read this article should fail its own scrape, not take down your Prometheus. Two per-scrape limits do this.
sample_limit caps how many samples a single scrape may yield. Exceed it and the entire scrape is dropped and marked failed — a loud, visible signal rather than silent series creep:
scrape_configs:
  - job_name: "app"
    sample_limit: 5000              # whole scrape fails past 5k samples per target
    label_limit: 30                 # whole scrape fails if any sample has >30 labels
    label_name_length_limit: 200
    label_value_length_limit: 1000
    static_configs:
      - targets: ["app:8080"]
label_limit, label_name_length_limit, and label_value_length_limit apply the same all-or-nothing rule: if any sample in a scrape carries too many labels, or a label name or value that is too long, the entire scrape is treated as failed. That pattern is the signature of a runaway label-generation bug. Set a sane default in global so every job inherits a ceiling:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  # Applied to every job unless overridden
  sample_limit: 10000
  label_limit: 30
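When a scrape is rejected for exceeding a limit, it counts as a failed scrape: the target’s up metric drops to 0 and none of that scrape’s samples are ingested. Prometheus also increments a counter for sample_limit breaches. Alert on it so a team learns immediately rather than after the bill arrives:
# Fires while any scrape has recently been rejected by sample_limit.
# The counter is server-wide and only ever increases, so alert on its increase.
increase(prometheus_target_scrapes_exceeded_sample_limit_total[15m]) > 0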
Target churn protection
Autoscaling and CI runners create the other cardinality leak: churn. Pods come and go, each with a unique pod or instance label, so series that are no longer scraped still occupy the head until they age out, and the cumulative unique count over a day dwarfs the instantaneous count. Two defenses:
- Drop pod-identifying labels you do not slice by (pod, pod_template_hash, controller_revision_hash) with labeldrop, so a ReplicaSet’s churn does not mint new series.
- Cap concurrent targets per job with target_limit. If a job suddenly resolves to more targets than the limit, Prometheus marks them as failed without scraping them:
scrape_configs:
  - job_name: "kubernetes-pods"
    target_limit: 2000   # refuse to scrape if SD returns >2000 targets
    metric_relabel_configs:
      - regex: "pod_template_hash|controller_revision_hash"
        action: labeldrop
5. Kill the usual high-cardinality suspects
Most cardinality fires come from the same short list. Burn them out at the source:
| Offending label | Why it explodes | Fix |
|---|---|---|
| user_id, customer_id, tenant_id | Unbounded, grows with your business | labeldrop, or aggregate it away (Section 6) |
| path, url, endpoint (raw) | Unique per ID/query string | Normalize to route template :id via relabel replacement |
| pod, instance under autoscaling | Churns on every deploy/scale event | labeldrop if not sliced by; use target_limit |
| trace_id, request_id, span_id | One value per request, a pure cardinality bomb | Never a label. Belongs in traces/logs |
| email, session_id, IP addresses | Effectively unbounded; also a PII risk | Drop entirely |
| status_message, free-text errors | Arbitrary strings | Use a bounded status_code instead |
The governing rule, written so a developer can self-check before adding a label:
If you cannot name a finite, reasonably small set of values a label will ever take, it is not a label. Put that value in a trace or a log line, and slice metrics by something bounded.
A particularly common trap is histogram buckets. A histogram with a high-cardinality label is not one extra series — it is (buckets + 2) extra series per label-value combination. Check whether you actually need per-handler latency at full bucket resolution before you ship it.
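The bucket multiplier is worth working through with real numbers. The sketch below is plain arithmetic in Python; the function name and the figures (12 buckets, 40 handlers, 9,000 values for the extra label) are assumptions chosen for illustration.

```python
def histogram_series(buckets: int, label_combinations: int) -> int:
    """Series for a classic histogram: one per bucket, plus _sum and _count."""
    return (buckets + 2) * label_combinations

# Hypothetical histogram: 12 buckets (including +Inf), sliced by 40 handlers.
per_handler = histogram_series(buckets=12, label_combinations=40)

# The same histogram after adding one label with 9,000 distinct values.
with_extra_label = histogram_series(buckets=12, label_combinations=40 * 9_000)

print(f"per handler:      {per_handler:,}")
print(f"with extra label: {with_extra_label:,}")
```

Halving the bucket count or dropping one label are both cheaper than discovering the product in production.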
6. Aggregate at write time with recording rules
Sometimes you genuinely need a high-cardinality metric for short-term debugging but only ever query an aggregate. Recording rules pre-compute the aggregate on a schedule and store the small result. Combined with relabeling, this lets you keep raw data on a short retention while a pre-aggregated series feeds dashboards and long-term storage.
groups:
  - name: cardinality-reduction
    interval: 30s
    rules:
      # Collapse per-pod, per-handler request rate into per-service rate.
      # The stored result drops the pod dimension entirely.
      - record: "service:http_requests:rate5m"
        expr: |
          sum by (service, method, status) (
            rate(http_requests_total[5m])
          )
      # Pre-aggregate a latency histogram down to per-service buckets,
      # so the quantile query later touches a fraction of the series.
      - record: "service:http_request_duration_seconds_bucket:rate5m"
        expr: |
          sum by (service, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
Dashboards then query service:http_requests:rate5m instead of the raw metric — fewer series scanned, faster refresh. The high-leverage pattern is to pair this with remote-write filtering: keep raw data locally for 24 hours, but only forward the aggregated recording-rule output to the expensive long-term backend.
remote_write:
  - url: "https://prometheus-prod.example.com/api/v1/write"
    write_relabel_configs:
      # Forward only pre-aggregated recording-rule series (service:...) and
      # a small allow-list; drop raw per-pod series from long-term storage.
      - source_labels: [__name__]
        regex: "service:.*|up|node_.*"
        action: keep
write_relabel_configs is the same relabeling engine applied at the remote-write boundary. It is where you turn “store everything locally” into “pay to keep only what matters,” which is frequently a 5-10x reduction in remote-write series.
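Before rolling out a keep rule like the one above, it can be worth dry-running the regex against a list of metric names. The sketch below does that in Python, using `re.fullmatch` to approximate the full anchoring Prometheus applies to relabel regexes. The helper name and the metric list are hypothetical.

```python
import re

# Same pattern as the keep rule above.
KEEP = re.compile(r"service:.*|up|node_.*")

def forwarded(metric_names: list[str]) -> list[str]:
    """Names that would survive an anchored `keep` rule on __name__."""
    return [name for name in metric_names if KEEP.fullmatch(name)]

names = [
    "service:http_requests:rate5m",
    "up",
    "node_cpu_seconds_total",
    "http_requests_total",
    "upstream_latency_seconds",
]
kept = forwarded(names)
print(kept)
```

The last name shows why anchoring matters: `upstream_latency_seconds` starts with `up` but is not forwarded, because the alternative `up` has to match the whole name.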
7. Per-team cardinality budgets, dashboards, and alerts
Controlling cardinality once is a project; keeping it controlled is governance. The model that holds in a multi-team platform is a budget per team, attributed by a team or owner label that you attach via relabeling at scrape time, then measured and alerted on.
First, attribute every series to an owner. In Kubernetes, map a namespace label to a team via relabeling so attribution is automatic:
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    target_label: team
    regex: "(payments|checkout|search)-.*"
    replacement: "${1}"
Then build the per-team series count as a recording rule so the budget dashboard is cheap to render:
groups:
  - name: cardinality-governance
    interval: 1m
    rules:
      - record: "team:series:count"
        expr: "count by (team) ({__name__=~'.+'})"
Alert when a team crosses its allocation, and — more importantly — alert on growth rate, because a slow leak is invisible on an instantaneous gauge until it is a crisis:
groups:
  - name: cardinality-budgets
    rules:
      # Hard budget breach
      - alert: TeamCardinalityBudgetExceeded
        expr: "team:series:count > 200000"
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Team {{ $labels.team }} over its 200k series budget"
          description: "{{ $labels.team }} is at {{ $value }} active series."
      # Growth detector: series up >25% week-over-week
      - alert: CardinalityGrowthSpike
        expr: |
          team:series:count
            / (team:series:count offset 1w) > 1.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Team {{ $labels.team }} cardinality up >25% WoW"
The growth alert is the one that earns its keep. A budget alert tells you something already broke; the week-over-week ratio catches the new exporter the day it ships, while the fix is still a one-line relabel rule.
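The week-over-week comparison can be prototyped offline before committing to a threshold. The sketch below mirrors the ratio check in Python under simplifying assumptions: the per-team counts are invented, the function name is hypothetical, and a team with no sample a week ago is skipped, much as the PromQL division yields no result when one side has no matching series.

```python
def growth_breaches(
    current: dict[str, int],
    week_ago: dict[str, int],
    threshold: float = 1.25,
) -> dict[str, float]:
    """Teams whose series count grew by more than `threshold` week over week."""
    breaches = {}
    for team, now in current.items():
        before = week_ago.get(team)
        if before is None:
            continue  # no baseline a week ago, so there is nothing to compare
        ratio = now / before
        if ratio > threshold:
            breaches[team] = ratio
    return breaches

# Invented per-team series counts.
current = {"payments": 260_000, "checkout": 90_000, "search": 51_000}
week_ago = {"payments": 200_000, "checkout": 88_000}

breaches = growth_breaches(current, week_ago)
print(breaches)
```

Here only the payments team is flagged. A brand-new team like search is invisible to the ratio, which is one reason to keep the absolute budget alert alongside it.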
Enterprise scenario
A payments platform team I worked with ran a single-tenant Prometheus per environment, remote-writing to Grafana Cloud. Over six weeks their active series climbed from 1.4M to 6.8M and the monthly bill roughly quintupled, with no corresponding traffic growth. The on-call narrative was “Prometheus is slow,” but the real damage was on the bill, not in query latency.
promtool tsdb analyze on a snapshot block made it obvious in under a minute: the top label name by unique values was card_bin (the first six digits of a card number) on a single payment_authorization_duration_seconds histogram. A well-meaning engineer had added it to slice latency by issuing bank. With ~12 histogram buckets, 9,000 distinct BINs, and the existing method and status labels, that one histogram had ballooned to several million series, and it was streaming straight to the paid backend.
The constraint: they could not simply delete the metric, because the fraud team did use per-BIN latency during incident reviews. The fix was to split storage by audience. They kept the raw, per-BIN histogram locally on a 24-hour retention for fraud debugging, but dropped the raw per-BIN series at the remote-write boundary and forwarded only a pre-aggregated recording rule to long-term storage:
# Local recording rule: aggregate away the BIN dimension
groups:
  - name: payments
    interval: 30s
    rules:
      - record: "service:payment_auth_duration:bucket:rate5m"
        expr: |
          sum by (service, method, status, le) (
            rate(payment_authorization_duration_seconds_bucket[5m])
          )
# Remote-write: drop the raw per-BIN series (buckets, _sum, _count), keep the aggregate
remote_write:
  - url: "https://prometheus-prod.grafana.net/api/prom/push"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "payment_authorization_duration_seconds_(bucket|sum|count)"
        action: drop
Remote-write series for that metric dropped from millions to a few thousand. The fraud team kept its high-resolution local view; dashboards moved to the aggregated series and rendered faster. They then added the week-over-week growth alert from Section 7, scoped per team, so the next person who reached for a high-cardinality label got paged before it hit the invoice. Monthly spend returned to its prior baseline within one billing cycle.
Verify
Confirm the controls actually took effect rather than assuming the config reloaded cleanly.
# 1) Config is valid before you reload
promtool check config /etc/prometheus/prometheus.yml
# 2) Reload without restart (requires --web.enable-lifecycle), then confirm the new config is live
curl -s -X POST http://localhost:9090/-/reload
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A2 metric_relabel_configs
# 3) Headline series count should have dropped after relabeling
prometheus_tsdb_head_series
# 4) The label you dropped should be gone - this must return nothing
count({user_id!=""})
# 5) No target is silently failing a sample_limit
prometheus_target_scrapes_exceeded_sample_limit_total
# 6) Per-team budget rule is producing data
team:series:count
Cross-check the /tsdb-status page after the reload: the top metric names and top label-value pairs should reflect your changes, and memoryInBytesByLabelName for the offending label should be absent.