A single Prometheus is one of the best engineering deals in infrastructure: scrape, store, alert, and query from one binary with no external dependencies. That deal expires the moment your active series count, retention requirement, or dashboard concurrency outgrows one machine’s RAM and disk. This guide is the path from that overloaded single node to a horizontally scalable stack: where the load actually comes from, how to shed it with recording rules and relabeling, how to tune remote-write so it does not melt your WAL, and how to choose between Thanos and Mimir for the long-term backend.
1. When a single Prometheus stops scaling
Prometheus is vertically scaled by design. It holds an in-memory index of every active series and keeps the most recent ~2 hours of samples in a head block before flushing to disk. Three symptoms tell you the node is past its limit.
Cardinality. Memory consumption tracks the number of active series (unique label-set combinations), not the sample rate. A single new high-cardinality label - a user ID, a full request path, a pod name churning under autoscaling - multiplies series. When the process starts getting OOM-killed during head compaction, cardinality is almost always the cause.
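To make the multiplication concrete, here is a minimal Python sketch. The label names and value counts are invented for illustration, and the product is only an upper bound: real series counts are lower when label values do not occur in every combination.

```python
from math import prod


def series_upper_bound(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label cardinalities for a single HTTP counter.
baseline = {"method": 5, "code": 8, "instance": 50}
with_user_id = {**baseline, "user_id": 10_000}

print(series_upper_bound(baseline))      # 2000
print(series_upper_bound(with_user_id))  # 20000000
```

One extra label with ten thousand values turns a two-thousand-series metric into a potential twenty million, which is why a single label change can take a healthy node to an OOM.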
Retention. Local TSDB retention is governed by --storage.tsdb.retention.time (and/or --storage.tsdb.retention.size). Keeping a year of data locally means a year of blocks on one volume, with no replication and no horizontal read path. Local disk is for the recent window; anything you want to keep for capacity planning or compliance belongs in object storage.
Query time. A dashboard that computes histogram_quantile() over a rate() of a high-cardinality histogram across a 30-day range will touch enormous numbers of samples on every refresh. Long, complex range queries are the load you can engineer away before you ever shard.
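A rough way to size that cost, sketched in Python. The series count and the 15-second scrape interval are assumptions for illustration, and the estimate ignores chunk encoding, caching, and staleness handling; it only counts raw samples in the window.

```python
def samples_touched(series: int, range_seconds: int, scrape_interval_seconds: int) -> int:
    """Approximate raw samples a range query reads: series x samples per series."""
    return series * (range_seconds // scrape_interval_seconds)


THIRTY_DAYS = 30 * 24 * 3600

# Hypothetical histogram: 5,000 bucket series scraped every 15 seconds.
print(samples_touched(5_000, THIRTY_DAYS, 15))  # 864000000
```

Every dashboard refresh repeats that read, so reducing either the series count or the range has a direct, multiplicative payoff.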
Inspect the real cost before changing anything:
# Top 10 metric names by series count (run against your Prometheus)
topk(10, count by (__name__)({__name__=~".+"}))
# Series count contributed per scrape job
sort_desc(count by (job)({__name__=~".+"}))
# Head series and chunks over time - watch the trend, not the instant
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks
Rule of thumb: budget roughly a few KB of RAM per active series including index overhead. A node holding several million active series wants tens of GB of RAM with headroom for query bursts and compaction. Measure your own ratio with process_resident_memory_bytes / prometheus_tsdb_head_series rather than trusting a generic number.
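The arithmetic behind that rule of thumb, as a small Python sketch. The 4 KiB per series figure and the 1.5x headroom factor are placeholders, not measurements; substitute the ratio you observe on your own node.

```python
def estimate_ram_gib(active_series: int, bytes_per_series: float, headroom: float = 1.5) -> float:
    """RAM estimate in GiB: series count x measured bytes per series x headroom factor."""
    return active_series * bytes_per_series * headroom / 2**30


# Hypothetical: 5M active series at a measured 4 KiB each, with 50% headroom.
print(round(estimate_ram_gib(5_000_000, 4096), 1))  # 28.6
```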
The promtool tsdb subcommand gives you a block-level breakdown of where cardinality lives:
# Analyze the most recent block for the worst label/metric offenders
promtool tsdb analyze /prometheus
2. Recording rules as a load valve
A recording rule precomputes an expression on its rule group's evaluation interval and writes the result back as a new series. Instead of every dashboard panel re-deriving a per-service error rate over millions of raw samples, the rule computes it once and panels read a cheap, low-cardinality series. This is the single highest-leverage change for query-time pain.
Naming matters - the community convention is level:metric:operations, where level is the aggregation labels kept, metric is the source metric, and operations describes the transform applied (most recent first):
# recording-rules.yaml
groups:
  - name: http_aggregations
    interval: 30s
    rules:
      # Per-service, per-method 5m request rate. Drops instance/pod labels.
      - record: job:http_requests:rate5m
        expr: sum by (job, method) (rate(http_requests_total[5m]))
      # Per-service error ratio, precomputed for SLO dashboards.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Pre-aggregated histogram buckets so quantiles read cheap series.
      - record: job_le:http_request_duration_seconds:rate5m
        expr: sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
With the bucket aggregation precomputed, a latency panel becomes histogram_quantile(0.99, job_le:http_request_duration_seconds:rate5m) - reading a handful of series per job instead of the full per-instance bucket set.
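The size of that reduction can be sketched in Python by counting distinct label sets before and after an aggregation. The job names, instance count, and bucket count below are invented, and the helper only models how `sum by (...)` collapses label sets, not the summation itself.

```python
def aggregated_series_count(series: list[dict[str, str]], by: tuple[str, ...]) -> int:
    """Number of output series after `sum by (...)`: distinct values of the kept labels."""
    return len({tuple(s.get(label, "") for label in by) for s in series})


# Hypothetical raw bucket series: 2 jobs x 40 instances x 12 buckets.
raw = [
    {"job": job, "instance": f"pod-{i}", "le": str(le)}
    for job in ("api", "checkout")
    for i in range(40)
    for le in range(12)
]

print(len(raw))                                     # 960
print(aggregated_series_count(raw, ("job", "le")))  # 24
```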
Two operational guardrails:
- Keep recording-rule expressions cheap. A rule that aggregates away labels reduces cardinality and is healthy. A rule whose evaluation itself is expensive just moves the cost from query time to evaluation time and can blow your rule-group evaluation interval. Watch prometheus_rule_group_last_duration_seconds against prometheus_rule_group_interval_seconds; if duration approaches the interval, the group is falling behind.
- Order within a group is sequential; groups run in parallel. Rules in one group can depend on earlier rules in the same group within an evaluation. Cross-group dependencies are not guaranteed in a single tick.
Validate before shipping:
promtool check rules recording-rules.yaml
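The duration-versus-interval guardrail is easy to automate. A minimal Python sketch follows, assuming you have already fetched the two gauges per rule group. The group names, the sample values, and the 0.8 warning threshold are all illustrative choices, not Prometheus defaults.

```python
def lagging_groups(groups: dict[str, tuple[float, float]], threshold: float = 0.8) -> list[str]:
    """Return names of groups whose last_duration / interval ratio is at or above threshold."""
    return sorted(
        name
        for name, (duration_s, interval_s) in groups.items()
        if interval_s > 0 and duration_s / interval_s >= threshold
    )


# Hypothetical (last_duration_seconds, interval_seconds) per rule group.
observed = {
    "http_aggregations": (4.2, 30.0),
    "slo_burn": (27.5, 30.0),
    "capacity": (61.0, 60.0),
}

print(lagging_groups(observed))  # ['capacity', 'slo_burn']
```

A ratio at or above 1 means the group cannot finish one evaluation before the next is due; warning somewhat earlier gives you time to split or simplify the group.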
3. Federation vs remote-write for global views
Once you run more than one Prometheus - per cluster, per region, per team - you need a way to query across them. There are two patterns, and they are not equivalent.
Federation has a central Prometheus scrape the /federate endpoint of leaf Prometheis, pulling a filtered subset of series on an interval. It works for small, pre-aggregated rollups, but it scales poorly: it is pull-based and bursty, the central node inherits the cardinality of everything it federates, and you only get whatever the federation match selectors captured at scrape time. Use it sparingly, for a thin layer of cross-cluster aggregates only.
Remote-write streams samples from each Prometheus to a central, horizontally scalable backend (Thanos Receive, Mimir, or any compatible endpoint) as they are ingested. It is push-based, continuous, carries full resolution, and the backend - not your scraping Prometheus - owns the global query path and long-term storage. For a true global view, remote-write is the right default.
# prometheus.yaml - minimal remote_write to a central backend
remote_write:
  - url: https://mimir.internal.example.com/api/v1/push
    headers:
      X-Scope-OrgID: team-platform # Mimir tenant header; Thanos Receive defaults to THANOS-TENANT
    # Send only what the backend needs for global views; keep the rest local.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "(go_gc_duration_seconds_count|process_cpu_seconds_total)"
        action: keep
You can run remote-write and keep local TSDB for fast, recent, single-cluster queries. A common topology: short local retention for on-call drill-down, remote-write to a global backend for cross-cluster and long-term queries. Many teams add external_labels (e.g. cluster, region) under global: so every remote-written series is attributable to its source.
4. Tuning the remote-write queue
Remote-write reads from the WAL and ships samples through an in-memory queue per remote endpoint. If the queue cannot keep up with ingestion, the WAL grows, disk fills, and you get backpressure that eventually stalls the whole process. The relevant knobs live under queue_config.
remote_write:
  - url: https://mimir.internal.example.com/api/v1/push
    queue_config:
      # Concurrency. Prometheus auto-scales shards between min and max
      # based on backlog; raise the ceiling for high-throughput endpoints.
      min_shards: 1
      max_shards: 50
      # Samples buffered per shard before reading from the WAL blocks.
      capacity: 10000
      # Max samples per outbound request to the backend.
      max_samples_per_send: 2000
      # Force a flush at this interval even if a batch is not full.
      batch_send_deadline: 5s
      # Retry backoff bounds for 5xx/429 from the backend.
      min_backoff: 30ms
      max_backoff: 5s
      # Retry on 429 (honoring Retry-After) instead of dropping the batch.
      retry_on_http_429: true
How to reason about it:
- Shards are the concurrency dial. Prometheus dynamically scales the shard count within [min_shards, max_shards] based on how far behind the queue is. If you are persistently backlogged and not network- or backend-bound, raise max_shards. Watch prometheus_remote_storage_shards, prometheus_remote_storage_shards_desired, and prometheus_remote_storage_shards_max.
- Batch size trades latency for efficiency. Larger max_samples_per_send and capacity mean fewer, fatter requests - more throughput, slightly more memory and per-request latency. Tune up only if the backend accepts larger batches without timing out.
- WAL backpressure is the failure mode to watch. The lag between newest ingested and newest sent shows up as a growing gap. Track prometheus_remote_storage_samples_pending and prometheus_remote_storage_samples_failed_total; pair them with prometheus_remote_storage_highest_timestamp_in_seconds versus prometheus_remote_storage_queue_highest_sent_timestamp_seconds to see how far behind you are.
- Drop noise at the edge. The cheapest sample to store is the one you never send. Use write_relabel_configs with action: drop to strip series the global backend does not need - debug metrics, ultra-high-cardinality internal counters, anything only useful for local drill-down.
write_relabel_configs:
  # Drop noisy debug metrics from the remote stream.
  - source_labels: [__name__]
    regex: "go_memstats_.*|.*_bucket_le_.*"
    action: drop
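For a back-of-the-envelope check on whether max_shards is even in the right range, the sketch below treats each shard as sending one batch per round trip. This is a simplification for reasoning, not the algorithm Prometheus uses to compute its desired shard count, and the ingest rate and latency figures are hypothetical.

```python
import math


def shards_needed(samples_per_sec: int, send_latency_ms: int, max_samples_per_send: int) -> int:
    """Rough shard count if each shard sends one full batch per round trip."""
    return math.ceil(samples_per_sec * send_latency_ms / (1000 * max_samples_per_send))


def lag_seconds(highest_ingested_ts: float, highest_sent_ts: float) -> float:
    """Remote-write lag: newest ingested timestamp minus newest sent timestamp."""
    return max(0.0, float(highest_ingested_ts - highest_sent_ts))


# Hypothetical: 450k samples/s, 100 ms per request, 2000 samples per request.
print(shards_needed(450_000, 100, 2_000))         # 23
print(lag_seconds(1_700_000_300, 1_700_000_180))  # 120.0
```

If the estimate sits well below max_shards and the lag still climbs, the bottleneck is more likely the network or the backend than the shard ceiling.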
5. Thanos architecture
Thanos turns a fleet of Prometheus servers into a global, long-term system using object storage as the durable layer. The components you actually run:
- Sidecar. Runs next to each Prometheus. It uploads completed TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes the Store API so the recent, in-Prometheus window is queryable globally. (In a remote-write-first design you may use Thanos Receive instead of the sidecar upload path; pick one ingestion model.)
- Store Gateway. Serves historical blocks from object storage over the Store API, with index caching so queries do not re-download whole blocks.
- Querier (Query). Fans a PromQL query out to all Store API endpoints - sidecars and store gateways - and deduplicates overlapping series (e.g. from replicated Prometheus pairs) at query time.
- Compactor. A singleton (per object-storage bucket) that compacts blocks, applies retention, and performs downsampling - precomputing 5-minute and 1-hour resolution series so long-range queries read far fewer samples. The compactor must run as exactly one instance per bucket to avoid corrupting blocks.
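The label-set half of the querier's deduplication can be pictured with a short Python sketch: series that differ only in the replica label are treated as one. This is an illustration only. The real querier also has to merge the samples of the duplicates, which this sketch does not attempt, and the label values are invented.

```python
def dedup_label_sets(series: list[dict[str, str]], replica_label: str = "replica") -> list[dict[str, str]]:
    """Collapse series that differ only in the replica label."""
    unique: dict[tuple, dict[str, str]] = {}
    for labels in series:
        stripped = {k: v for k, v in labels.items() if k != replica_label}
        unique.setdefault(tuple(sorted(stripped.items())), stripped)
    return list(unique.values())


# Hypothetical HA pair in eu-1 plus a single replica in us-1.
series = [
    {"__name__": "up", "job": "api", "cluster": "eu-1", "replica": "a"},
    {"__name__": "up", "job": "api", "cluster": "eu-1", "replica": "b"},
    {"__name__": "up", "job": "api", "cluster": "us-1", "replica": "a"},
]

print(len(dedup_label_sets(series)))  # 2
```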
# Store Gateway with an object-storage config (S3 example)
# objstore.yaml
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
# Store Gateway pointed at the bucket
thanos store \
--objstore.config-file=objstore.yaml \
--data-dir=/var/thanos/store \
--index-cache-size=2GB
# Compactor: downsampling + retention per resolution. Run ONE per bucket.
thanos compact \
--objstore.config-file=objstore.yaml \
--data-dir=/var/thanos/compact \
--retention.resolution-raw=30d \
--retention.resolution-5m=180d \
--retention.resolution-1h=2y \
--wait
# Querier deduplicating across replicated Prometheus pairs
# (older Thanos releases spell the --endpoint flag as --store)
thanos query \
--endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc \
--endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc \
--query.replica-label=replica
The compactor’s downsampling is what makes Thanos viable for multi-year retention: a one-hour-resolution series over two years is orders of magnitude smaller to scan than raw. Set --query.replica-label to whichever external_label distinguishes your HA Prometheus replicas, or the querier cannot dedupe them.
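The "orders of magnitude" claim is straightforward to check. The Python sketch below counts time windows per series over two years at each resolution, assuming a 15-second raw scrape interval (an assumption, not something Thanos dictates). Downsampled blocks store several aggregate values per window, so the real saving is somewhat smaller than the raw ratio suggests.

```python
def points_per_series(range_seconds: int, resolution_seconds: int) -> int:
    """Number of time windows one series spans at a given resolution."""
    return range_seconds // resolution_seconds


TWO_YEARS = 2 * 365 * 24 * 3600

print(points_per_series(TWO_YEARS, 15))    # 4204800  (raw, assumed 15s scrape)
print(points_per_series(TWO_YEARS, 300))   # 210240   (5m resolution)
print(points_per_series(TWO_YEARS, 3600))  # 17520    (1h resolution)
```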
6. Grafana Mimir architecture
Mimir is a horizontally scalable, multi-tenant TSDB built primarily around the remote-write ingestion model. Where Thanos composes loosely coupled components over your Prometheus servers, Mimir is a clustered system with its own write and read paths. It is one binary that runs in different modes; in production it is commonly deployed in three groups.
- Write path. The distributor validates and shards incoming remote-write samples and forwards them to ingesters, which hold recent data in memory and flush blocks to object storage.
- Read path. The query-frontend splits and caches queries, the querier executes PromQL against ingesters (recent) and the store-gateway (historical), and the store-gateway serves blocks from object storage.
- Backend. The compactor compacts and deduplicates blocks and enforces retention; the store-gateway, ruler (for recording/alerting rules run server-side), and supporting services round out long-term storage and rule evaluation.
The defining feature is native multi-tenancy: every request carries an X-Scope-OrgID header, and Mimir isolates series, queries, and limits per tenant. Per-tenant limits are how you stop one team from exhausting the cluster:
# Mimir runtime overrides - per-tenant guardrails
overrides:
  team-platform:
    max_global_series_per_user: 3000000
    ingestion_rate: 250000 # samples/sec
    ingestion_burst_size: 500000
    max_fetched_series_per_query: 500000
    compactor_blocks_retention_period: 1y
  team-payments:
    max_global_series_per_user: 1000000
    ingestion_rate: 100000
    compactor_blocks_retention_period: 90d
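The way a rate and a burst size interact is easiest to see as a token bucket. The Python sketch below is a generic single-process token bucket, offered on the assumption that the two limits behave this way. Mimir enforces its limits across many distributors, so treat this as a mental model rather than its implementation. Time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` per second, holds at most `burst` tokens."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full
        self.last = 0.0

    def allow(self, n: int, now: float) -> bool:
        """Admit n samples at time `now` (seconds) if enough tokens have accumulated."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False


# Values borrowed from the team-platform override above.
bucket = TokenBucket(rate=250_000, burst=500_000)
print(bucket.allow(500_000, now=0.0))  # True  (the full burst is available)
print(bucket.allow(1, now=0.0))        # False (bucket is empty)
print(bucket.allow(250_000, now=1.0))  # True  (one second of refill)
print(bucket.allow(200_000, now=1.5))  # False (only 125,000 tokens refilled)
```

The burst absorbs short spikes such as a deploy or a reconnect storm; the rate is what a tenant can sustain.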
Mimir deliberately reuses Thanos and Cortex lineage internally (block format, store-gateway concepts), so the storage primitives will look familiar - the difference is the operational packaging and the first-class tenancy and limits layer.
7. Choosing between Thanos and Mimir
Both solve global view + long-term storage on object storage. The decision is operational, not feature-checkbox.
| Dimension | Thanos | Mimir |
|---|---|---|
| Mental model | Loosely coupled components over your existing Prometheus | Clustered, purpose-built TSDB you push into |
| Primary ingestion | Sidecar block upload (or Receive for push) | Remote-write first |
| Multi-tenancy | Possible, less native | First-class, per-tenant limits built in |
| Long-range query latency | Good with downsampling + caches | Strong; query-frontend splitting/caching and sharding |
| Operational surface | Fewer moving parts to start; compactor is a singleton | More components/config; horizontally scalable end to end |
| Best fit | Already have many Prometheis; want global view with minimal re-architecture | Building a central, multi-team metrics platform at scale |
Decision guide:
- Pick Thanos when you already operate a fleet of Prometheus servers, want a global query view and durable storage with the smallest change to what you run, and value the simpler component model. The sidecar pattern bolts on without re-routing all ingestion.
- Pick Mimir when you are standing up a central metrics platform for many teams, need hard per-tenant isolation and limits, and want a write/read/backend split whose tiers you can scale independently. Remote-write-first fits a hub-and-spoke org cleanly.
Either way, remote-write tuning (section 4) and cardinality control (section 8) apply identically - the backend choice does not save you from a series explosion.
8. Controlling cardinality at the source
The most durable fix is to never create the cardinality. Prometheus offers two relabeling stages, and the distinction matters in practice:
- relabel_configs runs before the scrape, on target labels (service discovery metadata). Use it to decide what to scrape and set target-level labels.
- metric_relabel_configs runs after the scrape, on every sample’s labels. Use it to drop entire metrics or strip high-cardinality labels before they hit the TSDB.
scrape_configs:
  - job_name: app
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Drop a high-cardinality metric entirely.
      - source_labels: [__name__]
        regex: "apiserver_request_duration_seconds_bucket"
        action: drop
      # Strip a label that explodes cardinality (e.g. a full URL path),
      # but keep the metric.
      - regex: "url_path"
        action: labeldrop
      # Allow-list: keep only metrics you have explicitly blessed.
      - source_labels: [__name__]
        regex: "(http_requests_total|http_request_duration_seconds_bucket|up|process_cpu_seconds_total)"
        action: keep
labeldrop/labelkeep act on label names; drop/keep act on samples matched by source_labels values. An allow-list (action: keep on __name__) is the strongest control: nothing enters the TSDB unless it is on the list. Apply allow-lists at the source for local TSDB and at write_relabel_configs for the remote stream.
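The semantics of these three actions can be modeled in a few lines of Python. This is a simplified model, not Prometheus code: it covers only drop, keep, and labeldrop, joins source label values with the default ";" separator, and uses Python's re module with full-string matching to approximate Prometheus's anchored regexes. The sample label sets are invented.

```python
import re
from typing import Optional


def apply_rules(labels: dict[str, str], rules: list[dict]) -> Optional[dict[str, str]]:
    """Apply relabel rules in order; return None if the sample is dropped."""
    labels = dict(labels)
    for rule in rules:
        pattern = re.compile(rule["regex"])
        action = rule["action"]
        if action == "labeldrop":
            # Acts on label NAMES: remove every label whose name matches.
            labels = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
            continue
        # drop/keep act on the joined VALUES of source_labels.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        matched = pattern.fullmatch(value) is not None
        if (action == "drop" and matched) or (action == "keep" and not matched):
            return None
    return labels


rules = [
    {"source_labels": ["__name__"], "regex": "apiserver_request_duration_seconds_bucket", "action": "drop"},
    {"regex": "url_path", "action": "labeldrop"},
    {"source_labels": ["__name__"], "regex": "(http_requests_total|up)", "action": "keep"},
]

print(apply_rules({"__name__": "http_requests_total", "url_path": "/a/b", "code": "200"}, rules))
print(apply_rules({"__name__": "go_goroutines"}, rules))
```

The first sample survives without its url_path label; the second is dropped because it is not on the allow-list.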
Enterprise scenario
A payments platform team ran HA Prometheus pairs per Kubernetes cluster and remote-wrote everything to a central Mimir cluster on S3. After a Black Friday traffic ramp, the distributors started returning 429 and on-call dashboards went blank for the team-payments tenant. The root cause was not Mimir capacity - it was the per-tenant ingestion_rate limit colliding with a cardinality spike from a new deploy that added a customer_id label to http_requests_total. Each Prometheus dutifully sharded up its remote-write queue, hammered the throttled distributor, honored Retry-After, and fell hours behind while the WAL grew toward the disk limit.
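The fix had two halves. First, stop the bleeding at the edge: stop shipping the series that carry the offending label, so the tenant fits back under its rate budget. A plain labeldrop is not safe here, because series that differ only by customer_id would collapse into identical label sets and collide at the backend. Dropping the labeled series outright avoids that and leaves every other metric in the remote stream untouched, while aggregates that never carried the label (such as the recording rules from section 2) keep flowing:
write_relabel_configs:
  # Drop any series that still carries a non-empty customer_id label.
  - source_labels: [customer_id]
    regex: ".+"
    action: drop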
Second, make the limit a guardrail, not a silent ceiling: alert on cortex_discarded_samples_total{reason="rate_limited"} per tenant, and raise the headroom deliberately rather than reactively.
overrides:
  team-payments:
    ingestion_rate: 200000 # was 100000
    ingestion_burst_size: 600000 # absorb deploy-time spikes
The durable lesson: a multi-tenant backend turns one team’s cardinality mistake into that team’s outage, not everyone’s - but only if the per-tenant limit is observable. Wire cortex_discarded_samples_total and prometheus_remote_storage_samples_pending into paging before the next ramp, not after.
Verify
Confirm each layer is actually doing its job before declaring victory.
# 0. Rule file is syntactically valid (this does not prove the rules are loaded or keeping up)
promtool check rules recording-rules.yaml
# 1. Recording rules produce series and stay within their interval
job:http_requests:rate5m
max by (rule_group) (prometheus_rule_group_last_duration_seconds)
/ max by (rule_group) (prometheus_rule_group_interval_seconds) # keep < 1
# 2. Remote-write is keeping up: failures flat, backlog small
rate(prometheus_remote_storage_samples_failed_total[5m]) # expect ~0
prometheus_remote_storage_samples_pending # bounded, not climbing
prometheus_remote_storage_shards_desired # <= shards_max
# 3. WAL is not growing unbounded (backpressure check)
prometheus_tsdb_wal_truncations_total
prometheus_remote_storage_highest_timestamp_in_seconds
- prometheus_remote_storage_queue_highest_sent_timestamp_seconds # small gap
# 4. Cardinality moved in the right direction after relabeling
count by (__name__)({__name__=~".+"})
# 5. Thanos: blocks present in object storage and visible to the querier
thanos tools bucket inspect --objstore.config-file=objstore.yaml
# 6. Mimir: confirm a tenant's series count against its limit
# (cortex_-prefixed metrics are exposed by Mimir components)
# Query Mimir's own /metrics or your monitoring stack for:
# cortex_ingester_memory_series
Pitfalls and next steps
- Recording rules that do not reduce cardinality just relocate cost. If a rule keeps every label it had, it is buying you nothing - aggregate labels away.
- Raising max_shards blindly can overwhelm a backend that is the real bottleneck. Confirm where the backpressure is (network, backend 429s, or genuine under-sharding) before turning the dial.
- Two Thanos compactors on one bucket will corrupt blocks. It is a singleton per bucket - guard it with a single-replica deployment and a lock you trust.
- Forgetting the tenant header. Mimir keys on X-Scope-OrgID, and Thanos Receive reads the tenant from a configurable header that defaults to THANOS-TENANT; omit it and samples land in a default tenant or get rejected.
- Treating the backend as a cardinality fix. Thanos and Mimir scale storage and query, not your series budget. Control cardinality at the source regardless of backend.
Next, push rule evaluation server-side (Thanos Ruler or Mimir’s ruler) so global rules run against the global dataset rather than per-Prometheus, add per-tenant or per-job cardinality alerts so explosions page before they OOM a node, and codify the downsampling/retention tiers to your actual query patterns - most teams over-retain raw resolution and under-use the 5m and 1h tiers.