A payments platform team runs forty Kubernetes clusters and a fleet of edge appliances, and their single Prometheus pair has stopped coping: the active series count crossed nine million the week they added per-transaction-id labels to a latency histogram, the box now needs 180 GiB of RAM just to stay up, queries that used to return in a second time out, and retention is capped at fifteen days because the local disk is full. The SRE lead’s mandate is blunt — “thirteen months of metrics, sub-second dashboards, and stop paging me about OOM at 3 a.m.” This guide walks through replacing that single Prometheus with a VictoriaMetrics cluster — vminsert, vmselect, and vmstorage as independently scalable tiers — fronted by vmagent as a drop-in Prometheus remote_write backend that absorbs high-cardinality ingestion and serves long-term queries without the memory cliff. Everything below is run against a real Kubernetes cluster and is reversible.
VictoriaMetrics splits what Prometheus does in a single process into three roles. vmstorage holds the data and does the heavy lifting of each query; it is the only stateful tier and the one you scale for cardinality and retention. vminsert is stateless: it accepts writes and shards each series across the storage nodes by a consistent hash of its labels. vmselect is also stateless: it fans a query out to every storage node and merges the results. vmagent replaces Prometheus’ own scraping and remote-write: it scrapes targets (or receives Prometheus’ remote_write), buffers to disk when the backend is briefly unavailable, and can drop or relabel high-cardinality labels before they ever hit storage. Because the read and write tiers are stateless, you scale them with a replica count; because storage is sharded, you add cardinality headroom by adding vmstorage pods.
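To make the sharding idea concrete, here is a toy sketch of how a write router can map a series to a storage node. It is an illustration only: it uses `cksum` and a plain modulo, which is simpler than consistent hashing and is not the hash VictoriaMetrics itself uses. The function name `shard_for` and the sample series are hypothetical.

```shell
#!/usr/bin/env bash
# Toy model of series sharding. NOT VictoriaMetrics' real algorithm:
# cksum + modulo stands in for the consistent hash described above.

# shard_for SERIES NODE_COUNT -> index (0..NODE_COUNT-1) of the owning storage node
shard_for() {
  local series=$1 nodes=$2 hash
  # cksum prints "<crc> <byte-count>"; keep only the CRC
  hash=$(printf '%s' "$series" | cksum | cut -d' ' -f1)
  echo $(( hash % nodes ))
}

# The same label set always lands on the same node; different label sets spread out.
for s in 'http_requests_total{pod="a"}' 'http_requests_total{pod="b"}' 'up{job="edge"}'; do
  echo "$s -> vmstorage-$(shard_for "$s" 3)"
done
```

The property that matters is determinism: every sample of a given series goes to the same node, so each node only indexes its own share of the label sets.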
Prerequisites
- A Kubernetes cluster (this guide assumes 1.28+) with a working StorageClass that provisions SSD-backed volumes (e.g. managed-csi-premium on AKS, gp3 on EKS). High-cardinality storage is IOPS-bound.
- kubectl and Helm 3.14+ configured against the cluster.
- A namespace you can create (monitoring), and the ability to create StatefulSets, PersistentVolumeClaims, and Services.
- An existing Prometheus you will point at the new backend, or vmagent scrape access to your targets.
- Cluster sizing headroom: for ~10M active series plan 3 × vmstorage at 8 vCPU / 32 GiB / 500 GiB SSD to start. VictoriaMetrics needs roughly an order of magnitude less RAM than Prometheus for the same series, but storage IO is real.
- HashiCorp Vault reachable from the cluster (for the object-storage credentials used by backups), and an OIDC IdP (Microsoft Entra ID federated from Okta) for dashboard SSO.
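The sizing bullet gives a starting point; a rough disk estimate helps decide when to grow it. The sketch below is a back-of-envelope formula, not an official sizing method: samples per day × retention × bytes per sample × replication. The function name `estimate_disk_gib`, the 395-day approximation of thirteen months, and the 0.8 bytes-per-sample figure (passed in thousandths of a byte) are assumptions for illustration; it ignores index size and free-space headroom.

```shell
#!/usr/bin/env bash
# Back-of-envelope disk estimate. All inputs are assumptions you should replace.

# estimate_disk_gib SERIES SCRAPE_INTERVAL_S RETENTION_DAYS MILLIBYTES_PER_SAMPLE REPLICATION
#   MILLIBYTES_PER_SAMPLE is bytes per sample x 1000 (800 = 0.8 bytes), to stay in integer math.
estimate_disk_gib() {
  local series=$1 interval=$2 days=$3 mbytes=$4 rf=$5
  local samples_per_day=$(( series * 86400 / interval ))
  local bytes=$(( samples_per_day * days * mbytes / 1000 * rf ))
  echo $(( bytes / 1024 / 1024 / 1024 ))
}

# Example inputs: 10M active series, 30s scrape, ~13 months, 0.8 B/sample, 2 copies.
estimate_disk_gib 10000000 30 395 800 2
```

Compare the result with the total provisioned vmstorage capacity (3 × 500 GiB in the starting layout) and use the active-series count you expect after relabeling, since that is the number the formula is most sensitive to.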
Target topology
The write path and the read path share the storage tier but are otherwise independent, and keeping them separate in your head is the key to operating this well. On the write path, vmagent scrapes pods and appliances (or receives remote_write from your legacy Prometheus), applies relabeling to tame cardinality, and pushes to a load-balanced vminsert Service; vminsert hashes each series and shards it across the vmstorage pods. On the read path, Grafana (and Dynatrace, via a Prometheus datasource) queries vmselect, which scatter-gathers across every vmstorage pod and merges the result. vmstorage is the only stateful tier — its PVCs hold both the inverted index (what makes high cardinality expensive) and the compressed samples (what makes thirteen-month retention cheap). Around the edges: vmauth terminates auth and routes, Vault issues the object-storage credentials vmbackup uses for the durable copy, and Entra/Okta gate the dashboards.
The components, and the one configuration choice that matters most for each:
| Component | Role | The choice that matters |
|---|---|---|
| vmagent | Scrape / receive remote_write, relabel, buffer, forward | -remoteWrite.maxDiskUsagePerURL so a backend blip buffers instead of dropping |
| vminsert | Stateless write router; shards series across storage | Replica count for write throughput; -replicationFactor for durability |
| vmstorage | Stateful index + sample store | -retentionPeriod, disk size, and pod count (your cardinality lever) |
| vmselect | Stateless query fan-out and merge | -search.maxUniqueTimeseries, cache size, replica count for QPS |
| vmauth | Auth proxy / router in front of insert + select | Per-tenant routing and bearer-token enforcement |
| vmbackup / vmrestore | Snapshot to object storage and restore | Vault-issued S3/Blob creds; backup cadence |
| Vault | Issues short-lived object-storage credentials for backups | Dynamic secrets engine; no static keys on disk |
| Entra ID + Okta | SSO for Grafana and vmauth | OIDC; Okta federated to Entra; group → role mapping |
1. Create the namespace and add the Helm repo
VictoriaMetrics ships an official Helm chart for the cluster topology. Pin the chart version so a helm upgrade never silently jumps a major.
kubectl create namespace monitoring
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
# Pin to a known-good chart version (inspect what's available first)
helm search repo vm/victoria-metrics-cluster --versions | head
Confirm your SSD StorageClass exists before you ask a StatefulSet to bind 500 GiB volumes to it:
kubectl get storageclass
# Expect a premium/SSD class, e.g. managed-csi-premium or gp3, marked (default) or named explicitly below.
2. Write the cluster values file
This values.yaml defines the three tiers, sets thirteen-month retention, and enables a replication factor of 2 so the cluster survives a single vmstorage pod loss. The vmstorage block is where high-cardinality decisions live — disk size, retention, and the replica count you will grow over time.
# vm-cluster-values.yaml
vmstorage:
  replicaCount: 3
  retentionPeriod: "13"                # months; the long-term-storage requirement
  extraArgs:
    dedup.minScrapeInterval: "30s"     # global dedup if you HA-pair vmagent
    # Reject absurd-cardinality streams at the door rather than OOM later.
    # Keep these above the series count you expect after relabeling:
    # samples for new series beyond the limit are dropped.
    storage.maxHourlySeries: "2000000"
    storage.maxDailySeries: "8000000"
  persistentVolume:
    enabled: true
    storageClassName: "managed-csi-premium"
    size: 500Gi
  resources:
    requests: { cpu: "4", memory: "16Gi" }
    limits: { cpu: "8", memory: "32Gi" }
  podDisruptionBudget:
    enabled: true
    maxUnavailable: 1
vminsert:
  replicaCount: 3
  extraArgs:
    replicationFactor: "2"             # each series written to 2 storage nodes
    maxLabelsPerTimeseries: "40"       # hard cap on label count per series
  resources:
    requests: { cpu: "1", memory: "1Gi" }
    limits: { cpu: "2", memory: "2Gi" }
vmselect:
  replicaCount: 3
  cacheMountPath: /cache
  persistentVolume:
    enabled: true
    storageClassName: "managed-csi-premium"
    size: 50Gi
  extraArgs:
    dedup.minScrapeInterval: "30s"
    search.maxUniqueTimeseries: "1000000"   # guard against runaway queries
    search.maxQueryDuration: "60s"
  resources:
    requests: { cpu: "2", memory: "4Gi" }
    limits: { cpu: "4", memory: "8Gi" }
A note on replicationFactor: it is set on vminsert (the writer), not on vmstorage, and it must be strictly less than the vmstorage replica count. With replicationFactor: 2 and 3 storage pods, vmselect must also be told that it can tolerate one missing node, which it learns from a matching -replicationFactor flag. The chart is expected to wire that flag when it sees the insert replication arg; confirm it on the running vmselect pods after the install in step 3, especially if you tune these by hand.
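The arithmetic behind that note can be sketched in a few lines. The 2×RF−1 rule below follows the VictoriaMetrics cluster documentation's guidance on how many vmstorage nodes are needed to keep RF copies of new data while RF−1 nodes are down; the function names are hypothetical, and the capacity figure is a simple division that ignores index overhead.

```shell
#!/usr/bin/env bash
# Replication arithmetic for a vmstorage tier (illustrative helper names).

# Smallest node count that still keeps RF copies of new data with RF-1 nodes down.
min_storage_nodes() { echo $(( 2 * $1 - 1 )); }

# Node losses the data survives without losing the last copy.
tolerated_failures() { echo $(( $1 - 1 )); }

# usable_gib NODES GIB_PER_NODE RF -> capacity left after storing RF copies
usable_gib() { echo $(( $1 * $2 / $3 )); }

echo "min nodes for RF=2:      $(min_storage_nodes 2)"    # 3
echo "tolerated node failures: $(tolerated_failures 2)"   # 1
echo "usable GiB (3 x 500):    $(usable_gib 3 500 2)"     # 750
```

With the values file above, three storage pods is the minimum that satisfies RF=2, and replication halves the usable capacity of the 3 × 500 GiB volumes.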
3. Install the cluster
helm install vmcluster vm/victoria-metrics-cluster \
--namespace monitoring \
--version <pinned-chart-version> \
-f vm-cluster-values.yaml
# Watch the storage StatefulSet bind its PVCs and go Ready
kubectl -n monitoring rollout status statefulset/vmcluster-victoria-metrics-cluster-vmstorage --timeout=300s
kubectl -n monitoring get pods -l app.kubernetes.io/instance=vmcluster
Note the two Service names the chart creates — you will write to one and read from the other:
kubectl -n monitoring get svc | grep -E 'vminsert|vmselect'
# vminsert -> :8480 (write endpoint)
# vmselect -> :8481 (read endpoint)
The cluster URL paths carry a tenant id (0 for single-tenant). Writes go to /insert/0/prometheus/api/v1/write; reads go to /select/0/prometheus as a Prometheus-compatible datasource. The 0 is the accountID; a tenant can also be written as accountID:projectID, and projectID defaults to 0 when omitted. You get multi-tenancy later by changing that path segment.
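Because the tenant lives in the URL path, it is easy to get one endpoint right and the other wrong. A small helper keeps the two in step; this is a sketch, the function names are hypothetical, and the host:port values in the example are placeholders rather than the chart's real Service names.

```shell
#!/usr/bin/env bash
# Build tenant-aware VictoriaMetrics cluster URLs (helper names are illustrative).

# vm_tenant ACCOUNT_ID [PROJECT_ID] -> "account" or "account:project"
vm_tenant() {
  local account=$1 project=${2:-}
  if [ -n "$project" ]; then echo "${account}:${project}"; else echo "$account"; fi
}

# vm_write_url HOST:PORT ACCOUNT_ID [PROJECT_ID]
vm_write_url() { echo "http://$1/insert/$(vm_tenant "$2" "${3:-}")/prometheus/api/v1/write"; }

# vm_read_url HOST:PORT ACCOUNT_ID [PROJECT_ID]
vm_read_url() { echo "http://$1/select/$(vm_tenant "$2" "${3:-}")/prometheus"; }

vm_write_url vminsert:8480 0      # http://vminsert:8480/insert/0/prometheus/api/v1/write
vm_read_url  vmselect:8481 7 2    # http://vmselect:8481/select/7:2/prometheus
```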
4. Deploy vmagent as the remote-write backend
vmagent is what your existing Prometheus talks to, and what tames cardinality. Deploy it with a scrape config plus a relabeling rule that drops the high-cardinality label that started the incident, then forwards to vminsert. Buffering to disk (-remoteWrite.maxDiskUsagePerURL) means a storage hiccup queues data instead of losing it.
# vmagent-values.yaml
remoteWriteUrls:
  - http://vmcluster-victoria-metrics-cluster-vminsert.monitoring.svc:8480/insert/0/prometheus/api/v1/write
extraArgs:
  remoteWrite.maxDiskUsagePerURL: "10GiB"   # on-disk buffer if vminsert is briefly down
  remoteWrite.tmpDataPath: /vmagent-buffer
  promscrape.maxScrapeSize: "32MiB"
persistentVolume:
  enabled: true
  storageClassName: "managed-csi-premium"
  size: 20Gi
config:
  global:
    scrape_interval: 30s
    external_labels:
      cluster: payments-prod
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs: [{ role: pod }]
      relabel_configs:
        # only scrape pods that opt in
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: "true"
      metric_relabel_configs:
        # DROP the per-transaction-id label that exploded cardinality.
        # labeldrop matches label NAMES with regex; it does not take source_labels.
        - action: labeldrop
          regex: "transaction_id"
        # Drop a noisy histogram we never query
        - source_labels: [__name__]
          regex: "envoy_cluster_.*_bucket"
          action: drop
helm install vmagent vm/victoria-metrics-agent \
--namespace monitoring \
--version <pinned-chart-version> \
-f vmagent-values.yaml
kubectl -n monitoring rollout status deployment/vmagent-victoria-metrics-agent
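To see why dropping a single label matters so much, it helps to count series before and after. The sketch below mimics the effect of a labeldrop rule on a handful of made-up series using sed; it is a text-level illustration, not how vmagent implements relabeling, and the metric name, label values, and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Simulate the cardinality effect of dropping one label (illustration only).

# drop_label NAME: remove NAME="..." from each series read on stdin.
# Simplification: it would also match a longer label name that ends in NAME.
drop_label() {
  sed -E "s/,?$1=\"[^\"]*\"//; s/[{],/{/"
}

# count_series: number of distinct series on stdin
count_series() { sort -u | wc -l | tr -d ' '; }

samples='latency_bucket{le="0.5",service="pay",transaction_id="a1"}
latency_bucket{le="0.5",service="pay",transaction_id="b2"}
latency_bucket{le="0.5",service="pay",transaction_id="c3"}'

before=$(printf '%s\n' "$samples" | count_series)
after=$(printf '%s\n' "$samples" | drop_label transaction_id | count_series)
echo "before=$before after=$after"   # before=3 after=1
```

Every distinct transaction id was its own series; once the label is gone they collapse into one, which is the index the storage tier no longer has to build.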
To migrate without ripping out the legacy Prometheus first, point its remote_write at the same vminsert endpoint and run both in parallel during cutover:
# add to the legacy prometheus.yml, then reload Prometheus
remote_write:
  - url: http://vmcluster-victoria-metrics-cluster-vminsert.monitoring.svc:8480/insert/0/prometheus/api/v1/write
    queue_config:
      max_shards: 30
      capacity: 20000
5. Point Grafana (and Dynatrace) at vmselect
Add vmselect as a Prometheus datasource. VictoriaMetrics speaks the Prometheus query API and MetricsQL (a superset of PromQL), so existing dashboards work unchanged.
# grafana datasource (provisioning/datasources/vm.yaml)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://vmcluster-victoria-metrics-cluster-vmselect.monitoring.svc:8481/select/0/prometheus
    isDefault: true
    jsonData:
      httpMethod: POST
      prometheusType: Prometheus
      timeInterval: 30s
The same vmselect URL can serve as a Prometheus-compatible source for Dynatrace where it consumes a Prometheus endpoint, so a single long-term store backs both your Grafana dashboards and your APM tool’s metric correlation, with no second copy of the data. For SSO, front Grafana with Microsoft Entra ID as the OIDC provider, with Okta federated to Entra as the workforce IdP, and map an Entra group to the Grafana Admin role so dashboard access follows the same identity your humans already use everywhere else.
6. Wire backups to object storage via Vault
vmstorage PVCs are durable within the cluster, but the long-term-storage requirement means an off-cluster copy. Run vmbackup as a sidecar that snapshots to S3/Azure Blob, and fetch the bucket credentials at runtime from HashiCorp Vault so no long-lived static access key sits in a Secret or on disk.
# Vault issues short-lived S3 creds via its AWS secrets engine
export VAULT_ADDR=https://vault.internal:8200
vault read aws/creds/vmbackup-writer
# -> access_key / secret_key with a 1h TTL, rotated automatically
# vmbackup sidecar args (added to the vmstorage pod spec)
- name: vmbackup
  image: victoriametrics/vmbackup:v1.103.0-cluster
  args:
    - -storageDataPath=/storage
    - -snapshot.createURL=http://localhost:8482/snapshot/create
    - -dst=s3://payments-vm-longterm/$(POD_NAME)/
    - -customS3Endpoint=https://s3.ap-south-1.amazonaws.com
  env:
    - name: POD_NAME              # needed for the $(POD_NAME) expansion in -dst
      valueFrom: { fieldRef: { fieldPath: metadata.name } }
    - name: AWS_ACCESS_KEY_ID     # secretKeyRef reads the vmbackup-s3 Secret; keep it synced from the short-lived Vault creds
      valueFrom: { secretKeyRef: { name: vmbackup-s3, key: access_key } }
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom: { secretKeyRef: { name: vmbackup-s3, key: secret_key } }
  volumeMounts:
    - { name: vmstorage-volume, mountPath: /storage }
Schedule backups (a CronJob invoking the snapshot+upload, or vmbackupmanager for retention-aware rotation) and test a restore with vmrestore -src=s3://... -storageDataPath=/storage into a scratch pod before you trust it.
7. Manage it all as code
Keep every values file and Helm release in Git and reconcile with Argo CD, so the cluster’s desired state is reviewable and revertable: a helm upgrade becomes a pull request, not an SSH session. A GitHub Actions (or Jenkins) pipeline lints the chart values and runs helm template | kubeconform on every PR before Argo CD syncs. Provision the underlying nodes, the SSD StorageClass, the S3 bucket, and the Vault roles with Terraform, and use Ansible to lay down vmagent on the virtual appliances at the edge, which are not part of the Kubernetes cluster but still need to ship metrics into the same backend.
# what the CI gate runs on every PR
helm template vmcluster vm/victoria-metrics-cluster -f vm-cluster-values.yaml \
| kubeconform -strict -summary -kubernetes-version 1.28.0
Validation
Prove ingestion, sharding, retention, and query before you cut traffic over.
# 1. vmagent is forwarding and not dropping — check its /metrics
kubectl -n monitoring port-forward deploy/vmagent-victoria-metrics-agent 8429 &
curl -s localhost:8429/metrics | grep -E 'vmagent_remotewrite_(requests_total|errors_total|pending_data_bytes)'
# pending_data_bytes near 0 and errors_total flat = healthy forwarding
# 2. Storage is actually holding series, and they are sharded across pods
kubectl -n monitoring port-forward svc/vmcluster-victoria-metrics-cluster-vmstorage 8482 &
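curl -s 'localhost:8482/metrics' | grep 'vm_cache_entries{type="storage/hour_metric_ids"}'
# That entry is the active-series count for the last hour. A Service port-forward reaches only one pod,
# so repeat against each vmstorage pod; series counts should be roughly even.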
# 3. Query through vmselect returns data (Prometheus API)
kubectl -n monitoring port-forward svc/vmcluster-victoria-metrics-cluster-vmselect 8481 &
curl -s 'localhost:8481/select/0/prometheus/api/v1/query?query=up' | jq '.data.result | length'
# 4. Cardinality explorer — find what is eating your index
curl -s 'localhost:8481/select/0/prometheus/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName[0:10]'
# This is the high-cardinality audit: the top metrics by series count.
The fourth call is the one to run weekly: VictoriaMetrics’ cardinality explorer (also a UI at vmselect’s /select/0/vmui/#/cardinality) tells you exactly which metric or label is driving series growth, so you tune the vmagent labeldrop rule with evidence rather than guesswork. Retention cannot be proven on day one, because a new cluster holds only what it has ingested since install: once it has been running longer than your old Prometheus’ fifteen-day cap, query a timestamp older than fifteen days and confirm you get a result.
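The "roughly even" check from the second validation call is easier to act on as a number. The sketch below takes one line per vmstorage pod with its active-series count and reports the spread between the busiest and the quietest pod as a percentage. The function name `shard_skew_pct`, the input format, and the sample counts are assumptions; you would fill the input from the per-pod metric you collected.

```shell
#!/usr/bin/env bash
# Quantify how evenly series are spread across vmstorage pods (illustrative helper).

# shard_skew_pct: reads "<pod> <series_count>" lines on stdin,
# prints (max - min) * 100 / max as an integer percentage.
shard_skew_pct() {
  awk 'NR == 1 { min = $2; max = $2 }
       $2 < min { min = $2 }
       $2 > max { max = $2 }
       END { if (NR == 0 || max == 0) print 0; else printf "%d\n", (max - min) * 100 / max }'
}

# Hypothetical per-pod counts
printf '%s\n' 'vmstorage-0 100' 'vmstorage-1 90' 'vmstorage-2 95' | shard_skew_pct   # 10
```

A small spread is expected from hashing; a large and persistent one is worth investigating before you add more storage pods.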
Rollback / teardown
Because the legacy Prometheus kept running through cutover, rollback is a config revert, not a recovery.
# Roll back a bad upgrade to the previous release revision
helm -n monitoring history vmcluster
helm -n monitoring rollback vmcluster <previous-revision>
# Point Grafana back at the old Prometheus datasource (revert the provisioning PR via Argo CD)
# Full teardown — uninstall the releases, then DELETE PVCs explicitly
helm -n monitoring uninstall vmagent vmcluster
kubectl -n monitoring delete pvc -l app.kubernetes.io/instance=vmcluster
helm uninstall does not delete the vmstorage PVCs, because they are created from the StatefulSet’s volume claim templates rather than owned by the release. Treat that as a safety feature: your thirteen months of data survives an accidental uninstall. Delete them only when you are certain, and confirm a fresh backup exists in object storage first.
Common pitfalls
- Setting replicationFactor on the wrong tier. It belongs on vminsert and must be less than the vmstorage replica count; if you set it equal, a single node loss makes queries fail. With replication on, vmselect needs the matching -replicationFactor flag to know it can tolerate a partial response; verify the chart wired it.
- Forgetting -dedup.minScrapeInterval when running HA vmagent. Two agents scraping the same target double every sample; set the dedup interval on both vmstorage and vmselect to the scrape interval so the duplicate is collapsed at write and read.
- Treating high cardinality as a storage problem instead of an ingestion one. Scaling vmstorage buys headroom, but the cheap fix is a labeldrop/drop rule in vmagent that stops the offending label at the door. Use the cardinality explorer to find it.
- Under-provisioning vmselect cache disk. Heavy dashboards thrash the cache; give vmselect a real PVC, not emptyDir, or query latency stays spiky.
- Slow StorageClass on vmstorage. The inverted index is IOPS-hungry; an HDD or burst-credit-limited volume turns into the new bottleneck. Use provisioned-IOPS SSD.
- No backup restore test. A backup you have never restored is a hope, not a guarantee; run vmrestore into a scratch pod on a schedule.
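The dedup pitfall is easier to reason about with a toy model. The sketch below keeps one sample per interval, the one with the largest timestamp, which is the behaviour the -dedup.minScrapeInterval setting is documented to have. It is a simplification: the function name `dedup_samples` and the sample data are hypothetical, the input must already be sorted by timestamp, and the real interval alignment inside VictoriaMetrics may differ.

```shell
#!/usr/bin/env bash
# Toy deduplication: keep the last sample in each INTERVAL-second bucket.

# dedup_samples INTERVAL_S: reads "<unix_ts> <value>" lines sorted by timestamp
dedup_samples() {
  awk -v iv="$1" '
    { b = int($1 / iv) }                  # bucket index for this sample
    NR > 1 && b != prev { print last }    # bucket changed: emit the kept sample
    { prev = b; last = $0 }               # remember the newest sample in the bucket
    END { if (NR > 0) print last }
  '
}

# Two HA agents scraping one target every 30s, one second apart
printf '%s\n' '0 1.0' '1 1.0' '30 2.0' '31 2.0' '60 3.0' '61 3.0' | dedup_samples 30
# prints:
# 1 1.0
# 31 2.0
# 61 3.0
```

Six raw samples become three, one per scrape interval, which is why the dedup interval should match the scrape interval rather than exceed it.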
Security notes
Do not expose vminsert or vmselect directly. Put vmauth in front as the single authenticated front door (it validates bearer tokens, routes per tenant, and rate-limits) and gate the human-facing Grafana with OIDC through Microsoft Entra ID, federated from Okta so dashboard access follows your existing workforce identity and conditional-access policies. Pull the object-storage credentials vmbackup uses from HashiCorp Vault’s dynamic secrets engine so they are short-lived and never written to a Kubernetes Secret in plain text. Run Wiz (with Wiz Code scanning the Helm values and Terraform in the repo) for continuous cloud-posture and misconfiguration detection; it flags the moment a Service drifts to a public load balancer or a backup bucket loses its block-public-access setting. Deploy CrowdStrike Falcon sensors on the cluster nodes and the edge virtual appliances for runtime threat detection feeding your SOC, and auto-raise a ServiceNow incident on a guardrail breach (a public-exposure alert from Wiz, a sustained vmagent drop) so security gets a ticket, not just a log line. Network-policy the monitoring namespace so only vmagent and vmauth can reach vminsert, and only Grafana and vmauth can reach vmselect on the query path.
Cost notes
The win is mostly RAM: VictoriaMetrics holds the same series in roughly an order of magnitude less memory than Prometheus, so the 180 GiB single box becomes three modest 32 GiB storage pods — and its on-disk compression (typically well under one byte per sample) makes thirteen-month retention on SSD genuinely affordable where Prometheus could not hold fifteen days. Scale the tiers independently and only where the pressure is: add vmstorage pods for cardinality and retention, vmselect replicas for dashboard QPS, vminsert replicas for write throughput — never one oversized everything. The biggest single lever is the vmagent relabel rule: every high-cardinality label you drop before storage is index you never pay to build, hold, or query, so the cardinality explorer pays for itself directly in storage cost. Tier the backup bucket to infrequent-access/cool storage since restores are rare, and meter ingestion and series growth into Dynatrace so the metrics platform’s own cost is on a dashboard the SRE lead sees — the same dashboard that proves the 3 a.m. pages stopped.