Most Kubernetes clusters are simultaneously over-provisioned and unreliable: aggregate node utilization sits at 25-35% on a typical billing dashboard, yet pods still get OOMKilled and throttled. Both symptoms have the same root cause — requests and limits set by copy-paste, never by measurement. This guide fixes that with the Vertical Pod Autoscaler (VPA): how to gather recommendations safely, how to read the three numbers it produces, where it conflicts with the HPA, and how right-sizing feeds directly into better bin-packing and a smaller bill.
1. Requests vs limits, QoS, and the two failure modes
Before touching VPA, internalize what the scheduler and kubelet actually do with these numbers, because VPA only ever changes one of them.
- Requests are a scheduling contract. kube-scheduler sums pod requests against each node’s allocatable capacity and places the pod where it fits. Requests reserve capacity whether or not the pod uses it.
- Limits are an enforcement ceiling, applied by the kernel via cgroups. CPU over a limit is throttled (CFS throttling: the pod is slowed, not killed). Memory over a limit is OOMKilled: the kernel terminates the container, the kubelet restarts it, and you see a CrashLoopBackOff if it repeats.
That asymmetry produces the two failure modes:
| Mis-set value | Consequence | Who pays |
|---|---|---|
| Requests too high | Capacity reserved but idle; nodes fill on paper at 30% real use | The bill |
| Requests too low | Pods crammed onto nodes, then evicted under node pressure | Reliability |
| Memory limit too low | OOMKill, restart, CrashLoopBackOff | Reliability |
| CPU limit too low | Silent CFS throttling, latency spikes | Latency SLOs |
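To make the "full on paper at 30% real use" row concrete, here is a minimal Python sketch of the request-based fit check described above. The `Node` class, the `fits` helper, and all numbers are illustrative assumptions, not scheduler code; the real scheduler evaluates several resources and many other constraints.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_mcpu: int  # allocatable CPU in millicores
    requested_mcpu: int    # sum of CPU requests of pods already placed
    used_mcpu: int         # CPU actually being consumed right now


def fits(node: Node, pod_request_mcpu: int) -> bool:
    """A pod fits only if its request fits in allocatable minus existing requests.

    Real usage plays no part in the decision.
    """
    return node.requested_mcpu + pod_request_mcpu <= node.allocatable_mcpu


def percent(part: int, whole: int) -> int:
    """Integer percentage, rounded down."""
    return part * 100 // whole


# Hypothetical 4-core node: fully reserved by requests, only 30% really used.
node = Node(allocatable_mcpu=4000, requested_mcpu=4000, used_mcpu=1200)

print(fits(node, 250))                                      # False: no room on paper
print(percent(node.requested_mcpu, node.allocatable_mcpu))  # 100 (% reserved)
print(percent(node.used_mcpu, node.allocatable_mcpu))       # 30 (% actually used)
```

The node rejects a 250m pod even though 70% of its CPU is idle, which is exactly the over-provisioning that shows up on the bill.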
QoS class is derived from these values and decides eviction order when a node runs out of memory:
| QoS class | Condition | Eviction priority |
|---|---|---|
| Guaranteed | requests == limits for every container, CPU and memory | Evicted last |
| Burstable | at least one request set, but not Guaranteed | Middle |
| BestEffort | no requests or limits at all | Evicted first |
The single most important rule on this page: set memory requests == memory limits for anything you care about. It gives the container Guaranteed-style behavior for memory, removes the burst headroom that lures you into OOMKills, and makes scheduling deterministic. Note that the pod’s actual QoS class is Guaranteed only if CPU requests also equal CPU limits in every container; with the CPU limit left off, the pod is Burstable. For CPU, leave the limit off or set it generously: CPU is compressible, and a low CPU limit throttles you for no capacity benefit. VPA’s job is to find the right request number; you decide the request/limit relationship.
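The QoS rules in the table can be sketched as a small Python function. This is a simplified illustration, not kubelet code: `qos_class` and `effective_requests` are hypothetical helper names, and quantities are compared as literal strings (so "1" and "1000m" would be treated as different).

```python
from typing import Dict, List

Resources = Dict[str, str]  # e.g. {"cpu": "250m", "memory": "410Mi"}


def effective_requests(requests: Resources, limits: Resources) -> Resources:
    """Kubernetes defaults a missing request to the limit for that resource."""
    merged = dict(limits)
    merged.update(requests)
    return merged


def qos_class(containers: List[Dict[str, Resources]]) -> str:
    """Classify a pod from its containers' requests and limits."""
    # BestEffort: no container sets any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container has CPU and memory limits equal to its requests.
    for c in containers:
        limits = c.get("limits", {})
        requests = effective_requests(c.get("requests", {}), limits)
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                return "Burstable"
    return "Guaranteed"


pinned = {"requests": {"cpu": "250m", "memory": "410Mi"},
          "limits": {"cpu": "250m", "memory": "410Mi"}}
no_cpu_limit = {"requests": {"cpu": "250m", "memory": "410Mi"},
                "limits": {"memory": "410Mi"}}

print(qos_class([pinned]))        # Guaranteed
print(qos_class([no_cpu_limit]))  # Burstable
print(qos_class([{}]))            # BestEffort
```

The second case is the layout this guide recommends (memory request == limit, no CPU limit): the memory ceiling is pinned, but the reported class is Burstable.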
2. VPA architecture: recommender, updater, admission controller
VPA is not built into Kubernetes. You install it from the autoscaler repo, and it ships as three independent components plus a CRD:
- Recommender: watches live usage via the metrics API and historical samples, and writes target, lowerBound, and upperBound numbers into the VPA object’s status. This component does the math and is safe to run alone.
- Updater: reads recommendations and, in Auto/Recreate mode, evicts pods whose requests are out of bounds so they get rescheduled with new values. It does not patch running pods in place (in-place resize is a separate, newer KEP).
- Admission controller: a mutating webhook that rewrites a pod’s resource requests at creation time to match the recommendation. Without it, evicted pods would just come back with the same old requests.
Install the released manifests:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
# Generates webhook TLS certs and applies recommender + updater + admission controller
./hack/vpa-up.sh
kubectl get pods -n kube-system | grep vpa
# vpa-recommender-... 1/1 Running
# vpa-updater-... 1/1 Running
# vpa-admission-controller-... 1/1 Running
Prerequisite: metrics-server must be healthy; kubectl top pods has to return numbers. The recommender also benefits from a Prometheus history source for cold-start accuracy, but the default in-cluster checkpoint store works out of the box.
3. Run VPA in Off mode first — recommendation only
Never start with Auto. Deploy the VPA object in updateMode: "Off" so the recommender observes and reports, but nothing evicts or mutates your pods. This is pure, zero-risk telemetry.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off" # recommend only; do not touch pods
Let it run across at least one full traffic cycle — a week is sensible so it sees weekday peaks, the weekend trough, and any batch jobs. The recommender keeps a decaying histogram of usage, so longer is strictly better for the first pass.
4. Reading target, lowerBound, upperBound
After it has data, the recommendation lives in status:
kubectl describe vpa checkout -n shop
status:
  recommendation:
    containerRecommendations:
    - containerName: checkout
      lowerBound:
        cpu: 110m
        memory: 262144k
      target:
        cpu: 250m
        memory: 410Mi
      uncappedTarget:
        cpu: 250m
        memory: 410Mi
      upperBound:
        cpu: 1200m
        memory: 980Mi
Read these precisely — they are not min/typical/max of raw usage, they are percentile estimates with safety margin:
- target: what VPA would set the request to right now. This is the number you act on. Internally it tracks roughly the 90th percentile of CPU and the peak of memory, plus a safety margin (about 15% by default).
- lowerBound: the floor below which VPA considers the pod under-provisioned. If your current request is below lowerBound, you are at OOMKill/eviction risk. In Auto mode, dropping below this triggers an eviction to scale up.
- upperBound: the ceiling above which the pod is wastefully over-provisioned. If your current request sits above upperBound, you are burning money. Early in the data window upperBound is huge and shrinks as confidence grows, so do not act on it until it stabilizes.
- uncappedTarget: what target would be ignoring any minAllowed/maxAllowed caps you set. The gap between uncappedTarget and target tells you your cap is binding.
The decision rule for a manual first pass: set your request to target, set memory limit == memory request, drop the CPU limit.
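The reading of lowerBound and upperBound above can be turned into a quick check of a current request against the recommendation. This is a minimal Python sketch: `parse_quantity` and `classify` are hypothetical helpers that handle only the quantity suffixes appearing on this page, and the bounds are the sample values from the status output.

```python
from fractions import Fraction

# Multipliers for the Kubernetes quantity suffixes used on this page.
_SUFFIXES = {
    "Ki": Fraction(2**10), "Mi": Fraction(2**20), "Gi": Fraction(2**30),
    "m": Fraction(1, 1000), "k": Fraction(10**3),
    "M": Fraction(10**6), "G": Fraction(10**9),
}


def parse_quantity(q: str) -> Fraction:
    """Parse '250m', '2', '410Mi' or '262144k' into cores or bytes."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):  # two-letter suffixes first
        if q.endswith(suffix):
            return Fraction(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(q)


def classify(current: str, lower_bound: str, upper_bound: str) -> str:
    """Compare a current request with the VPA bounds for one resource."""
    value = parse_quantity(current)
    if value < parse_quantity(lower_bound):
        return "under-provisioned"
    if value > parse_quantity(upper_bound):
        return "over-provisioned"
    return "within bounds"


# A template-style request of cpu: 1 / memory: 2Gi against the sample bounds.
print(classify("1", "110m", "1200m"))        # within bounds
print(classify("2Gi", "262144k", "980Mi"))   # over-provisioned
print(classify("100m", "110m", "1200m"))     # under-provisioned
```

Being within bounds does not mean the request is right: a CPU request of 1 sits inside [110m, 1200m] here, yet the target is 250m, so target remains the number to act on.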
You can constrain the recommender with a resourcePolicy so it never proposes something absurd — essential for sidecars and JVMs that need a memory floor:
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: checkout
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
    - containerName: istio-proxy
      mode: "Off" # never right-size the sidecar
Tune the target percentile only if you have evidence. The recommender’s defaults (memory target near peak, CPU near p90) are deliberately conservative because under-sizing memory kills pods. Lowering the memory target percentile to save money is how teams reintroduce the OOMKills they just fixed.
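Why a lower memory percentile is dangerous can be shown with a few lines of arithmetic. The Python sketch below is a deliberately simplified stand-in for the recommender: it takes a plain nearest-rank percentile of hypothetical samples and adds a 15% margin, whereas the real component works on decaying histograms. The function names and sample values are assumptions for illustration only.

```python
def nearest_rank_percentile(samples, percentile):
    """Smallest sample such that at least `percentile` % of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, -(-percentile * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def sketch_target(samples, percentile, margin_percent=15):
    """Percentile of usage plus a safety margin (simplified stand-in)."""
    base = nearest_rank_percentile(samples, percentile)
    return base * (100 + margin_percent) / 100


# Hypothetical memory samples in MiB; one observation spikes to 420.
memory_mib = [300, 310, 305, 320, 315, 330, 325, 340, 335, 420]
peak = max(memory_mib)

at_peak = sketch_target(memory_mib, 100)  # sized from the 420 MiB peak
at_p90 = sketch_target(memory_mib, 90)    # sized from the 340 MiB sample

print(at_peak, at_peak >= peak)  # the peak fits under the request
print(at_p90, at_p90 >= peak)    # the peak no longer fits
```

With memory limit == memory request, the p90-based value sits below the observed peak, so the next spike of that size would be an OOMKill.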
5. Update modes — and the hard HPA conflict
VPA supports four updateMode values:
| Mode | Behavior |
|---|---|
| Off | Recommend only. Never mutates pods. |
| Initial | Applies recommendations only at pod creation. No eviction of running pods. |
| Recreate | Evicts and recreates pods whenever requests drift out of [lowerBound, upperBound]. |
| Auto | Currently behaves like Recreate; intended to use in-place resize as it matures. |
Initial is the underrated safe default for production: new pods get right-sized requests, but you never suffer surprise mid-day evictions. You pick up correct values naturally on every rollout.
Now the rule you cannot violate:
Do not run VPA in Auto/Recreate mode and an HPA on the same resource metric for the same workload. If the HPA scales replicas on CPU utilization while VPA simultaneously rewrites the CPU request, they enter a feedback loop: VPA raises the request, which lowers measured utilization, which makes the HPA scale down, and the controllers fight. The official guidance is explicit: VPA must not be used with the HPA on CPU or memory.
This is the most common way teams break themselves with VPA. Memorize it.
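The feedback loop is easy to see in the HPA's documented scaling formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), where CPU utilization is measured as usage relative to the request. The Python sketch below applies that formula with hypothetical numbers; it ignores the HPA's tolerance band and stabilization window, and `hpa_desired_replicas` is an illustrative helper, not an API.

```python
def hpa_desired_replicas(current_replicas, usage_mcpu, request_mcpu, target_util_percent):
    """ceil(current * currentUtilization / targetUtilization).

    Utilization is average per-pod usage as a percentage of the CPU request,
    so changing the request changes the result even if usage is constant.
    """
    numerator = current_replicas * usage_mcpu * 100
    denominator = request_mcpu * target_util_percent
    return -(-numerator // denominator)  # ceiling division on integers


# Same workload, same real usage (200m per pod), 70% utilization target.
before = hpa_desired_replicas(10, usage_mcpu=200, request_mcpu=250, target_util_percent=70)
after = hpa_desired_replicas(10, usage_mcpu=200, request_mcpu=500, target_util_percent=70)

print(before)  # 12: at 80% utilization the HPA wants more replicas
print(after)   # 6: VPA doubled the request, utilization reads 40%, HPA scales in
```

Nothing about the traffic changed between the two calls; only the request did. Fewer replicas then push per-pod usage up, VPA reacts again, and the two controllers chase each other.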
6. Combining VPA and HPA correctly
You can absolutely use both — just keep them on different signals. Let the HPA scale replicas on a custom or external metric (queue depth, requests-per-second, p95 latency) and let VPA own CPU/memory requests. They no longer overlap.
HPA on a custom metric (replicas only — no CPU/memory resource metric here):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "200"
VPA owning requests, scoped to CPU and memory only:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
    - containerName: checkout
      controlledResources: ["cpu", "memory"]
The HPA decides how many pods; VPA decides how big each one is — orthogonal axes, no feedback loop.
7. Right-sizing is half the battle: fix bin-packing too
Correct requests only pay off if the scheduler can pack them densely. Three levers:
Scheduler scoring. The default kube-scheduler NodeResourcesFit plugin uses LeastAllocated scoring, which spreads pods for resilience. For cost-driven node pools, switch to MostAllocated so the scheduler fills nodes before opening new ones — this is what makes the autoscaler able to drain and remove a node:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
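The difference between the two scoring strategies can be sketched in Python. This is a simplified model based on the general shape of the strategies (score each resource on a 0 to 100 scale, then take the weighted average); it is not the scheduler's implementation, and the node names and numbers are hypothetical.

```python
MAX_NODE_SCORE = 100


def least_allocated(requested, allocatable):
    """Higher score for emptier nodes: spreads pods."""
    return (allocatable - requested) * MAX_NODE_SCORE // allocatable


def most_allocated(requested, allocatable):
    """Higher score for fuller nodes: packs pods."""
    return requested * MAX_NODE_SCORE // allocatable


def node_score(strategy, requested, allocatable, weights):
    """Weighted average of per-resource scores.

    `requested` is the node's total requests including the incoming pod.
    """
    total = sum(weights[r] * strategy(requested[r], allocatable[r]) for r in weights)
    return total // sum(weights.values())


def pick_node(strategy, nodes, allocatable, weights):
    """Return the name of the highest-scoring node."""
    return max(nodes, key=lambda name: node_score(strategy, nodes[name], allocatable, weights))


allocatable = {"cpu": 4000, "memory": 16384}  # millicores, MiB
weights = {"cpu": 1, "memory": 1}             # same weights as the config above
nodes = {
    "busy": {"cpu": 3000, "memory": 12288},   # 75% requested after placement
    "idle": {"cpu": 500, "memory": 2048},     # about 12% requested after placement
}

print(pick_node(least_allocated, nodes, allocatable, weights))  # idle
print(pick_node(most_allocated, nodes, allocatable, weights))   # busy
```

Under MostAllocated the pod lands on the already-busy node, leaving the idle node empty enough for the autoscaler to remove.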
Node sizing. Bin-packing is a geometry problem. If your largest pod requests 6 GiB and your nodes are 8 GiB, you waste the remainder on every node. Match node shape to the request distribution you measured in step 4. On Karpenter, let it choose instance types from the actual pending-pod requirements rather than pinning one family:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values: ["4", "8", "16"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
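The node-sizing geometry described above (a 6 GiB pod on an 8 GiB node) can be checked with simple arithmetic. This Python sketch assumes identical pods, considers memory only, and treats the nominal node size as allocatable, which overstates what a real node offers after system reservations; `packing` is a hypothetical helper.

```python
def packing(node_allocatable_gib, pod_request_gib):
    """Return (pods per node, stranded GiB, stranded percent) for identical pods."""
    pods = node_allocatable_gib // pod_request_gib
    stranded = node_allocatable_gib - pods * pod_request_gib
    return pods, stranded, stranded * 100 // node_allocatable_gib


# How a 6 GiB request packs onto different node sizes.
for node_gib in (8, 12, 16, 32):
    pods, stranded, pct = packing(node_gib, 6)
    print(f"{node_gib} GiB node: {pods} pod(s), {stranded} GiB stranded ({pct}%)")
```

Doubling the node from 8 GiB to 16 GiB strands the same 25%, while a 12 GiB shape strands nothing. That is why letting the provisioner choose among several shapes tends to beat pinning one instance family.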
Consolidation. Karpenter’s WhenEmptyOrUnderutilized policy actively recomputes whether the current pods would fit on fewer or cheaper nodes and replaces them when they would. Right-sized requests are the input that makes consolidation aggressive — shrink the requests and Karpenter discovers it can delete nodes. Cluster Autoscaler offers a weaker version via --scale-down-utilization-threshold.
Verify
Prove the change end to end rather than trusting the dashboard.
# 1. Recommendations exist and have stabilized (upperBound no longer huge)
kubectl describe vpa checkout -n shop | sed -n '/Recommendation/,/Events/p'
# 2. Pods actually picked up the new requests after a rollout
kubectl get pods -n shop -l app=checkout \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory'
# 3. QoS class is what you expect (Guaranteed requires CPU and memory requests == limits; otherwise Burstable)
kubectl get pod -n shop -l app=checkout \
-o jsonpath='{.items[*].status.qosClass}{"\n"}'
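# 4. No new OOMKills since the change: no container's last termination reason should be OOMKilled
kubectl get pods -n shop -l app=checkout \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'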
# 5. Real allocation vs capacity per node (the bin-packing payoff)
kubectl describe nodes | grep -A6 "Allocated resources"
A successful right-sizing shows: requests near target, the QoS class you intended (Guaranteed only if CPU requests also equal limits; otherwise Burstable with memory request == limit), no fresh OOMKills, and node “Allocated resources” climbing toward 70-80% of allocatable while node count drops.
Enterprise scenario
A payments platform team ran ~140 microservices on EKS, every Deployment copied from one Helm template with requests.cpu: 1 and requests.memory: 2Gi. Cluster cost was roughly 38,000 USD/month at 22% average CPU utilization — and despite that slack, three JVM services OOMKilled nightly because 2Gi was below their actual heap-plus-metaspace peak. Classic dual failure: massively over-provisioned on aggregate, under-provisioned where it mattered.
The constraint: those same services already ran HPAs on CPU utilization, so they could not simply flip VPA to Auto — that would have pitted the two controllers against each other on the CPU metric.
The fix, staged over three weeks:
- Deployed VPA in updateMode: "Off" fleet-wide for one week to collect recommendations with zero production risk.
- Re-platformed the HPAs off CPU onto a Prometheus custom metric (in-flight requests per pod via the Prometheus Adapter), freeing CPU/memory for VPA to own.
- Switched VPA to updateMode: "Initial" so requests right-sized on each rollout without surprise evictions, with a minAllowed.memory floor on the JVM services so the recommender never proposed below their measured heap peak.
- Set the cost node pool’s scheduler profile to MostAllocated and enabled Karpenter WhenEmptyOrUnderutilized consolidation.
The result over the next billing cycle: average CPU utilization rose from 22% to 61%, node count fell by 44%, monthly spend dropped from ~38,000 to ~21,000 USD, and the nightly OOMKills went to zero because the JVM services finally got Guaranteed memory at their real footprint. The custom-metric HPA snippet that unblocked everything:
metrics:
- type: Pods
  pods:
    metric:
      name: http_inflight_requests
    target:
      type: AverageValue
      averageValue: "50"
The non-obvious lesson: the savings did not come from VPA alone. VPA produced correct requests, but the money only materialized once MostAllocated scheduling plus Karpenter consolidation could act on those smaller requests and physically delete nodes. Right-sizing without consolidation just leaves the freed capacity stranded.