Right-Sizing Kubernetes Workloads: Vertical Pod Autoscaler, Resource Recommendations, and Bin-Packing Efficiency

Most Kubernetes clusters are simultaneously over-provisioned and unreliable: aggregate node utilization sits at 25-35% on a typical billing dashboard, yet pods still get OOMKilled and throttled. Both symptoms have the same root cause — requests and limits set by copy-paste, never by measurement. This guide fixes that with the Vertical Pod Autoscaler (VPA): how to gather recommendations safely, how to read the three numbers it produces, where it conflicts with the HPA, and how right-sizing feeds directly into better bin-packing and a smaller bill.

1. Requests vs limits, QoS, and the two failure modes

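Before touching VPA, internalize what the scheduler and kubelet actually do with these numbers, because VPA only ever recommends one of them: requests. When it applies a recommendation it scales limits proportionally by default, preserving your request-to-limit ratio; set controlledValues: RequestsOnly in the container policy if you want limits left alone.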

Requests are a scheduling contract. kube-scheduler sums pod requests against each node’s allocatable capacity and places the pod where it fits. Requests reserve capacity whether or not the pod uses it.
Limits are an enforcement ceiling, applied by the kernel via cgroups. CPU over a limit is throttled (CFS throttling — the pod is slowed, not killed). Memory over a limit is OOMKilled — the kernel terminates the container, the kubelet restarts it, and you see a CrashLoopBackOff if it repeats.

That asymmetry produces the two failure modes:

Mis-set value	Consequence	Who pays
Requests too high	Capacity reserved but idle; nodes fill on paper at 30% real use	The bill
Requests too low	Pods crammed onto nodes, then evicted under node pressure	Reliability
Memory limit too low	OOMKill, restart, CrashLoopBackOff	Reliability
CPU limit too low	Silent CFS throttling, latency spikes	Latency SLOs
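The "silent CFS throttling" row is the hardest of the four to notice, because nothing restarts. The kernel does count it, though, in the container cgroup's cpu.stat file. The sketch below assumes you have already captured that file's text (on a cgroup v2 node it is typically readable inside the container at /sys/fs/cgroup/cpu.stat, which is an assumption about your node setup); throttle_pct is a hypothetical helper name, and the counter values in the example are made up.

```shell
# throttle_pct: read cgroup cpu.stat text on stdin and print the share of
# CFS enforcement periods in which the container was throttled, in percent.
# Only the nr_periods and nr_throttled counters are used.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }'
}

# Example with hypothetical counter values: throttled in 1000 of 4000 periods.
throttle_pct <<'EOF'
usage_usec 912345678
nr_periods 4000
nr_throttled 1000
throttled_usec 53000000
EOF
```

A steadily rising percentage on a latency-sensitive container is a hint that its CPU limit, not its request, is the binding constraint.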

QoS class is derived from these values and decides eviction order when a node runs out of memory:

QoS class	Condition	Eviction priority
`Guaranteed`	requests == limits for every container, CPU and memory	Evicted last
`Burstable`	at least one request set, but not Guaranteed	Middle
`BestEffort`	no requests or limits at all	Evicted first

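The single most important rule on this page: set memory requests == memory limits for anything you care about. It removes the burst headroom that lures you into OOMKills, keeps the pod from ever using more memory than it requested (usage above requests is what the kubelet looks at first when it evicts under memory pressure), and makes scheduling deterministic. For CPU, leave the limit off or set it generously: CPU is compressible, and a low CPU limit throttles you for no capacity benefit. Note the trade-off: the Guaranteed class requires requests == limits for CPU as well, so a pod with no CPU limit is classed Burstable even when its memory is pinned. VPA’s job is to find the right request number; you decide the request/limit relationship.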
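To make the QoS table concrete, here is a minimal sketch of the classification for a single-container pod. qos_class is a hypothetical helper, not a kubectl feature; it compares quantities as plain strings, so both sides must be written in the same unit, and it ignores multi-container pods, where every container has to satisfy the condition.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM
# Pass "" for a value that is not set. Single-container simplification.
qos_class() (
  cpu_req=$1 cpu_lim=$2 mem_req=$3 mem_lim=$4
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    # No requests or limits at all.
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "${cpu_req:-$cpu_lim}" = "$cpu_lim" ] &&
       [ "${mem_req:-$mem_lim}" = "$mem_lim" ]; then
    # Both resources have limits and requests equal to them
    # (an unset request defaults to the limit).
    echo Guaranteed
  else
    echo Burstable
  fi
)

qos_class 250m 250m 410Mi 410Mi   # requests == limits for CPU and memory
qos_class 250m ""   410Mi 410Mi   # memory pinned, CPU limit dropped
qos_class ""   ""   ""    ""      # nothing set
```

The middle case is the configuration this guide recommends, and it is classed Burstable: the class is computed over both resources, not per resource.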

2. VPA architecture: recommender, updater, admission controller

VPA is not built into Kubernetes. You install it from the autoscaler repo, and it ships as three independent components plus a CRD:

Recommender — watches live usage via the metrics API and historical samples, and writes target, lowerBound, and upperBound numbers into the VPA object’s status. This component does the math and is safe to run alone.
Updater — reads recommendations and, in Auto/Recreate mode, evicts pods whose requests are out of bounds so they get rescheduled with new values. It does not patch running pods in place (in-place resize is a separate, newer KEP).
Admission controller — a mutating webhook that rewrites a pod’s resource requests at creation time to match the recommendation. Without it, evicted pods would just come back with the same old requests.

Install the released manifests:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
# Generates webhook TLS certs and applies recommender + updater + admission controller
./hack/vpa-up.sh

kubectl get pods -n kube-system | grep vpa
# vpa-recommender-...           1/1   Running
# vpa-updater-...               1/1   Running
# vpa-admission-controller-...  1/1   Running

Prerequisite: metrics-server must be healthy — kubectl top pods has to return numbers. The recommender also benefits from a Prometheus history source for cold-start accuracy, but the default in-cluster checkpoint store works out of the box.

3. Run VPA in Off mode first — recommendation only

Never start with Auto. Deploy the VPA object in updateMode: "Off" so the recommender observes and reports, but nothing evicts or mutates your pods. This is pure, zero-risk telemetry.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"          # recommend only; do not touch pods

Let it run across at least one full traffic cycle. A week is sensible so it sees weekday peaks, the weekend trough, and any batch jobs. The recommender keeps a decaying histogram of usage in which older samples count for less, so a longer first pass helps, with diminishing returns once the oldest samples have mostly faded.
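To get a feel for what "decaying" means here, the sketch below computes the relative weight of a usage sample by age under plain exponential decay. The 24-hour half-life is an assumption based on the recommender's commonly documented default; check the flags of the version you run. sample_weight is a hypothetical helper.

```shell
# sample_weight AGE_HOURS [HALF_LIFE_HOURS]
# Relative weight of a sample that is AGE_HOURS old, assuming exponential
# decay: weight = 0.5 ^ (age / half_life). Default half-life is an assumption.
sample_weight() {
  awk -v age="$1" -v hl="${2:-24}" 'BEGIN { printf "%.3f\n", 0.5 ^ (age / hl) }'
}

for days in 0 1 3 7; do
  echo "$days day(s) old: weight $(sample_weight $((days * 24)))"
done
```

Under that assumption a week-old sample still contributes, but only marginally, which is why a one-week window is a reasonable first pass rather than a month.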

4. Reading target, lowerBound, upperBound

After it has data, the recommendation lives in status:

kubectl describe vpa checkout -n shop

status:
  recommendation:
    containerRecommendations:
    - containerName: checkout
      lowerBound:
        cpu: 110m
        memory: 262144k
      target:
        cpu: 250m
        memory: 410Mi
      uncappedTarget:
        cpu: 250m
        memory: 410Mi
      upperBound:
        cpu: 1200m
        memory: 980Mi

Read these precisely — they are not min/typical/max of raw usage, they are percentile estimates with safety margin:

target: what VPA would set the request to right now. This is the number you act on. With default settings it tracks roughly the 90th percentile of CPU usage and a high percentile of the memory peaks it has observed, plus a safety margin (about 15% by default).
lowerBound: the floor below which VPA considers the pod under-provisioned. If your current request is below lowerBound, you are at OOMKill/eviction risk. In Auto or Recreate mode, a request below this triggers an eviction so the pod comes back with a larger request.
upperBound: the ceiling above which the pod is wastefully over-provisioned. If your current request sits above upperBound, you are burning money. Early in the data window upperBound is huge and shrinks as confidence grows; do not act on it until it stabilizes.
uncappedTarget: what target would be if your minAllowed/maxAllowed caps were ignored. A gap between uncappedTarget and target means a cap is binding.
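The comparison described above (current request against lowerBound and upperBound) can be sketched for CPU as follows. Both helpers are hypothetical names; to_millicores only understands the two forms used on this page (250m and whole or fractional cores such as 2).

```shell
# to_millicores QTY: convert a CPU quantity such as 250m or 2 to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;
  esac
}

# classify_cpu_request CURRENT LOWER_BOUND UPPER_BOUND
classify_cpu_request() (
  cur=$(to_millicores "$1")
  lo=$(to_millicores "$2")
  hi=$(to_millicores "$3")
  if   [ "$cur" -lt "$lo" ]; then echo under-provisioned
  elif [ "$cur" -gt "$hi" ]; then echo over-provisioned
  else                            echo within-bounds
  fi
)

# Against the sample status above (lowerBound 110m, upperBound 1200m):
classify_cpu_request 100m 110m 1200m
classify_cpu_request 2    110m 1200m
classify_cpu_request 250m 110m 1200m
```

"within-bounds" only means VPA would not evict for it; the value to aim for is still target.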

The decision rule for a manual first pass: set your request to target, set memory limit == memory request, drop the CPU limit.

You can constrain the recommender with a resourcePolicy so it never proposes something absurd — essential for sidecars and JVMs that need a memory floor:

spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: checkout
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
    - containerName: istio-proxy
      mode: "Off"               # never right-size the sidecar

Tune the target percentile only if you have evidence. The recommender’s defaults (memory target near peak, CPU near p90) are deliberately conservative because under-sizing memory kills pods. Lowering the memory target percentile to save money is how teams reintroduce the OOMKills they just fixed.
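Before changing any percentile, it helps to see how much of the recommendation is margin rather than measurement. The sketch below applies only the safety margin mentioned earlier (about 15% by default) to a measured percentile; it is a simplification that ignores the recommender's other post-processing, and both the helper name and the 217m input are hypothetical.

```shell
# with_margin VALUE [MARGIN_FRACTION]
# Percentile value plus safety margin, rounded to the nearest unit.
with_margin() {
  awk -v v="$1" -v m="${2:-0.15}" 'BEGIN { printf "%.0f\n", v * (1 + m) }'
}

with_margin 217        # a hypothetical CPU p90 of 217m
with_margin 217 0.05   # the same measurement with a thinner margin
```

If the gap between what you measure and what VPA recommends is roughly the margin, the recommender is behaving as designed and there is little to gain from tuning.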

5. Update modes — and the hard HPA conflict

VPA supports four updateMode values:

Mode	Behavior
`Off`	Recommend only. Never mutates pods.
`Initial`	Applies recommendations only at pod creation. No eviction of running pods.
`Recreate`	Evicts and recreates pods whenever requests drift out of `[lowerBound, upperBound]`.
`Auto`	Currently behaves like `Recreate`; intended to use in-place resize as it matures.

Initial is the underrated safe default for production: new pods get right-sized requests, but you never suffer surprise mid-day evictions. You pick up correct values naturally on every rollout.
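Moving an existing VPA object between modes is a one-field change. The sketch below builds the JSON merge patch for it and rejects anything that is not one of the four modes in the table; vpa_mode_patch is a hypothetical helper, and the kubectl line in the comment assumes the checkout object from the earlier examples.

```shell
# vpa_mode_patch MODE
# Print a JSON merge patch that sets spec.updatePolicy.updateMode,
# or fail if MODE is not one of the four values listed above.
vpa_mode_patch() {
  case "$1" in
    Off|Initial|Recreate|Auto)
      printf '{"spec":{"updatePolicy":{"updateMode":"%s"}}}\n' "$1" ;;
    *)
      echo "unknown updateMode: $1" >&2
      return 1 ;;
  esac
}

vpa_mode_patch Initial

# Usage against a cluster (custom resources need --type merge):
#   kubectl patch vpa checkout -n shop --type merge -p "$(vpa_mode_patch Initial)"
```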

Now the rule you cannot violate:

Do not run VPA in Auto/Recreate mode and an HPA on the same resource metric for the same workload. If the HPA scales replicas on CPU utilization while VPA simultaneously rewrites the CPU request, they enter a feedback loop — VPA raises the request, which lowers measured utilization, which makes the HPA scale down, and the controllers fight. The official guidance is explicit: VPA must not be used with the HPA on CPU or memory.

This is the most common way teams break themselves with VPA. Memorize it.
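The feedback loop is easier to believe with numbers. The sketch below applies the HPA's documented scaling rule, desired = ceil(current replicas × current utilization / target utilization), where utilization is usage divided by request. It ignores the HPA's tolerance band and stabilization window, and the replica counts and usage figures are hypothetical.

```shell
# hpa_desired_replicas CURRENT_REPLICAS USAGE_M REQUEST_M TARGET_UTIL_PCT
# USAGE_M and REQUEST_M are per-pod CPU in millicores.
hpa_desired_replicas() {
  awk -v n="$1" -v use="$2" -v req="$3" -v tgt="$4" 'BEGIN {
    util = 100 * use / req          # utilization as a percentage of the request
    d = n * util / tgt
    r = int(d); if (r < d) r++      # ceil
    print r
  }'
}

# Same 200m of real usage per pod, HPA target 80%, 10 replicas running:
hpa_desired_replicas 10 200 250 80   # request 250m: utilization 80%
hpa_desired_replicas 10 200 500 80   # VPA doubled the request: utilization 40%
```

Nothing about the workload changed between the two calls; only the request did, and the HPA halves the replica count. The surviving pods then each carry more load, which pushes usage up and invites the next VPA increase.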

6. Combining VPA and HPA correctly

You can absolutely use both — just keep them on different signals. Let the HPA scale replicas on a custom or external metric (queue depth, requests-per-second, p95 latency) and let VPA own CPU/memory requests. They no longer overlap.

HPA on a custom metric (replicas only — no CPU/memory resource metric here):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"

VPA owning requests, scoped to CPU and memory only:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
    - containerName: checkout
      controlledResources: ["cpu", "memory"]

The HPA decides how many pods; VPA decides how big each one is — orthogonal axes, no feedback loop.
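A quick guard against reintroducing the conflict is to check which metric types an HPA uses before enabling VPA on the same workload. The helper below is a hypothetical sketch: it takes the space-separated list of metric types and reports a conflict if any of them is a CPU/memory resource metric.

```shell
# hpa_types_conflict "TYPE TYPE ..."
# Exit 0 if any metric type is Resource or ContainerResource, meaning the
# HPA scales on CPU/memory and would overlap with VPA; exit 1 otherwise.
hpa_types_conflict() {
  for t in $1; do
    case "$t" in
      Resource|ContainerResource) return 0 ;;
    esac
  done
  return 1
}

if hpa_types_conflict "Pods External"; then echo conflict; else echo safe; fi
if hpa_types_conflict "Resource Pods"; then echo conflict; else echo safe; fi

# Usage against a cluster:
#   types=$(kubectl get hpa checkout -n shop -o jsonpath='{.spec.metrics[*].type}')
#   if hpa_types_conflict "$types"; then echo "HPA overlaps with VPA"; fi
```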

7. Right-sizing is half the battle: fix bin-packing too

Correct requests only pay off if the scheduler can pack them densely. Three levers:

Scheduler scoring. The default kube-scheduler NodeResourcesFit plugin uses LeastAllocated scoring, which spreads pods for resilience. For cost-driven node pools, switch to MostAllocated so the scheduler fills nodes before opening new ones. Packing the busy nodes tightly leaves the others nearly empty, which is what lets the autoscaler drain and remove them:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
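The difference between the two strategies comes down to which node gets the higher score. The sketch below is a single-resource simplification of the scoring on a 0-100 scale (the real plugin combines CPU and memory using the weights in the config above); node_score is a hypothetical helper and the node figures are made up.

```shell
# node_score STRATEGY REQUESTED CAPACITY
# Score for one resource, with REQUESTED including the pod being placed:
#   LeastAllocated: (capacity - requested) * 100 / capacity
#   MostAllocated:  requested * 100 / capacity
node_score() {
  awk -v s="$1" -v r="$2" -v c="$3" 'BEGIN {
    if (s == "MostAllocated") printf "%d\n", r * 100 / c
    else                      printf "%d\n", (c - r) * 100 / c
  }'
}

# Two 4000m nodes: node A would hold 3000m after placement, node B 1000m.
echo "MostAllocated:  A=$(node_score MostAllocated 3000 4000) B=$(node_score MostAllocated 1000 4000)"
echo "LeastAllocated: A=$(node_score LeastAllocated 3000 4000) B=$(node_score LeastAllocated 1000 4000)"
```

MostAllocated prefers the already busy node A, so node B stays light enough to be drained; LeastAllocated prefers B and keeps both nodes half used.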

Node sizing. Bin-packing is a geometry problem. If your largest pod requests 6 GiB and your nodes are 8 GiB, you waste the remainder on every node. Match node shape to the request distribution you measured in step 4. On Karpenter, let it choose instance types from the actual pending-pod requirements rather than pinning one family:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
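spec:
  template:
    spec:
      # nodeClassRef is required in karpenter.sh/v1; "default" is a placeholder name
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements: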
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["4", "8", "16"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
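The node-sizing geometry described above can be checked with a few lines of arithmetic. The sketch assumes identical pods and treats the node size as fully allocatable, which real nodes are not (system reservations take a slice); packing is a hypothetical helper.

```shell
# packing NODE_ALLOCATABLE_MIB POD_REQUEST_MIB
# How many identical pods fit on one node, and how much memory is stranded.
packing() {
  awk -v node="$1" -v pod="$2" 'BEGIN {
    n = int(node / pod)
    left = node - n * pod
    printf "pods=%d stranded_mib=%d stranded_pct=%.0f\n", n, left, 100 * left / node
  }'
}

packing 8192 6144     # 6 GiB pods on 8 GiB nodes
packing 24576 6144    # the same pods on 24 GiB nodes
```

The first shape strands a quarter of every node; the second packs four pods with nothing left over. That is the kind of comparison a flexible instance-type requirement lets the provisioner make for you.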

Consolidation. Karpenter’s WhenEmptyOrUnderutilized policy actively recomputes whether the current pods would fit on fewer or cheaper nodes and replaces them when they would. Right-sized requests are the input that makes consolidation aggressive — shrink the requests and Karpenter discovers it can delete nodes. Cluster Autoscaler offers a weaker version via --scale-down-utilization-threshold.
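For the Cluster Autoscaler flag mentioned above, the test is a simple ratio of requested to allocatable resources compared with the threshold. The sketch below is a single-resource simplification with the threshold defaulting to 0.5, which is the commonly documented default and should be treated as an assumption for your version; scale_down_candidate is a hypothetical helper.

```shell
# scale_down_candidate REQUESTED ALLOCATABLE [THRESHOLD]
# Exit 0 if requested/allocatable is below THRESHOLD, so the node would be
# considered for scale-down; exit 1 otherwise.
scale_down_candidate() {
  awk -v r="$1" -v a="$2" -v t="${3:-0.5}" 'BEGIN { exit !(r / a < t) }'
}

# A 4000m node before and after right-sizing the pods on it:
if scale_down_candidate 3000 4000; then echo "before: candidate"; else echo "before: kept"; fi
if scale_down_candidate 1200 4000; then echo "after: candidate";  else echo "after: kept";  fi
```

Note that the ratio is computed from requests, not from real usage, which is why inflated requests keep half-idle nodes alive.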

Verify

Prove the change end to end rather than trusting the dashboard.

# 1. Recommendations exist and have stabilized (upperBound no longer huge)
kubectl describe vpa checkout -n shop | sed -n '/Recommendation/,/Events/p'

# 2. Pods actually picked up the new requests after a rollout
#    (the column spec is quoted so the shell does not treat [0] as a glob)
kubectl get pods -n shop -l app=checkout \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory'

# 3. QoS class per pod (Guaranteed requires CPU requests == limits too; Burstable is expected if you dropped the CPU limit)
kubectl get pod -n shop -l app=checkout \
  -o jsonpath='{.items[*].status.qosClass}{"\n"}'

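# 4. No new OOMKills since the change: print each pod's last container termination reason
#    (kubelet OOMKilling events are recorded against the Node, so a namespaced event query can miss them)
kubectl get pods -n shop -l app=checkout \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'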

# 5. Real allocation vs capacity per node (the bin-packing payoff)
kubectl describe nodes | grep -A6 "Allocated resources"

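A successful right-sizing shows: requests near target, memory requests equal to memory limits (qosClass reads Guaranteed only if CPU requests equal CPU limits as well; otherwise Burstable is expected), no fresh OOMKills, and node “Allocated resources” climbing toward 70-80% of allocatable while node count drops.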

Enterprise scenario

A payments platform team ran ~140 microservices on EKS, every Deployment copied from one Helm template with requests.cpu: 1 and requests.memory: 2Gi. Cluster cost was roughly 38,000 USD/month at 22% average CPU utilization — and despite that slack, three JVM services OOMKilled nightly because 2Gi was below their actual heap-plus-metaspace peak. Classic dual failure: massively over-provisioned on aggregate, under-provisioned where it mattered.

The constraint: those same services already ran HPAs on CPU utilization, so they could not simply flip VPA to Auto — that would have pitted the two controllers against each other on the CPU metric.

The fix, staged over three weeks:

Deployed VPA in updateMode: "Off" fleet-wide for one week to collect recommendations with zero production risk.
Re-platformed the HPAs off CPU onto a Prometheus custom metric (in-flight requests per pod via the Prometheus Adapter), freeing CPU/memory for VPA to own.
Switched VPA to updateMode: "Initial" so requests right-sized on each rollout without surprise evictions, with a minAllowed.memory floor on the JVM services so the recommender never proposed below their measured heap peak.
Set the cost node pool’s scheduler profile to MostAllocated and enabled Karpenter WhenEmptyOrUnderutilized consolidation.

The result over the next billing cycle: average CPU utilization rose from 22% to 61%, node count fell by 44%, monthly spend dropped from ~38,000 to ~21,000 USD, and the nightly OOMKills went to zero because the JVM services finally had memory requests and limits pinned at their real footprint. The custom-metric HPA snippet that unblocked everything:

metrics:
  - type: Pods
    pods:
      metric:
        name: http_inflight_requests
      target:
        type: AverageValue
        averageValue: "50"

The non-obvious lesson: the savings did not come from VPA alone. VPA produced correct requests, but the money only materialized once MostAllocated scheduling plus Karpenter consolidation could act on those smaller requests and physically delete nodes. Right-sizing without consolidation just leaves the freed capacity stranded.

Written by Vinod