Kubernetes Pod Autoscaling, In Depth: the HPA Algorithm, Metrics & VPA

Most engineers can write a Horizontal Pod Autoscaler manifest. Far fewer can explain why it chose seven replicas instead of six, why it scaled up in fifteen seconds but refused to scale down for five minutes, or what the <unknown> in the TARGETS column actually means. This lesson is about that machinery: the exact arithmetic the HPA runs on every loop, the five metric types and how each is computed, the metrics pipeline that feeds those numbers in, the behavior block that governs how fast scaling happens, and the Vertical Pod Autoscaler, with its three components, four modes, and the one rule that stops it fighting the HPA.

This is the fundamentals-and-algorithm companion to two adjacent lessons. The HPA, KEDA & node autoscaling guide wires up the full practical stack — custom-metric adapters, event-driven KEDA scalers, Cluster Autoscaler vs Karpenter. The VPA right-sizing guide is about reading recommendations and improving bin-packing. This lesson deliberately stays underneath both of those: how the controllers compute their decisions in the first place. If you want to build an event-driven pipeline, start there; if you want to understand and debug what an autoscaler is doing, start here.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You should be comfortable with Deployments and ReplicaSets, and with container requests and limits (the VPA right-sizing lesson covers requests/limits, QoS classes and the two failure modes in full — this lesson assumes them). You will need a cluster with metrics-server installed; kind or minikube is perfect for the lab. Everything here targets Kubernetes v1.30+ and the stable autoscaling/v2 API. In the Zero-to-Hero programme this sits in the Workloads module, after Deployments and before scheduling and security hardening.

Core concepts: two axes of scaling, one shared input

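There are two independent ways to give a workload more capacity, and they operate on perpendicular axes: horizontally, the HPA changes how many replicas run; vertically, the VPA changes the CPU and memory requests of each pod.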

A third controller, the Cluster Autoscaler (or Karpenter), operates a level below both: it adds nodes when Pods cannot be scheduled. It is not pod autoscaling at all — it reacts to Pending pods — but it is the third corner of the decision triangle, so we compare against it at the end.

| | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| Changes | replica count | per-pod requests | node count |
| Reacts to | live metric vs target | historical usage | unschedulable (Pending) pods |
| Disruptive? | no (adds/removes pods) | yes (evicts to resize)* | yes (drains nodes) |
| Built in? | yes (autoscaling/v2) | no (separate install) | no (separate install / managed) |

*In-place pod resize reached beta in Kubernetes v1.33 and will eventually let the VPA resize without eviction; today the standard updater evicts.

The single most important shared fact: all utilisation-based autoscaling keys off the Pod’s resource request, not its limit and not the node capacity. Get requests wrong and every percentage the HPA computes is wrong. That is why the VPA (which finds the right request) and the HPA (which scales on a percentage of it) are so entangled — and why their conflict, covered later, is the classic interview trap.

The HPA control loop

The HPA is a closed control loop run by the kube-controller-manager. On a fixed interval — the --horizontal-pod-autoscaler-sync-period, 15 seconds by default — it does the following for every HPA object:

  1. Fetch the current metric value(s) from the appropriate metrics API.
  2. Compute a desired replica count for each metric using the scaling formula.
  3. Take the maximum desired count across all metrics (scale to satisfy the most demanding one).
  4. Apply behavior constraints (policies, stabilisation) to that proposal.
  5. Clamp the result to [minReplicas, maxReplicas].
  6. If the final number differs from the current replica count, patch the target’s /scale subresource.

Two consequences fall straight out of this. First, the HPA never talks to your Pods directly — it patches the Deployment’s replica count and lets the normal ReplicaSet controller create the Pods. That is why the HPA can target anything with a /scale subresource: Deployments, ReplicaSets, StatefulSets, and many CRDs. Second, because step 3 takes the maximum, adding a metric can only ever make the HPA scale up more, never less — each metric independently argues for a floor on the replica count.

The scaling algorithm and formula

Here is the heart of the whole topic. For a single metric the HPA computes:

desiredReplicas = ceil( currentReplicas × ( currentMetricValue / desiredMetricValue ) )

The ratio currentMetricValue / desiredMetricValue is the usage ratio. If you are at twice your target, the ratio is 2.0 and the HPA wants twice the replicas. The ceil() (round up) means the HPA always errs toward more capacity — it never rounds away a fractional pod you might need.

A worked example. You run 4 replicas, the HPA targets 50% CPU utilisation, and the current average utilisation is 90%:

desiredReplicas = ceil( 4 × (90 / 50) ) = ceil( 4 × 1.8 ) = ceil(7.2) = 8

The HPA scales to 8. Now utilisation should fall: the same total CPU work spread over 8 pods instead of 4 lands near 45%, just under target, and the loop stabilises. Work the reverse: at 4 replicas and 20% utilisation against a 50% target, ceil(4 × 0.4) = ceil(1.6) = 2, so it scales down to 2.
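The arithmetic above is easy to check in a few lines. What follows is a minimal Python sketch of the single-metric formula only; the function name desired_replicas is illustrative and not part of any Kubernetes API, and tolerance, behaviour policies and min/max clamping are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    # Multiply before dividing so simple inputs stay exact in floating point.
    return math.ceil(current_replicas * current_value / target_value)


# The two worked examples from the text: 4 replicas against a 50% target.
print(desired_replicas(4, 90, 50))  # 8  (usage ratio 1.8, ceil(7.2))
print(desired_replicas(4, 20, 50))  # 2  (usage ratio 0.4, ceil(1.6))
```

Because of the ceil(), any fractional result is rounded up, which is why 7.2 becomes 8 rather than 7.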

Three refinements make this production-accurate rather than textbook:

Tolerance. The HPA does not act on every tiny deviation. If the usage ratio is within a tolerance of 1.0 (default 0.1, i.e. ±10%), it does nothing, so a ratio between 0.9 and 1.1 is treated as “on target”. This is the first line of defence against flapping: minor metric noise never moves the replica count. (The cluster-wide default comes from the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance. Since v1.33 the tolerance can also be set per HPA under spec.behavior.scaleUp.tolerance and spec.behavior.scaleDown.tolerance; that field was introduced as an alpha feature behind the HPAConfigurableTolerance feature gate.)
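To make the dead band concrete, here is a small Python sketch that wraps the formula in the tolerance check. It is an illustration under the stated default of 0.1, not the controller's actual code, and the function name is made up for this example.

```python
import math


def desired_with_tolerance(current_replicas, current_value, target_value, tolerance=0.1):
    """Return the current count when the usage ratio is within the tolerance band."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # "on target": do nothing
    return math.ceil(current_replicas * current_value / target_value)


print(desired_with_tolerance(4, 54, 50))   # 4  (ratio 1.08 is inside the band)
print(desired_with_tolerance(4, 56, 50))   # 5  (ratio 1.12, ceil(4.48))
print(desired_with_tolerance(10, 44, 50))  # 9  (ratio 0.88, ceil(8.8))
```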

Not-ready and missing pods. When computing an average, the HPA is careful about which pods to count. Pods that are being deleted or have failed are ignored. Pods that are not yet Ready, or are still in their CPU initialisation period, are set aside, as are pods with no metric sample. The ratio is first computed from the remaining pods, and the set-aside pods are then folded back in pessimistically: pods with missing metrics are assumed to consume 0% when the result would be a scale-up and 100% when it would be a scale-down, and not-yet-ready pods are assumed to consume 0% on a scale-up. The effect is to dampen the size of any change: the start-up CPU spike of brand-new pods cannot trigger a further scale-up, and a pod that has simply not reported yet cannot justify removing capacity.
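The pessimistic recomputation can be sketched as follows. This is a simplified Python model, not the controller's implementation: it assumes every pod has the same CPU request (so average utilisation is the mean of the per-pod percentages), and the function and parameter names are hypothetical.

```python
import math


def conservative_desired(current_replicas, ready_utils, n_unready, n_missing,
                         target, tolerance=0.1):
    """ready_utils: utilisation % of Ready pods that reported a metric."""
    ratio = (sum(ready_utils) / len(ready_utils)) / target
    scale_up_with_unready = n_unready > 0 and ratio > 1.0
    if not scale_up_with_unready and n_missing == 0:
        if abs(ratio - 1.0) <= tolerance:
            return current_replicas
        return math.ceil(ratio * len(ready_utils))

    utils = list(ready_utils)
    if ratio < 1.0:
        utils += [100.0] * n_missing   # scale-down: assume missing pods are busy
    elif ratio > 1.0:
        utils += [0.0] * n_missing     # scale-up: assume missing pods are idle
    if scale_up_with_unready:
        utils += [0.0] * n_unready     # not-yet-ready pods count as idle

    new_ratio = (sum(utils) / len(utils)) / target
    flipped = (ratio < 1.0 < new_ratio) or (ratio > 1.0 > new_ratio)
    if abs(new_ratio - 1.0) <= tolerance or flipped:
        return current_replicas
    proposal = math.ceil(new_ratio * len(utils))
    # Never let the dampened figure move against its own direction.
    if (new_ratio < 1.0 and proposal > current_replicas) or \
       (new_ratio > 1.0 and proposal < current_replicas):
        return current_replicas
    return proposal


# 3 Ready pods at 100%, 1 pod still starting, target 50%: 6 instead of a naive 8.
print(conservative_desired(4, [100, 100, 100], 1, 0, 50))  # 6
# 3 Ready pods at 20%, 1 pod with no metric, target 50%: no scale-down at all.
print(conservative_desired(4, [20, 20, 20], 0, 1, 50))     # 4
```

In both cases the set-aside pod pulls the result toward the current replica count, which is the dampening the paragraph describes.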

Multiple metrics. With several metrics listed, the formula runs once per metric and the largest desiredReplicas wins, as noted in the loop. A CPU metric might say 6 and a requests-per-second metric might say 9 — the HPA picks 9.
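A short Python sketch of the "largest wins, then clamp" logic may help. It covers only the per-metric formula, the maximum, and the [minReplicas, maxReplicas] clamp; the behaviour step is omitted, and the metric names and readings are invented so that the result matches the 6-versus-9 example above.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    return math.ceil(current_replicas * current_value / target_value)


def propose_replicas(current_replicas, metrics, min_replicas, max_replicas):
    """metrics maps a name to (current_value, target_value)."""
    per_metric = {
        name: desired_for_metric(current_replicas, current, target)
        for name, (current, target) in metrics.items()
    }
    proposal = max(per_metric.values())  # the most demanding metric wins
    clamped = max(min_replicas, min(max_replicas, proposal))
    return per_metric, clamped


# Hypothetical readings at 3 replicas: CPU at 100% vs 50%, RPS at 150 vs 50 per pod.
readings = {"cpu": (100, 50), "rps": (150, 50)}
print(propose_replicas(3, readings, min_replicas=2, max_replicas=30))
# ({'cpu': 6, 'rps': 9}, 9)
print(propose_replicas(3, readings, min_replicas=2, max_replicas=8))
# ({'cpu': 6, 'rps': 9}, 8)
```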

The exam-favourite gotcha: a Utilization target is a percentage of the request. If a pod requests 250m CPU and uses 200m, that is 80% — even though the node has spare cores. People who reason about node capacity instead of the request get every HPA calculation wrong.
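As a quick check of that reasoning, the sketch below computes average utilisation as total usage over total request across the pods. The integer division and the function name are simplifying assumptions for illustration; note that node capacity never appears as an input.

```python
def average_utilization(usage_millicores, request_millicores):
    """Percentage of the summed requests that the pods are using."""
    return (sum(usage_millicores) * 100) // sum(request_millicores)


# One pod requesting 250m and using 200m, as in the text.
print(average_utilization([200], [250]))                      # 80
# Three pods, each requesting 250m, with uneven usage.
print(average_utilization([200, 150, 250], [250, 250, 250]))  # 80
```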

Metric types in autoscaling/v2

The autoscaling/v2 API accepts a list of metric sources, each with a type. There are five, and choosing the right one is most of writing a correct HPA.

| type | Reads | Target kinds | Served by | Typical use |
|---|---|---|---|---|
| Resource | CPU/memory of the target’s pods | Utilization, AverageValue | metrics-server | CPU%, memory bytes |
| Pods | a custom per-pod metric, averaged | AverageValue only | custom metrics API (adapter) | requests/sec per pod |
| Object | a metric describing one other object | Value, AverageValue | custom metrics API (adapter) | Ingress RPS, queue length as one number |
| ContainerResource | CPU/memory of one named container | Utilization, AverageValue | metrics-server | scale on the app container, ignoring a noisy sidecar |
| External | a metric not tied to any K8s object | Value, AverageValue | external metrics API (adapter/KEDA) | cloud queue depth, third-party SLO |

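The target type inside each metric matters as much as the metric type. Utilization is a percentage of the pods’ resource request and applies only to Resource and ContainerResource metrics; AverageValue is a raw quantity per pod; Value is a raw quantity for the metric as a whole, compared as-is.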

The Value vs AverageValue distinction on Object/External metrics is the subtle one and a frequent source of wildly wrong scaling. With AverageValue the HPA computes desired = ceil(totalValue / targetPerReplica) directly — “give me one replica per N units of work”. With Value it computes the usage ratio against the current replica count like the resource formula. For a queue you almost always want AverageValue: “I want roughly 30 messages of backlog per worker”.
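The difference is easiest to see with numbers. The Python sketch below implements the two calculations exactly as described in the paragraph above; the queue depth of 300 messages and the function names are invented for illustration.

```python
import math


def desired_average_value(metric_total, target_per_replica):
    """AverageValue: one replica per N units of work, independent of current size."""
    return math.ceil(metric_total / target_per_replica)


def desired_value(current_replicas, metric_value, target_value):
    """Value: usage ratio applied to the current replica count."""
    return math.ceil(current_replicas * metric_value / target_value)


backlog = 300  # hypothetical queue depth
print(desired_average_value(backlog, 30))  # 10 workers, whatever the current count
print(desired_value(4, backlog, 30))       # 40 when starting from 4 replicas
print(desired_value(10, backlog, 30))      # 100 when starting from 10 replicas
```

With Value the answer depends on how many replicas are already running, which is why a backlog that does not shrink per replica can drive runaway scaling.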

A ContainerResource example — scale on the application container only, so an Istio sidecar’s CPU does not skew the average:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        container: api            # ignore the proxy sidecar entirely
        target:
          type: Utilization
          averageUtilization: 70

An Object metric — scale on requests-per-second reported against an Ingress, taken as a single value divided across pods:

  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          name: shop
        metric:
          name: requests_per_second
        target:
          type: AverageValue       # total RPS / replicas
          averageValue: "200"

The metrics pipeline

A metric type is only useful if something serves it. Kubernetes does not collect application metrics itself; the HPA reads from three aggregated API groups, each backed by a different component. Understanding this pipeline is how you debug a <unknown> target.

                 ┌─────────────────────────────┐
   HPA  ───────► │ metrics.k8s.io              │ ◄── metrics-server  (CPU/memory)
 (controller)    ├─────────────────────────────┤
        ───────► │ custom.metrics.k8s.io       │ ◄── adapter (Prometheus Adapter, …)
        ───────► │ external.metrics.k8s.io     │ ◄── adapter / KEDA metrics adapter
                 └─────────────────────────────┘
                   (API aggregation layer)

Resource metrics → metrics.k8s.io → metrics-server. metrics-server is a lightweight cluster add-on that scrapes the kubelet’s /metrics/resource endpoint on every node (which in turn reads cAdvisor) and keeps the latest CPU and memory reading per pod in memory. It is not a monitoring system: it stores no history, only the most recent value, which is all the HPA needs. It registers itself as an API service for the metrics.k8s.io group via the aggregation layer. The litmus test that it is healthy:

kubectl top nodes      # returns CPU/memory numbers, not an error
kubectl top pods -A    # per-pod usage

If kubectl top errors, no CPU/memory HPA can work — fix metrics-server first. On managed platforms (AKS, GKE, EKS) it usually ships pre-installed; on kind/minikube you install it yourself (see the lab). Two metrics-server gotchas: on kind you must add --kubelet-insecure-tls because the kubelet’s serving cert is self-signed, and the HPA only sees pods whose containers actually declare CPU/memory requests — a pod with no requests reports <unknown> for utilisation forever.

Custom & external metrics → adapters. For Pods, Object and External metrics, Kubernetes ships no default provider. You install an adapter that registers as the API service for custom.metrics.k8s.io (for Pods/Object) or external.metrics.k8s.io (for External). The common choice is the Prometheus Adapter, which translates PromQL queries into Kubernetes metrics; KEDA also registers an external-metrics adapter to drive the HPAs it manages. The wiring and adapter rules are covered in depth in the HPA, KEDA & node autoscaling lesson — what matters here is the topology: the HPA asks the aggregated API, the API routes to the adapter, the adapter queries the real source.

Inspect every layer directly — this is the single most useful debugging skill for autoscaling:

kubectl get apiservices | grep metrics            # which metrics APIs are registered & Available
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods | head          # resource metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .        # what custom metrics exist
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq .      # external metrics

A <unknown> in kubectl get hpa always traces to this pipeline: the relevant API service is missing/Unavailable, the adapter is down, the metric name is misspelled, or (for resource metrics) the pod has no requests. The HPA is flying blind until that column shows real numbers, so never tune thresholds before fixing it.

Scaling behaviour: stabilisation, policies and flapping

The raw formula tells the HPA where it wants to be; the spec.behavior block governs how fast it is allowed to get there. This is where you stop oscillation and control blast radius. There are two symmetric sub-blocks, scaleUp and scaleDown, each with a stabilisation window, a list of rate policies, and a policy-selection rule.

The default behaviour is asymmetric on purpose — scale up fast, scale down slowly:

| | Default scaleUp | Default scaleDown |
|---|---|---|
| stabilizationWindowSeconds | 0 (act immediately) | 300 (consider last 5 min) |
| Rate policies | +100% or +4 pods per 15s, whichever is more | -100% per 15s (i.e. can remove all surplus) |
| selectPolicy | Max | Max |

The rationale: briefly over-provisioning is cheap and protects users; thrashing replicas down and back up is expensive and risky. Removing capacity is the dangerous direction, so it is deliberately damped.

stabilizationWindowSeconds is the anti-flapping mechanism, and it works differently per direction. The HPA records its computed recommendations over the window and then, instead of using the latest, it picks the most conservative one for the direction of change: the highest recommendation in the window when scaling down, and the lowest when scaling up.

This “take the safest recommendation over a window” behaviour, not the per-period rate limits, is what actually kills flapping. If pods still oscillate, widen scaleDown.stabilizationWindowSeconds before you touch anything else.
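The window logic can be modelled in a few lines of Python. This is a sketch under simplifying assumptions (recommendations are kept as a plain list of timestamp and value pairs, and rate policies are ignored); the names are illustrative.

```python
def stabilize(current_replicas, history, now, up_window=0, down_window=300):
    """history: list of (timestamp_seconds, recommended_replicas), newest last."""
    latest = history[-1][1]
    # Scaling up is held to the LOWEST recommendation seen in the up window.
    up_rec = min([r for t, r in history if now - t <= up_window] + [latest])
    # Scaling down is held to the HIGHEST recommendation seen in the down window.
    down_rec = max([r for t, r in history if now - t <= down_window] + [latest])
    result = current_replicas
    if result < up_rec:
        result = up_rec
    if result > down_rec:
        result = down_rec
    return result


# Load drops at t=60, but the recommendation of 8 from t=0 is still in the window.
print(stabilize(8, [(0, 8), (60, 3), (120, 7), (180, 2)], now=180))    # 8
# By t=400 that entry has aged out; the highest remaining recommendation is 7.
print(stabilize(8, [(60, 3), (120, 7), (180, 2), (400, 2)], now=400))  # 7
# Scale-up with a zero-second window acts on the latest recommendation at once.
print(stabilize(2, [(0, 2), (15, 10)], now=15))                        # 10
```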

Rate policies cap how much can change per time period, independent of stabilisation. Each policy is a type: Percent or type: Pods with a value and a periodSeconds. selectPolicy decides how to combine multiple policies: Max (take the most permissive — the default, lets either policy authorise the change), Min (the most restrictive), or Disabled (forbid scaling in that direction entirely — e.g. scaleDown.selectPolicy: Disabled to make a workload scale up but never down automatically).
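Here is a rough Python sketch of how policies and selectPolicy combine into a per-period limit. It assumes the replica count did not change earlier in the period, ignores periodSeconds entirely, and the rounding choices (ceil when growing, floor when shrinking) are assumptions for illustration rather than a statement of the controller's exact behaviour.

```python
import math


def scale_up_limit(start_replicas, policies, select_policy="Max"):
    """Highest replica count the policies allow within one period."""
    if select_policy == "Disabled":
        return start_replicas
    limits = []
    for kind, value in policies:
        if kind == "Percent":
            limits.append(math.ceil(start_replicas * (1 + value / 100)))
        else:  # "Pods"
            limits.append(start_replicas + value)
    return max(limits) if select_policy == "Max" else min(limits)


def scale_down_limit(start_replicas, policies, select_policy="Max"):
    """Lowest replica count the policies allow within one period."""
    if select_policy == "Disabled":
        return start_replicas
    floors = []
    for kind, value in policies:
        if kind == "Percent":
            floors.append(math.floor(start_replicas * (1 - value / 100)))
        else:  # "Pods"
            floors.append(start_replicas - value)
    # "Max" means the most permissive policy, i.e. the lowest floor.
    return min(floors) if select_policy == "Max" else max(floors)


up = [("Percent", 100), ("Pods", 8)]
print(scale_up_limit(3, up))                      # 11 (doubling gives 6, +8 pods gives 11)
print(scale_up_limit(3, up, "Min"))               # 6
print(scale_down_limit(10, [("Percent", 20)]))    # 8
print(scale_down_limit(10, [("Percent", 20)], "Disabled"))  # 10
```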

A complete, production-shaped block — react instantly upward but cap the rate, and shed capacity gently:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # no delay going up
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100                     # at most double…
          periodSeconds: 30
        - type: Pods
          value: 8                       # …or +8 pods, whichever is larger
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300    # only shrink after 5 min of sustained low load
      selectPolicy: Max
      policies:
        - type: Percent
          value: 20                      # remove at most 20% of pods per minute
          periodSeconds: 60

There is no separate “cooldown” field in autoscaling/v2. The cluster-wide kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization (default 5 minutes) only supplies the default window; the per-HPA scaleDown.stabilizationWindowSeconds overrides it and is strictly more expressive.

Figure: Kubernetes pod autoscaling (HPA & VPA)

The diagram traces both axes: the HPA loop on the left (metrics APIs → formula → behaviour gate → /scale patch → more replicas) and the VPA loop on the right (recommender → recommendation → updater eviction → admission-controller rewrite → bigger pods), meeting at the shared “request” input that makes their conflict unavoidable.

The Vertical Pod Autoscaler: components

The VPA scales the other axis — it right-sizes a pod’s CPU/memory requests instead of changing the replica count. It is not built into Kubernetes; you install it from the autoscaler repository, and it ships as three independent components plus a CRD. (The right-sizing lesson covers how to read the recommendation numbers; here we focus on the mechanism that produces and applies them.)

| Component | Role | Safe to run alone? |
|---|---|---|
| Recommender | Watches live usage (metrics API) + history; writes target/lowerBound/upperBound into the VPA object’s status. | Yes: pure observation. |
| Updater | In Auto/Recreate mode, evicts pods whose requests are out of bounds so they reschedule. | No: it disrupts pods. |
| Admission controller | A mutating webhook that rewrites a new pod’s requests to the recommendation at creation time. | Paired with the updater. |

The interplay is the key insight: the recommender only suggests; the updater can only evict (it does not patch a running pod in place — the standard updater works by deletion, relying on the controller to recreate); and the admission controller is what makes a recreated pod come back with the new requests rather than the old ones. Remove the admission controller and an evicted pod would simply restart at its original size — the updater and admission controller are a pair. The recommender by contrast is genuinely standalone, which is exactly why the recommended first step is to run only recommendations.

Install and verify:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh                          # generates webhook certs; deploys all three
kubectl -n kube-system get pods | grep vpa
# vpa-recommender-...           1/1 Running
# vpa-updater-...               1/1 Running
# vpa-admission-controller-...  1/1 Running

VPA update modes

The VPA’s updatePolicy.updateMode decides how aggressively recommendations are applied. There are four:

| Mode | Behaviour | Disruption |
|---|---|---|
| Off | Recommend only: writes status, never touches pods. | None. |
| Initial | Applies the recommendation only when a pod is first created; never evicts running pods. | Only on natural rollout/restart. |
| Recreate | Evicts and recreates pods whenever their requests drift outside [lowerBound, upperBound]. | Mid-life evictions. |
| Auto | Currently equivalent to Recreate; intended to adopt in-place resize as that feature matures. | Mid-life evictions (today). |

The discipline that prevents self-inflicted outages: start in Off to gather a full traffic cycle of recommendations with zero risk, then graduate to Initial for most production workloads — new pods get correct requests on every deploy, but you never suffer a surprise eviction during a traffic spike. Reserve Recreate/Auto for workloads that tolerate eviction and where you actively want continuous resizing.
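The mode table reduces to a small decision rule, sketched below in Python. It models only what the table states (mode plus the [lowerBound, upperBound] check); the real updater applies further safeguards that this sketch ignores, and the function name and millicore values are hypothetical.

```python
def should_evict(update_mode, request_m, lower_bound_m, upper_bound_m):
    """True if a running pod would be evicted for resizing under this mode."""
    if update_mode not in ("Auto", "Recreate"):
        return False  # Off and Initial never evict running pods
    return request_m < lower_bound_m or request_m > upper_bound_m


# A pod requesting 100m CPU when the recommended range is 250m..600m.
for mode in ("Off", "Initial", "Recreate", "Auto"):
    print(mode, should_evict(mode, 100, 250, 600))
# Off False
# Initial False
# Recreate True
# Auto True
```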

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"               # observe first — always
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed: { cpu: 100m, memory: 256Mi }
        maxAllowed: { cpu: "2",  memory: 2Gi }
        controlledResources: ["cpu", "memory"]
      - containerName: istio-proxy
        mode: "Off"                 # never resize the sidecar

The resourcePolicy bounds the recommender so it cannot propose something absurd: minAllowed is essential for JVMs and other runtimes with a fixed memory floor, maxAllowed is your guard-rail against a runaway recommendation, and per-container mode: "Off" excludes sidecars. controlledResources is the field that becomes load-bearing in the conflict section below — it scopes which resources the VPA is allowed to touch.
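A compact way to picture the policy is as a clamp plus a filter. The Python sketch below uses the bounds from the manifest above, expressed in millicores and MiB; the raw recommendation values and the function name are made up for the example.

```python
def apply_policy(raw, min_allowed, max_allowed, controlled):
    """Clamp each controlled resource into [minAllowed, maxAllowed]; drop the rest."""
    return {
        res: max(min_allowed[res], min(max_allowed[res], value))
        for res, value in raw.items()
        if res in controlled
    }


min_allowed = {"cpu": 100, "memory": 256}    # 100m, 256Mi
max_allowed = {"cpu": 2000, "memory": 2048}  # 2 cores, 2Gi
raw = {"cpu": 40, "memory": 3000}            # hypothetical unbounded recommendation

print(apply_policy(raw, min_allowed, max_allowed, ["cpu", "memory"]))
# {'cpu': 100, 'memory': 2048}
print(apply_policy(raw, min_allowed, max_allowed, ["memory"]))
# {'memory': 2048}
```

The second call shows the effect of narrowing controlledResources: CPU is simply left alone.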

The HPA + VPA conflict — and how to combine them

This is the single most important rule in pod autoscaling, and a near-guaranteed interview question.

Never run the VPA (in Auto/Recreate) and an HPA on the same resource metric for the same workload.

The mechanism is a feedback loop. Suppose the HPA scales on CPU utilisation and the VPA also controls CPU. Under load, the VPA raises the CPU request. But utilisation is usage ÷ request — so raising the request lowers the measured utilisation percentage, even though real usage is unchanged. The HPA sees utilisation fall below target and scales the replica count down. Fewer replicas means more load per pod, the VPA raises the request again, and the two controllers oscillate against each other. The official guidance is explicit: do not use the VPA with the HPA on CPU or memory.
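One turn of that loop, worked through in Python with invented numbers (400m of real usage per pod, an 80% HPA target, 4 replicas). The helper names are illustrative only.

```python
import math


def utilization(usage_m, request_m):
    return 100 * usage_m / request_m


def hpa_desired(replicas, current_util, target_util):
    return math.ceil(replicas * current_util / target_util)


usage = 400  # millicores actually consumed per pod; unchanged throughout

before = utilization(usage, 500)   # request 500m
after = utilization(usage, 1000)   # the VPA raises the request to 1000m

print(before, hpa_desired(4, before, 80))  # 80.0 4  -> on target, no change
print(after, hpa_desired(4, after, 80))    # 40.0 2  -> the HPA halves the replicas
```

Real usage never moved, yet the HPA removed half the pods purely because the denominator changed.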

They combine perfectly, however, as long as they never share a controlled dimension. Two clean designs:

Design A — orthogonal signals (preferred). The HPA scales replicas on a custom or external metric (requests-per-second, queue depth, p95 latency); the VPA owns CPU and memory requests. No shared dimension, no loop. The HPA decides how many pods; the VPA decides how big each one is.

# HPA: replicas driven by RPS only — NO cpu/memory resource metric here
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: shop }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Pods
      pods:
        metric: { name: http_requests_per_second }
        target: { type: AverageValue, averageValue: "50" }
---
# VPA: owns CPU + memory requests, Initial mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api, namespace: shop }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Initial" }
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["cpu", "memory"]   # HPA touches neither

Design B — split the resources. If you must keep the HPA on CPU (say RPS metrics are not available), confine the VPA to memory only with controlledResources: ["memory"]. The HPA owns CPU; the VPA owns memory; they never touch the same number.

  # VPA scoped so it cannot interfere with a CPU-based HPA
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]   # HPA keeps CPU

Memorise the rule and one valid combination — examiners love to ask you to spot the broken config (HPA on CPU + VPA in Auto controlling CPU) and then fix it.

When HPA vs VPA vs Cluster Autoscaler

These solve genuinely different problems; the right answer is often “more than one, layered”.

| Use… | When the bottleneck is… | It changes… | Watch out for |
|---|---|---|---|
| HPA | Load is horizontally scalable (stateless web/API, queue workers) and varies over time. | replica count | needs a good per-replica metric; can’t scale stateful singletons |
| VPA | A pod is mis-sized (OOMKilled, or hugely over-provisioned) and you don’t know the right request. | per-pod requests | evicts to resize; conflicts with CPU/mem HPA; not for sharp spikes |
| Cluster Autoscaler / Karpenter | Pods are Pending because the cluster has no room. | node count | only reacts after pods are unschedulable; adds node-provisioning latency |

The mental model: HPA and VPA decide how much pod you need; the Cluster Autoscaler decides whether there is a node to put it on. They stack. A typical production setup runs an HPA on an application metric, a VPA in Initial mode on memory to keep requests honest, and a node autoscaler underneath to supply capacity when the HPA’s new replicas have nowhere to land. Crucially the latencies are additive: metric crosses target → HPA loop (~15s) → new pod Pending → node autoscaler reacts → node Ready → pod scheduled. The end-to-end scale-up time is the sum of all three loops, which is the operational point the HPA/KEDA/node lesson measures in its load test. Use the VPA’s recommendations as the source of truth for the requests your HPA percentages and your node bin-packing both depend on.

Hands-on lab

We will stand up metrics-server, watch the HPA formula produce a specific replica count under real CPU load, tune the behavior block, then deploy a VPA in Off mode and read its recommendation. A two-CPU kind or minikube node is enough.

1. Cluster and metrics-server

# kind:
kind create cluster --name hpa-lab
# Install metrics-server (the components.yaml release)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# kind's kubelet uses a self-signed serving cert — let metrics-server accept it:
kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Wait until it is serving:
kubectl -n kube-system rollout status deploy/metrics-server
kubectl top nodes      # must return numbers, not an error

minikube users can instead run minikube addons enable metrics-server and skip the TLS patch.

2. A CPU-burnable workload with requests

The HPA can only compute utilisation if the container declares a CPU request:

kubectl create deployment php --image=registry.k8s.io/hpa-example
kubectl set resources deployment php --requests=cpu=200m --limits=cpu=500m
kubectl expose deployment php --port=80
kubectl rollout status deploy/php

3. Create the HPA and read the algorithm inputs

kubectl autoscale deployment php --cpu-percent=50 --min=1 --max=10
# Equivalent to a metrics: [{type: Resource, resource: {name: cpu,
#   target: {type: Utilization, averageUtilization: 50}}}] HPA.

kubectl get hpa php
# NAME  REFERENCE        TARGETS       MINPODS  MAXPODS  REPLICAS
# php   Deployment/php   cpu: 0%/50%   1        10       1

cpu: 0%/50% is current/target. If you ever see <unknown>/50%, metrics-server is unhealthy or the deployment has no CPU request — stop and fix that.

4. Generate load and watch the formula fire

In one terminal, drive sustained load:

kubectl run load --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://php; done"

In another, watch the HPA decide:

kubectl get hpa php -w
# TARGETS climbs (e.g. cpu: 250%/50%) → REPLICAS rises.
# 250%/50% gives a usage ratio of 5.0; from 1 replica → ceil(1×5)=5 (then re-evaluated as pods spread load)
kubectl describe hpa php      # the "events" log every decision and the computed desired count

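Watch kubectl describe hpa php: its SuccessfulRescale events state the new size and the reason (for example “New size: 5; reason: cpu resource utilization (percentage of request) above target”), and the Metrics line shows the reading behind each change. That is the formula, observable in production.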
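The "re-evaluated as pods spread load" step can be simulated offline. The Python sketch below assumes a fixed total demand equal to 400% of one pod's request (an invented figure), caps per-pod utilisation at 250% because the lab's 500m limit is 2.5 times the 200m request, and ignores stabilisation and rate policies.

```python
import math


def simulate(total_demand, target, limit_util, replicas=1, max_replicas=10, tolerance=0.1):
    """Return the sequence of replica counts until the loop settles."""
    trace = [replicas]
    for _ in range(20):  # safety bound on iterations
        util = min(limit_util, total_demand / replicas)  # a pod cannot exceed its limit
        if abs(util / target - 1.0) <= tolerance:
            break
        desired = min(max_replicas, max(1, math.ceil(replicas * util / target)))
        if desired == replicas:
            break
        replicas = desired
        trace.append(replicas)
    return trace


print(simulate(total_demand=400, target=50, limit_util=250))  # [1, 5, 8]
```

The first step is limited by the CPU cap (250%/50% gives 5), and only once load spreads does the HPA see the true demand and settle at 8 pods of 50% each.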

5. Validation — confirm the maths

kubectl get hpa php
# Expect REPLICAS to settle where average utilisation per pod is within tolerance of the 50% target.
kubectl top pods -l app=php   # per-pod CPU should hover near the 100m that 50% of a 200m request implies

Stop the load generator (Ctrl-C in its terminal). With the default 300s down-stabilisation window, observe that replicas do not immediately fall — the HPA waits, demonstrating stabilisation directly. After ~5 minutes it scales back toward 1.

6. Tune behaviour and observe the change

kubectl patch hpa php --type=merge -p '{
  "spec":{"behavior":{"scaleDown":{"stabilizationWindowSeconds":30,
    "policies":[{"type":"Percent","value":50,"periodSeconds":30}]}}}}'

Re-run the load, stop it, and note that scale-down now begins after ~30s instead of 5 minutes — you have changed the algorithm’s down-gate live.

7. VPA recommendation (observation only)

git clone --depth 1 https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh
kubectl -n kube-system get pods | grep vpa     # three components Running

cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: php }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: php }
  updatePolicy: { updateMode: "Off" }     # recommend only — zero risk
EOF

# Give the recommender a few minutes of the load running, then:
kubectl describe vpa php | sed -n '/Recommendation/,/Events/p'
# target / lowerBound / upperBound for cpu and memory appear in status.

The VPA in Off mode never touched a pod — it only wrote a recommendation you can act on, exactly as the right-sizing lesson advises.

Cleanup

kubectl delete hpa php
kubectl delete vpa php
kubectl delete deploy php
kubectl delete svc php
# remove VPA components (run from the directory where you cloned the autoscaler repo):
cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-down.sh
# remove the whole cluster:
kind delete cluster --name hpa-lab        # or: minikube delete

Cost note

Everything here runs on a single local kind/minikube node — zero cloud cost. On a managed cluster, the only spend is the extra replicas the HPA briefly creates under load; the VPA in Off mode costs nothing because it changes nothing.

Common mistakes & troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| TARGETS shows <unknown> | Pod has no CPU/memory request, or metrics-server down | Set requests; verify kubectl top pods works |
| HPA never scales despite high load | Reasoning about node capacity, but utilisation is % of request | Recompute against the request; lower the request or the target |
| Replicas flap up and down | Stabilisation window too short / metric noisy | Widen scaleDown.stabilizationWindowSeconds; rely on default tolerance |
| Custom-metric HPA stuck at <unknown> | Adapter not registered / metric name wrong | Check that kubectl get apiservices lists custom.metrics.k8s.io as Available; query the raw API |
| HPA and VPA “fighting”, odd oscillation | Both control the same CPU/memory metric | Move HPA to a custom metric, or scope VPA to memory only |
| Scale-up far slower than expected | Latencies are additive (HPA + scheduler + node provisioning) | Profile each hop; pre-warm or raise minReplicas |
| VPA evicts pods during a spike | updateMode: Auto/Recreate on a user-facing app | Use Initial (or Off) for anything latency-sensitive |
| HPA scales to maxReplicas and stops | Hit the ceiling; looks identical to “broken” | Alert on max reached; raise maxReplicas if capacity allows |

Best practices

Security notes

Interview & exam questions

  1. State the HPA scaling formula. desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The ceil() rounds up so the HPA never under-provisions a fractional pod.

  2. At 6 replicas, 80% current CPU utilisation, 40% target — what does the HPA want? ceil(6 × 80/40) = ceil(12) = 12 replicas.

  3. What is a Utilization target a percentage of? The pod’s resource request — not its limit, not node capacity. Wrong requests give wrong HPA maths.

  4. Name the five autoscaling/v2 metric types and one use of each. Resource (CPU% of pods), ContainerResource (CPU% of one named container, ignoring sidecars), Pods (per-pod custom metric like RPS), Object (a metric on one other object such as an Ingress), External (a metric outside the cluster like a cloud queue’s length).

  5. Value vs AverageValue on an External metric? AverageValue divides the metric by the replica count (“N units per pod”); Value compares the raw number as-is. For a queue you almost always want AverageValue.

  6. Which component serves resource metrics, and how do you check it? metrics-server, registered for metrics.k8s.io. Healthy iff kubectl top nodes/pods returns numbers.

  7. What does <unknown> in the HPA TARGETS column mean? The metrics pipeline is broken — missing request, dead metrics-server, unregistered adapter, or wrong metric name. The HPA cannot scale until it shows a real value.

  8. How does stabilizationWindowSeconds prevent flapping? The HPA picks the most conservative recommendation over the window — the highest when scaling down, the lowest when scaling up — so transient dips/spikes don’t move the replica count. Default is 0 up, 300 down.

  9. Why is the default behaviour asymmetric (fast up, slow down)? Brief over-provisioning is cheap and protects users; thrashing capacity down then back up is expensive and risky, so the down direction is damped.

  10. List the VPA’s three components and their roles. Recommender (writes target/lowerBound/upperBound), updater (evicts out-of-bounds pods in Auto/Recreate), admission controller (rewrites requests on new pods). Recommender is safe to run alone.

  11. State the HPA + VPA conflict rule and give one safe combination. Never let both control the same CPU/memory metric for a workload — raising the request lowers utilisation, which makes the HPA scale down, creating a loop. Safe: HPA on a custom metric (RPS) while VPA owns CPU/memory; or HPA on CPU while VPA is scoped to memory only.

  12. HPA vs VPA vs Cluster Autoscaler — what does each change, and in response to what? HPA changes replica count in response to a live metric vs target; VPA changes per-pod requests in response to historical usage; Cluster Autoscaler changes node count in response to unschedulable (Pending) pods.

Quick check

  1. Multiple metrics on one HPA: does it use the minimum, maximum, or average of the per-metric desired counts?
  2. True/false: an HPA can scale a Deployment down to zero replicas.
  3. Which metric target type is discouraged for memory, Utilization or AverageValue, and why?
  4. Which VPA update mode applies recommendations only at pod creation, never evicting running pods?
  5. You see cpu: <unknown>/50%. Name two distinct causes.

Answers

  1. The maximum — the HPA scales to satisfy the most demanding metric, so each metric sets a floor.
  2. False. A plain HPA’s minReplicas is at least 1 (without the alpha HPAScaleToZero feature). Scale-to-zero is KEDA’s job.
  3. Utilization is technically valid but discouraged for memory because a percentage of a memory request rarely reflects real pressure (memory doesn’t “burst” usefully like CPU); use AverageValue (absolute bytes) instead.
  4. Initial — it sets requests on new pods and never evicts running ones.
  5. Any two of: the deployment’s containers declare no CPU request; metrics-server is not running/healthy; the metrics API service is Unavailable.

Exercise

On a kind/minikube cluster, deploy a workload with requests.cpu=100m. Create an HPA targeting 60% CPU utilisation, min=2 max=12. (a) Before generating load, compute by hand the replica count you’d expect if average utilisation hit 180%. (b) Generate load, confirm the HPA reaches your predicted number (within rounding), and capture the kubectl describe hpa event that reports the new size and its reason. (c) Add a behavior.scaleDown.stabilizationWindowSeconds: 20 and demonstrate, by stopping the load and watching kubectl get hpa -w, that scale-down now starts roughly 20s after load drops instead of the 5-minute default. (d) Add a VPA in Off mode and record its target CPU recommendation, then explain why you must not simultaneously run this CPU-based HPA and a VPA controlling CPU in Auto.

Certification mapping

Glossary

Next steps

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.