
Kubernetes Autoscaling in Depth: HPA, KEDA Event-Driven Scaling & Node Autoscaling

Autoscaling on Kubernetes is three independent control loops stacked on top of each other, and most outages happen at the seams between them. This guide wires up all three — pod-level HPA on custom/external metrics, KEDA for event-driven and scale-to-zero workloads, and node autoscaling with both Cluster Autoscaler and Karpenter — then tunes them so they cooperate instead of fight.

The three layers, and why the order matters

| Layer | Controller | Scales | Reacts to |
|---|---|---|---|
| Pod replicas | HPA / KEDA | replica count of a Deployment | CPU, memory, custom, external metrics |
| Pod requests | VPA | per-pod CPU/memory requests | historical usage |
| Nodes | Cluster Autoscaler / Karpenter | the node count / shape | unschedulable (Pending) pods |

The causal chain runs top-down: a metric crosses a threshold, the HPA (or KEDA-managed HPA) adds replicas, those replicas go Pending because the cluster is full, and only then does the node autoscaler add capacity. Your end-to-end scale-up latency is the sum of all three loops — typically HPA sync (15s default) + scheduler + node provisioning (30s–several minutes). Internalizing that sum is the whole game.
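
To make that sum concrete, here is a minimal Python sketch of the latency budget. The scheduler delay and the two node-provisioning figures are illustrative assumptions, not measurements; only the 15s HPA sync comes from the text above.

```python
# Rough scale-up latency budget: the three loops run in sequence, so their delays add.
# Image pull and application start are deliberately left out of this sketch.
def scale_up_latency_seconds(hpa_sync: int, scheduling: int, node_provisioning: int) -> int:
    """Time from metric breach to a schedulable node, in seconds."""
    return hpa_sync + scheduling + node_provisioning

# Assumed figures: 2s scheduling, 30s (fast) vs 180s (slow) node provisioning.
best_case = scale_up_latency_seconds(hpa_sync=15, scheduling=2, node_provisioning=30)
slow_case = scale_up_latency_seconds(hpa_sync=15, scheduling=2, node_provisioning=180)
print(best_case, slow_case)  # 47 197
```

Plugging in your own measured numbers shows which loop dominates the budget before you start tuning any of them.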

Prerequisite: metrics-server must be running for any CPU/memory HPA. AKS and GKE ship it managed; on EKS you typically install it yourself (for example as a cluster add-on). Verify with kubectl top nodes returning numbers, not an error.

1. HPA beyond CPU: memory, custom, and external metrics

The v2 HPA API (autoscaling/v2) takes a list of metrics and scales to satisfy the most demanding one. Start with the two built-in resource metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization        # % of the pod's CPU *request*
          averageUtilization: 65
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue        # absolute, not %, for memory
          averageValue: 600Mi

Utilization targets are a percentage of the resource request, not the limit. If your requests are wrong, your HPA math is wrong. This is the single most common HPA misconfiguration.
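
The arithmetic behind this can be sketched in a few lines of Python. The formula is the documented HPA one (desired = ceil(current × currentValue / targetValue), with a default 10% tolerance); the replica count and usage figures are made-up inputs, and the sketch ignores minReplicas/maxReplicas clamping and missing-pod handling.

```python
import math

def utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as the HPA sees it: usage relative to the request, never the limit."""
    return 100.0 * usage_millicores / request_millicores

def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     tolerance: float = 0.1) -> int:
    """ceil(currentReplicas * currentValue / targetValue), skipped when within tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

replicas = 10  # hypothetical current replica count
cpu_util = utilization_percent(usage_millicores=400, request_millicores=500)  # 80.0
by_cpu = desired_replicas(replicas, cpu_util, target_value=65)      # 13
by_memory = desired_replicas(replicas, 450, target_value=600)       # 8 (450Mi average vs 600Mi)
print(max(by_cpu, by_memory))  # 13: the most demanding metric wins
```

Because the request sits in the denominator, doubling it halves the utilization the HPA computes for the same real usage.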

CPU and memory rarely correlate with what users actually feel. To scale on a real signal — requests-per-second, p95 latency, queue depth — you need the custom metrics or external metrics API, served by an adapter. The canonical choice is the Prometheus Adapter.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  -n monitoring --create-namespace \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=80

The adapter exposes a rule-defined PromQL series as a Kubernetes metric. Scale on per-pod RPS:

# adapter rule (values.yaml -> rules.custom)
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: namespace }
          pod: { resource: pod }
      name:
        matches: "http_requests_total"
        as: "http_requests_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# the HPA consuming it
  metrics:
    - type: Pods
      pods:
        metric: { name: http_requests_per_second }
        target:
          type: AverageValue
          averageValue: "50"          # aim for ~50 rps per pod

Use type: Pods when the metric is per-replica (HPA divides total by replica count for you). Use type: External for a metric that is not attached to your pods — a cloud queue length, a third-party SLO — where the adapter (or KEDA, below) talks to the source directly.
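
As a rough illustration of the difference, the Python sketch below computes a replica recommendation for each type. The per-pod RPS samples and the queue length are hypothetical, and the HPA's tolerance band and min/max clamping are left out.

```python
import math

def replicas_for_pods_metric(per_pod_values: list[float], target_average: float) -> int:
    """type: Pods -> average the per-pod values, then compare with the per-pod target."""
    current = len(per_pod_values)
    average = sum(per_pod_values) / current
    return math.ceil(current * average / target_average)

def replicas_for_external_average(total_value: float, target_average: float) -> int:
    """type: External with AverageValue -> one cluster-wide number split across replicas."""
    return math.ceil(total_value / target_average)

# Four pods serving 80, 70, 90 and 60 rps against a 50 rps-per-pod target.
print(replicas_for_pods_metric([80, 70, 90, 60], target_average=50))  # 6
# A queue holding 900 messages with a target of 100 messages per replica.
print(replicas_for_external_average(900, target_average=100))         # 9
```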

2. Event-driven scaling with KEDA

HPA is a closed loop on a steady-state metric. KEDA is the right tool when work arrives as discrete events — a queue backlog, Kafka consumer lag, a cron window — and especially when you want scale-to-zero, which a plain HPA cannot do (HPA minReplicas is >= 1).

KEDA installs an operator plus a metrics adapter. Under the hood it creates and manages an HPA for you from a ScaledObject; you do not write the HPA by hand.

helm repo add kedacore https://kedacore.github.io/charts
helm upgrade --install keda kedacore/keda -n keda --create-namespace

A queue-driven worker that idles at zero and bursts on backlog (Azure Service Bus shown; the pattern is identical for SQS, Pub/Sub, RabbitMQ):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker
  namespace: shop
spec:
  scaleTargetRef:
    name: order-worker            # the Deployment
  minReplicaCount: 0              # scale to zero when idle
  maxReplicaCount: 100
  pollingInterval: 15            # how often KEDA checks the source (s)
  cooldownPeriod: 120            # wait before scaling back to zero (s)
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "20"        # target backlog per replica
      authenticationRef:
        name: sb-auth             # TriggerAuthentication (workload identity / secret)

Kafka consumer lag is the other workhorse trigger:

  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.svc:9092
        consumerGroup: order-consumers
        topic: orders
        lagThreshold: "100"       # desired max lag per replica
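
Both triggers size the Deployment the same way: backlog (or lag) divided by the per-replica target, clamped to the configured bounds. A minimal Python sketch of that sizing, assuming the min/max values from the ScaledObject above and made-up backlog figures; it ignores polling, cooldown, and the HPA tolerance band:

```python
import math

def keda_replicas(backlog: int, per_replica_target: int,
                  min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Approximate steady-state replica count for a backlog-style trigger."""
    if backlog <= 0:
        return min_replicas  # idle: fall back to the floor (zero here)
    wanted = math.ceil(backlog / per_replica_target)
    return max(min_replicas, min(max_replicas, wanted))

print(keda_replicas(0, 20))      # 0   -> scaled to zero
print(keda_replicas(1, 20))      # 1   -> first message wakes one replica
print(keda_replicas(450, 20))    # 23  -> ceil(450 / 20)
print(keda_replicas(5000, 20))   # 100 -> capped at maxReplicaCount
print(keda_replicas(1250, 100))  # 13  -> Kafka lag with lagThreshold 100
```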

Two more KEDA patterns worth knowing:

Scale-to-zero cuts cost but adds cold-start latency: the first event must wait for a node (maybe), a pull, and app start. For latency-sensitive paths keep minReplicaCount: 1. Reserve zero for genuinely bursty, latency-tolerant work.

3. Tuning behavior: stabilization, policies, no flapping

Default HPA behavior scales up fast and down slow (a 300s downscale stabilization window). That asymmetry is deliberate — over-provisioning briefly is cheap; thrashing is expensive. Tune it explicitly via spec.behavior:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately on the way up
      policies:
        - type: Percent
          value: 100                    # at most double
          periodSeconds: 30
        - type: Pods
          value: 8                      # ...or +8 pods
          periodSeconds: 30
      selectPolicy: Max                 # take the more aggressive of the two
    scaleDown:
      stabilizationWindowSeconds: 300   # consider the last 5 min of recommendations
      policies:
        - type: Percent
          value: 20                     # shed at most 20% per minute
          periodSeconds: 60

The downscale stabilization window makes the HPA pick the highest recommendation it computed over the window before acting, and that is what kills flapping. If your traffic is spiky and pods still oscillate, widen scaleDown.stabilizationWindowSeconds and lower the per-period Percent before you touch thresholds. KEDA passes an advanced.horizontalPodAutoscalerConfig.behavior block straight through to the HPA it manages, so the same knobs apply to event-driven workloads.
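
The effect of these knobs can be approximated in Python. This is a simplified sketch of the behavior block above, not the controller's implementation: the recommendation history and replica counts are invented, and the exact rounding at the edges is an assumption.

```python
import math

def scale_up_limit(current: int, percent: int = 100, pods: int = 8) -> int:
    """selectPolicy: Max -> whichever policy allows more replicas within one period."""
    by_percent = current + math.ceil(current * percent / 100)
    by_pods = current + pods
    return max(by_percent, by_pods)

def stabilized_scale_down(recommendations: list[int]) -> int:
    """Scale-down stabilization: act on the highest recommendation seen in the window."""
    return max(recommendations)

def scale_down_floor(current: int, percent: int = 20) -> int:
    """Percent policy: shed at most `percent` of the current replicas per period."""
    return current - (current * percent) // 100

print(scale_up_limit(3))    # 11: +8 pods beats doubling at small sizes
print(scale_up_limit(20))   # 40: doubling beats +8 pods at larger sizes

window = [18, 12, 15, 9]    # hypothetical recommendations from the last 5 minutes
next_replicas = max(stabilized_scale_down(window), scale_down_floor(20))
print(next_replicas)        # 18: a brief dip to 9 does not trigger a drop
```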

4. Node autoscaling: Cluster Autoscaler vs Karpenter

Both react to the same trigger — Pending pods the scheduler cannot place — but they differ fundamentally in how they pick capacity.

|  | Cluster Autoscaler (CA) | Karpenter |
|---|---|---|
| Unit of scaling | a node group (ASG / VMSS / MIG) you pre-define | individual nodes, instance type chosen at provision time |
| Instance selection | fixed per group | from a flexible set; picks cheapest that fits |
| Speed | slower (group scale, then schedule) | faster (provisions the node the pod needs) |
| Bin-packing | limited | active consolidation built in |
| Availability | every managed K8s | EKS first-class; expanding to others |

Cluster Autoscaler is the universal default. On AKS it’s a cluster toggle; the autoscaler watches your node pools’ min/max:

az aks nodepool update -g rg-shop --cluster-name aks-shop -n apps \
  --enable-cluster-autoscaler --min-count 3 --max-count 30

CA only scales node groups it owns and assumes all nodes in a group are interchangeable, so it works best with a handful of well-sized, single-instance-type pools. For node consolidation it removes a node only when its pods can be rescheduled elsewhere and it has sat under-utilized past --scale-down-unneeded-time.

Karpenter discards the node-group abstraction. You declare constraints (a NodePool) and a provisioning template (EC2NodeClass on AWS); Karpenter computes the cheapest instance(s) that satisfy pending pods and launches them directly.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]      # prefer spot, fall back
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]          # let it pick Graviton when it fits
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                               # hard ceiling across this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidationAfter: 1m
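
To illustrate what "cheapest instance that satisfies pending pods" means, here is a toy Python sketch. It is not Karpenter's algorithm: the catalog names, sizes, and prices are invented, it only considers a single node, and it ignores DaemonSet overhead, allocatable-versus-capacity differences, and every constraint other than CPU and memory.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gib: float
    hourly_usd: float  # made-up price, for illustration only

CATALOG = [  # hypothetical catalog, not real instance types
    InstanceType("small", 2, 4, 0.05),
    InstanceType("medium", 4, 16, 0.12),
    InstanceType("large", 8, 32, 0.22),
]

def cheapest_fit(pending: list[tuple[float, float]],
                 catalog: list[InstanceType]) -> Optional[InstanceType]:
    """Return the cheapest single instance that holds all pending (cpu, GiB) requests."""
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    candidates = [i for i in catalog if i.vcpu >= need_cpu and i.memory_gib >= need_mem]
    return min(candidates, key=lambda i: i.hourly_usd, default=None)

pending_pods = [(1.0, 2.0), (1.5, 6.0), (0.5, 1.0)]  # 3 vCPU and 9 GiB in total
choice = cheapest_fit(pending_pods, CATALOG)
print(choice.name if choice else "no single instance fits")  # medium
```

The contrast with Cluster Autoscaler is that the candidate list is evaluated per scheduling decision rather than fixed by a pre-defined node group.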

5. Bin-packing, consolidation, and Spot safety

Karpenter’s real value is consolidation: it continuously re-evaluates whether the current fleet is the cheapest way to host current pods, and will replace several small nodes with one larger node, or swap an on-demand node for a cheaper instance, draining the old one safely. That is bin-packing as a live process, not a one-time placement.

Two guardrails make this safe in production:

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "10%"                # never voluntarily disrupt >10% of nodes at once
      - nodes: "0"                  # ...and zero during business hours
        schedule: "0 9 * * mon-fri"
        duration: 8h
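
When several budgets are active at once, the most restrictive one applies. A small Python sketch of that rule, assuming a hypothetical 50-node pool; rounding percentages up is an assumption of this sketch, so check the Karpenter documentation for the exact behavior at the edges.

```python
import math

def allowed_disruptions(total_nodes: int, active_budgets: list[str]) -> int:
    """Most restrictive active budget wins; each budget is either 'N' or 'N%'."""
    limits = []
    for budget in active_budgets:
        if budget.endswith("%"):
            limits.append(math.ceil(total_nodes * int(budget[:-1]) / 100))
        else:
            limits.append(int(budget))
    return min(limits)

print(allowed_disruptions(50, ["10%"]))       # 5: outside the scheduled window
print(allowed_disruptions(50, ["10%", "0"]))  # 0: during the 8h weekday window
```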

And on the workload side, a PodDisruptionBudget is non-negotiable once you run Spot or enable consolidation — it is what stops a node drain from taking your service below quorum:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: checkout, namespace: shop }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: checkout } }

Spot capacity can be reclaimed on short notice (about two minutes on AWS, roughly 30 seconds on Azure and GCP) or simply be unavailable (no capacity). Mitigate by spreading across many instance types (let Karpenter choose), keeping critical singletons on on-demand, setting PDBs, and using topology spread so a single AZ/instance-type pull can’t drain a whole tier.

6. Combining VPA with HPA safely

The Vertical Pod Autoscaler right-sizes requests; the HPA scales replica count. They collide when both act on the same resource: VPA raises the CPU request, which lowers CPU utilization (same usage / bigger request), which tells the HPA to scale down — a feedback loop that defeats both.
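
The feedback loop is easy to see with numbers. The Python sketch below uses made-up figures (10 replicas, 325m average usage, a 65% target) and the standard HPA formula; it is an illustration of the interaction, not a model of either controller.

```python
import math

def hpa_desired(current_replicas: int, usage_m: float, request_m: float,
                target_percent: float, tolerance: float = 0.1) -> int:
    """HPA recommendation from CPU utilization measured against the request."""
    utilization = 100.0 * usage_m / request_m
    ratio = utilization / target_percent
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# Before VPA acts: 325m used of a 500m request is exactly 65%, so the HPA holds at 10.
before = hpa_desired(10, usage_m=325, request_m=500, target_percent=65)
# VPA raises the request to 800m: same usage now reads as ~40.6%, so the HPA scales down.
after = hpa_desired(10, usage_m=325, request_m=800, target_percent=65)
print(before, after)  # 10 7
```

Nothing about the workload changed between the two calls; only the denominator did, which is why both controllers must not own the same resource.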

Rules that keep them from fighting:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: checkout, namespace: shop }
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Initial"          # set requests at pod creation; don't evict running pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # HPA owns CPU; VPA owns memory only

7. Load-test the whole stack and read the timeline

A config that looks right on paper means nothing until you watch all three loops fire under load. Drive synthetic traffic and observe.

# generate load (k6 is convenient; hey/wrk/vegeta all work)
kubectl run k6 --rm -i --restart=Never --image=grafana/k6 -- run - <<'EOF'
import http from 'k6/http';
export const options = { stages: [
  { duration: '2m', target: 200 },   // ramp
  { duration: '5m', target: 800 },   // sustained peak
  { duration: '3m', target: 0 },     // drain -> watch scale-down
]};
export default function () { http.get('https://checkout.shop.svc/health'); }
EOF

In separate panes, watch each loop and timestamp the transitions:

kubectl get hpa checkout -n shop -w                # metric vs target, replica deltas
kubectl get pods -n shop -w                        # Pending -> ContainerCreating -> Running
kubectl get nodes -w                               # new nodes joining
kubectl get events -n shop --sort-by=.lastTimestamp | tail -30
kubectl describe hpa checkout -n shop              # the why behind each decision

Reading the timeline end to end, you should be able to attribute every second: metric crossed at T+0 → HPA bumped replicas at T+~15s → pods Pending at T+18s → node autoscaler reacted → node Ready → pods Running → metric back under target. If a stage is slow, you now know exactly which loop to tune.
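
One way to do that attribution is to record a timestamp for each transition and compute the gaps. The Python sketch below assumes hand-collected, hypothetical timestamps; the event names simply mirror the stages listed above.

```python
from datetime import datetime

# Hypothetical timestamps noted while watching the panes above.
events = [
    ("metric crossed target", "17:00:00"),
    ("HPA raised replicas", "17:00:15"),
    ("pods Pending", "17:00:18"),
    ("node Ready", "17:01:40"),
    ("pods Running", "17:02:05"),
]

def stage_durations(timeline: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Seconds spent between each pair of consecutive transitions."""
    parsed = [(name, datetime.strptime(ts, "%H:%M:%S")) for name, ts in timeline]
    return [(f"{a} -> {b}", int((tb - ta).total_seconds()))
            for (a, ta), (b, tb) in zip(parsed, parsed[1:])]

durations = stage_durations(events)
for stage, seconds in durations:
    print(f"{seconds:>4}s  {stage}")

slowest = max(durations, key=lambda d: d[1])
print("tune first:", slowest[0])  # tune first: pods Pending -> node Ready
```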

Enterprise scenario

A payments platform ran KEDA scale-to-zero on its settlement-batch workers (SQS-driven) backed by a Karpenter Spot pool. Every weekday at 17:00 a fan-out job dumped ~40k messages into the queue. KEDA correctly scaled the Deployment from 0 to ~120 replicas, but p99 settlement time blew past the SLA on the first few thousand messages. The cause was additive cold-start, not throughput: 0→1 forced a Karpenter node launch, a 1.2 GB image pull, and JVM warmup, and because SQS ApproximateNumberOfMessages lags by ~20–30s, KEDA itself reacted late. The Spot pool made it worse: diversified instance types meant variable boot times, and one launch hit InsufficientInstanceCapacity.

The fix was to stop reacting and start pre-warming. They added a second KEDA trigger with a cron schedule so capacity was in place before the 17:00 dump, while keeping the queue trigger for actual backlog:

  minReplicaCount: 0
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: "55 16 * * 1-5"   # warm up at 16:55
        end:   "30 18 * * 1-5"
        desiredReplicas: "30"     # floor during the window
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.../settlements
        queueLength: "50"

They also pinned the first 30 replicas to on-demand via a separate NodePool (Spot only above the floor) and pre-pulled the image with a DaemonSet. End-to-end p99 dropped back under SLA, and Spot still covered the long tail.
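
With two triggers on one ScaledObject, the generated HPA follows whichever trigger asks for more replicas. A minimal Python sketch of the resulting replica count through the afternoon; the cron floor and queueLength come from the snippet above, while max_replicas=120 is an assumption based on the "~120 replicas" in the scenario.

```python
import math

def desired_replicas(in_cron_window: bool, queue_depth: int, cron_floor: int = 30,
                     queue_length_target: int = 50, max_replicas: int = 120) -> int:
    """The most demanding trigger wins, capped at the (assumed) maxReplicaCount."""
    from_cron = cron_floor if in_cron_window else 0
    from_queue = math.ceil(queue_depth / queue_length_target)
    return min(max_replicas, max(from_cron, from_queue))

print(desired_replicas(in_cron_window=True, queue_depth=0))       # 30  at 16:55, pre-warmed
print(desired_replicas(in_cron_window=True, queue_depth=40_000))  # 120 at 17:00, queue takes over
print(desired_replicas(in_cron_window=True, queue_depth=600))     # 30  backlog low, floor holds
print(desired_replicas(in_cron_window=False, queue_depth=0))      # 0   after 18:30, back to zero
```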

Verify

kubectl top nodes                                  # metrics-server returns data
kubectl get apiservices | grep metrics             # custom/external metrics API registered
kubectl get hpa -A                                 # TARGETS column shows current/target, not <unknown>
kubectl get scaledobject -A                        # KEDA objects; READY/ACTIVE = True
kubectl get hpa -A | grep keda-hpa                 # the HPAs KEDA generated exist (named keda-hpa-<scaledobject>)
kubectl get nodepool,nodeclaim                     # Karpenter intent + provisioned nodes
kubectl get pdb -A                                 # disruption budgets present for critical apps

A <unknown> in the HPA TARGETS column means the metrics pipeline is broken (adapter down, bad PromQL, or wrong label overrides) — fix that before tuning anything else, because the HPA is flying blind.

Production checklist

Pitfalls

Get the three loops cooperating and the cluster becomes self-managing: it absorbs traffic spikes, drains queues to zero cost, and packs nodes tightly — without a human in the loop. The work is almost entirely in the tuning and the testing, not the YAML.
