
EKS Cluster Upgrades: Version Lifecycle, Add-on Compatibility, and Fleet Operations

A single aws eks update-cluster-version call looks trivial. The risk is never the control plane API call itself, which AWS performs in a managed, rolling fashion. The risk is everything around it: an admission webhook that stops answering, a CSI driver that skews past the API server, a DaemonSet that won’t drain because nobody set a PodDisruptionBudget, and the slow bleed of clusters parked on a version that slid into extended support at six times the hourly rate. Across a fleet those small risks multiply by cluster count. This is the runbook I use to move a fleet forward one minor version at a time without paging anyone.

Assume EKS 1.31+ as a baseline and kubectl, eksctl, and AWS CLI v2 on the operator workstation throughout.

1. Understand the version lifecycle before you plan anything

EKS supports each Kubernetes minor version for a fixed window, and the window is what drives your cadence — not feature envy.

The practical takeaway: standard support gives you runway to land roughly one minor upgrade per quarter and stay perpetually within it. The moment a cluster crosses into extended support, every idle month is real money — and across a fleet that delta becomes a budget line, not a rounding error.
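To put a number on that delta, here is a minimal sketch. It assumes control plane rates of $0.10 (standard) and $0.60 (extended) per cluster-hour, consistent with the six-fold ratio mentioned earlier, and a 720-hour billing month. The function name is hypothetical; check current pricing before quoting the result to finance.

```shell
# Monthly control plane cost delta for clusters parked in extended support.
# Assumed rates in US cents per cluster-hour: 10 standard, 60 extended.
# Assumed billing month: 720 hours (30 days).
extended_support_delta() {
  clusters=$1
  hours=${2:-720}
  std_cents=10
  ext_cents=60
  # Integer math in cents, converted to whole dollars at the end.
  echo $(( clusters * hours * (ext_cents - std_cents) / 100 ))
}

extended_support_delta 6   # prints 2160 (six clusters, one month)
```

The second argument overrides the hour count if you want to price a partial month.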

Check exactly where every cluster sits before you plan the wave:

# What versions exist, and which are in extended support?
aws eks describe-cluster-versions \
  --query 'clusterVersions[].{Version:clusterVersion,Status:status,Support:versionStatus,EndStd:endOfStandardSupportDate}' \
  --output table

# Inventory the fleet's current versions in one pass
for c in $(aws eks list-clusters --query 'clusters[]' --output text); do
  v=$(aws eks describe-cluster --name "$c" --query 'cluster.version' --output text)
  printf '%-28s %s\n' "$c" "$v"
done

Rule of thumb: never carry more than two distinct minor versions across the fleet at once. The more spread you allow, the more your add-on compatibility matrix and tooling fork.
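The rule of thumb can be turned into a check on the inventory output. This is a sketch, assuming input lines in the "name version" shape the inventory loop prints; the function names and sample cluster names are hypothetical.

```shell
# Reads "cluster-name version" lines on stdin and prints the number of
# distinct minor versions seen (blank or malformed lines are skipped).
count_minors() {
  awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' '
}

# Succeeds only if the fleet spans at most $1 distinct minors (default 2).
fleet_spread_ok() {
  limit=${1:-2}
  n=$(count_minors)
  echo "distinct minor versions: $n"
  [ "$n" -le "$limit" ]
}

# Sample data standing in for the inventory loop's output.
printf '%s\n' 'platform-prod 1.32' 'platform-stage 1.32' 'data-prod 1.31' \
  | fleet_spread_ok
```

Piping the real inventory loop into fleet_spread_ok gives a non-zero exit status whenever the spread exceeds the limit, which is enough to fail a scheduled job.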

2. Pre-upgrade readiness: hunt down removed and deprecated APIs

The number one cause of a “successful” upgrade that breaks workloads is a removed API. Kubernetes removes superseded beta APIs on minor bumps, and any manifest, Helm chart, or controller still calling the old group/version simply stops working. Catch this before you touch the control plane.

Two tools cover this. kube-no-trouble (kubent) scans live cluster state and Helm releases; pluto scans both live clusters and static manifests/charts in CI.

# kube-no-trouble: scan the live cluster (Helm v3 + collected manifests)
kubent --context platform-prod

# Pluto: detect deprecated/removed APIs against a TARGET version
pluto detect-all-in-cluster --target-versions k8s=v1.32

# Pluto in CI: scan rendered manifests before they ever reach a cluster
helm template ./charts/payments | pluto detect - --target-versions k8s=v1.32

Both report the offending object, the deprecated apiVersion, and the version where it is removed. The fix is almost always a chart bump or an apiVersion rewrite, for example policy/v1beta1 PodDisruptionBudget to policy/v1, or an old autoscaling/v2beta2 HPA to autoscaling/v2. Remediate, redeploy, and re-scan until clean.
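In CI the scan result has to become a pass or fail decision. The sketch below maps Pluto's exit status to one, assuming the exit codes as I understand them from Pluto's documentation (0 clean, 1 error, 2 deprecated API found, 3 removed API found); confirm them for the Pluto version you run. The function name and the warn-versus-block policy are illustrative choices, not something Pluto prescribes.

```shell
# Maps a Pluto exit code to a CI decision.
# Assumed codes: 0 clean, 1 error, 2 deprecated found, 3 removed found.
pluto_gate() {
  case "$1" in
    0) echo "pass" ;;
    2) echo "warn: deprecated API, schedule the fix" ;;
    3) echo "block: removed API in target version"; return 1 ;;
    *) echo "block: unexpected exit code $1"; return 1 ;;
  esac
}

# Hypothetical CI usage, once per chart:
#   helm template ./charts/payments | pluto detect - --target-versions k8s=v1.32
#   pluto_gate $? || exit 1
pluto_gate 2
```

A stricter team can return 1 for the deprecated case as well, which matches the "re-scan until clean" standard before a production ring.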

EKS also runs upgrade insights server-side. Pull them as a hard gate — they flag deprecated API usage observed by the control plane itself, which catches clients you forgot existed:

aws eks list-insights --cluster-name platform-prod \
  --filter '{"categories":["UPGRADE_READINESS"]}'

aws eks describe-insight --cluster-name platform-prod --id <insight-id> \
  --query 'insight.{Name:name,Status:insightStatus.status,Reason:insightStatus.reason}'

Treat any insight not in PASSING as a release blocker.
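That blocker rule is easy to script. A minimal sketch, assuming the statuses arrive as whitespace-separated text (which is how --output text renders a list); the function name is hypothetical.

```shell
# Reads insight statuses on stdin (tab- or newline-separated) and fails
# if any of them is not PASSING. Empty input passes, so make sure the
# upstream query actually returned insights.
insights_gate() {
  awk '
    { for (i = 1; i <= NF; i++) if ($i != "PASSING") { bad++; print "blocker: " $i } }
    END { exit (bad > 0 ? 1 : 0) }
  '
}

# Hypothetical usage against a live cluster:
#   aws eks list-insights --cluster-name platform-prod \
#     --filter '{"categories":["UPGRADE_READINESS"]}' \
#     --query 'insights[].insightStatus.status' --output text | insights_gate
printf 'PASSING\tWARNING\n' | insights_gate || echo "gate closed"
```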

3. Upgrade the control plane and respect the skip-version rules

EKS upgrades the control plane one minor version at a time — you cannot jump from 1.30 to 1.32 in a single API call. To cross two versions you issue two sequential updates, each completing before the next.

aws eks update-cluster-version \
  --name platform-prod \
  --kubernetes-version 1.32

# Watch the update to completion (status goes InProgress -> Successful)
aws eks describe-update \
  --name platform-prod \
  --update-id <update-id> \
  --query 'update.{Status:status,Type:type,Errors:errors}'
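Crossing more than one minor means repeating that call once per version. The helper below is a sketch that lists the intermediate targets in order; it assumes plain "1.<minor>" version strings and its name is hypothetical.

```shell
# Prints each target version, one per line, needed to move the control
# plane from $1 to $2. Prints nothing if $1 is already at or past $2.
upgrade_path() {
  from_minor=${1#1.}
  to_minor=${2#1.}
  m=$(( from_minor + 1 ))
  while [ "$m" -le "$to_minor" ]; do
    echo "1.$m"
    m=$(( m + 1 ))
  done
}

upgrade_path 1.30 1.32   # prints 1.31, then 1.32
```

Each printed version corresponds to one update-cluster-version call, and the previous update should report Successful before the next one starts.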

The ordering rule that trips people up is kubelet skew. On EKS 1.28 and newer, managed and Fargate nodes tolerate the control plane being up to three minor versions ahead of the kubelet. So you can advance the control plane 1.29 -> 1.30 -> 1.31 while nodes stay on 1.29, then catch the nodes up after. It does not let you skip control plane versions — only the data plane is allowed to lag. The correct order of operations is always:

  1. Control plane up one minor version (repeat as needed).
  2. Node groups (kubelet) up to a version within the skew window.
  3. kube-proxy last, never newer than the control plane and no more than three minors behind it.
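The skew rule in that list reduces to a small comparison of minor versions. A sketch, assuming version strings whose second dot-separated field is the minor (1.32, v1.29.8-eks-a737599, v1.32.0-eksbuild.2); the function name and the sample kubelet build string are hypothetical.

```shell
# Returns 0 when the component (kubelet or kube-proxy) is not newer than
# the control plane and at most $3 minors behind it (default 3).
skew_ok() {
  cp_minor=$(echo "$1" | cut -d. -f2)
  node_minor=$(echo "$2" | cut -d. -f2)
  max_lag=${3:-3}
  lag=$(( cp_minor - node_minor ))
  [ "$lag" -ge 0 ] && [ "$lag" -le "$max_lag" ]
}

skew_ok 1.32 v1.29.8-eks-a737599 && echo "legal skew"      # lag of 3
skew_ok 1.32 1.28                || echo "too far behind"  # lag of 4
```

Running it before each control plane bump, with the target version as the first argument and the oldest kubelet in the cluster as the second, shows whether the nodes have to move first.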

EKS pre-flight checks the control plane upgrade for you: it needs up to five free IP addresses in the subnets you specified at cluster creation and will refuse the upgrade if they are exhausted, so confirm subnet headroom first.

4. Reconcile EKS managed add-ons and version skew

The four add-ons that gate a clean upgrade are VPC CNI, CoreDNS, kube-proxy, and the EBS CSI driver. Manage them as EKS managed add-ons so AWS exposes a per-version compatibility matrix. Ask AWS which build is compatible with the target version — do not guess:

# What add-on versions are compatible with the target cluster version?
aws eks describe-addon-versions \
  --kubernetes-version 1.32 \
  --addon-name kube-proxy \
  --query 'addons[].addonVersions[].{Version:addonVersion,Default:compatibilities[0].defaultVersion}' \
  --output table

kube-proxy is the strict one: it must not be newer than the control plane minor version, and must not be more than three minors older. CoreDNS and the CSI drivers are looser but still version-gated. Update each add-on to a compatible build, choosing your conflict-resolution mode deliberately:

aws eks update-addon \
  --cluster-name platform-prod \
  --addon-name kube-proxy \
  --addon-version v1.32.0-eksbuild.2 \
  --resolve-conflicts PRESERVE

The --resolve-conflicts flag has three values and the choice matters:

Value     | Behavior                                                       | Use when
NONE      | EKS does not touch changed fields; update may fail on conflict | You want a hard stop if anything was hand-edited
OVERWRITE | EKS resets changed fields to its defaults                      | The add-on config is fully owned by EKS / IaC
PRESERVE  | EKS keeps your out-of-band edits across the update             | You have intentional custom config (e.g. CNI env, CoreDNS Corefile)

If you have tuned the VPC CNI for prefix delegation or edited the CoreDNS Corefile, PRESERVE keeps the upgrade from silently reverting it. Use OVERWRITE only when you are certain EKS defaults are correct — it clobbers any field set through the Kubernetes API rather than the add-on API.

5. Upgrade node groups: managed rolling updates, Karpenter drift, Bottlerocket

Once the control plane is ahead, bring the kubelet forward. Pick the strategy that matches how the nodes were provisioned.

Managed node groups do a rolling, surge-based replacement. EKS launches new nodes on the target version, cordons and drains the old ones respecting PodDisruptionBudgets, then terminates them:

# Bump the AMI/version on a managed node group; EKS rolls it
aws eks update-nodegroup-version \
  --cluster-name platform-prod \
  --nodegroup-name core-2024 \
  --kubernetes-version 1.32

Tune the surge so the roll is fast but bounded. maxUnavailablePercentage caps how many nodes drain at once; without a sane cap a large node group either crawls or evicts too aggressively:

aws eks update-nodegroup-config \
  --cluster-name platform-prod \
  --nodegroup-name core-2024 \
  --update-config maxUnavailablePercentage=10

Karpenter-managed nodes upgrade through drift, not a node-group API. When you change the AMI the NodePool/EC2NodeClass references, Karpenter marks existing nodes drifted and replaces them. Pin the AMI explicitly so upgrades are intentional, not a surprise on the next AMI release:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    # Pin an EKS-optimized AL2023 AMI release; the alias resolves against the cluster's Kubernetes version
    - alias: al2023@v20260601
  role: KarpenterNodeRole-platform-prod

Control the blast radius with a NodePool disruption budget so drift does not recycle the whole fleet at once. A budget scoped to the Drifted reason throttles upgrade churn while leaving normal consolidation alone:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "10%"        # default safety net for all reasons
      - nodes: "3"          # at most 3 nodes drifting at once
        reasons: ["Drifted"]
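To see what those two budgets allow in practice, here is a worked sketch. It assumes, per my reading of the Karpenter docs, that percentage budgets round up and that the most restrictive active budget wins. It ignores nodes that are already deleting or not ready, which the real controller subtracts. The function name is hypothetical.

```shell
# Nodes that may be disrupted for drift at once under a percentage budget
# and a fixed cap: min(ceil(total * pct / 100), cap).
drift_budget() {
  total=$1
  pct=${2:-10}
  cap=${3:-3}
  by_pct=$(( (total * pct + 99) / 100 ))   # ceiling division
  if [ "$by_pct" -lt "$cap" ]; then echo "$by_pct"; else echo "$cap"; fi
}

drift_budget 12   # ceil(1.2) = 2, under the cap, prints 2
drift_budget 80   # ceil(8.0) = 8, capped, prints 3
```

On a small NodePool the percentage budget is the binding one; past roughly 30 nodes the fixed cap of 3 takes over.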

Bottlerocket nodes can be driven the same ways, plus the in-cluster Bottlerocket update operator (BRUPOP), which coordinates host updates and reboots while respecting PDBs — useful when you want OS patches decoupled from the Kubernetes minor bump.

6. Drain safely: PodDisruptionBudgets, surge, and autoscaler interplay

Draining is where availability is won or lost. The control is the PodDisruptionBudget. Every critical workload needs one, or a node drain can take all replicas down at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
spec:
  minAvailable: 2          # never let eviction drop below 2 healthy pods
  selector:
    matchLabels:
      app: checkout

Two failure modes to design out:

Watch evictions live and pounce on anything stuck:

kubectl get events -A --field-selector reason=Evicted --watch
# A drain that won't progress almost always means an unsatisfiable PDB:
kubectl get pdb -A -o wide
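An unsatisfiable PDB shows up as zero allowed disruptions, for example a single replica behind minAvailable: 1. The sketch below filters for those rows; it assumes three whitespace-separated columns (namespace, name, allowed disruptions) and the function name and sample rows are hypothetical.

```shell
# Reads "namespace name disruptionsAllowed" rows on stdin and prints the
# PDBs that currently allow zero evictions; a drain will block on these.
blocked_pdbs() {
  awk 'NF >= 3 && $3 == "0" { print $1 "/" $2 }'
}

# Hypothetical usage against a live cluster:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
#     | blocked_pdbs
printf '%s\n' 'shop checkout 1' 'billing invoices 0' | blocked_pdbs
```

Running this before a node roll, not during it, turns a stalled drain into a ticket for the owning team.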

7. Fleet-scale orchestration: GitOps and staged ring rollouts

One cluster is a runbook. Forty clusters is an orchestration problem, and clicking through them by hand guarantees drift and human error. Two patterns make a fleet tractable.

GitOps as the source of truth. Express cluster and add-on versions declaratively (EKS Blueprints / Terraform for the cluster, Argo CD or Flux for in-cluster add-ons), so an upgrade is a reviewed pull request, not a command run from someone’s laptop:

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "platform-prod"
  cluster_version = "1.32"   # the upgrade is this one-line diff, reviewed in a PR

  cluster_addons = {
    coredns    = { addon_version = "v1.11.4-eksbuild.1" }
    kube-proxy = { addon_version = "v1.32.0-eksbuild.2", resolve_conflicts_on_update = "PRESERVE" }
    vpc-cni    = { addon_version = "v1.19.2-eksbuild.1", resolve_conflicts_on_update = "PRESERVE" }
    aws-ebs-csi-driver = { addon_version = "v1.38.1-eksbuild.1" }
  }
}

Ring rollouts. Never move the whole fleet at once. Promote the version through rings, gating each ring on the previous one passing its smoke tests:

Ring       | Scope               | Gate to promote
0 (canary) | 1 non-prod cluster  | kubent/pluto clean, smoke tests green, soak 24h
1 (early)  | low-traffic prod    | no SLO regression, soak 48h
2 (broad)  | bulk of prod        | error budget intact
3 (final)  | highest-criticality | change window, full sign-off

Encode the ring as a variable so the same module rolls each tier on its own schedule — you upgrade by merging the version bump ring by ring, never a fleet-wide script.

Verify

A control plane that reports Successful is necessary, not sufficient. Verify the whole stack agrees on the new version and that workloads are healthy.

Confirm every layer landed on the target version and the skew is legal:

# Control plane
kubectl version -o json | jq '.serverVersion.gitVersion'

# Every node's kubelet (must be within 3 minors of the control plane)
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'

# Add-ons all ACTIVE and on the compatible build
# (--output text puts all names on one tab-separated line; split before xargs)
aws eks list-addons --cluster-name platform-prod --query 'addons[]' --output text \
  | tr '\t' '\n' \
  | xargs -I{} aws eks describe-addon --cluster-name platform-prod \
      --addon-name {} --query 'addon.{Addon:addonName,Ver:addonVersion,Status:status}'

Then prove the data plane is healthy, not just present:

# Core system pods Running, no CrashLoopBackOff
kubectl get pods -n kube-system

# DNS resolves end to end (CoreDNS is a common post-upgrade casualty)
kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local

# No nodes stuck cordoned/draining from a stalled roll
kubectl get nodes | grep -i 'SchedulingDisabled' || echo "no cordoned nodes"

# Insights flipped to passing post-upgrade
aws eks list-insights --cluster-name platform-prod \
  --filter '{"categories":["UPGRADE_READINESS"]}' \
  --query 'insights[].{Name:name,Status:insightStatus.status}'

Run your synthetic/smoke suite against the cluster’s ingress and watch the golden signals (latency, error rate, saturation) over the soak window before promoting the next ring.

Rollback boundary, stated plainly: the EKS control plane cannot be downgraded. Once you are on 1.32 you stay on 1.32. Your only “rollback” is forward: roll nodes back to the prior AMI (still in the skew window) and revert add-on versions while you fix the workload. That one-way move is exactly why the readiness scanning in step 2 and the canary ring are non-negotiable.

Enterprise scenario

A media company ran 23 EKS clusters and let the cadence slip during a hiring freeze. Six drifted onto a version that aged out of standard support; the per-cluster control plane bill jumped from ~$72 to ~$432/month and finance flagged the ~$2,160/month surprise. Worse, two were now inside the window where AWS would auto-upgrade them on its own schedule — an unplanned minor bump on a payments-adjacent cluster.

The constraint: a three-person platform team could not take the big-bang risk of catching every cluster up in one weekend. The hidden landmine surfaced when the first canary stalled — a billing service ran a single replica behind a policy/v1beta1 PDB with minAvailable: 1, so the node drain could never complete, and that API version was also removed in the target release. Both problems lived in the same manifest.

They fixed it structurally, not per-cluster. The version became a per-ring Terraform variable so the same module rolled each tier on its own gated schedule, and a CI step ran pluto against every rendered chart so a removed API could never reach a cluster again:

variable "cluster_version" {
  type        = string
  description = "Set per ring; promote ring N+1 only after ring N soaks clean."
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version   # 1.32 in canary first, then promoted by PR
}

The PDB trap was fixed at the source — scale to two replicas and move the manifest to policy/v1. With the canary ring proving each step and pluto gating CI, they caught the remaining five clusters up over three weeks of reviewed pull requests, ended the extended-support charges, and took the auto-upgrade risk off the table. The durable fix was the ring variable plus the CI gate, not a heroic weekend.

Pre-flight and rollout checklist

What makes EKS upgrades boring is sequencing and gating, not heroics: scan before you move, advance the control plane one minor at a time, reconcile add-ons against the version matrix, drain behind PDBs, and promote through rings on a GitOps diff. Do that and a fleet upgrade is a quiet series of pull requests — and nobody on call ever learns it happened.
