A single aws eks update-cluster-version call looks trivial. The risk is never the control plane API call itself, which AWS performs in a managed, rolling fashion. The risk is everything around it: an admission webhook that stops answering, a CSI driver that skews past the API server, a DaemonSet that won’t drain because nobody set a PodDisruptionBudget, and the slow bleed of clusters parked on a version that slid into extended support at six times the hourly rate. Across a fleet those small risks multiply by cluster count. This is the runbook I use to move a fleet forward one minor version at a time without paging anyone.
Assume EKS 1.31+ as a baseline and kubectl, eksctl, and AWS CLI v2 on the operator workstation throughout.
1. Understand the version lifecycle before you plan anything
EKS supports each Kubernetes minor version for a fixed window, and the window is what drives your cadence — not feature envy.
- Standard support lasts roughly 14 months from when the version becomes available in EKS. Control plane cost is $0.10 per cluster per hour (~$72/month).
- Extended support then runs a further 12 months. The control plane price jumps to $0.60 per cluster per hour (~$432/month) — a 6x increase. Worker node and data transfer costs are unchanged; only the per-cluster control plane charge moves.
- A version that exits extended support is auto-upgraded by AWS to the next minor on a schedule you do not control. You do not want that to be your upgrade strategy.
The practical takeaway: standard support gives you runway to land roughly one minor upgrade per quarter and stay perpetually within it. The moment a cluster crosses into extended support, every idle month is real money — and across a fleet that delta becomes a budget line, not a rounding error.
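To make the delta concrete, here is a minimal sketch of the arithmetic using the two hourly rates quoted above. The 720-hour month is an assumption (30 days, which is what the ~$72 and ~$432 figures imply), and the function names are hypothetical helpers, not part of any AWS tooling.

```shell
# Rates in cents per cluster per hour, so the arithmetic stays in integers.
STANDARD_CENTS_PER_HOUR=10
EXTENDED_CENTS_PER_HOUR=60
HOURS_PER_MONTH=720   # assumption: 30-day month

# monthly_cost_usd <cents_per_hour> <cluster_count>
monthly_cost_usd() {
  echo $(( $1 * HOURS_PER_MONTH * $2 / 100 ))
}

# extended_premium_usd <cluster_count>
# Extra dollars per month for clusters in extended support versus standard.
extended_premium_usd() {
  ext=$(monthly_cost_usd "$EXTENDED_CENTS_PER_HOUR" "$1")
  std=$(monthly_cost_usd "$STANDARD_CENTS_PER_HOUR" "$1")
  echo $(( ext - std ))
}

extended_premium_usd 6   # prints 2160
```

Six clusters left in extended support cost about $2,160 a month more than the same six in standard support, which is the kind of number that gets a cadence funded.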
Check exactly where every cluster sits before you plan the wave:
# What versions exist, and which are in extended support?
aws eks describe-cluster-versions \
--query 'clusterVersions[].{Version:clusterVersion,Status:status,Support:versionStatus,EndStd:endOfStandardSupportDate}' \
--output table
# Inventory the fleet's current versions in one pass
for c in $(aws eks list-clusters --query 'clusters[]' --output text); do
  v=$(aws eks describe-cluster --name "$c" --query 'cluster.version' --output text)
  printf '%-28s %s\n' "$c" "$v"
done
Rule of thumb: never carry more than two distinct minor versions across the fleet at once. The more spread you allow, the more your add-on compatibility matrix and tooling fork.
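The two-version rule is easy to check mechanically. The sketch below assumes input in the same "name version" shape the inventory loop prints; `distinct_minors`, `spread_ok`, and the `MAX_MINORS` override are hypothetical names introduced for illustration.

```shell
# distinct_minors: reads "<cluster> <version>" lines on stdin and prints
# how many distinct versions appear in the second column.
distinct_minors() {
  awk 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (v in seen) n++; print n }'
}

# spread_ok: exits 0 when the fleet carries at most MAX_MINORS distinct
# versions (default 2), non-zero otherwise.
spread_ok() {
  [ "$(distinct_minors)" -le "${MAX_MINORS:-2}" ]
}

# Example with a hand-written inventory:
if printf 'platform-prod 1.32\nplatform-stage 1.32\nlegacy-batch 1.31\n' | spread_ok; then
  echo "version spread within limit"
else
  echo "too many minor versions in the fleet"
fi
```

Piping the inventory loop into `spread_ok` gives a one-line check you can run before planning a wave.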
2. Pre-upgrade readiness: hunt down removed and deprecated APIs
The number one cause of a “successful” upgrade that breaks workloads is a removed API. Kubernetes removes superseded beta APIs on minor bumps, and any manifest, Helm chart, or controller still calling the old group/version simply stops working. Catch this before you touch the control plane.
Two tools cover this. kube-no-trouble (kubent) scans live cluster state and Helm releases; pluto scans both live clusters and static manifests/charts in CI.
# kube-no-trouble: scan the live cluster (Helm v3 + collected manifests)
kubent --context platform-prod
# Pluto: detect deprecated/removed APIs against a TARGET version
pluto detect-all-in-cluster --target-versions k8s=v1.32
# Pluto in CI: scan rendered manifests before they ever reach a cluster
helm template ./charts/payments | pluto detect - --target-versions k8s=v1.32
Both report the offending object, the deprecated apiVersion, and the version where it is removed. The fix is almost always a chart bump or an apiVersion rewrite, for example policy/v1beta1 PodDisruptionBudget to policy/v1, or an old autoscaling/v2beta2 HPA to autoscaling/v2. Remediate, redeploy, and re-scan until clean.
EKS also runs upgrade insights server-side. Pull them as a hard gate — they flag deprecated API usage observed by the control plane itself, which catches clients you forgot existed:
aws eks list-insights --cluster-name platform-prod \
--filter '{"categories":["UPGRADE_READINESS"]}'
aws eks describe-insight --cluster-name platform-prod --id <insight-id> \
--query 'insight.{Name:name,Status:insightStatus.status,Reason:insightStatus.reason}'
Treat any insight not in PASSING as a release blocker.
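One way to turn that rule into a pipeline gate is a small filter over the status values. This is a sketch, not an AWS-provided tool: `insights_gate` is a hypothetical helper, and treating an empty result as a failure is my own design choice (insights that have not been computed yet should not read as a pass).

```shell
# insights_gate: reads whitespace-separated insight statuses on stdin.
# Exits 0 only if at least one status was read and every one is PASSING.
insights_gate() {
  awk '
    { for (i = 1; i <= NF; i++) { total++; if ($i != "PASSING") bad++ } }
    END {
      if (total == 0) { print "no insights returned"; exit 1 }
      if (bad > 0)    { print bad " of " total " insights not PASSING"; exit 1 }
      print "all " total " insights PASSING"
    }'
}

# Intended usage (not run here), reusing the status field queried above:
#   aws eks list-insights --cluster-name platform-prod \
#     --filter '{"categories":["UPGRADE_READINESS"]}' \
#     --query 'insights[].insightStatus.status' --output text | insights_gate

# Example with canned statuses:
printf 'PASSING\tPASSING\tWARNING\n' | insights_gate || echo "blocked: fix insights first"
```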
3. Upgrade the control plane and respect the skip-version rules
EKS upgrades the control plane one minor version at a time — you cannot jump from 1.30 to 1.32 in a single API call. To cross two versions you issue two sequential updates, each completing before the next.
aws eks update-cluster-version \
--name platform-prod \
--kubernetes-version 1.32
# Watch the update to completion (status goes InProgress -> Successful)
aws eks describe-update \
--name platform-prod \
--update-id <update-id> \
--query 'update.{Status:status,Type:type,Errors:errors}'
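Because each hop must be its own update, it helps to compute the sequence up front. A minimal sketch, assuming versions are given as plain "major.minor" strings; `upgrade_path` is a hypothetical helper name.

```shell
# upgrade_path <current> <target>
# Prints every version the control plane must pass through, one per line.
# Assumes "major.minor" input (e.g. 1.30), not full patch versions.
upgrade_path() {
  cur_major=${1%%.*}; cur_minor=${1#*.}
  tgt_major=${2%%.*}; tgt_minor=${2#*.}
  if [ "$cur_major" != "$tgt_major" ] || [ "$cur_minor" -ge "$tgt_minor" ]; then
    echo "target must be a later minor of the same major version" >&2
    return 1
  fi
  m=$cur_minor
  while [ "$m" -lt "$tgt_minor" ]; do
    m=$((m + 1))
    echo "$cur_major.$m"
  done
}

upgrade_path 1.30 1.32
# prints:
# 1.31
# 1.32

# Intended usage (not run here): one update per hop, each finishing first.
#   for v in $(upgrade_path 1.30 1.32); do
#     aws eks update-cluster-version --name platform-prod --kubernetes-version "$v"
#     # poll describe-update until Status is Successful before the next hop
#   done
```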
The ordering rule that trips people up is kubelet skew. On EKS 1.28 and newer, managed and Fargate nodes tolerate the control plane being up to three minor versions ahead of the kubelet. So you can advance the control plane 1.29 -> 1.30 -> 1.31 while nodes stay on 1.29, then catch the nodes up after. It does not let you skip control plane versions — only the data plane is allowed to lag. The correct order of operations is always:
- Control plane up one minor version (repeat as needed).
- Node groups (kubelet) up to a version within the skew window.
- kube-proxy last, never newer than the control plane and no more than three minors behind it.
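The skew rules in the list above reduce to one comparison on minor numbers. The sketch below assumes every version shares the same major and accepts either a bare "1.32" or a longer string such as a kubelet or add-on version; `minor_of`, `skew_ok`, and `MAX_SKEW` are hypothetical names.

```shell
# minor_of <version>
# Extracts the minor number from "1.32", "v1.29.6-eks-1a2b3c4", or
# "v1.32.0-eksbuild.2".
minor_of() {
  echo "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# skew_ok <control_plane_version> <component_version>
# Exits 0 when the component is not newer than the control plane and is at
# most MAX_SKEW (default 3) minor versions behind it.
skew_ok() {
  cp_minor=$(minor_of "$1")
  comp_minor=$(minor_of "$2")
  behind=$((cp_minor - comp_minor))
  [ "$behind" -ge 0 ] && [ "$behind" -le "${MAX_SKEW:-3}" ]
}

if skew_ok 1.32 v1.29.6-eks-1a2b3c4; then
  echo "kubelet within skew"
else
  echo "kubelet outside skew"
fi
```

The same check covers kube-proxy: pass the control plane version and the add-on version string, and a negative or too-large gap fails.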
EKS runs a pre-flight check on the control plane upgrade for you: it needs free IP addresses (up to five, per the EKS documentation) in the subnets you specified when you created the cluster, and the update fails if those subnets are exhausted, so confirm subnet headroom first.
4. Reconcile EKS managed add-ons and version skew
The four add-ons that gate a clean upgrade are VPC CNI, CoreDNS, kube-proxy, and the EBS CSI driver. Manage them as EKS managed add-ons so AWS exposes a per-version compatibility matrix. Ask AWS which build is compatible with the target version — do not guess:
# What add-on versions are compatible with the target cluster version?
aws eks describe-addon-versions \
--kubernetes-version 1.32 \
--addon-name kube-proxy \
--query 'addons[].addonVersions[].{Version:addonVersion,Default:compatibilities[0].defaultVersion}' \
--output table
kube-proxy is the strict one: it must not be newer than the control plane minor version, and must not be more than three minors older. CoreDNS and the CSI drivers are looser but still version-gated. Update each add-on to a compatible build, choosing your conflict-resolution mode deliberately:
aws eks update-addon \
--cluster-name platform-prod \
--addon-name kube-proxy \
--addon-version v1.32.0-eksbuild.2 \
--resolve-conflicts PRESERVE
The --resolve-conflicts flag has three values and the choice matters:
| Value | Behavior | Use when |
|---|---|---|
| NONE | EKS does not touch changed fields; update may fail on conflict | You want a hard stop if anything was hand-edited |
| OVERWRITE | EKS resets changed fields to its defaults | The add-on config is fully owned by EKS / IaC |
| PRESERVE | EKS keeps your out-of-band edits across the update | You have intentional custom config (e.g. CNI env, CoreDNS Corefile) |
If you have tuned the VPC CNI for prefix delegation or edited the CoreDNS Corefile, PRESERVE keeps the upgrade from silently reverting it. Use OVERWRITE only when you are certain EKS defaults are correct — it clobbers any field set through the Kubernetes API rather than the add-on API.
5. Upgrade node groups: managed rolling updates, Karpenter drift, Bottlerocket
Once the control plane is ahead, bring the kubelet forward. Pick the strategy that matches how the nodes were provisioned.
Managed node groups do a rolling, surge-based replacement. EKS launches new nodes on the target version, cordons and drains the old ones respecting PodDisruptionBudgets, then terminates them:
# Bump the AMI/version on a managed node group; EKS rolls it
aws eks update-nodegroup-version \
--cluster-name platform-prod \
--nodegroup-name core-2024 \
--kubernetes-version 1.32
Tune the surge so the roll is fast but bounded. maxUnavailablePercentage caps how many nodes drain at once; without a sane cap a large node group either crawls or evicts too aggressively:
aws eks update-nodegroup-config \
--cluster-name platform-prod \
--nodegroup-name core-2024 \
--update-config maxUnavailablePercentage=10
Karpenter-managed nodes upgrade through drift, not a node-group API. When you change the AMI the NodePool/EC2NodeClass references, Karpenter marks existing nodes drifted and replaces them. Pin the AMI explicitly so upgrades are intentional, not a surprise on the next AMI release:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    # Pin to the EKS-optimized AL2023 AMI for the TARGET version
    - alias: al2023@v20260601
  role: KarpenterNodeRole-platform-prod
Control the blast radius with a NodePool disruption budget so drift does not recycle the whole fleet at once. A budget scoped to the Drifted reason throttles upgrade churn while leaving normal consolidation alone:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "10%" # default safety net for all reasons
      - nodes: "3"   # at most 3 nodes drifting at once
        reasons: ["Drifted"]
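To see what those two budgets allow in practice, here is a rough sketch of the arithmetic. It rests on my reading of the Karpenter disruption documentation: the most restrictive matching budget wins, and a percentage budget rounds up. It also ignores nodes that are already deleting or not ready, which Karpenter subtracts from the allowance. The function names are hypothetical, and the 10% and 3 are the values from the NodePool above.

```shell
# ceil_percent <total> <percent>
# roundup(total * percent / 100) using integer arithmetic only.
ceil_percent() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

# allowed_drift <total_nodes>
# Drift disruptions allowed at once: min(roundup(total * 10%), 3).
# Assumption: no nodes are currently deleting or NotReady.
allowed_drift() {
  by_percent=$(ceil_percent "$1" 10)
  by_count=3
  if [ "$by_percent" -lt "$by_count" ]; then
    echo "$by_percent"
  else
    echo "$by_count"
  fi
}

allowed_drift 12   # prints 2 (10% of 12 rounds up to 2, below the cap of 3)
allowed_drift 50   # prints 3 (10% is 5, capped by the Drifted budget)
```

On small NodePools the percentage budget is the binding one; on large ones the fixed Drifted cap takes over.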
Bottlerocket nodes can be driven the same ways, plus the in-cluster Bottlerocket update operator (BRUPOP), which coordinates host updates and reboots while respecting PDBs — useful when you want OS patches decoupled from the Kubernetes minor bump.
6. Drain safely: PodDisruptionBudgets, surge, and autoscaler interplay
Draining is where availability is won or lost. The control is the PodDisruptionBudget. Every critical workload needs one, or a node drain can take all replicas down at once:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
spec:
  minAvailable: 2 # never let eviction drop below 2 healthy pods
  selector:
    matchLabels:
      app: checkout
Two failure modes to design out:
- A PDB that can never be satisfied, for example minAvailable: 1 on a single-replica Deployment, blocks the drain forever. The node stays cordoned and the upgrade stalls. Audit single-replica workloads with restrictive PDBs before you roll.
- Autoscaler fighting the drain. With Cluster Autoscaler, scaling and your drain can race; pause aggressive scale-down during the wave or rely on managed node group surge for replacements. With Karpenter this is mostly a non-issue (draining a drifted node provisions replacement capacity first and honors PDBs), provided the disruption budget above is set.
Watch evictions live and pounce on anything stuck:
kubectl get events -A --field-selector reason=Evicted --watch
# A drain that won't progress almost always means an unsatisfiable PDB:
kubectl get pdb -A -o wide
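To narrow that listing to the likely culprits, a small filter can pick out PDBs that currently allow zero disruptions. This sketch assumes the default column layout of `kubectl get pdb -A --no-headers` (namespace, name, min available, max unavailable, allowed disruptions, age); `blocked_pdbs` is a hypothetical helper. A zero here can also be temporary, for example while a replacement pod is still starting, so treat the output as a list to investigate, not a verdict.

```shell
# blocked_pdbs: reads `kubectl get pdb -A --no-headers` output on stdin and
# prints "<namespace>/<name>" for each PDB whose allowed disruptions is 0.
# Assumes column 5 is ALLOWED DISRUPTIONS (default, non-wide layout).
blocked_pdbs() {
  awk '$5 == 0 { print $1 "/" $2 }'
}

# Intended usage (not run here):
#   kubectl get pdb -A --no-headers | blocked_pdbs

# Example with canned output:
printf 'payments checkout 2 N/A 1 35d\nbilling invoicer 1 N/A 0 90d\n' | blocked_pdbs
# prints billing/invoicer
```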
7. Fleet-scale orchestration: GitOps and staged ring rollouts
One cluster is a runbook. Forty clusters is an orchestration problem, and clicking through them by hand guarantees drift and human error. Two patterns make a fleet tractable.
GitOps as the source of truth. Express cluster and add-on versions declaratively (EKS Blueprints / Terraform for the cluster, Argo CD or Flux for in-cluster add-ons), so an upgrade is a reviewed pull request, not a command run from someone’s laptop:
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "platform-prod"
  cluster_version = "1.32" # the upgrade is this one-line diff, reviewed in a PR
  cluster_addons = {
    coredns            = { addon_version = "v1.11.4-eksbuild.1" }
    kube-proxy         = { addon_version = "v1.32.0-eksbuild.2", resolve_conflicts_on_update = "PRESERVE" }
    vpc-cni            = { addon_version = "v1.19.2-eksbuild.1", resolve_conflicts_on_update = "PRESERVE" }
    aws-ebs-csi-driver = { addon_version = "v1.38.1-eksbuild.1" }
  }
}
Ring rollouts. Never move the whole fleet at once. Promote the version through rings, gating each ring on the previous one passing its smoke tests:
| Ring | Scope | Gate to promote |
|---|---|---|
| 0 — canary | 1 non-prod cluster | kubent/pluto clean, smoke tests green, soak 24h |
| 1 — early | low-traffic prod | no SLO regression, soak 48h |
| 2 — broad | bulk of prod | error budget intact |
| 3 — final | highest-criticality | change window, full sign-off |
Encode the ring as a variable so the same module rolls each tier on its own schedule — you upgrade by merging the version bump ring by ring, never a fleet-wide script.
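The soak gates from the ring table can be encoded as a small check that a pipeline runs before it opens the next ring's pull request. This is a sketch under assumptions: the soak hours come from the table, rings 2 and 3 have no stated soak so they return 0 here, and the "green" smoke status string and function names are hypothetical.

```shell
# required_soak_hours <ring>
# Soak time from the ring table. Rings 2 and 3 gate on error budget and
# sign-off rather than a stated soak, so this sketch returns 0 for them.
required_soak_hours() {
  case "$1" in
    0) echo 24 ;;
    1) echo 48 ;;
    2|3) echo 0 ;;
    *) echo "unknown ring: $1" >&2; return 1 ;;
  esac
}

# can_promote <ring> <hours_soaked> <smoke_status>
# Exits 0 when smoke tests are green and the ring has soaked long enough.
can_promote() {
  need=$(required_soak_hours "$1") || return 1
  [ "$3" = "green" ] && [ "$2" -ge "$need" ]
}

if can_promote 0 26 green; then
  echo "ring 0 clean: promote to ring 1"
else
  echo "hold: ring 0 gate not met"
fi
```

The SLO and error-budget gates still need a human or a monitoring query; this only keeps anyone from promoting before the soak has elapsed.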
Verify
A control plane that reports Successful is necessary, not sufficient. Verify the whole stack agrees on the new version and that workloads are healthy.
Confirm every layer landed on the target version and the skew is legal:
# Control plane
kubectl version -o json | jq '.serverVersion.gitVersion'
# Every node's kubelet (must be within 3 minors of the control plane)
kubectl get nodes \
-o custom-columns='NODE:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'
# Add-ons all ACTIVE and on the compatible build
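# --output text puts all names on one tab-separated line; split it so
# xargs -I sees one add-on per line
aws eks list-addons --cluster-name platform-prod --query 'addons[]' --output text \
| tr '\t' '\n' \
| xargs -I{} aws eks describe-addon --cluster-name platform-prod \
--addon-name {} --query 'addon.{Addon:addonName,Ver:addonVersion,Status:status}'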
Then prove the data plane is healthy, not just present:
# Core system pods Running, no CrashLoopBackOff
kubectl get pods -n kube-system
# DNS resolves end to end (CoreDNS is a common post-upgrade casualty)
kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
nslookup kubernetes.default.svc.cluster.local
# No nodes stuck cordoned/draining from a stalled roll
kubectl get nodes | grep -i 'SchedulingDisabled' || echo "no cordoned nodes"
# Insights flipped to passing post-upgrade
aws eks list-insights --cluster-name platform-prod \
--filter '{"categories":["UPGRADE_READINESS"]}' \
--query 'insights[].{Name:name,Status:insightStatus.status}'
Run your synthetic/smoke suite against the cluster’s ingress and watch the golden signals (latency, error rate, saturation) over the soak window before promoting the next ring.
Rollback boundary, stated plainly: the EKS control plane cannot be downgraded. Once you are on 1.32 you stay on 1.32. The only "rollback" left is in the data plane: return nodes to the prior AMI (still in the skew window) and revert add-on versions while you fix the workload. That one-way move is exactly why the readiness scanning in step 2, the add-on compatibility checks in step 4, and the canary ring are non-negotiable.
Enterprise scenario
A media company ran 23 EKS clusters and let the cadence slip during a hiring freeze. Six drifted onto a version that aged out of standard support; the per-cluster control plane bill jumped from ~$72 to ~$432/month and finance flagged the ~$2,160/month surprise. Worse, two were now inside the window where AWS would auto-upgrade them on its own schedule — an unplanned minor bump on a payments-adjacent cluster.
The constraint: a three-person platform team could not take the big-bang risk of catching every cluster up in one weekend. The hidden landmine surfaced when the first canary stalled — a billing service ran a single replica behind a policy/v1beta1 PDB with minAvailable: 1, so the node drain could never complete, and that API version was also removed in the target release. Both problems lived in the same manifest.
They fixed it structurally, not per-cluster. The version became a per-ring Terraform variable so the same module rolled each tier on its own gated schedule, and a CI step ran pluto against every rendered chart so a removed API could never reach a cluster again:
variable "cluster_version" {
  type        = string
  description = "Set per ring; promote ring N+1 only after ring N soaks clean."
}
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version # 1.32 in canary first, then promoted by PR
}
The PDB trap was fixed at the source — scale to two replicas and move the manifest to policy/v1. With the canary ring proving each step and pluto gating CI, they caught the remaining five clusters up over three weeks of reviewed pull requests, ended the extended-support charges, and took the auto-upgrade risk off the table. The durable fix was the ring variable plus the CI gate, not a heroic weekend.
Pre-flight and rollout checklist
What makes EKS upgrades boring is sequencing and gating, not heroics: scan before you move, advance the control plane one minor at a time, reconcile add-ons against the version matrix, drain behind PDBs, and promote through rings on a GitOps diff. Do that and a fleet upgrade is a quiet series of pull requests — and nobody on call ever learns it happened.