
EKS Cluster Upgrades: Version Lifecycle, Add-on Compatibility, and Fleet Operations

A single aws eks update-cluster-version call looks trivial. The risk is never the control-plane API call itself, which AWS performs in a managed, rolling fashion behind the scenes. The risk is everything around it: an admission webhook that stops answering mid-upgrade, a CSI driver that skews past the API server, a Deployment whose replicas are all evicted at once because nobody set a PodDisruptionBudget, a policy/v1beta1 object that the target version removed out from under a running controller, and the slow bleed of clusters parked on a version that slid into extended support at six times the hourly rate. On one cluster, those are footguns. Across a forty-cluster fleet they multiply by cluster count and become a budget line and an on-call rotation. This is the runbook I use to move a fleet forward one minor version at a time without paging anyone.

An EKS upgrade is four upgrades that must happen in a fixed order: the control plane, the managed add-ons (VPC CNI, CoreDNS, kube-proxy, EBS CSI), the node groups (the kubelet), and then kube-proxy, revisited last once the nodes have caught up. Get the order wrong (bump nodes before the control plane, or let kube-proxy get newer than the API server) and you manufacture skew violations that the managed APIs partly guard against but a hand-rolled script will happily create. Layer on top of that the one-way door at the centre of the whole thing: the EKS control plane cannot be downgraded. Once you are on 1.32 you stay on 1.32. Your only “rollback” is forward. That single fact is why every gate in this runbook (the deprecated-API scan, the add-on compatibility check, the canary ring, the soak window) is non-negotiable rather than nice-to-have.

By the end you will treat a fleet upgrade as a quiet series of reviewed pull requests, not a heroic weekend. You will know exactly where every cluster sits in its support window, which clusters are bleeding money in extended support, how to hunt down a removed API before it breaks a workload, how to read the add-on compatibility matrix instead of guessing, how to roll node groups and Karpenter-managed capacity without an availability dip, and how to promote a version through rings so a regression stops at one non-prod cluster instead of taking the fleet. Assume EKS 1.31+ as a baseline and kubectl, eksctl, and AWS CLI v2 on the operator workstation throughout. Because this is a reference you will keep open mid-wave, the lifecycle windows, the skew rules, the add-on matrix, the drain failure modes and the ring gates are all laid out as scannable tables — read the prose once, then keep the tables open during the change window.

To frame the whole field before the deep dive, here are the four ordered phases of an EKS upgrade, what each one moves, the hard rule that governs it, and the single thing most likely to bite:

| Phase | What moves | Hard rule | Most common failure |
|---|---|---|---|
| 0 - Readiness | Nothing (scan only) | Every removed/deprecated API remediated first | A policy/v1beta1 PDB removed in the target version |
| 1 - Control plane | API server + etcd (managed) | One minor at a time; needs free subnet IPs | Subnet IP exhaustion refuses the upgrade |
| 2 - Add-ons | VPC CNI, CoreDNS, kube-proxy, EBS CSI | kube-proxy never newer than the control plane | OVERWRITE clobbers a tuned CNI/Corefile |
| 3 - Node groups | kubelet (data plane) | Kubelet within 3 minors of the control plane | Unsatisfiable PDB stalls the drain forever |
| 3b - kube-proxy last | kube-proxy to match nodes | ≤ control plane, ≤3 minors behind | Skew left in place; DNS/networking flakes |

What problem this solves

EKS hides the control plane so completely that the upgrade looks like a one-line version bump — and that is exactly the trap. The managed control-plane roll is the easy, safe part; AWS does it for you with no downtime. The hard part is the blast radius in your cluster: the workloads, controllers, webhooks, CSI drivers and DaemonSets that were written against an API surface that the new Kubernetes minor quietly changed or removed. The update-cluster-version call succeeds, the console says Successful, and three hours later a Helm-managed controller that still calls autoscaling/v2beta2 stops reconciling, or a node drain hangs forever on a single-replica Deployment behind a minAvailable: 1 PDB, and now you are debugging a “successful” upgrade.

What breaks without a disciplined runbook: a removed API silently kills a workload (the number-one cause of a broken upgrade); a drain stalls and a node group roll wedges half-cordoned; an add-on update with --resolve-conflicts OVERWRITE reverts a hand-tuned VPC CNI prefix-delegation config and the cluster runs out of pod IPs; kube-proxy drifts newer than the control plane and networking goes flaky; and — the quiet, expensive one — clusters slide from standard support into extended support and the per-cluster control-plane charge jumps from ~$72 to ~$432 a month while nobody notices, until finance does.

Who hits this: every team running EKS past day one. It bites hardest on fleets (the per-cluster math compounds), on clusters with stateful or single-replica workloads (the PDB traps), on teams that hand-tuned add-ons (the OVERWRITE reversion), and on anyone who let cadence slip during a hiring freeze (the extended-support surprise plus the forced AWS auto-upgrade when a version finally ages out). The fix is almost never heroics — it is sequencing and gating: scan before you move, advance the control plane one minor at a time, reconcile add-ons against the compatibility matrix, drain behind PDBs, and promote through rings on a GitOps diff.

A quick map of the moving parts, who owns each, and the failure class it can cause, so you call the right person fast during a wave:

| Layer | What lives here | Who usually owns it | Failure class it can cause |
|---|---|---|---|
| Control plane (managed) | API server, etcd, scheduler | AWS (platform) | Upgrade refused on subnet IP exhaustion |
| Managed add-ons | CNI, CoreDNS, kube-proxy, CSI | Platform team | Skew violation; tuned config reverted |
| Node groups / kubelet | The data plane VMs | Platform / infra | Drain stalls; surge too aggressive |
| Workload manifests | Deployments, PDBs, HPAs | App teams | Removed-API breakage; unsatisfiable PDB |
| Admission webhooks | Validating/mutating controllers | Platform / security | Webhook unavailable blocks all writes |
| GitOps / IaC | Cluster + add-on versions | Platform team | Drift; unreviewed imperative changes |
| Billing / support tier | Standard vs extended support | FinOps + platform | 6× control-plane cost; forced auto-upgrade |

Learning objectives

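By the end of this article you can place every cluster in its support window and spot the ones paying for extended support, find removed APIs before they break a workload, read the add-on compatibility matrix instead of guessing, roll managed node groups and Karpenter capacity without an availability dip, and promote a version through rings so a regression stops at one non-prod cluster.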

Prerequisites & where this fits

You should already be comfortable operating an EKS cluster: kubectl against a context, reading aws eks describe-cluster output, and the basic objects — Deployments, DaemonSets, Services. You should know what a minor version is (the 1.x in 1.32), that Kubernetes deprecates and then removes beta APIs on minor bumps, and that EKS exposes a managed control plane you never SSH into. Familiarity with Helm (charts render manifests that may carry old apiVersions), with PodDisruptionBudgets, and with at least one of managed node groups or Karpenter will let you apply every section directly. AWS CLI v2, eksctl, kubent, and pluto on your workstation are assumed throughout.

This sits in the day-two / fleet-operations track. It assumes the managed-Kubernetes fundamentals from Understanding Managed Kubernetes: AKS vs EKS vs GKE Compared and the broader day-two checklist in Kubernetes Production Readiness: Day-2 Operations Checklist. It pairs tightly with EKS at Scale: Pod Identity, Karpenter, and Networking and Deploy Karpenter on EKS: Consolidation, Spot, and Disruption Budgets, because how your nodes are provisioned dictates the upgrade strategy. When a drain stalls or DNS breaks post-upgrade, lean on Kubernetes Troubleshooting Methodology: Pods, Nodes, Networking, Storage, RBAC. The Azure-shop equivalent of this exact runbook is AKS Day-Two: Upgrades and Fleet Operations — the sequencing rules rhyme.

Where each tool fits in the upgrade pipeline, so you reach for the right one at the right phase:

| Tool | Phase it serves | What it does | When you run it |
|---|---|---|---|
| aws eks describe-cluster-versions | Plan | Lists versions + support status | Before planning the wave |
| kube-no-trouble (kubent) | Readiness | Scans live cluster + Helm for removed APIs | Pre-upgrade gate, per cluster |
| pluto | Readiness | Scans live clusters and static charts in CI | Pre-upgrade + every CI render |
| EKS upgrade insights | Readiness | Server-side deprecated-API detection | Hard gate; treat non-PASSING as blocker |
| aws eks update-cluster-version | Control plane | Rolls the API server one minor up | Phase 1 |
| aws eks describe-addon-versions | Add-ons | Returns compatible builds for a target | Before each add-on update |
| aws eks update-addon | Add-ons | Updates an add-on with a conflict mode | Phase 2 |
| aws eks update-nodegroup-version | Nodes | Rolling, surge-based managed-node roll | Phase 3 |
| Karpenter drift + budgets | Nodes | Replaces drifted nodes within a budget | Phase 3 (Karpenter fleets) |
| Argo CD / Flux + Terraform | Orchestration | Version as a reviewed declarative diff | All phases, fleet-wide |

Core concepts

Five mental models make every later decision obvious.

An upgrade is four ordered upgrades, not one. The control plane, the add-ons, the kubelet, and kube-proxy move in sequence, each constrained by version-skew rules. The control plane leads; the data plane is allowed to lag (within the skew window); kube-proxy trails. You never bump nodes ahead of the control plane, and you never let kube-proxy get newer than the API server. The platform enforces some of this; a hand-rolled script enforces none of it.

The control plane is a one-way door. EKS upgrades the control plane one minor at a time and cannot downgrade it. To cross two minors you issue two sequential update-cluster-version calls. Once a control-plane upgrade completes there is no aws eks downgrade-cluster-version — it does not exist. “Rollback” means rolling nodes back to the prior AMI (still legal within the skew window) and reverting add-on versions while you fix the workload. This is why readiness scanning and a canary ring are mandatory, not optional.

Kubernetes removes APIs, and removal is silent until something calls it. On a minor bump, superseded beta APIs are removed: policy/v1beta1 PodDisruptionBudget → policy/v1, autoscaling/v2beta2 HPA → autoscaling/v2, old Ingress and CRD groups, and so on. Any manifest, Helm chart, or controller still calling the removed group/version simply stops working after the upgrade — no warning at upgrade time, just a workload that quietly fails to reconcile. You catch this before the control plane moves with kubent, pluto, and server-side insights, or you debug it in production.

Kubelet skew is the lever that lets you stage. On EKS 1.28 and newer, managed and Fargate nodes tolerate the control plane being up to three minor versions ahead of the kubelet. So you can advance the control plane 1.29 → 1.30 → 1.31 while nodes stay on 1.29, then catch nodes up afterward. Skew tolerance applies only to the data plane lagging: it never lets you skip control-plane versions. kube-proxy has its own rule: it must never be newer than the control plane, and it may be at most three minors behind it.

Draining is where availability is won or lost, and the control is the PDB. A node upgrade cordons and drains nodes; the PodDisruptionBudget is what stops a drain from taking all replicas of a workload down at once. A PDB that can never be satisfied (minAvailable: 1 on a single-replica Deployment) blocks the drain forever and wedges the roll. Every critical workload needs a satisfiable PDB; every single-replica workload needs auditing before you roll.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to upgrades |
|---|---|---|---|
| Minor version | The 1.x in 1.32 | Cluster + nodes | You upgrade one minor at a time |
| Standard support | ~14-month full-support window | Per cluster version | Plan cadence to stay inside it |
| Extended support | +12 months at 6× control-plane cost | Per cluster version | Idle months become a budget line |
| Control plane | Managed API server + etcd | AWS-managed | One-way: cannot be downgraded |
| Kubelet skew | Allowed control-plane-ahead-of-node gap | Data plane | Up to 3 minors lets you stage |
| Removed API | A beta group/version deleted on a bump | Manifests/charts/controllers | Silent breakage; scan first |
| Managed add-on | EKS-versioned CNI/CoreDNS/kube-proxy/CSI | Cluster add-on API | Gated by a compatibility matrix |
| --resolve-conflicts | How add-on update treats your edits | Add-on update call | OVERWRITE clobbers tuned config |
| PodDisruptionBudget | Cap on simultaneous evictions | policy/v1 object | Unsatisfiable PDB stalls the drain |
| Surge / maxUnavailable | How many nodes roll at once | Managed node group config | Too high evicts hard; too low crawls |
| Karpenter drift | Node replacement on AMI change | NodePool/EC2NodeClass | Upgrade mechanism for Karpenter |
| Upgrade insights | Server-side readiness checks | EKS control plane | Hard gate; non-PASSING blocks |
| Ring rollout | Staged promotion across the fleet | Your orchestration | A regression stops at one ring |

1. Understand the version lifecycle before you plan anything

EKS supports each Kubernetes minor version for a fixed window, and the window — not feature envy — is what drives your cadence.

The practical takeaway: standard support gives you runway to land roughly one minor upgrade per quarter and stay perpetually within it. The moment a cluster crosses into extended support, every idle month is real money — and across a fleet that delta becomes a budget line, not a rounding error.

The lifecycle phases, what each costs, and what you should be doing in each:

| Phase | Duration (approx) | Control-plane cost | AWS behaviour | Your action |
|---|---|---|---|---|
| Standard support | ~14 months | $0.10/hr (~$72/mo) | Full support, patches | Upgrade ~1 minor/quarter; stay inside |
| Extended support | +12 months | $0.60/hr (~$432/mo) | Security backports only | Treat as a deadline; budget the 6× |
| End of extended | (forced bump) | | Auto-upgrades to next minor | Never reach here on purpose |
| Standard window of next minor | resets ~14 months | $0.10/hr | New runway | Land here before the old one expires |

The cost delta made concrete across a fleet — this is the table finance actually reacts to:

| Clusters in extended support | Monthly delta vs standard | Annualised delta | Equivalent |
|---|---|---|---|
| 1 | ~$360 | ~$4,320 | A small managed service |
| 6 | ~$2,160 | ~$25,920 | A junior engineer’s tooling budget |
| 12 | ~$4,320 | ~$51,840 | A meaningful line in the cloud bill |
| 23 | ~$8,280 | ~$99,360 | An “explain this” finance escalation |
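The arithmetic behind that table is simple enough to keep as a helper. This is a minimal sketch, assuming the article's rates ($0.10/hr standard, $0.60/hr extended) and a 720-hour month, which is the approximation that yields ~$72 and ~$432. The function name is hypothetical.

```shell
# Sketch: monthly extended-support premium for N clusters, in whole dollars.
# Assumptions: $0.10/hr standard, $0.60/hr extended, 720 hours per month.
extended_support_delta() {
  clusters=$1
  standard_cents=10   # control-plane cents per hour in standard support
  extended_cents=60   # control-plane cents per hour in extended support
  hours=720           # assumed hours per month
  echo $(( clusters * (extended_cents - standard_cents) * hours / 100 ))
}

# Usage: extended_support_delta 12   # monthly delta in dollars for 12 clusters
```

Multiply the result by 12 for the annualised column.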

Check exactly where every cluster sits before you plan the wave:

# What versions exist, and which are in extended support?
aws eks describe-cluster-versions \
  --query 'clusterVersions[].{Version:clusterVersion,Status:status,Support:versionStatus,EndStd:endOfStandardSupportDate}' \
  --output table

# Inventory the fleet's current versions in one pass
for c in $(aws eks list-clusters --query 'clusters[]' --output text); do
  v=$(aws eks describe-cluster --name "$c" --query 'cluster.version' --output text)
  printf '%-28s %s\n' "$c" "$v"
done

Turn that inventory into an action list by ranking clusters on urgency — the wave plan falls out of this table:

| Cluster state | Support status | Urgency | Action this quarter |
|---|---|---|---|
| On N or N-1, standard | Healthy | Low | Routine: roll one minor on cadence |
| On N-2, standard, near end | Aging | Medium | Schedule before standard window closes |
| In extended support | Costing 6× | High | Prioritise; stop the bleed |
| Near end of extended | Auto-upgrade imminent | Critical | Drop everything; AWS will move it for you |
| Two+ minors behind fleet | Tooling fork risk | Medium | Catch up; cap fleet spread at 2 minors |

Rule of thumb: never carry more than two distinct minor versions across the fleet at once. The more spread you allow, the more your add-on compatibility matrix and tooling fork.
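One way to make that rule of thumb checkable is to count distinct minors in the inventory output. The sketch below assumes input lines shaped like the inventory loop prints them (cluster name, whitespace, version). Both function names are hypothetical.

```shell
# Reads "cluster version" lines on stdin and prints how many distinct
# minor versions the fleet is carrying.
fleet_minor_spread() {
  awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' '
}

# Succeeds only when the fleet carries at most two distinct minors.
fleet_spread_ok() {
  [ "$(fleet_minor_spread)" -le 2 ]
}

# Usage: pipe the inventory loop's output into the check, e.g.
#   ./inventory.sh | fleet_spread_ok || echo "fleet spread exceeds 2 minors"
```

Here inventory.sh stands in for wherever you keep the inventory loop; it is not a real tool.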

2. Pre-upgrade readiness: hunt down removed and deprecated APIs

The number-one cause of a “successful” upgrade that breaks workloads is a removed API. Kubernetes removes superseded beta APIs on minor bumps, and any manifest, Helm chart, or controller still calling the old group/version simply stops working. Catch this before you touch the control plane.

Two tools cover this. kube-no-trouble (kubent) scans live cluster state and Helm releases; pluto scans both live clusters and static manifests/charts in CI.

# kube-no-trouble: scan the live cluster (Helm v3 + collected manifests)
kubent --context platform-prod

# Pluto: detect deprecated/removed APIs against a TARGET version
pluto detect-all-in-cluster --target-versions k8s=v1.32

# Pluto in CI: scan rendered manifests before they ever reach a cluster
helm template ./charts/payments | pluto detect - --target-versions k8s=v1.32

Both report the offending object, the deprecated apiVersion, and the version where it is removed. The fix is almost always a chart bump or an apiVersion rewrite. The migrations you will hit most, with the version each beta group is removed in:

| Old apiVersion | Kind | Replace with | Removed in | Typical source |
|---|---|---|---|---|
| policy/v1beta1 | PodDisruptionBudget | policy/v1 | 1.25 | Hand-written manifests, old charts |
| autoscaling/v2beta2 | HorizontalPodAutoscaler | autoscaling/v2 | 1.26 | Legacy HPA definitions |
| autoscaling/v2beta1 | HorizontalPodAutoscaler | autoscaling/v2 | 1.25 | Very old HPAs |
| batch/v1beta1 | CronJob | batch/v1 | 1.25 | Older job manifests |
| discovery.k8s.io/v1beta1 | EndpointSlice | discovery.k8s.io/v1 | 1.25 | Service-mesh / controller internals |
| networking.k8s.io/v1beta1 | Ingress / IngressClass | networking.k8s.io/v1 | 1.22 | Ancient ingress definitions |
| flowcontrol.apiserver.k8s.io/v1beta2 | FlowSchema / PriorityLevel | .../v1 | 1.29 | APF config |
| flowcontrol.apiserver.k8s.io/v1beta3 | FlowSchema / PriorityLevel | .../v1 | 1.32 | APF config (newer) |
| apiextensions.k8s.io/v1beta1 | CustomResourceDefinition | apiextensions.k8s.io/v1 | 1.22 | Old operator CRDs |
| admissionregistration.k8s.io/v1beta1 | Validating/MutatingWebhookConfiguration | .../v1 | 1.22 | Old webhook configs |
| coordination.k8s.io/v1beta1 | Lease | coordination.k8s.io/v1 | 1.22 | Leader-election internals |
| rbac.authorization.k8s.io/v1beta1 | Role / ClusterRole / bindings | .../v1 | 1.22 | Legacy RBAC manifests |
| storage.k8s.io/v1beta1 | CSIStorageCapacity | storage.k8s.io/v1 | 1.27 | CSI driver internals |
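The go/no-go reasoning behind that table is "removed at or before the target minor". A minimal sketch of that comparison follows, with the table encoded as a lookup. It is a simplification: it keys on apiVersion only and ignores Kind, so the storage.k8s.io/v1beta1 entry reflects the CSIStorageCapacity row alone. The function names are hypothetical, and this is no substitute for kubent or pluto.

```shell
# Prints the minor in which a beta apiVersion is removed, per the table above.
# Unknown apiVersions print nothing.
removed_in() {
  case "$1" in
    networking.k8s.io/v1beta1|apiextensions.k8s.io/v1beta1) echo 1.22 ;;
    admissionregistration.k8s.io/v1beta1|coordination.k8s.io/v1beta1) echo 1.22 ;;
    rbac.authorization.k8s.io/v1beta1) echo 1.22 ;;
    policy/v1beta1|autoscaling/v2beta1|batch/v1beta1) echo 1.25 ;;
    discovery.k8s.io/v1beta1) echo 1.25 ;;
    autoscaling/v2beta2) echo 1.26 ;;
    storage.k8s.io/v1beta1) echo 1.27 ;;   # CSIStorageCapacity only
    flowcontrol.apiserver.k8s.io/v1beta2) echo 1.29 ;;
    flowcontrol.apiserver.k8s.io/v1beta3) echo 1.32 ;;
  esac
}

# Succeeds when the apiVersion is gone at (or before) the target minor.
# Assumes versions of the form 1.<minor>.
is_removed_at() {
  gone=$(removed_in "$1")
  [ -n "$gone" ] && [ "${gone#1.}" -le "${2#1.}" ]
}

# Usage: is_removed_at flowcontrol.apiserver.k8s.io/v1beta3 1.32 && echo "blocker"
```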

How the three scanners differ, and why you run all three rather than picking one:

| Scanner | Scans | Sees CI charts? | Sees live clients? | Role in the gate |
|---|---|---|---|---|
| kubent | Live cluster state + Helm v3 releases | No | Partially (stored manifests) | Fast per-cluster pre-flight |
| pluto | Live clusters and static manifests/charts | Yes (rendered templates) | No | CI gate + pre-flight |
| EKS upgrade insights | Server-side, control-plane-observed API calls | No | Yes (actual requests) | Catches clients you forgot exist |

EKS also runs upgrade insights server-side. Pull them as a hard gate — they flag deprecated API usage observed by the control plane itself, which catches clients you forgot existed:

aws eks list-insights --cluster-name platform-prod \
  --filter '{"categories":["UPGRADE_READINESS"]}'

aws eks describe-insight --cluster-name platform-prod --id <insight-id> \
  --query 'insight.{Name:name,Status:insightStatus.status,Reason:insightStatus.reason}'

Treat any insight not in PASSING as a release blocker. The insight statuses and what each means for go/no-go:

| Insight status | Meaning | Gate decision |
|---|---|---|
| PASSING | No deprecated/removed API usage observed | Proceed |
| WARNING | Deprecated (not yet removed) APIs in use | Remediate before this minor compounds |
| ERROR | APIs removed in the target version still called | Block: fix before upgrading |
| UNKNOWN | Insufficient data / recently created | Re-check after a soak; do not assume safe |
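To turn "anything not PASSING is a blocker" into an exit code a pipeline can act on, something like the following could work. It is a sketch: the function name is hypothetical, and the usage comment assumes the list-insights response exposes each status at insights[].insightStatus.status.

```shell
# Reads one insight status per line on stdin.
# Fails (non-zero) if any status is something other than PASSING.
insights_gate() {
  blockers=0
  while read -r status; do
    [ -z "$status" ] && continue
    if [ "$status" != "PASSING" ]; then
      echo "blocker: insight status $status" >&2
      blockers=$((blockers + 1))
    fi
  done
  [ "$blockers" -eq 0 ]
}

# Usage (cluster name taken from the examples above):
#   aws eks list-insights --cluster-name platform-prod \
#     --filter '{"categories":["UPGRADE_READINESS"]}' \
#     --query 'insights[].insightStatus.status' --output text \
#     | tr '\t' '\n' | insights_gate
```

WARNING and UNKNOWN fail the gate here too, which matches the strict reading of the rule; loosen it only deliberately.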

Remediate, redeploy, and re-scan until clean. The readiness checklist as a gate table — every box must be green before Phase 1:

| Readiness gate | How to confirm | Blocks upgrade if… |
|---|---|---|
| No removed APIs in live state | kubent clean | Any object on a removed group/version |
| No removed APIs in CI charts | pluto detect on rendered templates clean | A chart still renders an old apiVersion |
| No removed APIs observed server-side | Insights UPGRADE_READINESS all PASSING | Any insight in ERROR |
| Webhooks tolerate the new minor | Vendor compatibility note checked | Admission controller pinned below target |
| Control-plane subnets have free IPs | describe-subnets available-IP count ≥ 5 (EKS can need up to five) | Subnets exhausted (upgrade refused) |
| CRDs/controllers support target | Operator release notes checked | Controller incompatible with target |

3. Upgrade the control plane and respect the skip-version rules

EKS upgrades the control plane one minor version at a time — you cannot jump from 1.30 to 1.32 in a single API call. To cross two versions you issue two sequential updates, each completing before the next.

aws eks update-cluster-version \
  --name platform-prod \
  --kubernetes-version 1.32

# Watch the update to completion (status goes InProgress -> Successful)
aws eks describe-update \
  --name platform-prod \
  --update-id <update-id> \
  --query 'update.{Status:status,Type:type,Errors:errors}'
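Because each hop must complete before the next, it helps to compute the hop list up front. This is a minimal sketch, assuming versions of the form 1.<minor>; plan_hops is a hypothetical helper name.

```shell
# Prints each control-plane hop from the current minor to the target,
# one per line. Prints nothing if the cluster is already at the target.
plan_hops() {
  current=${1#1.}
  target=${2#1.}
  minor=$((current + 1))
  while [ "$minor" -le "$target" ]; do
    echo "1.$minor"
    minor=$((minor + 1))
  done
}

# Usage sketch: one update per hop, each finishing before the next starts.
#   for v in $(plan_hops 1.30 1.32); do
#     aws eks update-cluster-version --name platform-prod --kubernetes-version "$v"
#     aws eks wait cluster-active --name platform-prod
#   done
```

The wait call is one way to block between hops. Polling describe-update as shown above gives you the error list as well, and in practice you would re-run the readiness gates between hops rather than loop blindly.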

The ordering rule that trips people up is kubelet skew. On EKS 1.28 and newer, managed and Fargate nodes tolerate the control plane being up to three minor versions ahead of the kubelet. So you can advance the control plane 1.29 → 1.30 → 1.31 while nodes stay on 1.29, then catch the nodes up after. It does not let you skip control-plane versions — only the data plane is allowed to lag. The correct order of operations is always:

  1. Control plane up one minor version (repeat as needed).
  2. Node groups (kubelet) up to a version within the skew window.
  3. kube-proxy last, never newer than the control plane and no more than three minors behind it.

The complete skew matrix you must keep legal at all times:

| Component | Allowed relative to control plane | Direction | Violation symptom |
|---|---|---|---|
| kubelet (nodes) | Up to 3 minors behind (EKS 1.28+) | Lag only | Nodes NotReady; pods unschedulable |
| kube-proxy | ≤ control plane, ≤3 minors behind | Lag only, never ahead | Service routing / iptables flakiness |
| kubectl (client) | ±1 minor of the API server | Either way | kubectl warnings; odd API errors |
| CoreDNS | Per add-on compatibility matrix | Version-gated | DNS resolution failures |
| VPC CNI | Per add-on compatibility matrix | Version-gated | Pods stuck ContainerCreating (no IP) |
| Control plane itself | One minor per update; never downgrade | Forward only | API call rejected if you skip a minor |
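The kubelet and kube-proxy rows reduce to the same check: never ahead of the control plane, and no more than a fixed number of minors behind. A small sketch of that check, assuming plain 1.<minor> strings (no patch or build suffixes); skew_ok is a hypothetical name.

```shell
# Succeeds when a component at version $2 is legal against a control plane
# at version $1: never ahead, and at most $3 minors behind (default 3).
skew_ok() {
  plane=${1#1.}
  component=${2#1.}
  max_lag=${3:-3}
  lag=$((plane - component))
  [ "$lag" -ge 0 ] && [ "$lag" -le "$max_lag" ]
}

# Usage: skew_ok 1.32 1.29 && echo "legal"   # control plane 1.32, kubelet 1.29
```

A hand-rolled upgrade script could call this before every node or kube-proxy change, since nothing else will stop it from creating an illegal combination.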

What describe-update reports while the roll is in flight, and what each status means for you:

| Update status | Meaning | What to do |
|---|---|---|
| InProgress | AWS is rolling the control plane | Wait; it is a managed, no-downtime roll |
| Successful | Control plane is on the new minor | Proceed to add-ons (Phase 2) |
| Failed | Pre-flight or roll failed | Read errors[]; commonly subnet IPs |
| Cancelled | Update aborted | Re-check readiness, re-issue |

EKS pre-flight checks the control-plane upgrade for you: it requires free IP addresses in your control-plane subnets and will refuse the upgrade if the subnets are exhausted, so confirm subnet headroom first. The pre-flight conditions EKS enforces, and the fix for each:

| Pre-flight condition | Why it exists | How to confirm | Fix if it fails |
|---|---|---|---|
| Free IPs in control-plane subnets | New ENIs for the upgraded control plane | aws ec2 describe-subnets --query '...AvailableIpAddressCount' | Free IPs / add a larger subnet |
| Security groups allow control-plane traffic | New control-plane ENIs must reach nodes | Cluster SG rules | Restore required 443/10250 rules |
| Cluster in ACTIVE state | No concurrent operation | describe-cluster --query cluster.status | Wait for the in-flight op to finish |
| Subnets in supported AZs | Control plane spans ≥2 AZs | Subnet AZ list | Add a subnet in a second AZ |
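The subnet headroom check is easy to script as a gate. The sketch below assumes a floor of five free addresses per subnet, drawn from AWS guidance that a cluster update can need up to five; treat the floor as a tunable assumption. The function name is hypothetical.

```shell
# Reads one AvailableIpAddressCount per line on stdin.
# Fails if any subnet is below the floor (default 5, an assumption).
subnet_headroom_ok() {
  floor=${1:-5}
  short=0
  while read -r free; do
    [ -z "$free" ] && continue
    if [ "$free" -lt "$floor" ]; then
      echo "subnet below floor: $free free IPs" >&2
      short=1
    fi
  done
  return "$short"
}

# Usage (cluster name taken from the examples above):
#   subnets=$(aws eks describe-cluster --name platform-prod \
#     --query 'cluster.resourcesVpcConfig.subnetIds' --output text)
#   aws ec2 describe-subnets --subnet-ids $subnets \
#     --query 'Subnets[].AvailableIpAddressCount' --output text \
#     | tr '\t' '\n' | subnet_headroom_ok
```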

4. Reconcile EKS managed add-ons and version skew

The four add-ons that gate a clean upgrade are VPC CNI, CoreDNS, kube-proxy, and the EBS CSI driver. Manage them as EKS managed add-ons so AWS exposes a per-version compatibility matrix. Ask AWS which build is compatible with the target version — do not guess:

# What add-on versions are compatible with the target cluster version?
aws eks describe-addon-versions \
  --kubernetes-version 1.32 \
  --addon-name kube-proxy \
  --query 'addons[].addonVersions[].{Version:addonVersion,Default:compatibilities[0].defaultVersion}' \
  --output table

The four gating add-ons, what each does, how strict its version coupling is, and what breaks if it skews:

| Add-on | Role | Version strictness | Symptom if skewed/broken |
|---|---|---|---|
| VPC CNI (vpc-cni) | Assigns pod IPs from the VPC | Looser, but config-sensitive | Pods stuck ContainerCreating, no IP |
| CoreDNS (coredns) | In-cluster DNS | Version-gated, looser | Name resolution fails cluster-wide |
| kube-proxy (kube-proxy) | Service VIP → pod routing (iptables/IPVS) | Strict: ≤ control plane, ≤3 behind | Service traffic blackholes intermittently |
| EBS CSI (aws-ebs-csi-driver) | Dynamic EBS volume provisioning | Version-gated, looser | PVCs stuck Pending; volumes won’t attach |

kube-proxy is the strict one: it must not be newer than the control-plane minor version, and must not be more than three minors older. CoreDNS and the CSI drivers are looser but still version-gated. Update each add-on to a compatible build, choosing your conflict-resolution mode deliberately:

aws eks update-addon \
  --cluster-name platform-prod \
  --addon-name kube-proxy \
  --addon-version v1.32.0-eksbuild.2 \
  --resolve-conflicts PRESERVE

The --resolve-conflicts flag has three values and the choice matters:

| Value | Behavior | Use when |
|---|---|---|
| NONE | EKS does not touch changed fields; update may fail on conflict | You want a hard stop if anything was hand-edited |
| OVERWRITE | EKS resets changed fields to its defaults | The add-on config is fully owned by EKS / IaC |
| PRESERVE | EKS keeps your out-of-band edits across the update | You have intentional custom config (e.g. CNI env, CoreDNS Corefile) |

If you have tuned the VPC CNI for prefix delegation or edited the CoreDNS Corefile, PRESERVE keeps the upgrade from silently reverting it. Use OVERWRITE only when you are certain EKS defaults are correct — it clobbers any field set through the Kubernetes API rather than the add-on API. The custom config that OVERWRITE will silently revert, so you know what is at stake per add-on:

| Add-on | Common custom config | What OVERWRITE does to it | Recommended mode |
|---|---|---|---|
| VPC CNI | ENABLE_PREFIX_DELEGATION, WARM_*, custom networking env | Resets env to EKS defaults → IP-density loss | PRESERVE |
| CoreDNS | Edited Corefile (stub domains, forward, cache) | Reverts to default Corefile → resolution gaps | PRESERVE |
| kube-proxy | Mode (iptables/IPVS), config tuning | Resets to defaults | PRESERVE if tuned, else OVERWRITE |
| EBS CSI | Custom StorageClass params, tolerations | Add-on-managed fields reset | OVERWRITE usually safe |

Confirm every add-on landed ACTIVE on a compatible build before you move to nodes:

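# --output text returns the add-on names tab-separated on one line;
# split them so xargs runs describe-addon once per add-on
aws eks list-addons --cluster-name platform-prod --query 'addons[]' --output text \
  | tr '\t' '\n' \
  | xargs -I{} aws eks describe-addon --cluster-name platform-prod \
      --addon-name {} --query 'addon.{Addon:addonName,Ver:addonVersion,Status:status}'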

The add-on update states and what each means for go/no-go to Phase 3:

| Add-on status | Meaning | Decision |
|---|---|---|
| ACTIVE | Running on the requested build | Good; proceed |
| UPDATING | Roll in progress | Wait |
| DEGRADED | Running but unhealthy | Investigate before nodes |
| CREATE_FAILED / UPDATE_FAILED | Update did not apply | Read health.issues[]; fix and retry |

5. Upgrade node groups: managed rolling updates, Karpenter drift, Bottlerocket

Once the control plane is ahead, bring the kubelet forward. Pick the strategy that matches how the nodes were provisioned.

The three provisioning models, plus Bottlerocket as an OS option that layers on any of them, and how each upgrades. Choose your row, then read its detail:

| Provisioning model | Upgrade mechanism | Blast-radius control | Best for |
|---|---|---|---|
| Managed node group | update-nodegroup-version (surge roll) | maxUnavailable[Percentage] | Stable, statically-sized pools |
| Karpenter | AMI change → drift replacement | NodePool disruption budgets | Dynamic, bin-packed, spot-heavy fleets |
| Self-managed / ASG | Custom (rotate launch template + drain) | Your own automation | Bespoke needs; you own the orchestration |
| Bottlerocket | Any of the above + BRUPOP | PDB-aware operator | OS patches decoupled from the K8s bump |

Managed node groups do a rolling, surge-based replacement. EKS launches new nodes on the target version, cordons and drains the old ones respecting PodDisruptionBudgets, then terminates them:

# Bump the AMI/version on a managed node group; EKS rolls it
aws eks update-nodegroup-version \
  --cluster-name platform-prod \
  --nodegroup-name core-2024 \
  --kubernetes-version 1.32

Tune the surge so the roll is fast but bounded. maxUnavailablePercentage caps how many nodes drain at once; without a sane cap a large node group either crawls or evicts too aggressively:

aws eks update-nodegroup-config \
  --cluster-name platform-prod \
  --nodegroup-name core-2024 \
  --update-config maxUnavailablePercentage=10

The managed-node-group roll knobs and how to reason about each:

| Setting | What it controls | Default | Valid range | When to change |
|---|---|---|---|---|
| maxUnavailable | Absolute nodes down at once | 1 | 1–100 (≤ group size) | Small fixed-size groups |
| maxUnavailablePercentage | % of nodes down at once | | 1–100 | Large groups; bound the churn |
| force (update flag) | Evict even if a PDB blocks | off | on/off | Last resort; breaks PDB guarantees |
| AMI type | EKS-optimized AL2023 / BR / GPU | per group | per group | Match workload + target version |
| Launch template version | Pinned LT for the roll | latest | any LT version | Pin for reproducible rolls |

Karpenter-managed nodes upgrade through drift, not a node-group API. When you change the AMI the NodePool/EC2NodeClass references, Karpenter marks existing nodes drifted and replaces them. Pin the AMI explicitly so upgrades are intentional, not a surprise on the next AMI release:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    # Pin to the EKS-optimized AL2023 AMI for the TARGET version
    - alias: al2023@v20260601
  role: KarpenterNodeRole-platform-prod
  # subnetSelectorTerms and securityGroupSelectorTerms are also required; omitted here

Control the blast radius with a NodePool disruption budget so drift does not recycle the whole fleet at once. A budget scoped to the Drifted reason throttles upgrade churn while leaving normal consolidation alone:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "10%"        # default safety net for all reasons
      - nodes: "3"          # at most 3 nodes drifting at once
        reasons: ["Drifted"]
  # spec.template is also required on a NodePool; omitted here

How Karpenter disruption budgets behave during an upgrade, by the reasons you scope them to:

| Budget reasons | Throttles | Leaves alone | Use during an upgrade |
|---|---|---|---|
| (unset / all) | Every graceful disruption (drift, consolidation) | Nothing | A blanket safety net |
| ["Drifted"] | Only AMI-drift replacement (the upgrade) | Normal consolidation | The precise upgrade throttle |
| ["Empty"] | Reclaiming empty nodes | Drift, underutilized | Rarely for upgrades |
| ["Underutilized"] | Bin-pack consolidation | Drift | Pause cost-churn during a wave |
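To reason about how many nodes the two budgets in the NodePool above let drift at once, a rough calculation helps. This sketch assumes Karpenter rounds percentage budgets up and honours the most restrictive matching budget, and it ignores nodes that are already deleting or not ready. The function name is hypothetical; verify the behaviour against the Karpenter version you run.

```shell
# Approximate simultaneous drift replacements for a NodePool of $1 nodes
# under a 10% all-reasons budget plus a 3-node Drifted budget.
# Assumptions: percentages round up; the smaller budget wins.
drift_allowance() {
  total=$1
  pct_cap=$(( (total * 10 + 99) / 100 ))   # ceil(total * 10%)
  drift_cap=3
  if [ "$pct_cap" -lt "$drift_cap" ]; then
    echo "$pct_cap"
  else
    echo "$drift_cap"
  fi
}

# Usage: drift_allowance 40   # nodes that may be replaced for drift at once
```

On small pools the percentage budget is the binding one; on large pools the fixed Drifted cap takes over.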

Bottlerocket nodes can be driven the same ways, plus the in-cluster Bottlerocket update operator (BRUPOP), which coordinates host updates and reboots while respecting PDBs — useful when you want OS patches decoupled from the Kubernetes minor bump. The node-strategy decision as a quick grid:

| If your nodes are… | Roll them via | Key safety control | Watch for |
|---|---|---|---|
| Managed node groups | update-nodegroup-version | maxUnavailablePercentage | force silently breaking PDBs |
| Karpenter-provisioned | Pin AMI → drift | Drifted disruption budget | Unpinned AMI = surprise drift |
| Bottlerocket + want OS/K8s split | BRUPOP for OS, drift/MNG for K8s | PDB-aware operator | Two cadences to coordinate |
| Self-managed ASG | Rotate LT + cordon/drain | Your automation + PDBs | No platform safety net at all |

6. Drain safely: PodDisruptionBudgets, surge, and autoscaler interplay

Draining is where availability is won or lost. The control is the PodDisruptionBudget. Every critical workload needs one, or a node drain can take all replicas down at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
spec:
  minAvailable: 2          # never let eviction drop below 2 healthy pods
  selector:
    matchLabels:
      app: checkout

PDB sizing by workload shape — pick the row that matches the replica count and criticality:

Workload shape | Replicas | Recommended PDB | Why
Critical, horizontally scaled | ≥3 | minAvailable: <N-1> or maxUnavailable: 1 | Keeps quorum during drain
Stateless web tier | ≥2 | maxUnavailable: 25% | Bounded eviction, fast roll
Single-replica (anti-pattern) | 1 | Scale to 2 first, then a PDB | minAvailable: 1 on 1 replica blocks drain
Quorum system (etcd-like) | 3/5 | maxUnavailable: 1 | Never lose more than one member
Batch/best-effort | any | No PDB or maxUnavailable: 100% | Eviction is acceptable
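The arithmetic behind each row is small enough to sketch. The function below is a hypothetical helper, not a Kubernetes API: it mirrors how a PDB's allowed disruptions follow from the healthy pod count and the budget, handles integer values only (percentages are left out), and shows why minAvailable: 1 on one replica can never be drained.

```python
from typing import Optional


def allowed_evictions(healthy: int, expected: int,
                      min_available: Optional[int] = None,
                      max_unavailable: Optional[int] = None) -> int:
    """Voluntary evictions a PDB would currently permit (integer budgets only).

    healthy  = pods currently ready
    expected = pods the workload is supposed to run (its replica count)
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    # Never negative: a PDB already below its floor simply allows zero evictions.
    return max(healthy - desired_healthy, 0)
```

A single replica with min_available=1 yields 0, which is the wedged drain; scaling to two replicas yields 1, so the drain can proceed one pod at a time.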

The drain failure modes to design out, as a symptom → cause → confirm → fix playbook:

# | Symptom | Root cause | Confirm (exact cmd) | Fix
1 | Node stuck SchedulingDisabled, drain hangs forever | Unsatisfiable PDB (minAvailable: 1 on 1 replica) | kubectl get pdb -A -o wide (ALLOWED DISRUPTIONS = 0) | Scale to ≥2; relax PDB
2 | Drain blocked on one pod, “cannot evict” | PDB at its floor; no headroom | kubectl get pdb <name> -o yaml | Add replicas or maxUnavailable
3 | Pods evicted but never reschedule | No spare capacity / taints mismatch | kubectl get events -A --field-selector reason=FailedScheduling | Scale nodes; fix tolerations
4 | Roll crawls; one node at a time | maxUnavailable=1 on a big group | aws eks describe-nodegroup ... updateConfig | Raise maxUnavailablePercentage
5 | Autoscaler removes the node you’re draining | Cluster Autoscaler racing the drain | CA logs; scale-down events | Pause CA scale-down during the wave
6 | Eviction stalls on a webhook | Admission webhook down mid-roll | kubectl get validatingwebhookconfigurations | Make webhook HA / failurePolicy aware
7 | DaemonSet pods block drain | DaemonSet not drain-tolerant | kubectl drain --ignore-daemonsets | Use --ignore-daemonsets (managed roll does)
8 | Local-storage pod blocks drain | emptyDir/local data on the node | kubectl drain --delete-emptydir-data | Accept data loss or migrate off local

Watch evictions live and pounce on anything stuck:

kubectl get events -A --field-selector reason=Evicted --watch
# A drain that won't progress almost always means an unsatisfiable PDB:
kubectl get pdb -A -o wide
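The same check can be scripted so a wave halts before a drain wedges. This is a rough sketch that assumes you pass it the parsed output of kubectl get pdb -A -o json; the function name is hypothetical, and it relies only on the status.disruptionsAllowed and status.expectedPods fields of the PodDisruptionBudget object.

```python
from typing import List


def blocking_pdbs(pdb_list: dict) -> List[str]:
    """Return namespace/name for every PDB that currently allows zero evictions.

    pdb_list is the parsed JSON from: kubectl get pdb -A -o json
    PDBs that match no pods are skipped, since they cannot block a drain.
    """
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        matches_pods = status.get("expectedPods", 0) > 0
        if matches_pods and status.get("disruptionsAllowed", 0) == 0:
            meta = item["metadata"]
            blocked.append(f'{meta["namespace"]}/{meta["name"]}')
    return sorted(blocked)
```

Run it before each node roll and treat a non-empty result as a stop signal: every name it returns is a workload to scale up or a PDB to relax first.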

How the two autoscalers interact with a drain, side by side:

Aspect | Cluster Autoscaler | Karpenter
Replacement capacity | Reacts after pods go pending | Provisions before draining a drifted node
PDB awareness | Honors PDBs | Honors PDBs + disruption budgets
Upgrade mechanism | Node-group AMI roll | AMI drift
Risk during a wave | Scale-down can race the drain | Mostly self-coordinating
Mitigation | Pause/limit scale-down | Set a Drifted budget

7. Fleet-scale orchestration: GitOps and staged ring rollouts

One cluster is a runbook. Forty clusters is an orchestration problem, and clicking through them by hand guarantees drift and human error. Two patterns make a fleet tractable.

GitOps as the source of truth. Express cluster and add-on versions declaratively (EKS Blueprints / Terraform for the cluster, Argo CD or Flux for in-cluster add-ons), so an upgrade is a reviewed pull request, not a command run from someone’s laptop:

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "platform-prod"
  cluster_version = "1.32"   # the upgrade is this one-line diff, reviewed in a PR

  cluster_addons = {
    coredns    = { addon_version = "v1.11.4-eksbuild.1" }
    kube-proxy = { addon_version = "v1.32.0-eksbuild.2", resolve_conflicts_on_update = "PRESERVE" }
    vpc-cni    = { addon_version = "v1.19.2-eksbuild.1", resolve_conflicts_on_update = "PRESERVE" }
    aws-ebs-csi-driver = { addon_version = "v1.38.1-eksbuild.1" }
  }
}

Imperative versus GitOps-driven upgrades, and why the fleet needs the latter:

Dimension | Imperative (aws eks update-...) | GitOps / Terraform diff
Reviewability | None; runs from a laptop | Pull request with diff + approval
Auditability | Scattered CloudTrail entries | Git history is the audit log
Reproducibility | Per-operator, drift-prone | Same module across every cluster
Rollback of intent | Manual re-run | Revert the commit
Fleet scale | Linear human toil | One module, N clusters via variables
Drift detection | Manual | Argo/Flux reconcile flags drift

Ring rollouts. Never move the whole fleet at once. Promote the version through rings, gating each ring on the previous one passing its smoke tests:

Ring | Scope | Gate to promote
0 (canary) | 1 non-prod cluster | kubent/pluto clean, smoke tests green, soak 24h
1 (early) | low-traffic prod | no SLO regression, soak 48h
2 (broad) | bulk of prod | error budget intact
3 (final) | highest-criticality | change window, full sign-off

Encode the ring as a variable so the same module rolls each tier on its own schedule — you upgrade by merging the version bump ring by ring, never a fleet-wide script. What each ring is actually checking before it lets the version through:

Ring | Soak | Signals watched | Promote only if
0 (canary) | 24h | Smoke suite, pod health, DNS, insights | All green, zero scanner findings
1 (early) | 48h | SLOs (latency, error rate), saturation | No SLO regression vs baseline
2 (broad) | per change policy | Error budget burn rate | Budget intact, no new alerts
3 (final) | change window | Full golden signals + sign-off | Business approval + clean rings 0–2
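The gating logic in the ring tables can be expressed as a small decision function. This is an illustrative sketch only: the RingState type, the ring names as dictionary keys and the gates_green flag (standing in for "smoke tests, SLOs and error budget all pass") are assumptions, and the minimum soak hours are taken from the tables, with no fixed floor for the broad and final rings.

```python
from dataclasses import dataclass
from typing import Dict, Optional

RING_ORDER = ["canary", "early", "broad", "final"]
# Soak floors from the ring tables; broad/final are governed by change policy instead.
MIN_SOAK_HOURS = {"canary": 24, "early": 48, "broad": 0, "final": 0}


@dataclass
class RingState:
    upgraded: bool = False      # has this ring been moved to the target version?
    soak_hours: float = 0.0     # time since the ring finished upgrading
    gates_green: bool = False   # all watched signals for this ring are passing


def next_ring_to_promote(state: Dict[str, RingState]) -> Optional[str]:
    """Return the ring whose version-bump PR may be merged next, or None to hold."""
    for ring in RING_ORDER:
        current = state.get(ring, RingState())
        if not current.upgraded:
            return ring  # every earlier ring has passed its gate
        soaked = current.soak_hours >= MIN_SOAK_HOURS[ring]
        if not (soaked and current.gates_green):
            return None  # hold: this ring has not yet earned promotion
    return None  # the whole fleet is on the target version
```

A check like this fits naturally in CI on the version-bump pull request: the PR for ring N+1 is mergeable only when the function returns that ring.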

Architecture at a glance

The diagram traces an EKS upgrade as it actually flows, left to right, through the four ordered phases plus the orchestration plane that drives them all. Read it as a pipeline. On the far left, the plan & readiness zone is where every upgrade starts: aws eks describe-cluster-versions tells you the support window, and kubent / pluto / upgrade insights scan for removed APIs — this is the gate, and nothing moves until it is clean. The arrow into the control plane zone is the one-way door: update-cluster-version rolls the managed API server one minor up, pre-flighted on free subnet IPs. From there the path fans into the add-ons zone — VPC CNI, CoreDNS, and the strict kube-proxy — each reconciled against the compatibility matrix with a deliberate --resolve-conflicts mode. Only then does flow reach the data plane zone, where managed node groups surge-roll and Karpenter replaces drifted nodes behind PDBs and disruption budgets.

Above the whole pipeline sits the orchestration plane — Argo CD / Terraform and the ring controller — because in a fleet none of these phases is a manual command; each is a reviewed diff promoted ring by ring. The numbered badges mark the five places this goes wrong: a removed API that the scan missed (1), the control-plane upgrade refused on subnet IP exhaustion (2), kube-proxy skewing newer than the control plane (3), a drain wedged on an unsatisfiable PDB (4), and the fleet-wide regression that a ring rollout is designed to contain (5). The legend narrates each as symptom · confirm · fix. The whole method is in the left-to-right order: scan, then control plane, then add-ons, then nodes, then kube-proxy last — each gated, each a pull request.

Figure: EKS fleet-upgrade pipeline. Four ordered phases run left to right (plan and readiness, control plane, add-ons, data plane) beneath an Argo CD / Terraform orchestration plane with a ring controller; numbered badges 1 to 5 mark the failure points described above.

Real-world scenario

A media company ran 23 EKS clusters and let the cadence slip during a hiring freeze. Six drifted onto a version that aged out of standard support; the per-cluster control-plane bill jumped from ~$72 to ~$432/month and finance flagged the ~$2,160/month surprise. Worse, two were now inside the window where AWS would auto-upgrade them on its own schedule — an unplanned minor bump on a payments-adjacent cluster, exactly the kind of change you want to schedule yourself.

The constraint: a three-person platform team could not take the big-bang risk of catching every cluster up in one weekend. The hidden landmine surfaced when the first canary stalled — a billing service ran a single replica behind a policy/v1beta1 PDB with minAvailable: 1, so the node drain could never complete (the node sat cordoned indefinitely), and that API version was also removed in the target release. Both problems lived in the same manifest: an unsatisfiable PDB and a removed API, on the most sensitive workload in the fleet. kubectl get pdb -A -o wide showed ALLOWED DISRUPTIONS: 0; pluto detect flagged the policy/v1beta1 object as removed in the target. The canary did exactly its job — it caught both in a non-prod cluster instead of a 02:00 page.

They fixed it structurally, not per-cluster. The version became a per-ring Terraform variable so the same module rolled each tier on its own gated schedule, and a CI step ran pluto against every rendered chart so a removed API could never reach a cluster again:

variable "cluster_version" {
  type        = string
  description = "Set per ring; promote ring N+1 only after ring N soaks clean."
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version   # 1.32 in canary first, then promoted by PR
}

The PDB trap was fixed at the source — scale to two replicas and move the manifest to policy/v1:

apiVersion: policy/v1            # was policy/v1beta1 (removed in the target minor)
kind: PodDisruptionBudget
metadata:
  name: billing
spec:
  minAvailable: 1                # now satisfiable: the Deployment runs 2 replicas
  selector:
    matchLabels:
      app: billing

With the canary ring proving each step and pluto gating CI, they caught the remaining five clusters up over three weeks of reviewed pull requests, ended the extended-support charges (~$2,160/month recovered), and took the auto-upgrade risk off the table. The durable fix was the ring variable plus the CI gate, not a heroic weekend. The incident, as a timeline, because the order of moves is the lesson:

Stage | State | Action taken | Effect | What it should have been
Drift | 6 clusters in extended support | (cadence slipped) | +$2,160/mo, 2 near auto-upgrade | Cap fleet at 2 minors; cadence per quarter
Canary | Ring-0 cluster upgraded | Roll the target on 1 non-prod | Drain stalls immediately | (This is the canary working)
Diagnose | Drain wedged | kubectl get pdb -A -o wide | ALLOWED DISRUPTIONS: 0 found |
Diagnose | Removed API found | pluto detect on the chart | policy/v1beta1 flagged removed | Should have been a CI gate already
Fix source | Both bugs in one manifest | Scale to 2; move to policy/v1 | Drain completes | Fix at the source, not per cluster
Structural | Repeatable rollout | Ring variable + pluto in CI | Removed API can’t reach a cluster | The durable fix
Complete | 5 clusters remaining | Promote ring by ring over 3 weeks | Extended-support charges ended | A series of reviewed PRs

Advantages and disadvantages

The managed-control-plane, version-gated, add-on model both de-risks the upgrade and introduces its own sharp edges. Weigh it honestly:

Advantages (why this model helps you) | Disadvantages (why it bites)
AWS rolls the control plane with zero downtime, so the riskiest part is managed | The control plane is one-way: no downgrade, ever; mistakes are forward-only
Server-side upgrade insights catch deprecated-API clients you forgot exist | Removed-API breakage is silent at upgrade time; you only see it when a workload fails to reconcile
Managed add-ons expose a per-version compatibility matrix so you don’t guess | --resolve-conflicts OVERWRITE silently reverts hand-tuned CNI/CoreDNS config
Kubelet skew (3 minors) lets you stage control plane ahead of nodes | The skew rules are easy to violate with a hand-rolled script (e.g. kube-proxy newer than the API server)
Managed node groups + Karpenter drain behind PDBs automatically | An unsatisfiable PDB stalls the drain forever and wedges the roll
Standard support gives a predictable ~14-month runway per minor | Slipping into extended support silently multiplies the per-cluster control-plane bill by 6
GitOps makes a fleet upgrade a reviewed diff, not laptop commands | If you don’t adopt GitOps, fleet drift and human error compound by cluster count

The model is right for any team past day one that wants the control plane operated for them and a clear compatibility contract for add-ons. It bites hardest on fleets that let cadence slip (extended-support cost, forced auto-upgrade), on clusters with single-replica or stateful workloads (PDB traps), and on teams that hand-tuned add-ons then upgraded with OVERWRITE. Every disadvantage is manageable — but only if you know it exists and gate for it, which is the entire point of the runbook.

Hands-on lab

Stand up a tiny EKS cluster, practise the exact readiness-and-upgrade sequence (scan, control plane, add-on, verify), then tear it down. Keep it small (two nodes) so an hour of lab time costs no more than a couple of dollars. Run from a workstation with AWS CLI v2, eksctl, kubectl, and pluto installed.

Step 1 — Create a small cluster one minor behind the latest, so there is something to upgrade.

eksctl create cluster --name eks-upgrade-lab \
  --region ap-south-1 --version 1.31 \
  --nodes 2 --node-type t3.medium --managed

Expected: ~15 minutes; eksctl writes your kubeconfig. Confirm: kubectl get nodes shows two Ready nodes on v1.31.x.

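Step 2: Plant a removed-API landmine and catch it with pluto. The policy/v1beta1 PodDisruptionBudget API was removed in Kubernetes 1.25, so a 1.31 API server rejects it on kubectl apply. The realistic place to catch it is in a manifest or rendered chart, before it ever reaches a cluster. Write a legacy PDB to a file and scan that file against the target, then run the in-cluster scan as the second half of the gate:

cat <<'EOF' > legacy-pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata: { name: legacy-pdb }
spec:
  minAvailable: 1
  selector: { matchLabels: { app: nope } }
EOF

pluto detect legacy-pdb.yaml --target-versions k8s=v1.32.0
# Expected: legacy-pdb flagged; policy/v1beta1 is removed in the target.

pluto detect-all-in-cluster --target-versions k8s=v1.32.0
# Expected: nothing flagged on a freshly created cluster.

The scanner names the object, the deprecated apiVersion, and the removal version, which is exactly the pre-upgrade gate. Remediate by migrating the manifest to policy/v1 (or, for this lab, deleting it):

rm legacy-pdb.yaml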

Step 3 — Check EKS readiness insights and the add-on matrix.

aws eks list-insights --cluster-name eks-upgrade-lab \
  --filter '{"categories":["UPGRADE_READINESS"]}'

# Which kube-proxy build is compatible with the target?
aws eks describe-addon-versions --kubernetes-version 1.32 \
  --addon-name kube-proxy \
  --query 'addons[].addonVersions[0].addonVersion' --output text

Expected: insights PASSING (you removed the landmine), and a concrete compatible kube-proxy build string.
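To turn the insights check into a gate rather than something you eyeball, a script can refuse to continue unless every insight is passing. The sketch below assumes the response shape of aws eks list-insights (an insights list whose entries carry a name and an insightStatus.status); the function name is hypothetical.

```python
from typing import List, Tuple


def failing_insights(response: dict) -> List[Tuple[str, str]]:
    """Return (name, status) for every insight that is not PASSING.

    response is the parsed JSON printed by: aws eks list-insights ...
    A missing status is reported as UNKNOWN so it cannot slip through the gate.
    """
    failing = []
    for insight in response.get("insights", []):
        status = insight.get("insightStatus", {}).get("status", "UNKNOWN")
        if status != "PASSING":
            failing.append((insight.get("name", "<unnamed>"), status))
    return failing
```

An empty list means the cluster is clear to upgrade; anything else is a finding to remediate before touching the control plane.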

Step 4 — Upgrade the control plane one minor (1.31 → 1.32) and watch it.

aws eks update-cluster-version --name eks-upgrade-lab --kubernetes-version 1.32
aws eks describe-cluster --name eks-upgrade-lab --query 'cluster.{Status:status,Version:version}'
# Re-run describe-cluster until Status returns to ACTIVE (ACTIVE -> UPDATING -> ACTIVE); this takes several minutes.

Step 5 — Reconcile the add-on, then the nodes. Update kube-proxy to the compatible build, then roll the managed node group:

aws eks update-addon --cluster-name eks-upgrade-lab \
  --addon-name kube-proxy --addon-version <build-from-step-3> \
  --resolve-conflicts PRESERVE

eksctl upgrade nodegroup --cluster eks-upgrade-lab \
  --name <nodegroup-name> --kubernetes-version 1.32

Step 6 — Verify every layer agrees and the data plane is healthy.

kubectl version -o json | jq -r '.serverVersion.gitVersion'        # control plane on 1.32
kubectl get nodes -o custom-columns='NODE:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'
kubectl get pods -n kube-system                                    # all Running, no CrashLoop
kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local                    # CoreDNS resolves

Expected: control plane and kubelet both on v1.32.x, kube-system healthy, DNS resolves. The lab steps mapped to what each proves:

Step | What you did | What it proves | Real-world analogue
1 | Create a cluster one minor behind | There is a real upgrade to perform | Any cluster on N-1
2 | Plant + scan a policy/v1beta1 PDB | Removed-API detection is real and specific | The number-one upgrade breakage
3 | Check insights + add-on matrix | Readiness is a gate, compatibility is queryable | The pre-flight that blocks bad upgrades
4 | Control plane 1.31 → 1.32 | The one-way roll is one minor at a time | Phase 1 of every upgrade
5 | Reconcile kube-proxy, roll nodes | Add-ons then kubelet, in order | Phases 2–3
6 | Verify versions + DNS | “Successful” ≠ healthy; you confirm | The post-upgrade smoke check
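Step 6's kubelet check is easy to automate across a fleet. The following sketch assumes the parsed output of kubectl get nodes -o json and a target minor written as "1.32"; the function name is hypothetical.

```python
import re
from typing import Dict


def nodes_off_target(node_list: dict, target_minor: str) -> Dict[str, str]:
    """Map node name -> kubelet version for nodes not yet on target_minor.

    node_list is the parsed JSON from: kubectl get nodes -o json
    target_minor looks like "1.32".
    """
    stale = {}
    for node in node_list.get("items", []):
        version = node["status"]["nodeInfo"]["kubeletVersion"]
        match = re.match(r"v?(\d+\.\d+)", version)
        # Unparseable versions are reported too, rather than silently passed.
        if match is None or match.group(1) != target_minor:
            stale[node["metadata"]["name"]] = version
    return stale
```

An empty result means every kubelet has caught up to the control plane; a non-empty one lists exactly which nodes the roll has not reached.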

Teardown (avoid lingering control-plane + node charges):

eksctl delete cluster --name eks-upgrade-lab --region ap-south-1

Cost note. One control plane at $0.10/hr plus two t3.medium nodes comes to well under $2 for an hour; deleting the cluster stops every charge. There is no free tier for the EKS control plane, so do not leave the lab running overnight.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read mid-change-window, then the entries that bite hardest expanded with the full reasoning.

# | Symptom | Root cause | Confirm (exact cmd / path) | Fix
1 | Upgrade “Successful” but a controller stops reconciling | Removed API still called by a chart/controller | pluto detect-all-in-cluster --target-versions k8s=<target>; insights ERROR | Bump chart / rewrite apiVersion; redeploy; re-scan
2 | update-cluster-version refused / Failed | Control-plane subnets out of free IPs | aws ec2 describe-subnets --query '...AvailableIpAddressCount' | Free IPs / add a larger subnet, retry
3 | Node drain hangs; node stuck SchedulingDisabled | Unsatisfiable PDB (minAvailable: 1 on 1 replica) | kubectl get pdb -A -o wide (ALLOWED DISRUPTIONS = 0) | Scale to ≥2; relax PDB; never --force blindly
4 | Pods ContainerCreating, no IP, after add-on update | VPC CNI custom config reverted by OVERWRITE | kubectl get ds aws-node -n kube-system -o yaml (env reset) | Re-apply CNI env; re-run update with PRESERVE
5 | Intermittent service routing failures post-upgrade | kube-proxy skewed newer than the control plane | kubectl get ds kube-proxy -n kube-system; compare to API version | Downgrade kube-proxy to ≤ control plane build
6 | DNS resolution broken cluster-wide after upgrade | CoreDNS Corefile reverted / version skew | kubectl get cm coredns -n kube-system -o yaml; CoreDNS logs | Restore Corefile; update to compatible build
7 | PVCs stuck Pending after upgrade | EBS CSI driver incompatible/not updated | kubectl get pods -n kube-system -l app=ebs-csi-controller | Update aws-ebs-csi-driver to compatible build
8 | All API writes fail mid-upgrade | Admission webhook unavailable during node roll | kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | Make webhook HA; review failurePolicy
9 | Karpenter recycles too many nodes during upgrade | No Drifted disruption budget | kubectl get nodepool -o yaml (no budget) | Add a Drifted budget (nodes: "3" etc.)
10 | Unexpected minor bump on a cluster | Version aged out of extended support → AWS auto-upgraded | aws eks describe-cluster --query cluster.version; support status | Never reach end-of-extended; upgrade on cadence
11 | Bill jumped ~6× on some clusters | Clusters slid into extended support | aws eks describe-cluster-versions vs cluster versions | Upgrade the stragglers; cap fleet spread
12 | Managed node roll crawls one node at a time | maxUnavailable=1 on a large group | aws eks describe-nodegroup --query '...updateConfig' | Set maxUnavailablePercentage (e.g. 10)
13 | Pods evicted but never reschedule | No spare capacity or taint/toleration mismatch | kubectl get events -A --field-selector reason=FailedScheduling | Add capacity; fix tolerations/affinity
14 | kubectl warns “deprecated” or odd API errors | Client skewed >1 minor from the API server | kubectl version (compare client/server) | Update kubectl to within ±1 minor

The expanded form, with the full reasoning for the entries that cost the most time:

1. Upgrade reports Successful but a controller silently stops reconciling. Root cause: A Helm chart or controller still calls an API removed in the target minor (e.g. policy/v1beta1, autoscaling/v2beta2). The control plane upgraded fine; the client is now talking to an endpoint that no longer exists. Confirm: pluto detect-all-in-cluster --target-versions k8s=<target> and kubent flag the object and the removal version; EKS upgrade insights show an ERROR for UPGRADE_READINESS. Fix: Bump the chart or rewrite the apiVersion, redeploy, and re-scan until clean. Add pluto to CI so a removed API can never reach a cluster again — this is the durable fix, not a per-cluster patch.

2. aws eks update-cluster-version is refused or returns Failed. Root cause: EKS pre-flight requires free IP addresses in the control-plane subnets to place new control-plane ENIs; exhausted subnets refuse the upgrade. Confirm: aws ec2 describe-subnets --subnet-ids <ids> --query 'Subnets[].{Id:SubnetId,Free:AvailableIpAddressCount}' shows zero or near-zero free IPs; describe-update --query update.errors names the condition. Fix: Free up IPs (clean up stale ENIs) or add a larger/extra subnet in a second AZ to the cluster, then retry.

3. A node drain hangs and the node sits SchedulingDisabled indefinitely. Root cause: A PodDisruptionBudget that can never be satisfied — classically minAvailable: 1 on a single-replica Deployment. Eviction would drop below the floor, so the API server refuses it forever, and the roll wedges. Confirm: kubectl get pdb -A -o wide shows ALLOWED DISRUPTIONS: 0 for the offending PDB; kubectl get nodes | grep SchedulingDisabled shows the cordoned node. Fix: Scale the workload to at least two replicas (so the PDB becomes satisfiable) or relax the PDB. Audit single-replica workloads before the roll. Do not reach for --force — it evicts past the PDB and breaks the very guarantee you set.

4. After an add-on update, pods are stuck ContainerCreating with no IP. Root cause: The add-on update ran with --resolve-conflicts OVERWRITE and reverted a hand-tuned VPC CNI config (e.g. ENABLE_PREFIX_DELEGATION, WARM_*), collapsing IP density so new pods can’t get an address. Confirm: kubectl get ds aws-node -n kube-system -o yaml shows the env reset to defaults; pod events show “failed to assign an IP”. Fix: Re-apply the CNI configuration and re-run the add-on update with --resolve-conflicts PRESERVE. Going forward, keep the CNI env in IaC and always use PRESERVE for tuned add-ons.

5. Intermittent service-routing failures appear right after the upgrade. Root cause: kube-proxy ended up newer than the control plane (or more than three minors behind it) — a skew the platform won’t create but a hand-rolled order can. Confirm: kubectl get ds kube-proxy -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}' versus the API-server version from kubectl version. Fix: Set kube-proxy to a build ≤ the control-plane minor (the compatible build from describe-addon-versions). Always roll kube-proxy last and never ahead of the API server.

6. DNS resolution breaks cluster-wide after the upgrade. Root cause: The CoreDNS Corefile was reverted by an OVERWRITE update (losing stub domains / forwarders) or CoreDNS skewed off a compatible build. Confirm: kubectl get cm coredns -n kube-system -o yaml shows the default Corefile; kubectl logs -n kube-system -l k8s-app=kube-dns shows resolution errors; the nslookup probe fails. Fix: Restore the Corefile and update CoreDNS to the matrix-compatible build with PRESERVE.

10–11. A cluster bumps a minor on its own, or the bill jumps ~6×. Root cause: The cluster aged out of extended support, so AWS auto-upgraded it on its own schedule (10); or several clusters slid into extended support and the per-cluster control-plane charge went from ~$72 to ~$432/month (11). Confirm: aws eks describe-cluster --query cluster.version against aws eks describe-cluster-versions support status; the surprise line on the bill. Fix: There is no after-the-fact fix beyond upgrading the stragglers. Prevent it: upgrade on a quarterly cadence, cap fleet spread at two minors, and rank clusters by support status every wave.
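The skew rule from entry 5 (never newer than the control plane, at most three minors behind) is simple to check mechanically before a hand-rolled script is allowed to apply anything. The helper names below are hypothetical, and the sketch assumes Kubernetes major version 1 and version strings like "1.32" or "v1.32.0-eksbuild.2".

```python
import re
from typing import Optional


def minor_of(version: str) -> int:
    """Extract the minor number from strings like '1.32' or 'v1.32.0-eksbuild.2'."""
    match = re.match(r"v?1\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised Kubernetes version: {version!r}")
    return int(match.group(1))


def skew_problem(control_plane: str, component: str, max_lag: int = 3) -> Optional[str]:
    """Return a description of the skew violation, or None if the pair is legal.

    component is a kube-proxy or kubelet version; max_lag=3 reflects the
    three-minor window described in this lesson.
    """
    lag = minor_of(control_plane) - minor_of(component)
    if lag < 0:
        return "newer than the control plane"
    if lag > max_lag:
        return f"{lag} minors behind (limit {max_lag})"
    return None
```

Calling skew_problem with the control-plane version and the kube-proxy image tag before an add-on update catches the "kube-proxy ahead of the API server" mistake at plan time instead of in production traffic.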

Best practices

Security notes

Upgrades are a security event in both directions: they close known CVEs and, done carelessly, they can widen access or break the guardrails that protect the cluster.

Cost & sizing

What drives the bill on an EKS upgrade is rarely the upgrade itself — it is the support tier you let clusters fall into, plus transient capacity during node rolls.

The cost levers, what each costs, and how to control it:

Cost driver | What it costs | Driven by | How to control
Control plane (standard) | $0.10/cluster/hr (~$72/mo) | Each running cluster | Baseline; consolidate idle clusters
Control plane (extended) | $0.60/cluster/hr (~$432/mo) | Versions out of standard support | Upgrade on cadence; never slip
Surge nodes during a roll | Extra node-hours while old+new overlap | maxUnavailable/surge + group size | Bound surge; roll in off-peak
Karpenter drift churn | Replacement node-hours | Drift recycling capacity | Drifted disruption budget
Data transfer | Unchanged by upgrade | Workload traffic | Not an upgrade lever
EBS volumes for new nodes | gp3 GB-month during overlap | Surge + drift node disks | Bound surge; reclaim promptly
NAT data processing | Per-GB during image re-pull | New nodes pulling images | Pre-pull / cache base images
Extended-support delta (fleet) | ~$360/cluster/mo over standard | Number of clusters in extended | Rank + upgrade stragglers first
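The control-plane rows reduce to a few lines of arithmetic, sketched here so you can plug in your own fleet. The function names are hypothetical; the rates come from the table, and the 720-hour month is the assumption that reproduces the ~$72 and ~$432 figures used throughout.

```python
HOURS_PER_MONTH = 720            # assumption: 30-day month, matching ~$72 / ~$432
STANDARD_CENTS_PER_HOUR = 10     # $0.10 per cluster-hour
EXTENDED_CENTS_PER_HOUR = 60     # $0.60 per cluster-hour


def monthly_control_plane_usd(standard_clusters: int, extended_clusters: int) -> float:
    """Total monthly control-plane charge for the fleet, in USD."""
    cents = HOURS_PER_MONTH * (standard_clusters * STANDARD_CENTS_PER_HOUR
                               + extended_clusters * EXTENDED_CENTS_PER_HOUR)
    return cents / 100


def extended_support_premium_usd(extended_clusters: int) -> float:
    """Avoidable monthly spend: what extended support costs over standard."""
    delta_cents = EXTENDED_CENTS_PER_HOUR - STANDARD_CENTS_PER_HOUR
    return HOURS_PER_MONTH * extended_clusters * delta_cents / 100
```

For the 23-cluster fleet in the scenario, with six clusters in extended support, the premium works out to the $2,160/month that finance flagged.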

Right-sizing the upgrade, not the cluster — keep the wave cheap:

Rough INR framing for an India-region fleet: a single cluster’s control plane runs roughly ₹6,000/month in standard support and ₹36,000/month in extended — so six stragglers in extended support are about ₹1,80,000/month of avoidable spend, which is precisely the kind of number that turns an upgrade backlog into a funded project.

Interview & exam questions

1. Why can’t you upgrade an EKS control plane from 1.30 directly to 1.32? EKS upgrades the control plane one minor version at a time. To cross two minors you issue two sequential update-cluster-version calls, each completing before the next. This mirrors upstream Kubernetes’s supported upgrade path and keeps API/feature transitions incremental. (CKA / EKS practitioner.)

2. What is the kubelet skew rule on EKS, and how do you exploit it during a fleet upgrade? On EKS 1.28+, nodes tolerate the control plane being up to three minor versions ahead of the kubelet. You exploit it by advancing the control plane multiple minors first (e.g. 1.29 → 1.30 → 1.31) while nodes stay put, then catching nodes up — never the reverse, and kube-proxy is never newer than the control plane.

3. A cluster upgrade reports Successful but a controller stops working. What happened and how would you have prevented it? A Kubernetes minor bump removed a beta API the controller still called (e.g. policy/v1beta1, autoscaling/v2beta2). Prevention is a pre-upgrade scan — kubent/pluto plus EKS upgrade insights as a hard CI gate — remediating every removed-API usage before touching the control plane.

4. Explain the three --resolve-conflicts modes for EKS add-ons and when each is correct. NONE fails the update on any hand-edited field (a hard stop). OVERWRITE resets changed fields to EKS defaults (use when EKS owns the config). PRESERVE keeps your out-of-band edits across the update (use for tuned VPC CNI / CoreDNS). OVERWRITE on tuned config silently reverts it.

5. Which add-on is the strict one for version skew, and what is its rule? kube-proxy. It must not be newer than the control-plane minor and not more than three minors older. CoreDNS and the CSI drivers are version-gated but looser. Roll kube-proxy last.

6. A node drain hangs forever during an upgrade. What is the most likely cause and the fix? An unsatisfiable PodDisruptionBudget — classically minAvailable: 1 on a single-replica Deployment — so eviction would breach the floor and the API server refuses it indefinitely. Fix: scale to ≥2 replicas (or relax the PDB). kubectl get pdb -A -o wide showing ALLOWED DISRUPTIONS: 0 confirms it.

7. How do Karpenter-managed nodes upgrade, and how do you bound the churn? Through drift: when the AMI referenced by the EC2NodeClass/NodePool changes, Karpenter marks existing nodes drifted and replaces them. Bound it with a disruption budget scoped to the Drifted reason (e.g. nodes: "3") so only a few nodes recycle at once.

8. What is the cost difference between standard and extended support, and why does it matter at fleet scale? Standard control plane is $0.10/cluster/hr; extended is $0.60/cluster/hr — a 6× jump (~$72 → ~$432/month). Across a fleet, every cluster that slips into extended support adds ~$360/month, turning a missed cadence into a real budget line. (FinOps / EKS.)

9. Can you roll back an EKS control-plane upgrade? What is the recovery path if a workload breaks? No — the control plane cannot be downgraded. Recovery is forward: roll nodes back to the prior AMI (still legal within the three-minor skew window) and revert add-on versions while you fix the workload, then re-advance. This one-way property is why readiness scanning and a canary ring are mandatory.

10. What pre-flight condition most commonly blocks a control-plane upgrade, and how do you confirm it? Subnet IP exhaustion — EKS needs free IPs in the control-plane subnets for new ENIs and refuses the upgrade otherwise. Confirm with aws ec2 describe-subnets checking AvailableIpAddressCount; fix by freeing IPs or adding a larger/second-AZ subnet.

11. Describe a ring rollout and why a fleet needs one. Promote the version through rings — canary (1 non-prod) → early (low-traffic prod) → broad → final — gating each on the previous ring soaking clean (scanners green, no SLO regression, error budget intact). It contains a regression to one cluster instead of the fleet and turns the upgrade into reviewed PRs.

12. Why express the upgrade as a GitOps/Terraform diff instead of aws eks update-cluster-version? It makes the change reviewable, auditable, and reproducible across the fleet: a one-line cluster_version bump in a PR, the same module for N clusters via a per-ring variable, Git history as the audit log, and revert-the-commit as the rollback of intent. Imperative commands scatter the record and invite drift.

Quick check

  1. In what order do you upgrade the four moving parts of an EKS cluster, and which one goes last?
  2. How many minor versions can the EKS control plane be ahead of the kubelet (on 1.28+), and can the data plane ever be ahead?
  3. You hand-tuned the VPC CNI for prefix delegation. Which --resolve-conflicts mode do you use when updating the add-on, and why?
  4. A managed node group’s drain has wedged with a node stuck SchedulingDisabled. What is the first command you run, and what are you looking for?
  5. What happens to a cluster that reaches the end of its extended-support window, and what does that cost compared to standard support?

Answers

  1. Control plane → managed add-ons → node groups (kubelet) → kube-proxy last. kube-proxy trails because it must never be newer than the control plane.
  2. Up to three minors ahead (nodes lag the control plane). The data plane may only lag, never lead — and you can never skip control-plane minors.
  3. PRESERVE. It keeps your out-of-band CNI config across the update; OVERWRITE would silently reset the env to EKS defaults and collapse IP density, leaving pods stuck without an IP.
  4. kubectl get pdb -A -o wide, looking for a PDB with ALLOWED DISRUPTIONS: 0 — an unsatisfiable PodDisruptionBudget (often minAvailable: 1 on a single replica) blocking the eviction. Fix by scaling to ≥2 or relaxing the PDB.
  5. AWS auto-upgrades it to the next minor on a schedule you don’t control. While it sat in extended support it cost $0.60/cluster/hr (~$432/month) versus $0.10/hr (~$72/month) in standard — a 6× control-plane premium.

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
