Deploy Karpenter on EKS with Consolidation, Spot Diversification, and Disruption Budgets

A retail SaaS platform team runs a 60-node EKS cluster behind a single overprovisioned managed node group of m5.2xlarge On-Demand instances. Bin-packing is poor, nodes sit at 35% utilization through the night, and every traffic spike means a 4-minute wait while the Cluster Autoscaler asks an Auto Scaling Group to add a node of the one fixed shape it knows. The monthly compute bill is the second-largest line in the AWS invoice, and the FinOps lead has flagged it twice. The brief is concrete: cut steady-state compute cost by moving the right workloads onto Spot, scale in seconds instead of minutes, and right-size automatically as load falls — without paging the on-call team every time a Spot instance is reclaimed. This guide installs Karpenter to do exactly that, with consolidation to claw back the idle capacity, Spot diversification so a single capacity pool drying up does not stall the cluster, and disruption budgets so the churn never breaches the platform’s availability SLO.

Prerequisites

export CLUSTER_NAME="saas-prod"
export AWS_REGION="ap-south-1"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export KARPENTER_VERSION="1.3.3"
export K8S_VERSION="1.31"
# AL2023 EKS-optimized AMI alias Karpenter will resolve at launch
export AMI_ALIAS="al2023@latest"
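
Every later command depends on these variables, so it can help to fail fast when one is empty (for example when aws sts get-caller-identity returned nothing). The helper below is a minimal sketch, not part of Karpenter or the AWS CLI; require_env is a hypothetical name.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name (POSIX-compatible)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not run here):
#   require_env CLUSTER_NAME AWS_REGION AWS_ACCOUNT_ID KARPENTER_VERSION || exit 1
```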

Target topology

Karpenter runs as a two-replica Deployment on a small, stable managed node group that you keep On-Demand on purpose: the controller must survive a Spot reclamation event, so it never schedules itself onto the capacity it manages. From that foothold it watches the Kubernetes API for unschedulable pods, and instead of resizing an Auto Scaling Group it calls the EC2 CreateFleet API directly to launch a right-sized node that fits the pending pods’ CPU, memory, architecture, and topology requirements. It owns the full node lifecycle: provisioning, consolidation that packs workloads onto fewer or cheaper nodes, and graceful disruption that taints and drains a node before terminating it.

An SQS interruption queue fed by EventBridge delivers the two-minute Spot interruption notice to Karpenter, so it can drain the node and start a replacement before the instance disappears. Rebalance recommendations arrive on the same queue, but in current releases Karpenter records them as Kubernetes events rather than draining on them.

Two custom resources drive everything: a NodePool (what kinds of nodes may exist, and the rules for disrupting them) and an EC2NodeClass (the AWS-specific launch settings: AMI, subnets, security groups, node IAM role).

Around the cluster, HashiCorp Vault issues the short-lived database and third-party API credentials the workloads consume so no static secret rides a Spot node to its grave, Dynatrace OneAgent (deployed by a DaemonSet that tolerates every node) traces pods across the constant node churn, Wiz continuously scans the running node AMIs and Kubernetes posture for drift, and the whole configuration ships through a GitHub Actions plus Argo CD GitOps pipeline rather than kubectl apply by hand.

1. Create the Karpenter controller and node IAM roles

Karpenter needs two identities: a controller role the pod assumes (to call the EC2, SQS, and Pricing APIs) and a node role the launched EC2 instances run under. If both roles already exist, the remaining cluster-side step is to let instances that carry the node role join the cluster. The eksctl command below does only that: it maps the node role into the aws-auth ConfigMap. It does not create IAM roles, the SQS queue, or the EventBridge rules:

eksctl create iamidentitymapping \
  --cluster "$CLUSTER_NAME" --region "$AWS_REGION" \
  --arn "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}" \
  --group system:bootstrappers --group system:nodes \
  --username "system:node:{{EC2PrivateDNSName}}"

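If you are starting clean, the supported path is the CloudFormation template Karpenter publishes. The template for the version pinned here provisions KarpenterNodeRole-<cluster>, the KarpenterControllerPolicy-<cluster> managed policy, an SQS interruption queue named after the cluster, and the EventBridge rules for Spot interruption, rebalance, instance state-change, and scheduled-change events. It does not create the controller role itself, so create KarpenterControllerRole-<cluster> separately and attach that policy to it. Read the downloaded file to confirm the resource names before you depend on them: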

curl -fsSL "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml" \
  -o /tmp/karpenter-cfn.yaml

aws cloudformation deploy \
  --stack-name "Karpenter-${CLUSTER_NAME}" \
  --template-file /tmp/karpenter-cfn.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides "ClusterName=${CLUSTER_NAME}" \
  --region "$AWS_REGION"

The node role needs four AWS managed policies (AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly, AmazonSSMManagedInstanceCore), and the CloudFormation template attaches them. Confirm the queue exists before continuing:

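# The published template names the queue after the cluster; the stack, not the queue, carries the Karpenter- prefix
aws sqs get-queue-url --queue-name "${CLUSTER_NAME}" --region "$AWS_REGION"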

2. Associate the controller role with Pod Identity

Bind the controller role to the karpenter service account in the kube-system namespace using EKS Pod Identity, which is simpler than IRSA because there is no OIDC trust policy to hand-edit. It requires the EKS Pod Identity Agent add-on to be running on the cluster:

aws eks create-pod-identity-association \
  --cluster-name "$CLUSTER_NAME" --region "$AWS_REGION" \
  --namespace kube-system \
  --service-account karpenter \
  --role-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}"

The controller role’s trust policy must allow the Pod Identity principal pods.eks.amazonaws.com to assume it with both sts:AssumeRole and sts:TagSession. Check this on the role itself rather than assuming it is already in place. Verify the association:

aws eks list-pod-identity-associations \
  --cluster-name "$CLUSTER_NAME" --region "$AWS_REGION" \
  --query "associations[?serviceAccount=='karpenter']"
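
The trust relationship described above can be sketched as a policy document. The snippet writes it to a local file; the file name is an arbitrary choice, and the commented command assumes the role name used throughout this guide.

```shell
# Write the Pod Identity trust policy to a local file (file name is illustrative).
cat > karpenter-controller-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
EOF

# To set it on an existing role (not run here):
#   aws iam update-assume-role-policy \
#     --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
#     --policy-document file://karpenter-controller-trust.json
```

Keeping the document in a file means it can be reviewed and versioned alongside the rest of the GitOps configuration.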

3. Tag subnets and security groups for discovery

Karpenter finds where to launch nodes through tag selectors, not hardcoded IDs. Tag the private subnets and the cluster node security group so the EC2NodeClass in step 5 can select them:

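# Tag the cluster's subnets. This lists every subnet registered with the cluster,
# so remove any public ones from the list before tagging.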
for subnet in $(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" \
    --query "cluster.resourcesVpcConfig.subnetIds[]" --output text); do
  aws ec2 create-tags --region "$AWS_REGION" --resources "$subnet" \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}"
done

# Tag the shared node security group
NODE_SG=$(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)
aws ec2 create-tags --region "$AWS_REGION" --resources "$NODE_SG" \
  --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}"

Use only private subnets — launching Spot nodes into public subnets is both a cost and a security mistake. Wiz will flag any node that lands on a public subnet, but it is cheaper to never tag them in the first place.
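
One way to keep public subnets out of the tagging loop is to filter the cluster’s subnet list before tagging. The function below is a sketch under one assumption: that a subnet with MapPublicIpOnLaunch disabled is private. That is a heuristic, and a route-table check for an internet gateway route is stricter. The function name is hypothetical.

```shell
# Hypothetical helper: print the cluster subnets that do NOT auto-assign public IPs.
private_cluster_subnets() {
  cluster_subnets=$(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" \
    --query "cluster.resourcesVpcConfig.subnetIds[]" --output text)
  # $cluster_subnets is intentionally unquoted so each ID becomes its own argument
  aws ec2 describe-subnets --region "$AWS_REGION" --subnet-ids $cluster_subnets \
    --filters "Name=map-public-ip-on-launch,Values=false" \
    --query "Subnets[].SubnetId" --output text
}

# Usage (not run here): for subnet in $(private_cluster_subnets); do ...create-tags...; done
```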

4. Install Karpenter with Helm

Install the controller chart from the public ECR OCI registry, pinned to the exact version. The chart’s default nodeAffinity requires that the karpenter.sh/nodepool label does not exist on the node (operator DoesNotExist), which keeps the controller on the existing managed node group and off the nodes it manages. replicas: 2 lets it tolerate the loss of one pod:

helm upgrade --install karpenter \
  oci://public.ecr.aws/karpenter/karpenter \
  --version "$KARPENTER_VERSION" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "controller.resources.requests.cpu=1" \
  --set "controller.resources.requests.memory=1Gi" \
  --set "controller.resources.limits.cpu=1" \
  --set "controller.resources.limits.memory=1Gi" \
  --set "replicas=2" \
  --wait

Confirm both replicas are running and leader election settled:

kubectl -n kube-system rollout status deploy/karpenter --timeout=180s
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter -c controller --tail=20

5. Define the EC2NodeClass

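The EC2NodeClass is the AWS-specific half: which AMI family, which subnets and security groups (by the tags from step 3), the node role, and disk. The manifest below uses the al2023@latest alias from the prerequisites, which follows every new EKS-optimized AMI release and lets Karpenter roll the fleet through drift when one lands. For production, pin a versioned alias instead so a base-image change never rolls the fleet underneath you, and let your pipeline bump it deliberately. Apply: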

# ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-saas-prod"
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "saas-prod"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "saas-prod"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeType: gp3
        volumeSize: 50Gi
        encrypted: true       # EBS-at-rest encryption is non-negotiable for the security review
        deleteOnTermination: true
  metadataOptions:
    httpTokens: required        # IMDSv2 only — blocks SSRF-style credential theft
    httpPutResponseHopLimit: 1
  tags:
    team: platform
    managed-by: karpenter

kubectl apply -f ec2nodeclass.yaml

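httpTokens: required forces IMDSv2, and the hop limit of 1 stops pods that are not on the host network from reaching the node’s metadata endpoint at all, so a compromised pod cannot lift the node’s credentials. Karpenter applies the same values when metadataOptions is omitted, but stating them explicitly keeps the intent visible in review and is exactly the kind of setting Wiz checks for.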
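
The prerequisites export CLUSTER_NAME and AMI_ALIAS, while the manifest above hardcodes both. A minimal sketch of rendering the same EC2NodeClass from those variables follows; render_ec2nodeclass is a hypothetical helper name. Setting AMI_ALIAS to a versioned alias (the Karpenter docs show the form al2023@v followed by a release date) is what pins the image.

```shell
# Hypothetical helper: print the EC2NodeClass with cluster name and AMI alias
# taken from the environment instead of being hardcoded.
render_ec2nodeclass() {
  cat <<EOF
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: ${AMI_ALIAS}
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeType: gp3
        volumeSize: 50Gi
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpTokens: required
    httpPutResponseHopLimit: 1
  tags:
    team: platform
    managed-by: karpenter
EOF
}

# Usage (not run here): render_ec2nodeclass | kubectl apply -f -
```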

6. Define a NodePool with Spot diversification

The NodePool is the Kubernetes-facing contract: the universe of instance types Karpenter may pick from, the capacity types it may use, and the disruption rules. The single most important lever for Spot resilience is diversification — give Karpenter a broad set of instance families and sizes so its Spot allocation strategy (price-capacity-optimized under the hood) can draw from many capacity pools. A NodePool locked to one instance type on Spot is fragile; a NodePool spanning a dozen is resilient.

# nodepool-general.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    metadata:
      labels:
        workload-class: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer spot; fall back to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]          # broad families = many Spot pools
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]                     # gen 6+ for price/perf
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["4", "8", "16"]
      expireAfter: 168h                      # recycle nodes weekly for patching
      terminationGracePeriod: 5m
  limits:
    cpu: "400"                               # hard ceiling on this pool's total vCPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

kubectl apply -f nodepool-general.yaml

Three things make this safe and cheap. Listing both spot and on-demand lets Karpenter prefer Spot but fall back to On-Demand when no Spot capacity is available, so workloads never get stuck pending. The broad instance-category / instance-generation / instance-cpu requirements expand the pool count dramatically: Karpenter filters the catalog down to the types that fit the pending pods and hands a set of the cheapest candidates to EC2 Fleet, whose price-capacity-optimized strategy picks a pool that balances low price against available capacity. The limits.cpu is your blast-radius cap: once the pool’s nodes total 400 vCPU, Karpenter stops launching for it regardless of pending demand. Limit checks are eventually consistent, so a very fast scale-out can overshoot slightly before the cap takes hold.

For workloads that genuinely cannot tolerate interruption — the Karpenter controller’s own node group aside, things like stateful singletons — run a separate On-Demand-only NodePool with a taint and schedule only those pods there with a matching toleration, so Spot churn never touches them.
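
A minimal sketch of such an On-Demand-only pool, reusing the EC2NodeClass from step 5. The pool name, the workload-class: critical label and taint, the CPU ceiling, and the consolidation settings are all illustrative assumptions, not values from this platform.

```shell
# Write a tainted, On-Demand-only NodePool manifest (names and limits are illustrative).
cat > nodepool-on-demand.yaml <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical-on-demand
spec:
  template:
    metadata:
      labels:
        workload-class: critical
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-class
          value: critical
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]      # no spot entry, so Spot is never used
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
  limits:
    cpu: "64"
  disruption:
    consolidationPolicy: WhenEmpty   # only remove nodes once nothing runs on them
    consolidateAfter: 5m
EOF

# Apply with (not run here): kubectl apply -f nodepool-on-demand.yaml
```

Pods meant for this pool need both a toleration for workload-class=critical:NoSchedule and a nodeSelector of workload-class: critical. The toleration alone only permits scheduling there; it does not require it.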

7. Turn on consolidation and bound it with disruption budgets

Consolidation is what recovers the idle capacity behind the 35% utilization figure in the opening scenario. With consolidationPolicy: WhenEmptyOrUnderutilized (set in step 6), Karpenter continuously looks for nodes it can delete (their pods fit elsewhere) or replace with a cheaper node, and acts after consolidateAfter: 1m of stability. Left unbounded, that is dangerous: a cluster-wide repack could drain a large fraction of nodes at once and breach your availability SLO. Disruption budgets cap how much voluntary churn Karpenter may cause at any moment. Patch the NodePool to add them:

# nodepool-general-budgets.yaml (merge into spec.disruption)
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"                 # at most 10% of this pool's nodes disrupting at once
      - nodes: "0"                   # freeze all voluntary disruption during business peak
        schedule: "0 9 * * mon-fri"
        duration: 9h
        reasons: ["Underutilized", "Drifted"]
      - nodes: "5"                   # but always allow empty-node cleanup, capped at 5
        reasons: ["Empty"]
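
# The file is a partial spec, so merge-patch it into the existing NodePool rather than applying it
kubectl patch nodepool general-spot --type merge --patch-file nodepool-general-budgets.yaml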

Order does not matter here: at any instant Karpenter applies the most restrictive of the budgets that are active and match the disruption reason. The first entry is the always-on ceiling: never disrupt more than 10% of the pool’s nodes simultaneously. The second freezes consolidation and drift disruptions for a 9-hour window every weekday starting at 09:00. Budget schedules are evaluated in UTC and do not accept a timezone, so convert your local peak window to UTC before committing the cron expression. Because that entry scopes reasons to Underutilized and Drifted, deliberately not Empty, Karpenter can still reap genuinely empty nodes during the freeze, bounded by the tighter of the 10% ceiling and the third budget’s cap of 5. That combination, freezing the risky repacking while keeping the free cleanup, is the practical sweet spot.
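
A worked example makes the interaction concrete. The sketch below assumes a hypothetical pool of 23 nodes and follows the rule in the Karpenter docs that a percentage budget is rounded up. It ignores nodes that are already deleting or NotReady, which Karpenter subtracts from the allowance.

```shell
# ceil(total * percent / 100) using integer arithmetic
percent_budget() { echo $(( ($1 * $2 + 99) / 100 )); }
# smaller of two integers
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

total=23                                  # hypothetical node count in general-spot
ceiling=$(percent_budget "$total" 10)     # budget 1 applies to every reason: ceil(2.3) = 3

echo "in freeze, Underutilized/Drifted: $(min "$ceiling" 0)"   # => 0
echo "in freeze, Empty: $(min "$ceiling" 5)"                   # => 3
echo "outside freeze, Underutilized: $ceiling"                 # => 3
```

With 23 nodes the Empty cap of 5 is never reached, because the 10% ceiling is tighter. The two limits only meet once the pool holds 41 or more nodes.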

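Pair this with PodDisruptionBudgets on your actual workloads. Karpenter honors PDBs during drain, so a minAvailable on each Deployment is the second, app-level guardrail that stops a node drain from taking the last healthy replica. The one exception is the terminationGracePeriod: 5m set in step 6: once a draining node exceeds it, Karpenter removes the remaining pods regardless of PDBs, so size that value to your slowest graceful shutdown: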

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web

Critical pods that must never be evicted by consolidation get the annotation karpenter.sh/do-not-disrupt: "true" on the pod template — use it sparingly, since every annotated pod pins its node.
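
The annotation has to sit on the pod template, not on the Deployment’s own metadata, or Karpenter never sees it on the pods. A minimal sketch of a merge patch that puts it in the right place follows; the file name and the Deployment name in the commented command are hypothetical.

```shell
# Write a merge patch that annotates the pod template (file name is illustrative).
cat > do-not-disrupt-patch.yaml <<'EOF'
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
EOF

# Apply to a Deployment of your choosing (not run here):
#   kubectl patch deployment <name> --type merge --patch-file do-not-disrupt-patch.yaml
```

Changing the pod template triggers a rollout, so the annotation reaches running pods only after they are replaced.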

Validation

Prove the three behaviors end to end. First, scale-up: deploy a pause workload that cannot fit on current nodes and watch Karpenter launch a right-sized one in seconds.

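kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 --replicas=0
kubectl set resources deployment/inflate --requests=cpu=1,memory=1.5Gi
# Pin the pods to the NodePool from step 6 so spare room on the managed node group cannot absorb them
kubectl patch deployment inflate --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"workload-class":"general"}}}}}'
kubectl scale deployment/inflate --replicas=12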

# Watch a new NodeClaim go from pending to ready
kubectl get nodeclaims -w
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter -c controller -f \
  | grep -E "launched|registered|initialized"

Confirm the new node is Spot and a diversified type:

kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type,topology.kubernetes.io/zone

Then consolidation: scale the workload down and watch Karpenter delete or replace nodes after the stabilization window.

kubectl scale deployment/inflate --replicas=0
# After ~1m, expect NodeClaims to be deleted via consolidation
kubectl get nodeclaims -w
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter -c controller \
  | grep -i "disrupting\|consolidat"

Finally, confirm the interruption path is live by checking that the controller picked up the SQS queue, and inspect the NodePool’s status conditions:

kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter -c controller | grep -i "interruption"
kubectl get nodepool general-spot -o jsonpath='{.status.conditions}' | jq .

Karpenter exports Prometheus metrics on :8080/metrics; scrape karpenter_nodes_allocatable, karpenter_nodeclaims_disrupted_total, and karpenter_pods_state into Dynatrace to alert when pending-pod time or Spot-fallback rate climbs.
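
Before wiring up the scrape, a quick spot-check of a counter can confirm the endpoint is serving what you expect. The helper below is a sketch that sums every series of one metric from Prometheus text format on standard input; sum_metric is a hypothetical name, and it assumes samples carry no trailing timestamp.

```shell
# Hypothetical helper: sum all series of one metric from Prometheus text on stdin.
sum_metric() {
  awk -v metric="$1" '
    # Match "metric 3" or "metric{label=...} 3"; HELP and TYPE lines start with "#"
    $1 == metric || index($1, metric "{") == 1 { total += $NF }
    END { print total + 0 }
  '
}

# Usage (not run here):
#   kubectl -n kube-system port-forward deploy/karpenter 8080:8080 &
#   curl -s localhost:8080/metrics | sum_metric karpenter_nodeclaims_disrupted_total
```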

Rollback / teardown

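To back out, leave the Karpenter controller running while you delete its resources. The controller is what processes the NodeClaim finalizers, drains the nodes, and terminates the instances; with it scaled to zero, the deletes below would hang and the EC2 instances would keep running. Deleting the NodePool is itself what stops new provisioning, so there is nothing to fight: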

# Delete NodePools and NodeClasses; Karpenter-owned nodes drain and terminate
kubectl delete nodepool general-spot
kubectl delete ec2nodeclass default

# Confirm all Karpenter-managed nodes are gone
kubectl get nodes -l karpenter.sh/nodepool

# Remove the controller, Pod Identity association, and CloudFormation stack
helm uninstall karpenter -n kube-system
aws eks delete-pod-identity-association --cluster-name "$CLUSTER_NAME" --region "$AWS_REGION" \
  --association-id "$(aws eks list-pod-identity-associations --cluster-name "$CLUSTER_NAME" \
    --region "$AWS_REGION" --query "associations[?serviceAccount=='karpenter'].associationId" --output text)"
aws cloudformation delete-stack --stack-name "Karpenter-${CLUSTER_NAME}" --region "$AWS_REGION"

Before deleting NodePools, make sure your original managed node group still has headroom to absorb the rescheduled pods, or the teardown drain will leave pods pending. Deleting the EC2NodeClass while NodeClaims still reference it is blocked by a finalizer, which is the safe behavior — delete the NodePool first and let nodes drain.
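
Uninstalling the chart while Karpenter-managed nodes still exist would strand them, so it is worth gating that step. A minimal sketch of a wait loop follows; the function name, the 10-second poll interval, and the default timeout are arbitrary choices.

```shell
# Hypothetical helper: block until no node carries the karpenter.sh/nodepool label.
# Returns 0 when none remain, 1 on timeout, 2 if kubectl itself fails.
wait_for_karpenter_nodes_gone() {
  timeout_s=${1:-600}
  waited=0
  while :; do
    nodes=$(kubectl get nodes -l karpenter.sh/nodepool -o name) || return 2
    [ -z "$nodes" ] && return 0
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "Karpenter-managed nodes still present after ${timeout_s}s" >&2
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}

# Usage (not run here): wait_for_karpenter_nodes_gone 900 && helm uninstall karpenter -n kube-system
```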

Common pitfalls

Security notes

Karpenter’s controller role is powerful — it can launch EC2 and pass the node role — so scope it to the cluster’s resources and let Wiz continuously check the running node AMIs, IMDS configuration, and Kubernetes RBAC for posture drift, alerting if a node ever launches without IMDSv2 or with public exposure. Enforce IMDSv2 (httpTokens: required) and a hop limit of 1 on the EC2NodeClass so a compromised pod cannot steal node credentials. Encrypt the EBS root volume (encrypted: true) to satisfy the at-rest control. Because nodes are ephemeral and constantly recycled, never bake secrets into the AMI or a node bootstrap script — have workloads pull short-lived database and third-party API credentials from HashiCorp Vault at runtime (Vault Agent sidecar or the Vault CSI provider), so a reclaimed Spot node carries nothing of value to its termination. Human and service access to the cluster federates through your IdP — Okta or Entra ID brokered to AWS IAM Identity Center — so the same SSO and conditional-access policy that governs the console governs kubectl, and there are no long-lived IAM users.

Cost notes

The win comes from three compounding levers. Spot diversification typically lands instances at 60–90% off On-Demand, and the broad NodePool keeps that discount available because Karpenter always draws from the deepest, cheapest pool. Consolidation recovers the idle headroom — the opening cluster’s 35%-utilized nodes get repacked onto fewer, smaller instances, and WhenEmptyOrUnderutilized keeps doing it as load ebbs, so you stop paying for capacity you booked for a peak that passed. Right-sizing at launch means Karpenter picks the smallest instance that fits the pending pods rather than a fixed 2xlarge, so the fleet shape tracks demand minute to minute. Set limits.cpu per NodePool as a hard spend ceiling, set expireAfter to recycle nodes weekly (patching plus a natural nudge toward newer, cheaper generations), and pipe Karpenter’s metrics to Dynatrace alongside AWS Cost Explorer so the FinOps lead sees blended Spot discount and consolidation savings on one dashboard. Teams that move a general workload pool from a fixed On-Demand managed node group to a diversified, consolidating Karpenter NodePool routinely cut that compute line 50–70% — which is the number that closes the ticket the FinOps lead opened twice.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
