Running EKS at Scale: Pod Identity, Karpenter Autoscaling, and VPC CNI Networking

eksctl create cluster gives you a control plane and some nodes. It does not give you a platform. The gap between a demo cluster and one that runs hundreds of services across thousands of pods comes down to four decisions you make early and rarely revisit cheaply: how identity flows to workloads, how the data plane allocates IPs, how nodes appear and disappear, and how you keep the whole thing current. This guide walks through each one with the commands and manifests I actually ship.

Beyond eksctl create: the four decisions

A production EKS platform on AWS lives or dies on these:

Decision          | Legacy default                            | What scales
Cluster auth      | aws-auth ConfigMap                        | Access entries (EKS access-management API)
Workload identity | IRSA (OIDC + per-SA role)                 | EKS Pod Identity (association API)
Pod networking    | One secondary IP per pod, low pod density | VPC CNI prefix delegation
Node lifecycle    | Managed node groups + Cluster Autoscaler  | Karpenter with consolidation

None of these are exotic. They are the boring, correct defaults for a cluster you intend to operate for years. Assume EKS 1.31+ throughout.

Step 1 — Cluster provisioning with access entries

The aws-auth ConfigMap was the original way to map IAM principals to Kubernetes RBAC. It is a single YAML blob with no validation: one bad edit can lock every admin out of the cluster. The access-management API replaces it with first-class AWS resources you manage via the API, CLI, or IaC.

Create the cluster with the API-based authentication mode. With eksctl:

# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: platform-prod
  region: us-east-1
  version: "1.31"
accessConfig:
  authenticationMode: API_AND_CONFIG_MAP
  bootstrapClusterCreatorAdminPermissions: true
vpc:
  clusterEndpoints:
    publicAccess: true
    privateAccess: true
addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
  - name: eks-pod-identity-agent

eksctl create cluster -f cluster.yaml

API_AND_CONFIG_MAP lets both mechanisms coexist while you migrate; flip to API once nothing reads the ConfigMap. Grant a role cluster-admin via an access entry plus an access policy association:

aws eks create-access-entry \
  --cluster-name platform-prod \
  --principal-arn arn:aws:iam::111122223333:role/platform-admins

aws eks associate-access-policy \
  --cluster-name platform-prod \
  --principal-arn arn:aws:iam::111122223333:role/platform-admins \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster

AWS-managed access policies (AmazonEKSClusterAdminPolicy, AmazonEKSAdminPolicy, AmazonEKSViewPolicy, and others) map to predictable RBAC. For namespace-scoped grants, set --access-scope type=namespace,namespaces=team-a,team-b. For anything bespoke, create an access entry of type STANDARD and bind your own RBAC by Kubernetes group.
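
A namespace-scoped grant can be sketched as a small helper. This is a dry-run illustration rather than a finished tool: it prints the two CLI calls unless APPLY=1 is set, and the role name team-a-readonly is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run by default: print each command instead of executing it. Set APPLY=1 to run.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# grant_namespace_view CLUSTER PRINCIPAL_ARN NAMESPACE...
# Creates the access entry, then scopes AmazonEKSViewPolicy to the listed namespaces.
grant_namespace_view() {
  local cluster=$1 principal=$2
  shift 2
  local namespaces
  namespaces=$(IFS=,; printf '%s' "$*")   # join the remaining arguments with commas
  run aws eks create-access-entry \
    --cluster-name "$cluster" --principal-arn "$principal"
  run aws eks associate-access-policy \
    --cluster-name "$cluster" --principal-arn "$principal" \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
    --access-scope "type=namespace,namespaces=$namespaces"
}

# Hypothetical read-only role for two team namespaces.
grant_namespace_view platform-prod \
  arn:aws:iam::111122223333:role/team-a-readonly team-a team-b
```

Reviewing the printed commands before setting APPLY=1 keeps the grant easy to audit in a pull request.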

The payoff: access is auditable in CloudTrail, expressible in Terraform (aws_eks_access_entry / aws_eks_access_policy_association), and a typo returns an API error instead of bricking RBAC.

Step 2 — Workload identity: IRSA to EKS Pod Identity

IRSA works: annotate a ServiceAccount with a role ARN, the pod gets a projected token, and the SDK exchanges it via the cluster’s OIDC provider. The operational cost shows up at scale. Every cluster needs its own IAM OIDC provider, and every role’s trust policy hardcodes that provider’s URL plus the SA sub. Replicate a workload across ten clusters and each role carries ten provider-specific trust statements (or you maintain ten per-cluster roles), all of which must stay in sync.

EKS Pod Identity removes the OIDC plumbing. A node-level agent (the eks-pod-identity-agent add-on) vends credentials, and a single API call associates a role with a (namespace, ServiceAccount) pair. The role’s trust policy points at the EKS service, not a cluster-specific OIDC URL.

The trust policy is identical across every cluster:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}

Create the association:

aws eks create-pod-identity-association \
  --cluster-name platform-prod \
  --namespace payments \
  --service-account checkout-sa \
  --role-arn arn:aws:iam::111122223333:role/checkout-app

The ServiceAccount needs no annotation: the binding lives in EKS, not on the SA. Application code is unchanged as long as the AWS SDK is recent enough to include the container credential provider that Pod Identity relies on; it then resolves credentials transparently.

A practical migration sequence:

  1. Install the eks-pod-identity-agent add-on.
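  2. For one workload, add a pods.eks.amazonaws.com statement to its IAM role trust policy, keeping the OIDC statement for now, and create the association.
  3. Remove the IRSA SA annotation and roll the pods. Most SDK credential chains try the IRSA web identity token before Pod Identity, so an annotated pod keeps using IRSA. Confirm AWS calls still succeed, then drop the OIDC statement from the trust policy.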
  4. Repeat per workload; decommission the IAM OIDC provider only after the last IRSA consumer is gone.
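
The per-workload steps above can be sketched as a dry-run script. It assumes the role's trust policy already allows pods.eks.amazonaws.com and that the workload is a Deployment named checkout (a hypothetical name). It removes the annotation before restarting because, as far as I know, most SDKs prefer the IRSA token while the annotation is present; treat that ordering as an assumption to verify against your SDK.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run by default: print each command instead of executing it. Set APPLY=1 to run.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# migrate_workload CLUSTER NAMESPACE SERVICE_ACCOUNT DEPLOYMENT ROLE_ARN
migrate_workload() {
  local cluster=$1 ns=$2 sa=$3 deploy=$4 role_arn=$5
  # Bind the role to the (namespace, ServiceAccount) pair.
  run aws eks create-pod-identity-association \
    --cluster-name "$cluster" --namespace "$ns" \
    --service-account "$sa" --role-arn "$role_arn"
  # A trailing "-" after the key removes the IRSA annotation.
  run kubectl annotate serviceaccount "$sa" -n "$ns" eks.amazonaws.com/role-arn-
  # Restart so new pods pick up Pod Identity credentials, then wait for the rollout.
  run kubectl rollout restart "deployment/$deploy" -n "$ns"
  run kubectl rollout status "deployment/$deploy" -n "$ns"
}

# Deployment name "checkout" is an assumption for illustration.
migrate_workload platform-prod payments checkout-sa checkout \
  arn:aws:iam::111122223333:role/checkout-app
```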

Keep IRSA where you genuinely need cross-account sts:AssumeRole chains or non-EKS consumers of the same role. For in-cluster workloads, Pod Identity is the lower-maintenance default.

Step 3 — VPC CNI tuning: prefix delegation and beyond

The AWS VPC CNI gives every pod a routable VPC IP, which is great for native security groups and flow logs and brutal for IP exhaustion. By default each pod consumes one secondary IP on an ENI, so pod density per node is capped by the instance’s ENI and per-ENI IP limits, and large nodes burn through a /24 fast.

Prefix delegation assigns /28 prefixes (16 IPs each) to an ENI’s address slots instead of single IPs, multiplying pod density on Nitro-based instances and cutting EC2 API calls during scale-up. Enable it on the add-on’s aws-node DaemonSet:

kubectl set env daemonset aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true

# Warm capacity so pod scheduling never blocks on a slow ENI attach
kubectl set env daemonset aws-node -n kube-system \
  WARM_PREFIX_TARGET=1
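
Because vpc-cni is installed as a managed add-on in Step 1, the same settings can also be expressed as add-on configuration values, so a later add-on update does not have to reconcile a hand-edited DaemonSet. A minimal dry-run sketch, assuming the add-on's configuration schema exposes these variables under an env key (worth confirming with aws eks describe-addon-configuration for your add-on version):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run by default: print each command instead of executing it. Set APPLY=1 to run.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Configuration values for the vpc-cni add-on; env values are passed as strings.
CNI_CONFIG='{"env":{"ENABLE_PREFIX_DELEGATION":"true","WARM_PREFIX_TARGET":"1"}}'

# configure_prefix_delegation CLUSTER
configure_prefix_delegation() {
  run aws eks update-addon \
    --cluster-name "$1" \
    --addon-name vpc-cni \
    --configuration-values "$CNI_CONFIG"
}

configure_prefix_delegation platform-prod
```

The dry-run output drops shell quoting around the JSON, so it is meant for review rather than copy-paste.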

Prefix delegation also changes how you size the --max-pods value on each node — derive it from the instance’s ENI and prefix limits rather than leaving the old per-IP default. AWS publishes a max-pods-calculator helper for this; bake the result into your node bootstrap.
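
To make the sizing concrete, here is a rough sketch of the arithmetic as I understand it from AWS's documentation; it is not a substitute for the official calculator. The ENI count, IPs per ENI, and vCPU count are inputs you look up for the instance type, and the 110/250 cap follows AWS's published recommendation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# max_pods_secondary_ip ENIS IPS_PER_ENI
# One address on each ENI is its own primary IP; the +2 accounts for
# host-network pods such as aws-node and kube-proxy.
max_pods_secondary_ip() {
  echo $(( $1 * ($2 - 1) + 2 ))
}

# max_pods_prefix ENIS IPS_PER_ENI VCPUS
# With prefix delegation each secondary slot holds a /28 (16 addresses).
# The result is capped at 110 below 30 vCPUs and at 250 otherwise.
max_pods_prefix() {
  local raw=$(( $1 * ($2 - 1) * 16 + 2 ))
  local cap=250
  if [ "$3" -lt 30 ]; then cap=110; fi
  if [ "$raw" -gt "$cap" ]; then echo "$cap"; else echo "$raw"; fi
}

# Example inputs: 3 ENIs, 10 IPv4 addresses per ENI, 2 vCPUs (m5.large limits).
max_pods_secondary_ip 3 10   # prints 29
max_pods_prefix 3 10 2       # prints 110
```

On small instances the cap, not the address math, usually decides the final number, which is why the two modes can produce very different values for the same node.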

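Two adjacent features worth knowing: custom networking, which places pod ENIs in separate subnets (typically carved from a secondary VPC CIDR) through ENIConfig resources, and security groups for pods, which attaches dedicated security groups to selected pods through a SecurityGroupPolicy like this one: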

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payments-db-access
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: checkout
  securityGroups:
    groupIds:
      - sg-0abc123def4567890

Prefix delegation is the one almost everyone needs; custom networking and security-groups-for-pods are situational. Turn them on only when a real constraint demands it — each adds moving parts to the data plane.

Step 4 — Node lifecycle with Karpenter

Cluster Autoscaler scales node groups you predefine: it can only add nodes of a shape you already declared, and it bin-packs poorly across many instance types. Karpenter watches for unschedulable pods and provisions right-sized nodes directly against EC2, picks instance types from a broad pool, and consolidates — replacing or removing nodes when workloads no longer justify them.

Two CRDs drive it. EC2NodeClass is the AWS-specific template (AMI, subnets, security groups, IAM role). NodePool is the scheduling policy and constraints.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-platform-prod"
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "platform-prod"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "platform-prod"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidationAfter: 1m
  limits:
    cpu: "1000"

Design notes from running this in anger:

Install Karpenter via its Helm chart, ensuring the controller has its own IAM permissions (a Pod Identity association is the clean way) and that the node role is registered as an EKS access entry of type EC2_LINUX so nodes can join.
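
The two IAM wiring steps can be sketched as dry-run commands. The controller role name, and the kube-system namespace and karpenter ServiceAccount for the Helm release, are assumptions for illustration; adjust them to match your install.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run by default: print each command instead of executing it. Set APPLY=1 to run.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# register_karpenter_node_role CLUSTER NODE_ROLE_ARN
# An EC2_LINUX access entry lets instances using this role join the cluster.
register_karpenter_node_role() {
  run aws eks create-access-entry \
    --cluster-name "$1" --principal-arn "$2" --type EC2_LINUX
}

# bind_karpenter_controller CLUSTER NAMESPACE SERVICE_ACCOUNT CONTROLLER_ROLE_ARN
# Gives the controller pod its own IAM role through Pod Identity.
bind_karpenter_controller() {
  run aws eks create-pod-identity-association \
    --cluster-name "$1" --namespace "$2" \
    --service-account "$3" --role-arn "$4"
}

register_karpenter_node_role platform-prod \
  arn:aws:iam::111122223333:role/KarpenterNodeRole-platform-prod
# Namespace, ServiceAccount, and role name below are hypothetical.
bind_karpenter_controller platform-prod kube-system karpenter \
  arn:aws:iam::111122223333:role/KarpenterControllerRole-platform-prod
```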

Step 5 — Managing core add-ons and the upgrade cadence

CoreDNS, kube-proxy, the VPC CNI, and the EBS CSI driver are EKS managed add-ons — version them through EKS rather than as loose manifests, so the control plane tracks compatibility.

List what an add-on supports for your cluster version, then update:

aws eks describe-addon-versions \
  --addon-name aws-ebs-csi-driver \
  --kubernetes-version 1.31 \
  --query 'addons[].addonVersions[].addonVersion'

aws eks update-addon \
  --cluster-name platform-prod \
  --addon-name aws-ebs-csi-driver \
  --addon-version v1.35.0-eksbuild.1 \
  --resolve-conflicts PRESERVE

--resolve-conflicts PRESERVE keeps your field-level customizations (replica counts, tolerations) instead of clobbering them with add-on defaults. Use OVERWRITE deliberately, when you want to reset to defaults.

The EBS CSI driver needs IAM permissions to manage volumes — wire it with a Pod Identity association to its controller ServiceAccount rather than node-instance-profile permissions, so the blast radius stays narrow.

Upgrade cadence: upstream Kubernetes ships a new minor roughly every four months and EKS follows, and each version has a standard support window (14 months at the time of writing) after which extended support charges apply. Schedule an upgrade for every new minor rather than a panicked annual jump across several versions. Control-plane upgrades are one minor at a time and non-skippable.

Step 6 — Ingress with the AWS Load Balancer Controller

The AWS Load Balancer Controller reconciles Kubernetes Ingress objects into ALBs and Service type: LoadBalancer into NLBs, with target-type ip registering pod IPs directly (no extra node hop). Give its controller an IAM role via Pod Identity, then drive everything with annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  namespace: payments
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:111122223333:certificate/abcd-1234
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
spec:
  ingressClassName: alb
  rules:
    - host: checkout.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout
                port:
                  number: 80

Use IngressGroups (alb.ingress.kubernetes.io/group.name) to merge multiple Ingress resources onto one shared ALB — otherwise every Ingress spins up its own load balancer and the bill (and ENI consumption) climbs fast.
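
As a sketch, the group annotation can be applied to existing Ingress objects from the command line. The group name shared-public and the second Ingress name refunds are hypothetical; in practice I would put the annotation in the manifests themselves and use this only to trial the change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run by default: print each command instead of executing it. Set APPLY=1 to run.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# group_ingresses GROUP NAMESPACE INGRESS...
# Ingress objects that share a group.name are merged onto one ALB.
group_ingresses() {
  local group=$1 ns=$2
  shift 2
  local ing
  for ing in "$@"; do
    run kubectl annotate ingress "$ing" -n "$ns" \
      "alb.ingress.kubernetes.io/group.name=$group" --overwrite
  done
}

group_ingresses shared-public payments checkout refunds
```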

Enterprise scenario

A fintech platform team ran 40+ services on a single EKS cluster and started seeing pods stuck Pending during morning traffic ramps, but only on their m6i.4xlarge nodes, never the smaller ones. The constraint wasn’t compute; CPU and memory sat at 50%. It was IP exhaustion masked by a subtle interaction: they had enabled ENABLE_PREFIX_DELEGATION=true on the VPC CNI but never recalculated --max-pods, which Karpenter was still deriving from the old per-IP ENI formula. So a node advertised capacity for ~110 pods, but the CNI could only attach enough /28 prefixes for ~58 before hitting the per-instance ENI limit. The scheduler kept binding pods to the node; the CNI kept failing IP allocation, leaving pods wedged.

The fix was to make Karpenter compute --max-pods consistently with prefix delegation by setting maxPods explicitly in the EC2NodeClass kubelet config, derived from AWS’s max-pods-calculator.sh --cni-version 1.x --instance-type m6i.4xlarge --cni-prefix-delegation-enabled:

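apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # role, amiSelectorTerms, and the subnet/security-group selectors stay as in Step 4
  kubelet:
    maxPods: 110   # calculator output; must match what the CNI can actually allocate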

After applying it, Karpenter drifted the old nodes out under PDBs and the Pending storm disappeared. The lesson: prefix delegation and --max-pods are one decision, not two — and Karpenter’s advertised capacity must agree with what the CNI can physically allocate, or the scheduler will happily overcommit IPs you don’t have.

Verify

Confirm each layer before declaring the platform ready:

# Auth: access entries resolve, no stale aws-auth dependency
aws eks list-access-entries --cluster-name platform-prod

# Pod Identity: agent running, associations present
kubectl get daemonset eks-pod-identity-agent -n kube-system
aws eks list-pod-identity-associations --cluster-name platform-prod

# VPC CNI: prefix delegation active
kubectl get daemonset aws-node -n kube-system -o yaml | grep -i ENABLE_PREFIX_DELEGATION

# Karpenter: pools healthy, nodes claimed
kubectl get nodepool,ec2nodeclass
kubectl get nodeclaim

# Add-ons: all ACTIVE on compatible versions
aws eks list-addons --cluster-name platform-prod
aws eks describe-addon --cluster-name platform-prod --addon-name vpc-cni \
  --query 'addon.{v:addonVersion,status:status}'

# Ingress: ALB provisioned and address assigned
kubectl get ingress -A

A fast end-to-end identity smoke test: schedule a debug pod under a Pod-Identity-bound ServiceAccount and call STS.

kubectl run sts-check --rm -it --restart=Never \
  --image=public.ecr.aws/aws-cli/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"checkout-sa"}}' \
  -n payments -- sts get-caller-identity

The returned ARN should be the assumed role you associated — proof the credential chain works without any SA annotation.

Production checklist

Cost visibility, scaling limits, and the upgrade runbook

Cost visibility. Enable Split Cost Allocation Data for EKS in the billing console to attribute shared node cost down to pods by namespace and label — this is what turns “the cluster costs $X” into per-team chargeback. Tag NodePool-provisioned instances (via EC2NodeClass tags) so Cost Explorer can group by team. Karpenter consolidation is the single biggest lever on the compute line item; measure node utilization before and after enabling it.

Scaling limits to respect. IP space is the usual wall — even with prefix delegation, plan VPC CIDRs (and secondary CIDRs / custom networking) for peak pod count, not today’s. Watch per-node --max-pods, per-ENI prefix limits, and Karpenter’s own controller throughput when scaling thousands of nodes. Service quotas (ENIs per region, EBS volumes, ELBs) bite at the data-plane edges before the control plane does.
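
A back-of-the-envelope sketch for the IP planning above. It relies on the fact that AWS reserves five addresses in every subnet, and it deliberately ignores node IPs, warm prefixes, and /28 fragmentation, so treat the result as a lower bound. The peak pod count is an illustrative input.

```shell
#!/usr/bin/env bash
set -euo pipefail

# usable_ips PREFIX_LEN
# Addresses in an IPv4 subnet of the given prefix length, minus the 5 AWS reserves.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

# subnets_needed PEAK_PODS PREFIX_LEN
# Ceiling division of the peak pod count by usable addresses per subnet.
subnets_needed() {
  local per
  per=$(usable_ips "$2")
  echo $(( ($1 + per - 1) / per ))
}

usable_ips 24            # prints 251
usable_ips 19            # prints 8187
subnets_needed 5000 24   # prints 20
```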

Cluster-upgrade runbook (one minor at a time):

  1. Read the EKS release notes and Kubernetes deprecation guide for the target minor; scan workloads for removed APIs (kubectl deprecation warnings, or a tool like pluto).
  2. Upgrade add-ons first to versions compatible with the target Kubernetes version.
  3. Upgrade the control plane: aws eks update-cluster-version --name platform-prod --kubernetes-version 1.32.
  4. Roll the data plane: for Karpenter, bump the EC2NodeClass AMI alias and let consolidation/drift recycle nodes gracefully under PDBs; for managed node groups, do a rolling update.
  5. Re-run the Verify section end to end.
  6. Confirm no workload is pinned to a now-removed API and that HPA/Karpenter still react to load.
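
The one-minor-at-a-time rule is easy to enforce with a guard before step 3. A minimal sketch, assuming versions are given as MAJOR.MINOR strings; in a real run the current version would come from aws eks describe-cluster --name platform-prod --query cluster.version --output text.

```shell
#!/usr/bin/env bash
set -euo pipefail

# is_next_minor CURRENT TARGET
# Succeeds only when TARGET is exactly one minor above CURRENT (same major).
is_next_minor() {
  local cur_major=${1%%.*} cur_minor=${1#*.}
  local tgt_major=${2%%.*} tgt_minor=${2#*.}
  [ "$cur_major" = "$tgt_major" ] && [ "$tgt_minor" -eq $(( cur_minor + 1 )) ]
}

# Values hardcoded here for illustration.
current=1.31
target=1.32

if is_next_minor "$current" "$target"; then
  echo "ok: $current -> $target is a single-minor upgrade"
else
  echo "refusing: $current -> $target is not a single-minor upgrade" >&2
  exit 1
fi
```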

Pitfalls to avoid

Get identity, networking, node lifecycle, and add-on hygiene right at the start, and EKS becomes a platform your teams build on without thinking about it — which is exactly the point.
