The first time an EKS cluster runs out of IPs, it is never obvious. Pods stick in ContainerCreating, the events say failed to assign an IP address to container, and the node has plenty of CPU and memory free. The cluster autoscaler or Karpenter sees no pressure, so it adds nothing. You are not out of compute; you are out of the one resource nobody put on a dashboard: VPC IPv4 addresses. This guide is the playbook I use to push pod density up and IP burn down on EKS, covering prefix delegation, custom networking, and security groups for pods, plus how the three interact.
1. How the VPC CNI allocates ENIs and IPs
The Amazon VPC CNI (aws-node, a DaemonSet) gives every pod a real VPC IP from the node’s subnet. That is the feature and the trap. The component doing the work is ipamd inside each aws-node pod. It attaches Elastic Network Interfaces (ENIs) to the EC2 instance and pulls secondary IPs onto them, maintaining a warm pool so pod creation does not wait on an EC2 API call.
Two hard EC2 limits govern this in the default “secondary IP” mode:
- ENIs per instance is fixed by instance type. An m5.large gets 3 ENIs; an m5.4xlarge gets 8.
- IPs per ENI is also fixed by instance type. An m5.large gets 10 per ENI; one of those is the ENI’s primary, leaving 9 usable for pods.
Max pods in secondary IP mode is therefore (ENIs * (IPs_per_ENI - 1)) + 2. For an m5.large: (3 * 9) + 2 = 29. The +2 accounts for pods on the host network (kube-proxy, aws-node) that do not consume a secondary IP. AWS ships this exact formula as a script (max-pods-calculator.sh) in the amazon-vpc-cni-k8s repo.
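The formula is easy to sanity-check by hand. A minimal sketch, where max_pods_secondary is a hypothetical helper name (not part of any AWS tooling) and the ENI and IP counts are the ones quoted in this guide:

```shell
# max_pods_secondary ENIS IPS_PER_ENI
# Pod ceiling in secondary IP mode: each ENI gives up one address to its own
# primary IP, and two host-network pods (kube-proxy, aws-node) need none.
max_pods_secondary() {
  local enis=$1 ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods_secondary 3 10   # m5.large   -> 29
max_pods_secondary 3 6    # t3.medium  -> 17
max_pods_secondary 8 30   # m5.4xlarge -> 234
```

This only reproduces the arithmetic; the AWS calculator script remains the authoritative source because it also looks up the per-instance limits for you.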
The problem at scale is subnet consumption. Each node holds a warm pool of pre-allocated IPs it is not using yet. With defaults (WARM_ENI_TARGET=1), a freshly scheduled node can claim a whole extra ENI worth of IPs just to keep one warm. Multiply by hundreds of nodes and a /24 subnet (251 usable) evaporates. You will see free IPs pinned to ENIs on idle nodes while new pods elsewhere cannot schedule.
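To get a feel for how quickly that adds up, here is a rough back-of-the-envelope estimate. It assumes the worst case in which every attached ENI is fully populated with addresses, and nodes_per_subnet is a hypothetical helper, not something ipamd exposes:

```shell
# nodes_per_subnet USABLE_IPS IPS_PER_ENI ENIS_IN_USE WARM_ENIS
# Assumption: each attached ENI holds its full complement of addresses,
# whether or not pods are using them.
nodes_per_subnet() {
  local usable=$1 ips_per_eni=$2 enis_in_use=$3 warm_enis=$4
  local held=$(( (enis_in_use + warm_enis) * ips_per_eni ))
  echo $(( usable / held ))
}

# /24 subnet, m5.large nodes, one ENI in use plus one warm ENI
nodes_per_subnet 251 10 1 1   # -> 12
```

Under these assumptions a /24 supports about a dozen lightly loaded m5.large nodes, regardless of how few pods they run.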
Inspect what a node actually holds:
# IPs and prefixes currently attached, per ENI, on a node
kubectl exec -n kube-system aws-node-xxxxx -c aws-node -- \
curl -s http://localhost:61679/v1/enis | jq '.ENIs[] | {eni: .ID, ips: (.IPv4Addresses | length), prefixes: (.IPv4Prefixes | length)}'
2. Enabling prefix delegation (/28 prefixes)
Prefix delegation changes the unit of allocation. Instead of assigning individual secondary IPs to an ENI, the CNI assigns /28 IPv4 prefixes – 16 contiguous addresses per prefix. The EC2 limit on slots per ENI stays the same, but now each slot holds a prefix (16 IPs) instead of one IP. That multiplies addressable pods per ENI by up to 16 without attaching more ENIs.
The math: an m5.large ENI has 10 slots, minus 1 for the primary = 9 prefixes = 9 * 16 = 144 IPs per ENI. Across 3 ENIs that is theoretically far more than you need, so the practical limit becomes the EKS recommendation of 110 pods per node (or 250 on instances with enough capacity). Prefix mode is what makes a c5.large run 110 pods instead of 29.
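The same arithmetic as a sketch. max_pods_prefix is a hypothetical helper; the 110/250 ceiling and the 30 vCPU threshold follow the EKS recommendation described in this guide, and treating exactly 30 vCPUs as the lower tier is an assumption:

```shell
# max_pods_prefix ENIS SLOTS_PER_ENI VCPUS
# Raw capacity: every non-primary slot holds a /28 (16 addresses), plus the
# two host-network pods. The result is then clamped to the recommended ceiling.
max_pods_prefix() {
  local enis=$1 slots=$2 vcpus=$3
  local raw=$(( enis * (slots - 1) * 16 + 2 ))
  local cap=110
  if [ "$vcpus" -gt 30 ]; then cap=250; fi
  if [ "$raw" -lt "$cap" ]; then echo "$raw"; else echo "$cap"; fi
}

max_pods_prefix 3 10 2     # m5.large (2 vCPUs)     -> 110
max_pods_prefix 15 50 72   # c5.18xlarge (72 vCPUs) -> 250
```

On every type shown here the raw address capacity is far above the ceiling, which is why the recommendation, not the CNI, ends up being the limit.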
Enable it on the add-on. The two knobs are ENABLE_PREFIX_DELEGATION and WARM_PREFIX_TARGET:
kubectl set env daemonset aws-node -n kube-system \
ENABLE_PREFIX_DELEGATION=true \
WARM_PREFIX_TARGET=1
Or, the way you should actually do it, through the managed add-on config so it survives upgrades:
aws eks update-addon \
--cluster-name prod-use1 \
--addon-name vpc-cni \
--resolve-conflicts OVERWRITE \
--configuration-values '{"env":{"ENABLE_PREFIX_DELEGATION":"true","WARM_PREFIX_TARGET":"1"}}'
Tuning the warm targets
WARM_PREFIX_TARGET=1 keeps one full extra prefix (16 IPs) warm. That is the AWS-recommended floor and the safest setting – it guarantees a node can always burst at least 16 pods without an EC2 call. The trade-off is up to 15 wasted IPs per node when pods are sparse.
For tighter packing, switch to IP-level targets, which work in prefix mode too:
kubectl set env daemonset aws-node -n kube-system \
WARM_IP_TARGET=5 \
MINIMUM_IP_TARGET=10
Do not set WARM_PREFIX_TARGET and WARM_IP_TARGET to fight each other. If WARM_IP_TARGET / MINIMUM_IP_TARGET are set, the CNI rounds up to whole prefixes to satisfy them and ignores WARM_PREFIX_TARGET. Use one model. I use MINIMUM_IP_TARGET + WARM_IP_TARGET on IP-starved clusters and WARM_PREFIX_TARGET=1 everywhere else.
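The rounding behavior can be illustrated with a simplified model. This is a sketch of the idea, not ipamd's actual logic, and prefixes_needed is a hypothetical name:

```shell
# prefixes_needed ASSIGNED_PODS WARM_IP_TARGET MINIMUM_IP_TARGET
# Simplified model: the node wants max(assigned + warm, minimum) addresses,
# and can only obtain them in whole /28 prefixes of 16.
prefixes_needed() {
  local assigned=$1 warm=$2 minimum=$3
  local want=$(( assigned + warm ))
  if [ "$want" -lt "$minimum" ]; then want=$minimum; fi
  echo $(( (want + 15) / 16 ))
}

prefixes_needed 0 5 10    # empty node: wants 10 addresses -> 1 prefix
prefixes_needed 12 5 10   # 12 pods:    wants 17 addresses -> 2 prefixes
```

The jump from one prefix to two happens at the twelfth pod here, so IP-level targets still consume addresses in steps of 16.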
There is one real constraint people miss: prefix delegation needs contiguous /28 blocks. On a subnet that has been fragmented by years of churn, EC2 may fail to find a free contiguous prefix even when scattered IPs exist. Fresh, generously sized subnets are a prerequisite, not a nicety.
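A small simulation makes the fragmentation problem concrete. It is a sketch for a single /24 only, and free_prefixes_in_24 is a hypothetical helper: it takes the host offsets (0-255) already in use and counts the 16-aligned /28 blocks that are still completely free.

```shell
# free_prefixes_in_24 [USED_OFFSET...]
free_prefixes_in_24() {
  # AWS reserves the first four addresses and the last one in every subnet.
  local used=" 0 1 2 3 255 $* "
  local block=0 off busy free=0
  while [ "$block" -lt 256 ]; do
    busy=0
    off=$block
    # A block is unusable as a prefix if any single address in it is taken.
    while [ "$off" -lt $(( block + 16 )) ]; do
      case "$used" in
        *" $off "*) busy=1; break ;;
      esac
      off=$(( off + 1 ))
    done
    if [ "$busy" -eq 0 ]; then free=$(( free + 1 )); fi
    block=$(( block + 16 ))
  done
  echo "$free"
}

free_prefixes_in_24                    # untouched subnet -> 14
free_prefixes_in_24 $(seq 8 16 255)    # one stray IP per block -> 0
```

In the second call only 16 addresses are in use and 235 remain free, yet no /28 can be carved out, which is the failure mode described above.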
You must also bump the node’s --max-pods kubelet flag, because the default Bottlerocket/AL2 bootstrap computes max-pods for secondary IP mode. With managed node groups, pass it through the AMI bootstrap:
# AL2 bootstrap.sh arguments for a launch template (AL2023 uses a nodeadm NodeConfig instead)
--use-max-pods false --kubelet-extra-args '--max-pods=110'
3. Per-instance pod density: with and without prefix delegation
The gap is dramatic, and it changes your instance selection. A few representative types:
| Instance type | ENIs | IPs/ENI | Max pods (secondary IP) | Max pods (prefix delegation) |
|---|---|---|---|---|
| t3.medium | 3 | 6 | 17 | 110 |
| m5.large | 3 | 10 | 29 | 110 |
| c5.xlarge | 4 | 15 | 58 | 110 |
| m5.4xlarge | 8 | 30 | 234 | 110 |
| c5.18xlarge | 15 | 50 | 250 (capped) | 250 (capped) |
The headline: small and medium instances are transformed. A t3.medium going from 17 to 110 pods means you stop fragmenting workloads across oversized nodes just to get IPs. EKS caps the recommendation at 110 below 30 vCPUs and 250 above, because kubelet and kube-proxy performance degrade past that, not because the CNI cannot allocate more.
Always confirm with the calculator rather than trusting a table:
# from the amazon-vpc-cni-k8s repo
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.18.0 --cni-prefix-delegation-enabled
4. Custom networking: pods on a secondary CIDR
Prefix delegation conserves IPs but still draws them from the node’s subnet. If your primary VPC CIDR is small (a /20 shared with on-prem via Transit Gateway, say), you cannot grow it. Custom networking solves this by putting pods on a separate, larger CIDR – typically the non-routable 100.64.0.0/10 (CGNAT) range added as a secondary VPC CIDR. Node primary IPs stay on the routable subnet; pod IPs live in space you do not have to advertise anywhere.
How it works: with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true, the CNI stops using the node’s primary ENI/subnet for pods. Instead it reads an ENIConfig custom resource (selected per node via a label) that tells it which subnet and security groups to use for the secondary ENIs that carry pods.
Add the secondary CIDR and subnets
aws ec2 associate-vpc-cidr-block \
--vpc-id vpc-0abc123 \
--cidr-block 100.64.0.0/16
# create one pod subnet per AZ inside the new CIDR
aws ec2 create-subnet --vpc-id vpc-0abc123 \
--cidr-block 100.64.0.0/19 --availability-zone us-east-1a
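When sizing the pod subnets it helps to estimate how many nodes one subnet can feed. The sketch below assumes prefix delegation is also enabled, that the first and last /28 blocks are unusable because of AWS-reserved addresses, and that ENI primary IPs are ignored; nodes_per_pod_subnet is a hypothetical helper and expects a mask of /28 or shorter:

```shell
# nodes_per_pod_subnet MASK PODS_PER_NODE WARM_PREFIXES
nodes_per_pod_subnet() {
  local mask=$1 pods=$2 warm=$3
  # Number of /28 blocks in the subnet, minus the two touching reserved IPs.
  local blocks=$(( (1 << (28 - mask)) - 2 ))
  # Prefixes one full node holds: enough for its pods plus the warm ones.
  local per_node=$(( (pods + 15) / 16 + warm ))
  echo $(( blocks / per_node ))
}

nodes_per_pod_subnet 19 110 1   # /19, 110 pods, one warm prefix -> 63
```

Treat the result as an upper bound: in practice each pod-carrying ENI also takes an individual primary address from the same subnet.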
Enable custom networking and create ENIConfigs
kubectl set env daemonset aws-node -n kube-system \
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true \
ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone
ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone is the trick that makes this maintainable: the CNI matches the node’s well-known zone label to an ENIConfig named after the zone, so you do not have to label nodes manually. Create one ENIConfig per AZ, named exactly for the zone:
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a
spec:
  subnet: subnet-0podsubneta
  securityGroups:
    - sg-0nodesharedsg
---
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1b
spec:
  subnet: subnet-0podsubnetb
  securityGroups:
    - sg-0nodesharedsg
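With more than a couple of zones, generating these manifests is less error-prone than copying them. A minimal sketch, where render_eniconfig is a hypothetical helper and the subnet and security group IDs are the placeholders used in the manifests above:

```shell
# render_eniconfig ZONE SUBNET_ID SECURITY_GROUP_ID
# Prints one ENIConfig document named after the zone.
render_eniconfig() {
  local zone=$1 subnet=$2 sg=$3
  cat <<EOF
---
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: ${zone}
spec:
  subnet: ${subnet}
  securityGroups:
    - ${sg}
EOF
}

# Placeholder IDs; substitute the real pod subnet and SG for each zone.
render_eniconfig us-east-1a subnet-0podsubneta sg-0nodesharedsg
render_eniconfig us-east-1b subnet-0podsubnetb sg-0nodesharedsg
```

The output can be piped straight into kubectl apply -f - once the IDs are real.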
Two gotchas that cost people a day each. First, custom networking “wastes” the node’s primary ENI for pods – pods only land on secondary ENIs – so your per-node pod count drops by one ENI’s worth unless you combine it with prefix delegation (which you should). Second, this only applies to nodes launched after you enable it; existing nodes must be recycled.
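The first gotcha shows up directly in the max-pods arithmetic. A sketch using a hypothetical max_pods_custom_net helper, which drops the primary ENI from the pod-carrying set and returns the raw capacity before any 110/250 ceiling is applied:

```shell
# max_pods_custom_net ENIS IPS_PER_ENI [PREFIX_MODE]
# With custom networking only the secondary ENIs carry pods, so one ENI is
# subtracted. PREFIX_MODE=true multiplies each usable slot by 16.
max_pods_custom_net() {
  local enis=$1 ips_per_eni=$2 prefix_mode=${3:-false}
  local per_slot=1
  if [ "$prefix_mode" = "true" ]; then per_slot=16; fi
  echo $(( (enis - 1) * (ips_per_eni - 1) * per_slot + 2 ))
}

max_pods_custom_net 3 10        # m5.large, secondary IPs -> 20 (down from 29)
max_pods_custom_net 3 10 true   # m5.large, prefixes      -> 290 (raw)
```

Without prefix delegation the m5.large loses roughly a third of its pod capacity; with it, the lost ENI is irrelevant because the raw number is still well above 110.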
5. Security groups for pods
By default every pod on a node shares the node’s security group. When a specific workload needs its own ingress/egress posture – say it talks to an RDS instance whose SG only allows a tight source – you want security groups at the pod level. EKS supports this through the CNI’s ENI trunking feature plus a SecurityGroupPolicy CRD.
Mechanically: the CNI creates a trunk ENI on the node and attaches branch ENIs to it, one per pod that matches a policy. Each branch ENI carries the SGs you specify. This is gated by a flag and supported only on Nitro instances:
kubectl set env daemonset aws-node -n kube-system \
ENABLE_POD_ENI=true
Then declare which pods get which SGs. The policy selects pods by label or service account:
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payments-db-access
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: ledger-api
  securityGroups:
    groupIds:
      - sg-0ledgerpodsg
      - sg-0clustersharedsg
Pods matched by this policy get a branch ENI with sg-0ledgerpodsg (which the RDS SG trusts) instead of the node SG. Include the cluster shared SG too, or you break node-to-pod health checks and DNS.
The trunk interface limit is the real constraint
Branch ENIs come from a separate, smaller budget than regular ENIs. The number of branch ENIs (i.e., pods with their own SGs) per node is not the same as max-pods – it ranges from about 9 on smaller types to 54+ on large ones. Check it:
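# Branch ENI capacity is advertised on the node as the vpc.amazonaws.com/pod-eni resource
# (replace <node-name> with a real node; the regular ENI limits from
# describe-instance-types do not include the branch budget)
kubectl describe node <node-name> | grep 'vpc.amazonaws.com/pod-eni'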
Because branch ENIs are scarce, apply SecurityGroupPolicy only to workloads that genuinely need isolation, not the whole cluster. Pods without a matching policy keep using the node SG and do not consume the branch ENI budget.
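The branch budget can also drive node count on its own. A quick sketch, with nodes_for_sg_pods as a hypothetical helper and the per-node figure of 9 taken from the small-instance number mentioned above:

```shell
# nodes_for_sg_pods POD_COUNT BRANCH_ENIS_PER_NODE
# Minimum nodes needed to host POD_COUNT pods that each require a branch ENI.
nodes_for_sg_pods() {
  local pods=$1 branch=$2
  echo $(( (pods + branch - 1) / branch ))
}

nodes_for_sg_pods 40 9   # 40 policy-matched pods, 9 branch ENIs per node -> 5
```

Forty isolated pods need at least five such nodes even if each node could otherwise run 110 pods, which is another reason to keep the policy narrow.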
There are also behavioral caveats: with security groups for pods, source NAT for off-VPC traffic and certain NetworkPolicy interactions change. If a branch-ENI pod needs internet egress, set POD_SECURITY_GROUP_ENFORCING_MODE=standard so traffic still SNATs through the primary ENI:
kubectl set env daemonset aws-node -n kube-system \
POD_SECURITY_GROUP_ENFORCING_MODE=standard
6. Combining the features, and the IPv6 alternative
These three features stack:
- Prefix delegation + custom networking is the default endgame for large IPv4 clusters: pods on a roomy 100.64.0.0/x CIDR, packed 110+ per node via /28 prefixes, node IPs staying small on routable subnets. Enable both; they do not conflict.
- Add security groups for pods on top for the handful of workloads needing isolation. Branch ENIs honor the custom networking subnet too.
But there is a cleaner answer if you can adopt it: IPv6 mode. An IPv6 EKS cluster gives every pod a globally unique IPv6 address from a /80 per ENI – the address space is so vast that prefix delegation, custom networking, and WARM target tuning all become unnecessary. You set it at cluster creation (it cannot be toggled later):
aws eks create-cluster \
--name prod-v6 \
--kubernetes-network-config ipFamily=ipv6 \
--resources-vpc-config subnetIds=subnet-a,subnet-b \
--role-arn arn:aws:iam::111122223333:role/eksClusterRole \
--version 1.30
The trade-off is real and worth stating plainly: IPv4-only services (legacy partners, some SaaS endpoints, RDS without dual-stack) require an egress path, and IPv6 mode is permanent for the cluster’s life. I reach for IPv6 on greenfield clusters with modern workloads and stick with prefix delegation + custom networking when there is IPv4 baggage.
Verify
After enabling the features, confirm the data plane actually behaves:
# 1. Confirm prefix delegation: ENIs should show IPv4 prefixes, not just IPs
kubectl exec -n kube-system aws-node-xxxxx -c aws-node -- \
curl -s http://localhost:61679/v1/enis | jq '.ENIs[].IPv4Prefixes'
# 2. Confirm a pod got an IP from the secondary (custom networking) CIDR
kubectl get pod ledger-api-xxxx -n payments -o wide
# the pod IP should be in 100.64.0.0/x, the node IP in the routable subnet
# 3. Confirm security groups for pods: branch ENI exists with the right SG
kubectl describe pod ledger-api-xxxx -n payments | grep -A2 'vpc.amazonaws.com/pod-eni'
# 4. Watch ipamd allocate without errors
kubectl logs -n kube-system aws-node-xxxxx -c aws-node | grep -i 'prefix\|assign' | tail -20
The CNI metrics are exported on 127.0.0.1:61678/metrics (Prometheus). Alarm on two things: the gap between awscni_total_ip_addresses and awscni_assigned_ip_addresses per node, and, at the VPC level, subnet free-IP capacity in CloudWatch. For the per-node side, assuming the CNI metrics are already being shipped to CloudWatch:
// CloudWatch Metrics Insights: pod density approaching the node ceiling
SELECT AVG(awscni_assigned_ip_addresses)
FROM SCHEMA("ContainerInsights", ClusterName)
WHERE ClusterName = 'prod-use1'
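For a quick check without CloudWatch, the two gauges can be turned into a utilization percentage on the spot. A sketch that assumes the metric lines appear unlabelled in the Prometheus text output (if your CNI version attaches labels, the match needs adjusting); ip_utilization is a hypothetical helper:

```shell
# ip_utilization: reads Prometheus text on stdin and prints the percentage
# of allocated addresses that are actually assigned to pods.
ip_utilization() {
  awk '
    $1 == "awscni_total_ip_addresses"    { total = $2 }
    $1 == "awscni_assigned_ip_addresses" { assigned = $2 }
    END {
      if (total > 0) { print int(assigned * 100 / total) }
      else           { print 0 }
    }
  '
}

# Sample input standing in for the real metrics endpoint
printf '%s\n' \
  'awscni_total_ip_addresses 32' \
  'awscni_assigned_ip_addresses 20' | ip_utilization   # -> 62
```

Against a live node, feed it the output of curl -s http://localhost:61678/metrics run inside the aws-node container. A persistently low percentage across the fleet suggests the warm targets are too generous.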
Diagnosing IP allocation failures
When pods stick in ContainerCreating with failed to assign an IP address to container, walk this order:
- ipamd logs at /var/log/aws-routed-eni/ipamd.log on the node (or via the aws-node container logs). Look for InsufficientFreeAddressesInSubnet (subnet exhausted) versus InsufficientCidrBlocks (no contiguous /28 for prefix mode – a fragmentation signal).
- ENI limit reached. If the node is at its max ENIs and max IPs/prefixes, it physically cannot take more pods. Confirm against the calculator; the fix is prefix delegation or a bigger instance.
- Subnet free IPs. aws ec2 describe-subnets --subnet-ids subnet-x --query 'Subnets[].AvailableIpAddressCount'. If this is near zero, you need custom networking onto a secondary CIDR – adding nodes will not help.
- Account ENI/EIP quotas. Trunk/branch ENIs and large fleets can hit Network interfaces per Region (Service Quotas, code L-DF5E4CA3).
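The first step of that walk can be scripted. A minimal sketch that looks only for the two error codes named above; classify_ipam_failure and its output labels are hypothetical, and anything else in the log is reported as unclassified:

```shell
# classify_ipam_failure: reads ipamd log text on stdin and names the likely cause.
classify_ipam_failure() {
  local log
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'InsufficientFreeAddressesInSubnet'; then
    echo "subnet-exhausted"
  elif printf '%s\n' "$log" | grep -q 'InsufficientCidrBlocks'; then
    echo "no-contiguous-prefix"
  else
    echo "unclassified"
  fi
}

# Sample line standing in for real log output
echo 'failed to allocate: InsufficientCidrBlocks' | classify_ipam_failure
```

On a node this would be fed from /var/log/aws-routed-eni/ipamd.log; an unclassified result means moving on to the ENI limit and quota checks.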
Enterprise scenario
A fintech platform team ran a shared services EKS cluster in a /22 VPC – the largest block their networking team would carve from a Transit-Gateway-connected supernet, because every routable IP was inventory shared with on-prem. They were at 180 nodes when payments rollouts started failing: nodes had free CPU, but the three node subnets (/24 each) were exhausted. Worse, one workload, the ledger service, needed a dedicated SG because the RDS Aurora cluster it called only trusted a specific source SG, and they had been hairpinning through an NLB to fake it.
Two moves fixed both problems without renumbering the VPC. First, they associated 100.64.0.0/16 as a secondary CIDR and stood up /19 pod subnets per AZ, then enabled custom networking with zone-named ENIConfigs and prefix delegation together. Pod IPs moved entirely off the routable space; node count for the same workload dropped because each node now held 110 pods instead of ~30. Second, they applied a SecurityGroupPolicy to the ledger pods so they got branch ENIs carrying the SG Aurora trusted – deleting the NLB hairpin entirely.
The combined add-on config they standardized on:
{
  "env": {
    "ENABLE_PREFIX_DELEGATION": "true",
    "WARM_PREFIX_TARGET": "1",
    "AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG": "true",
    "ENI_CONFIG_LABEL_DEF": "topology.kubernetes.io/zone",
    "ENABLE_POD_ENI": "true",
    "POD_SECURITY_GROUP_ENFORCING_MODE": "standard"
  }
}
The one painful detail: enabling custom networking only affected new nodes, so they drained the fleet through a Karpenter-driven node rollout over a weekend rather than in place. Six months later the routable subnets sat below 15% utilization and the ledger team had a clean SG boundary.