Terraform Module: AWS EKS Node Group — Managed Worker Pools with Safe Rolling Upgrades

Quick take — A reusable Terraform module for aws_eks_node_group on hashicorp/aws ~> 5.0: launch templates, taints/labels, autoscaling, and graceful rolling updates for production EKS clusters. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "eks_node_group" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"

  cluster_name    = "..."           # Existing EKS cluster to attach the node group to.
  node_group_name = "..."           # Unique node group name (1-63 chars, validated).
  subnet_ids      = ["...", "..."]  # Subnet IDs (typically private) for the worker nodes.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
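
If you would rather not hard-code the three placeholders, they can usually be resolved with data sources. The following is a minimal sketch, not part of the module: the cluster name "demo", the node group name "default", and the Tier = "private" subnet tag are all assumptions to replace with your own values.

```hcl
# Assumption: an EKS cluster called "demo" already exists in this region.
data "aws_eks_cluster" "target" {
  name = "demo"
}

# Assumption: private subnets in the cluster's VPC carry a Tier = "private" tag.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_eks_cluster.target.vpc_config[0].vpc_id]
  }

  tags = {
    Tier = "private"
  }
}

module "eks_node_group" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"

  cluster_name    = data.aws_eks_cluster.target.name
  node_group_name = "default" # hypothetical name
  subnet_ids      = data.aws_subnets.private.ids
}
```

Because both lookups are data sources, their values are known at plan time, so the module's input validations run against real values.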

What this module is

An EKS managed node group is a set of EC2 instances that EKS provisions, registers as Kubernetes worker nodes, and lifecycle-manages on your behalf. Behind the scenes EKS creates an Auto Scaling Group, bootstraps each node with the kubelet, joins it to the control plane, and — critically — knows how to cordon, drain, and roll nodes when you bump the AMI or instance type. You get the elasticity of an ASG without writing your own bootstrap user-data or aws-auth plumbing.

The raw aws_eks_node_group resource is deceptively simple to declare but easy to get subtly wrong: the scaling_config and update_config interplay, the choice between AWS-managed AMI types and a custom launch template, the taint / labels blocks that decide what pods land where, and the lifecycle { ignore_changes = [scaling_config[0].desired_size] } dance you need so the Cluster Autoscaler and Terraform stop fighting over node counts. Wrapping it in a module lets every cluster declare a node pool with a few well-named variables — instance_types, capacity_type = "SPOT", min_size/max_size, taints — while the module enforces the rolling-update safety net, sane disk encryption, and consistent tagging in one place.

When to use it

Module structure

terraform-module-aws-eks-node-group/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # IAM role, launch template, aws_eks_node_group
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # node group id/arn + ASG + role outputs

versions.tf

terraform {
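  # 1.9+ is required: the max_size validation references var.min_size, and
  # cross-variable validation conditions are not supported before 1.9.
  required_version = ">= 1.9.0"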

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.cluster_name}-${var.node_group_name}"

  tags = merge(
    {
      "Name"                                      = local.name
      "ManagedBy"                                 = "terraform"
      "kubernetes.io/cluster/${var.cluster_name}" = "owned"
    },
    var.tags,
  )

  # Create an IAM role unless the caller passes an existing one in.
  create_role = var.node_role_arn == null
}

# --- IAM role for the worker nodes (optional) ---------------------------------
data "aws_iam_policy_document" "assume_role" {
  count = local.create_role ? 1 : 0

  statement {
    sid     = "EKSNodeAssumeRole"
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "this" {
  count = local.create_role ? 1 : 0

  name                 = "${local.name}-node"
  assume_role_policy   = data.aws_iam_policy_document.assume_role[0].json
  permissions_boundary = var.permissions_boundary_arn
  tags                 = local.tags
}

# Minimum policies every EKS worker node needs to join and pull images.
resource "aws_iam_role_policy_attachment" "this" {
  for_each = local.create_role ? toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
  ]) : toset([])

  role       = aws_iam_role.this[0].name
  policy_arn = each.value
}

# Optional SSM access for break-glass node debugging.
resource "aws_iam_role_policy_attachment" "ssm" {
  count = local.create_role && var.enable_ssm ? 1 : 0

  role       = aws_iam_role.this[0].name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

locals {
  node_role_arn = local.create_role ? aws_iam_role.this[0].arn : var.node_role_arn
}

# --- Launch template (encrypted disk, IMDSv2, tags) ---------------------------
# A module-managed launch template lets us enforce gp3 + encryption and IMDSv2
# without baking a custom AMI. EKS still picks the AMI unless ami_type is
# "CUSTOM", which would also need an image_id that this template does not set.
resource "aws_launch_template" "this" {
  name_prefix            = "${local.name}-"
  update_default_version = true
  vpc_security_group_ids = var.security_group_ids

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size           = var.disk_size
      volume_type           = "gp3"
      iops                  = 3000
      throughput            = 125
      encrypted             = true
      kms_key_id            = var.ebs_kms_key_arn
      delete_on_termination = true
    }
  }

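  # Require IMDSv2 session tokens to mitigate SSRF-style credential theft.
  # A hop limit of 2 still lets pods reach IMDS; lower it to 1 to restrict
  # access to processes in the host network namespace.
  # Instance tags in metadata stay disabled: that feature rejects tag keys
  # containing "/", such as the kubernetes.io/cluster/... tag in local.tags.
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "disabled"
  }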

  monitoring {
    enabled = var.enable_detailed_monitoring
  }

  tag_specifications {
    resource_type = "instance"
    tags          = local.tags
  }

  tag_specifications {
    resource_type = "volume"
    tags          = local.tags
  }

  tags = local.tags

  lifecycle {
    create_before_destroy = true
  }
}

# --- The managed node group ---------------------------------------------------
resource "aws_eks_node_group" "this" {
  cluster_name    = var.cluster_name
  node_group_name = var.node_group_name
  node_role_arn   = local.node_role_arn
  subnet_ids      = var.subnet_ids
  version         = var.kubernetes_version

  # ami_type + instance_types govern the EKS-managed AMI. capacity_type picks
  # ON_DEMAND vs SPOT. force_update_version drains nodes even if PDBs block.
  ami_type             = var.ami_type
  instance_types       = var.instance_types
  capacity_type        = var.capacity_type
  force_update_version = var.force_update_version

  scaling_config {
    desired_size = var.desired_size
    min_size     = var.min_size
    max_size     = var.max_size
  }

  # Rolling-update budget: the share of nodes EKS may take unavailable at once.
  update_config {
    max_unavailable_percentage = var.max_unavailable_percentage
  }

  launch_template {
    id      = aws_launch_template.this.id
    version = aws_launch_template.this.latest_version
  }

  dynamic "taint" {
    for_each = var.taints
    content {
      key    = taint.value.key
      value  = taint.value.value # optional(string): null when omitted
      effect = taint.value.effect
    }
  }

  labels = var.labels

  tags = local.tags

  # Let the Cluster Autoscaler own desired_size; only Terraform changes the
  # bounds. Without this, every plan would try to reset the node count.
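  # No create_before_destroy here: node_group_name is fixed, so a replacement
  # created first would collide with the existing group's name.
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }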

  # Policies must attach before the node group tries to join the cluster.
  depends_on = [aws_iam_role_policy_attachment.this]
}
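
A note on replacements: create_before_destroy on a node group can only succeed when the new group is allowed to take a different name from the old one. One way to get that is the provider's node_group_name_prefix argument. The sketch below is a hypothetical variant, not the module's actual resource; it reuses the module's variable and local names and omits the launch template, taints, and labels for brevity.

```hcl
# Sketch only: a prefix-named variant of the node group resource.
resource "aws_eks_node_group" "prefixed" {
  cluster_name = var.cluster_name
  # Terraform appends a unique suffix, so old and new groups can coexist.
  node_group_name_prefix = "${var.node_group_name}-"
  node_role_arn          = local.node_role_arn
  subnet_ids             = var.subnet_ids

  scaling_config {
    desired_size = var.desired_size
    min_size     = var.min_size
    max_size     = var.max_size
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes        = [scaling_config[0].desired_size]
  }
}
```

The trade-off is that the final node group name is no longer exactly node_group_name, and the prefix has to leave room for the generated suffix within the 63-character limit.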

variables.tf

variable "cluster_name" {
  description = "Name of the existing EKS cluster to attach this node group to."
  type        = string
}

variable "node_group_name" {
  description = "Name of the managed node group (must be unique within the cluster)."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9][a-zA-Z0-9_-]{0,62}$", var.node_group_name))
    error_message = "node_group_name must be 1-63 chars: letters, numbers, hyphens or underscores."
  }
}

variable "subnet_ids" {
  description = "Subnet IDs (typically private) in which to launch worker nodes."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) > 0
    error_message = "At least one subnet ID is required."
  }
}

variable "node_role_arn" {
  description = "Existing IAM role ARN for the nodes. If null, the module creates one."
  type        = string
  default     = null
}

variable "permissions_boundary_arn" {
  description = "Optional permissions boundary applied to the node IAM role created by this module."
  type        = string
  default     = null
}

variable "security_group_ids" {
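  description = "Security group IDs attached to nodes via the launch template. When this list is non-empty, EKS does not add the cluster security group automatically, so include it here if the nodes need it."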
  type        = list(string)
  default     = []
}

variable "kubernetes_version" {
  description = "Kubernetes minor version for the node group (e.g. \"1.30\"). Defaults to the cluster version when null."
  type        = string
  default     = null
}

variable "ami_type" {
  description = "EKS AMI type, e.g. AL2023_x86_64_STANDARD, AL2023_ARM_64_STANDARD, BOTTLEROCKET_x86_64, AL2_x86_64_GPU."
  type        = string
  default     = "AL2023_x86_64_STANDARD"

  validation {
    condition = contains([
      "AL2023_x86_64_STANDARD", "AL2023_ARM_64_STANDARD",
      "AL2_x86_64", "AL2_ARM_64", "AL2_x86_64_GPU",
      "BOTTLEROCKET_x86_64", "BOTTLEROCKET_ARM_64", "CUSTOM",
    ], var.ami_type)
    error_message = "ami_type must be a valid EKS AMI type or CUSTOM."
  }
}

variable "instance_types" {
  description = "List of EC2 instance types for the node group."
  type        = list(string)
  default     = ["m6i.large"]

  validation {
    condition     = length(var.instance_types) > 0
    error_message = "At least one instance type is required."
  }
}

variable "capacity_type" {
  description = "Capacity type: ON_DEMAND or SPOT."
  type        = string
  default     = "ON_DEMAND"

  validation {
    condition     = contains(["ON_DEMAND", "SPOT"], var.capacity_type)
    error_message = "capacity_type must be ON_DEMAND or SPOT."
  }
}

variable "desired_size" {
  description = "Initial desired node count. After creation, the Cluster Autoscaler owns this value."
  type        = number
  default     = 2
}

variable "min_size" {
  description = "Minimum node count for the node group."
  type        = number
  default     = 1
}

variable "max_size" {
  description = "Maximum node count for the node group."
  type        = number
  default     = 5

  validation {
    condition     = var.max_size >= var.min_size
    error_message = "max_size must be greater than or equal to min_size."
  }
}

variable "max_unavailable_percentage" {
  description = "Percentage of nodes EKS may take unavailable at once during a rolling update."
  type        = number
  default     = 33

  validation {
    condition     = var.max_unavailable_percentage >= 1 && var.max_unavailable_percentage <= 100
    error_message = "max_unavailable_percentage must be between 1 and 100."
  }
}

variable "force_update_version" {
  description = "Force a node version update even when pods cannot be drained due to PodDisruptionBudgets."
  type        = bool
  default     = false
}

variable "disk_size" {
  description = "Root EBS volume size in GiB for each node."
  type        = number
  default     = 50

  validation {
    condition     = var.disk_size >= 20
    error_message = "disk_size must be at least 20 GiB to fit the OS, kubelet and image cache."
  }
}

variable "ebs_kms_key_arn" {
  description = "KMS key ARN to encrypt the root EBS volume. Uses the account's default EBS encryption key (aws/ebs unless changed) when null."
  type        = string
  default     = null
}

variable "enable_detailed_monitoring" {
  description = "Enable EC2 detailed (1-minute) CloudWatch monitoring on nodes."
  type        = bool
  default     = false
}

variable "enable_ssm" {
  description = "Attach AmazonSSMManagedInstanceCore to the node role for break-glass debugging (only when the module creates the role)."
  type        = bool
  default     = true
}

variable "labels" {
  description = "Map of Kubernetes labels applied to every node in the group."
  type        = map(string)
  default     = {}
}

variable "taints" {
  description = "List of Kubernetes taints. effect must be NO_SCHEDULE, NO_EXECUTE, or PREFER_NO_SCHEDULE."
  type = list(object({
    key    = string
    value  = optional(string)
    effect = string
  }))
  default = []

  validation {
    condition = alltrue([
      for t in var.taints :
      contains(["NO_SCHEDULE", "NO_EXECUTE", "PREFER_NO_SCHEDULE"], t.effect)
    ])
    error_message = "Each taint effect must be NO_SCHEDULE, NO_EXECUTE, or PREFER_NO_SCHEDULE."
  }
}

variable "tags" {
  description = "Additional tags merged onto all resources created by the module."
  type        = map(string)
  default     = {}
}

outputs.tf

output "node_group_id" {
  description = "EKS node group ID in the form cluster_name:node_group_name."
  value       = aws_eks_node_group.this.id
}

output "node_group_arn" {
  description = "ARN of the EKS managed node group."
  value       = aws_eks_node_group.this.arn
}

output "node_group_name" {
  description = "Name of the EKS managed node group."
  value       = aws_eks_node_group.this.node_group_name
}

output "status" {
  description = "Current status of the node group (e.g. ACTIVE, UPDATING)."
  value       = aws_eks_node_group.this.status
}

output "autoscaling_group_names" {
  description = "Names of the Auto Scaling Groups backing this node group (for Cluster Autoscaler discovery)."
  value       = [for r in aws_eks_node_group.this.resources[0].autoscaling_groups : r.name]
}

output "node_role_arn" {
  description = "IAM role ARN used by the worker nodes."
  value       = local.node_role_arn
}

output "node_role_name" {
  description = "IAM role name used by the worker nodes (null when an external role is supplied)."
  value       = local.create_role ? aws_iam_role.this[0].name : null
}

output "launch_template_id" {
  description = "ID of the launch template backing the node group."
  value       = aws_launch_template.this.id
}
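
As an illustration of how these outputs get consumed, a caller can attach an extra managed policy to the module-created role through node_role_name. This is a sketch under assumptions: the module label eks_node_group comes from the Quickstart, and the CloudWatch agent policy is just an example of something a team might add.

```hcl
# Caller-side code, outside the module. Only valid when the module created
# the role; node_role_name is null if an external node_role_arn was passed in.
resource "aws_iam_role_policy_attachment" "cloudwatch_agent" {
  role       = module.eks_node_group.node_role_name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}
```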

How to use it

A realistic two-pool setup: an on-demand system pool and a tainted Spot pool for batch workloads. The batch pool reuses the node role that the system pool creates, and the system pool’s ASG names are used to put the Cluster Autoscaler discovery tag on each backing Auto Scaling Group.

module "eks_node_group_system" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"

  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "system"
  subnet_ids      = module.vpc.private_subnet_ids

  ami_type       = "AL2023_x86_64_STANDARD"
  instance_types = ["m6i.large"]
  capacity_type  = "ON_DEMAND"

  desired_size = 3
  min_size     = 3
  max_size     = 6

  disk_size       = 50
  ebs_kms_key_arn = aws_kms_key.ebs.arn

  labels = {
    "workload-type" = "system"
  }

  tags = {
    Environment = "prod"
    Team        = "platform"
  }
}

module "eks_node_group_batch" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"

  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "batch-spot"
  subnet_ids      = module.vpc.private_subnet_ids

  # Reuse the role the system pool created instead of minting another.
  node_role_arn = module.eks_node_group_system.node_role_arn

  ami_type       = "AL2023_x86_64_STANDARD"
  instance_types = ["m6i.large", "m6a.large", "m5.large"] # diversify Spot pools
  capacity_type  = "SPOT"

  desired_size               = 2
  min_size                   = 0
  max_size                   = 20
  max_unavailable_percentage = 50

  labels = {
    "workload-type" = "batch"
  }

  taints = [
    {
      key    = "dedicated"
      value  = "batch"
      effect = "NO_SCHEDULE"
    },
  ]

  tags = {
    Environment = "prod"
    Team        = "data"
  }
}

# Downstream: let the Cluster Autoscaler discover the system pool's ASG.
resource "aws_autoscaling_group_tag" "ca_enabled" {
  for_each = toset(module.eks_node_group_system.autoscaling_group_names)

  autoscaling_group_name = each.value

  tag {
    key                 = "k8s.io/cluster-autoscaler/enabled"
    value               = "true"
    propagate_at_launch = false
  }
}
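
The taint on the batch pool only pays off if batch workloads tolerate it. The sketch below shows what that could look like for a Kubernetes Job managed from Terraform. It assumes the hashicorp/kubernetes provider is already configured against the cluster; the job name, namespace, and image are placeholders.

```hcl
# Assumes a configured hashicorp/kubernetes provider; names are hypothetical.
resource "kubernetes_job_v1" "nightly_report" {
  metadata {
    name      = "nightly-report"
    namespace = "default"
  }

  spec {
    template {
      metadata {}

      spec {
        # Matches the label set on the batch pool above.
        node_selector = {
          "workload-type" = "batch"
        }

        # The EKS taint effect NO_SCHEDULE appears on the node as NoSchedule.
        toleration {
          key      = "dedicated"
          operator = "Equal"
          value    = "batch"
          effect   = "NoSchedule"
        }

        container {
          name    = "report"
          image   = "busybox:1.36" # placeholder image
          command = ["sh", "-c", "echo done"]
        }

        restart_policy = "Never"
      }
    }
  }

  wait_for_completion = false
}
```

The toleration lets the pod onto tainted nodes, while the node selector keeps it off the system pool. Both are needed for strict placement.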

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config, live/prod/eks_node_group/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"
}

inputs = {
  cluster_name    = "..."
  node_group_name = "..."
  subnet_ids      = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/eks_node_group && terragrunt apply   # this module only
terragrunt run-all apply                          # run from live/prod: every module under it

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
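
Dependency orchestration is easiest to see with a dependency block. The following sketch of a module-level terragrunt.hcl assumes sibling units at ../eks and ../vpc that expose cluster_name and private_subnet_ids outputs. Those paths and output names are hypothetical and depend on how your own modules are written.

```hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"
}

# Assumption: ../eks exposes a cluster_name output.
dependency "eks" {
  config_path = "../eks"

  # Placeholder values so validate/plan work before ../eks has been applied.
  mock_outputs = {
    cluster_name = "mock-cluster"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

# Assumption: ../vpc exposes a private_subnet_ids output.
dependency "vpc" {
  config_path = "../vpc"

  mock_outputs = {
    private_subnet_ids = ["subnet-mock-a", "subnet-mock-b"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  cluster_name    = dependency.eks.outputs.cluster_name
  node_group_name = "system"
  subnet_ids      = dependency.vpc.outputs.private_subnet_ids
}
```

With these blocks in place, run-all applies ../vpc and ../eks before this unit, which is the ordering the node group needs.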

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| cluster_name | string | - | yes | Existing EKS cluster to attach the node group to. |
| node_group_name | string | - | yes | Unique node group name (1-63 chars, validated). |
| subnet_ids | list(string) | - | yes | Subnet IDs (typically private) for the worker nodes. |
| node_role_arn | string | null | no | Existing node IAM role; module creates one when null. |
| permissions_boundary_arn | string | null | no | Permissions boundary for the module-created node role. |
| security_group_ids | list(string) | [] | no | Security groups attached via the launch template; when set, EKS does not add the cluster security group for you. |
| kubernetes_version | string | null | no | Node K8s minor version; inherits cluster version when null. |
| ami_type | string | "AL2023_x86_64_STANDARD" | no | EKS AMI type or CUSTOM (validated against allowed set). |
| instance_types | list(string) | ["m6i.large"] | no | EC2 instance types for the node group. |
| capacity_type | string | "ON_DEMAND" | no | ON_DEMAND or SPOT (validated). |
| desired_size | number | 2 | no | Initial node count; Cluster Autoscaler owns it afterward. |
| min_size | number | 1 | no | Minimum node count. |
| max_size | number | 5 | no | Maximum node count (must be >= min_size). |
| max_unavailable_percentage | number | 33 | no | Percent of nodes unavailable during a rolling update (1-100). |
| force_update_version | bool | false | no | Force version updates even when PDBs block draining. |
| disk_size | number | 50 | no | Root EBS volume size in GiB (>= 20). |
| ebs_kms_key_arn | string | null | no | KMS key ARN for root volume encryption. |
| enable_detailed_monitoring | bool | false | no | Enable 1-minute EC2 CloudWatch monitoring. |
| enable_ssm | bool | true | no | Attach SSM core policy to the module-created node role. |
| labels | map(string) | {} | no | Kubernetes labels applied to every node. |
| taints | list(object) | [] | no | Kubernetes taints (effect validated). |
| tags | map(string) | {} | no | Additional tags merged onto all resources. |

Outputs

| Name | Description |
| --- | --- |
| node_group_id | Node group ID in the form cluster_name:node_group_name. |
| node_group_arn | ARN of the managed node group. |
| node_group_name | Name of the managed node group. |
| status | Current node group status (ACTIVE, UPDATING, etc.). |
| autoscaling_group_names | ASG names backing the node group for Cluster Autoscaler discovery. |
| node_role_arn | IAM role ARN used by the worker nodes. |
| node_role_name | IAM role name (null when an external role is supplied). |
| launch_template_id | ID of the launch template backing the node group. |

Enterprise scenario

A fintech platform team runs a regulated EKS cluster where cluster add-ons (CoreDNS, the AWS Load Balancer Controller, Cluster Autoscaler) must never share nodes with tenant workloads. They instantiate this module three times: a small on-demand system pool, a general on-demand pool for stateful services, and a large batch-spot pool tainted dedicated=batch:NO_SCHEDULE that scales from zero. Because the module pins IMDSv2 (http_tokens = "required") and encrypts every root volume by default, each pool meets the metadata-access and disk-encryption checks in the firm’s CIS-EKS audit without extra wiring. With force_update_version left at its default of false, EKS-managed rolling upgrades honour the team’s PodDisruptionBudgets when they bump from Kubernetes 1.29 to 1.30.
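
The three instantiations in this scenario could also be expressed as one module block with for_each. The sketch below is one possible shape, not the team's actual configuration: the variable names, pool sizes, and instance types are illustrative assumptions.

```hcl
variable "cluster_name" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

locals {
  # Hypothetical pool definitions; sizes and instance types are illustrative.
  pools = {
    "system" = {
      capacity_type  = "ON_DEMAND"
      instance_types = ["m6i.large"]
      min_size       = 2
      max_size       = 4
      taints         = []
    }
    "general" = {
      capacity_type  = "ON_DEMAND"
      instance_types = ["m6i.xlarge"]
      min_size       = 3
      max_size       = 12
      taints         = []
    }
    "batch-spot" = {
      capacity_type  = "SPOT"
      instance_types = ["m6i.large", "m6a.large", "m5.large"]
      min_size       = 0 # scales from zero
      max_size       = 50
      taints         = [{ key = "dedicated", value = "batch", effect = "NO_SCHEDULE" }]
    }
  }
}

module "node_groups" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-eks-node-group?ref=v1.0.0"
  for_each = local.pools

  cluster_name    = var.cluster_name
  node_group_name = each.key
  subnet_ids      = var.private_subnet_ids

  capacity_type  = each.value.capacity_type
  instance_types = each.value.instance_types
  desired_size   = each.value.min_size # starting point; the autoscaler takes over
  min_size       = each.value.min_size
  max_size       = each.value.max_size
  taints         = each.value.taints

  labels = {
    "workload-type" = each.key
  }
}
```

Keying the pools by name keeps each node group at a stable address such as module.node_groups["batch-spot"], so adding or removing a pool does not disturb the others.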

Best practices
