Terraform Module: AWS Batch — Spot-Backed Compute Environments Without the Boilerplate

Quick take — A reusable Terraform module for AWS Batch on hashicorp/aws ~> 5.0: provision a Fargate or EC2 Spot compute environment, job queue, and IAM service roles with validated, var-driven inputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "batch" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"

  name_prefix        = "..."           # Prefix for all named resources; 2-25 lowercase alphanum…
  environment        = "..."           # Deployment environment; one of `dev`, `stage`, `prod`.
  subnet_ids         = ["...", "..."]  # Subnets (use private) for tasks/instances; must be non-…
  security_group_ids = ["...", "..."]  # Security groups attached to Batch tasks/instances; must…
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.

What this module is

AWS Batch is a managed batch-computing service that schedules and runs containerized jobs across pools of compute you do not have to babysit. You define a compute environment (the pool — Fargate, Fargate Spot, EC2 On-Demand, or EC2 Spot), attach one or more job queues to it, register job definitions, and submit jobs; Batch handles instance provisioning, bin-packing, scaling to zero when idle, and tearing capacity down when the queue drains. It is the natural fit for genomics pipelines, nightly ETL, ML training sweeps, rendering, and any embarrassingly parallel workload that is wasteful to keep on a long-running cluster.

The core resource, aws_batch_compute_environment, is deceptively fiddly. A production-ready environment needs a correctly scoped service-linked or service IAM role, the right networking (subnets, security_group_ids), type = MANAGED with a compute_resources block whose type is one of FARGATE, FARGATE_SPOT, EC2, or SPOT, and, for Spot EC2, a Spot fleet IAM role plus a bid_percentage. Get desired_vcpus or the update_policy wrong and Terraform either fights the Batch scaler on every apply or terminates running jobs during an infrastructure update; forget lifecycle { create_before_destroy = true } and replacements deadlock because the old environment is still referenced by the queue. This module encodes all of that once: it stands up the compute environment, an attached job queue, and the supporting IAM, exposes only the knobs that matter, and validates them so a bad bid_percentage or empty subnet list fails at plan instead of at apply.

When to use it

Reach for plain ECS/EKS instead if your workload is a long-lived service, needs sub-second scheduling latency, or requires per-request autoscaling — Batch is optimized for throughput of finite jobs, not steady-state request serving.

Module structure

terraform-module-aws-batch/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.name_prefix}-${var.environment}"

  # Spot EC2 needs a fleet role + bid percentage; Fargate/Fargate Spot/EC2 do not.
  is_spot_ec2     = var.compute_type == "SPOT"
  is_ec2_flavored = contains(["EC2", "SPOT"], var.compute_type)

  tags = merge(
    {
      Module      = "terraform-module-aws-batch"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags,
  )
}

# --- Trust policies -------------------------------------------------------

data "aws_iam_policy_document" "batch_service_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["batch.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "ec2_assume" {
  count = local.is_ec2_flavored ? 1 : 0

  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "spot_fleet_assume" {
  count = local.is_spot_ec2 ? 1 : 0

  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["spotfleet.amazonaws.com"]
    }
  }
}

# --- Batch service role ---------------------------------------------------

resource "aws_iam_role" "batch_service" {
  name               = "${local.name}-batch-service"
  assume_role_policy = data.aws_iam_policy_document.batch_service_assume.json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "batch_service" {
  role       = aws_iam_role.batch_service.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"
}

# --- EC2 instance role + profile (only for EC2/SPOT compute types) --------

resource "aws_iam_role" "ecs_instance" {
  count              = local.is_ec2_flavored ? 1 : 0
  name               = "${local.name}-ecs-instance"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume[0].json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "ecs_instance" {
  count      = local.is_ec2_flavored ? 1 : 0
  role       = aws_iam_role.ecs_instance[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs_instance" {
  count = local.is_ec2_flavored ? 1 : 0
  name  = "${local.name}-ecs-instance"
  role  = aws_iam_role.ecs_instance[0].name
  tags  = local.tags
}

# --- Spot fleet role (only for SPOT compute type) -------------------------

resource "aws_iam_role" "spot_fleet" {
  count              = local.is_spot_ec2 ? 1 : 0
  name               = "${local.name}-spot-fleet"
  assume_role_policy = data.aws_iam_policy_document.spot_fleet_assume[0].json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "spot_fleet" {
  count      = local.is_spot_ec2 ? 1 : 0
  role       = aws_iam_role.spot_fleet[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole"
}

# --- Compute environment --------------------------------------------------

resource "aws_batch_compute_environment" "this" {
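  # A name prefix rather than a fixed name: create_before_destroy (below)
  # creates the replacement while the old environment still exists, and
  # compute environment names must be unique. Argument name as of the AWS
  # provider 5.x series pinned in versions.tf.
  compute_environment_name_prefix = "${local.name}-"
  type                            = "MANAGED"
  state                           = var.state
  service_role                    = aws_iam_role.batch_service.arn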

  compute_resources {
    type = var.compute_type

    # vCPU bounds. Fargate flavors take only max_vcpus; min/desired stay unset.
    max_vcpus     = var.max_vcpus
    min_vcpus     = local.is_ec2_flavored ? var.min_vcpus : null
    desired_vcpus = local.is_ec2_flavored ? var.min_vcpus : null

    subnets            = var.subnet_ids
    security_group_ids = var.security_group_ids

    # EC2/SPOT-only knobs.
    instance_role       = local.is_ec2_flavored ? aws_iam_instance_profile.ecs_instance[0].arn : null
    instance_type       = local.is_ec2_flavored ? var.instance_types : null
    allocation_strategy = local.is_ec2_flavored ? var.allocation_strategy : null
    spot_iam_fleet_role = local.is_spot_ec2 ? aws_iam_role.spot_fleet[0].arn : null
    bid_percentage      = local.is_spot_ec2 ? var.bid_percentage : null

    tags = local.tags
  }

  update_policy {
    job_execution_timeout_minutes = var.job_execution_timeout_minutes
    terminate_jobs_on_update      = var.terminate_jobs_on_update
  }

  tags = local.tags

  # The service role must exist before the CE, and the CE must be replaced
  # before deletion because the job queue references it.
  depends_on = [aws_iam_role_policy_attachment.batch_service]

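  lifecycle {
    create_before_destroy = true

    # Batch's scaler adjusts desired_vcpus at runtime; ignore that drift so
    # Terraform does not try to reset it on every apply.
    ignore_changes = [compute_resources[0].desired_vcpus]
  }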
}

# --- Job queue attached to the compute environment ------------------------

resource "aws_batch_job_queue" "this" {
  name     = "${local.name}-queue"
  state    = var.state
  priority = var.queue_priority

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.this.arn
  }

  tags = local.tags
}

variables.tf

variable "name_prefix" {
  description = "Prefix for all named resources (e.g. \"genomics\")."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,24}$", var.name_prefix))
    error_message = "name_prefix must be 2-25 chars, lowercase alphanumeric or hyphen, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment, appended to the name prefix."
  type        = string

  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "environment must be one of: dev, stage, prod."
  }
}

variable "compute_type" {
  description = "Compute environment provisioning model."
  type        = string
  default     = "FARGATE_SPOT"

  validation {
    condition     = contains(["FARGATE", "FARGATE_SPOT", "EC2", "SPOT"], var.compute_type)
    error_message = "compute_type must be one of: FARGATE, FARGATE_SPOT, EC2, SPOT."
  }
}

variable "state" {
  description = "Desired state of the compute environment and job queue."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "DISABLED"], var.state)
    error_message = "state must be ENABLED or DISABLED."
  }
}

variable "subnet_ids" {
  description = "Subnets the compute environment launches tasks/instances into (use private subnets)."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) > 0
    error_message = "At least one subnet_id is required."
  }
}

variable "security_group_ids" {
  description = "Security groups attached to Batch tasks/instances."
  type        = list(string)

  validation {
    condition     = length(var.security_group_ids) > 0
    error_message = "At least one security_group_id is required."
  }
}

variable "max_vcpus" {
  description = "Maximum vCPUs the compute environment may scale to (hard cost ceiling)."
  type        = number
  default     = 16

  validation {
    condition     = var.max_vcpus >= 1 && var.max_vcpus <= 10000
    error_message = "max_vcpus must be between 1 and 10000."
  }
}

variable "min_vcpus" {
  description = "Minimum/desired vCPUs to keep warm (EC2/SPOT only; ignored for Fargate). Set 0 to scale to zero."
  type        = number
  default     = 0

  validation {
    condition     = var.min_vcpus >= 0
    error_message = "min_vcpus must be >= 0."
  }
}

variable "instance_types" {
  description = "EC2 instance types or families for EC2/SPOT compute (e.g. [\"optimal\"] or [\"c6i\", \"m6i\"]). Unused for Fargate."
  type        = list(string)
  default     = ["optimal"]
}

variable "allocation_strategy" {
  description = "EC2/SPOT allocation strategy: BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED."
  type        = string
  default     = "BEST_FIT_PROGRESSIVE"

  validation {
    condition     = contains(["BEST_FIT", "BEST_FIT_PROGRESSIVE", "SPOT_CAPACITY_OPTIMIZED"], var.allocation_strategy)
    error_message = "allocation_strategy must be BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED."
  }
}

variable "bid_percentage" {
  description = "Maximum Spot bid as a percentage of On-Demand price (SPOT compute type only)."
  type        = number
  default     = 100

  validation {
    condition     = var.bid_percentage >= 1 && var.bid_percentage <= 100
    error_message = "bid_percentage must be between 1 and 100."
  }
}

variable "queue_priority" {
  description = "Job queue priority; higher values are scheduled first when queues share a compute environment."
  type        = number
  default     = 1

  validation {
    condition     = var.queue_priority >= 0 && var.queue_priority <= 1000
    error_message = "queue_priority must be between 0 and 1000."
  }
}

variable "job_execution_timeout_minutes" {
  description = "Grace period for running jobs before infrastructure updates terminate them."
  type        = number
  default     = 30
}

variable "terminate_jobs_on_update" {
  description = "Whether to terminate running jobs when the compute environment is updated."
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags merged onto every resource."
  type        = map(string)
  default     = {}
}
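
Each validation block above inspects a single variable. On the module's minimum Terraform version (1.5), a validation condition can only refer to the variable it belongs to, so rules that span several inputs need another home. A minimal sketch of one option, assuming a built-in terraform_data resource is acceptable as a plan-time guard; the resource name input_guard and both rules are illustrative additions, not part of the module as published:

```hcl
# Hypothetical guard resource: creates nothing in AWS, it only carries
# preconditions that are evaluated during plan.
resource "terraform_data" "input_guard" {
  lifecycle {
    # min_vcpus only applies to EC2/SPOT, so only check it there.
    precondition {
      condition     = !contains(["EC2", "SPOT"], var.compute_type) || var.min_vcpus <= var.max_vcpus
      error_message = "min_vcpus must not exceed max_vcpus for EC2/SPOT compute types."
    }

    # Batch offers SPOT_CAPACITY_OPTIMIZED for Spot capacity only.
    precondition {
      condition     = var.compute_type != "EC2" || var.allocation_strategy != "SPOT_CAPACITY_OPTIMIZED"
      error_message = "SPOT_CAPACITY_OPTIMIZED is not valid with compute_type = \"EC2\"; use BEST_FIT or BEST_FIT_PROGRESSIVE."
    }
  }
}
```

The same precondition blocks could instead sit inside the lifecycle block of aws_batch_compute_environment.this; a separate guard simply keeps the input rules in one place.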

outputs.tf

output "compute_environment_arn" {
  description = "ARN of the Batch compute environment."
  value       = aws_batch_compute_environment.this.arn
}

output "compute_environment_name" {
  description = "Name of the Batch compute environment."
  value       = aws_batch_compute_environment.this.compute_environment_name
}

output "compute_environment_status" {
  description = "Current status of the compute environment (e.g. VALID)."
  value       = aws_batch_compute_environment.this.status
}

output "job_queue_arn" {
  description = "ARN of the attached job queue — pass to job definitions and submit-job callers."
  value       = aws_batch_job_queue.this.arn
}

output "job_queue_name" {
  description = "Name of the attached job queue."
  value       = aws_batch_job_queue.this.name
}

output "service_role_arn" {
  description = "ARN of the Batch service IAM role."
  value       = aws_iam_role.batch_service.arn
}

output "instance_profile_arn" {
  description = "ARN of the ECS instance profile (null for Fargate compute types)."
  value       = try(aws_iam_instance_profile.ecs_instance[0].arn, null)
}

How to use it

module "batch" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"

  name_prefix = "genomics"
  environment = "prod"

  # Cost-optimized Spot EC2 across compute-optimized families.
  compute_type        = "SPOT"
  instance_types      = ["c6i", "c6a", "m6i"]
  allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
  bid_percentage      = 80
  max_vcpus           = 256
  min_vcpus           = 0

  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = [aws_security_group.batch.id]

  queue_priority = 10

  tags = {
    Team       = "bioinformatics"
    CostCenter = "rnd-4400"
  }
}

# Downstream: a job definition that targets the module's queue via a
# variant-calling container, plus an EventBridge schedule that submits to it.
resource "aws_batch_job_definition" "variant_calling" {
  name                  = "variant-calling"
  type                  = "container"
  platform_capabilities = ["EC2"]

  container_properties = jsonencode({
    image      = "${aws_ecr_repository.pipeline.repository_url}:latest"
    command    = ["python", "call_variants.py"]
    jobRoleArn = aws_iam_role.job_task.arn
    resourceRequirements = [
      { type = "VCPU", value = "4" },
      { type = "MEMORY", value = "8192" },
    ]
  })
}

# Reference a module output downstream.
output "submit_target_queue" {
  description = "Queue ARN that the scheduler should submit variant-calling jobs to."
  value       = module.batch.job_queue_arn
}
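
The comment above mentions a schedule that submits to the queue, which the listing does not show. A minimal sketch of that piece, assuming a classic EventBridge rule (not EventBridge Scheduler) and reusing module.batch and aws_batch_job_definition.variant_calling from the example; the role, policy, and rule names and the 02:00 UTC cron are hypothetical:

```hcl
# Trust policy letting EventBridge assume the submit role.
data "aws_iam_policy_document" "events_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

# Permission to submit this job definition to this queue only.
data "aws_iam_policy_document" "events_submit" {
  statement {
    effect  = "Allow"
    actions = ["batch:SubmitJob"]
    resources = [
      module.batch.job_queue_arn,
      aws_batch_job_definition.variant_calling.arn,
    ]
  }
}

resource "aws_iam_role" "events_submit" {
  name               = "variant-calling-events-submit" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.events_assume.json
}

resource "aws_iam_role_policy" "events_submit" {
  name   = "submit-batch-job"
  role   = aws_iam_role.events_submit.id
  policy = data.aws_iam_policy_document.events_submit.json
}

resource "aws_cloudwatch_event_rule" "nightly" {
  name                = "variant-calling-nightly"
  schedule_expression = "cron(0 2 * * ? *)" # 02:00 UTC every day (assumed schedule)
}

resource "aws_cloudwatch_event_target" "submit" {
  rule     = aws_cloudwatch_event_rule.nightly.name
  arn      = module.batch.job_queue_arn # the target is the job queue
  role_arn = aws_iam_role.events_submit.arn

  batch_target {
    job_definition = aws_batch_job_definition.variant_calling.arn
    job_name       = "variant-calling-nightly"
  }
}
```

Because the policy references the job definition ARN, which includes the revision, Terraform updates the policy whenever a new revision is registered.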

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config, live/prod/batch/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
}

inputs = {
  name_prefix        = "..."
  environment        = "..."
  subnet_ids         = ["...", "..."]
  security_group_ids = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/batch && terragrunt apply)      # this module only
(cd live/prod && terragrunt run-all apply)    # every module under live/prod
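
The root config in step 1 declares only remote_state, and the module itself ships no provider block, so the provider also has to come from somewhere. A minimal sketch of a Terragrunt generate block that could sit alongside remote_state in live/terragrunt.hcl, assuming the same us-east-1 region as the Quickstart; the file name provider.tf is an arbitrary choice:

```hcl
# Writes provider.tf into each module's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt" # only overwrite files Terragrunt itself generated
  contents  = <<EOF
provider "aws" {
  region = "us-east-1" # assumed region, matching the Quickstart
}
EOF
}
```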

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
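
To make the per-environment override concrete, here is a sketch of what a dev counterpart might look like, assuming a live/dev/batch/terragrunt.hcl path that mirrors the prod layout. The name_prefix value and the max_vcpus cap are illustrative, and the "..." placeholders still need real IDs:

```hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
}

inputs = {
  name_prefix        = "genomics" # illustrative value
  environment        = "dev"
  subnet_ids         = ["...", "..."]
  security_group_ids = ["...", "..."]

  # Dev-only override: keep the FARGATE_SPOT default and cap spend lower.
  max_vcpus = 8
}
```

Only the inputs block differs from the prod config; the module source and root include stay identical.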

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| name_prefix | string | (none) | Yes | Prefix for all named resources; 2-25 lowercase alphanumeric/hyphen chars starting with a letter. |
| environment | string | (none) | Yes | Deployment environment; one of dev, stage, prod. |
| subnet_ids | list(string) | (none) | Yes | Subnets (use private) for tasks/instances; must be non-empty. |
| security_group_ids | list(string) | (none) | Yes | Security groups attached to Batch tasks/instances; must be non-empty. |
| compute_type | string | "FARGATE_SPOT" | No | One of FARGATE, FARGATE_SPOT, EC2, SPOT. |
| state | string | "ENABLED" | No | ENABLED or DISABLED for the compute environment and queue. |
| max_vcpus | number | 16 | No | Maximum vCPUs (hard cost ceiling); 1-10000. |
| min_vcpus | number | 0 | No | Min/desired warm vCPUs for EC2/SPOT (ignored for Fargate). |
| instance_types | list(string) | ["optimal"] | No | EC2 instance types/families for EC2/SPOT compute. |
| allocation_strategy | string | "BEST_FIT_PROGRESSIVE" | No | BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED. |
| bid_percentage | number | 100 | No | Max Spot bid as % of On-Demand (SPOT only); 1-100. |
| queue_priority | number | 1 | No | Job queue priority; 0-1000, higher scheduled first. |
| job_execution_timeout_minutes | number | 30 | No | Grace period (minutes) before infra updates terminate running jobs. |
| terminate_jobs_on_update | bool | false | No | Terminate running jobs when the compute environment updates. |
| tags | map(string) | {} | No | Additional tags merged onto every resource. |

Outputs

| Name | Description |
| --- | --- |
| compute_environment_arn | ARN of the Batch compute environment. |
| compute_environment_name | Name of the Batch compute environment. |
| compute_environment_status | Current status of the compute environment (e.g. VALID). |
| job_queue_arn | ARN of the attached job queue; pass to job definitions and submit-job callers. |
| job_queue_name | Name of the attached job queue. |
| service_role_arn | ARN of the Batch service IAM role. |
| instance_profile_arn | ARN of the ECS instance profile (null for Fargate compute types). |

Enterprise scenario

A bioinformatics platform team runs a nightly secondary-analysis pipeline that fans out variant calling across thousands of patient samples. They instantiate this module once per environment with compute_type = "SPOT", instance_types = ["c6i", "c6a", "m6i"], allocation_strategy = "SPOT_CAPACITY_OPTIMIZED", and max_vcpus = 256 so capacity scales from zero overnight and tears down by morning, cutting compute spend roughly 70% versus On-Demand. An EventBridge Scheduler rule submits array jobs to the module’s job_queue_arn output, and because the pipeline checkpoints to S3, Spot interruptions simply re-queue affected samples instead of failing the run.
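
Re-queueing after a Spot interruption is not automatic: the job definition has to opt in with a retry strategy. A sketch of how the variant_calling job definition from the usage example might do that, assuming three attempts and the status-reason pattern AWS documents for reclaimed hosts ("Host EC2*"); the attempt count is an illustrative choice:

```hcl
resource "aws_batch_job_definition" "variant_calling" {
  name                  = "variant-calling"
  type                  = "container"
  platform_capabilities = ["EC2"]

  container_properties = jsonencode({
    image      = "${aws_ecr_repository.pipeline.repository_url}:latest"
    command    = ["python", "call_variants.py"]
    jobRoleArn = aws_iam_role.job_task.arn
    resourceRequirements = [
      { type = "VCPU", value = "4" },
      { type = "MEMORY", value = "8192" },
    ]
  })

  retry_strategy {
    attempts = 3 # assumed ceiling; tune to how often Spot capacity is reclaimed

    # Retry when the host was terminated underneath the job (Spot reclaim).
    evaluate_on_exit {
      on_status_reason = "Host EC2*"
      action           = "RETRY"
    }

    # Anything else (for example an application error) fails immediately.
    evaluate_on_exit {
      on_reason = "*"
      action    = "EXIT"
    }
  }
}
```

The retried attempt restarts the container from scratch, so the S3 checkpointing described above is what keeps a retry from repeating finished work.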
