Terraform Module: AWS Batch — Spot-Backed Compute Environments Without the Boilerplate

Quick take — A reusable Terraform module for AWS Batch on hashicorp/aws ~> 5.0: provision a Fargate or EC2 Spot compute environment, job queue, and IAM service roles with validated, var-driven inputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "batch" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"

  name_prefix        = "..."           # Prefix for all named resources; 2-25 lowercase alphanumeric/hyphen chars, starting with a letter.
  environment        = "..."           # Deployment environment; one of `dev`, `stage`, `prod`.
  subnet_ids         = ["...", "..."]  # Subnets (use private) for tasks/instances; must be non-empty.
  security_group_ids = ["...", "..."]  # Security groups attached to Batch tasks/instances; must be non-empty.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
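
The Quickstart leaves compute_type at its FARGATE_SPOT default, so a job definition submitted to the resulting queue has to declare the FARGATE platform capability and carry an execution role. A minimal sketch of such a definition follows; the job name, role name, public Amazon Linux image, and command are illustrative assumptions and not part of the module:

```hcl
# Trust policy letting ECS tasks (which back Fargate Batch jobs) assume the role.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Hypothetical execution role: pulls the image and writes container logs.
resource "aws_iam_role" "batch_execution" {
  name               = "hello-fargate-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "batch_execution" {
  role       = aws_iam_role.batch_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_batch_job_definition" "hello" {
  name                  = "hello-fargate"
  type                  = "container"
  platform_capabilities = ["FARGATE"]

  container_properties = jsonencode({
    image            = "public.ecr.aws/amazonlinux/amazonlinux:latest" # placeholder image
    command          = ["echo", "hello from Batch"]
    executionRoleArn = aws_iam_role.batch_execution.arn

    # Fargate accepts only specific vCPU/memory pairs; 0.25 vCPU with 512 MiB is the smallest.
    resourceRequirements = [
      { type = "VCPU", value = "0.25" },
      { type = "MEMORY", value = "512" },
    ]

    # Private subnets: no public IP, so image pulls need a NAT gateway or VPC endpoints.
    networkConfiguration = { assignPublicIp = "DISABLED" }
  })
}
```

Jobs created from this definition can then be submitted to the queue named by the module's job_queue_name or job_queue_arn output.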

What this module is

AWS Batch is a managed batch-computing service that schedules and runs containerized jobs across pools of compute you do not have to babysit. You define a compute environment (the pool: Fargate, Fargate Spot, EC2 On-Demand, or EC2 Spot), attach one or more job queues to it, register job definitions, and submit jobs; Batch handles instance provisioning, bin-packing, and scaling capacity back down (to zero, if the minimum allows) once the queue drains. It is a natural fit for genomics pipelines, nightly ETL, ML training sweeps, rendering, and any embarrassingly parallel workload that would be wasteful to keep on a long-running cluster.

The core resource, aws_batch_compute_environment, is deceptively fiddly. A production-ready environment needs a correctly scoped service-linked or service IAM role, the right networking (subnets, security_group_ids), type = MANAGED with a compute_resources block whose type is one of FARGATE, FARGATE_SPOT, EC2, or SPOT, and, for Spot EC2, a Spot fleet IAM role plus a bid_percentage. Get the update_policy wrong and an infrastructure update can terminate running jobs; forget lifecycle { create_before_destroy = true } and replacements deadlock because the old environment is still referenced by the queue. This module encodes all of that once: it stands up the compute environment, an attached job queue, and the supporting IAM, exposes only the knobs that matter, and validates them so a bad bid_percentage or empty subnet list fails at plan instead of at apply.

When to use it

You run containerized batch or scheduled jobs (ETL, ML training, simulation, media transcoding) and want managed scaling to zero rather than an always-on ECS/EKS cluster.
You want Fargate Spot or EC2 Spot economics for fault-tolerant work but do not want to hand-write the Spot fleet role, instance role, and capacity wiring every time.
You are standing up the same Batch pattern across many accounts or teams (dev/stage/prod, per-domain data platforms) and need consistent IAM, tagging, and networking.
You need the compute environment’s ARN and the job queue’s ARN and name as outputs so downstream Terraform (EventBridge rules or schedules, Step Functions states) can reference them.

Reach for plain ECS/EKS instead if your workload is a long-lived service, needs sub-second scheduling latency, or requires per-request autoscaling — Batch is optimized for throughput of finite jobs, not steady-state request serving.

Module structure

terraform-module-aws-batch/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.name_prefix}-${var.environment}"

  # Spot EC2 needs a fleet role + bid percentage; Fargate/Fargate Spot/EC2 do not.
  is_spot_ec2     = var.compute_type == "SPOT"
  is_ec2_flavored = contains(["EC2", "SPOT"], var.compute_type)

  tags = merge(
    {
      Module      = "terraform-module-aws-batch"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags,
  )
}

# --- Trust policies -------------------------------------------------------

data "aws_iam_policy_document" "batch_service_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["batch.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "ec2_assume" {
  count = local.is_ec2_flavored ? 1 : 0

  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "spot_fleet_assume" {
  count = local.is_spot_ec2 ? 1 : 0

  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["spotfleet.amazonaws.com"]
    }
  }
}

# --- Batch service role ---------------------------------------------------

resource "aws_iam_role" "batch_service" {
  name               = "${local.name}-batch-service"
  assume_role_policy = data.aws_iam_policy_document.batch_service_assume.json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "batch_service" {
  role       = aws_iam_role.batch_service.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"
}

# --- EC2 instance role + profile (only for EC2/SPOT compute types) --------

resource "aws_iam_role" "ecs_instance" {
  count              = local.is_ec2_flavored ? 1 : 0
  name               = "${local.name}-ecs-instance"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume[0].json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "ecs_instance" {
  count      = local.is_ec2_flavored ? 1 : 0
  role       = aws_iam_role.ecs_instance[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs_instance" {
  count = local.is_ec2_flavored ? 1 : 0
  name  = "${local.name}-ecs-instance"
  role  = aws_iam_role.ecs_instance[0].name
  tags  = local.tags
}

# --- Spot fleet role (only for SPOT compute type) -------------------------

resource "aws_iam_role" "spot_fleet" {
  count              = local.is_spot_ec2 ? 1 : 0
  name               = "${local.name}-spot-fleet"
  assume_role_policy = data.aws_iam_policy_document.spot_fleet_assume[0].json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "spot_fleet" {
  count      = local.is_spot_ec2 ? 1 : 0
  role       = aws_iam_role.spot_fleet[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole"
}

# --- Compute environment --------------------------------------------------

resource "aws_batch_compute_environment" "this" {
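  # name_prefix (not name) lets create_before_destroy stand up a replacement
  # alongside the old environment; Batch rejects a duplicate fixed name.
  name_prefix  = "${local.name}-"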
  type         = "MANAGED"
  state        = var.state
  service_role = aws_iam_role.batch_service.arn

  compute_resources {
    type = var.compute_type

    # vCPU bounds. Fargate flavors ignore min/desired but require max_vcpus.
    max_vcpus     = var.max_vcpus
    min_vcpus     = local.is_ec2_flavored ? var.min_vcpus : null
    desired_vcpus = local.is_ec2_flavored ? var.min_vcpus : null

    subnets            = var.subnet_ids
    security_group_ids = var.security_group_ids

    # EC2/SPOT-only knobs.
    instance_role       = local.is_ec2_flavored ? aws_iam_instance_profile.ecs_instance[0].arn : null
    instance_type       = local.is_ec2_flavored ? var.instance_types : null
    allocation_strategy = local.is_ec2_flavored ? var.allocation_strategy : null
    spot_iam_fleet_role = local.is_spot_ec2 ? aws_iam_role.spot_fleet[0].arn : null
    bid_percentage      = local.is_spot_ec2 ? var.bid_percentage : null

    tags = local.tags
  }

  update_policy {
    job_execution_timeout_minutes = var.job_execution_timeout_minutes
    terminate_jobs_on_update      = var.terminate_jobs_on_update
  }

  tags = local.tags

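  # The service role's policy must be attached before the CE is created.
  depends_on = [aws_iam_role_policy_attachment.batch_service]

  lifecycle {
    # The job queue references this CE, so a replacement must be created
    # before the old environment is destroyed.
    create_before_destroy = true

    # Batch's scaler adjusts desired_vcpus at runtime; ignoring drift here
    # keeps Terraform from trying to reset it on later applies.
    ignore_changes = [compute_resources[0].desired_vcpus]
  }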
}

# --- Job queue attached to the compute environment ------------------------

resource "aws_batch_job_queue" "this" {
  name     = "${local.name}-queue"
  state    = var.state
  priority = var.queue_priority

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.this.arn
  }

  tags = local.tags
}
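
One combination the per-variable validations cannot catch is an allocation strategy that does not match the compute type: SPOT_CAPACITY_OPTIMIZED applies only to Spot instances, so pairing it with compute_type = "EC2" would otherwise fail at apply. Referencing a second variable inside a validation block needs Terraform 1.9 or later, so the sketch below uses a precondition on a terraform_data resource instead, which works on the 1.5 floor set in versions.tf. The resource name is hypothetical and this guard is an illustration, not part of the published module:

```hcl
# Plan-time guard for a cross-variable rule (hypothetical addition to main.tf).
resource "terraform_data" "allocation_strategy_guard" {
  lifecycle {
    precondition {
      # Fargate types ignore allocation_strategy, so only EC2 On-Demand is checked.
      condition = !(
        var.compute_type == "EC2" &&
        var.allocation_strategy == "SPOT_CAPACITY_OPTIMIZED"
      )
      error_message = "allocation_strategy SPOT_CAPACITY_OPTIMIZED requires compute_type = \"SPOT\"."
    }
  }
}
```

Because both inputs are known at plan time, the precondition fails during terraform plan, in line with the module's other validations.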

variables.tf

variable "name_prefix" {
  description = "Prefix for all named resources (e.g. \"genomics\")."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,24}$", var.name_prefix))
    error_message = "name_prefix must be 2-25 chars, lowercase alphanumeric or hyphen, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment, appended to the name prefix."
  type        = string

  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "environment must be one of: dev, stage, prod."
  }
}

variable "compute_type" {
  description = "Compute environment provisioning model."
  type        = string
  default     = "FARGATE_SPOT"

  validation {
    condition     = contains(["FARGATE", "FARGATE_SPOT", "EC2", "SPOT"], var.compute_type)
    error_message = "compute_type must be one of: FARGATE, FARGATE_SPOT, EC2, SPOT."
  }
}

variable "state" {
  description = "Desired state of the compute environment and job queue."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "DISABLED"], var.state)
    error_message = "state must be ENABLED or DISABLED."
  }
}

variable "subnet_ids" {
  description = "Subnets the compute environment launches tasks/instances into (use private subnets)."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) > 0
    error_message = "At least one subnet_id is required."
  }
}

variable "security_group_ids" {
  description = "Security groups attached to Batch tasks/instances."
  type        = list(string)

  validation {
    condition     = length(var.security_group_ids) > 0
    error_message = "At least one security_group_id is required."
  }
}

variable "max_vcpus" {
  description = "Maximum vCPUs the compute environment may scale to (hard cost ceiling)."
  type        = number
  default     = 16

  validation {
    condition     = var.max_vcpus >= 1 && var.max_vcpus <= 10000
    error_message = "max_vcpus must be between 1 and 10000."
  }
}

variable "min_vcpus" {
  description = "Minimum/desired vCPUs to keep warm (EC2/SPOT only; ignored for Fargate). Set 0 to scale to zero."
  type        = number
  default     = 0

  validation {
    condition     = var.min_vcpus >= 0
    error_message = "min_vcpus must be >= 0."
  }
}

variable "instance_types" {
  description = "EC2 instance types or families for EC2/SPOT compute (e.g. [\"optimal\"] or [\"c6i\", \"m6i\"]). Unused for Fargate."
  type        = list(string)
  default     = ["optimal"]
}

variable "allocation_strategy" {
  description = "EC2/SPOT allocation strategy: BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED."
  type        = string
  default     = "BEST_FIT_PROGRESSIVE"

  validation {
    condition     = contains(["BEST_FIT", "BEST_FIT_PROGRESSIVE", "SPOT_CAPACITY_OPTIMIZED"], var.allocation_strategy)
    error_message = "allocation_strategy must be BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED."
  }
}

variable "bid_percentage" {
  description = "Maximum Spot bid as a percentage of On-Demand price (SPOT compute type only)."
  type        = number
  default     = 100

  validation {
    condition     = var.bid_percentage >= 1 && var.bid_percentage <= 100
    error_message = "bid_percentage must be between 1 and 100."
  }
}

variable "queue_priority" {
  description = "Job queue priority; higher values are scheduled first when queues share a compute environment."
  type        = number
  default     = 1

  validation {
    condition     = var.queue_priority >= 0 && var.queue_priority <= 1000
    error_message = "queue_priority must be between 0 and 1000."
  }
}

variable "job_execution_timeout_minutes" {
  description = "Grace period for running jobs before infrastructure updates terminate them."
  type        = number
  default     = 30
}

variable "terminate_jobs_on_update" {
  description = "Whether to terminate running jobs when the compute environment is updated."
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags merged onto every resource."
  type        = map(string)
  default     = {}
}
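
job_execution_timeout_minutes is the one bounded numeric input above without a validation block. The Batch API documents a range of 1 to 360 minutes for this setting, so a variant of the variable with a matching check might look like the sketch below. It is an illustration under that assumption, not the module's published code:

```hcl
variable "job_execution_timeout_minutes" {
  description = "Grace period for running jobs before infrastructure updates terminate them."
  type        = number
  default     = 30

  validation {
    # Assumed bounds: the 1-360 minute range documented for the Batch update policy.
    condition     = var.job_execution_timeout_minutes >= 1 && var.job_execution_timeout_minutes <= 360
    error_message = "job_execution_timeout_minutes must be between 1 and 360."
  }
}
```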

outputs.tf

output "compute_environment_arn" {
  description = "ARN of the Batch compute environment."
  value       = aws_batch_compute_environment.this.arn
}

output "compute_environment_name" {
  description = "Name of the Batch compute environment."
  value       = aws_batch_compute_environment.this.name
}

output "compute_environment_status" {
  description = "Current status of the compute environment (e.g. VALID)."
  value       = aws_batch_compute_environment.this.status
}

output "job_queue_arn" {
  description = "ARN of the attached job queue; pass to submit-job callers and schedulers."
  value       = aws_batch_job_queue.this.arn
}

output "job_queue_name" {
  description = "Name of the attached job queue."
  value       = aws_batch_job_queue.this.name
}

output "service_role_arn" {
  description = "ARN of the Batch service IAM role."
  value       = aws_iam_role.batch_service.arn
}

output "instance_profile_arn" {
  description = "ARN of the ECS instance profile (null for Fargate compute types)."
  value       = try(aws_iam_instance_profile.ecs_instance[0].arn, null)
}

How to use it

module "batch" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"

  name_prefix = "genomics"
  environment = "prod"

  # Cost-optimized Spot EC2 across compute-optimized families.
  compute_type        = "SPOT"
  instance_types      = ["c6i", "c6a", "m6i"]
  allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
  bid_percentage      = 80
  max_vcpus           = 256
  min_vcpus           = 0

  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = [aws_security_group.batch.id]

  queue_priority = 10

  tags = {
    Team       = "bioinformatics"
    CostCenter = "rnd-4400"
  }
}

# Downstream: a job definition for the variant-calling container. The queue is
# chosen at submit time, so schedulers take the module's queue ARN (output below).
resource "aws_batch_job_definition" "variant_calling" {
  name                  = "variant-calling"
  type                  = "container"
  platform_capabilities = ["EC2"]

  container_properties = jsonencode({
    image      = "${aws_ecr_repository.pipeline.repository_url}:latest"
    command    = ["python", "call_variants.py"]
    jobRoleArn = aws_iam_role.job_task.arn
    resourceRequirements = [
      { type = "VCPU", value = "4" },
      { type = "MEMORY", value = "8192" },
    ]
  })
}

# Reference a module output downstream.
output "submit_target_queue" {
  description = "Queue ARN that the scheduler should submit variant-calling jobs to."
  value       = module.batch.job_queue_arn
}
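
The scheduling side is not shown above. One way to wire it is a classic EventBridge rule (not EventBridge Scheduler) with a Batch target; the sketch below assumes the module.batch and aws_batch_job_definition.variant_calling blocks from the example above, and the rule name, role name, and 02:00 UTC schedule are hypothetical:

```hcl
# Trust policy so EventBridge can assume the submit role.
data "aws_iam_policy_document" "events_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

# Least privilege: submit this job definition to this queue only.
data "aws_iam_policy_document" "events_submit" {
  statement {
    effect  = "Allow"
    actions = ["batch:SubmitJob"]
    resources = [
      module.batch.job_queue_arn,
      aws_batch_job_definition.variant_calling.arn,
    ]
  }
}

resource "aws_iam_role" "events_submit" {
  name               = "variant-calling-nightly-submit"
  assume_role_policy = data.aws_iam_policy_document.events_assume.json
}

resource "aws_iam_role_policy" "events_submit" {
  name   = "submit-variant-calling"
  role   = aws_iam_role.events_submit.id
  policy = data.aws_iam_policy_document.events_submit.json
}

resource "aws_cloudwatch_event_rule" "nightly" {
  name                = "variant-calling-nightly"
  schedule_expression = "cron(0 2 * * ? *)" # 02:00 UTC every day
}

resource "aws_cloudwatch_event_target" "submit" {
  rule     = aws_cloudwatch_event_rule.nightly.name
  arn      = module.batch.job_queue_arn # the target ARN is the job queue
  role_arn = aws_iam_role.events_submit.arn

  batch_target {
    job_definition = aws_batch_job_definition.variant_calling.arn
    job_name       = "variant-calling-nightly"
  }
}
```

The job definition ARN includes its revision, so the IAM policy follows each new revision as long as both are managed in the same configuration.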

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config — live/prod/batch/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
}

inputs = {
  name_prefix        = "..."
  environment        = "..."
  subnet_ids         = ["...", "..."]
  security_group_ids = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root:
cd live/prod/batch && terragrunt apply        # this module only
cd live/prod && terragrunt run-all apply      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
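
The root config in step 1 shows only the backend. A provider defined in the same place could be sketched with a generate block like the one below; the region is an assumption carried over from the Quickstart, and the file name provider.tf is illustrative:

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module source itself stays free of provider blocks.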

Inputs

Name	Type	Default	Required	Description
`name_prefix`	`string`	—	Yes	Prefix for all named resources; 2-25 lowercase alphanumeric/hyphen chars starting with a letter.
`environment`	`string`	—	Yes	Deployment environment; one of `dev`, `stage`, `prod`.
`subnet_ids`	`list(string)`	—	Yes	Subnets (use private) for tasks/instances; must be non-empty.
`security_group_ids`	`list(string)`	—	Yes	Security groups attached to Batch tasks/instances; must be non-empty.
`compute_type`	`string`	`"FARGATE_SPOT"`	No	One of `FARGATE`, `FARGATE_SPOT`, `EC2`, `SPOT`.
`state`	`string`	`"ENABLED"`	No	`ENABLED` or `DISABLED` for the compute environment and queue.
`max_vcpus`	`number`	`16`	No	Maximum vCPUs (hard cost ceiling); 1-10000.
`min_vcpus`	`number`	`0`	No	Min/desired warm vCPUs for EC2/SPOT (ignored for Fargate).
`instance_types`	`list(string)`	`["optimal"]`	No	EC2 instance types/families for EC2/SPOT compute.
`allocation_strategy`	`string`	`"BEST_FIT_PROGRESSIVE"`	No	`BEST_FIT`, `BEST_FIT_PROGRESSIVE`, or `SPOT_CAPACITY_OPTIMIZED`.
`bid_percentage`	`number`	`100`	No	Max Spot bid as % of On-Demand (SPOT only); 1-100.
`queue_priority`	`number`	`1`	No	Job queue priority; 0-1000, higher scheduled first.
`job_execution_timeout_minutes`	`number`	`30`	No	Grace period before infra updates terminate running jobs.
`terminate_jobs_on_update`	`bool`	`false`	No	Terminate running jobs when the compute environment updates.
`tags`	`map(string)`	`{}`	No	Additional tags merged onto every resource.

Outputs

Name	Description
`compute_environment_arn`	ARN of the Batch compute environment.
`compute_environment_name`	Name of the Batch compute environment.
`compute_environment_status`	Current status of the compute environment (e.g. `VALID`).
`job_queue_arn`	ARN of the attached job queue; pass to submit-job callers and schedulers.
`job_queue_name`	Name of the attached job queue.
`service_role_arn`	ARN of the Batch service IAM role.
`instance_profile_arn`	ARN of the ECS instance profile (`null` for Fargate compute types).

Enterprise scenario

A bioinformatics platform team runs a nightly secondary-analysis pipeline that fans out variant calling across thousands of patient samples. They instantiate this module once per environment with compute_type = "SPOT", instance_types = ["c6i", "c6a", "m6i"], allocation_strategy = "SPOT_CAPACITY_OPTIMIZED", and max_vcpus = 256 so capacity scales from zero overnight and tears down by morning, cutting compute spend roughly 70% versus On-Demand. An EventBridge Scheduler schedule submits array jobs to the queue identified by the module’s job_queue_arn output. Because the pipeline checkpoints to S3 and the job definition sets a retry strategy, a Spot interruption retries only the affected samples instead of failing the run.
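
Retrying after a Spot interruption depends on the job definition carrying a retry strategy; Batch does not retry a failed attempt unless attempts is greater than 1. A sketch of one such strategy follows, using the status-reason pattern AWS documents for reclaimed hosts; the definition name, placeholder image, and command are hypothetical:

```hcl
resource "aws_batch_job_definition" "variant_calling_retrying" {
  name                  = "variant-calling-retrying"
  type                  = "container"
  platform_capabilities = ["EC2"]

  container_properties = jsonencode({
    image   = "public.ecr.aws/amazonlinux/amazonlinux:latest" # placeholder image
    command = ["echo", "resume from the last S3 checkpoint here"]
    resourceRequirements = [
      { type = "VCPU", value = "4" },
      { type = "MEMORY", value = "8192" },
    ]
  })

  retry_strategy {
    attempts = 3

    # Retry when the host went away: a reclaimed Spot instance surfaces as a
    # status reason beginning with "Host EC2".
    evaluate_on_exit {
      on_status_reason = "Host EC2*"
      action           = "RETRY"
    }

    # Anything else is treated as a genuine failure and is not retried.
    evaluate_on_exit {
      on_reason = "*"
      action    = "EXIT"
    }
  }
}
```

The conditions are evaluated in order, so the catch-all EXIT rule has to come last.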

Best practices

Set a max_vcpus ceiling and prefer Spot for fault-tolerant work. Batch scales up to whatever max_vcpus allows (EC2 and Spot environments can overshoot it by at most one instance), so this number is your real cost guardrail. Pair SPOT with SPOT_CAPACITY_OPTIMIZED and checkpointing to absorb interruptions cheaply.
Launch into private subnets and scope the security group tightly. Batch tasks rarely need inbound access; give them egress to ECR, S3, and CloudWatch (via VPC endpoints where possible) and nothing more, keeping data-plane traffic off the public internet.
Keep create_before_destroy = true and let the module own the update policy. Compute environments are referenced by job queues, so a replacement without create_before_destroy deadlocks: the old environment cannot be deleted while the queue still points at it. Setting terminate_jobs_on_update = false gives running jobs up to job_execution_timeout_minutes to drain during infra changes.
Use job-level IAM (jobRoleArn) for least privilege, not the instance/execution role. The compute environment roles should only grant Batch and ECS bootstrap permissions; per-job application permissions (S3 buckets, KMS keys) belong on the job definition’s task role.
Scale EC2/SPOT environments to zero with min_vcpus = 0 so idle queues cost nothing, and reserve non-zero min_vcpus only when cold-start latency on the first jobs is unacceptable.
Enforce consistent naming and tags via name_prefix + environment and the merged tags map so every compute environment, queue, and IAM role is traceable to a team and cost center across all accounts.
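
The networking advice above can be sketched as a security group with no inbound rules and HTTPS-only egress. This assumes a hypothetical vpc_id input and is a starting point, not part of the module; egress could be narrowed further to VPC endpoint prefix lists or security groups where those are in place:

```hcl
variable "vpc_id" {
  description = "Hypothetical input: VPC that hosts the Batch subnets."
  type        = string
}

resource "aws_security_group" "batch" {
  name_prefix = "batch-tasks-"
  description = "Batch tasks: no inbound, HTTPS outbound only"
  vpc_id      = var.vpc_id

  # No ingress blocks: nothing may open a connection to the tasks.

  egress {
    description = "HTTPS to ECR, S3, and CloudWatch Logs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

Its ID can then be passed to the module as security_group_ids = [aws_security_group.batch.id], as in the usage example.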