Quick take — A reusable Terraform module for AWS Batch on hashicorp/aws ~> 5.0: provision a Fargate or EC2 Spot compute environment, job queue, and IAM service roles with validated, var-driven inputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "batch" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
name_prefix = "..." # Prefix for all named resources; 2-25 lowercase alphanum…
environment = "..." # Deployment environment; one of `dev`, `stage`, `prod`.
subnet_ids = ["...", "..."] # Subnets (use private) for tasks/instances; must be non-…
security_group_ids = ["...", "..."] # Security groups attached to Batch tasks/instances; must…
}
Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
What this module is
AWS Batch is a managed batch-computing service that schedules and runs containerized jobs across pools of compute you do not have to babysit. You define a compute environment (the pool — Fargate, Fargate Spot, EC2 On-Demand, or EC2 Spot), attach one or more job queues to it, register job definitions, and submit jobs; Batch handles instance provisioning, bin-packing, scaling to zero when idle, and tearing capacity down when the queue drains. It is the natural fit for genomics pipelines, nightly ETL, ML training sweeps, rendering, and any embarrassingly parallel workload that is wasteful to keep on a long-running cluster.
The core resource, aws_batch_compute_environment, is deceptively fiddly. A production-ready environment needs a correctly-scoped service-linked or service IAM role, the right networking (subnets, security_group_ids), type = MANAGED with a compute_resources block whose type is one of FARGATE, FARGATE_SPOT, EC2, or SPOT, and — for Spot EC2 — a Spot fleet IAM role plus a bid_percentage. Get the update_policy wrong and Terraform fights the scaler on every apply; forget lifecycle { create_before_destroy = true } and replacements deadlock because the old environment is still referenced by the queue. This module encodes all of that once: it stands up the compute environment, an attached job queue, and the supporting IAM, exposes only the knobs that matter, and validates them so a bad bid_percentage or empty subnet list fails at plan instead of at apply.
When to use it
- You run containerized batch or scheduled jobs (ETL, ML training, simulation, media transcoding) and want managed scaling to zero rather than an always-on ECS/EKS cluster.
- You want Fargate Spot or EC2 Spot economics for fault-tolerant work but do not want to hand-write the Spot fleet role, instance role, and capacity wiring every time.
- You are standing up the same Batch pattern across many accounts or teams (dev/stage/prod, per-domain data platforms) and need consistent IAM, tagging, and networking.
- You need the compute environment’s ARN and the job queue’s name and ARN as outputs so downstream Terraform (EventBridge Scheduler schedules, Step Functions state machines, or anything else that calls SubmitJob) can target them.
Reach for plain ECS/EKS instead if your workload is a long-lived service, needs sub-second scheduling latency, or requires per-request autoscaling — Batch is optimized for throughput of finite jobs, not steady-state request serving.
Module structure
terraform-module-aws-batch/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
name = "${var.name_prefix}-${var.environment}"
# Spot EC2 needs a fleet role + bid percentage; Fargate/Fargate Spot/EC2 do not.
is_spot_ec2 = var.compute_type == "SPOT"
is_ec2_flavored = contains(["EC2", "SPOT"], var.compute_type)
tags = merge(
{
Module = "terraform-module-aws-batch"
Environment = var.environment
ManagedBy = "terraform"
},
var.tags,
)
}
# --- Trust policies -------------------------------------------------------
data "aws_iam_policy_document" "batch_service_assume" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["batch.amazonaws.com"]
}
}
}
data "aws_iam_policy_document" "ec2_assume" {
count = local.is_ec2_flavored ? 1 : 0
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ec2.amazonaws.com"]
}
}
}
data "aws_iam_policy_document" "spot_fleet_assume" {
count = local.is_spot_ec2 ? 1 : 0
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["spotfleet.amazonaws.com"]
}
}
}
# --- Batch service role ---------------------------------------------------
resource "aws_iam_role" "batch_service" {
name = "${local.name}-batch-service"
assume_role_policy = data.aws_iam_policy_document.batch_service_assume.json
tags = local.tags
}
resource "aws_iam_role_policy_attachment" "batch_service" {
role = aws_iam_role.batch_service.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"
}
# --- EC2 instance role + profile (only for EC2/SPOT compute types) --------
resource "aws_iam_role" "ecs_instance" {
count = local.is_ec2_flavored ? 1 : 0
name = "${local.name}-ecs-instance"
assume_role_policy = data.aws_iam_policy_document.ec2_assume[0].json
tags = local.tags
}
resource "aws_iam_role_policy_attachment" "ecs_instance" {
count = local.is_ec2_flavored ? 1 : 0
role = aws_iam_role.ecs_instance[0].name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}
resource "aws_iam_instance_profile" "ecs_instance" {
count = local.is_ec2_flavored ? 1 : 0
name = "${local.name}-ecs-instance"
role = aws_iam_role.ecs_instance[0].name
tags = local.tags
}
# --- Spot fleet role (only for SPOT compute type) -------------------------
resource "aws_iam_role" "spot_fleet" {
count = local.is_spot_ec2 ? 1 : 0
name = "${local.name}-spot-fleet"
assume_role_policy = data.aws_iam_policy_document.spot_fleet_assume[0].json
tags = local.tags
}
resource "aws_iam_role_policy_attachment" "spot_fleet" {
count = local.is_spot_ec2 ? 1 : 0
role = aws_iam_role.spot_fleet[0].name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole"
}
# --- Compute environment --------------------------------------------------
resource "aws_batch_compute_environment" "this" {
compute_environment_name = local.name
type = "MANAGED"
state = var.state
service_role = aws_iam_role.batch_service.arn
compute_resources {
type = var.compute_type
# vCPU bounds. Fargate flavors ignore min/desired but require max_vcpus.
max_vcpus = var.max_vcpus
min_vcpus = local.is_ec2_flavored ? var.min_vcpus : null
desired_vcpus = local.is_ec2_flavored ? var.min_vcpus : null
subnets = var.subnet_ids
security_group_ids = var.security_group_ids
# EC2/SPOT-only knobs.
instance_role = local.is_ec2_flavored ? aws_iam_instance_profile.ecs_instance[0].arn : null
instance_type = local.is_ec2_flavored ? var.instance_types : null
allocation_strategy = local.is_ec2_flavored ? var.allocation_strategy : null
spot_iam_fleet_role = local.is_spot_ec2 ? aws_iam_role.spot_fleet[0].arn : null
bid_percentage = local.is_spot_ec2 ? var.bid_percentage : null
tags = local.tags
}
update_policy {
job_execution_timeout_minutes = var.job_execution_timeout_minutes
terminate_jobs_on_update = var.terminate_jobs_on_update
}
tags = local.tags
# The service role must exist before the CE, and the CE must be replaced
# before deletion because the job queue references it.
depends_on = [aws_iam_role_policy_attachment.batch_service]
lifecycle {
create_before_destroy = true
}
}
# --- Job queue attached to the compute environment ------------------------
resource "aws_batch_job_queue" "this" {
name = "${local.name}-queue"
state = var.state
priority = var.queue_priority
compute_environment_order {
order = 1
compute_environment = aws_batch_compute_environment.this.arn
}
tags = local.tags
}
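The "Terraform fights the scaler" problem described earlier most often shows up on desired_vcpus: Batch raises it while jobs run, and a later plan may try to push it back to the configured value. One common mitigation, sketched here as an assumption and not something this module ships, is to ignore drift on that one attribute. The block below is a fragment of the compute environment resource, not a complete resource:

```hcl
# Hypothetical variation of the lifecycle block in main.tf.
resource "aws_batch_compute_environment" "this" {
  # ... same arguments as in main.tf ...

  lifecycle {
    create_before_destroy = true

    # Assumption: once the environment exists, the Batch scaler owns
    # desired_vcpus, so Terraform should not try to reset it to var.min_vcpus.
    ignore_changes = [compute_resources[0].desired_vcpus]
  }
}
```

A related caveat: create_before_destroy creates the replacement while the old environment still exists, so a fixed environment name can collide during replacement. The provider offers a name-prefix argument for this case (compute_environment_name_prefix on 5.x, to the best of my knowledge); check the provider documentation for your pinned version before relying on it.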
variables.tf
variable "name_prefix" {
description = "Prefix for all named resources (e.g. \"genomics\")."
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,24}$", var.name_prefix))
error_message = "name_prefix must be 2-25 chars, lowercase alphanumeric or hyphen, starting with a letter."
}
}
variable "environment" {
description = "Deployment environment, appended to the name prefix."
type = string
validation {
condition = contains(["dev", "stage", "prod"], var.environment)
error_message = "environment must be one of: dev, stage, prod."
}
}
variable "compute_type" {
description = "Compute environment provisioning model."
type = string
default = "FARGATE_SPOT"
validation {
condition = contains(["FARGATE", "FARGATE_SPOT", "EC2", "SPOT"], var.compute_type)
error_message = "compute_type must be one of: FARGATE, FARGATE_SPOT, EC2, SPOT."
}
}
variable "state" {
description = "Desired state of the compute environment and job queue."
type = string
default = "ENABLED"
validation {
condition = contains(["ENABLED", "DISABLED"], var.state)
error_message = "state must be ENABLED or DISABLED."
}
}
variable "subnet_ids" {
description = "Subnets the compute environment launches tasks/instances into (use private subnets)."
type = list(string)
validation {
condition = length(var.subnet_ids) > 0
error_message = "At least one subnet_id is required."
}
}
variable "security_group_ids" {
description = "Security groups attached to Batch tasks/instances."
type = list(string)
validation {
condition = length(var.security_group_ids) > 0
error_message = "At least one security_group_id is required."
}
}
variable "max_vcpus" {
description = "Maximum vCPUs the compute environment may scale to (hard cost ceiling)."
type = number
default = 16
validation {
condition = var.max_vcpus >= 1 && var.max_vcpus <= 10000
error_message = "max_vcpus must be between 1 and 10000."
}
}
variable "min_vcpus" {
description = "Minimum/desired vCPUs to keep warm (EC2/SPOT only; ignored for Fargate). Set 0 to scale to zero."
type = number
default = 0
validation {
condition = var.min_vcpus >= 0
error_message = "min_vcpus must be >= 0."
}
}
variable "instance_types" {
description = "EC2 instance types or families for EC2/SPOT compute (e.g. [\"optimal\"] or [\"c6i\", \"m6i\"]). Unused for Fargate."
type = list(string)
default = ["optimal"]
}
variable "allocation_strategy" {
description = "EC2/SPOT allocation strategy: BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED."
type = string
default = "BEST_FIT_PROGRESSIVE"
validation {
condition = contains(["BEST_FIT", "BEST_FIT_PROGRESSIVE", "SPOT_CAPACITY_OPTIMIZED"], var.allocation_strategy)
error_message = "allocation_strategy must be BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED."
}
}
variable "bid_percentage" {
description = "Maximum Spot bid as a percentage of On-Demand price (SPOT compute type only)."
type = number
default = 100
validation {
condition = var.bid_percentage >= 1 && var.bid_percentage <= 100
error_message = "bid_percentage must be between 1 and 100."
}
}
variable "queue_priority" {
description = "Job queue priority; higher values are scheduled first when queues share a compute environment."
type = number
default = 1
validation {
condition = var.queue_priority >= 0 && var.queue_priority <= 1000
error_message = "queue_priority must be between 0 and 1000."
}
}
variable "job_execution_timeout_minutes" {
description = "Grace period for running jobs before infrastructure updates terminate them."
type = number
default = 30
}
variable "terminate_jobs_on_update" {
description = "Whether to terminate running jobs when the compute environment is updated."
type = bool
default = false
}
variable "tags" {
description = "Additional tags merged onto every resource."
type = map(string)
default = {}
}
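The validation blocks above each look at a single variable. Some mistakes only show up when two inputs are combined, for example a Spot-only allocation strategy on an On-Demand EC2 environment. A minimal sketch of cross-variable checks follows, assuming Terraform >= 1.5 (which versions.tf already requires). The check names are hypothetical, and check blocks report warnings instead of failing the plan:

```hcl
# Hypothetical additions to the module; not part of the published code.
check "spot_only_allocation_strategy" {
  assert {
    # SPOT_CAPACITY_OPTIMIZED only applies to Spot compute resources.
    condition = (
      var.compute_type != "EC2" ||
      var.allocation_strategy != "SPOT_CAPACITY_OPTIMIZED"
    )
    error_message = "SPOT_CAPACITY_OPTIMIZED is only valid with compute_type = \"SPOT\"."
  }
}

check "min_not_above_max" {
  assert {
    condition     = var.min_vcpus <= var.max_vcpus
    error_message = "min_vcpus should not exceed max_vcpus."
  }
}
```

If a hard failure is preferred over a warning, the same conditions could be moved into a precondition inside the compute environment's lifecycle block.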
outputs.tf
output "compute_environment_arn" {
description = "ARN of the Batch compute environment."
value = aws_batch_compute_environment.this.arn
}
output "compute_environment_name" {
description = "Name of the Batch compute environment."
value = aws_batch_compute_environment.this.compute_environment_name
}
output "compute_environment_status" {
description = "Current status of the compute environment (e.g. VALID)."
value = aws_batch_compute_environment.this.status
}
output "job_queue_arn" {
description = "ARN of the attached job queue; pass it as the job queue to SubmitJob callers (schedulers, Step Functions, CLI)."
value = aws_batch_job_queue.this.arn
}
output "job_queue_name" {
description = "Name of the attached job queue."
value = aws_batch_job_queue.this.name
}
output "service_role_arn" {
description = "ARN of the Batch service IAM role."
value = aws_iam_role.batch_service.arn
}
output "instance_profile_arn" {
description = "ARN of the ECS instance profile (null for Fargate compute types)."
value = try(aws_iam_instance_profile.ecs_instance[0].arn, null)
}
How to use it
module "batch" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
name_prefix = "genomics"
environment = "prod"
# Cost-optimized Spot EC2 across compute-optimized families.
compute_type = "SPOT"
instance_types = ["c6i", "c6a", "m6i"]
allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
bid_percentage = 80
max_vcpus = 256
min_vcpus = 0
subnet_ids = module.vpc.private_subnet_ids
security_group_ids = [aws_security_group.batch.id]
queue_priority = 10
tags = {
Team = "bioinformatics"
CostCenter = "rnd-4400"
}
}
# Downstream: a job definition for the variant-calling container. Jobs created
# from it are submitted to the module's queue, exposed as an output further down.
resource "aws_batch_job_definition" "variant_calling" {
name = "variant-calling"
type = "container"
platform_capabilities = ["EC2"]
container_properties = jsonencode({
image = "${aws_ecr_repository.pipeline.repository_url}:latest"
command = ["python", "call_variants.py"]
jobRoleArn = aws_iam_role.job_task.arn
resourceRequirements = [
{ type = "VCPU", value = "4" },
{ type = "MEMORY", value = "8192" },
]
})
}
# Reference a module output downstream.
output "submit_target_queue" {
description = "Queue ARN that the scheduler should submit variant-calling jobs to."
value = module.batch.job_queue_arn
}
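The example above targets EC2 Spot. With the module's default compute_type of FARGATE_SPOT, the job definition needs a Fargate platform capability and a task execution role. The following is a sketch under those assumptions: the role names are hypothetical, and aws_ecr_repository.pipeline and aws_iam_role.job_task are the same externally defined resources the EC2 example refers to.

```hcl
# Trust policy for roles assumed by the ECS task that runs the job.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: lets Fargate pull the image and write logs (hypothetical name).
resource "aws_iam_role" "fargate_execution" {
  name               = "variant-calling-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "fargate_execution" {
  role       = aws_iam_role.fargate_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_batch_job_definition" "variant_calling_fargate" {
  name                  = "variant-calling-fargate"
  type                  = "container"
  platform_capabilities = ["FARGATE"]

  container_properties = jsonencode({
    image            = "${aws_ecr_repository.pipeline.repository_url}:latest"
    command          = ["python", "call_variants.py"]
    executionRoleArn = aws_iam_role.fargate_execution.arn
    jobRoleArn       = aws_iam_role.job_task.arn
    # Fargate only accepts specific vCPU/memory pairs; 4 vCPU with 8 GiB is one.
    resourceRequirements = [
      { type = "VCPU", value = "4" },
      { type = "MEMORY", value = "8192" },
    ]
  })
}
```

The job role still carries the application permissions (S3, KMS); the execution role only covers image pull and logging.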
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/batch/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
}
inputs = {
name_prefix = "..."
environment = "..."
subnet_ids = ["...", "..."]
security_group_ids = ["...", "..."]
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/batch && terragrunt apply # this module only
cd live/prod && terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
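To make the dependency orchestration concrete, here is a sketch of the module config with its networking inputs wired from another stack. It assumes a hypothetical sibling stack at live/prod/vpc that exports private_subnet_ids and batch_security_group_id; neither the stack nor those output names come from this module.

```hcl
# live/prod/batch/terragrunt.hcl (sketch)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-batch?ref=v1.0.0"
}

# Hypothetical VPC stack; run-all applies it before this module.
dependency "vpc" {
  config_path = "../vpc"

  # Placeholder values so validate/plan work before the VPC has been applied.
  mock_outputs = {
    private_subnet_ids      = ["subnet-00000000"]
    batch_security_group_id = "sg-00000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  name_prefix        = "genomics"
  environment        = "prod"
  subnet_ids         = dependency.vpc.outputs.private_subnet_ids
  security_group_ids = [dependency.vpc.outputs.batch_security_group_id]
}
```

Because the dependency is declared explicitly, run-all can order the VPC stack ahead of the Batch stack without any manual sequencing.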
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| `name_prefix` | `string` | none | Yes | Prefix for all named resources; 2-25 lowercase alphanumeric/hyphen chars starting with a letter. |
| `environment` | `string` | none | Yes | Deployment environment; one of dev, stage, prod. |
| `subnet_ids` | `list(string)` | none | Yes | Subnets (use private) for tasks/instances; must be non-empty. |
| `security_group_ids` | `list(string)` | none | Yes | Security groups attached to Batch tasks/instances; must be non-empty. |
| `compute_type` | `string` | `"FARGATE_SPOT"` | No | One of FARGATE, FARGATE_SPOT, EC2, SPOT. |
| `state` | `string` | `"ENABLED"` | No | ENABLED or DISABLED for the compute environment and queue. |
| `max_vcpus` | `number` | `16` | No | Maximum vCPUs (hard cost ceiling); 1-10000. |
| `min_vcpus` | `number` | `0` | No | Min/desired warm vCPUs for EC2/SPOT (ignored for Fargate). |
| `instance_types` | `list(string)` | `["optimal"]` | No | EC2 instance types/families for EC2/SPOT compute. |
| `allocation_strategy` | `string` | `"BEST_FIT_PROGRESSIVE"` | No | BEST_FIT, BEST_FIT_PROGRESSIVE, or SPOT_CAPACITY_OPTIMIZED. |
| `bid_percentage` | `number` | `100` | No | Max Spot bid as % of On-Demand (SPOT only); 1-100. |
| `queue_priority` | `number` | `1` | No | Job queue priority; 0-1000, higher scheduled first. |
| `job_execution_timeout_minutes` | `number` | `30` | No | Grace period before infra updates terminate running jobs. |
| `terminate_jobs_on_update` | `bool` | `false` | No | Terminate running jobs when the compute environment updates. |
| `tags` | `map(string)` | `{}` | No | Additional tags merged onto every resource. |
Outputs
| Name | Description |
|---|---|
| `compute_environment_arn` | ARN of the Batch compute environment. |
| `compute_environment_name` | Name of the Batch compute environment. |
| `compute_environment_status` | Current status of the compute environment (e.g. VALID). |
| `job_queue_arn` | ARN of the attached job queue; pass it as the job queue to SubmitJob callers (schedulers, Step Functions, CLI). |
| `job_queue_name` | Name of the attached job queue. |
| `service_role_arn` | ARN of the Batch service IAM role. |
| `instance_profile_arn` | ARN of the ECS instance profile (null for Fargate compute types). |
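As one way to consume these outputs, the sketch below wires job_queue_arn into a Step Functions state machine that submits a job and waits for it to finish. The state machine name and aws_iam_role.sfn are hypothetical. The role would need batch:SubmitJob and, for the .sync integration pattern, permission to manage the EventBridge rule Step Functions uses to track completion.

```hcl
resource "aws_sfn_state_machine" "variant_calling" {
  name     = "variant-calling"    # hypothetical name
  role_arn = aws_iam_role.sfn.arn # hypothetical role, defined elsewhere

  definition = jsonencode({
    StartAt = "SubmitVariantCalling"
    States = {
      SubmitVariantCalling = {
        Type = "Task"
        # Optimized Batch integration; ".sync" waits for the job to complete.
        Resource = "arn:aws:states:::batch:submitJob.sync"
        Parameters = {
          JobName       = "variant-calling"
          JobDefinition = aws_batch_job_definition.variant_calling.arn
          JobQueue      = module.batch.job_queue_arn
        }
        End = true
      }
    }
  })
}
```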
Enterprise scenario
A bioinformatics platform team runs a nightly secondary-analysis pipeline that fans out variant calling across thousands of patient samples. They instantiate this module once per environment with compute_type = "SPOT", instance_types = ["c6i", "c6a", "m6i"], allocation_strategy = "SPOT_CAPACITY_OPTIMIZED", and max_vcpus = 256 so capacity scales from zero overnight and tears down by morning, cutting compute spend roughly 70% versus On-Demand. An EventBridge Scheduler schedule submits array jobs to the module’s job_queue_arn output, and because the pipeline checkpoints to S3, Spot interruptions simply re-queue affected samples instead of failing the run.
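The nightly submission in this scenario could look roughly like the following. This is a sketch, not the team's actual configuration: the schedule name, the cron expression, the array size, and aws_iam_role.scheduler (a role allowed to call batch:SubmitJob) are all illustrative assumptions. It relies on EventBridge Scheduler's universal target format for calling an AWS API directly.

```hcl
resource "aws_scheduler_schedule" "nightly_variant_calling" {
  name                = "nightly-variant-calling" # hypothetical name
  schedule_expression = "cron(0 1 * * ? *)"       # 01:00 UTC daily, illustrative

  flexible_time_window {
    mode = "OFF"
  }

  target {
    # Universal target: invokes the Batch SubmitJob API.
    arn      = "arn:aws:scheduler:::aws-sdk:batch:submitJob"
    role_arn = aws_iam_role.scheduler.arn # hypothetical role, defined elsewhere

    input = jsonencode({
      JobName       = "nightly-variant-calling"
      JobQueue      = module.batch.job_queue_arn
      JobDefinition = aws_batch_job_definition.variant_calling.arn
      ArrayProperties = {
        Size = 1000 # illustrative fan-out: one child job per sample
      }
    })
  }
}
```

Each child job can read its AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick the sample it should process.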
Best practices
- Pin a hard max_vcpus ceiling and prefer Spot for fault-tolerant work. Batch will scale to whatever max_vcpus allows, so this number is your real cost guardrail. Pair SPOT with SPOT_CAPACITY_OPTIMIZED and checkpointing to absorb interruptions cheaply.
- Launch into private subnets and scope the security group tightly. Batch tasks rarely need inbound access; give them egress to ECR, S3, and CloudWatch (via VPC endpoints where possible) and nothing more, keeping data-plane traffic off the public internet.
- Keep create_before_destroy = true and let the module own the update policy. Compute environments are referenced by job queues, so replacement without create_before_destroy deadlocks; setting terminate_jobs_on_update = false lets running jobs drain during infra changes.
- Use job-level IAM (jobRoleArn) for least privilege, not the instance/execution role. The compute environment roles should only grant Batch and ECS bootstrap permissions; per-job application permissions (S3 buckets, KMS keys) belong on the job definition’s task role.
- Scale EC2/SPOT environments to zero with min_vcpus = 0 so idle queues cost nothing, and reserve non-zero min_vcpus only when cold-start latency on the first jobs is unacceptable.
- Enforce consistent naming and tags via name_prefix + environment and the merged tags map so every compute environment, queue, and IAM role is traceable to a team and cost center across all accounts.