Terraform Module: AWS Glue Job — repeatable, governed ETL jobs

Quick take: Provision AWS Glue ETL jobs with Terraform using a reusable module for the hashicorp/aws provider (~> 5.0). It wraps aws_glue_job with worker sizing, job bookmarks, retries, timeouts, and CloudWatch logging baked in. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "glue_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"

  name            = "..."  # Glue job name; also used for log group and trigger.
  role_arn        = "..."  # IAM role ARN Glue assumes to run the job.
  script_location = "..."  # S3 URI of the job script.
  temp_bucket     = "..."  # S3 bucket (no s3://) for TempDir and Spark logs.
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
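To see what the apply created, the root configuration can surface the module's values as outputs. A minimal sketch: the output names glue_job_name and glue_job_arn are illustrative, while name and arn are the module outputs listed under Outputs.

```hcl
# Illustrative root-level outputs; rename them to suit your stack.
output "glue_job_name" {
  description = "Name of the Glue job created by the quickstart module."
  value       = module.glue_job.name
}

output "glue_job_arn" {
  description = "ARN of the Glue job, for IAM policies or orchestration."
  value       = module.glue_job.arn
}
```

After terraform apply, terraform output glue_job_arn prints the ARN for use in other tooling.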

What this module is

AWS Glue is a serverless data-integration service, and a Glue Job is the unit of work that actually runs your ETL: a PySpark, Scala Spark, Python Shell, or Ray script that reads from sources (S3, JDBC, the Glue Data Catalog), transforms data, and writes it back out. You point a job at a script in S3, choose a worker type and a number of workers, hand it an IAM role, and Glue spins up an ephemeral Spark cluster on demand — you never manage the infrastructure underneath.

The trouble is that a correct production Glue job is more than one resource. You need the right glue_version pinned (mixing 3.0 and 5.0 semantics silently breaks scripts), a worker type and count that match your data volume, job bookmarks so incremental runs don’t reprocess everything, sane max_retries/timeout/max_concurrent_runs so a stuck job doesn’t burn DPU-hours all weekend, and continuous CloudWatch logging plus Spark UI so you can actually debug failures. Hand-rolling all of that per job leads to drift — one team forgets bookmarks, another sets a 48-hour timeout, a third leaves logging off and flies blind.

Wrapping aws_glue_job in a reusable module fixes that. The module bakes in the production defaults (G.1X workers, bookmarks enabled, a 60-minute timeout, one retry, continuous logging) while exposing every knob as a variable, so every job in your estate is consistent, tagged, and observable, and a new pipeline is a 15-line module block instead of 80 lines of copy-pasted HCL.

When to use it

You run more than a couple of Glue jobs and want them to share worker sizing, logging, retry, and tagging conventions instead of drifting per team.
You need incremental ETL with job bookmarks so daily/hourly runs only process new partitions rather than rescanning the whole bucket.
You want cost guardrails — capped concurrency, a hard timeout, and right-sized workers — codified so nobody accidentally provisions 100 G.2X workers with no timeout.
You orchestrate jobs from Glue Workflows, Triggers, Step Functions, EventBridge, or MWAA/Airflow and need a stable job name and ARN to wire up downstream.
You’re standardizing a data platform / lakehouse and want ETL job definitions to live in version control, reviewed via PR, alongside the rest of your IaC.

If you only ever need one throwaway job, the raw resource is fine. The module pays off the moment you have a fleet to keep consistent.
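The fleet case can be sketched with for_each over a map of jobs, so each new pipeline becomes one more map entry. This is an illustration under assumptions: the glue_jobs map, the inventory job, and the glue_role_arn variable are hypothetical, and the bucket names are borrowed from the usage example later in this article.

```hcl
variable "glue_role_arn" {
  description = "Hypothetical shared IAM role ARN for all jobs in the fleet."
  type        = string
}

# Hypothetical job catalogue: keys become job names, values hold per-job settings.
locals {
  glue_jobs = {
    "orders-curated-daily"    = { script = "orders_curated.py", workers = 10 }
    "inventory-curated-daily" = { script = "inventory_curated.py", workers = 4 }
  }
}

module "glue_jobs" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
  for_each = local.glue_jobs

  name              = each.key
  role_arn          = var.glue_role_arn
  script_location   = "s3://kv-data-platform-artifacts/scripts/${each.value.script}"
  temp_bucket       = "kv-data-platform-glue-temp"
  number_of_workers = each.value.workers

  # Shared conventions live here once; only the map entries differ per job.
  tags = {
    Environment = "prod"
    Team        = "data-platform"
    Pipeline    = each.key
  }
}
```

Individual jobs are then addressed as module.glue_jobs["orders-curated-daily"].arn and so on.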

Module structure

terraform-module-aws-glue-job/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # aws_glue_job + optional trigger + log group
├── variables.tf     # all inputs, validated
└── outputs.tf       # id/name/arn + role + log group

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Continuous logging needs a known log group so we can grant it and read it.
  log_group_name = coalesce(var.log_group_name, "/aws-glue/jobs/${var.name}")

  # Default arguments AWS expects, merged with user overrides. User values win.
  default_arguments = merge(
    {
      "--job-language"                     = var.job_language
      "--job-bookmark-option"              = var.bookmark_enabled ? "job-bookmark-enable" : "job-bookmark-disable"
      "--enable-metrics"                   = "true"
      "--enable-observability-metrics"     = "true"
      "--enable-continuous-cloudwatch-log" = tostring(var.continuous_logging)
      "--enable-spark-ui"                  = tostring(var.enable_spark_ui)
      "--TempDir"                          = "s3://${var.temp_bucket}/glue-temp/${var.name}/"
    },
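    var.enable_spark_ui ? {
      "--spark-event-logs-path" = "s3://${var.temp_bucket}/glue-spark-logs/${var.name}/"
    } : {},
    # Point continuous logging at the log group this module manages; without
    # this argument Glue writes to its own default continuous-logging group.
    var.continuous_logging ? {
      "--continuous-log-logGroup" = local.log_group_name
    } : {},
    var.extra_arguments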
  )
}

resource "aws_cloudwatch_log_group" "this" {
  count             = var.create_log_group ? 1 : 0
  name              = local.log_group_name
  retention_in_days = var.log_retention_days
  tags              = var.tags
}

resource "aws_glue_job" "this" {
  name                   = var.name
  description            = var.description
  role_arn               = var.role_arn
  glue_version           = var.glue_version
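  # Python Shell jobs are sized via max_capacity, which conflicts with worker settings.
  worker_type            = var.command_name == "pythonshell" ? null : var.worker_type
  number_of_workers      = var.command_name == "pythonshell" ? null : var.number_of_workers
  max_retries            = var.max_retries
  timeout                = var.timeout_minutes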
  security_configuration = var.security_configuration
  connections            = var.connections
  default_arguments      = local.default_arguments
  tags                   = var.tags

  command {
    name            = var.command_name
    script_location = var.script_location
    python_version  = var.command_name == "pythonshell" ? var.python_version : null
  }

  execution_property {
    max_concurrent_runs = var.max_concurrent_runs
  }

  dynamic "notification_property" {
    for_each = var.notify_delay_after_minutes != null ? [1] : []
    content {
      notify_delay_after = var.notify_delay_after_minutes
    }
  }

  # Python Shell uses max_capacity instead of worker_type/number_of_workers.
  max_capacity = var.command_name == "pythonshell" ? var.max_capacity : null

  # Fail at plan time if a Python Shell job is declared without max_capacity.
  lifecycle {
    precondition {
      condition     = var.command_name != "pythonshell" || var.max_capacity != null
      error_message = "command_name = \"pythonshell\" requires max_capacity (0.0625 or 1.0), not workers."
    }
  }
}

# Optional schedule trigger so the module can stand up a self-contained pipeline.
resource "aws_glue_trigger" "schedule" {
  count    = var.schedule_expression != null ? 1 : 0
  name     = "${var.name}-schedule"
  type     = "SCHEDULED"
  schedule = var.schedule_expression
  enabled  = var.schedule_enabled
  tags     = var.tags

  actions {
    job_name = aws_glue_job.this.name
    timeout  = var.timeout_minutes
  }
}

variables.tf

variable "name" {
  description = "Name of the Glue job. Used for the job, its log group, and the trigger."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9_.-]{1,255}$", var.name))
    error_message = "name must be 1-255 chars: letters, digits, '_', '.', '-' only."
  }
}

variable "description" {
  description = "Human-readable description of what the job does."
  type        = string
  default     = null
}

variable "role_arn" {
  description = "ARN of the IAM role Glue assumes to run the job (needs Glue + S3 + source access)."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$", var.role_arn))
    error_message = "role_arn must be a valid IAM role ARN."
  }
}

variable "command_name" {
  description = "Job type: 'glueetl' (Spark), 'gluestreaming' (streaming), 'pythonshell', or 'glueray'."
  type        = string
  default     = "glueetl"

  validation {
    condition     = contains(["glueetl", "gluestreaming", "pythonshell", "glueray"], var.command_name)
    error_message = "command_name must be one of: glueetl, gluestreaming, pythonshell, glueray."
  }
}

variable "script_location" {
  description = "S3 URI of the job script, e.g. s3://my-bucket/scripts/etl.py."
  type        = string

  validation {
    condition     = can(regex("^s3://", var.script_location))
    error_message = "script_location must be an s3:// URI."
  }
}

variable "glue_version" {
  description = "Glue runtime version for Spark jobs: 2.0 runs Spark 2.4; 3.0, 4.0, and 5.0 run Spark 3.x."
  type        = string
  default     = "5.0"

  validation {
    condition     = contains(["2.0", "3.0", "4.0", "5.0"], var.glue_version)
    error_message = "glue_version must be one of: 2.0, 3.0, 4.0, 5.0."
  }
}

variable "job_language" {
  description = "Script language passed as --job-language ('python' or 'scala')."
  type        = string
  default     = "python"

  validation {
    condition     = contains(["python", "scala"], var.job_language)
    error_message = "job_language must be 'python' or 'scala'."
  }
}

variable "python_version" {
  description = "Python version for pythonshell jobs ('3.9' recommended)."
  type        = string
  default     = "3.9"
}

variable "worker_type" {
  description = "Worker size for Spark/Ray jobs: Standard, G.1X, G.2X, G.4X, G.8X, or Z.2X."
  type        = string
  default     = "G.1X"

  validation {
    condition     = contains(["Standard", "G.1X", "G.2X", "G.4X", "G.8X", "Z.2X"], var.worker_type)
    error_message = "worker_type must be one of: Standard, G.1X, G.2X, G.4X, G.8X, Z.2X."
  }
}

variable "number_of_workers" {
  description = "Number of workers for Spark/Ray jobs. Minimum 2. DPUs per worker depend on worker_type (G.1X = 1 DPU, G.2X = 2 DPU)."
  type        = number
  default     = 2

  validation {
    condition     = var.number_of_workers >= 2 && var.number_of_workers <= 299
    error_message = "number_of_workers must be between 2 and 299."
  }
}

variable "max_capacity" {
  description = "DPU capacity for pythonshell jobs only: 0.0625 or 1.0."
  type        = number
  default     = null

  validation {
    condition     = var.max_capacity == null ? true : contains([0.0625, 1.0], var.max_capacity)
    error_message = "max_capacity (pythonshell) must be 0.0625 or 1.0."
  }
}

variable "max_retries" {
  description = "How many times Glue retries the job on failure."
  type        = number
  default     = 1

  validation {
    condition     = var.max_retries >= 0 && var.max_retries <= 10
    error_message = "max_retries must be between 0 and 10."
  }
}

variable "timeout_minutes" {
  description = "Job timeout in minutes. Hard cap to stop runaway DPU spend."
  type        = number
  default     = 60

  validation {
    condition     = var.timeout_minutes >= 1 && var.timeout_minutes <= 7200
    error_message = "timeout_minutes must be between 1 and 7200 (5 days)."
  }
}

variable "max_concurrent_runs" {
  description = "Max simultaneous runs of this job."
  type        = number
  default     = 1

  validation {
    condition     = var.max_concurrent_runs >= 1 && var.max_concurrent_runs <= 1000
    error_message = "max_concurrent_runs must be between 1 and 1000."
  }
}

variable "bookmark_enabled" {
  description = "Enable job bookmarks for incremental processing (skips already-read data)."
  type        = bool
  default     = true
}

variable "continuous_logging" {
  description = "Enable continuous CloudWatch logging during the run."
  type        = bool
  default     = true
}

variable "enable_spark_ui" {
  description = "Write Spark event logs to S3 for the Spark UI (Spark jobs only)."
  type        = bool
  default     = true
}

variable "temp_bucket" {
  description = "S3 bucket name (no s3://) for --TempDir and Spark event logs."
  type        = string
}

variable "connections" {
  description = "List of Glue connection names (e.g. for JDBC/VPC sources)."
  type        = list(string)
  default     = []
}

variable "security_configuration" {
  description = "Name of a Glue security configuration (for S3/CloudWatch encryption)."
  type        = string
  default     = null
}

variable "extra_arguments" {
  description = "Additional --key/value default arguments merged into the job (overrides defaults)."
  type        = map(string)
  default     = {}
}

variable "notify_delay_after_minutes" {
  description = "Emit a DELAYED notification if a run exceeds this many minutes. Null disables."
  type        = number
  default     = null
}

variable "create_log_group" {
  description = "Create and manage the CloudWatch log group for this job."
  type        = bool
  default     = true
}

variable "log_group_name" {
  description = "Override the log group name. Defaults to /aws-glue/jobs/<name>."
  type        = string
  default     = null
}

variable "log_retention_days" {
  description = "CloudWatch log retention in days."
  type        = number
  default     = 30
}

variable "schedule_expression" {
  description = "Optional cron(...) expression to create a SCHEDULED trigger. Null = no trigger."
  type        = string
  default     = null
}

variable "schedule_enabled" {
  description = "Whether the schedule trigger is active when created."
  type        = bool
  default     = true
}

variable "tags" {
  description = "Tags applied to the job, trigger, and log group."
  type        = map(string)
  default     = {}
}
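Several inputs above, such as log_retention_days and schedule_expression, carry no validation block. If stricter checking is wanted, their declarations could be replaced along these lines. This is a sketch, not part of the published module: the retention list reflects the values the AWS provider documents for aws_cloudwatch_log_group and should be checked against the current docs, and the cron regex is only a loose shape check.

```hcl
variable "log_retention_days" {
  description = "CloudWatch log retention in days."
  type        = number
  default     = 30

  validation {
    # Retention periods CloudWatch Logs accepts (0 means never expire).
    condition = contains(
      [0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653],
      var.log_retention_days
    )
    error_message = "log_retention_days must be a retention period CloudWatch Logs supports (for example 30, 90, or 365)."
  }
}

variable "schedule_expression" {
  description = "Optional cron(...) expression to create a SCHEDULED trigger. Null = no trigger."
  type        = string
  default     = null

  validation {
    # Conditional form so the regex is never evaluated against null.
    condition     = var.schedule_expression == null ? true : can(regex("^cron\\(.+\\)$", var.schedule_expression))
    error_message = "schedule_expression must look like cron(0 2 * * ? *)."
  }
}
```

Catching a bad retention value or a malformed schedule at plan time is cheaper than discovering it as an API error halfway through an apply.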

outputs.tf

output "id" {
  description = "The Glue job ID (same as its name)."
  value       = aws_glue_job.this.id
}

output "name" {
  description = "The name of the Glue job — use to wire up triggers/workflows/Step Functions."
  value       = aws_glue_job.this.name
}

output "arn" {
  description = "The ARN of the Glue job."
  value       = aws_glue_job.this.arn
}

output "role_arn" {
  description = "The IAM role ARN the job runs as."
  value       = aws_glue_job.this.role_arn
}

output "default_arguments" {
  description = "The resolved default arguments passed to the job (bookmarks, TempDir, etc.)."
  value       = aws_glue_job.this.default_arguments
}

output "log_group_name" {
  description = "The CloudWatch log group name for the job."
  value       = local.log_group_name
}

output "trigger_name" {
  description = "Name of the created schedule trigger, or null if none."
  value       = try(aws_glue_trigger.schedule[0].name, null)
}
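The arn output is what an orchestrator's IAM policy typically needs. As a sketch, assuming a module instance named glue_job and a hypothetical policy name, the permissions to start and poll this one job could be scoped like this:

```hcl
# Permissions a hypothetical orchestrator role (Step Functions, Airflow, etc.)
# needs to start, poll, and stop runs of this specific job.
data "aws_iam_policy_document" "start_glue_job" {
  statement {
    sid       = "RunGlueJob"
    effect    = "Allow"
    actions   = ["glue:StartJobRun", "glue:GetJobRun", "glue:GetJobRuns", "glue:BatchStopJobRun"]
    resources = [module.glue_job.arn]
  }
}

resource "aws_iam_policy" "start_glue_job" {
  name   = "start-${module.glue_job.name}" # illustrative naming scheme
  policy = data.aws_iam_policy_document.start_glue_job.json
}
```

Referencing the output rather than a hand-built ARN string keeps the policy correct if the job is renamed.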

How to use it

module "glue_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"

  name            = "orders-curated-daily"
  description     = "Curate raw orders from the lake into the analytics zone, daily."
  role_arn        = aws_iam_role.glue_etl.arn
  command_name    = "glueetl"
  glue_version    = "5.0"
  script_location = "s3://kv-data-platform-artifacts/scripts/orders_curated.py"

  worker_type       = "G.2X"
  number_of_workers = 10
  timeout_minutes   = 90
  max_retries       = 1

  bookmark_enabled = true
  temp_bucket      = "kv-data-platform-glue-temp"

  # Pass pipeline config straight to the script.
  extra_arguments = {
    "--source_path"         = "s3://kv-data-lake-raw/orders/"
    "--target_path"         = "s3://kv-data-lake-curated/orders/"
    "--enable-auto-scaling" = "true"
  }

  # Stand up the daily schedule from the same module.
  schedule_expression = "cron(0 2 * * ? *)"

  notify_delay_after_minutes = 60 # keep below timeout_minutes (90) or it can never fire
  log_retention_days         = 90

  tags = {
    Environment = "prod"
    Team        = "data-platform"
    Pipeline    = "orders-curated"
  }
}

# Downstream: chain a second job after this one using the output job name.
resource "aws_glue_trigger" "after_curated" {
  name = "run-orders-agg-after-curated"
  type = "CONDITIONAL"

  predicate {
    conditions {
      job_name = module.glue_job.name # output consumed here
      state    = "SUCCEEDED"
    }
  }

  actions {
    job_name = "orders-aggregate-daily"
  }
}
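The example above references aws_iam_role.glue_etl without defining it. One possible shape for that role is sketched below; the role and policy names are hypothetical, the S3 prefixes mirror the buckets used in the example, and a real job may also need Data Catalog, KMS, or connection permissions that are not shown.

```hcl
# Trust policy: let the Glue service assume the role.
data "aws_iam_policy_document" "glue_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "glue_etl" {
  name               = "glue-etl-orders" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}

# AWS-managed baseline policy for Glue service roles.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_etl.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Data access limited to the prefixes this pipeline touches.
data "aws_iam_policy_document" "glue_data_access" {
  statement {
    sid     = "ReadRawAndScripts"
    actions = ["s3:GetObject"]
    resources = [
      "arn:aws:s3:::kv-data-lake-raw/orders/*",
      "arn:aws:s3:::kv-data-platform-artifacts/scripts/*",
    ]
  }

  statement {
    sid     = "WriteCuratedAndTemp"
    actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = [
      "arn:aws:s3:::kv-data-lake-curated/orders/*",
      "arn:aws:s3:::kv-data-platform-glue-temp/*",
    ]
  }

  statement {
    sid     = "ListBuckets"
    actions = ["s3:ListBucket"]
    resources = [
      "arn:aws:s3:::kv-data-lake-raw",
      "arn:aws:s3:::kv-data-lake-curated",
      "arn:aws:s3:::kv-data-platform-artifacts",
      "arn:aws:s3:::kv-data-platform-glue-temp",
    ]
  }
}

resource "aws_iam_role_policy" "glue_data_access" {
  name   = "data-access"
  role   = aws_iam_role.glue_etl.id
  policy = data.aws_iam_policy_document.glue_data_access.json
}
```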

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}
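The root config can also carry the provider definition that the introduction to this section mentions. A sketch of a generate block that could sit alongside remote_state in live/terragrunt.hcl; the region is taken from the Quickstart and is an assumption for your environment:

```hcl
# Hypothetical addition to live/terragrunt.hcl: writes provider.tf into every
# module directory Terragrunt runs, so modules need no provider block of their own.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```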

2. Module config — live/prod/glue_job/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
}

inputs = {
  name            = "..."
  role_arn        = "..."
  script_location = "..."
  temp_bucket     = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/glue_job && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
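A per-environment override might look like the following live/dev/glue_job/terragrunt.hcl. It is a sketch: the dev job name and the reduced sizing are illustrative choices, and the "..." placeholders are left for your own values.

```hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
}

inputs = {
  name            = "orders-curated-daily-dev" # illustrative dev job name
  role_arn        = "..."                      # dev Glue role ARN
  script_location = "..."                      # dev script S3 URI
  temp_bucket     = "..."                      # dev temp bucket

  # Only the values that differ from prod: smaller, shorter, cheaper.
  number_of_workers  = 2
  timeout_minutes    = 30
  log_retention_days = 7

  tags = {
    Environment = "dev"
  }
}
```

The module source and version stay identical across environments; only the inputs map changes.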

Inputs

Name	Type	Default	Required	Description
name	string	—	yes	Glue job name; also used for log group and trigger.
description	string	null	no	Human-readable description of the job.
role_arn	string	—	yes	IAM role ARN Glue assumes to run the job.
command_name	string	"glueetl"	no	Job type: glueetl, gluestreaming, pythonshell, or glueray.
script_location	string	—	yes	S3 URI of the job script.
glue_version	string	"5.0"	no	Glue runtime version (2.0/3.0/4.0/5.0).
job_language	string	"python"	no	Script language: python or scala.
python_version	string	"3.9"	no	Python version for pythonshell jobs.
worker_type	string	"G.1X"	no	Worker size for Spark/Ray jobs.
number_of_workers	number	2	no	Number of workers; 2–299 (DPUs per worker depend on worker_type).
max_capacity	number	null	no	DPU capacity for pythonshell only (0.0625 or 1.0).
max_retries	number	1	no	Retries on failure (0–10).
timeout_minutes	number	60	no	Hard job timeout in minutes (1–7200).
max_concurrent_runs	number	1	no	Max simultaneous runs (1–1000).
bookmark_enabled	bool	true	no	Enable job bookmarks for incremental processing.
continuous_logging	bool	true	no	Enable continuous CloudWatch logging.
enable_spark_ui	bool	true	no	Write Spark event logs to S3 for the Spark UI.
temp_bucket	string	—	yes	S3 bucket (no s3://) for TempDir and Spark logs.
connections	list(string)	[]	no	Glue connection names for JDBC/VPC sources.
security_configuration	string	null	no	Glue security configuration name for encryption.
extra_arguments	map(string)	{}	no	Extra default arguments merged into the job.
notify_delay_after_minutes	number	null	no	Emit a DELAYED notification past this runtime.
create_log_group	bool	true	no	Create and manage the CloudWatch log group.
log_group_name	string	null	no	Override log group name (default /aws-glue/jobs/<name>).
log_retention_days	number	30	no	CloudWatch log retention in days.
schedule_expression	string	null	no	cron(…) expression to create a SCHEDULED trigger.
schedule_enabled	bool	true	no	Whether the schedule trigger is active.
tags	map(string)	{}	no	Tags applied to job, trigger, and log group.

Outputs

Name	Description
id	The Glue job ID (same as its name).
name	Job name — wire up triggers, workflows, or Step Functions.
arn	The ARN of the Glue job.
role_arn	The IAM role ARN the job runs as.
default_arguments	Resolved default arguments (bookmarks, TempDir, flags).
log_group_name	CloudWatch log group name for the job.
trigger_name	Name of the created schedule trigger, or null.

Enterprise scenario

A retail analytics team runs a nightly lakehouse refresh: roughly 40 Glue jobs curate raw S3 data (orders, inventory, clickstream) into Iceberg tables, then aggregate marts on top. By standardizing every job on this module, they apply G.2X workers with auto-scaling, bookmarks, a 90-minute timeout, and 90-day log retention uniformly, as one shared set of inputs rather than 40 hand-written resources. Curated jobs feed their name output into CONDITIONAL Glue triggers so aggregation starts only after the upstream job has SUCCEEDED. With notify_delay_after_minutes set below the timeout, Glue emits a delay notification as soon as a run exceeds its SLA, and an EventBridge rule can route that event to the on-call data engineer instead of letting the run drag silently into business hours.
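The delay notification only reaches a person if something routes it. One way to wire that up is an EventBridge rule that forwards Glue delay events to an SNS topic, sketched below. The topic and rule names are hypothetical, and the event pattern assumes delay notifications arrive with the detail-type "Glue Job Run Status"; verify that against the current Glue events documentation before relying on it.

```hcl
resource "aws_sns_topic" "glue_delay" {
  name = "glue-job-delay-alerts" # hypothetical topic; subscribe your paging tool to it
}

resource "aws_cloudwatch_event_rule" "glue_delay" {
  name        = "glue-job-run-delayed"
  description = "Glue job runs that exceeded notify_delay_after."

  # Assumed event shape: delay notifications for the named job.
  event_pattern = jsonencode({
    source        = ["aws.glue"]
    "detail-type" = ["Glue Job Run Status"]
    detail = {
      jobName = [module.glue_job.name]
    }
  })
}

resource "aws_cloudwatch_event_target" "glue_delay_sns" {
  rule = aws_cloudwatch_event_rule.glue_delay.name
  arn  = aws_sns_topic.glue_delay.arn
}

# Allow EventBridge to publish to the topic.
data "aws_iam_policy_document" "sns_publish" {
  statement {
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.glue_delay.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "glue_delay" {
  arn    = aws_sns_topic.glue_delay.arn
  policy = data.aws_iam_policy_document.sns_publish.json
}
```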

Best practices

Pin glue_version explicitly and test before bumping. Moving 3.0 → 4.0 → 5.0 changes the Spark/Python runtime and bundled connectors; an unpinned or surprise upgrade can break scripts at 2 a.m. Pin it in the module and roll forward deliberately.
Right-size workers and enforce a timeout. Glue bills per DPU-hour, so over-provisioning number_of_workers or omitting timeout_minutes directly burns money. Start small (G.1X, 2–10 workers), enable --enable-auto-scaling, and keep the hard timeout as a runaway-cost backstop.
Keep job bookmarks on for incremental ETL, and reset them deliberately. bookmark_enabled = true stops jobs from rescanning the whole bucket every run, but it only takes effect if the script calls job.init() and job.commit() and sets a transformation_ctx on its sources. A backfill needs an explicit bookmark reset, and bookmarks only track new data reliably when the source is append-only.
Scope the IAM role tightly and encrypt via a security configuration. Give the role_arn only the specific S3 prefixes and Catalog/KMS actions it needs (never s3:*), and attach a security_configuration so CloudWatch logs, S3 outputs, and job bookmarks are encrypted with your CMK.
Standardize naming and tags for cost allocation. Use a consistent <domain>-<dataset>-<frequency> job name and always set Environment, Team, and Pipeline tags so Glue spend shows up cleanly in Cost Explorer and jobs are discoverable in the console.
Turn on metrics, continuous logging, and the Spark UI from day one. Leave continuous_logging and enable_spark_ui on so failures are debuggable in real time rather than only post-mortem; pair them with sensible log_retention_days so logs don’t accumulate cost indefinitely.
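The security configuration recommended above is a separate resource whose name is passed to the module. A minimal sketch, assuming an existing customer-managed KMS key; the variable and configuration names are hypothetical:

```hcl
variable "glue_kms_key_arn" {
  description = "ARN of an existing customer-managed KMS key (assumed to exist already)."
  type        = string
}

resource "aws_glue_security_configuration" "etl" {
  name = "glue-etl-cmk" # hypothetical name

  encryption_configuration {
    cloudwatch_encryption {
      cloudwatch_encryption_mode = "SSE-KMS"
      kms_key_arn                = var.glue_kms_key_arn
    }

    job_bookmarks_encryption {
      job_bookmarks_encryption_mode = "CSE-KMS"
      kms_key_arn                   = var.glue_kms_key_arn
    }

    s3_encryption {
      s3_encryption_mode = "SSE-KMS"
      kms_key_arn        = var.glue_kms_key_arn
    }
  }
}

# Then, in the module call:
#   security_configuration = aws_glue_security_configuration.etl.name
```

The key policy also has to let the job's IAM role and the CloudWatch Logs service use the key; that part is not shown here.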