Quick take — Provision AWS Glue ETL jobs with Terraform: a reusable hashicorp/aws ~> 5.0 module wrapping aws_glue_job with worker sizing, job bookmarks, retries, timeouts, and CloudWatch logging baked in. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "glue_job" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
name = "..." # Glue job name; also used for log group and trigger.
role_arn = "..." # IAM role ARN Glue assumes to run the job.
script_location = "..." # S3 URI of the job script.
temp_bucket = "..." # S3 bucket (no s3://) for TempDir and Spark logs.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
AWS Glue is a serverless data-integration service, and a Glue Job is the unit of work that actually runs your ETL: a PySpark, Scala Spark, Python Shell, or Ray script that reads from sources (S3, JDBC, the Glue Data Catalog), transforms data, and writes it back out. You point a job at a script in S3, choose a worker type and a number of workers, hand it an IAM role, and Glue spins up an ephemeral Spark cluster on demand — you never manage the infrastructure underneath.
The trouble is that a correct production Glue job is more than one resource. You need the right glue_version pinned (mixing 3.0 and 5.0 semantics silently breaks scripts), a worker type and count that match your data volume, job bookmarks so incremental runs don’t reprocess everything, sane max_retries/timeout/max_concurrent_runs so a stuck job doesn’t burn DPU-hours all weekend, and continuous CloudWatch logging plus Spark UI so you can actually debug failures. Hand-rolling all of that per job leads to drift — one team forgets bookmarks, another sets a 48-hour timeout, a third leaves logging off and flies blind.
Wrapping aws_glue_job in a reusable module fixes that. The module bakes in the production defaults (G.1X workers, bookmarks enabled, a 60-minute timeout, one retry, continuous logging) while exposing every knob as a variable, so every job in your estate is consistent, tagged, and observable, and a new pipeline is a 15-line module block instead of 80 lines of copy-pasted HCL.
When to use it
- You run more than a couple of Glue jobs and want them to share worker sizing, logging, retry, and tagging conventions instead of drifting per team.
- You need incremental ETL with job bookmarks so daily/hourly runs only process new partitions rather than rescanning the whole bucket.
- You want cost guardrails — capped concurrency, a hard timeout, and right-sized workers — codified so nobody accidentally provisions 100 G.2X workers with no timeout.
- You orchestrate jobs from Glue Workflows, Triggers, Step Functions, EventBridge, or MWAA/Airflow and need a stable job name and ARN to wire up downstream.
- You’re standardizing a data platform / lakehouse and want ETL job definitions to live in version control, reviewed via PR, alongside the rest of your IaC.
If you only ever need one throwaway job, the raw resource is fine. The module pays off the moment you have a fleet to keep consistent.
Module structure
terraform-module-aws-glue-job/
├── versions.tf # provider + Terraform version pins
├── main.tf # aws_glue_job + optional trigger + log group
├── variables.tf # all inputs, validated
└── outputs.tf # id/name/arn + role + log group
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
# Continuous logging needs a known log group so we can grant it and read it.
log_group_name = coalesce(var.log_group_name, "/aws-glue/jobs/${var.name}")
# Default arguments AWS expects, merged with user overrides. User values win.
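default_arguments = merge(
{
"--job-language" = var.job_language
"--job-bookmark-option" = var.bookmark_enabled ? "job-bookmark-enable" : "job-bookmark-disable"
"--enable-metrics" = "true"
"--enable-observability-metrics" = "true"
"--enable-continuous-cloudwatch-log" = tostring(var.continuous_logging)
# Point continuous logging at the log group managed below; otherwise Glue writes to its shared default group.
"--continuous-log-logGroup" = local.log_group_name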
"--enable-spark-ui" = tostring(var.enable_spark_ui)
"--TempDir" = "s3://${var.temp_bucket}/glue-temp/${var.name}/"
},
var.enable_spark_ui ? {
"--spark-event-logs-path" = "s3://${var.temp_bucket}/glue-spark-logs/${var.name}/"
} : {},
var.extra_arguments
)
}
resource "aws_cloudwatch_log_group" "this" {
count = var.create_log_group ? 1 : 0
name = local.log_group_name
retention_in_days = var.log_retention_days
tags = var.tags
}
resource "aws_glue_job" "this" {
name = var.name
description = var.description
role_arn = var.role_arn
glue_version = var.glue_version
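# Python Shell jobs are sized by max_capacity, which conflicts with the worker settings.
worker_type = var.command_name == "pythonshell" ? null : var.worker_type
number_of_workers = var.command_name == "pythonshell" ? null : var.number_of_workers
max_retries = var.max_retries
timeout = var.timeout_minutes
# max_concurrent_runs is not a top-level argument; it is set in execution_property below.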
security_configuration = var.security_configuration
connections = var.connections
default_arguments = local.default_arguments
tags = var.tags
command {
name = var.command_name
script_location = var.script_location
python_version = var.command_name == "pythonshell" ? var.python_version : null
}
execution_property {
max_concurrent_runs = var.max_concurrent_runs
}
dynamic "notification_property" {
for_each = var.notify_delay_after_minutes != null ? [1] : []
content {
notify_delay_after = var.notify_delay_after_minutes
}
}
# Python Shell jobs are sized with max_capacity, so fail at plan time if it is missing.
lifecycle {
precondition {
condition = var.command_name != "pythonshell" || var.max_capacity != null
error_message = "command_name = \"pythonshell\" requires max_capacity (0.0625 or 1.0), not workers."
}
}
# Python Shell uses max_capacity instead of worker_type/number_of_workers.
max_capacity = var.command_name == "pythonshell" ? var.max_capacity : null
}
# Optional schedule trigger so the module can stand up a self-contained pipeline.
resource "aws_glue_trigger" "schedule" {
count = var.schedule_expression != null ? 1 : 0
name = "${var.name}-schedule"
type = "SCHEDULED"
schedule = var.schedule_expression
enabled = var.schedule_enabled
tags = var.tags
actions {
job_name = aws_glue_job.this.name
timeout = var.timeout_minutes
}
}
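To see why caller-supplied extra_arguments win over the module defaults, it helps to look at merge() in isolation. The sketch below is a standalone configuration that mirrors the merge order used in local.default_arguments. The local names and bucket names are made up for illustration and are not part of the module.

```hcl
# Standalone sketch: needs no provider, so it can be applied in an empty directory.
locals {
  # Hypothetical stand-ins for the module's built-in defaults.
  module_defaults = {
    "--enable-spark-ui" = "true"
    "--TempDir"         = "s3://example-temp-bucket/glue-temp/demo/"
  }

  # Hypothetical caller override, as it would arrive via var.extra_arguments.
  extra_arguments = {
    "--enable-spark-ui" = "false"
    "--source_path"     = "s3://example-raw-bucket/orders/"
  }

  # merge() gives precedence to later arguments, so the caller's value wins.
  resolved = merge(local.module_defaults, local.extra_arguments)
}

output "resolved" {
  value = local.resolved
}
```

Here resolved carries "--enable-spark-ui" = "false" alongside the untouched --TempDir and the new --source_path. One consequence in the module itself: overriding --enable-spark-ui through extra_arguments does not remove --spark-event-logs-path, because that key is driven by var.enable_spark_ui.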
variables.tf
variable "name" {
description = "Name of the Glue job. Used for the job, its log group, and the trigger."
type = string
validation {
condition = can(regex("^[A-Za-z0-9_.-]{1,255}$", var.name))
error_message = "name must be 1-255 chars: letters, digits, '_', '.', '-' only."
}
}
variable "description" {
description = "Human-readable description of what the job does."
type = string
default = null
}
variable "role_arn" {
description = "ARN of the IAM role Glue assumes to run the job (needs Glue + S3 + source access)."
type = string
validation {
condition = can(regex("^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$", var.role_arn))
error_message = "role_arn must be a valid IAM role ARN."
}
}
variable "command_name" {
description = "Job type: 'glueetl' (Spark), 'gluestreaming' (streaming), 'pythonshell', or 'glueray'."
type = string
default = "glueetl"
validation {
condition = contains(["glueetl", "gluestreaming", "pythonshell", "glueray"], var.command_name)
error_message = "command_name must be one of: glueetl, gluestreaming, pythonshell, glueray."
}
}
variable "script_location" {
description = "S3 URI of the job script, e.g. s3://my-bucket/scripts/etl.py."
type = string
validation {
condition = can(regex("^s3://", var.script_location))
error_message = "script_location must be an s3:// URI."
}
}
variable "glue_version" {
description = "Glue runtime version for Spark jobs (3.0, 4.0, and 5.0 run Spark 3.x). Python Shell jobs pick their interpreter via python_version."
type = string
default = "5.0"
validation {
condition = contains(["2.0", "3.0", "4.0", "5.0"], var.glue_version)
error_message = "glue_version must be one of: 2.0, 3.0, 4.0, 5.0."
}
}
variable "job_language" {
description = "Script language passed as --job-language ('python' or 'scala')."
type = string
default = "python"
validation {
condition = contains(["python", "scala"], var.job_language)
error_message = "job_language must be 'python' or 'scala'."
}
}
variable "python_version" {
description = "Python version for pythonshell jobs ('3.9' recommended)."
type = string
default = "3.9"
}
variable "worker_type" {
description = "Worker size for Spark/Ray jobs: Standard, G.1X, G.2X, G.4X, G.8X, or Z.2X."
type = string
default = "G.1X"
validation {
condition = contains(["Standard", "G.1X", "G.2X", "G.4X", "G.8X", "Z.2X"], var.worker_type)
error_message = "worker_type must be one of: Standard, G.1X, G.2X, G.4X, G.8X, Z.2X."
}
}
variable "number_of_workers" {
description = "Number of workers for Spark/Ray jobs. Minimum 2. DPUs per worker depend on worker_type (G.1X = 1 DPU, G.2X = 2 DPU)."
type = number
default = 2
validation {
condition = var.number_of_workers >= 2 && var.number_of_workers <= 299
error_message = "number_of_workers must be between 2 and 299."
}
}
variable "max_capacity" {
description = "DPU capacity for pythonshell jobs only: 0.0625 or 1.0."
type = number
default = null
validation {
condition = var.max_capacity == null || contains([0.0625, 1.0], var.max_capacity)
error_message = "max_capacity (pythonshell) must be 0.0625 or 1.0."
}
}
variable "max_retries" {
description = "How many times Glue retries the job on failure."
type = number
default = 1
validation {
condition = var.max_retries >= 0 && var.max_retries <= 10
error_message = "max_retries must be between 0 and 10."
}
}
variable "timeout_minutes" {
description = "Job timeout in minutes. Hard cap to stop runaway DPU spend."
type = number
default = 60
validation {
condition = var.timeout_minutes >= 1 && var.timeout_minutes <= 7200
error_message = "timeout_minutes must be between 1 and 7200 (5 days)."
}
}
variable "max_concurrent_runs" {
description = "Max simultaneous runs of this job."
type = number
default = 1
validation {
condition = var.max_concurrent_runs >= 1 && var.max_concurrent_runs <= 1000
error_message = "max_concurrent_runs must be between 1 and 1000."
}
}
variable "bookmark_enabled" {
description = "Enable job bookmarks for incremental processing (skips already-read data)."
type = bool
default = true
}
variable "continuous_logging" {
description = "Enable continuous CloudWatch logging during the run."
type = bool
default = true
}
variable "enable_spark_ui" {
description = "Write Spark event logs to S3 for the Spark UI (Spark jobs only)."
type = bool
default = true
}
variable "temp_bucket" {
description = "S3 bucket name (no s3://) for --TempDir and Spark event logs."
type = string
}
variable "connections" {
description = "List of Glue connection names (e.g. for JDBC/VPC sources)."
type = list(string)
default = []
}
variable "security_configuration" {
description = "Name of a Glue security configuration (for S3/CloudWatch encryption)."
type = string
default = null
}
variable "extra_arguments" {
description = "Additional --key/value default arguments merged into the job (overrides defaults)."
type = map(string)
default = {}
}
variable "notify_delay_after_minutes" {
description = "Emit a DELAYED notification if a run exceeds this many minutes. Null disables."
type = number
default = null
}
variable "create_log_group" {
description = "Create and manage the CloudWatch log group for this job."
type = bool
default = true
}
variable "log_group_name" {
description = "Override the log group name. Defaults to /aws-glue/jobs/<name>."
type = string
default = null
}
variable "log_retention_days" {
description = "CloudWatch log retention in days."
type = number
default = 30
}
variable "schedule_expression" {
description = "Optional cron(...) expression to create a SCHEDULED trigger. Null = no trigger."
type = string
default = null
}
variable "schedule_enabled" {
description = "Whether the schedule trigger is active when created."
type = bool
default = true
}
variable "tags" {
description = "Tags applied to the job, trigger, and log group."
type = map(string)
default = {}
}
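Not every variable above carries a validation block. log_retention_days, for instance, accepts any number, yet CloudWatch Logs only accepts a fixed set of retention periods, so a bad value would fail only at apply time. As a sketch of how that could be caught at plan time (this is a possible tightening, not part of the module as published), the variable could be declared like this:

```hcl
variable "log_retention_days" {
  description = "CloudWatch log retention in days."
  type        = number
  default     = 30

  validation {
    # Retention periods accepted by CloudWatch Logs; 0 means never expire.
    condition = contains(
      [
        0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
        1096, 1827, 2192, 2557, 2922, 3288, 3653,
      ],
      var.log_retention_days
    )
    error_message = "log_retention_days must be a retention period supported by CloudWatch Logs (for example 7, 30, 90, or 365)."
  }
}
```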
outputs.tf
output "id" {
description = "The Glue job ID (same as its name)."
value = aws_glue_job.this.id
}
output "name" {
description = "The name of the Glue job — use to wire up triggers/workflows/Step Functions."
value = aws_glue_job.this.name
}
output "arn" {
description = "The ARN of the Glue job."
value = aws_glue_job.this.arn
}
output "role_arn" {
description = "The IAM role ARN the job runs as."
value = aws_glue_job.this.role_arn
}
output "default_arguments" {
description = "The resolved default arguments passed to the job (bookmarks, TempDir, etc.)."
value = aws_glue_job.this.default_arguments
}
output "log_group_name" {
description = "The CloudWatch log group name for the job."
value = local.log_group_name
}
output "trigger_name" {
description = "Name of the created schedule trigger, or null if none."
value = try(aws_glue_trigger.schedule[0].name, null)
}
How to use it
module "glue_job" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
name = "orders-curated-daily"
description = "Curate raw orders from the lake into the analytics zone, daily."
role_arn = aws_iam_role.glue_etl.arn
command_name = "glueetl"
glue_version = "5.0"
script_location = "s3://kv-data-platform-artifacts/scripts/orders_curated.py"
worker_type = "G.2X"
number_of_workers = 10
timeout_minutes = 90
max_retries = 1
bookmark_enabled = true
temp_bucket = "kv-data-platform-glue-temp"
# Pass pipeline config straight to the script.
extra_arguments = {
"--source_path" = "s3://kv-data-lake-raw/orders/"
"--target_path" = "s3://kv-data-lake-curated/orders/"
"--enable-auto-scaling" = "true"
}
# Stand up the daily schedule from the same module.
schedule_expression = "cron(0 2 * * ? *)"
notify_delay_after_minutes = 60 # keep below timeout_minutes (90), or the run is stopped before the notification fires
log_retention_days = 90
tags = {
Environment = "prod"
Team = "data-platform"
Pipeline = "orders-curated"
}
}
# Downstream: chain a second job after this one using the output job name.
resource "aws_glue_trigger" "after_curated" {
name = "run-orders-agg-after-curated"
type = "CONDITIONAL"
predicate {
conditions {
job_name = module.glue_job.name # output consumed here
state = "SUCCEEDED"
}
}
actions {
job_name = "orders-aggregate-daily"
}
}
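The example above references aws_iam_role.glue_etl without defining it. A minimal sketch of such a role might look like the following. The role and policy names are assumptions, and the S3 statements are derived only from the bucket paths used in the example, so adjust them to your own layout (Data Catalog and KMS permissions are left out).

```hcl
# Trust policy that lets the Glue service assume the role.
data "aws_iam_policy_document" "glue_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "glue_etl" {
  name               = "kv-glue-etl-orders" # assumed name
  assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}

# AWS-managed baseline policy for Glue service roles.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_etl.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Data access scoped to the prefixes the example job reads and writes.
data "aws_iam_policy_document" "orders_data" {
  statement {
    actions = ["s3:GetObject"]
    resources = [
      "arn:aws:s3:::kv-data-lake-raw/orders/*",
      "arn:aws:s3:::kv-data-platform-artifacts/scripts/*",
    ]
  }

  statement {
    actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = [
      "arn:aws:s3:::kv-data-lake-curated/orders/*",
      "arn:aws:s3:::kv-data-platform-glue-temp/*",
    ]
  }

  statement {
    actions = ["s3:ListBucket"]
    resources = [
      "arn:aws:s3:::kv-data-lake-raw",
      "arn:aws:s3:::kv-data-lake-curated",
      "arn:aws:s3:::kv-data-platform-glue-temp",
    ]
  }
}

resource "aws_iam_role_policy" "orders_data" {
  name   = "orders-data-access" # assumed name
  role   = aws_iam_role.glue_etl.id
  policy = data.aws_iam_policy_document.orders_data.json
}
```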
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...S3 state bucket + key per path...
}
}
2. Module config — live/prod/glue_job/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
}
inputs = {
name = "..."
role_arn = "..."
script_location = "..."
temp_bucket = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/glue_job && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
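As an illustration of per-environment overrides, a dev counterpart of the prod file could reuse the same source and change only sizing and retention. The path live/dev/glue_job/terragrunt.hcl and the override values below are assumptions for the sake of the example. The "..." placeholders are left as in the prod file.

```hcl
# live/dev/glue_job/terragrunt.hcl (hypothetical dev counterpart of the prod file)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
}

inputs = {
  name            = "..."
  role_arn        = "..."
  script_location = "..."
  temp_bucket     = "..."

  # Dev-only overrides; the values are illustrative.
  number_of_workers  = 2
  timeout_minutes    = 30
  log_retention_days = 7

  tags = {
    Environment = "dev"
  }
}
```

Everything not listed under inputs falls back to the module defaults, so the dev and prod files differ only where the environments actually differ.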
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | — | yes | Glue job name; also used for log group and trigger. |
| description | string | null | no | Human-readable description of the job. |
| role_arn | string | — | yes | IAM role ARN Glue assumes to run the job. |
| command_name | string | "glueetl" | no | Job type: glueetl, gluestreaming, pythonshell, or glueray. |
| script_location | string | — | yes | S3 URI of the job script. |
| glue_version | string | "5.0" | no | Glue runtime version (2.0/3.0/4.0/5.0). |
| job_language | string | "python" | no | Script language: python or scala. |
| python_version | string | "3.9" | no | Python version for pythonshell jobs. |
| worker_type | string | "G.1X" | no | Worker size for Spark/Ray jobs. |
| number_of_workers | number | 2 | no | Number of workers for Spark/Ray jobs; 2–299. |
| max_capacity | number | null | no | DPU capacity for pythonshell only (0.0625 or 1.0). |
| max_retries | number | 1 | no | Retries on failure (0–10). |
| timeout_minutes | number | 60 | no | Hard job timeout in minutes (1–7200). |
| max_concurrent_runs | number | 1 | no | Max simultaneous runs (1–1000). |
| bookmark_enabled | bool | true | no | Enable job bookmarks for incremental processing. |
| continuous_logging | bool | true | no | Enable continuous CloudWatch logging. |
| enable_spark_ui | bool | true | no | Write Spark event logs to S3 for the Spark UI. |
| temp_bucket | string | — | yes | S3 bucket (no s3://) for TempDir and Spark logs. |
| connections | list(string) | [] | no | Glue connection names for JDBC/VPC sources. |
| security_configuration | string | null | no | Glue security configuration name for encryption. |
| extra_arguments | map(string) | {} | no | Extra default arguments merged into the job. |
| notify_delay_after_minutes | number | null | no | Emit a DELAYED notification past this runtime. |
| create_log_group | bool | true | no | Create and manage the CloudWatch log group. |
| log_group_name | string | null | no | Override log group name (default /aws-glue/jobs/<name>). |
| log_retention_days | number | 30 | no | CloudWatch log retention in days. |
| schedule_expression | string | null | no | cron(...) expression to create a SCHEDULED trigger. |
| schedule_enabled | bool | true | no | Whether the schedule trigger is active. |
| tags | map(string) | {} | no | Tags applied to job, trigger, and log group. |
Outputs
| Name | Description |
|---|---|
| id | The Glue job ID (same as its name). |
| name | Job name — wire up triggers, workflows, or Step Functions. |
| arn | The ARN of the Glue job. |
| role_arn | The IAM role ARN the job runs as. |
| default_arguments | Resolved default arguments (bookmarks, TempDir, flags). |
| log_group_name | CloudWatch log group name for the job. |
| trigger_name | Name of the created schedule trigger, or null. |
Enterprise scenario
A retail analytics team runs a nightly lakehouse refresh: roughly 40 Glue jobs curate raw S3 data (orders, inventory, clickstream) into Iceberg tables, then build aggregate marts on top. By standardizing every job on this module, they apply G.2X workers with auto-scaling, bookmarks, a 90-minute timeout, and 90-day log retention uniformly, defined once as shared inputs rather than 40 times. Curated jobs feed their name output into CONDITIONAL Glue triggers so aggregation starts only after the upstream job reaches SUCCEEDED, and notify_delay_after_minutes raises a delay notification as soon as a run exceeds its SLA, which the team routes to the on-call data engineer instead of letting the run drag silently into business hours.
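One way such a fleet could be expressed is a single module block with for_each over a job catalogue. This is a sketch under assumptions: the catalogue contents, script names, and worker counts are invented, and aws_iam_role.glue_etl is assumed to be defined elsewhere in the configuration.

```hcl
locals {
  # Hypothetical job catalogue: key = dataset, value = what differs per job.
  curated_jobs = {
    orders      = { script = "orders_curated.py", workers = 10 }
    inventory   = { script = "inventory_curated.py", workers = 4 }
    clickstream = { script = "clickstream_curated.py", workers = 20 }
  }
}

module "curated" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
  for_each = local.curated_jobs

  name            = "${each.key}-curated-daily"
  role_arn        = aws_iam_role.glue_etl.arn # assumed to exist elsewhere
  script_location = "s3://kv-data-platform-artifacts/scripts/${each.value.script}"
  temp_bucket     = "kv-data-platform-glue-temp"

  # Fleet-wide standards from the scenario, stated once.
  worker_type        = "G.2X"
  number_of_workers  = each.value.workers
  timeout_minutes    = 90
  log_retention_days = 90
  extra_arguments    = { "--enable-auto-scaling" = "true" }

  tags = {
    Environment = "prod"
    Team        = "data-platform"
    Pipeline    = "${each.key}-curated"
  }
}

# Job names keyed by dataset, ready to feed CONDITIONAL triggers downstream.
output "curated_job_names" {
  value = { for dataset, job in module.curated : dataset => job.name }
}
```

Adding a pipeline then means adding one entry to curated_jobs, and the shared settings cannot drift between jobs because they appear only once.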
Best practices
- Pin glue_version explicitly and test before bumping. Moving 3.0 → 4.0 → 5.0 changes the Spark/Python runtime and bundled connectors; an unpinned or surprise upgrade can break scripts at 2 a.m. Pin it in the module and roll forward deliberately.
- Right-size workers and enforce a timeout. Glue bills per DPU-hour, so over-provisioning number_of_workers or omitting timeout_minutes directly burns money. Start small (G.1X, 2–10 workers), enable --enable-auto-scaling, and keep the hard timeout as a runaway-cost backstop.
- Keep job bookmarks on for incremental ETL, and reset them deliberately. bookmark_enabled = true stops jobs from rescanning the whole bucket every run. Just remember that a backfill needs an explicit bookmark reset, and that source data must be append-only for bookmarks to be correct.
- Scope the IAM role tightly and encrypt via a security configuration. Give the role_arn only the specific S3 prefixes and Catalog/KMS actions it needs (never s3:*), and attach a security_configuration so CloudWatch logs, S3 outputs, and shuffle data are encrypted with your CMK.
- Standardize naming and tags for cost allocation. Use a consistent <domain>-<dataset>-<frequency> job name and always set Environment, Team, and Pipeline tags so Glue spend shows up cleanly in Cost Explorer and jobs are discoverable in the console.
- Turn on metrics, continuous logging, and the Spark UI from day one. Leave continuous_logging and enable_spark_ui on so failures are debuggable in real time rather than only post-mortem; pair them with a sensible log_retention_days so logs don’t accumulate cost indefinitely.
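For the encryption practice above, the module only takes the name of an existing security configuration, so that resource has to be created alongside it. A minimal sketch, assuming a dedicated customer-managed key (the resource and configuration names are hypothetical, and the key policy needed for CloudWatch Logs to use the key is not shown):

```hcl
# Hypothetical customer-managed key for Glue data, logs, and bookmarks.
resource "aws_kms_key" "glue" {
  description         = "CMK for Glue job outputs, logs, and bookmarks"
  enable_key_rotation = true
}

resource "aws_glue_security_configuration" "etl" {
  name = "kv-glue-etl-encryption" # assumed name

  encryption_configuration {
    cloudwatch_encryption {
      cloudwatch_encryption_mode = "SSE-KMS"
      kms_key_arn                = aws_kms_key.glue.arn
    }

    job_bookmarks_encryption {
      job_bookmarks_encryption_mode = "CSE-KMS"
      kms_key_arn                   = aws_kms_key.glue.arn
    }

    s3_encryption {
      s3_encryption_mode = "SSE-KMS"
      kms_key_arn        = aws_kms_key.glue.arn
    }
  }
}

module "glue_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"

  name            = "..."
  role_arn        = "..."
  script_location = "..."
  temp_bucket     = "..."

  # Attach the configuration by name; the job role also needs permission to use the key.
  security_configuration = aws_glue_security_configuration.etl.name
}
```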