Terraform Module: AWS Glue Job — repeatable, governed ETL jobs

Quick take — Provision AWS Glue ETL jobs with Terraform: a reusable hashicorp/aws ~> 5.0 module wrapping aws_glue_job with worker sizing, job bookmarks, retries, timeouts, and CloudWatch logging baked in. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "glue_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"

  name            = "..."  # Glue job name; also used for log group and trigger.
  role_arn        = "..."  # IAM role ARN Glue assumes to run the job.
  script_location = "..."  # S3 URI of the job script.
  temp_bucket     = "..."  # S3 bucket (no s3://) for TempDir and Spark logs.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS Glue is a serverless data-integration service, and a Glue Job is the unit of work that actually runs your ETL: a PySpark, Scala Spark, Python Shell, or Ray script that reads from sources (S3, JDBC, the Glue Data Catalog), transforms data, and writes it back out. You point a job at a script in S3, choose a worker type and a number of workers, hand it an IAM role, and Glue spins up an ephemeral Spark cluster on demand — you never manage the infrastructure underneath.

The trouble is that a correct production Glue job is more than one resource. You need the right glue_version pinned (mixing 3.0 and 5.0 semantics silently breaks scripts), a worker type and count that match your data volume, job bookmarks so incremental runs don’t reprocess everything, sane max_retries/timeout/max_concurrent_runs so a stuck job doesn’t burn DPU-hours all weekend, and continuous CloudWatch logging plus Spark UI so you can actually debug failures. Hand-rolling all of that per job leads to drift — one team forgets bookmarks, another sets a 48-hour timeout, a third leaves logging off and flies blind.

Wrapping aws_glue_job in a reusable module fixes that. The module bakes in the production defaults (G.1X workers, bookmarks enabled, a 60-minute timeout, one retry, continuous logging) while exposing every knob as a variable, so every job in your estate is consistent, tagged, and observable, and a new pipeline is a 15-line module block instead of 80 lines of copy-pasted HCL.

When to use it

If you only ever need one throwaway job, the raw resource is fine. The module pays off the moment you have a fleet to keep consistent.
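
For a fleet, one way to keep jobs uniform is a single module block driven by for_each, with only the per-job differences listed in a map. This is a minimal sketch, not part of the module itself: the job names, script paths, bucket name, and the aws_iam_role.glue_etl reference are all assumptions made for illustration.

```hcl
# Hypothetical fleet definition: one entry per job, listing only what differs.
locals {
  glue_jobs = {
    "orders-curated-daily" = {
      script_location   = "s3://example-artifacts/scripts/orders_curated.py" # assumed path
      number_of_workers = 10
    }
    "inventory-curated-daily" = {
      script_location   = "s3://example-artifacts/scripts/inventory_curated.py" # assumed path
      number_of_workers = 4
    }
  }
}

module "glue_jobs" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
  for_each = local.glue_jobs

  name              = each.key
  role_arn          = aws_iam_role.glue_etl.arn # assumed to be defined elsewhere
  script_location   = each.value.script_location
  number_of_workers = each.value.number_of_workers
  temp_bucket       = "example-glue-temp" # assumed bucket name

  # Everything not set here falls back to the module defaults.
  tags = {
    Team = "data-platform"
  }
}
```

Each job is then addressable as module.glue_jobs["orders-curated-daily"], and adding a pipeline is one more map entry.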

Module structure

terraform-module-aws-glue-job/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # aws_glue_job + optional trigger + log group
├── variables.tf     # all inputs, validated
└── outputs.tf       # id/name/arn + role + log group

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Continuous logging needs a known log group so we can grant it and read it.
  log_group_name = coalesce(var.log_group_name, "/aws-glue/jobs/${var.name}")

  # Default arguments AWS expects, merged with user overrides. User values win.
  default_arguments = merge(
    {
      "--job-language"                     = var.job_language
      "--job-bookmark-option"              = var.bookmark_enabled ? "job-bookmark-enable" : "job-bookmark-disable"
      "--enable-metrics"                   = "true"
      "--enable-observability-metrics"     = "true"
      "--enable-continuous-cloudwatch-log" = tostring(var.continuous_logging)
      "--continuous-log-logGroup"          = local.log_group_name
      "--enable-spark-ui"                  = tostring(var.enable_spark_ui)
      "--TempDir"                          = "s3://${var.temp_bucket}/glue-temp/${var.name}/"
    },
    var.enable_spark_ui ? {
      "--spark-event-logs-path" = "s3://${var.temp_bucket}/glue-spark-logs/${var.name}/"
    } : {},
    var.extra_arguments
  )
}

resource "aws_cloudwatch_log_group" "this" {
  count             = var.create_log_group ? 1 : 0
  name              = local.log_group_name
  retention_in_days = var.log_retention_days
  tags              = var.tags
}

resource "aws_glue_job" "this" {
  name                   = var.name
  description            = var.description
  role_arn               = var.role_arn
  glue_version           = var.glue_version
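  # Python Shell jobs are sized by max_capacity, which conflicts with worker settings.
  worker_type            = var.command_name == "pythonshell" ? null : var.worker_type
  number_of_workers      = var.command_name == "pythonshell" ? null : var.number_of_workers
  max_retries            = var.max_retries
  timeout                = var.timeout_minutes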
  security_configuration = var.security_configuration
  connections            = var.connections
  default_arguments      = local.default_arguments
  tags                   = var.tags

  command {
    name            = var.command_name
    script_location = var.script_location
    python_version  = var.command_name == "pythonshell" ? var.python_version : null
  }

  execution_property {
    max_concurrent_runs = var.max_concurrent_runs
  }

  dynamic "notification_property" {
    for_each = var.notify_delay_after_minutes != null ? [1] : []
    content {
      notify_delay_after = var.notify_delay_after_minutes
    }
  }

  # Python Shell jobs are sized by max_capacity, so fail early if it is missing.
  lifecycle {
    precondition {
      condition     = var.command_name != "pythonshell" || var.max_capacity != null
      error_message = "command_name = \"pythonshell\" requires max_capacity (0.0625 or 1.0), not workers."
    }
  }

  # Python Shell uses max_capacity instead of worker_type/number_of_workers.
  max_capacity = var.command_name == "pythonshell" ? var.max_capacity : null
}

# Optional schedule trigger so the module can stand up a self-contained pipeline.
resource "aws_glue_trigger" "schedule" {
  count    = var.schedule_expression != null ? 1 : 0
  name     = "${var.name}-schedule"
  type     = "SCHEDULED"
  schedule = var.schedule_expression
  enabled  = var.schedule_enabled
  tags     = var.tags

  actions {
    job_name = aws_glue_job.this.name
    timeout  = var.timeout_minutes
  }
}
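
The merge order in the locals block is what lets extra_arguments override the baked-in defaults: merge() resolves duplicate keys in favor of the later argument. The standalone sketch below illustrates that behavior with stand-in local names (baked_in, caller_overrides, and resolved are illustrative, not the module's own), and the S3 path is an assumed placeholder.

```hcl
locals {
  # Stand-in for the defaults the module sets.
  baked_in = {
    "--job-bookmark-option" = "job-bookmark-enable"
    "--enable-metrics"      = "true"
  }

  # Stand-in for var.extra_arguments supplied by the caller.
  caller_overrides = {
    "--job-bookmark-option" = "job-bookmark-pause"
    "--source_path"         = "s3://example-raw/orders/" # assumed path
  }

  # Later maps win on key collisions, so the caller's bookmark option
  # replaces the default while untouched defaults pass through.
  resolved = merge(local.baked_in, local.caller_overrides)
}

output "resolved_arguments" {
  value = local.resolved
}
```

Here resolved carries "--job-bookmark-option" = "job-bookmark-pause" alongside the unchanged "--enable-metrics" default and the caller's extra "--source_path" key.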

variables.tf

variable "name" {
  description = "Name of the Glue job. Used for the job, its log group, and the trigger."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9_.-]{1,255}$", var.name))
    error_message = "name must be 1-255 chars: letters, digits, '_', '.', '-' only."
  }
}

variable "description" {
  description = "Human-readable description of what the job does."
  type        = string
  default     = null
}

variable "role_arn" {
  description = "ARN of the IAM role Glue assumes to run the job (needs Glue + S3 + source access)."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$", var.role_arn))
    error_message = "role_arn must be a valid IAM role ARN."
  }
}

variable "command_name" {
  description = "Job type: 'glueetl' (Spark), 'gluestreaming' (streaming), 'pythonshell', or 'glueray'."
  type        = string
  default     = "glueetl"

  validation {
    condition     = contains(["glueetl", "gluestreaming", "pythonshell", "glueray"], var.command_name)
    error_message = "command_name must be one of: glueetl, gluestreaming, pythonshell, glueray."
  }
}

variable "script_location" {
  description = "S3 URI of the job script, e.g. s3://my-bucket/scripts/etl.py."
  type        = string

  validation {
    condition     = can(regex("^s3://", var.script_location))
    error_message = "script_location must be an s3:// URI."
  }
}

variable "glue_version" {
  description = "Glue runtime version. 3.0, 4.0, and 5.0 run Spark 3.x; 2.0 runs Spark 2.4. The Python version for pythonshell jobs is set separately via python_version."
  type        = string
  default     = "5.0"

  validation {
    condition     = contains(["2.0", "3.0", "4.0", "5.0"], var.glue_version)
    error_message = "glue_version must be one of: 2.0, 3.0, 4.0, 5.0."
  }
}

variable "job_language" {
  description = "Script language passed as --job-language ('python' or 'scala')."
  type        = string
  default     = "python"

  validation {
    condition     = contains(["python", "scala"], var.job_language)
    error_message = "job_language must be 'python' or 'scala'."
  }
}

variable "python_version" {
  description = "Python version for pythonshell jobs ('3.9' recommended)."
  type        = string
  default     = "3.9"
}

variable "worker_type" {
  description = "Worker size for Spark/Ray jobs: Standard, G.1X, G.2X, G.4X, G.8X, or Z.2X."
  type        = string
  default     = "G.1X"

  validation {
    condition     = contains(["Standard", "G.1X", "G.2X", "G.4X", "G.8X", "Z.2X"], var.worker_type)
    error_message = "worker_type must be one of: Standard, G.1X, G.2X, G.4X, G.8X, Z.2X."
  }
}

variable "number_of_workers" {
  description = "Number of workers (DPUs) for Spark/Ray jobs. Minimum 2."
  type        = number
  default     = 2

  validation {
    condition     = var.number_of_workers >= 2 && var.number_of_workers <= 299
    error_message = "number_of_workers must be between 2 and 299."
  }
}

variable "max_capacity" {
  description = "DPU capacity for pythonshell jobs only: 0.0625 or 1.0."
  type        = number
  default     = null

  validation {
    condition     = var.max_capacity == null || contains([0.0625, 1.0], var.max_capacity)
    error_message = "max_capacity (pythonshell) must be 0.0625 or 1.0."
  }
}

variable "max_retries" {
  description = "How many times Glue retries the job on failure."
  type        = number
  default     = 1

  validation {
    condition     = var.max_retries >= 0 && var.max_retries <= 10
    error_message = "max_retries must be between 0 and 10."
  }
}

variable "timeout_minutes" {
  description = "Job timeout in minutes. Hard cap to stop runaway DPU spend."
  type        = number
  default     = 60

  validation {
    condition     = var.timeout_minutes >= 1 && var.timeout_minutes <= 7200
    error_message = "timeout_minutes must be between 1 and 7200 (5 days)."
  }
}

variable "max_concurrent_runs" {
  description = "Max simultaneous runs of this job."
  type        = number
  default     = 1

  validation {
    condition     = var.max_concurrent_runs >= 1 && var.max_concurrent_runs <= 1000
    error_message = "max_concurrent_runs must be between 1 and 1000."
  }
}

variable "bookmark_enabled" {
  description = "Enable job bookmarks for incremental processing (skips already-read data)."
  type        = bool
  default     = true
}

variable "continuous_logging" {
  description = "Enable continuous CloudWatch logging during the run."
  type        = bool
  default     = true
}

variable "enable_spark_ui" {
  description = "Write Spark event logs to S3 for the Spark UI (Spark jobs only)."
  type        = bool
  default     = true
}

variable "temp_bucket" {
  description = "S3 bucket name (no s3://) for --TempDir and Spark event logs."
  type        = string
}

variable "connections" {
  description = "List of Glue connection names (e.g. for JDBC/VPC sources)."
  type        = list(string)
  default     = []
}

variable "security_configuration" {
  description = "Name of a Glue security configuration (for S3/CloudWatch encryption)."
  type        = string
  default     = null
}

variable "extra_arguments" {
  description = "Additional --key/value default arguments merged into the job (overrides defaults)."
  type        = map(string)
  default     = {}
}

variable "notify_delay_after_minutes" {
  description = "Emit a DELAYED notification if a run exceeds this many minutes. Null disables."
  type        = number
  default     = null
}

variable "create_log_group" {
  description = "Create and manage the CloudWatch log group for this job."
  type        = bool
  default     = true
}

variable "log_group_name" {
  description = "Override the log group name. Defaults to /aws-glue/jobs/<name>."
  type        = string
  default     = null
}

variable "log_retention_days" {
  description = "CloudWatch log retention in days."
  type        = number
  default     = 30
}

variable "schedule_expression" {
  description = "Optional cron(...) expression to create a SCHEDULED trigger. Null = no trigger."
  type        = string
  default     = null
}

variable "schedule_enabled" {
  description = "Whether the schedule trigger is active when created."
  type        = bool
  default     = true
}

variable "tags" {
  description = "Tags applied to the job, trigger, and log group."
  type        = map(string)
  default     = {}
}

outputs.tf

output "id" {
  description = "The Glue job ID (same as its name)."
  value       = aws_glue_job.this.id
}

output "name" {
  description = "The name of the Glue job — use to wire up triggers/workflows/Step Functions."
  value       = aws_glue_job.this.name
}

output "arn" {
  description = "The ARN of the Glue job."
  value       = aws_glue_job.this.arn
}

output "role_arn" {
  description = "The IAM role ARN the job runs as."
  value       = aws_glue_job.this.role_arn
}

output "default_arguments" {
  description = "The resolved default arguments passed to the job (bookmarks, TempDir, etc.)."
  value       = aws_glue_job.this.default_arguments
}

output "log_group_name" {
  description = "The CloudWatch log group name for the job."
  value       = local.log_group_name
}

output "trigger_name" {
  description = "Name of the created schedule trigger, or null if none."
  value       = try(aws_glue_trigger.schedule[0].name, null)
}
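
The log_group_name output can feed monitoring that lives outside the module. The following is a rough sketch, assuming the job's logs are written to the group that output reports and that a module "glue_job" block exists in the same configuration; the metric name, namespace, and the plain "ERROR" pattern are illustrative choices, not something the module prescribes.

```hcl
# Count log lines containing ERROR in the job's log group.
resource "aws_cloudwatch_log_metric_filter" "glue_errors" {
  name           = "${module.glue_job.name}-errors"
  log_group_name = module.glue_job.log_group_name
  pattern        = "ERROR"

  metric_transformation {
    name      = "GlueJobErrorCount" # hypothetical metric name
    namespace = "Custom/Glue"       # hypothetical namespace
    value     = "1"
  }

  # The output is a plain string, so make the ordering on the log group explicit.
  depends_on = [module.glue_job]
}

# Alarm whenever at least one ERROR line is seen in a 5-minute window.
resource "aws_cloudwatch_metric_alarm" "glue_errors" {
  alarm_name          = "${module.glue_job.name}-errors"
  namespace           = "Custom/Glue"
  metric_name         = "GlueJobErrorCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  depends_on = [aws_cloudwatch_log_metric_filter.glue_errors]
}
```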

How to use it

module "glue_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"

  name            = "orders-curated-daily"
  description     = "Curate raw orders from the lake into the analytics zone, daily."
  role_arn        = aws_iam_role.glue_etl.arn
  command_name    = "glueetl"
  glue_version    = "5.0"
  script_location = "s3://kv-data-platform-artifacts/scripts/orders_curated.py"

  worker_type       = "G.2X"
  number_of_workers = 10
  timeout_minutes   = 90
  max_retries       = 1

  bookmark_enabled = true
  temp_bucket      = "kv-data-platform-glue-temp"

  # Pass pipeline config straight to the script.
  extra_arguments = {
    "--source_path"         = "s3://kv-data-lake-raw/orders/"
    "--target_path"         = "s3://kv-data-lake-curated/orders/"
    "--enable-auto-scaling" = "true"
  }

  # Stand up the daily schedule from the same module.
  schedule_expression = "cron(0 2 * * ? *)"

  notify_delay_after_minutes = 60 # must be below timeout_minutes (90) to ever fire
  log_retention_days         = 90

  tags = {
    Environment = "prod"
    Team        = "data-platform"
    Pipeline    = "orders-curated"
  }
}

# Downstream: chain a second job after this one using the output job name.
resource "aws_glue_trigger" "after_curated" {
  name = "run-orders-agg-after-curated"
  type = "CONDITIONAL"

  predicate {
    conditions {
      job_name = module.glue_job.name # output consumed here
      state    = "SUCCEEDED"
    }
  }

  actions {
    job_name = "orders-aggregate-daily"
  }
}
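
The usage example above references aws_iam_role.glue_etl without defining it. A minimal sketch of such a role might look like the following; the role and policy names are hypothetical, the bucket list simply mirrors the buckets used in the example, and a real job would likely need tighter, source-specific permissions (JDBC secrets, KMS keys, and so on).

```hcl
locals {
  # Buckets the example job reads from or writes to.
  glue_buckets = [
    "kv-data-platform-artifacts",
    "kv-data-platform-glue-temp",
    "kv-data-lake-raw",
    "kv-data-lake-curated",
  ]
}

# Trust policy: let the Glue service assume the role.
data "aws_iam_policy_document" "glue_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "glue_etl" {
  name               = "glue-etl-orders" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}

# AWS-managed baseline permissions for Glue jobs.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_etl.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Object and listing access to the buckets above.
data "aws_iam_policy_document" "glue_s3" {
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = [for b in local.glue_buckets : "arn:aws:s3:::${b}/*"]
  }

  statement {
    actions   = ["s3:ListBucket"]
    resources = [for b in local.glue_buckets : "arn:aws:s3:::${b}"]
  }
}

resource "aws_iam_role_policy" "glue_s3" {
  name   = "glue-etl-s3" # hypothetical policy name
  role   = aws_iam_role.glue_etl.id
  policy = data.aws_iam_policy_document.glue_s3.json
}
```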

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config (live/terragrunt.hcl), inherited by every module:

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config (live/prod/glue_job/terragrunt.hcl):

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
}

inputs = {
  name            = "..."
  role_arn        = "..."
  script_location = "..."
  temp_bucket     = "..."
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root.
(cd live/prod/glue_job && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
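
As an illustration of per-environment overrides, a dev configuration might look like the sketch below. The live/dev/glue_job path, the sibling ../iam stack and its role_arn output, and every input value are assumptions made for the example; only the module source and input names come from this article.

```hcl
# live/dev/glue_job/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-job?ref=v1.0.0"
}

# Hypothetical sibling stack that creates the IAM role and exposes a role_arn output.
# Declaring it lets run-all apply the role before the job.
dependency "iam" {
  config_path = "../iam"
}

inputs = {
  name            = "orders-curated-daily-dev"
  role_arn        = dependency.iam.outputs.role_arn
  script_location = "s3://example-dev-artifacts/scripts/orders_curated.py" # assumed path
  temp_bucket     = "example-dev-glue-temp"                                # assumed bucket

  # Dev runs smaller and keeps logs for less time than prod.
  number_of_workers  = 2
  log_retention_days = 7
}
```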

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | - | yes | Glue job name; also used for log group and trigger. |
| description | string | null | no | Human-readable description of the job. |
| role_arn | string | - | yes | IAM role ARN Glue assumes to run the job. |
| command_name | string | "glueetl" | no | Job type: glueetl, gluestreaming, pythonshell, or glueray. |
| script_location | string | - | yes | S3 URI of the job script. |
| glue_version | string | "5.0" | no | Glue runtime version (2.0/3.0/4.0/5.0). |
| job_language | string | "python" | no | Script language: python or scala. |
| python_version | string | "3.9" | no | Python version for pythonshell jobs. |
| worker_type | string | "G.1X" | no | Worker size for Spark/Ray jobs. |
| number_of_workers | number | 2 | no | Number of workers (DPUs); 2–299. |
| max_capacity | number | null | no | DPU capacity for pythonshell only (0.0625 or 1.0). |
| max_retries | number | 1 | no | Retries on failure (0–10). |
| timeout_minutes | number | 60 | no | Hard job timeout in minutes (1–7200). |
| max_concurrent_runs | number | 1 | no | Max simultaneous runs (1–1000). |
| bookmark_enabled | bool | true | no | Enable job bookmarks for incremental processing. |
| continuous_logging | bool | true | no | Enable continuous CloudWatch logging. |
| enable_spark_ui | bool | true | no | Write Spark event logs to S3 for the Spark UI. |
| temp_bucket | string | - | yes | S3 bucket (no s3://) for TempDir and Spark logs. |
| connections | list(string) | [] | no | Glue connection names for JDBC/VPC sources. |
| security_configuration | string | null | no | Glue security configuration name for encryption. |
| extra_arguments | map(string) | {} | no | Extra default arguments merged into the job. |
| notify_delay_after_minutes | number | null | no | Emit a DELAYED notification past this runtime. |
| create_log_group | bool | true | no | Create and manage the CloudWatch log group. |
| log_group_name | string | null | no | Override log group name (default /aws-glue/jobs/<name>). |
| log_retention_days | number | 30 | no | CloudWatch log retention in days. |
| schedule_expression | string | null | no | cron(...) expression to create a SCHEDULED trigger. |
| schedule_enabled | bool | true | no | Whether the schedule trigger is active. |
| tags | map(string) | {} | no | Tags applied to job, trigger, and log group. |

Outputs

| Name | Description |
|---|---|
| id | The Glue job ID (same as its name). |
| name | Job name; wire up triggers, workflows, or Step Functions. |
| arn | The ARN of the Glue job. |
| role_arn | The IAM role ARN the job runs as. |
| default_arguments | Resolved default arguments (bookmarks, TempDir, flags). |
| log_group_name | CloudWatch log group name for the job. |
| trigger_name | Name of the created schedule trigger, or null. |

Enterprise scenario

A retail analytics team runs a nightly lakehouse refresh: roughly 40 Glue jobs curate raw S3 data (orders, inventory, clickstream) into Iceberg tables, then build aggregate marts on top. By standardizing every job on this module, they apply G.2X workers with auto-scaling, bookmarks, a 90-minute timeout, and 90-day log retention uniformly, set once rather than 40 times. Curated jobs feed their name output into CONDITIONAL Glue triggers so aggregation starts only after the upstream job has SUCCEEDED, and notify_delay_after_minutes raises a delay notification as soon as a run exceeds its SLA, so the on-call data engineer can be paged before the run drags into business hours.
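
The delay notification only reaches a person if something routes it. One possible wiring, sketched here under assumptions, is an EventBridge rule that forwards the job's delay events to an SNS topic. The topic name is hypothetical, a module "glue_job" block is assumed to exist, and the "Glue Job Run Status" detail-type reflects my reading of the AWS Glue event documentation for delay notifications, so it is worth verifying against a real event before relying on it.

```hcl
resource "aws_sns_topic" "glue_oncall" {
  name = "glue-oncall-alerts" # hypothetical topic name
}

# Match delay notifications for this one job.
resource "aws_cloudwatch_event_rule" "glue_delay" {
  name        = "${module.glue_job.name}-delayed"
  description = "Runs that exceed notify_delay_after_minutes."

  event_pattern = jsonencode({
    source        = ["aws.glue"]
    "detail-type" = ["Glue Job Run Status"] # assumed detail-type for delay events
    detail = {
      jobName = [module.glue_job.name]
    }
  })
}

resource "aws_cloudwatch_event_target" "glue_delay_sns" {
  rule = aws_cloudwatch_event_rule.glue_delay.name
  arn  = aws_sns_topic.glue_oncall.arn
}

# Allow EventBridge to publish to the topic.
data "aws_iam_policy_document" "allow_events" {
  statement {
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.glue_oncall.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "glue_oncall" {
  arn    = aws_sns_topic.glue_oncall.arn
  policy = data.aws_iam_policy_document.allow_events.json
}
```

Subscribing a paging tool or email endpoint to the topic is left out, since that choice depends on the team's alerting stack.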

Best practices
