IaC AWS

Terraform Module: AWS Glue Crawler — schema discovery that keeps your Data Catalog in sync

Quick take — A reusable hashicorp/aws Terraform module for AWS Glue Crawler: var-driven S3/JDBC targets, schedule, classifiers, and schema-change policy so your Glue Data Catalog stays current without click-ops. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "glue_crawler" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-crawler?ref=v1.0.0"

  name          = "..."  # Crawler name; also used to derive the IAM role name.
  database_name = "..."  # Glue Data Catalog database where discovered tables are …
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

An AWS Glue Crawler connects to a data store (S3 prefix, JDBC database, DynamoDB table, Delta Lake, Iceberg, or Hudi tables), walks the data, infers a schema using built-in or custom classifiers, and writes or updates the resulting table definitions in a Glue Data Catalog database. That catalog is what Athena, Redshift Spectrum, EMR, and Glue ETL jobs query against — so the crawler is the thing that keeps “what’s actually in S3” and “what the query engine thinks is there” from drifting apart.

The raw aws_glue_crawler resource is deceptively fiddly in production: it needs an IAM role with both Glue service permissions and read access to the target data, a schema_change_policy so a stray column rename doesn’t silently delete partitions, a recrawl_policy to avoid re-scanning terabytes on every run, table-level configuration JSON for partition-grouping behaviour, and usually a cron schedule. Wrapping all of that in a module means every data domain gets the same safe defaults — LOG instead of DELETE_FROM_DATABASE, CRAWL_EVENT_MODE where S3 event notifications exist, consistent table_prefix naming — instead of each team hand-rolling a crawler in the console and forgetting half of it.

When to use it

If you only ever query a single, static S3 file with a fixed schema, skip the crawler and define the table directly with aws_glue_catalog_table — a crawler is overkill there.

Module structure

terraform-module-aws-glue-crawler/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # aws_glue_crawler + optional IAM role/policy
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # id / name / arn + role arn

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # When create_role = true we build a least-privilege-ish role for the crawler,
  # otherwise the caller passes an existing role ARN.
  role_arn = var.create_role ? aws_iam_role.this[0].arn : var.iam_role_arn

  # Glue expects crawler "configuration" as a JSON string. We only emit it when
  # the caller asks for a non-default partition/grouping behaviour.
  configuration = var.configuration_version == null ? null : jsonencode({
    Version = var.configuration_version
    Grouping = {
      TableGroupingPolicy = var.table_grouping_policy
    }
    CrawlerOutput = {
      Partitions = {
        AddOrUpdateBehavior = var.partitions_add_or_update_behavior
      }
      Tables = {
        AddOrUpdateBehavior = "MergeNewColumns"
      }
    }
  })
}

# ---------------------------------------------------------------------------
# Optional IAM role for the crawler
# ---------------------------------------------------------------------------
data "aws_iam_policy_document" "assume" {
  count = var.create_role ? 1 : 0

  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "this" {
  count = var.create_role ? 1 : 0

  name                 = "${var.name}-crawler-role"
  assume_role_policy   = data.aws_iam_policy_document.assume[0].json
  permissions_boundary = var.permissions_boundary_arn
  tags                 = var.tags
}

# Managed policy that grants the Glue control-plane permissions a crawler needs.
resource "aws_iam_role_policy_attachment" "service" {
  count = var.create_role ? 1 : 0

  role       = aws_iam_role.this[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Scoped read access to the S3 targets so the crawler can actually read the data.
data "aws_iam_policy_document" "s3_read" {
  count = var.create_role && length(var.s3_target_paths) > 0 ? 1 : 0

  statement {
    sid     = "ListBuckets"
    effect  = "Allow"
    actions = ["s3:GetBucketLocation", "s3:ListBucket"]
    resources = distinct([
      for p in var.s3_target_paths : "arn:aws:s3:::${split("/", replace(p, "s3://", ""))[0]}"
    ])
  }

  statement {
    sid       = "GetObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = [for p in var.s3_target_paths : "${replace(p, "s3://", "arn:aws:s3:::")}*"]
  }
}

resource "aws_iam_role_policy" "s3_read" {
  count = var.create_role && length(var.s3_target_paths) > 0 ? 1 : 0

  name   = "${var.name}-s3-read"
  role   = aws_iam_role.this[0].id
  policy = data.aws_iam_policy_document.s3_read[0].json
}

# ---------------------------------------------------------------------------
# The crawler
# ---------------------------------------------------------------------------
resource "aws_glue_crawler" "this" {
  name          = var.name
  role          = local.role_arn
  database_name = var.database_name
  description   = var.description

  table_prefix  = var.table_prefix
  schedule      = var.schedule
  classifiers   = var.classifiers
  configuration = local.configuration

  dynamic "s3_target" {
    for_each = var.s3_target_paths
    content {
      path                = s3_target.value
      exclusions          = var.s3_exclusions
      sample_size         = var.s3_sample_size
      connection_name     = var.s3_connection_name
      event_queue_arn     = var.event_queue_arn
      dlq_event_queue_arn = var.dlq_event_queue_arn
    }
  }

  dynamic "jdbc_target" {
    for_each = var.jdbc_targets
    content {
      connection_name = jdbc_target.value.connection_name
      path            = jdbc_target.value.path
      exclusions      = lookup(jdbc_target.value, "exclusions", null)
    }
  }

  dynamic "catalog_target" {
    for_each = var.catalog_targets
    content {
      database_name = catalog_target.value.database_name
      tables        = catalog_target.value.tables
    }
  }

  recrawl_policy {
    recrawl_behavior = var.recrawl_behavior
  }

  schema_change_policy {
    update_behavior = var.schema_update_behavior
    delete_behavior = var.schema_delete_behavior
  }

  # Lineage / event-mode crawls require these blocks only when enabled.
  dynamic "lineage_configuration" {
    for_each = var.lineage_enabled ? [1] : []
    content {
      crawler_lineage_settings = "ENABLE"
    }
  }

  tags = var.tags
}

variables.tf

variable "name" {
  description = "Name of the Glue crawler (also used to derive the IAM role name)."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9_-]{1,255}$", var.name))
    error_message = "name must be 1-255 chars of letters, digits, hyphen or underscore."
  }
}

variable "database_name" {
  description = "Glue Data Catalog database where discovered tables are written."
  type        = string
}

variable "description" {
  description = "Free-text description shown in the Glue console."
  type        = string
  default     = null
}

variable "table_prefix" {
  description = "Prefix prepended to every table the crawler creates (e.g. 'raw_')."
  type        = string
  default     = null
}

variable "schedule" {
  description = "Cron expression in AWS Glue format, e.g. 'cron(0 2 * * ? *)'. Null = on-demand only."
  type        = string
  default     = null

  validation {
    condition     = var.schedule == null || can(regex("^cron\\(.+\\)$", var.schedule))
    error_message = "schedule must be a Glue cron() expression, e.g. cron(0 2 * * ? *)."
  }
}

# ----- IAM ------------------------------------------------------------------
variable "create_role" {
  description = "Create an IAM role for the crawler. If false, supply iam_role_arn."
  type        = bool
  default     = true
}

variable "iam_role_arn" {
  description = "Existing IAM role ARN to use when create_role = false."
  type        = string
  default     = null

  validation {
    condition     = var.iam_role_arn == null || can(regex("^arn:aws[a-z-]*:iam::", var.iam_role_arn))
    error_message = "iam_role_arn must be a valid IAM role ARN."
  }
}

variable "permissions_boundary_arn" {
  description = "Optional permissions boundary applied to the created role."
  type        = string
  default     = null
}

# ----- S3 targets -----------------------------------------------------------
variable "s3_target_paths" {
  description = "List of s3:// paths to crawl, e.g. ['s3://my-lake/bronze/orders/']."
  type        = list(string)
  default     = []

  validation {
    condition     = alltrue([for p in var.s3_target_paths : startswith(p, "s3://")])
    error_message = "Every s3_target_paths entry must start with s3://."
  }
}

variable "s3_exclusions" {
  description = "Glob patterns to exclude from S3 crawls, e.g. ['**/_temporary/**']."
  type        = list(string)
  default     = []
}

variable "s3_sample_size" {
  description = "Number of files per leaf folder to sample (1-249). Null = crawl all files."
  type        = number
  default     = null

  validation {
    condition     = var.s3_sample_size == null || (var.s3_sample_size >= 1 && var.s3_sample_size <= 249)
    error_message = "s3_sample_size must be between 1 and 249."
  }
}

variable "s3_connection_name" {
  description = "Glue connection name for S3 targets reachable only via VPC (e.g. S3 access points)."
  type        = string
  default     = null
}

variable "event_queue_arn" {
  description = "SQS queue ARN for S3 event-driven crawls (CRAWL_EVENT_MODE)."
  type        = string
  default     = null
}

variable "dlq_event_queue_arn" {
  description = "Dead-letter SQS queue ARN for event-driven crawls."
  type        = string
  default     = null
}

# ----- JDBC / catalog targets ----------------------------------------------
variable "jdbc_targets" {
  description = "JDBC targets: list of objects with connection_name, path and optional exclusions."
  type = list(object({
    connection_name = string
    path            = string
    exclusions      = optional(list(string))
  }))
  default = []
}

variable "catalog_targets" {
  description = "Catalog targets for crawling existing tables: database_name + list of table names."
  type = list(object({
    database_name = string
    tables        = list(string)
  }))
  default = []
}

# ----- Behaviour / policy ---------------------------------------------------
variable "classifiers" {
  description = "Ordered list of custom classifier names to apply before built-ins."
  type        = list(string)
  default     = []
}

variable "recrawl_behavior" {
  description = "CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, or CRAWL_EVENT_MODE."
  type        = string
  default     = "CRAWL_EVERYTHING"

  validation {
    condition     = contains(["CRAWL_EVERYTHING", "CRAWL_NEW_FOLDERS_ONLY", "CRAWL_EVENT_MODE"], var.recrawl_behavior)
    error_message = "recrawl_behavior must be CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, or CRAWL_EVENT_MODE."
  }
}

variable "schema_update_behavior" {
  description = "How to handle schema changes: UPDATE_IN_DATABASE or LOG."
  type        = string
  default     = "UPDATE_IN_DATABASE"

  validation {
    condition     = contains(["UPDATE_IN_DATABASE", "LOG"], var.schema_update_behavior)
    error_message = "schema_update_behavior must be UPDATE_IN_DATABASE or LOG."
  }
}

variable "schema_delete_behavior" {
  description = "How to handle objects deleted from the data store: LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE."
  type        = string
  default     = "LOG"

  validation {
    condition     = contains(["LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"], var.schema_delete_behavior)
    error_message = "schema_delete_behavior must be LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE."
  }
}

variable "configuration_version" {
  description = "Crawler configuration JSON version (e.g. 1.0). Null = omit configuration block."
  type        = number
  default     = null
}

variable "table_grouping_policy" {
  description = "CombineCompatibleSchemas to merge similar folders into one table, or null."
  type        = string
  default     = "CombineCompatibleSchemas"
}

variable "partitions_add_or_update_behavior" {
  description = "Partition output behaviour, typically InheritFromTable."
  type        = string
  default     = "InheritFromTable"
}

variable "lineage_enabled" {
  description = "Enable crawler lineage tracking."
  type        = bool
  default     = false
}

variable "tags" {
  description = "Tags applied to the crawler and any created IAM role."
  type        = map(string)
  default     = {}
}

outputs.tf

output "id" {
  description = "The crawler name, which is its Terraform/AWS identifier."
  value       = aws_glue_crawler.this.id
}

output "name" {
  description = "Name of the Glue crawler."
  value       = aws_glue_crawler.this.name
}

output "arn" {
  description = "ARN of the Glue crawler."
  value       = aws_glue_crawler.this.arn
}

output "database_name" {
  description = "Glue Data Catalog database the crawler populates."
  value       = aws_glue_crawler.this.database_name
}

output "role_arn" {
  description = "ARN of the IAM role used by the crawler."
  value       = local.role_arn
}

output "role_name" {
  description = "Name of the created IAM role, or null when an existing role was supplied."
  value       = try(aws_iam_role.this[0].name, null)
}

How to use it

# A Glue database to hold the discovered tables (a dependency the crawler needs).
resource "aws_glue_catalog_database" "bronze" {
  name = "lake_bronze"
}

module "glue_crawler" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-crawler?ref=v1.0.0"

  name          = "bronze-orders-crawler"
  database_name = aws_glue_catalog_database.bronze.name
  description   = "Discovers daily order drops in the bronze layer"
  table_prefix  = "raw_"

  # Crawl every night at 02:00 UTC.
  schedule = "cron(0 2 * * ? *)"

  s3_target_paths = [
    "s3://acme-data-lake/bronze/orders/",
    "s3://acme-data-lake/bronze/customers/",
  ]
  s3_exclusions = ["**/_temporary/**", "**/_SUCCESS"]

  # Only scan folders added since the last run — cheap incremental crawls.
  recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY"

  # Never delete catalog tables when files disappear; just mark them deprecated.
  schema_update_behavior = "UPDATE_IN_DATABASE"
  schema_delete_behavior = "DEPRECATE_IN_DATABASE"

  configuration_version = 1.0

  tags = {
    Environment = "prod"
    DataLayer   = "bronze"
    Team        = "data-platform"
  }
}

# Downstream: grant an Athena workgroup's role read access to whatever role the
# crawler runs as is not needed, but you DO commonly trigger an ETL job after a
# successful crawl. A Glue workflow trigger keys off the crawler by name:
resource "aws_glue_trigger" "after_crawl" {
  name = "start-etl-after-bronze-crawl"
  type = "CONDITIONAL"

  predicate {
    conditions {
      crawler_name = module.glue_crawler.name # output consumed here
      crawl_state  = "SUCCEEDED"
    }
  }

  actions {
    job_name = "bronze-to-silver-transform"
  }
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root configlive/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module configlive/prod/glue_crawler/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-crawler?ref=v1.0.0"
}

inputs = {
  name = "..."
  database_name = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/glue_crawler && terragrunt apply        # this module
terragrunt run-all apply                      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.

Inputs

Name Type Default Required Description
name string yes Crawler name; also used to derive the IAM role name.
database_name string yes Glue Data Catalog database where discovered tables are written.
description string null no Free-text description shown in the console.
table_prefix string null no Prefix prepended to every created table (e.g. raw_).
schedule string null no Glue cron expression; null means on-demand only.
create_role bool true no Create an IAM role for the crawler; if false, supply iam_role_arn.
iam_role_arn string null no Existing role ARN used when create_role = false.
permissions_boundary_arn string null no Permissions boundary for the created role.
s3_target_paths list(string) [] no List of s3:// paths to crawl.
s3_exclusions list(string) [] no Glob patterns to exclude from S3 crawls.
s3_sample_size number null no Files per leaf folder to sample (1–249); null crawls all.
s3_connection_name string null no Glue connection for VPC-only S3 targets.
event_queue_arn string null no SQS queue ARN for event-driven crawls.
dlq_event_queue_arn string null no Dead-letter SQS queue ARN for event-driven crawls.
jdbc_targets list(object) [] no JDBC targets: connection_name, path, optional exclusions.
catalog_targets list(object) [] no Existing-table targets: database_name + tables.
classifiers list(string) [] no Ordered custom classifier names applied before built-ins.
recrawl_behavior string CRAWL_EVERYTHING no CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, or CRAWL_EVENT_MODE.
schema_update_behavior string UPDATE_IN_DATABASE no UPDATE_IN_DATABASE or LOG.
schema_delete_behavior string LOG no LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE.
configuration_version number null no Configuration JSON version (e.g. 1.0); null omits the block.
table_grouping_policy string CombineCompatibleSchemas no Merge compatible folders into one table, or null.
partitions_add_or_update_behavior string InheritFromTable no Partition output behaviour.
lineage_enabled bool false no Enable crawler lineage tracking.
tags map(string) {} no Tags applied to the crawler and created role.

Outputs

Name Description
id The crawler name, which is its Terraform/AWS identifier.
name Name of the Glue crawler.
arn ARN of the Glue crawler.
database_name Glue Data Catalog database the crawler populates.
role_arn ARN of the IAM role used by the crawler.
role_name Name of the created IAM role, or null when an existing role was supplied.

Enterprise scenario

A retail analytics team runs a bronze/silver/gold lake on S3. Each night, ingestion jobs drop new date-partitioned Parquet under s3://acme-data-lake/bronze/<feed>/dt=YYYY-MM-DD/. They deploy one instance of this module per feed with recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY" and schema_delete_behavior = "DEPRECATE_IN_DATABASE", so new partitions appear in Athena within minutes of landing while an accidental upstream backfill that removes a day’s files can never silently delete a production table. A CONDITIONAL Glue trigger references each crawler’s name output to kick off the bronze-to-silver ETL only after the crawl reports SUCCEEDED, giving them a fully code-reviewed, dependency-ordered pipeline.

Best practices

TerraformAWSGlue CrawlerModuleIaC
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading