Quick take — A reusable hashicorp/aws Terraform module for AWS Glue Crawler: var-driven S3/JDBC targets, schedule, classifiers, and schema-change policy so your Glue Data Catalog stays current without click-ops. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "glue_crawler" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-crawler?ref=v1.0.0"
name = "..." # Crawler name; also used to derive the IAM role name.
database_name = "..." # Glue Data Catalog database where discovered tables are …
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
An AWS Glue Crawler connects to a data store (S3 prefix, JDBC database, DynamoDB table, Delta Lake, Iceberg, or Hudi tables), walks the data, infers a schema using built-in or custom classifiers, and writes or updates the resulting table definitions in a Glue Data Catalog database. That catalog is what Athena, Redshift Spectrum, EMR, and Glue ETL jobs query against — so the crawler is the thing that keeps “what’s actually in S3” and “what the query engine thinks is there” from drifting apart.
The raw aws_glue_crawler resource is deceptively fiddly in production: it needs an IAM role with both Glue service permissions and read access to the target data, a schema_change_policy so a stray column rename doesn’t silently delete partitions, a recrawl_policy to avoid re-scanning terabytes on every run, table-level configuration JSON for partition-grouping behaviour, and usually a cron schedule. Wrapping all of that in a module means every data domain gets the same safe defaults — LOG instead of DELETE_FROM_DATABASE, CRAWL_EVENT_MODE where S3 event notifications exist, consistent table_prefix naming — instead of each team hand-rolling a crawler in the console and forgetting half of it.
When to use it
- You land raw or curated data in S3 (Parquet/JSON/CSV) and need Athena/Spectrum tables created and partitions kept up to date automatically.
- You want schema discovery as code so a new data feed is a
moduleblock in a PR, reviewed and versioned, not a console wizard. - You are standing up a lake-house pattern (bronze/silver/gold) and need one crawler per layer pointed at different prefixes, all with identical guardrails.
- You need to crawl an on-prem or RDS database over a Glue JDBC connection on a schedule and surface it in the catalog.
- You care about cost and reliability: incremental crawls (
recrawl_policy), event-driven crawls instead of full re-scans, and a non-destructive schema-change policy.
If you only ever query a single, static S3 file with a fixed schema, skip the crawler and define the table directly with aws_glue_catalog_table — a crawler is overkill there.
Module structure
terraform-module-aws-glue-crawler/
├── versions.tf # provider + Terraform version pins
├── main.tf # aws_glue_crawler + optional IAM role/policy
├── variables.tf # var-driven inputs with validation
└── outputs.tf # id / name / arn + role arn
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
# When create_role = true we build a least-privilege-ish role for the crawler,
# otherwise the caller passes an existing role ARN.
role_arn = var.create_role ? aws_iam_role.this[0].arn : var.iam_role_arn
# Glue expects crawler "configuration" as a JSON string. We only emit it when
# the caller asks for a non-default partition/grouping behaviour.
configuration = var.configuration_version == null ? null : jsonencode({
Version = var.configuration_version
Grouping = {
TableGroupingPolicy = var.table_grouping_policy
}
CrawlerOutput = {
Partitions = {
AddOrUpdateBehavior = var.partitions_add_or_update_behavior
}
Tables = {
AddOrUpdateBehavior = "MergeNewColumns"
}
}
})
}
# ---------------------------------------------------------------------------
# Optional IAM role for the crawler
# ---------------------------------------------------------------------------
data "aws_iam_policy_document" "assume" {
count = var.create_role ? 1 : 0
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["glue.amazonaws.com"]
}
}
}
resource "aws_iam_role" "this" {
count = var.create_role ? 1 : 0
name = "${var.name}-crawler-role"
assume_role_policy = data.aws_iam_policy_document.assume[0].json
permissions_boundary = var.permissions_boundary_arn
tags = var.tags
}
# Managed policy that grants the Glue control-plane permissions a crawler needs.
resource "aws_iam_role_policy_attachment" "service" {
count = var.create_role ? 1 : 0
role = aws_iam_role.this[0].name
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
# Scoped read access to the S3 targets so the crawler can actually read the data.
data "aws_iam_policy_document" "s3_read" {
count = var.create_role && length(var.s3_target_paths) > 0 ? 1 : 0
statement {
sid = "ListBuckets"
effect = "Allow"
actions = ["s3:GetBucketLocation", "s3:ListBucket"]
resources = distinct([
for p in var.s3_target_paths : "arn:aws:s3:::${split("/", replace(p, "s3://", ""))[0]}"
])
}
statement {
sid = "GetObjects"
effect = "Allow"
actions = ["s3:GetObject"]
resources = [for p in var.s3_target_paths : "${replace(p, "s3://", "arn:aws:s3:::")}*"]
}
}
resource "aws_iam_role_policy" "s3_read" {
count = var.create_role && length(var.s3_target_paths) > 0 ? 1 : 0
name = "${var.name}-s3-read"
role = aws_iam_role.this[0].id
policy = data.aws_iam_policy_document.s3_read[0].json
}
# ---------------------------------------------------------------------------
# The crawler
# ---------------------------------------------------------------------------
resource "aws_glue_crawler" "this" {
name = var.name
role = local.role_arn
database_name = var.database_name
description = var.description
table_prefix = var.table_prefix
schedule = var.schedule
classifiers = var.classifiers
configuration = local.configuration
dynamic "s3_target" {
for_each = var.s3_target_paths
content {
path = s3_target.value
exclusions = var.s3_exclusions
sample_size = var.s3_sample_size
connection_name = var.s3_connection_name
event_queue_arn = var.event_queue_arn
dlq_event_queue_arn = var.dlq_event_queue_arn
}
}
dynamic "jdbc_target" {
for_each = var.jdbc_targets
content {
connection_name = jdbc_target.value.connection_name
path = jdbc_target.value.path
exclusions = lookup(jdbc_target.value, "exclusions", null)
}
}
dynamic "catalog_target" {
for_each = var.catalog_targets
content {
database_name = catalog_target.value.database_name
tables = catalog_target.value.tables
}
}
recrawl_policy {
recrawl_behavior = var.recrawl_behavior
}
schema_change_policy {
update_behavior = var.schema_update_behavior
delete_behavior = var.schema_delete_behavior
}
# Lineage / event-mode crawls require these blocks only when enabled.
dynamic "lineage_configuration" {
for_each = var.lineage_enabled ? [1] : []
content {
crawler_lineage_settings = "ENABLE"
}
}
tags = var.tags
}
variables.tf
variable "name" {
description = "Name of the Glue crawler (also used to derive the IAM role name)."
type = string
validation {
condition = can(regex("^[A-Za-z0-9_-]{1,255}$", var.name))
error_message = "name must be 1-255 chars of letters, digits, hyphen or underscore."
}
}
variable "database_name" {
description = "Glue Data Catalog database where discovered tables are written."
type = string
}
variable "description" {
description = "Free-text description shown in the Glue console."
type = string
default = null
}
variable "table_prefix" {
description = "Prefix prepended to every table the crawler creates (e.g. 'raw_')."
type = string
default = null
}
variable "schedule" {
description = "Cron expression in AWS Glue format, e.g. 'cron(0 2 * * ? *)'. Null = on-demand only."
type = string
default = null
validation {
condition = var.schedule == null || can(regex("^cron\\(.+\\)$", var.schedule))
error_message = "schedule must be a Glue cron() expression, e.g. cron(0 2 * * ? *)."
}
}
# ----- IAM ------------------------------------------------------------------
variable "create_role" {
description = "Create an IAM role for the crawler. If false, supply iam_role_arn."
type = bool
default = true
}
variable "iam_role_arn" {
description = "Existing IAM role ARN to use when create_role = false."
type = string
default = null
validation {
condition = var.iam_role_arn == null || can(regex("^arn:aws[a-z-]*:iam::", var.iam_role_arn))
error_message = "iam_role_arn must be a valid IAM role ARN."
}
}
variable "permissions_boundary_arn" {
description = "Optional permissions boundary applied to the created role."
type = string
default = null
}
# ----- S3 targets -----------------------------------------------------------
variable "s3_target_paths" {
description = "List of s3:// paths to crawl, e.g. ['s3://my-lake/bronze/orders/']."
type = list(string)
default = []
validation {
condition = alltrue([for p in var.s3_target_paths : startswith(p, "s3://")])
error_message = "Every s3_target_paths entry must start with s3://."
}
}
variable "s3_exclusions" {
description = "Glob patterns to exclude from S3 crawls, e.g. ['**/_temporary/**']."
type = list(string)
default = []
}
variable "s3_sample_size" {
description = "Number of files per leaf folder to sample (1-249). Null = crawl all files."
type = number
default = null
validation {
condition = var.s3_sample_size == null || (var.s3_sample_size >= 1 && var.s3_sample_size <= 249)
error_message = "s3_sample_size must be between 1 and 249."
}
}
variable "s3_connection_name" {
description = "Glue connection name for S3 targets reachable only via VPC (e.g. S3 access points)."
type = string
default = null
}
variable "event_queue_arn" {
description = "SQS queue ARN for S3 event-driven crawls (CRAWL_EVENT_MODE)."
type = string
default = null
}
variable "dlq_event_queue_arn" {
description = "Dead-letter SQS queue ARN for event-driven crawls."
type = string
default = null
}
# ----- JDBC / catalog targets ----------------------------------------------
variable "jdbc_targets" {
description = "JDBC targets: list of objects with connection_name, path and optional exclusions."
type = list(object({
connection_name = string
path = string
exclusions = optional(list(string))
}))
default = []
}
variable "catalog_targets" {
description = "Catalog targets for crawling existing tables: database_name + list of table names."
type = list(object({
database_name = string
tables = list(string)
}))
default = []
}
# ----- Behaviour / policy ---------------------------------------------------
variable "classifiers" {
description = "Ordered list of custom classifier names to apply before built-ins."
type = list(string)
default = []
}
variable "recrawl_behavior" {
description = "CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, or CRAWL_EVENT_MODE."
type = string
default = "CRAWL_EVERYTHING"
validation {
condition = contains(["CRAWL_EVERYTHING", "CRAWL_NEW_FOLDERS_ONLY", "CRAWL_EVENT_MODE"], var.recrawl_behavior)
error_message = "recrawl_behavior must be CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, or CRAWL_EVENT_MODE."
}
}
variable "schema_update_behavior" {
description = "How to handle schema changes: UPDATE_IN_DATABASE or LOG."
type = string
default = "UPDATE_IN_DATABASE"
validation {
condition = contains(["UPDATE_IN_DATABASE", "LOG"], var.schema_update_behavior)
error_message = "schema_update_behavior must be UPDATE_IN_DATABASE or LOG."
}
}
variable "schema_delete_behavior" {
description = "How to handle objects deleted from the data store: LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE."
type = string
default = "LOG"
validation {
condition = contains(["LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"], var.schema_delete_behavior)
error_message = "schema_delete_behavior must be LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE."
}
}
variable "configuration_version" {
description = "Crawler configuration JSON version (e.g. 1.0). Null = omit configuration block."
type = number
default = null
}
variable "table_grouping_policy" {
description = "CombineCompatibleSchemas to merge similar folders into one table, or null."
type = string
default = "CombineCompatibleSchemas"
}
variable "partitions_add_or_update_behavior" {
description = "Partition output behaviour, typically InheritFromTable."
type = string
default = "InheritFromTable"
}
variable "lineage_enabled" {
description = "Enable crawler lineage tracking."
type = bool
default = false
}
variable "tags" {
description = "Tags applied to the crawler and any created IAM role."
type = map(string)
default = {}
}
outputs.tf
output "id" {
description = "The crawler name, which is its Terraform/AWS identifier."
value = aws_glue_crawler.this.id
}
output "name" {
description = "Name of the Glue crawler."
value = aws_glue_crawler.this.name
}
output "arn" {
description = "ARN of the Glue crawler."
value = aws_glue_crawler.this.arn
}
output "database_name" {
description = "Glue Data Catalog database the crawler populates."
value = aws_glue_crawler.this.database_name
}
output "role_arn" {
description = "ARN of the IAM role used by the crawler."
value = local.role_arn
}
output "role_name" {
description = "Name of the created IAM role, or null when an existing role was supplied."
value = try(aws_iam_role.this[0].name, null)
}
How to use it
# A Glue database to hold the discovered tables (a dependency the crawler needs).
resource "aws_glue_catalog_database" "bronze" {
name = "lake_bronze"
}
module "glue_crawler" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-crawler?ref=v1.0.0"
name = "bronze-orders-crawler"
database_name = aws_glue_catalog_database.bronze.name
description = "Discovers daily order drops in the bronze layer"
table_prefix = "raw_"
# Crawl every night at 02:00 UTC.
schedule = "cron(0 2 * * ? *)"
s3_target_paths = [
"s3://acme-data-lake/bronze/orders/",
"s3://acme-data-lake/bronze/customers/",
]
s3_exclusions = ["**/_temporary/**", "**/_SUCCESS"]
# Only scan folders added since the last run — cheap incremental crawls.
recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY"
# Never delete catalog tables when files disappear; just mark them deprecated.
schema_update_behavior = "UPDATE_IN_DATABASE"
schema_delete_behavior = "DEPRECATE_IN_DATABASE"
configuration_version = 1.0
tags = {
Environment = "prod"
DataLayer = "bronze"
Team = "data-platform"
}
}
# Downstream: grant an Athena workgroup's role read access to whatever role the
# crawler runs as is not needed, but you DO commonly trigger an ETL job after a
# successful crawl. A Glue workflow trigger keys off the crawler by name:
resource "aws_glue_trigger" "after_crawl" {
name = "start-etl-after-bronze-crawl"
type = "CONDITIONAL"
predicate {
conditions {
crawler_name = module.glue_crawler.name # output consumed here
crawl_state = "SUCCEEDED"
}
}
actions {
job_name = "bronze-to-silver-transform"
}
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/glue_crawler/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue-crawler?ref=v1.0.0"
}
inputs = {
name = "..."
database_name = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/glue_crawler && terragrunt apply # this module
terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
name |
string |
— | yes | Crawler name; also used to derive the IAM role name. |
database_name |
string |
— | yes | Glue Data Catalog database where discovered tables are written. |
description |
string |
null |
no | Free-text description shown in the console. |
table_prefix |
string |
null |
no | Prefix prepended to every created table (e.g. raw_). |
schedule |
string |
null |
no | Glue cron expression; null means on-demand only. |
create_role |
bool |
true |
no | Create an IAM role for the crawler; if false, supply iam_role_arn. |
iam_role_arn |
string |
null |
no | Existing role ARN used when create_role = false. |
permissions_boundary_arn |
string |
null |
no | Permissions boundary for the created role. |
s3_target_paths |
list(string) |
[] |
no | List of s3:// paths to crawl. |
s3_exclusions |
list(string) |
[] |
no | Glob patterns to exclude from S3 crawls. |
s3_sample_size |
number |
null |
no | Files per leaf folder to sample (1–249); null crawls all. |
s3_connection_name |
string |
null |
no | Glue connection for VPC-only S3 targets. |
event_queue_arn |
string |
null |
no | SQS queue ARN for event-driven crawls. |
dlq_event_queue_arn |
string |
null |
no | Dead-letter SQS queue ARN for event-driven crawls. |
jdbc_targets |
list(object) |
[] |
no | JDBC targets: connection_name, path, optional exclusions. |
catalog_targets |
list(object) |
[] |
no | Existing-table targets: database_name + tables. |
classifiers |
list(string) |
[] |
no | Ordered custom classifier names applied before built-ins. |
recrawl_behavior |
string |
CRAWL_EVERYTHING |
no | CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, or CRAWL_EVENT_MODE. |
schema_update_behavior |
string |
UPDATE_IN_DATABASE |
no | UPDATE_IN_DATABASE or LOG. |
schema_delete_behavior |
string |
LOG |
no | LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE. |
configuration_version |
number |
null |
no | Configuration JSON version (e.g. 1.0); null omits the block. |
table_grouping_policy |
string |
CombineCompatibleSchemas |
no | Merge compatible folders into one table, or null. |
partitions_add_or_update_behavior |
string |
InheritFromTable |
no | Partition output behaviour. |
lineage_enabled |
bool |
false |
no | Enable crawler lineage tracking. |
tags |
map(string) |
{} |
no | Tags applied to the crawler and created role. |
Outputs
| Name | Description |
|---|---|
id |
The crawler name, which is its Terraform/AWS identifier. |
name |
Name of the Glue crawler. |
arn |
ARN of the Glue crawler. |
database_name |
Glue Data Catalog database the crawler populates. |
role_arn |
ARN of the IAM role used by the crawler. |
role_name |
Name of the created IAM role, or null when an existing role was supplied. |
Enterprise scenario
A retail analytics team runs a bronze/silver/gold lake on S3. Each night, ingestion jobs drop new date-partitioned Parquet under s3://acme-data-lake/bronze/<feed>/dt=YYYY-MM-DD/. They deploy one instance of this module per feed with recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY" and schema_delete_behavior = "DEPRECATE_IN_DATABASE", so new partitions appear in Athena within minutes of landing while an accidental upstream backfill that removes a day’s files can never silently delete a production table. A CONDITIONAL Glue trigger references each crawler’s name output to kick off the bronze-to-silver ETL only after the crawl reports SUCCEEDED, giving them a fully code-reviewed, dependency-ordered pipeline.
Best practices
- Never default
schema_delete_behaviortoDELETE_FROM_DATABASE. UseLOGorDEPRECATE_IN_DATABASEso a transient gap in the source data can’t drop catalog tables (and their partition metadata) out from under live queries. - Crawl incrementally to control cost.
CRAWL_NEW_FOLDERS_ONLYor event-mode (CRAWL_EVENT_MODEwith an SQSevent_queue_arn) means you pay Glue DPU-hours for new data only, not a full terabyte re-scan every night. - Scope the IAM role tightly. The bundled role attaches
AWSGlueServiceRoleplus a path-scoped S3 read policy derived froms3_target_paths— don’t widen it tos3:GetObjecton*; add the KMSDecryptpermission separately if the bucket is SSE-KMS encrypted. - Standardise naming with
table_prefixand consistent crawler names. Araw_/stg_/cur_prefix per layer makes the Data Catalog self-documenting and keeps Athena queries unambiguous across hundreds of tables. - Use exclusions to skip job exhaust. Patterns like
**/_temporary/**and**/_SUCCESSstop the crawler wasting time classifying Spark scratch files and prevent spurious tables from appearing in the catalog. - Prefer
CombineCompatibleSchemasfor partitioned data. Keepingtable_grouping_policy = "CombineCompatibleSchemas"collapses thousands of partition folders into a single logical table instead of creating one table per folder — essential for query performance and catalog hygiene.