Terraform Module: AWS Glue — a governed Data Catalog database as code

Quick take: Provision an AWS Glue Data Catalog database with Terraform using aws_glue_catalog_database, with Lake Formation-friendly defaults, optional catalog encryption, S3 and cross-account target locations, and reusable variable-driven inputs for production data lakes. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "glue" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  database_name = "..."  # Catalog database name; lowercased automatically (1-255 …
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS Glue is a serverless data-integration service, and the Glue Data Catalog is its persistent metadata store — a Hive-compatible catalog of databases, tables, partitions and schemas that Athena, EMR, Redshift Spectrum, Glue ETL jobs and Lake Formation all read from. The aws_glue_catalog_database resource creates the top-level container — a database — under which crawlers and ETL jobs register tables that point at data in S3.

On its own a Glue database is a small object, but in practice it is almost never created in isolation. It needs a stable name and tags for chargeback, an optional location_uri that anchors managed tables to an S3 prefix, a target_database link when you federate across accounts via Lake Formation, and increasingly a federated_database block when the catalog points at an external metastore (Redshift, Aurora, or another account). Wrapping it in a module gives every data domain a consistent, policy-compliant way to stamp out catalog databases — same naming convention, same tags, same encryption posture — instead of hand-clicking them or copy-pasting HCL between repos. This module produces one Glue Data Catalog database plus its most common production companions: catalog encryption settings and an optional resource-level Lake Formation permission grant.

When to use it

You run a data lake on S3 + Athena and need a catalog database per domain (e.g. sales_raw, sales_curated) created repeatably across dev/stage/prod.
A Glue crawler or ETL job needs a target database to write table definitions into, and you want that database managed in the same Terraform stack.
You are adopting Lake Formation and want catalog encryption and a baseline grant codified rather than configured by hand.
You need cross-account or federated catalog databases (a target_database pointer or a federated_database connection) defined as code.
You want consistent tagging and naming for cost allocation across many small catalog objects.

If you only need a throwaway database for a one-off Athena query, the console is faster — reach for this module when the database is part of a long-lived, governed platform.
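
The per-domain, per-layer case in the first item above can be sketched with a module-level for_each. This is a minimal illustration, not part of the published module: the local name catalog_databases, the module label glue_domain, the bucket example-lake-dev and the output name are all hypothetical.

```hcl
locals {
  # Hypothetical map of database name => default S3 prefix.
  catalog_databases = {
    sales_raw     = "s3://example-lake-dev/raw/sales/"
    sales_curated = "s3://example-lake-dev/curated/sales/"
  }
}

module "glue_domain" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"
  for_each = local.catalog_databases

  database_name = each.key
  location_uri  = each.value

  tags = {
    Environment = "dev"
  }
}

# Collect the ARNs of every database created above, keyed by name.
output "catalog_database_arns" {
  value = { for name, db in module.glue_domain : name => db.database_arn }
}
```

enable_catalog_encryption is left at its default of false here because the encryption settings are account-wide; turning it on inside a for_each would have several module instances managing the same setting.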

Module structure

terraform-module-aws-glue/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Glue database names must be lowercase; normalise defensively.
  database_name = lower(var.database_name)

  common_tags = merge(
    {
      "ManagedBy" = "terraform"
      "Module"    = "terraform-module-aws-glue"
    },
    var.tags
  )
}

resource "aws_glue_catalog_database" "this" {
  name         = local.database_name
  description  = var.description
  catalog_id   = var.catalog_id
  location_uri = var.location_uri

  # Cross-account / federated catalog link to another Glue database.
  dynamic "target_database" {
    for_each = var.target_database != null ? [var.target_database] : []
    content {
      catalog_id    = target_database.value.catalog_id
      database_name = target_database.value.database_name
      region        = target_database.value.region
    }
  }

  # Point the database at an external metastore (e.g. Redshift, Aurora,
  # another account) via a Glue connection.
  dynamic "federated_database" {
    for_each = var.federated_database != null ? [var.federated_database] : []
    content {
      identifier      = federated_database.value.identifier
      connection_name = federated_database.value.connection_name
    }
  }

  # Default table permissions applied to new tables in this database
  # (used by Lake Formation hybrid access mode).
  dynamic "create_table_default_permission" {
    for_each = var.create_table_default_permissions
    content {
      permissions = create_table_default_permission.value.permissions
      principal {
        data_lake_principal_identifier = create_table_default_permission.value.principal
      }
    }
  }

  tags = local.common_tags
}

# Account-level Glue Data Catalog encryption. Optional but recommended:
# encrypts metadata at rest and (optionally) connection passwords.
resource "aws_glue_data_catalog_encryption_settings" "this" {
  count      = var.enable_catalog_encryption ? 1 : 0
  catalog_id = var.catalog_id

  data_catalog_encryption_settings {
    encryption_at_rest {
      catalog_encryption_mode         = "SSE-KMS"
      sse_aws_kms_key_id              = var.catalog_kms_key_arn
      catalog_encryption_service_role = var.catalog_encryption_service_role_arn
    }

    connection_password_encryption {
      return_connection_password_encrypted = true
      aws_kms_key_id                        = var.catalog_kms_key_arn
    }
  }
}

# Optional Lake Formation grant on the database to a baseline principal.
resource "aws_lakeformation_permissions" "database" {
  count       = var.lakeformation_grant != null ? 1 : 0
  principal   = var.lakeformation_grant.principal
  permissions = var.lakeformation_grant.permissions

  database {
    catalog_id = var.catalog_id
    name       = aws_glue_catalog_database.this.name
  }
}

variables.tf

variable "database_name" {
  description = "Name of the Glue Data Catalog database. Lowercased automatically; 1-255 chars, letters/numbers/underscore."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9_]{1,255}$", var.database_name))
    error_message = "database_name must be 1-255 characters and contain only letters, numbers, and underscores."
  }
}

variable "description" {
  description = "Free-text description shown in the Glue console and APIs."
  type        = string
  default     = null
}

variable "catalog_id" {
  description = "AWS account ID of the Data Catalog. Defaults to the caller's account when null."
  type        = string
  default     = null

  validation {
    condition     = var.catalog_id == null || can(regex("^[0-9]{12}$", var.catalog_id))
    error_message = "catalog_id must be a 12-digit AWS account ID or null."
  }
}

variable "location_uri" {
  description = "Default S3 location for managed tables created in this database, e.g. s3://my-lake/curated/sales/."
  type        = string
  default     = null

  validation {
    condition     = var.location_uri == null || can(regex("^s3://", var.location_uri))
    error_message = "location_uri must be an s3:// URI when set."
  }
}

variable "target_database" {
  description = "Cross-account/region link to another Glue database (resource link). Object with catalog_id, database_name, and optional region."
  type = object({
    catalog_id    = string
    database_name = string
    region        = optional(string)
  })
  default = null
}

variable "federated_database" {
  description = "Link this database to an external metastore via a Glue connection. Object with identifier and connection_name."
  type = object({
    identifier      = string
    connection_name = string
  })
  default = null
}

variable "create_table_default_permissions" {
  description = "Default Lake Formation permissions applied to new tables in this database. List of { permissions, principal } where principal is a data-lake principal identifier (e.g. IAM_ALLOWED_PRINCIPALS)."
  type = list(object({
    permissions = list(string)
    principal   = string
  }))
  default = []
}

variable "enable_catalog_encryption" {
  description = "Manage account-level Glue Data Catalog encryption settings (SSE-KMS at rest + encrypted connection passwords). Account-wide — set true in exactly one stack per account."
  type        = bool
  default     = false
}

variable "catalog_kms_key_arn" {
  description = "KMS key ARN used to encrypt the Data Catalog at rest and connection passwords. Required when enable_catalog_encryption is true."
  type        = string
  default     = null

  validation {
    condition     = var.catalog_kms_key_arn == null || can(regex("^arn:aws[a-z\\-]*:kms:", var.catalog_kms_key_arn))
    error_message = "catalog_kms_key_arn must be a KMS key ARN or null."
  }
}

variable "catalog_encryption_service_role_arn" {
  description = "IAM role ARN Glue assumes to use the KMS key for catalog encryption. Optional; leave null to use the Glue service-linked permissions."
  type        = string
  default     = null
}

variable "lakeformation_grant" {
  description = "Optional baseline Lake Formation grant on the database. Object with principal (IAM ARN) and permissions (e.g. [\"DESCRIBE\", \"CREATE_TABLE\"])."
  type = object({
    principal   = string
    permissions = list(string)
  })
  default = null
}

variable "tags" {
  description = "Extra tags merged onto the database (and any tagged sub-resources)."
  type        = map(string)
  default     = {}
}
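
The description of catalog_kms_key_arn says the key is required when enable_catalog_encryption is true, but the validation block above only checks the ARN format. One possible way to enforce the pairing is a lifecycle precondition (available since Terraform 1.2) on the encryption resource in main.tf. The sketch below repeats that resource with the precondition added; the lifecycle block is a suggested addition, not something the published module is known to contain.

```hcl
resource "aws_glue_data_catalog_encryption_settings" "this" {
  count      = var.enable_catalog_encryption ? 1 : 0
  catalog_id = var.catalog_id

  data_catalog_encryption_settings {
    encryption_at_rest {
      catalog_encryption_mode         = "SSE-KMS"
      sse_aws_kms_key_id              = var.catalog_kms_key_arn
      catalog_encryption_service_role = var.catalog_encryption_service_role_arn
    }

    connection_password_encryption {
      return_connection_password_encrypted = true
      aws_kms_key_id                       = var.catalog_kms_key_arn
    }
  }

  lifecycle {
    # Suggested addition: fail at plan time instead of at the AWS API call.
    # Only evaluated when count is 1, i.e. when encryption is enabled.
    precondition {
      condition     = var.catalog_kms_key_arn != null
      error_message = "catalog_kms_key_arn must be set when enable_catalog_encryption is true."
    }
  }
}
```

Because the precondition lives inside the counted resource, callers who leave enable_catalog_encryption at false are unaffected.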

outputs.tf

output "database_id" {
  description = "Glue catalog database ID in the form catalog_id:name."
  value       = aws_glue_catalog_database.this.id
}

output "database_name" {
  description = "Name of the Glue Data Catalog database."
  value       = aws_glue_catalog_database.this.name
}

output "database_arn" {
  description = "ARN of the Glue Data Catalog database."
  value       = aws_glue_catalog_database.this.arn
}

output "catalog_id" {
  description = "Catalog (account) ID the database lives in."
  value       = aws_glue_catalog_database.this.catalog_id
}

output "location_uri" {
  description = "Default S3 location for managed tables, if set."
  value       = aws_glue_catalog_database.this.location_uri
}

How to use it

module "glue_sales_curated" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  database_name = "sales_curated"
  description   = "Curated, conformed sales facts and dimensions for the analytics domain."
  location_uri  = "s3://kloudvin-lake-prod/curated/sales/"

  # Account-wide setting: enable in exactly one stack per account.
  enable_catalog_encryption = true
  catalog_kms_key_arn       = aws_kms_key.glue_catalog.arn

  lakeformation_grant = {
    principal   = aws_iam_role.analytics_etl.arn
    permissions = ["DESCRIBE", "CREATE_TABLE", "ALTER", "DROP"]
  }

  tags = {
    Domain      = "sales"
    Environment = "prod"
    CostCenter  = "data-platform"
  }
}

# Downstream: a Glue crawler that registers tables into the module's database.
resource "aws_glue_crawler" "sales_curated" {
  name          = "sales-curated-crawler"
  role          = aws_iam_role.analytics_etl.arn
  database_name = module.glue_sales_curated.database_name # <- module output

  s3_target {
    path = module.glue_sales_curated.location_uri
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }
}
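
For the cross-account case, a resource link can be declared with the same module by passing target_database. The following is a sketch under assumptions: the module label, the link name sales_curated_link and the account ID 111122223333 are hypothetical, and the owning account would typically need to have shared the database through Lake Formation before the link is usable.

```hcl
module "glue_sales_curated_link" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  # Hypothetical local name for the link in the consumer account.
  database_name = "sales_curated_link"

  target_database = {
    catalog_id    = "111122223333" # hypothetical producer account ID
    database_name = "sales_curated"
    # region is optional in the module's variable type and is omitted here.
  }

  tags = {
    Domain      = "sales"
    Environment = "prod"
  }
}
```

location_uri is deliberately left unset, since the link only points at the producer's database.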

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}

2. Module config — live/prod/glue/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"
}

inputs = {
  database_name = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/glue && terragrunt apply)      # this module
(cd live/prod && terragrunt run-all apply)   # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
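
The root config above only shows the backend. To also define the provider once, a generate block in the same root file is one common approach. This is a sketch, assuming the us-east-1 region from the Quickstart and a generated file name of provider.tf:

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module source itself stays free of provider configuration.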

Inputs

Name	Type	Default	Required	Description
`database_name`	`string`	—	Yes	Catalog database name; lowercased automatically (1-255 chars, alphanumeric + underscore).
`description`	`string`	`null`	No	Free-text description shown in console/APIs.
`catalog_id`	`string`	`null`	No	12-digit account ID of the Data Catalog; defaults to caller’s account.
`location_uri`	`string`	`null`	No	Default S3 location (`s3://...`) for managed tables in the database.
`target_database`	`object`	`null`	No	Cross-account/region resource link: `{ catalog_id, database_name, region? }`.
`federated_database`	`object`	`null`	No	External metastore link: `{ identifier, connection_name }`.
`create_table_default_permissions`	`list(object)`	`[]`	No	Default LF permissions for new tables: `{ permissions, principal }`.
`enable_catalog_encryption`	`bool`	`false`	No	Manage account-level catalog encryption (SSE-KMS). Account-wide — enable in one stack only.
`catalog_kms_key_arn`	`string`	`null`	No	KMS key ARN for catalog/connection-password encryption. Required when encryption is enabled.
`catalog_encryption_service_role_arn`	`string`	`null`	No	IAM role ARN Glue assumes to use the KMS key for catalog encryption.
`lakeformation_grant`	`object`	`null`	No	Baseline LF grant on the database: `{ principal, permissions }`.
`tags`	`map(string)`	`{}`	No	Extra tags merged onto the database.

Outputs

Name	Description
`database_id`	Catalog database ID (`catalog_id:name`).
`database_name`	Name of the Glue Data Catalog database.
`database_arn`	ARN of the database.
`catalog_id`	Catalog (account) ID the database resides in.
`location_uri`	Default S3 location for managed tables, if set.
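
As a rough illustration of consuming these outputs beyond the crawler shown earlier, the sketch below saves an Athena query against the module's database and re-exports its ARN. It assumes the module block labelled glue_sales_curated from How to use it; the query name, the orders table and the output name are hypothetical.

```hcl
resource "aws_athena_named_query" "sample_orders" {
  name        = "sample-orders" # hypothetical query name
  description = "Sample rows from the curated sales database."
  database    = module.glue_sales_curated.database_name
  query       = "SELECT * FROM orders LIMIT 10;" # hypothetical table
}

# Re-export the ARN so other stacks can reference it in IAM policies.
output "sales_curated_database_arn" {
  value = module.glue_sales_curated.database_arn
}
```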

Enterprise scenario

A retail analytics platform runs a medallion data lake on S3 with one Glue catalog database per layer and domain — orders_raw, orders_curated, inventory_curated — all stamped out by this module from a single environments folder. Each database carries Domain and CostCenter tags so finance can attribute Athena scan costs per team, while enable_catalog_encryption in the platform’s bootstrap stack enforces SSE-KMS on all metadata account-wide. When the data-governance team onboards a new domain, they add one module block, and Lake Formation grants plus crawler targets flow from the module’s outputs — no console clicks, fully reviewable in a pull request.

Best practices

Encrypt the catalog account-wide, once. aws_glue_data_catalog_encryption_settings is per-account, not per-database, so set enable_catalog_encryption to true in exactly one bootstrap stack and reference a dedicated KMS key. If several stacks manage the setting, each apply overwrites the others' configuration and the next plan reports drift.
Keep names lowercase and convention-driven. Glue folds database names to lowercase when it stores them, and Athena lowercases names when it runs queries; this module lowercases for you, but standardise on {domain}_{layer} (e.g. sales_curated) so crawlers, workgroups and IAM policies stay predictable.
Anchor managed tables with location_uri. Setting a default S3 prefix keeps Glue/Athena CREATE TABLE output organised and makes bucket-policy and Lake Formation scoping far simpler than scattering tables across arbitrary paths.
Prefer Lake Formation grants over IAM-only access for governed lakes. Use lakeformation_grant (and create_table_default_permissions) to give ETL roles least-privilege DESCRIBE/CREATE_TABLE rather than broad glue:* IAM, and avoid leaving IAM_ALLOWED_PRINCIPALS as the default on production databases.
Tag for cost allocation. Catalog objects are cheap, but the Athena queries against them are not — consistent Domain/CostCenter/Environment tags let you trace data-scan spend back to the owning team.
Pin the module ref and provider. Consume a tagged ?ref=v1.0.0 and keep aws ~> 5.0 pinned in versions.tf so catalog changes are deliberate and reviewable, never an accidental provider upgrade.