Quick take: Provision an AWS Glue Data Catalog database with Terraform using aws_glue_catalog_database, with Lake Formation-friendly defaults, optional account-level catalog encryption, a default S3 location, cross-account links, and reusable variable-driven inputs for production data lakes. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
  region = "us-east-1"
}

module "glue" {
  source        = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"
  database_name = "..." # Catalog database name; lowercased automatically (1-255 …
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
AWS Glue is a serverless data-integration service, and the Glue Data Catalog is its persistent metadata store — a Hive-compatible catalog of databases, tables, partitions and schemas that Athena, EMR, Redshift Spectrum, Glue ETL jobs and Lake Formation all read from. The aws_glue_catalog_database resource creates the top-level container — a database — under which crawlers and ETL jobs register tables that point at data in S3.
On its own a Glue database is a small object, but in practice it is almost never created in isolation. It needs a stable name and tags for chargeback, an optional location_uri that anchors managed tables to an S3 prefix, a target_database link when you federate across accounts via Lake Formation, and increasingly a federated_database block when the catalog points at an external metastore (Redshift, Aurora, or another account). Wrapping it in a module gives every data domain a consistent, policy-compliant way to stamp out catalog databases — same naming convention, same tags, same encryption posture — instead of hand-clicking them or copy-pasting HCL between repos. This module produces one Glue Data Catalog database plus its most common production companions: catalog encryption settings and an optional resource-level Lake Formation permission grant.
When to use it
- You run a data lake on S3 + Athena and need a catalog database per domain (e.g. sales_raw, sales_curated) created repeatably across dev/stage/prod.
- A Glue crawler or ETL job needs a target database to write table definitions into, and you want that database managed in the same Terraform stack.
- You are adopting Lake Formation and want catalog encryption and a baseline grant codified rather than configured by hand.
- You need cross-account or federated catalog databases (a target_database pointer or a federated_database connection) defined as code.
- You want consistent tagging and naming for cost allocation across many small catalog objects.
If you only need a throwaway database for a one-off Athena query, the console is faster — reach for this module when the database is part of a long-lived, governed platform.
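For the cross-account case in the list above, a resource link could look like the sketch below. The account ID 111122223333, the eu-west-1 region, and the sales_curated_link name are hypothetical placeholders, and the sketch assumes the producer account has already shared its database with yours. target_database is the module input described under Inputs.

```hcl
# Sketch only: account ID, region, and names are hypothetical.
module "glue_sales_link" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  # Local name under which the remote database appears in this account's catalog.
  database_name = "sales_curated_link"

  # Pointer to the database that actually holds the table definitions.
  target_database = {
    catalog_id    = "111122223333" # producer account (assumed)
    database_name = "sales_curated"
    region        = "eu-west-1" # optional; omit for a same-region link
  }
}
```

location_uri is left unset here because the tables live in the producer's database, not in the link.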
Module structure
terraform-module-aws-glue/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf
versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
main.tf
locals {
  # Glue database names must be lowercase; normalise defensively.
  database_name = lower(var.database_name)

  common_tags = merge(
    {
      "ManagedBy" = "terraform"
      "Module"    = "terraform-module-aws-glue"
    },
    var.tags
  )
}

resource "aws_glue_catalog_database" "this" {
  name         = local.database_name
  description  = var.description
  catalog_id   = var.catalog_id
  location_uri = var.location_uri

  # Cross-account / federated catalog link to another Glue database.
  dynamic "target_database" {
    for_each = var.target_database != null ? [var.target_database] : []
    content {
      catalog_id    = target_database.value.catalog_id
      database_name = target_database.value.database_name
      # optional(string) attribute: null when the caller omits it.
      region = target_database.value.region
    }
  }

  # Point the database at an external metastore (e.g. Redshift, Aurora,
  # another account) via a Lake Formation connection.
  dynamic "federated_database" {
    for_each = var.federated_database != null ? [var.federated_database] : []
    content {
      identifier      = federated_database.value.identifier
      connection_name = federated_database.value.connection_name
    }
  }

  # Default table permissions applied to new tables in this database
  # (used by Lake Formation hybrid access mode).
  dynamic "create_table_default_permission" {
    for_each = var.create_table_default_permissions
    content {
      permissions = create_table_default_permission.value.permissions
      principal {
        data_lake_principal_identifier = create_table_default_permission.value.principal
      }
    }
  }

  tags = local.common_tags
}
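
# Account-level Glue Data Catalog encryption. Optional but recommended:
# encrypts metadata at rest and connection passwords.
resource "aws_glue_data_catalog_encryption_settings" "this" {
  count      = var.enable_catalog_encryption ? 1 : 0
  catalog_id = var.catalog_id

  data_catalog_encryption_settings {
    encryption_at_rest {
      # The Glue API pairs a service role with the SSE-KMS-WITH-SERVICE-ROLE mode.
      catalog_encryption_mode         = var.catalog_encryption_service_role_arn != null ? "SSE-KMS-WITH-SERVICE-ROLE" : "SSE-KMS"
      sse_aws_kms_key_id              = var.catalog_kms_key_arn
      catalog_encryption_service_role = var.catalog_encryption_service_role_arn
    }
    connection_password_encryption {
      return_connection_password_encrypted = true
      aws_kms_key_id                       = var.catalog_kms_key_arn
    }
  }

  lifecycle {
    # Enforce the documented requirement at plan time rather than at the API.
    precondition {
      condition     = var.catalog_kms_key_arn != null
      error_message = "catalog_kms_key_arn is required when enable_catalog_encryption is true."
    }
  }
}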

# Optional Lake Formation grant on the database to a baseline principal.
resource "aws_lakeformation_permissions" "database" {
  count       = var.lakeformation_grant != null ? 1 : 0
  principal   = var.lakeformation_grant.principal
  permissions = var.lakeformation_grant.permissions

  database {
    catalog_id = var.catalog_id
    name       = aws_glue_catalog_database.this.name
  }
}
variables.tf
variable "database_name" {
  description = "Name of the Glue Data Catalog database. Lowercased automatically; 1-255 chars, letters/numbers/underscore."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9_]{1,255}$", var.database_name))
    error_message = "database_name must be 1-255 characters and contain only letters, numbers, and underscores."
  }
}

variable "description" {
  description = "Free-text description shown in the Glue console and APIs."
  type        = string
  default     = null
}

variable "catalog_id" {
  description = "AWS account ID of the Data Catalog. Defaults to the caller's account when null."
  type        = string
  default     = null

  validation {
    condition     = var.catalog_id == null || can(regex("^[0-9]{12}$", var.catalog_id))
    error_message = "catalog_id must be a 12-digit AWS account ID or null."
  }
}

variable "location_uri" {
  description = "Default S3 location for managed tables created in this database, e.g. s3://my-lake/curated/sales/."
  type        = string
  default     = null

  validation {
    condition     = var.location_uri == null || can(regex("^s3://", var.location_uri))
    error_message = "location_uri must be an s3:// URI when set."
  }
}

variable "target_database" {
  description = "Cross-account/region link to another Glue database (resource link). Object with catalog_id, database_name, and optional region."
  type = object({
    catalog_id    = string
    database_name = string
    region        = optional(string)
  })
  default = null
}

variable "federated_database" {
  description = "Link this database to an external metastore via a Lake Formation connection. Object with identifier and connection_name."
  type = object({
    identifier      = string
    connection_name = string
  })
  default = null
}

variable "create_table_default_permissions" {
  description = "Default Lake Formation permissions applied to new tables in this database. List of { permissions, principal } where principal is a data-lake principal identifier (e.g. IAM_ALLOWED_PRINCIPALS)."
  type = list(object({
    permissions = list(string)
    principal   = string
  }))
  default = []
}

variable "enable_catalog_encryption" {
  description = "Manage account-level Glue Data Catalog encryption settings (SSE-KMS at rest + encrypted connection passwords). Account-wide: set true in exactly one stack per account."
  type        = bool
  default     = false
}

variable "catalog_kms_key_arn" {
  description = "KMS key ARN used to encrypt the Data Catalog at rest and connection passwords. Required when enable_catalog_encryption is true."
  type        = string
  default     = null

  validation {
    condition     = var.catalog_kms_key_arn == null || can(regex("^arn:aws[a-z\\-]*:kms:", var.catalog_kms_key_arn))
    error_message = "catalog_kms_key_arn must be a KMS key ARN or null."
  }
}

variable "catalog_encryption_service_role_arn" {
  description = "IAM role ARN Glue assumes to use the KMS key for catalog encryption. Optional; leave null to use the Glue service-linked permissions."
  type        = string
  default     = null
}

variable "lakeformation_grant" {
  description = "Optional baseline Lake Formation grant on the database. Object with principal (IAM ARN) and permissions (e.g. [\"DESCRIBE\", \"CREATE_TABLE\"])."
  type = object({
    principal   = string
    permissions = list(string)
  })
  default = null
}

variable "tags" {
  description = "Extra tags merged onto the database (and any tagged sub-resources)."
  type        = map(string)
  default     = {}
}
outputs.tf
output "database_id" {
  description = "Glue catalog database ID in the form catalog_id:name (or just name)."
  value       = aws_glue_catalog_database.this.id
}

output "database_name" {
  description = "Name of the Glue Data Catalog database."
  value       = aws_glue_catalog_database.this.name
}

output "database_arn" {
  description = "ARN of the Glue Data Catalog database."
  value       = aws_glue_catalog_database.this.arn
}

output "catalog_id" {
  description = "Catalog (account) ID the database lives in."
  value       = aws_glue_catalog_database.this.catalog_id
}

output "location_uri" {
  description = "Default S3 location for managed tables, if set."
  value       = aws_glue_catalog_database.this.location_uri
}
How to use it
module "glue_sales_curated" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  database_name = "sales_curated"
  description   = "Curated, conformed sales facts and dimensions for the analytics domain."
  location_uri  = "s3://kloudvin-lake-prod/curated/sales/"

  enable_catalog_encryption = true
  catalog_kms_key_arn       = aws_kms_key.glue_catalog.arn

  lakeformation_grant = {
    principal   = aws_iam_role.analytics_etl.arn
    permissions = ["DESCRIBE", "CREATE_TABLE", "ALTER", "DROP"]
  }

  tags = {
    Domain      = "sales"
    Environment = "prod"
    CostCenter  = "data-platform"
  }
}

# Downstream: a Glue crawler that registers tables into the module's database.
resource "aws_glue_crawler" "sales_curated" {
  name          = "sales-curated-crawler"
  role          = aws_iam_role.analytics_etl.arn
  database_name = module.glue_sales_curated.database_name # <- module output

  s3_target {
    path = module.glue_sales_curated.location_uri
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}
2. Module config — live/prod/glue/terragrunt.hcl:
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"
}

inputs = {
  database_name = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/glue && terragrunt apply    # this module only
cd live/prod && terragrunt run-all apply # every module under live/prod (run from the repo root instead of the line above)
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
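The root config in step 1 only generates the backend. One way to keep the provider in the same place is a Terragrunt generate block next to remote_state, sketched below. The provider.tf file name and the us-east-1 region are assumptions carried over from the Quickstart, not part of the original root config.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block.
# Sketch only: writes a provider.tf into every module's working directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

With this in the root, each per-environment terragrunt.hcl stays limited to its source and inputs.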
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| database_name | string | (none) | Yes | Catalog database name; lowercased automatically (1-255 chars, alphanumeric + underscore). |
| description | string | null | No | Free-text description shown in console/APIs. |
| catalog_id | string | null | No | 12-digit account ID of the Data Catalog; defaults to caller’s account. |
| location_uri | string | null | No | Default S3 location (s3://...) for managed tables in the database. |
| target_database | object | null | No | Cross-account/region resource link: { catalog_id, database_name, region? }. |
| federated_database | object | null | No | External metastore link: { identifier, connection_name }. |
| create_table_default_permissions | list(object) | [] | No | Default LF permissions for new tables: { permissions, principal }. |
| enable_catalog_encryption | bool | false | No | Manage account-level catalog encryption (SSE-KMS). Account-wide; enable in one stack only. |
| catalog_kms_key_arn | string | null | No | KMS key ARN for catalog/connection-password encryption. Required when encryption is enabled. |
| catalog_encryption_service_role_arn | string | null | No | IAM role ARN Glue assumes to use the KMS key for catalog encryption. |
| lakeformation_grant | object | null | No | Baseline LF grant on the database: { principal, permissions }. |
| tags | map(string) | {} | No | Extra tags merged onto the database. |
Outputs
| Name | Description |
|---|---|
| database_id | Catalog database ID (catalog_id:name). |
| database_name | Name of the Glue Data Catalog database. |
| database_arn | ARN of the database. |
| catalog_id | Catalog (account) ID the database resides in. |
| location_uri | Default S3 location for managed tables, if set. |
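As an illustration of consuming these outputs, the sketch below builds a read-only IAM policy document scoped to one database. It assumes the glue_sales_curated module block from "How to use it", the standard aws partition, and that the catalog sits in the provider's region; the data source and local names are hypothetical.

```hcl
# Sketch only: names are illustrative, ARNs assume the "aws" partition.
data "aws_region" "current" {}

locals {
  # Common ARN prefix for Glue resources in the catalog's account and region.
  glue_arn_prefix = "arn:aws:glue:${data.aws_region.current.name}:${module.glue_sales_curated.catalog_id}"
}

data "aws_iam_policy_document" "read_sales_catalog" {
  statement {
    sid    = "ReadSalesCuratedMetadata"
    effect = "Allow"
    actions = [
      "glue:GetDatabase",
      "glue:GetTable",
      "glue:GetTables",
      "glue:GetPartitions",
    ]
    # Glue authorises against the catalog, the database, and its tables.
    resources = [
      "${local.glue_arn_prefix}:catalog",
      module.glue_sales_curated.database_arn,
      "${local.glue_arn_prefix}:table/${module.glue_sales_curated.database_name}/*",
    ]
  }
}
```

In a Lake Formation-governed lake this IAM policy only covers the Glue API calls; data access is still decided by Lake Formation grants such as lakeformation_grant.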
Enterprise scenario
A retail analytics platform runs a medallion data lake on S3 with one Glue catalog database per layer and domain — orders_raw, orders_curated, inventory_curated — all stamped out by this module from a single environments folder. Each database carries Domain and CostCenter tags so finance can attribute Athena scan costs per team, while enable_catalog_encryption in the platform’s bootstrap stack enforces SSE-KMS on all metadata account-wide. When the data-governance team onboards a new domain, they add one module block, and Lake Formation grants plus crawler targets flow from the module’s outputs — no console clicks, fully reviewable in a pull request.
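One possible way to stamp out the databases in this scenario is a single module block with for_each, sketched below. The example-lake-prod bucket, the S3 prefix layout, and the cost-centre values are hypothetical; only the database names come from the scenario.

```hcl
# Sketch only: bucket name, prefixes, and cost centres are hypothetical.
locals {
  catalog_databases = {
    orders_raw        = { domain = "orders", layer = "raw", cost_center = "orders-team" }
    orders_curated    = { domain = "orders", layer = "curated", cost_center = "orders-team" }
    inventory_curated = { domain = "inventory", layer = "curated", cost_center = "supply-chain" }
  }
}

module "glue" {
  for_each = local.catalog_databases
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  database_name = each.key
  location_uri  = "s3://example-lake-prod/${each.value.layer}/${each.value.domain}/"

  tags = {
    Domain     = each.value.domain
    CostCenter = each.value.cost_center
  }
}
```

Onboarding a new domain then means adding one entry to the map, and downstream resources can reference module.glue["orders_curated"].database_name and the other outputs.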
Best practices
- Encrypt the catalog account-wide, once. aws_glue_data_catalog_encryption_settings is per-account, not per-database, so set enable_catalog_encryption to true in exactly one bootstrap stack and reference a dedicated KMS key; flipping it in multiple stacks causes Terraform drift fights.
- Keep names lowercase and convention-driven. Glue and Athena treat database names case-insensitively and reject uppercase in some paths; this module lowercases for you, but standardise on {domain}_{layer} (e.g. sales_curated) so crawlers, workgroups and IAM policies stay predictable.
- Anchor managed tables with location_uri. Setting a default S3 prefix keeps Glue/Athena CREATE TABLE output organised and makes bucket-policy and Lake Formation scoping far simpler than scattering tables across arbitrary paths.
- Prefer Lake Formation grants over IAM-only access for governed lakes. Use lakeformation_grant (and create_table_default_permissions) to give ETL roles least-privilege DESCRIBE/CREATE_TABLE rather than broad glue:* IAM, and avoid leaving IAM_ALLOWED_PRINCIPALS as the default on production databases.
- Tag for cost allocation. Catalog objects are cheap, but the Athena queries against them are not; consistent Domain/CostCenter/Environment tags let you trace data-scan spend back to the owning team.
- Pin the module ref and provider. Consume a tagged ?ref=v1.0.0 and keep aws ~> 5.0 pinned in versions.tf so catalog changes are deliberate and reviewable, never an accidental provider upgrade.
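The first practice could be laid out as in the sketch below: one bootstrap stack owns the KMS key and switches encryption on, and every domain stack leaves the flag at its default. The platform_bootstrap database name and the key description are hypothetical, and the key relies on the default KMS key policy, which may need tightening in a real account.

```hcl
# Bootstrap stack: the only place in the account that enables catalog encryption.
# Sketch only: names are hypothetical.
resource "aws_kms_key" "glue_catalog" {
  description         = "Glue Data Catalog metadata and connection passwords"
  enable_key_rotation = true
}

module "glue_platform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  database_name             = "platform_bootstrap"
  enable_catalog_encryption = true
  catalog_kms_key_arn       = aws_kms_key.glue_catalog.arn
}

# Domain stack (separate state): enable_catalog_encryption stays false (the default),
# so this stack never touches the account-level settings.
module "glue_sales_raw" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-glue?ref=v1.0.0"

  database_name = "sales_raw"
}
```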