Terraform Module: AWS Lake Formation - Govern data-lake access with centralized, column-level permissions

Quick take — A reusable Terraform module that registers S3 locations with AWS Lake Formation and grants fine-grained, auditable database, table, and column permissions to IAM principals — without brittle bucket policies. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "lake_formation" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"

  s3_resource_arn = "..."  # ARN of the S3 bucket/prefix to register with Lake Formation.
  database_name   = "..."  # Glue/Lake Formation database the grants target.
}

Then run terraform init && terraform apply. Every other input has a default; with those defaults the module only registers the location and creates no grants, so see Inputs below to add permissions and override behaviour.

What this module is

AWS Lake Formation is a governance layer that sits on top of an S3-backed data lake and the AWS Glue Data Catalog. Instead of hand-crafting S3 bucket policies and IAM statements for every analyst, ETL job, and BI tool, you register the underlying S3 location with Lake Formation once, then grant database-, table-, and column-level permissions to IAM principals. Athena, Redshift Spectrum, EMR, and Glue all honour those grants, and every access decision is centralized and auditable.

Two resources do the heavy lifting, and they are easy to get subtly wrong:

aws_lakeformation_resource: registers an S3 path with Lake Formation. You choose whether Lake Formation uses a service-linked role or your own role_arn to vend temporary credentials for that path. Register the wrong prefix, or get the trailing-slash semantics wrong, and downstream grants silently resolve to nothing.
aws_lakeformation_permissions: the actual grant. As used in this module it is a tri-state resource: each grant targets exactly one of a database, table, or table_with_columns block, and takes permissions plus an optional permissions_with_grant_option. Naming the wrong principal, or granting SELECT on a database (where this module accepts only ALTER, CREATE_TABLE, DESCRIBE, and DROP), produces an apply-time error or a no-op grant.
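
For orientation, the two resources look roughly like this when written by hand, outside the module. This is a minimal sketch only: the bucket, database name, and role ARN are placeholder values, not names taken from this module.

```hcl
# Register a prefix. No role_arn is given, so the service-linked role is used.
resource "aws_lakeformation_resource" "example" {
  arn                     = "arn:aws:s3:::example-lake/curated"
  use_service_linked_role = true
}

# A database-scoped grant: exactly one target block (here, database).
resource "aws_lakeformation_permissions" "example_describe" {
  principal   = "arn:aws:iam::123456789012:role/example-analyst"
  permissions = ["DESCRIBE"] # SELECT belongs on a table grant, not here

  database {
    name = "example_db"
  }

  # Grant only after the location has been registered.
  depends_on = [aws_lakeformation_resource.example]
}
```

Swapping the database block for a table or table_with_columns block changes what the grant targets, which is the choice the module makes for you based on which input variable you populate.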

Wrapping this in a module gives every team a single, version-pinned way to onboard a data domain: register the bucket prefix, grant a read role and a write role with consistent, least-privilege permission sets, and emit the catalog IDs that downstream Athena workgroups and Glue jobs reference. The module hides the tri-state quirks and the service-linked-role decision behind validated variables.

When to use it

You run an S3 data lake catalogued in AWS Glue and want to retire per-bucket IAM/S3 policies in favour of central, table-level governance.
Multiple consumers (analysts via Athena, ETL via Glue, BI via Redshift Spectrum) need different slices of the same dataset — e.g. analysts get SELECT on non-PII columns only, while the pipeline role gets full ALTER/INSERT.
You need an audit trail of who can read which table/column, satisfying data-governance or regulatory review.
You are standing up a new data domain or lake-house zone (raw / curated / consumption) and want each zone registered and permissioned identically across accounts.

Skip it if your lake has a single trusted consumer and no column-level requirements — plain IAM may be simpler. Also note Lake Formation governs the catalog + S3 credential vending; it does not replace KMS encryption or VPC controls on the bucket itself.

Module structure

terraform-module-aws-lake-formation/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Lake Formation grant resources require the account that owns the catalog.
  catalog_id = coalesce(var.catalog_id, data.aws_caller_identity.current.account_id)

  # Normalise principal -> permission-set maps into a flat list we can for_each.
  database_grants = [
    for principal, perms in var.database_permissions : {
      principal = principal
      perms     = perms
    }
  ]

  table_grants = [
    for principal, perms in var.table_permissions : {
      principal = principal
      perms     = perms
    }
  ]

  # Column-scoped grants: each entry pins a principal to an explicit column list.
  column_grants = {
    for grant in var.column_permissions :
    "${grant.principal}:${grant.table}" => grant
  }
}

data "aws_caller_identity" "current" {}

# ---------------------------------------------------------------------------
# Register the S3 location with Lake Formation so it can vend credentials and
# enforce catalog permissions on objects under this prefix.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_resource" "this" {
  arn = var.s3_resource_arn

  # When registration_role_arn is null, Lake Formation uses its service-linked role.
  role_arn                = var.registration_role_arn
  use_service_linked_role = var.registration_role_arn == null

  # Hybrid mode keeps existing IAM/S3 access working alongside LF permissions;
  # set false once you are ready to enforce LF-only access.
  hybrid_access_enabled = var.hybrid_access_enabled
}

# ---------------------------------------------------------------------------
# Database-level grants (e.g. DESCRIBE, CREATE_TABLE) for catalog discovery.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_permissions" "database" {
  for_each = { for g in local.database_grants : g.principal => g }

  principal                     = each.value.principal
  permissions                   = each.value.perms.permissions
  permissions_with_grant_option = each.value.perms.grant_options
  catalog_id                    = local.catalog_id

  database {
    name       = var.database_name
    catalog_id = local.catalog_id
  }

  # Ensure the location is registered before we hand out access to it.
  depends_on = [aws_lakeformation_resource.this]
}

# ---------------------------------------------------------------------------
# Whole-table grants (SELECT / INSERT / ALTER / DELETE / DROP) for ETL roles
# and consumers that need every column.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_permissions" "table" {
  for_each = { for g in local.table_grants : g.principal => g }

  principal                     = each.value.principal
  permissions                   = each.value.perms.permissions
  permissions_with_grant_option = each.value.perms.grant_options
  catalog_id                    = local.catalog_id

  table {
    database_name = var.database_name
    name          = each.value.perms.table_name
    catalog_id    = local.catalog_id
  }

  depends_on = [aws_lakeformation_resource.this]
}

# ---------------------------------------------------------------------------
# Column-level grants: SELECT on an explicit allow-list of columns, used to
# hide PII from analysts while still exposing the rest of the table.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_permissions" "columns" {
  for_each = local.column_grants

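  principal = each.value.principal
  # column_permissions entries carry no permissions attribute; these grants are SELECT-only.
  permissions = ["SELECT"]
  catalog_id  = local.catalog_id

  table_with_columns {
    database_name = var.database_name
    name          = each.value.table
    catalog_id    = local.catalog_id
    column_names  = length(each.value.column_names) > 0 ? each.value.column_names : null

    # A deny-list means "all columns except these", so wildcard must be true with it.
    wildcard              = length(each.value.excluded_column_names) > 0 ? true : null
    excluded_column_names = length(each.value.excluded_column_names) > 0 ? each.value.excluded_column_names : null
  }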

  depends_on = [aws_lakeformation_resource.this]
}

variables.tf

variable "s3_resource_arn" {
  description = "ARN of the S3 bucket or prefix to register with Lake Formation (e.g. arn:aws:s3:::my-lake/curated)."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:s3:::", var.s3_resource_arn))
    error_message = "s3_resource_arn must be a valid S3 ARN starting with arn:aws:s3:::."
  }
}

variable "registration_role_arn" {
  description = "IAM role ARN Lake Formation assumes to vend credentials for the location. Leave null to use the LF service-linked role."
  type        = string
  default     = null

  validation {
    condition     = var.registration_role_arn == null || can(regex("^arn:aws[a-z-]*:iam::[0-9]{12}:role/", var.registration_role_arn))
    error_message = "registration_role_arn must be null or a valid IAM role ARN."
  }
}

variable "hybrid_access_enabled" {
  description = "Keep existing IAM/S3 permissions effective alongside Lake Formation grants. Set false to enforce LF-only access."
  type        = bool
  default     = true
}

variable "catalog_id" {
  description = "Glue Data Catalog account ID owning the database. Defaults to the caller's account."
  type        = string
  default     = null

  validation {
    condition     = var.catalog_id == null || can(regex("^[0-9]{12}$", var.catalog_id))
    error_message = "catalog_id must be null or a 12-digit AWS account ID."
  }
}

variable "database_name" {
  description = "Name of the Glue/Lake Formation database these grants apply to."
  type        = string
}

variable "database_permissions" {
  description = "Map of principal ARN => database-level permission set. Valid permissions: ALTER, CREATE_TABLE, DESCRIBE, DROP."
  type = map(object({
    permissions   = list(string)
    grant_options = optional(list(string), [])
  }))
  default = {}

  validation {
    condition = alltrue([
      for p in values(var.database_permissions) : alltrue([
        for perm in p.permissions :
        contains(["ALTER", "CREATE_TABLE", "DESCRIBE", "DROP"], perm)
      ])
    ])
    error_message = "database_permissions may only use ALTER, CREATE_TABLE, DESCRIBE, or DROP."
  }
}

variable "table_permissions" {
  description = "Map of principal ARN => whole-table permission set. Valid permissions: SELECT, INSERT, DELETE, ALTER, DROP, DESCRIBE."
  type = map(object({
    table_name    = string
    permissions   = list(string)
    grant_options = optional(list(string), [])
  }))
  default = {}

  validation {
    condition = alltrue([
      for p in values(var.table_permissions) : alltrue([
        for perm in p.permissions :
        contains(["SELECT", "INSERT", "DELETE", "ALTER", "DROP", "DESCRIBE"], perm)
      ])
    ])
    error_message = "table_permissions may only use SELECT, INSERT, DELETE, ALTER, DROP, or DESCRIBE."
  }
}

variable "column_permissions" {
  description = "Column-scoped SELECT grants. Provide either column_names (allow-list) or excluded_column_names (deny-list) per entry, not both."
  type = list(object({
    principal             = string
    table                 = string
    column_names          = optional(list(string), [])
    excluded_column_names = optional(list(string), [])
  }))
  default = []

  validation {
    condition = alltrue([
      for g in var.column_permissions :
      (length(g.column_names) > 0) != (length(g.excluded_column_names) > 0)
    ])
    error_message = "Each column_permissions entry must set exactly one of column_names or excluded_column_names."
  }
}
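
The validation blocks above can be exercised without touching AWS. The following is a sketch using Terraform's native test framework; the file name tests/validation.tftest.hcl, the example ARNs, and the database name are all hypothetical, and mock_provider needs Terraform 1.7 or later, which is newer than the >= 1.5.0 floor this module declares.

```hcl
# tests/validation.tftest.hcl (hypothetical file; assumes Terraform >= 1.7)

# Mock the AWS provider so planning needs no credentials.
mock_provider "aws" {}

run "rejects_select_on_database" {
  command = plan

  variables {
    s3_resource_arn = "arn:aws:s3:::example-lake/curated"
    database_name   = "example_db"

    database_permissions = {
      "arn:aws:iam::123456789012:role/example-analyst" = {
        permissions = ["SELECT"] # not a database-level permission
      }
    }
  }

  # The run passes only if this variable's validation fails.
  expect_failures = [var.database_permissions]
}

run "rejects_both_column_lists" {
  command = plan

  variables {
    s3_resource_arn = "arn:aws:s3:::example-lake/curated"
    database_name   = "example_db"

    column_permissions = [{
      principal             = "arn:aws:iam::123456789012:role/example-analyst"
      table                 = "orders"
      column_names          = ["order_id"]
      excluded_column_names = ["customer_email"]
    }]
  }

  expect_failures = [var.column_permissions]
}
```

Run from the module directory with terraform test. Each run block asserts that a bad input is caught at plan time rather than surfacing as a failed or empty grant at apply.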

outputs.tf

output "resource_id" {
  description = "ID of the registered Lake Formation resource (the S3 ARN)."
  value       = aws_lakeformation_resource.this.id
}

output "registered_arn" {
  description = "S3 ARN registered with Lake Formation."
  value       = aws_lakeformation_resource.this.arn
}

output "registration_role_arn" {
  description = "IAM role Lake Formation uses to vend credentials for the location (service-linked role when unset)."
  value       = aws_lakeformation_resource.this.role_arn
}

output "catalog_id" {
  description = "Catalog account ID the grants were applied against."
  value       = local.catalog_id
}

output "database_grant_principals" {
  description = "Principals granted database-level permissions."
  value       = keys(aws_lakeformation_permissions.database)
}

output "table_grant_principals" {
  description = "Principals granted whole-table permissions."
  value       = keys(aws_lakeformation_permissions.table)
}

output "column_grant_keys" {
  description = "principal:table keys for column-scoped SELECT grants."
  value       = keys(aws_lakeformation_permissions.columns)
}

How to use it

module "lake_formation" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"

  s3_resource_arn       = "arn:aws:s3:::kloudvin-datalake/curated/sales"
  registration_role_arn = aws_iam_role.lf_registration.arn
  hybrid_access_enabled = false

  database_name = "sales_curated"

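  # Discovery rights for everyone who touches the database.
  # Principal ARNs become for_each keys inside the module, so they must be known
  # at plan time: the referenced roles need to exist before this module is planned.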
  database_permissions = {
    (aws_iam_role.analyst.arn) = {
      permissions = ["DESCRIBE"]
    }
    (aws_iam_role.etl.arn) = {
      permissions = ["DESCRIBE", "CREATE_TABLE", "ALTER"]
    }
  }

  # The ETL role owns the table end to end.
  table_permissions = {
    (aws_iam_role.etl.arn) = {
      table_name  = "orders"
      permissions = ["SELECT", "INSERT", "ALTER", "DELETE"]
    }
  }

  # Analysts read orders, but never see PII columns.
  column_permissions = [
    {
      principal             = aws_iam_role.analyst.arn
      table                 = "orders"
      excluded_column_names = ["customer_email", "customer_phone"]
    }
  ]
}

# Downstream: an Athena workgroup for the governed dataset. Query results go to
# a separate results bucket; the tags reference the module outputs so the
# dependency on the registered location is explicit.
resource "aws_athena_workgroup" "sales" {
  name = "sales-analytics"

  configuration {
    result_configuration {
      output_location = "s3://kloudvin-athena-results/sales/"
    }
  }

  tags = {
    GovernedResource = module.lake_formation.registered_arn
    Catalog          = module.lake_formation.catalog_id
  }
}
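
The example above references aws_iam_role.lf_registration without defining it. One way such a role might look is sketched below; the role and policy names are hypothetical, and the statements assume the bucket layout from the example and no KMS encryption. If the prefix is encrypted with a customer-managed key, the role would also need permissions on that key.

```hcl
# Hypothetical registration role; names are illustrative, not part of the module.
data "aws_iam_policy_document" "lf_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    # Lake Formation assumes this role to vend credentials.
    principals {
      type        = "Service"
      identifiers = ["lakeformation.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "lf_data_access" {
  # Object-level access, limited to the registered prefix.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::kloudvin-datalake/curated/sales/*"]
  }

  # Listing, limited to keys under the same prefix.
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::kloudvin-datalake"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["curated/sales/*"]
    }
  }
}

resource "aws_iam_role" "lf_registration" {
  name               = "lf-registration-sales-curated"
  assume_role_policy = data.aws_iam_policy_document.lf_assume.json
}

resource "aws_iam_role_policy" "lf_data_access" {
  name   = "lf-data-access"
  role   = aws_iam_role.lf_registration.id
  policy = data.aws_iam_policy_document.lf_data_access.json
}
```

Scoping both statements to curated/sales keeps the role from reading raw or quarantine data that shares the bucket.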

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config — live/prod/lake_formation/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"
}

inputs = {
  s3_resource_arn = "..."
  database_name   = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/lake_formation && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply          # from the repo root instead: every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
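
The root config shown in step 1 declares only the backend. A provider block can be generated from the same file so that it also lives in one place. The snippet below is a sketch of that addition to live/terragrunt.hcl; the generated file name and the region are example values.

```hcl
# Hypothetical addition to live/terragrunt.hcl.
# Terragrunt writes provider.tf into each module's working directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

Because the module itself declares no provider block, the generated file does not collide with anything in the module source.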

Inputs

Name	Type	Default	Required	Description
`s3_resource_arn`	`string`	—	Yes	ARN of the S3 bucket/prefix to register with Lake Formation.
`registration_role_arn`	`string`	`null`	No	IAM role LF assumes to vend credentials; `null` uses the service-linked role.
`hybrid_access_enabled`	`bool`	`true`	No	Keep IAM/S3 access effective alongside LF grants; `false` enforces LF-only.
`catalog_id`	`string`	`null`	No	Glue Data Catalog account ID; defaults to the caller’s account.
`database_name`	`string`	—	Yes	Glue/Lake Formation database the grants target.
`database_permissions`	`map(object)`	`{}`	No	Principal ARN => database permission set (`ALTER`, `CREATE_TABLE`, `DESCRIBE`, `DROP`).
`table_permissions`	`map(object)`	`{}`	No	Principal ARN => whole-table permission set incl. `table_name`.
`column_permissions`	`list(object)`	`[]`	No	Column-scoped `SELECT` grants via allow-list or deny-list of columns.

Outputs

Name	Description
`resource_id`	ID of the registered Lake Formation resource (the S3 ARN).
`registered_arn`	S3 ARN registered with Lake Formation.
`registration_role_arn`	IAM role LF uses to vend credentials (service-linked role when unset).
`catalog_id`	Catalog account ID the grants were applied against.
`database_grant_principals`	Principals granted database-level permissions.
`table_grant_principals`	Principals granted whole-table permissions.
`column_grant_keys`	`principal:table` keys for column-scoped `SELECT` grants.

Enterprise scenario

A retail analytics platform stores curated sales data in s3://kloudvin-datalake/curated/sales, catalogued in Glue and queried by 40+ analysts through Athena. Finance analysts must aggregate revenue but are barred from seeing customer_email and customer_phone under the company’s PII policy. The data platform team instantiates this module once per data domain: the ETL role gets full table permissions to land nightly batches, while the analyst role gets a column-level SELECT grant that excludes the two PII fields — so a stray SELECT * in Athena simply returns no PII column rather than leaking it, and every grant is recorded centrally for the quarterly access review.

Best practices

Disable hybrid mode once migrated. Keep hybrid_access_enabled = true only while you transition off bucket policies; set it to false so Lake Formation is the single source of truth and stale IAM grants can’t bypass column controls.
Prefer column deny-lists for PII. Use excluded_column_names for sensitive fields so newly added non-PII columns are automatically visible to analysts, instead of an allow-list you must update on every schema change.
Grant the narrowest verb set. Database principals rarely need more than DESCRIBE; reserve ALTER/CREATE_TABLE for pipeline roles, and never hand permissions_with_grant_option to human users unless they genuinely re-delegate access.
Register prefixes, not whole buckets. Point s3_resource_arn at the curated/consumption prefix rather than the bucket root, so raw or quarantine zones under the same bucket stay outside this grant’s blast radius.
Use a dedicated registration role. Supply registration_role_arn with a least-privilege role scoped to the registered prefix and its KMS key, rather than the broad service-linked role, to bound which objects Lake Formation can vend credentials for.
Name databases by zone and domain. A sales_curated / sales_consumption convention keeps grants legible in audits and prevents accidental cross-zone permissions when the module is reused across the lake-house.
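
Several of these practices combine naturally when the module is instantiated once per zone. The sketch below assumes a hypothetical bucket named example-lake with curated and consumption prefixes for a sales domain, and Glue databases that already exist under the matching names.

```hcl
# Hypothetical root-module sketch: one module instance per lake-house zone.
locals {
  sales_zones = {
    curated     = "arn:aws:s3:::example-lake/curated/sales"
    consumption = "arn:aws:s3:::example-lake/consumption/sales"
  }
}

module "sales_zone" {
  for_each = local.sales_zones
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"

  # A prefix per zone, never the bucket root.
  s3_resource_arn = each.value

  # Yields sales_curated and sales_consumption.
  database_name = "sales_${each.key}"

  # Default is true; set to false per zone once its consumers have migrated.
  hybrid_access_enabled = true
}
```

Keying the map by zone name keeps the database naming convention and the registered prefix in step, so a new zone is a one-line addition.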