Terraform Module: AWS Lake Formation — Govern data-lake access with centralized, tag-aware permissions

Quick take — A reusable Terraform module that registers S3 locations with AWS Lake Formation and grants fine-grained, auditable database, table, and column permissions to IAM principals — without brittle bucket policies. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "lake_formation" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"

  s3_resource_arn = "..."  # ARN of the S3 bucket/prefix to register with Lake Forma…
  database_name   = "..."  # Glue/Lake Formation database the grants target.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS Lake Formation is a governance layer that sits on top of an S3-backed data lake and the AWS Glue Data Catalog. Instead of hand-crafting S3 bucket policies and IAM statements for every analyst, ETL job, and BI tool, you register the underlying S3 location with Lake Formation once, then grant database-, table-, and column-level permissions to IAM principals. Athena, Redshift Spectrum, EMR, and Glue all honour those grants, and every access decision is centralized and auditable.

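Two resources do the heavy lifting, and they are easy to get subtly wrong: aws_lakeformation_resource, which registers the S3 location and has to choose between a custom registration role and the service-linked role, and aws_lakeformation_permissions, which attaches database, table, and column grants to each principal.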

Wrapping this in a module gives every team a single, version-pinned way to onboard a data domain: register the bucket prefix, grant a read role and a write role with consistent, least-privilege permission sets, and emit the catalog IDs that downstream Athena workgroups and Glue jobs reference. The module hides the tri-state quirks and the service-linked-role decision behind validated variables.

When to use it

Skip it if your lake has a single trusted consumer and no column-level requirements — plain IAM may be simpler. Also note Lake Formation governs the catalog + S3 credential vending; it does not replace KMS encryption or VPC controls on the bucket itself.
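
Because Lake Formation leaves bucket-level encryption to you, a governed bucket would typically still carry its own SSE-KMS configuration. The following is a minimal sketch and not part of the module: the resource names and the key are hypothetical, and the bucket name is borrowed from the usage example later in this post.

```hcl
# Hypothetical customer-managed key for the data-lake bucket.
resource "aws_kms_key" "lake" {
  description         = "CMK for the data-lake bucket (illustrative)"
  enable_key_rotation = true
}

# Default encryption for the bucket that Lake Formation governs.
# Lake Formation controls who may read the data; this controls how it is stored.
resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
  bucket = "kloudvin-datalake" # assumed bucket name, taken from the usage example

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.lake.arn
    }
    bucket_key_enabled = true
  }
}
```

If the location is registered with a custom role, that role would also need permission to use the key (for example kms:Decrypt and kms:GenerateDataKey).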

Module structure

terraform-module-aws-lake-formation/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Lake Formation grant resources require the account that owns the catalog.
  catalog_id = coalesce(var.catalog_id, data.aws_caller_identity.current.account_id)

  # Normalise principal -> permission-set maps into a flat list we can for_each.
  database_grants = [
    for principal, perms in var.database_permissions : {
      principal = principal
      perms     = perms
    }
  ]

  table_grants = [
    for principal, perms in var.table_permissions : {
      principal = principal
      perms     = perms
    }
  ]

  # Column-scoped grants: each entry pins a principal to an explicit column list.
  column_grants = {
    for grant in var.column_permissions :
    "${grant.principal}:${grant.table}" => grant
  }
}

data "aws_caller_identity" "current" {}

# ---------------------------------------------------------------------------
# Register the S3 location with Lake Formation so it can vend credentials and
# enforce catalog permissions on objects under this prefix.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_resource" "this" {
  arn = var.s3_resource_arn

  # When role_arn is null, Lake Formation uses its service-linked role.
  role_arn                = var.registration_role_arn
  use_service_linked_role = var.registration_role_arn == null

  # Hybrid mode keeps existing IAM/S3 access working alongside LF permissions;
  # set false once you are ready to enforce LF-only access.
  hybrid_access_enabled = var.hybrid_access_enabled
}

# ---------------------------------------------------------------------------
# Database-level grants (e.g. DESCRIBE, CREATE_TABLE) for catalog discovery.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_permissions" "database" {
  for_each = { for g in local.database_grants : g.principal => g }

  principal                     = each.value.principal
  permissions                   = each.value.perms.permissions
  permissions_with_grant_option = each.value.perms.grant_options
  catalog_id                    = local.catalog_id

  database {
    name       = var.database_name
    catalog_id = local.catalog_id
  }

  # Ensure the location is registered before we hand out access to it.
  depends_on = [aws_lakeformation_resource.this]
}

# ---------------------------------------------------------------------------
# Whole-table grants (SELECT / INSERT / ALTER / DELETE / DROP) for ETL roles
# and consumers that need every column.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_permissions" "table" {
  for_each = { for g in local.table_grants : g.principal => g }

  principal                     = each.value.principal
  permissions                   = each.value.perms.permissions
  permissions_with_grant_option = each.value.perms.grant_options
  catalog_id                    = local.catalog_id

  table {
    database_name = var.database_name
    name          = each.value.perms.table_name
    catalog_id    = local.catalog_id
  }

  depends_on = [aws_lakeformation_resource.this]
}

# ---------------------------------------------------------------------------
# Column-level grants: SELECT on an explicit allow-list of columns, used to
# hide PII from analysts while still exposing the rest of the table.
# ---------------------------------------------------------------------------
resource "aws_lakeformation_permissions" "columns" {
  for_each = local.column_grants

  principal   = each.value.principal
  # column_permissions entries carry no permissions attribute; these grants are SELECT-only.
  permissions = ["SELECT"]
  catalog_id  = local.catalog_id

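  table_with_columns {
    database_name = var.database_name
    name          = each.value.table
    catalog_id    = local.catalog_id

    # Allow-list: grant on exactly these columns.
    column_names = length(each.value.column_names) > 0 ? each.value.column_names : null

    # Deny-list: the provider expects wildcard = true alongside excluded_column_names.
    wildcard              = length(each.value.excluded_column_names) > 0 ? true : null
    excluded_column_names = length(each.value.excluded_column_names) > 0 ? each.value.excluded_column_names : null
  }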

  depends_on = [aws_lakeformation_resource.this]
}

variables.tf

variable "s3_resource_arn" {
  description = "ARN of the S3 bucket or prefix to register with Lake Formation (e.g. arn:aws:s3:::my-lake/curated)."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:s3:::", var.s3_resource_arn))
    error_message = "s3_resource_arn must be a valid S3 ARN starting with arn:aws:s3:::."
  }
}

variable "registration_role_arn" {
  description = "IAM role ARN Lake Formation assumes to vend credentials for the location. Leave null to use the LF service-linked role."
  type        = string
  default     = null

  validation {
    condition     = var.registration_role_arn == null || can(regex("^arn:aws[a-z-]*:iam::[0-9]{12}:role/", var.registration_role_arn))
    error_message = "registration_role_arn must be null or a valid IAM role ARN."
  }
}

variable "hybrid_access_enabled" {
  description = "Keep existing IAM/S3 permissions effective alongside Lake Formation grants. Set false to enforce LF-only access."
  type        = bool
  default     = true
}

variable "catalog_id" {
  description = "Glue Data Catalog account ID owning the database. Defaults to the caller's account."
  type        = string
  default     = null

  validation {
    condition     = var.catalog_id == null || can(regex("^[0-9]{12}$", var.catalog_id))
    error_message = "catalog_id must be null or a 12-digit AWS account ID."
  }
}

variable "database_name" {
  description = "Name of the Glue/Lake Formation database these grants apply to."
  type        = string
}

variable "database_permissions" {
  description = "Map of principal ARN => database-level permission set. Valid permissions: ALTER, CREATE_TABLE, DESCRIBE, DROP."
  type = map(object({
    permissions   = list(string)
    grant_options = optional(list(string), [])
  }))
  default = {}

  validation {
    condition = alltrue([
      for p in values(var.database_permissions) : alltrue([
        for perm in p.permissions :
        contains(["ALTER", "CREATE_TABLE", "DESCRIBE", "DROP"], perm)
      ])
    ])
    error_message = "database_permissions may only use ALTER, CREATE_TABLE, DESCRIBE, or DROP."
  }
}

variable "table_permissions" {
  description = "Map of principal ARN => whole-table permission set. Valid permissions: SELECT, INSERT, DELETE, ALTER, DROP, DESCRIBE."
  type = map(object({
    table_name    = string
    permissions   = list(string)
    grant_options = optional(list(string), [])
  }))
  default = {}

  validation {
    condition = alltrue([
      for p in values(var.table_permissions) : alltrue([
        for perm in p.permissions :
        contains(["SELECT", "INSERT", "DELETE", "ALTER", "DROP", "DESCRIBE"], perm)
      ])
    ])
    error_message = "table_permissions may only use SELECT, INSERT, DELETE, ALTER, DROP, or DESCRIBE."
  }
}

variable "column_permissions" {
  description = "Column-scoped SELECT grants. Provide either column_names (allow-list) or excluded_column_names (deny-list) per entry, not both."
  type = list(object({
    principal             = string
    table                 = string
    column_names          = optional(list(string), [])
    excluded_column_names = optional(list(string), [])
  }))
  default = []

  validation {
    condition = alltrue([
      for g in var.column_permissions :
      (length(g.column_names) > 0) != (length(g.excluded_column_names) > 0)
    ])
    error_message = "Each column_permissions entry must set exactly one of column_names or excluded_column_names."
  }
}
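
The validation rules above can be exercised without touching AWS. The following is a sketch using Terraform's native test framework; the file name tests/validation.tftest.hcl, the placeholder ARNs, and the run names are all hypothetical. It assumes Terraform 1.7 or later for mock_provider, which is newer than the module's own >= 1.5.0 constraint.

```hcl
# tests/validation.tftest.hcl (hypothetical file name)

# Mock the AWS provider so no credentials or real API calls are needed.
mock_provider "aws" {}

# Valid baseline inputs shared by every run block.
variables {
  s3_resource_arn = "arn:aws:s3:::example-lake/curated"
  database_name   = "example_db"
}

# An ARN that is not an S3 ARN should trip the s3_resource_arn validation.
run "rejects_non_s3_arn" {
  command = plan

  variables {
    s3_resource_arn = "arn:aws:iam::123456789012:role/not-a-bucket"
  }

  expect_failures = [var.s3_resource_arn]
}

# Setting both an allow-list and a deny-list should trip the
# "exactly one of" validation on column_permissions.
run "rejects_both_column_lists" {
  command = plan

  variables {
    column_permissions = [
      {
        principal             = "arn:aws:iam::123456789012:role/example-analyst"
        table                 = "orders"
        column_names          = ["order_id"]
        excluded_column_names = ["customer_email"]
      }
    ]
  }

  expect_failures = [var.column_permissions]
}
```

Running terraform test from the module directory would pick this file up; each run passes only if the named variable's validation fails.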

outputs.tf

output "resource_id" {
  description = "ID of the registered Lake Formation resource (the S3 ARN)."
  value       = aws_lakeformation_resource.this.id
}

output "registered_arn" {
  description = "S3 ARN registered with Lake Formation."
  value       = aws_lakeformation_resource.this.arn
}

output "registration_role_arn" {
  description = "IAM role Lake Formation uses to vend credentials for the location (service-linked role when unset)."
  value       = aws_lakeformation_resource.this.role_arn
}

output "catalog_id" {
  description = "Catalog account ID the grants were applied against."
  value       = local.catalog_id
}

output "database_grant_principals" {
  description = "Principals granted database-level permissions."
  value       = keys(aws_lakeformation_permissions.database)
}

output "table_grant_principals" {
  description = "Principals granted whole-table permissions."
  value       = keys(aws_lakeformation_permissions.table)
}

output "column_grant_keys" {
  description = "principal:table keys for column-scoped SELECT grants."
  value       = keys(aws_lakeformation_permissions.columns)
}

How to use it

module "lake_formation" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"

  s3_resource_arn       = "arn:aws:s3:::kloudvin-datalake/curated/sales"
  registration_role_arn = aws_iam_role.lf_registration.arn
  hybrid_access_enabled = false

  database_name = "sales_curated"

  # Discovery rights for everyone who touches the database.
  database_permissions = {
    (aws_iam_role.analyst.arn) = {
      permissions = ["DESCRIBE"]
    }
    (aws_iam_role.etl.arn) = {
      permissions = ["DESCRIBE", "CREATE_TABLE", "ALTER"]
    }
  }

  # The ETL role owns the table end to end.
  table_permissions = {
    (aws_iam_role.etl.arn) = {
      table_name  = "orders"
      permissions = ["SELECT", "INSERT", "ALTER", "DELETE"]
    }
  }

  # Analysts read orders, but never see PII columns.
  column_permissions = [
    {
      principal             = aws_iam_role.analyst.arn
      table                 = "orders"
      excluded_column_names = ["customer_email", "customer_phone"]
    }
  ]
}

# Downstream: an Athena workgroup for this domain. Tagging it with the module
# outputs makes the dependency on the governed location explicit.
resource "aws_athena_workgroup" "sales" {
  name = "sales-analytics"

  configuration {
    result_configuration {
      output_location = "s3://kloudvin-athena-results/sales/"
    }
  }

  tags = {
    GovernedResource = module.lake_formation.registered_arn
    Catalog          = module.lake_formation.catalog_id
  }
}
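
The example above references aws_iam_role.lf_registration without defining it. A minimal sketch of such a registration role might look like the following; the role and policy names are hypothetical, and the S3 actions shown are an assumption about what a read/write lake location needs rather than a definitive policy.

```hcl
# Trust policy: let the Lake Formation service assume the role.
data "aws_iam_policy_document" "lf_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lakeformation.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "lf_registration" {
  name               = "lf-registration-sales" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.lf_assume.json
}

# Data access on the registered prefix only (assumed action set).
data "aws_iam_policy_document" "lf_s3_access" {
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::kloudvin-datalake/curated/sales/*"]
  }

  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::kloudvin-datalake"]
  }
}

resource "aws_iam_role_policy" "lf_s3_access" {
  name   = "lf-s3-access" # hypothetical name
  role   = aws_iam_role.lf_registration.id
  policy = data.aws_iam_policy_document.lf_s3_access.json
}
```

One caveat on the analyst and ETL roles: the module uses principal ARNs as for_each keys, so if those roles are created in the same apply as the module, their ARNs are unknown at plan time and Terraform can reject the plan. Creating the roles in a separate stack, or applying them first, is one way around that.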

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config, live/prod/lake_formation/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"
}

inputs = {
  s3_resource_arn = "..."
  database_name = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/lake_formation && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)          # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
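
The root config shown earlier only generates the backend. To keep the provider in one place as well, the root terragrunt.hcl could carry a generate block along these lines. This is a sketch: the region is copied from the Quickstart, and the generated file name is an assumption.

```hcl
# live/terragrunt.hcl (addition to the root config)
# Writes provider.tf into every module's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf" # assumed file name
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

With if_exists set to overwrite_terragrunt, Terragrunt replaces only files it generated itself, so a hand-written provider.tf in a module would cause an error instead of being silently overwritten.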

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| s3_resource_arn | string | | Yes | ARN of the S3 bucket/prefix to register with Lake Formation. |
| registration_role_arn | string | null | No | IAM role LF assumes to vend credentials; null uses the service-linked role. |
| hybrid_access_enabled | bool | true | No | Keep IAM/S3 access effective alongside LF grants; false enforces LF-only. |
| catalog_id | string | null | No | Glue Data Catalog account ID; defaults to the caller’s account. |
| database_name | string | | Yes | Glue/Lake Formation database the grants target. |
| database_permissions | map(object) | {} | No | Principal ARN => database permission set (ALTER, CREATE_TABLE, DESCRIBE, DROP). |
| table_permissions | map(object) | {} | No | Principal ARN => whole-table permission set incl. table_name. |
| column_permissions | list(object) | [] | No | Column-scoped SELECT grants via allow-list or deny-list of columns. |

Outputs

| Name | Description |
| --- | --- |
| resource_id | ID of the registered Lake Formation resource (the S3 ARN). |
| registered_arn | S3 ARN registered with Lake Formation. |
| registration_role_arn | IAM role LF uses to vend credentials (service-linked role when unset). |
| catalog_id | Catalog account ID the grants were applied against. |
| database_grant_principals | Principals granted database-level permissions. |
| table_grant_principals | Principals granted whole-table permissions. |
| column_grant_keys | principal:table keys for column-scoped SELECT grants. |

Enterprise scenario

A retail analytics platform stores curated sales data in s3://kloudvin-datalake/curated/sales, catalogued in Glue and queried by 40+ analysts through Athena. Finance analysts must aggregate revenue but are barred from seeing customer_email and customer_phone under the company’s PII policy. The data platform team instantiates this module once per data domain: the ETL role gets full table permissions to land nightly batches, while the analyst role gets a column-level SELECT grant that excludes the two PII fields — so a stray SELECT * in Athena simply returns no PII column rather than leaking it, and every grant is recorded centrally for the quarterly access review.
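
Instantiating the module once per data domain can be expressed with a module-level for_each. The sketch below is an assumption about how the platform team might lay this out, not something the scenario prescribes: the domains map, the analyst_role_arn variable, and the table name are hypothetical, and only the sales domain comes from the scenario itself.

```hcl
variable "analyst_role_arn" {
  description = "ARN of the existing analyst role (hypothetical input)."
  type        = string
}

locals {
  # Hypothetical domain map; add one entry per data domain.
  domains = {
    sales = {
      prefix      = "curated/sales"
      database    = "sales_curated"
      table       = "orders"
      pii_columns = ["customer_email", "customer_phone"]
    }
  }
}

module "lake_formation" {
  for_each = local.domains
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-lake-formation?ref=v1.0.0"

  s3_resource_arn = "arn:aws:s3:::kloudvin-datalake/${each.value.prefix}"
  database_name   = each.value.database

  # Analysts can query the table, minus the domain's PII columns.
  column_permissions = [
    {
      principal             = var.analyst_role_arn
      table                 = each.value.table
      excluded_column_names = each.value.pii_columns
    }
  ]
}
```

Outputs are then addressed per domain, for example module.lake_formation["sales"].registered_arn.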

Best practices
