Terraform Module: GCP Data Catalog — a governed entry group with custom fileset entries and reader IAM in one module

Quick take — A reusable hashicorp/google Terraform module for Data Catalog: provision an entry group with deletion policy, register custom fileset/user-specified entries with GCS file patterns and JSON schemas, and grant viewer IAM from typed, validated variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "data_catalog" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"

  project_id     = "..."  # GCP project ID that owns the entry group, entries and I…
  region         = "..."  # Region for the entry group (e.g. `asia-south1`, `us`, `…
  entry_group_id = "..."  # Entry group ID; letter/underscore start, letters/number…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Data Catalog is Google Cloud’s managed metadata catalog — the searchable index that sits in front of your data estate so analysts can find a table, a Cloud Storage fileset, or a Pub/Sub topic without knowing which project or bucket it lives in. It auto-ingests metadata for “integrated systems” (BigQuery datasets/tables and Pub/Sub topics show up on their own), but the assets that aren’t integrated — a partitioned Parquet fileset in GCS, an on-prem Oracle table, a Kafka topic — only appear if you register them yourself as entries inside an entry group.

The entry group is the unit you actually own and manage. It’s a regional container (projects/P/locations/REGION/entryGroups/ID) that scopes IAM and holds your hand-registered entries, and it is the first thing every data-platform team re-creates by hand: pick entry_group_id, a region, a deletion_policy, then bolt on the same handful of google_data_catalog_entry resources and the same roles/datacatalog.viewer grants for the BI group. The name it exports (projects/.../entryGroups/...) is also the parent reference every entry and IAM binding needs, so wiring three raw resources together correctly is fiddly and easy to get subtly wrong (passing the short id where the URL-format name is required is the classic mistake).
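To make that classic mistake concrete, here is a minimal sketch of the raw resources wired by hand, without the module. The resource labels (raw, orders) and the bucket name are placeholders chosen for illustration, not names from the module.

```hcl
resource "google_data_catalog_entry_group" "raw" {
  entry_group_id = "lakehouse_curated"
  region         = "asia-south1"
}

resource "google_data_catalog_entry" "orders" {
  # Wrong: the short ID is not a valid parent reference for an entry.
  # entry_group = google_data_catalog_entry_group.raw.entry_group_id

  # Right: the URL-format name (projects/P/locations/REGION/entryGroups/ID).
  entry_group = google_data_catalog_entry_group.raw.name
  entry_id    = "orders_fileset"

  type = "FILESET"
  gcs_fileset_spec {
    file_patterns = ["gs://example-bucket/orders/*.parquet"] # placeholder bucket
  }
}
```

Both attributes are strings, so Terraform itself will not flag the wrong one; the error would only surface when the API rejects the parent at apply time.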

This module wraps google_data_catalog_entry_group, a for_each map of google_data_catalog_entry (covering both the FILESET enum type with gcs_fileset_spec.file_patterns and arbitrary user_specified_type/user_specified_system entries with a JSON schema), and google_data_catalog_entry_group_iam_member viewer bindings behind typed, validated variables. A consuming team passes intent — “an asia-south1 entry group called lakehouse_curated, two GCS filesets registered, BI analysts can view it” — and gets a correct, governed catalog surface every time.

When to use it

Module structure

terraform-module-gcp-data-catalog/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # entry group + entries (fileset / user-specified) + viewer IAM
├── variables.tf     # typed, validated inputs
└── outputs.tf       # entry group id/name + entry names/ids map

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # The entry group's URL-format name is the parent every entry and IAM
  # binding must reference (NOT the short entry_group_id). Computed once.
  entry_group_name = google_data_catalog_entry_group.this.name

  # De-duplicated viewer principals, expanded to one IAM member each.
  viewer_members = { for m in distinct(var.viewer_members) : m => m }
}

resource "google_data_catalog_entry_group" "this" {
  project        = var.project_id
  region         = var.region
  entry_group_id = var.entry_group_id

  display_name = coalesce(var.display_name, var.entry_group_id)
  description  = var.description

  # DELETE allows `terraform destroy`; set ABANDON to keep the group in the
  # catalog and only drop it from state (useful for shared, long-lived groups).
  deletion_policy = var.deletion_policy
}

resource "google_data_catalog_entry" "this" {
  for_each = var.entries

  entry_group = local.entry_group_name
  entry_id    = each.key

  display_name    = each.value.display_name
  description     = each.value.description
  linked_resource = each.value.linked_resource

  # An entry is EITHER a typed FILESET, OR a fully user-specified entry.
  # FILESET is the only EntryType enum value accepted for 'type' here;
  # everything else (Kafka, Oracle, a custom system) is modelled via
  # user_specified_*.
  type                  = each.value.type
  user_specified_type   = each.value.user_specified_type
  user_specified_system = each.value.user_specified_system

  # JSON-encoded column schema (file()/jsonencode()), shown in catalog search.
  schema = each.value.schema

  # Required for FILESET entries: the GCS glob(s) that make up the fileset.
  dynamic "gcs_fileset_spec" {
    for_each = each.value.gcs_file_patterns == null ? [] : [each.value.gcs_file_patterns]
    content {
      file_patterns = gcs_fileset_spec.value
    }
  }
}

# Read access to the entry group and the entries inside it.
resource "google_data_catalog_entry_group_iam_member" "viewer" {
  for_each = local.viewer_members

  project     = google_data_catalog_entry_group.this.project
  region      = google_data_catalog_entry_group.this.region
  entry_group = local.entry_group_name
  role        = "roles/datacatalog.viewer"
  member      = each.value
}

variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID that owns the entry group, its entries and IAM bindings."
}

variable "region" {
  type        = string
  description = "Region the entry group lives in (e.g. 'asia-south1', 'us', 'eu'). Entry groups are regional; co-locate with the data you are cataloguing."

  validation {
    condition     = length(var.region) > 0
    error_message = "region must be set (e.g. asia-south1)."
  }
}

variable "entry_group_id" {
  type        = string
  description = "Entry group ID. Must begin with a letter or underscore, contain only letters, numbers and underscores, and be at most 64 chars."

  validation {
    condition     = can(regex("^[A-Za-z_][A-Za-z0-9_]{0,63}$", var.entry_group_id))
    error_message = "entry_group_id must start with a letter or underscore, contain only letters/numbers/underscores, and be at most 64 chars (no hyphens or dots)."
  }
}

variable "display_name" {
  type        = string
  default     = null
  description = "Human-readable name shown in the console. Defaults to entry_group_id."
}

variable "description" {
  type        = string
  default     = null
  description = "Free-text description of what this entry group catalogues."
}

variable "deletion_policy" {
  type        = string
  default     = "DELETE"
  description = "DELETE lets `terraform destroy` remove the entry group; ABANDON drops it from state but leaves it in the catalog. Use ABANDON for shared, long-lived groups."

  validation {
    condition     = contains(["DELETE", "ABANDON"], var.deletion_policy)
    error_message = "deletion_policy must be either DELETE or ABANDON."
  }
}

variable "entries" {
  description = <<-EOT
    Map of entry_id => entry settings registered inside the entry group.
    Each entry is EITHER a FILESET (set type = "FILESET" and gcs_file_patterns),
    OR a user-specified asset (set user_specified_type, and optionally
    user_specified_system, e.g. type=null, user_specified_type="kafka_topic").
    Do not set both 'type' and 'user_specified_type' on the same entry.
    'user_specified_system' is only valid alongside 'user_specified_type'.
  EOT
  default     = {}
  type = map(object({
    display_name          = optional(string)
    description           = optional(string)
    linked_resource       = optional(string)        # full resource name the entry points at
    type                  = optional(string)         # only "FILESET" is accepted as an enum
    user_specified_type   = optional(string)         # custom type, e.g. "kafka_topic"
    user_specified_system = optional(string)         # custom source system, e.g. "on_prem_oracle"
    schema                = optional(string)         # JSON-encoded column schema
    gcs_file_patterns     = optional(list(string))   # required for FILESET entries
  }))

  # entry_id naming rules (same constraints as entry_group_id).
  validation {
    condition = alltrue([
      for k in keys(var.entries) :
      can(regex("^[A-Za-z_][A-Za-z0-9_]{0,63}$", k))
    ])
    error_message = "Each entry_id must start with a letter or underscore, contain only letters/numbers/underscores, and be at most 64 chars."
  }

  # 'type' (enum) and 'user_specified_type' are mutually exclusive.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      !(e.type != null && e.user_specified_type != null)
    ])
    error_message = "An entry may set 'type' OR 'user_specified_type', not both."
  }

  # At least one of the two type fields must be set; combined with the
  # mutual-exclusion rule above, that means exactly one per entry.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      (e.type != null || e.user_specified_type != null)
    ])
    error_message = "Each entry must set exactly one of 'type' (e.g. FILESET) or 'user_specified_type'."
  }

  # The only EntryType enum the API accepts on create here is FILESET.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.type == null ? true : e.type == "FILESET"
    ])
    error_message = "When 'type' is set it must be \"FILESET\" (the only EntryType enum allowed on a Terraform-created entry)."
  }

  # FILESET entries require at least one GCS file pattern.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.type == "FILESET" ? (e.gcs_file_patterns != null && length(coalesce(e.gcs_file_patterns, [])) > 0) : true
    ])
    error_message = "Every FILESET entry must set a non-empty 'gcs_file_patterns' list (e.g. [\"gs://bucket/prefix/*.parquet\"])."
  }

  # gcs_file_patterns only makes sense on a FILESET entry.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.gcs_file_patterns == null ? true : e.type == "FILESET"
    ])
    error_message = "'gcs_file_patterns' may only be set on entries with type = \"FILESET\"."
  }

  # user_specified_system is meaningless without user_specified_type.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.user_specified_system == null ? true : e.user_specified_type != null
    ])
    error_message = "'user_specified_system' may only be set together with 'user_specified_type'."
  }
}

variable "viewer_members" {
  type        = list(string)
  default     = []
  description = "Principals granted roles/datacatalog.viewer on the entry group (read/search the entries), e.g. [\"group:bi-analysts@kloudvin.com\", \"serviceAccount:lineage@kloudvin-prod.iam.gserviceaccount.com\"]."

  validation {
    condition     = alltrue([for m in var.viewer_members : can(regex("^(user|group|serviceAccount|domain):", m))])
    error_message = "Each viewer member must be user:, group:, serviceAccount:, or domain: — wildcards (allUsers/allAuthenticatedUsers) are not allowed."
  }
}
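The validation rules above can be exercised without touching GCP. The following is a sketch of a native Terraform test file, assuming Terraform 1.7 or later (newer than the module's own 1.5.0 pin, because mock_provider arrived in 1.7). The file name, project ID and bucket are hypothetical.

```hcl
# tests/entries_validation.tftest.hcl (hypothetical file name)
# mock_provider stands in for the real provider, so no credentials are needed.
mock_provider "google" {}

# Shared inputs for every run block; all values are placeholders.
variables {
  project_id     = "example-project"
  region         = "us-central1"
  entry_group_id = "validation_demo"
}

run "rejects_fileset_without_patterns" {
  command = plan

  variables {
    entries = {
      orders = {
        type = "FILESET" # gcs_file_patterns deliberately missing
      }
    }
  }

  # The run passes only if validation on var.entries fails.
  expect_failures = [var.entries]
}

run "rejects_both_type_fields" {
  command = plan

  variables {
    entries = {
      orders = {
        type                = "FILESET"
        user_specified_type = "kafka_topic"
        gcs_file_patterns   = ["gs://example-bucket/orders/*.parquet"]
      }
    }
  }

  expect_failures = [var.entries]
}

run "accepts_user_specified_entry" {
  command = plan

  variables {
    entries = {
      legacy = {
        user_specified_type   = "oracle_table"
        user_specified_system = "on_prem_oracle"
      }
    }
  }
}
```

Run it with terraform test from the module directory. Each expect_failures run documents one rule, so a future change that loosens a validation shows up as a failing test.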

outputs.tf

output "entry_group_id" {
  description = "Short entry group ID (e.g. lakehouse_curated)."
  value       = google_data_catalog_entry_group.this.entry_group_id
}

output "entry_group_name" {
  description = "URL-format resource name (projects/P/locations/REGION/entryGroups/ID). Use this as the parent for additional entries or tags."
  value       = google_data_catalog_entry_group.this.name
}

output "entry_group_resource_id" {
  description = "Terraform resource id of the entry group ({{name}})."
  value       = google_data_catalog_entry_group.this.id
}

output "region" {
  description = "Region the entry group and its entries live in."
  value       = google_data_catalog_entry_group.this.region
}

output "entry_names" {
  description = "Map of entry_id => URL-format entry resource name (projects/.../entryGroups/.../entries/...)."
  value       = { for k, e in google_data_catalog_entry.this : k => e.name }
}

output "entry_ids" {
  description = "Map of entry_id => Terraform resource id of the entry."
  value       = { for k, e in google_data_catalog_entry.this : k => e.id }
}

output "entry_integrated_systems" {
  description = "Map of entry_id => integrated_system the entry resolves to (empty for user-specified entries)."
  value       = { for k, e in google_data_catalog_entry.this : k => e.integrated_system }
}

How to use it

A curated lakehouse entry group in asia-south1 that registers two GCS filesets (orders and clickstream Parquet) plus one user-specified on-prem Oracle table, with the BI group and the lineage service account granted viewer:

module "data_catalog" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"

  project_id     = "kloudvin-analytics-prod"
  region         = "asia-south1"
  entry_group_id = "lakehouse_curated"
  display_name   = "Lakehouse (Curated)"
  description    = "Catalogued curated assets for the analytics domain: GCS filesets + external sources."

  # Shared, long-lived group — drop from state on destroy, don't delete it.
  deletion_policy = "ABANDON"

  entries = {
    orders_fileset = {
      display_name    = "Orders (curated Parquet)"
      description     = "One Parquet object per daily partition under the curated orders prefix."
      type            = "FILESET"
      linked_resource = "//storage.googleapis.com/kloudvin-curated/orders"
      gcs_file_patterns = [
        "gs://kloudvin-curated/orders/dt=*/*.parquet",
      ]
      schema = file("${path.module}/schemas/orders.json")
    }

    clickstream_fileset = {
      display_name      = "Clickstream events (curated Parquet)"
      description       = "Hourly clickstream Parquet partitions."
      type              = "FILESET"
      gcs_file_patterns = ["gs://kloudvin-curated/clickstream/*/*.parquet"]
      schema            = file("${path.module}/schemas/clickstream.json")
    }

    legacy_orders_oracle = {
      display_name          = "Legacy orders (on-prem Oracle)"
      description           = "Reference to the soon-to-be-migrated Oracle ORDERS table."
      user_specified_type   = "oracle_table"
      user_specified_system = "on_prem_oracle"
      schema                = file("${path.module}/schemas/orders.json")
    }
  }

  viewer_members = [
    "group:bi-analysts@kloudvin.com",
    "serviceAccount:lineage@kloudvin-analytics-prod.iam.gserviceaccount.com",
  ]
}

# Downstream: anchor a Dataplex/data-lineage or governance workflow on the
# fileset entry by its catalogued resource name (no copy-pasted path).
# Assumes a google_data_catalog_tag_template.sensitivity resource with an
# enum field "classification" is defined elsewhere in this stack.
resource "google_data_catalog_tag" "orders_sensitivity" {
  parent   = module.data_catalog.entry_names["orders_fileset"]
  template = google_data_catalog_tag_template.sensitivity.id

  fields {
    field_name = "classification"
    enum_value = "INTERNAL"
  }
}

# Hand the entry group name to another stack (e.g. a search/lineage job).
output "curated_entry_group" {
  description = "Entry group name to register further entries or attach tags."
  value       = module.data_catalog.entry_group_name
}
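The tag in the example references google_data_catalog_tag_template.sensitivity, which the listing does not define. A minimal sketch of such a template follows. The template ID, display names and allowed values are assumptions for illustration, and placing it in the same project and region as the entry group is a choice made here, not a requirement stated by the module.

```hcl
resource "google_data_catalog_tag_template" "sensitivity" {
  project         = "kloudvin-analytics-prod"
  region          = "asia-south1"
  tag_template_id = "sensitivity" # hypothetical ID
  display_name    = "Sensitivity"

  # Enum field matching field_name = "classification" on the tag.
  fields {
    field_id     = "classification"
    display_name = "Classification"
    is_required  = true

    type {
      enum_type {
        allowed_values { display_name = "PUBLIC" }
        allowed_values { display_name = "INTERNAL" }
        allowed_values { display_name = "CONFIDENTIAL" }
      }
    }
  }
}
```

The enum_value set on the tag has to match one of the allowed_values display names, which is why "INTERNAL" appears in both places.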

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config in live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config in live/prod/data_catalog/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"
}

inputs = {
  project_id = "..."
  region = "..."
  entry_group_id = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/data_catalog && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)        # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
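The root config shown in step 1 covers only the backend. One way to keep the provider in the same place is a generate block in the root terragrunt.hcl, sketched below. The project and region values are placeholders, and the generated file name provider.tf is a convention rather than something this module requires.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "google" {
      project = "my-project"  # placeholder
      region  = "us-central1" # placeholder
    }
  EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module source itself stays free of provider configuration.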

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | (none) | yes | GCP project ID that owns the entry group, entries and IAM. |
| region | string | (none) | yes | Region for the entry group (e.g. asia-south1, us, eu). |
| entry_group_id | string | (none) | yes | Entry group ID; letter/underscore start, letters/numbers/underscores, ≤64 chars. |
| display_name | string | null | no | Console display name; defaults to entry_group_id. |
| description | string | null | no | Free-text description of the entry group. |
| deletion_policy | string | "DELETE" | no | DELETE (destroy removes it) or ABANDON (drop from state only). |
| entries | map(object) | {} | no | Entries to register: FILESET (type + gcs_file_patterns) or user-specified (user_specified_type / user_specified_system), with optional schema and linked_resource. |
| viewer_members | list(string) | [] | no | Principals granted roles/datacatalog.viewer on the entry group. |

Outputs

| Name | Description |
| --- | --- |
| entry_group_id | Short entry group ID (e.g. lakehouse_curated). |
| entry_group_name | URL-format resource name (projects/P/locations/REGION/entryGroups/ID); the parent for entries/tags. |
| entry_group_resource_id | Terraform resource id of the entry group. |
| region | Region the entry group and entries live in. |
| entry_names | Map of entry_id => URL-format entry resource name. |
| entry_ids | Map of entry_id => Terraform resource id of the entry. |
| entry_integrated_systems | Map of entry_id => resolved integrated_system (empty for user-specified). |
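The module only manages viewer bindings, so any other role has to be granted outside it. A sketch of how the entry_group_name output could anchor such a binding is shown below. The steward group is hypothetical, and roles/datacatalog.entryGroupOwner is used as an example of a broader predefined role; check it against your own access model before adopting it.

```hcl
# Outside the module: grant a data-steward group ownership of the entry group.
resource "google_data_catalog_entry_group_iam_member" "steward" {
  entry_group = module.data_catalog.entry_group_name
  role        = "roles/datacatalog.entryGroupOwner"
  member      = "group:data-stewards@example.com" # hypothetical principal
}
```

Because the iam_member resource is additive, this binding sits alongside the viewer grants the module creates rather than replacing them.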

Enterprise scenario

A retail analytics platform runs a GCS-backed lakehouse where curated orders and clickstream datasets are thousands of daily/hourly Parquet objects under bucket prefixes — invisible to Data Catalog because GCS filesets are not an integrated system. The data-platform team deploys this module once per environment: it creates the lakehouse_curated entry group in asia-south1, registers each prefix as a FILESET entry whose gcs_file_patterns glob the partitions and whose JSON schema surfaces columns in catalog search, and adds a legacy_orders_oracle user_specified_type entry so the about-to-be-migrated source still appears in lineage. The bi-analysts Google Group and the lineage service account get roles/datacatalog.viewer via viewer_members, deletion_policy = "ABANDON" protects the shared group from a stray destroy, and the whole catalog surface is reviewed in one pull request and reproduced identically in staging.
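The scenario relies on a JSON schema per entry, but the schema files themselves are not shown. As a sketch, the same JSON can be built inline with jsonencode() in the Data Catalog column format (a top-level columns list with column, type, mode and description). The column names below are invented for illustration and are not the real orders schema.

```hcl
locals {
  # Hypothetical columns; replace with the actual dataset schema.
  orders_schema = jsonencode({
    columns = [
      {
        column      = "order_id"
        type        = "STRING"
        mode        = "REQUIRED"
        description = "Order identifier"
      },
      {
        column      = "order_total"
        type        = "DOUBLE"
        mode        = "NULLABLE"
        description = "Order value in the store currency"
      },
      {
        column      = "dt"
        type        = "STRING"
        mode        = "REQUIRED"
        description = "Partition date taken from the object path"
      },
    ]
  })
}
```

An entry would then set schema = local.orders_schema instead of reading a file. Keeping the schema in HCL makes column changes visible in the same pull request as the entry, at the cost of a longer configuration.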

Best practices
