Quick take — A reusable hashicorp/google Terraform module for Data Catalog: provision an entry group with deletion policy, register custom fileset/user-specified entries with GCS file patterns and JSON schemas, and grant viewer IAM from typed, validated variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "data_catalog" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"

  project_id     = "..." # GCP project ID that owns the entry group, entries and I…
  region         = "..." # Region for the entry group (e.g. `asia-south1`, `us`, `…
  entry_group_id = "..." # Entry group ID; letter/underscore start, letters/number…
}
Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
What this module is
Data Catalog is Google Cloud’s managed metadata catalog — the searchable index that sits in front of your data estate so analysts can find a table, a Cloud Storage fileset, or a Pub/Sub topic without knowing which project or bucket it lives in. It auto-ingests metadata for “integrated systems” (BigQuery datasets/tables and Pub/Sub topics show up on their own), but the assets that aren’t integrated — a partitioned Parquet fileset in GCS, an on-prem Oracle table, a Kafka topic — only appear if you register them yourself as entries inside an entry group.
The entry group is the unit you actually own and manage. It’s a regional container (projects/P/locations/REGION/entryGroups/ID) that scopes IAM and holds your hand-registered entries, and it is the first thing every data-platform team re-creates by hand: pick entry_group_id, a region, a deletion_policy, then bolt on the same handful of google_data_catalog_entry resources and the same roles/datacatalog.viewer grants for the BI group. The name it exports (projects/.../entryGroups/...) is also the parent reference every entry and IAM binding needs, so wiring three raw resources together correctly is fiddly and easy to get subtly wrong (passing the short id where the URL-format name is required is the classic mistake).
This module wraps google_data_catalog_entry_group, a for_each map of google_data_catalog_entry (covering both the FILESET enum type with gcs_fileset_spec.file_patterns and arbitrary user_specified_type/user_specified_system entries with a JSON schema), and google_data_catalog_entry_group_iam_member viewer bindings behind typed, validated variables. A consuming team passes intent — “an asia-south1 entry group called lakehouse_curated, two GCS filesets registered, BI analysts can view it” — and gets a correct, governed catalog surface every time.
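To make the wiring pitfall concrete, here is a minimal sketch of the raw resources the module replaces. The resource labels, IDs and bucket name are hypothetical; the point is only that an entry's entry_group argument takes the group's name attribute rather than its short entry_group_id.

```hcl
resource "google_data_catalog_entry_group" "example_group" {
  entry_group_id = "example_group" # hypothetical ID
  region         = "asia-south1"
}

resource "google_data_catalog_entry" "example_entry" {
  # Correct: the URL-format name (projects/.../locations/.../entryGroups/...).
  entry_group = google_data_catalog_entry_group.example_group.name
  # Wrong: google_data_catalog_entry_group.example_group.entry_group_id

  entry_id = "example_entry" # hypothetical ID
  type     = "FILESET"

  gcs_fileset_spec {
    file_patterns = ["gs://my-bucket/prefix/*.parquet"] # hypothetical bucket
  }
}
```

The module hides this choice behind a local, so callers only ever pass the short ID.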
When to use it
- You have non-integrated data assets — GCS filesets (Parquet/CSV/Avro under a prefix), or external/on-prem systems — that need to be discoverable in Data Catalog search alongside your BigQuery and Pub/Sub metadata.
- You are standing up per-domain entry groups (sales_curated, events_raw) and want a consistent region, deletion_policy, naming and viewer IAM instead of hand-built one-offs.
- You register filesets with file_patterns (gs://bucket/prefix/*.parquet) so a logically-partitioned dataset spread across many objects shows up as one catalogued entry with a schema.
- You want search-time discoverability and lineage anchors for assets that Terraform already provisions (a curated bucket, a Dataproc output path); cataloguing them in the same plan keeps metadata in lockstep with the data.
- You need least-privilege read access to the catalog surface managed as a reviewed list of groups/service accounts via roles/datacatalog.viewer, rather than relying on broad project-level grants.
- Reach for the BigQuery/Pub/Sub resources directly (not this module) when the asset is an integrated system: Data Catalog already ingests that metadata, and you’d only add an entry group to attach tags or organize search, which is a different concern from registering the asset itself.
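For the per-domain case, one way to keep the groups consistent is to instantiate the module with for_each. This is a sketch under assumptions: the domain map, project ID and group address are hypothetical, and only inputs documented for this module are used.

```hcl
locals {
  # Hypothetical domains; each key becomes an entry_group_id.
  catalog_domains = {
    sales_curated = "Curated sales datasets"
    events_raw    = "Raw event streams"
  }
}

module "data_catalog" {
  for_each = local.catalog_domains
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"

  project_id      = "my-project" # hypothetical project
  region          = "asia-south1"
  entry_group_id  = each.key
  description     = each.value
  deletion_policy = "DELETE"
  viewer_members  = ["group:bi-analysts@example.com"] # hypothetical group
}
```

Each instance is then addressed as module.data_catalog["sales_curated"], so the region, deletion policy and viewer list are stated once for every domain.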
Module structure
terraform-module-gcp-data-catalog/
├── versions.tf # provider + Terraform version pins
├── main.tf # entry group + entries (fileset / user-specified) + viewer IAM
├── variables.tf # typed, validated inputs
└── outputs.tf # entry group id/name + entry names/ids map
versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}
main.tf
locals {
  # The entry group's URL-format name is the parent every entry and IAM
  # binding must reference (NOT the short entry_group_id). Computed once.
  entry_group_name = google_data_catalog_entry_group.this.name

  # De-duplicated viewer principals, expanded to one IAM member each.
  viewer_members = { for m in distinct(var.viewer_members) : m => m }
}

resource "google_data_catalog_entry_group" "this" {
  project        = var.project_id
  region         = var.region
  entry_group_id = var.entry_group_id
  display_name   = coalesce(var.display_name, var.entry_group_id)
  description    = var.description

  # DELETE allows `terraform destroy`; set ABANDON to keep the group in the
  # catalog and only drop it from state (useful for shared, long-lived groups).
  deletion_policy = var.deletion_policy
}

resource "google_data_catalog_entry" "this" {
  for_each = var.entries

  entry_group     = local.entry_group_name
  entry_id        = each.key
  display_name    = each.value.display_name
  description     = each.value.description
  linked_resource = each.value.linked_resource

  # An entry is EITHER a typed FILESET, OR a fully user-specified entry.
  # FILESET is the only EntryType enum value accepted for `type` here;
  # everything else (Kafka, Oracle, a custom system) is modelled via
  # user_specified_*.
  type                  = each.value.type
  user_specified_type   = each.value.user_specified_type
  user_specified_system = each.value.user_specified_system

  # JSON-encoded column schema (file()/jsonencode()), shown in catalog search.
  schema = each.value.schema

  # Required for FILESET entries: the GCS glob(s) that make up the fileset.
  dynamic "gcs_fileset_spec" {
    for_each = each.value.gcs_file_patterns == null ? [] : [each.value.gcs_file_patterns]
    content {
      file_patterns = gcs_fileset_spec.value
    }
  }
}

# Read access to the entry group and the entries inside it.
resource "google_data_catalog_entry_group_iam_member" "viewer" {
  for_each = local.viewer_members

  project     = google_data_catalog_entry_group.this.project
  region      = google_data_catalog_entry_group.this.region
  entry_group = local.entry_group_name
  role        = "roles/datacatalog.viewer"
  member      = each.value
}
variables.tf
variable "project_id" {
  type        = string
  description = "GCP project ID that owns the entry group, its entries and IAM bindings."
}

variable "region" {
  type        = string
  description = "Region the entry group lives in (e.g. 'asia-south1', 'us', 'eu'). Entry groups are regional; co-locate with the data you are cataloguing."

  validation {
    condition     = length(var.region) > 0
    error_message = "region must be set (e.g. asia-south1)."
  }
}

variable "entry_group_id" {
  type        = string
  description = "Entry group ID. Must begin with a letter or underscore, contain only letters, numbers and underscores, and be at most 64 chars."

  validation {
    condition     = can(regex("^[A-Za-z_][A-Za-z0-9_]{0,63}$", var.entry_group_id))
    error_message = "entry_group_id must start with a letter or underscore, contain only letters/numbers/underscores, and be at most 64 chars (no hyphens or dots)."
  }
}

variable "display_name" {
  type        = string
  default     = null
  description = "Human-readable name shown in the console. Defaults to entry_group_id."
}

variable "description" {
  type        = string
  default     = null
  description = "Free-text description of what this entry group catalogues."
}

variable "deletion_policy" {
  type        = string
  default     = "DELETE"
  description = "DELETE lets `terraform destroy` remove the entry group; ABANDON drops it from state but leaves it in the catalog. Use ABANDON for shared, long-lived groups."

  validation {
    condition     = contains(["DELETE", "ABANDON"], var.deletion_policy)
    error_message = "deletion_policy must be either DELETE or ABANDON."
  }
}

variable "entries" {
  description = <<-EOT
    Map of entry_id => entry settings registered inside the entry group.
    Each entry is EITHER a FILESET (set type = "FILESET" and gcs_file_patterns),
    OR a user-specified asset (set user_specified_type, and optionally
    user_specified_system, e.g. type=null, user_specified_type="kafka_topic").
    Do not set both 'type' and 'user_specified_type' on the same entry.
    'user_specified_system' is only valid alongside 'user_specified_type'.
  EOT
  default     = {}

  type = map(object({
    display_name          = optional(string)
    description           = optional(string)
    linked_resource       = optional(string)       # full resource name the entry points at
    type                  = optional(string)       # only "FILESET" is accepted as an enum
    user_specified_type   = optional(string)       # custom type, e.g. "kafka_topic"
    user_specified_system = optional(string)       # custom source system, e.g. "on_prem_oracle"
    schema                = optional(string)       # JSON-encoded column schema
    gcs_file_patterns     = optional(list(string)) # required for FILESET entries
  }))

  # entry_id naming rules (same constraints as entry_group_id).
  validation {
    condition = alltrue([
      for k in keys(var.entries) :
      can(regex("^[A-Za-z_][A-Za-z0-9_]{0,63}$", k))
    ])
    error_message = "Each entry_id must start with a letter or underscore, contain only letters/numbers/underscores, and be at most 64 chars."
  }

  # 'type' (enum) and 'user_specified_type' are mutually exclusive.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      !(e.type != null && e.user_specified_type != null)
    ])
    error_message = "An entry may set 'type' OR 'user_specified_type', not both."
  }

  # Exactly one of the two type fields must be set per entry.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      (e.type != null || e.user_specified_type != null)
    ])
    error_message = "Each entry must set exactly one of 'type' (e.g. FILESET) or 'user_specified_type'."
  }

  # The only EntryType enum the API accepts on create here is FILESET.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.type == null ? true : e.type == "FILESET"
    ])
    error_message = "When 'type' is set it must be \"FILESET\" (the only EntryType enum allowed on a Terraform-created entry)."
  }

  # FILESET entries require at least one GCS file pattern.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.type == "FILESET" ? (e.gcs_file_patterns != null && length(coalesce(e.gcs_file_patterns, [])) > 0) : true
    ])
    error_message = "Every FILESET entry must set a non-empty 'gcs_file_patterns' list (e.g. [\"gs://bucket/prefix/*.parquet\"])."
  }

  # gcs_file_patterns only makes sense on a FILESET entry.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.gcs_file_patterns == null ? true : e.type == "FILESET"
    ])
    error_message = "'gcs_file_patterns' may only be set on entries with type = \"FILESET\"."
  }

  # user_specified_system is meaningless without user_specified_type.
  validation {
    condition = alltrue([
      for e in values(var.entries) :
      e.user_specified_system == null ? true : e.user_specified_type != null
    ])
    error_message = "'user_specified_system' may only be set together with 'user_specified_type'."
  }
}

variable "viewer_members" {
  type        = list(string)
  default     = []
  description = "Principals granted roles/datacatalog.viewer on the entry group (read/search the entries), e.g. [\"group:bi-analysts@kloudvin.com\", \"serviceAccount:lineage@kloudvin-prod.iam.gserviceaccount.com\"]."

  validation {
    condition     = alltrue([for m in var.viewer_members : can(regex("^(user|group|serviceAccount|domain):", m))])
    error_message = "Each viewer member must be user:, group:, serviceAccount:, or domain: — wildcards (allUsers/allAuthenticatedUsers) are not allowed."
  }
}
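The validations above are meant to reject bad entry combinations at plan time. One way to check that behaviour is a native Terraform test file, sketched here under assumptions: the file name, project ID and bucket are hypothetical, and mock_provider needs Terraform 1.7 or later, which is newer than the module's own minimum of 1.5.0.

```hcl
# tests/entries_validation.tftest.hcl (hypothetical file name)

# Assumption: Terraform >= 1.7, so the provider can be mocked and no
# Google credentials are needed to run the plan.
mock_provider "google" {}

variables {
  project_id     = "my-project" # hypothetical project
  region         = "asia-south1"
  entry_group_id = "lakehouse_curated"
}

run "rejects_type_and_user_specified_type_together" {
  command = plan

  variables {
    entries = {
      bad_entry = {
        type                = "FILESET"
        user_specified_type = "kafka_topic"
        gcs_file_patterns   = ["gs://my-bucket/prefix/*.parquet"]
      }
    }
  }

  # The run passes only if a validation on var.entries fails.
  expect_failures = [var.entries]
}

run "rejects_fileset_without_patterns" {
  command = plan

  variables {
    entries = {
      no_patterns = { type = "FILESET" }
    }
  }

  expect_failures = [var.entries]
}
```

Running terraform test from the module directory would execute both runs; each passes only when the corresponding bad input is rejected.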
outputs.tf
output "entry_group_id" {
  description = "Short entry group ID (e.g. lakehouse_curated)."
  value       = google_data_catalog_entry_group.this.entry_group_id
}

output "entry_group_name" {
  description = "URL-format resource name (projects/P/locations/REGION/entryGroups/ID). Use this as the parent for additional entries or tags."
  value       = google_data_catalog_entry_group.this.name
}

output "entry_group_resource_id" {
  description = "Terraform resource id of the entry group ({{name}})."
  value       = google_data_catalog_entry_group.this.id
}

output "region" {
  description = "Region the entry group and its entries live in."
  value       = google_data_catalog_entry_group.this.region
}

output "entry_names" {
  description = "Map of entry_id => URL-format entry resource name (projects/.../entryGroups/.../entries/...)."
  value       = { for k, e in google_data_catalog_entry.this : k => e.name }
}

output "entry_ids" {
  description = "Map of entry_id => Terraform resource id of the entry."
  value       = { for k, e in google_data_catalog_entry.this : k => e.id }
}

output "entry_integrated_systems" {
  description = "Map of entry_id => integrated_system the entry resolves to (empty for user-specified entries)."
  value       = { for k, e in google_data_catalog_entry.this : k => e.integrated_system }
}
How to use it
A curated lakehouse entry group in asia-south1 that registers two GCS filesets (orders and clickstream Parquet) plus one user-specified on-prem Oracle table, with the BI group and the lineage service account granted viewer:
module "data_catalog" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"

  project_id     = "kloudvin-analytics-prod"
  region         = "asia-south1"
  entry_group_id = "lakehouse_curated"
  display_name   = "Lakehouse (Curated)"
  description    = "Catalogued curated assets for the analytics domain: GCS filesets + external sources."

  # Shared, long-lived group — drop from state on destroy, don't delete it.
  deletion_policy = "ABANDON"

  entries = {
    orders_fileset = {
      display_name    = "Orders (curated Parquet)"
      description     = "One Parquet object per daily partition under the curated orders prefix."
      type            = "FILESET"
      linked_resource = "//storage.googleapis.com/kloudvin-curated/orders"
      gcs_file_patterns = [
        "gs://kloudvin-curated/orders/dt=*/*.parquet",
      ]
      schema = file("${path.module}/schemas/orders.json")
    }

    clickstream_fileset = {
      display_name      = "Clickstream events (curated Parquet)"
      description       = "Hourly clickstream Parquet partitions."
      type              = "FILESET"
      gcs_file_patterns = ["gs://kloudvin-curated/clickstream/*/*.parquet"]
      schema            = file("${path.module}/schemas/clickstream.json")
    }

    legacy_orders_oracle = {
      display_name          = "Legacy orders (on-prem Oracle)"
      description           = "Reference to the soon-to-be-migrated Oracle ORDERS table."
      user_specified_type   = "oracle_table"
      user_specified_system = "on_prem_oracle"
      schema                = file("${path.module}/schemas/orders.json")
    }
  }

  viewer_members = [
    "group:bi-analysts@kloudvin.com",
    "serviceAccount:lineage@kloudvin-analytics-prod.iam.gserviceaccount.com",
  ]
}

# Downstream: anchor a Dataplex/data-lineage or governance workflow on the
# fileset entry by its catalogued resource name (no copy-pasted path).
# Assumes google_data_catalog_tag_template.sensitivity is defined elsewhere
# in the calling configuration.
resource "google_data_catalog_tag" "orders_sensitivity" {
  parent   = module.data_catalog.entry_names["orders_fileset"]
  template = google_data_catalog_tag_template.sensitivity.id

  fields {
    field_name = "classification"
    enum_value = "INTERNAL"
  }
}

# Hand the entry group name to another stack (e.g. a search/lineage job).
output "curated_entry_group" {
  description = "Entry group name to register further entries or attach tags."
  value       = module.data_catalog.entry_group_name
}
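The tag resource above refers to google_data_catalog_tag_template.sensitivity, which this module does not create. A minimal sketch of what such a template might look like follows; the template ID, display names and the list of allowed values are assumptions made for illustration, with INTERNAL included only because the tag above uses it.

```hcl
resource "google_data_catalog_tag_template" "sensitivity" {
  project         = "kloudvin-analytics-prod"
  region          = module.data_catalog.region
  tag_template_id = "sensitivity" # hypothetical template ID
  display_name    = "Sensitivity"

  fields {
    field_id     = "classification" # matches field_name on the tag
    display_name = "Classification"
    is_required  = true

    type {
      enum_type {
        # Hypothetical value set; INTERNAL is the one the tag assigns.
        allowed_values { display_name = "PUBLIC" }
        allowed_values { display_name = "INTERNAL" }
        allowed_values { display_name = "CONFIDENTIAL" }
      }
    }
  }
}
```

Taking the region from the module output keeps the template next to the entries it tags.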
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}
2. Module config — live/prod/data_catalog/terragrunt.hcl:
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-data-catalog?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  region         = "..."
  entry_group_id = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/data_catalog && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)        # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
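The root config shown earlier generates only the backend. A minimal sketch of how the provider could be generated from the same root file is below; the project and region values are placeholders, and the generated file name provider.tf is an assumption.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf" # assumed file name
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"  # placeholder
  region  = "us-central1" # placeholder
}
EOF
}
```

Every child terragrunt.hcl that includes the root then gets the same provider.tf written into its working directory, so no module needs its own provider block.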
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | n/a | yes | GCP project ID that owns the entry group, entries and IAM. |
| region | string | n/a | yes | Region for the entry group (e.g. asia-south1, us, eu). |
| entry_group_id | string | n/a | yes | Entry group ID; letter/underscore start, letters/numbers/underscores, ≤64 chars. |
| display_name | string | null | no | Console display name; defaults to entry_group_id. |
| description | string | null | no | Free-text description of the entry group. |
| deletion_policy | string | "DELETE" | no | DELETE (destroy removes it) or ABANDON (drop from state only). |
| entries | map(object) | {} | no | Entries to register: FILESET (type+gcs_file_patterns) or user-specified (user_specified_type/_system), with optional schema/linked_resource. |
| viewer_members | list(string) | [] | no | Principals granted roles/datacatalog.viewer on the entry group. |
Outputs
| Name | Description |
|---|---|
| entry_group_id | Short entry group ID (e.g. lakehouse_curated). |
| entry_group_name | URL-format resource name (projects/P/locations/REGION/entryGroups/ID); the parent for entries/tags. |
| entry_group_resource_id | Terraform resource id of the entry group. |
| region | Region the entry group and entries live in. |
| entry_names | Map of entry_id => URL-format entry resource name. |
| entry_ids | Map of entry_id => Terraform resource id of the entry. |
| entry_integrated_systems | Map of entry_id => resolved integrated_system (empty for user-specified). |
Enterprise scenario
A retail analytics platform runs a GCS-backed lakehouse where curated orders and clickstream datasets are thousands of daily/hourly Parquet objects under bucket prefixes — invisible to Data Catalog because GCS filesets are not an integrated system. The data-platform team deploys this module once per environment: it creates the lakehouse_curated entry group in asia-south1, registers each prefix as a FILESET entry whose gcs_file_patterns glob the partitions and whose JSON schema surfaces columns in catalog search, and adds a legacy_orders_oracle user_specified_type entry so the about-to-be-migrated source still appears in lineage. The bi-analysts Google Group and the lineage service account get roles/datacatalog.viewer via viewer_members, deletion_policy = "ABANDON" protects the shared group from a stray destroy, and the whole catalog surface is reviewed in one pull request and reproduced identically in staging.
Best practices
- Co-locate the entry group with the data, and treat region as fixed. Entry groups are regional and entries inherit that region; create the group in the same region as the bucket/dataset it catalogues (e.g. asia-south1 for an asia-south1 lakehouse) so search, lineage and residency stay consistent. Moving it later means re-creating it.
- Use FILESET + precise file_patterns for GCS, user_specified_* for everything else. Make the glob specific (gs://bucket/orders/dt=*/*.parquet, not gs://bucket/**) so the entry represents exactly the logical dataset; for non-GCS, non-integrated assets (Kafka, Oracle, an external warehouse) set user_specified_type and a clear user_specified_system rather than forcing them into an enum. The module’s validations enforce this either/or so a bad combination fails at plan time.
- Always attach a JSON schema so entries are discoverable, not just named. A catalogued fileset with no schema is a dead end in search; pass the same schema JSON you use for the BigQuery/external table so analysts see columns and types, and keep it in source (file("${path.module}/schemas/orders.json")) next to the rest of the data definition.
- Set deletion_policy = "ABANDON" on shared, long-lived groups. A curated entry group is referenced by tags, lineage and search; ABANDON removes it from Terraform state on destroy without deleting the catalogued metadata, so decommissioning the stack that created it doesn’t blow away everyone’s discoverability. Keep DELETE only for throwaway/per-PR catalogs.
- Grant roles/datacatalog.viewer to groups, not individuals, and keep it least-privilege. Read/search access is the point of a catalog, so wire viewer_members to Google Groups (bi-analysts@…) and the lineage/governance service accounts; the module blocks allUsers/allAuthenticatedUsers, and binding at the entry group (not project) scope means a leaver is handled in your IdP and a new entry group doesn’t silently widen who can see metadata.
- Name entry groups and entries for the domain, and reference by output name. Use purpose-clear ids (lakehouse_curated, orders_fileset) since entry_group_id/entry_id are immutable, and always wire downstream tags/lineage to entry_group_name/entry_names outputs rather than re-deriving projects/.../entryGroups/... strings by hand; passing the short id where the URL-format name is required is the most common Data Catalog wiring bug.
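As an illustration of the schema practice above, the JSON can also be built inline with jsonencode() instead of being read from a file. This is a sketch only: the column names, types and descriptions are invented, and the top-level columns list follows the shape used in the google provider's documentation example for google_data_catalog_entry.

```hcl
locals {
  # Hypothetical columns; replace with the real dataset's definition.
  orders_schema = jsonencode({
    columns = [
      { column = "order_id", type = "STRING", mode = "REQUIRED", description = "Order identifier" },
      { column = "order_ts", type = "TIMESTAMP", mode = "NULLABLE", description = "Order creation time" },
      { column = "amount", type = "DOUBLE", mode = "NULLABLE", description = "Order total" },
    ]
  })
}
```

An entry would then set schema = local.orders_schema in place of the file() call. Keeping the schema in a file remains the better fit when the same JSON is shared with a BigQuery or external table definition.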