Terraform Module: GCP Dataplex — a governed lake with typed zones in one block

Quick take: a reusable Terraform module for GCP Dataplex, built on the hashicorp/google provider. It provisions a data lake plus RAW and CURATED zones, scheduled auto-discovery, CSV/JSON parsing options, an optional Dataproc Metastore attachment, labels, and outputs suited to least-privilege IAM bindings. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "dataplex" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"

  project_id     = "..."  # GCP project ID hosting the lake and zones.
  app            = "..."  # Workload short name used in the lake name (validated lo…
  environment    = "..."  # One of `dev`, `staging`, `prod`, `sandbox`.
  location_short = "..."  # Cosmetic region token for naming (1–8 lowercase chars).
  region         = "..."  # GCP region for the lake and all zones, e.g. `europe-wes…
  zones = {               # Zones keyed by ID; each has `type` (RAW/CURATED), `loca…
    raw = { type = "RAW", location_type = "SINGLE_REGION" } # at least one zone is required
  }
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Dataplex is GCP’s data fabric and governance plane: instead of treating each Cloud Storage bucket and BigQuery dataset as an island, you organise them into a logical lake, carve that lake into zones by data quality tier, and attach the underlying storage as assets. Dataplex then runs automatic discovery across those assets — crawling files, inferring schemas, and registering tables in a Dataproc Metastore and the BigQuery external/@dataplex catalogues so analysts can query data that was just dropped into a bucket, without anyone hand-writing a CREATE EXTERNAL TABLE. It is GCP’s answer to a metadata-driven lakehouse, sitting roughly where AWS Lake Formation or an Azure Purview + ADLS combination sits, and it underpins data-quality scans, data lineage, and unified IAM across structured and unstructured data.

The mental model that the Terraform resources enforce is a strict hierarchy: a google_dataplex_lake is the top-level container scoped to a single region, and every google_dataplex_zone belongs to exactly one lake. A zone's type is not cosmetic; it is a hard contract. A RAW zone holds data in any format (Avro, Parquet, CSV, JSON, images, logs) as it lands, while a CURATED zone is restricted to structured, query-optimised formats (Parquet, ORC, Avro, or BigQuery-native tables) and is what you point BI tools and other downstream consumers at. Every zone must also declare a resource_spec.location_type (SINGLE_REGION or MULTI_REGION, which must be compatible with the buckets you later attach) and a discovery_spec that turns metadata crawling on or off, optionally on a cron schedule with CSV/JSON parsing hints. Get the type or location type wrong and asset attachment is rejected later; forget the discovery schedule and your “automatic” catalogue silently never refreshes.

This module wraps google_dataplex_lake plus a for_each set of google_dataplex_zone resources behind clean, validated variables. You name a lake, optionally attach an existing Dataproc Metastore, and pass a map of zones — each with its tier, location type, and discovery settings — and the module provisions the whole hierarchy with consistent app-env-region naming, labels, and a discovery cadence that actually fires. Asset attachment and IAM are deliberately left to the caller (assets often reference buckets/datasets owned by other modules), but the lake’s service_account is exported so you can grant it read access in one downstream binding.

When to use it

You are building a lakehouse on GCP and want Cloud Storage + BigQuery organised into governed tiers (landing → raw → curated) with a single IAM and metadata plane rather than ad-hoc buckets.
You need schema-on-read at scale: drop Parquet/CSV/JSON into a bucket and have Dataplex discover it, infer the schema, and register queryable tables automatically on a schedule.
You want a clean RAW vs CURATED contract so downstream BI and ML consumers only ever read structured, query-optimised data from the curated zone, while ingestion lands messy data in raw.
You are standardising a data platform and want every team’s lake to carry the same zone taxonomy, discovery cadence, labels, and naming so the data estate is auditable and spend is attributable.
You plan to layer Dataplex data-quality scans, profiling, or lineage on top — these all key off the lake/zone/asset hierarchy this module creates.

Reach for plain BigQuery datasets instead when all your data is already native BigQuery and you do not need to govern Cloud Storage alongside it; reach for Dataproc Metastore on its own if you only need a Hive metastore for Spark and not the discovery/governance layer. Dataplex is the right tool when storage spans GCS and BigQuery and you want one fabric to discover, organise, and secure it.

Module structure

terraform-module-gcp-dataplex/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_dataplex_lake + for_each google_dataplex_zone
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # lake id/name/service_account + per-zone ids and names

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Consistent app-env-region naming, e.g. "analytics-prod-euw1".
  # Lake and zone IDs must be lowercase letters/digits/hyphen, start with a
  # letter, and be 1-63 chars. We compose, then the variable validations guard
  # the inputs so the result always fits.
  lake_name = "${var.app}-${var.environment}-${var.location_short}"
}

# ---------------------------------------------------------------------------
# The Dataplex lake — top-level, single-region container.
# ---------------------------------------------------------------------------

resource "google_dataplex_lake" "this" {
  project      = var.project_id
  name         = local.lake_name
  location     = var.region
  display_name = coalesce(var.display_name, local.lake_name)
  description  = var.description

  labels = var.labels

  # Optionally federate discovered tables into an existing Dataproc Metastore so
  # Spark/Hive and BigQuery share one schema catalogue. Omit for the built-in
  # Dataplex catalogue only.
  dynamic "metastore" {
    for_each = var.metastore_service == null ? [] : [var.metastore_service]
    content {
      service = metastore.value
    }
  }

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

# ---------------------------------------------------------------------------
# Zones — one per quality tier. Each belongs to the lake above.
# RAW = any format as-landed; CURATED = structured/query-optimised only.
# ---------------------------------------------------------------------------

resource "google_dataplex_zone" "this" {
  for_each = var.zones

  project      = var.project_id
  name         = each.key
  location     = var.region
  lake         = google_dataplex_lake.this.name
  display_name = coalesce(each.value.display_name, each.key)
  description  = each.value.description

  # RAW or CURATED — a hard contract on what formats the zone accepts.
  type = each.value.type

  labels = merge(var.labels, each.value.labels)

  # SINGLE_REGION or MULTI_REGION — must be compatible with the GCS buckets /
  # BigQuery datasets you later attach as assets to this zone.
  resource_spec {
    location_type = each.value.location_type
  }

  # Automatic metadata discovery: crawl attached storage, infer schemas, and
  # register tables. schedule is a cron expression; runs must be >= 60 min apart.
  discovery_spec {
    enabled          = each.value.discovery_enabled
    schedule         = each.value.discovery_enabled ? each.value.discovery_schedule : null
    include_patterns = each.value.include_patterns
    exclude_patterns = each.value.exclude_patterns

    # CSV parsing hints applied during discovery (only meaningful for RAW data
    # containing CSV). header_rows are skipped; type inference can be disabled.
    dynamic "csv_options" {
      for_each = each.value.csv_options == null ? [] : [each.value.csv_options]
      content {
        header_rows            = csv_options.value.header_rows
        delimiter              = csv_options.value.delimiter
        encoding               = csv_options.value.encoding
        disable_type_inference = csv_options.value.disable_type_inference
      }
    }

    # JSON parsing hints applied during discovery.
    dynamic "json_options" {
      for_each = each.value.json_options == null ? [] : [each.value.json_options]
      content {
        encoding               = json_options.value.encoding
        disable_type_inference = json_options.value.disable_type_inference
      }
    }
  }

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

variables.tf

variable "project_id" {
  description = "GCP project ID that will host the Dataplex lake and zones."
  type        = string
}

variable "app" {
  description = "Application/workload short name, used in the lake name (e.g. \"analytics\")."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.app))
    error_message = "app must be lowercase letters/digits/hyphen, 2-21 chars, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment (dev, staging, prod, sandbox)."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod", "sandbox"], var.environment)
    error_message = "environment must be one of: dev, staging, prod, sandbox."
  }
}

variable "location_short" {
  description = "Short region token for naming, e.g. \"euw1\", \"use4\". Cosmetic only."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9]{1,8}$", var.location_short))
    error_message = "location_short must be 1-8 lowercase letters/digits."
  }
}

variable "region" {
  description = "GCP region for the lake and all its zones, e.g. \"europe-west1\". Dataplex lakes are regional."
  type        = string
}

variable "display_name" {
  description = "Human-friendly display name for the lake. Defaults to the generated lake name."
  type        = string
  default     = null
}

variable "description" {
  description = "Free-text description shown in the Dataplex console for the lake."
  type        = string
  default     = "Managed by Terraform"
}

variable "metastore_service" {
  description = <<-EOT
    Optional relative reference to an existing Dataproc Metastore service to
    federate discovered tables into, e.g.
    "projects/p/locations/europe-west1/services/hms". null = Dataplex catalogue only.
  EOT
  type        = string
  default     = null
}

variable "zones" {
  description = <<-EOT
    Map of zones keyed by zone ID (lowercase letters/digits/hyphen, start with a
    letter, <= 63 chars). Each zone declares its tier, location type, and
    discovery behaviour.
  EOT
  type = map(object({
    type               = string                # RAW or CURATED
    location_type      = string                # SINGLE_REGION or MULTI_REGION
    display_name       = optional(string)
    description        = optional(string, "Managed by Terraform")
    labels             = optional(map(string), {})
    discovery_enabled  = optional(bool, true)
    discovery_schedule = optional(string, "0 * * * *") # hourly; >= 60 min apart
    include_patterns   = optional(list(string), [])
    exclude_patterns   = optional(list(string), [])
    csv_options = optional(object({
      header_rows            = optional(number, 1)
      delimiter              = optional(string, ",")
      encoding               = optional(string, "UTF-8")
      disable_type_inference = optional(bool, false)
    }))
    json_options = optional(object({
      encoding               = optional(string, "UTF-8")
      disable_type_inference = optional(bool, false)
    }))
  }))

  validation {
    condition     = length(var.zones) > 0
    error_message = "Provide at least one zone."
  }

  validation {
    condition     = alltrue([for z in values(var.zones) : contains(["RAW", "CURATED"], z.type)])
    error_message = "Each zone.type must be RAW or CURATED."
  }

  validation {
    condition     = alltrue([for z in values(var.zones) : contains(["SINGLE_REGION", "MULTI_REGION"], z.location_type)])
    error_message = "Each zone.location_type must be SINGLE_REGION or MULTI_REGION."
  }

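  validation {
    # Dataplex also requires IDs to end with a letter or digit, so a trailing hyphen is rejected here.
    condition     = alltrue([for k in keys(var.zones) : can(regex("^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$", k))])
    error_message = "Each zone ID (map key) must be lowercase letters/digits/hyphen, start with a letter, end with a letter or digit, <= 63 chars."
  }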
}

variable "labels" {
  description = "Labels applied to the lake and merged into every zone."
  type        = map(string)
  default     = {}
}
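
The validations above cover zone type, location type, and ID format, but not the 60-minute discovery floor. One conservative way to surface it early, sketched here as an assumption rather than part of the published module, is a Terraform check block (available from 1.5, which versions.tf already requires). It only accepts schedules whose first cron field is a single fixed minute, since such an expression cannot fire more than once per hour. The block name is hypothetical, and the pattern deliberately rejects some schedules that may be valid, such as ones carrying a time-zone prefix.

# Hypothetical addition to main.tf; not part of the module as published.
check "discovery_schedule_floor" {
  assert {
    # A single fixed minute (0-59) as the first cron field means at most one
    # run per hour. Zones with discovery disabled are skipped.
    condition = alltrue([
      for z in values(var.zones) :
      !z.discovery_enabled || can(regex("^[0-5]?[0-9] ", z.discovery_schedule))
    ])
    error_message = "Each enabled zone's discovery_schedule should start with a fixed minute (for example \"0 * * * *\") so runs are at least 60 minutes apart."
  }
}

A check block only reports a warning during plan and apply. Moving the same condition into a validation block inside variable "zones" would turn it into a hard failure instead.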

outputs.tf

output "lake_id" {
  description = "Fully qualified lake ID (projects/<p>/locations/<region>/lakes/<name>)."
  value       = google_dataplex_lake.this.id
}

output "lake_name" {
  description = "Dataplex lake name (used as the parent in zone/asset references and gcloud)."
  value       = google_dataplex_lake.this.name
}

output "lake_uid" {
  description = "System-generated globally unique ID for the lake."
  value       = google_dataplex_lake.this.uid
}

output "lake_service_account" {
  description = "Service account Dataplex uses for this lake; grant it read on attached buckets/datasets."
  value       = google_dataplex_lake.this.service_account
}

output "lake_state" {
  description = "Current lake state (ACTIVE, CREATING, ACTION_REQUIRED, ...)."
  value       = google_dataplex_lake.this.state
}

output "zone_ids" {
  description = "Map of zone key => fully qualified zone ID."
  value       = { for k, z in google_dataplex_zone.this : k => z.id }
}

output "zone_names" {
  description = "Map of zone key => zone name (used as the parent when attaching assets)."
  value       = { for k, z in google_dataplex_zone.this : k => z.name }
}

output "zone_states" {
  description = "Map of zone key => current zone state (ACTIVE, CREATING, ...)."
  value       = { for k, z in google_dataplex_zone.this : k => z.state }
}

How to use it

The example provisions an analytics lake in europe-west1 federated into an existing Dataproc Metastore, with two zones: a RAW landing zone that discovers CSV drops hourly (skipping a _tmp/ staging prefix), and a CURATED zone restricted to structured data and discovered four times a day. The downstream block grants the lake’s Dataplex service account object-viewer on the landing bucket — using the module’s lake_service_account output instead of a hardcoded identity — so discovery can actually read the files.

module "dataplex" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"

  project_id     = "kv-data-prod"
  app            = "analytics"
  environment    = "prod"
  location_short = "euw1"
  region         = "europe-west1"

  # Share discovered schemas with Spark/Hive via an existing Dataproc Metastore.
  metastore_service = "projects/kv-data-prod/locations/europe-west1/services/hms-prod"

  zones = {
    raw-landing = {
      type               = "RAW"
      location_type      = "SINGLE_REGION"
      display_name       = "Raw landing"
      discovery_enabled  = true
      discovery_schedule = "0 * * * *" # hourly
      exclude_patterns   = ["_tmp/**"]
      csv_options = {
        header_rows = 1
        delimiter   = ","
      }
    }

    curated-sales = {
      type               = "CURATED"
      location_type      = "SINGLE_REGION"
      display_name       = "Curated sales"
      discovery_enabled  = true
      discovery_schedule = "0 */6 * * *" # every 6 hours
      labels             = { tier = "gold" }
    }
  }

  labels = {
    team        = "data-platform"
    cost-center = "kv-1042"
    workload    = "lakehouse"
  }
}

# Downstream: Dataplex discovery must be able to read the landing bucket.
# Bind the lake's service account (a module output) to object-viewer.
resource "google_storage_bucket_iam_member" "dataplex_read_landing" {
  bucket = "kv-data-prod-landing"
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${module.dataplex.lake_service_account}"
}

# And expose the curated zone name so an asset module can attach a dataset to it.
output "curated_zone_name" {
  value = module.dataplex.zone_names["curated-sales"]
}
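
Attaching storage is left to the caller, so here is a minimal sketch of what that downstream step could look like with the provider's google_dataplex_asset resource, wired to the module outputs. The asset name and the reuse of the kv-data-prod-landing bucket are illustrative assumptions; this block is not part of the module.

# Hypothetical downstream asset; not part of the module.
resource "google_dataplex_asset" "landing_bucket" {
  project       = "kv-data-prod"
  location      = "europe-west1"
  lake          = module.dataplex.lake_name
  dataplex_zone = module.dataplex.zone_names["raw-landing"]
  name          = "landing-bucket" # assumed asset ID

  # Point the asset at the bucket the IAM binding above grants read access to.
  resource_spec {
    name = "projects/kv-data-prod/buckets/kv-data-prod-landing"
    type = "STORAGE_BUCKET"
  }

  # The provider requires a discovery_spec on the asset as well as on the zone.
  discovery_spec {
    enabled = true
  }
}

Because raw-landing is a SINGLE_REGION zone, this only works if the bucket lives in the same region as the lake.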

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config — live/prod/dataplex/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  app            = "..."
  environment    = "..."
  location_short = "..."
  region         = "..."
  zones = {
    raw = { type = "RAW", location_type = "SINGLE_REGION" } # at least one zone is required
  }
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/dataplex && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all applies modules in dependency order. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
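
The root config shown in step 1 only covers the backend. To keep the provider in one place as well, the root terragrunt.hcl can carry a generate block along the lines of the sketch below. The file name provider.tf and the placeholder project and region values are assumptions to adapt.

# Sketch for live/terragrunt.hcl; written into each module's working directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "..." # default project for resources in this stack
  region  = "..." # default region
}
EOF
}

With this in the root, the per-environment terragrunt.hcl files stay limited to source and inputs.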

Inputs

Name	Type	Default	Required	Description
`project_id`	`string`	—	Yes	GCP project ID hosting the lake and zones.
`app`	`string`	—	Yes	Workload short name used in the lake name (validated lowercase, 2–21 chars).
`environment`	`string`	—	Yes	One of `dev`, `staging`, `prod`, `sandbox`.
`location_short`	`string`	—	Yes	Cosmetic region token for naming (1–8 lowercase chars).
`region`	`string`	—	Yes	GCP region for the lake and all zones, e.g. `europe-west1`.
`display_name`	`string`	`null`	No	Lake display name; defaults to the generated lake name.
`description`	`string`	`"Managed by Terraform"`	No	Lake console description.
`metastore_service`	`string`	`null`	No	Relative reference to an existing Dataproc Metastore to federate into.
`zones`	`map(object)`	—	Yes	Zones keyed by ID; each has `type` (RAW/CURATED), `location_type` (SINGLE_REGION/MULTI_REGION), and discovery settings (`discovery_enabled`, `discovery_schedule`, `include_patterns`, `exclude_patterns`, `csv_options`, `json_options`).
`labels`	`map(string)`	`{}`	No	Labels applied to the lake and merged into every zone.

Outputs

Name	Description
`lake_id`	Fully qualified lake ID (`projects/<p>/locations/<region>/lakes/<name>`).
`lake_name`	Lake name used as the parent in zone/asset references and `gcloud`.
`lake_uid`	System-generated globally unique ID for the lake.
`lake_service_account`	Dataplex service account for the lake; grant it read on attached storage.
`lake_state`	Current lake state (`ACTIVE`, `CREATING`, `ACTION_REQUIRED`, …).
`zone_ids`	Map of zone key → fully qualified zone ID.
`zone_names`	Map of zone key → zone name (parent when attaching assets).
`zone_states`	Map of zone key → current zone state.

Enterprise scenario

A media company lands clickstream, ad-impression, and CDN logs from a dozen sources into a regional Cloud Storage estate and needs analysts to query yesterday’s data without a data engineer wiring tables by hand. They deploy this module per environment as an analytics lake federated into their Dataproc Metastore, with a RAW zone discovering newline-delimited JSON and CSV hourly (excluding _tmp/** staging paths) and a CURATED zone restricted to the Parquet tables their dbt jobs write. Bucket and dataset assets are attached by a separate asset module that consumes the zone_names output, and the lake_service_account output drives the object-viewer bindings so discovery can read every source. As a result, onboarding a new region’s lakehouse is one module block plus a zone map, and the BigQuery @dataplex catalogue is queryable within the hour.

Best practices

Treat zone type as an enforced contract, not a label. Keep messy, any-format ingestion in RAW and point BI/ML consumers only at CURATED (which rejects non-structured formats) so a stray CSV can never leak into a “gold” reporting surface; size the split around how data is actually consumed, not around source systems.
Tune the discovery schedule for cost and freshness, never below 60 minutes. Each discovery run scans attached storage and incurs metadata/scan cost, and Dataplex rejects schedules closer than 60 minutes apart — run high-churn RAW landing zones hourly but back curated zones off to every few hours, and disable discovery entirely (discovery_enabled = false) on zones whose schema is managed elsewhere.
Grant the lake service_account least privilege, per asset. Bind the exported lake_service_account to roles/storage.objectViewer on the specific buckets and roles/bigquery.dataViewer on the specific datasets you attach, rather than project-wide, so discovery can read exactly what it governs and nothing more.
Keep location_type aligned with the storage you attach. A SINGLE_REGION zone must hold same-region buckets/datasets and a MULTI_REGION zone multi-region storage; mismatches fail at asset-attach time, so decide the regional topology up front and validate it in the zone map.
Standardise naming and labels for an auditable estate. The app-env-region lake name plus team/cost-center/tier labels (merged onto every zone) make a sprawling data fabric attributable and let you slice discovery/scan spend by owner in billing exports.
Federate into Dataproc Metastore when Spark and BigQuery share data. Set metastore_service so discovered tables land in one Hive-compatible catalogue both engines read, avoiding divergent schema definitions; leave it null when the built-in Dataplex catalogue is sufficient to avoid the extra metastore cost.
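
As a sketch of the per-asset least-privilege practice above applied to BigQuery, the binding below scopes read access to a single dataset instead of the whole project. The dataset ID curated_sales is a hypothetical example; the project and the module output follow the usage example earlier in this article.

# Hypothetical dataset-level binding; repeat per attached dataset.
resource "google_bigquery_dataset_iam_member" "dataplex_read_curated" {
  project    = "kv-data-prod"
  dataset_id = "curated_sales" # assumed dataset attached to the curated zone
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${module.dataplex.lake_service_account}"
}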