Terraform Module: GCP Dataplex — a governed lake with typed zones in one block

Quick take — A reusable hashicorp/google Terraform module for GCP Dataplex: a data lake plus RAW and CURATED zones, scheduled auto-discovery, CSV/JSON parsing options, Dataproc Metastore attach, labels and least-privilege outputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "dataplex" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"

  project_id     = "..."  # GCP project ID hosting the lake and zones.
  app            = "..."  # Workload short name used in the lake name (validated lo…
  environment    = "..."  # One of `dev`, `staging`, `prod`, `sandbox`.
  location_short = "..."  # Cosmetic region token for naming (1–8 lowercase chars).
  region         = "..."  # GCP region for the lake and all zones, e.g. `europe-wes…
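  zones = {                # Zones keyed by ID; at least one is required (an empty map fails validation).
    raw = {
      type          = "RAW"           # RAW or CURATED
      location_type = "SINGLE_REGION" # SINGLE_REGION or MULTI_REGION
    }
  }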
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Dataplex is GCP’s data fabric and governance plane: instead of treating each Cloud Storage bucket and BigQuery dataset as an island, you organise them into a logical lake, carve that lake into zones by data quality tier, and attach the underlying storage as assets. Dataplex then runs automatic discovery across those assets — crawling files, inferring schemas, and registering tables in a Dataproc Metastore and the BigQuery external/@dataplex catalogues so analysts can query data that was just dropped into a bucket, without anyone hand-writing a CREATE EXTERNAL TABLE. It is GCP’s answer to a metadata-driven lakehouse, sitting roughly where AWS Lake Formation or an Azure Purview + ADLS combination sits, and it underpins data-quality scans, data lineage, and unified IAM across structured and unstructured data.

The mental model that the Terraform resources enforce is a strict hierarchy: a google_dataplex_lake is the top-level container scoped to a single region, and every google_dataplex_zone belongs to exactly one lake. A zone is not optional cosmetics — its type is a hard contract. A RAW zone holds data in any format (Avro, Parquet, CSV, JSON, images, logs) as it lands, while a CURATED zone is restricted to structured, query-optimised formats (Parquet, ORC, Avro, or BigQuery-native tables) and is what you point BI tools and CURATED-tier consumers at. Every zone must also declare a resource_spec.location_type (SINGLE_REGION or MULTI_REGION, which must be compatible with the buckets you later attach) and a discovery_spec that turns metadata crawling on or off, optionally on a cron schedule with CSV/JSON parsing hints. Get the type or location-type wrong and asset attachment is rejected later; forget the discovery schedule and your “automatic” catalogue silently never refreshes.

This module wraps google_dataplex_lake plus a for_each set of google_dataplex_zone resources behind clean, validated variables. You name a lake, optionally attach an existing Dataproc Metastore, and pass a map of zones — each with its tier, location type, and discovery settings — and the module provisions the whole hierarchy with consistent app-env-region naming, labels, and a discovery cadence that actually fires. Asset attachment and IAM are deliberately left to the caller (assets often reference buckets/datasets owned by other modules), but the lake’s service_account is exported so you can grant it read access in one downstream binding.

When to use it

Dataplex is the right tool when storage spans GCS and BigQuery and you want one fabric to discover, organise, and secure it. Reach for plain BigQuery datasets instead when all your data is already native BigQuery and you do not need to govern Cloud Storage alongside it; reach for Dataproc Metastore on its own if you only need a Hive metastore for Spark and not the discovery/governance layer.

Module structure

terraform-module-gcp-dataplex/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_dataplex_lake + for_each google_dataplex_zone
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # lake id/name/service_account + per-zone ids and names

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Consistent app-env-region naming, e.g. "analytics-prod-euw1".
  # Lake and zone IDs must be lowercase letters/digits/hyphen, start with a
  # letter, and be 1-63 chars. We compose, then the variable validations guard
  # the inputs so the result always fits.
  lake_name = "${var.app}-${var.environment}-${var.location_short}"
}

# ---------------------------------------------------------------------------
# The Dataplex lake — top-level, single-region container.
# ---------------------------------------------------------------------------

resource "google_dataplex_lake" "this" {
  project      = var.project_id
  name         = local.lake_name
  location     = var.region
  display_name = coalesce(var.display_name, local.lake_name)
  description  = var.description

  labels = var.labels

  # Optionally federate discovered tables into an existing Dataproc Metastore so
  # Spark/Hive and BigQuery share one schema catalogue. Omit for the built-in
  # Dataplex catalogue only.
  dynamic "metastore" {
    for_each = var.metastore_service == null ? [] : [var.metastore_service]
    content {
      service = metastore.value
    }
  }

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

# ---------------------------------------------------------------------------
# Zones — one per quality tier. Each belongs to the lake above.
# RAW = any format as-landed; CURATED = structured/query-optimised only.
# ---------------------------------------------------------------------------

resource "google_dataplex_zone" "this" {
  for_each = var.zones

  project      = var.project_id
  name         = each.key
  location     = var.region
  lake         = google_dataplex_lake.this.name
  display_name = coalesce(each.value.display_name, each.key)
  description  = each.value.description

  # RAW or CURATED — a hard contract on what formats the zone accepts.
  type = each.value.type

  labels = merge(var.labels, each.value.labels)

  # SINGLE_REGION or MULTI_REGION — must be compatible with the GCS buckets /
  # BigQuery datasets you later attach as assets to this zone.
  resource_spec {
    location_type = each.value.location_type
  }

  # Automatic metadata discovery: crawl attached storage, infer schemas, and
  # register tables. schedule is a cron expression; runs must be >= 60 min apart.
  discovery_spec {
    enabled          = each.value.discovery_enabled
    schedule         = each.value.discovery_enabled ? each.value.discovery_schedule : null
    include_patterns = each.value.include_patterns
    exclude_patterns = each.value.exclude_patterns

    # CSV parsing hints applied during discovery (only meaningful for RAW data
    # containing CSV). header_rows are skipped; type inference can be disabled.
    dynamic "csv_options" {
      for_each = each.value.csv_options == null ? [] : [each.value.csv_options]
      content {
        header_rows            = csv_options.value.header_rows
        delimiter              = csv_options.value.delimiter
        encoding               = csv_options.value.encoding
        disable_type_inference = csv_options.value.disable_type_inference
      }
    }

    # JSON parsing hints applied during discovery.
    dynamic "json_options" {
      for_each = each.value.json_options == null ? [] : [each.value.json_options]
      content {
        encoding               = json_options.value.encoding
        disable_type_inference = json_options.value.disable_type_inference
      }
    }
  }

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

variables.tf

variable "project_id" {
  description = "GCP project ID that will host the Dataplex lake and zones."
  type        = string
}

variable "app" {
  description = "Application/workload short name, used in the lake name (e.g. \"analytics\")."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.app))
    error_message = "app must be lowercase letters/digits/hyphen, 2-21 chars, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment (dev, staging, prod, sandbox)."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod", "sandbox"], var.environment)
    error_message = "environment must be one of: dev, staging, prod, sandbox."
  }
}

variable "location_short" {
  description = "Short region token for naming, e.g. \"euw1\", \"use4\". Cosmetic only."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9]{1,8}$", var.location_short))
    error_message = "location_short must be 1-8 lowercase letters/digits."
  }
}

variable "region" {
  description = "GCP region for the lake and all its zones, e.g. \"europe-west1\". Dataplex lakes are regional."
  type        = string
}

variable "display_name" {
  description = "Human-friendly display name for the lake. Defaults to the generated lake name."
  type        = string
  default     = null
}

variable "description" {
  description = "Free-text description shown in the Dataplex console for the lake."
  type        = string
  default     = "Managed by Terraform"
}

variable "metastore_service" {
  description = <<-EOT
    Optional relative reference to an existing Dataproc Metastore service to
    federate discovered tables into, e.g.
    "projects/p/locations/europe-west1/services/hms". null = Dataplex catalogue only.
  EOT
  type        = string
  default     = null
}

variable "zones" {
  description = <<-EOT
    Map of zones keyed by zone ID (lowercase letters/digits/hyphen, start with a
    letter, <= 63 chars). Each zone declares its tier, location type, and
    discovery behaviour.
  EOT
  type = map(object({
    type               = string                # RAW or CURATED
    location_type      = string                # SINGLE_REGION or MULTI_REGION
    display_name       = optional(string)
    description        = optional(string, "Managed by Terraform")
    labels             = optional(map(string), {})
    discovery_enabled  = optional(bool, true)
    discovery_schedule = optional(string, "0 * * * *") # hourly; >= 60 min apart
    include_patterns   = optional(list(string), [])
    exclude_patterns   = optional(list(string), [])
    csv_options = optional(object({
      header_rows            = optional(number, 1)
      delimiter              = optional(string, ",")
      encoding               = optional(string, "UTF-8")
      disable_type_inference = optional(bool, false)
    }))
    json_options = optional(object({
      encoding               = optional(string, "UTF-8")
      disable_type_inference = optional(bool, false)
    }))
  }))

  validation {
    condition     = length(var.zones) > 0
    error_message = "Provide at least one zone."
  }

  validation {
    condition     = alltrue([for z in values(var.zones) : contains(["RAW", "CURATED"], z.type)])
    error_message = "Each zone.type must be RAW or CURATED."
  }

  validation {
    condition     = alltrue([for z in values(var.zones) : contains(["SINGLE_REGION", "MULTI_REGION"], z.location_type)])
    error_message = "Each zone.location_type must be SINGLE_REGION or MULTI_REGION."
  }

  validation {
    condition     = alltrue([for k in keys(var.zones) : can(regex("^[a-z][a-z0-9-]{0,62}$", k))])
    error_message = "Each zone ID (map key) must be lowercase letters/digits/hyphen, start with a letter, <= 63 chars."
  }
}

variable "labels" {
  description = "Labels applied to the lake and merged into every zone."
  type        = map(string)
  default     = {}
}

outputs.tf

output "lake_id" {
  description = "Fully qualified lake ID (projects/<p>/locations/<region>/lakes/<name>)."
  value       = google_dataplex_lake.this.id
}

output "lake_name" {
  description = "Dataplex lake name (used as the parent in zone/asset references and gcloud)."
  value       = google_dataplex_lake.this.name
}

output "lake_uid" {
  description = "System-generated globally unique ID for the lake."
  value       = google_dataplex_lake.this.uid
}

output "lake_service_account" {
  description = "Service account Dataplex uses for this lake; grant it read on attached buckets/datasets."
  value       = google_dataplex_lake.this.service_account
}

output "lake_state" {
  description = "Current lake state (ACTIVE, CREATING, ACTION_REQUIRED, ...)."
  value       = google_dataplex_lake.this.state
}

output "zone_ids" {
  description = "Map of zone key => fully qualified zone ID."
  value       = { for k, z in google_dataplex_zone.this : k => z.id }
}

output "zone_names" {
  description = "Map of zone key => zone name (used as the parent when attaching assets)."
  value       = { for k, z in google_dataplex_zone.this : k => z.name }
}

output "zone_states" {
  description = "Map of zone key => current zone state (ACTIVE, CREATING, ...)."
  value       = { for k, z in google_dataplex_zone.this : k => z.state }
}

How to use it

The example provisions an analytics lake in europe-west1 federated into an existing Dataproc Metastore, with two zones: a RAW landing zone that discovers CSV drops hourly (skipping a _tmp/ staging prefix), and a CURATED zone restricted to structured data and discovered four times a day. The downstream block grants the lake’s Dataplex service account object-viewer on the landing bucket — using the module’s lake_service_account output instead of a hardcoded identity — so discovery can actually read the files.

module "dataplex" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"

  project_id     = "kv-data-prod"
  app            = "analytics"
  environment    = "prod"
  location_short = "euw1"
  region         = "europe-west1"

  # Share discovered schemas with Spark/Hive via an existing Dataproc Metastore.
  metastore_service = "projects/kv-data-prod/locations/europe-west1/services/hms-prod"

  zones = {
    raw-landing = {
      type               = "RAW"
      location_type      = "SINGLE_REGION"
      display_name       = "Raw landing"
      discovery_enabled  = true
      discovery_schedule = "0 * * * *" # hourly
      exclude_patterns   = ["_tmp/**"]
      csv_options = {
        header_rows = 1
        delimiter   = ","
      }
    }

    curated-sales = {
      type               = "CURATED"
      location_type      = "SINGLE_REGION"
      display_name       = "Curated sales"
      discovery_enabled  = true
      discovery_schedule = "0 */6 * * *" # every 6 hours
      labels             = { tier = "gold" }
    }
  }

  labels = {
    team        = "data-platform"
    cost-center = "kv-1042"
    workload    = "lakehouse"
  }
}

# Downstream: Dataplex discovery must be able to read the landing bucket.
# Bind the lake's service account (a module output) to object-viewer.
resource "google_storage_bucket_iam_member" "dataplex_read_landing" {
  bucket = "kv-data-prod-landing"
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${module.dataplex.lake_service_account}"
}

# And expose the curated zone name so an asset module can attach a dataset to it.
output "curated_zone_name" {
  value = module.dataplex.zone_names["curated-sales"]
}
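
The module stops at zones, so attaching storage is the caller's job. As a minimal sketch, assuming the landing bucket from the binding above already exists in the lake's region, a google_dataplex_asset could attach it to the RAW zone. The asset ID landing-bucket is hypothetical, and whether resource_spec.name expects the project ID or the project number should be checked against the provider documentation before relying on this.

```hcl
# Sketch only: attach an existing bucket to the RAW zone created by the module.
resource "google_dataplex_asset" "landing" {
  project  = "kv-data-prod"
  location = "europe-west1"
  name     = "landing-bucket" # hypothetical asset ID

  # Parent references come from the module outputs, not hardcoded names.
  lake          = module.dataplex.lake_name
  dataplex_zone = module.dataplex.zone_names["raw-landing"]

  resource_spec {
    type = "STORAGE_BUCKET"
    # Assumption: project ID shown here; the API may expect the project number.
    name = "projects/kv-data-prod/buckets/kv-data-prod-landing"
  }

  # The provider requires a discovery_spec block on the asset as well.
  discovery_spec {
    enabled = true
  }

  # Make sure Dataplex can read the bucket before the asset is created.
  depends_on = [google_storage_bucket_iam_member.dataplex_read_landing]
}
```

Keeping the asset outside the module means the bucket can stay owned by whichever module created it, while the zone reference still tracks the module output.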

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}
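
The root config above only covers remote state. For the provider half of "define it once", a Terragrunt generate block is one option. This is a minimal sketch, with the project and region values carried over from the Quickstart as placeholders rather than anything the module prescribes:

```hcl
# live/terragrunt.hcl (continued). Sketch: write a provider.tf into every module.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"  # placeholder, as in the Quickstart
  region  = "us-central1" # placeholder, as in the Quickstart
}
EOF
}
```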

2. Module config: live/prod/dataplex/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"
}

inputs = {
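  project_id     = "..."
  app            = "..."
  environment    = "..."
  location_short = "..."
  region         = "..."

  # At least one zone is required; an empty map fails the module's validation.
  zones = {
    raw = { type = "RAW", location_type = "SINGLE_REGION" }
  }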
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/dataplex && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
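
As a sketch of the cross-module wiring that run-all puts in order, assume a hypothetical sibling asset module at live/prod/dataplex-assets whose inputs are named lake_name and zone_name (neither the path nor the input names are part of this module). A dependency block can then feed it this module's outputs:

```hcl
# live/prod/dataplex-assets/terragrunt.hcl (hypothetical sibling module)
include "root" {
  path = find_in_parent_folders()
}

# The terraform { source = ... } block for the asset module is omitted here.

# Terragrunt applies ../dataplex first and exposes its outputs below.
dependency "dataplex" {
  config_path = "../dataplex"
}

inputs = {
  # Input names are assumptions about the asset module, not this module.
  lake_name = dependency.dataplex.outputs.lake_name
  zone_name = dependency.dataplex.outputs.zone_names["raw-landing"]
}
```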

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | - | Yes | GCP project ID hosting the lake and zones. |
| app | string | - | Yes | Workload short name used in the lake name (validated lowercase, 2–21 chars). |
| environment | string | - | Yes | One of dev, staging, prod, sandbox. |
| location_short | string | - | Yes | Cosmetic region token for naming (1–8 lowercase chars). |
| region | string | - | Yes | GCP region for the lake and all zones, e.g. europe-west1. |
| display_name | string | null | No | Lake display name; defaults to the generated lake name. |
| description | string | "Managed by Terraform" | No | Lake console description. |
| metastore_service | string | null | No | Relative reference to an existing Dataproc Metastore to federate into. |
| zones | map(object) | - | Yes | Zones keyed by ID; each has type (RAW/CURATED), location_type (SINGLE_REGION/MULTI_REGION), and discovery settings (discovery_enabled, discovery_schedule, include_patterns, exclude_patterns, csv_options, json_options). |
| labels | map(string) | {} | No | Labels applied to the lake and merged into every zone. |

Outputs

| Name | Description |
| --- | --- |
| lake_id | Fully qualified lake ID (projects/<p>/locations/<region>/lakes/<name>). |
| lake_name | Lake name used as the parent in zone/asset references and gcloud. |
| lake_uid | System-generated globally unique ID for the lake. |
| lake_service_account | Dataplex service account for the lake; grant it read on attached storage. |
| lake_state | Current lake state (ACTIVE, CREATING, ACTION_REQUIRED, …). |
| zone_ids | Map of zone key → fully qualified zone ID. |
| zone_names | Map of zone key → zone name (parent when attaching assets). |
| zone_states | Map of zone key → current zone state. |
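
The lake_service_account output is meant for bindings on attached storage. The usage example covers a bucket; for a BigQuery dataset the same pattern might look like the sketch below. The dataset ID curated_sales is hypothetical, and the data-viewer role is an assumption that mirrors the object-viewer binding; Dataplex may need broader roles depending on how the asset is managed.

```hcl
# Sketch only: let the lake's service account read a BigQuery dataset.
resource "google_bigquery_dataset_iam_member" "dataplex_read_sales" {
  project    = "kv-data-prod"
  dataset_id = "curated_sales"             # hypothetical dataset ID
  role       = "roles/bigquery.dataViewer" # assumption: read-only is sufficient
  member     = "serviceAccount:${module.dataplex.lake_service_account}"
}
```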

Enterprise scenario

A media company lands clickstream, ad-impression, and CDN logs from a dozen sources into a regional Cloud Storage estate and needs analysts to query yesterday’s data without a data engineer wiring tables by hand. They deploy this module per environment as an analytics lake federated into their Dataproc Metastore, with a RAW zone discovering newline-delimited JSON and CSV hourly (excluding _tmp/** staging paths) and a CURATED zone restricted to the Parquet tables their dbt jobs write. Bucket and dataset assets are attached by a separate asset module that consumes the zone_names output, and the lake_service_account output drives the object-viewer bindings so discovery can read every source. Onboarding a new region’s lakehouse is then one module block plus a zone map, and the BigQuery @dataplex catalogue is queryable within the hour.
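
A zone map for this scenario could look roughly like the following. It is a sketch using only the inputs documented above; the zone IDs raw-events and curated-parquet are hypothetical, and the fragment belongs inside the module "dataplex" block.

```hcl
zones = {
  # RAW: newline-delimited JSON and CSV, discovered hourly.
  raw-events = {
    type               = "RAW"
    location_type      = "SINGLE_REGION"
    discovery_schedule = "0 * * * *" # hourly
    exclude_patterns   = ["_tmp/**"] # skip staging paths
    csv_options        = { header_rows = 1 }
    json_options       = {} # module defaults: UTF-8, type inference on
  }

  # CURATED: the Parquet tables written by the dbt jobs.
  # discovery_schedule is left at the module default (hourly).
  curated-parquet = {
    type          = "CURATED"
    location_type = "SINGLE_REGION"
  }
}
```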
