Quick take — A reusable hashicorp/google Terraform module for GCP Dataplex: a data lake plus RAW and CURATED zones, scheduled auto-discovery, CSV/JSON parsing options, Dataproc Metastore attach, labels and least-privilege outputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "dataplex" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"
project_id = "..." # GCP project ID hosting the lake and zones.
app = "..." # Workload short name used in the lake name (validated lo…
environment = "..." # One of `dev`, `staging`, `prod`, `sandbox`.
location_short = "..." # Cosmetic region token for naming (1–8 lowercase chars).
region = "..." # GCP region for the lake and all zones, e.g. `europe-wes…
zones = {} # Zones keyed by ID; each has `type` (RAW/CURATED), `loca…
}
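Then run terraform init && terraform apply. Note that zones must contain at least one entry: the empty map above is only a placeholder and fails the module's "Provide at least one zone." validation. Every other input has a sensible default; see Inputs below to override behaviour.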
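As a starting point for the zones placeholder, a single RAW zone could look like the sketch below. The zone key "raw" is a hypothetical name chosen for illustration; the attribute names and defaults come from the zones variable defined in variables.tf.

```hcl
# Goes inside the module "dataplex" block in place of `zones = {}`.
# "raw" is a hypothetical zone ID; pick any ID that passes the key validation.
zones = {
  raw = {
    type          = "RAW"           # RAW or CURATED
    location_type = "SINGLE_REGION" # SINGLE_REGION or MULTI_REGION
    # Everything else is optional: discovery defaults to enabled
    # on an hourly schedule ("0 * * * *").
  }
}
```

Only type and location_type are required per zone, so this is about the smallest map that satisfies the validations.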
What this module is
Dataplex is GCP’s data fabric and governance plane: instead of treating each Cloud Storage bucket and BigQuery dataset as an island, you organise them into a logical lake, carve that lake into zones by data quality tier, and attach the underlying storage as assets. Dataplex then runs automatic discovery across those assets — crawling files, inferring schemas, and registering tables in a Dataproc Metastore and the BigQuery external/@dataplex catalogues so analysts can query data that was just dropped into a bucket, without anyone hand-writing a CREATE EXTERNAL TABLE. It is GCP’s answer to a metadata-driven lakehouse, sitting roughly where AWS Lake Formation or an Azure Purview + ADLS combination sits, and it underpins data-quality scans, data lineage, and unified IAM across structured and unstructured data.
The mental model that the Terraform resources enforce is a strict hierarchy: a google_dataplex_lake is the top-level container scoped to a single region, and every google_dataplex_zone belongs to exactly one lake. A zone is not optional cosmetics — its type is a hard contract. A RAW zone holds data in any format (Avro, Parquet, CSV, JSON, images, logs) as it lands, while a CURATED zone is restricted to structured, query-optimised formats (Parquet, ORC, Avro, or BigQuery-native tables) and is what you point BI tools and CURATED-tier consumers at. Every zone must also declare a resource_spec.location_type (SINGLE_REGION or MULTI_REGION, which must be compatible with the buckets you later attach) and a discovery_spec that turns metadata crawling on or off, optionally on a cron schedule with CSV/JSON parsing hints. Get the type or location-type wrong and asset attachment is rejected later; forget the discovery schedule and your “automatic” catalogue silently never refreshes.
This module wraps google_dataplex_lake plus a for_each map of google_dataplex_zone resources behind clean, validated variables. You supply the naming inputs (app, environment, location_short), optionally attach an existing Dataproc Metastore, and pass a map of zones (each with its tier, location type, and discovery settings); the module then provisions the whole hierarchy with consistent app-env-region naming, labels, and a discovery cadence that actually fires. Asset attachment and IAM are deliberately left to the caller (assets often reference buckets/datasets owned by other modules), but the lake’s service_account is exported so you can grant it read access in one downstream binding.
When to use it
- You are building a lakehouse on GCP and want Cloud Storage + BigQuery organised into governed tiers (landing → raw → curated) with a single IAM and metadata plane rather than ad-hoc buckets.
- You need schema-on-read at scale: drop Parquet/CSV/JSON into a bucket and have Dataplex discover it, infer the schema, and register queryable tables automatically on a schedule.
- You want a clean RAW vs CURATED contract so downstream BI and ML consumers only ever read structured, query-optimised data from the curated zone, while ingestion lands messy data in raw.
- You are standardising a data platform and want every team’s lake to carry the same zone taxonomy, discovery cadence, labels, and naming so the data estate is auditable and spend is attributable.
- You plan to layer Dataplex data-quality scans, profiling, or lineage on top — these all key off the lake/zone/asset hierarchy this module creates.
Reach for plain BigQuery datasets instead when all your data is already native BigQuery and you do not need to govern Cloud Storage alongside it; reach for Dataproc Metastore on its own if you only need a Hive metastore for Spark and not the discovery/governance layer. Dataplex is the right tool when storage spans GCS and BigQuery and you want one fabric to discover, organise, and secure it.
Module structure
terraform-module-gcp-dataplex/
├── versions.tf # provider + Terraform version pins
├── main.tf # google_dataplex_lake + for_each google_dataplex_zone
├── variables.tf # var-driven inputs with validation
└── outputs.tf # lake id/name/service_account + per-zone ids and names
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
locals {
# Consistent app-env-region naming, e.g. "analytics-prod-euw1".
# Lake and zone IDs must be lowercase letters/digits/hyphen, start with a
# letter, and be 1-63 chars. We compose, then the variable validations guard
# the inputs so the result always fits.
lake_name = "${var.app}-${var.environment}-${var.location_short}"
}
# ---------------------------------------------------------------------------
# The Dataplex lake — top-level, single-region container.
# ---------------------------------------------------------------------------
resource "google_dataplex_lake" "this" {
project = var.project_id
name = local.lake_name
location = var.region
display_name = coalesce(var.display_name, local.lake_name)
description = var.description
labels = var.labels
# Optionally federate discovered tables into an existing Dataproc Metastore so
# Spark/Hive and BigQuery share one schema catalogue. Omit for the built-in
# Dataplex catalogue only.
dynamic "metastore" {
for_each = var.metastore_service == null ? [] : [var.metastore_service]
content {
service = metastore.value
}
}
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
# ---------------------------------------------------------------------------
# Zones — one per quality tier. Each belongs to the lake above.
# RAW = any format as-landed; CURATED = structured/query-optimised only.
# ---------------------------------------------------------------------------
resource "google_dataplex_zone" "this" {
for_each = var.zones
project = var.project_id
name = each.key
location = var.region
lake = google_dataplex_lake.this.name
display_name = coalesce(each.value.display_name, each.key)
description = each.value.description
# RAW or CURATED — a hard contract on what formats the zone accepts.
type = each.value.type
labels = merge(var.labels, each.value.labels)
# SINGLE_REGION or MULTI_REGION — must be compatible with the GCS buckets /
# BigQuery datasets you later attach as assets to this zone.
resource_spec {
location_type = each.value.location_type
}
# Automatic metadata discovery: crawl attached storage, infer schemas, and
# register tables. schedule is a cron expression; runs must be >= 60 min apart.
discovery_spec {
enabled = each.value.discovery_enabled
schedule = each.value.discovery_enabled ? each.value.discovery_schedule : null
include_patterns = each.value.include_patterns
exclude_patterns = each.value.exclude_patterns
# CSV parsing hints applied during discovery (only meaningful for RAW data
# containing CSV). header_rows are skipped; type inference can be disabled.
dynamic "csv_options" {
for_each = each.value.csv_options == null ? [] : [each.value.csv_options]
content {
header_rows = csv_options.value.header_rows
delimiter = csv_options.value.delimiter
encoding = csv_options.value.encoding
disable_type_inference = csv_options.value.disable_type_inference
}
}
# JSON parsing hints applied during discovery.
dynamic "json_options" {
for_each = each.value.json_options == null ? [] : [each.value.json_options]
content {
encoding = json_options.value.encoding
disable_type_inference = json_options.value.disable_type_inference
}
}
}
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
variables.tf
variable "project_id" {
description = "GCP project ID that will host the Dataplex lake and zones."
type = string
}
variable "app" {
description = "Application/workload short name, used in the lake name (e.g. \"analytics\")."
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,20}$", var.app))
error_message = "app must be lowercase letters/digits/hyphen, 2-21 chars, starting with a letter."
}
}
variable "environment" {
description = "Deployment environment (dev, staging, prod, sandbox)."
type = string
validation {
condition = contains(["dev", "staging", "prod", "sandbox"], var.environment)
error_message = "environment must be one of: dev, staging, prod, sandbox."
}
}
variable "location_short" {
description = "Short region token for naming, e.g. \"euw1\", \"use4\". Cosmetic only."
type = string
validation {
condition = can(regex("^[a-z0-9]{1,8}$", var.location_short))
error_message = "location_short must be 1-8 lowercase letters/digits."
}
}
variable "region" {
description = "GCP region for the lake and all its zones, e.g. \"europe-west1\". Dataplex lakes are regional."
type = string
}
variable "display_name" {
description = "Human-friendly display name for the lake. Defaults to the generated lake name."
type = string
default = null
}
variable "description" {
description = "Free-text description shown in the Dataplex console for the lake."
type = string
default = "Managed by Terraform"
}
variable "metastore_service" {
description = <<-EOT
Optional relative reference to an existing Dataproc Metastore service to
federate discovered tables into, e.g.
"projects/p/locations/europe-west1/services/hms". null = Dataplex catalogue only.
EOT
type = string
default = null
}
variable "zones" {
description = <<-EOT
Map of zones keyed by zone ID (lowercase letters/digits/hyphen, start with a
letter, <= 63 chars). Each zone declares its tier, location type, and
discovery behaviour.
EOT
type = map(object({
type = string # RAW or CURATED
location_type = string # SINGLE_REGION or MULTI_REGION
display_name = optional(string)
description = optional(string, "Managed by Terraform")
labels = optional(map(string), {})
discovery_enabled = optional(bool, true)
discovery_schedule = optional(string, "0 * * * *") # hourly; >= 60 min apart
include_patterns = optional(list(string), [])
exclude_patterns = optional(list(string), [])
csv_options = optional(object({
header_rows = optional(number, 1)
delimiter = optional(string, ",")
encoding = optional(string, "UTF-8")
disable_type_inference = optional(bool, false)
}))
json_options = optional(object({
encoding = optional(string, "UTF-8")
disable_type_inference = optional(bool, false)
}))
}))
validation {
condition = length(var.zones) > 0
error_message = "Provide at least one zone."
}
validation {
condition = alltrue([for z in values(var.zones) : contains(["RAW", "CURATED"], z.type)])
error_message = "Each zone.type must be RAW or CURATED."
}
validation {
condition = alltrue([for z in values(var.zones) : contains(["SINGLE_REGION", "MULTI_REGION"], z.location_type)])
error_message = "Each zone.location_type must be SINGLE_REGION or MULTI_REGION."
}
validation {
condition = alltrue([for k in keys(var.zones) : can(regex("^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$", k))])
error_message = "Each zone ID (map key) must be lowercase letters/digits/hyphen, start with a letter, end with a letter or digit, <= 63 chars."
}
}
variable "labels" {
description = "Labels applied to the lake and merged into every zone."
type = map(string)
default = {}
}
outputs.tf
output "lake_id" {
description = "Fully qualified lake ID (projects/<p>/locations/<region>/lakes/<name>)."
value = google_dataplex_lake.this.id
}
output "lake_name" {
description = "Dataplex lake name (used as the parent in zone/asset references and gcloud)."
value = google_dataplex_lake.this.name
}
output "lake_uid" {
description = "System-generated globally unique ID for the lake."
value = google_dataplex_lake.this.uid
}
output "lake_service_account" {
description = "Service account Dataplex uses for this lake; grant it read on attached buckets/datasets."
value = google_dataplex_lake.this.service_account
}
output "lake_state" {
description = "Current lake state (ACTIVE, CREATING, ACTION_REQUIRED, ...)."
value = google_dataplex_lake.this.state
}
output "zone_ids" {
description = "Map of zone key => fully qualified zone ID."
value = { for k, z in google_dataplex_zone.this : k => z.id }
}
output "zone_names" {
description = "Map of zone key => zone name (used as the parent when attaching assets)."
value = { for k, z in google_dataplex_zone.this : k => z.name }
}
output "zone_states" {
description = "Map of zone key => current zone state (ACTIVE, CREATING, ...)."
value = { for k, z in google_dataplex_zone.this : k => z.state }
}
How to use it
The example provisions an analytics lake in europe-west1 federated into an existing Dataproc Metastore, with two zones: a RAW landing zone that discovers CSV drops hourly (skipping a _tmp/ staging prefix), and a CURATED zone restricted to structured data and discovered four times a day. The downstream block grants the lake’s Dataplex service account object-viewer on the landing bucket — using the module’s lake_service_account output instead of a hardcoded identity — so discovery can actually read the files.
module "dataplex" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"
project_id = "kv-data-prod"
app = "analytics"
environment = "prod"
location_short = "euw1"
region = "europe-west1"
# Share discovered schemas with Spark/Hive via an existing Dataproc Metastore.
metastore_service = "projects/kv-data-prod/locations/europe-west1/services/hms-prod"
zones = {
raw-landing = {
type = "RAW"
location_type = "SINGLE_REGION"
display_name = "Raw landing"
discovery_enabled = true
discovery_schedule = "0 * * * *" # hourly
exclude_patterns = ["_tmp/**"]
csv_options = {
header_rows = 1
delimiter = ","
}
}
curated-sales = {
type = "CURATED"
location_type = "SINGLE_REGION"
display_name = "Curated sales"
discovery_enabled = true
discovery_schedule = "0 */6 * * *" # every 6 hours
labels = { tier = "gold" }
}
}
labels = {
team = "data-platform"
cost-center = "kv-1042"
workload = "lakehouse"
}
}
# Downstream: Dataplex discovery must be able to read the landing bucket.
# Bind the lake's service account (a module output) to object-viewer.
resource "google_storage_bucket_iam_member" "dataplex_read_landing" {
bucket = "kv-data-prod-landing"
role = "roles/storage.objectViewer"
member = "serviceAccount:${module.dataplex.lake_service_account}"
}
# And expose the curated zone name so an asset module can attach a dataset to it.
output "curated_zone_name" {
value = module.dataplex.zone_names["curated-sales"]
}
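Asset attachment is left to the caller, so a downstream configuration might consume the zone_names output roughly as follows. This is a sketch, not part of the module: the asset name "sales-dataset" and the dataset "sales_curated" are hypothetical, and it assumes the google_dataplex_asset resource from the same hashicorp/google provider.

```hcl
# Hypothetical downstream asset: attaches a BigQuery dataset to the curated zone.
resource "google_dataplex_asset" "curated_sales_dataset" {
  project       = "kv-data-prod"
  location      = "europe-west1"
  lake          = module.dataplex.lake_name
  dataplex_zone = module.dataplex.zone_names["curated-sales"]
  name          = "sales-dataset" # hypothetical asset ID

  resource_spec {
    # Hypothetical dataset; must be compatible with the zone's location_type.
    name = "projects/kv-data-prod/datasets/sales_curated"
    type = "BIGQUERY_DATASET"
  }

  discovery_spec {
    enabled = true
  }
}
```

Keeping assets in the caller means the bucket or dataset can be owned by a different module while still referencing this module's lake and zone names.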
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/dataplex/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataplex?ref=v1.0.0"
}
inputs = {
project_id = "..."
app = "..."
environment = "..."
location_short = "..."
region = "..."
zones = {}
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/dataplex && terragrunt apply # this module
terragrunt run-all apply # run from live/prod: applies every module under it
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; the inputs block is overridden per environment (dev / staging / prod / sandbox) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
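For illustration, a root config that keeps both the backend and the provider in one place might look like the sketch below. The project and bucket names are hypothetical placeholders, and deriving the state prefix from path_relative_to_include() is one common convention rather than something the module requires.

```hcl
# live/terragrunt.hcl (sketch; "kv-tfstate" and "kv-tfstate-live" are hypothetical)
remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    project  = "kv-tfstate"
    location = "europe-west1"
    bucket   = "kv-tfstate-live"
    # One state prefix per module directory, e.g. "prod/dataplex".
    prefix = path_relative_to_include()
  }
}

# Generate the provider block once so child configs do not repeat it.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "google" {
      region = "europe-west1"
    }
  EOF
}
```

With this in place, each environment's terragrunt.hcl only needs the include, the module source, and its own inputs.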
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| `project_id` | `string` | - | Yes | GCP project ID hosting the lake and zones. |
| `app` | `string` | - | Yes | Workload short name used in the lake name (validated lowercase, 2–21 chars). |
| `environment` | `string` | - | Yes | One of dev, staging, prod, sandbox. |
| `location_short` | `string` | - | Yes | Cosmetic region token for naming (1–8 lowercase chars). |
| `region` | `string` | - | Yes | GCP region for the lake and all zones, e.g. europe-west1. |
| `display_name` | `string` | `null` | No | Lake display name; defaults to the generated lake name. |
| `description` | `string` | `"Managed by Terraform"` | No | Lake console description. |
| `metastore_service` | `string` | `null` | No | Relative reference to an existing Dataproc Metastore to federate into. |
| `zones` | `map(object)` | - | Yes | Zones keyed by ID; each has type (RAW/CURATED), location_type (SINGLE_REGION/MULTI_REGION), and discovery settings (discovery_enabled, discovery_schedule, include_patterns, exclude_patterns, csv_options, json_options). |
| `labels` | `map(string)` | `{}` | No | Labels applied to the lake and merged into every zone. |
Outputs
| Name | Description |
|---|---|
| `lake_id` | Fully qualified lake ID (projects/<p>/locations/<region>/lakes/<name>). |
| `lake_name` | Lake name used as the parent in zone/asset references and gcloud. |
| `lake_uid` | System-generated globally unique ID for the lake. |
| `lake_service_account` | Dataplex service account for the lake; grant it read on attached storage. |
| `lake_state` | Current lake state (ACTIVE, CREATING, ACTION_REQUIRED, …). |
| `zone_ids` | Map of zone key → fully qualified zone ID. |
| `zone_names` | Map of zone key → zone name (parent when attaching assets). |
| `zone_states` | Map of zone key → current zone state. |
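One possible use of the state outputs is a check block in the calling configuration, which the module's required_version (>= 1.5.0) already permits. The sketch below assumes the module is instantiated as module.dataplex; the check names are hypothetical. Check failures surface as warnings rather than blocking the apply, and the state values reflect what the provider last read.

```hcl
# Caller-side sanity checks on the exported state values (sketch).
check "dataplex_lake_active" {
  assert {
    condition     = module.dataplex.lake_state == "ACTIVE"
    error_message = "Dataplex lake is not ACTIVE; inspect it in the console."
  }
}

check "dataplex_zones_active" {
  assert {
    # zone_states is a map of zone key => state string.
    condition     = alltrue([for s in values(module.dataplex.zone_states) : s == "ACTIVE"])
    error_message = "One or more Dataplex zones are not ACTIVE."
  }
}
```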
Enterprise scenario
A media company lands clickstream, ad-impression, and CDN logs from a dozen sources into a regional Cloud Storage estate and needs analysts to query yesterday’s data without a data engineer wiring tables by hand. They deploy this module per environment as an analytics lake federated into their Dataproc Metastore, with a RAW zone discovering newline-delimited JSON and CSV hourly (excluding _tmp/** staging paths) and a CURATED zone restricted to the Parquet tables their dbt jobs write. Bucket and dataset assets are attached by a separate asset module that consumes the zone_names output, and the lake_service_account output drives the object-viewer bindings so discovery can read every source — so onboarding a new region’s lakehouse is one module block plus a zone map, and the BigQuery @dataplex catalogue is queryable within the hour.
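A zones map for a scenario like this might be sketched as follows. The zone keys and the curated discovery cadence are assumptions made for illustration; the scenario only fixes the hourly RAW schedule, the _tmp/** exclusion, and the RAW/CURATED split.

```hcl
# Goes inside the module "dataplex" block. Zone keys are hypothetical.
zones = {
  raw-events = {
    type               = "RAW"
    location_type      = "SINGLE_REGION"
    discovery_schedule = "0 * * * *" # hourly
    exclude_patterns   = ["_tmp/**"] # skip staging paths
    csv_options        = { header_rows = 1 }
    json_options       = { encoding = "UTF-8" }
  }
  curated-parquet = {
    type               = "CURATED"
    location_type      = "SINGLE_REGION"
    discovery_schedule = "0 */6 * * *" # assumed cadence, every 6 hours
  }
}
```

Because csv_options and json_options use optional attributes, partial objects like these pick up the remaining defaults from the variable definition.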
Best practices
- Treat zone `type` as an enforced contract, not a label. Keep messy, any-format ingestion in `RAW` and point BI/ML consumers only at `CURATED` (which rejects non-structured formats) so a stray CSV can never leak into a “gold” reporting surface; size the split around how data is actually consumed, not around source systems.
- Tune the discovery schedule for cost and freshness, never below 60 minutes. Each discovery run scans attached storage and incurs metadata/scan cost, and Dataplex rejects schedules closer than 60 minutes apart: run high-churn RAW landing zones hourly but back curated zones off to every few hours, and disable discovery entirely (`discovery_enabled = false`) on zones whose schema is managed elsewhere.
- Grant the lake `service_account` least privilege, per asset. Bind the exported `lake_service_account` to `roles/storage.objectViewer` / BigQuery data-viewer only on the specific buckets and datasets you attach, rather than project-wide, so discovery can read exactly what it governs and nothing more.
- Keep `location_type` aligned with the storage you attach. A `SINGLE_REGION` zone must hold same-region buckets/datasets and a `MULTI_REGION` zone multi-region storage; mismatches fail at asset-attach time, so decide the regional topology up front and validate it in the zone map.
- Standardise naming and labels for an auditable estate. The `app-env-region` lake name plus `team`/`cost-center`/`tier` labels (merged onto every zone) make a sprawling data fabric attributable and let you slice discovery/scan spend by owner in billing exports.
- Federate into Dataproc Metastore when Spark and BigQuery share data. Set `metastore_service` so discovered tables land in one Hive-compatible catalogue both engines read, avoiding divergent schema definitions; leave it null when the built-in Dataplex catalogue is sufficient to avoid the extra metastore cost.
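The per-asset least-privilege advice could be expressed along these lines. The bucket and dataset names are hypothetical placeholders, and the sketch assumes the module is instantiated as module.dataplex in the same configuration.

```hcl
# Hypothetical inventory of the storage actually attached as assets.
locals {
  dataplex_buckets  = toset(["kv-data-prod-landing"])
  dataplex_datasets = toset(["sales_curated"])
}

# Read-only access on each attached bucket, never project-wide.
resource "google_storage_bucket_iam_member" "dataplex_bucket_read" {
  for_each = local.dataplex_buckets
  bucket   = each.value
  role     = "roles/storage.objectViewer"
  member   = "serviceAccount:${module.dataplex.lake_service_account}"
}

# Read-only access on each attached BigQuery dataset.
resource "google_bigquery_dataset_iam_member" "dataplex_dataset_read" {
  for_each   = local.dataplex_datasets
  project    = "kv-data-prod"
  dataset_id = each.value
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${module.dataplex.lake_service_account}"
}
```

Driving the bindings from explicit sets keeps the grant list reviewable: adding an asset means adding one name, and nothing is readable by default.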