Quick take: a reusable Terraform module (hashicorp/google provider) for GCP Cloud Data Fusion, covering DEVELOPER/BASIC/ENTERPRISE editions, private-IP VPC peering, CMEK, RBAC, Stackdriver logging/monitoring toggles, accelerators, and event publishing to Pub/Sub. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "datafusion" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "..." # GCP project ID hosting the Data Fusion instance.
  app            = "..." # Workload short name used in the instance name (validate…
  environment    = "..." # One of `dev`, `staging`, `prod`, `sandbox`.
  location_short = "..." # Cosmetic region token for naming (≤6 chars to fit the 3…
  region         = "..." # GCP region, e.g. `europe-west1`.
}
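Then run terraform init && terraform apply. Note that private_instance defaults to true, and per the network variable's own description that requires network to be set as well; for a quick public sandbox, set private_instance = false instead. Every other input has a workable default; see Inputs below to override behaviour.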
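Because private_instance defaults to true and the module only emits a network_config block when network is also set, the five required inputs on their own leave a private instance with no network wiring. A minimal sketch of the two ways to complete the Quickstart, assuming a hypothetical VPC named my-vpc in my-project:

```hcl
module "datafusion" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "..."
  app            = "..."
  environment    = "..."
  location_short = "..."
  region         = "..."

  # Option A: stay private and name the VPC to peer into.
  # "my-vpc" is a hypothetical network name, not something the module creates.
  network = "projects/my-project/global/networks/my-vpc"

  # Option B: for a throwaway sandbox, opt out of private networking instead.
  # private_instance = false
}
```

Option A keeps the private-by-default posture the module is designed around; Option B is only reasonable for short-lived experiments.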
What this module is
Cloud Data Fusion is GCP’s fully managed, code-free data integration service built on the open-source CDAP project. It gives data engineers a visual, drag-and-drop pipeline studio with 150+ pre-built connectors and transforms, then compiles those pipelines down to ephemeral Dataproc (Spark/MapReduce) clusters at execution time. The product is sold as a single regional instance that hosts the design-time UI, the metadata/lineage store, and the pipeline orchestration plane — you pay per-instance-hour by edition (DEVELOPER, BASIC, ENTERPRISE), plus the Dataproc compute each pipeline run spins up. It is the GCP equivalent of Azure Data Factory’s mapping data flows or AWS Glue Studio, and it shines for teams who want ETL/ELT without hand-writing Spark.
The google_data_fusion_instance resource looks deceptively simple — name and type are the only required arguments — but a production instance is almost never the default. A real deployment is private_instance = true so the tenant VPC has no public IP, which forces a network_config block with a pre-allocated /22 peering range and a matching google_compute_global_address + google_service_networking_connection so the Google-managed tenant project can peer into your VPC. On top of that you usually want a crypto_key_config for CMEK, enable_rbac for namespace-level access control, Stackdriver logging/monitoring toggles, optional accelerators (CDC, Healthcare, CCAI Insights), and an event_publish_config that streams pipeline lifecycle events to Pub/Sub. Wire the peering range wrong and instance creation hangs for 20+ minutes before failing; forget deletion_policy and a terraform destroy can orphan the tenant project.
This module wraps google_data_fusion_instance plus its private-networking companions behind clean, validated variables. You pick an edition, optionally flip private_instance and pass a CIDR, and the module provisions the global address, the service-networking peering, and the instance itself — with consistent app-env-region naming, labels, and CMEK so every team ships Data Fusion the same safe, private-by-default way.
When to use it
- You want code-free, visual ETL/ELT on GCP — ingesting from Cloud SQL, BigQuery, GCS, Kafka, on-prem JDBC, SaaS APIs — without standing up and maintaining your own Spark/Airflow stack.
- You are migrating Informatica, Talend, SSIS, or Azure Data Factory pipelines and want a managed CDAP/Wrangler studio with lineage and a connector catalogue.
- You need enterprise networking guarantees: a private_instance with no public IP, VPC-peered into your shared VPC, reaching private Cloud SQL / on-prem sources over the peering link.
- You want change-data-capture from operational databases (the CDC accelerator / Replication feature) landing into BigQuery in near-real-time.
- You are standardizing a data platform and want every Data Fusion instance to carry the same edition policy, CMEK, RBAC, labels, and naming so spend and access are auditable.
Reach for Dataflow instead when you need pure code-first streaming/batch (Apache Beam) with fine-grained autoscaling, or Cloud Composer when orchestration (DAGs across many services) matters more than visual transformation. Data Fusion is the sweet spot when analysts and engineers want to build pipelines visually but still run them at Spark scale.
Module structure
terraform-module-gcp-datafusion/
├── versions.tf # provider + Terraform version pins
├── main.tf # global address + SN peering + google_data_fusion_instance
├── variables.tf # var-driven inputs with validation
└── outputs.tf # instance id/name, service & api endpoints, tenant project, gcs bucket
versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}
main.tf
locals {
  # Consistent app-env-region naming, e.g. "ingest-prod-euw1".
  # Data Fusion instance IDs must be lowercase, <= 30 chars, start with a letter.
  instance_name = "${var.app}-${var.environment}-${var.location_short}"

  # Private instances peer the Google-managed tenant project into your VPC via
  # Service Networking. That requires a reserved global address range up front.
  use_private_networking = var.private_instance && var.network != null
}
# ---------------------------------------------------------------------------
# Private connectivity (only created when private_instance + network are set)
# ---------------------------------------------------------------------------
# Reserve the /22 (or chosen prefix) that the Data Fusion tenant project will
# use for its managed nodes. Data Fusion requires a dedicated, non-overlapping
# range; 22 is the documented minimum for ENTERPRISE.
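resource "google_compute_global_address" "data_fusion" {
  count = local.use_private_networking ? 1 : 0

  project      = var.project_id
  name         = "${local.instance_name}-psa-range"
  purpose      = "VPC_PEERING"
  address_type = "INTERNAL"
  network      = var.network

  # When the caller passes an explicit CIDR, reserve exactly that range so the
  # reservation matches the ip_allocation handed to the instance. Otherwise the
  # address stays null and GCP picks a free range of the requested size.
  address       = try(split("/", var.ip_allocation_cidr)[0], null)
  prefix_length = try(tonumber(split("/", var.ip_allocation_cidr)[1]), var.ip_allocation_prefix_length)
}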
# Establish the Service Networking peering so the tenant project can reach the
# reserved range. Shared with other PSA consumers on the same VPC.
resource "google_service_networking_connection" "data_fusion" {
  count = local.use_private_networking ? 1 : 0

  network                 = var.network
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.data_fusion[0].name]
}
# ---------------------------------------------------------------------------
# The Data Fusion instance
# ---------------------------------------------------------------------------
resource "google_data_fusion_instance" "this" {
  project      = var.project_id
  name         = local.instance_name
  region       = var.region
  display_name = coalesce(var.display_name, local.instance_name)
  description  = var.description

  # DEVELOPER (cheapest, single-user dev), BASIC, or ENTERPRISE (HA, replication,
  # triggers, more concurrent pipelines). ENTERPRISE is the prod default.
  type = var.type

  # Pin the CDAP version for reproducibility; null = latest at create time.
  version = var.cdap_version

  # No public IP on the tenant project; traffic stays on the VPC peering link.
  private_instance = var.private_instance

  enable_stackdriver_logging    = var.enable_stackdriver_logging
  enable_stackdriver_monitoring = var.enable_stackdriver_monitoring

  # Namespace-scoped role-based access control (ENTERPRISE only).
  enable_rbac = var.type == "ENTERPRISE" ? var.enable_rbac : null

  # User-managed SA the ephemeral Dataproc clusters run as. Lets you scope
  # exactly what pipeline runs can touch instead of the default compute SA.
  dataproc_service_account = var.dataproc_service_account

  # Private network wiring. ip_allocation is the reserved range above; if the
  # caller passed an explicit CIDR we honour it, otherwise we let the reserved
  # global address drive allocation by name.
  dynamic "network_config" {
    for_each = local.use_private_networking ? [1] : []
    content {
      network       = var.network
      ip_allocation = coalesce(var.ip_allocation_cidr, "${google_compute_global_address.data_fusion[0].address}/${var.ip_allocation_prefix_length}")
    }
  }

  # Customer-managed encryption key for data at rest (metadata + pipeline state).
  dynamic "crypto_key_config" {
    for_each = var.kms_key_reference == null ? [] : [var.kms_key_reference]
    content {
      key_reference = crypto_key_config.value
    }
  }

  # Opt-in feature accelerators: CDC (Replication), HEALTHCARE, CCAI_INSIGHTS.
  dynamic "accelerators" {
    for_each = { for a in var.accelerators : a => a }
    content {
      accelerator_type = accelerators.value
      state            = "ENABLED"
    }
  }

  # Stream pipeline lifecycle events (start/stop/failure) to a Pub/Sub topic for
  # downstream alerting / orchestration.
  dynamic "event_publish_config" {
    for_each = var.event_publish_topic == null ? [] : [var.event_publish_topic]
    content {
      enabled = true
      topic   = event_publish_config.value
    }
  }

  labels = var.labels

  # DELETE | ABANDON. PREVENT-style protection is handled with prevent_destroy.
  deletion_policy = var.deletion_policy

  # The instance cannot be created until the peering connection exists.
  depends_on = [google_service_networking_connection.data_fusion]

  timeouts {
    create = "90m"
    update = "60m"
    delete = "60m"
  }
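  lifecycle {
    # prevent_destroy only accepts a literal, so it cannot be driven by a
    # variable. It is left false here so non-prod instances can be torn down;
    # set it to true in a fork or wrapper to guard a stateful prod instance.
    prevent_destroy = false
  }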
}
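The main.tf above silently skips network_config when private_instance is true but network is null, so the mistake only surfaces when the API rejects the instance. A minimal sketch of a plan-time guard, assuming you are free to add one resource to the module; the resource name validate_private_networking is hypothetical and not part of the published module:

```hcl
# Fails `terraform plan` early when a private instance has no VPC to peer into.
# terraform_data is built into Terraform >= 1.4, so no extra provider is needed.
resource "terraform_data" "validate_private_networking" {
  lifecycle {
    precondition {
      condition     = !var.private_instance || var.network != null
      error_message = "network must be set when private_instance = true (or set private_instance = false)."
    }
  }
}
```

The same precondition block could instead sit inside the existing lifecycle block of google_data_fusion_instance.this; a separate terraform_data resource just keeps the check visible on its own.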
variables.tf
variable "project_id" {
  description = "GCP project ID that will host the Data Fusion instance."
  type        = string
}

variable "app" {
  description = "Application/workload short name, used in the instance name (e.g. \"ingest\")."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,12}$", var.app))
    error_message = "app must be lowercase letters/digits/hyphen, 2-13 chars, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment (dev, staging, prod, sandbox)."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod", "sandbox"], var.environment)
    error_message = "environment must be one of: dev, staging, prod, sandbox."
  }
}

variable "location_short" {
  description = "Short region token for naming, e.g. \"euw1\", \"use4\". Cosmetic only."
  type        = string

  validation {
    # Keep the composed name within Data Fusion's 30-char instance-ID limit.
    condition     = length(var.location_short) <= 6
    error_message = "location_short must be 6 characters or fewer to fit the 30-char instance name limit."
  }
}

variable "region" {
  description = "GCP region for the instance, e.g. \"europe-west1\". Data Fusion is regional."
  type        = string
}

variable "type" {
  description = "Instance edition: DEVELOPER (cheap dev), BASIC, or ENTERPRISE (HA + advanced features)."
  type        = string
  default     = "ENTERPRISE"

  validation {
    condition     = contains(["DEVELOPER", "BASIC", "ENTERPRISE"], var.type)
    error_message = "type must be one of: DEVELOPER, BASIC, ENTERPRISE."
  }
}

variable "display_name" {
  description = "Human-friendly display name. Defaults to the generated instance name."
  type        = string
  default     = null
}

variable "description" {
  description = "Free-text description shown in the console."
  type        = string
  default     = "Managed by Terraform"
}

variable "cdap_version" {
  description = "Pin a specific Data Fusion (CDAP) version, e.g. \"6.10.0\". null = latest at create."
  type        = string
  default     = null
}

variable "private_instance" {
  description = "Provision with no public IP; tenant project peers into your VPC. Requires network."
  type        = bool
  default     = true
}

variable "network" {
  description = <<-EOT
    Self-link or short name of the VPC the private instance peers into (e.g.
    "projects/host-proj/global/networks/shared-vpc"). Required when
    private_instance = true; ignored otherwise.
  EOT
  type        = string
  default     = null
}

variable "ip_allocation_cidr" {
  description = "Explicit /22 peering CIDR for the tenant nodes. If null, a global address is reserved automatically."
  type        = string
  default     = null
}

variable "ip_allocation_prefix_length" {
  description = "Prefix length for the reserved peering range (22 is the documented minimum for ENTERPRISE)."
  type        = number
  default     = 22

  validation {
    condition     = var.ip_allocation_prefix_length >= 16 && var.ip_allocation_prefix_length <= 22
    error_message = "ip_allocation_prefix_length must be between 16 and 22."
  }
}

variable "enable_stackdriver_logging" {
  description = "Send instance + pipeline logs to Cloud Logging."
  type        = bool
  default     = true
}

variable "enable_stackdriver_monitoring" {
  description = "Send instance metrics to Cloud Monitoring."
  type        = bool
  default     = true
}

variable "enable_rbac" {
  description = "Enable namespace-scoped role-based access control. ENTERPRISE only; ignored on other editions."
  type        = bool
  default     = true
}

variable "dataproc_service_account" {
  description = "Email of the user-managed SA the ephemeral Dataproc clusters run as. null = default compute SA."
  type        = string
  default     = null
}

variable "kms_key_reference" {
  description = <<-EOT
    Full Cloud KMS CryptoKey resource ID for CMEK at rest, e.g.
    "projects/p/locations/europe-west1/keyRings/r/cryptoKeys/k". null = Google-managed key.
  EOT
  type        = string
  default     = null
}

variable "accelerators" {
  description = "List of feature accelerators to enable: any of CDC, HEALTHCARE, CCAI_INSIGHTS."
  type        = list(string)
  default     = []

  validation {
    condition     = alltrue([for a in var.accelerators : contains(["CDC", "HEALTHCARE", "CCAI_INSIGHTS"], a)])
    error_message = "accelerators may only contain CDC, HEALTHCARE, or CCAI_INSIGHTS."
  }
}

variable "event_publish_topic" {
  description = "Pub/Sub topic ID (projects/p/topics/t) to publish pipeline lifecycle events to. null = disabled."
  type        = string
  default     = null
}

variable "deletion_policy" {
  description = "What happens to the underlying instance on destroy: DELETE or ABANDON."
  type        = string
  default     = "DELETE"

  validation {
    condition     = contains(["DELETE", "ABANDON"], var.deletion_policy)
    error_message = "deletion_policy must be DELETE or ABANDON."
  }
}

variable "labels" {
  description = "Labels applied to the Data Fusion instance."
  type        = map(string)
  default     = {}
}
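Unlike app, environment, and the prefix length, ip_allocation_cidr is accepted without any shape check, so a typo only shows up at apply time. A possible hardening, sketched as a drop-in replacement for the existing ip_allocation_cidr block (this validation is an addition for illustration, not part of the module as published):

```hcl
variable "ip_allocation_cidr" {
  description = "Explicit /22 peering CIDR for the tenant nodes. If null, a global address is reserved automatically."
  type        = string
  default     = null

  validation {
    # null is allowed; any other value must parse as a CIDR block.
    # can() turns the error raised by cidrhost() on bad input into false.
    condition     = var.ip_allocation_cidr == null || can(cidrhost(var.ip_allocation_cidr, 0))
    error_message = "ip_allocation_cidr must be null or a valid CIDR block such as 10.84.0.0/22."
  }
}
```

This only checks that the string is a well-formed CIDR; overlap with existing subnets still has to be managed out of band.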
outputs.tf
output "instance_id" {
  description = "Fully qualified Data Fusion instance ID (projects/<p>/locations/<region>/instances/<name>)."
  value       = google_data_fusion_instance.this.id
}

output "instance_name" {
  description = "Data Fusion instance name (used in gcloud and IAM bindings)."
  value       = google_data_fusion_instance.this.name
}

output "service_endpoint" {
  description = "HTTPS endpoint of the Data Fusion UI / management REST API (CDAP)."
  value       = google_data_fusion_instance.this.service_endpoint
}

output "api_endpoint" {
  description = "REST API endpoint for programmatic pipeline deployment."
  value       = google_data_fusion_instance.this.api_endpoint
}

output "version" {
  description = "Resolved CDAP version actually running on the instance."
  value       = google_data_fusion_instance.this.version
}

output "gcs_bucket" {
  description = "Auto-created Cloud Storage bucket Data Fusion uses for pipeline artifacts/staging."
  value       = google_data_fusion_instance.this.gcs_bucket
}

output "tenant_project_id" {
  description = "Google-managed tenant project that hosts the instance control plane."
  value       = google_data_fusion_instance.this.tenant_project_id
}

output "p4_service_account" {
  description = "Service agent (P4 SA) to grant cross-project roles (e.g. dataproc.serviceAgent)."
  value       = google_data_fusion_instance.this.p4_service_account
}

output "state" {
  description = "Current instance state (RUNNING, CREATING, FAILED, ...)."
  value       = google_data_fusion_instance.this.state
}
How to use it
The example provisions a private, ENTERPRISE instance peered into a shared VPC with CMEK, the CDC accelerator, and pipeline events streamed to Pub/Sub. The downstream block grants the instance’s P4 service agent the Dataproc service-agent role in the project where pipelines actually run — using the module’s p4_service_account output instead of a hardcoded email.
module "data_fusion" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "kv-data-prod"
  app            = "ingest"
  environment    = "prod"
  location_short = "euw1"
  region         = "europe-west1"

  type             = "ENTERPRISE"
  private_instance = true
  network          = "projects/kv-host-prod/global/networks/shared-vpc"

  # Reserve a dedicated /22 for the tenant nodes (must not overlap the VPC).
  ip_allocation_cidr          = "10.84.0.0/22"
  ip_allocation_prefix_length = 22

  enable_rbac              = true
  dataproc_service_account = "df-pipelines@kv-data-prod.iam.gserviceaccount.com"
  kms_key_reference        = "projects/kv-sec-prod/locations/europe-west1/keyRings/data/cryptoKeys/datafusion"

  # Change-data-capture into BigQuery, with lifecycle events to Pub/Sub.
  accelerators        = ["CDC"]
  event_publish_topic = "projects/kv-data-prod/topics/df-pipeline-events"

  labels = {
    team        = "data-platform"
    cost-center = "kv-1042"
    workload    = "ingest"
  }
}

# Downstream: the Data Fusion service agent must be able to spin up the
# ephemeral Dataproc clusters that run pipelines. Bind it by the module output.
resource "google_project_iam_member" "df_dataproc_agent" {
  project = "kv-data-prod"
  role    = "roles/dataproc.serviceAgent"
  member  = "serviceAccount:${module.data_fusion.p4_service_account}"
}

# And let a CI job deploy pipelines against the instance's REST endpoint.
output "pipeline_deploy_endpoint" {
  value = module.data_fusion.api_endpoint
}
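The example above passes event_publish_topic as a literal string, which assumes the topic already exists and that Data Fusion is allowed to publish to it. A minimal sketch of managing both in the same configuration; it assumes the instance's service agent (the p4_service_account output) is the publishing identity, which is worth confirming against the Data Fusion documentation before relying on it:

```hcl
# Topic that receives pipeline lifecycle events.
resource "google_pubsub_topic" "df_pipeline_events" {
  project = "kv-data-prod"
  name    = "df-pipeline-events"
}

# Assumption: the Data Fusion service agent publishes the events, so it needs
# the publisher role on this one topic only.
resource "google_pubsub_topic_iam_member" "df_publisher" {
  project = google_pubsub_topic.df_pipeline_events.project
  topic   = google_pubsub_topic.df_pipeline_events.name
  role    = "roles/pubsub.publisher"
  member  = "serviceAccount:${module.data_fusion.p4_service_account}"
}
```

With the topic managed here, the module call could take event_publish_topic = google_pubsub_topic.df_pipeline_events.id instead of the literal, so Terraform creates the topic before the instance.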
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}
2. Module config — live/prod/datafusion/terragrunt.hcl:
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  app            = "..."
  environment    = "..."
  location_short = "..."
  region         = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/datafusion && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply      # or: every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
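To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might look like. The path live/dev/datafusion/terragrunt.hcl and the specific overrides are illustrative assumptions, not taken from the module's repository:

```hcl
# live/dev/datafusion/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  app            = "..."
  environment    = "dev"
  location_short = "..."
  region         = "..."

  # Only the inputs that differ from the module defaults:
  type             = "DEVELOPER" # cheapest edition for single-user dev work
  private_instance = false       # skips the reserved range and peering
}
```

Everything else (backend, module version pin) is shared with prod, which is the point of the layout.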
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | n/a | Yes | GCP project ID hosting the Data Fusion instance. |
| app | string | n/a | Yes | Workload short name used in the instance name (validated lowercase, ≤13 chars). |
| environment | string | n/a | Yes | One of dev, staging, prod, sandbox. |
| location_short | string | n/a | Yes | Cosmetic region token for naming (≤6 chars to fit the 30-char ID limit). |
| region | string | n/a | Yes | GCP region, e.g. europe-west1. |
| type | string | "ENTERPRISE" | No | Edition: DEVELOPER, BASIC, or ENTERPRISE. |
| display_name | string | null | No | Display name; defaults to the generated instance name. |
| description | string | "Managed by Terraform" | No | Console description. |
| cdap_version | string | null | No | Pin a specific CDAP version (e.g. 6.10.0). |
| private_instance | bool | true | No | No public IP; peer the tenant project into network. |
| network | string | null | No | VPC self-link/name to peer into. Required when private_instance = true. |
| ip_allocation_cidr | string | null | No | Explicit /22 peering CIDR; auto-reserved if null. |
| ip_allocation_prefix_length | number | 22 | No | Prefix length for the reserved range (16–22). |
| enable_stackdriver_logging | bool | true | No | Ship logs to Cloud Logging. |
| enable_stackdriver_monitoring | bool | true | No | Ship metrics to Cloud Monitoring. |
| enable_rbac | bool | true | No | Namespace RBAC (ENTERPRISE only). |
| dataproc_service_account | string | null | No | User-managed SA the ephemeral Dataproc clusters run as. |
| kms_key_reference | string | null | No | Cloud KMS key for CMEK at rest. |
| accelerators | list(string) | [] | No | Any of CDC, HEALTHCARE, CCAI_INSIGHTS. |
| event_publish_topic | string | null | No | Pub/Sub topic for pipeline lifecycle events. |
| deletion_policy | string | "DELETE" | No | DELETE or ABANDON on destroy. |
| labels | map(string) | {} | No | Labels applied to the instance. |
Outputs
| Name | Description |
|---|---|
| instance_id | Fully qualified instance ID (projects/<p>/locations/<region>/instances/<name>). |
| instance_name | Instance name used in gcloud and IAM bindings. |
| service_endpoint | HTTPS endpoint of the Data Fusion UI / CDAP REST API. |
| api_endpoint | REST API endpoint for programmatic pipeline deployment. |
| version | Resolved CDAP version running on the instance. |
| gcs_bucket | Auto-created GCS bucket for pipeline artifacts/staging. |
| tenant_project_id | Google-managed tenant project hosting the control plane. |
| p4_service_account | Service agent (P4 SA) for cross-project role grants. |
| state | Current instance state (RUNNING, CREATING, FAILED, …). |
Enterprise scenario
A retail bank consolidates ten operational databases (Cloud SQL for PostgreSQL plus on-prem Oracle reached over Interconnect) into a BigQuery lakehouse. They deploy this module per environment as a private, ENTERPRISE instance peered into the shared VPC with a dedicated 10.84.0.0/22 range, CMEK from the security project’s key ring, and the CDC accelerator enabled for near-real-time replication. Pipeline lifecycle events flow to a df-pipeline-events Pub/Sub topic that a Cloud Function consumes to page on-call when an overnight load fails, and the p4_service_account output drives the cross-project Dataproc service-agent binding — so standing up a new region’s ingestion plane is a single module block plus a non-overlapping CIDR.
Best practices
- Plan the peering CIDR before the first apply. Data Fusion reserves a /22 for tenant nodes that must not overlap any subnet or other PSA consumer on the VPC. Allocate it from an IPAM-managed block, set it explicitly via ip_allocation_cidr, and keep private_instance = true so the instance has no public IP and reaches private sources over the peering link.
- Match the edition to the workload to control cost. ENTERPRISE bills the highest per-instance-hour and is the only edition with HA, replication/CDC, and RBAC. Use it for prod, but drop dev/sandbox to DEVELOPER (or pause instances out of hours), and remember the real spend driver is the ephemeral Dataproc clusters, not the instance itself.
- Run pipelines as a scoped service account, not the default. Set dataproc_service_account to a least-privilege SA so a runaway pipeline can only touch its intended buckets/datasets, and grant the p4_service_account only roles/dataproc.serviceAgent (and any cross-project read roles) rather than broad project editor.
- Encrypt at rest with CMEK and turn on RBAC. Pass kms_key_reference from a dedicated key ring so encryption keys are rotated and revocable independently of the data, and keep enable_rbac = true on ENTERPRISE so pipeline namespaces are isolated per team.
- Wire observability and events from day one. Leave enable_stackdriver_logging / enable_stackdriver_monitoring on and point event_publish_topic at Pub/Sub so pipeline failures trigger alerting instead of being discovered the next morning.
- Standardize naming, labels, and timeouts. The app-env-region instance name (kept under the 30-char limit) plus team / cost-center labels make Data Fusion spend attributable; keep the generous create/update timeouts because instance provisioning routinely takes 20–30 minutes.
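The scoped-service-account practice could be wired up roughly as follows. This is a sketch under assumptions: the role set (roles/datafusion.runner and roles/dataproc.worker for the pipeline SA, roles/iam.serviceAccountUser for the service agent) reflects the commonly documented minimum and should be checked against current Data Fusion guidance, and the resource names are hypothetical:

```hcl
# Dedicated identity for the ephemeral Dataproc clusters that run pipelines.
resource "google_service_account" "df_pipelines" {
  project      = "kv-data-prod"
  account_id   = "df-pipelines"
  display_name = "Data Fusion pipeline runner"
}

# Assumed baseline roles for running pipelines; add dataset- or bucket-level
# grants separately rather than widening these project-level roles.
resource "google_project_iam_member" "df_pipelines_roles" {
  for_each = toset(["roles/datafusion.runner", "roles/dataproc.worker"])

  project = "kv-data-prod"
  role    = each.value
  member  = "serviceAccount:${google_service_account.df_pipelines.email}"
}

# Let the Data Fusion service agent launch clusters that run as this SA.
resource "google_service_account_iam_member" "p4_can_act_as" {
  service_account_id = google_service_account.df_pipelines.name
  role               = "roles/iam.serviceAccountUser"
  member             = "serviceAccount:${module.data_fusion.p4_service_account}"
}
```

The module call would then take dataproc_service_account = google_service_account.df_pipelines.email instead of a hardcoded address.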