Terraform Module: GCP Data Fusion — private CDAP pipelines in one block

Quick take: a reusable Terraform module, built on the hashicorp/google provider, for GCP Cloud Data Fusion. It covers the DEVELOPER, BASIC, and ENTERPRISE editions, private-IP VPC peering, CMEK, RBAC, Cloud Logging and Monitoring (the Stackdriver toggles), accelerators, and event publishing to Pub/Sub. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "datafusion" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "..."  # GCP project ID hosting the Data Fusion instance.
  app            = "..."  # Workload short name used in the instance name (validate…
  environment    = "..."  # One of `dev`, `staging`, `prod`, `sandbox`.
  location_short = "..."  # Cosmetic region token for naming (≤6 chars to fit the 3…
  region         = "..."  # GCP region, e.g. `europe-west1`.
}

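Then run terraform init && terraform apply. Every other input has a default (see Inputs below to override behaviour), with one caveat: private_instance defaults to true and requires network, so either pass your VPC in network or set private_instance = false for a first test.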
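For a throwaway sandbox where a public endpoint is acceptable, the Quickstart can be trimmed further. The block below is a minimal sketch, not a recommended production setup: the module label, project ID, and naming token are hypothetical placeholders, and only inputs documented in this module are used.

```hcl
module "datafusion_sandbox" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "my-project"  # hypothetical project ID
  app            = "ingest"      # hypothetical workload name
  environment    = "sandbox"
  location_short = "usc1"        # hypothetical naming token
  region         = "us-central1"

  type             = "DEVELOPER" # cheapest edition, intended for dev work
  private_instance = false       # public endpoint, so no VPC or peering range
}
```

With private_instance = false the module skips the global address and the Service Networking connection, so nothing else needs to exist in the project beforehand.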

What this module is

Cloud Data Fusion is GCP’s fully managed, code-free data integration service built on the open-source CDAP project. It gives data engineers a visual, drag-and-drop pipeline studio with 150+ pre-built connectors and transforms, then compiles those pipelines down to ephemeral Dataproc (Spark/MapReduce) clusters at execution time. The product is sold as a single regional instance that hosts the design-time UI, the metadata/lineage store, and the pipeline orchestration plane — you pay per-instance-hour by edition (DEVELOPER, BASIC, ENTERPRISE), plus the Dataproc compute each pipeline run spins up. It is the GCP equivalent of Azure Data Factory’s mapping data flows or AWS Glue Studio, and it shines for teams who want ETL/ELT without hand-writing Spark.

The google_data_fusion_instance resource looks deceptively simple (name and type are the only required arguments), but a production instance is almost never the default. A real deployment sets private_instance = true so the instance has no public IP, which forces a network_config block with a pre-allocated /22 peering range, plus a matching google_compute_global_address and google_service_networking_connection so the Google-managed tenant project can peer into your VPC. On top of that you usually want a crypto_key_config for CMEK, enable_rbac for namespace-level access control, the Stackdriver logging/monitoring toggles, optional accelerators (CDC, Healthcare, CCAI Insights), and an event_publish_config that streams pipeline lifecycle events to Pub/Sub. Wire the peering range wrong and instance creation hangs for 20+ minutes before failing; forget deletion_policy and a terraform destroy can orphan the tenant project.

This module wraps google_data_fusion_instance plus its private-networking companions behind clean, validated variables. You pick an edition, optionally flip private_instance and pass a CIDR, and the module provisions the global address, the service-networking peering, and the instance itself — with consistent app-env-region naming, labels, and CMEK so every team ships Data Fusion the same safe, private-by-default way.

When to use it

Reach for Dataflow instead when you need pure code-first streaming/batch (Apache Beam) with fine-grained autoscaling, or Cloud Composer when orchestration (DAGs across many services) matters more than visual transformation. Data Fusion is the sweet spot when analysts and engineers want to build pipelines visually but still run them at Spark scale.

Module structure

terraform-module-gcp-datafusion/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # global address + SN peering + google_data_fusion_instance
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # instance id/name, service & api endpoints, tenant project, gcs bucket

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Consistent app-env-region naming, e.g. "ingest-prod-euw1".
  # Data Fusion instance IDs must be lowercase, <= 30 chars, start with a letter.
  instance_name = "${var.app}-${var.environment}-${var.location_short}"

  # Private instances peer the Google-managed tenant project into your VPC via
  # Service Networking. That requires a reserved global address range up front.
  use_private_networking = var.private_instance && var.network != null
}

# ---------------------------------------------------------------------------
# Private connectivity (only created when private_instance + network are set)
# ---------------------------------------------------------------------------

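# Reserve the /22 (or chosen prefix) that the Data Fusion tenant project will
# use for its managed nodes. Data Fusion requires a dedicated, non-overlapping
# range; 22 is the documented minimum for ENTERPRISE.
resource "google_compute_global_address" "data_fusion" {
  count = local.use_private_networking ? 1 : 0

  project      = var.project_id
  name         = "${local.instance_name}-psa-range"
  purpose      = "VPC_PEERING"
  address_type = "INTERNAL"
  network      = var.network

  # When the caller passes an explicit CIDR, reserve exactly that range so the
  # reservation matches network_config.ip_allocation; otherwise GCP picks a
  # free range of the requested prefix length.
  address       = var.ip_allocation_cidr == null ? null : cidrhost(var.ip_allocation_cidr, 0)
  prefix_length = var.ip_allocation_cidr == null ? var.ip_allocation_prefix_length : tonumber(split("/", var.ip_allocation_cidr)[1])
}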

# Establish the Service Networking peering so the tenant project can reach the
# reserved range. Shared with other PSA consumers on the same VPC.
resource "google_service_networking_connection" "data_fusion" {
  count = local.use_private_networking ? 1 : 0

  network                 = var.network
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.data_fusion[0].name]
}

# ---------------------------------------------------------------------------
# The Data Fusion instance
# ---------------------------------------------------------------------------

resource "google_data_fusion_instance" "this" {
  project      = var.project_id
  name         = local.instance_name
  region       = var.region
  display_name = coalesce(var.display_name, local.instance_name)
  description  = var.description

  # DEVELOPER (cheapest, single-user dev), BASIC, or ENTERPRISE (HA, replication,
  # triggers, more concurrent pipelines). ENTERPRISE is the prod default.
  type = var.type

  # Pin the CDAP version for reproducibility; null = latest at create time.
  version = var.cdap_version

  # No public IP on the tenant project — traffic stays on the VPC peering link.
  private_instance = var.private_instance

  enable_stackdriver_logging    = var.enable_stackdriver_logging
  enable_stackdriver_monitoring = var.enable_stackdriver_monitoring

  # Namespace-scoped role-based access control (ENTERPRISE only).
  enable_rbac = var.type == "ENTERPRISE" ? var.enable_rbac : null

  # User-managed SA the ephemeral Dataproc clusters run as. Lets you scope
  # exactly what pipeline runs can touch instead of the default compute SA.
  dataproc_service_account = var.dataproc_service_account

  # Private network wiring. If the caller passed an explicit CIDR we honour it;
  # otherwise ip_allocation is built from the address of the reserved global
  # range above plus the configured prefix length.
  dynamic "network_config" {
    for_each = local.use_private_networking ? [1] : []
    content {
      network       = var.network
      ip_allocation = coalesce(var.ip_allocation_cidr, "${google_compute_global_address.data_fusion[0].address}/${var.ip_allocation_prefix_length}")
    }
  }

  # Customer-managed encryption key for data at rest (metadata + pipeline state).
  dynamic "crypto_key_config" {
    for_each = var.kms_key_reference == null ? [] : [var.kms_key_reference]
    content {
      key_reference = crypto_key_config.value
    }
  }

  # Opt-in feature accelerators: CDC (Replication), HEALTHCARE, CCAI_INSIGHTS.
  dynamic "accelerators" {
    for_each = { for a in var.accelerators : a => a }
    content {
      accelerator_type = accelerators.value
      state            = "ENABLED"
    }
  }

  # Stream pipeline lifecycle events (start/stop/failure) to a Pub/Sub topic for
  # downstream alerting / orchestration.
  dynamic "event_publish_config" {
    for_each = var.event_publish_topic == null ? [] : [var.event_publish_topic]
    content {
      enabled = true
      topic   = event_publish_config.value
    }
  }

  labels = var.labels

  # DELETE | ABANDON. PREVENT-style protection is handled with prevent_destroy.
  deletion_policy = var.deletion_policy

  # The instance cannot be created until the peering connection exists.
  depends_on = [google_service_networking_connection.data_fusion]

  timeouts {
    create = "90m"
    update = "60m"
    delete = "60m"
  }

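  lifecycle {
    # prevent_destroy must be a literal (it cannot come from a variable).
    # It is left false here; set it to true to guard against accidental
    # teardown of a stateful pipeline plane.
    prevent_destroy = false
  }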
}
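
As written, main.tf silently skips the peering resources when private_instance = true but network is null, and nothing checks the composed name against the 30-character limit mentioned in the comments. One way to surface both at plan time is a pair of preconditions. This is a sketch only: the terraform_data resource and its name are hypothetical additions, not part of the module as published.

```hcl
# Hypothetical guard resource; it creates nothing in GCP and exists only to
# carry plan-time checks. terraform_data is built in from Terraform 1.4.
resource "terraform_data" "input_guards" {
  lifecycle {
    precondition {
      # A private instance needs a VPC to peer into.
      condition     = !var.private_instance || var.network != null
      error_message = "network must be set when private_instance = true."
    }

    precondition {
      # app + environment + location_short are joined with hyphens.
      condition     = length(local.instance_name) <= 30
      error_message = "The composed instance name exceeds the 30-character limit."
    }
  }
}
```

The same precondition blocks could instead live in the lifecycle block of google_data_fusion_instance.this; a separate resource just keeps the checks in one visible place.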

variables.tf

variable "project_id" {
  description = "GCP project ID that will host the Data Fusion instance."
  type        = string
}

variable "app" {
  description = "Application/workload short name, used in the instance name (e.g. \"ingest\")."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,12}$", var.app))
    error_message = "app must be lowercase letters/digits/hyphen, 2-13 chars, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment (dev, staging, prod, sandbox)."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod", "sandbox"], var.environment)
    error_message = "environment must be one of: dev, staging, prod, sandbox."
  }
}

variable "location_short" {
  description = "Short region token for naming, e.g. \"euw1\", \"use4\". Cosmetic only."
  type        = string

  validation {
    # Keep the composed name within Data Fusion's 30-char instance-ID limit.
    condition     = length(var.location_short) <= 6
    error_message = "location_short must be 6 characters or fewer to fit the 30-char instance name limit."
  }
}

variable "region" {
  description = "GCP region for the instance, e.g. \"europe-west1\". Data Fusion is regional."
  type        = string
}

variable "type" {
  description = "Instance edition: DEVELOPER (cheap dev), BASIC, or ENTERPRISE (HA + advanced features)."
  type        = string
  default     = "ENTERPRISE"

  validation {
    condition     = contains(["DEVELOPER", "BASIC", "ENTERPRISE"], var.type)
    error_message = "type must be one of: DEVELOPER, BASIC, ENTERPRISE."
  }
}

variable "display_name" {
  description = "Human-friendly display name. Defaults to the generated instance name."
  type        = string
  default     = null
}

variable "description" {
  description = "Free-text description shown in the console."
  type        = string
  default     = "Managed by Terraform"
}

variable "cdap_version" {
  description = "Pin a specific Data Fusion (CDAP) version, e.g. \"6.10.0\". null = latest at create."
  type        = string
  default     = null
}

variable "private_instance" {
  description = "Provision with no public IP; tenant project peers into your VPC. Requires network."
  type        = bool
  default     = true
}

variable "network" {
  description = <<-EOT
    Self-link or short name of the VPC the private instance peers into (e.g.
    "projects/host-proj/global/networks/shared-vpc"). Required when
    private_instance = true; ignored otherwise.
  EOT
  type        = string
  default     = null
}

variable "ip_allocation_cidr" {
  description = "Explicit /22 peering CIDR for the tenant nodes. If null, a global address is reserved automatically."
  type        = string
  default     = null
}

variable "ip_allocation_prefix_length" {
  description = "Prefix length for the reserved peering range (22 is the documented minimum for ENTERPRISE)."
  type        = number
  default     = 22

  validation {
    condition     = var.ip_allocation_prefix_length >= 16 && var.ip_allocation_prefix_length <= 22
    error_message = "ip_allocation_prefix_length must be between 16 and 22."
  }
}

variable "enable_stackdriver_logging" {
  description = "Send instance + pipeline logs to Cloud Logging."
  type        = bool
  default     = true
}

variable "enable_stackdriver_monitoring" {
  description = "Send instance metrics to Cloud Monitoring."
  type        = bool
  default     = true
}

variable "enable_rbac" {
  description = "Enable namespace-scoped role-based access control. ENTERPRISE only; ignored on other editions."
  type        = bool
  default     = true
}

variable "dataproc_service_account" {
  description = "Email of the user-managed SA the ephemeral Dataproc clusters run as. null = default compute SA."
  type        = string
  default     = null
}

variable "kms_key_reference" {
  description = <<-EOT
    Full Cloud KMS CryptoKey resource ID for CMEK at rest, e.g.
    "projects/p/locations/europe-west1/keyRings/r/cryptoKeys/k". null = Google-managed key.
  EOT
  type        = string
  default     = null
}

variable "accelerators" {
  description = "List of feature accelerators to enable: any of CDC, HEALTHCARE, CCAI_INSIGHTS."
  type        = list(string)
  default     = []

  validation {
    condition     = alltrue([for a in var.accelerators : contains(["CDC", "HEALTHCARE", "CCAI_INSIGHTS"], a)])
    error_message = "accelerators may only contain CDC, HEALTHCARE, or CCAI_INSIGHTS."
  }
}

variable "event_publish_topic" {
  description = "Pub/Sub topic ID (projects/p/topics/t) to publish pipeline lifecycle events to. null = disabled."
  type        = string
  default     = null
}

variable "deletion_policy" {
  description = "What happens to the underlying instance on destroy: DELETE or ABANDON."
  type        = string
  default     = "DELETE"

  validation {
    condition     = contains(["DELETE", "ABANDON"], var.deletion_policy)
    error_message = "deletion_policy must be DELETE or ABANDON."
  }
}

variable "labels" {
  description = "Labels applied to the Data Fusion instance."
  type        = map(string)
  default     = {}
}
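
Several string inputs above are documented with a fixed shape but not validated. As an illustration, the event_publish_topic variable could reject malformed topic IDs early. The regular expression below is an assumption about what a reasonable check looks like, not the published module's behaviour.

```hcl
variable "event_publish_topic" {
  description = "Pub/Sub topic ID (projects/p/topics/t) to publish pipeline lifecycle events to. null = disabled."
  type        = string
  default     = null

  validation {
    # can() turns the regex error raised by a null or malformed value into
    # false, so null remains allowed through the first operand.
    condition = (
      var.event_publish_topic == null ||
      can(regex("^projects/[^/]+/topics/[^/]+$", var.event_publish_topic))
    )
    error_message = "event_publish_topic must look like projects/<project>/topics/<topic>, or be null."
  }
}
```

The same pattern would fit kms_key_reference, with a pattern matching the projects/.../cryptoKeys/... form given in its description.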

outputs.tf

output "instance_id" {
  description = "Fully qualified Data Fusion instance ID (projects/<p>/locations/<region>/instances/<name>)."
  value       = google_data_fusion_instance.this.id
}

output "instance_name" {
  description = "Data Fusion instance name (used in gcloud and IAM bindings)."
  value       = google_data_fusion_instance.this.name
}

output "service_endpoint" {
  description = "HTTPS endpoint of the Data Fusion UI / management REST API (CDAP)."
  value       = google_data_fusion_instance.this.service_endpoint
}

output "api_endpoint" {
  description = "REST API endpoint for programmatic pipeline deployment."
  value       = google_data_fusion_instance.this.api_endpoint
}

output "version" {
  description = "Resolved CDAP version actually running on the instance."
  value       = google_data_fusion_instance.this.version
}

output "gcs_bucket" {
  description = "Auto-created Cloud Storage bucket Data Fusion uses for pipeline artifacts/staging."
  value       = google_data_fusion_instance.this.gcs_bucket
}

output "tenant_project_id" {
  description = "Google-managed tenant project that hosts the instance control plane."
  value       = google_data_fusion_instance.this.tenant_project_id
}

output "p4_service_account" {
  description = "Service agent (P4 SA) to grant cross-project roles (e.g. dataproc.serviceAgent)."
  value       = google_data_fusion_instance.this.p4_service_account
}

output "state" {
  description = "Current instance state (RUNNING, CREATING, FAILED, ...)."
  value       = google_data_fusion_instance.this.state
}
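
The comments in main.tf note that the Service Networking connection is shared with other PSA consumers on the same VPC, so callers may want the name of the reserved range as well. A possible extra output, shown here as a hypothetical addition rather than something the module exposes today:

```hcl
output "peering_range_name" {
  description = "Name of the reserved peering range, or null when private networking is off."
  # one() returns the single element of the list, or null if the list is empty,
  # which matches the count = 0 / count = 1 pattern used in main.tf.
  value = one(google_compute_global_address.data_fusion[*].name)
}
```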

How to use it

The example provisions a private, ENTERPRISE instance peered into a shared VPC with CMEK, the CDC accelerator, and pipeline events streamed to Pub/Sub. The downstream block grants the instance’s P4 service agent the Dataproc service-agent role in the project where pipelines actually run — using the module’s p4_service_account output instead of a hardcoded email.

module "data_fusion" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "kv-data-prod"
  app            = "ingest"
  environment    = "prod"
  location_short = "euw1"
  region         = "europe-west1"

  type             = "ENTERPRISE"
  private_instance = true
  network          = "projects/kv-host-prod/global/networks/shared-vpc"

  # Reserve a dedicated /22 for the tenant nodes (must not overlap the VPC).
  ip_allocation_cidr          = "10.84.0.0/22"
  ip_allocation_prefix_length = 22

  enable_rbac              = true
  dataproc_service_account = "df-pipelines@kv-data-prod.iam.gserviceaccount.com"
  kms_key_reference        = "projects/kv-sec-prod/locations/europe-west1/keyRings/data/cryptoKeys/datafusion"

  # Change-data-capture into BigQuery, with lifecycle events to Pub/Sub.
  accelerators        = ["CDC"]
  event_publish_topic = "projects/kv-data-prod/topics/df-pipeline-events"

  labels = {
    team        = "data-platform"
    cost-center = "kv-1042"
    workload    = "ingest"
  }
}

# Downstream: the Data Fusion service agent must be able to spin up the
# ephemeral Dataproc clusters that run pipelines. Bind it by the module output.
resource "google_project_iam_member" "df_dataproc_agent" {
  project = "kv-data-prod"
  role    = "roles/dataproc.serviceAgent"
  member  = "serviceAccount:${module.data_fusion.p4_service_account}"
}

# And let a CI job deploy pipelines against the instance's REST endpoint.
output "pipeline_deploy_endpoint" {
  value = module.data_fusion.api_endpoint
}
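
The example passes a Pub/Sub topic ID as a plain string, which assumes the topic already exists and that Data Fusion is allowed to publish to it. A minimal sketch of managing both next to the module call follows. The resource names are hypothetical, and binding roles/pubsub.publisher to the P4 service agent is an assumption about which identity publishes the events, so check it against the Data Fusion documentation before relying on it.

```hcl
# Hypothetical topic managed alongside the module call.
resource "google_pubsub_topic" "df_events" {
  project = "kv-data-prod"
  name    = "df-pipeline-events"
}

# Assumption: the Data Fusion service agent is the publishing identity.
resource "google_pubsub_topic_iam_member" "df_events_publisher" {
  project = google_pubsub_topic.df_events.project
  topic   = google_pubsub_topic.df_events.name
  role    = "roles/pubsub.publisher"
  member  = "serviceAccount:${module.data_fusion.p4_service_account}"
}
```

With this in place, event_publish_topic in the module block can be set to google_pubsub_topic.df_events.id instead of the literal string, which also makes Terraform create the topic before the instance.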

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config, live/prod/datafusion/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  app            = "..."
  environment    = "..."
  location_short = "..."
  region         = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/datafusion && terragrunt apply        # this module
terragrunt run-all apply                      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
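
To make the per-environment override concrete, a dev counterpart to the prod config might look like the sketch below. The live/dev/datafusion path and the project ID are hypothetical, chosen to mirror the prod layout; the inputs themselves are the module's documented variables.

```hcl
# live/dev/datafusion/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"
}

inputs = {
  project_id     = "kv-data-dev" # hypothetical project ID
  app            = "ingest"
  environment    = "dev"
  location_short = "euw1"
  region         = "europe-west1"

  # Only the values that differ from the module defaults.
  type             = "DEVELOPER"
  private_instance = false
}
```

Everything else (logging, monitoring, deletion policy, labels) falls back to the module defaults, so the dev and prod files differ only where the environments really do.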

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | - | Yes | GCP project ID hosting the Data Fusion instance. |
| app | string | - | Yes | Workload short name used in the instance name (validated lowercase, ≤13 chars). |
| environment | string | - | Yes | One of dev, staging, prod, sandbox. |
| location_short | string | - | Yes | Cosmetic region token for naming (≤6 chars to fit the 30-char ID limit). |
| region | string | - | Yes | GCP region, e.g. europe-west1. |
| type | string | "ENTERPRISE" | No | Edition: DEVELOPER, BASIC, or ENTERPRISE. |
| display_name | string | null | No | Display name; defaults to the generated instance name. |
| description | string | "Managed by Terraform" | No | Console description. |
| cdap_version | string | null | No | Pin a specific CDAP version (e.g. 6.10.0). |
| private_instance | bool | true | No | No public IP; peer the tenant project into network. |
| network | string | null | No | VPC self-link/name to peer into. Required when private_instance = true. |
| ip_allocation_cidr | string | null | No | Explicit /22 peering CIDR; auto-reserved if null. |
| ip_allocation_prefix_length | number | 22 | No | Prefix length for the reserved range (16–22). |
| enable_stackdriver_logging | bool | true | No | Ship logs to Cloud Logging. |
| enable_stackdriver_monitoring | bool | true | No | Ship metrics to Cloud Monitoring. |
| enable_rbac | bool | true | No | Namespace RBAC (ENTERPRISE only). |
| dataproc_service_account | string | null | No | User-managed SA the ephemeral Dataproc clusters run as. |
| kms_key_reference | string | null | No | Cloud KMS key for CMEK at rest. |
| accelerators | list(string) | [] | No | Any of CDC, HEALTHCARE, CCAI_INSIGHTS. |
| event_publish_topic | string | null | No | Pub/Sub topic for pipeline lifecycle events. |
| deletion_policy | string | "DELETE" | No | DELETE or ABANDON on destroy. |
| labels | map(string) | {} | No | Labels applied to the instance. |

Outputs

| Name | Description |
|---|---|
| instance_id | Fully qualified instance ID (projects/<p>/locations/<region>/instances/<name>). |
| instance_name | Instance name used in gcloud and IAM bindings. |
| service_endpoint | HTTPS endpoint of the Data Fusion UI / CDAP REST API. |
| api_endpoint | REST API endpoint for programmatic pipeline deployment. |
| version | Resolved CDAP version running on the instance. |
| gcs_bucket | Auto-created GCS bucket for pipeline artifacts/staging. |
| tenant_project_id | Google-managed tenant project hosting the control plane. |
| p4_service_account | Service agent (P4 SA) for cross-project role grants. |
| state | Current instance state (RUNNING, CREATING, FAILED, …). |

Enterprise scenario

A retail bank consolidates ten operational databases (Cloud SQL for PostgreSQL plus on-prem Oracle reached over Interconnect) into a BigQuery lakehouse. They deploy this module per environment as a private, ENTERPRISE instance peered into the shared VPC with a dedicated 10.84.0.0/22 range, CMEK from the security project’s key ring, and the CDC accelerator enabled for near-real-time replication. Pipeline lifecycle events flow to a df-pipeline-events Pub/Sub topic that a Cloud Function consumes to page on-call when an overnight load fails, and the p4_service_account output drives the cross-project Dataproc service-agent binding — so standing up a new region’s ingestion plane is a single module block plus a non-overlapping CIDR.
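
The "single module block plus a non-overlapping CIDR" idea can be pushed one step further with for_each on the module. The sketch below is illustrative: the europe-west1 values come from the scenario, while the second region, its CIDR, and the local and output names are hypothetical. CMEK is left out because a Cloud KMS key has to live in the same location as the instance, so each region would need its own key reference.

```hcl
locals {
  # Key = naming token, value = region and its dedicated peering range.
  ingest_regions = {
    euw1 = { region = "europe-west1", cidr = "10.84.0.0/22" }
    euw4 = { region = "europe-west4", cidr = "10.84.4.0/22" } # hypothetical
  }
}

module "data_fusion" {
  for_each = local.ingest_regions
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-datafusion?ref=v1.0.0"

  project_id     = "kv-data-prod"
  app            = "ingest"
  environment    = "prod"
  location_short = each.key
  region         = each.value.region

  private_instance   = true
  network            = "projects/kv-host-prod/global/networks/shared-vpc"
  ip_allocation_cidr = each.value.cidr

  accelerators        = ["CDC"]
  event_publish_topic = "projects/kv-data-prod/topics/df-pipeline-events"
}

# One service agent per region, keyed by the same naming token.
output "data_fusion_service_agents" {
  value = { for key, instance in module.data_fusion : key => instance.p4_service_account }
}
```

Adding a region then means adding one map entry, and the plan shows exactly one new instance plus its peering range.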

Best practices

