
Terraform Module: GCP Dataform — version-controlled BigQuery ELT repositories as code

Quick take — Build a reusable Terraform module for google_dataform_repository on hashicorp/google ~> 5.0: wire up Git remotes, CMEK, a dedicated service account, and workspace compilation overrides for safe BigQuery ELT. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "dataform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "..."  # GCP project ID hosting the Dataform repository.
  region          = "..."  # Repository region (must be a BigQuery-supported location).
  repository_name = "..."  # Repository name; validated `^[a-z][a-z0-9-]{1,62}$`.
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
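As a sketch of overriding those defaults, the variant below reuses an existing runtime identity instead of letting the module create one. The service account email is a hypothetical placeholder, and the "..." values are still yours to fill in.

```hcl
module "dataform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "..."
  region          = "..."
  repository_name = "..."

  # Skip SA creation and run Dataform as an identity you already manage.
  # Hypothetical email; it must match *.iam.gserviceaccount.com to pass validation.
  create_service_account = false
  service_account_email  = "dataform-runner@my-project.iam.gserviceaccount.com"
}
```

With create_service_account = false the module does not grant roles/bigquery.jobUser, so that identity needs its BigQuery permissions assigned elsewhere.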

What this module is

Dataform is Google Cloud’s managed service for developing, version-controlling, and operationalising SQL-based ELT pipelines that run inside BigQuery. You write .sqlx files that declare tables, views, incremental models, and assertions; Dataform compiles them into a dependency graph and executes the SQL as BigQuery jobs. A google_dataform_repository is the top-level container that holds that code — either as a Git-connected repository (pointing at GitHub, GitLab, Azure Repos, or Cloud Source Repositories) or as a standalone repo edited through Dataform development workspaces.

Wrapping the repository in a Terraform module matters because a Dataform repo is rarely “just a repo.” In production it needs a dedicated service account with scoped BigQuery permissions, a Git remote authenticated through a Secret Manager secret, optional CMEK encryption for the stored compilation metadata, and workspace compilation overrides so that developers’ interactive workspaces write to sandboxed datasets instead of clobbering production tables. Settings hand-clicked in the console drift out of sync across repositories and environments. This module fixes the topology once, validates the inputs, and lets every data domain stamp out an identical, governed repository by changing a handful of variables.

When to use it

Module structure

terraform-module-gcp-dataform/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_dataform_repository + optional dedicated SA
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # repository id/name + key attributes

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
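  # Stable, predictable account id when the module creates the SA.
  # Account ids are capped at 30 chars and may not end in a hyphen,
  # so drop a trailing hyphen left behind by truncation.
  sa_account_id = coalesce(
    var.service_account_id,
    trimsuffix(substr("df-${var.repository_name}", 0, 30), "-")
  )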

  create_sa = var.create_service_account

  # Either the caller-provided SA email, or the one we create.
  effective_service_account = local.create_sa ? google_service_account.dataform[0].email : var.service_account_email
}

# Optional dedicated runtime identity for Dataform workflow invocations.
resource "google_service_account" "dataform" {
  count = local.create_sa ? 1 : 0

  project      = var.project_id
  account_id   = local.sa_account_id
  display_name = "Dataform runtime SA for ${var.repository_name}"
}

# Grant the runtime identity the ability to run BigQuery jobs in the project.
resource "google_project_iam_member" "bq_job_user" {
  count = local.create_sa ? 1 : 0

  project = var.project_id
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.dataform[0].email}"
}

resource "google_dataform_repository" "this" {
  provider = google

  project      = var.project_id
  region       = var.region
  name         = var.repository_name
  display_name = var.display_name

  # Runtime identity used when Dataform executes BigQuery jobs.
  service_account = local.effective_service_account

  # Optional CMEK for repository-stored compilation metadata.
  kms_key_name = var.kms_key_name

  # Secret version feeding variables into .npmrc during package installs.
  npmrc_environment_variables_secret_version = var.npmrc_secret_version

  labels = var.labels

  # Connect to an external Git remote when a URL is supplied.
  dynamic "git_remote_settings" {
    for_each = var.git_remote == null ? [] : [var.git_remote]
    content {
      url            = git_remote_settings.value.url
      default_branch = git_remote_settings.value.default_branch

      # Token-based auth (HTTPS) via Secret Manager.
      authentication_token_secret_version = try(git_remote_settings.value.authentication_token_secret_version, null)

      # OR SSH-based auth.
      dynamic "ssh_authentication_config" {
        for_each = try(git_remote_settings.value.ssh, null) == null ? [] : [git_remote_settings.value.ssh]
        content {
          user_private_key_secret_version = ssh_authentication_config.value.user_private_key_secret_version
          host_public_key                 = ssh_authentication_config.value.host_public_key
        }
      }
    }
  }

  # Sandbox developer workspaces away from production datasets/tables.
  dynamic "workspace_compilation_overrides" {
    for_each = var.workspace_compilation_overrides == null ? [] : [var.workspace_compilation_overrides]
    content {
      default_database = try(workspace_compilation_overrides.value.default_database, null)
      schema_suffix    = try(workspace_compilation_overrides.value.schema_suffix, null)
      table_prefix     = try(workspace_compilation_overrides.value.table_prefix, null)
    }
  }

  deletion_policy = var.deletion_policy
}
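One piece the main.tf above leaves out: Google's Dataform documentation describes granting the default Dataform service agent the Service Account Token Creator role on any custom runtime service account, so that it can act as that identity. A minimal sketch of that grant follows, assuming it lives in the same module. The data source and resource names are illustrative additions, not part of the module as published, and the service agent only exists once the Dataform API is enabled in the project.

```hcl
# Look up the project number, which forms part of the service agent's email.
data "google_project" "this" {
  project_id = var.project_id
}

# Assumption: let the default Dataform service agent mint tokens for the
# dedicated runtime SA created by this module.
resource "google_service_account_iam_member" "dataform_agent_token_creator" {
  count = local.create_sa ? 1 : 0

  service_account_id = google_service_account.dataform[0].name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:service-${data.google_project.this.number}@gcp-sa-dataform.iam.gserviceaccount.com"
}
```

When the caller supplies service_account_email instead, the equivalent grant would have to be made wherever that service account is managed.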

variables.tf

variable "project_id" {
  description = "GCP project ID where the Dataform repository lives."
  type        = string
}

variable "region" {
  description = "Region for the Dataform repository (must match a BigQuery-supported location, e.g. us-central1, europe-west2, asia-south1)."
  type        = string
}

variable "repository_name" {
  description = "Name of the Dataform repository. Lowercase letters, numbers and hyphens; must start with a letter."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,62}$", var.repository_name))
    error_message = "repository_name must start with a lowercase letter and contain only lowercase letters, numbers and hyphens (2 to 63 chars)."
  }
}

variable "display_name" {
  description = "User-friendly display name for the repository."
  type        = string
  default     = null
}

variable "create_service_account" {
  description = "If true, create a dedicated runtime service account and grant it roles/bigquery.jobUser. If false, supply service_account_email."
  type        = bool
  default     = true
}

variable "service_account_id" {
  description = "Account ID for the created SA (when create_service_account = true). Defaults to df-<repository_name> truncated to 30 chars."
  type        = string
  default     = null
}

variable "service_account_email" {
  description = "Existing service account email to use as the Dataform runtime identity (when create_service_account = false)."
  type        = string
  default     = null

  validation {
    condition     = var.service_account_email == null || can(regex("^[^@]+@[^@]+\\.iam\\.gserviceaccount\\.com$", var.service_account_email))
    error_message = "service_account_email must be a valid *.iam.gserviceaccount.com address."
  }
}

variable "kms_key_name" {
  description = "Optional CMEK key (projects/.../cryptoKeys/...) to encrypt repository-stored data. Key must be in the same region as the repository."
  type        = string
  default     = null
}

variable "npmrc_secret_version" {
  description = "Optional Secret Manager secret version (projects/.../versions/...) used to interpolate variables into .npmrc during package installs."
  type        = string
  default     = null
}

variable "git_remote" {
  description = "Optional Git remote connection. Provide either an HTTPS token (authentication_token_secret_version) or an ssh block — not both."
  type = object({
    url                                  = string
    default_branch                       = string
    authentication_token_secret_version  = optional(string)
    ssh = optional(object({
      user_private_key_secret_version = string
      host_public_key                 = string
    }))
  })
  default = null

  validation {
    condition = var.git_remote == null ? true : (
      (try(var.git_remote.authentication_token_secret_version, null) != null) != (try(var.git_remote.ssh, null) != null)
    )
    error_message = "Provide exactly one of git_remote.authentication_token_secret_version (HTTPS) or git_remote.ssh (SSH)."
  }
}

variable "workspace_compilation_overrides" {
  description = "Optional overrides applied to development workspace compilations to sandbox dev output (default_database, schema_suffix, table_prefix)."
  type = object({
    default_database = optional(string)
    schema_suffix    = optional(string)
    table_prefix     = optional(string)
  })
  default = null
}

variable "labels" {
  description = "Labels to apply to the Dataform repository."
  type        = map(string)
  default     = {}
}

variable "deletion_policy" {
  description = "Deletion behaviour: DELETE, FORCE (delete with child resources), ABANDON, or PREVENT."
  type        = string
  default     = "DELETE"

  validation {
    condition     = contains(["DELETE", "FORCE", "ABANDON", "PREVENT"], var.deletion_policy)
    error_message = "deletion_policy must be one of DELETE, FORCE, ABANDON, PREVENT."
  }
}
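To illustrate the "exactly one of" rule on git_remote, here is a sketch of the SSH variant as it might appear in a terraform.tfvars file. The repository URL, secret path, and host key are hypothetical placeholders; setting authentication_token_secret_version alongside the ssh block would fail the validation above.

```hcl
# terraform.tfvars (illustrative values only)
git_remote = {
  url            = "git@github.com:example-org/example-dataform.git"
  default_branch = "main"

  # SSH auth: the private key lives in Secret Manager, the host key is pinned here.
  ssh = {
    user_private_key_secret_version = "projects/example-project/secrets/dataform-ssh-key/versions/1"
    host_public_key                 = "..." # public host key of your Git provider
  }
}
```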

outputs.tf

output "id" {
  description = "Full resource ID: projects/{project}/locations/{region}/repositories/{name}."
  value       = google_dataform_repository.this.id
}

output "name" {
  description = "Short name of the Dataform repository."
  value       = google_dataform_repository.this.name
}

output "region" {
  description = "Region the repository was created in."
  value       = google_dataform_repository.this.region
}

output "service_account_email" {
  description = "Runtime service account email used by Dataform (created or supplied)."
  value       = local.effective_service_account
}

output "service_account_created" {
  description = "Whether this module created the runtime service account."
  value       = local.create_sa
}

output "git_connected" {
  description = "True when a Git remote is configured on the repository."
  value       = var.git_remote != null
}

How to use it

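# A secret holding the GitHub PAT used by Dataform for HTTPS Git operations.
# The variable and secret_id below are illustrative; adjust to your conventions.
variable "github_pat" {
  type      = string
  sensitive = true
}

resource "google_secret_manager_secret" "github_pat" {
  project   = "kv-analytics-prod"
  secret_id = "dataform-github-pat"

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "github_pat" {
  secret      = google_secret_manager_secret.github_pat.id
  secret_data = var.github_pat # sensitive, sourced from a tfvars/CI secret
}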

module "dataform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "kv-analytics-prod"
  region          = "europe-west2"
  repository_name = "finance-elt"
  display_name    = "Finance ELT (BigQuery)"

  # Let the module mint a dedicated runtime SA + grant BigQuery Job User.
  create_service_account = true

  # CMEK in the same region as the repository.
  kms_key_name = "projects/kv-security/locations/europe-west2/keyRings/dataform/cryptoKeys/repo-cmek"

  # HTTPS Git connection authenticated by a Secret Manager secret version.
  git_remote = {
    url                                 = "https://github.com/kloudvin/finance-dataform.git"
    default_branch                      = "main"
    authentication_token_secret_version = google_secret_manager_secret_version.github_pat.id
  }

  # Developer workspaces compile into a sandbox: tables prefixed + dataset suffixed.
  workspace_compilation_overrides = {
    default_database = "kv-analytics-dev"
    schema_suffix    = "_dev"
    table_prefix     = "dev_"
  }

  labels = {
    domain      = "finance"
    environment = "prod"
    managed-by  = "terraform"
  }
}

# Downstream: grant the Dataform runtime SA write access to the curated dataset
# using the module's service_account_email output.
resource "google_bigquery_dataset_iam_member" "dataform_writer" {
  project    = "kv-analytics-prod"
  dataset_id = "finance_curated"
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:${module.dataform.service_account_email}"
}

# Downstream: a scheduled release/workflow config keyed off the repository name.
resource "google_dataform_repository_release_config" "nightly" {
  project       = "kv-analytics-prod"
  region        = "europe-west2"
  repository    = module.dataform.name
  name          = "nightly"
  git_commitish = "main"
  cron_schedule = "0 2 * * *"
  time_zone     = "Europe/London"
}
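The Git connection above works only if Dataform can read the PAT. Google's documentation for connecting a third-party Git repository describes granting the default Dataform service agent the Secret Manager Secret Accessor role on the secret. The sketch below assumes the google_secret_manager_secret.github_pat resource referenced in the example; the data source and IAM resource names are illustrative.

```hcl
# Project number is needed to build the Dataform service agent's email.
data "google_project" "analytics" {
  project_id = "kv-analytics-prod"
}

# Assumption: the default Dataform service agent reads the PAT at Git sync time.
resource "google_secret_manager_secret_iam_member" "dataform_pat_accessor" {
  project   = google_secret_manager_secret.github_pat.project
  secret_id = google_secret_manager_secret.github_pat.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:service-${data.google_project.analytics.number}@gcp-sa-dataform.iam.gserviceaccount.com"
}
```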

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...gcs state bucket/container + key per path...
  }
}

2. Module config, live/prod/dataform/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"
}

inputs = {
  project_id      = "..."
  region          = "..."
  repository_name = "..."
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root.
cd live/prod/dataform && terragrunt apply    # this module only
cd live/prod && terragrunt run-all apply     # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
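The root config in step 1 shows only the remote state. Assuming you also want the provider defined there, a Terragrunt generate block is one way to do it. This is a sketch, not part of the original root config, and the project and region values are the same placeholders used in the Quickstart.

```hcl
# live/terragrunt.hcl (addition): write provider.tf into every module directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

Because the module itself declares only required_providers, the generated provider.tf supplies the single provider configuration each environment needs.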

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | | Yes | GCP project ID hosting the Dataform repository. |
| region | string | | Yes | Repository region (must be a BigQuery-supported location). |
| repository_name | string | | Yes | Repository name; validated ^[a-z][a-z0-9-]{1,62}$. |
| display_name | string | null | No | User-friendly display name. |
| create_service_account | bool | true | No | Create a dedicated runtime SA and grant roles/bigquery.jobUser. |
| service_account_id | string | null | No | Account ID for the created SA; defaults to df-<repository_name> (≤30 chars). |
| service_account_email | string | null | No | Existing SA email when create_service_account = false; validated as *.iam.gserviceaccount.com. |
| kms_key_name | string | null | No | CMEK crypto key (same region) for repository data encryption. |
| npmrc_secret_version | string | null | No | Secret Manager secret version interpolated into .npmrc for package installs. |
| git_remote | object | null | No | Git remote (url, default_branch, plus exactly one of authentication_token_secret_version or ssh). |
| workspace_compilation_overrides | object | null | No | Dev-workspace overrides: default_database, schema_suffix, table_prefix. |
| labels | map(string) | {} | No | Labels applied to the repository. |
| deletion_policy | string | "DELETE" | No | One of DELETE, FORCE, ABANDON, PREVENT. |

Outputs

| Name | Description |
| --- | --- |
| id | Full resource ID projects/{project}/locations/{region}/repositories/{name}. |
| name | Short name of the Dataform repository. |
| region | Region the repository was created in. |
| service_account_email | Runtime service account email (created or supplied). |
| service_account_created | Whether the module created the runtime service account. |
| git_connected | True when a Git remote is configured. |

Enterprise scenario

A retail analytics platform runs three BigQuery data domains — finance, supply-chain, and marketing — each owned by a separate squad. The platform team calls this module three times from a single for_each, giving every domain an identically-governed Dataform repository: a dedicated runtime service account scoped to roles/bigquery.jobUser, a CMEK key from the central security project, and an HTTPS Git connection to each squad’s GitHub repo via a Secret Manager PAT. workspace_compilation_overrides route every developer’s interactive workspace into a *_dev sandbox dataset, so a junior engineer’s experimental model can never overwrite the nightly curated tables, while a downstream google_dataform_repository_release_config (keyed off the module’s name output) drives the 02:00 production release per domain.
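The single for_each call in this scenario could look roughly like the sketch below. The domain map, the supply-chain and marketing Git URLs, and the PAT secret paths are hypothetical; the project, region, and CMEK key reuse the values from the usage example.

```hcl
locals {
  # One entry per data domain; URLs and secret paths are illustrative.
  domains = {
    finance = {
      git_url    = "https://github.com/kloudvin/finance-dataform.git"
      pat_secret = "projects/kv-analytics-prod/secrets/finance-github-pat/versions/1"
    }
    supply-chain = {
      git_url    = "https://github.com/kloudvin/supply-chain-dataform.git"
      pat_secret = "projects/kv-analytics-prod/secrets/supply-chain-github-pat/versions/1"
    }
    marketing = {
      git_url    = "https://github.com/kloudvin/marketing-dataform.git"
      pat_secret = "projects/kv-analytics-prod/secrets/marketing-github-pat/versions/1"
    }
  }
}

module "dataform" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"
  for_each = local.domains

  project_id      = "kv-analytics-prod"
  region          = "europe-west2"
  repository_name = "${each.key}-elt"
  kms_key_name    = "projects/kv-security/locations/europe-west2/keyRings/dataform/cryptoKeys/repo-cmek"

  git_remote = {
    url                                 = each.value.git_url
    default_branch                      = "main"
    authentication_token_secret_version = each.value.pat_secret
  }

  # Route every developer workspace into a *_dev sandbox dataset.
  workspace_compilation_overrides = {
    schema_suffix = "_dev"
  }

  labels = {
    domain = each.key
  }
}

# Map of domain => repository ID, for downstream release configs.
output "dataform_repository_ids" {
  value = { for domain, repo in module.dataform : domain => repo.id }
}
```

Keying the module instances by domain name keeps addresses stable (module.dataform["finance"]), so adding a fourth domain later does not disturb the existing three.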

Best practices
