Quick take — Build a reusable Terraform module for google_dataform_repository on hashicorp/google ~> 5.0: wire up Git remotes, CMEK, a dedicated service account, and workspace compilation overrides for safe BigQuery ELT. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "dataform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "..." # GCP project ID hosting the Dataform repository.
  region          = "..." # Repository region (must be a BigQuery-supported locatio…
  repository_name = "..." # Repository name; validated `^[a-z][a-z0-9-]{1,62}$`.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Dataform is Google Cloud’s managed service for developing, version-controlling, and operationalising SQL-based ELT pipelines that run inside BigQuery. You write .sqlx files that declare tables, views, incremental models, and assertions; Dataform compiles them into a dependency graph and executes the SQL as BigQuery jobs. A google_dataform_repository is the top-level container that holds that code — either as a Git-connected repository (pointing at GitHub, GitLab, Azure Repos, or Cloud Source Repositories) or as a standalone repo edited through Dataform development workspaces.
Wrapping the repository in a Terraform module matters because a Dataform repo is rarely “just a repo.” In production it needs a dedicated service account with scoped BigQuery permissions, a Git remote authenticated through a Secret Manager secret, optional CMEK encryption for the stored compilation metadata, and workspace compilation overrides so that developers’ interactive workspaces write to sandboxed datasets instead of clobbering production tables. Hand-clicking those in the console drifts immediately. This module fixes the topology once, validates the inputs, and lets every data domain stamp out an identical, governed repository by changing a handful of variables.
When to use it
- You run BigQuery as your warehouse and want SQL transformations under version control with compilation, dependency resolution, and data assertions — without standing up a separate dbt runner on Composer or Cloud Run.
- You are onboarding multiple data domains or teams (finance, marketing, product analytics) and each needs its own Dataform repository with a consistent service account and Git-connection pattern.
- You need environment isolation: developer workspaces must compile into dataform_dev_*-style datasets while scheduled releases target production, achieved declaratively via workspace_compilation_overrides.
- Compliance requires customer-managed encryption keys on the repository and no plaintext Git tokens in state; both are handled here through kms_key_name and a Secret Manager secret version.
- You want the repository created and governed in Terraform while Dataform release/workflow configs (schedules) are layered on top by a downstream stack that consumes this module's outputs.
Module structure
terraform-module-gcp-dataform/
├── versions.tf # provider + Terraform version pins
├── main.tf # google_dataform_repository + optional dedicated SA
├── variables.tf # var-driven inputs with validation
└── outputs.tf # repository id/name + key attributes
versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}
main.tf
locals {
  # Stable, predictable account id when the module creates the SA.
  # Service account IDs are capped at 30 characters and cannot end in a hyphen,
  # so trim any hyphen left dangling by the truncation.
  sa_account_id = coalesce(
    var.service_account_id,
    trimsuffix(substr("df-${var.repository_name}", 0, 30), "-")
  )

  create_sa = var.create_service_account

  # Either the caller-provided SA email, or the one we create.
  effective_service_account = local.create_sa ? google_service_account.dataform[0].email : var.service_account_email
}

# Optional dedicated runtime identity for Dataform workflow invocations.
resource "google_service_account" "dataform" {
  count        = local.create_sa ? 1 : 0
  project      = var.project_id
  account_id   = local.sa_account_id
  display_name = "Dataform runtime SA for ${var.repository_name}"
}

# Grant the runtime identity the ability to run BigQuery jobs in the project.
resource "google_project_iam_member" "bq_job_user" {
  count   = local.create_sa ? 1 : 0
  project = var.project_id
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.dataform[0].email}"
}

resource "google_dataform_repository" "this" {
  provider     = google
  project      = var.project_id
  region       = var.region
  name         = var.repository_name
  display_name = var.display_name

  # Runtime identity used when Dataform executes BigQuery jobs.
  service_account = local.effective_service_account

  # Optional CMEK for repository-stored compilation metadata.
  kms_key_name = var.kms_key_name

  # Secret version feeding variables into .npmrc during package installs.
  npmrc_environment_variables_secret_version = var.npmrc_secret_version

  labels = var.labels

  # Connect to an external Git remote when a URL is supplied.
  dynamic "git_remote_settings" {
    for_each = var.git_remote == null ? [] : [var.git_remote]
    content {
      url            = git_remote_settings.value.url
      default_branch = git_remote_settings.value.default_branch

      # Token-based auth (HTTPS) via Secret Manager.
      authentication_token_secret_version = try(git_remote_settings.value.authentication_token_secret_version, null)

      # OR SSH-based auth.
      dynamic "ssh_authentication_config" {
        for_each = try(git_remote_settings.value.ssh, null) == null ? [] : [git_remote_settings.value.ssh]
        content {
          user_private_key_secret_version = ssh_authentication_config.value.user_private_key_secret_version
          host_public_key                 = ssh_authentication_config.value.host_public_key
        }
      }
    }
  }

  # Sandbox developer workspaces away from production datasets/tables.
  dynamic "workspace_compilation_overrides" {
    for_each = var.workspace_compilation_overrides == null ? [] : [var.workspace_compilation_overrides]
    content {
      default_database = try(workspace_compilation_overrides.value.default_database, null)
      schema_suffix    = try(workspace_compilation_overrides.value.schema_suffix, null)
      table_prefix     = try(workspace_compilation_overrides.value.table_prefix, null)
    }
  }

  deletion_policy = var.deletion_policy
}
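When a custom runtime service account is set on the repository, Google's Dataform documentation describes granting the project's default Dataform service agent the Service Account Token Creator role on that account so it can act as it. The module above does not do this, so the following is a minimal sketch of how it might be added, not part of the published module. It assumes the service agent follows the documented naming pattern service-<PROJECT_NUMBER>@gcp-sa-dataform.iam.gserviceaccount.com and that the agent already exists (it is provisioned when the Dataform API is first used in the project); the resource and local names are illustrative.

```hcl
# Look up the project number, which is needed to build the service agent email.
data "google_project" "this" {
  project_id = var.project_id
}

locals {
  # Assumption: default Dataform service agent naming pattern.
  dataform_service_agent = "serviceAccount:service-${data.google_project.this.number}@gcp-sa-dataform.iam.gserviceaccount.com"
}

# Allow the Dataform service agent to mint tokens for the module-created SA.
resource "google_service_account_iam_member" "agent_token_creator" {
  count = local.create_sa ? 1 : 0

  service_account_id = google_service_account.dataform[0].name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = local.dataform_service_agent
}
```

For HTTPS Git connections the same service agent typically also needs roles/secretmanager.secretAccessor on the token secret; that grant is left to the stack that owns the secret.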
variables.tf
variable "project_id" {
  description = "GCP project ID where the Dataform repository lives."
  type        = string
}

variable "region" {
  description = "Region for the Dataform repository (must match a BigQuery-supported location, e.g. us-central1, europe-west2, asia-south1)."
  type        = string
}

variable "repository_name" {
  description = "Name of the Dataform repository. Lowercase letters, numbers and hyphens; must start with a letter."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,62}$", var.repository_name))
    error_message = "repository_name must start with a lowercase letter and contain only lowercase letters, numbers and hyphens (max 63 chars)."
  }
}

variable "display_name" {
  description = "User-friendly display name for the repository."
  type        = string
  default     = null
}

variable "create_service_account" {
  description = "If true, create a dedicated runtime service account and grant it roles/bigquery.jobUser. If false, supply service_account_email."
  type        = bool
  default     = true
}

variable "service_account_id" {
  description = "Account ID for the created SA (when create_service_account = true). Defaults to df-<repository_name> truncated to 30 chars."
  type        = string
  default     = null
}

variable "service_account_email" {
  description = "Existing service account email to use as the Dataform runtime identity (when create_service_account = false)."
  type        = string
  default     = null

  validation {
    condition     = var.service_account_email == null || can(regex("^[^@]+@[^@]+\\.iam\\.gserviceaccount\\.com$", var.service_account_email))
    error_message = "service_account_email must be a valid *.iam.gserviceaccount.com address."
  }
}

variable "kms_key_name" {
  description = "Optional CMEK key (projects/.../cryptoKeys/...) to encrypt repository-stored data. Key must be in the same region as the repository."
  type        = string
  default     = null
}

variable "npmrc_secret_version" {
  description = "Optional Secret Manager secret version (projects/.../versions/...) used to interpolate variables into .npmrc during package installs."
  type        = string
  default     = null
}

variable "git_remote" {
  description = "Optional Git remote connection. Provide either an HTTPS token (authentication_token_secret_version) or an ssh block, not both."
  type = object({
    url                                 = string
    default_branch                      = string
    authentication_token_secret_version = optional(string)
    ssh = optional(object({
      user_private_key_secret_version = string
      host_public_key                 = string
    }))
  })
  default = null

  validation {
    condition = var.git_remote == null ? true : (
      (try(var.git_remote.authentication_token_secret_version, null) != null) != (try(var.git_remote.ssh, null) != null)
    )
    error_message = "Provide exactly one of git_remote.authentication_token_secret_version (HTTPS) or git_remote.ssh (SSH)."
  }
}

variable "workspace_compilation_overrides" {
  description = "Optional overrides applied to development workspace compilations to sandbox dev output (default_database, schema_suffix, table_prefix)."
  type = object({
    default_database = optional(string)
    schema_suffix    = optional(string)
    table_prefix     = optional(string)
  })
  default = null
}

variable "labels" {
  description = "Labels to apply to the Dataform repository."
  type        = map(string)
  default     = {}
}

variable "deletion_policy" {
  description = "Deletion behaviour: DELETE, FORCE (delete with child resources), ABANDON, or PREVENT."
  type        = string
  default     = "DELETE"

  validation {
    condition     = contains(["DELETE", "FORCE", "ABANDON", "PREVENT"], var.deletion_policy)
    error_message = "deletion_policy must be one of DELETE, FORCE, ABANDON, PREVENT."
  }
}
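The description of create_service_account says to supply service_account_email when it is false, but the per-variable validation blocks above cannot enforce that on their own: referencing a second variable inside a validation condition is only supported from Terraform 1.9, while this module pins >= 1.5.0. One possible way to enforce the rule, sketched here as an optional addition rather than something the module ships, is a precondition on a built-in terraform_data resource. The resource name input_guard is hypothetical.

```hcl
# Fails the plan if the caller opts out of SA creation without naming an SA.
# terraform_data is built into Terraform (1.4+), so no extra provider is needed.
resource "terraform_data" "input_guard" {
  lifecycle {
    precondition {
      condition     = var.create_service_account || var.service_account_email != null
      error_message = "Set service_account_email when create_service_account = false."
    }
  }
}
```

A check block would also work on Terraform 1.5, but it only emits a warning, whereas a failed precondition stops the plan.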
outputs.tf
output "id" {
  description = "Full resource ID: projects/{project}/locations/{region}/repositories/{name}."
  value       = google_dataform_repository.this.id
}

output "name" {
  description = "Short name of the Dataform repository."
  value       = google_dataform_repository.this.name
}

output "region" {
  description = "Region the repository was created in."
  value       = google_dataform_repository.this.region
}

output "service_account_email" {
  description = "Runtime service account email used by Dataform (created or supplied)."
  value       = local.effective_service_account
}

output "service_account_created" {
  description = "Whether this module created the runtime service account."
  value       = local.create_sa
}

output "git_connected" {
  description = "True when a Git remote is configured on the repository."
  value       = var.git_remote != null
}
How to use it
# A secret holding the GitHub PAT used by Dataform for HTTPS Git operations.
# The secret_id below is a placeholder; name it to suit your conventions.
resource "google_secret_manager_secret" "github_pat" {
  project   = "kv-analytics-prod"
  secret_id = "dataform-github-pat"

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "github_pat" {
  secret      = google_secret_manager_secret.github_pat.id
  secret_data = var.github_pat # sensitive, sourced from a tfvars/CI secret; still recorded in state
}

module "dataform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "kv-analytics-prod"
  region          = "europe-west2"
  repository_name = "finance-elt"
  display_name    = "Finance ELT (BigQuery)"

  # Let the module mint a dedicated runtime SA + grant BigQuery Job User.
  create_service_account = true

  # CMEK in the same region as the repository.
  kms_key_name = "projects/kv-security/locations/europe-west2/keyRings/dataform/cryptoKeys/repo-cmek"

  # HTTPS Git connection authenticated by a Secret Manager secret version.
  git_remote = {
    url                                 = "https://github.com/kloudvin/finance-dataform.git"
    default_branch                      = "main"
    authentication_token_secret_version = google_secret_manager_secret_version.github_pat.id
  }

  # Developer workspaces compile into a sandbox: tables prefixed + dataset suffixed.
  workspace_compilation_overrides = {
    default_database = "kv-analytics-dev"
    schema_suffix    = "_dev"
    table_prefix     = "dev_"
  }

  labels = {
    domain      = "finance"
    environment = "prod"
    managed-by  = "terraform"
  }
}

# Downstream: grant the Dataform runtime SA write access to the curated dataset
# using the module's service_account_email output.
resource "google_bigquery_dataset_iam_member" "dataform_writer" {
  project    = "kv-analytics-prod"
  dataset_id = "finance_curated"
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:${module.dataform.service_account_email}"
}

# Downstream: a scheduled release/workflow config keyed off the repository name.
resource "google_dataform_repository_release_config" "nightly" {
  project       = "kv-analytics-prod"
  region        = "europe-west2"
  repository    = module.dataform.name
  name          = "nightly"
  git_commitish = "main"
  cron_schedule = "0 2 * * *"
  time_zone     = "Europe/London"
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}
2. Module config — live/prod/dataform/terragrunt.hcl:
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"
}

inputs = {
  project_id      = "..."
  region          = "..."
  repository_name = "..."
}
3. Deploy one environment, or roll out all modules together:
# Run either command from the repository root.
cd live/prod/dataform && terragrunt apply   # this module
cd live/prod && terragrunt run-all apply    # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
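To make the per-environment override concrete, here is a sketch of what a sibling dev configuration might look like at live/dev/dataform/terragrunt.hcl. The project ID kv-analytics-dev and the label values are illustrative assumptions; only the inputs that differ from prod need to change, and the module source and version stay identical.

```hcl
# live/dev/dataform/terragrunt.hcl (hypothetical dev counterpart of the prod config)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"
}

inputs = {
  project_id      = "kv-analytics-dev" # assumed dev project
  region          = "europe-west2"
  repository_name = "finance-elt"

  labels = {
    domain      = "finance"
    environment = "dev"
    managed-by  = "terraform"
  }
}
```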
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | - | Yes | GCP project ID hosting the Dataform repository. |
| region | string | - | Yes | Repository region (must be a BigQuery-supported location). |
| repository_name | string | - | Yes | Repository name; validated ^[a-z][a-z0-9-]{1,62}$. |
| display_name | string | null | No | User-friendly display name. |
| create_service_account | bool | true | No | Create a dedicated runtime SA and grant roles/bigquery.jobUser. |
| service_account_id | string | null | No | Account ID for the created SA; defaults to df-<repository_name> (≤30 chars). |
| service_account_email | string | null | No | Existing SA email when create_service_account = false; validated as *.iam.gserviceaccount.com. |
| kms_key_name | string | null | No | CMEK crypto key (same region) for repository data encryption. |
| npmrc_secret_version | string | null | No | Secret Manager secret version interpolated into .npmrc for package installs. |
| git_remote | object | null | No | Git remote (url, default_branch, plus exactly one of authentication_token_secret_version or ssh). |
| workspace_compilation_overrides | object | null | No | Dev-workspace overrides: default_database, schema_suffix, table_prefix. |
| labels | map(string) | {} | No | Labels applied to the repository. |
| deletion_policy | string | "DELETE" | No | One of DELETE, FORCE, ABANDON, PREVENT. |
Outputs
| Name | Description |
|---|---|
| id | Full resource ID projects/{project}/locations/{region}/repositories/{name}. |
| name | Short name of the Dataform repository. |
| region | Region the repository was created in. |
| service_account_email | Runtime service account email (created or supplied). |
| service_account_created | Whether the module created the runtime service account. |
| git_connected | True when a Git remote is configured. |
Enterprise scenario
A retail analytics platform runs three BigQuery data domains — finance, supply-chain, and marketing — each owned by a separate squad. The platform team calls this module three times from a single for_each, giving every domain an identically-governed Dataform repository: a dedicated runtime service account scoped to roles/bigquery.jobUser, a CMEK key from the central security project, and an HTTPS Git connection to each squad’s GitHub repo via a Secret Manager PAT. workspace_compilation_overrides route every developer’s interactive workspace into a *_dev sandbox dataset, so a junior engineer’s experimental model can never overwrite the nightly curated tables, while a downstream google_dataform_repository_release_config (keyed off the module’s name output) drives the 02:00 production release per domain.
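The single for_each call described above could look roughly like the sketch below. The domains variable, its attribute names, and the "<domain>-elt" naming scheme are assumptions made for illustration; the project, region, and KMS key are reused from the usage example earlier in this article.

```hcl
# Hypothetical per-domain settings: each squad's Git URL and the Secret Manager
# secret version that holds its PAT, keyed by domain name.
variable "domains" {
  type = map(object({
    git_url                  = string
    git_token_secret_version = string
  }))
}

module "dataform" {
  for_each = var.domains
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "kv-analytics-prod"
  region          = "europe-west2"
  repository_name = "${each.key}-elt" # e.g. finance-elt, supply-chain-elt

  kms_key_name = "projects/kv-security/locations/europe-west2/keyRings/dataform/cryptoKeys/repo-cmek"

  git_remote = {
    url                                 = each.value.git_url
    default_branch                      = "main"
    authentication_token_secret_version = each.value.git_token_secret_version
  }

  # Route every developer workspace into *_dev datasets.
  workspace_compilation_overrides = {
    schema_suffix = "_dev"
  }

  labels = {
    domain     = each.key
    managed-by = "terraform"
  }
}

# One runtime SA email per domain, for downstream dataset IAM grants.
output "dataform_service_accounts" {
  value = { for domain, repo in module.dataform : domain => repo.service_account_email }
}
```

Keying the map by domain name means adding a fourth squad is a one-entry change, and removing an entry destroys only that domain's repository.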
Best practices
- Never put raw Git tokens in Terraform state. Pass authentication_token_secret_version pointing at a Secret Manager secret version (or use the SSH block); the repository stores only the secret reference, not the credential. Create the secret version outside Terraform where you can, because a google_secret_manager_secret_version resource with secret_data still records the token in state.
- Always set workspace_compilation_overrides for non-trivial repos. A schema_suffix/table_prefix (e.g. _dev/dev_) plus a separate default_database keeps interactive development from writing into production datasets, a common cause of Dataform incidents.
- Give each repository its own least-privilege service account rather than reusing the default Dataform service agent. Grant only roles/bigquery.jobUser at the project plus roles/bigquery.dataEditor on the specific target datasets downstream, so a compromised repo cannot read or mutate unrelated data.
- Pin the repository region to your BigQuery data location. Dataform executes jobs in-region; a mismatch between region and your datasets' location causes cross-region failures. Apply CMEK with a key in that same region.
- Use deletion_policy = "PREVENT" on production repositories to stop an accidental terraform destroy from wiping compilation results and release history; reserve FORCE for ephemeral preview environments only.
- Label consistently (domain, environment, managed-by) so Dataform-driven BigQuery cost can be sliced by team in billing exports, and pin the source ?ref= to an immutable module version tag for reproducible rollouts.
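One way to follow the first practice is to have the secret version created out of band (for example by a CI job or a security team) and hand the module only its resource name. The sketch below assumes a hypothetical variable github_pat_secret_version; nothing in it reads or stores the token itself, so only the reference string ends up in state.

```hcl
# Hypothetical input: full resource name of a secret version created outside
# Terraform, e.g. projects/<project>/secrets/<secret>/versions/<n>.
variable "github_pat_secret_version" {
  description = "Secret Manager secret version holding the Git PAT (reference only)."
  type        = string
}

module "dataform" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataform?ref=v1.0.0"

  project_id      = "kv-analytics-prod"
  region          = "europe-west2"
  repository_name = "finance-elt"

  git_remote = {
    url                                 = "https://github.com/kloudvin/finance-dataform.git"
    default_branch                      = "main"
    authentication_token_secret_version = var.github_pat_secret_version
  }
}
```

Avoid swapping the variable for a google_secret_manager_secret_version data source here: that data source exposes the payload as an attribute, which would bring the token back into state.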