Terraform Module: GCP BigQuery Data Transfer — scheduled, repeatable ingestion into BigQuery

Quick take — A reusable Terraform module for google_bigquery_data_transfer_config that codifies scheduled BigQuery Data Transfer Service runs — data source, schedule, destination dataset, and service-account auth — with validated inputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "bigquery_data_transfer" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"

  project_id     = "..."  # GCP project owning the transfer config and dataset.
  location       = "..."  # Region of the destination dataset (must match the datas…
  display_name   = "..."  # UI display name (1-256 chars).
  data_source_id = "..."  # Transfer source (scheduled_query, google_cloud_storage,…
  params         = {}     # Source-specific params; {} is a placeholder and fails validation until filled.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
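
One thing the Quickstart leaves implicit: with the default create_service_identity = true, the module's google_project_service_identity resource is created through the google-beta provider. A minimal sketch of configuring that provider next to the google one, assuming the same placeholder project and region as above (whether you need an explicit block depends on how your credentials and defaults are set up):

```hcl
# Assumption: same placeholder project/region as the Quickstart's google provider.
provider "google-beta" {
  project = "my-project"
  region  = "us-central1"
}
```

If the service agent and its IAM grant are managed elsewhere, setting create_service_identity = false skips the google-beta resource.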

What this module is

The BigQuery Data Transfer Service (BQ DTS) is GCP’s managed, scheduled ingestion engine. Instead of writing cron jobs, Cloud Functions, or Composer DAGs to pull data into BigQuery, you declare a transfer config — a data source (Google Ads, Cloud Storage, Amazon S3, Google Play, YouTube, a scheduled query, or a cross-region BigQuery copy), a destination dataset, a schedule, and a bag of source-specific params — and the service runs it for you, retries on failure, and backfills history on demand.

The trouble is that google_bigquery_data_transfer_config is a deceptively fiddly resource. The params map is untyped and entirely different for every data_source_id; service_account_name depends on the BQ DTS service agent being able to mint tokens for that account (this module grants it roles/iam.serviceAccountTokenCreator), which is easy to get wrong; schedule_options and data_refresh_window_days interact in non-obvious ways; and the project needs the bigquerydatatransfer.googleapis.com API enabled and a one-time google_project_service_identity to exist before the first transfer can be created. Wrapping all of that in a module gives you one tested, var-driven interface so every team ships transfers that are named consistently, authenticated by a dedicated service account, and pinned to a known schedule format, with no copy-pasted params blocks drifting across repos.

When to use it

If you only need a one-off historical load, a bq load / bq cp command is simpler — reach for this module when the ingestion is recurring and needs to be in code.

Module structure

terraform-module-gcp-bigquery-data-transfer/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # API enablement, service identity, transfer config, IAM
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # config id/name + key attributes
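# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    # main.tf creates google_project_service_identity through google-beta,
    # so declare that provider explicitly as well.
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~> 5.0"
    }
  }
}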
# main.tf

locals {
  # Transfers backed by Google-managed first-party data sources (Ads, Play, YouTube,
  # Search Ads 360, etc.) require an OAuth/data-source authorization flow and CANNOT be
  # driven by a service account. GCS, S3, scheduled_query and cross_region_copy can.
  sa_capable_sources = [
    "google_cloud_storage",
    "amazon_s3",
    "scheduled_query",
    "cross_region_copy",
    "redshift",
    "azure_blob_storage",
  ]

  use_service_account = (
    var.service_account_email != null &&
    contains(local.sa_capable_sources, var.data_source_id)
  )
}

# Ensure the Data Transfer API is on before we try to create a config.
resource "google_project_service" "bigquerydatatransfer" {
  count = var.enable_api ? 1 : 0

  project                    = var.project_id
  service                    = "bigquerydatatransfer.googleapis.com"
  disable_dependent_services = false
  disable_on_destroy         = false
}

# Create the BQ DTS service agent (P4SA). Required so you can grant it
# roles/iam.serviceAccountTokenCreator on your transfer service account.
resource "google_project_service_identity" "bq_dts" {
  count = var.create_service_identity ? 1 : 0

  provider = google-beta
  project  = var.project_id
  service  = "bigquerydatatransfer.googleapis.com"

  depends_on = [google_project_service.bigquerydatatransfer]
}

# Let the BQ DTS service agent impersonate the user-supplied transfer SA.
resource "google_service_account_iam_member" "dts_token_creator" {
  count = local.use_service_account && var.create_service_identity ? 1 : 0

  service_account_id = "projects/${var.project_id}/serviceAccounts/${var.service_account_email}"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:${google_project_service_identity.bq_dts[0].email}"
}

resource "google_bigquery_data_transfer_config" "this" {
  project                = var.project_id
  location               = var.location
  display_name           = var.display_name
  data_source_id         = var.data_source_id
  destination_dataset_id = var.destination_dataset_id

  # cron-like ("every 24 hours", "first sunday of quarter 09:00", "1 of month 00:00")
  # or empty string for manual/event-driven (e.g. GCS notifications) transfers.
  schedule = var.schedule

  # How many days back each run reloads (handles late-arriving data). 0 = current day only.
  data_refresh_window_days = var.data_refresh_window_days

  # Pause without destroying the config.
  disabled = var.disabled

  # Publish transfer-run notifications to a Pub/Sub topic (failure email is set separately below).
  notification_pubsub_topic = var.notification_pubsub_topic

  # Source-specific configuration. Shape depends entirely on data_source_id.
  params = var.params

  # Service-account auth (only set for SA-capable sources).
  service_account_name = local.use_service_account ? var.service_account_email : null

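  # Emit schedule_options when any of its fields is set, including
  # disable_auto_scheduling on its own (otherwise that flag would be ignored).
  dynamic "schedule_options" {
    for_each = (
      var.disable_auto_scheduling ||
      var.schedule_start_time != null ||
      var.schedule_end_time != null
    ) ? [1] : []
    content {
      disable_auto_scheduling = var.disable_auto_scheduling
      start_time              = var.schedule_start_time
      end_time                = var.schedule_end_time
    }
  }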

  dynamic "email_preferences" {
    for_each = var.enable_failure_email ? [1] : []
    content {
      enable_failure_email = true
    }
  }

  depends_on = [
    google_project_service.bigquerydatatransfer,
    google_service_account_iam_member.dts_token_creator,
  ]
}
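
Because use_service_account silently falls back to user credentials when the source is not in sa_capable_sources, a caller can pass service_account_email and never notice it was ignored. One way to surface that, sketched here as an optional addition to main.tf rather than something the module is known to ship, is a Terraform check block (available from Terraform 1.5, which matches the version pin). The block name is hypothetical:

```hcl
# Hypothetical addition: warn when a service account is supplied for a
# source that the module will not apply it to.
check "service_account_is_honoured" {
  assert {
    condition     = var.service_account_email == null || local.use_service_account
    error_message = "service_account_email is set but data_source_id is not SA-capable; the transfer will run under user credentials."
  }
}
```

A failed check is reported as a warning during plan and apply and does not block the run, so existing callers keep working.
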
# variables.tf

variable "project_id" {
  description = "GCP project that owns the transfer config and destination dataset."
  type        = string
}

variable "location" {
  description = "Location of the destination dataset (e.g. \"US\", \"EU\", \"asia-south1\"). Must match the dataset's region."
  type        = string
}

variable "display_name" {
  description = "Human-readable name shown in the BigQuery Data Transfers UI."
  type        = string

  validation {
    condition     = length(var.display_name) > 0 && length(var.display_name) <= 256
    error_message = "display_name must be 1-256 characters."
  }
}

variable "data_source_id" {
  description = "The transfer data source, e.g. scheduled_query, google_cloud_storage, amazon_s3, cross_region_copy, google_ads, play."
  type        = string

  validation {
    condition = contains([
      "scheduled_query", "google_cloud_storage", "amazon_s3", "azure_blob_storage",
      "redshift", "cross_region_copy", "google_ads", "play", "youtube_channel",
      "youtube_content_owner", "search_ads", "merchant_center",
    ], var.data_source_id)
    error_message = "data_source_id is not a recognised BigQuery Data Transfer source."
  }
}

variable "destination_dataset_id" {
  description = "Existing BigQuery dataset ID that receives the transferred data. Not required for some sources (e.g. some scheduled queries write via DDL)."
  type        = string
  default     = null
}

variable "schedule" {
  description = "Schedule in BQ DTS cron syntax (e.g. \"every 6 hours\", \"every day 07:00\"). Empty string = manual/event-driven."
  type        = string
  default     = "every 24 hours"
}

variable "data_refresh_window_days" {
  description = "Number of days of historical data to reload on each run (for late data). 0 reloads only the current day."
  type        = number
  default     = 0

  validation {
    condition     = var.data_refresh_window_days >= 0 && var.data_refresh_window_days <= 30
    error_message = "data_refresh_window_days must be between 0 and 30."
  }
}

variable "disabled" {
  description = "If true, the transfer is created but paused (no runs are scheduled)."
  type        = bool
  default     = false
}

variable "params" {
  description = "Source-specific parameters map. Keys depend on data_source_id (e.g. query for scheduled_query; data_path_template + file_format for google_cloud_storage)."
  type        = map(string)

  validation {
    condition     = length(var.params) > 0
    error_message = "params must contain at least one key (every data_source_id requires source-specific params)."
  }
}

variable "service_account_email" {
  description = "Email of a dedicated service account to run the transfer under. Only honoured for SA-capable sources (GCS, S3, scheduled_query, cross_region_copy, etc.). Null = use the configuring user's credentials."
  type        = string
  default     = null

  validation {
    condition     = var.service_account_email == null || can(regex("^[^@]+@[^@]+\\.iam\\.gserviceaccount\\.com$", var.service_account_email))
    error_message = "service_account_email must be a valid *.iam.gserviceaccount.com address or null."
  }
}

variable "schedule_start_time" {
  description = "RFC3339 timestamp; transfer will not run before this time. Null to omit."
  type        = string
  default     = null
}

variable "schedule_end_time" {
  description = "RFC3339 timestamp; transfer will not run after this time. Null to omit."
  type        = string
  default     = null
}

variable "disable_auto_scheduling" {
  description = "If true, no automatic runs are scheduled (manual/backfill only) while keeping the schedule string for reference."
  type        = bool
  default     = false
}

variable "notification_pubsub_topic" {
  description = "Full Pub/Sub topic name (projects/<p>/topics/<t>) to receive transfer-run completion notifications. Null to disable."
  type        = string
  default     = null

  validation {
    condition     = var.notification_pubsub_topic == null || can(regex("^projects/[^/]+/topics/[^/]+$", var.notification_pubsub_topic))
    error_message = "notification_pubsub_topic must be of the form projects/<project>/topics/<topic> or null."
  }
}

variable "enable_failure_email" {
  description = "Email the transfer's owner when a run fails."
  type        = bool
  default     = true
}

variable "enable_api" {
  description = "Whether this module should enable bigquerydatatransfer.googleapis.com (set false if managed elsewhere)."
  type        = bool
  default     = true
}

variable "create_service_identity" {
  description = "Whether to create the BQ DTS service agent and grant it tokenCreator on the transfer SA. Requires google-beta provider."
  type        = bool
  default     = true
}
# outputs.tf

output "id" {
  description = "Fully-qualified resource ID of the transfer config (projects/<p>/locations/<l>/transferConfigs/<uuid>)."
  value       = google_bigquery_data_transfer_config.this.id
}

output "name" {
  description = "Resource name of the transfer config, including the server-generated UUID."
  value       = google_bigquery_data_transfer_config.this.name
}

output "display_name" {
  description = "Human-readable display name of the transfer."
  value       = google_bigquery_data_transfer_config.this.display_name
}

output "data_source_id" {
  description = "The data source backing this transfer."
  value       = google_bigquery_data_transfer_config.this.data_source_id
}

output "destination_dataset_id" {
  description = "Destination BigQuery dataset for the transfer (may be null for some sources)."
  value       = google_bigquery_data_transfer_config.this.destination_dataset_id
}

output "schedule" {
  description = "Effective schedule string of the transfer config."
  value       = google_bigquery_data_transfer_config.this.schedule
}

output "service_account_email" {
  description = "Service account the transfer runs as, or null if running under user credentials."
  value       = google_bigquery_data_transfer_config.this.service_account_name
}

output "dts_service_agent_email" {
  description = "Email of the BQ DTS service agent (P4SA), if created by this module."
  value       = try(google_project_service_identity.bq_dts[0].email, null)
}

How to use it

A scheduled hourly load of CSV files from Cloud Storage into a partitioned table, running under a dedicated service account:

module "bigquery_data_transfer" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"

  project_id             = "kv-data-prod"
  location               = "asia-south1"
  display_name           = "gcs-orders-hourly"
  data_source_id         = "google_cloud_storage"
  destination_dataset_id = google_bigquery_dataset.raw.dataset_id
  schedule               = "every 1 hours"
  service_account_email  = google_service_account.bq_transfer.email

  # GCS source params: where to read, target table, format, and write disposition.
  params = {
    data_path_template              = "gs://kv-landing-orders/incoming/*.csv"
    destination_table_name_template = "orders_raw"
    file_format                     = "CSV"
    write_disposition               = "APPEND"
    skip_leading_rows               = "1"
    max_bad_records                 = "0"
  }

  notification_pubsub_topic = "projects/kv-data-prod/topics/bq-transfer-events"
}

# Downstream: alert on failed transfer runs by pushing events from the Pub/Sub
# topic to the ops function, named and labelled from this module's display_name and id.
resource "google_pubsub_subscription" "transfer_alerts" {
  name  = "alert-${module.bigquery_data_transfer.display_name}"
  topic = "projects/kv-data-prod/topics/bq-transfer-events"

  push_config {
    push_endpoint = "https://asia-south1-kv-data-prod.cloudfunctions.net/transfer-failure-alert"
  }

  labels = {
    transfer_id = element(split("/", module.bigquery_data_transfer.id), length(split("/", module.bigquery_data_transfer.id)) - 1)
  }
}
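
The example above refers to google_bigquery_dataset.raw and google_service_account.bq_transfer without defining them. A minimal sketch of those supporting resources follows; the dataset ID, account ID, and the two roles are assumptions chosen for illustration, so adjust them to your own least-privilege policy:

```hcl
# Assumed destination dataset; the dataset_id "raw" is a placeholder.
resource "google_bigquery_dataset" "raw" {
  project    = "kv-data-prod"
  dataset_id = "raw"
  location   = "asia-south1"
}

# Dedicated identity for the transfer; the account_id is a placeholder.
resource "google_service_account" "bq_transfer" {
  project      = "kv-data-prod"
  account_id   = "bq-transfer"
  display_name = "BigQuery Data Transfer runner"
}

# Assumption: write access to the destination dataset.
resource "google_bigquery_dataset_iam_member" "transfer_editor" {
  project    = "kv-data-prod"
  dataset_id = google_bigquery_dataset.raw.dataset_id
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:${google_service_account.bq_transfer.email}"
}

# Assumption: read access to the landing bucket named in data_path_template.
resource "google_storage_bucket_iam_member" "landing_reader" {
  bucket = "kv-landing-orders"
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.bq_transfer.email}"
}
```

Depending on the source, the account may need further permissions (for example to run BigQuery jobs), so treat this as a starting point.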

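For a scheduled query instead, set data_source_id = "scheduled_query". If the query is DDL or DML (such as MERGE), drop destination_dataset_id and pass only params = { query = "MERGE ..." }; for a plain SELECT, keep destination_dataset_id and also set destination_table_name_template and write_disposition (for example "WRITE_TRUNCATE").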
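
As a sketch of the scheduled-query variant, the call below rebuilds a daily rollup table from a SELECT. The module label, the analytics dataset, the table names, and the order_ts column are all hypothetical:

```hcl
module "orders_daily_rollup" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"

  project_id             = "kv-data-prod"
  location               = "asia-south1"
  display_name           = "orders-daily-rollup"
  data_source_id         = "scheduled_query"
  destination_dataset_id = "analytics" # hypothetical existing dataset
  schedule               = "every day 07:00"
  service_account_email  = google_service_account.bq_transfer.email

  # SELECT result replaces the destination table on every run.
  params = {
    query                           = "SELECT DATE(order_ts) AS order_date, COUNT(*) AS orders FROM `kv-data-prod.raw.orders_raw` GROUP BY order_date"
    destination_table_name_template = "orders_daily"
    write_disposition               = "WRITE_TRUNCATE"
  }
}
```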

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...gcs state bucket/container + key per path...
  }
}
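
The root config above covers only the backend. A sketch of generating the provider configuration from the same file, assuming placeholder project and region values and that the module's google-beta resource is in use:

```hcl
# Hypothetical addition to live/terragrunt.hcl: write provider.tf into every module.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}

provider "google-beta" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```
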

2. Module config, live/prod/bigquery_data_transfer/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  location       = "..."
  display_name   = "..."
  data_source_id = "..."
  params         = {}
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/bigquery_data_transfer && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)                  # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
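
To illustrate the dependency orchestration mentioned above, here is a sketch of additions to the module's terragrunt.hcl that take the destination dataset from a sibling Terragrunt module. The ../bigquery_dataset path and its dataset_id output are hypothetical:

```hcl
# Hypothetical sibling module that creates the dataset and exposes dataset_id.
dependency "dataset" {
  config_path = "../bigquery_dataset"

  # Placeholder used while the dependency has no applied outputs yet.
  mock_outputs = {
    dataset_id = "mock_dataset"
  }
}

inputs = {
  destination_dataset_id = dependency.dataset.outputs.dataset_id
  # ...remaining inputs as in the module config above...
}
```

With this in place, run-all applies the dataset module before the transfer.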

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | (none) | Yes | GCP project owning the transfer config and dataset. |
| location | string | (none) | Yes | Region of the destination dataset (must match the dataset). |
| display_name | string | (none) | Yes | UI display name (1-256 chars). |
| data_source_id | string | (none) | Yes | Transfer source (scheduled_query, google_cloud_storage, amazon_s3, cross_region_copy, google_ads, play, …). |
| destination_dataset_id | string | null | No | Destination BigQuery dataset ID. |
| schedule | string | "every 24 hours" | No | BQ DTS cron schedule; empty string = manual/event-driven. |
| data_refresh_window_days | number | 0 | No | Days of history reloaded per run (0-30). |
| disabled | bool | false | No | Create the config but pause scheduling. |
| params | map(string) | (none) | Yes | Source-specific params (must be non-empty). |
| service_account_email | string | null | No | Dedicated SA to run the transfer; honoured only for SA-capable sources. |
| schedule_start_time | string | null | No | RFC3339 earliest run time. |
| schedule_end_time | string | null | No | RFC3339 latest run time. |
| disable_auto_scheduling | bool | false | No | Manual/backfill only; suppress automatic runs. |
| notification_pubsub_topic | string | null | No | Pub/Sub topic (projects/<p>/topics/<t>) for run notifications. |
| enable_failure_email | bool | true | No | Email the owner on run failure. |
| enable_api | bool | true | No | Enable bigquerydatatransfer.googleapis.com from this module. |
| create_service_identity | bool | true | No | Create the BQ DTS service agent and grant it tokenCreator on the SA. |

Outputs

| Name | Description |
| --- | --- |
| id | Fully-qualified transfer config ID (projects/<p>/locations/<l>/transferConfigs/<uuid>). |
| name | Resource name including the server-generated UUID. |
| display_name | Human-readable display name. |
| data_source_id | The data source backing the transfer. |
| destination_dataset_id | Destination dataset (may be null for some sources). |
| schedule | Effective schedule string. |
| service_account_email | SA the transfer runs as, or null for user credentials. |
| dts_service_agent_email | BQ DTS service agent (P4SA) email, if created here. |

Enterprise scenario

A retail analytics team centralises marketing spend reporting in BigQuery. They instantiate this module once per channel: one google_ads transfer (user-authorised, daily, data_refresh_window_days = 7 to capture click-attribution backfill) and three google_cloud_storage transfers pulling partner CSVs that land hourly in a landing bucket, all writing into a single marketing_raw dataset in asia-south1. The Cloud Storage transfers run under a dedicated bq-transfer@ service account, while the google_ads transfer keeps its user authorisation because the module does not apply a service account to that source. Every config emits run events to a shared Pub/Sub topic, and a downstream Cloud Function pages the on-call channel on failure, so a broken partner feed surfaces in minutes instead of at the next morning's dashboard refresh.
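
The three Cloud Storage transfers in this scenario could be stamped out with for_each on the module call. This is a sketch under stated assumptions: the project ID, bucket paths, partner keys, table naming, and topic name are hypothetical, and the API enablement and service-agent grant are assumed to be managed once elsewhere so the three instances do not each repeat them:

```hcl
locals {
  # Hypothetical partner feeds: key = partner name, value = landing path.
  partner_feeds = {
    "partner-a" = "gs://kv-landing-marketing/partner-a/*.csv"
    "partner-b" = "gs://kv-landing-marketing/partner-b/*.csv"
    "partner-c" = "gs://kv-landing-marketing/partner-c/*.csv"
  }
}

module "partner_transfer" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"
  for_each = local.partner_feeds

  project_id                = "kv-data-prod"
  location                  = "asia-south1"
  display_name              = "gcs-${each.key}-hourly"
  data_source_id            = "google_cloud_storage"
  destination_dataset_id    = "marketing_raw"
  schedule                  = "every 1 hours"
  service_account_email     = "bq-transfer@kv-data-prod.iam.gserviceaccount.com"
  notification_pubsub_topic = "projects/kv-data-prod/topics/bq-transfer-events"

  # Assumption: API and service agent (with its tokenCreator grant) live in a shared stack.
  enable_api              = false
  create_service_identity = false

  params = {
    data_path_template              = each.value
    destination_table_name_template = "${replace(each.key, "-", "_")}_spend"
    file_format                     = "CSV"
    write_disposition               = "APPEND"
    skip_leading_rows               = "1"
  }
}
```

The google_ads transfer would remain a separate module call without service_account_email, since it relies on user authorisation.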

Best practices
