Quick take — A reusable Terraform module for google_bigquery_data_transfer_config that codifies scheduled BigQuery Data Transfer Service runs — data source, schedule, destination dataset, and service-account auth — with validated inputs. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal configuration: drop this in a .tf file, fill in the "..." placeholders, and replace the empty params map with the keys your data source needs (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "bigquery_data_transfer" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"
project_id = "..." # GCP project owning the transfer config and dataset.
location = "..." # Region of the destination dataset (must match the datas…
display_name = "..." # UI display name (1-256 chars).
data_source_id = "..." # Transfer source (scheduled_query, google_cloud_storage,…
params = {} # Source-specific params (must be non-empty).
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
The BigQuery Data Transfer Service (BQ DTS) is GCP’s managed, scheduled ingestion engine. Instead of writing cron jobs, Cloud Functions, or Composer DAGs to pull data into BigQuery, you declare a transfer config — a data source (Google Ads, Cloud Storage, Amazon S3, Google Play, YouTube, a scheduled query, or a cross-region BigQuery copy), a destination dataset, a schedule, and a bag of source-specific params — and the service runs it for you, retries on failure, and backfills history on demand.
The trouble is that google_bigquery_data_transfer_config is a deceptively fiddly resource. The params map is untyped and entirely different for every data_source_id; the service_account_name plus the useServiceAccount IAM grant is easy to get wrong; schedule_options and data_refresh_window_days interact in non-obvious ways; and the provider needs the bigquerydatatransfer.googleapis.com API enabled and a one-time google_project_service_identity to exist before the first transfer can be created. Wrapping all of that in a module gives you one tested, var-driven interface so every team ships transfers that are named consistently, authenticated by a dedicated service account, and pinned to a known schedule format — no copy-pasted params blocks drifting across repos.
When to use it
- You routinely land external SaaS data (Google Ads, Google Play, YouTube channel/content owner reports, Search Ads 360) into BigQuery and want it scheduled, not scripted.
- You ingest from Cloud Storage or Amazon S3 on a recurring cadence (e.g. hourly Parquet/CSV drops into a partitioned table).
- You run scheduled queries (data_source_id = "scheduled_query") for periodic rollups, materializations, or ELT steps and want them version-controlled.
- You perform cross-region or cross-project BigQuery dataset copies (data_source_id = "cross_region_copy") for DR or data locality.
- You want transfers owned by a dedicated service account (not a user’s credentials) so they survive offboarding and pass audit.
If you only need a one-off historical load, a bq load / bq cp command is simpler — reach for this module when the ingestion is recurring and needs to be in code.
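As an illustration of the cross-region copy case in the list above, here is a minimal sketch of a module call. The project, dataset, and service-account names are hypothetical placeholders, and the params keys are the ones the cross_region_copy source is assumed to expect; verify them against the BQ DTS documentation before relying on this.

```hcl
# Hypothetical example: all project, dataset, and service-account names are placeholders.
module "orders_dr_copy" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"

  project_id             = "kv-data-dr"     # assumed DR project
  location               = "europe-west1"   # must match the destination dataset
  display_name           = "orders-dr-copy-daily"
  data_source_id         = "cross_region_copy"
  destination_dataset_id = "orders_replica" # assumed to exist already
  schedule               = "every 24 hours"
  service_account_email  = "bq-transfer@kv-data-dr.iam.gserviceaccount.com"

  # params is map(string), so booleans are passed as strings.
  params = {
    source_project_id           = "kv-data-prod"
    source_dataset_id           = "orders"
    overwrite_destination_table = "true"
  }
}
```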
Module structure
terraform-module-gcp-bigquery-data-transfer/
├── versions.tf # provider + Terraform version pins
├── main.tf # API enablement, service identity, transfer config, IAM
├── variables.tf # var-driven inputs with validations
└── outputs.tf # config id/name + key attributes
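# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
# main.tf creates google_project_service_identity with provider = google-beta,
# so the beta provider is declared here as well.
google-beta = {
source = "hashicorp/google-beta"
version = "~> 5.0"
}
}
}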
# main.tf
locals {
# Transfers backed by Google-managed first-party data sources (Ads, Play, YouTube,
# Search Ads 360, etc.) require an OAuth/data-source authorization flow and CANNOT be
# driven by a service account. GCS, S3, scheduled_query and cross_region_copy can.
sa_capable_sources = [
"google_cloud_storage",
"amazon_s3",
"scheduled_query",
"cross_region_copy",
"redshift",
"azure_blob_storage",
]
use_service_account = (
var.service_account_email != null &&
contains(local.sa_capable_sources, var.data_source_id)
)
}
# Ensure the Data Transfer API is on before we try to create a config.
resource "google_project_service" "bigquerydatatransfer" {
count = var.enable_api ? 1 : 0
project = var.project_id
service = "bigquerydatatransfer.googleapis.com"
disable_dependent_services = false
disable_on_destroy = false
}
# Create the BQ DTS service agent (P4SA). Required so you can grant it
# roles/iam.serviceAccountTokenCreator on your transfer service account.
resource "google_project_service_identity" "bq_dts" {
count = var.create_service_identity ? 1 : 0
provider = google-beta
project = var.project_id
service = "bigquerydatatransfer.googleapis.com"
depends_on = [google_project_service.bigquerydatatransfer]
}
# Let the BQ DTS service agent impersonate the user-supplied transfer SA.
resource "google_service_account_iam_member" "dts_token_creator" {
count = local.use_service_account && var.create_service_identity ? 1 : 0
service_account_id = "projects/${var.project_id}/serviceAccounts/${var.service_account_email}"
role = "roles/iam.serviceAccountTokenCreator"
member = "serviceAccount:${google_project_service_identity.bq_dts[0].email}"
}
resource "google_bigquery_data_transfer_config" "this" {
project = var.project_id
location = var.location
display_name = var.display_name
data_source_id = var.data_source_id
destination_dataset_id = var.destination_dataset_id
# cron-like ("every 24 hours", "first sunday of quarter 09:00", "1 of month 00:00")
# or empty string for manual/event-driven (e.g. GCS notifications) transfers.
schedule = var.schedule
# How many days back each run reloads (handles late-arriving data). 0 = use the data source's default window.
data_refresh_window_days = var.data_refresh_window_days
# Pause without destroying the config.
disabled = var.disabled
# Publish transfer-run notifications to a Pub/Sub topic (failure emails are set in email_preferences below).
notification_pubsub_topic = var.notification_pubsub_topic
# Source-specific configuration. Shape depends entirely on data_source_id.
params = var.params
# Service-account auth (only set for SA-capable sources).
service_account_name = local.use_service_account ? var.service_account_email : null
dynamic "schedule_options" {
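# Render the block when any schedule option is set, including disable_auto_scheduling on its own.
for_each = (var.disable_auto_scheduling || var.schedule_start_time != null || var.schedule_end_time != null) ? [1] : []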
content {
disable_auto_scheduling = var.disable_auto_scheduling
start_time = var.schedule_start_time
end_time = var.schedule_end_time
}
}
dynamic "email_preferences" {
for_each = var.enable_failure_email ? [1] : []
content {
enable_failure_email = true
}
}
depends_on = [
google_project_service.bigquerydatatransfer,
google_service_account_iam_member.dts_token_creator,
]
}
# variables.tf
variable "project_id" {
description = "GCP project that owns the transfer config and destination dataset."
type = string
}
variable "location" {
description = "Location of the destination dataset (e.g. \"US\", \"EU\", \"asia-south1\"). Must match the dataset's region."
type = string
}
variable "display_name" {
description = "Human-readable name shown in the BigQuery Data Transfers UI."
type = string
validation {
condition = length(var.display_name) > 0 && length(var.display_name) <= 256
error_message = "display_name must be 1-256 characters."
}
}
variable "data_source_id" {
description = "The transfer data source, e.g. scheduled_query, google_cloud_storage, amazon_s3, cross_region_copy, google_ads, play."
type = string
validation {
condition = contains([
"scheduled_query", "google_cloud_storage", "amazon_s3", "azure_blob_storage",
"redshift", "cross_region_copy", "google_ads", "play", "youtube_channel",
"youtube_content_owner", "search_ads", "merchant_center",
], var.data_source_id)
error_message = "data_source_id is not a recognised BigQuery Data Transfer source."
}
}
variable "destination_dataset_id" {
description = "Existing BigQuery dataset ID that receives the transferred data. Not required for some sources (e.g. some scheduled queries write via DDL)."
type = string
default = null
}
variable "schedule" {
description = "Schedule in BQ DTS cron syntax (e.g. \"every 6 hours\", \"every day 07:00\"). Empty string = manual/event-driven."
type = string
default = "every 24 hours"
}
variable "data_refresh_window_days" {
description = "Number of days of historical data to reload on each run (for late data). 0 uses the data source's default window."
type = number
default = 0
validation {
condition = var.data_refresh_window_days >= 0 && var.data_refresh_window_days <= 30
error_message = "data_refresh_window_days must be between 0 and 30."
}
}
variable "disabled" {
description = "If true, the transfer is created but paused (no runs are scheduled)."
type = bool
default = false
}
variable "params" {
description = "Source-specific parameters map. Keys depend on data_source_id (e.g. query for scheduled_query; data_path_template + file_format for google_cloud_storage)."
type = map(string)
validation {
condition = length(var.params) > 0
error_message = "params must contain at least one key (every data_source_id requires source-specific params)."
}
}
variable "service_account_email" {
description = "Email of a dedicated service account to run the transfer under. Only honoured for SA-capable sources (GCS, S3, scheduled_query, cross_region_copy, etc.). Null = use the configuring user's credentials."
type = string
default = null
validation {
condition = var.service_account_email == null || can(regex("^[^@]+@[^@]+\\.iam\\.gserviceaccount\\.com$", var.service_account_email))
error_message = "service_account_email must be a valid *.iam.gserviceaccount.com address or null."
}
}
variable "schedule_start_time" {
description = "RFC3339 timestamp; transfer will not run before this time. Null to omit."
type = string
default = null
}
variable "schedule_end_time" {
description = "RFC3339 timestamp; transfer will not run after this time. Null to omit."
type = string
default = null
}
variable "disable_auto_scheduling" {
description = "If true, no automatic runs are scheduled (manual/backfill only) while keeping the schedule string for reference."
type = bool
default = false
}
variable "notification_pubsub_topic" {
description = "Full Pub/Sub topic name (projects/<p>/topics/<t>) to receive transfer-run completion notifications. Null to disable."
type = string
default = null
validation {
condition = var.notification_pubsub_topic == null || can(regex("^projects/[^/]+/topics/[^/]+$", var.notification_pubsub_topic))
error_message = "notification_pubsub_topic must be of the form projects/<project>/topics/<topic> or null."
}
}
variable "enable_failure_email" {
description = "Email the transfer's owner when a run fails."
type = bool
default = true
}
variable "enable_api" {
description = "Whether this module should enable bigquerydatatransfer.googleapis.com (set false if managed elsewhere)."
type = bool
default = true
}
variable "create_service_identity" {
description = "Whether to create the BQ DTS service agent and grant it tokenCreator on the transfer SA. Requires google-beta provider."
type = bool
default = true
}
# outputs.tf
output "id" {
description = "Fully-qualified resource ID of the transfer config (projects/<p>/locations/<l>/transferConfigs/<uuid>)."
value = google_bigquery_data_transfer_config.this.id
}
output "name" {
description = "Resource name of the transfer config, including the server-generated UUID."
value = google_bigquery_data_transfer_config.this.name
}
output "display_name" {
description = "Human-readable display name of the transfer."
value = google_bigquery_data_transfer_config.this.display_name
}
output "data_source_id" {
description = "The data source backing this transfer."
value = google_bigquery_data_transfer_config.this.data_source_id
}
output "destination_dataset_id" {
description = "Destination BigQuery dataset for the transfer (may be null for some sources)."
value = google_bigquery_data_transfer_config.this.destination_dataset_id
}
output "schedule" {
description = "Effective schedule string of the transfer config."
value = google_bigquery_data_transfer_config.this.schedule
}
output "service_account_email" {
description = "Service account the transfer runs as, or null if running under user credentials."
value = google_bigquery_data_transfer_config.this.service_account_name
}
output "dts_service_agent_email" {
description = "Email of the BQ DTS service agent (P4SA), if created by this module."
value = try(google_project_service_identity.bq_dts[0].email, null)
}
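The params variable above only checks that the map is non-empty. A minimal sketch of a stricter check follows, assuming you add it to main.tf; the required_params map, the terraform_data resource name, and the key lists are illustrative assumptions, not part of the published module. It uses a lifecycle precondition so a missing key fails at plan time.

```hcl
locals {
  # Assumed minimum keys per source; extend to match the sources you actually use.
  required_params = {
    scheduled_query      = ["query"]
    google_cloud_storage = ["data_path_template", "destination_table_name_template", "file_format"]
    cross_region_copy    = ["source_dataset_id"]
  }

  # Keys required for the chosen source that are absent from var.params.
  # Sources not listed in required_params fall back to an empty list (no check).
  missing_params = [
    for key in try(local.required_params[var.data_source_id], []) :
    key if !contains(keys(var.params), key)
  ]
}

# Hypothetical guard resource: it creates nothing in GCP and exists only to host the precondition.
resource "terraform_data" "params_contract" {
  input = var.data_source_id

  lifecycle {
    precondition {
      condition     = length(local.missing_params) == 0
      error_message = "params is missing required keys for ${var.data_source_id}: ${join(", ", local.missing_params)}."
    }
  }
}
```

A precondition is used rather than a variable validation because the check spans two variables, which variable validation blocks only support in newer Terraform releases than the 1.5.0 floor pinned in versions.tf.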
How to use it
A scheduled hourly load of CSV files from Cloud Storage into a partitioned table, running under a dedicated service account:
module "bigquery_data_transfer" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"
project_id = "kv-data-prod"
location = "asia-south1"
display_name = "gcs-orders-hourly"
data_source_id = "google_cloud_storage"
destination_dataset_id = google_bigquery_dataset.raw.dataset_id
schedule = "every 1 hours"
service_account_email = google_service_account.bq_transfer.email
# GCS source params: where to read, target table, format, and write disposition.
params = {
data_path_template = "gs://kv-landing-orders/incoming/*.csv"
destination_table_name_template = "orders_raw"
file_format = "CSV"
write_disposition = "APPEND"
skip_leading_rows = "1"
max_bad_records = "0"
}
notification_pubsub_topic = "projects/kv-data-prod/topics/bq-transfer-events"
}
# Downstream: alert on failed transfer runs by subscribing the ops function
# to the Pub/Sub topic, named and labelled from this module's display_name and id outputs.
resource "google_pubsub_subscription" "transfer_alerts" {
name = "alert-${module.bigquery_data_transfer.display_name}"
topic = "projects/kv-data-prod/topics/bq-transfer-events"
push_config {
push_endpoint = "https://asia-south1-kv-data-prod.cloudfunctions.net/transfer-failure-alert"
}
labels = {
transfer_id = element(split("/", module.bigquery_data_transfer.id), length(split("/", module.bigquery_data_transfer.id)) - 1)
}
}
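For a scheduled query instead, set data_source_id = "scheduled_query" and put the SQL in params. A SELECT that writes to a table takes params = { query = "SELECT ...", destination_table_name_template = "...", write_disposition = "WRITE_TRUNCATE" }; a DML or DDL statement such as MERGE names its own target, so drop destination_dataset_id and pass only params = { query = "MERGE ..." }.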
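A sketch of the scheduled-query variant for a plain SELECT rollup that writes to a destination table. The dataset, table, and column names in the query are hypothetical, and the service account is assumed to exist already.

```hcl
# Hypothetical example: dataset, table, and column names are placeholders.
module "daily_revenue_rollup" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"

  project_id             = "kv-data-prod"
  location               = "asia-south1"
  display_name           = "revenue-rollup-daily"
  data_source_id         = "scheduled_query"
  destination_dataset_id = "marketing_agg" # assumed to exist already
  schedule               = "every day 07:00"
  service_account_email  = "bq-transfer@kv-data-prod.iam.gserviceaccount.com"

  params = {
    query                           = "SELECT order_date, SUM(amount) AS revenue FROM `kv-data-prod.raw.orders_raw` GROUP BY order_date"
    destination_table_name_template = "daily_revenue"
    write_disposition               = "WRITE_TRUNCATE" # replace the table on every run
  }
}
```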
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...gcs state bucket/container + key per path...
}
}
2. Module config — live/prod/bigquery_data_transfer/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"
}
inputs = {
project_id = "..."
location = "..."
display_name = "..."
data_source_id = "..."
params = {}
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/bigquery_data_transfer && terragrunt apply # this module
cd live/prod && terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
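To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might look like. The live/dev path, project, bucket, and dataset names are assumptions for illustration; only the inputs differ from prod, and the module source stays pinned to the same tag.

```hcl
# live/dev/bigquery_data_transfer/terragrunt.hcl (hypothetical dev counterpart)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"
}

inputs = {
  project_id             = "kv-data-dev" # assumed dev project
  location               = "asia-south1"
  display_name           = "gcs-orders-daily-dev"
  data_source_id         = "google_cloud_storage"
  destination_dataset_id = "raw"
  schedule               = "every 24 hours" # slower cadence than prod
  disabled               = true             # keep the config paused until needed

  params = {
    data_path_template              = "gs://kv-landing-orders-dev/incoming/*.csv"
    destination_table_name_template = "orders_raw"
    file_format                     = "CSV"
    write_disposition               = "APPEND"
    skip_leading_rows               = "1"
  }
}
```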
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | — | Yes | GCP project owning the transfer config and dataset. |
| location | string | — | Yes | Region of the destination dataset (must match the dataset). |
| display_name | string | — | Yes | UI display name (1-256 chars). |
| data_source_id | string | — | Yes | Transfer source (scheduled_query, google_cloud_storage, amazon_s3, cross_region_copy, google_ads, play, …). |
| destination_dataset_id | string | null | No | Destination BigQuery dataset ID. |
| schedule | string | "every 24 hours" | No | BQ DTS cron schedule; empty string = manual/event-driven. |
| data_refresh_window_days | number | 0 | No | Days of history reloaded per run (0-30). |
| disabled | bool | false | No | Create the config but pause scheduling. |
| params | map(string) | — | Yes | Source-specific params (must be non-empty). |
| service_account_email | string | null | No | Dedicated SA to run the transfer; honoured only for SA-capable sources. |
| schedule_start_time | string | null | No | RFC3339 earliest run time. |
| schedule_end_time | string | null | No | RFC3339 latest run time. |
| disable_auto_scheduling | bool | false | No | Manual/backfill only; suppress automatic runs. |
| notification_pubsub_topic | string | null | No | Pub/Sub topic (projects/<p>/topics/<t>) for run notifications. |
| enable_failure_email | bool | true | No | Email the owner on run failure. |
| enable_api | bool | true | No | Enable bigquerydatatransfer.googleapis.com from this module. |
| create_service_identity | bool | true | No | Create the BQ DTS service agent and grant it tokenCreator on the SA. |
Outputs
| Name | Description |
|---|---|
| id | Fully-qualified transfer config ID (projects/<p>/locations/<l>/transferConfigs/<uuid>). |
| name | Resource name including the server-generated UUID. |
| display_name | Human-readable display name. |
| data_source_id | The data source backing the transfer. |
| destination_dataset_id | Destination dataset (may be null for some sources). |
| schedule | Effective schedule string. |
| service_account_email | SA the transfer runs as, or null for user credentials. |
| dts_service_agent_email | BQ DTS service agent (P4SA) email, if created here. |
Enterprise scenario
A retail analytics team centralises marketing spend reporting in BigQuery. They instantiate this module once per channel: one google_ads transfer (user-authorised, daily, data_refresh_window_days = 7 to capture click-attribution backfill) and three google_cloud_storage transfers pulling partner CSVs that land hourly in a landing bucket, all writing into a single marketing_raw dataset in asia-south1. The Cloud Storage configs run under a dedicated bq-transfer@ service account, while the Google Ads config stays on user authorisation because the module treats that source as not SA-capable. Every config emits run events to a shared Pub/Sub topic, and a downstream Cloud Function pages the on-call channel on failure, so a broken partner feed surfaces in minutes instead of at the next morning’s dashboard refresh.
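The three Cloud Storage transfers in this scenario could be driven from a single module block with for_each. This is a sketch under assumptions: the partner keys, bucket paths, table names, and service-account email are hypothetical, and the Google Ads transfer would be a separate module call.

```hcl
locals {
  # Hypothetical partner feeds; bucket paths and table names are placeholders.
  partner_feeds = {
    "partner-a" = { path = "gs://kv-landing-marketing/partner-a/*.csv", table = "partner_a_spend" }
    "partner-b" = { path = "gs://kv-landing-marketing/partner-b/*.csv", table = "partner_b_spend" }
    "partner-c" = { path = "gs://kv-landing-marketing/partner-c/*.csv", table = "partner_c_spend" }
  }
}

module "partner_spend_transfer" {
  for_each = local.partner_feeds
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery-data-transfer?ref=v1.0.0"

  project_id                = "kv-data-prod"
  location                  = "asia-south1"
  display_name              = "gcs-${each.key}-spend-hourly"
  data_source_id            = "google_cloud_storage"
  destination_dataset_id    = "marketing_raw"
  schedule                  = "every 1 hours"
  service_account_email     = "bq-transfer@kv-data-prod.iam.gserviceaccount.com"
  notification_pubsub_topic = "projects/kv-data-prod/topics/bq-transfer-events"

  params = {
    data_path_template              = each.value.path
    destination_table_name_template = each.value.table
    file_format                     = "CSV"
    write_disposition               = "APPEND"
    skip_leading_rows               = "1"
  }
}
```

Each instance also carries the API enablement and the token-creator grant. Some teams may prefer to manage those once outside the module and set enable_api = false and create_service_identity = false on the instances. That split is a design choice, not something the module requires.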
Best practices
- Run on a dedicated service account, never user credentials, for every SA-capable source (GCS, S3, scheduled queries, cross-region copies). User-owned transfers break the moment that person leaves; grant the BQ DTS service agent roles/iam.serviceAccountTokenCreator on the SA (this module does it) and give the SA only roles/bigquery.dataEditor on the target dataset plus read on the source.
- Tune data_refresh_window_days to the data, not the maximum. A large refresh window re-scans and re-bills the same source rows every run; set it to the realistic lateness of your data (often 1-3 days) and use one-off backfills for historical loads rather than a permanently wide window.
- Keep the destination dataset and the transfer location identical, and prefer partition-friendly destination_table_name_template patterns (e.g. table name plus {run_date}/{run_time} placeholders for GCS) so each run lands in its own partition and reloads stay cheap and idempotent.
- Wire notification_pubsub_topic (and enable_failure_email) on day one. Transfers fail silently otherwise: a missed schedule or a malformed source file just produces stale tables, so alerting is the difference between a five-minute fix and a day-old dashboard.
- Name configs by source-and-cadence (gcs-orders-hourly, ads-spend-daily) since the display_name is the only human handle; the underlying config name is a server-generated UUID you cannot choose.
- Pin the module by tag and treat params as source-specific contracts: validate the exact keys each data_source_id expects (e.g. query for scheduled queries, data_path_template + file_format for GCS) in review, because the provider only fails at apply time when a required param is missing.
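The least-privilege grants from the first practice might be sketched as follows for the Cloud Storage example. The project, dataset, and bucket names are assumptions carried over from the usage example, and some sources may need further roles that are not shown here.

```hcl
# Dedicated identity the transfer runs as (names are placeholders).
resource "google_service_account" "bq_transfer" {
  project      = "kv-data-prod"
  account_id   = "bq-transfer"
  display_name = "BigQuery Data Transfer runner"
}

# Write access scoped to the destination dataset only, not the whole project.
resource "google_bigquery_dataset_iam_member" "transfer_writer" {
  project    = "kv-data-prod"
  dataset_id = "raw" # hypothetical destination dataset
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:${google_service_account.bq_transfer.email}"
}

# Read-only access to the landing bucket the transfer pulls from.
resource "google_storage_bucket_iam_member" "transfer_reader" {
  bucket = "kv-landing-orders"
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.bq_transfer.email}"
}
```

Scoping the grants to the dataset and bucket, rather than the project, keeps the blast radius of the transfer identity small. The token-creator grant for the BQ DTS service agent is left to the module itself.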