Quick take — Build a reusable Terraform module for google_storage_transfer_job: schedule S3/Azure/GCS transfers, wire the agent service account, set object conditions, and emit job names for downstream orchestration. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "storage_transfer" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"
project_id = "..." # Project owning the job; its STS service account is gran…
description = "..." # Job description; unique within the project (1–1024 char…
source_type = "..." # Source family: gcs, aws_s3, or azure_blob.
sink_bucket = "..." # Destination GCS bucket name.
}
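Then run terraform init && terraform apply. Most other inputs have sensible defaults, but two are conditionally required: the source-specific inputs for the chosen source_type (for example gcs_source_bucket when source_type = "gcs"), and schedule_start_date unless you set schedule_enabled = false. See Inputs below for the full list.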
What this module is
Google Cloud Storage Transfer Service (STS) is a managed, agentless service for moving large volumes of objects into Cloud Storage — from another GCS bucket, from Amazon S3, from Azure Blob Storage, from an arbitrary HTTP/HTTPS URL list, or from an S3-compatible endpoint (MinIO, Wasabi, on-prem object stores). It handles parallelism, retries, checksum verification, and incremental sync (only copying objects that changed) without you running any VMs or writing a single line of copy logic. You define a transfer job that pairs a source and a sink, optionally attaches a schedule and object conditions, and STS does the rest on Google’s infrastructure.
The single Terraform resource that drives all of this is google_storage_transfer_job. On its own it is deceptively fiddly: the schedule block uses split year/month/day objects rather than a cron string, the source credential blocks differ per provider (AWS access keys vs. an AWS role ARN vs. an Azure SAS token), and — the part that trips everyone up — the Storage Transfer Service-managed service account (project-<NUMBER>@storage-transfer-service.iam.gserviceaccount.com) must be granted roles/storage.objectViewer on the source and roles/storage.objectUser (or legacy objectAdmin) on the sink before the job will run. Wrapping all of this in a module means every team gets a job whose service-account IAM, schedule, retry conditions, and notification topic are correct and consistent — instead of copy-pasting a brittle resource and rediscovering the same permission error each time.
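One way to avoid hand-assembling that service account address is the provider's google_storage_transfer_project_service_account data source, which returns the managed account's email for a project. The following is a minimal sketch, assuming the same var.project_id input the module uses. It is an alternative to the string interpolation in main.tf, not what the module itself does.

```hcl
# Sketch: look up the Google-managed STS service account instead of
# building its address from the project number.
data "google_storage_transfer_project_service_account" "this" {
  project = var.project_id
}

locals {
  # Same local name as in main.tf, so the IAM bindings would not change.
  sts_service_account = data.google_storage_transfer_project_service_account.this.email
}
```

Either approach yields the same address. The data source simply removes the dependency on knowing the address format.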
When to use it
- Cloud migration / repatriation — one-shot or recurring backfill from Amazon S3 or Azure Blob into GCS as you move workloads onto Google Cloud.
- Cross-region / cross-project replication — keep a GCS bucket in asia-south1 synced to a DR bucket in europe-west1, on a nightly schedule, deleting objects from the sink when they’re removed at source.
- Periodic ingestion from a partner’s S3-compatible endpoint into your data-lake landing bucket, filtered to a prefix and a max file age.
- Lifecycle-aware archival — move objects older than N days out of a hot Standard bucket into a Coldline/Archive bucket, using object conditions on min_time_elapsed_since_last_modification.
Reach for a different tool when you need sub-minute event-driven copies (use Pub/Sub + Cloud Functions / Eventarc), POSIX filesystem transfers from on-prem NFS (use the Transfer service for on-premises / agent pools with transfer_agent_pool, a different shape than this module targets), or a single ad-hoc gcloud storage cp.
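As an illustration of the replication case in the list above, a module call might look like the sketch below. The project ID, bucket names, and start date are hypothetical placeholders; only the input names come from the module's variables.tf.

```hcl
# Sketch: nightly GCS -> GCS mirror into a DR bucket.
module "dr_replication" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "my-project" # hypothetical
  description = "Nightly mirror: primary bucket -> DR bucket"
  source_type = "gcs"

  gcs_source_bucket = "my-primary-asia-south1" # hypothetical
  sink_bucket       = "my-dr-europe-west1"     # hypothetical

  overwrite_when        = "DIFFERENT" # copy only objects that changed
  delete_unique_in_sink = true        # remove sink objects deleted at source

  schedule_start_date = { year = 2026, month = 1, day = 1 }
  start_time_of_day   = { hours = 20, minutes = 0 } # UTC
  repeat_interval     = "86400s"
}
```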
Module structure
terraform-module-gcp-storage-transfer/
├── versions.tf # provider pin
├── main.tf # IAM bindings + google_storage_transfer_job
├── variables.tf # var-driven inputs + validations
├── outputs.tf # job name/id + key attributes
└── README.md
versions.tf
terraform {
required_version = ">= 1.9.0" # variables.tf has a validation that references another variable, which needs Terraform 1.9+
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
# Resolve the project so we can build the STS-managed service account address.
data "google_project" "this" {
project_id = var.project_id
}
# The Storage Transfer Service control plane runs as this Google-managed SA.
# It must be able to read the source and write the sink.
locals {
sts_service_account = "project-${data.google_project.this.number}@storage-transfer-service.iam.gserviceaccount.com"
# Only one source family is active at a time; this keeps the dynamic blocks tidy.
is_gcs_source = var.source_type == "gcs"
is_s3_source = var.source_type == "aws_s3"
is_azure_source = var.source_type == "azure_blob"
}
# Grant STS read on the source bucket when the source is a GCS bucket.
resource "google_storage_bucket_iam_member" "source_reader" {
count = local.is_gcs_source && var.grant_sts_iam ? 1 : 0
bucket = var.gcs_source_bucket
role = "roles/storage.objectViewer"
member = "serviceAccount:${local.sts_service_account}"
}
# Grant STS read+write on the destination bucket.
resource "google_storage_bucket_iam_member" "sink_writer" {
count = var.grant_sts_iam ? 1 : 0
bucket = var.sink_bucket
role = "roles/storage.objectUser"
member = "serviceAccount:${local.sts_service_account}"
}
resource "google_storage_transfer_job" "this" {
project = var.project_id
description = var.description
status = var.status
transfer_spec {
# ---- SOURCE: pick exactly one family based on var.source_type ----
dynamic "gcs_data_source" {
for_each = local.is_gcs_source ? [1] : []
content {
bucket_name = var.gcs_source_bucket
path = var.source_path
}
}
dynamic "aws_s3_data_source" {
for_each = local.is_s3_source ? [1] : []
content {
bucket_name = var.aws_s3_bucket
path = var.source_path
# Prefer a federated role over long-lived keys when set.
role_arn = var.aws_role_arn # null leaves the argument unset
dynamic "aws_access_key" {
for_each = var.aws_role_arn == null ? [1] : []
content {
access_key_id = var.aws_access_key_id
secret_access_key = var.aws_secret_access_key
}
}
}
}
dynamic "azure_blob_storage_data_source" {
for_each = local.is_azure_source ? [1] : []
content {
storage_account = var.azure_storage_account
container = var.azure_container
path = var.source_path
azure_credentials {
sas_token = var.azure_sas_token
}
}
}
# ---- SINK: always a GCS bucket ----
gcs_data_sink {
bucket_name = var.sink_bucket
path = var.sink_path
}
# ---- What to move / how aggressively ----
object_conditions {
include_prefixes = var.include_prefixes
exclude_prefixes = var.exclude_prefixes
max_time_elapsed_since_last_modification = var.max_time_elapsed_since_last_modification
min_time_elapsed_since_last_modification = var.min_time_elapsed_since_last_modification
}
transfer_options {
overwrite_objects_already_existing_in_sink = var.overwrite_existing
delete_objects_unique_in_sink = var.delete_unique_in_sink
delete_objects_from_source_after_transfer = var.delete_from_source_after_transfer
overwrite_when = var.overwrite_when
}
}
# ---- SCHEDULE: omit entirely for a job that runs only when triggered explicitly ----
dynamic "schedule" {
for_each = var.schedule_enabled ? [1] : []
content {
schedule_start_date {
year = var.schedule_start_date.year
month = var.schedule_start_date.month
day = var.schedule_start_date.day
}
dynamic "schedule_end_date" {
for_each = var.schedule_end_date != null ? [1] : []
content {
year = var.schedule_end_date.year
month = var.schedule_end_date.month
day = var.schedule_end_date.day
}
}
dynamic "start_time_of_day" {
for_each = var.start_time_of_day != null ? [1] : []
content {
hours = var.start_time_of_day.hours
minutes = var.start_time_of_day.minutes
seconds = 0
nanos = 0
}
}
repeat_interval = var.repeat_interval
}
}
# ---- Pub/Sub notifications on run completion (optional) ----
dynamic "notification_config" {
for_each = var.notification_pubsub_topic != null ? [1] : []
content {
pubsub_topic = var.notification_pubsub_topic
event_types = var.notification_event_types
payload_format = "JSON"
}
}
depends_on = [
google_storage_bucket_iam_member.sink_writer,
google_storage_bucket_iam_member.source_reader,
]
}
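A caveat on the two IAM bindings above: Google's STS access documentation also lists storage.buckets.get among the permissions the service account needs on each bucket. As far as I can tell, neither roles/storage.objectViewer nor roles/storage.objectUser includes it. If a job fails with a bucket-level permission error, an extra pair of bindings along these lines may help. This is a sketch under that assumption, with hypothetical resource names, and is worth checking against the current STS documentation before adopting.

```hcl
# Sketch: add bucket-metadata read (storage.buckets.get) for the STS account.
resource "google_storage_bucket_iam_member" "source_bucket_metadata" {
  count  = local.is_gcs_source && var.grant_sts_iam ? 1 : 0
  bucket = var.gcs_source_bucket
  role   = "roles/storage.legacyBucketReader"
  member = "serviceAccount:${local.sts_service_account}"
}

resource "google_storage_bucket_iam_member" "sink_bucket_metadata" {
  count  = var.grant_sts_iam ? 1 : 0
  bucket = var.sink_bucket
  role   = "roles/storage.legacyBucketReader"
  member = "serviceAccount:${local.sts_service_account}"
}
```

If these were added to the module, they would also belong in the job's depends_on list so the job is created only after the bindings exist.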
variables.tf
variable "project_id" {
type = string
description = "Project that owns the transfer job and whose STS service account is granted IAM."
}
variable "description" {
type = string
description = "Human-readable description of the transfer job (shown in the console). Must be unique within the project."
validation {
condition = length(var.description) > 0 && length(var.description) <= 1024
error_message = "description must be 1-1024 characters."
}
}
variable "status" {
type = string
default = "ENABLED"
description = "Job status: ENABLED, DISABLED, or DELETED."
validation {
condition = contains(["ENABLED", "DISABLED", "DELETED"], var.status)
error_message = "status must be one of ENABLED, DISABLED, DELETED."
}
}
variable "source_type" {
type = string
description = "Source family: gcs, aws_s3, or azure_blob."
validation {
condition = contains(["gcs", "aws_s3", "azure_blob"], var.source_type)
error_message = "source_type must be one of gcs, aws_s3, azure_blob."
}
}
variable "grant_sts_iam" {
type = bool
default = true
description = "If true, grant the STS-managed service account objectViewer on a GCS source and objectUser on the sink."
}
# ---- Source: GCS ----
variable "gcs_source_bucket" {
type = string
default = null
description = "Source bucket name when source_type = gcs."
}
# ---- Source: AWS S3 ----
variable "aws_s3_bucket" {
type = string
default = null
description = "Source S3 bucket name when source_type = aws_s3."
}
variable "aws_role_arn" {
type = string
default = null
description = "AWS IAM role ARN for federated access. Preferred over static keys; when set, access keys are ignored."
}
variable "aws_access_key_id" {
type = string
default = null
sensitive = true
description = "AWS access key ID (used only if aws_role_arn is null)."
}
variable "aws_secret_access_key" {
type = string
default = null
sensitive = true
description = "AWS secret access key (used only if aws_role_arn is null)."
}
# ---- Source: Azure Blob ----
variable "azure_storage_account" {
type = string
default = null
description = "Azure storage account name when source_type = azure_blob."
}
variable "azure_container" {
type = string
default = null
description = "Azure blob container name when source_type = azure_blob."
}
variable "azure_sas_token" {
type = string
default = null
sensitive = true
description = "Azure SAS token granting read+list on the container."
}
# ---- Paths ----
variable "source_path" {
type = string
default = null
description = "Optional object prefix at the source (e.g. 'incoming/'). Must end with '/' if set."
validation {
condition = var.source_path == null || endswith(var.source_path, "/")
error_message = "source_path must end with a trailing slash."
}
}
variable "sink_bucket" {
type = string
description = "Destination GCS bucket name."
}
variable "sink_path" {
type = string
default = null
description = "Optional object prefix at the sink. Must end with '/' if set."
validation {
condition = var.sink_path == null || endswith(var.sink_path, "/")
error_message = "sink_path must end with a trailing slash."
}
}
# ---- Object conditions ----
variable "include_prefixes" {
type = list(string)
default = []
description = "Only transfer objects whose name begins with one of these prefixes."
}
variable "exclude_prefixes" {
type = list(string)
default = []
description = "Skip objects whose name begins with one of these prefixes."
}
variable "min_time_elapsed_since_last_modification" {
type = string
default = null
description = "Only transfer objects last modified at least this long ago, as a duration string e.g. '2592000s' (30 days)."
validation {
condition = var.min_time_elapsed_since_last_modification == null || endswith(var.min_time_elapsed_since_last_modification, "s")
error_message = "Duration must be expressed in seconds with a trailing 's', e.g. '2592000s'."
}
}
variable "max_time_elapsed_since_last_modification" {
type = string
default = null
description = "Only transfer objects last modified at most this long ago (seconds with trailing 's')."
validation {
condition = var.max_time_elapsed_since_last_modification == null || endswith(var.max_time_elapsed_since_last_modification, "s")
error_message = "Duration must be expressed in seconds with a trailing 's', e.g. '86400s'."
}
}
# ---- Transfer options ----
variable "overwrite_existing" {
type = bool
default = false
description = "Overwrite objects that already exist in the sink (set with overwrite_when)."
}
variable "overwrite_when" {
type = string
default = "DIFFERENT"
description = "When to overwrite sink objects: ALWAYS, DIFFERENT, or NEVER."
validation {
condition = contains(["ALWAYS", "DIFFERENT", "NEVER"], var.overwrite_when)
error_message = "overwrite_when must be ALWAYS, DIFFERENT, or NEVER."
}
}
variable "delete_unique_in_sink" {
type = bool
default = false
description = "Delete objects in the sink that are not present at the source (true sync/mirror). Mutually exclusive with delete_from_source_after_transfer."
}
variable "delete_from_source_after_transfer" {
type = bool
default = false
description = "Delete objects from the source once transferred (move semantics). Mutually exclusive with delete_unique_in_sink."
validation {
condition = !(var.delete_from_source_after_transfer && var.delete_unique_in_sink)
error_message = "delete_from_source_after_transfer and delete_unique_in_sink cannot both be true."
}
}
# ---- Schedule ----
variable "schedule_enabled" {
type = bool
default = true
description = "If false, no schedule block is emitted; the job is created but only runs when triggered explicitly (for example via transferJobs.run)."
}
variable "schedule_start_date" {
type = object({
year = number
month = number
day = number
})
default = null
description = "First date the job is eligible to run (UTC). Required when schedule_enabled = true."
}
variable "schedule_end_date" {
type = object({
year = number
month = number
day = number
})
default = null
description = "Last date the job runs. Omit for an indefinitely recurring schedule."
}
variable "start_time_of_day" {
type = object({
hours = number
minutes = number
})
default = null
description = "UTC time of day each run starts. Omit to run as soon as eligible."
}
variable "repeat_interval" {
type = string
default = "86400s"
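description = "Time between the starts of recurring runs as a duration string, e.g. '86400s' for daily; STS requires at least one hour ('3600s'). For a single scheduled run, set schedule_end_date equal to schedule_start_date."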
}
# ---- Notifications ----
variable "notification_pubsub_topic" {
type = string
default = null
description = "Full Pub/Sub topic resource name (projects/<p>/topics/<t>) to publish run events to."
}
variable "notification_event_types" {
type = list(string)
default = ["TRANSFER_OPERATION_SUCCESS", "TRANSFER_OPERATION_FAILED"]
description = "Which run events trigger a Pub/Sub notification."
}
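Several inputs above are only conditionally required (gcs_source_bucket for a GCS source, schedule_start_date when a schedule is enabled), and nothing in the declarations enforces that. Left unset, they surface as a less helpful error at plan time. The sketch below shows how two of the declarations could be rewritten to fail early. It assumes Terraform 1.9 or later, where a validation block may reference other variables, and these blocks would replace the existing declarations rather than sit alongside them.

```hcl
# Sketch: replacement declarations with cross-variable validation (Terraform >= 1.9).
variable "gcs_source_bucket" {
  type        = string
  default     = null
  description = "Source bucket name when source_type = gcs."

  validation {
    condition     = var.source_type != "gcs" || var.gcs_source_bucket != null
    error_message = "gcs_source_bucket is required when source_type is \"gcs\"."
  }
}

variable "schedule_start_date" {
  type        = object({ year = number, month = number, day = number })
  default     = null
  description = "First date the job is eligible to run (UTC). Required when schedule_enabled = true."

  validation {
    condition     = !var.schedule_enabled || var.schedule_start_date != null
    error_message = "schedule_start_date is required when schedule_enabled is true."
  }
}
```

The same pattern would extend to aws_s3_bucket and the Azure inputs.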
outputs.tf
output "name" {
description = "Server-assigned unique name of the transfer job, e.g. transferJobs/1234567890. Use this to trigger runs or query operations."
value = google_storage_transfer_job.this.name
}
output "description" {
description = "Description of the transfer job."
value = google_storage_transfer_job.this.description
}
output "status" {
description = "Current status of the job (ENABLED / DISABLED / DELETED)."
value = google_storage_transfer_job.this.status
}
output "creation_time" {
description = "Timestamp the transfer job was created."
value = google_storage_transfer_job.this.creation_time
}
output "sts_service_account" {
description = "The Storage Transfer Service-managed service account that was (optionally) granted bucket IAM. Useful for granting cross-project access manually."
value = local.sts_service_account
}
output "sink_bucket" {
description = "Destination bucket the job writes to."
value = var.sink_bucket
}
How to use it
A nightly S3 → GCS sync into a data-lake landing bucket, using a federated AWS role (no static keys), restricted to a prefix, mirroring deletions, and notifying a Pub/Sub topic on completion:
module "storage_transfer_service" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"
project_id = "kv-data-platform-prod"
description = "Nightly sync: partner-acme S3 incoming -> lakehouse landing"
source_type = "aws_s3"
aws_s3_bucket = "acme-export-prod"
aws_role_arn = "arn:aws:iam::210987654321:role/gcp-sts-reader"
source_path = "incoming/"
sink_bucket = "kv-lakehouse-landing-prod"
sink_path = "acme/"
include_prefixes = ["incoming/orders/", "incoming/inventory/"]
overwrite_existing = true
overwrite_when = "DIFFERENT"
delete_unique_in_sink = true
schedule_enabled = true
schedule_start_date = { year = 2026, month = 6, day = 10 }
start_time_of_day = { hours = 18, minutes = 30 } # 00:00 IST
repeat_interval = "86400s"
notification_pubsub_topic = google_pubsub_topic.transfer_events.id
}
resource "google_pubsub_topic" "transfer_events" {
name = "sts-transfer-events"
project = "kv-data-platform-prod"
}
# Downstream: a Cloud Scheduler job that triggers an ad-hoc run uses the job name output.
resource "google_cloud_scheduler_job" "manual_kick" {
name = "kick-acme-transfer"
project = "kv-data-platform-prod"
region = "asia-south1"
schedule = "0 12 * * 1" # Monday noon catch-up run
http_target {
http_method = "POST"
uri = "https://storagetransfer.googleapis.com/v1/${module.storage_transfer_service.name}:run"
body = base64encode(jsonencode({ projectId = "kv-data-platform-prod" }))
oauth_token {
service_account_email = google_service_account.scheduler_sa.email
}
}
}
Because module.storage_transfer_service.name is the server-assigned transferJobs/<id>, downstream resources (Cloud Scheduler, a Cloud Function, monitoring alerts) can reference the exact job without you hard-coding the generated ID.
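The example above leaves two supporting pieces implicit: the STS service account must be allowed to publish to the topic, and google_service_account.scheduler_sa must exist and be allowed to run transfer jobs. A minimal sketch of both follows. The resource names and the account_id are hypothetical, and the project-level roles/storagetransfer.user grant is an assumption about which role to use, so a narrower one may suit your setup.

```hcl
# Sketch: let the STS-managed account publish run events to the topic.
resource "google_pubsub_topic_iam_member" "sts_publisher" {
  project = "kv-data-platform-prod"
  topic   = google_pubsub_topic.transfer_events.name
  role    = "roles/pubsub.publisher"
  member  = "serviceAccount:${module.storage_transfer_service.sts_service_account}"
}

# Sketch: identity that Cloud Scheduler uses to call transferJobs.run.
resource "google_service_account" "scheduler_sa" {
  project      = "kv-data-platform-prod"
  account_id   = "sts-scheduler" # hypothetical
  display_name = "Cloud Scheduler caller for Storage Transfer runs"
}

resource "google_project_iam_member" "scheduler_can_run_jobs" {
  project = "kv-data-platform-prod"
  role    = "roles/storagetransfer.user"
  member  = "serviceAccount:${google_service_account.scheduler_sa.email}"
}
```

One ordering caveat: the module cannot depend on the publisher binding, so Terraform may try to create the job before it exists. If the first apply is rejected for that reason, applying the binding first (for example with a targeted apply) should resolve it.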
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/storage_transfer/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"
}
inputs = {
project_id = "..."
description = "..."
source_type = "..."
sink_bucket = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/storage_transfer && terragrunt apply # this module
terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
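To make the per-environment override concrete, a dev counterpart to the prod config might look like the sketch below. The path, project ID, and bucket names are hypothetical; the point is that only the inputs block differs between environments.

```hcl
# Sketch: live/dev/storage_transfer/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"
}

inputs = {
  project_id        = "my-project-dev" # hypothetical
  description       = "Dev: sample export -> landing bucket"
  source_type       = "gcs"
  gcs_source_bucket = "my-sample-export-dev" # hypothetical
  sink_bucket       = "my-landing-dev"       # hypothetical

  # Dev keeps the job parked until someone enables it deliberately.
  status              = "DISABLED"
  schedule_start_date = { year = 2026, month = 1, day = 1 }
}
```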
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | — | Yes | Project owning the job; its STS service account is granted IAM. |
| description | string | — | Yes | Job description; unique within the project (1–1024 chars). |
| status | string | "ENABLED" | No | ENABLED, DISABLED, or DELETED. |
| source_type | string | — | Yes | Source family: gcs, aws_s3, or azure_blob. |
| grant_sts_iam | bool | true | No | Grant STS SA objectViewer on a GCS source and objectUser on the sink. |
| gcs_source_bucket | string | null | If gcs | Source bucket name for a GCS source. |
| aws_s3_bucket | string | null | If aws_s3 | Source S3 bucket name. |
| aws_role_arn | string | null | No | AWS role ARN for federated access; preferred over static keys. |
| aws_access_key_id | string | null | If aws_s3 + no role | AWS access key ID (sensitive). |
| aws_secret_access_key | string | null | If aws_s3 + no role | AWS secret access key (sensitive). |
| azure_storage_account | string | null | If azure_blob | Azure storage account name. |
| azure_container | string | null | If azure_blob | Azure blob container name. |
| azure_sas_token | string | null | If azure_blob | SAS token with read+list (sensitive). |
| source_path | string | null | No | Object prefix at source; must end with ‘/’. |
| sink_bucket | string | — | Yes | Destination GCS bucket name. |
| sink_path | string | null | No | Object prefix at sink; must end with ‘/’. |
| include_prefixes | list(string) | [] | No | Only transfer objects matching these prefixes. |
| exclude_prefixes | list(string) | [] | No | Skip objects matching these prefixes. |
| min_time_elapsed_since_last_modification | string | null | No | Min object age, seconds with trailing ‘s’. |
| max_time_elapsed_since_last_modification | string | null | No | Max object age, seconds with trailing ‘s’. |
| overwrite_existing | bool | false | No | Overwrite objects already in the sink. |
| overwrite_when | string | "DIFFERENT" | No | ALWAYS, DIFFERENT, or NEVER. |
| delete_unique_in_sink | bool | false | No | Delete sink objects absent at source (mirror). |
| delete_from_source_after_transfer | bool | false | No | Delete source objects after transfer (move). |
| schedule_enabled | bool | true | No | If false, no schedule block is emitted and the job only runs when triggered explicitly. |
| schedule_start_date | object(year,month,day) | null | If scheduled | First eligible run date (UTC). |
| schedule_end_date | object(year,month,day) | null | No | Last run date; omit for indefinite. |
| start_time_of_day | object(hours,minutes) | null | No | UTC time each run starts. |
| repeat_interval | string | "86400s" | No | Gap between runs, seconds with trailing ‘s’. |
| notification_pubsub_topic | string | null | No | Pub/Sub topic resource name for run events. |
| notification_event_types | list(string) | ["TRANSFER_OPERATION_SUCCESS", "TRANSFER_OPERATION_FAILED"] | No | Events that trigger notifications. |
Outputs
| Name | Description |
|---|---|
| name | Server-assigned job name (transferJobs/<id>); use to trigger runs and query operations. |
| description | The job’s description. |
| status | Current job status (ENABLED / DISABLED / DELETED). |
| creation_time | Timestamp the job was created. |
| sts_service_account | The STS-managed service account address granted bucket IAM. |
| sink_bucket | Destination bucket the job writes to. |
Enterprise scenario
A retail analytics group is mid-migration from AWS to Google Cloud and must keep its BigQuery lakehouse fed while order processing still writes to Amazon S3. The team instantiates this module once per source domain (orders, inventory, clickstream), each pointed at a different S3 prefix with a federated aws_role_arn so no long-lived AWS keys ever land in Terraform state, scheduled for 00:00 IST with delete_unique_in_sink = true so the GCS landing zone is a faithful mirror. The topic passed in as notification_pubsub_topic feeds a Cloud Function that kicks off the downstream BigQuery load only after TRANSFER_OPERATION_SUCCESS, giving them an event-driven pipeline with no copy infrastructure to operate.
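The "once per source domain" pattern could be written with a module-level for_each, as in the sketch below. The domain-to-prefix map and the start date are illustrative assumptions. The bucket names and role ARN are reused from the usage example earlier, and google_pubsub_topic.transfer_events is assumed to be declared as in that example.

```hcl
# Sketch: one transfer job per source domain.
locals {
  domains = {
    orders      = "orders/"
    inventory   = "inventory/"
    clickstream = "clickstream/"
  }
}

module "domain_transfer" {
  for_each = local.domains
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "kv-data-platform-prod"
  description = "Nightly sync: ${each.key} S3 -> lakehouse landing"
  source_type = "aws_s3"

  aws_s3_bucket = "acme-export-prod"
  aws_role_arn  = "arn:aws:iam::210987654321:role/gcp-sts-reader"
  source_path   = each.value

  sink_bucket = "kv-lakehouse-landing-prod"
  sink_path   = "${each.key}/"

  # All three instances share one sink bucket, so the binding is managed once
  # outside the module rather than three times inside it.
  grant_sts_iam = false

  delete_unique_in_sink = true
  schedule_start_date   = { year = 2026, month = 6, day = 10 }
  start_time_of_day     = { hours = 18, minutes = 30 } # 00:00 IST

  notification_pubsub_topic = google_pubsub_topic.transfer_events.id
}
```

With grant_sts_iam = false, the sink binding has to be created separately, using the same role the module would otherwise grant. Otherwise destroying one instance would remove the binding the other two rely on.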
Best practices
- Never use static AWS keys when you can federate. Prefer aws_role_arn (an AWS IAM role that trusts Google through web identity federation) over aws_access_key_id / aws_secret_access_key; if you must use keys, mark them sensitive (the module does) and source them from Secret Manager, never from .tfvars in git.
- Grant the STS-managed SA least privilege, scoped to buckets. The module binds roles/storage.objectViewer on a GCS source and roles/storage.objectUser on the sink at the bucket level; do not grant these at the project level. For cross-project sinks, use the exported sts_service_account output to add the binding in the other project.
- Treat delete_unique_in_sink and delete_from_source_after_transfer as loaded guns. They are mutually exclusive (the module validates this) and irreversible per run; enable mirror/move semantics only when the sink is genuinely disposable and rebuildable, and create the job with status = "DISABLED" first so the object conditions can be reviewed before any run is allowed.
- Cut cost and run time with object conditions, not bigger schedules. include_prefixes plus min/max_time_elapsed_since_last_modification mean STS lists and copies far fewer objects; incremental syncs with overwrite_when = "DIFFERENT" skip unchanged data, so a daily job over a large bucket transfers (and bills egress for) only the delta.
- Schedule in UTC and document the local offset. STS schedule blocks are UTC-only with split year/month/day/time objects, so bake the IST/PST conversion into a comment (as the example does) so on-call engineers reading hours = 18, minutes = 30 know it fires at midnight India time.
- Wire notification_pubsub_topic for observability and chaining. Emitting TRANSFER_OPERATION_SUCCESS / TRANSFER_OPERATION_FAILED to Pub/Sub gives you alerting on failed runs and a clean trigger to start downstream load jobs only after a successful transfer, which is far more reliable than guessing completion from a fixed delay.
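For the case where static AWS keys are unavoidable, the Secret Manager advice above could look like the following sketch. The secret names, project ID, bucket names, and start date are hypothetical, and the secrets are assumed to already exist with a current version.

```hcl
# Sketch: read AWS credentials from Secret Manager instead of .tfvars.
data "google_secret_manager_secret_version" "aws_access_key_id" {
  project = "my-project"            # hypothetical
  secret  = "sts-aws-access-key-id" # hypothetical secret name
}

data "google_secret_manager_secret_version" "aws_secret_access_key" {
  project = "my-project"                # hypothetical
  secret  = "sts-aws-secret-access-key" # hypothetical secret name
}

module "s3_ingest_with_keys" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "my-project"
  description = "S3 ingest using static keys from Secret Manager"
  source_type = "aws_s3"

  aws_s3_bucket         = "my-partner-bucket" # hypothetical
  aws_access_key_id     = data.google_secret_manager_secret_version.aws_access_key_id.secret_data
  aws_secret_access_key = data.google_secret_manager_secret_version.aws_secret_access_key.secret_data

  sink_bucket         = "my-landing-bucket" # hypothetical
  schedule_start_date = { year = 2026, month = 1, day = 1 }
}
```

This keeps the keys out of version control, but values read through a data source are still recorded in Terraform state, so the state backend needs the same access controls as the secrets themselves.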