Terraform Module: GCP Storage Transfer Service — scheduled, IAM-correct cross-cloud and bucket-to-bucket data movement

Quick take — Build a reusable Terraform module for google_storage_transfer_job: schedule S3/Azure/GCS transfers, wire the agent service account, set object conditions, and emit job names for downstream orchestration. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "storage_transfer" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "..."  # Project owning the job; its STS service account is gran…
  description = "..."  # Job description; unique within the project (1–1024 char…
  source_type = "..."  # Source family: gcs, aws_s3, or azure_blob.
  sink_bucket = "..."  # Destination GCS bucket name.
}

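Then run terraform init && terraform apply. The four inputs above are the only ones without defaults, but two more groups are needed in practice: the source-specific inputs for your chosen source_type (for example gcs_source_bucket when source_type = "gcs"), and schedule_start_date, because schedule_enabled defaults to true (set schedule_enabled = false to skip the schedule). Everything else has a usable default; see Inputs below to override behaviour.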

What this module is

Google Cloud Storage Transfer Service (STS) is a managed, agentless service for moving large volumes of objects into Cloud Storage — from another GCS bucket, from Amazon S3, from Azure Blob Storage, from an arbitrary HTTP/HTTPS URL list, or from an S3-compatible endpoint (MinIO, Wasabi, on-prem object stores). It handles parallelism, retries, checksum verification, and incremental sync (only copying objects that changed) without you running any VMs or writing a single line of copy logic. You define a transfer job that pairs a source and a sink, optionally attaches a schedule and object conditions, and STS does the rest on Google’s infrastructure.

The single Terraform resource that drives all of this is google_storage_transfer_job. On its own it is deceptively fiddly: the schedule block uses split year/month/day objects rather than a cron string, and the source credential blocks differ per provider (AWS access keys vs. an AWS role ARN vs. an Azure SAS token). The part that trips everyone up is IAM: the Storage Transfer Service-managed service account (project-<NUMBER>@storage-transfer-service.iam.gserviceaccount.com) must be granted roles/storage.objectViewer on a GCS source and roles/storage.objectUser (or the broader roles/storage.objectAdmin) on the sink before the job will run. Wrapping all of this in a module means every team gets a job whose service-account IAM, schedule, object conditions, and notification topic are correct and consistent, instead of copy-pasting a brittle resource and rediscovering the same permission error each time.
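To make the fiddliness concrete, here is a minimal sketch of the bare resource for a daily GCS-to-GCS copy, written without the module. The project ID, bucket names, and dates are hypothetical placeholders, and the sketch assumes the IAM grants described above are already in place.

```hcl
# Bare resource, no module. Every name below is a hypothetical placeholder.
resource "google_storage_transfer_job" "raw_example" {
  project     = "example-project"
  description = "Daily copy: example-source-bucket -> example-sink-bucket"

  transfer_spec {
    gcs_data_source {
      bucket_name = "example-source-bucket"
    }

    gcs_data_sink {
      bucket_name = "example-sink-bucket"
    }
  }

  schedule {
    # No cron string: the start date is split into year/month/day (UTC).
    schedule_start_date {
      year  = 2026
      month = 1
      day   = 15
    }

    # The time of day is a separate block, also in UTC.
    start_time_of_day {
      hours   = 2
      minutes = 0
      seconds = 0
      nanos   = 0
    }

    # Interval between run starts, as a seconds-suffixed duration (24 hours here).
    repeat_interval = "86400s"
  }
}
```

Even this smallest form needs three nested blocks just to say "every day at 02:00 UTC", which is the shape the module hides behind a few object-typed variables.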

When to use it

Cloud migration / repatriation — one-shot or recurring backfill from Amazon S3 or Azure Blob into GCS as you move workloads onto Google Cloud.
Cross-region / cross-project replication — keep a GCS bucket in asia-south1 synced to a DR bucket in europe-west1, on a nightly schedule, deleting objects from the sink when they’re removed at source.
Periodic ingestion from a partner’s S3-compatible endpoint into your data-lake landing bucket, filtered to a prefix and a max file age.
Lifecycle-aware archival — move objects older than N days out of a hot Standard bucket into a Coldline/Archive bucket, using object conditions on min_time_elapsed_since_last_modification.

Reach for a different tool when you need sub-minute event-driven copies (use Pub/Sub + Cloud Functions / Eventarc), POSIX filesystem transfers from on-prem NFS (use the Transfer service for on-premises / agent pools with transfer_agent_pool, a different shape than this module targets), or a single ad-hoc gcloud storage cp.
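As an illustration of the cross-region replication case listed above, the following sketch calls the module for a nightly GCS-to-GCS mirror. The project ID, bucket names, and start date are hypothetical, and the module label dr_mirror is just an example name.

```hcl
module "dr_mirror" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "example-project" # hypothetical
  description = "Nightly DR mirror: asia-south1 primary -> europe-west1 replica"

  source_type       = "gcs"
  gcs_source_bucket = "example-primary-asia-south1" # hypothetical
  sink_bucket       = "example-dr-europe-west1"     # hypothetical

  # Mirror semantics: objects removed at the source are removed from the sink.
  delete_unique_in_sink = true

  schedule_start_date = { year = 2026, month = 1, day = 15 }
  start_time_of_day   = { hours = 19, minutes = 0 } # 19:00 UTC
  repeat_interval     = "86400s"                    # daily
}
```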

Module structure

terraform-module-gcp-storage-transfer/
├── versions.tf      # provider pin
├── main.tf          # IAM bindings + google_storage_transfer_job
├── variables.tf     # var-driven inputs + validations
├── outputs.tf       # job name/id + key attributes
└── README.md

versions.tf

terraform {
  # 1.9+ is needed because a validation block in variables.tf references a second variable.
  required_version = ">= 1.9.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

# Resolve the project so we can build the STS-managed service account address.
data "google_project" "this" {
  project_id = var.project_id
}

# The Storage Transfer Service control plane runs as this Google-managed SA.
# It must be able to read the source and write the sink.
locals {
  sts_service_account = "project-${data.google_project.this.number}@storage-transfer-service.iam.gserviceaccount.com"

  # Only one source family is active at a time; this keeps the dynamic blocks tidy.
  is_gcs_source   = var.source_type == "gcs"
  is_s3_source    = var.source_type == "aws_s3"
  is_azure_source = var.source_type == "azure_blob"
}

# Grant STS read on the source bucket when the source is GCS.
resource "google_storage_bucket_iam_member" "source_reader" {
  count  = local.is_gcs_source && var.grant_sts_iam ? 1 : 0
  bucket = var.gcs_source_bucket
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${local.sts_service_account}"
}

# Grant STS read+write on the destination bucket.
resource "google_storage_bucket_iam_member" "sink_writer" {
  count  = var.grant_sts_iam ? 1 : 0
  bucket = var.sink_bucket
  role   = "roles/storage.objectUser"
  member = "serviceAccount:${local.sts_service_account}"
}

resource "google_storage_transfer_job" "this" {
  project     = var.project_id
  description = var.description
  status      = var.status

  transfer_spec {
    # ---- SOURCE: pick exactly one family based on var.source_type ----

    dynamic "gcs_data_source" {
      for_each = local.is_gcs_source ? [1] : []
      content {
        bucket_name = var.gcs_source_bucket
        path        = var.source_path
      }
    }

    dynamic "aws_s3_data_source" {
      for_each = local.is_s3_source ? [1] : []
      content {
        bucket_name = var.aws_s3_bucket
        path        = var.source_path
        # Prefer a federated role over long-lived keys when set (a null value omits the argument).
        role_arn = var.aws_role_arn

        dynamic "aws_access_key" {
          for_each = var.aws_role_arn == null ? [1] : []
          content {
            access_key_id     = var.aws_access_key_id
            secret_access_key = var.aws_secret_access_key
          }
        }
      }
    }

    dynamic "azure_blob_storage_data_source" {
      for_each = local.is_azure_source ? [1] : []
      content {
        storage_account = var.azure_storage_account
        container       = var.azure_container
        path            = var.source_path
        azure_credentials {
          sas_token = var.azure_sas_token
        }
      }
    }

    # ---- SINK: always a GCS bucket ----
    gcs_data_sink {
      bucket_name = var.sink_bucket
      path        = var.sink_path
    }

    # ---- What to move / how aggressively ----
    object_conditions {
      include_prefixes                         = var.include_prefixes
      exclude_prefixes                         = var.exclude_prefixes
      max_time_elapsed_since_last_modification = var.max_time_elapsed_since_last_modification
      min_time_elapsed_since_last_modification = var.min_time_elapsed_since_last_modification
    }

    transfer_options {
      overwrite_objects_already_existing_in_sink = var.overwrite_existing
      delete_objects_unique_in_sink              = var.delete_unique_in_sink
      delete_objects_from_source_after_transfer  = var.delete_from_source_after_transfer
      overwrite_when                             = var.overwrite_when
    }
  }

  # ---- SCHEDULE: when omitted, the job runs only when triggered via transferJobs.run ----
  dynamic "schedule" {
    for_each = var.schedule_enabled ? [1] : []
    content {
      schedule_start_date {
        year  = var.schedule_start_date.year
        month = var.schedule_start_date.month
        day   = var.schedule_start_date.day
      }

      dynamic "schedule_end_date" {
        for_each = var.schedule_end_date != null ? [1] : []
        content {
          year  = var.schedule_end_date.year
          month = var.schedule_end_date.month
          day   = var.schedule_end_date.day
        }
      }

      dynamic "start_time_of_day" {
        for_each = var.start_time_of_day != null ? [1] : []
        content {
          hours   = var.start_time_of_day.hours
          minutes = var.start_time_of_day.minutes
          seconds = 0
          nanos   = 0
        }
      }

      repeat_interval = var.repeat_interval
    }
  }

  # ---- Pub/Sub notifications on run completion (optional) ----
  dynamic "notification_config" {
    for_each = var.notification_pubsub_topic != null ? [1] : []
    content {
      pubsub_topic   = var.notification_pubsub_topic
      event_types    = var.notification_event_types
      payload_format = "JSON"
    }
  }

  depends_on = [
    google_storage_bucket_iam_member.sink_writer,
    google_storage_bucket_iam_member.source_reader,
  ]
}
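Google's Storage Transfer Service permission guidance lists bucket-level legacy roles alongside the object roles, because the service also reads bucket metadata (storage.buckets.get). If a run fails with that permission error, the sketch below shows extra bindings that could sit next to the two above. The resource names source_bucket_reader and sink_bucket_writer are hypothetical, and whether your setup needs these roles should be checked against the current STS IAM documentation.

```hcl
# Assumption: bucket-metadata access is missing from the object-level roles above.
resource "google_storage_bucket_iam_member" "source_bucket_reader" {
  count  = local.is_gcs_source && var.grant_sts_iam ? 1 : 0
  bucket = var.gcs_source_bucket
  role   = "roles/storage.legacyBucketReader"
  member = "serviceAccount:${local.sts_service_account}"
}

resource "google_storage_bucket_iam_member" "sink_bucket_writer" {
  count  = var.grant_sts_iam ? 1 : 0
  bucket = var.sink_bucket
  role   = "roles/storage.legacyBucketWriter"
  member = "serviceAccount:${local.sts_service_account}"
}
```

If you adopt these, add both resources to the job's depends_on list so the job is not created before the grants exist. Note also that delete_objects_from_source_after_transfer on a GCS source needs delete permission on the source bucket, which roles/storage.objectViewer does not include.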

variables.tf

variable "project_id" {
  type        = string
  description = "Project that owns the transfer job and whose STS service account is granted IAM."
}

variable "description" {
  type        = string
  description = "Human-readable description of the transfer job (shown in the console). Must be unique within the project."

  validation {
    condition     = length(var.description) > 0 && length(var.description) <= 1024
    error_message = "description must be 1-1024 characters."
  }
}

variable "status" {
  type        = string
  default     = "ENABLED"
  description = "Job status: ENABLED, DISABLED, or DELETED."

  validation {
    condition     = contains(["ENABLED", "DISABLED", "DELETED"], var.status)
    error_message = "status must be one of ENABLED, DISABLED, DELETED."
  }
}

variable "source_type" {
  type        = string
  description = "Source family: gcs, aws_s3, or azure_blob."

  validation {
    condition     = contains(["gcs", "aws_s3", "azure_blob"], var.source_type)
    error_message = "source_type must be one of gcs, aws_s3, azure_blob."
  }
}

variable "grant_sts_iam" {
  type        = bool
  default     = true
  description = "If true, grant the STS-managed service account objectViewer on a GCS source and objectUser on the sink."
}

# ---- Source: GCS ----
variable "gcs_source_bucket" {
  type        = string
  default     = null
  description = "Source bucket name when source_type = gcs."
}

# ---- Source: AWS S3 ----
variable "aws_s3_bucket" {
  type        = string
  default     = null
  description = "Source S3 bucket name when source_type = aws_s3."
}

variable "aws_role_arn" {
  type        = string
  default     = null
  description = "AWS IAM role ARN for federated access. Preferred over static keys; when set, access keys are ignored."
}

variable "aws_access_key_id" {
  type        = string
  default     = null
  sensitive   = true
  description = "AWS access key ID (used only if aws_role_arn is null)."
}

variable "aws_secret_access_key" {
  type        = string
  default     = null
  sensitive   = true
  description = "AWS secret access key (used only if aws_role_arn is null)."
}

# ---- Source: Azure Blob ----
variable "azure_storage_account" {
  type        = string
  default     = null
  description = "Azure storage account name when source_type = azure_blob."
}

variable "azure_container" {
  type        = string
  default     = null
  description = "Azure blob container name when source_type = azure_blob."
}

variable "azure_sas_token" {
  type        = string
  default     = null
  sensitive   = true
  description = "Azure SAS token granting read+list on the container."
}

# ---- Paths ----
variable "source_path" {
  type        = string
  default     = null
  description = "Optional object prefix at the source (e.g. 'incoming/'). Must end with '/' if set."

  validation {
    condition     = var.source_path == null || endswith(var.source_path, "/")
    error_message = "source_path must end with a trailing slash."
  }
}

variable "sink_bucket" {
  type        = string
  description = "Destination GCS bucket name."
}

variable "sink_path" {
  type        = string
  default     = null
  description = "Optional object prefix at the sink. Must end with '/' if set."

  validation {
    condition     = var.sink_path == null || endswith(var.sink_path, "/")
    error_message = "sink_path must end with a trailing slash."
  }
}

# ---- Object conditions ----
variable "include_prefixes" {
  type        = list(string)
  default     = []
  description = "Only transfer objects whose name begins with one of these prefixes."
}

variable "exclude_prefixes" {
  type        = list(string)
  default     = []
  description = "Skip objects whose name begins with one of these prefixes."
}

variable "min_time_elapsed_since_last_modification" {
  type        = string
  default     = null
  description = "Only transfer objects last modified at least this long ago, as a duration string e.g. '2592000s' (30 days)."

  validation {
    condition     = var.min_time_elapsed_since_last_modification == null || endswith(var.min_time_elapsed_since_last_modification, "s")
    error_message = "Duration must be expressed in seconds with a trailing 's', e.g. '2592000s'."
  }
}

variable "max_time_elapsed_since_last_modification" {
  type        = string
  default     = null
  description = "Only transfer objects last modified at most this long ago (seconds with trailing 's')."

  validation {
    condition     = var.max_time_elapsed_since_last_modification == null || endswith(var.max_time_elapsed_since_last_modification, "s")
    error_message = "Duration must be expressed in seconds with a trailing 's', e.g. '86400s'."
  }
}

# ---- Transfer options ----
variable "overwrite_existing" {
  type        = bool
  default     = false
  description = "Legacy flag: overwrite every same-named object in the sink. STS consults it only when overwrite_when is unset, and this module always sets overwrite_when."
}

variable "overwrite_when" {
  type        = string
  default     = "DIFFERENT"
  description = "When to overwrite sink objects: ALWAYS, DIFFERENT, or NEVER."

  validation {
    condition     = contains(["ALWAYS", "DIFFERENT", "NEVER"], var.overwrite_when)
    error_message = "overwrite_when must be ALWAYS, DIFFERENT, or NEVER."
  }
}

variable "delete_unique_in_sink" {
  type        = bool
  default     = false
  description = "Delete objects in the sink that are not present at the source (true sync/mirror). Mutually exclusive with delete_from_source_after_transfer."
}

variable "delete_from_source_after_transfer" {
  type        = bool
  default     = false
  description = "Delete objects from the source once transferred (move semantics). Mutually exclusive with delete_unique_in_sink."

  validation {
    condition     = !(var.delete_from_source_after_transfer && var.delete_unique_in_sink)
    error_message = "delete_from_source_after_transfer and delete_unique_in_sink cannot both be true."
  }
}

# ---- Schedule ----
variable "schedule_enabled" {
  type        = bool
  default     = true
  description = "If false, no schedule block is emitted and the job runs only when triggered explicitly (for example via the transferJobs.run API)."
}

variable "schedule_start_date" {
  type = object({
    year  = number
    month = number
    day   = number
  })
  default     = null
  description = "First date the job is eligible to run (UTC). Required when schedule_enabled = true."
}

variable "schedule_end_date" {
  type = object({
    year  = number
    month = number
    day   = number
  })
  default     = null
  description = "Last date the job runs. Omit for an indefinitely recurring schedule."
}

variable "start_time_of_day" {
  type = object({
    hours   = number
    minutes = number
  })
  default     = null
  description = "UTC time of day each run starts. Omit to run as soon as eligible."
}

variable "repeat_interval" {
  type        = string
  default     = "86400s"
  description = "Time between the starts of recurring runs as a duration string, e.g. '86400s' for daily. STS does not accept intervals shorter than one hour ('3600s')."
}

# ---- Notifications ----
variable "notification_pubsub_topic" {
  type        = string
  default     = null
  description = "Full Pub/Sub topic resource name (projects/<p>/topics/<t>) to publish run events to."
}

variable "notification_event_types" {
  type        = list(string)
  default     = ["TRANSFER_OPERATION_SUCCESS", "TRANSFER_OPERATION_FAILED"]
  description = "Which run events trigger a Pub/Sub notification."
}
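Several inputs are described as required only in combination (schedule_start_date when schedule_enabled is true, the source bucket for the chosen source_type), but nothing above enforces that. One way to fail fast at plan time is a set of preconditions on a placeholder resource. This is a sketch, not part of the published module: the resource name input_checks is hypothetical, and terraform_data needs Terraform 1.4 or later.

```hcl
# Hypothetical guard resource: it creates nothing, it only carries cross-variable checks.
resource "terraform_data" "input_checks" {
  lifecycle {
    precondition {
      condition     = !var.schedule_enabled || var.schedule_start_date != null
      error_message = "schedule_start_date is required when schedule_enabled is true."
    }

    precondition {
      condition     = var.source_type != "gcs" || var.gcs_source_bucket != null
      error_message = "gcs_source_bucket is required when source_type is gcs."
    }

    precondition {
      condition     = var.source_type != "aws_s3" || var.aws_s3_bucket != null
      error_message = "aws_s3_bucket is required when source_type is aws_s3."
    }

    precondition {
      condition     = var.source_type != "azure_blob" || (var.azure_storage_account != null && var.azure_container != null)
      error_message = "azure_storage_account and azure_container are required when source_type is azure_blob."
    }
  }
}
```

Without such a check, a missing schedule_start_date surfaces as a null-attribute error inside the schedule block rather than as a readable message.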

outputs.tf

output "name" {
  description = "Server-assigned unique name of the transfer job, e.g. transferJobs/1234567890. Use this to trigger runs or query operations."
  value       = google_storage_transfer_job.this.name
}

output "description" {
  description = "Description of the transfer job."
  value       = google_storage_transfer_job.this.description
}

output "status" {
  description = "Current status of the job (ENABLED / DISABLED / DELETED)."
  value       = google_storage_transfer_job.this.status
}

output "creation_time" {
  description = "Timestamp the transfer job was created."
  value       = google_storage_transfer_job.this.creation_time
}

output "sts_service_account" {
  description = "The Storage Transfer Service-managed service account that was (optionally) granted bucket IAM. Useful for granting cross-project access manually."
  value       = local.sts_service_account
}

output "sink_bucket" {
  description = "Destination bucket the job writes to."
  value       = var.sink_bucket
}
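The sts_service_account output is mainly useful to whoever owns a bucket in another project. A minimal sketch of what that team might apply in their own configuration follows; the variable name and bucket name are hypothetical, and the value is assumed to be handed over from this module's output (for example through remote state or a pipeline variable).

```hcl
# Applied in the configuration that owns the other project's bucket.
variable "sts_service_account" {
  type        = string
  description = "Value of the transfer module's sts_service_account output."
}

resource "google_storage_bucket_iam_member" "partner_source_reader" {
  bucket = "example-partner-source-bucket" # hypothetical
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${var.sts_service_account}"
}
```

This grant should exist before the transfer job is created or first run, since the job needs access to the bucket from the start.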

How to use it

A nightly S3 → GCS sync into a data-lake landing bucket, using a federated AWS role (no static keys), restricted to a prefix, mirroring deletions, and notifying a Pub/Sub topic on completion:

module "storage_transfer_service" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "kv-data-platform-prod"
  description = "Nightly sync: partner-acme S3 incoming -> lakehouse landing"

  source_type   = "aws_s3"
  aws_s3_bucket = "acme-export-prod"
  aws_role_arn  = "arn:aws:iam::210987654321:role/gcp-sts-reader"
  source_path   = "incoming/"

  sink_bucket = "kv-lakehouse-landing-prod"
  sink_path   = "acme/"

  include_prefixes      = ["incoming/orders/", "incoming/inventory/"]
  overwrite_existing    = true
  overwrite_when        = "DIFFERENT"
  delete_unique_in_sink = true

  schedule_enabled    = true
  schedule_start_date = { year = 2026, month = 6, day = 10 }
  start_time_of_day   = { hours = 18, minutes = 30 } # 00:00 IST
  repeat_interval     = "86400s"

  notification_pubsub_topic = google_pubsub_topic.transfer_events.id
}

resource "google_pubsub_topic" "transfer_events" {
  name    = "sts-transfer-events"
  project = "kv-data-platform-prod"
}

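# Downstream: a Cloud Scheduler job that triggers an ad-hoc run uses the job name output.
# google_service_account.scheduler_sa is assumed to be defined elsewhere, with permission to run transfer jobs.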
resource "google_cloud_scheduler_job" "manual_kick" {
  name     = "kick-acme-transfer"
  project  = "kv-data-platform-prod"
  region   = "asia-south1"
  schedule = "0 12 * * 1" # Monday noon catch-up run

  http_target {
    http_method = "POST"
    uri         = "https://storagetransfer.googleapis.com/v1/${module.storage_transfer_service.name}:run"
    body        = base64encode(jsonencode({ projectId = "kv-data-platform-prod" }))
    oauth_token {
      service_account_email = google_service_account.scheduler_sa.email
    }
  }
}

Because module.storage_transfer_service.name is the server-assigned transferJobs/<id>, downstream resources (Cloud Scheduler, a Cloud Function, monitoring alerts) can reference the exact job without you hard-coding the generated ID.
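The example references google_service_account.scheduler_sa without defining it. One possible definition is sketched below. The account_id is a hypothetical name, and the choice of roles/storagetransfer.user as the role that allows running jobs is an assumption to verify against the Storage Transfer Service IAM reference.

```hcl
resource "google_service_account" "scheduler_sa" {
  project      = "kv-data-platform-prod"
  account_id   = "sts-kick-scheduler" # hypothetical
  display_name = "Cloud Scheduler caller for ad-hoc STS runs"
}

# Assumption: this role covers the transferJobs.run call made by Cloud Scheduler.
resource "google_project_iam_member" "scheduler_can_run_transfers" {
  project = "kv-data-platform-prod"
  role    = "roles/storagetransfer.user"
  member  = "serviceAccount:${google_service_account.scheduler_sa.email}"
}
```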

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...gcs state bucket + prefix per path...
  }
}

2. Module config — live/prod/storage_transfer/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"
}

inputs = {
  project_id  = "..."
  description = "..."
  source_type = "..."
  sink_bucket = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/storage_transfer && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
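To show what "only the inputs that differ" can look like, here is a sketch of a second environment's config. The path live/dev/storage_transfer/terragrunt.hcl, the project ID, and the bucket names are hypothetical.

```hcl
# live/dev/storage_transfer/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"
}

inputs = {
  project_id  = "example-dev-project" # hypothetical
  description = "Dev sync: sample source -> dev landing"

  source_type       = "gcs"
  gcs_source_bucket = "example-dev-source"  # hypothetical
  sink_bucket       = "example-dev-landing" # hypothetical

  # Dev differs from prod only here: weekly instead of daily.
  schedule_start_date = { year = 2026, month = 1, day = 15 }
  repeat_interval     = "604800s" # 7 days
}
```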

Inputs

Name	Type	Default	Required	Description
project_id	string	—	Yes	Project owning the job; its STS service account is granted IAM.
description	string	—	Yes	Job description; unique within the project (1–1024 chars).
status	string	"ENABLED"	No	ENABLED, DISABLED, or DELETED.
source_type	string	—	Yes	Source family: gcs, aws_s3, or azure_blob.
grant_sts_iam	bool	true	No	Grant STS SA objectViewer on a GCS source and objectUser on the sink.
gcs_source_bucket	string	null	If gcs	Source bucket name for a GCS source.
aws_s3_bucket	string	null	If aws_s3	Source S3 bucket name.
aws_role_arn	string	null	No	AWS role ARN for federated access; preferred over static keys.
aws_access_key_id	string	null	If aws_s3 + no role	AWS access key ID (sensitive).
aws_secret_access_key	string	null	If aws_s3 + no role	AWS secret access key (sensitive).
azure_storage_account	string	null	If azure_blob	Azure storage account name.
azure_container	string	null	If azure_blob	Azure blob container name.
azure_sas_token	string	null	If azure_blob	SAS token with read+list (sensitive).
source_path	string	null	No	Object prefix at source; must end with '/'.
sink_bucket	string	—	Yes	Destination GCS bucket name.
sink_path	string	null	No	Object prefix at sink; must end with '/'.
include_prefixes	list(string)	[]	No	Only transfer objects matching these prefixes.
exclude_prefixes	list(string)	[]	No	Skip objects matching these prefixes.
min_time_elapsed_since_last_modification	string	null	No	Min object age, seconds with trailing 's'.
max_time_elapsed_since_last_modification	string	null	No	Max object age, seconds with trailing 's'.
overwrite_existing	bool	false	No	Overwrite objects already in the sink.
overwrite_when	string	"DIFFERENT"	No	ALWAYS, DIFFERENT, or NEVER.
delete_unique_in_sink	bool	false	No	Delete sink objects absent at source (mirror).
delete_from_source_after_transfer	bool	false	No	Delete source objects after transfer (move).
schedule_enabled	bool	true	No	If false, no schedule block is emitted; the job runs only when triggered.
schedule_start_date	object(year,month,day)	null	If scheduled	First eligible run date (UTC).
schedule_end_date	object(year,month,day)	null	No	Last run date; omit for indefinite.
start_time_of_day	object(hours,minutes)	null	No	UTC time each run starts.
repeat_interval	string	"86400s"	No	Gap between runs, seconds with trailing 's'.
notification_pubsub_topic	string	null	No	Pub/Sub topic resource name for run events.
notification_event_types	list(string)	["TRANSFER_OPERATION_SUCCESS", "TRANSFER_OPERATION_FAILED"]	No	Events that trigger notifications.

Outputs

Name	Description
name	Server-assigned job name (transferJobs/<id>); use to trigger runs and query operations.
description	The job’s description.
status	Current job status (ENABLED / DISABLED / DELETED).
creation_time	Timestamp the job was created.
sts_service_account	The STS-managed service account address granted bucket IAM.
sink_bucket	Destination bucket the job writes to.

Enterprise scenario

A retail analytics group is mid-migration from AWS to Google Cloud and must keep their BigQuery lakehouse fed while order processing still writes to Amazon S3. They instantiate this module once per source domain (orders, inventory, clickstream), each pointed at a different S3 prefix with a federated aws_role_arn so no long-lived AWS keys ever land in Terraform state, and each scheduled for 00:00 IST with delete_unique_in_sink = true so the GCS landing zone is a faithful mirror. The topic passed as notification_pubsub_topic feeds a Cloud Function that kicks off the downstream BigQuery load only after TRANSFER_OPERATION_SUCCESS, giving them an event-driven pipeline with zero copy infrastructure to operate.
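The "once per source domain" pattern can be sketched with a module-level for_each. Everything named here is hypothetical: the domain map, project ID, buckets, role ARN, and topic. The sketch also assumes you prefer to grant the sink IAM once outside the module, so that three module instances do not each manage the same bucket binding.

```hcl
locals {
  # Hypothetical map of domain name -> S3 prefix.
  transfer_domains = {
    orders      = "incoming/orders/"
    inventory   = "incoming/inventory/"
    clickstream = "incoming/clickstream/"
  }
}

data "google_project" "platform" {
  project_id = "example-project"
}

# Granted once for all domains instead of once per module instance.
resource "google_storage_bucket_iam_member" "landing_writer" {
  bucket = "example-landing-bucket"
  role   = "roles/storage.objectUser"
  member = "serviceAccount:project-${data.google_project.platform.number}@storage-transfer-service.iam.gserviceaccount.com"
}

module "domain_transfer" {
  for_each = local.transfer_domains
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-storage-transfer?ref=v1.0.0"

  project_id  = "example-project"
  description = "Nightly sync: S3 ${each.key} -> landing"

  source_type   = "aws_s3"
  aws_s3_bucket = "example-export-bucket"
  aws_role_arn  = "arn:aws:iam::123456789012:role/example-sts-reader"
  source_path   = each.value

  sink_bucket   = "example-landing-bucket"
  sink_path     = "${each.key}/"
  grant_sts_iam = false # handled by landing_writer above

  delete_unique_in_sink = true

  schedule_start_date = { year = 2026, month = 1, day = 15 }
  start_time_of_day   = { hours = 18, minutes = 30 } # 00:00 IST

  notification_pubsub_topic = "projects/example-project/topics/example-transfer-events"

  depends_on = [google_storage_bucket_iam_member.landing_writer]
}
```

Each instance is then addressable as module.domain_transfer["orders"].name and so on, which keeps downstream references stable when a domain is added or removed.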

Best practices

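Never use static AWS keys when you can federate. Prefer aws_role_arn (an AWS IAM role with AWS STS web-identity trust) over aws_access_key_id/aws_secret_access_key. If you must use keys, source them from Secret Manager, never from .tfvars in git, and remember that marking them sensitive (the module does) only redacts plan and apply output: the values are still written to Terraform state, so the state backend needs the same protection as the secret itself.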
Grant the STS-managed SA least privilege, scoped to buckets. The module binds roles/storage.objectViewer on the source and roles/storage.objectUser on the sink at the bucket level — do not grant these at the project level. For cross-project sinks, use the exported sts_service_account output to add the binding in the other project.
Treat delete_unique_in_sink and delete_from_source_after_transfer as loaded guns. They are mutually exclusive (the module validates this) and their deletions cannot be undone by a later run. Enable mirror or move semantics only when the side being deleted from is genuinely disposable and rebuildable, and create the job with status = "DISABLED" first so nothing runs until the rendered object conditions have been reviewed.
Cut cost and run time with object conditions, not bigger schedules. include_prefixes plus min/max_time_elapsed_since_last_modification mean STS lists and copies far fewer objects; incremental syncs with overwrite_when = "DIFFERENT" skip unchanged data, so a daily job over a large bucket transfers (and bills egress for) only the delta.
Schedule in UTC and document the local offset. STS schedule blocks are UTC-only with split year/month/day/time objects — bake the IST/PST conversion into a comment (as the example does) so on-call engineers reading hours = 18 know it fires at midnight India time.
Wire notification_pubsub_topic for observability and chaining. Emitting TRANSFER_OPERATION_SUCCESS/FAILED to Pub/Sub gives you alerting on failed runs and a clean trigger to start downstream load jobs only after a successful transfer — far more reliable than guessing completion from a fixed delay.
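For the notification practice above to work, the STS-managed service account also needs permission to publish to the topic. A minimal sketch follows, assuming the module is instantiated as module.storage_transfer (as in the Quickstart) and using a hypothetical project ID and topic name.

```hcl
resource "google_pubsub_topic_iam_member" "sts_publisher" {
  project = "example-project"         # hypothetical
  topic   = "example-transfer-events" # hypothetical topic name
  role    = "roles/pubsub.publisher"
  member  = "serviceAccount:${module.storage_transfer.sts_service_account}"
}
```

Scoping the grant to the single topic, rather than granting the role at project level, follows the same least-privilege approach the module takes for buckets.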