Terraform Module: GCP Cloud Run Jobs — serverless batch that runs to completion

Quick take — Provision a GCP Cloud Run Job with Terraform: container task config, parallelism and retries, Secret Manager env, VPC egress, a dedicated runtime SA, and an optional Cloud Scheduler trigger. A reusable hashicorp/google ~> 5.0 module. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "cloud_run_jobs" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"

  project_id = "..."  # GCP project ID for the job.
  location   = "..."  # Region, e.g. `asia-south1`.
  name       = "..."  # Job name; 1-63 chars, lowercase, regex-validated.
  image      = "..."  # Container image; pin a tag/digest, not `:latest`.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

A Cloud Run Job runs a container to completion rather than serving traffic. Where a Cloud Run service keeps a revision warm to answer HTTP requests, a job spins up one or more tasks, each runs your container’s entrypoint, and the execution finishes when every task exits 0. There is no URL, no port, no min-instance scaling — it is GCP’s serverless answer to “run this batch workload, optionally in parallel, retry on failure, then stop billing.” Typical fits: nightly ETL, database migrations, report generation, queue drains, ML batch inference, and scheduled housekeeping.

The provider models this with google_cloud_run_v2_job, the GA resource built on the Cloud Run Admin API v2 (the provider ships no v1-style job resource, so this is the one to use). The resource itself carries a fair amount of production-critical surface area: the task count and parallelism, max_retries and timeout per task, CPU/memory limits, environment variables sourced from Secret Manager, Direct VPC egress to reach private databases, and the runtime service account the tasks authenticate as. Getting any of these wrong (a missing retry budget, a secret passed as a plaintext env var, the default Compute Engine SA with roles/editor) is how batch jobs become incidents.

Wrapping it in a module means a team requests “a job that runs migrate:latest once, 30-minute timeout, with the DATABASE_URL secret and egress into the prod VPC” in ~15 lines, instead of hand-assembling a deeply-nested template { template { containers { ... } } } block, a google_cloud_run_v2_job_iam_member for who can run it, and a Cloud Scheduler trigger — and getting the nesting or the IAM subtly wrong every time.
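
As a rough illustration of the request described above, a call to this module might look like the sketch below. The project, image path, secret name, and network paths are hypothetical placeholders, and the tag is pinned rather than using :latest, as the image input recommends.

```hcl
# Sketch only: a one-off migration job using this module's inputs.
# All project, image, secret, and network names below are hypothetical.
module "migrate_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"

  project_id = "my-project"
  location   = "us-central1"
  name       = "db-migrate"
  image      = "us-central1-docker.pkg.dev/my-project/app/migrate:1.0.0"

  task_count   = 1       # run the migration exactly once
  task_timeout = "1800s" # 30-minute ceiling per attempt
  max_retries  = 0       # assumption: a failed migration should not be retried blindly

  # version is omitted here, so the module's default ("latest") applies.
  secret_env_vars = {
    DATABASE_URL = { secret = "database-url" }
  }

  vpc_network    = "projects/my-project/global/networks/prod-vpc"
  vpc_subnetwork = "projects/my-project/regions/us-central1/subnetworks/prod-run"
}
```

Everything not listed falls back to the module defaults, which is where the "~15 lines" figure comes from.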

When to use it

Reach for something else when the workload serves HTTP or needs to stay warm (that’s a Cloud Run service), when you need rich DAG orchestration with dependencies between steps (Cloud Composer / Workflows), or when a sub-second, event-driven function is a better fit (Cloud Run functions). Jobs have a hard per-task timeout ceiling and no inbound networking.

Module structure

terraform-module-gcp-cloud-run-jobs/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
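  # >= 1.9 because the scheduler_service_account validation references
  # var.schedule, and validation conditions can only reference other
  # variables from Terraform 1.9 onwards.
  required_version = ">= 1.9.0"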

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Plain (non-secret) environment variables -> name/value pairs.
  plain_env = [
    for k, v in var.env_vars : {
      name  = k
      value = v
    }
  ]

  # Secret-backed environment variables -> name + value_source ref.
  # Map key is the env var name; value points at a Secret Manager secret + version.
  secret_env = [
    for name, ref in var.secret_env_vars : {
      name    = name
      secret  = ref.secret
      version = ref.version
    }
  ]

  # Only build a VPC access block when a subnetwork (Direct VPC egress) is set.
  vpc_access_enabled = var.vpc_subnetwork != null
}

resource "google_cloud_run_v2_job" "this" {
  name         = var.name
  location     = var.location
  project      = var.project_id
  labels       = var.labels
  launch_stage = var.launch_stage

  # When true, Terraform refuses to destroy the job (including the destroy
  # half of a replacement), so the apply fails instead of deleting it.
  # Callers opt into this guardrail via var.deletion_protection.
  deletion_protection = var.deletion_protection

  template {
    # How many tasks the execution fans out into, and how many run at once.
    task_count  = var.task_count
    parallelism = var.parallelism

    template {
      # Per-task retry budget and wall-clock timeout.
      max_retries           = var.max_retries
      timeout               = var.task_timeout
      service_account       = var.service_account_email
      execution_environment = var.execution_environment

      containers {
        image   = var.image
        command = var.command
        args    = var.args

        resources {
          limits = {
            cpu    = var.cpu
            memory = var.memory
          }
        }

        # Plaintext environment variables.
        dynamic "env" {
          for_each = local.plain_env
          content {
            name  = env.value.name
            value = env.value.value
          }
        }

        # Secret Manager-backed environment variables (never stored in plan/state
        # as values — only the secret reference is).
        dynamic "env" {
          for_each = local.secret_env
          content {
            name = env.value.name
            value_source {
              secret_key_ref {
                secret  = env.value.secret
                version = env.value.version
              }
            }
          }
        }
      }

      # Direct VPC egress — reach private databases / internal services without
      # a Serverless VPC Access connector. Only emitted when a subnetwork is set.
      dynamic "vpc_access" {
        for_each = local.vpc_access_enabled ? [1] : []
        content {
          egress = var.vpc_egress
          network_interfaces {
            network    = var.vpc_network
            subnetwork = var.vpc_subnetwork
            tags       = var.vpc_network_tags
          }
        }
      }
    }
  }

  lifecycle {
    # The launcher (Cloud Scheduler / CI) annotates the latest execution; ignore
    # client-name churn so unrelated applies don't show a perpetual diff.
    ignore_changes = [
      template[0].labels,
      client,
      client_version,
    ]
  }
}

# Who is allowed to RUN (execute) this job. Granting roles/run.invoker on the
# job resource lets these members trigger executions (e.g. a scheduler SA, a CI SA).
resource "google_cloud_run_v2_job_iam_member" "invokers" {
  for_each = toset(var.invoker_members)

  name     = google_cloud_run_v2_job.this.name
  location = google_cloud_run_v2_job.this.location
  project  = var.project_id
  role     = "roles/run.invoker"
  member   = each.value
}

# Optional Cloud Scheduler trigger that calls the Cloud Run Admin API to start
# an execution on a cron schedule, authenticated as scheduler_service_account.
resource "google_cloud_scheduler_job" "trigger" {
  count = var.schedule == null ? 0 : 1

  name      = "${var.name}-trigger"
  project   = var.project_id
  region    = var.location
  schedule  = var.schedule
  time_zone = var.schedule_time_zone

  http_target {
    http_method = "POST"
    uri         = "https://run.googleapis.com/v2/${google_cloud_run_v2_job.this.id}:run"

    oauth_token {
      service_account_email = var.scheduler_service_account
      scope                 = "https://www.googleapis.com/auth/cloud-platform"
    }
  }

  retry_config {
    retry_count = var.schedule_retry_count
  }
}

variables.tf

variable "project_id" {
  description = "GCP project ID in which to create the Cloud Run Job."
  type        = string
}

variable "location" {
  description = "Region for the job, e.g. asia-south1. Must support Cloud Run Jobs."
  type        = string
}

variable "name" {
  description = "Job name. 1-63 chars, lowercase letters/digits/hyphens, must start with a letter and not end with a hyphen."
  type        = string

  validation {
    condition     = can(regex("^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$", var.name))
    error_message = "name must be 1-63 chars, lowercase, start with a letter, contain only letters/digits/hyphens, and not end with a hyphen."
  }
}

variable "image" {
  description = "Fully-qualified container image, e.g. asia-south1-docker.pkg.dev/proj/repo/migrate:1.4.2. Pin a tag/digest, not :latest, for reproducible runs."
  type        = string
}

variable "command" {
  description = "Entrypoint override (the container ENTRYPOINT). Empty list uses the image's ENTRYPOINT."
  type        = list(string)
  default     = []
}

variable "args" {
  description = "Arguments passed to the entrypoint (the container CMD)."
  type        = list(string)
  default     = []
}

variable "service_account_email" {
  description = "Runtime service account the tasks authenticate as. Strongly recommended — omitting it falls back to the default Compute Engine SA, which is over-privileged."
  type        = string
  default     = null
}

variable "task_count" {
  description = "Number of tasks the execution runs. Each task runs the container once; the execution succeeds only when all tasks complete."
  type        = number
  default     = 1

  validation {
    condition     = var.task_count >= 1 && var.task_count <= 10000
    error_message = "task_count must be between 1 and 10000."
  }
}

variable "parallelism" {
  description = "Maximum tasks to run concurrently. 0 lets Cloud Run choose the max. Cap this to protect downstream systems (DB connections, rate limits)."
  type        = number
  default     = 0

  validation {
    condition     = var.parallelism >= 0
    error_message = "parallelism must be >= 0 (0 means 'as parallel as possible')."
  }
}

variable "max_retries" {
  description = "Times a FAILED task is retried before the task is marked failed. 0 means no retries (a single failed task fails the execution)."
  type        = number
  default     = 3

  validation {
    condition     = var.max_retries >= 0 && var.max_retries <= 10
    error_message = "max_retries must be between 0 and 10."
  }
}

variable "task_timeout" {
  description = "Max wall-clock duration of a single task attempt, as a duration string (e.g. \"600s\", \"3600s\"). Up to 24h (\"86400s\")."
  type        = string
  default     = "600s"

  validation {
    condition     = can(regex("^[0-9]+s$", var.task_timeout))
    error_message = "task_timeout must be a duration in seconds, e.g. \"600s\"."
  }
}

variable "cpu" {
  description = "CPU limit per task, e.g. \"1\", \"2\", \"1000m\". Jobs are billed for CPU/memory only while a task runs."
  type        = string
  default     = "1"
}

variable "memory" {
  description = "Memory limit per task, e.g. \"512Mi\", \"2Gi\". Must be consistent with the CPU value per Cloud Run's CPU/memory ratios."
  type        = string
  default     = "512Mi"
}

variable "execution_environment" {
  description = "Sandbox: EXECUTION_ENVIRONMENT_GEN2 (faster CPU, full Linux compat, needed for some mounts) or EXECUTION_ENVIRONMENT_GEN1."
  type        = string
  default     = "EXECUTION_ENVIRONMENT_GEN2"

  validation {
    condition     = contains(["EXECUTION_ENVIRONMENT_GEN1", "EXECUTION_ENVIRONMENT_GEN2"], var.execution_environment)
    error_message = "execution_environment must be EXECUTION_ENVIRONMENT_GEN1 or EXECUTION_ENVIRONMENT_GEN2."
  }
}

variable "env_vars" {
  description = "Plaintext environment variables (map of name => value). Do NOT put secrets here — use secret_env_vars."
  type        = map(string)
  default     = {}
}

variable "secret_env_vars" {
  description = "Secret Manager-backed env vars: map of ENV_NAME => { secret = \"projects/p/secrets/db-url\" or \"db-url\", version = \"latest\" }. Only the reference lives in state, never the value."
  type = map(object({
    secret  = string
    version = optional(string, "latest")
  }))
  default = {}
}

variable "vpc_network" {
  description = "VPC network self-link/name for Direct VPC egress. Required when vpc_subnetwork is set."
  type        = string
  default     = null
}

variable "vpc_subnetwork" {
  description = "Subnetwork for Direct VPC egress. Set this to give tasks a private IP and reach internal resources. Null disables VPC egress."
  type        = string
  default     = null
}

variable "vpc_egress" {
  description = "Egress setting when VPC access is enabled: ALL_TRAFFIC (all egress via the VPC) or PRIVATE_RANGES_ONLY (only RFC1918 via VPC, internet direct)."
  type        = string
  default     = "PRIVATE_RANGES_ONLY"

  validation {
    condition     = contains(["ALL_TRAFFIC", "PRIVATE_RANGES_ONLY"], var.vpc_egress)
    error_message = "vpc_egress must be ALL_TRAFFIC or PRIVATE_RANGES_ONLY."
  }
}

variable "vpc_network_tags" {
  description = "Network tags applied to the task's network interface (for firewall targeting) when VPC egress is enabled."
  type        = list(string)
  default     = []
}

variable "labels" {
  description = "Labels applied to the job for cost attribution and inventory (e.g. team, env, cost-center)."
  type        = map(string)
  default     = {}
}

variable "launch_stage" {
  description = "API launch stage for preview features, e.g. GA, BETA. Leave GA unless using a preview field."
  type        = string
  default     = "GA"
}

variable "deletion_protection" {
  description = "When true, Terraform refuses to delete the job (guards against destroying a production batch pipeline)."
  type        = bool
  default     = false
}

variable "invoker_members" {
  description = "IAM members granted roles/run.invoker on the job (allowed to start executions). e.g. [\"serviceAccount:scheduler@proj.iam.gserviceaccount.com\"]."
  type        = list(string)
  default     = []
}

variable "schedule" {
  description = "Optional cron schedule (e.g. \"0 2 * * *\"). When set, a Cloud Scheduler job is created to trigger an execution. Null = no schedule."
  type        = string
  default     = null
}

variable "schedule_time_zone" {
  description = "IANA time zone for the cron schedule, e.g. Asia/Kolkata."
  type        = string
  default     = "Etc/UTC"
}

variable "scheduler_service_account" {
  description = "SA email Cloud Scheduler authenticates as when triggering the job. Must hold roles/run.invoker on the job (add it via invoker_members). Required if schedule is set."
  type        = string
  default     = null

  validation {
    condition     = var.schedule == null || var.scheduler_service_account != null
    error_message = "scheduler_service_account is required when schedule is set."
  }
}

variable "schedule_retry_count" {
  description = "Cloud Scheduler retry attempts if the trigger HTTP call fails (does not affect in-job task retries)."
  type        = number
  default     = 1
}
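
The task_timeout validation above checks only the format, so a value such as "999999s" passes even though the description caps the timeout at 24h. A stricter variant could look like the sketch below; the 1 to 86400 bounds are taken from the description, not from anything the module currently enforces.

```hcl
# Sketch: a stricter variant of the task_timeout variable.
# The bounds are an assumption drawn from the variable's description.
variable "task_timeout" {
  description = "Max wall-clock duration of a single task attempt, e.g. \"600s\". Up to 24h (\"86400s\")."
  type        = string
  default     = "600s"

  validation {
    # Format check, as in the original variable.
    condition     = can(regex("^[0-9]+s$", var.task_timeout))
    error_message = "task_timeout must be a duration in seconds, e.g. \"600s\"."
  }

  validation {
    # Range check. try() yields false instead of raising an evaluation
    # error when the value is not numeric.
    condition = try(
      tonumber(trimsuffix(var.task_timeout, "s")) >= 1 &&
      tonumber(trimsuffix(var.task_timeout, "s")) <= 86400,
      false
    )
    error_message = "task_timeout must be between \"1s\" and \"86400s\" (24h)."
  }
}
```

Keeping the two checks in separate validation blocks means a malformed value and an out-of-range value each get their own error message.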

outputs.tf

output "id" {
  description = "Fully-qualified job resource id: projects/{project}/locations/{location}/jobs/{name}. Used to build the :run execution URI."
  value       = google_cloud_run_v2_job.this.id
}

output "name" {
  description = "The job name."
  value       = google_cloud_run_v2_job.this.name
}

output "location" {
  description = "Region the job is deployed in."
  value       = google_cloud_run_v2_job.this.location
}

output "run_command" {
  description = "Ready-to-run gcloud command to execute the job manually (e.g. in a deploy pipeline)."
  value       = "gcloud run jobs execute ${google_cloud_run_v2_job.this.name} --region ${google_cloud_run_v2_job.this.location} --project ${var.project_id}"
}

output "run_uri" {
  description = "Cloud Run Admin API endpoint to start an execution via authenticated POST (used by Cloud Scheduler / Workflows)."
  value       = "https://run.googleapis.com/v2/${google_cloud_run_v2_job.this.id}:run"
}

output "scheduler_job_name" {
  description = "Name of the Cloud Scheduler trigger job, or null when no schedule was set."
  value       = var.schedule == null ? null : google_cloud_scheduler_job.trigger[0].name
}

output "latest_created_execution" {
  description = "Reference to the most recent execution created by the job, as reported by the API."
  value       = google_cloud_run_v2_job.this.latest_created_execution
}
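
These outputs are only visible to the calling configuration. If a pipeline needs them, one option is to re-export them from the root module, as in this sketch; the output names job_run_command and job_run_uri are hypothetical, and module.cloud_run_jobs assumes the module label used in the Quickstart.

```hcl
# Sketch: re-export module outputs from the root module so CI can read
# them with `terraform output -raw <name>`.
output "job_run_command" {
  description = "gcloud command that starts an execution of the job."
  value       = module.cloud_run_jobs.run_command
}

output "job_run_uri" {
  description = "Admin API :run endpoint for authenticated POST triggers."
  value       = module.cloud_run_jobs.run_uri
}
```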

How to use it

A nightly orders-aggregation batch: 8 tasks running at most 4 at a time, the DB URL pulled from Secret Manager, Direct VPC egress into the prod network, running as a dedicated SA, and triggered at 02:00 IST by Cloud Scheduler.

module "cloud_run_jobs" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"

  project_id = "kloudvin-prod"
  location   = "asia-south1"
  name       = "orders-nightly-rollup"
  image      = "asia-south1-docker.pkg.dev/kloudvin-prod/batch/orders-rollup:1.7.0"

  # Fan out across shards; cap concurrency to protect the database.
  task_count   = 8
  parallelism  = 4
  max_retries  = 2
  task_timeout = "1800s" # 30 min per task
  cpu          = "2"
  memory       = "2Gi"

  # Dedicated least-privilege runtime identity (see the GCP Service Account module).
  service_account_email = google_service_account.rollup.email

  env_vars = {
    LOG_LEVEL = "info"
    SHARD_KEY = "order_date"
  }

  # DB credentials never leave Secret Manager.
  secret_env_vars = {
    DATABASE_URL = { secret = "orders-db-url", version = "latest" }
  }

  # Reach the private Cloud SQL instance over the VPC.
  vpc_network    = "projects/kloudvin-prod/global/networks/prod-vpc"
  vpc_subnetwork = "projects/kloudvin-prod/regions/asia-south1/subnetworks/prod-run"
  vpc_egress     = "PRIVATE_RANGES_ONLY"

  # Let the scheduler SA start executions, and trigger nightly at 02:00 IST.
  invoker_members           = ["serviceAccount:${google_service_account.scheduler.email}"]
  schedule                  = "0 2 * * *"
  schedule_time_zone        = "Asia/Kolkata"
  scheduler_service_account = google_service_account.scheduler.email

  labels = {
    team        = "data-platform"
    environment = "prod"
    cost-center = "cc-4412"
  }

  deletion_protection = true
}

# Downstream: run the job once as a post-deploy migration step in a pipeline,
# using the run_command output so the region/name never drift out of sync.
resource "null_resource" "run_migration" {
  triggers = {
    image = "asia-south1-docker.pkg.dev/kloudvin-prod/batch/orders-rollup:1.7.0"
  }

  provisioner "local-exec" {
    command = module.cloud_run_jobs.run_command
  }
}
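
The example above refers to google_service_account.rollup and google_service_account.scheduler without defining them. A minimal sketch of those two identities, plus the Secret Manager grant the runtime account needs to resolve DATABASE_URL, might look like this; the account_id values are hypothetical.

```hcl
# Sketch: the two identities the example above refers to.
# account_id values are hypothetical.
resource "google_service_account" "rollup" {
  project      = "kloudvin-prod"
  account_id   = "orders-rollup-runtime"
  display_name = "Runtime identity for the orders rollup job"
}

resource "google_service_account" "scheduler" {
  project      = "kloudvin-prod"
  account_id   = "orders-rollup-scheduler"
  display_name = "Cloud Scheduler identity that triggers the rollup job"
}

# The runtime SA must be able to read the secret behind DATABASE_URL;
# without this grant the secret reference cannot be resolved.
resource "google_secret_manager_secret_iam_member" "rollup_db_url" {
  project   = "kloudvin-prod"
  secret_id = "orders-db-url"
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.rollup.email}"
}
```

One caveat, offered as an assumption rather than a tested result: the email attribute is only known once the account exists, and the module iterates invoker_members with for_each. A first apply that creates the accounts and the job together may therefore fail at plan time, in which case creating the accounts first (for example with -target) should unblock it.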

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config (live/terragrunt.hcl), inherited by every module:

remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config (live/prod/cloud_run_jobs/terragrunt.hcl):

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"
}

inputs = {
  project_id = "..."
  location   = "..."
  name       = "..."
  image      = "..."
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root.
cd live/prod/cloud_run_jobs && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply          # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
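
The root config shown earlier only generates the backend. If the provider is also meant to live there, a generate block along these lines is one way to do it; the project and region values are placeholders copied from the Quickstart, not settings the module requires.

```hcl
# Sketch for live/terragrunt.hcl: write a provider.tf into every module
# directory so the provider block is defined once.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

In practice the project and region would usually come from per-environment locals rather than literals, so that dev, stage, and prod each get their own provider settings.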

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | - | Yes | GCP project ID for the job. |
| location | string | - | Yes | Region, e.g. asia-south1. |
| name | string | - | Yes | Job name; 1-63 chars, lowercase, regex-validated. |
| image | string | - | Yes | Container image; pin a tag/digest, not :latest. |
| command | list(string) | [] | No | Entrypoint override (container ENTRYPOINT). |
| args | list(string) | [] | No | Arguments to the entrypoint (container CMD). |
| service_account_email | string | null | No | Runtime SA the tasks run as; set it to avoid the default Compute SA. |
| task_count | number | 1 | No | Number of tasks per execution (1-10000). |
| parallelism | number | 0 | No | Max concurrent tasks; 0 = Cloud Run picks the max. |
| max_retries | number | 3 | No | Per-task retry budget (0-10). |
| task_timeout | string | "600s" | No | Per-task wall-clock timeout in seconds (up to 86400s). |
| cpu | string | "1" | No | CPU limit per task (e.g. "2", "1000m"). |
| memory | string | "512Mi" | No | Memory limit per task (e.g. "2Gi"). |
| execution_environment | string | "EXECUTION_ENVIRONMENT_GEN2" | No | Sandbox generation (GEN1/GEN2). |
| env_vars | map(string) | {} | No | Plaintext env vars; never put secrets here. |
| secret_env_vars | map(object) | {} | No | Secret Manager-backed env vars ({secret, version}); only the ref lives in state. |
| vpc_network | string | null | No | VPC network for Direct VPC egress; required with vpc_subnetwork. |
| vpc_subnetwork | string | null | No | Subnetwork for Direct VPC egress; null disables VPC access. |
| vpc_egress | string | "PRIVATE_RANGES_ONLY" | No | ALL_TRAFFIC or PRIVATE_RANGES_ONLY when VPC access is on. |
| vpc_network_tags | list(string) | [] | No | Network tags on the task interface for firewall targeting. |
| labels | map(string) | {} | No | Labels for cost attribution / inventory. |
| launch_stage | string | "GA" | No | API launch stage for preview fields. |
| deletion_protection | bool | false | No | Refuse to delete the job when true. |
| invoker_members | list(string) | [] | No | Members granted roles/run.invoker (may start executions). |
| schedule | string | null | No | Cron schedule; when set, creates a Cloud Scheduler trigger. |
| schedule_time_zone | string | "Etc/UTC" | No | IANA time zone for the cron schedule. |
| scheduler_service_account | string | null | No | SA Scheduler authenticates as; required when schedule is set. |
| schedule_retry_count | number | 1 | No | Scheduler retry attempts on a failed trigger call. |

Outputs

| Name | Description |
|---|---|
| id | Fully-qualified job id projects/{project}/locations/{location}/jobs/{name}. |
| name | The job name. |
| location | Region the job is deployed in. |
| run_command | Ready-to-run gcloud run jobs execute ... command for pipelines. |
| run_uri | Cloud Run Admin API :run endpoint for authenticated POST triggers. |
| scheduler_job_name | Name of the Cloud Scheduler trigger, or null when unscheduled. |
| latest_created_execution | Reference to the most recent execution the job created. |

Enterprise scenario

A fintech platform runs end-of-day reconciliation across 20 ledger shards in its kloudvin-prod project. The data team calls this module once with task_count = 20, parallelism = 5 (to stay under the Cloud SQL connection limit), max_retries = 2, the DATABASE_URL and SFTP_KEY secrets from Secret Manager, and Direct VPC egress to the private ledger database. A schedule = "30 18 * * *" in Asia/Kolkata fires the run after market close via a scheduler SA that holds only roles/run.invoker on this one job, while the job’s own runtime SA holds only roles/cloudsql.client and roles/storage.objectCreator — so a bug in the batch code can never reach beyond reconciliation, and finance can attribute the exact compute cost via the cost-center label.
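
Expressed as configuration, the scenario might look roughly like the sketch below. The numeric settings, schedule, roles, and project come from the text; the region, image, secret names, network paths, account IDs, and label value are hypothetical.

```hcl
# Sketch of the scenario above; see the lead-in for which values are assumed.
resource "google_service_account" "reconcile" {
  project    = "kloudvin-prod"
  account_id = "ledger-reconcile-runtime"
}

resource "google_service_account" "reconcile_scheduler" {
  project    = "kloudvin-prod"
  account_id = "ledger-reconcile-sched"
}

module "ledger_reconciliation" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"

  project_id = "kloudvin-prod"
  location   = "asia-south1" # assumption: the scenario does not name a region
  name       = "ledger-reconciliation"
  image      = "asia-south1-docker.pkg.dev/kloudvin-prod/batch/reconcile:1.0.0"

  task_count  = 20 # one task per ledger shard
  parallelism = 5  # stay under the Cloud SQL connection limit
  max_retries = 2

  service_account_email = google_service_account.reconcile.email

  secret_env_vars = {
    DATABASE_URL = { secret = "ledger-db-url" }
    SFTP_KEY     = { secret = "sftp-key" }
  }

  vpc_network    = "projects/kloudvin-prod/global/networks/prod-vpc"
  vpc_subnetwork = "projects/kloudvin-prod/regions/asia-south1/subnetworks/prod-run"

  schedule                  = "30 18 * * *"
  schedule_time_zone        = "Asia/Kolkata"
  scheduler_service_account = google_service_account.reconcile_scheduler.email
  invoker_members           = ["serviceAccount:${google_service_account.reconcile_scheduler.email}"]

  labels = { cost-center = "cc-0000" }
}

# The runtime SA holds only the two roles named in the scenario.
resource "google_project_iam_member" "reconcile_roles" {
  for_each = toset(["roles/cloudsql.client", "roles/storage.objectCreator"])

  project = "kloudvin-prod"
  role    = each.value
  member  = "serviceAccount:${google_service_account.reconcile.email}"
}
```

The runtime account would also need read access to the two secrets, which is omitted here to keep the sketch focused on the roles the scenario names.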

Best practices
