Quick take — Provision a GCP Cloud Run Job with Terraform: container task config, parallelism and retries, Secret Manager env, VPC egress, a dedicated runtime SA, and an optional Cloud Scheduler trigger. A reusable hashicorp/google ~> 5.0 module. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "cloud_run_jobs" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"
project_id = "..." # GCP project ID for the job.
location = "..." # Region, e.g. `asia-south1`.
name = "..." # Job name; 1-63 chars, lowercase, regex-validated.
image = "..." # Container image; pin a tag/digest, not `:latest`.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
A Cloud Run Job runs a container to completion rather than serving traffic. Where a Cloud Run service keeps a revision warm to answer HTTP requests, a job spins up one or more tasks, each runs your container’s entrypoint, and the execution succeeds only when every task exits 0 (a task that exhausts its retries fails the execution). There is no URL, no port, and no min-instance scaling. It is GCP’s serverless answer to “run this batch workload, optionally in parallel, retry on failure, then stop billing.” Typical fits: nightly ETL, database migrations, report generation, queue drains, ML batch inference, and scheduled housekeeping.
The provider models this with google_cloud_run_v2_job, the GA resource backed by the Cloud Run Admin API v2 (the "v2" refers to the API version, not to the gen2 execution environment). The resource itself carries a fair amount of production-critical surface area: the task count and parallelism, max_retries and timeout per task, CPU/memory limits, environment variables sourced from Secret Manager, Direct VPC egress to reach private databases, and the runtime service account the tasks authenticate as. Getting any of these wrong (a missing retry budget, a secret passed as a plaintext env var, a fallback to the default Compute Engine SA, which has historically been granted roles/editor automatically) is how batch jobs become incidents.
Wrapping it in a module means a team requests “a job that runs migrate:1.4.2 once, 30-minute timeout, with the DATABASE_URL secret and egress into the prod VPC” in ~15 lines. The alternative is hand-assembling a deeply-nested template { template { containers { ... } } } block, a google_cloud_run_v2_job_iam_member for who can run it, and a Cloud Scheduler trigger, and getting the nesting or the IAM subtly wrong every time.
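As a rough sketch of what that request might look like as a module call: the input names are the module's own, while the project, job name, image path, secret ID, service account, and network paths below are hypothetical placeholders. It pins an image tag rather than :latest, following the module's own guidance, and sets max_retries = 0 on the assumption that the migration is not idempotent.

```hcl
module "migrate_job" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"

  # Hypothetical project, region, name, and image.
  project_id = "my-project"
  location   = "us-central1"
  name       = "db-migrate"
  image      = "us-central1-docker.pkg.dev/my-project/app/migrate:1.4.2"

  # Run once, 30-minute timeout. No retries: assumes the migration
  # is NOT idempotent, so a failed attempt should not be re-run blindly.
  task_count   = 1
  max_retries  = 0
  task_timeout = "1800s"

  # Hypothetical dedicated runtime identity.
  service_account_email = "migrate-runtime@my-project.iam.gserviceaccount.com"

  # Only the secret reference is configured here; version defaults to "latest".
  secret_env_vars = {
    DATABASE_URL = { secret = "database-url" }
  }

  # Direct VPC egress into the (hypothetical) prod network.
  vpc_network    = "projects/my-project/global/networks/prod-vpc"
  vpc_subnetwork = "projects/my-project/regions/us-central1/subnetworks/prod-run"
}
```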
When to use it
- A scheduled batch workload (nightly aggregation, billing run, data export) needs to execute on cron and then stop, without paying for an always-on service or managing a GKE CronJob.
- A database migration or one-shot task must run as part of a deploy pipeline (`gcloud run jobs execute` or a Terraform-managed trigger) before the new service revision goes live.
- An embarrassingly-parallel job (process 500 shards, render 1,000 thumbnails) should fan out across many tasks via `task_count` + `parallelism` and only succeed when all tasks complete.
- You want every job in the estate created the same way: a dedicated least-privilege runtime SA, secrets from Secret Manager (never plaintext), a sane retry/timeout budget, and consistent labels for cost attribution.
Reach for something else when the workload serves HTTP or needs to stay warm (that’s a Cloud Run service), when you need rich DAG orchestration with dependencies between steps (Cloud Composer / Workflows), or when a sub-second, event-driven function is a better fit (Cloud Run functions). Jobs have a hard per-task timeout ceiling and no inbound networking.
Module structure
terraform-module-gcp-cloud-run-jobs/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf
versions.tf
terraform {
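# >= 1.9 because the scheduler_service_account validation in variables.tf
# references another variable (var.schedule), which older releases reject.
required_version = ">= 1.9.0"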
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
locals {
# Plain (non-secret) environment variables -> name/value pairs.
plain_env = [
for k, v in var.env_vars : {
name = k
value = v
}
]
# Secret-backed environment variables -> name + value_source ref.
# Map key is the env var name; value points at a Secret Manager secret + version.
secret_env = [
for name, ref in var.secret_env_vars : {
name = name
secret = ref.secret
version = ref.version
}
]
# Only build a VPC access block when a subnetwork (Direct VPC egress) is set.
vpc_access_enabled = var.vpc_subnetwork != null
}
resource "google_cloud_run_v2_job" "this" {
name = var.name
location = var.location
project = var.project_id
labels = var.labels
launch_stage = var.launch_stage
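# When true, the provider refuses to destroy the job (including the destroy
# half of a replace), so an accidental delete fails the apply instead.
# Assumption to verify: this argument appears to be available only from the
# 6.x line of the google provider; check it against the pin in versions.tf.
deletion_protection = var.deletion_protection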
template {
# How many tasks the execution fans out into, and how many run at once.
task_count = var.task_count
parallelism = var.parallelism
template {
# Per-task retry budget and wall-clock timeout.
max_retries = var.max_retries
timeout = var.task_timeout
service_account = var.service_account_email
execution_environment = var.execution_environment
containers {
image = var.image
command = var.command
args = var.args
resources {
limits = {
cpu = var.cpu
memory = var.memory
}
}
# Plaintext environment variables.
dynamic "env" {
for_each = local.plain_env
content {
name = env.value.name
value = env.value.value
}
}
# Secret Manager-backed environment variables (never stored in plan/state
# as values — only the secret reference is).
dynamic "env" {
for_each = local.secret_env
content {
name = env.value.name
value_source {
secret_key_ref {
secret = env.value.secret
version = env.value.version
}
}
}
}
}
# Direct VPC egress — reach private databases / internal services without
# a Serverless VPC Access connector. Only emitted when a subnetwork is set.
dynamic "vpc_access" {
for_each = local.vpc_access_enabled ? [1] : []
content {
egress = var.vpc_egress
network_interfaces {
network = var.vpc_network
subnetwork = var.vpc_subnetwork
tags = var.vpc_network_tags
}
}
}
}
}
lifecycle {
# The launcher (Cloud Scheduler / CI) annotates the latest execution; ignore
# client-name churn so unrelated applies don't show a perpetual diff.
ignore_changes = [
template[0].labels,
client,
client_version,
]
}
}
# Who is allowed to RUN (execute) this job. Granting roles/run.invoker on the
# job resource lets these members trigger executions (e.g. a scheduler SA, a CI SA).
resource "google_cloud_run_v2_job_iam_member" "invokers" {
for_each = toset(var.invoker_members)
name = google_cloud_run_v2_job.this.name
location = google_cloud_run_v2_job.this.location
project = var.project_id
role = "roles/run.invoker"
member = each.value
}
# Optional Cloud Scheduler trigger that calls the Cloud Run Admin API to start
# an execution on a cron schedule, authenticated as scheduler_service_account.
resource "google_cloud_scheduler_job" "trigger" {
count = var.schedule == null ? 0 : 1
name = "${var.name}-trigger"
project = var.project_id
region = var.location
schedule = var.schedule
time_zone = var.schedule_time_zone
http_target {
http_method = "POST"
uri = "https://run.googleapis.com/v2/${google_cloud_run_v2_job.this.id}:run"
oauth_token {
service_account_email = var.scheduler_service_account
scope = "https://www.googleapis.com/auth/cloud-platform"
}
}
retry_config {
retry_count = var.schedule_retry_count
}
}
variables.tf
variable "project_id" {
description = "GCP project ID in which to create the Cloud Run Job."
type = string
}
variable "location" {
description = "Region for the job, e.g. asia-south1. Must support Cloud Run Jobs."
type = string
}
variable "name" {
description = "Job name. 1-63 chars, lowercase letters/digits/hyphens, must start with a letter and not end with a hyphen."
type = string
validation {
condition = can(regex("^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$", var.name))
error_message = "name must be 1-63 chars, lowercase, start with a letter, contain only letters/digits/hyphens, and not end with a hyphen."
}
}
variable "image" {
description = "Fully-qualified container image, e.g. asia-south1-docker.pkg.dev/proj/repo/migrate:1.4.2. Pin a tag/digest, not :latest, for reproducible runs."
type = string
}
variable "command" {
description = "Entrypoint override (the container ENTRYPOINT). Empty list uses the image's ENTRYPOINT."
type = list(string)
default = []
}
variable "args" {
description = "Arguments passed to the entrypoint (the container CMD)."
type = list(string)
default = []
}
variable "service_account_email" {
description = "Runtime service account the tasks authenticate as. Strongly recommended — omitting it falls back to the default Compute Engine SA, which is over-privileged."
type = string
default = null
}
variable "task_count" {
description = "Number of tasks the execution runs. Each task runs the container once; the execution succeeds only when all tasks complete."
type = number
default = 1
validation {
condition = var.task_count >= 1 && var.task_count <= 10000
error_message = "task_count must be between 1 and 10000."
}
}
variable "parallelism" {
description = "Maximum tasks to run concurrently. 0 lets Cloud Run choose the max. Cap this to protect downstream systems (DB connections, rate limits)."
type = number
default = 0
validation {
condition = var.parallelism >= 0
error_message = "parallelism must be >= 0 (0 means 'as parallel as possible')."
}
}
variable "max_retries" {
description = "Times a FAILED task is retried before the task is marked failed. 0 means no retries (a single failed task fails the execution)."
type = number
default = 3
validation {
condition = var.max_retries >= 0 && var.max_retries <= 10
error_message = "max_retries must be between 0 and 10."
}
}
variable "task_timeout" {
description = "Max wall-clock duration of a single task attempt, as a duration string (e.g. \"600s\", \"3600s\"). Up to 24h (\"86400s\")."
type = string
default = "600s"
validation {
condition = can(regex("^[0-9]+s$", var.task_timeout))
error_message = "task_timeout must be a duration in seconds, e.g. \"600s\"."
}
}
variable "cpu" {
description = "CPU limit per task, e.g. \"1\", \"2\", \"1000m\". Jobs are billed for CPU/memory only while a task runs."
type = string
default = "1"
}
variable "memory" {
description = "Memory limit per task, e.g. \"512Mi\", \"2Gi\". Must be consistent with the CPU value per Cloud Run's CPU/memory ratios."
type = string
default = "512Mi"
}
variable "execution_environment" {
description = "Sandbox: EXECUTION_ENVIRONMENT_GEN2 (faster CPU, full Linux compat, needed for some mounts) or EXECUTION_ENVIRONMENT_GEN1."
type = string
default = "EXECUTION_ENVIRONMENT_GEN2"
validation {
condition = contains(["EXECUTION_ENVIRONMENT_GEN1", "EXECUTION_ENVIRONMENT_GEN2"], var.execution_environment)
error_message = "execution_environment must be EXECUTION_ENVIRONMENT_GEN1 or EXECUTION_ENVIRONMENT_GEN2."
}
}
variable "env_vars" {
description = "Plaintext environment variables (map of name => value). Do NOT put secrets here — use secret_env_vars."
type = map(string)
default = {}
}
variable "secret_env_vars" {
description = "Secret Manager-backed env vars: map of ENV_NAME => { secret = \"projects/p/secrets/db-url\" or \"db-url\", version = \"latest\" }. Only the reference lives in state, never the value."
type = map(object({
secret = string
version = optional(string, "latest")
}))
default = {}
}
variable "vpc_network" {
description = "VPC network self-link/name for Direct VPC egress. Required when vpc_subnetwork is set."
type = string
default = null
}
variable "vpc_subnetwork" {
description = "Subnetwork for Direct VPC egress. Set this to give tasks a private IP and reach internal resources. Null disables VPC egress."
type = string
default = null
}
variable "vpc_egress" {
description = "Egress setting when VPC access is enabled: ALL_TRAFFIC (all egress via the VPC) or PRIVATE_RANGES_ONLY (only RFC1918 via VPC, internet direct)."
type = string
default = "PRIVATE_RANGES_ONLY"
validation {
condition = contains(["ALL_TRAFFIC", "PRIVATE_RANGES_ONLY"], var.vpc_egress)
error_message = "vpc_egress must be ALL_TRAFFIC or PRIVATE_RANGES_ONLY."
}
}
variable "vpc_network_tags" {
description = "Network tags applied to the task's network interface (for firewall targeting) when VPC egress is enabled."
type = list(string)
default = []
}
variable "labels" {
description = "Labels applied to the job for cost attribution and inventory (e.g. team, env, cost-center)."
type = map(string)
default = {}
}
variable "launch_stage" {
description = "API launch stage for preview features, e.g. GA, BETA. Leave GA unless using a preview field."
type = string
default = "GA"
}
variable "deletion_protection" {
description = "When true, Terraform refuses to delete the job (guards against destroying a production batch pipeline)."
type = bool
default = false
}
variable "invoker_members" {
description = "IAM members granted roles/run.invoker on the job (allowed to start executions). e.g. [\"serviceAccount:scheduler@proj.iam.gserviceaccount.com\"]."
type = list(string)
default = []
}
variable "schedule" {
description = "Optional cron schedule (e.g. \"0 2 * * *\"). When set, a Cloud Scheduler job is created to trigger an execution. Null = no schedule."
type = string
default = null
}
variable "schedule_time_zone" {
description = "IANA time zone for the cron schedule, e.g. Asia/Kolkata."
type = string
default = "Etc/UTC"
}
variable "scheduler_service_account" {
description = "SA email Cloud Scheduler authenticates as when triggering the job. Must hold roles/run.invoker on the job (add it via invoker_members). Required if schedule is set."
type = string
default = null
validation {
condition = var.schedule == null || var.scheduler_service_account != null
error_message = "scheduler_service_account is required when schedule is set."
}
}
variable "schedule_retry_count" {
description = "Cloud Scheduler retry attempts if the trigger HTTP call fails (does not affect in-job task retries)."
type = number
default = 1
}
outputs.tf
output "id" {
description = "Fully-qualified job resource id: projects/{project}/locations/{location}/jobs/{name}. Used to build the :run execution URI."
value = google_cloud_run_v2_job.this.id
}
output "name" {
description = "The job name."
value = google_cloud_run_v2_job.this.name
}
output "location" {
description = "Region the job is deployed in."
value = google_cloud_run_v2_job.this.location
}
output "run_command" {
description = "Ready-to-run gcloud command to execute the job manually (e.g. in a deploy pipeline)."
value = "gcloud run jobs execute ${google_cloud_run_v2_job.this.name} --region ${google_cloud_run_v2_job.this.location} --project ${var.project_id}"
}
output "run_uri" {
description = "Cloud Run Admin API endpoint to start an execution via authenticated POST (used by Cloud Scheduler / Workflows)."
value = "https://run.googleapis.com/v2/${google_cloud_run_v2_job.this.id}:run"
}
output "scheduler_job_name" {
description = "Name of the Cloud Scheduler trigger job, or null when no schedule was set."
value = var.schedule == null ? null : google_cloud_scheduler_job.trigger[0].name
}
output "latest_created_execution" {
description = "Reference to the most recent execution created by the job, as reported by the API."
value = google_cloud_run_v2_job.this.latest_created_execution
}
How to use it
A nightly orders-aggregation batch: 8 parallel tasks, the DB URL pulled from Secret Manager, Direct VPC egress into the prod network, running as a dedicated SA, and triggered at 02:00 IST by Cloud Scheduler.
module "cloud_run_jobs" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"
project_id = "kloudvin-prod"
location = "asia-south1"
name = "orders-nightly-rollup"
image = "asia-south1-docker.pkg.dev/kloudvin-prod/batch/orders-rollup:1.7.0"
# Fan out across shards; cap concurrency to protect the database.
task_count = 8
parallelism = 4
max_retries = 2
task_timeout = "1800s" # 30 min per task
cpu = "2"
memory = "2Gi"
# Dedicated least-privilege runtime identity (see the GCP Service Account module).
service_account_email = google_service_account.rollup.email
env_vars = {
LOG_LEVEL = "info"
SHARD_KEY = "order_date"
}
# DB credentials never leave Secret Manager.
secret_env_vars = {
DATABASE_URL = { secret = "orders-db-url", version = "latest" }
}
# Reach the private Cloud SQL instance over the VPC.
vpc_network = "projects/kloudvin-prod/global/networks/prod-vpc"
vpc_subnetwork = "projects/kloudvin-prod/regions/asia-south1/subnetworks/prod-run"
vpc_egress = "PRIVATE_RANGES_ONLY"
# Let the scheduler SA start executions, and trigger nightly at 02:00 IST.
invoker_members = ["serviceAccount:${google_service_account.scheduler.email}"]
schedule = "0 2 * * *"
schedule_time_zone = "Asia/Kolkata"
scheduler_service_account = google_service_account.scheduler.email
labels = {
team = "data-platform"
environment = "prod"
cost-center = "cc-4412"
}
deletion_protection = true
}
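# Downstream: run the job once as a post-deploy step in a pipeline, using the
# run_command output so the region/name never drift out of sync.
# Needs the hashicorp/null provider and an authenticated gcloud on the runner.
resource "null_resource" "run_migration" {
triggers = {
image = "asia-south1-docker.pkg.dev/kloudvin-prod/batch/orders-rollup:1.7.0"
}
provisioner "local-exec" {
# --wait makes gcloud block until the execution completes instead of
# returning as soon as it has started.
command = "${module.cloud_run_jobs.run_command} --wait"
}
}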
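The example above references google_service_account.rollup and google_service_account.scheduler without defining them. A minimal sketch of what they might look like follows; the account_id and display_name values are hypothetical. The Secret Manager grant is included because the runtime SA needs read access to any secret it receives as an environment variable, and any further roles (database, storage) depend on what the task actually does.

```hcl
# Runtime identity the tasks authenticate as (hypothetical account_id).
resource "google_service_account" "rollup" {
  project      = "kloudvin-prod"
  account_id   = "orders-rollup-runtime"
  display_name = "orders-nightly-rollup runtime"
}

# Identity Cloud Scheduler uses when calling the :run endpoint
# (hypothetical account_id). The module grants it roles/run.invoker
# on the job via invoker_members.
resource "google_service_account" "scheduler" {
  project      = "kloudvin-prod"
  account_id   = "orders-rollup-scheduler"
  display_name = "orders-nightly-rollup scheduler"
}

# Let the runtime SA read the secret referenced in secret_env_vars.
# Scoped to the single secret rather than the whole project.
resource "google_secret_manager_secret_iam_member" "rollup_db_url" {
  project   = "kloudvin-prod"
  secret_id = "orders-db-url"
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.rollup.email}"
}
```

Keeping the two identities separate mirrors the split the module already makes between who may start an execution and what the running task may touch.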
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/cloud_run_jobs/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"
}
inputs = {
project_id = "..."
location = "..."
name = "..."
image = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/cloud_run_jobs && terragrunt apply # this module
terragrunt run-all apply # run from live/prod: every module under it
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
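The root config shown above only generates the backend. To keep the provider in one place as well, Terragrunt's generate block can write a provider.tf into each module's working directory. A minimal sketch for live/terragrunt.hcl, reusing the placeholder project and region from the Quickstart (both are assumptions, not values the module requires):

```hcl
# live/terragrunt.hcl (alongside the remote_state block)
# Writes provider.tf into every module that includes this root config.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

With this in place, a child terragrunt.hcl needs no provider block of its own. Values that differ per environment would more likely come from inputs or shared locals than be hard-coded as they are here.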
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| `project_id` | `string` | (none) | Yes | GCP project ID for the job. |
| `location` | `string` | (none) | Yes | Region, e.g. `asia-south1`. |
| `name` | `string` | (none) | Yes | Job name; 1-63 chars, lowercase, regex-validated. |
| `image` | `string` | (none) | Yes | Container image; pin a tag/digest, not `:latest`. |
| `command` | `list(string)` | `[]` | No | Entrypoint override (container ENTRYPOINT). |
| `args` | `list(string)` | `[]` | No | Arguments to the entrypoint (container CMD). |
| `service_account_email` | `string` | `null` | No | Runtime SA the tasks run as; set it to avoid the default Compute SA. |
| `task_count` | `number` | `1` | No | Number of tasks per execution (1-10000). |
| `parallelism` | `number` | `0` | No | Max concurrent tasks; 0 = Cloud Run picks the max. |
| `max_retries` | `number` | `3` | No | Per-task retry budget (0-10). |
| `task_timeout` | `string` | `"600s"` | No | Per-task wall-clock timeout in seconds (up to 86400s). |
| `cpu` | `string` | `"1"` | No | CPU limit per task (e.g. `"2"`, `"1000m"`). |
| `memory` | `string` | `"512Mi"` | No | Memory limit per task (e.g. `"2Gi"`). |
| `execution_environment` | `string` | `"EXECUTION_ENVIRONMENT_GEN2"` | No | Sandbox generation (GEN1/GEN2). |
| `env_vars` | `map(string)` | `{}` | No | Plaintext env vars; never put secrets here. |
| `secret_env_vars` | `map(object)` | `{}` | No | Secret Manager-backed env vars (`{secret, version}`); only the ref lives in state. |
| `vpc_network` | `string` | `null` | No | VPC network for Direct VPC egress; required with `vpc_subnetwork`. |
| `vpc_subnetwork` | `string` | `null` | No | Subnetwork for Direct VPC egress; null disables VPC access. |
| `vpc_egress` | `string` | `"PRIVATE_RANGES_ONLY"` | No | `ALL_TRAFFIC` or `PRIVATE_RANGES_ONLY` when VPC access is on. |
| `vpc_network_tags` | `list(string)` | `[]` | No | Network tags on the task interface for firewall targeting. |
| `labels` | `map(string)` | `{}` | No | Labels for cost attribution / inventory. |
| `launch_stage` | `string` | `"GA"` | No | API launch stage for preview fields. |
| `deletion_protection` | `bool` | `false` | No | Refuse to delete the job when true. |
| `invoker_members` | `list(string)` | `[]` | No | Members granted `roles/run.invoker` (may start executions). |
| `schedule` | `string` | `null` | No | Cron schedule; when set, creates a Cloud Scheduler trigger. |
| `schedule_time_zone` | `string` | `"Etc/UTC"` | No | IANA time zone for the cron schedule. |
| `scheduler_service_account` | `string` | `null` | No | SA Scheduler authenticates as; required when `schedule` is set. |
| `schedule_retry_count` | `number` | `1` | No | Scheduler retry attempts on a failed trigger call. |
Outputs
| Name | Description |
|---|---|
| `id` | Fully-qualified job id `projects/{project}/locations/{location}/jobs/{name}`. |
| `name` | The job name. |
| `location` | Region the job is deployed in. |
| `run_command` | Ready-to-run `gcloud run jobs execute ...` command for pipelines. |
| `run_uri` | Cloud Run Admin API `:run` endpoint for authenticated POST triggers. |
| `scheduler_job_name` | Name of the Cloud Scheduler trigger, or null when unscheduled. |
| `latest_created_execution` | Reference to the most recent execution the job created. |
Enterprise scenario
A fintech platform runs end-of-day reconciliation across 20 ledger shards in its kloudvin-prod project. The data team calls this module once with task_count = 20, parallelism = 5 (to stay under the Cloud SQL connection limit), max_retries = 2, the DATABASE_URL and SFTP_KEY secrets from Secret Manager, and Direct VPC egress to the private ledger database. A schedule = "30 18 * * *" in Asia/Kolkata fires the run after market close via a scheduler SA that holds only roles/run.invoker on this one job. The job’s own runtime SA holds only roles/cloudsql.client, roles/storage.objectCreator, and roles/secretmanager.secretAccessor on the two secrets it reads (without that last grant the tasks cannot resolve their secret env vars). A bug in the batch code therefore has little reach beyond reconciliation, and finance can attribute the compute cost via the cost-center label.
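Expressed as a module call, the scenario might look roughly like the sketch below. The numbers, schedule, time zone, and env var names come from the description above; the region, job name, image path, secret IDs, service account emails, network paths, and label value are hypothetical stand-ins.

```hcl
module "eod_reconciliation" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-cloud-run-jobs?ref=v1.0.0"

  project_id = "kloudvin-prod"
  location   = "asia-south1" # assumption: the scenario does not name a region
  name       = "eod-reconciliation"
  image      = "asia-south1-docker.pkg.dev/kloudvin-prod/batch/reconcile:1.0.0" # hypothetical

  # One task per ledger shard; at most 5 at a time to respect the
  # Cloud SQL connection limit.
  task_count  = 20
  parallelism = 5
  max_retries = 2

  # Hypothetical runtime identity holding only the roles the task needs.
  service_account_email = "recon-runtime@kloudvin-prod.iam.gserviceaccount.com"

  # Hypothetical secret IDs; version defaults to "latest".
  secret_env_vars = {
    DATABASE_URL = { secret = "ledger-db-url" }
    SFTP_KEY     = { secret = "recon-sftp-key" }
  }

  # Direct VPC egress to the private ledger database (hypothetical paths).
  vpc_network    = "projects/kloudvin-prod/global/networks/prod-vpc"
  vpc_subnetwork = "projects/kloudvin-prod/regions/asia-south1/subnetworks/prod-run"

  # Fire after market close; the scheduler SA may only start this job.
  schedule                  = "30 18 * * *"
  schedule_time_zone        = "Asia/Kolkata"
  scheduler_service_account = "recon-scheduler@kloudvin-prod.iam.gserviceaccount.com"
  invoker_members           = ["serviceAccount:recon-scheduler@kloudvin-prod.iam.gserviceaccount.com"]

  labels = {
    cost-center = "cc-4412" # hypothetical value
  }
}
```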
Best practices
- Give every job a dedicated, least-privilege runtime SA. Always set `service_account_email`; never let a job fall back to the default Compute Engine SA (which has historically been granted `roles/editor` automatically). Grant the runtime SA only the roles the task actually uses (e.g. `roles/cloudsql.client`), and keep the invoker SA (who can press “run”) separate from the runtime SA (what the task can do).
- Never put secrets in `env_vars`. Use `secret_env_vars` so values come from Secret Manager at runtime; only the reference lands in plan/state. Use `version = "latest"` to pick up rotations automatically, or pin a fixed version for reproducible, auditable runs.
- Set a realistic retry and timeout budget, and make tasks idempotent. `max_retries` re-runs the whole failed task, so a non-idempotent migration can double-apply; design for at-least-once. Size `task_timeout` to the slowest shard, not the average, or long tasks get killed mid-flight.
- Cap `parallelism` to protect downstream systems. `parallelism = 0` fans out as wide as Cloud Run allows, which is fine for stateless work and dangerous against a database. Bound it to your connection-pool / rate-limit headroom even when `task_count` is large.
- Pin image digests for reproducibility. A `:latest` tag means re-running last night’s failed job may execute different code. Use an immutable tag or `@sha256:` digest so an execution is always the artifact you tested, and so `terraform plan` shows real image changes.
- Label for cost attribution and turn on `deletion_protection` in prod. Consistent `labels` (`team`, `environment`, `cost-center`) make per-job spend visible in billing exports, and `deletion_protection = true` stops a stray `terraform destroy` or refactor from deleting a revenue-critical batch pipeline.
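The image-pinning advice could also be enforced at plan time rather than left to convention. The sketch below is not part of the module as shown: it adds two validation blocks to the image variable, one requiring an explicit tag or digest and one rejecting `:latest`. The regex is a deliberate simplification and an assumption about how your image references are written.

```hcl
variable "image" {
  description = "Fully-qualified container image. Must carry an explicit tag or @sha256 digest, and must not use :latest."
  type        = string

  # Require the reference to end in ":<tag>" or "@sha256:<64 hex chars>".
  # A registry port such as "localhost:5000/img" does not count as a tag,
  # because the text after the last colon would contain a slash.
  validation {
    condition     = can(regex("(@sha256:[a-f0-9]{64}|:[^:/]+)$", var.image))
    error_message = "The image must end in an explicit tag or an @sha256 digest."
  }

  # Reject the mutable :latest tag outright.
  validation {
    condition     = !endswith(var.image, ":latest")
    error_message = "The image must not use the :latest tag; pin a version tag or digest."
  }
}
```

Under these rules a value like `asia-south1-docker.pkg.dev/proj/repo/migrate:1.4.2` passes, while an untagged reference or one ending in `:latest` fails during plan. Note that a tag only helps if the registry treats it as immutable; a digest is the stricter option.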