Cloud Run in Production: Services, Jobs, VPC Egress, and Concurrency Tuning

Cloud Run gets sold as “just give us a container,” and for a hello-world it is. In production it is a request-scaling, scale-to-zero compute platform with a specific billing model and a set of networking knobs that, if you misread them, produce either a surprise invoice or a service that silently can’t reach your database. This guide is the operational mental model: how instances spin up and bill, when to reach for a job instead of a service, how concurrency and CPU allocation interact, and how to wire private ingress and VPC egress without leaking traffic to the public internet.

Everything below uses the v2 API surface (gcloud run with current flags). Where a flag changed names or a default flipped, I call it out.

1. The execution model: requests, concurrency, instance lifecycle

A Cloud Run service is a set of revisions. Each revision is an immutable container config (image, env, resources, scaling bounds). Traffic is routed to revisions by a traffic-split policy. The autoscaler creates instances of the active revision to absorb load and removes them when load drops, down to min-instances (zero by default).

The unit that matters for both behavior and billing is the instance, and the lever that governs how many you need is concurrency – the maximum number of requests one instance handles simultaneously. Default concurrency is 80. If your container can genuinely serve 80 concurrent requests, one instance covers a lot of traffic. If each request pins a CPU or holds a scarce backend connection, a concurrency of 80 is a recipe for overload and timeouts.

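The autoscaler’s rough target is about 60% utilization per instance. As a planning heuristic, with concurrency 80 expect it to add instances as you approach ~48 in-flight requests per instance. Google documents the 60% figure as a CPU-utilization target that is evaluated alongside request concurrency, so treat the concurrency version as an approximation. You do not control that 60% directly; you control concurrency, min/max instances, and CPU.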
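
To make that heuristic concrete, here is a rough planning sketch. It assumes Little's law (in-flight requests = arrival rate x mean latency) and treats 60% as a fixed per-instance target; the function name, parameters, and example numbers are mine, and this is not Cloud Run's actual autoscaling algorithm.

```python
import math

def estimate_instances(rps, mean_latency_ms, concurrency,
                       target_pct=60, min_instances=0, max_instances=100):
    """Back-of-envelope instance count for a steady load (planning aid only)."""
    # Little's law: average requests in flight = arrival rate * mean latency.
    in_flight = rps * mean_latency_ms / 1000
    # Assumed target: each instance runs at target_pct of its concurrency limit.
    per_instance = concurrency * target_pct / 100
    needed = math.ceil(in_flight / per_instance)
    # The configured scaling bounds always win.
    return max(min_instances, min(needed, max_instances))

# 600 req/s at 200 ms mean latency is about 120 requests in flight.
print(estimate_instances(rps=600, mean_latency_ms=200, concurrency=80))
print(estimate_instances(rps=600, mean_latency_ms=200, concurrency=10))
```

The same load needs far more instances at low concurrency, which is why the concurrency setting dominates both the bill and the cold-start rate.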

The billing distinction that trips people up: by default (request-based billing) you are billed for CPU and memory only while the instance is processing at least one request, rounded up to the nearest 100 ms, plus a small per-request fee. Outside a request the instance still exists for a while (warm) but is not billed, unless it is being kept alive by min-instances (section 4). Switch to instance-based billing (--no-cpu-throttling, covered below) and you pay for the full lifetime of the instance instead – the right choice for background work, the wrong choice for bursty request traffic.
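
A small sketch of the difference in billable time for a single instance. It assumes overlapping requests are billed once (billing follows the instance, not each request) and that the total is rounded up to 100 ms; the function names and the sample timeline are illustrative, and no prices are modelled.

```python
def busy_ms(requests_ms):
    """Total time with at least one request in flight on ONE instance.

    requests_ms: iterable of (start_ms, end_ms) pairs. Overlapping requests
    are merged into a single busy span.
    """
    total = 0
    span_start = span_end = None
    for start, end in sorted(requests_ms):
        if span_end is None or start > span_end:
            if span_end is not None:
                total += span_end - span_start   # close the previous span
            span_start, span_end = start, end
        else:
            span_end = max(span_end, end)        # extend the current span
    if span_end is not None:
        total += span_end - span_start
    return total

def request_billed_ms(requests_ms, granularity_ms=100):
    """Request-based billing: busy time rounded up (assumed 100 ms steps)."""
    total = busy_ms(requests_ms)
    return -(-total // granularity_ms) * granularity_ms

def instance_billed_ms(started_ms, stopped_ms):
    """Instance-based billing: the whole instance lifetime."""
    return stopped_ms - started_ms

# One instance alive for 10 minutes, serving three short requests.
timeline = [(1000, 1250), (1100, 1400), (300000, 300120)]
print(request_billed_ms(timeline), instance_billed_ms(0, 600000))
```

For bursty traffic the gap between the two numbers is large; for a worker that is busy most of its lifetime it nearly disappears.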

2. Services vs jobs: long-running APIs vs batch and scheduled work

A service answers requests on a port and scales on request load. A job runs a container to completion and exits – no port, no ingress, no request concurrency. Reach for a job when the workload is “do this and finish”: a nightly export, a database migration, a Pub/Sub-triggered batch, a one-shot data backfill.

Deploy a service from an image (use --source instead of --image to build from source):

gcloud run deploy orders-api \
  --image=us-docker.pkg.dev/acme-prod/apps/orders-api:1.42.0 \
  --region=us-central1 \
  --project=acme-prod \
  --concurrency=40 \
  --cpu=1 --memory=512Mi \
  --min-instances=1 --max-instances=50 \
  --no-allow-unauthenticated \
  --port=8080

A job is defined once, then executed (manually, on a schedule, or from an event). Jobs support parallelism via task arrays: --tasks is how many tasks run, --parallelism is how many run at once, and each task gets its index in CLOUD_RUN_TASK_INDEX.

# Define a job that fans a backfill across 100 shards, 10 at a time
gcloud run jobs create nightly-backfill \
  --image=us-docker.pkg.dev/acme-prod/apps/backfill:2.3.0 \
  --region=us-central1 \
  --project=acme-prod \
  --tasks=100 --parallelism=10 \
  --task-timeout=3600s \
  --max-retries=3 \
  --cpu=2 --memory=2Gi

# Run it now and stream until completion
gcloud run jobs execute nightly-backfill --region=us-central1 --wait
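
Inside the container, each task has to decide which slice of the work it owns. A minimal sketch, assuming a modulo assignment of shards to tasks: CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT are set by Cloud Run, while TOTAL_SHARDS and the function names are hypothetical.

```python
import os

def shards_for_task(total_shards: int, task_index: int, task_count: int) -> list[int]:
    """Shards owned by one task: every shard s with s % task_count == task_index."""
    return list(range(task_index, total_shards, task_count))

def main() -> None:
    # Set by Cloud Run for every task in an execution.
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))
    # Hypothetical variable: how many shards the backfill is split into.
    total_shards = int(os.environ.get("TOTAL_SHARDS", "100"))
    for shard in shards_for_task(total_shards, task_index, task_count):
        # Real work goes here; it should be safe to repeat on retry.
        print(f"task {task_index}/{task_count}: processing shard {shard}")

if __name__ == "__main__":
    main()
```

Because the assignment depends only on the index and the count, a retried task picks up exactly the same shards, which keeps retries deterministic.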

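Schedule it with Cloud Scheduler calling the Cloud Run Admin API with an OAuth access token (the --oauth-service-account-email flag below; OIDC tokens are for invoking your own services, not *.googleapis.com endpoints). There is no public endpoint and no secret to rotate: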

gcloud scheduler jobs create http nightly-backfill-trigger \
  --location=us-central1 \
  --schedule="0 2 * * *" --time-zone="Etc/UTC" \
  --uri="https://run.googleapis.com/v2/projects/acme-prod/locations/us-central1/jobs/nightly-backfill:run" \
  --http-method=POST \
  --oauth-service-account-email=scheduler-invoker@acme-prod.iam.gserviceaccount.com

The scheduler service account needs roles/run.invoker on the job (or roles/run.developer). Jobs read their workload from env/args, not from a request body, so design them to be re-runnable and idempotent – retries and manual re-execution are normal.

3. Concurrency, CPU allocation, and CPU-always-on

Three settings interact and you must tune them together, not in isolation:

Setting	Flag	What it controls
Concurrency	`--concurrency`	Max simultaneous requests per instance (1-1000)
CPU	`--cpu`	vCPU per instance (0.08 up to 8; values <1 cap concurrency)
CPU throttling	`--cpu-throttling` / `--no-cpu-throttling`	Whether CPU is throttled outside request processing

The rule of thumb: CPU-bound work wants low concurrency; I/O-bound work wants high concurrency. A handler that does heavy serialization or image processing should run at concurrency 1-8 so requests don’t starve each other on a shared core. A handler that mostly awaits a downstream API or database can run at 80+ because the CPU sits idle during the wait and a single instance multiplexes cheaply.
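
One way to turn that rule of thumb into a starting number is to compare CPU time with wall-clock time per request. This is a back-of-envelope heuristic of my own, not an official formula: it assumes you have measured both figures, and the function name and the 80% headroom default are illustrative.

```python
def suggest_concurrency(cpu_ms_per_request, wall_ms_per_request,
                        vcpu=1, headroom_pct=80):
    """Starting point for --concurrency from measured per-request timings.

    A request that burns cpu_ms of CPU over wall_ms of elapsed time keeps a
    core busy for cpu_ms / wall_ms of its life, so one vCPU can interleave
    roughly wall_ms / cpu_ms such requests. headroom_pct leaves slack.
    """
    if cpu_ms_per_request <= 0 or wall_ms_per_request <= 0:
        raise ValueError("timings must be positive")
    raw = (vcpu * wall_ms_per_request * headroom_pct) // (cpu_ms_per_request * 100)
    # Clamp to the range Cloud Run accepts for concurrency (1-1000).
    return max(1, min(int(raw), 1000))

# CPU-bound image resize: 180 ms of CPU in a 200 ms request.
print(suggest_concurrency(cpu_ms_per_request=180, wall_ms_per_request=200))
# I/O-bound handler: 5 ms of CPU, 250 ms mostly waiting on a database.
print(suggest_concurrency(cpu_ms_per_request=5, wall_ms_per_request=250))
```

Treat the result as a first guess to load-test, since memory per request and backend connection limits can cap concurrency well below what CPU alone allows.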

Fractional CPU has a hard constraint worth memorizing: any allocation below 1 vCPU forces concurrency to 1 and is incompatible with CPU-always-on. Sub-1-vCPU instances are for cheap, strictly serial, latency-tolerant endpoints – not your main API.

CPU throttling is the other half. By default, outside request processing CPU is throttled to near zero. That breaks anything that needs to keep working between requests: a background goroutine flushing a buffer, an async logging/telemetry exporter, a queue consumer that acks after the response is sent. Turn it off:

gcloud run services update telemetry-relay \
  --region=us-central1 \
  --no-cpu-throttling \
  --min-instances=1

--no-cpu-throttling is instance-based billing: you pay for the full instance lifetime, so pin a sane min-instances and max-instances and don’t leave it scaling to zero-and-back all day. If your code finishes all work before returning the response and needs nothing in between, leave throttling on and save the money.

4. Cold starts: min instances, startup CPU boost, startup probes

A cold start is the time from “autoscaler decides it needs an instance” to “that instance serves its first request”: pull/start the container, run your startup code, pass the startup probe. You attack it from three directions.

Min instances keep warm capacity so the common path never pays a cold start:

gcloud run services update orders-api \
  --region=us-central1 \
  --min-instances=2

Warm idle instances outside a request are billed at a reduced idle rate under request-based billing – cheap insurance for a latency-sensitive front door, but it is not free, so size it to real traffic.

Startup CPU boost temporarily allocates extra CPU during container startup so initialization (JIT warm-up, framework boot, connection pools) finishes faster. It is on by default in v2; keep it on for JVM/Node services with heavy boot:

gcloud run services update orders-api \
  --region=us-central1 \
  --cpu-boost

Startup probes define when an instance is considered ready. Get this right or the autoscaler routes traffic into a process that hasn’t bound its port yet, producing 503s that look like cold-start failures. A correct startup probe gives slow boots room without making fast boots wait:

# service.yaml (knative-style; apply with: gcloud run services replace service.yaml)
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: orders-api
spec:
  template:
    spec:
      containers:
        - image: us-docker.pkg.dev/acme-prod/apps/orders-api:1.42.0
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz/startup
            initialDelaySeconds: 0
            periodSeconds: 5
            failureThreshold: 12   # up to ~60s to become ready
          livenessProbe:
            httpGet:
              path: /healthz/live
            periodSeconds: 30

Keep startup and liveness checks separate. A liveness probe that also validates a database connection will restart your instance during a transient DB blip and turn a 30-second backend hiccup into a self-inflicted outage.
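
On the application side, the split can look like the sketch below, written with only the Python standard library. The paths match the probe config above; the warm_up function and the readiness flag are hypothetical stand-ins for whatever your service does at boot.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ready = threading.Event()  # set once startup work has finished

def probe_status(path: str, is_ready: bool) -> int:
    """HTTP status for a probe path; liveness never checks dependencies."""
    if path == "/healthz/live":
        return 200                       # the process is up and serving
    if path == "/healthz/startup":
        return 200 if is_ready else 503  # fail until initialisation is done
    return 404

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, ready.is_set()))
        self.end_headers()

    def log_message(self, *args):        # keep probe traffic out of the logs
        pass

def warm_up():
    # Hypothetical slow initialisation: load config, open pools, fill caches.
    ready.set()

def main():
    port = int(os.environ.get("PORT", "8080"))  # Cloud Run injects PORT
    server = ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
    # Bind the port first, then initialise in the background.
    threading.Thread(target=warm_up, daemon=True).start()
    server.serve_forever()
```

Calling main() starts the server. Dependency checks such as a database ping belong in the startup path at most, never in /healthz/live.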

5. Direct VPC egress vs Serverless VPC Access connectors

By default a Cloud Run service reaches the internet directly and cannot reach RFC 1918 addresses in your VPC. To talk to private resources you attach the service to a VPC. There are two mechanisms.

Serverless VPC Access connector is the older model: a managed set of e2-micro-class instances (a connector) that you provision in a /28, sized by throughput, and that all your serverless egress hairpins through. It works, but it is a standing cost, a throughput bottleneck, and one more thing to scale.

Direct VPC egress is the current default choice: Cloud Run instances get IPs directly from a subnet in your VPC – no connector instances, lower latency, higher throughput ceiling, and you pay nothing extra for the data path. Give it its own subnet with enough address space for peak instance count plus headroom.

# Direct VPC egress: attach the service straight to a subnet
gcloud run services update orders-api \
  --region=us-central1 \
  --network=prod-vpc \
  --subnet=run-egress-subnet \
  --vpc-egress=private-ranges-only \
  --network-tags=cloud-run-orders

The --vpc-egress value is the decision that determines where your traffic goes:

Value	Behavior
`private-ranges-only`	Only RFC 1918 / private traffic goes through the VPC; public traffic exits directly via Google
`all-traffic`	All egress, including internet-bound, routes through the VPC

Use all-traffic when you need every outbound packet to leave through Cloud NAT with a known, allowlistable static IP – the classic “the partner’s firewall only accepts our two NAT IPs” requirement. It also means public traffic now depends on your NAT being healthy and adequately provisioned with ports, so plan NAT capacity accordingly. The --network-tags you assign let VPC firewall rules target this service specifically; Cloud NAT is scoped by subnet rather than by tag, which is one more reason to give the service its own egress subnet.

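Subnet sizing is a real capacity decision. Each running instance consumes at least one address, and Cloud Run can hold more addresses than it has running instances (Google’s guidance has been to budget roughly twice the peak instance count; check the current Direct VPC egress documentation for the exact ratio). Google Cloud also reserves four addresses in every subnet, so a /28 leaves only 12 usable, and Direct VPC egress is documented as requiring a /26 or larger in any case. If max-instances is 100, a /25 (124 usable) is the bare minimum at one address per instance and a /24 is the safer choice. Running out of addresses surfaces as instances failing to start under load – exactly when you can least afford it. Size the subnet to max-instances with margin and don’t share it with anything else.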
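
The arithmetic is easy to script. The sketch below assumes four reserved addresses per Google Cloud subnet and exposes the addresses-per-instance ratio and headroom as parameters, since those are planning assumptions rather than fixed facts; it does not encode any platform-imposed minimum subnet size, and the function names are mine.

```python
import ipaddress

GCP_RESERVED_PER_SUBNET = 4  # network, gateway, second-to-last, broadcast

def usable_addresses(cidr: str) -> int:
    """Addresses left for workloads in a Google Cloud subnet."""
    return ipaddress.ip_network(cidr).num_addresses - GCP_RESERVED_PER_SUBNET

def smallest_prefix(max_instances: int, ips_per_instance: int = 1,
                    headroom_pct: int = 20) -> int:
    """Longest prefix (smallest subnet) that fits the peak instance count."""
    wanted = max_instances * ips_per_instance * (100 + headroom_pct)
    needed = -(-wanted // 100) + GCP_RESERVED_PER_SUBNET  # integer ceiling
    # Smallest power of two >= needed, expressed as a prefix length.
    return 32 - (needed - 1).bit_length()

print(usable_addresses("10.8.0.0/28"))
print(smallest_prefix(100))
print(smallest_prefix(80, ips_per_instance=2))
```

Raising ips_per_instance to 2 moves a 100-instance service from a /25 to a /24, which is the kind of jump worth knowing before the subnet is carved out.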

6. Private ingress: internal load balancers, IAP, and PSC endpoints

Egress is half the story; ingress is the other. Lock down who can reach the service with --ingress:

Ingress setting	Reachable by
`all`	The public internet (default)
`internal`	Internal LBs, VPC sources, and the same project’s VPC; also Pub/Sub, Eventarc, Workflows
`internal-and-cloud-load-balancing`	The above plus an external Application Load Balancer

gcloud run services update orders-api \
  --region=us-central1 \
  --ingress=internal

--ingress=internal plus an internal Application Load Balancer is the standard private front door: requests that arrive at the run.app URL from the public internet are rejected, and clients reach the service only from inside the VPC (or across VPN/Interconnect). You point the ILB at the service with a serverless NEG:

gcloud compute network-endpoint-groups create orders-neg \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=orders-api

gcloud compute backend-services create orders-backend \
  --load-balancing-scheme=INTERNAL_MANAGED \
  --region=us-central1

gcloud compute backend-services add-backend orders-backend \
  --region=us-central1 \
  --network-endpoint-group=orders-neg \
  --network-endpoint-group-region=us-central1

For browser-facing internal apps that need user identity at the edge, front the service with an external Application Load Balancer (set ingress to internal-and-cloud-load-balancing) and enable Identity-Aware Proxy on the backend service. IAP authenticates every request against your IdP before it reaches the container, so the app gets a verified identity in a signed header and never sees unauthenticated traffic.

To reach a Cloud Run service from another VPC or another project’s network without an external LB, use a Private Service Connect endpoint targeting the Google APIs bundle and call the service through run.app over that private path – traffic never touches the public internet, and you keep a single private IP as the reach point.

7. Connecting to Cloud SQL and private services securely

Two correct ways to reach Cloud SQL; pick deliberately.

Cloud SQL connector (Unix socket). The platform mounts a socket at /cloudsql/INSTANCE_CONNECTION_NAME. This path does not need VPC egress at all and handles IAM-based authorization of the connection and encryption for you. Best default for most apps:

gcloud run services update orders-api \
  --region=us-central1 \
  --add-cloudsql-instances=acme-prod:us-central1:orders-db \
  --set-env-vars=DB_SOCKET=/cloudsql/acme-prod:us-central1:orders-db

Your code connects to the socket path (e.g. host=/cloudsql/acme-prod:us-central1:orders-db for Postgres) and authenticates – prefer IAM database authentication with the service’s own service account over a stored password, so there is no DB credential to leak or rotate.
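
As an illustration of the socket path in application code, here is a sketch that builds a libpq-style connection string from the environment. DB_SOCKET matches the variable set above; DB_NAME, DB_USER, their defaults, and the function name are hypothetical. The token handling needed for IAM database authentication is not shown, and the result would be handed to whatever Postgres driver you use.

```python
import os

def _quote(value: str) -> str:
    """Quote a libpq keyword value (backslash-escape \\ and ')."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def postgres_dsn(env=None) -> str:
    """Build a keyword/value DSN that connects over the Cloud SQL Unix socket."""
    env = os.environ if env is None else env
    socket_dir = env["DB_SOCKET"]           # set via --set-env-vars above
    dbname = env.get("DB_NAME", "orders")   # hypothetical variable and default
    user = env.get("DB_USER", "orders-api") # hypothetical variable and default
    # libpq treats a host value starting with "/" as a socket directory.
    parts = [f"host={socket_dir}", f"dbname={dbname}", f"user={user}"]
    password = env.get("DB_PASSWORD")       # omitted when no password is set
    if password:
        parts.append(f"password={_quote(password)}")
    return " ".join(parts)

print(postgres_dsn({"DB_SOCKET": "/cloudsql/acme-prod:us-central1:orders-db"}))
```

Reading everything from the environment keeps the image identical across environments; only the deploy flags change.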

Private IP over Direct VPC egress. If the instance has a private IP and you’ve enabled the Service Networking / private connection, connect straight to that IP through the VPC attachment from section 5 – lower latency, no socket proxy, and the natural choice when you’re already on Direct VPC egress for other reasons.

For other private dependencies (Memorystore, an internal API, a partner service behind PSC), the pattern is identical: attach to the VPC, set --vpc-egress to at least private-ranges-only, and make sure a firewall rule allows the egress subnet’s range (or the network tag) to reach the target port. Pull secrets from Secret Manager mounted as env or files rather than baking them into the image:

gcloud run services update orders-api \
  --region=us-central1 \
  --set-secrets=DB_PASSWORD=orders-db-password:latest

8. Revisions, traffic splitting, and gradual rollouts with tags

Every deploy creates a revision. By default 100% of traffic shifts to the newest one – fine for dev, reckless for a tier-1 API. Decouple deploy from promote so you can ship a revision, smoke-test it on a private URL, then move traffic in steps.

Deploy without taking traffic, and assign a tag that mints a stable, revision-specific URL:

gcloud run deploy orders-api \
  --image=us-docker.pkg.dev/acme-prod/apps/orders-api:1.43.0 \
  --region=us-central1 \
  --no-traffic \
  --tag=canary

That gives you a URL of the form https://canary---orders-api-<hash>-uc.a.run.app, addressable for tests while live traffic stays on the old revision. Promote in stages once it’s healthy:

# 10% canary
gcloud run services update-traffic orders-api \
  --region=us-central1 \
  --to-tags=canary=10

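# Full cutover when SLOs hold. Pinning to the tag keeps promotion explicit;
# --to-latest would also route to whichever revision becomes latest later.
gcloud run services update-traffic orders-api \
  --region=us-central1 \
  --to-tags=canary=100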

Instant rollback is a one-liner to a known-good revision – which is the entire reason revisions are immutable:

gcloud run services update-traffic orders-api \
  --region=us-central1 \
  --to-revisions=orders-api-00041-abc=100
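
If you script the promotion, a small generator keeps the steps explicit and reviewable. This sketch only builds the command strings, using the update-traffic flags shown above; the step percentages and the function name are illustrative, and the health check between steps is left to your pipeline.

```python
import shlex

def rollout_commands(service, region, tag="canary", steps=(10, 25, 50, 100)):
    """Return one update-traffic command per rollout step, in order."""
    previous = 0
    commands = []
    for pct in steps:
        if not previous < pct <= 100:
            raise ValueError("steps must increase and stay within 1-100")
        argv = ["gcloud", "run", "services", "update-traffic", service,
                f"--region={region}", f"--to-tags={tag}={pct}"]
        commands.append(shlex.join(argv))  # shell-safe rendering
        previous = pct
    return commands

for cmd in rollout_commands("orders-api", "us-central1"):
    # A real pipeline would run cmd, then wait and check SLOs before the next.
    print(cmd)
```

Keeping the plan as data makes it easy to stop at any step and fall back to the rollback command above.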

Verify

Confirm the running config and behavior match intent before you call it done.

# Effective scaling, concurrency, CPU, ingress, egress on the live revision
gcloud run services describe orders-api \
  --region=us-central1 \
  --format="yaml(spec.template.spec.containerConcurrency,spec.template.spec.containers[0].resources.limits,spec.template.metadata.annotations,metadata.annotations,status.traffic)"

# Confirm ingress is locked down (expect 'internal')
gcloud run services describe orders-api --region=us-central1 \
  --format="value(metadata.annotations['run.googleapis.com/ingress'])"

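# Authenticated call (no --allow-unauthenticated, so a token is required).
# With --ingress=internal, run this from inside the VPC; from the public internet it is rejected.
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://orders-api-<hash>-uc.a.run.app/healthz/live

# Prove the egress path: from the service, call an endpoint that reports the caller's source IP.
# With --vpc-egress=all-traffic, the IP seen downstream must be one of your Cloud NAT IPs.
# With private-ranges-only, only private destinations should see a source IP from the egress subnet.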

# Jobs: inspect the last execution and per-task outcomes
gcloud run jobs executions list --job=nightly-backfill --region=us-central1
gcloud run jobs executions describe <execution-id> --region=us-central1 \
  --format="value(status.succeededCount,status.failedCount)"

Watch the autoscaler under load in Cloud Monitoring: run.googleapis.com/container/instance_count (split by state to see active vs idle), container/cpu/utilizations, and request_latencies. If instance count is pinned at max-instances while latency climbs, you are concurrency- or CPU-starved – raise concurrency for I/O-bound work or raise max-instances and CPU for compute-bound work.

Enterprise scenario

A fintech platform team ran a payments-reconciliation API on Cloud Run behind an external LB. The card-network partner enforced an IP allowlist: outbound calls to the settlement endpoint had to originate from two pre-registered static IPs, or they were dropped at the partner’s firewall. The service worked in staging (open egress) and failed intermittently in production – some instances happened to egress through Google IPs the partner had once seen, most didn’t.

The constraint: every outbound packet to the partner had to leave from a fixed, allowlisted IP, while the service still scaled to dozens of instances and still served low-latency public ingress.

The fix was Direct VPC egress with all-traffic, forcing all egress through a Cloud NAT configured with two reserved static IPs – the exact pair the partner had registered. The team gave the egress subnet a /24 (max-instances was 80, and they wanted headroom plus room for other serverless egress), tagged the service, scoped a firewall rule to that tag, and scoped the NAT to the egress subnet:

gcloud compute addresses create recon-nat-ip-1 recon-nat-ip-2 \
  --region=us-central1

gcloud compute routers nats create recon-nat \
  --router=prod-router --region=us-central1 \
  --nat-custom-subnet-ip-ranges=run-egress-subnet \
  --nat-external-ip-pool=recon-nat-ip-1,recon-nat-ip-2

gcloud run services update recon-api \
  --region=us-central1 \
  --network=prod-vpc --subnet=run-egress-subnet \
  --vpc-egress=all-traffic \
  --network-tags=recon-egress

After cutover, 100% of partner-bound traffic egressed from the two registered IPs, the intermittent drops stopped, and the allowlist held. The hidden cost they planned for: routing all traffic through NAT meant NAT port exhaustion was now a production risk, so they bumped --min-ports-per-vm and alerted on the Cloud NAT dropped_sent_packets_count metric. The lesson the team wrote down: all-traffic is not a networking detail, it is a dependency – your public egress now lives and dies with Cloud NAT capacity.


Written by Vinod
