Quick take — A reusable Terraform module for GCP Vertex AI Workbench (google_workbench_instance): private-IP JupyterLab instances with idle shutdown, CMEK disks, Shielded VM, and least-privilege service accounts. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "vertex_workbench" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
project_id = "..." # GCP project ID that will host the Workbench instance.
instance_name = "..." # Instance name; lowercase letters, digits, hyphens; star…
zone = "..." # Zone such as `asia-south1-a`.
network = "..." # VPC network self-link or name.
subnet = "..." # Subnetwork self-link (Private Google Access enabled).
environment = "..." # `dev`/`staging`/`prod` label.
team = "..." # Owning team (cost-allocation label).
cost_center = "..." # Cost-centre code (chargeback label).
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Vertex AI Workbench is GCP’s managed JupyterLab environment for data scientists and ML engineers. A Workbench instance is essentially a Compute Engine VM, pre-loaded with a curated Deep Learning VM image (TensorFlow, PyTorch, RAPIDS, or a base image), that exposes a JupyterLab UI you reach through a Google-proxied URL or directly over a private IP. Google handles the OS image, the notebook runtime, optional GPU attachment, and lifecycle features like idle auto-shutdown so an idle GPU box doesn’t quietly burn your budget overnight.
The catch is that the defaults are not what a regulated enterprise wants. Created from the console, a Workbench instance tends to land with a public IP, the default Compute Engine service account (which is wildly over-privileged), no customer-managed encryption, and no consistent labelling. Multiply that across forty data scientists who each spin up their own box and you have a sprawling, unauditable, non-compliant estate.
This module wraps google_workbench_instance (the current GA resource under hashicorp/google ~> 5.0, replacing the deprecated google_notebooks_instance) so that every notebook is born compliant: private IP only, Shielded VM with Secure Boot and integrity monitoring, CMEK-encrypted boot and data disks, a dedicated least-privilege service account, mandatory idle-shutdown, and a standard label set for cost allocation. Data scientists get self-service notebooks; platform and security teams get a single, reviewable definition of what “a notebook” is allowed to be.
When to use it
- You run a data science or ML platform and want to give analysts self-service JupyterLab without handing them the GCP console or letting them provision public-IP VMs.
- You need notebooks that sit inside a Shared VPC and reach BigQuery, Cloud Storage, and on-prem data over Private Google Access / Private Service Connect, never the public internet.
- Compliance requires CMEK on all persistent disks and Shielded VM posture on every compute instance, notebooks included.
- You want predictable cost: enforced idle-shutdown timers, right-sized machine types, and labels that map every rupee of spend back to a team and cost centre.
- You are standardising on GPU notebooks (e.g. an NVIDIA_TESLA_T4 for prototyping) and want the GPU type, count, and driver install captured in code rather than clicked per-instance.
If you only need a single throwaway notebook for a one-hour spike, the console is faster. This module pays off the moment notebooks become shared infrastructure that auditors will ask about.
Module structure
terraform-module-gcp-vertex-workbench/
├── versions.tf # provider + Terraform version pins
├── main.tf # google_workbench_instance + dedicated SA + IAM
├── variables.tf # all knobs, with validation
├── outputs.tf # id, name, proxy URI, service account, state
└── README.md
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
locals {
# Standard labels merged onto every resource for cost allocation + governance.
base_labels = {
managed-by = "terraform"
module = "vertex-workbench"
environment = var.environment
team = var.team
cost-center = var.cost_center
}
labels = merge(local.base_labels, var.additional_labels)
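# Derive a stable SA account_id (<= 30 chars, RFC1035-ish) from the instance name.
# trimsuffix drops a hyphen left at the cut point, since an account_id may not end in one.
sa_account_id = trimsuffix(substr("wb-${var.instance_name}", 0, 30), "-")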
}
# Dedicated, least-privilege service account for this notebook.
# Created only when the caller does not pass an existing SA email.
resource "google_service_account" "workbench" {
count = var.service_account_email == null ? 1 : 0
project = var.project_id
account_id = local.sa_account_id
display_name = "Vertex Workbench SA - ${var.instance_name}"
description = "Runtime identity for Vertex AI Workbench instance ${var.instance_name}"
}
locals {
effective_sa_email = coalesce(
var.service_account_email,
try(google_service_account.workbench[0].email, null)
)
}
# Minimal project-level roles the notebook runtime needs to function.
# Grant data-access (BigQuery, GCS buckets) at the resource level outside this module.
resource "google_project_iam_member" "workbench_roles" {
for_each = var.service_account_email == null ? toset(var.service_account_roles) : toset([])
project = var.project_id
role = each.value
member = "serviceAccount:${google_service_account.workbench[0].email}"
}
resource "google_workbench_instance" "this" {
project = var.project_id
name = var.instance_name
location = var.zone
gce_setup {
machine_type = var.machine_type
# Disable external IP. The instance is reachable only over the private
# network and the Google-managed proxy URI.
disable_public_ip = true
# Shielded VM posture — required for most CIS / regulated baselines.
shielded_instance_config {
enable_secure_boot = true
enable_vtpm = true
enable_integrity_monitoring = true
}
# Optional GPU accelerator (e.g. NVIDIA_TESLA_T4). Driver auto-installed
# so notebooks get CUDA without manual setup.
dynamic "accelerator_configs" {
for_each = var.accelerator_type == null ? [] : [1]
content {
type = var.accelerator_type
core_count = var.accelerator_count
}
}
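# Boot disk: CMEK-encrypted when a KMS key is supplied, Google-managed otherwise.
# kms_key only takes effect when disk_encryption is set to "CMEK".
boot_disk {
disk_type = var.boot_disk_type
disk_size_gb = var.boot_disk_size_gb
disk_encryption = var.kms_key == null ? "GMEK" : "CMEK"
kms_key = var.kms_key
}
# Data disk: separate, persistent, encrypted the same way as the boot disk.
data_disks {
disk_type = var.data_disk_type
disk_size_gb = var.data_disk_size_gb
disk_encryption = var.kms_key == null ? "GMEK" : "CMEK"
kms_key = var.kms_key
}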
network_interfaces {
network = var.network
subnet = var.subnet
# nic_type left to provider default (GVNIC on supported images).
}
service_accounts {
email = local.effective_sa_email
}
# Curated Deep Learning image. Default channel keeps the notebook patched.
vm_image {
project = var.image_project
family = var.image_family
}
# Hardening + lifecycle metadata.
metadata = merge(
{
# Auto-shut-down after N minutes idle to control cost.
"idle-timeout-seconds" = tostring(var.idle_shutdown_minutes * 60)
# Block project-wide SSH keys; access is via JupyterLab proxy / IAP.
"block-project-ssh-keys" = "true"
# Disable the per-instance Jupyter "terminal as root" surface.
"notebook-disable-root" = "true"
# Report instance health/metrics to the Workbench control plane.
"report-system-health" = "true"
},
var.metadata
)
# Do not let the notebook VM forward packets on behalf of other hosts.
enable_ip_forwarding = false
tags = var.network_tags
}
# User email(s) allowed to open the JupyterLab UI (single-user notebooks).
instance_owners = var.instance_owners
# Disable the public proxy access when running fully private (IAP / VPN only).
disable_proxy_access = var.disable_proxy_access
labels = local.labels
lifecycle {
# Image family updates are applied via a controlled re-create, not silently.
ignore_changes = [
gce_setup[0].vm_image,
]
}
}
variables.tf
variable "project_id" {
type = string
description = "GCP project ID that will host the Workbench instance."
}
variable "instance_name" {
type = string
description = "Name of the Workbench instance. Lowercase letters, digits and hyphens; must start with a letter."
validation {
condition = can(regex("^[a-z][a-z0-9-]{0,62}$", var.instance_name))
error_message = "instance_name must start with a lowercase letter and contain only lowercase letters, digits, and hyphens (max 63 chars)."
}
}
variable "zone" {
type = string
description = "Zone for the instance, e.g. asia-south1-a."
validation {
condition = can(regex("^[a-z]+-[a-z0-9]+-[a-z]$", var.zone))
error_message = "zone must be a valid GCP zone such as asia-south1-a."
}
}
variable "machine_type" {
type = string
description = "Compute Engine machine type for the notebook, e.g. n1-standard-4 or e2-standard-8."
default = "e2-standard-4"
}
variable "network" {
type = string
description = "Self-link or short name of the VPC network the instance attaches to."
}
variable "subnet" {
type = string
description = "Self-link of the subnetwork (must have Private Google Access enabled)."
}
variable "network_tags" {
type = list(string)
description = "Network tags applied to the instance for firewall targeting."
default = []
}
variable "service_account_email" {
type = string
description = "Optional existing service account email for the notebook runtime. If null, a dedicated SA is created by the module."
default = null
}
variable "service_account_roles" {
type = list(string)
description = "Project-level roles granted to the module-created service account. Ignored when service_account_email is supplied."
default = [
"roles/aiplatform.user",
"roles/storage.objectViewer",
"roles/logging.logWriter",
"roles/monitoring.metricWriter",
]
}
variable "image_project" {
type = string
description = "Project hosting the VM image family."
default = "deeplearning-platform-release"
}
variable "image_family" {
type = string
description = "Deep Learning VM image family (e.g. workbench-instances, tf-latest-cpu, pytorch-latest-gpu)."
default = "workbench-instances"
}
variable "boot_disk_type" {
type = string
description = "Boot disk type."
default = "PD_SSD"
validation {
condition = contains(["PD_STANDARD", "PD_SSD", "PD_BALANCED"], var.boot_disk_type)
error_message = "boot_disk_type must be one of PD_STANDARD, PD_SSD, PD_BALANCED."
}
}
variable "boot_disk_size_gb" {
type = number
description = "Boot disk size in GB."
default = 150
validation {
condition = var.boot_disk_size_gb >= 100 && var.boot_disk_size_gb <= 65536
error_message = "boot_disk_size_gb must be between 100 and 65536."
}
}
variable "data_disk_type" {
type = string
description = "Persistent data disk type (holds /home/jupyter)."
default = "PD_BALANCED"
validation {
condition = contains(["PD_STANDARD", "PD_SSD", "PD_BALANCED"], var.data_disk_type)
error_message = "data_disk_type must be one of PD_STANDARD, PD_SSD, PD_BALANCED."
}
}
variable "data_disk_size_gb" {
type = number
description = "Persistent data disk size in GB."
default = 200
validation {
condition = var.data_disk_size_gb >= 100 && var.data_disk_size_gb <= 65536
error_message = "data_disk_size_gb must be between 100 and 65536."
}
}
variable "kms_key" {
type = string
description = "CMEK key resource ID for disk encryption (projects/<p>/locations/<l>/keyRings/<r>/cryptoKeys/<k>). Null uses Google-managed keys."
default = null
}
variable "accelerator_type" {
type = string
description = "Optional GPU accelerator type, e.g. NVIDIA_TESLA_T4. Null means CPU-only."
default = null
validation {
condition = var.accelerator_type == null || contains([
"NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100", "NVIDIA_TESLA_P100",
"NVIDIA_TESLA_A100", "NVIDIA_A100_80GB", "NVIDIA_L4",
], coalesce(var.accelerator_type, "none"))
error_message = "accelerator_type must be a supported Vertex Workbench GPU type or null."
}
}
variable "accelerator_count" {
type = number
description = "Number of GPUs to attach when accelerator_type is set."
default = 1
validation {
condition = var.accelerator_count >= 1 && var.accelerator_count <= 8
error_message = "accelerator_count must be between 1 and 8."
}
}
variable "idle_shutdown_minutes" {
type = number
description = "Minutes of inactivity before the notebook auto-stops. Set to 0 to disable (not recommended)."
default = 60
validation {
condition = var.idle_shutdown_minutes >= 0 && var.idle_shutdown_minutes <= 1440
error_message = "idle_shutdown_minutes must be between 0 and 1440 (24h)."
}
}
variable "instance_owners" {
type = list(string)
description = "User emails permitted to access the JupyterLab UI (single-user notebooks). Empty for service-managed access."
default = []
}
variable "disable_proxy_access" {
type = bool
description = "If true, disables the public Google proxy URI; access only via IAP/VPN to the private IP."
default = false
}
variable "environment" {
type = string
description = "Deployment environment label (e.g. dev, staging, prod)."
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "environment must be one of dev, staging, prod."
}
}
variable "team" {
type = string
description = "Owning team, used as a label for cost allocation."
}
variable "cost_center" {
type = string
description = "Cost-centre code, used as a label for chargeback."
}
variable "metadata" {
type = map(string)
description = "Extra instance metadata key/value pairs, merged onto the module defaults."
default = {}
}
variable "additional_labels" {
type = map(string)
description = "Extra labels merged onto the standard label set."
default = {}
}
outputs.tf
output "id" {
description = "Fully-qualified resource ID of the Workbench instance."
value = google_workbench_instance.this.id
}
output "name" {
description = "Name of the Workbench instance."
value = google_workbench_instance.this.name
}
output "proxy_uri" {
description = "Google-managed proxy URI to open the JupyterLab UI (empty when proxy access is disabled)."
value = google_workbench_instance.this.proxy_uri
}
output "state" {
description = "Current lifecycle state of the instance (e.g. ACTIVE, STOPPED)."
value = google_workbench_instance.this.state
}
output "service_account_email" {
description = "Service account email the notebook runs as."
value = local.effective_sa_email
}
output "creator" {
description = "Email of the principal that created the instance."
value = google_workbench_instance.this.creator
}
output "health_state" {
description = "Reported health state of the instance from the Workbench control plane."
value = google_workbench_instance.this.health_state
}
How to use it
module "vertex_ai_workbench" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
project_id = "kv-ml-platform-prod"
instance_name = "fraud-ds-anita"
zone = "asia-south1-a"
machine_type = "n1-standard-8"
# GPU notebook for model prototyping.
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1
# Attach to the Shared VPC; subnet has Private Google Access on.
network = "projects/kv-shared-host/global/networks/ml-vpc"
subnet = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"
network_tags = ["workbench", "egress-via-proxy"]
# CMEK on all disks.
kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"
# Single-user notebook locked to one analyst.
instance_owners = ["anita@kloudvin.com"]
# Aggressive idle-shutdown for an expensive GPU box.
idle_shutdown_minutes = 30
data_disk_size_gb = 500
environment = "prod"
team = "fraud-ml"
cost_center = "CC-4471"
additional_labels = {
project = "realtime-fraud-scoring"
}
}
# Downstream: grant this notebook's SA read access to a specific BigQuery
# dataset, using the module's service_account_email output.
resource "google_bigquery_dataset_iam_member" "notebook_reader" {
project = "kv-data-warehouse-prod"
dataset_id = "fraud_features"
role = "roles/bigquery.dataViewer"
member = "serviceAccount:${module.vertex_ai_workbench.service_account_email}"
}
output "notebook_url" {
value = module.vertex_ai_workbench.proxy_uri
}
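The example above assumes the subnet already has Private Google Access switched on and that something can reach instances carrying the workbench tag. A minimal sketch of those prerequisites follows. The CIDR range, the resource names, and the choice to manage them in the same configuration are assumptions for illustration; in a Shared VPC these usually live in the host project's own code. The source range is the one Google publishes for IAP TCP forwarding.

```hcl
# Hypothetical network prerequisites for the notebook above.
# Project, network, and subnet names mirror the usage example; the CIDR is assumed.
resource "google_compute_subnetwork" "notebooks" {
  project       = "kv-shared-host"
  name          = "notebooks-asia-south1"
  region        = "asia-south1"
  network       = "projects/kv-shared-host/global/networks/ml-vpc"
  ip_cidr_range = "10.20.0.0/24" # assumed range, pick one that fits your VPC plan

  # Lets private-IP instances reach Google APIs (BigQuery, GCS) without an external IP.
  private_ip_google_access = true
}

# Allow IAP TCP forwarding to open SSH tunnels to instances tagged "workbench".
resource "google_compute_firewall" "iap_to_workbench" {
  project       = "kv-shared-host"
  name          = "allow-iap-to-workbench"
  network       = "projects/kv-shared-host/global/networks/ml-vpc"
  direction     = "INGRESS"
  source_ranges = ["35.235.240.0/20"] # IAP TCP forwarding range
  target_tags   = ["workbench"]       # matches network_tags passed to the module

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}
```

Only port 22 is opened here, on the assumption that analysts who bypass the proxy URI tunnel to JupyterLab over SSH port-forwarding; widen the rule only if your access pattern needs it.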
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/vertex_workbench/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
}
inputs = {
project_id = "..."
instance_name = "..."
zone = "..."
network = "..."
subnet = "..."
environment = "..."
team = "..."
cost_center = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/vertex_workbench && terragrunt apply # this module
terragrunt run-all apply # run from live/prod to apply every module under it
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
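To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might look like. The path live/dev/vertex_workbench/terragrunt.hcl and every input value are hypothetical; only the input names come from the module.

```hcl
# live/dev/vertex_workbench/terragrunt.hcl (hypothetical path and values)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
}

inputs = {
  project_id    = "kv-ml-platform-dev" # assumed dev project
  instance_name = "fraud-ds-sandbox"
  zone          = "asia-south1-a"
  network       = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet        = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"

  # Dev differs from prod only where it needs to: smaller CPU-only box, shorter idle timer.
  machine_type          = "e2-standard-4"
  idle_shutdown_minutes = 30

  environment = "dev"
  team        = "fraud-ml"
  cost_center = "CC-4471"
}
```

The module source and version pin stay identical across environments, so promoting a module upgrade is a one-line change to ref in each file.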
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | — | yes | GCP project ID that will host the Workbench instance. |
| instance_name | string | — | yes | Instance name; lowercase letters, digits, hyphens; starts with a letter. |
| zone | string | — | yes | Zone such as asia-south1-a. |
| machine_type | string | e2-standard-4 | no | Compute Engine machine type. |
| network | string | — | yes | VPC network self-link or name. |
| subnet | string | — | yes | Subnetwork self-link (Private Google Access enabled). |
| network_tags | list(string) | [] | no | Network tags for firewall targeting. |
| service_account_email | string | null | no | Existing SA email; null creates a dedicated SA. |
| service_account_roles | list(string) | [aiplatform.user, storage.objectViewer, logging.logWriter, monitoring.metricWriter] | no | Roles granted to the module-created SA. |
| image_project | string | deeplearning-platform-release | no | Project hosting the VM image family. |
| image_family | string | workbench-instances | no | Deep Learning VM image family. |
| boot_disk_type | string | PD_SSD | no | Boot disk type (PD_STANDARD/PD_SSD/PD_BALANCED). |
| boot_disk_size_gb | number | 150 | no | Boot disk size in GB (100–65536). |
| data_disk_type | string | PD_BALANCED | no | Data disk type (holds /home/jupyter). |
| data_disk_size_gb | number | 200 | no | Data disk size in GB (100–65536). |
| kms_key | string | null | no | CMEK key resource ID for disk encryption. |
| accelerator_type | string | null | no | GPU type (e.g. NVIDIA_TESLA_T4); null = CPU-only. |
| accelerator_count | number | 1 | no | Number of GPUs (1–8) when accelerator_type set. |
| idle_shutdown_minutes | number | 60 | no | Idle minutes before auto-stop; 0 disables. |
| instance_owners | list(string) | [] | no | User emails allowed into the JupyterLab UI. |
| disable_proxy_access | bool | false | no | Disable the public proxy URI (IAP/VPN-only access). |
| environment | string | — | yes | dev/staging/prod label. |
| team | string | — | yes | Owning team (cost-allocation label). |
| cost_center | string | — | yes | Cost-centre code (chargeback label). |
| metadata | map(string) | {} | no | Extra instance metadata, merged onto defaults. |
| additional_labels | map(string) | {} | no | Extra labels merged onto the standard set. |
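Most of these inputs only tune sizes, but service_account_email changes what the module creates: when it is set, the dedicated service account and its project-level role grants are skipped. A sketch of that bring-your-own-identity case, with a hypothetical project, network, and service account:

```hcl
# Hypothetical values throughout; the service account is assumed to exist already
# and to hold whatever roles the notebook needs.
module "vertex_workbench_byo_sa" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "my-project"
  instance_name = "team-sandbox"
  zone          = "us-central1-a"
  network       = "my-vpc"
  subnet        = "projects/my-project/regions/us-central1/subnetworks/notebooks"

  # Literal email: the module skips google_service_account and the IAM members.
  service_account_email = "notebooks-runtime@my-project.iam.gserviceaccount.com"

  # Private-only access, no Google proxy URI.
  disable_proxy_access = true

  environment = "dev"
  team        = "platform"
  cost_center = "CC-0000"
}
```

The email is written as a literal on purpose. The module decides whether to create a service account with a count expression on this variable, so a value that is only known after apply (for example the email attribute of a service account created in the same run) would likely fail at plan time.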
Outputs
| Name | Description |
|---|---|
| id | Fully-qualified resource ID of the Workbench instance. |
| name | Name of the Workbench instance. |
| proxy_uri | Google-managed proxy URI for the JupyterLab UI (empty when proxy access disabled). |
| state | Current lifecycle state (e.g. ACTIVE, STOPPED). |
| service_account_email | Service account email the notebook runs as. |
| creator | Email of the principal that created the instance. |
| health_state | Reported health state from the Workbench control plane. |
Enterprise scenario
A fintech’s fraud-modelling group needs forty data scientists able to spin up GPU notebooks against a feature store in BigQuery, but their PCI-DSS scope forbids any public-IP compute and mandates CMEK on all persistent storage. The platform team exposes this module behind a thin self-service wrapper: an analyst opens a pull request with their name, team, and required machine type, and CI applies a fraud-ds-<name> instance that lands private-only, Shielded, CMEK-encrypted, and owned by exactly one user — then grants its dedicated service account read-only access to just the fraud_features dataset. The enforced 30-minute idle-shutdown alone cut their monthly notebook GPU spend by roughly 40% versus the previous click-ops fleet that ran around the clock.
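One way the self-service wrapper described above could be sketched is a single for_each over a roster of analysts. The analysts variable, the second roster entry, and the fixed T4 accelerator are assumptions for illustration; project, network, key, and dataset names are reused from the usage example.

```hcl
# Hypothetical roster; in the scenario, CI would assemble this map from merged pull requests.
variable "analysts" {
  type = map(object({
    owner_email  = string
    machine_type = string
  }))
  default = {
    anita = { owner_email = "anita@kloudvin.com", machine_type = "n1-standard-8" }
    ravi  = { owner_email = "ravi@kloudvin.com", machine_type = "n1-standard-4" } # made-up entry
  }
}

# One private, Shielded, CMEK-encrypted notebook per analyst.
module "fraud_notebooks" {
  for_each = var.analysts
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "kv-ml-platform-prod"
  instance_name = "fraud-ds-${each.key}"
  zone          = "asia-south1-a"
  machine_type  = each.value.machine_type

  accelerator_type      = "NVIDIA_TESLA_T4"
  idle_shutdown_minutes = 30
  instance_owners       = [each.value.owner_email]

  network = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet  = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"
  kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"

  environment = "prod"
  team        = "fraud-ml"
  cost_center = "CC-4471"
}

# Read-only access to the one dataset, granted per notebook identity.
resource "google_bigquery_dataset_iam_member" "fraud_features_reader" {
  for_each   = module.fraud_notebooks
  project    = "kv-data-warehouse-prod"
  dataset_id = "fraud_features"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${each.value.service_account_email}"
}
```

Removing an analyst from the map destroys both the instance and its dataset grant in the same apply, which keeps offboarding as reviewable as onboarding.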
Best practices
- Always run private. Keep disable_public_ip = true (module default) and put the instance on a subnet with Private Google Access so it reaches BigQuery, GCS, and Artifact Registry without a route to the internet. For fully air-gapped access set disable_proxy_access = true and reach JupyterLab through IAP or a VPN to the private IP.
- One notebook, one identity, one user. Let the module mint a dedicated service account per instance and pin instance_owners to a single analyst. Never reuse the default Compute Engine SA; grant data access narrowly at the dataset/bucket level via the service_account_email output, not broad project roles.
- Enforce idle-shutdown, especially on GPUs. A T4 or A100 left running overnight is the single biggest avoidable cost on Workbench. Keep idle_shutdown_minutes low (30–60) and treat any request to disable it as a budget exception that needs sign-off.
- CMEK everywhere and keep the keyring co-located. Supply kms_key for both boot and data disks, and make sure the Workbench service agent has roles/cloudkms.cryptoKeyEncrypterDecrypter on that key, with the keyring in the same region as the zone to avoid cross-region key calls.
- Hold state on the data disk, not the boot disk. Keep work under /home/jupyter on the persistent data_disks volume so you can recreate or upgrade the instance (the module ignores in-place vm_image drift) without losing a scientist’s notebooks.
- Standardise naming and labels for chargeback. Use a predictable team-purpose-user instance name and always populate environment, team, and cost_center so every notebook’s spend rolls up cleanly in billing exports and BigQuery cost reports.
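The CMEK grant mentioned in the list above sits outside the module. A minimal sketch of it is below, under two assumptions that should be checked against current Google documentation: that the service agents follow the address formats shown, and that both the Workbench (Notebooks) agent and the Compute Engine agent need the role because the disks are ordinary persistent disks. Project and key names are reused from the usage example.

```hcl
# Look up the project number of the project that hosts the notebooks.
data "google_project" "notebooks" {
  project_id = "kv-ml-platform-prod"
}

locals {
  notebooks_kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"

  # Assumed service agent address formats; verify before relying on them.
  disk_encryption_agents = {
    notebooks = "service-${data.google_project.notebooks.number}@gcp-sa-notebooks.iam.gserviceaccount.com"
    compute   = "service-${data.google_project.notebooks.number}@compute-system.iam.gserviceaccount.com"
  }
}

# Let each agent encrypt and decrypt with the notebooks key, and nothing broader.
resource "google_kms_crypto_key_iam_member" "disk_encryption" {
  for_each      = local.disk_encryption_agents
  crypto_key_id = local.notebooks_kms_key
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:${each.value}"
}
```

Binding at the key level rather than the keyring or project keeps the grant as narrow as the rest of the module's IAM.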