Terraform Module: GCP Vertex AI Workbench — governed, private-by-default notebooks for data science teams

Quick take — A reusable Terraform module for GCP Vertex AI Workbench (google_workbench_instance): private-IP JupyterLab instances with idle shutdown, CMEK disks, Shielded VM, and least-privilege service accounts. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "vertex_workbench" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "..."  # GCP project ID that will host the Workbench instance.
  instance_name = "..."  # Instance name; lowercase letters, digits, hyphens; star…
  zone          = "..."  # Zone such as `asia-south1-a`.
  network       = "..."  # VPC network self-link or name.
  subnet        = "..."  # Subnetwork self-link (Private Google Access enabled).
  environment   = "..."  # `dev`/`staging`/`prod` label.
  team          = "..."  # Owning team (cost-allocation label).
  cost_center   = "..."  # Cost-centre code (chargeback label).
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.

What this module is

Vertex AI Workbench is GCP’s managed JupyterLab environment for data scientists and ML engineers. A Workbench instance is essentially a Compute Engine VM, booted from a curated Deep Learning VM image (TensorFlow, PyTorch, RAPIDS, or a base image), that exposes a JupyterLab UI you reach through a Google-proxied URL or directly over a private IP. Google handles the OS image, the notebook runtime, optional GPU attachment, and lifecycle features like idle auto-shutdown so an idle GPU box doesn’t quietly burn your budget overnight.

The catch is that the defaults are not what a regulated enterprise wants. Created from the console, a Workbench instance tends to land with a public IP, the default Compute Engine service account (which receives the broad project Editor role unless an organization policy blocks that grant), no customer-managed encryption, and no consistent labelling. Multiply that across forty data scientists who each spin up their own box and you have a sprawling, unauditable, non-compliant estate.

This module wraps google_workbench_instance (the current GA resource under hashicorp/google ~> 5.0, replacing the deprecated google_notebooks_instance) so that every notebook is born compliant: private IP only, Shielded VM with Secure Boot, vTPM, and integrity monitoring, CMEK-encrypted boot and data disks whenever kms_key is supplied, a dedicated least-privilege service account, idle shutdown enabled by default (60 minutes), and a standard label set for cost allocation. Data scientists get self-service notebooks; platform and security teams get a single, reviewable definition of what “a notebook” is allowed to be.

When to use it

You run a data science or ML platform and want to give analysts self-service JupyterLab without handing them the GCP console or letting them provision public-IP VMs.
You need notebooks that sit inside a Shared VPC and reach BigQuery, Cloud Storage, and on-prem data over Private Google Access / Private Service Connect, never the public internet.
Compliance requires CMEK on all persistent disks and Shielded VM posture on every compute instance, notebooks included.
You want predictable cost: enforced idle-shutdown timers, right-sized machine types, and labels that map every rupee of spend back to a team and cost centre.
You are standardising on GPU notebooks (e.g. an NVIDIA_TESLA_T4 for prototyping) and want the GPU type, count, and driver install captured in code rather than clicked per-instance.

If you only need a single throwaway notebook for a one-hour spike, the console is faster. This module pays off the moment notebooks become shared infrastructure that auditors will ask about.

Module structure

terraform-module-gcp-vertex-workbench/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_workbench_instance + dedicated SA + IAM
├── variables.tf     # all knobs, with validation
├── outputs.tf       # id, name, proxy URI, service account, state
└── README.md

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Standard labels merged onto every resource for cost allocation + governance.
  base_labels = {
    managed-by   = "terraform"
    module       = "vertex-workbench"
    environment  = var.environment
    team         = var.team
    cost-center  = var.cost_center
  }

  labels = merge(local.base_labels, var.additional_labels)

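  # Derive a stable SA account_id from the instance name. Account IDs must be
  # 6-30 characters and may not end with a hyphen, so strip any trailing
  # hyphens left behind by truncation.
  sa_account_id = replace(substr("wb-${var.instance_name}", 0, 30), "/-+$/", "")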
}

# Dedicated, least-privilege service account for this notebook.
# Created only when the caller does not pass an existing SA email.
resource "google_service_account" "workbench" {
  count        = var.service_account_email == null ? 1 : 0
  project      = var.project_id
  account_id   = local.sa_account_id
  display_name = "Vertex Workbench SA - ${var.instance_name}"
  description  = "Runtime identity for Vertex AI Workbench instance ${var.instance_name}"
}

locals {
  effective_sa_email = coalesce(
    var.service_account_email,
    try(google_service_account.workbench[0].email, null)
  )
}

# Minimal project-level roles the notebook runtime needs to function.
# Grant data-access (BigQuery, GCS buckets) at the resource level outside this module.
resource "google_project_iam_member" "workbench_roles" {
  for_each = var.service_account_email == null ? toset(var.service_account_roles) : toset([])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.workbench[0].email}"
}

resource "google_workbench_instance" "this" {
  project  = var.project_id
  name     = var.instance_name
  location = var.zone

  gce_setup {
    machine_type = var.machine_type

    # Disable external IP. The instance is reachable only over the private
    # network and the Google-managed proxy URI.
    disable_public_ip = true

    # Shielded VM posture — required for most CIS / regulated baselines.
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }

    # Optional GPU accelerator (e.g. NVIDIA_TESLA_T4). Driver auto-installed
    # so notebooks get CUDA without manual setup.
    dynamic "accelerator_configs" {
      for_each = var.accelerator_type == null ? [] : [1]
      content {
        type       = var.accelerator_type
        core_count = var.accelerator_count
      }
    }

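    # Boot disk: CMEK-encrypted when a KMS key is supplied. kms_key only takes
    # effect when disk_encryption is "CMEK"; null leaves the provider default
    # (Google-managed keys).
    boot_disk {
      disk_type       = var.boot_disk_type
      disk_size_gb    = var.boot_disk_size_gb
      disk_encryption = var.kms_key == null ? null : "CMEK"
      kms_key         = var.kms_key
    }

    # Data disk: separate, persistent, encrypted with the same key.
    data_disks {
      disk_type       = var.data_disk_type
      disk_size_gb    = var.data_disk_size_gb
      disk_encryption = var.kms_key == null ? null : "CMEK"
      kms_key         = var.kms_key
    }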

    network_interfaces {
      network = var.network
      subnet  = var.subnet
      # nic_type left to provider default (GVNIC on supported images).
    }

    service_accounts {
      email = local.effective_sa_email
    }

    # Curated Deep Learning image. Default channel keeps the notebook patched.
    vm_image {
      project = var.image_project
      family  = var.image_family
    }

    # Hardening + lifecycle metadata.
    metadata = merge(
      {
        # Auto-shut-down after N minutes idle to control cost.
        "idle-timeout-seconds" = tostring(var.idle_shutdown_minutes * 60)
        # Block project-wide SSH keys; access is via JupyterLab proxy / IAP.
        "block-project-ssh-keys" = "true"
        # Disable the per-instance Jupyter "terminal as root" surface.
        "notebook-disable-root" = "true"
        # Report instance health/metrics to the Workbench control plane.
        "report-system-health" = "true"
      },
      var.metadata
    )

    # Do not let the instance forward packets that are not addressed to it.
    enable_ip_forwarding = false

    tags = var.network_tags
  }

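  # Restrict JupyterLab access to the listed user email (single-user mode).
  # When empty, access falls back to principals allowed to act as the
  # instance's service account.
  instance_owners = var.instance_owners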

  # Disable the public proxy access when running fully private (IAP / VPN only).
  disable_proxy_access = var.disable_proxy_access

  labels = local.labels

  lifecycle {
    # Image family updates are applied via a controlled re-create, not silently.
    ignore_changes = [
      gce_setup[0].vm_image,
    ]
  }
}
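
One gap in the configuration above: accelerator_type and machine_type are validated independently, yet E2 machine types (including the module default e2-standard-4) cannot attach GPUs. A minimal sketch of a guard follows, assuming Terraform 1.5+ check blocks are acceptable in your pipeline. The block name gpu_machine_type is hypothetical and not part of the published module.

```hcl
# Hypothetical addition to main.tf; not part of the published module.
# E2 machine types do not support GPU attachment, so flag that pairing
# at plan time instead of waiting for the API to reject it.
check "gpu_machine_type" {
  assert {
    condition     = var.accelerator_type == null || !startswith(var.machine_type, "e2-")
    error_message = "accelerator_type is set but machine_type is an E2 type, which cannot attach GPUs. Pick a GPU-capable family, for example N1 for NVIDIA_TESLA_T4."
  }
}
```

A failed check surfaces as a warning and does not block the apply. Moving the same condition into a lifecycle precondition on google_workbench_instance.this would turn it into a hard error.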

variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID that will host the Workbench instance."
}

variable "instance_name" {
  type        = string
  description = "Name of the Workbench instance. Lowercase letters, digits and hyphens; must start with a letter."

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{0,62}$", var.instance_name))
    error_message = "instance_name must start with a lowercase letter and contain only lowercase letters, digits, and hyphens (max 63 chars)."
  }
}

variable "zone" {
  type        = string
  description = "Zone for the instance, e.g. asia-south1-a."

  validation {
    condition     = can(regex("^[a-z]+-[a-z0-9]+-[a-z]$", var.zone))
    error_message = "zone must be a valid GCP zone such as asia-south1-a."
  }
}

variable "machine_type" {
  type        = string
  description = "Compute Engine machine type for the notebook, e.g. n1-standard-4 or e2-standard-8."
  default     = "e2-standard-4"
}

variable "network" {
  type        = string
  description = "Self-link or short name of the VPC network the instance attaches to."
}

variable "subnet" {
  type        = string
  description = "Self-link of the subnetwork (must have Private Google Access enabled)."
}

variable "network_tags" {
  type        = list(string)
  description = "Network tags applied to the instance for firewall targeting."
  default     = []
}

variable "service_account_email" {
  type        = string
  description = "Optional existing service account email for the notebook runtime. If null, a dedicated SA is created by the module."
  default     = null
}

variable "service_account_roles" {
  type        = list(string)
  description = "Project-level roles granted to the module-created service account. Ignored when service_account_email is supplied."
  default = [
    "roles/aiplatform.user",
    "roles/storage.objectViewer",
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ]
}

variable "image_project" {
  type        = string
  description = "Project hosting the VM image family."
  default     = "deeplearning-platform-release"
}

variable "image_family" {
  type        = string
  description = "Deep Learning VM image family (e.g. workbench-instances, tf-latest-cpu, pytorch-latest-gpu)."
  default     = "workbench-instances"
}

variable "boot_disk_type" {
  type        = string
  description = "Boot disk type."
  default     = "PD_SSD"

  validation {
    condition     = contains(["PD_STANDARD", "PD_SSD", "PD_BALANCED"], var.boot_disk_type)
    error_message = "boot_disk_type must be one of PD_STANDARD, PD_SSD, PD_BALANCED."
  }
}

variable "boot_disk_size_gb" {
  type        = number
  description = "Boot disk size in GB."
  default     = 150

  validation {
    condition     = var.boot_disk_size_gb >= 100 && var.boot_disk_size_gb <= 65536
    error_message = "boot_disk_size_gb must be between 100 and 65536."
  }
}

variable "data_disk_type" {
  type        = string
  description = "Persistent data disk type (holds /home/jupyter)."
  default     = "PD_BALANCED"

  validation {
    condition     = contains(["PD_STANDARD", "PD_SSD", "PD_BALANCED"], var.data_disk_type)
    error_message = "data_disk_type must be one of PD_STANDARD, PD_SSD, PD_BALANCED."
  }
}

variable "data_disk_size_gb" {
  type        = number
  description = "Persistent data disk size in GB."
  default     = 200

  validation {
    condition     = var.data_disk_size_gb >= 100 && var.data_disk_size_gb <= 65536
    error_message = "data_disk_size_gb must be between 100 and 65536."
  }
}

variable "kms_key" {
  type        = string
  description = "CMEK key resource ID for disk encryption (projects/<p>/locations/<l>/keyRings/<r>/cryptoKeys/<k>). Null uses Google-managed keys."
  default     = null
}

variable "accelerator_type" {
  type        = string
  description = "Optional GPU accelerator type, e.g. NVIDIA_TESLA_T4. Null means CPU-only."
  default     = null

  validation {
    condition = var.accelerator_type == null || contains([
      "NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100", "NVIDIA_TESLA_P100",
      "NVIDIA_TESLA_A100", "NVIDIA_A100_80GB", "NVIDIA_L4",
    ], coalesce(var.accelerator_type, "none"))
    error_message = "accelerator_type must be a supported Vertex Workbench GPU type or null."
  }
}

variable "accelerator_count" {
  type        = number
  description = "Number of GPUs to attach when accelerator_type is set."
  default     = 1

  validation {
    condition     = var.accelerator_count >= 1 && var.accelerator_count <= 8
    error_message = "accelerator_count must be between 1 and 8."
  }
}

variable "idle_shutdown_minutes" {
  type        = number
  description = "Minutes of inactivity before the notebook auto-stops. Set to 0 to disable (not recommended)."
  default     = 60

  validation {
    condition     = var.idle_shutdown_minutes >= 0 && var.idle_shutdown_minutes <= 1440
    error_message = "idle_shutdown_minutes must be between 0 and 1440 (24h)."
  }
}

variable "instance_owners" {
  type        = list(string)
  description = "User emails permitted to access the JupyterLab UI (single-user notebooks). Empty for service-managed access."
  default     = []
}

variable "disable_proxy_access" {
  type        = bool
  description = "If true, disables the public Google proxy URI; access only via IAP/VPN to the private IP."
  default     = false
}

variable "environment" {
  type        = string
  description = "Deployment environment label (e.g. dev, staging, prod)."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, prod."
  }
}

variable "team" {
  type        = string
  description = "Owning team, used as a label for cost allocation."
}

variable "cost_center" {
  type        = string
  description = "Cost-centre code, used as a label for chargeback."
}

variable "metadata" {
  type        = map(string)
  description = "Extra instance metadata key/value pairs, merged onto the module defaults."
  default     = {}
}

variable "additional_labels" {
  type        = map(string)
  description = "Extra labels merged onto the standard label set."
  default     = {}
}

outputs.tf

output "id" {
  description = "Fully-qualified resource ID of the Workbench instance."
  value       = google_workbench_instance.this.id
}

output "name" {
  description = "Name of the Workbench instance."
  value       = google_workbench_instance.this.name
}

output "proxy_uri" {
  description = "Google-managed proxy URI to open the JupyterLab UI (empty when proxy access is disabled)."
  value       = google_workbench_instance.this.proxy_uri
}

output "state" {
  description = "Current lifecycle state of the instance (e.g. ACTIVE, STOPPED)."
  value       = google_workbench_instance.this.state
}

output "service_account_email" {
  description = "Service account email the notebook runs as."
  value       = local.effective_sa_email
}

output "creator" {
  description = "Email of the principal that created the instance."
  value       = google_workbench_instance.this.creator
}

output "health_state" {
  description = "Reported health state of the instance from the Workbench control plane."
  value       = google_workbench_instance.this.health_state
}

How to use it

module "vertex_ai_workbench" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "kv-ml-platform-prod"
  instance_name = "fraud-ds-anita"
  zone          = "asia-south1-a"

  machine_type = "n1-standard-8"

  # GPU notebook for model prototyping.
  accelerator_type  = "NVIDIA_TESLA_T4"
  accelerator_count = 1

  # Attach to the Shared VPC; subnet has Private Google Access on.
  network = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet  = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"

  network_tags = ["workbench", "egress-via-proxy"]

  # CMEK on all disks.
  kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"

  # Single-user notebook locked to one analyst.
  instance_owners = ["anita@kloudvin.com"]

  # Aggressive idle-shutdown for an expensive GPU box.
  idle_shutdown_minutes = 30

  data_disk_size_gb = 500

  environment = "prod"
  team        = "fraud-ml"
  cost_center = "CC-4471"

  additional_labels = {
    project = "realtime-fraud-scoring"
  }
}

# Downstream: grant this notebook's SA read access to a specific BigQuery
# dataset, using the module's service_account_email output.
resource "google_bigquery_dataset_iam_member" "notebook_reader" {
  project    = "kv-data-warehouse-prod"
  dataset_id = "fraud_features"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${module.vertex_ai_workbench.service_account_email}"
}

output "notebook_url" {
  value = module.vertex_ai_workbench.proxy_uri
}
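
The example above lets the module create the service account. For the other path, here is a hedged sketch that passes an existing identity through service_account_email. The project, instance name, and service account email are hypothetical. Because the module uses count on whether this input is null, the value needs to be known at plan time: a literal string (or a data source lookup) works, whereas the email of a service account created in the same apply can fail with an "Invalid count argument" error.

```hcl
# Hypothetical dev notebook that reuses a pre-existing service account.
module "vertex_workbench_shared_sa" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "kv-ml-platform-dev" # hypothetical project
  instance_name = "fraud-ds-sandbox"   # hypothetical name
  zone          = "asia-south1-a"

  network = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet  = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"

  # Literal email, known at plan time. With this set, the module skips
  # creating its own SA and ignores service_account_roles.
  service_account_email = "notebooks-dev@kv-ml-platform-dev.iam.gserviceaccount.com" # hypothetical

  environment = "dev"
  team        = "fraud-ml"
  cost_center = "CC-4471"
}
```

In this mode the module grants no project roles, so the supplied account must already hold whatever the notebook runtime needs (the defaults listed under service_account_roles are a reasonable starting point).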

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}
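
The root config above covers only the backend. A minimal sketch of the matching provider block, generated once and inherited by every child config, could look like this. The project and region are the same placeholders used in the Quickstart, and the output file name provider.tf is an assumption.

```hcl
# live/terragrunt.hcl (continued): hypothetical provider generation.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"   # placeholder, as in the Quickstart
  region  = "us-central1"  # placeholder, as in the Quickstart
}
EOF
}
```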

2. Module config — live/prod/vertex_workbench/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
}

inputs = {
  project_id = "..."
  instance_name = "..."
  zone = "..."
  network = "..."
  subnet = "..."
  environment = "..."
  team = "..."
  cost_center = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/vertex_workbench && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
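
To make the dependency orchestration concrete, here is a hedged sketch of a child config that reads the network and subnet from another Terragrunt unit. The ../network path and its network_self_link and subnet_self_link outputs are hypothetical; substitute whatever your networking module actually exposes.

```hcl
# live/prod/vertex_workbench/terragrunt.hcl (variant with a dependency)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
}

# Hypothetical sibling unit that owns the VPC and subnet.
dependency "network" {
  config_path = "../network"
}

inputs = {
  project_id    = "..."
  instance_name = "..."
  zone          = "..."

  # Output names are assumptions about the network unit.
  network = dependency.network.outputs.network_self_link
  subnet  = dependency.network.outputs.subnet_self_link

  environment = "..."
  team        = "..."
  cost_center = "..."
}
```

With the dependency declared, run-all applies the network unit before this one.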

Inputs

Name	Type	Default	Required	Description
project_id	string	—	yes	GCP project ID that will host the Workbench instance.
instance_name	string	—	yes	Instance name; lowercase letters, digits, hyphens; starts with a letter.
zone	string	—	yes	Zone such as `asia-south1-a`.
machine_type	string	`e2-standard-4`	no	Compute Engine machine type.
network	string	—	yes	VPC network self-link or name.
subnet	string	—	yes	Subnetwork self-link (Private Google Access enabled).
network_tags	list(string)	`[]`	no	Network tags for firewall targeting.
service_account_email	string	`null`	no	Existing SA email; null creates a dedicated SA.
service_account_roles	list(string)	`[roles/aiplatform.user, roles/storage.objectViewer, roles/logging.logWriter, roles/monitoring.metricWriter]`	no	Roles granted to the module-created SA.
image_project	string	`deeplearning-platform-release`	no	Project hosting the VM image family.
image_family	string	`workbench-instances`	no	Deep Learning VM image family.
boot_disk_type	string	`PD_SSD`	no	Boot disk type (`PD_STANDARD`/`PD_SSD`/`PD_BALANCED`).
boot_disk_size_gb	number	`150`	no	Boot disk size in GB (100–65536).
data_disk_type	string	`PD_BALANCED`	no	Data disk type (holds `/home/jupyter`).
data_disk_size_gb	number	`200`	no	Data disk size in GB (100–65536).
kms_key	string	`null`	no	CMEK key resource ID for disk encryption.
accelerator_type	string	`null`	no	GPU type (e.g. `NVIDIA_TESLA_T4`); null = CPU-only.
accelerator_count	number	`1`	no	Number of GPUs (1–8) when `accelerator_type` set.
idle_shutdown_minutes	number	`60`	no	Idle minutes before auto-stop; 0 disables.
instance_owners	list(string)	`[]`	no	User emails allowed into the JupyterLab UI.
disable_proxy_access	bool	`false`	no	Disable the public proxy URI (IAP/VPN-only access).
environment	string	—	yes	`dev`/`staging`/`prod` label.
team	string	—	yes	Owning team (cost-allocation label).
cost_center	string	—	yes	Cost-centre code (chargeback label).
metadata	map(string)	`{}`	no	Extra instance metadata, merged onto defaults.
additional_labels	map(string)	`{}`	no	Extra labels merged onto the standard set.

Outputs

Name	Description
id	Fully-qualified resource ID of the Workbench instance.
name	Name of the Workbench instance.
proxy_uri	Google-managed proxy URI for the JupyterLab UI (empty when proxy access disabled).
state	Current lifecycle state (e.g. `ACTIVE`, `STOPPED`).
service_account_email	Service account email the notebook runs as.
creator	Email of the principal that created the instance.
health_state	Reported health state from the Workbench control plane.

Enterprise scenario

A fintech’s fraud-modelling group needs forty data scientists able to spin up GPU notebooks against a feature store in BigQuery, but their PCI-DSS scope forbids any public-IP compute and mandates CMEK on all persistent storage. The platform team exposes this module behind a thin self-service wrapper: an analyst opens a pull request with their name, team, and required machine type, and CI applies a fraud-ds-<name> instance that lands private-only, Shielded, CMEK-encrypted, and owned by exactly one user — then grants its dedicated service account read-only access to just the fraud_features dataset. The enforced 30-minute idle-shutdown alone cut their monthly notebook GPU spend by roughly 40% versus the previous click-ops fleet that ran around the clock.
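
A thin wrapper of this kind can be sketched with for_each over a map of approved requests. This is an illustration under stated assumptions, not the fintech's actual code: the analysts map, the second analyst, and the module label fraud_notebooks are hypothetical, while the project, network, and key values are reused from the usage example earlier.

```hcl
locals {
  # Hypothetical request map; in practice it would come from the
  # pull-request-reviewed file each analyst edits.
  analysts = {
    anita = { email = "anita@kloudvin.com", machine_type = "n1-standard-8" }
    ravi  = { email = "ravi@kloudvin.com", machine_type = "n1-standard-4" } # hypothetical
  }
}

module "fraud_notebooks" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
  for_each = local.analysts

  project_id    = "kv-ml-platform-prod"
  instance_name = "fraud-ds-${each.key}"
  zone          = "asia-south1-a"
  machine_type  = each.value.machine_type

  accelerator_type = "NVIDIA_TESLA_T4"

  network = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet  = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"
  kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"

  # One notebook, one owner.
  instance_owners       = [each.value.email]
  idle_shutdown_minutes = 30

  environment = "prod"
  team        = "fraud-ml"
  cost_center = "CC-4471"
}

# Read-only access to the single dataset, one binding per notebook SA.
resource "google_bigquery_dataset_iam_member" "fraud_features_reader" {
  for_each = local.analysts

  project    = "kv-data-warehouse-prod"
  dataset_id = "fraud_features"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${module.fraud_notebooks[each.key].service_account_email}"
}

output "notebook_urls" {
  value = { for name, nb in module.fraud_notebooks : name => nb.proxy_uri }
}
```

Adding or removing an analyst is then a one-line change to the map, which keeps the pull request small and easy to review.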

Best practices

Always run private. The module hard-codes disable_public_ip = true; put the instance on a subnet with Private Google Access so it reaches BigQuery, GCS, and Artifact Registry without a route to the internet. For fully private access set disable_proxy_access = true and reach JupyterLab through IAP or a VPN to the private IP.
One notebook, one identity, one user. Let the module mint a dedicated service account per instance and pin instance_owners to a single analyst. Never reuse the default Compute Engine SA — grant data access narrowly at the dataset/bucket level via the service_account_email output, not broad project roles.
Enforce idle-shutdown, especially on GPUs. A T4 or A100 left running overnight is the single biggest avoidable cost on Workbench. Keep idle_shutdown_minutes low (30–60) and treat any request to disable it as a budget exception that needs sign-off.
CMEK everywhere and keep the keyring co-located. Supply kms_key (the module applies the same key to both the boot and data disks), and make sure the Workbench service agent has roles/cloudkms.cryptoKeyEncrypterDecrypter on that key, with the keyring in the same region as the instance's zone to avoid cross-region key calls.
Hold state on the data disk, not the boot disk. Keep work under /home/jupyter on the persistent data_disks volume so you can recreate or upgrade the instance (the module ignores in-place vm_image drift) without losing a scientist’s notebooks.
Standardise naming and labels for chargeback. Use a predictable team-purpose-user instance name and always populate environment, team, and cost_center so every notebook’s spend rolls up cleanly in billing exports and BigQuery cost reports.
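
The CMEK grant mentioned above can be captured in code next to the key. The following is a sketch under stated assumptions rather than a definitive recipe: the practice above names the Workbench service agent, and because disk encryption on the underlying VM is carried out by Compute Engine, this example grants the role to both the Notebooks and Compute Engine service agents. Verify which agents your setup actually requires before adopting it. The resource labels are hypothetical; the project and key path are reused from the usage example.

```hcl
# Look up the project number of the project that hosts the notebooks.
data "google_project" "notebooks" {
  project_id = "kv-ml-platform-prod"
}

locals {
  # Assumption: both agents need access to the key. Confirm against
  # your own project before relying on this list.
  cmek_service_agents = {
    compute   = "service-${data.google_project.notebooks.number}@compute-system.iam.gserviceaccount.com"
    notebooks = "service-${data.google_project.notebooks.number}@gcp-sa-notebooks.iam.gserviceaccount.com"
  }
}

# Grant encrypt/decrypt on the single notebooks key, not the whole keyring.
resource "google_kms_crypto_key_iam_member" "workbench_cmek" {
  for_each = local.cmek_service_agents

  crypto_key_id = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:${each.value}"
}
```

Scoping the binding to the key rather than the keyring or project keeps the grant as narrow as the rest of the module's identity model.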