Terraform Module: GCP Vertex AI Workbench — governed, private-by-default notebooks for data science teams

Quick take — A reusable Terraform module for GCP Vertex AI Workbench (google_workbench_instance): private-IP JupyterLab instances with idle shutdown, CMEK disks, Shielded VM, and least-privilege service accounts. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "vertex_workbench" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "..."  # GCP project ID that will host the Workbench instance.
  instance_name = "..."  # Instance name; lowercase letters, digits, hyphens; star…
  zone          = "..."  # Zone such as `asia-south1-a`.
  network       = "..."  # VPC network self-link or name.
  subnet        = "..."  # Subnetwork self-link (Private Google Access enabled).
  environment   = "..."  # `dev`/`staging`/`prod` label.
  team          = "..."  # Owning team (cost-allocation label).
  cost_center   = "..."  # Cost-centre code (chargeback label).
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Vertex AI Workbench is GCP’s managed JupyterLab environment for data scientists and ML engineers. A Workbench instance is essentially a Compute Engine VM, booted from a curated Deep Learning VM image (TensorFlow, PyTorch, RAPIDS, or a base image), that exposes a JupyterLab UI you reach through a Google-proxied URL or directly over a private IP. Google handles the OS image, the notebook runtime, optional GPU attachment, and lifecycle features like idle auto-shutdown so an idle GPU box doesn’t quietly burn your budget overnight.

The catch is that the defaults are not what a regulated enterprise wants. Created from the console, a Workbench instance tends to land with a public IP, the default Compute Engine service account (which typically carries the broad project-level Editor role), no customer-managed encryption, and no consistent labelling. Multiply that across forty data scientists who each spin up their own box and you have a sprawling, unauditable, non-compliant estate.

This module wraps google_workbench_instance (the current GA resource under hashicorp/google ~> 5.0, replacing the deprecated google_notebooks_instance) so that every notebook is born compliant: private IP only, Shielded VM with Secure Boot, vTPM, and integrity monitoring, CMEK-encrypted boot and data disks whenever a KMS key is supplied, a dedicated least-privilege service account, idle shutdown enabled by default, and a standard label set for cost allocation. Data scientists get self-service notebooks; platform and security teams get a single, reviewable definition of what “a notebook” is allowed to be.

When to use it

If you only need a single throwaway notebook for a one-hour spike, the console is faster. This module pays off the moment notebooks become shared infrastructure that auditors will ask about.

Module structure

terraform-module-gcp-vertex-workbench/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_workbench_instance + dedicated SA + IAM
├── variables.tf     # all knobs, with validation
├── outputs.tf       # id, name, proxy URI, service account, state
└── README.md

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Standard labels merged onto every resource for cost allocation + governance.
  base_labels = {
    managed-by   = "terraform"
    module       = "vertex-workbench"
    environment  = var.environment
    team         = var.team
    cost-center  = var.cost_center
  }

  labels = merge(local.base_labels, var.additional_labels)

  # Derive a stable SA account_id (<= 30 chars, RFC1035-ish) from the instance name.
  # Truncation can leave a trailing hyphen, which account_id rejects, so trim it.
  sa_account_id = trimsuffix(substr("wb-${var.instance_name}", 0, 30), "-")
}

# Dedicated, least-privilege service account for this notebook.
# Created only when the caller does not pass an existing SA email.
resource "google_service_account" "workbench" {
  count        = var.service_account_email == null ? 1 : 0
  project      = var.project_id
  account_id   = local.sa_account_id
  display_name = "Vertex Workbench SA - ${var.instance_name}"
  description  = "Runtime identity for Vertex AI Workbench instance ${var.instance_name}"
}

locals {
  effective_sa_email = coalesce(
    var.service_account_email,
    try(google_service_account.workbench[0].email, null)
  )
}

# Minimal project-level roles the notebook runtime needs to function.
# Grant data-access (BigQuery, GCS buckets) at the resource level outside this module.
resource "google_project_iam_member" "workbench_roles" {
  for_each = var.service_account_email == null ? toset(var.service_account_roles) : toset([])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.workbench[0].email}"
}

resource "google_workbench_instance" "this" {
  project  = var.project_id
  name     = var.instance_name
  location = var.zone

  gce_setup {
    machine_type = var.machine_type

    # Disable external IP. The instance is reachable only over the private
    # network and the Google-managed proxy URI.
    disable_public_ip = true

    # Shielded VM posture — required for most CIS / regulated baselines.
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }

    # Optional GPU accelerator (e.g. NVIDIA_TESLA_T4). Driver auto-installed
    # so notebooks get CUDA without manual setup.
    dynamic "accelerator_configs" {
      for_each = var.accelerator_type == null ? [] : [1]
      content {
        type       = var.accelerator_type
        core_count = var.accelerator_count
      }
    }

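    # Boot disk: CMEK-encrypted when a KMS key is supplied, Google-managed otherwise.
    # kms_key only takes effect when disk_encryption is set to CMEK.
    boot_disk {
      disk_type       = var.boot_disk_type
      disk_size_gb    = var.boot_disk_size_gb
      disk_encryption = var.kms_key == null ? "GMEK" : "CMEK"
      kms_key         = var.kms_key
    }

    # Data disk: separate, persistent, encrypted the same way as the boot disk.
    data_disks {
      disk_type       = var.data_disk_type
      disk_size_gb    = var.data_disk_size_gb
      disk_encryption = var.kms_key == null ? "GMEK" : "CMEK"
      kms_key         = var.kms_key
    }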

    network_interfaces {
      network = var.network
      subnet  = var.subnet
      # nic_type left to provider default (GVNIC on supported images).
    }

    service_accounts {
      email = local.effective_sa_email
    }

    # Curated Deep Learning image. Default channel keeps the notebook patched.
    vm_image {
      project = var.image_project
      family  = var.image_family
    }

    # Hardening + lifecycle metadata.
    metadata = merge(
      {
        # Auto-shut-down after N minutes idle to control cost.
        "idle-timeout-seconds" = tostring(var.idle_shutdown_minutes * 60)
        # Block project-wide SSH keys; access is via JupyterLab proxy / IAP.
        "block-project-ssh-keys" = "true"
        # Disable the per-instance Jupyter "terminal as root" surface.
        "notebook-disable-root" = "true"
        # Report instance health/metrics to the Workbench control plane.
        "report-system-health" = "true"
      },
      var.metadata
    )

    # Never let the VM forward packets it did not originate (not a router/NAT).
    enable_ip_forwarding = false

    tags = var.network_tags
  }

  # User email(s) allowed to open the JupyterLab UI. These are individual
  # accounts, not domains or groups; the API may accept only a single owner.
  instance_owners = var.instance_owners

  # Disable the public proxy access when running fully private (IAP / VPN only).
  disable_proxy_access = var.disable_proxy_access

  labels = local.labels

  lifecycle {
    # Image family updates are applied via a controlled re-create, not silently.
    ignore_changes = [
      gce_setup[0].vm_image,
    ]
  }
}
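
A CMEK key is only usable if Google's service agents are allowed to encrypt and decrypt with it; otherwise disk creation typically fails at apply time. The sketch below is not part of the module as published. It assumes that both the Compute Engine service agent and the Notebooks service agent need roles/cloudkms.cryptoKeyEncrypterDecrypter on the key, and that both agents already exist in the project. The names host, kms_agents, and cmek_users are hypothetical; check the agent list against the current Workbench CMEK documentation before relying on it.

```hcl
# Sketch only: could sit next to main.tf, or be adapted to the calling stack.
data "google_project" "host" {
  project_id = var.project_id
}

locals {
  # Assumed service agents that must be able to use the CMEK key.
  # The map is empty when no key is supplied, so nothing is created.
  kms_agents = {
    for name, email in {
      compute   = "service-${data.google_project.host.number}@compute-system.iam.gserviceaccount.com"
      notebooks = "service-${data.google_project.host.number}@gcp-sa-notebooks.iam.gserviceaccount.com"
    } : name => email if var.kms_key != null
  }
}

resource "google_kms_crypto_key_iam_member" "cmek_users" {
  for_each = local.kms_agents

  crypto_key_id = var.kms_key
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:${each.value}"
}
```

If the key lives in a separate security project, the principal running Terraform also needs permission to set IAM on that key, which is one reason some teams keep this grant in the key-owning stack instead.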

variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID that will host the Workbench instance."
}

variable "instance_name" {
  type        = string
  description = "Name of the Workbench instance. Lowercase letters, digits and hyphens; must start with a letter."

  validation {
    condition     = can(regex("^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$", var.instance_name))
    error_message = "instance_name must start with a lowercase letter, end with a letter or digit, and contain only lowercase letters, digits, and hyphens (max 63 chars)."
  }
}

variable "zone" {
  type        = string
  description = "Zone for the instance, e.g. asia-south1-a."

  validation {
    condition     = can(regex("^[a-z]+-[a-z0-9]+-[a-z]$", var.zone))
    error_message = "zone must be a valid GCP zone such as asia-south1-a."
  }
}

variable "machine_type" {
  type        = string
  description = "Compute Engine machine type for the notebook, e.g. n1-standard-4 or e2-standard-8."
  default     = "e2-standard-4"
}

variable "network" {
  type        = string
  description = "Self-link or short name of the VPC network the instance attaches to."
}

variable "subnet" {
  type        = string
  description = "Self-link of the subnetwork (must have Private Google Access enabled)."
}

variable "network_tags" {
  type        = list(string)
  description = "Network tags applied to the instance for firewall targeting."
  default     = []
}

variable "service_account_email" {
  type        = string
  description = "Optional existing service account email for the notebook runtime. If null, a dedicated SA is created by the module."
  default     = null
}

variable "service_account_roles" {
  type        = list(string)
  description = "Project-level roles granted to the module-created service account. Ignored when service_account_email is supplied."
  default = [
    "roles/aiplatform.user",
    "roles/storage.objectViewer",
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ]
}

variable "image_project" {
  type        = string
  description = "Project hosting the VM image family."
  default     = "deeplearning-platform-release"
}

variable "image_family" {
  type        = string
  description = "Deep Learning VM image family (e.g. workbench-instances, tf-latest-cpu, pytorch-latest-gpu)."
  default     = "workbench-instances"
}

variable "boot_disk_type" {
  type        = string
  description = "Boot disk type."
  default     = "PD_SSD"

  validation {
    condition     = contains(["PD_STANDARD", "PD_SSD", "PD_BALANCED"], var.boot_disk_type)
    error_message = "boot_disk_type must be one of PD_STANDARD, PD_SSD, PD_BALANCED."
  }
}

variable "boot_disk_size_gb" {
  type        = number
  description = "Boot disk size in GB."
  default     = 150

  validation {
    condition     = var.boot_disk_size_gb >= 100 && var.boot_disk_size_gb <= 65536
    error_message = "boot_disk_size_gb must be between 100 and 65536."
  }
}

variable "data_disk_type" {
  type        = string
  description = "Persistent data disk type (holds /home/jupyter)."
  default     = "PD_BALANCED"

  validation {
    condition     = contains(["PD_STANDARD", "PD_SSD", "PD_BALANCED"], var.data_disk_type)
    error_message = "data_disk_type must be one of PD_STANDARD, PD_SSD, PD_BALANCED."
  }
}

variable "data_disk_size_gb" {
  type        = number
  description = "Persistent data disk size in GB."
  default     = 200

  validation {
    condition     = var.data_disk_size_gb >= 100 && var.data_disk_size_gb <= 65536
    error_message = "data_disk_size_gb must be between 100 and 65536."
  }
}

variable "kms_key" {
  type        = string
  description = "CMEK key resource ID for disk encryption (projects/<p>/locations/<l>/keyRings/<r>/cryptoKeys/<k>). Null uses Google-managed keys."
  default     = null
}

variable "accelerator_type" {
  type        = string
  description = "Optional GPU accelerator type, e.g. NVIDIA_TESLA_T4. Null means CPU-only."
  default     = null

  validation {
    condition = var.accelerator_type == null || contains([
      "NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100", "NVIDIA_TESLA_P100",
      "NVIDIA_TESLA_A100", "NVIDIA_A100_80GB", "NVIDIA_L4",
    ], coalesce(var.accelerator_type, "none"))
    error_message = "accelerator_type must be a supported Vertex Workbench GPU type or null."
  }
}

variable "accelerator_count" {
  type        = number
  description = "Number of GPUs to attach when accelerator_type is set."
  default     = 1

  validation {
    condition     = var.accelerator_count >= 1 && var.accelerator_count <= 8
    error_message = "accelerator_count must be between 1 and 8."
  }
}

variable "idle_shutdown_minutes" {
  type        = number
  description = "Minutes of inactivity before the notebook auto-stops. Set to 0 to disable (not recommended)."
  default     = 60

  validation {
    condition     = var.idle_shutdown_minutes >= 0 && var.idle_shutdown_minutes <= 1440
    error_message = "idle_shutdown_minutes must be between 0 and 1440 (24h)."
  }
}

variable "instance_owners" {
  type        = list(string)
  description = "User emails permitted to access the JupyterLab UI (single-user notebooks). Empty for service-managed access."
  default     = []
}

variable "disable_proxy_access" {
  type        = bool
  description = "If true, disables the public Google proxy URI; access only via IAP/VPN to the private IP."
  default     = false
}

variable "environment" {
  type        = string
  description = "Deployment environment label (e.g. dev, staging, prod)."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, prod."
  }
}

variable "team" {
  type        = string
  description = "Owning team, used as a label for cost allocation."
}

variable "cost_center" {
  type        = string
  description = "Cost-centre code, used as a label for chargeback."
}

variable "metadata" {
  type        = map(string)
  description = "Extra instance metadata key/value pairs, merged onto the module defaults."
  default     = {}
}

variable "additional_labels" {
  type        = map(string)
  description = "Extra labels merged onto the standard label set."
  default     = {}
}
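
The validation blocks above can be exercised without touching GCP. The following is a minimal sketch using Terraform's native test framework; it assumes Terraform 1.7 or newer (for mock_provider), which is stricter than the module's own >= 1.5.0 pin, and the file path and all input values are hypothetical.

```hcl
# tests/validation.tftest.hcl (hypothetical path); run with `terraform test`.
# mock_provider avoids real credentials and API calls.
mock_provider "google" {}

# Baseline inputs shared by every run block; all values are placeholders.
variables {
  project_id    = "example-project"
  instance_name = "example-notebook"
  zone          = "asia-south1-a"
  network       = "example-vpc"
  subnet        = "example-subnet"
  environment   = "dev"
  team          = "example-team"
  cost_center   = "CC-0000"
}

run "rejects_unknown_environment" {
  command = plan

  variables {
    environment = "qa"
  }

  # The plan is expected to fail on this variable's validation rule.
  expect_failures = [var.environment]
}

run "rejects_uppercase_instance_name" {
  command = plan

  variables {
    instance_name = "Fraud-DS"
  }

  expect_failures = [var.instance_name]
}

run "defaults_are_private" {
  command = plan

  assert {
    condition     = google_workbench_instance.this.gce_setup[0].disable_public_ip == true
    error_message = "Workbench instances must not receive a public IP."
  }
}
```

Tests like these are cheap to run in CI on every pull request and catch a loosened validation rule before it reaches a real project.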

outputs.tf

output "id" {
  description = "Fully-qualified resource ID of the Workbench instance."
  value       = google_workbench_instance.this.id
}

output "name" {
  description = "Name of the Workbench instance."
  value       = google_workbench_instance.this.name
}

output "proxy_uri" {
  description = "Google-managed proxy URI to open the JupyterLab UI (empty when proxy access is disabled)."
  value       = google_workbench_instance.this.proxy_uri
}

output "state" {
  description = "Current lifecycle state of the instance (e.g. ACTIVE, STOPPED)."
  value       = google_workbench_instance.this.state
}

output "service_account_email" {
  description = "Service account email the notebook runs as."
  value       = local.effective_sa_email
}

output "creator" {
  description = "Email of the principal that created the instance."
  value       = google_workbench_instance.this.creator
}

output "health_state" {
  description = "Reported health state of the instance from the Workbench control plane."
  value       = google_workbench_instance.this.health_state
}

How to use it

module "vertex_ai_workbench" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"

  project_id    = "kv-ml-platform-prod"
  instance_name = "fraud-ds-anita"
  zone          = "asia-south1-a"

  machine_type = "n1-standard-8"

  # GPU notebook for model prototyping.
  accelerator_type  = "NVIDIA_TESLA_T4"
  accelerator_count = 1

  # Attach to the Shared VPC; subnet has Private Google Access on.
  network = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet  = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"

  network_tags = ["workbench", "egress-via-proxy"]

  # CMEK on all disks.
  kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"

  # Single-user notebook locked to one analyst.
  instance_owners = ["anita@kloudvin.com"]

  # Aggressive idle-shutdown for an expensive GPU box.
  idle_shutdown_minutes = 30

  data_disk_size_gb = 500

  environment = "prod"
  team        = "fraud-ml"
  cost_center = "CC-4471"

  additional_labels = {
    project = "realtime-fraud-scoring"
  }
}

# Downstream: grant this notebook's SA read access to a specific BigQuery
# dataset, using the module's service_account_email output.
resource "google_bigquery_dataset_iam_member" "notebook_reader" {
  project    = "kv-data-warehouse-prod"
  dataset_id = "fraud_features"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${module.vertex_ai_workbench.service_account_email}"
}

output "notebook_url" {
  value = module.vertex_ai_workbench.proxy_uri
}
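
The module expects the subnet to exist already, with Private Google Access enabled so that a notebook without a public IP can still reach Google APIs. As a rough sketch of what that subnet might look like in the Shared VPC host project: the project, network, and subnet names are taken from the example above, while the CIDR range and the resource label notebooks are assumptions.

```hcl
# Sketch only: managed in the network/host-project stack, not in this module.
resource "google_compute_subnetwork" "notebooks" {
  project       = "kv-shared-host"
  name          = "notebooks-asia-south1"
  region        = "asia-south1"
  network       = "projects/kv-shared-host/global/networks/ml-vpc"
  ip_cidr_range = "10.20.0.0/24" # hypothetical range; pick one that fits your IP plan

  # Lets VMs without external IPs reach Google APIs (BigQuery, GCS, Vertex AI).
  private_ip_google_access = true
}
```

Private Google Access covers Google APIs only. Installing packages from the public internet would still need something like Cloud NAT or an internal package mirror, which is outside the scope of this module.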

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...gcs state bucket/container + key per path...
  }
}

2. Module config: live/prod/vertex_workbench/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
}

inputs = {
  project_id    = "..."
  instance_name = "..."
  zone          = "..."
  network       = "..."
  subnet        = "..."
  environment   = "..."
  team          = "..."
  cost_center   = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/vertex_workbench && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
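
The root config shown in step 1 covers only the backend. To keep the provider in the same single place, a generate block can be added to the root terragrunt.hcl. This is a minimal sketch: the "..." values are placeholders to fill in, and the choice of provider.tf as the generated file name is an assumption.

```hcl
# live/terragrunt.hcl (addition to the root config)
# Writes provider.tf into every module's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "..."
  region  = "..."
}
EOF
}
```

The overwrite_terragrunt setting replaces only files that Terragrunt itself generated, so a hand-written provider.tf inside a module is not silently clobbered.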

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | - | yes | GCP project ID that will host the Workbench instance. |
| instance_name | string | - | yes | Instance name; lowercase letters, digits, hyphens; starts with a letter. |
| zone | string | - | yes | Zone such as asia-south1-a. |
| machine_type | string | e2-standard-4 | no | Compute Engine machine type. |
| network | string | - | yes | VPC network self-link or name. |
| subnet | string | - | yes | Subnetwork self-link (Private Google Access enabled). |
| network_tags | list(string) | [] | no | Network tags for firewall targeting. |
| service_account_email | string | null | no | Existing SA email; null creates a dedicated SA. |
| service_account_roles | list(string) | [roles/aiplatform.user, roles/storage.objectViewer, roles/logging.logWriter, roles/monitoring.metricWriter] | no | Roles granted to the module-created SA. |
| image_project | string | deeplearning-platform-release | no | Project hosting the VM image family. |
| image_family | string | workbench-instances | no | Deep Learning VM image family. |
| boot_disk_type | string | PD_SSD | no | Boot disk type (PD_STANDARD/PD_SSD/PD_BALANCED). |
| boot_disk_size_gb | number | 150 | no | Boot disk size in GB (100–65536). |
| data_disk_type | string | PD_BALANCED | no | Data disk type (holds /home/jupyter). |
| data_disk_size_gb | number | 200 | no | Data disk size in GB (100–65536). |
| kms_key | string | null | no | CMEK key resource ID for disk encryption. |
| accelerator_type | string | null | no | GPU type (e.g. NVIDIA_TESLA_T4); null = CPU-only. |
| accelerator_count | number | 1 | no | Number of GPUs (1–8) when accelerator_type set. |
| idle_shutdown_minutes | number | 60 | no | Idle minutes before auto-stop; 0 disables. |
| instance_owners | list(string) | [] | no | User emails allowed into the JupyterLab UI. |
| disable_proxy_access | bool | false | no | Disable the public proxy URI (IAP/VPN-only access). |
| environment | string | - | yes | dev/staging/prod label. |
| team | string | - | yes | Owning team (cost-allocation label). |
| cost_center | string | - | yes | Cost-centre code (chargeback label). |
| metadata | map(string) | {} | no | Extra instance metadata, merged onto defaults. |
| additional_labels | map(string) | {} | no | Extra labels merged onto the standard set. |

Outputs

| Name | Description |
| --- | --- |
| id | Fully-qualified resource ID of the Workbench instance. |
| name | Name of the Workbench instance. |
| proxy_uri | Google-managed proxy URI for the JupyterLab UI (empty when proxy access disabled). |
| state | Current lifecycle state (e.g. ACTIVE, STOPPED). |
| service_account_email | Service account email the notebook runs as. |
| creator | Email of the principal that created the instance. |
| health_state | Reported health state from the Workbench control plane. |

Enterprise scenario

A fintech’s fraud-modelling group needs forty data scientists able to spin up GPU notebooks against a feature store in BigQuery, but their PCI-DSS scope forbids any public-IP compute and mandates CMEK on all persistent storage. The platform team exposes this module behind a thin self-service wrapper: an analyst opens a pull request with their name, team, and required machine type, and CI applies a fraud-ds-<name> instance that lands private-only, Shielded, CMEK-encrypted, and owned by exactly one user — then grants its dedicated service account read-only access to just the fraud_features dataset. The enforced 30-minute idle-shutdown alone cut their monthly notebook GPU spend by roughly 40% versus the previous click-ops fleet that ran around the clock.
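
One way the self-service wrapper described above might look is a single for_each over a reviewed map of requests. This is a sketch under stated assumptions, not the platform team's actual implementation: the notebooks local, the analyst "ravi", and the module label fraud_notebooks are hypothetical, and the remaining values are reused from the usage example earlier in this article.

```hcl
locals {
  # Hypothetical request map; each pull request adds or removes one entry.
  notebooks = {
    anita = { machine_type = "n1-standard-8", accelerator_type = "NVIDIA_TESLA_T4" }
    ravi  = { machine_type = "e2-standard-4", accelerator_type = null }
  }
}

module "fraud_notebooks" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-vertex-workbench?ref=v1.0.0"
  for_each = local.notebooks

  project_id    = "kv-ml-platform-prod"
  instance_name = "fraud-ds-${each.key}"
  zone          = "asia-south1-a"

  machine_type     = each.value.machine_type
  accelerator_type = each.value.accelerator_type

  network = "projects/kv-shared-host/global/networks/ml-vpc"
  subnet  = "projects/kv-shared-host/regions/asia-south1/subnetworks/notebooks-asia-south1"
  kms_key = "projects/kv-shared-host/locations/asia-south1/keyRings/ml-kr/cryptoKeys/notebooks"

  # One owner per notebook, derived from the map key.
  instance_owners       = ["${each.key}@kloudvin.com"]
  idle_shutdown_minutes = 30

  environment = "prod"
  team        = "fraud-ml"
  cost_center = "CC-4471"
}

# Read-only access to the single dataset, one binding per notebook SA.
resource "google_bigquery_dataset_iam_member" "fraud_features_reader" {
  for_each = module.fraud_notebooks

  project    = "kv-data-warehouse-prod"
  dataset_id = "fraud_features"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${each.value.service_account_email}"
}
```

Keying the map by analyst name keeps each instance's address in state stable, so removing one entry destroys only that analyst's notebook and its IAM binding.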

