Terraform Module: GCP GKE Cluster — a hardened, VPC-native cluster you can stamp out per environment

Quick take — Build a reusable Terraform module for a production GKE cluster on hashicorp/google ~> 5.0: VPC-native networking, Workload Identity, private nodes, release channels, and a managed node pool — all var-driven. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "gke" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"

  project_id           = "..."  # GCP project ID that owns the cluster.
  cluster_name         = "..."  # Cluster name (validated: lowercase, starts with a lette…
  region               = "..."  # Region for the regional (HA) cluster.
  network              = "..."  # VPC network self-link or name.
  subnetwork           = "..."  # Subnetwork self-link or name for nodes.
  pods_range_name      = "..."  # Subnet secondary range name for pod alias IPs.
  services_range_name  = "..."  # Subnet secondary range name for service IPs.
  node_service_account = "..."  # Email of the least-privilege node SA.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

A Google Kubernetes Engine (GKE) cluster is GCP’s managed Kubernetes control plane plus the worker infrastructure that runs your pods. The control plane (API server, scheduler, etcd) is fully managed by Google; you own the node pools — the VM groups that actually schedule your workloads. In Terraform, the cluster itself is the google_container_cluster resource, and the worker VMs live in one or more google_container_node_pool resources.

The reason to wrap this in a reusable module is that a correct GKE cluster is deceptively fiddly. The defaults Google ships are not the ones you want in production: nodes get public IPs, the control-plane endpoint is reachable from any address, Workload Identity is off, and the cluster comes with a built-in “default node pool” that you almost always want to delete and replace with your own managed pool. Getting VPC-native (alias IP) networking, a private control plane, release channels, and Workload Identity wired together correctly takes ~150 lines of HCL that nobody wants to copy-paste, and copy-paste is exactly how one environment ends up with shielded nodes and another without.

This module bakes those production decisions in once. It creates a VPC-native, regional cluster with the default node pool removed, attaches a separately managed node pool with autoscaling and auto-repair, enables Workload Identity so pods authenticate to GCP APIs without node-level service account keys, and optionally makes the control plane private. Every environment-specific value — project, region, CIDR ranges, machine type, node count — is a variable, so dev, staging, and prod are the same code with different .tfvars.

When to use it

You run more than one GKE cluster (per-environment, per-region, or per-team) and want them provisioned identically rather than hand-tuned.
You need VPC-native networking (alias IPs) because you’re peering with other VPCs, using GKE Ingress with container-native load balancing, or you’ve outgrown route-based clusters.
You want Workload Identity as the standard for pod-to-GCP-API auth and need it enabled consistently — not bolted on later.
You’re enforcing a security baseline (private nodes, shielded nodes, release channel, no legacy ABAC/basic-auth) and want it as code, not a runbook.
You’re standing up clusters through CI/CD and need a stable module reference (?ref=v1.0.0) so a cluster’s config is reproducible and reviewable.

If you only ever need a single throwaway sandbox cluster, gcloud container clusters create is faster. The moment a second cluster appears, reach for the module.

Module structure

terraform-module-gcp-gke/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_container_cluster + google_container_node_pool
├── variables.tf     # all environment-specific inputs (with validation)
├── outputs.tf       # cluster id/name/endpoint + CA cert + node pool name
└── README.md

versions.tf

terraform {
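  # 1.9+ is needed: the max_node_count validation refers to var.min_node_count,
  # and cross-variable references in validation blocks fail on older versions.
  required_version = ">= 1.9.0"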

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # GKE requires the control-plane CIDR to be a /28.
  master_ipv4_cidr_block = var.master_ipv4_cidr_block

  # Workload Identity pool is always "<project>.svc.id.goog".
  workload_pool = "${var.project_id}.svc.id.goog"
}

resource "google_container_cluster" "this" {
  provider = google

  name     = var.cluster_name
  project  = var.project_id
  location = var.region

  # Removing the default node pool is the recommended pattern: it lets us
  # manage the real worker pool separately via google_container_node_pool.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Deletion protection prevents accidental `terraform destroy` of the cluster.
  deletion_protection = var.deletion_protection

  # VPC-native (alias IP) networking. Cluster/services ranges are looked up
  # by the secondary range *names* defined on the subnet.
  networking_mode = "VPC_NATIVE"
  network         = var.network
  subnetwork      = var.subnetwork

  ip_allocation_policy {
    cluster_secondary_range_name  = var.pods_range_name
    services_secondary_range_name = var.services_range_name
  }

  # Managed control-plane upgrades. Use a release channel rather than pinning
  # min_master_version so Google handles patching within the channel.
  release_channel {
    channel = var.release_channel
  }

  # Workload Identity: the recommended way for pods to authenticate to Google
  # APIs without node-level service account keys.
  workload_identity_config {
    workload_pool = local.workload_pool
  }

  # Private control plane / private nodes. Nodes get only internal IPs;
  # the master endpoint is optionally private as well.
  private_cluster_config {
    enable_private_nodes    = var.enable_private_nodes
    enable_private_endpoint = var.enable_private_endpoint
    master_ipv4_cidr_block  = var.enable_private_nodes ? local.master_ipv4_cidr_block : null
  }

  # Restrict who can reach the public control-plane endpoint.
  dynamic "master_authorized_networks_config" {
    for_each = length(var.master_authorized_networks) > 0 ? [1] : []
    content {
      dynamic "cidr_blocks" {
        for_each = var.master_authorized_networks
        content {
          cidr_block   = cidr_blocks.value.cidr_block
          display_name = cidr_blocks.value.display_name
        }
      }
    }
  }

  # Disable legacy auth surfaces. (Basic auth was removed in GKE 1.19, and
  # new clusters do not issue client certificates unless asked to, so we
  # simply do not configure master_auth credentials.)
  enable_legacy_abac = false

  # Shielded Nodes hardens the VM boot integrity for the whole cluster.
  enable_shielded_nodes = true

  # Optional dataplane v2 (eBPF-based networking + network policy).
  datapath_provider = var.enable_dataplane_v2 ? "ADVANCED_DATAPATH" : "DATAPATH_PROVIDER_UNSPECIFIED"

  # Maintenance window so node/control-plane upgrades land off-peak.
  maintenance_policy {
    recurring_window {
      start_time = var.maintenance_start_time
      end_time   = var.maintenance_end_time
      recurrence = var.maintenance_recurrence
    }
  }

  resource_labels = var.labels

  lifecycle {
    # The control plane occasionally rewrites node_config on the (removed)
    # default pool; ignore it so plans stay clean.
    ignore_changes = [node_config]
  }
}

resource "google_container_node_pool" "primary" {
  provider = google

  name     = "${var.cluster_name}-primary"
  project  = var.project_id
  location = var.region
  cluster  = google_container_cluster.this.name

  # With autoscaling, initial_node_count is per-zone; a regional cluster
  # multiplies this across its zones.
  initial_node_count = var.node_count

  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  upgrade_settings {
    max_surge       = var.max_surge
    max_unavailable = var.max_unavailable
  }

  node_config {
    machine_type = var.machine_type
    disk_size_gb = var.disk_size_gb
    disk_type    = var.disk_type
    image_type   = "COS_CONTAINERD"

    # Least-privilege node identity. Prefer a dedicated SA with only the
    # logging/monitoring/artifact-registry roles it needs.
    service_account = var.node_service_account
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    # Required to let pods on these nodes use Workload Identity.
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    labels = var.node_labels
    tags   = var.node_network_tags

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

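  lifecycle {
    # create_before_destroy is deliberately not set: the pool name is fixed,
    # so a replacement pool cannot be created while the old one still exists.
    # Changing initial_node_count forces replacement, so ignore it and let
    # the autoscaler own the pool size after creation.
    ignore_changes = [initial_node_count]
  }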
}

variables.tf

variable "project_id" {
  description = "GCP project ID that owns the cluster."
  type        = string
}

variable "cluster_name" {
  description = "Name of the GKE cluster. Lowercase letters, numbers and hyphens; must start with a letter."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{0,38}[a-z0-9]$", var.cluster_name))
    error_message = "cluster_name must be 2-40 chars, lowercase alphanumeric or '-', start with a letter, and not end with '-'."
  }
}

variable "region" {
  description = "Region for a regional cluster (e.g. asia-south1). Use a region, not a zone, for HA control plane."
  type        = string
}

variable "network" {
  description = "Self-link or name of the VPC network the cluster attaches to."
  type        = string
}

variable "subnetwork" {
  description = "Self-link or name of the subnetwork the nodes live in."
  type        = string
}

variable "pods_range_name" {
  description = "Name of the subnet secondary range used for pod IPs (alias IPs)."
  type        = string
}

variable "services_range_name" {
  description = "Name of the subnet secondary range used for service (ClusterIP) IPs."
  type        = string
}

variable "release_channel" {
  description = "GKE release channel governing auto-upgrade cadence."
  type        = string
  default     = "REGULAR"

  validation {
    condition     = contains(["RAPID", "REGULAR", "STABLE", "UNSPECIFIED"], var.release_channel)
    error_message = "release_channel must be one of RAPID, REGULAR, STABLE, or UNSPECIFIED."
  }
}

variable "enable_private_nodes" {
  description = "If true, nodes receive only internal IPs."
  type        = bool
  default     = true
}

variable "enable_private_endpoint" {
  description = "If true, the control-plane endpoint is private (reachable only from authorized internal networks)."
  type        = bool
  default     = false
}

variable "master_ipv4_cidr_block" {
  description = "RFC 1918 /28 block for the managed control plane. Only used when enable_private_nodes is true."
  type        = string
  default     = "172.16.0.0/28"

  validation {
    condition     = can(cidrhost(var.master_ipv4_cidr_block, 0)) && endswith(var.master_ipv4_cidr_block, "/28")
    error_message = "master_ipv4_cidr_block must be a valid /28 CIDR."
  }
}

variable "master_authorized_networks" {
  description = "CIDR blocks allowed to reach the control-plane endpoint."
  type = list(object({
    cidr_block   = string
    display_name = string
  }))
  default = []
}

variable "enable_dataplane_v2" {
  description = "Enable GKE Dataplane V2 (Cilium/eBPF) for advanced networking and network policy."
  type        = bool
  default     = true
}

variable "deletion_protection" {
  description = "Prevent accidental destroy of the cluster via Terraform."
  type        = bool
  default     = true
}

variable "node_service_account" {
  description = "Email of the IAM service account attached to nodes. Use a dedicated least-privilege SA."
  type        = string
}

variable "machine_type" {
  description = "Compute Engine machine type for node VMs."
  type        = string
  default     = "e2-standard-4"
}

variable "node_count" {
  description = "Initial node count per zone for the primary node pool."
  type        = number
  default     = 1

  validation {
    condition     = var.node_count >= 1
    error_message = "node_count must be at least 1."
  }
}

variable "min_node_count" {
  description = "Minimum nodes per zone for autoscaling."
  type        = number
  default     = 1
}

variable "max_node_count" {
  description = "Maximum nodes per zone for autoscaling."
  type        = number
  default     = 5

  validation {
    condition     = var.max_node_count >= var.min_node_count
    error_message = "max_node_count must be >= min_node_count."
  }
}

variable "disk_size_gb" {
  description = "Boot disk size (GB) per node."
  type        = number
  default     = 100
}

variable "disk_type" {
  description = "Boot disk type for nodes."
  type        = string
  default     = "pd-balanced"

  validation {
    condition     = contains(["pd-standard", "pd-balanced", "pd-ssd"], var.disk_type)
    error_message = "disk_type must be pd-standard, pd-balanced, or pd-ssd."
  }
}

variable "max_surge" {
  description = "Extra nodes allowed above pool size during a surge upgrade."
  type        = number
  default     = 1
}

variable "max_unavailable" {
  description = "Nodes allowed to be unavailable during an upgrade."
  type        = number
  default     = 0
}

variable "maintenance_start_time" {
  description = "RFC3339 start of the recurring maintenance window."
  type        = string
  default     = "2026-01-01T18:00:00Z"
}

variable "maintenance_end_time" {
  description = "RFC3339 end of the recurring maintenance window."
  type        = string
  default     = "2026-01-01T22:00:00Z"
}

variable "maintenance_recurrence" {
  description = "RFC5545 RRULE for the maintenance window recurrence."
  type        = string
  default     = "FREQ=WEEKLY;BYDAY=SA,SU"
}

variable "labels" {
  description = "Resource labels applied to the cluster."
  type        = map(string)
  default     = {}
}

variable "node_labels" {
  description = "Kubernetes labels applied to nodes in the primary pool."
  type        = map(string)
  default     = {}
}

variable "node_network_tags" {
  description = "Network tags applied to node VMs (for firewall targeting)."
  type        = list(string)
  default     = []
}

outputs.tf

output "cluster_id" {
  description = "Fully qualified GKE cluster ID."
  value       = google_container_cluster.this.id
}

output "cluster_name" {
  description = "Name of the GKE cluster."
  value       = google_container_cluster.this.name
}

output "endpoint" {
  description = "IP address of the cluster's Kubernetes API server endpoint."
  value       = google_container_cluster.this.endpoint
  sensitive   = true
}

output "cluster_ca_certificate" {
  description = "Base64-encoded public CA certificate for the cluster control plane."
  value       = google_container_cluster.this.master_auth[0].cluster_ca_certificate
  sensitive   = true
}

output "location" {
  description = "Region/location the cluster runs in."
  value       = google_container_cluster.this.location
}

output "workload_identity_pool" {
  description = "Workload Identity pool for binding KSAs to GCP service accounts."
  value       = google_container_cluster.this.workload_identity_config[0].workload_pool
}

output "primary_node_pool_name" {
  description = "Name of the primary managed node pool."
  value       = google_container_node_pool.primary.name
}

How to use it

module "gke_cluster" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"

  project_id   = "kloudvin-prod"
  cluster_name = "kv-prod-apps"
  region       = "asia-south1"

  # VPC-native wiring: the subnet must already define these secondary ranges.
  network             = google_compute_network.vpc.self_link
  subnetwork          = google_compute_subnetwork.gke.self_link
  pods_range_name     = "gke-pods"
  services_range_name = "gke-services"

  # Security baseline
  enable_private_nodes    = true
  enable_private_endpoint = false
  master_ipv4_cidr_block  = "172.16.8.0/28"
  release_channel         = "STABLE"

  master_authorized_networks = [
    {
      cidr_block   = "10.20.0.0/16"
      display_name = "corp-vpn"
    }
  ]

  # Least-privilege node identity
  node_service_account = google_service_account.gke_nodes.email

  # Capacity
  machine_type   = "e2-standard-8"
  node_count     = 2
  min_node_count = 2
  max_node_count = 10

  labels = {
    env   = "prod"
    team  = "platform"
    owner = "vinod"
  }
}

# Downstream: configure the Kubernetes/Helm providers from the module outputs
# so you can deploy workloads into the cluster you just created.
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${module.gke_cluster.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate)
}

# Downstream: bind a Kubernetes service account to a GCP SA via the
# Workload Identity pool the module exposes.
resource "google_service_account_iam_member" "wi_binding" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${module.gke_cluster.workload_identity_pool}[default/checkout-api]"
}
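
The example above references google_compute_network.vpc and google_compute_subnetwork.gke without defining them. A minimal sketch of that network layer might look like the following. The resource names match the references above, but the kv-prod-* names, the CIDR ranges, and the Cloud NAT resources are illustrative assumptions, not part of the module. The NAT is included because nodes created with enable_private_nodes = true have no external IPs and typically need one for outbound internet access.

```hcl
# Hypothetical network layer for the example above.
# All names and CIDR ranges are placeholders; adjust to your address plan.
resource "google_compute_network" "vpc" {
  name                    = "kv-prod-vpc"
  project                 = "kloudvin-prod"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke" {
  name          = "kv-prod-gke"
  project       = "kloudvin-prod"
  region        = "asia-south1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.10.0.0/20" # primary range: node IPs

  # Lets private nodes reach Google APIs without an external IP.
  private_ip_google_access = true

  # Range names must match pods_range_name / services_range_name.
  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.16.0.0/14"
  }

  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.24.0.0/20"
  }
}

# Outbound internet access for private nodes (assumption: you need it,
# for example to pull images from registries outside Google Cloud).
resource "google_compute_router" "nat" {
  name    = "kv-prod-nat-router"
  project = "kloudvin-prod"
  region  = "asia-south1"
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "kv-prod-nat"
  project                            = "kloudvin-prod"
  region                             = "asia-south1"
  router                             = google_compute_router.nat.name
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```

Keeping the network in its own configuration (or its own module) means the cluster can be destroyed and recreated without touching the address plan.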

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config — live/prod/gke/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"
}

inputs = {
  project_id           = "..."
  cluster_name         = "..."
  region               = "..."
  network              = "..."
  subnetwork           = "..."
  pods_range_name      = "..."
  services_range_name  = "..."
  node_service_account = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/gke && terragrunt apply        # this module
cd .. && terragrunt run-all apply             # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
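
The dependency handling mentioned above can be sketched with a dependency block. This assumes a sibling live/prod/network unit whose module exposes outputs named network_self_link and subnetwork_self_link; both the unit and the output names are hypothetical.

```hcl
# live/prod/gke/terragrunt.hcl (sketch)
# Assumes a sibling unit at live/prod/network with the hypothetical
# outputs network_self_link and subnetwork_self_link.
dependency "network" {
  config_path = "../network"

  # Placeholder values so validate/plan can run before the network exists.
  mock_outputs = {
    network_self_link    = "mock-network"
    subnetwork_self_link = "mock-subnetwork"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # These two replace the literal "..." values in the module config;
  # the remaining inputs stay as they were.
  network    = dependency.network.outputs.network_self_link
  subnetwork = dependency.network.outputs.subnetwork_self_link
}
```

With the dependency declared, run-all applies the network unit before the cluster unit, so the cluster never plans against a VPC that does not exist yet.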

Inputs

Name	Type	Default	Required	Description
project_id	string	—	yes	GCP project ID that owns the cluster.
cluster_name	string	—	yes	Cluster name (validated: lowercase, starts with a letter, ≤40 chars).
region	string	—	yes	Region for the regional (HA) cluster.
network	string	—	yes	VPC network self-link or name.
subnetwork	string	—	yes	Subnetwork self-link or name for nodes.
pods_range_name	string	—	yes	Subnet secondary range name for pod alias IPs.
services_range_name	string	—	yes	Subnet secondary range name for service IPs.
node_service_account	string	—	yes	Email of the least-privilege node SA.
release_channel	string	`"REGULAR"`	no	Release channel: RAPID, REGULAR, STABLE, UNSPECIFIED.
enable_private_nodes	bool	`true`	no	Give nodes internal-only IPs.
enable_private_endpoint	bool	`false`	no	Make the control-plane endpoint private.
master_ipv4_cidr_block	string	`"172.16.0.0/28"`	no	/28 block for the managed control plane.
master_authorized_networks	list(object)	`[]`	no	CIDRs allowed to reach the control plane.
enable_dataplane_v2	bool	`true`	no	Enable GKE Dataplane V2 (eBPF) + network policy.
deletion_protection	bool	`true`	no	Block Terraform destroy of the cluster.
machine_type	string	`"e2-standard-4"`	no	Node VM machine type.
node_count	number	`1`	no	Initial nodes per zone.
min_node_count	number	`1`	no	Autoscaler minimum nodes per zone.
max_node_count	number	`5`	no	Autoscaler maximum nodes per zone (≥ min).
disk_size_gb	number	`100`	no	Node boot disk size in GB.
disk_type	string	`"pd-balanced"`	no	Node boot disk type.
max_surge	number	`1`	no	Surge nodes during upgrades.
max_unavailable	number	`0`	no	Unavailable nodes during upgrades.
maintenance_start_time	string	`"2026-01-01T18:00:00Z"`	no	RFC3339 maintenance window start.
maintenance_end_time	string	`"2026-01-01T22:00:00Z"`	no	RFC3339 maintenance window end.
maintenance_recurrence	string	`"FREQ=WEEKLY;BYDAY=SA,SU"`	no	RRULE for the maintenance window.
labels	map(string)	`{}`	no	Resource labels on the cluster.
node_labels	map(string)	`{}`	no	Kubernetes labels on primary-pool nodes.
node_network_tags	list(string)	`[]`	no	Network tags for firewall targeting.

Outputs

Name	Description
cluster_id	Fully qualified GKE cluster ID.
cluster_name	Name of the cluster.
endpoint	API server endpoint IP (sensitive).
cluster_ca_certificate	Base64 CA cert for the control plane (sensitive).
location	Region the cluster runs in.
workload_identity_pool	Workload Identity pool (`<project>.svc.id.goog`) for KSA→GSA bindings.
primary_node_pool_name	Name of the primary managed node pool.
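
The workload_identity_pool output covers the IAM half of a Workload Identity binding. The Kubernetes service account usually also needs an annotation that names the GCP service account it impersonates. A minimal sketch, assuming the kubernetes provider is configured as in "How to use it" and reusing the checkout-api name from that example; the GCP service account shown here is hypothetical.

```hcl
# Hypothetical application identity; names are illustrative.
resource "google_service_account" "app" {
  project    = "kloudvin-prod"
  account_id = "checkout-api"
}

# Kubernetes side of the binding: the annotation tells GKE which GCP
# service account pods using this KSA should act as.
resource "kubernetes_service_account_v1" "checkout_api" {
  metadata {
    name      = "checkout-api"
    namespace = "default"

    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.app.email
    }
  }
}
```

The namespace and name must match the [default/checkout-api] suffix used in the roles/iam.workloadIdentityUser member string, otherwise token exchange fails.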

Enterprise scenario

A fintech platform team runs the same checkout and ledger services across three GKE clusters — kv-dev-apps, kv-stg-apps, and kv-prod-apps in asia-south1 — and is subject to PCI controls. They consume this module from three Terraform workspaces with identical code and per-environment .tfvars: prod pins release_channel = "STABLE", sets enable_private_endpoint = true with only the corporate VPN CIDR in master_authorized_networks, and scales max_node_count to 30, while dev stays on REGULAR with a smaller e2-standard-4 pool. Because Workload Identity is enabled uniformly, every pod authenticates to Cloud SQL and Secret Manager through a bound GCP service account with zero exported keys, satisfying the auditors that node and pod credentials are short-lived and least-privilege everywhere.
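
The per-environment split in this scenario could look roughly like the two files below. This is a sketch that assumes the root configuration declares variables with the same names and passes them straight through to the module; the CIDR is reused from the earlier example, not taken from a real deployment.

```hcl
# prod.tfvars (sketch)
cluster_name            = "kv-prod-apps"
region                  = "asia-south1"
release_channel         = "STABLE"
enable_private_endpoint = true
max_node_count          = 30

master_authorized_networks = [
  {
    cidr_block   = "10.20.0.0/16" # corporate VPN range (placeholder)
    display_name = "corp-vpn"
  }
]
```

```hcl
# dev.tfvars (sketch)
cluster_name    = "kv-dev-apps"
region          = "asia-south1"
release_channel = "REGULAR"
machine_type    = "e2-standard-4"
```

Each workspace then runs terraform apply -var-file=<env>.tfvars against the same code.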

Best practices

Remove the default node pool and manage pools separately (this module does). Editing node_config on the in-cluster default pool forces cluster recreation; a standalone google_container_node_pool lets you change machine types, scale, or roll node OS versions without touching the control plane.
Use a release channel, not a pinned master version. STABLE for prod, REGULAR for non-prod. Channels give you automatic, tested security patches; pinning min_master_version leaves you responsible for CVE patching and tends to rot.
Give nodes a dedicated least-privilege service account, never the default Compute Engine SA, which by default holds the broad Editor role on the project. Grant the node SA only roles/logging.logWriter, roles/monitoring.metricWriter, roles/monitoring.viewer, and roles/artifactregistry.reader, and push all app-level access through Workload Identity bindings instead.
Right-size with autoscaling and pick the disk to match the workload. Set realistic min/max_node_count, use pd-balanced (not pd-ssd) unless you measure I/O pressure, and prefer e2/t2d families for general workloads to keep INR spend down; reserve n2/c3 for CPU-bound services.
Keep nodes private and lock the control plane. enable_private_nodes = true plus a tight master_authorized_networks list (or enable_private_endpoint = true) removes the public attack surface; pair with enable_shielded_nodes, secure boot, and integrity monitoring, all set here by default.
Name and label for cost attribution. Encode environment and region in cluster_name (kv-prod-apps) and always set labels with env/team/owner so GKE node VMs roll up cleanly in billing exports and budget alerts.
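
The node service account practice above can be sketched as follows. The account_id and project are placeholders, and the role list simply mirrors the four roles named in that bullet; treat it as a starting point, not a complete policy for every workload.

```hcl
# Hypothetical names; "kloudvin-prod" is the example project used earlier.
locals {
  node_sa_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
    "roles/artifactregistry.reader",
  ]
}

resource "google_service_account" "gke_nodes" {
  project      = "kloudvin-prod"
  account_id   = "gke-nodes"
  display_name = "GKE node service account"
}

# One additive project-level binding per role; other members holding
# the same roles are left untouched.
resource "google_project_iam_member" "gke_nodes" {
  for_each = toset(local.node_sa_roles)

  project = "kloudvin-prod"
  role    = each.value
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}
```

Pass google_service_account.gke_nodes.email as node_service_account when calling the module.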