
Terraform Module: GCP GKE Cluster — a hardened, VPC-native cluster you can stamp out per environment

Quick take — Build a reusable Terraform module for a production GKE cluster on hashicorp/google ~> 5.0: VPC-native networking, Workload Identity, private nodes, release channels, and a managed node pool — all var-driven. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "gke" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"

  project_id           = "..."  # GCP project ID that owns the cluster.
  cluster_name         = "..."  # Cluster name (validated: lowercase, starts with a lette…
  region               = "..."  # Region for the regional (HA) cluster.
  network              = "..."  # VPC network self-link or name.
  subnetwork           = "..."  # Subnetwork self-link or name for nodes.
  pods_range_name      = "..."  # Subnet secondary range name for pod alias IPs.
  services_range_name  = "..."  # Subnet secondary range name for service IPs.
  node_service_account = "..."  # Email of the least-privilege node SA.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
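The module only references the network, the subnet secondary ranges, and the node service account; it does not create them. A minimal sketch of those prerequisites follows, assuming hypothetical resource names (vpc, gke, gke_nodes), example CIDRs, and a role list you should trim or extend for your own workloads:

```hcl
# Hypothetical prerequisites for the module. Names, CIDRs and roles are
# illustrative assumptions, not part of the module itself.
resource "google_compute_network" "vpc" {
  name                    = "kv-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke" {
  name                     = "kv-gke-nodes"
  region                   = "us-central1"
  network                  = google_compute_network.vpc.id
  ip_cidr_range            = "10.10.0.0/20" # node IPs
  private_ip_google_access = true           # lets private nodes reach Google APIs

  # These range_name values are what pods_range_name and
  # services_range_name must match.
  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.100.0.0/16"
  }

  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.101.0.0/20"
  }
}

resource "google_service_account" "gke_nodes" {
  account_id   = "gke-nodes"
  display_name = "GKE node service account"
}

# Grant the node SA only what nodes typically need: logs, metrics, image pulls.
resource "google_project_iam_member" "gke_nodes" {
  for_each = toset([
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
    "roles/artifactregistry.reader",
  ])

  project = "my-project" # same placeholder project as the provider block
  role    = each.value
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}
```

With these in place, the module inputs become google_compute_network.vpc.self_link, google_compute_subnetwork.gke.self_link, the two range names, and google_service_account.gke_nodes.email.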

What this module is

A Google Kubernetes Engine (GKE) cluster is GCP’s managed Kubernetes control plane plus the worker infrastructure that runs your pods. The control plane (API server, scheduler, etcd) is fully managed by Google; you own the node pools — the VM groups that actually schedule your workloads. In Terraform, the cluster itself is the google_container_cluster resource, and the worker VMs live in one or more google_container_node_pool resources.

The reason to wrap this in a reusable module is that a correct GKE cluster is deceptively fiddly. A cluster created with minimal settings is not what you want in production: nodes get public IPs, Workload Identity is off, legacy authentication surfaces have to be kept disabled, and there is a built-in “default node pool” that you almost always want to delete and replace with your own managed pool. Getting VPC-native (alias IP) networking, a private control plane, release channels, and Workload Identity wired together correctly takes ~150 lines of HCL that nobody wants to copy-paste, and copy-paste is exactly how one environment ends up with shielded nodes and another without.

This module bakes those production decisions in once. It creates a VPC-native, regional cluster with the default node pool removed, attaches a separately managed node pool with autoscaling and auto-repair, enables Workload Identity so pods authenticate to GCP APIs without node-level service account keys, and optionally makes the control plane private. Every environment-specific value — project, region, CIDR ranges, machine type, node count — is a variable, so dev, staging, and prod are the same code with different .tfvars.

When to use it

If you only ever need a single throwaway sandbox cluster, gcloud container clusters create is faster. The moment a second cluster appears, reach for the module.

Module structure

terraform-module-gcp-gke/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # google_container_cluster + google_container_node_pool
├── variables.tf     # all environment-specific inputs (with validation)
├── outputs.tf       # cluster id/name/endpoint + CA cert + node pool name
└── README.md

versions.tf

terraform {
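  # 1.9+ because variables.tf validates max_node_count against another
  # variable (min_node_count); older releases reject cross-variable conditions.
  required_version = ">= 1.9.0"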

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # GKE requires the control-plane CIDR to be a /28.
  master_ipv4_cidr_block = var.master_ipv4_cidr_block

  # Workload Identity pool is always "<project>.svc.id.goog".
  workload_pool = "${var.project_id}.svc.id.goog"
}

resource "google_container_cluster" "this" {
  provider = google

  name     = var.cluster_name
  project  = var.project_id
  location = var.region

  # Removing the default node pool is the recommended pattern: it lets us
  # manage the real worker pool separately via google_container_node_pool.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Deletion protection prevents accidental `terraform destroy` of the cluster.
  deletion_protection = var.deletion_protection

  # VPC-native (alias IP) networking. Cluster/services ranges are looked up
  # by the secondary range *names* defined on the subnet.
  networking_mode = "VPC_NATIVE"
  network         = var.network
  subnetwork      = var.subnetwork

  ip_allocation_policy {
    cluster_secondary_range_name  = var.pods_range_name
    services_secondary_range_name = var.services_range_name
  }

  # Managed control-plane upgrades. Use a release channel rather than pinning
  # min_master_version so Google handles patching within the channel.
  release_channel {
    channel = var.release_channel
  }

  # Workload Identity: the recommended way for pods to authenticate to Google
  # APIs without node-level service account keys.
  workload_identity_config {
    workload_pool = local.workload_pool
  }

  # Private control plane / private nodes. Nodes get only internal IPs;
  # the master endpoint is optionally private as well.
  private_cluster_config {
    enable_private_nodes    = var.enable_private_nodes
    enable_private_endpoint = var.enable_private_endpoint
    master_ipv4_cidr_block  = var.enable_private_nodes ? local.master_ipv4_cidr_block : null
  }

  # Restrict who can reach the public control-plane endpoint.
  dynamic "master_authorized_networks_config" {
    for_each = length(var.master_authorized_networks) > 0 ? [1] : []
    content {
      dynamic "cidr_blocks" {
        for_each = var.master_authorized_networks
        content {
          cidr_block   = cidr_blocks.value.cidr_block
          display_name = cidr_blocks.value.display_name
        }
      }
    }
  }

  # Disable legacy auth surfaces. (Basic auth was removed in GKE 1.19 and
  # client certificates are not issued unless requested, so we simply do not
  # configure master_auth credentials.)
  enable_legacy_abac = false

  # Shielded Nodes hardens the VM boot integrity for the whole cluster.
  enable_shielded_nodes = true

  # Optional dataplane v2 (eBPF-based networking + network policy).
  datapath_provider = var.enable_dataplane_v2 ? "ADVANCED_DATAPATH" : "DATAPATH_PROVIDER_UNSPECIFIED"

  # Maintenance window so node/control-plane upgrades land off-peak.
  maintenance_policy {
    recurring_window {
      start_time = var.maintenance_start_time
      end_time   = var.maintenance_end_time
      recurrence = var.maintenance_recurrence
    }
  }

  resource_labels = var.labels

  lifecycle {
    # The control plane occasionally rewrites node_config on the (removed)
    # default pool; ignore it so plans stay clean.
    ignore_changes = [node_config]
  }
}

resource "google_container_node_pool" "primary" {
  provider = google

  name     = "${var.cluster_name}-primary"
  project  = var.project_id
  location = var.region
  cluster  = google_container_cluster.this.name

  # With autoscaling, initial_node_count is per-zone; a regional cluster
  # multiplies this across its zones.
  initial_node_count = var.node_count

  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  upgrade_settings {
    max_surge       = var.max_surge
    max_unavailable = var.max_unavailable
  }

  node_config {
    machine_type = var.machine_type
    disk_size_gb = var.disk_size_gb
    disk_type    = var.disk_type
    image_type   = "COS_CONTAINERD"

    # Least-privilege node identity. Prefer a dedicated SA with only the
    # logging/monitoring/artifact-registry roles it needs.
    service_account = var.node_service_account
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    # Required to let pods on these nodes use Workload Identity.
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    labels = var.node_labels
    tags   = var.node_network_tags

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}
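One consequence of enable_private_nodes = true is that nodes have no external IPs, so pulling images from registries outside Google fails unless the VPC provides outbound NAT. The module as shown does not create one. A minimal sketch of a Cloud NAT in the same region follows, assuming hypothetical resource names and a placeholder network name:

```hcl
# Hypothetical egress path for private nodes. Names are assumptions, and
# "my-vpc" stands in for the VPC you pass to the module as var.network.
resource "google_compute_router" "gke" {
  name    = "kv-gke-router"
  region  = "us-central1"
  network = "my-vpc"
}

resource "google_compute_router_nat" "gke" {
  name                               = "kv-gke-nat"
  router                             = google_compute_router.gke.name
  region                             = google_compute_router.gke.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```

Keeping NAT outside the module is a design choice: the network usually has a different lifecycle and owner than any one cluster.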

variables.tf

variable "project_id" {
  description = "GCP project ID that owns the cluster."
  type        = string
}

variable "cluster_name" {
  description = "Name of the GKE cluster. Lowercase letters, numbers and hyphens; must start with a letter."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{0,38}[a-z0-9]$", var.cluster_name))
    error_message = "cluster_name must be 2-40 chars, lowercase alphanumeric or '-', start with a letter, and not end with '-'."
  }
}

variable "region" {
  description = "Region for a regional cluster (e.g. asia-south1). Use a region, not a zone, for HA control plane."
  type        = string
}

variable "network" {
  description = "Self-link or name of the VPC network the cluster attaches to."
  type        = string
}

variable "subnetwork" {
  description = "Self-link or name of the subnetwork the nodes live in."
  type        = string
}

variable "pods_range_name" {
  description = "Name of the subnet secondary range used for pod IPs (alias IPs)."
  type        = string
}

variable "services_range_name" {
  description = "Name of the subnet secondary range used for service (ClusterIP) IPs."
  type        = string
}

variable "release_channel" {
  description = "GKE release channel governing auto-upgrade cadence."
  type        = string
  default     = "REGULAR"

  validation {
    condition     = contains(["RAPID", "REGULAR", "STABLE", "UNSPECIFIED"], var.release_channel)
    error_message = "release_channel must be one of RAPID, REGULAR, STABLE, or UNSPECIFIED."
  }
}

variable "enable_private_nodes" {
  description = "If true, nodes receive only internal IPs."
  type        = bool
  default     = true
}

variable "enable_private_endpoint" {
  description = "If true, the control-plane endpoint is private (reachable only from authorized internal networks)."
  type        = bool
  default     = false
}

variable "master_ipv4_cidr_block" {
  description = "RFC 1918 /28 block for the managed control plane. Required when enable_private_nodes is true."
  type        = string
  default     = "172.16.0.0/28"

  validation {
    condition     = can(cidrhost(var.master_ipv4_cidr_block, 0)) && tonumber(split("/", var.master_ipv4_cidr_block)[1]) == 28
    error_message = "master_ipv4_cidr_block must be a valid /28 CIDR."
  }
}

variable "master_authorized_networks" {
  description = "CIDR blocks allowed to reach the control-plane endpoint."
  type = list(object({
    cidr_block   = string
    display_name = string
  }))
  default = []
}

variable "enable_dataplane_v2" {
  description = "Enable GKE Dataplane V2 (Cilium/eBPF) for advanced networking and network policy."
  type        = bool
  default     = true
}

variable "deletion_protection" {
  description = "Prevent accidental destroy of the cluster via Terraform."
  type        = bool
  default     = true
}

variable "node_service_account" {
  description = "Email of the IAM service account attached to nodes. Use a dedicated least-privilege SA."
  type        = string
}

variable "machine_type" {
  description = "Compute Engine machine type for node VMs."
  type        = string
  default     = "e2-standard-4"
}

variable "node_count" {
  description = "Initial node count per zone for the primary node pool."
  type        = number
  default     = 1

  validation {
    condition     = var.node_count >= 1
    error_message = "node_count must be at least 1."
  }
}

variable "min_node_count" {
  description = "Minimum nodes per zone for autoscaling."
  type        = number
  default     = 1
}

variable "max_node_count" {
  description = "Maximum nodes per zone for autoscaling."
  type        = number
  default     = 5

  validation {
    condition     = var.max_node_count >= var.min_node_count
    error_message = "max_node_count must be >= min_node_count."
  }
}

variable "disk_size_gb" {
  description = "Boot disk size (GB) per node."
  type        = number
  default     = 100
}

variable "disk_type" {
  description = "Boot disk type for nodes."
  type        = string
  default     = "pd-balanced"

  validation {
    condition     = contains(["pd-standard", "pd-balanced", "pd-ssd"], var.disk_type)
    error_message = "disk_type must be pd-standard, pd-balanced, or pd-ssd."
  }
}

variable "max_surge" {
  description = "Extra nodes allowed above pool size during a surge upgrade."
  type        = number
  default     = 1
}

variable "max_unavailable" {
  description = "Nodes allowed to be unavailable during an upgrade."
  type        = number
  default     = 0
}

variable "maintenance_start_time" {
  description = "RFC3339 start of the recurring maintenance window."
  type        = string
  default     = "2026-01-01T18:00:00Z"
}

variable "maintenance_end_time" {
  description = "RFC3339 end of the recurring maintenance window."
  type        = string
  default     = "2026-01-01T22:00:00Z"
}

variable "maintenance_recurrence" {
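  description = "RFC5545 RRULE for the maintenance window recurrence. GKE expects at least 48 hours of window time per rolling 32 days, so widen the window or add days if the API rejects the schedule."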
  type        = string
  default     = "FREQ=WEEKLY;BYDAY=SA,SU"
}

variable "labels" {
  description = "Resource labels applied to the cluster."
  type        = map(string)
  default     = {}
}

variable "node_labels" {
  description = "Kubernetes labels applied to nodes in the primary pool."
  type        = map(string)
  default     = {}
}

variable "node_network_tags" {
  description = "Network tags applied to node VMs (for firewall targeting)."
  type        = list(string)
  default     = []
}

outputs.tf

output "cluster_id" {
  description = "Fully qualified GKE cluster ID."
  value       = google_container_cluster.this.id
}

output "cluster_name" {
  description = "Name of the GKE cluster."
  value       = google_container_cluster.this.name
}

output "endpoint" {
  description = "IP address of the cluster's Kubernetes API server endpoint."
  value       = google_container_cluster.this.endpoint
  sensitive   = true
}

output "cluster_ca_certificate" {
  description = "Base64-encoded public CA certificate for the cluster control plane."
  value       = google_container_cluster.this.master_auth[0].cluster_ca_certificate
  sensitive   = true
}

output "location" {
  description = "Region/location the cluster runs in."
  value       = google_container_cluster.this.location
}

output "workload_identity_pool" {
  description = "Workload Identity pool for binding KSAs to GCP service accounts."
  value       = google_container_cluster.this.workload_identity_config[0].workload_pool
}

output "primary_node_pool_name" {
  description = "Name of the primary managed node pool."
  value       = google_container_node_pool.primary.name
}

How to use it

module "gke_cluster" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"

  project_id   = "kloudvin-prod"
  cluster_name = "kv-prod-apps"
  region       = "asia-south1"

  # VPC-native wiring: the subnet must already define these secondary ranges.
  network             = google_compute_network.vpc.self_link
  subnetwork          = google_compute_subnetwork.gke.self_link
  pods_range_name     = "gke-pods"
  services_range_name = "gke-services"

  # Security baseline
  enable_private_nodes    = true
  enable_private_endpoint = false
  master_ipv4_cidr_block  = "172.16.8.0/28"
  release_channel         = "STABLE"

  master_authorized_networks = [
    {
      cidr_block   = "10.20.0.0/16"
      display_name = "corp-vpn"
    }
  ]

  # Least-privilege node identity
  node_service_account = google_service_account.gke_nodes.email

  # Capacity
  machine_type   = "e2-standard-8"
  node_count     = 2
  min_node_count = 2
  max_node_count = 10

  labels = {
    env   = "prod"
    team  = "platform"
    owner = "vinod"
  }
}

# Downstream: configure the Kubernetes/Helm providers from the module outputs
# so you can deploy workloads into the cluster you just created.
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${module.gke_cluster.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate)
}

# Downstream: bind a Kubernetes service account to a GCP SA via the
# Workload Identity pool the module exposes.
resource "google_service_account_iam_member" "wi_binding" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${module.gke_cluster.workload_identity_pool}[default/checkout-api]"
}
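The IAM binding above is only half of a Workload Identity setup: the Kubernetes service account also needs an annotation naming the GCP service account it impersonates. A minimal sketch, assuming the kubernetes provider configured above and the same google_service_account.app (which is referenced but not defined in this article):

```hcl
# Hypothetical Kubernetes side of the binding. The namespace and name must
# match the [default/checkout-api] suffix used in the IAM member string.
resource "kubernetes_service_account_v1" "checkout_api" {
  metadata {
    name      = "checkout-api"
    namespace = "default"

    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.app.email
    }
  }
}
```

Pods that set serviceAccountName: checkout-api then receive short-lived credentials for the bound GCP service account through the GKE metadata server.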

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config: live/prod/gke/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"
}

inputs = {
  project_id           = "..."
  cluster_name         = "..."
  region               = "..."
  network              = "..."
  subnetwork           = "..."
  pods_range_name      = "..."
  services_range_name  = "..."
  node_service_account = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/gke && terragrunt apply        # this module only
cd live/prod && terragrunt run-all apply    # every module under live/prod (run from the repo root)

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
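The dependency ordering that run-all relies on comes from dependency blocks. A minimal sketch of how the module config from step 2 could take its network inputs from a sibling unit follows. It assumes a hypothetical ../network unit and hypothetical output names, so adapt both to whatever your network module exports:

```hcl
# Addition to live/prod/gke/terragrunt.hcl (sketch). The ../network unit and
# its output names are hypothetical.
dependency "network" {
  config_path = "../network"

  # Placeholder values so validate/plan work before the network is applied.
  mock_outputs = {
    network_self_link    = "mock-network"
    subnetwork_self_link = "mock-subnetwork"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  network    = dependency.network.outputs.network_self_link
  subnetwork = dependency.network.outputs.subnetwork_self_link
  # ...remaining required inputs as in step 2...
}
```

With this in place, run-all applies ../network before the cluster because the dependency is explicit.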

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | | yes | GCP project ID that owns the cluster. |
| cluster_name | string | | yes | Cluster name (validated: lowercase, starts with a letter, ≤40 chars). |
| region | string | | yes | Region for the regional (HA) cluster. |
| network | string | | yes | VPC network self-link or name. |
| subnetwork | string | | yes | Subnetwork self-link or name for nodes. |
| pods_range_name | string | | yes | Subnet secondary range name for pod alias IPs. |
| services_range_name | string | | yes | Subnet secondary range name for service IPs. |
| node_service_account | string | | yes | Email of the least-privilege node SA. |
| release_channel | string | "REGULAR" | no | Release channel: RAPID, REGULAR, STABLE, UNSPECIFIED. |
| enable_private_nodes | bool | true | no | Give nodes internal-only IPs. |
| enable_private_endpoint | bool | false | no | Make the control-plane endpoint private. |
| master_ipv4_cidr_block | string | "172.16.0.0/28" | no | /28 block for the managed control plane. |
| master_authorized_networks | list(object) | [] | no | CIDRs allowed to reach the control plane. |
| enable_dataplane_v2 | bool | true | no | Enable GKE Dataplane V2 (eBPF) + network policy. |
| deletion_protection | bool | true | no | Block Terraform destroy of the cluster. |
| machine_type | string | "e2-standard-4" | no | Node VM machine type. |
| node_count | number | 1 | no | Initial nodes per zone. |
| min_node_count | number | 1 | no | Autoscaler minimum nodes per zone. |
| max_node_count | number | 5 | no | Autoscaler maximum nodes per zone (≥ min). |
| disk_size_gb | number | 100 | no | Node boot disk size in GB. |
| disk_type | string | "pd-balanced" | no | Node boot disk type. |
| max_surge | number | 1 | no | Surge nodes during upgrades. |
| max_unavailable | number | 0 | no | Unavailable nodes during upgrades. |
| maintenance_start_time | string | "2026-01-01T18:00:00Z" | no | RFC3339 maintenance window start. |
| maintenance_end_time | string | "2026-01-01T22:00:00Z" | no | RFC3339 maintenance window end. |
| maintenance_recurrence | string | "FREQ=WEEKLY;BYDAY=SA,SU" | no | RRULE for the maintenance window. |
| labels | map(string) | {} | no | Resource labels on the cluster. |
| node_labels | map(string) | {} | no | Kubernetes labels on primary-pool nodes. |
| node_network_tags | list(string) | [] | no | Network tags for firewall targeting. |
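The node_count, min_node_count and max_node_count inputs are per zone, so in a regional cluster the cluster-wide totals are a multiple of what you type. A small worked example follows, using the capacity values from the usage example. The three-zone spread is an assumption, because the module does not set node_locations:

```hcl
# Worked example: per-zone inputs versus cluster-wide totals.
# Assumption: the regional node pool spans three zones.
locals {
  zone_count = 3

  node_count     = 2  # per zone
  min_node_count = 2  # per zone
  max_node_count = 10 # per zone
}

output "initial_total_nodes" {
  value = local.node_count * local.zone_count # 2 * 3 = 6
}

output "autoscaler_min_total_nodes" {
  value = local.min_node_count * local.zone_count # 2 * 3 = 6
}

output "autoscaler_max_total_nodes" {
  value = local.max_node_count * local.zone_count # 10 * 3 = 30
}
```

Size quotas and subnet ranges against the maximum total, not the per-zone figure.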

Outputs

| Name | Description |
| --- | --- |
| cluster_id | Fully qualified GKE cluster ID. |
| cluster_name | Name of the cluster. |
| endpoint | API server endpoint IP (sensitive). |
| cluster_ca_certificate | Base64 CA cert for the control plane (sensitive). |
| location | Region the cluster runs in. |
| workload_identity_pool | Workload Identity pool (<project>.svc.id.goog) for KSA→GSA bindings. |
| primary_node_pool_name | Name of the primary managed node pool. |

Enterprise scenario

A fintech platform team runs the same checkout and ledger services across three GKE clusters (kv-dev-apps, kv-stg-apps, and kv-prod-apps) in asia-south1 and is subject to PCI controls. They consume this module from three Terraform workspaces with identical code and per-environment .tfvars: prod pins release_channel = "STABLE", sets enable_private_endpoint = true with only the corporate VPN CIDR in master_authorized_networks, and scales max_node_count to 30, while dev stays on REGULAR with a smaller e2-standard-4 pool. Because Workload Identity is enabled uniformly, every pod authenticates to Cloud SQL and Secret Manager through a bound GCP service account with zero exported keys, satisfying the auditors that node and pod credentials are short-lived and least-privilege everywhere.
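The per-environment split in this scenario can be sketched as a variables file. This assumes the calling root module declares variables with the same names and forwards them to the gke module, and the VPN CIDR is illustrative:

```hcl
# prod.tfvars (sketch). Assumes the root module declares matching variables
# and passes them through to the gke module. The CIDR is illustrative.
cluster_name            = "kv-prod-apps"
region                  = "asia-south1"
release_channel         = "STABLE"
enable_private_endpoint = true
max_node_count          = 30

master_authorized_networks = [
  {
    cidr_block   = "10.20.0.0/16"
    display_name = "corp-vpn"
  }
]

# dev.tfvars would differ only in values, for example:
#   cluster_name    = "kv-dev-apps"
#   release_channel = "REGULAR"
#   machine_type    = "e2-standard-4"
```

Each workspace then runs the same code with terraform apply -var-file=prod.tfvars (or the dev and staging equivalents).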

Best practices


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
