Quick take — Build a reusable Terraform module for a production GKE cluster on hashicorp/google ~> 5.0: VPC-native networking, Workload Identity, private nodes, release channels, and a managed node pool — all var-driven. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "gke" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"
project_id = "..." # GCP project ID that owns the cluster.
cluster_name = "..." # Cluster name (validated: lowercase, starts with a lette…
region = "..." # Region for the regional (HA) cluster.
network = "..." # VPC network self-link or name.
subnetwork = "..." # Subnetwork self-link or name for nodes.
pods_range_name = "..." # Subnet secondary range name for pod alias IPs.
services_range_name = "..." # Subnet secondary range name for service IPs.
node_service_account = "..." # Email of the least-privilege node SA.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
A Google Kubernetes Engine (GKE) cluster is GCP’s managed Kubernetes control plane plus the worker infrastructure that runs your pods. The control plane (API server, scheduler, etcd) is fully managed by Google; you own the node pools, the VM groups that actually run your workloads. In Terraform, the cluster itself is the google_container_cluster resource, and the worker VMs live in one or more google_container_node_pool resources.
The reason to wrap this in a reusable module is that a correct GKE cluster is deceptively fiddly. The defaults Google ships are not the ones you want in production: a default cluster gives nodes public IPs, leaves Workload Identity off, and comes with a built-in “default node pool” that you almost always want to delete and replace with your own managed pool. Legacy surfaces such as ABAC, basic authentication, and client certificates also have to stay off. Getting VPC-native (alias IP) networking, a private control plane, release channels, and Workload Identity wired together correctly takes ~150 lines of HCL that nobody wants to copy-paste, and copy-paste is exactly how one environment ends up with shielded nodes and another without.
This module bakes those production decisions in once. It creates a VPC-native, regional cluster with the default node pool removed, attaches a separately managed node pool with autoscaling and auto-repair, enables Workload Identity so pods authenticate to GCP APIs without node-level service account keys, and optionally makes the control plane private. Every environment-specific value — project, region, CIDR ranges, machine type, node count — is a variable, so dev, staging, and prod are the same code with different .tfvars.
When to use it
- You run more than one GKE cluster (per-environment, per-region, or per-team) and want them provisioned identically rather than hand-tuned.
- You need VPC-native networking (alias IPs) because you’re peering with other VPCs, using GKE Ingress with container-native load balancing, or you’ve outgrown route-based clusters.
- You want Workload Identity as the standard for pod-to-GCP-API auth and need it enabled consistently — not bolted on later.
- You’re enforcing a security baseline (private nodes, shielded nodes, release channel, no legacy ABAC/basic-auth) and want it as code, not a runbook.
- You’re standing up clusters through CI/CD and need a stable module reference (?ref=v1.0.0) so a cluster’s config is reproducible and reviewable.
If you only ever need a single throwaway sandbox cluster, gcloud container clusters create is faster. The moment a second cluster appears, reach for the module.
Module structure
terraform-module-gcp-gke/
├── versions.tf # provider + Terraform version pins
├── main.tf # google_container_cluster + google_container_node_pool
├── variables.tf # all environment-specific inputs (with validation)
├── outputs.tf # cluster id/name/endpoint + CA cert + node pool name
└── README.md
versions.tf
terraform {
# 1.9+ is needed because the max_node_count validation references another variable.
required_version = ">= 1.9.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
locals {
# GKE requires the control-plane CIDR to be a /28.
master_ipv4_cidr_block = var.master_ipv4_cidr_block
# Workload Identity pool is always "<project>.svc.id.goog".
workload_pool = "${var.project_id}.svc.id.goog"
}
resource "google_container_cluster" "this" {
provider = google
name = var.cluster_name
project = var.project_id
location = var.region
# Removing the default node pool is the recommended pattern: it lets us
# manage the real worker pool separately via google_container_node_pool.
remove_default_node_pool = true
initial_node_count = 1
# Deletion protection prevents accidental `terraform destroy` of the cluster.
deletion_protection = var.deletion_protection
# VPC-native (alias IP) networking. Cluster/services ranges are looked up
# by the secondary range *names* defined on the subnet.
networking_mode = "VPC_NATIVE"
network = var.network
subnetwork = var.subnetwork
ip_allocation_policy {
cluster_secondary_range_name = var.pods_range_name
services_secondary_range_name = var.services_range_name
}
# Managed control-plane upgrades. Use a release channel rather than pinning
# min_master_version so Google handles patching within the channel.
release_channel {
channel = var.release_channel
}
# Workload Identity: the recommended way for pods to authenticate to Google
# APIs without node-level service account keys.
workload_identity_config {
workload_pool = local.workload_pool
}
# Private control plane / private nodes. Nodes get only internal IPs;
# the master endpoint is optionally private as well.
private_cluster_config {
enable_private_nodes = var.enable_private_nodes
enable_private_endpoint = var.enable_private_endpoint
master_ipv4_cidr_block = var.enable_private_nodes ? local.master_ipv4_cidr_block : null
}
# Restrict who can reach the public control-plane endpoint.
dynamic "master_authorized_networks_config" {
for_each = length(var.master_authorized_networks) > 0 ? [1] : []
content {
dynamic "cidr_blocks" {
for_each = var.master_authorized_networks
content {
cidr_block = cidr_blocks.value.cidr_block
display_name = cidr_blocks.value.display_name
}
}
}
}
# Disable legacy auth surfaces. (Basic auth was removed in GKE 1.19 and client
# certificates are not issued by default, so we simply do not configure
# master_auth credentials.)
enable_legacy_abac = false
# Shielded Nodes hardens the VM boot integrity for the whole cluster.
enable_shielded_nodes = true
# Optional dataplane v2 (eBPF-based networking + network policy).
datapath_provider = var.enable_dataplane_v2 ? "ADVANCED_DATAPATH" : "DATAPATH_PROVIDER_UNSPECIFIED"
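# Maintenance window so node/control-plane upgrades land off-peak. GKE expects
# at least 48 hours of window time in any 32-day period (in blocks of 4+ hours);
# widen the window or add days if the API rejects the defaults.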
maintenance_policy {
recurring_window {
start_time = var.maintenance_start_time
end_time = var.maintenance_end_time
recurrence = var.maintenance_recurrence
}
}
resource_labels = var.labels
lifecycle {
# The control plane occasionally rewrites node_config on the (removed)
# default pool; ignore it so plans stay clean.
ignore_changes = [node_config]
}
}
resource "google_container_node_pool" "primary" {
provider = google
name = "${var.cluster_name}-primary"
project = var.project_id
location = var.region
cluster = google_container_cluster.this.name
# With autoscaling, initial_node_count is per-zone; a regional cluster
# multiplies this across its zones.
initial_node_count = var.node_count
autoscaling {
min_node_count = var.min_node_count
max_node_count = var.max_node_count
}
management {
auto_repair = true
auto_upgrade = true
}
upgrade_settings {
max_surge = var.max_surge
max_unavailable = var.max_unavailable
}
node_config {
machine_type = var.machine_type
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
image_type = "COS_CONTAINERD"
# Least-privilege node identity. Prefer a dedicated SA with only the
# logging/monitoring/artifact-registry roles it needs.
service_account = var.node_service_account
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
# Required to let pods on these nodes use Workload Identity.
workload_metadata_config {
mode = "GKE_METADATA"
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
labels = var.node_labels
tags = var.node_network_tags
metadata = {
disable-legacy-endpoints = "true"
}
}
lifecycle {
create_before_destroy = true
}
}
variables.tf
variable "project_id" {
description = "GCP project ID that owns the cluster."
type = string
}
variable "cluster_name" {
description = "Name of the GKE cluster. Lowercase letters, numbers and hyphens; must start with a letter."
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{0,38}[a-z0-9]$", var.cluster_name))
error_message = "cluster_name must be 2-40 chars, lowercase alphanumeric or '-', start with a letter, and not end with '-'."
}
}
variable "region" {
description = "Region for a regional cluster (e.g. asia-south1). Use a region, not a zone, for HA control plane."
type = string
}
variable "network" {
description = "Self-link or name of the VPC network the cluster attaches to."
type = string
}
variable "subnetwork" {
description = "Self-link or name of the subnetwork the nodes live in."
type = string
}
variable "pods_range_name" {
description = "Name of the subnet secondary range used for pod IPs (alias IPs)."
type = string
}
variable "services_range_name" {
description = "Name of the subnet secondary range used for service (ClusterIP) IPs."
type = string
}
variable "release_channel" {
description = "GKE release channel governing auto-upgrade cadence."
type = string
default = "REGULAR"
validation {
condition = contains(["RAPID", "REGULAR", "STABLE", "UNSPECIFIED"], var.release_channel)
error_message = "release_channel must be one of RAPID, REGULAR, STABLE, or UNSPECIFIED."
}
}
variable "enable_private_nodes" {
description = "If true, nodes receive only internal IPs."
type = bool
default = true
}
variable "enable_private_endpoint" {
description = "If true, the control-plane endpoint is private (reachable only from authorized internal networks)."
type = bool
default = false
}
variable "master_ipv4_cidr_block" {
description = "RFC 1918 /28 block for the managed control plane. Required when enable_private_nodes is true."
type = string
default = "172.16.0.0/28"
validation {
condition = can(cidrhost(var.master_ipv4_cidr_block, 0)) && tonumber(split("/", var.master_ipv4_cidr_block)[1]) == 28
error_message = "master_ipv4_cidr_block must be a valid /28 CIDR."
}
}
variable "master_authorized_networks" {
description = "CIDR blocks allowed to reach the control-plane endpoint."
type = list(object({
cidr_block = string
display_name = string
}))
default = []
}
variable "enable_dataplane_v2" {
description = "Enable GKE Dataplane V2 (Cilium/eBPF) for advanced networking and network policy."
type = bool
default = true
}
variable "deletion_protection" {
description = "Prevent accidental destroy of the cluster via Terraform."
type = bool
default = true
}
variable "node_service_account" {
description = "Email of the IAM service account attached to nodes. Use a dedicated least-privilege SA."
type = string
}
variable "machine_type" {
description = "Compute Engine machine type for node VMs."
type = string
default = "e2-standard-4"
}
variable "node_count" {
description = "Initial node count per zone for the primary node pool."
type = number
default = 1
validation {
condition = var.node_count >= 1
error_message = "node_count must be at least 1."
}
}
variable "min_node_count" {
description = "Minimum nodes per zone for autoscaling."
type = number
default = 1
}
variable "max_node_count" {
description = "Maximum nodes per zone for autoscaling."
type = number
default = 5
validation {
condition = var.max_node_count >= var.min_node_count
error_message = "max_node_count must be >= min_node_count."
}
}
variable "disk_size_gb" {
description = "Boot disk size (GB) per node."
type = number
default = 100
}
variable "disk_type" {
description = "Boot disk type for nodes."
type = string
default = "pd-balanced"
validation {
condition = contains(["pd-standard", "pd-balanced", "pd-ssd"], var.disk_type)
error_message = "disk_type must be pd-standard, pd-balanced, or pd-ssd."
}
}
variable "max_surge" {
description = "Extra nodes allowed above pool size during a surge upgrade."
type = number
default = 1
}
variable "max_unavailable" {
description = "Nodes allowed to be unavailable during an upgrade."
type = number
default = 0
}
variable "maintenance_start_time" {
description = "RFC3339 start of the recurring maintenance window."
type = string
default = "2026-01-01T18:00:00Z"
}
variable "maintenance_end_time" {
description = "RFC3339 end of the recurring maintenance window."
type = string
default = "2026-01-01T22:00:00Z"
}
variable "maintenance_recurrence" {
description = "RFC5545 RRULE for the maintenance window recurrence."
type = string
default = "FREQ=WEEKLY;BYDAY=SA,SU"
}
variable "labels" {
description = "Resource labels applied to the cluster."
type = map(string)
default = {}
}
variable "node_labels" {
description = "Kubernetes labels applied to nodes in the primary pool."
type = map(string)
default = {}
}
variable "node_network_tags" {
description = "Network tags applied to node VMs (for firewall targeting)."
type = list(string)
default = []
}
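A validation condition that refers to a second variable, like the one on max_node_count, is only accepted by Terraform 1.9 and newer. If the module has to run on older releases, one option is to express the same rule as a precondition instead. The sketch below is an assumption, not part of the published module: the terraform_data resource name is hypothetical, and the two variable declarations are repeated only so the fragment stands on its own.

```hcl
variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 5
}

# terraform_data (Terraform >= 1.4) creates nothing in GCP; here it only
# carries the check. The resource name is hypothetical.
resource "terraform_data" "node_count_guard" {
  lifecycle {
    precondition {
      condition     = var.max_node_count >= var.min_node_count
      error_message = "max_node_count must be >= min_node_count."
    }
  }
}
```

The same precondition block could instead sit in the lifecycle block of google_container_node_pool.primary, which keeps the check next to the autoscaling settings it protects.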
outputs.tf
output "cluster_id" {
description = "Fully qualified GKE cluster ID."
value = google_container_cluster.this.id
}
output "cluster_name" {
description = "Name of the GKE cluster."
value = google_container_cluster.this.name
}
output "endpoint" {
description = "IP address of the cluster's Kubernetes API server endpoint."
value = google_container_cluster.this.endpoint
sensitive = true
}
output "cluster_ca_certificate" {
description = "Base64-encoded public CA certificate for the cluster control plane."
value = google_container_cluster.this.master_auth[0].cluster_ca_certificate
sensitive = true
}
output "location" {
description = "Region/location the cluster runs in."
value = google_container_cluster.this.location
}
output "workload_identity_pool" {
description = "Workload Identity pool for binding KSAs to GCP service accounts."
value = google_container_cluster.this.workload_identity_config[0].workload_pool
}
output "primary_node_pool_name" {
description = "Name of the primary managed node pool."
value = google_container_node_pool.primary.name
}
How to use it
module "gke_cluster" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"
project_id = "kloudvin-prod"
cluster_name = "kv-prod-apps"
region = "asia-south1"
# VPC-native wiring: the subnet must already define these secondary ranges.
network = google_compute_network.vpc.self_link
subnetwork = google_compute_subnetwork.gke.self_link
pods_range_name = "gke-pods"
services_range_name = "gke-services"
# Security baseline
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = "172.16.8.0/28"
release_channel = "STABLE"
master_authorized_networks = [
{
cidr_block = "10.20.0.0/16"
display_name = "corp-vpn"
}
]
# Least-privilege node identity
node_service_account = google_service_account.gke_nodes.email
# Capacity
machine_type = "e2-standard-8"
node_count = 2
min_node_count = 2
max_node_count = 10
labels = {
env = "prod"
team = "platform"
owner = "vinod"
}
}
# Downstream: configure the Kubernetes/Helm providers from the module outputs
# so you can deploy workloads into the cluster you just created.
data "google_client_config" "default" {}
provider "kubernetes" {
host = "https://${module.gke_cluster.endpoint}"
token = data.google_client_config.default.access_token
cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate)
}
# Downstream: bind a Kubernetes service account to a GCP SA via the
# Workload Identity pool the module exposes.
resource "google_service_account_iam_member" "wi_binding" {
service_account_id = google_service_account.app.name
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${module.gke_cluster.workload_identity_pool}[default/checkout-api]"
}
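The module call above references a VPC, a subnet with two named secondary ranges, and a node service account that the module itself does not create. A minimal sketch of those prerequisites might look like the following. The resource names match the references in the example, but the display names, account ID, and every CIDR range are assumptions to replace with your own addressing plan. The application service account used in the binding would be a similar google_service_account resource and is omitted here.

```hcl
# Hypothetical prerequisites for the usage example. CIDR ranges are placeholders.
resource "google_compute_network" "vpc" {
  project                 = "kloudvin-prod"
  name                    = "kv-prod-vpc" # assumed name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke" {
  project       = "kloudvin-prod"
  name          = "kv-prod-gke" # assumed name
  region        = "asia-south1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.10.0.0/20" # node IPs (assumed)

  # Lets nodes without external IPs reach Google APIs.
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "gke-pods" # must equal pods_range_name
    ip_cidr_range = "10.16.0.0/14"
  }

  secondary_ip_range {
    range_name    = "gke-services" # must equal services_range_name
    ip_cidr_range = "10.32.0.0/20"
  }
}

resource "google_service_account" "gke_nodes" {
  project      = "kloudvin-prod"
  account_id   = "kv-prod-gke-nodes" # assumed ID
  display_name = "GKE node service account"
}
```

The placeholder ranges were chosen so they do not overlap each other, the 172.16.8.0/28 control-plane block, or the 10.20.0.0/16 authorized network from the example.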
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/gke/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-gke?ref=v1.0.0"
}
inputs = {
project_id = "..."
cluster_name = "..."
region = "..."
network = "..."
subnetwork = "..."
pods_range_name = "..."
services_range_name = "..."
node_service_account = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/gke && terragrunt apply # this module only
cd live/prod && terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
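The root config in step 1 only generates the backend. To keep the provider in one place as well, Terragrunt's generate block can write a provider.tf into every module directory. The fragment below is a sketch of that addition to live/terragrunt.hcl; the project and region values are placeholders carried over from the Quickstart, not real settings.

```hcl
# Addition to live/terragrunt.hcl (project and region are placeholders).
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

With if_exists set to overwrite_terragrunt, Terragrunt replaces only files it generated itself, so a hand-written provider.tf in a module is not silently overwritten.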
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | — | yes | GCP project ID that owns the cluster. |
| cluster_name | string | — | yes | Cluster name (validated: lowercase, starts with a letter, ≤40 chars). |
| region | string | — | yes | Region for the regional (HA) cluster. |
| network | string | — | yes | VPC network self-link or name. |
| subnetwork | string | — | yes | Subnetwork self-link or name for nodes. |
| pods_range_name | string | — | yes | Subnet secondary range name for pod alias IPs. |
| services_range_name | string | — | yes | Subnet secondary range name for service IPs. |
| node_service_account | string | — | yes | Email of the least-privilege node SA. |
| release_channel | string | "REGULAR" | no | Release channel: RAPID, REGULAR, STABLE, UNSPECIFIED. |
| enable_private_nodes | bool | true | no | Give nodes internal-only IPs. |
| enable_private_endpoint | bool | false | no | Make the control-plane endpoint private. |
| master_ipv4_cidr_block | string | "172.16.0.0/28" | no | /28 block for the managed control plane. |
| master_authorized_networks | list(object) | [] | no | CIDRs allowed to reach the control plane. |
| enable_dataplane_v2 | bool | true | no | Enable GKE Dataplane V2 (eBPF) + network policy. |
| deletion_protection | bool | true | no | Block Terraform destroy of the cluster. |
| machine_type | string | "e2-standard-4" | no | Node VM machine type. |
| node_count | number | 1 | no | Initial nodes per zone. |
| min_node_count | number | 1 | no | Autoscaler minimum nodes per zone. |
| max_node_count | number | 5 | no | Autoscaler maximum nodes per zone (≥ min). |
| disk_size_gb | number | 100 | no | Node boot disk size in GB. |
| disk_type | string | "pd-balanced" | no | Node boot disk type. |
| max_surge | number | 1 | no | Surge nodes during upgrades. |
| max_unavailable | number | 0 | no | Unavailable nodes during upgrades. |
| maintenance_start_time | string | "2026-01-01T18:00:00Z" | no | RFC3339 maintenance window start. |
| maintenance_end_time | string | "2026-01-01T22:00:00Z" | no | RFC3339 maintenance window end. |
| maintenance_recurrence | string | "FREQ=WEEKLY;BYDAY=SA,SU" | no | RRULE for the maintenance window. |
| labels | map(string) | {} | no | Resource labels on the cluster. |
| node_labels | map(string) | {} | no | Kubernetes labels on primary-pool nodes. |
| node_network_tags | list(string) | [] | no | Network tags for firewall targeting. |
Outputs
| Name | Description |
|---|---|
| cluster_id | Fully qualified GKE cluster ID. |
| cluster_name | Name of the cluster. |
| endpoint | API server endpoint IP (sensitive). |
| cluster_ca_certificate | Base64 CA cert for the control plane (sensitive). |
| location | Region the cluster runs in. |
| workload_identity_pool | Workload Identity pool (<project>.svc.id.goog) for KSA→GSA bindings. |
| primary_node_pool_name | Name of the primary managed node pool. |
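The workload_identity_pool output covers the GCP side of a Workload Identity binding. For a pod to actually pick up the GCP identity, the Kubernetes service account also needs the iam.gke.io/gcp-service-account annotation. A minimal sketch using the hashicorp/kubernetes provider follows; it assumes that provider is already configured against the cluster, and the GSA definition, its account ID, and the checkout-api name in the default namespace are illustrative rather than part of the module.

```hcl
# Hypothetical application GSA; the IAM binding example refers to it as
# google_service_account.app.
resource "google_service_account" "app" {
  project      = "kloudvin-prod"
  account_id   = "checkout-api" # assumed ID
  display_name = "checkout-api workload identity"
}

# Kubernetes side of the binding: the KSA named in the IAM member string
# ("default/checkout-api") points at the GSA through this annotation.
resource "kubernetes_service_account" "checkout_api" {
  metadata {
    name      = "checkout-api"
    namespace = "default"
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.app.email
    }
  }
}
```

Pods then set serviceAccountName to checkout-api. The namespace and name must match the bracketed part of the IAM member string exactly, or token exchange is denied.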
Enterprise scenario
A fintech platform team runs the same checkout and ledger services across three GKE clusters — kv-dev-apps, kv-stg-apps, and kv-prod-apps in asia-south1 — and is subject to PCI controls. They consume this module from three Terraform workspaces with identical code and per-environment .tfvars: prod pins release_channel = "STABLE", sets enable_private_endpoint = true with only the corporate VPN CIDR in master_authorized_networks, and scales max_node_count to 30, while dev stays on REGULAR with a smaller e2-standard-4 pool. Because Workload Identity is enabled uniformly, every pod authenticates to Cloud SQL and Secret Manager through a bound GCP service account with zero exported keys, satisfying the auditors that node and pod credentials are short-lived and least-privilege everywhere.
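The per-environment differences in this scenario could be captured in two small .tfvars files along the following lines. Only the values named in the scenario are taken from the text; the VPN CIDR and display name are placeholders, and the remaining required inputs (project, network, subnet ranges, node service account) are left out for brevity.

```hcl
# prod.tfvars (sketch)
cluster_name            = "kv-prod-apps"
region                  = "asia-south1"
release_channel         = "STABLE"
enable_private_endpoint = true
max_node_count          = 30
master_authorized_networks = [
  {
    cidr_block   = "10.20.0.0/16" # corporate VPN range, placeholder
    display_name = "corp-vpn"
  }
]

# dev.tfvars (sketch, kept in a separate file)
cluster_name    = "kv-dev-apps"
region          = "asia-south1"
release_channel = "REGULAR"
machine_type    = "e2-standard-4"
```

Each workspace would then run terraform apply -var-file=prod.tfvars (or dev.tfvars) against the same module call.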
Best practices
- Remove the default node pool and manage pools separately (this module does). Editing node_config on the in-cluster default pool forces cluster recreation; a standalone google_container_node_pool lets you change machine types, scale, or roll node OS versions without touching the control plane.
- Use a release channel, not a pinned master version. STABLE for prod, REGULAR for non-prod. Channels give you automatic, tested security patches; pinning min_master_version leaves you responsible for CVE patching and tends to rot.
- Give nodes a dedicated least-privilege service account, never the default Compute Engine SA, which carries Editor on the project. Grant the node SA only logging.logWriter, monitoring.metricWriter, monitoring.viewer, and artifactregistry.reader, and push all app-level access through Workload Identity bindings instead.
- Right-size with autoscaling and pick the disk to match the workload. Set realistic min/max_node_count, use pd-balanced (not pd-ssd) unless you measure I/O pressure, and prefer e2/t2d families for general workloads to keep INR spend down; reserve n2/c3 for CPU-bound services.
- Keep nodes private and lock the control plane. enable_private_nodes = true plus a tight master_authorized_networks list (or enable_private_endpoint = true) removes the public attack surface; pair with enable_shielded_nodes, secure boot, and integrity monitoring, all set here by default.
- Name and label for cost attribution. Encode environment and region in cluster_name (kv-prod-apps) and always set labels with env/team/owner so GKE node VMs roll up cleanly in billing exports and budget alerts.
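The least-privilege role set for the node service account can be granted in one place with for_each. The sketch below sits outside the module and assumes the project ID and the node SA email are passed in as variables; the variable, local, and resource names are hypothetical.

```hcl
variable "project_id" {
  type = string
}

variable "node_service_account_email" {
  type = string
}

locals {
  # Roles listed in the best practices above, with the roles/ prefix added.
  node_sa_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
    "roles/artifactregistry.reader",
  ]
}

# One non-authoritative binding per role; other members of each role are untouched.
resource "google_project_iam_member" "node_sa" {
  for_each = toset(local.node_sa_roles)

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${var.node_service_account_email}"
}
```

google_project_iam_member is additive, so it does not remove existing bindings the way the authoritative google_project_iam_binding or google_project_iam_policy resources can.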