Terraform Module: GCP Managed Instance Group — Self-Healing, Auto-Scaling Compute Across Zones

Quick take — Reusable Terraform module for a regional GCP Managed Instance Group: instance template, multi-zone distribution, autoscaler, named ports, auto-healing health checks and rolling updates. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "managed_instance_group" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id            = "..."  # GCP project the MIG and supporting resources live in.
  name                  = "..."  # Base name (RFC1035) for the MIG and derived resources.
  region                = "..."  # Region for the regional MIG.
  base_instance_name    = "..."  # VM name prefix; GCP appends a unique suffix.
  source_image          = "..."  # Boot image self-link or family path.
  network               = "..."  # VPC network self-link or name.
  subnetwork            = "..."  # Subnetwork in the target region.
  service_account_email = "..."  # Least-privilege SA attached to each VM.
}

Then run terraform init && terraform apply. Every other input has a default; see Inputs below to override behaviour.

What this module is

A Managed Instance Group (MIG) on Google Cloud is a controller that runs a fleet of identical Compute Engine VMs from a single instance template. Unlike an unmanaged group, a MIG actively maintains a desired state: it recreates VMs that crash or fail their health check (auto-healing), spreads instances across zones for resilience, scales the fleet up and down based on load (autoscaling), and rolls out new template versions gradually instead of replacing everything at once. It is the standard GCP building block behind stateless web tiers, API backends, and worker pools sitting behind a load balancer.

This module wraps google_compute_region_instance_group_manager, the regional variant that distributes VMs across multiple zones in a region, so the loss of a single zone does not take the whole tier down. Wrapping it in a module matters because a production-grade MIG is never just one resource: you need a versioned instance template, a health check, an autoscaler, named ports for load-balancer wiring, and a carefully tuned rolling-update policy. Hand-rolling those five pieces per environment invites drift and copy-paste bugs. The module turns the whole pattern into a handful of variables with opinionated defaults (proactive rolling updates that surge by 3 with 0 unavailable, auto-healing with a 300-second startup grace period, an EVEN spread across zones) while still exposing the knobs that differ per workload.

When to use it

Reach for a regional MIG (this module) over a zonal one for anything that must survive a single-zone outage. Skip MIGs entirely for stateful singletons (a primary database, a license server) — those belong on a standalone instance with a persistent disk, not in an auto-healing group that may delete and recreate the VM out from under your data.
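
For contrast, here is a minimal sketch of the "standalone instance with a persistent disk" shape that suits a stateful singleton. It is not part of this module; every name, zone, size, and image below is a hypothetical placeholder chosen for illustration.

```hcl
# Hypothetical stateful singleton: one VM plus a separately managed data disk.
# Nothing here is auto-healed or recreated by a MIG.
resource "google_compute_disk" "data" {
  project = "my-project"
  name    = "license-server-data"
  type    = "pd-balanced"
  zone    = "us-central1-a"
  size    = 50 # GB

  lifecycle {
    prevent_destroy = true # make Terraform refuse to destroy the data disk
  }
}

resource "google_compute_instance" "license_server" {
  project      = "my-project"
  name         = "license-server"
  machine_type = "e2-standard-2"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "projects/debian-cloud/global/images/family/debian-12"
    }
  }

  # The data disk outlives the VM: replacing the instance does not delete it.
  attached_disk {
    source      = google_compute_disk.data.self_link
    device_name = "data"
  }

  network_interface {
    network = "default"
  }
}
```

Because the disk is its own resource, the VM can be rebuilt without touching the data, which is the property an auto-healing group cannot promise.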

Module structure

terraform-module-gcp-managed-instance-group/
├── versions.tf      # provider + Terraform version constraints
├── main.tf          # instance template, health check, region MIG, autoscaler
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # ids, self-links, instance group URL, named ports

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  # A stable prefix lets the template use create_before_destroy safely.
  name_prefix = "${var.name}-"
}

# ---------------------------------------------------------------------------
# Instance template — the immutable blueprint every VM in the MIG is born from.
# name_prefix + create_before_destroy lets a new template be created before the
# old one is destroyed, which is what makes rolling updates non-disruptive.
# ---------------------------------------------------------------------------
resource "google_compute_instance_template" "this" {
  project      = var.project_id
  name_prefix  = local.name_prefix
  description  = "Instance template for MIG ${var.name}"
  machine_type = var.machine_type
  region       = var.region
  tags         = var.network_tags
  labels       = var.labels

  disk {
    source_image = var.source_image
    auto_delete  = true
    boot         = true
    disk_type    = var.boot_disk_type
    disk_size_gb = var.boot_disk_size_gb
  }

  network_interface {
    network    = var.network
    subnetwork = var.subnetwork

    # Only attach an external IP when explicitly requested.
    dynamic "access_config" {
      for_each = var.assign_external_ip ? [1] : []
      content {
        network_tier = "PREMIUM"
      }
    }
  }

  metadata = var.metadata

  # Optional startup script wired through metadata.
  metadata_startup_script = var.startup_script

  service_account {
    email  = var.service_account_email
    scopes = var.service_account_scopes
  }

  # Shielded VM hardening — on by default for a stronger security posture.
  shielded_instance_config {
    enable_secure_boot          = var.enable_secure_boot
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  lifecycle {
    create_before_destroy = true
  }
}

# ---------------------------------------------------------------------------
# Health check used for auto-healing. This is the signal the MIG uses to decide
# a VM is unhealthy and must be recreated — keep it distinct from (and usually
# more lenient than) the load balancer's serving health check.
# ---------------------------------------------------------------------------
resource "google_compute_health_check" "autohealing" {
  project             = var.project_id
  name                = "${var.name}-autohealing"
  check_interval_sec  = var.health_check_interval_sec
  timeout_sec         = var.health_check_timeout_sec
  healthy_threshold   = var.health_check_healthy_threshold
  unhealthy_threshold = var.health_check_unhealthy_threshold

  http_health_check {
    port         = var.health_check_port
    request_path = var.health_check_request_path
  }

  log_config {
    enable = var.enable_health_check_logging
  }
}

# ---------------------------------------------------------------------------
# Regional Managed Instance Group — distributes target_size VMs across the
# region's zones and maintains desired state (auto-healing + rolling updates).
# ---------------------------------------------------------------------------
resource "google_compute_region_instance_group_manager" "this" {
  project            = var.project_id
  name               = var.name
  region             = var.region
  base_instance_name = var.base_instance_name

  # When the autoscaler is enabled it owns target_size, so we omit it here to
  # avoid the two controllers fighting over the instance count.
  target_size = var.enable_autoscaler ? null : var.target_size

  distribution_policy_zones        = var.distribution_policy_zones
  distribution_policy_target_shape = var.distribution_policy_target_shape

  version {
    instance_template = google_compute_instance_template.this.id
    name              = "primary"
  }

  dynamic "named_port" {
    for_each = var.named_ports
    content {
      name = named_port.value.name
      port = named_port.value.port
    }
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.autohealing.id
    initial_delay_sec = var.health_check_initial_delay_sec
  }

  update_policy {
    type                         = var.update_policy_type
    instance_redistribution_type = var.update_instance_redistribution_type
    minimal_action               = var.update_minimal_action
    max_surge_fixed              = var.update_max_surge_fixed
    max_unavailable_fixed        = var.update_max_unavailable_fixed
    replacement_method           = "SUBSTITUTE"
  }

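  # Ignore target_size after creation so Terraform does not fight the autoscaler
  # over the instance count. Caveat: ignore_changes is unconditional, so with
  # enable_autoscaler = false later edits to var.target_size are ignored too;
  # drop this block if you manage the count statically.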
  lifecycle {
    ignore_changes = [target_size]
  }
}

# ---------------------------------------------------------------------------
# Autoscaler — optional. Scales the MIG between min and max replicas based on
# CPU and (optionally) load-balancer utilization.
# ---------------------------------------------------------------------------
resource "google_compute_region_autoscaler" "this" {
  count = var.enable_autoscaler ? 1 : 0

  project = var.project_id
  name    = "${var.name}-autoscaler"
  region  = var.region
  target  = google_compute_region_instance_group_manager.this.id

  autoscaling_policy {
    min_replicas    = var.autoscaler_min_replicas
    max_replicas    = var.autoscaler_max_replicas
    cooldown_period = var.autoscaler_cooldown_period

    cpu_utilization {
      target = var.autoscaler_cpu_target
    }

    dynamic "load_balancing_utilization" {
      for_each = var.autoscaler_lb_target == null ? [] : [1]
      content {
        target = var.autoscaler_lb_target
      }
    }
  }
}

# variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID the MIG and its supporting resources are created in."
}

variable "name" {
  type        = string
  description = "Base name for the MIG and derived resources (autoscaler, health check)."

  validation {
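    # Capped at 51 chars so the derived "<name>-autohealing" health check name
    # still fits GCP's 63-character limit.
    condition     = can(regex("^[a-z]([-a-z0-9]{0,49}[a-z0-9])?$", var.name))
    error_message = "name must be 1-51 chars, lowercase RFC1035: start with a letter, end with a letter or digit, and contain only lowercase letters, digits, and hyphens."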
  }
}

variable "region" {
  type        = string
  description = "Region for the regional MIG (e.g. asia-south1)."
}

variable "base_instance_name" {
  type        = string
  description = "Prefix for VM instance names; GCP appends a unique suffix per VM."
}

variable "machine_type" {
  type        = string
  description = "Compute Engine machine type for each VM (e.g. e2-standard-2)."
  default     = "e2-standard-2"
}

variable "source_image" {
  type        = string
  description = "Boot image self-link or family path (e.g. projects/debian-cloud/global/images/family/debian-12)."
}

variable "boot_disk_type" {
  type        = string
  description = "Boot disk type."
  default     = "pd-balanced"

  validation {
    condition     = contains(["pd-standard", "pd-balanced", "pd-ssd"], var.boot_disk_type)
    error_message = "boot_disk_type must be one of: pd-standard, pd-balanced, pd-ssd."
  }
}

variable "boot_disk_size_gb" {
  type        = number
  description = "Boot disk size in GB."
  default     = 20

  validation {
    condition     = var.boot_disk_size_gb >= 10
    error_message = "boot_disk_size_gb must be at least 10 GB."
  }
}

variable "network" {
  type        = string
  description = "VPC network self-link or name."
}

variable "subnetwork" {
  type        = string
  description = "Subnetwork self-link or name in the target region."
}

variable "assign_external_ip" {
  type        = bool
  description = "Whether to attach an ephemeral external IP to each VM. Keep false behind a load balancer."
  default     = false
}

variable "network_tags" {
  type        = list(string)
  description = "Network tags applied to each VM (used by firewall rules)."
  default     = []
}

variable "labels" {
  type        = map(string)
  description = "Labels applied to the instance template and VMs."
  default     = {}
}

variable "metadata" {
  type        = map(string)
  description = "Custom instance metadata key/value pairs."
  default     = {}
}

variable "startup_script" {
  type        = string
  description = "Startup script run on each VM at boot. Null to omit."
  default     = null
}

variable "service_account_email" {
  type        = string
  description = "Service account email attached to each VM. Use a dedicated least-privilege SA, not the default."
}

variable "service_account_scopes" {
  type        = list(string)
  description = "OAuth scopes for the attached service account. Prefer cloud-platform plus IAM roles."
  default     = ["https://www.googleapis.com/auth/cloud-platform"]
}

variable "enable_secure_boot" {
  type        = bool
  description = "Enable Shielded VM Secure Boot. Disable only for unsigned/custom kernels."
  default     = true
}

variable "target_size" {
  type        = number
  description = "Static number of VMs when the autoscaler is disabled."
  default     = 2

  validation {
    condition     = var.target_size >= 0
    error_message = "target_size must be 0 or greater."
  }
}

variable "distribution_policy_zones" {
  type        = list(string)
  description = "Explicit zones to spread VMs across. Empty list lets GCP choose the zones (typically three per region)."
  default     = []
}

variable "distribution_policy_target_shape" {
  type        = string
  description = "How instances are spread across zones: EVEN, BALANCED, or ANY."
  default     = "EVEN"

  validation {
    condition     = contains(["EVEN", "BALANCED", "ANY"], var.distribution_policy_target_shape)
    error_message = "distribution_policy_target_shape must be EVEN, BALANCED, or ANY."
  }
}

variable "named_ports" {
  type = list(object({
    name = string
    port = number
  }))
  description = "Named ports exposed to load balancer backends (e.g. [{ name = \"http\", port = 8080 }])."
  default     = []
}

# --- Auto-healing health check -------------------------------------------------

variable "health_check_port" {
  type        = number
  description = "TCP port the auto-healing HTTP health check probes."
  default     = 80
}

variable "health_check_request_path" {
  type        = string
  description = "HTTP path the auto-healing health check requests."
  default     = "/healthz"
}

variable "health_check_interval_sec" {
  type        = number
  description = "Seconds between health check probes."
  default     = 10
}

variable "health_check_timeout_sec" {
  type        = number
  description = "Seconds to wait for a probe response before it is a failure."
  default     = 5
}

variable "health_check_healthy_threshold" {
  type        = number
  description = "Consecutive successes before a VM is considered healthy."
  default     = 2
}

variable "health_check_unhealthy_threshold" {
  type        = number
  description = "Consecutive failures before a VM is recreated by auto-healing."
  default     = 3
}

variable "health_check_initial_delay_sec" {
  type        = number
  description = "Seconds after a VM starts during which auto-healing does not act on failed probes. Set above app warm-up time."
  default     = 300
}

variable "enable_health_check_logging" {
  type        = bool
  description = "Emit health check probe results to Cloud Logging."
  default     = false
}

# --- Rolling update policy -----------------------------------------------------

variable "update_policy_type" {
  type        = string
  description = "Rolling update mode: PROACTIVE (auto-roll on template change) or OPPORTUNISTIC."
  default     = "PROACTIVE"

  validation {
    condition     = contains(["PROACTIVE", "OPPORTUNISTIC"], var.update_policy_type)
    error_message = "update_policy_type must be PROACTIVE or OPPORTUNISTIC."
  }
}

variable "update_instance_redistribution_type" {
  type        = string
  description = "Whether the MIG redistributes instances across zones during updates: PROACTIVE or NONE."
  default     = "PROACTIVE"

  validation {
    condition     = contains(["PROACTIVE", "NONE"], var.update_instance_redistribution_type)
    error_message = "update_instance_redistribution_type must be PROACTIVE or NONE."
  }
}

variable "update_minimal_action" {
  type        = string
  description = "Least disruptive action applied on update: RESTART, REFRESH, or REPLACE."
  default     = "REPLACE"

  validation {
    condition     = contains(["RESTART", "REFRESH", "REPLACE"], var.update_minimal_action)
    error_message = "update_minimal_action must be RESTART, REFRESH, or REPLACE."
  }
}

variable "update_max_surge_fixed" {
  type        = number
  description = "Extra VMs created above target during a rolling update. For regional MIGs a fixed value must be 0 or >= the number of zones."
  default     = 3
}

variable "update_max_unavailable_fixed" {
  type        = number
  description = "Max VMs that may be unavailable during a rolling update. 0 means strictly add-before-remove."
  default     = 0
}

# --- Autoscaler ----------------------------------------------------------------

variable "enable_autoscaler" {
  type        = bool
  description = "Create a regional autoscaler that owns the MIG's instance count."
  default     = true
}

variable "autoscaler_min_replicas" {
  type        = number
  description = "Minimum number of VMs the autoscaler will keep running."
  default     = 2

  validation {
    condition     = var.autoscaler_min_replicas >= 1
    error_message = "autoscaler_min_replicas must be at least 1."
  }
}

variable "autoscaler_max_replicas" {
  type        = number
  description = "Maximum number of VMs the autoscaler may scale up to."
  default     = 10
}

variable "autoscaler_cooldown_period" {
  type        = number
  description = "Seconds the autoscaler waits after a new VM boots before using its metrics."
  default     = 60
}

variable "autoscaler_cpu_target" {
  type        = number
  description = "Target average CPU utilization (0.0-1.0) the autoscaler steers toward."
  default     = 0.6

  validation {
    condition     = var.autoscaler_cpu_target > 0 && var.autoscaler_cpu_target <= 1
    error_message = "autoscaler_cpu_target must be between 0 (exclusive) and 1 (inclusive)."
  }
}

variable "autoscaler_lb_target" {
  type        = number
  description = "Optional target load-balancing utilization (0.0-1.0). Null disables LB-based scaling."
  default     = null
}

# outputs.tf

output "instance_group_manager_id" {
  description = "Fully qualified ID of the regional instance group manager."
  value       = google_compute_region_instance_group_manager.this.id
}

output "instance_group_manager_name" {
  description = "Name of the regional instance group manager."
  value       = google_compute_region_instance_group_manager.this.name
}

output "instance_group" {
  description = "Self-link of the underlying instance group, used as a load balancer backend."
  value       = google_compute_region_instance_group_manager.this.instance_group
}

output "instance_template_id" {
  description = "ID of the instance template currently driving the MIG."
  value       = google_compute_instance_template.this.id
}

output "instance_template_self_link" {
  description = "Self-link of the active instance template."
  value       = google_compute_instance_template.this.self_link
}

output "health_check_id" {
  description = "ID of the auto-healing health check."
  value       = google_compute_health_check.autohealing.id
}

output "autoscaler_id" {
  description = "ID of the regional autoscaler, or null when the autoscaler is disabled."
  value       = try(google_compute_region_autoscaler.this[0].id, null)
}

output "named_ports" {
  description = "Named ports configured on the instance group for load balancer wiring."
  value       = var.named_ports
}
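
The validation blocks in variables.tf check each input on its own, so cross-variable mistakes (a maximum below the minimum, a surge smaller than the zone list) only surface at apply time. Below is a minimal sketch of how they could be caught earlier, assuming Terraform 1.5+ check blocks added to main.tf. The block names are hypothetical and not part of the module as listed. Failed check assertions are reported as warnings rather than blocking the run.

```hcl
# Hypothetical additions to main.tf; not part of the published module.

# Autoscaler bounds must be ordered when the autoscaler is enabled.
check "autoscaler_bounds" {
  assert {
    condition     = !var.enable_autoscaler || var.autoscaler_max_replicas >= var.autoscaler_min_replicas
    error_message = "autoscaler_max_replicas must be >= autoscaler_min_replicas."
  }
}

# For regional MIGs a fixed surge must be 0 or at least the number of zones.
# Only checkable here when the zones are listed explicitly.
check "surge_covers_zones" {
  assert {
    condition = (
      var.update_max_surge_fixed == 0 ||
      length(var.distribution_policy_zones) == 0 ||
      var.update_max_surge_fixed >= length(var.distribution_policy_zones)
    )
    error_message = "update_max_surge_fixed must be 0 or >= the number of distribution_policy_zones."
  }
}
```

If a hard failure is preferred over a warning, the same conditions could instead go into lifecycle precondition blocks on the resources they guard.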

How to use it

module "managed_instance_group" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id         = "kloudvin-prod"
  name               = "web-frontend"
  region             = "asia-south1"
  base_instance_name = "web"

  machine_type = "e2-standard-4"
  source_image = "projects/kloudvin-prod/global/images/family/web-frontend"

  network    = "projects/kloudvin-prod/global/networks/prod-vpc"
  subnetwork = "projects/kloudvin-prod/regions/asia-south1/subnetworks/prod-web-asia-south1"

  service_account_email = google_service_account.web.email # SA defined elsewhere in the root module
  network_tags          = ["web", "allow-lb-health-checks"]
  labels                = { app = "web-frontend", env = "prod" }

  # Expose the app port so the load balancer backend can target it by name.
  named_ports = [{ name = "http", port = 8080 }]

  # Auto-healing probes the app's health endpoint on the serving port.
  health_check_port         = 8080
  health_check_request_path = "/healthz"

  # CPU + load-balancer driven autoscaling.
  enable_autoscaler       = true
  autoscaler_min_replicas = 3
  autoscaler_max_replicas = 20
  autoscaler_cpu_target   = 0.55
  autoscaler_lb_target    = 0.8

  # Zero-downtime rolling deploys: add 3 before removing any.
  update_max_surge_fixed       = 3
  update_max_unavailable_fixed = 0
}

# Downstream: wire the MIG into an HTTP(S) load balancer backend service.
resource "google_compute_backend_service" "web" {
  project               = "kloudvin-prod"
  name                  = "web-frontend-backend"
  protocol              = "HTTP"
  port_name             = "http"
  load_balancing_scheme = "EXTERNAL_MANAGED"
  health_checks         = [google_compute_health_check.lb_serving.id] # serving check, defined elsewhere

  backend {
    group           = module.managed_instance_group.instance_group
    balancing_mode  = "UTILIZATION"
    max_utilization = 0.8
  }
}
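
The example above references google_service_account.web and google_compute_health_check.lb_serving without defining them, and neither health check can reach the VMs unless a firewall rule admits Google's probe ranges. Here is a minimal sketch of those supporting resources. The resource names, the /readyz path, and the probe timings are assumptions made for illustration, not values from the module.

```hcl
# Dedicated service account for the web tier (account_id is hypothetical).
resource "google_service_account" "web" {
  project      = "kloudvin-prod"
  account_id   = "web-frontend"
  display_name = "web-frontend MIG VMs"
}

# Load balancer serving check, kept separate from the module's auto-healing
# check. The /readyz path is an assumed readiness endpoint.
resource "google_compute_health_check" "lb_serving" {
  project             = "kloudvin-prod"
  name                = "web-frontend-lb-serving"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 2

  http_health_check {
    port         = 8080
    request_path = "/readyz"
  }
}

# Allow Google Cloud health check probes to reach VMs carrying the network tag
# used in the module call above.
resource "google_compute_firewall" "allow_health_checks" {
  project       = "kloudvin-prod"
  name          = "allow-lb-health-checks"
  network       = "projects/kloudvin-prod/global/networks/prod-vpc"
  direction     = "INGRESS"
  source_ranges = ["35.191.0.0/16", "130.211.0.0/22"]
  target_tags   = ["allow-lb-health-checks"]

  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }
}
```

Without the firewall rule, every probe fails and auto-healing will keep recreating VMs once the initial delay expires.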

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + per-path prefix...
  }
}

2. Module config, live/prod/managed_instance_group/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"
}

inputs = {
  project_id            = "..."
  name                  = "..."
  region                = "..."
  base_instance_name    = "..."
  source_image          = "..."
  network               = "..."
  subnetwork            = "..."
  service_account_email = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/managed_instance_group && terragrunt apply   # this module
cd .. && terragrunt run-all apply                         # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
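
The root config shown in step 1 only covers the backend. Below is a minimal sketch of how the provider could be generated from the same root file, assuming Terragrunt's generate block. The project and region values are placeholders copied from the Quickstart.

```hcl
# Addition to live/terragrunt.hcl (sketch; project and region are placeholders).
# Terragrunt writes provider.tf into each module's working directory before
# running Terraform, so the module source itself needs no provider block.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

The overwrite_terragrunt setting only replaces files that Terragrunt generated itself, which avoids clobbering a hand-written provider.tf in a module.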

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | | Yes | GCP project the MIG and supporting resources live in. |
| name | string | | Yes | Base name (RFC1035) for the MIG and derived resources. |
| region | string | | Yes | Region for the regional MIG. |
| base_instance_name | string | | Yes | VM name prefix; GCP appends a unique suffix. |
| machine_type | string | e2-standard-2 | No | Machine type per VM. |
| source_image | string | | Yes | Boot image self-link or family path. |
| boot_disk_type | string | pd-balanced | No | Boot disk type (pd-standard/pd-balanced/pd-ssd). |
| boot_disk_size_gb | number | 20 | No | Boot disk size in GB (>= 10). |
| network | string | | Yes | VPC network self-link or name. |
| subnetwork | string | | Yes | Subnetwork in the target region. |
| assign_external_ip | bool | false | No | Attach an ephemeral external IP to each VM. |
| network_tags | list(string) | [] | No | Network tags applied to each VM. |
| labels | map(string) | {} | No | Labels on the template and VMs. |
| metadata | map(string) | {} | No | Custom instance metadata. |
| startup_script | string | null | No | Startup script run at boot. |
| service_account_email | string | | Yes | Least-privilege SA attached to each VM. |
| service_account_scopes | list(string) | ["…/auth/cloud-platform"] | No | OAuth scopes for the attached SA. |
| enable_secure_boot | bool | true | No | Enable Shielded VM Secure Boot. |
| target_size | number | 2 | No | Static VM count when autoscaler is disabled. |
| distribution_policy_zones | list(string) | [] | No | Explicit zones to spread across; empty = GCP chooses the zones. |
| distribution_policy_target_shape | string | EVEN | No | Zone spread strategy (EVEN/BALANCED/ANY). |
| named_ports | list(object) | [] | No | Named ports for load balancer backends. |
| health_check_port | number | 80 | No | Port for the auto-healing health check. |
| health_check_request_path | string | /healthz | No | HTTP path for the auto-healing check. |
| health_check_interval_sec | number | 10 | No | Seconds between probes. |
| health_check_timeout_sec | number | 5 | No | Probe response timeout in seconds. |
| health_check_healthy_threshold | number | 2 | No | Consecutive successes to mark healthy. |
| health_check_unhealthy_threshold | number | 3 | No | Consecutive failures before recreation. |
| health_check_initial_delay_sec | number | 300 | No | Seconds after boot before auto-healing acts on failed probes. |
| enable_health_check_logging | bool | false | No | Log health check probe results. |
| update_policy_type | string | PROACTIVE | No | Rolling update mode (PROACTIVE/OPPORTUNISTIC). |
| update_instance_redistribution_type | string | PROACTIVE | No | Redistribute across zones on update (PROACTIVE/NONE). |
| update_minimal_action | string | REPLACE | No | Least disruptive update action (RESTART/REFRESH/REPLACE). |
| update_max_surge_fixed | number | 3 | No | Extra VMs added during update (0 or >= zone count). |
| update_max_unavailable_fixed | number | 0 | No | Max VMs unavailable during update. |
| enable_autoscaler | bool | true | No | Create a regional autoscaler owning the VM count. |
| autoscaler_min_replicas | number | 2 | No | Minimum VMs the autoscaler keeps (>= 1). |
| autoscaler_max_replicas | number | 10 | No | Maximum VMs the autoscaler may reach. |
| autoscaler_cooldown_period | number | 60 | No | Seconds before a new VM’s metrics count. |
| autoscaler_cpu_target | number | 0.6 | No | Target average CPU utilization (0–1). |
| autoscaler_lb_target | number | null | No | Target LB utilization (0–1); null disables it. |
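
As an illustration of overriding these defaults, here is a minimal sketch of a fixed-size group with the autoscaler switched off. The module label worker_pool is hypothetical, and the "..." placeholders are the same required inputs as in the Quickstart.

```hcl
module "worker_pool" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id            = "..."
  name                  = "..."
  region                = "..."
  base_instance_name    = "..."
  source_image          = "..."
  network               = "..."
  subnetwork            = "..."
  service_account_email = "..."

  # Fixed fleet: no autoscaler resource is created and the MIG is created
  # with exactly target_size VMs.
  enable_autoscaler = false
  target_size       = 3
}
```

One caveat, given main.tf as listed: the MIG resource sets ignore_changes = [target_size], so changing target_size after the first apply would not resize the group through Terraform.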

Outputs

| Name | Description |
| --- | --- |
| instance_group_manager_id | Fully qualified ID of the regional instance group manager. |
| instance_group_manager_name | Name of the regional instance group manager. |
| instance_group | Self-link of the underlying instance group, used as a load balancer backend. |
| instance_template_id | ID of the instance template currently driving the MIG. |
| instance_template_self_link | Self-link of the active instance template. |
| health_check_id | ID of the auto-healing health check. |
| autoscaler_id | ID of the regional autoscaler, or null when disabled. |
| named_ports | Named ports configured on the instance group. |

Enterprise scenario

KloudVin’s e-commerce client runs its checkout API as a stateless tier in asia-south1. They instantiate this module with enable_autoscaler = true, autoscaler_min_replicas = 4, autoscaler_max_replicas = 40, and autoscaler_lb_target = 0.8, so the fleet expands across all three zones during flash sales and contracts overnight to control spend. Because update_max_unavailable_fixed = 0 with a surge of 3, the platform team ships a new container-optimized image every sprint by publishing a fresh instance template: the MIG creates replacement VMs before deleting old ones, up to three at a time, with no checkout downtime. The auto-healing health check (/healthz, 300s initial delay) recreates any VM that wedges under load; with the default 10-second interval and unhealthy threshold of 3, a wedged VM is flagged after roughly 30 seconds of failed probes.
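
The scenario above could be expressed roughly as follows. This is a sketch built only from the settings named in the paragraph. The module label checkout_api is hypothetical, and the "..." inputs are left as placeholders because the scenario does not specify them.

```hcl
module "checkout_api" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id            = "..."
  name                  = "..."
  region                = "asia-south1"
  base_instance_name    = "..."
  source_image          = "..." # the container-optimized image published each sprint
  network               = "..."
  subnetwork            = "..."
  service_account_email = "..."

  # Flash-sale scaling: 4 to 40 VMs, driven by CPU and load-balancer utilization.
  enable_autoscaler       = true
  autoscaler_min_replicas = 4
  autoscaler_max_replicas = 40
  autoscaler_lb_target    = 0.8

  # Add-before-remove rollouts: surge by 3, never drop below target.
  update_max_surge_fixed       = 3
  update_max_unavailable_fixed = 0

  # Auto-healing: /healthz with a 300 s grace period (both are module defaults,
  # restated here for clarity).
  health_check_request_path      = "/healthz"
  health_check_initial_delay_sec = 300
}
```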

Best practices
