Terraform Module: GCP Managed Instance Group — Self-Healing, Auto-Scaling Compute Across Zones

Quick take — Reusable Terraform module for a regional GCP Managed Instance Group: instance template, multi-zone distribution, autoscaler, named ports, auto-healing health checks and rolling updates. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "managed_instance_group" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id            = "..."  # GCP project the MIG and supporting resources live in.
  name                  = "..."  # Base name (RFC1035) for the MIG and derived resources.
  region                = "..."  # Region for the regional MIG.
  base_instance_name    = "..."  # VM name prefix; GCP appends a unique suffix.
  source_image          = "..."  # Boot image self-link or family path.
  network               = "..."  # VPC network self-link or name.
  subnetwork            = "..."  # Subnetwork in the target region.
  service_account_email = "..."  # Least-privilege SA attached to each VM.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

A Managed Instance Group (MIG) on Google Cloud is a controller that runs a fleet of identical Compute Engine VMs from a single instance template. Unlike an unmanaged group, a MIG actively maintains a desired state: it recreates VMs that crash or fail their health check (auto-healing), spreads instances across zones for resilience, scales the fleet up and down based on load (autoscaling), and rolls out new template versions gradually instead of replacing everything at once. It is the standard GCP building block behind stateless web tiers, API backends, and worker pools sitting behind a load balancer.

This module wraps google_compute_region_instance_group_manager, the regional variant that distributes VMs across multiple zones in a region, so the loss of a single zone does not take the whole tier down. Wrapping it in a module matters because a production-grade MIG is never just one resource: you need a versioned instance template, a health check, an autoscaler, named ports for load-balancer wiring, and a carefully tuned rolling-update policy. Hand-rolling those five pieces per environment invites drift and copy-paste bugs. The module turns the whole pattern into a handful of variables with sane, opinionated defaults (surge-based zero-downtime updates, auto-healing with a startup grace period, even multi-zone spread) while still exposing the knobs that differ per workload.

When to use it

You run a stateless tier (web frontends, REST/gRPC backends, queue consumers) where any VM can serve any request and instances are cattle, not pets.
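You want horizontal autoscaling driven by CPU or load-balancer utilization instead of a fixed VM count. (MIG autoscalers also accept custom Cloud Monitoring metrics, but the module as listed below wires up only the CPU and load-balancer signals.)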
You need zonal redundancy within a region and want GCP to keep the fleet evenly balanced across zones automatically.
You front the group with an HTTP(S), TCP, or internal load balancer and need a backend with named ports and health-based membership.
You want zero-downtime rolling deploys — bake a new image, publish a new template, and let the MIG replace instances with controlled surge and max-unavailable limits.

Reach for a regional MIG (this module) over a zonal one for anything that must survive a single-zone outage. Skip MIGs entirely for stateful singletons (a primary database, a license server) — those belong on a standalone instance with a persistent disk, not in an auto-healing group that may delete and recreate the VM out from under your data.

Module structure

terraform-module-gcp-managed-instance-group/
├── versions.tf      # provider + Terraform version constraints
├── main.tf          # instance template, health check, region MIG, autoscaler
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # ids, self-links, instance group URL, named ports

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  # A stable prefix lets the template use create_before_destroy safely.
  name_prefix = "${var.name}-"
}

# ---------------------------------------------------------------------------
# Instance template — the immutable blueprint every VM in the MIG is born from.
# name_prefix + create_before_destroy lets a new template be created before the
# old one is destroyed, which is what makes rolling updates non-disruptive.
# ---------------------------------------------------------------------------
resource "google_compute_instance_template" "this" {
  project      = var.project_id
  name_prefix  = local.name_prefix
  description  = "Instance template for MIG ${var.name}"
  machine_type = var.machine_type
  region       = var.region
  tags         = var.network_tags
  labels       = var.labels

  disk {
    source_image = var.source_image
    auto_delete  = true
    boot         = true
    disk_type    = var.boot_disk_type
    disk_size_gb = var.boot_disk_size_gb
  }

  network_interface {
    network    = var.network
    subnetwork = var.subnetwork

    # Only attach an external IP when explicitly requested.
    dynamic "access_config" {
      for_each = var.assign_external_ip ? [1] : []
      content {
        network_tier = "PREMIUM"
      }
    }
  }

  metadata = var.metadata

  # Optional startup script wired through metadata.
  metadata_startup_script = var.startup_script

  service_account {
    email  = var.service_account_email
    scopes = var.service_account_scopes
  }

  # Shielded VM hardening — on by default for a stronger security posture.
  shielded_instance_config {
    enable_secure_boot          = var.enable_secure_boot
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  lifecycle {
    create_before_destroy = true
  }
}

# ---------------------------------------------------------------------------
# Health check used for auto-healing. This is the signal the MIG uses to decide
# a VM is unhealthy and must be recreated — keep it distinct from (and usually
# more lenient than) the load balancer's serving health check.
# ---------------------------------------------------------------------------
resource "google_compute_health_check" "autohealing" {
  project             = var.project_id
  name                = "${var.name}-autohealing"
  check_interval_sec  = var.health_check_interval_sec
  timeout_sec         = var.health_check_timeout_sec
  healthy_threshold   = var.health_check_healthy_threshold
  unhealthy_threshold = var.health_check_unhealthy_threshold

  http_health_check {
    port         = var.health_check_port
    request_path = var.health_check_request_path
  }

  log_config {
    enable = var.enable_health_check_logging
  }
}

# ---------------------------------------------------------------------------
# Regional Managed Instance Group — distributes target_size VMs across the
# region's zones and maintains desired state (auto-healing + rolling updates).
# ---------------------------------------------------------------------------
resource "google_compute_region_instance_group_manager" "this" {
  project            = var.project_id
  name               = var.name
  region             = var.region
  base_instance_name = var.base_instance_name

  # When the autoscaler is enabled it owns target_size, so we omit it here to
  # avoid the two controllers fighting over the instance count.
  target_size = var.enable_autoscaler ? null : var.target_size

  distribution_policy_zones        = var.distribution_policy_zones
  distribution_policy_target_shape = var.distribution_policy_target_shape

  version {
    instance_template = google_compute_instance_template.this.id
    name              = "primary"
  }

  dynamic "named_port" {
    for_each = var.named_ports
    content {
      name = named_port.value.name
      port = named_port.value.port
    }
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.autohealing.id
    initial_delay_sec = var.health_check_initial_delay_sec
  }

  update_policy {
    type                         = var.update_policy_type
    instance_redistribution_type = var.update_instance_redistribution_type
    minimal_action               = var.update_minimal_action
    max_surge_fixed              = var.update_max_surge_fixed
    max_unavailable_fixed        = var.update_max_unavailable_fixed
    replacement_method           = "SUBSTITUTE"
  }

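  # ignore_changes must be a static list, so target_size drift is ignored in
  # both modes. Caveat: with enable_autoscaler = false, later edits to
  # var.target_size are not applied to an existing MIG. Drop this block if you
  # need to resize a static group through Terraform.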
  lifecycle {
    ignore_changes = [target_size]
  }
}

# ---------------------------------------------------------------------------
# Autoscaler — optional. Scales the MIG between min and max replicas based on
# CPU and (optionally) load-balancer utilization.
# ---------------------------------------------------------------------------
resource "google_compute_region_autoscaler" "this" {
  count = var.enable_autoscaler ? 1 : 0

  project = var.project_id
  name    = "${var.name}-autoscaler"
  region  = var.region
  target  = google_compute_region_instance_group_manager.this.id

  autoscaling_policy {
    min_replicas    = var.autoscaler_min_replicas
    max_replicas    = var.autoscaler_max_replicas
    cooldown_period = var.autoscaler_cooldown_period

    cpu_utilization {
      target = var.autoscaler_cpu_target
    }

    dynamic "load_balancing_utilization" {
      for_each = var.autoscaler_lb_target == null ? [] : [1]
      content {
        target = var.autoscaler_lb_target
      }
    }
  }
}

# variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID the MIG and its supporting resources are created in."
}

variable "name" {
  type        = string
  description = "Base name for the MIG and derived resources (autoscaler, health check)."

  validation {
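    # Capped at 51 chars so derived names such as "<name>-autohealing" stay within GCP's 63-char limit.
    condition     = can(regex("^[a-z]([-a-z0-9]{0,49}[a-z0-9])?$", var.name))
    error_message = "name must be 1-51 chars, lowercase RFC1035: start with a letter, end with a letter or digit, and contain only lowercase letters, digits, and hyphens."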
  }
}

variable "region" {
  type        = string
  description = "Region for the regional MIG (e.g. asia-south1)."
}

variable "base_instance_name" {
  type        = string
  description = "Prefix for VM instance names; GCP appends a unique suffix per VM."
}

variable "machine_type" {
  type        = string
  description = "Compute Engine machine type for each VM (e.g. e2-standard-2)."
  default     = "e2-standard-2"
}

variable "source_image" {
  type        = string
  description = "Boot image self-link or family path (e.g. projects/debian-cloud/global/images/family/debian-12)."
}

variable "boot_disk_type" {
  type        = string
  description = "Boot disk type."
  default     = "pd-balanced"

  validation {
    condition     = contains(["pd-standard", "pd-balanced", "pd-ssd"], var.boot_disk_type)
    error_message = "boot_disk_type must be one of: pd-standard, pd-balanced, pd-ssd."
  }
}

variable "boot_disk_size_gb" {
  type        = number
  description = "Boot disk size in GB."
  default     = 20

  validation {
    condition     = var.boot_disk_size_gb >= 10
    error_message = "boot_disk_size_gb must be at least 10 GB."
  }
}

variable "network" {
  type        = string
  description = "VPC network self-link or name."
}

variable "subnetwork" {
  type        = string
  description = "Subnetwork self-link or name in the target region."
}

variable "assign_external_ip" {
  type        = bool
  description = "Whether to attach an ephemeral external IP to each VM. Keep false behind a load balancer."
  default     = false
}

variable "network_tags" {
  type        = list(string)
  description = "Network tags applied to each VM (used by firewall rules)."
  default     = []
}

variable "labels" {
  type        = map(string)
  description = "Labels applied to the instance template and VMs."
  default     = {}
}

variable "metadata" {
  type        = map(string)
  description = "Custom instance metadata key/value pairs."
  default     = {}
}

variable "startup_script" {
  type        = string
  description = "Startup script run on each VM at boot. Null to omit."
  default     = null
}

variable "service_account_email" {
  type        = string
  description = "Service account email attached to each VM. Use a dedicated least-privilege SA, not the default."
}

variable "service_account_scopes" {
  type        = list(string)
  description = "OAuth scopes for the attached service account. Prefer cloud-platform plus IAM roles."
  default     = ["https://www.googleapis.com/auth/cloud-platform"]
}

variable "enable_secure_boot" {
  type        = bool
  description = "Enable Shielded VM Secure Boot. Disable only for unsigned/custom kernels."
  default     = true
}

variable "target_size" {
  type        = number
  description = "Static number of VMs when the autoscaler is disabled."
  default     = 2

  validation {
    condition     = var.target_size >= 0
    error_message = "target_size must be 0 or greater."
  }
}

variable "distribution_policy_zones" {
  type        = list(string)
  description = "Explicit zones to spread VMs across. Empty list lets GCP choose the zones (three zones in the region by default)."
  default     = []
}

variable "distribution_policy_target_shape" {
  type        = string
  description = "How instances are spread across zones: EVEN, BALANCED, or ANY."
  default     = "EVEN"

  validation {
    condition     = contains(["EVEN", "BALANCED", "ANY"], var.distribution_policy_target_shape)
    error_message = "distribution_policy_target_shape must be EVEN, BALANCED, or ANY."
  }
}

variable "named_ports" {
  type = list(object({
    name = string
    port = number
  }))
  description = "Named ports exposed to load balancer backends (e.g. [{ name = \"http\", port = 8080 }])."
  default     = []
}

# --- Auto-healing health check -------------------------------------------------

variable "health_check_port" {
  type        = number
  description = "TCP port the auto-healing HTTP health check probes."
  default     = 80
}

variable "health_check_request_path" {
  type        = string
  description = "HTTP path the auto-healing health check requests."
  default     = "/healthz"
}

variable "health_check_interval_sec" {
  type        = number
  description = "Seconds between health check probes."
  default     = 10
}

variable "health_check_timeout_sec" {
  type        = number
  description = "Seconds to wait for a probe response before it is a failure."
  default     = 5
}

variable "health_check_healthy_threshold" {
  type        = number
  description = "Consecutive successes before a VM is considered healthy."
  default     = 2
}

variable "health_check_unhealthy_threshold" {
  type        = number
  description = "Consecutive failures before a VM is recreated by auto-healing."
  default     = 3
}

variable "health_check_initial_delay_sec" {
  type        = number
  description = "Seconds a new VM is given to boot and warm up before auto-healing acts on failed health checks. Set above app warm-up time."
  default     = 300
}

variable "enable_health_check_logging" {
  type        = bool
  description = "Emit health check probe results to Cloud Logging."
  default     = false
}

# --- Rolling update policy -----------------------------------------------------

variable "update_policy_type" {
  type        = string
  description = "Rolling update mode: PROACTIVE (auto-roll on template change) or OPPORTUNISTIC."
  default     = "PROACTIVE"

  validation {
    condition     = contains(["PROACTIVE", "OPPORTUNISTIC"], var.update_policy_type)
    error_message = "update_policy_type must be PROACTIVE or OPPORTUNISTIC."
  }
}

variable "update_instance_redistribution_type" {
  type        = string
  description = "Whether the MIG redistributes instances across zones during updates: PROACTIVE or NONE."
  default     = "PROACTIVE"

  validation {
    condition     = contains(["PROACTIVE", "NONE"], var.update_instance_redistribution_type)
    error_message = "update_instance_redistribution_type must be PROACTIVE or NONE."
  }
}

variable "update_minimal_action" {
  type        = string
  description = "Least disruptive action applied on update: RESTART, REFRESH, or REPLACE."
  default     = "REPLACE"

  validation {
    condition     = contains(["RESTART", "REFRESH", "REPLACE"], var.update_minimal_action)
    error_message = "update_minimal_action must be RESTART, REFRESH, or REPLACE."
  }
}

variable "update_max_surge_fixed" {
  type        = number
  description = "Extra VMs created above target during a rolling update. For regional MIGs this must be 0 or >= the number of zones, and it cannot be 0 while update_max_unavailable_fixed is also 0."
  default     = 3
}

variable "update_max_unavailable_fixed" {
  type        = number
  description = "Max VMs that may be unavailable during a rolling update. 0 means strictly add-before-remove."
  default     = 0
}

# --- Autoscaler ----------------------------------------------------------------

variable "enable_autoscaler" {
  type        = bool
  description = "Create a regional autoscaler that owns the MIG's instance count."
  default     = true
}

variable "autoscaler_min_replicas" {
  type        = number
  description = "Minimum number of VMs the autoscaler will keep running."
  default     = 2

  validation {
    condition     = var.autoscaler_min_replicas >= 1
    error_message = "autoscaler_min_replicas must be at least 1."
  }
}

variable "autoscaler_max_replicas" {
  type        = number
  description = "Maximum number of VMs the autoscaler may scale up to."
  default     = 10
}

variable "autoscaler_cooldown_period" {
  type        = number
  description = "Seconds the autoscaler waits after a new VM boots before using its metrics."
  default     = 60
}

variable "autoscaler_cpu_target" {
  type        = number
  description = "Target average CPU utilization (0.0-1.0) the autoscaler steers toward."
  default     = 0.6

  validation {
    condition     = var.autoscaler_cpu_target > 0 && var.autoscaler_cpu_target <= 1
    error_message = "autoscaler_cpu_target must be between 0 (exclusive) and 1 (inclusive)."
  }
}

variable "autoscaler_lb_target" {
  type        = number
  description = "Optional target load-balancing utilization (0.0-1.0). Null disables LB-based scaling."
  default     = null
}

# outputs.tf

output "instance_group_manager_id" {
  description = "Fully qualified ID of the regional instance group manager."
  value       = google_compute_region_instance_group_manager.this.id
}

output "instance_group_manager_name" {
  description = "Name of the regional instance group manager."
  value       = google_compute_region_instance_group_manager.this.name
}

output "instance_group" {
  description = "Self-link of the underlying instance group, used as a load balancer backend."
  value       = google_compute_region_instance_group_manager.this.instance_group
}

output "instance_template_id" {
  description = "ID of the instance template currently driving the MIG."
  value       = google_compute_instance_template.this.id
}

output "instance_template_self_link" {
  description = "Self-link of the active instance template."
  value       = google_compute_instance_template.this.self_link
}

output "health_check_id" {
  description = "ID of the auto-healing health check."
  value       = google_compute_health_check.autohealing.id
}

output "autoscaler_id" {
  description = "ID of the regional autoscaler, or null when the autoscaler is disabled."
  value       = try(google_compute_region_autoscaler.this[0].id, null)
}

output "named_ports" {
  description = "Named ports configured on the instance group for load balancer wiring."
  value       = var.named_ports
}
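
The validation blocks in variables.tf each check a single input. With the module's required_version floor of 1.5.0, a validation condition cannot refer to other variables (that ability arrived in a later Terraform release), so combinations such as min replicas above max replicas pass validation and only fail at apply time. One possible way to catch them at plan time is a terraform_data resource that carries lifecycle preconditions. This is a sketch and not part of the module as listed; the resource name input_guards is hypothetical.

```hcl
# Hypothetical addition to main.tf: fail the plan on inconsistent input combinations.
resource "terraform_data" "input_guards" {
  lifecycle {
    # Only relevant when the autoscaler is created.
    precondition {
      condition     = !var.enable_autoscaler || var.autoscaler_max_replicas >= var.autoscaler_min_replicas
      error_message = "autoscaler_max_replicas must be >= autoscaler_min_replicas."
    }

    # A rolling update needs room to make progress.
    precondition {
      condition     = var.update_max_surge_fixed > 0 || var.update_max_unavailable_fixed > 0
      error_message = "update_max_surge_fixed and update_max_unavailable_fixed cannot both be 0."
    }

    # Regional MIG rule: a fixed surge is either 0 or at least the zone count.
    # Only checkable when the zones are listed explicitly.
    precondition {
      condition = (
        length(var.distribution_policy_zones) == 0 ||
        var.update_max_surge_fixed == 0 ||
        var.update_max_surge_fixed >= length(var.distribution_policy_zones)
      )
      error_message = "update_max_surge_fixed must be 0 or >= the number of distribution_policy_zones."
    }
  }
}
```

The guards add no cloud resources; terraform_data exists only in state, so the cost of this approach is one extra entry in the plan.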

How to use it

module "managed_instance_group" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id         = "kloudvin-prod"
  name               = "web-frontend"
  region             = "asia-south1"
  base_instance_name = "web"

  machine_type = "e2-standard-4"
  source_image = "projects/kloudvin-prod/global/images/family/web-frontend"

  network    = "projects/kloudvin-prod/global/networks/prod-vpc"
  subnetwork = "projects/kloudvin-prod/regions/asia-south1/subnetworks/prod-web-asia-south1"

  service_account_email = google_service_account.web.email # SA defined elsewhere in the root module
  network_tags          = ["web", "allow-lb-health-checks"]
  labels                = { app = "web-frontend", env = "prod" }

  # Expose the app port so the load balancer backend can target it by name.
  named_ports = [{ name = "http", port = 8080 }]

  # Auto-healing probes the app's liveness endpoint.
  health_check_port         = 8080
  health_check_request_path = "/healthz"

  # CPU + load-balancer driven autoscaling.
  enable_autoscaler       = true
  autoscaler_min_replicas = 3
  autoscaler_max_replicas = 20
  autoscaler_cpu_target   = 0.55
  autoscaler_lb_target    = 0.8

  # Zero-downtime rolling deploys: add 3 before removing any.
  update_max_surge_fixed       = 3
  update_max_unavailable_fixed = 0
}

# Downstream: wire the MIG into an HTTP(S) load balancer backend service.
resource "google_compute_backend_service" "web" {
  project               = "kloudvin-prod"
  name                  = "web-frontend-backend"
  protocol              = "HTTP"
  port_name             = "http"
  load_balancing_scheme = "EXTERNAL_MANAGED"
  health_checks         = [google_compute_health_check.lb_serving.id] # serving check, defined elsewhere

  backend {
    group           = module.managed_instance_group.instance_group
    balancing_mode  = "UTILIZATION"
    max_utilization = 0.8
  }
}
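
The backend service above references google_compute_health_check.lb_serving, and the module call tags VMs with allow-lb-health-checks, but neither the serving check nor a matching firewall rule is shown. A minimal sketch of both follows. The resource names, the /readyz path, and the probe timings are assumptions for illustration; the two source ranges are the address blocks Google documents for health check probes.

```hcl
# Hypothetical serving health check for the load balancer.
# Kept separate from (and stricter than) the module's auto-healing check.
resource "google_compute_health_check" "lb_serving" {
  project             = "kloudvin-prod"
  name                = "web-frontend-lb-serving"
  check_interval_sec  = 5
  timeout_sec         = 3
  healthy_threshold   = 2
  unhealthy_threshold = 2

  http_health_check {
    port         = 8080
    request_path = "/readyz" # assumed readiness endpoint
  }
}

# Let Google's health check probers reach the tagged VMs on the app port.
# Without a rule like this, both the serving and auto-healing probes fail.
resource "google_compute_firewall" "allow_health_checks" {
  project       = "kloudvin-prod"
  name          = "prod-vpc-allow-lb-health-checks"
  network       = "projects/kloudvin-prod/global/networks/prod-vpc"
  direction     = "INGRESS"
  source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
  target_tags   = ["allow-lb-health-checks"]

  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }
}
```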

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config — live/prod/managed_instance_group/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"
}

inputs = {
  project_id = "..."
  name = "..."
  region = "..."
  base_instance_name = "..."
  source_image = "..."
  network = "..."
  subnetwork = "..."
  service_account_email = "..."
}

3. Deploy this module alone, or roll out every module in the environment (both commands run from the repository root):

(cd live/prod/managed_instance_group && terragrunt apply)   # this module
(cd live/prod && terragrunt run-all apply)                  # every module under live/prod

Why Terragrunt here: the backend configuration (and the provider, once you add a generate block for it to the root config) lives in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all applies modules in dependency order. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
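
The root config in step 1 shows only remote_state. A provider block could be generated from the same file along the following lines. This is a sketch: the project and region values are the Quickstart placeholders, not real settings.

```hcl
# live/terragrunt.hcl (addition): write provider.tf into every child module.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"  # placeholder
  region  = "us-central1" # placeholder
}
EOF
}
```

With if_exists = "overwrite_terragrunt", Terragrunt replaces only files it generated itself and stops with an error if a hand-written provider.tf already exists at that path.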

Inputs

Name	Type	Default	Required	Description
project_id	string	—	Yes	GCP project the MIG and supporting resources live in.
name	string	—	Yes	Base name (RFC1035) for the MIG and derived resources.
region	string	—	Yes	Region for the regional MIG.
base_instance_name	string	—	Yes	VM name prefix; GCP appends a unique suffix.
machine_type	string	`e2-standard-2`	No	Machine type per VM.
source_image	string	—	Yes	Boot image self-link or family path.
boot_disk_type	string	`pd-balanced`	No	Boot disk type (pd-standard/pd-balanced/pd-ssd).
boot_disk_size_gb	number	`20`	No	Boot disk size in GB (>= 10).
network	string	—	Yes	VPC network self-link or name.
subnetwork	string	—	Yes	Subnetwork in the target region.
assign_external_ip	bool	`false`	No	Attach an ephemeral external IP to each VM.
network_tags	list(string)	`[]`	No	Network tags applied to each VM.
labels	map(string)	`{}`	No	Labels on the template and VMs.
metadata	map(string)	`{}`	No	Custom instance metadata.
startup_script	string	`null`	No	Startup script run at boot.
service_account_email	string	—	Yes	Least-privilege SA attached to each VM.
service_account_scopes	list(string)	`["…/auth/cloud-platform"]`	No	OAuth scopes for the attached SA.
enable_secure_boot	bool	`true`	No	Enable Shielded VM Secure Boot.
target_size	number	`2`	No	Static VM count when autoscaler is disabled.
distribution_policy_zones	list(string)	`[]`	No	Explicit zones to spread across; empty = GCP picks the zones.
distribution_policy_target_shape	string	`EVEN`	No	Zone spread strategy (EVEN/BALANCED/ANY).
named_ports	list(object)	`[]`	No	Named ports for load balancer backends.
health_check_port	number	`80`	No	Port for the auto-healing health check.
health_check_request_path	string	`/healthz`	No	HTTP path for the auto-healing check.
health_check_interval_sec	number	`10`	No	Seconds between probes.
health_check_timeout_sec	number	`5`	No	Probe response timeout.
health_check_healthy_threshold	number	`2`	No	Consecutive successes to mark healthy.
health_check_unhealthy_threshold	number	`3`	No	Consecutive failures before recreation.
health_check_initial_delay_sec	number	`300`	No	Grace period after boot before probing.
enable_health_check_logging	bool	`false`	No	Log health check probe results.
update_policy_type	string	`PROACTIVE`	No	Rolling update mode (PROACTIVE/OPPORTUNISTIC).
update_instance_redistribution_type	string	`PROACTIVE`	No	Redistribute across zones on update (PROACTIVE/NONE).
update_minimal_action	string	`REPLACE`	No	Least disruptive update action.
update_max_surge_fixed	number	`3`	No	Extra VMs added during update (0 or >= zone count).
update_max_unavailable_fixed	number	`0`	No	Max VMs unavailable during update.
enable_autoscaler	bool	`true`	No	Create a regional autoscaler owning the VM count.
autoscaler_min_replicas	number	`2`	No	Minimum VMs the autoscaler keeps (>= 1).
autoscaler_max_replicas	number	`10`	No	Maximum VMs the autoscaler may reach.
autoscaler_cooldown_period	number	`60`	No	Seconds before a new VM’s metrics count.
autoscaler_cpu_target	number	`0.6`	No	Target average CPU utilization (0–1).
autoscaler_lb_target	number	`null`	No	Target LB utilization (0–1); null disables it.
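
As an illustration of overriding the defaults in the table, the sketch below configures a fixed-size pool with the autoscaler turned off. The module label worker_pool, the zone list, and the port are assumptions; the "..." placeholders are the same required inputs as in the Quickstart.

```hcl
module "worker_pool" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id            = "..."
  name                  = "..."
  region                = "us-central1"
  base_instance_name    = "..."
  source_image          = "..."
  network               = "..."
  subnetwork            = "..."
  service_account_email = "..."

  # Fixed fleet: no autoscaler, so target_size is used.
  enable_autoscaler = false
  target_size       = 6

  # Pin the zones explicitly; the surge of 3 matches the zone count.
  distribution_policy_zones    = ["us-central1-a", "us-central1-b", "us-central1-c"]
  update_max_surge_fixed       = 3
  update_max_unavailable_fixed = 0

  # Roll new templates only when instances are recreated for other reasons.
  update_policy_type = "OPPORTUNISTIC"

  # The module always creates an HTTP auto-healing check, so even a
  # queue worker must serve a health endpoint on this port.
  health_check_port = 8080
}
```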

Outputs

Name	Description
instance_group_manager_id	Fully qualified ID of the regional instance group manager.
instance_group_manager_name	Name of the regional instance group manager.
instance_group	Self-link of the underlying instance group, used as a load balancer backend.
instance_template_id	ID of the instance template currently driving the MIG.
instance_template_self_link	Self-link of the active instance template.
health_check_id	ID of the auto-healing health check.
autoscaler_id	ID of the regional autoscaler, or null when disabled.
named_ports	Named ports configured on the instance group.
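
A root module can re-export these values for other stacks or for inspection with terraform output. A short sketch, assuming the module block is labelled managed_instance_group as in the examples above; the output names are hypothetical.

```hcl
# Instance group self-link, e.g. for a backend service defined in another stack.
output "web_instance_group" {
  description = "Instance group self-link for load balancer backends."
  value       = module.managed_instance_group.instance_group
}

# Null when the module was called with enable_autoscaler = false.
output "web_autoscaler_id" {
  description = "Regional autoscaler ID, if one was created."
  value       = module.managed_instance_group.autoscaler_id
}
```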

Enterprise scenario

KloudVin’s e-commerce client runs its checkout API as a stateless tier in asia-south1. They instantiate this module with enable_autoscaler = true, autoscaler_min_replicas = 4, autoscaler_max_replicas = 40, and autoscaler_lb_target = 0.8, so the fleet expands across all three zones during flash sales and contracts overnight to control spend. Because update_max_unavailable_fixed = 0 with a surge of 3, the platform team ships a new container-optimized image every sprint by publishing a fresh template version: the MIG creates replacement VMs in waves before removing old ones, so checkout capacity never drops during a rollout. The auto-healing health check (/healthz, 300s initial delay) recreates any VM that wedges under load, usually before customers notice.
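
Expressed as a module call, the scenario's settings might look like the sketch below. Only the values named in the paragraph above come from the scenario; the module label checkout_api is hypothetical, and the "..." placeholders stand for inputs the scenario does not specify.

```hcl
module "checkout_api" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-managed-instance-group?ref=v1.0.0"

  project_id            = "..."
  name                  = "..."
  region                = "asia-south1"
  base_instance_name    = "..."
  source_image          = "..." # the container-optimized image baked each sprint
  network               = "..."
  subnetwork            = "..."
  service_account_email = "..."

  # Scale between 4 and 40 VMs on load-balancer utilization (plus the default CPU target).
  enable_autoscaler       = true
  autoscaler_min_replicas = 4
  autoscaler_max_replicas = 40
  autoscaler_lb_target    = 0.8

  # Add up to 3 new VMs before removing any old ones.
  update_max_surge_fixed       = 3
  update_max_unavailable_fixed = 0

  # Auto-healing: probe /healthz, with a 300s grace period after boot.
  health_check_request_path      = "/healthz"
  health_check_initial_delay_sec = 300
}
```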

Best practices

Keep auto-healing and load-balancer health checks separate. The auto-healing probe should be lenient (a shallow liveness check, with health_check_initial_delay_sec set above warm-up time) so the MIG never recreates a VM that is merely slow to start, while the load balancer’s serving check can be stricter about readiness.
Surge before you subtract for zero-downtime deploys. Set update_max_unavailable_fixed = 0 and update_max_surge_fixed to at least the number of zones (a regional MIG accepts a fixed value of either 0 or at least the zone count) so new instances are healthy before old ones drain, and a rollout never dips below capacity.
Attach a dedicated, least-privilege service account. Never run MIG VMs as the default Compute Engine SA. Create a purpose-built SA, grant only the IAM roles the workload needs, and leave Shielded VM Secure Boot, vTPM, and integrity monitoring enabled.
Let the autoscaler own target_size. This module passes target_size = null when enable_autoscaler = true and ignores drift on that attribute, so the target_size input has no effect in that mode. Do not work around this by setting a static count elsewhere, or the two controllers will thrash the fleet up and down against each other.
Use pd-balanced and tune CPU targets for cost. Default to pd-balanced over pd-ssd unless I/O-bound, and pick an autoscaler_cpu_target around 0.55–0.65 — too high starves headroom during scale-up latency, too low wastes idle VMs and inflates the bill.
Name MIGs and base instances consistently per env and region. A scheme like web-frontend / web makes VM names, health checks, and autoscalers self-describing in the console and in billing exports, and the RFC1035 validation in this module enforces valid, predictable names up front.
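
The dedicated service account practice above could be sketched as follows. The account_id and the two roles are assumptions chosen for illustration (writing logs and metrics is a common baseline for a VM workload); grant only what your own workload needs.

```hcl
# Purpose-built service account for the MIG's VMs (hypothetical names).
resource "google_service_account" "web" {
  project      = "kloudvin-prod"
  account_id   = "web-frontend"
  display_name = "web-frontend MIG VMs"
}

# Grant a narrow set of project roles instead of relying on the default
# Compute Engine service account.
resource "google_project_iam_member" "web_roles" {
  for_each = toset([
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ])

  project = "kloudvin-prod"
  role    = each.value
  member  = "serviceAccount:${google_service_account.web.email}"
}
```

The resulting google_service_account.web.email is the value passed to the module's service_account_email input.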

Written by Vinod
