Quick take: a reusable Terraform module (hashicorp/azurerm ~> 4.0) for Azure Linux VM Scale Sets, covering rolling upgrades, a CPU-based autoscale setting, load balancer health probes, automatic instance repair, and SSH-key-only auth. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "azurerm" {
features {}
}
module "vm_scale_set" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-vm-scale-set?ref=v1.0.0"
name = "..." # Name of the scale set; prefix for child resources.
resource_group_name = "..." # Resource group for the scale set.
location = "..." # Azure region.
admin_ssh_public_key = "..." # OpenSSH public key (password auth always off).
subnet_id = "..." # Subnet for the primary NIC.
}
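Then run terraform init && terraform apply. Every other input has a default (see Inputs below to override behaviour), but two of those defaults depend on a load balancer probe: upgrade_mode = "Rolling" and enable_automatic_instance_repair = true both need health_probe_id. Without a probe the azurerm provider is expected to reject the plan, so either pass health_probe_id or set upgrade_mode = "Manual" and enable_automatic_instance_repair = false.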
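If no load balancer probe exists yet, a Quickstart variant along the following lines should plan without one. This is a sketch that only uses inputs documented in this module; the "..." placeholders are left for you to fill in, and the trade-off is that upgrades become manual and unhealthy instances are not repaired automatically.

```hcl
provider "azurerm" {
  features {}
}

module "vm_scale_set" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-vm-scale-set?ref=v1.0.0"

  name                 = "..." # Name of the scale set; prefix for child resources.
  resource_group_name  = "..." # Resource group for the scale set.
  location             = "..." # Azure region.
  admin_ssh_public_key = "..." # OpenSSH public key (password auth always off).
  subnet_id            = "..." # Subnet for the primary NIC.

  # No health probe available: switch off the two features that rely on one.
  upgrade_mode                     = "Manual"
  enable_automatic_instance_repair = false
}
```

Once a probe is in place, removing the last two lines and passing health_probe_id returns the module to its Rolling and self-healing defaults.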
What this module is
An Azure Virtual Machine Scale Set (VMSS) is a managed group of identical, load-balanced VMs that you scale in and out as a single unit. Instead of provisioning a fixed number of VMs by hand, you declare a VM profile — image, SKU, OS disk, NIC, identity — and Azure stamps out as many instances as the scale set’s capacity (or an autoscale rule) demands. For stateless workloads — web tiers, API fleets, worker pools, self-hosted CI agents — VMSS is the workhorse compute primitive that sits behind a load balancer or application gateway.
The raw azurerm_linux_virtual_machine_scale_set resource is deceptively large: it carries nested blocks for network interfaces, IP configurations, OS disk, data disks, identity, boot diagnostics, rolling upgrade policy, automatic instance repair, and the application health extension. Getting the production-grade combination right — upgrade_mode = "Rolling" paired with a health probe, automatic_instance_repair so unhealthy instances get reimaged, SSH-key-only auth with disable_password_authentication = true, and a separate azurerm_monitor_autoscale_setting — is the same boilerplate on every project, and easy to get subtly wrong.
This module wraps that combination behind clean, validated variables. You pass a subnet ID, an image reference, an instance SKU, and an autoscale band, and you get a scale set that does zero-downtime rolling upgrades, self-heals unhealthy instances, and scales on CPU. The module is opinionated toward security defaults (no public IPs per instance by default, system-assigned managed identity, encryption-at-host) so callers fall into the pit of success.
When to use it
Reach for this module when you need a fleet of interchangeable Linux VMs rather than one or two pets:
- Stateless horizontal tiers — web/API servers behind an Azure Load Balancer or Application Gateway where you want to add capacity by adding instances.
- Background worker pools — queue consumers, batch processors, or render farms that scale with backlog depth or CPU.
- Self-hosted runners/agents — Azure DevOps or GitHub Actions agents, or HPC-style compute that you grow and shrink on a schedule.
- Workloads needing self-healing — anywhere you want Azure to automatically detect an unhealthy instance (via a health probe) and reimage it without paging a human.
Do not use it for stateful singletons (databases, brokers with sticky disks) — use a standalone azurerm_linux_virtual_machine or a StatefulSet on AKS. If your workload is containerised and you want bin-packing and per-pod scaling, AKS or Container Apps is a better fit than VMSS. And if you only ever need a fixed count with no scaling or rolling-upgrade story, a plain availability set of VMs may be simpler.
Module structure
terraform-module-azure-vm-scale-set/
├── versions.tf # provider + Terraform version pins
├── main.tf # VMSS + autoscale setting wired together
├── variables.tf # validated input variables
└── outputs.tf # id/name + identity + key attributes
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
main.tf
locals {
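# Derive a short, hostname-safe prefix: underscores are not valid in hostnames,
# and 9 chars is a conservative cap (the strict 9-char limit applies to Windows
# scale sets; Linux is understood to allow longer prefixes).
computer_name_prefix = substr(replace(var.name, "_", "-"), 0, 9)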
base_tags = merge(
{
module = "terraform-module-azure-vm-scale-set"
environment = var.environment
managed_by = "terraform"
},
var.tags
)
}
resource "azurerm_linux_virtual_machine_scale_set" "this" {
name = var.name
resource_group_name = var.resource_group_name
location = var.location
sku = var.instance_sku
instances = var.instances
# Spread instances across availability zones when supplied.
zones = var.zones
zone_balance = length(var.zones) > 0 ? var.zone_balance : null
platform_fault_domain_count = var.platform_fault_domain_count
# Controls how model changes (image, SKU, extensions) reach existing instances.
upgrade_mode = var.upgrade_mode
# Create extra VMs at deploy time and delete the surplus once the requested count is running.
overprovision = var.overprovision
# Security baseline: SSH keys only, no inline passwords.
admin_username = var.admin_username
disable_password_authentication = true
encryption_at_host_enabled = var.encryption_at_host_enabled
admin_ssh_key {
username = var.admin_username
public_key = var.admin_ssh_public_key
}
source_image_id = var.source_image_id
dynamic "source_image_reference" {
for_each = var.source_image_id == null ? [var.source_image_reference] : []
content {
publisher = source_image_reference.value.publisher
offer = source_image_reference.value.offer
sku = source_image_reference.value.sku
version = source_image_reference.value.version
}
}
os_disk {
caching = var.os_disk_caching
storage_account_type = var.os_disk_storage_account_type
disk_size_gb = var.os_disk_size_gb
write_accelerator_enabled = false
}
dynamic "data_disk" {
for_each = var.data_disks
content {
lun = data_disk.value.lun
caching = data_disk.value.caching
storage_account_type = data_disk.value.storage_account_type
disk_size_gb = data_disk.value.disk_size_gb
create_option = "Empty"
}
}
computer_name_prefix = local.computer_name_prefix
network_interface {
name = "${var.name}-nic"
primary = true
enable_accelerated_networking = var.enable_accelerated_networking
ip_configuration {
name = "internal"
primary = true
subnet_id = var.subnet_id
# Attach to a load balancer backend pool when one is provided.
load_balancer_backend_address_pool_ids = var.load_balancer_backend_pool_ids
# Attach to an Application Gateway backend pool when provided.
application_gateway_backend_address_pool_ids = var.application_gateway_backend_pool_ids
}
}
identity {
type = var.identity_type
identity_ids = var.identity_type == "UserAssigned" || var.identity_type == "SystemAssigned, UserAssigned" ? var.user_assigned_identity_ids : null
}
boot_diagnostics {
# Empty endpoint => managed (Azure-hosted) boot diagnostics storage.
storage_account_uri = var.boot_diagnostics_storage_account_uri
}
# Load Balancer health probe. Automatic/Rolling upgrades and
# automatic_instance_repair need a health signal: either this probe or an
# Application Health extension (which this module does not configure).
health_probe_id = var.health_probe_id
dynamic "rolling_upgrade_policy" {
# The provider requires this block for both Automatic and Rolling modes.
for_each = contains(["Automatic", "Rolling"], var.upgrade_mode) ? [1] : []
content {
max_batch_instance_percent = var.rolling_max_batch_instance_percent
max_unhealthy_instance_percent = var.rolling_max_unhealthy_instance_percent
max_unhealthy_upgraded_instance_percent = var.rolling_max_unhealthy_upgraded_instance_percent
pause_time_between_batches = var.rolling_pause_time_between_batches
}
}
dynamic "automatic_instance_repair" {
for_each = var.enable_automatic_instance_repair ? [1] : []
content {
enabled = true
grace_period = var.instance_repair_grace_period
}
}
tags = local.base_tags
lifecycle {
# Let autoscale own the instance count after creation.
ignore_changes = [instances]
}
}
resource "azurerm_monitor_autoscale_setting" "this" {
count = var.enable_autoscale ? 1 : 0
name = "${var.name}-autoscale"
resource_group_name = var.resource_group_name
location = var.location
target_resource_id = azurerm_linux_virtual_machine_scale_set.this.id
profile {
name = "cpu-based"
capacity {
default = var.autoscale_default
minimum = var.autoscale_min
maximum = var.autoscale_max
}
rule {
metric_trigger {
metric_name = "Percentage CPU"
metric_resource_id = azurerm_linux_virtual_machine_scale_set.this.id
time_grain = "PT1M"
statistic = "Average"
time_window = "PT5M"
time_aggregation = "Average"
operator = "GreaterThan"
threshold = var.autoscale_out_cpu_threshold
}
scale_action {
direction = "Increase"
type = "ChangeCount"
value = "1"
cooldown = "PT5M"
}
}
rule {
metric_trigger {
metric_name = "Percentage CPU"
metric_resource_id = azurerm_linux_virtual_machine_scale_set.this.id
time_grain = "PT1M"
statistic = "Average"
time_window = "PT5M"
time_aggregation = "Average"
operator = "LessThan"
threshold = var.autoscale_in_cpu_threshold
}
scale_action {
direction = "Decrease"
type = "ChangeCount"
value = "1"
cooldown = "PT5M"
}
}
}
tags = local.base_tags
}
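The scale set resource above needs a health signal whenever upgrade_mode is not Manual or automatic instance repair is on, and this module's only health signal is health_probe_id. The following is a minimal sketch (not part of the published module) of how that dependency could be surfaced at plan time with lifecycle preconditions; it would replace the existing lifecycle block inside azurerm_linux_virtual_machine_scale_set.this, and the error wording is illustrative.

```hcl
# Fragment: belongs inside resource "azurerm_linux_virtual_machine_scale_set" "this".
lifecycle {
  # Let autoscale own the instance count after creation.
  ignore_changes = [instances]

  # Fail early if an upgrade mode that needs health data has no probe.
  precondition {
    condition     = var.upgrade_mode == "Manual" || var.health_probe_id != null
    error_message = "upgrade_mode Automatic or Rolling requires health_probe_id."
  }

  # Fail early if instance repair is enabled without a probe.
  precondition {
    condition     = !var.enable_automatic_instance_repair || var.health_probe_id != null
    error_message = "enable_automatic_instance_repair requires health_probe_id."
  }
}
```

Preconditions are evaluated during plan when the values are known, so a caller who forgets the probe gets a module-level message rather than a provider error.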
variables.tf
variable "name" {
description = "Name of the VM Scale Set (also used as a prefix for child resources)."
type = string
validation {
condition = can(regex("^[a-zA-Z0-9][a-zA-Z0-9._-]{0,62}[a-zA-Z0-9_]$", var.name))
error_message = "name must be 2-64 chars, start with an alphanumeric, and end with an alphanumeric or underscore."
}
}
variable "resource_group_name" {
description = "Resource group that will contain the scale set."
type = string
}
variable "location" {
description = "Azure region (e.g. eastus, centralindia)."
type = string
}
variable "environment" {
description = "Environment label applied as a tag (dev, staging, prod)."
type = string
default = "dev"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "environment must be one of: dev, staging, prod."
}
}
variable "instance_sku" {
description = "VM size for each instance (e.g. Standard_D2s_v5)."
type = string
default = "Standard_D2s_v5"
}
variable "instances" {
description = "Initial instance count. Autoscale takes over after creation when enabled."
type = number
default = 2
validation {
condition = var.instances >= 0 && var.instances <= 1000
error_message = "instances must be between 0 and 1000."
}
}
variable "zones" {
description = "Availability zones to spread instances across. Empty = regional (no zones)."
type = list(string)
default = ["1", "2", "3"]
}
variable "zone_balance" {
description = "Force even distribution across zones (only honoured when zones is non-empty)."
type = bool
default = true
}
variable "platform_fault_domain_count" {
description = "Number of fault domains. Use 1 for zonal deployments in most regions."
type = number
default = 1
}
variable "upgrade_mode" {
description = "How OS/image upgrades roll out: Manual, Automatic, or Rolling."
type = string
default = "Rolling"
validation {
condition = contains(["Manual", "Automatic", "Rolling"], var.upgrade_mode)
error_message = "upgrade_mode must be Manual, Automatic, or Rolling."
}
}
variable "overprovision" {
description = "Allow Azure to temporarily over-provision instances to improve deployment reliability."
type = bool
default = true
}
variable "admin_username" {
description = "Admin user created on each instance."
type = string
default = "azureuser"
}
variable "admin_ssh_public_key" {
description = "OpenSSH public key used for SSH-key-only login (password auth is always disabled)."
type = string
validation {
condition = can(regex("^ssh-(rsa|ed25519) ", var.admin_ssh_public_key))
error_message = "admin_ssh_public_key must be an ssh-rsa or ssh-ed25519 OpenSSH public key."
}
}
variable "encryption_at_host_enabled" {
description = "Encrypt temp disks and VM caches on the host. Requires the feature to be registered on the subscription."
type = bool
default = true
}
variable "source_image_id" {
description = "Resource ID of a managed/shared image or gallery image version. If set, source_image_reference is ignored."
type = string
default = null
}
variable "source_image_reference" {
description = "Marketplace image to use when source_image_id is null."
type = object({
publisher = string
offer = string
sku = string
version = string
})
default = {
publisher = "Canonical"
offer = "ubuntu-24_04-lts"
sku = "server"
version = "latest"
}
}
variable "os_disk_caching" {
description = "OS disk caching mode."
type = string
default = "ReadWrite"
validation {
condition = contains(["None", "ReadOnly", "ReadWrite"], var.os_disk_caching)
error_message = "os_disk_caching must be None, ReadOnly, or ReadWrite."
}
}
variable "os_disk_storage_account_type" {
description = "OS disk storage tier (e.g. Premium_LRS, StandardSSD_LRS)."
type = string
default = "Premium_LRS"
}
variable "os_disk_size_gb" {
description = "OS disk size in GB. Null uses the image default."
type = number
default = null
}
variable "data_disks" {
description = "Empty data disks to attach to every instance."
type = list(object({
lun = number
caching = string
storage_account_type = string
disk_size_gb = number
}))
default = []
}
variable "subnet_id" {
description = "Subnet ID the primary NIC attaches to."
type = string
}
variable "enable_accelerated_networking" {
description = "Enable accelerated networking (SR-IOV). The chosen SKU must support it."
type = bool
default = false
}
variable "load_balancer_backend_pool_ids" {
description = "Load Balancer backend address pool IDs to register instances into."
type = list(string)
default = []
}
variable "application_gateway_backend_pool_ids" {
description = "Application Gateway backend address pool IDs to register instances into."
type = list(string)
default = []
}
variable "identity_type" {
description = "Managed identity type for the scale set."
type = string
default = "SystemAssigned"
validation {
condition = contains(["SystemAssigned", "UserAssigned", "SystemAssigned, UserAssigned"], var.identity_type)
error_message = "identity_type must be SystemAssigned, UserAssigned, or 'SystemAssigned, UserAssigned'."
}
}
variable "user_assigned_identity_ids" {
description = "User-assigned managed identity resource IDs (required when identity_type includes UserAssigned)."
type = list(string)
default = []
}
variable "boot_diagnostics_storage_account_uri" {
description = "Blob endpoint for boot diagnostics. Empty string = Azure-managed storage."
type = string
default = ""
}
variable "health_probe_id" {
description = "Load Balancer health probe ID. Required for Rolling upgrades and automatic instance repair."
type = string
default = null
}
variable "enable_automatic_instance_repair" {
description = "Reimage instances that report unhealthy via the health probe."
type = bool
default = true
}
variable "instance_repair_grace_period" {
description = "ISO 8601 grace period before an unhealthy instance is repaired (PT10M..PT90M)."
type = string
default = "PT30M"
}
# --- Rolling upgrade policy ---
variable "rolling_max_batch_instance_percent" {
description = "Max percent of instances upgraded simultaneously in one batch."
type = number
default = 20
}
variable "rolling_max_unhealthy_instance_percent" {
description = "Max percent of instances allowed unhealthy before/during an upgrade."
type = number
default = 20
}
variable "rolling_max_unhealthy_upgraded_instance_percent" {
description = "Max percent of newly upgraded instances that may be unhealthy before rollback."
type = number
default = 20
}
variable "rolling_pause_time_between_batches" {
description = "ISO 8601 wait between upgrade batches."
type = string
default = "PT2M"
}
# --- Autoscale ---
variable "enable_autoscale" {
description = "Create a CPU-based autoscale setting targeting this scale set."
type = bool
default = true
}
variable "autoscale_default" {
description = "Capacity to fall back to when metrics are unavailable."
type = number
default = 2
}
variable "autoscale_min" {
description = "Minimum instance count under autoscale."
type = number
default = 2
}
variable "autoscale_max" {
description = "Maximum instance count under autoscale."
type = number
default = 10
}
variable "autoscale_out_cpu_threshold" {
description = "Average CPU percent above which to scale out by one instance."
type = number
default = 75
}
variable "autoscale_in_cpu_threshold" {
description = "Average CPU percent below which to scale in by one instance."
type = number
default = 25
}
variable "tags" {
description = "Additional tags merged onto all resources."
type = map(string)
default = {}
}
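Several of the string and number inputs above document a range in their description without enforcing it. As a sketch of how one of them could be tightened, the variant below adds a validation block to instance_repair_grace_period that enforces the PT10M..PT90M window stated in its description. It assumes callers use the minutes-only form; an equivalent value such as PT1H would be rejected by this pattern.

```hcl
variable "instance_repair_grace_period" {
  description = "ISO 8601 grace period before an unhealthy instance is repaired (PT10M..PT90M)."
  type        = string
  default     = "PT30M"

  validation {
    # [1-8][0-9] matches 10 through 89; the second alternative matches 90.
    condition     = can(regex("^PT([1-8][0-9]|90)M$", var.instance_repair_grace_period))
    error_message = "instance_repair_grace_period must be between PT10M and PT90M, written in minutes."
  }
}
```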
outputs.tf
output "id" {
description = "Resource ID of the VM Scale Set."
value = azurerm_linux_virtual_machine_scale_set.this.id
}
output "name" {
description = "Name of the VM Scale Set."
value = azurerm_linux_virtual_machine_scale_set.this.name
}
output "unique_id" {
description = "Globally unique, immutable identifier of the scale set."
value = azurerm_linux_virtual_machine_scale_set.this.unique_id
}
output "identity_principal_id" {
description = "Principal (object) ID of the system-assigned managed identity, if enabled. Use this for RBAC role assignments."
value = try(azurerm_linux_virtual_machine_scale_set.this.identity[0].principal_id, null)
}
output "instances" {
description = "Current configured instance count."
value = azurerm_linux_virtual_machine_scale_set.this.instances
}
output "autoscale_setting_id" {
description = "Resource ID of the autoscale setting, if created."
value = try(azurerm_monitor_autoscale_setting.this[0].id, null)
}
How to use it
The example below builds a load-balanced web fleet: instances register into an internal Load Balancer backend pool, an HTTP health probe drives both rolling upgrades and self-healing, and the system-assigned identity is granted Key Vault access downstream.
module "vm_scale_set" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-vm-scale-set?ref=v1.0.0"
name = "vmss-web-prod"
resource_group_name = azurerm_resource_group.app.name
location = "centralindia"
environment = "prod"
instance_sku = "Standard_D4s_v5"
instances = 3
zones = ["1", "2", "3"]
subnet_id = azurerm_subnet.web.id
admin_ssh_public_key = file("~/.ssh/vmss_web.pub")
# Ubuntu 24.04 LTS marketplace image (default), accelerated networking on.
enable_accelerated_networking = true
# Register into the internal load balancer + drive health from its probe.
load_balancer_backend_pool_ids = [azurerm_lb_backend_address_pool.web.id]
health_probe_id = azurerm_lb_probe.web_http.id
# Zero-downtime rolling upgrades + reimage unhealthy nodes after 15 min.
upgrade_mode = "Rolling"
enable_automatic_instance_repair = true
instance_repair_grace_period = "PT15M"
# Scale 3 -> 12 on CPU. The default capacity must sit within [min, max].
enable_autoscale = true
autoscale_default = 3
autoscale_min = 3
autoscale_max = 12
autoscale_out_cpu_threshold = 70
autoscale_in_cpu_threshold = 30
tags = {
team = "platform"
cost_center = "cc-1042"
}
}
# Downstream: grant the scale set's managed identity read access to Key Vault
# using the principal_id output.
resource "azurerm_role_assignment" "vmss_kv_reader" {
scope = azurerm_key_vault.app.id
role_definition_name = "Key Vault Secrets User"
principal_id = module.vm_scale_set.identity_principal_id
}
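The module call above references azurerm_lb_backend_address_pool.web and azurerm_lb_probe.web_http without defining them. A minimal sketch of those supporting resources might look like the following; the resource names, port 8080, and the /healthz path are assumptions for illustration, and the load balancing rule is included because a probe generally has to be attached to a rule before a scale set can use it as its health signal.

```hcl
resource "azurerm_lb" "web" {
  name                = "lb-web-prod"
  location            = azurerm_resource_group.app.location
  resource_group_name = azurerm_resource_group.app.name
  sku                 = "Standard"

  # Internal (private) frontend in the same subnet as the fleet.
  frontend_ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.web.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_lb_backend_address_pool" "web" {
  name            = "web-pool"
  loadbalancer_id = azurerm_lb.web.id
}

# HTTP probe against a readiness endpoint (path and port are hypothetical).
resource "azurerm_lb_probe" "web_http" {
  name                = "http-ready"
  loadbalancer_id     = azurerm_lb.web.id
  protocol            = "Http"
  port                = 8080
  request_path        = "/healthz"
  interval_in_seconds = 15
}

# Ties the frontend, backend pool, and probe together.
resource "azurerm_lb_rule" "web" {
  name                           = "http"
  loadbalancer_id                = azurerm_lb.web.id
  protocol                       = "Tcp"
  frontend_port                  = 80
  backend_port                   = 8080
  frontend_ip_configuration_name = "internal"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.web.id]
  probe_id                       = azurerm_lb_probe.web_http.id
}
```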
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm state storage account/container + key per path...
}
}
2. Module config — live/prod/vm_scale_set/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-vm-scale-set?ref=v1.0.0"
}
inputs = {
name = "..."
resource_group_name = "..."
location = "..."
admin_ssh_public_key = "..."
subnet_id = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/vm_scale_set && terragrunt apply # this module
terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
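To make the dependency orchestration concrete, here is a sketch of a module config that takes its subnet from a sibling Terragrunt unit instead of a hard-coded ID. The ../network unit and its web_subnet_id output are hypothetical; substitute whatever your networking module actually exposes.

```hcl
# live/prod/vm_scale_set/terragrunt.hcl (sketch)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-vm-scale-set?ref=v1.0.0"
}

# Hypothetical sibling unit that creates the virtual network and subnets.
dependency "network" {
  config_path = "../network"
}

inputs = {
  name                 = "..."
  resource_group_name  = "..."
  location             = "..."
  admin_ssh_public_key = "..."
  environment          = "prod"

  # Resolved from the network unit's outputs, so run-all applies it first.
  subnet_id = dependency.network.outputs.web_subnet_id
}
```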
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | — | yes | Name of the scale set; prefix for child resources. |
| resource_group_name | string | — | yes | Resource group for the scale set. |
| location | string | — | yes | Azure region. |
| environment | string | "dev" | no | Environment tag; one of dev/staging/prod. |
| instance_sku | string | "Standard_D2s_v5" | no | VM size per instance. |
| instances | number | 2 | no | Initial instance count (autoscale takes over). |
| zones | list(string) | ["1","2","3"] | no | Availability zones; empty = regional. |
| zone_balance | bool | true | no | Force even zone distribution (when zones set). |
| platform_fault_domain_count | number | 1 | no | Fault domain count. |
| upgrade_mode | string | "Rolling" | no | Manual, Automatic, or Rolling. |
| overprovision | bool | true | no | Temporarily over-provision for deploy reliability. |
| admin_username | string | "azureuser" | no | Admin user on each instance. |
| admin_ssh_public_key | string | — | yes | OpenSSH public key (password auth always off). |
| encryption_at_host_enabled | bool | true | no | Encrypt host caches/temp disks. |
| source_image_id | string | null | no | Managed/gallery image ID; overrides reference. |
| source_image_reference | object | Ubuntu 24.04 LTS | no | Marketplace image when no image ID. |
| os_disk_caching | string | "ReadWrite" | no | OS disk caching mode. |
| os_disk_storage_account_type | string | "Premium_LRS" | no | OS disk tier. |
| os_disk_size_gb | number | null | no | OS disk size; null = image default. |
| data_disks | list(object) | [] | no | Empty data disks per instance. |
| subnet_id | string | — | yes | Subnet for the primary NIC. |
| enable_accelerated_networking | bool | false | no | SR-IOV accelerated networking. |
| load_balancer_backend_pool_ids | list(string) | [] | no | LB backend pool IDs to join. |
| application_gateway_backend_pool_ids | list(string) | [] | no | App Gateway backend pool IDs to join. |
| identity_type | string | "SystemAssigned" | no | Managed identity type. |
| user_assigned_identity_ids | list(string) | [] | no | UAMI IDs (when UserAssigned). |
| boot_diagnostics_storage_account_uri | string | "" | no | Boot diag blob URI; empty = managed. |
| health_probe_id | string | null | no | LB probe ID; required for Rolling + repair. |
| enable_automatic_instance_repair | bool | true | no | Reimage unhealthy instances. |
| instance_repair_grace_period | string | "PT30M" | no | Grace before repair (PT10M…PT90M). |
| rolling_max_batch_instance_percent | number | 20 | no | Max % upgraded per batch. |
| rolling_max_unhealthy_instance_percent | number | 20 | no | Max % unhealthy allowed during upgrade. |
| rolling_max_unhealthy_upgraded_instance_percent | number | 20 | no | Max % new instances unhealthy before rollback. |
| rolling_pause_time_between_batches | string | "PT2M" | no | Wait between upgrade batches. |
| enable_autoscale | bool | true | no | Create CPU-based autoscale setting. |
| autoscale_default | number | 2 | no | Capacity when metrics unavailable. |
| autoscale_min | number | 2 | no | Minimum instance count. |
| autoscale_max | number | 10 | no | Maximum instance count. |
| autoscale_out_cpu_threshold | number | 75 | no | Avg CPU % to scale out. |
| autoscale_in_cpu_threshold | number | 25 | no | Avg CPU % to scale in. |
| tags | map(string) | {} | no | Extra tags merged onto all resources. |
Outputs
| Name | Description |
|---|---|
| id | Resource ID of the VM Scale Set. |
| name | Name of the VM Scale Set. |
| unique_id | Globally unique, immutable identifier of the scale set. |
| identity_principal_id | Principal ID of the system-assigned identity (for RBAC). |
| instances | Current configured instance count. |
| autoscale_setting_id | Resource ID of the autoscale setting, if created. |
Enterprise scenario
A fintech platform team runs its customer-facing API tier as a VMSS behind an internal Application Gateway across three availability zones in Central India. Using this module, every release ships a new golden image version (built by Packer into a Shared Image Gallery), and upgrade_mode = "Rolling" rolls it out 20% at a time, pausing two minutes between batches and halting automatically if more than 20% of the new instances fail the gateway health probe. Overnight the autoscale band drops the floor from 12 to 4 instances to cut compute spend, while automatic_instance_repair quietly reimages any node that wedges, so the on-call engineer is never paged for a single bad instance.
Best practices
- Always wire a health probe before enabling Rolling upgrades or instance repair. Without health_probe_id, Azure has no signal of instance health: rolling upgrades become risky and automatic_instance_repair cannot function. Point the probe at a real readiness endpoint, not just a TCP port that opens before the app is ready.
- Enforce SSH-key-only access and managed identity. This module hard-codes disable_password_authentication = true and validates the key format; pair it with a system-assigned identity and RBAC (via the identity_principal_id output) instead of embedding secrets or service-principal credentials in cloud-init.
- Let autoscale own capacity and keep it out of state drift. The resource uses ignore_changes = [instances] so Terraform won’t fight the autoscale engine. Tune autoscale_min for your latency floor and autoscale_max/cooldowns to cap cost, and set a sane autoscale_default so the set survives a metrics outage.
- Spread across availability zones with zone_balance. For production, keep zones = ["1","2","3"] and zone_balance = true so a single zone failure removes at most a third of capacity; align platform_fault_domain_count with what the target region supports.
- Pick Premium_LRS and accelerated networking deliberately. Premium OS disks and SR-IOV cut latency for throughput-sensitive tiers but cost more and require supported SKUs; gate enable_accelerated_networking on a SKU that supports it, and drop to StandardSSD_LRS for non-prod to save money.
- Name and tag for fleet-wide operations. Use a consistent vmss-<workload>-<env> name and rely on the module’s merged tags (environment, managed_by, plus your cost_center/team) so cost reports, policy, and incident tooling can slice by fleet without per-instance bookkeeping.
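Several of these practices (disk tier, accelerated networking, autoscale bounds, naming) vary by environment, and one way to keep them in a single place is a lookup table in the calling configuration. The sketch below is illustrative only: the vmss_profiles local, its field names, and the numbers in it are hypothetical, and the "..." placeholders stand for your own values.

```hcl
variable "environment" {
  type    = string
  default = "dev"
}

locals {
  # Hypothetical per-environment sizing; values are examples, not recommendations.
  vmss_profiles = {
    dev     = { os_disk_tier = "StandardSSD_LRS", accelerated = false, min = 1, max = 2 }
    staging = { os_disk_tier = "StandardSSD_LRS", accelerated = false, min = 2, max = 4 }
    prod    = { os_disk_tier = "Premium_LRS", accelerated = true, min = 3, max = 12 }
  }
  vmss_profile = local.vmss_profiles[var.environment]
}

module "vm_scale_set" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-vm-scale-set?ref=v1.0.0"

  # Follows the vmss-<workload>-<env> naming convention.
  name                 = "vmss-web-${var.environment}"
  environment          = var.environment
  resource_group_name  = "..."
  location             = "..."
  admin_ssh_public_key = "..."
  subnet_id            = "..."
  health_probe_id      = "..." # Load Balancer probe ID; needed by the Rolling default.

  os_disk_storage_account_type  = local.vmss_profile.os_disk_tier
  enable_accelerated_networking = local.vmss_profile.accelerated

  # Keep initial count and autoscale default at the floor so they stay within [min, max].
  instances         = local.vmss_profile.min
  autoscale_default = local.vmss_profile.min
  autoscale_min     = local.vmss_profile.min
  autoscale_max     = local.vmss_profile.max
}
```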