Most teams that run IaaS at scale on Azure are still operating VM Scale Sets the way they did in 2019: Uniform orchestration, a Marketplace image plus a 400-line cloud-init that re-runs on every boot, and “upgrades” that mean tearing the whole set down on a Friday. That model fights you on three fronts. Boot is slow and non-deterministic because you build the box every time it starts. You have no immutable, versioned artifact to roll back to. And you have no safe, health-gated mechanism to push a new image without a maintenance window.
The modern shape fixes all three. Flexible orchestration gives you the placement control of an availability set with the scale machinery of a VMSS. Azure Image Builder bakes a golden image once, in a pipeline, so boot is fast and identical. Azure Compute Gallery versions and replicates that image like any other artifact. And rolling upgrades, gated by the Application Health extension, replace instances batch by batch and stop the moment health regresses. This is the principal-level walkthrough of wiring those four together correctly.
1. Uniform vs Flexible orchestration, and fault-domain placement
Uniform orchestration treats instances as identical, fungible cattle managed through a single VMSS model. It is still the right choice for very large, homogeneous fleets (thousands of instances) and for Service Fabric. But it hides the underlying VMs behind a virtualMachineScaleSets/virtualMachines proxy, so per-instance operations and standard VM tooling are awkward.
Flexible orchestration is the default for almost every new workload. Instances are real Microsoft.Compute/virtualMachines resources that happen to be members of a scale set. That means each instance shows up in the portal as a normal VM, takes VM extensions the normal way, can be attached or detached individually, and works with anything that expects a real VM resource. You trade some of Uniform’s raw scale ceiling for operability, and for most fleets that is the correct trade.
The placement decision that matters on day one is fault-domain spreading. Set it at creation; you cannot change it later.
| platformFaultDomainCount | Behaviour | Use when |
|---|---|---|
| 1 (max spreading) | Azure spreads instances across as many fault domains as the region allows, best-effort | Default. Best availability for most stateless fleets |
| 2-3 (fixed spreading) | Instances pinned across exactly N fault domains; request fails if it cannot satisfy N | Quorum systems that need a known, fixed FD count |
Microsoft’s own guidance is to use max spreading (platformFaultDomainCount = 1) for most scale sets. It gives the broadest distribution and avoids allocation failures when a region is constrained. Combine that with Availability Zones for the strongest posture.
LOC=eastus
RG=rg-vmss-prod
az group create --name $RG --location $LOC
# Flexible scale set, zonal + max fault-domain spreading.
# --orchestration-mode flexible and --platform-fault-domain-count 1
# are the two flags that define the shape.
az vmss create \
--resource-group $RG --name vmss-app \
--orchestration-mode flexible \
--zones 1 2 3 \
--platform-fault-domain-count 1 \
--instance-count 3 \
--vm-sku Standard_D2as_v5 \
--image Ubuntu2204 \
--upgrade-policy-mode Manual \
--admin-username azureuser \
--generate-ssh-keys \
--single-placement-group false
Create the set in Manual upgrade mode first. You want to confirm the Application Health extension is reporting green before you ever switch to Rolling. Flipping to Rolling with a misconfigured health signal is how people brick a fleet.
2. The golden-image pipeline with Azure Image Builder
Azure Image Builder (AIB) is a managed wrapper over HashiCorp Packer. You describe a source image, a set of customizers, and one or more distribution targets in a Microsoft.VirtualMachineImages/imageTemplates resource. AIB spins up a transient build VM in a staging resource group, runs your customizers, generalizes the result, and publishes it to your target – here, a Compute Gallery image version.
First, the identity. AIB runs as a user-assigned managed identity that needs rights to write image versions into your gallery. Grant it a role scoped to the gallery resource group.
IDENTITY=id-aib
az identity create --resource-group $RG --name $IDENTITY
AIB_PRINCIPAL=$(az identity show -g $RG -n $IDENTITY --query principalId -o tsv)
AIB_ID=$(az identity show -g $RG -n $IDENTITY --query id -o tsv)
SUB=$(az account show --query id -o tsv)
# AIB needs to write image versions into the gallery RG.
az role assignment create \
--assignee-object-id $AIB_PRINCIPAL \
--assignee-principal-type ServicePrincipal \
--role "Contributor" \
--scope /subscriptions/$SUB/resourceGroups/$RG
Contributor on the resource group is the simple path. In a hardened estate, replace it with a custom role that grants only the image-version and disk actions AIB needs, scoped to the gallery and the staging RG. Never grant subscription-level Contributor to a build identity.
Now the template. The two load-bearing sections are source (a PlatformImage) and distribute (a SharedImage pointing at a gallery image definition). Note version: latest on the source – because latest is resolved at build time, you can rerun the same template monthly and always rebake on top of the newest patched base image.
{
"type": "Microsoft.VirtualMachineImages/imageTemplates",
"name": "aib-ubuntu-app",
"apiVersion": "2024-02-01",
"location": "eastus",
"identity": {
"type": "UserAssigned",
"userAssignedIdentities": {
"<AIB_ID resource id>": {}
}
},
"properties": {
"buildTimeoutInMinutes": 60,
"vmProfile": {
"vmSize": "Standard_D2as_v5",
"osDiskSizeGB": 30
},
"source": {
"type": "PlatformImage",
"publisher": "Canonical",
"offer": "0001-com-ubuntu-server-jammy",
"sku": "22_04-lts-gen2",
"version": "latest"
},
"customize": [
{
"type": "Shell",
"name": "harden-and-install",
"inline": [
"set -euo pipefail",
"sudo apt-get update && sudo apt-get -y upgrade",
"sudo apt-get -y install nginx jq",
"sudo systemctl enable nginx",
"echo \"baked $(date -u +%FT%TZ)\" | sudo tee /etc/image-build-stamp"
]
},
{
"type": "Shell",
"name": "cis-baseline",
"scriptUri": "https://stbuildscripts.blob.core.windows.net/scripts/cis-baseline.sh"
}
],
"distribute": [
{
"type": "SharedImage",
"galleryImageId": "/subscriptions/<sub>/resourceGroups/rg-vmss-prod/providers/Microsoft.Compute/galleries/galProd/images/ubuntu-app",
"runOutputName": "ubuntu-app-out",
"artifactTags": { "source": "aib", "baseline": "cis" },
"targetRegions": [
{ "name": "eastus", "replicaCount": 3, "storageAccountType": "Standard_ZRS" },
{ "name": "westus3", "replicaCount": 2, "storageAccountType": "Standard_ZRS" }
]
}
]
}
}
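Deploy the template, then invoke the build. AIB templates are submitted as ARM resources, so the imageTemplates resource above has to sit inside the resources array of a standard ARM deployment template (saved here as aib-ubuntu-app.json). The build itself is a separate Run action.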
# Submit the template resource (validates and creates the build pipeline).
az deployment group create \
--resource-group $RG \
--template-file aib-ubuntu-app.json
# Kick off the actual image build (long-running).
az image builder run \
--resource-group $RG \
--name aib-ubuntu-app
# Watch the build; lastRunStatus.runState goes Running -> Succeeded.
az image builder show \
--resource-group $RG --name aib-ubuntu-app \
--query "lastRunStatus" -o jsonc
customize is fail-fast: if any single customizer fails, the whole build fails. Put set -euo pipefail at the top of every Shell block so a silent error inside a script actually surfaces as a build failure instead of shipping a broken image.
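In a pipeline you usually want to block until the build finishes rather than check by hand. The following is a minimal sketch that wraps the same az image builder show query in a polling loop. The function names, the 60-second default interval, and the set of terminal state names other than Succeeded are assumptions to verify against the API version you deploy.

```shell
# Sketch: block until an Azure Image Builder run reaches a terminal state.
# Assumes `az` is installed and logged in. State names other than
# Running and Succeeded are an assumption about the runState enum.

# Exit 0 when the run state can no longer change.
is_terminal_state() {
  case "$1" in
    Succeeded|PartiallySucceeded|Failed|Canceled) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_build <resource-group> <template-name> [poll-seconds]
# Returns 0 only if the build ends in Succeeded.
wait_for_build() {
  local rg name interval state
  rg=$1
  name=$2
  interval=${3:-60}   # hypothetical default; tune to your build length
  while true; do
    state=$(az image builder show --resource-group "$rg" --name "$name" \
      --query "lastRunStatus.runState" -o tsv)
    echo "runState=$state"
    if is_terminal_state "$state"; then
      if [ "$state" = "Succeeded" ]; then return 0; else return 1; fi
    fi
    sleep "$interval"
  done
}
```

Usage would look like wait_for_build "$RG" aib-ubuntu-app && echo "image published", so a failed bake stops the pipeline before anything references the new version.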
3. Compute Gallery versioning, replication, and targeting
The Compute Gallery is the registry for your images. The hierarchy is gallery -> image definition -> image version. The definition is the immutable contract (OS type, generation, security type, publisher/offer/sku triple). Versions are the artifacts AIB writes into it.
GAL=galProd
az sig create --resource-group $RG --gallery-name $GAL
# Image definition. Hyper-V Gen2 + TrustedLaunchSupported is the modern
# default; it lets you boot the image on either Standard or TrustedLaunch VMs.
az sig image-definition create \
--resource-group $RG --gallery-name $GAL \
--gallery-image-definition ubuntu-app \
--publisher kloudvin --offer ubuntu --sku app-jammy \
--os-type Linux --os-state Generalized \
--hyper-v-generation V2 \
--features SecurityType=TrustedLaunchSupported
A few facts that bite people:
- Replication is what gates regional rollout. A scale set in westus3 only sees a new version once that version has finished replicating to westus3. This is the lever you use to stage rollouts geographically: replicate to a canary region first.
- storageAccountType: Standard_ZRS for the replica makes the image version zone-redundant, so a single-zone storage outage cannot block instance creation. Use it for any production gallery in a region with zones.
- excludeFromLatest on a version removes it from latest resolution. Automatic OS upgrade will not roll out an excluded version, which makes this your kill switch for a bad bake.
# Demote a bad version without deleting it; latest then resolves to the
# newest version that is not excluded.
az sig image-version update \
--resource-group $RG --gallery-name $GAL \
--gallery-image-definition ubuntu-app \
--gallery-image-version 1.4.0 \
--set publishingProfile.excludeFromLatest=true
When the scale set should always track the newest version, reference the definition without a version. That /latest-style reference (omit the version segment) is exactly what automatic OS upgrade keys off.
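To reason about what a version-less reference will pick after you exclude a version, it can help to model the resolution locally. This is a sketch only: it assumes latest means the highest version number that is not excluded, it ignores regional replication state, and resolve_latest is a hypothetical helper, not an Azure command.

```shell
# Sketch: model how a version-less gallery reference might resolve.
# Input on stdin: one "<version> <excludeFromLatest>" pair per line.
# Assumption: "latest" = highest version number that is not excluded.
# Requires a sort that supports -V (GNU coreutils).
resolve_latest() {
  awk 'tolower($2) != "true" { print $1 }' | sort -V | tail -n 1
}
```

You could feed it from az sig image-version list with a --query that emits the name and publishingProfile.excludeFromLatest columns as tsv. The match on the second column is case-insensitive so either True or true is treated as excluded.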
4. Automatic OS image upgrades and rolling-upgrade health policies
There are two distinct, separately-configured knobs that people constantly conflate:
- Upgrade policy mode (Manual/Automatic/Rolling) controls what happens to existing instances when you change the scale set model.
- automaticOSUpgradePolicy.enableAutomaticOSUpgrade controls what happens when the image publisher (or your gallery) ships a new version. It does not use the mode; it always uses the rolling upgrade policy settings.
Heads-up on lifecycle: automatic OS image upgrade for VMSS Flex is in preview (it has been GA for Uniform for years). For Flex, the instance image version must be set to latest, the Application Health extension version on the instance must match the model, and, importantly, MaxSurge cannot be combined with automatic OS upgrade on Flex. Validate it in non-prod and confirm current regional availability before you depend on it in production.
The rolling upgrade policy is the safety envelope. These are the real fields and their meanings:
| Field (az flag) | Meaning |
|---|---|
| maxBatchInstancePercent (--max-batch-instance-percent) | Max % of instances upgraded in one batch |
| maxUnhealthyInstancePercent (--max-unhealthy-instance-percent) | If more than this % of the whole set is unhealthy, the upgrade halts |
| maxUnhealthyUpgradedInstancePercent (--max-unhealthy-upgraded-instance-percent) | If more than this % of already-upgraded instances go unhealthy, the upgrade is cancelled |
| pauseTimeBetweenBatches (--pause-time-between-batches) | ISO-8601 soak time between batches, e.g. PT2M |
| prioritizeUnhealthyInstances (--prioritize-unhealthy-instances) | Upgrade already-unhealthy instances first |
| maxSurge (--max-surge) | Create new instances before deleting old (Uniform; not with auto-OS on Flex) |
Set the policy and enable automatic OS upgrade:
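# 1) Tighten the rolling envelope and switch to Rolling mode.
# Run this only after the Application Health extension (section 5)
# is installed and reporting healthy.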
az vmss update \
--resource-group $RG --name vmss-app \
--set upgradePolicy.mode=Rolling \
--max-batch-instance-percent 20 \
--max-unhealthy-instance-percent 20 \
--max-unhealthy-upgraded-instance-percent 20 \
--prioritize-unhealthy-instances true \
--pause-time-between-batches PT2M
# 2) Enable automatic OS image upgrade (keys off the gallery 'latest').
az vmss update \
--resource-group $RG --name vmss-app \
--enable-auto-os-upgrade true
The orchestrator never upgrades more than 20% of the set at once, waits up to 5 minutes for each upgraded instance to report healthy, and restores the previous OS disk if an instance does not recover in time. If overall unhealthy instances cross your threshold mid-flight – even from unrelated maintenance – it stops at the end of the current batch.
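The envelope is easier to tune once you see how many batches it implies. The sketch below does that arithmetic for a fully healthy set. The rounding rule (round down, minimum one instance per batch) and the function names are assumptions for planning purposes, not documented platform behaviour.

```shell
# Sketch: rough rolling-upgrade plan for a healthy scale set.
# Assumptions: batch size is floor(instances * percent / 100) with a
# minimum of one instance, and every batch passes health first time.

# batch_size <instances> <maxBatchInstancePercent>
batch_size() {
  local size
  size=$(( $1 * $2 / 100 ))
  if [ "$size" -lt 1 ]; then size=1; fi
  echo "$size"
}

# batch_count <instances> <maxBatchInstancePercent>
batch_count() {
  local size
  size=$(batch_size "$1" "$2")
  echo $(( ($1 + size - 1) / size ))   # ceiling division
}

# min_soak_minutes <instances> <percent> <pause-minutes>
# Pauses sit between batches, so there is one fewer pause than batches.
min_soak_minutes() {
  local n
  n=$(batch_count "$1" "$2")
  echo $(( (n - 1) * $3 ))
}
```

Under these assumptions, a 30-instance set at 20% with PT2M pauses runs as 5 batches of 6 with at least 8 minutes of pure soak time, before counting the time each batch takes to reimage and report healthy. A 3-instance set at the same settings degrades to one instance per batch.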
5. Application Health extension and graceful instance replacement
A rolling upgrade is only as safe as its health signal. With Flexible orchestration and a rolling policy, the Application Health extension is required – there is no load-balancer-probe fallback the way there is for Uniform. The platform uses this signal to decide whether a freshly-upgraded instance is healthy before touching the next batch.
Critical constraint: a scale set may have exactly one health source. If you have both an Application Health extension and a Load Balancer health probe configured, you must remove one before orchestration features (automatic OS upgrade, instance repair) will work.
Add the extension to the model. It probes a local endpoint your app owns – make that endpoint mean “I can actually serve traffic,” not just “the process is up.”
az vmss extension set \
--resource-group $RG --vmss-name vmss-app \
--name ApplicationHealthLinux \
--publisher Microsoft.ManagedServices \
--version 2.0 \
--settings '{
"protocol": "http",
"port": 8080,
"requestPath": "/healthz",
"intervalInSeconds": 5,
"numberOfProbes": 1,
"gracePeriod": 600
}'
# Make sure the extension change is rolled to existing instances.
az vmss update-instances \
--resource-group $RG --name vmss-app --instance-ids '*'
Pair this with automatic instance repair, which uses the same health signal to replace an instance that stays unhealthy outside of any upgrade. The grace period must be long enough to cover boot plus app warm-up, or you will fight a repair loop.
az vmss update \
--resource-group $RG --name vmss-app \
--enable-automatic-repairs true \
--automatic-repairs-grace-period 30
6. Autoscale rules, predictive scaling, and scale-in protection
Scaling is configured against the scale set as the target resource. Build rules on a real saturation signal, and always pair a scale-out rule with a scale-in rule plus a cooldown so you do not flap.
az monitor autoscale create \
--resource-group $RG \
--resource vmss-app \
--resource-type Microsoft.Compute/virtualMachineScaleSets \
--name autoscale-app \
--min-count 3 --max-count 30 --count 3
# Scale out on sustained CPU, scale in conservatively.
az monitor autoscale rule create \
--resource-group $RG --autoscale-name autoscale-app \
--condition "Percentage CPU > 70 avg 5m" \
--scale out 2 --cooldown 5
az monitor autoscale rule create \
--resource-group $RG --autoscale-name autoscale-app \
--condition "Percentage CPU < 30 avg 10m" \
--scale in 1 --cooldown 10
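A quick way to sanity-check a rule pair for flapping is to project the average CPU after a scale-in and compare it with the scale-out threshold. The sketch below assumes total load stays constant and spreads evenly across the remaining instances; projected_cpu_after_scale_in is a hypothetical planning helper, not part of the Azure CLI.

```shell
# Sketch: project average CPU after a scale-in step.
# Assumption: total load is constant and spreads evenly.

# projected_cpu_after_scale_in <instances> <scale-in-step> <avg-cpu-now>
# Prints the projected average CPU (integer percent, rounded down).
projected_cpu_after_scale_in() {
  local n step cpu remaining
  n=$1
  step=$2
  cpu=$3
  remaining=$(( n - step ))
  if [ "$remaining" -lt 1 ]; then
    echo "scale-in step would remove every instance" >&2
    return 1
  fi
  echo $(( cpu * n / remaining ))
}
```

With the rules above and a floor of 3 instances, the tightest case is 4 instances at 29% scaling in to 3, which projects to about 38%, well under the 70% scale-out threshold. A pair like "scale in below 40, scale out above 70" on 2 instances would project 39% to 78% and immediately trigger a scale-out again.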
Two refinements that separate a production config from a demo:
- Predictive autoscale. For workloads with a daily or weekly shape, enable predictive scaling so Azure provisions ahead of a forecasted spike instead of chasing it. Run it in ForecastOnly first to validate the model against reality, then switch to Enabled.
- Scale-in protection. A long-running job on an instance should not be killed by a scale-in event. Apply instance-level protection so autoscale picks a different victim:
az vmss update \
--resource-group $RG --name vmss-app \
--instance-id 3 \
--protect-from-scale-in true \
--protect-from-scale-set-actions false
7. Spot instances, eviction handling, and mixed capacity
Flexible orchestration unlocks Spot Priority Mix (GA for Flex), which runs a guaranteed floor of regular VMs alongside Spot VMs in one scale set. You set a base count of regular instances that is never evicted, plus a percentage of regular instances among everything above that base. The rest are Spot, evicted (and optionally deallocated) when Azure reclaims capacity.
# Floor of 3 regular VMs; above that, 50% regular / 50% Spot.
# Eviction policy 'Deallocate' keeps the disk so the instance can return.
az vmss create \
--resource-group $RG --name vmss-batch \
--orchestration-mode flexible \
--platform-fault-domain-count 1 \
--instance-count 10 \
--vm-sku Standard_D4as_v5 \
--image Ubuntu2204 \
--priority Spot \
--eviction-policy Deallocate \
--regular-priority-count 3 \
--regular-priority-percentage 50 \
--single-placement-group false
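The base-plus-percentage split is easy to misread, so here is the arithmetic as a small sketch. How the platform rounds an odd remainder is an assumption here (this version rounds the regular share down), and spot_mix is a hypothetical helper for capacity planning only.

```shell
# Sketch: expected regular/Spot split under Spot Priority Mix.
# Assumption: the regular share above the base rounds down.

# spot_mix <total> <base-regular-count> <regular-percentage-above-base>
# Prints "<regular> <spot>".
spot_mix() {
  local total base pct above regular_above
  total=$1
  base=$2
  pct=$3
  if [ "$total" -le "$base" ]; then
    echo "$total 0"          # everything fits inside the regular floor
    return 0
  fi
  above=$(( total - base ))
  regular_above=$(( above * pct / 100 ))
  echo "$(( base + regular_above )) $(( above - regular_above ))"
}
```

For example, spot_mix 11 3 50 prints "7 4": three base regular VMs, then the remaining eight split evenly. For the 10-instance set above the remainder is seven, so the exact split depends on the platform's rounding; check the instance list after creation rather than trusting the estimate.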
Handle eviction gracefully from inside the instance. Spot eviction is delivered through Azure Scheduled Events on the Instance Metadata Service; poll it and drain on a Preempt signal.
# Poll IMDS for a Preempt event and drain before the 30s window closes.
curl -s -H "Metadata:true" \
"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" \
| jq '.Events[] | select(.EventType=="Preempt")'
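A one-shot curl only tells you about this instant, so in practice the check runs in a loop. The sketch below polls the same endpoint and calls a drain hook when a Preempt event appears. The one-second default interval and the drain command you pass in are assumptions; the drain logic itself is application-specific and not shown.

```shell
# Sketch: poll Scheduled Events and run a drain hook on Preempt.
# Requires curl and jq on the instance.
IMDS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Reads a Scheduled Events document on stdin, prints Preempt EventIds.
preempt_event_ids() {
  jq -r '.Events[]? | select(.EventType == "Preempt") | .EventId'
}

# watch_for_preempt <drain-command> [poll-seconds]
# <drain-command> is a hypothetical script or function that stops
# accepting work and flushes state.
watch_for_preempt() {
  local drain_cmd interval ids
  drain_cmd=$1
  interval=${2:-1}
  while true; do
    ids=$(curl -s -H "Metadata:true" "$IMDS_URL" | preempt_event_ids)
    if [ -n "$ids" ]; then
      echo "Preempt scheduled: $ids"
      "$drain_cmd"
      return 0
    fi
    sleep "$interval"
  done
}
```

Run it as a systemd service baked into the image so it is present on every Spot instance from first boot. Keep the drain hook short: the notice window is only about 30 seconds.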
Spot is for interruptible work: batch, CI runners, stateless stream processors, dev. The base regular count is your insurance that core capacity survives a region-wide Spot reclamation. For a tier-1 synchronous API, keep Spot out of the path or set the base high enough to carry full load alone.
Enterprise scenario
A payments platform team ran a tier-1 authorization service on a Uniform VMSS with a Marketplace Ubuntu image and a 350-line cloud-init. Two problems converged. First, their security team mandated a CIS-hardened, monthly-patched base image with a sub-90-second boot SLO; cloud-init at boot took 4-6 minutes and was non-deterministic under load. Second, a routine kernel CVE patch the previous quarter had been pushed by re-running cloud-init across the fleet, a bad script slipped through, and ~30% of instances came back unable to reach the HSM. There was no health gate and no rollback – the bad config rolled to the entire set before anyone noticed, and recovery was a frantic manual reimage.
They rebuilt on Flexible orchestration with an AIB + Compute Gallery pipeline. The CIS baseline and all packages moved into the image (boot dropped to ~50 seconds). The Application Health extension probed a /ready endpoint that returned 200 only after a successful test connection to the HSM – so “healthy” meant “can actually authorize.” Crucially, they kept maxUnhealthyUpgradedInstancePercent tight and replicated each new gallery version to a single canary region first.
The next CVE patch told the story. A new image version went to the canary region; rolling upgrade replaced the first 20% batch; the new instances failed the HSM connectivity probe; the orchestrator restored their previous OS disks and halted at MaxUnhealthyUpgradedInstancePercentExceededInRollingUpgrade after the first batch. Blast radius: a handful of instances in one region, auto-rolled-back, zero customer impact. The fix (a firewall rule the new baseline had tightened) shipped as 1.7.1; 1.7.0 was demoted with excludeFromLatest rather than deleted.
The health endpoint design was the whole game – a probe that only checks the process would have happily greenlit the broken batch:
az vmss extension set \
--resource-group rg-pay-prod --vmss-name vmss-authz \
--name ApplicationHealthLinux \
--publisher Microsoft.ManagedServices --version 2.0 \
--settings '{
"protocol": "https",
"port": 8443,
"requestPath": "/ready",
"intervalInSeconds": 5,
"gracePeriod": 900
}'
Verify
Confirm the pipeline and the rollout machinery end to end.
# Image was published and replicated to every target region.
az sig image-version show \
--resource-group $RG --gallery-name galProd \
--gallery-image-definition ubuntu-app \
--gallery-image-version 1.7.1 \
--query "{state:provisioningState, regions:publishingProfile.targetRegions[].name}" -o jsonc
# Upgrade policy and auto-OS-upgrade are set as intended.
az vmss show -g $RG -n vmss-app \
--query "{mode:upgradePolicy.mode, autoOS:upgradePolicy.automaticOSUpgradePolicy.enableAutomaticOSUpgrade, rolling:upgradePolicy.rollingUpgradePolicy}" -o jsonc
# Every instance reports a health state (not just a power state).
az vmss get-instance-view -g $RG -n vmss-app \
--query "statuses" -o table
# Rolling upgrade progress / last result.
az vmss rolling-upgrade get-latest -g $RG -n vmss-app -o jsonc
# History of the last automatic OS upgrades.
az vmss get-os-upgrade-history -g $RG -n vmss-app -o table
Healthy looks like: gallery version Succeeded and present in all target regions; mode: Rolling with enableAutomaticOSUpgrade: true; every instance carrying a HealthState/Healthy status; and runningStatus.code of Completed with failedInstanceCount: 0.
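If you want a pipeline gate rather than a visual check, the last condition can be turned into an exit code. This is a sketch: it assumes failedInstanceCount sits under a progress object in the get-latest output, and rolling_upgrade_ok is a hypothetical helper name.

```shell
# Sketch: turn the rolling-upgrade result into a pass/fail exit code.
# Reads the JSON from `az vmss rolling-upgrade get-latest` on stdin.
# Assumption: failedInstanceCount lives under .progress in that output.
# Requires jq.
rolling_upgrade_ok() {
  jq -e '.runningStatus.code == "Completed"
         and .progress.failedInstanceCount == 0' >/dev/null
}
```

Usage would be along the lines of az vmss rolling-upgrade get-latest -g $RG -n vmss-app -o json | rolling_upgrade_ok || exit 1, which fails the stage on a halted, cancelled, or partially failed upgrade.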