Azure Update Manager: Maintenance Configurations, Scheduled Patching, and Hybrid Coverage with Arc

Patching is where good intentions go to die. Every estate I have inherited had a patch “strategy” that was really three strategies – a Windows team on WSUS, a Linux team running unattended-upgrades on a cron, and a cloud team hoping the images were recent enough. Nobody could answer the only question that matters at audit time: which machines are missing which CVEs right now, and when will they be patched? Azure Update Manager (AUM) is Microsoft’s answer, and unlike its predecessor it needs no Log Analytics workspace, no Automation account, and no agent of its own – it is a native VM platform capability that also reaches off-Azure through Azure Arc. This is how to wire it up end to end: assessment, on-demand remediation, recurring maintenance configurations, tag-driven dynamic scopes, pre/post automation, hybrid coverage, hotpatching, and the Azure Policy that keeps it all honest.

Update Manager has two planes, and confusing them is the single most common reason a patch program stalls. The data plane assesses and installs updates on a single machine on demand – it is a button you click. The scheduling plane – maintenance configurations – is what turns one-off actions into a governed, recurring program that runs at 02:00 on the third Sunday whether or not anyone is awake. Most teams treat AUM as a button to click rather than a schedule to declare, and so they never escape the cycle of manual, panic-driven, audit-deadline patching. The schedule is the product. Everything in this article builds toward a maintenance configuration that targets the right machines, at the right time, with the right reboot behaviour, across Azure and everything you run outside it.

By the end you will be able to put a heterogeneous fleet – Windows and Linux, Azure and on-prem and other-cloud – under one targeting model, prove a patch landed before you trust a schedule, decouple install from reboot so a restart window never blocks a security fix, and produce the single queryable compliance view an auditor actually wants. You will also know the half-dozen silent failure modes (the wrong patchMode, a dynamic scope that resolves to zero machines, a window 10 minutes too short) that make a run look successful while patching nothing, and the exact az/Resource Graph command to confirm each.

What problem this solves

In production, “we patch monthly” is a sentence with no evidence behind it. The pain is concrete: an auditor asks for the current CVE exposure of 900 servers and you cannot produce it without three spreadsheets and a week. A critical zero-day drops and you have no fleet-wide mechanism to assess who is exposed and remediate in a bounded window. A latency-sensitive database reboots at 14:00 because someone’s cron fired, and you take an outage during clinical hours. Each of these is a governance failure dressed up as a tooling failure, and each is exactly what AUM exists to remove.

What breaks without it: patching becomes per-team, per-OS, and per-cloud, so there is no single answer to “are we compliant?” New machines are born unpatched because onboarding into the patch program is manual and gets forgotten. Reboots are uncoordinated because install and restart are welded together. And the legacy answer – Automation Update Management on a Log Analytics workspace plus the MMA/OMS agent – reached end of support on 31 August 2024, so anything still depending on it is running on a retired stack with no security backstop.

Who hits this: anyone operating more than a handful of VMs, and acutely anyone with a hybrid or multicloud estate, a regulated workload with an audit obligation, or a tier that legally or contractually cannot reboot during business hours. The fix is almost never “patch harder by hand” – it is “declare one schedule, target it by tag, let the platform install at run-time, and report from one query.”

To frame the whole field before the deep dive, here is every capability this article covers, the production pain it removes, and the AUM construct that delivers it:

Capability | Production pain without it | AUM construct that delivers it | First place to look
Fleet assessment | No fleet-wide CVE exposure answer | On-demand + periodic assessment | patchassessmentresources in Resource Graph
Out-of-band remediation | Zero-day with no bounded fix mechanism | One-time install-patches run | Update Manager -> History
Recurring program | “We patch monthly” with no evidence | Maintenance configuration (InGuestPatch) | maintenanceresources
Targeting at scale | New machines forgotten at onboarding | Dynamic scopes (ARG tag queries) | resources ARG query on tags
Orchestration hooks | Uncoordinated drain/snapshot/validate | Pre/post events via Event Grid | Function/Logic App invocation logs
Hybrid + multicloud | Separate patch stack off-cloud | Arc-enabled servers | connectedmachine resources
Reboot control | Outage during business hours | rebootSetting + post-event reboot | installPatches.rebootSetting
Born-compliant | Drift the moment a VM is created | Policy DeployIfNotExists enrolment | Policy compliance blade

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand Azure resource basics: subscriptions, resource groups, tags, and how to run az in Cloud Shell and read JSON output. You should know what an Azure VM is and that VMs carry an osProfile with OS-specific configuration. Familiarity with Azure Policy effects (Audit, DeployIfNotExists, Modify) and a passing knowledge of Kusto (KQL) for Resource Graph queries will let you use the reporting sections directly. No prior exposure to the legacy Automation Update Management is required – if anything it is baggage.

This sits in the Governance & Operations track. It assumes the platform foundation from Azure Policy: Governance at Scale (the enforcement engine AUM leans on) and the targeting fundamentals from Azure Resource Hierarchy Explained (management groups are where you assign the policies). It pairs tightly with Azure Arc-Enabled Servers: Machine Configuration & Extended Security Updates, because Arc is what makes the hybrid story work, and with Azure Monitor & Application Insights for Observability for surfacing compliance in workbooks. If you orchestrate pre/post events with serverless, Azure Functions: Serverless Patterns is the layer those handlers live in.

A quick map of who owns what during a patch program, so you route work to the right team:

Layer | What lives here | Who usually owns it | Failure classes it can cause
Policy / governance | Enrolment, assessment enforcement, drift reporting | Platform / governance team | Machines never enrolled; remediation silently no-ops
Maintenance configuration | Window, cadence, reboot, classifications | Platform / ops team | Window too short; wrong reboot setting
Targeting (dynamic scope) | Tag vocabulary, ring design | Platform + app owners | Scope resolves to zero; wrong ring patched
Machine settings | patchMode, bypass, assessment mode | VM / app team | Run skipped; platform auto-patches instead
Update source | WSUS, distro repos, egress | Network / Linux / Windows teams | Assessment empty; install fails
Orchestration hooks | Drain, snapshot, validate handlers | App / SRE team | Un-drained run; pre-event timeout

Core concepts

Six mental models make every later decision obvious.

Two planes, one product. The data plane – az vm assess-patches and az vm install-patches – acts on one machine, now. The scheduling plane – a maintenance configuration of scope InGuestPatch – declares when to patch, what classifications, and how to reboot, then machines are associated to it. You use the data plane to learn and to prove; you use the scheduling plane to operate. A patch program that only ever uses the data plane is a person clicking buttons forever.

The orchestration mode is the master switch. Every machine has a patch orchestration mode (patchMode). For a maintenance configuration to install anything, the machine must be AutomaticByPlatform (Azure-orchestrated) and carry bypassPlatformSafetyChecksOnUserSchedule = true so the platform does not also apply its own automatic patches on Microsoft’s cadence and collide with your window. Get this wrong and your run is silently skipped – the single most common cause of “it ran but nothing happened.”

Targeting is a query, not a list. A dynamic scope attaches an Azure Resource Graph filter (over subscriptions, resource groups, locations, OS types, and – above all – tags) to a maintenance configuration. Membership is evaluated at run time, so a VM created an hour before the window, carrying the right tag, is patched with zero manual onboarding. Static assignments rot; dynamic scopes scale.

Arc makes off-Azure machines first-class. An Arc-enabled server – on-prem, in AWS, in GCP – is, to AUM, just another machine. It gets the same assessment, the same one-time runs, the same maintenance configurations and dynamic scopes. There is no per-machine charge for AUM on native Azure VMs; Arc-enabled servers carry a small per-server monthly charge. One targeting model spans every cloud.

Install and reboot are separable. rebootSetting has three values – IfRequired, Always, Never. Setting it to Never lets AUM install packages inside a window but restart nothing, so you can drive the reboot later through a controlled post-event – turning “no reboots during business hours” from a blocker into a scheduling detail. Hotpatching takes this further: on supported Windows Server SKUs, security updates apply without a reboot at all for two of every three months.

A window is a hard stop with a tax. A maintenance window has a duration (minimum 1 hour 30 minutes), and the platform reserves the last 10 minutes to finalize, so effective install time is duration - 10m. AUM stops starting new package installs once the window is exhausted; in-flight installs finish, but anything not yet started is deferred to the next window. Size the window for the slowest machine in the batch.
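The window arithmetic above can be sketched in a few lines of Python. This is only an illustration of the duration - 10m rule and the 01:30 minimum as stated in this article; the function names and the "slowest machine" estimate are hypothetical helpers, not part of any Azure SDK.

```python
from datetime import timedelta

MIN_WINDOW = timedelta(hours=1, minutes=30)   # minimum duration stated above
RESERVED_TAIL = timedelta(minutes=10)         # time the platform reserves to finalize


def parse_hhmm(duration: str) -> timedelta:
    """Parse an 'HH:mm' window duration such as '03:00'."""
    hours, minutes = duration.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def effective_install_time(duration: str) -> timedelta:
    """Return the time actually available for installs: duration - 10m."""
    window = parse_hhmm(duration)
    if window < MIN_WINDOW:
        raise ValueError(f"window {duration} is shorter than the 01:30 minimum")
    return window - RESERVED_TAIL


def fits(duration: str, slowest_machine_minutes: int) -> bool:
    """True if the slowest machine's install estimate fits inside the window."""
    return timedelta(minutes=slowest_machine_minutes) <= effective_install_time(duration)


if __name__ == "__main__":
    print(effective_install_time("03:00"))   # 2:50:00
    print(fits("01:30", 85))                 # False: only 80 minutes are usable
```

Feeding it your own measured install times per ring is a cheap way to catch the "window 10 minutes too short" failure before a run defers half its packages.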

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept | One-line definition | Where it lives | Why it matters
Assessment | Read-only scan of missing updates | Per machine; results in Resource Graph | Source of CVE exposure; never installs
One-time deployment | Ad-hoc install run | az vm install-patches | Out-of-band remediation, proving a patch lands
Maintenance configuration | The recurring schedule resource | Microsoft.Maintenance/maintenanceConfigurations | The product – when/what/how to patch
maintenanceScope | What the config governs | Config property | Must be InGuestPatch for guest OS patching
patchMode | Orchestration mode of the machine | osProfile patch settings | Must be AutomaticByPlatform or run is skipped
bypassPlatformSafetyChecks… | Suppress platform auto-patch | osProfile patch settings | Must be true so your schedule owns patching
Dynamic scope | ARG filter binding machines to a config | Configuration assignment | Membership by tag; scales onboarding to zero
Configuration assignment | The binding of a machine/scope to a config | Microsoft.Maintenance/configurationAssignments | Static (one machine) or dynamic (a query)
Pre/post event | Hook fired before/after the window | Event Grid on the config | Drain, snapshot, validate, controlled reboot
Arc-enabled server | Off-Azure machine projected into Azure | Microsoft.HybridCompute/machines | Same patch model off-cloud; billed per server
Hotpatching | Reboot-less OS security updates | OS profile (WS Azure Ed / WS 2025) | 4 reboots/yr instead of 12
rebootSetting | Reboot behaviour of a run | installPatches | Decouple install from restart
Classification | Update category to include | windowsParameters/linuxParameters | Critical/Security vs everything
Ring | A wave of the fleet with its own window | Tag value (PatchGroup) | Canary -> broad -> sensitive sequencing

Migrating off Automation Update Management

The legacy Automation Update Management (under an Automation account, backed by a Log Analytics workspace) is retired – it reached end of support on 31 August 2024, and the MMA/OMS agent it depended on was retired the same month. If any of your patch program still runs on it, migration is overdue, not optional. The two services differ in ways that matter operationally, and the table below is the translation map:

Concern | Automation Update Management (legacy) | Azure Update Manager | Migration action
Dependencies | Log Analytics workspace + Automation account | None – native to VM/Arc platform | Decommission workspace dependency for patching
Agent | Log Analytics agent (MMA/OMS) | No separate agent; VM/Arc extension framework | Remove MMA/OMS after migration
Scheduling | Automation schedules + Update Deployments | Maintenance configurations (InGuestPatch) | Recreate via the portal migration tool
Targeting | Saved searches / computer groups | Dynamic scopes (ARG on tags/sub/RG) | Re-express groups as tag filters
Off-Azure | Hybrid Runbook Worker | Arc-enabled servers | Onboard servers to Arc
Reporting | Log Analytics queries | Resource Graph (patchassessmentresources) | Rebuild queries/workbooks on ARG
Pre/post tasks | Pre/post scripts in the deployment | Event Grid pre/post events | Re-wire to Functions/Logic Apps
Cost model | Log Analytics ingestion + Automation | Free on Azure VMs; per-server on Arc | Re-baseline the bill

Use Microsoft’s portal migration experience and the supplied runbooks that recreate legacy schedules as maintenance configurations – do not hand-translate. The dynamic-scope mapping (turning saved searches into tag filters) is precisely the part teams get wrong by hand, and the tool reduces the error surface. The migration sequence that avoids a coverage gap:

# | Migration step | Why this order | Verify
1 | Inventory legacy schedules + groups | Know the target state before you move | Export Update Deployments list
2 | Register AUM resource providers | Prerequisite for any AUM action | az provider show = Registered
3 | Set patchMode/bypass on machines | Without it the new schedule no-ops | ARG over osProfile patch settings
4 | Run the portal migration tool | Recreates schedules as maint. configs | Configs exist in Microsoft.Maintenance
5 | Validate dynamic scopes resolve | Catch tag-mapping errors pre-cutover | ARG count matches legacy group size
6 | Run a canary window on one ring | Prove the new path before fleet cutover | maintenanceresources shows installs
7 | Disable legacy Update Deployments | Avoid double-patching during overlap | Legacy schedules disabled
8 | Remove MMA/OMS agent | Retired dependency, attack surface | Agent absent; assessment still returns

Register the resource providers AUM and its scheduling/policy surface depend on:

# Providers AUM and its scheduling/policy surface depend on
az provider register --namespace Microsoft.Maintenance
az provider register --namespace Microsoft.Compute
az provider register --namespace Microsoft.PolicyInsights
az provider register --namespace Microsoft.HybridCompute   # Arc-enabled servers

# Confirm they are Registered before doing anything else
az provider show --namespace Microsoft.Maintenance --query registrationState -o tsv

Patch orchestration mode: the master switch

Update Manager itself requires no enablement resource, but it does require that each machine’s update settings allow the platform to orchestrate. The property that controls this is patchMode on the OS profile, and it is the master switch behind every scheduled patch. For Windows the path is osProfile.windowsConfiguration.patchSettings.*; for Linux it is osProfile.linuxConfiguration.patchSettings.*. Here is every value and what it means operationally:

patchMode value | OS | Who patches | Works with maintenance config? | When to use
AutomaticByPlatform | Win + Linux | The platform, on your schedule (with bypass) | Yes – required for AUM scheduling | Any machine you want AUM to schedule
AutomaticByOS | Windows | Windows Update automatically | No (the OS owns timing) | Standalone auto-patch, no governance
Manual | Windows | You, by hand / your own tooling | No | Fully manual control
ImageDefault | Linux | The image’s default (e.g. unattended-upgrades) | No | Legacy/cron patching, not AUM

The crucial pairing is AutomaticByPlatform plus bypassPlatformSafetyChecksOnUserSchedule = true. The assessmentMode property is separate and controls scanning: set it to AutomaticByPlatform for continuous periodic assessment, or ImageDefault for on-demand only. The full setting matrix you must get right per machine:

Setting | Values | Default | What it controls | Set it to | Gotcha if wrong
patchMode | AutomaticByPlatform, AutomaticByOS, Manual, ImageDefault | varies by image | Who installs updates and when | AutomaticByPlatform | Any other value -> schedule installs nothing
bypassPlatformSafetyChecksOnUserSchedule | true / false | false | Suppress platform auto-patch so your schedule owns it | true | false -> platform patches on its own cadence, collides
assessmentMode | AutomaticByPlatform, ImageDefault | ImageDefault | Continuous vs on-demand scanning | AutomaticByPlatform | ImageDefault -> stale exposure data, no periodic scan
provisionVMAgent (Win) | true / false | true | VM agent present (prerequisite) | true | false -> no extension framework, no AUM
enableAutomaticUpdates (Win) | true / false | true | Windows Update service enabled | true (with platform mode) | false -> WU disabled, install can fail

Set the orchestration mode explicitly on an existing Linux VM, then hand control to your schedule:

# Put an existing Linux VM into customer-managed scheduled patching
az vm update \
  --resource-group rg-fleet-prod \
  --name vm-app-01 \
  --set osProfile.linuxConfiguration.patchSettings.patchMode=AutomaticByPlatform \
  --set osProfile.linuxConfiguration.patchSettings.assessmentMode=AutomaticByPlatform

# Hand control to YOUR maintenance schedule (suppress platform auto-patching)
az vm update \
  --resource-group rg-fleet-prod \
  --name vm-app-01 \
  --set osProfile.linuxConfiguration.patchSettings.automaticByPlatformSettings.bypassPlatformSafetyChecksOnUserSchedule=true

For Windows, swap to osProfile.windowsConfiguration.patchSettings.patchMode=AutomaticByPlatform. In Bicep, bake it into the VM definition so machines are born correct rather than reconciled later:

// VM born with platform orchestration + bypass so a maintenance config can drive it
properties: {
  osProfile: {
    linuxConfiguration: {
      patchSettings: {
        patchMode: 'AutomaticByPlatform'
        assessmentMode: 'AutomaticByPlatform'
        // bypass is nested under automaticByPlatformSettings, not directly under patchSettings
        automaticByPlatformSettings: {
          bypassPlatformSafetyChecksOnUserSchedule: true
        }
      }
    }
  }
}

Prove the whole fleet is set correctly with one Resource Graph query – never trust a schedule until this returns clean:

// Machines NOT correctly configured for AUM scheduling (the silent-failure hunt)
resources
| where type in~ ("microsoft.compute/virtualmachines", "microsoft.hybridcompute/machines")
| extend ps = properties.osProfile.linuxConfiguration.patchSettings
| extend pw = properties.osProfile.windowsConfiguration.patchSettings
| extend mode = tostring(coalesce(ps.patchMode, pw.patchMode))
| extend bypass = tobool(coalesce(
    ps.automaticByPlatformSettings.bypassPlatformSafetyChecksOnUserSchedule,
    pw.automaticByPlatformSettings.bypassPlatformSafetyChecksOnUserSchedule))
| where mode != "AutomaticByPlatform" or bypass != true
| project name, type, mode, bypass, resourceGroup
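For a single machine, the same check can be run against the JSON you already have in hand. The sketch below assumes a dict shaped like the output of az vm show -o json, with osProfile at the top level and the bypass flag nested under automaticByPlatformSettings as in the Bicep and KQL above; scheduling_problems is a hypothetical helper name, not an Azure API.

```python
def scheduling_problems(vm: dict) -> list[str]:
    """Return the reasons a maintenance configuration would skip this VM."""
    os_profile = vm.get("osProfile") or {}
    # Normally only one of these is populated, depending on the OS.
    config = (os_profile.get("linuxConfiguration")
              or os_profile.get("windowsConfiguration")
              or {})
    patch = config.get("patchSettings") or {}
    by_platform = patch.get("automaticByPlatformSettings") or {}

    problems = []
    if patch.get("patchMode") != "AutomaticByPlatform":
        problems.append(
            f"patchMode is {patch.get('patchMode')!r}, expected 'AutomaticByPlatform'")
    if by_platform.get("bypassPlatformSafetyChecksOnUserSchedule") is not True:
        problems.append("bypassPlatformSafetyChecksOnUserSchedule is not true")
    return problems


if __name__ == "__main__":
    # Illustrative input only: a Linux VM still on its image default.
    sample = {"osProfile": {"linuxConfiguration": {
        "patchSettings": {"patchMode": "ImageDefault"}}}}
    for reason in scheduling_problems(sample):
        print(reason)
```

An empty list means the machine is eligible for a customer-managed schedule; anything else is a reason the run would be skipped.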

On-demand assessment and one-time deployments

Before scheduling anything, learn what the fleet actually needs. Assessment is read-only: it queries each machine’s update source (Windows Update / WSUS for Windows; the distro package manager for Linux) and reports missing updates by classification and KB/package. It installs nothing. Run it on one machine:

# One-off assessment of a single machine (results land in Update Manager)
az vm assess-patches \
  --resource-group rg-fleet-prod \
  --name vm-app-01

Drive it at fleet scale and read the results out of Azure Resource Graph, the only sane way to query patch state across hundreds of machines. The patchassessmentresources table holds both the per-machine summary and the individual patches as child resources:

// Machines with pending CRITICAL or SECURITY updates, Azure VMs and Arc together
patchassessmentresources
| where type =~ "microsoft.compute/virtualmachines/patchassessmentresults/softwarepatches"
   or type =~ "microsoft.hybridcompute/machines/patchassessmentresults/softwarepatches"
| where properties.classifications has_any ("Critical", "Security")
| where properties.patchState =~ "Available"
| extend machine = tostring(split(id, "/")[8])
| summarize pendingUpdates = count() by machine, tostring(properties.classifications)
| order by pendingUpdates desc

When you need to remediate now – an out-of-band CVE – run a one-time deployment (an install run). Filter by classification, give it an explicit maximum duration in minutes, and choose a reboot policy:

# Install only Critical + Security updates, 120-minute window, reboot only if required
az vm install-patches \
  --resource-group rg-fleet-prod \
  --name vm-app-01 \
  --maximum-duration PT120M \
  --reboot-setting IfRequired \
  --classifications-to-include-linux Critical Security

For Windows, swap to --classifications-to-include-win and you can pin or block specific KBs. Every classification value, per OS, and when to include it:

Classification | OS | What it covers | Include in scheduled runs?
Critical | Win + Linux | Critical-severity fixes | Always
Security | Win + Linux | Security updates | Always
UpdateRollUp | Windows | Cumulative roll-ups | Usually
FeaturePack | Windows | New feature packages | Rarely – test first
ServicePack | Windows | Service packs | Rarely – test first
Definition | Windows | AV/defender definitions | Often (fast-moving)
Tools | Windows | Utilities | Optional
Updates | Windows | Non-security updates | Optional
Other | Linux | Distro “other” bucket | Optional

The az vm install-patches flags you will actually use, with their values and effect:

Flag | Values | Default | Effect | Gotcha
--maximum-duration | ISO 8601 e.g. PT120M | required | Hard stop on starting new installs | Size for the slowest machine; the final minutes are reserved
--reboot-setting | IfRequired, Always, Never | required | Reboot behaviour | Never decouples install from restart
--classifications-to-include-win | Windows classifications | none | What to install (Windows) | Empty = nothing installs
--classifications-to-include-linux | Linux classifications | none | What to install (Linux) | Empty = nothing installs
--kb-numbers-to-include | KB list | none | Pin specific KBs (Windows) | Added on top of the classification filter (union semantics)
--kb-numbers-to-exclude | KB list | none | Block known-bad KBs (Windows) | Exclusion wins even if a classification would include the KB
--package-name-masks-to-include | package name masks | none | Pin specific packages (Linux) | Distro-specific naming
--package-name-masks-to-exclude | package name masks | none | Block packages (Linux) | Use to hold back a problematic package
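If you drive one-time runs from a script, it helps to build the argument list in one place and refuse the combinations that install nothing. The following is a minimal sketch, assuming you hand the result to subprocess.run yourself; install_patches_args is a hypothetical helper, and it only emits flags already used in this article.

```python
def install_patches_args(resource_group, name, os_type, classifications,
                         maximum_duration="PT120M", reboot_setting="IfRequired",
                         kb_numbers_to_exclude=()):
    """Build the argv for a one-time `az vm install-patches` run."""
    if not classifications:
        # An empty classification list is the "ran but installed nothing" trap.
        raise ValueError("no classifications given: the run would install nothing")
    if reboot_setting not in ("IfRequired", "Always", "Never"):
        raise ValueError(f"unknown reboot setting: {reboot_setting}")
    flags = {
        "Linux": "--classifications-to-include-linux",
        "Windows": "--classifications-to-include-win",
    }
    if os_type not in flags:
        raise ValueError(f"unknown OS type: {os_type}")

    args = [
        "az", "vm", "install-patches",
        "--resource-group", resource_group,
        "--name", name,
        "--maximum-duration", maximum_duration,
        "--reboot-setting", reboot_setting,
        flags[os_type], *classifications,
    ]
    if kb_numbers_to_exclude:
        if os_type != "Windows":
            raise ValueError("KB exclusions only apply to Windows machines")
        args += ["--kb-numbers-to-exclude", *kb_numbers_to_exclude]
    return args


if __name__ == "__main__":
    print(" ".join(install_patches_args(
        "rg-fleet-prod", "vm-app-01", "Linux", ["Critical", "Security"])))
```

Passing the list to subprocess.run(args, check=True) avoids shell quoting problems, and the early ValueError turns a silent no-op into a loud failure.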

The --reboot-setting values decoded – this is the lever the whole reboot-control story hinges on:

--reboot-setting | Behaviour | Use when
IfRequired | Reboot only if an installed update needs it | The usual choice; balances currency and disruption
Always | Reboot after the run regardless | Force a clean state; maintenance windows that expect it
Never | Install, never restart | Restart-sensitive tiers; reboot driven by post-event later

After any install run, re-assess and confirm the pending count dropped – proving the data plane works before you trust a schedule. The result statuses you will see and what each means:

Run status | Meaning | Likely cause | Next step
Succeeded | All in-scope updates installed | n/a | Re-assess to confirm zero pending
CompletedWithWarnings | Some updates failed / pending reboot | A KB failed, or window cut it short | Inspect per-update detail; re-run
Failed | Run could not complete | Update source unreachable, agent issue | Check egress + agent; see playbook
InProgress | Still installing | Long window, large batch | Wait; do not start a second run
NotStarted / skipped | Run never began on the machine | patchMode wrong, machine off | Fix orchestration mode; power on

Maintenance configurations, schedules, and reboot settings

This is the heart of AUM. A maintenance configuration of scope InGuestPatch is a first-class Azure resource that declares when to patch, what classifications to include, and how to handle reboots. Machines are then associated to it – statically or, far better, via dynamic scopes (next section). Every field that matters, its format, default, and the trap in each:

Field | Format / values | Default | What it controls | Trap
maintenanceScope | InGuestPatch (for OS patching) | - | What the config governs | Wrong scope = not a guest-patch schedule
extensionProperties.InGuestPatchMode | User / Platform | - | Treats config as a user (AUM) schedule | Omit -> config ignored by AUM
maintenanceWindow.startDateTime | YYYY-MM-DD HH:mm | required | First window start | Local vs UTC confusion
maintenanceWindow.duration | HH:mm, min 01:30 | - | Window length | Last 10 min reserved; effective = duration - 10m
maintenanceWindow.timeZone | Windows TZ name (e.g. W. Europe Standard Time) | UTC | Window time zone | DST shifts the wall-clock window
maintenanceWindow.recurEvery | 1Day, 1Week, Month Third Sunday | - | Cadence | Monthly expression syntax is exact
installPatches.rebootSetting | IfRequired, Always, Never | IfRequired | Reboot behaviour | Never needs a post-event to ever reboot
installPatches.windowsParameters | classifications + KB include/exclude | - | Windows patch selection | Empty classifications = nothing installs
installPatches.linuxParameters | classifications + package include/exclude | - | Linux patch selection | Distro package naming

The recurEvery cadence expressions, with concrete examples:

Cadence intent | recurEvery expression | Notes
Every day | 1Day | Aggressive; rings/canaries
Every week | 1Week | Common for non-prod
Every N days | 7Days, 14Days | Numeric multiplier
Monthly, nth weekday | Month Third Sunday | “Patch Tuesday + a week” pattern
Monthly, last weekday | Month Last Saturday | End-of-month window
Monthly, specific day | Month day23 | Calendar-day cadence

Here is a production-grade monthly Windows configuration in Bicep – patching on the third Sunday at 02:00 UTC with a 3-hour window, blocking a known-bad KB:

resource patchMonthly 'Microsoft.Maintenance/maintenanceConfigurations@2023-10-01-preview' = {
  name: 'mc-win-prod-monthly'
  location: 'eastus2'
  properties: {
    maintenanceScope: 'InGuestPatch'
    extensionProperties: {
      // Required so the config is treated as a guest-patch (AUM) schedule
      InGuestPatchMode: 'User'
    }
    maintenanceWindow: {
      startDateTime: '2026-06-21 02:00'
      duration: '03:00'
      timeZone: 'UTC'
      recurEvery: 'Month Third Sunday'
    }
    installPatches: {
      rebootSetting: 'IfRequired'
      windowsParameters: {
        classificationsToInclude: [
          'Critical'
          'Security'
          'UpdateRollUp'
        ]
        kbNumbersToExclude: [
          'KB5099999'   // block a known-bad KB until validated
        ]
      }
    }
  }
}
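A startDateTime that does not actually fall on the day the recurEvery expression names is an easy mistake to make by eye. The sketch below computes the calendar date for an "nth weekday" or "last weekday" of a month, so you can check the start date before deploying; monthly_occurrence is a hypothetical helper, not something AUM exposes.

```python
import calendar
from datetime import date

# date.weekday() numbers Monday as 0 through Sunday as 6.
WEEKDAYS = {name: index for index, name in enumerate(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])}
ORDINALS = {"First": 1, "Second": 2, "Third": 3, "Fourth": 4}


def monthly_occurrence(year: int, month: int, ordinal: str, weekday: str) -> date:
    """Date of, for example, the Third Sunday or the Last Saturday of a month."""
    matching = [
        day for day in calendar.Calendar().itermonthdates(year, month)
        if day.month == month and day.weekday() == WEEKDAYS[weekday]
    ]
    if ordinal == "Last":
        return matching[-1]
    return matching[ORDINALS[ordinal] - 1]


if __name__ == "__main__":
    print(monthly_occurrence(2026, 6, "Third", "Sunday"))    # 2026-06-21
    print(monthly_occurrence(2026, 6, "Last", "Saturday"))   # 2026-06-27
```

The first result matches the startDateTime in the Bicep above, which is the point: the first window and the recurrence agree.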

Deploy it and you have a schedule with no machines yet – intentionally. Never staple machines into a config by hand at scale; declare the intent (what it patches, when) and let dynamic scopes decide membership. To associate a single machine explicitly when you must, use a configuration assignment:

az maintenance assignment create \
  --resource-group rg-fleet-prod \
  --resource-name vm-app-01 \
  --resource-type virtualMachines \
  --provider-name Microsoft.Compute \
  --configuration-assignment-name assign-vm-app-01 \
  --maintenance-configuration-id "/subscriptions/<sub>/resourceGroups/rg-fleet-prod/providers/Microsoft.Maintenance/maintenanceConfigurations/mc-win-prod-monthly"

Static assignment versus dynamic scope – when to reach for each:

Dimension | Static assignment | Dynamic scope
Membership | One named machine | ARG query (tags/sub/RG/OS)
New-machine onboarding | Manual, per machine | Automatic on next window
Scale | Tens of machines | Hundreds to thousands
Drift risk | High – forgotten machines | Low – query-driven
Reproducibility | Per-resource assignment | One scope in IaC
When to use | A one-off exception | The fleet, every ring

The most common silent failure: the VM’s patchMode is not AutomaticByPlatform, or bypassPlatformSafetyChecksOnUserSchedule is false. The maintenance run is skipped with a status that looks benign. Reconcile machine settings to the schedule before you trust the schedule – run the ARG hunt from the previous section.

Dynamic scopes and tag-based targeting at scale

A dynamic scope attaches an Azure Resource Graph filter to a maintenance configuration. Membership is evaluated at run time, so a newly created VM that carries the right tag is patched on the next window with zero manual onboarding. This is the difference between a patch program that scales and one that rots. Filters are expressed over several dimensions; the one you will lean on is tags:

Filter dimension | Example value | Operator semantics | Notes
Subscriptions | /subscriptions/<id> | In-list | Scope across multiple subs
Resource groups | rg-fleet-prod | In-list | Narrow to an RG
Resource types | microsoft.compute/virtualmachines | In-list | VMs vs Arc machines
Locations | eastus2, centralus | In-list | Region-bounded windows
OS types | Windows, Linux | In-list | Per-OS configs
Tags | PatchGroup=ring1 | All (AND) or Any (OR) | The primary targeting axis
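The All versus Any operator on the tag filter is worth internalising, because it decides whether a second tag narrows or widens the scope. Here is a small local model of that semantics, assuming exact, case-sensitive value matching (check the real behaviour against your own scope before relying on it); matches_tag_filter is a hypothetical function, not part of the platform.

```python
def matches_tag_filter(machine_tags: dict, filter_tags: dict, operator: str = "All") -> bool:
    """Model a dynamic scope tag filter.

    filter_tags maps a tag name to its accepted values,
    for example {"PatchGroup": ["ring1"]}.
    """
    hits = [machine_tags.get(key) in values for key, values in filter_tags.items()]
    if operator == "All":
        return all(hits)   # every tag condition must hold (AND)
    if operator == "Any":
        return any(hits)   # one matching tag condition is enough (OR)
    raise ValueError(f"unknown operator: {operator}")


if __name__ == "__main__":
    vm_tags = {"PatchGroup": "ring1", "Environment": "prod"}
    scope = {"PatchGroup": ["ring1"], "Environment": ["nonprod"]}
    print(matches_tag_filter(vm_tags, scope, "All"))   # False
    print(matches_tag_filter(vm_tags, scope, "Any"))   # True
```

The same machine is out of scope under All and in scope under Any, which is exactly how a prod server ends up in a nonprod window when the operator is set carelessly.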

Define a small, governed tag vocabulary up front and bind one configuration per ring. The recommended vocabulary:

Tag | Values | Purpose | Audit note
PatchGroup | ring0, ring1, ring2, exempt | Wave / ring membership | exempt routes to no config – audited separately
Environment | prod, nonprod, dev | Environment dimension | Combine with ring for prod-only windows
OSFamily | windows, linux | Optional OS hint | Redundant with OS-type filter but readable
Owner | team alias | Accountability | Who to call when a ring fails

Attach a dynamic scope binding all machines tagged PatchGroup=ring1 in chosen subscriptions and regions:

# Attach a dynamic scope: all machines tagged PatchGroup=ring1
az maintenance assignment create-or-update-subscription \
  --maintenance-configuration-id "/subscriptions/<sub>/resourceGroups/rg-fleet-prod/providers/Microsoft.Maintenance/maintenanceConfigurations/mc-win-prod-monthly" \
  --configuration-assignment-name "scope-ring1" \
  --filter-tags '{"PatchGroup":["ring1"]}' \
  --filter-tags-operator "All" \
  --filter-os-types "Windows" \
  --filter-locations "eastus2" "centralus"

The CLI surface for dynamic scopes has churned across versions; many teams declare scopes in Bicep/ARM alongside the configuration so they are reproducible. The decision that matters is not syntax, it is ring design. The reference ring model:

Ring | Tag | Membership | Window timing | Reboot posture | Risk tolerance
ring0 (canary) | PatchGroup=ring0 | Build agents, a few non-critical app servers | Earliest (e.g. 1st Sat) | IfRequired | Highest – catch bad patches here
ring1 (broad) | PatchGroup=ring1 | The bulk of the fleet | A few days after ring0 | IfRequired | Medium
ring2 (sensitive) | PatchGroup=ring2 | Latency/availability-critical tier | Last; off-hours | Never + post-event reboot | Lowest
exempt | PatchGroup=exempt | Deliberate, time-boxed exceptions | None | n/a | Audited separately

Validate that a scope resolves to the machines you expect before the window fires, using a Resource Graph query that mirrors the scope’s filter – a scope that resolves to zero is one of the most common causes of “the schedule ran but nothing happened”:

resources
| where type in~ ("microsoft.compute/virtualmachines", "microsoft.hybridcompute/machines")
| where tags["PatchGroup"] =~ "ring1"
| project name, type, location, resourceGroup, subscriptionId

A scope-design decision table – match the symptom to the cause:

| If you see… | It’s probably… | Do this |
| --- | --- | --- |
| Scope resolves to 0 machines | Tag typo or wrong sub/region in the filter | Run the ARG query manually; fix the filter |
| A machine patched by the wrong ring | Two configs both match its tags | Make tags mutually exclusive; one ring per machine |
| New VM not patched on first window | Tag applied after the window evaluated, or missing | Tag at create time (IaC/Policy), confirm before window |
| exempt machine still patched | A broad scope (e.g. by RG) overrides the tag | Exclude exempt explicitly or avoid RG-wide scopes |
| Arc machine not in scope | Resource type filter excludes hybridcompute | Include microsoft.hybridcompute/machines |
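Two of the rows above (a machine matched by two configurations, and an exempt machine caught by a broad scope) are easy to detect mechanically. The following is a rough sketch under stated assumptions: scopes are modelled as plain tag filters with "All" semantics, an empty filter stands in for an RG-wide scope, and the configuration names, `find_overlaps`, and `exempt_but_matched` are all hypothetical.

```python
def matches(tags, tag_filter):
    """A machine matches when every filtered key holds one of the allowed values.

    An empty filter models an RG-wide scope and therefore matches everything.
    """
    return all(tags.get(key) in allowed for key, allowed in tag_filter.items())


def find_overlaps(machines, scopes):
    """Map each machine matched by more than one configuration to those configurations."""
    overlaps = {}
    for name, tags in machines.items():
        hits = sorted(cfg for cfg, f in scopes.items() if matches(tags, f))
        if len(hits) > 1:
            overlaps[name] = hits
    return overlaps


def exempt_but_matched(machines, scopes):
    """Map each machine tagged exempt to any configurations that would still patch it."""
    result = {}
    for name, tags in machines.items():
        if tags.get("PatchGroup") != "exempt":
            continue
        hits = sorted(cfg for cfg, f in scopes.items() if matches(tags, f))
        if hits:
            result[name] = hits
    return result


# Invented sample data.
machines = {
    "vm-app-01": {"PatchGroup": "ring1"},
    "vm-records-01": {"PatchGroup": "exempt"},
    "vm-build-01": {"PatchGroup": "ring0"},
}
scopes = {
    "mc-ring0": {"PatchGroup": ["ring0"]},
    "mc-ring1": {"PatchGroup": ["ring1"]},
    "mc-rg-wide": {},  # no tag filter: catches everything in its resource group
}

print(find_overlaps(machines, scopes))
print(exempt_but_matched(machines, scopes))
```

Both checks return empty dictionaries once tags are mutually exclusive and the RG-wide scope is removed, which makes them convenient assertions in a deployment pipeline.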

Pre and post maintenance events with automation hooks

Patching is rarely just patching. You drain a load balancer first, you quiesce a database, you snapshot a disk, you re-run smoke tests after. AUM exposes this through pre and post maintenance events delivered via Event Grid on the maintenance configuration. A pre-event fires before the window starts; a post-event after it completes. Subscribe an Azure Function, Logic App, Automation runbook, or webhook and you have orchestration hooks without bolting on a separate scheduler. The two event types and their contract:

| Event type | Fires | Bounded by | Run proceeds when | Use it for |
| --- | --- | --- | --- | --- |
| Microsoft.Maintenance.PreMaintenanceEvent | Before the window | ~20-minute pre-window | Handler completes or times out | Drain node, snapshot, quiesce DB |
| Microsoft.Maintenance.PostMaintenanceEvent | After the window (post-completion) | n/a | n/a | Validate health, queue/release reboots |

The handler options, and when each fits:

| Endpoint type | Latency | State / orchestration | Best for |
| --- | --- | --- | --- |
| Azure Function | Low | Stateless (or Durable for long flows) | Fast drain/snapshot triggers, validation |
| Logic App | Medium | Visual, connectors, stateful | Multi-step approvals, ticketing integration |
| Automation runbook | Medium | PowerShell/Python, hybrid worker | Existing runbook estates, on-prem actions |
| Webhook | Low | Whatever you build | Custom orchestrators, ChatOps |

Wire a pre-event to a Function that cordons machines and a post-event that validates health:

# Pre-maintenance event -> Function App endpoint
az eventgrid event-subscription create \
  --name pre-maint-drain \
  --source-resource-id "/subscriptions/<sub>/resourceGroups/rg-fleet-prod/providers/Microsoft.Maintenance/maintenanceConfigurations/mc-win-prod-monthly" \
  --endpoint-type azurefunction \
  --endpoint "/subscriptions/<sub>/resourceGroups/rg-ops/providers/Microsoft.Web/sites/fn-patch-orchestrator/functions/PreMaintenanceDrain" \
  --included-event-types Microsoft.Maintenance.PreMaintenanceEvent

# Post-maintenance event -> validation Function
az eventgrid event-subscription create \
  --name post-maint-validate \
  --source-resource-id "/subscriptions/<sub>/resourceGroups/rg-fleet-prod/providers/Microsoft.Maintenance/maintenanceConfigurations/mc-win-prod-monthly" \
  --endpoint-type azurefunction \
  --endpoint "/subscriptions/<sub>/resourceGroups/rg-ops/providers/Microsoft.Web/sites/fn-patch-orchestrator/functions/PostMaintenanceValidate" \
  --included-event-types Microsoft.Maintenance.PostMaintenanceEvent

The contract to internalize: the pre-event handler runs inside a bounded pre-window (roughly 20 minutes), and the maintenance run proceeds when the handler completes or times out. Keep handlers idempotent and fast – this is the wrong place for a 30-minute backup. Use the pre-event to trigger the operation (kick off a snapshot, drain a node) and let the long-running work happen asynchronously, with the post-event reconciling state. Good versus bad handler patterns:

| Handler concern | Do | Don’t | Why |
| --- | --- | --- | --- |
| Duration | Trigger async work, return fast | Run a 30-min backup inline | Pre-window is ~20 min; you will time out |
| Idempotency | Make re-runs safe (no double-drain) | Assume exactly-once delivery | Event Grid can redeliver |
| Failure handling | Fail closed for safety-critical drains | Swallow errors silently | A silent failure patches an un-drained node |
| Long work | Snapshot kicked off, reconciled in post | Block the pre-event on completion | Decouple trigger from completion |
| Reboot control | Queue reboots in post, release in window | Reboot mid-business-hours | Post-event is where controlled reboot lives |
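The first two rows (return fast, survive redelivery) can be sketched in a few lines. This is an illustration of the pattern only, not an Azure Functions binding: `PreMaintenanceHandler` is a hypothetical class, the in-memory set and queue stand in for a durable store and a real work queue, and the sample event carries only the standard Event Grid fields `id`, `eventType`, and `subject`.

```python
import queue

PRE_EVENT = "Microsoft.Maintenance.PreMaintenanceEvent"


class PreMaintenanceHandler:
    """Accept a pre-maintenance event once, enqueue the slow work, and return fast."""

    def __init__(self):
        self.seen_event_ids = set()      # stand-in for a durable dedupe store
        self.work_queue = queue.Queue()  # stand-in for a real async work queue

    def handle(self, event):
        if event.get("eventType") != PRE_EVENT:
            return "ignored"
        event_id = event["id"]
        if event_id in self.seen_event_ids:
            # Redelivery of an event already accepted: do not drain twice.
            return "duplicate-ignored"
        self.seen_event_ids.add(event_id)
        # Only enqueue here; the drain or snapshot itself runs elsewhere.
        self.work_queue.put({"action": "drain", "subject": event.get("subject")})
        return "accepted"


handler = PreMaintenanceHandler()
sample = {"id": "evt-001", "eventType": PRE_EVENT, "subject": "mc-win-prod-monthly"}

print(handler.handle(sample))            # first delivery
print(handler.handle(sample))            # redelivery of the same event
print(handler.work_queue.qsize())        # drain enqueued exactly once
```

In a real handler the dedupe record has to survive restarts, so the set would be replaced by a table or cache keyed on the event id.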

Hybrid and multicloud patching via Azure Arc-enabled servers

Update Manager’s real leverage is that an Arc-enabled server is, to AUM, just another machine. Onboard a server in your datacenter, in AWS, or in GCP, and it gets the same assessment, the same one-time deployments, the same maintenance configurations and dynamic scopes. There is no per-machine charge for Update Manager on native Azure VMs; for Arc-enabled servers, Update Manager is billed (a small per-server monthly charge) – budget for it, but it is far cheaper than running a parallel patch stack off-cloud. Native VM versus Arc server, feature by feature:

| Aspect | Native Azure VM | Arc-enabled server |
| --- | --- | --- |
| AUM charge | Free | Small per-server / month |
| Agent | VM agent (built-in) | Connected Machine agent (azcmagent) |
| patchMode set via | az vm update / VM osProfile | az connectedmachine update / Policy |
| Update source | Windows Update / distro repo | WSUS / internal repo / distro repo |
| Egress requirement | Platform-managed | Outbound 443 to Arc + update endpoints |
| Dynamic scope membership | By tag/sub/RG/OS | Identical – tag at connect time |
| Hotpatching | WS Azure Ed / WS 2025 | WS 2025 (Arc) under subscription |

Connect a Linux server to Arc, tagging it at connect time so an existing scope picks it up:

# On the target server, with the Connected Machine agent already installed (this command only connects)
sudo azcmagent connect \
  --resource-group "rg-arc-servers" \
  --tenant-id "<tenant-id>" \
  --location "eastus2" \
  --subscription-id "<sub>" \
  --cloud "AzureCloud" \
  --tags "PatchGroup=ring1,Environment=prod"

Because you tagged the machine on connect, your existing ring1 dynamic scope picks it up automatically – no separate onboarding into AUM. That is the whole point: one targeting model spanning Azure, on-prem, and other clouds. The connectivity requirements that bite hybrid fleets, and how to satisfy each:

| Requirement | Why | How to satisfy | Confirm |
| --- | --- | --- | --- |
| Outbound 443 to Arc endpoints | Agent heartbeat, config pull | Allow *.his.arc.azure.com, *.guestconfiguration.azure.com etc. | azcmagent check |
| Proxy support (if behind one) | No direct egress | azcmagent config set proxy.url http://proxy:8080 | azcmagent show proxy line |
| Reachable update source (Windows) | Assessment + install need it | Point at internal WSUS or Windows Update | Assessment returns rows |
| Reachable distro repos (Linux) | Package manager needs them | Mirror/repo reachable; air-gapped needs a local mirror | apt/yum update succeeds |
| patchMode on the machine | Same master switch applies | az connectedmachine update patch settings | ARG over osProfile |
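When a firewall team asks which destinations a proposed rule set actually admits, a wildcard allow-list is quick to test offline. The sketch below uses only the two patterns named in the table; the table's "etc." means the list is deliberately partial, and the hostnames checked are illustrative rather than an authoritative endpoint list. `is_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Partial allow-list: only the two patterns named in the table above.
ARC_ALLOW_LIST = [
    "*.his.arc.azure.com",
    "*.guestconfiguration.azure.com",
]


def is_allowed(hostname, patterns=ARC_ALLOW_LIST):
    """Return True when the hostname matches one of the wildcard patterns."""
    host = hostname.lower().rstrip(".")
    return any(fnmatchcase(host, pattern) for pattern in patterns)


for candidate in (
    "gbl.his.arc.azure.com",
    "example.com",
    "his.arc.azure.com.evil.example",
):
    print(candidate, is_allowed(candidate))
```

Matching on the full suffix is the point: a hostname that merely contains an approved domain somewhere in the middle is rejected.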

Set the orchestration mode on an Arc machine the same way conceptually, via the connected-machine surface:

# Arc machines honour the same patchMode concept; set it via the connectedmachine surface
az connectedmachine update \
  --resource-group rg-arc-servers \
  --name arc-records-01 \
  --set properties.osProfile.windowsConfiguration.patchSettings.patchMode=AutomaticByPlatform

Hotpatching and Windows Server orchestration patterns

For supported Windows Server SKUs, hotpatching installs OS security updates without a reboot by patching in-memory code, dramatically shrinking your reboot-driven maintenance windows. It is available on Windows Server Azure Edition (Datacenter: Azure Edition) and, more recently, on Windows Server 2025 – including, notably, Arc-enabled Windows Server 2025 machines under a subscription. The cadence is the pattern to internalize:

| Month type | Months | What ships | Reboot? |
| --- | --- | --- | --- |
| Baseline | Jan, Apr, Jul, Oct | Cumulative update | Yes – required |
| Hotpatch | The two months after each baseline | Security fixes patched in-memory | No |

So a year is four reboots, not twelve, with no loss of security coverage. Where hotpatch is and is not available:

| Platform | Hotpatch support | Notes |
| --- | --- | --- |
| Windows Server Datacenter: Azure Edition | Yes | The original hotpatch SKU |
| Windows Server 2025 (Azure VM) | Yes | Broader availability |
| Windows Server 2025 (Arc, under subscription) | Yes | Hotpatch reaches hybrid |
| Windows Server 2022/2019 Standard/Datacenter | No | Standard cumulative + reboot |
| Linux | N/A | Distro live-patch is separate, not AUM hotpatch |
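The quarterly cadence described earlier reduces to a lookup on the month number, which is handy when building a change calendar. This is a small sketch of that arithmetic only; `month_type` and `planned_reboots_per_year` are hypothetical helper names, and unplanned out-of-band updates are ignored.

```python
# Baseline months from the cadence table: January, April, July, October.
BASELINE_MONTHS = {1, 4, 7, 10}


def month_type(month):
    """Classify a calendar month (1-12) as a baseline or a hotpatch month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return "baseline" if month in BASELINE_MONTHS else "hotpatch"


def planned_reboots_per_year():
    """Count the months in which a reboot is expected on a hotpatch-enabled machine."""
    return sum(1 for m in range(1, 13) if month_type(m) == "baseline")


print([month_type(m) for m in range(1, 13)])
print(planned_reboots_per_year())
```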

The orchestration implication: design your maintenance configuration with rebootSetting: IfRequired, and the platform reboots only on baseline months and skips it on hotpatch months automatically. You do not script the calendar; AUM and the hotpatch service handle it. Enable hotpatch on the OS profile:

// Windows Server Azure Edition VM with hotpatch enabled
properties: {
  osProfile: {
    windowsConfiguration: {
      provisionVMAgent: true
      enableAutomaticUpdates: true
      patchSettings: {
        patchMode: 'AutomaticByPlatform'
        enableHotpatching: true
      }
    }
  }
}

Even where hotpatch is not available, the orchestration pattern holds: separate install from reboot using Never on disruption-sensitive tiers, then drive the reboot through a controlled post-event so it lands inside an approved restart window rather than mid-patch. The reboot-decoupling patterns side by side:

| Pattern | How | Reboots/yr | Window disruption | Best for |
| --- | --- | --- | --- | --- |
| Hotpatch (IfRequired + hotpatch on) | Platform skips reboot on hotpatch months | ~4 | Minimal | WS Azure Ed / WS 2025 |
| Install now, reboot later (Never + post-event) | AUM installs; post-event reboots off-hours | As needed | Deferred to approved window | Restart-sensitive tiers, no hotpatch |
| Standard (IfRequired) | Reboot whenever an update needs it | ~12 | Per window | General fleet |
| Always reboot (Always) | Force clean state every run | Per window | Highest | Machines that must restart to apply config |

Reporting, compliance dashboards, and Policy-driven enforcement

A patch program you cannot report on is a liability. AUM surfaces compliance in the portal, but the durable answer is Azure Policy – it both enforces the prerequisites (so new machines are born compliant) and reports drift across the estate. There are built-in policy definitions for exactly this; assign them at a management group so the whole tenant inherits them:

| Built-in policy | Effect | What it does | Why you need it |
| --- | --- | --- | --- |
| Configure periodic checking for missing system updates on Azure VMs | Modify / DINE | Sets assessmentMode = AutomaticByPlatform | Continuous exposure data, no stale scans |
| Schedule recurring updates using Azure Update Manager | DeployIfNotExists | Associates in-scope machines to a maintenance config | New VMs auto-enrol; no forgotten onboarding |
| Configure periodic checking on Arc machines | Modify / DINE | Assessment mode on Arc servers | Hybrid parity for exposure data |
| Machines should be configured to periodically check for missing updates | Audit | Reports machines not in periodic assessment | Drift visibility before you enforce |

Enforce periodic assessment tenant-wide via the built-in policy. The critical part is the managed identity – without it, the policy reports but never acts:

# Enforce periodic assessment tenant-wide via the built-in policy
az policy assignment create \
  --name "enforce-periodic-assessment" \
  --display-name "AUM: periodic assessment on all VMs" \
  --scope "/providers/Microsoft.Management/managementGroups/mg-platform" \
  --policy "59efceea-0c96-497e-a4a1-4eb2290dac15" \
  --mi-system-assigned --location eastus2 \
  --identity-scope "/providers/Microsoft.Management/managementGroups/mg-platform" \
  --role "Contributor"

DeployIfNotExists and Modify policies need a managed identity with the right role at the assigned scope. Skip the --mi-system-assigned / --role and remediation tasks silently fail to deploy – the assignment shows compliant-looking definitions but never acts. Always provision the identity.

The role each policy effect requires at the assigned scope:

| Effect | Needs MI? | Typical role | If omitted |
| --- | --- | --- | --- |
| Audit | No | n/a | Reports only; no change |
| Modify | Yes | Contributor (or scoped role) | Tags/settings never applied |
| DeployIfNotExists | Yes | Contributor + resource-specific | Remediation never deploys; looks compliant |
| Deny | No | n/a | Blocks non-compliant creates |
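The table translates directly into a lint rule for exported assignments. The sketch below assumes a simplified, hypothetical record shape (`name`, `effect`, `identity`, `roles`) rather than the real Azure Policy export schema, and `cannot_remediate` is an invented helper; adapt the field names to whatever your export produces.

```python
# Effects that act on resources and therefore need an identity with a role.
EFFECTS_NEEDING_IDENTITY = {"modify", "deployifnotexists"}


def cannot_remediate(assignments):
    """Return names of assignments whose effect needs an identity and role but lacks one."""
    flagged = []
    for a in assignments:
        needs_identity = a["effect"].lower() in EFFECTS_NEEDING_IDENTITY
        has_identity = bool(a.get("identity"))
        has_role = bool(a.get("roles"))
        if needs_identity and not (has_identity and has_role):
            flagged.append(a["name"])
    return flagged


# Invented sample assignments in the simplified shape described above.
assignments = [
    {"name": "audit-periodic", "effect": "Audit"},
    {"name": "enforce-periodic", "effect": "Modify",
     "identity": "SystemAssigned", "roles": ["Contributor"]},
    {"name": "enrol-schedule", "effect": "DeployIfNotExists"},
    {"name": "enrol-arc", "effect": "DeployIfNotExists",
     "identity": "SystemAssigned", "roles": []},
]

print(cannot_remediate(assignments))
```

Treating an identity without a role as a failure matters as much as the missing identity: both leave the remediation unable to act.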

Report compliance from Resource Graph so it feeds a workbook or your existing dashboards rather than living only in the AUM blade:

// Fleet patch-compliance rollup: compliant vs non-compliant by environment
patchassessmentresources
| where type =~ "microsoft.compute/virtualmachines/patchassessmentresults"
   or type =~ "microsoft.hybridcompute/machines/patchassessmentresults"
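// The assessment row is a child resource; strip the suffix to recover the parent machine id
| extend machineId = tostring(split(tolower(id), "/patchassessmentresults/")[0])
| extend pending = toint(properties.availablePatchCountByClassification.security)
                 + toint(properties.availablePatchCountByClassification.critical)
| extend state = iff(pending == 0, "Compliant", "NonCompliant")
| join kind=leftouter (
    resources
    | project machineId = tolower(id), env = tostring(tags["Environment"])
  ) on machineId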
| summarize machines = count() by state, env
| order by env asc, state asc
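To sanity-check the rollup logic before trusting a dashboard, the same classification can be replayed on a handful of exported rows. This is a sketch with a hypothetical input shape (`machineId` plus the classification counts) and invented sample data; it mirrors the rule in the query above, where a machine is compliant only when it has no pending security or critical updates.

```python
from collections import Counter


def compliance_rollup(assessments, env_by_machine):
    """Count machines by (environment, state), using pending security + critical updates."""
    counts = Counter()
    for row in assessments:
        by_class = row.get("availablePatchCountByClassification", {})
        pending = int(by_class.get("security", 0)) + int(by_class.get("critical", 0))
        state = "Compliant" if pending == 0 else "NonCompliant"
        # Machine ids are compared in lower case, as in the query's join.
        env = env_by_machine.get(row["machineId"].lower(), "")
        counts[(env, state)] += 1
    return dict(sorted(counts.items()))


# Invented sample rows.
assessments = [
    {"machineId": "/vm/a", "availablePatchCountByClassification": {"security": 0, "critical": 0}},
    {"machineId": "/VM/B", "availablePatchCountByClassification": {"security": 3, "critical": 1}},
    {"machineId": "/vm/c", "availablePatchCountByClassification": {"security": 0, "critical": 2}},
]
env_by_machine = {"/vm/a": "prod", "/vm/b": "prod", "/vm/c": "dev"}

print(compliance_rollup(assessments, env_by_machine))
```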

The Resource Graph tables you will query for patch reporting, and what each holds:

| Table | Holds | Key columns | Use for |
| --- | --- | --- | --- |
| patchassessmentresources | Assessment summary + per-patch children | classifications, patchState, availablePatchCountByClassification | Exposure, pending counts |
| patchinstallationresources | Install run results + per-patch | installationState, patchName | What actually installed |
| maintenanceresources | Maintenance config + run history | maintenanceScope, run status | Did the schedule run? |
| resources | Machines + tags + osProfile | tags, patchSettings | Scope validation, patchMode hunt |

Architecture at a glance

The diagram traces patch orchestration as it actually flows, left to right, across the four planes that make AUM work – and marks the five places a run silently does nothing. Read it as a control loop. On the left, the control plane is authored once as code: Azure Policy enrols machines and enforces assessment, a maintenance configuration (InGuestPatch, window ≥ 1h30m) declares the schedule, and a dynamic scope – an Azure Resource Graph query over the PatchGroup tag – decides membership at run time. That intent flows into the orchestration plane, where the AUM engine assesses and installs (only if the machine is AutomaticByPlatform with bypass = true) and an optional pre/post Event Grid handler drains, snapshots, and validates inside a bounded pre-window.

From orchestration the same schedule fans out to two execution targets: the Azure fleet (no per-VM charge, including hotpatch-capable Windows Server SKUs that take four reboots a year instead of twelve) and the hybrid/multicloud estate of Arc-enabled servers in your datacenter, AWS, or GCP, each of which must reach its update source – WSUS or a distro repo – over outbound 443. Both targets emit assessment data into the report-and-enforce plane, where Resource Graph (patchassessmentresources) and a compliance workbook give one queryable Azure-plus-Arc view, and detected drift loops back to Policy for remediation. The five numbered badges mark the silent failures: a window too short or wrong scope, a dynamic scope that resolves to zero machines, the wrong patchMode/bypass, a pre-event that times out, and an unreachable update source. Each badge in the legend reads as symptom · how to confirm · fix – the same diagnostic loop the playbook below expands.

Figure: Azure Update Manager patch-orchestration architecture across four planes (control, orchestration, execution, and report-and-enforce), with five numbered failure badges.

Real-world scenario

Meridian Health Systems, a healthcare ISV, ran ~900 servers split across Azure (Windows + Linux app tiers) and two on-prem datacenters still hosting a regulated records system that legally could not move to the cloud yet. Their old world was Automation Update Management for the Azure VMs and a hand-maintained WSUS-plus-cron arrangement on-prem. When the legacy service hit end of support, two constraints collided: an external auditor required a single, queryable compliance view across the entire estate, and the on-prem records servers had a hard rule – no unscheduled reboots during clinical hours (06:00-22:00 local), ever.

They solved it with one targeting model. The on-prem servers were onboarded to Arc and tagged PatchGroup=ring2,Environment=prod at connect time, which dropped them straight into an existing dynamic scope – no bespoke onboarding. The Azure app tiers were tagged ring0 (a thin canary of build agents and two non-critical app servers) and ring1 (the broad fleet). Every machine’s patchMode was driven to AutomaticByPlatform with bypass = true by an Azure Policy Modify assignment at the platform management group, so newly created VMs were born correct. The reboot constraint on ring2 was handled by splitting install from restart: the ring2 maintenance configuration used rebootSetting: Never, so AUM installed packages inside a late-evening window but never restarted anything itself. A post-maintenance Event Grid handler then queued required reboots and released them only after 22:00 via a controlled runbook, machine by machine, with health checks between. Finally, all compliance reporting – Azure and Arc alike – came from a single Resource Graph query feeding one workbook, which is exactly the artifact the auditor wanted.

The first scheduled window did not go cleanly, and the failure is instructive. The ring1 run completed with a green status but the pending-update count barely moved. The cause was the classic one: a subset of ring1 VMs had been created from an older image whose patchMode was ImageDefault, and the Policy Modify remediation task had never run because the assignment was created without a managed identity – it reported compliant definitions but never acted. The ARG patchMode hunt surfaced 140 machines in the wrong mode within seconds. They added the system-assigned identity and Contributor role to the assignment, kicked a remediation task, re-validated the hunt to zero, and the next window patched all 900. The reboot-suppressing slice of the ring2 config:

installPatches: {
  rebootSetting: 'Never'   // install in-window; reboots handled by post-event after clinical hours
  windowsParameters: {
    classificationsToInclude: [ 'Critical', 'Security' ]
  }
}
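The release side of that arrangement, holding reboots until clinical hours end and stopping at the first unhealthy machine, might look like the sketch below. It illustrates the gating logic only and is not Meridian's actual runbook: `reboot_allowed`, `release_reboots`, and the `reboot`/`is_healthy` callbacks are hypothetical, and the protected hours are taken from the 06:00-22:00 rule in the scenario.

```python
from datetime import time

CLINICAL_START = time(6, 0)
CLINICAL_END = time(22, 0)


def reboot_allowed(now):
    """True when local time `now` falls outside the protected hours [06:00, 22:00)."""
    return not (CLINICAL_START <= now < CLINICAL_END)


def release_reboots(pending, now, reboot, is_healthy):
    """Reboot queued machines one at a time; stop at the first failed health check.

    Returns the machines that were rebooted and then verified healthy.
    """
    verified = []
    if not reboot_allowed(now):
        return verified
    for machine in pending:
        reboot(machine)
        if not is_healthy(machine):
            break  # leave the rest queued for a human to look at
        verified.append(machine)
    return verified


# Invented example: the second machine fails its health check.
rebooted = []
result = release_reboots(
    ["rec-01", "rec-02", "rec-03"],
    time(22, 30),
    reboot=rebooted.append,
    is_healthy=lambda m: m != "rec-02",
)
print(result, rebooted)
```

Stopping on the first failed check keeps a bad patch from taking down the whole tier in one evening; the remaining machines stay pending until someone investigates.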

The lessons the team took away: Arc plus dynamic scopes collapsed three patch programs into one; decoupling install from reboot turned a hard compliance constraint into a scheduling detail rather than a blocker; and a DeployIfNotExists/Modify policy without a managed identity is worse than no policy, because it looks like governance while doing nothing.

Advantages and disadvantages

The native-platform, schedule-declared model both removes a huge amount of operational toil and introduces a handful of sharp edges. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
| --- | --- |
| No Log Analytics workspace, Automation account, or dedicated agent – it is native to the VM/Arc platform | The orchestration mode (patchMode + bypass) is an easy-to-miss prerequisite; get it wrong and runs silently no-op |
| One targeting model (dynamic scopes on tags) spans Azure, on-prem, AWS, GCP via Arc | Off-Azure machines need reachable update sources and 443 egress – air-gapped/WSUS estates need real network work |
| Free on native Azure VMs; only Arc servers carry a small per-server charge | Arc per-server billing must be budgeted; large hybrid fleets add up |
| DeployIfNotExists policy makes machines born-compliant and auto-enrolled | A policy without a managed identity looks compliant but never acts – a dangerous false signal |
| Install and reboot are separable (Never + post-event); hotpatch removes most reboots entirely | Reboot decoupling adds an orchestration handler you must build, test, and keep idempotent |
| Single Resource Graph query gives one Azure+Arc compliance view for audit | Reporting lives in ARG/KQL, not a turnkey dashboard – you build the workbook |
| Dynamic scopes evaluate at run time, so new machines need zero manual onboarding | A scope that resolves to zero (tag typo) fails silently – “ran but nothing happened” |
| Pre/post events integrate drain/snapshot/validate without a separate scheduler | The ~20-minute pre-window forces async design; long inline work times out |

The model is right for any fleet beyond a handful of VMs, and especially for hybrid/multicloud and regulated estates that need one auditable answer. It bites hardest on teams that treat AUM as a button (never declaring a schedule), estates with restrictive egress (Arc machines that cannot reach their update source), and anyone who assigns a remediation policy without the identity that lets it act. Every disadvantage is manageable – but only if you know it exists, which is the point of the playbook below.

Hands-on lab

Stand up a single Linux VM, put it under platform orchestration, prove an assessment and a one-time install work, then attach it to a maintenance configuration – all free-tier-friendly on native Azure VMs (AUM has no per-VM charge; you pay only for the VM). Run in Cloud Shell (Bash) and tear it down at the end.

Step 1 – Variables and resource group.

RG=rg-aum-lab
LOC=eastus2
VM=vm-aum-lab
az group create -n $RG -l $LOC -o table

Step 2 – Register the Microsoft.Maintenance provider (idempotent; a no-op if already registered).

az provider register --namespace Microsoft.Maintenance
az provider show --namespace Microsoft.Maintenance --query registrationState -o tsv
# Expected: Registered

Step 3 – Create a small Ubuntu VM born with platform orchestration.

az vm create -g $RG -n $VM --image Ubuntu2204 --size Standard_B1s \
  --admin-username azureuser --generate-ssh-keys \
  --patch-mode AutomaticByPlatform -o table

Expected: a VM row with a public IP. The --patch-mode AutomaticByPlatform flag sets the orchestration mode at create time.

Step 4 – Set the bypass flag so YOUR schedule will own patching.

az vm update -g $RG -n $VM \
  --set osProfile.linuxConfiguration.patchSettings.automaticByPlatformSettings.bypassPlatformSafetyChecksOnUserSchedule=true \
  --set osProfile.linuxConfiguration.patchSettings.assessmentMode=AutomaticByPlatform

Step 5 – Assess the machine and read the result from Resource Graph.

az vm assess-patches -g $RG -n $VM -o table
# Then query the result (may take a minute to surface):
az graph query -q "patchassessmentresources
| where id contains '$VM'
| project name, type, properties" -o jsonc

Expected: an assessment summary with availablePatchCountByClassification. Non-zero counts mean updates are pending.

Step 6 – Run a one-time install of Critical + Security, no reboot.

az vm install-patches -g $RG -n $VM \
  --maximum-duration PT60M \
  --reboot-setting Never \
  --classifications-to-include-linux Critical Security -o table

Expected: a run that returns Succeeded or CompletedWithWarnings. Re-run Step 5’s assessment to confirm the pending count dropped – this proves the data plane before you trust a schedule.

Step 7 – Create a daily maintenance configuration and attach the VM.

MCID=$(az maintenance configuration create -g $RG --resource-name mc-lab-daily -l $LOC \
  --maintenance-scope InGuestPatch \
  --extension-properties InGuestPatchMode=User \
  --duration 01:30 --recur-every 1Day --start-date-time "2026-06-09 03:00" --time-zone "UTC" \
  --reboot-setting IfRequired \
  --query id -o tsv)

az maintenance assignment create -g $RG --resource-name $VM --resource-type virtualMachines \
  --provider-name Microsoft.Compute \
  --configuration-assignment-name assign-lab \
  --maintenance-configuration-id "$MCID" -o table

Expected: the assignment is created. The VM is now associated to a daily schedule; the next window will assess and install per the config.

Step 8 – Verify the association and the machine’s orchestration mode.

az graph query -q "resources
| where name == '$VM'
| extend mode = properties.osProfile.linuxConfiguration.patchSettings.patchMode
| extend bypass = properties.osProfile.linuxConfiguration.patchSettings.automaticByPlatformSettings.bypassPlatformSafetyChecksOnUserSchedule
| project name, mode, bypass" -o jsonc
# Expected: mode = AutomaticByPlatform, bypass = true

Step 9 – Teardown.

az group delete -n $RG --yes --no-wait

This deletes the VM, the maintenance configuration, and the assignment in one shot.

Common mistakes & troubleshooting

This is the differentiator. Patch programs fail quietly – a green status that patched nothing is worse than a red one. The playbook below is the structured map: symptom → root cause → confirm (exact command/portal path) → fix. Scan it first; the prose after expands the worst offenders.

| # | Symptom | Root cause | Confirm (exact command / portal path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Run shows green, pending count barely drops | patchMode not AutomaticByPlatform or bypass=false | ARG patchMode hunt over osProfile patch settings | Set patchMode=AutomaticByPlatform + bypass=true |
| 2 | “Schedule ran but nothing happened” | Dynamic scope resolves to zero machines | Run the scope’s ARG query manually; count rows | Fix the tag/filter; re-validate count before window |
| 3 | Only some machines in a ring patched | Mixed patchMode across the ring (old image) | ARG hunt filtered to that ring’s tag | Remediate via Policy Modify; re-run with identity |
| 4 | Most patches skipped, run hit time limit | Window too short for the batch | maintenanceresources run status + duration | Raise duration (≥ 01:30); size for slowest machine |
| 5 | Assessment returns no rows | Machine can’t reach its update source | patchassessmentresources empty for that machine; azcmagent check (Arc) | Open 443 egress/proxy; point Windows at WSUS; fix repos |
| 6 | Arc machine not patched | Not in scope (resource-type filter) or agent disconnected | ARG scope query; az connectedmachine show status | Include microsoft.hybridcompute; reconnect agent |
| 7 | Machine reboots during business hours | rebootSetting = IfRequired/Always on sensitive tier | Inspect config installPatches.rebootSetting | Set Never; drive reboot via post-event off-hours |
| 8 | Policy shows compliant but VMs not enrolled | DeployIfNotExists assignment has no managed identity | Policy assignment -> Identity tab is empty | Add system-assigned MI + role; run remediation task |
| 9 | Pre-event handler errors / run un-drained | Handler exceeds ~20-min pre-window or not idempotent | Function/Logic App invocation logs around window time | Make handler fast + idempotent; do long work async |
| 10 | A known-bad KB keeps installing | KB not excluded in config | Config windowsParameters.kbNumbersToExclude | Add the KB to kbNumbersToExclude |
| 11 | Two configs patch the same machine | Overlapping scopes (both tag and RG-wide match) | List assignments on the machine | Make tags mutually exclusive; one ring per machine |
| 12 | New VM missed its first window | Tag applied after scope evaluated, or missing | ARG query for the tag at the time | Tag at create (IaC/Policy); confirm before window |
| 13 | Linux install fails on specific package | Held/broken package or repo conflict | default package-manager logs on the host | Exclude the package; fix the repo; re-run |
| 14 | Hotpatch month still rebooted | Hotpatch not enabled or unsupported SKU | OS profile enableHotpatching; SKU check | Enable hotpatch on a supported SKU; baseline months still reboot |

The wrong orchestration mode (the #1 silent failure)

A run completes with a benign-looking status and the pending-update count barely changes. The machine’s patchMode is not AutomaticByPlatform, or bypassPlatformSafetyChecksOnUserSchedule is false, so the platform either never installs on your schedule or auto-patches on its own cadence. Confirm with the ARG patchMode hunt from earlier in this article – it returns every misconfigured machine in seconds. Fix: drive the property to AutomaticByPlatform + bypass=true (by az vm update, az connectedmachine update, or a Policy Modify assignment for born-correct machines), then re-run the hunt to confirm zero.
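If you already export machine properties to JSON, the same hunt can run offline. The sketch below assumes each record carries a `name` and an `osProfile` shaped like the VM resource (the property paths match the ones used in the lab's verification query); `misconfigured` is a hypothetical helper and the sample records are invented.

```python
def misconfigured(machines):
    """Return (name, patchMode, bypass) for machines a user schedule would not patch."""
    bad = []
    for m in machines:
        os_profile = m.get("osProfile") or {}
        config = (
            os_profile.get("windowsConfiguration")
            or os_profile.get("linuxConfiguration")
            or {}
        )
        patch_settings = config.get("patchSettings") or {}
        mode = patch_settings.get("patchMode")
        platform_settings = patch_settings.get("automaticByPlatformSettings") or {}
        bypass = platform_settings.get("bypassPlatformSafetyChecksOnUserSchedule")
        if mode != "AutomaticByPlatform" or bypass is not True:
            bad.append((m["name"], mode, bypass))
    return bad


# Invented sample records.
fleet = [
    {"name": "vm-good", "osProfile": {"linuxConfiguration": {"patchSettings": {
        "patchMode": "AutomaticByPlatform",
        "automaticByPlatformSettings": {"bypassPlatformSafetyChecksOnUserSchedule": True}}}}},
    {"name": "vm-old-image", "osProfile": {"linuxConfiguration": {"patchSettings": {
        "patchMode": "ImageDefault"}}}},
    {"name": "vm-no-bypass", "osProfile": {"windowsConfiguration": {"patchSettings": {
        "patchMode": "AutomaticByPlatform"}}}},
]

for row in misconfigured(fleet):
    print(row)
```

A missing bypass flag is reported as `None` rather than `False`, which distinguishes "never set" from "explicitly disabled" when you triage the list.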

A dynamic scope that resolves to zero

The window fires, the run logs success, and nothing is patched – because the scope’s tag filter matched no machines. A single typo (ring1 vs Ring1, or the wrong subscription in the filter) silently empties the scope. Confirm by running the scope’s exact ARG query manually and counting rows; zero is the smoking gun. Fix: correct the tag or filter, re-validate the count against expectation, and make this validation a pre-window gate – never trust a scope you have not counted.

DeployIfNotExists policy with no managed identity

The Policy compliance blade shows your enrolment policy as compliant, yet new VMs are never associated to a maintenance configuration. The DeployIfNotExists/Modify assignment was created without a managed identity, so it evaluates definitions but never deploys the remediation. Confirm: open the assignment, check the Identity tab – it is empty – and look for a remediation history of zero. Fix: add a system-assigned identity and the required role (Contributor or a scoped equivalent) at the assignment scope, then trigger a remediation task and confirm the deployment runs.

An unreachable update source

Assessment returns no rows for a machine, or installs fail outright – because the machine cannot reach Windows Update / WSUS (Windows) or its distro repos (Linux), or, for Arc, the Arc endpoints on 443. Confirm: the machine is absent from patchassessmentresources; on Arc, azcmagent check flags the failing endpoint. Fix: open the egress/proxy path, point Windows machines at an internal WSUS, confirm Linux repos are reachable (or stand up a local mirror for air-gapped estates), then re-assess.

A window too short for the batch

Most patches are skipped and the run hits its time limit, because the window’s effective install time (duration - 10m) was too small for the slowest machine in the batch. Confirm: maintenanceresources shows the run hitting the time limit; compare duration against the batch’s worst-case install time. Fix: raise duration (minimum 01:30), split a large ring into smaller batches with their own windows, or move slow machines to a dedicated config with a longer window.
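The sizing rule is simple arithmetic and worth encoding so nobody has to remember the ten-minute deduction. A minimal sketch, taking the 10-minute reservation and the 01:30 minimum from the text above; `effective_install_time` and `window_fits` are hypothetical helper names.

```python
from datetime import timedelta

RESERVED = timedelta(minutes=10)                # deducted from the window, per the text above
MIN_DURATION = timedelta(hours=1, minutes=30)   # minimum window length, per the text above


def effective_install_time(duration):
    """Time actually available for installs inside a maintenance window."""
    if duration < MIN_DURATION:
        raise ValueError(f"window {duration} is below the minimum {MIN_DURATION}")
    return duration - RESERVED


def window_fits(duration, worst_case_install):
    """True when the slowest machine in the batch can finish inside the window."""
    return worst_case_install <= effective_install_time(duration)


slowest = timedelta(minutes=85)
print(effective_install_time(timedelta(hours=1, minutes=30)))
print(window_fits(timedelta(hours=1, minutes=30), slowest))
print(window_fits(timedelta(hours=2), slowest))
```

An 85-minute worst case does not fit a 01:30 window (80 minutes effective) but does fit a two-hour one, which is exactly the kind of ten-minute miss that produces a run full of skipped patches.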

Best practices

Security notes

Patching is a security control, but the patch pipeline itself is also an attack surface and a least-privilege problem. Treat it accordingly:

| Concern | Risk | Control |
| --- | --- | --- |
| Policy remediation identity | Over-broad Contributor at MG scope | Scope the role to the minimum (a custom role granting only maintenance-assignment + VM patch settings) where feasible |
| Pre/post handler identity | A Function that drains/reboots needs power | Use a managed identity with the least role; never store credentials in the handler |
| Update source integrity | WSUS/repo poisoning installs malicious “updates” | Use trusted, signed sources; restrict who can publish to internal WSUS/mirrors |
| Arc agent egress | Broad outbound from on-prem to the internet | Allow-list the specific Arc + update endpoints on 443; use a proxy, not open egress |
| Known-bad / supply-chain KBs | A bad patch breaks or backdoors machines | Canary ring first; kbNumbersToExclude/packagesToExclude to hold back; validate before broad rollout |
| Exemptions | exempt machines accumulate unpatched | Time-box exemptions, audit them separately, require a documented owner and expiry |
| Reboot suppression | Never leaves machines pending-reboot, partially patched | Track pending-reboot state; drive the controlled reboot promptly via post-event |
| Compliance data exposure | CVE exposure is sensitive | Restrict Resource Graph/workbook access; treat the exposure view as confidential |
| Secrets in handlers | Snapshot/drain handlers touching storage/DB | Reference Key Vault via managed identity; no secrets in app settings |

The throughline: the only identities that can act on the fleet (the remediation MI, the handler MI) should hold the minimum role at the minimum scope, the only sources machines pull from should be trusted and signed, and the only machines left unpatched should be deliberately, visibly, and temporarily exempt.

Cost & sizing

AUM’s pricing model is deliberately simple, and the headline is that it is free on native Azure VMs. The cost surface and rough figures (always verify current rates for your region/currency):

Cost driver | Native Azure VM | Arc-enabled server | Notes
--- | --- | --- | ---
Update Manager itself | Free | Small per-server / month | The only AUM line item is Arc machines
Underlying compute | You pay for the VM | You pay for the on-prem/other-cloud host | AUM does not change compute cost
Pre/post handler (Function) | Consumption per execution | Same | Tiny at monthly cadence
Event Grid | Per operation | Same | Negligible for patch events
Resource Graph queries | Free | Free | No charge for ARG
Log Analytics (optional) | Per GB if you route logs there | Same | AUM does not require it; only if you choose to

Rough INR/USD framing for a hybrid estate:

Scenario | Machines | AUM cost driver | Rough monthly cost
--- | --- | --- | ---
All-Azure fleet | 200 Azure VMs | AUM free; pay only compute | ₹0 for AUM (compute separate)
Small hybrid | 50 Azure + 50 Arc | 50 Arc × per-server charge | ~50 × small fee (USD low single digits each)
Large hybrid | 500 Azure + 400 Arc | 400 Arc × per-server charge | 400 × per-server fee – budget explicitly
Orchestration overhead | any | Functions + Event Grid at monthly cadence | Effectively rounding error
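The arithmetic behind these scenarios is simple enough to script for a budget sheet. A rough sketch: `arc_fee_per_server` is deliberately a required argument with no default, because the rate must come from the current price list for your region and currency, and the function ignores the orchestration overhead that the table treats as rounding error.

```python
def monthly_aum_cost(azure_vms: int, arc_servers: int, arc_fee_per_server: float) -> float:
    """Estimate the monthly AUM line item: native Azure VMs are free, Arc servers are not."""
    if azure_vms < 0 or arc_servers < 0 or arc_fee_per_server < 0:
        raise ValueError("counts and fee must be non-negative")
    # azure_vms is accepted only to make the point explicit: it contributes nothing.
    return arc_servers * arc_fee_per_server


if __name__ == "__main__":
    fee = 1.0  # placeholder unit, NOT a real rate; substitute the published per-server fee
    for name, vms, arc in [("All-Azure", 200, 0), ("Small hybrid", 50, 50), ("Large hybrid", 500, 400)]:
        print(f"{name}: {monthly_aum_cost(vms, arc, fee):.2f} fee units / month")
```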

Sizing is about windows and batches, not money: size each maintenance window for the slowest machine in its batch (duration - 10m effective), split very large rings into multiple windowed batches to avoid time-limit skips, and prefer hotpatch-capable SKUs where reboot disruption (not licence cost) is the constraint. The one real spend decision is Arc per-server billing on large hybrid fleets – it is far cheaper than running a parallel patch stack, but on 400+ servers it is a line item you must put in the budget, not discover.
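The window arithmetic can be sanity-checked before a schedule is committed. A minimal sketch, assuming the figures stated in this article (01:30 minimum window, last 10 minutes reserved) and assuming machines in a batch patch in parallel, so the slowest one governs; the per-machine duration is your own estimate, not something AUM reports in advance:

```python
from typing import List, Sequence

RESERVED_MINUTES = 10      # tail of the window that is not available for installs
MIN_WINDOW_MINUTES = 90    # minimum window duration (01:30)


def effective_install_minutes(window_minutes: int) -> int:
    """Usable install time in a maintenance window: duration minus the reserved tail."""
    if window_minutes < MIN_WINDOW_MINUTES:
        raise ValueError("window is shorter than the 01:30 minimum")
    return window_minutes - RESERVED_MINUTES


def window_fits(window_minutes: int, slowest_machine_minutes: int) -> bool:
    """True if the slowest machine in the batch can finish inside the usable time."""
    return slowest_machine_minutes <= effective_install_minutes(window_minutes)


def split_into_batches(machines: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a large ring into fixed-size batches, one per window."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(machines[i:i + batch_size]) for i in range(0, len(machines), batch_size)]
```

For example, a 120-minute window leaves 110 usable minutes, so a machine estimated at 115 minutes belongs in a longer window or a different batch.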

Interview & exam questions

These map to AZ-104 (Azure Administrator) and AZ-800/AZ-801 (Windows Server Hybrid); the governance angle touches AZ-305.

  1. What are the two planes of Azure Update Manager? The data plane (assess-patches/install-patches) acts on one machine on demand; the scheduling plane (maintenance configurations of scope InGuestPatch) declares a recurring, governed program. You prove with the data plane and operate with the scheduling plane.

  2. Why must a machine be AutomaticByPlatform with bypassPlatformSafetyChecksOnUserSchedule = true for scheduled patching? AutomaticByPlatform lets the platform orchestrate installs; bypass=true stops the platform from also applying its own automatic patches on Microsoft’s cadence, which would collide with your window. Without both, the run is silently skipped.

  3. What replaced Automation Update Management, and when did it retire? Azure Update Manager replaced it; Automation Update Management reached end of support on 31 August 2024, as did the MMA/OMS agent it depended on.

  4. What is a dynamic scope and why prefer it over static assignment? A dynamic scope binds machines to a maintenance configuration via a Resource Graph filter (subscription, resource group, tag, OS type) evaluated at run time, so new machines carrying the right tag are patched with zero manual onboarding, whereas static assignments rot as the fleet changes.

  5. What is the minimum maintenance window duration, and what is the 10-minute caveat? Minimum 01:30; the platform reserves the last 10 minutes to finalize, so effective install time is duration - 10m, and AUM stops starting new installs once the window is exhausted.

  6. How do you patch an on-prem or AWS/GCP server with AUM? Onboard it as an Arc-enabled server (azcmagent connect), tag it at connect time, and it inherits the same maintenance configurations and dynamic scopes – billed a small per-server monthly charge, unlike free native Azure VMs.

  7. How do you patch a machine that cannot reboot during business hours? Set rebootSetting: Never so AUM installs but never restarts, then drive the reboot through a controlled post-maintenance Event Grid handler that releases reboots in an approved window.

  8. What is hotpatching and what is its reboot cadence? On Windows Server Azure Edition / WS 2025, hotpatch installs OS security updates without a reboot. Baseline months (Jan/Apr/Jul/Oct) ship a cumulative update and require a reboot; the two months after each ship hotpatches with no reboot – four reboots a year, not twelve.

  9. Which two built-in policies operationalize AUM, and what do they do? “Configure periodic checking for missing system updates” sets assessmentMode=AutomaticByPlatform; “Schedule recurring updates using Azure Update Manager” uses DeployIfNotExists to auto-enrol in-scope machines into a maintenance configuration.

  10. Why does a DeployIfNotExists policy sometimes show compliant but never act? It needs a managed identity with the right role at the assigned scope to deploy the remediation. Without the identity, it evaluates definitions and looks compliant but never deploys anything.

  11. Where do you query fleet patch compliance for both Azure and Arc? Azure Resource Graph – patchassessmentresources for exposure, patchinstallationresources for what installed, maintenanceresources for run history – joined to resources for tags; no Log Analytics workspace required.

  12. What is the bounded contract for a pre-maintenance event handler? It runs inside a ~20-minute pre-window; the maintenance run proceeds when the handler completes or times out, so handlers must be fast and idempotent and trigger long work asynchronously.
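The hotpatch cadence from question 8 reduces to a calendar lookup. A small sketch, assuming only the planned schedule described there (baseline months January, April, July, October) and ignoring any out-of-band baseline that might be released:

```python
BASELINE_MONTHS = {1, 4, 7, 10}  # Jan, Apr, Jul, Oct: cumulative update, reboot required


def reboot_expected(month: int) -> bool:
    """True if the given month (1-12) is a baseline month on the planned hotpatch calendar."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return month in BASELINE_MONTHS


# Count the planned reboots across a calendar year.
reboots_per_year = sum(reboot_expected(m) for m in range(1, 13))
```

Here `reboots_per_year` comes out as 4, matching the "four reboots a year, not twelve" answer.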

Quick check

  1. Your maintenance run shows a green status but the pending-update count barely moved. What is the single most likely cause and the one command to confirm it?
  2. You attach a dynamic scope filtering PatchGroup=ring1 and the window fires with nothing patched. What do you check first?
  3. A regulated tier cannot reboot between 06:00 and 22:00. How do you patch it without violating that rule?
  4. Why is Azure Update Manager free on native Azure VMs but billed on Arc-enabled servers?
  5. What two osProfile patch settings must a VM have before a maintenance configuration will patch it, and what is the effect of each?

Answers

  1. The machine’s patchMode is not AutomaticByPlatform (or bypass=false), so the run is silently skipped. Confirm with the Resource Graph patchMode hunt over osProfile patch settings, which returns every misconfigured machine.
  2. Run the scope’s exact ARG query manually and count the rows – a scope resolving to zero (a tag typo or wrong subscription in the filter) is the number-one cause of “ran but nothing happened.” Fix the filter and re-validate the count.
  3. Set rebootSetting: Never so AUM installs packages inside a window but restarts nothing, then drive the reboot through a controlled post-maintenance Event Grid handler that releases reboots only after 22:00.
  4. AUM is a native platform capability for Azure VMs (no extra charge); Arc-enabled servers are off-Azure machines projected into Azure, and AUM coverage for them carries a small per-server monthly charge – still far cheaper than a parallel off-cloud patch stack.
  5. patchMode = AutomaticByPlatform (lets the platform orchestrate installs on your schedule) and bypassPlatformSafetyChecksOnUserSchedule = true (stops the platform from auto-patching on its own cadence and colliding with your window). Both are required or the run is skipped.
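Answers 1 and 5 describe a check you can run against exported VM JSON. A hedged sketch: the nested keys below reflect my understanding of where the Azure VM resource model keeps its patch settings (`patchSettings` under `windowsConfiguration` or `linuxConfiguration`), so verify them against real `az vm show` output before relying on it; the function name `schedule_blockers` is made up for illustration.

```python
from typing import List


def schedule_blockers(vm: dict) -> List[str]:
    """Return the reasons an Azure VM would be skipped by scheduled patching (empty if none)."""
    os_profile = (vm.get("properties") or {}).get("osProfile") or {}
    # A VM carries either a Windows or a Linux configuration, never both.
    config = os_profile.get("windowsConfiguration") or os_profile.get("linuxConfiguration") or {}
    patch = config.get("patchSettings") or {}

    blockers = []
    if patch.get("patchMode") != "AutomaticByPlatform":
        blockers.append("patchMode is not AutomaticByPlatform")
    by_platform = patch.get("automaticByPlatformSettings") or {}
    if by_platform.get("bypassPlatformSafetyChecksOnUserSchedule") is not True:
        blockers.append("bypassPlatformSafetyChecksOnUserSchedule is not true")
    return blockers
```

An empty list means both settings are in place; anything else names the setting that would cause the silent skip.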

Glossary

Next steps
