Azure Site Recovery for IaaS: Zone-to-Zone and Region Failover with Recovery Plans

Every DR program I have audited had the same gap: replication was healthy, the dashboards were green, and nobody could tell me the last time anyone actually failed over. Azure Site Recovery (ASR) – Microsoft’s business-continuity service that continuously replicates a running VM’s disks to a second fault domain (another availability zone in the same region, or a paired region entirely) and orchestrates an ordered failover – is not hard to enable. The trap is treating “protected” as “recoverable.” A VM with a 5-minute RPO (Recovery Point Objective) is worthless if your application boots in the wrong order, comes up with a stale IP, can’t find DNS, or needs a runbook that lives only in someone’s head.

This is how to wire ASR for IaaS correctly across both an availability-zone failure and a full region loss: build recovery plans that boot a multi-tier app in dependency order, automate the messy parts (DNS cutover, IP reassignment, load-balancer wiring) with Azure Automation runbooks, run isolated test failovers that never touch production, and – the part everyone skips – prove your numbers with a drill you repeat on a schedule. We treat the whole thing as an operational discipline, not a checkbox: replication health is necessary, but a clean, timed, repeatable test failover is the only thing that is sufficient.

By the end you will stop confusing a green replication dashboard with a recoverable application. You will know exactly where the cache storage account lives and why, why multiVmSyncStatus is non-negotiable for a SQL Always On pair, why a fixed Start-Sleep in a boot group is a lie, how a runbook tells a drill apart from a real disaster, and how to measure achieved RPO as a first-class alert rather than a number you discover during the incident. Because this is a reference you return to while planning a drill – or during one – the policy knobs, the failover modes, the runbook context fields, the cost drivers and the failure playbook are all laid out as scannable tables. Read the prose once; keep the tables open when it counts.

What problem this solves

The pain is concrete and almost always discovered at the worst possible time – the first real drill, or the first real outage. Replication being “Normal” tells you bytes are arriving in the target. It tells you nothing about whether the application comes back. The gap between “disks are replicated” and “the service is serving traffic again, inside the RTO the business signed up for” is where DR programs die, and it is invisible on every dashboard until you exercise it.

What breaks without this: an availability zone loses power and the team discovers their “zone-redundant” production was actually pinned to that one zone with no zone-to-zone replication. Or a region degrades, the team triggers failover, and the app tier crash-loops because it booted before SQL finished Always On recovery and exhausted its connection retries. Or the SQL pair fails over to recovery points three minutes apart because nobody enabled multi-VM sync, so the availability group comes up with the secondary ahead of the primary and must be manually reseeded – turning a 15-minute RTO target into a 90-minute incident. Or the failover works, but a runbook nobody tested flips production DNS during a test drill and takes down the live site. Or the team gets to DR and is stranded there for weeks because nobody rehearsed reprotect and failback, accruing cross-region egress and running unscaled DR capacity under real load.

Who hits this: anyone running stateful IaaS that the business cannot lose – regulated payments, ERP, line-of-business apps on VMs, SQL Server on Azure VMs, domain controllers. PaaS-native teams lean on built-in zone redundancy and geo-replication; IaaS teams own the orchestration themselves, and ASR is the tool that turns a pile of replicated disks into a rehearsed, ordered, automated recovery. To frame the whole field before the deep dive, here is every failure class this article addresses, the question it forces, and where you look first:

Failure class	What’s actually wrong	First question to ask	Where to look first	Most common single cause
RPO drift	Achieved RPO climbs above target	Is churn outrunning replication?	Replication health blade; RPO metric	Cache SA throttled / shared with app I/O
Cross-tier time skew	Tiers fail over to points minutes apart	Is multi-VM sync on?	Policy `multiVmSyncStatus`	Multi-VM sync left disabled
Boot-order crash loop	App tier dies before its dependency is up	Did the data/identity tier finish first?	Recovery-plan boot groups	VM unassigned → default parallel group
Runbook misfire	A drill mutates production	Does the runbook branch on failover type?	`RecoveryPlanContext.FailoverType`	No `Test` guard in the runbook
Stranded in DR	Can fail over, can’t get back	Was reprotect/failback rehearsed?	Replication direction after failover	Reprotect step never practised
Drill blast radius	Test failover reaches real prod	Is the test VNet truly isolated?	`vnet-asr-test` peering/routes	Peering or VPN left on the test VNet

Learning objectives

By the end of this article you can:

Explain the Azure-to-Azure (A2A) replication path end to end – Mobility service, source-region cache storage account, asynchronous replication to target managed disks, and crash- vs app-consistent recovery points – and size the cache account on churn.
Decide between zone-to-zone and region-to-region replication as a risk model, and run both where the workload warrants it.
Author a replication policy with the right RPO retention, app-consistent frequency, and multi-VM sync for multi-tier apps, and attach it via a protection-container mapping.
Build a recovery plan with tiered boot groups (1-7) that boot identity/DNS, then data, then app, then web in dependency order, with every VM explicitly assigned.
Inject pre/post Azure Automation runbooks for DNS cutover, IP/LB wiring and app startup that key off RecoveryPlanContext.FailoverType and never mutate production during a Test failover.
Run an isolated test failover into a network with no path back to production, record achieved RTO, and clean up – plus distinguish planned, unplanned and failback modes and execute reprotect.
Alert on achieved RPO as a metric and prove RPO/RTO with a recurring drill, instead of trusting a “replication healthy” dashboard.

Prerequisites & where this fits

You should already be comfortable with Azure IaaS fundamentals: how a VM, its managed disks (covered in Azure Managed Disks: Performance, Snapshots & Encryption), a VNet and a load balancer fit together, and what a resource group and a subscription each scope. You should understand availability zones versus regions – the physical-failure boundaries are the entire premise of this article, laid out in Azure Regions & Availability Zones Explained and the deeper Azure Global Infrastructure: Regions, Zones, Fault & Update Domains. Familiarity with az in Cloud Shell, reading JSON output, and basic PowerShell helps, since runbooks are PowerShell.

This sits at the top of the resilience and BCDR track. It assumes the platform-level redundancy story (VM-level HA via zones and scale sets, in Azure VM Availability & Resilience) and complements the data-tier failover stories that ASR does not replace – Azure SQL Managed Instance: Failover Groups & Link for managed SQL, and the application-layer global routing in Azure Front Door & Traffic Manager: Global Failover. It pairs with Azure Backup Vault: Immutability, MUA & Cross-Region Restore (backup is point-in-time recovery; ASR is fast failover – different tools), and with Validate Resilience with Azure Chaos Studio for proving the failover under injected fault.

A quick map of who owns what during a DR event, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure class it can cause
Source VMs / Mobility	Write capture, agent health	App + platform team	RPO drift if the agent stalls
Cache storage account	Write-burst buffer (source region)	Platform / storage	RPO spikes from throttling
Recovery Services vault	Policy, recovery points, jobs	DR / platform team	Wrong policy → bad RPO/retention
Recovery plan	Boot order, runbook hooks	DR + app team	Crash loop from wrong boot order
Automation runbooks	DNS, IP, LB, app startup	Platform + app team	Drill mutates prod; failed cutover
Target VNet / DNS / LB	Where the app lands	Network team	Stale IP, unresolved DNS, no LB
Validation / metrics	Test failover, achieved RPO	DR / SRE	False confidence; undiscovered RTO

Core concepts

Five mental models make every later decision obvious.

Replication is asynchronous and buffered at the source. For A2A there is no appliance and no process server to babysit – the Mobility service extension is pushed onto each protected VM automatically. It continuously captures disk writes and ships them to a cache storage account in the source region first; ASR then asynchronously replicates that data to managed disks in the target zone or region. The cache account decouples your application’s I/O from cross-region replication latency, and it is the single most misconfigured component. RPO is decided here: if the cache is throttled or undersized, no policy setting will save your RPO.

A recovery point is a moment you can boot from, in one of two qualities. Crash-consistent points are like pulling the power cord – the disk is captured as-is; filesystems journal-replay on boot and most apps recover, but in-flight uncommitted writes are lost. ASR takes these every 5 minutes for A2A. App-consistent points trigger VSS (Windows) or a pre/post script freeze (Linux) so the application flushes buffers before the snapshot – what you want for databases, but heavier, so taken less often (hourly is typical). Failing over to an app-consistent point loses more time but recovers cleaner.

The recovery plan is the unit of failover, not the VM. A recovery plan groups VMs, orders their boot into boot groups (1-7) executed sequentially, and lets you inject automation as pre/post actions on any group. Without one, “failover” means clicking each VM individually in the wrong order at 3 a.m. With one, it is a single, ordered, repeatable, automatable operation. Multi-VM consistency – enabled on the policy – makes a group of VMs share one crash-consistent recovery point so tiers don’t drift apart in time.

Failover has direction and three modes, and the return trip is separate. The modes are planned failover (zero data loss, source must be healthy), unplanned failover (source gone; expect data loss equal to achieved RPO), and failback (returning home). Failback is not a button – you must reprotect first, which reverses replication from DR back to the original region, re-seeding only the delta. Forget this and you can get to DR but you are stranded there.

A test failover proves recoverability with zero blast radius – if the network is isolated. Test failover spins up your VMs from a chosen recovery point into a network you specify, while production replication keeps running uninterrupted. The non-negotiable rule: fail over into a network with no peering, no VPN, and no route back to production, or your drill can corrupt production data or duplicate identity objects.
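
The isolation rule can be checked mechanically before a drill rather than trusted. The sketch below assumes you have loaded the JSON that `az network vnet show` returns for the test VNet into a dict; the function name and the three checks are illustrative, not an ASR feature, and a clean result here does not prove isolation (DNS forwarders and NSG rules still need a human look).

```python
from typing import Any


def isolation_violations(vnet: dict[str, Any]) -> list[str]:
    """Return reasons a VNet may not be isolated (empty list = none found)."""
    problems: list[str] = []

    # Any peering is a potential path back to production.
    for peering in vnet.get("virtualNetworkPeerings") or []:
        problems.append(f"peering present: {peering.get('name', '<unnamed>')}")

    for subnet in vnet.get("subnets") or []:
        name = subnet.get("name", "<unnamed>")
        # A GatewaySubnet usually means a VPN/ExpressRoute gateway is attached.
        if name == "GatewaySubnet":
            problems.append("GatewaySubnet present: possible VPN/ExpressRoute gateway")
        # A route table could steer traffic toward production; flag it for review.
        if subnet.get("routeTable"):
            problems.append(f"route table attached to subnet: {name}")

    return problems


if __name__ == "__main__":
    # Hypothetical test VNet with a peering someone forgot to remove.
    test_vnet = {
        "name": "vnet-asr-test",
        "virtualNetworkPeerings": [{"name": "to-prod-hub"}],
        "subnets": [{"name": "snet-app", "routeTable": None}],
    }
    for reason in isolation_violations(test_vnet):
        print("NOT ISOLATED:", reason)
```

Running a check like this as the first step of the drill turns "we think the test network is isolated" into a gate that stops the drill when it is not.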

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to DR
Mobility service	Agent that captures VM disk writes	On each protected VM	No agent → no replication
Cache storage account	Write-burst buffer before replication	Source region	Throttle here → RPO spike
Recovery Services vault	Container for policy, RPs, jobs	Target region	The control plane for failover
Replication policy	RPO retention + app-consistent freq	Attached via container mapping	Sets recovery quality
Recovery point (RP)	A bootable moment in time	In the vault’s history	What you fail over to
Crash-consistent RP	Power-cord snapshot, every 5 min	Auto for A2A	RPO floor ≈ 5 min + lag
App-consistent RP	VSS/freeze snapshot, hourly	Policy-controlled	Clean DB recovery
Multi-VM sync	Shared RP across a VM group	Policy flag	Stops cross-tier time skew
Recovery plan	Ordered failover of a VM set	In the vault	Boots tiers in order
Boot group (1-7)	Sequential tier in a plan	In the recovery plan	Dependency ordering
Runbook	Automation at a group boundary	Azure Automation (target region)	DNS/IP/LB/app startup
`RecoveryPlanContext`	Object passed to the runbook	Runbook parameter	Tells Test from real
Achieved RPO	Actual recoverable lag, per VM	Replication health + metric	The number that matters
Reprotect	Reverse replication for failback	After failover	Enables the return trip
Test failover	Drill into an isolated network	Triggered on a plan	Proves recoverability

Replication architecture, cache storage, and what’s actually supported

To recap the path for Azure-to-Azure replication: the Mobility service extension, pushed onto each protected VM automatically, captures writes and ships them to a cache storage account in the source region; ASR then asynchronously replicates that data to managed disks in the target region (or target zone). There is no appliance and no process server to run. The cache account is the component teams most often misconfigure, so most of the rules below concern it.

The flow is:

VM disk write -> Mobility service -> cache storage account (source region)
                                          |
                                          v
                          ASR replication -> target managed disks (target region/zone)
                                          |
                                          v
                                  recovery points (crash- and app-consistent)

Rules that bite teams in production:

The cache account lives in the source region, not the target. It absorbs write bursts and decouples app I/O from cross-region replication latency. Size it on churn (write rate), not capacity.
Use a separate, dedicated cache account for ASR. Sharing it with application workloads causes throttling that shows up as RPO spikes. Use a standard general-purpose v2 account; for high-churn VMs that breach Standard limits, ASR supports a high-churn flow but you must watch for Replica Storage Throttle events.
Premium SSD source disks are supported, but Ultra Disk is not supported as an ASR-replicated disk type at the time of writing – confirm disk SKUs against the current support matrix before you promise coverage.
Zone-to-zone requires the region to support availability zones and is configured within a single region. Region-to-region is the classic cross-region DR. The replication mechanics are identical; only the target differs.

Decide RPO at the cache layer. If the cache account is throttled or undersized, no replication policy setting will save your RPO. Provision the cache account first, separately, and monitor it as a first-class resource.

The components you provision, where each lives, and what owns its sizing:

Component	Region it lives in	Sized / chosen on	Why it must be there	Gotcha if you get it wrong
Recovery Services vault	Target region	n/a (one per DR target)	Control plane survives source loss	Vault in source region → lost with the disaster
Cache storage account	Source region	Aggregate write churn	Absorbs write bursts at the source	In target region → adds latency, breaks the model
Replica managed disks	Target region/zone	Source disk size + SKU	The bootable copy on failover	Wrong SKU → slow boot or over-spend
Mobility extension	On each source VM	Auto-installed	Captures writes	Blocked by a restrictive NSG/policy → no replication
Protection container + mapping	Vault (both fabrics)	Per source-target pair	Wires source fabric to target	Missing mapping → enable-replication fails
Target VNet / subnet	Target region	App network layout	Where the failed-over VM lands	Mismatched layout → app can’t resolve peers

The source-disk types ASR can and cannot replicate – confirm against the live support matrix before you promise coverage:

Source disk type	A2A replicated?	Replica SKU options	Notes / limit
Standard HDD (`Standard_LRS`)	Yes	Standard HDD/SSD or Premium	Cheapest replica; fine for cold tiers
Standard SSD (`StandardSSD_LRS`)	Yes	Standard SSD or Premium	Common default
Premium SSD (`Premium_LRS`)	Yes	Premium recommended	Match for IOPS-sensitive workloads
Premium SSD v2	Check matrix	Region/limit dependent	Verify support in your region first
Ultra Disk	No (at time of writing)	n/a	Not supported as ASR-replicated disk
Ephemeral OS disk	No	n/a	Stateless by design; re-image instead
Shared disks / cluster disks	Restricted	n/a	Use guest-level clustering / AG instead

Create the Recovery Services vault and the dedicated cache account up front:

# Vault for region-to-region DR (the vault lives in the TARGET region)
az backup vault create \
  --resource-group rg-dr-westus2 \
  --name rsv-dr-prod \
  --location westus2

# Dedicated cache storage account in the SOURCE region
az storage account create \
  --resource-group rg-dr-eastus2 \
  --name stasrcacheeastus2 \
  --location eastus2 \
  --sku Standard_LRS \
  --kind StorageV2 \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false

Size the cache account on churn, not capacity. A rough rule of thumb for picking the cache tier and watching for trouble:

Aggregate write churn (per source)	Cache account choice	What to watch	Action if breached
Low (< ~5 MB/s)	One Standard GPv2, shared across VMs but dedicated to ASR	RPO stays near floor	None
Moderate (~5-20 MB/s)	Standard GPv2, dedicated	`Replica Storage Throttle` events	Split high-churn VMs to own cache
High (> ~20 MB/s steady)	Multiple cache accounts; high-churn flow	Throttle events + achieved RPO climb	Enable high-churn support; add cache SAs
Spiky (batch jobs, log floods)	Dedicated cache, headroom	RPO spikes during the spike window	Schedule heavy jobs; oversize the cache
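
The rule of thumb above is easy to apply consistently if you encode it once. This is a minimal sketch that sums per-VM write churn and maps the total onto the table's steady-state rows; the thresholds are this article's rough figures, not ASR limits, the VM churn numbers are made up, and the "spiky" row is deliberately not modelled because it depends on the shape of the load rather than its average.

```python
# Thresholds mirror the rule-of-thumb table in this article; they are not ASR limits.
LOW_MAX_MB_S = 5.0
MODERATE_MAX_MB_S = 20.0

ACTIONS = {
    "low": "one dedicated Standard GPv2 cache account",
    "moderate": "dedicated Standard GPv2; watch Replica Storage Throttle events",
    "high": "multiple cache accounts; consider the high-churn flow",
}


def cache_tier(aggregate_churn_mb_s: float) -> str:
    """Classify steady aggregate write churn (MB/s) as low, moderate or high."""
    if aggregate_churn_mb_s < 0:
        raise ValueError("churn cannot be negative")
    if aggregate_churn_mb_s < LOW_MAX_MB_S:
        return "low"
    if aggregate_churn_mb_s <= MODERATE_MAX_MB_S:
        return "moderate"
    return "high"


if __name__ == "__main__":
    # Hypothetical per-VM steady write rates in MB/s.
    per_vm_churn_mb_s = {"vm-sql01": 9.5, "vm-app01": 1.2, "vm-web01": 0.4}
    total = sum(per_vm_churn_mb_s.values())
    tier = cache_tier(total)
    print(f"{total:.1f} MB/s aggregate -> {tier}: {ACTIONS[tier]}")
```

Feed it measured churn (for example from disk write metrics over a representative week) rather than provisioned throughput, since the cache is sized on what the VMs actually write.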

Enabling zone-to-zone vs region-to-region replication

The decision is a risk model, not a preference. Zone-to-zone protects against a single availability zone failing (power, cooling, network in one datacenter) and keeps you inside the region – lowest latency, no data-residency change, but no protection against a regional outage. Region-to-region is your true DR posture for a region-wide event, at the cost of cross-region replication latency and a second region’s spend. Mature platforms run both: AZ-redundant production for the common case, plus cross-region ASR for the regional event.

Dimension	Zone-to-zone (Z2Z)	Region-to-region (R2R)
Protects against	Single-zone outage (power/cooling/network)	Region-wide outage / disaster
Target location	Another AZ, same region	A different (usually paired) region
Replication latency	Very low (intra-region)	Higher (inter-region distance)
Data residency	Unchanged	Changes region – compliance review needed
Failover scope	Same region, different zone	New region entirely
Typical RPO floor	~5 min + small lag	~5 min + inter-region lag
Cost of replica	Replica disks + protected-instance fee	Same, plus inter-region egress on replicate
Requirement	Region must support zones	A valid target region (often the pair)
What it does not cover	A full region loss	(covers region loss; not a substitute for backup)
Best for	AZ-resilience for the common failure	True DR for the rare regional event

The cleanest way to enable replication at scale is the portal’s “Enable replication” wizard for the first pass, then codify it. With the CLI, the modern path uses az site-recovery (install the extension first):

az extension add --name site-recovery

# A2A replication for a single VM, region-to-region (eastus2 -> westus2).
# Run after the vault, fabrics, containers, and protection container
# mapping exist (the portal wizard creates these on first use).
# The CLI takes the REST provider fields in kebab-case under an "a2a" key
# (recoveryContainerId -> recovery-container-id); confirm the exact keys with
# `az site-recovery protected-item create --help` for your extension version.
az site-recovery protected-item create \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --fabric-name asr-a2a-default-eastus2 \
  --protection-container-name asr-a2a-default-eastus2-container \
  --replication-protected-item-name vm-app01 \
  --policy-id "/subscriptions/<sub>/resourceGroups/rg-dr-westus2/providers/Microsoft.RecoveryServices/vaults/rsv-dr-prod/replicationPolicies/policy-prod-5min" \
  --provider-specific-details '{
    "a2a": {
      "fabric-object-id": "/subscriptions/<sub>/resourceGroups/rg-app-eastus2/providers/Microsoft.Compute/virtualMachines/vm-app01",
      "recovery-container-id": "<target-container-id>",
      "recovery-resource-group-id": "/subscriptions/<sub>/resourceGroups/rg-app-westus2",
      "vm-managed-disks": [{
        "disk-id": "<source-disk-id>",
        "primary-staging-azure-storage-account-id": "/subscriptions/<sub>/resourceGroups/rg-dr-eastus2/providers/Microsoft.Storage/storageAccounts/stasrcacheeastus2",
        "recovery-resource-group-id": "/subscriptions/<sub>/resourceGroups/rg-app-westus2",
        "recovery-replica-disk-account-type": "Premium_LRS",
        "recovery-target-disk-account-type": "Premium_LRS"
      }]
    }
  }'

For zone-to-zone, the difference is the target: you set recoveryAvailabilityZone to the target zone and keep recoveryResourceGroupId in the same region as the source. Everything else – policy, cache, recovery plans – is identical. The provider-specific fields that change between the two modes:

Provider field	Zone-to-zone value	Region-to-region value	Purpose
`recoveryResourceGroupId`	RG in the same region	RG in the target region	Where failed-over resources land
`recoveryAvailabilityZone`	Target zone (e.g. `3`)	omit / not zone-pinned	Pins the replica to a zone
`recoveryContainerId`	Target container (same region)	Target container (other region)	Wires the protection container
`recoveryReplicaDiskAccountType`	Replica disk SKU	Replica disk SKU	Cost lever on the standby copy
`recoveryTargetDiskAccountType`	Post-failover disk SKU	Post-failover disk SKU	Performance once running
Cache SA	Source region	Source region	Identical in both modes
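
Because only a few fields differ between the two modes, it helps to generate them from one function so a zone-to-zone item never accidentally omits its zone. The sketch below builds just the mode-specific fields from the table, using the REST field names; the function name, the `mode` strings and the placeholder IDs are assumptions for illustration, and the result is a fragment to merge into the full provider details, not a complete request body.

```python
from typing import Any, Optional


def a2a_target_fields(
    mode: str,
    recovery_resource_group_id: str,
    recovery_container_id: str,
    zone: Optional[str] = None,
) -> dict[str, Any]:
    """Return the provider fields that differ between zone-to-zone and region-to-region."""
    if mode not in ("zone-to-zone", "region-to-region"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "zone-to-zone" and not zone:
        raise ValueError("zone-to-zone replication needs a target zone")
    if mode == "region-to-region" and zone:
        raise ValueError("region-to-region sketch does not pin a zone")

    fields: dict[str, Any] = {
        "recoveryResourceGroupId": recovery_resource_group_id,
        "recoveryContainerId": recovery_container_id,
    }
    # Only zone-to-zone pins the replica to an availability zone.
    if mode == "zone-to-zone":
        fields["recoveryAvailabilityZone"] = zone
    return fields


if __name__ == "__main__":
    # Placeholder IDs; substitute real resource IDs from your subscription.
    print(a2a_target_fields("zone-to-zone", "<same-region-rg-id>", "<same-region-container-id>", zone="3"))
    print(a2a_target_fields("region-to-region", "<target-region-rg-id>", "<target-region-container-id>"))
```

Rejecting a missing zone up front is the point of the design: the failure shows at build time instead of as a replica that silently lands unpinned.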

Be honest about cost. Cross-region ASR bills you continuously for replicated storage in the target, the cache account in the source, inter-region egress on the replicated churn, and the protected-instance fee per VM. Compute in the target region does not run until you fail over, which is the whole point, but every other charge accrues whether or not you ever fail over. Right-size the replica disk SKU (recoveryReplicaDiskAccountType) to control the storage share of the bill.

Replication policies: RPO, retention, and app-consistent snapshots

A replication policy controls three knobs and you attach it to a protection container mapping. Get these right and most of your DR posture is set:

Setting	What it controls	Sensible default
Recovery-point retention (`recoveryPointHistory`)	How far back you can recover (the recovery-point history window)	24 hours (up to 72 for A2A)
App-consistent frequency (`appConsistentFrequencyInMinutes`)	How often an application-consistent snapshot is taken	60 minutes (0 to disable)
Crash-consistent frequency	Taken every 5 minutes automatically for A2A	Fixed at 5 min

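# The CLI takes the REST policy fields in kebab-case under an "a2a" key
# (recoveryPointHistory -> recovery-point-history); confirm the exact keys with
# `az site-recovery policy create --help` for your extension version.
# recovery-point-history is in minutes: 1440 = 24 hours.
az site-recovery policy create \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --name policy-prod-5min \
  --provider-input '{
    "a2a": {
      "recovery-point-history": 1440,
      "app-consistent-frequency-in-minutes": 60,
      "crash-consistent-frequency-in-minutes": 5,
      "multi-vm-sync-status": "Enable"
    }
  }'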

The full policy surface, with valid ranges and the trade-off of each knob:

Policy field (REST)	What it controls	Default / typical	Valid range	Trade-off / gotcha
`recoveryPointHistory` (min)	Retention window for recovery points	1440 (24h)	up to 4320 (72h) for A2A	More history = more replica storage cost
`crashConsistentFrequencyInMinutes`	Crash-consistent cadence	5	Fixed at 5 for A2A	Your RPO floor ≈ this + replication lag
`appConsistentFrequencyInMinutes`	App-consistent cadence	60	0 (disable) to 720	Heavier; too frequent → VSS/freeze overhead
`multiVmSyncStatus`	Shared RP across a VM group	Disable (must opt in)	Enable / Disable	Enable for any multi-VM app
`instanceType`	Provider type	A2A	A2A (for Azure VMs)	Wrong type → policy won’t attach
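
A policy body is small enough to lint before it reaches the vault. The sketch below checks a REST-shaped policy dict against the ranges in the table above; it assumes retention is expressed in minutes (1440 for 24 hours), and the limits are the ones quoted in this article rather than values read from the service, so adjust them if the current documentation differs.

```python
MAX_HISTORY_MIN = 72 * 60        # 72 hours, per this article's A2A figure
MAX_APP_CONSISTENT_MIN = 720     # 12 hours
CRASH_CONSISTENT_MIN = 5         # fixed cadence for A2A


def validate_a2a_policy(policy: dict) -> list:
    """Return a list of problems with a REST-shaped A2A policy (empty = looks sane)."""
    errors = []

    if policy.get("instanceType") != "A2A":
        errors.append("instanceType must be 'A2A' for Azure VMs")

    history = policy.get("recoveryPointHistory")
    if not isinstance(history, int) or not 0 < history <= MAX_HISTORY_MIN:
        errors.append(f"recoveryPointHistory must be 1..{MAX_HISTORY_MIN} minutes")

    app = policy.get("appConsistentFrequencyInMinutes")
    if not isinstance(app, int) or not 0 <= app <= MAX_APP_CONSISTENT_MIN:
        errors.append(f"appConsistentFrequencyInMinutes must be 0..{MAX_APP_CONSISTENT_MIN}")

    if policy.get("crashConsistentFrequencyInMinutes") != CRASH_CONSISTENT_MIN:
        errors.append(f"crashConsistentFrequencyInMinutes must be {CRASH_CONSISTENT_MIN}")

    if policy.get("multiVmSyncStatus") not in ("Enable", "Disable"):
        errors.append("multiVmSyncStatus must be 'Enable' or 'Disable'")

    return errors


if __name__ == "__main__":
    policy = {
        "instanceType": "A2A",
        "recoveryPointHistory": 1440,
        "appConsistentFrequencyInMinutes": 60,
        "crashConsistentFrequencyInMinutes": 5,
        "multiVmSyncStatus": "Enable",
    }
    print(validate_a2a_policy(policy) or "policy looks sane")
```

The most common slip this catches is a unit mix-up: passing 86400 (seconds) or 24 (hours) where minutes are expected.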

The distinction that matters for recovery quality:

Crash-consistent recovery points are like pulling the power cord – the disk is captured as-is. Filesystems journal-replay on boot and most apps recover, but in-flight, uncommitted writes are lost. ASR takes these every 5 minutes for A2A, so your effective RPO floor is roughly 5 minutes plus replication lag.
App-consistent recovery points trigger VSS (Windows) or a pre/post script freeze (Linux) so the application flushes buffers to disk before the snapshot. These are what you want for databases and stateful apps, but they are heavier, so you take them less often (hourly is typical). Failing over to an app-consistent point loses more time but recovers cleaner.

The two recovery-point qualities side by side, so you pick the right one per workload:

Property	Crash-consistent	App-consistent
Mechanism	Disk-as-is snapshot	VSS (Windows) / pre-post freeze (Linux)
Default cadence (A2A)	Every 5 min	Hourly (policy-controlled)
Captures in-flight writes	No (lost)	Yes (flushed first)
Overhead on the VM	Minimal	Higher (quiesce + freeze)
Best for	Stateless / journaled filesystems	Databases, stateful apps
Failover data loss	Lower (more recent point)	Higher (less frequent)
Failover recovery quality	App may replay/repair	Cleaner, app-consistent
When you choose it	Need the freshest point	Need transactional integrity
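
The trade-off in the table is arithmetic: the freshest point minimises lost time, while the freshest app-consistent point may be much older. This sketch picks a point either way and reports the data-loss window; the `RecoveryPoint` type, the function name and the timestamps are hypothetical, and real recovery points would come from the vault's recovery-point listing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool


def pick_recovery_point(points, failure_time, require_app_consistent):
    """Return the newest usable point taken at or before the failure."""
    candidates = [
        p for p in points
        if p.taken_at <= failure_time and (p.app_consistent or not require_app_consistent)
    ]
    if not candidates:
        raise LookupError("no recovery point satisfies the request")
    return max(candidates, key=lambda p: p.taken_at)


def data_loss(point, failure_time) -> timedelta:
    """Writes made after the chosen point and before the failure are lost."""
    return failure_time - point.taken_at


if __name__ == "__main__":
    failure = datetime(2024, 1, 1, 10, 47)
    points = [
        RecoveryPoint(datetime(2024, 1, 1, 10, 0), app_consistent=True),
        RecoveryPoint(datetime(2024, 1, 1, 10, 40), app_consistent=False),
        RecoveryPoint(datetime(2024, 1, 1, 10, 45), app_consistent=False),
    ]
    for strict in (False, True):
        chosen = pick_recovery_point(points, failure, require_app_consistent=strict)
        print(f"app-consistent required={strict}: lose {data_loss(chosen, failure)}")
```

With these sample points the freshest option loses 2 minutes and the app-consistent option loses 47, which is the decision you want made per workload in advance, not during the incident.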

multiVmSyncStatus: Enable is non-negotiable for multi-tier apps that span VMs. It creates shared, crash-consistent recovery points across a group of VMs so your web, app, and DB tiers fail over to the same point in time – otherwise your app tier might be 4 minutes ahead of your DB tier after failover, which is a data-integrity incident waiting to happen.

RPO is a target, not a guarantee. ASR continuously computes an achieved RPO per VM (visible in the replication health blade and queryable via metrics). If your churn outruns replication bandwidth, achieved RPO climbs above target and the VM goes to a warning state. Alert on achieved RPO, not just on “is replication healthy.”
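
Alerting on achieved RPO mostly comes down to separating a brief blip from a sustained breach. A minimal sketch of that evaluation follows, assuming you already export per-VM achieved-RPO samples in seconds, oldest first; the state names and the three-sample window are assumptions for illustration and are not ASR's own health states.

```python
def rpo_state(samples_s, target_s, sustained=3):
    """Classify achieved RPO: 'ok', 'warning' (above target) or 'breach' (sustained)."""
    if not samples_s:
        raise ValueError("need at least one RPO sample")
    if sustained < 1:
        raise ValueError("sustained must be at least 1")

    # The most recent sample decides whether we are currently within target.
    if samples_s[-1] <= target_s:
        return "ok"

    # Above target now; call it a breach only if the whole recent window is above.
    recent = samples_s[-sustained:]
    if len(recent) == sustained and all(s > target_s for s in recent):
        return "breach"
    return "warning"


if __name__ == "__main__":
    target = 300  # a 5-minute RPO target, in seconds
    history = {
        "vm-web01": [120, 140, 130],
        "vm-app01": [200, 250, 420],
        "vm-sql01": [400, 520, 610],
    }
    for vm, samples in history.items():
        print(vm, rpo_state(samples, target))
```

Page on the sustained state and ticket the transient one; that keeps the alert meaningful enough that people still react to it.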

The replication health states you will see and what each one means for a failover decision:

Replication health	What it means	Safe to fail over?	First action
`Normal`	Replication on track, RPO under target	Yes	None
`Warning`	Achieved RPO above target, or transient issue	Yes, with data-loss awareness	Investigate churn / cache throttle
`Critical`	Replication broken or far behind	Risky – expect larger data loss	Fix replication; failover loses more
`NotApplicable`	Item just enabled / mid initial sync	No (still seeding)	Wait for initial replication

ASR surfaces problems as health events and failed jobs rather than HTTP codes; here is the reference you scan when a protected item or a failover job goes red, what it means, and the fix:

Event / job error	Where it shows	Likely cause	How to confirm	First fix
`Replica Storage Throttle`	Replication health blade	Cache SA throttled (shared / undersized)	Cache account metrics; achieved RPO climbing	Dedicated GPv2 cache; split high-churn VMs
RPO exceeds the configured threshold	Health + `RPO` metric	Churn > replication bandwidth	RPO metric trend vs target	Reduce churn / raise bandwidth / fix cache
Mobility service not reachable / heartbeat lost	Protected-item health	Agent stalled, VM down, egress blocked	Agent status on the VM; NSG/UDR egress	Restart Mobility; allow ASR egress
Initial replication stuck at `NotApplicable`	Protected-item state	Unsupported disk (Ultra) or blocked egress	Disk SKU; outbound rules	Swap disk SKU; open egress
`DisableProtection`/re-enable churn	Jobs blade	Policy/disk change forced re-protect	Activity log; job details	Avoid mid-flight policy churn
Test failover job `Failed`	Jobs blade	Target network missing / boot-group error	Job step that failed; target VNet id	Fix target network; reassign boot group
Commit/cleanup job `Failed`	Jobs blade	Stale test resources / locked items	Job details; leftover test VMs	Re-run cleanup; remove locks
Reprotect job `Failed`	Jobs blade	Source RG/cache not ready for reverse	Reprotect job step	Pre-create reverse cache/RG; retry
Runbook action `Failed` in a plan	Plan job timeline	Cmdlet/RBAC/missing module in runbook	Automation job stream/output	Fix module/RBAC; make idempotent
App-consistent RP not generated	RP history	VSS error (Win) / freeze script (Linux) failed	VSS event log; Linux script logs	Repair VSS / app-consistent script

The real ASR limits and numbers worth knowing before you size a program (confirm against the current docs, since several scale with region and time):

Limit / number	Value (typical)	Why it matters
Crash-consistent cadence (A2A)	5 minutes (fixed)	Sets your RPO floor (≈ 5 min + lag)
Recovery-point retention (A2A)	up to 72 hours	How far back you can recover
App-consistent frequency	1 min - 12 hours (0 disables)	Trade clean recovery vs VM overhead
Boot groups per recovery plan	7	Tiers you can order
Protected items per recovery plan	up to 100	Plan scope ceiling
Recovery plans per vault	up to 50 (region-dependent)	DR-program scale
Protected items per vault	hundreds-thousands (region-dependent)	Estate scale ceiling
DNS TTL for cutover records	30-60 s (your choice)	Cutover speed during failover

Building recovery plans with tiered boot groups

A recovery plan is the unit of failover. It groups VMs, orders their boot, and lets you inject automation. Without one, “failover” means clicking each VM individually in the wrong order at 3 a.m. With one, it is a single, repeatable, ordered operation.

The structure is boot groups (1-7) executed sequentially: every VM in Group 1 boots and reaches the configured wait condition before Group 2 starts. Model your dependency graph onto groups – bottom of the dependency stack first:

Group 1: Domain controllers / DNS / identity. Nothing else can authenticate until these are up.
Group 2: Data tier – databases, caches, message brokers.
Group 3: Application / API tier.
Group 4: Web / frontend / load-balancer-facing VMs.

A canonical four-tier mapping, with the readiness signal that gates each group and the automation it typically carries:

Boot group	Tier	Example VMs	Gate before next group	Typical runbook hook
1	Identity / DNS	`vm-dc01`, `vm-dc02`	DCs answer LDAP; DNS resolves	Post: confirm DNS health
2	Data	`vm-sql01`, `vm-sql02` (AG), Redis	AG primary serving on listener	Post: AG readiness gate (poll listener)
3	App / API	`vm-app01`, `vm-app02`	App health endpoint 200	Pre: set app config / connection strings
4	Web / frontend	`vm-web01`, `vm-web02`	LB backend healthy	Post: DNS cutover + LB wiring
(default)	Unassigned	any VM not placed	None – boots in parallel	None – this is the failure mode to avoid
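
A gate like "poll listener" means waiting on a readiness signal with a deadline instead of sleeping for a fixed time. The sketch below shows the polling pattern for a TCP endpoint; the function name and defaults are assumptions, and a successful TCP connect only proves something is listening, so a real availability-group gate would go on to run a query against the listener. An actual ASR runbook would express the same loop in PowerShell.

```python
import socket
import time


def wait_for_tcp(host, port, timeout_s=600.0, interval_s=5.0, connect_timeout_s=3.0):
    """Poll until host:port accepts a TCP connection; return False on deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True
        except OSError:
            pass

        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, remaining))
```

A call such as `wait_for_tcp("sql-listener.dr.example", 1433)` (a hypothetical listener name) returns as soon as the endpoint answers and fails loudly when it never does, which is exactly what a fixed sleep cannot do.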

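A recovery plan can be created in the portal, but for repeatability define it as JSON and create it via REST/ARM. Here is the shape of a tiered plan with three boot groups (identity, then data, then app and web together to keep the example short) and the startGroupActions/endGroupActions arrays where runbook hooks attach (covered next):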

{
  "properties": {
    "primaryFabricId": "<source-fabric-id>",
    "recoveryFabricId": "<target-fabric-id>",
    "failoverDeploymentModel": "ResourceManager",
    "groups": [
      {
        "groupType": "Boot",
        "replicationProtectedItems": [
          { "id": "<vm-dc01-protected-item-id>", "virtualMachineId": "<vm-dc01-id>" }
        ]
      },
      {
        "groupType": "Boot",
        "replicationProtectedItems": [
          { "id": "<vm-sql01-protected-item-id>", "virtualMachineId": "<vm-sql01-id>" }
        ],
        "startGroupActions": [],
        "endGroupActions": []
      },
      {
        "groupType": "Boot",
        "replicationProtectedItems": [
          { "id": "<vm-app01-protected-item-id>", "virtualMachineId": "<vm-app01-id>" },
          { "id": "<vm-web01-protected-item-id>", "virtualMachineId": "<vm-web01-id>" }
        ]
      }
    ]
  }
}

VMs not added to any boot group land in a default group and boot in parallel with no ordering – which is exactly the chaos a recovery plan exists to prevent. Be explicit: every protected VM should have an assigned group. The group/action vocabulary you compose a plan from:

Plan element	What it is	Where it runs	Use it for
`groupType: Boot`	A boot group (1-7)	Sequentially	Ordering tiers
`replicationProtectedItems`	The VMs in a group	Within the group	Assigning every VM a tier
`startGroupActions`	Pre-actions for a group	Before the group boots	Set config before VMs start
`endGroupActions`	Post-actions for a group	After the group boots	DNS/LB wiring, readiness gates
Manual action	A pause for a human step	At a group boundary	A required manual check
Script/runbook action	An Automation runbook	Pre or post	DNS, IP, LB, app startup
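
The "every protected VM should have an assigned group" rule is easy to check mechanically before a drill. The sketch below is a minimal illustration in Python, assuming the plan JSON has the shape shown above; `check_plan` and `MAX_BOOT_GROUPS` are hypothetical names, not part of any ASR SDK, and the limit of 7 is taken from the "(1-7)" range in the table.

```python
from __future__ import annotations

MAX_BOOT_GROUPS = 7  # assumption: taken from the "(1-7)" range in the table above


def check_plan(plan: dict, protected_vm_ids: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means every VM has one tier."""
    problems: list[str] = []
    boot_groups = [g for g in plan["properties"]["groups"] if g.get("groupType") == "Boot"]
    if len(boot_groups) > MAX_BOOT_GROUPS:
        problems.append(f"{len(boot_groups)} boot groups exceeds {MAX_BOOT_GROUPS}")

    first_seen: dict[str, int] = {}  # VM id -> 1-based index of the group that claimed it
    for index, group in enumerate(boot_groups, start=1):
        for item in group.get("replicationProtectedItems", []):
            vm_id = item["virtualMachineId"]
            if vm_id in first_seen:
                problems.append(f"{vm_id} is in groups {first_seen[vm_id]} and {index}")
            else:
                first_seen[vm_id] = index

    # Anything protected but never placed would land in the default parallel group.
    for vm_id in sorted(protected_vm_ids - set(first_seen)):
        problems.append(f"{vm_id} is not assigned to any boot group")
    return problems


plan = {"properties": {"groups": [
    {"groupType": "Boot", "replicationProtectedItems": [{"virtualMachineId": "vm-dc01"}]},
    {"groupType": "Boot", "replicationProtectedItems": [{"virtualMachineId": "vm-sql01"}]},
]}}
print(check_plan(plan, {"vm-dc01", "vm-sql01", "vm-web01"}))
# ['vm-web01 is not assigned to any boot group']
```

Running a check like this in CI against the plan definition catches an unplaced VM before a failover does.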

Injecting pre/post automation runbooks for DNS, IPs, and app startup

Booting VMs is the easy 80%. The failure-prone 20% is everything around the boot: repointing DNS to the DR region, assigning the right private/public IPs, attaching the failed-over VMs to the correct load balancer backend pool, and kicking off application-level startup. ASR lets you attach Azure Automation runbooks as pre- or post-actions at the start or end of any boot group.

Two things to get right:

The runbook must live in an Azure Automation account in the target region with the Az modules imported, and authenticate with a system-assigned managed identity whose RBAC is scoped to the DR resource group. (Run As accounts and the AzureRM modules are retired; migrate any runbook that still depends on them.)
ASR passes a RecoveryPlanContext object to the runbook as a parameter. Your runbook keys off FailoverType (Test, Planned, Unplanned) and FailoverDirection so the same runbook behaves correctly in a drill vs. a real event – e.g. it must not flip production DNS during a test failover.

The RecoveryPlanContext fields your runbook branches on:

Context field	Type / values	What you do with it
`FailoverType`	`Test` / `Planned` / `Unplanned`	Skip prod mutations on `Test`
`FailoverDirection`	`PrimaryToSecondary` / `SecondaryToPrimary` (the REST API and CLI call the same directions `PrimaryToRecovery` / `RecoveryToPrimary`)	Forward cutover vs failback wiring
`GroupId`	The boot group index	Scope actions to the right tier
`VmMap`	Map of source → failed-over VM	Look up DR VM names/IDs
`RecoveryPlanName`	The plan’s name	Logging / idempotency keys
`FailoverJobId`	The orchestration job id	Correlate logs to the run
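
The branching rule is small enough to isolate and unit-test on its own. The following is a minimal sketch in Python (Azure Automation also runs Python runbooks, but this is plain logic with no Azure calls); `parse_context` and `may_mutate_production` are hypothetical helper names, and the sample payload contains only fields listed in the table above.

```python
from __future__ import annotations

import json

# Only an explicit real failover may touch production; everything else is a drill.
REAL_FAILOVER_TYPES = {"Planned", "Unplanned"}


def parse_context(raw: str | dict) -> dict:
    """Accept the context as a JSON string or an already-parsed mapping."""
    context = json.loads(raw) if isinstance(raw, str) else dict(raw)
    if "FailoverType" not in context:
        raise ValueError("RecoveryPlanContext has no FailoverType")
    return context


def may_mutate_production(context: dict) -> bool:
    """Fail closed: an unknown or missing-looking type is treated like a test."""
    return context["FailoverType"] in REAL_FAILOVER_TYPES


drill = parse_context('{"RecoveryPlanName": "rp-prod-app", "FailoverType": "Test", "GroupId": "4"}')
print(may_mutate_production(drill))  # False
```

Checking for membership in the set of real failover types, rather than testing `!= "Test"`, means a typo or an unexpected value skips the production cutover instead of triggering it.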

The runbook prerequisites and the failure each one prevents:

Prerequisite	Why	Failure if missing
Automation account in target region	Survives source-region loss	Runbook unreachable during a real DR event
`Az` modules imported	Runbook uses `Az` cmdlets	Cmdlet-not-found at the worst time
Managed identity (system-assigned)	No stored creds to leak/expire	Run As cert expiry breaks the runbook
RBAC scoped to DR RG (least privilege)	Runbook can wire DNS/LB only	Either it can’t act, or it’s over-permissioned
Branch on `FailoverType`	Tell drill from disaster	Drill mutates production
Idempotent logic	Re-runs and retries are safe	Double-apply / error on retry

Here is a production-shaped post-group runbook that repoints DNS at the DR load balancer's frontend IP, but only on a real failover:

param(
    [Parameter(Mandatory = $true)]
    [object]$RecoveryPlanContext
)

# ASR may pass the context as a JSON string depending on engine version.
if ($RecoveryPlanContext -is [string]) {
    $RecoveryPlanContext = $RecoveryPlanContext | ConvertFrom-Json
}

Connect-AzAccount -Identity | Out-Null

$failoverType = $RecoveryPlanContext.FailoverType   # Test | Planned | Unplanned
Write-Output "Recovery plan failover type: $failoverType"

# NEVER mutate production DNS during a test failover.
if ($failoverType -eq "Test") {
    Write-Output "Test failover detected -- skipping production DNS/LB changes."
    return
}

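# DR resource group that holds the DR load balancer's public IP.
$drRg = "rg-app-westus2"

# Re-point the app A record to the DR load balancer's frontend IP.
# Set-AzDnsRecordSet takes a record-set object, so fetch it, edit it, then write it back.
$drLbIp = (Get-AzPublicIpAddress -ResourceGroupName $drRg -Name "pip-app-lb-dr").IpAddress
$rs = Get-AzDnsRecordSet -ResourceGroupName "rg-dns" -ZoneName "app.contoso.com" `
    -Name "@" -RecordType A

# Idempotent: skip the write when the record already points at the DR LB.
if ($rs.Records.Count -eq 1 -and $rs.Records[0].Ipv4Address -eq $drLbIp -and $rs.Ttl -eq 60) {
    Write-Output "DNS app.contoso.com already points to DR LB $drLbIp -- nothing to do."
    return
}

$rs.Ttl = 60
$rs.Records.Clear()
Add-AzDnsRecordConfig -RecordSet $rs -Ipv4Address $drLbIp | Out-Null
Set-AzDnsRecordSet -RecordSet $rs | Out-Null

Write-Output "DNS app.contoso.com repointed to DR LB $drLbIp (TTL 60s)."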

Two patterns I insist on:

Set a short DNS TTL (30-60s) on the records you fail over, permanently. You cannot shrink TTL during an incident – caches already hold the old value. Low TTL on DR-relevant records is a standing cost you pay for fast cutover.
Make runbooks idempotent. A drill that re-runs, or a partial failover you retry, must not double-apply or error out. Check current state before mutating.
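
"Check current state before mutating" can be expressed as a pure function that returns either the change to apply or nothing. This is a minimal sketch in Python, not the article's runbook: `ARecordSet` and `plan_dns_update` are hypothetical names, and the IP addresses are documentation-range placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ARecordSet:
    """The only state the cutover cares about: the A record's targets and TTL."""
    ips: tuple[str, ...]
    ttl: int


def plan_dns_update(current: ARecordSet, dr_ip: str, ttl: int = 60) -> ARecordSet | None:
    """Return the record set to write, or None when the record is already correct."""
    desired = ARecordSet(ips=(dr_ip,), ttl=ttl)
    return None if current == desired else desired


live = ARecordSet(ips=("203.0.113.10",), ttl=60)   # still pointing at the source
first = plan_dns_update(live, "198.51.100.20")      # first run: a change is needed
second = plan_dns_update(first, "198.51.100.20")    # re-run: nothing left to do
print(first, second)
```

Because the second call returns `None`, a retried or re-run drill makes no further write, which is the property the pattern asks for.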

The post-failover wiring tasks, where to automate them, and what bites if you skip them:

Cutover task	Automate at	Mechanism	What breaks if skipped
DNS repoint to DR	Post (web group)	`Set-AzDnsRecordSet` (TTL 30-60s)	Clients still hit dead source IPs
Private IP assignment	Replication settings / pre	Target NIC config on the protected item	App config points at wrong/stale IP
Public IP attach	Post	`Set-AzNetworkInterfaceIpConfig` + `Set-AzNetworkInterface` on the DR NIC	No ingress to the DR frontend
LB backend pool membership	Post (web group)	Add DR NICs to the DR LB pool	Traffic lands on no healthy backend
App / service startup	Post (app group)	Invoke run command / app health gate	App “up” but service not started
Data-tier readiness gate	Post (data group)	Poll AG listener before app group	App boots before DB → crash loop

Test failover into an isolated network – without touching production

This is the feature that makes ASR trustworthy: test failover spins up your VMs from a chosen recovery point into a network you specify, while production replication keeps running uninterrupted. Done right, it proves recoverability with zero blast radius.

The non-negotiable rule: fail over into an isolated VNet that has no peering, no VPN, and no route back to production. If your test VMs can reach the real domain controllers or the real database, your drill can corrupt production data or duplicate identity objects. Build a dedicated vnet-asr-test in the DR region with the same address space layout (so app configs resolve) but isolated.

# Isolated DR-test VNet: same subnet layout, NO peering, NO gateway.
az network vnet create \
  --resource-group rg-dr-westus2 \
  --name vnet-asr-test \
  --location westus2 \
  --address-prefixes 10.99.0.0/16 \
  --subnet-name snet-app --subnet-prefixes 10.99.1.0/24

# Trigger a test failover for a recovery plan into the isolated network.
az site-recovery recovery-plan test-failover \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --recovery-plan-name rp-prod-app \
  --failover-direction PrimaryToRecovery \
  --network-id "/subscriptions/<sub>/resourceGroups/rg-dr-westus2/providers/Microsoft.Network/virtualNetworks/vnet-asr-test" \
  --network-type VmNetworkAsInput

The isolation checklist for the test VNet – every item is a way a “test” can leak into production:

Isolation control	Required state	How to verify	Leak if wrong
VNet peering	None to prod VNets	`az network vnet peering list` is empty	Test VMs reach real DCs/DB
VPN / ExpressRoute gateway	None	No gateway subnet wired	Path back to on-prem prod
User-defined routes	No route to prod ranges	`az network route-table` review	Traffic forwarded to prod
DNS	Isolated or test resolver	Check NIC DNS settings	Test VM registers in prod DNS
NSG outbound	Deny to prod CIDRs	NSG rules review	Drill calls live services
Address layout	Same layout, isolated space	`10.99.0.0/16` mirrors prod subnets	App config can’t resolve peers
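
The checklist can be turned into a pre-drill gate that refuses to start when any path to production exists. The sketch below is a minimal Python illustration under stated assumptions: the `vnet` dictionary (keys `peerings`, `has_gateway`, `routes`) is a hypothetical summary you would assemble from the CLI output, and `find_isolation_leaks` is not an Azure API.

```python
from __future__ import annotations

import ipaddress


def find_isolation_leaks(vnet: dict, prod_cidrs: list[str]) -> list[str]:
    """List every way the test VNet could reach production; empty means isolated."""
    leaks: list[str] = []
    for peering in vnet.get("peerings", []):
        leaks.append(f"peering to {peering}")
    if vnet.get("has_gateway", False):
        leaks.append("VPN/ExpressRoute gateway attached")

    prod_networks = [ipaddress.ip_network(cidr) for cidr in prod_cidrs]
    for route in vnet.get("routes", []):
        destination = ipaddress.ip_network(route)
        # Conservative: a default route (0.0.0.0/0) overlaps everything and is flagged.
        if any(destination.overlaps(net) for net in prod_networks):
            leaks.append(f"route {route} reaches production")
    return leaks


test_vnet = {"peerings": [], "has_gateway": False, "routes": ["10.0.0.0/16"]}
print(find_isolation_leaks(test_vnet, ["10.0.1.0/24"]))
# ['route 10.0.0.0/16 reaches production']
```

A drill script would abort when the returned list is non-empty, so isolation is verified every time rather than assumed.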

After validation, you must run cleanup to delete the test VMs and resources – otherwise you keep paying for them and the plan state stays dirty. Record findings (what booted, what broke, time to app-ready) as part of cleanup, because that is your drill evidence.

az site-recovery recovery-plan test-failover-cleanup \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --recovery-plan-name rp-prod-app \
  --comments "Q2 DR drill -- app-ready in 11m, DNS runbook OK, fixed SQL boot-group wait."

A test failover that you never clean up is worse than no drill: it bills continuously and leaves the recovery plan unable to run again. Treat cleanup as part of the drill, not an afterthought, and automate the reminder.

Planned, unplanned, and failback with reprotection

Three failover modes, each for a different situation:

Planned failover (region-to-region): zero data loss. ASR shuts down the source, flushes the final pending data, then brings up the target. Use this for anticipated events – scheduled datacenter maintenance, a planned region migration. The source must be reachable. (Note: planned failover semantics differ between zone-to-zone and region-to-region; for A2A region pairs it is available while the source is healthy.)
Unplanned failover: the source is gone or unreachable. ASR fails over to the latest (or a chosen earlier) recovery point. Expect some data loss equal to your achieved RPO at the moment of the outage. This is the real-disaster path.
Failback: returning to the original region after it recovers. This is not a single button – you must reprotect first.

The three modes above compared alongside test failover, so you pick the right one under pressure:

Mode	When to use	Source state	Data loss	Replication kept	Recovery point used
Test failover	Drill / validation	Healthy (untouched)	None (prod unaffected)	Yes – prod keeps replicating	Any (you choose)
Planned failover	Anticipated event / migration	Must be healthy	Zero (final flush)	Stops after cutover	Latest after flush
Unplanned failover	Real disaster	Gone / unreachable	≈ achieved RPO	Was already lagging	Latest or chosen earlier
Failback (after reprotect)	Return home	DR is current primary	Zero (planned back)	Reversed during reprotect	Latest delta

The recovery-point selection options at failover, and when each is right:

Recovery-point option	What it gives you	When to choose it
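Latest (lowest RPO)	Processes all data already sent to ASR, then fails over to that point	Minimal data loss matters most; failover takes longer while data is processed
Latest processed	Most recent point ASR has already processed	Fastest failover (lowest RTO); slightly older data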
Latest app-consistent	Most recent VSS/freeze point	Databases needing clean state
Custom (pick from history)	A specific earlier point	Recover before a corruption/ransomware event
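
The "latest app-consistent" and "custom" choices both come down to filtering the recovery-point history and taking the newest survivor. This is a minimal Python sketch, assuming each point is a dictionary with a `time` and an `app_consistent` flag; that shape and the `pick_recovery_point` name are illustrative, not the ASR API.

```python
from __future__ import annotations

from datetime import datetime


def pick_recovery_point(points: list[dict], app_consistent_only: bool = False,
                        before: datetime | None = None) -> dict:
    """Newest point that satisfies the filters; raise if none qualifies."""
    candidates = [
        p for p in points
        if (not app_consistent_only or p["app_consistent"])
        and (before is None or p["time"] < before)
    ]
    if not candidates:
        raise LookupError("no recovery point matches the requested filters")
    return max(candidates, key=lambda p: p["time"])


history = [
    {"time": datetime(2024, 5, 1, 9, 0), "app_consistent": True},
    {"time": datetime(2024, 5, 1, 9, 55), "app_consistent": False},
    {"time": datetime(2024, 5, 1, 10, 0), "app_consistent": True},
    {"time": datetime(2024, 5, 1, 10, 5), "app_consistent": False},
]
corruption_seen = datetime(2024, 5, 1, 9, 58)  # hypothetical time the bad write landed
print(pick_recovery_point(history, before=corruption_seen)["time"])
# 2024-05-01 09:55:00
```

For a ransomware case you would pass both filters, trading a little more data loss for a point that is clean and application-consistent.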

The decision table for the moment the pager goes off – match what you’re seeing to the failover move:

If you’re seeing…	It’s probably…	Do this
Source region degrading, still reachable	An anticipated/early event	Planned failover (zero loss); cut over before it’s gone
Source region hard-down / unreachable	A real disaster	Unplanned failover to latest point; accept ≈RPO loss
A single zone lost, region healthy	Zone outage	Failover the Z2Z plan to the target zone
Suspected ransomware/corruption in data	A poisoned latest point	Unplanned failover to a custom earlier clean point
Want to validate before committing	A drill or pre-check	Test failover into the isolated VNet; cleanup after
DR running, source recovered	Time to go home	Reprotect, then planned failback, reprotect again
Failover done, want to lock it in	Satisfied with DR	Commit (discards other RPs)

The lifecycle people get wrong is the return trip:

Fail over to DR (target region is now primary, serving traffic).
Commit the failover once you have validated the DR VMs (this discards the other recovery points). Reprotect requires a committed failover.
Reprotect – ASR now replicates from the DR region back to the original region. When the original disks still exist, only the changed data is synchronized rather than the full disks, so far less data moves than in an initial seed, although comparing the disks still takes time.
When the original region is healthy and reprotection is in sync, run a planned failover back (zero data loss) and reprotect again to restore the normal DR direction.

# After the failover is committed, reprotect to start replicating DR -> source.
az site-recovery protected-item reprotect \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --fabric-name asr-a2a-default-westus2 \
  --protection-container-name asr-a2a-default-westus2-container \
  --replication-protected-item-name vm-app01

The full failover-to-home lifecycle as a state table – where you are, what’s replicating, and the next legal move:

Stage	Primary (serving)	Replication direction	Next action	Watch-out
Steady state	Source region	Source → DR	(normal ops)	Alert on achieved RPO
After failover	DR region	None (paused)	Validate, then commit	Don’t linger – unprotected
Committed	DR region	None (paused)	Reprotect	Other RPs discarded on commit
Reprotecting	DR region	DR → Source (delta)	Wait for sync, then plan failback	Re-seed is delta, but watch RPO
Failback (planned)	Source region	(flips)	Reprotect again	Schedule a window; zero loss
Back to steady	Source region	Source → DR	Resume drills	Direction restored

Failover without a tested reprotect/failback plan means you can get to DR but you are stranded there. The expensive incidents I have seen were not the failover – they were teams running production from a DR region for weeks because nobody had rehearsed the return trip, accruing cross-region egress and running unscaled DR capacity under real load.

Architecture at a glance

Read the diagram left to right – it is the data path and the control path on one canvas, with each number marking a step that decides whether the failover actually recovers. On the far left, protected VMs in the source region (a DC, a SQL Always On pair, app servers) run the Mobility extension, which streams disk writes into a dedicated cache storage account that lives in the source region (badge 1 – the single most misconfigured component; share it with app I/O and your achieved RPO spikes). From the cache, the ASR engine asynchronously replicates to replica managed disks in the target region, coordinated by the Recovery Services vault – which itself lives in the target region so it survives the disaster – producing crash-consistent points every five minutes and hourly app-consistent points. Badge 2 sits on replication: leave multi-VM sync off and your tiers fail over to points minutes apart, forcing a reseed of the SQL AG.

The middle-right is the orchestration. The recovery plan boots tiers in dependency order through boot groups (DNS → data → app → web; badge 3 – an unassigned VM lands in the default parallel group and crash-loops the app), and attaches Automation runbooks at group boundaries for DNS cutover, IP and load-balancer wiring (badge 4 – the runbook must branch on FailoverType or a drill mutates live production). The target (a zone in the same region, or a paired region) is where the replica disks come up and DNS cuts over to the DR load balancer with a 60-second TTL. Finally, the validate zone is the discipline that makes all of it real: an isolated test VNet with no path back to production (badge 5), and an achieved-RPO metric alert that fires when recoverable lag crosses your target – not a green dashboard you trust on faith. The dotted reprotect flow from the engine back to the recovery plan is the return trip most teams never rehearse.

Real-world scenario

A platform team running a regulated payments workload (three-tier: IIS web, .NET app, SQL Server on a 2-VM Always On AG, plus a pair of domain controllers) protected everything to a paired region with ASR and a 5-minute RPO. Replication was healthy for months. During their first real drill, the recovery plan booted all VMs but the app tier crash-looped: it came up before SQL had finished AG recovery, exhausted its connection retries, and the service marked itself failed. Worse, because they had not enabled multi-VM sync, the two SQL VMs failed over to recovery points 3 minutes apart, so the AG came up with the secondary ahead of the primary and had to be manually reseeded – blowing their RTO from a target of 15 minutes to over an hour.

The constraint was real: SQL AG recovery time is variable and you cannot put a fixed sleep in a boot group and call it deterministic. They fixed it with two changes. First, they enabled multiVmSyncStatus: Enable on the policy so the SQL pair (and the whole app) shares one crash-consistent recovery point – no more cross-VM time skew. Second, they replaced the fragile fixed wait with a post-group readiness gate: a runbook attached to the end of the SQL boot group that polls the AG listener until the primary is actually serving, before the app group is allowed to start.

# Post-action on the SQL boot group: block until the AG primary answers.
# Assumption: this runs where the listener is reachable and a domain identity exists
# (e.g. a Hybrid Runbook Worker in the DR VNet), because it uses Integrated Security.
$listener = "sql-ag-listener.payments.internal"
$deadline = (Get-Date).AddMinutes(12)
$primaryReady = $false
do {
    $c = New-Object System.Data.SqlClient.SqlConnection("Server=$listener;Database=master;Integrated Security=True;Connect Timeout=5")
    try {
        $c.Open()
        $primaryReady = $true
    } catch {
        # Not ready yet: wait before the next probe.
        Start-Sleep -Seconds 10
    } finally {
        # Release the connection whether or not Open() succeeded.
        $c.Dispose()
    }
} until ($primaryReady -or (Get-Date) -gt $deadline)

if (-not $primaryReady) { throw "SQL AG primary not ready before deadline -- halting recovery plan." }
Write-Output "SQL AG primary reachable -- releasing app tier boot group."

With shared recovery points plus a readiness gate instead of a guessed sleep, their next drill came up clean and reproducibly inside 14 minutes – and crucially, the plan now fails loudly if the data tier isn’t ready, instead of cascading into a half-booted app. The drill-over-drill progression, because the numbers are the lesson:

Drill	Multi-VM sync	Data-tier gate	App tier outcome	Achieved RTO	Verdict
Q1 (first)	Off	Fixed `Start-Sleep 300`	Crash loop; AG reseed	> 60 min	Failed target
Q2 (after sync fix)	Enable	Fixed `Start-Sleep 300`	Boots, but sleep sometimes short	~22 min	Still over target
Q3 (after gate)	Enable	Readiness poll (listener)	Clean, ordered	~14 min	Met 15-min target
Q4 (steady)	Enable	Readiness poll	Clean, reproducible	~13 min	Stable; fails loudly if not ready

The lesson the team wrote on the wall: “A fixed sleep is a guess; a readiness gate is a contract. Replication health is necessary; a clean, timed test failover is sufficient.”
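
The readiness-gate pattern from this scenario reduces to a small, testable loop: probe, wait, and fail loudly at a deadline. The sketch below is a generic Python illustration, not the team's runbook; `wait_until_ready` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist only so the gate can be tested without real waiting.

```python
from __future__ import annotations

import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float, interval_s: float = 10.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll probe() until it returns True; return the attempt count or raise TimeoutError."""
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            if probe():
                return attempts
        except Exception:
            pass  # a probe that throws just means "not ready yet"
        if clock() >= deadline:
            # Fail loudly so the plan halts instead of booting the next tier.
            raise TimeoutError(f"dependency not ready after {attempts} attempts")
        sleep(interval_s)


# Simulated time: the third probe succeeds, 20 simulated seconds in.
now = [0.0]
answers = iter([False, False, True])
attempts = wait_until_ready(lambda: next(answers), timeout_s=60, interval_s=10,
                            clock=lambda: now[0], sleep=lambda s: now.__setitem__(0, now[0] + s))
print(attempts)  # 3
```

The gate returns as soon as the dependency answers, so a fast recovery is not penalized the way it is with a fixed sleep, and a slow one is caught by the deadline.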

Advantages and disadvantages

ASR turns a pile of replicated disks into a rehearsed, ordered, automatable recovery – but it is orchestration you own, and the defaults do not protect you. Weigh it honestly:

Advantages (why ASR earns its place)	Disadvantages (why it bites)
No appliance or process server to run for A2A – the Mobility extension installs on each VM automatically	You still own boot order, runbooks, DNS, IPs – the failure-prone part around the boot
Crash-consistent points every 5 min → low RPO floor	App-consistent points are hourly and heavier; data loss is real on unplanned
Recovery plans encode boot order + automation, repeatably	Defaults are unsafe: multi-VM sync off, VMs default to parallel boot
Test failover proves recoverability with zero prod impact	Only true if the test VNet is genuinely isolated – easy to leak
Compute in DR isn’t billed until failover	Replica storage + protected-instance fee are continuous
Reprotect re-seeds only the delta on failback	Failback is multi-step and frequently un-rehearsed → stranded in DR
Achieved RPO is a first-class, alertable metric	“Replication healthy” hides RTO problems entirely
One plan can mix VMs, scripts and manual gates	A bad runbook (no `Test` guard) can take down production during a drill

ASR is the right tool when you have stateful IaaS the business cannot lose and you need fast, ordered recovery – payments, ERP, SQL-on-VM, domain controllers. It is the wrong tool when the right answer is platform-native: zone-redundant PaaS, SQL failover groups, or Cosmos multi-region writes handle their own failover and ASR would just add machinery. And it is not a backup: ASR gives you a short recovery-point window for fast failover, not 35-day point-in-time restore against accidental deletion or ransomware – pair it with Azure Backup for that. The disadvantages are all manageable – but only if you know they exist and rehearse around them, which is the entire point of a drill.

ASR positioned against the adjacent resilience tools, so you don’t reach for the wrong one:

Tool	Protects against	Recovery model	Use ASR instead when…
Azure Site Recovery	Zone / region outage (IaaS)	Fast ordered failover, ~5-min RPO	(this is the IaaS DR tool)
Azure Backup	Deletion, corruption, ransomware	Point-in-time restore (days)	You need fast failover, not restore
Zone-redundant PaaS	Single-zone failure	Built-in, transparent	Workload is IaaS, not PaaS
SQL failover groups	SQL region failure	DB-level automatic or manual failover	You’re running SQL on a VM
Front Door / Traffic Manager	Endpoint/region routing	Global traffic steering	You need the compute moved, not just routed
VMSS across zones	Instance/zone loss (stateless)	Re-create instances	App is stateful and can’t be re-imaged

Hands-on lab

Protect one VM with A2A, prove replication health, run an isolated test failover, and clean up – all reversible. Run in Cloud Shell (Bash). This incurs replica-storage and protected-instance charges while enabled, so tear down at the end. (The portal “Enable replication” wizard creates the fabrics, containers and mappings on first use; this lab assumes you enable the first VM via the portal, then drive validation from the CLI.)

Step 1 – Variables and resource groups.

SRC_RG=rg-app-eastus2
DR_RG=rg-dr-westus2
SRC_LOC=eastus2
DR_LOC=westus2
VAULT=rsv-dr-lab
VM=vm-lab01
az group create -n $SRC_RG -l $SRC_LOC -o table
az group create -n $DR_RG  -l $DR_LOC  -o table

Step 2 – Create the vault in the TARGET region and a dedicated cache SA in the SOURCE region.

az backup vault create -g $DR_RG -n $VAULT -l $DR_LOC -o table
az storage account create -g $SRC_RG -n stasrlab$RANDOM -l $SRC_LOC \
  --sku Standard_LRS --kind StorageV2 --min-tls-version TLS1_2 \
  --allow-blob-public-access false -o table

Expected: a vault row in westus2; a StorageV2 cache account in eastus2.

Step 3 – Deploy a small source VM to protect (use a cheap SKU).

az vm create -g $SRC_RG -n $VM --image Ubuntu2204 --size Standard_B2s \
  --admin-username azureuser --generate-ssh-keys -o table

Step 4 – Enable replication (portal wizard). In the portal: Recovery Services vault rsv-dr-lab → Site Recovery → Enable replication → Azure virtual machines, source eastus2, target westus2, select vm-lab01, accept the default policy (5-min crash-consistent), and the dedicated cache account. This creates the fabrics/containers/mappings. Wait for initial replication to complete (the protected item moves from NotApplicable/seeding to Protected).

Step 5 – Verify replication health and achieved RPO from the CLI.

az extension add --name site-recovery
az site-recovery protected-item show \
  --resource-group $DR_RG --vault-name $VAULT \
  --fabric-name asr-a2a-default-eastus2 \
  --protection-container-name asr-a2a-default-eastus2-container \
  --replication-protected-item-name $VM \
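  --query "{state:properties.protectionStateDescription, health:properties.replicationHealth, rpoSeconds:properties.providerSpecificDetails.rpoInSeconds}" \
  -o table

Expected: protectionStateDescription = Protected, replicationHealth = Normal, rpoInSeconds well under 300. (If the show call cannot find the item, the wizard may have given it a different name; list the container with az site-recovery protected-item list and use the name it reports.)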

Step 6 – Build an isolated test VNet and a single-VM recovery plan.

az network vnet create -g $DR_RG -n vnet-asr-test -l $DR_LOC \
  --address-prefixes 10.99.0.0/16 \
  --subnet-name snet-app --subnet-prefixes 10.99.1.0/24 -o table
# Create rp-lab in the portal (Site Recovery → Recovery Plans → +Recovery plan),
# add vm-lab01 to Group 1. (Single-VM plans are simplest via the portal.)

Step 7 – Run a test failover into the isolated VNet, then CLEAN UP.

az site-recovery recovery-plan test-failover \
  --resource-group $DR_RG --vault-name $VAULT \
  --recovery-plan-name rp-lab \
  --failover-direction PrimaryToRecovery \
  --network-id "$(az network vnet show -g $DR_RG -n vnet-asr-test --query id -o tsv)" \
  --network-type VmNetworkAsInput

# Validate the VM booted in westus2 in the isolated VNet, then:
az site-recovery recovery-plan test-failover-cleanup \
  --resource-group $DR_RG --vault-name $VAULT \
  --recovery-plan-name rp-lab \
  --comments "Lab drill -- VM booted in isolated VNet, cleaned up."

Expected: a test VM appears in rg-dr-westus2 attached to vnet-asr-test with no path to production; cleanup deletes it.

Validation checklist. You enabled A2A, confirmed Protected/Normal with a sub-300s achieved RPO, ran a test failover into a network with no route back to prod, and cleaned up – the full proof loop. What each step proves:

Step	What you did	What it proves	Real-world analogue
2	Vault in target, cache in source	The region placement that makes ASR survivable	The first design decision in any DR build
5	Checked health + achieved RPO	Replication is real and within target	The daily DR health check
6	Built an isolated test VNet	Drills must have zero blast radius	The non-negotiable drill setup
7	Test failover + cleanup	You can prove recovery without touching prod	The quarterly DR drill

Cleanup (stop all charges).

# Force-delete (purge) the protected item, then delete both resource groups.
# Outside a throwaway lab, prefer `az site-recovery protected-item remove` for a clean disable.
az site-recovery protected-item delete \
  --resource-group $DR_RG --vault-name $VAULT \
  --fabric-name asr-a2a-default-eastus2 \
  --protection-container-name asr-a2a-default-eastus2-container \
  --replication-protected-item-name $VM --yes 2>/dev/null || true
az group delete -n $SRC_RG --yes --no-wait
az group delete -n $DR_RG  --yes --no-wait

Cost note. While enabled, you pay the protected-instance fee (on the order of $25/VM/month) plus replica-disk storage; an hour or two of this lab costs a few cents of storage. Disabling replication and deleting the resource groups stops everything.

Common mistakes & troubleshooting

This is the playbook – the part you bookmark. First as a scannable table, then the entries that bite hardest with full confirm-detail underneath.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	Achieved RPO climbs above target, VM goes `Warning`	Cache SA throttled (shared with app I/O) or undersized	Replication health blade; `Replica Storage Throttle` events; RPO metric	Dedicated GPv2 Standard cache in source; split high-churn VMs
2	Tiers fail over to recovery points minutes apart	Multi-VM sync not enabled on the policy	Policy `multiVmSyncStatus` != `Enable`	Enable multi-VM sync; re-associate the group
3	App tier crash-loops on failover, exhausts retries	App boots before data/identity (no ordering / no gate)	Recovery-plan boot groups; unassigned VMs in default group	Assign boot groups in dependency order; add a readiness gate runbook
4	A test drill changed production DNS/LB	Runbook doesn’t branch on `FailoverType`	Runbook ignores `RecoveryPlanContext.FailoverType`	Branch on `Test` and `return`; make runbook idempotent
5	Failed over fine, but can’t get back to source	Reprotect/failback never set up or rehearsed	Replication direction is `None` after failover	Reprotect (DR→source), then planned failback
6	Test VMs can reach real DCs / database	Test VNet has peering/VPN/route to prod	`az network vnet peering list` on `vnet-asr-test`	Remove peering/gateway/UDR; isolated VNet only
7	Test failover left VMs running, plan won’t re-run	Cleanup never ran after the drill	Test VMs still present in DR RG; plan state dirty	Run `test-failover-cleanup`; automate the reminder
8	Enable-replication fails with mapping error	Protection container mapping / fabrics missing	`az site-recovery ...` errors on container	Create via portal wizard first, then codify
9	Failed-over VM comes up with stale/wrong IP	Target NIC/IP not configured in replication settings	Protected item’s target network settings blank	Set target subnet/IP; automate public IP/LB post-failover
10	Initial replication never completes	Mobility blocked (NSG/policy) or unsupported disk	Protected item stuck seeding; Ultra disk present	Allow Mobility egress; swap unsupported disk type
11	App-consistent points missing on Linux	No/failed pre-post freeze script; VSS-equiv absent	App-consistent RP count zero in history	Install/repair the app-consistent script; check perms
12	SQL AG comes up split-brain / reseeds	Multi-VM sync off and no AG readiness gate	AG secondary ahead of primary post-failover	Multi-VM sync + post-group listener poll before app group
13	DR drill “passes” but RTO is unknown	Achieved RTO never recorded	No timing captured in cleanup comments	Record boot-order + app-ready time every drill
14	“Replication healthy” but failover still loses data	Confused replication health with achieved RPO	RPO metric > target while health shows `Normal`	Alert on the RPO metric, not just health state

The expanded form for the entries that cost the most:

1. Achieved RPO climbs above target and the VM flips to Warning. Root cause: The cache storage account is throttled – usually because it is shared with application workloads – or it is undersized for the VM’s write churn. Confirm: Replication health blade shows achieved RPO above target; look for Replica Storage Throttle events; query the RPO metric over time. Fix: Use a dedicated GPv2 Standard cache account in the source region, sized on churn not capacity; split very high-churn VMs onto their own cache; enable the high-churn flow if Standard limits are breached.

2. Tiers of one application fail over to recovery points minutes apart. Root cause: Multi-VM sync is not enabled on the policy, so each VM picks its own recovery point. Confirm: multiVmSyncStatus on the policy is not Enable; post-failover, the data tier’s point lags the app tier’s. Fix: Enable multi-VM sync on the policy so the VM group shares one crash-consistent recovery point; re-associate the protected items with the synced policy.

3. The app tier crash-loops on failover and marks itself failed. Root cause: The app boots before its dependency (SQL/DNS) is ready – either VMs are unassigned (default parallel boot) or the data-tier gate is a fixed sleep that finished too early. Confirm: Recovery-plan boot groups show unassigned VMs in the default group, or the app group has no readiness gate. Fix: Assign every VM a boot group in dependency order, and gate the app group on a post-group readiness poll of the data tier (e.g. the AG listener), not a guessed sleep.

4. A test drill mutated production DNS / load balancer. Root cause: The runbook does not branch on RecoveryPlanContext.FailoverType, so it ran its production cutover during a Test failover. Confirm: The runbook code ignores FailoverType; the test run shows live DNS/LB changes. Fix: Branch on FailoverType -eq "Test" and return early; make every cutover runbook idempotent so re-runs and retries are safe.

5. You failed over to DR successfully but cannot get back to the source. Root cause: Reprotect/failback was never set up or rehearsed, so after failover replication direction is None and the DR VMs are unprotected. Confirm: The protected item’s replication direction is None after failover; no reprotect job exists. Fix: Run reprotect to replicate DR→source (delta re-seed), let it sync, then run a planned failover back and reprotect again to restore the normal direction.

6. Test VMs can reach the real domain controllers or production database. Root cause: The test VNet has peering, a VPN/ExpressRoute gateway, or a UDR that provides a path back to production. Confirm: az network vnet peering list on vnet-asr-test is non-empty, or a gateway/route exists. Fix: Use a genuinely isolated VNet – same address layout, but no peering, no gateway, no route to prod – and verify before every drill.
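
For entry 2, the time skew between the recovery points that each VM actually failed over to is a single subtraction, and worth asserting after every drill. This is a minimal Python sketch; `recovery_point_skew` is a hypothetical helper and the timestamps are illustrative, chosen to mirror the three-minute gap from the scenario above.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def recovery_point_skew(points_by_vm: dict[str, datetime]) -> timedelta:
    """Gap between the oldest and newest recovery point used across a VM group."""
    times = list(points_by_vm.values())
    if not times:
        return timedelta(0)
    return max(times) - min(times)


used = {
    "vm-sql01": datetime(2024, 5, 1, 10, 0),
    "vm-sql02": datetime(2024, 5, 1, 10, 3),
}
print(recovery_point_skew(used))  # 0:03:00
```

With multi-VM sync enabled the expectation is zero skew across the group, so any non-zero result is a finding to record in the drill evidence.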

Best practices

Vault in the target region, cache account in the source region. This placement is what makes ASR survive the disaster it protects against – get it wrong and you lose your control plane with the region.
Dedicate the cache storage account to ASR and monitor it. Sharing it with app I/O is the number-one cause of RPO spikes. Size on churn, watch Replica Storage Throttle.
Enable multiVmSyncStatus for every multi-VM application. It is off by default and is the difference between a clean failover and a SQL AG reseed.
Assign every protected VM to a boot group, bottom of the dependency stack first. Unassigned VMs boot in parallel – exactly the chaos a recovery plan exists to prevent.
Gate tiers with readiness runbooks, never fixed sleeps. Poll the actual dependency (AG listener, health endpoint) so the plan fails loudly if it isn’t ready.
Branch every runbook on FailoverType and make it idempotent. A drill must never mutate production; a retry must never double-apply.
Keep DR-relevant DNS records at a 30-60s TTL permanently. You cannot shrink TTL mid-incident; pay the standing cost for fast cutover.
Run an isolated test failover on a schedule (quarterly minimum) and clean up. Replication health is necessary; a clean, timed drill is the only sufficient proof.
Alert on achieved RPO as a metric, not on “replication healthy.” Health state hides RTO and RPO-drift problems entirely.
Rehearse reprotect and failback end to end. Getting to DR is half the job; teams stranded in DR for weeks are the expensive incidents.
Right-size the replica disk SKU. It is a continuous cost; match it to recovery performance needs, not reflexively to Premium for cold tiers.
Capture drill evidence every time – boot order, app-ready time, runbook outcomes, achieved RTO – for audit and to prove improvement drill over drill.
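
The "readiness runbooks, never fixed sleeps" rule above comes down to a poll-with-deadline loop. Here is a minimal sketch of that pattern in Python, under stated assumptions. The helper names are mine, the TCP probe is only one possible readiness signal, and the real gate would run as a runbook at the boot-group boundary.

```python
import socket
import time


def tcp_probe(host, port, connect_timeout_s=3.0):
    """Build a probe that returns True when a TCP connect to host:port succeeds."""
    def probe():
        try:
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True
        except OSError:
            return False
    return probe


def wait_until_ready(probe, timeout_s, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True and report the elapsed seconds.

    Raises TimeoutError once timeout_s has passed, so the plan fails loudly
    instead of releasing the next boot group into a dead dependency.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"dependency not ready after {timeout_s}s")
        sleep(interval_s)
```

A call such as wait_until_ready(tcp_probe("ag-listener.contoso.local", 1433), timeout_s=600) would gate the app tier on the AG listener. That hostname is a placeholder. A TCP connect shows only that something is listening, so a stricter gate would run a real query or hit a health endpoint.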

The standing alerts worth wiring before the next drill – leading indicators, not “site down”:

Alert on	Signal	Threshold (starting point)	Why it’s leading
Achieved RPO breach	`RPO` metric (per VM)	> 300 s sustained 15 min	Catches replication falling behind before failover
Replication health	Health state	`Critical` for 5 min	Replication broken → bad failover
Cache throttling	`Replica Storage Throttle` events	any sustained	Root cause of RPO drift, before RPO climbs
Test-failover staleness	Days since last clean drill	> 95 days	Untested DR is unproven DR
Unprotected after failover	Replication direction = `None`	any	You’re in DR with no protection
Runbook failure	Automation job status	any `Failed` in a plan	Cutover step silently broke
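
Two rows of that table are pure arithmetic, so it helps to pin down what "sustained" and "stale" mean before wiring them into a monitoring tool. The sketch below is an illustration only. It assumes you have exported per-VM RPO samples as (epoch seconds, RPO seconds) pairs and that you record the date of the last clean drill. The function names are hypothetical and the defaults are the starting-point thresholds from the table.

```python
from datetime import date


def rpo_breach_sustained(samples, threshold_s=300, window_s=900):
    """True when RPO stayed above threshold_s for the whole trailing window.

    samples -- list of (epoch_seconds, rpo_seconds), oldest first.
    Needs at least window_s of history; one short spike does not fire.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    # Without a full window of history we cannot call the breach "sustained".
    if samples[0][0] > latest - window_s:
        return False
    window = [rpo for ts, rpo in samples if ts >= latest - window_s]
    return all(rpo > threshold_s for rpo in window)


def drill_is_stale(last_clean_drill, today, max_age_days=95):
    """True when the last clean test failover is older than max_age_days."""
    return (today - last_clean_drill).days > max_age_days


if __name__ == "__main__":
    minute_samples = [(t, 420) for t in range(0, 901, 60)]
    print(rpo_breach_sustained(minute_samples))
    print(drill_is_stale(date(2024, 1, 1), date(2024, 4, 10)))
```

Requiring every sample in the window to exceed the threshold is a deliberately strict reading of "sustained". Loosen it to a percentage if your metric is noisy.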

Security notes

Managed identity over stored credentials for runbooks. Use the Automation account’s system-assigned managed identity with RBAC scoped to the DR resource group – no Run As certificate to expire or leak. Grant least privilege (e.g. Network Contributor on the DNS/LB scope), not Owner.
Encrypt replica disks. Replica managed disks support server-side encryption with platform- or customer-managed keys; for CMK, pre-stage a disk encryption set in the target region and reference it on the replica. See Azure Encryption at Rest with Customer-Managed Keys.
Keep the test network truly isolated. An isolated vnet-asr-test is a security control, not just a correctness one – it prevents a drill from touching real identity (duplicate AD objects) or real data (corruption). No peering, no VPN, no route back.
Lock down the vault and Automation account with RBAC. Site Recovery Contributor / Operator on the vault, not broad subscription roles; the runbook identity is a high-value target because it can rewire DNS and networking.
Mind data residency on region-to-region. Failing over to a paired region moves regulated data across a boundary – get compliance sign-off on the target region before you promise cross-region DR.
Protect the DNS zone you cut over. The runbook that repoints app.contoso.com has write access to your public zone; scope its identity tightly and log every change.
Pair ASR with immutable backup for the ransomware case. ASR replicates corruption as faithfully as good writes and keeps recovery points only for a short retention window; a custom recovery point lets you boot from before the event if that point is still retained, but immutable Azure Backup is your guarantee against an admin-level compromise.

The security controls and the failure each one also prevents:

Control	Mechanism	Secures against	Also prevents
Managed identity for runbooks	System-assigned MI + scoped RBAC	Leaked/expired Run As creds	Runbook breaking on cert expiry
Least-privilege runbook scope	Network Contributor on DNS/LB only	Over-broad blast radius	Accidental changes outside DR scope
Replica disk encryption (CMK)	Disk encryption set (target region)	Data-at-rest exposure	Compliance gaps on the replica
Isolated test VNet	No peering/VPN/route	Drill touching real identity/data	Production corruption during a drill
Vault/Automation RBAC	SR Contributor/Operator roles	Unauthorized failover/config	Fat-finger failover by non-DR staff
Custom recovery point	Pick a point before an event	Replicating ransomware/corruption	Failing over into a compromised state

Cost & sizing

The bill is continuous even though the DR compute is off – that surprises teams. The drivers:

Protected-instance fee per VM is the headline ASR charge (on the order of $25/VM/month). It is charged per protected VM regardless of size, so protecting 200 VMs is roughly $5,000/month before storage.
Replica disk storage in the target runs continuously – you pay for standby disks sized like the source. Right-size the replica SKU (recoveryReplicaDiskAccountType): Standard HDD/SSD for cold tiers, Premium only where post-failover IOPS demand it.
Cross-region egress on replication applies to region-to-region (not zone-to-zone, which stays intra-region). High-churn VMs replicate more bytes; this is a real line item for chatty workloads.
Cache storage account is a small but real cost; dedicate it anyway – sharing it to “save money” costs you RPO.
DR compute is billed only while recovered VMs are running – during a real failover, or during a test failover until cleanup deletes the test VMs. That is exactly why forgetting cleanup is expensive, and why being stranded in DR (running un-scaled compute under real load) blows budgets.

A rough monthly picture for a mid-size estate, and what each lever buys:

Cost driver	What you pay for	Rough figure	Lever to control it	Watch-out
Protected-instance fee	Per protected VM	~$25/VM/mo	Protect only what needs fast failover	200 VMs ≈ $5,000/mo baseline
Replica disk storage	Standby disks in target	Source-disk-sized, continuous	Right-size replica SKU	Premium everywhere over-spends
Cross-region egress	Replication bytes (R2R)	Per-GB, churn-driven	Zone-to-zone where it suffices	High-churn VMs add up
Cache storage account	Source-region buffer	Small	Dedicated GPv2 Standard	Sharing it → RPO spikes
Test-failover compute	DR VMs during a drill	Hourly, while running	Always run cleanup	Forgetting cleanup bills forever
Failover compute	DR VMs serving traffic	Hourly, during DR	Scale appropriately; fail back	Stranded-in-DR is the budget killer
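
The baseline is simple enough to keep in a few lines of code next to your inventory. This is a rough estimator, not a pricing calculator. The only figure taken from this article is the ~$25 per protected VM. Every other input is a number you supply from your own bill, and the function names are hypothetical.

```python
def monthly_asr_estimate(vm_count, fee_per_vm=25.0,
                         replica_storage=0.0, egress=0.0, cache=0.0):
    """Rough standing monthly cost of ASR protection, in dollars.

    fee_per_vm is the article's order-of-magnitude figure; replica_storage,
    egress and cache are caller-supplied monthly amounts (assumptions).
    """
    protected = vm_count * fee_per_vm
    return {
        "protected_instances": protected,
        "replica_storage": replica_storage,
        "egress": egress,
        "cache": cache,
        "total": protected + replica_storage + egress + cache,
    }


def leftover_drill_cost(vm_count, hourly_rate_per_vm, hours_left_running):
    """Compute cost of test-failover VMs that were never cleaned up."""
    return vm_count * hourly_rate_per_vm * hours_left_running


if __name__ == "__main__":
    print(monthly_asr_estimate(200)["protected_instances"])  # 5000.0
    # 20 test VMs at a placeholder $0.20/hour, forgotten for 30 days.
    print(leftover_drill_cost(20, 0.20, 24 * 30))
```

The second function puts a number on the "forgetting cleanup bills forever" row, which is usually the fastest way to get cleanup automated.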

The decision rule as a table – match the workload to the cheapest posture that meets its risk:

If the workload…	Risk it faces	DR posture	Why
Must survive a zone outage, low latency	Single-zone failure	Zone-to-zone A2A	Intra-region, no egress, no residency change
Must survive a region loss	Regional disaster	Region-to-region A2A	True DR; accept egress + second region
Is critical and budget allows	Both	Both (Z2Z + R2R)	AZ-resilience for common case + DR for rare
Is stateless and re-imageable	Instance/zone loss	VMSS across zones (not ASR)	Cheaper to re-create than replicate
Is PaaS / managed SQL	Service-level failure	Native zone redundancy / failover groups	ASR adds no value over built-in
Just needs deletion/ransomware cover	Data loss, not outage	Azure Backup (immutable)	Point-in-time restore, not failover
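
The same rule can be written as a small function, which is handy when you are classifying a long inventory. This is a sketch of the table above and nothing more. The Workload fields are hypothetical labels for the "If the workload…" column. The order in which rows are checked is my assumption, since the table does not say which row wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    is_paas: bool = False                   # PaaS / managed SQL
    only_needs_data_recovery: bool = False  # deletion/ransomware cover only
    stateless: bool = False                 # stateless and re-imageable
    needs_zone_survival: bool = False
    needs_region_survival: bool = False


def dr_posture(w):
    """Map a workload to the cheapest posture in the decision table."""
    # Assumed precedence: rule out the cases where ASR adds nothing first.
    if w.is_paas:
        return "Native zone redundancy / failover groups"
    if w.only_needs_data_recovery:
        return "Azure Backup (immutable)"
    if w.stateless:
        return "VMSS across zones (not ASR)"
    if w.needs_zone_survival and w.needs_region_survival:
        return "Both (Z2Z + R2R)"
    if w.needs_region_survival:
        return "Region-to-region A2A"
    if w.needs_zone_survival:
        return "Zone-to-zone A2A"
    return "No ASR posture required by the stated risks"


if __name__ == "__main__":
    print(dr_posture(Workload(needs_zone_survival=True)))
```

The "budget allows" condition on the Both row is not modelled. Treat a "Both (Z2Z + R2R)" result as a candidate that still needs a cost sign-off.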

Interview & exam questions

1. Where does the cache storage account live, and why does that placement matter? In the source region. It buffers disk-write bursts from the Mobility service and decouples app I/O from cross-region replication latency, and ASR replicates from it to the target. Placing it in the target would add latency and break the model; sharing it with app workloads throttles it and spikes RPO. Dedicate it and size on churn.

2. Difference between crash-consistent and app-consistent recovery points? Crash-consistent is a disk-as-is snapshot (every 5 min for A2A) – filesystems journal-replay but in-flight writes are lost. App-consistent uses VSS (Windows) or a pre/post freeze (Linux) to flush application buffers first (hourly, heavier). Databases want app-consistent for clean recovery; because those points are taken less often, failing over to one usually means an older point and more lost data, in exchange for a cleaner recovery.

3. Why is multiVmSyncStatus: Enable non-negotiable for a multi-tier app? It makes a group of VMs share one crash-consistent recovery point, so the web, app and DB tiers fail over to the same moment in time. Without it, tiers can land minutes apart – e.g. a SQL AG secondary ahead of its primary, forcing a manual reseed and a data-integrity incident.

4. What is a recovery plan and what does a boot group do? A recovery plan is the unit of failover – it groups VMs, orders their boot into boot groups (1-7) executed sequentially (every VM in a group reaches its wait condition before the next group starts), and attaches pre/post automation. Model your dependency graph onto groups: identity/DNS first, then data, then app, then web. Unassigned VMs boot in parallel in a default group.

5. How does one runbook behave correctly in both a drill and a real disaster? ASR passes a RecoveryPlanContext with FailoverType (Test/Planned/Unplanned) and FailoverDirection. The runbook branches on these – crucially it must skip production DNS/LB changes when FailoverType is Test – and should be idempotent so retries and re-runs are safe.

6. Why prefer a readiness gate over a fixed Start-Sleep between boot groups? Dependency recovery time is variable (SQL AG recovery especially), so a fixed sleep is a guess – too short and the app boots into a dead dependency and crash-loops; too long and you waste RTO. A post-group readiness gate polls the actual dependency (e.g. the AG listener) and releases the next group only when it is truly serving, and fails loudly if it isn’t.

7. What’s the non-negotiable rule for a test failover network, and why? Fail over into a network with no peering, no VPN, and no route back to production. If the test VMs can reach the real DCs or database, the drill can corrupt production data or duplicate AD identity objects. The isolated vnet-asr-test uses the same address layout (so app config resolves) but is sealed off.

8. What is reprotect and why does failback depend on it? After a failover, the DR region is primary and replication is paused – you are unprotected. Reprotect reverses replication (DR→source), re-seeding only the delta, so the original region becomes a valid target again. Only then can you run a planned failover back (zero data loss). Skipping reprotect leaves you stranded in DR.

9. Planned vs unplanned failover – data-loss expectation for each? Planned failover requires a healthy source, flushes the final pending data, and is zero data loss – for anticipated events. Unplanned failover is the real-disaster path when the source is gone; it fails over to the latest available point and you expect data loss ≈ your achieved RPO at the moment of the outage.

10. “Replication is healthy” – why isn’t that enough to trust your DR? Health state says bytes are arriving; it says nothing about achieved RPO (you could be far behind and still “Normal”/“Warning”) or RTO (boot order, runbooks, app readiness). The only sufficient proof is a clean, timed, isolated test failover with achieved RTO recorded – plus alerting on the achieved-RPO metric, not just health.

11. Zone-to-zone vs region-to-region – when each, and what’s identical? Zone-to-zone protects a single-zone failure, stays in-region (low latency, no residency change) but does not cover a regional outage. Region-to-region covers a region loss at the cost of egress and a second region. The replication mechanics, policy, cache, and recovery plans are identical; only the target (zone vs region) differs.

12. Why must you run test-failover cleanup, and what happens if you don’t? Cleanup deletes the test VMs and resets the plan state. If you skip it you pay continuously for the running test VMs, and the recovery plan is left in an un-cleaned test state and cannot be run again until cleanup completes – so a forgotten cleanup both costs money and breaks your next drill.
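
Question 4's advice to model your dependency graph onto boot groups can be done mechanically: a VM's group is one more than the deepest group among the things it depends on. The sketch below shows that layering under my own assumptions. The function name and the input shape (VM name mapped to the set of VMs it needs) are hypothetical. The cap of 7 comes from the boot-group range stated above.

```python
def assign_boot_groups(depends_on, max_groups=7):
    """Layer VMs into boot groups so every dependency boots in an earlier group.

    depends_on -- dict mapping VM name -> set of VM names it depends on.
    Returns a dict of VM name -> boot group number (1 = boots first).
    """
    groups = {}
    visiting = set()

    def depth(vm):
        if vm in groups:
            return groups[vm]
        if vm in visiting:
            raise ValueError(f"dependency cycle at {vm}")
        visiting.add(vm)
        # Bottom of the stack (no dependencies) lands in group 1.
        group = 1 + max((depth(d) for d in depends_on.get(vm, ())), default=0)
        visiting.discard(vm)
        if group > max_groups:
            raise ValueError(f"{vm} needs group {group}, above the limit of {max_groups}")
        groups[vm] = group
        return group

    for vm in depends_on:
        depth(vm)
    return groups


if __name__ == "__main__":
    deps = {"dc01": set(), "sql01": {"dc01"}, "app01": {"sql01"}, "web01": {"app01"}}
    for vm, group in sorted(assign_boot_groups(deps).items(), key=lambda kv: kv[1]):
        print(group, vm)
```

A cycle or a chain deeper than seven tiers raises an error instead of silently flattening, because either one means the plan cannot express the real dependency order.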

These map to AZ-104 (Administrator) – implement and manage Azure Site Recovery, backup and disaster recovery – and AZ-305 (Solutions Architect Expert) – design business continuity, RPO/RTO, and DR for IaaS. The networking-cutover angle touches AZ-700, and the encryption, RBAC and isolation questions touch AZ-500. A compact cert mapping for revision:

Question theme	Primary cert	Exam objective area
ASR architecture, cache, RPs	AZ-104	Implement & manage backup and DR
Recovery plans, boot groups, runbooks	AZ-104 / AZ-305	Orchestrate failover; design BCDR
Zone-to-zone vs region-to-region	AZ-305	Design for HA and DR
RPO/RTO targets and proof	AZ-305	Business continuity requirements
DNS/IP/LB cutover automation	AZ-700	Design & implement network connectivity
Encryption / RBAC / isolation	AZ-500 / AZ-104	Secure DR; manage identities & access

Quick check

1. Your achieved RPO is climbing and a VM flips to Warning, but replication is still running. What single component do you suspect first, and where does it live?
2. A multi-tier app’s SQL AG comes up split-brain after failover and has to be reseeded. What policy setting was almost certainly missing?
3. A Test failover drill changed production DNS. What is the single line of runbook logic that would have prevented it?
4. You failed over to DR successfully but now can’t return to the source region. What step did you skip, and what does it do?
5. True or false: a green “replication healthy” dashboard is sufficient evidence that your DR meets its RTO.

Answers

1. The cache storage account – it lives in the source region. Suspect it is throttled (often because it is shared with application I/O) or undersized for the write churn; look for Replica Storage Throttle events and move to a dedicated GPv2 Standard cache sized on churn.
2. multiVmSyncStatus: Enable on the replication policy. Without it the SQL VMs fail over to recovery points minutes apart, so the AG secondary can come up ahead of the primary, forcing a manual reseed. Multi-VM sync makes the group share one crash-consistent point.
3. Branch on the failover type and return early: if ($RecoveryPlanContext.FailoverType -eq "Test") { return } before any production DNS/LB mutation. (And make the runbook idempotent.)
4. You skipped reprotect. After a failover the DR region is primary and replication is paused; reprotect reverses replication (DR→source), re-seeding only the delta, so the original region becomes a valid target again and you can run a planned failover back.
5. False. Replication health only confirms bytes are arriving. It hides achieved RPO (you can be far behind and still “Normal”/“Warning”) and says nothing about RTO – boot order, runbooks, app readiness. Only a clean, timed, isolated test failover proves RTO.
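
The third answer pairs two rules: skip production changes on a Test failover, and make the change idempotent. The real runbook is PowerShell, so what follows is only a Python model of the decision it has to make, with the context passed in as a plain dict. The function name and return strings are hypothetical. The FailoverType values are the ones listed earlier (Test, Planned, Unplanned).

```python
def plan_dns_cutover(context, current_ip, dr_ip):
    """Decide what a DNS cutover step should do, without performing it.

    context    -- dict standing in for RecoveryPlanContext (assumed shape)
    current_ip -- where the production record points right now
    dr_ip      -- where it should point after a real failover
    """
    failover_type = context.get("FailoverType")

    # A drill must never mutate production.
    if failover_type == "Test":
        return "skip: test failover must not touch production DNS"

    # Refuse to guess on a missing or unknown type instead of defaulting to "real".
    if failover_type not in ("Planned", "Unplanned"):
        raise ValueError(f"unexpected FailoverType: {failover_type!r}")

    # Idempotency: a retry after a successful cutover is a no-op.
    if current_ip == dr_ip:
        return "noop: record already points at DR"

    return f"update: {current_ip} -> {dr_ip}"


if __name__ == "__main__":
    print(plan_dns_cutover({"FailoverType": "Test"}, "10.0.1.4", "10.1.1.4"))
    print(plan_dns_cutover({"FailoverType": "Unplanned"}, "10.0.1.4", "10.1.1.4"))
```

Separating the decision from the side effect lets you unit-test the branch logic without touching a DNS zone. The IP addresses in the example are placeholders.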

Glossary

Azure Site Recovery (ASR) – Microsoft’s BCDR service that replicates running VM disks to another zone or region and orchestrates ordered failover.
A2A (Azure-to-Azure) – the replication scenario for Azure VMs (as opposed to VMware/physical); there is no agent to deploy by hand, because the Mobility service extension is installed automatically on each protected VM.
Mobility service extension – the agent auto-installed on each protected VM that captures disk writes and ships them to the cache account.
Cache storage account – a dedicated storage account in the source region that buffers write bursts before asynchronous replication to the target; sized on churn.
Recovery Services vault – the control-plane resource (in the target region) holding the replication policy, recovery points and failover jobs.
Replication policy – defines recovery-point retention, app-consistent frequency, and whether multi-VM sync is on; attached via a protection-container mapping.
Recovery point (RP) – a bootable moment in time; crash-consistent (every 5 min, disk-as-is) or app-consistent (hourly, VSS/freeze-flushed).
Multi-VM consistency (multiVmSyncStatus) – makes a group of VMs share one crash-consistent recovery point so tiers don’t drift apart in time.
Recovery plan – the unit of failover: groups VMs, orders boot into boot groups (1-7), and injects pre/post automation.
Boot group – a sequential tier (1-7) in a recovery plan; every VM in a group reaches its wait condition before the next group starts.
Azure Automation runbook – a PowerShell script run as a pre/post action at a boot-group boundary, for DNS cutover, IP/LB wiring, and app startup.
RecoveryPlanContext – the object ASR passes to a runbook carrying FailoverType, FailoverDirection, the VM map and the group id.
Achieved RPO – the actual recoverable lag per VM (queryable as a metric); the number to alert on, not just “replication healthy.”
Test failover – a drill that boots VMs from a chosen point into an isolated network while production replication continues uninterrupted.
Planned failover – zero-data-loss failover for an anticipated event; source must be healthy and is flushed first.
Unplanned failover – the real-disaster path when the source is gone; expect data loss roughly equal to the achieved RPO at the moment of the outage.
Reprotect – reverses replication (DR→source) after a failover, re-seeding only the delta, to enable failback.
Failback – the return trip to the original region; requires reprotect first, then a planned failover back.

Next steps

You can now protect IaaS with ASR, build ordered recovery plans, automate cutover, and – the part that matters – prove RPO/RTO with isolated drills. Build outward:

Next: Azure VM Availability & Resilience: Zones, Scale Sets & Fault Domains – the VM-level HA that handles the common failure before you ever need a region failover.
Related: Azure Backup Vault: Immutability, MUA & Cross-Region Restore – the point-in-time, ransomware-resilient half of BCDR that ASR does not replace.
Related: Azure Front Door & Traffic Manager: Global Failover – steer user traffic to the DR region the moment ASR brings it up.
Related: Azure SQL Managed Instance: Failover Groups & Link – managed-SQL failover for when you move off SQL-on-a-VM.
Related: Validate Resilience with Azure Chaos Studio – inject the zone/region fault and prove your recovery plan under real failure.
Related: Azure Regions & Availability Zones Explained – the physical-failure boundaries that decide zone-to-zone vs region-to-region.