Azure Site Recovery for IaaS: Zone-to-Zone and Region Failover with Recovery Plans

Every DR program I have audited had the same gap: replication was healthy, the dashboards were green, and nobody could tell me the last time anyone actually failed over. Azure Site Recovery (ASR) is not hard to enable – the trap is treating “protected” as “recoverable.” A VM with a 5-minute RPO is worthless if your application boots in the wrong order, comes up with a stale IP, can’t find DNS, or needs a runbook that lives only in someone’s head. This is how to wire ASR for IaaS correctly across both an availability-zone failure and a full region loss, build recovery plans that boot a multi-tier app in dependency order, automate the messy parts with runbooks, and – the part everyone skips – prove your numbers with a test failover that never touches production.

1. Replication architecture, cache storage, and what’s actually supported

For Azure-to-Azure replication, there is no appliance and no agent server to babysit. The Mobility service extension is pushed onto each protected VM automatically. It continuously captures writes and ships them to a cache storage account in the source region first, then ASR asynchronously replicates that data to managed disks in the target region (or target zone). The cache account is the single most important component people misconfigure.

The flow is:

VM disk write -> Mobility service -> cache storage account (source region)
                                          |
                                          v
                          ASR replication -> target managed disks (target region/zone)
                                          v
                                  recovery points (crash- and app-consistent)

Rules that bite teams in production:

The cache account lives in the source region, not the target. It absorbs write bursts and decouples app I/O from cross-region replication latency. Size it on churn (write rate), not capacity.
Use a separate, dedicated cache account for ASR. Sharing it with application workloads causes throttling that shows up as RPO spikes. Use a standard general-purpose v2 account; for high-churn VMs that breach Standard limits, ASR supports a high-churn flow but you must watch for Replica Storage Throttle events.
Premium SSD source disks are supported, but Ultra Disk is not supported as an ASR-replicated disk type at the time of writing – confirm disk SKUs against the current support matrix before you promise coverage.
Zone-to-zone requires the region to support availability zones and is configured within a single region. Region-to-region is the classic cross-region DR. The replication mechanics are identical; only the target differs.

Decide RPO at the cache layer. If the cache account is throttled or undersized, no replication policy setting will save your RPO. Provision the cache account first, separately, and monitor it as a first-class resource.
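
To make "size it on churn" concrete, here is a back-of-the-envelope sketch in Python. The function name and the example figures are illustrative assumptions, not ASR limits; feed it write rates and replication throughput you have actually measured.

```python
def replication_lag_seconds(churn_mb_s: float, drain_mb_s: float, burst_seconds: float) -> float:
    """Estimate the replication lag left behind by a write burst.

    churn_mb_s    -- write rate the VM produces during the burst (MB/s)
    drain_mb_s    -- rate at which data actually leaves the cache for the target (MB/s)
    burst_seconds -- how long the burst lasts
    """
    if drain_mb_s <= 0:
        raise ValueError("drain_mb_s must be positive")
    # Backlog only builds while writes outrun replication.
    backlog_mb = max(0.0, (churn_mb_s - drain_mb_s) * burst_seconds)
    # Time needed to drain that backlog once the burst ends.
    return backlog_mb / drain_mb_s


# A 10-minute burst at 30 MB/s against 20 MB/s of drain leaves 300 s of lag.
print(replication_lag_seconds(30, 20, 600))
```

If the result is already at or above your RPO target for a routine burst, the problem is throughput at the cache layer, not the policy.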

Create the Recovery Services vault and the dedicated cache account up front:

# Vault for region-to-region DR (the vault lives in the TARGET region)
az backup vault create \
  --resource-group rg-dr-westus2 \
  --name rsv-dr-prod \
  --location westus2

# Dedicated cache storage account in the SOURCE region
az storage account create \
  --resource-group rg-dr-eastus2 \
  --name stasrcacheeastus2 \
  --location eastus2 \
  --sku Standard_LRS \
  --kind StorageV2 \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false

2. Enabling zone-to-zone vs region-to-region replication

The decision is a risk model, not a preference. Zone-to-zone protects against a single availability zone failing (power, cooling, network in one datacenter) and keeps you inside the region – lowest latency, no data-residency change, but no protection against a regional outage. Region-to-region is your true DR posture for a region-wide event, at the cost of cross-region replication latency and a second region’s spend. Mature platforms run both: AZ-redundant production for the common case, plus cross-region ASR for the regional event.

The cleanest way to enable replication at scale is the portal’s “Enable replication” wizard for the first pass, then codify it. With the CLI, the modern path uses az site-recovery (install the extension first):

az extension add --name site-recovery

# A2A replication for a single VM, region-to-region (eastus2 -> westus2).
# Run after the vault, fabrics, containers, and protection container
# mapping exist (the portal wizard creates these on first use).
# JSON keys are the CLI (kebab-case) forms of the REST properties,
# e.g. recovery-replica-disk-account-type = recoveryReplicaDiskAccountType.
az site-recovery protected-item create \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --fabric-name asr-a2a-default-eastus2 \
  --protection-container asr-a2a-default-eastus2-container \
  --name vm-app01 \
  --policy-id "/subscriptions/<sub>/resourceGroups/rg-dr-westus2/providers/Microsoft.RecoveryServices/vaults/rsv-dr-prod/replicationPolicies/policy-prod-5min" \
  --provider-details '{
    "a2a": {
      "fabric-object-id": "/subscriptions/<sub>/resourceGroups/rg-app-eastus2/providers/Microsoft.Compute/virtualMachines/vm-app01",
      "recovery-container-id": "<target-container-id>",
      "recovery-resource-group-id": "/subscriptions/<sub>/resourceGroups/rg-app-westus2",
      "vm-managed-disks": [{
        "disk-id": "<source-disk-id>",
        "primary-staging-azure-storage-account-id": "<cache-storage-account-id>",
        "recovery-resource-group-id": "/subscriptions/<sub>/resourceGroups/rg-app-westus2",
        "recovery-replica-disk-account-type": "Premium_LRS",
        "recovery-target-disk-account-type": "Premium_LRS"
      }]
    }
  }'

For zone-to-zone, the difference is the target: you set recoveryAvailabilityZone to the target zone and keep recoveryResourceGroupId in the same region as the source. Everything else – policy, cache, recovery plans – is identical.

Be honest about cost. Cross-region ASR bills you for replicated storage in the target plus the protected-instance fee per VM. The compute in the target region is not running until you fail over, which is the whole point – but the storage and protected-instance charges are continuous. Right-size the replica disk SKU (recoveryReplicaDiskAccountType) to control this.
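
A rough way to reason about that standing bill is sketched below. The function and every price in the example call are placeholders I made up for illustration, not Azure list prices; substitute the rates from your own agreement.

```python
def monthly_standby_cost(vm_count: int,
                         protected_instance_fee: float,
                         replica_gib_per_vm: float,
                         replica_price_per_gib: float) -> float:
    """Continuous monthly DR cost while nothing is failed over.

    Compute in the target is excluded on purpose: it only runs after failover.
    """
    per_vm = protected_instance_fee + replica_gib_per_vm * replica_price_per_gib
    return vm_count * per_vm


# Placeholder numbers: 10 VMs, 512 GiB of replica disk each.
premium = monthly_standby_cost(10, 25.0, 512, 0.15)
standard = monthly_standby_cost(10, 25.0, 512, 0.05)
print(f"replica SKU choice changes the standing bill by {premium - standard:.2f} per month")
```

The point of the comparison is the shape, not the numbers: the replica disk SKU is the only term in the standing bill you can tune per VM.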

3. Replication policies: RPO, retention, and app-consistent snapshots

A replication policy controls three knobs and you attach it to a protection container mapping. Get these right and most of your DR posture is set:

Setting	What it controls	Sensible default
`recoveryPointHistory`	How far back you can recover (the recovery-point retention window), in minutes	1440 (24 hours); managed disks allow up to 15 days, so check the current limit before going longer
`appConsistentFrequencyInMinutes`	How often an application-consistent snapshot is taken	60 (0 to disable)
`crashConsistentFrequencyInMinutes`	How often a crash-consistent point is taken; ASR takes these every 5 minutes for A2A	5

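# Keys are the CLI (kebab-case) forms of the REST properties,
# e.g. multi-vm-sync-status = multiVmSyncStatus.
# recovery-point-history is in minutes (1440 = 24 hours).
az site-recovery policy create \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --name policy-prod-5min \
  --provider-specific-input '{
    "a2a": {
      "recovery-point-history": 1440,
      "app-consistent-frequency-in-minutes": 60,
      "crash-consistent-frequency-in-minutes": 5,
      "multi-vm-sync-status": "Enable"
    }
  }'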

The distinction that matters for recovery quality:

Crash-consistent recovery points are like pulling the power cord – the disk is captured as-is. Filesystems journal-replay on boot and most apps recover, but in-flight, uncommitted writes are lost. ASR takes these every 5 minutes for A2A, so your effective RPO floor is roughly 5 minutes plus replication lag.
App-consistent recovery points trigger VSS (Windows) or a pre/post script freeze (Linux) so the application flushes buffers to disk before the snapshot. These are what you want for databases and stateful apps, but they are heavier, so you take them less often (hourly is typical). Failing over to an app-consistent point loses more time but recovers cleaner.

multiVmSyncStatus: Enable is non-negotiable for multi-tier apps that span VMs. It creates shared, crash-consistent recovery points across a group of VMs so your web, app, and DB tiers fail over to the same point in time – otherwise your app tier might be 4 minutes ahead of your DB tier after failover, which is a data-integrity incident waiting to happen.
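
To see the skew problem in numbers, here is a small Python sketch. It is not how ASR computes anything; the function name and timestamps are hypothetical and only illustrate what "tiers at different points in time" means.

```python
from datetime import datetime, timedelta
from typing import Dict


def recovery_skew(latest_point_by_vm: Dict[str, datetime]) -> timedelta:
    """Gap between the newest and oldest 'latest recovery point' across a group of VMs."""
    if not latest_point_by_vm:
        raise ValueError("need at least one VM")
    points = list(latest_point_by_vm.values())
    return max(points) - min(points)


# Without shared recovery points, each VM has its own latest point.
skew = recovery_skew({
    "vm-sql01": datetime(2024, 1, 1, 10, 0),
    "vm-sql02": datetime(2024, 1, 1, 10, 3),
})
print(skew)  # 0:03:00
```

With shared recovery points the group fails over to one timestamp, so this value is zero by construction.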

RPO is a target, not a guarantee. ASR continuously computes an achieved RPO per VM (visible in the replication health blade and queryable via metrics). If your churn outruns replication bandwidth, achieved RPO climbs above target and the VM goes to a warning state. Alert on achieved RPO, not just on “is replication healthy.”
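
The alert logic I mean is roughly the following, sketched in Python. The function name, the three-sample threshold, and the sample values are assumptions for illustration; the idea is to page on a sustained breach rather than on a single noisy sample.

```python
from typing import Iterable


def sustained_breach(rpo_samples_s: Iterable[float], target_s: float, min_consecutive: int = 3) -> bool:
    """True if achieved RPO exceeded the target for min_consecutive samples in a row."""
    run = 0
    for sample in rpo_samples_s:
        # Extend the current run on a breach, reset it on a healthy sample.
        run = run + 1 if sample > target_s else 0
        if run >= min_consecutive:
            return True
    return False


samples = [120, 310, 140, 320, 450, 610, 200]  # seconds, made-up values
print(sustained_breach(samples, target_s=300))
```

Here the single spike at 310 s is ignored, while the run of 320, 450, and 610 s triggers the alert.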

4. Building recovery plans with tiered boot groups

A recovery plan is the unit of failover. It groups VMs, orders their boot, and lets you inject automation. Without one, “failover” means clicking each VM individually in the wrong order at 3 a.m. With one, it is a single, repeatable, ordered operation.

The structure is boot groups (1-7) executed sequentially: every VM in Group 1 boots and reaches the configured wait condition before Group 2 starts. Model your dependency graph onto groups – bottom of the dependency stack first:

Group 1: Domain controllers / DNS / identity. Nothing else can authenticate until these are up.
Group 2: Data tier – databases, caches, message brokers.
Group 3: Application / API tier.
Group 4: Web / frontend / load-balancer-facing VMs.

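A recovery plan can be created in the portal, but for repeatability define it as JSON and create it via REST/ARM. Here is the shape of a tiered plan with three boot groups (identity, data, and a combined app/web group to keep the example short) and empty runbook hooks on the data group (covered next):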

{
  "properties": {
    "primaryFabricId": "<source-fabric-id>",
    "recoveryFabricId": "<target-fabric-id>",
    "failoverDeploymentModel": "ResourceManager",
    "groups": [
      {
        "groupType": "Boot",
        "replicationProtectedItems": [
          { "id": "<vm-dc01-protected-item-id>", "virtualMachineId": "<vm-dc01-id>" }
        ]
      },
      {
        "groupType": "Boot",
        "replicationProtectedItems": [
          { "id": "<vm-sql01-protected-item-id>", "virtualMachineId": "<vm-sql01-id>" }
        ],
        "startGroupActions": [],
        "endGroupActions": []
      },
      {
        "groupType": "Boot",
        "replicationProtectedItems": [
          { "id": "<vm-app01-protected-item-id>", "virtualMachineId": "<vm-app01-id>" },
          { "id": "<vm-web01-protected-item-id>", "virtualMachineId": "<vm-web01-id>" }
        ]
      }
    ]
  }
}

VMs not added to any boot group land in a default group and boot in parallel with no ordering – which is exactly the chaos a recovery plan exists to prevent. Be explicit: every protected VM should have an assigned group.
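
If you keep the dependency graph in code, the group assignment can be derived instead of hand-maintained. A minimal Python sketch, assuming a hypothetical `depends_on` mapping of VM name to the VMs it needs first; the 7-group ceiling is the boot-group limit mentioned above.

```python
from typing import Dict, List, Set


def boot_groups(depends_on: Dict[str, Set[str]], max_groups: int = 7) -> List[List[str]]:
    """Layer VMs into ordered boot groups: a VM boots only after all its dependencies."""
    remaining = {vm: set(deps) for vm, deps in depends_on.items()}
    placed: Set[str] = set()
    groups: List[List[str]] = []
    while remaining:
        # Every VM whose dependencies are already placed can boot in this group.
        ready = sorted(vm for vm, deps in remaining.items() if deps <= placed)
        if not ready:
            raise ValueError("dependency cycle or unknown VM among: " + ", ".join(sorted(remaining)))
        groups.append(ready)
        placed.update(ready)
        for vm in ready:
            del remaining[vm]
    if len(groups) > max_groups:
        raise ValueError(f"{len(groups)} groups needed, but a plan allows {max_groups}")
    return groups


plan = boot_groups({
    "vm-dc01": set(),
    "vm-sql01": {"vm-dc01"},
    "vm-app01": {"vm-sql01"},
    "vm-web01": {"vm-app01"},
})
print(plan)  # [['vm-dc01'], ['vm-sql01'], ['vm-app01'], ['vm-web01']]
```

A VM that is missing from the mapping never gets a group, which surfaces the "unassigned VM" problem as an error at build time instead of at 3 a.m.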

5. Injecting pre/post automation runbooks for DNS, IPs, and app startup

Booting VMs is the easy 80%. The failure-prone 20% is everything around the boot: repointing DNS to the DR region, assigning the right private/public IPs, attaching the failed-over VMs to the correct load balancer backend pool, and kicking off application-level startup. ASR lets you attach Azure Automation runbooks as pre- or post-actions at the start or end of any boot group.

Two things to get right:

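The runbook should live in an Azure Automation account that does not depend on the source region (the target region is the natural choice), with the Az modules imported, and it should authenticate with a system-assigned managed identity whose RBAC is scoped to the DR resource group and the DNS zone it edits. AzureRM modules and Run As accounts are retired, so migrate any older setup before you rely on it.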
ASR passes a RecoveryPlanContext object to the runbook as a parameter. Your runbook keys off FailoverType (Test, Planned, Unplanned) and FailoverDirection so the same runbook behaves correctly in a drill vs. a real event – e.g. it must not flip production DNS during a test failover.
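
The decision the runbook has to make can be sketched independently of the cloud calls. This Python version is illustrative only: the sample context is hand-written, and the `PrimaryToSecondary` direction string is an assumption taken from the documented sample context, so verify the values your plan actually sends.

```python
import json
from typing import Union


def should_mutate_production(context: Union[str, dict]) -> bool:
    """Allow production DNS/LB changes only on a real failover towards DR."""
    if isinstance(context, str):
        # The context may arrive serialized as JSON.
        context = json.loads(context)
    real_failover = context.get("FailoverType") in {"Planned", "Unplanned"}
    towards_dr = context.get("FailoverDirection") == "PrimaryToSecondary"
    return real_failover and towards_dr


drill = '{"RecoveryPlanName": "rp-prod-app", "FailoverType": "Test", "FailoverDirection": "PrimaryToSecondary"}'
print(should_mutate_production(drill))  # False
```

Anything the function does not recognise, including a missing field, resolves to "do not touch production", which is the safe default for a drill.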

Here is a production-shaped post-group runbook that repoints DNS at the DR load balancer, but only on a real failover:

param(
    [Parameter(Mandatory = $true)]
    [object]$RecoveryPlanContext
)

# ASR may pass the context as a JSON string depending on engine version.
if ($RecoveryPlanContext -is [string]) {
    $RecoveryPlanContext = $RecoveryPlanContext | ConvertFrom-Json
}

Connect-AzAccount -Identity | Out-Null

$failoverType = $RecoveryPlanContext.FailoverType   # Test | Planned | Unplanned
Write-Output "Recovery plan failover type: $failoverType"

# NEVER mutate production DNS during a test failover.
if ($failoverType -eq "Test") {
    Write-Output "Test failover detected -- skipping production DNS/LB changes."
    return
}

# DR resource group that holds the failed-over VMs and the DR load balancer.
$drRg = "rg-app-westus2"

# Re-point the app A record to the DR load balancer's frontend IP.
# Set-AzDnsRecordSet only accepts a record-set object, so fetch, edit, then save.
$drLbIp = (Get-AzPublicIpAddress -ResourceGroupName $drRg -Name "pip-app-lb-dr").IpAddress
$rs = Get-AzDnsRecordSet -ResourceGroupName "rg-dns" -ZoneName "app.contoso.com" `
    -Name "@" -RecordType A
$rs.Ttl = 60
$rs.Records.Clear()
Add-AzDnsRecordConfig -RecordSet $rs -Ipv4Address $drLbIp | Out-Null
Set-AzDnsRecordSet -RecordSet $rs | Out-Null

Write-Output "DNS app.contoso.com repointed to DR LB $drLbIp (TTL 60s)."

Two patterns I insist on:

Set a short DNS TTL (30-60s) on the records you fail over, permanently. You cannot shrink TTL during an incident – caches already hold the old value. Low TTL on DR-relevant records is a standing cost you pay for fast cutover.
Make runbooks idempotent. A drill that re-runs, or a partial failover you retry, must not double-apply or error out. Check current state before mutating.
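
"Check current state before mutating" reduces to a pure comparison you can unit test. A minimal Python sketch, with a hypothetical function name and documentation-range IP addresses as stand-ins:

```python
from typing import List, Optional


def plan_dns_change(current_ips: List[str], current_ttl: int,
                    desired_ip: str, desired_ttl: int = 60) -> Optional[dict]:
    """Return the change to apply, or None when the record is already correct."""
    if list(current_ips) == [desired_ip] and current_ttl == desired_ttl:
        return None  # already in the desired state: a re-run is a no-op
    return {"ips": [desired_ip], "ttl": desired_ttl}


first_run = plan_dns_change(["203.0.113.10"], 60, "198.51.100.20")
second_run = plan_dns_change(["198.51.100.20"], 60, "198.51.100.20")
print(first_run, second_run)
```

The first call yields a change and the second yields nothing, which is exactly the behavior you want when a drill or a retried failover runs the same runbook twice.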

6. Test failover into an isolated network – without touching production

This is the feature that makes ASR trustworthy: test failover spins up your VMs from a chosen recovery point into a network you specify, while production replication keeps running uninterrupted. Done right, it proves recoverability with zero blast radius.

The non-negotiable rule: fail over into an isolated VNet that has no peering, no VPN, and no route back to production. If your test VMs can reach the real domain controllers or the real database, your drill can corrupt production data or duplicate identity objects. Build a dedicated vnet-asr-test in the DR region with the same address space layout (so app configs resolve) but isolated.

# Isolated DR-test VNet: same subnet layout, NO peering, NO gateway.
az network vnet create \
  --resource-group rg-dr-westus2 \
  --name vnet-asr-test \
  --location westus2 \
  --address-prefixes 10.99.0.0/16 \
  --subnet-name snet-app --subnet-prefixes 10.99.1.0/24

# Trigger a test failover for a recovery plan into the isolated network.
az site-recovery recovery-plan test-failover \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --recovery-plan-name rp-prod-app \
  --failover-direction PrimaryToRecovery \
  --network-id "/subscriptions/<sub>/resourceGroups/rg-dr-westus2/providers/Microsoft.Network/virtualNetworks/vnet-asr-test" \
  --network-type VmNetworkAsInput
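
Before triggering a drill, it is worth asserting the isolation rule mechanically. The Python sketch below inspects a VNet description shaped like the JSON that az network vnet show returns; the keys it reads (virtualNetworkPeerings, subnets, name) are my assumption about that output, and the check covers only peerings and a gateway subnet, not custom routes.

```python
from typing import List


def isolation_violations(vnet: dict) -> List[str]:
    """List the reasons a VNet is NOT safe to use as an isolated test network."""
    problems = []
    if vnet.get("virtualNetworkPeerings"):
        problems.append("has VNet peerings")
    subnets = vnet.get("subnets") or []
    # A GatewaySubnet means a VPN/ExpressRoute gateway can be (or is) attached.
    if any(s.get("name") == "GatewaySubnet" for s in subnets):
        problems.append("has a GatewaySubnet")
    return problems


test_vnet = {"name": "vnet-asr-test", "virtualNetworkPeerings": [], "subnets": [{"name": "snet-app"}]}
print(isolation_violations(test_vnet))  # []
```

An empty list is the only acceptable result; anything else should abort the drill before a single test VM boots.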

After validation, you must run cleanup to delete the test VMs and resources – otherwise you keep paying for them and the plan state stays dirty. Record findings (what booted, what broke, time to app-ready) as part of cleanup, because that is your drill evidence.

az site-recovery recovery-plan test-failover-cleanup \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --recovery-plan-name rp-prod-app \
  --comments "Q2 DR drill -- app-ready in 11m, DNS runbook OK, fixed SQL boot-group wait."

A test failover that you never clean up is worse than no drill: it bills continuously and leaves the recovery plan unable to run again. Treat cleanup as part of the drill, not an afterthought, and automate the reminder.

7. Planned, unplanned, and failback with reprotection

Three failover modes, each for a different situation:

Planned failover (region-to-region): zero data loss. ASR shuts down the source, flushes the final pending data, then brings up the target. Use this for anticipated events – scheduled datacenter maintenance, a planned region migration. The source must be reachable and healthy. Planned-failover behavior is not identical for zone-to-zone and region-to-region, so check which options your configuration exposes before you build a procedure around it.
Unplanned failover: the source is gone or unreachable. ASR fails over to the latest (or a chosen earlier) recovery point. Expect some data loss equal to your achieved RPO at the moment of the outage. This is the real-disaster path.
Failback: returning to the original region after it recovers. This is not a single button – you must reprotect first.

The lifecycle people get wrong is the return trip:

Fail over to DR (target region is now primary, serving traffic).
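Commit the failover once you are satisfied with the recovery point. Commit discards the other recovery points, and ASR requires it before you can reprotect.
Reprotect – ASR now replicates from the DR region back to the original region. When the original disks still exist, only the changes are synced rather than the full disks, which is much faster than a fresh seed.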
When the original region is healthy and reprotection is in sync, run a planned failover back (zero data loss) and reprotect again to restore the normal DR direction.

# After an unplanned failover, reprotect to start replicating DR -> source.
az site-recovery protected-item reprotect \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --fabric-name asr-a2a-default-westus2 \
  --protection-container asr-a2a-default-westus2-container \
  --name vm-app01 \
  --provider-details '{"a2a": {
    "policy-id": "<policy-id>",
    "recovery-container-id": "<original-region-container-id>",
    "recovery-resource-group-id": "<original-region-resource-group-id>"
  }}'

Failover without a tested reprotect/failback plan means you can get to DR but you are stranded there. The expensive incidents I have seen were not the failover – they were teams running production from a DR region for weeks because nobody had rehearsed the return trip, accruing cross-region egress and running unscaled DR capacity under real load.
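
One way to rehearse the return trip on paper is to write the lifecycle down as a state machine and refuse illegal moves. This Python sketch uses state and action names I invented for illustration, and it leaves out the commit step for brevity.

```python
TRANSITIONS = {
    ("protected", "failover"): "failed_over",            # DR region now serves traffic
    ("failed_over", "reprotect"): "reverse_protected",   # replicating DR -> original
    ("reverse_protected", "failback"): "failed_back",    # original region serves again
    ("failed_back", "reprotect"): "protected",           # normal DR direction restored
}


def advance(state: str, action: str) -> str:
    """Return the next lifecycle state, or raise if the action is not allowed yet."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot '{action}' while '{state}'") from None


state = "protected"
for action in ["failover", "reprotect", "failback", "reprotect"]:
    state = advance(state, action)
print(state)  # protected
```

The useful property is the error path: attempting a failback straight after failover is rejected, which mirrors the "you must reprotect first" rule.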

Verify

Confirm replication health, recovery-point availability, and achieved RPO before you ever claim a VM is protected.

# Per-VM replication health, protection state, and achieved RPO.
az site-recovery protected-item show \
  --resource-group rg-dr-westus2 \
  --vault-name rsv-dr-prod \
  --fabric-name asr-a2a-default-eastus2 \
  --protection-container asr-a2a-default-eastus2-container \
  --name vm-app01 \
  --query "{state:properties.protectionStateDescription, health:properties.replicationHealth, rpoSeconds:properties.providerSpecificDetails.rpoInSeconds}" \
  -o table

You want protectionStateDescription = Protected, replicationHealth = Normal, and rpoInSeconds comfortably under your target. Then query achieved RPO over time to catch drift, and confirm app-consistent points exist:

// Achieved RPO trend per protected VM (over the SLA window) -- alert on breach.
// Assumes the vault's diagnostic settings send Site Recovery logs to
// Log Analytics in Azure diagnostics mode (the AzureDiagnostics table).
AzureDiagnostics
| where replicationProviderName_s == "A2A"
| where isnotempty(name_s)
| summarize MaxRpoSeconds = max(rpoInSeconds_d) by name_s, bin(TimeGenerated, 15m)
| where MaxRpoSeconds > 300   // 5-minute target
| order by TimeGenerated desc

Finally, the only verification that actually counts: run an isolated test failover (Step 6), confirm every boot group reaches app-ready in dependency order, confirm the runbooks ran (and the DNS runbook correctly skipped on Test), record the achieved RTO, and clean up. Replication health is necessary; a clean test failover is sufficient.
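
Drill evidence is easier to defend when it is computed rather than typed from memory. A small Python sketch, assuming you record a start time and a hypothetical "app-ready" timestamp per boot group during the test failover; the timestamps below are invented.

```python
from datetime import datetime
from typing import Dict, List


def drill_report(started: datetime, ready_at: Dict[str, datetime],
                 boot_order: List[str], rto_target_min: float) -> dict:
    """Summarise a drill: achieved RTO, and whether groups became ready in order."""
    times = [ready_at[group] for group in boot_order]
    # Each group must be ready no later than the group that depends on it.
    in_order = all(earlier <= later for earlier, later in zip(times, times[1:]))
    rto_min = (times[-1] - started).total_seconds() / 60
    return {
        "rto_minutes": rto_min,
        "in_dependency_order": in_order,
        "met_target": in_order and rto_min <= rto_target_min,
    }


report = drill_report(
    started=datetime(2024, 6, 1, 2, 0),
    ready_at={
        "identity": datetime(2024, 6, 1, 2, 3),
        "data": datetime(2024, 6, 1, 2, 8),
        "app": datetime(2024, 6, 1, 2, 11),
    },
    boot_order=["identity", "data", "app"],
    rto_target_min=15,
)
print(report)
```

A drill where the last group is ready in time but an earlier group finished late still fails the report, because out-of-order readiness is the failure mode a recovery plan exists to prevent.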

Enterprise scenario

A platform team running a regulated payments workload (three-tier: IIS web, .NET app, SQL Server on a 2-VM Always On AG, plus a pair of domain controllers) protected everything to a paired region with ASR and a 5-minute RPO. Replication was healthy for months. During their first real drill, the recovery plan booted all VMs but the app tier crash-looped: it came up before SQL had finished AG recovery, exhausted its connection retries, and the service marked itself failed. Worse, because they had not enabled multi-VM sync, the two SQL VMs failed over to recovery points 3 minutes apart, so the AG came up with the secondary ahead of the primary and had to be manually reseeded – blowing their RTO from a target of 15 minutes to over an hour.

The constraint was real: SQL AG recovery time is variable and you cannot put a fixed sleep in a boot group and call it deterministic. They fixed it with two changes. First, they enabled multiVmSyncStatus: Enable on the policy so the SQL pair (and the whole app) shares one crash-consistent recovery point – no more cross-VM time skew. Second, they replaced the fragile fixed wait with a post-group readiness gate: a runbook attached to the end of the SQL boot group that polls the AG listener until the primary is actually serving, before the app group is allowed to start.

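# Post-action on the SQL boot group: block until the AG primary answers.
# Assumes the runbook runs where it can reach the listener and authenticate
# (for example a Hybrid Runbook Worker joined to the recovered domain).
$listener = "sql-ag-listener.payments.internal"
$deadline = (Get-Date).AddMinutes(12)
$primaryReady = $false
do {
    $c = New-Object System.Data.SqlClient.SqlConnection("Server=$listener;Database=master;Integrated Security=True;Connect Timeout=5")
    try {
        $c.Open()
        $primaryReady = $true
    } catch {
        Start-Sleep -Seconds 10   # not ready yet: back off, then retry
    } finally {
        $c.Dispose()              # release the connection on success and failure
    }
} until ($primaryReady -or (Get-Date) -gt $deadline)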

if (-not $primaryReady) { throw "SQL AG primary not ready before deadline -- halting recovery plan." }
Write-Output "SQL AG primary reachable -- releasing app tier boot group."

With shared recovery points plus a readiness gate instead of a guessed sleep, their next drill came up clean and reproducibly inside 14 minutes – and crucially, the plan now fails loudly if the data tier isn’t ready, instead of cascading into a half-booted app.
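
The readiness-gate pattern generalises beyond SQL. Here is a minimal Python sketch of the same idea, with a hypothetical `probe` callable standing in for "is the data tier serving yet"; the clock and sleep functions are injectable only so the gate can be tested without waiting.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float, interval_s: float = 10.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or the deadline passes. Probe errors count as 'not ready'."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # a failing probe just means the dependency is not up yet
        if clock() >= deadline:
            return False
        sleep(interval_s)


attempts = []

def fake_probe() -> bool:
    # Simulated dependency that becomes ready on the third check.
    attempts.append(1)
    return len(attempts) >= 3

ready = wait_until_ready(fake_probe, timeout_s=60, interval_s=0, sleep=lambda _: None)
print(ready, len(attempts))  # True 3
```

A caller should treat a False result as a hard stop, as the runbook above does with throw, so the next boot group never starts against a half-ready data tier.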


Written by Vinod
