Azure Backup & Site Recovery Deep Dive: Vaults, Policies, Restore & DR Failover

Two questions decide whether your career survives the worst day in the data centre: how much data can you afford to lose (your recovery point objective, RPO), and how fast must you be back (your recovery time objective, RTO)? Azure answers them with two complementary services that learners and exam writers constantly confuse. Azure Backup keeps point-in-time copies so you can recover data after deletion, corruption, or ransomware. Azure Site Recovery (ASR) continuously replicates whole machines so you can fail over and keep running when a region, zone, or on-prem site goes dark. Backup is your time machine; Site Recovery is your spare engine. You need both, and an interviewer will probe exactly where one ends and the other begins.

This is the exhaustive lesson. We go setting by setting through the two vault types and the workloads each protects, backup policies down to GFS retention and instant-restore snapshots, the full restore matrix for VMs (whole-machine and file-level), Azure Files and SQL-in-VM, then the hardening controls every production tenant needs — cross-region restore, soft delete, immutable vaults, and multi-user authorization. Then we switch to Site Recovery — replication policies, recovery plans, the A2A and on-prem-to-Azure scenarios, and the critical distinction between test failover, failover, and failback — before pulling the estate together in Backup Center. By the end you can design a backup-and-DR posture from memory and answer the follow-ups AZ-104 and AZ-305 will throw at you.

Learning objectives

Prerequisites & where this fits

You need an Azure subscription, a resource group or two, at least one VM you can protect, and the az CLI (Cloud Shell is fine) from the earlier Fundamentals and Compute lessons. This is the Operations deep-dive of the Azure Zero-to-Hero course — it builds directly on the compute and storage deep-dives (Azure Virtual Machines Deep Dive and Azure Managed Disks Deep Dive), because the things you back up here are exactly the VMs, disks, and file shares you created there, and uses the same monitoring pipeline as Azure Monitor Deep Dive for alerts. If “RPO” and “RTO” are brand new, this lesson defines them in full; if you want the ransomware-hardening angle in more depth afterwards, the companion lesson Azure Backup Hardening goes deeper still.

Core concepts

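Backup and disaster recovery solve different problems, and the single most important mental model is why two services exist: Azure Backup keeps point-in-time copies so you can recover data after deletion, corruption, or ransomware, while Azure Site Recovery continuously replicates whole machines so you can fail over and keep running.

Anchor terms you will see throughout: RPO (recovery point objective) is how much data you can afford to lose, and RTO (recovery time objective) is how fast you must be back.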

Recovery Services vault vs Backup vault: every workload

Azure has two vault resource types, and they are not interchangeable. Choosing the wrong one means re-onboarding workloads later, so this is a day-one decision and a guaranteed exam question.

| Capability | Recovery Services vault | Backup vault |
| --- | --- | --- |
| Resource type | Microsoft.RecoveryServices/vaults | Microsoft.DataProtection/backupVaults |
| Azure VMs | Yes (snapshot + vault) | No |
| SQL Server in Azure VM | Yes | No |
| SAP HANA in Azure VM | Yes | No |
| Azure Files | Yes (snapshot-based) | Yes (vaulted, off-site copy) |
| On-prem files/folders (MARS agent) | Yes | No |
| On-prem VMs / app workloads (MABS / DPM) | Yes | No |
| Azure Blobs (operational + vaulted) | No | Yes |
| Azure Managed Disks | No | Yes |
| Azure Database for PostgreSQL Flexible Server | No | Yes |
| AKS (cluster state + PVs) | No | Yes |
| Azure Site Recovery (DR replication) | Yes | No |
| Immutability | Yes | Yes |
| Soft delete | Yes (basic + enhanced) | Yes |
| MUA via Resource Guard | Yes | Yes |
| Cross-region restore | Yes | Yes (selected workloads) |

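The rule of thumb: Azure VMs, SQL Server or SAP HANA in a VM, on-prem workloads (MARS/MABS/DPM), and everything Site Recovery go in a Recovery Services vault; Blobs, Managed Disks, PostgreSQL Flexible Server, AKS, and vaulted Azure Files go in a Backup vault.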

Many platform teams run both — that is expected and correct; they are governed the same way for redundancy, soft delete, immutability, and MUA. The seam is purely about which workloads each supports.
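To make the seam concrete, here is a minimal Python sketch that turns the capability table into a lookup and reports which vault types a given estate needs. The workload labels (such as "azure-vm") and the function name are hypothetical names chosen for this illustration, not Azure identifiers, and the tie-break of defaulting a flexible workload to a Recovery Services vault is an assumption of the sketch.

```python
# Workload -> supported vault types, transcribed from the capability table.
# Keys are informal labels invented for this sketch, not Azure identifiers.
RSV = "Recovery Services vault"
BV = "Backup vault"

VAULT_SUPPORT = {
    "azure-vm": {RSV},
    "sql-in-vm": {RSV},
    "sap-hana-in-vm": {RSV},
    "azure-files": {RSV, BV},
    "mars-files": {RSV},
    "mabs-dpm": {RSV},
    "azure-blobs": {BV},
    "managed-disks": {BV},
    "postgresql-flexible": {BV},
    "aks": {BV},
    "site-recovery": {RSV},
}


def vaults_needed(workloads):
    """Return the set of vault types required to cover every workload."""
    unknown = [w for w in workloads if w not in VAULT_SUPPORT]
    if unknown:
        raise ValueError(f"unknown workload(s): {unknown}")
    needed = set()
    # Workloads with a single option force that vault type.
    for w in workloads:
        if len(VAULT_SUPPORT[w]) == 1:
            needed |= VAULT_SUPPORT[w]
    # Flexible workloads reuse a vault that is already required;
    # otherwise this sketch defaults them to a Recovery Services vault.
    for w in workloads:
        options = VAULT_SUPPORT[w]
        if len(options) > 1 and not (needed & options):
            needed.add(RSV)
    return needed


print(sorted(vaults_needed(["azure-vm", "azure-blobs", "azure-files"])))
```

An estate with VMs and Blobs needs both vault types, which matches the observation that many platform teams run both.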

Creating a Recovery Services vault: every setting

When you create a Recovery Services vault the portal walks Basics → Networking → Tags → Review + create. After creation, two vault-wide Properties (redundancy and security) are the load-bearing settings.

| Setting | What it is | Choices / default | When / trade-off / gotcha |
| --- | --- | --- | --- |
| Subscription / Resource group | Billing + lifecycle container | Any you have rights to | A vault protects resources in the same region; you typically have one vault per region per environment. |
| Vault name | The resource name | 2–50 chars, unique in the RG | Renaming is not supported. Use rsv-<env>-<region> (e.g. rsv-prod-eastus). |
| Region | Where the vault (and its backup data) lives | Any region | The vault must be in the same region as the resources it backs up. A VM in East US cannot be backed up to a vault in West US. |
| Storage redundancy | Durability of backup data | LRS · ZRS · GRS (default for new vaults) | Changeable only while the vault has zero protected items, so it is a true day-zero decision. GRS is required for cross-region restore. See the redundancy table below. |
| Cross Region Restore | Lets you restore in the geo-paired region on demand | Off by default; requires GRS | Enable it before onboarding if you want region-failure recovery without a Microsoft-declared outage. |
| Networking (public/private access) | How the vault is reached | Public endpoint (default) or Private Endpoint | Private endpoints lock the vault’s data plane to your VNet. Recommended for production; configure before large-scale onboarding. |
| Tags | Cost/governance metadata | Free | env, owner, costcenter; FinOps and policy depend on these. |

Storage redundancy controls how durable your backup data is — and you cannot change it once items are protected:

| Redundancy | Copies / placement | Survives | Cost | When |
| --- | --- | --- | --- | --- |
| LRS (Locally redundant) | 3 copies in one datacentre | Disk/rack failure | Lowest | Dev/test, or where data can be recreated. |
| ZRS (Zone redundant) | 3 copies across availability zones in the region | A full zone outage | Medium | Production needing in-region zone resilience. |
| GRS (Geo-redundant) | LRS locally + async copy to the paired region | Loss of the primary region | Higher | Production needing region-failure recovery; required for cross-region restore. |

Gotcha: GRS is the default for new vaults, but if you switch a vault to LRS to save money, you permanently forfeit cross-region restore and can’t switch back once items are protected. Decide based on your DR requirement, not the monthly bill.
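The two rules in this section (redundancy is frozen once anything is protected, and cross-region restore needs GRS) can be sketched as a small Python model. The Vault class and function names are hypothetical and exist only for illustration; the real checks are enforced by the Azure Backup service, not by client code.

```python
from dataclasses import dataclass

VALID_REDUNDANCY = {"LRS", "ZRS", "GRS"}


@dataclass
class Vault:
    redundancy: str = "GRS"      # GRS is the default for new vaults
    protected_items: int = 0
    crr_enabled: bool = False


def change_redundancy(vault: Vault, new: str) -> None:
    """Apply the day-zero rule: no change once items are protected."""
    if new not in VALID_REDUNDANCY:
        raise ValueError(f"unknown redundancy: {new}")
    if vault.protected_items > 0:
        raise RuntimeError("redundancy is fixed once items are protected")
    vault.redundancy = new
    if new != "GRS":
        vault.crr_enabled = False  # CRR only exists on GRS


def enable_crr(vault: Vault) -> None:
    """Cross-region restore is only available on a GRS vault."""
    if vault.redundancy != "GRS":
        raise RuntimeError("cross-region restore requires GRS")
    vault.crr_enabled = True


lab = Vault()
change_redundancy(lab, "LRS")    # allowed: nothing is protected yet
lab.protected_items = 1
try:
    change_redundancy(lab, "GRS")
except RuntimeError as err:
    print(err)
```

The failed second call is the situation the gotcha warns about: the cheaper choice was made first and can no longer be undone on that vault.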

# Recovery Services vault + redundancy (set redundancy BEFORE any backups)
az group create -n rg-backup-lab -l eastus

az backup vault create \
  --resource-group rg-backup-lab \
  --name rsv-lab-eastus \
  --location eastus

# Geo-redundant + cross-region restore enabled (prerequisite for CRR)
az backup vault backup-properties set \
  --resource-group rg-backup-lab \
  --name rsv-lab-eastus \
  --backup-storage-redundancy GeoRedundant \
  --cross-region-restore-flag true

// The same vault and storage configuration in Bicep
resource rsv 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
  name: 'rsv-lab-eastus'
  location: 'eastus'
  sku: { name: 'RS0', tier: 'Standard' }
  properties: {}
}

resource rsvConfig 'Microsoft.RecoveryServices/vaults/backupstorageconfig@2024-04-01' = {
  parent: rsv
  name: 'vaultstorageconfig'
  properties: {
    storageModelType: 'GeoRedundant'
    crossRegionRestoreFlag: true
  }
}

Creating a Backup vault

The Backup vault create flow is similar (Basics → Redundancy → Tags → Review + create), but redundancy values are named LocallyRedundant / ZoneRedundant / GeoRedundant, and you choose the soft-delete mode at creation. Use it only for the workloads in the table above.

az dataprotection backup-vault create \
  --resource-group rg-backup-lab \
  --vault-name bvault-lab-eastus \
  --location eastus \
  --storage-settings datastore-type="VaultStore" type="GeoRedundant" \
  --soft-delete-state On --retention-duration-in-days 14

Backup policies: frequency, GFS retention, instant restore

A backup policy answers two questions for a workload: how often do we back up, and how long do we keep each kind of recovery point. The same policy can protect many items.

Schedule (frequency)

| Workload | Schedule options | Default | Notes / gotcha |
| --- | --- | --- | --- |
| Azure VM (Standard policy) | Daily or Weekly | Daily | One backup per day; the simplest, most common policy. |
| Azure VM (Enhanced policy) | Hourly (every 4/6/8/12 h) or Daily | | Required for Trusted Launch VMs, Ultra Disk and Premium SSD v2 disks, and for multiple backups per day; can’t downgrade Enhanced→Standard on a protected item. |
| Azure Files | Daily (multiple times/day on snapshot) + vaulted | Daily | Snapshot tier is local to the share; vaulted tier copies off-site. |
| SQL in Azure VM | Full (daily/weekly) + Differential + Transaction log (as often as every 15 min) | | Log backups drive a 15-minute RPO with log-chain point-in-time restore. |

GFS retention (Grandfather-Father-Son)

Retention is where GFS lives. You keep frequent points for a short window and progressively sparser points for longer — granular recently, cheap for years:

| Tier | Keeps | Typical retention | Purpose |
| --- | --- | --- | --- |
| Daily | Each day’s backup | 7–30 days | Day-to-day “oops I deleted it” recovery. |
| Weekly | One chosen day’s backup per week | 4–12 weeks | Roll-back across weeks without daily bloat. |
| Monthly | One chosen backup per month | 12–60 months | Month-end / reporting snapshots. |
| Yearly | One chosen backup per year | up to 99 years | Long-term compliance/legal hold. |

Gotcha: retention is per recovery point, and reducing retention in a policy shrinks existing points’ lifetimes at the next cleanup — which is exactly the destructive operation immutability and MUA (below) are designed to block. Lengthening retention is always safe.
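The pruning logic behind GFS can be sketched in a few lines of Python. This is an illustrative model, not how the service is implemented: the function name, the choice of Sunday as the weekly day, and "first backup of the month/year" as the monthly/yearly pick are all assumptions, and month and year ages are measured by calendar difference rather than exact days.

```python
from datetime import date, timedelta


def gfs_keep(points, today, daily_days=30, weekly_weeks=12,
             monthly_months=12, yearly_years=7, weekly_day=6):
    """Return {backup_date: [tiers]} for the points a GFS policy still keeps.

    weekly_day uses date.weekday() numbering (6 = Sunday). The monthly and
    yearly tiers tag the first backup found in each month and year.
    """
    points = sorted(set(points))
    first_in_month, first_in_year = {}, {}
    for p in points:
        first_in_month.setdefault((p.year, p.month), p)
        first_in_year.setdefault(p.year, p)

    kept = {}
    for p in points:
        age_days = (today - p).days
        if age_days < 0:
            continue  # ignore points dated after "today"
        months_old = (today.year - p.year) * 12 + (today.month - p.month)
        tiers = []
        if age_days < daily_days:
            tiers.append("daily")
        if p.weekday() == weekly_day and age_days < weekly_weeks * 7:
            tiers.append("weekly")
        if first_in_month[(p.year, p.month)] == p and months_old < monthly_months:
            tiers.append("monthly")
        if first_in_year[p.year] == p and today.year - p.year < yearly_years:
            tiers.append("yearly")
        if tiers:
            kept[p] = tiers  # a point survives while any tier still wants it
    return kept


today = date(2025, 6, 30)
start = date(2024, 1, 1)
history = [start + timedelta(days=i) for i in range((today - start).days + 1)]
kept = gfs_keep(history, today, daily_days=7, weekly_weeks=4,
                monthly_months=12, yearly_years=2)
print(len(history), "points taken,", len(kept), "retained")
```

Re-running the function with smaller retention values drops points that were previously kept, which is the destructive effect of shortening retention that the gotcha describes.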

Instant-restore snapshots

For Azure VMs, Azure Backup first takes a local managed-disk snapshot (the instant restore / snapshot tier), then copies the data into the vault. The snapshot tier gives near-instant restores of recent points (no vault read) and is configurable from 1 to 5 days (default 2) under the Standard policy; the Enhanced policy allows a longer snapshot window, up to 30 days.

| Lever | Effect | Trade-off |
| --- | --- | --- |
| Snapshot retention 1–5 days (default 2) | More days = more recent points restore instantly | Snapshots are stored as managed-disk snapshots → more days = more snapshot storage cost. |

# Export the default VM policy, edit it, then create a custom daily policy from the JSON
az backup policy show -g rg-backup-lab -v rsv-lab-eastus -n DefaultPolicy -o json > vmpolicy.json
# (edit vmpolicy.json: schedule time, daily/weekly/monthly/yearly retention,
#  instantRpRetentionRangeInDays 1-5)
# "policy create" adds a new policy; "policy set" only updates an existing one
az backup policy create -g rg-backup-lab -v rsv-lab-eastus \
  --name DailyGfsPolicy --policy @vmpolicy.json \
  --backup-management-type AzureIaasVM
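
As a rough illustration of what the snapshot-retention lever changes, the Python sketch below decides whether a recovery point of a given age would still be served from the snapshot tier or would need a vault read. The function name is hypothetical, and the 1–5 day bound is the range quoted above for the snapshot tier.

```python
from datetime import datetime, timedelta


def restore_source(point_time, now, snapshot_days=2):
    """Return "snapshot" if the point is still inside the instant-restore
    window, otherwise "vault". snapshot_days follows the 1-5 day range."""
    if not 1 <= snapshot_days <= 5:
        raise ValueError("snapshot retention must be 1-5 days")
    age = now - point_time
    if age < timedelta(0):
        raise ValueError("recovery point is in the future")
    return "snapshot" if age <= timedelta(days=snapshot_days) else "vault"


now = datetime(2025, 6, 30, 12, 0)
for hours in (6, 60, 100):
    point = now - timedelta(hours=hours)
    print(hours, "h old ->", restore_source(point, now, snapshot_days=2))
```

Raising snapshot_days moves more recent points into the fast path, at the price of keeping more snapshots on disk.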

Backing up & restoring Azure VMs

VM backup is agent-light: enabling protection installs the VM Backup extension (VMSnapshot) which coordinates an application-consistent snapshot (via VSS on Windows / pre-post scripts on Linux), then ships the snapshot to the vault.

Consistency levels (know these for the exam):

| Consistency | What it captures | When you get it |
| --- | --- | --- |
| Application-consistent | In-memory/in-flight data flushed via VSS (Windows) or pre/post scripts (Linux); the best level | The goal for running VMs; needs the extension + VSS/scripts. |
| File-system-consistent | On-disk files consistent, but app buffers not flushed | Fallback when VSS/scripts unavailable. |
| Crash-consistent | Equivalent to pulling the power cord | Last resort (e.g. VM stopped/deallocated). |

Enable and trigger a backup

# Enable VM backup against a policy
az backup protection enable-for-vm \
  --resource-group rg-backup-lab \
  --vault-name rsv-lab-eastus \
  --vm myVm \
  --policy-name DefaultPolicy

# On-demand backup with an explicit retention for this point
az backup protection backup-now \
  --resource-group rg-backup-lab \
  --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm \
  --retain-until 15-07-2026 \
  --backup-management-type AzureIaasVM

Restore options for an Azure VM

| Restore type | What it does | When to use | Gotcha |
| --- | --- | --- | --- |
| Create new VM | Builds a brand-new VM from the recovery point | Fast full recovery without touching the original | Needs a staging storage account; new NIC/IP. |
| Replace existing (replace disks) | Swaps the original VM’s disks back to the recovery point | In-place rollback keeping name/IP/NIC | Original VM must exist; replaced disks are kept as a backup. |
| Restore disks | Restores managed disks only (and emits an ARM template) | Custom rebuilds, attach to a different VM | You assemble the VM yourself. |
| File recovery (item-level) | Mounts the recovery point as drives to copy individual files | You need a few files, not the whole VM | Uses an iSCSI mount via a downloaded script; unmount when done. |
| Cross-region restore | Any of the above in the paired region | Region outage or DR drill | Requires GRS + CRR enabled. |

# Restore disks from the latest recovery point to a staging account
RP=$(az backup recoverypoint list -g rg-backup-lab -v rsv-lab-eastus \
  --container-name myVm --item-name myVm \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)

az backup restore restore-disks \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm \
  --rp-name $RP --storage-account mystagingsa \
  --target-resource-group rg-restore-lab

File-level (item-level) recovery — the mechanism

File recovery is exam-favourite because the mechanism is unusual: you download a small executable script from the recovery point, run it on any machine with network line-of-sight, and it mounts the recovery point’s volumes locally over iSCSI. You then copy the files you need with normal file tools and unmount to release the mount.

# Generate the file-restore script + credentials for a recovery point
az backup restore files mount-rp \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm --rp-name $RP
# -> download/run the returned script; it mounts the volumes locally.
# When finished:
az backup restore files unmount-rp \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm --rp-name $RP

Gotcha: the file-restore mount has a 12-hour lifetime and you should always unmount when done — a forgotten mount holds the recovery point open. On Linux you may need the open-iscsi package; on Windows the script needs to run as administrator.

Backing up & restoring Azure Files

Azure Files has two protection tiers:

| Tier | Where copies live | Protects against | RPO/retention | Restore granularity |
| --- | --- | --- | --- | --- |
| Snapshot (operational) | In the same storage account as the share | Accidental change/delete of files | Multiple/day; up to 200 snapshots | Whole share or individual files/folders |
| Vaulted (off-site) | In the vault (Backup vault) | Loss of the storage account itself | Daily; GFS retention | Whole share (point-in-time) |

Snapshot backup is fast and cheap and covers the common “someone deleted the folder” case; vaulted backup is the insurance for storage-account-level loss. Enable Azure Files backup from a Recovery Services vault (snapshot) and/or a Backup vault (vaulted).

# Snapshot-based Azure Files backup via Recovery Services vault
az backup protection enable-for-azurefileshare \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --storage-account mystorageacct --azure-file-share myshare \
  --policy-name DefaultPolicy

# Item-level restore of a single folder back to the original share
# (restore-azurefiles restores selected items; restore-azurefileshare restores the whole share)
az backup restore restore-azurefiles \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name "StorageContainer;Storage;rg-backup-lab;mystorageacct" \
  --item-name myshare --rp-name <recovery-point> \
  --restore-mode OriginalLocation --resolve-conflict Overwrite \
  --source-file-type Directory --source-file-path <folder-path>

Backing up & restoring SQL Server in an Azure VM

For SQL running inside an Azure VM, Azure Backup is a workload-aware backup (not the same as VM-level backup). The Azure Backup workload extension runs inside the guest, discovers databases, and performs native SQL backups (Full, Differential, Transaction-log). This gives:

| SQL backup type | Frequency | Purpose |
| --- | --- | --- |
| Full | Daily or Weekly | Baseline restore point. |
| Differential | Daily (not same day as full) | Smaller, faster than full; reduces restore time. |
| Transaction log | As often as every 15 min | Drives the 15-minute RPO and point-in-time restore. |

# Register the SQL-in-VM instance with the vault (the step that precedes enabling protection on its databases)
az backup container register \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --workload-type MSSQL --backup-management-type AzureWorkload \
  --resource-id $(az vm show -g rg-backup-lab -n sqlvm --query id -o tsv)

Gotcha: SQL-in-VM backup needs the VM to reach the Azure Backup service (network) and a SQL account/permissions for the extension. It backs up databases, so restore targets are databases (overwrite, alternate location, or as files) — not the whole VM. To protect the whole machine and the databases, run both VM backup and SQL backup.

Cross-region restore, soft delete, immutability & MUA

These four controls turn a backup vault from “a copy of your data” into “a copy an attacker or a rogue admin cannot destroy.” This is the ransomware-resilience core of AZ-104/AZ-305.

Cross-region restore (CRR)

CRR lets you restore in the geo-paired region on demand, without waiting for Microsoft to declare a regional outage — invaluable for DR drills and region-down recovery.

Soft delete

Soft delete keeps deleted backup data recoverable for a retention window even after someone stops protection and deletes backups — defeating the classic “delete the backups, then encrypt” ransomware play.

| Mode | Behaviour | Retention |
| --- | --- | --- |
| Basic soft delete | Deleted backup items are kept and recoverable | 14 days, free |
| Enhanced soft delete | Adds always-on option (can be made non-disableable) + configurable window | 14–180 days (paid beyond the free window) |

# Inspect / set soft delete on a Recovery Services vault
az backup vault backup-properties show -g rg-backup-lab -n rsv-lab-eastus \
  --query "{soft:softDeleteFeatureState}"
az backup vault backup-properties set -g rg-backup-lab -n rsv-lab-eastus \
  --soft-delete-feature-state Enable

Immutable vault

Immutability blocks operations that would reduce protection of existing recovery points — deleting data before retention expires, shortening retention, or disabling soft delete. It does not block creating new backups or extending retention.

| State | Can an admin revert it? | Use |
| --- | --- | --- |
| Immutability not locked | Yes; a vault admin can disable it | Soak/test period to validate nothing breaks. |
| Immutability locked | No; irreversible | Production compliance/ransomware posture; even Microsoft cannot unlock it. |

Gotcha: lock immutability only after a soak period. Once locked it is permanent for the life of the vault; if a policy genuinely needs shorter retention later, you cannot shorten it.

Multi-user authorization (MUA)

MUA puts destructive operations behind a second person’s approval using a Resource Guard — typically owned in a different subscription/tenant so the backup admin alone cannot both request and approve. Protected operations (disable soft delete, reduce retention, delete a protected item, disable MUA itself) require a Just-In-Time approval against the Resource Guard before they proceed.

The defence-in-depth order to enable these is: soft delete → cross-region restore → immutability (unlocked, then locked after soak) → MUA.
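
How immutability and MUA layer on top of each other can be sketched as a simple decision function in Python. This is a teaching model built only from the behaviours described in this section; the operation names and return strings are hypothetical, and the real enforcement happens inside the vault and Resource Guard.

```python
# Operations that reduce protection of existing recovery points.
DESTRUCTIVE = {"delete_backup_data", "reduce_retention", "disable_soft_delete"}
IMMUTABILITY_STATES = {"disabled", "unlocked", "locked"}


def evaluate(operation, immutability="disabled", mua=False, approved=False):
    """Return what happens to `operation` under the given vault settings."""
    if immutability not in IMMUTABILITY_STATES:
        raise ValueError(f"unknown immutability state: {immutability}")
    if operation == "disable_immutability":
        # An unlocked vault can be reverted by an admin; a locked one cannot.
        return "blocked: lock is irreversible" if immutability == "locked" else "allowed"
    if operation not in DESTRUCTIVE:
        return "allowed"  # e.g. new backups or extending retention
    if immutability != "disabled":
        return "blocked by immutability"
    if mua and not approved:
        return "needs Resource Guard approval"
    return "allowed"


for op in ("extend_retention", "reduce_retention", "disable_immutability"):
    print(op, "->", evaluate(op, immutability="locked", mua=True))
```

With immutability disabled and MUA on, the same destructive call moves from "needs Resource Guard approval" to "allowed" only when approved=True, which is the second-person check the section describes.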

Azure Site Recovery: replication, plans & failover

Switch gears: Azure Site Recovery (ASR) continuously replicates whole machines so you can fail over and keep running. ASR config lives in a Recovery Services vault.

Scenarios

| Scenario | Source → Target | Replication mechanism |
| --- | --- | --- |
| Azure-to-Azure (A2A) | Azure VM in region/zone A → region/zone B | No appliance to deploy: the Site Recovery Mobility service extension is installed on the VM automatically when replication is enabled; the headline cloud-DR pattern. |
| Zone-to-zone | Azure VM in zone 1 → zone 2 (same region) | A2A variant for in-region zone resilience. |
| VMware / Physical → Azure | On-prem VMware VMs or physical servers → Azure | Via the Azure Site Recovery replication appliance (modern) running on-prem. |
| Hyper-V → Azure | On-prem Hyper-V VMs → Azure | Via the Azure Site Recovery provider on the Hyper-V host/VMM. |

Replication policy (the RPO/retention knobs)

A replication policy controls how recovery points are generated and kept:

| Setting | What it is | Typical / default | Trade-off |
| --- | --- | --- | --- |
| RPO threshold | When ASR raises an alert if replication lags | e.g. 15 min (alert only) | Lower = noisier alerts; replication itself is continuous. |
| Recovery-point retention | How long crash/app-consistent points are kept | 24 hours by default (A2A); configurable up to 15 days | Longer = more recovery points to fail over to. |
| App-consistent snapshot frequency | How often app-consistent (VSS) points are taken | e.g. every 1–4 hours, or off | More frequent = lower data loss for app-consistent recovery, slight guest overhead. |
| Multi-VM consistency | Groups VMs so they share a crash/app-consistent recovery point | Off by default | Essential for multi-tier apps (DB + app) that must fail over to the same moment; adds a replication group. |

ASR keeps only a short window of recovery points (hours by default, days at most), so it is not a backup. Use Azure Backup for long-term/point-in-time data recovery and ASR for keeping the machine running.
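
The RPO threshold is an alerting rule, not a replication setting, and a short Python sketch makes that distinction visible: it compares each VM's latest recovery point with the current time and reports the ones lagging beyond the threshold. The function name and VM names are hypothetical; in practice the lag is read from ASR's replication health.

```python
from datetime import datetime, timedelta


def rpo_breaches(latest_points, now, threshold=timedelta(minutes=15)):
    """Return {vm: lag} for every VM whose newest recovery point is older
    than the threshold. Replication keeps running; this only raises alerts."""
    return {vm: now - ts for vm, ts in latest_points.items()
            if now - ts > threshold}


now = datetime(2025, 6, 30, 12, 0)
latest = {
    "web01": now - timedelta(minutes=2),
    "app01": now - timedelta(minutes=5),
    "sql01": now - timedelta(minutes=40),   # heavy churn, replication lagging
}
for vm, lag in rpo_breaches(latest, now).items():
    print(vm, "is", int(lag.total_seconds() // 60), "minutes behind")
```

Lowering the threshold changes nothing about how fast data replicates; it only makes more VMs show up in this report.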

Recovery plans

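A recovery plan orchestrates the failover of many VMs into an ordered, repeatable runbook: it sequences VMs into groups (for example database before app before web) and adds manual actions and automation runbooks between them.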

A recovery plan is what turns “fail over 30 VMs” from a frantic afternoon into one button with predictable ordering.
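
The ordering a recovery plan provides can be sketched in Python as groups that run strictly in sequence, with an optional manual gate before each one. This is a simplified model: the function and parameter names are hypothetical, the fail_over callable stands in for the real failover operation, and VMs inside a group are looped here even though a real plan may start them together.

```python
def run_recovery_plan(groups, fail_over, manual_check=lambda group_no: True):
    """Fail over VMs group by group and return a log of what happened.

    groups: ordered list of lists of VM names (e.g. DB, then app, then web).
    fail_over: callable invoked once per VM.
    manual_check: gate called before each group; returning False halts the plan.
    """
    log = []
    for group_no, vms in enumerate(groups, start=1):
        if not manual_check(group_no):
            log.append(f"group {group_no}: halted at manual action")
            break
        for vm in vms:
            fail_over(vm)
            log.append(f"group {group_no}: {vm} failed over")
    return log


started = []
plan = [["sql01"], ["app01", "app02"], ["web01"]]
for line in run_recovery_plan(plan, started.append):
    print(line)
```

Because the database group always completes before the app group starts, the dependency order no longer relies on whoever is on call remembering it.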

Test failover vs failover vs failback

This three-way distinction is the single most-asked ASR exam question:

| Operation | What it does | Impact on production | When |
| --- | --- | --- | --- |
| Test failover | Spins up the replicated VMs in an isolated network to validate DR | None; production keeps running and replicating | Regular non-disruptive DR drills; do this often. |
| Failover | Brings the VMs up in the target region/zone for real | Production source is now down/secondary; you’re running in the target | An actual disaster (or a planned, committed migration). |
| Failback | After the source is healthy, reverse-replicate and return to the original | Brief switch back to the source region | Once the primary region recovers; for on-prem you reprotect then failback. |

The full lifecycle is: enable replication → (let it reach a healthy RPO) → test failover (drill) → [disaster] → failover → commit → reprotect (reverse replication) → failback → re-enable original-direction replication.
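
The lifecycle above reads naturally as a state machine, and a small Python sketch makes the legal order explicit. The state and action names are labels invented for this illustration (they are not ASR API values), and the model deliberately ignores details such as cleaning up a test failover.

```python
# (current state, action) -> next state, following the lifecycle order above.
TRANSITIONS = {
    ("replicating", "test_failover"): "replicating",        # drill: no change
    ("replicating", "failover"): "failed_over",
    ("failed_over", "commit"): "committed",
    ("committed", "reprotect"): "reverse_replicating",
    ("reverse_replicating", "test_failover"): "reverse_replicating",
    ("reverse_replicating", "failback"): "failed_back",
    ("failed_back", "reprotect"): "replicating",            # original direction
}


def step(state, action):
    """Apply one action, rejecting anything out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} while {state}") from None


def run(actions, state="replicating"):
    for action in actions:
        state = step(state, action)
    return state


print(run(["test_failover", "failover", "commit", "reprotect"]))
```

Note that test_failover maps a state to itself: it is the only action that can be repeated freely without moving production anywhere.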

# A2A is most reliably scripted via PowerShell/templates; the az CLI surface
# is limited. Validate replication health for a protected item:
az site-recovery replication-protected-item show \
  --resource-group rg-asr --vault-name rsv-asr \
  --fabric-name <primary-fabric> --protection-container-name <pc> \
  --replicated-protected-item-name myVm-asr \
  --query "{state:properties.protectionState, rpo:properties.providerSpecificDetails}"

Gotcha: A2A is most robustly configured via the portal, ARM/Bicep, or Az.RecoveryServices PowerShell; the az CLI coverage for ASR is partial. For exam answers, know the concepts and order; for production, drive ASR from templates/PowerShell and rehearse with test failover on a schedule.

Backup Center, reports & alerts

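As the estate grows you stop managing vault-by-vault and move to Backup Center, a single pane across all vaults, subscriptions, and workload types. The CLI calls below approximate its Jobs view for one vault and route vault diagnostics to Log Analytics so that Backup Reports populate.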

# Recent backup jobs across a vault (the Backup Center "Jobs" view, in CLI)
az backup job list -g rg-backup-lab -v rsv-lab-eastus \
  --query "[].{op:operation, status:status, start:startTime}" -o table

# Route vault diagnostics to Log Analytics so Backup Reports populate
az monitor diagnostic-settings create \
  --name to-law \
  --resource $(az backup vault show -g rg-backup-lab -n rsv-lab-eastus --query id -o tsv) \
  --workspace <log-analytics-workspace-id> \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]'

Figure: Azure Backup and Site Recovery architecture. A Recovery Services vault holds VM, Azure Files and SQL-in-VM backups with GFS retention, instant-restore snapshots, soft delete, immutability and multi-user authorization, alongside Azure Site Recovery replicating VMs to a paired region with a replication policy, recovery plan, and the test-failover / failover / failback lifecycle, all surfaced through Backup Center reports and alerts.

The diagram above ties the whole picture together: one Recovery Services vault feeding both the recover-data path (Backup, hardened by soft delete, immutability, and MUA) and the keep-running path (Site Recovery replication with its failover lifecycle), unified under Backup Center.

Hands-on lab

Enable backup on a small VM, take an on-demand backup, perform a file-level restore, then clean everything up. All az CLI; the only billable pieces are tiny and removed at the end.

1. Resource group, vault, and a small VM

LOC=eastus
RG=rg-backup-lab
az group create -n $RG -l $LOC

# Recovery Services vault (LRS for the lab to minimise cost)
az backup vault create -g $RG -n rsv-lab-eastus -l $LOC
az backup vault backup-properties set -g $RG -n rsv-lab-eastus \
  --backup-storage-redundancy LocallyRedundant

# Tiny Linux VM to protect
az vm create -g $RG -n bkpVm -l $LOC \
  --image Ubuntu2204 --size Standard_B1s \
  --admin-username azureuser --generate-ssh-keys

2. Enable backup and trigger an on-demand backup

az backup protection enable-for-vm \
  -g $RG -v rsv-lab-eastus --vm bkpVm --policy-name DefaultPolicy

# On-demand backup, retained ~30 days from today
az backup protection backup-now \
  -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm \
  --backup-management-type AzureIaasVM \
  --retain-until $(date -v+30d +%d-%m-%Y 2>/dev/null || date -d "+30 days" +%d-%m-%Y)

3. Verify the job and list recovery points

# Wait for the backup job to complete (Status -> Completed)
az backup job list -g $RG -v rsv-lab-eastus -o table

# List recovery points and capture the newest
RP=$(az backup recoverypoint list -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)
echo "Recovery point: $RP"

Expected: a Backup job with Status = Completed, and $RP populated with a recovery-point name.

4. File-level restore (mount the recovery point)

# Generate the file-restore script + iSCSI credentials
az backup restore files mount-rp \
  -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm --rp-name $RP

Run the returned script on a machine with network access (it mounts the recovery point’s volumes locally over iSCSI). Browse the mounted volume, copy any file you need, then release the mount:

az backup restore files unmount-rp \
  -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm --rp-name $RP

Read the result: you recovered individual files without restoring the whole VM — the everyday “I deleted one file” scenario, and a classic exam question.

Cleanup

# Stop protection AND delete backup data, then remove the RG
az backup protection disable -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm \
  --backup-management-type AzureIaasVM --delete-backup-data true --yes

az group delete -n $RG --yes --no-wait

If az group delete fails because the vault “contains backup items,” it’s because soft delete is holding deleted data. Undo/disable soft delete or wait out the soft-delete window, then delete — this is the soft-delete safety net doing its job.

Cost note

The vault and policies are free to define; you pay for protected-instance fees (per backed-up instance, by source size band) plus backup storage consumed (LRS in this lab). A single tiny B1s VM with one on-demand backup for an hour is a few rupees of storage plus a small instance fee — round to ₹5–₹30 if cleaned up promptly. The genuinely expensive accidents are GRS storage on large VMs with long retention and forgotten ASR replication (which bills continuously per replicated VM plus target-region storage). Always disable protection with --delete-backup-data and az group delete your labs.

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Can’t change vault redundancy to GRS | The vault already has protected items | Redundancy is fixed once items exist. Create a new vault with the right redundancy, or remove all items first. |
| “Cross-region restore not available” | Vault is LRS/ZRS, or CRR flag off | Set redundancy to GRS and enable the Cross Region Restore flag before onboarding. |
| VM backup is only crash-consistent | VSS (Windows) / pre-post scripts (Linux) failed, or VM was stopped | Ensure the VM Backup extension is healthy and VSS/scripts run; back up while running. |
| az group delete fails on a vault with no visible items | Soft delete is retaining deleted backups | Disable soft delete or wait the retention window, then delete. |
| Can’t reduce retention / delete a recovery point | Immutability locked or MUA is enforcing | This is by design. Request approval via the Resource Guard (MUA) or accept that locked immutability is permanent. |
| Trusted Launch VM won’t take the Standard policy | Standard policy doesn’t support it | Use the Enhanced backup policy. |
| ASR shows a high RPO / replication lag | Network throughput between regions, or churn spikes | Check egress bandwidth and disk churn; raise the RPO threshold alert appropriately; consider larger cache. |
| Multi-tier app fails over to inconsistent state | No multi-VM consistency group | Put the interdependent VMs in a replication group so they share a recovery point. |

Best practices

Security notes

Cost & sizing

The levers that move a backup/DR bill, roughly in order of impact:

| Lever | Cost behaviour |
| --- | --- |
| Protected-instance fee | Charged per backed-up instance, banded by source data size (e.g. ≤50 GB, ≤500 GB, then per 500 GB). |
| Backup storage consumed | Per GB stored, multiplied by redundancy: GRS > ZRS > LRS. Long GFS retention compounds this. |
| Instant-restore snapshots | Stored as managed-disk snapshots; 1–5 days of snapshots = more disk storage. |
| Cross-region restore | GRS storage cost + restore egress when you actually do a CRR. |
| ASR replication (per VM) | A per-replicated-VM monthly fee plus target-region storage and any egress; bills continuously while enabled. |
| Enhanced soft delete beyond 14 days | Paid for retained-deleted data past the free window. |
| Log Analytics for Backup Reports | Ingestion + retention of vault diagnostic logs. |

Sizing rules of thumb: redundancy and retention are the big multipliers — GRS on large VMs with multi-year retention is where surprise bills come from. Right-size GFS to the actual compliance need, keep the snapshot tier at the default 2 days unless you need faster recent restores, and never leave ASR replication running on machines you no longer need to protect — it’s the most common forgotten recurring charge in this space.
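
The size banding behind the protected-instance fee can be sketched in Python. The band boundaries come from the table above (≤50 GB, ≤500 GB, then per 500 GB); the band labels and function name are hypothetical, and no prices are included because they vary by region and change over time.

```python
import math


def instance_fee_units(size_gb):
    """Return (band, units) for one protected instance of the given size.

    Bands follow the table: up to 50 GB, up to 500 GB, then one unit per
    500 GB increment. Band labels are invented for this sketch.
    """
    if size_gb <= 0:
        raise ValueError("size must be positive")
    if size_gb <= 50:
        return ("up-to-50GB", 1)
    if size_gb <= 500:
        return ("up-to-500GB", 1)
    return ("per-500GB", math.ceil(size_gb / 500))


for size in (30, 200, 501, 1200):
    band, units = instance_fee_units(size)
    print(f"{size} GB -> {band} x{units}")
```

Multiplying the units by the regional band price, then adding storage per GB at the chosen redundancy, gives a first-pass estimate before consulting the pricing calculator.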

Interview & exam questions

  1. What’s the difference between Azure Backup and Azure Site Recovery? Backup keeps point-in-time copies so you can recover data after deletion/corruption/ransomware (RPO hours, many historical points). Site Recovery continuously replicates whole machines so you can fail over and keep running (RPO seconds-minutes, short retention). Backup is a time machine; ASR is a spare engine. You use both.

  2. When do you use a Recovery Services vault vs a Backup vault? Recovery Services vault for Azure VMs, SQL/SAP-HANA-in-VM, Azure Files (snapshot), on-prem (MARS/MABS/DPM), and all of ASR. Backup vault for Blobs, Disks, PostgreSQL Flexible Server, AKS, and vaulted Azure Files. VMs and ASR ⇒ Recovery Services vault.

  3. Explain GFS retention. Grandfather-Father-Son: keep daily points for a short window, roll up to weekly, monthly, and yearly (up to 99 years). Granular recently, sparse and cheap for long-term compliance.

  4. What is instant restore / the snapshot tier? A local managed-disk snapshot taken before data is copied to the vault, giving near-instant restores of recent points without reading vault storage. Configurable 1–5 days (default 2); more days = more snapshot storage cost.

  5. What are the consistency levels for VM backup, and which is best? Application-consistent (VSS/pre-post scripts flush app buffers — best), file-system-consistent (on-disk files consistent), crash-consistent (like pulling the power). Aim for application-consistent on running VMs.

  6. How does file-level (item-level) restore work for an Azure VM? You download a script from the recovery point and run it; it mounts the recovery point’s volumes locally over iSCSI. You copy the files you need, then unmount. The mount has a ~12-hour lifetime.

  7. How do you achieve a 15-minute RPO for SQL running in an Azure VM? Use SQL-in-VM (workload-aware) backup with transaction-log backups every 15 minutes, on top of Full + Differential — enabling point-in-time restore within the log chain.

  8. What’s required for cross-region restore, and why would you use it? GRS redundancy and the Cross Region Restore flag, set before onboarding. It lets you restore in the paired region on demand — for DR drills or a region outage — without waiting for Microsoft to declare an outage.

  9. How do the four ransomware controls fit together, and in what order? Soft delete keeps deleted backups recoverable; cross-region restore gives an out-of-region copy; immutable vault blocks retention-shortening/early-deletion; MUA (Resource Guard) gates destructive ops behind a second approver. Enable soft delete → CRR → immutability (soak then lock) → MUA.

  10. What is the difference between an unlocked and a locked immutable vault? Unlocked: immutability is active but a vault admin can disable it (use as a soak period). Locked: irreversible — not even Microsoft can unlock it; retention can only be extended, never shortened.

  11. Test failover vs failover vs failback — what’s the difference? Test failover spins the replica up in an isolated network with no impact on production (your DR drill). Failover brings the machines up in the target region for real during a disaster. Failback reverse-replicates and returns to the original site once it’s healthy.

  12. What does a recovery plan add over failing over VMs individually, and what is multi-VM consistency? A recovery plan orders failover into groups (e.g. DB before app before web) and can add manual actions and automation runbooks as pre- or post-steps. Multi-VM consistency groups interdependent VMs so they share the same recovery point, which is essential for multi-tier apps that must come back to the same moment.

  13. Why is ASR not a substitute for backup? ASR replicates everything faithfully, including corruption and ransomware, and keeps only a short window of recovery points (hours to a few days). For clean, long-term, point-in-time recovery you need Azure Backup.
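
The point-in-time restore in question 7 is easier to reason about with a concrete chain. The sketch below is illustrative Python, not how Azure Backup is implemented: `Backup` and `restore_chain` are hypothetical names, and it assumes the usual SQL Server restore order (the latest full backup, then the latest differential based on it, then log backups up to the first one that covers the target time).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    kind: str            # "full", "diff", or "log" (hypothetical labels)
    taken_at: datetime


def restore_chain(backups, target):
    """Return the backups to restore, in order, to reach `target`."""
    ordered = sorted(backups, key=lambda b: b.taken_at)

    fulls = [b for b in ordered if b.kind == "full" and b.taken_at <= target]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    full = fulls[-1]

    # Latest differential taken after that full and not later than the target.
    diffs = [b for b in ordered
             if b.kind == "diff" and full.taken_at < b.taken_at <= target]
    base = diffs[-1] if diffs else full
    chain = [full] if base is full else [full, base]

    # Replay logs taken after the base; stop at the first log covering the target.
    for b in ordered:
        if b.kind == "log" and b.taken_at > base.taken_at:
            chain.append(b)
            if b.taken_at >= target:
                return chain
    raise ValueError("log chain does not reach the target time")


# Sample schedule (assumed): full at 00:00, differential at 12:00, logs every 15 minutes.
day = datetime(2024, 6, 1)
backups = [Backup("full", day), Backup("diff", day + timedelta(hours=12))]
backups += [Backup("log", day + timedelta(minutes=15 * i)) for i in range(1, 96)]

chain = restore_chain(backups, day + timedelta(hours=13, minutes=7))
print([(b.kind, b.taken_at.strftime("%H:%M")) for b in chain])
```

The differential keeps the chain short, and the 15-minute log interval bounds how far the newest restorable moment can lag behind the failure. If no log backup reaches the target time, the sketch raises an error, because the target falls outside the log chain.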

Quick check

  1. Which vault type backs up Azure VMs and hosts Azure Site Recovery?
  2. What two things must be true for cross-region restore to work?
  3. What is the default (and max) instant-restore snapshot retention?
  4. Which backup type gives SQL-in-VM a 15-minute RPO?
  5. True or false: a test failover briefly takes your production VM offline.

Answers

  1. The Recovery Services vault (Microsoft.RecoveryServices/vaults).
  2. GRS storage redundancy and the Cross Region Restore flag enabled — both set before onboarding.
  3. Default 2 days, maximum 5 days.
  4. Transaction-log backups every 15 minutes (workload-aware SQL-in-VM backup).
  5. False — a test failover runs in an isolated network with no impact on production; it’s the non-disruptive DR drill.

Exercise

Design and build (in CLI) a hardened single-VM backup: create a GRS Recovery Services vault with the cross-region-restore flag enabled and enhanced soft delete on, protect one B-series VM with a custom policy (daily schedule, 14-day daily / 6-week weekly / 12-month monthly retention, instant-restore = 3 days), and trigger an on-demand backup. Then prove two things: (a) attempt to shorten retention or delete a recovery point and observe what soft delete / immutability would gate, and (b) confirm with az backup recoverypoint list that your point exists. Write one short paragraph explaining why you’d enable MUA with a Resource Guard in a separate subscription before considering this production-ready, and what destructive operation it would block. Clean up with az backup protection disable --delete-backup-data true, then az group delete.
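
Before building the policy, it can help to sanity-check how many recovery points the exercise's 14-day / 6-week / 12-month retention keeps at steady state. The following is a minimal Python sketch under stated assumptions, not Azure Backup's own logic: `retained_points` is a hypothetical helper, one backup per day is assumed, the weekly point is assumed to be Sunday's backup, and the monthly point is assumed to be the backup taken on the 1st (a real policy lets you choose both).

```python
from datetime import date, timedelta


def retained_points(today, daily_days=14, weekly_weeks=6, monthly_months=12,
                    history_days=400):
    """Map each retained backup date to the retention rules that keep it."""
    kept = {}
    for age in range(history_days):
        day = today - timedelta(days=age)
        rules = []
        if age < daily_days:
            rules.append("daily")
        # Assumption: the weekly point is Sunday's backup.
        if day.weekday() == 6 and age < weekly_weeks * 7:
            rules.append("weekly")
        # Assumption: the monthly point is the backup taken on the 1st.
        months_old = (today.year - day.year) * 12 + (today.month - day.month)
        if day.day == 1 and months_old < monthly_months:
            rules.append("monthly")
        if rules:
            kept[day] = rules
    return kept


kept = retained_points(date(2024, 6, 30))
print(len(kept), "recovery points retained")
for day in sorted(kept, reverse=True)[:3]:
    print(day.isoformat(), kept[day])
```

A date kept by more than one rule still counts as a single recovery point, which is why the total is lower than the sum of the three retention counts. Changing the keyword arguments shows how quickly longer retention multiplies the points you pay to store.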

Certification mapping

Exam: skills this lesson covers
AZ-104 (Administrator): Monitor and back up Azure resources: create and configure Recovery Services / Backup vaults, backup policies (GFS, instant restore), back up and restore VMs (incl. file-level), Azure Files, and SQL-in-VM; configure soft delete, cross-region restore; configure Azure Site Recovery for Azure VMs and perform failover/failback; use Backup Center, jobs, and alerts. The az lab mirrors the exam’s task-based items.
AZ-305 (Solutions Architect): Design business continuity solutions: design a backup and recovery strategy from RPO/RTO requirements, choose vault redundancy and CRR, design ransomware-resilient posture (soft delete, immutability, MUA), and design site recovery / DR with replication policies, recovery plans, multi-VM consistency, and region/zone failover.

Glossary

Next steps
