Azure Backup & Site Recovery Deep Dive: Vaults, Policies, Restore & DR Failover

Two questions decide whether your career survives the worst day in the data centre: how much data can you afford to lose (your RPO), and how fast must you be back (your RTO)? Azure answers them with two complementary services that learners and exam writers constantly confuse. Azure Backup keeps point-in-time copies so you can recover data after deletion, corruption, or ransomware. Azure Site Recovery (ASR) continuously replicates whole machines so you can fail over and keep running when a region, zone, or on-prem site goes dark. Backup is your time machine; Site Recovery is your spare engine. You need both, and an interviewer will probe exactly where one ends and the other begins.

This is the exhaustive lesson. We go setting by setting through the two vault types and the workloads each protects, backup policies down to GFS retention and instant-restore snapshots, the full restore matrix for VMs (whole-machine and file-level), Azure Files and SQL-in-VM, then the hardening controls every production tenant needs — cross-region restore, soft delete, immutable vaults, and multi-user authorization. Then we switch to Site Recovery — replication policies, recovery plans, the A2A and on-prem-to-Azure scenarios, and the critical distinction between test failover, failover, and failback — before pulling the estate together in Backup Center. By the end you can design a backup-and-DR posture from memory and answer the follow-ups AZ-104 and AZ-305 will throw at you.

Learning objectives

Choose correctly between a Recovery Services vault and a Backup vault, and name which workloads each one protects.
Author backup policies with the right frequency, GFS retention (daily/weekly/monthly/yearly), and instant-restore snapshot tuning.
Back up and restore Azure VMs (whole-VM, replace-disks, and file-level recovery), Azure Files (snapshot and vaulted), and SQL Server running in an Azure VM.
Harden backups against ransomware and rogue admins with cross-region restore, soft delete, immutable vaults, and multi-user authorization (MUA).
Configure Azure Site Recovery — replication policy, recovery plans, the Azure-to-Azure (A2A) and on-prem-to-Azure scenarios — and run a test failover, a real failover, and a failback, reasoning about RPO and RTO throughout.
Operate the estate centrally with Backup Center, backup reports, jobs, and alerts.

Prerequisites & where this fits

You need an Azure subscription, a resource group or two, at least one VM you can protect, and the az CLI (Cloud Shell is fine) from the earlier Fundamentals and Compute lessons. This is the Operations deep-dive of the Azure Zero-to-Hero course — it builds directly on the compute and storage deep-dives (Azure Virtual Machines Deep Dive and Azure Managed Disks Deep Dive), because the things you back up here are exactly the VMs, disks, and file shares you created there, and uses the same monitoring pipeline as Azure Monitor Deep Dive for alerts. If “RPO” and “RTO” are brand new, this lesson defines them in full; if you want the ransomware-hardening angle in more depth afterwards, the companion lesson Azure Backup Hardening goes deeper still.

Core concepts

Backup and disaster recovery solve different problems, and the single most important mental model is why two services exist:

Azure Backup = recover data to a point in time. It periodically copies data into a vault and keeps many historical recovery points (yesterday, last week, last quarter, last year). You use it when something was deleted, encrypted, or corrupted and you need a clean older copy. RPO is typically hours (one or a few backups a day), or as low as 15 minutes for SQL databases protected with transaction-log backups; RTO is however long a restore takes.
Azure Site Recovery = keep the machine running through a disaster. It continuously replicates a VM’s disks to another region (or zone, or to Azure from on-prem) so that within minutes you can boot an identical machine elsewhere. RPO is seconds to a couple of minutes; RTO is minutes. ASR is not a substitute for backup — it replicates corruption and ransomware just as faithfully as good data, and it keeps only a short window of recovery points.

Anchor terms you will see throughout:

RPO (Recovery Point Objective) — the maximum data loss you can tolerate, measured in time. “RPO = 15 minutes” means you can lose at most the last 15 minutes of changes. Backup frequency and replication continuity set your RPO.
RTO (Recovery Time Objective) — the maximum time you can be down. Restore speed and failover automation set your RTO.
GFS (Grandfather-Father-Son) — the retention scheme where short-interval backups roll up into longer-lived ones (daily → weekly → monthly → yearly), so you keep granularity recently and sparse copies for years cheaply.
Recovery point — a single restorable snapshot/backup taken at a moment in time.
Instant restore (snapshot tier) — a local managed-disk snapshot taken before data is copied to the vault, giving near-instant restores for recent points without reading from vault storage.
Vault — the Azure resource that holds backup data and/or replication configuration, governs redundancy, and is the unit for soft delete, immutability, and RBAC.
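To make the RPO definition concrete, here is a minimal sketch of the arithmetic behind a schedule-driven RPO. It assumes backups are evenly spaced and that the failure strikes just before the next backup would have run; the function name `worst_case_rpo_minutes` is illustrative and not part of any Azure tooling.

```shell
# Worst-case data loss in minutes for evenly spaced backups.
# Arg 1: number of backups per day (must divide 1440 for an exact answer).
worst_case_rpo_minutes() {
  echo $(( 24 * 60 / $1 ))
}

worst_case_rpo_minutes 1    # one daily VM backup        -> 1440
worst_case_rpo_minutes 6    # a backup every 4 hours     -> 240
worst_case_rpo_minutes 96   # SQL log backup every 15 min -> 15
```

Continuous replication (ASR) does not fit this model, because there is no fixed interval; its RPO is governed by replication lag instead.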

Recovery Services vault vs Backup vault: every workload

Azure has two vault resource types, and they are not interchangeable. Choosing the wrong one means re-onboarding workloads later, so this is a day-one decision and a guaranteed exam question.

Capability	Recovery Services vault	Backup vault
Resource type	`Microsoft.RecoveryServices/vaults`	`Microsoft.DataProtection/backupVaults`
Azure VMs	Yes (snapshot + vault)	No
SQL Server in Azure VM	Yes	No
SAP HANA in Azure VM	Yes	No
Azure Files	Yes (snapshot and vaulted tiers)	No
On-prem files/folders (MARS agent)	Yes	No
On-prem VMs / app workloads (MABS / DPM)	Yes	No
Azure Blobs (operational + vaulted)	No	Yes
Azure Managed Disks	No	Yes
Azure Database for PostgreSQL Flexible Server	No	Yes
AKS (cluster state + PVs)	No	Yes
Azure Site Recovery (DR replication)	Yes	No
Immutability	Yes	Yes
Soft delete	Yes (basic + enhanced)	Yes
MUA via Resource Guard	Yes	Yes
Cross-region restore	Yes	Yes (selected workloads)

The rule of thumb:

Recovery Services vault: the classic and broadest vault. Use it for Azure VMs, SQL/SAP-HANA-in-VM, Azure Files (snapshot and vaulted tiers), on-prem via MARS/MABS/DPM, and all of Azure Site Recovery. If you remember one thing: VMs and ASR live in a Recovery Services vault.
Backup vault: the newer vault for managed-data-store workloads (Blobs, Disks, PostgreSQL Flexible Server, and AKS). It introduced the modern data-protection control plane (Microsoft.DataProtection).

Many platform teams run both — that is expected and correct; they are governed the same way for redundancy, soft delete, immutability, and MUA. The seam is purely about which workloads each supports.

Creating a Recovery Services vault: every setting

When you create a Recovery Services vault the portal walks Basics → Networking → Tags → Review + create. After creation, two vault-wide Properties (redundancy and security) are the load-bearing settings.

Setting	What it is	Choices / default	When / trade-off / gotcha
Subscription / Resource group	Billing + lifecycle container	Any you have rights to	A vault protects resources in the same region; you typically have one vault per region per environment.
Vault name	The resource name	2–50 chars, unique in the RG	Renaming is not supported. Use `rsv-<env>-<region>` (e.g. `rsv-prod-eastus`).
Region	Where the vault (and its backup data) lives	Any region	The vault must be in the same region as the resources it backs up. A VM in East US cannot be backed up to a vault in West US.
Storage redundancy	Durability of backup data	LRS · ZRS · GRS (default for new vaults)	Changeable only while the vault has zero protected items — a true day-zero decision. GRS is required for cross-region restore. See the redundancy table below.
Cross Region Restore	Lets you restore in the geo-paired region on demand	Off by default; requires GRS	Enable it before onboarding if you want region-failure recovery without a Microsoft-declared outage.
Networking (public/private access)	How the vault is reached	Public endpoint (default) or Private Endpoint	Private endpoints lock the vault’s data plane to your VNet — recommended for production; configure before large-scale onboarding.
Tags	Cost/governance metadata	Free	`env`, `owner`, `costcenter` — FinOps and policy depend on these.

Storage redundancy controls how durable your backup data is — and you cannot change it once items are protected:

Redundancy	Copies / placement	Survives	Cost	When
LRS (Locally redundant)	3 copies in one datacentre	Disk/rack failure	Lowest	Dev/test, or where data can be recreated.
ZRS (Zone redundant)	3 copies across availability zones in the region	A full zone outage	Medium	Production needing in-region zone resilience.
GRS (Geo-redundant)	LRS locally + async copy to the paired region	Loss of the primary region	Higher	Production needing region-failure recovery; required for cross-region restore.

Gotcha: GRS is the default for new vaults, but if you switch a vault to LRS to save money, you permanently forfeit cross-region restore and can’t switch back once items are protected. Decide based on your DR requirement, not the monthly bill.

# Recovery Services vault + redundancy (set redundancy BEFORE any backups)
az group create -n rg-backup-lab -l eastus

az backup vault create \
  --resource-group rg-backup-lab \
  --name rsv-lab-eastus \
  --location eastus

# Geo-redundant + cross-region restore enabled (prerequisite for CRR)
az backup vault backup-properties set \
  --resource-group rg-backup-lab \
  --name rsv-lab-eastus \
  --backup-storage-redundancy GeoRedundant \
  --cross-region-restore-flag true

resource rsv 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
  name: 'rsv-lab-eastus'
  location: 'eastus'
  sku: { name: 'RS0', tier: 'Standard' }
  properties: {}
}

resource rsvConfig 'Microsoft.RecoveryServices/vaults/backupstorageconfig@2024-04-01' = {
  parent: rsv
  name: 'vaultstorageconfig'
  properties: {
    storageModelType: 'GeoRedundant'
    crossRegionRestoreFlag: true
  }
}

Creating a Backup vault

The Backup vault create flow is similar (Basics → Redundancy → Tags → Review + create), but redundancy values are named LocallyRedundant / ZoneRedundant / GeoRedundant, and you choose the soft-delete mode at creation. Use it only for the workloads in the table above.

az dataprotection backup-vault create \
  --resource-group rg-backup-lab \
  --vault-name bvault-lab-eastus \
  --location eastus \
  --storage-settings datastore-type="VaultStore" type="GeoRedundant" \
  --soft-delete-state On --retention-duration-in-days 14

Backup policies: frequency, GFS retention, instant restore

A backup policy answers two questions for a workload: how often do we back up, and how long do we keep each kind of recovery point. The same policy can protect many items.

Schedule (frequency)

Workload	Schedule options	Default	Notes / gotcha
Azure VM (Standard policy)	Daily or Weekly	Daily	One backup per day; the simplest, most common policy.
Azure VM (Enhanced policy)	Hourly (every 4/6/8/12 h) or Daily	—	Required for Trusted Launch VMs and VMs with Ultra or Premium SSD v2 disks, and for multiple backups per day (Gen2 alone does not require it); can’t downgrade Enhanced→Standard on a protected item.
Azure Files	Daily (multiple times/day on snapshot) + vaulted	Daily	Snapshot tier is local to the share; vaulted tier copies off-site.
SQL in Azure VM	Full (daily/weekly) + Differential + Transaction log (as often as every 15 min)	—	Log backups drive a 15-minute RPO with log-chain point-in-time restore.

GFS retention (Grandfather-Father-Son)

Retention is where GFS lives. You keep frequent points for a short window and progressively sparser points for longer — granular recently, cheap for years:

Tier	Keeps	Typical retention	Purpose
Daily	Each day’s backup	7–30 days	Day-to-day “oops I deleted it” recovery.
Weekly	One chosen day’s backup per week	4–12 weeks	Roll-back across weeks without daily bloat.
Monthly	One chosen backup per month	12–60 months	Month-end / reporting snapshots.
Yearly	One chosen backup per year	up to 99 years	Long-term compliance/legal hold.

Gotcha: retention is per recovery point, and reducing retention in a policy shrinks existing points’ lifetimes at the next cleanup — which is exactly the destructive operation immutability and MUA (below) are designed to block. Lengthening retention is always safe.
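The cost argument for GFS is easier to see with numbers. The sketch below counts recovery points at steady state and compares them with keeping every daily backup for the same horizon. It is an upper bound under a simplifying assumption: each tier is counted separately, whereas in practice one backup can satisfy several tiers at once. The function names are hypothetical.

```shell
# Upper bound on recovery points retained under a GFS policy.
# Args: daily_days weekly_weeks monthly_months yearly_years
gfs_max_points() {
  echo $(( $1 + $2 + $3 + $4 ))
}

# Points retained if every daily backup were kept instead.
# Arg 1: years (leap days ignored)
daily_only_points() {
  echo $(( $1 * 365 ))
}

gfs_max_points 30 12 60 10   # prints 112
daily_only_points 10         # prints 3650
```

Ten years of coverage with roughly 112 points instead of 3650 is the whole point of the scheme: granularity where you need it, sparse copies where you do not.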

Instant-restore snapshots

For Azure VMs, Azure Backup first takes a local managed-disk snapshot (the instant restore / snapshot tier), then copies the data into the vault. The snapshot tier gives near-instant restores of recent points (no vault read). On the Standard policy with a daily schedule it is configurable from 1 to 5 days (default 2); the Enhanced policy allows a longer snapshot window (up to 30 days).

Lever	Effect	Trade-off
Snapshot retention 1–5 days (default 2)	More days = more recent points restore instantly	Snapshots are kept in your subscription and billed as snapshot storage, so more days = more cost.
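As a rough illustration of how the snapshot window affects restores, the sketch below decides which tier would serve a recovery point of a given age. It is a simplification: the helper name `restore_tier` is hypothetical, ages are whole days, and the exact expiry boundary in the service may differ from the strict comparison used here.

```shell
# Which tier can serve a restore for a recovery point of a given age?
# Args: age_days [snapshot_retention_days, default 2]
restore_tier() {
  retention=${2:-2}
  if [ "$1" -lt "$retention" ]; then
    echo "snapshot"   # still inside the instant-restore window
  else
    echo "vault"      # snapshot has aged out; data is read from the vault
  fi
}

restore_tier 1      # prints snapshot
restore_tier 3      # prints vault
restore_tier 3 5    # prints snapshot
```

Raising the retention moves more restores onto the fast path, at the price of holding more snapshots.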

# Export the default VM policy as a starting point, then create a custom daily policy from the edited JSON
az backup policy show -g rg-backup-lab -v rsv-lab-eastus -n DefaultPolicy -o json > vmpolicy.json
# (edit vmpolicy.json: schedule time, daily/weekly/monthly/yearly retention,
#  instantRpRetentionRangeInDays 1-5)
# "policy create" adds a new policy; "policy set" only updates an existing one
az backup policy create -g rg-backup-lab -v rsv-lab-eastus \
  --name DailyGfsPolicy --policy @vmpolicy.json \
  --backup-management-type AzureIaasVM

Backing up & restoring Azure VMs

VM backup is agent-light: enabling protection installs the VM Backup extension (VMSnapshot) which coordinates an application-consistent snapshot (via VSS on Windows / pre-post scripts on Linux), then ships the snapshot to the vault.

Consistency levels (know these for the exam):

Consistency	What it captures	When you get it
Application-consistent	In-memory/in-flight data flushed via VSS (Windows) or pre/post scripts (Linux) — best	The goal for running VMs; needs the extension + VSS/scripts.
File-system-consistent	On-disk files consistent, but app buffers not flushed	Fallback when VSS/scripts unavailable.
Crash-consistent	Equivalent to pulling the power cord	Last resort (e.g. VM stopped/deallocated).

Enable and trigger a backup

# Enable VM backup against a policy
az backup protection enable-for-vm \
  --resource-group rg-backup-lab \
  --vault-name rsv-lab-eastus \
  --vm myVm \
  --policy-name DefaultPolicy

# On-demand backup with an explicit retention for this point
az backup protection backup-now \
  --resource-group rg-backup-lab \
  --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm \
  --retain-until 15-07-2026 \
  --backup-management-type AzureIaasVM

Restore options for an Azure VM

Restore type	What it does	When to use	Gotcha
Create new VM	Builds a brand-new VM from the recovery point	Fast full recovery without touching the original	Needs a staging storage account; new NIC/IP.
Replace existing (replace disks)	Swaps the original VM’s disks back to the recovery point	In-place rollback keeping name/IP/NIC	Original VM must exist; replaced disks are kept as a backup.
Restore disks	Restores managed disks only (and emits an ARM template)	Custom rebuilds, attach to a different VM	You assemble the VM yourself.
File recovery (item-level)	Mounts the recovery point as drives to copy individual files	You need a few files, not the whole VM	Uses an iSCSI mount via a downloaded script; unmount when done.
Cross-region restore	Any of the above in the paired region	Region outage or DR drill	Requires GRS + CRR enabled.

# Restore disks from the latest recovery point to a staging account
RP=$(az backup recoverypoint list -g rg-backup-lab -v rsv-lab-eastus \
  --container-name myVm --item-name myVm \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)

az backup restore restore-disks \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm \
  --rp-name $RP --storage-account mystagingsa \
  --target-resource-group rg-restore-lab

File-level (item-level) recovery — the mechanism

File recovery is an exam favourite because the mechanism is unusual: you download a small executable script for the recovery point, run it on any machine with network line-of-sight, and it mounts the recovery point’s volumes locally over iSCSI. You then copy the files you need with normal file tools and unmount to release the mount.

# Generate the file-restore script + credentials for a recovery point
az backup restore files mount-rp \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm --rp-name $RP
# -> download/run the returned script; it mounts the volumes locally.
# When finished:
az backup restore files unmount-rp \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name myVm --item-name myVm --rp-name $RP

Gotcha: the file-restore mount has a 12-hour lifetime and you should always unmount when done — a forgotten mount holds the recovery point open. On Linux you may need the open-iscsi package; on Windows the script needs to run as administrator.

Backing up & restoring Azure Files

Azure Files has two protection tiers:

Tier	Where copies live	Protects against	RPO/retention	Restore granularity
Snapshot (operational)	In the same storage account as the share	Accidental change/delete of files	Multiple/day; up to 200 snapshots	Whole share or individual files/folders
Vaulted (off-site)	In the Recovery Services vault	Loss of the storage account itself	Daily; GFS retention	Whole share (point-in-time)

Snapshot backup is fast and cheap and covers the common “someone deleted the folder” case; vaulted backup is the insurance for storage-account-level loss. Both tiers are enabled from a Recovery Services vault; the backup tier chosen in the policy decides whether points stay as share snapshots or are also copied to the vault.

# Snapshot-based Azure Files backup via Recovery Services vault
az backup protection enable-for-azurefileshare \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --storage-account mystorageacct --azure-file-share myshare \
  --policy-name DefaultPolicy

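# Full-share restore back to the original share, overwriting conflicts.
# For a single file or folder, use `az backup restore restore-azurefiles`
# with --source-file-type and --source-file-path instead.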
az backup restore restore-azurefileshare \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --container-name "StorageContainer;Storage;rg-backup-lab;mystorageacct" \
  --item-name myshare --rp-name <recovery-point> \
  --restore-mode OriginalLocation --resolve-conflict Overwrite

Backing up & restoring SQL Server in an Azure VM

For SQL running inside an Azure VM, Azure Backup is a workload-aware backup (not the same as VM-level backup). The Azure Backup workload extension runs inside the guest, discovers databases, and performs native SQL backups (Full, Differential, Transaction-log). This gives:

15-minute RPO through frequent transaction-log backups.
Point-in-time restore to any moment within the log chain.
Auto-protection of new databases on an instance.

SQL backup type	Frequency	Purpose
Full	Daily or Weekly	Baseline restore point.
Differential	Daily (not same day as full)	Smaller, faster than full; reduces restore time.
Transaction log	As often as every 15 min	Drives the 15-minute RPO and point-in-time restore.

# Register the SQL VM with the vault so its databases can be discovered.
# Protection is then enabled per database with `az backup protection enable-for-azurewl`.
az backup container register \
  --resource-group rg-backup-lab --vault-name rsv-lab-eastus \
  --workload-type MSSQL --backup-management-type AzureWorkload \
  --resource-id $(az vm show -g rg-backup-lab -n sqlvm --query id -o tsv)

Gotcha: SQL-in-VM backup needs the VM to reach the Azure Backup service (network) and a SQL account/permissions for the extension. It backs up databases, so restore targets are databases (overwrite, alternate location, or as files) — not the whole VM. To protect the whole machine and the databases, run both VM backup and SQL backup.

Cross-region restore, soft delete, immutability & MUA

These four controls turn a vault (of either type) from “a copy of your data” into “a copy an attacker or a rogue admin cannot destroy.” This is the ransomware-resilience core of AZ-104/AZ-305.

Cross-region restore (CRR)

CRR lets you restore in the geo-paired region on demand, without waiting for Microsoft to declare a regional outage — invaluable for DR drills and region-down recovery.

Requires GRS storage redundancy and the Cross Region Restore flag enabled (set both before onboarding).
Works for Azure VMs, SQL/SAP-HANA-in-VM, and selected Backup-vault workloads.
You restore from the secondary region recovery points to resources in that region.

Soft delete

Soft delete keeps deleted backup data recoverable for a retention window even after someone stops protection and deletes backups — defeating the classic “delete the backups, then encrypt” ransomware play.

Mode	Behaviour	Retention
Basic soft delete	Deleted backup items are kept and recoverable	14 days, free
Enhanced soft delete	Adds always-on option (can be made non-disableable) + configurable window	14–180 days (paid beyond the free window)

# Inspect / set soft delete on a Recovery Services vault
az backup vault backup-properties show -g rg-backup-lab -n rsv-lab-eastus \
  --query "{soft:softDeleteFeatureState}"
az backup vault backup-properties set -g rg-backup-lab -n rsv-lab-eastus \
  --soft-delete-feature-state Enable

Immutable vault

Immutability blocks operations that would reduce protection of existing recovery points — deleting data before retention expires, shortening retention, or disabling soft delete. It does not block creating new backups or extending retention.

State	Can an admin revert it?	Use
Immutability not locked	Yes — vault admin can disable	Soak/test period to validate nothing breaks.
Immutability locked	No — irreversible	Production compliance/ransomware posture; even Microsoft cannot unlock it.

Gotcha: lock immutability only after a soak period. Once locked it is permanent for the life of the vault; if a policy genuinely needs shorter retention later, you cannot shorten it.

Multi-user authorization (MUA)

MUA puts destructive operations behind a second person’s approval using a Resource Guard — typically owned in a different subscription/tenant so the backup admin alone cannot both request and approve. Protected operations (disable soft delete, reduce retention, delete a protected item, disable MUA itself) require a Just-In-Time approval against the Resource Guard before they proceed.

The defence-in-depth order to enable these is: soft delete → cross-region restore → immutability (unlocked, then locked after soak) → MUA.

Azure Site Recovery: replication, plans & failover

Switch gears: Azure Site Recovery (ASR) continuously replicates whole machines so you can fail over and keep running. ASR config lives in a Recovery Services vault.

Scenarios

Scenario	Source → Target	Replication mechanism
Azure-to-Azure (A2A)	Azure VM in region/zone A → region/zone B	No appliance to deploy; the Site Recovery Mobility service extension is installed on the VM automatically. The headline cloud-DR pattern.
Zone-to-zone	Azure VM in zone 1 → zone 2 (same region)	A2A variant for in-region zone resilience.
VMware / Physical → Azure	On-prem VMware VMs or physical servers → Azure	Via the Azure Site Recovery replication appliance (modern) running on-prem.
Hyper-V → Azure	On-prem Hyper-V VMs → Azure	Via the Azure Site Recovery provider on the Hyper-V host/VMM.

Replication policy (the RPO/retention knobs)

A replication policy controls how recovery points are generated and kept:

Setting	What it is	Typical / default	Trade-off
RPO threshold	When ASR raises an alert if replication lags	e.g. 15 min (alert only)	Lower = noisier alerts; replication itself is continuous.
Recovery-point retention	How long crash/app-consistent points are kept	24 hours by default; up to 15 days for managed disks (A2A)	Longer = more recovery points to fail over to, at extra storage cost.
App-consistent snapshot frequency	How often app-consistent (VSS) points are taken	e.g. every 1–4 hours, or off	More frequent = lower data loss for app-consistent recovery, slight guest overhead.
Multi-VM consistency	Groups VMs so they share a crash/app-consistent recovery point	Off by default	Essential for multi-tier apps (DB + app) that must fail over to the same moment; adds a replication group.

ASR keeps only a short window of recovery points (hours by default, days at most), so it is not a backup. Use Azure Backup for long-term/point-in-time data recovery and ASR for keeping the machine running.
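The RPO threshold in the table is an alerting line, not a throttle. A minimal sketch of that comparison follows, assuming you already have the observed replication lag in seconds from your monitoring; the function `rpo_status` is illustrative and is not an ASR command.

```shell
# Compare observed replication lag with the policy's RPO threshold.
# Args: observed_lag_seconds rpo_threshold_minutes
rpo_status() {
  threshold_seconds=$(( $2 * 60 ))
  if [ "$1" -gt "$threshold_seconds" ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

rpo_status 45 15     # prints OK
rpo_status 1200 15   # prints ALERT
```

Replication keeps running in both cases; only the alert state changes, which is why lowering the threshold makes alerts noisier without improving the actual RPO.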

Recovery plans

A recovery plan orchestrates the failover of many VMs into an ordered, repeatable runbook:

Groups and ordering — boot tier-1 (DB) before tier-2 (app) before tier-3 (web).
Manual actions — pause for an operator step (e.g. validate DNS).
Automation runbooks (Azure Automation) — script post-failover tasks (reassign public IPs, update DNS, open NSGs).

A recovery plan is what turns “fail over 30 VMs” from a frantic afternoon into one button with predictable ordering.

Test failover vs failover vs failback

This three-way distinction is the single most-asked ASR exam question:

Operation	What it does	Impact on production	When
Test failover	Spins up the replicated VMs in an isolated network to validate DR	None — production keeps running and replicating	Regular non-disruptive DR drills; do this often.
Failover	Brings the VMs up in the target region/zone for real	Production source is now down/secondary; you’re running in the target	An actual disaster (or a planned, committed migration).
Failback	After the source is healthy, reverse-replicate and return to the original	Brief switch back to the source region	Once the primary region recovers; for on-prem you reprotect then failback.

The full lifecycle is: enable replication → (let it reach a healthy RPO) → test failover (drill) → [disaster] → failover → commit → reprotect (reverse replication) → failback → re-enable original-direction replication.
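The lifecycle order is easy to get wrong under pressure, so it can help to encode it. The sketch below is a plain lookup of "what comes next" that mirrors the sequence above; the step names and the function `asr_next_step` are hypothetical labels for a runbook, not ASR API operations.

```shell
# Return the step that follows the given one in the failover lifecycle.
asr_next_step() {
  case "$1" in
    enable-replication) echo "test-failover" ;;
    test-failover)      echo "failover" ;;
    failover)           echo "commit" ;;
    commit)             echo "reprotect" ;;
    reprotect)          echo "failback" ;;
    failback)           echo "re-enable-replication" ;;
    *) echo "unknown step: $1" >&2; return 1 ;;
  esac
}

asr_next_step failover    # prints commit
asr_next_step reprotect   # prints failback
```

Note that reprotect sits between commit and failback: reverse replication has to be healthy before you can return to the original side.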

# A2A is most reliably scripted via PowerShell/templates; the az CLI surface
# is limited. Validate replication health for a protected item:
az site-recovery protected-item show \
  --resource-group rg-asr --vault-name rsv-asr \
  --fabric-name <primary-fabric> --protection-container-name <pc> \
  --replicated-protected-item-name myVm-asr \
  --query "{state:properties.protectionState, rpo:properties.providerSpecificDetails}"

Gotcha: A2A is most robustly configured via the portal, ARM/Bicep, or Az.RecoveryServices PowerShell; the az CLI coverage for ASR is partial. For exam answers, know the concepts and order; for production, drive ASR from templates/PowerShell and rehearse with test failover on a schedule.

Backup Center, reports & alerts

As the estate grows you stop managing vault-by-vault and move to Backup Center — a single pane across all vaults, subscriptions, and workload types:

Overview & Backup instances — every protected item and its last-backup health in one list.
Jobs — every backup/restore job, success/failure, durations (also az backup job list).
Policies — author and govern policies centrally; spot items on weak policies.
Backup reports — a Log-Analytics-backed workbook for storage consumed, retention, job trends, and optimization (e.g. items with excessive retention). Requires routing vault diagnostic settings to a Log Analytics workspace.
Alerts — built-in Azure Monitor alerts for backup failures and security events (e.g. disable-soft-delete), surfaced via action groups to email/SMS/webhook/ITSM. Wire these to the same action groups you built in the Azure Monitor Deep Dive.

# Recent backup jobs across a vault (the Backup Center "Jobs" view, in CLI)
az backup job list -g rg-backup-lab -v rsv-lab-eastus \
  --query "[].{op:properties.operation, status:properties.status, start:properties.startTime}" -o table

# Route vault diagnostics to Log Analytics so Backup Reports populate
az monitor diagnostic-settings create \
  --name to-law \
  --resource $(az backup vault show -g rg-backup-lab -n rsv-lab-eastus --query id -o tsv) \
  --workspace <log-analytics-workspace-id> \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]'

The diagram above ties the whole picture together: one Recovery Services vault feeding both the recover-data path (Backup, hardened by soft delete, immutability, and MUA) and the keep-running path (Site Recovery replication with its failover lifecycle), unified under Backup Center.

Hands-on lab

Enable backup on a small VM, take an on-demand backup, perform a file-level restore, then clean everything up. All az CLI; the only billable pieces are tiny and removed at the end.

1. Resource group, vault, and a small VM

LOC=eastus
RG=rg-backup-lab
az group create -n $RG -l $LOC

# Recovery Services vault (LRS for the lab to minimise cost)
az backup vault create -g $RG -n rsv-lab-eastus -l $LOC
az backup vault backup-properties set -g $RG -n rsv-lab-eastus \
  --backup-storage-redundancy LocallyRedundant

# Tiny Linux VM to protect
az vm create -g $RG -n bkpVm -l $LOC \
  --image Ubuntu2204 --size Standard_B1s \
  --admin-username azureuser --generate-ssh-keys

2. Enable backup and trigger an on-demand backup

az backup protection enable-for-vm \
  -g $RG -v rsv-lab-eastus --vm bkpVm --policy-name DefaultPolicy

# On-demand backup, retained ~30 days from today
az backup protection backup-now \
  -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm \
  --backup-management-type AzureIaasVM \
  --retain-until $(date -v+30d +%d-%m-%Y 2>/dev/null || date -d "+30 days" +%d-%m-%Y)

3. Verify the job and list recovery points

# Wait for the backup job to complete (Status -> Completed)
az backup job list -g $RG -v rsv-lab-eastus -o table

# List recovery points and capture the newest
RP=$(az backup recoverypoint list -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)
echo "Recovery point: $RP"

Expected: a Backup job with Status = Completed, and $RP populated with a recovery-point name.
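Rather than re-running the job list by hand, you can block until the job finishes. This is a sketch under two assumptions: that the first entry returned by `az backup job list` is the job you just started, and that the job status is exposed under `properties.status` in the JSON output. The wrapper name `wait_for_latest_job` is hypothetical.

```shell
# Wait for the first job listed in the vault, then print its final status.
# Args: resource_group vault_name
wait_for_latest_job() {
  job=$(az backup job list --resource-group "$1" --vault-name "$2" \
          --query "[0].name" -o tsv)
  # Blocks until the job completes or the CLI's timeout is reached
  az backup job wait --resource-group "$1" --vault-name "$2" --name "$job"
  az backup job show --resource-group "$1" --vault-name "$2" --name "$job" \
    --query "properties.status" -o tsv
}

# Usage in this lab:
#   wait_for_latest_job "$RG" rsv-lab-eastus
```

If several jobs are running in the vault, pick the job name from the table output instead of trusting the first entry.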

4. File-level restore (mount the recovery point)

# Generate the file-restore script + iSCSI credentials
az backup restore files mount-rp \
  -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm --rp-name $RP

Run the returned script on a machine with network access (it mounts the recovery point’s volumes locally over iSCSI). Browse the mounted volume, copy any file you need, then release the mount:

az backup restore files unmount-rp \
  -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm --rp-name $RP

Read the result: you recovered individual files without restoring the whole VM — the everyday “I deleted one file” scenario, and a classic exam question.

Cleanup

# Stop protection AND delete backup data, then remove the RG
az backup protection disable -g $RG -v rsv-lab-eastus \
  --container-name bkpVm --item-name bkpVm \
  --backup-management-type AzureIaasVM --delete-backup-data true --yes

az group delete -n $RG --yes --no-wait

If az group delete fails because the vault “contains backup items,” soft delete is still holding the deleted data. Disable soft delete on the vault, then undelete and re-delete the item so it is purged, or wait out the soft-delete window and delete the group afterwards. This is the soft-delete safety net doing its job.

Cost note

The vault and policies are free to define; you pay for protected-instance fees (per backed-up instance, by source size band) plus backup storage consumed (LRS in this lab). A single tiny B1s VM with one on-demand backup for an hour is a few rupees of storage plus a small instance fee — round to ₹5–₹30 if cleaned up promptly. The genuinely expensive accidents are GRS storage on large VMs with long retention and forgotten ASR replication (which bills continuously per replicated VM plus target-region storage). Always disable protection with --delete-backup-data and az group delete your labs.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
Can’t change vault redundancy to GRS	The vault already has protected items	Redundancy is fixed once items exist — create a new vault with the right redundancy, or remove all items first.
“Cross-region restore not available”	Vault is LRS/ZRS, or CRR flag off	Set redundancy to GRS and enable the Cross Region Restore flag before onboarding.
VM backup is only crash-consistent	VSS (Windows) / pre-post scripts (Linux) failed, or VM was stopped	Ensure the VM Backup extension is healthy and VSS/scripts run; back up while running.
`az group delete` fails on a vault with no visible items	Soft delete is retaining deleted backups	Disable soft delete or wait the retention window, then delete.
Can’t reduce retention / delete a recovery point	Immutability locked or MUA is enforcing	This is by design — request approval via the Resource Guard (MUA) or accept that locked immutability is permanent.
Trusted Launch VM won’t take the Standard policy	Standard policy doesn’t support Trusted Launch (a plain Gen2 VM without it is fine)	Use the Enhanced backup policy.
ASR shows a high RPO / replication lag	Network throughput between regions, or churn spikes	Check egress bandwidth and disk churn; raise the RPO threshold alert appropriately; consider larger cache.
Multi-tier app fails over to inconsistent state	No multi-VM consistency group	Put the interdependent VMs in a replication group so they share a recovery point.
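The first two rows come down to setting vault properties before the first item is protected. A minimal sketch, assuming an empty vault; the function name configure_vault_day_zero is hypothetical, and the flags are those of az backup vault backup-properties:

```shell
#!/usr/bin/env bash
# Sketch: set redundancy and the CRR flag on an EMPTY vault, then read them back.
configure_vault_day_zero() {
  local rg="$1" vault="$2"

  # GRS plus the Cross Region Restore flag, set before onboarding any item.
  az backup vault backup-properties set -g "$rg" -n "$vault" \
    --backup-storage-redundancy GeoRedundant \
    --cross-region-restore-flag true || return 1

  # Show the resulting properties so the settings can be verified by eye.
  az backup vault backup-properties show -g "$rg" -n "$vault"
}
```

Running configure_vault_day_zero "$RG" rsv-lab-eastus right after creating the vault avoids both symptoms; running it after items are protected is expected to be rejected for the redundancy change.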

Best practices

Pick the vault redundancy and CRR on day zero. GRS + cross-region restore for anything you’d need after a region outage — you cannot change redundancy once items are protected.
Use GFS deliberately. Short daily retention for granularity, plus weekly/monthly/yearly only as compliance requires — over-long retention quietly dominates the bill.
Match RPO to the workload. Daily VM backup for stateless tiers; Enhanced/hourly or SQL log backups (15-min RPO) for transactional data; ASR for “must keep running.”
Back up the machine and the database. VM backup plus SQL-in-VM backup gives both whole-machine recovery and point-in-time database restore.
Harden in order: soft delete → CRR → immutability (soak, then lock) → MUA. Treat the backup control plane as the highest-value ransomware target.
Rehearse with test failover on a schedule. A DR plan you’ve never run is a hypothesis; isolated test failovers prove RTO without touching production.
Centralize in Backup Center with diagnostics to Log Analytics so reports populate and you catch unprotected/weakly-protected items.
Don’t treat ASR as backup. It replicates corruption too and keeps only hours of points — keep Azure Backup for long-term/point-in-time recovery.
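To see why GFS is cheaper than simply keeping daily points for the whole horizon, it helps to count recovery points. The sketch below is illustrative arithmetic only: both function names are hypothetical, the GFS figure is an upper bound (one recovery point can satisfy several tiers at once), and stored size depends on churn, not just on the count.

```shell
#!/usr/bin/env bash
# Upper bound on recovery points a GFS policy holds at steady state:
# daily (days) + weekly (weeks) + monthly (months) + yearly (years).
gfs_max_points() {
  local daily="$1" weekly="$2" monthly="$3" yearly="$4"
  echo $(( daily + weekly + monthly + yearly ))
}

# Points held if you kept one daily point for the whole horizon instead.
daily_only_points() {
  local years="$1"
  echo $(( years * 365 ))
}

gfs_max_points 14 6 12 0     # 14 daily / 6 weekly / 12 monthly   -> 32
gfs_max_points 30 12 60 10   # heavier compliance policy          -> 112
daily_only_points 10         # ten years of daily points          -> 3650
```

The ten-year GFS policy holds at most 112 points where a daily-only policy would hold 3650, which is the "granular recently, sparse for long-term" trade in numbers.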

Security notes

Enable enhanced soft delete (always-on) so deleted backups survive an admin compromise — the core anti-ransomware control.
Lock immutability on production vaults after a soak period. While enabled it blocks retention-shortening and early deletion, and once locked the setting cannot be reversed, even by Microsoft.
Turn on multi-user authorization (MUA) with a Resource Guard owned in a separate subscription/tenant so no single admin can both request and approve destructive operations.
Scope RBAC tightly. Backup Contributor / Backup Operator are powerful — separate who can configure backups from who can delete them, and audit changes.
Use private endpoints on the vault to keep the backup data plane off the public internet.
Encrypt backups with customer-managed keys (CMK) where compliance requires control of the key; platform-managed keys protect data at rest by default.
Alert on security events (disable-soft-delete, MUA changes) via action groups — early warning of a control-plane attack.

Cost & sizing

The levers that move a backup/DR bill, roughly in order of impact:

Lever	Cost behaviour
Protected-instance fee	Charged per backed-up instance, banded by source data size (e.g. ≤50 GB, ≤500 GB, then per 500 GB).
Backup storage consumed	Per GB stored — multiplied by redundancy: GRS/GZRS > ZRS > LRS. Long GFS retention compounds this.
Instant-restore snapshots	Stored as managed-disk snapshots; retaining 1–5 days of them means more snapshot storage.
Cross-region restore	GRS storage cost + restore egress when you actually do a CRR.
ASR replication (per VM)	A per-replicated-VM monthly fee plus target-region storage and any egress — bills continuously while enabled.
Enhanced soft delete beyond 14 days	Paid for retained-deleted data past the free window.
Log Analytics for Backup Reports	Ingestion + retention of vault diagnostic logs.

Sizing rules of thumb: redundancy and retention are the big multipliers — GRS on large VMs with multi-year retention is where surprise bills come from. Right-size GFS to the actual compliance need, keep the snapshot tier at the default 2 days unless you need faster recent restores, and never leave ASR replication running on machines you no longer need to protect — it’s the most common forgotten recurring charge in this space.
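The protected-instance banding from the table can be worked through with a few lines of arithmetic. This is a sketch under stated assumptions: the function name protected_instance_fee is hypothetical, and the two rates are placeholder units passed in by the caller, not quoted prices. Check the current pricing page for real figures; storage is billed separately.

```shell
#!/usr/bin/env bash
# Sketch of the size bands: <=50 GB, <=500 GB, then one charge per 500 GB increment.
# Args: source size in GB, small-band rate, standard rate (placeholder units).
protected_instance_fee() {
  local size_gb="$1" small_rate="$2" std_rate="$3"
  if [ "$size_gb" -le 50 ]; then
    echo "$small_rate"
  elif [ "$size_gb" -le 500 ]; then
    echo "$std_rate"
  else
    # Round up to whole 500 GB increments using integer arithmetic.
    echo $(( std_rate * ( (size_gb + 499) / 500 ) ))
  fi
}

protected_instance_fee 30 5 10     # tiny lab VM        -> 5
protected_instance_fee 200 5 10    # mid-size VM        -> 10
protected_instance_fee 1200 5 10   # 1.2 TB, 3 x 500 GB -> 30
```

The step at 500 GB is why one very large VM can carry a higher instance fee than several small ones, before any storage or redundancy multiplier is applied.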

Interview & exam questions

What’s the difference between Azure Backup and Azure Site Recovery? Backup keeps point-in-time copies so you can recover data after deletion/corruption/ransomware (RPO hours, many historical points). Site Recovery continuously replicates whole machines so you can fail over and keep running (RPO seconds-minutes, short retention). Backup is a time machine; ASR is a spare engine. You use both.
When do you use a Recovery Services vault vs a Backup vault? Recovery Services vault for Azure VMs, SQL/SAP-HANA-in-VM, Azure Files (snapshot), on-prem (MARS/MABS/DPM), and all of ASR. Backup vault for Blobs, Disks, PostgreSQL Flexible Server, AKS, and vaulted Azure Files. VMs and ASR ⇒ Recovery Services vault.
Explain GFS retention. Grandfather-Father-Son: keep daily points for a short window, roll up to weekly, monthly, and yearly (up to 99 years). Granular recently, sparse and cheap for long-term compliance.
What is instant restore / the snapshot tier? A local managed-disk snapshot taken before data is copied to the vault, giving near-instant restores of recent points without reading vault storage. Configurable 1–5 days (default 2); more days = more snapshot storage cost.
What are the consistency levels for VM backup, and which is best? Application-consistent (VSS/pre-post scripts flush app buffers — best), file-system-consistent (on-disk files consistent), crash-consistent (like pulling the power). Aim for application-consistent on running VMs.
How does file-level (item-level) restore work for an Azure VM? You download a script from the recovery point and run it; it mounts the recovery point’s volumes locally over iSCSI. You copy the files you need, then unmount. The mount has a ~12-hour lifetime.
How do you achieve a 15-minute RPO for SQL running in an Azure VM? Use SQL-in-VM (workload-aware) backup with transaction-log backups every 15 minutes, on top of Full + Differential — enabling point-in-time restore within the log chain.
What’s required for cross-region restore, and why would you use it? GRS redundancy and the Cross Region Restore flag, set before onboarding. It lets you restore in the paired region on demand — for DR drills or a region outage — without waiting for Microsoft to declare an outage.
How do the four ransomware controls fit together, and in what order? Soft delete keeps deleted backups recoverable; cross-region restore gives an out-of-region copy; immutable vault blocks retention-shortening/early-deletion; MUA (Resource Guard) gates destructive ops behind a second approver. Enable soft delete → CRR → immutability (soak then lock) → MUA.
What is the difference between an unlocked and a locked immutable vault? Unlocked: immutability is active but a vault admin can disable it (use as a soak period). Locked: irreversible — not even Microsoft can unlock it; retention can only be extended, never shortened.
Test failover vs failover vs failback — what’s the difference? Test failover spins the replica up in an isolated network with no impact on production (your DR drill). Failover brings the machines up in the target region for real during a disaster. Failback reverse-replicates and returns to the original site once it’s healthy.
What does a recovery plan add over failing over VMs individually, and what is multi-VM consistency? A recovery plan orders failover into groups (e.g. DB before app before web), adds manual actions and automation runbooks. Multi-VM consistency groups interdependent VMs so they share the same recovery point — essential for multi-tier apps that must come back to the same moment.
Why is ASR not a substitute for backup? ASR replicates everything faithfully — including corruption and ransomware — and keeps only a short window of recovery points (hours). For clean, long-term, point-in-time recovery you need Azure Backup.

Quick check

Which vault type backs up Azure VMs and hosts Azure Site Recovery?
What two things must be true for cross-region restore to work?
What is the default (and max) instant-restore snapshot retention?
Which backup type gives SQL-in-VM a 15-minute RPO?
True or false: a test failover briefly takes your production VM offline.

Answers

The Recovery Services vault (Microsoft.RecoveryServices/vaults).
GRS storage redundancy and the Cross Region Restore flag enabled — both set before onboarding.
Default 2 days, maximum 5 days.
Transaction-log backups every 15 minutes (workload-aware SQL-in-VM backup).
False — a test failover runs in an isolated network with no impact on production; it’s the non-disruptive DR drill.

Exercise

Design and build (in CLI) a hardened single-VM backup: create a GRS Recovery Services vault with the cross-region-restore flag enabled and enhanced soft delete on, protect one B-series VM with a custom policy (daily schedule, 14-day daily / 6-week weekly / 12-month monthly retention, instant-restore = 3 days), and trigger an on-demand backup. Then prove two things: (a) attempt to shorten retention or delete a recovery point and observe what soft delete / immutability would gate, and (b) confirm in az backup recoverypoint list that your point exists. Write one short paragraph explaining why you’d enable MUA with a Resource Guard in a separate subscription before considering this production-ready, and what destructive operation it would block. Clean up with az backup protection disable --delete-backup-data true then az group delete.

Certification mapping

Exam	Skills this lesson covers
AZ-104 (Administrator)	Monitor and back up Azure resources: create and configure Recovery Services / Backup vaults, backup policies (GFS, instant restore), back up and restore VMs (incl. file-level), Azure Files, and SQL-in-VM; configure soft delete, cross-region restore; configure Azure Site Recovery for Azure VMs and perform failover/failback; use Backup Center, jobs, and alerts. The `az` lab mirrors the exam’s task-based items.
AZ-305 (Solutions Architect)	Design business continuity solutions: design a backup and recovery strategy from RPO/RTO requirements, choose vault redundancy and CRR, design ransomware-resilient posture (soft delete, immutability, MUA), and design site recovery / DR with replication policies, recovery plans, multi-VM consistency, and region/zone failover.

Glossary

Azure Backup — Service that takes and retains point-in-time copies of data for recovery after loss/corruption.
Azure Site Recovery (ASR) — Service that continuously replicates whole machines to enable failover/DR.
Recovery Services vault — Vault (Microsoft.RecoveryServices) for VMs, SQL/SAP-HANA-in-VM, Azure Files, on-prem, and ASR.
Backup vault — Vault (Microsoft.DataProtection) for Blobs, Disks, PostgreSQL Flexible Server, AKS, vaulted Files.
RPO (Recovery Point Objective) — Maximum tolerable data loss, measured in time.
RTO (Recovery Time Objective) — Maximum tolerable downtime.
GFS (Grandfather-Father-Son) — Retention rolling daily→weekly→monthly→yearly points.
Recovery point — A single restorable snapshot/backup at a moment in time.
Instant restore (snapshot tier) — Local managed-disk snapshot for near-instant recent restores (1–5 days, default 2).
Application-consistent backup — Backup with app buffers flushed via VSS/pre-post scripts (best consistency).
Cross-region restore (CRR) — On-demand restore in the geo-paired region; needs GRS + the CRR flag.
Soft delete — Retention of deleted backup data (basic 14 days; enhanced 14–180, can be always-on).
Immutable vault — Vault that blocks retention-reducing/early-deletion operations; can be locked (irreversible).
Multi-user authorization (MUA) — Destructive operations gated by a second approver via a Resource Guard.
Replication policy — ASR settings for recovery-point retention, app-consistent frequency, RPO threshold, multi-VM consistency.
Recovery plan — Ordered ASR runbook grouping VMs with manual actions and automation.
Test failover — Non-disruptive ASR drill in an isolated network.
Failover / Failback — Real switch to the target region, and the reverse-replication return to the source.
Backup Center — Single-pane management across all vaults, workloads, jobs, policies, reports, and alerts.

Next steps

Microsoft Entra ID & Governance Admin Deep Dive — the natural sequel: lock down who can configure and delete backups with RBAC, policy, locks, and tags across the management-group hierarchy.
Azure Backup Hardening: Immutable Vaults, MUA, Soft Delete & Cross-Region Restore — go deeper on the four ransomware controls and the exact order to wire them.
Azure Site Recovery: Zone-to-Zone & Region Failover with Runbooks — the advanced ASR playbook: zone-to-zone DR, recovery-plan automation, and failover runbooks.
Azure Monitor Deep Dive — wire backup and DR alerts into action groups and dashboards for end-to-end operational visibility.