BCDR Foundations on Azure: Making Sense of RTO, RPO, and the Resilience Spectrum

Every team can answer “is the app up?” Very few can answer the two questions that actually decide whether an outage is a footnote or a front-page incident: how long can we be down, and how much data can we afford to lose. Those two numbers — RTO (Recovery Time Objective) and RPO (Recovery Point Objective) — are the entire vocabulary of business continuity and disaster recovery (BCDR). Everything else (zones, replicas, backups, failover regions, runbooks) is just machinery for hitting the RTO and RPO you committed to. Get the numbers right and the architecture almost designs itself; get them wrong and you either overpay for resilience nobody needs or discover at 3 a.m. that “we have backups” and “we can recover in an hour” are very different statements.

This article builds the mental model from the ground up. We define RTO and RPO precisely (they are targets, not measurements, and they are not the same thing), separate high availability (HA) from disaster recovery (DR) — two ideas people blur constantly — and lay out the resilience spectrum: a ladder on Azure from a single VM in one datacentre up to active-active across regions, each rung buying a smaller RTO/RPO for more money and complexity. You will see where availability zones, paired regions, Azure Backup, zone-redundant storage, and Azure Site Recovery sit on that ladder, and which RTO/RPO band each realistically delivers.

By the end you can take a workload, ask “what does an hour of downtime cost, and how stale can the data be after recovery?”, and translate the answer into a concrete Azure design and a believable monthly bill. This is the foundation the deeper region, backup, and multi-region articles build on — where most teams’ resilience story should start, and where a surprising number of expensive mistakes are quietly prevented.

What problem this solves

Outages are not hypothetical. A region has a bad day, a deployment corrupts a database, a ransomware actor encrypts a file share, an availability zone loses power, or someone fat-fingers a DELETE without a WHERE. The question is never “will something break?” — it is “when it breaks, what’s our plan, and is the plan good enough for this workload?” Without RTO and RPO as a shared language, that conversation has no anchor. Engineering says “we have geo-redundant storage”; the business hears “we’re safe”; nobody has agreed how long recovery takes or how much data vanishes. The gap surfaces only during the incident, which is the worst possible time to discover it.

The second problem is mis-spending. Resilience is a spectrum with a steep cost curve. A marketing microsite and a payments ledger do not need the same protection, but teams routinely apply one blanket policy to both — either gold-plating the microsite or under-protecting the ledger (a single-region app with nightly backups, quietly accepting an RPO of up to 24 hours nobody signed off on). RTO and RPO let you tier workloads: spend the budget where an outage genuinely hurts, and accept cheaper, slower recovery where it doesn’t.

Who hits this: essentially everyone running production on Azure, but it bites hardest on teams that treat “backup” as a synonym for “DR.” A backup is a recovery point; it says nothing about recovery time. If restoring your 2 TB database takes six hours, you cannot credibly commit to an RTO shorter than six hours, no matter how recent the backup is. BCDR planning forces both numbers into the open, where they can be designed for instead of discovered.

To frame the whole field before the deep dive, here are the failure events this foundation prepares you for, what each one threatens, and the layer of the spectrum that addresses it:

Failure event	What it threatens	Primary defence	RTO/RPO it drives
Single VM / disk failure	One instance	Availability set / multiple instances	Seconds–minutes / ~0
Datacentre (zone) outage	One zone (one or more datacentres)	Availability zones	Seconds–minutes / ~0
Whole-region outage	All zones in a region	Paired region + replication/DR	Minutes–hours / seconds–minutes
Data corruption / bad deploy	Data integrity, not infra	Backups + point-in-time restore	Hours / your backup interval
Accidental / malicious deletion	Data existence	Soft-delete + immutable backup	Hours / backup interval
Ransomware	Data + backups together	Immutable, isolated backup copy	Hours–days / last clean point

Learning objectives

By the end of this article you can:

Define RTO and RPO precisely, explain why they are independent, and place any workload on a 2-by-2 of “downtime cost vs data-loss cost.”
Articulate the difference between high availability and disaster recovery, and why a zone-redundant app can still need a DR plan.
Walk the resilience spectrum from single-instance to active-active and name what each rung costs and what RTO/RPO it buys.
Map Azure’s building blocks — availability zones, paired regions, Azure Backup, storage redundancy (LRS/ZRS/GRS/GZRS), and Azure Site Recovery — onto the right rung.
Choose the standard DR strategy (Backup & Restore, Pilot Light, Warm Standby, Active-Active) that fits a given RTO/RPO and budget.
Read a backup/replication design and tell whether it actually meets a stated RTO/RPO, or only sounds like it does.
Estimate the rough monthly cost difference between resilience tiers, in INR and USD, and right-size instead of gold-plating.

Prerequisites & where this fits

You should be comfortable with Azure basics: a subscription holds resource groups, which hold resources like VMs, storage accounts, and databases; an Azure region is a geographic area, and within most regions there are physically separate availability zones. You should know how to run az commands in Cloud Shell and read their output. No prior DR experience is assumed — this article is the starting point.

This sits at the very front of the Resilience track. It is the conceptual layer beneath the hands-on guides: once you understand RTO/RPO and the spectrum here, the region mechanics in Azure Regions and Availability Zones Explained, the protection tooling in Azure Backup and Site Recovery for Protection, and the cross-region design in Azure Multi-Region Active-Active Design all slot into place. It pairs naturally with High Availability vs Disaster Recovery: RTO and RPO, which drills the HA/DR distinction further. If your concern is application-level resilience rather than infrastructure, Resiliency Patterns: Retry, Circuit Breaker, Bulkhead is the complementary read.

A quick map of who usually owns each decision, so the conversation reaches the right people early:

Decision	Who owns it	Why it matters
RTO / RPO targets per workload	Business + product owner	These are business commitments, not engineering defaults
Region and zone topology	Cloud architect	Determines the achievable RTO/RPO floor
Backup policy and retention	Platform / ops team	Sets the RPO for data corruption events
DR runbook and failover drills	SRE / ops	A plan never tested is a plan that fails
Cost ceiling per tier	Finance + architect	Resilience is bought, not free

Core concepts

Four mental models make every later decision obvious.

RTO is a clock; RPO is a rewind. Picture the moment disaster strikes. RTO (Recovery Time Objective) looks forward from that moment: the longest you can accept waiting until the service is usable again. RPO (Recovery Point Objective) looks backward: the furthest back in time you can accept your most recent safe data being. An RTO of 1 hour and an RPO of 5 minutes means “we will be back within an hour, having lost at most the last five minutes of data.” They answer different questions (time to restore service versus amount of data lost), and a design can be strong on one and weak on the other. Nightly backups are cheap but recover slowly (an RTO of hours) and carry a brutal RPO (up to 24 hours of lost data); synchronous replication gives a near-zero RPO but says nothing on its own about how fast you can fail over.
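
The "1 hour / 5 minutes" example above can be checked mechanically against the timestamps of a drill or a real incident. The following is a minimal sketch; the `Incident` class, its field names, and the sample timestamps are illustrative assumptions, not part of any Azure SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps from one outage or drill (hypothetical record shape)."""
    last_safe_copy: datetime    # newest recoverable data point before the disaster
    disaster: datetime          # moment the service failed
    service_restored: datetime  # moment the service was usable again

    def downtime(self) -> timedelta:
        # Measured forward from the disaster: compare against the RTO.
        return self.service_restored - self.disaster

    def data_loss_window(self) -> timedelta:
        # Measured backward from the disaster: compare against the RPO.
        return self.disaster - self.last_safe_copy


def meets_targets(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Report each objective separately; a design can pass one and fail the other."""
    return {
        "rto_met": incident.downtime() <= rto,
        "rpo_met": incident.data_loss_window() <= rpo,
    }


# Made-up drill: 3 minutes of data lost, 80 minutes of downtime.
drill = Incident(
    last_safe_copy=datetime(2024, 1, 10, 13, 57),
    disaster=datetime(2024, 1, 10, 14, 0),
    service_restored=datetime(2024, 1, 10, 15, 20),
)
result = meets_targets(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
print(result)  # {'rto_met': False, 'rpo_met': True}
```

Here the drill meets the RPO but misses the RTO, which is the independence of the two numbers in practice.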

HA and DR are different jobs. High availability keeps a service running through small, expected failures — a dead disk, a rebooting VM, a single zone losing power — usually automatically, within one region, with little or no data loss. Disaster recovery is the plan for a large, rare event that takes out your primary location entirely — a whole region down, a data-destroying event — and typically involves failing over to a different region, often with a human decision and a measurable recovery window. HA reduces how often you have an outage; DR limits how bad the worst outage gets. A zone-redundant app (great HA) with no second region (no DR) is fully exposed to a regional disaster — a common and dangerous blind spot.

Resilience is a spectrum, and the cost curve is steep. There is no single “resilient” setting; there is a ladder. At the bottom, one VM in one datacentre — cheapest, weakest. At the top, active-active across two regions serving live traffic from both — most expensive, strongest. Each rung up shrinks your RTO and RPO and grows your bill, often non-linearly (active-active can roughly double infrastructure cost because you run a full second copy). The architect’s job is to climb only as high as the workload’s RTO/RPO justifies, no higher.

You cannot exceed your weakest link. Your real RTO is the slowest component to recover; your real RPO is the least recent safe copy across the whole system. A stateless web tier that fails over in 30 seconds is irrelevant if its database takes four hours to restore — the workload’s RTO is four hours. BCDR is a chain, and you measure it end-to-end, not by its strongest part.
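
The weakest-link rule reduces to taking the worst figure across every tier. This is a small sketch with made-up per-tier numbers (the tier names and durations are assumptions for illustration), and it assumes the tiers recover in parallel.

```python
from datetime import timedelta

# Hypothetical per-tier recovery figures for one workload.
tiers = {
    "web":      {"recovery_time": timedelta(seconds=30), "data_staleness": timedelta(0)},
    "database": {"recovery_time": timedelta(hours=4),    "data_staleness": timedelta(minutes=5)},
    "search":   {"recovery_time": timedelta(minutes=20), "data_staleness": timedelta(hours=2)},
}


def end_to_end(tiers):
    """Workload recovery time is the slowest tier; data staleness is the stalest tier.

    Assumes tiers recover in parallel. If they must recover one after another,
    sum the recovery times instead of taking the maximum.
    """
    recovery_time = max(t["recovery_time"] for t in tiers.values())
    staleness = max(t["data_staleness"] for t in tiers.values())
    return recovery_time, staleness


workload_rto, workload_rpo = end_to_end(tiers)
print(workload_rto, workload_rpo)  # 4:00:00 2:00:00
```

Note that the two worst figures come from different tiers: the database sets the recovery time, while the search index sets the staleness.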

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Term	One-line definition	Why it matters to BCDR
RTO	Max acceptable time to restore service after an outage	Sets your downtime budget; drives the DR strategy
RPO	Max acceptable amount of recent data lost	Sets your backup/replication frequency
High availability (HA)	Surviving small failures, usually automatically, in-region	Reduces outage frequency; near-zero data loss
Disaster recovery (DR)	Recovering from a large failure, often cross-region	Limits worst-case blast radius
Availability zone	One or more physically isolated datacentres within a region, with independent power, cooling, and networking	In-region HA against a datacentre-level failure
Region pair	Two regions Azure links for residency and recovery	Natural DR target; sequenced platform updates
Backup	A point-in-time copy you restore from	Defends against corruption/deletion; sets data RPO
Replication	Continuous copying of data to another location	Lowers RPO toward near-zero for DR
Failover	Switching production to the standby location	The act of executing DR
Failback	Returning production to the original location	The other half of a complete DR plan
SLA	The provider’s uptime commitment for a service	A floor on availability, not your RTO/RPO

RTO and RPO, made concrete

The two numbers are simple to state and easy to get muddled in practice. This section nails them down with real magnitudes, because the band you land in is what selects an architecture.

Reading the two numbers off a timeline

Lay an outage out on a timeline. Data is being written continuously. At time T disaster hits. The last protected copy of your data dates from some earlier time, T − RPO in the worst case you have agreed to tolerate, and everything written between then and T is lost. Service comes back by T + RTO, and that span is your outage. RPO is governed by how often you protect data (backup interval, replication lag). RTO is governed by how fast you can bring service back (restore speed, failover automation, DNS/traffic cut-over).

A subtle trap: RPO is not the typical loss under your backup schedule, it’s the worst case within it. If you back up every 24 hours and the outage strikes 23 hours after the last backup, you lose 23 hours of data. Averaged over many incidents the loss would be about 12 hours, but a daily backup buys an RPO of up to 24 hours, not 12. Quote RPO as the maximum, because that is what you must survive.
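
The worst-case reading of a backup schedule can be worked through numerically. This is a sketch under a simplifying assumption: each backup is treated as instantaneous and always successful. The function names are illustrative.

```python
from datetime import timedelta


def data_lost(time_since_first_backup: timedelta, backup_interval: timedelta) -> timedelta:
    """Data written since the most recent backup, for a fixed backup schedule.

    Simplification: treats each backup as instantaneous and always successful.
    """
    return time_since_first_backup % backup_interval


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """The loss if disaster strikes just before the next backup would have run."""
    return backup_interval


daily = timedelta(hours=24)
print(data_lost(timedelta(hours=47), daily))  # 23:00:00
print(worst_case_rpo(daily))                  # 1 day, 0:00:00
print(daily / 2)                              # 12:00:00 (average loss if disasters are evenly spread)
```

In a real system the time a backup takes to complete adds to the exposure window, so the true worst case is somewhat larger than the interval alone.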

The bands that matter

You rarely need a precise RTO like “47 minutes.” You need to know which band a workload sits in, because each band maps to a class of architecture. Here are the practical bands and what each implies:

Band	RTO meaning	RPO meaning	Typical mechanism	Rough relative cost
Near-zero (seconds)	No perceptible downtime	No data loss	Active-active, synchronous replication	Highest (×1.7–2)
Minutes	Brief, automated failover	Seconds of data	Warm standby, async replication	High
Tens of minutes	Fast manual failover	Minutes of data	Pilot light, frequent replication	Moderate
Hours	Restore-and-redeploy	Hours of data	Backup & restore, geo-redundant backup	Low
A day+	Rebuild from backups	Up to a day of data	Nightly backup only	Lowest

The decision is rarely “what’s the best we can do?” — it’s “what band does this workload’s cost of being wrong put it in?” That is the next table.
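
Placing a target in a band is a simple threshold lookup. The band names below come from the table above, but the cut-off values are assumptions picked for illustration; a real tiering policy would set its own.

```python
from datetime import timedelta

# Band names come from the table above. The cut-off values are assumptions
# picked for illustration; set them from your own tiering policy.
BAND_UPPER_BOUNDS = [
    (timedelta(minutes=1), "Near-zero (seconds)"),
    (timedelta(minutes=10), "Minutes"),
    (timedelta(hours=1), "Tens of minutes"),
    (timedelta(hours=24), "Hours"),
]


def band_for(target: timedelta) -> str:
    """Return the first band whose upper bound is above the target."""
    for upper_bound, name in BAND_UPPER_BOUNDS:
        if target < upper_bound:
            return name
    return "A day+"


print(band_for(timedelta(minutes=47)))  # Tens of minutes
print(band_for(timedelta(minutes=52)))  # Tens of minutes
print(band_for(timedelta(hours=6)))     # Hours
```

A 47-minute and a 52-minute target land in the same band and therefore the same class of architecture, which is why the precise figure rarely matters.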

Mapping a workload to a band

For any workload, ask two independent questions: what does an hour of downtime cost? and what does losing recent data cost? The answers place it on a grid, and the grid suggests a tier:

Downtime cost \ Data-loss cost	Low data-loss cost	High data-loss cost
Low downtime cost	Single region + backups (RTO hours / RPO hours) — e.g. internal wiki, marketing site	Single region + frequent backup/PITR (RTO hours / RPO minutes) — e.g. analytics store you can rebuild slowly but mustn’t lose entries
High downtime cost	Zone-redundant + warm DR (RTO minutes / RPO hours) — e.g. a read-heavy catalog you can repopulate	Zone-redundant + active-active or warm standby (RTO seconds–minutes / RPO seconds) — e.g. payments, ordering, the system of record

Most organisations have workloads in all four quadrants and waste money applying a single quadrant’s design to everything. Tiering by this grid is the single highest-leverage BCDR decision you make.
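
The grid can be encoded as a lookup so that every workload in an inventory gets tiered the same way. This is a sketch: the quadrant contents are copied from the grid above, while the cost figures and the thresholds that separate "low" from "high" are hypothetical values you would agree with the business.

```python
# The four quadrants of the grid above, keyed by
# (downtime cost is high, data-loss cost is high).
GRID = {
    (False, False): ("Single region + backups", "RTO hours / RPO hours"),
    (False, True):  ("Single region + frequent backup/PITR", "RTO hours / RPO minutes"),
    (True,  False): ("Zone-redundant + warm DR", "RTO minutes / RPO hours"),
    (True,  True):  ("Zone-redundant + active-active or warm standby",
                     "RTO seconds-minutes / RPO seconds"),
}


def quadrant(downtime_cost_per_hour, data_loss_cost_per_hour,
             downtime_threshold, data_loss_threshold):
    """Answer the two questions independently, then look up the quadrant."""
    key = (downtime_cost_per_hour >= downtime_threshold,
           data_loss_cost_per_hour >= data_loss_threshold)
    return GRID[key]


# Hypothetical costs and thresholds, all in one currency.
wiki = quadrant(500, 200, downtime_threshold=10_000, data_loss_threshold=10_000)
payments = quadrant(250_000, 400_000, downtime_threshold=10_000, data_loss_threshold=10_000)
print(wiki[0])      # Single region + backups
print(payments[0])  # Zone-redundant + active-active or warm standby
```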

High availability vs disaster recovery

These two words get used interchangeably and they are not the same job. Conflating them produces the classic failure: a beautifully zone-redundant application that has no answer at all when the region goes down.

Two different failures, two different plans

Dimension	High availability (HA)	Disaster recovery (DR)
Failure it handles	Small, frequent (disk, VM, single zone)	Large, rare (whole region, data destruction)
Scope	Usually within one region	Usually a second, separate region
Trigger	Automatic (platform/health probe)	Often a human decision (declare a disaster)
Typical RTO	Seconds to a few minutes	Minutes to hours
Typical RPO	Near-zero	Seconds to hours (depends on replication)
Primary cost	Redundancy within a region	A second-region footprint + drills
Example mechanism	Multiple instances across zones	Site Recovery / geo-replication + failover

The line to remember: HA keeps you up through the failures that happen most weeks; DR saves you on the worst day of the decade. They are additive, not alternatives. A serious system has both — zone redundancy and a cross-region recovery plan.

Where the SLA fits (and where it doesn’t)

An Azure service-level agreement (SLA) — for example, the higher availability percentage you get by spreading VM instances across availability zones — is a statement about how often the platform aims to keep a service reachable. It is a useful floor, but it is not your RTO or RPO. The SLA does not promise how fast your specific application recovers from a bad deployment, nor how much data you lose to a corruption event. Vendors sell SLA numbers; architects own RTO/RPO. Treat the SLA as one input to your HA design and nothing more — it never substitutes for a tested recovery plan.
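
It helps to convert an availability percentage into the downtime it actually permits. The arithmetic below is a sketch over a 30-day month; the percentages are common illustrative values, not a statement of any specific service's SLA.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime a given availability percentage permits over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


for sla in (99.9, 99.95, 99.99):
    print(sla, round(allowed_downtime_minutes(sla), 2))
# 99.9 43.2
# 99.95 21.6
# 99.99 4.32
```

Even a 99.99% figure only bounds platform unavailability to a few minutes a month. It says nothing about a multi-hour restore after a bad deployment, which is why the SLA cannot stand in for an RTO.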

The resilience spectrum on Azure

Now assemble the ladder. Each rung is a real Azure topology; the higher you climb, the smaller your RTO/RPO and the larger your bill. Pick the lowest rung that satisfies the workload’s band.

The four-tier DR strategy ladder

The cloud industry settled on four named DR strategies, ordered by recovery speed and cost. They are the canonical way to talk about where a workload sits:

Strategy	How it works	Typical RTO	Typical RPO	Standby cost	Best for
Backup & Restore	Back up data; on disaster, restore and redeploy in another region	Hours	Hours (= backup interval)	Storage only	Cost-sensitive, downtime-tolerant workloads
Pilot Light	Core data replicated and a minimal “spark” running in DR; scale up on failover	Tens of minutes	Minutes (async replication)	Small (data + minimal infra)	Important apps that can take a short outage
Warm Standby	A scaled-down but running copy in DR; scale up and cut traffic over	Minutes	Seconds–minutes	Moderate (always-running smaller fleet)	Business-critical apps needing fast recovery
Active-Active	Both regions serve live traffic; failure removes one with little impact	Seconds (near-zero)	Near-zero	Highest (full second copy)	Mission-critical, no perceptible downtime

The cost climbs roughly in step with the recovery speed: Backup & Restore parks cheap data and nothing else; Active-Active runs a second production-grade environment full-time. The art is choosing the cheapest strategy whose RTO/RPO still clears the workload’s band.
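
Choosing the cheapest strategy that still clears the band is a first-match search over the ladder, cheapest rung first. In this sketch the strategy names come from the table above, but the numeric RTO/RPO bounds are assumptions that stand in for the "typical" bands; they are not guarantees of any Azure service.

```python
from datetime import timedelta

# Ordered cheapest first. The RTO/RPO figures are illustrative upper bounds
# for the "typical" bands in the table above, not service guarantees.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=8),    timedelta(hours=24)),
    ("Pilot Light",      timedelta(minutes=60), timedelta(minutes=15)),
    ("Warm Standby",     timedelta(minutes=10), timedelta(minutes=1)),
    ("Active-Active",    timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta):
    """Return the first (cheapest) strategy that meets both targets, or None."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None


print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))     # Backup & Restore
print(cheapest_strategy(timedelta(minutes=15), timedelta(minutes=5)))  # Warm Standby
print(cheapest_strategy(timedelta(seconds=1), timedelta(0)))           # None
```

A `None` result means no rung meets the target as stated, which is a cue to renegotiate the target rather than to keep adding infrastructure.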

Azure building blocks, placed on the ladder

The strategies above are abstract; here is the concrete Azure machinery and the rung each one serves. None of these are interchangeable — each defends a different failure scope.

Building block	What it protects against	Scope	RTO band it enables	RPO band it enables
Availability set	VM host/rack/update failures	Within one datacentre	Seconds–minutes	~0 (HA only)
Availability zones	A whole datacentre (zone) failing	Within one region	Seconds–minutes	~0 (HA only)
Zone-redundant storage (ZRS)	A zone failing, for data at rest	Within one region	Transparent	~0
Azure Backup	Corruption, deletion, ransomware	Per resource; vault can be geo-redundant	Hours	= backup frequency
Geo-redundant storage (GRS/GZRS)	A whole region failing, for blobs/files	Cross-region (paired)	Hours (after failover)	Minutes (async)
Azure Site Recovery (ASR)	A region failing, for VMs/workloads	Cross-region or zone-to-zone	Minutes–hours	Minutes (continuous)
Geo-replication (databases)	A region failing, for managed data	Cross-region	Seconds–minutes (failover)	Seconds (async) / 0 (sync)
Front Door / Traffic Manager	Routing traffic away from a dead region	Global	Part of the failover cut-over	n/a (routing layer)

Two clarifications that prevent expensive confusion. First, HA blocks (sets, zones, ZRS) do not give you DR — they survive a zone, not a region. A region-down event sails straight through zone redundancy. Second, a backup in a geo-redundant vault is still a backup, not a hot standby — its RTO is “restore time,” which for large datasets is hours, regardless of how recent the copy is.
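
One way to catch the "HA but no DR" gap is to list which failure scopes a design leaves uncovered. The sketch below is deliberately coarse: the block names and the scope labels are shorthand invented for this example, and a scope counts as covered if any block addresses it, without distinguishing compute from data.

```python
# Failure scopes each building block defends, condensed from the table above.
# Keys and scope labels are shorthand for this example only.
COVERS = {
    "availability_set":   {"host"},
    "availability_zones": {"host", "zone"},  # instances in separate zones are also on separate hosts
    "zrs":                {"zone"},
    "azure_backup":       {"corruption", "deletion", "ransomware"},
    "grs":                {"region"},
    "site_recovery":      {"region"},
    "db_geo_replication": {"region"},
}

REQUIRED = {"host", "zone", "region", "corruption", "deletion", "ransomware"}


def uncovered(design):
    """Return the failure scopes that no block in the design addresses."""
    covered = set()
    for block in design:
        covered |= COVERS[block]
    return REQUIRED - covered


ha_only = {"availability_zones", "zrs"}
print(sorted(uncovered(ha_only)))
# ['corruption', 'deletion', 'ransomware', 'region']

ha_plus_dr = ha_only | {"site_recovery", "db_geo_replication", "azure_backup"}
print(sorted(uncovered(ha_plus_dr)))  # []
```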

Storage redundancy: the RPO knob for data at rest

Storage redundancy deserves its own look because it is the most common place teams think they have DR and don’t. Azure Storage offers four redundancy levels, each a different point on the cost/durability/availability curve:

Redundancy	Copies & placement	Survives a zone outage?	Survives a region outage?	Read access in DR	Relative cost
LRS (Locally redundant)	3 copies, one datacentre	No	No	—	Lowest
ZRS (Zone-redundant)	3 copies across zones in one region	Yes	No	—	Low–moderate
GRS (Geo-redundant)	LRS in primary + async copy in paired region	No (in primary)	Yes (after failover)	No until failover	Moderate
GZRS (Geo-zone-redundant)	ZRS in primary + async copy in paired region	Yes	Yes (after failover)	No until failover	Highest
RA-GRS / RA-GZRS (Read-access)	GRS/GZRS + read endpoint in secondary	Same as the base tier (RA-GRS: no; RA-GZRS: yes)	Yes	Yes (read-only) anytime	+ small premium

The trap: GRS/GZRS replication is asynchronous, so the secondary lags the primary by a short window — that lag is your storage RPO for a region failure, and it is not zero. And until a failover is initiated (or you use the read-access variants), you cannot read the secondary copy. “We’re on GRS” is a DR posture with an RPO of minutes and an RTO gated by failover, not a magic no-loss guarantee.
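
The redundancy table can be turned into a small selector that returns the lowest tier meeting a stated failure scope. This is a sketch: the SKU strings are the Azure Storage SKU names, the survival flags are copied from the table above, and the ordering is only the table's rough relative cost, so verify it against current pricing.

```python
from typing import Optional

# (SKU, survives a zone outage, survives a region outage, secondary readable before failover)
# Rows follow the table above in rough cost order.
OPTIONS = [
    ("Standard_LRS",    False, False, False),
    ("Standard_ZRS",    True,  False, False),
    ("Standard_GRS",    False, True,  False),
    ("Standard_RAGRS",  False, True,  True),
    ("Standard_GZRS",   True,  True,  False),
    ("Standard_RAGZRS", True,  True,  True),
]


def cheapest_sku(need_zone: bool, need_region: bool,
                 need_secondary_read: bool = False) -> Optional[str]:
    """Return the first SKU that satisfies every stated requirement."""
    for sku, zone, region, readable in OPTIONS:
        if ((zone or not need_zone)
                and (region or not need_region)
                and (readable or not need_secondary_read)):
            return sku
    return None


print(cheapest_sku(need_zone=True, need_region=False))  # Standard_ZRS
print(cheapest_sku(need_zone=False, need_region=True))  # Standard_GRS
print(cheapest_sku(need_zone=True, need_region=True))   # Standard_GZRS
```

"Survives a region outage" here still means after failover and minus the asynchronous lag, so the selector picks a tier; it does not remove the RPO.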

Set the redundancy explicitly when you create a storage account. Defaults vary by tool, so an account created without an explicit SKU may not have the zone or regional protection you assumed:

# Create a geo-zone-redundant storage account (zone HA + regional DR, async)
az storage account create \
  --name stkvbcdrprod \
  --resource-group rg-bcdr-prod \
  --location centralindia \
  --sku Standard_GZRS \
  --kind StorageV2 \
  --access-tier Hot

The equivalent Bicep declaration:

resource storage 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  name: 'stkvbcdrprod'
  location: 'centralindia'
  sku: { name: 'Standard_GZRS' }   // ZRS in-region + async copy to the paired region
  kind: 'StorageV2'
  properties: { accessTier: 'Hot' }
}

Region pairs: the default DR direction

Azure organises most regions into region pairs: two regions in the same geography that the platform links for data residency and recovery. Pairs matter for BCDR for three reasons: GRS/GZRS replication targets the paired region automatically, platform updates are sequenced so both halves of a pair are rarely updated at once, and in a broad regional event Azure prioritises the recovery of one region in each pair. Picking your DR region as the pair of your primary keeps data in-geography (important for compliance) and aligns with how the platform already replicates.

You can confirm which region is paired with yours, and which regions even support zones, from the CLI:

# Show the paired region for your primary (the natural DR target)
az account list-locations \
  --query "[?name=='centralindia'].{region:name, pairedWith:metadata.pairedRegion[0].name}" \
  -o table

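# List regions that report availability zones (zonal HA).
# Assumes a recent Azure CLI whose output includes availabilityZoneMappings;
# older versions omit the field, in which case this returns nothing.
az account list-locations \
  --query "[?availabilityZoneMappings!=null].name" -o tsv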

Architecture at a glance

The reference shape for a typical business-critical web workload reads left to right as a recovery story. A global routing layer — Azure Front Door or Traffic Manager — sits at the front and decides which region serves a user; it is the lever you pull (or that pulls automatically on health-probe failure) to steer traffic away from a dead region. Behind it, the primary region runs the live application across availability zones: the web/API tier is spread over zones for HA, and the data tier (database plus storage) is zone-redundant so a single datacentre failure is invisible. That zone redundancy is your HA story — it handles the common failures without anyone waking up.

The DR story is the second region. Data flows continuously from primary to the secondary (paired) region by asynchronous replication (geo-redundant storage for blobs/files, geo-replication for the database), giving an RPO of seconds-to-minutes. The secondary runs as warm standby (a smaller, running copy) or pilot light (data warm, compute minimal), so on a regional disaster you scale it up and Front Door cuts traffic over, landing an RTO in the minutes-to-tens-of-minutes band. Layered across both regions, Azure Backup writes point-in-time copies into a geo-redundant vault: the independent defence against corruption, deletion, and ransomware that replication alone cannot give you (replication faithfully copies bad data too).

Four points in this design are where an RTO/RPO target is actually won or lost. Read them as the four levers of your RTO/RPO: (1) the routing cut-over decides failover time; (2) zone redundancy decides whether common failures even register; (3) replication lag decides your regional RPO; (4) an independent, immutable backup decides whether you survive corruption and ransomware at all.
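
The regional failover described above is a sequence of steps, and the achieved recovery time is their sum. This is a sketch with hypothetical step durations; the real figures only come from timing a drill.

```python
from datetime import timedelta

# Hypothetical durations for one regional failover; measure your own in a drill.
failover_steps = [
    ("health probes mark the primary unhealthy", timedelta(seconds=90)),
    ("database fails over to the geo-replica",   timedelta(minutes=2)),
    ("warm standby scales up to full capacity",  timedelta(minutes=8)),
    ("global routing shifts traffic",            timedelta(minutes=1)),
]


def total_recovery_time(steps):
    """Assumes the steps run one after another, so their durations add up."""
    return sum((duration for _, duration in steps), timedelta(0))


measured = total_recovery_time(failover_steps)
rto_target = timedelta(minutes=15)
print(measured, measured <= rto_target)  # 0:12:30 True
```

Listing the steps this way also shows where to spend effort: in this made-up example the standby scale-up dominates, so pre-scaling the standby would cut the recovery time more than faster routing would.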

Real-world scenario

KloudMart, a mid-size Indian e-commerce company, ran its order platform as a single-region deployment in Central India: web app and SQL database in one region, nightly backups to a geo-redundant vault, no second region. Leadership believed they were “covered” because backups existed and storage was geo-redundant. The numbers nobody had written down were an RTO of several hours (restore a 1.2 TB database, redeploy the app, repoint DNS) and an RPO of up to 24 hours (the nightly backup).

The reckoning came in two waves. First, a botched schema migration corrupted the orders table at 2 p.m. The team restored from the previous night’s backup — and lost every order placed that day, roughly ₹40 lakh in transactions, because the RPO was 24 hours and replication had faithfully copied the corruption to the GRS secondary. Months later a regional incident took the primary offline for ninety minutes; with no warm standby, the store was down the whole time, because the “DR plan” was a manual restore slower than the outage itself.

The fix was a deliberate tiering exercise. The team classified the order platform as high downtime cost, high data-loss cost and set targets of RTO 15 minutes, RPO 5 minutes. They moved the database to active geo-replication into the paired region (RPO now seconds), stood up a warm standby of the web tier there (smaller fleet, kept running), and put Azure Front Door in front with health-probe-driven failover so a dead primary reroutes automatically. Crucially, they kept Azure Backup with point-in-time restore and a short backup interval as the independent defence against corruption — the lesson from the migration — and enabled vault immutability so ransomware could not destroy the recovery points. They left the low-stakes microsite on its cheap single-region-plus-nightly-backup design, refusing to gold-plate it.

The result: the next regional blip caused a sub-minute automatic failover most customers never noticed, and a later accidental bulk-delete was recovered to within five minutes via point-in-time restore. The extra spend — a scaled-down second-region fleet plus more frequent backups — was a fraction of the single bad afternoon it replaced. The decisive change was not a tool; it was writing the two numbers down and designing to them.

Advantages and disadvantages

Treating BCDR as an explicit RTO/RPO-driven discipline has clear trade-offs:

Advantages	Disadvantages
Spend lands where outages actually hurt (tiering)	Requires up-front classification work per workload
Recovery becomes a designed-for number, not a surprise	Higher rungs add real cost and operational complexity
Business and engineering share one vocabulary	Targets must be tested (drills), or they’re fiction
Right-sizing avoids gold-plating low-stakes apps	Cross-region designs add latency and data-consistency concerns
Independent backups defend against corruption/ransomware	More moving parts to monitor, patch, and keep in sync
SLA, HA, and DR stop being conflated	Failback (returning to primary) is often under-planned

When each matters: the up-front classification cost is trivial next to a single mis-tiered incident, so it almost always pays off. The complexity and cost of higher rungs matter most for truly critical workloads — which is exactly why you tier, climbing the ladder only for the systems whose downtime cost justifies it, and leaving everything else on a cheap, honest, lower rung.

Hands-on lab

This lab makes RTO/RPO tangible without a multi-region bill: you will set storage redundancy (the RPO knob for data at rest), inspect the region pair (your DR direction), and create a Recovery Services vault with geo-redundancy (the backup foundation). It is free-tier-friendly except for minimal storage costs; teardown is included.

1. Set variables and a resource group.

RG=rg-bcdr-lab
LOC=centralindia
az group create --name $RG --location $LOC

2. Inspect your region pair — the natural DR target.

az account list-locations \
  --query "[?name=='$LOC'].{region:name, pairedWith:metadata.pairedRegion[0].name}" -o table
# Expected: a row showing centralindia paired with southindia (your DR direction).

3. Create a geo-zone-redundant storage account (HA and DR for data at rest).

az storage account create \
  --name stbcdrlab$RANDOM \
  --resource-group $RG --location $LOC \
  --sku Standard_GZRS --kind StorageV2 --access-tier Hot
# 'provisioningState': 'Succeeded' and 'sku': { 'name': 'Standard_GZRS' } confirm it.

4. Confirm the redundancy you actually got (a frequent surprise, because defaults vary by tool).

az storage account list -g $RG \
  --query "[].{name:name, sku:sku.name, location:primaryLocation, secondary:secondaryLocation}" -o table
# 'secondaryLocation' populated = your data is replicating to the paired region.

5. Create a Recovery Services vault with geo-redundant storage (the backup foundation).

az backup vault create --name rsv-bcdr-lab --resource-group $RG --location $LOC
# Set the vault's backup storage redundancy to geo-redundant (cross-region durability).
# Do this before protecting any items: the setting generally cannot be changed afterwards.
az backup vault backup-properties set \
  --name rsv-bcdr-lab --resource-group $RG \
  --backup-storage-redundancy GeoRedundant

6. The equivalent Bicep, for repeatable infrastructure.

param location string = 'centralindia'

resource vault 'Microsoft.RecoveryServices/vaults@2023-06-01' = {
  name: 'rsv-bcdr-lab'
  location: location
  sku: { name: 'RS0', tier: 'Standard' }
  properties: {}
}

resource vaultConfig 'Microsoft.RecoveryServices/vaults/backupstorageconfig@2023-06-01' = {
  parent: vault
  name: 'vaultstorageconfig'
  properties: {
    storageModelType: 'GeoRedundant'   // recovery points survive a regional event
  }
}

7. Teardown — delete everything to stop charges.

az group delete --name $RG --yes --no-wait
# Note: a vault with active backup items must have them removed before it deletes.

What you proved: redundancy is an explicit choice (steps 3–4), the platform already defines your DR direction via the region pair (step 2), and the vault is the geo-redundant home for the point-in-time copies that defend your RPO against corruption (steps 5–6).
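
The check in step 4 can be automated across many accounts by parsing the CLI's JSON output. This sketch assumes you have saved the output of `az storage account list -o json` (without `--query`); the function name and the sample accounts are hypothetical.

```python
import json
from typing import List

# Azure Storage SKUs that keep a copy in a second region.
GEO_SKUS = {"Standard_GRS", "Standard_RAGRS", "Standard_GZRS", "Standard_RAGZRS"}


def accounts_without_regional_copy(az_json: str) -> List[str]:
    """Return names of storage accounts whose SKU has no cross-region replica."""
    accounts = json.loads(az_json)
    return [a["name"] for a in accounts if a["sku"]["name"] not in GEO_SKUS]


# Trimmed, made-up sample of what the CLI returns.
sample = """[
  {"name": "stbcdrlab1234", "sku": {"name": "Standard_GZRS"}, "secondaryLocation": "southindia"},
  {"name": "stlogsdev", "sku": {"name": "Standard_LRS"}, "secondaryLocation": null}
]"""

print(accounts_without_regional_copy(sample))  # ['stlogsdev']
```

An account on this list is not necessarily wrong; a low-stakes workload may belong on LRS. The point is that the choice should be deliberate.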

Common mistakes & troubleshooting

BCDR fails in predictable ways. Each row is a real failure mode — symptom, the root cause, how to confirm it, and the fix:

#	Symptom	Root cause	How to confirm	Fix
1	“We have GRS, so we’re safe,” yet a region outage still lost data	GRS replication is async; the lag is your RPO, and you can’t read the secondary until failover	`az storage account show -n <account> -g <rg> --expand geoReplicationStats --query "{sku:sku.name, secondary:secondaryLocation, lastSync:geoReplicationStats.lastSyncTime}"` (writes after lastSync may not be in the secondary yet)	Set an RPO budget; use RA-GZRS for read access; pair with backups
2	Restored from backup but lost a full day of orders	RPO equals the backup interval; nightly backup = up to 24 h loss	Check backup policy schedule in the vault	Shorten the interval / use point-in-time restore on the database
3	Zone-redundant app went fully down in a regional outage	Zone redundancy is HA, not DR — it doesn’t survive a region	Confirm there is no second-region footprint	Add a DR region (warm standby / pilot light) + global routing
4	Corruption was faithfully copied to the DR replica	Replication mirrors all writes, including bad ones — it’s not a backup	The bad data exists in both regions	Keep independent point-in-time backups alongside replication
5	DR plan exists on paper but failover took hours	Never drilled; runbook steps were wrong or manual	No record of a successful failover test	Schedule regular failover drills; automate cut-over
6	Ransomware encrypted the backups too	Backups were mutable / in the same trust boundary	Vault has no immutability / soft-delete	Enable vault immutability + soft-delete; isolate the copy
7	Failover worked, but failback was chaos	Failback was never designed	No documented return-to-primary procedure	Plan and test failback as a first-class step
8	The SLA said 99.99% but recovery still took hours	SLA is platform uptime, not your RTO/RPO	The SLA doc says nothing about your app’s recovery	Own RTO/RPO separately; design and test to them
9	DR region exists but DNS still pointed at the dead region	No global routing / health-probe failover	TTL-bound DNS with manual change	Front Door / Traffic Manager with automatic health failover
10	“RPO is 5 minutes” but a dependency lagged hours	Measured one component, not the weakest link	Map RTO/RPO end-to-end across every tier	Set targets on the slowest/stalest component

The meta-lesson across all ten: a recovery capability you have never tested is a hypothesis, not a plan. Drill it.
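
Several rows of the table above reduce to yes/no questions you can ask of any workload. The following is a minimal sketch of that idea, not an Azure API: the `Workload` record and every field name in it are hypothetical, chosen only to mirror the table rows, and you would populate them by hand or from your own inventory.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All field names are hypothetical; they mirror rows of the pitfalls table.
    name: str
    regions: int                    # regions with a deployable footprint
    has_replication: bool
    has_independent_backups: bool
    backups_immutable: bool
    failover_drilled: bool
    failback_documented: bool
    automatic_global_routing: bool


def audit(w: Workload) -> list[str]:
    """Return the pitfalls from the table that this workload is exposed to."""
    findings = []
    if w.regions < 2:
        findings.append("Row 3: no second region; zone redundancy is HA, not DR")
    if w.has_replication and not w.has_independent_backups:
        findings.append("Row 4: replication without backups copies corruption")
    if not w.failover_drilled:
        findings.append("Row 5: failover has never been drilled")
    if w.has_independent_backups and not w.backups_immutable:
        findings.append("Row 6: backups are mutable and exposed to ransomware")
    if not w.failback_documented:
        findings.append("Row 7: failback was never designed")
    if w.regions >= 2 and not w.automatic_global_routing:
        findings.append("Row 9: DR region exists but routing is manual")
    return findings


# Illustrative input only: a single-region app that replicates but has no backups.
example = Workload(
    name="orders-api",
    regions=1,
    has_replication=True,
    has_independent_backups=False,
    backups_immutable=False,
    failover_drilled=False,
    failback_documented=False,
    automatic_global_routing=False,
)
for finding in audit(example):
    print(finding)
```

An empty result means only that none of these particular rows apply; it does not replace an actual failover drill.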

Best practices

Write RTO and RPO down per workload, signed off by the business. Undocumented targets default to “whatever the architecture happens to give,” which is how 24-hour RPOs get accepted by accident.
Tier your workloads. Use the downtime-cost × data-loss-cost grid; do not apply one resilience policy to a microsite and a payments ledger.
Separate HA from DR explicitly. Confirm you have both an in-region HA story (zones) and a cross-region DR story for anything critical.
Keep backups independent of replication. Replication copies corruption; backups give you a clean point to roll back to. You need both.
Choose storage redundancy deliberately. The default is often LRS (no regional protection). Pick ZRS/GZRS to match the failure scope you must survive.
Make the DR region the paired region where compliance and latency allow — it aligns with how Azure replicates and sequences updates.
Drill failover on a schedule. An untested runbook fails when it matters; rehearse it and time the real RTO.
Plan failback as carefully as failover. Returning to primary is a project of its own, not an afterthought.
Protect backups against ransomware with vault immutability, soft-delete, and isolation from production credentials.
Measure RTO/RPO end-to-end. Your real numbers are set by the slowest/stalest component, not the fastest.
Use global routing for failover. Front Door or Traffic Manager with health probes turns a region failure into an automatic reroute, not a frantic DNS edit.
Right-size, don’t gold-plate. Climb the spectrum only as high as the workload’s band justifies; over-provisioned resilience is wasted budget.
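
The "measure end-to-end" practice can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: the component names and timings are invented drill results, and the tiers are assumed to recover in parallel (if they recover one after another, the recovery times add up instead of taking the maximum).

```python
from datetime import timedelta

# Hypothetical drill measurements per tier:
# (time taken to recover, age of the last safe copy at the moment of failure).
measured = {
    "web": (timedelta(seconds=30), timedelta(0)),
    "database": (timedelta(hours=4), timedelta(minutes=5)),
    "file share": (timedelta(minutes=20), timedelta(hours=6)),
}


def end_to_end(components):
    """Workload-level figures are set by the slowest and the stalest component.

    Assumes tiers recover in parallel; for sequential recovery, sum the
    recovery times instead of taking the maximum.
    """
    recovery_time = max(recovery for recovery, _ in components.values())
    data_loss = max(staleness for _, staleness in components.values())
    return recovery_time, data_loss


def meets_targets(components, target_rto, target_rpo):
    """Compare the measured end-to-end figures against the agreed targets."""
    recovery_time, data_loss = end_to_end(components)
    return recovery_time <= target_rto and data_loss <= target_rpo


recovery_time, data_loss = end_to_end(measured)
print("Measured recovery time:", recovery_time)
print("Measured data loss:", data_loss)
print("Meets 1 h RTO / 15 min RPO:",
      meets_targets(measured, timedelta(hours=1), timedelta(minutes=15)))
```

Here the 30-second web failover is irrelevant to the outcome: the database sets the recovery time and the file share sets the data loss.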

Security notes

Resilience and security overlap more than teams expect. Backups and replicas are full copies of your data — and therefore full-value targets. Apply the same least-privilege, encryption, and isolation discipline to them as to production. Use Azure RBAC to restrict who can delete or restore from a Recovery Services vault, and enable multi-user authorization (MUA) so a single compromised admin cannot wipe recovery points. Enable vault immutability and soft-delete so backups cannot be silently deleted or encrypted by ransomware — the difference between “we have backups” and “we have backups an attacker cannot reach.”

Keep DR credentials and the DR environment in a separate trust boundary from production where practical: a compromised identity plane should not automatically hand the attacker your recovery copies. Targeting the paired region usually keeps data in-geography for residency compliance, though a few pairs span geographies (Brazil South pairs with South Central US), so confirm the secondary satisfies the same regulatory requirements as the primary. All redundancy levels (LRS–GZRS) and Recovery Services vaults encrypt at rest by default; for stricter needs, layer customer-managed keys (CMK) in Key Vault, but the key then needs its own availability plan, or it becomes a new single point of failure for recovery.

Cost & sizing

BCDR cost is driven by how much you keep running in the second location and how much/often you store recovery data. The spectrum’s cost curve is steep precisely because the top rungs run a full second environment full-time. Rough figures (Central India, indicative; always price your own SKUs):

Tier	What you pay for	Rough relative monthly cost	Indicative INR/mo (small workload)	Indicative USD/mo
Backup & Restore	Backup storage + vault	Baseline (×1.0)	₹1,500–6,000	$18–72
Pilot Light	Replicated data + minimal compute	×1.1–1.3	₹8,000–20,000	$95–240
Warm Standby	Smaller always-running second fleet	×1.4–1.6	₹25,000–60,000	$300–720
Active-Active	Full second production environment	×1.7–2.0	₹60,000–1,20,000+	$720–1,440+

The dominant cost levers, and how to right-size each:

Cost driver	What inflates it	How to right-size
Standby compute	Running a full second fleet 24×7	Use pilot light / warm standby; scale up only on failover
Backup storage	Long retention + high frequency on large data	Tier retention; back up frequently only what needs a tight RPO
Geo-redundant storage	GZRS premium across all data	Use GZRS for critical data, LRS/ZRS for the rest
Cross-region egress	Replication and failover data transfer	Expected with DR; minimise chatty cross-region calls
Recovery Services vault	Per-instance protection + storage redundancy	Geo-redundant only where regional durability is required

Free-tier and cost notes: Azure Backup has no separate compute charge — you pay for protected-instance count and the backup storage consumed. ZRS/GZRS carry a premium over LRS for the extra copies. The cheapest honest posture for a non-critical workload is single-region with scheduled backups to a geo-redundant vault — a low monthly cost that still gives you a real (if hours-scale) recovery. Reserve the expensive rungs for the workloads whose downtime cost genuinely clears the bar.

Interview & exam questions

1. What is the difference between RTO and RPO? RTO (Recovery Time Objective) is the maximum acceptable time to restore service after an outage — measured forward from the disaster. RPO (Recovery Point Objective) is the maximum acceptable amount of recent data lost — measured backward to the last safe copy. They are independent: a design can have a good RTO and a poor RPO, or vice versa.

2. How do high availability and disaster recovery differ? HA survives small, frequent failures (disk, VM, a single availability zone), usually automatically and within one region, with near-zero data loss. DR recovers from large, rare events (a whole region down, data destruction), typically by failing over to a second region, often with a human decision and a measurable recovery window. You generally need both.

3. Does zone redundancy give you disaster recovery? No. Availability zones protect against a single zone (one or more datacentres) failing within a region; that is HA. A whole-region outage defeats zone redundancy entirely. For DR you need a presence in a second region.

4. Why is replication not a substitute for backup? Replication copies every write to the secondary, including corrupt or maliciously deleted data — so the bad state appears in both regions. A backup is an independent point-in-time copy you can roll back to. You need replication for low RPO on infrastructure failure and backups for recovery from corruption/ransomware.

5. What is the practical difference between LRS, ZRS, GRS, and GZRS? LRS keeps three copies in one datacentre (no zone or region protection). ZRS spreads them across zones in one region (survives a zone, not a region). GRS adds an async copy in the paired region (survives a region, after failover). GZRS combines in-region zone redundancy with that async cross-region copy. GRS/GZRS replication is asynchronous, so the lag is your storage RPO.

6. What are the four standard DR strategies, from cheapest to most resilient? Backup & Restore (restore and redeploy; RTO hours), Pilot Light (core data warm, minimal compute; RTO tens of minutes), Warm Standby (a scaled-down running copy; RTO minutes), and Active-Active (both regions serve live traffic; RTO near-zero). Cost rises with recovery speed.

7. What is a region pair and why does it matter for BCDR? A region pair is two regions Azure links within a geography. It matters because GRS/GZRS replication targets the paired region automatically, platform updates are sequenced so both are rarely updated together, and recovery is prioritised one region per pair in a broad event. It is the natural, compliance-friendly DR direction.

8. Is an Azure SLA the same as your RTO/RPO? No. An SLA is the platform’s uptime commitment for a service — a floor on availability. It says nothing about how fast your application recovers from a bad deployment or how much data you lose to corruption. RTO/RPO are yours to define, design for, and test.

9. How do you decide which DR strategy a workload needs? Estimate two independent costs — an hour of downtime, and losing recent data — to place the workload on a grid that maps to an RTO/RPO band, which selects a strategy. Tier workloads so spend matches the real cost of an outage.

10. Why must your real RTO/RPO be measured end-to-end? Because a chain is only as strong as its weakest link: your true RTO is the slowest component to recover and your true RPO is the least-recent safe copy across the whole system. A 30-second web failover is meaningless if the database takes four hours to restore.

11. What protects backups from ransomware? Vault immutability and soft-delete (so recovery points cannot be silently deleted or encrypted), multi-user authorization so one compromised admin cannot wipe them, and isolation of the backup copy from production credentials. “Having backups” an attacker can reach is not protection.

12. Which certs cover this material? The HA/DR, RTO/RPO, region-pair, storage-redundancy, and Backup/Site Recovery concepts here map to AZ-104 (Administrator) and AZ-305 (Solutions Architect, design for resilience), and to the Reliability pillar of the Azure Well-Architected Framework. The routing layer (Front Door, Traffic Manager) is covered by the AZ-700 networking content.
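
Questions 6 and 9 describe a selection rule: climb the ladder only until the strategy's recovery speed meets the target. A minimal sketch of that rule follows. The strategy names and their ordering come from the text; the numeric ceilings are illustrative assumptions standing in for "hours", "tens of minutes", "minutes", and "near-zero", and should be replaced with figures from your own drills.

```python
from datetime import timedelta

# Cheapest first. The timedelta values are assumed stand-ins for the
# qualitative RTO bands in the text, not measured or published figures.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=8)),
    ("Pilot Light", timedelta(minutes=45)),
    ("Warm Standby", timedelta(minutes=10)),
    ("Active-Active", timedelta(seconds=30)),
]


def cheapest_strategy(target_rto: timedelta) -> str:
    """Return the first (cheapest) strategy whose indicative RTO meets the target."""
    for name, indicative_rto in STRATEGIES:
        if indicative_rto <= target_rto:
            return name
    # Nothing listed is fast enough, so fall back to the top rung.
    return STRATEGIES[-1][0]


for target in (timedelta(hours=24), timedelta(hours=1), timedelta(minutes=1)):
    print(target, "->", cheapest_strategy(target))
```

The sketch selects on RTO only. A real decision also checks that the strategy's data-loss window fits the RPO, which can push a workload up a rung on its own.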

Quick check

An outage strikes at 14:00; your last backup was at 02:00. How much data do you lose in this event, and how does that relate to your RPO?
Your web tier fails over in 30 seconds but the database restore takes 3 hours. What is the workload’s real RTO?
You are on GZRS. A whole region fails. Is your data automatically readable in the secondary region with no further action?
Which DR strategy keeps a scaled-down but running copy in the second region?
True or false: a 99.99% SLA guarantees you will recover from a corrupted database within minutes.

Answers

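12 hours: the data lost is the gap back to the last safe copy, so 14:00 minus 02:00. That is the recovery point you actually achieved; the RPO is the target it is judged against. (With daily backups the worst case approaches the full 24-hour interval.)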
3 hours — the workload’s RTO is set by its slowest-to-recover component, the database, not the fast web tier.
No — GZRS replication is asynchronous and the secondary is not readable until a failover is initiated, unless you are using the read-access variant (RA-GZRS).
Warm Standby — a running, scaled-down copy you scale up and cut traffic to (Pilot Light keeps data warm but compute minimal; Active-Active serves live from both).
False — an SLA is a platform-uptime commitment; it says nothing about recovering your application from corruption, which is governed by your RTO/RPO and your backups.
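
The first quick-check answer is plain subtraction, which makes the target-versus-measurement distinction easy to see in code. In this sketch the calendar date is arbitrary and the two RPO targets are hypothetical; only the 14:00 and 02:00 times come from the question.

```python
from datetime import datetime, timedelta


def data_lost(outage: datetime, last_backup: datetime) -> timedelta:
    """Window of writes that the last safe copy did not capture."""
    if last_backup > outage:
        raise ValueError("last backup cannot be later than the outage")
    return outage - last_backup


def within_rpo(loss: timedelta, rpo_target: timedelta) -> bool:
    """The RPO is a target; the loss is what actually happened."""
    return loss <= rpo_target


# The date is arbitrary; the times are the ones in the quick check.
outage = datetime(2024, 1, 1, 14, 0)
last_backup = datetime(2024, 1, 1, 2, 0)

loss = data_lost(outage, last_backup)
print("Data lost:", loss)
print("Within a 24 h RPO:", within_rpo(loss, timedelta(hours=24)))
print("Within a 4 h RPO:", within_rpo(loss, timedelta(hours=4)))
```

The same event passes a 24-hour RPO and fails a 4-hour one, which is why the target has to be agreed before the incident rather than read off afterwards.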

Glossary

RTO (Recovery Time Objective): The maximum acceptable time to restore service after an outage; measured forward from the disaster.
RPO (Recovery Point Objective): The maximum acceptable amount of recent data lost; measured backward to the last safe copy.
High availability (HA): Surviving small, frequent failures (disk, VM, zone) automatically, usually within one region, with near-zero data loss.
Disaster recovery (DR): Recovering from large, rare failures (region down, data destruction), typically via a second region and a deliberate failover.
Availability zone: A physically separate group of one or more datacentres within an Azure region, with independent power, cooling, and networking.
Region pair: Two Azure regions in a geography that the platform links for residency, replication, and sequenced recovery.
Failover: Switching production to the standby location during a disaster.
Failback: Returning production to the original primary location after recovery.
LRS / ZRS / GRS / GZRS: Storage redundancy levels — locally, zone-, geo-, and geo-zone-redundant respectively — trading cost against the failure scope they survive.
Azure Backup / Recovery Services vault: Azure’s managed point-in-time backup service and the vault that stores recovery points.
Azure Site Recovery (ASR): Azure’s service for replicating and failing over whole workloads (VMs) to another region or zone.
Pilot Light / Warm Standby / Active-Active: DR strategies ordered by recovery speed and cost, from minimal-standby to full live second region.
SLA (service-level agreement): The provider’s uptime commitment for a service — a floor on availability, not a substitute for your RTO/RPO.
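Asynchronous replication: Copying data to a secondary after the primary has acknowledged the write, so the secondary trails by some lag; the lag at the moment of a primary-region failure is the data you lose, and it must fit within your RPO.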

Next steps

Go deeper on the region/zone mechanics that set your HA floor in Azure Regions and Availability Zones Explained.
Learn the protection tooling end-to-end in Azure Backup and Site Recovery for Protection.
Drill the HA-versus-DR distinction further in High Availability vs Disaster Recovery: RTO and RPO.
Climb to the top rung with Azure Multi-Region Active-Active Design.
Add the global routing layer that makes failover automatic in Azure Front Door and Traffic Manager for Global Failover.
Make the application itself resilient with Resiliency Patterns: Retry, Circuit Breaker, Bulkhead.