Azure Enterprise Architecture: Disaster Recovery for IaaS

Most “disaster recovery” stories in the enterprise are not about cosmic events. They are about a botched storage firmware update that takes a zone offline, a ransomware attack that encrypts a SQL Server VM along with its local backups, or a region-wide control-plane incident that makes a perfectly healthy VM unreachable for four hours. The workloads that get hit hardest are almost always the IaaS ones: the lift-and-shifted line-of-business apps that never got refactored, run on Windows or Linux VMs, talk to a SQL Server or Oracle instance, and quietly underpin revenue.

This article is a reusable reference architecture for DR of Azure IaaS workloads built on Azure Site Recovery (ASR), Azure Traffic Manager, paired regions, and an explicit, tested RTO/RPO contract. It is deliberately IaaS-first: no rewrite, no PaaS migration, no application code changes. It scales from a 15-VM small enterprise to a 600-VM regulated estate using the same building blocks.

The business scenario

Picture a mid-market financial-services firm — call them a “small-to-large enterprise” depending on the year — that runs a customer-facing loan-origination platform. The platform is three tiers of VMs: a pair of IIS web servers behind an internal load balancer, three .NET application servers, and a SQL Server 2022 Always On availability group on two VMs with a witness. There are another dozen supporting VMs: an AD domain controller replica, a file server, a reporting box, and a couple of integration/ETL servers. Everything lives in a single Azure region today, spread across availability zones.

Zones protect against a datacenter failure. They do not protect against:

A regional outage — a rare but real control-plane or networking incident that affects all zones in the region at once.
Ransomware or destructive operator error — where the corruption replicates into your in-region backups before anyone notices.
A compliance mandate — the regulator (and the firm’s own auditors) require a documented, tested ability to recover the platform in a geographically separate region within a defined window, with evidence from periodic drills.

The business has signed off on two numbers that drive every decision below:

RTO (Recovery Time Objective): 1 hour. From “region declared lost” to “customers transacting again in the recovery region.”
RPO (Recovery Point Objective): 15 minutes for the app/web tier and near-zero (seconds) for the SQL transactional data.

The constraint that makes this interesting: leadership will not fund a hot, fully-running duplicate of the platform in a second region 24x7. They want DR economics — pay for replication and a small standby footprint, pay for full compute only when a disaster is actually declared. That single constraint is what makes Azure Site Recovery the right tool rather than just “run two of everything.”

Architecture overview

The architecture is a two-region active/passive (pilot-light) topology across an Azure region pair, for example Central India (primary) and South India (secondary), or East US and West US. Region pairs matter here because Azure sequences planned platform updates so that only one region in a pair is updated at a time, prioritizes recovery of at least one region in each pair after a broad outage, and some services (geo-redundant storage, for example) replicate to the paired region by default.

Here is the end-to-end picture, described as a flow.

Steady state (no disaster). Users resolve the platform’s public name — loans.contoso-fs.com — through Azure Traffic Manager configured with a Priority routing method. Traffic Manager returns the CNAME/endpoint for the primary region; the secondary endpoint is configured but ranked lower and is not receiving traffic. In the primary region, requests enter through an Application Gateway (WAF v2), hit the IIS web tier, which calls the .NET app tier, which talks to the SQL Always On listener. Meanwhile, Azure Site Recovery continuously replicates every protected VM’s disks to the secondary region: the Mobility service agent installed inside each VM captures writes, ships them to a cache storage account in the primary region, and ASR pushes them across to managed-disk replicas in the secondary region. SQL data is handled differently and better — see below.
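To make the Priority routing behavior concrete, here is a minimal Python sketch of the selection rule: answer with the lowest-numbered endpoint that is both enabled and passing its health probe. The Endpoint class, the endpoint names, and pick_endpoint are illustrative assumptions, not Traffic Manager's API, and the sketch ignores DNS caching and what the real service does when every endpoint is unhealthy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int   # lower number = preferred
    enabled: bool   # operator-controlled switch
    healthy: bool   # result of the most recent health probe


def pick_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the most preferred endpoint that is enabled and healthy."""
    candidates = [e for e in endpoints if e.enabled and e.healthy]
    return min(candidates, key=lambda e: e.priority, default=None)


# Hypothetical endpoints standing in for the two regional App Gateways.
primary = Endpoint("appgw-primary", priority=1, enabled=True, healthy=True)
secondary = Endpoint("appgw-secondary", priority=2, enabled=True, healthy=True)
steady_state = pick_endpoint([primary, secondary])

# During a declared disaster the operator disables the primary endpoint.
primary_down = Endpoint("appgw-primary", priority=1, enabled=False, healthy=False)
after_flip = pick_endpoint([primary_down, secondary])
```

In steady state the rule picks the primary; once the primary is disabled, the secondary is the only candidate left, which is the whole cutover from the resolver's point of view.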

The replication data path. For each protected VM, app-consistent and crash-consistent recovery points are generated on a schedule (crash-consistent every 5 minutes; app-consistent typically every 1–4 hours). These are retained for a configurable window (e.g. 24–72 hours), giving you not just “the latest” but a ladder of recovery points to roll back past a ransomware event. The secondary region holds only the replica disks and a recovery plan — there are no running VMs for the app/web tier until failover. This is the “pilot light.”
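The value of a ladder of recovery points is easiest to see as a selection problem: given the time of compromise, pick the newest point taken before it. The sketch below is an assumption-laden illustration in Python; the timestamps, the 24-hour ladder, and latest_clean_point are hypothetical and do not call any ASR API.

```python
from datetime import datetime, timedelta
from typing import Optional


def latest_clean_point(
    recovery_points: list[datetime], compromised_at: datetime
) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before the compromise."""
    clean = [p for p in recovery_points if p < compromised_at]
    return max(clean, default=None)


# Hypothetical ladder: one crash-consistent point every 5 minutes for 24 hours.
now = datetime(2026, 1, 15, 12, 0)
ladder = [now - timedelta(minutes=5 * i) for i in range(24 * 12)]

# Assume forensics places the first malicious write 3 hours 2 minutes before now.
compromised_at = now - timedelta(hours=3, minutes=2)
chosen = latest_clean_point(ladder, compromised_at)

# A compromise older than the retention window leaves no clean point at all.
too_late = latest_clean_point(ladder, now - timedelta(days=2))
```

The second call returns None, which is the case where retention has already rolled past the clean state and only a longer-retention backup can help.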

SQL is special. The web and app tiers are stateless enough to recover from ASR replicas. The database is not — a 15-minute crash-consistent recovery point is unacceptable for a loan ledger. So SQL is replicated natively using a SQL Server Always On availability group with an asynchronous replica VM running in the secondary region, not via ASR. This gives near-zero RPO and a clean, transactionally-consistent failover that the database engine itself guarantees. ASR protects the app/web/supporting tiers; Always On protects the data tier.

Failover (disaster declared). An operator (or an automated signal) triggers the ASR recovery plan. The plan boots the replica VMs in the secondary region in the correct order (domain controller and SQL first, then app, then web), runs post-failover scripts (re-IP, re-join, warm caches), and forces the SQL Always On secondary to become primary. Once the recovery-region stack is healthy, the operator flips Traffic Manager — either by disabling the primary endpoint or by promoting the secondary’s priority — and DNS resolvers begin handing users the recovery region. Because Traffic Manager is DNS-based, the cutover is gated by DNS TTL (set low, e.g. 30–60 seconds, well before any incident).
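The ordering and the approval gate described above can be sketched as a small state machine. This is a simplified Python model, not an ASR recovery plan: the step names are hypothetical, and bring_up and approve_cutover are placeholder callbacks standing in for "start this VM or promote this replica" and "a human approved the public cutover".

```python
from typing import Callable

# Boot groups in the order described above; step names are hypothetical.
BOOT_GROUPS: list[tuple[str, list[str]]] = [
    ("identity-and-data", ["dc01", "sql-dr"]),
    ("app", ["app01", "app02", "app03"]),
    ("web", ["web01", "web02"]),
]


def run_recovery_plan(
    bring_up: Callable[[str], bool],
    approve_cutover: Callable[[], bool],
) -> list[str]:
    """Run groups in order, halt on the first failure, gate the public cutover."""
    log: list[str] = []
    for group, steps in BOOT_GROUPS:
        for step in steps:
            if not bring_up(step):
                log.append(f"FAILED {step} in group {group}; plan halted")
                return log
            log.append(f"started {step}")
    if approve_cutover():
        log.append("cutover approved: flip Traffic Manager")
    else:
        log.append("cutover withheld: traffic stays on primary")
    return log


happy_path = run_recovery_plan(lambda step: True, lambda: True)
halted = run_recovery_plan(lambda step: step != "app02", lambda: True)
```

Halting on the first failed step is a design choice in this sketch: it keeps the web tier from coming up in front of an application tier that is only partly recovered.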

Failback. When the primary region is healthy again, ASR reverse-replicates the now-current secondary back to the primary, you re-establish the Always On replica in the primary, and a planned failover returns you to steady state during a maintenance window.

The mental model: Traffic Manager decides where users go, ASR decides what comes back to life, Always On guarantees the data is intact, and the region pair is the blast-radius boundary.

Component breakdown

Component	Role in the architecture	Key configuration choices
Azure Site Recovery (Recovery Services vault)	Orchestrates replication, recovery points, and failover for the app/web/supporting VM tiers.	Azure-to-Azure replication; multi-VM consistency groups for VMs that must fail over together; 24–72h recovery-point retention; app-consistent snapshots enabled where the in-VM app supports VSS/pre-post scripts.
Recovery Plan	The ordered, scripted failover runbook ASR executes.	Boot groups (DC → SQL → app → web); pre/post Azure Automation runbooks for re-IP, DNS update, and Traffic Manager flip; manual-approval gate before the public cutover.
Azure Traffic Manager	Global DNS-based traffic director between regions.	Priority routing; primary endpoint priority 1, secondary priority 2; endpoint health probes against an app health endpoint; low DNS TTL (30–60s) to bound cutover time.
SQL Server Always On AG (on VMs)	Near-zero-RPO data replication for the database tier.	Async-commit replica in the secondary region; AG listener fronted per-region; automatic seeding; backups to immutable storage; not protected by ASR (engine-level replication instead).
Application Gateway (WAF v2)	Regional ingress, TLS termination, and L7 routing in each region.	Deployed independently in both regions (primary live, secondary pre-provisioned or deployed-on-failover); zone-redundant; WAF in Prevention mode.
Paired regions	The fault-isolation boundary the whole design assumes.	Use an Azure region pair so platform maintenance is sequenced and recovery is prioritized; place secondary far enough for true geographic isolation.
Virtual networks (hub-spoke per region)	Identical-but-separate network fabric in each region.	Non-overlapping or mirrored address spaces; ASR maps source subnets → target subnets; global VNet peering / shared hub for replication and management traffic.
Cache storage account	Staging buffer for ASR replication writes in the source region.	One per replicated region; sized for change rate; do not delete during the protection lifecycle.
Azure Backup	Independent, immutable backup as the ransomware/long-retention safety net.	Soft-delete + immutable vault; cross-region restore enabled; complements ASR (ASR is for DR, Backup is for data recovery/retention — they are not the same control).
Azure Monitor + Log Analytics	Observability for replication health, RPO drift, and drill evidence.	Alerts on replication health, RPO breach, and recovery-point age; workbook tracking last successful test failover per workload.

A few breakdown points deserve emphasis.

ASR vs Backup is not redundant — it is two different controls. ASR keeps a replica that can become a running system in minutes (low RTO, short retention). Azure Backup keeps immutable point-in-time copies for weeks/months (data recovery, ransomware rollback, compliance). A mature DR posture uses both. Relying on ASR alone leaves you exposed if corruption replicates into all your recovery points before detection; relying on Backup alone leaves your RTO at “however long a full restore takes.”

Why Always On instead of ASR for SQL. ASR’s recovery points for a busy database are crash-consistent at best on a 5-minute cadence and app-consistent far less often. Both can mean lost committed transactions and a longer, riskier recovery for a transactional system. Always On asynchronous replication ships log records continuously, gives you a tested engine-native failover, and brings RPO down to seconds. The trade-off is that you pay for a running SQL replica VM in the secondary region even in steady state, but that is one small standing cost, not a full duplicate platform.

Why Traffic Manager and not just Azure Front Door. For a pure IaaS, region-failover scenario where the protocols include non-HTTP traffic and you want DNS-level steering with explicit priority, Traffic Manager is the clean fit. Front Door (or Application Gateway behind it) is the better choice when you want a global L7 reverse proxy, caching, and instant anycast failover for HTTP(S). Many enterprises run both: Front Door for the web front-end, Traffic Manager for non-HTTP and infrastructure endpoints. This reference keeps Traffic Manager as the primary director to stay IaaS-protocol-agnostic, and notes Front Door as the HTTP-optimized alternative under “When to use it.”

Implementation guidance

Provision the foundation as code. Stand up both regions’ landing zones with Terraform or Bicep so the secondary is a deterministic mirror, never hand-built drift.

Networking: two hub-spoke VNets, one per region. Decide early between mirrored address spaces (identical CIDRs, simpler app config, requires careful routing and no direct peering of the duplicates) versus non-overlapping address spaces (peerable, easier to run “warm” tests, requires ASR subnet mapping and possible re-IP). For a failover-only secondary, mirrored is common; for ones you want to test live, non-overlapping wins.
ASR setup (Terraform): create the azurerm_recovery_services_vault, an azurerm_site_recovery_fabric and azurerm_site_recovery_protection_container per region, then azurerm_site_recovery_replication_policy (e.g. 24h recovery-point retention, app-consistent frequency 4h), an azurerm_site_recovery_protection_container_mapping that ties the policy to the source and target containers, azurerm_site_recovery_network_mapping (source VNet → target VNet), and azurerm_site_recovery_replicated_vm for each protected VM with target_resource_group_id, target_network_id, a network_interface block naming the target subnet, and managed_disk blocks pinning replica disk SKUs. Codify the boot order intent here even though the recovery plan itself is often finalized in the portal/REST.
Recovery plan + automation: define boot groups and attach Azure Automation runbooks as pre/post actions — a post-failover runbook that updates internal DNS, fixes any re-IP, validates the SQL listener, and (after an approval gate) calls the Traffic Manager API to disable the primary endpoint. Keep the public cutover behind manual approval so a transient blip never auto-fails-over your customers.
SQL Always On: build the AG with two synchronous-commit replicas in the primary region (HA) plus one asynchronous-commit replica in the secondary region (DR). Use a distributed network name / per-region listener pattern, automatic seeding, and ship log/full backups to an immutable, soft-delete-enabled vault with cross-region restore.
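The mirrored versus non-overlapping decision in the networking step can be checked mechanically before any peering is attempted. Below is a minimal sketch using Python's standard ipaddress module; the CIDR ranges are hypothetical, and overlapping_pairs is an illustrative helper rather than part of any Azure tooling.

```python
from ipaddress import ip_network


def overlapping_pairs(primary: list[str], secondary: list[str]) -> list[tuple[str, str]]:
    """Return every (primary, secondary) CIDR pair whose address ranges overlap."""
    return [
        (p, s)
        for p in primary
        for s in secondary
        if ip_network(p).overlaps(ip_network(s))
    ]


# Hypothetical address plans for the two regions.
primary_spaces = ["10.10.0.0/16"]
mirrored_secondary = ["10.10.0.0/16"]   # identical CIDRs, cannot be peered directly
separate_secondary = ["10.20.0.0/16"]   # peerable, needs subnet mapping

mirrored_conflicts = overlapping_pairs(primary_spaces, mirrored_secondary)
separate_conflicts = overlapping_pairs(primary_spaces, separate_secondary)
```

An empty result means the two plans can coexist on a peered network; any pair returned is a range that would have to stay isolated.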

Identity wiring. DR is worthless if nobody can authenticate after failover. Ensure at least one read-write domain controller (or a replica that can be promoted) exists in the secondary region and is included early in the boot order, with AD replication healthy across regions and Microsoft Entra Connect sync healthy. Use managed identities for the Automation runbooks and an Entra ID PIM-gated break-glass account for declaring a disaster, scoped with least privilege via Azure RBAC and Azure Policy.

Make the runbook real, not theoretical. ASR’s Test Failover spins the secondary up into an isolated test VNet without touching production or breaking ongoing replication. Schedule it quarterly, capture screenshots/logs as audit evidence, and measure actual RTO against the 1-hour target. A DR plan that has never been executed is a hypothesis.
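Measuring a drill is simple arithmetic, but writing it down removes any ambiguity about where the clock starts and stops. A minimal Python sketch follows, assuming the clock runs from the disaster declaration to the first successful transaction in the recovery region; the timestamps are hypothetical drill data.

```python
from datetime import datetime, timedelta

RTO_TARGET = timedelta(hours=1)  # the business contract stated earlier


def measure_rto(declared_at: datetime, transacting_at: datetime) -> timedelta:
    """Elapsed time from disaster declaration to the first successful transaction."""
    if transacting_at < declared_at:
        raise ValueError("recovery cannot complete before the declaration")
    return transacting_at - declared_at


# Hypothetical timestamps captured during a test failover.
declared = datetime(2026, 2, 10, 9, 0)
transacting = datetime(2026, 2, 10, 9, 38)

measured = measure_rto(declared, transacting)
within_target = measured <= RTO_TARGET
headroom = RTO_TARGET - measured
```

Recording headroom as well as pass/fail makes it visible when successive drills are creeping toward the limit.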

Enterprise considerations

Security & Zero Trust. Treat the recovery region as production from day one: same NSGs, same Azure Firewall/WAF policy, same Private Endpoints, same Conditional Access. Enable Customer-Managed Keys on replica disks if the primary uses them (and replicate the key to a Key Vault in the secondary region, or use a multi-region Managed HSM). The most common DR security gap is a recovery region with relaxed controls — attackers love a warm, under-monitored failover target. Gate the disaster-declaration capability behind Entra PIM with approval, and log every failover action to an immutable audit sink.

Cost optimization. This is the architecture’s headline benefit. In steady state you pay for: ASR per-protected-instance licensing, replication storage (cache + replica managed disks), egress for replication, the one standing SQL DR replica VM, and a small always-on footprint (DC, Traffic Manager, pre-provisioned App Gateway). You do not pay for the app/web tier compute in the secondary region until you fail over. Tactics: use Reserved Instances / savings plans for the standing SQL replica and DC; keep replica disks at Standard SSD where the source is Premium if the failover-time perf hit is acceptable (or match SKUs for tier-1); right-size recovery-point retention (every extra day is storage you pay for). Expect the steady-state DR bill to land somewhere between a single-digit percentage and roughly 15% of the cost of running a full duplicate region.

Scalability. The pattern scales by adding VMs to ASR consistency groups and boot groups — there is no architectural cliff from 15 to 600 VMs, only operational discipline (group related VMs so they fail over coherently, and respect ASR’s per-vault and churn limits by sharding very large estates across multiple vaults/policies). The data tier scales independently via SQL HA/DR topology.
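One way to think about sharding a large estate is as a packing problem: fill vaults up to a planning limit without ever splitting a group of VMs that must fail over together. The Python sketch below uses a first-fit heuristic; the group names, sizes, and the limit of 8 are hypothetical planning inputs and not actual ASR quotas.

```python
def shard_groups(groups: dict[str, int], max_vms_per_vault: int) -> list[list[str]]:
    """First-fit packing of consistency groups into vaults without splitting a group."""
    vaults: list[list[str]] = []
    loads: list[int] = []
    # Place the largest groups first so they are not stranded at the end.
    for name, size in sorted(groups.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > max_vms_per_vault:
            raise ValueError(f"group {name} exceeds the per-vault limit")
        for i, load in enumerate(loads):
            if load + size <= max_vms_per_vault:
                vaults[i].append(name)
                loads[i] += size
                break
        else:
            # No existing vault has room, so open a new one.
            vaults.append([name])
            loads.append(size)
    return vaults


# Hypothetical group sizes and a hypothetical planning limit.
plan = shard_groups(
    {"loans": 5, "reporting": 4, "etl": 3, "files": 2},
    max_vms_per_vault=8,
)
```

First-fit is not optimal in general, but it is predictable and keeps related VMs in the same vault, which matters more here than squeezing out the last slot.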

Reliability & DR (RTO/RPO). This is the whole point, so make the contract explicit and monitored:

Tier	Mechanism	RPO target	RTO contribution
Web / App (stateless-ish)	Azure Site Recovery replica + recovery plan	~5–15 min (recovery-point cadence)	Boot + script time within the plan
Database	SQL Always On async replica	Seconds (async log shipping)	AG failover (fast)
Public entry	Traffic Manager priority flip	n/a (routing)	Bounded by DNS TTL (30–60s)
End to end	Recovery plan + TM cutover	≤15 min data loss	≤1 hour to transact

Validate it with quarterly Test Failovers and record the measured numbers — auditors want evidence, not aspirations. Watch RPO drift: if a VM’s change rate exceeds replication throughput, its recovery-point age grows and silently breaks your RPO. Alert on it.
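RPO drift can be estimated with a deliberately simple steady-rate model. If a VM writes at a constant churn rate that exceeds a constant replication throughput, then after t minutes only the first t × throughput / churn minutes of writes have been shipped, so the lag is t × (1 − throughput / churn). The Python sketch below solves that for the moment the lag reaches the RPO. The rates are hypothetical, and real replication is burstier than this model assumes.

```python
from typing import Optional


def minutes_until_rpo_breach(
    churn_mb_s: float, throughput_mb_s: float, rpo_minutes: float
) -> Optional[float]:
    """Minutes of sustained churn before replication lag exceeds the RPO.

    Returns None when replication keeps up (throughput >= churn).
    Assumes constant rates and zero lag at the start.
    """
    if churn_mb_s <= throughput_mb_s:
        return None
    # lag(t) = t * (1 - throughput / churn); solve lag(t) = rpo for t.
    return rpo_minutes * churn_mb_s / (churn_mb_s - throughput_mb_s)


# Hypothetical rates against the 15-minute app-tier RPO.
keeps_up = minutes_until_rpo_breach(8.0, 10.0, 15.0)
falls_behind = minutes_until_rpo_breach(12.0, 10.0, 15.0)
```

With churn only 20% above throughput, the 15-minute RPO is gone after an hour and a half of sustained load, which is why the alert needs to fire on the trend and not only on the breach.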

Observability. Pipe ASR replication health, recovery-point age, and Always On synchronization state into Log Analytics. Build a workbook that answers three questions at a glance: Is every protected VM healthy and within RPO? When did each workload last pass a test failover? What is the current measured RTO? Alert on replication health degradation and RPO breach before a real incident exposes the gap.
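The first two workbook questions reduce to two filters over per-workload status records. The Python sketch below shows that logic with hypothetical records and field names; it does not query Log Analytics, and the 92-day threshold is an assumed stand-in for "quarterly".

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WorkloadStatus:
    name: str
    replication_healthy: bool
    recovery_point_age_min: float
    rpo_target_min: float
    last_test_failover: date


def out_of_contract(statuses: list[WorkloadStatus]) -> list[str]:
    """Workloads that are unhealthy or whose newest recovery point exceeds the RPO."""
    return [
        s.name
        for s in statuses
        if not s.replication_healthy or s.recovery_point_age_min > s.rpo_target_min
    ]


def stale_drills(
    statuses: list[WorkloadStatus], today: date, max_age_days: int = 92
) -> list[str]:
    """Workloads whose last test failover is older than the drill interval."""
    return [s.name for s in statuses if (today - s.last_test_failover).days > max_age_days]


# Hypothetical status snapshot.
snapshot = [
    WorkloadStatus("web01", True, 6, 15, date(2026, 1, 20)),
    WorkloadStatus("etl01", True, 22, 15, date(2026, 1, 20)),
    WorkloadStatus("rpt01", False, 4, 15, date(2025, 8, 1)),
]
breaches = out_of_contract(snapshot)
overdue = stale_drills(snapshot, today=date(2026, 3, 1))
```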

Governance. Enforce with Azure Policy: “all tier-1 VMs in production must be protected by a Recovery Services vault,” “replica disks must use CMK,” “Backup immutability is enabled.” Tag every workload with its RTO/RPO tier so the right protection level is applied automatically. Maintain a versioned DR runbook in source control alongside the IaC, and review it after every drill.
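Tag-driven protection amounts to a lookup from tier to required controls, followed by a set difference against what is actually applied. The Python sketch below illustrates that check. The tag key dr-tier, the tier names, and the control labels are hypothetical, and in practice the enforcement would live in Azure Policy, not in a script.

```python
# Hypothetical tag key and tier names; the controls are the ones this article uses.
TIER_TAG = "dr-tier"
REQUIRED_PROTECTION: dict[str, set[str]] = {
    "tier1-data": {"sql-always-on", "immutable-backup"},
    "tier1-app": {"asr", "immutable-backup"},
}


def missing_protection(tags: dict[str, str], applied: set[str]) -> set[str]:
    """Return the controls a VM's tier requires but that are not yet applied."""
    tier = tags.get(TIER_TAG)
    if tier not in REQUIRED_PROTECTION:
        # An untagged or mistagged VM is treated as a governance failure.
        raise ValueError(f"missing or unknown {TIER_TAG} tag: {tier!r}")
    return REQUIRED_PROTECTION[tier] - applied


gap = missing_protection({"dr-tier": "tier1-app"}, {"asr"})
compliant = missing_protection(
    {"dr-tier": "tier1-data"}, {"sql-always-on", "immutable-backup"}
)
```

Raising on a missing tag is deliberate in this sketch: a VM with no declared tier should surface as a finding, not silently default to unprotected.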

Reference enterprise example

FinServe Lending Pvt. Ltd. runs the loan-origination platform described above out of Central India, with South India as the paired recovery region. Their estate: 22 production VMs (2 web, 3 app, 2 SQL + witness, 1 DC replica, plus reporting/ETL/file/integration), all zone-distributed in primary.

Their decisions:

ASR protects 19 VMs (everything except the two SQL nodes and the witness). They group the 2 web + 3 app servers into a multi-VM consistency group and a single recovery plan with boot groups: DC (group 1) → app servers (group 2) → web servers (group 3). Recovery-point retention: 48 hours, app-consistent every 4 hours.
SQL uses Always On with two synchronous replicas in Central India (HA across zones) and one asynchronous replica in South India (DR), running on a Reserved-Instance Standard_E8s_v5. Backups go to an immutable Recovery Services vault with cross-region restore.
Traffic Manager in Priority mode: loans.finserve.in → primary endpoint (priority 1, the primary App Gateway), secondary endpoint (priority 2). DNS TTL set to 45 seconds. Health probes hit /healthz on each App Gateway.
A post-failover Azure Automation runbook updates private DNS, validates the South India SQL listener, then pauses at a manual-approval gate before calling the Traffic Manager API to disable the Central India endpoint.

The drill. In their Q1 2026 test failover into an isolated VNet, the recovery plan brought the full app/web stack up and the SQL async replica was promoted; measured RTO was 38 minutes, comfortably under the 1-hour target, and SQL RPO was 6 seconds at the moment of the simulated failure. The web/app tier’s worst-case recovery point was 11 minutes old. Two issues surfaced and were fixed: the App Gateway WAF custom rules hadn’t been replicated to South India (now enforced via Azure Policy), and the post-failover DNS runbook had a hard-coded primary-region IP (now parameterized).

The economics. Running a full hot duplicate of the platform in South India was quoted internally at roughly ₹14–16 lakh/month. The pilot-light DR design — ASR licensing + replication storage + replica disks + the single SQL DR replica + the small standing footprint — lands at about ₹1.7–2.1 lakh/month, roughly 12–13% of the hot-duplicate cost, while still meeting a 1-hour RTO. That cost delta is exactly the conversation that got DR funded.
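The percentage quoted above can be reproduced from the two ranges. The short Python sketch below works in thousands of rupees to keep the arithmetic exact and shows the midpoint ratio alongside the best and worst corners of the ranges; the helper name percent is illustrative.

```python
# Monthly figures from the example above, in thousands of rupees (1 lakh = 100 thousand).
HOT_LOW, HOT_HIGH = 1400, 1600   # full hot duplicate: 14 to 16 lakh
DR_LOW, DR_HIGH = 170, 210       # pilot-light design: 1.7 to 2.1 lakh


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100 * part / whole


midpoint = percent((DR_LOW + DR_HIGH) / 2, (HOT_LOW + HOT_HIGH) / 2)
best_case = percent(DR_LOW, HOT_HIGH)    # cheapest DR against the dearest duplicate
worst_case = percent(DR_HIGH, HOT_LOW)   # dearest DR against the cheapest duplicate
```

The midpoint comes out at about 12.7%, which matches the quoted 12–13%, while the corners of the two ranges span roughly 10.6% to 15%.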

The outcome. Six weeks later a real zonal storage incident in Central India degraded the SQL HA pair. Because the data tier had a healthy async replica in South India and the app tiers had current ASR recovery points, the team executed a partial, controlled failover of the data tier and rode out the incident within RTO. The documented runbook and drill evidence turned a potential audit finding into a demonstration of control maturity.

When to use it

Use this architecture when you have meaningful IaaS workloads you cannot or will not refactor, you have a regulatory or business mandate for tested cross-region recovery, your RTO is measured in tens of minutes to a couple of hours (not seconds), and your finance team will fund DR economics but not a hot duplicate region. It is the canonical answer for lift-and-shifted three-tier line-of-business apps with a SQL/Oracle backend.

Trade-offs to accept. Failover is an operational event, not automatic and instant — recovery-plan boot time plus DNS TTL means tens of minutes, not zero. You carry a small standing cost (SQL replica, DC, replication storage) and the operational burden of actually running drills. ASR’s recovery-point cadence sets a floor on app-tier RPO that is fine for stateless tiers but unacceptable for transactional data — which is precisely why the data tier is split out to Always On.

Anti-patterns.

Treating ASR as backup. ASR’s short retention won’t save you from ransomware discovered a week later — pair it with immutable Azure Backup.
Protecting SQL with ASR alone. Crash-consistent recovery points for a busy transactional database invite data loss and messy recovery; use engine-native replication.
Skipping test failovers. An untested DR plan is the single most common reason real recoveries fail. Quarterly drills are non-negotiable.
A relaxed recovery region. Weaker security/monitoring in the secondary turns your DR target into the easiest attack surface you own.
High DNS TTL. If you only lower TTL during the incident, caches already hold the old value and your cutover stalls — keep it low permanently.

Alternatives and where they fit.

Azure Front Door instead of Traffic Manager when the workload is HTTP(S)-only and you want global anycast L7 failover, caching, and faster cutover than DNS TTL allows.
Availability Zones only (single region, no DR region) when the mandate is HA, not geographic DR, and a region-wide outage is an accepted risk — cheaper, simpler, but no protection against regional loss.
Re-platform to PaaS (App Service / AKS + Azure SQL with active geo-replication and auto-failover groups) when you’re willing to invest in modernization — it generally yields better RTO/RPO and less DR machinery, but it is a migration project, not a DR project, and is out of scope when the requirement is “protect what we have, now.”
Active/active multi-region when even a 30-minute RTO is too long and you can afford to run (and load-balance) two live regions — far costlier and requires stateless app design and multi-write data strategy.

For the very common reality of “we have IaaS we can’t rewrite, an auditor who wants proof, and a CFO who won’t pay for a twin region,” the ASR + Traffic Manager + paired-region + Always On pattern in this article is the pragmatic, provable, and economical answer.

Written by Vinod
