Disaster Recovery Orchestration with Azure Site Recovery and ServiceNow

A regional hospital network running eleven facilities loses its primary Azure region for ninety minutes during a Tuesday-morning storm. The pharmacy system, the bed-management platform, and the clinical-results portal all go dark mid-shift. The on-call engineer who knows the failover runbook is asleep after a night shift; the engineer who is awake has done a real failover exactly zero times. By the time someone finds the right runbook page, confirms with a duty manager that it is genuinely safe to flip, and starts clicking through the Azure portal, forty of those ninety minutes are gone — and nobody can tell the chief medical officer, who is now standing in the room, whether the systems are actually back or just powered on. The board’s after-action question is the one that funds this entire project: “Why did recovering a system we pay to protect depend on which human happened to be awake?”

This article is the reference architecture for taking that human-paced, tribal-knowledge failover and turning it into a governed, one-decision, fully audited orchestration — where a duty manager approves a major incident in ServiceNow, Azure Site Recovery executes the regional failover as a tested recovery plan, automated runbooks rewire DNS and reseed dependencies, and Dynatrace confirms the services are genuinely healthy before anyone declares the incident resolved. The goal is not faster clicking. It is removing the clicking, the guesswork, and the “is it really up?” entirely, while leaving a clean record an auditor and a regulator can both read.

The pressures here are healthcare’s, and they are unforgiving. Regulation (HIPAA, and the contractual SLAs the network signs with each hospital) means every failover is a change that must be approved, logged, and reversible, with an audit trail showing who authorized it and when. Recovery objectives are clinical, not cosmetic: an agreed RTO of 30 minutes and RPO of 5 minutes for the tier-1 clinical systems, because a pharmacy that cannot dispense for an hour is a patient-safety event, not an IT ticket. Blast radius matters: a botched failover that brings up half a system or corrupts data is worse than the outage. And trust matters most — the CMO needs a single, unambiguous “the clinical portal is serving real traffic and passing health checks,” not a wall of green VM icons that mean nothing clinically.

Why not the obvious approaches

Three shortcuts get proposed on every DR project, and each fails in a way worth naming because someone will argue for all three.

A runbook in a wiki, executed by hand. This is where most organizations start and where the hospital network is today. It fails on the two axes that matter under pressure: it depends on the right human being awake and calm, and it has no enforced approval gate, so either someone fails over without authorization or waits too long seeking it. Worse, a hand-run failover is never identical twice — step order drifts, a manual DNS edit gets fat-fingered — so the thing you practiced is not the thing you execute.

A fully automatic failover with no human in the loop. Tempting, and exactly wrong for clinical systems. Automatic regional failover on a health-signal trigger means a transient network blip or a monitoring false-positive can flip a healthy region for no reason, and an unsupervised failover during a partial outage can split-brain your data. For tier-1 healthcare you want the execution automated and the decision human — a duty manager who confirms this is a real declared disaster, not a flapping probe.

Site Recovery alone, driven from the Azure portal. Azure Site Recovery (ASR) is the correct execution engine, but the portal is the wrong cockpit. Driving DR from the portal means the approval, the communication, the incident record, and the post-failover validation all live outside the system of record, reconstructed afterward from memory. The whole point is that the orchestration, the approval, and the proof of health are one connected workflow — which is why ServiceNow, not the portal, is the front door.

The architecture threads the needle: ServiceNow owns the decision and the record, Azure Site Recovery owns the execution, automation runbooks own the wiring, and Dynatrace owns the proof. Each tool does the one thing it is best at, and they hand off to each other through APIs, not through humans copying values between consoles.

Architecture overview

[Figure: architecture of disaster recovery orchestration with Azure Site Recovery and ServiceNow]

The platform runs two regions and three planes that are worth keeping distinct in your head: a replication plane that continuously copies state from primary to secondary (steady-state, always running), a control plane that decides and orchestrates a failover (event-driven, rarely active), and a validation plane that proves the recovered service is genuinely serving (active immediately after failover and during every drill).

The defining property of the whole topology is the one the board cares about: no failover happens without an approved ServiceNow record, and no failover is declared done until Dynatrace confirms real service health. The decision is gated by a human; everything between the decision and the proof is automated and identical every time.

Replication plane (steady state, continuous):

The primary region — say Central India — runs the clinical workloads: a mix of Azure VMs (legacy pharmacy and bed-management appliances that cannot be re-platformed) and the clinical-results portal. Azure Site Recovery continuously replicates each protected VM to the secondary region — South India — keeping a crash-consistent recovery point every few minutes and an app-consistent point on a schedule, which is what makes the 5-minute RPO real. Data tiers replicate through their own native mechanisms rather than ASR: Azure SQL uses an auto-failover group, and any geo-redundant storage replicates underneath. ASR groups the VMs into a recovery plan with an explicit boot order — domain controllers and identity first, then databases, then app servers, then the web tier — with pre- and post-actions hooked to automation runbooks at each step.
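The boot order described above can be pictured as a small grouping step. The sketch below is illustrative Python rather than ASR's own API: the VM names and the `TIER_ORDER` list are hypothetical, and in practice the groups are defined on the recovery plan itself.

```python
from collections import defaultdict

# Boot tiers in the order the recovery plan starts them (taken from the text).
TIER_ORDER = ["identity", "database", "app", "web"]


def boot_groups(vms):
    """Group (vm_name, tier) pairs into ordered boot groups.

    Raises ValueError for an unknown tier so a mislabeled VM is caught
    at planning time rather than in the middle of a failover.
    """
    by_tier = defaultdict(list)
    for name, tier in vms:
        if tier not in TIER_ORDER:
            raise ValueError(f"unknown tier {tier!r} for VM {name!r}")
        by_tier[tier].append(name)
    # Emit only tiers that actually contain VMs, preserving TIER_ORDER.
    return [(tier, sorted(by_tier[tier])) for tier in TIER_ORDER if by_tier[tier]]


# Hypothetical inventory, for illustration only.
inventory = [
    ("web-01", "web"),
    ("pharm-app-01", "app"),
    ("dc-01", "identity"),
    ("bed-app-01", "app"),
    ("sql-vm-01", "database"),
]
plan = boot_groups(inventory)
```

Rejecting an unknown tier up front is a deliberate choice here: a VM that silently falls outside the boot order is exactly the kind of gap a drill should not be the first to find.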

Control plane (the decision and the orchestration):

  1. An outage is detected. Dynatrace Davis flags that the clinical portal’s synthetic checks from outside the region are failing while the region itself is unreachable, and raises a problem. A CrowdStrike Falcon detection or an Azure Service Health event can be the trigger just as easily — the point is that a signal, not a person stumbling onto it, opens the incident.
  2. The signal auto-creates a ServiceNow major incident (via the Dynatrace–ServiceNow integration), pre-populated with the affected services, the CMDB CIs, and the impacted facilities. ServiceNow’s CMDB is what maps “the clinical portal is down” to “these are the protected VMs, this is the recovery plan, this is the business service and its SLA.”
  3. The major-incident workflow invokes the emergency change process. This is the human gate: an authorized duty manager (and, for a full regional failover, a second approver from the CAB on-call roster) reviews the auto-assembled change — scope, recovery plan, expected RTO, rollback plan — and approves. ServiceNow enforces who is allowed to approve and records the timestamp, the approver, and the justification. This single approval is the audit artifact the regulator will ask for.
  4. On approval, a ServiceNow workflow / Flow Designer action calls Azure to start the failover. It does not hold Azure credentials itself — it triggers an Azure Automation runbook (or an Azure Function) through a ServiceNow MID Server or a scoped webhook, and that runbook authenticates to Azure with a managed identity and pulls any residual secrets (the ServiceNow callback token, third-party DNS API keys) from HashiCorp Vault. No standing Azure credential ever lives in the ITSM tool.
  5. The runbook invokes Start-AzRecoveryServicesAsrUnplannedFailoverJob on the ASR recovery plan, targeting the latest app-consistent recovery point (or the latest processed point, per the runbook’s policy for the RPO target). ASR brings the VMs up in the secondary region in the defined boot order.
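The human gate in step 3 reduces to a rule that is easy to state in code. The following is a minimal sketch, assuming approvals arrive as (approver, role) pairs; the role strings, the `scope` values, and the `EmergencyChange` type are hypothetical and do not reflect ServiceNow's data model.

```python
from dataclasses import dataclass, field


@dataclass
class EmergencyChange:
    scope: str                                      # "regional" or "service" (assumed values)
    approvals: list = field(default_factory=list)   # (approver, role) tuples


def may_start_failover(change):
    """Return True only when the approvals satisfy the gate.

    Any failover needs a duty manager. A regional failover also needs a
    CAB on-call approver who is a different person from that duty manager.
    """
    duty_managers = {who for who, role in change.approvals if role == "duty_manager"}
    cab_on_call = {who for who, role in change.approvals if role == "cab_on_call"}
    if not duty_managers:
        return False
    if change.scope == "regional":
        return any(dm != cab for dm in duty_managers for cab in cab_on_call)
    return True
```

Requiring two distinct people for the regional case is an interpretation of "a second approver"; if the roster allows one person to hold both roles, the rule would need to say so explicitly.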

Wiring runbooks (the part that turns “VMs are up” into “service is up”):

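ASR’s recovery plan fires automation runbooks as post-failover actions at the right points in the boot sequence, which is the difference between machines that are powered on and a service that actually works. Identity confirmation runs as a post-action on the domain-controller step, connection-string re-pointing on the app-tier step, and DNS / Traffic Manager updates on the web-tier step.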

Validation plane (the proof, before anyone says “resolved”):

  1. With the service ostensibly up, the orchestration runs post-failover validation checks — not VM power state, but real service health. Automated smoke tests hit the clinical portal’s health endpoint, perform a synthetic login through the SSO path, place a canary order in a non-production patient record, and confirm the database is serving reads and writes. Dynatrace synthetic monitors and the application’s real-user monitoring confirm the portal is returning 200s with acceptable latency from the secondary region, and Davis confirms the problem is clearing.
  2. Only when validation passes does the runbook call back into ServiceNow to advance the incident — attaching the validation evidence (the smoke-test results, the Dynatrace health snapshot) to the record. The duty manager sees “clinical portal: 200 OK, SSO passing, DB read/write confirmed, p95 latency nominal,” not a guess. If validation fails, the orchestration does not silently declare success — it flags the incident for human attention and surfaces the failing check, because a half-up clinical system is the failure mode that hurts patients.
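The "no silent success" rule above can be sketched as a small aggregation function. The check names mirror the smoke tests listed in the text; the function name, the result shape, and the status strings are assumptions made for illustration.

```python
# The four service-level checks named in the text.
REQUIRED_CHECKS = ("health_endpoint", "sso_login", "canary_order", "db_read_write")


def evaluate_validation(results):
    """Turn raw check results into an incident update.

    results maps check name -> bool. A check that is missing counts as
    failing, so an incomplete run can never be reported as healthy.
    """
    failing = [check for check in REQUIRED_CHECKS if results.get(check) is not True]
    return {
        "status": "validated" if not failing else "needs_attention",
        "failing": failing,
        "evidence": dict(results),  # attached to the incident record either way
    }
```

Treating a missing result as a failure is the important design choice: a validation suite that crashed halfway looks the same to the duty manager as one that found a broken SSO path.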

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Replication engine | Azure Site Recovery | Continuous VM replication; recovery plan with ordered failover | Crash-consistent every 5 min; app-consistent hourly; multi-VM consistency group for the DB+app tier |
| Data-tier DR | Azure SQL auto-failover group | Database replication and listener failover, independent of ASR | Auto-failover group; grace period tuned so a blip doesn’t flip the DB |
| Decision & record | ServiceNow (ITSM + CMDB) | Major incident, emergency change, approval gate, audit trail | CMDB maps service → CIs → recovery plan; CAB on-call approval; Flow Designer trigger |
| Orchestration glue | Azure Automation runbooks / Functions | Execute ASR failover, wire DNS, validate, call back | Managed identity to Azure; Vault for residual secrets; idempotent and re-runnable |
| Edge / traffic | Akamai + Azure Traffic Manager / Front Door | Global anycast, TLS, WAF; origin steering primary→secondary | Health-checked dual origins; automatic failover; DNS TTL low enough for fast cutover |
| Identity / SSO | Microsoft Entra ID + Okta | Clinician SSO must work from the recovered region | Okta federated to Entra; conditional access; DC/identity boots first in the recovery plan |
| Secrets | HashiCorp Vault | DNS API keys, ServiceNow callback token, signing keys | Entra auth method; dynamic short-lived leases; never stored in ServiceNow or a runbook variable |
| Detection & health | Dynatrace | Outage detection, post-failover service-health proof | Synthetic checks from outside the region; Davis problem detection; RUM on the portal |
| Endpoint/runtime security | CrowdStrike Falcon | Runtime protection; secondary region is not a security gap | Sensor on both regions’ VMs; detections piped to the SOC and can open a ServiceNow incident |
| Cloud posture | Wiz / Wiz Code | DR region config parity; no drift, no public exposure in secondary | Agentless scan of both regions; alert if the secondary drifts from the IaC baseline |
| CI / IaC | GitHub Actions + Terraform / Ansible | Both regions provisioned from one codebase; drill automation in CI | OIDC to Azure (no stored creds); Ansible configures the failed-over app tier; scheduled drill workflow |

A few of these choices deserve the why, because they are the ones teams get wrong.

Why ServiceNow holds the decision, not the execution. It is tempting to give ServiceNow an Azure service-principal secret and let it call ARM directly. Do not — that puts a standing, highly-privileged Azure credential inside the ITSM platform’s integration store, exactly the kind of secret that leaks. Instead ServiceNow’s only job is to decide and trigger: it fires a webhook or MID-Server action to an Azure Automation runbook, and that runbook — running inside Azure with a managed identity — holds the privilege to actually start the failover. ServiceNow owns the approval and the record; Azure owns the credential and the action. The two never share a secret.

Why validation is service-level, not VM-level. ASR will happily report “failover succeeded” the moment the VMs boot, and that is the single most dangerous false signal in DR. A VM can be running while the app pool is dead, the database listener is pointing at the wrong region, or SSO is broken because identity came up after the app. The validation plane exists precisely to convert “powered on” into “clinicians can log in and dispense” — synthetic login, canary transaction, DB read/write, Dynatrace 200s from outside. The incident is not “resolved” on VM state; it is resolved on proven service health.

Why the data tier fails over on its own path. Replicating a busy SQL database through ASR’s VM replication is the wrong tool — you want transactionally consistent, database-native replication. So the VMs replicate via ASR while Azure SQL uses an auto-failover group with its own listener, and the app-tier runbook simply re-points connection strings at the listener, which already follows the database to the secondary. Two replication mechanisms, each correct for its data, coordinated by the recovery plan’s ordering.

Implementation guidance

Provision both regions from one Terraform codebase, and treat parity as the deliverable. The most common DR failure is not the failover mechanism — it is that the secondary region quietly drifted from the primary (a missing NSG rule, an un-replicated app setting, a smaller VM SKU) and nobody noticed until the real failover exposed it. Define both regions in the same Terraform, parameterized by region, so the secondary is a near-mirror by construction. Use Ansible to configure the failed-over application tier on boot (connection strings, feature flags, cache warm-up) so the recovered app is configured identically every time, not hand-tuned. Run Wiz continuously against both regions to alert on any drift from the IaC baseline — the independent check that parity is real.
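To make "parity as the deliverable" concrete, a drift check can be as simple as a keyed diff of the two regions' settings. This is a hedged sketch of the idea only: the setting names and values are invented for illustration, and it does not represent how Wiz or Terraform compute drift.

```python
def parity_drift(primary, secondary, ignore=("location",)):
    """Return {setting: (primary_value, secondary_value)} for every mismatch.

    Keys listed in `ignore` are expected to differ between regions
    (the region name itself, for example) and are skipped.
    """
    keys = (set(primary) | set(secondary)) - set(ignore)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in sorted(keys)
        if primary.get(key) != secondary.get(key)
    }


# Hypothetical settings showing two of the drift cases the text names:
# a smaller VM SKU and a missing NSG rule in the secondary region.
primary = {
    "location": "centralindia",
    "vm_sku": "Standard_D8s_v5",
    "nsg_rules": ("allow-443", "allow-1433"),
}
secondary = {
    "location": "southindia",
    "vm_sku": "Standard_D4s_v5",
    "nsg_rules": ("allow-443",),
}
drift = parity_drift(primary, secondary)
```

An empty result is the only acceptable steady state; anything else is a finding to fix in the shared codebase, not by hand in the secondary region.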

A minimal shape for the ASR replication and recovery plan communicates the intent:

resource "azurerm_site_recovery_replication_policy" "clinical" {
  name                                                 = "rp-clinical-5min"
  recovery_vault_name                                  = azurerm_recovery_services_vault.dr.name
  resource_group_name                                  = azurerm_resource_group.dr.name
  recovery_point_retention_in_minutes                  = 24 * 60   # 24h of points
  application_consistent_snapshot_frequency_in_minutes = 60        # app-consistent hourly
}
# Boot order (DCs → DB → app → web) and per-step runbook hooks are
# defined on the recovery plan so failover is identical every drill.

The PowerShell the orchestration runbook runs is deliberately small and idempotent — start the plan, capture the job, hand the job id back to ServiceNow:

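Connect-AzAccount -Identity | Out-Null                   # managed identity, no stored secret
# ASR cmdlets need a vault context first. The vault and resource group names are placeholders.
$vault = Get-AzRecoveryServicesVault -Name "rsv-dr" -ResourceGroupName "rg-dr"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault | Out-Null
$plan  = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-clinical-tier1"
# Unplanned failover is the disaster path: the primary region is assumed unreachable.
$job   = Start-AzRecoveryServicesAsrUnplannedFailoverJob `
            -RecoveryPlan $plan `
            -Direction PrimaryToRecovery `
            -RecoveryTag Latest   # or LatestAvailableApplicationConsistent; confirm tag semantics for your Az.RecoveryServices version
# $job.ID is written back to the ServiceNow incident as the execution handle.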

Identity: federate the humans, kill the standing keys. Clinician SSO is Okta federated to Microsoft Entra ID — Okta is the workforce IdP, brokered to Entra so Azure resources see a first-class token — and because identity must work from the recovered region, the recovery plan boots domain controllers and confirms the Entra/Okta endpoints before the app tier comes up. The orchestration runbook authenticates to Azure with a managed identity, and the handful of secrets that are not managed identities (the third-party DNS provider’s API key, the ServiceNow callback token) live in HashiCorp Vault, leased short-lived, never written into a ServiceNow integration record or an Automation account variable.

Wire the runbooks to the recovery plan’s steps, not after it. The single biggest authoring mistake is running all the wiring at the end. Hook DNS/Traffic Manager updates as a post-action on the web-tier step, identity confirmation as a post-action on the DC step, and connection-string re-pointing as a post-action on the app-tier step — so each layer is wired the moment it is up, in order, and a failure in one step is visible against that step rather than buried in a final catch-all.
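The value of hooking each runbook to its own step is easiest to see in a toy runner. The sketch below is not how ASR executes a plan; it only models the behavior argued for above, with hypothetical step names and placeholder actions that return True on success.

```python
def run_plan(steps):
    """Run (step_name, post_action) pairs in order, stopping at the first failure.

    Each post_action is a zero-argument callable returning True on success.
    The result names the failing step so the error is reported against
    that step instead of surfacing in a final catch-all.
    """
    completed = []
    for name, post_action in steps:
        if not post_action():
            return {"ok": False, "failed_step": name, "completed": completed}
        completed.append(name)
    return {"ok": True, "failed_step": None, "completed": completed}


# Placeholder actions standing in for the real runbooks.
steps = [
    ("domain-controllers", lambda: True),   # confirm identity endpoints
    ("app-tier", lambda: True),             # re-point connection strings
    ("web-tier", lambda: True),             # update DNS / Traffic Manager
]
result = run_plan(steps)
```

Stopping at the first failed step also means later layers are never wired on top of a broken one, which keeps a partial failover easier to reason about.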

Enterprise considerations

Security & Zero Trust. DR is a security event, not just an availability one, and the secondary region is a favorite blind spot. Apply the same posture there as in primary: CrowdStrike Falcon sensors on both regions’ VMs so a failover does not move workloads onto unmonitored compute, with Falcon detections able to open a ServiceNow incident the same way Dynatrace does. Wiz / Wiz Code scans both regions continuously and alerts if the secondary drifts from baseline or exposes anything publicly — a real risk when a rarely-exercised region accumulates quiet misconfiguration. Access to trigger a failover is least-privileged and gated: only the ServiceNow emergency-change approvers can authorize, the runbook’s managed identity holds exactly the ASR and DNS roles it needs and nothing more, and Vault keeps the residual secrets short-lived. Azure Policy denies a DR resource created with public network access, and Wiz independently verifies the policy holds.

Cost optimization. DR’s cost question is “what are we paying to keep a region we hope never to use?” Engineer the answer.

| Lever | Mechanism | Typical effect |
|---|---|---|
| ASR over warm standby | Replicate state, don’t run duplicate VMs in secondary | Pay storage + license, not double compute, until failover |
| Right-sized recovery SKUs | Define target VM sizes; spin full size only on failover | Avoids paying primary-grade compute 24×7 in DR |
| Tiered RPO | 5-min RPO for tier-1 only; longer for tier-2/3 | Replication cost scales with criticality, not blanket |
| Reserved capacity in secondary | Reserve the compute you’d burst to, if RTO is tight | Trades a little spend for guaranteed capacity at failover |
| Drills in CI off-hours | Automated test failover on a schedule, torn down after | Proves DR without standing duplicate environments |

ASR’s model — replicate continuously, but only run the secondary VMs during an actual or test failover — is what keeps DR affordable: you pay for replication storage and licensing, not for a full duplicate of production sitting idle. Meter the DR spend and surface it in Dynatrace so the CFO sees the cost of the protection against the SLA it buys.

Scalability. The orchestration scales with the number of protected services, not request volume. Group workloads into recovery plans by business service and tier so a single facility’s failure can fail over just its affected services, and a regional disaster fails over tier-1 first (the 30-minute RTO clock) then tier-2/3. ASR’s multi-VM consistency groups keep app+DB tiers crash-consistent together. As the estate grows, the ServiceNow CMDB is the thing that keeps the mapping (service → CIs → recovery plan) coherent — without it, you cannot answer “what do I fail over?” at 3 a.m.
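The service → CIs → recovery plan mapping, and the tier-1-first ordering, can be sketched as a lookup plus a sort. The CMDB here is a plain dictionary standing in for ServiceNow; apart from `rp-clinical-tier1`, every service and plan name is hypothetical.

```python
def failover_order(affected_services, cmdb):
    """Return the recovery plans to run, most critical tier first.

    A service missing from the CMDB raises KeyError on purpose: an
    unmapped service is a gap to fix, not something to skip quietly.
    """
    plans = {}
    for service in affected_services:
        entry = cmdb[service]
        plan = entry["recovery_plan"]
        # A plan shared by several services takes its most critical tier.
        plans[plan] = min(plans.get(plan, entry["tier"]), entry["tier"])
    return [plan for plan, _ in sorted(plans.items(), key=lambda kv: (kv[1], kv[0]))]


# Illustrative stand-in for the CMDB mapping.
cmdb = {
    "pharmacy": {"tier": 1, "recovery_plan": "rp-clinical-tier1"},
    "clinical-results-portal": {"tier": 1, "recovery_plan": "rp-clinical-tier1"},
    "staff-rostering": {"tier": 2, "recovery_plan": "rp-admin-tier2"},
    "reporting": {"tier": 3, "recovery_plan": "rp-reporting-tier3"},
}
order = failover_order(
    ["reporting", "pharmacy", "staff-rostering", "clinical-results-portal"], cmdb
)
```

Passing only one facility's affected services yields only the plans those services need, which is the partial-failover case described above.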

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Set the numbers per tier and prove them, repeatedly. For this network: RTO 30 minutes, RPO 5 minutes for tier-1 clinical systems (pharmacy, bed management, results portal), with tier-2 at RTO 4 hours and tier-3 next-business-day. The numbers are meaningless without scheduled, automated test failovers — ASR’s test failover spins the recovery plan into an isolated network without touching production, and running it monthly from a GitHub Actions workflow (provision the isolated bubble, fail over, run the same validation suite, tear down) is what turns a paper RTO into a proven one. The hospital network’s hard-won rule: a recovery plan that has not passed a test failover in the last quarter is assumed broken.
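The "assumed broken" rule is simple enough to automate as a scheduled check. This sketch assumes a quarter means 90 days and that the date of the last passing test failover is recorded per plan; both are illustrative assumptions, not details from the network's actual policy.

```python
from datetime import date, timedelta

QUARTER = timedelta(days=90)  # assumption: "last quarter" read as 90 days


def assumed_broken(last_passed, today):
    """Return True when a plan has no passing test failover within QUARTER.

    last_passed is the date of the most recent passing drill, or None
    if the plan has never passed one.
    """
    return last_passed is None or (today - last_passed) > QUARTER
```

For example, `assumed_broken(date(2024, 10, 1), date(2025, 3, 1))` is True because 151 days have elapsed, while a plan that last passed on 10 January 2025 is still trusted on the same day.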

Observability. Instrument the whole orchestration in Dynatrace, not just the apps: the time from detection → incident → approval → failover-start → service-validated, broken into segments, so the team can see where the 30-minute budget is actually spent (it is almost always the approval wait, which is a process fix, not a tech one). Emit the metrics the business cares about — measured RTO and RPO per drill and per real event, validation pass-rate, replication health, and time-to-approval — and keep them on a dashboard the duty managers and the CMO both read. Every failover and every drill writes its evidence back to the ServiceNow record, so the audit trail is a byproduct of the work, not a thing reconstructed later.
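Breaking the timeline into segments is plain timestamp arithmetic. The sketch below assumes the RTO clock starts at detection and stops at service validation, which is one reasonable reading of the text; the milestone keys and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

MILESTONES = ["detection", "incident", "approval", "failover_start", "service_validated"]
RTO_BUDGET = timedelta(minutes=30)  # tier-1 RTO from the text


def segments(timestamps):
    """Return the duration of each consecutive milestone-to-milestone segment."""
    return {
        f"{start}->{end}": timestamps[end] - timestamps[start]
        for start, end in zip(MILESTONES, MILESTONES[1:])
    }


def within_rto(timestamps, budget=RTO_BUDGET):
    """True when detection-to-validated fits inside the budget."""
    return timestamps["service_validated"] - timestamps["detection"] <= budget


# Hypothetical drill timeline.
drill = {
    "detection": datetime(2025, 1, 14, 9, 0),
    "incident": datetime(2025, 1, 14, 9, 1),
    "approval": datetime(2025, 1, 14, 9, 14),
    "failover_start": datetime(2025, 1, 14, 9, 15),
    "service_validated": datetime(2025, 1, 14, 9, 27),
}
segs = segments(drill)
slowest = max(segs, key=segs.get)
```

In this invented timeline the slowest segment is the wait for approval, matching the pattern the text describes, and the whole run still fits the 30-minute budget.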

Governance. Pin the recovery plans and the orchestration runbooks in version control, reviewed and revertible, so DR behavior does not drift. Run a post-incident review as a ServiceNow problem record after every real failover, feeding fixes back into the recovery plan. Apply Azure Policy to require diagnostic settings and deny public exposure on DR resources, with Wiz as the independent verifier. And treat the drill as a first-class deliverable with the same rigor as the real thing — a DR plan nobody exercises is a document, not a capability.

Explicit tradeoffs

Accept these or do not build it. Orchestrated DR adds real moving parts — a ServiceNow workflow, automation runbooks, a validation suite — and each is a thing to maintain and test. The human approval gate that protects you from split-brain and unauthorized failover also adds the one delay you cannot automate away: the minutes between the incident opening and a duty manager approving. That is a deliberate trade — supervised execution over blind speed — and for clinical systems it is the right one, but it means your RTO budget must include a human, so the process of reaching and equipping that approver (the auto-assembled change with scope, plan, and rollback already filled in) matters as much as the technology. ASR’s replicate-don’t-duplicate model keeps cost down but means the secondary VMs are cold until failover, so first-failover boot time is part of your RTO and must be measured, not assumed. And the validation plane that prevents the “powered on but dead” disaster is itself code that must be kept honest — a stale smoke test that always passes is worse than none.

The alternatives, and when they win. If your RTO is measured in seconds and you can afford it, an active-active multi-region design (both regions serving live, no failover event at all) beats orchestrated failover — but it costs roughly double and demands every workload be region-agnostic, which the network’s legacy pharmacy appliances are not. If your systems are stateless and re-deployable in minutes, redeploy-from-IaC to a second region (Terraform + GitHub Actions, no replication) is simpler and cheaper than ASR — but it does not meet a 5-minute RPO for a stateful clinical database. And if you are a small team with a single non-critical app, a manual runbook may genuinely be enough — the full ServiceNow-gated, Dynatrace-validated orchestration here earns its complexity precisely when failover must be authorized, audited, identical every time, and proven healthy before a clinician trusts it.

The shape of the win

For the hospital network, the payoff is not “faster failover.” It is that when the next storm takes the primary region, a duty manager — any duty manager on the roster, not the one engineer who knew the runbook — opens the auto-created ServiceNow major incident, reads the pre-filled change with its scope and rollback, and approves with one decision. Azure Site Recovery executes the tested recovery plan in the exact order it was drilled; runbooks rewire DNS, Akamai steers traffic, and SSO works because identity booted first; and minutes later the duty manager sees, attached to the incident, “clinical portal: 200 OK, synthetic login passing, pharmacy DB read/write confirmed, p95 latency nominal” — so when the CMO asks “are we back?”, the answer is proven, not hoped. That sentence — a clinician can log in and dispense, and we can show the regulator exactly who approved it and when — is the one that funds the platform. Everything upstream (the continuous replication, the CMDB mapping, the approval gate, the wiring runbooks, the Dynatrace validation, the Wiz parity check, the monthly automated drills) exists so that recovering a system the network pays to protect never again depends on which human happened to be awake.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
