Multi-Cloud Disaster Recovery: AWS Primary with Azure Pilot-Light Standby

A national health-insurance payer runs its claims-adjudication and member-portal platform entirely on AWS in us-east-1, and the board has just made the risk personal. A regional power and network event the prior winter took the platform dark for nine hours; the regulator’s follow-up letter cited the state-mandated 24/7 availability requirement for member services, and the actuarial team put a number on a single lost adjudication day that made the CFO stop arguing about cloud bills. The mandate that came down is precise and uncomfortable: survive the loss of an entire cloud provider’s region — not just an Availability Zone — with a recovery time the regulator will accept, and do it without doubling the annual infrastructure spend. Multi-region within AWS was the easy proposal; the board rejected it because the failure they actually feared last winter was a control-plane and account-level event that a second AWS region would not have escaped. The decision: keep AWS as primary, stand up a pilot-light disaster-recovery standby on Microsoft Azure, and prove the failover with a runbook the audit committee can watch execute. This article is the reference architecture for building exactly that.

The pressures stack the way they always do in regulated operations. Regulation means a documented, tested, regularly-rehearsed DR capability with evidence — not a slide deck claiming one exists. Provider independence means the standby cannot share AWS’s control plane, IAM, or regional fate, which is the whole reason it lives on Azure. Cost means the standby must be cheap while idle — you are paying for insurance, not a second production estate — which is what “pilot light” buys you. And recovery objectives mean two hard numbers everyone signs in blood: RTO, how long until members can file claims again, and RPO, how much in-flight data you can afford to lose. For this payer the targets are RTO 60 minutes, RPO 5 minutes for the core adjudication path. Pilot light is the DR pattern that hits those numbers without idle-running a full Azure copy of production.

Why pilot light, not the alternatives

The DR strategy menu has four rungs, and naming why the others lose here matters because someone on the steering committee will champion each one.

Backup-and-restore (keep snapshots in the other cloud, rebuild on demand) is the cheapest and has the worst RTO. Restoring databases, re-provisioning everything, and rehydrating from object storage is a many-hours exercise, and many-hours fails a 60-minute mandate outright. Warm standby, a scaled-down but fully running copy on Azure taking a share of live traffic, hits a much better RTO than pilot light but runs a second always-on estate, which is the cost the board explicitly refused. Multi-site active/active across both clouds is the gold standard for availability and the nightmare for cost, data consistency, and operational complexity; nobody adjudicating insurance claims wants two-cloud write conflicts.

Pilot light sits deliberately in the middle. The data tier on Azure is always on and continuously fed — a database replica receiving log shipping, an object store receiving live sync — because data is the one thing you cannot conjure at failover time. Everything else — the compute, the application services, the scaled-out capacity — exists only as Terraform code and pre-built images, switched off or scaled to zero, ready to ignite. The “pilot light” is that always-burning data flame; failover means turning on the gas. You pay continuously only for replicated storage and a small replica database, and you pay for full compute only during an actual disaster or a rehearsal. That economic shape — cheap while idle, fast to ignite — is precisely what a 60-minute RTO under a no-blank-cheque budget demands.

Architecture overview

[Figure: architecture of the AWS primary with Azure pilot-light standby]

The architecture runs as two clouds in distinct states that share one global front door and two continuous replication streams. Holding the states clearly in your head is the first step to operating this well: AWS is hot and serving 100% of traffic; Azure is a warm data tier wrapped in cold, code-defined compute. The only things crossing the cloud boundary during normal operation are the two replication streams and the telemetry the observability stack collects from both sides.

The defining property of the whole topology is the one the board cares about most: the standby shares no control plane, no IAM, and no regional fate with the primary. Azure has its own identity, its own Terraform state, its own secrets store, its own network. An AWS account compromise, an IAM misconfiguration, or a us-east-1 control-plane event cannot reach into Azure and take the lifeboat down with the ship. That independence is the entire point, and it is also what makes the DR posture defensible to a regulator who has seen single-vendor outages cascade.

Primary path (AWS, steady state), following the request flow:

A member opens the portal. Akamai sits at the global edge providing TLS termination, anycast routing, WAF, and bot mitigation, and — critically for this design — it is the failover control point: its origin configuration and health checks decide which cloud receives traffic, so cutover is a DNS/origin decision made at the edge, not a frantic manual repoint.
Identity federates through Okta, the IdP for both members and the workforce, including the staff-facing adjudication console. Microsoft Entra ID is federated in so the Azure standby resources trust the same identity authority the AWS side uses. One identity plane spanning both clouds means an adjudicator’s access works identically after failover, with no second login system to fail independently.
Traffic reaches the AWS application tier: EKS running the adjudication services and member-portal API behind an ALB, autoscaled for the daily claims peak. Application secrets — database credentials, third-party clearinghouse API tokens, signing keys — come from HashiCorp Vault, which runs as the cross-cloud secrets authority precisely so the same secret material is reachable from either cloud at failover without copying long-lived keys into two places.
State lives in Amazon RDS (PostgreSQL) for adjudication and member data, with Amazon S3 holding documents — scanned claims, EOBs, member correspondence — as the durable object store.

Replication streams (always on, the pilot light):

Database log shipping: RDS streams its write-ahead log to a continuously-recovering Azure Database for PostgreSQL replica. The replica applies the WAL as it arrives and stays seconds behind primary, which is what delivers the 5-minute RPO. This small replica is the single always-running compute cost on the Azure side.
Object-store sync: S3 replicates documents to Azure Blob Storage continuously. New claim documents written on AWS land in Azure within seconds, so the standby’s document store is never more than moments stale.
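The relationship between replica lag and the 5-minute RPO can be sketched in a few lines. This is a minimal illustration, not the payer's monitoring code: the function names, the two timestamps passed in, and the 20% alert fraction are all assumptions made for the example. In practice the timestamps would come from whatever the replication stream exposes on each side.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=5)   # target stated in the article
ALERT_FRACTION = 0.2         # assumption: warn once 20% of the RPO budget is used


def replication_lag(primary_commit_at: datetime, replica_replay_at: datetime) -> timedelta:
    """How far the replica's last replayed commit trails the primary's last commit."""
    lag = primary_commit_at - replica_replay_at
    # Clock skew between clouds can make the difference slightly negative.
    return max(lag, timedelta(0))


def classify_lag(lag: timedelta, rpo: timedelta = RPO,
                 alert_fraction: float = ALERT_FRACTION) -> str:
    """Return 'ok', 'warn' (past the alert line) or 'breach' (lag exceeds the RPO)."""
    if lag > rpo:
        return "breach"
    if lag > rpo * alert_fraction:
        return "warn"
    return "ok"


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(classify_lag(replication_lag(now, now - timedelta(seconds=4))))   # ok
print(classify_lag(replication_lag(now, now - timedelta(seconds=90))))  # warn
print(classify_lag(replication_lag(now, now - timedelta(minutes=7))))   # breach
```

Alerting well below the RPO is the design choice that matters here: a "warn" at one minute of lag leaves time to fix the stream before a failover would lose more than the signed-for five minutes.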

Standby path (Azure, cold compute), pre-built and waiting: the entire Azure application tier — AKS node pools, the application deployments, the ingress, the front-door wiring — exists as Terraform code with desired_count/min_count at zero and pre-baked container images already in Azure Container Registry. Nothing serves traffic. The PostgreSQL replica and the Blob store burn as the pilot light; everything above them is dark until the runbook ignites it.

Component breakdown

Component	AWS (primary)	Azure (pilot-light standby)	Role in DR
Global edge / failover control	Akamai	Akamai (same config)	Health-checked origin failover between clouds
Identity / SSO	Okta → Entra ID	Okta → Entra ID	One identity plane; access survives cutover
Compute	EKS (hot, autoscaled)	AKS (scaled to zero)	Cold compute ignited from Terraform + ACR images
Application secrets	HashiCorp Vault	HashiCorp Vault (reachable)	Same secret material both clouds, no copied keys
Relational state	RDS PostgreSQL (primary)	Azure DB for PostgreSQL (recovering replica)	Log shipping → seconds-behind replica = low RPO
Object store	Amazon S3	Azure Blob Storage	Continuous sync → near-live document copy
Image registry	Amazon ECR	Azure Container Registry	Pre-baked images so ignition needs no build
Infrastructure definition	Terraform	Terraform (same modules)	Standby is code; full estate buildable with terraform apply
Config / ignition automation	—	Ansible	Post-apply wiring: scale up, promote DB, run smoke tests
Posture / IaC scanning	Wiz + Wiz Code	Wiz + Wiz Code	Drift, exposure, and pre-merge Terraform misconfig checks
Runtime security	CrowdStrike Falcon	CrowdStrike Falcon	Sensors on both clouds’ nodes; one SOC view
Observability	Datadog / Dynatrace	Datadog / Dynatrace	Replication-lag SLOs; failover dashboards span both
ITSM / runbook control	ServiceNow	ServiceNow	Declares the disaster, drives runbook, records evidence
CI/CD	Jenkins / GitHub Actions / Argo CD	Argo CD (standby cluster)	Same artifacts to both; GitOps re-converges Azure

A few of these choices deserve the why, because they are the ones teams get wrong.

Why log shipping for the database, not a managed cross-cloud HA feature. There is no native managed primary/replica that spans RDS and Azure PostgreSQL, because the clouds do not federate their database control planes. So the bridge is PostgreSQL’s own physical replication: RDS ships WAL, the Azure side runs in continuous-recovery mode applying it. This is portable, well-understood, and the replica is genuinely usable read-only while it recovers, which matters for testing. One caveat to verify before committing: managed services often restrict physical streaming to targets outside their own platform, so confirm what RDS and Azure Database for PostgreSQL actually expose; where WAL shipping is unavailable, PostgreSQL logical replication is the usual substitute, and everything said here about lag still applies. The price is that promotion is one-way (once you promote Azure to primary, you have committed to a failback re-seed) and that you must monitor replication lag as a first-class SLO, because silent lag growth quietly destroys your RPO. That lag is the single most important number on the DR dashboard.

Why secrets live in one Vault, not duplicated per cloud. Copying long-lived credentials into both AWS Secrets Manager and Azure Key Vault doubles your secret-sprawl and your rotation problem, and at failover you discover the Azure copy went stale months ago. Instead HashiCorp Vault is the single secrets authority both clouds authenticate to — AWS workloads via the AWS auth method, Azure workloads via the Entra/Azure auth method — leasing short-lived dynamic credentials. The same database role, the same clearinghouse token, reachable identically from either side, with nothing long-lived sitting in two places to drift apart.

Why the standby is only Terraform, and what Ansible adds. The discipline that makes pilot light work is that the Azure compute estate has no hand-built drift: it is built from the same Terraform module set that defines the AWS estate, parameterized for Azure, so a terraform apply brings the full application tier into existence reproducibly. Ansible then handles the imperative ignition sequence that Terraform is poor at: scaling node pools up, promoting the PostgreSQL replica, repointing the app’s database DSN, warming caches, and running smoke tests in order. Terraform declares the standby; Ansible lights it.

The failover runbook

A DR design that cannot be executed under stress is theatre. The runbook is the deliverable the audit committee actually cares about, and it must be rehearsed until it is boring. The sequence, owned and timestamped in ServiceNow so every step produces evidence:

Declare. A major-incident record is opened in ServiceNow; the DR commander role is assigned. This starts the clock the regulator will measure and creates the audit trail. No failover happens without a declaration — it prevents a jittery on-call from cutting over during a recoverable blip.
Confirm primary loss and replica health. Verify via Datadog/Dynatrace that AWS is genuinely unreachable (not an Akamai-side or DNS artifact) and that the Azure PostgreSQL replica’s lag is within RPO. Cutting over to a badly-lagged replica trades an outage for data loss.
Ignite compute. terraform apply the Azure application tier (or scale the pre-applied estate’s node pools and deployments from zero), pulling the pre-baked images from Azure Container Registry. Argo CD re-converges the AKS cluster to the declared production state from Git.
Promote the database. Ansible promotes the Azure PostgreSQL replica to a standalone read-write primary and stops log-recovery. This is the irreversible step — past here, failback means a re-seed.
Rewire and warm. Ansible repoints application config at the promoted Azure database and the Blob document store, pulls fresh leases from Vault, and warms caches.
Smoke-test behind a closed door. Run the synthetic adjudication transaction and member-portal checks against Azure before exposing it — submit a test claim, confirm it adjudicates, confirm a document loads.
Cut traffic over. Flip Akamai’s origin to the Azure ingress. Health checks confirm; members are now served from Azure. Okta → Entra federation means adjudicators’ logins work unchanged.
Verify and record. Confirm SLOs green in the observability stack, close the cutover task with timestamps. The elapsed time is the audited RTO.
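The sequence above can be sketched as a small ordered runner that refuses to start without a declaration, timestamps every step as evidence, stops at the first failed check, and reports the elapsed time against the RTO. This is an illustrative sketch only: `Step`, `RunbookResult`, `run_runbook`, and `fake_clock` are hypothetical names, the step actions are stand-in lambdas rather than real Terraform, Ansible, or Akamai calls, and the 5-minute tick is an arbitrary value chosen to make the example deterministic.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Tuple

RTO = timedelta(minutes=60)  # target stated in the article


@dataclass
class Step:
    name: str
    action: Callable[[], bool]   # hypothetical hook; returns True on success
    irreversible: bool = False   # e.g. promoting the replica


@dataclass
class RunbookResult:
    completed: bool = False
    failback_required: bool = False
    elapsed: timedelta = timedelta(0)
    evidence: List[Tuple[str, datetime, bool]] = field(default_factory=list)


def run_runbook(steps: List[Step], declared: bool,
                clock: Callable[[], datetime] = lambda: datetime.now(timezone.utc)
                ) -> RunbookResult:
    """Run steps in order, timestamping each one; stop at the first failure."""
    if not declared:
        raise RuntimeError("failover refused: no major-incident declaration")
    started = clock()
    result = RunbookResult()
    for step in steps:
        ok = step.action()
        result.evidence.append((step.name, clock(), ok))
        if not ok:
            break  # never continue past a failed check
        if step.irreversible:
            result.failback_required = True  # past the point of no return
    else:
        result.completed = True
    result.elapsed = clock() - started
    return result


def fake_clock(tick: timedelta = timedelta(minutes=5)) -> Callable[[], datetime]:
    """Deterministic clock for the example: advances one tick per reading."""
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)

    def read() -> datetime:
        nonlocal now
        current, now = now, now + tick
        return current

    return read


steps = [
    Step("confirm primary loss and replica health", lambda: True),
    Step("ignite compute", lambda: True),
    Step("promote database", lambda: True, irreversible=True),
    Step("rewire and warm", lambda: True),
    Step("smoke-test", lambda: True),
    Step("cut traffic over", lambda: True),
    Step("verify and record", lambda: True),
]
result = run_runbook(steps, declared=True, clock=fake_clock())
print(result.completed, result.elapsed, result.elapsed <= RTO)  # True 0:40:00 True
```

Placing the confirmation step before the irreversible promotion is what keeps a failed health check from committing the team to a failback; the `failback_required` flag makes that commitment explicit in the record.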

Failback is the runbook’s mirror, and it is the part teams forget to design. Because promotion broke the replication chain, failback means re-establishing AWS as a fresh replica of the now-primary Azure database, syncing S3 from Blob, validating, and cutting back during a planned window — never an emergency. Document it with the same rigor; an un-rehearsed failback strands you on the standby indefinitely.

Enterprise considerations

Security & Zero Trust across two clouds. The hard part of multi-cloud DR security is that you now have two attack surfaces and must not let the bridge between them become the weakest link. Identity is unified through Okta → Entra ID so there is one place to enforce conditional access and one set of group claims that work in both clouds. Wiz runs continuous CSPM across both AWS and Azure, surfacing misconfigurations, public-exposure drift, and cross-cloud attack paths in one inventory — and Wiz Code scans the Terraform in pull requests so a standby module that would deploy a publicly-exposed Blob container or an over-permissive role is caught before merge, not discovered during a failover. CrowdStrike Falcon sensors run on both the EKS and AKS node pools, feeding one SOC so a threat detected on the dormant Azure side is not invisible just because that cloud is idle. Vault keeps secrets short-lived and centrally revocable, so a compromised lease is contained regardless of which cloud used it. A critical, easily-missed control: the replication streams themselves are sensitive data in motion between clouds and must be encrypted in transit with access scoped to exactly the replication principals — Wiz flags it if that scope ever widens.

Cost optimization — the whole reason for pilot light. The economics are the design’s justification, so engineer them deliberately.

Lever	Mechanism	Effect on standby cost
Scale compute to zero	AKS node pools / deployments at zero until failover	Eliminates ~all standby compute spend while idle
Small recovering replica	Right-size Azure PostgreSQL to apply WAL, not serve load	Pay for one modest replica, not a prod-sized DB
Storage tiering	Blob cool/cold tier for older replicated documents	Cuts object-store cost on the standby copy
Pre-baked images, not idle build infra	Images in ACR; no running CI on Azure	No standing compute for the standby pipeline
Rehearse, don’t idle-run	Full Azure compute billed only during tests/DR	Insurance cost, not second-estate cost

The standby’s steady-state bill is essentially replicated storage plus a small replica plus egress — a fraction of a warm standby and a tiny fraction of active/active. Datadog/Dynatrace track that cross-cloud egress, because replication and observability traffic between providers is the cost line that surprises people. The board approved DR specifically because pilot light kept it off the active/active price tag.

Scalability and the ignition ceiling. Steady-state scaling is ordinary AWS autoscaling on EKS for the daily claims peak. The DR-specific scaling concern is the ignition burst: when you fail over, AKS must scale from zero to full production capacity fast enough to honor RTO, which means the Azure subscription needs pre-arranged compute quota in the standby region — discovering a quota cap at minute 30 of a real disaster is the canonical pilot-light failure. Pre-baked ACR images keep ignition to a pull-and-start, not a build. The replication tier scales with data volume: log-shipping throughput must keep pace with the primary’s write rate or lag grows and RPO silently erodes, so replication-lag headroom is a capacity metric, not just a health metric.
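Both capacity concerns in this paragraph reduce to simple arithmetic that a scheduled readiness check could run. The sketch below is a hedged illustration: `NodePool`, `quota_shortfall`, and `lag_headroom` are hypothetical names, and every number in the usage example (pool sizes, vCPU quota, throughput figures) is invented for the example rather than taken from the payer's environment.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodePool:
    name: str
    node_count: int       # production-size target at ignition
    vcpus_per_node: int


def quota_shortfall(pools: Iterable[NodePool], regional_vcpu_quota: int,
                    vcpus_in_use: int = 0) -> int:
    """vCPUs missing from the standby region's quota; 0 means ignition fits."""
    required = sum(p.node_count * p.vcpus_per_node for p in pools)
    available = regional_vcpu_quota - vcpus_in_use
    return max(0, required - available)


def lag_headroom(ship_rate_mb_s: float, peak_write_rate_mb_s: float) -> float:
    """Ratio of log-shipping throughput to peak WAL generation.

    Above 1.0 the stream can catch up after a spike; at or below 1.0
    lag grows for as long as the peak lasts.
    """
    if peak_write_rate_mb_s <= 0:
        raise ValueError("peak write rate must be positive")
    return ship_rate_mb_s / peak_write_rate_mb_s


# Hypothetical figures, for illustration only.
pools = [NodePool("adjudication", 12, 8), NodePool("portal", 6, 4)]
print(quota_shortfall(pools, regional_vcpu_quota=100, vcpus_in_use=4))  # 24
print(lag_headroom(ship_rate_mb_s=30.0, peak_write_rate_mb_s=20.0))     # 1.5
```

A non-zero shortfall found on an ordinary weekday is a support ticket; the same shortfall found mid-disaster is a missed RTO.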

Observability — replication lag is the heartbeat. Instrument both clouds into Datadog/Dynatrace with the DR-specific signals that ordinary app monitoring omits: database replication lag (the RPO proxy and the single most important DR metric), S3-to-Blob sync lag and backlog depth, synthetic failover-readiness checks that confirm the replica is promotable and images are current, and cross-cloud egress for cost. Anomaly detection on replication lag means a quietly-stalling stream pages someone before a disaster exposes a blown RPO. The same dashboards become the failover console during an actual event — the DR commander watches RTO accumulate live. Every rehearsal feeds its timings back into ServiceNow as the auditable record the regulator reviews.
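A quietly stalling stream shows up as a trend before it shows up as a threshold breach. As a simple stand-in for the anomaly detection described above, the sketch below projects when lag would cross the RPO from its average growth over a window of samples. The function name, the fixed sampling interval, and the sample values are assumptions for illustration; a monitoring platform's own anomaly detection would normally replace this.

```python
from typing import Optional, Sequence


def seconds_to_breach(lag_samples: Sequence[float], interval_s: float,
                      rpo_s: float = 300.0) -> Optional[float]:
    """Project when replication lag crosses the RPO.

    lag_samples are lag readings in seconds, oldest first, taken every
    interval_s seconds. Returns None when lag is flat or shrinking and
    0.0 when the RPO is already breached.
    """
    if len(lag_samples) < 2:
        return None
    latest = lag_samples[-1]
    if latest >= rpo_s:
        return 0.0
    window_s = (len(lag_samples) - 1) * interval_s
    growth_per_s = (latest - lag_samples[0]) / window_s
    if growth_per_s <= 0:
        return None
    return (rpo_s - latest) / growth_per_s


# Hypothetical readings, one per minute.
print(seconds_to_breach([4, 5, 4, 4], interval_s=60))     # None: steady
print(seconds_to_breach([5, 20, 35, 50], interval_s=60))  # 1000.0: stalling
```

In the second case lag is only 50 seconds, comfortably inside the RPO, yet the projection says the budget is gone in under 17 minutes. That is the page worth sending.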

Governance — DR you can prove. The regulator does not accept “we have DR”; they accept evidence of a tested DR capability. So the architecture treats rehearsal as a first-class, scheduled obligation: a periodic game-day where the runbook executes against Azure (ideally a real cutover of a non-member-facing slice, at minimum a full promote-and-smoke-test in isolation), every step timestamped in ServiceNow, producing the artifact the audit committee signs. Terraform state and the Wiz Code pre-merge gate mean the standby cannot silently drift from production’s shape between tests — the failure mode where the runbook worked last quarter but the infrastructure changed underneath it. Pin image versions in ACR and database engine versions on both sides so a rehearsal and a real failover land on identical, reproducible infrastructure.
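The pinning advice can be backed by a trivial comparison run before every rehearsal. The sketch below assumes each side can export its pinned versions as a flat mapping; the `version_drift` name, the keys, and the version strings are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def version_drift(primary: Dict[str, str],
                  standby: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {pin: (primary_value, standby_value)} for every pin that
    differs between the two sides or exists on only one of them."""
    keys = sorted(set(primary) | set(standby))
    return {
        k: (primary.get(k), standby.get(k))
        for k in keys
        if primary.get(k) != standby.get(k)
    }


# Hypothetical pins, for illustration only.
aws_pins = {"adjudication-api": "1.42.0", "portal-api": "3.7.1", "postgres-engine": "15.4"}
azure_pins = {"adjudication-api": "1.42.0", "portal-api": "3.7.0", "postgres-engine": "15.4"}
print(version_drift(aws_pins, azure_pins))  # {'portal-api': ('3.7.1', '3.7.0')}
```

An empty result is the evidence that a rehearsal and a real failover would land on the same versions; anything else is a finding to close before game-day.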

Failure modes, named before they page you.

Replication lag blowout — the WAL stream falls behind under a write spike and nobody notices, so a failover would lose more than 5 minutes of claims. Mitigation: lag as a hard SLO in Datadog/Dynatrace with alerting well below the RPO threshold.
Standby drift — someone hot-fixes AWS infrastructure and the Azure Terraform never catches up; the runbook ignites a subtly-wrong estate. Mitigation: single shared modules, Wiz Code in CI, scheduled rehearsals that exercise the real standby.
Quota wall at ignition — AKS cannot scale to production size because the Azure region’s quota was never raised. Mitigation: pre-arranged quota and a synthetic readiness check that test-scales periodically.
Promotion-before-confirmation — an itchy cutover during a recoverable AWS blip promotes the replica and forces an unnecessary, expensive failback. Mitigation: the ServiceNow declaration gate and the explicit “confirm primary loss” runbook step.
Failback amnesia — failover is rehearsed, failback is not, and the payer is stranded on Azure for weeks. Mitigation: failback documented and rehearsed as a peer of failover.
Secret staleness — duplicated per-cloud secrets drift and the standby cannot authenticate at the worst moment. Mitigation: one Vault, short-lived leases, no copied long-lived keys.

Explicit tradeoffs

Accept these or do not build it. Pilot light buys cheap insurance and pays for it in recovery complexity: failover is a multi-step orchestration with an irreversible database promotion in the middle, not a transparent flip, and the seconds-of-lag RPO means a real disaster will drop the last few in-flight transactions — acceptable for this payer at 5 minutes, but it is data loss you must consciously sign for. Running two clouds means two skill sets, two security postures, and two billing models, which is real operational weight even with Terraform, Wiz, Falcon, and Datadog unifying the tooling. The cross-cloud replication streams add egress cost and a bandwidth dependency between providers that itself can fail. And the standby is only as good as its last rehearsal — an untested pilot light is a more expensive backup-and-restore wearing a costume, which is why the governance discipline is not optional overhead but the thing that makes the whole investment real.

The alternatives, and when they win. If your RTO can tolerate hours and your budget is the dominant constraint, backup-and-restore across clouds is simpler and cheaper; accept the long recovery. If your RTO is a few minutes and the budget allows, warm standby (a scaled-down always-running Azure copy taking a traffic share) removes the ignition delay and is worth it for the most critical tier. If you genuinely need to survive a regional event with near-zero recovery time and can stomach the cost and the cross-cloud data-consistency problem, active/active is the ceiling. And if provider independence is not actually your threat model, because an AZ-level or single-region AWS event is what you fear, then multi-region within AWS is dramatically simpler than spanning clouds, and you should not pay the multi-cloud tax for a guarantee you do not need. This payer crossed to Azure for one specific reason: the board’s feared failure was provider- and account-level, and only a genuinely independent cloud answers it.

The shape of the win

For the payer, the payoff is not “we run on two clouds.” It is that when us-east-1 goes dark again, a DR commander opens a ServiceNow record, the team executes a rehearsed runbook, Terraform and Ansible ignite the Azure standby off a replica that is seconds behind and a document store that is moments behind, Akamai flips the edge, adjudicators log in with the same Okta credentials, and members are filing claims again inside the hour — with a timestamped, audited record handed to the regulator that proves the mandated capability is real and not aspirational. That last clause is what funds the architecture. Everything upstream — the log-shipped replica burning as the pilot light, the S3-to-Blob sync, the one Vault spanning both clouds, the Wiz and Falcon coverage that does not blink because Azure is idle, the Datadog lag SLO, the Terraform that keeps the standby honest — exists so that a regulator, a CISO, and a CFO each say yes: yes it works, yes it is independent of AWS, and yes it costs insurance money rather than a second production estate. Start narrower if you must, but for a regulated service under a provider-independence mandate, a cross-cloud pilot light is where DR has to land.

Written by Vinod
