AWS Enterprise Architecture: Disaster Recovery Strategies

Every disaster recovery conversation that goes wrong starts the same way: someone asks “are we covered if a region goes down?” and someone else answers “yes, we have backups.” Those two sentences are about completely different things, and the gap between them is where outages turn into resignations. “Covered” is a question about how much data you can afford to lose and how long you can afford to be down — RPO and RTO — and the honest answer is a number with a price tag, not a yes. Disaster recovery on AWS is not one architecture; it is a spectrum of four, each buying you a tighter RTO/RPO for more money and more operational discipline. The art is choosing the cheapest one that meets your actual recovery objectives, building it so it genuinely works, and — the part everyone skips — proving it works on a schedule.

This article lays out the four canonical AWS DR strategies as the AWS whitepaper Disaster Recovery of Workloads on AWS frames them (Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active), but as a single decision framework rather than four disconnected diagrams. We will anchor everything to RTO (recovery time objective: how long until you are serving again) and RPO (recovery point objective: how much recent data you lose), because those two numbers are the only honest way to compare them. Along the way: the AWS services that implement each tier (AWS Backup, S3 Cross-Region Replication, Aurora Global Database, DynamoDB Global Tables, AWS Elastic Disaster Recovery, Route 53 Application Recovery Controller), how to express them in Terraform, how to orchestrate an actual failover, and how to keep the bill from quietly turning a passive standby into an expensive insurance policy nobody has ever cashed.

The business scenario

The thing that makes DR universally relevant is that the driver is never “we want DR” — it is a specific loss the business cannot absorb, and that loss looks different at every size of company while the underlying maths is identical.

A small SaaS startup (one product, one region, a dozen engineers) has a customer portal on a single RDS instance and an EC2/ECS app tier. They have never lost a region — but they have lost an availability zone, suffered a botched migration that corrupted a table, and once had an engineer drop the wrong database. Their real exposure is not “us-east-1 disappears”; it is operational and data-corruption events, and their honest RTO need is “a few hours” with an RPO of “last night’s backup, ideally tighter.” For them, gold-plated active-active would be malpractice; Backup & Restore done properly is the right and responsible answer.
A mid-market company running an order-management system or an insurer’s quote-and-bind app has graduated to “down for a day is a serious incident.” They can tolerate a short outage during a genuine disaster but not data loss on financial records. They want an RTO in tens of minutes and an RPO of seconds to single-digit minutes — and they want to spend as little as possible to get there. Pilot Light or Warm Standby is their band.
A regulated enterprise (a bank, a healthcare platform, a payments processor) operates under explicit regulatory expectations (DORA in the EU, FFIEC/OCC guidance for US banking, operational-resilience rules that name impact tolerances in time) and contractual SLAs with credits attached. For their tier-1 systems, an RTO measured in minutes and an RPO near zero are not aspirational; they are audited. Warm Standby or Multi-Site Active-Active is mandated for the crown-jewel workloads, but even they run their tier-3 internal systems on Backup & Restore, because spending active-active money on a low-criticality app is how DR budgets get wasted.

What unites all three is the same uncomfortable trio of facts. First, the failures that actually require DR are correlated and large — an AZ-spanning power or network event, a regional control-plane degradation, a bad region-wide config change, ransomware that encrypts your primary and anything it can reach, or a data-corruption bug that faithfully replicates to your replica. Multi-AZ (which you should already have) handles the small ones; DR is about the big ones, including the ones that come from inside the house. Second, the cost of recovery is dominated by what you keep running while nothing is wrong — a hot standby costs the same on a quiet Tuesday as during a disaster, so over-provisioning DR is a tax you pay every single day for an event that may never come. Third, an untested recovery plan is not a plan — it is a document, and documents do not fail over.

The problem this architecture solves, stated precisely: map each workload to a recovery objective (RTO/RPO) the business has actually signed off on, implement the cheapest DR strategy that meets it using native AWS services, orchestrate the failover so it is deterministic rather than heroic, and verify it on a recurring schedule so the RTO/RPO you claim is the RTO/RPO you can deliver. The rest of this article is the map from “RTO/RPO target” to “which of the four, built how.”

Architecture overview

Figure: AWS disaster recovery reference architecture. A primary Region (us-east-1) replicates to a recovery Region (us-west-2) across four strategy lanes (Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active) under a shared Route 53 / ARC / Global Accelerator failover control plane, with an immutable cross-account backup vault as the ransomware backstop.

The four strategies are best understood as one axis — increasing investment in pre-provisioned, pre-warmed, pre-replicated infrastructure in a recovery Region — sliding RTO and RPO from “hours/hours” down toward “near-zero/near-zero.” Picture a primary Region (say us-east-1) serving production, and a recovery Region (say us-west-2). The difference between the four strategies is entirely about what is already running, and how current the data is, in us-west-2 before disaster strikes.

Strategy 1 — Backup & Restore (RPO: hours; RTO: hours). Nothing runs in the recovery Region in steady state. AWS Backup takes scheduled snapshots of EBS volumes, RDS/Aurora clusters, DynamoDB tables, EFS, and more, and copies them cross-Region (and ideally cross-account, into a locked vault) on a schedule. Application artifacts live as container images in ECR with cross-Region replication and IaC in version control. When disaster strikes, you build the recovery environment from IaC and restore data from the latest copied backups. The data path in steady state is just “snapshot → copy to Region B → sit in a vault.” The recovery path is “terraform apply in Region B, restore snapshots, repoint DNS.” It is the cheapest by far and the slowest by far.

Strategy 2 — Pilot Light (RPO: minutes; RTO: tens of minutes). The data layer is live and continuously replicating to the recovery Region, but compute is switched off or minimal. An Aurora Global Database secondary cluster sits in us-west-2 receiving storage-level replication (sub-second lag); DynamoDB Global Tables replicate continuously; S3 Cross-Region Replication mirrors objects. The “pilot light” is exactly this always-warm data plus the core scaffolding — the VPC, subnets, security groups, and a scaled-to-zero (or minimal) compute definition — kept current but not serving. On failover you promote the Aurora secondary, scale the compute up from zero to production size, and repoint traffic. You pay for replicated storage and data transfer, but almost nothing for idle compute.

Strategy 3 — Warm Standby (RPO: seconds; RTO: minutes). A fully functional but under-scaled copy of the workload runs in the recovery Region all the time. The data layer replicates as in Pilot Light, and a smaller version of the compute fleet — fewer/smaller ECS tasks or EC2 instances, a smaller Aurora reader — is actually running and could serve traffic right now, just not at full capacity. Failover is promote the database, scale the already-running fleet up, and shift traffic — faster than Pilot Light because nothing has to cold-start from zero. You pay for a continuously-running (if minimal) second environment.

Strategy 4 — Multi-Site Active-Active (RPO: near-zero; RTO: near-zero). Both Regions serve live production traffic simultaneously. There is no “failover” of compute because both fleets are already hot at full-ish capacity; DynamoDB Global Tables are multi-active (both Regions write locally); Aurora Global Database has one writer but a promotable secondary. A regional loss is a capacity event, not a recovery event. This is the most expensive and the most complex, and it gets its own deep treatment in the companion Active-Active Multi-Region article — here it is the top rung of the ladder.

The common control plane across all four: traffic steering and failover orchestration are the same machinery regardless of strategy. Amazon Route 53 provides DNS with health-check-based failover records; Route 53 Application Recovery Controller (ARC) provides routing controls (deterministic on/off switches you flip to redirect traffic, instead of hoping health checks fire correctly) and readiness checks (continuous verification that the standby is actually capable of taking load). For non-cacheable latency-critical APIs, AWS Global Accelerator fails over at the network layer in seconds rather than waiting on DNS TTLs. And for server-based workloads that are hard to re-platform, AWS Elastic Disaster Recovery (DRS) continuously block-level-replicates whole servers into a low-cost staging area in the recovery Region and launches them on demand — effectively a managed Pilot Light / Warm Standby for lift-and-shift estates.

The single most important thing this overview should make obvious: you do not pick one strategy for the company — you pick one per workload, based on its RTO/RPO tier. A mature enterprise runs all four at once: active-active for the payment rail, warm standby for the order system, pilot light for the reporting platform, and backup & restore for the internal wiki — each priced to its actual cost of downtime.
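
The per-workload choice can be sketched as a walk up the ladder: take the cheapest strategy whose achievable RTO and RPO both fit the agreed targets. The following is a minimal Python sketch, not a sizing tool. The numeric bounds attached to each strategy are illustrative assumptions chosen to mirror the qualitative bands above ("hours", "tens of minutes", "minutes", "near-zero"), and `pick_strategy` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    best_rto: timedelta  # assumed fastest recovery this tier can deliver
    best_rpo: timedelta  # assumed smallest data-loss window for this tier


# Ordered cheapest first. The bounds are assumptions for illustration only;
# replace them with numbers measured in your own failover drills.
LADDER = [
    Strategy("Backup & Restore", timedelta(hours=4), timedelta(hours=1)),
    Strategy("Pilot Light", timedelta(minutes=20), timedelta(minutes=1)),
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(seconds=5)),
    Strategy("Multi-Site Active-Active", timedelta(0), timedelta(0)),
]


def pick_strategy(target_rto: timedelta, target_rpo: timedelta) -> str:
    """Return the cheapest strategy whose assumed bounds meet both targets."""
    for strategy in LADDER:
        if strategy.best_rto <= target_rto and strategy.best_rpo <= target_rpo:
            return strategy.name
    # Not reachable with the ladder above, because the top rung is (0, 0).
    raise ValueError("no strategy meets the targets")
```

With these assumed bounds, a 15-minute RTO with a 5-second RPO lands on Warm Standby, while an 8-hour RTO with a 24-hour RPO lands on Backup & Restore. Ordering the ladder cheapest first is what stops the function from over-buying.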

Component breakdown

Component	AWS service	Role in DR	Key configuration choices
Cross-Region backup	AWS Backup	The foundation of Backup & Restore; defence against corruption/ransomware for all tiers	Backup plans with cross-Region copy + cross-account copy to an isolated account; Vault Lock (compliance mode) for immutability; lifecycle to cold storage; backup of RDS, EBS, DynamoDB, EFS, S3, Aurora
Relational replication	Aurora Global Database	Live cross-Region data plane for Pilot Light / Warm Standby / Active-Active	One global cluster; primary writer + ≥1 secondary Region; storage-layer replication (typically <1s lag); managed planned failover for drills; target RPO ~1s
Relational (RDS, non-Aurora)	RDS cross-Region read replica	Same idea for MySQL/PostgreSQL/MariaDB on RDS	Cross-Region read replica that can be promoted to standalone; async replication (RPO = replication lag, seconds–minutes)
NoSQL replication	DynamoDB Global Tables	Multi-active key/document data; zero-failover for the tiers that use it	Global Tables v2 (2019.11.21); PITR enabled; design for last-writer-wins; watch `ReplicationLatency`
Object replication	S3 Cross-Region Replication (CRR)	Mirror uploads, exports, static origins, and backups to Region B	Versioning on; CRR rules (optionally RTC for a 15-min replication SLA); replication metrics; bidirectional where both Regions write
Server-based DR	AWS Elastic Disaster Recovery (DRS)	Continuous block replication of whole EC2/on-prem servers; managed pilot-light/warm-standby for lift-and-shift	Low-cost staging area with cheap instances + EBS; point-in-time recovery snapshots; launch templates for recovery; supports drill launches into an isolated subnet
Container artifacts	Amazon ECR (cross-Region replication)	Make the same image digest available in Region B for fast rebuild/scale-up	Registry replication rules to Region B; deploy by digest, not tag
DNS failover	Route 53	Steer traffic to the healthy Region	Failover or latency routing; health checks on a deep `/health` endpoint; low TTL (30–60s); `Evaluate Target Health` on alias records
Failover orchestration	Route 53 Application Recovery Controller (ARC)	Deterministic, audited Region switch + standby readiness assurance	Routing controls (manual/automated on-off) gated by safety rules; readiness checks per resource type; multi-Region cluster of 5 endpoints
Network-layer failover	AWS Global Accelerator	Sub-DNS failover for non-cacheable latency-critical APIs	Two anycast IPs; endpoint groups per Region; traffic dials; health checks at the network layer
Keys & secrets	KMS (multi-Region keys) + Secrets Manager (replica secrets)	Ensure encrypted data and credentials are usable in Region B	Multi-Region KMS keys so replicated data decrypts locally; replica secrets so Region B reads DB creds without a cross-Region call
IaC + pipeline	Terraform + CodePipeline/GitHub Actions	Recreate compute/networking in Region B on demand (Backup & Restore / Pilot Light)	Region-parameterised modules; recovery environment is `terraform apply`, not click-ops; pipeline can deploy Region B independently

A few component choices deserve their why, not just their what:

Why AWS Backup with cross-Region and cross-account copy and Vault Lock, not just snapshots? Because the disaster that most often actually happens is not a region vanishing; it is deletion or encryption of your data, whether by a bad actor, ransomware, or a mistake. A snapshot in the same account that the same compromised credentials can delete is not a backup; it is a hostage. Copying backups into a separate, locked account with Vault Lock in compliance mode (under which not even the root user can delete recovery points or shorten their retention before the retention period ends) is what turns “we have snapshots” into “we can actually recover from ransomware.” This single control protects every tier, including active-active stacks whose live replication would have dutifully copied the corruption to the other Region.

Why Aurora Global Database over RDS cross-Region read replicas for the higher tiers? Both give you a promotable copy in Region B, but Aurora Global Database replicates at the storage layer with typically sub-second lag and therefore a far lower RPO, supports managed planned failover (a clean, ~1-minute switch you can rehearse), and decouples replication from the database engine’s own load. RDS cross-Region read replicas use engine-level async replication, which is perfectly fine for Pilot Light on smaller MySQL/PostgreSQL workloads but has looser, more variable lag. Choose Aurora Global DB when your RPO budget is on the order of a second and you intend to rehearse failover regularly.

Why Elastic Disaster Recovery (DRS) instead of re-architecting? Because a huge share of enterprise estates is not cloud-native — it is lift-and-shift EC2 (or still on-prem) running software that nobody is going to re-platform onto Fargate and Aurora just to get DR. DRS continuously replicates the entire server (OS, app, data) as block-level changes into a cheap staging area in Region B, and on failover (or a non-disruptive drill) launches production-sized instances from the latest point-in-time. It is the pragmatic way to put a Pilot-Light-grade RTO/RPO on workloads you cannot or will not refactor — and it is far better than a quarterly AMI copy.

Why Route 53 ARC and not just health-check failover? Because DNS health checks fail in the messy middle: a partial brownout where the primary is sick enough to lose data but healthy enough to pass a shallow health check, or where flapping health checks bounce traffic back into a degraded Region. ARC routing controls are deterministic switches you (or an automated runbook) flip with intent, protected by safety rules (e.g. “at least one Region must always be on,” so no change can turn both off). ARC readiness checks continuously answer the question that actually matters before you fail over: is the standby genuinely ready to take this load right now? For tier-1 systems, that determinism is worth the added moving part.
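
The idea behind a safety rule can be illustrated without any AWS calls: a proposed change to the routing-control states is evaluated as one batch and rejected if it would leave too few Regions on. This is a plain-Python sketch of the concept only. It does not use the ARC API, and `apply_routing_change` and its parameters are hypothetical names.

```python
from typing import Dict


def apply_routing_change(
    current: Dict[str, bool],
    change: Dict[str, bool],
    min_regions_on: int = 1,
) -> Dict[str, bool]:
    """Return the new routing-control states, or raise if the rule fails.

    `current` and `change` map a Region name to True (traffic on) or
    False (traffic off). `current` is never modified.
    """
    unknown = set(change) - set(current)
    if unknown:
        raise KeyError(f"unknown routing controls: {sorted(unknown)}")

    # Evaluate the whole batch at once, so "A off, B on" is judged as a unit.
    proposed = {**current, **change}
    regions_on = sum(proposed.values())
    if regions_on < min_regions_on:
        raise ValueError(
            f"safety rule violated: {regions_on} Region(s) on, "
            f"need at least {min_regions_on}"
        )
    return proposed
```

Evaluating the batch rather than each switch in turn matters: a failover that turns the primary off and the standby on passes, whereas turning the primary off on its own is refused.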

Implementation guidance

Infrastructure as Code is the load-bearing wall of Backup & Restore and Pilot Light, because in those strategies the recovery environment’s compute and networking do not exist (or barely exist) until you create them. If “recovery” means an engineer hand-clicking a VPC together under pressure at 3am, your real RTO is “however long that takes, plus the mistakes.” Express the workload as a region-parameterised Terraform module so that standing up Region B is terraform apply against a second provider alias — not archaeology.

The clean structure:

A global stack for genuinely global resources: the Route 53 hosted zone + failover records + health checks, the ARC cluster and routing controls, the multi-Region KMS key, and ECR replication rules.
A reusable region module instantiated for the primary and (for Pilot Light/Warm Standby) the recovery Region: VPC, subnets, security groups, ALB, ECS service, and the regional database member. The difference between Pilot Light and Warm Standby in this module is often one variable — the desired task count / min capacity (0 for pilot light, a small N for warm standby).
A data stack wiring the cross-Region replication: the Aurora global cluster, S3 CRR rules, and DynamoDB replicas.
A backup stack for the AWS Backup plan, cross-Region/cross-account copy actions, and Vault Lock.

Terraform shape for the AWS Backup plan with cross-Region copy and an immutable vault (illustrative):

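# Assumes aws.usw2 is a provider alias for the recovery Region. For the
# cross-account copy described above, that alias would also have to assume a
# role in the isolated backup account. aws_kms_key.backup_usw2 and
# aws_backup_vault.primary are assumed to be defined elsewhere.
resource "aws_backup_vault" "dr" {
  provider    = aws.usw2                    # vault in the recovery Region
  name        = "dr-locked-vault"
  kms_key_arn = aws_kms_key.backup_usw2.arn
}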

# Compliance-mode lock: cannot be deleted/shortened before retention elapses
resource "aws_backup_vault_lock_configuration" "dr" {
  provider            = aws.usw2
  backup_vault_name   = aws_backup_vault.dr.name
  min_retention_days  = 35
  changeable_for_days = 3                    # cooling-off before lock is permanent
}

resource "aws_backup_plan" "core" {
  name = "core-cross-region"

  rule {
    rule_name         = "daily-35d"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)"  # 05:00 UTC daily
    start_window      = 60
    completion_window = 180
    lifecycle { delete_after = 35 }

    copy_action {                            # the DR-critical part
      # Different Region; a different account only if aws.usw2 targets one
      destination_vault_arn = aws_backup_vault.dr.arn
      lifecycle { delete_after = 35 }        # must not be below the vault's min_retention_days
    }
  }
}
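
One detail in the plan above is easy to get wrong: a locked vault rejects backup and copy jobs whose retention is shorter than its minimum, so the `delete_after` on the copy must not fall below `min_retention_days`. A small pre-flight check can catch this before `terraform apply`. This is a sketch in Python with a hypothetical helper name; how the values are extracted from the Terraform configuration is left open.

```python
from typing import Dict, List


def retention_violations(
    min_retention_days: int, rule_retention_days: Dict[str, int]
) -> List[str]:
    """Return the names of rules whose retention is below the vault minimum."""
    return sorted(
        name
        for name, days in rule_retention_days.items()
        if days < min_retention_days
    )
```

For the configuration shown, `retention_violations(35, {"daily-35d copy": 35})` returns an empty list, because 35 days is exactly the minimum.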

Terraform shape for the Aurora Global Database data plane (the Pilot Light / Warm Standby relational tier):

resource "aws_rds_global_cluster" "this" {
  global_cluster_identifier = "ord-global"
  engine                    = "aurora-postgresql"
  engine_version            = "16.4"
  storage_encrypted         = true
}

resource "aws_rds_cluster" "primary" {          # writer — us-east-1
  provider                    = aws.use1
  cluster_identifier          = "ord-use1"
  engine                      = aws_rds_global_cluster.this.engine
  engine_version              = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier   = aws_rds_global_cluster.this.id
  master_username             = var.db_user
  manage_master_user_password = true            # secret in Secrets Manager, not state; verify support for global-cluster members
  kms_key_id                  = aws_kms_key.use1.arn
  db_subnet_group_name        = module.region_use1.db_subnet_group
}

resource "aws_rds_cluster" "secondary" {        # promotable reader — us-west-2
  provider                  = aws.usw2
  cluster_identifier        = "ord-usw2"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  kms_key_id                = aws_kms_key.usw2.arn
  db_subnet_group_name      = module.region_usw2.db_subnet_group
  depends_on                = [aws_rds_cluster.primary]
}

And the Region B compute as a single variable away from Pilot Light vs Warm Standby:

module "region_usw2" {
  source    = "./modules/region"
  providers = { aws = aws.usw2 }

  # Pilot Light: desired_count = 0  (scaffolding only, scaled to zero)
  # Warm Standby: desired_count = 2 (minimal live fleet, ready to scale)
  ecs_desired_count = var.dr_warm ? 2 : 0
  ecs_max_count     = 40            # full production ceiling on failover
}

Networking and identity. Keep request handling in-Region; only data replication should cross Regions, and it travels the AWS backbone natively for Aurora, DynamoDB, and S3, so you do not need hot-path VPC peering for the user flow. Put Gateway VPC endpoints for S3 and DynamoDB and Interface endpoints for the rest in each Region so data-plane traffic stays off NAT and the internet. Use multi-Region KMS keys so an encrypted snapshot or replicated object decrypts in Region B under the local key replica; a single-Region key is a silent way to make your “recovered” data unreadable. Put DB credentials in Secrets Manager replica secrets so Region B never makes a cross-Region Secrets Manager call on the recovery path. For human and machine identity, use one AWS Organization with IAM Identity Center for SSO, plus per-Region IAM roles scoped to that Region’s resource ARNs (ECS task roles, or IRSA on EKS). A compromised task in Region A should have no standing path to Region B beyond what replication already grants.

The failover runbook itself must be code, not prose. Whatever the strategy, the switch should be an executable sequence (a Systems Manager Automation document or a Step Functions state machine) that: (1) confirms the standby’s readiness (ARC readiness check), (2) promotes the data tier (a managed failover-global-cluster for a planned switch, or detaching and promoting the secondary for an unplanned one), (3) scales Region B compute to production size, (4) flips the ARC routing control (and/or Global Accelerator traffic dial) to send traffic to Region B, and (5) verifies synthetic transactions succeed before declaring victory. Backup & Restore adds a step zero: terraform apply the Region B environment and restore the latest copied backups. Encoding this is what collapses RTO from “however long the on-call figures it out” to a predictable number.
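
The five steps above can be sketched as an ordered list that stops at the first failure and records how long each step took, which is also how a drill produces a measured RTO. This is a minimal Python sketch rather than a Systems Manager or Step Functions definition. The step actions are stubs standing in for real API calls, and `run_failover`, `stub`, and `RUNBOOK` are hypothetical names.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[Tuple[str, bool, float]]:
    """Run steps in order; stop at the first one that fails or raises.

    Returns (step name, succeeded, seconds taken) for every step attempted.
    """
    results = []
    for name, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            # A real runbook would log the exception before giving up.
            ok = False
        results.append((name, ok, time.monotonic() - started))
        if not ok:
            break  # never flip traffic after a failed earlier step
    return results


def stub(result: bool) -> Callable[[], bool]:
    """Placeholder for a real action, such as an AWS API call."""
    return lambda: result


RUNBOOK: List[Step] = [
    ("confirm standby readiness", stub(True)),
    ("promote data tier", stub(True)),
    ("scale Region B compute", stub(True)),
    ("flip routing control", stub(True)),
    ("verify synthetic transactions", stub(True)),
]
```

Calling `run_failover(RUNBOOK)` attempts all five steps. If the readiness step returned `False`, the result would contain only that one entry and traffic would never be moved.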

Enterprise considerations

Security and Zero Trust. DR widens your attack surface — there is now a second copy of everything — so the recovery Region must be held to the same standard, not a relaxed one. Encrypt every backup and replica with multi-Region KMS; keep the immutable backup copy in a separate, least-privilege account so the credentials that run production cannot delete your last line of defence. Treat ransomware as a first-class DR scenario: your live cross-Region replication will faithfully copy encrypted data to the other Region, so the only recovery is the immutable, point-in-time backup — design retention and Vault Lock accordingly, and rehearse a restore from the locked vault specifically. Enable GuardDuty, Security Hub, and an organization-wide multi-Region CloudTrail so detection is symmetric across both Regions. Scope IAM per Region; a breach in the primary should not hand the attacker the recovery Region for free.

Cost optimization. This is where DR strategy selection literally is the cost decision, because the four strategies are a price ladder and the steady-state spend is dominated by what you keep running while nothing is wrong:

Strategy	Steady-state cost driver	Rough relative cost	What you’re paying for
Backup & Restore	Snapshot storage + cross-Region copy transfer	$ (lowest)	Just durable, replicated backups; zero idle compute
Pilot Light	Replicated DB storage + transfer; minimal scaffolding	$$	Live data plane; compute scaled to ~zero
Warm Standby	Above + a small always-running compute fleet	$$$	A real (if minimal) second environment, always on
Multi-Site Active-Active	A near-full second environment + cross-Region transfer	$$$$	Two live Regions; capacity for either to take 100%

The discipline is matching the strategy to the cost of downtime per workload, not to anxiety. A tier-3 internal tool on active-active is pure waste; a payment rail on backup-and-restore is negligence. Concrete levers: don’t replicate data that doesn’t need it (DynamoDB Global Tables charge replicated write capacity, Aurora Global DB charges cross-Region transfer — keep purely-regional data single-Region); run the Region B Aurora secondary smaller and scale it up as a step in the failover runbook if your RTO budget allows the extra minute; use AWS Backup lifecycle to cold storage for long-retention copies; and for Warm Standby, size Region B to the minimum that can survive the first few minutes while Auto Scaling ramps, not to full production.

Scalability. The scaling question in DR is specifically “can Region B actually absorb production load when it has to?” — and the failure mode is a Pilot Light or Warm Standby that looks ready but cannot scale fast enough, hitting service quotas (Region-specific limits on EC2 vCPUs, Elastic IPs, Lambda concurrency) or cold-start cliffs at the worst moment. Pre-raise quotas in Region B to production levels now, not during the incident. For Pilot Light especially, validate that scale-from-zero actually reaches capacity within your RTO — a fleet that takes 20 minutes to warm up turns a “10-minute RTO” into fiction.
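
A quota check of this kind is simple enough to run on a schedule. The sketch below compares what production would need in Region B against the quotas currently granted there. It is plain Python with hard-coded inputs. The quota names are made-up labels, not Service Quotas codes, and in practice the numbers would be fetched rather than typed in.

```python
from typing import Dict, List


def quota_shortfalls(
    required: Dict[str, float], quotas: Dict[str, float]
) -> List[str]:
    """List every quota that is too low for production load in Region B.

    A quota missing from `quotas` is treated as 0, so it is always reported.
    """
    problems = []
    for name, needed in sorted(required.items()):
        limit = quotas.get(name, 0)
        if limit < needed:
            problems.append(f"{name}: need {needed:g}, quota is {limit:g}")
    return problems
```

For example, `quota_shortfalls({"ec2_vcpus": 256, "elastic_ips": 6}, {"ec2_vcpus": 128, "elastic_ips": 10})` reports only the vCPU gap. An empty list is the signal that scale-up will not stall on limits, although it says nothing about how long the scale-up itself takes.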

Reliability and DR (RTO/RPO). This is the headline the whole article serves — the explicit mapping:

Strategy	RPO (data loss)	RTO (time to recover)	How it’s achieved
Backup & Restore	Hours (since last backup/copy)	Hours (build + restore)	AWS Backup cross-Region copies; IaC rebuild; snapshot restore
Pilot Light	Minutes (replication lag)	Tens of minutes (promote DB + scale compute from ~0)	Live data replication; scaffolding ready; cold compute warms on failover
Warm Standby	Seconds (replication lag)	Minutes (promote DB + scale up an already-running fleet)	Live data + minimal live compute; scale, don’t cold-start
Multi-Site Active-Active	Near-zero	Near-zero (capacity event, not failover)	Both Regions hot; multi-active data; promote-only relational writer

Two practices make these numbers real rather than aspirational. First: rehearse on a schedule. Run Aurora managed planned failover and an ARC routing-control flip in a monthly/quarterly GameDay; launch DRS drills into an isolated subnet without disrupting production. A failover path you have never executed is an RTO you cannot honestly claim. Second: watch the leading indicators. Rising Aurora AuroraGlobalDBRPOLag or DynamoDB ReplicationLatency means your RPO promise is silently degrading before any outage — alarm on them. And distinguish planned (clean, replication caught up, RPO ~0) from unplanned (region truly lost, RPO = whatever was in-flight) failover in your runbooks and your SLA claims; they are different numbers.
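
The leading-indicator alarm can be sketched as a comparison between recent replication-lag samples and a fraction of the RPO budget, firing only when several samples breach so that a single spike does not page anyone. This is a plain-Python illustration of the logic, not a CloudWatch alarm definition. The 50% warning fraction and the three-breach rule are assumptions, and lag values are expected in seconds (convert first if your metric is reported in milliseconds).

```python
from typing import Sequence


def rpo_at_risk(
    lag_seconds: Sequence[float],
    rpo_budget_seconds: float,
    warn_fraction: float = 0.5,
    breaches_to_alarm: int = 3,
) -> bool:
    """True if enough recent lag samples exceed a fraction of the RPO budget.

    `lag_seconds` holds the most recent samples of a replication-lag metric.
    """
    threshold = rpo_budget_seconds * warn_fraction
    breaches = sum(1 for lag in lag_seconds if lag > threshold)
    return breaches >= breaches_to_alarm
```

With a 5-second RPO budget the warning threshold is 2.5 seconds, so samples of `[0.3, 2.6, 3.0, 4.1, 0.4]` trip the alarm (three breaches) while `[0.1, 0.2, 0.3, 0.2, 0.1]` do not. Alarming at a fraction of the budget, rather than at the budget itself, is what makes this a leading indicator.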

Observability. Emit metrics, logs, and traces per Region (CloudWatch, X-Ray/OpenTelemetry) and aggregate into a single pane (cross-account/cross-Region CloudWatch dashboards or Datadog/Grafana). The DR-specific signals to watch: replication lag (Aurora RPO lag, DynamoDB replication latency, S3 CRR metrics), backup job success/failure and copy-job completion (a silently-failing cross-Region copy is a DR outage you discover at the worst time), ARC readiness-check status, Route 53 health-check status, and recovery-Region service-quota headroom. Build a DR readiness dashboard that answers “if we had to fail over in the next five minutes, would it work?” — and put backup-copy failures and readiness-check regressions on the on-call pager, because they are the failures that bite you precisely when you reach for the parachute.

Governance. Enforce with Service Control Policies (deny resource creation outside sanctioned Regions to stop shadow expansion; deny deletion of backup vaults), AWS Config conformance packs verifying that critical resources are actually covered by a backup plan and that replication is enabled, and a data-residency tagging scheme so a future engineer cannot accidentally replicate EU-resident personal data into a non-permitted recovery Region — a genuine compliance trap in any cross-Region design. Maintain a per-workload DR register: each workload’s tier, its agreed RTO/RPO, its chosen strategy, the date of its last successful failover test, and the owner who signed off the objectives. That register is your audit evidence and your honesty check in one document.
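
The DR register lends itself to a small data structure with one automated check: which workloads have no recent successful failover test. The sketch below is one possible shape in Python. The field names follow the list above, while the 90-day staleness limit and the names `RegisterEntry` and `overdue_tests` are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RegisterEntry:
    workload: str
    tier: int
    rto: timedelta
    rpo: timedelta
    strategy: str
    owner: str
    last_successful_test: Optional[date]  # None means never tested


def overdue_tests(
    register: List[RegisterEntry], today: date, max_age_days: int = 90
) -> List[str]:
    """Return workloads whose last successful test is missing or too old."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        entry.workload
        for entry in register
        if entry.last_successful_test is None
        or entry.last_successful_test < cutoff
    ]
```

Running `overdue_tests` in a scheduled job and failing it when the list is non-empty turns the register from a static document into a control that notices when a test has lapsed.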

Reference enterprise example

Meridian Logistics, a fictional mid-market freight and supply-chain platform, runs three customer-facing systems in us-east-1 for ~450 enterprise shippers: a shipment-tracking API and portal, an order/billing system (the financial system of record), and an analytics & reporting platform. After a six-hour us-east-1 AZ event cost them a day of degraded service and a near-miss on a major customer’s renewal, their board mandated a DR program — but their CFO refused to “build everything active-active” after seeing the quote. The CTO’s mandate became the right one: tier each system and spend per tier.

What they decided (one strategy per workload):

System	Criticality	Agreed RTO / RPO	Strategy chosen	Why
Order / Billing	Tier-1 (financial SoR)	RTO 15 min / RPO ~5 s	Warm Standby	Money can’t be lost or down for long; needs a fast, rehearsable failover but not full active-active
Shipment Tracking API/Portal	Tier-2 (customer-facing)	RTO 30 min / RPO 5 min	Pilot Light	Customers tolerate a short gap; data must be current; idle compute is wasteful
Analytics / Reporting	Tier-3 (internal + batch)	RTO 8 h / RPO 24 h	Backup & Restore	Re-runs nightly; a day-old report is fine; cheapest is correct here

How each was built:

Order/Billing (Warm Standby): Aurora PostgreSQL Global Database with the writer in us-east-1 and a promotable secondary in us-west-2 (observed AuroraGlobalDBRPOLag 40–300 ms). A minimal ECS Fargate fleet (2 tasks) runs continuously in us-west-2, ready to scale to 30. DynamoDB Global Tables hold idempotency keys for billing events. Failover is a Systems Manager Automation doc that promotes the Aurora secondary, scales the fleet, and flips an ARC routing control.
Shipment Tracking (Pilot Light): Aurora Global DB secondary live in us-west-2; ECS desired count = 0 in steady state (full VPC/ALB/security-group scaffolding present via the shared Terraform region module). On failover the runbook scales compute 0 → 16 and promotes the DB.
Analytics (Backup & Restore): AWS Backup daily snapshots of the Redshift-adjacent RDS metadata and EFS, copied cross-Region and cross-account into a Vault-Locked vault (35-day compliance-mode retention — also their ransomware backstop for all three systems). Recovery is terraform apply in us-west-2 plus snapshot restore.

Numbers:

Total DR program added ~$31k/month over the single-Region baseline of ~$86k. The breakdown of the increase tells the tiering story: Warm Standby for billing (~$19k: secondary Aurora + the always-on minimal fleet + replicated transfer), Pilot Light for tracking (~$9k: replicated data + scaffolding, near-zero idle compute), and Backup & Restore for analytics (~$3k: cross-Region/-account copy + locked-vault storage).
Had they built all three active-active, the estimate was ~$74k/month added — they saved ~$43k/month (~$516k/year) by tiering instead of gold-plating, while still meeting every RTO/RPO the business signed off.
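
The arithmetic behind these figures can be checked in a few lines. The inputs are the numbers quoted above, in thousands of USD per month. The percentage uplifts are derived here for comparison and are not stated in the text.

```python
# Monthly figures in thousands of USD, taken from the text above.
baseline = 86
tiered_added = {
    "Warm Standby (billing)": 19,
    "Pilot Light (tracking)": 9,
    "Backup & Restore (analytics)": 3,
}
active_active_added = 74

tiered_total = sum(tiered_added.values())             # 31
monthly_saving = active_active_added - tiered_total   # 43
annual_saving = monthly_saving * 12                    # 516

# Derived, not quoted: DR spend as a share of the single-Region baseline.
tiered_uplift_pct = round(100 * tiered_total / baseline)                 # 36
active_active_uplift_pct = round(100 * active_active_added / baseline)   # 86
```

Read this way, tiering adds roughly a third to the baseline bill, where building everything active-active would have nearly doubled it.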

The test that mattered: in their first quarterly GameDay they ran a managed planned Aurora failover for the billing system from us-east-1 to us-west-2 in a low-traffic window. Writes were serving from us-west-2 in ~75 seconds; the warm fleet scaled to full in ~3 minutes; the ARC routing control flipped traffic deterministically; synthetic billing transactions passed. Total measured RTO was under 6 minutes, comfortably inside the 15-minute commitment. The tracking-system Pilot Light test came in at ~22 minutes (scaling compute up from zero being the long pole), inside its 30-minute target. And restoring an analytics snapshot from the locked vault into us-west-2 took ~3 hours, well inside the 8-hour budget.

One scar they earned: their first Pilot Light test for tracking failed — Region B hit the default EC2 vCPU service quota while scaling from zero and stalled at half capacity, blowing the RTO. The fix was to pre-raise service quotas in us-west-2 to production levels as a standing item in the DR register, and to add an ARC readiness check that flags quota headroom. It is the canonical Pilot Light lesson: a standby that exists is not a standby that can scale — and you find that out in a drill or in a disaster, so make it the drill.
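The quota lesson reduces to a comparison that can run before every drill: what the standby Region must be able to launch at full scale versus what its quotas currently allow. The following is a minimal sketch of that check, not the ARC readiness check itself. The quota name and the numbers are illustrative assumptions; in practice the quota values would be read from the Service Quotas API rather than hard-coded.

```python
def quota_shortfalls(required, quotas):
    """Return {quota_name: missing_units} for every quota that is too low.

    `required` is what the standby must reach at production scale;
    `quotas` is what the recovery Region currently permits. A quota that
    is absent from `quotas` is treated as zero, so it is always flagged.
    """
    shortfalls = {}
    for name, needed in required.items():
        available = quotas.get(name, 0)
        if available < needed:
            shortfalls[name] = needed - available
    return shortfalls


if __name__ == "__main__":
    # Illustrative numbers only: a Region that can reach half of production.
    required = {"ec2_on_demand_vcpus": 256}
    region_b_quotas = {"ec2_on_demand_vcpus": 128}
    missing = quota_shortfalls(required, region_b_quotas)
    if missing:
        print("Recovery Region cannot reach production scale:", missing)
```

An empty result means every listed quota has enough room; anything else belongs in the DR register as a blocking item.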

When to use it

The decision is never “should we do DR” — it is “which strategy, for which workload, at what cost.” The clean rule: start from the agreed RTO/RPO and walk up the ladder only as far as those numbers force you.

Choose Backup & Restore when:

RTO of hours and RPO of hours are genuinely acceptable (internal tools, batch/reporting systems, anything reconstructible).
Cost sensitivity is high and the workload’s downtime cost is low.
You need it anyway as the ransomware/corruption backstop for every tier — even active-active stacks need immutable, cross-account backups, because live replication copies corruption too.

Choose Pilot Light when:

RTO of tens of minutes and RPO of minutes fit, and you want to avoid paying for idle compute.
The data must be current but the application can tolerate a short scale-up gap.
You can commit to validating scale-from-zero regularly (this is the strategy most often defeated by quotas and cold starts).

Choose Warm Standby when:

RTO of minutes and RPO of seconds are required, but full active-active is overkill or its complexity/cost isn’t justified.
You want a rehearsable, fast failover without running two full production environments.
The workload is important enough that a cold scale-up is too slow, but not so latency-global that both Regions must serve simultaneously.

Choose Multi-Site Active-Active when:

RTO/RPO of near-zero is contractual or regulatory, and the downtime cost exceeds the ~2x infrastructure premium.
You genuinely have users on multiple continents needing low write latency, not just a failover need (otherwise warm standby is cheaper and simpler).
See the dedicated Active-Active Multi-Region article — and be honest that most workloads do not need this.
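The "walk up the ladder only as far as the numbers force you" rule can be written down as a small selection function. This is a sketch under loudly stated assumptions: the best-case RTO/RPO attached to each tier below are illustrative thresholds chosen to match the rough bands in the lists above (hours, tens of minutes, minutes, near-zero), not AWS-published figures, and the RPO values in the usage example are hypothetical.

```python
from datetime import timedelta

# Cheapest first: (strategy, assumed best RTO, assumed best RPO).
# The thresholds are illustrative assumptions, not official numbers.
LADDER = [
    ("Backup & Restore", timedelta(hours=1), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=10), timedelta(minutes=1)),
    ("Warm Standby", timedelta(minutes=1), timedelta(seconds=1)),
    ("Multi-Site Active-Active", timedelta(0), timedelta(0)),
]


def choose_strategy(rto, rpo):
    """Return the cheapest tier whose assumed best-case RTO and RPO both
    fit inside the objectives the business has signed off."""
    for name, best_rto, best_rpo in LADDER:
        if best_rto <= rto and best_rpo <= rpo:
            return name
    return LADDER[-1][0]


if __name__ == "__main__":
    # RTO targets come from the reference example; RPOs are hypothetical.
    print(choose_strategy(timedelta(minutes=15), timedelta(seconds=5)))
    print(choose_strategy(timedelta(minutes=30), timedelta(minutes=5)))
    print(choose_strategy(timedelta(hours=8), timedelta(hours=24)))
```

Because the ladder is ordered by cost, the first match is by construction the cheapest tier that meets both objectives, which is the whole decision rule in one loop.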

Anti-patterns to avoid:

One strategy for the whole company. Either you over-spend on low-criticality systems or you under-protect the crown jewels. Tier per workload.
“We have backups” as a DR plan. Backups are an input to recovery; without tested IaC rebuild, restore runbooks, and a measured RTO, you have storage, not recovery.
Never testing failover. An untested plan is a document. The first real failover should not be the first failover — rehearse with GameDays, Aurora planned failover, and DRS drills.
Single-Region KMS keys / single-Region secrets on the recovery path. Quietly renders replicated data undecryptable or blocks Region B at the worst moment. Use multi-Region keys and replica secrets.
Replicating corruption and calling it resilience. Live cross-Region replication is not a backup — it dutifully copies a DROP TABLE or a ransomware encryption to the other Region. Immutable point-in-time backups are the only defence against that class of disaster.
Pilot Light that can’t scale. Pre-raise quotas, validate scale-from-zero, or your “10-minute RTO” is fiction.
DNS-only failover for latency-critical APIs. Resolver TTL caching makes DNS failover minutes-slow; use ARC routing controls plus Global Accelerator for those paths.
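Some of these anti-patterns can be caught mechanically. As one example, the single-Region KMS key problem can be linted from a list of key ARNs, because multi-Region key IDs begin with `mrk-`. The sketch below is a hypothetical helper, not an AWS tool: the ARNs are made-up examples, and it only understands key ARNs (an alias ARN would be flagged regardless of the key behind it).

```python
def single_region_keys(key_arns):
    """Return the key ARNs whose key ID lacks the multi-Region 'mrk-' prefix."""
    flagged = []
    for arn in key_arns:
        key_id = arn.rsplit("/", 1)[-1]  # text after the final '/'
        if not key_id.startswith("mrk-"):
            flagged.append(arn)
    return flagged


if __name__ == "__main__":
    # Made-up example ARNs for keys on a recovery path.
    recovery_path_keys = [
        "arn:aws:kms:us-east-1:111122223333:key/mrk-1234abcd12ab34cd56ef1234567890ab",
        "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    ]
    for arn in single_region_keys(recovery_path_keys):
        print("Single-Region key on recovery path:", arn)
```

A check like this fits naturally next to the quota check in a pre-drill pipeline; it does not replace actually decrypting replicated data in the recovery Region during a GameDay.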

Alternatives and adjacent choices:

Single-Region multi-AZ with rigorous backups + PITR is the right default for most workloads — reach for cross-Region DR only when a concrete RTO/RPO or regulatory driver forces it. Multi-AZ already handles data-centre failure.
AWS Elastic Disaster Recovery (DRS) is the pragmatic path for lift-and-shift / non-cloud-native estates that you cannot re-architect — a managed pilot-light/warm-standby for whole servers.
CloudEndure-style continuous replication and AWS Backup can coexist: use Backup for the immutable compliance/ransomware copy and DRS or native replication for the fast-recovery path. They solve different halves of “disaster.”

The honest framing for any DR review: the RTO/RPO numbers are a business decision and a price tag, not a technical preference. Get those numbers signed off per workload, build the cheapest strategy that meets them, encode the failover so it is deterministic, and prove it on a schedule. A DR plan you have tested at the lowest tier that meets your objectives beats a gold-plated one you have never run — because the parachute you have actually deployed is worth more than the one you only bought.

Written by Vinod
