
AWS Enterprise Architecture: Disaster Recovery Strategies

Every disaster recovery conversation that goes wrong starts the same way: someone asks “are we covered if a region goes down?” and someone else answers “yes, we have backups.” Those two sentences are about completely different things, and the gap between them is where outages turn into resignations. “Covered” is a question about how much data you can afford to lose and how long you can afford to be down — RPO and RTO — and the honest answer is a number with a price tag, not a yes. Disaster recovery on AWS is not one architecture; it is a spectrum of four, each buying you a tighter RTO/RPO for more money and more operational discipline. The art is choosing the cheapest one that meets your actual recovery objectives, building it so it genuinely works, and — the part everyone skips — proving it works on a schedule.

This article lays out the four canonical AWS DR strategies as the AWS whitepaper Disaster Recovery of Workloads on AWS frames them (Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active), but as a single decision framework rather than four disconnected diagrams. We will anchor everything to RTO (recovery time objective: how long until you are serving again) and RPO (recovery point objective: how much recent data you lose), because those two numbers are the only honest way to compare them. Along the way: the AWS services that implement each tier (AWS Backup, S3 Cross-Region Replication, Aurora Global Database, DynamoDB Global Tables, AWS Elastic Disaster Recovery, Route 53 Application Recovery Controller), how to express them in Terraform, how to orchestrate an actual failover, and how to keep the bill from quietly turning a passive standby into an expensive insurance policy nobody has ever cashed.

The business scenario

The thing that makes DR universally relevant is that the driver is never “we want DR” — it is a specific loss the business cannot absorb, and that loss looks different at every size of company while the underlying maths is identical.

Whatever the size of the company, the same uncomfortable trio of facts applies. First, the failures that actually require DR are correlated and large: an AZ-spanning power or network event, a regional control-plane degradation, a bad region-wide config change, ransomware that encrypts your primary and anything it can reach, or a data-corruption bug that faithfully replicates to your replica. Multi-AZ (which you should already have) handles the small ones; DR is about the big ones, including the ones that come from inside the house. Second, the cost of recovery is dominated by what you keep running while nothing is wrong: a hot standby costs the same on a quiet Tuesday as during a disaster, so over-provisioning DR is a tax you pay every single day for an event that may never come. Third, an untested recovery plan is not a plan. It is a document, and documents do not fail over.

The problem this architecture solves, stated precisely: map each workload to a recovery objective (RTO/RPO) the business has actually signed off on, implement the cheapest DR strategy that meets it using native AWS services, orchestrate the failover so it is deterministic rather than heroic, and verify it on a recurring schedule so the RTO/RPO you claim is the RTO/RPO you can deliver. The rest of this article is the map from “RTO/RPO target” to “which of the four, built how.”

Architecture overview

Figure: AWS disaster recovery reference architecture. A primary Region (us-east-1) replicates to a recovery Region (us-west-2) across four strategy lanes (Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active) under a shared Route 53 / ARC / Global Accelerator failover control plane, with an immutable cross-account backup vault as the ransomware backstop.

The four strategies are best understood as one axis — increasing investment in pre-provisioned, pre-warmed, pre-replicated infrastructure in a recovery Region — sliding RTO and RPO from “hours/hours” down toward “near-zero/near-zero.” Picture a primary Region (say us-east-1) serving production, and a recovery Region (say us-west-2). The difference between the four strategies is entirely about what is already running, and how current the data is, in us-west-2 before disaster strikes.

Strategy 1 — Backup & Restore (RPO: hours; RTO: hours). Nothing runs in the recovery Region in steady state. AWS Backup takes scheduled snapshots of EBS volumes, RDS/Aurora clusters, DynamoDB tables, EFS, and more, and copies them cross-Region (and ideally cross-account, into a locked vault) on a schedule. Application artifacts live as container images in ECR with cross-Region replication and IaC in version control. When disaster strikes, you build the recovery environment from IaC and restore data from the latest copied backups. The data path in steady state is just “snapshot → copy to Region B → sit in a vault.” The recovery path is “terraform apply in Region B, restore snapshots, repoint DNS.” It is the cheapest by far and the slowest by far.

Strategy 2 — Pilot Light (RPO: minutes; RTO: tens of minutes). The data layer is live and continuously replicating to the recovery Region, but compute is switched off or minimal. An Aurora Global Database secondary cluster sits in us-west-2 receiving storage-level replication (sub-second lag); DynamoDB Global Tables replicate continuously; S3 Cross-Region Replication mirrors objects. The “pilot light” is exactly this always-warm data plus the core scaffolding — the VPC, subnets, security groups, and a scaled-to-zero (or minimal) compute definition — kept current but not serving. On failover you promote the Aurora secondary, scale the compute up from zero to production size, and repoint traffic. You pay for replicated storage and data transfer, but almost nothing for idle compute.

Strategy 3 — Warm Standby (RPO: seconds; RTO: minutes). A fully functional but under-scaled copy of the workload runs in the recovery Region all the time. The data layer replicates as in Pilot Light, and a smaller version of the compute fleet — fewer/smaller ECS tasks or EC2 instances, a smaller Aurora reader — is actually running and could serve traffic right now, just not at full capacity. Failover is promote the database, scale the already-running fleet up, and shift traffic — faster than Pilot Light because nothing has to cold-start from zero. You pay for a continuously-running (if minimal) second environment.

Strategy 4 — Multi-Site Active-Active (RPO: near-zero; RTO: near-zero). Both Regions serve live production traffic simultaneously. There is no “failover” of compute because both fleets are already hot at full-ish capacity; DynamoDB Global Tables are multi-active (both Regions write locally); Aurora Global Database has one writer but a promotable secondary. A regional loss is a capacity event, not a recovery event. This is the most expensive and the most complex, and it gets its own deep treatment in the companion Active-Active Multi-Region article — here it is the top rung of the ladder.

The common control plane across all four: traffic steering and failover orchestration are the same machinery regardless of strategy. Amazon Route 53 provides DNS with health-check-based failover records; Route 53 Application Recovery Controller (ARC) provides routing controls (deterministic on/off switches you flip to redirect traffic, instead of hoping health checks fire correctly) and readiness checks (continuous verification that the standby is actually capable of taking load). For non-cacheable latency-critical APIs, AWS Global Accelerator fails over at the network layer in seconds rather than waiting on DNS TTLs. And for server-based workloads that are hard to re-platform, AWS Elastic Disaster Recovery (DRS) continuously block-level-replicates whole servers into a low-cost staging area in the recovery Region and launches them on demand — effectively a managed Pilot Light / Warm Standby for lift-and-shift estates.
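The DNS half of that control plane can be sketched in Terraform as a pair of failover records. This is a minimal sketch under assumptions: the hosted zone variable, the hostnames under example.com, and the thresholds are all hypothetical, and a real deployment would more likely use alias records pointing at load balancers.

```hcl
variable "zone_id" {
  type        = string
  description = "Hosted zone ID (hypothetical input)"
}

# Health check against a deep /health endpoint in the primary Region.
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-use1.example.com" # hypothetical regional hostname
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health"
  request_interval  = 30
  failure_threshold = 3
}

# PRIMARY record: served while the health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "api.example.com"
  type            = "CNAME"
  ttl             = 60 # low TTL so clients re-resolve quickly
  records         = ["api-use1.example.com"]
  set_identifier  = "primary-use1"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# SECONDARY record: returned only when the primary is unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["api-usw2.example.com"] # hypothetical regional hostname
  set_identifier = "secondary-usw2"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Health-check failover like this is automatic but only as good as the endpoint it probes, which is why the /health path should exercise real dependencies rather than return a static 200.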

The single most important thing this overview should make obvious: you do not pick one strategy for the company — you pick one per workload, based on its RTO/RPO tier. A mature enterprise runs all four at once: active-active for the payment rail, warm standby for the order system, pilot light for the reporting platform, and backup & restore for the internal wiki — each priced to its actual cost of downtime.

Component breakdown

| Component | AWS service | Role in DR | Key configuration choices |
|---|---|---|---|
| Cross-Region backup | AWS Backup | The foundation of Backup & Restore; defence against corruption/ransomware for all tiers | Backup plans with cross-Region copy + cross-account copy to an isolated account; Vault Lock (compliance mode) for immutability; lifecycle to cold storage; backup of RDS, EBS, DynamoDB, EFS, S3, Aurora |
| Relational replication | Aurora Global Database | Live cross-Region data plane for Pilot Light / Warm Standby / Active-Active | One global cluster; primary writer + ≥1 secondary Region; storage-layer replication (typically <1s lag); managed planned failover for drills; target RPO ~1s |
| Relational (RDS, non-Aurora) | RDS cross-Region read replica | Same idea for MySQL/PostgreSQL/MariaDB on RDS | Cross-Region read replica that can be promoted to standalone; async replication (RPO = replication lag, seconds–minutes) |
| NoSQL replication | DynamoDB Global Tables | Multi-active key/document data; zero-failover for the tiers that use it | Global Tables v2 (2019.11.21); PITR enabled; design for last-writer-wins; watch ReplicationLatency |
| Object replication | S3 Cross-Region Replication (CRR) | Mirror uploads, exports, static origins, and backups to Region B | Versioning on; CRR rules (optionally RTC for a 15-min replication SLA); replication metrics; bidirectional where both Regions write |
| Server-based DR | AWS Elastic Disaster Recovery (DRS) | Continuous block replication of whole EC2/on-prem servers; managed pilot-light/warm-standby for lift-and-shift | Low-cost staging area with cheap instances + EBS; point-in-time recovery snapshots; launch templates for recovery; supports drill launches into an isolated subnet |
| Container artifacts | Amazon ECR (cross-Region replication) | Make the same image digest available in Region B for fast rebuild/scale-up | Registry replication rules to Region B; deploy by digest, not tag |
| DNS failover | Route 53 | Steer traffic to the healthy Region | Failover or latency routing; health checks on a deep /health endpoint; low TTL (30–60s); Evaluate Target Health on alias records |
| Failover orchestration | Route 53 Application Recovery Controller (ARC) | Deterministic, audited Region switch + standby readiness assurance | Routing controls (manual/automated on-off) gated by safety rules; readiness checks per resource type; multi-Region cluster of 5 endpoints |
| Network-layer failover | AWS Global Accelerator | Sub-DNS failover for non-cacheable latency-critical APIs | Two anycast IPs; endpoint groups per Region; traffic dials; health checks at the network layer |
| Keys & secrets | KMS (multi-Region keys) + Secrets Manager (replica secrets) | Ensure encrypted data and credentials are usable in Region B | Multi-Region KMS keys so replicated data decrypts locally; replica secrets so Region B reads DB creds without a cross-Region call |
| IaC + pipeline | Terraform + CodePipeline/GitHub Actions | Recreate compute/networking in Region B on demand (Backup & Restore / Pilot Light) | Region-parameterised modules; recovery environment is terraform apply, not click-ops; pipeline can deploy Region B independently |
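Two of the replication rows are short enough to sketch directly. The following is an illustrative fragment, not a complete data tier: the table name, key schema, and billing mode are hypothetical, and it assumes provider aliases aws.use1 (us-east-1) and aws.usw2 (us-west-2) are configured.

```hcl
# DynamoDB global table (version 2019.11.21): replicas are declared on the table itself.
resource "aws_dynamodb_table" "sessions" {
  provider         = aws.use1
  name             = "sessions" # hypothetical table
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "pk"
  stream_enabled   = true                 # streams are required for replicas
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "pk"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  replica {
    region_name            = "us-west-2"
    point_in_time_recovery = true
  }
}

data "aws_caller_identity" "current" {
  provider = aws.use1
}

# ECR registry-level replication: every pushed image is copied to Region B,
# so the same digest can be deployed there.
resource "aws_ecr_replication_configuration" "to_usw2" {
  provider = aws.use1

  replication_configuration {
    rule {
      destination {
        region      = "us-west-2"
        registry_id = data.aws_caller_identity.current.account_id
      }
    }
  }
}
```

Deploying by digest rather than tag matters here because a tag can point at different images in the two Regions during a replication delay, while a digest cannot.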

A few component choices deserve their why, not just their what:

Why AWS Backup with cross-Region and cross-account copy and Vault Lock, not just snapshots? Because the disaster that most often actually happens is not a region vanishing — it is deletion or encryption of your data, whether by a bad actor, ransomware, or a mistake. A snapshot in the same account that the same compromised credentials can delete is not a backup; it is a hostage. Copying backups into a separate, locked account with Vault Lock in compliance mode (which even the root user cannot shorten or delete before the retention period) is what turns “we have snapshots” into “we can actually recover from ransomware.” This single control protects every tier, including active-active stacks whose live replication would have dutifully copied the corruption to the other Region.

Why Aurora Global Database over RDS cross-Region read replicas for the higher tiers? Both give you a promotable copy in Region B, but Aurora Global Database replicates at the storage layer with typically sub-second lag and far lower RPO, supports managed planned failover (a clean, ~1-minute switch you can rehearse), and decouples replication from the database engine’s own load. RDS cross-Region read replicas use engine-level async replication — perfectly fine for Backup & Restore / Pilot Light on smaller MySQL/PostgreSQL workloads, but with looser, more variable lag. Choose Aurora Global DB when your RPO budget is sub-second and you intend to rehearse failover regularly.

Why Elastic Disaster Recovery (DRS) instead of re-architecting? Because a huge share of enterprise estates is not cloud-native — it is lift-and-shift EC2 (or still on-prem) running software that nobody is going to re-platform onto Fargate and Aurora just to get DR. DRS continuously replicates the entire server (OS, app, data) as block-level changes into a cheap staging area in Region B, and on failover (or a non-disruptive drill) launches production-sized instances from the latest point-in-time. It is the pragmatic way to put a Pilot-Light-grade RTO/RPO on workloads you cannot or will not refactor — and it is far better than a quarterly AMI copy.

Why Route 53 ARC and not just health-check failover? Because DNS health checks fail in the messy middle — a partial brownout where the primary is sick enough to lose data but healthy enough to pass a shallow health check, or healthy enough that flapping health checks bounce traffic back into a degraded Region. ARC routing controls are deterministic switches you (or an automated runbook) flip with intent, protected by safety rules (e.g. “never turn both Regions off,” “always keep at least one on”). ARC readiness checks continuously answer the question that actually matters before you fail over: is the standby genuinely ready to take this load right now? For tier-1 systems, that determinism is worth the added moving part.
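A rough Terraform sketch of the routing-control idea follows. The resource names are hypothetical, the five-second wait period is arbitrary, and it assumes a provider alias aws.usw2 for us-west-2, where the ARC recovery-control configuration API is assumed to be served; verify the endpoint for your setup.

```hcl
resource "aws_route53recoverycontrolconfig_cluster" "dr" {
  provider = aws.usw2
  name     = "dr-cluster"
}

resource "aws_route53recoverycontrolconfig_control_panel" "ord" {
  provider    = aws.usw2
  name        = "ord-failover"
  cluster_arn = aws_route53recoverycontrolconfig_cluster.dr.arn
}

# One on/off switch per Region.
resource "aws_route53recoverycontrolconfig_routing_control" "use1" {
  provider          = aws.usw2
  name              = "ord-use1"
  cluster_arn       = aws_route53recoverycontrolconfig_cluster.dr.arn
  control_panel_arn = aws_route53recoverycontrolconfig_control_panel.ord.arn
}

resource "aws_route53recoverycontrolconfig_routing_control" "usw2" {
  provider          = aws.usw2
  name              = "ord-usw2"
  cluster_arn       = aws_route53recoverycontrolconfig_cluster.dr.arn
  control_panel_arn = aws_route53recoverycontrolconfig_control_panel.ord.arn
}

# Assertion rule: at least one of the two controls must stay On.
resource "aws_route53recoverycontrolconfig_safety_rule" "at_least_one_on" {
  provider          = aws.usw2
  name              = "at-least-one-region-on"
  control_panel_arn = aws_route53recoverycontrolconfig_control_panel.ord.arn
  wait_period_ms    = 5000

  asserted_controls = [
    aws_route53recoverycontrolconfig_routing_control.use1.arn,
    aws_route53recoverycontrolconfig_routing_control.usw2.arn,
  ]

  rule_config {
    inverted  = false
    threshold = 1
    type      = "ATLEAST"
  }
}

# A health check whose state mirrors the routing control; attach it to the
# us-east-1 DNS record so flipping the control moves traffic.
resource "aws_route53_health_check" "use1_control" {
  type                = "RECOVERY_CONTROL"
  routing_control_arn = aws_route53recoverycontrolconfig_routing_control.use1.arn
}
```

The design point is that the DNS record follows a switch a human or runbook set on purpose, not a probe result that can flap during a brownout.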

Implementation guidance

Infrastructure as Code is the load-bearing wall of Backup & Restore and Pilot Light, because in those strategies the recovery environment’s compute and networking do not exist (or barely exist) until you create them. If “recovery” means an engineer hand-clicking a VPC together under pressure at 3am, your real RTO is “however long that takes, plus the mistakes.” Express the workload as a region-parameterised Terraform module so that standing up Region B is terraform apply against a second provider alias — not archaeology.

The clean structure:
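One possible layout, sketched under assumptions: a default provider plus two aliased providers named aws.use1 and aws.usw2, and the primary Region instantiated from a shared ./modules/region module. The provider version pin and the primary fleet size of 6 are illustrative values, not taken from a real deployment.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumed pin
    }
  }
}

# Default provider for resources that do not set `provider` explicitly.
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "use1"
  region = "us-east-1" # primary Region
}

provider "aws" {
  alias  = "usw2"
  region = "us-west-2" # recovery Region
}

# The same module is instantiated once per Region; only the provider and sizing differ.
module "region_use1" {
  source    = "./modules/region"
  providers = { aws = aws.use1 }

  ecs_desired_count = 6  # hypothetical steady-state production size
  ecs_max_count     = 40 # production ceiling
}
```

Because Region B is just a second instantiation of the same module, drift between the two environments shows up as a Terraform plan diff rather than as a surprise during failover.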

Terraform shape for the AWS Backup plan with cross-Region copy and an immutable vault (illustrative):

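resource "aws_backup_vault" "dr" {
  # Vault in the recovery Region. For cross-account isolation, point this
  # provider alias at a separate backup account rather than the workload account.
  provider    = aws.usw2
  name        = "dr-locked-vault"
  kms_key_arn = aws_kms_key.backup_usw2.arn
}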

# Compliance-mode lock: cannot be deleted/shortened before retention elapses
resource "aws_backup_vault_lock_configuration" "dr" {
  provider            = aws.usw2
  backup_vault_name   = aws_backup_vault.dr.name
  min_retention_days  = 35
  changeable_for_days = 3                    # cooling-off before lock is permanent
}

resource "aws_backup_plan" "core" {
  name = "core-cross-region"

  rule {
    rule_name         = "daily-35d"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)"  # 05:00 UTC daily
    start_window      = 60
    completion_window = 180
    lifecycle { delete_after = 35 }

    copy_action {                            # the DR-critical part
      destination_vault_arn = aws_backup_vault.dr.arn   # different Region + account
      lifecycle { delete_after = 35 }
    }
  }
}
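A backup plan on its own protects nothing until resources are assigned to it. A minimal sketch of that assignment follows, assuming tag-based selection; the tag key and value and the role name are hypothetical, while the attached policy is the AWS-managed backup service role policy.

```hcl
data "aws_iam_policy_document" "backup_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["backup.amazonaws.com"]
    }
  }
}

# Role that AWS Backup assumes to snapshot and copy resources.
resource "aws_iam_role" "backup" {
  name               = "dr-backup-service-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.backup_assume.json
}

resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

# Everything tagged backup-plan=core is picked up by the plan above.
resource "aws_backup_selection" "core" {
  name         = "core-by-tag"
  plan_id      = aws_backup_plan.core.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "backup-plan" # hypothetical tag key
    value = "core"
  }
}
```

Tag-based selection keeps coverage declarative: a new database joins the plan by carrying the tag, which is also something a Config rule can audit.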

Terraform shape for the Aurora Global Database data plane (the Pilot Light / Warm Standby relational tier):

resource "aws_rds_global_cluster" "this" {
  global_cluster_identifier = "ord-global"
  engine                    = "aurora-postgresql"
  engine_version            = "16.4"
  storage_encrypted         = true
}

resource "aws_rds_cluster" "primary" {          # writer, us-east-1
  provider                    = aws.use1
  cluster_identifier          = "ord-use1"
  engine                      = aws_rds_global_cluster.this.engine
  engine_version              = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier   = aws_rds_global_cluster.this.id
  master_username             = var.db_user
  manage_master_user_password = true            # secret in Secrets Manager, not state
  storage_encrypted           = true            # must be true when kms_key_id is set
  kms_key_id                  = aws_kms_key.use1.arn
  db_subnet_group_name        = module.region_use1.db_subnet_group
}

resource "aws_rds_cluster" "secondary" {        # promotable reader, us-west-2
  provider                  = aws.usw2
  cluster_identifier        = "ord-usw2"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  storage_encrypted         = true              # must be true when kms_key_id is set
  kms_key_id                = aws_kms_key.usw2.arn
  db_subnet_group_name      = module.region_usw2.db_subnet_group
  depends_on                = [aws_rds_cluster.primary]
}

And the Region B compute as a single variable away from Pilot Light vs Warm Standby:

module "region_usw2" {
  source    = "./modules/region"
  providers = { aws = aws.usw2 }

  # Pilot Light: desired_count = 0  (scaffolding only, scaled to zero)
  # Warm Standby: desired_count = 2 (minimal live fleet, ready to scale)
  ecs_desired_count = var.dr_warm ? 2 : 0
  ecs_max_count     = 40            # full production ceiling on failover
}
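Inside the region module, those two inputs might feed an Application Auto Scaling target for the ECS service. This is a sketch of what such a module could contain, not the module's actual contents: the cluster and service name variables, the policy name, and the 60% CPU target are assumptions.

```hcl
variable "ecs_desired_count" {
  type = number
}

variable "ecs_max_count" {
  type = number
}

variable "ecs_cluster_name" {
  type = string # hypothetical input
}

variable "ecs_service_name" {
  type = string # hypothetical input
}

# The floor is the Pilot Light / Warm Standby switch; the ceiling is production size.
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/${var.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = var.ecs_desired_count
  max_capacity       = var.ecs_max_count
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 60 # assumed CPU utilisation target, in percent

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

One caveat worth designing for: with a floor of zero there are no running tasks to emit CPU metrics, so target tracking alone is unlikely to start the fleet. In a Pilot Light setup the failover runbook would raise the minimum capacity explicitly.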

Networking and identity. Keep request handling in-Region; only data replication should cross Regions, and it travels the AWS backbone natively for Aurora, DynamoDB, and S3, so you do not need a hot-path VPC peering for the user flow. Put Gateway VPC endpoints for S3 and DynamoDB and Interface endpoints for the rest in each Region so data-plane traffic stays off NAT and the internet. Use multi-Region KMS keys so an encrypted snapshot or replicated object decrypts in Region B under the local key replica; a single-Region key is a silent way to make your “recovered” data unreadable. Put DB credentials in Secrets Manager replica secrets so Region B never makes a cross-Region Secrets Manager call on the recovery path. For human and machine identity, use one AWS Organization with IAM Identity Center for SSO, plus per-Region IAM roles scoped to that Region’s resource ARNs (IRSA on EKS). A compromised task in Region A should have no standing path to Region B beyond what replication already grants.
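The key and secret replication described above might look like the following in Terraform. It is a sketch under assumptions: the resource names and the secret name are hypothetical, and it presumes provider aliases aws.use1 and aws.usw2 exist.

```hcl
# Multi-Region primary key in us-east-1.
resource "aws_kms_key" "app_use1" {
  provider            = aws.use1
  description         = "Application data key (multi-Region primary)"
  multi_region        = true
  enable_key_rotation = true
}

# Replica of the same key material in us-west-2, so replicated
# ciphertext can be decrypted locally in the recovery Region.
resource "aws_kms_replica_key" "app_usw2" {
  provider        = aws.usw2
  description     = "Application data key (replica)"
  primary_key_arn = aws_kms_key.app_use1.arn
}

# Secret with a managed replica in us-west-2, encrypted under the local key replica.
resource "aws_secretsmanager_secret" "db" {
  provider   = aws.use1
  name       = "ord/db-credentials" # hypothetical secret name
  kms_key_id = aws_kms_key.app_use1.arn

  replica {
    region     = "us-west-2"
    kms_key_id = aws_kms_replica_key.app_usw2.arn
  }
}
```

With this shape, nothing on the Region B recovery path needs to reach back into us-east-1 for a key or a credential.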

The failover runbook itself must be code, not prose. Whatever the strategy, the switch should be an executable sequence (a Systems Manager Automation document or a Step Functions state machine) that: (1) confirms the standby’s readiness (ARC readiness check), (2) promotes the data tier (switchover-global-cluster for a planned switch with replication caught up, or failover-global-cluster with --allow-data-loss when the primary Region is truly lost), (3) scales Region B compute to production size, (4) flips the ARC routing control (and/or Global Accelerator traffic dial) to send traffic to Region B, and (5) verifies synthetic transactions succeed before declaring victory. Backup & Restore adds a step zero: terraform apply the Region B environment and restore the latest copied backups. Encoding this is what collapses RTO from “however long the on-call figures it out” to a predictable number.
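The five steps can be sketched as a Step Functions state machine defined in Terraform. To keep the sketch honest, each step is delegated to a Lambda function that is purely hypothetical here (supplied through a variable), as is the execution role; only the ordering and the abort path are the point. It is deployed through an assumed aws.usw2 alias so the runbook lives in the recovery Region, not in the Region that may be down.

```hcl
variable "runbook_lambda_arns" {
  type        = map(string)
  description = "Hypothetical step functions, keyed: check_readiness, promote_data, scale_compute, flip_routing, verify_synthetics"
}

variable "runbook_role_arn" {
  type        = string
  description = "Hypothetical execution role allowed to invoke the functions"
}

locals {
  # Fields shared by every step: invoke a Lambda, pass its result on, abort on any error.
  runbook_task = {
    Type       = "Task"
    Resource   = "arn:aws:states:::lambda:invoke"
    OutputPath = "$.Payload"
    Catch      = [{ ErrorEquals = ["States.ALL"], Next = "Abort" }]
  }
}

resource "aws_sfn_state_machine" "failover" {
  provider = aws.usw2
  name     = "dr-failover-runbook"
  role_arn = var.runbook_role_arn

  definition = jsonencode({
    StartAt = "CheckReadiness"
    States = {
      CheckReadiness = merge(local.runbook_task, {
        Parameters = { FunctionName = var.runbook_lambda_arns["check_readiness"], "Payload.$" = "$" }
        Next       = "PromoteData"
      })
      PromoteData = merge(local.runbook_task, {
        Parameters = { FunctionName = var.runbook_lambda_arns["promote_data"], "Payload.$" = "$" }
        Next       = "ScaleCompute"
      })
      ScaleCompute = merge(local.runbook_task, {
        Parameters = { FunctionName = var.runbook_lambda_arns["scale_compute"], "Payload.$" = "$" }
        Next       = "FlipRouting"
      })
      FlipRouting = merge(local.runbook_task, {
        Parameters = { FunctionName = var.runbook_lambda_arns["flip_routing"], "Payload.$" = "$" }
        Next       = "VerifySynthetics"
      })
      VerifySynthetics = merge(local.runbook_task, {
        Parameters = { FunctionName = var.runbook_lambda_arns["verify_synthetics"], "Payload.$" = "$" }
        End        = true
      })
      Abort = {
        Type  = "Fail"
        Error = "FailoverAborted"
        Cause = "A runbook step failed; escalate to the on-call."
      }
    }
  })
}
```

Every execution leaves a timestamped history of each step, which doubles as the measured RTO for the drill.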

Enterprise considerations

Security and Zero Trust. DR widens your attack surface — there is now a second copy of everything — so the recovery Region must be held to the same standard, not a relaxed one. Encrypt every backup and replica with multi-Region KMS; keep the immutable backup copy in a separate, least-privilege account so the credentials that run production cannot delete your last line of defence. Treat ransomware as a first-class DR scenario: your live cross-Region replication will faithfully copy encrypted data to the other Region, so the only recovery is the immutable, point-in-time backup — design retention and Vault Lock accordingly, and rehearse a restore from the locked vault specifically. Enable GuardDuty, Security Hub, and an organization-wide multi-Region CloudTrail so detection is symmetric across both Regions. Scope IAM per Region; a breach in the primary should not hand the attacker the recovery Region for free.

Cost optimization. This is where DR strategy selection literally is the cost decision, because the four strategies are a price ladder and the steady-state spend is dominated by what you keep running while nothing is wrong:

| Strategy | Steady-state cost driver | Rough relative cost | What you’re paying for |
|---|---|---|---|
| Backup & Restore | Snapshot storage + cross-Region copy transfer | $ (lowest) | Just durable, replicated backups; zero idle compute |
| Pilot Light | Replicated DB storage + transfer; minimal scaffolding | $$ | Live data plane; compute scaled to ~zero |
| Warm Standby | Above + a small always-running compute fleet | $$$ | A real (if minimal) second environment, always on |
| Multi-Site Active-Active | A near-full second environment + cross-Region transfer | $$$$ | Two live Regions; capacity for either to take 100% |

The discipline is matching the strategy to the cost of downtime per workload, not to anxiety. A tier-3 internal tool on active-active is pure waste; a payment rail on backup-and-restore is negligence. Concrete levers: don’t replicate data that doesn’t need it (DynamoDB Global Tables charge replicated write capacity, Aurora Global DB charges cross-Region transfer — keep purely-regional data single-Region); run the Region B Aurora secondary smaller and scale it up as a step in the failover runbook if your RTO budget allows the extra minute; use AWS Backup lifecycle to cold storage for long-retention copies; and for Warm Standby, size Region B to the minimum that can survive the first few minutes while Auto Scaling ramps, not to full production.

Scalability. The scaling question in DR is specifically “can Region B actually absorb production load when it has to?” — and the failure mode is a Pilot Light or Warm Standby that looks ready but cannot scale fast enough, hitting service quotas (Region-specific limits on EC2 vCPUs, Elastic IPs, Lambda concurrency) or cold-start cliffs at the worst moment. Pre-raise quotas in Region B to production levels now, not during the incident. For Pilot Light especially, validate that scale-from-zero actually reaches capacity within your RTO — a fleet that takes 20 minutes to warm up turns a “10-minute RTO” into fiction.
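Quota headroom can be kept in code alongside everything else. A minimal sketch, assuming an aws.usw2 provider alias: the target values are hypothetical, and the quota codes shown are believed to be the EC2 On-Demand Standard vCPU and Lambda concurrent-executions quotas, so confirm them with `aws service-quotas list-service-quotas` before relying on them.

```hcl
# Request recovery-Region quotas at production levels ahead of time.
resource "aws_servicequotas_service_quota" "usw2_ondemand_vcpus" {
  provider     = aws.usw2
  service_code = "ec2"
  quota_code   = "L-1216C47A" # Running On-Demand Standard instances (vCPUs); verify
  value        = 512          # hypothetical production requirement
}

resource "aws_servicequotas_service_quota" "usw2_lambda_concurrency" {
  provider     = aws.usw2
  service_code = "lambda"
  quota_code   = "L-B99A9384" # Concurrent executions; verify
  value        = 3000         # hypothetical production requirement
}
```

Quota increases are requests that AWS may take time to approve, which is exactly why they belong in steady-state configuration and not in the incident runbook.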

Reliability and DR (RTO/RPO). This is the headline the whole article serves — the explicit mapping:

| Strategy | RPO (data loss) | RTO (time to recover) | How it’s achieved |
|---|---|---|---|
| Backup & Restore | Hours (since last backup/copy) | Hours (build + restore) | AWS Backup cross-Region copies; IaC rebuild; snapshot restore |
| Pilot Light | Minutes (replication lag) | Tens of minutes (promote DB + scale compute from ~0) | Live data replication; scaffolding ready; cold compute warms on failover |
| Warm Standby | Seconds (replication lag) | Minutes (promote DB + scale up an already-running fleet) | Live data + minimal live compute; scale, don’t cold-start |
| Multi-Site Active-Active | Near-zero | Near-zero (capacity event, not failover) | Both Regions hot; multi-active data; promote-only relational writer |

Two practices make these numbers real rather than aspirational. First: rehearse on a schedule. Run Aurora managed planned failover and an ARC routing-control flip in a monthly/quarterly GameDay; launch DRS drills into an isolated subnet without disrupting production. A failover path you have never executed is an RTO you cannot honestly claim. Second: watch the leading indicators. Rising Aurora AuroraGlobalDBRPOLag or DynamoDB ReplicationLatency means your RPO promise is silently degrading before any outage — alarm on them. And distinguish planned (clean, replication caught up, RPO ~0) from unplanned (region truly lost, RPO = whatever was in-flight) failover in your runbooks and your SLA claims; they are different numbers.
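The leading-indicator alarms could be expressed roughly as follows. Treat this as a sketch with several assumptions: the thresholds, the table name, and the SNS topic variable are hypothetical, both metrics are assumed to be reported in milliseconds, and the Aurora alarm assumes the metric is published against the secondary cluster in the recovery Region, which you should confirm in your own CloudWatch console.

```hcl
variable "pager_topic_arns" {
  type        = map(string)
  description = "Hypothetical SNS topic ARNs keyed by Region, e.g. us-east-1 and us-west-2"
}

# Aurora Global Database: alarm when RPO lag stays above 5 seconds.
resource "aws_cloudwatch_metric_alarm" "aurora_rpo_lag" {
  provider            = aws.usw2
  alarm_name          = "ord-global-rpo-lag"
  namespace           = "AWS/RDS"
  metric_name         = "AuroraGlobalDBRPOLag"
  dimensions          = { DBClusterIdentifier = "ord-usw2" } # assumed emitting cluster
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5000 # milliseconds (assumed unit)
  alarm_actions       = [var.pager_topic_arns["us-west-2"]]
}

# DynamoDB Global Tables: alarm on replication latency toward us-west-2.
resource "aws_cloudwatch_metric_alarm" "ddb_replication_latency" {
  provider            = aws.use1
  alarm_name          = "sessions-replication-latency"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ReplicationLatency"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 10000 # milliseconds (assumed unit)
  alarm_actions       = [var.pager_topic_arns["us-east-1"]]

  dimensions = {
    TableName       = "sessions" # hypothetical table
    ReceivingRegion = "us-west-2"
  }
}
```

Set each threshold from the workload's agreed RPO rather than from observed averages, so the alarm fires when the promise is at risk and not merely when the number moves.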

Observability. Emit metrics, logs, and traces per Region (CloudWatch, X-Ray/OpenTelemetry) and aggregate into a single pane (cross-account/cross-Region CloudWatch dashboards or Datadog/Grafana). The DR-specific signals to watch: replication lag (Aurora RPO lag, DynamoDB replication latency, S3 CRR metrics), backup job success/failure and copy-job completion (a silently-failing cross-Region copy is a DR outage you discover at the worst time), ARC readiness-check status, Route 53 health-check status, and recovery-Region service-quota headroom. Build a DR readiness dashboard that answers “if we had to fail over in the next five minutes, would it work?” — and put backup-copy failures and readiness-check regressions on the on-call pager, because they are the failures that bite you precisely when you reach for the parachute.

Governance. Enforce with Service Control Policies (deny resource creation outside sanctioned Regions to stop shadow expansion; deny deletion of backup vaults), AWS Config conformance packs verifying that critical resources are actually covered by a backup plan and that replication is enabled, and a data-residency tagging scheme so a future engineer cannot accidentally replicate EU-resident personal data into a non-permitted recovery Region — a genuine compliance trap in any cross-Region design. Maintain a per-workload DR register: each workload’s tier, its agreed RTO/RPO, its chosen strategy, the date of its last successful failover test, and the owner who signed off the objectives. That register is your audit evidence and your honesty check in one document.
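The two SCP guardrails named above might be sketched like this. It assumes Terraform runs with Organizations management-account credentials; the OU variable is hypothetical, and the list of exempted global services is illustrative and deliberately incomplete, so extend it for the services you actually use.

```hcl
variable "workloads_ou_id" {
  type        = string
  description = "Hypothetical organizational unit to attach the guardrails to"
}

data "aws_iam_policy_document" "dr_guardrails" {
  # Nobody in member accounts may delete a vault or remove its lock.
  statement {
    sid       = "DenyBackupVaultTampering"
    effect    = "Deny"
    actions   = ["backup:DeleteBackupVault", "backup:DeleteBackupVaultLockConfiguration"]
    resources = ["*"]
  }

  # Deny regional API calls outside the two sanctioned Regions.
  statement {
    sid         = "DenyUnsanctionedRegions"
    effect      = "Deny"
    not_actions = ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "sts:*", "support:*"]
    resources   = ["*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:RequestedRegion"
      values   = ["us-east-1", "us-west-2"]
    }
  }
}

resource "aws_organizations_policy" "dr_guardrails" {
  name    = "dr-guardrails"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.dr_guardrails.json
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.dr_guardrails.id
  target_id = var.workloads_ou_id
}
```

Roll a Region-deny policy out to a test OU first; a missing exemption for a global service tends to surface as confusing access-denied errors in unrelated tooling.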

Reference enterprise example

Meridian Logistics, a fictional mid-market freight and supply-chain platform, runs three systems in us-east-1 for ~450 enterprise shippers: a shipment-tracking API and portal, an order/billing system (the financial system of record), and an analytics & reporting platform. After a six-hour us-east-1 AZ event cost them a day of degraded service and a near-miss on a major customer’s renewal, their board mandated a DR program, but their CFO refused to “build everything active-active” after seeing the quote. The CTO’s mandate became the right one: tier each system and spend per tier.

What they decided (one strategy per workload):

| System | Criticality | Agreed RTO / RPO | Strategy chosen | Why |
|---|---|---|---|---|
| Order / Billing | Tier-1 (financial SoR) | RTO 15 min / RPO ~5 s | Warm Standby | Money can’t be lost or down for long; needs a fast, rehearsable failover but not full active-active |
| Shipment Tracking API/Portal | Tier-2 (customer-facing) | RTO 30 min / RPO 5 min | Pilot Light | Customers tolerate a short gap; data must be current; idle compute is wasteful |
| Analytics / Reporting | Tier-3 (internal + batch) | RTO 8 h / RPO 24 h | Backup & Restore | Re-runs nightly; a day-old report is fine; cheapest is correct here |

How each was built:

Numbers:

The test that mattered: in their first quarterly GameDay they ran a managed planned Aurora failover for the billing system from us-east-1 to us-west-2 in a low-traffic window. Writes were serving from us-west-2 in ~75 seconds; the warm fleet scaled to full in ~3 minutes; the ARC routing control flipped traffic deterministically; synthetic billing transactions passed. Total measured RTO was under 6 minutes, comfortably inside the 15-minute commitment. The tracking-system Pilot Light test came in at ~22 minutes (the scale-from-zero compute being the long pole), inside its 30-minute target. And restoring an analytics snapshot from the locked vault into us-west-2 took ~3 hours, well inside the 8-hour budget.

One scar they earned: their first Pilot Light test for tracking failed — Region B hit the default EC2 vCPU service quota while scaling from zero and stalled at half capacity, blowing the RTO. The fix was to pre-raise service quotas in us-west-2 to production levels as a standing item in the DR register, and to add an ARC readiness check that flags quota headroom. It is the canonical Pilot Light lesson: a standby that exists is not a standby that can scale — and you find that out in a drill or in a disaster, so make it the drill.

When to use it

The decision is never “should we do DR” — it is “which strategy, for which workload, at what cost.” The clean rule: start from the agreed RTO/RPO and walk up the ladder only as far as those numbers force you.

Choose Backup & Restore when:

Choose Pilot Light when:

Choose Warm Standby when:

Choose Multi-Site Active-Active when:

Anti-patterns to avoid:

Alternatives and adjacent choices:

The honest framing for any DR review: the RTO/RPO numbers are a business decision and a price tag, not a technical preference. Get those numbers signed off per workload, build the cheapest strategy that meets them, encode the failover so it is deterministic, and prove it on a schedule. A DR plan you have tested at the lowest tier that meets your objectives beats a gold-plated one you have never run — because the parachute you have actually deployed is worth more than the one you only bought.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
