The most expensive mistake I see in AWS architecture reviews is not under-engineering. It is over-engineering — a four-person team running a multi-region, active-active estate with DynamoDB global tables and an Aurora Global Database to serve an internal tool that a few hundred staff touch during office hours, hemorrhaging cost and operational toil it has no capacity to sustain. The second most common mistake is its mirror image: a single EC2 instance in one Availability Zone quietly running the checkout flow for a business that loses serious money per hour when it falls over. Both teams skipped the only question that matters in architecture: what do the requirements actually demand, and what is the cheapest design that meets them with margin to spare?
Architecture on AWS is not a catalogue of impressive services. It is the disciplined act of climbing exactly as high as the requirements force you, and not one rung higher. So this lesson teaches architecture as a ladder — six designs for the same application, starting from a static site that costs almost nothing and ending at a multi-region active-active system engineered to transact through the loss of an entire AWS Region. Each rung adds a specific, named capability (durability and global edge delivery, a contractual SLA, event-driven elasticity, independent team scaling, regional disaster recovery, regional failure tolerance) and each addition has a price in money, complexity, and operational burden. The skill this builds is reading a set of requirements — RTO, RPO, scale, budget, team shape, compliance — and landing on the right rung.
We will use the lens of the AWS Well-Architected Framework’s Reliability pillar throughout, because every step up this ladder is a deliberate trade between the six pillars: you spend Cost-Optimisation and Operational-Excellence currency to buy Reliability and Performance Efficiency. The two geographic rungs at the top land precisely on the territory of the active-active multi-region reference architecture and the four canonical AWS disaster-recovery strategies — so by the time you arrive there, every concept (pilot light, warm standby, failover routing, global tables, composite SLA) will already be familiar, because you watched the requirements force it into existence.
Learning objectives
By the end of this lesson you will be able to:
- Map requirements to architecture — translate RTO, RPO, scale, budget, team topology and compliance into a specific rung on the ladder, with justification.
- Describe six canonical AWS application designs in increasing order of resilience and complexity, naming the services and the role each plays.
- Reason about the key design decision at each rung as a Well-Architected tradeoff — what the addition buys and what it costs across the six pillars.
- Calculate a composite SLA for a chained system and understand how Availability Zones and multiple Regions raise the achievable number.
- Distinguish the four DR strategies (backup & restore, pilot light, warm standby, multi-site active-active) and tie each to an RTO/RPO band.
- Recognise when not to climb: identify the cheapest design that meets the requirement, resist gold-plating, and justify a rung choice to a review board.
Prerequisites & where this fits
This is an Advanced lesson in the Architecture & Design Mastery module. You will get the most from it if you have already met the Well-Architected Reliability pillar (service quotas, failure management, the DR strategies), have a working mental model of high availability versus disaster recovery and RTO/RPO, and understand VPC networking fundamentals (subnets, Availability Zones, route tables). A passing familiarity with the core AWS compute and data services (S3, EC2, ECS/EKS, Lambda, RDS/Aurora, DynamoDB) helps, but each is reintroduced in context.
Where it fits in the ladder of learning: the Well-Architected pillars taught you the value system, the troubleshooting lessons taught you to keep a running system alive, and this lesson shows you how to design one — requirements in, the right rung out, across a realistic progression. It is the bridge from operating AWS to architecting on it.
A note on the running example: every rung serves the same application — “ShopKart”, a product catalogue with a shopping basket and checkout. Keeping the app constant is the whole point. It isolates the one variable that actually changes between these designs: how much failure the business can tolerate, and what it will pay to tolerate less.
A word on the costs below. The figures are deliberately rough, order-of-magnitude monthly estimates for a small-to-moderate workload (a few hundred GB of data, low-millions of requests per month), at indicative US-region on-demand rates, to teach the shape of the cost curve — they are not quotes. Real numbers depend on Region, instance choice, Savings Plans/Reserved Instances, data-transfer (egress) and traffic. Always model your own in the AWS Pricing Calculator. The lesson is in the ratios between rungs, not the absolute dollars — and the biggest hidden line item as you climb is almost always cross-AZ and cross-Region data transfer, which most first estimates forget.
How to read the requirement axes
Every rung is described against the same set of axes. Internalise these — they are the vocabulary of an architecture decision, and they are exactly what an SAA-C03 or SAP-C02 scenario hands you before asking for a design.
| Axis | What it asks | Why it drives design |
|---|---|---|
| RTO (Recovery Time Objective) | After a failure, how long until service is restored? | Drives redundancy: hours allows backup & restore; minutes forces warm standby; near-zero forces active-active. |
| RPO (Recovery Point Objective) | How much data can you afford to lose? | Drives replication: hours allows nightly snapshots; near-zero forces continuous/synchronous replication or multi-write. |
| Scale | Peak concurrent load and its variability (steady vs spiky)? | Drives elasticity and the compute model (serverless vs EC2/ECS vs orchestrated containers). |
| Availability target (SLA) | What uptime % must you promise, and is it composite? | Drives AZ/Region redundancy; each “nine” is roughly an order of magnitude harder and dearer. |
| Budget | Capital and run-rate ceiling? | The hard constraint. Caps how high you can climb regardless of desire. |
| Team topology | One team or many? What is their operational maturity? | A microservices rung needs many autonomous teams; a small team should stay on managed/serverless. |
| Compliance / data residency | Regulatory constraints on where data lives and how DR is proven? | Can force a Region choice or a provable multi-Region DR design irrespective of pure availability maths. |
Keep this table in your head as you read each rung. The design is always a response to a specific movement in these axes — never an aesthetic preference and never a résumé-building exercise.
A quick word on AWS’s two-level geography, because the whole ladder turns on it. A Region is a physical location (e.g. us-east-1) containing multiple, isolated Availability Zones — each one or more discrete data centres with independent power, cooling and networking, interconnected by low-latency links. Spreading across AZs buys you data-centre-fault tolerance within a Region (Rungs 2–4). Spreading across Regions buys you tolerance to the loss of an entire Region (Rungs 5–6). They are different failures at different price points, and confusing the two is the single most common architecture error this lesson exists to prevent.
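The way AZ and Region redundancy raise an achievable availability figure can be sketched with two small formulas: availabilities multiply for components in series, and unavailabilities multiply for independent redundant copies. The sketch below is illustrative only. The component figures are made up for easy arithmetic (they are not AWS SLA values), and the independence assumption is optimistic, because real failures are often correlated.

```python
from math import prod

def serial(availabilities):
    """Components in a chain: every one must be up, so availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """Independent redundant copies: the system is down only if all copies are down."""
    return 1 - (1 - availability) ** copies

# Hypothetical component figures, chosen for easy arithmetic (not AWS SLA values).
chain = serial([0.999, 0.999])       # two dependencies in series
two_zones = redundant(0.99, 2)       # one 99% component duplicated across two AZs
two_regions = redundant(chain, 2)    # the whole chain duplicated across two Regions

for name, value in [("chain", chain), ("two_zones", two_zones), ("two_regions", two_regions)]:
    print(f"{name}: {value:.6%}")
```

The chain is always less available than its weakest link, while each redundant copy multiplies the remaining downtime by the copy's own unavailability. That is why chaining more services lowers the composite number and spreading across AZs or Regions raises it.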
Rung 1 — Static site (S3 + CloudFront + Route 53)
Scenario & requirements. ShopKart begins as an MVP: a catalogue and a marketing front, with the dynamic basket to be bolted on later. A two-person startup needs it online to validate the idea. Traffic is a trickle and unpredictable: it could be 10 visitors, or 10,000 if a post goes viral. RTO: hours is fine (a hobbyist-grade promise). RPO: effectively moot, because the content lives in version control and can be rebuilt, so “data loss” barely applies. Availability: best-effort, though as we will see this rung quietly delivers far better. Budget: as close to zero as possible. Team: two generalist developers, no operations function.
The design. A single-page application (React/Vue) or a static-site-generator build is compiled to static files and stored in an Amazon S3 bucket. Amazon CloudFront sits in front as the CDN, caching assets at hundreds of edge locations worldwide, terminating TLS with a free AWS Certificate Manager (ACM) certificate, and serving everything over HTTPS. Amazon Route 53 hosts the DNS zone and provides an alias record pointing the apex domain at the CloudFront distribution. The S3 bucket stays private, fronted by an Origin Access Control (OAC) so objects are reachable only through CloudFront, never directly. Deploys are an aws s3 sync plus a CloudFront cache invalidation, trivially wired into CI/CD.
This is the Static Content Hosting pattern in its purest form — serve static assets straight from object storage through a CDN, with no web server to run, patch, or scale.
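The two-step deploy described above (sync the build to the bucket, then invalidate the CDN cache) might be wrapped for a CI job roughly as follows. This is a minimal sketch that assumes the AWS CLI is installed and authenticated. The build directory, bucket name and distribution ID are placeholders, and the helper names are hypothetical.

```python
import subprocess

def deploy_commands(build_dir, bucket, distribution_id):
    """Return the two AWS CLI invocations for a static-site deploy."""
    return [
        # Upload new or changed files and remove objects no longer in the build.
        ["aws", "s3", "sync", build_dir, f"s3://{bucket}", "--delete"],
        # Ask CloudFront to drop cached copies so edges fetch the new build.
        ["aws", "cloudfront", "create-invalidation",
         "--distribution-id", distribution_id, "--paths", "/*"],
    ]

def deploy(build_dir, bucket, distribution_id, run=subprocess.run):
    """Run each command in order, stopping on the first failure."""
    for command in deploy_commands(build_dir, bucket, distribution_id):
        run(command, check=True)

if __name__ == "__main__":
    # Placeholder names; substitute your own bucket and distribution ID.
    for command in deploy_commands("./dist", "example-shopkart-site", "EXAMPLEDISTID"):
        print(" ".join(command))
```

Invalidating every path is the bluntest option. Fingerprinted asset filenames with long cache lifetimes would let a deploy invalidate only the entry page, which is a design choice this sketch does not make for you.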
Services.
| Component | Service | Role |
|---|---|---|
| Origin store | Amazon S3 (private bucket) | Durable object storage for the built site (11 nines of durability) |
| Edge / CDN | Amazon CloudFront | Global caching, TLS termination, low-latency delivery |
| DNS | Amazon Route 53 | Hosted zone + alias record to the distribution |
| Certificate | AWS Certificate Manager | Free, auto-renewing TLS certificate |
| Origin protection | Origin Access Control (OAC) | Keep the bucket private; force traffic through CloudFront |
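The origin-protection row is enforced by a bucket policy that grants read access to the CloudFront service principal, scoped to a single distribution. A sketch of that policy document, built in Python, is shown here. The bucket name, account ID and distribution ID are placeholders, and the function name is hypothetical.

```python
import json

def oac_bucket_policy(bucket, account_id, distribution_id):
    """Bucket policy letting only one CloudFront distribution read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCloudFrontServicePrincipalReadOnly",
            "Effect": "Allow",
            "Principal": {"Service": "cloudfront.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            # Without this condition any distribution could read the bucket.
            "Condition": {"StringEquals": {
                "AWS:SourceArn":
                    f"arn:aws:cloudfront::{account_id}:distribution/{distribution_id}"
            }},
        }],
    }

# Placeholder identifiers for illustration only.
policy = oac_bucket_policy("example-shopkart-site", "111122223333", "EXAMPLEDISTID")
print(json.dumps(policy, indent=2))
```

Because the policy names no public principal and the bucket keeps Block Public Access on, a direct request to the S3 endpoint is refused while the same object is served normally through the distribution.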
Key design decisions & Well-Architected tradeoffs. The defining decision is no servers at all, and the principle behind it is use managed services and let the platform absorb scale. You pay essentially nothing when idle and CloudFront soaks up traffic spikes automatically — superb Cost Optimisation and Performance Efficiency for a spiky, low-baseline, read-heavy workload. The quiet bonus is Reliability: S3 offers eleven nines of durability and a regional availability SLA, and CloudFront spreads delivery across many edge locations, so this “best-effort” rung is in practice extraordinarily robust — far more available than a single server ever is. The tradeoff is that it is static: there is no server-side logic. The moment you need a real “place order” endpoint, a database write, or anything dynamic, you must add it — which is exactly Rungs 2 and 3. From the Reliability pillar’s “design for business requirements” principle, this is the right call: the requirement here is cheap and good-enough, and this rung over-delivers on reliability almost for free.
Rough cost. With low traffic this rung lives largely inside the AWS Free Tier and the standing cost is dominated by the Route 53 hosted zone (about $0.50/month) plus pennies of S3 storage and CloudFront requests. Realistically $1–$10/month until you have meaningful traffic; the main variable as you grow is CloudFront data-transfer egress. This is the cheapest functional rung on the ladder by one to two orders of magnitude.
When this is enough. Marketing sites, MVPs, documentation, JAMstack content, single-page apps that call a separate API, and any read-heavy front end. If your app is genuinely static (or static plus a third-party API), you may never need to climb off this rung — it scales globally and costs almost nothing. Stop here unless you need server-side business logic, your own database, authenticated write operations, or a backend you control.
Rung 2 — Single-region 3-tier (ALB + EC2/ECS + RDS Multi-AZ)
Scenario & requirements. ShopKart found product-market fit. It is now a real business with paying customers, a relational data model (orders, inventory, customers with foreign-key integrity), server-side logic, and the need for consistent, predictable performance. RTO: minutes, automatic for a component failure. RPO: near-zero for committed transactions (a database failover must not lose acknowledged orders). Availability: a solid ~99.9%+ that survives the loss of one Availability Zone. Budget: modest but real — the business can fund a few hundred dollars a month. Team: a small product team, light on dedicated operations.
The design. The canonical three-tier web application, deployed across at least two Availability Zones inside one Region’s VPC. An Application Load Balancer (ALB) spans public subnets in two AZs and distributes traffic to the app tier. The app tier runs in private subnets — either an EC2 Auto Scaling group spanning the AZs, or, more commonly today, Amazon ECS on AWS Fargate tasks (serverless containers, no instances to patch) behind the same ALB. The data tier is Amazon RDS (or Aurora) in a Multi-AZ configuration: a primary in one AZ with a synchronously-replicated standby in another, and automatic failover. Amazon S3 + CloudFront still serve static assets; secrets live in AWS Secrets Manager or SSM Parameter Store; an internet-facing path is protected by AWS WAF on the ALB or a CloudFront distribution in front.
This is the N-tier architecture style — layered, well-understood, the natural home for a lift-and-shift or a straightforward new build with a relational core. Crucially, multi-AZ from the outset is what makes it production-grade rather than a single point of failure.
Services.
| Tier | Service | Role |
|---|---|---|
| Edge / WAF | CloudFront + AWS WAF (optional) | Caching, TLS, OWASP protection, global ingress |
| Load balancing | Application Load Balancer (multi-AZ) | Layer-7 routing, health checks, AZ spread |
| App tier | ECS Fargate or EC2 Auto Scaling (2+ AZs) | Run the application; scale horizontally on demand |
| Data tier | Amazon RDS / Aurora Multi-AZ | Relational store with synchronous standby + auto-failover |
| Secrets/config | Secrets Manager / SSM Parameter Store | Externalised credentials and settings |
| Static | S3 + CloudFront | Images, assets |
Key design decisions & Well-Architected tradeoffs. Two decisions define this rung. First, multi-AZ everything: the ALB spans AZs, the app tier runs in two or more AZs, and RDS runs Multi-AZ. This is the make all things redundant principle applied at the zone granularity, and it is the single highest-ROI reliability step in all of AWS — it removes the most common real-world failure (a data-centre/AZ fault) for a modest premium rather than a re-architecture. Second, warm provisioned capacity trades scale-to-zero for predictability: you now pay a standing monthly bill whether or not anyone visits, in exchange for no cold starts and consistent latency on checkout. The chief tradeoff is Cost and a little Operational Excellence (you now own scaling policies, AMIs or task definitions, and a database to tune) bought in exchange for Reliability and Performance Efficiency. Note the RDS subtlety: Multi-AZ is for availability, not for scaling reads — a standby is not a readable replica; if you need read scale you add read replicas, a different feature.
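A review of "multi-AZ everything" reduces to one question per tier: does it run in at least two zones? The check below is a hypothetical sketch over a hand-written placement map. In practice the map would be derived from your infrastructure-as-code or an inventory query, which is not shown.

```python
def single_az_tiers(placement, minimum_zones=2):
    """Return the tiers that run in fewer than `minimum_zones` distinct AZs."""
    return sorted(
        tier for tier, zones in placement.items()
        if len(set(zones)) < minimum_zones
    )

# Hypothetical placement for illustration; the AZ names are examples.
placement = {
    "alb": ["us-east-1a", "us-east-1b"],
    "app": ["us-east-1a", "us-east-1b"],
    "database": ["us-east-1a"],      # no standby in a second AZ
}
print(single_az_tiers(placement))    # ['database']
```

Any tier the check returns is a single point of failure at the zone level, however redundant the tiers around it are.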
Rough cost. ALB + two small Fargate tasks (or two t-class EC2 instances) + RDS Multi-AZ (a small-to-medium instance; the standby roughly doubles the single-AZ database cost) + CloudFront: ballpark $150–$500/month depending on instance sizes and traffic. That is one to two orders of magnitude above Rung 1, and it is the price of warm capacity, a managed relational engine, and zone-fault tolerance.
When this is enough. The overwhelming majority of business web applications. SaaS products in early-to-mid growth, internal line-of-business apps, e-commerce that can tolerate a rare, very short blip during an AZ failover, anything where ~99.9–99.95% and an RTO of minutes is contractually fine. Stop here unless a whole-Region outage would be unacceptable, you have an event-driven workload that scale-to-zero would serve far more cheaply, or your organisation (many teams) has outgrown a single deployable.
Rung 3 — Serverless event-driven (API Gateway + Lambda + DynamoDB + EventBridge)
Scenario & requirements. ShopKart’s load is now genuinely spiky and event-shaped: flat for hours, then a flash-sale or a marketing push drives a 50× burst for twenty minutes. The team is small and wants to stop managing servers, patching, and scaling policies entirely — and to stop paying for idle capacity between bursts. RTO: minutes (the managed services self-heal). RPO: near-zero (DynamoDB is multi-AZ and durable by default). Availability: high, ~99.95%+, inherited from regional managed services that are themselves multi-AZ. Budget: pay strictly for what you use — ideally near-zero when idle. Team: a small team that wants to ship features, not operate infrastructure.
The design. A fully serverless, event-driven architecture, all of it regional and multi-AZ by default. Amazon API Gateway (HTTP API) is the front door, terminating TLS and routing requests to AWS Lambda functions that hold the business logic and scale automatically from zero to thousands of concurrent executions. State lives in Amazon DynamoDB, a fully-managed, single-digit-millisecond key-value/document store with on-demand capacity. The event spine is Amazon EventBridge: when an order is placed, Lambda emits an event onto an event bus, and downstream consumers (send confirmation email, decrement inventory, update analytics) are decoupled rules and targets. Asynchronous, buffered work flows through Amazon SQS queues drained by Lambda functions acting as Competing Consumers, and durable multi-step workflows use AWS Step Functions. Static assets stay on S3 + CloudFront.
This is the Event-driven architecture style composed with serverless compute, and it leans on the Publisher-Subscriber, Queue-Based Load Levelling, Competing Consumers and Async Request-Reply patterns to decouple producers from consumers and absorb spikes.
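The decoupling that Publisher-Subscriber buys can be seen in a toy, in-memory event bus. This is not the EventBridge API. It is a synchronous stand-in with hypothetical names, whereas the real service delivers asynchronously and at least once. It shows only that the publisher of an order event needs no knowledge of its consumers.

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for an event bus: rules map an event type to targets."""
    def __init__(self):
        self._targets = defaultdict(list)

    def subscribe(self, detail_type, target):
        self._targets[detail_type].append(target)

    def publish(self, detail_type, detail):
        # The publisher does not know who, or how many, consumers exist.
        for target in self._targets[detail_type]:
            target(detail)

log = []
bus = EventBus()
bus.subscribe("OrderPlaced", lambda d: log.append(f"email:{d['order_id']}"))
bus.subscribe("OrderPlaced", lambda d: log.append(f"inventory:{d['order_id']}"))
bus.subscribe("OrderPlaced", lambda d: log.append(f"analytics:{d['order_id']}"))

bus.publish("OrderPlaced", {"order_id": "o-1001"})
print(log)
```

Adding a fourth consumer is one more subscription. The code that places the order does not change, which is the property that lets teams extend the system independently.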
Services.
| Component | Service | Role |
|---|---|---|
| API front door | Amazon API Gateway (HTTP API) | Managed ingress, TLS, throttling, routing to Lambda |
| Compute | AWS Lambda | Event-driven functions; scale to zero, scale to thousands |
| Data | Amazon DynamoDB (on-demand) | Multi-AZ NoSQL store, single-digit-ms, pay-per-request |
| Event bus | Amazon EventBridge | Route domain events to decoupled consumers |
| Buffering | Amazon SQS (+ DLQ) | Smooth bursts; Competing Consumers; poison-message handling |
| Orchestration | AWS Step Functions | Durable, visual multi-step workflows |
Key design decisions & Well-Architected tradeoffs. The defining decision is serverless and event-driven, and the principle is use managed services taken to its conclusion plus design to scale out. You get extraordinary Cost Optimisation for a spiky workload (you pay per request and per millisecond, nothing when idle) and effortless Performance Efficiency at the burst (the platform scales horizontally for you). High availability is inherited — API Gateway, Lambda and DynamoDB are all regional, multi-AZ managed services, so you get strong reliability without designing it. The tradeoffs are real and exam-relevant. Lambda cold starts add tail latency after idle (mitigated with provisioned concurrency or SnapStart, at a cost). The model is eventually consistent and asynchronous by nature, so you accept the fallacies of distributed computing and design for idempotency, retries and out-of-order delivery. You trade away the comfort of relational joins and transactions across tables — DynamoDB rewards single-table design and access-pattern-first modelling, which is a genuine skill shift. And you accept service quotas (concurrency limits, throttling) as first-class design constraints. The Operational-Excellence story is excellent for a small team, but observability is harder (many short-lived functions and async hops), so distributed tracing with AWS X-Ray becomes essential, not optional.
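Designing for idempotency means a consumer produces the same result however many times an event is delivered. A minimal sketch follows, with an in-memory set standing in for a durable record of processed event IDs. In a real deployment the check and the record would need to be a single atomic operation (for example a conditional write), which this sketch does not attempt. The class and field names are hypothetical.

```python
class InventoryConsumer:
    """Decrements stock once per event ID, however many times the event arrives."""
    def __init__(self, stock):
        self.stock = dict(stock)
        self._seen = set()     # stand-in for a durable store of processed event IDs

    def handle(self, event):
        if event["id"] in self._seen:
            return False       # duplicate delivery: safely ignored
        self.stock[event["sku"]] -= event["quantity"]
        self._seen.add(event["id"])
        return True

consumer = InventoryConsumer({"sku-1": 10})
event = {"id": "evt-42", "sku": "sku-1", "quantity": 3}
results = [consumer.handle(event) for _ in range(3)]   # simulate redelivery
print(results, consumer.stock)   # [True, False, False] {'sku-1': 7}
```

Without the deduplication check, three deliveries of one order would remove nine units of stock instead of three, and retries would become a source of corruption rather than resilience.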
Rough cost. With bursty, low-average traffic this rung can be dramatically cheaper than Rung 2 — frequently $20–$200/month for a low-millions-of-requests workload, and often inside the Free Tier at MVP volumes, because there is no idle compute or standby database to pay for. The cost scales cleanly with usage, which is precisely its appeal; the watch-outs are high-throughput sustained traffic (where provisioned capacity can become cheaper than per-request) and chatty designs that explode the request count.
When this is enough. Event-driven and spiky workloads: APIs with uneven traffic, webhooks, ingestion and processing pipelines, scheduled jobs, glue between systems, and product backends a small team wants to run with minimal operational surface. It is also a superb complement to the other rungs — even a Rung 2 or 4 estate uses EventBridge/SQS/Lambda for the asynchronous edges. Stop here unless you need long-running or specialised compute that Lambda’s limits don’t suit, you have a strongly relational/transactional core that fights DynamoDB, or — the big one — your organisation has grown to many teams who each need to own and deploy a service independently, which is the driver for Rung 4.
Rung 4 — Containerised microservices (ECS/EKS + service discovery)
Scenario & requirements. ShopKart is now a large product with many engineering teams. The catalogue team, the basket team, the payments team and the fulfilment team each want to own, deploy, and scale their slice independently, on their own cadence, without a coordinated big-bang release. The domain has grown complex enough that a single deployable is a bottleneck — every change requires whole-app regression and a shared release train. RTO/RPO: similar to Rung 2/3 (still single-Region, multi-AZ; the move here is organisational, not a reliability upgrade). Availability: ~99.95%+ within the Region. Budget: the business will fund the additional platform overhead because team velocity is the constraint. Team: many autonomous teams with real platform/DevOps maturity and on-call.
The design. A microservices architecture: the application is decomposed by business capability into independently-deployable services, each containerised. They run on an orchestrator — Amazon ECS on Fargate (simpler, fully AWS-managed) or Amazon EKS (managed Kubernetes, for teams who want the Kubernetes ecosystem and portability). Each service has its own data store (“database per service” — DynamoDB here, Aurora there, the best data store for the job). Services find each other through service discovery: ECS Service Connect / AWS Cloud Map on ECS, or Kubernetes DNS plus a service mesh on EKS. North-south traffic enters through an ALB (or API Gateway); east-west service-to-service traffic can run through a mesh (Istio/App Mesh-style) or Amazon VPC Lattice for IAM-authenticated, cross-account service networking. Asynchronous integration between services uses EventBridge / SQS / SNS (the same event spine as Rung 3). Everything is still spread across multiple AZs.
This is the Microservices architecture style, and it draws heavily on the Sidecar, Ambassador, Anti-Corruption Layer, Backends for Frontends, Gateway Routing/Aggregation/Offloading and Strangler Fig patterns — the last being how you usually get here, by carving services off a monolith incrementally rather than rewriting.
Services.
| Concern | ECS Fargate option | EKS option |
|---|---|---|
| Orchestration | Amazon ECS on Fargate | Amazon EKS (managed Kubernetes) |
| Service discovery | ECS Service Connect / AWS Cloud Map | Kubernetes DNS + service mesh |
| Service-to-service | VPC Lattice / internal ALBs | Service mesh (mTLS, traffic shaping) |
| Ingress | ALB / API Gateway | AWS Load Balancer Controller (ALB ingress) / API Gateway |
| Per-service data | DynamoDB / Aurora / RDS (best fit) | DynamoDB / Aurora / RDS (best fit) |
| Async integration | EventBridge / SQS / SNS | EventBridge / SQS / SNS |
Key design decisions & Well-Architected tradeoffs. The single most important thing to understand about this rung — and the most common exam and review trap — is that microservices are an answer to organisational and scaling pressure, not an availability upgrade. A zone-redundant Rung 2 or serverless Rung 3 is just as available within a Region, and far simpler. You climb to Rung 4 when many teams need to deploy independently and the domain complexity justifies the split — not to chase nines. The principle in play is minimize coordination (autonomous teams shipping without a shared release train) and partition around limits. What you pay for it is substantial Operational-Excellence and Cost currency: a distributed system brings the full fallacies of distributed computing — network partitions, partial failures, eventual consistency, distributed transactions (handled with the Saga pattern over Compensating Transactions, never 2PC) — plus the platform burden of a mesh, service discovery, distributed tracing, and per-service pipelines. EKS vs ECS is itself a tradeoff: ECS/Fargate minimises operational surface and is the right default on AWS; EKS buys ecosystem and portability at the price of Kubernetes operational complexity. Choosing ECS unless you have a concrete reason for Kubernetes is the keep it simple call.
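The Saga idea mentioned above replaces one distributed transaction with a sequence of local steps, each paired with a compensating action that undoes it. A minimal, single-process sketch is below. The step names are hypothetical, and a production saga would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure undo completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True

log = []

def charge_card():
    raise RuntimeError("payment declined")   # simulated failure in the third step

steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("create order"),  lambda: log.append("cancel order")),
    (charge_card,                         lambda: log.append("refund card")),
]
ok = run_saga(steps)
print(ok, log)
```

The failed step's own compensation never runs, because the charge never happened. Only the two completed steps are undone, newest first. Between a step and its compensation the system is visibly inconsistent, which is the eventual-consistency cost this rung accepts.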
Rough cost. Compute cost is broadly comparable to running the same workload as a monolith on Fargate/EC2 — you are paying for the same aggregate vCPU/memory — but the platform overhead (a service mesh, more load balancers, more observability, an EKS control-plane fee where applicable, and a larger DevOps investment) adds a real premium. Ballpark $500–$3,000+/month for a modest multi-service estate, dominated less by raw compute than by the surrounding platform and the engineering time to run it.
When this is enough. Large applications with many teams, complex domains, and independent scaling/deployment needs — the textbook fit. It is the right rung when the organisation, not the availability target, is the binding constraint. Stop here (do not climb to multi-region) unless a whole-Region outage is genuinely unacceptable or a regulator demands a provable, geographically separate DR capability. And critically: do not climb to here for availability — if your pain is “we need to survive an AZ failure”, Rungs 2 and 3 already solve that for a fraction of the cost and toil.
Rung 5 — Multi-region active-passive (disaster recovery)
Scenario & requirements. ShopKart now underpins revenue the business feels acutely, and a whole-Region event — rare, but real — must not take the service down for long or lose committed orders. A regulator (or the business’s own risk appetite) requires a demonstrable, geographically separate disaster-recovery capability. RTO: tens of minutes, via a controlled failover. RPO: small — seconds to a few minutes — the last sliver of un-replicated data may be lost in a sudden regional loss, and that is acceptable. Availability: a higher composite, surviving the loss of the primary Region. Budget: the business will pay an insurance premium for geographic redundancy, but not the full cost of a second live estate. Team: a mature platform team that will own and test a runbook.
The design. A second Region holds a standby copy of the stack, kept at one of two warmth levels chosen by the RTO. Pilot light: the data layer is continuously replicated to the second Region and minimal core infrastructure exists, but compute is scaled to (near) zero and is scaled up only on failover. It is the cheapest option, with the longest RTO. Warm standby: a scaled-down-but-running copy of the full stack is always live in the second Region, ready to scale up fast. It costs more and recovers faster. Data replication is the heart of it: Amazon RDS / Aurora cross-Region read replicas (or Aurora’s cross-Region replication) for relational data, DynamoDB global tables for NoSQL, and S3 Cross-Region Replication (CRR) for objects. DynamoDB point-in-time recovery is a backup feature rather than a replication mechanism, so it complements global tables but does not replace them. Amazon Route 53 does the failover: a health check on the primary plus a failover routing policy flips DNS to the standby Region when the primary is unhealthy. CloudFront stays global in front. Infrastructure is defined as code (Terraform/CloudFormation) so the standby is a faithful, redeployable twin, and the failover runbook is automated and regularly rehearsed.
This rung instantiates two of the four canonical DR strategies — pilot light and warm standby — covered in depth in the AWS disaster-recovery strategies lesson. (The simplest strategy, backup & restore, is essentially “Rung 2 plus cross-Region snapshots” with an RTO of hours; the most aggressive, multi-site active-active, is Rung 6.)
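The traffic switch can be pictured as a small state machine: keep answering with the primary until a run of consecutive health-check failures marks it unhealthy. The toy model below is a deliberate simplification and not the Route 53 API. The threshold of three and the Region names are assumptions, it fails back on a single passing check, and it ignores DNS TTLs, all of which matter in a real failover.

```python
class FailoverRouter:
    """Toy model of DNS failover: answer with the primary until it is judged unhealthy."""
    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary, self.secondary = primary, secondary
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record_health_check(self, passed):
        # A pass resets the count; failures must be consecutive to trip failover.
        self._consecutive_failures = 0 if passed else self._consecutive_failures + 1

    def resolve(self):
        unhealthy = self._consecutive_failures >= self.failure_threshold
        return self.secondary if unhealthy else self.primary

router = FailoverRouter("eu-west-1", "eu-central-1")
answers = []
for passed in [True, False, False, False, True]:
    router.record_health_check(passed)
    answers.append(router.resolve())
print(answers)
```

Even in this toy, detection takes three check intervals before any traffic moves. Detection time, DNS caching and the time for the standby to scale up all add to the RTO, which is why the runbook has to be measured rather than assumed.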
Services & the warmth dial.
| Concern | Pilot light | Warm standby |
|---|---|---|
| Compute in standby | Scaled to ~zero; scaled up on failover | Always running, scaled down |
| Relational data | Cross-Region read replica (async) | Cross-Region read replica (async) |
| NoSQL data | DynamoDB global tables / replication | DynamoDB global tables / replication |
| Object data | S3 Cross-Region Replication | S3 Cross-Region Replication |
| Traffic switch | Route 53 health check + failover policy | Route 53 health check + failover policy |
| RTO band | Tens of minutes to ~1 hour | Minutes to tens of minutes |
| Relative cost | Lower (no idle compute) | Higher (always-on, scaled-down stack) |
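Choosing among the four DR strategies is a search for the cheapest one whose RTO and RPO bands cover the requirement. The sketch below encodes that search. The minute cut-offs are illustrative readings of the bands in this lesson and are assumptions, not AWS-defined limits. Tune them to what your own failover tests actually measure.

```python
# (strategy, fastest RTO in minutes it is assumed to meet), cheapest first.
# The cut-offs are illustrative readings of the bands above, not AWS-defined limits.
STRATEGIES = [
    ("backup & restore", 120),
    ("pilot light", 60),
    ("warm standby", 5),
    ("multi-site active-active", 0),
]

def cheapest_strategy(rto_minutes, rpo_seconds):
    """Return the first (cheapest) strategy that satisfies both objectives."""
    if rpo_seconds == 0:
        # Asynchronous replication always leaves a non-zero RPO.
        return "multi-site active-active"
    for name, fastest_rto in STRATEGIES:
        if rto_minutes >= fastest_rto:
            return name
    return "multi-site active-active"

print(cheapest_strategy(rto_minutes=480, rpo_seconds=3600))   # backup & restore
print(cheapest_strategy(rto_minutes=15, rpo_seconds=60))      # warm standby
print(cheapest_strategy(rto_minutes=480, rpo_seconds=0))      # multi-site active-active
```

The last call shows that a relaxed RTO does not rescue a zero-RPO requirement: the data objective alone forces the top rung.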
Key design decisions & Well-Architected tradeoffs. Two decisions dominate. First, how warm is the standby? — a pure RTO-versus-cost dial. Pay only for the warmth the RTO actually requires: a pilot light that must scale up may or may not hit a tight RTO (so test it), while a fully warm standby wastes money if a slower failover is acceptable. Second, and the one architects most often get wrong: active-passive replication is asynchronous, so the RPO is never zero. At the instant the primary Region fails, the last few seconds-to-minutes of un-replicated transactions are lost. If the business truly needs zero data loss, active-passive cannot deliver it — that requirement alone is what justifies Rung 6’s multi-write data. The Well-Architected trade is Cost and significant Operational-Excellence investment (a second estate, replication, and a tested failover runbook) spent to buy Reliability against a regional disaster. The most dangerous failure mode is an untested runbook: a DR capability you have never exercised is a hope, not a control — rehearse the failover (and failback) on a schedule.
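Why asynchronous replication implies a non-zero RPO can be shown with a small timeline. A write is lost if the primary Region fails after the write commits but before the replication lag has elapsed. The numbers are hypothetical, and the model assumes a constant lag, which real replication does not guarantee.

```python
def lost_writes(commit_times, replication_lag, failure_time):
    """Writes committed in the primary but not yet applied in the standby at failure."""
    return [
        t for t in commit_times
        if t <= failure_time < t + replication_lag
    ]

# Hypothetical timeline in seconds: one order commits every 2 s, the Region fails at t=11.
commits = list(range(0, 12, 2))          # [0, 2, 4, 6, 8, 10]
print(lost_writes(commits, replication_lag=5, failure_time=11))   # [8, 10]
print(lost_writes(commits, replication_lag=0, failure_time=11))   # []
```

With a five-second lag, the two most recent acknowledged orders vanish. Only a lag of zero, which means synchronous or multi-write data, loses nothing, and that is a different rung at a different price.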
Rough cost. Pilot light adds the cost of cross-Region data replication, standby data stores, and the cross-Region data-transfer bill — but little idle compute — so it might add 40–80% over the single-Region baseline. Warm standby adds an always-on (scaled-down) second stack on top, pushing the total toward 1.5–2× the single-Region cost. The frequently-forgotten line item is cross-Region data-transfer egress, which on a chatty replication workload can rival the compute savings.
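As a worked example of those ratios, the sketch below applies them to a hypothetical single-Region bill. The baseline figure is invented. The multipliers are simply the rough ranges from the paragraph above, so the output inherits all of their imprecision.

```python
# Multipliers on the single-Region monthly bill, taken from the rough ranges above.
DR_MULTIPLIERS = {
    "pilot light": (1.4, 1.8),     # baseline plus 40-80%
    "warm standby": (1.5, 2.0),    # total of 1.5-2x baseline
}

def dr_monthly_range(baseline, strategy):
    """Return the (low, high) monthly estimate for a strategy."""
    low, high = DR_MULTIPLIERS[strategy]
    return (round(baseline * low, 2), round(baseline * high, 2))

baseline = 500.0   # hypothetical single-Region bill in USD per month
for strategy in DR_MULTIPLIERS:
    print(strategy, dr_monthly_range(baseline, strategy))
```

The ranges overlap, which is the point: a chatty replication workload can push a pilot light's data-transfer bill into warm-standby territory, so model the transfer volume before assuming the cheaper option is cheaper.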
When this is enough. Revenue-critical and regulated workloads that must survive a regional outage and can tolerate an RTO of minutes-to-tens-of-minutes and a small, non-zero RPO. This is the right rung for most “serious DR” requirements — it delivers geographic resilience without the cost and consistency complexity of running two live regions. Stop here unless the business has quantified a catastrophic cost for any downtime, demands an RTO/RPO approaching zero, or needs to serve users in multiple geographies with low latency from the nearest Region — all of which push you to Rung 6.
Rung 6 — Multi-region active-active (mission-critical)
Scenario & requirements. ShopKart is now a system whose downtime cost is catastrophic and quantified — every minute offline is a number the board can recite — and it serves a global user base that expects low latency from the nearest Region. There can be no failover step: traffic must already be flowing to multiple Regions, so the loss of one is absorbed, not recovered from. RTO: effectively zero — the surviving Regions are already serving. RPO: effectively zero, requiring multi-Region writes, not async replication. Availability: the highest tier, surviving the complete loss of a Region with no human in the loop. Budget: justified only by the cost-of-downtime maths — this is the most expensive rung by a wide margin. Team: a mature engineering organisation with deep operational practice (chaos testing, game days, continuous validation).
The design. Two or more Regions all serving live production traffic simultaneously. Amazon Route 53 uses latency-based or geolocation routing to send each user to the nearest healthy Region, with health checks to drain a Region the moment it degrades. Amazon CloudFront fronts everything globally and can route to a regional origin per request. The defining challenge is data: writes happen in every Region, so you need stores built for multi-Region write. DynamoDB global tables provide active-active, multi-Region, last-writer-wins replication out of the box. Amazon Aurora Global Database provides a primary Region with fast cross-Region replicas and managed promotion (sub-minute), and — with Aurora Global Database write forwarding — limited multi-Region write semantics for relational data. The application is built as an independently-deployable cell / scale unit replicated per Region so blast radius is contained and whole regional stacks can be deployed blue-green. Conflict handling, idempotency and eventual consistency are designed in from the first line of code.
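To make "last-writer-wins" and "idempotency" less abstract, here is a minimal in-memory sketch of both ideas. It is not DynamoDB's implementation: the `Version` type, the tie-break on Region name and the `IdempotentOrders` class are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the later write; break exact timestamp ties by Region name (an illustrative rule)."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


class IdempotentOrders:
    """Replaying the same idempotency key returns the original result instead of a duplicate."""

    def __init__(self) -> None:
        self._seen: dict[str, dict] = {}

    def place(self, idempotency_key: str, order: dict) -> dict:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # retry: no second order is created
        self._seen[idempotency_key] = order
        return order


if __name__ == "__main__":
    eu = Version("shipped", 1000, "eu-west-1")
    us = Version("cancelled", 1005, "us-east-1")
    print(last_writer_wins(eu, us).value)  # the earlier "shipped" write is silently discarded
```

Note what the merge does to the losing write: it disappears without an error. That silent discard is the reason conflict-prone entities usually need partitioning by owner Region or an explicit merge rule rather than a blind overwrite.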
This rung lands exactly on the active-active multi-region reference architecture — Route 53 latency routing, CloudFront, DynamoDB global tables and Aurora Global Database — and represents the multi-site active-active end of the DR strategy spectrum. It draws on the Geode pattern (geographically distributed nodes serving any request) and Deployment Stamps (the per-Region cell).
Services.
| Concern | Service | Role |
|---|---|---|
| Global routing | Route 53 latency/geolocation + health checks | Send users to the nearest healthy Region; drain unhealthy ones |
| Edge | Amazon CloudFront | Global caching + per-request regional origin selection |
| NoSQL data | DynamoDB global tables | Active-active multi-Region writes (last-writer-wins) |
| Relational data | Aurora Global Database (+ write forwarding) | Cross-Region replicas, fast promotion, limited multi-write |
| Compute (per Region) | ECS/EKS/Lambda cell per Region | Independently-deployable scale unit / cell |
| Validation | Fault injection / game days | Continuously prove the design survives Region loss |
Key design decisions & Well-Architected tradeoffs. The decision that makes this rung is multi-Region writes, and it is genuinely hard: you must confront data conflicts and consistency head-on (last-writer-wins in DynamoDB, or careful partitioning/write-forwarding with Aurora), accept eventual consistency across Regions, and design every write path to be idempotent and conflict-tolerant. This is the deepest dive into the fallacies of distributed computing on the whole ladder. The Well-Architected trade is stark: you spend the maximum of Cost (multiple full live estates plus cross-Region replication traffic) and Operational-Excellence currency (multi-Region deployments, continuous validation, chaos engineering, sophisticated observability) to buy the maximum Reliability (zero-RTO, zero-RPO, Region-loss tolerance). A subtle but vital point: the composite SLA of two Regions each at, say, 99.95% combines as 1 − (1 − A)² ≈ 99.999975% for that redundant tier in isolation — but the real number is capped by the least-available serial dependency in front of them (a single global router misconfiguration, a single shared global table that throttles, a single account-level quota). Adding Regions buys nothing if a serial choke point remains. And honesty matters: active-active is not automatically better than active-passive — it is far more complex and costly to operate, and complexity avoidance is itself a mission-critical principle. You climb here only when the cost-of-downtime maths forces it.
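The composite-SLA arithmetic above is easy to check in a few lines. In this sketch the 99.95% per-Region figure comes from the text, while the 99.99% serial dependency is a hypothetical number chosen only to show the cap; the function names are also illustrative.

```python
def redundant(availability: float, copies: int = 2) -> float:
    """Availability of N independent redundant paths: down only when all are down."""
    return 1 - (1 - availability) ** copies


def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    two_regions = redundant(0.9995, copies=2)
    # Hypothetical serial dependency in front of both Regions, at 99.99%.
    end_to_end = serial(two_regions, 0.9999)
    print(f"redundant tier alone: {two_regions:.6%}")
    print(f"with serial dependency: {end_to_end:.6%}")
```

With these inputs the end-to-end figure lands just below the serial dependency's own 99.99%, which is the point: the redundant tier contributes almost nothing once a weaker component sits in front of it.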
Rough cost. You are now running N full live estates plus continuous cross-Region replication, so cost scales roughly with the number of Regions — commonly 2–3×+ the single-Region baseline, and frequently more once cross-Region data-transfer and the operational investment (tooling, chaos programmes, larger SRE function) are counted. This is the most expensive rung on the ladder by a wide margin, and the only justification for it is a downtime cost that exceeds the spend.
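The "downtime cost exceeds the spend" test can be written down as a break-even check. This is a rough sketch under stated assumptions: the availability figures, the cost per hour and the extra annual spend are all hypothetical inputs, and it ignores everything except expected downtime.

```python
HOURS_PER_YEAR = 365 * 24


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR * cost_per_hour


def upgrade_pays_off(current: float, target: float,
                     cost_per_hour: float, extra_annual_spend: float) -> bool:
    """True when the downtime cost avoided is larger than the extra yearly spend."""
    avoided = (annual_downtime_cost(current, cost_per_hour)
               - annual_downtime_cost(target, cost_per_hour))
    return avoided > extra_annual_spend


if __name__ == "__main__":
    # Hypothetical: moving from 99.9% to 99.99% when an hour offline costs 10,000.
    print(upgrade_pays_off(0.999, 0.9999, 10_000, extra_annual_spend=50_000))
    print(upgrade_pays_off(0.999, 0.9999, 10_000, extra_annual_spend=100_000))
```

If the business cannot supply a credible cost-per-hour figure, that is itself a finding: the requirement for this rung has not been quantified yet.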
When this is enough. This is the apex. It is right for systems where any downtime is catastrophic and quantified, where a global audience demands low-latency local serving, and where the organisation has the maturity to operate a continuously-validated multi-Region estate. For the overwhelming majority of workloads, this rung is over-engineering — the discipline is recognising that and climbing back down. There is no rung above this; the work from here is operational excellence (chaos testing, game days, tightening the health model and deployment automation), not a more elaborate topology.
How to choose a rung from requirements
You never pick a rung by taste. You read the axes and let them point. Here is the decision distilled into a single table. Read it top to bottom and settle on the lowest rung that covers every requirement you genuinely have.
| If the requirement is… | …the rung is | Why |
|---|---|---|
| Static/JAMstack front end, cheap, spiky, best-effort (over-delivers anyway) | 1 — Static (S3 + CloudFront + Route 53) | No servers; global edge; near-zero cost; 11 nines of durability |
| Server-side logic, relational data, ~99.9%+, survive an AZ failure | 2 — Single-region 3-tier (ALB + ECS/EC2 + RDS Multi-AZ) | Warm capacity + ACID + multi-AZ; the simplest real production design |
| Spiky/event-shaped load, want zero idle cost and no servers to run | 3 — Serverless event-driven (API GW + Lambda + DynamoDB + EventBridge) | Pay-per-use, scale-to-zero, HA inherited; cold starts/eventual consistency accepted |
| Many autonomous teams + complex domain + independent deploy/scale | 4 — Containerised microservices (ECS/EKS) | Organisational/scaling driver — not an availability driver |
| Must survive a whole-Region outage; DR provable; RTO tens of min, small RPO | 5 — Multi-region active-passive (pilot light / warm standby) | Geographic insurance; composite SLA up; async ⇒ non-zero RPO |
| Catastrophic, quantified cost of any downtime; RTO/RPO ≈ 0; global low latency | 6 — Multi-region active-active | The apex; justified only by cost-of-downtime maths |
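The table can be encoded as a small decision function. This is one possible encoding, not a rule from AWS: the field names are hypothetical, it checks the most demanding requirement first so a stronger requirement is never masked by a weaker one, and it flattens Rung 4 into the same sequence even though team structure is really a separate axis.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_server_side_logic: bool = False
    spiky_event_shaped_load: bool = False
    many_autonomous_teams: bool = False
    must_survive_region_outage: bool = False
    near_zero_rto_rpo: bool = False


def choose_rung(r: Requirements) -> int:
    """Return the lowest rung that covers every stated requirement."""
    if r.near_zero_rto_rpo:
        return 6  # multi-region active-active
    if r.must_survive_region_outage:
        return 5  # multi-region active-passive
    if r.many_autonomous_teams:
        return 4  # containerised microservices
    if r.needs_server_side_logic and r.spiky_event_shaped_load:
        return 3  # serverless event-driven
    if r.needs_server_side_logic:
        return 2  # single-region 3-tier, multi-AZ
    return 1  # static site


if __name__ == "__main__":
    print(choose_rung(Requirements(needs_server_side_logic=True)))
    print(choose_rung(Requirements(needs_server_side_logic=True,
                                   must_survive_region_outage=True)))
```

The value of writing it down is less the function than the inputs: every flag set to true should trace back to a requirement the business actually stated.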
Four rules govern the whole climb:
- Requirements drive the rung — not fashion, not résumé-building. The single best question in any review is “what requirement forces us off the rung below?” If you cannot answer it crisply, you have over-engineered.
- Availability and organisation are different axes. Rungs 1→2→3 (and 5→6) climb reliability/geography; Rung 4 climbs organisational structure. Do not reach for EKS to get availability — Rung 2’s Multi-AZ or Rung 3’s managed services do that more cheaply.
- Multi-AZ (inside Rungs 2–4) is the highest-ROI reliability step on the ladder. It removes the most common real failure for a modest premium. Most teams under-buy AZ redundancy and over-buy multi-Region.
- Every step up spends Cost and Operational-Excellence currency to buy Reliability and Performance. That is the Well-Architected trade in one sentence. Make it deliberately, write down what you bought and what you paid, and you will rarely be wrong.
The honest summary: most production workloads belong on Rung 2 or 3 — multi-AZ, single-Region, on managed or serverless compute. Rung 4 is for organisations whose team structure forces it. Rung 5 is for the regulated and the genuinely revenue-critical. Rung 6 is for the few systems with a catastrophic, quantified downtime cost, and is over-engineering everywhere else. Climbing the ladder is easy; the discipline — and the seniority — is in knowing when to stop.
The diagram above stacks the six rungs as a single climb — static site, single-region 3-tier, serverless event-driven, containerised microservices, multi-region active-passive DR, and multi-region active-active — showing for each the headline AWS services and the capability it adds, so you can see at a glance how cost, complexity and resilience all rise together with every step up.
Real-world application
In a real AWS design engagement this ladder is the backbone of the first conversation, before a single resource is drawn. You sit with the business owner and pin the axes: “What does an hour of downtime actually cost? Can you lose a minute of data? How many teams will touch this? What does compliance demand? Is your traffic flat or spiky?” Their answers land you on a rung, and from there the service list almost writes itself.
It also reframes migration and modernisation. A lift-and-shift typically lands a workload on Rung 2 (rehost onto EC2/ECS Multi-AZ + RDS), and the modernisation roadmap is literally “which rung, and when?” — often 2→3 for spiky workloads (replatform to serverless to kill idle cost), or 2→4 only when team structure demands it, and 2/3→5 only if a regional-DR requirement appears. It anchors cost conversations with FinOps: each rung has a recognisable cost shape, and “we are paying for Rung 6 but only need Rung 2” — a multi-Region active-active estate fronting an internal app — is one of the most common and expensive findings in an AWS cost review. And in the SAA-C03 and SAP-C02 design exams, the questions are this ladder in disguise: a scenario hands you RTO/RPO/scale/budget and asks for the most cost-effective design that meets it — you are being tested on whether you can land on the right rung without over- or under-shooting.
Common mistakes & anti-patterns
- Over-engineering — building Rung 5 or 6 for a Rung 2/3 requirement. The classic. A small team runs a multi-Region active-active estate (global tables, Aurora Global Database) for an internal app, drowning in cost and cross-Region data-transfer bills and operational toil it cannot sustain. If you cannot name the requirement that forced you up, climb back down.
- Under-engineering — a single AZ/Region for a revenue-critical system. A single EC2 instance, or RDS with no Multi-AZ, under a checkout flow. One AZ event and the business stops. At minimum, go multi-AZ (Rung 2); the cost is modest and the failure it removes is the most common one.
- Reaching for microservices/EKS to get availability. ECS or EKS does not make you more available than a multi-AZ Rung 2 or serverless Rung 3 — it makes you more operationally complex. Microservices answer organisational and scaling pressure, not a reliability gap.
- Forgetting the serial dependency in the composite SLA. Adding a second Region but leaving a single global database, a single shared identity path, or an account-level quota in front. The composite SLA is capped by the least-available serial component, so the redundancy buys far less than the back-of-napkin maths promised.
- Treating active-passive’s RPO as zero. Cross-Region replication (RDS read replicas, DynamoDB replication, S3 CRR) is asynchronous — it always has a non-zero RPO. If the business truly needs zero data loss, active-passive cannot deliver it; that requirement is what justifies Rung 6’s multi-write data.
- Never testing the DR runbook. A pilot-light or warm-standby Region you have never failed over to is a hope, not a control. Rehearse failover and failback on a schedule, or your real RTO is “unknown”.
- Confusing “highly available” with “disaster recoverable”. Multi-AZ (inside Rungs 2–4) is HA — it survives a data-centre/AZ fault. It does not survive a regional outage. They are different rungs solving different failures; do not let one stand in for the other.
- Skipping multi-AZ and jumping to multi-Region. Teams sometimes leap from a single-instance app straight to a multi-Region project because it sounds impressive, skipping the cheapest, highest-ROI reliability step. Get multi-AZ right first; reconsider multi-Region only if a regional requirement actually exists.
- Ignoring data-transfer cost as you climb. Cross-AZ and especially cross-Region data transfer is a real, often-forgotten line item that can dominate the bill on chatty or replication-heavy designs. Model it explicitly at Rungs 5 and 6.
Interview & exam questions
- Walk me through how you would choose between a single-region multi-AZ design and a multi-region active-passive design. (Look for: RTO/RPO and regional-outage tolerance as the deciding axis; multi-AZ survives an AZ fault but not a Region; multi-Region is justified by a regional-outage or compliance-DR requirement; cost roughly 1.5–2× and a non-zero RPO from async replication.)
- A startup with two engineers wants to build “microservices on EKS” for their MVP. What’s your advice? (Look for: that’s an over-engineering anti-pattern; microservices answer organisational complexity they don’t have; recommend Rung 1, 2 or serverless Rung 3; “keep it simple”; revisit ECS/EKS when many teams and domain complexity force it — and prefer ECS/Fargate over EKS unless they have a concrete Kubernetes need.)
- Calculate the approximate composite availability of two Regions, each 99.95%, behind a global router, and explain what caps it. (Look for: redundant paths combine as 1 − (1 − A)² ≈ 99.999975% for that tier in isolation; then multiply by the Route 53 / serial-dependency SLA; the least-available serial component caps the composite.)
- What’s the minimum change to make a single-EC2-instance web app survive a data-centre failure, and why is it the highest-ROI step? (Look for: spread across two+ AZs behind an ALB with an Auto Scaling group, and make RDS Multi-AZ; it removes the most common real failure — an AZ fault — for a modest premium rather than a re-architecture; that’s the core of Rung 2.)
- A workload needs RPO ≈ 0. Which rung does that force on AWS, and why can’t a cheaper one deliver it? (Look for: active-passive uses async cross-Region replication → non-zero RPO; near-zero RPO requires multi-Region writes — DynamoDB global tables and/or Aurora Global Database with write forwarding, i.e. Rung 6 active-active — or, for zonal zero-loss, RDS Multi-AZ’s synchronous standby within a Region.)
- Explain the four AWS DR strategies and the RTO/RPO band of each. (Look for: backup & restore — hours, cheapest; pilot light — tens of minutes, data replicated/compute off; warm standby — minutes, scaled-down stack live; multi-site active-active — near-zero, all Regions serving. Map them to Rungs: backup & restore ≈ Rung 2 + snapshots, pilot light & warm standby = Rung 5, active-active = Rung 6.)
- When would you choose RDS read replicas versus RDS Multi-AZ? (Look for: Multi-AZ is for availability — a synchronous standby with auto-failover, not readable for scaling; read replicas are for read scaling/offloading and are asynchronous; cross-Region read replicas also serve as a DR building block. They solve different problems and are often used together.)
- ECS Fargate vs EKS for a microservices estate — how do you decide? (Look for: ECS/Fargate minimises operational surface and is the sensible AWS default; EKS buys the Kubernetes ecosystem, portability and advanced scheduling at the cost of Kubernetes operational complexity; choose ECS unless there’s a concrete reason for Kubernetes — “keep it simple”.)
- Where does a static-site design break down, and what’s the next rung? (Look for: the moment you need server-side logic, your own database, or authenticated write operations; climb to Rung 2 (3-tier multi-AZ) for a relational/transactional core, or Rung 3 (serverless event-driven) for a spiky, event-shaped workload.)
- Why might you choose active-passive over active-active even when you can afford active-active? (Look for: complexity and consistency cost — multi-Region write conflict resolution is hard; if an RTO of minutes-to-tens-of-minutes is acceptable, active-passive is far simpler and cheaper to operate; complexity avoidance is itself a reliability principle.)
- How does this ladder map to a Well-Architected tradeoff? (Look for: every step spends Cost and Operational-Excellence currency to buy Reliability/Performance; the right rung is the cheapest one that meets the requirement with margin; over- and under-shooting are both pillar violations.)
- Is the ladder strictly linear? Defend your answer. (Look for: no — Rung 4 (microservices) is an organisational axis orthogonal to the geographic axis of 5–6; a monolith can be multi-Region active-active and a microservices estate can be single-Region; Rungs 1→2→3 and 5→6 climb reliability/geography, 4 climbs structure. Also: serverless (3) can be combined with multi-Region (5/6).)
Quick check
- Which step is the highest-ROI reliability upgrade on AWS, and what failure does it remove?
- What is the composite-availability formula for two independent redundant Regions each at availability A, and what caps the real number?
- Name the axis that drives a move to containerised microservices — and the axis it does not improve.
- Why does multi-region active-passive always have a non-zero RPO on AWS?
- Which AWS data services make multi-region active-active writes possible, and what hard problem do they force you to design for?
Answers.
- Going multi-AZ (ALB across AZs + Auto Scaling/Fargate across AZs + RDS Multi-AZ) — the heart of Rung 2. It removes a single Availability-Zone/data-centre failure for a modest premium rather than a re-architecture.
- 1 − (1 − A)² for that redundant tier in isolation (the system is down only when both Regions are down) — then multiply by the SLA of any serial dependency in front (Route 53, a shared global table, an account-level quota). The least-available serial component caps the composite.
- It is driven by organisational scale and domain complexity (many autonomous teams + independent deploy/scale needs). It does not improve availability — Rung 2’s Multi-AZ or Rung 3’s managed services deliver HA more cheaply.
- Because cross-Region replication (RDS read replicas, DynamoDB replication, S3 CRR) is asynchronous — at the instant the primary Region fails, the last few seconds-to-minutes of un-replicated data are lost. Near-zero RPO requires multi-Region writes.
- DynamoDB global tables (active-active, last-writer-wins) and Aurora Global Database (cross-Region replicas + write forwarding). They force you to design for data conflicts, eventual consistency and idempotency — the fallacies of distributed computing at Region scale.
Exercise
The brief. You are the architect for “MediShip”, a pharmacy fulfilment platform on AWS. Requirements as stated by the business: it handles prescription orders across one country; a regional AWS outage must not lose orders and must not take the system down for more than ~15 minutes; losing more than a minute or two of order data in a disaster is unacceptable for audit reasons; traffic is steady with predictable evening peaks; there is one moderately-sized engineering team; a regulator requires a demonstrable, geographically separate DR capability; the budget is real but not unlimited. Choose a rung, name the key AWS services, and state the one decision you would push back on.
Write your answer before reading on.
Model answer. Read the axes. “Regional outage must not take it down for >15 min” + “regulator requires a geographically separate, demonstrable DR” → this crosses the regional boundary, so a single-Region multi-AZ design (Rung 2/3) is not sufficient. An RTO of ~15 minutes is achievable with a controlled failover, so you do not need true active-active (Rung 6) — its cost and multi-write consistency complexity aren’t justified by a 15-minute RTO. One moderately-sized team → microservices on EKS (Rung 4) is an over-engineering trap; stay on managed/serverless compute. The right rung is 5 — multi-region active-passive, specifically a warm standby (a pilot light may struggle to hit 15 minutes — but test it before deciding), with each Region’s stack already multi-AZ (so you also get Rung 2’s HA inside it). Services: Route 53 health check + failover routing policy; ALB + ECS Fargate (multi-AZ) in the primary, a scaled-down warm copy in the second Region; Amazon RDS/Aurora cross-Region read replica (promote on failover) or Aurora Global Database for fast managed promotion; DynamoDB global tables if any NoSQL is involved; S3 Cross-Region Replication for objects; CloudFront in front; everything in Terraform/CloudFormation so the standby is a redeployable twin; and a tested, automated failover (and failback) runbook.
The decision to push back on: the stated RPO of “a minute or two” sits in tension with async cross-Region replication, which under a sudden regional loss can lose more than that. Surface it explicitly: either (a) use Aurora Global Database, whose typical cross-Region replication lag and managed promotion can achieve a low-RPO that you then validate against the audit requirement, or (b) if the audit truly demands near-zero loss, recognise that this requirement alone pushes you toward multi-Region writes (a Rung-6 cost) and force the business to choose between the RPO and the budget. Naming that tension — rather than silently designing past it — is the senior move. Also flag: tune the standby’s warmth to the 15-minute RTO and prove it with a real failover test, and budget the cross-Region data-transfer cost explicitly, because it is the line item that most often surprises on a Rung-5 design.
Certification mapping
This lesson is squarely SAA-C03 (Solutions Architect – Associate) and SAP-C02 (Solutions Architect – Professional) territory — both exams are, in essence, a series of “given these requirements, choose the most cost-effective design that meets them” questions, which is precisely this ladder.
| Cert | Relevance |
|---|---|
| SAA-C03 | Primary. Design resilient, high-performing, secure and cost-optimised architectures: multi-AZ vs multi-Region; ALB + Auto Scaling/ECS + RDS Multi-AZ; serverless (API Gateway/Lambda/DynamoDB/EventBridge); S3/CloudFront/Route 53; the four DR strategies and RTO/RPO. Every rung maps to exam objectives. |
| SAP-C02 | Primary (advanced). Multi-account/multi-Region strategy, active-passive vs active-active, Aurora Global Database, DynamoDB global tables, composite-SLA reasoning, cost-of-downtime-driven design, failover orchestration, and choosing the cheapest design that meets aggressive RTO/RPO. |
| SOA-C02 | The operational view — running multi-AZ workloads, configuring Route 53 health checks and failover, monitoring, and executing/testing DR runbooks. |
| DVA-C02 | The Rung 1–3 developer view — Lambda, API Gateway, DynamoDB single-table design, EventBridge/SQS/SNS, Step Functions, and resilience patterns (retries, idempotency, DLQs) in code. |
For SAA-C03 and SAP-C02 specifically, drill the availability and DR fundamentals: per-service SLAs, the down-minutes-per-month each “nine” implies, composite SLA for serial chains, the four DR strategies mapped to RTO/RPO bands, and how AZs and Regions raise the achievable number — and remember the exam’s recurring tell: it usually wants the most cost-effective design that meets the requirement, which is this ladder’s whole thesis.
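The down-minutes each "nine" implies is a one-line calculation worth doing by hand at least once. The sketch below assumes a 30-day month for simplicity; the function name is illustrative.

```python
def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a month at the given availability."""
    return (1 - availability) * days * 24 * 60


if __name__ == "__main__":
    for availability in (0.999, 0.9995, 0.9999, 0.99999):
        minutes = downtime_minutes_per_month(availability)
        print(f"{availability:.3%} -> {minutes:.2f} minutes per 30-day month")
```

Three nines allows roughly 43 minutes a month and four nines a little over 4, which is why each additional nine tends to demand a different rung rather than a tuning exercise.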
Glossary
- Rung — In this lesson, one design on the progression from simplest to multi-region active-active; the unit of the architecture decision.
- RTO (Recovery Time Objective) — The maximum acceptable time to restore service after a failure. Drives how much redundancy/failover automation you need.
- RPO (Recovery Point Objective) — The maximum acceptable amount of data loss, measured in time. Drives the replication strategy (snapshot vs async vs multi-write).
- Region — A physical AWS location (e.g. us-east-1) containing multiple isolated Availability Zones. Spanning Regions survives a whole-Region outage.
- Availability Zone (AZ) — One or more discrete data centres within a Region, with independent power, cooling and networking. Spanning AZs survives a data-centre fault.
- Composite SLA — The end-to-end availability of a system computed from its components: serial dependencies multiply (lowering it); redundant paths combine as 1 − (1 − A)ⁿ (raising it).
- Multi-AZ — A resource whose instances/replicas are spread across Availability Zones (e.g. RDS Multi-AZ, an ALB + Auto Scaling group), so a single-AZ failure does not take it down.
- Active-passive (DR) — One Region serves traffic; a standby Region (pilot light/warm) takes over on failover. Async replication → non-zero RPO; failover → non-zero RTO.
- Active-active — All Regions serve live traffic simultaneously; with multi-Region-write data there is no failover step. The basis of the apex rung.
- Pilot light — A minimal standby Region (data replicated, compute near-zero) scaled up on failover; cheapest multi-Region option, with the longest RTO.
- Warm standby — A scaled-down-but-running copy of the full stack in the standby Region, ready to scale up fast; more expensive than pilot light, faster RTO.
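- DynamoDB global tables — DynamoDB’s active-active, multi-Region replication (last-writer-wins), enabling multi-Region writes. Replication between Regions is asynchronous by default, so the loss of a Region can still drop writes that had not yet propagated.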
- Aurora Global Database — Aurora’s cross-Region replication with fast managed promotion and optional write forwarding, the relational basis for low-RPO DR and active-active.
- Cold start — The latency penalty when a scaled-to-zero serverless function (Lambda) must be provisioned to serve the first request after idle.
- Distributed-systems tax — The added complexity (eventual consistency, distributed tracing, sagas, partial failure, idempotency) you accept when you decompose across the network or across Regions.
Next steps
You now have the spine of AWS architectural judgement: requirements in, the right rung out. The natural next lesson turns this design discipline into a hiring portfolio — Real-World AWS Portfolio Projects: From a Static Site to a Multi-Account Landing Zone — which builds exactly these rungs as shippable GitHub projects with quantified résumé bullets, so you can demonstrate the judgement this lesson teaches.
To deepen the surrounding material:
- Revisit AWS Well-Architected: Reliability — every step on this ladder is a deliberate move within its system of tradeoffs (Cost/Ops spent to buy Reliability/Performance), and it covers the service quotas, failure management and DR foundations the upper rungs depend on.
- Study the apex up close: AWS Enterprise Architecture: Active-Active Multi-Region — Route 53 latency routing, CloudFront, DynamoDB global tables and Aurora Global Database, with conflict handling and the cost trade-offs of Rung 6 in full.
- Master the geographic middle: AWS Enterprise Architecture: Disaster Recovery Strategies — the four canonical strategies (backup & restore, pilot light, warm standby, multi-site active-active) driven by RTO/RPO, with failover orchestration and a worked example — the detailed treatment of Rung 5.
- Ground the foundations: high availability vs disaster recovery and RTO/RPO for the vocabulary, and VPC networking fundamentals for the multi-AZ subnet/route-table mechanics every rung from 2 upward relies on.