The most common mistake I see in architecture reviews is not under-engineering. It is over-engineering — a team building a globally-distributed, active/active, deployment-stamped fortress to serve a line-of-business app that 200 internal users touch during office hours. The second most common mistake is its mirror image: a single virtual machine in one datacentre quietly running the checkout flow for a business that loses lakhs per hour when it falls over. Both teams skipped the only question that matters in architecture: what do the requirements actually demand, and what is the cheapest design that meets them with margin to spare?
Architecture is not a catalogue of impressive components. It is the disciplined act of climbing exactly as high as the requirements force you, and not one rung higher. So this lesson teaches architecture as a ladder — six designs for the same web application, starting from a static site that costs almost nothing and ending at a mission-critical system engineered to transact through a regional failure. Each rung adds a specific, named capability (durability, an SLA you can stand behind, zone tolerance, independent team scaling, regional disaster recovery, regional failure tolerance) and each addition has a price in money, complexity, and operational burden. The skill this builds is reading a set of requirements — RTO, RPO, scale, budget, team shape, compliance — and landing on the right rung.
We will use the lens of the Azure Well-Architected Framework throughout, because every step up this ladder is a deliberate trade between its five pillars: you spend Cost and Operational-Excellence currency to buy Reliability and Performance. We will reach for named cloud design patterns as each rung needs them. And the top of the ladder lands exactly on the mission-critical (AlwaysOn) architecture — so by the time you arrive there, every concept (deployment stamp, health model, composite SLA) will already be familiar, because you watched the requirements force it into existence.
Learning objectives
By the end of this lesson you will be able to:
- Map requirements to architecture — translate RTO, RPO, scale, budget, team topology and compliance into a specific rung on the ladder, with justification.
- Describe six canonical Azure web-application designs in increasing order of resilience and complexity, naming the services and the role each plays.
- Reason about the key design decision at each rung as a Well-Architected tradeoff — what the addition buys and what it costs.
- Calculate a composite SLA for a chained system and understand how availability zones and paired regions raise the achievable number.
- Recognise when not to climb — to identify the cheapest design that meets the requirement and resist gold-plating.
- Justify a rung choice to a review board using cost, RTO/RPO, and operational burden as the deciding axes.
Prerequisites & where this fits
This is an Advanced lesson in the Architecture & Design Mastery module. You will get the most from it if you have already met the Well-Architected Framework (the five pillars and their tradeoffs), the architecture styles and ten design principles (N-tier, web-queue-worker, microservices), and have a working mental model of high availability versus disaster recovery and RTO/RPO. A passing familiarity with the core Azure compute and data services (App Service, Functions, AKS, Azure SQL, Cosmos DB) helps, but each is reintroduced in context.
Where it fits in the ladder of learning: the styles lesson taught you the macro shape, the patterns lesson taught you the tactical moves, and this lesson shows you both driven by requirements across a realistic progression. It is the bridge into the mission-critical capstone.
A note on the running example: every rung serves the same application — “ContosoCart”, a product catalogue with a shopping basket and checkout. Keeping the app constant is the whole point. It isolates the one variable that actually changes between these designs: how much failure the business can tolerate, and what it will pay to tolerate less.
A word on the costs below. The INR figures are deliberately rough, order-of-magnitude monthly estimates for a small-to-moderate workload (a few hundred GB of data, low-millions of requests per month), at indicative pay-as-you-go rates, to teach the shape of the cost curve — they are not quotes. Real numbers depend on region, tier, reservations, egress and traffic. Always model your own in the Azure Pricing Calculator and TCO Calculator. The lesson is in the ratios between rungs, not the absolute rupees.
How to read the requirement axes
Every rung is described against the same set of axes. Internalise these — they are the vocabulary of an architecture decision.
| Axis | What it asks | Why it drives design |
|---|---|---|
| RTO (Recovery Time Objective) | After a failure, how long until service is restored? | Drives redundancy: minutes-to-hours allows restore/failover; near-zero forces active/active. |
| RPO (Recovery Point Objective) | How much data can you afford to lose? | Drives replication: hours allows nightly backup; zero forces synchronous or multi-write replication. |
| Scale | Peak concurrent load and its variability (steady vs spiky)? | Drives elasticity and the compute model (serverless vs PaaS vs orchestrated containers). |
| Availability target (SLA) | What uptime % must you promise, and is it composite? | Drives zone/region redundancy; each “nine” is roughly an order of magnitude harder and dearer. |
| Budget | Capital and run-rate ceiling? | The hard constraint. Caps how high you can climb regardless of desire. |
| Team topology | One team or many? What is their operational maturity? | A microservices rung needs many autonomous teams; a small team should stay on PaaS. |
| Compliance / data residency | Regulatory constraints on where data lives and how DR is proven? | Can force a region choice or a paired-region DR design irrespective of pure availability maths. |
Keep this table in your head as you read each rung. The design is always a response to a specific movement in these axes — never an aesthetic preference.
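To make the availability axis concrete, it helps to turn a percentage into a downtime budget. The sketch below is plain arithmetic, not an Azure API; the function name and the list of targets are illustrative choices, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime that an availability target still allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Each extra "nine" shrinks the budget by roughly a factor of ten.
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% allows about {budget:,.1f} minutes of downtime per year")
```

Reading the targets as minutes rather than percentages is usually what makes the RTO conversation with a business owner productive.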
Rung 1 — Static site + serverless API
Scenario & requirements. ContosoCart begins as an MVP. A two-person startup needs the catalogue and a basic basket online to validate the idea. Traffic is a trickle and unpredictable — could be 10 users, could be 10,000 if a post goes viral. RTO: hours is fine (a hobbyist-grade promise). RPO: a few hours (some lost cart data is survivable at this stage). Availability: best-effort. Budget: as close to zero as possible. Team: two generalist developers, no operations function.
The design. A single-page application (React/Vue) is built to static files and served from Azure Storage static website hosting (or Azure Static Web Apps, which bundles the front end with a managed Functions backend). The dynamic bits — “add to cart”, “get price”, “place order” — are individual Azure Functions on the Consumption plan, invoked over HTTP. State lives in Azure Cosmos DB in serverless capacity mode. Azure CDN / Front Door (the Static Web Apps free tier includes global distribution) caches the static assets at the edge.
This is the Static Content Hosting pattern (serve static assets straight from object storage, not from a web server) composed with a serverless API. Architecturally it is the leanest expression of the web workload.
Services.
| Component | Service | Role |
|---|---|---|
| Front end | Azure Storage static website / Static Web Apps | Serve SPA assets globally, no server to manage |
| API | Azure Functions (Consumption) | Event-driven HTTP endpoints; scale to zero when idle |
| Data | Azure Cosmos DB (serverless) | Low-latency document store, pay per request unit |
| Edge | Azure CDN / Front Door | Cache static assets, TLS, custom domain |
Key design decisions & Well-Architected tradeoffs. The defining decision is serverless everything, and the principle behind it is use managed services and scale to zero. You pay essentially nothing when idle and the platform absorbs spikes automatically — superb Cost Optimization and Performance Efficiency for a spiky, low-baseline workload. The tradeoff is paid in Reliability and Performance at the tails: Consumption-plan Functions suffer cold starts (a request after idle waits for an instance to spin up), and there is no warm capacity guarantee. You also accept a relatively stateless, simple model — long-running or stateful workflows fight this design. From the Reliability pillar’s “design for business requirements” principle, that is the right call: the business requirement here is cheap and good-enough, not fast and certain.
Rough cost. With low traffic, this rung can run on free tiers (Static Web Apps free, Functions’ generous monthly free grant, Cosmos DB free-tier allowance). Realistically, ₹0–₹1,500/month until you have meaningful traffic. This is the cheapest functional rung on the ladder by an order of magnitude.
When this is enough. Marketing sites, MVPs, internal tools, JAMstack content sites, and any read-heavy app whose backend is a handful of simple operations. If your traffic is genuinely spiky with a low baseline and you can tolerate occasional cold-start latency, you may never need to climb off this rung. Many successful products live here far longer than their founders expect. Stop here unless you need predictable low latency, complex stateful server logic, or an SLA you can put in a contract.
Rung 2 — Single-region 3-tier web app
Scenario & requirements. ContosoCart found product-market fit. It is now a real business with paying customers, a relational data model (orders, inventory, customers with foreign-key integrity), and the need for consistent, predictable performance — no cold starts on checkout. RTO: ~1 hour. RPO: minutes (point-in-time restore of the database is acceptable; losing a day of orders is not). Availability: roughly 99.9% is fine for now. Budget: modest but real — the business can fund a few tens of thousands of rupees a month. Team: a small product team, still no dedicated ops.
The design. The canonical N-tier / three-tier web application: a presentation/app tier on Azure App Service (a managed PaaS web host — deploy code, the platform runs it), a relational data tier on Azure SQL Database (single database, a sensible General Purpose tier), and Azure Front Door (or Application Gateway) in front for TLS termination, a WAF, caching, and a global entry point. Secrets live in Azure Key Vault; the app reads configuration via the External Configuration Store pattern using App Configuration + Key Vault references. Static assets still go to Blob/CDN.
This is the N-tier architecture style — layered, well-understood, the natural home for a lift-and-shift or a straightforward new build with a relational core.
Services.
| Component | Service | Role |
|---|---|---|
| Edge / WAF | Azure Front Door | Global ingress, TLS, WAF, caching, routing |
| App tier | Azure App Service (PaaS) | Host the web app/API; autoscale on a plan |
| Data tier | Azure SQL Database (single, General Purpose) | Relational store with point-in-time restore |
| Secrets/config | Key Vault + App Configuration | Externalised secrets and settings |
| Static | Blob Storage + CDN | Images, assets |
Key design decisions & Well-Architected tradeoffs. The move from Rung 1 is a deliberate trade of Cost for Performance Efficiency and predictability: a provisioned App Service plan has warm capacity (no cold starts) and Azure SQL gives you ACID transactions and referential integrity that a document store makes you work for. The cost is now a standing monthly bill whether or not anyone visits; you have left scale-to-zero behind. The critical Reliability caveat: a single App Service instance and a single SQL database in one region is not highly available. A platform SLA on paper does not change that: with a single instance, or a single-AZ database, a hardware fault or a datacentre-rack issue takes you down. This rung is “redundant within the service if you configure it”: run at least two App Service instances so that one host failing does not take the app offline, but remember that you are still single-region and not explicitly zone-redundant, so the design is not yet resilient to an availability-zone failure. The design principle in play is keep it simple: do not buy zone redundancy until the availability requirement demands it.
Rough cost. App Service (Standard/Premium plan, 2 small instances) + Azure SQL (General Purpose, modest vCores) + Front Door standard + Key Vault: ballpark ₹15,000–₹40,000/month depending on tiers and traffic. An order of magnitude above Rung 1 — the price of warm capacity and a managed relational engine.
When this is enough. The overwhelming majority of business web applications. SaaS products in early growth, internal line-of-business apps, e-commerce that can tolerate rare short outages, anything where ~99.9% and an RTO of an hour is contractually fine. Stop here unless an availability-zone failure would be unacceptable, you need a stronger SLA, or you have outgrown a single regional footprint.
Rung 3 — Zone-redundant high availability
Scenario & requirements. ContosoCart now underpins revenue that the business genuinely feels when the site is down. A datacentre-level incident — a rack failure, a power event in one building — must not take the service offline. RTO: minutes, automatic. RPO: near-zero for committed transactions. Availability: a solid 99.95%+ that survives the loss of one availability zone. Budget: the business will pay a premium for this resilience — it is now justified. Team: a small but maturing platform team with on-call.
The design. Same three tiers as Rung 2, but every tier is now zone-redundant within a single region. Azure regions that support availability zones offer at least three physically separate zones, each made up of one or more datacentres with independent power, cooling and networking. So: App Service on a plan with zone redundancy enabled (instances spread across zones); a zone-redundant Application Gateway v2 behind Front Door, which is a global service and not tied to any one zone; Azure SQL Database in a tier that supports zone-redundant configuration (Business Critical or zone-redundant General Purpose, either of which keeps data synchronously replicated across zones); zone-redundant storage (ZRS) for assets; and any cache (Azure Cache for Redis) in a zone-redundant tier. The shape is unchanged; the redundancy posture is what changed.
This rung is where the make all things redundant design principle becomes the organising idea, applied at the zone granularity.
Services & the redundancy upgrade.
| Tier | Rung 2 (single-region) | Rung 3 (zone-redundant) |
|---|---|---|
| Ingress | Front Door (already resilient) | Front Door + zone-redundant App Gateway v2 |
| App | 2 instances, one zone | Instances spread across 3 zones |
| Database | Single GP database | Zone-redundant GP/Business Critical (synchronous replicas across zones) |
| Storage | LRS/standard | ZRS (zone-redundant) |
| Cache | (optional) | Zone-redundant Redis |
Key design decisions & Well-Architected tradeoffs. This is the single most cost-effective reliability upgrade on the entire ladder, and the most under-used. Enabling zone redundancy is usually a configuration flag plus a tier bump, not a re-architecture — yet it removes an entire class of failure (a single datacentre going dark). The tradeoff is cost (zone-redundant SKUs carry a premium; synchronous cross-zone SQL replication needs the higher tier) and a small latency consideration (synchronous commit across zones adds sub-millisecond-to-low-millisecond write latency — usually negligible because zones in a region are close). The Reliability pillar’s “design for resilience” principle is satisfied here for intra-region faults. What this rung does not protect against is a whole-region outage — if the entire Azure region has a problem, zone redundancy cannot save you. That is the boundary that pushes you to Rung 5.
Rough cost. Zone-redundant App Service + Business Critical / zone-redundant Azure SQL + ZRS + Redis: ballpark ₹60,000–₹1,50,000/month. Notably not 3× Rung 2 — you are not running three full independent stacks, you are running one stack whose components are zone-spread, so the premium is meaningful but not linear in the number of zones.
When this is enough. This is the right default for most production business applications that matter to revenue. It delivers a genuinely high SLA, automatic recovery from the most common real-world failure (a single datacentre fault), and it does so without the operational weight of multi-region. A very large fraction of “serious” Azure workloads should live exactly here. Stop here unless regulation demands a geographically separate DR site, or a regional outage is intolerable, or your domain complexity is forcing organisational decomposition (Rung 4).
Rung 4 — Microservices on AKS
Scenario & requirements. Note the change of axis. ContosoCart’s problem is no longer purely availability — it is organisational scale and domain complexity. The company has grown to many engineering teams; the monolith has become a bottleneck where every team’s release is coupled to every other’s; different sub-domains (catalogue, basket, pricing, fulfilment, recommendations) have wildly different scaling and release cadences. RTO/RPO: similar to Rung 3 (still want zone-redundant HA). Scale: components must scale independently — pricing under Black-Friday load shouldn’t force the whole app to scale. Team: many autonomous teams, each owning a service end-to-end. Budget: substantial; the org can fund a platform team.
The design. Decompose the monolith into microservices running on Azure Kubernetes Service (AKS) — a managed container orchestrator. Each service is independently deployable, owns its own data store (the best data store for the job — Cosmos DB here, Azure SQL there), and scales on its own. Azure API Management (APIM) sits at the front as the Gateway Routing / Aggregation / Offloading layer, presenting one external contract and handling auth, throttling and versioning. Asynchronous workflows go through Azure Service Bus (queues and topics) so services are temporally decoupled — the Queue-Based Load Levelling and Competing Consumers patterns. Cross-cutting resilience (retry, circuit breaker, mTLS) is handled by a service mesh or Dapr. The cluster runs zone-redundant node pools, so you keep Rung 3’s HA posture inside the new style.
This is the microservices architecture style — and crucially, it is chosen here for an organisational reason, not because microservices are “more advanced”. The style buys team autonomy and independent scaling at the price of the full distributed-systems tax.
Services.
| Concern | Service | Role |
|---|---|---|
| Orchestration | AKS (zone-redundant node pools) | Run/scale containerised services |
| API gateway | Azure API Management | Routing, aggregation, offloading, throttling, auth |
| Async messaging | Azure Service Bus (queues/topics) | Decouple services; load levelling; pub/sub |
| Per-service data | Cosmos DB / Azure SQL / Redis | Polyglot persistence — right store per service |
| Cross-cutting | Service mesh / Dapr | Retry, circuit breaker, mTLS, observability sidecars |
| Registry/secrets | Azure Container Registry, Key Vault (CSI) | Images and secrets |
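The Queue-Based Load Levelling and Competing Consumers patterns named above can be sketched without any Azure dependency. In this minimal illustration an in-process `queue.Queue` stands in for a Service Bus queue and threads stand in for service replicas; it is not the Service Bus SDK, and `run_competing_consumers` and the worker names are hypothetical.

```python
import queue
import threading


def run_competing_consumers(orders, worker_count=3):
    """Drain a shared queue with several workers; each order is handled once."""
    work = queue.Queue()
    for order in orders:
        work.put(order)  # the producer only enqueues and never waits on a consumer

    processed = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                order = work.get_nowait()  # workers compete for the next message
            except queue.Empty:
                return  # queue drained, this worker stops
            with lock:
                processed.append((name, order))
            work.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


results = run_competing_consumers(range(20))
print(len(results), "orders processed by", len({name for name, _ in results}), "worker(s)")
```

The point of the pattern is visible in the shape: the producer's pace and the consumers' pace are decoupled by the queue, and adding workers raises throughput without changing the producer.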
Key design decisions & Well-Architected tradeoffs. This is the rung where it is easiest to over-engineer, so be honest about the trade. Microservices on AKS deliver enormous Operational-Excellence upside for the right org — independent deploys, blast-radius isolation per service, technology heterogeneity — and excellent Performance Efficiency through granular scaling. But the cost is steep and lands squarely on Operational Excellence and Cost: you now own a Kubernetes platform (upgrades, node management, networking, security), you inherit eventual consistency and distributed-transaction complexity (hello Saga and Compensating Transaction), debugging requires distributed tracing, and you must fund a platform team. The design principle that should haunt this decision is keep it simple and its corollary from the styles lesson: do not adopt microservices for a small team or a simple domain. A two-pizza company running this rung is the canonical over-engineering failure. The right trigger is many teams + genuine domain complexity + independent-scaling needs — all three, not one.
Rough cost. A production AKS platform (multiple zone-spread nodes, system + user pools) + APIM (Standard/Premium) + Service Bus + multiple data stores + the human cost of a platform team: easily ₹1,50,000–₹5,00,000+/month in infrastructure, before salaries. APIM Premium alone is a significant line item. This is a step-change in total cost of ownership, much of it operational rather than on the invoice.
When this is enough — and a caution. Choose this rung when the organisation, not just the workload, demands it: many autonomous teams blocked by a shared monolith, sub-domains with divergent scaling and release needs, and the maturity to run a platform. Do not climb here for availability — Rung 3 already gives you HA more cheaply. And note this rung is somewhat orthogonal to Rungs 5–6: you can take a microservices estate multi-region, and a monolith can be mission-critical. The ladder is not strictly linear here; this rung is about structure, the next two are about geography.
Rung 5 — Multi-region active-passive (disaster recovery)
Scenario & requirements. Now the requirement crosses the regional boundary. A whole-region Azure outage — rare, but real — must not take ContosoCart permanently down, and a compliance auditor requires a demonstrable, geographically separate DR capability. RTO: tens of minutes (a controlled failover is acceptable; near-zero is not yet required). RPO: minutes (a small amount of in-flight data loss at the moment of regional failure is tolerable). Availability: a composite target meaningfully above what one region can offer. Budget: the business funds a standby footprint — paying for insurance it hopes never to use. Team: a mature platform/SRE team that runs DR drills.
The design. Deploy the (zone-redundant) stack into a primary region and a secondary (paired) region. Azure Front Door (or Traffic Manager) sits globally in front and routes all traffic to primary in priority/failover mode. The data tier replicates asynchronously to the secondary: Azure SQL active geo-replication / failover groups (a readable secondary in the paired region), GRS/RA-GRS storage, geo-replicated Cosmos DB. Compute in the secondary is either warm (running but not taking traffic) or pilot-light/cold (minimal, scaled up on failover) depending on RTO and budget. For IaaS components, Azure Site Recovery (ASR) replicates VMs. A tested failover runbook (ideally automated) promotes the secondary.
This is the active-passive multi-region DR design, the geographic expression of “design for recovery”. It is treated in depth in multi-region disaster recovery and grounded in high availability vs disaster recovery and RTO/RPO.
Services.
| Concern | Primary region | Secondary (paired) region |
|---|---|---|
| Global routing | Front Door / Traffic Manager (priority routing) | (same global resource) |
| App tier | Active, zone-redundant | Warm or pilot-light standby |
| Database | Azure SQL primary | Geo-replicated readable secondary (failover group) |
| Storage | ZRS | GRS/RA-GRS replica in paired region |
| IaaS | — | Azure Site Recovery replicas |
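Priority (failover) routing is simple to reason about once written down: send traffic to the most-preferred origin that is currently healthy. The sketch below only illustrates that selection rule; it is not how Front Door or Traffic Manager is configured, and the `Origin` type and region labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Origin:
    name: str
    priority: int  # lower number means more preferred
    healthy: bool  # result of the latest health probe


def pick_origin(origins: List[Origin]) -> Optional[Origin]:
    """Return the healthy origin with the best priority, or None if all are down."""
    healthy = [o for o in origins if o.healthy]
    return min(healthy, key=lambda o: o.priority) if healthy else None


normal = [Origin("primary", 1, True), Origin("secondary", 2, True)]
outage = [Origin("primary", 1, False), Origin("secondary", 2, True)]

print(pick_origin(normal).name)  # primary
print(pick_origin(outage).name)  # secondary
```

In active-passive, the secondary only ever receives traffic in the second case, which is exactly why its warm-up time counts toward RTO.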
Key design decisions & Well-Architected tradeoffs, and the composite-SLA point. Two decisions dominate. First, how warm is the standby? A warm standby gives a shorter RTO but you pay for largely-idle capacity; a pilot-light standby is cheap but the RTO includes scale-up time. This is a direct Cost-vs-Reliability dial. Second, asynchronous replication means a non-zero RPO: at the instant the primary region dies, the last few seconds of transactions that hadn’t yet replicated are lost. If RPO must be zero, active-passive with async replication cannot deliver it and you are pushed toward Rung 6’s multi-write data.
The reliability payoff is best understood through composite SLA maths. Components in series multiply (and so reduce the total); independent redundant paths combine to raise availability. Two regions each at availability A, behind a global router, give a combined availability of roughly 1 − (1 − A)² — the system is down only when both regions are simultaneously down. Worked example: a region at 99.9% (down 0.1% of the time) deployed in two regions yields about 1 − (0.001)² = 99.9999% for the redundant tier in isolation — but you must then multiply by the SLA of the global router and any shared/serial dependency in front. The lesson: redundant regions raise availability dramatically, but a serial dependency (a single global front door, a single shared identity provider) caps the composite, so you account for those too. This is exactly the maths the mission-critical lesson formalises.
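The two rules in the paragraph above (serial components multiply, redundant paths combine as 1 − (1 − A)²) fit in a few lines. This is a minimal sketch of the arithmetic only: the 99.9% regional figure comes from the worked example, while the 99.99% router figure is an assumed placeholder, not a quoted SLA.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Components in series: every one must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Independent redundant paths: down only when all paths are down at once."""
    return 1 - prod(1 - a for a in availabilities)


region = 0.999                            # from the worked example above
two_regions = redundant(region, region)   # 1 - 0.001**2
router = 0.9999                           # assumed figure for the global router
composite = serial(router, two_regions)

print(f"two regions in isolation: {two_regions:.6%}")
print(f"behind the global router: {composite:.6%}")
```

With these inputs the composite lands just below the router's own figure, which is the cap the paragraph warns about: past a certain point, adding regions buys nothing until the serial dependency improves.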
Rough cost. Roughly 1.5×–2× the single-region (Rung 3) cost, depending on how warm the standby is and how much data egress the geo-replication generates. A pilot-light secondary keeps it near the lower end; a fully warm standby approaches 2×. You are paying a real premium for insurance; that is the nature of DR.
When this is enough. Regulated workloads that must prove a DR site; revenue-critical systems that can tolerate a brief, controlled failover (tens of minutes) and a few seconds of data loss; most “serious enterprise” applications that need regional protection but cannot justify true always-on. For the large majority of organisations, active-passive multi-region is the top of the ladder they ever actually need. Climb to Rung 6 only when the financial or human cost of even a brief outage — or any data loss — genuinely justifies the very large jump in cost and complexity that active/active demands.
Rung 6 — Multi-region active-active (mission-critical / AlwaysOn)
Scenario & requirements. This is the apex. ContosoCart is now, say, a payments or trading platform where an hour of downtime costs crores and a regional failover that takes ten minutes is itself a catastrophe. The system must keep transacting through a regional failure with no human in the loop for the first line of defence. RTO: effectively zero (seconds, automatic — users in a failing region are routed away transparently). RPO: at or near zero. Availability: the maximum the architecture can deliver, justified by the cost of downtime. Budget: explicitly funded for “as close to always-on as possible”. Team: a mature SRE organisation running continuous validation and chaos engineering.
The design. Multiple regions run active/active — all taking live traffic simultaneously — built from deployment stamps (a.k.a. scale units): byte-for-byte identical, self-contained regional deployments, each a complete copy of the stack. A thin global layer (Front Door anycast ingress with health-probe-driven routing, global DNS, shared identity) sits above the stamps and routes users to a healthy stamp. The data tier is active-active multi-write — Cosmos DB with multi-region writes (and a defined conflict-resolution policy) so any region can accept writes, eliminating the failover-promotion step entirely. Critically, the system is governed by a health model: telemetry is rolled up into healthy / degraded / unhealthy states (not raw uptime), and that signal drives automated traffic shifting and the blue/green deployment of whole stamps. Continuous validation and chaos (fault injection) constantly proves the system can take the failures it claims to survive.
This is the mission-critical (AlwaysOn) architecture, and it is the convergence point of everything in this module — it uses Deployment Stamps, Geode, Health Endpoint Monitoring, the resilience pattern compositions, and the composite-SLA reasoning all at once. It is taught in full in Mission-Critical (AlwaysOn) Architecture on Azure: The Apex Design, and its data layer is the active-active end-state of multi-region disaster recovery.
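Multi-region writes only work if every region resolves a conflict the same way. One common policy is last-writer-wins; the sketch below illustrates that rule in isolation, assuming each write carries a comparable timestamp. The `Version` type and the tie-break on region name are hypothetical choices for the illustration, not Cosmos DB's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    region: str     # region that accepted the write
    timestamp: int  # assumed comparable across regions
    value: dict     # the document as written


def resolve_last_writer_wins(a: Version, b: Version) -> Version:
    """Pick the later write; break ties deterministically by region name."""
    # The tie-break matters: every region must choose the same winner
    # regardless of the order in which it sees the two versions.
    return max(a, b, key=lambda v: (v.timestamp, v.region))


basket_a = Version("region-a", 1005, {"items": 2})
basket_b = Version("region-b", 1009, {"items": 3})

winner = resolve_last_writer_wins(basket_a, basket_b)
print(winner.region, winner.value)
```

Note what the policy silently does: the losing write is discarded. For a basket that may be acceptable; for a ledger it usually is not, which is why the policy has to be a defined design decision.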
Services.
| Concern | Realisation |
|---|---|
| Global ingress | Azure Front Door (anycast, health-probe routing) + global DNS |
| Regional unit | Deployment stamp — full stack (zone-redundant AKS/App Service + App Gateway + messaging + cache + private data endpoints) |
| Data | Cosmos DB multi-region writes with conflict resolution (active-active) |
| Health | Health model: telemetry → healthy/degraded/unhealthy → automated routing |
| Delivery | Blue/green of whole stamps; continuous validation + chaos (Azure Chaos Studio) |
Key design decisions & Well-Architected tradeoffs. The defining principle is active/active with blast-radius reduction and fault isolation — each stamp is independent, so a poisoned stamp can be drained and replaced without touching the others. The deepest conceptual shift is observing application health, not uptime: a stamp that is “up” but returning errors is unhealthy and must be routed away, which only a real health model can detect. The tradeoffs are severe and must be stated plainly to any board: cost is the highest on the ladder (you run N full stacks, all live); complexity is the highest (multi-write conflict resolution, global consistency, stamp lifecycle); and the operational and engineering maturity required is substantial (chaos engineering, continuous validation, automated everything). The mission-critical guidance is explicit that this is the most expensive way to run software on Azure, and that the central skill is knowing when the price is justified and avoiding complexity that doesn’t buy reliability — complexity avoidance is itself one of its principles. You do not arrive here by ambition; you arrive here because the cost of downtime forced you, and you proved it with a number.
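The difference between uptime and health is easiest to see in code. This is a minimal sketch of a health model, assuming two signals per stamp (error rate and p95 latency); the thresholds, function names and stamp names are all hypothetical, and a real model would roll up far more telemetry.

```python
from enum import Enum
from typing import Dict, List


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def stamp_health(error_rate: float, p95_latency_ms: float) -> Health:
    """Roll raw telemetry up into one state. Thresholds are illustrative only."""
    if error_rate >= 0.05 or p95_latency_ms >= 2000:
        return Health.UNHEALTHY
    if error_rate >= 0.01 or p95_latency_ms >= 800:
        return Health.DEGRADED
    return Health.HEALTHY


def routable_stamps(stamps: Dict[str, Health]) -> List[str]:
    """Prefer healthy stamps, fall back to degraded, never route to unhealthy."""
    healthy = [name for name, state in stamps.items() if state is Health.HEALTHY]
    if healthy:
        return healthy
    return [name for name, state in stamps.items() if state is Health.DEGRADED]


# A stamp that answers every probe but fails 20% of requests is "up" yet unhealthy.
states = {
    "stamp-1": stamp_health(error_rate=0.002, p95_latency_ms=140),
    "stamp-2": stamp_health(error_rate=0.20, p95_latency_ms=150),
}
print(routable_stamps(states))  # ['stamp-1']
```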
Rough cost. Multiples of Rung 5 — you are running several complete active stacks, plus globally-distributed multi-write data, plus the tooling and the SRE organisation. For most organisations the figure is “if you have to ask whether you can afford it, you are not on this rung”. The justification is never the infrastructure cost in isolation; it is the infrastructure cost measured against the cost of downtime the business has quantified (e.g. ₹4 crore/hour). That ratio is the entire argument.
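That ratio can be written as a one-line comparison. The sketch below uses the ₹4 crore/hour figure from the paragraph above; the availability numbers and the extra annual spend are assumptions chosen only to show the shape of the argument, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored
CRORE = 10_000_000
LAKH = 100_000


def expected_annual_downtime_cost(availability: float, cost_per_hour_inr: float) -> float:
    """Expected yearly downtime cost at a given availability (fraction, 0 to 1)."""
    return (1 - availability) * HOURS_PER_YEAR * cost_per_hour_inr


def upgrade_pays_off(current: float, target: float,
                     cost_per_hour_inr: float, extra_annual_spend_inr: float) -> bool:
    """True when the downtime cost avoided exceeds the extra yearly spend."""
    avoided = (expected_annual_downtime_cost(current, cost_per_hour_inr)
               - expected_annual_downtime_cost(target, cost_per_hour_inr))
    return avoided > extra_annual_spend_inr


# Assumed: moving from 99.95% to 99.99% costs an extra 2 crore rupees a year.
print(upgrade_pays_off(0.9995, 0.9999, 4 * CRORE, 2 * CRORE))  # True
print(upgrade_pays_off(0.9995, 0.9999, 1 * LAKH, 2 * CRORE))   # False
```

The same upgrade is obviously right for one business and obviously wrong for another; only the quantified cost of downtime changed.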
When this is enough. When downtime has a catastrophic, quantified cost and the business has issued an explicit mandate for always-on: payments, trading, emergency/health systems, anything where minutes of outage mean lost lives or losses of lakhs to crores per minute. This is the ceiling of the ladder. There is nothing above it; the work beyond this rung is operating it well, not adding more architecture. For the vast majority of systems, reaching this rung would be a textbook over-engineering error.
The diagram lays the six rungs side by side so the shape of the climb is visible at a glance: each step adds a specific capability (durability → predictability → zone tolerance → team autonomy → regional DR → regional failure tolerance) while cost and complexity rise non-linearly — the lesson is to climb exactly as high as the requirements force you and no higher.
How to choose a rung from requirements
You never pick a rung by taste. You read the axes and let them point. Here is the decision distilled into a single table: read it top to bottom and settle on the last row whose requirement you genuinely have, because a requirement further down the table overrides the cheaper rows above it.
| If the requirement is… | …the rung is | Why |
|---|---|---|
| Cheap, spiky, simple backend, best-effort uptime | 1 — Static + serverless | Scale-to-zero, near-zero cost; cold starts acceptable |
| Predictable performance, relational data, ~99.9%, RTO ~1h | 2 — Single-region 3-tier | Warm PaaS + ACID; simplest “real” production design |
| Must survive a datacentre/zone failure, ~99.95%+, auto recovery | 3 — Zone-redundant HA | Cheapest big reliability win; the right default for serious workloads |
| Many autonomous teams + complex domain + independent scaling | 4 — Microservices on AKS | Organisational/scaling driver — not an availability driver |
| Must survive a whole-region outage; DR provable; RTO tens of min, small RPO | 5 — Active-passive multi-region | Geographic redundancy as insurance; composite SLA up, async RPO |
| Catastrophic, quantified cost of any downtime; RTO/RPO ≈ 0; mandate for always-on | 6 — Active-active mission-critical | The apex; cost justified only by cost-of-downtime maths |
Four rules govern the whole climb:
- Requirements drive the rung — not fashion, not résumé-building. The single best question in any review is “what requirement forces us off the rung below?” If you cannot answer it crisply, you have over-engineered.
- Reliability and geography are different axes from organisation. Rungs 1→2→3 and 5→6 climb reliability/geography; Rung 4 climbs organisational structure. Do not reach for AKS to get availability — Rung 3 does that more cheaply.
- Zone redundancy (Rung 3) is the highest-ROI step on the ladder. It removes the most common real failure for a config-flag-and-tier-bump premium. Most teams under-buy it and over-buy multi-region.
- Every step up spends Cost and Operational-Excellence currency to buy Reliability and Performance. That is the Well-Architected trade in one sentence. Make it deliberately, write down what you bought and what you paid, and you will rarely be wrong.
The honest summary: most production workloads belong on Rung 3. A large share are happy on Rung 2. Rung 5 is for the regulated and the genuinely revenue-critical. Rungs 4 and 6 are for specific, provable situations and are over-engineering everywhere else. Climbing the ladder is easy; the discipline — and the seniority — is in knowing when to stop.
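The table and rules can be sketched as a small function that starts on Rung 1 and climbs only when a requirement forces it. This is a minimal sketch, not a decision engine: the `Requirements` fields and the `choose_rung` name are hypothetical, and a real review would weigh budget and compliance as well. Because Rung 4 sits on a separate organisational axis, the sketch returns it as a separate flag rather than as a step on the reliability climb.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Requirements:
    """Hypothetical requirement flags, one per forcing row in the table."""
    needs_predictable_performance: bool = False  # warm PaaS, relational data, ~99.9%
    must_survive_zone_failure: bool = False      # datacentre/zone fault tolerance
    must_survive_region_outage: bool = False     # provable regional DR
    needs_always_on: bool = False                # RTO/RPO near zero, quantified downtime cost
    many_autonomous_teams: bool = False          # organisational axis (Rung 4)


def choose_rung(req: Requirements) -> Tuple[int, bool]:
    """Return (reliability rung, whether the microservices axis is justified)."""
    rung = 1  # static + serverless unless a requirement forces a climb
    if req.needs_predictable_performance:
        rung = 2
    if req.must_survive_zone_failure:
        rung = 3
    if req.must_survive_region_outage:
        rung = 5
    if req.needs_always_on:
        rung = 6
    return rung, req.many_autonomous_teams


# A serious single-team workload that must survive a zone fault.
rung, microservices = choose_rung(
    Requirements(needs_predictable_performance=True, must_survive_zone_failure=True)
)
print(rung, microservices)
```

Each `if` is the code form of the review question "what requirement forces us off the rung below?": if no flag is set, the function never leaves Rung 1.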
Real-world application
In a real Azure design engagement this ladder is the backbone of the first conversation, before a single resource is drawn. You sit with the business owner and pin the axes: “What does an hour of downtime actually cost you? Can you lose ten minutes of data? How many teams will touch this? What does compliance demand?” Their answers land you on a rung, and from there the service list almost writes itself.
It also reframes migration and modernisation: a lift-and-shift typically lands a workload on Rung 2, and the modernisation roadmap is literally “which rung, and when?” — usually 2→3 first (cheap, huge reliability gain), and 3→5 only if a regional-DR requirement appears. It anchors cost conversations with FinOps, because each rung has a recognisable cost shape and “we are paying for Rung 6 but only need Rung 3” is one of the most common and expensive findings in a cloud cost review. And in an AZ-305 design exam, the questions are this ladder in disguise: a scenario hands you RTO/RPO/scale/budget and asks for the design — you are being tested on whether you can land on the right rung without over- or under-shooting.
Common mistakes & anti-patterns
- Over-engineering — building Rung 5 or 6 for a Rung 2/3 requirement. The classic. A small team runs a multi-region active-active estate for an internal app, drowning in cost and operational toil it cannot sustain. If you cannot name the requirement that forced you up, climb back down.
- Under-engineering — a single instance/region for a revenue-critical system. A single App Service instance (no SLA) or single-AZ database under a checkout flow. One rack fault and the business stops. At minimum, run two instances; if revenue is at stake, get to Rung 3.
- Reaching for microservices to get availability. AKS does not make you more available than zone-redundant App Service — it makes you more operationally complex. Microservices are an answer to organisational and scaling pressure, not a reliability upgrade.
- Forgetting the serial dependency in the composite SLA. Adding a second region but leaving a single shared identity provider, a single global gateway, or a single shared database in front. The composite SLA is capped by the least available serial component, so the redundancy buys far less than the maths-on-the-back-of-a-napkin promised.
- Ignoring the warm/cold standby dial in DR. Paying for a fully hot secondary when a pilot-light meets the RTO (wasted money), or running a cold secondary when the RTO is tight (a failover that misses its target). Tune the standby to the actual RTO.
- Treating active-passive’s RPO as zero. Asynchronous geo-replication always has a non-zero RPO. If the business truly needs zero data loss, active-passive cannot deliver it — that requirement is what justifies Rung 6’s multi-write data.
- Confusing “highly available” with “disaster recoverable”. Zone redundancy (Rung 3) is HA — it survives a datacentre fault. It does not survive a regional outage. They are different rungs solving different failures; do not let one stand in for the other.
- Skipping Rung 3. Teams often jump from a single-region app straight to a multi-region project because it sounds impressive, skipping the cheapest, highest-ROI reliability step. Enable zone redundancy first; reconsider multi-region only if a regional requirement actually exists.
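The serial-dependency trap is easier to see with numbers. The sketch below assumes component failures are independent, and the availability figures (99.9% per region, 99.99% for a shared component in front) are illustrative rather than quoted from any Azure SLA document. The helper names `serial` and `redundant` are hypothetical.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Down only when every redundant path is down at the same time."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Two regions at 99.9% each, considered in isolation.
two_regions = redundant(0.999, 0.999)

# The same two regions behind one shared serial component at 99.99%
# (for example a single global gateway or identity provider).
end_to_end = serial(0.9999, two_regions)

print(f"regions only: {two_regions:.6f}")
print(f"end to end:   {end_to_end:.6f}")
```

The redundant tier alone reaches about 99.9999%, but the end-to-end figure falls just below the 99.99% of the shared component. Adding a third region would barely move it; only making the serial component more available would.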
Interview & exam questions
- Walk me through how you would choose between a single-region zone-redundant design and a multi-region active-passive design. (Look for: RTO/RPO and regional-outage tolerance as the deciding axis; zone redundancy survives a datacentre fault but not a region; multi-region is justified by a regional-outage or compliance-DR requirement; cost roughly 1.5–2× and a non-zero RPO from async replication.)
- A startup with two engineers wants to build “microservices on Kubernetes” for their MVP. What’s your advice? (Look for: that’s an over-engineering anti-pattern; microservices answer organisational complexity they don’t have; recommend Rung 1 or 2; “keep it simple”; revisit AKS when many teams and domain complexity force it.)
- Calculate the approximate composite availability of two regions, each 99.9%, behind a global router, and explain what caps it. (Look for: redundant paths combine as 1 − (1 − A)² ≈ 99.9999% for that tier in isolation; then multiply by the router/serial-dependency SLA; the least-available serial component caps the composite.)
- Why is single-instance Azure App Service not covered by an availability SLA, and what’s the minimum change to get one? (Look for: the SLA requires two or more instances; a single instance has no redundancy; run ≥2 instances, ideally zone-spread, to earn the SLA — that’s the Rung 2→3 boundary.)
- A workload needs RPO = 0. Which rung does that force, and why can’t a cheaper one deliver it? (Look for: active-passive uses async replication → non-zero RPO; zero RPO requires synchronous or active-active multi-write data, i.e. Rung 6 / mission-critical with Cosmos DB multi-region writes — or at least synchronous zone replication for zonal zero-loss.)
- What is a deployment stamp / scale unit, and at which rung does it appear? (Look for: a self-contained, identical, independently-deployable copy of the full stack; appears at Rung 6 mission-critical; enables blast-radius isolation and blue/green of whole stamps; built on the Deployment Stamps pattern.)
- The business says “we want 99.99% uptime.” What questions do you ask before designing? (Look for: cost of downtime, RTO/RPO, is it composite, regional vs zonal tolerance, budget, compliance; each nine is ~10× harder/dearer; map the answer to a rung rather than over-building.)
- Explain why a health model beats raw uptime monitoring for a mission-critical system. (Look for: a node can be “up” but returning errors/degraded; routing on raw uptime keeps sending traffic to a sick stamp; the health model rolls telemetry into healthy/degraded/unhealthy to drive automated routing and failover.)
- Why might you choose active-passive over active-active even when you can afford active-active? (Look for: complexity and consistency cost — multi-write conflict resolution is hard; if RTO of tens of minutes is acceptable, active-passive is far simpler and cheaper to operate; complexity avoidance is itself a mission-critical principle.)
- Where does a static-site-plus-serverless design break down, and what’s the next rung? (Look for: cold-start latency, stateful/complex server logic, need for a contractual SLA or relational integrity; climb to Rung 2 single-region 3-tier on App Service + Azure SQL.)
- How does this ladder map to a Well-Architected tradeoff? (Look for: every step spends Cost and Operational-Excellence currency to buy Reliability/Performance; the right rung is the cheapest one that meets the requirement with margin; over- and under-shooting are both pillar violations.)
- Is the ladder strictly linear? Defend your answer. (Look for: no — Rung 4 is an organisational axis orthogonal to the geographic axis of 5–6; a monolith can be mission-critical and a microservices estate can be single-region; Rungs 1→2→3 and 5→6 climb reliability/geography, 4 climbs structure.)
Quick check
- Which rung is the highest-ROI reliability upgrade, and what failure does it remove?
- What is the composite-availability formula for two independent redundant regions each at availability A?
- Name the axis that drives a move to microservices on AKS — and the axis it does not improve.
- Why does active-passive multi-region always have a non-zero RPO?
- What signal drives automated traffic shifting in a mission-critical (Rung 6) design — and why not raw uptime?
Answers.
- Rung 3 — zone-redundant HA. It removes a single availability-zone / datacentre failure, usually for a config-flag-plus-tier-bump premium rather than a re-architecture.
- 1 − (1 − A)² for that redundant tier in isolation (the system is down only when both regions are down) — then multiply by the SLA of any serial dependency in front (the global router, shared identity).
- It is driven by organisational scale and domain complexity (many autonomous teams + independent-scaling needs). It does not improve availability — Rung 3 delivers HA more cheaply.
- Because geo-replication is asynchronous — at the instant the primary region fails, the last few seconds of un-replicated transactions are lost. Zero RPO requires synchronous or active-active multi-write data.
- A health model that rolls telemetry into healthy / degraded / unhealthy states. Raw uptime can show a node as “up” while it returns errors; routing on uptime keeps sending traffic to a sick stamp.
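The last answer can be illustrated with a minimal health-model sketch. Everything here is an assumption made for illustration: the `StampTelemetry` fields, the thresholds and the function names are hypothetical, and this is not an Azure API. A production health model would draw on many more signals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class StampTelemetry:
    reachable: bool        # what raw uptime monitoring would report
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile response time


def evaluate(t: StampTelemetry,
             unhealthy_error_rate: float = 0.05,
             degraded_error_rate: float = 0.01,
             degraded_latency_ms: float = 500.0) -> Health:
    """Roll raw telemetry into one of three states (thresholds are placeholders)."""
    if not t.reachable or t.error_rate >= unhealthy_error_rate:
        return Health.UNHEALTHY
    if t.error_rate >= degraded_error_rate or t.p95_latency_ms >= degraded_latency_ms:
        return Health.DEGRADED
    return Health.HEALTHY


def routable_stamps(stamps: Dict[str, StampTelemetry]) -> List[str]:
    """Stamps that should keep receiving traffic."""
    return [name for name, t in stamps.items()
            if evaluate(t) is not Health.UNHEALTHY]


stamps = {
    "stamp-a": StampTelemetry(reachable=True, error_rate=0.001, p95_latency_ms=120.0),
    # "Up" by an uptime check, but failing one request in five.
    "stamp-b": StampTelemetry(reachable=True, error_rate=0.20, p95_latency_ms=150.0),
}
print(routable_stamps(stamps))
```

Here `stamp-b` responds to a ping, so raw uptime would keep routing to it, while the health model marks it unhealthy and removes it from rotation.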
Exercise
The brief. You are the architect for “MediShip”, a pharmacy fulfilment platform. Requirements as stated by the business: it handles prescription orders across India; a regional Azure outage must not lose orders and must not take the system down for more than ~15 minutes; losing more than a minute or two of order data in a disaster is unacceptable for audit reasons; traffic is steady with predictable evening peaks; there is one moderately-sized engineering team; a regulator requires a demonstrable, geographically separate DR capability; the budget is real but not unlimited. Choose a rung, name the key services, and state the one decision you would push back on.
Write your answer before reading on.
Model answer. Read the axes. “Regional outage must not take it down for >15 min” + “regulator requires a geographically separate, demonstrable DR” → this crosses the regional boundary, so Rung 3 (zone-only) is not sufficient. RTO of ~15 minutes is achievable with a controlled failover, so you do not need true active/active (Rung 6) — its cost and complexity aren’t justified by a 15-minute RTO. One moderately-sized team → microservices on AKS (Rung 4) is an over-engineering trap; stay on PaaS. The right rung is 5 — active-passive multi-region, with each region’s stack zone-redundant (so you also get Rung 3’s HA inside it). Services: Front Door (priority/failover routing) globally; App Service (zone-redundant) in primary, warm or pilot-light standby in the paired region; Azure SQL with a failover group / active geo-replication (readable secondary in the paired region); RA-GRS storage; a tested, ideally automated failover runbook.
The decision to push back on: the stated RPO of “a minute or two” sits in tension with async geo-replication, which under a sudden regional loss can lose more than that. You would surface this explicitly: either (a) accept that Azure SQL failover groups typically achieve a low-seconds-to-low-minutes RPO and validate it meets the audit requirement, or (b) if the audit truly demands near-zero loss, recognise that this requirement alone pushes you toward synchronous/active-active data (a Rung-6 cost) and force the business to choose between the RPO and the budget. Naming that tension — rather than silently designing past it — is the senior move. Also worth flagging: tune the standby’s warmth to the 15-minute RTO (a pilot-light standby that scales up may or may not hit 15 minutes — test it), and do not over-buy a fully hot standby if pilot-light meets the target.
Certification mapping
This lesson is squarely AZ-305 (Designing Microsoft Azure Infrastructure Solutions) territory — the exam is, in essence, a series of “given these requirements, choose the design” questions, which is precisely this ladder.
| Cert | Relevance |
|---|---|
| AZ-305 | Primary. Design for high availability and disaster recovery (RTO/RPO, zone vs region, active-passive vs active-active); choose compute (Functions/App Service/AKS) and data stores from requirements; cost-aware design; composite SLA reasoning. Every rung maps to exam objectives. |
| AZ-104 | Operating the building blocks — App Service plans and scale, Azure SQL tiers and geo-replication, storage redundancy (LRS/ZRS/GRS), availability zones, Front Door/App Gateway configuration. |
| AZ-204 | The Rung 1–2 developer view — Functions and Consumption plans, App Service deployment, Cosmos DB, Service Bus, durable/async patterns and resilience in code. |
For AZ-305 specifically, drill the availability maths: per-service SLAs, the down-minutes-per-month each “nine” implies, composite SLA for serial chains, and how zones/paired-regions raise the achievable number. That maths is covered in the lesson Azure cloud economics: pricing, TCO, SLAs and support, and it is a reliable source of exam points.
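As a quick drill for the down-minutes-per-month figures, the following sketch converts an availability percentage into allowed downtime. It assumes a 30-day month for simplicity; real SLA documents define their own measurement window, and the function name is hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes_per_month(availability_percent: float) -> float:
    """Minutes of downtime a given availability allows in a 30-day month."""
    return (1 - availability_percent / 100) * MINUTES_PER_30_DAY_MONTH


for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% allows {downtime_minutes_per_month(sla):.1f} minutes per month")
```

At 99.9% the budget is about 43.2 minutes a month, and at 99.99% it is about 4.3 minutes, which is why each extra nine demands a different class of design rather than a small tweak.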
Glossary
- Rung — In this lesson, one design on the progression from simplest to mission-critical; the unit of the architecture decision.
- RTO (Recovery Time Objective) — The maximum acceptable time to restore service after a failure. Drives how much redundancy/failover automation you need.
- RPO (Recovery Point Objective) — The maximum acceptable amount of data loss, measured in time. Drives the replication strategy (backup vs sync vs multi-write).
- Availability zone — One of (typically three) physically separate locations within an Azure region, each made up of one or more datacentres with independent power, cooling and networking. Zone redundancy survives the loss of one.
- Paired region — A second Azure region linked to the primary for DR; the basis of multi-region active-passive and active-active designs.
- Composite SLA — The end-to-end availability of a system computed from its components: serial dependencies multiply (lowering it); redundant paths combine (raising it).
- Zone-redundant — A resource whose instances/replicas are spread across availability zones, so a single-zone failure does not take it down.
- Active-passive (DR) — One region serves traffic; a standby region (warm/pilot-light/cold) takes over on failover. Async replication → non-zero RPO; failover → non-zero RTO.
- Active-active — All regions serve live traffic simultaneously; with multi-write data there is no failover step. The basis of mission-critical.
- Deployment stamp / scale unit — A self-contained, identical, independently-deployable copy of the full application stack; the unit of scale and blast-radius isolation in mission-critical designs.
- Health model — A scheme that rolls raw telemetry into healthy/degraded/unhealthy states to drive automated routing and failover, instead of relying on raw uptime.
- Pilot light — A minimal always-on footprint in the standby region (e.g. replicated data, scaled-down compute) that is scaled up on failover; cheaper than a hot standby, with a longer RTO.
- Cold start — The latency penalty when a scaled-to-zero serverless instance must be provisioned to serve the first request after idle.
- Distributed-systems tax — The added complexity (eventual consistency, distributed tracing, sagas, partial failure) you accept when you decompose into services across the network.
Next steps
You now have the spine of architectural judgement: requirements in, the right rung out. The natural next lesson is the top of this ladder seen up close — Mission-Critical (AlwaysOn) Architecture on Azure: The Apex Design — where deployment stamps, the health model, active/active multi-write data, composite-SLA maths and continuous validation are taught in full. Everything there will feel inevitable, because you watched the requirements build it rung by rung.
To deepen the surrounding material:
- Revisit The Azure Well-Architected Framework, In Depth — every step on this ladder is a deliberate move within its system of tradeoffs (Cost/Ops spent to buy Reliability/Performance).
- Read The 43 Azure Cloud Design Patterns — the tactical moves (Static Content Hosting, Queue-Based Load Levelling, Gateway Aggregation, Deployment Stamps, Geode, Saga) that appear inside the rungs above.
- Revisit Choosing an Architecture: Styles & the Ten Design Principles — the styles (N-tier, web-queue-worker, microservices) each rung instantiates, and the principles (keep it simple, make all things redundant, scale out) that justify climbing or staying put.
- See it land in a whole system: Multi-region active-active disaster recovery, and High availability vs disaster recovery and RTO/RPO, for the geographic rungs in detail.
- Connect it to the organisation: every workload lands inside an application landing zone — see Cloud Adoption Framework & Azure Landing Zones, In Depth and Azure Landing Zones with CAF for where these designs are deployed.
- Ground the numbers: Azure cloud economics: pricing, TCO, SLAs and support for the SLA and composite-availability maths that decide which rung you can stand behind.