Mission-Critical (AlwaysOn) Architecture on Azure: The Apex Design

This is the apex of the Architecture & Design Mastery module — the lesson where the Well-Architected pillars, the cloud design patterns, the architecture styles, and the landing zone all converge into a single, coherent, ruthless target: a system that stays up. Not “highly available”. Not “we have a DR runbook”. Mission-critical — engineered so that the business keeps transacting through a zone failure, a regional failure, a bad deployment, a poisoned cache, and a dependency that just went dark, with no human in the loop for the first line of defence.

Microsoft codified this discipline as the mission-critical workload guidance (historically the AlwaysOn project) inside the Azure Well-Architected Framework. It is not a product, a SKU, or a reference template you deploy. It is a design methodology — five principles, eight design areas, and two reference implementations you can read line-by-line on GitHub. It is also, bluntly, the most expensive way to run software on Azure, and the central skill this lesson teaches is knowing when that price is justified and how to spend it where it actually buys reliability.

Everything in the earlier lessons was building to this. The Well-Architected Reliability pillar told you to design for resilience and recovery; mission-critical tells you exactly how far. The design patterns gave you Bulkhead, Circuit Breaker, Deployment Stamps, Health Endpoint Monitoring, Geode; mission-critical composes them into one architecture. The landing zone gave you the platform; mission-critical is the most demanding workload that lands on it. If you can design a mission-critical system and defend every tradeoff, you are no longer a service-operator — you are an architect.

Learning objectives

By the end of this lesson you will be able to:

Define mission-critical in business terms — tie criticality to the financial and human cost of downtime, and explain why Reliability is the primary pillar but never a blank cheque.
Apply the five mission-critical design principles (active/active, blast-radius reduction & fault isolation, observe application health, drive automation, self-healing with complexity avoidance) and explain the tension between them.
Work through the eight design areas end-to-end and make a defensible decision in each, from the application scale-unit down to operational procedures.
Design with the scale-unit / deployment-stamp model and compute a composite SLA for a multi-region, multi-dependency system — including why adding components lowers the number.
Build a health model that classifies the system as healthy / degraded / unhealthy from telemetry, instead of trusting raw infrastructure uptime.
Make zero-downtime deployment and continuous validation/chaos a first-class part of the architecture, including blue/green of an entire stamp.
Choose between the Mission-Critical Online and Connected reference implementations and justify the choice from your connectivity and compliance constraints.

Prerequisites & where this fits

This is lesson A5, the capstone of the Architecture & Design Mastery module. It assumes the four lessons before it:

The Azure Well-Architected Framework, In Depth — mission-critical is the Reliability pillar pushed to its limit, paid for by deliberate tradeoffs against Cost and Performance. You must already think in pillar tensions.
Cloud Adoption Framework & Azure Landing Zones, In Depth — a mission-critical workload is an application landing zone with extreme requirements; it consumes the platform’s identity, connectivity, and governance.
Choosing an Architecture: Styles & the Ten Design Principles — the ten design principles (design for self-healing, make everything redundant, scale out, partition around limits, use managed services…) are the seeds of the five mission-critical principles.
The 43 Azure Cloud Design Patterns — mission-critical is a composition of patterns. You should already know Deployment Stamps, Geode, Bulkhead, Circuit Breaker, Health Endpoint Monitoring, Queue-Based Load Levelling, and Throttling.

You should also be comfortable with the reliability fundamentals: RTO/RPO and HA vs DR, active-active multi-region on Azure, and the resiliency patterns.

Where it fits in the wider course: this lesson is the design-judgement bridge to the capstone build. After this, the Azure Zero-to-Hero capstone has you build a production landing zone, and Azure Chaos Studio gives you the tooling to prove the resilience this lesson designs.

What “mission-critical” actually means

The word is abused. Every team thinks their workload is mission-critical because they care about it. The Microsoft definition is sharper and it is economic, not emotional:

A mission-critical workload is one whose unavailability or under-performance results in severe business or human consequence — large direct financial loss, regulatory or legal exposure, reputational damage, or risk to life and safety.

Criticality is therefore a spectrum, and you place a workload on it by answering one question: what does one hour of downtime cost? A payments authorisation switch, an airline check-in and departure-control system, an e-commerce platform on its peak trading day, a hospital’s clinical records, a stock exchange’s matching engine, a connected-vehicle telemetry backbone — for these, an hour of downtime is measured in millions, in lost flights, or in patient harm. That number is what justifies the architecture.

It is worth contrasting the tiers explicitly, because the architecture is completely different at each:

Tier	Typical target	Downtime/yr	Topology	Cost posture
Standard / internal	99.9%	~8.8 h	Single region, zone-redundant	Minimise cost
Business-critical	99.95–99.99%	~4.4 h–52 min	Active-passive multi-region, automated failover	Balance cost & reliability
Mission-critical	99.99%+ (often “as close to 100% as the budget buys”)	~52 min or less	Active/active multi-region, multi-write data, self-healing	Reliability is primary; cost is justified, not minimised
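The downtime column follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from any Azure SDK):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


for target in (99.9, 99.95, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/yr ({hours * 60:.1f} min/yr)")
```

Each extra nine cuts the budget by a factor of ten, which is why the topology has to change between tiers rather than just being tuned.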

Two consequences follow that you must internalise:

Reliability is the primary pillar — but it is cost-justified, never unconstrained. Mission-critical does not mean “ignore cost”. It means cost is subordinate to reliability rather than co-equal, and every reliability investment is defended by the downtime cost it averts. You do not chase a theoretical extra nine that the business will not fund; you design to the criticality the workload actually carries. Spending five-nines money on a four-nines workload is an architecture failure, not a virtue.

Reliability lives in tension with every other pillar. Active/active doubles (or worse) your compute spend (Cost). Cross-region synchronous replication taxes write latency (Performance Efficiency). Every redundant component and failover path is more to operate (Operational Excellence). Every additional network path and identity boundary is more attack surface (Security). Mission-critical design is the deliberate management of these tensions in the direction of staying up. Naming the tradeoff out loud is the senior move — “we accept +40 ms write latency and 2x compute to hold RPO near zero across regions” is an architecture statement; “we made it reliable” is not.

A final framing that keeps teams honest: the SLO is a business decision, the SLA is a contract, and the SLI is what you actually measure. Mission-critical work starts by negotiating a realistic Service-Level Objective with the business (the target you design and operate to, with error budget), separate from any provider Service-Level Agreement (the financially-backed promise from Azure) and from the Service-Level Indicators (the concrete signals — success rate, latency percentiles, freshness — your health model watches). Confusing these three is the most common reliability mistake at this level.
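To make the three terms concrete, here is a small sketch of a request-based SLI measured against an SLO, and the share of the error budget it has consumed. The request counts and the 99.99% objective are made-up illustration values, and the function names are hypothetical.

```python
def sli_success_rate(good_requests: int, total_requests: int) -> float:
    """The SLI: what was actually measured over the window."""
    return good_requests / total_requests


def error_budget_consumed(slo: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget used; above 1.0 means the SLO is breached."""
    allowed_failures = (1 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return actual_failures / allowed_failures


slo = 0.9999                       # the objective negotiated with the business
good, total = 9_999_400, 10_000_000  # illustrative counts for one window

print(f"SLI: {sli_success_rate(good, total):.5%}")
print(f"Error budget consumed: {error_budget_consumed(slo, good, total):.0%}")
```

The provider SLA does not appear in this calculation at all: it is a contractual input to the design, not something the workload measures.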

The five design principles

Microsoft’s mission-critical guidance is built on five design principles. They are not independent best-practices to tick off — they form a system, and the art is in their interplay. Learn them by name; AZ-305 reliability scenarios are built on this exact mental model.

1. Active/active by design (maximise reliability)

Run multiple regional stamps that all take live production traffic at the same time, fronted by a global ingress that routes around the unhealthy. The opposite — active-passive with a “warm” standby — has two fatal flaws for mission-critical: the standby’s failover path is exercised only during disasters (so it is never trusted), and failover introduces a recovery-time gap you are trying to eliminate. If both regions are always serving, a regional loss is just the global router shifting weight — there is no “failover event” to go wrong, because the surviving region was already doing the work.

Active/active is the principle that costs the most and the one juniors resist hardest. It is also the one that most directly buys reliability, because the only failover path you can trust is the one you use in production every minute.

2. Blast-radius reduction & fault isolation

Partition the system so that no single failure can take down everything. This is the scale-unit / deployment-stamp idea (below) plus the Bulkhead and cell-based patterns: independent, self-contained units with no shared fate. A poisoned message, a hot tenant, a bad config push, or a saturated connection pool should be contained inside one stamp/scale-unit and degrade a slice of traffic, not the whole service. The design question for every shared component is: “if this fails, who notices?” — and the answer must never be “everyone”.

3. Observe application health

You cannot heal or fail over from what you cannot see. Mission-critical demands a health model that fuses telemetry into a layered judgement of whether the system — and each scale-unit — is healthy, degraded, or unhealthy, from the customer’s perspective. This is the principle that distinguishes mission-critical from ordinary monitoring: you are not watching CPU and disk, you are watching whether the business outcome is succeeding, and you have pre-defined what “degraded” means before the incident, not during it.

4. Drive automation

Humans are too slow and too inconsistent for mission-critical response. Every routine action — deployment, scaling, failover, certificate rotation, environment provisioning — must be automated and repeatable through infrastructure-as-code and pipelines. Manual steps are reliability liabilities: they fail under pressure, they are not tested, and they do not run at 3 a.m. The goal is that the first response to a regional health failure is the platform shifting traffic, with the on-call engineer informed rather than required.

5. Design for self-healing — and avoid unnecessary complexity

Microsoft deliberately pairs two ideas here, because they pull against each other and must be balanced. Self-healing: the system recovers from common failures automatically. Retries with back-off and circuit breakers absorb transient faults, health probes evict bad instances, autoscale replaces capacity, orchestrators restart crashed pods, and unhealthy stamps are drained. Complexity avoidance: every mechanism you add to improve reliability is itself a thing that can fail, so you add only what materially moves the reliability needle and you keep the architecture as simple as the requirements allow. The most reliable component is the one that isn’t there. Over-engineering for reliability is itself a source of unreliability: a baroque failover scheme nobody fully understands is less reliable than a simple one that is exercised constantly.
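As an illustration of the self-healing half, the sketch below combines retry with exponential back-off and a very small circuit breaker. It is a teaching sketch under stated assumptions, not a production library: the class and function names, the thresholds, and the injected `clock` and `sleep` parameters are all hypothetical. In real code you would normally reach for an established resilience library instead, which is itself an application of the complexity-avoidance half.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency circuit is open")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry transient failures; never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential back-off with full jitter to avoid synchronised retries.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```

A caller would typically wrap the dependency call as `retry_with_backoff(lambda: breaker.call(call_dependency))`, so transient faults are retried while a persistently failing dependency is cut off quickly.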

The five principles in tension, stated plainly:

Pull toward…	…in tension with	The mission-critical resolution
Active/active (max reliability)	Cost, write latency	Accept the spend; choose data consistency per workload
Fault isolation (many units)	Operational simplicity	Automate operations so many units cost little to run
Deep observability	Telemetry cost & noise	Model health into 3 states, alert on the model not raw metrics
Full automation	Up-front engineering effort	Pay it once; manual steps are the real liability
Self-healing mechanisms	Complexity avoidance	Add only what moves the needle; the simplest reliable design wins

The eight design areas

Mission-critical design is organised into eight design areas. Think of them as the eight surfaces you must make a defensible decision on. Skipping one is how mission-critical systems fail — the chain is exactly as strong as the weakest area.

1. Application design — scale-unit & deployment stamp

The architecture is decomposed into scale-units bundled into a regional deployment stamp (next section is dedicated to this). At the application layer this means: stateless compute that can scale horizontally and be replaced freely; the Health Endpoint Monitoring pattern on every service so the platform can judge it; asynchronous, loosely-coupled communication (Queue-Based Load Levelling, Publisher-Subscriber) so a slow downstream becomes back-pressure rather than cascading failure; and resiliency patterns — Retry, Circuit Breaker, Bulkhead — baked into every external call. Crucially, the app must degrade gracefully: shed non-essential features, serve cached/stale data, and protect the critical-path user journey when a dependency is impaired, rather than failing the whole request.
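Graceful degradation can be sketched in a few lines. The example below assumes a non-critical "recommendations" feature backed by a dependency that may fail; the function name, the `fetch_live` callable, and the dictionary cache are hypothetical stand-ins for whatever the real application uses.

```python
def get_recommendations(user_id, fetch_live, cache):
    """Return fresh data when possible, stale or empty data when the dependency is impaired."""
    try:
        fresh = fetch_live(user_id)
        cache[user_id] = fresh  # remember the last good answer
        return {"items": fresh, "degraded": False}
    except Exception:
        # Dependency is down or slow: serve the last known answer (or nothing)
        # instead of failing the whole request.
        return {"items": cache.get(user_id, []), "degraded": True}


def dependency_down(user_id):
    raise TimeoutError("recommendation service unavailable")


cache = {}
print(get_recommendations("u1", lambda u: ["a", "b"], cache))
print(get_recommendations("u1", dependency_down, cache))
```

The `degraded` flag matters as much as the fallback: it is the signal a health model can count to tell "serving, but impaired" apart from "fine".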

2. Application platform

The compute substrate must support availability zones and multi-region deployment, scale rapidly, and be operable as immutable, automatable infrastructure. The three canonical mission-critical choices on Azure:

Platform	When it fits	Reliability characteristics
Azure Kubernetes Service (AKS)	Containerised microservices, maximum control, the Online reference uses it	Zone-spread node pools, cluster autoscaler, self-healing pods, full orchestration control
Azure Container Apps	Container workloads wanting less cluster overhead	Managed Kubernetes underneath, built-in scale-to-demand, simpler operations
Azure App Service	Web apps/APIs without containers	Zone-redundant plans, deployment slots; less granular than AKS

Whatever the choice, the platform-level rules are constant: zone-redundant by default, deployed identically per region from IaC, immutable (you replace, you do not patch in place), and capable of scaling out faster than load arrives. The platform is also where you apply Compute Resource Consolidation judgement — packing for efficiency without coupling fates.

3. Data platform

Data is the hardest mission-critical problem and where most designs are won or lost, because state cannot simply be duplicated like stateless compute — it must be replicated with a consistency choice, and that choice is governed by the CAP theorem, not by wishing. The decisions:

Globally distributed, multi-region datastore. Azure Cosmos DB with multi-region writes is the archetypal mission-critical data tier — multiple write regions, tunable consistency, and automatic regional failover. Azure SQL with failover groups (and active geo-replication) is the relational answer, though writes funnel to one primary.
Consistency vs latency vs availability: choose explicitly. Strong/synchronous gives RPO 0 at the price of cross-region write latency; bounded-staleness/eventual gives lower latency and a non-zero RPO bounded by replication lag, so it admits some loss on a hard failure. Note that Cosmos DB does not offer strong consistency on accounts with multiple write regions, so multi-write and RPO 0 cannot be combined in that store. There is no free choice here; mission-critical means deciding deliberately per dataset and writing the decision down.
Handle the split-brain. Active/active multi-write means concurrent writes to the same logical item in two regions are possible. You must define conflict resolution (last-writer-wins, custom merge, or partition so a given key is only ever written in one region) before go-live, not after the first incident.
Right store for the job. Cache (Azure Cache for Redis, active-active geo-replication), event store/log (Event Hubs), blob (Azure Storage with GZRS/RA-GZRS) — each replicated to match the workload’s RPO.
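Two of the conflict-resolution options above can be sketched in plain Python. This is illustrative only and does not reflect how any particular datastore implements them internally; the `Version` type, the tie-break on region name, and the hash-based `home_region` helper are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # write time as seen by the writing region
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Higher timestamp wins; region name breaks ties so every replica picks the same winner."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


def home_region(key: str, regions: list[str]) -> str:
    """Pin each key to one write region so concurrent cross-region writes cannot conflict."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


west = Version("balance=90", 100.0, "westeurope")
north = Version("balance=80", 101.0, "northeurope")
print(last_writer_wins(west, north).value)
print(home_region("account-42", ["westeurope", "northeurope"]))
```

Last-writer-wins silently discards the losing write, which is only acceptable when writes are idempotent or the loss is tolerable. Pinning keys to a home region avoids conflicts entirely, at the cost of routing some writes cross-region.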

4. Networking & connectivity

A global, redundant ingress is the keystone. Azure Front Door is the standard mission-critical front door: anycast global entry, layer-7 routing, health-probe-driven failover between regional backends, integrated WAF and TLS, and caching/offload at the edge. (Azure Traffic Manager, which is DNS-based, is the lighter alternative where layer-7 features are not needed; its weakness is that failover is only as fast as DNS TTLs and client-side caching allow.) Below the edge: regional load balancing (Application Gateway / internal load balancers), private networking with Private Endpoints so data services are not internet-exposed, DDoS protection, and a network topology that itself has no single regional choke point. The rule: the ingress must be able to take a whole region out of rotation in seconds, automatically, on a health signal.

5. Health modelling & observability

The design area that operationalises principle #3 — covered in depth in its own section below. In short: unified collection (Azure Monitor, Log Analytics, Application Insights with distributed tracing), correlated across every layer, rolled up into a health model that yields healthy/degraded/unhealthy per scale-unit and globally, driving both alerting and automated failover.

6. Deployment & testing — zero-downtime

Mission-critical means you can ship continuously without ever taking the service down, and you prove resilience continuously rather than assuming it. This area mandates: fully automated CI/CD with zero-downtime deployment (blue/green, ideally of an entire stamp — see below); immutable, ephemeral environments stood up from IaC and torn down, so every environment is identical and reproducible; and continuous validation — load tests, chaos/fault-injection, and end-to-end checks run as part of the pipeline, not as an occasional event. Covered in depth below; this is where Azure Chaos Studio lives.

7. Security

Reliability and security are inseparable at this tier: a successful attack is an availability event. Mission-critical security applies Zero Trust — verify explicitly, least privilege, assume breach — across the whole stamp: managed identities (no secrets in code), Key Vault for keys/certs with automated rotation, private networking and Private Endpoints, WAF and DDoS at the edge, and policy-as-code guardrails inherited from the landing zone. The reliability-specific lens: security controls must not themselves become single points of failure or unbounded latency — a single regional Key Vault that every request blocks on, or a TLS-inspection appliance with no redundancy, is a reliability anti-pattern dressed as security.

8. Operational procedures

Even a self-healing system needs disciplined operations for the failures automation cannot handle. This area covers: automated, version-controlled runbooks for the residual manual scenarios; secret/certificate rotation as automated procedure; an incident response model with clear severities and on-call; a blameless post-incident review loop that feeds findings back into the health model and chaos suite; and operating the deployment/teardown of ephemeral environments. The principle here is that operations are engineered and rehearsed, not improvised — the runbook you have never run is not a runbook.

A one-line decision per area, to carry into a design review:

#	Design area	The decision you must defend
1	Application design	How is the app decomposed into scale-units, and how does it degrade gracefully?
2	Application platform	AKS / Container Apps / App Service — zone-redundant, immutable, multi-region?
3	Data platform	Which store, which consistency level, and how is split-brain resolved?
4	Networking & connectivity	Global ingress with automatic health-based regional failover?
5	Health modelling & observability	How does telemetry roll up into healthy/degraded/unhealthy?
6	Deployment & testing	Zero-downtime deploys + continuous validation/chaos in the pipeline?
7	Security	Zero Trust without security becoming a single point of failure?
8	Operational procedures	Automated runbooks, incident response, post-incident learning loop?

Signature concept: the scale-unit and the deployment stamp

This is the structural heart of mission-critical, and the single most important idea to take away.

A deployment stamp (the Deployment Stamps pattern) is a complete, self-contained, independently deployable copy of the application and its scale-units in one region — its own compute, its own data, its own messaging, wired together and able to serve traffic entirely on its own. A scale-unit is the bundle of components within a stamp that scales together as one indivisible block — you grow capacity by adding more units, not by making one unit bigger, which sidesteps the per-resource limits any single instance eventually hits.

The architecture is then layered:

A thin global layer above all stamps: Front Door (ingress, health-based routing), global DNS, and the globally-replicated data services and shared identity. This is the only shared fate, so it is kept deliberately minimal and itself globally redundant.
Multiple regional stamps, each a full vertical slice of the application, each independently scalable and deployable, each taking live traffic.

Why this model is the backbone of mission-critical:

Fault isolation (principle #2): a stamp is a blast-radius boundary. A failure inside one stamp — a bad node pool, a saturated queue, a regional dependency outage — is contained to that stamp’s slice of traffic. The global router simply stops sending it work.
Scale around limits: every Azure resource has ceilings (throughput, connections, IPs, throttle limits). When a scale-unit approaches a limit, you deploy another unit rather than fighting the ceiling — horizontal growth with predictable headroom.
Active/active (principle #1): because each stamp is identical and self-sufficient, running them in multiple regions simultaneously is the active/active design. No special standby logic exists.
Zero-downtime deployment: because a stamp is independently deployable and disposable, you can blue/green an entire stamp — stand up a new green stamp on the new version, shift traffic to it via the global router, and retire the blue stamp. The deployment is a routing change, the safest kind, and rollback is just routing back.
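The "scale around limits" point lends itself to a quick capacity calculation. The sketch below is a rough planning aid under assumed numbers: the per-unit capacity, the 70% target utilisation, and the function names are hypothetical, and a real figure would come from load-testing one scale-unit.

```python
import math


def scale_units_needed(peak_rps: float, unit_capacity_rps: float,
                       target_utilisation: float = 0.7) -> int:
    """Units required to carry peak load while keeping headroom in each unit."""
    usable_rps = unit_capacity_rps * target_utilisation
    return math.ceil(peak_rps / usable_rps)


def units_per_region_surviving_one_loss(peak_rps: float, unit_capacity_rps: float,
                                        regions: int,
                                        target_utilisation: float = 0.7) -> int:
    """Units per region so the remaining regions can absorb the loss of one region."""
    if regions < 2:
        raise ValueError("need at least two regions to survive a regional loss")
    rps_per_surviving_region = peak_rps / (regions - 1)
    return scale_units_needed(rps_per_surviving_region, unit_capacity_rps, target_utilisation)


print(scale_units_needed(12_000, 2_000))                      # 9
print(units_per_region_surviving_one_loss(12_000, 2_000, 3))  # 5
```

The second function is where the active/active cost shows up: three regions sized to survive the loss of one carry 15 units in total for a load that 9 units could serve.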

The diagram above is the canonical mission-critical reference shape: a minimal global layer over independent, identical, active/active regional stamps, with a health model driving automated routing decisions. Commit this picture to memory — it is the answer to most AZ-305 reliability scenarios.

Signature concept: active/active multi-region and composite SLA maths

Mission-critical reliability is quantified, and the maths is the part most engineers get wrong. You must be able to compute a composite SLA — the realistic availability of a system built from many components — and it produces two counter-intuitive results.

Result 1: dependencies in series multiply, so more components means a lower number. When a request must pass through several components and the failure of any one fails the request, their availabilities multiply:

Composite SLA (series) = SLA_1 × SLA_2 × SLA_3 × …

Worked example — a single-region request path of Front Door (99.99%) → App Service (99.95%) → Azure SQL (99.99%) → Key Vault (99.99%):

0.9999 × 0.9995 × 0.9999 × 0.9999 ≈ 0.99920  →  ~99.92%  →  ~7 hours downtime/yr

Every component you add in the critical path drags the composite down, even when each part is excellent. This is the mathematical reason behind principle #5’s “avoid unnecessary complexity” — each dependency is a multiplicative tax on availability. (It is also the reason to remove components from the critical path — e.g. make a dependency asynchronous or cache it — so its outage degrades rather than fails the request.)
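The series calculation is easy to reproduce. The sketch below uses the SLA figures from the worked example and also shows the effect of taking one component (Key Vault here, purely as an illustration) out of the synchronous critical path; the function name is an assumption.

```python
import math

HOURS_PER_YEAR = 365 * 24


def composite_series(slas) -> float:
    """Availability of a path where every component must succeed."""
    return math.prod(slas)


path = {
    "Front Door": 0.9999,
    "App Service": 0.9995,
    "Azure SQL": 0.9999,
    "Key Vault": 0.9999,
}

full = composite_series(path.values())
print(f"{full:.5f}")                               # 0.99920
print(f"{(1 - full) * HOURS_PER_YEAR:.1f} h/yr")   # 7.0 h/yr

# Same path with Key Vault removed from the critical path (e.g. cached secrets).
without_kv = composite_series(v for k, v in path.items() if k != "Key Vault")
print(f"{without_kv:.5f}")                         # 0.99930
```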

Result 2: redundancy in parallel adds nines, which is how you climb back up. When you run two independent instances and the request succeeds if either survives, the system is down only when both are down at once, so you multiply their failure probabilities and subtract the result from one:

Composite SLA (parallel) = 1 − ( (1 − SLA_A) × (1 − SLA_B) )

Two regional stamps each at 99.95%, behind a global router:

1 − ( (1 − 0.9995) × (1 − 0.9995) ) = 1 − (0.0005 × 0.0005) = 1 − 0.00000025  ≈  0.99999975  →  ~99.9999%

Two three-and-a-half-nines regions (99.95% each) combine to roughly six nines, but only if they are genuinely independent and the router can shift between them. That “if” is the whole game: the gain only materialises when there is no shared fate (no single regional dependency both stamps lean on) and the routing layer actually fails over on a health signal. This single calculation is why active/active across independent stamps is the mission-critical topology: parallelism is the only lever that buys nines back after the series multiplication takes them away.
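The parallel formula generalises to any number of independent stamps. A minimal sketch, assuming full independence between stamps (which, as noted, is the part the architecture has to earn):

```python
import math


def composite_parallel(slas) -> float:
    """Availability when the request succeeds if at least one instance is up."""
    all_down = math.prod(1 - sla for sla in slas)
    return 1 - all_down


two_stamps = composite_parallel([0.9995, 0.9995])
three_stamps = composite_parallel([0.9995, 0.9995, 0.9995])

print(f"{two_stamps:.8f}")     # 0.99999975
print(f"{three_stamps:.10f}")
```

A third stamp adds further nines on paper, but long before that point the shared global layer and correlated failures dominate the real number.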

The honest caveats — state these in interviews to show seniority:

The global layer caps you. A composite can never exceed the SLA of the least-redundant shared component, most often the global ingress/DNS and the global data tier. If Front Door is 99.99%, the system tops out at 99.99% however many regional stamps sit beneath it, because every request passes through Front Door in series. Mission-critical designs therefore keep the global layer as thin and as redundant as the platform allows.
Independence is an assumption, not a fact. Correlated failures — a shared dependency, a global config push, a platform-wide control-plane incident, a bad deployment to both regions — break the multiplication. The parallel maths assumes the two failures are uncorrelated; your job is to make them uncorrelated (independent stamps, staggered deployments, no shared regional services).
Composite SLA is a design estimate, not a measured SLO. It tells you whether the shape can plausibly hit the target. Your real reliability is what the health model measures in production against the negotiated SLO — and it will be lower than the paper composite until the design is exercised.
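The first caveat can be checked numerically by putting the global layer in series with the parallel stamps. This is a paper estimate only, with an assumed 99.99% global ingress and 99.95% per stamp; the function names are illustrative.

```python
import math


def composite_series(slas) -> float:
    return math.prod(slas)


def composite_parallel(slas) -> float:
    return 1 - math.prod(1 - sla for sla in slas)


def system_availability(global_layer_sla: float, stamp_sla: float, stamp_count: int) -> float:
    """Global layer in series with N independent regional stamps in parallel."""
    stamps = composite_parallel([stamp_sla] * stamp_count)
    return composite_series([global_layer_sla, stamps])


for count in (1, 2, 3):
    print(count, f"{system_availability(0.9999, 0.9995, count):.6f}")
```

Going from one stamp to two moves the estimate from roughly 99.94% to just under 99.99%; the third stamp barely changes it, because the global layer is now the ceiling.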

Signature concept: the health model

Ordinary monitoring asks “is the VM up?”. A mission-critical health model asks “is the business outcome succeeding, from the customer’s point of view?” — and turns the answer into one of three states the automation can act on. This is principle #3 and design area #5 made concrete, and it is the concept that most separates mission-critical from “we have dashboards”.

The model is layered and quantified:

Collect signals from every layer — infrastructure metrics, platform metrics, application telemetry (Application Insights), distributed traces, dependency health, and synthetic probes — into a unified store (Azure Monitor / Log Analytics).
Define the Service-Level Indicators that actually represent customer success — request success rate, latency percentiles (p95/p99), error rate, queue depth, data freshness/replication lag — not raw CPU. Raw infrastructure uptime is a vanity metric: a node can be “up” while every request it serves fails.
Set thresholds that classify each component, each scale-unit, and the whole system into:
- Healthy — operating within SLOs; all indicators green.
- Degraded — impaired but still serving the critical path (elevated latency/errors, a non-critical dependency down, reduced capacity); typically you shed non-essential features and may pre-emptively shift some traffic.
- Unhealthy — failing the critical path; the scale-unit/region is taken out of rotation and traffic is shifted to a healthy stamp.
Drive action from the model, not from raw metrics. Alerts fire on state transitions (so on-call gets “Region X is degraded”, not a wall of CPU graphs), and the unhealthy state is what triggers automated failover at the global router. Failover decisions are made on the modelled health of the customer experience, not on whether a server replied to a ping.

The two failure modes a good health model prevents:

False healthy — infrastructure reports up while customers are failing (the classic “all green dashboards during an outage”). The fix is SLIs that measure outcomes, plus synthetic transactions that exercise the real critical path.
False unhealthy / flapping — a transient blip evicts a healthy region. The fix is hysteresis: sustained breaches over a window, not single data points, and graduated degraded-before-unhealthy states.
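A health model of this shape can be expressed compactly in code. The sketch below classifies one scale-unit from two SLIs and applies simple hysteresis so a single bad sample cannot evict a region. All thresholds, names, and the three-sample window are hypothetical; real values would come from the negotiated SLO.

```python
from collections import deque
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def classify(success_rate: float, p99_ms: float,
             slo_success: float = 0.999, floor_success: float = 0.99,
             slo_p99_ms: float = 500.0) -> Health:
    """Map customer-facing SLIs to one of the three states."""
    if success_rate < floor_success:
        return Health.UNHEALTHY            # critical path is failing
    if success_rate < slo_success or p99_ms > slo_p99_ms:
        return Health.DEGRADED             # impaired but still serving
    return Health.HEALTHY


class HysteresisState:
    """Report a new state only after `window` consecutive samples agree on it."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)
        self.state = Health.HEALTHY

    def observe(self, sample: Health) -> Health:
        self.recent.append(sample)
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            self.state = sample
        return self.state


unit = HysteresisState(window=3)
for rate in (0.9995, 0.98, 0.9995, 0.98, 0.98, 0.98):
    print(rate, unit.observe(classify(rate, p99_ms=200)).value)
```

In this run the isolated dip to 0.98 does not change the reported state; only the three consecutive bad samples at the end flip the unit to unhealthy, which is the transition that would trigger a routing change.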

The health model is the brain of the self-healing system. Self-healing (principle #5) and automated failover (#4) are only as good as the model that tells them when to act. Building it well — before the incident, in code, with degraded explicitly defined — is the senior skill.

Signature concept: continuous validation, chaos, and blue/green of whole stamps

Mission-critical reliability that is assumed is reliability that does not exist. Two practices make it real and continuous.

Continuous validation & chaos engineering. Resilience is verified constantly, in the pipeline, not in an annual DR drill. The discipline: state a steady-state hypothesis (“the system stays healthy when one zone is lost”), inject a controlled fault with a bounded blast radius, observe whether the hypothesis holds via the health model, and feed every surprise back into the design. On Azure this is Azure Chaos Studio — agent-based and service-direct faults (kill AKS pods, fail a zone, add network latency, fault a dependency) run as gated pipeline experiments correlated with Azure Monitor. The point is cultural as much as technical: you find the weakness on a Tuesday afternoon with the team watching, not at 3 a.m. during a real regional outage. This is the operational embodiment of chaos engineering as a programme.
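The experiment loop described above (hypothesis, inject, observe, revert) can be sketched against a toy system. This does not call Azure Chaos Studio or any Azure API; the `zones` dictionary, the "at least two zones serving" hypothesis, and the function names are stand-ins chosen for illustration.

```python
def run_experiment(steady_state, inject_fault, remove_fault) -> str:
    """Run one bounded chaos experiment and report whether the hypothesis held."""
    if not steady_state():
        return "aborted: system not in steady state before injection"
    inject_fault()
    try:
        held = steady_state()
    finally:
        remove_fault()  # always revert the fault, even if the check raises
    return "hypothesis held" if held else "hypothesis failed: weakness found"


# Toy system: three zones, and the service is healthy while at least two serve traffic.
zones = {"zone-1": True, "zone-2": True, "zone-3": True}


def at_least_two_zones_serving() -> bool:
    return sum(zones.values()) >= 2


result = run_experiment(
    steady_state=at_least_two_zones_serving,
    inject_fault=lambda: zones.update({"zone-1": False}),
    remove_fault=lambda: zones.update({"zone-1": True}),
)
print(result)
```

The two details worth copying into a real pipeline are the pre-check (never inject into a system that is already unhealthy) and the unconditional revert.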

Zero-downtime deployment via blue/green of whole stamps. Because a deployment stamp is independently deployable and disposable (the structural property from earlier), the deployment unit becomes the entire stamp:

Stand up a fresh green stamp on the new version, from IaC, alongside the running blue stamp.
Run continuous-validation checks (smoke, load, chaos) against green before it sees customers.
Progressively shift production traffic blue → green at the global router (canary weights → full).
Watch the health model on green throughout; if it degrades, shift traffic back — rollback is a routing change, instant and safe.
Retire the blue stamp once green is proven.
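The progressive shift and its rollback can be sketched as a small control loop. The `set_weights` and `green_is_healthy` callables are hypothetical hooks: in a real pipeline they would update the global router's backend weights and query the health model, neither of which is shown here.

```python
def shift_traffic(green_weight_steps, green_is_healthy, set_weights) -> str:
    """Move traffic blue -> green in steps, rolling back on the first unhealthy reading."""
    for green_weight in green_weight_steps:
        set_weights(blue=100 - green_weight, green=green_weight)
        if not green_is_healthy():
            # Rollback is just another routing change: send everything back to blue.
            set_weights(blue=100, green=0)
            return "rolled back"
    return "green promoted"


history = []


def record_weights(blue: int, green: int) -> None:
    history.append((blue, green))


outcome = shift_traffic([10, 50, 100], green_is_healthy=lambda: True, set_weights=record_weights)
print(outcome, history)
```

A real rollout would also hold at each step for an observation window before reading the health model; that wait is omitted to keep the sketch short.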

This is the safest possible deployment: the new version is fully built and validated before any customer touches it, the cut-over is a reversible routing decision rather than an in-place mutation, and a bad release degrades nothing because the old stamp is still there until the new one earns the traffic. It is also why immutable, ephemeral environments (design area #6) are mandatory — blue/green of stamps only works if a stamp is something you create and destroy reliably from code.

The reference implementations: Mission-Critical Online vs Connected

Microsoft does not leave this as theory — it publishes two production-grade, open-source reference implementations (the Azure/Mission-Critical-Online and Azure/Mission-Critical-Connected repositories) that implement the full methodology end-to-end in Bicep/Terraform and application code. They differ on exactly one axis — connectivity and isolation — and choosing between them is a real design decision.

Dimension	Mission-Critical Online	Mission-Critical Connected
Network posture	Public-facing; services reachable over the internet (with WAF, private endpoints to data)	Private; integrated into a corporate network via the platform landing zone
Landing-zone integration	Standalone — does not require a platform landing zone; self-contained subscription	Integrated with Azure landing zones — uses hub-spoke/Virtual WAN connectivity, central identity, policy
Typical workload	Internet-facing apps, public SaaS, consumer e-commerce	Enterprise apps needing on-prem connectivity, private ingress, stricter network isolation/compliance
Connectivity to on-prem	Not required	Via the landing zone’s hub (ExpressRoute/VPN)
Governance	Self-governed within its subscription	Inherits enterprise policy/guardrails from the platform
Complexity	Lower — fewer moving parts, easier to stand up	Higher — coupled to platform networking and governance

Both share the same core: active/active regional stamps with scale-units, Front Door global ingress with health-based routing, Cosmos DB multi-write data, AKS application platform, a full health model, zero-downtime stamp deployment, and continuous validation. The only difference is whether the workload lives in the open (Online) or inside the enterprise network and governance fabric (Connected).

How to choose, in one rule: start from your connectivity and compliance constraints. If the workload is internet-facing, has no hard requirement for private/on-prem connectivity, and can be self-governed, choose Online — it is simpler and that simplicity is reliability (principle #5). If the workload must reach on-premises systems, sit behind private ingress, or inherit enterprise governance from a platform landing zone, choose Connected — it is the mission-critical application landing zone that plugs into the platform. Read the actual repos before a design review; they are the best worked example of every concept in this lesson.
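The selection rule can be written down as a tiny function, which is sometimes a useful way to make a design review explicit about which constraint drove the choice. The `Constraints` fields and `choose_reference` are hypothetical names for this sketch; the rule itself is only the one stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    needs_on_prem_connectivity: bool
    needs_private_ingress: bool
    must_inherit_platform_governance: bool


def choose_reference(c: Constraints) -> str:
    """Return 'Connected' if any enterprise constraint applies, else 'Online'."""
    if (
        c.needs_on_prem_connectivity
        or c.needs_private_ingress
        or c.must_inherit_platform_governance
    ):
        return "Connected"
    return "Online"  # the simpler option when nothing forces the coupling


public_shop = Constraints(False, False, False)
regulated_platform = Constraints(True, True, True)
```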

Real-world application

How this shows up when you are actually designing on Azure:

The payments authorisation switch. Active/active stamps across two paired regions; Cosmos DB multi-write with last-writer-wins on idempotent keys; Front Door routing on a synthetic “can I authorise a test transaction” probe, not on CPU; a health model where “p99 authorisation latency > X” is degraded (shed enrichment features) and “authorisation success rate < Y” is unhealthy (drain the region). Downtime cost in the millions/hour funds the 2x compute without argument.
Airline check-in / departure control on irregular-ops days. The scale-unit model lets capacity grow per-airport-cluster; graceful degradation keeps the critical path (issue a boarding pass) alive even when ancillary services (seat-map upsell) are down. Blue/green of stamps means a fix ships during operations without a maintenance window — there is no maintenance window when planes are departing.
Peak-trading e-commerce (the Online reference, almost verbatim). Internet-facing, self-governed, Front Door + WAF + AKS stamps + Cosmos DB; continuous load + chaos in the pipeline all year so the peak day is just another Tuesday at scale. See the Black Friday surge architecture for the cross-cloud shape of the same problem.
Regulated enterprise platform (the Connected reference). Private ingress through the landing-zone hub, central identity and policy, on-prem connectivity over ExpressRoute — same mission-critical core, wrapped in enterprise governance. This is where mission-critical meets the secure landing zone.

In all four, the design judgement — not the service list — is the deliverable: which pillar tensions you accepted, where you spent for redundancy, what you deliberately left simple, and how you proved it works before the customer found out it didn’t.

Common mistakes & anti-patterns

Anti-pattern	Why it fails mission-critical	The fix
Active-passive dressed as active/active	The standby’s failover path is never exercised, so it is never trusted; the recovery gap is exactly what you were eliminating	Both regions take live traffic; the only failover path you trust is the one used in production
Failing over on infrastructure metrics	A region can be “up” (servers pinging) while every customer request fails — “false healthy”	Drive failover from a health model of customer outcomes (success rate, p99), with synthetic probes
Ignoring composite SLA maths	Adding components in series silently lowers availability; teams claim “four nines” for a chain that computes to two	Compute the series composite; remove dependencies from the critical path; add parallel redundancy to buy nines back
Assuming regions are independent	Shared regional dependency, global config push, or simultaneous deploy correlates the failures and breaks the parallel maths	Make stamps genuinely independent; stagger deployments; eliminate shared regional services
State as an afterthought	Stateless compute is easy to duplicate; teams forget data needs a consistency choice and a split-brain plan	Decide consistency per dataset; define conflict resolution before go-live
Over-engineering reliability	A baroque failover scheme nobody understands is less reliable than a simple, exercised one (violates principle #5)	Add only mechanisms that move the needle; favour the simplest design that meets the SLO
Manual failover / deployment	Humans are too slow and inconsistent under pressure; the untested runbook fails at 3 a.m.	Automate failover off the health model; blue/green whole stamps; rehearse residual runbooks
Reliability without continuous validation	Resilience that is assumed is resilience that does not exist; the first real test is the real outage	Chaos + load tests gated in the pipeline (Azure Chaos Studio); find weaknesses on a Tuesday
Security as a single point of failure	A single regional Key Vault every request blocks on, or non-redundant inline inspection, is an availability risk wearing a security badge	Make security controls redundant and bounded-latency; cache/replicate appropriately
Mission-critical everything	Treating a low-criticality workload as mission-critical wastes money the business will not fund	Tier by downtime cost; spend the architecture only where the consequence justifies it

Interview & exam questions

These concepts dominate AZ-305 reliability scenarios and senior architecture interviews. Be ready to defend tradeoffs, not just name services.

Q1. What distinguishes a “mission-critical” workload from a merely “highly available” one, and how do you decide? Criticality is defined by the consequence of downtime — severe financial loss, legal/regulatory exposure, or risk to human safety. You decide by quantifying the cost of one hour of downtime; that number justifies the architecture. Mission-critical makes Reliability the primary pillar (cost subordinate but still justified), targets active/active multi-region with self-healing, and is reserved for workloads whose failure carries severe consequence — not every workload a team cares about.

Q2. Name the five mission-critical design principles. Active/active (maximise reliability); blast-radius reduction & fault isolation; observe application health; drive automation; design for self-healing while avoiding unnecessary complexity. Bonus seniority: explain that they form a system in tension — e.g. self-healing mechanisms add complexity, which principle #5 explicitly pulls back against.

Q3. List the eight mission-critical design areas. Application design (scale-unit/deployment stamp); Application platform; Data platform; Networking & connectivity; Health modelling & observability; Deployment & testing (zero-downtime); Security; Operational procedures. The chain is as strong as the weakest area.

Q4. What is a deployment stamp, and how does it differ from a scale-unit? A deployment stamp is a complete, self-contained, independently deployable copy of the application (compute + data + messaging) in one region, able to serve traffic alone. A scale-unit is the bundle of components within a stamp that scales together as one block — you add capacity by adding units, not by enlarging one. Stamps provide regional fault isolation and the unit of blue/green; scale-units provide growth around per-resource limits.

Q5. Compute the composite SLA of two regional stamps, each 99.95%, behind a global router, and state the assumptions. Parallel: 1 − (1 − 0.9995)² = 1 − 0.00000025 = 99.999975%, roughly six nines. Assumptions: the two regions are genuinely independent (no shared fate / correlated failure) and the router fails over on a health signal. Caveat: the composite cannot exceed the SLA of the least-redundant shared component (the global ingress/DNS/data tier), and correlated failures break the multiplication.
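Written out in general form, the same calculation looks as follows. The symbols are introduced here for illustration only: a is the availability of one stamp, n the number of independent stamps, and g the availability of the shared global layer in front of them.

```latex
\begin{aligned}
A_{\text{parallel}} &= 1 - (1 - a)^{n} \\
a = 0.9995,\; n = 2:\quad A_{\text{parallel}} &= 1 - (0.0005)^{2} = 1 - 2.5 \times 10^{-7} = 0.99999975 \\
A_{\text{end-to-end}} &= g \cdot A_{\text{parallel}} \le g
\end{aligned}
```

The last line is the caveat in symbols: because the global layer sits in series with the redundant stamps, the end-to-end figure can never exceed g, however many stamps are added.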

Q6. Why does adding components to a request path lower availability, and what do you do about it? Components in series multiply their availabilities, so each added dependency is a multiplicative tax — five excellent services in a row compute to less than any one of them. The fixes: remove dependencies from the critical path (make them asynchronous or cached so they degrade rather than fail the request), and add parallel redundancy (multiple stamps/regions) to buy nines back. This is the maths behind “avoid unnecessary complexity”.
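A short calculation makes the multiplicative tax concrete. This sketch assumes five services at 99.99% each, a figure chosen for illustration; `series_availability` is a hypothetical helper name.

```python
from math import prod


def series_availability(availabilities: list[float]) -> float:
    """Availability of a request path where every component must succeed."""
    return prod(availabilities)


five_services = [0.9999] * 5
chain = series_availability(five_services)  # lower than any single link

# Moving one dependency off the critical path (async or cached) removes
# its factor from the product, so the path gets more available.
four_services = five_services[:-1]
shorter_chain = series_availability(four_services)
```

Five four-nines links come out at roughly 99.95%, which is why removing a dependency from the critical path is usually cheaper than making that dependency more reliable.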

Q7. What is a health model and why not just monitor uptime? A health model fuses telemetry into a layered, quantified judgement of healthy/degraded/unhealthy per scale-unit and globally, based on customer-outcome SLIs (success rate, p99 latency, freshness) rather than raw infrastructure metrics. Uptime is a vanity metric — a server can be “up” while every request fails (“false healthy”). The model drives alerting on state transitions and triggers automated failover off the unhealthy state.

Q8. How do you deploy a new version of a mission-critical system with zero downtime? Blue/green of an entire stamp: stand up a green stamp on the new version from IaC, validate it (smoke/load/chaos) before any traffic, shift production blue→green progressively at the global router while watching the health model, roll back by routing traffic back if green degrades, then retire blue. The cut-over is a reversible routing change, not an in-place mutation — the safest deployment shape. Requires immutable, ephemeral environments.

Q9. What role does chaos engineering play, and which Azure service implements it? Continuous validation proves resilience instead of assuming it: state a steady-state hypothesis, inject a bounded-blast-radius fault, verify via the health model, feed surprises back into the design — run as gated pipeline experiments, not annual drills. Azure Chaos Studio provides agent-based and service-direct faults (kill pods, fail a zone, add latency) correlated with Azure Monitor. The aim is to find weaknesses under controlled conditions rather than during a real outage.

Q10. Front Door vs Traffic Manager for mission-critical global ingress — which and why? Azure Front Door is the mission-critical default: anycast layer-7 ingress with health-probe-driven backend failover (seconds), integrated WAF/TLS, and edge caching/offload. Traffic Manager is DNS-based — lighter, but failover is bounded by DNS TTL and it offers no layer-7 features, so it is reserved for cases where DNS-level routing suffices. Mission-critical needs the fast, health-based, layer-7 failover Front Door provides.

Q11. Online vs Connected reference implementation — how do you choose? Choose on connectivity and governance. Online is standalone, public-facing, self-governed — for internet-facing apps with no on-prem/private requirement; simpler, and simplicity is reliability. Connected integrates with the platform landing zone — private ingress, central identity/policy, on-prem connectivity via the hub — for enterprise/regulated workloads. Same mission-critical core; the difference is whether the workload lives in the open or inside the enterprise network/governance fabric.

Q12. Your data tier must serve writes in two regions at once. What do you decide, and what do you guard against? Decide the consistency level per dataset (strong/synchronous gives RPO 0 at the cost of cross-region write latency; bounded-staleness/eventual gives lower latency with a small loss window), then pick a store that fits: Cosmos DB multi-region writes for true multi-write (bearing in mind that Cosmos DB does not offer strong consistency on accounts with multiple write regions), or Azure SQL failover groups for relational data, accepting a single write primary. Guard against split-brain: define conflict resolution (last-writer-wins, custom merge, or partitioning the keyspace so each item is written in only one region) before go-live. This is the CAP tradeoff made explicit.
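Last-writer-wins is easy to state and easy to get subtly wrong, so a store-agnostic sketch may help. This is not the Cosmos DB API: `Write`, its `timestamp` field, and `last_writer_wins` are hypothetical, and the tie-break on region name is an assumption added so that every replica picks the same winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str        # idempotency key identifying the logical item
    value: str
    timestamp: int  # hypothetical logical timestamp; higher wins
    region: str


def last_writer_wins(writes: list[Write]) -> dict[str, Write]:
    """Keep one write per key: the highest (timestamp, region) pair."""
    winners: dict[str, Write] = {}
    for w in writes:
        current = winners.get(w.key)
        # Comparing (timestamp, region) makes the outcome independent of
        # the order in which replicas happen to see the writes.
        if current is None or (w.timestamp, w.region) > (
            current.timestamp,
            current.region,
        ):
            winners[w.key] = w
    return winners


conflicting = [
    Write("txn-42", "approved", 100, "region-a"),
    Write("txn-42", "declined", 101, "region-b"),
    Write("txn-43", "approved", 100, "region-a"),
]
resolved = last_writer_wins(conflicting)
```

Note what the policy costs: the losing write for `txn-42` is discarded silently, which is acceptable only for data where that is a safe outcome. That is the reason the decision belongs before go-live rather than after the first conflict.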

Quick check

1. True or false: mission-critical means cost is ignored in favour of reliability. False. Reliability is the primary pillar, but it is cost-justified — every reliability investment is defended by the downtime cost it averts, and you do not buy nines the business will not fund.

2. You add a Key Vault call (99.99%) into a request path that was 99.92%. What is the new composite, and what does that illustrate? 0.9992 × 0.9999 ≈ 0.99910 → ~99.91%. It illustrates that components in series multiply — every added critical-path dependency lowers availability, the maths behind “avoid unnecessary complexity”.

3. A region’s servers are all responding to pings, but 80% of customer transactions are failing. What health state is it, and what should happen? Unhealthy (the critical path is failing, regardless of infrastructure “up”). The global router should drain the region and shift traffic to a healthy stamp — this is exactly the “false healthy” trap a good health model prevents.

4. Why is blue/green of a whole stamp safer than an in-place rolling update? The new version is fully built and validated before any customer touches it, the cut-over is a reversible routing change rather than an in-place mutation, and the old stamp stays serving until the new one earns traffic — so rollback is instant and a bad release degrades nothing.

5. Two regional stamps each at 99.9% combine to roughly what availability, and what one assumption must hold? 1 − ((1−0.999)²) = 1 − 0.000001 = 99.9999% (~six nines). The assumption: the regions are genuinely independent (no shared fate / correlated failure) and the router fails over on a health signal — otherwise the parallel maths does not hold.

Exercise: the design thought-experiment

Scenario. You are the architect for “AuthSwitch”, the real-time card-authorisation service for a global payments processor. Requirements: authorise a transaction in < 300 ms p99; one hour of downtime costs ~₹4 crore in lost interchange and SLA penalties; transactions originate worldwide; regulators in two jurisdictions require data residency and an audited DR capability; the business has explicitly funded “as close to always-on as the architecture can deliver”. You will be challenged in an architecture review board.

Produce: (a) the topology, (b) one defended decision per design area, (c) a composite-SLA estimate, (d) the health model’s degraded vs unhealthy definitions, (e) the deployment & validation approach, and (f) Online or Connected.

Model answer.

(a) Topology. Active/active deployment stamps across two paired regions per residency jurisdiction (so four stamps total if both jurisdictions need in-region processing; two if one jurisdiction). Thin global layer: Azure Front Door anycast ingress with health-probe-driven routing + WAF; global DNS; a globally-distributed data tier. Each stamp is a self-contained scale-unit: zone-redundant AKS behind Application Gateway, Service Bus/Event Hubs for async enrichment, Azure Cache for Redis (geo-replicated) for hot BIN/token lookups, Cosmos DB multi-region writes for the authorisation ledger, all over Private Endpoints.

(b) One defended decision per design area:

Area	Decision	Defence
Application design	Stateless auth service; enrichment async; Circuit Breaker on every downstream	Keeps the critical path (approve/decline) short and resilient; enrichment failure → degrade, not fail
Application platform	Zone-redundant AKS, immutable, per-region from IaC	Maximum control + self-healing pods; identical stamps; replace-don’t-patch
Data platform	Cosmos DB multi-write, session/bounded-staleness consistency; idempotent transaction keys; last-writer-wins	Sub-300 ms p99 forbids cross-region synchronous writes; idempotency + LWW resolves split-brain; ledger is append-mostly
Networking	Front Door, health-based failover, Private Endpoints, DDoS	Seconds-level regional drain on a transaction probe; data tier never internet-exposed
Health modelling	SLIs: auth success rate, p99 latency, downstream/enrichment health, replication lag	Measures the business outcome, not CPU; drives routing
Deployment & testing	Blue/green of whole stamps + chaos/load gated in pipeline (Chaos Studio)	Ship fixes during live operation; prove zone/region loss continuously
Security	Zero Trust, managed identity, Key Vault (auto-rotation, regional + cached), WAF	No secrets in code; security controls redundant so they are not SPOFs
Operational procedures	Automated runbooks for residual failover, rehearsed; blameless PIR loop feeding the chaos suite	The untested runbook fails at 3 a.m.; learning loop hardens the health model

(c) Composite SLA. Per-stamp series (App Gateway + AKS + Cosmos DB, taking two links at ~99.99% and one at ~99.95%) ≈ 99.93%. Two independent stamps in parallel: 1 − (1 − 0.9993)² ≈ 99.99995%, roughly six nines for the regional tier. Front Door sits in series in front of both stamps, so as the least-redundant shared component its own 99.99% caps the end-to-end figure at about four nines. Honest statement to the ARB: “the shape can hit about four nines end to end; the global ingress is the ceiling and the assumption is region independence, which we enforce with staggered deploys and no shared regional services.”
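The estimate can be reproduced end to end, which also turns the percentages into a yearly downtime budget. This sketch assumes one reading of the figures: three in-stamp links (one at 99.95%, two at 99.99%) with Front Door at 99.99% treated as the shared global layer in series with the pair of stamps. The helper names are hypothetical and the inputs are illustrative, not published SLAs for any specific configuration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series(availabilities: list[float]) -> float:
    """Every component must work: availabilities multiply."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Independent copies: the system fails only if all copies fail."""
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


stamp = series([0.9995, 0.9999, 0.9999])      # assumed in-stamp links
regional_tier = parallel(stamp, copies=2)     # two independent stamps
front_door = 0.9999                           # shared global layer
end_to_end = series([front_door, regional_tier])
```

Under these assumptions the redundant regional tier contributes almost nothing to the downtime budget; nearly all of the roughly 53 minutes per year comes from the shared ingress, which is the ceiling argument in numbers.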

(d) Health model. Degraded = auth success ≥ 99.5% but p99 latency > 250 ms, OR a non-critical enrichment dependency is down → shed enrichment, serve approve/decline on cached risk data, optionally pre-shift a fraction of traffic. Unhealthy = auth success < 98% sustained over a 60-second window, OR Cosmos write failures in-region → drain the region at Front Door, shift to the paired region. Hysteresis (sustained windows) prevents flapping; synthetic “authorise a canary transaction” probes prevent false-healthy.
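The definitions above translate into a small classifier. This is a sketch under stated assumptions: `Sample`, `classify`, and the idea of passing in the samples covering the last 60 seconds are hypothetical, and a success rate between 98% and 99.5% (which the definitions leave open) is treated here as degraded.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class Sample:
    auth_success: float            # fraction of successful authorisations, 0..1
    p99_ms: float                  # p99 authorisation latency in milliseconds
    enrichment_up: bool = True
    cosmos_writes_ok: bool = True


def classify(window: list[Sample]) -> Health:
    """Classify one stamp from the samples covering the last 60 seconds."""
    if not window:
        raise ValueError("need at least one sample")
    latest = window[-1]
    # Unhealthy needs a sustained breach (every sample in the window),
    # which is the hysteresis that stops a single bad reading from
    # draining a region. In-region write failures are unhealthy at once.
    if not latest.cosmos_writes_ok or all(s.auth_success < 0.98 for s in window):
        return Health.UNHEALTHY
    if latest.auth_success < 0.995 or latest.p99_ms > 250 or not latest.enrichment_up:
        return Health.DEGRADED
    return Health.HEALTHY


steady = [Sample(0.999, 180)] * 6
slow = steady[:5] + [Sample(0.998, 270)]
blip = steady[:5] + [Sample(0.97, 200)]
outage = [Sample(0.90, 400)] * 6
```

The `blip` case is the interesting one: a single reading under 98% degrades the stamp but does not drain it, because the unhealthy state requires the whole window to breach.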

(e) Deployment & validation. Immutable, ephemeral environments from IaC; blue/green of an entire stamp with canary traffic weights at Front Door, rollback = route back; continuous load + Chaos Studio experiments (kill a zone, fault Cosmos in one region, add downstream latency) gated in the pipeline against the steady-state hypothesis “auth success stays ≥ 99.9% through single-zone loss”.
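The pipeline gate itself can be very small. The sketch below checks the steady-state hypothesis against success-rate samples collected while a fault is injected; `steady_state_holds` and `gate` are hypothetical names, the sample values are made up, and nothing here calls Azure Chaos Studio. In a real pipeline the failing branch would end the run with a non-zero exit code.

```python
def steady_state_holds(success_rates: list[float], threshold: float = 0.999) -> bool:
    """True if every sample taken during the fault stays at or above the threshold."""
    # An experiment that produced no samples proves nothing, so it fails.
    return bool(success_rates) and min(success_rates) >= threshold


def gate(experiment: str, success_rates: list[float]) -> str:
    """Summarise one chaos experiment as a PASS or FAIL line for the pipeline log."""
    if steady_state_holds(success_rates):
        return f"PASS {experiment}"
    worst = min(success_rates, default=0.0)
    return f"FAIL {experiment}: min success {worst:.4f}"


passed = gate("single-zone loss", [0.9995, 0.9992, 0.9991])
failed = gate("single-zone loss", [0.9995, 0.9950])
```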

(f) Online or Connected? Connected. The residency, audited-DR, and enterprise-governance requirements mean the workload must inherit policy and (likely) private connectivity from a platform landing zone — it is a mission-critical application landing zone, not a standalone public app. If AuthSwitch were a purely internet-facing consumer product with no on-prem/governance ties, Online would be the simpler, and therefore more reliable, choice.

ARB-killer line to close on: “We accept ~2x compute and bounded-staleness writes to hold RPO near-zero and p99 under 300 ms across regions; the global ingress is our availability ceiling and region independence is our load-bearing assumption, which the chaos suite tests every pipeline run.” That sentence — explicit tradeoffs, named ceiling, tested assumption — is what separates an architect from a service-operator.

Certification mapping

AZ-305 (Designing Microsoft Azure Infrastructure Solutions) — primary. This lesson sits at the centre of the Design infrastructure solutions and Design business continuity solutions domains:

AZ-305 area	What this lesson covers
Design for high availability	Active/active multi-region, zone-redundancy, deployment stamps, global ingress with health-based failover
Design a solution for backup & DR	RPO/RTO via data-tier consistency choices, multi-region data, automated failover, Online/Connected DR posture
Design for reliability / resilience	The five principles, health modelling, self-healing, composite SLA reasoning
Design compute & application architecture solutions	Scale-unit/stamp model, AKS/Container Apps/App Service choice, graceful degradation
Design data storage solutions	Cosmos DB multi-write, SQL failover groups, consistency-vs-latency tradeoffs, split-brain handling
Design network solutions	Front Door vs Traffic Manager, private endpoints, redundant global topology

AZ-104 (Azure Administrator). Supporting: availability zones/sets, load balancing, Azure Monitor/Log Analytics, backup and site recovery, and the operational mechanics (autoscale, deployment slots) that mission-critical automates.

AZ-204 (Developing Solutions for Azure). Supporting: implementing the resiliency patterns in app code (Retry, Circuit Breaker via the SDKs/Polly), Application Insights instrumentation feeding the health model, Cosmos DB consistency levels and conflict resolution, message-based decoupling with Service Bus/Event Hubs.

Across all three, the differentiator is the design judgement: examiners increasingly test “given these requirements and this downtime cost, choose and justify the topology” — exactly the reasoning this lesson drills.

Glossary

Mission-critical workload — a workload whose unavailability causes severe business or human consequence; Reliability is its primary, cost-justified pillar.
AlwaysOn — Microsoft’s historical name for the mission-critical guidance and reference implementations.
Deployment stamp — a complete, self-contained, independently deployable copy of the application (compute + data + messaging) in one region, able to serve traffic alone.
Scale-unit — the bundle of components within a stamp that scales together as one indivisible block; capacity grows by adding units, not enlarging one.
Active/active — multiple regional stamps all taking live production traffic simultaneously, so a regional loss is a routing shift rather than a failover event.
Blast radius — the extent of impact when a component fails; mission-critical minimises it via isolation (stamps, bulkheads, cells).
Composite SLA — the realistic availability of a multi-component system: components in series multiply (lower), redundancy in parallel combines failure probabilities (higher).
Health model — a layered, quantified classification of the system and each scale-unit as healthy / degraded / unhealthy, based on customer-outcome SLIs, driving alerting and automated failover.
SLI / SLO / SLA — Service-Level Indicator (the measured signal), Objective (the target you design/operate to), Agreement (the financially-backed contractual promise).
RPO / RTO — Recovery Point Objective (tolerable data loss, set by replication) and Recovery Time Objective (tolerable downtime, set by routing/failover).
Split-brain — concurrent conflicting writes to the same logical item in multiple regions under active/active multi-write; resolved by conflict-resolution policy or key partitioning.
Graceful degradation — shedding non-essential features / serving stale data to protect the critical path when a dependency is impaired.
Continuous validation — running load and chaos/fault-injection experiments as part of the pipeline to prove resilience continuously rather than assume it.
Blue/green (of a stamp) — deploying a new version as a fresh stamp, validating it before traffic, then shifting traffic via the global router; rollback is a routing change.
Front Door — Azure’s anycast layer-7 global ingress with health-probe-driven backend failover, WAF/TLS, and edge caching — the mission-critical global front door.
Online vs Connected — the two mission-critical reference implementations: Online (standalone, public-facing, self-governed) vs Connected (integrated with a platform landing zone, private, enterprise-governed).

Next steps

You have reached the apex of the design-judgement layer. The natural next lesson is the course-level capstone where you build what you have learned to design:

Capstone: Design & Build a Production-Ready Azure Landing Zone — take identity, networking, governance, monitoring, and reliability and stand up one governed landing zone with Bicep, Terraform, and the az CLI.

To go deeper on the two pillars this lesson leans on hardest:

Active-Active Multi-Region on Azure: Building for RTO Near Zero — the deployment-stamp and multi-write-data mechanics at implementation depth, including split-brain handling.
Resilience Validation with Azure Chaos Studio — the tooling to prove the resilience this lesson designs: agent-based and service-direct faults, steady-state hypotheses, and CI/CD gating.

And to reinforce the foundations this lesson composes:

The Azure Well-Architected Framework, In Depth — the Reliability pillar and the tradeoff system mission-critical pushes to its limit.
The 43 Azure Cloud Design Patterns — Deployment Stamps, Geode, Bulkhead, Circuit Breaker, Health Endpoint Monitoring as standalone patterns.
Cell-Based Architecture: Containing Blast Radius — the blast-radius isolation idea generalised across clouds.
The Well-Architected Reliability Pillar, Deep Dive and High Availability vs Disaster Recovery: RTO/RPO for the reliability fundamentals underneath it all.