
Azure Architecture Case Studies: Real Proposal Walkthroughs (Easy → Complex)

Every architecture course teaches you the parts — the services, the pillars, the patterns. Almost none teach you the job, which is sitting across a table from a customer who hands you a one-paragraph business problem and a budget, and walking out with a defensible design. That gap is where careers stall: engineers who can recite the Well-Architected Framework word-for-word but freeze when asked “so what would you actually build for us?” This lesson is the bridge. It is four complete proposal walkthroughs — the kind you would present to a customer’s review board — taken from a one-line brief all the way to a costed, justified, failure-analysed architecture.

The four case studies rise deliberately in difficulty, because the skill of architecture is not knowing the most complex pattern. It is climbing exactly as high as the requirements force you and not one rung higher — a discipline taught at length in The Azure Architecting Ladder: From a Simple Web App to Mission-Critical, which this lesson puts into practice on four genuinely different businesses. We start with a two-founder SaaS that needs to ship cheaply and survive a single datacentre’s bad day. We move to a regulated healthcare data platform where the binding constraint is not uptime but HIPAA, privacy and audit. We then scale out to a global retail e-commerce platform that must absorb a Diwali traffic surge across continents without buckling. And we finish at the apex: a bank’s core ledger that must transact through the loss of an entire Azure region with zero downtime and zero data loss, under data-sovereignty law — the design that lands squarely on Mission-Critical (AlwaysOn) Architecture on Azure.

Throughout, watch how the design is derived. Each case follows the same disciplined sequence — business brief → requirements (RTO/RPO/scale/compliance/budget) → constraints → the design and its Azure services → the architecture decisions as Well-Architected tradeoffs → a rough INR cost → what could go wrong — and reaches for named, reusable cloud design patterns as each problem demands them. By the end you will not just know the services; you will know how a senior architect reasons from a sentence of business intent to a system you can stand behind.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

This is an Advanced lesson in the Architecture & Design Mastery module, and it is the applied one — it spends the vocabulary the earlier lessons built. You will get the most from it if you have already met:

You should also be comfortable with the reliability fundamentals — high availability vs disaster recovery and RTO/RPO — because RTO and RPO are the spine of every brief.

Where it fits: the WAF, styles, patterns and ladder lessons taught you the language; this lesson shows you the job. It is exam-critical for AZ-305, whose case-study questions are exactly this shape — given a scenario with constraints, choose and justify a design.

How to read each case

Every case below is described against the same set of axes. Internalise them — they are the vocabulary of an architecture decision, and the order is the order you should ask them in:

| Axis | The question it asks | Why it drives the design |
|---|---|---|
| RTO (Recovery Time Objective) | After a failure, how long until service is restored? | Minutes-to-hours allows restore/failover; near-zero forces active/active. |
| RPO (Recovery Point Objective) | How much data can you afford to lose? | Hours allows nightly backup; zero forces synchronous or multi-write replication. |
| Scale & shape | Peak load, and is it steady or spiky? | Drives the compute model (serverless vs PaaS vs orchestrated) and elasticity. |
| Availability target (SLA) | What uptime must you promise, composite across the chain? | Each “nine” is roughly an order of magnitude harder and dearer. |
| Compliance & residency | Regulatory limits on where data lives and how it is protected? | Can force a region, an encryption model or a DR design regardless of pure availability maths. |
| Budget | Capital and run-rate ceiling? | The hard constraint. Caps how high you can climb regardless of desire. |
| Team topology | One team or many? What operational maturity? | A microservices design needs autonomous teams; a small team should stay on PaaS. |

A word on the costs throughout. The INR figures are deliberately rough, order-of-magnitude monthly run-rate estimates at indicative pay-as-you-go rates, meant to teach the shape of the cost curve; they are not quotes. Real numbers depend on region, tier, reservations, egress and traffic. Always model your own in the Azure Pricing and TCO calculators. The lesson is in the ratios between the four cases, not the absolute rupees: note how the bill rises by more than 100× from the first case (about ₹20,000/month) to the last (₹22–28 lakh/month), and why.


Case study 1 — A startup SaaS web app

Business brief

“PulseDesk” is a two-founder startup building a help-desk SaaS for small businesses. They have a working prototype on a laptop, ten design-partner customers lined up, and eight months of runway. They need it live, multi-tenant, and able to take credit-card signups next month. They have no operations team and no patience for managing servers. The message to the architect is blunt: “Get us to market cheaply, don’t let a single outage embarrass us in front of our first customers, and don’t build anything we’ll have to throw away when we grow.”

Requirements

| Axis | PulseDesk’s requirement |
|---|---|
| RTO | A few hours is tolerable at launch. The business does not die if it is down for an afternoon, but a multi-day outage would. |
| RPO | Near-zero for customer data (a ticket lost is a customer lost), but a short window (minutes) is survivable. |
| Scale & shape | Tiny now (hundreds of users), unpredictable later. Could 100× in a year if it takes off, or flatline. Spiky, not steady. |
| Availability | “Don’t embarrass us”: call it 99.9% as a stretch goal, not a contractual SLA yet. |
| Compliance | None binding at launch beyond basic data hygiene; they will want SOC 2 eventually, so build toward it. |
| Budget | Brutal. Under ₹40,000/month all-in is the ceiling; ideally far less while traffic is low. |
| Team | Two full-stack developers. No dedicated ops. Every hour spent patching a VM is an hour not building product. |

Constraints

The binding constraints here are money and people, not reliability. With two developers and no ops function, the architecture must be managed — no VMs to patch, no Kubernetes to babysit. The “don’t throw it away” instruction rules out a design that cannot grow: a single SQLite file on one box would be cheap but a dead end. And the elastic, spiky load means anything with a fixed always-on cost floor is wrong; the bill should track usage, falling near zero on quiet nights.

The design and its Azure services

This is the canonical web-queue-worker style, built entirely from managed PaaS so two people can run it. Multi-tenancy is handled at the data layer with a shared database and a tenant ID on every row — the cheapest tenancy model, appropriate at this stage.
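As a minimal sketch of the shared-database, tenant-ID-on-every-row model described above, the following uses an in-memory SQLite database purely as a stand-in for Azure SQL. The table name, column names and helper functions are hypothetical; the point is only that every read and write is scoped by the tenant ID.

```python
import sqlite3

# Hypothetical schema: one shared table, every row tagged with its tenant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, subject TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_tickets_tenant ON tickets (tenant_id)")


def add_ticket(tenant_id: str, subject: str) -> int:
    """Insert a ticket for one tenant and return its row id."""
    cur = conn.execute(
        "INSERT INTO tickets (tenant_id, subject) VALUES (?, ?)",
        (tenant_id, subject),
    )
    conn.commit()
    return cur.lastrowid


def list_tickets(tenant_id: str) -> list:
    """Return subjects for one tenant only.

    Every read is filtered by tenant_id; a query without this filter
    would leak one customer's tickets to another.
    """
    rows = conn.execute(
        "SELECT subject FROM tickets WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return [subject for (subject,) in rows]


add_ticket("acme", "Printer offline")
add_ticket("globex", "Cannot log in")
add_ticket("acme", "Invoice missing")
print(list_tickets("acme"))
```

The weakness of this model is that isolation depends on every query remembering the filter. In Azure SQL, row-level security is one way to enforce the filter in the database rather than in application code; whether PulseDesk needs that at launch is a judgment call.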

| Layer | Azure service | Role |
|---|---|---|
| Front end / API | Azure App Service (Linux, Basic→Standard plan) | Hosts the web app and REST API. Built-in autoscale, TLS, deployment slots, no servers to manage. |
| Background work | Azure Functions (Consumption plan) | Sends emails, generates ticket digests, processes webhooks, all triggered off a queue. Scales to zero; you pay per execution. |
| Queue | Azure Storage Queue (or Service Bus later) | Decouples slow work (email, exports) from the request path: the Queue-Based Load Levelling pattern. |
| Database | Azure SQL Database (General Purpose, serverless) | Relational store with auto-pause when idle (cost), point-in-time restore (RPO), and a clean path to grow. |
| Files / assets | Azure Blob Storage | Ticket attachments and static assets. The Valet Key pattern issues short-lived SAS URLs so clients upload directly, bypassing the app tier. |
| Cache | Azure Cache for Redis (Basic, optional) | Sessions and hot lookups; add only when read load justifies the floor cost. |
| Identity | Microsoft Entra External ID | Customer sign-up/sign-in (CIAM) with social and email; far cheaper and safer than rolling your own auth. |
| Secrets | Azure Key Vault | Connection strings and API keys, read via the App Service managed identity; no secrets in code or config. |
| Edge | Azure Front Door (Standard) | Global TLS termination, CDN caching of static assets, and a WAF: a cheap security and performance win. |
| Ops | Application Insights + Log Analytics | Telemetry, traces, and alerts so two people get paged before customers complain. |
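The Valet Key idea in the Files / assets row can be illustrated without any cloud SDK. The sketch below is not the Azure SAS format and does not call Azure Storage; it is a generic, hypothetical signed token (the names `issue_upload_token`, `verify_upload_token` and `SIGNING_KEY` are invented here) that shows the two properties that matter: the grant is scoped to one resource and it expires.

```python
import hashlib
import hmac
import time
from typing import Optional

SIGNING_KEY = b"demo-only-secret"  # hypothetical; a real key belongs in a vault


def _sign(blob_path: str, expiry: int) -> str:
    # Bind the signature to the path, the permission ("w") and the expiry.
    message = f"{blob_path}|w|{expiry}".encode()
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def issue_upload_token(blob_path: str, ttl_seconds: int,
                       now: Optional[float] = None) -> str:
    """App tier: hand the client a short-lived, single-path write grant."""
    now = time.time() if now is None else now
    expiry = int(now) + ttl_seconds
    return f"{expiry}.{_sign(blob_path, expiry)}"


def verify_upload_token(blob_path: str, token: str,
                        now: Optional[float] = None) -> bool:
    """Storage side: accept only an unexpired token signed for this path."""
    now = time.time() if now is None else now
    try:
        expiry_text, signature = token.split(".", 1)
        expiry = int(expiry_text)
    except ValueError:
        return False  # malformed token
    expected = _sign(blob_path, expiry)
    return hmac.compare_digest(signature, expected) and now < expiry


token = issue_upload_token("tenants/acme/ticket-42/log.txt", ttl_seconds=300)
print(verify_upload_token("tenants/acme/ticket-42/log.txt", token))
```

The design choice this illustrates is that the application tier never proxies the file bytes: it only issues the grant, so upload bandwidth does not consume App Service capacity.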

Single region, but with availability zones turned on for App Service and Azure SQL where the tier supports it — this buys datacentre-fault tolerance for a small premium, which directly answers “don’t embarrass us”.

Architecture decisions and Well-Architected tradeoffs

Rough cost

| Item | Indicative monthly (INR) |
|---|---|
| App Service (Standard S1, zone-redundant) | ₹9,000 |
| Azure SQL (serverless, light use) | ₹4,000 |
| Functions (Consumption) | ₹500 |
| Storage + Queue + Blob | ₹600 |
| Front Door (Standard) | ₹3,000 |
| Entra External ID (first tier free, light MAU) | ₹0–500 |
| App Insights / Log Analytics | ₹1,500 |
| Approx. total | ₹18,000–22,000/month |

Comfortably inside the ₹40,000 ceiling, and most of it scales up only as revenue does. On a truly quiet month it drifts lower as serverless components idle.

What could go wrong

This is the correct architecture for where PulseDesk is: cheap, managed, zone-tolerant, and with a clear, non-throwaway growth path. Now we change the binding constraint entirely.


Case study 2 — A regulated healthcare data platform

Business brief

“MediVault” is a healthcare analytics company building a platform that ingests electronic health records (EHR) and medical imaging from a network of hospitals, stores them securely, and lets clinicians and researchers query de-identified data. The customers are US hospital systems; the data is Protected Health Information (PHI). The brief from their CISO is unambiguous: “This is HIPAA-regulated PHI. Nothing touches the public internet. Everything is encrypted, every access is logged, and we must be able to prove to an auditor exactly who saw what. Reliability matters, but compliance and privacy are non-negotiable — we will pay for them.”

Requirements

| Axis | MediVault’s requirement |
|---|---|
| RTO | Hours is acceptable for the analytics platform (it is not a life-support system), but the ingestion pipeline must not lose data during an outage. |
| RPO | Effectively zero for ingested clinical records; losing a patient’s lab result is unacceptable and a reportable event. |
| Scale & shape | Steady, batch-heavy ingestion (nightly EHR feeds) plus large imaging files; analytical query load from a known set of clinicians and researchers. |
| Availability | 99.9% for the platform; the ingestion path is the part that must be durable above all. |
| Compliance | HIPAA / HITRUST binding. Data residency in the US. Full audit trail. Encryption at rest and in transit, with customer-managed keys preferred. PHI must be de-identified before research access. |
| Budget | Generous relative to PulseDesk (compliance is funded) but not unlimited. A few lakh rupees/month is in scope; the CISO will not trade away controls to save money. |
| Team | A small platform team plus a compliance officer. Operational maturity is moderate; they need automation and clear audit evidence, not heroics. |

Constraints

The binding constraint here is regulatory, not load. HIPAA reshapes the architecture in ways performance never would: no public endpoints (everything behind private networking), encryption with customer-managed keys, immutable audit logs, and a hard separation between identifiable PHI and the de-identified data researchers may touch. The data-residency rule pins every component to US regions. And because PHI breaches are legally reportable and ruinous, the design must be private-by-default — the opposite of the public, edge-cached startup above.

The design and its Azure services

This is a medallion data platform (Bronze → Silver → Gold) wrapped in a private network with defence-in-depth. The architecture pattern that matters most is isolation: every service is reachable only over private endpoints, and identifiable data is walled off from de-identified data.

| Concern | Azure service | Role |
|---|---|---|
| Secure ingestion | Azure Data Factory (with self-hosted/managed VNet integration runtime) + SFTP on Blob | Pulls nightly EHR feeds and imaging over private connectivity; no public ingress. |
| Landing & lake | Azure Data Lake Storage Gen2 (Bronze/Silver/Gold) | Immutable raw landing zone, cleansed/conformed Silver, de-identified/aggregated Gold. Hierarchical namespace, lifecycle tiering for cold imaging. |
| Transform & de-identify | Azure Databricks (in a customer VNet, no public IP) | Cleansing, conforming, and the de-identification step that produces the research-safe Gold layer. |
| Serve / query | Microsoft Fabric / Synapse + Power BI | Clinician dashboards and researcher SQL, reading only the Gold (de-identified) layer for research personas. |
| Networking | Azure Virtual Network, Private Endpoints, Private DNS, Azure Firewall, NSGs | Everything private. No service has a public endpoint. Egress is forced through the firewall and logged. |
| Identity & access | Microsoft Entra ID + Conditional Access + PIM | Least-privilege RBAC; just-in-time elevation for admins; MFA enforced; access to PHI strictly role-scoped. |
| Encryption & keys | Azure Key Vault / Managed HSM with customer-managed keys (CMK) | Encryption at rest under keys MediVault controls and can revoke, a HIPAA-grade requirement. |
| Audit & monitoring | Microsoft Sentinel, Defender for Cloud, Azure Monitor, immutable storage for logs | Every access logged to tamper-evident storage; SIEM detection; Defender regulatory-compliance dashboard tracks HIPAA/HITRUST controls. |
| Governance | Microsoft Purview + Azure Policy (HIPAA/HITRUST initiative) | Data classification, lineage from raw to research, and policy enforcement that denies non-compliant resources (e.g. any public endpoint). |

The whole estate sits inside a landing zone with governance baked in — see Azure Landing Zones with CAF — so that the HIPAA Azure Policy initiative and private-networking guardrails apply by construction, not by hope.
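To give a feel for what “tamper-evident” means for the audit trail, here is a small hash-chain sketch. It is an illustration of the idea only, under the assumption that each log record carries a hash of the previous one; it is not how Azure immutable storage or Sentinel work internally, and the function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash value used before the first record


def _digest(prev_hash: str, entry: dict) -> str:
    # Hash the previous record's hash together with this entry's content.
    payload = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append_entry(log: list, who: str, action: str, resource: str) -> None:
    """Append an access record chained to the record before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"who": who, "action": action, "resource": resource}
    log.append({"entry": entry, "hash": _digest(prev_hash, entry)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or removed record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


audit_log = []
append_entry(audit_log, "dr.rao", "read", "patient/1001/labs")
append_entry(audit_log, "researcher.k", "query", "gold/cohort-7")
print(verify_chain(audit_log))
```

An auditor's question, “who saw what, and can you prove the log was not altered?”, maps onto `verify_chain`: an edit to any earlier record changes every hash after it.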

Architecture decisions and Well-Architected tradeoffs

Rough cost

| Item | Indicative monthly (INR) |
|---|---|
| Data Lake Gen2 (large, tiered imaging) | ₹40,000 |
| Databricks (VNet-injected, scheduled jobs) | ₹70,000 |
| Data Factory + integration runtimes | ₹15,000 |
| Synapse / Fabric + Power BI | ₹40,000 |
| Private networking (Firewall, PE, DNS) | ₹45,000 |
| Key Vault / Managed HSM (CMK) | ₹15,000 |
| Sentinel + Defender + immutable log storage | ₹35,000 |
| Purview governance | ₹12,000 |
| Approx. total | ₹2.7–3.2 lakh/month |

Roughly 15× PulseDesk — and almost every additional rupee buys compliance and privacy, not features or raw scale. That is the honest cost of regulated data, and the CISO signed up for it.

What could go wrong

MediVault shows a vital lesson: the most expensive pillar is not always Reliability. Here it is Security, and a good architect spends where the requirements — not the textbook — point. Next, the binding constraint shifts again, to scale.


Case study 3 — A global retail e-commerce platform

Business brief

“BharatBazaar” is a fast-growing retailer selling across India, South-East Asia, and the Middle East. They run flash sales and a Diwali peak where traffic spikes 50× in minutes. Their current monolith falls over under load, oversells stock, and is slow for customers far from their single datacentre. The brief from the VP of Engineering: “We need a platform that’s fast for customers on three continents, never oversells inventory, and survives the Diwali surge without us pre-provisioning a fortune of idle capacity the other 360 days. Teams must be able to ship independently — checkout, catalogue and search can’t be blocked by each other.”

Requirements

| Axis | BharatBazaar’s requirement |
|---|---|
| RTO | Low: minutes. An outage during a flash sale is lost revenue measured in crores per hour. |
| RPO | Near-zero for orders and payments; eventual consistency is acceptable for catalogue and recommendations. |
| Scale & shape | Extreme spikes (50× in minutes during sales) on a moderate baseline. Global read traffic; write traffic concentrated around checkout. |
| Availability | 99.95%+ across regions; graceful degradation (browse must survive even if recommendations don’t). |
| Compliance | PCI-DSS for card data (mostly delegated to a payment provider); data-residency awareness across countries. |
| Budget | Significant but ROI-driven. They will spend to capture peak revenue, but idle capacity 360 days a year is unacceptable; elasticity is a hard requirement. |
| Team | Multiple autonomous product teams (catalogue, search, cart, checkout, fulfilment) with good DevOps maturity. |

Constraints

The binding constraint is elastic global scale with correctness under contention. Two things break naive designs here: the surge (a design that needs pre-provisioned peak capacity is too expensive) and overselling (concurrent buyers racing for the last unit of stock). The multi-team requirement rules out a monolith — teams must deploy independently, which points to microservices and an event-driven spine. And global customers demand low latency, which forces multi-region presence and edge delivery.
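The overselling race described above is commonly closed with optimistic concurrency: read the stock with a version, and write back only if the version is unchanged. The sketch below simulates this with an in-memory store whose version plays the role of an ETag. The class and function names are hypothetical, and the lock only models the store's atomic compare-and-swap, not a real database client.

```python
import threading


class VersionConflict(Exception):
    """Raised when the item changed between our read and our write."""


class InventoryStore:
    """In-memory stand-in for a document store with ETag-style versions."""

    def __init__(self, stock: dict):
        self._items = {sku: (qty, 0) for sku, qty in stock.items()}
        self._lock = threading.Lock()  # models the store's atomic check-and-set

    def read(self, sku: str):
        with self._lock:
            return self._items[sku]  # (quantity, version)

    def write_if_unchanged(self, sku: str, new_qty: int, expected_version: int):
        with self._lock:
            _, current_version = self._items[sku]
            if current_version != expected_version:
                raise VersionConflict(sku)
            self._items[sku] = (new_qty, current_version + 1)


def reserve_one(store: InventoryStore, sku: str, max_attempts: int = 50) -> bool:
    """Try to take one unit; retry on conflict, give up when stock is gone."""
    for _ in range(max_attempts):
        qty, version = store.read(sku)
        if qty <= 0:
            return False  # sold out: never decrement below zero
        try:
            store.write_if_unchanged(sku, qty - 1, version)
            return True
        except VersionConflict:
            continue  # someone else won the race; re-read and retry
    return False


def simulate(buyers: int, stock: int):
    """Race many buyers for limited stock; return (units sold, units left)."""
    store = InventoryStore({"SKU-1": stock})
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(reserve_one(store, "SKU-1")))
        for _ in range(buyers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), store.read("SKU-1")[0]


print(simulate(buyers=20, stock=3))
```

However the threads interleave, the number of successful reservations cannot exceed the starting stock, because every successful write requires an unchanged version.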

The design and its Azure services

This is a microservices + event-driven architecture, multi-region active-active for the stateless tiers, with the Competing Consumers and Queue-Based Load Levelling patterns absorbing the surge, and CQRS separating the read-heavy catalogue from the write-critical order path.
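A toy version of Queue-Based Load Levelling with Competing Consumers, using only Python's standard library as a stand-in for Service Bus and AKS workers. The function name `run_surge` and the worker naming are hypothetical. The producer dumps a burst onto the queue and returns at once; a fixed pool of workers drains it at their own pace, each message going to exactly one worker.

```python
import queue
import threading


def run_surge(order_count: int, worker_count: int) -> dict:
    """Enqueue a burst of orders and let competing workers drain the queue.

    Returns a mapping of order id -> name of the worker that processed it.
    """
    orders = queue.Queue()
    processed_by = {}
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            order_id = orders.get()
            if order_id is None:  # sentinel: no more work for this worker
                orders.task_done()
                return
            with lock:
                processed_by[order_id] = name
            orders.task_done()

    workers = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for w in workers:
        w.start()

    # The burst: enqueueing is cheap, so the request path is never blocked
    # by the slow work happening behind the queue.
    for order_id in range(order_count):
        orders.put(order_id)
    for _ in workers:
        orders.put(None)  # one sentinel per worker

    orders.join()
    for w in workers:
        w.join()
    return processed_by


result = run_surge(order_count=500, worker_count=4)
print(len(result), "orders processed by", len(set(result.values())), "workers")
```

In the real design the worker count is not fixed: it grows with queue depth during the surge and shrinks afterward, which is where the autoscaler comes in.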

| Concern | Azure service | Role |
|---|---|---|
| Global edge | Azure Front Door (Premium) + CDN + WAF | Routes users to the nearest healthy region, caches static catalogue/imagery at the edge, absorbs bot and DDoS load. |
| Compute | Azure Kubernetes Service (AKS), multiple regions, with KEDA | Per-team microservices; KEDA scales pods on queue depth, so checkout workers spin up with the surge and back down after. |
| Async spine | Azure Service Bus (premium) / Event Hubs | Orders, inventory events, and notifications flow asynchronously; the Publisher-Subscriber and Competing Consumers patterns decouple teams and level load. |
| Order data | Azure Cosmos DB (multi-region, session/strong consistency where needed) | Globally distributed, elastically scalable store for the order and inventory domain; multi-region writes for availability. |
| Catalogue read model | Cosmos DB / Azure SQL + Azure Cache for Redis | A CQRS read model and Cache-Aside keep the hot browse path millisecond-fast and cheap to scale out. |
| Search | Azure AI Search | Faceted product search and relevance, scaled independently of catalogue writes. |
| Inventory correctness | Cosmos DB optimistic concurrency / Service Bus sessions | Serialises decrements per SKU so the last unit is never oversold: correctness under contention. |
| Payments | External PCI provider + tokenisation | Card data never lands in BharatBazaar’s estate (Quarantine/Gatekeeper thinking), so PCI scope is minimised. |
| Order workflow | Azure Durable Functions / Logic Apps | The Saga pattern orchestrates reserve-stock → charge → fulfil with Compensating Transactions on failure. |
| Resilience | App-level Retry + Circuit Breaker + Bulkhead | Browse survives even when recommendations or reviews are degraded: graceful degradation by design. |
| Ops | Azure Monitor, App Insights, Managed Grafana | Per-service SLOs, surge dashboards, and autoscale observability across regions. |
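The Resilience row's claim that browse survives a degraded recommendations service can be sketched as a minimal Circuit Breaker with a fallback. This is a simplified, single-threaded illustration with hypothetical names and thresholds; production code would normally use an established resilience library rather than a hand-rolled breaker.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then probe again."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, operation, fallback):
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self._reset_after:
            return fallback()  # open: do not touch the failing dependency
        # Closed, or half-open after the reset window: let one call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = now  # open (or re-open) the circuit
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def fetch_recommendations():
    raise RuntimeError("recommendations service is down")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0)
for _ in range(3):
    # The product page still renders, just with an empty recommendations strip.
    print(breaker.call(fetch_recommendations, fallback=lambda: []))
```

The fallback is the graceful-degradation decision made explicit: the page loses a feature, not its availability.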

Front Door routes to the nearest healthy region; AKS clusters in each region scale on demand; Cosmos DB spans regions so reads are local and a region loss does not stop writes.

Architecture decisions and Well-Architected tradeoffs

Rough cost

| Item | Indicative monthly (INR) |
|---|---|
| AKS (multi-region, baseline + surge headroom) | ₹2.0 lakh |
| Cosmos DB (multi-region, provisioned + autoscale RU) | ₹2.5 lakh |
| Front Door Premium + CDN + WAF | ₹1.2 lakh |
| Service Bus / Event Hubs (premium) | ₹60,000 |
| Azure AI Search | ₹50,000 |
| Redis (Premium, clustered) | ₹70,000 |
| Monitoring / Grafana | ₹40,000 |
| Approx. baseline total | ₹8–9 lakh/month |

Crucially, the bill is elastic: it spikes during sales when revenue justifies it and falls back toward baseline afterward — the opposite of a pre-provisioned monolith that pays for peak capacity year-round. The architecture’s whole economic argument is that cost tracks revenue.
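The elasticity argument can be made concrete with the arithmetic a queue-depth autoscaler performs. The sketch below assumes the common rule of dividing the backlog by a per-replica target and clamping to configured bounds, which is roughly how a KEDA queue-length trigger behaves through the Kubernetes autoscaler. The function name and every number are illustrative, not BharatBazaar's real settings.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Workers wanted for the current backlog, clamped to configured bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical day: quiet baseline, flash-sale surge, then drain-down.
for label, depth in [("baseline", 150), ("surge", 12_000), ("after", 40)]:
    replicas = desired_replicas(depth, target_per_replica=100,
                                min_replicas=2, max_replicas=200)
    print(f"{label:>8}: queue depth {depth:>6} -> {replicas} checkout workers")
```

The `max_replicas` ceiling matters as much as the scaling rule: it caps the bill and protects downstream dependencies such as the database from being overwhelmed by the workers themselves.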

What could go wrong

BharatBazaar is the classic AZ-305 “design for scale and resilience” case. But notice its reliability ceiling: it survives a region loss with degradation, but it is not zero-downtime, zero-data-loss through that loss. For the final case, the business cannot tolerate even that.


Case study 4 — A zero-downtime bank core

Business brief

“SovereignBank” is building a new core banking ledger — the system that records every account balance and transaction. The brief from the Chief Risk Officer is the most demanding an architect ever hears: “This system cannot go down and cannot lose a transaction — ever. A regional disaster, a bad deployment, a poisoned cache: customers must keep transacting through all of them with no human in the loop for the first line of defence. We are regulated for data sovereignty — every byte of customer data stays within national borders. Reliability is the requirement; we will pay what correctness and continuity cost, but not a rupee on theatre that doesn’t buy them.”

Requirements

| Axis | SovereignBank’s requirement |
|---|---|
| RTO | Effectively zero. The business transacts through a regional failure; there is no acceptable “down for failover” window for the ledger. |
| RPO | Zero for committed transactions. A committed debit/credit must never be lost; this is the absolute, non-negotiable requirement. |
| Scale & shape | High, steady transactional throughput with predictable daily/monthly peaks (payday, month-end), not flash-sale spikes. |
| Availability | The highest the business can fund and prove, designed for continuity through the loss of a whole region. |
| Compliance | Data sovereignty binding: all customer data within national borders; full audit, regulatory reporting, and provable controls. Banking regulation (e.g. RBI-style) and strong cryptographic key control. |
| Budget | Large and explicitly justified by the cost of downtime (crores per hour, regulatory penalties, reputational ruin). But spent with discipline: no reliability theatre. |
| Team | A mature platform and SRE organisation, comfortable with chaos engineering, health modelling, and zero-downtime deployment. |

Constraints

This is the apex case, and the binding constraints stack: zero RTO and zero RPO under regional failure, plus data sovereignty that pins everything inside national borders. Zero-RTO-through-region-loss forces active/active multi-region — an active-passive design has a failover window, which is disqualifying. Zero-RPO forces careful data design — you cannot simply async-replicate the ledger and accept lag. Sovereignty means both active regions must be in-country (Azure has multiple Indian regions, e.g. Central and South India), and key material stays under national control. And “no human in the loop for the first line of defence” forces a health model and self-healing automation, not a runbook a tired engineer follows at 3 a.m.

The design and its Azure services

This lands squarely on Mission-Critical (AlwaysOn) Architecture on Azure — the apex design where the Well-Architected pillars and the design patterns converge. The signature concepts are the deployment stamp / scale unit, active/active multi-region, the health model, and zero-downtime deployment of whole stamps.
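To make the health model less abstract, here is a minimal sketch of classifying a stamp from telemetry rather than from “is it up?”. The thresholds, field names and the `routable_stamps` helper are all hypothetical; a real health model is designed per workload and validated continuously.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class StampTelemetry:
    p95_latency_ms: float
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    dependencies_ok: bool   # e.g. ledger store and key service reachable


def classify(t: StampTelemetry, latency_slo_ms: float = 300.0,
             error_budget: float = 0.01) -> Health:
    """Map raw signals to a health state (illustrative thresholds)."""
    if not t.dependencies_ok or t.error_rate >= 5 * error_budget:
        return Health.UNHEALTHY
    if t.p95_latency_ms > latency_slo_ms or t.error_rate > error_budget:
        return Health.DEGRADED  # a brown-out: up, but hurting customers
    return Health.HEALTHY


def routable_stamps(stamps: dict) -> list:
    """Prefer healthy stamps; fall back to degraded ones over no service."""
    healthy = [n for n, t in stamps.items() if classify(t) is Health.HEALTHY]
    if healthy:
        return sorted(healthy)
    return sorted(n for n, t in stamps.items() if classify(t) is Health.DEGRADED)


stamps = {
    "central-india": StampTelemetry(120.0, 0.001, True),
    "south-india": StampTelemetry(850.0, 0.004, True),  # slow but responding
}
print(routable_stamps(stamps))
```

A plain uptime probe would report both stamps as fine. The health model is what notices that one of them is answering too slowly to keep serving customers.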

| Concern | Azure service / concept | Role |
|---|---|---|
| Topology | Active/active across two in-country regions, fronted by Azure Front Door | Both regions serve live traffic; loss of one is absorbed with no failover window, giving zero RTO. |
| Deployment unit | Deployment Stamp / scale unit | A self-contained, independently-deployable unit (compute + data + config). Capacity is added by cloning stamps; blue/green of an entire stamp gives zero-downtime releases. |
| Compute | AKS (or scale-unit-aligned App Service/Container Apps) per stamp | Stateless application tier within each stamp, zone-redundant within region and replicated across regions. |
| Ledger data | Cosmos DB multi-region (multi-write) and/or Azure SQL with synchronous in-region replicas + cross-region replication | The hardest decision: a globally-distributed store with active-active writes and conflict resolution, or a strongly-consistent SQL design with synchronous zone replicas in-region and tight cross-region replication. The ledger’s correctness model decides which. |
| Exactly-once integrity | Transactional Outbox, Idempotency keys, Saga | Every transaction is idempotent and replayable; the Transactional Outbox pattern guarantees a committed ledger entry and its event are atomic, so no transactions are lost or duplicated. |
| Health model | Custom Health Endpoint Monitoring → healthy/degraded/unhealthy | The system classifies its own health from telemetry (latency, error rate, dependency health), not raw uptime, and Front Door routes away from a degraded stamp automatically. |
| Self-healing & isolation | Bulkhead, Circuit Breaker, Retry, Throttling + automation | Fault isolation per stamp (blast-radius reduction); automated remediation is the first responder. |
| Networking & sovereignty | Private endpoints, in-country regions only, Azure Firewall | All data in-country; private-by-default; egress controlled and logged. |
| Keys & encryption | Managed HSM with customer-managed keys, in-country | Cryptographic control under national jurisdiction. |
| Continuous validation | Azure Chaos Studio + load and failover testing in the pipeline | The resilience is proven continuously by injecting faults (kill a stamp, fail a region); see chaos engineering. |
| Observability & audit | Azure Monitor, App Insights, Sentinel, immutable audit, regulatory reporting | Deep telemetry feeds the health model; tamper-evident audit satisfies the regulator. |
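The Exactly-once integrity row combines two ideas that are easier to see in code: an idempotency key that makes a replayed request a no-op, and a Transactional Outbox that commits the ledger entry and its event together. The sketch uses in-memory SQLite as a stand-in for the ledger store; table names, column names and functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger (
    idempotency_key TEXT PRIMARY KEY,   -- replays collide on this key
    account         TEXT NOT NULL,
    amount_paise    INTEGER NOT NULL
);
CREATE TABLE outbox (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    idempotency_key TEXT NOT NULL,
    event           TEXT NOT NULL,
    published       INTEGER NOT NULL DEFAULT 0
);
""")


def post_transaction(key: str, account: str, amount_paise: int) -> bool:
    """Write the ledger entry and its event in ONE local transaction.

    Returns False if this idempotency key was already applied (a replay).
    """
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO ledger VALUES (?, ?, ?)",
                         (key, account, amount_paise))
            conn.execute(
                "INSERT INTO outbox (idempotency_key, event) VALUES (?, ?)",
                (key, f"posted:{account}:{amount_paise}"),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the first attempt already committed


def drain_outbox(publish) -> int:
    """Relay unpublished events, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT id, event FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event in rows:
        publish(event)  # may repeat after a crash: consumers must de-duplicate
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


print(post_transaction("tx-1001", "ACC-9", -50_000))  # first attempt
print(post_transaction("tx-1001", "ACC-9", -50_000))  # client retry, ignored
print(drain_outbox(print))
```

Note what the sketch does and does not promise: the relay is at-least-once, so “exactly-once” is an end-to-end property that comes from pairing it with idempotent consumers, not from the outbox alone.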

The composite-SLA maths is explicit here: chaining components multiplies their availabilities, so adding regions and removing single points of failure is how you claw back the nines that a long dependency chain erodes — the discipline taught in Mission-Critical (AlwaysOn) Architecture and multi-region active-active disaster recovery.
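The composite-SLA arithmetic above can be checked in a few lines. The availability figures below are illustrative placeholders, not published Azure SLAs, and the redundancy formula assumes the two regions fail independently, which in-country regions only approximate.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """A chain where every component must work: multiply availabilities."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """N independent copies where any one suffices: 1 - (all fail)."""
    return 1 - (1 - availability) ** copies


# Illustrative single-region chain: compute -> messaging -> database.
single_region = serial(0.9995, 0.9999, 0.9995)

# The same chain deployed active/active in two regions.
two_regions = redundant(single_region, copies=2)

# The global entry point still sits in series with everything behind it.
end_to_end = serial(0.9999, two_regions)

print(f"single region : {single_region:.6f}")
print(f"two regions   : {two_regions:.8f}")
print(f"with front end: {end_to_end:.6f}")
```

Two lessons fall out of the numbers. The chain is always worse than its weakest link, and once the regions are redundant, the shared component in front of them becomes the term that dominates the result.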

Architecture decisions and Well-Architected tradeoffs

This case is the inverse of PulseDesk’s economics: there, you spent the minimum and accepted real reliability gaps; here, Reliability is paramount and you spend deliberately — but still with discipline, refusing spend that does not measurably buy continuity or correctness.

Rough cost

| Item | Indicative monthly (INR) |
|---|---|
| Active/active compute (AKS, multiple stamps × 2 regions) | ₹8–10 lakh |
| Multi-region ledger data (Cosmos multi-write / SQL replicas) | ₹6–8 lakh |
| Front Door Premium + global routing/WAF | ₹1.5 lakh |
| Private networking × 2 regions (Firewall, PE, DNS) | ₹2 lakh |
| Managed HSM + CMK (in-country) | ₹1.5 lakh |
| Sentinel + immutable audit + regulatory reporting | ₹2 lakh |
| Chaos Studio + load/failover test infrastructure | ₹50,000 |
| Approx. total | ₹22–28 lakh/month |

More than 100× PulseDesk (₹22–28 lakh against roughly ₹20,000). But for a system where an hour of downtime costs crores and a lost transaction is a regulatory incident, the run-rate is dwarfed by the risk it retires. The architect’s job is to ensure every rupee buys continuity or correctness, not reassurance.

What could go wrong

This is the apex: the design every other case has been climbing toward. And the meta-lesson across all four is the one that separates an architect from a service-operator — the right design is the one the requirements force, no higher and no lower.

Figure: Azure architecture case studies

The diagram lays the four case studies side by side on a rising-complexity axis, so you can see at a glance how the binding constraint shifts — cost, then compliance, then scale, then reliability — and how the architecture grows in response from a single-region serverless web app to a sovereign, active/active mission-critical core.

Real-world application

In a real engagement, these walkthroughs are the job — they are what you whiteboard in a discovery workshop and then formalise in a proposal or an Azure Architecture Review. The repeatable method is the deliverable: ask the seven axis questions, find the binding constraint, choose the cheapest design that meets it with margin, name every decision as a Well-Architected tradeoff, cost it, and pre-mortem it. A few patterns from these cases recur in almost every engagement:

These are precisely the scenarios AZ-305 tests, and precisely the conversations that fill a senior architect’s week.

Common mistakes & anti-patterns

Interview & exam questions

  1. A startup with two developers and a brutal budget needs a multi-tenant SaaS live next month. What architecture do you propose, and why not Kubernetes? (Looking for: managed PaaS — App Service, Functions, Azure SQL serverless; consumption/serverless for cost; shared-DB multi-tenancy; KEDA/AKS is over-engineering for two people — “use managed services”.)
  2. A HIPAA platform’s binding constraint is compliance, not uptime. How does that reshape the architecture versus a public web app? (Private endpoints, no public ingress, CMK in Managed HSM, immutable audit logs, PHI/de-identified separation, Azure Policy in deny mode, reliability sized honestly to the real RTO.)
  3. How do you absorb a 50× flash-sale surge without paying for peak capacity all year? (Queue-Based Load Levelling + Competing Consumers as a shock absorber; KEDA scaling workers on queue depth; the queue buffers the spike; cost tracks revenue.)
  4. How do you guarantee a retailer never oversells the last unit of stock under extreme concurrency? (Per-SKU serialisation via Service Bus sessions or optimistic concurrency with retry; reserve strong/session consistency for inventory and money only; eventual consistency elsewhere.)
  5. Active/active versus active/passive for a system that must transact through a region loss — which, and why? (Active/active: active/passive has a failover window that violates zero RTO. Cost is the tradeoff; both regions already serve, so there is no failover.)
  6. What does it mean to drive failover from a “health model” rather than raw uptime, and why is it superior? (Classify application health — healthy/degraded/unhealthy — from telemetry; route away from a degraded stamp before customers feel it; raw “is the VM up?” misses brown-outs.)
  7. How do you guarantee zero RPO for a bank ledger — no lost and no duplicate transactions? (Transactional Outbox for atomic commit-and-publish, idempotency keys, Saga with compensating transactions; synchronous in-region replication; careful conflict resolution on multi-write.)
  8. A client insists on active/active multi-region for a line-of-business app that tolerates four hours of RTO. How do you respond? (Push back: it is over-engineering. Use the WAF tradeoff language — they would spend Cost and Operational Excellence for Reliability they don’t need. Propose zone-redundant single region with geo-backup DR.)
  9. How does a data-sovereignty requirement change a multi-region design? (Both regions must be in-country; CMK under national jurisdiction; fewer regions to spread across, so maximise zone/region separation within the country and document residual correlated-failure risk.)
  10. Name three things that go wrong in an active/active design and how the architecture mitigates each. (Split-brain → conflict-resolution model + chaos testing; bad deployment to both regions → stamp blue/green + canary; wrong health model → treat it as a tested, continuously-validated artefact.)
  11. Why delegate card handling to an external PCI provider instead of building it? (Tokenisation keeps card data out of your estate — Quarantine/Gatekeeper thinking — collapsing PCI scope; you spend a small hop and fee to retire most of the compliance burden.)
  12. Across these four cases the monthly bill rises by more than 100×. What single principle explains the spread? (Climb exactly as high as the requirements force you. Cost is the price of the binding constraint: money/people, then compliance, then scale, then zero-downtime reliability. Never aesthetics.)

Quick check

  1. In the startup case, why is Azure Functions on the Consumption plan the right choice for background work?
  2. What is the binding constraint in the healthcare case, and name two architecture decisions it forces.
  3. Which pattern lets the e-commerce platform absorb a 50× surge without pre-provisioning peak capacity?
  4. Why must the bank core be active/active rather than active/passive?
  5. State the one principle that explains why the four designs differ so much in cost and complexity.

Answers

  1. Background work (email, exports, webhooks) is intermittent, so Consumption scales to zero and bills per execution — maximum Cost Optimization for a startup whose traffic is mostly zero — and it is fully managed, fitting a two-person team. The cold-start Performance cost is acceptable for async work off a queue.
  2. The binding constraint is regulatory compliance (HIPAA), not uptime. It forces, among others: private-by-default networking (no public endpoints), customer-managed keys, immutable audit logs, hard PHI/de-identified separation, and Azure Policy in deny mode — and it justifies sizing reliability honestly to the real (hours) RTO rather than gold-plating it.
  3. Queue-Based Load Levelling (with Competing Consumers): the queue buffers the spike and KEDA scales workers on queue depth, so the platform provisions for the surge in minutes and scales back afterward — cost tracks revenue.
  4. Because the requirement is zero RTO through a region loss. Active/passive has a failover window during which the ledger is unavailable, which is disqualifying; active/active means both regions are already serving, so losing one is absorbed with no failover.
  5. Climb exactly as high as the requirements force you, and not one rung higher — the right design is the one driven by the binding constraint (money/people, compliance, scale, or zero-downtime reliability), so cost and complexity rise only as the requirements genuinely demand.
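The queue-depth scaling in answer 3 can be sketched as a simplified model. This is not KEDA's implementation: the function name, the per-replica target and the replica bounds are assumptions for illustration, and a real scaler also applies polling intervals and cooldown periods.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Simplified queue-depth scaling rule (assumed, not KEDA's actual code)."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # One worker per `target_per_replica` queued messages, rounded up.
    raw = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, raw))
```

With a floor of zero the worker pool disappears when the queue is empty, and the ceiling caps spend during the surge while the queue absorbs whatever the workers cannot yet drain.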

Exercise

A design thought-experiment. A mid-sized airline approaches you to architect its new flight check-in and boarding platform. The brief: passengers check in via web and mobile, often in a rush at the gate; load is highly peaked around departure waves at major hubs in two countries; a check-in must not be lost (a passenger with a boarding pass must be boardable even if a server failed mid-transaction); the platform must keep working at one hub even if another hub’s region has problems; aviation regulators require data on passengers to stay in-region and demand an audit trail. Budget is real but ROI-driven — downtime during a departure wave strands passengers and incurs penalties.

Produce a one-page proposal in the lesson’s format: (a) extract the requirement axes (RTO, RPO, scale shape, availability, compliance, budget posture, team), (b) name the binding constraint, (c) sketch the design and key Azure services, (d) state the three most important decisions as Well-Architected tradeoffs, and (e) list three things that could go wrong and their mitigations. Then decide: is this closer to the e-commerce case or the bank case, and why?

Model answer (outline).

(a) Requirements: RTO low (minutes, because a stranded departure wave is costly) but arguably not absolute-zero across the whole platform; RPO zero for a completed check-in (the boarding-pass guarantee); scale is spiky around departure waves (closer to the retail surge than steady banking load); availability high with graceful degradation (browse/seat-map can degrade, check-in cannot); compliance forces in-region data residency and an audit trail; budget ROI-driven; team assumed reasonably mature.

(b) Binding constraint: a combination of peaked scale, the never-lose-a-check-in correctness guarantee, and regional independence between hubs, all under residency law.

(c) Design: multi-region active-active across the two in-country regions, fronted by Front Door routing passengers to their hub’s region; AKS or Container Apps scaling on queue depth (KEDA) to absorb departure-wave peaks; the check-in transaction protected by Transactional Outbox + idempotency + Saga so a completed check-in is durable and replayable; Cosmos DB / SQL with in-region replicas for residency; private networking and immutable audit for the regulator; a health model so a degraded hub region sheds traffic gracefully.

(d) Tradeoffs: (1) active-active spends Cost/Consistency to buy hub independence and continuity through a region problem; (2) queue-depth autoscaling spends a little Performance latency to buy Cost Optimization against year-round peak provisioning; (3) Transactional Outbox + Saga spends Operational Excellence/Performance to buy the zero-RPO check-in guarantee.

(e) What could go wrong: split-brain on a check-in during a partition (mitigate with per-passenger/per-flight write ownership + chaos testing); a departure-wave surge outrunning autoscale (mitigate with the queue buffer + pre-scaling on the known flight schedule); residency limiting regions (mitigate with maximum zone separation in-country + documented residual risk).

Verdict: it sits between the two — it has the spiky surge shape of the e-commerce case but the correctness-critical, regional-independence, residency demands closer to the bank. A strong answer recognises it is not full mission-critical zero-RTO everywhere (the seat-map can degrade), so it spends the bank-grade rigour only on the check-in transaction and keeps the rest at retail-grade — right-sizing within a single system, which is the highest form of the skill this lesson teaches.
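The check-in guarantee in the model answer (Transactional Outbox plus an idempotency key) can be sketched with the standard-library `sqlite3` module standing in for the real database. The table names, the `check_in` and `relay` helpers and the event shape are hypothetical, chosen only to show the mechanism; Saga compensation is out of scope here.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    # In-memory database as a stand-in for the real in-region store.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE checkins (
            idempotency_key TEXT PRIMARY KEY,
            passenger_id    TEXT NOT NULL,
            flight          TEXT NOT NULL
        );
        CREATE TABLE outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def check_in(conn: sqlite3.Connection, key: str, passenger_id: str, flight: str) -> bool:
    """Return True if the check-in was newly recorded, False if the key was already seen."""
    event = json.dumps({"key": key, "passenger_id": passenger_id, "flight": flight})
    try:
        with conn:  # one transaction: the check-in and its event commit together or not at all
            conn.execute("INSERT INTO checkins VALUES (?, ?, ?)", (key, passenger_id, flight))
            conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))
        return True
    except sqlite3.IntegrityError:
        return False  # a retry with the same key: nothing is written twice


def relay(conn: sqlite3.Connection, publish) -> int:
    """Publish unpublished outbox rows in order; delivery is at-least-once."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the process dies after `publish` but before the row is marked, the relay sends the event again on its next pass, so downstream consumers need to deduplicate on the same key. That is the price of never losing a completed check-in.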

Next steps

You now have the architect’s core skill in practice: a one-line brief in, a costed, justified, failure-analysed design out. The natural next lesson is the apex these case studies climbed toward, taught in full — Mission-Critical (AlwaysOn) Architecture on Azure: The Apex Design — where deployment stamps, the health model, active/active multi-write data, composite-SLA maths and continuous validation are unpacked end-to-end. Everything there will feel inevitable, because you watched the bank case force each piece into existence.
