Every architecture course teaches you the parts — the services, the pillars, the patterns. Almost none teach you the job, which is sitting across a table from a customer who hands you a one-paragraph business problem and a budget, and walking out with a defensible design. That gap is where careers stall: engineers who can recite the Well-Architected Framework word-for-word but freeze when asked “so what would you actually build for us?” This lesson is the bridge. It is four complete proposal walkthroughs — the kind you would present to a customer’s review board — taken from a one-line brief all the way to a costed, justified, failure-analysed architecture.
The four case studies rise deliberately in difficulty, because the skill of architecture is not knowing the most complex pattern. It is climbing exactly as high as the requirements force you and not one rung higher — a discipline taught at length in The Azure Architecting Ladder: From a Simple Web App to Mission-Critical, which this lesson puts into practice on four genuinely different businesses. We start with a two-founder SaaS that needs to ship cheaply and survive a single datacentre’s bad day. We move to a regulated healthcare data platform where the binding constraint is not uptime but HIPAA, privacy and audit. We then scale out to a global retail e-commerce platform that must absorb a Diwali traffic surge across continents without buckling. And we finish at the apex: a bank’s core ledger that must transact through the loss of an entire Azure region with zero downtime and zero data loss, under data-sovereignty law — the design that lands squarely on Mission-Critical (AlwaysOn) Architecture on Azure.
Throughout, watch how the design is derived. Each case follows the same disciplined sequence — business brief → requirements (RTO/RPO/scale/compliance/budget) → constraints → the design and its Azure services → the architecture decisions as Well-Architected tradeoffs → a rough INR cost → what could go wrong — and reaches for named, reusable cloud design patterns as each problem demands them. By the end you will not just know the services; you will know how a senior architect reasons from a sentence of business intent to a system you can stand behind.
Learning objectives
By the end of this lesson you will be able to:
- Run a proposal from brief to design — take a one-paragraph business problem and produce a costed, justified Azure architecture using a repeatable requirements-first method.
- Translate business language into the requirement axes — convert “we can’t lose patient data” or “we go down, we lose crores” into concrete RTO, RPO, scale, compliance and budget numbers that drive the design.
- Select Azure services as a response to requirements, not from familiarity — and defend each choice against a cheaper or simpler alternative.
- Articulate every significant decision as a Well-Architected tradeoff — name which pillar you are spending and which you are buying, across all five pillars.
- Right-size deliberately across four very different businesses — recognise when serverless-and-cheap is correct and when active/active-and-expensive is the only honest answer.
- Perform a failure analysis — pre-mortem each design for what realistically goes wrong, and show where the architecture absorbs it.
Prerequisites & where this fits
This is an Advanced lesson in the Architecture & Design Mastery module, and it is the applied one — it spends the vocabulary the earlier lessons built. You will get the most from it if you have already met:
- The Azure Well-Architected Framework, In Depth — every decision below is named as a tradeoff between its five pillars (Reliability, Security, Cost Optimization, Operational Excellence, Performance Efficiency). You must already think in pillar tensions.
- Choosing an Architecture: Styles & the Ten Design Principles — each case instantiates a style (web-queue-worker, event-driven, microservices) and leans on principles like use managed services, make all things redundant, partition around limits.
- The 43 Azure Cloud Design Patterns — the tactical moves (Queue-Based Load Levelling, Cache-Aside, CQRS, Deployment Stamps, Geode, Saga, Valet Key, Gateway Routing) that appear inside the designs.
- The Azure Architecting Ladder — the requirements-first habit of mind these case studies practise on real businesses.
You should also be comfortable with the reliability fundamentals — high availability vs disaster recovery and RTO/RPO — because RTO and RPO are the spine of every brief.
Where it fits: the WAF, styles, patterns and ladder lessons taught you the language; this lesson shows you the job. It is exam-critical for AZ-305, whose case-study questions are exactly this shape — given a scenario with constraints, choose and justify a design.
How to read each case
Every case below is described against the same set of axes. Internalise them — they are the vocabulary of an architecture decision, and the order is the order you should ask them in:
| Axis | The question it asks | Why it drives the design |
|---|---|---|
| RTO (Recovery Time Objective) | After a failure, how long until service is restored? | Minutes-to-hours allows restore/failover; near-zero forces active/active. |
| RPO (Recovery Point Objective) | How much data can you afford to lose? | Hours allows nightly backup; zero forces synchronous or multi-write replication. |
| Scale & shape | Peak load, and is it steady or spiky? | Drives the compute model (serverless vs PaaS vs orchestrated) and elasticity. |
| Availability target (SLA) | What uptime must you promise, composite across the chain? | Each “nine” is roughly an order of magnitude harder and dearer. |
| Compliance & residency | Regulatory limits on where data lives and how it is protected? | Can force a region, an encryption model or a DR design regardless of pure availability maths. |
| Budget | Capital and run-rate ceiling? | The hard constraint. Caps how high you can climb regardless of desire. |
| Team topology | One team or many? What operational maturity? | A microservices design needs autonomous teams; a small team should stay on PaaS. |
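The "composite across the chain" wording in the availability row can be made concrete with a little arithmetic. The sketch below assumes a simple serial chain, where a request succeeds only if every component is up, so the availabilities multiply. The three 99.95% figures are hypothetical placeholders, not published SLAs for any particular service.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def composite_availability(slas):
    """Availability of a serial chain: every component must be up at once."""
    return prod(slas)


def downtime_minutes_per_month(availability):
    """Downtime budget implied by an availability figure."""
    return (1 - availability) * MINUTES_PER_MONTH


# Hypothetical chain: edge -> web tier -> database, each promising 99.95%.
chain = [0.9995, 0.9995, 0.9995]
total = composite_availability(chain)
print(f"composite: {total:.4%}")
print(f"budget: {downtime_minutes_per_month(total):.1f} min/month")
```

Three components that each look comfortable on their own combine into a chain that falls below 99.9%, which is why the availability question has to be asked of the whole path rather than of any single service.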
A word on the costs throughout. The INR figures are deliberately rough, order-of-magnitude monthly run-rate estimates at indicative pay-as-you-go rates, to teach the shape of the cost curve — they are not quotes. Real numbers depend on region, tier, reservations, egress and traffic. Always model your own in the Azure Pricing and TCO calculators. The lesson is in the ratios between the four cases, not the absolute rupees — note how the bill rises roughly 50× from the first case to the last, and why.
Case study 1 — A startup SaaS web app
Business brief
“PulseDesk” is a two-founder startup building a help-desk SaaS for small businesses. They have a working prototype on a laptop, ten design-partner customers lined up, and eight months of runway. They need it live, multi-tenant, and able to take credit-card signups next month. They have no operations team and no patience for managing servers. The message to the architect is blunt: “Get us to market cheaply, don’t let a single outage embarrass us in front of our first customers, and don’t build anything we’ll have to throw away when we grow.”
Requirements
| Axis | PulseDesk’s requirement |
|---|---|
| RTO | A few hours is tolerable at launch. The business does not die if it is down for an afternoon, but a multi-day outage would. |
| RPO | Near-zero for customer data (a ticket lost is a customer lost), but a short window (minutes) is survivable. |
| Scale & shape | Tiny now (hundreds of users), unpredictable later. Could 100× in a year if it takes off, or flatline. Spiky, not steady. |
| Availability | “Don’t embarrass us” — call it 99.9% as a stretch goal, not a contractual SLA yet. |
| Compliance | None binding at launch beyond basic data hygiene; they will want SOC 2 eventually, so build toward it. |
| Budget | Brutal. Under ₹40,000/month all-in is the ceiling; ideally far less while traffic is low. |
| Team | Two full-stack developers. No dedicated ops. Every hour spent patching a VM is an hour not building product. |
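One way to practise turning a brief into numbers is to write the requirements down as data and let the RTO and RPO pick the cheapest recovery posture that satisfies them. The following is a minimal sketch; the `Requirements` type, the `recovery_posture` function and every threshold in it are illustrative assumptions, not Azure guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    rto_minutes: float  # tolerable time to restore service
    rpo_minutes: float  # tolerable window of lost data


def recovery_posture(req: Requirements) -> str:
    """Map RTO/RPO to the cheapest posture that satisfies them.

    The thresholds are illustrative assumptions, not Azure guidance.
    """
    if req.rto_minutes == 0 or req.rpo_minutes == 0:
        return "active/active with synchronous or multi-write replication"
    if req.rto_minutes <= 15:
        return "warm standby with automated failover"
    if req.rpo_minutes < 60:
        return "single region, point-in-time restore plus geo-redundant backup"
    return "single region, nightly backup and restore"


# PulseDesk: a few hours of RTO, a few minutes of RPO.
pulsedesk = Requirements(rto_minutes=4 * 60, rpo_minutes=5)
print(recovery_posture(pulsedesk))
```

With PulseDesk's numbers the function stops at the third rung, which is the point of the exercise: the requirements, not ambition, decide how far to climb.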
Constraints
The binding constraints here are money and people, not reliability. With two developers and no ops function, the architecture must be managed — no VMs to patch, no Kubernetes to babysit. The “don’t throw it away” instruction rules out a design that cannot grow: a single SQLite file on one box would be cheap but a dead end. And the elastic, spiky load means anything with a fixed always-on cost floor is wrong; the bill should track usage, falling near zero on quiet nights.
The design and its Azure services
This is the canonical web-queue-worker style, built entirely from managed PaaS so two people can run it. Multi-tenancy is handled at the data layer with a shared database and a tenant ID on every row — the cheapest tenancy model, appropriate at this stage.
| Layer | Azure service | Role |
|---|---|---|
| Front end / API | Azure App Service (Linux, Standard plan or higher) | Hosts the web app and REST API. Autoscale and deployment slots start at the Standard tier; TLS is built in and there are no servers to manage. |
| Background work | Azure Functions (Consumption plan) | Sends emails, generates ticket digests, processes webhooks — triggered off a queue. Scales to zero; you pay per execution. |
| Queue | Azure Storage Queue (or Service Bus later) | Decouples slow work (email, exports) from the request path — the Queue-Based Load Levelling pattern. |
| Database | Azure SQL Database (General Purpose, serverless) | Relational store with auto-pause when idle (cost), point-in-time restore (RPO), and a clean path to grow. |
| Files / assets | Azure Blob Storage | Ticket attachments and static assets. The Valet Key pattern issues short-lived SAS URLs so clients upload directly, bypassing the app tier. |
| Cache | Azure Cache for Redis (Basic, optional) | Sessions and hot lookups — add only when read load justifies the floor cost. |
| Identity | Microsoft Entra External ID | Customer sign-up/sign-in (CIAM) with social and email — far cheaper and safer than rolling your own auth. |
| Secrets | Azure Key Vault | Connection strings and API keys, read via the App Service managed identity — no secrets in code or config. |
| Edge | Azure Front Door (Standard) | Global TLS termination, CDN caching of static assets, and a WAF — a cheap security and performance win. |
| Ops | Application Insights + Log Analytics | Telemetry, traces, and alerts so two people get paged before customers complain. |
Single region, but with availability zones turned on for App Service and Azure SQL where the tier supports it (for App Service that means a Premium v2 or later plan; Basic and Standard plans cannot be zone-redundant). This buys datacentre-fault tolerance for a small premium, which directly answers “don’t embarrass us”.
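The Valet Key idea in the files row is easier to reason about with a toy version. The sketch below is not the Azure SAS format and uses no Azure SDK; it is a standard-library illustration, with a hypothetical signing key and token layout, of the two properties that matter: the token is scoped to one blob path and it expires. A real implementation would ask the Azure Storage SDK to generate the SAS.

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"demo-only-key"  # hypothetical; a real key would live in Key Vault


def issue_upload_token(blob_path: str, ttl_seconds: int, now: float) -> str:
    """Return 'expiry.signature' granting write access to one blob path."""
    expiry = int(now) + ttl_seconds
    message = f"{blob_path}|w|{expiry}".encode()
    signature = hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()
    return f"{expiry}.{signature}"


def is_token_valid(blob_path: str, token: str, now: float) -> bool:
    """Storage-side check: signature matches this path and has not expired."""
    expiry_text, _, signature = token.partition(".")
    if not expiry_text.isdigit() or int(expiry_text) < now:
        return False
    message = f"{blob_path}|w|{expiry_text}".encode()
    expected = hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)


now = time.time()
token = issue_upload_token("tenant-42/ticket-7/screenshot.png", ttl_seconds=300, now=now)
print(is_token_valid("tenant-42/ticket-7/screenshot.png", token, now))        # valid now
print(is_token_valid("tenant-99/other.png", token, now))                      # wrong path
print(is_token_valid("tenant-42/ticket-7/screenshot.png", token, now + 600))  # expired
```

The app tier only signs; the bytes of the attachment never pass through it, which is what keeps the App Service plan small.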
Architecture decisions and Well-Architected tradeoffs
- Serverless / consumption everywhere it fits. Functions on Consumption and Azure SQL serverless mean the bill tracks usage and falls near zero on quiet nights. This spends Performance Efficiency (cold starts; the database can take on the order of a minute to resume from auto-pause, so the first connection after an idle period may need a retry) to buy Cost Optimization, which is exactly the right trade for a startup whose traffic is mostly zero.
- Managed PaaS over IaaS. Choosing App Service over VMs and Azure SQL over self-hosted Postgres spends a little Cost (PaaS carries a margin over raw compute) and buys enormous Operational Excellence — the “use managed services” design principle. Two developers cannot run a patched, backed-up, highly-available database themselves; Azure does it for them.
- Shared-database multi-tenancy. A single database with a tenant column is the cheapest tenancy model and ships fastest. The tradeoff is a Security and noisy-neighbour risk — one tenant’s heavy query can slow others, and a query bug could leak across tenants. Acceptable now; flagged as the first thing to revisit at scale (the path is per-tenant databases or elastic pools).
- Availability zones, not multi-region. Zone redundancy answers a datacentre failure cheaply. Full multi-region DR would spend Cost and Operational Excellence for a Reliability level the business does not yet need. This is the discipline of not climbing the ladder before requirements force you to.
- Front Door from day one. Putting a global edge and WAF in front early spends a small fixed Cost to buy Security and Performance, and crucially avoids a painful re-architecture later when they add a second region — Front Door is already the front door.
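The shared-database tenancy decision is safest when no code path can forget the tenant filter. A minimal sketch of that idea follows, using an in-memory SQLite database as a stand-in for the shared Azure SQL database; the `tickets` table and the `TenantScopedTickets` wrapper are hypothetical names for illustration.

```python
import sqlite3


class TenantScopedTickets:
    """Data-access wrapper that forces every query through a tenant filter."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, subject: str) -> None:
        self._conn.execute(
            "INSERT INTO tickets (tenant_id, subject) VALUES (?, ?)",
            (self._tenant_id, subject),
        )

    def list_subjects(self) -> list[str]:
        rows = self._conn.execute(
            "SELECT subject FROM tickets WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [subject for (subject,) in rows]


conn = sqlite3.connect(":memory:")  # stand-in for the shared Azure SQL database
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, subject TEXT NOT NULL)"
)
acme = TenantScopedTickets(conn, "acme")
globex = TenantScopedTickets(conn, "globex")
acme.add("Printer offline")
globex.add("Refund request")
acme.add("Password reset")
print(acme.list_subjects())
print(globex.list_subjects())
```

An application-level wrapper like this narrows the room for a cross-tenant query bug but does not remove it; database-enforced row-level security is the stronger control once the tenant count grows.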
Rough cost
| Item | Indicative monthly (INR) |
|---|---|
| App Service (entry-level Premium v3, zone-redundant) | ₹9,000 |
| Azure SQL (serverless, light use) | ₹4,000 |
| Functions (Consumption) | ₹500 |
| Storage + Queue + Blob | ₹600 |
| Front Door (Standard) | ₹3,000 |
| Entra External ID (first tier free, light MAU) | ₹0–500 |
| App Insights / Log Analytics | ₹1,500 |
| Approx. total | ₹18,000–22,000/month |
Comfortably inside the ₹40,000 ceiling, and most of it scales up only as revenue does. On a truly quiet month it drifts lower as serverless components idle.
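The budget check is simple enough to keep as a small script next to the proposal. The figures below are copied from the table above; the service labels are shortened and the variable names are illustrative. The itemised sum lands near the low end of the quoted ₹18,000–22,000 range, which presumably leaves some allowance for usage growth.

```python
CEILING_INR = 40_000

# (low, high) monthly estimates in INR, copied from the table above.
line_items = {
    "App Service": (9_000, 9_000),
    "Azure SQL serverless": (4_000, 4_000),
    "Functions": (500, 500),
    "Storage + Queue + Blob": (600, 600),
    "Front Door": (3_000, 3_000),
    "Entra External ID": (0, 500),
    "App Insights / Log Analytics": (1_500, 1_500),
}

low = sum(lo for lo, _ in line_items.values())
high = sum(hi for _, hi in line_items.values())
headroom = CEILING_INR - high
print(f"itemised total: INR {low:,} to {high:,}")
print(f"headroom under ceiling: INR {headroom:,} ({headroom / CEILING_INR:.0%})")
```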
What could go wrong
- The shared database becomes the bottleneck and the risk. As tenants grow, a single noisy tenant degrades everyone, and the cross-tenant data-leak blast radius is the whole customer base. Mitigation path: move to elastic pools or per-tenant databases; enforce row-level security; add tenant-scoped rate limits.
- Function cold starts hurt once latency matters. Consumption-plan cold starts are fine for background email but would be felt if synchronous APIs moved onto them. Mitigation: keep latency-sensitive APIs on App Service; promote hot Functions to a Premium plan with pre-warmed instances when needed.
- No real DR story. A regional Azure outage takes PulseDesk fully down — zones don’t help there. Honest answer for launch: acceptable, with a documented “restore from geo-redundant backup into a second region” runbook (hours of RTO) as the safety net until the business can fund active DR.
- Secrets and tenancy are the audit gaps for future SOC 2. Already mitigated by Key Vault + managed identity and centralised logging — the design is built toward compliance even though none binds yet.
This is the correct architecture for where PulseDesk is: cheap, managed, zone-tolerant, and with a clear, non-throwaway growth path. Now we change the binding constraint entirely.
Case study 2 — A regulated healthcare data platform
Business brief
“MediVault” is a healthcare analytics company building a platform that ingests electronic health records (EHR) and medical imaging from a network of hospitals, stores them securely, and lets clinicians and researchers query de-identified data. The customers are US hospital systems; the data is Protected Health Information (PHI). The brief from their CISO is unambiguous: “This is HIPAA-regulated PHI. Nothing touches the public internet. Everything is encrypted, every access is logged, and we must be able to prove to an auditor exactly who saw what. Reliability matters, but compliance and privacy are non-negotiable — we will pay for them.”
Requirements
| Axis | MediVault’s requirement |
|---|---|
| RTO | Hours is acceptable for the analytics platform (it is not a life-support system), but the ingestion pipeline must not lose data during an outage. |
| RPO | Effectively zero for ingested clinical records — losing a patient’s lab result is unacceptable and a reportable event. |
| Scale & shape | Steady, batch-heavy ingestion (nightly EHR feeds) plus large imaging files; analytical query load from a known set of clinicians and researchers. |
| Availability | 99.9% for the platform; the ingestion path is the part that must be durable above all. |
| Compliance | HIPAA / HITRUST binding. Data residency in the US. Full audit trail. Encryption at rest and in transit, with customer-managed keys preferred. PHI must be de-identified before research access. |
| Budget | Generous relative to PulseDesk — compliance is funded — but not unlimited. A few lakh rupees/month is in scope; the CISO will not trade away controls to save money. |
| Team | A small platform team plus a compliance officer. Operational maturity is moderate; they need automation and clear audit evidence, not heroics. |
Constraints
The binding constraint here is regulatory, not load. HIPAA reshapes the architecture in ways performance never would: no public endpoints (everything behind private networking), encryption with customer-managed keys, immutable audit logs, and a hard separation between identifiable PHI and the de-identified data researchers may touch. The data-residency rule pins every component to US regions. And because PHI breaches are legally reportable and ruinous, the design must be private-by-default — the opposite of the public, edge-cached startup above.
The design and its Azure services
This is a medallion data platform (Bronze → Silver → Gold) wrapped in a private network with defence-in-depth. The architecture pattern that matters most is isolation: every service is reachable only over private endpoints, and identifiable data is walled off from de-identified data.
| Concern | Azure service | Role |
|---|---|---|
| Secure ingestion | Azure Data Factory (with self-hosted/managed VNet integration runtime) + SFTP on Blob | Pulls nightly EHR feeds and imaging over private connectivity; no public ingress. |
| Landing & lake | Azure Data Lake Storage Gen2 (Bronze/Silver/Gold) | Immutable raw landing zone, cleansed/conformed Silver, de-identified/aggregated Gold. Hierarchical namespace, lifecycle tiering for cold imaging. |
| Transform & de-identify | Azure Databricks (in a customer VNet, no public IP) | Cleansing, conforming, and the de-identification step that produces the research-safe Gold layer. |
| Serve / query | Microsoft Fabric / Synapse + Power BI | Clinician dashboards and researcher SQL — reading only the Gold (de-identified) layer for research personas. |
| Networking | Azure Virtual Network, Private Endpoints, Private DNS, Azure Firewall, NSGs | Everything private. No service has a public endpoint. Egress is forced through the firewall and logged. |
| Identity & access | Microsoft Entra ID + Conditional Access + PIM | Least-privilege RBAC; just-in-time elevation for admins; MFA enforced; access to PHI strictly role-scoped. |
| Encryption & keys | Azure Key Vault / Managed HSM with customer-managed keys (CMK) | Encryption at rest under keys MediVault controls and can revoke — a HIPAA-grade requirement. |
| Audit & monitoring | Microsoft Sentinel, Defender for Cloud, Azure Monitor, immutable storage for logs | Every access logged to tamper-evident storage; SIEM detection; Defender regulatory-compliance dashboard tracks HIPAA/HITRUST controls. |
| Governance | Microsoft Purview + Azure Policy (HIPAA/HITRUST initiative) | Data classification, lineage from raw to research, and policy enforcement that denies non-compliant resources (e.g. any public endpoint). |
The whole estate sits inside a landing zone with governance baked in — see Azure Landing Zones with CAF — so that the HIPAA Azure Policy initiative and private-networking guardrails apply by construction, not by hope.
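To make the Silver-to-Gold de-identification step less abstract, here is a minimal sketch of the kind of transform involved. It is an illustration only and is not a compliant Safe Harbor or Expert Determination implementation; the field names, the identifier list and the key are hypothetical, and a real job would run in Databricks over the lake rather than on a single dictionary.

```python
import hashlib
import hmac

PSEUDONYM_KEY = b"demo-only-key"  # hypothetical; keep the real key in Key Vault
DIRECT_IDENTIFIERS = {"name", "phone", "address", "email"}


def pseudonymise(patient_id: str) -> str:
    """Keyed hash so the same patient links across records without exposing the ID."""
    return hmac.new(PSEUDONYM_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]


def to_gold(silver_record: dict) -> dict:
    """Drop direct identifiers, pseudonymise the key, coarsen the birth date to a year."""
    gold = {
        field: value
        for field, value in silver_record.items()
        if field not in DIRECT_IDENTIFIERS and field not in {"patient_id", "date_of_birth"}
    }
    gold["patient_key"] = pseudonymise(silver_record["patient_id"])
    gold["birth_year"] = int(silver_record["date_of_birth"][:4])
    return gold


silver = {
    "patient_id": "MRN-001234",
    "name": "Jane Example",
    "phone": "555-0100",
    "date_of_birth": "1984-03-17",
    "lab_code": "HbA1c",
    "result": 6.1,
}
print(to_gold(silver))
```

Using a keyed hash rather than a plain one means someone who knows a medical record number cannot recompute the research key without also holding the secret.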
Architecture decisions and Well-Architected tradeoffs
- Private-by-default networking. Private Endpoints + forced-tunnel egress + Azure Firewall mean no PHI service is internet-reachable. This heavily spends Cost and Operational Excellence (private DNS, firewall, runtime integration are real complexity and money) to buy Security — and here Security is the whole point. The “design to protect confidentiality” principle is in command.
- Customer-managed keys over platform-managed keys. CMK in Key Vault/Managed HSM spends Operational Excellence (key lifecycle, rotation, the risk of locking yourself out) to buy Security and compliance leverage — MediVault can cryptographically revoke a customer’s data on contract termination, which platform-managed keys cannot offer.
- Hard PHI/de-identified separation. Producing a de-identified Gold layer and granting researchers access only there spends Performance and Cost (an extra transform stage, duplicated data) to buy Security and regulatory safety — it makes the worst-case research breach a breach of non-identifiable data.
- Immutable, tamper-evident audit logs. Writing access logs to immutable (WORM) storage and into Sentinel spends Cost to buy the one thing HIPAA demands above all: the ability to prove who saw what to an auditor. Without provable audit, every other control is unverifiable.
- Policy-as-guardrail, not as guidance. Using Azure Policy in deny mode (no public IPs, CMK required, approved regions only) spends a little Operational Excellence (developers hit guardrails) to buy Security and Governance that cannot be accidentally bypassed — compliance by construction.
- Reliability sized honestly. The platform is single-region with zone redundancy and geo-redundant backup, not active/active — because the RTO genuinely allows hours. The team resisted gold-plating reliability so the compliance budget went where it mattered. The ingestion durability (the real near-zero-RPO requirement) is met with idempotent, replayable pipelines and immutable Bronze, not with multi-region spend.
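"Tamper-evident" in the audit-log decision has a simple mechanical meaning that a short sketch can show: each entry's hash covers the previous entry's hash, so an edit anywhere breaks every later link. This is a standard-library illustration with hypothetical event fields, not how immutable Blob storage or Sentinel work internally.

```python
import hashlib
import json

GENESIS = "0" * 64


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute the chain; an edited, reordered or mid-chain-deleted entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"who": "dr.rao", "action": "read", "record": "patient_key:ab12"})
append_entry(audit_log, {"who": "researcher7", "action": "query", "record": "gold.labs"})
print(verify(audit_log))                     # chain intact
audit_log[0]["event"]["who"] = "someone.else"
print(verify(audit_log))                     # tampering detected
```

A chain like this cannot detect removal of the most recent entries by itself; that gap is one reason the design also writes the log to WORM storage, where deletion is blocked outright.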
Rough cost
| Item | Indicative monthly (INR) |
|---|---|
| Data Lake Gen2 (large, tiered imaging) | ₹40,000 |
| Databricks (VNet-injected, scheduled jobs) | ₹70,000 |
| Data Factory + integration runtimes | ₹15,000 |
| Synapse / Fabric + Power BI | ₹40,000 |
| Private networking (Firewall, PE, DNS) | ₹45,000 |
| Key Vault / Managed HSM (CMK) | ₹15,000 |
| Sentinel + Defender + immutable log storage | ₹35,000 |
| Purview governance | ₹12,000 |
| Approx. total | ₹2.7–3.2 lakh/month |
Roughly 15× PulseDesk — and almost every additional rupee buys compliance and privacy, not features or raw scale. That is the honest cost of regulated data, and the CISO signed up for it.
What could go wrong
- De-identification is imperfect. Re-identification from “anonymised” data is a known risk; a sloppy transform leaks PHI into the research layer. Mitigation: expert-determination or safe-harbour de-identification standards, automated checks, and treating the Gold layer as still-sensitive with access controls.
- CMK self-lockout. Lose or mis-rotate the customer-managed key and the data is unreadable — by you too. Mitigation: strict key backup, rotation runbooks, soft-delete and purge-protection on Key Vault, and tested recovery.
- Audit gaps fail the audit, not just security. If any path to PHI is not logged to immutable storage, the auditor finding is as damaging as a breach. Mitigation: diagnostic settings enforced by policy on every resource; periodic audit-completeness reviews.
- Private-networking complexity causes outages. Private DNS or firewall misconfiguration is the most common cause of “everything is down” in private estates. Mitigation: IaC for all networking, change control, and connectivity tests in the deployment pipeline.
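Several of the mitigations above come down to checking every resource against the same short list of guardrails. The sketch below shows that check over a hypothetical inventory; it is not the Azure Policy definition language, and the field names and region allow-list are assumptions for illustration. In the real estate, Azure Policy deny and audit effects would do this work.

```python
ALLOWED_REGIONS = {"eastus2", "centralus"}  # hypothetical US-only allow-list


def violations(resource: dict) -> list[str]:
    """Return the guardrails a resource description breaks (empty list = compliant)."""
    found = []
    if resource.get("public_network_access", True):
        found.append("public endpoint enabled")
    if not resource.get("customer_managed_key", False):
        found.append("no customer-managed key")
    if not resource.get("diagnostics_to_immutable_store", False):
        found.append("access not logged to immutable storage")
    if resource.get("region") not in ALLOWED_REGIONS:
        found.append("region outside allow-list")
    return found


inventory = [
    {"name": "bronze-lake", "region": "eastus2", "public_network_access": False,
     "customer_managed_key": True, "diagnostics_to_immutable_store": True},
    {"name": "scratch-store", "region": "westeurope", "public_network_access": True,
     "customer_managed_key": False},
]
for resource in inventory:
    print(resource["name"], violations(resource) or "compliant")
```

The defaults fail closed: a resource that does not declare a control is treated as lacking it, which mirrors the private-by-default stance of the design.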
MediVault shows a vital lesson: the most expensive pillar is not always Reliability. Here it is Security, and a good architect spends where the requirements — not the textbook — point. Next, the binding constraint shifts again, to scale.
Case study 3 — A global retail e-commerce platform
Business brief
“BharatBazaar” is a fast-growing retailer selling across India, South-East Asia, and the Middle East. They run flash sales and a Diwali peak where traffic spikes 50× in minutes. Their current monolith falls over under load, oversells stock, and is slow for customers far from their single datacentre. The brief from the VP of Engineering: “We need a platform that’s fast for customers in all three regions, never oversells inventory, and survives the Diwali surge without us pre-provisioning a fortune of idle capacity the other 360 days. Teams must be able to ship independently: checkout, catalogue and search can’t be blocked by each other.”
Requirements
| Axis | BharatBazaar’s requirement |
|---|---|
| RTO | Low — minutes. An outage during a flash sale is lost revenue measured in crores per hour. |
| RPO | Near-zero for orders and payments; eventual consistency is acceptable for catalogue and recommendations. |
| Scale & shape | Extreme spikes (50× in minutes during sales) on a moderate baseline. Global read traffic; write traffic concentrated around checkout. |
| Availability | 99.95%+ across regions; graceful degradation (browse must survive even if recommendations don’t). |
| Compliance | PCI-DSS for card data (mostly delegated to a payment provider); data-residency awareness across countries. |
| Budget | Significant but ROI-driven. They will spend to capture peak revenue, but idle capacity 360 days a year is unacceptable — elasticity is a hard requirement. |
| Team | Multiple autonomous product teams (catalogue, search, cart, checkout, fulfilment) with good DevOps maturity. |
Constraints
The binding constraint is elastic global scale with correctness under contention. Two things break naive designs here: the surge (a design that needs pre-provisioned peak capacity is too expensive) and overselling (concurrent buyers racing for the last unit of stock). The multi-team requirement rules out a monolith — teams must deploy independently, which points to microservices and an event-driven spine. And global customers demand low latency, which forces multi-region presence and edge delivery.
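The overselling race is worth seeing in miniature. The sketch below models optimistic concurrency with a version number on each stock record, in the spirit of an ETag-conditional write. `InventoryStore` is a hypothetical in-memory stand-in; a real store performs the version check and the write atomically on the server, which this single-threaded toy does not need to.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the one the caller read."""


class InventoryStore:
    """In-memory stand-in for a store with conditional (version-checked) writes."""

    def __init__(self):
        self._items = {}  # sku -> (quantity, version)

    def put_initial(self, sku: str, quantity: int) -> None:
        self._items[sku] = (quantity, 1)

    def read(self, sku: str):
        return self._items[sku]

    def write_if_version(self, sku: str, quantity: int, expected_version: int) -> None:
        _, current_version = self._items[sku]
        if current_version != expected_version:
            raise VersionConflict(sku)
        self._items[sku] = (quantity, current_version + 1)


def reserve_one(store: InventoryStore, sku: str, max_attempts: int = 3) -> bool:
    """Decrement stock by one; re-read and retry when another writer got in first."""
    for _ in range(max_attempts):
        quantity, version = store.read(sku)
        if quantity == 0:
            return False  # sold out: never go negative
        try:
            store.write_if_version(sku, quantity - 1, version)
            return True
        except VersionConflict:
            continue
    return False


store = InventoryStore()
store.put_initial("diya-set", 1)
# Two buyers read the same snapshot of the last unit, then both try to write.
qty_a, ver_a = store.read("diya-set")
qty_b, ver_b = store.read("diya-set")
store.write_if_version("diya-set", qty_a - 1, ver_a)       # buyer A wins
try:
    store.write_if_version("diya-set", qty_b - 1, ver_b)   # buyer B is rejected
except VersionConflict:
    print("buyer B must re-read; remaining stock:", store.read("diya-set")[0])
print(reserve_one(store, "diya-set"))  # nothing left to reserve
```

The losing buyer gets a conflict rather than a silent overwrite, so the worst outcome is a retry or a "sold out" message, never a second sale of the last unit.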
The design and its Azure services
This is a microservices + event-driven architecture, multi-region active-active for the stateless tiers, with the Competing Consumers and Queue-Based Load Levelling patterns absorbing the surge, and CQRS separating the read-heavy catalogue from the write-critical order path.
| Concern | Azure service | Role |
|---|---|---|
| Global edge | Azure Front Door (Premium) + CDN + WAF | Routes users to the nearest healthy region, caches static catalogue/imagery at the edge, absorbs bot and DDoS load. |
| Compute | Azure Kubernetes Service (AKS), multiple regions, with KEDA | Per-team microservices; KEDA scales pods on queue depth, so checkout workers spin up with the surge and back down after. |
| Async spine | Azure Service Bus (Premium) / Event Hubs | Orders, inventory events, and notifications flow asynchronously; the Publisher-Subscriber and Competing Consumers patterns decouple teams and level load. |
| Order data | Azure Cosmos DB (multi-region, session/strong consistency where needed) | Globally distributed, elastically scalable store for the order and inventory domain; multi-region writes for availability. |
| Catalogue read model | Cosmos DB / Azure SQL + Azure Cache for Redis | A CQRS read model and Cache-Aside keep the hot browse path millisecond-fast and cheap to scale out. |
| Search | Azure AI Search | Faceted product search and relevance, scaled independently of catalogue writes. |
| Inventory correctness | Cosmos DB optimistic concurrency / Service Bus sessions | Serialises decrements per SKU so the last unit is never oversold — correctness under contention. |
| Payments | External PCI provider + tokenisation | Card data never lands in BharatBazaar’s estate (Quarantine/Gatekeeper thinking) — PCI scope minimised. |
| Order workflow | Azure Durable Functions / Logic Apps | The Saga pattern orchestrates reserve-stock → charge → fulfil with Compensating Transactions on failure. |
| Resilience | App-level Retry + Circuit Breaker + Bulkhead | Browse survives even when recommendations or reviews are degraded — graceful degradation by design. |
| Ops | Azure Monitor, App Insights, Managed Grafana | Per-service SLOs, surge dashboards, and autoscale observability across regions. |
Front Door routes to the nearest healthy region; AKS clusters in each region scale on demand; Cosmos DB spans regions so reads are local and a region loss does not stop writes.
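How a queue and queue-depth autoscaling absorb a 50× spike together can be shown with a small simulation. The replica rule below (backlog divided by a per-replica target, rounded up and clamped) is a simplified reading of how queue-length scalers are commonly described, not KEDA's actual code; all rates and limits are hypothetical, and the model idealises by letting new replicas start instantly.

```python
from math import ceil


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count proportional to backlog, clamped to the allowed range."""
    wanted = ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def simulate(arrivals, per_replica_rate: int, target_per_replica: int,
             min_replicas: int = 2, max_replicas: int = 50):
    """Per tick: enqueue arrivals, size the worker pool, drain what it can handle."""
    backlog, history = 0, []
    for arriving in arrivals:
        backlog += arriving
        replicas = desired_replicas(backlog, target_per_replica, min_replicas, max_replicas)
        backlog -= min(backlog, replicas * per_replica_rate)
        history.append((replicas, backlog))
    return history


# Baseline of 100 orders per tick, a 50x spike for two ticks, then baseline again.
arrivals = [100, 100, 5000, 5000, 100, 100, 100, 100]
for tick, (replicas, backlog) in enumerate(simulate(arrivals, per_replica_rate=60,
                                                    target_per_replica=100)):
    print(f"tick {tick}: replicas={replicas:2d} backlog after drain={backlog}")
```

During the spike the worker pool hits its ceiling and the backlog grows instead of requests failing; once arrivals return to baseline the backlog drains and the replica count falls with it. In practice pod start-up adds a lag that the queue also has to cover.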
Architecture decisions and Well-Architected tradeoffs
- Microservices over the monolith. Vertical decomposition spends Operational Excellence and Performance (network hops, distributed-systems complexity, the fallacies of distributed computing now apply) to buy team autonomy and independent scaling — the catalogue team ships without waiting on checkout. Justified only because the team topology is many autonomous, DevOps-mature teams. (For a two-person startup it would be over-engineering — contrast Case 1.)
- KEDA queue-depth autoscaling over fixed capacity. Scaling workers on queue depth means the platform provisions for the surge in minutes and scales to near-baseline afterward. This spends a little Performance (a brief lag while pods start, smoothed by the queue buffer) to buy enormous Cost Optimization — the explicit “no idle peak capacity” requirement. The queue is the shock absorber: Queue-Based Load Levelling turns a 50× spike into a manageable backlog drained by Competing Consumers.
- CQRS + Cache-Aside for the read/write split. Browse traffic dwarfs order traffic, so separating the read model and caching it spends Operational Excellence and Cost (a second model to keep eventually-consistent, cache to manage) to buy Performance and Cost-at-scale — the cheap, fast browse path is what survives the surge.
- Cosmos DB multi-region writes over a single SQL primary. Going multi-region active-active for the order domain spends Cost and Consistency (eventual consistency, conflict handling, higher RU spend) to buy Reliability and global low latency: a region loss does not stop checkout. Where strict correctness is needed (inventory), per-SKU serialisation and optimistic concurrency are applied selectively rather than globally, paying the consistency cost only where it earns its keep. Note that Cosmos DB does not allow strong consistency on an account with multiple write regions, so strongly consistent inventory data needs its own single-write-region account.
- Saga for the order workflow. Distributed transactions across reserve-stock, payment and fulfilment cannot use a 2-phase commit, so a Saga with Compensating Transactions spends Operational Excellence (you must design and test every compensation) to buy Reliability — a failed payment cleanly releases reserved stock instead of leaving the system inconsistent.
- Delegating PCI to a provider. Tokenising cards through an external PCI-DSS provider spends a small Cost and a little Performance (an extra hop) to buy a massive reduction in Security and compliance scope — card data never enters the estate, so most of PCI’s burden simply does not apply.
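The Saga decision is mostly about discipline: every step that can succeed needs a matching undo. A minimal orchestrator sketch follows. The step functions are hypothetical stubs, and a production version on Durable Functions or Logic Apps would also persist progress so it can resume after a crash, which this in-process toy does not attempt.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; on failure undo completed ones in reverse."""
    completed, log = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception as error:
            log.append(f"failed:{name}:{error}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


stock = {"diya-set": 5}


def reserve_stock():
    stock["diya-set"] -= 1


def release_stock():
    stock["diya-set"] += 1


def charge_card():
    raise RuntimeError("card declined")  # simulated payment failure


def refund_card():
    pass  # nothing was charged in this run


def create_shipment():
    pass


def cancel_shipment():
    pass


ok, log = run_saga([
    ("reserve-stock", reserve_stock, release_stock),
    ("charge", charge_card, refund_card),
    ("fulfil", create_shipment, cancel_shipment),
])
print(ok, log, stock)
```

The declined payment triggers only the compensation for the step that had already completed, so the reserved unit goes back on the shelf and the fulfilment step never runs.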
Rough cost
| Item | Indicative monthly (INR) |
|---|---|
| AKS (multi-region, baseline + surge headroom) | ₹2.0 lakh |
| Cosmos DB (multi-region, provisioned + autoscale RU) | ₹2.5 lakh |
| Front Door Premium + CDN + WAF | ₹1.2 lakh |
| Service Bus / Event Hubs (Premium) | ₹60,000 |
| Azure AI Search | ₹50,000 |
| Redis (Premium, clustered) | ₹70,000 |
| Monitoring / Grafana | ₹40,000 |
| Approx. baseline total | ₹8–9 lakh/month |
Crucially, the bill is elastic: it spikes during sales when revenue justifies it and falls back toward baseline afterward — the opposite of a pre-provisioned monolith that pays for peak capacity year-round. The architecture’s whole economic argument is that cost tracks revenue.
What could go wrong
- Overselling under extreme contention. If inventory decrements race, two buyers get the last unit. Mitigation: per-SKU serialisation via Service Bus sessions or optimistic concurrency with retry; treat inventory as the one place strong consistency is non-negotiable.
- Retry storms amplify an outage. When a downstream service slows, naive retries from every microservice can turn a brown-out into a black-out (the Retry Storm anti-pattern). Mitigation: exponential backoff with jitter, Circuit Breakers, and Bulkheads to isolate failures so a slow recommendations service cannot exhaust the threads serving checkout.
- Cache stampede at the start of a flash sale. When the sale opens, a cold or invalidated cache sends every request to the database at once. Mitigation: pre-warming, request coalescing, and TTL jitter (randomised expiry so hot keys do not all lapse at the same moment).
- Eventual-consistency surprises. A customer sees a product in search that the catalogue write model has just removed. Mitigation: design the UX for eventual consistency, and reserve strong consistency for money and stock only.
- Distributed-systems operational load. Many services across many regions is a lot to run. Mitigation: strong SLOs per service, golden-path tooling, and centralised observability — and the honest acknowledgement that this complexity is only worth it at this scale.
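The first mitigation in this list, optimistic concurrency with retry, can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not BharatBazaar's implementation: `InventoryStore`, `write_if_version` and `reserve_one` are hypothetical names, and the version counter stands in for whatever conditional-write token the real store offers (Cosmos DB, for example, exposes an ETag that a write can be made conditional on).

```python
from dataclasses import dataclass


@dataclass
class StockRecord:
    quantity: int
    version: int = 0  # bumped on every successful write


class InventoryStore:
    """Hypothetical in-memory stand-in for a store with conditional writes."""

    def __init__(self) -> None:
        self._records: dict[str, StockRecord] = {}

    def put(self, sku: str, quantity: int) -> None:
        self._records[sku] = StockRecord(quantity)

    def read(self, sku: str) -> StockRecord:
        rec = self._records[sku]
        return StockRecord(rec.quantity, rec.version)  # snapshot copy

    def write_if_version(self, sku: str, quantity: int, expected_version: int) -> bool:
        """Apply the write only if nobody has written since our read."""
        rec = self._records[sku]
        if rec.version != expected_version:
            return False
        rec.quantity = quantity
        rec.version += 1
        return True


def reserve_one(store: InventoryStore, sku: str, max_attempts: int = 5) -> bool:
    """Take one unit; False means sold out or contention outlasted the budget."""
    for _ in range(max_attempts):
        snapshot = store.read(sku)
        if snapshot.quantity <= 0:
            return False  # sold out: refuse rather than oversell
        if store.write_if_version(sku, snapshot.quantity - 1, snapshot.version):
            return True
        # The version moved under us: another buyer won, so re-read and retry.
    return False
```

Two buyers who both read the last unit hold the same version. Only the first conditional write succeeds; the loser re-reads, sees zero stock and is refused, so the unit is sold exactly once.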
BharatBazaar is the classic AZ-305 “design for scale and resilience” case. But notice its reliability ceiling: it survives a region loss with degradation, but it is not zero-downtime, zero-data-loss through that loss. For the final case, the business cannot tolerate even that.
Case study 4 — A zero-downtime bank core
Business brief
“SovereignBank” is building a new core banking ledger — the system that records every account balance and transaction. The brief from the Chief Risk Officer is the most demanding an architect ever hears: “This system cannot go down and cannot lose a transaction — ever. A regional disaster, a bad deployment, a poisoned cache: customers must keep transacting through all of them with no human in the loop for the first line of defence. We are regulated for data sovereignty — every byte of customer data stays within national borders. Reliability is the requirement; we will pay what correctness and continuity cost, but not a rupee on theatre that doesn’t buy them.”
Requirements
| Axis | SovereignBank’s requirement |
|---|---|
| RTO | Effectively zero. The business transacts through a regional failure; there is no acceptable “down for failover” window for the ledger. |
| RPO | Zero for committed transactions. A committed debit/credit must never be lost — this is the absolute, non-negotiable requirement. |
| Scale & shape | High, steady transactional throughput with predictable daily/monthly peaks (payday, month-end), not flash-sale spikes. |
| Availability | The highest the business can fund and prove — designed for continuity through the loss of a whole region. |
| Compliance | Data sovereignty binding — all customer data within national borders; full audit, regulatory reporting, and provable controls. Banking regulation (e.g. RBI-style) and strong cryptographic key control. |
| Budget | Large and explicitly justified by the cost of downtime (crores per hour, regulatory penalties, reputational ruin). But spent with discipline — no reliability theatre. |
| Team | A mature platform and SRE organisation, comfortable with chaos engineering, health modelling, and zero-downtime deployment. |
Constraints
This is the apex case, and the binding constraints stack: zero RTO and zero RPO under regional failure, plus data sovereignty that pins everything inside national borders. Zero-RTO-through-region-loss forces active/active multi-region — an active-passive design has a failover window, which is disqualifying. Zero-RPO forces careful data design — you cannot simply async-replicate the ledger and accept lag. Sovereignty means both active regions must be in-country (Azure has multiple Indian regions, e.g. Central and South India), and key material stays under national control. And “no human in the loop for the first line of defence” forces a health model and self-healing automation, not a runbook a tired engineer follows at 3 a.m.
The design and its Azure services
This lands squarely on Mission-Critical (AlwaysOn) Architecture on Azure — the apex design where the Well-Architected pillars and the design patterns converge. The signature concepts are the deployment stamp / scale unit, active/active multi-region, the health model, and zero-downtime deployment of whole stamps.
| Concern | Azure service / concept | Role |
|---|---|---|
| Topology | Active/active across two in-country regions, fronted by Azure Front Door | Both regions serve live traffic; loss of one is absorbed with no failover window — zero RTO. |
| Deployment unit | Deployment Stamp / scale unit | A self-contained, independently-deployable unit (compute + data + config). Capacity is added by cloning stamps; blue/green of an entire stamp gives zero-downtime releases. |
| Compute | AKS (or scale-unit-aligned App Service/Container Apps) per stamp | Stateless application tier within each stamp, zone-redundant within region and replicated across regions. |
| Ledger data | Cosmos DB multi-region (multi-write) and/or Azure SQL with synchronous in-region replicas + cross-region replication | The hardest decision: a globally-distributed store with active-active writes and conflict resolution, or a strongly-consistent SQL design with synchronous zone replicas in-region and tight cross-region replication. The ledger’s correctness model decides which. |
| Exactly-once integrity | Transactional Outbox, Idempotency keys, Saga | Every transaction is idempotent and replayable. The Transactional Outbox commits a ledger entry and its event atomically so nothing is lost, and idempotency keys make redelivery harmless so nothing is applied twice. |
| Health model | Custom Health Endpoint Monitoring → healthy/degraded/unhealthy | The system classifies its own health from telemetry (latency, error rate, dependency health) — not raw uptime — and Front Door routes away from a degraded stamp automatically. |
| Self-healing & isolation | Bulkhead, Circuit Breaker, Retry, Throttling + automation | Fault isolation per stamp (blast-radius reduction); automated remediation is the first responder. |
| Networking & sovereignty | Private endpoints, in-country regions only, Azure Firewall | All data in-country; private-by-default; egress controlled and logged. |
| Keys & encryption | Managed HSM with customer-managed keys, in-country | Cryptographic control under national jurisdiction. |
| Continuous validation | Azure Chaos Studio + load and failover testing in the pipeline | The resilience is proven continuously by injecting faults (kill a stamp, fail a region) — see chaos engineering. |
| Observability & audit | Azure Monitor, App Insights, Sentinel, immutable audit, regulatory reporting | Deep telemetry feeds the health model; tamper-evident audit satisfies the regulator. |
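The Retry entry in the self-healing row only helps if retries are spaced out and bounded; otherwise they become the retry storm described in the retail case. A minimal sketch of exponential backoff with full jitter follows. The names `backoff_delay` and `call_with_retry`, and the default base, cap and attempt budget, are illustrative assumptions, and `ConnectionError` stands in for whatever the real client treats as a transient fault.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def call_with_retry(operation: Callable[[], T], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None) -> T:
    """Retry a call on transient failure, giving up after max_attempts tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure, never loop forever
            sleep(backoff_delay(attempt, rng=rng))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters as much as the backoff: without it, every caller that failed at the same instant retries at the same instant. The bounded attempt budget is what lets a Circuit Breaker upstream see the failure and open.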
The composite-SLA maths is explicit here: chaining components multiplies their availabilities, so adding regions and removing single points of failure is how you claw back the nines that a long dependency chain erodes — the discipline taught in Mission-Critical (AlwaysOn) Architecture and multi-region active-active disaster recovery.
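That arithmetic can be made concrete with a small sketch. The availability figures are made-up illustrations, not published Azure SLAs, the function names are hypothetical, and the redundancy formula assumes the regions fail independently, which a correlated failure would violate.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Components that must all work: their availabilities multiply."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Independent copies where any one suffices: 1 - P(all copies fail)."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_minutes_per_month(availability: float, minutes: int = 30 * 24 * 60) -> float:
    """Expected unavailable minutes in a 30-day month."""
    return (1.0 - availability) * minutes


# Illustrative chain inside one region: compute, data store, messaging.
one_region = serial(0.9995, 0.9999, 0.9999)   # lower than any single link
# Two such regions, either of which can serve on its own.
two_regions = redundant(one_region, 2)
# A shared global entry point sits in series with the regional pair.
end_to_end = serial(0.9999, two_regions)
```

With these numbers the single-region chain lands a little below 99.93%, the redundant pair recovers to better than six nines, and the end-to-end figure is then capped by the one shared component in front. That last step is why removing single points of failure matters as much as adding regions.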
Architecture decisions and Well-Architected tradeoffs
- Active/active over active/passive. This is the decision that defines the case. Active/active spends Cost (roughly double the footprint), Operational Excellence (running two live regions and reconciling data), and Consistency to buy the one thing the brief demands and active/passive cannot give: zero RTO through a region loss. There is no failover window because there is no failover — both regions were already serving.
- Deployment stamps over a shared global deployment. Partitioning the system into independently-deployable scale units spends Operational Excellence and Cost (more units to manage, some duplicated overhead) to buy blast-radius reduction (a fault is contained to one stamp) and zero-downtime deployment (blue/green an entire stamp, drain it via the health model, never the whole system at once).
- Health model over raw uptime. Investing in a custom health model that classifies application health spends real engineering effort to buy Reliability and Operational Excellence — the system can route away from a degraded (not yet dead) stamp before customers feel it, which raw “is the VM up?” monitoring never catches. This is the “observe application health” mission-critical principle.
- Exactly-once data integrity over best-effort. The Transactional Outbox + idempotency + Saga combination spends significant Performance and Operational Excellence (every write path is more complex and a little slower) to buy the absolute zero-RPO, no-duplicate guarantee a ledger demands. For a bank, correctness is not a tradeoff to optimise — it is the floor.
- Sovereignty pinned in-country with CMK in Managed HSM. Restricting to in-country regions and national key control spends Cost and Reliability flexibility (you cannot reach for a far-flung third region; the in-country region pair is your universe) to buy legal compliance — non-negotiable, so the design works within that box rather than fighting it.
- Continuous validation over “we tested DR once.” Running Chaos Studio experiments in the pipeline spends Operational Excellence (building and trusting fault-injection) to buy proven Reliability — an untested failover is a hope, not a control, and a regulator (and the CRO) will not accept hope. You prove the region can be lost by routinely losing it on purpose.
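The data-integrity decision above can be illustrated with a small sketch of a Transactional Outbox combined with an idempotency key. It uses an in-memory SQLite database purely as a stand-in for the ledger store; the table layout and the names `open_ledger`, `post_entry` and `relay_once` are assumptions made for illustration, not SovereignBank's schema.

```python
import json
import sqlite3
from typing import Callable


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ledger (
            idempotency_key TEXT PRIMARY KEY,   -- one row per client request
            account         TEXT NOT NULL,
            amount_paise    INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def post_entry(conn: sqlite3.Connection, key: str, account: str, amount_paise: int) -> bool:
    """Write the ledger row and its event in one local transaction.

    Returns False when the key was already processed (a safe replay).
    """
    event = {"key": key, "account": account, "amount_paise": amount_paise}
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO ledger VALUES (?, ?, ?)", (key, account, amount_paise))
            conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the first attempt already committed


def relay_once(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish pending events, marking each one only after publish returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A retried request with the same key changes nothing, and an event can never exist without its ledger row or the reverse. If the relay crashes between publishing and marking, the event is sent again, so delivery is at-least-once and consumers must deduplicate on the same key.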
This case is the inverse of PulseDesk’s economics: there, you spent the minimum and accepted real reliability gaps; here, Reliability is paramount and you spend deliberately — but still with discipline, refusing spend that does not measurably buy continuity or correctness.
Rough cost
| Item | Indicative monthly (INR) |
|---|---|
| Active/active compute (AKS, multiple stamps × 2 regions) | ₹8–10 lakh |
| Multi-region ledger data (Cosmos multi-write / SQL replicas) | ₹6–8 lakh |
| Front Door Premium + global routing/WAF | ₹1.5 lakh |
| Private networking × 2 regions (Firewall, PE, DNS) | ₹2 lakh |
| Managed HSM + CMK (in-country) | ₹1.5 lakh |
| Sentinel + immutable audit + regulatory reporting | ₹2 lakh |
| Chaos Studio + load/failover test infrastructure | ₹50,000 |
| Approx. total | ₹22–28 lakh/month |
Roughly 50× PulseDesk — but for a system where an hour of downtime costs crores and a lost transaction is a regulatory incident, the run-rate is dwarfed by the risk it retires. The architect’s job is to ensure every rupee buys continuity or correctness, not reassurance.
What could go wrong
- Split-brain on active/active writes. A network partition leaves both regions accepting conflicting writes to the same account. Mitigation: a clear consistency and conflict-resolution model (per-account write ownership, or a consensus store for the balance), and testing it under partition with chaos experiments. This is the single hardest problem in the design and deserves the most scrutiny.
- A bad deployment poisons both regions. Active/active means a flawed release can roll to everywhere. Mitigation: the deployment-stamp blue/green model and progressive rollout — deploy to one stamp, let the health model judge it, and never promote globally until a canary stamp is proven healthy.
- The health model is wrong. If health is mis-classified, Front Door routes traffic to a sick stamp or away from a healthy one. Mitigation: treat the health model as a first-class, tested, continuously-validated artefact — and validate it with chaos, not just unit tests.
- Sovereignty limits your blast-radius options. Pinned to in-country regions, you have fewer regions to spread across, so a correlated national-scale failure is a residual risk. Mitigation: maximise zone and region separation within the country, design for graceful degradation, and document the residual risk to the regulator honestly.
- Operational complexity outruns the team. This is the most complex thing most organisations will ever run; under-investing in SRE maturity is the real failure mode. Mitigation: the brief already specifies a mature SRE org — and the architecture’s automation, health model and continuous validation exist precisely so humans are not the first line of defence.
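Because a mis-classifying health model is itself a listed risk, it helps to see how small and testable the classification logic can be. The sketch below is one possible shape, with loud assumptions: the `StampTelemetry` fields, the thresholds and the function names are all hypothetical, and real thresholds would come from the service's SLOs. The status-code mapping relies on a router that treats a 200 response to its probe as healthy and any other status as a failure.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class StampTelemetry:
    p95_latency_ms: float
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    dependencies_ok: bool    # e.g. ledger store and key service reachable


def classify(t: StampTelemetry) -> Health:
    """Classify a stamp from telemetry; thresholds are illustrative only."""
    if not t.dependencies_ok or t.error_rate >= 0.05:
        return Health.UNHEALTHY
    if t.p95_latency_ms > 500 or t.error_rate >= 0.01:
        return Health.DEGRADED
    return Health.HEALTHY


def probe_status_code(health: Health) -> int:
    """Status for the stamp's health endpoint: anything but HEALTHY drains it."""
    return 200 if health is Health.HEALTHY else 503
```

Keeping `classify` a pure function of telemetry is a deliberate choice: it can be unit-tested against recorded incidents, and chaos experiments can then check that the classification it produces matches what customers actually experienced.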
This is the apex: the design every other case has been climbing toward. And the meta-lesson across all four is the one that separates an architect from a service-operator — the right design is the one the requirements force, no higher and no lower.
The diagram lays the four case studies side by side on a rising-complexity axis, so you can see at a glance how the binding constraint shifts — cost, then compliance, then scale, then reliability — and how the architecture grows in response from a single-region serverless web app to a sovereign, active/active mission-critical core.
Real-world application
In a real engagement, these walkthroughs are the job — they are what you whiteboard in a discovery workshop and then formalise in a proposal or an Azure Architecture Review. The repeatable method is the deliverable: ask the seven axis questions, find the binding constraint, choose the cheapest design that meets it with margin, name every decision as a Well-Architected tradeoff, cost it, and pre-mortem it. A few patterns from these cases recur in almost every engagement:
- The binding constraint is rarely the one the client names first. PulseDesk said “make it reliable” but the binding constraint was money and people. MediVault thought they needed reliability; the binding constraint was compliance. Surfacing the real constraint is half the value an architect adds.
- Right-sizing both ways. Half of real architecture is talking a client out of complexity they don’t need (a startup does not need active/active) and the other half is making them fund the complexity they do (a bank ledger cannot run single-region). The same Well-Architected tradeoff language justifies both directions.
- The same patterns, recomposed. Queue-Based Load Levelling appears in a startup (decouple email) and in a global retailer (absorb a 50× surge). Deployment Stamps appear only when reliability requirements force them. Mastering the pattern catalogue lets you reach for the right move in any case.
- Cost as a first-class design output. Every proposal ends in a number, and the shape of the cost curve (does it track usage? does it track revenue? is it a fixed floor?) is itself an architecture decision a client will scrutinise.
These are precisely the scenarios AZ-305 tests, and precisely the conversations that fill a senior architect’s week.
Common mistakes & anti-patterns
- Designing from the service catalogue, not the requirements. Reaching for AKS, Cosmos DB multi-region and active/active because they are impressive — when the client is a two-person startup — is the cardinal sin. Requirements first, always.
- Treating Reliability as the only pillar that matters. MediVault’s binding constraint was Security; over-spending on multi-region reliability there would have starved the compliance budget. Spend where the requirements point.
- Skipping the failure analysis. A design with no pre-mortem is a hope. Every proposal must answer “what realistically goes wrong, and where does the architecture absorb it?” — split-brain, retry storms, cache stampedes, CMK lockout, de-identification leaks.
- Active/passive when the brief says zero RTO. A failover window is disqualifying for a system that “cannot go down”. Conversely, active/active for a system that tolerates hours of RTO is expensive theatre. Match the topology to the RTO.
- Ignoring the cost shape. Two designs with the same monthly bill can be wildly different businesses — one that scales to zero on quiet nights versus one that pays for peak capacity year-round. Always reason about elasticity, not just the headline number.
- Forgetting that compliance can override availability maths. Data sovereignty can pin you to fewer regions than pure reliability would choose. The law wins; design within the box.
- Under-investing in operations for complex designs. A microservices or mission-critical estate that a small, low-maturity team cannot run is a liability, however elegant on paper. Team topology is a design input.
Interview & exam questions
- A startup with two developers and a brutal budget needs a multi-tenant SaaS live next month. What architecture do you propose, and why not Kubernetes? (Looking for: managed PaaS — App Service, Functions, Azure SQL serverless; consumption/serverless for cost; shared-DB multi-tenancy; KEDA/AKS is over-engineering for two people — “use managed services”.)
- A HIPAA platform’s binding constraint is compliance, not uptime. How does that reshape the architecture versus a public web app? (Private endpoints, no public ingress, CMK in Managed HSM, immutable audit logs, PHI/de-identified separation, Azure Policy in deny mode, reliability sized honestly to the real RTO.)
- How do you absorb a 50× flash-sale surge without paying for peak capacity all year? (Queue-Based Load Levelling + Competing Consumers as a shock absorber; KEDA scaling workers on queue depth; the queue buffers the spike; cost tracks revenue.)
- How do you guarantee a retailer never oversells the last unit of stock under extreme concurrency? (Per-SKU serialisation via Service Bus sessions or optimistic concurrency with retry; reserve strong/session consistency for inventory and money only; eventual consistency elsewhere.)
- Active/active versus active/passive for a system that must transact through a region loss — which, and why? (Active/active: active/passive has a failover window that violates zero RTO. Cost is the tradeoff; both regions already serve, so there is no failover.)
- What does it mean to drive failover from a “health model” rather than raw uptime, and why is it superior? (Classify application health — healthy/degraded/unhealthy — from telemetry; route away from a degraded stamp before customers feel it; raw “is the VM up?” misses brown-outs.)
- How do you guarantee zero RPO for a bank ledger — no lost and no duplicate transactions? (Transactional Outbox for atomic commit-and-publish, idempotency keys, Saga with compensating transactions; synchronous in-region replication; careful conflict resolution on multi-write.)
- A client insists on active/active multi-region for a line-of-business app that tolerates four hours of RTO. How do you respond? (Push back: it is over-engineering. Use the Well-Architected tradeoff language: they would spend Cost and Operational Excellence for Reliability they don’t need. Propose zone-redundant single region with geo-backup DR.)
- How does a data-sovereignty requirement change a multi-region design? (Both regions must be in-country; CMK under national jurisdiction; fewer regions to spread across, so maximise zone/region separation within the country and document residual correlated-failure risk.)
- Name three things that go wrong in an active/active design and how the architecture mitigates each. (Split-brain → conflict-resolution model + chaos testing; bad deployment to both regions → stamp blue/green + canary; wrong health model → treat it as a tested, continuously-validated artefact.)
- Why delegate card handling to an external PCI provider instead of building it? (Tokenisation keeps card data out of your estate — Quarantine/Gatekeeper thinking — collapsing PCI scope; you spend a small hop and fee to retire most of the compliance burden.)
- Across these four cases the monthly bill rises ~50×. What single principle explains the spread? (Climb exactly as high as the requirements force you. Cost is the price of the binding constraint — money/people, then compliance, then scale, then zero-downtime reliability — never aesthetics.)
Quick check
- In the startup case, why is Azure Functions on the Consumption plan the right choice for background work?
- What is the binding constraint in the healthcare case, and name two architecture decisions it forces.
- Which pattern lets the e-commerce platform absorb a 50× surge without pre-provisioning peak capacity?
- Why must the bank core be active/active rather than active/passive?
- State the one principle that explains why the four designs differ so much in cost and complexity.
Answers
- Background work (email, exports, webhooks) is intermittent, so Consumption scales to zero and bills per execution — maximum Cost Optimization for a startup whose traffic is mostly zero — and it is fully managed, fitting a two-person team. The cold-start Performance cost is acceptable for async work off a queue.
- The binding constraint is regulatory compliance (HIPAA), not uptime. It forces, among others: private-by-default networking (no public endpoints), customer-managed keys, immutable audit logs, hard PHI/de-identified separation, and Azure Policy in deny mode — and it justifies sizing reliability honestly to the real (hours) RTO rather than gold-plating it.
- Queue-Based Load Levelling (with Competing Consumers): the queue buffers the spike and KEDA scales workers on queue depth, so the platform provisions for the surge in minutes and scales back afterward — cost tracks revenue.
- Because the requirement is zero RTO through a region loss. Active/passive has a failover window during which the ledger is unavailable, which is disqualifying; active/active means both regions are already serving, so losing one is absorbed with no failover.
- Climb exactly as high as the requirements force you, and not one rung higher — the right design is the one driven by the binding constraint (money/people, compliance, scale, or zero-downtime reliability), so cost and complexity rise only as the requirements genuinely demand.
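The queue-depth scaling in the third answer reduces to simple arithmetic. The sketch below is a simplified model of what a queue-length scaler asks for, not KEDA's actual code: `desired_workers`, the per-worker target and the replica limits are hypothetical, and a real autoscaler adds polling intervals, cooldowns and stabilisation windows on top.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_workers: int = 0, max_workers: int = 100) -> int:
    """Workers needed so each handles about target_per_worker queued messages."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_depth / target_per_worker)
    # Clamp: the floor allows scale-to-zero, the ceiling protects downstream stores.
    return max(min_workers, min(max_workers, wanted))
```

With a target of 50 messages per worker, an empty queue asks for zero workers, 120 queued orders ask for three, and a surge of 6,000 is clamped at the ceiling of 100 while the queue holds the remainder. The clamp is where the queue earns its keep as a buffer: the backlog waits instead of overwhelming the database.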
Exercise
A design thought-experiment. A mid-sized airline approaches you to architect its new flight check-in and boarding platform. The brief: passengers check in via web and mobile, often in a rush at the gate; load is highly peaked around departure waves at major hubs in two countries; a check-in must not be lost (a passenger with a boarding pass must be boardable even if a server failed mid-transaction); the platform must keep working at one hub even if another hub’s region has problems; aviation regulators require data on passengers to stay in-region and demand an audit trail. Budget is real but ROI-driven — downtime during a departure wave strands passengers and incurs penalties.
Produce a one-page proposal in the lesson’s format: (a) extract the requirement axes (RTO, RPO, scale shape, availability, compliance, budget posture, team), (b) name the binding constraint, (c) sketch the design and key Azure services, (d) state the three most important decisions as Well-Architected tradeoffs, and (e) list three things that could go wrong and their mitigations. Then decide: is this closer to the e-commerce case or the bank case, and why?
Model answer (outline).
- (a) Requirements: RTO low (minutes, because a stranded departure wave is costly) but arguably not absolute-zero across the whole platform; RPO zero for a completed check-in (the boarding-pass guarantee); scale is spiky around departure waves (closer to the retail surge than steady banking load); availability high with graceful degradation (browse/seat-map can degrade, check-in cannot); compliance forces in-region data residency + audit; budget ROI-driven; team assumed reasonably mature.
- (b) Binding constraint: a combination of peaked scale, the never-lose-a-check-in correctness guarantee, and regional independence between hubs, under residency law.
- (c) Design: multi-region active-active across the two in-country regions fronted by Front Door routing passengers to their hub’s region; AKS or Container Apps scaling on queue depth (KEDA) to absorb departure-wave peaks; the check-in transaction protected by Transactional Outbox + idempotency + Saga so a completed check-in is durable and replayable; Cosmos DB / SQL with in-region replicas for residency; private networking and immutable audit for the regulator; a health model so a degraded hub region sheds traffic gracefully.
- (d) Tradeoffs: (1) active-active spends Cost/Consistency to buy hub independence and continuity through a region problem; (2) queue-depth autoscaling spends a little Performance latency to buy Cost Optimization against year-round peak provisioning; (3) Transactional Outbox + Saga spends Operational Excellence/Performance to buy the zero-RPO check-in guarantee.
- (e) What could go wrong: split-brain on a check-in during a partition (mitigate with per-passenger/per-flight write ownership + chaos testing); a departure-wave surge outrunning autoscale (mitigate with the queue buffer + pre-scaling on the known flight schedule); residency limiting regions (mitigate with max zone separation in-country + documented residual risk).
Verdict: it sits between the two — it has the spiky surge shape of the e-commerce case but the correctness-critical, regional-independence, residency demands closer to the bank. A strong answer recognises it is not full mission-critical zero-RTO everywhere (the seat-map can degrade), so it spends the bank-grade rigour only on the check-in transaction and keeps the rest at retail-grade — right-sizing within a single system, which is the highest form of the skill this lesson teaches.
Certification mapping
- AZ-305 — Designing Microsoft Azure Infrastructure Solutions (primary). This lesson is the applied core of the exam. It maps to Design identity, governance, and monitoring solutions (Entra, Conditional Access, PIM, Purview, audit), Design data storage solutions (Azure SQL vs Cosmos DB, medallion lake, CQRS, consistency choices), Design business continuity solutions (RTO/RPO, backup, multi-region active-passive vs active-active, composite SLA), and Design infrastructure solutions (compute selection App Service/Functions/AKS, messaging, networking, private endpoints). The exam’s case-study items are exactly this brief→design→justify shape.
- AZ-104 — Azure Administrator (supporting). The building blocks appear here: App Service plans and scaling, storage tiers and SAS, VNets/NSGs/private endpoints, Key Vault, Azure Monitor and alerts.
- AZ-204 — Developing Solutions for Azure (supporting). The application-level patterns — Queue-Based Load Levelling, Cache-Aside, Competing Consumers, Saga/Durable Functions, Transactional Outbox, idempotency, Valet Key SAS — are AZ-204 territory and recur across these designs.
- SC-100 / AZ-500 (adjacent). The healthcare and bank cases lean on Zero-Trust networking, CMK/Managed HSM, Sentinel and Defender for Cloud — the security-architecture exams’ core material.
Glossary
- Binding constraint — the single requirement that most limits the design space (money, compliance, scale, or reliability). The architect’s first job is to identify it; it dictates the whole shape of the solution.
- RTO (Recovery Time Objective) — the maximum acceptable time to restore service after a failure. Near-zero RTO forces active/active.
- RPO (Recovery Point Objective) — the maximum acceptable amount of data loss, measured in time. Zero RPO forces synchronous replication and exactly-once data design.
- Composite SLA — the combined availability of a system whose components are chained; availabilities multiply, so long dependency chains erode the achievable number, and redundancy claws it back.
- Deployment stamp / scale unit — a self-contained, independently-deployable unit of an application (compute + data + config). Capacity grows by cloning stamps; releases are blue/green at stamp granularity for zero downtime.
- Active/active vs active/passive — both regions serving live traffic (no failover window, zero RTO, higher cost) versus a standby region promoted on failure (a failover window, cheaper).
- Health model — a classification of application health (healthy/degraded/unhealthy) derived from telemetry, used to route traffic away from a sick component before users feel it — superior to raw infrastructure uptime.
- Medallion architecture — a data-lake layering of Bronze (raw), Silver (cleansed/conformed) and Gold (curated/de-identified) used to refine and govern data progressively.
- CMK (customer-managed key) — encryption keys the customer controls (in Key Vault or Managed HSM) and can revoke, rather than platform-managed keys — a requirement in regulated estates.
- CQRS (Command Query Responsibility Segregation) — separating the write model from one or more read models, so read-heavy paths (browse) scale independently of write-critical paths (orders).
- Transactional Outbox — a pattern that writes a business change and its outgoing event in one local transaction, then relays the event reliably, so no event is lost. The relay is at-least-once, so consumers use idempotency keys to discard redeliveries (vital for a ledger).
- Saga — a sequence of local transactions across services with compensating transactions to undo on failure, used where a distributed two-phase commit is impossible (e.g. reserve-stock → charge → fulfil).
- De-identification — removing or transforming personal identifiers so data can be used for research without exposing PHI; imperfect de-identification risks re-identification.
- Data sovereignty / residency — legal requirements that data (and sometimes the keys protecting it) remain within national borders, which can constrain region and key-management choices.
Next steps
You now have the architect’s core skill in practice: a one-line brief in, a costed, justified, failure-analysed design out. The natural next lesson is the apex these case studies climbed toward, taught in full — Mission-Critical (AlwaysOn) Architecture on Azure: The Apex Design — where deployment stamps, the health model, active/active multi-write data, composite-SLA maths and continuous validation are unpacked end-to-end. Everything there will feel inevitable, because you watched the bank case force each piece into existence.
To deepen the surrounding material:
- Internalise the requirements-first habit with The Azure Architecting Ladder: From a Simple Web App to Mission-Critical — the same business problem at six levels of resilience, which is the engine behind every case above.
- Master the tactical moves with The 43 Azure Cloud Design Patterns — Queue-Based Load Levelling, Cache-Aside, CQRS, Saga, Deployment Stamps, Valet Key, Gatekeeper and the rest that appeared inside these designs.
- Anchor every decision in The Azure Well-Architected Framework, In Depth — the five-pillar tradeoff language used to justify every choice in this lesson.
- Choose the macro shape with Choosing an Architecture: Styles & the Ten Design Principles — web-queue-worker, microservices and event-driven, the styles these cases instantiate.
- Ground the reliability and continuity maths in high availability vs disaster recovery and RTO/RPO and multi-region active-active disaster recovery.
- Connect designs to the organisation that runs them via Azure Landing Zones with CAF — every workload here lands inside an application landing zone — and prove the resilience you design with Azure Chaos Studio fault-injection.