A design pattern is a named, reusable answer to a problem that recurs across systems – and the value is not the code, it is the name. When you can say “we’ll put a Queue-Based Load Leveling buffer in front of the database and add Competing Consumers behind it,” you have compressed a paragraph of design reasoning, its failure modes, and its tradeoffs into a single sentence that a room of architects will all decode the same way. That shared vocabulary is what this lesson builds. The Azure Architecture Center publishes a catalogue of cloud design patterns – commonly cited as forty-three – and most engineers have met a handful (Retry, Cache-Aside, Sidecar) by accident without ever seeing the whole map. This is the whole map.
I am not going to teach these as trivia. Every pattern here is an answer to a force that distributed systems exert on you whether you acknowledge it or not, and the cleanest way to see those forces is through the fallacies of distributed computing – the false assumptions that, when you build on them, produce exactly the outages these patterns prevent. So we start with the why, then catalogue all forty-three grouped by intent (Reliability/resilience, Messaging, Data management, Design & implementation, Security), each as problem → solution → when to use → Well-Architected pillar(s) → a concrete Azure example. Then we do the part the official docs skip: how patterns compose into the real designs you ship, and the anti-patterns – the named failure shapes – that tell you a composition has gone wrong. This is the reference lesson for the module; keep it open while you design.
Learning objectives
By the end of this lesson you will be able to:
- Explain the eight fallacies of distributed computing and trace each cloud design pattern back to the fallacy it defends against, so you can justify a pattern from first principles rather than cargo-culting it.
- Recall and apply all 43 Azure cloud design patterns, grouped by intent, naming for each its problem, solution, when-to-use trigger, the Well-Architected pillar(s) it serves, and a concrete Azure service that implements it.
- Compose patterns correctly – Retry with Circuit Breaker, Queue-Based Load Leveling with Competing Consumers, Gateway Routing with Aggregation and Offloading, Saga on Compensating Transaction – and explain why the combination is stronger than either part alone.
- Recognise the key cloud anti-patterns (Busy Database, Chatty I/O, Noisy Neighbour, Retry Storm, Improper Instantiation, Monolithic Persistence, No Caching, Synchronous I/O) in a design or a metric, and name the pattern that fixes each.
- Map any pattern to its Well-Architected pillar(s) using the summary table, so design reviews and AZ-305 answers connect a concrete technique to the pillar it advances.
- Select a coherent pattern set for a scenario under stated constraints, defending each choice and – just as important – naming the patterns you deliberately left out.
Prerequisites & where this fits
This is lesson A4 in the Architecture & Design Mastery module. It assumes you have done the upstream lessons: the Well-Architected Framework deep dive (you must know the five pillars – Reliability, Security, Cost Optimisation, Operational Excellence, Performance Efficiency – as a tradeoff system, because every pattern here is tagged to them) and Choosing an Architecture: Styles & the Ten Design Principles. The distinction between the two upstream lessons matters here: an architecture style (N-tier, microservices, event-driven) is the shape of the whole system; a design pattern is a local, composable technique you apply inside that shape. Styles are the floor plan; patterns are how you build each room. You pick one style and many patterns.
You should be comfortable with core Azure compute, messaging, and data services (App Service, Functions, AKS, Service Bus, Event Hubs, Event Grid, Cosmos DB, Azure SQL, Storage, Front Door, Application Gateway, API Management). Where a pattern leans on a service you have already studied in depth – caching, CQRS, the Strangler Fig migration, saga orchestration – I link the dedicated article so this catalogue can stay a catalogue rather than balloon into a book.
Where it fits in the arc: A3 taught you to choose the style from requirements. This lesson gives you the toolbox you reach into once the style is chosen. A5, Mission-Critical (AlwaysOn) Architecture, is where pillars, styles, and these patterns converge into the apex design – deployment stamps, active/active, health modelling – so think of A4 as the vocabulary you need before A5 can speak to you in full sentences.
The fallacies of distributed computing: the why behind every pattern
In 1994 Peter Deutsch (with later additions credited to James Gosling and others at Sun Microsystems) wrote down the assumptions that engineers reliably make when they first build distributed systems – assumptions that are all false, and that all eventually cause an outage. The cloud did not repeal these fallacies; it industrialised them. Every one of the 43 patterns is, at root, a structured way of not believing one or more of these eight lies. Internalise the fallacies and the catalogue stops being a list to memorise and becomes a set of inevitable consequences.
| # | The fallacy (the false assumption) | The reality in Azure | Patterns that defend against it |
|---|---|---|---|
| 1 | The network is reliable | Packets drop, connections reset, a dependency restarts mid-call, a region has a partition. Transient faults are the normal case at scale, not the exception. | Retry, Circuit Breaker, Health Endpoint Monitoring, Compensating Transaction, Scheduler Agent Supervisor |
| 2 | Latency is zero | Every hop costs milliseconds; a chatty call pattern multiplies them. Cross-region adds tens of ms; cross-cloud more. | Cache-Aside, Materialized View, Index Table, Gateway Aggregation, Backends for Frontends, CQRS, Geode |
| 3 | Bandwidth is infinite | Large payloads saturate links and message buses; queues have size limits. Moving everything everywhere is not free. | Claim Check, Static Content Hosting, Valet Key, Pipes and Filters |
| 4 | The network is secure | The wire is hostile by default; identity must be proven on every call and the trust boundary made explicit. | Federated Identity, Gatekeeper, Valet Key, Quarantine, Ambassador |
| 5 | Topology doesn’t change | Instances scale out and in, move zones, get replaced on deploy; IPs and leaders are not stable. | Leader Election, Sidecar, Ambassador, External Configuration Store, Deployment Stamps |
| 6 | There is one administrator | Ownership is federated across teams, subscriptions, even clouds; you integrate with systems you don’t control and can’t change. | Anti-Corruption Layer, Strangler Fig, Messaging Bridge, Gateway Routing |
| 7 | Transport cost is zero | Serialisation, egress, and per-message charges are real line items; “just call the API again” has a price. | Cache-Aside, Claim Check, Compute Resource Consolidation, Gateway Aggregation |
| 8 | The network is homogeneous | Services speak different protocols, schemas, and SLAs; nothing is uniform across a real estate of systems. | Anti-Corruption Layer, Messaging Bridge, Pipes and Filters, Ambassador, Sidecar |
A ninth fallacy is often appended – “the system is monolithic” / coordination is free – and it is the spiritual root of the messaging and data-management groups: the moment you have more than one process, every shared decision costs a round trip and risks a conflict. Patterns like Competing Consumers, Choreography, CQRS, Event Sourcing, and Sharding exist to minimise coordination, which is one of the Ten Design Principles you met in A3.
The architect’s habit this builds: when someone proposes a design, silently ask “which fallacy is this assuming?” A synchronous chain of six service calls assumes latency is zero and the network is reliable. A 200 MB blob on a message bus assumes bandwidth is infinite. A hard-coded leader assumes topology doesn’t change. The fallacy names the risk; the catalogue names the fix.
A note on the count and on canon accuracy. The Azure Architecture Center catalogue is a living document: patterns have been added over the years (Rate Limiting and Sequential Convoy are relatively recent; older course material lists fewer). This lesson catalogues the full set as grouped below; treat the names and intents as authoritative for AZ-305 and design reviews, and expect the headline number to drift by one or two as Microsoft curates the catalogue. What does not drift is the reasoning: every pattern still answers a fallacy.
Now the catalogue.
Group 1 — Reliability & resilience patterns
These patterns answer fallacy #1 (the network is reliable) and the coordination fallacy. Their shared job is to keep a system serving its business requirements when individual components misbehave – by retrying transient faults, isolating failures, smoothing load, electing coordinators, and recovering from partial failure. They map predominantly to the Reliability pillar, with strong Performance Efficiency and Operational Excellence crossovers.
Retry
- Problem. A call fails because of a transient fault – a momentary network blip, a throttled dependency, a brief failover. Treating it as a hard error fails an operation that would have succeeded a moment later.
- Solution. Re-attempt the failed operation, distinguishing transient faults (retry) from permanent ones (fail fast). Use exponential backoff with jitter so a recovering dependency is not re-saturated, and cap the attempts and total deadline.
- When to use. Idempotent or safely-retryable operations against services with documented transient failure modes (almost all cloud services). Do not retry non-idempotent writes without a dedup/idempotency key, and do not retry permanent errors (most 4xx responses; 408 and 429 are the usual retryable exceptions).
- WAF pillar(s). Reliability (primary); Performance Efficiency.
- Azure example. The Azure SDKs (Cosmos DB, Storage, Service Bus) have built-in retry policies; for HTTP, Polly / Microsoft.Extensions.Http.Resilience adds backoff to HttpClient. Service Bus and Event Grid retry delivery automatically with configurable schedules. Deep dive: Resiliency patterns that actually work.
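The backoff-with-jitter rule in the Solution bullet can be sketched in a few lines. This is a minimal illustration, not the policy any Azure SDK ships: `TransientError`, `retry`, and `make_flaky` are hypothetical names, and the "full jitter" variant (sleep a random fraction of the exponential ceiling) is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (timeouts, 429, 503)."""


def retry(operation, max_attempts=4, base_delay=0.2, max_delay=5.0,
          sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError with capped backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the fault to the caller
            # Exponential ceiling: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)


def make_flaky(failures):
    """Demo helper: an operation that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated blip")
        return "ok"

    return operation, state
```

Any exception other than `TransientError` propagates on the first attempt, which is the fail-fast path for permanent errors. A total deadline, which the Solution bullet also calls for, is left out of this sketch for brevity.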
Circuit Breaker
- Problem. A dependency is not blipping but down (or pathologically slow). Continuing to call it – especially retrying – wastes resources, ties up threads, and amplifies the failure into a cascade.
- Solution. Wrap the call in a state machine: Closed (calls flow, failures counted), Open (calls fail fast for a cooldown, no load on the sick dependency), Half-Open (a trial trickle probes recovery, then closes or re-opens). It is the deliberate complement to Retry – Retry handles brief faults, the breaker handles sustained ones.
- When to use. Any remote dependency whose sustained failure could exhaust your resources or cascade. Pair it with a fallback (cached value, degraded response, queued write).
- WAF pillar(s). Reliability (primary); Performance Efficiency.
- Azure example. Polly’s circuit-breaker strategy in .NET; on AKS, a service mesh (Istio, Linkerd) or Dapr resiliency policies provide breakers without app code. Application Gateway and Front Door eject unhealthy backends from rotation – a breaker at the load-balancer tier.
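The three-state machine described above is small enough to sketch directly. The class below is a single-threaded illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and a consecutive-failure threshold; production libraries such as Polly use richer failure-rate windows and are thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast during cooldown")
            self.state = "half-open"  # cooldown elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        # Any success closes the breaker and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call re-opens immediately; otherwise open once the
        # consecutive-failure threshold is reached.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers would typically catch `CircuitOpenError` and serve the fallback mentioned in the When-to-use bullet (cached value, degraded response, queued write).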
Bulkhead
- Problem. One slow or failing dependency consumes a shared resource pool (threads, connections, memory) and starves every other call – so a problem in one feature sinks the whole service.
- Solution. Partition resources into isolated pools, like the watertight compartments of a ship’s hull. Each dependency or tenant gets its own bounded pool; when one floods, the bulkhead keeps the flooding contained.
- When to use. When a service calls multiple downstreams of differing reliability, or serves multiple tenants/criticalities that must not interfere. The structural cousin at the system scale is cell-based / blast-radius isolation.
- WAF pillar(s). Reliability (primary); Performance Efficiency.
- Azure example. Separate HttpClient instances with isolated connection pools per dependency; dedicated AKS node pools or resource quotas per workload class; separate App Service plans for critical vs. background work; Cosmos DB throughput isolation per container.
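At the code level, a bulkhead is often just a bounded concurrency limit per dependency. The sketch below assumes a reject-when-full policy and uses hypothetical names (`Bulkhead`, `BulkheadFullError`); queueing the excess for a bounded time is an equally valid policy.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Non-blocking acquire: when the compartment is full, reject at once
        # instead of letting callers pile up on a struggling dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One compartment per downstream: exhausting one leaves the other untouched.
payments = Bulkhead("payments", max_concurrent=1)
search = Bulkhead("search", max_concurrent=1)
```

The point of the design is the separate semaphores: a stalled payments dependency can only ever hold payments slots, so search traffic keeps flowing.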
Compensating Transaction
- Problem. A business operation spans multiple services and cannot be wrapped in a single ACID transaction. A step succeeds, a later step fails, and you are left with partial, inconsistent state.
- Solution. For each step, define a compensating action that semantically undoes it (refund the charge, release the reserved seat, cancel the shipment). On failure, run the compensations in reverse to return the system to a consistent state. Compensation is not a rollback – the original effects happened and may be visible; you are issuing a counteracting business action.
- When to use. Long-running, multi-service workflows without distributed transactions – which is to say, almost all cloud workflows. It is the building block of the Saga pattern (see composition below).
- WAF pillar(s). Reliability (primary); Operational Excellence.
- Azure example. Durable Functions (the orchestrator can call compensating activities in a try/catch); Logic Apps with explicit compensation branches; saga orchestration on Service Bus. Background: saga orchestration vs. choreography.
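The "run the compensations in reverse" rule can be shown with a tiny in-process runner. This is a sketch under simplifying assumptions: `run_saga` is a hypothetical helper, steps are plain callables, and state is not persisted, whereas a real orchestrator (for example Durable Functions) records progress durably so that compensation survives a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On the first failure, run the compensations of the completed steps in
    reverse order. Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo in reverse: the most recent effect is counteracted first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


def fail_shipping():
    raise RuntimeError("carrier unavailable")


# Illustrative workflow: the third step fails, so the first two are undone.
ok, log = run_saga([
    ("charge", lambda: None, lambda: None),   # compensation: refund
    ("reserve", lambda: None, lambda: None),  # compensation: release seat
    ("ship", fail_shipping, lambda: None),    # compensation: cancel shipment
])
```

In practice each compensation must itself be idempotent and retryable, because a compensation can also fail partway through.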
Leader Election
- Problem. A set of identical instances must coordinate, and exactly one must perform a singleton task (a scheduler, a partition owner, a cleanup job) – but instances come and go, so you cannot hard-code which one.
- Solution. Have the instances elect a leader via a shared, atomic mechanism (a distributed lock or lease). The leader holds the lease and renews it; if it dies, the lease expires and the others re-elect. Defends directly against fallacy #5 (topology doesn’t change).
- When to use. Singleton background work in a scaled-out tier; ownership assignment; avoiding duplicate processing where idempotency alone is insufficient.
- WAF pillar(s). Reliability (primary); Operational Excellence.
- Azure example. Azure Blob Storage lease as a distributed lock (the canonical lightweight approach); the Durable Functions / WebJobs singleton feature; Kubernetes lease objects on AKS; ZooKeeper/etcd for self-managed clusters.
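The lease mechanics (acquire, renew, expire, re-elect) can be sketched with an in-memory stand-in. `LeaseStore` is hypothetical and lives in one process; in a real deployment the atomic compare-and-set comes from the shared service (a blob lease, a Kubernetes Lease object), not from a local lock.

```python
import threading
import time


class LeaseStore:
    """In-memory stand-in for an atomic, expiring lease (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, candidate, duration):
        """Acquire or renew the lease; returns True if candidate is now leader."""
        with self._lock:
            now = self._clock()
            free = self._holder is None or now >= self._expires_at
            if free or self._holder == candidate:
                self._holder = candidate
                self._expires_at = now + duration
                return True
            return False

    def leader(self):
        """Current leader, or None if the lease has lapsed."""
        with self._lock:
            if self._holder is not None and self._clock() < self._expires_at:
                return self._holder
            return None
```

Each instance would call `try_acquire` on a timer shorter than the lease duration: the current leader renews, and the others take over only after a missed renewal lets the lease expire.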
Health Endpoint Monitoring
- Problem. Infrastructure liveness (the process is up) is not the same as application health (it can actually serve, its dependencies are reachable). Without a real health signal, load balancers send traffic to instances that will fail.
- Solution. Expose dedicated health endpoints that perform functional checks – can I reach the database, the cache, the downstream API? – and have external probes call them on a schedule, routing traffic only to healthy instances.
- When to use. Always, for any service behind a load balancer or orchestrator. It is the foundation of the health model you will build in mission-critical design (healthy/degraded/unhealthy, not raw uptime).
- WAF pillar(s). Reliability (primary); Operational Excellence.
- Azure example. ASP.NET Core Health Checks (/healthz, /readyz) consumed by Application Gateway / Front Door health probes, AKS liveness/readiness/startup probes, and Application Insights availability tests (URL ping / standard tests) for outside-in monitoring.
Queue-Based Load Leveling
- Problem. A spiky producer (a flash sale, an IoT burst) overwhelms a downstream that can only handle a steady rate, causing failures and timeouts at the peaks.
- Solution. Insert a queue between producer and consumer. The producer enqueues at its own pace; the consumer dequeues at its sustainable rate. The queue absorbs the spike, decoupling the two and turning a peak that breaks the system into a backlog that drains over time.
- When to use. Bursty or unpredictable load against a rate-limited resource (a database, a legacy API, a paid third party). The classic partner of Competing Consumers (see composition).
- WAF pillar(s). Reliability (primary); Performance Efficiency; Cost Optimisation.
- Azure example. Azure Service Bus queues or Storage queues between an ingestion API and worker tier; Event Hubs for high-throughput telemetry buffering. The Web-Queue-Worker style is this pattern made structural. Related: backpressure & flow control.
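A toy tick-based simulation makes the "peak becomes a backlog" claim concrete. The function name `simulate` and the numbers are illustrative assumptions; the model ignores queue size limits and message expiry.

```python
from collections import deque


def simulate(arrivals_per_tick, drain_rate):
    """Return (backlog after each tick, total processed) for a levelled queue."""
    queue = deque()
    backlog = []
    processed = 0
    for arrivals in arrivals_per_tick:
        # The producer enqueues at its own pace, however bursty.
        queue.extend(range(arrivals))
        # The consumer takes at most drain_rate items per tick.
        for _ in range(min(drain_rate, len(queue))):
            queue.popleft()
            processed += 1
        backlog.append(len(queue))
    return backlog, processed


# A burst of 10 messages against a consumer that sustains 3 per tick.
backlog, processed = simulate([10, 0, 0, 0], drain_rate=3)
# backlog == [7, 4, 1, 0], processed == 10
```

Nothing is dropped: the downstream never sees more than 3 per tick, and the cost of the spike is latency (four ticks to drain) rather than failures.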
Rate Limiting
- Problem. You must call a downstream that enforces quotas (a SaaS API, a partner, a throttled Azure service), and exceeding the limit gets you throttled or banned – harming you.
- Solution. Self-impose limits on your outbound rate so you stay within the downstream’s quota – token-bucket or fixed-window counters that shape your egress, often coordinated across instances. (Contrast with Throttling, which protects your own service from others.)
- When to use. Integrating with quota-bound dependencies; smoothing batch jobs that would otherwise burst; cost control on metered APIs.
- WAF pillar(s). Reliability (primary); Cost Optimisation; Performance Efficiency.
- Azure example. The .NET System.Threading.RateLimiting limiters (token bucket, fixed and sliding window) applied to outbound calls; API Management rate-limit and quota policies on outbound calls; Service Bus / Event Hubs client-side throttling; coordinated limiting via a shared Redis token bucket.
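A token bucket is the usual way to shape outbound calls. The sketch below is a single-process illustration with a hypothetical `TokenBucket` class; coordinating the bucket across instances (for example in Redis, as mentioned above) needs an atomic shared counter that this example does not attempt.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full: allows an initial burst
        self.updated = clock()

    def try_acquire(self, cost=1):
        """Return True and spend tokens if the call may proceed right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The caller decides what a `False` means: delay the call, park it on a queue, or skip optional work. `capacity` bounds the burst; `rate_per_sec` bounds the sustained rate and would be set just under the downstream's documented quota.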
Throttling
- Problem. A surge of incoming demand – legitimate or abusive – threatens to exhaust your capacity and degrade service for everyone, including your best customers.
- Solution. Cap the resources any single consumer (or tier, or the whole system) may use, rejecting or queuing excess with 429 Too Many Requests and Retry-After. Throttling is graceful degradation under overload: shed the marginal load to protect the core. It complements autoscaling, which is slower to react.
- When to use. Public or multi-tenant APIs; protecting a fixed-capacity backend; enforcing fair use and tiered SLAs. Pairs with Priority Queue to decide whose load to shed.
- WAF pillar(s). Reliability (primary); Performance Efficiency; Security (DoS mitigation); Cost Optimisation.
- Azure example. API Management rate-limit-by-key policies returning 429 (quota policies reject with 403 once the quota is exhausted); Front Door and Application Gateway WAF rate-limit rules; Cosmos DB and Service Bus themselves throttle you with 429 / server-busy – which is your cue to apply Retry with backoff.
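On the inbound side, the same idea becomes a per-consumer counter that answers with 429 and a Retry-After hint. This is a fixed-window sketch with a hypothetical `FixedWindowThrottle` class and in-memory state; it is not how API Management implements its policies.

```python
import math
import time


class FixedWindowThrottle:
    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._windows = {}  # consumer -> (window_start, request_count)

    def check(self, consumer):
        """Return (status_code, headers) for one incoming request."""
        now = self.clock()
        start, count = self._windows.get(consumer, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window has ended
        if count >= self.limit:
            # Tell the caller how long until the current window resets.
            retry_after = max(1, math.ceil(start + self.window - now))
            return 429, {"Retry-After": str(retry_after)}
        self._windows[consumer] = (start, count + 1)
        return 200, {}
```

Because counters are keyed by consumer, one noisy caller exhausts only its own allowance. Fixed windows permit a short burst across a window boundary; sliding windows or token buckets smooth that at the cost of more state.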
Scheduler Agent Supervisor
- Problem. A multi-step distributed action must complete reliably as a whole, but individual steps can fail, time out, or leave the system in an indeterminate state – and you need detection and recovery, not just optimism.
- Solution. Three roles. The Scheduler orchestrates the steps and records state. Agents perform each step against a remote service. The Supervisor monitors for steps that have failed or stalled and triggers remediation – retry, compensate, or escalate. It is a resilient orchestration skeleton with built-in detection-and-recovery.
- When to use. Complex, fault-prone workflows where you need guaranteed eventual completion or clean compensation – order fulfilment, provisioning, financial settlement.
- WAF pillar(s). Reliability (primary); Operational Excellence.
- Azure example. Durable Functions is essentially this pattern as a service: the orchestrator is the Scheduler, activity functions are Agents, and the durable runtime + timers/monitors act as the Supervisor. Logic Apps with run history and resubmission fills the same role for integration workflows.
Sequential Convoy
- Problem. You need high-throughput parallel message processing (Competing Consumers), but a subset of messages must be processed strictly in order – all events for one order, or one device, must not be reordered or processed concurrently.
- Solution. Group related messages by a session / partition key and route each group to a single consumer that processes that group sequentially, while different groups still run in parallel. You get ordering within a key and scale across keys – the best of both.
- When to use. Event streams that demand per-entity ordering (per-account, per-device, per-aggregate) while still needing horizontal scale. The reconciliation of “ordered” and “parallel.”
- WAF pillar(s). Reliability (primary); Performance Efficiency.
- Azure example. Service Bus message sessions (one consumer per SessionId); Event Hubs partitions (ordering guaranteed within a partition, parallel across partitions); Kafka on Azure with key-based partitioning.
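The ordering-within-a-key, parallel-across-keys idea can be sketched as key-based routing. This models the partition-style variant (closer to Event Hubs or Kafka than to Service Bus sessions, which lock a session to a consumer rather than hashing); `lane_for` and `route` are hypothetical names.

```python
import zlib
from collections import defaultdict


def lane_for(session_id, consumer_count):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(session_id.encode("utf-8")) % consumer_count


def route(messages, consumer_count):
    """Assign each (session_id, payload) to one lane, preserving arrival order."""
    lanes = defaultdict(list)
    for session_id, payload in messages:
        lanes[lane_for(session_id, consumer_count)].append((session_id, payload))
    return dict(lanes)


messages = [
    ("order-1", "created"), ("order-2", "created"),
    ("order-1", "paid"), ("order-2", "paid"), ("order-1", "shipped"),
]
lanes = route(messages, consumer_count=3)
```

Every message for a given order lands in the same lane in its original order, so a single consumer per lane processes it sequentially, while other lanes proceed in parallel.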
Group 2 — Messaging patterns
Messaging patterns answer the latency, bandwidth, and coordination fallacies by replacing tight synchronous coupling with asynchronous, decoupled communication. They let producers and consumers scale, fail, and evolve independently. They serve Reliability and Performance Efficiency primarily, with Operational Excellence and Cost Optimisation crossovers.
Asynchronous Request-Reply
- Problem. A client needs the result of an operation that takes too long for a synchronous HTTP request, but the client cannot or should not hold a connection open or poll naively.
- Solution. The API accepts the request, returns 202 Accepted with a status/polling URL (or a callback/webhook), and processes asynchronously. The client polls the status endpoint (or is notified) and retrieves the result when ready. Decouples request acceptance from completion without a persistent connection.
- When to use. Long-running operations behind a request/response client (report generation, media transcoding, large imports) where you cannot push the whole flow onto a queue the client never sees.
- WAF pillar(s). Performance Efficiency (primary); Reliability.
- Azure example. API Management as the front end returning 202 + Location, with Durable Functions doing the work and exposing a status endpoint; the long-running operation convention used across Azure Resource Manager APIs.
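The accept-then-poll contract can be modelled without any web framework. In this sketch `JobApi`, the `/jobs/{id}` path, and the response shapes are all hypothetical; real implementations vary in which status codes the polling endpoint returns.

```python
import uuid


class JobApi:
    """Framework-free model of the 202 + status-URL handshake."""

    def __init__(self):
        self._jobs = {}

    def submit(self, payload):
        """Accept the work and return immediately with a polling location."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"status": "running", "payload": payload, "result": None}
        headers = {"Location": f"/jobs/{job_id}", "Retry-After": "5"}
        return 202, headers, job_id

    def complete(self, job_id, result):
        """Called by the background worker when processing finishes."""
        self._jobs[job_id].update(status="succeeded", result=result)

    def status(self, job_id):
        """What the client sees when it polls the Location URL."""
        job = self._jobs.get(job_id)
        if job is None:
            return 404, {}
        if job["status"] == "running":
            return 200, {"status": "running"}
        return 200, {"status": "succeeded", "result": job["result"]}
```

The client never holds a connection open: it gets the 202 in milliseconds, polls at the suggested interval, and reads the result once the status flips.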
Claim Check
- Problem. A message carries a large payload (a video, a big document) that bloats the message bus, blows past message-size limits, and wastes bandwidth (fallacy #3).
- Solution. Store the large payload in external storage and put only a reference (the “claim check”) on the bus. The consumer uses the reference to fetch the payload directly. The message stays small; the heavy data moves out-of-band.
- When to use. Any messaging flow with payloads near or beyond the broker’s size limit, or where most consumers don’t need the full payload.
- WAF pillar(s). Performance Efficiency (primary); Cost Optimisation; Reliability.
- Azure example. Put the blob in Azure Blob Storage, send the blob URI on Service Bus / Event Grid; Event Grid Blob Created events are this pattern natively – the event is the claim check for the object.
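The split between a small message and an out-of-band payload can be sketched with an in-memory store. `PayloadStore`, `to_message`, `read_payload`, and the 256 KB `SIZE_LIMIT` are illustrative assumptions; the real limit depends on the broker and tier.

```python
import uuid

SIZE_LIMIT = 256 * 1024  # hypothetical broker limit in bytes


class PayloadStore:
    """In-memory stand-in for external storage such as a blob container."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        ref = f"payloads/{uuid.uuid4()}"
        self._blobs[ref] = data
        return ref

    def get(self, ref):
        return self._blobs[ref]


def to_message(store, payload, limit=SIZE_LIMIT):
    """Inline small payloads; replace large ones with a claim check."""
    if len(payload) <= limit:
        return {"body": payload}
    return {"claim_check": store.put(payload), "size": len(payload)}


def read_payload(store, message):
    """Consumer side: redeem the claim check only when one is present."""
    if "claim_check" in message:
        return store.get(message["claim_check"])
    return message["body"]
```

Consumers that only need metadata (here, `size`) never fetch the payload at all, which is where much of the bandwidth saving comes from.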
Competing Consumers
- Problem. A single consumer cannot keep up with the message volume on a queue, and you need to scale throughput and add resilience to consumer failure.
- Solution. Run multiple consumer instances reading from the same queue. The broker delivers each message to one consumer at a time; the pool self-balances, and throughput scales with instance count. If a consumer dies mid-message, the lock expires and another picks it up, so handlers must tolerate redelivery.
- When to use. Almost always alongside Queue-Based Load Leveling – the queue smooths the load, competing consumers drain it elastically.
- WAF pillar(s). Reliability (primary); Performance Efficiency; Cost Optimisation.
- Azure example. Multiple Functions instances scaled by the Service Bus / Event Hubs trigger; an AKS deployment of workers with KEDA scaling on queue depth; Service Bus PeekLock giving the at-least-once redelivery that makes this safe.
Choreography
- Problem. A central orchestrator that drives a multi-service workflow becomes a bottleneck, a single point of failure, and a coupling magnet – every change touches the orchestrator.
- Solution. Remove the conductor. Each service reacts to events and emits its own, so the workflow emerges from local reactions rather than central command. Services are decoupled and independently deployable; the trade is reduced central visibility (you must invest in distributed tracing).
- When to use. Event-driven systems with autonomous teams/services and simple-to-moderate flows. Use orchestration (Scheduler Agent Supervisor / Saga orchestration) instead when the flow is complex or needs strong central control and visibility.
- WAF pillar(s). Reliability (primary); Operational Excellence; Performance Efficiency.
- Azure example. Event Grid as the backbone, services subscribing to and publishing domain events; Service Bus topics for choreographed sagas. Contrast in depth: saga orchestration vs. choreography.
Pipes and Filters
- Problem. A complex processing task is a monolithic block – hard to scale unevenly, reuse, or recombine, and you cannot scale only the slow stage.
- Solution. Decompose the task into a chain of independent filters (single-responsibility transforms) connected by pipes (the channels between them). Each filter scales and is reused independently; stages can be reordered or swapped.
- When to use. Data-processing and ingestion pipelines, ETL, media processing, content moderation – anywhere stages have different scaling profiles or need independent reuse.
- WAF pillar(s). Performance Efficiency (primary); Reliability; Operational Excellence.
- Azure example. A chain of Functions linked by Service Bus/Storage queues; Azure Data Factory / Synapse pipelines; Stream Analytics jobs feeding downstream stages; container stages on AKS connected by queues.
Priority Queue
- Problem. Not all work is equal – a premium customer’s request or an urgent alert must jump ahead of routine background jobs – but a single FIFO queue treats everything the same.
- Solution. Process higher-priority messages ahead of lower-priority ones, via either separate queues per priority (consumers favour the high-priority queue) or a broker that supports priority ordering.
- When to use. Tiered SLAs, mixed urgent/batch workloads, premium vs. free tenants. Pairs naturally with Throttling (shed the low-priority work first under overload).
- WAF pillar(s). Performance Efficiency (primary); Reliability; Cost Optimisation.
- Azure example. Multiple Service Bus queues (orders-high, orders-normal) with consumers weighted to the high-priority queue; Service Bus topics with subscription filters routing by a priority property.
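The separate-queues variant reduces to a consumer that always checks the high-priority queue first. `TwoLevelQueue` below is a hypothetical in-memory model of two queues such as orders-high and orders-normal, using strict priority.

```python
from collections import deque


class TwoLevelQueue:
    """Two FIFO queues; the consumer drains the high-priority one first."""

    def __init__(self):
        self.high = deque()
        self.normal = deque()

    def enqueue(self, message, high_priority=False):
        (self.high if high_priority else self.normal).append(message)

    def dequeue(self):
        """Next message to process, or None when both queues are empty."""
        if self.high:
            return self.high.popleft()
        if self.normal:
            return self.normal.popleft()
        return None


q = TwoLevelQueue()
q.enqueue("nightly-report")
q.enqueue("premium-order", high_priority=True)
q.enqueue("batch-export")
order = [q.dequeue() for _ in range(3)]
# order == ["premium-order", "nightly-report", "batch-export"]
```

Strict priority can starve the normal queue under sustained high-priority load; weighting the consumers, or dedicating a minimum share to the low-priority queue, is the usual remedy.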
Publisher-Subscriber
- Problem. A producer must notify many, unknown, changing consumers of an event without coupling to who they are or how many exist (fallacies #5 and #6).
- Solution. The publisher sends events to a topic; the messaging infrastructure fans them out to all current subscribers, each on its own subscription. Publisher and subscribers never know about each other – full temporal and referential decoupling.
- When to use. Event broadcast / fan-out, integration across teams, reactive architectures. The foundational pattern beneath Choreography and event-driven styles.
- WAF pillar(s). Reliability (primary); Performance Efficiency; Operational Excellence.
- Azure example. Event Grid (discrete reactive events, massive fan-out), Service Bus topics + subscriptions (durable enterprise pub/sub with filtering), Event Hubs (high-volume streaming to multiple consumer groups). Choosing between them: message queues vs. pub/sub.
Messaging Bridge
- Problem. Two systems use different messaging infrastructures (an on-prem ESB and cloud Service Bus, Kafka and Event Hubs, two clouds) and must exchange messages without rewriting either end (fallacies #6 and #8).
- Solution. A bridge connects the two messaging systems, translating protocols and formats and relaying messages between them – an adapter at the transport layer that lets heterogeneous buses interoperate.
- When to use. Hybrid and migration scenarios; multicloud event flows; integrating legacy middleware with cloud-native messaging.
- WAF pillar(s). Reliability (primary); Operational Excellence.
- Azure example. Azure Service Bus with the on-prem connector / hybrid relay; Logic Apps or MirrorMaker-style relays bridging Kafka↔Event Hubs; Event Grid’s MQTT broker bridging IoT protocols to Azure messaging.
Group 3 — Data management patterns
Data patterns answer the latency, bandwidth, and coordination fallacies in the data tier – where the hardest tradeoffs (consistency vs. availability vs. performance) live. They serve Performance Efficiency and Reliability heavily, with Security and Cost Optimisation crossovers.
Cache-Aside
- Problem. Repeatedly reading the same data from a slow or expensive store adds latency (fallacy #2) and load and cost (fallacy #7).
- Solution. The application checks the cache first; on a miss it loads from the store, populates the cache, and returns. Writes update the store and invalidate (or update) the cache. The app – not the store – owns the cache, hence “aside.”
- When to use. Read-heavy data that tolerates slight staleness; reference data, session state, computed results. Pick a TTL/invalidation strategy deliberately. Deep dive: caching strategies.
- WAF pillar(s). Performance Efficiency (primary); Cost Optimisation; Reliability.
- Azure example. Azure Cache for Redis in front of Azure SQL or Cosmos DB; in-memory cache for single-instance hot paths; Front Door / CDN caching at the edge for the HTTP layer.
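The read and write paths described in the Solution bullet look like this in miniature. The `CacheAside` class, the injected `load_from_store` / `write_to_store` callables, and the dict standing in for Redis are assumptions of the sketch; it uses invalidate-on-write with a TTL as a safety net.

```python
import time


class CacheAside:
    def __init__(self, load_from_store, write_to_store, ttl_seconds=60,
                 clock=time.monotonic):
        self._load = load_from_store
        self._write = write_to_store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, expires_at); stands in for Redis

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]  # cache hit
        # Miss or expired: load from the store, then populate the cache.
        value = self._load(key)
        self._cache[key] = (value, self._clock() + self._ttl)
        return value

    def update(self, key, value):
        # Write to the store first, then invalidate so the next read reloads.
        self._write(key, value)
        self._cache.pop(key, None)
```

Invalidating rather than updating the cache on write keeps the store as the single source of truth; the price is one extra miss after every write.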
CQRS (Command and Query Responsibility Segregation)
- Problem. A single data model optimised for writes (normalised, transactional) is a poor fit for reads (denormalised, query-shaped), and the two have wildly different scaling and contention profiles.
- Solution. Split the model in two: commands mutate state through the write model; queries read from one or more read models shaped for specific queries. The two can use different stores and scale independently; they are kept in sync (often eventually).
- When to use. Read/write ratios that are very asymmetric; complex domains; collaborative systems with contention. It adds complexity and eventual consistency – do not apply it blanket. Deep dive: CQRS read-model projection pipelines.
- WAF pillar(s). Performance Efficiency (primary); Reliability; Operational Excellence.
- Azure example. Writes to Azure SQL, reads from a denormalised Cosmos DB projection kept current by a change feed / Functions; the read side often combined with Materialized View.
Event Sourcing
- Problem. Storing only the current state loses history, makes auditing and temporal queries impossible, and creates update contention on the latest row.
- Solution. Persist an append-only log of events as the source of truth; current state is derived by replaying events. Nothing is updated in place – you append facts. You get a perfect audit trail, time travel, and rebuildable read models.
- When to use. Audit-critical domains (finance, healthcare), systems needing temporal queries or replay, and as the natural write side of CQRS. It demands schema-evolution discipline and snapshotting for performance. Deep dive: event sourcing aggregate design.
- WAF pillar(s). Reliability (primary); Operational Excellence; Performance Efficiency.
- Azure example. Event store on Cosmos DB or Event Hubs / Kafka; projections built by Functions on the change feed; pairs with CQRS and Materialized View.
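Deriving state by replay, and "time travel" by replaying a prefix, fit in a few lines. The account domain, the event names, and the `apply` / `replay` helpers are illustrative assumptions; snapshotting and event versioning are omitted.

```python
def apply(balance, event):
    """Pure transition: current state plus one event gives the next state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def replay(events, upto=None):
    """Fold the log into state; `upto` limits replay to the first N events."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


# The append-only log is the source of truth; nothing is updated in place.
log = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
current = replay(log)               # 75
as_of_second = replay(log, upto=2)  # 70
```

Because `apply` is pure, any read model can be rebuilt from the same log at any time, which is what makes the projections in the CQRS pairing disposable.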
Materialized View
- Problem. Producing a needed result requires expensive joins/aggregations across normalised data on every read – too slow and costly to compute on demand (fallacies #2, #7).
- Solution. Pre-compute and store the query result (the materialised view), refreshing it as source data changes. Reads hit the prepared view directly. The view is disposable – it can always be rebuilt from source.
- When to use. Expensive, frequently-read aggregations; dashboards; the read side of CQRS/Event Sourcing; cross-source summaries.
- WAF pillar(s). Performance Efficiency (primary); Cost Optimisation; Reliability.
- Azure example. Cosmos DB materialised views / change-feed-built projections; Azure SQL indexed views; precomputed summary containers refreshed by Functions; Synapse materialised views for analytics.
Index Table
- Problem. A data store partitions or indexes only by primary key, but your queries filter by other fields – forcing slow full scans (fallacy #2).
- Solution. Maintain secondary index tables keyed by the fields you query, each pointing back to the primary records (or carrying the projected fields). You trade extra write work and storage for fast non-key lookups.
- When to use. NoSQL/partitioned stores where the engine lacks rich secondary indexing, or where you need a query path the partition key doesn’t serve.
- WAF pillar(s). Performance Efficiency (primary); Cost Optimisation.
- Azure example. Secondary lookup containers in Cosmos DB (or its native indexing where sufficient); Azure Table Storage index tables; an Azure AI Search index over Blob/SQL as an externalised query path.
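The write-side cost of an index table is easiest to see in code. The sketch below uses two plain dictionaries as stand-ins for a primary container and a secondary lookup container; `put_customer`, `customers_in`, and the customer/city fields are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Primary store: keyed by customer id (the only key the engine serves natively).
primary = {}
# Index table: keyed by city, pointing back to primary keys.
city_index = defaultdict(set)


def put_customer(customer_id: str, name: str, city: str) -> None:
    """Write the record and maintain the secondary index (the extra write work)."""
    old = primary.get(customer_id)
    if old is not None:
        city_index[old["city"]].discard(customer_id)  # drop the stale index entry
    primary[customer_id] = {"name": name, "city": city}
    city_index[city].add(customer_id)


def customers_in(city: str) -> list:
    """Non-key lookup through the index table instead of a full scan."""
    return sorted(primary[cid]["name"] for cid in city_index.get(city, ()))


put_customer("c1", "Ana", "Lisbon")
put_customer("c2", "Ben", "Oslo")
put_customer("c2", "Ben", "Lisbon")  # Ben moves: the index must follow the update
```

In a distributed store the two writes are not atomic, so the index is usually maintained asynchronously and readers must tolerate a briefly stale entry.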
Sharding
- Problem. A single data store hits a ceiling – storage, throughput, or connection limits – that no amount of scaling up can clear (the “partition around limits” principle).
- Solution. Partition data horizontally across multiple stores (shards) by a shard key, each holding a subset. The system scales out near-linearly; the art is choosing a key that spreads load evenly and avoids hot shards. Strategies: lookup, range, hash.
- When to use. Datasets or throughput beyond a single store; multi-tenant isolation; geo-partitioning. A poor shard key creates the Noisy Neighbour anti-pattern; choose carefully.
- WAF pillar(s). Performance Efficiency (primary); Reliability; Cost Optimisation.
- Azure example. Cosmos DB partition keys (sharding as a managed feature – pick the key well); Azure SQL elastic database tools / sharded pools; partitioned Event Hubs/Kafka. See also multi-region data replication.
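As a rough illustration of the hash strategy, the sketch below maps a shard key to one of a fixed set of shards. The shard names and the `shard_for` helper are assumptions for the example; managed services such as Cosmos DB do this routing for you.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names


def shard_for(shard_key: str) -> str:
    """Hash strategy: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# The same key always lands on the same shard.
home = shard_for("tenant-42")
```

Plain modulo hashing remaps most keys when the shard count changes, which is one reason the lookup strategy (an explicit key-to-shard map) is often preferred when shards are added over time.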
Static Content Hosting
- Problem. Serving static assets (HTML, JS, CSS, images, video) from your application compute wastes server cycles and adds latency for distant users (fallacies #2, #3).
- Solution. Host static content in dedicated storage and serve it via a CDN at the edge, freeing application compute for dynamic work and putting bytes close to users.
- When to use. Any app with meaningful static assets – which is nearly all of them. The cheapest performance and cost win available.
- WAF pillar(s). Performance Efficiency (primary); Cost Optimisation; Reliability.
- Azure example. Azure Storage static website + Azure Front Door / CDN; Static Web Apps for SPA hosting with a built-in global edge; combine with Cache-Aside at the edge.
Valet Key
- Problem. Routing large uploads/downloads through your application to enforce access control wastes compute and bandwidth and makes your app a bottleneck – but you cannot hand out blanket storage access either (fallacies #3, #4).
- Solution. Issue the client a scoped, time-limited token that grants direct access to a specific storage resource for a specific operation. The client talks to storage directly; your app never touches the bytes but still controls who can do what.
- When to use. Large file uploads/downloads, media delivery, any direct-to-storage flow where the app should authorise but not proxy.
- WAF pillar(s). Performance Efficiency (primary); Security; Cost Optimisation.
- Azure example. Azure Storage Shared Access Signatures (SAS) – the textbook valet key; user-delegation SAS backed by Entra ID for least-privilege, auditable, short-lived grants.
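The mechanics of a valet key can be sketched with an HMAC-signed token. This is not the SAS format and not an Azure API: `issue_token`, `storage_accepts`, the `|`-delimited payload, and the hard-coded `SECRET` are all assumptions made to show the idea of a scoped, expiring grant that storage can verify without calling the app.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-signing-key"  # hypothetical; a real key lives in a vault, never in code


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(resource: str, permission: str, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """App side: grant one permission on one resource until an expiry time."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{resource}|{permission}|{expires}"
    return f"{payload}|{_sign(payload)}"


def storage_accepts(token: str, resource: str, permission: str,
                    now: Optional[float] = None) -> bool:
    """Storage side: check signature, scope, and expiry."""
    try:
        tok_resource, tok_permission, expires, sig = token.rsplit("|", 3)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{tok_resource}|{tok_permission}|{expires}")
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return False  # tampered or forged
    current = time.time() if now is None else now
    return (tok_resource == resource
            and tok_permission == permission
            and current < int(expires))
```

The app authorises once by issuing the token; every byte of the upload then flows between client and storage. Short expiry and narrow scope are what keep a leaked token cheap.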
Group 4 — Design & implementation patterns
The largest group: structural patterns for how you compose, deploy, configure, and evolve services – answering the topology, single-administrator, and homogeneity fallacies. They serve Operational Excellence and Performance Efficiency most, with Reliability and Cost Optimisation crossovers.
Ambassador
- Problem. You want consistent networking concerns – retries, timeouts, routing, TLS, telemetry – around outbound calls, but baking them into every app in every language is duplicative and error-prone.
- Solution. Place a helper “ambassador” service alongside the app that proxies its outbound network calls and handles those cross-cutting client-side concerns. The app makes a simple local call; the ambassador does the hard networking. (A Sidecar specialised for the client side of communication.)
- When to use. Polyglot estates needing uniform connectivity behaviour; legacy apps you cannot modify; offloading client resilience without code changes.
- WAF pillar(s). Operational Excellence (primary); Reliability; Security.
- Azure example. A service-mesh sidecar (Istio/Linkerd/Open Service Mesh) on AKS handling outbound mTLS, retries, and routing; Dapr sidecar for service invocation with built-in resilience.
Anti-Corruption Layer
- Problem. Integrating a clean, modern system with a legacy or third-party system risks letting the legacy model’s quirks, terms, and bad abstractions leak in and corrupt your design (fallacies #6, #8).
- Solution. Insert a translation layer that maps between the two models, isolating your domain from the foreign one. Your code speaks your language; the ACL does the dirty translation at the boundary.
- When to use. Strangler-Fig migrations, integrating SaaS/legacy systems, any boundary where a foreign model would otherwise pollute yours. The companion to Strangler Fig.
- WAF pillar(s). Operational Excellence (primary); Reliability.
- Azure example. A façade Functions/API Management layer translating a legacy SOAP/mainframe contract into clean REST for new services; an adapter microservice fronting a third-party API. Background: mainframe modernisation with Strangler Fig.
Backends for Frontends (BFF)
- Problem. A single general-purpose API can’t serve a mobile app, a web SPA, and partners well at once – each wants different payloads, chattiness, and auth, and one API becomes a compromise that fits none (fallacy #2 hits mobile hardest).
- Solution. Build a separate backend per frontend, each tailored to its client’s needs (payload shape, aggregation, auth), sitting between the client and shared downstream services.
- When to use. Multiple distinct client types with divergent needs; mobile clients on high-latency links needing tailored, aggregated responses. Often combined with Gateway Aggregation. Deep dive: API gateway & BFF pattern.
- WAF pillar(s). Performance Efficiency (primary); Operational Excellence; Security.
- Azure example. Per-client App Service / Functions / Container Apps BFFs behind API Management; an AKS ingress routing to a mobile-BFF and a web-BFF.
Compute Resource Consolidation
- Problem. Running many small, under-utilised services each on its own compute wastes money and management overhead – lots of idle capacity billed in full (fallacy #7).
- Solution. Consolidate multiple tasks/services onto shared compute to raise utilisation and cut cost and operational surface – carefully, so they don’t interfere (mind the Noisy Neighbour and Bulkhead tradeoff).
- When to use. Many low-traffic services; cost pressure; consolidating microservices that fragmented too far. Balance against isolation needs.
- WAF pillar(s). Cost Optimisation (primary); Operational Excellence; Performance Efficiency.
- Azure example. Multiple apps on a shared App Service Plan; several microservices co-located in an AKS cluster with resource limits; Container Apps environments packing workloads onto shared infrastructure.
Deployment Stamps
- Problem. A single shared deployment cannot scale past a ceiling, isolate tenants, deploy regionally, or contain a blast radius – one fault or one noisy tenant affects everyone.
- Solution. Deploy multiple independent copies (stamps / scale units), each a self-contained slice of the app+data serving a subset of tenants or a region. Scale by adding stamps; a stamp is the unit of deployment, scale, and blast-radius isolation.
- When to use. Multi-tenant SaaS, regional scale-out, blast-radius reduction. The cornerstone concept of mission-critical architecture (A5) and the structural sibling of cell-based design.
- WAF pillar(s). Reliability (primary); Performance Efficiency; Operational Excellence; Cost Optimisation.
- Azure example. Per-region/per-tenant-tier stamps provisioned by Bicep/Terraform (Azure Verified Modules), each with its own AKS/App Service + Cosmos DB/SQL, traffic split by Front Door / Traffic Manager. Related: cell-based architecture.
External Configuration Store
- Problem. Configuration baked into deployment packages can’t change without a redeploy, can’t be shared across instances/services, and scatters secrets through code (fallacy #5).
- Solution. Move configuration out of the deployment into a centralised external store that instances read at runtime, enabling dynamic updates, sharing, and central management (with secrets in a vault).
- When to use. Anything beyond a single static instance; feature flags; shared settings; secret management. Foundational for twelve-factor apps.
- WAF pillar(s). Operational Excellence (primary); Security; Reliability.
- Azure example. Azure App Configuration (settings + feature flags) with Key Vault references for secrets, consumed by App Service/Functions/AKS at startup and on refresh. Related: global config & feature flag platform.
Gateway Aggregation
- Problem. A client must call many backend services to render one screen, and on a high-latency link those round trips stack up painfully (fallacy #2).
- Solution. A gateway receives one client request, fans out to the needed backends, and combines the responses into a single payload. The client makes one call; the gateway absorbs the chattiness over the fast internal network.
- When to use. Composite UI screens, mobile clients, microservice backends. The direct fix for the Chatty I/O anti-pattern; often part of a BFF.
- WAF pillar(s). Performance Efficiency (primary); Operational Excellence.
- Azure example. API Management request aggregation / composite policies; a BFF on Functions/Container Apps doing the fan-out; AKS ingress + aggregator service.
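The fan-out and combine step can be sketched with `asyncio.gather`. The three `fetch_*` coroutines are hypothetical stand-ins for backend services, and the `asyncio.sleep` calls simulate internal network calls; a real aggregator would also decide what to return when one backend fails or times out.

```python
import asyncio


async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for a call over the internal network
    return {"name": f"user-{user_id}"}


async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.01)
    return [{"order_id": 1, "user": user_id}]


async def fetch_recommendations(user_id: str) -> list:
    await asyncio.sleep(0.01)
    return ["item-a", "item-b"]


async def aggregate_home_screen(user_id: str) -> dict:
    """One client request fans out to three backends concurrently."""
    profile, orders, recs = await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
        fetch_recommendations(user_id),
    )
    # A single combined payload goes back over the slow client link.
    return {"profile": profile, "orders": orders, "recommendations": recs}
```

Because the calls run concurrently, the gateway's latency is roughly that of the slowest backend rather than the sum of all three.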
Gateway Offloading
- Problem. Cross-cutting concerns – TLS termination, authentication, response caching, compression, WAF – are implemented redundantly in every service, duplicating effort and inviting config drift.
- Solution. Offload those shared concerns to the gateway so individual services don’t each implement them. Centralise once at the edge; simplify every backend.
- When to use. Any multi-service estate behind a gateway – which is most. Centralising TLS, auth, and WAF is near-universal best practice.
- WAF pillar(s). Operational Excellence (primary); Security; Performance Efficiency.
- Azure example. Front Door / Application Gateway terminating TLS and running WAF; API Management handling auth (JWT validation), caching, and rate limiting on behalf of backends.
Gateway Routing
- Problem. Exposing many backend services directly forces clients to know each service’s address and version, and couples them to your internal topology (fallacies #5, #6).
- Solution. Put a single gateway endpoint in front and route requests to the right backend by path, header, or version. Clients see one stable entry point; you reshuffle backends freely behind it.
- When to use. Microservices, API versioning, blue-green/canary routing, hiding internal structure. The base of the gateway trio (Routing + Aggregation + Offloading).
- WAF pillar(s). Operational Excellence (primary); Reliability; Performance Efficiency.
- Azure example. Application Gateway path-based routing, Front Door routing rules, API Management API routing; AKS Ingress controllers. Background: API gateways explained.
Geode
- Problem. A globally distributed user base suffers latency from a single-region backend, and you want every region to serve every user (fallacy #2 at planet scale).
- Solution. Deploy the backend as geographical nodes (“geodes”) across regions, each able to serve any request, backed by a globally-distributed, multi-write data layer. Requests route to the nearest geode; the system is active everywhere.
- When to use. Global, latency-sensitive, high-availability services – gaming, social, real-time collaboration – that justify the cost and the multi-write data complexity.
- WAF pillar(s). Performance Efficiency (primary); Reliability; Cost Optimisation (tradeoff – it raises cost).
- Azure example. Cosmos DB multi-region writes as the data layer with compute geodes in each region, fronted by Front Door / Traffic Manager geo-routing. The active/active end state of multi-region DR.
Sidecar
- Problem. You need supporting features – monitoring, logging, configuration, networking, security – attached to an app without baking them into the app or coupling their lifecycle and language to it (fallacies #5, #8).
- Solution. Deploy the helper as a separate co-located process/container (the sidecar) sharing the app’s lifecycle and host but isolated from its code. The sidecar adds capabilities the app gets “for free.” (Ambassador is a sidecar specialised for outbound networking.)
- When to use. Polyglot microservices, adding observability/security uniformly, service-mesh data planes. The mechanism behind much of cloud-native cross-cutting tooling.
- WAF pillar(s). Operational Excellence (primary); Reliability; Security; Performance Efficiency.
- Azure example. Dapr sidecar on AKS/Container Apps (state, pub/sub, secrets, invocation); Envoy proxies in a service mesh; a logging/telemetry agent container in the pod.
Strangler Fig
- Problem. A big-bang rewrite of a legacy monolith is high-risk and often fails; you need to modernise incrementally while the old system keeps running (fallacy #6).
- Solution. Place a façade in front of the legacy system and incrementally route slices of functionality to new services, “strangling” the old system feature by feature until it can be retired. The cutover is gradual and reversible.
- When to use. Modernising monoliths and mainframes with low risk and continuous delivery of value. Pairs with Anti-Corruption Layer (clean translation) and Gateway Routing (the façade). Deep dive: Strangler Fig monolith decomposition.
- WAF pillar(s). Operational Excellence (primary); Reliability.
- Azure example. Application Gateway / API Management façade routing some paths to new Container Apps/AKS services and the rest to the legacy backend, shifting routes over time as features are rebuilt.
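A façade that shifts routes over time is, at its core, a small routing table. The sketch below is an assumption-laden illustration: `migrate`, `rollback`, `route`, and the service names are hypothetical, and in practice this table would be gateway routing rules rather than application code.

```python
LEGACY = "legacy-monolith"   # hypothetical backend name
migrated_routes = {}         # path prefix -> new service that now owns it


def migrate(prefix: str, service: str) -> None:
    """Cut one slice of functionality over to a new service."""
    migrated_routes[prefix] = service


def rollback(prefix: str) -> None:
    """Reverse a cutover: the slice falls back to the legacy system."""
    migrated_routes.pop(prefix, None)


def route(path: str) -> str:
    """Longest-prefix match against migrated slices; everything else stays on legacy."""
    matches = [p for p in migrated_routes if path.startswith(p)]
    return migrated_routes[max(matches, key=len)] if matches else LEGACY
```

Each `migrate` call is one step of the strangling; `rollback` is what makes the cutover reversible.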
Group 5 — Security patterns
The smallest but non-negotiable group: patterns that answer fallacy #4 (the network is secure) by proving identity, validating untrusted input at the boundary, and isolating the unverified. They serve the Security pillar primarily, with Reliability crossovers. (Several patterns in other groups – Valet Key, Ambassador, Gateway Offloading, Throttling – also do heavy security work; security is cross-cutting.)
Federated Identity
- Problem. Managing user credentials yourself is a liability and a poor experience; users and partners already have identities elsewhere, and you should not be in the password-storage business (fallacies #4, #6).
- Solution. Delegate authentication to a trusted external identity provider (IdP). Your app trusts tokens the IdP issues rather than handling credentials, enabling SSO, social/enterprise login, and B2B federation.
- When to use. Practically every modern app – enterprise SSO, customer (CIAM), partner federation. Owning passwords should be the rare exception.
- WAF pillar(s). Security (primary); Operational Excellence.
- Azure example. Microsoft Entra ID (OIDC/SAML) for workforce SSO; Entra External ID / Azure AD B2C for customer identity and social/partner federation; App Service Easy Auth wiring it in with no code. Background: identity federation & SSO concepts.
Gatekeeper
- Problem. Exposing application instances that hold keys, secrets, and direct data access to untrusted clients means a compromise of the app exposes everything behind it.
- Solution. Insert a hardened, minimal broker (the gatekeeper) between clients and the app/storage. It validates and sanitises requests and brokers access, holding no sensitive keys itself – so compromising it yields little. It decouples the public attack surface from the privileged core.
- When to use. Internet-facing systems with sensitive backends; defence-in-depth where you want a thin, throwaway front line. Combines with Valet Key (the gatekeeper issues scoped tokens).
- WAF pillar(s). Security (primary); Reliability.
- Azure example. A WAF-fronted API Management / Front Door tier validating and sanitising before traffic reaches backends that hold the real keys; a DMZ broker service that talks to Key Vault and storage on the client’s behalf.
Quarantine
- Problem. Accepting external content or artefacts (uploaded files, third-party container images, partner data) directly into your trusted environment risks ingesting malware or non-compliant material.
- Solution. Land incoming content in an isolated quarantine zone, validate/scan it (malware, schema, policy), and promote only what passes into the trusted environment; reject or destroy the rest.
- When to use. User-uploaded files, software-supply-chain ingestion of external images/packages, B2B data exchange – any inbound artefact you did not produce.
- WAF pillar(s). Security (primary); Reliability; Operational Excellence.
- Azure example. Uploads to a quarantine Blob container scanned by Microsoft Defender for Storage, promoted to the trusted container only on a clean verdict; ACR image quarantine + Defender scanning before images are allowed into AKS. Related: secure SFTP ingestion gateway.
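The land, scan, promote flow can be sketched with two dictionaries standing in for the quarantine and trusted containers. The `scan` function here is a deliberately trivial placeholder (it rejects empty content and one marker string); it is an assumption for illustration and not how Defender for Storage works.

```python
quarantine = {}  # stand-in for the isolated landing container
trusted = {}     # stand-in for the trusted container


def scan(content: bytes) -> bool:
    """Hypothetical scanner: passes non-empty content that lacks a marker string."""
    return len(content) > 0 and b"MALWARE-MARKER" not in content


def upload(name: str, content: bytes) -> None:
    """Inbound artefacts always land in quarantine first, never in trusted."""
    quarantine[name] = content


def process_quarantine() -> list:
    """Promote what passes the scan; destroy the rest. Returns rejected names."""
    rejected = []
    for name in list(quarantine):
        content = quarantine.pop(name)
        if scan(content):
            trusted[name] = content
        else:
            rejected.append(name)
    return rejected
```

The important property is structural: nothing reaches `trusted` without passing through `scan`, regardless of who uploaded it.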
How the patterns map: the catalogue at a glance
The diagram below is the mental map for the whole catalogue – the five groups, the fallacy each group primarily answers, and the headline patterns within each, with the composition arrows that connect them into real designs. Keep it as your one-page index; the rest of this lesson teaches how to combine what it shows.
Pattern composition: how real designs are built
No production system uses one pattern. The skill that separates an architect from a pattern-memoriser is composition – knowing which patterns reinforce each other and which conflict. Here are the four canonical combinations you will reach for constantly. Each is more than the sum of its parts: the composition closes a gap that either pattern alone leaves open.
Retry + Circuit Breaker — the resilience pair
These are almost never used apart, because each covers the other’s blind spot. Retry handles transient faults brilliantly but is dangerous against a sustained outage – retrying a dead service just adds load and amplifies the failure (the Retry Storm anti-pattern). Circuit Breaker handles sustained outages brilliantly but does nothing for the brief blips that retry would have absorbed transparently. Composed correctly, the breaker wraps the retry: while the circuit is Closed, retry absorbs blips; once failures cross a threshold the breaker Opens and short-circuits the retries, so you stop hammering the sick dependency; in Half-Open a trial request decides whether to resume. Add a timeout beneath both (so a hung call can’t pin a thread forever) and a fallback above (cached value, queued write, degraded response) and you have the complete resilience pipeline. The ordering matters: timeout is innermost, then retry, then breaker outermost, so the breaker counts final failures, not each retry attempt. In .NET this is a single Polly pipeline; on AKS it is a Dapr resiliency policy or mesh config. Full treatment: resiliency patterns that actually work.
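The layering described above (timeout innermost, then retry, then breaker, with a fallback on top) can be sketched in a few dozen lines of asyncio. This is a simplified illustration, not Polly or Dapr: the `CircuitBreaker` class, `with_retry`, `resilient_call`, and all thresholds are assumptions, and the Half-Open handling does not guard against several concurrent trial requests.

```python
import asyncio
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is Open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means Closed

    async def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise Half-Open: let this call through as the trial request.
        try:
            result = await operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0  # success closes the circuit
        self.opened_at = None
        return result


async def with_retry(operation, attempts=3, base_delay=0.1, timeout=1.0):
    """Retry layer; each attempt is wrapped in the innermost timeout."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(operation(), timeout)
        except Exception:
            if attempt == attempts - 1:
                raise  # only the final failure reaches the breaker
            # Exponential backoff with full jitter.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))


async def resilient_call(breaker, operation, fallback,
                         attempts=3, base_delay=0.1, timeout=1.0):
    """Breaker outermost, fallback above it."""
    try:
        return await breaker.call(
            lambda: with_retry(operation, attempts, base_delay, timeout))
    except Exception:
        return fallback
```

With a threshold of 2 and three attempts per call, a dead dependency is hit six times before the circuit opens; after that, callers get the fallback immediately and the dependency gets no traffic until `reset_after` elapses.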
Queue-Based Load Leveling + Competing Consumers — the elastic worker
The most common throughput composition in cloud apps. Queue-Based Load Leveling puts a buffer between a spiky producer and the backend, converting peaks into a drainable backlog – but a single consumer draining that queue is now the bottleneck and a single point of failure. Competing Consumers solves that: multiple consumers read the same queue, the broker hands each message to exactly one, and the pool self-balances and survives consumer death (the lock expires, another picks up). Together they give you both smoothing (the queue) and elastic, resilient drain (the consumer pool) – the heart of the Web-Queue-Worker style. Scale the consumer pool on queue depth (KEDA on AKS, the Functions Service Bus trigger) so workers track the backlog automatically. If a subset of messages must stay ordered, layer Sequential Convoy on top (Service Bus sessions / Event Hubs partitions) to get ordering within a key while still scaling across keys.
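The shape of this composition can be shown in-process with a standard-library queue and a pool of threads. It is only a sketch: `run_workers` and the worker names are hypothetical, and an in-memory `queue.Queue` has none of a real broker's peek-lock, redelivery, or dead-lettering.

```python
import queue
import threading


def run_workers(messages, worker_count=3):
    """A burst lands in the buffer; a pool of consumers drains it, one message each."""
    backlog = queue.Queue()
    for message in messages:
        backlog.put(message)  # the spike is absorbed by the queue, not the backend

    processed = []
    lock = threading.Lock()

    def consumer(name):
        while True:
            try:
                message = backlog.get_nowait()  # each message goes to exactly one consumer
            except queue.Empty:
                return  # backlog drained
            with lock:
                processed.append((name, message))
            backlog.task_done()

    threads = [threading.Thread(target=consumer, args=(f"worker-{i}",))
               for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed
```

Changing `worker_count` changes only how fast the backlog drains, not the producer's behaviour, which is the decoupling that makes scaling on queue depth possible.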
Gateway Routing + Aggregation + Offloading — the gateway trio
A mature API gateway is rarely doing one job; it is doing three patterns at once, and they stack cleanly. Gateway Routing gives clients one stable endpoint and routes by path/header/version to the right backend, hiding internal topology and enabling canary/blue-green. Gateway Aggregation lets one client call fan out to several backends and return a combined payload, killing the Chatty I/O that murders mobile latency. Gateway Offloading centralises the cross-cutting concerns – TLS termination, authentication, WAF, caching, compression – so no backend reimplements them. Composed, a request hits Front Door / Application Gateway / API Management, gets TLS-terminated and WAF-screened and authenticated (offloading), is routed to the right service or BFF (routing), which may aggregate several downstreams (aggregation) before responding. Layer this with Backends for Frontends when different client types need differently-shaped aggregations. Background: API gateways explained and the BFF pattern.
Saga (built on Compensating Transaction) — distributed consistency without 2PC
You cannot wrap a multi-service business transaction in a single ACID commit (no distributed transactions across Cosmos DB, Service Bus, and a partner API). The Saga pattern composes the workflow from a sequence of local transactions, each with a defined Compensating Transaction that semantically undoes it. If step 4 fails, the saga runs the compensations for steps 3, 2, 1 in reverse, returning the system to a consistent (not identical) state – the charge is refunded, not un-charged. Sagas come in two flavours, which themselves are pattern compositions: orchestration (a central coordinator drives steps – built on Scheduler Agent Supervisor) gives strong visibility and control; choreography (services react to events – built on Publisher-Subscriber + Choreography) gives looser coupling and autonomy at the cost of central visibility. Make every step idempotent so retries are safe, and you have reliable distributed consistency without two-phase commit. Deep dives: saga orchestration vs. choreography, idempotency & deduplication, and the transactional outbox for reliably publishing the saga’s events.
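The orchestrated flavour reduces to a loop that remembers what it has done and undoes it in reverse on failure. The sketch below is a minimal illustration under stated assumptions: `run_saga` and the step names are hypothetical, steps run in-process, and compensations are assumed to succeed (a real saga must persist progress and make compensations retryable and idempotent).

```python
def run_saga(steps):
    """steps: list of (name, action, compensation) tuples. Returns (ok, log)."""
    completed, log = [], []
    for name, action, compensate in steps:
        try:
            action()  # a local transaction in one service
        except Exception:
            log.append(f"failed:{name}")
            # Undo the completed steps in reverse order.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


def ship_order():
    raise RuntimeError("carrier unavailable")  # simulated failure at step 3


ok, log = run_saga([
    ("reserve_stock", lambda: None, lambda: None),
    ("charge_card", lambda: None, lambda: None),  # compensation would issue a refund
    ("ship_order", ship_order, lambda: None),
])
# log: done:reserve_stock, done:charge_card, failed:ship_order,
#      compensated:charge_card, compensated:reserve_stock
```

Note that the failed step itself is not compensated; only the steps that committed are.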
The composition meta-lesson: patterns combine along the axes of the fallacies. Resilience patterns stack to fully defend fallacy #1 (timeout → retry → breaker → fallback). Throughput patterns stack to defend the coordination/latency fallacies (queue → consumers → convoy). Boundary patterns stack at the edge (routing → offloading → aggregation → gatekeeper). When you find yourself reaching for one pattern, ask which adjacent pattern covers its blind spot – that is the composition.
Real-world application: how this shows up in actual Azure designs
In a real Azure design review you will not hear “let’s apply pattern 17.” You will see these patterns embedded in product decisions, and your job is to name them so the tradeoffs become discussable. A few composite shapes you will meet repeatedly:
- The resilient web-and-worker product. Front Door (Gateway Routing + Offloading + WAF) → App Service / Container Apps web tier with Health Endpoint Monitoring → Service Bus (Queue-Based Load Leveling) → Functions workers (Competing Consumers, scaled on queue depth) → Azure SQL writes and a Cosmos DB read projection (CQRS + Materialized View) with Azure Cache for Redis (Cache-Aside) in front. Every outbound call carries Retry + Circuit Breaker. That single sentence is a dozen patterns, and an experienced reviewer hears each one.
- The multi-tenant SaaS platform. Deployment Stamps per tenant tier/region (each a self-contained scale unit), Sharding the data by tenant, Throttling and Rate Limiting to enforce per-tenant fairness, Bulkhead/Compute Resource Consolidation to balance isolation against cost, and Federated Identity (Entra External ID) for tenant SSO. This is the on-ramp to the mission-critical design in A5.
- The integration / migration estate. Strangler Fig façade (API Management) routing slices to new services, Anti-Corruption Layer translating the legacy model, Messaging Bridge connecting the on-prem ESB to Service Bus, and Quarantine + Defender screening inbound partner files. Federated Identity bridges the organisational boundary.
- The event-driven backbone. Event Grid / Service Bus topics (Publisher-Subscriber) with Choreography between autonomous services, Claim Check for large payloads, Priority Queue for tiered work, Pipes and Filters for the processing chain, and Sagas (Compensating Transaction) holding cross-service consistency together. Distributed tracing (Application Insights) is mandatory because choreography sacrifices central visibility.
The recurring lesson: patterns are how you operationalise the Well-Architected pillars and the Ten Design Principles. “Design for self-healing” is Retry + Circuit Breaker + Health Endpoint Monitoring + Scheduler Agent Supervisor. “Partition around limits” is Sharding + Deployment Stamps + Bulkhead. “Minimise coordination” is Competing Consumers + Choreography + CQRS. The principles tell you what good looks like; the patterns are how you get there.
Pattern → pillar summary table
This is the table to internalise for AZ-305 and design reviews. The primary pillar is the pattern’s main intent; secondary pillars are strong crossovers. (Cost Optimisation, Operational Excellence, and Performance Efficiency appear widely as crossovers because almost every structural choice touches cost, operations, and performance.)
| Pattern | Group | Primary pillar | Secondary pillar(s) |
|---|---|---|---|
| Retry | Reliability | Reliability | Performance Efficiency |
| Circuit Breaker | Reliability | Reliability | Performance Efficiency |
| Bulkhead | Reliability | Reliability | Performance Efficiency |
| Compensating Transaction | Reliability | Reliability | Operational Excellence |
| Leader Election | Reliability | Reliability | Operational Excellence |
| Health Endpoint Monitoring | Reliability | Reliability | Operational Excellence |
| Queue-Based Load Leveling | Reliability | Reliability | Performance Efficiency, Cost Optimisation |
| Rate Limiting | Reliability | Reliability | Cost Optimisation, Performance Efficiency |
| Throttling | Reliability | Reliability | Performance Efficiency, Security, Cost Optimisation |
| Scheduler Agent Supervisor | Reliability | Reliability | Operational Excellence |
| Sequential Convoy | Reliability | Reliability | Performance Efficiency |
| Asynchronous Request-Reply | Messaging | Performance Efficiency | Reliability |
| Claim Check | Messaging | Performance Efficiency | Cost Optimisation, Reliability |
| Competing Consumers | Messaging | Reliability | Performance Efficiency, Cost Optimisation |
| Choreography | Messaging | Reliability | Operational Excellence, Performance Efficiency |
| Pipes and Filters | Messaging | Performance Efficiency | Reliability, Operational Excellence |
| Priority Queue | Messaging | Performance Efficiency | Reliability, Cost Optimisation |
| Publisher-Subscriber | Messaging | Reliability | Performance Efficiency, Operational Excellence |
| Messaging Bridge | Messaging | Reliability | Operational Excellence |
| Cache-Aside | Data management | Performance Efficiency | Cost Optimisation, Reliability |
| CQRS | Data management | Performance Efficiency | Reliability, Operational Excellence |
| Event Sourcing | Data management | Reliability | Operational Excellence, Performance Efficiency |
| Materialized View | Data management | Performance Efficiency | Cost Optimisation, Reliability |
| Index Table | Data management | Performance Efficiency | Cost Optimisation |
| Sharding | Data management | Performance Efficiency | Reliability, Cost Optimisation |
| Static Content Hosting | Data management | Performance Efficiency | Cost Optimisation, Reliability |
| Valet Key | Data management | Performance Efficiency | Security, Cost Optimisation |
| Ambassador | Design & implementation | Operational Excellence | Reliability, Security |
| Anti-Corruption Layer | Design & implementation | Operational Excellence | Reliability |
| Backends for Frontends | Design & implementation | Performance Efficiency | Operational Excellence, Security |
| Compute Resource Consolidation | Design & implementation | Cost Optimisation | Operational Excellence, Performance Efficiency |
| Deployment Stamps | Design & implementation | Reliability | Performance Efficiency, Operational Excellence, Cost Optimisation |
| External Configuration Store | Design & implementation | Operational Excellence | Security, Reliability |
| Gateway Aggregation | Design & implementation | Performance Efficiency | Operational Excellence |
| Gateway Offloading | Design & implementation | Operational Excellence | Security, Performance Efficiency |
| Gateway Routing | Design & implementation | Operational Excellence | Reliability, Performance Efficiency |
| Geode | Design & implementation | Performance Efficiency | Reliability, Cost Optimisation |
| Sidecar | Design & implementation | Operational Excellence | Reliability, Security, Performance Efficiency |
| Strangler Fig | Design & implementation | Operational Excellence | Reliability |
| Federated Identity | Security | Security | Operational Excellence |
| Gatekeeper | Security | Security | Reliability |
| Quarantine | Security | Security | Reliability, Operational Excellence |
Common mistakes & anti-patterns
An anti-pattern is a recurring bad shape – a solution that looks reasonable and is reliably wrong. Microsoft’s performance anti-patterns are the named failure modes you should be able to spot in a code review or a metrics dashboard, and crucially each one maps to the pattern that fixes it. Spotting the anti-pattern is half the diagnosis; naming the corrective pattern is the prescription.
| Anti-pattern | What it looks like | The tell (in metrics/code) | Pattern that fixes it |
|---|---|---|---|
| Busy Database | Pushing too much application logic (heavy stored procs, business rules, transforms) into the database, the hardest tier to scale out. | DB CPU saturated while app tier is idle; scaling the app does nothing. | Move logic to the (scalable) app tier; Cache-Aside, Materialized View, CQRS to offload reads. |
| Chatty I/O | Many small fine-grained calls where a few coarse ones would do – N+1 queries, per-item API calls. | Request count and round-trip latency dominate; throughput collapses on high-latency links. | Gateway Aggregation, Backends for Frontends, batching, coarser-grained APIs. |
| Noisy Neighbour | One tenant/workload monopolises a shared resource and starves the rest. | One tenant’s spike degrades everyone; uneven shard/partition load. | Bulkhead, Deployment Stamps, Throttling/Rate Limiting, better shard key. |
| Retry Storm | Aggressive retries (no backoff, no breaker) against a struggling dependency, amplifying the failure into a cascade. | Failures increase load; synchronised retry waves; cascading timeouts. | Retry with exponential backoff + jitter + Circuit Breaker (the composition). |
| Improper Instantiation | Creating and discarding expensive, meant-to-be-shared objects (HttpClient, DB connections, SDK clients) per request. | Socket/port exhaustion, connection pool churn, GC pressure under load. | Singleton/pooled instances (IHttpClientFactory, connection pooling); Compute Resource Consolidation mindset. |
| Monolithic Persistence | Forcing all data into one store regardless of access pattern, creating contention and a scaling ceiling. | Hot table contention; one store throttling the whole system; wrong store for the job. | Polyglot persistence (best store for each job), Sharding, CQRS, Index Table. |
| No Caching | Recomputing or re-fetching the same data on every request. | Repeated identical expensive reads; backend load that caching would erase; high cost per request. | Cache-Aside, Materialized View, Static Content Hosting / CDN, edge caching. |
| Synchronous I/O | Blocking threads on I/O (synchronous calls, blocking file/network ops) on the request path. | Thread-pool starvation; throughput plateaus far below CPU capacity; tail latency under load. | async/await end-to-end; Asynchronous Request-Reply; Queue-Based Load Leveling for long work. |
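
The fix named in the Retry Storm row (backoff, jitter, and a breaker composed together) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Polly or any Azure SDK retry policy: `CircuitBreaker`, `retry_with_backoff`, and every threshold and delay below are hypothetical names and values chosen for the example.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is short-circuited."""


class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures, then
    allows a trial call once `reset_after` seconds have passed."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and let this call probe the dependency.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn, breaker, attempts=4, base=0.2, cap=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `fn` through `breaker`, retrying with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # never retry into an open breaker: that is the retry storm
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: random wait in [0, min(cap, base * 2^attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))


# Usage: a hypothetical dependency that fails twice, then recovers.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


print(retry_with_backoff(flaky, CircuitBreaker(), base=0.01))  # ok
```

The jitter spreads retries from many clients so they do not arrive in synchronised waves, and the breaker stops the retry loop once failures look sustained rather than transient. A production version would also need a per-call timeout and a fallback, which this sketch leaves out.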
Beyond these named eight, the architect-level mistakes I see most often are pattern misuse rather than absence:
- Applying CQRS or Event Sourcing everywhere. They add real complexity and eventual consistency. Use them where the read/write asymmetry or audit need demands it – not by default. A premature CQRS split is its own anti-pattern.
- Retry without idempotency. Retrying a non-idempotent write duplicates side effects (double charges). Retry presupposes idempotency keys or dedup. The pattern is only safe with the precondition.
- Choreography for complex flows. Beautifully decoupled, but with many steps and error paths you lose the central visibility needed to debug. Past a complexity threshold, orchestration (Scheduler Agent Supervisor / Saga orchestration) is the right call.
- Gateway as a god-object. Offloading too much business logic into the gateway recreates a monolith at the edge. The gateway does cross-cutting concerns; domain logic stays in services.
- Stamps/sharding without an even key. Deployment Stamps and Sharding only deliver if load distributes evenly; a skewed key reintroduces Noisy Neighbour and hot partitions. The key choice is the design.
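The "Retry without idempotency" point above is easiest to see with a toy example. The sketch below assumes a hypothetical in-memory `PaymentService` and a made-up key format; it only illustrates the idea of deduplicating on a caller-supplied idempotency key, not any real payment API.

```python
class PaymentService:
    """Toy payment backend that deduplicates on a caller-supplied idempotency key."""

    def __init__(self):
        self.total_charged = 0
        self._results = {}  # idempotency key -> stored result of the first attempt

    def charge(self, idempotency_key, amount):
        # A replayed key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged += amount
        receipt = {"key": idempotency_key, "amount": amount, "status": "charged"}
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
key = "order-1001-payment"  # hypothetical key, generated once per logical operation

first = service.charge(key, 25)
retry = service.charge(key, 25)  # e.g. the client timed out and retried

print(service.total_charged)  # 25, not 50
print(first is retry)         # True: the retry got the stored receipt
```

In a real service the check-and-store step would have to be atomic and backed by a durable store; the dictionary here stands in for that and is the main simplification.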
Interview & exam questions
These concepts dominate AZ-305 (“design” verbs – recommend, select the appropriate solution) and senior architecture interviews. Practise giving the pattern name plus the tradeoff, not just a service.
- A spiky ingestion workload overwhelms a downstream that processes at a steady rate. Which pattern, and which Azure service? Queue-Based Load Leveling – put a Service Bus queue (or Event Hubs for high volume) between producer and consumer so the queue absorbs the spike and the consumer drains at its sustainable rate. Add Competing Consumers (scaled on queue depth) to drain elastically.
- Why are Retry and Circuit Breaker used together, and what happens if you use Retry alone against a dependency that is fully down? Retry handles transient faults; Circuit Breaker handles sustained outages. Retry alone against a down dependency causes a Retry Storm – the retries add load and amplify the failure into a cascade. The breaker opens after a failure threshold and short-circuits the retries, shedding load while the dependency recovers. Always pair them, with a timeout beneath and a fallback above.
- You must build a multi-service order workflow with no distributed transaction available. How do you keep data consistent if a later step fails? Saga pattern built on Compensating Transactions: each step has a semantic undo (refund, release, cancel); on failure you run the compensations in reverse. Choose orchestration (Durable Functions – central visibility) or choreography (Event Grid/Service Bus – looser coupling). Make every step idempotent so retries are safe.
- A mobile app makes 12 backend calls to render one screen and is slow on cellular. What pattern fixes it and what anti-pattern were they hitting? The anti-pattern is Chatty I/O; the fix is Gateway Aggregation (one client call fans out internally and returns a combined payload), often as a Backends for Frontends tailored to the mobile client. The fast internal network absorbs the chattiness instead of the high-latency mobile link.
- A multi-tenant SaaS sees one large tenant degrade performance for everyone. Name the anti-pattern and two patterns that mitigate it. Noisy Neighbour. Mitigate with Bulkhead (isolate resource pools), Deployment Stamps (give large tenants their own scale unit), and Throttling / Rate Limiting (cap per-tenant consumption). A better shard key also helps if the cause is a hot partition.
- You need exactly one instance in a scaled-out worker tier to run a singleton scheduled job. Which pattern and what is a lightweight Azure mechanism? Leader Election. Lightweight mechanism: an Azure Blob Storage lease as a distributed lock – the holder is the leader and renews the lease; if it dies the lease expires and another instance acquires it. Durable Functions and WebJobs singletons provide this as a built-in feature.
- A message bus rejects messages because some payloads are too large. How do you keep using messaging? Claim Check: store the large payload in Blob Storage and send only the reference (URI) on the bus; the consumer fetches the payload directly. Event Grid Blob Created events are this pattern natively.
- Distinguish Throttling from Rate Limiting – who is protected in each, and give an Azure example of each. Throttling protects your own service from inbound overload by rejecting/queuing excess (API Management 429 / WAF rate-limit rules). Rate Limiting is self-imposed on your outbound calls to stay within a downstream’s quota (client-side token bucket, API Management outbound policies). One defends your capacity; the other respects someone else’s.
- Users worldwide complain about latency from your single-region app, and you want every region to serve every user. Which pattern, and what is the key data-tier requirement? Geode – deploy interchangeable geographical nodes across regions, route to the nearest, backed by a globally-distributed multi-write data layer (Cosmos DB multi-region writes). The tradeoffs are multi-write conflict handling and higher cost (a Cost Optimisation tension).
- You are migrating a legacy monolith and cannot do a big-bang rewrite. Which two patterns do you combine, and what is the role of each? Strangler Fig (a façade routes slices of functionality to new services incrementally until the monolith is retired) combined with Anti-Corruption Layer (a translation layer that keeps the legacy model’s quirks from leaking into your clean new domain). Gateway Routing implements the façade.
- Your team wants to add observability and mTLS to a dozen microservices written in three languages without touching app code. Which pattern, and how on AKS? Sidecar (and its outbound-networking specialisation, Ambassador). On AKS, a service mesh (Istio/Linkerd/Open Service Mesh) or Dapr injects a sidecar per pod that handles telemetry, mTLS, retries, and routing – capabilities the app gets for free, uniformly, regardless of language.
- For a read-heavy reporting screen built from expensive cross-table aggregations, which two data patterns combine, and what is the consistency tradeoff? CQRS (separate read model) + Materialized View (pre-computed, stored aggregation refreshed from source, e.g. a Cosmos DB projection built off the SQL change feed). The tradeoff is eventual consistency – the read model lags the write model briefly – accepted in exchange for fast, cheap reads.
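The Saga answer in the list above ("on failure you run the compensations in reverse") can be sketched as a small orchestrator. This is an in-memory illustration only: `run_saga`, the step names, and the failing step are hypothetical, and a real orchestrator such as Durable Functions would persist progress so the saga survives a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On failure, run the compensations of the already-completed steps in
    reverse order. Returns (succeeded, log)."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            for done_name, undo in reversed(completed):
                undo()  # semantic undo, e.g. refund or cancel
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


def fail(message):
    raise RuntimeError(message)


# Hypothetical order workflow: the restaurant step fails after payment succeeded.
steps = [
    ("create_order", lambda: None, lambda: None),
    ("charge_payment", lambda: None, lambda: None),
    ("notify_restaurant", lambda: fail("restaurant offline"), lambda: None),
]
ok, log = run_saga(steps)
print(ok)  # False
for entry in log:
    print(entry)
# done:create_order
# done:charge_payment
# failed:notify_restaurant:restaurant offline
# compensated:charge_payment
# compensated:create_order
```

The sketch assumes compensations always succeed. In practice they must be idempotent and retried, because a compensation can itself hit a transient fault.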
Quick check
- The fallacy “latency is zero” most directly motivates which group of patterns? The data-management and gateway patterns that reduce round trips – Cache-Aside, Materialized View, Gateway Aggregation, BFF, Geode. (More broadly, anything that brings data closer or batches calls.)
- True or false: a Compensating Transaction rolls the system back to its exact prior state. False. It issues a semantic undo (a refund, a cancellation) – the original effects happened and may have been visible; you counteract them, you do not erase them.
- Which pattern lets you process related messages in order while still scaling across unrelated ones, and what Azure feature implements it? Sequential Convoy – Service Bus message sessions (one consumer per SessionId) or Event Hubs partitions (ordering within a partition, parallel across partitions).
- Name the anti-pattern: an app creates a new HttpClient per request and exhausts sockets under load. What fixes it? Improper Instantiation. Fix with a shared/pooled instance via IHttpClientFactory (and connection pooling for DB clients).
- Which security pattern issues a client a scoped, time-limited token to access storage directly so the app never proxies the bytes, and what is the Azure mechanism? Valet Key – Azure Storage Shared Access Signatures (preferably user-delegation SAS backed by Entra ID).
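The Sequential Convoy answer above (ordering within a group, parallelism across groups) comes down to routing every message of a session to the same lane. The sketch below models that with a stable hash over a session ID, which is closer in spirit to partition keys than to Service Bus sessions; `partition_for`, `route`, and the sample messages are hypothetical and no Azure SDK is involved.

```python
import hashlib
from collections import defaultdict


def partition_for(session_id, partition_count):
    """Stable hash so every message of a session lands on the same partition."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(messages, partition_count):
    """Group (session_id, payload) messages by partition, keeping arrival order."""
    partitions = defaultdict(list)
    for session_id, payload in messages:
        pid = partition_for(session_id, partition_count)
        partitions[pid].append((session_id, payload))
    return partitions


messages = [
    ("order-A", "created"), ("order-B", "created"),
    ("order-A", "paid"), ("order-B", "cancelled"), ("order-A", "shipped"),
]
partitions = route(messages, partition_count=4)

# Each partition can be drained by its own consumer; within a partition the
# messages of one session are still in their original order.
for pid, batch in sorted(partitions.items()):
    print(pid, batch)
```

Python's built-in `hash()` is deliberately avoided here because it is randomised per process for strings, which would break the "same session, same partition" guarantee across workers.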
Exercise: pick the patterns for a scenario
Scenario. You are designing the backend for a global food-delivery platform. Requirements: (a) customers worldwide expect sub-200 ms responses; (b) order placement must remain consistent across the order, payment, and restaurant services even though there is no distributed transaction; (c) order traffic spikes 10× at lunch and dinner; (d) the platform is multi-tenant (each restaurant chain is a tenant) and one large chain must not degrade others; (e) customers upload food photos and review attachments; (f) restaurant partners log in with their existing corporate identity; (g) a mobile app must render an order screen that today needs eight backend calls and is slow on cellular.
Task. For each requirement, name the pattern(s) you would apply, the Azure service that implements it, and – for at least two requirements – one pattern you deliberately considered and rejected, with your reason. Then state the two anti-patterns you are most at risk of and how your choices avoid them.
Model answer.
- (a) Global low latency → Geode with Cosmos DB multi-region writes and Front Door geo-routing, plus Static Content Hosting (Storage + CDN) for assets and Cache-Aside (Azure Cache for Redis) for hot reads. Rejected: a single-region design with read replicas – it cannot meet sub-200 ms for distant users on writes, and the latency floor is physics, not capacity.
- (b) Cross-service order consistency → Saga built on Compensating Transactions, implemented as orchestration with Durable Functions (I want central visibility for a money-touching flow), with every step idempotent. Rejected: choreography via Event Grid – elegant and decoupled, but for a multi-step payment flow I value the central visibility and explicit error handling of orchestration over loose coupling.
- (c) 10× meal-time spikes → Queue-Based Load Leveling (Service Bus) + Competing Consumers (Functions/KEDA scaled on queue depth) so peaks become a drainable backlog and workers scale elastically. Throttling (API Management 429) protects the edge if the spike exceeds even that.
- (d) Tenant isolation / no noisy neighbour → Deployment Stamps per tenant tier (large chains get their own scale unit), Sharding by tenant (good key to avoid hot partitions), Bulkhead + Rate Limiting per tenant. This directly defuses the Noisy Neighbour risk.
- (e) User uploads → Valet Key (user-delegation SAS) so clients upload directly to Blob Storage without proxying bytes through compute, landing in a Quarantine container scanned by Defender for Storage before promotion.
- (f) Partner SSO → Federated Identity with Microsoft Entra ID / External ID (OIDC), so you never store partner credentials.
- (g) Chatty mobile screen → Gateway Aggregation behind a mobile Backends for Frontends, collapsing eight calls into one tailored payload over the fast internal network – the direct fix for the Chatty I/O anti-pattern.
Anti-patterns you are most at risk of: Noisy Neighbour (multi-tenant + huge chains) – mitigated by stamps, sharding, bulkhead, and rate limiting; and Retry Storm (10× spikes against payment/restaurant services) – mitigated by Retry-with-backoff + Circuit Breaker on every outbound call, plus Queue-Based Load Leveling so backends see a smoothed rate rather than the raw spike. A good answer also names what it left out and why (e.g. no Event Sourcing here – the audit/temporal need does not yet justify the complexity).
The grading rubric is not “did you list patterns” – it is “did you justify each from a requirement, name the tradeoff, and show judgement about what not to apply.” That last part is the architect signal.
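To make the meal-time spike answer concrete, the arithmetic of Queue-Based Load Leveling can be sketched as a tick-by-tick simulation. The arrival numbers and drain rates below are invented for illustration, and `simulate_queue` is a hypothetical helper, not a model of Service Bus behaviour.

```python
def simulate_queue(arrivals, drain_rate):
    """Return the queue depth after each tick, given per-tick arrivals and a
    fixed number of messages the consumers can process per tick."""
    depth = 0
    depths = []
    for arriving in arrivals:
        depth += arriving
        depth -= min(depth, drain_rate)  # consumers never drain below empty
        depths.append(depth)
    return depths


# Hypothetical lunch spike: baseline 10 orders per tick, 100 for three ticks.
arrivals = [10, 10, 100, 100, 100, 10, 10, 10, 10, 10]

print(simulate_queue(arrivals, drain_rate=40))
# [0, 0, 60, 120, 180, 150, 120, 90, 60, 30]

print(simulate_queue(arrivals, drain_rate=80))
# [0, 0, 20, 40, 60, 0, 0, 0, 0, 0]
```

In both runs the backend never sees more than its drain rate per tick, so the spike turns into a backlog instead of an overload. Doubling the drain rate, which is what scaling out Competing Consumers does, shrinks the peak backlog from 180 to 60 and clears it within one tick of the spike ending.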
Certification mapping
This lesson is high-yield for AZ-305: Designing Microsoft Azure Infrastructure Solutions, whose questions are framed as design recommendations under constraints – exactly the pattern-selection skill above.
- Design for resiliency, high availability, and disaster recovery (AZ-305). Retry, Circuit Breaker, Bulkhead, Health Endpoint Monitoring, Queue-Based Load Leveling + Competing Consumers, Deployment Stamps, Geode, and the active/active material that A5 builds on. Expect “the application must remain available when a dependency fails / a region is lost – what do you recommend?” – answer with the pattern and the service.
- Design data storage solutions (AZ-305). CQRS, Event Sourcing, Materialized View, Index Table, Sharding, Cache-Aside, and polyglot persistence vs. the Monolithic Persistence anti-pattern – “best store for the job” and partitioning around limits.
- Design business continuity & messaging integration (AZ-305). Publisher-Subscriber, Competing Consumers, Choreography vs. orchestration, Asynchronous Request-Reply, Claim Check, Priority Queue, Messaging Bridge, and Saga/Compensating Transaction for distributed consistency.
- Design identity, governance, and security (AZ-305). Federated Identity (Entra ID / External ID), Gatekeeper, Valet Key (SAS), Quarantine, Gateway Offloading (WAF/TLS/auth), and External Configuration Store with Key Vault.
Crossover with AZ-204 (Developing Solutions for Microsoft Azure): the implementation-level patterns – Retry/Circuit Breaker with the SDKs and Polly, Cache-Aside with Azure Cache for Redis, Competing Consumers with Service Bus/Functions, Valet Key with SAS, Asynchronous Request-Reply, and the performance anti-patterns (Improper Instantiation, Chatty I/O, Synchronous I/O) – are AZ-204 staples. Crossover with AZ-104: Health probes, App Gateway/Front Door routing and offloading, autoscaling, and storage SAS appear at the operations level. The pattern names are the connective tissue across all three exams.
Glossary
- Design pattern. A named, reusable solution to a recurring design problem, with known forces and tradeoffs – a local, composable technique applied inside an architecture style.
- Architecture style. The overall shape of a system (N-tier, microservices, event-driven). You pick one style and many patterns; the style is the floor plan, patterns build the rooms.
- Fallacies of distributed computing. Eight (often nine) false assumptions – network reliable, latency zero, bandwidth infinite, network secure, topology stable, one admin, transport cost zero, network homogeneous – that cause outages when built upon; each cloud pattern defends against one or more.
- Anti-pattern. A recurring solution that looks reasonable but is reliably wrong (Busy Database, Chatty I/O, Noisy Neighbour, Retry Storm, Improper Instantiation, Monolithic Persistence, No Caching, Synchronous I/O).
- Transient fault. A short-lived, self-correcting failure (a blip, a throttle, a brief failover) – the target of Retry; distinct from a sustained outage (the target of Circuit Breaker).
- Idempotency. The property that performing an operation multiple times has the same effect as once – the precondition that makes Retry and at-least-once messaging safe.
- Compensating transaction. A semantic undo of a completed step (refund, release, cancel) used by Saga to restore consistency without a distributed rollback.
- Saga. A multi-service transaction composed of local transactions each with a compensating action; runs as orchestration (central coordinator) or choreography (event-driven).
- Deployment stamp / scale unit. A self-contained, independently deployable copy of the app+data serving a subset of tenants or a region; the unit of scale and blast-radius isolation.
- Geode. A geographical node that can serve any request, deployed across regions over a multi-write data layer for global low-latency, high-availability service.
- Sidecar / Ambassador. A co-located helper process adding cross-cutting capabilities to an app without changing its code (Ambassador = sidecar specialised for outbound networking).
- Bulkhead. Resource isolation into separate pools so one failing dependency cannot starve the rest – named for a ship’s watertight compartments.
- Valet key. A scoped, time-limited token (Azure SAS) granting a client direct, limited access to a specific storage resource without proxying through the app.
- Throttling vs. rate limiting. Throttling protects your service from inbound overload; rate limiting self-imposes outbound limits to respect a downstream’s quota.
- CQRS. Command and Query Responsibility Segregation – separate models/stores for writes and reads, scaling and shaping each independently, usually with eventual consistency.
- Materialized view. A pre-computed, stored query result refreshed from source data, read directly to avoid expensive on-demand computation; always rebuildable.
Next steps
You now have the full toolbox and – more importantly – the judgement to compose it and to recognise when it has been misused. The capstone of this module puts it all together.
- Next lesson: Mission-Critical (AlwaysOn) Architecture on Azure: The Apex Design – where pillars, styles, and these patterns converge into deployment stamps, active/active multi-region with composite-SLA math, and health modelling. This is where Deployment Stamps, Geode, Health Endpoint Monitoring, and the resilience compositions graduate from technique to architecture.
- Foundations to revisit: The Azure Well-Architected Framework deep dive (the pillars each pattern is tagged to) and Choosing an Architecture: Styles & the Ten Design Principles (the style these patterns live inside).
- Go deeper on individual patterns: Resiliency patterns that actually work (Retry + Circuit Breaker + Bulkhead composed), CQRS read-model projection pipelines, Event sourcing aggregate design, Saga orchestration vs. choreography, Strangler Fig monolith decomposition, Caching strategies, and the API gateway & BFF pattern.
- See patterns in whole systems: Multi-region active/active disaster recovery, Cell-based architecture & blast-radius isolation, and the Well-Architected reliability and security pillar deep dives.