Most teams arrive at event-driven architecture for the wrong reason: someone read that it “decouples services,” drew a box labelled “message bus” between the monolith and a new service, and called it a day. Six months later they have a distributed monolith — the same synchronous dependencies, now hidden behind a queue, with the added joy of debugging across process boundaries. Event-driven done well is not about adding a broker. It is about choosing, for every interaction in the system, whether it is a command (someone is asking for something to happen and cares about the answer) or an event (something happened and other parts of the business want to know), and then picking the Azure primitive that matches that semantic — not picking one broker and forcing every interaction through it.
This article is a concrete Azure reference architecture for that decision. The protagonists are Azure Kubernetes Service (AKS) for stateful, latency-sensitive domain services; Azure Service Bus for reliable command and work-queue messaging with ordering and transactions; Event Grid for high-fanout reactive notifications and the glue between Azure resources; Azure Functions for stateless event handlers and the occasional orchestration; Cosmos DB as the multi-region operational store with change feed; and API Management (APIM) as the governed synchronous edge. The running domain is an order-to-fulfilment platform for a mid-market retailer, because order processing is the canonical workload where commands, events, sagas, and read models all show up in one system and you cannot hand-wave any of them away.
The business scenario
Northwind Apparel is a fictional but representative company: an omnichannel clothing retailer doing roughly ₹400 crore (~USD 48M) in annual GMV across a website, a mobile app, ~40 physical stores, and three marketplace integrations (a domestic marketplace, plus Amazon and a regional one). They run on a five-year-old .NET monolith backed by a single SQL Server. The monolith is not the problem on a normal day. The problem shows up on the days that matter.
The pain that triggered the rebuild:
- Sale events take the site down. During an end-of-season sale, checkout, inventory decrement, payment, loyalty-points accrual, fraud scoring, and the marketplace inventory sync all run synchronously inside one HTTP request. When the payment gateway slows from 200 ms to 4 seconds, the request threads pile up, the SQL connection pool exhausts, and the entire site — including browsing, which needs none of that — falls over. One slow downstream takes everything down.
- Inventory is wrong everywhere at once. Because store POS, the website, and three marketplaces each write stock counts on their own schedule into the same tables with their own locking assumptions, oversells are routine on high-velocity SKUs. Customer service eats the chargebacks.
- Every new channel is a six-month integration. Onboarding a new marketplace means threading its calls through the monolith’s request path. The blast radius of a change is the whole application, so releases are quarterly and terrifying.
- No team owns anything end to end. Pricing, inventory, and fulfilment logic are tangled in shared tables, so no squad can ship independently.
The business goals the architecture must hit are not “use microservices.” They are: survive a 10x traffic spike on sale days without checkout failing; never oversell; onboard a new sales channel in under two weeks; and let four product squads release independently on their own cadence. Those goals are what make this event-driven rather than a request/response refactor — the spike survival and the never-oversell guarantee both demand that work be accepted quickly and processed reliably out of band, with a single source of truth for stock that everyone reacts to instead of races against.
Crucially, this is an architecture that scales down as well as up. A 20-person startup processing 500 orders a day can deploy the exact same shape on consumption-tier Functions, a Standard Service Bus namespace (the Basic tier has no topics or sessions, which this design relies on), and a serverless Cosmos account for a few thousand rupees a month, and grow into the AKS-and-Premium-tier version without re-architecting. That is what makes it a reference architecture and not a megacorp special.
Architecture overview
The system separates three planes that get muddled in monoliths: the synchronous request plane (a user is waiting), the asynchronous command plane (work that must happen reliably but not while the user waits), and the event notification plane (facts that many consumers react to). Each plane gets the Azure service that fits its semantics.
The end-to-end flow for placing an order:
- Ingress and edge. A client (web, mobile, store POS, marketplace webhook) calls a single hostname fronted by API Management, which itself sits behind Azure Front Door (global anycast, WAF, TLS, caching of static/cacheable GETs). APIM validates the JWT (issuer, audience, signature, expiry) against Microsoft Entra ID, enforces per-product rate limits and quotas, strips internal headers, and routes to the right backend. APIM is the only public door to the platform.
- Command acceptance (the fast path). POST /orders routes to the Order API, a service running on AKS. The Order API does the minimum synchronous work: validate the payload, check idempotency (more on this below), persist an Order aggregate in PENDING state to Cosmos DB, and, in the same logical transaction as the write, record an outbox entry. It returns 202 Accepted with an order id and a status URL in well under 200 ms. The user is no longer waiting on payment, fraud, or inventory. This is the single most important move in the whole design: the request returns as soon as the intent is durably captured.
- Reliable publication (outbox → bus). A change-feed processor (an Azure Functions app bound to the Cosmos DB change feed) reads newly-written order documents, and for each one publishes an OrderPlaced command/event onto Azure Service Bus. Because the Cosmos write and the document that drives publication are the same write, we get the transactional outbox guarantee: an order is never accepted-but-unpublished, and the change feed gives us at-least-once delivery to the bus with a checkpoint. (The deep mechanics of outbox/inbox live in a companion article; here we just wire it.)
- Command processing (Service Bus, ordered, reliable). Domain services subscribe to Service Bus topics. OrderPlaced fans out via topic subscriptions to: Payment service (charge the card), Inventory service (reserve stock), and Fraud service (score the order). These are commands with delivery guarantees and ordering needs, which is exactly what Service Bus is for. Inventory uses session-enabled subscriptions keyed by SKU so that all operations on one SKU are processed in order by one consumer, which is how we guarantee we never oversell under concurrency. Each handler is idempotent via an inbox (dedup table in Cosmos) so redelivery is safe.
- State changes emit facts (Event Grid, high fanout). When a domain service changes state, it publishes a notification (a fact) to Event Grid: PaymentCaptured, StockReserved, OrderConfirmed, OrderShipped. Event Grid is the right tool here because these are lightweight “it happened” signals that an unknown and growing number of consumers want, with push delivery, filtering, and retries. Subscribers include: a read-model projector (Function) that maintains a denormalised OrderView in Cosmos for the status page; the notification service (email/SMS via a Function); the loyalty service on AKS; the analytics pipeline (Event Grid → Event Hubs → the lakehouse); and the marketplace-sync service that pushes authoritative stock levels back out. New consumers attach by adding a subscription with a filter, with zero changes to producers. This is the property that turns “six-month channel onboarding” into “two weeks.”
- The saga / long-running coordination. Order fulfilment is a multi-step business transaction across services with no distributed transaction available: reserve stock → capture payment → confirm → allocate to a warehouse → ship. We coordinate it with a Durable Functions orchestration acting as a saga: it listens for the relevant Event Grid facts, advances the order state machine, and on failure runs compensating actions (release the stock reservation if payment fails; refund if allocation fails after capture). The orchestration state itself is durable and survives restarts.
- Read path (the status page never touches the write services). The customer’s “where’s my order?” page and the ops dashboard query the OrderView read model in Cosmos through a thin Query API on AKS, fronted by APIM and cached at Front Door. Reads are fully decoupled from writes (a CQRS split), so a flood of status checks during a sale cannot impact order acceptance.
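The fast path in the flow above can be sketched in a few lines. This is a minimal in-memory illustration, not the Order API itself: the `OrderStore` class, the `idempotency_key` parameter, and the dicts standing in for a Cosmos container are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class OrderStore:
    """In-memory stand-in for the orders container (hypothetical)."""
    orders: dict = field(default_factory=dict)   # order_id -> order document
    by_key: dict = field(default_factory=dict)   # idempotency key -> order_id

    def accept(self, idempotency_key: str, payload: dict) -> tuple[int, dict]:
        """Capture intent durably, then return 202 without downstream work."""
        if not payload.get("items"):
            return 400, {"error": "order has no items"}

        # A retried request with the same key returns the original order.
        existing = self.by_key.get(idempotency_key)
        if existing is not None:
            return 202, self._status(existing)

        order_id = str(uuid.uuid4())
        # The PENDING document doubles as the outbox entry: a change-feed
        # reader would pick it up and publish it to the bus.
        self.orders[order_id] = {**payload, "id": order_id, "status": "PENDING"}
        self.by_key[idempotency_key] = order_id
        return 202, self._status(order_id)

    def _status(self, order_id: str) -> dict:
        return {"orderId": order_id, "statusUrl": f"/orders/{order_id}/status"}
```

Nothing in `accept` calls payment, fraud, or inventory, which is the point: the response depends only on the write succeeding. A real store would need the key lookup and the write to be atomic, for example by using the idempotency key as the document id.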
In words, the diagram is three horizontal lanes. Top lane (blue, synchronous): clients → Front Door → APIM → Order API / Query API on AKS → Cosmos (write) and Cosmos OrderView (read). Middle lane (orange, commands): Cosmos change feed → Functions publisher → Service Bus topics → Payment / Inventory / Fraud handlers (AKS + Functions) → back to Cosmos. Bottom lane (green, events): domain services → Event Grid → {projector, notifications, loyalty, analytics→Event Hubs, marketplace-sync} and the Durable Functions saga orchestrator weaving across the middle and bottom lanes. Everything sits inside a hub-and-spoke VNet; all data services are reached over Private Endpoints; identity flows via Entra Workload Identity with no secrets in code.
The key architectural insight visible in that diagram: Service Bus carries commands (point-to-point/queued work with ordering and transactions), Event Grid carries events (broadcast facts with fanout and filtering). Using one for the other is the most common mistake, and the article’s component table makes the choice explicit.
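To make the "broadcast facts with fanout and filtering" half of that insight concrete, here is a toy topic in Python. It is a sketch of the semantics only and does not use the Event Grid SDK; the `Topic` and `Subscription` classes and the subject paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subscription:
    name: str
    event_types: set[str]                 # event types this subscriber wants
    handler: Callable[[dict], None]       # called once per matching event
    subject_prefix: str = ""              # optional subject filter


class Topic:
    """Toy broadcast topic: producers publish, the topic does the filtering."""

    def __init__(self) -> None:
        self.subscriptions: list[Subscription] = []

    def subscribe(self, sub: Subscription) -> None:
        self.subscriptions.append(sub)

    def publish(self, event: dict) -> list[str]:
        """Deliver to every matching subscription; return who received it."""
        delivered = []
        for sub in self.subscriptions:
            if event["eventType"] not in sub.event_types:
                continue
            if not event["subject"].startswith(sub.subject_prefix):
                continue
            sub.handler(event)
            delivered.append(sub.name)
        return delivered
```

The producer calls `publish` and never learns who is listening. Adding a consumer is one more `subscribe` call, which mirrors the "zero changes to producers" property described earlier.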
Component breakdown
Choosing the right messaging primitive
This decision is the spine of the whole architecture, so it gets the first table. Azure gives you four messaging services and they are not interchangeable.
| Service | Semantic | Use it for | Don’t use it for |
|---|---|---|---|
| Service Bus (queues/topics) | Reliable, ordered, transactional brokered messaging; pull-based with sessions, dead-lettering, scheduled delivery, duplicate detection | Commands and work that must be processed reliably, in order, exactly-once-ish: reserve stock, capture payment, fulfil order | High-fanout “notify everyone” patterns; millions of tiny telemetry events |
| Event Grid | Lightweight push notification of discrete events; fanout to many subscribers with server-side filtering and retry | Reactive “X happened” facts with many/unknown consumers: OrderConfirmed, BlobCreated, resource lifecycle | Ordered command processing; large payloads; anything needing transactions or sessions |
| Event Hubs | High-throughput streaming/ingestion, partitioned log, replay | Telemetry, clickstream, IoT, analytics ingestion — millions of events/sec | Per-message workflow with individual ack/dead-letter |
| Storage Queues | Dirt-simple, cheap queue, 7-day default message TTL, no topics/sessions | Trivial decoupling on a budget | Anything needing ordering, topics, dedup, or transactions |
The rule I apply in reviews: if a human or downstream cares about the result and the message names an imperative (“do this”), it is a command → Service Bus. If the message names a fact in past tense (“this happened”) and you don’t know or care who consumes it, it is an event → Event Grid. Northwind uses Service Bus for the order pipeline (capture/reserve/fulfil) and Event Grid for the fact fanout (confirmed/shipped/etc.), and Event Hubs only at the analytics tap.
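The review rule can be written down as a small decision function. This is one possible encoding rather than a definitive one: the `Interaction` fields are hypothetical, and sending ambiguous cases (a past-tense message whose sender still needs it processed reliably) to Service Bus is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Primitive(Enum):
    SERVICE_BUS = "Service Bus"
    EVENT_GRID = "Event Grid"
    EVENT_HUBS = "Event Hubs"


@dataclass(frozen=True)
class Interaction:
    name: str
    imperative: bool                  # "do this" (True) vs "this happened" (False)
    sender_needs_result: bool         # someone cares about the outcome
    telemetry_stream: bool = False    # high-volume stream for analytics


def choose_primitive(i: Interaction) -> Primitive:
    """Map the semantics of an interaction to a messaging primitive."""
    if i.telemetry_stream:
        return Primitive.EVENT_HUBS
    if i.imperative or i.sender_needs_result:
        return Primitive.SERVICE_BUS   # command or must-process work
    return Primitive.EVENT_GRID        # fact with unknown consumers
```

For example, `choose_primitive(Interaction("ReserveStock", True, True))` selects Service Bus, while an `OrderShipped` fact with no interested sender selects Event Grid.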
Component-by-component
| Component | Azure service | What it does here | Key configuration choices |
|---|---|---|---|
| Global edge | Front Door (Premium) + WAF | TLS, anycast, caching of cacheable GETs, WAF, DDoS | WAF in Prevention mode; Front Door → APIM locked via Private Link so APIM is not directly internet-reachable |
| API gateway | API Management (Premium, multi-region) | The single governed door: JWT validation, rate-limit/quota by product, routing, request/response policy, developer portal | validate-jwt against Entra; products = public, partner, internal; Premium for VNet injection + multi-region gateway + zone redundancy |
| Domain services | AKS (Standard tier, AZ-spread) | Order, Inventory, Payment, Loyalty, Query APIs — stateful, latency-sensitive, polyglot | System + user node pools; KEDA scaling on Service Bus queue length; Workload Identity; Istio/Cilium for mTLS east-west |
| Command bus | Service Bus (Premium) | Reliable ordered command transport with sessions, dead-letter, dedup | Premium for VNet/Private Endpoint, predictable throughput (MUs), and ≥4 messaging units in prod; sessions on Inventory subscription keyed by SKU; 5 max-delivery then dead-letter |
| Event broker | Event Grid (custom topic / namespace) | Fanout of domain facts to N subscribers with filtering and retry | Subscription filters by eventType and subject; dead-letter to Blob; MQTT/namespace topics if pull or higher scale is needed |
| Serverless handlers | Azure Functions (Premium/EP plan) | Outbox publisher (Cosmos change feed → Service Bus), Event Grid handlers, notifications, projector | Premium (Elastic) plan for VNet integration + no cold start on the hot path; consumption plan acceptable for the small-scale variant |
| Saga orchestrator | Durable Functions | Long-running order state machine with compensation | Orchestration per order id; fan-out/fan-in for parallel reserve+score; durable timers for SLA escalation |
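| Operational store | Cosmos DB (NoSQL API) | Write store for aggregates, inbox/outbox, and the OrderView read model | Partition key /customerId for orders, /sku for inventory; change feed drives the publisher and projector; session consistency as the account default (Cosmos sets the default per account, not per container, and strong consistency is not available with multi-region writes, so inventory correctness rests on the single-writer-per-SKU design) |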
| Streaming tap | Event Hubs + Stream Analytics / lakehouse | Analytics ingestion off Event Grid | Capture to ADLS; downstream to Synapse/Fabric or Databricks |
| Secrets/keys | Key Vault + Managed HSM | CMK for Cosmos/Storage, certs, any unavoidable secrets | Private Endpoint; RBAC data-plane; no secrets in app config — Workload Identity instead |
| Observability | Azure Monitor, App Insights, Log Analytics | Distributed tracing, metrics, logs, dead-letter alerts | OpenTelemetry from AKS + Functions; W3C trace context propagated through Service Bus/Event Grid via message properties |
Two configuration choices deserve emphasis because they are where teams quietly get burned:
Service Bus Premium, not Standard, for production. Standard is shared-tenant and metered per-operation; under a sale-day burst its throughput is unpredictable and you cannot put it on a Private Endpoint. Premium gives you dedicated messaging units, VNet isolation, and large-message support. Sessions and duplicate detection, which the pipeline relies on, are available on Standard as well, so the small-scale variant can absolutely start there. Just know the upgrade is a namespace change, so isolate the connection behind config.
Cosmos partition keys are a one-way door. Orders partition on /customerId (queries are almost always “this customer’s orders,” and a hot customer is rare). Inventory partitions on /sku and uses Service Bus sessions on the same key, so the partition that owns a SKU’s stock and the consumer that mutates it are aligned — single-writer-per-SKU is what makes the never-oversell guarantee hold without distributed locks. Get this wrong and you are migrating containers later, which is painful.
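The single-writer-per-SKU idea is easier to see in code. The following sketch simulates session semantics in memory. It is not the Service Bus SDK, and the request shape and the `StockRejected` event name are hypothetical.

```python
from collections import defaultdict


def reserve_by_sku_session(stock: dict[str, int], requests: list[dict]) -> list[dict]:
    """Process reservation requests one SKU session at a time.

    Each request is {"order_id", "sku", "qty"}. Requests for the same SKU are
    applied strictly in arrival order by a single logical consumer, so the
    check-then-decrement below never interleaves with another writer.
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for req in requests:                      # session id == SKU
        sessions[req["sku"]].append(req)

    results = []
    for sku, queue in sessions.items():       # different sessions could run in parallel
        available = stock.get(sku, 0)
        for req in queue:                     # within a session: strictly in order
            if req["qty"] <= available:
                available -= req["qty"]
                results.append({"order_id": req["order_id"], "event": "StockReserved"})
            else:
                results.append({"order_id": req["order_id"], "event": "StockRejected"})
        stock[sku] = available                # never goes below zero
    return results
```

Because only one consumer holds a given SKU's session at a time, the availability check and the decrement behave like a critical section without any lock service.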
Implementation guidance
Identity and networking wiring (do this first)
Everything else assumes a Zero-Trust foundation, so it is the first thing to stand up, not the last.
- One hub VNet, spoke per environment. AKS, Functions (Premium), APIM (Premium), and all Private Endpoints land in spokes peered to a hub holding Azure Firewall and the Front Door Private Link origin. No data service has a public endpoint — Cosmos, Service Bus, Event Grid, Key Vault, and Storage are all reached over Private Endpoints with Private DNS zones.
- Workload Identity, not connection strings. AKS pods and Function apps authenticate to Service Bus, Event Grid, and Cosmos via Microsoft Entra Workload Identity (federated, OIDC) mapped to RBAC data roles (Azure Service Bus Data Sender/Receiver, Cosmos DB Built-in Data Contributor, EventGrid Data Sender). There are no broker connection strings in the cluster. This single decision eliminates the most common Azure incident class: a leaked SAS key in a repo or pipeline log.
- APIM validates, services trust. APIM does coarse authN/Z (valid token, has scope orders:write). Services do fine-grained authZ (“can this user act on this order”). Don’t duplicate the JWT validation in every service; do duplicate the business authorization, because only the service knows the domain rule.
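The split between coarse and fine-grained authorization looks like this in miniature. The sketch assumes the token has already been validated and decoded into a `claims` dict, that scopes arrive in a space-separated `scp` claim, and that the `sub` claim maps directly to the order's `customerId`. That mapping is an assumption of the example, not something the design prescribes.

```python
def gateway_allows(claims: dict, required_scope: str) -> bool:
    """Coarse check at the edge: does the validated token carry the scope?"""
    return required_scope in claims.get("scp", "").split()


def service_allows(claims: dict, order: dict) -> bool:
    """Fine-grained domain rule: only the owning customer may act on the order."""
    return claims.get("sub") == order["customerId"]
```

A request must pass both checks. The gateway can reject a token without `orders:write` without knowing anything about orders, and the service can reject a valid token that targets someone else's order.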
Infrastructure as code
Use Bicep for the Azure resource graph (it is first-class for Azure, has no state file to babysit, and what-if previews are reliable), and reserve Terraform for the small minority of cross-cloud or third-party pieces if any exist. Structure the IaC as Azure Verified Modules (AVM) composed per workload, deployed through the Azure Developer CLI (azd) so app + infra ship together.
A representative slice — a Service Bus Premium namespace, a topic, and the Inventory subscription with sessions and dead-lettering:
```bicep
// Parameters referenced below; supplied per environment at deployment time.
param env string
param location string = resourceGroup().location

resource sb 'Microsoft.ServiceBus/namespaces@2024-01-01' = {
  name: 'sb-northwind-${env}'
  location: location
  sku: { name: 'Premium', tier: 'Premium', capacity: 4 } // 4 messaging units
  properties: {
    minimumTlsVersion: '1.2'
    publicNetworkAccess: 'Disabled' // private endpoint only
    zoneRedundant: true
  }
}

resource topicOrders 'Microsoft.ServiceBus/namespaces/topics@2024-01-01' = {
  parent: sb
  name: 'order-placed'
  properties: {
    requiresDuplicateDetection: true
    duplicateDetectionHistoryTimeWindow: 'PT10M'
    supportOrdering: true
  }
}

resource subInventory 'Microsoft.ServiceBus/namespaces/topics/subscriptions@2024-01-01' = {
  parent: topicOrders
  name: 'inventory'
  properties: {
    requiresSession: true // one consumer per SKU session => no oversell
    maxDeliveryCount: 5 // then dead-letter
    deadLetteringOnMessageExpiration: true
    lockDuration: 'PT1M'
  }
}
```
The corresponding KEDA ScaledObject lets the Inventory deployment on AKS scale on the backlog, which is the metric that actually matters for an event-driven worker — not CPU:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: inventory-worker }
spec:
  scaleTargetRef: { name: inventory-worker }
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: azure-servicebus
      metadata:
        topicName: order-placed
        subscriptionName: inventory
        messageCount: "20" # target backlog per replica
        namespace: sb-northwind-prod
      authenticationRef: { name: keda-wi } # Workload Identity, no keys
```
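As a rough guide to what those numbers mean, the replica count the autoscaler aims for is approximately the backlog divided by the per-replica target, clamped to the configured bounds. The function below is a simplified model of that arithmetic and an assumption of this sketch. The real behaviour also depends on polling intervals, tolerance, and stabilization windows.

```python
import math


def desired_replicas(backlog: int, target_per_replica: int = 20,
                     min_replicas: int = 2, max_replicas: int = 30) -> int:
    """Approximate the scaling target: backlog / target, clamped to the bounds."""
    wanted = math.ceil(backlog / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With the values from the manifest, an empty subscription holds at the floor of 2, a backlog of 440 messages asks for 22 replicas, and anything above 600 messages saturates at the ceiling of 30.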
For the outbox publisher, bind a Function to the Cosmos change feed and send to Service Bus — the lease container gives you the checkpoint that makes delivery at-least-once:
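```csharp
using Azure.Messaging.ServiceBus;
using Microsoft.Azure.Functions.Worker;

public class OutboxPublisher
{
    // Assumes a ServiceBusSender for the order-placed topic is registered in DI,
    // built from a ServiceBusClient that authenticates with managed identity.
    private readonly ServiceBusSender _sender;
    public OutboxPublisher(ServiceBusSender sender) => _sender = sender;

    [Function("OutboxPublisher")]
    public async Task Run(
        [CosmosDBTrigger(
            databaseName: "northwind",
            containerName: "orders",
            Connection = "CosmosConnection",
            LeaseContainerName = "leases",
            CreateLeaseContainerIfNotExists = true)] IReadOnlyList<OrderDoc> changes)
    {
        // The isolated-worker output binding only serializes the body, so the
        // sender is used directly to set SessionId, MessageId, and properties.
        var messages = changes
            .Where(o => o.Status == "PENDING")
            .Select(o => new ServiceBusMessage(BinaryData.FromObjectAsJson(o))
            {
                SessionId = o.Sku,      // honour SKU ordering downstream
                MessageId = o.OrderId,  // enables dedup
                ApplicationProperties = { ["traceparent"] = o.TraceParent }
            })
            .ToList();
        if (messages.Count > 0) await _sender.SendMessagesAsync(messages);
    }
}
```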
Note the three properties carried on every message: SessionId (ordering), MessageId (duplicate detection / idempotency), and traceparent (W3C trace propagation so the distributed trace survives the async hop). Propagating trace context through the broker is the difference between a debuggable system and a haunted house.
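For readers who have not handled the header directly, a `traceparent` value has four dash-separated fields: version, a 32-hex-character trace id, a 16-hex-character parent span id, and flags. The sketch below shows the propagation rule each hop follows. The function names are hypothetical, and in practice an OpenTelemetry SDK would do this for you.

```python
import re
import secrets

# version 00, 16-byte trace id, 8-byte span id, 1-byte flags (all lowercase hex)
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent: str) -> str:
    """Keep the trace id and flags, mint a new span id for the next hop."""
    match = TRACEPARENT.match(parent)
    if match is None:
        return new_traceparent()       # malformed header: start a fresh trace
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A consumer reads the value from the message's application properties, derives a child for its own span, and stamps that child on anything it publishes. The trace id stays constant across every hop.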
Deployment and progressive delivery
Deploy services through GitHub Actions → AKS with Argo Rollouts or Flagger for canary, gated on App Insights metrics (error rate, p95 latency, dead-letter count). Because producers and consumers are decoupled by the broker, you deploy them independently, but you must keep event schemas backward-compatible (additive-only). Register schemas in a schema registry (Azure Schema Registry, which is hosted in an Event Hubs namespace, or another Avro/JSON Schema registry) and enforce compatibility in CI so a producer can’t ship a breaking change that silently poisons every consumer’s dead-letter queue.
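An additive-only rule is simple enough to check mechanically. The sketch below is one way a CI step might do it, assuming schemas are reduced to a flat `{"fields": {name: type}, "required": [...]}` shape. That shape and the function name are assumptions of the example, not a registry format.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List the ways `new` breaks consumers written against `old`."""
    problems = []
    for name, old_type in old["fields"].items():
        if name not in new["fields"]:
            problems.append(f"removed field: {name}")
        elif new["fields"][name] != old_type:
            problems.append(f"type changed: {name}")
    # Consumers may rely on required fields always being present.
    for name in old.get("required", []):
        if name in new["fields"] and name not in new.get("required", []):
            problems.append(f"required field made optional: {name}")
    return problems
```

A pipeline would fail the build when the returned list is non-empty. Adding a new optional field produces an empty list, which is the additive case the rule allows.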
Enterprise considerations
Security and Zero Trust
The architecture is Zero Trust by construction, not by bolt-on:
- No public data plane. Every backing service is Private-Endpoint-only; the only internet ingress is Front Door → APIM. Lateral movement is contained by AKS network policies and service-mesh mTLS.
- No secrets. Workload Identity federates AKS/Functions to Entra; data-plane RBAC is least-privilege per service (the Payment service can receive from its subscription and write its own Cosmos container, nothing else). A familiar failure mode, old database credentials leaked into git, is exactly what this design removes by construction: there are no broker or DB connection strings to leak.
- Encryption at rest with customer-managed keys in Key Vault/Managed HSM; TLS 1.2+ everywhere; Cosmos and Service Bus payloads can carry field-level encryption for PII (card tokens never touch your store — use the PSP’s vault and keep only a token).
- Defender for Cloud + Defender for Containers for posture and runtime; Microsoft Sentinel ingests APIM, Front Door WAF, and AKS audit logs for detection.
Cost optimization
Event-driven on Azure is cost-elastic if you let it be:
- Scale to (near) zero off-peak. KEDA scales Inventory/Payment workers down to a small floor overnight; Functions on consumption bill per-execution. Cosmos uses autoscale RU/s (or serverless for the small variant) so you pay for the sale-day peak only on sale day.
- Right-size the broker. Service Bus Premium messaging units are the biggest fixed line item; size to steady-state and rely on the queue to absorb bursts (the whole point of buffering) rather than over-provisioning for peak. A burst that doubles latency-to-process for 20 minutes is usually a fine trade vs. paying for 4x MUs year-round.
- Front Door caching offloads cacheable GETs (product status pages with short TTLs) from APIM/AKS, cutting both compute and egress.
- Tag and budget by squad. Each domain service is its own resource group with cost tags; FinOps alerts per team make the bill legible. Northwind’s blended steady-state lands around ₹6–7 lakh/month at their scale, with sale-day spikes auto-scaling and auto-receding.
Scalability and reliability (RTO/RPO)
- Scalability: the buffering broker is the shock absorber. A 10x checkout spike turns into a deeper Service Bus backlog, KEDA adds consumers, and the user-facing 202 latency is unaffected because acceptance is decoupled from processing. This is the core mechanism by which the sale-day-outage problem disappears.
- Reliability: at-least-once delivery + idempotent inbox handlers + dead-letter queues mean no order is silently lost; a poison message lands in DLQ with an alert instead of crash-looping a consumer. The Durable Functions saga guarantees that partial failures compensate (no captured-payment-without-stock).
- DR with explicit RTO/RPO. Deploy active-active across two paired regions (e.g., Central India + South India, or a primary + paired region). Cosmos DB multi-region writes give RPO ≈ seconds and near-zero RTO for the data tier. APIM Premium and Front Door are multi-region; AKS and Functions are deployed per region with Front Door health-probing failover. Service Bus uses Geo-Disaster-Recovery (Geo-DR) pairing (metadata replication; for full message-data replication use the Premium geo-replication feature). Realistic targets for the platform: RTO ≤ 15 min, RPO ≤ 1 min for the order pipeline. The single most important DR property is that accepted orders are already durably in Cosmos and (via outbox) guaranteed to be published — a regional failure cannot lose an accepted order, only delay its processing.
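The compensation behaviour mentioned in the reliability bullet can be illustrated without any orchestration framework. The sketch below is plain Python, not the Durable Functions API, and it has none of that runtime's durability. The step functions and the `card_declined` flag are hypothetical.

```python
def run_saga(steps, context):
    """Run (action, compensation) pairs in order; undo completed steps on failure.

    Returns (succeeded, log), where log lists the functions that were invoked.
    """
    log, completed = [], []
    for action, compensation in steps:
        try:
            action(context)
            log.append(action.__name__)
            completed.append(compensation)
        except Exception:
            log.append(f"{action.__name__} failed")
            for undo in reversed(completed):   # compensate newest-first
                undo(context)
                log.append(undo.__name__)
            return False, log
    return True, log


def reserve_stock(ctx):
    ctx["reserved"] = True

def release_stock(ctx):
    ctx["reserved"] = False

def capture_payment(ctx):
    if ctx.get("card_declined"):
        raise RuntimeError("payment declined")
    ctx["captured"] = True

def refund_payment(ctx):
    ctx["captured"] = False


ORDER_SAGA = [(reserve_stock, release_stock), (capture_payment, refund_payment)]
```

When payment fails, only the steps that actually completed are compensated: the stock reservation is released, and no refund is attempted because nothing was captured.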
Observability
- Distributed tracing is non-negotiable in async systems. Propagate W3C traceparent through Service Bus application properties and Event Grid event data so one trace spans Order API → bus → Inventory → Event Grid → projector. App Insights’ Application Map then shows the async topology, not just the sync hops.
- The metrics that matter are not CPU: Service Bus active message count and dead-letter count per subscription, Event Grid delivery failures, Functions execution backlog, and end-to-end order-accepted-to-confirmed latency as a business SLO. Alert on rising DLQ depth above all; it is the earliest signal that a consumer is broken.
- Correlation IDs flow from APIM (request-id) through every hop, so support can paste an order id and see the full saga timeline.
Governance
- Azure Landing Zones provide the management-group hierarchy, Azure Policy (deny public endpoints, require CMK, enforce tags, allowed regions), and platform-level networking. The workload lands in an application landing zone.
- Schema governance: the event/command schema registry with CI-enforced backward compatibility is the governance control that keeps a decoupled system from rotting — it is to events what an API contract is to REST.
- Subscription/RBAC governance via PIM for break-glass; everything else is least-privilege Workload Identity.
Reference enterprise example
Northwind Apparel ran the rebuild over three quarters. Concrete decisions and numbers:
Scale and traffic. Steady state ~6,000 orders/day; sale-day peak observed at 58,000 orders/day with a 9-minute burst hitting ~140 orders/second at checkout. The old monolith fell over at ~25 orders/second.
What they built.
- Front Door Premium + APIM Premium (2 regions: Central India primary, South India secondary), 3 products (public, partner for marketplaces, internal).
- AKS Standard, 2 user node pools, KEDA-scaled Order/Inventory/Payment/Loyalty/Query services. Inventory uses SKU-session subscriptions.
- Service Bus Premium, 4 MU, topics order-placed, payment, fulfilment; Inventory subscription session-enabled.
- Event Grid custom topic for OrderConfirmed/Shipped/Cancelled, 7 subscribers including the marketplace-sync service.
- Cosmos DB, multi-region writes both regions, autoscale 4,000→40,000 RU/s on the orders container; OrderView read model projected by a Function off the change feed.
- Durable Functions saga for the reserve→capture→confirm→allocate→ship state machine with compensation.
Decisions that paid off.
- 202-Accept on the hot path. Checkout p95 dropped from 3.8 s (under load) to 140 ms, because the user no longer waits on payment/fraud/inventory. The 9-minute burst became a Service Bus backlog that drained in ~6 minutes after the peak while KEDA ran Inventory at 22 replicas. Zero checkout failures on the first post-launch sale.
- SKU sessions killed oversells. Oversell incidents on high-velocity SKUs went from ~40/sale to 0, because all decrements for one SKU serialize through one consumer over an authoritative Cosmos partition.
- Marketplace onboarding of a fourth marketplace took 9 days: add a partner APIM product, point the marketplace-sync Event Grid subscriber at it. No producer changed.
- Independent releases. Four squads now ship 30–40 times a week combined behind canaries; the quarterly-release era is over.
A failure they handled gracefully. Two months in, a bad deploy of the Fraud service started throwing on a subset of orders. Because fraud scoring is a Service Bus consumer with maxDeliveryCount: 5, the affected messages dead-lettered with an alert instead of blocking the pipeline; payment and confirmation continued for unaffected orders, and the DLQ was replayed after the fix. In the monolith, that same bug would have failed checkouts outright.
Outcome. Steady-state Azure spend ~₹6.4 lakh/month (vs. an over-provisioned monolith that cost more to keep upright on sale days), sale-day survivability proven at 10x, zero oversells, and a platform four teams can evolve independently. The business signed off on the rebuild on the strength of the first sale event alone.
When to use it
Use this architecture when:
- You have bursty or spiky load where accept-fast / process-reliably genuinely helps (commerce, ticketing, onboarding, claims, IoT command-and-control).
- You need independent team and deployment autonomy across distinct business capabilities.
- You have multiple producers/consumers of the same business facts (the “many channels react to one stock truth” shape) — this is where Event Grid fanout earns its keep.
- Reliability and never-lose-the-work matter more than the simplest possible request/response.
Be honest about the trade-offs:
- Eventual consistency is now your problem. The order is PENDING for a moment; the read model lags the write by milliseconds-to-seconds. The UI and the business must tolerate “accepted, processing.” If your domain truly needs read-your-write strong consistency on every operation, an event-driven split fights you.
- Operational complexity goes up. You are now running a broker, dead-letter queues, sagas, and distributed tracing. Without the observability and idempotency discipline described above, you have built something harder to debug than the monolith, not easier. Do not adopt this without committing to tracing and DLQ alerting from day one.
- It is more moving parts than a small problem needs. 500 orders a day with one team and no spikes? A well-structured modular monolith on App Service + SQL is the right answer; you can carve out events later when a real seam appears.
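The eventual-consistency trade-off is easiest to reason about with the projector in front of you. The sketch below shows a read-model update that tolerates duplicate and out-of-order deliveries. The per-order `sequence` number and the status names are assumptions of the example.

```python
def project(view: dict, event: dict) -> dict:
    """Apply one fact to the OrderView read model; safe to call twice per event."""
    order_id = event["orderId"]
    row = view.setdefault(order_id, {"status": "PENDING", "version": 0})
    if event["sequence"] <= row["version"]:
        return view                      # duplicate or stale delivery: ignore
    status_by_event = {
        "PaymentCaptured": "PAID",
        "OrderConfirmed": "CONFIRMED",
        "OrderShipped": "SHIPPED",
    }
    row["status"] = status_by_event.get(event["eventType"], row["status"])
    row["version"] = event["sequence"]
    return view
```

Until the first fact arrives the row simply reads PENDING, which is the "accepted, processing" state the UI has to be designed around. A late `PaymentCaptured` arriving after `OrderConfirmed` is dropped rather than moving the status backwards.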
Anti-patterns to avoid (each one is a real incident waiting to happen):
- One broker for everything. Forcing high-fanout events through Service Bus topics (or ordered commands through Event Grid) fights the tool. Match the primitive to the semantic — that is what the messaging table is for.
- The distributed monolith. Services that call each other synchronously over the broker and block on the reply have all the cost of distribution with none of the decoupling. If a “command” is just a blocking RPC with extra steps, it should be a direct call or shouldn’t exist.
- Events that carry no useful state but force a callback. A fact event that only says “something changed, call me back to find out what” reintroduces the coupling you paid to remove. Carry enough state in the event (or version it) for common consumers to act without a synchronous round-trip.
- No idempotency / no dead-letter strategy. At-least-once delivery will redeliver. Handlers that aren’t idempotent will double-charge; pipelines without DLQ will crash-loop on the first poison message. These aren’t edge cases — they are guaranteed to happen in week one.
- Shared database across services. If two services write the same table, you do not have microservices; you have a monolith with a network in the middle. Each service owns its data; they integrate through events.
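The idempotency and dead-letter anti-pattern is worth seeing as code, because the fix is small. The following sketch simulates at-least-once delivery of one message to one consumer. The `deliver` function, the in-memory inbox set, and the return strings are hypothetical, and the default of 5 attempts simply mirrors the maxDeliveryCount used earlier.

```python
def deliver(message: dict, handler, inbox: set, dead_letters: list,
            max_delivery_count: int = 5) -> str:
    """Simulate at-least-once delivery with an inbox and a dead-letter queue."""
    for attempt in range(1, max_delivery_count + 1):
        if message["id"] in inbox:
            return "duplicate-skipped"        # already processed: ack and move on
        try:
            handler(message)
        except Exception:
            continue                          # lock released, broker redelivers
        inbox.add(message["id"])              # record only after success
        return f"processed-on-attempt-{attempt}"
    dead_letters.append(message)              # poison message: park it and alert
    return "dead-lettered"
```

In a real handler the inbox write and the side effect need to commit together (for example in one transactional batch on the same partition). Otherwise a crash between the two reopens the double-processing window.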
Alternatives worth weighing: a modular monolith (simplest, best until you have proven independent-scaling or independent-deployment needs); Dapr on AKS/Container Apps if you want the same patterns with a portable building-block abstraction over the brokers (pub/sub, state, bindings) and less Azure-specific glue; Azure Container Apps instead of full AKS if you want event-driven, KEDA-built-in, serverless containers without operating Kubernetes — a strictly better starting point for most teams until AKS-level control is actually required; and pure Functions + Service Bus/Event Grid with no Kubernetes at all for the genuinely small or spiky-but-stateless workload. The shape of the architecture — three planes, commands vs. events, outbox, idempotent handlers, saga — stays the same across all of them. Only the compute host changes.