
Azure Enterprise Architecture: Event-Driven Microservices

Most teams arrive at event-driven architecture for the wrong reason: someone read that it “decouples services,” drew a box labelled “message bus” between the monolith and a new service, and called it a day. Six months later they have a distributed monolith — the same synchronous dependencies, now hidden behind a queue, with the added joy of debugging across process boundaries. Event-driven done well is not about adding a broker. It is about choosing, for every interaction in the system, whether it is a command (someone is asking for something to happen and cares about the answer) or an event (something happened and other parts of the business want to know), and then picking the Azure primitive that matches that semantic — not picking one broker and forcing every interaction through it.

This article is a concrete Azure reference architecture for that decision. The protagonists are Azure Kubernetes Service (AKS) for stateful, latency-sensitive domain services; Azure Service Bus for reliable command and work-queue messaging with ordering and transactions; Event Grid for high-fanout reactive notifications and the glue between Azure resources; Azure Functions for stateless event handlers and the occasional orchestration; Cosmos DB as the multi-region operational store with change feed; and API Management (APIM) as the governed synchronous edge. The running domain is an order-to-fulfilment platform for a mid-market retailer, because order processing is the canonical workload where commands, events, sagas, and read models all show up in one system and you cannot hand-wave any of them away.

The business scenario

Northwind Apparel is a fictional but representative company: an omnichannel clothing retailer doing roughly ₹400 crore (~USD 48M) in annual GMV across a website, a mobile app, ~40 physical stores, and three marketplace integrations (a domestic marketplace, plus Amazon and a regional one). They run on a five-year-old .NET monolith backed by a single SQL Server. The monolith is not the problem on a normal day. The problem shows up on the days that matter.

The pain that triggered the rebuild:

The business goals the architecture must hit are not “use microservices.” They are: survive a 10x traffic spike on sale days without checkout failing; never oversell; onboard a new sales channel in under two weeks; and let four product squads release independently on their own cadence. Those goals are what make this event-driven rather than a request/response refactor — the spike survival and the never-oversell guarantee both demand that work be accepted quickly and processed reliably out of band, with a single source of truth for stock that everyone reacts to instead of races against.

Crucially, this is an architecture that scales down as well as up. A 20-person startup processing 500 orders a day can deploy the exact same shape on consumption-tier Functions, a Standard Service Bus namespace (Basic has no topics or sessions, which this design depends on), and a serverless Cosmos account for a few thousand rupees a month, and grow into the AKS-and-Premium-tier version without re-architecting. That is what makes it a reference architecture and not a megacorp special.

Architecture overview

The system separates three planes that get muddled in monoliths: the synchronous request plane (a user is waiting), the asynchronous command plane (work that must happen reliably but not while the user waits), and the event notification plane (facts that many consumers react to). Each plane gets the Azure service that fits its semantics.

Figure: Azure event-driven microservices reference architecture. Synchronous request plane: Front Door, API Management, Order/Query APIs on AKS, Cosmos DB. Asynchronous command plane: Cosmos change feed, Functions outbox publisher, Service Bus topic, Payment/Inventory/Fraud handlers. Event Grid notification plane: fanout to projector, notifications, loyalty, marketplace-sync and Event Hubs. A Durable Functions saga spans the planes, with Workload Identity over Private Endpoints throughout.

The end-to-end flow for placing an order:

  1. Ingress and edge. A client (web, mobile, store POS, marketplace webhook) calls a single hostname fronted by API Management, which itself sits behind Azure Front Door (global anycast, WAF, TLS, caching of static/cacheable GETs). APIM validates the JWT (issuer, audience, signature, expiry) against Microsoft Entra ID, enforces per-product rate limits and quotas, strips internal headers, and routes to the right backend. APIM is the only public door to the platform.

  2. Command acceptance (the fast path). POST /orders routes to the Order API, a service running on AKS. The Order API does the minimum synchronous work: validate the payload, check idempotency (more on this below), persist an Order aggregate in PENDING state to Cosmos DB, and — in the same logical transaction as the write — record an outbox entry. It returns 202 Accepted with an order id and a status URL in well under 200 ms. The user is no longer waiting on payment, fraud, or inventory. This is the single most important move in the whole design: the request returns as soon as the intent is durably captured.

  3. Reliable publication (outbox → bus). A change-feed processor (an Azure Functions app bound to the Cosmos DB change feed) reads newly-written order documents, and for each one publishes an OrderPlaced command/event onto Azure Service Bus. Because the Cosmos write and the document that drives publication are the same write, we get the transactional outbox guarantee: an order is never accepted-but-unpublished, and the change feed gives us at-least-once delivery to the bus with a checkpoint. (The deep mechanics of outbox/inbox live in a companion article; here we just wire it.)

  4. Command processing (Service Bus, ordered, reliable). Domain services subscribe to Service Bus topics. OrderPlaced fans out via topic subscriptions to: Payment service (charge the card), Inventory service (reserve stock), and Fraud service (score the order). These are commands with delivery guarantees and ordering needs — exactly what Service Bus is for. Inventory uses session-enabled subscriptions keyed by SKU so that all operations on one SKU are processed in order by one consumer, which is how we guarantee we never oversell under concurrency. Each handler is idempotent via an inbox (dedup table in Cosmos) so redelivery is safe.

  5. State changes emit facts (Event Grid, high fanout). When a domain service changes state, it publishes a notification — a fact — to Event Grid: PaymentCaptured, StockReserved, OrderConfirmed, OrderShipped. Event Grid is the right tool here because these are lightweight “it happened” signals that an unknown and growing number of consumers want, with push delivery, filtering, and retries. Subscribers include: a read-model projector (Function) that maintains a denormalised OrderView in Cosmos for the status page; the notification service (email/SMS via a Function); the loyalty service on AKS; the analytics pipeline (Event Grid → Event Hubs → the lakehouse); and the marketplace-sync service that pushes authoritative stock levels back out. New consumers attach by adding a subscription with a filter — zero changes to producers. This is the property that turns “six-month channel onboarding” into “two weeks.”

  6. The saga / long-running coordination. Order fulfilment is a multi-step business transaction across services with no distributed transaction available: reserve stock → capture payment → confirm → allocate to a warehouse → ship. We coordinate it with a Durable Functions orchestration acting as a saga: it listens for the relevant Event Grid facts, advances the order state machine, and on failure runs compensating actions (release the stock reservation if payment fails; refund if allocation fails after capture). The orchestration state itself is durable and survives restarts.

  7. Read path (the status page never touches the write services). The customer’s “where’s my order?” page and the ops dashboard query the OrderView read model in Cosmos through a thin Query API on AKS, fronted by APIM and cached at Front Door. Reads are fully decoupled from writes — a CQRS split — so a flood of status checks during a sale cannot impact order acceptance.
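The fast path in step 2 is easier to reason about with a toy version in hand. The following is a minimal sketch, assuming an in-memory dictionary in place of the Cosmos DB container and a client-supplied idempotency key; `OrderStore` and its field names are hypothetical, not part of Northwind's code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class OrderStore:
    """In-memory stand-in for the orders container (hypothetical)."""
    by_idempotency_key: dict = field(default_factory=dict)

    def accept(self, idempotency_key, payload):
        # A retried request with the same key gets the original order back
        # instead of creating a second one.
        existing = self.by_idempotency_key.get(idempotency_key)
        if existing is not None:
            return 202, existing

        if not payload.get("items"):
            return 400, {"error": "order must contain at least one item"}

        order_id = str(uuid.uuid4())
        # One document carries both the aggregate state and the status the
        # publisher later filters on, so "saved" and "queued for publication"
        # cannot diverge.
        doc = {
            "orderId": order_id,
            "status": "PENDING",
            "items": payload["items"],
            "statusUrl": f"/orders/{order_id}",
        }
        self.by_idempotency_key[idempotency_key] = doc
        return 202, doc
```

The point of the sketch is the shape, not the storage: validate, deduplicate, write once, return 202. Payment, fraud and inventory never appear in this code path.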

In words, the diagram is three horizontal lanes. Top lane (blue, synchronous): clients → Front Door → APIM → Order API / Query API on AKS → Cosmos (write) and Cosmos OrderView (read). Middle lane (orange, commands): Cosmos change feed → Functions publisher → Service Bus topics → Payment / Inventory / Fraud handlers (AKS + Functions) → back to Cosmos. Bottom lane (green, events): domain services → Event Grid → {projector, notifications, loyalty, analytics→Event Hubs, marketplace-sync} and the Durable Functions saga orchestrator weaving across the middle and bottom lanes. Everything sits inside a hub-and-spoke VNet; all data services are reached over Private Endpoints; identity flows via Entra Workload Identity with no secrets in code.
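The saga's compensation logic is independent of the host that runs it. As a rough sketch, assuming plain Python callables instead of Durable Functions activities (`run_order_saga` and the step names are illustrative assumptions), the orchestrator runs steps in order and, on failure, undoes the completed ones in reverse:

```python
def run_order_saga(order, steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    `steps` is a list of (name, action, compensation) tuples. Each action and
    compensation receives the order dict. Returns (final_status, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action(order)
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            # Undo the most recent successful step first.
            for done_name, undo in reversed(completed):
                undo(order)
                log.append(f"compensated:{done_name}")
            return "CANCELLED", log
    return "CONFIRMED", log
```

A real orchestration also has to persist its progress between steps and make each compensation idempotent; the sketch leaves both out and shows only the ordering rule.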

The key architectural insight visible in that diagram: Service Bus carries commands (point-to-point/queued work with ordering and transactions), Event Grid carries events (broadcast facts with fanout and filtering). Using one for the other is the most common mistake, and the article’s component table makes the choice explicit.

Component breakdown

Choosing the right messaging primitive

This decision is the spine of the whole architecture, so it gets the first table. Azure gives you four messaging services and they are not interchangeable.

| Service | Semantic | Use it for | Don’t use it for |
|---|---|---|---|
| Service Bus (queues/topics) | Reliable, ordered, transactional brokered messaging; pull-based with sessions, dead-lettering, scheduled delivery, duplicate detection | Commands and work that must be processed reliably, in order, exactly-once-ish: reserve stock, capture payment, fulfil order | High-fanout “notify everyone” patterns; millions of tiny telemetry events |
| Event Grid | Lightweight push notification of discrete events; fanout to many subscribers with server-side filtering and retry | Reactive “X happened” facts with many/unknown consumers: OrderConfirmed, BlobCreated, resource lifecycle | Ordered command processing; large payloads; anything needing transactions or sessions |
| Event Hubs | High-throughput streaming/ingestion, partitioned log, replay | Telemetry, clickstream, IoT, analytics ingestion (millions of events/sec) | Per-message workflow with individual ack/dead-letter |
| Storage Queues | Dirt-simple, cheap queue, 7-day TTL, no topics/sessions | Trivial decoupling on a budget | Anything needing ordering, topics, dedup, or transactions |

The rule I apply in reviews: if a human or a downstream system cares about the result and the message names an imperative (“do this”), it is a command → Service Bus. If the message names a fact in past tense (“this happened”) and you don’t know or care who consumes it, it is an event → Event Grid. Northwind uses Service Bus for the order pipeline (capture/reserve/fulfil) and Event Grid for the fact fanout (confirmed/shipped/etc.), and Event Hubs only at the analytics tap.
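One way to make that rule enforceable rather than tribal is an explicit registry that every message type must join before it can be sent. This is only a sketch: the command names are hypothetical examples of the imperative style, the event names are the ones used in this article, and `transport_for` is an assumed helper.

```python
from enum import Enum


class Transport(Enum):
    SERVICE_BUS = "service-bus"
    EVENT_GRID = "event-grid"


# Commands are imperative; events are past-tense facts.
COMMANDS = {"ReserveStock", "CapturePayment", "FulfilOrder"}
EVENTS = {"PaymentCaptured", "StockReserved", "OrderConfirmed", "OrderShipped"}


def transport_for(message_type: str) -> Transport:
    """Pick the broker for a message type, refusing anything unclassified."""
    if message_type in COMMANDS:
        return Transport.SERVICE_BUS
    if message_type in EVENTS:
        return Transport.EVENT_GRID
    raise ValueError(f"unclassified message type: {message_type}")
```

Raising on an unknown type is deliberate: a new message cannot ship until someone has decided, in review, which of the two it is.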

Component-by-component

| Component | Azure service | What it does here | Key configuration choices |
|---|---|---|---|
| Global edge | Front Door (Premium) + WAF | TLS, anycast, caching of cacheable GETs, WAF, DDoS | WAF in Prevention mode; Front Door → APIM locked via Private Link so APIM is not directly internet-reachable |
| API gateway | API Management (Premium, multi-region) | The single governed door: JWT validation, rate-limit/quota by product, routing, request/response policy, developer portal | validate-jwt against Entra; products = public, partner, internal; Premium for VNet injection + multi-region gateway + zone redundancy |
| Domain services | AKS (Standard tier, AZ-spread) | Order, Inventory, Payment, Loyalty, Query APIs: stateful, latency-sensitive, polyglot | System + user node pools; KEDA scaling on Service Bus queue length; Workload Identity; Istio/Cilium for mTLS east-west |
| Command bus | Service Bus (Premium) | Reliable ordered command transport with sessions, dead-letter, dedup | Premium for VNet/Private Endpoint, predictable throughput (MUs), and ≥4 messaging units in prod; sessions on Inventory subscription keyed by SKU; 5 max-delivery then dead-letter |
| Event broker | Event Grid (custom topic / namespace) | Fanout of domain facts to N subscribers with filtering and retry | Subscription filters by eventType and subject; dead-letter to Blob; MQTT/namespace topics if pull or higher scale is needed |
| Serverless handlers | Azure Functions (Premium/EP plan) | Outbox publisher (Cosmos change feed → Service Bus), Event Grid handlers, notifications, projector | Premium (Elastic) plan for VNet integration + no cold start on the hot path; consumption plan acceptable for the small-scale variant |
| Saga orchestrator | Durable Functions | Long-running order state machine with compensation | Orchestration per order id; fan-out/fan-in for parallel reserve+score; durable timers for SLA escalation |
| Operational store | Cosmos DB (NoSQL API) | Write store for aggregates, inbox/outbox, and the OrderView read model | Partition key /customerId for orders, /sku for inventory; change feed drives the publisher and projector; session consistency default, strong on the inventory container’s region |
| Streaming tap | Event Hubs + Stream Analytics / lakehouse | Analytics ingestion off Event Grid | Capture to ADLS; downstream to Synapse/Fabric or Databricks |
| Secrets/keys | Key Vault + Managed HSM | CMK for Cosmos/Storage, certs, any unavoidable secrets | Private Endpoint; RBAC data-plane; no secrets in app config, Workload Identity instead |
| Observability | Azure Monitor, App Insights, Log Analytics | Distributed tracing, metrics, logs, dead-letter alerts | OpenTelemetry from AKS + Functions; W3C trace context propagated through Service Bus/Event Grid via message properties |

Two configuration choices deserve emphasis because they are where teams quietly get burned:

Service Bus Premium, not Standard, for production. Standard is shared-tenant and metered per-operation; under a sale-day burst its throughput is unpredictable and you cannot put it on a Private Endpoint. Premium gives you dedicated messaging units, VNet isolation, and large-message support. Sessions and duplicate detection, which the pipeline relies on, exist on Standard too, so the small-scale variant can absolutely start there. Just know the upgrade is a namespace change, so isolate the connection behind config.

Cosmos partition keys are a one-way door. Orders partition on /customerId (queries are almost always “this customer’s orders,” and a hot customer is rare). Inventory partitions on /sku and uses Service Bus sessions on the same key, so the partition that owns a SKU’s stock and the consumer that mutates it are aligned — single-writer-per-SKU is what makes the never-oversell guarantee hold without distributed locks. Get this wrong and you are migrating containers later, which is painful.
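To see why single-writer-per-SKU removes the need for locks, here is a small in-memory simulation. It assumes a dictionary for stock and a set for the inbox; `InventoryWorker` and `process_by_session` are hypothetical names, and real session handling in Service Bus involves locks and lock renewal that are not modelled here.

```python
from collections import defaultdict


class InventoryWorker:
    """Handles reservation messages one at a time (in-memory sketch)."""

    def __init__(self, stock):
        self.stock = dict(stock)   # sku -> units available
        self.inbox = set()         # message ids already handled
        self.results = {}          # message id -> "RESERVED" | "REJECTED"

    def handle(self, message):
        msg_id = message["messageId"]
        # Inbox check: a redelivered message must not reserve stock twice.
        if msg_id in self.inbox:
            return self.results[msg_id]
        sku, qty = message["sku"], message["qty"]
        if self.stock.get(sku, 0) >= qty:
            self.stock[sku] -= qty
            outcome = "RESERVED"
        else:
            outcome = "REJECTED"
        self.inbox.add(msg_id)
        self.results[msg_id] = outcome
        return outcome


def process_by_session(worker, messages):
    # Group by SKU (the session key) and drain each session in arrival order,
    # mimicking one consumer holding that session.
    sessions = defaultdict(list)
    for m in messages:
        sessions[m["sku"]].append(m)
    return {m["messageId"]: worker.handle(m)
            for batch in sessions.values() for m in batch}
```

Because every message for a SKU flows through one sequential consumer, the check-then-decrement in `handle` never races with another writer, and the inbox makes a redelivery a no-op.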

Implementation guidance

Identity and networking wiring (do this first)

Everything else assumes a Zero-Trust foundation, so it is the first thing to stand up, not the last.

Infrastructure as code

Use Bicep for the Azure resource graph (it is first-class for Azure, has no state file to babysit, and what-if previews are reliable), and reserve Terraform for the small minority of cross-cloud or third-party pieces if any exist. Structure the IaC as Azure Verified Modules (AVM) composed per workload, deployed through the Azure Developer CLI (azd) so app + infra ship together.

A representative slice — a Service Bus Premium namespace, a topic, and the Inventory subscription with sessions and dead-lettering:

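param env string                                   // environment suffix, e.g. 'prod'
param location string = resourceGroup().location

resource sb 'Microsoft.ServiceBus/namespaces@2024-01-01' = {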
  name: 'sb-northwind-${env}'
  location: location
  sku: { name: 'Premium', tier: 'Premium', capacity: 4 } // 4 messaging units
  properties: {
    minimumTlsVersion: '1.2'
    publicNetworkAccess: 'Disabled'      // private endpoint only
    zoneRedundant: true
  }
}

resource topicOrders 'Microsoft.ServiceBus/namespaces/topics@2024-01-01' = {
  parent: sb
  name: 'order-placed'
  properties: {
    requiresDuplicateDetection: true
    duplicateDetectionHistoryTimeWindow: 'PT10M'
    supportOrdering: true
  }
}

resource subInventory 'Microsoft.ServiceBus/namespaces/topics/subscriptions@2024-01-01' = {
  parent: topicOrders
  name: 'inventory'
  properties: {
    requiresSession: true                // one consumer per SKU session => no oversell
    maxDeliveryCount: 5                   // then dead-letter
    deadLetteringOnMessageExpiration: true
    lockDuration: 'PT1M'
  }
}
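The duplicateDetectionHistoryTimeWindow above is worth a moment, because it bounds what the broker will deduplicate for you. A minimal simulation of the idea, assuming a simple in-memory map of message id to first-seen time (`DuplicateDetector` is a hypothetical class, not a Service Bus API):

```python
from datetime import datetime, timedelta


class DuplicateDetector:
    """Drops a message id that was already accepted inside the window."""

    def __init__(self, window=timedelta(minutes=10)):
        self.window = window
        self.seen = {}   # message id -> time first accepted

    def accept(self, message_id, now):
        # Forget ids that have aged out of the history window.
        self.seen = {m: t for m, t in self.seen.items()
                     if now - t < self.window}
        if message_id in self.seen:
            return False   # duplicate inside the window: dropped
        self.seen[message_id] = now
        return True
```

A resend five minutes later is dropped; a resend eleven minutes later is accepted again. That is why broker-level dedup complements the inbox in each handler rather than replacing it.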

The corresponding KEDA ScaledObject lets the Inventory deployment on AKS scale on the backlog, which is the metric that actually matters for an event-driven worker — not CPU:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: inventory-worker }
spec:
  scaleTargetRef: { name: inventory-worker }
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: azure-servicebus
      metadata:
        topicName: order-placed
        subscriptionName: inventory
        messageCount: "20"               # target backlog per replica
        namespace: sb-northwind-prod
      authenticationRef: { name: keda-wi }   # Workload Identity, no keys
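The numbers in that ScaledObject translate into replica counts with simple arithmetic. As an approximation only (the real Horizontal Pod Autoscaler that KEDA feeds also applies tolerances and stabilisation windows, which this ignores), `desired_replicas` below is a hypothetical helper that shows the steady-state target:

```python
import math


def desired_replicas(backlog: int, target_per_replica: int = 20,
                     min_replicas: int = 2, max_replicas: int = 30) -> int:
    """Approximate replica count for a given subscription backlog."""
    wanted = math.ceil(backlog / target_per_replica)
    # Clamp to the bounds set by minReplicaCount / maxReplicaCount.
    return max(min_replicas, min(max_replicas, wanted))
```

With these settings an empty subscription still keeps 2 replicas, a backlog of 250 asks for 13, and anything above 600 messages pins the deployment at the 30-replica ceiling. If sale-day backlogs routinely sit above that, the ceiling or the per-replica target is the knob to revisit.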

For the outbox publisher, bind a Function to the Cosmos change feed and send to Service Bus — the lease container gives you the checkpoint that makes delivery at-least-once:

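// Isolated worker model. Namespaces assumed: Azure.Messaging.ServiceBus,
// Microsoft.Azure.Functions.Worker, System.Linq, System.Collections.Generic.
// Assumption: the output binding preserves SessionId, MessageId and
// ApplicationProperties. If your worker version serialises the return value
// into the message body instead, send through a ServiceBusSender.
[Function("OutboxPublisher")]
[ServiceBusOutput("order-placed", Connection = "SbConnection",
                  EntityType = ServiceBusEntityType.Topic)]
public static IEnumerable<ServiceBusMessage> Run(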
    [CosmosDBTrigger(
        databaseName: "northwind",
        containerName: "orders",
        Connection = "CosmosConnection",
        LeaseContainerName = "leases",
        CreateLeaseContainerIfNotExists = true)] IReadOnlyList<OrderDoc> changes)
    => changes.Where(o => o.Status == "PENDING")
              .Select(o => new ServiceBusMessage(BinaryData.FromObjectAsJson(o))
              {
                  SessionId = o.Sku,                  // honour SKU ordering downstream
                  MessageId = o.OrderId,              // enables dedup
                  ApplicationProperties = { ["traceparent"] = o.TraceParent }
              });

Note the three properties carried on every message: SessionId (ordering), MessageId (duplicate detection / idempotency), and traceparent (W3C trace propagation so the distributed trace survives the async hop). Propagating trace context through the broker is the difference between a debuggable system and a haunted house.

Deployment and progressive delivery

Deploy services through GitHub Actions → AKS with Argo Rollouts or Flagger for canary, gated on App Insights metrics (error rate, p95 latency, dead-letter count). Because producers and consumers are decoupled by the broker, you deploy them independently, but you must keep event schemas backward-compatible (additive-only). Register schemas in Azure Schema Registry (hosted in an Event Hubs namespace) or another Avro/JSON Schema registry, and enforce compatibility in CI so a producer can’t ship a breaking change that silently poisons every consumer’s dead-letter queue.
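An additive-only rule is simple enough to check mechanically. The following is a deliberately conservative sketch, assuming flat JSON Schema documents with top-level "properties" and "required" keys; `breaking_changes` is a hypothetical helper, and a real registry's compatibility modes are richer than this.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """List the reasons new_schema is not an additive-only change."""
    problems = []
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})

    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"removed field: {name}")
        elif new_props[name].get("type") != spec.get("type"):
            problems.append(f"changed type of field: {name}")

    # A field that becomes mandatory breaks anyone still producing or
    # replaying the old shape.
    old_required = set(old_schema.get("required", []))
    for name in new_schema.get("required", []):
        if name not in old_required:
            problems.append(f"new required field: {name}")
    return problems
```

In CI, fail the build whenever the returned list is non-empty. Adding an optional field passes; removing, retyping, or newly requiring a field does not.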

Enterprise considerations

Security and Zero Trust

The architecture is Zero Trust by construction, not by bolt-on:

Cost optimization

Event-driven on Azure is cost-elastic if you let it be:

Scalability and reliability (RTO/RPO)

Observability

Governance

Reference enterprise example

Northwind Apparel ran the rebuild over three quarters. Concrete decisions and numbers:

Scale and traffic. Steady state ~6,000 orders/day; sale-day peak observed at 58,000 orders/day, with a 9-minute burst that peaked at ~140 orders/second at checkout. The old monolith fell over at ~25 orders/second.

What they built.

Decisions that paid off.

A failure they handled gracefully. Two months in, a bad deploy of the Fraud service started throwing on a subset of orders. Because fraud scoring is a Service Bus consumer with maxDeliveryCount: 5, the affected messages dead-lettered with an alert instead of blocking the pipeline; payment and confirmation continued for unaffected orders, and the DLQ was replayed after the fix. In the monolith, that same bug would have failed checkouts outright.
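The mechanism behind that outcome fits in a few lines. This is a sequential simulation under stated assumptions: the `deliver` function and message shape are hypothetical, and a real broker redelivers concurrently with other messages rather than retrying one to exhaustion before moving on.

```python
def deliver(messages, handler, max_delivery_count=5):
    """Retry each message until it succeeds or its delivery count runs out."""
    completed, dead_lettered = [], []
    for msg in messages:
        last_error = None
        for _ in range(max_delivery_count):
            try:
                handler(msg)
                completed.append(msg["id"])
                break
            except Exception as exc:
                last_error = str(exc)
        else:
            # Delivery count exhausted: park the message and keep going.
            dead_lettered.append({"id": msg["id"],
                                  "reason": last_error,
                                  "deliveries": max_delivery_count})
    return completed, dead_lettered
```

A poison message costs five attempts and one dead-letter entry; the messages behind it still complete. Replaying the dead-letter queue after the fix is then just another call to the same handler.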

Outcome. Steady-state Azure spend ~₹6.4 lakh/month (vs. an over-provisioned monolith that cost more to keep upright on sale days), sale-day survivability proven at 10x, zero oversells, and a platform four teams can evolve independently. The business signed off on the rebuild on the strength of the first sale event alone.

When to use it

Use this architecture when:

Be honest about the trade-offs:

Anti-patterns to avoid (each one is a real incident waiting to happen):

Alternatives worth weighing: a modular monolith (simplest, best until you have proven independent-scaling or independent-deployment needs); Dapr on AKS/Container Apps if you want the same patterns with a portable building-block abstraction over the brokers (pub/sub, state, bindings) and less Azure-specific glue; Azure Container Apps instead of full AKS if you want event-driven, KEDA-built-in, serverless containers without operating Kubernetes — a strictly better starting point for most teams until AKS-level control is actually required; and pure Functions + Service Bus/Event Grid with no Kubernetes at all for the genuinely small or spiky-but-stateless workload. The shape of the architecture — three planes, commands vs. events, outbox, idempotent handlers, saga — stays the same across all of them. Only the compute host changes.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
