Azure Enterprise Architecture: Production Microservices on AKS

Most “we run microservices on Kubernetes” stories are really “we run a distributed monolith on a cluster nobody is sure how to upgrade.” The pods are there, the YAML is there, but the cluster has one giant flat namespace, secrets are baked into images, traffic between services is plaintext, deploys happen by kubectl apply from a laptop, and the security team has quietly given up on auditing any of it. This reference architecture is about the other thing: a production-grade Azure Kubernetes Service (AKS) platform where dozens of microservices owned by many teams share a cluster safely — every pod has its own least-privilege Azure identity with no stored secrets, every service-to-service hop is mutually authenticated and encrypted, every image is signed and scanned before it can run, and every change to what’s deployed arrives through Git, not a human with cluster-admin. It scales down to a ten-service startup on one cluster and up to a regulated enterprise running hundreds of services across a fleet — the same component set serves both; what changes is the number of node pools and the strictness of the policy, not the diagram.

This article follows the format of the major architecture centers: the scenario, the end-to-end request and delivery flow, a component-by-component breakdown, concrete implementation and IaC wiring, the enterprise concerns (security, cost, reliability, observability, governance), a named worked example with real numbers, and an honest section on when not to build this.

The business scenario

Picture an engineering organization that has outgrown a single deployable. It might be a Series-B SaaS company with four squads, or a bank’s digital channel with forty. The shape of the pain is the same at both ends:

The problem this architecture solves is precise: let many teams run many services on shared AKS infrastructure with hard security and tenancy boundaries, zero stored secrets, encrypted and authorized service-to-service traffic, a trusted software supply chain, and a fully auditable Git-driven delivery path, while keeping the cluster disposable and continuously upgradable: cattle, not a pet. The non-goals matter too: this is not “Kubernetes for a single app” (that is over-engineering; use Container Apps or App Service), and it is not a multi-cluster service-mesh federation (a different, heavier article). It is the smallest coherent platform that makes a shared AKS cluster safe for real multi-team production.

Architecture overview

The organizing idea is a paved-road platform: the platform team owns one (or a small fleet of) hardened, private AKS clusters and a set of golden guardrails; application teams own namespaces and the Git repos that describe what runs in them. Three planes are kept deliberately separate — the traffic plane (how requests get in and move between services), the identity plane (how workloads prove who they are without secrets), and the delivery plane (how desired state becomes running state). Get those three right and the cluster becomes boring in the best way.

Reference architecture for multi-team microservices on Azure AKS: a public edge (Front Door + WAF, Entra ID) fronting a private spoke VNet that holds the private AKS cluster with an mTLS service mesh and user node pools, secret-less workload-identity federation, private-endpoint state and data services (Key Vault, Azure SQL, Cosmos DB, Service Bus, Storage), a supply-chain and GitOps delivery column (CI to ACR scan and sign to a config repo reconciled by Flux/Argo with signed-image admission), and an observability strip. The numbered request path runs 1 to 7 and the delivery path is shown as dashed pull-model connectors.

The request path, end to end, for external traffic hitting a service:

  1. A client resolves the application hostname to Azure Front Door (anycast edge, TLS termination, WAF with OWASP managed rules, bot and rate-limit policies, edge caching for static assets). For dynamic requests Front Door selects the cluster’s regional origin over a Private Link origin, so the cluster is never directly reachable from the internet.
  2. The request lands on the cluster’s ingress — the AKS-managed Application Gateway for Containers (AGC) or an Istio ingress gateway exposed via an internal Azure Load Balancer. Ingress is implemented through the Kubernetes Gateway API (Gateway + HTTPRoute), not legacy Ingress, so routing is portable and expressive.
  3. Inside the cluster the request enters the service mesh. Two production-supported choices on AKS: Istio (the AKS-managed Istio add-on, increasingly in ambient/sidecar-less mode) for rich L7 traffic management, or Cilium (Azure CNI powered by Cilium) with Cilium service mesh / Hubble for an eBPF-based data plane with identity-aware L3/L4 network policy at near-zero overhead. Either way, the first hop into a workload is mutual TLS — the calling identity is cryptographically verified, not trusted by IP.
  4. The target microservice pod runs in its team namespace, scheduled onto a user node pool appropriate to its workload class (general, memory-optimized, spot, or GPU). It has a Kubernetes ServiceAccount federated to an Azure Managed Identity via Workload Identity Federation — so when it needs to call Azure, it requests a token using the OIDC issuer projected into the pod, with no client secret anywhere.
  5. The service reads its configuration and secrets from Azure Key Vault through the Secrets Store CSI Driver (mounted as a tmpfs volume and/or synced to a Secret), authenticating with that same workload identity over a private endpoint. It calls Azure data services — Azure SQL, Cosmos DB, Service Bus, Storage — using the workload identity’s Microsoft Entra token, so the database sees a named, least-privilege principal, not a shared connection string.
  6. Service-to-service calls (service A → service B) stay inside the mesh: mTLS encrypts the hop, an AuthorizationPolicy (Istio) or CiliumNetworkPolicy decides whether A is allowed to call B at all, and the mesh records the call. East-west traffic that is not explicitly allowed is denied by default.
  7. Telemetry flows out continuously: Azure Monitor managed Prometheus scrapes metrics, Container Insights / Azure Monitor collects logs and the control-plane audit log, managed Grafana visualizes both, and the mesh emits distributed traces and a live service-dependency map.
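
The Gateway API routing in step 2 can be sketched as one platform-owned Gateway plus one team-owned HTTPRoute. This is a minimal illustration rather than the article's actual manifests: the gateway name, the `ingress` namespace, the hostname `api.example.com`, the TLS Secret, and the `orders` Service and port are all hypothetical, and the `gatewayClassName` value depends on which implementation (AGC or Istio) backs the gateway.

```yaml
# Sketch only: every name, namespace, and hostname here is an assumption.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: platform-gateway            # hypothetical, owned by the platform team
  namespace: ingress                # hypothetical platform namespace
spec:
  gatewayClassName: istio           # assumption; AGC registers its own class name
  listeners:
    - name: https
      hostname: "api.example.com"
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: api-example-com-tls   # hypothetical TLS secret
      allowedRoutes:
        namespaces:
          from: All                 # tighten to a namespace selector in practice
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders
  namespace: orders                 # owned by the application team
spec:
  parentRefs:
    - name: platform-gateway
      namespace: ingress
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /orders
      backendRefs:
        - name: orders              # hypothetical ClusterIP Service
          port: 8080
```

The split is the useful part: the platform team controls the listener and certificate, while each squad attaches routes from its own namespace without touching shared ingress configuration.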

The delivery path, end to end, for getting a change into the cluster:

  1. A developer merges to the app repo. Azure Pipelines / GitHub Actions builds the container, runs tests, pushes the image to Azure Container Registry (ACR), and the registry scans it (Microsoft Defender for Containers / Trivy) and signs it (Notation / Cosign).
  2. The pipeline does not deploy. Instead it opens a pull request against a GitOps config repo that bumps the image digest in the service’s manifests/Helm values.
  3. A GitOps controller running in the cluster — Argo CD or Flux (available as the AKS GitOps add-on, microsoft.flux) — continuously reconciles cluster state to that repo. When the PR merges, the controller pulls the change and applies it. Git is the single source of truth; the cluster converges to it; drift is detected and (optionally) auto-corrected.
  4. Before anything runs, an admission policy engine — Azure Policy for AKS (Gatekeeper) or Kyverno — validates the manifests: only images from the trusted ACR with a valid signature are admitted, every pod must set non-root and resource limits, host networking is forbidden, and so on. A non-conforming deploy is rejected at the API server, before scheduling.
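
Step 3's pull-model reconciliation might look like the following pair of Flux objects. Treat it as a sketch under assumptions: the repository URL, the `./apps/orders/production` path, and the object names are hypothetical. On AKS, the microsoft.flux extension is normally driven through an Azure-side Flux configuration that generates equivalent objects, so these raw manifests only illustrate the underlying model.

```yaml
# Sketch only: URL, path, and names are hypothetical.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m                      # how often the controller polls Git
  url: https://github.com/example-org/platform-config
  ref:
    branch: main
  # A private repo would also need spec.secretRef pointing at credentials.
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: orders
  namespace: flux-system
spec:
  interval: 5m                      # re-apply on this cadence, which corrects drift
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./apps/orders/production
  prune: true                       # delete objects that were removed from Git
  targetNamespace: orders
  wait: true                        # report Ready only once workloads are healthy
  timeout: 3m
```

Because the controller reaches out to Git, nothing here requires inbound access to the private API server.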

The mental model: Front Door and the Gateway get traffic in; the mesh moves and authorizes it with mTLS; workload identity removes every secret from the picture; ACR + signing + admission control guarantees only trusted code runs; and GitOps makes the whole thing converge to Git, auditable commit by commit.

Component breakdown

| Component | Azure service / project | What it does | Key configuration choices |
|---|---|---|---|
| Cluster | Azure Kubernetes Service (AKS) | Managed Kubernetes control plane + nodes | Private cluster (no public API endpoint) or API-server VNet integration with authorized IP ranges; Azure CNI Overlay (or CNI powered by Cilium) for IP efficiency; auto-upgrade channel = stable with planned maintenance windows; Uptime SLA / Standard tier for the control plane; system node pool tainted CriticalAddonsOnly |
| Node pools | AKS user node pools + VMSS | Run workloads, segmented by class | Separate pools for general / memory / spot / GPU; cluster autoscaler per pool + KEDA for event-driven scale; ephemeral OS disks; Azure Linux (Mariner) nodes; taints/labels so workloads land on the right pool |
| Ingress / Gateway | App Gateway for Containers (AGC) or Istio ingress gateway | North-south entry, L7 routing, TLS | Gateway API (Gateway/HTTPRoute) over legacy Ingress; WAF at Front Door (and optionally AGC); internal LB so origin is private behind Front Door Private Link |
| Service mesh | AKS-managed Istio add-on or Cilium (Azure CNI Powered by Cilium) | mTLS, L7 traffic mgmt, authz, observability | Istio: prefer ambient mode (ztunnel + waypoints) to drop per-pod sidecar cost; PeerAuthentication: STRICT mTLS; AuthorizationPolicy default-deny. Cilium: eBPF dataplane, CiliumNetworkPolicy (identity-aware), Hubble for flow visibility; mutual auth via SPIFFE |
| Registry | Azure Container Registry (Premium) | Stores & secures images and Helm/OCI charts | Premium for private endpoints, geo-replication, content trust / Notation signing; quarantine pattern: image scanned before promotion; ACR Tasks for base-image patching; pull via workload/kubelet managed identity, not admin user (which is disabled) |
| Secrets | Azure Key Vault + Secrets Store CSI Driver | Source of truth for secrets/certs | Key Vault with RBAC authorization + private endpoint; CSI SecretProviderClass mounts secrets as files; rotation enabled; prefer passwordless (Entra tokens) over storing connection strings at all |
| Workload identity | Microsoft Entra Workload Identity Federation | Pods get Azure tokens with no secret | OIDC issuer enabled on AKS; federated credential binds ServiceAccount → User-Assigned Managed Identity; annotate SA + label pod azure.workload.identity/use: "true"; one identity per service, scoped RBAC |
| Delivery (GitOps) | Argo CD or Flux (microsoft.flux AKS add-on) | Reconciles cluster to a Git repo | App-of-apps / Flux Kustomizations; digest-pinned images (no :latest); progressive delivery via Argo Rollouts / Flagger (canary, blue-green); separate app repo vs config repo |
| Admission / policy | Azure Policy for AKS (Gatekeeper) or Kyverno | Enforces guardrails at the API server | Image-source + signature verification, non-root, read-only rootfs, required limits/requests, no host network, allowed registries; built-in Azure Policy initiative for AKS baseline |
| Edge | Azure Front Door + WAF | Global entry, TLS, WAF, caching | OWASP managed ruleset, bot manager, rate limiting; Private Link origin to the internal LB |
| Observability | Azure Monitor managed Prometheus, Container Insights, managed Grafana, Application Insights | Metrics, logs, traces, dashboards, audit | Managed Prometheus scrape configs; control-plane diagnostic logs (kube-audit) to Log Analytics; mesh traces + service map; Grafana dashboards as code |

A few choices deserve the “why,” because they are where teams most often go wrong.

Why a private cluster. A public AKS API endpoint is a standing internet-facing attack surface for the most powerful credential in your platform. A private cluster (or API-server VNet integration) means the control plane is reachable only from your network; CI reaches it through a self-hosted agent in the VNet or via the GitOps pull model (which needs no inbound access to the cluster at all — the controller reaches out to Git). Pair it with local accounts disabled and Entra + Azure RBAC for Kubernetes authorization, so cluster access is governed by Entra groups and Conditional Access, and cluster-admin is a break-glass PIM-elevated role, not a kubeconfig on a laptop.

Why ambient-mode Istio (or Cilium) instead of classic sidecars. The sidecar model puts an Envoy proxy in every pod — real CPU/memory tax per replica, and a coupling between app and proxy lifecycle that complicates upgrades. Istio ambient mode moves mTLS and L4 to a per-node ztunnel and only deploys an L7 waypoint proxy where you actually need L7 policy, cutting mesh overhead substantially. Cilium takes a different route entirely: mTLS-equivalent identity and policy enforced in the eBPF datapath in the kernel, no userspace proxy on the hot path at all. Choose Istio when you need rich L7 traffic shaping (header-based canaries, retries/timeouts, fault injection) mesh-wide; choose Cilium when you want high-throughput identity-aware L3/L4 security with minimal overhead and great flow visibility via Hubble. Both give you the non-negotiable: encrypted, authenticated, default-deny east-west traffic.
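
For the Istio option, "encrypted, authenticated, default-deny" comes down to three small objects. The sketch below assumes the AKS add-on's root namespace is `aks-istio-system` (verify this for your revision), and the `orders` and `checkout` namespaces, labels, and service accounts are hypothetical. The allow rule matches on caller identity only, so it stays at L4; path- or method-level rules in ambient mode would additionally need a waypoint proxy.

```yaml
# Sketch only: namespaces, labels, and service-account names are assumptions.
# 1. Mesh-wide: refuse any plaintext traffic between workloads.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system       # assumed root namespace of the AKS add-on
spec:
  mtls:
    mode: STRICT
---
# 2. Namespace default: an ALLOW policy with no rules matches nothing,
#    so every request into the orders namespace is denied.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: orders
spec: {}
---
# 3. Explicit exception: only the checkout service account may call orders.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-checkout-to-orders
  namespace: orders
spec:
  selector:
    matchLabels:
      app: orders
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/checkout/sa/checkout-sa
```

With the deny-all in place, traffic from the ingress gateway also needs its own explicit allow rule, which is easy to forget on the first rollout.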

Why workload identity federation, emphatically. This is the single highest-leverage security decision in the architecture. The legacy alternatives — pod-identity, or a service principal with a client secret in a Secret — both end with a long-lived credential somewhere on the cluster that can be exfiltrated. Federation means the pod presents its projected Kubernetes service-account token (a short-lived OIDC JWT) to Entra and receives a short-lived Azure access token in exchange. There is no secret to steal, rotate, or leak. One managed identity per service, each with exactly the Azure RBAC it needs (this service reads this Key Vault and writes that Storage container — nothing else), gives you per-workload least privilege and a clean audit trail.

Implementation guidance

Provision in layers, each with its own IaC stack and state, so the platform team owns the cluster and app teams own their namespaces. Terraform is the common choice on Azure; Bicep is equally valid and avoids state management. The layering matters more than the tool.

A representative Terraform skeleton for Layer 1 — note the add-ons that turn a bare cluster into this architecture (OIDC issuer + workload identity, the managed Istio mesh, monitoring, and key-vault secrets provider):

resource "azurerm_kubernetes_cluster" "prod" {
  name                      = "aks-prod-eus2"
  resource_group_name       = azurerm_resource_group.platform.name
  location                  = "eastus2"
  dns_prefix                = "aks-prod"
  kubernetes_version        = "1.31"          # track n-1 of latest stable
  sku_tier                  = "Standard"       # control-plane Uptime SLA
  oidc_issuer_enabled       = true             # required for workload identity
  workload_identity_enabled = true
  azure_policy_enabled      = true             # Gatekeeper guardrails
  local_account_disabled    = true             # Entra-only access

  # Private cluster: no public API server
  private_cluster_enabled   = true

  default_node_pool {
    name                 = "system"
    vm_size              = "Standard_D4ds_v5"
    auto_scaling_enabled = true
    min_count            = 3
    max_count            = 5
    only_critical_addons_enabled = true        # taint: CriticalAddonsOnly
    zones                = [1, 2, 3]
    os_sku               = "AzureLinux"
  }

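  network_profile {
    network_plugin      = "azure"
    network_plugin_mode = "overlay"            # CNI Overlay for IP efficiency
    network_data_plane  = "cilium"             # must be "cilium" when network_policy = "cilium"
    network_policy      = "cilium"             # or "azure" (then drop network_data_plane); mesh handles mTLS
    load_balancer_sku   = "standard"
  }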

  # AKS-managed Istio service mesh add-on
  service_mesh_profile {
    mode      = "Istio"
    revisions = ["asm-1-23"]
  }

  key_vault_secrets_provider {
    secret_rotation_enabled = true
  }

  oms_agent {
    log_analytics_workspace_id  = azurerm_log_analytics_workspace.platform.id
    msi_auth_for_monitoring_enabled = true
  }

  azure_active_directory_role_based_access_control {
    azure_rbac_enabled = true                  # Azure RBAC for Kubernetes
    tenant_id          = data.azurerm_client_config.current.tenant_id
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.cluster.id]
  }
}

# Spot + GPU + general user pools added as azurerm_kubernetes_cluster_node_pool ...

Wire the workload identity for one service — this is the pattern every microservice repeats. Create a user-assigned identity, grant it only the Azure RBAC it needs, federate it to the service’s Kubernetes ServiceAccount, then annotate the ServiceAccount:

resource "azurerm_user_assigned_identity" "orders" {
  name                = "id-orders-svc"
  resource_group_name = azurerm_resource_group.platform.name
  location            = "eastus2"
}

# Least-privilege Azure RBAC: this service reads ONLY its own KV secrets
resource "azurerm_role_assignment" "orders_kv" {
  scope                = azurerm_key_vault.orders.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_user_assigned_identity.orders.principal_id
}

# Federate the K8s ServiceAccount -> the managed identity (no secret)
resource "azurerm_federated_identity_credential" "orders" {
  name                = "orders-fed"
  resource_group_name = azurerm_resource_group.platform.name
  parent_id           = azurerm_user_assigned_identity.orders.id
  issuer              = azurerm_kubernetes_cluster.prod.oidc_issuer_url
  subject             = "system:serviceaccount:orders:orders-sa"
  audience            = ["api://AzureADTokenExchange"]
}

The matching Kubernetes ServiceAccount manifest (YAML):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-sa
  namespace: orders
  annotations:
    azure.workload.identity/client-id: "<id-orders-svc client id>"
---
# In the Deployment pod template:
#   labels: { azure.workload.identity/use: "true" }
#   serviceAccountName: orders-sa
# The Azure SDK in the pod now gets tokens via the projected SA token — no secret.
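
To make the commented pod-template settings concrete, here is a hedged sketch of a Deployment that uses `orders-sa` and mounts a Key Vault secret through the Secrets Store CSI Driver. The SecretProviderClass name, the secret name `payment-gateway-key`, the registry placeholder, the port, and the resource figures are all hypothetical; the angle-bracket values are left unfilled on purpose.

```yaml
# Sketch only: names, sizes, and placeholders are assumptions.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: orders-kv
  namespace: orders
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<id-orders-svc client id>"   # selects the workload identity
    keyvaultName: "<key vault name>"
    tenantId: "<tenant id>"
    objects: |
      array:
        - |
          objectName: payment-gateway-key
          objectType: secret
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  namespace: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
        azure.workload.identity/use: "true"   # opt in to token projection
    spec:
      serviceAccountName: orders-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
      containers:
        - name: orders
          image: "<registry>.azurecr.io/orders@sha256:<digest>"  # digest-pinned
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 500m, memory: 256Mi }
          volumeMounts:
            - name: kv-secrets
              mountPath: /mnt/secrets
              readOnly: true
      volumes:
        - name: kv-secrets
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: orders-kv
```

The same managed identity serves both paths: the CSI driver uses it to read Key Vault, and the application's Azure SDK uses it for data-plane calls.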

Networking and identity wiring, the load-bearing rules:

Progressive delivery: wire Argo Rollouts or Flagger so a new digest rolls out as a canary — 5% of traffic, watch the mesh’s success-rate and latency metrics from managed Prometheus, auto-promote if healthy, auto-rollback if not. The mesh provides the traffic-splitting primitive; the rollout controller provides the analysis and the abort.
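
With Flagger, the canary described above could be declared roughly as follows. This is a sketch under assumptions: the Deployment name, port, and thresholds are hypothetical, and Flagger's built-in `request-success-rate` and `request-duration` checks need a Prometheus endpoint they can query, so pointing them at Azure Monitor managed Prometheus may take extra wiring that is not shown.

```yaml
# Sketch only: target, port, and thresholds are illustrative assumptions.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: orders
  namespace: orders
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  service:
    port: 8080
  analysis:
    interval: 1m          # evaluate metrics once a minute
    threshold: 5          # failed checks tolerated before automatic rollback
    stepWeight: 5         # start at 5% and grow in 5% steps
    maxWeight: 50         # promote once the canary is healthy at 50%
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99         # percent of non-5xx responses
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500        # latency ceiling in milliseconds
        interval: 1m
```

The GitOps controller only changes the Deployment's image digest; Flagger notices the new revision and drives the traffic shift and the abort decision.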

Enterprise considerations

Security and Zero Trust. This architecture is a Zero Trust implementation for compute, applied at four layers. Identity: every workload has its own short-lived, secret-less Entra identity (workload identity federation) and every Azure data call is a named principal with least-privilege RBAC — there is no shared credential to compromise. Network: default-deny east-west via the mesh/CNI, mTLS on every hop, private endpoints for every dependency, and a private API server — a pod can only reach what policy explicitly allows. Supply chain: images are scanned (Defender for Containers / Trivy) and signed, and admission control refuses to run anything unsigned or from an untrusted registry; this closes the “someone pushed a malicious image” path that most clusters leave wide open. Runtime: Defender for Containers provides runtime threat detection (crypto-miner, reverse-shell, suspicious exec) on the nodes; pods run non-root, read-only-rootfs, with dropped capabilities, enforced by policy. Pull CIS AKS benchmark and Microsoft cloud security baseline assessments into Defender for Cloud and treat the findings as a backlog, not a one-time audit.

Cost optimization (FinOps). A shared cluster is itself the biggest cost lever — bin-packing many services onto common nodes beats a VM-per-service estate. Beyond that: (1) Spot node pools for stateless, interruptible, and batch workloads — often 60–90% cheaper, with KEDA/PDBs to handle eviction gracefully; (2) right-size with the VPA recommender and set requests from real usage, because over-requested pods waste reserved capacity even when idle; (3) cluster autoscaler + scale-to-zero node pools and KEDA so capacity tracks demand, plus the AKS Stop/Start feature for non-prod overnight; (4) Savings Plans / Reserved Instances for the steady-state baseline node count; (5) ambient-mode mesh to remove the per-pod sidecar tax across hundreds of replicas; (6) OpenCost / Microsoft Cost Management + Kubernetes cost views to show each team a per-namespace bill, which is the single most effective behavior change. Showback by namespace turns “the cluster is expensive” into “your service is expensive,” which is actionable.

Scalability. Four independent axes: pods scale via HPA (CPU/memory) and KEDA (queue depth, event rate, custom metrics) including scale-to-zero; nodes scale via the cluster autoscaler per pool and the faster Node Autoprovisioning (Karpenter for AKS) where available; the cluster itself has generous limits (thousands of nodes), and the platform scales by adding clusters to a fleet managed by Azure Kubernetes Fleet Manager when one cluster’s blast radius or limits become the constraint. Design services stateless so any of these can scale them freely; push state to Azure data services.
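
Queue-driven scale-to-zero with KEDA could be sketched like this for a Service Bus consumer. The `notifications` namespace, the `notification-worker` Deployment, the queue name, and the scaling numbers are hypothetical. The identity KEDA uses to read queue length (its own federated identity, or the one named in `identityId`) must separately be granted a suitable Service Bus role.

```yaml
# Sketch only: workload, queue, and limits are illustrative assumptions.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: servicebus-workload-identity
  namespace: notifications
spec:
  podIdentity:
    provider: azure-workload        # no connection string, no stored secret
    identityId: "<managed identity client id>"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: notification-worker
  namespace: notifications
spec:
  scaleTargetRef:
    name: notification-worker       # hypothetical Deployment
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 30
  cooldownPeriod: 300               # seconds of inactivity before going to zero
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: notifications
        namespace: "<service bus namespace name>"
        messageCount: "20"          # target messages per replica
      authenticationRef:
        name: servicebus-workload-identity
```

Paired with a spot node pool and the cluster autoscaler, idle consumers cost nothing until messages arrive.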

Reliability and DR (RTO/RPO). Inside a region: node pools span availability zones, Pod Disruption Budgets keep minimum replicas during upgrades and node churn, topology spread constraints avoid single-node concentration, and planned maintenance windows + surge upgrades make Kubernetes-version and node-image upgrades routine and non-disruptive, so the cluster stays current and in support by construction. Region loss: the cluster is stateless and rebuildable; that is the whole point of GitOps. Your RTO is “how fast can IaC stand up a cluster in the paired region and the GitOps controller reconcile every service onto it”: realistically 15–45 minutes for a warm-standby cluster (pre-provisioned, GitOps paused) or longer for cold. Your RPO is governed entirely by the data tier, not the cluster: it is the replication lag of Cosmos DB (multi-region writes, seconds), Azure SQL failover groups, or geo-replicated Storage, because the cluster holds no durable state to lose. Geo-replicate ACR so the standby region can pull images during a primary outage. The reliability win of this architecture is that DR reduces to re-running the IaC and letting the GitOps controller reconcile the config repo onto a fresh cluster, which you can, and must, rehearse on a schedule.
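
The in-region guarantees map onto two standard Kubernetes objects. A minimal sketch, assuming a hypothetical `orders` service with three replicas: the PodDisruptionBudget bounds voluntary evictions during upgrades, and the topology spread constraints keep replicas in different zones and on different nodes.

```yaml
# Sketch only: service name, replica count, and image are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders
  namespace: orders
spec:
  minAvailable: 2                   # node drains may evict at most one of three
  selector:
    matchLabels:
      app: orders
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  namespace: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                # hard rule: spread evenly across zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: orders
        - maxSkew: 1                # soft rule: prefer distinct nodes
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: orders
      containers:
        - name: orders
          image: "<registry>.azurecr.io/orders@sha256:<digest>"
```

A budget that can never be satisfied (for example `minAvailable` equal to the replica count) will stall node upgrades, so it is worth checking budgets against replica counts in policy.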

Observability. Three signals plus audit. Metrics: Azure Monitor managed Prometheus scrapes app and mesh metrics; managed Grafana dashboards them (golden signals per service, mesh success-rate/latency, node and cost views). Logs: Container Insights collects stdout and the control-plane kube-audit log to Log Analytics — the audit log is your “who changed what on the cluster” record. Traces: the mesh emits distributed traces to Application Insights, and Istio/Hubble draw a live service-dependency map so you can see, not guess, who calls whom. Alert on SLOs (error budget burn), not raw CPU. The mesh and GitOps controller both expose health you should alert on: mesh mTLS coverage and GitOps sync/drift status (a service that has drifted from Git, or a failed reconcile, is an incident).
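
"Alert on error budget burn" can be expressed as a Prometheus rule over the mesh's request metrics. The sketch below assumes a 99.9% availability SLO over 30 days for a hypothetical `orders` service, and uses the commonly cited fast-burn threshold of 14.4x (which spends about 2% of a 30-day budget in one hour). It is written in the native Prometheus rule-file format; Azure Monitor managed Prometheus defines rule groups as Azure resources, so the expression would need to be carried over into that form.

```yaml
# Sketch only: SLO target, service name, and severity label are assumptions.
groups:
  - name: orders-slo
    rules:
      - alert: OrdersErrorBudgetFastBurn
        # Ratio of 5xx responses to all responses over the last hour,
        # compared against 14.4 times the allowed error rate (0.1%).
        expr: |
          (
            sum(rate(istio_requests_total{destination_service_name="orders", response_code=~"5.."}[1h]))
            /
            sum(rate(istio_requests_total{destination_service_name="orders"}[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "orders is burning its 99.9% error budget more than 14.4x too fast"
```

In practice this is usually paired with a slower, lower-threshold window so that both sudden outages and gradual degradation are caught.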

Governance. Enforce, do not document. Azure Policy for AKS (Gatekeeper) applies the org’s guardrails cluster-wide — allowed registries, required labels/limits, no privileged pods, no host network — and reports compliance into Azure Policy. Kyverno covers mutation and finer policy (auto-inject securityContext, enforce image-digest pinning). Microsoft Entra + Azure RBAC for Kubernetes ties cluster access to Entra groups and PIM so cluster-admin is time-bound and approved. The GitOps repo’s PR history is your change-management record — every production change is a reviewed, attributed, revertable commit, which is exactly what auditors want and exactly what kubectl apply from a laptop never provides.
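
As an illustration of "enforce, do not document", a few of these guardrails could be written as one Kyverno ClusterPolicy. This is a sketch, not the platform's actual policy: the registry placeholder and policy name are hypothetical, signature verification (Kyverno's `verifyImages` rules) is left out, init containers are not covered, and newer Kyverno releases prefer a per-rule failure action over the top-level `validationFailureAction` field shown here.

```yaml
# Sketch only: registry and names are assumptions; system namespaces
# would normally be excluded from these rules.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: platform-pod-guardrails
spec:
  validationFailureAction: Enforce  # reject at admission instead of auditing
  background: true
  rules:
    - name: trusted-registry-only
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from the platform's trusted registry."
        pattern:
          spec:
            containers:
              - image: "<registry>.azurecr.io/*"
    - name: require-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Every container must set CPU and memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
    - name: disallow-host-network
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Host networking is not allowed."
        pattern:
          spec:
            =(hostNetwork): "false"   # if the field is set, it must be false
```

Running the same policy in audit mode first gives teams a list of violations to fix before enforcement starts rejecting their deploys.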

Reference enterprise example

Northwind Mobility is a (fictional) mid-market mobility-and-logistics SaaS: a driver app, a shipper portal, and a partner API, serving ~120,000 daily active users across the US, run by six squads. They started as a Django monolith on App Service. By 2025 the monolith’s single release train was the bottleneck — squads waited days for each other’s changes, a payments hotfix required a full-app deploy, and a SOC 2 audit flagged shared service-principal secrets and plaintext internal traffic. They decided to decompose into ~28 services on AKS, deliberately as a platform, not a pile of pods.

What they built. One production AKS cluster in East US 2 (Standard tier, private API server, Azure CNI Overlay with Cilium), with a warm-standby cluster in Central US kept current by the same IaC and a paused GitOps controller. Node pools: a 3-node zonal system pool, a general user pool (D-series, autoscaling 6→30), a spot pool for the trip-pricing batch and notification fan-out (saving ~70% on that bursty compute), and a small GPU pool for their ETA-prediction model. They chose the AKS-managed Istio add-on in ambient mode because the squads wanted header-based canaries and per-route retries, and ambient kept the mesh tax low across ~400 pods. Each of the 28 services got its own user-assigned managed identity federated to its ServiceAccount — zero client secrets remained anywhere; the SOC 2 finding closed itself. Secrets that genuinely had to exist (a third-party payment-gateway key) came from Key Vault via the CSI driver over a private endpoint; everything else (SQL, Service Bus, Blob) went passwordless via Entra tokens. Delivery moved to Flux (the microsoft.flux add-on) with an app repo / config repo split and Flagger canaries; Azure Policy for AKS enforced “only signed images from northwind.azurecr.io, non-root, limits required.” Front Door + WAF fronted the cluster’s internal ingress over a Private Link origin.

The numbers and decisions. Roughly $6,800/month all-in for the production cluster: ~$4,100 compute (heavily offset by spot and a 1-year Savings Plan on the baseline nodes), ~$700 ACR Premium + geo-replication, ~$900 Front Door + WAF, ~$1,100 Azure Monitor/Prometheus/Grafana/Log Analytics ingestion. The warm-standby cluster added ~$1,500 (mostly its idle baseline nodes). They debated classic Istio sidecars vs ambient and chose ambient, saving an estimated ~$900/month in sidecar CPU/memory at their replica count. They debated a cluster-per-squad model and rejected it as six times the platform toil for tenancy they could get with namespaces + mesh policy + Azure RBAC.

The outcome. Deploy frequency went from ~3/week (whole monolith) to 40+/day across squads, each squad shipping independently behind canaries. Mean time to recovery for a bad deploy dropped to under 4 minutes (Flagger auto-rollback on success-rate dip). They ran a region-loss game day: failed Front Door over to Central US, un-paused the standby cluster’s Flux controller, and had all 28 services reconciled and serving in 31 minutes (RTO), with an RPO of seconds because Cosmos DB multi-region writes and the SQL failover group held the state; the cluster held none. The SOC 2 auditor’s “credential management” and “encryption in transit (internal)” findings were closed by workload identity and mesh mTLS respectively. The platform team’s recurring nightmare, Kubernetes upgrades, became a scheduled, automated non-event via the stable auto-upgrade channel and surge upgrades within maintenance windows. Net: independent team velocity and a defensible security posture, on shared infrastructure, for under $8.5k/month.

When to use it

Use this architecture when you have multiple teams shipping multiple services that must release independently, you need a defensible security and tenancy boundary on shared infrastructure (encrypted/authorized east-west traffic, per-workload secret-less identity, a trusted supply chain), and you want auditable, Git-driven delivery with a cluster you can upgrade and rebuild without fear. It scales cleanly from one cluster running a dozen services to a Fleet-managed estate running hundreds; the diagram is the same, only the number of clusters and the strictness of policy change. The prerequisite is operational maturity: a platform team that owns the paved road, and app teams willing to live on it.

Trade-offs to accept going in. Kubernetes, a service mesh, GitOps, workload identity, and policy-as-code are a substantial amount of platform to learn and operate. You are buying enormous flexibility and a strong security posture, and paying for it in platform complexity and a real platform team. If you have one or two services and a single squad, this is over-engineering — the operational surface will cost you more than it returns.

Anti-patterns that quietly defeat the design:

Alternatives, in increasing capability and operational cost: (1) Azure Container Apps — managed, serverless Kubernetes-without-the-cluster, with built-in Dapr, KEDA, and ingress; the right choice for a small-to-medium set of microservices that do not need full cluster control, custom operators, or a specific mesh. Most teams should start here and graduate only when they hit its ceiling. (2) App Service / Functions — for a handful of web apps and event handlers, no orchestration needed. (3) A single AKS cluster, namespace-per-team (this article) — the default for real multi-team production at moderate scale. (4) AKS Fleet (multi-cluster) — when one cluster’s blast radius, scale limits, or hard regulatory isolation forces a fleet; same components, federated. Pick the lowest tier that meets your team count and isolation requirements; most organizations reach for AKS when Container Apps would have done, and pay for cluster operations they did not need. The platform you can actually operate beats the platform you merely deployed.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
