Application Gateway for Containers: Gateway API on AKS with Traffic Splitting, mTLS, and Header Routing

If you have run AGIC (the Application Gateway Ingress Controller) at any scale, you already know its failure mode: a single pod reconciling the entire Application Gateway config, mutating an ARM resource on every Ingress change, and grinding through multi-minute control-plane updates while one noisy namespace starves everyone else. Application Gateway for Containers (AGC) is Microsoft’s clean break from that design. The data plane is a managed, regional, near-real-time proxy fleet; the control plane is the ALB Controller running inside your cluster; and, the change that reorganises everything, it speaks the Kubernetes Gateway API, not the legacy Ingress API. The result is routing that propagates in seconds, a blast radius scoped per Gateway object instead of per shared Application Gateway resource, and first-class weighted splitting, backend re-encryption, mTLS and header/path/query routing expressed as portable Kubernetes objects.

This guide walks the full production path and treats AGC as an operable system, not a demo. You will install the ALB Controller against workload identity, provision a managed AGC from an ApplicationLoadBalancer CRD, expose it with Gateway and HTTPRoute, then layer on weighted traffic splitting, BackendTLSPolicy re-encryption and mTLS, custom HealthCheckPolicy probes, and header/path/query routing — every command in the Gateway API variant. Because AGC has a precise mapping from Gateway API objects to AGC constructs, a precise set of RBAC roles, and a precise set of status conditions that tell you exactly where a request stalls, this is also a reference you keep open mid-incident: every CRD field, every role, every status.conditions value, every error string and every limit is laid out as a scannable table beside the prose and the YAML.

By the end you will stop treating AGC as a black box. When a route stays Accepted=False, a split lands 50/50 instead of 90/10, a re-encrypt leg throws 502, or the GatewayClass never goes Accepted, you will know which CRD field, which controller log line, or which federated-credential subject is the cause, and the exact kubectl or az command that confirms it. AGC does still support a legacy Ingress path, but if you are adopting AGC in 2026 you adopt Gateway API: that is where all the routing capability lives, and it is the only path Microsoft is investing in.

What problem this solves

AGIC’s architecture made a category of pain inevitable. Every Ingress change anywhere in the cluster triggered a full Application Gateway configuration push to ARM, so a single team’s frequent deploys produced 4–7 minute propagation windows during which unrelated services saw stale routing. The controller was a single point of reconciliation; a malformed Ingress could wedge the whole pipeline; and because the gateway was a Standard_v2 ARM resource, every change paid ARM’s control-plane latency and throttling. Worse, sophisticated L7 routing (header matches, weighted canaries, per-service backend mTLS) was either impossible or expressed through a sprawl of annotations that no two engineers spelled the same way.

What breaks without AGC’s model: canary releases that should be a one-line weight edit become deploy-time gymnastics; a PCI requirement to re-encrypt to the cardholder workload gets quietly skipped because AGIC’s backend-mTLS story was awkward, leaving an audit finding waiting to happen; and the noisy-neighbour reconcile storms turn one team’s churn into everyone’s latency. Teams paper over it by sharding gateways (one AGIC per namespace, multiplying cost and operational surface) or by freezing deploys during business hours.

Who hits this: any platform team running ingress for more than a handful of namespaces on AKS, anyone who needs progressive delivery (Argo Rollouts / Flagger) wired to a real traffic-splitting data plane, and any regulated estate that must prove encryption all the way to the pod. AGC fixes the architecture: near-real-time programming from an in-cluster controller, per-Gateway blast radius, native weighted splits, and BackendTLSPolicy-driven re-encryption/mTLS — all in Gateway API manifests that travel with the workload.

To frame the whole field before the deep dive, here is what AGC changes versus AGIC, the symptom each change removes, and where in this article you act on it:

| Dimension | AGIC (the old way) | AGC (this article) | Symptom it removes | Where you act |
| --- | --- | --- | --- | --- |
| Data plane | Standard_v2 Application Gateway (ARM) | Managed regional proxy fleet | ARM throttling on every change | §AGC architecture |
| Control plane | One pod mutating ARM per Ingress | ALB Controller writing a config plane | Single-point reconcile wedge | §ALB Controller install |
| Propagation | 4–7 min full config push | Single-digit seconds | Stale routing during peer deploys | §First HTTPRoute |
| API | Ingress + annotation sprawl | Gateway API CRDs | Inconsistent annotation routing | §Gateway + HTTPRoute |
| Blast radius | Whole gateway per change | Per-Gateway object | Noisy-neighbour reconcile storms | §Enterprise scenario |
| Traffic split | Awkward / annotation-driven | Native backendRefs weights | Canary as deploy gymnastics | §Weighted splitting |
| Backend mTLS | Limited / awkward | BackendTLSPolicy + clientCert | PCI re-encrypt gap | §Backend TLS & mTLS |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You need an AKS cluster you can administer, with OIDC issuer and workload identity enabled (we turn them on idempotently below), and a dedicated subnet — minimum /24, delegated to Microsoft.ServiceNetworking/trafficControllers — that AGC injects its data plane into. You should be comfortable with kubectl, Helm, and reading Kubernetes object status.conditions, and have az configured with rights to create identities and role assignments in the cluster’s resource groups. Familiarity with the Gateway API object model (GatewayClass / Gateway / HTTPRoute) is assumed at a conceptual level; if it is new, read Kubernetes Gateway API: HTTPRoute, Traffic Splitting & Ingress Migration first — this article is the Azure-managed implementation of exactly those primitives.

This sits in the AKS networking & ingress track. Upstream of it is the managed-Kubernetes decision in Understanding Managed Kubernetes: AKS vs EKS vs GKE Compared and the broader cluster-networking picture in Production AKS: Networking & Observability. It is the containers-native cousin of the classic data-plane covered in Application Gateway v2 with WAF, L7 Routing & TLS in Production and the end-to-end TLS patterns in Application Gateway with WAF, mTLS & End-to-End TLS. The identity mechanism it depends on is detailed in Azure Key Vault & Workload Identity for Secrets, and it pairs naturally with a mesh — compare with AKS Istio Service Mesh Add-on: mTLS, Ingress & Egress when you need pod-to-pod mTLS behind the gateway.

A quick map of who owns what during an AGC incident, so you escalate to the right team fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| DNS / client | CNAME to AGC FQDN, TLS | Frontend / SRE | No resolution; cert name mismatch |
| AGC data plane | Managed proxy, listeners, routing rules | Microsoft (managed) | 502/503 if backend unhealthy; rule eval |
| ALB Controller | Reconciles CRDs → config plane | Platform / cluster team | Nothing programmed; Accepted=False |
| Workload identity | Federated cred, role assignment | Platform + identity | Controller 401; provisioning stalls |
| Delegated subnet | /24, trafficControllers delegation | Network team | AGC won’t inject; association fails |
| Gateway API CRDs | Gateway, HTTPRoute, policies | App + platform | Routing, splits, mTLS misconfig |
| Backend pods / Services | Workloads, TLS, health paths | App / dev team | Re-encrypt 502; probe eviction |

Core concepts

Five mental-model shifts make every later step obvious.

The proxy is not an Application Gateway v2. AGC is a separate product backed by Microsoft.ServiceNetworking/trafficControllers, not Microsoft.Network/applicationGateways. There is no Standard_v2 SKU, no per-Ingress ARM mutation, and — critically — no WAF policy built in. Routing changes propagate in seconds because the controller writes to a managed config plane rather than re-deploying a gateway resource. If you need a WAF in front of AGC today, you place it upstream (Front Door, or a classic Application Gateway v2 fronting the AGC FQDN), not on the AGC itself.

The control plane lives in your cluster. The ALB Controller is a Helm-installed deployment in the azure-alb-system namespace. It watches Gateway API objects (and its own AGC CRDs), and programs the managed data plane. It authenticates to Azure as a user-assigned managed identity federated to its Kubernetes service account — no secrets, no service-principal passwords. The GatewayClass named azure-alb-external is what the chart registers; Accepted=True on it is your green light that the controller is alive and authorised.

Gateway API objects map cleanly onto AGC constructs. This mapping is worth memorising because every diagnosis traces back to it:

| Gateway API object | AGC construct it becomes | Carries | Status to watch |
| --- | --- | --- | --- |
| GatewayClass (azure-alb-external) | The AGC integration itself | Controller binding | Accepted=True |
| Gateway | An AGC frontend + its listeners | Hostnames, ports, TLS | Programmed=True, an address |
| HTTPRoute | AGC routing rules | path/header/query matches | Accepted, ResolvedRefs |
| backendRefs with weight | A weighted traffic split | Relative weights | (reflected in ResolvedRefs) |
| BackendTLSPolicy | Backend re-encryption / mTLS | SAN, CA, client cert | Accepted on the policy |
| HealthCheckPolicy | Per-backend health probe | path, interval, codes | (reflected in backend health) |

Two deployment flavours, two ownership models. AGC can be created and lifecycle-managed by the controller from an in-cluster CRD (managed mode), or provisioned by you via ARM/Bicep/Terraform with the controller only referencing it (BYO mode). Managed mode is faster and keeps everything in cluster manifests; BYO mode fits enterprises where a platform-networking team must own the AGC, its subnet delegation and its Private Link surface independently of any cluster. We deploy managed mode end to end, then show the BYO association, because most regulated estates land there.

| Mode | Who creates the AGC + association | Lifecycle owner | Use when |
| --- | --- | --- | --- |
| Managed by ALB Controller | The controller, from an ApplicationLoadBalancer CRD | Kubernetes manifests | Greenfield, GitOps-driven, lifecycle in-cluster |
| Bring your own (BYO) | You, via ARM/Bicep/Terraform | The network team’s IaC | Central team owns subnet/RBAC/Private Link governance |

Weights are relative, not percentages. A traffic split is multiple backendRefs under one HTTPRoute rule, each with a weight. AGC distributes requests proportionally to the sum — so 90/10 and 9/1 behave identically, and weight: 0 drains a backend to zero without deleting the ref (keeping rollback one edit away). Because propagation is near-real-time, a canary ramp is just a sequence of kubectl applys, which is exactly why Argo Rollouts and Flagger drive AGC through the Gateway API provider.
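
The proportional rule is easy to check by hand. The following is a minimal sketch, not part of AGC or its tooling: weight_shares is a hypothetical helper name, and it only does the arithmetic (each weight divided by the sum of all weights in the rule).

```shell
#!/usr/bin/env bash
# weight_shares: hypothetical helper that prints "<weight> <share>%" for each
# relative weight passed as an argument (share = weight / sum of weights).
weight_shares() {
  printf '%s\n' "$@" | awk '
    { w[NR] = $1; sum += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        share = (sum > 0) ? 100 * w[i] / sum : 0
        printf "%d %.1f%%\n", w[i], share
      }
    }'
}

# Example calls (uncomment to run):
#   weight_shares 90 10
#   weight_shares 9 1
#   weight_shares 100 0
```

Running it with 90 10 and with 9 1 gives the same 90/10 shares, and 100 0 shows the second backend drained to zero while its ref stays in the rule.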

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side:

| Term | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| AGC | Managed L7 proxy fleet (trafficControllers) | Azure (regional) | The data plane; replaces AGIC’s gateway |
| ALB Controller | In-cluster reconciler that programs AGC | azure-alb-system ns | No controller → nothing routes |
| GatewayClass | The azure-alb-external binding | Cluster-scoped | Accepted=True = controller live |
| Gateway | Frontend + listeners (hostnames, TLS) | App namespace | Emits the AGC FQDN address |
| HTTPRoute | Routing rules + weighted backends | App namespace | Where splits and matches live |
| ApplicationLoadBalancer | CRD that creates a managed AGC | Infra namespace | Managed-mode provisioning |
| BackendTLSPolicy | Re-encrypt / mTLS to pods | App namespace | End-to-end encryption |
| HealthCheckPolicy | Per-Service probe override | App namespace | Replaces default GET / |
| Workload identity | Federated SA → managed identity | Azure + cluster | How the controller authenticates |
| Delegated subnet | /24 for trafficControllers | The VNet | Where AGC injects its data plane |
| ReferenceGrant | Cross-namespace ref permission | Target namespace | Lets a route reach a foreign Secret/Service |
| Weight | Relative share of a backend | HTTPRoute rule | 0 = drain; relative not % |

AGC architecture, and how it differs from AGIC

Two architectural facts drive everything operational. First, the data plane is managed and regional: you never patch it, scale it, or pay ARM latency to change it. The controller writes a desired-state config and the fleet converges in seconds. Second, the control plane is in your cluster and identity-bound: the ALB Controller is the only thing with rights to program the AGC, and it earns those rights through a federated managed identity, not a stored secret.

The practical consequence is a different operational posture than AGIC. With AGIC you debugged ARM deployments and Application Gateway config; with AGC you debug Kubernetes objects and a controller. Here is the side-by-side that matters when you are deciding whether to migrate and what to expect:

| Property | AGIC | AGC | Operational consequence |
| --- | --- | --- | --- |
| Backing resource | Microsoft.Network/applicationGateways | Microsoft.ServiceNetworking/trafficControllers | Different ARM API, different RBAC |
| Reconcile target | ARM gateway config | Managed config plane | Seconds vs minutes |
| API surface | Ingress + annotations | Gateway API CRDs | Portable, typed routing |
| WAF | Built-in WAF_v2 policy | None on AGC (put upstream) | WAF moves to Front Door / AppGW v2 |
| Subnet | AppGW subnet | /24 delegated to trafficControllers | New delegation requirement |
| Identity | AAD pod identity / MSI | Workload identity (federated) | Secretless, OIDC-based |
| Private exposure | Private frontend IP | Private Link to frontend | Private endpoint + Private DNS |
| Multi-tenancy | One gateway, shared config | Per-Gateway, scoped blast radius | One namespace can’t stall another |
| Splitting | Annotation / awkward | Native backendRefs weights | First-class canary |

What you lose moving off AGIC is the integrated WAF and the familiarity of Ingress; what you gain is propagation speed, blast-radius isolation, and the full Gateway API routing vocabulary. The WAF gap is the one to plan for deliberately — most teams front AGC with Front Door (Premium, with managed WAF) or keep a thin Application Gateway v2 + WAF_v2 hop for the public edge, then let AGC own all the L7 routing and backend TLS inside. The capability decision in one grid:

| If you need… | AGIC could… | AGC does… | Recommendation |
| --- | --- | --- | --- |
| Built-in WAF at the same hop | Yes | No | Front AGC with Front Door / AppGW v2 WAF |
| Sub-10s routing changes | No (4–7 min) | Yes | AGC, native |
| Per-namespace blast radius | No | Yes | AGC, one Gateway per team |
| Weighted canary, edit-to-shift | Awkward | Yes | AGC backendRefs weights |
| Backend mTLS to pods | Limited | Yes | AGC BackendTLSPolicy |
| Header/path/query routing | Annotation soup | Typed matches | AGC Gateway API |
| Central-team-owned data plane | Possible | Yes (BYO) | AGC BYO mode |

Prerequisites and the ALB Controller install

You need OIDC issuer and workload identity on the cluster, plus the delegated subnet. Turn the cluster features on idempotently:

RG=rg-agc-prod
AKS=aks-agc-prod
LOCATION=eastus2

# Ensure OIDC + workload identity are on (idempotent on an existing cluster)
az aks update -g "$RG" -n "$AKS" \
  --enable-oidc-issuer \
  --enable-workload-identity

OIDC_ISSUER=$(az aks show -g "$RG" -n "$AKS" \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

The infrastructure prerequisites are unforgiving in specific ways — a /25 subnet or a missing delegation fails provisioning with errors that don’t always name the real cause. Confirm each against this checklist before you install anything:

| Prerequisite | Exact requirement | How to verify | Failure if wrong |
| --- | --- | --- | --- |
| OIDC issuer | Enabled on the cluster | az aks show --query oidcIssuerProfile.enabled | Federation has no issuer to trust |
| Workload identity | Add-on enabled | az aks show --query securityProfile.workloadIdentity | SA token not projected; controller 401 |
| Subnet size | /24 minimum | az network vnet subnet show --query addressPrefix | AGC injection fails (too few IPs) |
| Subnet delegation | Microsoft.ServiceNetworking/trafficControllers | ... --query delegations | Association cannot bind the subnet |
| Subnet emptiness | No conflicting resources | Subnet has free address space | Injection / association errors |
| Helm | v3.8+ for OCI charts | helm version | OCI oci:// pull unsupported |
| kubelet identity / RBAC | az rights to assign roles | az role assignment create succeeds | Controller cannot be granted Config Manager |
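
The two subnet rows of the checklist can be checked together before any install. This is a rough sketch under assumptions: subnet_big_enough and check_alb_subnet are hypothetical helper names, and the subnet is assumed to expose a single addressPrefix (a subnet defined with addressPrefixes would need a different query).

```shell
#!/usr/bin/env bash
# subnet_big_enough: succeeds when the CIDR prefix length is 24 or smaller,
# i.e. the subnet is a /24 or larger.
subnet_big_enough() {
  local prefix_len="${1##*/}"
  [ "$prefix_len" -le 24 ]
}

# check_alb_subnet: reads the live subnet with az and reports on size and delegation.
# Arguments: resource group, VNet name, subnet name.
check_alb_subnet() {
  local rg="$1" vnet="$2" subnet="$3" prefix delegations
  prefix=$(az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" \
    --query addressPrefix -o tsv)
  delegations=$(az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" \
    --query "delegations[].serviceName" -o tsv)

  if ! subnet_big_enough "$prefix"; then
    echo "subnet $prefix is smaller than /24"
    return 1
  fi
  case "$delegations" in
    *Microsoft.ServiceNetworking/trafficControllers*)
      echo "subnet OK: $prefix, delegated to trafficControllers" ;;
    *)
      echo "subnet $prefix is not delegated to Microsoft.ServiceNetworking/trafficControllers"
      return 1 ;;
  esac
}

# Example call (uncomment to run against the names used in this article):
#   check_alb_subnet rg-agc-prod vnet-agc subnet-alb
```

A non-zero exit makes the function usable as a gate in a pipeline step that runs before the Helm install.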

The controller authenticates as a user-assigned managed identity federated to its service account. Create the identity, grant it the purpose-built role on the node resource group (where the controller manages the AGC), and Network Contributor on the subnet’s resource group so it can join the delegated subnet:

IDENTITY=alb-controller-identity
az identity create -g "$RG" -n "$IDENTITY" -l "$LOCATION"

PRINCIPAL_ID=$(az identity show -g "$RG" -n "$IDENTITY" --query principalId -o tsv)
CLIENT_ID=$(az identity show -g "$RG" -n "$IDENTITY" --query clientId -o tsv)

MC_RG=$(az aks show -g "$RG" -n "$AKS" --query nodeResourceGroup -o tsv)
MC_RG_ID=$(az group show -n "$MC_RG" --query id -o tsv)

# The controller manages AGC inside the node resource group
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --scope "$MC_RG_ID" \
  --role "AppGw for Containers Configuration Manager"

# Network Contributor on the subnet's resource group so it can join the delegated subnet
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --scope "$(az group show -n "$RG" --query id -o tsv)" \
  --role "Network Contributor"

The AppGw for Containers Configuration Manager role is purpose-built for AGC. Do not substitute Contributor — least privilege here is auditable, and Microsoft scopes the built-in role exactly to the trafficControllers and association operations the controller needs. The exact roles, their scope, and why each is required:

| Role | Scope | Why the controller needs it | Substitute? |
| --- | --- | --- | --- |
| AppGw for Containers Configuration Manager | Node resource group (or AGC scope in BYO) | Create/update AGC, frontends, associations, routing config | No: purpose-built, least privilege |
| Network Contributor | Subnet’s resource group | Join/associate the delegated subnet | Narrow to the subnet if your policy demands |
| Reader (implicit via above) | Same | Read VNet/subnet to validate delegation | Covered by Network Contributor |
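
To confirm both assignments actually landed, you can list the identity's roles and compare them with the two names above. A small sketch, assuming a hypothetical helper called has_required_roles that reads role names from standard input; it checks names only, not the scope each role was granted at.

```shell
#!/usr/bin/env bash
# has_required_roles: reads role names on stdin (one per line) and fails
# if either role the ALB Controller identity needs is absent.
has_required_roles() {
  local roles required missing=0
  roles=$(cat)
  for required in "AppGw for Containers Configuration Manager" "Network Contributor"; do
    if ! printf '%s\n' "$roles" | grep -Fxq "$required"; then
      echo "missing role: $required"
      missing=1
    fi
  done
  return "$missing"
}

# Example call against the live identity (uncomment to run):
#   az role assignment list --assignee "$PRINCIPAL_ID" --all \
#     --query "[].roleDefinitionName" -o tsv | has_required_roles
```

Because scope is not compared here, a role granted on the wrong resource group would still pass; inspect the scope column of the assignment list when provisioning stalls despite a clean result.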

Federate the identity to the controller’s service account (namespace azure-alb-system, service account alb-controller-sa). The subject string must match exactly — a typo here is the single most common “controller starts but gets 401” cause:

az identity federated-credential create \
  --name alb-controller-fedcred \
  --identity-name "$IDENTITY" \
  -g "$RG" \
  --issuer "$OIDC_ISSUER" \
  --subject "system:serviceaccount:azure-alb-system:alb-controller-sa" \
  --audience api://AzureADTokenExchange

The federated-credential fields and the exact value each must take:

| Field | Required value | Consequence if wrong |
| --- | --- | --- |
| --issuer | The cluster’s OIDC issuer URL | Token issuer not trusted → 401 |
| --subject | system:serviceaccount:azure-alb-system:alb-controller-sa | Subject mismatch → 401 (most common) |
| --audience | api://AzureADTokenExchange | Audience rejected → token exchange fails |
| --identity-name | The UAMI you created | Credential federated to wrong identity |
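
Since a mistyped subject is the most common failure, it is worth comparing the registered value with the expected one character for character. A minimal sketch; fedcred_subject and check_fedcred_subject are hypothetical helper names, and the namespace and service account are the ones used in this article.

```shell
#!/usr/bin/env bash
# fedcred_subject: builds the subject string for a namespace and service account.
fedcred_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# check_fedcred_subject: compares the expected subject with the subjects
# registered on the managed identity. Arguments: resource group, identity name.
check_fedcred_subject() {
  local rg="$1" identity="$2" expected actual
  expected=$(fedcred_subject azure-alb-system alb-controller-sa)
  actual=$(az identity federated-credential list -g "$rg" \
    --identity-name "$identity" --query "[].subject" -o tsv)
  if printf '%s\n' "$actual" | grep -Fxq "$expected"; then
    echo "subject OK: $expected"
  else
    echo "expected subject not found: $expected"
    echo "registered: $actual"
    return 1
  fi
}

# Example call (uncomment to run):
#   check_fedcred_subject "$RG" "$IDENTITY"
```

The comparison is an exact whole-line match, so a missing hyphen or a different namespace fails the check instead of slipping through.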

Install the controller via Helm, passing the identity client ID. Pin the chart version explicitly so a helm upgrade is deliberate, not whatever floats at the tag:

az aks get-credentials -g "$RG" -n "$AKS" --overwrite-existing

helm upgrade --install alb-controller \
  oci://mcr.microsoft.com/application-lb/charts/alb-controller \
  --version 1.7.9 \
  --namespace azure-alb-system --create-namespace \
  --set albController.namespace=azure-alb-system \
  --set albController.podIdentity.clientID="$CLIENT_ID"

Confirm both the controller and its webhook are healthy, and that the GatewayClass is accepted, before going further:

kubectl get pods -n azure-alb-system
kubectl get gatewayclass azure-alb-external -o yaml | grep -A5 status:

azure-alb-external is the GatewayClass the chart registers; Accepted=True on it is your green light. The Helm values you actually touch, and what each controls:

| Helm value | Purpose | Default / typical | When to change |
| --- | --- | --- | --- |
| albController.podIdentity.clientID | The UAMI client ID for workload identity | (required) | Always set |
| albController.namespace | Namespace the controller runs in | azure-alb-system | Rarely; keep the default |
| --version | Chart/controller version | pin explicitly | Deliberate upgrades only |
| albController.replicaCount | Controller replicas (HA) | 2 | Raise for resilience, not throughput |
| albController.logLevel | Controller verbosity | info | debug while diagnosing |
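
In a pipeline you usually want to block until the controller is ready instead of eyeballing pod status. One possible sketch using kubectl wait; wait_for_alb_controller is a hypothetical helper name, and waiting on every Deployment in the namespace is an assumption that avoids hard-coding the chart's Deployment names.

```shell
#!/usr/bin/env bash
# wait_for_alb_controller: blocks until all Deployments in azure-alb-system are
# Available and the azure-alb-external GatewayClass reports Accepted.
# Optional argument: a kubectl timeout such as 180s (the default here).
wait_for_alb_controller() {
  local timeout="${1:-180s}"
  kubectl wait --for=condition=Available deployment --all \
    -n azure-alb-system --timeout="$timeout" &&
  kubectl wait --for=condition=Accepted gatewayclass/azure-alb-external \
    --timeout="$timeout"
}

# Example call (uncomment to run):
#   wait_for_alb_controller 300s
```

If the second wait times out while the pods are Available, the identity or role assignments are the first thing to check in the controller logs.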

Provision the AGC (managed mode)

In managed mode you declare the AGC and its association as CRDs and let the controller build them. Create an infra namespace and the ApplicationLoadBalancer object, pointing at the delegated subnet:

SUBNET_ID=$(az network vnet subnet show \
  -g "$RG" --vnet-name vnet-agc --name subnet-alb \
  --query id -o tsv)

kubectl create namespace alb-infra

# alb.yaml: the association must be the full subnet resource ID (the value of $SUBNET_ID above)
apiVersion: alb.networking.azure.io/v1
kind: ApplicationLoadBalancer
metadata:
  name: alb-prod
  namespace: alb-infra
spec:
  associations:
    - /subscriptions/<SUB_ID>/resourceGroups/rg-agc-prod/providers/Microsoft.Network/virtualNetworks/vnet-agc/subnets/subnet-alb

kubectl apply -f alb.yaml
# Watch provisioning; Deployment.Succeeded means the managed AGC + association exist
kubectl get applicationloadbalancer alb-prod -n alb-infra -o yaml | grep -A10 conditions

This step creates the actual Microsoft.ServiceNetworking/trafficControllers resource and a frontend in the node resource group. Provisioning takes a few minutes the first time. The ApplicationLoadBalancer spec fields and what each does:

| Field | Meaning | Required | Notes |
| --- | --- | --- | --- |
| spec.associations[] | Full resource ID of the delegated subnet | Yes | The subnet must be /24 + delegated |
| metadata.namespace | Infra namespace holding the CRD | Yes | Convention: alb-infra |
| metadata.name | Logical AGC name referenced by Gateway annotations | Yes | Used in alb-name annotation |
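
Pasting a subscription ID into the manifest by hand is an easy place to introduce a typo. As an alternative sketch, the manifest can be rendered from the subnet ID you already looked up; render_alb is a hypothetical helper name, and the fields it emits are the same three described in the table.

```shell
#!/usr/bin/env bash
# render_alb: prints an ApplicationLoadBalancer manifest.
# Arguments: AGC name, infra namespace, full subnet resource ID.
render_alb() {
  local name="$1" namespace="$2" subnet_id="$3"
  cat <<EOF
apiVersion: alb.networking.azure.io/v1
kind: ApplicationLoadBalancer
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  associations:
    - ${subnet_id}
EOF
}

# Example call (uncomment to run; SUBNET_ID comes from the az lookup earlier):
#   render_alb alb-prod alb-infra "$SUBNET_ID" | kubectl apply -f -
```

Rendering to standard output keeps the function side-effect free, so the same output can be committed to Git for a GitOps flow instead of being piped straight to kubectl.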

The provisioning status.conditions you read to know where you are — this is the managed-mode equivalent of watching an ARM deployment:

| Condition type | status you want | Meaning | If not |
| --- | --- | --- | --- |
| Deployment | Succeeded / True | AGC + association created | Check subnet delegation + Config Manager role |
| Available | True | Data plane reachable | Wait; then check controller logs |
| (any) | Reason *Succeeded | No error reason attached | A Reason like SubnetDelegationMissing names the fix |

Expose a Gateway and the first HTTPRoute

The Gateway references the GatewayClass and ties to your ApplicationLoadBalancer via annotation. Here is an HTTPS listener terminating a cert from a Kubernetes Secret:

# gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gw-prod
  namespace: app
  annotations:
    alb.networking.azure.io/alb-namespace: alb-infra
    alb.networking.azure.io/alb-name: alb-prod
spec:
  gatewayClassName: azure-alb-external
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "app.kloudvin.com"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: app-tls
      allowedRoutes:
        namespaces:
          from: Same

kubectl apply -f gateway.yaml
# AGC publishes a generated FQDN; read it back from the Gateway address
kubectl get gateway gw-prod -n app \
  -o jsonpath='{.status.addresses[0].value}{"\n"}'

Point your DNS CNAME for app.kloudvin.com at that generated FQDN (the *.fzXX.alb.azure.com name). The Gateway listener fields you set, and the choices behind each:

| Listener field | Values | Default / typical | When to change | Gotcha |
| --- | --- | --- | --- | --- |
| protocol | HTTP, HTTPS | HTTPS in prod | HTTP only for redirect listeners | HTTP listener serves cleartext |
| port | any TCP port | 443 (HTTPS), 80 (HTTP) | Match your edge | Must align with DNS/clients |
| hostname | FQDN or wildcard | the app host | SNI-based routing | Empty = match all (loosens routing) |
| tls.mode | Terminate, Passthrough | Terminate | Passthrough for end-to-end at pod | Passthrough skips L7 routing |
| tls.certificateRefs | Secret ref(s) | app-tls Secret | Rotate by replacing the Secret | Secret must be in the listener ns |
| allowedRoutes.namespaces.from | Same, All, Selector | Same | Multi-team gateways | All widens attach surface |
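
The Gateway only gets an address once it is programmed, so a script that reads the FQDN straight after the apply can come back empty. A hedged sketch of a polling loop around the same jsonpath query; gateway_fqdn, wait_for_gateway_fqdn and the POLL_SECONDS variable are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# gateway_fqdn: prints the first address on a Gateway, or nothing if none is set yet.
gateway_fqdn() {
  kubectl get gateway "$1" -n "$2" -o jsonpath='{.status.addresses[0].value}'
}

# wait_for_gateway_fqdn: polls until the Gateway has an address, then prints it.
# Arguments: gateway name, namespace, optional number of polls (default 30).
wait_for_gateway_fqdn() {
  local name="$1" namespace="$2" tries="${3:-30}" fqdn i=0
  while [ "$i" -lt "$tries" ]; do
    fqdn=$(gateway_fqdn "$name" "$namespace" 2>/dev/null)
    if [ -n "$fqdn" ]; then
      echo "$fqdn"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-5}"
  done
  echo "no address on gateway $namespace/$name after $tries polls" >&2
  return 1
}

# Example call (uncomment to run):
#   FQDN=$(wait_for_gateway_fqdn gw-prod app) && echo "CNAME target: $FQDN"
```

If the loop times out, check the Gateway's Programmed condition before touching DNS; a missing certificate Secret is a common reason no address appears.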

Now bind a route. The minimal HTTPRoute sends all traffic for the host to one Service:

# route-basic.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: rt-app
  namespace: app
spec:
  parentRefs:
    - name: gw-prod
  hostnames:
    - "app.kloudvin.com"
  rules:
    - backendRefs:
        - name: app-svc
          port: 80

Apply it with kubectl apply -f route-basic.yaml; once status.parents[].conditions shows Accepted=True and ResolvedRefs=True, traffic flows. The reconcile takes seconds, not the multi-minute ARM churn AGIC inflicted. The status conditions across the three objects are your single most-used diagnostic table:

| Object | Condition | True means | Common False cause |
| --- | --- | --- | --- |
| GatewayClass | Accepted | Controller bound + authorised | Controller down / RBAC missing |
| Gateway | Accepted | Spec valid, class matched | Bad gatewayClassName |
| Gateway | Programmed | Data plane configured; has an address | Cert Secret missing; AGC not ready |
| HTTPRoute | Accepted | Route is valid + attached to parent | parentRefs wrong; not allowed by Gateway |
| HTTPRoute | ResolvedRefs | All backendRefs/Secret refs resolve | Service/Secret missing or cross-ns w/o ReferenceGrant |
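
Reading those conditions out of full YAML gets tedious mid-incident. The sketch below flattens an HTTPRoute's parent conditions to one line each and turns the two that matter into an exit code; route_conditions and route_ready are hypothetical helper names.

```shell
#!/usr/bin/env bash
# route_conditions: prints one "Type=Status reason=Reason" line per condition
# across all parents of an HTTPRoute. Arguments: route name, namespace.
route_conditions() {
  kubectl get httproute "$1" -n "$2" -o \
    jsonpath='{range .status.parents[*].conditions[*]}{.type}={.status} reason={.reason}{"\n"}{end}'
}

# route_ready: succeeds only when both Accepted and ResolvedRefs are True.
route_ready() {
  local conds
  conds=$(route_conditions "$1" "$2") || return 1
  printf '%s\n' "$conds" | grep -q '^Accepted=True ' &&
  printf '%s\n' "$conds" | grep -q '^ResolvedRefs=True '
}

# Example calls (uncomment to run):
#   route_conditions rt-app app
#   route_ready rt-app app && echo "route is serving"
```

The reason field printed next to each status is usually the fastest pointer to which row of the table applies.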

The HTTPRoute backendRefs fields you’ll set on every route:

| backendRef field | Meaning | Required | Notes |
| --- | --- | --- | --- |
| name | Target Service name | Yes | Must exist in the route’s namespace (or grant cross-ns) |
| port | Service port | Yes | The Service’s exposed port, not the container’s |
| weight | Relative split share | No (default 1) | 0 drains; relative, not percent |
| kind | Service (default) | No | AGC routes to Services |
| namespace | Cross-namespace target | No | Requires a ReferenceGrant in the target ns |

Weighted traffic splitting and canary

This is where Gateway API earns its keep. Splitting is native: multiple backendRefs under one rule, each with a weight. AGC distributes requests proportionally. A 90/10 canary:

# canary.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: rt-canary
  namespace: app
spec:
  parentRefs:
    - name: gw-prod
  hostnames:
    - "app.kloudvin.com"
  rules:
    - backendRefs:
        - name: app-svc-stable
          port: 80
          weight: 90
        - name: app-svc-canary
          port: 80
          weight: 10

Weights are relative, not percentages: 90/10 and 9/1 behave identically. Progress a release by editing weights and re-applying; because propagation is near-real-time, a canary ramp is just a sequence of applies. Set a weight to 0 to drain a backend without deleting the ref, which keeps the rollback path one edit away. A typical ramp, what each step means, and the rollback at every stage:

| Stage | stable / canary weights | Effective canary share | Watch before advancing | Rollback |
| --- | --- | --- | --- | --- |
| Baseline | 100 / 0 | 0% | Canary deployed, healthy, drained | (already safe) |
| Smoke | 95 / 5 | ~5% | Error rate, p95 latency on canary | Set canary → 0 |
| Early | 80 / 20 | ~20% | Business metrics, saturation | Set canary → 0 |
| Half | 50 / 50 | ~50% | Full load parity | Set canary → 0 |
| Cutover | 0 / 100 | 100% | Soak, then retire stable | Swap weights back |
| Drain | 100 / 0 | 0% | Decommission canary deployment | n/a |
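
Each stage of the ramp is a two-number change, so it can also be applied as a JSON patch instead of editing the file. A sketch under assumptions: weight_patch and set_canary_weights are hypothetical helper names, and the patch paths assume the layout of canary.yaml (one rule, stable ref first, canary ref second, both with an explicit weight, since a JSON patch replace needs the field to exist).

```shell
#!/usr/bin/env bash
# weight_patch: prints a JSON patch that sets the weights of the first two
# backendRefs in the first rule. Arguments: stable weight, canary weight.
weight_patch() {
  printf '[{"op":"replace","path":"/spec/rules/0/backendRefs/0/weight","value":%d},' "$1"
  printf '{"op":"replace","path":"/spec/rules/0/backendRefs/1/weight","value":%d}]\n' "$2"
}

# set_canary_weights: applies the patch to a live HTTPRoute.
# Arguments: route name, namespace, stable weight, canary weight.
set_canary_weights() {
  local route="$1" namespace="$2" stable="$3" canary="$4"
  kubectl patch httproute "$route" -n "$namespace" --type=json \
    -p "$(weight_patch "$stable" "$canary")"
}

# Example calls (uncomment to run):
#   set_canary_weights rt-canary app 95 5    # Smoke
#   set_canary_weights rt-canary app 100 0   # Rollback: drain the canary
```

If the route is owned by a GitOps tool, patching the live object will be reverted or flagged as drift, so in that setup the same two numbers belong in a commit instead.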

Because each step is one kubectl apply with sub-10s propagation, a controller (Argo Rollouts or Flagger) using the Gateway API provider can drive the whole ramp from metric analysis. The integration shape:

| Tool | How it drives AGC | Weight mechanism | Promotion trigger |
| --- | --- | --- | --- |
| Argo Rollouts | Gateway API plugin edits the HTTPRoute | backendRefs weights | AnalysisTemplate metric checks pass |
| Flagger | Gateway API provider patches the route | backendRefs weights | Prometheus metric thresholds |
| Manual / GitOps | kubectl apply of weight edits | backendRefs weights | Human / pipeline gate |

For the broader pattern and the metric-analysis side, see Progressive Delivery with Argo Rollouts: Canary Metrics. The split-specific failure modes you’ll actually hit:

| Symptom | Likely cause | Confirm | Fix |
| --- | --- | --- | --- |
| Canary takes ~50% not 10% | weight missing on both backendRefs (each defaults to 1, so the split is 1:1) | kubectl get httproute -o yaml shows no weight | Set explicit integer weights on every ref |
| Split ignored entirely | Two rules instead of one (more-specific match wins) | Inspect rules[]; weights must share a rule | Put weighted refs under one rule |
| Drain not draining | weight: 0 not applied / typo | ResolvedRefs + the live YAML | Re-apply; confirm propagation |
| Sticky to one backend | Client/session affinity upstream | Sample many requests, not one | Don’t conclude from a single curl |
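
To judge a split you need a sample, not a single request. The sketch below assumes each backend identifies itself in a response header; x-backend-version is a hypothetical header your workloads would have to emit, and tally_backends and sample_split are hypothetical helper names.

```shell
#!/usr/bin/env bash
# tally_backends: reads one backend identifier per line on stdin and prints
# "<identifier> <count>" for each distinct value.
tally_backends() {
  sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# sample_split: sends N requests and tallies the x-backend-version header.
# Arguments: URL, optional number of requests (default 200).
sample_split() {
  local url="$1" n="${2:-200}" i=0
  while [ "$i" -lt "$n" ]; do
    # -D - writes response headers to stdout; the body is discarded.
    curl -s -o /dev/null -D - "$url" | tr -d '\r' |
      awk -F': ' 'tolower($1) == "x-backend-version" { print $2 }'
    i=$((i + 1))
  done | tally_backends
}

# Example call (uncomment to run):
#   sample_split https://app.kloudvin.com/ 500
```

With a 90/10 split and a few hundred requests the counts should land near 9:1; small samples will wander, so treat the result as a sanity check rather than a measurement.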

Backend TLS, mTLS, and health probes

By default AGC speaks HTTP to your pods. For end-to-end encryption (re-encrypting to the backend) and for mTLS where AGC presents a client cert, you use BackendTLSPolicy targeting the Service. The policy below re-encrypts with hostname validation against a CA you trust, and includes the optional client certificate that upgrades it to mTLS:

# backend-tls.yaml
apiVersion: alb.networking.azure.io/v1
kind: BackendTLSPolicy
metadata:
  name: btls-app
  namespace: app
spec:
  targetRef:
    group: ""
    kind: Service
    name: app-svc
  default:
    sni: backend.app.svc.cluster.local
    ports:
      - port: 443
    clientCertificateRef:
      name: alb-client-cert        # omit for one-way TLS; include for mTLS
    verify:
      caCertificateRef:
        name: backend-ca
      subjectAltName: backend.app.svc.cluster.local

The clientCertificateRef is what turns this into mutual TLS: AGC presents that certificate to the backend, and a backend (an Istio sidecar, an NGINX terminating mTLS, etc.) validates it. Drop that field and you get standard one-way re-encryption. The verify block makes AGC validate the backend’s certificate against backend-ca and pin the SAN — skip it only in non-production. Every BackendTLSPolicy field, what it does, and the trade-off:

| Field | What it does | Required | Omit when | Gotcha if wrong |
| --- | --- | --- | --- | --- |
| targetRef (Service) | Attaches the policy to a Service | Yes | | Wrong kind/name → policy never applies |
| sni | SNI sent to the backend | Yes for TLS | | Mismatch with cert → handshake fail |
| ports[].port | Backend TLS port | Yes | | Wrong port → connection refused |
| verify.caCertificateRef | CA that signs the backend cert | Prod: yes | non-prod only | Missing CA → cannot validate → 502 |
| verify.subjectAltName | SAN to pin on the backend cert | Prod: yes | non-prod only | SAN mismatch → re-encrypt 502 |
| clientCertificateRef | Client cert AGC presents (mTLS) | Only for mTLS | one-way TLS | Rotated out → backend rejects |

The three backend-encryption postures, side by side, so you pick deliberately:

| Posture | verify | clientCertificateRef | Use when | Security |
| --- | --- | --- | --- | --- |
| Cleartext (default) | n/a | n/a | Internal, low-trust, non-regulated | None on the backend leg |
| One-way re-encrypt | present | absent | Most production; encrypt to pod | Server authenticated, encrypted |
| Mutual TLS (mTLS) | present | present | PCI/Zero-Trust; backend verifies AGC | Both ends authenticated |

Health probes default to GET / on the backend port. Override per-Service with a HealthCheckPolicy:

# health.yaml
apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: hc-app
  namespace: app
spec:
  targetRef:
    group: ""
    kind: Service
    name: app-svc
  default:
    interval: 5s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 3
    http:
      host: app.kloudvin.com
      path: /healthz
      match:
        statusCodes:
          - start: 200
            end: 299

Both policies attach by targetRef to the Service, so they travel with the workload, not the gateway — exactly the separation of concerns you want when app and platform teams own different manifests. The HealthCheckPolicy knobs and how to reason about each:

| Field | What it does | Default | Typical | When to change |
| --- | --- | --- | --- | --- |
| interval | Probe frequency | (managed) | 5s | Faster detect vs more probe load |
| timeout | Per-probe timeout | (managed) | 3s | Slow backends need headroom |
| healthyThreshold | Successes to mark healthy | 1 | 1 | Raise to debounce flapping |
| unhealthyThreshold | Failures to mark unhealthy | 3 | 3 | Lower for fast eviction |
| http.path | Probe path | / | /healthz | Always a shallow readiness path |
| http.host | Host header on the probe | (none) | the app host | Backends that route by host |
| http.match.statusCodes | Codes that count as healthy | 200–399 | 200–299 | Tighten to real success codes |
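
To see why tightening http.match.statusCodes from 200–399 to 200–299 matters, here is a minimal shell sketch of inclusive range matching. It assumes each start/end pair is treated as an inclusive range, as the YAML above suggests; the function name code_is_healthy and its START-END argument format are illustrative and not part of AGC.

```shell
#!/usr/bin/env bash
# Hypothetical helper: does an HTTP status code fall inside any START-END range?
code_is_healthy() {
  local code="$1"; shift
  local range start end
  for range in "$@"; do
    start="${range%-*}"   # text before the dash
    end="${range#*-}"     # text after the dash
    if [ "$code" -ge "$start" ] && [ "$code" -le "$end" ]; then
      return 0
    fi
  done
  return 1
}

# A 302 redirect (for example to a login page) passes a 200-399 match but fails 200-299.
for ranges in "200-399" "200-299"; do
  if code_is_healthy 302 "$ranges"; then
    echo "302 vs $ranges: healthy"
  else
    echo "302 vs $ranges: unhealthy"
  fi
done
```

With the wider range, a backend that answers the probe with a redirect would still count as healthy, which is usually not what a readiness check is meant to prove.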

The re-encryption/mTLS failure modes — this is where 502s hide:

| Symptom | Root cause | Confirm | Fix |
| --- | --- | --- | --- |
| 502 on the backend leg only | SAN/host mismatch vs backend cert | Controller events; BackendTLSPolicy verify | Pin subjectAltName to the cert SAN |
| 502 “untrusted” | caCertificateRef wrong/missing | The referenced CA Secret content | Upload the correct backend root CA |
| Backend rejects AGC | clientCertificateRef rotated/invalid | Backend (sidecar) TLS logs | Re-issue the client cert Secret |
| Policy never applies | targetRef wrong kind/name | kubectl describe backendtlspolicy | Match kind: Service + exact name |
| All backends unhealthy | Probe path 5xx / wrong codes | HealthCheckPolicy + app /healthz | Shallow path; correct statusCodes |
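
Most re-encrypt 502s come down to a string comparison between the pinned subjectAltName and the SANs the backend certificate actually carries. The sketch below shows that comparison, assuming the input is the text printed by openssl x509 -noout -ext subjectAltName (OpenSSL 1.1.1 or later). san_is_pinned is a hypothetical helper and the sample SAN list is made up for illustration; it does exact DNS-name matching only, so wildcard SANs are out of scope.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the pinned name appears as an exact DNS SAN.
# stdin: output of `openssl x509 -noout -ext subjectAltName`
# $1:    the subjectAltName pinned in the BackendTLSPolicy
san_is_pinned() {
  local pinned="$1"
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fx "DNS:${pinned}" >/dev/null
}

# Sample openssl output (illustrative data, not taken from a real cert).
sample='X509v3 Subject Alternative Name:
    DNS:payments.internal.kloudvin.com, DNS:payments.kloudvin.com'

for name in payments.internal.kloudvin.com payments-svc.payments.svc.cluster.local; do
  if printf '%s\n' "$sample" | san_is_pinned "$name"; then
    echo "$name: matches the cert"
  else
    echo "$name: SAN mismatch"
  fi
done
```

Against a live backend, the input would typically come from piping openssl s_client -connect HOST:PORT -servername SNI into openssl x509 -noout -ext subjectAltName, with HOST, PORT and SNI filled in for the Service under test.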

Header, path, and query routing

Gateway API matches give you composable L7 rules. Match types combine with AND semantics within a single match block; AGC evaluates more specific matches first, so ordering behaves intuitively. A few production patterns in one route:

# routing-advanced.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: rt-advanced
  namespace: app
spec:
  parentRefs:
    - name: gw-prod
  hostnames:
    - "app.kloudvin.com"
  rules:
    # Beta cohort: header-based dark launch
    - matches:
        - headers:
            - name: x-cohort
              value: beta
              type: Exact
      backendRefs:
        - name: app-svc-beta
          port: 80

    # API v2 by path prefix, with the prefix rewritten off
    - matches:
        - path:
            type: PathPrefix
            value: /api/v2
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /
      backendRefs:
        - name: api-v2-svc
          port: 8080

    # Query-param routing for a debug build
    - matches:
        - queryParams:
            - name: debug
              value: "true"
              type: Exact
      backendRefs:
        - name: app-svc-debug
          port: 80

    # Default
    - backendRefs:
        - name: app-svc-stable
          port: 80

The header and query rules win over the catch-all because they are more specific. The URLRewrite filter strips /api/v2 before the request reaches api-v2-svc. You can also inject or strip headers with RequestHeaderModifier/ResponseHeaderModifier filters — the clean way to add X-Forwarded-* or correlation headers without touching app code. The match types and their type values:

| Match type | type values | Matches on | Notes |
| --- | --- | --- | --- |
| path | PathPrefix, Exact, RegularExpression | URL path | PathPrefix is the common case |
| headers | Exact, RegularExpression | Request header value | Multiple headers AND together |
| queryParams | Exact, RegularExpression | Query string param | Useful for debug/feature toggles |
| method | GET/POST/… | HTTP method | Split read vs write paths |

The filters you compose with matches, and what each does:

| Filter | Purpose | Key fields | Typical use |
| --- | --- | --- | --- |
| URLRewrite | Rewrite path/host upstream | path.ReplacePrefixMatch, hostname | Strip an API prefix |
| RequestHeaderModifier | Add/set/remove request headers | add, set, remove | Inject correlation/forwarded headers |
| ResponseHeaderModifier | Add/set/remove response headers | add, set, remove | Security headers (HSTS) |
| RequestRedirect | Issue an HTTP redirect | scheme, statusCode, hostname | HTTP→HTTPS, host moves |
| RequestMirror | Mirror traffic to a second backend | backendRef | Shadow a new build (no client impact) |
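
The ReplacePrefixMatch rewrite used in the route above is easy to reason about as plain string surgery. The sketch below mimics it in shell for illustration only: rewrite_prefix is a hypothetical helper, and this is not a description of how AGC implements the filter.

```shell
#!/usr/bin/env bash
# Hypothetical helper mimicking URLRewrite ReplacePrefixMatch.
# $1 request path, $2 matched PathPrefix, $3 replacePrefixMatch value
rewrite_prefix() {
  local path="$1" prefix="${2%/}" repl="${3%/}"
  local rest out
  rest=${path#"$prefix"}        # what follows the matched prefix
  out="${repl}${rest}"
  printf '%s\n' "${out:-/}"     # an empty result means the root path
}

rewrite_prefix /api/v2/orders /api/v2 /     # prints /orders
rewrite_prefix /api/v2        /api/v2 /     # prints /
rewrite_prefix /api/v2/orders /api/v2 /v2   # prints /v2/orders
```

The first call corresponds to the route above: a request for /api/v2/orders reaches api-v2-svc as /orders.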

Match precedence, made explicit so route ordering never surprises you:

| Rule shape | Wins over | Why |
| --- | --- | --- |
| Exact path | PathPrefix path | Exact is more specific |
| Longer PathPrefix | Shorter PathPrefix | Longest prefix wins |
| Match with header + path | Match with path only | More match criteria = more specific |
| Any explicit match | Catch-all (no matches) | Catch-all is least specific |
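
The "longest prefix wins" row can be sketched in a few lines of shell. This is an illustration of the rule as the Gateway API describes it (prefixes match on whole path segments), not AGC's implementation; longest_prefix is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the longest PathPrefix that matches a request path,
# matching whole path segments (so /api/v2 does not match /api/v22).
longest_prefix() {
  local path="$1"; shift
  local best="" p base
  for p in "$@"; do
    base="${p%/}"                 # "/" becomes "", "/api/" becomes "/api"
    case "$path" in
      "$base"|"$base"/*)
        if [ "${#p}" -gt "${#best}" ]; then best="$p"; fi ;;
    esac
  done
  printf '%s\n' "$best"
}

longest_prefix /api/v2/orders / /api /api/v2   # prints /api/v2
longest_prefix /api/v22       / /api /api/v2   # prints /api
longest_prefix /healthz       / /api /api/v2   # prints /
```

The second call is the case worth remembering: /api/v22 shares characters with /api/v2 but not a full segment, so it falls through to the shorter /api prefix.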

Bring-your-own AGC and Private Link

When a central network team owns the AGC, you provision it with IaC and the cluster only references it. Create the AGC, a frontend, and a subnet association with Bicep:

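// Parameters the resources below rely on
param location string = resourceGroup().location
param subnetId string // resource ID of the subnet delegated to Microsoft.ServiceNetworking/trafficControllers

resource agc 'Microsoft.ServiceNetworking/trafficControllers@2023-11-01' = {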
  name: 'agc-shared'
  location: location
}

resource frontend 'Microsoft.ServiceNetworking/trafficControllers/frontends@2023-11-01' = {
  parent: agc
  name: 'fe-prod'
  location: location
}

resource assoc 'Microsoft.ServiceNetworking/trafficControllers/associations@2023-11-01' = {
  parent: agc
  name: 'assoc-prod'
  location: location
  properties: {
    associationType: 'subnets'
    subnet: { id: subnetId }
  }
}

In BYO mode you skip the ApplicationLoadBalancer CRD and instead annotate the Gateway with the existing AGC’s frontend resource ID, granting the controller identity the Configuration Manager role on that AGC’s scope. For private exposure, the AGC frontend is reachable over Azure Private Link: create a private endpoint against the frontend and resolve its FQDN through a Private DNS zone, so the generated *.fzXX.alb.azure.com name resolves to a private IP inside the spoke. That keeps north-south traffic off the public internet while preserving the same Gateway API manifests. Managed vs BYO across the dimensions that decide it:

| Dimension | Managed mode | BYO mode |
| --- | --- | --- |
| AGC created by | ALB Controller (CRD) | Your IaC (ARM/Bicep/Terraform) |
| Lifecycle in | Kubernetes manifests | Network team’s pipeline |
| Gateway references it via | alb-namespace + alb-name annotations | The AGC frontend resource ID annotation |
| Config Manager role scope | Node resource group | The AGC’s resource scope |
| Subnet ownership | Convenient, cluster-adjacent | Central, governed |
| Private Link | Possible | First-class (team-owned) |
| Best for | Greenfield, GitOps | Regulated, segregated duties |

The Azure resources behind an AGC, regardless of mode, so you can read them in the portal/ARM:

| Resource type | Role | Created in |
| --- | --- | --- |
| Microsoft.ServiceNetworking/trafficControllers | The AGC itself | Node RG (managed) / chosen RG (BYO) |
| .../trafficControllers/frontends | Listener entry point (the FQDN) | Same |
| .../trafficControllers/associations | Binds the delegated subnet | Same |
| Delegated subnet | /24 for data-plane injection | Your VNet |
| Private endpoint + Private DNS zone | Private exposure of the frontend | Spoke VNet (optional) |

Architecture at a glance

Read the diagram left to right as a request actually travels, with the control plane feeding in from the side. A client resolves app.kloudvin.com to the AGC-generated FQDN (a *.fzXX.alb.azure.com CNAME) and opens HTTPS on 443 to the AGC data plane — the managed, regional proxy fleet. The fleet’s frontend listener terminates TLS using the cert from the app-tls Secret, then a routing rule evaluates the HTTPRoute matches (path, header, query — more-specific wins) and lands the request on the weighted split: backendRefs to the stable Service (weight 90) and the canary Service (weight 10), where weight: 0 is the drain lever. From the split, traffic is re-encrypted on a fresh TLS leg to the backend pods on 443, with the BackendTLSPolicy pinning the SAN and, when clientCertificateRef is set, presenting a client cert for mTLS that the pod’s sidecar verifies. Off to the side, the control plane — the ALB Controller in azure-alb-system, authorised by a workload identity (a federated service account holding the Configuration Manager role) — watches the Gateway API CRDs and programs the data plane in seconds.

Notice how every numbered failure point maps to one CRD or identity object, which is the whole operational story: if badge 1 (the GatewayClass not Accepted) is red, nothing downstream is programmed; badge 2 is the listener/cert leg (Programmed=False, no address); badge 3 is a skewed split (a missing weight); badge 4 is the re-encrypt 502 (SAN/CA mismatch, or a cross-namespace Service without a ReferenceGrant); and badge 5 is mTLS/identity drift (a wrong federated subject, or a rotated client cert). The diagnostic method is to walk the path left to right, find the first object whose status.conditions isn’t True, and read the legend for that badge’s confirm-and-fix.

Application Gateway for Containers on AKS: a client resolves a CNAME to the AGC FQDN and opens HTTPS 443 to the managed AGC data plane, whose frontend listener terminates TLS from a Kubernetes Secret and whose routing rule evaluates HTTPRoute path/header/query matches before landing on a weighted split between a stable Service (weight 90) and a canary Service (weight 10, weight 0 drains); traffic is then re-encrypted on a fresh TLS leg to backend pods on 443 with a BackendTLSPolicy pinning the SAN and an optional client certificate for mTLS that a sidecar verifies, while off to the side the in-cluster ALB Controller in azure-alb-system — authorised by a federated workload identity holding the AppGw for Containers Configuration Manager role — programs the data plane from Gateway API CRDs; five numbered badges mark the GatewayClass-not-Accepted, listener-cert, skewed-split, re-encrypt-502, and mTLS-identity-drift failure points
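
The "walk the path left to right and stop at the first condition that isn't True" method can be sketched as a small filter. It assumes you have already flattened each object's conditions into "object conditionType status" lines in path order; first_failing is a hypothetical helper and the sample lines stand in for real kubectl output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given "<object> <conditionType> <status>" lines in path order,
# print the first condition that is not True (False and Unknown both count).
first_failing() {
  awk 'NF >= 3 && $3 != "True" { print $1 " " $2 "=" $3; found = 1; exit }
       END { if (!found) print "all conditions True" }'
}

# Sample input standing in for what kubectl would report, ordered left to right.
printf '%s\n' \
  'gatewayclass/azure-alb-external Accepted True' \
  'gateway/gw-prod Programmed False' \
  'httproute/rt-app ResolvedRefs False' | first_failing
# prints: gateway/gw-prod Programmed=False
```

The input lines could be produced per object with a kubectl jsonpath range over .status.conditions[*]; note that an HTTPRoute keeps its conditions under .status.parents[*].conditions rather than at the top level.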

Real-world scenario

A fintech platform team I worked with ran AGIC fronting roughly 40 namespaces on one shared AKS cluster. Their pain was concrete and recurring: every Ingress change anywhere triggered a full Application Gateway config push, and a single team’s frequent deploys produced 4–7 minute propagation windows during which unrelated services saw stale routing. During those windows, a customer-facing payments microservice would intermittently route to a just-decommissioned pod set because the gateway hadn’t caught up — a Sev-2 they could not reliably reproduce. Worse, their PCI scope required re-encryption to the payment pods, but AGIC’s backend-mTLS story was awkward enough that they had quietly settled for TLS terminating at the edge and cleartext to the pod — an audit finding waiting to happen, and one the next QSA assessment would certainly flag.

They migrated to AGC in BYO mode so the network team kept ownership of the AGC, its delegated /24 subnet, and a Private Link frontend, all in Terraform. Each app namespace got its own Gateway bound to the shared AGC, which decoupled the reconcile blast radius — a deploy in one namespace no longer touched another’s routing, because the controller programs per-Gateway, not per gateway resource. The PCI gap closed with a BackendTLSPolicy carrying a clientCertificateRef, giving genuine mTLS from AGC to the payment service:

apiVersion: alb.networking.azure.io/v1
kind: BackendTLSPolicy
metadata:
  name: btls-payments
  namespace: payments
spec:
  targetRef:
    group: ""
    kind: Service
    name: payments-svc
  default:
    sni: payments.internal.kloudvin.com
    ports:
      - port: 443               # backend TLS port (required per the field table); 443 assumed here
    clientCertificateRef:
      name: agc-payments-client
    verify:
      caCertificateRef:
        name: payments-ca
      subjectAltName: payments.internal.kloudvin.com

The migration ran with both ingress paths live, with AGC on a parallel hostname, and DNS was weight-shifted over a week, so there was no big-bang cutover. One real snag surfaced on day two: a shared app-tls Secret lived in a platform namespace while several Gateway listeners lived in team namespaces, so those listeners came up Programmed=False until the team added a ReferenceGrant permitting the cross-namespace Secret reference. They also briefly chased a 502 on the payments leg that turned out to be a SAN mismatch: the cert’s SAN was payments.internal.kloudvin.com but an early BackendTLSPolicy pinned payments-svc.payments.svc.cluster.local. Aligning subjectAltName to the cert fixed it in one apply.

The measurable outcome: routing propagation dropped from minutes to single-digit seconds, the noisy-neighbour reconcile storms disappeared, the intermittent payments mis-route Sev-2 stopped recurring, and the next QSA assessment recorded encryption all the way to the cardholder-data workload. Canary releases that used to be a deploy-time ritual became a weight edit driven by Argo Rollouts. The lesson on the wall: “AGC moves the gateway out of ARM and into Kubernetes objects — so every failure is now a status.conditions you can read, not an ARM deployment you wait on.”

Advantages and disadvantages

AGC’s managed-data-plane-plus-in-cluster-controller model is a clear win for ingress at scale, but it is not free of trade-offs — most notably the missing WAF. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
| --- | --- |
| Near-real-time programming (seconds), no ARM throttling on every change | No built-in WAF: you must add Front Door / AppGW v2 upstream for L7 protection |
| Per-Gateway blast radius: one namespace can’t stall another’s routing | More moving parts (controller, federated identity, delegated subnet) to stand up correctly |
| Native weighted splitting: canary is a one-line weight edit | Gateway API + AGC CRDs are a learning curve vs familiar Ingress |
| First-class backend re-encryption and mTLS via BackendTLSPolicy | Cross-namespace refs need ReferenceGrant, an easy first-day trip-up |
| Secretless auth via workload identity (no SP passwords to rotate) | Federated-credential subject typos fail as opaque 401s |
| Policies (BackendTLSPolicy, HealthCheckPolicy) travel with the Service | Newer product, with a smaller community corpus than NGINX/AGIC |
| Portable, typed Gateway API manifests (multi-implementation) | Region availability and feature parity still maturing in places |

The model is right for any AKS estate doing ingress for more than a few namespaces, anyone needing progressive delivery wired to a real splitting data plane, and regulated workloads that must prove encryption to the pod. It is less compelling if you need a WAF at the same hop with zero extra components (then a classic Application Gateway v2 + WAF_v2, possibly via AGIC, is simpler), or if your cluster runs a single app and the AGIC/Ingress familiarity outweighs AGC’s gains. The disadvantages are all manageable — but only if you stand the prerequisites up precisely, which is the point of the install section.

Hands-on lab

Stand up AGC end to end on an existing AKS cluster, ship a 90/10 split, verify it lands, then add header routing and tear down. Run in Cloud Shell (Bash) with kubectl pointed at a test cluster you can administer. This uses a small managed AGC and a couple of single-replica deployments — minutes of runtime, deleted at the end.

Step 1 — Variables and cluster features.

RG=rg-agc-lab
AKS=aks-agc-lab
LOC=eastus2
az aks update -g "$RG" -n "$AKS" --enable-oidc-issuer --enable-workload-identity -o table
OIDC=$(az aks show -g "$RG" -n "$AKS" --query oidcIssuerProfile.issuerUrl -o tsv)
az aks get-credentials -g "$RG" -n "$AKS" --overwrite-existing

Expected: the cluster shows oidcIssuerProfile.enabled = true and a securityProfile.workloadIdentity block.

Step 2 — Identity, role, federation, and the Helm install.

ID=alb-lab-id
az identity create -g "$RG" -n "$ID" -l "$LOC" -o table
PID=$(az identity show -g "$RG" -n "$ID" --query principalId -o tsv)
CID=$(az identity show -g "$RG" -n "$ID" --query clientId -o tsv)
MCID=$(az group show -n "$(az aks show -g "$RG" -n "$AKS" --query nodeResourceGroup -o tsv)" --query id -o tsv)
az role assignment create --assignee-object-id "$PID" --assignee-principal-type ServicePrincipal \
  --scope "$MCID" --role "AppGw for Containers Configuration Manager"
az identity federated-credential create --name alb-lab-fc --identity-name "$ID" -g "$RG" \
  --issuer "$OIDC" --subject "system:serviceaccount:azure-alb-system:alb-controller-sa" \
  --audience api://AzureADTokenExchange
helm upgrade --install alb-controller \
  oci://mcr.microsoft.com/application-lb/charts/alb-controller --version 1.7.9 \
  --namespace azure-alb-system --create-namespace \
  --set albController.podIdentity.clientID="$CID"
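
Because a mistyped federated-credential subject is the most common cause of controller 401s, a byte-for-byte string check is a cheap guard. The sketch below builds the subject Kubernetes uses for a service account token and compares it to a candidate value; expected_subject and subject_ok are hypothetical helper names, and the candidates are sample strings rather than output from a real identity.

```shell
#!/usr/bin/env bash
# Build the subject Kubernetes puts in a service account token.
expected_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"   # $1 namespace, $2 service account
}

# Hypothetical helper: compare a configured subject with the expected one exactly.
subject_ok() {
  [ "$1" = "$(expected_subject "$2" "$3")" ]
}

# Sample candidates: one correct, one truncated, one with a trailing space.
for candidate in \
  "system:serviceaccount:azure-alb-system:alb-controller-sa" \
  "system:serviceaccount:azure-alb-system:alb-controller" \
  "system:serviceaccount:azure-alb-system:alb-controller-sa "; do
  if subject_ok "$candidate" azure-alb-system alb-controller-sa; then
    echo "ok:       [$candidate]"
  else
    echo "mismatch: [$candidate]"
  fi
done
```

In the lab, the candidate would be the subject field read back from az identity federated-credential show for alb-lab-fc.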

Step 3 — Confirm the controller and GatewayClass are healthy.

kubectl get pods -n azure-alb-system
kubectl get gatewayclass azure-alb-external \
  -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}{"\n"}'

Expected: controller pods Running, and Accepted prints True.

Step 4 — Provision a managed AGC against the delegated subnet.

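kubectl create namespace alb-infra
SUBNET=$(az network vnet subnet show -g "$RG" --vnet-name vnet-agc-lab --name subnet-alb --query id -o tsv)
# The controller identity needs Network Contributor on the subnet to create the association
az role assignment create --assignee-object-id "$PID" --assignee-principal-type ServicePrincipal \
  --scope "$SUBNET" --role "Network Contributor"
cat <<EOF | kubectl apply -f -
apiVersion: alb.networking.azure.io/v1
kind: ApplicationLoadBalancer
metadata: { name: alb-lab, namespace: alb-infra }
spec: { associations: [ "$SUBNET" ] }
EOF
# Print each condition with its status and reason, not just the condition types
kubectl get applicationloadbalancer alb-lab -n alb-infra \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'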

Expected (after a few minutes): a Deployment condition reaching Succeeded.

Step 5 — Deploy two versioned backends and a Gateway, then split 90/10.

kubectl create namespace app
# stable + canary echo deployments that return their version on /version
kubectl create deployment app-stable -n app --image=mcr.microsoft.com/azuredocs/aks-helloworld:v1
kubectl create deployment app-canary -n app --image=mcr.microsoft.com/azuredocs/aks-helloworld:v2
kubectl expose deployment app-stable -n app --name=app-svc-stable --port=80 --target-port=80
kubectl expose deployment app-canary -n app --name=app-svc-canary --port=80 --target-port=80
cat <<'EOF' | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gw-lab
  namespace: app
  annotations:
    alb.networking.azure.io/alb-namespace: alb-infra
    alb.networking.azure.io/alb-name: alb-lab
spec:
  gatewayClassName: azure-alb-external
  listeners:
    - { name: http, protocol: HTTP, port: 80, allowedRoutes: { namespaces: { from: Same } } }
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata: { name: rt-lab, namespace: app }
spec:
  parentRefs: [ { name: gw-lab } ]
  rules:
    - backendRefs:
        - { name: app-svc-stable, port: 80, weight: 90 }
        - { name: app-svc-canary, port: 80, weight: 10 }
EOF

Step 6 — Read the FQDN, confirm Programmed, and sample the split.

kubectl get gateway gw-lab -n app \
  -o jsonpath='{.status.addresses[0].value} {.status.conditions[?(@.type=="Programmed")].status}{"\n"}'
FQDN=$(kubectl get gateway gw-lab -n app -o jsonpath='{.status.addresses[0].value}')
for i in $(seq 1 50); do curl -s "http://${FQDN}/"; echo; done | sort | uniq -c

Expected: an FQDN plus True, and the 50-sample count lands roughly 45 stable / 5 canary (relative weights, so exact counts vary).
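
Since weights are relative, the expected share of each backend is its weight divided by the sum of all weights in the rule. The sketch below does that arithmetic so you can sanity-check a sample count; split_share is a hypothetical helper, and the default weight of 1 for an unweighted backendRef comes from the Gateway API.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn relative backendRef weights into traffic shares (percent).
split_share() {
  awk 'BEGIN {
    for (i = 1; i < ARGC; i++) total += ARGV[i]
    for (i = 1; i < ARGC; i++)
      printf "%s%.1f", (i > 1 ? " " : ""), (total > 0 ? 100 * ARGV[i] / total : 0)
    printf "\n"
  }' "$@"
}

split_share 90 10   # prints 90.0 10.0
split_share 9 1     # prints 90.0 10.0   (same ratio, weights are relative)
split_share 90 1    # prints 98.9 1.1    (one ref left at the default weight of 1)
split_share 1 1     # prints 50.0 50.0   (both refs left at the default)
split_share 100 0   # prints 100.0 0.0   (weight 0 drains the canary)
```

For the 50-request sample above, 90.0 and 10.0 percent work out to about 45 and 5 responses, with some run-to-run variation.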

Step 7 — Add header routing for a beta cohort and confirm it.

kubectl patch httproute rt-lab -n app --type=json -p='[
  {"op":"add","path":"/spec/rules/0","value":{
    "matches":[{"headers":[{"name":"x-cohort","value":"beta","type":"Exact"}]}],
    "backendRefs":[{"name":"app-svc-canary","port":80}]}}]'
curl -s "http://${FQDN}/" -H "x-cohort: beta"   # should always hit canary (v2)

Expected: with the x-cohort: beta header you consistently reach the canary (v2) backend; without it you get the 90/10 split.

Step 8 — Teardown.

kubectl delete namespace app
kubectl delete applicationloadbalancer alb-lab -n alb-infra
kubectl delete namespace alb-infra
helm uninstall alb-controller -n azure-alb-system
az identity delete -g "$RG" -n "$ID"

The lab steps mapped to what each proves:

| Step | What you did | What it proves |
| --- | --- | --- |
| 2 | Federate identity + Helm install | Secretless control-plane auth |
| 3 | GatewayClass Accepted=True | The controller is live and authorised |
| 4 | ApplicationLoadBalancer CRD | Managed-mode AGC provisioning |
| 5–6 | Gateway + split, sample FQDN | Native weighted traffic split works |
| 7 | Header match patch | Composable L7 routing, seconds to apply |
| 8 | Delete CRDs + identity | Clean lifecycle in Kubernetes |

Cost note. A managed AGC plus two single-replica pods for an hour is well under ₹100; deleting the namespaces, the ApplicationLoadBalancer, and the identity stops all AGC charges. The AKS cluster itself is the larger cost — reuse an existing test cluster rather than creating one for the lab.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table for mid-incident, then the entries that bite hardest expanded with the full reasoning.

| # | Symptom | Root cause | Confirm (exact cmd) | Fix |
| --- | --- | --- | --- | --- |
| 1 | GatewayClass azure-alb-external never Accepted | Controller down, or identity lacks Config Manager role | kubectl get pods -n azure-alb-system; kubectl describe gatewayclass azure-alb-external | Fix Helm install; grant Config Manager on the AGC scope |
| 2 | Controller pod runs but logs 401 / token errors | Federated-credential subject wrong, or workload identity off | kubectl logs deploy/alb-controller -n azure-alb-system; az identity federated-credential show | Re-create fedcred with exact system:serviceaccount:azure-alb-system:alb-controller-sa |
| 3 | ApplicationLoadBalancer stuck, no AGC created | Subnet not /24 or not delegated; missing role | kubectl get applicationloadbalancer -o yaml (Reason); az network vnet subnet show --query delegations | Delegate subnet to trafficControllers; size /24; grant role |
| 4 | Gateway has no address, Programmed=False | TLS Secret missing, or AGC not ready | kubectl describe gateway gw-prod -n app | Create the app-tls Secret in the listener ns; wait for AGC |
| 5 | HTTPRoute Accepted=False | parentRefs wrong, or Gateway allowedRoutes disallows it | kubectl get httproute -o yaml (parents conditions) | Fix parentRefs; widen allowedRoutes.namespaces |
| 6 | HTTPRoute ResolvedRefs=False | Service/Secret missing, or cross-namespace without grant | kubectl describe httproute rt-app -n app | Create the Service; add a ReferenceGrant in the target ns |
| 7 | Split lands ~50/50 not 90/10 | Both backendRefs missing weight (each defaults to 1); a single missing weight skews to ~99/1 instead | kubectl get httproute -o yaml | Set explicit integer weights on every ref |
| 8 | 502 only on the re-encrypt leg | BackendTLSPolicy SAN/CA mismatch | controller events; the cert’s actual SAN | Pin subjectAltName to the cert SAN; correct caCertificateRef |
| 9 | Backend rejects AGC (mTLS) | clientCertificateRef rotated/invalid | backend (sidecar) TLS logs | Re-issue the client cert Secret |
| 10 | All backends marked unhealthy → 503 | Probe path 5xx or wrong statusCodes | HealthCheckPolicy; curl the app /healthz | Shallow path; correct match.statusCodes |
| 11 | Header/query route never matches | A more-specific rule shadows it, or wrong type | kubectl get httproute -o yaml rules order | Use Exact/RegularExpression; rely on specificity |
| 12 | Routing changes don’t propagate | Editing an object the route doesn’t parent to; controller wedged | kubectl logs deploy/alb-controller -n azure-alb-system | Restart controller; check it reconciles the object |
| 13 | Private FQDN resolves publicly | Private DNS zone not linked, no private endpoint | nslookup the AGC FQDN from the spoke | Create PE on the frontend; link Private DNS zone |
| 14 | Cert rotation didn’t take effect | Replaced cert content but not the referenced Secret | kubectl get secret app-tls -o yaml; Gateway events | Update the exact Secret the listener references |

The expanded form, for the entries that cost the most time:

1. GatewayClass azure-alb-external never reaches Accepted. Root cause: the ALB Controller isn’t running, or its identity lacks the AppGw for Containers Configuration Manager role, so it can’t bind the class. Confirm: kubectl get pods -n azure-alb-system (are they Running?), then kubectl describe gatewayclass azure-alb-external for the condition reason. Fix: re-check the Helm install (--set albController.podIdentity.clientID correct) and that the role assignment landed on the right scope (node RG for managed, AGC scope for BYO).

2. The controller pod runs but its logs show 401 / token-exchange errors. Root cause: the federated-credential subject is wrong (the single most common cause), or workload identity isn’t actually enabled so the SA token isn’t projected. Confirm: kubectl logs deploy/alb-controller -n azure-alb-system shows AAD token failures; az identity federated-credential show and compare the subject to system:serviceaccount:azure-alb-system:alb-controller-sa exactly. Fix: re-create the federated credential with the exact subject, issuer, and api://AzureADTokenExchange audience; confirm --enable-workload-identity on the cluster.

3. The ApplicationLoadBalancer CRD applies but no AGC is created. Root cause: the subnet is smaller than /24 or not delegated to Microsoft.ServiceNetworking/trafficControllers, or the controller lacks rights on the subnet’s RG. Confirm: kubectl get applicationloadbalancer alb-prod -n alb-infra -o yaml and read the condition Reason (e.g. SubnetDelegationMissing); az network vnet subnet show --query "{prefix:addressPrefix, deleg:delegations}". Fix: delegate the subnet, size it at /24 or larger (at least 256 addresses), and ensure the controller identity has Network Contributor on the subnet’s RG.

6. HTTPRoute shows ResolvedRefs=False. Root cause: a referenced Service or TLS Secret doesn’t exist, or it lives in another namespace without a ReferenceGrant permitting the reference. Confirm: kubectl describe httproute rt-app -n app names the unresolved ref; check the Service/Secret exists in the expected namespace. Fix: create the missing object, or add a ReferenceGrant in the target namespace allowing the route/listener’s namespace to reference it:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-app-to-platform-tls
  namespace: platform           # the namespace that OWNS the Secret/Service
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: Gateway
      namespace: app            # the namespace that REFERENCES it
  to:
    - group: ""
      kind: Secret
      name: app-tls

7. The split lands ~50/50 when you configured 90/10. Root cause: weights are relative and a backendRef without a weight defaults to 1, so if neither ref carries a weight in the live object the split is 1:1. A single missing weight skews the other way: 90 against a defaulted 1 is roughly 99/1. The other common cause is that the two backends ended up in separate rules, where one rule wins outright and there is no split at all. Confirm: kubectl get httproute rt-canary -n app -o yaml and verify both refs share one rules[] entry and both carry explicit integer weights. Fix: put both weighted refs under a single rule with explicit weights; remember weights are relative.

8. A 502 appears only on the re-encrypt leg, not before the gateway. Root cause: BackendTLSPolicy verify doesn’t match the backend cert — wrong subjectAltName, or a caCertificateRef that doesn’t sign the backend’s cert. Confirm: controller events on the BackendTLSPolicy; inspect the backend cert’s real SAN (e.g. openssl s_client against the pod) and compare. Fix: set subjectAltName to the cert’s actual SAN and caCertificateRef to the CA that signed it.

12. Routing changes stop propagating. Root cause: the controller is wedged (a bad object earlier in the watch, or it lost its lease), or you’re editing an object the route doesn’t actually parent to. Confirm: kubectl logs deploy/alb-controller -n azure-alb-system for reconcile errors; confirm the HTTPRoute parentRefs points at the live Gateway. Fix: fix or remove the offending object; if needed restart the controller deployment to force a clean reconcile.

Best practices

The defaults to override on every new AGC, and what each prevents:

| Default | Override to | Prevents |
| --- | --- | --- |
| HTTP to backend (cleartext) | BackendTLSPolicy re-encrypt + verify | Plaintext-to-pod audit finding |
| GET / health probe | HealthCheckPolicy /healthz | Healthy backends marked unhealthy |
| Defaulted backendRef weight | Explicit integer weights | Skewed canary splits |
| allowedRoutes: All (if set) | Same / a Selector | Unintended route attachment |
| Floating chart tag | Pinned --version | Surprise controller upgrades |
| No ReferenceGrant | Pre-created grants | Programmed=False on shared Secrets |

Security notes

The security controls that also harden routing, mapped to what each defends and prevents:

| Control | Mechanism | Secures against | Also prevents |
| --- | --- | --- | --- |
| Workload identity | Federated SA → UAMI | Stored SP secrets | Credential-rotation breakage |
| Scoped Config Manager role | Built-in role on AGC scope | Over-privileged controller | Accidental cross-resource changes |
| BackendTLSPolicy + verify | Re-encrypt + CA/SAN pin | Cleartext-to-pod, MITM | Backend cert drift going unnoticed |
| Client cert (mTLS) | clientCertificateRef | Unauthenticated upstream to backend | Spoofed gateway traffic |
| Private Link frontend | PE + Private DNS | Public exposure | DNS leakage of internal services |
| allowedRoutes scoping | Same / Selector | Foreign route attachment | Route hijack across teams |
| Upstream WAF | Front Door / AppGW v2 | OWASP-class L7 attacks | (AGC has no WAF of its own) |

Cost & sizing

The cost model is fundamentally different from AGIC’s per-gateway-hour-plus-capacity-units billing, and far simpler to reason about once you separate the data plane from what surrounds it.

The cost drivers and what each one buys you:

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
| --- | --- | --- | --- | --- |
| AGC data plane | Managed proxy (hourly + usage) | ~₹3,000–8,000 (traffic-dependent) | The whole L7 ingress fleet | Usage scales with traffic |
| ALB Controller | 2 small pods on AKS | negligible | The control plane | Counts against node capacity |
| Front Door Premium (WAF) | Edge + managed WAF | ~₹25,000+ | OWASP protection AGC lacks | Often the biggest added cost |
| AppGW v2 + WAF_v2 (alt) | Gateway-hour + capacity units | ~₹15,000–30,000 | WAF at a thin edge hop | Reintroduces an ARM gateway |
| Private Link | PE hourly + per-GB | ~₹1,500–3,000 | Private-only exposure | Per-endpoint, per-spoke |
| Data processing / egress | Per-GB through the proxy | traffic-dependent | (the traffic itself) | Spikes during incidents/sales |

Sizing rule of thumb: one AGC per cluster (or per environment) serves many teams via per-namespace Gateways — you almost never need multiple AGCs for capacity, only for hard isolation or BYO governance. Consolidating off sharded AGICs onto a single AGC was, for the fintech team above, a net cost reduction even after adding a Front Door WAF hop, because they collapsed dozens of gateway resources into one managed data plane. For broader AKS cost levers, see Kubernetes Cost Allocation & Rightsizing with Kubecost.

Interview & exam questions

1. How does Application Gateway for Containers differ architecturally from AGIC? AGIC ran a single in-cluster pod that mutated a Standard_v2 Application Gateway ARM resource on every Ingress change, producing multi-minute propagation. AGC has a managed, regional proxy data plane and an in-cluster ALB Controller that programs it via a config plane in seconds, speaks the Gateway API instead of Ingress, scopes blast radius per-Gateway, and has no built-in WAF.

2. Why does AGC use the Gateway API rather than the Ingress API? Because all of AGC’s routing capability — weighted traffic splitting via backendRefs weights, header/path/query matches, BackendTLSPolicy re-encryption and mTLS, HealthCheckPolicy — maps onto typed Gateway API objects, avoiding the annotation sprawl Ingress required. Gateway API is also portable across implementations and is where Microsoft is investing.

3. How does the ALB Controller authenticate to Azure? Via workload identity: a user-assigned managed identity is federated to the controller’s Kubernetes service account (azure-alb-system:alb-controller-sa), and the controller exchanges the projected SA token for an Azure token — no service-principal secret. The identity holds the AppGw for Containers Configuration Manager role on the AGC scope.

4. What are managed mode and BYO mode, and when do you pick each? In managed mode the controller creates the AGC and its subnet association from an ApplicationLoadBalancer CRD — best for greenfield, GitOps-driven estates. In BYO mode a central team provisions the AGC via IaC and the cluster only references it — best when a platform-networking team must own the AGC, subnet delegation, and Private Link independently of any cluster.

5. How do you ship a 90/10 canary on AGC, and what does weight: 0 do? Put two backendRefs (stable and canary) under one HTTPRoute rule with weight: 90 and weight: 10; AGC distributes proportionally. Weights are relative, not percentages. Setting a backend’s weight to 0 drains it to zero traffic without deleting the ref, keeping rollback one edit away.

6. A route shows ResolvedRefs=False. What are the two most likely causes? Either a referenced Service or TLS Secret doesn’t exist in the route’s namespace, or it lives in another namespace without a ReferenceGrant permitting the cross-namespace reference. Confirm with kubectl describe httproute; fix by creating the object or adding a ReferenceGrant in the target namespace.

7. How do you enforce mTLS from AGC to a backend pod? Apply a BackendTLSPolicy targeting the Service with a clientCertificateRef (the cert AGC presents) plus a verify block (caCertificateRef + subjectAltName) so AGC also validates the backend. Drop clientCertificateRef for one-way re-encryption; keep verify on in production either way.

8. The GatewayClass azure-alb-external never reaches Accepted. What do you check? Whether the ALB Controller pods are Running in azure-alb-system, and whether its identity holds AppGw for Containers Configuration Manager on the AGC scope. A controller that’s down or unauthorised can’t bind the class. Then check kubectl describe gatewayclass for the condition reason.

9. Does AGC include a WAF? If not, how do you protect L7? No — AGC has no built-in WAF. You place protection upstream: Front Door Premium (managed WAF) or a classic Application Gateway v2 + WAF_v2 hop in front of the AGC FQDN, letting AGC own routing and backend TLS while the edge does OWASP-class inspection.

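10. A split you set to 90/10 is landing ~50/50. What’s wrong? Most likely the weights are not taking effect as written. If both backendRefs omit weight, each defaults to 1 and traffic splits evenly; a single missing weight also skews the ratio, because that backend gets a share of 1 against the other’s explicit value. The other common cause is that the two backends ended up in separate rules[], in which case each request matches one rule and there is no weighted split at all. Put both refs under a single rule with explicit integer weights.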

11. How does AGC achieve near-real-time routing changes when AGIC took minutes? AGC’s controller writes desired state to a managed config plane and the regional proxy fleet converges in seconds. AGIC, by contrast, redeployed an ARM Application Gateway resource on every change, so each update incurred ARM control-plane latency and throttling.

12. What subnet requirements does AGC impose? A dedicated subnet of at least /24, delegated to Microsoft.ServiceNetworking/trafficControllers, into which AGC injects its data plane. A smaller or undelegated subnet fails provisioning (often surfaced as an ApplicationLoadBalancer condition reason).
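
Two of the answers above translate directly into short manifests: the 90/10 canary from question 5 and the mTLS policy from question 7. The following is a minimal sketch, not a definitive configuration. The names demo, gateway-01, web-stable, web-canary, agc-client-cert, backend-ca and the SNI/SAN value web.demo.internal are all hypothetical, and the BackendTLSPolicy field layout is an assumption based on the alb.networking.azure.io/v1 CRD, so check it against the CRD version installed in your cluster.

```yaml
# Weighted canary: both backends sit under ONE rule with explicit weights.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-canary-split        # hypothetical name
  namespace: demo               # hypothetical namespace
spec:
  parentRefs:
  - name: gateway-01            # hypothetical Gateway in the same namespace
  rules:
  - backendRefs:
    - name: web-stable          # hypothetical Service
      port: 443
      weight: 90                # relative weight, not a percentage
    - name: web-canary          # hypothetical Service
      port: 443
      weight: 10                # set to 0 to drain without deleting the ref
---
# mTLS to the backend: AGC presents a client cert and validates the server.
# Field layout assumed from the alb.networking.azure.io/v1 CRD.
apiVersion: alb.networking.azure.io/v1
kind: BackendTLSPolicy
metadata:
  name: web-stable-mtls         # hypothetical name
  namespace: demo
spec:
  targetRef:
    group: ""
    kind: Service
    name: web-stable
    namespace: demo
  default:
    sni: web.demo.internal      # hypothetical SNI sent to the backend
    ports:
    - port: 443
    clientCertificateRef:       # drop this block for one-way re-encryption
      group: ""
      kind: Secret
      name: agc-client-cert     # hypothetical Secret holding the client cert
    verify:                     # keep verification on in production
      caCertificateRef:
        group: ""
        kind: Secret
        name: backend-ca        # hypothetical Secret holding the CA bundle
      subjectAltName: web.demo.internal
```

Because weights are relative, 9 and 1 would produce the same distribution as 90 and 10. The policy above covers only web-stable; in this sketch web-canary would need an equivalent policy of its own if it also terminates TLS.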

These map primarily to AZ-700 (Designing and Implementing Azure Networking) — load balancing and application delivery — and the CKA/CKAD Gateway API and services/networking domains, with the workload-identity mechanics touching AZ-500. A compact cert-mapping for revision:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| AGC vs AGIC, AGC architecture | AZ-700 | Design & implement application delivery |
| Gateway API objects, routing, splits | CKA / CKAD | Services & networking; Gateway API |
| Workload identity, federation, roles | AZ-500 / AZ-700 | Secure identity; secretless access |
| BackendTLSPolicy, mTLS, re-encrypt | AZ-700 / AZ-500 | Secure connectivity; encryption in transit |
| BYO mode, Private Link, subnet delegation | AZ-700 | Hybrid/private connectivity |

Quick check

  1. AGC has no built-in WAF. Name two ways to add L7 protection in front of it.
  2. You set a 90/10 split but traffic lands ~50/50. What is the single most likely misconfiguration?
  3. A Gateway listener’s TLS Secret lives in a different namespace and the listener is Programmed=False. What object fixes it?
  4. How does the ALB Controller authenticate to Azure, and what is the one string that most often breaks it?
  5. What does setting a backend’s weight to 0 accomplish, and why is it useful during a rollout?

Answers

  1. Front AGC with Front Door Premium (managed WAF) or a thin Application Gateway v2 + WAF_v2 hop pointed at the AGC FQDN. AGC owns routing/backend TLS; the upstream hop does OWASP-class inspection.
  2. The weights are not taking effect as written: both backendRefs omit weight (each defaults to 1, so traffic splits evenly), or the two backends sit in separate rules[] so there is no split at all. Put both weighted refs under one rule with explicit integer weights.
  3. A ReferenceGrant in the target namespace (the one that owns the Secret), permitting the listener’s namespace and Gateway kind to reference that Secret. Without it, cross-namespace refs are denied and the listener stays Programmed=False.
  4. Via workload identity — a user-assigned managed identity federated to the azure-alb-system:alb-controller-sa service account. The string that most often breaks it is the federated-credential subject, which must be exactly system:serviceaccount:azure-alb-system:alb-controller-sa.
  5. It drains the backend to zero traffic without deleting the backendRef, so the route and the backend stay in the manifest and rollback is a one-line weight edit rather than a redeploy.
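
The ReferenceGrant from answer 3 might look like the sketch below. It assumes a hypothetical Gateway in namespace edge that needs a TLS Secret named contoso-tls in namespace certs; all three names are illustrative only.

```yaml
# Lives in the TARGET namespace (the one that owns the Secret).
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-edge-gateway-tls  # hypothetical name
  namespace: certs              # hypothetical namespace that owns the Secret
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: Gateway
    namespace: edge             # hypothetical namespace of the Gateway
  to:
  - group: ""                   # core API group, where Secret lives
    kind: Secret
    name: contoso-tls           # optional; omit to allow every Secret in certs
```

The grant only permits the reference. The listener's certificateRefs entry still has to name the Secret's namespace explicitly, otherwise the Gateway looks for the Secret in its own namespace.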

Next steps

You can now stand up AGC on AKS, drive ingress through Gateway API, and ship splits, mTLS and header routing in production. Build outward:
