Istio Ambient Mesh in Practice: Zero-Trust mTLS, Traffic Management & L7 Authorization

Ambient mode splits the Istio data plane in two: a per-node L4 proxy (ztunnel) that does mTLS and identity for every enrolled pod for free, and an optional per-service L7 proxy (waypoint) you pay for only where you need HTTP routing or rich authorization. No sidecars, no pod restarts to join the mesh, and a CPU/memory bill that scales with traffic instead of pod count. This guide takes a workload from unmeshed to zero-trust, then routes, secures, and debugs it the way you would on call.

Everything here targets ambient as it shipped GA in Istio 1.24 and the APIs stable since. Commands are real and current; where a feature has a sharp edge I call it out rather than paper over it.

1. Sidecar vs ambient: what you are actually deploying

In the sidecar model every pod carries its own Envoy. That Envoy does L4 and L7, costs ~50-100Mi of memory per pod whether or not the pod needs HTTP features, and requires a pod restart to inject or upgrade. Ambient unbundles that:

| Concern | Sidecar | Ambient |
| --- | --- | --- |
| mTLS + L4 identity | Per-pod Envoy | ztunnel (a DaemonSet, one pod per node) |
| HTTP routing / retries / L7 authz | Same per-pod Envoy | waypoint, opt-in per namespace or service |
| Join the mesh | Inject sidecar, restart pod | Label namespace, no restart |
| Cost model | Scales with pod count | L4 scales with nodes; L7 scales with traffic |
| Upgrade blast radius | Restart every pod | Roll the DaemonSet / waypoint Deployment |

The trade-off is honest, not free. ztunnel tunnels traffic over HBONE (HTTP/2 CONNECT on port 15008, mTLS-wrapped), which adds a hop and a small latency tax versus a direct sidecar-to-sidecar path. The big win is that pods needing only encryption-in-transit and L4 policy never pay for an L7 proxy at all. You add a waypoint only when a service genuinely needs L7 — and a waypoint is a normal Deployment you can scale and schedule independently of the apps behind it.

Mental model: ztunnel is the zero-trust floor every workload gets. The waypoint is an L7 upgrade you bolt on per service. Traffic only traverses a waypoint when the destination it was originally addressed to has one configured.

2. Install ambient mode

Install with the ambient profile. This lays down istiod, the Istio CNI node agent (which sets up traffic redirection without NET_ADMIN in your app pods), and the ztunnel DaemonSet.

istioctl install --set profile=ambient --skip-confirmation

# Confirm the control plane and data plane are up
kubectl get pods -n istio-system
kubectl get daemonset ztunnel -n istio-system
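
If you keep installs in Git, the same profile can be expressed as an IstioOperator file and passed to istioctl install -f. This is a minimal sketch; the file name and the metadata.name are illustrative, not something the install requires.

```yaml
# ambient-install.yaml (illustrative file name)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ambient            # illustrative name
  namespace: istio-system
spec:
  profile: ambient         # same profile as --set profile=ambient
```

Apply it with istioctl install -f ambient-install.yaml --skip-confirmation, then run the same pod and DaemonSet checks as above.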

You should see istiod, istio-cni-node (one per node), and ztunnel (one per node). If you run a managed cluster, check the CNI chaining mode for your platform — on some distributions the Istio CNI must be ordered after the primary CNI. Install the Kubernetes Gateway API CRDs too; waypoints are Gateway API resources:

kubectl get crd gateways.gateway.networking.k8s.io >/dev/null 2>&1 || \
  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

3. Enroll namespaces incrementally

This is ambient’s headline feature: enrollment is a label, and existing pods join with zero restarts.

kubectl label namespace shop istio.io/dataplane-mode=ambient

# Verify workloads are now seen by ztunnel as HBONE-capable
istioctl ztunnel-config workloads --workload-namespace shop

Every pod in shop is now inside the L4 mesh. ztunnel-config workloads (aliased istioctl zc workloads) lists each workload’s address, the protocol (HBONE once enrolled), and any assigned waypoint. Roll out namespace by namespace — because there is no sidecar to inject, you can mesh a busy namespace during business hours without a rolling restart, which is the single biggest operational difference from the sidecar model.
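
For GitOps workflows the enrollment label can live in the Namespace manifest instead of an imperative kubectl label. A minimal sketch, assuming the shop namespace is managed declaratively:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    istio.io/dataplane-mode: ambient   # enrolls every pod in the namespace at L4
```

Removing the label from the manifest (and from the live object) takes the namespace back out of the mesh, again without restarting pods.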

To exclude a specific pod (say a job whose raw TCP connections do not tolerate redirection), label the pod:

kubectl label pod <pod> istio.io/dataplane-mode=none --overwrite
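
A label applied with kubectl label lives only as long as that pod does. For anything created by a controller, it is usually safer to put the opt-out in the pod template. The sketch below assumes a hypothetical Job named raw-tcp-export with a placeholder image; only the label line is the point.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: raw-tcp-export                 # hypothetical name
  namespace: shop
spec:
  template:
    metadata:
      labels:
        istio.io/dataplane-mode: none  # pod-level opt-out overrides the namespace label
    spec:
      restartPolicy: Never
      containers:
      - name: export
        image: registry.example.com/raw-tcp-export:1.0   # placeholder image
```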

4. Enforce strict mTLS and a default-deny posture

Enrollment already encrypts pod-to-pod traffic with mTLS via HBONE, but it does not yet forbid plaintext. Lock that down with a mesh-wide PeerAuthentication in STRICT mode, then layer a default-deny AuthorizationPolicy.

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # mesh-wide root config namespace
spec:
  mtls:
    mode: STRICT
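
During a migration you may have a namespace that still receives plaintext from clients outside the mesh. A namespace-scoped PeerAuthentication takes precedence over the mesh-wide one, so a temporary exception could look like the sketch below. The namespace name legacy-batch is hypothetical.

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-batch   # hypothetical namespace that still needs plaintext
spec:
  mtls:
    mode: PERMISSIVE        # accept both mTLS and plaintext; delete once clients are meshed
```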

Now the deny-all baseline. An AuthorizationPolicy with an empty spec (no rules) and the default ALLOW action denies everything, because “allow nothing” is the result of an allow-policy with zero matching rules.

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: shop
spec:
  {}   # ALLOW action, zero rules => deny everything in this namespace

From here you punch holes with explicit allows. This L4 policy lets only the web service account reach orders, matched by SPIFFE identity rather than IP. It is enforced by ztunnel because it uses a selector and references only L4 attributes (principals, ports) — no HTTP:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-web
  namespace: shop
spec:
  selector:
    matchLabels:
      app: orders
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/shop/sa/web

Key distinction: a policy using selector and only L4 attributes is enforced at ztunnel. The moment you need HTTP methods, paths, or headers you must target a waypoint with targetRefs (Section 7). ztunnel is L4-only and physically cannot read HTTP.
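
Destination port is also an L4 attribute, so a policy can stay on ztunnel while narrowing the allow further. This sketch assumes orders listens on container port 8080, which the original does not state.

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-web-8080
  namespace: shop
spec:
  selector:
    matchLabels:
      app: orders
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/shop/sa/web
    to:
    - operation:
        ports: ["8080"]   # assumed container port; still L4, so ztunnel enforces it
```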

5. Add a waypoint for L7

L7 routing and L7 authorization require a waypoint. Deploy one for the namespace:

istioctl waypoint apply -n shop --enroll-namespace
kubectl get gateways.gateway.networking.k8s.io -n shop

Under the hood this creates a Gateway API Gateway with gatewayClassName: istio-waypoint. If you manage manifests in Git, generate the YAML instead of applying imperatively:

istioctl waypoint generate --for service -n shop > waypoint.yaml

# waypoint.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: shop
  labels:
    istio.io/waypoint-for: service   # service | workload | all | none
spec:
  gatewayClassName: istio-waypoint
  listeners:
  - name: mesh
    port: 15008
    protocol: HBONE

Wait for the Gateway to report PROGRAMMED=True, then point services at it. --enroll-namespace adds istio.io/use-waypoint: waypoint to the namespace so all services route through it; you can scope it to a single service instead:

kubectl label service orders -n shop istio.io/use-waypoint=waypoint

The --for service choice matters: a service-scoped waypoint intercepts traffic addressed to a Service VIP (the common case for routing and splits). A workload waypoint intercepts pod-IP traffic. Pick service unless you specifically need to govern direct pod-to-pod calls.
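
The per-service opt-in can also be declared on the Service itself. In this sketch the selector and port numbers are assumptions for illustration (the guide only implies that orders answers plain HTTP on its Service name); the label is the part that matters.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders
  namespace: shop
  labels:
    istio.io/use-waypoint: waypoint   # route traffic addressed to this Service via the waypoint
spec:
  selector:
    app: orders                       # matches the label used by the policies in this guide
  ports:
  - name: http
    port: 80                          # assumed Service port
    targetPort: 8080                  # assumed container port
```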

6. L7 routing: weight and header-based splits

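With a waypoint in place, classic VirtualService and DestinationRule apply at the waypoint much as they do on a sidecar. One caveat: the Istio documentation still classes VirtualService on the ambient data plane as Alpha, so test the behavior you depend on. Define subsets, then split. Here is a canary: 90/10 by weight, with an escape hatch that sends anyone carrying x-canary: always straight to v2.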

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders
  namespace: shop
spec:
  host: orders.shop.svc.cluster.local
  subsets:
  - name: v1
    labels: { version: v1 }
  - name: v2
    labels: { version: v2 }
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders
  namespace: shop
spec:
  hosts:
  - orders.shop.svc.cluster.local
  http:
  - match:
    - headers:
        x-canary:
          exact: always
    route:
    - destination: { host: orders.shop.svc.cluster.local, subset: v2 }
  - route:
    - destination: { host: orders.shop.svc.cluster.local, subset: v1 }
      weight: 90
    - destination: { host: orders.shop.svc.cluster.local, subset: v2 }
      weight: 10

You can drive routing with Gateway API HTTPRoute instead of VirtualService. Pick one per service: mixing VirtualService and Gateway API route objects on the same host is not supported and produces undefined precedence. I use VirtualService when I need fault injection or circuit breaking (Section 8), because the Gateway API does not yet cover the full Istio resilience surface.
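
For comparison, here is a sketch of the same canary as an HTTPRoute attached to the orders Service. HTTPRoute has no subsets, so it assumes two version-specific Services, orders-v1 and orders-v2, and a Service port of 80; neither appears in the original setup. Use it instead of the VirtualService above, never alongside it.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders
  namespace: shop
spec:
  parentRefs:
  - group: ""
    kind: Service
    name: orders
    port: 80                 # assumed Service port
  rules:
  - matches:
    - headers:
      - name: x-canary
        value: always        # header match type defaults to Exact
    backendRefs:
    - name: orders-v2        # assumed per-version Service
      port: 80
  - backendRefs:
    - name: orders-v1        # assumed per-version Service
      port: 80
      weight: 90
    - name: orders-v2
      port: 80
      weight: 10
```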

7. L7 authorization at the waypoint

Now the policy that needs HTTP semantics. This allows the web identity to GET and POST only under /api/, and is attached to the waypoint via targetRefs (note targetRefs, not selector). The to.operation block with methods and paths is exactly what forces waypoint enforcement.

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-l7
  namespace: shop
spec:
  targetRefs:
  - kind: Service
    group: ""
    name: orders
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/shop/sa/web
    to:
    - operation:
        methods: ["GET", "POST"]
        paths: ["/api/*"]

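For end-user identity, validate JWTs at the waypoint with RequestAuthentication, then require a valid token in an AuthorizationPolicy. RequestAuthentication only defines how to validate a token: it rejects requests carrying an invalid token but lets requests with no token through. The companion policy below allows any request that has an authenticated principal (requestPrincipals of ["*"] means “any valid issuer/subject”). Keep in mind that ALLOW policies are additive. On its own this policy rejects tokenless requests, but alongside orders-l7 a request passes if it matches either policy. To make the token mandatory for every caller, use a DENY policy with notRequestPrincipals: ["*"] instead.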

apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: orders-jwt
  namespace: shop
spec:
  targetRefs:
  - kind: Service
    group: ""
    name: orders
  jwtRules:
  - issuer: "https://accounts.example.com"
    jwksUri: "https://accounts.example.com/.well-known/jwks.json"
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-require-jwt
  namespace: shop
spec:
  targetRefs:
  - kind: Service
    group: ""
    name: orders
  action: ALLOW
  rules:
  - from:
    - source:
        requestPrincipals: ["*"]

Order of operations at the waypoint: JWT is validated, then CUSTOM (ext-authz) policies, then DENY, then ALLOW. Deny always wins over allow, so a deny rule cannot be overridden by a broader allow.
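
Because DENY is evaluated before ALLOW, a deny rule is the usual way to make a token mandatory regardless of which allow policies exist. A minimal sketch for the orders Service (the policy name is illustrative):

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-deny-no-jwt   # illustrative name
  namespace: shop
spec:
  targetRefs:
  - kind: Service
    group: ""
    name: orders
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]   # matches any request without a validated JWT
```

With this in place, an allow policy that matches only on the workload principal can no longer admit a request that lacks a token.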

8. Resilience: timeouts, retries, circuit breaking, fault injection

These ride on the same waypoint. Timeouts and retries live in VirtualService; connection-pool limits and outlier detection (circuit breaking) live in DestinationRule.

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-resilience
  namespace: shop
spec:
  hosts: ["orders.shop.svc.cluster.local"]
  http:
  - timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 500ms
      retryOn: 5xx,reset,connect-failure
    route:
    - destination: { host: orders.shop.svc.cluster.local }
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders-cb
  namespace: shop
spec:
  host: orders.shop.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
      http: { http2MaxRequests: 200, maxRequestsPerConnection: 10 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Fault injection is a separate VirtualService — inject a delay or an abort to test that your retries and timeouts actually behave:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-fault
  namespace: shop
spec:
  hosts: ["orders.shop.svc.cluster.local"]
  http:
  - fault:
      delay:
        percentage: { value: 25 }
        fixedDelay: 3s
      abort:
        percentage: { value: 10 }
        httpStatus: 503
    route:
    - destination: { host: orders.shop.svc.cluster.local }

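Sharp edge: per the VirtualService reference, timeouts and retries are not enabled on a route that has fault configured, so a single route cannot both inject the fault and retry around it. The config is accepted; the retry policy is simply ignored on that route. Keep fault injection in its own object so it is easy to find and delete, and apply it in place of the other orders VirtualServices while testing rather than alongside them: as with sidecars, mesh VirtualServices for the same host are not merged, and istioctl analyze flags the conflict. Remove it before promoting to prod.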
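
One way to keep an injected fault away from real users is to scope it to a marker header and leave a plain default route for everyone else. This is a sketch, not the setup used above: the x-chaos header and the object name are hypothetical, and it is meant to stand in for orders-fault rather than sit next to it.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-fault-scoped          # hypothetical name
  namespace: shop
spec:
  hosts: ["orders.shop.svc.cluster.local"]
  http:
  - match:
    - headers:
        x-chaos:                     # hypothetical marker header set by test clients
          exact: "true"
    fault:
      delay:
        percentage: { value: 100 }   # every marked request is delayed
        fixedDelay: 3s
    route:
    - destination: { host: orders.shop.svc.cluster.local }
  - route:                           # unmarked traffic is untouched
    - destination: { host: orders.shop.svc.cluster.local }
```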

9. Observability without sidecars

The waypoint exports the full standard Istio telemetry set, so a waypointed service has the same metrics, access logs, and tracing surface a sidecar would — just emitted by a shared proxy. Services that are L4-only (ztunnel alone) get connection-level metrics from ztunnel.

# Standard Istio request metrics, served by the waypoint
kubectl exec -n shop deploy/waypoint -c istio-proxy -- \
  pilot-agent request GET stats/prometheus | grep istio_requests_total

# ztunnel exposes its own metrics endpoint on 15020
kubectl exec -n istio-system ds/ztunnel -- curl -s localhost:15020/metrics | grep istio_tcp_connections

Turn on mesh-wide access logging and tracing with the Telemetry API rather than per-proxy flags:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy
  tracing:
  - randomSamplingPercentage: 5.0
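
A namespace-scoped Telemetry resource overrides the mesh default, which is handy when one namespace needs denser traces for a while. The sketch below assumes a tracing provider is already configured in mesh config (traces go nowhere otherwise); the object name and the 50% rate are illustrative.

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: shop-tracing               # illustrative name
  namespace: shop                  # applies to proxies in this namespace only
spec:
  tracing:
  - randomSamplingPercentage: 50.0 # temporary bump over the 5% mesh default
```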

Tail a waypoint’s access logs directly when you are debugging a specific call path:

kubectl logs -n shop deploy/waypoint -c istio-proxy --tail=50

Verify

Prove each layer independently.

# 1. mTLS / L4: ztunnel sees the workload over HBONE
istioctl ztunnel-config workloads --workload-namespace shop

# 2. The waypoint is healthy and programmed
kubectl get gateways.gateway.networking.k8s.io waypoint -n shop \
  -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}{"\n"}'

# 3. Authz works: allowed identity gets 200, others get 403
kubectl exec -n shop deploy/web -- curl -s -o /dev/null -w '%{http_code}\n' http://orders/api/health
kubectl run probe --rm -it --image=curlimages/curl -n shop --restart=Never -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://orders.shop/api/health   # expect 403

# 4. Routing split is live (waypoint metrics by subset)
kubectl exec -n shop deploy/waypoint -c istio-proxy -- \
  pilot-agent request GET stats/prometheus | grep 'destination_version="v2"'

# 5. Static config sanity check
istioctl analyze -n shop

A clean run: workloads show HBONE, the Gateway is True, the allowed call returns 200 and the unauthorized one 403, v2 shows ~10% of requests, and analyze reports no errors.

10. Debugging the data plane

Two proxies means two places to look. Triage L4 at ztunnel, L7 at the waypoint.

# Is the pod actually enrolled? Look for the redirection annotation
kubectl get pod <pod> -n shop -o yaml | grep ambient.istio.io/redirection

# ztunnel's view of services and which waypoint each maps to
istioctl ztunnel-config services --service-namespace shop

# ztunnel's view of workloads (address, protocol, waypoint)
istioctl ztunnel-config workloads --workload-namespace shop

# Crank ztunnel logging for one node's pod (scoped, revert after)
istioctl ztunnel-config log <ztunnel-pod> --level=info,access=debug
kubectl logs -n istio-system <ztunnel-pod> --tail=100

# Inspect the waypoint's Envoy like any proxy: listeners, routes, clusters
istioctl proxy-config routes deploy/waypoint -n shop
istioctl proxy-config clusters deploy/waypoint -n shop

The usual culprits, in the order I check them:

Enterprise scenario

A payments platform team migrated ~400 namespaces off sidecars to bank the memory savings. Enrollment went clean until the fraud-scoring service started returning intermittent 502s under load — but only for calls that crossed availability zones. The constraint: that service ran a StatefulSet with mutual TLS terminated inside the app (legacy, pre-mesh) and clients addressed pods directly by their stable pod DNS, not the Service VIP. They had applied a service-scoped waypoint namespace-wide via --enroll-namespace, so pod-IP traffic never traversed the waypoint, yet the double mTLS (HBONE plus app-level) was fighting ztunnel’s redirection on the stateful pods.

The fix was twofold. First, they excluded the StatefulSet pods from the L7 path and let ztunnel handle L4 only, then scoped the waypoint to the Services that actually needed routing rather than the whole namespace:

# Stop blanket namespace enrollment; opt in per Service instead
kubectl label namespace payments istio.io/use-waypoint-

# Pods addressed directly need a workload waypoint, not a service one
istioctl waypoint apply -n payments --name pods-wp --for workload
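# use-waypoint is read from pod labels, so set it in the pod template (this triggers a rolling update)
kubectl patch statefulset fraud-score -n payments --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"istio.io/use-waypoint":"pods-wp"}}}}}'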

Second, they kept the app-level TLS but stopped HBONE from re-wrapping by confirming the redirection annotation and sizing the waypoint against measured cross-AZ RPS, not pod count. The lesson the team wrote into their runbook: --for service and --for workload are not interchangeable, and any workload that bypasses the Service VIP needs a workload waypoint or no waypoint at all. The 502s disappeared once pod-IP traffic stopped hitting an L7 proxy that was never on its path.
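
For a runbook kept in Git, the workload waypoint from this scenario can be written declaratively. The Gateway below mirrors the waypoint manifest from Section 5 with the scope changed to workload. The second document is a pod-template fragment for the StatefulSet, not a complete object, and is shown only to make clear that the use-waypoint label has to land on the pods.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: pods-wp
  namespace: payments
  labels:
    istio.io/waypoint-for: workload   # intercept pod-IP traffic, not Service VIP traffic
spec:
  gatewayClassName: istio-waypoint
  listeners:
  - name: mesh
    port: 15008
    protocol: HBONE
---
# Fragment to merge into the fraud-score StatefulSet (not a standalone manifest)
spec:
  template:
    metadata:
      labels:
        istio.io/use-waypoint: pods-wp
```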

Checklist

Pitfalls and next steps

The mistakes that bite in production: treating “enrolled” as “secured” (enrollment encrypts but does not deny — you still need STRICT plus default-deny); putting an L7 policy on ztunnel and wondering why HTTP rules vanish; and forgetting the waypoint is on the request path, so an undersized one becomes a latency and availability bottleneck for every service behind it. EnvoyFilter is not supported against waypoints — if you relied on it with sidecars, plan that migration deliberately.

From here: add a waypoint only to services that earn it, keep the long tail L4-only to bank the savings, wire Telemetry output into your existing Prometheus/Grafana and tracing backend, and benchmark the HBONE hop under realistic load before committing it to a latency-sensitive path.
