
Azure Arc-Enabled Kubernetes: GitOps, Policy, and Fleet Governance for Hybrid Clusters

A platform team running EKS in one account, GKE in another, and three on-prem clusters in a colo does not have a Kubernetes problem. It has a governance problem: there is no single place to assert “every cluster runs this GitOps config, denies privileged pods, ships logs to one workspace, and is reachable for debugging without poking inbound holes in five firewalls.” Every cluster is a snowflake with its own RBAC, its own admission controller (or none), its own log destination, and its own bastion. Azure Arc-enabled Kubernetes projects any conformant cluster into Azure Resource Manager as a Microsoft.Kubernetes/connectedClusters resource, so the same management-group hierarchy, Azure Policy assignments, and RBAC you already use for native Azure resources now reach the cluster — wherever it physically runs.

This walkthrough onboards a non-Azure cluster, then layers the four controls that actually matter at fleet scale: Flux v2 GitOps for desired-state config, Azure Policy (Gatekeeper) for admission guardrails, cluster connect for kubectl without inbound firewall changes, and Container Insights plus workload identity for observability and secretless Key Vault access. Throughout I assume you have cluster-admin on the target cluster and Owner (or sufficient RBAC) on the Azure side. The goal is not a demo of one cluster — it is the machinery that turns forty snowflakes into “one policy, one Git repo, one identity boundary.”

Arc projects the cluster; it does not run it. The control plane, scheduler, and your nodes stay exactly where they are. Arc adds a set of agents that maintain an outbound connection to Azure and reconcile ARM intent into the cluster. If Azure is unreachable, the cluster keeps serving traffic — only the management plane pauses.

What problem this solves

The pain is operational drift across a heterogeneous fleet. Without a projection layer, every governance question becomes N separate answers. “Are privileged pods blocked everywhere?” means SSHing into N clusters or trusting N different OPA setups. “Who can debug the loyalty cluster at 02:00?” means N bastions, N VPNs, and N firewall change tickets. “Where are the logs?” means N workspaces and no fleet-wide query. When an auditor asks “prove no cluster runs hostPath mounts,” you have no single control plane to answer from.

What breaks without it: configuration entropy (each cluster diverges from the golden baseline because changes are applied by hand), inconsistent security posture (one cluster forgot the admission webhook and now runs root containers), blind operations (an outage on an edge cluster is invisible until a human notices), and access sprawl (every team cuts inbound firewall holes for kubectl, each one a new attack surface). Who hits this: platform/SRE teams running multi-cloud or hybrid Kubernetes, regulated shops that must prove uniform controls, and edge fleets (retail, manufacturing, telco) where clusters sit behind carrier-grade NAT with no public ingress.

Pain without Arc What it costs you How Arc fixes it
Config drift across N clusters Snowflakes; “works on cluster A, broken on B” Flux reconciles one Git repo to every cluster, prune=true
No uniform admission policy One cluster runs root pods, fails audit Azure Policy → Gatekeeper assigned at management-group scope
Inbound firewall holes for kubectl N attack surfaces, N change tickets Cluster connect — outbound-only, no inbound port
Logs scattered in N places No fleet-wide incident view Container Insights → one Log Analytics workspace
Static secrets in manifests Credential sprawl, no per-app audit Workload identity + Key Vault CSI, secretless
Onboarding a cluster is manual Days per cluster, human error MG inheritance — new cluster self-bootstraps baseline

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with core Kubernetes (Deployments, namespaces, RBAC, admission webhooks), kubectl and kubeconfig contexts, and Helm at a basic level. On the Azure side you need an understanding of Azure Resource Manager, management groups, Azure RBAC role assignments, and Azure Policy assignments. Familiarity with GitOps as a concept (desired state in Git, a controller reconciles) makes section 3 land faster — if you want a refresher, the Flux CD GitOps: Monorepo, Kustomize, and Multi-Tenancy and Argo CD App-of-Apps Multi-Cluster GitOps deep-dives cover the upstream engines Arc wraps.

Where this fits in the bigger picture: Arc-enabled Kubernetes is the hybrid arm of a wider Azure governance story. Management groups and Policy initiatives are the same primitives you would use in an Azure landing zone management group and Azure Policy at scale. Arc for servers is the sibling for VMs — see Azure Arc-Enabled Servers: Machine Configuration & Extended Security Updates. If your target is actually a managed Azure cluster, much of this carries over to AKS day-two operations covered in AKS Day-Two: Upgrades & Fleet Operations.

You should already know… Why it matters here If shaky, read
Kubernetes RBAC + admission webhooks Policy = Gatekeeper webhook; access = impersonation (K8s docs)
kubeconfig contexts connect uses the current context to deploy agents (kubectl basics)
Azure management groups Policy + RBAC inherit down the MG tree azure-landing-zone-management
Azure Policy assignments Initiatives become in-cluster constraints azure-policy-governance-scale
GitOps reconcile model Flux is the desired-state engine flux-cd-gitops-monorepo-kustomize-multi-tenancy
Managed identity + federation Workload identity = secretless Key Vault entra-managed-identities-deep-dive-user-assigned-fic-rbac

Core concepts

Arc-enabled Kubernetes is a thin projection: a Helm release of agents inside the cluster, a resource in ARM, and a set of cluster extensions that deliver capabilities (Flux, Policy, Monitor, Key Vault). Internalize this vocabulary before the deep sections — every later table assumes it.

Concept One-line definition Where it lives Why it matters
Connected cluster The ARM resource projecting your cluster Microsoft.Kubernetes/connectedClusters The handle Policy/RBAC/extensions attach to
Arc agents Helm release in azure-arc namespace In-cluster Maintain the outbound channel + reconcile intent
Cluster extension A managed add-on lifecycled by Arc Microsoft.KubernetesConfiguration/extensions How Flux/Policy/Monitor/KV get installed + upgraded
microsoft.flux The Flux v2 GitOps extension Cluster extension Delivers source/kustomize/helm controllers
fluxConfigurations ARM resource describing a Git source + Kustomizations ARM + in-cluster Desired-state intent, applied by config-agent
Azure Policy add-on Gatekeeper v3 (OPA) admission webhook Microsoft.PolicyInsights extension Turns ARM initiatives into in-cluster Constraints
Cluster connect Outbound channel for kubectl from anywhere clusterconnect-agent kubectl with no inbound port / VPN
kube-aad-proxy Entra authN + user impersonation shim In-cluster Maps an Azure token to a K8s identity
Container Insights Logs/metrics/inventory extension Microsoft.AzureMonitor.Containers Fleet telemetry into one workspace
Workload identity Federated UAMI → K8s service account Entra + cluster Secretless Key Vault / Azure API access
Key Vault CSI Secrets Store CSI driver + Azure provider Microsoft.AzureKeyVaultSecretsProvider Mounts vault secrets on tmpfs, no creds in-cluster
Management group A scope above subscriptions ARM hierarchy Policy + RBAC inheritance to all child clusters

The two control planes

Arc gives you two distinct planes, and confusing them is the root of most early mistakes. The management plane is ARM: management groups, Policy assignments, role assignments, extension lifecycle. It is eventually consistent — Policy syncs roughly every 15 minutes, Flux on its own interval. The data plane is your cluster’s kube-apiserver, untouched and authoritative for what actually runs. Arc never inserts itself in the request path of your workloads; it only reconciles intent and brokers kubectl.

Plane Owns Latency Authoritative for If Azure is down
Management (ARM) Policy, RBAC, extensions, GitOps intent Eventual (~15 min Policy) Desired state Reconcile pauses
Data (apiserver) Pods, services, actual admission Real-time Actual state Cluster keeps serving

1. Agent architecture, connectivity, and outbound requirements

az connectedk8s connect installs a Helm release into the azure-arc namespace. The agents are all-outbound by design — there is no inbound listener Azure dials into. Each agent has a single, separable job; knowing which one owns what turns a vague “Arc is broken” into a targeted fix.

Agent Role Owns this failure when it breaks
clusterconnect-agent Reverse proxy brokering the cluster-connect channel kubectl-over-Arc hangs / times out
kube-aad-proxy Entra authN on incoming connect requests, then impersonates the user kubectl returns forbidden / authN errors
config-agent Watches ARM for fluxConfigurations and applies them Flux config never reconciles
extension-manager Installs and lifecycles cluster extensions Extension stuck Creating/Failed
clusteridentityoperator Maintains the cluster’s MSI certificate used to auth to Azure Cluster goes Disconnected, cert renewal fails
resource-sync-agent Syncs cluster inventory back to the ARM resource connectivityStatus/inventory stale
cluster-metadata-operator Publishes cluster metadata (version, distribution) to ARM Resource Graph shows blank distribution/version
flux controllers (with extension) source-controller, kustomize-controller, helm-controller Source pull / apply failures

Every agent talks outbound over HTTPS on TCP 443, using websockets for the relay channel. The non-obvious requirement is *.servicebus.windows.net with websockets enabled on your proxy/firewall: cluster connect rides Azure Relay over that endpoint, and a Layer-7 proxy that blocks websocket upgrades will let onboarding succeed but break kubectl-over-Arc later. This single trap is a common cause of “onboarded fine but proxy hangs” tickets.

Required outbound endpoints

FQDN Port Purpose Breaks if blocked
management.azure.com 443 ARM API (resource, extensions) Onboarding, all management
login.microsoftonline.com 443 Entra ID token issuance All auth
mcr.microsoft.com 443 Agent + extension container images Agents can’t pull
*.data.mcr.microsoft.com 443 MCR image data edges Image pull (CDN)
*.dp.kubernetesconfiguration.azure.com 443 Flux/config data plane GitOps + extensions
guestnotificationservice.azure.com 443 Notifications + the allowlist API Connect signalling
*.servicebus.windows.net 443 Azure Relay for cluster connect (websockets) kubectl-over-Arc
*.his.arc.azure.com 443 Hybrid identity service (MSI cert) Identity/cert renewal
gbl.his.arc.azure.com 443 Global hybrid identity endpoint First MSI provisioning
*.obo.arc.azure.com 443 On-behalf-of token exchange Cluster connect authZ
*.oms.opinsights.azure.com 443 Container Insights ingestion Log shipping
*.monitoring.azure.com 443 Metrics ingestion Prometheus/metrics
*.vault.azure.net 443 Key Vault data plane (CSI) Secret retrieval

The wildcard Service Bus endpoints resolve per-region; never hard-block them on a deny-by-default proxy without first expanding them for your regions. Expand with:

# Region-specific allowlist to replace the *.servicebus.windows.net wildcard
curl -s "https://guestnotificationservice.azure.com/urls/allowlist?api-version=2020-01-01&location=eastus"
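A quick preflight before onboarding can surface a blocked endpoint early. The following is a minimal sketch, assuming curl is available on a host that shares the cluster's egress path. The function names (arc_fixed_fqdns, probe_fqdn, preflight_egress) are hypothetical helpers, not part of any Azure tooling. It covers only the non-wildcard FQDNs from the table and checks plain TLS reachability; it does not exercise websocket upgrades to Azure Relay.

```shell
# Non-wildcard FQDNs from the table above; wildcard entries need the
# region-specific allowlist expansion first.
arc_fixed_fqdns() {
  printf '%s\n' \
    management.azure.com \
    login.microsoftonline.com \
    mcr.microsoft.com \
    guestnotificationservice.azure.com \
    gbl.his.arc.azure.com
}

# Succeeds if an HTTPS request to the host completes within 10 seconds.
# Any HTTP status counts as reachable; only connection failures fail.
probe_fqdn() {
  curl --silent --output /dev/null --max-time 10 "https://$1/"
}

# Probes every FQDN, prints one line per host, returns non-zero if any failed.
preflight_egress() {
  failed=0
  for host in $(arc_fixed_fqdns); do
    if probe_fqdn "$host"; then
      echo "ok      $host"
    else
      echo "BLOCKED $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (run from a node or a host behind the same proxy/firewall):
#   preflight_egress || echo "fix egress before az connectedk8s connect"
```

A passing run says nothing about the Relay websocket path, so cluster connect still needs its own check after onboarding.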

There is no “Azure-initiated inbound” connectivity mode for Arc Kubernetes — it is outbound-only, which is precisely why it fits locked-down on-prem and multi-cloud egress postures. Choose your egress posture deliberately:

Egress posture What you configure Pros Cons
Direct outbound :443 Nothing extra Simplest; least to break Requires open egress to listed FQDNs
Explicit proxy --proxy-http/https/skip-range Centralized inspection/logging Proxy must allow websockets to Relay
Proxy + custom root CA add --proxy-cert TLS-inspecting proxies work Cert rotation must be maintained
Private endpoint (Arc PL) Private endpoints for Arc data plane Traffic stays on backbone More setup; per-region endpoints

Connectivity status meanings

connectivityStatus Meaning Likely cause Confirm Fix
Connected Agents heartbeating normally az connectedk8s show ... -o tsv (healthy)
Offline No heartbeat for >15 min Egress blocked / agents down kubectl get pods -n azure-arc Restore egress; restart agents
Connecting Onboarding/handshake in progress Just connected; provisioning Wait; check agent logs Usually transient
Expired MSI certificate expired clusteridentityoperator stuck / egress to *.his.arc.azure.com blocked Check that agent’s logs Allow HIS endpoints; restart agent
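The status table above can be turned into a small triage helper. This is a sketch only: next_step_for_status and triage_cluster are hypothetical names, and the suggested actions are simply the "Fix" column restated. The az connectedk8s show call and its connectivityStatus property are the same ones the table refers to.

```shell
# Maps a connectivityStatus value to the first action from the table above.
next_step_for_status() {
  case "$1" in
    Connected)  echo "healthy: nothing to do" ;;
    Connecting) echo "wait, then check agent logs in azure-arc" ;;
    Offline)    echo "restore egress, then restart agents in azure-arc" ;;
    Expired)    echo "allow *.his.arc.azure.com, then restart clusteridentityoperator" ;;
    *)          echo "unknown status: $1"; return 1 ;;
  esac
}

# Reads the live status from ARM and prints the suggested next step.
# Arguments: cluster name, resource group.
triage_cluster() {
  status=$(az connectedk8s show --name "$1" --resource-group "$2" \
             --query connectivityStatus --output tsv) || return 1
  echo "$1: $status -> $(next_step_for_status "$status")"
}

# Usage:
#   triage_cluster "$CLUSTER_NAME" "$RESOURCE_GROUP"
```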

2. Onboard an on-prem or EKS/GKE cluster

Point your kubeconfig at the target cluster (kubectl config use-context my-eks), then prep the Azure side. Register the resource providers once per subscription — registration is asynchronous and can take ~10 minutes, so gate on it.

az extension add --name connectedk8s

az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration
az provider register --namespace Microsoft.ExtendedLocation

# Registration can take ~10 min; gate on it before connecting
az provider show -n Microsoft.Kubernetes --query registrationState -o tsv   # -> Registered

Resource provider Why you register it Needed for
Microsoft.Kubernetes Creates the connected-cluster resource Onboarding (always)
Microsoft.KubernetesConfiguration Flux configs + cluster extensions GitOps, all extensions
Microsoft.ExtendedLocation Custom locations on the cluster Arc-enabled services (App Svc, data)
Microsoft.PolicyInsights Azure Policy for Kubernetes Gatekeeper guardrails
Microsoft.OperationalInsights Log Analytics workspaces Container Insights destination
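Because registration is asynchronous, "gate on it" usually means polling. One way to do that is sketched below; wait_for_provider is a hypothetical helper, and the defaults (40 attempts, 15 seconds apart, roughly the 10 minutes mentioned above) are assumptions you should tune.

```shell
# Polls until a resource provider reports "Registered" or attempts run out.
# Arguments: provider namespace, max attempts (default 40), sleep seconds (default 15).
wait_for_provider() {
  namespace=$1
  attempts=${2:-40}
  delay=${3:-15}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(az provider show --namespace "$namespace" \
              --query registrationState --output tsv)
    if [ "$state" = "Registered" ]; then
      echo "$namespace: Registered"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$namespace: still '$state' after $attempts attempts" >&2
  return 1
}

# Usage:
#   for rp in Microsoft.Kubernetes Microsoft.KubernetesConfiguration Microsoft.ExtendedLocation; do
#     wait_for_provider "$rp" || exit 1
#   done
```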

Create a resource group to hold the connected-cluster resources, then connect. connect uses the current kubeconfig context to deploy the Arc agents:

export RESOURCE_GROUP=rg-arc-fleet
export LOCATION=eastus
export CLUSTER_NAME=eks-prod-use1

az group create --name $RESOURCE_GROUP --location $LOCATION -o table

# Uses the CURRENT kubeconfig context to deploy the Arc agents
az connectedk8s connect \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION

connect installs its own Helm v3 binary under ~/.azure (it never touches a Helm you already have) and deploys the agents. The flags you will reach for most:

Flag What it does When to use Gotcha
--name Connected-cluster resource name Always Must be unique in the RG
--resource-group Target RG Always RG location ≠ cluster location is fine
--location ARM region for the resource Always Pick a region near you for control latency
--proxy-https HTTPS proxy for in-cluster agents Behind a proxy Agents inherit it, not just your shell
--proxy-http HTTP proxy Behind a proxy Pair with --proxy-https
--proxy-skip-range CIDRs/suffixes to bypass the proxy Behind a proxy Must include service CIDR + .svc
--proxy-cert Trusted root the proxy presents TLS-inspecting proxy Only for injecting a CA, not to “use a proxy”
--distribution Override detected distro Detection wrong Improves support/telemetry accuracy
--kube-config / --kube-context Target a specific kubeconfig/context Multiple clusters in one config Avoids onboarding the wrong cluster
--disable-auto-upgrade Pin agent version Change-controlled fleets You own upgrades thereafter
--container-log-path Custom container log path Non-standard distros For Insights log discovery

If the cluster egresses through a proxy, do not rely on HTTP_PROXY alone — pass it so the in-cluster agents inherit it. Always include the cluster’s service CIDR in --proxy-skip-range, or in-cluster service-to-service calls will be wrongly routed at the proxy:

az connectedk8s connect \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --proxy-https https://proxy.corp.local:8080 \
  --proxy-http  http://proxy.corp.local:8080 \
  --proxy-skip-range 10.0.0.0/16,kubernetes.default.svc,.svc.cluster.local,.svc \
  --proxy-cert /etc/ssl/certs/corp-root.crt

--proxy-cert is only for injecting a trusted root the proxy presents; it is not required just to use a proxy. The three flags most environments actually need are --proxy-http, --proxy-https, and --proxy-skip-range.
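Forgetting the in-cluster suffixes in --proxy-skip-range is easy when the value is typed by hand for each cluster. A small builder keeps it consistent. This is a sketch with a hypothetical function name (build_proxy_skip_range); the suffixes are the ones used in the example command above, and the CIDRs are whatever your cluster actually uses.

```shell
# Joins the given CIDRs with the in-cluster DNS suffixes used in the example above.
# Arguments: one or more CIDRs (the service CIDR, plus any others to bypass).
build_proxy_skip_range() {
  if [ "$#" -eq 0 ]; then
    echo "usage: build_proxy_skip_range <cidr> [cidr...]" >&2
    return 1
  fi
  range=""
  for cidr in "$@"; do
    range="${range}${cidr},"
  done
  echo "${range}kubernetes.default.svc,.svc.cluster.local,.svc"
}

# Usage (10.0.0.0/16 is the example CIDR from the command above):
#   az connectedk8s connect ... --proxy-skip-range "$(build_proxy_skip_range 10.0.0.0/16)"
```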

Distribution support and what changes

Arc onboards any CNCF-conformant cluster. The distribution mostly affects telemetry and which extensions are validated, not whether onboarding works.

Distribution Onboards Notes
AWS EKS Yes Common multi-cloud target; works as connectedClusters
Google GKE Yes Detected as gke; full extension support
k3s / k0s Yes Edge favourite; ensure adequate node resources
RKE / RKE2 Yes Rancher-managed; conformant
OpenShift (OKD/OCP) Yes SCCs may interact with policy; validate
kind / minikube Yes (dev) Fine for labs; not for production fleets
AKS (managed) Use managedClusters Already in Azure — Arc K8s is for non-AKS
AKS on Azure Stack HCI / Edge Essentials Provisioned-cluster path Slightly different onboarding

Onboarding errors you will actually hit

Symptom / error Likely cause Confirm Fix
MSI certificate is not ready Egress to *.his.arc.azure.com blocked clusteridentityoperator logs Allow HIS FQDNs; retry
Agents stuck Pending / ImagePullBackOff mcr.microsoft.com blocked kubectl describe pod -n azure-arc Allow MCR + data edges
connectivityStatus = Connecting forever Websocket/egress partial Agent logs; firewall logs Open *.servicebus, retry
Helm release failed on connect Stale prior install in azure-arc helm list -n azure-arc az connectedk8s delete then re-connect
Insufficient permissions Caller lacks RBAC on RG/sub az role assignment list Grant Contributor + K8s onboarding role
Provider not registered RP registration incomplete az provider show Re-run register; wait for Registered
In-cluster calls fail post-connect Service CIDR not in skip-range DNS/connectivity tests Add CIDR + .svc to --proxy-skip-range
Onboard OK, proxy hangs L7 proxy strips websockets az connectedk8s proxy -d (debug) Allow Relay FQDNs with websockets
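Several rows above start from the same first question: which agent pod is unhealthy? A sketch of that check follows. unhealthy_arc_pods is a hypothetical helper that parses the default kubectl get pods column layout (NAME, READY, STATUS, ...), which is an assumption about your kubectl output format.

```shell
# Lists azure-arc pods that are not Running or Completed.
# Prints "NAME STATUS" per unhealthy pod; returns 1 if any are found,
# 2 if kubectl itself failed.
unhealthy_arc_pods() {
  pods=$(kubectl get pods --namespace azure-arc --no-headers) || return 2
  bad=$(printf '%s\n' "$pods" |
        awk 'NF && $3 != "Running" && $3 != "Completed" { print $1, $3 }')
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}

# Usage:
#   unhealthy_arc_pods || echo "inspect the pods above with: kubectl describe pod -n azure-arc <name>"
```

A Running pod can still have containers that are not ready, so treat a clean result as a first filter, not proof of health.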

3. Configure Flux v2 GitOps via the Arc extension

Arc’s GitOps is Flux v2 delivered as the microsoft.flux cluster extension (it installs fluxconfig-agent and fluxconfig-controller alongside the upstream source/kustomize/helm controllers). You rarely install the extension by hand — creating your first fluxConfigurations pulls it in automatically. Register the configuration with az k8s-configuration flux create, scoped at the cluster level, with one or more Kustomizations:

# Needs the k8s-configuration CLI extension
az extension add --name k8s-configuration

az k8s-configuration flux create \
  --name fleet-baseline \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --namespace cluster-config \
  --scope cluster \
  --url https://github.com/acme-platform/fleet-gitops \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true \
  --kustomization name=apps  path=./apps/prod prune=true dependsOn=\["infra"\]

flux create options that change behaviour

Option Values Default When to change Trade-off / gotcha
--scope cluster | namespace cluster Tenant-confined config namespace can’t create CRDs/ClusterRoles
--namespace any (required) Where Flux objects live Created if absent
--kind git | bucket | azblob git Non-Git sources Auth differs per kind
--url repo URL (required) https:// or ssh://
--branch / --tag / --semver / --commit a ref branch master Pin to a release; set --branch main for repos whose default branch is main Tag/commit = immutable rollout
--interval duration 10m Faster/slower polls Lower = more API + Git load
--kustomization prune= true | false false Always true for real GitOps Without it, Git ≠ truth
--kustomization dependsOn= list none Order infra before apps Cycles = stuck reconcile
--kustomization sync_interval= duration 10m Per-Kustomization cadence Independent of source interval
--kustomization retry_interval= duration source interval Faster retry on failure Lower = more churn on broken state
--kustomization timeout= duration 10m Long applies (CRDs, big charts) Too low = false failures
--kustomization force= true | false false Recreate immutable fields Can cause disruptive replace
--https-user / --https-key string none Private HTTPS repo (PAT) Stored as a secret
--ssh-private-key / --ssh-private-key-file key none Private SSH repo Add known-hosts too
--known-hosts / --known-hosts-file string none SSH host verification Omit → host-key errors
--local-auth-ref secret name none Reference a pre-made secret Bring-your-own auth
--suspend flag off Freeze reconcile Drift not corrected while set

The mechanics worth internalising:

Source kinds and how each authenticates

--kind Source Auth options Use when
git GitHub/GitLab/Azure Repos/Bitbucket public, PAT (--https-*), SSH (--ssh-*) The default — Git is source of truth
bucket S3-compatible object store access key/secret Manifests in an S3/MinIO bucket
azblob Azure Blob Storage account key, SAS, managed identity Azure-native artifact store + WI

For a connected (non-AKS) cluster you do not need a managed identity to read a public Git repo — the source controller pulls directly. For private repos, pass --https-user/--https-key (PAT) or SSH key material; for Azure Blob sources with workload identity, the azblob kind federates to a UAMI (see section 7).

Flux config status and reconciliation states

complianceState / condition Meaning Likely cause Confirm Fix
Compliant Source + all Kustomizations applied az k8s-configuration flux show (healthy)
Non-Compliant Apply failed / drift uncorrected Manifest error, RBAC, --scope too narrow kubectl -n flux-system logs deploy/kustomize-controller Fix manifest/scope; re-reconcile
Pending First reconcile in progress Just created Watch source-controller logs Usually transient
Source not ready Can’t pull the repo Bad URL/branch, auth, host key source-controller events Fix URL/auth/known-hosts
Kustomization dependency not ready Waiting on dependsOn Upstream Kustomization not Ready flux show per Kustomization Fix the dependency first
health check failed Applied but objects unhealthy App crashing / not Ready kubectl get the objects Fix the workload

Force a reconcile without waiting for the interval by annotating the source or Kustomization (or run flux reconcile ... if the Flux CLI is installed). Pushing a new commit also triggers an apply, but only once the source controller's next poll picks it up. On Arc, the config-agent will also re-pull on the next ARM sync.
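The annotation route can be sketched as follows. reconcile.fluxcd.io/requestedAt is the upstream Flux trigger annotation; request_reconcile is a hypothetical wrapper, and the object names in the usage lines are assumptions about how the Arc extension names the in-cluster objects, so list them first.

```shell
# Requests an immediate reconcile by stamping the Flux trigger annotation.
# Arguments: kind (gitrepository | kustomization), object name, namespace.
request_reconcile() {
  kubectl annotate --overwrite "$1/$2" --namespace "$3" \
    "reconcile.fluxcd.io/requestedAt=$(date +%s)"
}

# Usage, assuming the objects are named after the fluxConfigurations above
# (verify the real names with: kubectl get gitrepositories,kustomizations -A):
#   request_reconcile gitrepository fleet-baseline cluster-config
#   request_reconcile kustomization fleet-baseline-infra cluster-config
```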

4. Apply Azure Policy (Gatekeeper) at fleet scope

Azure Policy for Kubernetes extends Gatekeeper v3 (the OPA admission webhook) so you can author guardrails once in ARM and enforce them as in-cluster admission decisions across the fleet. Install the extension per cluster, then assign initiatives at a scope that covers many clusters. Register the provider and install the extension (Microsoft.PolicyInsights):

az provider register --namespace Microsoft.PolicyInsights

az k8s-extension create \
  --cluster-type connectedClusters \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --extension-type Microsoft.PolicyInsights \
  --name azurepolicy

Now assign a built-in initiative. The Pod Security baseline standards for Linux workloads initiative (a8640138-9b0a-4a28-b8cb-1666c838647d) bundles the deny rules most teams want — no privileged containers, no host namespaces, no hostPath, drop dangerous capabilities. Assign it at a management group so it lands on every connected cluster underneath, and exclude the system namespaces (otherwise you will block Arc’s own agents):

az policy assignment create \
  --name "psp-baseline-fleet" \
  --display-name "Pod Security baseline - Arc fleet" \
  --policy-set-definition "a8640138-9b0a-4a28-b8cb-1666c838647d" \
  --scope "/providers/Microsoft.Management/managementGroups/mg-arc-prod" \
  --params '{
    "effect": { "value": "deny" },
    "excludedNamespaces": { "value": ["kube-system","gatekeeper-system","azure-arc"] }
  }'

Policy effects and what each does in-cluster

Effect In-cluster behaviour When to use Risk
audit Logs non-compliant; admits the object Brownfield rollout, discovery None to workloads; just visibility
deny Gatekeeper rejects the admission Steady-state enforcement Blocks bad deploys (and false positives)
disabled Policy inert Temporarily pause a rule Drift uncorrected

The Kubernetes add-on effects used in this walkthrough are audit, deny, and disabled; mutation is not covered here. There is no deployIfNotExists inside the cluster: remediation of Kubernetes objects goes through GitOps, not Policy.

Built-in initiatives worth knowing

Initiative Definition ID (set) What it enforces
Pod Security Baseline (Linux) a8640138-9b0a-4a28-b8cb-1666c838647d No privileged, no host ns, no hostPath, drop caps
Pod Security Restricted (Linux) (restricted set) Baseline + runAsNonRoot, seccomp, no privilege-escalation
Deployment safeguards (general) (built-in set) Resource limits, no :latest, approved registries

Common single-rule built-ins (assemble custom initiatives)

Rule (policy definition) Effect surface Catches
No privileged containers deny/audit securityContext.privileged: true
No host network/PID/IPC deny/audit hostNetwork/hostPID/hostIPC
No hostPath volumes deny/audit Node filesystem mounts
Allowed capabilities / drop NET_RAW deny/audit Dangerous Linux caps
runAsNonRoot required deny/audit Root containers
CPU/memory limits required deny/audit Unbounded pods
Allowed container registries deny/audit Pulls from untrusted registries
No :latest image tag deny/audit Unpinned images
Allowed external IPs / no NodePort deny/audit Unexpected exposure
Read-only root filesystem deny/audit Writable container roots

Two operational realities to respect:

  1. Roll out in audit before deny. Set effect to audit, watch the compliance results in Azure Policy for a week, fix the violators, then flip to deny. Flipping straight to deny on a brownfield cluster will reject existing Deployments on their next rollout and page you at 02:00.
  2. Constraints are pulled, not instant. The add-on syncs assignments roughly every 15 minutes and writes Gatekeeper Constraint objects whose names start with azurepolicy-. Inspect them in-cluster with kubectl get constrainttemplates and kubectl get constraints.
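The audit-then-deny rollout is easier to keep honest when the effect is the only thing that changes between the two assignments. A sketch of that, with a hypothetical helper name (baseline_params); the parameter names and excluded namespaces are the ones from the assignment shown earlier.

```shell
# Emits the --params JSON for the initiative with the given effect.
# Arguments: effect (audit | deny | disabled).
baseline_params() {
  case "$1" in
    audit|deny|disabled) ;;
    *) echo "unsupported effect: $1" >&2; return 1 ;;
  esac
  printf '{"effect":{"value":"%s"},"excludedNamespaces":{"value":["kube-system","gatekeeper-system","azure-arc"]}}\n' "$1"
}

# Usage: start in audit, switch to deny once the compliance view is clean.
#   az policy assignment create \
#     --name "psp-baseline-fleet" \
#     --policy-set-definition "a8640138-9b0a-4a28-b8cb-1666c838647d" \
#     --scope "/providers/Microsoft.Management/managementGroups/mg-arc-prod" \
#     --params "$(baseline_params audit)"
#
# In-cluster check after the ~15 minute sync:
#   kubectl get constraints | grep azurepolicy-
```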

Policy troubleshooting playbook

# Symptom Root cause Confirm (exact cmd / path) Fix
1 Deploys suddenly rejected Initiative flipped to deny, real violation kubectl get events; Policy compliance blade Remediate manifest; or revert to audit
2 Arc/system pods blocked System namespaces not excluded kubectl get constraints -o yaml (excludedNamespaces) Add kube-system,gatekeeper-system,azure-arc
3 No constraints in cluster Assignment not synced yet kubectl get constraints (empty) Wait ~15 min; check add-on provisioningState
4 Compliance shows “no data” Add-on not installed / unhealthy az k8s-extension show --name azurepolicy (Re)install; check gatekeeper-system pods
5 Legit pod flagged non-compliant Rule stricter than intended Compliance reason on the resource Tune params / switch initiative tier
6 deny blocks a needed exception No per-namespace carve-out Identify the namespace Exclude namespace or scope assignment narrower
7 Custom rule never fires ConstraintTemplate/Rego error kubectl describe constrainttemplate ... Fix Rego; re-publish definition
8 Webhook latency/timeouts Gatekeeper under-resourced gatekeeper-system pod CPU/mem Raise limits; reduce constraint count
9 Negative test still admits Constraints not synced / wrong scope kubectl run pwn --privileged admits Verify MG scope; wait for sync
10 Compliance lags reality 15-min add-on + 24h full scan cadence Compare event time vs compliance time Allow for eventual consistency
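Row 9's negative test can be run without leaving a pod behind by using a server-side dry run, which still passes through admission. The sketch below makes two assumptions: that your Gatekeeper webhook is evaluated on dry-run requests, and that the pause image tag shown is acceptable as a throwaway image. expect_privileged_denied is a hypothetical name. Any kubectl failure (including lost connectivity) reads as "denied", so confirm the rejection message by hand the first time.

```shell
# Submits a privileged pod as a server-side dry run, so nothing is created
# but admission webhooks still evaluate it.
# Succeeds only if the API server rejects the request.
# Arguments: namespace to test in (must not be an excluded namespace).
expect_privileged_denied() {
  if kubectl run policy-probe --namespace "$1" \
       --image=registry.k8s.io/pause:3.9 --privileged \
       --restart=Never --dry-run=server >/dev/null 2>&1; then
    echo "ADMITTED in $1: constraints missing, not synced, or effect is audit"
    return 1
  fi
  echo "denied in $1"
}

# Usage:
#   expect_privileged_denied prod
```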

For org-specific rules beyond the built-ins (e.g. “all images must come from acme.azurecr.io”), author a custom constraint template + Rego and ship it as a custom policy definition — same assignment model, same fleet scope. If you treat policy definitions as source-controlled artifacts, the Azure Policy as Code pipeline pattern applies unchanged here.

5. Cluster connect: kubectl without inbound firewall changes

This is the feature that wins over on-prem teams. The clusterconnect-agent holds an outbound channel open; az connectedk8s proxy uses your Azure token to open a local proxy and writes a kubeconfig that targets it. No inbound port, no VPN, no bastion. First grant access. With Azure RBAC, assign the user/group a built-in role at the cluster scope — no kubectl ClusterRoleBinding required:

ARM_ID=$(az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query id -o tsv)
AAD_ID=$(az ad signed-in-user show --query id -o tsv)

# "Cluster User Role" grants the cluster-connect channel; "Viewer/Writer" grants in-cluster RBAC
az role assignment create --role "Azure Arc Enabled Kubernetes Cluster User Role" --assignee $AAD_ID --scope $ARM_ID
az role assignment create --role "Azure Arc Kubernetes Viewer" --assignee $AAD_ID --scope $ARM_ID

Arc Kubernetes built-in roles

Role Grants Use for
Azure Arc Enabled Kubernetes Cluster User Role The cluster-connect channel (ability to open proxy) Anyone who needs kubectl access at all
Azure Arc Kubernetes Viewer Read-only in-cluster RBAC (no Secrets) Read access across the fleet
Azure Arc Kubernetes Writer Read/write most namespaced objects Operators deploying via kubectl
Azure Arc Kubernetes Admin Admin within namespaces (not cluster-scoped escalation) Namespace owners
Azure Arc Kubernetes Cluster Admin Full cluster-admin equivalent Break-glass / platform owners

The Cluster User Role only opens the channel; it grants no in-cluster permissions. You must also assign a Viewer/Writer/Admin role for the request to do anything once impersonated. Granting one without the other is the classic “I can connect but everything is forbidden” mistake.
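Since the two roles only make sense together, it can help to assign them as a pair. A minimal sketch, using a hypothetical wrapper name (grant_arc_readonly) around the same az role assignment create calls shown above; the group object ID in the usage line is a placeholder.

```shell
# Assigns both roles needed for read-only kubectl over Arc.
# Arguments: Entra object ID of a user or group, scope (cluster or management group ID).
grant_arc_readonly() {
  for role in \
    "Azure Arc Enabled Kubernetes Cluster User Role" \
    "Azure Arc Kubernetes Viewer"; do
    az role assignment create --role "$role" --assignee "$1" --scope "$2" \
      --output none || return 1
  done
}

# Usage at management-group scope (mg-arc-prod is the example group from section 4):
#   grant_arc_readonly "<sre-group-object-id>" \
#     "/providers/Microsoft.Management/managementGroups/mg-arc-prod"
```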

Then open the proxy (it blocks the shell) and run kubectl from a second shell:

# Shell 1 - opens the proxy, blocks
az connectedk8s proxy -n $CLUSTER_NAME -g $RESOURCE_GROUP

# Shell 2 - normal kubectl, routed over the Arc channel
kubectl get pods -A

If you prefer native Kubernetes RBAC over Azure RBAC, bind a service account token instead and pass --token $TOKEN to the proxy command. Either way, the request path is: your token → Azure Relay → clusterconnect-agent → kube-aad-proxy (Entra auth + user impersonation) → kube-apiserver. The impersonation step is why a fleet-wide Azure Arc Kubernetes Viewer role gives read-only kubectl on every cluster at once.
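For the service account route, one possible sequence is sketched here. It assumes kubectl 1.24 or later (for kubectl create token), an existing admin context on the target cluster, and that the built-in view ClusterRole is enough for your use. mint_connect_token and the arc-ci account name are hypothetical.

```shell
# Creates a read-only service account and prints a short-lived token for it.
# Arguments: service account name, namespace.
# Status messages go to stderr so that stdout carries only the token.
mint_connect_token() {
  kubectl create serviceaccount "$1" --namespace "$2" >&2 || return 1
  kubectl create clusterrolebinding "$1-view" \
    --clusterrole view --serviceaccount "$2:$1" >&2 || return 1
  kubectl create token "$1" --namespace "$2" --duration 1h
}

# Usage:
#   TOKEN=$(mint_connect_token arc-ci default)
#   az connectedk8s proxy -n "$CLUSTER_NAME" -g "$RESOURCE_GROUP" --token "$TOKEN"
```

The one-hour duration is an arbitrary choice for the sketch; the API server may cap or adjust it.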

Azure RBAC vs native Kubernetes RBAC for connect

Aspect Azure RBAC Native K8s RBAC
Where you grant ARM role assignment (cluster/MG scope) RoleBinding/ClusterRoleBinding in-cluster
Fleet-wide grant One assignment at MG scope covers all Per-cluster bindings
Identity Entra users/groups/SPs Service account token
Audit Entra sign-in + Activity log apiserver audit log
Proxy flag (default) --token $TOKEN
Best for Centralized human access at scale App/CI tokens, fine-grained in-cluster

Cluster connect failure modes

Symptom Root cause Confirm Fix
proxy hangs / never binds L7 proxy strips websockets to Relay az connectedk8s proxy -d Allow regional *.servicebus with websockets
Connect OK, all forbidden Only Cluster User Role assigned az role assignment list --scope $ARM_ID Add Viewer/Writer/Admin role
Long running operation failed clusterconnect-agent down kubectl get pods -n azure-arc Restart agent; check egress
Token/auth error Stale Azure CLI login az account show az login again
Works for you, not teammates Their identity unassigned Check their role assignments Assign at group/MG scope
Intermittent drops Relay/egress flapping Firewall + agent logs Stabilize egress; check proxy timeouts

6. Enable Azure Monitor Container Insights

Ship stdout/stderr logs, inventory, and container metrics from every Arc cluster into one Log Analytics workspace via the Microsoft.AzureMonitor.Containers extension. Use managed identity auth (amalogs.useAADAuth=true) so there is no workspace key sitting in the cluster:

WORKSPACE_ID="/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-fleet"

az k8s-extension create \
  --name azuremonitor-containers \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureMonitor.Containers \
  --configuration-settings \
      logAnalyticsWorkspaceResourceID=$WORKSPACE_ID \
      amalogs.useAADAuth=true

The extension deploys the ama-logs DaemonSet (every node) and ama-logs-rs ReplicaSet (cluster-level) into kube-system. To control ingestion cost on chatty clusters, scope collection to specific namespaces with dataCollectionSettings at install time:

az k8s-extension create \
  --name azuremonitor-containers \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureMonitor.Containers \
  --configuration-settings amalogs.useAADAuth=true \
      dataCollectionSettings='{"interval":"1m","namespaceFilteringMode":"Include","namespaces":["prod","ingress"],"enableContainerLogV2":true}'
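
The JSON value in that command is easy to mis-quote when typed by hand. As a minimal sketch, the helper below builds the same string programmatically; the function name data_collection_settings is hypothetical, and the keys are only the ones shown in the command above.

```python
import json


def data_collection_settings(namespaces, interval="1m", mode="Include", log_v2=True):
    """Return the compact JSON string passed as dataCollectionSettings=..."""
    if mode not in ("Include", "Exclude", "Off"):
        raise ValueError(f"unsupported namespaceFilteringMode: {mode}")
    settings = {
        "interval": interval,
        "namespaceFilteringMode": mode,
        "namespaces": list(namespaces),
        "enableContainerLogV2": log_v2,
    }
    # Compact separators keep the value free of spaces, so it stays one shell argument.
    return json.dumps(settings, separators=(",", ":"))


value = data_collection_settings(["prod", "ingress"])
print(f"dataCollectionSettings='{value}'")
```

Generating the value this way keeps the namespace list in one reviewable place, which helps when the same install command runs across many clusters from a pipeline.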

Container Insights configuration settings

| Setting | Values | Default | Effect | Cost lever |
|---|---|---|---|---|
| amalogs.useAADAuth | true/false | false | Managed-identity auth (no workspace key) | n/a (security) |
| logAnalyticsWorkspaceResourceID | ARM ID | (auto) | Destination workspace | Consolidate to one |
| dataCollectionSettings.interval | 1m–30m | 1m | Metric scrape cadence | Higher = cheaper |
| namespaceFilteringMode | Include/Exclude/Off | Off | Which namespaces collect logs | Big lever |
| namespaces | list | (none) | Namespace allow/deny list | Trim noisy ns |
| enableContainerLogV2 | true/false | varies | Richer schema, multi-line | Slightly more data |
| streams | list | all | Which tables to ingest | Drop unused streams |

Key Container Insights tables (KQL)

| Table | Holds | Typical query use |
|---|---|---|
| ContainerLogV2 | stdout/stderr lines | Error mining across fleet |
| KubePodInventory | pod state, restarts | Crash/restart hunting |
| KubeNodeInventory | node status, conditions | NotReady nodes |
| KubeEvents | cluster events | OOMKilled, FailedScheduling |
| InsightsMetrics | container/node metrics | CPU/mem saturation |
| ContainerInventory | image, repo, ports | Image/registry audit |

Once data lands, query the whole fleet from one workspace. Container logs carry the cluster identity, so a single KQL query slices across every onboarded cluster:

ContainerLogV2
| where TimeGenerated > ago(1h)
| where LogLevel in~ ("error","critical")   // in~ matches regardless of case
| summarize Errors = count() by Computer, ContainerName, _ResourceId
| sort by Errors desc
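
When the query results are exported for reporting, the _ResourceId column is what identifies the cluster. The sketch below shows one way to split such an ID into subscription, resource group, and cluster name. The function name and the sample ID are hypothetical, and the case-insensitive matching is a precaution on the assumption that the ID may arrive lower-cased.

```python
def cluster_from_resource_id(resource_id):
    """Extract (subscription, resource group, cluster name) from an Arc cluster ARM ID."""
    parts = resource_id.strip("/").split("/")
    # ARM IDs alternate key/value segments; compare keys case-insensitively.
    lowered = [p.lower() for p in parts]
    try:
        sub = parts[lowered.index("subscriptions") + 1]
        rg = parts[lowered.index("resourcegroups") + 1]
        name = parts[lowered.index("connectedclusters") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"not a connectedClusters resource ID: {resource_id}") from None
    return sub, rg, name


# Hypothetical ID, shaped like the _ResourceId values the query returns.
example = ("/subscriptions/0000/resourcegroups/rg-edge/providers/"
           "microsoft.kubernetes/connectedclusters/store-017")
print(cluster_from_resource_id(example))
```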

Note the migration: the legacy Helm-chart onboarding for the Container Insights agent is retired. On Arc, install via the Microsoft.AzureMonitor.Containers extension — that is the supported path and the one that participates in extension lifecycle/upgrades.

If you want metrics in Prometheus/Grafana rather than (or alongside) Log Analytics, the managed Prometheus/Grafana pattern in Azure Monitor: Managed Prometheus & Managed Grafana for AKS applies to Arc clusters via the metrics extension. For shaping ingestion with data collection rules, see Azure Monitor: Data Collection Rules, Workbooks & Alerting.

7. Workload identity and Key Vault secret access

Static secrets in manifests are the failure mode Arc lets you finally kill. The Azure Key Vault Secrets Provider extension (Microsoft.AzureKeyVaultSecretsProvider) installs the Secrets Store CSI Driver plus the Azure provider, so pods mount Key Vault secrets as files on tmpfs with no credential in the cluster:

az k8s-extension create \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureKeyVaultSecretsProvider \
  --name akvsecretsprovider \
  --configuration-settings \
      secrets-store-csi-driver.enableSecretRotation=true \
      secrets-store-csi-driver.rotationPollInterval=2m \
      secrets-store-csi-driver.syncSecret.enabled=true

Key Vault CSI extension settings

| Setting | Values | Default | When to change | Trade-off |
|---|---|---|---|---|
| enableSecretRotation | true/false | false | You rotate secrets | Polls vault; small overhead |
| rotationPollInterval | duration (e.g. 2m) | 2m | Faster/slower rotation pickup | Lower = more vault calls |
| syncSecret.enabled | true/false | false | Need a native K8s Secret for env vars | Env vars still need pod restart |

For the auth itself, federate a user-assigned managed identity to a Kubernetes service account (workload identity) so the CSI provider exchanges the pod’s projected token for an Entra token — no client secret anywhere. A SecretProviderClass ties the service account to the vault:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-kv
  namespace: prod
spec:
  provider: azure
  parameters:
    clientID: "<USER_ASSIGNED_CLIENT_ID>"   # the federated UAMI
    keyvaultName: "kv-acme-prod"
    tenantId: "<TENANT_ID>"
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
  # Optional: project mounted secrets into a native K8s Secret for env vars
  secretObjects:
    - secretName: app-db
      type: Opaque
      data:
        - objectName: db-connection-string
          key: DB_CONN

Grant the UAMI Key Vault Secrets User on the vault via Azure RBAC, federate it to the service account’s OIDC subject, then any pod using that service account and mounting this SecretProviderClass reads the secret. Because the access is scoped per-service-account, you get least privilege and clean per-app audit instead of one node-wide credential.

Secret access patterns compared

| Pattern | Credential in cluster? | Rotation | Audit granularity | Verdict |
|---|---|---|---|---|
| Hard-coded secret in manifest | Yes (in Git!) | Manual | None | Never |
| K8s Secret (base64) | Yes (etcd) | Manual | Per-Secret | Weak |
| Sealed/SOPS-encrypted in Git | Encrypted at rest | Re-encrypt | Per-Secret | OK for some |
| Workspace-key CSI | Workspace key in-cluster | Vault-side | Per-app (if scoped) | Avoid the key |
| Workload-identity CSI | No | Vault + poll | Per-service-account | Best |

Workload identity federation mapping

| Element | Value source | Notes |
|---|---|---|
| UAMI clientID | The federated user-assigned identity | Goes in SecretProviderClass |
| OIDC issuer | Cluster’s projected-token issuer URL | Must be reachable by Entra |
| Subject | system:serviceaccount:<ns>:<sa> | The federated credential subject |
| Vault RBAC | Key Vault Secrets User on the vault | Least-privilege data-plane role |
| Audience | api://AzureADTokenExchange | Standard WI audience |
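
The mapping above reduces to three strings that must match exactly between the cluster and Entra: issuer, subject, and audience. As a rough sketch, the helper below assembles them for one service account. The function name, credential name, service account name, and issuer URL are all hypothetical placeholders; only the subject format and the audience come from the table.

```python
import json

WI_AUDIENCE = "api://AzureADTokenExchange"


def federated_credential(name, issuer_url, namespace, service_account):
    """Build the issuer/subject/audience values for one Kubernetes service account."""
    if not issuer_url.startswith("https://"):
        raise ValueError("OIDC issuer must be an https URL")
    return {
        "name": name,
        "issuer": issuer_url,
        "subject": f"system:serviceaccount:{namespace}:{service_account}",
        "audiences": [WI_AUDIENCE],
    }


cred = federated_credential(
    name="app-kv-prod",                                    # hypothetical credential name
    issuer_url="https://oidc.example.invalid/cluster-1/",  # placeholder, not a real issuer
    namespace="prod",
    service_account="app-sa",                              # hypothetical service account
)
print(json.dumps(cred, indent=2))
```

A mismatch in any one of these values, such as a wrong namespace in the subject, makes the token exchange fail, so generating them from the same variables that render the manifests removes one class of typo.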

Rotation caveat: enableSecretRotation=true refreshes the mounted file on the poll interval. Apps that read the file on each request pick up new values automatically; apps that load secrets once at boot, or consume the synced Secret as env vars, still need a restart to see a rotated value. Env vars are snapshotted at pod start: Kubernetes sets them when the container is created and has no mechanism to change the environment of a process that is already running.
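
The difference between the two consumption styles can be shown in a few lines. This is a local simulation rather than the driver itself: a temporary directory stands in for the tmpfs mount, rewriting the file stands in for a rotation, and read_secret is a hypothetical helper name.

```python
import os
import tempfile
from pathlib import Path


def read_secret(mount_dir, object_name):
    """Read the secret from the mount on every call, so a rotated file is picked up."""
    return (Path(mount_dir) / object_name).read_text().strip()


# Stand-in for the CSI volume; in a pod this would be the volume's mountPath.
mount_dir = tempfile.mkdtemp()
secret_file = Path(mount_dir) / "db-connection-string"

secret_file.write_text("Server=db;Password=old\n")
# Boot-time snapshot, equivalent to an env var injected at container start.
os.environ["DB_CONN"] = read_secret(mount_dir, "db-connection-string")

# Simulate the driver refreshing the mounted file after a vault rotation.
secret_file.write_text("Server=db;Password=new\n")

from_file = read_secret(mount_dir, "db-connection-string")
from_env = os.environ["DB_CONN"]
print("file:", from_file)
print("env: ", from_env)
```

The file read returns the rotated value while the environment variable keeps the value captured at start, which is why per-request file reads avoid the restart.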

The same federation model underpins Azure Key Vault Workload Identity for Secrets and the AKS-flavoured Secrets Store CSI with Key Vault sync & rotation; the deep mechanics of federated credentials are in Entra Managed Identities: User-Assigned, FIC & RBAC. For rotation strategy across the vault itself, see Azure Key Vault Secret Rotation with Managed Identity.

8. Scale governance across many clusters

Onboarding one cluster is a demo. Governing forty is the job. Three primitives make Arc fleet-ready.

Management groups carry policy and RBAC. Place subscriptions (and therefore their connected clusters) under a management-group hierarchy and assign Policy initiatives + Arc Kubernetes roles at the MG level. A new cluster onboarded into any child subscription inherits the baseline the moment it appears — you do not touch it cluster-by-cluster.
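
Inheritance is easiest to reason about as a walk up the scope tree. The toy model below is only an illustration of that idea, not how ARM evaluates assignments internally. Every scope and assignment name is made up except mg-retail-edge, which is borrowed from the scenario later in this lesson.

```python
# Toy model: each scope names its parent and the assignments made directly on it.
PARENT = {
    "sub-store-east": "mg-retail-edge",
    "sub-store-west": "mg-retail-edge",
    "mg-retail-edge": "mg-root",
    "mg-root": None,
}
ASSIGNED = {
    "mg-root": ["tag-governance"],
    "mg-retail-edge": ["pod-security-baseline", "arc-k8s-viewer-oncall"],
    "sub-store-west": ["extra-logging"],
}


def effective_assignments(scope):
    """Collect assignments from the scope and every ancestor, root first."""
    chain = []
    while scope is not None:
        chain.append(scope)
        scope = PARENT[scope]
    result = []
    for s in reversed(chain):
        result.extend(ASSIGNED.get(s, []))
    return result


print(effective_assignments("sub-store-east"))
print(effective_assignments("sub-store-west"))
```

A cluster in sub-store-east picks up everything assigned at mg-root and mg-retail-edge without any assignment of its own, which is the property the paragraph above relies on.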

| Fleet primitive | What it gives you | Mechanism |
|---|---|---|
| Management-group inheritance | Policy + RBAC apply to all child clusters | Assign at MG, not per-cluster |
| Tags | Targeting, chargeback, inventory slicing | --tags on connect; ARG queries |
| GitOps-as-intent | New clusters self-bootstrap baseline | Bicep fluxConfigurations |
| Extension defaults | Consistent add-on versions | Pin via IaC; --auto-upgrade policy |
| Azure Resource Graph | Single-pane fleet inventory | resources queries |

Tags drive targeting and chargeback. Tag connected clusters with environment, owner, and data-classification, then write policy assignments that key off tags or build Azure Resource Graph queries for fleet inventory:

// Every Arc cluster, its agent version, and connectivity health
resources
| where type == "microsoft.kubernetes/connectedclusters"
| project name, location,
          distribution = properties.distribution,
          k8sVersion   = properties.kubernetesVersion,
          connectivity = tostring(properties.connectivityStatus),   // cast so order by gets a string, not dynamic
          agentVersion = properties.agentVersion,
          env = tags.environment
| order by connectivity asc

Fleet inventory queries worth saving

| Question | ARG where / project focus |
|---|---|
| Which clusters are Offline? | connectivityStatus == "Offline" |
| Agent version spread | summarize count() by agentVersion |
| Distribution mix | summarize count() by distribution |
| Untagged clusters | isnull(tags.owner) |
| Stale Kubernetes versions | project kubernetesVersion then sort |
| Clusters per management group | join to subscription/MG |

GitOps is the fleet rollout mechanism. Because the same az k8s-configuration flux create works across every connected cluster, codify it. The Bicep below registers the Flux config as ARM intent. Pair it with a Policy assignment that requires this config, and every newly onboarded cluster self-bootstraps its baseline:

resource fluxBaseline 'Microsoft.KubernetesConfiguration/fluxConfigurations@2023-05-01' = {
  name: 'fleet-baseline'
  scope: connectedCluster      // the Microsoft.Kubernetes/connectedClusters resource
  properties: {
    scope: 'cluster'
    namespace: 'cluster-config'
    sourceKind: 'GitRepository'
    gitRepository: {
      url: 'https://github.com/acme-platform/fleet-gitops'
      repositoryRef: { branch: 'main' }
    }
    kustomizations: {
      infra: { path: './infrastructure', prune: true }
      apps:  { path: './apps/prod', prune: true, dependsOn: ['infra'] }
    }
  }
}

The end state: a cluster joins the fleet, ARM applies the inherited Policy initiative (admission guardrails), the Flux config (desired state), the Monitor extension (telemetry), and the role assignments (kubectl access) — all without a human SSHing into the cluster.

Architecture at a glance

Read the diagram left to right as the path that intent travels and telemetry returns. On the far left, the platform SRE and the Git repository are the sources of truth — humans issue az commands and assign Policy, while desired configuration lives as YAML on branch: main. That intent lands in the Azure control plane zone: a management group that carries Policy and RBAC down to every child cluster, the Azure Policy engine that compiles initiatives into Gatekeeper constraints, and the Log Analytics workspace that all clusters report into. Critically, nothing in this zone reaches into your network — it publishes intent to ARM and waits.

The Arc agents zone is the bridge, living in the azure-arc namespace inside your cluster and dialling outbound only. The clusterconnect agent (badge 1) holds the Azure Relay channel open over *.servicebus.windows.net:443 so kubectl works with no inbound port; the config + extension manager (badge 3) pulls Flux/Policy/Monitor intent and reconciles it; and kube-aad-proxy (badge 4) authenticates each kubectl caller with Entra and impersonates them against the apiserver. Finally, the hybrid cluster zone — EKS, GKE, or k3s — keeps its kube-apiserver exactly where it was, runs your workloads with prune=true GitOps, and mounts Key Vault secrets via the CSI driver (badge 5). The two return flows (badge 2 marks the Policy admission decision; the amber arrow carries inventory and logs back) close the loop: intent flows right, evidence flows left, and not one inbound firewall rule was opened.

Figure: Azure Arc-enabled Kubernetes fleet governance architecture. Operator and Git repo on the left feed ARM intent (management group, Azure Policy/Gatekeeper, Log Analytics) into outbound-only Arc agents (clusterconnect over Azure Relay on servicebus:443, config/extension manager, kube-aad-proxy with Entra impersonation) running in the azure-arc namespace of a hybrid EKS/GKE/k3s cluster, where the unchanged kube-apiserver, prune=true workloads, and Key Vault CSI mounts reconcile desired state while inventory and ContainerLogV2 logs flow back. Five numbered badges mark (1) cluster-connect websocket failures, (2) Policy deny rollout, (3) Flux reconcile, (4) Entra authZ, and (5) stale secret rotation.

Real-world scenario

A retail platform team ran 28 store-edge clusters (k3s on ruggedised hardware, one per regional distribution center) plus a GKE cluster for their loyalty service. Security mandated two things the existing setup could not deliver: a centrally enforced ban on privileged containers, and break-glass kubectl access for the on-call SRE without opening inbound ports on store networks — the stores sat behind carrier-grade NAT with no public ingress and a websocket-stripping Layer-7 proxy.

The constraint that bit them first was the proxy. Onboarding succeeded, Flux reconciled, Policy enforced — but az connectedk8s proxy hung, because cluster connect rides Azure Relay over *.servicebus.windows.net and the proxy silently dropped the websocket upgrade. The fix was an allow-rule for the resolved, regional Service Bus endpoints with websockets explicitly permitted, expanded from the wildcard via the guest-notification allowlist API:

# Run per store region; feed results into the proxy allowlist with websockets enabled
for region in eastus westus2 centralus; do
  curl -s "https://guestnotificationservice.azure.com/urls/allowlist?api-version=2020-01-01&location=$region"
done

With egress fixed, they assigned the Pod Security baseline initiative at the mg-retail-edge management group — in audit first. The audit results surfaced exactly the violators they expected: a legacy label-printer DaemonSet that ran privileged to access /dev. They refactored it to a specific device plugin, then flipped the initiative to deny. New store clusters now onboard via a pipeline that runs az connectedk8s connect, and inherit the deny policy and the Flux baseline automatically from the management group — zero per-store configuration.
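
The audit → fix → deny rollout changes exactly one parameter value between phases. As a small sketch, the helper below generates the --params JSON for either phase so the namespace exclusions cannot drift between them. The function name is hypothetical; the parameter names effect and excludedNamespaces are the ones used in the lab assignment later in this lesson, and only the two effects from this rollout are accepted.

```python
import json

SYSTEM_NAMESPACES = ["kube-system", "gatekeeper-system", "azure-arc"]


def policy_params(effect):
    """Build the --params JSON for the baseline initiative assignment."""
    if effect not in ("audit", "deny"):
        raise ValueError("effect must be 'audit' or 'deny'")
    return json.dumps({
        "effect": {"value": effect},
        # Same exclusions in both phases, so Arc's own agents are never blocked.
        "excludedNamespaces": {"value": SYSTEM_NAMESPACES},
    })


audit_params = policy_params("audit")   # phase 1: surface violators
deny_params = policy_params("deny")     # phase 2: enforce after remediation
print(audit_params)
print(deny_params)
```

Keeping both phases in one function makes the promotion to deny a one-word diff in the pipeline, which is easy to review and easy to revert.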

| Decision | What they chose | Why |
|---|---|---|
| Onboarding | Pipeline-driven connect | 28 stores, no manual touch |
| Policy rollout | audit → fix → deny | Avoid breaking brownfield workloads |
| Access | Arc Cluster User Role at MG scope | Any store, no inbound port |
| Egress fix | Regional Relay FQDNs + websockets | Cluster connect over CGNAT |
| Secrets | Workload identity + KV CSI | No keys on store hardware |
| Telemetry | Container Insights → one workspace | Fleet-wide error queries |

On-call SREs hold Azure Arc Enabled Kubernetes Cluster User Role at the MG scope, giving them az connectedk8s proxy into any store on earth without a single inbound firewall rule. The whole 28-cluster fleet went from “28 snowflakes” to “one policy, one Git repo, one identity boundary” in under a sprint. The lasting win was not any single control — it was that adding store #29 became a pipeline run, not a project.

Advantages and disadvantages

| Advantages | Disadvantages |
|---|---|
| One control plane for hybrid/multi-cloud K8s | Management plane is eventually consistent (~15 min Policy) |
| Outbound-only: no inbound firewall holes | Hard dependency on egress to Azure FQDNs |
| Policy + RBAC inherit via management groups | Mis-scoped assignment can hit many clusters at once |
| GitOps identical across Arc and AKS | Flux/Gatekeeper add their own in-cluster footprint |
| Secretless Key Vault via workload identity | Federation setup is fiddly the first time |
| Fleet telemetry in one Log Analytics workspace | Ingestion cost grows with cluster/namespace count |
| New clusters self-bootstrap from MG inheritance | If Azure is unreachable, management pauses (data plane keeps running) |
| Works behind CGNAT / locked-down on-prem | Websocket-stripping proxies break cluster connect |

When each matters: the outbound-only model is decisive for edge and regulated on-prem where inbound is simply not allowed. Management-group inheritance is the multiplier once you pass ~5 clusters — below that, the per-cluster effort is small and Arc’s value is mostly uniformity, not labour saved. The eventual-consistency caveat matters most for security expectations: do not assume a freshly assigned deny is enforced the instant you click save; budget ~15 minutes and verify with a negative test. The egress dependency is the thing that bites in practice — almost every painful Arc incident traces back to a firewall or proxy, not to Arc itself.

Hands-on lab

This lab onboards a local kind cluster (free, no cloud cost beyond minimal ARM/Log Analytics) and layers Policy + cluster connect. You need Azure CLI, kubectl, Docker, and an Azure subscription.

# 0) Prereqs
az login
az extension add --name connectedk8s
az extension add --name k8s-configuration
az extension add --name k8s-extension

# 1) A throwaway local cluster
kind create cluster --name arc-lab
kubectl config use-context kind-arc-lab

# 2) Register providers (idempotent; wait for Registered)
for ns in Microsoft.Kubernetes Microsoft.KubernetesConfiguration Microsoft.ExtendedLocation Microsoft.PolicyInsights; do
  az provider register --namespace $ns
done
az provider show -n Microsoft.Kubernetes --query registrationState -o tsv   # -> Registered

# 3) Onboard
export RESOURCE_GROUP=rg-arc-lab LOCATION=eastus CLUSTER_NAME=kind-arc-lab
az group create -n $RESOURCE_GROUP -l $LOCATION -o table
az connectedk8s connect -n $CLUSTER_NAME -g $RESOURCE_GROUP -l $LOCATION

# Expected: connectivityStatus -> Connected; azure-arc pods Running
az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query connectivityStatus -o tsv
kubectl get pods -n azure-arc

# 4) GitOps against a public repo
az k8s-configuration flux create \
  --name lab-baseline -g $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME --cluster-type connectedClusters \
  --namespace cluster-config --scope cluster \
  --url https://github.com/Azure/gitops-flux2-kustomize-helm-mt \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true

# 5) Policy add-on + the baseline initiative in audit mode at the SUBSCRIPTION scope for the lab
az k8s-extension create --cluster-type connectedClusters \
  --cluster-name $CLUSTER_NAME -g $RESOURCE_GROUP \
  --extension-type Microsoft.PolicyInsights --name azurepolicy

SUB=$(az account show --query id -o tsv)
az policy assignment create \
  --name psp-baseline-lab \
  --policy-set-definition a8640138-9b0a-4a28-b8cb-1666c838647d \
  --scope "/subscriptions/$SUB" \
  --params '{"effect":{"value":"audit"},"excludedNamespaces":{"value":["kube-system","gatekeeper-system","azure-arc"]}}'

# 6) Cluster connect — grant yourself, then proxy
ARM_ID=$(az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query id -o tsv)
ME=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --role "Azure Arc Enabled Kubernetes Cluster User Role" --assignee $ME --scope $ARM_ID
az role assignment create --role "Azure Arc Kubernetes Cluster Admin" --assignee $ME --scope $ARM_ID
# Shell 1: az connectedk8s proxy -n $CLUSTER_NAME -g $RESOURCE_GROUP
# Shell 2: kubectl get nodes      # routed over the Arc channel

# 7) TEARDOWN (avoid lingering cost)
az policy assignment delete --name psp-baseline-lab --scope "/subscriptions/$SUB"
az connectedk8s delete -n $CLUSTER_NAME -g $RESOURCE_GROUP --yes
az group delete -n $RESOURCE_GROUP --yes --no-wait
kind delete cluster --name arc-lab

| Step | You should see | If you don’t |
|---|---|---|
| 3 onboard | Connected; azure-arc pods Running | Check egress to MCR + ARM |
| 4 GitOps | complianceState: Compliant after a minute | flux show; check the repo URL |
| 5 Policy | azurepolicy-* constraints after ~15 min | kubectl get constraints empty → wait |
| 6 connect | kubectl get nodes via proxy | Proxy hung → websocket egress |
| 7 teardown | Resources gone | Re-run delete; --no-wait is async |

Common mistakes & troubleshooting

These are the failure modes that actually generate tickets, in rough order of frequency. The first is responsible for more lost hours than the rest combined.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | Onboards fine, az connectedk8s proxy hangs | L7 proxy strips websocket upgrade to *.servicebus.windows.net | az connectedk8s proxy --debug; firewall logs | Allow resolved regional Service Bus FQDNs with websockets enabled |
| 2 | New deny policy pages you at 02:00 | Flipped straight to deny on brownfield | Policy compliance → violating resources | Always audit → fix → deny |
| 3 | Arc agents themselves blocked by policy | System namespaces not excluded | kubectl get constraints -o yaml | Exclude kube-system,gatekeeper-system,azure-arc |
| 4 | “I can connect but everything is forbidden” | Only Cluster User Role assigned | az role assignment list --scope $ARM_ID | Add Viewer/Writer/Admin role too |
| 5 | Flux never reconciles | Private repo, missing/invalid auth | az k8s-configuration flux show; source-controller logs | Pass PAT/SSH; add --known-hosts |
| 6 | prune deletes more than expected | Wrong path/--scope, shared namespace | Inspect the Kustomization path | Narrow path; separate namespaces |
| 7 | Cluster shows Offline | Egress lost / MSI cert expired | az connectedk8s show; clusteridentityoperator logs | Restore egress to HIS endpoints; restart agent |
| 8 | Extension stuck Creating/Failed | extension-manager can’t pull / RBAC | az k8s-extension show ... provisioningState; pod logs | Fix egress/RBAC; delete + recreate |
| 9 | In-cluster service calls fail after connect | Service CIDR not in --proxy-skip-range | DNS/connectivity test in a pod | Re-connect with CIDR + .svc in skip-range |
| 10 | Secret rotated but app still old value | Env-var snapshot at pod start | Compare mounted file vs env var | Read file per request, or restart pods |
| 11 | Log Analytics bill spikes | Collecting all namespaces, V2 on everything | Usage by _ResourceId | Scope dataCollectionSettings namespaces |
| 12 | Negative policy test still admits privileged pod | Constraints not synced / wrong scope | kubectl run pwn --image=nginx --privileged=true -n prod admits | Verify MG scope; wait ~15 min |
| 13 | connect fails: Helm release exists | Stale prior onboarding | helm list -n azure-arc | az connectedk8s delete then re-connect |
| 14 | Resource Graph shows blank distribution/version | cluster-metadata-operator unhealthy | That agent’s logs | Restart agent; check egress |

A fast negative test for Policy: kubectl run pwn --image=nginx --privileged=true -n prod should be denied by the Gatekeeper webhook once the baseline initiative is in deny mode. If it succeeds, your assignment scope or namespace exclusions are wrong, or the constraints have not synced yet.
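
Because constraints can take around 15 minutes to sync, a pipeline may want to repeat the negative test until it is denied rather than run it once. The sketch below is one way to do that. The function names are hypothetical, the 20-minute default timeout is an arbitrary margin over the sync window, and matching on the text "denied the request" is an assumption about how the admission webhook error is worded, so adjust it to what your cluster actually returns.

```python
import subprocess
import time

# The negative test from the paragraph above, as an argument list.
NEGATIVE_TEST = ["kubectl", "run", "pwn", "--image=nginx", "--privileged=true", "-n", "prod"]


def privileged_pod_denied(cmd=NEGATIVE_TEST):
    """Return True if admission rejected the pod, False if it was admitted."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        # Admitted: policy is not enforcing yet. Clean up the probe pod.
        subprocess.run(["kubectl", "delete", "pod", "pwn", "-n", "prod", "--wait=false"],
                       capture_output=True, text=True)
        return False
    if "denied the request" in result.stderr:  # assumed webhook error wording
        return True
    # Any other failure (no cluster, missing namespace) is not a policy verdict.
    raise RuntimeError(f"negative test failed for another reason: {result.stderr.strip()}")


def wait_until(probe, timeout_s=20 * 60, interval_s=60, sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (needs kubectl and a connected cluster):
#   enforced = wait_until(privileged_pod_denied)
```

Separating the probe from the polling loop keeps the loop testable without a cluster and lets the same loop wait on other conditions, such as constraints appearing.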

# Verify each layer landed before you call a cluster "governed"
az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query connectivityStatus -o tsv   # -> Connected
kubectl get pods -n azure-arc          # all Running
az k8s-configuration flux show --name fleet-baseline -g $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME --cluster-type connectedClusters \
  --query "statuses[].complianceState" -o tsv
kubectl get constraints                 # azurepolicy-* present
kubectl get ds ama-logs -n kube-system  # Monitor agent shipping

Best practices

Security notes

Arc’s security model is “outbound-only management plane + least-privilege identity,” and you should keep it that way deliberately.

| Concern | Default / mechanism | Hardening action |
|---|---|---|
| Inbound exposure | None; agents dial out only | Do not add inbound rules “to make it work”; fix egress instead |
| Cluster→Azure identity | MSI cert (clusteridentityoperator) | Allow only HIS FQDNs; monitor cert renewal |
| Human kubectl access | Azure RBAC via Entra | Least-privilege roles; PIM for admin; group, not user, assignments |
| Admission guardrails | Gatekeeper via Policy | Enforce baseline/restricted at MG; deny privileged/hostPath |
| Secrets | Workload identity + KV CSI | No static secrets; per-service-account scope; rotation on |
| Workspace key in cluster | Avoided with useAADAuth=true | Never store the Log Analytics key in-cluster |
| Private Git creds | PAT/SSH stored as secret | Prefer SSH deploy keys or WI (azblob); rotate PATs |
| Egress trust | TLS to Azure | TLS-inspecting proxy → supply --proxy-cert; pin allowlist |
| Network segmentation | Per-region private endpoints (optional) | Use Arc Private Link to keep data-plane off public internet |
| Audit | Entra sign-in + Activity log + apiserver audit | Centralize all three; alert on role-assignment changes |

The identity boundary is the crown jewel: because human access is Entra-mediated and impersonated per request, you get one place (Entra + Activity log) to answer “who touched which cluster, when.” Protect the Arc Cluster Admin role with PIM and just-in-time elevation; a standing Cluster Admin at MG scope is a standing cluster-admin on every cluster in the fleet.

Cost & sizing

Arc-enabled Kubernetes has no per-cluster Arc fee for the core control plane (onboarding, cluster connect, GitOps, Policy). What you pay for is the value-added services that ride on top — chiefly log/metric ingestion and any Arc-enabled data/app services. Sizing is therefore mostly an observability-cost exercise plus a small in-cluster resource footprint for the agents and add-ons.

| Cost driver | What it bills on | Rough figure | How to control |
|---|---|---|---|
| Arc K8s control plane | Onboarding/connect/GitOps/Policy | No core charge | n/a |
| Container Insights ingestion | GB ingested to Log Analytics | ~₹230–290 / USD 2.76 per GB (pay-as-you-go) | Scope namespaces; interval; commitment tiers |
| Log Analytics retention | GB-month beyond free 31 days | Per-GB-month | Shorten retention; archive tier |
| Managed Prometheus metrics | Metric samples ingested | Per-sample pricing | Scrape interval; drop unused series |
| Arc-enabled SQL/data services | vCore/usage of the data service | Service-specific | Right-size the data workload |
| In-cluster agent footprint | Node CPU/mem for agents + add-ons | ~0.5–1 vCPU + ~1–2 GB cluster-wide | Don’t run add-ons you don’t use |
| Egress/proxy infra | Your firewall/proxy capacity | Your existing infra | Allowlist precisely; no new ingress |

Right-sizing the in-cluster footprint

| Component | Approx. resource ask | Notes |
|---|---|---|
| Arc agents (azure-arc) | Modest; a handful of small pods | Always present |
| Flux controllers | CPU on reconcile spikes | Scales with repo size + interval |
| Gatekeeper | Scales with constraint count | Trim constraints; set limits |
| ama-logs DaemonSet | Per-node; scales with log volume | Biggest variable; scope namespaces |

Practical guidance: on a 28-cluster edge fleet, the dominant line item is almost always Container Insights ingestion, not anything Arc-specific. Turn on namespaceFilteringMode: Include for just prod/ingress, raise the metric interval to 5m where 1-minute resolution is not needed, and move long-tail logs to a cheaper retention/archive tier. Free-tier-wise, Log Analytics gives a small daily ingestion allowance and 31 days retention at no charge — enough to validate the pipeline in the lab above without a meaningful bill. (INR figures approximate at ~₹84/USD and vary by region and commitment tier; treat them as order-of-magnitude.)
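
To put rough numbers on the ingestion line item, the arithmetic is simply clusters × GB per cluster per day × days × price per GB. The sketch below uses the USD 2.76 per GB and ~₹84/USD figures quoted in this section. The per-cluster daily volumes are invented for illustration, and the free daily allowance and commitment-tier discounts are deliberately ignored.

```python
USD_PER_GB = 2.76   # pay-as-you-go figure quoted in the cost table
INR_PER_USD = 84    # approximate rate used in this section


def monthly_ingestion_cost(clusters, gb_per_cluster_per_day, days=30):
    """Return (GB per month, USD, INR) for Container Insights ingestion."""
    gb_month = clusters * gb_per_cluster_per_day * days
    usd = gb_month * USD_PER_GB
    return gb_month, round(usd, 2), round(usd * INR_PER_USD)


# Hypothetical volumes for a 28-cluster fleet, before and after namespace scoping.
print(monthly_ingestion_cost(28, 2.0))   # collecting every namespace
print(monthly_ingestion_cost(28, 0.5))   # Include mode: prod/ingress only
```

Since the cost is linear in volume, any reduction in GB per cluster per day from namespace filtering shows up one-for-one in the bill, which is why that setting is the first lever to pull.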

Interview & exam questions

1. What does Arc-enabled Kubernetes actually add to a cluster, and what does it not touch? It installs a Helm release of outbound-only agents in the azure-arc namespace and creates a connectedClusters ARM resource; it adds Policy/GitOps/Monitor/Key Vault via cluster extensions. It does not change your control plane, scheduler, nodes, or the data path of your workloads — Arc reconciles intent and brokers kubectl, nothing more.

2. Why is *.servicebus.windows.net special in the egress allowlist? Cluster connect rides Azure Relay over that endpoint using websockets. A Layer-7 proxy that blocks websocket upgrades will let onboarding succeed but break kubectl-over-Arc, which is a notoriously confusing failure. You must allow the resolved regional FQDNs with websockets enabled.

3. Why roll Azure Policy out in audit before deny? deny causes Gatekeeper to reject non-compliant admissions, so flipping straight to deny on a brownfield cluster rejects existing Deployments on their next rollout. audit surfaces violators without blocking, letting you remediate first, then promote to deny safely.

4. Which namespaces must you exclude from policy assignments, and why? kube-system, gatekeeper-system, and azure-arc. They run system and Arc agent workloads that may legitimately need elevated settings; failing to exclude them can block Arc’s own agents and brick management.

5. Explain the cluster-connect request path. Your Azure token → Azure Relay → clusterconnect-agent → kube-aad-proxy (Entra authN + user impersonation) → kube-apiserver. Impersonation is why a fleet-wide Viewer role grants read-only kubectl on every cluster at once.

6. Cluster User Role is assigned but everything returns forbidden. Why? The Cluster User Role only opens the connect channel; it grants no in-cluster permissions. You must also assign an in-cluster role (Viewer/Writer/Admin) for the impersonated request to do anything.

7. What does prune=true change, and why is it non-negotiable? With prune=true, deleting a manifest from Git causes Flux to garbage-collect the corresponding object from the cluster. Without it, deletions never propagate, so Git stops being the authoritative source of truth.

8. How do you give every new cluster the baseline automatically? Assign Policy initiatives and Arc Kubernetes roles at a management group, and register the Flux config as Bicep fluxConfigurations. A cluster onboarded into any child subscription inherits the policy, GitOps config, and access without per-cluster work.

9. How do you read a Key Vault secret with no credential in the cluster? Install the Key Vault Secrets Provider extension, federate a user-assigned managed identity to a Kubernetes service account (workload identity), and bind a SecretProviderClass. The CSI provider exchanges the pod’s projected token for an Entra token — no client secret anywhere.

10. A secret was rotated but the app still uses the old value. Why, and what fixes it? Rotation refreshes the mounted file on tmpfs, but environment variables are snapshotted at pod start and the synced Secret-as-env path is static. Apps that read the file per request pick up changes; apps that load at boot or use env vars need a pod restart.

11. How does Arc Kubernetes differ from AKS for these controls? The controls are nearly identical — same az k8s-configuration flux, same Policy initiatives, same extensions — but Arc uses --cluster-type connectedClusters and runs on non-Azure clusters with an outbound-only agent set, while AKS uses managedClusters and is already in Azure. The symmetry is intentional.

12. Which certs map to which exam? This material maps to AZ-305 (designing governance/hybrid) and AZ-104 (Arc, Policy, RBAC), with Kubernetes depth overlapping CKA/CKS for the in-cluster admission and RBAC mechanics.

Quick check

  1. What single egress FQDN, if proxied without websockets, lets onboarding succeed but breaks kubectl-over-Arc?
  2. You assigned the Arc Cluster User Role but every kubectl command is forbidden. What did you forget?
  3. Name the three namespaces you must exclude from a fleet policy assignment.
  4. What does prune=true do when you delete a manifest from Git?
  5. Why does a freshly assigned deny policy sometimes still admit a privileged pod for a few minutes?

Answers

  1. *.servicebus.windows.net — cluster connect rides Azure Relay over websockets there. Allow the resolved regional FQDNs with websockets enabled.
  2. An in-cluster role. The Cluster User Role only opens the connect channel; you must also assign Viewer/Writer/Admin for the impersonated request to have permissions.
  3. kube-system, gatekeeper-system, and azure-arc — excluding them keeps the policy from blocking system and Arc agent workloads.
  4. Flux garbage-collects the corresponding object from the cluster, keeping Git as the source of truth.
  5. The Policy add-on syncs assignments roughly every 15 minutes and writes the azurepolicy-* Gatekeeper constraints on that cadence; until they land (or if the scope is wrong), admission is not yet enforced.

Glossary

| Term | Definition |
|---|---|
| Connected cluster | The Microsoft.Kubernetes/connectedClusters ARM resource projecting a non-Azure cluster into Azure. |
| Arc agents | The outbound-only Helm release in the azure-arc namespace that maintains the channel and reconciles intent. |
| Cluster extension | A managed add-on (Flux, Policy, Monitor, Key Vault) installed and lifecycled via Microsoft.KubernetesConfiguration. |
| microsoft.flux | The Flux v2 GitOps cluster extension delivering source/kustomize/helm controllers. |
| fluxConfigurations | The ARM resource describing a Git source + Kustomizations that config-agent applies. |
| Kustomization | A Flux unit that applies a path from a source, with prune, dependsOn, and intervals. |
| Gatekeeper | The OPA admission webhook (v3) that Azure Policy uses to enforce in-cluster constraints. |
| Constraint | The in-cluster object (azurepolicy-*) Gatekeeper enforces, generated from a Policy assignment. |
| Cluster connect | The outbound channel that lets az connectedk8s proxy provide kubectl with no inbound port. |
| kube-aad-proxy | The in-cluster shim that performs Entra authN and impersonates the user against the apiserver. |
| Container Insights | The Microsoft.AzureMonitor.Containers extension shipping logs/metrics/inventory to Log Analytics. |
| Workload identity | A federated user-assigned managed identity bound to a Kubernetes service account for secretless Azure access. |
| SecretProviderClass | The CSI object tying a service account + vault + secret list together for tmpfs mounting. |
| Management group | An ARM scope above subscriptions through which Policy and RBAC inherit to child clusters. |
| Azure Relay | The Azure service (over *.servicebus.windows.net) that brokers the cluster-connect websocket channel. |

Next steps
