A container registry is the single most concentrated point of supply-chain risk in a platform. Every node in every cluster pulls from it, the images it serves run with whatever privileges the workload grants, and a compromised or stale image propagates silently across the fleet. Yet most ACR deployments I inherit are a Standard SKU with the admin user enabled, a long-lived password in a pipeline variable, public network access wide open, and no idea whether the latest tag is the thing that was scanned three months ago. This article builds the opposite: a Premium registry locked behind private endpoints, with repository-scoped tokens instead of admin creds, automated multi-step builds, Notation signatures gated by quarantine-on-push, geo-replicated zone-redundant distribution, Defender scanning, and keyless OIDC CI/CD.
Everything here requires the Premium tier. Private Link, tokens and scope maps, geo-replication, customer-managed keys, and content-trust workflows are all Premium-only. If you are on Basic or Standard, the first move is az acr update --sku Premium — the rest does not apply until you do.
RG=rg-platform-acr
ACR=kvacrprod # globally unique, alphanumeric, 5-50 chars
LOC=australiaeast
az group create -n $RG -l $LOC
az acr create -n $ACR -g $RG --sku Premium \
--admin-enabled false
1. Premium architecture: private endpoints, firewall, and trusted services
The data plane of ACR has two endpoint classes: the registry endpoint (<name>.azurecr.io, used for the Docker v2 API and auth) and the data endpoints that serve the actual layer blobs. With geo-replication, each region gets its own data endpoint (<name>.<region>.data.azurecr.io). When you lock down networking, you must account for both, or pulls succeed on auth and then hang on layer download.
Start by disabling public access and attaching a private endpoint. The private endpoint projects the registry into your VNet with a private IP, and Private Link automatically wires up the per-region data endpoints behind it.
# Disable public network access entirely
az acr update -n $ACR --public-network-enabled false
PE_SUBNET=/subscriptions/<sub>/resourceGroups/rg-net/providers/Microsoft.Network/virtualNetworks/vnet-hub/subnets/snet-pe
ACR_ID=$(az acr show -n $ACR -g $RG --query id -o tsv)
az network private-endpoint create \
-g $RG -n pe-$ACR \
--subnet $PE_SUBNET \
--private-connection-resource-id $ACR_ID \
--group-id registry \
--connection-name pe-$ACR-conn
The registry group ID covers both the control endpoint and all data endpoints — you do not create a separate private endpoint per region. Now wire the private DNS zone so <name>.azurecr.io and <name>.<region>.data.azurecr.io resolve to private IPs inside the VNet:
az network private-dns zone create -g rg-net -n privatelink.azurecr.io
az network private-dns link vnet create \
-g rg-net -n link-acr \
-z privatelink.azurecr.io \
-v vnet-hub --registration-enabled false
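ZONE_ID=$(az network private-dns zone show -g rg-net \
-n privatelink.azurecr.io --query id -o tsv)
# Pass the zone by resource ID: it lives in rg-net, not in $RG
az network private-endpoint dns-zone-group create \
-g $RG --endpoint-name pe-$ACR -n acr-zone-group \
--private-dns-zone $ZONE_ID --zone-name registry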
The DNS zone group auto-populates A records for the registry and every replica data endpoint, so when you add a geo-replica later the record appears without manual intervention. Verify with az network private-dns record-set a list -g rg-net -z privatelink.azurecr.io -o table — you should see one entry per region.
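The record listing shows what the zone contains, not what a client actually resolves. A minimal sketch of a client-side check follows, assuming POSIX sh and RFC 1918 addressing in the VNet. The helper names `is_private_ipv4` and `acr_fqdns` are hypothetical, and the commented usage assumes `getent` is available on the test VM.

```shell
#!/bin/sh
# Return 0 if the IPv4 address falls in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print every FQDN a client must resolve: the login server plus one
# data endpoint per replicated region.
acr_fqdns() {
  name=$1; shift
  echo "$name.azurecr.io"
  for region in "$@"; do
    echo "$name.$region.data.azurecr.io"
  done
}

# Usage from a VM inside the VNet:
# for host in $(acr_fqdns kvacrprod australiaeast southeastasia); do
#   ip=$(getent hosts "$host" | awk '{print $1; exit}')
#   if is_private_ipv4 "$ip"; then echo "OK   $host -> $ip"
#   else echo "FAIL $host -> $ip"; fi
# done
```

Checking the data endpoints as well as the login server is the point: a login server that resolves privately while a data endpoint resolves publicly produces exactly the "auth succeeds, layer download hangs" symptom.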
Trusted services bypass
With public access disabled, platform services that legitimately need to reach the registry — Defender for Cloud scanning, ACR Tasks, Container Apps, the AKS image-cleaner — cannot traverse your private endpoint. ACR exposes a trusted services bypass for exactly this. It is not a blanket “allow Microsoft”; the trusted service must authenticate with its own managed identity that holds an AcrPull (or finer) role.
az acr update -n $ACR --allow-trusted-services true
A subtle failure mode: az acr build and az acr task both run on ACR’s own compute, which is a trusted service, so they bypass the firewall. But az acr import from a network-restricted source, or a docker push from a self-hosted agent, is not trusted; that agent must sit inside the VNet or reach a private endpoint. Most “my firewall blocks ACR Tasks” tickets are actually about the source registry on an import, not the task itself.
2. Token and scope-map repository-scoped access without the admin user
The admin user is a single shared credential with full read/write to the entire registry. Disable it (we did, at creation) and use tokens scoped by scope maps instead. A scope map is an IAM policy for the registry: it grants a named set of actions on specific repositories. A token binds credentials to a scope map.
The valid actions are content/read, content/write, content/delete, metadata/read, and metadata/write. A pull-only CI consumer needs content/read plus metadata/read; a build agent that pushes needs content/write added.
# A pull-only scope map for the payments team's repos
az acr scope-map create -r $ACR -n payments-pull \
--repository payments/api content/read metadata/read \
--repository payments/worker content/read metadata/read \
--description "Pull-only access to payments images"
# Token bound to that scope map
az acr token create -r $ACR -n k8s-payments-puller \
--scope-map payments-pull
Wildcards make this scale. samples/* matches every repository under that prefix, and wildcard grants are additive with exact-match grants, so a CD service account can be given broad pull and narrow push in one map:
az acr scope-map create -r $ACR -n cd-pipeline \
--repository 'apps/*' content/read metadata/read \
--repository apps/checkout content/read content/write metadata/read metadata/write
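When scope maps are generated from a team manifest rather than typed by hand, a pre-flight check on action names catches typos before the CLI call. The sketch below assumes POSIX sh. `valid_scope_action` and `check_actions` are hypothetical helpers, and the accepted list is simply the five actions named above.

```shell
#!/bin/sh
# Return 0 if the argument is one of the five scope-map actions listed above.
valid_scope_action() {
  case "$1" in
    content/read|content/write|content/delete|metadata/read|metadata/write) return 0 ;;
    *) return 1 ;;
  esac
}

# Validate a whole action list; report the first bad entry and stop.
check_actions() {
  for action in "$@"; do
    if ! valid_scope_action "$action"; then
      echo "invalid scope-map action: $action" >&2
      return 1
    fi
  done
  return 0
}

# Usage before creating a map:
# check_actions content/read metadata/read && \
#   az acr scope-map create -r "$ACR" -n payments-pull \
#     --repository payments/api content/read metadata/read
```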
Tokens carry passwords (two for rotation), but the strong pattern is to skip token passwords entirely and let Entra-ID identities pull via AcrPull role assignments with managed identity — covered in section 8. Use scope-map tokens where you genuinely cannot use Entra ID (a third party, an appliance), and rotate them:
az acr token credential generate -r $ACR -n k8s-payments-puller \
--password1 --expiration-in-days 90 -o json
3. ACR Tasks: multi-step builds, base-image triggers, and cache
ACR Tasks run builds on ACR-managed compute, so source never touches a developer laptop and the resulting image is born inside the security boundary. A multi-step task is defined in acr-task.yaml with three step types — build, cmd, and push — and a when property to express dependencies. Critically, unlike az acr build, a multi-step build step does not auto-push; you only push after validation passes. That gives you a build-test-push gate in a single task.
# acr-task.yaml
version: v1.1.0
steps:
  - id: build
    build: -t $Registry/payments/api:$ID -f Dockerfile .
  # Run the freshly built image through tests before it is pushed
  - id: unit-tests
    cmd: $Registry/payments/api:$ID pytest -q
    when: ["build"]
  # Only push if tests succeeded
  - id: push
    push:
      - $Registry/payments/api:$ID
      - $Registry/payments/api:latest
    when: ["unit-tests"]
$Registry expands at runtime to the executing registry’s login server, and $ID is the unique run ID — using it as the immutable tag means every build is independently addressable. Create the task with a Git trigger so a commit to main builds automatically:
az acr task create -r $ACR -n payments-api-ci \
--file acr-task.yaml \
--context https://github.com/org/payments.git#main \
--git-access-token $GH_PAT \
--commit-trigger-enabled true \
--base-image-trigger-enabled true \
--base-image-trigger-type Runtime
The base-image trigger is the feature that earns ACR Tasks its keep. When the base image your FROM line references is updated — whether that is an upstream mcr.microsoft.com/dotnet/aspnet digest or a hardened internal base you maintain — the task re-runs and rebuilds your application image with the patched layers. This is how you keep thousands of derived images current against CVEs without anyone manually rebuilding. The trigger requires your Dockerfile to pin a specific base tag (not nothing, and ideally not latest); ACR tracks the digest behind that tag and fires when it moves.
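Since the trigger depends on a pinned base tag, it can help to lint Dockerfiles for FROM lines that have no tag or use latest. The following is a rough sketch, assuming POSIX sh and awk. `check_base_tags` is a hypothetical helper. It skips `--platform` style flags and references to earlier build stages, leaves digest-pinned references alone, and does not expand ARG variables.

```shell
#!/bin/sh
# Print each base image with no tag or the floating "latest" tag.
# Returns 1 if any such image is found, 0 otherwise.
check_base_tags() {
  dockerfile=$1
  bad=0
  images=$(awk 'toupper($1) == "FROM" {
    img = ""
    for (i = 2; i <= NF; i++) if ($i !~ /^--/) { img = $i; break }
    if (!(img in stages)) print img          # skip references to earlier stages
    if (toupper($(i + 1)) == "AS") stages[$(i + 2)] = 1
  }' "$dockerfile")
  for image in $images; do
    name=${image##*/}   # drop registry/path so a registry port is not read as a tag
    case "$name" in
      scratch|*@sha256:*) ;;                               # nothing to track by tag
      *:latest) echo "floating tag: $image"; bad=1 ;;
      *:*) ;;                                              # explicit tag, fine
      *) echo "no tag: $image"; bad=1 ;;
    esac
  done
  return "$bad"
}

# Usage in CI:
# check_base_tags Dockerfile || exit 1
```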
For an internal base-image chain, point a task at the base repo and let the derived task’s Runtime trigger cascade:
# Base image task — its push moves the digest behind myorg/base:1.0
az acr task create -r $ACR -n base-image \
--image myorg/base:1.0 \
--context https://github.com/org/base.git#main \
--git-access-token $GH_PAT \
--commit-trigger-enabled true
ACR Tasks caches layers between runs automatically, and BuildKit can be enabled by setting DOCKER_BUILDKIT=1 in the task env for better cache behavior and secret mounts.
4. Image signing with Notation and quarantine-on-push gating
Two independent controls combine here. Notation attaches a cryptographic signature to an image so consumers can prove provenance and integrity. Quarantine holds every pushed image invisible until a process explicitly marks it good — turning “push” into “push to staging” and forcing a gate before anything is pullable.
Quarantine on push
Quarantine is configured through the management policy API. Once enabled, a freshly pushed image is visible only to identities with quarantine-reader permission; normal pulls fail until the image is marked passed. Your scanner subscribes to the quarantine webhook, scans, and promotes.
ID=$(az acr show -n $ACR --query id -o tsv)
az resource update --ids $ID \
--set properties.policies.quarantinePolicy.status=enabled
Enabling quarantine is a breaking change to existing workflows: any image not explicitly marked good is blocked for pull. Roll it out per registry with the consuming teams aware, and make sure your promotion automation is live before you flip it, or every deployment stalls.
Signing with Notation and Azure Key Vault
Notation signs with a certificate stored in Key Vault via the azure-kv plugin. Install the CLI and plugin, pinning versions (the ones below were current at the time of writing):
curl -Lo notation.tar.gz \
https://github.com/notaryproject/notation/releases/download/v1.3.2/notation_1.3.2_linux_amd64.tar.gz
tar xzf notation.tar.gz && cp ./notation /usr/local/bin
notation plugin install --url \
https://github.com/Azure/notation-azure-kv/releases/download/v1.2.1/notation-azure-kv_1.2.1_linux_amd64.tar.gz \
--sha256sum 67c5ccaaf28dd44d2b6572684d84e344a02c2258af1d65ead3910b3156d3eaf5
The signing identity needs Key Vault Certificates Officer and Key Vault Crypto User on the vault (RBAC mode), plus pull/push on the registry. Always sign by digest, never by tag — tags are mutable, and a signature must bind to immutable content:
KEY_ID=$(az keyvault certificate show -n signing-cert \
--vault-name kv-signing --query 'kid' -o tsv)
DIGEST=$(az acr build -r $ACR -t $ACR.azurecr.io/payments/api:v1 \
https://github.com/org/payments.git#main \
--no-logs --query "outputImages[0].digest" -o tsv)
IMAGE=$ACR.azurecr.io/payments/api@$DIGEST
notation sign --signature-format cose \
--id $KEY_ID --plugin azure-kv \
--plugin-config self_signed=true \
$IMAGE
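The sign-by-digest rule is easy to enforce in the signing script itself. A minimal sketch, assuming POSIX sh with grep -E: `is_digest_ref` and `sign_image` are hypothetical wrappers, and `KEY_ID` is the variable set in the commands above. The wrapper refuses any reference that does not end in a sha256 digest before notation is invoked.

```shell
#!/bin/sh
# Return 0 only if the reference ends in @sha256:<64 lowercase hex chars>.
is_digest_ref() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Sign an image, but only when it is addressed by digest.
sign_image() {
  if ! is_digest_ref "$1"; then
    echo "refusing to sign a tag reference: $1" >&2
    return 1
  fi
  notation sign --signature-format cose \
    --id "$KEY_ID" --plugin azure-kv \
    --plugin-config self_signed=true \
    "$1"
}

# Usage:
# sign_image "$ACR.azurecr.io/payments/api@$DIGEST"
```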
Verification is policy-driven. Add the certificate to a named trust store, then import a trust policy that scopes which signers are trusted for which repositories:
az keyvault certificate download -n signing-cert --vault-name kv-signing -f cert.pem
notation cert add --type ca --store payments-ca cert.pem
{
  "version": "1.0",
  "trustPolicies": [
    {
      "name": "payments-images",
      "registryScopes": [ "kvacrprod.azurecr.io/payments/api" ],
      "signatureVerification": { "level": "strict" },
      "trustStores": [ "ca:payments-ca" ],
      "trustedIdentities": [
        "x509.subject: CN=payments.org,O=Platform,L=Sydney,ST=NSW,C=AU"
      ]
    }
  ]
}
notation policy import ./trustpolicy.json
notation verify $IMAGE
At the cluster, enforcement is done by Ratify plus an Azure Policy / Gatekeeper constraint that admits only images whose Notation signature validates against this trust policy. That closes the loop: ACR signs, AKS refuses anything unsigned or signed by the wrong identity. (Note Notation v1.2+ also supports RFC 3161 timestamping so signatures stay verifiable after the signing cert expires — essential with short-lived certs.)
5. Geo-replication, zone redundancy, and regional failover
Geo-replication makes the registry a single logical resource with image storage in multiple regions, served through one login server (<name>.azurecr.io). Pulls from a region are served by the nearest replica’s data endpoint, which cuts egress cost and latency for multi-region clusters, and survives a regional outage because the global endpoint routes around an unhealthy replica.
az acr replication create -r $ACR -l southeastasia
az acr replication create -r $ACR -l westus2
az acr replication list -r $ACR -o table
Zone redundancy is now on by default for every replica (and for the home region in AZ-supporting regions) at no extra cost — ACR spreads each replica’s storage across availability zones automatically. The --zone-redundancy flag still exists for backward compatibility but you no longer need to set it. The practical upshot: a single replica already survives a zone failure; geo-replication is what you add for region failure and pull locality.
Failover is platform-managed and health-aware. ACR continuously checks each replica and reroutes the global endpoint away from a replica that cannot serve reliably. There is no customer-invocable failover button and no DNS change on your side — pushes, pulls, and deletes continue through the surviving replicas. Your job is capacity planning (enough replicas that losing one does not overload the rest) and ensuring each consuming region actually has a nearby replica.
| Concern | Mechanism | Who triggers it |
|---|---|---|
| Zone outage | Zone-redundant replica storage (default) | Platform, automatic |
| Region outage | Geo-replication, health-aware routing | Platform, automatic |
| Pull latency / egress | Regional data endpoint nearest the client | Routing, automatic |
| Disaster recovery copy | Replica acts as a live, writable copy | You, by adding the replica |
6. Vulnerability scanning with Defender for Containers
Microsoft Defender for Containers scans images in ACR on push, on pull, and continuously (re-scanning already-pushed images as new CVE definitions land, for images pulled in the last 30 days). Enable the plan at the subscription level:
az security pricing create -n Containers --tier Standard
Because we disabled public access, Defender’s scanner reaches the registry through the trusted-services bypass — which is precisely why --allow-trusted-services true from section 1 is not optional once you turn on scanning. Findings surface in Defender for Cloud and can be queried in Azure Resource Graph to drive a fail-the-build or block-the-pull gate:
securityresources
| where type == "microsoft.security/assessments/subassessments"
| where id contains "containerRegistryVulnerability"
| extend sev = properties.status.severity,
cve = properties.id,
repo = properties.additionalData.repositoryName,
digest = properties.additionalData.imageDigest
| where sev in ("High", "Critical")
| project repo, digest, cve, sev, description = properties.description
| order by sev desc
Wire that query into a scheduled check or an Azure Monitor alert so a Critical finding on an in-use image pages the owning team, rather than sitting in a portal blade nobody opens.
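One way to turn the query into a pipeline gate is to reduce it to a finding count and fail the step when the count is non-zero. This is a sketch under assumptions: `fail_on_findings` is a hypothetical helper, the commented usage assumes the `resource-graph` CLI extension is installed and that its JSON output exposes a top-level `count` field, and `vuln-query.kql` is an assumed file holding the query above.

```shell
#!/bin/sh
# Exit status: 0 = no findings, 1 = findings present, 2 = count not parseable.
fail_on_findings() {
  count=$1
  case "$count" in
    ''|*[!0-9]*)
      echo "unexpected finding count: '$count'" >&2
      return 2 ;;
  esac
  if [ "$count" -gt 0 ]; then
    echo "blocking: $count High/Critical findings" >&2
    return 1
  fi
  echo "no High/Critical findings"
}

# Usage in a pipeline step:
# count=$(az graph query -q "$(cat vuln-query.kql)" --query count -o tsv)
# fail_on_findings "$count"
```

Treating an unparseable count as its own failure code keeps a broken query or an expired login from silently passing the gate.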
7. Purge tasks, retention policies, and untagged manifest cleanup
A busy CI pipeline tagging every build by run ID will accumulate thousands of manifests and bloat storage and scan scope. Two complementary tools clean up: a retention policy for untagged manifests, and a purge task for tags.
The retention policy auto-deletes untagged manifests after N days. Untagged manifests are typically the orphans left when a tag is overwritten:
az acr config retention update -r $ACR \
--status enabled --days 14 --type UntaggedManifests
For tag-level cleanup on a schedule, ACR ships a containerized acr purge command you run as a scheduled task. This deletes tags older than a duration matching a filter, and --untagged then removes the now-unreferenced manifests:
PURGE_CMD="acr purge \
--filter 'payments/api:.*' \
--filter 'payments/worker:.*' \
--ago 30d --untagged"
az acr task create -r $ACR -n nightly-purge \
--cmd "$PURGE_CMD" \
--schedule "0 2 * * *" \
--context /dev/null
Two sharp edges. First, acr purge --untagged can delete manifests that belong to multi-arch images or signatures if you are not careful with filters: anything referenced only by digest (signatures, SBOMs, multi-arch child manifests) looks “untagged.” Test filters with --dry-run (supported by the purge command) before scheduling. Second, deleted image data is unrecoverable unless soft delete is enabled, which keeps deleted artifacts recoverable for a retention window, so turn it on first if you want a safety net.
az acr config soft-delete update -r $ACR --status enabled --days 7
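To rehearse a purge before scheduling it, the same command can be run once on demand with --dry-run appended. The sketch below assumes POSIX sh. `purge_cmd` is a hypothetical helper that only assembles the command string, and the commented `az acr run` line is how the string would be submitted as a one-off run.

```shell
#!/bin/sh
# Build an "acr purge" dry-run command for the given age and repositories.
# Each repository gets a filter that matches all of its tags.
purge_cmd() {
  ago=$1; shift
  cmd="acr purge"
  for repo in "$@"; do
    cmd="$cmd --filter '$repo:.*'"
  done
  echo "$cmd --ago $ago --untagged --dry-run"
}

# Usage: review the printed list of tags and manifests, then drop --dry-run
# in the scheduled task once the filters look right.
# az acr run -r "$ACR" --cmd "$(purge_cmd 30d payments/api payments/worker)" /dev/null
```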
8. CI/CD wiring with managed identity and OIDC keyless push
The final piece removes the last long-lived secret. Instead of a token password or service-principal secret in the pipeline, use OIDC federated credentials: GitHub Actions (or Azure DevOps) presents a short-lived OIDC token, Entra ID validates it against a federated credential on a user-assigned managed identity, and the pipeline gets a transient access token. Nothing persistent is stored.
# User-assigned identity the pipeline will assume
az identity create -g $RG -n id-payments-cicd
APP_ID=$(az identity show -g $RG -n id-payments-cicd --query clientId -o tsv)
OID=$(az identity show -g $RG -n id-payments-cicd --query principalId -o tsv)
# Push rights to the registry (use a custom role / scope-map for least privilege)
az role assignment create --assignee $OID --role AcrPush --scope $ACR_ID
# Federate to a specific repo + branch — subject must match exactly
az identity federated-credential create \
-g $RG --identity-name id-payments-cicd \
-n gh-payments-main \
--issuer https://token.actions.githubusercontent.com \
--subject repo:org/payments:ref:refs/heads/main \
--audiences api://AzureADTokenExchange
The workflow requests id-token: write, logs in with no secret, and pushes:
permissions:
  id-token: write # required to fetch the OIDC token
  contents: read
jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
      - name: Build and push via ACR Tasks
        run: |
          az acr login --name kvacrprod
          az acr build -r kvacrprod -t kvacrprod.azurecr.io/payments/api:${{ github.sha }} .
Granting id-token: write only allows the job to request an OIDC token; it confers no resource access by itself. All authorization flows from the federated-credential subject match and the role assignment, so scope both tightly — federate per repo and branch, and assign push on the narrowest scope (a custom role limited to specific repositories beats AcrPush across the registry).
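Because the subject must match exactly, generating it from parameters avoids hand-typed mismatches. The sketch below assumes POSIX sh and covers the common GitHub Actions subject shapes (branch, tag, environment, pull request). `gh_oidc_subject` is a hypothetical helper, not part of any CLI.

```shell
#!/bin/sh
# Print the OIDC subject claim GitHub Actions issues for a given trigger.
#   gh_oidc_subject <org/repo> branch <name>
#   gh_oidc_subject <org/repo> tag <name>
#   gh_oidc_subject <org/repo> environment <name>
#   gh_oidc_subject <org/repo> pull_request
gh_oidc_subject() {
  repo=$1
  kind=$2
  value=${3:-}
  case "$kind" in
    branch)       echo "repo:$repo:ref:refs/heads/$value" ;;
    tag)          echo "repo:$repo:ref:refs/tags/$value" ;;
    environment)  echo "repo:$repo:environment:$value" ;;
    pull_request) echo "repo:$repo:pull_request" ;;
    *) echo "unknown subject kind: $kind" >&2; return 1 ;;
  esac
}

# Usage with the federated credential command shown earlier:
# az identity federated-credential create \
#   -g "$RG" --identity-name id-payments-cicd -n gh-payments-main \
#   --issuer https://token.actions.githubusercontent.com \
#   --subject "$(gh_oidc_subject org/payments branch main)" \
#   --audiences api://AzureADTokenExchange
```

A job that runs in a GitHub environment presents the environment form of the subject rather than the branch form, so a workflow that adds an `environment:` key needs its own federated credential.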
Verify
Confirm each control is actually in force, not just configured:
# Public access is off and trusted services are on
az acr show -n $ACR --query \
"{public:publicNetworkAccess, trusted:networkRuleBypassOptions}" -o table
# Admin user is disabled
az acr show -n $ACR --query adminUserEnabled -o tsv # expect: false
# Private endpoint resolves to a private IP from inside the VNet
nslookup $ACR.azurecr.io # expect a 10.x / private address, not public
# Replicas exist and are Ready
az acr replication list -r $ACR -o table
# Policies are enabled
az acr show -n $ACR --query \
"{quarantine:policies.quarantinePolicy.status, retention:policies.retentionPolicy.status}" -o table
# A signed image verifies; tampering or wrong signer fails
notation verify $ACR.azurecr.io/payments/api@$DIGEST
# A pull with a pull-only token cannot push
echo "$TOKEN_PW" | docker login $ACR.azurecr.io -u k8s-payments-puller --password-stdin
docker push $ACR.azurecr.io/payments/api:test # expect: denied
Enterprise scenario
A fintech platform team ran a single Standard ACR in Australia East feeding AKS clusters in Australia East and Southeast Asia. Two problems surfaced in the same quarter. First, a security review flagged that the registry’s admin user was enabled and its password lived in a Kubernetes imagePullSecret that had not rotated in 14 months — and the same secret was pasted into three pipelines. Second, the Southeast Asia clusters were pulling every layer cross-region on cold starts, adding seconds to pod startup during scale-out and racking up inter-region egress on every deployment.
They upgraded to Premium and made two coordinated changes. For credentials, they killed the admin user, moved AKS to managed-identity pulls by attaching the registry to each cluster (az aks update --attach-acr), which assigns AcrPull to the kubelet identity — no secret in the cluster at all — and moved pipelines to OIDC federated credentials. For locality, they added a geo-replica in Southeast Asia. Because the global login server is unchanged, no manifests, Helm charts, or pipelines needed editing; the Southeast Asia kubelets simply began resolving to the local data endpoint and pulling within region.
# Replica colocated with the SEA clusters — single command, zero manifest changes
az acr replication create -r kvacrprod -l southeastasia
# Each AKS cluster pulls with its kubelet managed identity, no imagePullSecret
az aks update -g rg-aks-sea -n aks-sea --attach-acr kvacrprod
The measurable outcomes: cold-start pull time in Southeast Asia dropped because layers no longer crossed the region boundary, inter-region egress on deploys went to near zero, and the credential audit finding closed because there were no static registry secrets left to rotate. The replica also gave them an unplanned benefit during a later Australia East zone disruption — the SEA replica kept serving pulls while the home region recovered, with no failover action on their part. The lesson the team took away: geo-replication is sold as DR, but the day-to-day wins are pull locality and the fact that one login server means you can change the topology underneath without touching a single workload manifest.