A single private endpoint is a five-minute job. A fleet of them — hundreds of endpoints across dozens of spokes, each needing an A record that resolves consistently from VNets and on-prem — is a governance problem dressed up as a networking one. This article builds the centralized Private DNS zone topology and then makes it self-enforcing with Azure Policy, so a spoke team can never ship an endpoint that silently resolves to a public IP.
Why private endpoints silently fail
A private endpoint projects a NIC into your VNet and assigns the target PaaS resource a private IP. What it does not do is change the name your application uses. Your SDK, connection string, and the resource’s TLS certificate all reference the public FQDN — stappdata.blob.core.windows.net, kv-app.vault.azure.net. The failure is always the same: the name resolves to the public IP, traffic takes the internet path, and if the PaaS firewall is locked down the connection simply hangs.
The fix is a DNS chain Azure half-builds for you. Microsoft’s public resolvers already return a CNAME from the public name to a privatelink.* alias. Your responsibility is to host that privatelink zone with an A record:
stappdata.blob.core.windows.net
└─ CNAME stappdata.privatelink.blob.core.windows.net (returned by Azure public DNS)
└─ A 10.42.3.7 (only resolves if YOU host privatelink.blob.core.windows.net)
If the querying client can see a Private DNS zone named privatelink.blob.core.windows.net containing that A record, it follows the CNAME and gets the private IP. If it cannot, the CNAME chain falls through to the public A record. Every troubleshooting scenario later in this article is a variation of one root cause: the client could not see the right zone, or the zone did not contain the right record.
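A quick way to tell which branch of that chain a given client actually took is to classify the addresses it gets back. The following is a minimal sketch, not part of the original tooling: the function names and the injectable `lookup` parameter are hypothetical, and it treats anything Python's `ipaddress` module marks as private as a successful private resolution.

```python
import ipaddress
import socket


def resolved_ipv4(fqdn, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses the resolver gives for fqdn."""
    infos = lookup(fqdn, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def classify_resolution(fqdn, lookup=socket.getaddrinfo):
    """Classify the resolver's answer as 'private', 'public', or 'mixed'."""
    addresses = resolved_ipv4(fqdn, lookup)
    if not addresses:
        raise LookupError(f"no IPv4 answer for {fqdn}")
    private = [a for a in addresses if ipaddress.ip_address(a).is_private]
    if not private:
        return "public"   # the CNAME chain fell through to the public A record
    return "private" if len(private) == len(addresses) else "mixed"
```

Run from a spoke VM, `classify_resolution("stappdata.blob.core.windows.net")` should come back as `"private"` once the zone is linked and the record exists. The `lookup` argument exists only so the logic can be exercised without live DNS.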
Azure’s wire server, 168.63.129.16, is what makes this transparent for VNet clients. Any VNet using Azure-provided DNS automatically consults every Private DNS zone linked to that VNet. You do not need custom DNS servers on the spokes for internal resolution — only for hybrid, covered later.
Centralized vs decentralized: the blast-radius decision
You have two topologies. Understand the trade before you pour concrete, because migrating later means re-pointing hundreds of zone groups.
| | Decentralized (zone per spoke) | Centralized (zone in hub) |
|---|---|---|
| Zone copies | One privatelink.blob... per spoke | Exactly one, in the connectivity subscription |
| Record drift | High — N places to diverge | Single source of truth |
| Cross-spoke resolution | Awkward; needs peering + links anyway | Native via VNet links |
| RBAC blast radius | Each team self-serves their zone | Platform team owns zones; spokes own endpoints |
| Failure if zone misconfigured | One spoke | Potentially the fleet |
Centralized wins for any estate past a handful of spokes. The blast-radius concern is real — a fat-fingered record in the hub zone affects everyone — but you mitigate that with policy-as-code and zone groups (so humans never write records by hand), not by accepting forty divergent copies. The model that follows is: the connectivity/hub subscription owns the canonical zones; each spoke owns its endpoints; Azure Policy stitches them together.
Step 1 — Canonical zones and VNet links in the hub
Create one copy of each privatelink zone in a dedicated DNS resource group in the connectivity subscription, then link every spoke VNet to each zone for resolution.
HUB_RG="rg-connectivity-dns"
LOCATION="eastus2"
# One canonical zone per PaaS suffix you use
for zone in \
    privatelink.blob.core.windows.net \
    privatelink.file.core.windows.net \
    privatelink.vaultcore.azure.net \
    privatelink.database.windows.net \
    privatelink.azurewebsites.net ; do
  az network private-dns zone create \
    --resource-group "$HUB_RG" \
    --name "$zone"
done
Now link the spokes. A VNet link created with --registration-enabled false is a resolution-only link, which is exactly what you want for a shared zone. Auto-registration is for VM hostnames in one VNet and has no business in a privatelink zone.
az network private-dns link vnet create \
  --resource-group "$HUB_RG" \
  --zone-name privatelink.blob.core.windows.net \
  --name link-spoke-payments \
  --virtual-network "$SPOKE_PAYMENTS_VNET_ID" \
  --registration-enabled false
At fleet scale this is a Terraform for_each over the cross-product of zones and spokes. One block creates the zones, one creates every link:
locals {
  privatelink_zones = [
    "privatelink.blob.core.windows.net",
    "privatelink.file.core.windows.net",
    "privatelink.vaultcore.azure.net",
    "privatelink.database.windows.net",
    "privatelink.azurewebsites.net",
  ]
}

resource "azurerm_private_dns_zone" "this" {
  for_each            = toset(local.privatelink_zones)
  name                = each.value
  resource_group_name = azurerm_resource_group.dns.name
}

# Cartesian product: every canonical zone linked to every spoke VNet
resource "azurerm_private_dns_zone_virtual_network_link" "this" {
  for_each = {
    for pair in setproduct(local.privatelink_zones, keys(var.spoke_vnets)) :
    "${pair[0]}::${pair[1]}" => { zone = pair[0], vnet = pair[1] }
  }
  name                  = "link-${each.value.vnet}"
  resource_group_name   = azurerm_resource_group.dns.name
  private_dns_zone_name = azurerm_private_dns_zone.this[each.value.zone].name
  virtual_network_id    = var.spoke_vnets[each.value.vnet]
  registration_enabled  = false
}
Adding a new spoke is now one entry in var.spoke_vnets and a terraform apply. Every zone links to it automatically. This is the operational payoff of centralization.
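The same cross-product is handy outside Terraform, for example to audit which links are still missing in an estate that was partly built by hand. This is a rough sketch under stated assumptions: `missing_links` is a hypothetical helper, and the `existing` pairs would come from whatever inventory you already have (a CLI export, for instance).

```python
from itertools import product


def missing_links(zones, spokes, existing):
    """Return the (zone, spoke) pairs that should be linked but are not.

    zones    -- iterable of privatelink zone names
    spokes   -- iterable of spoke VNet names
    existing -- iterable of (zone, spoke) pairs already linked
    """
    wanted = set(product(zones, spokes))      # every zone x every spoke
    return sorted(wanted - set(existing))


zones = ["privatelink.blob.core.windows.net", "privatelink.vaultcore.azure.net"]
spokes = ["payments", "orders"]               # hypothetical spoke names
existing = [
    ("privatelink.blob.core.windows.net", "payments"),
    ("privatelink.blob.core.windows.net", "orders"),
    ("privatelink.vaultcore.azure.net", "payments"),
]
gaps = missing_links(zones, spokes, existing)
```

Here `gaps` contains the single pair for the Key Vault zone and the `orders` spoke. An empty list means the link matrix is complete.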
Step 2 — Automating A-record registration with the zone group
Never write the A record yourself. A Private DNS zone group binds an endpoint to one or more zones and hands Azure the record lifecycle: it writes the A record on creation, rewrites it if the private IP changes, and deletes it when the endpoint is deleted. Hand-authored records rot on the first redeploy.
# 1. Create the endpoint in the spoke (storage blob example)
az network private-endpoint create \
  --name pe-stappdata-blob \
  --resource-group rg-app-payments \
  --vnet-name vnet-spoke-payments \
  --subnet snet-privateendpoints \
  --private-connection-resource-id "$STORAGE_ID" \
  --group-id blob \
  --connection-name conn-stappdata-blob

# 2. Bind it to the CENTRAL zone (note: --private-dns-zone is a full resource ID)
az network private-endpoint dns-zone-group create \
  --resource-group rg-app-payments \
  --endpoint-name pe-stappdata-blob \
  --name default \
  --private-dns-zone "$HUB_ZONE_ID_BLOB" \
  --zone-name privatelink-blob
The critical detail: --private-dns-zone takes a full resource ID, and that ID can point at a zone in a different subscription. That single fact is the mechanism of centralization — the spoke subscription owns the endpoint, the connectivity subscription owns the zone, and the zone group bridges them. The --zone-name here is just a local label for the config entry, not the DNS zone’s FQDN.
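To make the cross-subscription bridge concrete, here is a small sketch that assembles a hub zone's resource ID and checks that it lives in a different subscription from the endpoint. The helper names and the placeholder subscription GUIDs are hypothetical; the ID layout follows the standard ARM pattern for Microsoft.Network/privateDnsZones.

```python
def private_dns_zone_id(subscription_id, resource_group, zone_name):
    """Build the full ARM resource ID for a Private DNS zone."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/privateDnsZones/{zone_name}"
    )


def subscription_of(resource_id):
    """Extract the subscription GUID from an ARM resource ID."""
    parts = resource_id.strip("/").split("/")
    if len(parts) < 2 or parts[0].lower() != "subscriptions":
        raise ValueError(f"not a subscription-scoped resource ID: {resource_id}")
    return parts[1]


def is_cross_subscription(endpoint_id, zone_id):
    """True when the endpoint and the zone live in different subscriptions."""
    return subscription_of(endpoint_id).lower() != subscription_of(zone_id).lower()


# Placeholder GUIDs, for illustration only
HUB_SUB = "00000000-0000-0000-0000-000000000001"
SPOKE_SUB = "00000000-0000-0000-0000-000000000002"

hub_zone_id = private_dns_zone_id(
    HUB_SUB, "rg-connectivity-dns", "privatelink.blob.core.windows.net"
)
endpoint_id = (
    f"/subscriptions/{SPOKE_SUB}/resourceGroups/rg-app-payments"
    "/providers/Microsoft.Network/privateEndpoints/pe-stappdata-blob"
)
```

The value of `hub_zone_id` is what would be passed as `$HUB_ZONE_ID_BLOB`, and `is_cross_subscription(endpoint_id, hub_zone_id)` returning `True` is the centralized model working as intended.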
The --group-id (the sub-resource) is specific to each service and to each endpoint type within that service. A storage account serving blob and file traffic needs two sub-resources and therefore two private endpoints, one per group ID, and the records land in two different zones; see the “tricky services” section.
Step 3 — Enforcing the pattern fleet-wide with Azure Policy
Manual zone groups do not survive contact with a dozen application teams. The first endpoint someone creates without a zone group is a silent public-resolution bug that surfaces weeks later as an intermittent connectivity ticket. Close the gap with Azure Policy using the deployIfNotExists (DINE) effect: the policy watches for new private endpoints and creates the zone group pointing at your canonical zone — no human in the loop.
Microsoft ships built-in DINE definitions; search the catalog for “Configure private endpoints … to use private DNS zones”. There is one definition per service (Blob, File, Key Vault, SQL, and so on), and the standard practice is to bundle them into an initiative assigned at the landing-zone management group, with each definition parameterized with the relevant canonical zone’s resource ID.
The rule shape, so you understand what the engine evaluates and deploys:
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Network/privateEndpoints" },
      {
        "count": {
          "field": "Microsoft.Network/privateEndpoints/privateLinkServiceConnections[*].groupIds[*]",
          "where": {
            "field": "Microsoft.Network/privateEndpoints/privateLinkServiceConnections[*].groupIds[*]",
            "equals": "blob"
          }
        },
        "greaterOrEquals": 1
      }
    ]
  },
  "then": {
    "effect": "deployIfNotExists",
    "details": {
      "type": "Microsoft.Network/privateEndpoints/privateDnsZoneGroups",
      "roleDefinitionIds": [
        "/providers/Microsoft.Authorization/roleDefinitions/4d97b98b-1d4f-4787-a291-c67834d212e7"
      ],
      "deployment": {
        "properties": {
          "mode": "incremental",
          "parameters": {
            "privateDnsZoneId": { "value": "[parameters('privateDnsZoneId')]" },
            "privateEndpointName": { "value": "[field('name')]" },
            "location": { "value": "[field('location')]" }
          },
          "template": { "...": "ARM template that creates the privateDnsZoneGroups child resource" }
        }
      }
    }
  }
}
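To see what the "if" block of that rule selects, it can help to replay it against a resource payload by hand. The sketch below is an approximation, not the policy engine: `matches_policy` is a hypothetical helper, the resource dict mirrors the ARM shape of a private endpoint, and the case-insensitive comparison is an assumption about how the equals condition behaves.

```python
def matches_policy(resource, group_id="blob"):
    """Approximate the rule's 'if' block: a private endpoint with >= 1 matching groupId."""
    # First allOf clause: the resource type must be a private endpoint.
    if resource.get("type", "").lower() != "microsoft.network/privateendpoints":
        return False

    # Second clause: count every groupIds[*] entry across all connections
    # whose value equals the target group ID.
    connections = resource.get("properties", {}).get("privateLinkServiceConnections", [])
    count = sum(
        1
        for conn in connections
        for gid in conn.get("properties", {}).get("groupIds", [])
        if gid.lower() == group_id.lower()   # assumed case-insensitive, like 'equals'
    )
    return count >= 1                        # "greaterOrEquals": 1


blob_endpoint = {
    "type": "Microsoft.Network/privateEndpoints",
    "name": "pe-stappdata-blob",
    "properties": {
        "privateLinkServiceConnections": [
            {"name": "conn-stappdata-blob", "properties": {"groupIds": ["blob"]}}
        ]
    },
}
```

With this payload `matches_policy(blob_endpoint)` is true, while the same endpoint evaluated with `group_id="file"` is not, which is why each service needs its own definition in the initiative.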
Two things make or break this in production:
The managed identity needs the right roles. The roleDefinitionIds above is Network Contributor (4d97b98b-1d4f-4787-a291-c67834d212e7), which the policy’s system-assigned identity needs to write the zone group on the endpoint. Because the canonical zone lives in another subscription, that identity also needs Private DNS Zone Contributor scoped to the hub DNS resource group. If you assign the policy at a management group, grant the identity rights on the hub scope explicitly — it is not implied.
DINE only fires on new and updated resources. Endpoints that predate the assignment are flagged non-compliant but not fixed until you run a remediation task:
# Find the assignment, then remediate existing non-compliant endpoints
ASSIGNMENT_ID=$(az policy assignment show \
  --name "deploy-pe-privatedns" \
  --scope "/providers/Microsoft.Management/managementGroups/mg-landingzones" \
  --query id -o tsv)

az policy remediation create \
  --name "remediate-pe-privatedns-blob" \
  --policy-assignment "$ASSIGNMENT_ID" \
  --definition-reference-id "configurePrivateEndpointBlob" \
  --resource-discovery-mode ReEvaluateCompliance
Once assigned and remediated, the outcome is structural: every blob endpoint created anywhere under that management group, by any team, lands its A record in the one canonical zone. Governance, not goodwill.
Handling the tricky services
The single most common rollout bug is the wrong zone name. Several services use a counterintuitive suffix, a regional zone, or multiple zones per resource. Get it wrong and the zone group cheerfully writes the record into a zone nobody queries.
| Service | Sub-resource (group-id) | Private DNS zone name |
|---|---|---|
| Blob | blob | privatelink.blob.core.windows.net |
| File | file | privatelink.file.core.windows.net |
| Table | table | privatelink.table.core.windows.net |
| Queue | queue | privatelink.queue.core.windows.net |
| Data Lake Gen2 | dfs | privatelink.dfs.core.windows.net |
| Key Vault | vault | privatelink.vaultcore.azure.net |
| Azure SQL DB | sqlServer | privatelink.database.windows.net |
| Cosmos DB (Core/SQL) | Sql | privatelink.documents.azure.com |
| App Service / Functions | sites | privatelink.azurewebsites.net |
Storage sub-resources are separate zones. A storage account is not one private endpoint. Blob, file, table, queue, and dfs are independent sub-resources, each with its own FQDN and its own privatelink zone. If your app uses blob and file, you need both endpoints and both zones linked. A frequent miss: enabling the static-website or data-lake feature introduces the web or dfs endpoint and a zone you never provisioned.
Cosmos DB is partitioned by API and by account. The zone depends on the API: privatelink.documents.azure.com for Core (SQL), with different suffixes for MongoDB, Cassandra, Gremlin, and Table APIs. Critically, Cosmos creates an A record for the account’s global endpoint plus one per region, so a single account in three regions produces multiple records in the zone. The zone group manages all of them; do not try to reason about individual records.
Three services trip up even experienced teams. Key Vault’s zone suffix is vaultcore.azure.net, not vault.azure.net (the public name). Azure Monitor Private Link Scope needs a set of zones together (monitor, oms, ods, agentsvc, plus a blob zone), not a single zone. AKS private clusters embed the region in the zone name (privatelink.<region>.azmk8s.io). When in doubt, Microsoft’s “Azure Private Endpoint DNS configuration” reference is the source of truth; do not guess a suffix.
Sovereign and Government clouds use entirely different suffixes (*.core.usgovcloudapi.net and so on). Derive those from that cloud’s documentation rather than assuming the commercial names.
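One way to keep suffixes from being guessed is to encode the mapping table once and fail loudly on anything outside it. This is a minimal sketch covering only the commercial-cloud rows listed in the table above; `ZONE_BY_GROUP_ID` and `zone_for` are hypothetical names, and a real estate would extend the dictionary from Microsoft's reference rather than from memory.

```python
# Keys are lower-cased group IDs; values are the commercial-cloud zone names
# from the table in this section.
ZONE_BY_GROUP_ID = {
    "blob": "privatelink.blob.core.windows.net",
    "file": "privatelink.file.core.windows.net",
    "table": "privatelink.table.core.windows.net",
    "queue": "privatelink.queue.core.windows.net",
    "dfs": "privatelink.dfs.core.windows.net",
    "vault": "privatelink.vaultcore.azure.net",
    "sqlserver": "privatelink.database.windows.net",
    "sql": "privatelink.documents.azure.com",
    "sites": "privatelink.azurewebsites.net",
}


def zone_for(group_id):
    """Return the privatelink zone for a group ID, or raise instead of guessing."""
    try:
        return ZONE_BY_GROUP_ID[group_id.lower()]
    except KeyError:
        raise LookupError(
            f"no zone mapped for group-id {group_id!r}; check the DNS reference"
        ) from None
```

Raising on an unknown group ID is deliberate: an unmapped sub-resource should stop a pipeline rather than silently fall back to a plausible-looking suffix.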
Cross-tenant and on-prem resolution via conditional forwarding
VNet clients resolve transparently through 168.63.129.16. On-prem and cross-tenant clients do not — that wire-server address is non-routable outside its own VNet. Those clients need an in-Azure resolver to forward to, and the clean answer is the Azure DNS Private Resolver: a managed service with no DNS VMs to patch. Deploy it in the hub with an inbound endpoint (the IP on-prem forwards to) and, if Azure needs to resolve on-prem names, an outbound endpoint. Both require dedicated subnets delegated to Microsoft.Network/dnsResolvers, minimum /28 — reserve that IP space in the hub up front.
The direction teams forget is on-prem into Azure. On your on-prem DNS, create a conditional forwarder for the public PaaS suffix — blob.core.windows.net, not privatelink.blob... — pointing at the resolver’s inbound endpoint IP:
$inbound = "10.10.0.4"   # Private Resolver inbound endpoint IP in the hub

# Forward the PUBLIC suffixes. Key Vault's public name is vault.azure.net,
# so forward it alongside vaultcore.azure.net.
"blob.core.windows.net",
"file.core.windows.net",
"vault.azure.net",
"vaultcore.azure.net",
"database.windows.net" | ForEach-Object {
    Add-DnsServerConditionalForwarderZone `
        -Name $_ `
        -MasterServers $inbound `
        -ReplicationScope "Forest"
}
The mechanism: on-prem asks for stappdata.blob.core.windows.net and forwards it to the inbound endpoint. Inside Azure, 168.63.129.16 returns the public CNAME, follows it into your linked privatelink zone, and returns the private IP. On-prem never references the privatelink name at all — it only ever forwards the public suffix. Traffic to the inbound endpoint rides your existing ExpressRoute private peering or VPN, since the resolver IP is a normal hub private address.
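The first hop of that mechanism is a suffix match on the on-prem DNS server. The sketch below models only that first forwarding decision, assuming the most specific matching forwarder zone wins; `pick_forwarder` and the `"internet"` default label are hypothetical, and it does not model CNAME chasing or caching.

```python
def pick_forwarder(qname, forwarders, default="internet"):
    """Return the upstream for qname: the most specific matching forwarder zone wins.

    forwarders -- dict mapping forwarder zone name -> upstream IP
    default    -- label for normal recursion when no forwarder zone matches
    """
    qname = qname.rstrip(".").lower()
    best = None
    for zone, upstream in forwarders.items():
        zone = zone.rstrip(".").lower()
        if qname == zone or qname.endswith("." + zone):
            if best is None or len(zone) > len(best[0]):
                best = (zone, upstream)
    return best[1] if best else default


forwarders = {
    "blob.core.windows.net": "10.10.0.4",   # Private Resolver inbound endpoint
    "vault.azure.net": "10.10.0.4",
}
```

With these forwarders, a query for `stappdata.blob.core.windows.net` goes to `10.10.0.4`, while an unrelated name such as `example.com` takes the default path.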
The same pattern solves cross-tenant resolution. If a peered VNet lives in another tenant and you cannot link it to your zones via RBAC across the tenant boundary, point that tenant’s resolver (or its custom DNS) at your inbound endpoint as a conditional forwarder for the PaaS suffixes. The resolution still happens in the zone-owning tenant.
Verify
Confirm the chain end to end before declaring victory. From a VM in a spoke VNet:
# Should return the privatelink CNAME, then a PRIVATE (10.x / RFC1918) A record
nslookup stappdata.blob.core.windows.net

# Confirm the canonical zone actually holds the record (run against the hub sub)
az network private-dns record-set a list \
  --resource-group rg-connectivity-dns \
  --zone-name privatelink.blob.core.windows.net \
  --query "[].{name:name, ip:aRecords[0].ipv4Address}" -o table

# Confirm the endpoint's zone group is bound to the central zone
az network private-endpoint dns-zone-group show \
  --resource-group rg-app-payments \
  --endpoint-name pe-stappdata-blob \
  --name default \
  --query "privateDnsZoneConfigs[].privateDnsZoneId" -o tsv
Then check policy posture across the fleet:
# Any non-compliant private endpoints under the management group?
az policy state summarize \
  --management-group mg-landingzones \
  --query "policyAssignments[?contains(policyAssignmentId,'deploy-pe-privatedns')]"
A correct result: nslookup returns a private IP, the record exists in the hub zone, the zone group references the hub zone’s resource ID, and the policy summary shows zero non-compliant endpoints.
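Those four conditions can be folded into one pass/fail check once the command outputs have been collected. This is a sketch under stated assumptions: `verify_endpoint` is a hypothetical helper, its inputs are values you have already parsed from the commands above, and resource IDs are compared case-insensitively.

```python
import ipaddress


def verify_endpoint(resolved_ips, hub_record_ips, zone_group_zone_ids,
                    hub_zone_id, non_compliant_count):
    """Return a list of failed checks; an empty list means the chain is healthy."""
    failures = []

    # 1. The name resolved, and only to private addresses.
    if not resolved_ips or not all(
        ipaddress.ip_address(ip).is_private for ip in resolved_ips
    ):
        failures.append("name did not resolve to private IPs only")

    # 2. Every resolved address is a record in the hub zone.
    if not set(resolved_ips) <= set(hub_record_ips):
        failures.append("resolved IP is missing from the hub zone")

    # 3. The zone group references the hub zone's resource ID.
    if hub_zone_id.lower() not in {z.lower() for z in zone_group_zone_ids}:
        failures.append("zone group is not bound to the hub zone")

    # 4. Policy reports no non-compliant endpoints.
    if non_compliant_count != 0:
        failures.append(f"{non_compliant_count} non-compliant endpoint(s)")

    return failures
```

Returning the list of failures rather than a boolean makes the result usable directly in a pipeline log or an alert body.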
Enterprise scenario
A retail platform team ran the centralized model cleanly for a year, then a payments spoke’s blob endpoints started intermittently resolving to public IPs — but only for some clients in the same VNet. The records were correct in the hub zone, the zone group was bound, and nslookup from a fresh VM returned the private IP. The tell was “some clients”: the affected workloads were AKS pods, not VMs.
The cause was the AKS cluster running CoreDNS with a custom forward block pointing at an on-prem DNS appliance for a legacy internal domain. That appliance had a stale conditional forwarder for blob.core.windows.net sending queries straight to public Azure DNS — so pods bypassed 168.63.129.16 entirely and never touched the linked private zone. Node-level VMs used Azure-provided DNS and resolved fine, which is why it looked random.
The fix was to stop forwarding PaaS suffixes off-cluster. We scoped the CoreDNS forward to only the legacy domain and let everything else hit the default upstream (the VNet’s Azure DNS):
# coredns-custom ConfigMap: forward ONLY the legacy zone, not PaaS suffixes
legacy.server: |
  corp.internal:53 {
      forward . 10.50.0.10  # on-prem appliance, legacy domain only
  }
Lesson for the runbook: centralized zones only work if the client’s resolver path actually reaches 168.63.129.16. Any layer that overrides DNS — CoreDNS, a custom VNet DNS server, a containerized resolver — must forward PaaS suffixes back to Azure DNS, never to an upstream that answers publicly. Audit override points, not just the zones.
Pitfalls and next steps
The failures that recur most often, in order:
- Wrong zone name. The zone group writes a record into a zone nobody queries and resolution silently falls through to public. Always verify the suffix: vaultcore, not vault.
- Missing the cross-subscription RBAC grant. The DINE identity has Network Contributor on the endpoint but no rights on the hub zone, so the deployment fails quietly and the endpoint shows non-compliant with no record. Grant Private DNS Zone Contributor on the hub scope explicitly.
- Forgetting remediation. DINE only fires going forward. Endpoints created before the assignment stay broken until you remediate.
- One direction of hybrid forwarding. Azure-to-on-prem works but on-prem-to-Azure was never configured, so datacenter clients still hit public IPs. Both directions are separate configuration.
- DNS caching during cutover. When you migrate a record or first link a zone, clients and on-prem forwarders cache the old public answer for the TTL. Plan for the TTL window; do not assume an instant flip.
Next, fold the zone-and-link Terraform and the policy initiative into your platform pipeline so a new spoke is one PR, and wire a daily compliance scan that alerts on any private endpoint without a zone group. At that point the architecture maintains itself, and “the app can’t reach storage” stops being a recurring ticket.