DNS is the quiet load-bearing wall of every cloud network. It rarely shows up in an architecture diagram, yet a single misconfigured forwarding zone or an overlapping split-horizon record can take down an entire estate in ways that look like a routing problem, a firewall problem, or “the cluster is down” — anything but DNS. Cloud DNS gives you five primitives — public zones, private zones, forwarding zones, peering zones, and response policies — and the engineering is in composing them into one coherent resolution path across GCP, on-prem, and managed services. This walkthrough builds that path end to end, then shows how to diagnose it when resolution order betrays you.
Step 1: Understand the four zone types (plus response policies)
Cloud DNS zones are not interchangeable. Each visibility/type combination answers a different question, and picking the wrong one is the root cause of most “why won’t this resolve” tickets.
| Zone type | What it does | Authoritative? | Typical use |
|---|---|---|---|
| Public | Serves records to the internet | Yes | External domains, DNSSEC-signed apex |
| Private | Serves records to bound VPCs only | Yes | Internal domains (*.corp.internal) |
| Forwarding | Forwards queries to specific name servers | No | On-prem resolution, conditional forwarding |
| Peering | Delegates resolution to another VPC’s DNS config | No | Centralizing resolution in a hub VPC |
The mental model: public and private zones hold records; forwarding and peering zones redirect the query elsewhere. A response policy sits in front of all of them and can override or block answers before the authoritative lookup ever happens. That ordering — response policy, then alternate name server (if set), then zones in most-specific-suffix order, then the internet default — is the single most important thing to internalize, and we will return to it when diagnosing.
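The "most-specific-suffix" part of that ordering can be sketched in a few lines of bash. This is an illustrative model only: `pick_zone` is a hypothetical helper, not a Cloud DNS tool, and it deliberately ignores response policies and server policies.

```bash
#!/usr/bin/env bash
# Sketch: choose the zone whose DNS suffix is the longest match for a query.
# pick_zone QUERY ZONE_SUFFIX...
# Prints the most specific matching suffix, or "(internet default)" if none match.
pick_zone() {
  local query="${1%.}."          # normalize to exactly one trailing dot
  shift
  local best="" zone
  for zone in "$@"; do
    zone="${zone%.}."
    # A zone matches when the query equals it or ends with ".<zone>".
    if [[ "$query" == "$zone" || "$query" == *".$zone" ]]; then
      # All matching suffixes are nested, so the longest string is the most specific.
      if (( ${#zone} > ${#best} )); then
        best="$zone"
      fi
    fi
  done
  printf '%s\n' "${best:-(internet default)}"
}

# The forwarding zone for ad.example.com. shadows a broader example.com. zone.
pick_zone host01.ad.example.com example.com. ad.example.com. corp.example.internal.
# prints: ad.example.com.

pick_zone www.example.org example.com. ad.example.com.
# prints: (internet default)
```

Running a surprising name through a model like this is a quick way to spot a more specific zone shadowing the one you expected to answer.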
Set defaults once so every command in this guide is shorter:
gcloud config set project HUB_PROJECT_ID
gcloud config set compute/region us-central1
Step 2: Create a private zone and bind it to multiple VPCs
A private zone is authoritative for an internal domain and is only visible to the VPC networks you bind to it. Crucially, one zone can be bound to many networks, which is how you make a single source of truth resolvable across a Shared VPC and any peered or standalone VPCs that need it.
gcloud dns managed-zones create corp-internal \
--description="Authoritative internal records" \
--dns-name="corp.example.internal." \
--visibility=private \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc"
To extend visibility to additional VPCs, update the binding with the full set of networks (the flag replaces the list; it does not append):
gcloud dns managed-zones update corp-internal \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc,projects/HUB_PROJECT_ID/global/networks/data-vpc"
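Because the flag replaces the list, a small helper that merges the existing bindings with the new ones makes it harder to drop a network by accident. A minimal sketch, assuming you already have the current bindings as a comma-separated string (for example copied from `gcloud dns managed-zones describe`); `merge_networks` is a hypothetical name:

```bash
#!/usr/bin/env bash
# merge_networks CURRENT_CSV NEW_URL...
# Prints a de-duplicated, comma-separated list that preserves the existing order.
merge_networks() {
  local current="$1"; shift
  local -a parts=() merged=()
  local item existing seen
  IFS=',' read -r -a parts <<< "$current"
  for item in "${parts[@]}" "$@"; do
    [[ -z "$item" ]] && continue
    seen=0
    for existing in "${merged[@]}"; do
      [[ "$existing" == "$item" ]] && { seen=1; break; }
    done
    (( seen )) || merged+=("$item")
  done
  local IFS=','
  printf '%s\n' "${merged[*]}"
}

current="projects/HUB_PROJECT_ID/global/networks/hub-vpc"
merge_networks "$current" "projects/HUB_PROJECT_ID/global/networks/data-vpc"
```

The printed value is what you would pass to `--networks` on the update command.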
Add records the way you would in any authoritative zone, using a transaction so the change is atomic:
gcloud dns record-sets transaction start --zone=corp-internal
gcloud dns record-sets transaction add 10.10.0.42 \
--name="api.corp.example.internal." --ttl=300 --type=A --zone=corp-internal
gcloud dns record-sets transaction execute --zone=corp-internal
A private zone is only resolvable from VMs (and serverless egress) inside a bound network. If a workload in data-vpc cannot resolve api.corp.example.internal, the first thing to check is whether data-vpc is actually in the zone’s network list, not the record.
The HCL equivalent keeps the binding reviewable and avoids the append/replace footgun entirely:
resource "google_dns_managed_zone" "corp_internal" {
  name        = "corp-internal"
  dns_name    = "corp.example.internal."
  description = "Authoritative internal records"
  visibility  = "private"

  private_visibility_config {
    networks {
      network_url = google_compute_network.hub.id
    }
    networks {
      network_url = google_compute_network.data.id
    }
  }
}
Step 3: Centralize resolution with a peering zone
Multi-network bindings work when you control every VPC. They do not scale when the resolution config (forwarding targets, response policies, dozens of private zones) lives in a hub and you do not want every spoke to re-implement it. DNS peering solves this: a spoke (consumer) VPC creates a peering zone that says “for this DNS suffix, use the hub (producer) VPC’s entire DNS configuration.” The query leaves the spoke, lands in the hub, and is resolved there using the hub’s private zones, forwarding zones, and response policies.
This is a one-way delegation and is distinct from VPC Network Peering — DNS peering does not require the networks to be VPC-peered at the IP layer. It rides on Google’s internal DNS plane.
# Created in the SPOKE project; --target-network is the HUB VPC
gcloud dns managed-zones create peer-to-hub \
--description="Send all internal resolution to the hub" \
--dns-name="corp.example.internal." \
--visibility=private \
--networks="projects/SPOKE_PROJECT_ID/global/networks/spoke-vpc" \
--target-project=HUB_PROJECT_ID \
--target-network=hub-vpc
The identity that creates the peering zone needs roles/dns.peer on the target (hub) project. A common pattern is a single peering zone for the root internal suffix so the hub becomes the resolution authority for everything internal, while spokes keep zero private-zone bookkeeping.
DNS peering is one-way, and its transitivity is limited: the spoke uses the hub’s config, and if the hub itself peers onward to a third VPC, Cloud DNS follows at most that single extra hop and nothing beyond it. Keep your hub the terminal authority for internal names.
Step 4: Outbound forwarding to on-prem resolvers
For names owned by your data center (Active Directory, legacy DNS), you forward queries out of GCP to on-prem name servers. A forwarding zone is authoritative for a suffix only in the sense that it claims the suffix and ships the query to the listed targets.
gcloud dns managed-zones create onprem-forward \
--description="Forward AD domain to on-prem DNS" \
--dns-name="ad.example.com." \
--visibility=private \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc" \
--forwarding-targets="10.100.0.10,10.100.0.11"
Forwarding-target reachability is where this breaks. Cloud DNS has two forwarding modes:
- Standard (the default): Cloud DNS picks the route from the target’s address. RFC 1918 targets are reached through the VPC over your hybrid connectivity (VPN/Interconnect); other targets are reached over the internet. Use this for private IPs reachable over your tunnels.
- Private: always routes through the VPC, whatever the target’s address. You select it with --private-forwarding-targets, which guarantees the lookup traverses your private connectivity rather than the public internet.
# Force forwarding through private connectivity (VPN/Interconnect).
# Run this instead of the standard-mode create above: a zone named
# onprem-forward can only be created once per project.
gcloud dns managed-zones create onprem-forward \
--dns-name="ad.example.com." \
--visibility=private \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc" \
--private-forwarding-targets="10.100.0.10,10.100.0.11"
On-prem firewalls must allow UDP/TCP 53 from the Google DNS forwarding source range 35.199.192.0/19, and your Cloud Router must advertise that range to on-prem so the replies route back. Forgetting the route advertisement is the single most common reason outbound forwarding “works” intermittently — queries leave, replies have nowhere to go.
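When you read on-prem firewall or resolver logs, it helps to confirm quickly whether a source address really belongs to the forwarding range. The following is a minimal pure-bash sketch; `ip_to_int` and `in_cidr` are hypothetical helpers and the sample addresses are made up for illustration.

```bash
#!/usr/bin/env bash
# ip_to_int A.B.C.D -> prints the IPv4 address as a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS='.' read -r a b c d <<< "$1"
  # 10# forces base 10 so an octet like "010" is not parsed as octal.
  echo $(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
}

# in_cidr IP CIDR -> exit status 0 if IP falls inside CIDR, 1 otherwise.
in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

DNS_FWD_RANGE="35.199.192.0/19"   # forwarding source range named in the text

for src in 35.199.200.7 35.199.224.1 10.10.0.42; do
  if in_cidr "$src" "$DNS_FWD_RANGE"; then
    echo "$src: inside $DNS_FWD_RANGE (Cloud DNS forwarding source)"
  else
    echo "$src: outside $DNS_FWD_RANGE"
  fi
done
```

A /19 starting at 35.199.192.0 spans 35.199.192.0 through 35.199.223.255, so only the first sample address is reported as inside the range.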
Step 5: Inbound forwarding so on-prem can resolve GCP names
The reverse direction needs an inbound server policy on the VPC. This allocates an internal entry-point IP in each subnet of the network (taken from the subnet’s primary range) that on-prem resolvers can target as a conditional forwarder for your GCP-managed suffixes.
gcloud dns policies create hub-inbound \
--description="Inbound DNS forwarding entrypoint" \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc" \
--enable-inbound-forwarding
Discover the allocated entrypoint IPs — these are the addresses you hand to the on-prem DNS team:
gcloud compute addresses list \
--filter="purpose=DNS_RESOLVER" \
--format="table(address, region, subnetwork)"
Point on-prem conditional forwarders for corp.example.internal at those IPs, open 53 inbound across the tunnel, and on-prem now resolves your private zones. Combined with Step 4, you have bidirectional hybrid resolution: GCP resolves AD, on-prem resolves cloud-internal.
You can also override the VPC’s default resolver behavior with an --alternative-name-servers policy, which replaces the internal 169.254.169.254 resolution path entirely for that VPC. Reach for it rarely: it bypasses Cloud DNS private zones unless you also set the zones up correctly, and it is a frequent split-horizon culprit (see Pitfalls and next steps).
Step 6: Response policies for overrides and sinkholing
A response policy is a per-VPC firewall for DNS answers, evaluated before zones. It is the right tool for three jobs: overriding a record without owning its zone, sinkholing malicious or unwanted domains, and bypassing a record locally during an incident. Each policy attaches to one or more networks and contains rules keyed by DNS name.
# 1) Create the policy and bind it to the VPC
gcloud dns response-policies create org-rpz \
--description="Org DNS overrides and sinkhole" \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc"
Override a name to point somewhere you control (for example, pin a SaaS hostname to a PSC endpoint IP) by supplying inline local-data:
gcloud dns response-policies rules create pin-vendor \
--response-policy=org-rpz \
--dns-name="files.vendor.example.com." \
--local-data=name="files.vendor.example.com.",type="A",ttl=60,rrdatas="10.10.5.20"
Sinkhole a domain by returning an explicit answer (here, a safe RFC 5737 documentation address) so clients fail fast instead of reaching it:
gcloud dns response-policies rules create block-malware \
--response-policy=org-rpz \
--dns-name="known-bad.example.net." \
--local-data=name="known-bad.example.net.",type="A",ttl=60,rrdatas="192.0.2.1"
The escape hatch matters in incidents: a behavior=bypassResponsePolicy rule for a specific name lets that one query skip the policy and fall through to normal resolution, without deleting the whole policy.
gcloud dns response-policies rules create allow-exception \
--response-policy=org-rpz \
--dns-name="known-bad.example.net." \
--behavior=bypassResponsePolicy
Response policy rules match on the query name, support a * wildcard prefix for an entire subtree, and a more specific rule wins over a less specific one. Because they evaluate before any zone, an accidental wildcard here can blackhole a whole suffix across every bound VPC, so treat them as production-changing config and review them like firewall rules.
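One way to make that review mechanical is a pre-merge lint that flags wildcard rules covering too much of the namespace. The sketch below is an assumption-laden example: `is_broad_wildcard` and the `MIN_LABELS` threshold are hypothetical and not part of Cloud DNS.

```bash
#!/usr/bin/env bash
# MIN_LABELS is a hypothetical review threshold, not a Cloud DNS setting:
# a wildcard must sit under at least this many labels to pass without review.
MIN_LABELS=2

# is_broad_wildcard RULE_DNS_NAME
# Exit status 0 when the rule is a wildcard covering fewer than MIN_LABELS labels.
is_broad_wildcard() {
  local name="${1%.}"                  # drop the trailing dot
  [[ "$name" == "*" ]] && return 0     # "*." would match every name
  [[ "$name" == "*."* ]] || return 1   # not a wildcard rule at all
  local -a labels
  IFS='.' read -r -a labels <<< "${name:2}"   # text after the leading "*."
  (( ${#labels[@]} < MIN_LABELS ))
}

for rule in "*.com." "*.vendor.example.com." "files.vendor.example.com."; do
  if is_broad_wildcard "$rule"; then
    echo "REVIEW  $rule"
  else
    echo "ok      $rule"
  fi
done
```

With the threshold at 2, only `*.com.` is flagged here; tune the number to whatever your own suffix layout considers too broad.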
Step 7: DNSSEC for public zones — signing, rotation, and the DS handoff
DNSSEC applies to public zones, not private ones (there is no untrusted resolver path to spoof inside your VPC). Cloud DNS manages signing for you, but the chain of trust is only complete once the parent zone holds your DS record. That handoff is the step teams forget, leaving a “signed but unvalidated” zone.
# Enable DNSSEC on an existing public zone
gcloud dns managed-zones update example-com-public --dnssec-state=on
Cloud DNS uses a two-key model: a Key Signing Key (KSK) signs the DNSKEY set, a Zone Signing Key (ZSK) signs the records. To complete the chain, fetch the KSK DS record and submit it to your registrar / parent zone:
gcloud dns dns-keys list --zone=example-com-public \
--filter="type=keySigning" \
--format="value(ds_record())"
Key rotation is largely automated — Cloud DNS pre-publishes successor keys so rollovers do not break validation — but KSK rotation requires you to update the DS record at the registrar within the rollover window, because only the parent can vouch for a new KSK. ZSK rollovers are fully transparent. Set the algorithm and key specs explicitly at creation when compliance requires it:
gcloud dns managed-zones create example-com-public \
--dns-name="example.com." \
--description="Public apex, DNSSEC" \
--dnssec-state=on \
--ksk-algorithm=rsasha256 --ksk-key-length=2048 \
--zsk-algorithm=rsasha256 --zsk-key-length=1024
The most common DNSSEC outage is a transfer or registrar change that drops the DS record while signing stays on. Validating resolvers then return SERVFAIL for the entire domain. Before any registrar migration, either turn DNSSEC off, migrate, and re-sign, or pre-stage the matching DS at the new registrar.
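Pre-staging only helps if the DS at the parent really matches the one Cloud DNS reports, and the two are often formatted differently (dig, for instance, may print the digest in space-separated chunks and a different case). A minimal comparison sketch; `normalize_ds` and `ds_matches` are hypothetical helpers, and the DS strings are placeholder values rather than a real key.

```bash
#!/usr/bin/env bash
# normalize_ds "KEYTAG ALGORITHM DIGESTTYPE DIGEST [DIGEST...]"
# Joins digest chunks and lowercases them so two renderings can be compared.
normalize_ds() {
  local -a f
  read -r -a f <<< "$1"
  local digest
  digest="$(printf '%s' "${f[@]:3}" | tr '[:upper:]' '[:lower:]')"
  printf '%s %s %s %s\n' "${f[0]}" "${f[1]}" "${f[2]}" "$digest"
}

# ds_matches DS_A DS_B -> exit status 0 when both describe the same DS record.
ds_matches() {
  [[ "$(normalize_ds "$1")" == "$(normalize_ds "$2")" ]]
}

# Placeholder values for illustration only.
cloud_dns_ds="12345 8 2 ABCDEF0123456789"
parent_ds="12345 8 2 abcdef01 23456789"

if ds_matches "$cloud_dns_ds" "$parent_ds"; then
  echo "DS matches: the chain of trust points at the current KSK"
else
  echo "DS mismatch: fix the parent record before validators see it"
fi
```

In practice the first string would come from the dns-keys output and the second from a DS lookup against the parent zone.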
Step 8: Private Google Access and PSC DNS with custom zones
Workloads without external IPs reach Google APIs via Private Google Access, but only if DNS sends *.googleapis.com to a private VIP. The clean way is a private zone for googleapis.com with a wildcard CNAME to the access endpoint, plus an A record for the endpoint itself.
gcloud dns managed-zones create googleapis-private \
--description="Route Google APIs to private VIP" \
--dns-name="googleapis.com." \
--visibility=private \
--networks="projects/HUB_PROJECT_ID/global/networks/hub-vpc"
gcloud dns record-sets transaction start --zone=googleapis-private
# private.googleapis.com VIP range is 199.36.153.8/30
gcloud dns record-sets transaction add 199.36.153.8 199.36.153.9 199.36.153.10 199.36.153.11 \
--name="private.googleapis.com." --ttl=300 --type=A --zone=googleapis-private
gcloud dns record-sets transaction add "private.googleapis.com." \
--name="*.googleapis.com." --ttl=300 --type=CNAME --zone=googleapis-private
gcloud dns record-sets transaction execute --zone=googleapis-private
Use restricted.googleapis.com (199.36.153.4/30) instead when you enforce VPC Service Controls — it only resolves APIs that support the perimeter. For Private Service Connect endpoints to published services, Cloud DNS auto-creates a private zone for the service’s DNS name when you create the endpoint with a DNS name configured; you can also manage that zone manually if you need custom records. Verify which mechanism is in play before adding overlapping records, or you get two authoritative sources for the same name.
Verify
Resolution behaves differently from inside a VPC than from your laptop, so test from a VM in a bound network.
# From a VM in hub-vpc: private zone resolves to the internal record
dig +short api.corp.example.internal
# Outbound forwarding: on-prem name resolves via the forwarder
dig +short host01.ad.example.com
# Private Google Access: APIs resolve to the private VIP, not a public IP
dig +short storage.googleapis.com # expect 199.36.153.x
# Response policy override is taking effect
dig +short files.vendor.example.com # expect the pinned 10.10.5.20
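The same checks can be turned into pass/fail assertions for a runbook or a startup probe. This is a sketch under stated assumptions: `expect_prefix` and `RESOLVE_CMD` are hypothetical names, and the resolver command is injectable so the function can be exercised without network access.

```bash
#!/usr/bin/env bash
# RESOLVE_CMD is an assumption of this sketch: any command that prints answers
# one per line for a given name. It defaults to "dig +short".
RESOLVE_CMD="${RESOLVE_CMD:-dig +short}"

# expect_prefix NAME PREFIX
# Passes when the last answer line (the final address after any CNAME chain)
# starts with PREFIX.
expect_prefix() {
  local name="$1" want="$2" got
  got="$($RESOLVE_CMD "$name" | tail -n 1)"
  if [[ -n "$got" && "$got" == "$want"* ]]; then
    echo "PASS $name -> $got"
  else
    echo "FAIL $name -> ${got:-no answer} (wanted $want*)"
    return 1
  fi
}

# Example calls, meant to be run from a VM in a bound network:
#   expect_prefix api.corp.example.internal 10.10.0.42
#   expect_prefix storage.googleapis.com 199.36.153.
#   expect_prefix files.vendor.example.com 10.10.5.20
```

Taking the last line matters for the googleapis.com check, because the wildcard CNAME means the first line of output is a hostname rather than an address.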
Inspect the control plane to confirm intent matches reality:
# Every zone, its type, and visibility in one view
gcloud dns managed-zones list \
--format="table(name, dnsName, visibility, peeringConfig.targetNetwork.networkUrl)"
# DNSSEC state should read "on" (the DS record itself comes from dns-keys list, as in Step 7)
gcloud dns managed-zones describe example-com-public \
--format="value(dnssecConfig.state)"
# Inbound forwarding entrypoint IPs handed to on-prem
gcloud compute addresses list --filter="purpose=DNS_RESOLVER"
For on-prem-to-cloud, run nslookup api.corp.example.internal <inbound-entrypoint-ip> from a data-center host to prove the inbound path before flipping conditional forwarders for real users.
Enterprise scenario
A retail platform team ran a hub-and-spoke topology: one hub VPC with the authoritative corp.example.internal private zone and an ad.example.com forwarding zone to on-prem domain controllers, with every spoke using a DNS peering zone back to the hub. It worked for months. Then a new GKE-heavy spoke started reporting that pods could resolve internal services fine but intermittently failed to resolve their own on-prem AD-joined dependencies with SERVFAIL, roughly one query in five.
The constraint: the on-prem team had given Cloud DNS two domain-controller IPs as standard forwarding targets, and those DCs were reachable over an HA VPN whose Cloud Router advertised the VPC subnets but not the Google DNS forwarding source range 35.199.192.0/19. One of the two on-prem DCs sat behind an asymmetric path where replies to that source range were silently dropped; the other DC returned answers. Cloud DNS spread queries across both targets, so failures tracked which DC was selected, not the workload, which is exactly why it looked random and dodged every “it’s the cluster” hypothesis.
The fix had two parts. First, advertise the forwarding source range from the Cloud Router so replies route deterministically:
gcloud compute routers update-bgp-peer hub-router \
--peer-name=onprem-peer \
--region=us-central1 \
--advertisement-mode=custom \
--set-advertisement-ranges=10.10.0.0/16,35.199.192.0/19
Second, they switched the forwarding zone to private forwarding so the lookup was pinned to the VPC’s private connectivity instead of risking a public egress attempt, and opened UDP/TCP 53 from 35.199.192.0/19 on the on-prem firewall fronting both DCs. SERVFAILs went to zero. The lesson the team wrote into their runbook: outbound DNS forwarding is only as reliable as the return path for 35.199.192.0/19, and “intermittent” almost always means “one of N targets has a broken route,” not a flaky resolver.
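The runbook lesson can be checked with a per-target tally: if failures concentrate on one target, the problem is that target's path rather than the resolver. The sketch below assumes a hypothetical two-column input (target IP, response code), for instance something you derive from DNS query logs; the sample rows are invented for illustration.

```bash
#!/usr/bin/env bash
# tally_by_target: reads "TARGET_IP RESPONSE_CODE" lines on stdin and prints
# failed/total per target, sorted by target.
tally_by_target() {
  awk '
    { total[$1]++; if ($2 != "NOERROR") failed[$1]++ }
    END {
      for (t in total)
        printf "%s %d/%d failed\n", t, failed[t] + 0, total[t]
    }
  ' | sort
}

# Invented sample rows: every failure lands on the same target.
tally_by_target <<'EOF'
10.100.0.10 NOERROR
10.100.0.11 SERVFAIL
10.100.0.10 NOERROR
10.100.0.11 SERVFAIL
10.100.0.10 NOERROR
EOF
```

A lopsided result like this one points at a broken return route for a single target, which is the pattern the team eventually found.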
Pitfalls and next steps
The failure modes cluster around resolution order and return paths. Internalize the order — response policy, then alternate name servers, then private/forwarding/peering zones by most-specific suffix, then the internet default — because nearly every “split-horizon” mystery is a more-specific zone or a response policy quietly shadowing the answer you expected. An --alternative-name-servers policy that bypasses Cloud DNS is the classic trap: it overrides private zones for the whole VPC, so workloads stop resolving internal names that “obviously” exist.
From here, turn on DNS query logging via a server policy to make resolution observable, codify all zones and policies in Terraform so the network/return-path coupling is visible in review, and put guardrails on response policies so a stray wildcard cannot blackhole a suffix across every bound VPC. Get those three in place and Cloud DNS stops being the thing you blame last and becomes the thing you can actually reason about.