AWS Networking

Route 53 Resolver at Scale: Inbound/Outbound Endpoints, Rules, and DNS Firewall

DNS is the control plane nobody budgets for until it breaks at 2 a.m. In a single VPC the Route 53 Resolver “just works” and you never think about it. Across forty accounts with on-premises forwarding, split-horizon zones, and a security mandate to block DNS exfiltration, that same resolver becomes the single most load-bearing — and most misunderstood — piece of your network. This guide builds centralized hybrid DNS the way a platform team actually has to: outbound endpoints forwarding to on-prem over Direct Connect, inbound endpoints letting on-prem resolve your private zones, resolver rules shared to spoke VPCs through AWS RAM, and DNS Firewall enforcing an egress domain policy that fails the way you decided it should — not the way you discovered it does.

The mental model first, because almost every Resolver bug traces back to getting this wrong.

1. The +2 resolver, and what endpoints actually do

Every VPC has a built-in DNS resolver — the Amazon-provided DNS — reachable at the VPC CIDR base address plus two (10.0.0.0/16 -> 10.0.0.2), and also at the link-local 169.254.169.253. This resolver answers queries from instances inside the VPC. It resolves public names, private hosted zones associated with that VPC, and VPC-internal records. It is not addressable from outside the VPC, and it cannot, on its own, forward queries anywhere you tell it to.
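The addressing rule above is easy to compute. This is a minimal sketch using Python's standard ipaddress module; the function name is illustrative and not part of any AWS SDK, and it assumes you pass the VPC's primary CIDR.

```python
import ipaddress

# Link-local address where the Amazon-provided DNS is also reachable.
LINK_LOCAL_RESOLVER = "169.254.169.253"


def vpc_resolver_ip(vpc_cidr: str) -> str:
    """Return the 'base plus two' resolver address for a VPC's primary CIDR."""
    network = ipaddress.ip_network(vpc_cidr, strict=True)
    return str(network.network_address + 2)


if __name__ == "__main__":
    print(vpc_resolver_ip("10.0.0.0/16"))    # 10.0.0.2
    print(vpc_resolver_ip("10.100.0.0/20"))  # 10.100.0.2
```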

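Route 53 Resolver endpoints exist to break those two limitations:

Inbound endpoints. They give resolvers outside the VPC (on-prem, over Direct Connect or VPN) IP addresses inside the VPC to send queries to, so the Resolver becomes reachable from outside.

Outbound endpoints. They give the Resolver a path out, so queries that match a forwarding rule can be sent to DNS servers you specify, such as your on-prem resolvers.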

Each endpoint is a set of elastic network interfaces, one per subnet/AZ you specify, each with an IP. You need at least two IPs per endpoint, and they should sit in two different AZs: the two-IP minimum is enforced, and the AZ spread is what keeps DNS up when one AZ is impaired. Each ENI handles a bounded query rate (budget up to roughly 10,000 queries per second per IP, and less in practice depending on query type, response size, and protocol), so the IP count is also your throughput knob. The interfaces consume IPs from your subnets and show up as Route 53 Resolver-owned ENIs in EC2.
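To turn the per-IP budget into an IP count, a rough sizing calculation helps. The sketch below is an assumption-laden estimate, not an AWS formula: the 50% headroom and the single spare IP are illustrative defaults you would tune against your own measurements.

```python
import math

QPS_PER_IP = 10_000   # per-IP budget quoted above; treat it as an upper bound
MIN_IPS = 2           # an endpoint needs at least two IP addresses


def endpoint_ip_count(peak_qps: int, headroom: float = 0.5, spare_ips: int = 1) -> int:
    """Estimate how many endpoint IPs to provision.

    Peak load must fit within `headroom` of the per-IP budget, and
    `spare_ips` extra IPs are added to absorb the loss of an interface.
    Both knobs are assumptions of this sketch.
    """
    if peak_qps < 0 or not 0 < headroom <= 1:
        raise ValueError("peak_qps must be >= 0 and headroom in (0, 1]")
    needed = math.ceil(peak_qps / (QPS_PER_IP * headroom))
    return max(MIN_IPS, needed + spare_ips)


if __name__ == "__main__":
    print(endpoint_ip_count(3_000))    # 2
    print(endpoint_ip_count(18_000))   # 5
```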

# Outbound endpoint: 2 ENIs across 2 AZs in the networking account
# (--creator-request-id is a required idempotency token; any unique string works)
aws route53resolver create-resolver-endpoint \
  --creator-request-id "egress-fwd-outbound-001" \
  --name "egress-fwd-outbound" \
  --direction OUTBOUND \
  --security-group-ids sg-0outboundresolver \
  --ip-addresses \
    SubnetId=subnet-az1,Ip=10.100.0.10 \
    SubnetId=subnet-az2,Ip=10.100.1.10 \
  --tags Key=team,Value=platform-network

# Inbound endpoint: lets on-prem query our private zones
aws route53resolver create-resolver-endpoint \
  --creator-request-id "ingress-inbound-001" \
  --name "ingress-inbound" \
  --direction INBOUND \
  --security-group-ids sg-0inboundresolver \
  --ip-addresses \
    SubnetId=subnet-az1,Ip=10.100.0.20 \
    SubnetId=subnet-az2,Ip=10.100.1.20

Security groups on Resolver endpoints govern DNS traffic on port 53, both TCP and UDP. Inbound endpoints need ingress 53 from your on-prem resolver CIDRs; outbound endpoints need egress 53 to the on-prem resolver IPs. Forgetting TCP/53 is the classic failure: everything works until a response is too large for UDP (over 512 bytes without EDNS, or over the advertised EDNS buffer size), comes back truncated, and the client retries over TCP, which your security group silently drops.
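A pre-deployment check for the TCP/53 omission can be sketched as below. The rule dictionaries use a simplified, hypothetical shape (not the EC2 API response format), and the check only looks at protocol and port, not at source or destination CIDRs.

```python
# Both transports must be open on port 53 for an endpoint security group.
REQUIRED = {("tcp", 53), ("udp", 53)}


def missing_dns_rules(rules):
    """Return the (protocol, port) pairs from REQUIRED not covered by `rules`.

    Each rule is a dict with 'protocol' ('tcp', 'udp', or '-1' for all
    traffic) and, unless the protocol is '-1', an inclusive
    'from_port'/'to_port' range.
    """
    covered = set()
    for rule in rules:
        for proto, port in REQUIRED:
            if rule["protocol"] == "-1":
                covered.add((proto, port))
            elif rule["protocol"] == proto and rule["from_port"] <= port <= rule["to_port"]:
                covered.add((proto, port))
    return sorted(REQUIRED - covered)


if __name__ == "__main__":
    udp_only = [{"protocol": "udp", "from_port": 53, "to_port": 53}]
    print(missing_dns_rules(udp_only))  # [('tcp', 53)]
```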

2. Centralize resolution in a networking account

Do not scatter endpoints across application accounts. Resolver endpoints are expensive per-ENI-hour and operationally noisy; you want exactly one well-run pair of inbound/outbound endpoints in a central networking account, attached to a hub/inspection VPC that already has connectivity to on-prem (Transit Gateway + Direct Connect, or VPN). Every spoke reaches on-prem DNS through this account, and on-prem reaches every private zone through this account.

The forwarding logic lives in Resolver rules. A FORWARD rule says “for this domain, send queries to these target IPs” — the target IPs being your on-prem DNS servers. A SYSTEM rule explicitly says “for this domain, do NOT forward; resolve it normally” — used to carve exceptions out of a broader forward rule.

# Forward corp.example.com to on-prem DNS via the outbound endpoint
# (--creator-request-id is a required idempotency token; any unique string works)
aws route53resolver create-resolver-rule \
  --creator-request-id "fwd-corp-to-onprem-001" \
  --name "fwd-corp-to-onprem" \
  --rule-type FORWARD \
  --domain-name "corp.example.com" \
  --resolver-endpoint-id rslvr-out-0abc123 \
  --target-ips Ip=10.10.0.53,Port=53 Ip=10.10.1.53,Port=53 \
  --tags Key=scope,Value=hybrid

# Carve out an exception: resolve internal.corp.example.com inside AWS,
# even though corp.example.com forwards to on-prem.
aws route53resolver create-resolver-rule \
  --creator-request-id "system-internal-corp-001" \
  --name "system-internal-corp" \
  --rule-type SYSTEM \
  --domain-name "internal.corp.example.com"

Resolution is most-specific-match-wins: a query for db.internal.corp.example.com matches the internal.corp.example.com SYSTEM rule over the broader corp.example.com FORWARD rule, so it is resolved by AWS (a private hosted zone, presumably) and never leaves. You can also create a rule for the domain “.” (dot), which matches everything. That lets you forward all queries to on-prem if you want on-prem to resolve internet names too, but it is rarely what you want, and it puts your entire DNS dependency on the Direct Connect link.
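The most-specific-match behavior can be sketched as a longest-suffix comparison on DNS labels. This is an illustration of the selection logic only, using the two rules created above; it does not model autodefined rules or how Resolver breaks ties between rules for the same domain, and the function names are made up for the example.

```python
from typing import Optional

# (rule type, domain) pairs mirroring the two rules created above.
RULES = [
    ("FORWARD", "corp.example.com"),
    ("SYSTEM", "internal.corp.example.com"),
]


def _labels(name: str) -> list:
    """Split a DNS name into lowercase labels; '.' yields an empty list."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def select_rule(query_name: str, rules=RULES) -> Optional[tuple]:
    """Return the rule whose domain is the longest label-wise suffix of the query."""
    q = _labels(query_name)
    best, best_len = None, -1
    for rule_type, domain in rules:
        d = _labels(domain)
        if len(d) <= len(q) and q[len(q) - len(d):] == d and len(d) > best_len:
            best, best_len = (rule_type, domain), len(d)
    return best


if __name__ == "__main__":
    print(select_rule("db.internal.corp.example.com"))  # ('SYSTEM', 'internal.corp.example.com')
    print(select_rule("mail.corp.example.com"))         # ('FORWARD', 'corp.example.com')
    print(select_rule("example.org"))                   # None
```

Matching on whole labels rather than raw string suffixes matters: notcorp.example.com must not match a rule for corp.example.com.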

3. Share rules across accounts with AWS RAM

A Resolver rule created in the networking account does nothing for spoke VPCs in other accounts until you share it via AWS Resource Access Manager (RAM) and the consuming account associates it with its VPCs. This is the multi-account mechanism — the spoke account never owns an endpoint; it borrows the forwarding behavior.

# In the networking account: create a RAM share with the rule(s)
aws ram create-resource-share \
  --name "resolver-rules-hybrid" \
  --resource-arns \
    arn:aws:route53resolver:eu-west-1:111111111111:resolver-rule/rslvr-rr-0fwdcorp \
  --principals 222222222222 \
  --tags key=team,value=platform-network

If your org has RAM sharing enabled with AWS Organizations (aws ram enable-sharing-with-aws-organization), you can set --principals to an OU or organization ARN and skip per-account invitations entirely; new accounts in that OU pick up the share automatically. With Organizations sharing on, the share needs no acceptance step. Each spoke VPC still has to be associated with the rule.

In the spoke account, associate the shared rule with each VPC that should honor it:

# In account 222222222222: bind the shared rule to a spoke VPC
aws route53resolver associate-resolver-rule \
  --resolver-rule-id rslvr-rr-0fwdcorp \
  --vpc-id vpc-0spokeapp01 \
  --name "corp-fwd-on-spoke-app01"

From that moment, an instance in vpc-0spokeapp01 querying corp.example.com has its query handled by the networking account’s outbound endpoint, forwarded to on-prem, and answered — even though that VPC has no endpoint of its own. Doing this at scale by hand is a mistake; drive it from Terraform so association is a property of the VPC, not a manual step.

resource "aws_route53_resolver_rule_association" "corp_fwd" {
  for_each         = toset(var.spoke_vpc_ids)
  resolver_rule_id = var.shared_corp_rule_id
  vpc_id           = each.value
}

A given VPC can have at most one rule association per rule, and rules are evaluated by specificity across all associated rules. Keep the rule set small and intentional; an explosion of overlapping forward rules is how you end up debugging which one won.

4. On-prem forwarding and split-horizon

The hard cases in hybrid DNS are the ones where the same name must resolve differently depending on who is asking — split-horizon. Two directions, two mechanisms:

AWS resolves a corporate name (outbound). Covered above: a FORWARD rule on corp.example.com points at on-prem resolver IPs reachable over Direct Connect/VPN. The outbound endpoint ENIs sit in the hub VPC, so the path to 10.10.0.53 rides your existing TGW + DX attachment. No special routing beyond “the hub VPC can reach the on-prem DNS subnet.”

On-prem resolves an AWS private zone (inbound). Configure the on-premises DNS servers with a conditional forwarder: for aws.example.com (the domain of your Route 53 private hosted zone), forward to the inbound endpoint IPs (10.100.0.20, 10.100.1.20). The private hosted zone must be associated with the VPC that hosts the inbound endpoint, or the resolver won’t have the records.

On Windows Server DNS, the conditional forwarder is one command:

# On-prem AD DNS: forward the AWS private zone to the inbound endpoint IPs
Add-DnsServerConditionalForwarderZone `
  -Name "aws.example.com" `
  -MasterServers 10.100.0.20, 10.100.1.20 `
  -ReplicationScope "Forest"

Split-horizon trap: suppose app.example.com exists both on-prem (pointing at a legacy server) and in a Route 53 private hosted zone (pointing at an ALB). An instance in AWS will, by default, resolve it via the private hosted zone, unless a FORWARD rule at least as specific as the hosted zone’s domain is associated with the VPC; in that case the forward rule wins and AWS gets the on-prem answer. You resolve this deliberately with rule specificity: forward the parent zone to on-prem, then add a SYSTEM rule for the specific subdomain you want answered inside AWS (as in Step 2). Decide per-name which horizon wins and encode it; do not leave it to default precedence and hope.

5. DNS Firewall: domain policy on the egress path

Route 53 Resolver DNS Firewall inspects outbound DNS queries from a VPC and acts on the queried domain name before resolution completes. It is your DNS-layer egress control: block known-bad domains, restrict resolution to an allow-list, and detect the long, high-entropy subdomains that signal DNS tunneling/exfiltration.

You build rule groups containing rules; each rule references a domain list and an action (ALLOW, BLOCK, or ALERT). Rules within a group are evaluated by priority (lowest number first), and the first match wins — so ordering is policy. A typical group: high-priority ALLOW for sanctioned domains, then a low-priority catch-all BLOCK.
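The “ordering is policy” point can be made concrete with a small first-match evaluator. This is a sketch under stated assumptions: it models only ALLOW and BLOCK (ALERT is left out), treats a leading *. as matching subdomains but not the apex, and uses a bare * entry as this sketch's own shorthand for a catch-all list. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    priority: int
    action: str      # "ALLOW" or "BLOCK" in this sketch
    domains: tuple   # entries like "example.com" or "*.example.com"


def _matches(entry: str, name: str) -> bool:
    entry, name = entry.lower().rstrip("."), name.lower().rstrip(".")
    if entry == "*":
        return True                       # sketch-only catch-all
    if entry.startswith("*."):
        return name.endswith(entry[1:])   # subdomains only, not the apex
    return name == entry


def evaluate(rules, query_name, default="ALLOW"):
    """Walk rules in ascending priority; the first matching rule decides."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if any(_matches(entry, query_name) for entry in rule.domains):
            return rule.action
    return default


# Allow-list first, catch-all block last (listed out of order on purpose).
SAMPLE_RULES = [
    FirewallRule(100, "BLOCK", ("*",)),
    FirewallRule(10, "ALLOW", ("pypi.org", "*.amazonaws.com")),
]

if __name__ == "__main__":
    for name in ("pypi.org", "s3.amazonaws.com", "evil.example"):
        print(name, evaluate(SAMPLE_RULES, name))
```

Swapping the two priorities would block everything, including the sanctioned domains, which is why the priority numbers deserve the same review as the domain lists.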

# A managed AWS domain list (threat intel) - reference, don't author these
aws route53resolver list-firewall-domain-lists

# Custom block list for known-bad / disallowed domains
aws route53resolver create-firewall-domain-list --name "corp-blocklist"
aws route53resolver import-firewall-domains \
  --firewall-domain-list-id rslvr-fdl-0blocklist \
  --operation REPLACE \
  --domain-file-url s3://net-dns-policy/blocklist.txt

AWS publishes managed domain lists (AWSManagedDomainsMalwareDomainList, AWSManagedDomainsBotnetCommandandControl, AWSManagedDomainsAggregateThreatList, and an AWSManagedDomainsAmazonGuardDutyThreatList fed by GuardDuty) that you reference but cannot edit. Wire them into a rule group at the top of the order, then your custom policy below:

aws route53resolver create-firewall-rule-group --name "egress-dns-policy"

# Priority 10: block AWS-managed malware domains outright
aws route53resolver create-firewall-rule \
  --firewall-rule-group-id rslvr-frg-0egress \
  --firewall-domain-list-id rslvr-fdl-AWSMalware \
  --priority 10 --action BLOCK \
  --block-response NODATA \
  --name "block-managed-malware"

# Priority 20: block our internal blocklist, answering with a sinkhole CNAME
aws route53resolver create-firewall-rule \
  --firewall-rule-group-id rslvr-frg-0egress \
  --firewall-domain-list-id rslvr-fdl-0blocklist \
  --priority 20 --action BLOCK \
  --block-response OVERRIDE \
  --block-override-domain "blocked.corp.example.com" \
  --block-override-dns-type CNAME \
  --block-override-ttl 60 \
  --name "block-corp-list"

BLOCK supports three responses: NODATA (empty answer), NXDOMAIN (name does not exist), or OVERRIDE (return a CNAME to a sinkhole you control, which is useful to redirect users to a “this was blocked” page and to see who’s hitting it). For exfiltration defense, blocking known-bad domains is not enough, because attackers encode data into subdomains of a domain they own. Use a domain list entry with a leading wildcard (e.g. *.suspicious-tunnel.example) to cover every subdomain of a suspect parent, and pair it with query logging + anomaly detection on query length and rate, covered in Step 7. DNS Firewall Advanced rules also offer confidence-thresholded detection of DGA-style and tunneling domains; lean on those rather than hand-rolling entropy regex.

You associate a rule group with VPCs (the association carries its own priority between groups). The rule group itself is RAM-shareable, like Resolver rules, so the security team owns the policy centrally while application accounts merely consume it.

6. Fail-open vs. fail-closed, and group ordering

Two ordering decisions decide whether DNS Firewall protects you or pages you.

Fail mode. Each VPC has a DNS Firewall configuration with a FirewallFailOpen setting (ENABLED or DISABLED). It is set per VPC with update-firewall-config, not on the rule-group association. If DNS Firewall cannot evaluate a query (an internal failure, capacity exhaustion), fail-open (ENABLED) lets the query resolve normally; fail-closed (the default behavior, fail-open DISABLED) blocks it. For most enterprises, fail-open is the right setting: a DNS Firewall hiccup taking down all name resolution in a production VPC is a worse outage than a brief window where filtering is bypassed. Choose fail-closed only for genuinely high-side workloads where leaking a query is worse than the application failing.

# Set fail-open on the VPC's DNS Firewall config (recommended for most workloads)
aws route53resolver update-firewall-config \
  --resource-id vpc-0spokeapp01 \
  --firewall-fail-open ENABLED

# Associate the rule group with the VPC; the fail mode is not set here
aws route53resolver associate-firewall-rule-group \
  --firewall-rule-group-id rslvr-frg-0egress \
  --vpc-id vpc-0spokeapp01 \
  --priority 101 \
  --mutation-protection ENABLED \
  --name "egress-policy-on-app01"

Group priority. When multiple rule groups are associated with one VPC, they evaluate in ascending association priority. Reserve low numbers for the security-team baseline (managed threat lists, org-wide blocks) and higher numbers for app-specific allow/exception groups, so a workload can add exceptions but never reorder itself ahead of the mandatory controls. Set --mutation-protection ENABLED so the org policy’s association can’t be changed or removed from a VPC by accident; because the flag can be switched off by anyone with permission to update the association, restrict that permission in application accounts if the control is mandatory.
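The interaction between group ordering and fail mode can be sketched in a few lines. This is a toy model, not the service's implementation: each rule group is reduced to a function that returns a verdict or None, an evaluation failure is simulated with RuntimeError, and all names are hypothetical.

```python
def resolve_through_firewall(groups, query_name, fail_open):
    """Evaluate rule groups in ascending association priority.

    groups: list of (association_priority, evaluate_fn) pairs. Each
    evaluate_fn returns "ALLOW", "BLOCK", or None when no rule matched,
    and may raise RuntimeError to simulate an evaluation failure.
    """
    try:
        for _, evaluate_fn in sorted(groups, key=lambda g: g[0]):
            verdict = evaluate_fn(query_name)
            if verdict is not None:
                return verdict
        return "ALLOW"  # nothing matched in any group
    except RuntimeError:
        return "ALLOW" if fail_open else "BLOCK"


def baseline(name):
    """Security-team group: block anything under bad.example."""
    return "BLOCK" if name.endswith(".bad.example") else None


def app_exceptions(name):
    """App-team group: tries to allow one name the baseline blocks."""
    return "ALLOW" if name == "c2.bad.example" else None


def broken(name):
    raise RuntimeError("simulated evaluation failure")


if __name__ == "__main__":
    groups = [(200, app_exceptions), (101, baseline)]
    print(resolve_through_firewall(groups, "c2.bad.example", fail_open=True))   # BLOCK
    print(resolve_through_firewall([(101, broken)], "pypi.org", fail_open=True))   # ALLOW
    print(resolve_through_firewall([(101, broken)], "pypi.org", fail_open=False))  # BLOCK
```

Because the baseline sits at the lower priority number, the app group's ALLOW never gets a chance to override it.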

7. Query logging and DNS-based threat detection

You cannot detect exfiltration you can’t see. Resolver query logging captures every DNS query from a VPC — the query name, type, response code, who asked — to CloudWatch Logs, S3, or Kinesis Data Firehose. S3 for cheap long-term retention and Athena; CloudWatch for live alarming; Firehose to fan out to a SIEM.

aws route53resolver create-resolver-query-log-config \
  --name "vpc-dns-logs" \
  --destination-arn arn:aws:logs:eu-west-1:111111111111:log-group:/dns/resolver-queries

aws route53resolver associate-resolver-query-log-config \
  --resolver-query-log-config-id rqlc-0dnslogs \
  --resource-id vpc-0spokeapp01

The query log config is, again, RAM-shareable, so centralize logging policy: every VPC in the org can log to the security account’s destination without per-team effort. The classic exfiltration signal is a burst of unique, long, high-entropy subdomains under one parent. With logs in CloudWatch you can alarm on it directly with Logs Insights / Contributor Insights; in the Logs Insights query language:

fields @timestamp, query_name, srcaddr
| parse query_name /(?<label>[^.]+)\.(?<parent>.+)/
| filter strlen(label) > 40
| stats count(*) as longLabels, count_distinct(label) as uniqueLabels by parent, srcaddr, bin(5m)
| sort longLabels desc

A source generating hundreds of distinct 40+ character labels under a single parent domain in five minutes is tunneling, not browsing. Feed that into a CloudWatch alarm, or land the logs in S3 and run it from Athena across the whole fleet. GuardDuty also analyzes DNS query activity natively for its DNS-based findings (Trojan:EC2/DNSDataExfiltration, C2 callouts such as Backdoor:EC2/C&CActivity.B!DNS), and it does so without Resolver query logging being configured; enabling GuardDuty is the lowest-effort version of this and should be on regardless.
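If the logs land somewhere other than CloudWatch, the same heuristic is easy to run offline. The sketch below assumes records carrying the query_name and srcaddr fields used in the query above, already restricted to one five-minute window; the thresholds and the sample data are illustrative, not tuned values.

```python
from collections import defaultdict


def tunneling_suspects(records, min_label_len=40, min_unique=100):
    """Flag (srcaddr, parent) pairs with many distinct long leftmost labels.

    `records` are dicts with 'query_name' and 'srcaddr'. Returns a dict
    mapping each flagged pair to its count of distinct long labels.
    """
    seen = defaultdict(set)
    for rec in records:
        name = rec["query_name"].rstrip(".")
        label, _, parent = name.partition(".")
        if parent and len(label) > min_label_len:
            seen[(rec["srcaddr"], parent)].add(label)
    return {key: len(labels) for key, labels in seen.items() if len(labels) >= min_unique}


# Synthetic sample: 150 distinct 45-character labels from one host,
# mixed with ordinary lookups from another.
SAMPLE_RECORDS = [
    {"query_name": f"{i:045d}.t.evil.example.", "srcaddr": "10.0.1.5"} for i in range(150)
] + [
    {"query_name": "www.example.com.", "srcaddr": "10.0.1.9"},
    {"query_name": "pypi.org.", "srcaddr": "10.0.1.9"},
]

if __name__ == "__main__":
    print(tunneling_suspects(SAMPLE_RECORDS))
```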

8. Scaling, limits, and cost

Enterprise scenario

A payments platform ran a centralized DNS Firewall rule group across ~60 spoke VPCs, associated by the security team via RAM with mutation protection on. The group’s catch-all was a low-priority BLOCK NXDOMAIN over an allow-list — anything not explicitly sanctioned returned NXDOMAIN. It worked perfectly for months. Then a routine allow-list update job (a Lambda doing import-firewall-domains with --operation REPLACE) was handed a truncated S3 file after an upstream export failed: the allow-list dropped from ~1,400 domains to 12. Within seconds the catch-all started returning NXDOMAIN for package mirrors, container registries, and three internal SaaS dependencies. CI went red fleet-wide; a customer-facing service couldn’t reach its license server.

Two things bounded the blast radius. First, fail-open was ENABLED on every VPC’s firewall config, so when query volume spiked and one rule-group association briefly failed evaluation, those queries resolved instead of compounding the outage. Second, the team had a CloudWatch alarm on BLOCK-action query count from the Firewall metrics; the alarm fired in under two minutes, long before the support queue did.

The fix was procedural, not architectural. They moved the allow-list update from REPLACE to a guarded apply that refuses to shrink the list by more than a threshold, and made the importer validate row count against the previous version before committing:

PREV=$(aws route53resolver list-firewall-domains \
  --firewall-domain-list-id "$LIST_ID" --query 'Domains' --output text | wc -w)
NEW=$(wc -l < ./allowlist.txt)

# Refuse a >20% shrink - almost always a bad/truncated source file
if [ "$NEW" -lt $(( PREV * 80 / 100 )) ]; then
  echo "REFUSING: allow-list would shrink $PREV -> $NEW domains" >&2
  exit 1
fi

aws s3 cp ./allowlist.txt "s3://net-dns-policy/allowlist.txt"
aws route53resolver import-firewall-domains \
  --firewall-domain-list-id "$LIST_ID" \
  --operation REPLACE --domain-file-url "s3://net-dns-policy/allowlist.txt"

The lesson generalizes: an allow-list-first DNS Firewall is a fleet-wide kill switch wearing a security badge. Treat every change to it as a production deploy — validate the input, alarm on the BLOCK rate, and keep fail-open on so a control-plane stumble degrades gracefully instead of taking name resolution with it.

Verify

Confirm the whole path end to end before you call it done.

# 1. Outbound forwarding works from a spoke instance (SSM into it):
dig +short app.corp.example.com
# Expect the on-prem answer; confirm it left via the rule, not a PHZ.

# 2. The spoke VPC actually has the shared rule associated:
aws route53resolver list-resolver-rule-associations \
  --filters Name=VPCId,Values=vpc-0spokeapp01

# 3. Inbound works: from an on-prem host, query a private zone record:
#    dig @10.100.0.20 host.aws.example.com  -> expect the PHZ answer

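# 4. DNS Firewall blocks a known-bad name (from a spoke instance):
dig test-domain-on-blocklist.example
# Check the header status: NOERROR with an empty answer (NODATA), NXDOMAIN,
# or the override CNAME, per your rule action. (+short would hide the status.)

# 5. Firewall associations and fail mode are what you intended:
aws route53resolver list-firewall-rule-group-associations \
  --vpc-id vpc-0spokeapp01
aws route53resolver get-firewall-config \
  --resource-id vpc-0spokeapp01   # shows FirewallFailOpen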

# 6. Query logs are flowing:
aws logs tail /dns/resolver-queries --since 5m

If dig over TCP hangs but UDP works, your endpoint security group is missing TCP/53 — fix it before anything else. If a forwarded query returns a PHZ answer instead of the on-prem one, a more-specific SYSTEM rule or PHZ association is winning; check rule specificity.

Checklist
