A global front door has two jobs: stay up when an origin or a Region goes dark, and absorb hostile traffic before it ever reaches compute you pay for. CloudFront, Route 53, and AWS WAF do both, but only if you wire them together deliberately. The common failure mode is treating CloudFront as a dumb cache in front of one origin, pointing a CNAME at it, and bolting on the AWS-managed WAF rules with a click. That gives you a CDN, not an edge architecture. This walks through the layers that actually deliver resilience — Route 53 health-checked failover, CloudFront origin groups, Origin Shield, origin lock-down with OAC and signed headers, WAF with rate limiting and bot control, and the observability to prove any of it works.
A note on where each control lives, because the layering matters. Route 53 decides which hostname resolves to what — it is DNS, it operates before a connection is even opened, and its failover is health-check driven. CloudFront decides which origin a request is served from once the client has already connected to the edge — its failover is per-request and error-driven. They are complementary, not redundant: Route 53 moves you between front doors (or between CloudFront and a backup), CloudFront moves you between origins behind a single front door.
1. Route 53 routing policies and health checks
Route 53 routing policy is chosen per record set, and the policy decides which answer a resolver gets. Four matter for an edge design:
| Policy | Decides by | Typical use |
|---|---|---|
| Latency | Lowest network latency from resolver to Region | Multi-Region active-active where each Region has its own stack |
| Weighted | Operator-assigned weights | Canary / blue-green at DNS, A/B Region splits |
| Geolocation | Resolver’s continent/country | Data residency, localized content, sanctions blocking |
| Failover | Health-check state of the primary | Active-passive DR with a hot or warm standby |
The mistake people make is conflating latency routing with failover. Latency records will route away from a Region only when AWS’s latency data changes, not when your application is broken — a Region with a healthy network path but a 503-ing app still wins the latency race. If you want traffic to leave on application failure, you need health checks, and you attach them to the records.
Start with a health check that probes a real, cheap endpoint that exercises the dependency chain (not just GET / returning static HTML).
# Health check that probes a deep health endpoint over HTTPS with SNI
aws route53 create-health-check \
--caller-reference "primary-app-$(date +%s)" \
--health-check-config '{
"Type": "HTTPS",
"FullyQualifiedDomainName": "origin-primary.us-east-1.internal.example.com",
"Port": 443,
"ResourcePath": "/healthz/deep",
"RequestInterval": 30,
"FailureThreshold": 3,
"MeasureLatency": true,
"EnableSNI": true
}'
FailureThreshold: 3 with a 30-second interval means a hard origin failure takes up to ~90 seconds of probe time to flip the record, plus the record’s TTL on the resolver side. Keep failover-record TTLs low (60 seconds is the conventional floor for these) so resolvers re-query promptly; alias records, such as the CloudFront aliases used below, take no TTL of their own and inherit the target’s. Set RequestInterval: 10 for faster detection if you accept the higher per-check cost.
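The timing arithmetic above can be sketched as a small helper. This is a rough upper-bound model, not Route 53's actual algorithm: it assumes the failure starts right after a successful probe and that resolvers honor the TTL exactly. The function name is made up for illustration.

```javascript
// Rough worst-case seconds before clients stop being sent to a failed primary.
// Simplified model: consecutive failed probes, then cached DNS answers expiring.
function worstCaseFailoverSeconds(requestInterval, failureThreshold, recordTtl) {
  const detection = requestInterval * failureThreshold; // probes needed to mark unhealthy
  return detection + recordTtl;                         // plus resolver-side caching
}

console.log(worstCaseFailoverSeconds(30, 3, 60)); // 150
console.log(worstCaseFailoverSeconds(10, 3, 60)); // 90
```

Dropping the interval from 30 to 10 seconds removes a minute from the bound in this model, which is the trade-off against the higher per-check cost.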
For active-passive, define a primary and secondary record in the same name with Failover set, both referencing the health check on the primary:
aws route53 change-resource-record-sets \
--hosted-zone-id Z123EXAMPLE \
--change-batch '{
"Changes": [
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z2FDTNDATAQYW2",
"DNSName": "d111111abcdef8.cloudfront.net",
"EvaluateTargetHealth": false
},
"HealthCheckId": "abcd1234-primary-hc"
}
},
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z2FDTNDATAQYW2",
"DNSName": "d222222ghijkl9.cloudfront.net",
"EvaluateTargetHealth": false
}
}
}
]
}'
Z2FDTNDATAQYW2 is the fixed hosted-zone ID for all CloudFront alias targets; it is the same in every account and never changes. Do not invent one. For ALB/NLB aliases the zone ID is Region-specific; look it up rather than hardcoding.
A subtle and important point: EvaluateTargetHealth is set to false for CloudFront alias targets because CloudFront is a global, always-resolvable service — Route 53 cannot meaningfully health-check it, so you drive failover from your own health check on the origin instead.
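As a mental model of how the failover pair answers queries, here is a minimal sketch. The record and health-map shapes are hypothetical, and the fallback to the primary when nothing is healthy is a simplifying assumption of this model rather than a statement of Route 53's full evaluation rules.

```javascript
// Simplified model of failover routing for one record name.
// records: [{ failover: "PRIMARY" | "SECONDARY", target, healthCheckId? }]
// healthy: { [healthCheckId]: boolean }  (hypothetical input shape)
function resolveFailover(records, healthy) {
  const primary = records.find(r => r.failover === "PRIMARY");
  const secondary = records.find(r => r.failover === "SECONDARY");
  // A record with no health check attached is treated as healthy.
  const isHealthy = r => r.healthCheckId === undefined || healthy[r.healthCheckId] === true;
  if (primary && isHealthy(primary)) return primary.target;
  if (secondary && isHealthy(secondary)) return secondary.target;
  // Assumption in this sketch: with nothing healthy, answer with the primary.
  return primary ? primary.target : null;
}

const records = [
  { failover: "PRIMARY", target: "d111111abcdef8.cloudfront.net", healthCheckId: "abcd1234-primary-hc" },
  { failover: "SECONDARY", target: "d222222ghijkl9.cloudfront.net" },
];
console.log(resolveFailover(records, { "abcd1234-primary-hc": true }));  // d111111abcdef8.cloudfront.net
console.log(resolveFailover(records, { "abcd1234-primary-hc": false })); // d222222ghijkl9.cloudfront.net
```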
2. CloudFront distributions: behaviors, cache and origin request policies
A distribution is a set of behaviors, each a path pattern mapped to an origin plus a cache policy and an origin request policy. Get the split between those two policies right, because it is the single biggest lever on cache hit ratio.
- Cache policy controls the cache key and TTLs — which headers, cookies, and query strings make two requests “the same” object. Every field you add here fragments the cache.
- Origin request policy controls what gets forwarded to the origin without becoming part of the cache key. This is where you put things the origin needs to see but that must not split the cache (e.g. User-Agent for logging).
Use the managed policies where they fit; they are maintained by AWS and cover the common cases.
# Reference managed policies by their well-known IDs
# CachingOptimized: 658327ea-f89d-4fab-a63d-7e88639e58f6
# CachingDisabled: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad
# AllViewerExceptHostHeader (ORP): b689b0a8-53d0-40ab-baf2-68738e2966ac
aws cloudfront create-cache-policy \
--cache-policy-config '{
"Name": "api-cache-key",
"DefaultTTL": 0,
"MaxTTL": 31536000,
"MinTTL": 0,
"ParametersInCacheKeyAndForwardedToOrigin": {
"EnableAcceptEncodingGzip": true,
"EnableAcceptEncodingBrotli": true,
"HeadersConfig": { "HeaderBehavior": "whitelist", "Headers": { "Quantity": 1, "Items": ["Authorization"] } },
"CookiesConfig": { "CookieBehavior": "none" },
"QueryStringsConfig": { "QueryStringBehavior": "whitelist", "QueryStrings": { "Quantity": 2, "Items": ["page", "limit"] } }
}
}'
The defaults to internalize: a path serving immutable static assets wants CachingOptimized and a long MaxTTL. An authenticated API wants Authorization in the key (so user A’s response is never served to user B) and a short or zero default TTL. Never forward all headers or all cookies — that is a near-100% cache miss configuration that turns CloudFront into an expensive reverse proxy.
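To see why each field in the cache policy fragments the cache, it can help to model how a cache key might be derived. This is an illustrative sketch, not CloudFront's internal key format; the request and policy shapes are hypothetical, and header names in the request are assumed to be lowercase already.

```javascript
// Build a cache key from only the whitelisted query strings and headers.
function cacheKey(request, policy) {
  const parts = [request.path];
  for (const name of [...policy.queryStrings].sort()) {
    if (name in request.query) parts.push(`q:${name}=${request.query[name]}`);
  }
  for (const name of policy.headers.map(h => h.toLowerCase()).sort()) {
    if (name in request.headers) parts.push(`h:${name}=${request.headers[name]}`);
  }
  return parts.join("|");
}

const policy = { queryStrings: ["page", "limit"], headers: ["Authorization"] };
const a = { path: "/items", query: { page: "1", utm_source: "mail" },
            headers: { authorization: "Bearer A", "user-agent": "curl" } };
const b = { path: "/items", query: { page: "1", utm_source: "ads" },
            headers: { authorization: "Bearer A", "user-agent": "Firefox" } };
const c = { path: "/items", query: { page: "1" }, headers: { authorization: "Bearer B" } };

console.log(cacheKey(a, policy) === cacheKey(b, policy)); // true: utm_source and User-Agent are not in the key
console.log(cacheKey(a, policy) === cacheKey(c, policy)); // false: different Authorization value
```

Adding utm_source or User-Agent to the policy would make requests a and b distinct objects, which is the fragmentation the section warns about.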
3. Origin groups and error-based failover
Route 53 fails you over between front doors; an origin group fails you over between origins behind one distribution, per request, based on HTTP status. This is the layer that survives a single-Region origin outage without any DNS propagation delay.
You define two origins, then an origin group that lists primary and secondary plus the status codes that trigger failover.
aws cloudfront create-distribution --distribution-config '{
"CallerReference": "edge-2026-06",
"Comment": "Global front door with origin failover",
"Enabled": true,
"Origins": {
"Quantity": 2,
"Items": [
{ "Id": "origin-primary", "DomainName": "alb-primary.us-east-1.elb.amazonaws.com",
"CustomOriginConfig": { "HTTPPort": 80, "HTTPSPort": 443, "OriginProtocolPolicy": "https-only",
"OriginSslProtocols": { "Quantity": 1, "Items": ["TLSv1.2"] } } },
{ "Id": "origin-secondary","DomainName": "alb-secondary.us-west-2.elb.amazonaws.com",
"CustomOriginConfig": { "HTTPPort": 80, "HTTPSPort": 443, "OriginProtocolPolicy": "https-only",
"OriginSslProtocols": { "Quantity": 1, "Items": ["TLSv1.2"] } } }
]
},
"OriginGroups": {
"Quantity": 1,
"Items": [{
"Id": "og-app",
"FailoverCriteria": { "StatusCodes": { "Quantity": 4, "Items": [500, 502, 503, 504] } },
"Members": { "Quantity": 2, "Items": [
{ "OriginId": "origin-primary" }, { "OriginId": "origin-secondary" } ] }
}]
},
"DefaultCacheBehavior": {
"TargetOriginId": "og-app",
"ViewerProtocolPolicy": "redirect-to-https",
"CachePolicyId": "658327ea-f89d-4fab-a63d-7e88639e58f6",
"Compress": true
},
"DefaultRootObject": "index.html"
}'
Two constraints that trip people up:
- The DefaultCacheBehavior (and any behavior) targets the origin group ID, not an origin ID. If you target an origin directly, failover never happens.
- Origin-group failover triggers only on the listed status codes or on a connection-level error (timeout, can’t connect). With the 5xx-only criteria above it does not trigger on 4xx: a 403 from the primary is treated as a legitimate answer and is returned to the client, not retried against the secondary. (CloudFront does accept certain 4xx codes, such as 403 and 404, in FailoverCriteria if you choose to list them.) Only GET, HEAD, and OPTIONS requests fail over; a failed POST is not silently replayed against the secondary, which is the correct behavior for non-idempotent writes.
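The two constraints reduce to a small decision function. The sketch below is a model of the rules as described here, with hypothetical names and input shapes; it is not CloudFront code.

```javascript
// Methods that are eligible for origin-group failover.
const FAILOVER_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// primaryResult: { status?: number, connectionError?: boolean }  (hypothetical shape)
function shouldFailover(method, primaryResult, failoverStatusCodes) {
  if (!FAILOVER_METHODS.has(method.toUpperCase())) return false; // writes are never replayed
  if (primaryResult.connectionError) return true;                // timeout or cannot connect
  return failoverStatusCodes.includes(primaryResult.status);     // only the listed codes
}

const criteria = [500, 502, 503, 504];
console.log(shouldFailover("GET", { status: 503 }, criteria));            // true
console.log(shouldFailover("GET", { status: 403 }, criteria));            // false
console.log(shouldFailover("POST", { status: 503 }, criteria));           // false
console.log(shouldFailover("GET", { connectionError: true }, criteria));  // true
```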
4. Origin Shield and cache hit-ratio optimization
CloudFront has two cache layers: the ~600+ edge locations and a smaller set of regional edge caches. A miss at the edge normally goes to a regional cache, and a miss there goes to the origin. Origin Shield adds a third, designated regional layer that all edge locations route through for a given origin, so the many regional caches collapse into one shield. The effect on a global workload is fewer distinct cache nodes hitting your origin, which means higher offload and lower origin load — especially valuable when traffic is spread thin across many Regions and each regional cache would otherwise miss independently.
# Origin Shield is set per-origin; pick the Region closest to the origin
aws cloudfront update-distribution --id E1EXAMPLE --if-match ETAG --distribution-config '{
"...": "full config required on update",
"Origins": { "Quantity": 1, "Items": [{
"Id": "origin-primary",
"DomainName": "alb-primary.us-east-1.elb.amazonaws.com",
"OriginShield": { "Enabled": true, "OriginShieldRegion": "us-east-1" },
"CustomOriginConfig": { "HTTPPort": 80, "HTTPSPort": 443, "OriginProtocolPolicy": "https-only",
"OriginSslProtocols": { "Quantity": 1, "Items": ["TLSv1.2"] } }
}]}
}'
Set OriginShieldRegion to the Region hosting (or nearest to) that origin — shield traffic should not take a transcontinental hop to reach the origin. Origin Shield is most worth it for low-to-moderate cache-hit content with global viewers, or origins that are expensive to hit (databases, dynamic renders). For a single-Region origin serving already-high-hit static content, the incremental offload may not justify the per-request shield cost; measure before assuming.
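A toy model makes the offload argument concrete. It counts origin fetches for a single cold object, assuming each cache keeps the object once it has it; the function name and the traffic numbers are invented for illustration and say nothing about real hit ratios.

```javascript
// requestsByRegionalCache: request counts for one object, one entry per regional cache.
function originFetches(requestsByRegionalCache, originShieldEnabled) {
  const cachesWithTraffic = requestsByRegionalCache.filter(n => n > 0).length;
  if (cachesWithTraffic === 0) return 0;
  // Without a shield every regional cache misses independently;
  // with a shield they all collapse onto one upstream cache.
  return originShieldEnabled ? 1 : cachesWithTraffic;
}

const thinGlobalTraffic = [3, 1, 0, 2, 1, 4]; // hypothetical counts per regional cache
console.log(originFetches(thinGlobalTraffic, false)); // 5
console.log(originFetches(thinGlobalTraffic, true));  // 1
```

The gap shrinks as traffic concentrates in fewer regional caches, which is why measuring before enabling the shield is worthwhile.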
5. Securing origins: OAC, custom headers, edge functions
An origin that anyone can reach directly defeats every edge control above — attackers simply bypass CloudFront and WAF and hit the ALB or bucket. Two patterns lock this down.
For S3 origins, use Origin Access Control (OAC). OAC is the SigV4-signing successor to the legacy Origin Access Identity; it supports SSE-KMS and all Regions, and OAI should not be used for new builds.
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "AllowCloudFrontServicePrincipalReadOnly",
"Effect": "Allow",
"Principal": { "Service": "cloudfront.amazonaws.com" },
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-edge-bucket/*",
"Condition": {
"StringEquals": { "AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/E1EXAMPLE" }
}
}]
}
The AWS:SourceArn condition is what scopes the grant to your distribution — without it, any CloudFront distribution in any account could read the bucket. Pair this with Block Public Access on, so the bucket is reachable only through the signed CloudFront path.
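One way to catch a missing scope condition before it ships is a small policy lint. The sketch below is a hypothetical helper, not an AWS tool: it only looks for an aws:SourceArn key under any condition operator and does not validate the ARN itself.

```javascript
// True if the statement has an aws:SourceArn key under any condition operator.
function hasSourceArnCondition(stmt) {
  const condition = stmt.Condition || {};
  return Object.values(condition).some(keys =>
    Object.keys(keys).some(k => k.toLowerCase() === "aws:sourcearn"));
}

// Returns Allow statements for the CloudFront service principal that lack that scoping.
function unscopedCloudFrontGrants(policy) {
  const statements = [].concat(policy.Statement || []);
  return statements.filter(stmt => {
    const services = [].concat((stmt.Principal || {}).Service || []);
    const grantsCloudFront = stmt.Effect === "Allow" && services.includes("cloudfront.amazonaws.com");
    return grantsCloudFront && !hasSourceArnCondition(stmt);
  });
}

const scoped = { Statement: [{ Sid: "AllowCloudFrontServicePrincipalReadOnly", Effect: "Allow",
  Principal: { Service: "cloudfront.amazonaws.com" }, Action: "s3:GetObject",
  Resource: "arn:aws:s3:::my-edge-bucket/*",
  Condition: { StringEquals: { "AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/E1EXAMPLE" } } }] };
const unscoped = { Statement: [{ Sid: "TooBroad", Effect: "Allow",
  Principal: { Service: "cloudfront.amazonaws.com" }, Action: "s3:GetObject",
  Resource: "arn:aws:s3:::my-edge-bucket/*" }] };

console.log(unscopedCloudFrontGrants(scoped).length);              // 0
console.log(unscopedCloudFrontGrants(unscoped).map(s => s.Sid));   // [ 'TooBroad' ]
```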
For custom origins (ALB/EC2), inject a shared secret header at CloudFront and require it at the origin. CloudFront adds a custom header to every origin request; the ALB listener rule (or a WAF rule on the ALB) rejects requests lacking it.
# Add a secret header on the origin; store the value in Secrets Manager and reference it
aws cloudfront create-distribution --distribution-config '{
"...": "...",
"Origins": { "Quantity": 1, "Items": [{
"Id": "origin-primary",
"DomainName": "alb-primary.us-east-1.elb.amazonaws.com",
"CustomHeaders": { "Quantity": 1, "Items": [
{ "HeaderName": "X-Origin-Verify", "HeaderValue": "REPLACE_WITH_SECRET" } ] },
"CustomOriginConfig": { "HTTPPort": 80, "HTTPSPort": 443, "OriginProtocolPolicy": "https-only",
"OriginSslProtocols": { "Quantity": 1, "Items": ["TLSv1.2"] } }
}]}
}'
Rotate the header value on a schedule and have the ALB accept both old and new during the overlap window.
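If the header is also checked in the application behind the ALB, the overlap window amounts to accepting a short list of values. This is a minimal sketch with hypothetical names and placeholder secrets; a production check would load the values from a secret store and might prefer a constant-time comparison.

```javascript
// Hypothetical origin-side check; the header name matches the CloudFront origin config.
function isFromCloudFront(headers, acceptedSecrets) {
  const presented = headers["x-origin-verify"];
  if (typeof presented !== "string" || presented === "") return false;
  return acceptedSecrets.includes(presented);
}

// During rotation both values are accepted; after the overlap, the old one is dropped.
const duringOverlap = ["old-secret-value", "new-secret-value"];
const afterOverlap = ["new-secret-value"];

console.log(isFromCloudFront({ "x-origin-verify": "old-secret-value" }, duringOverlap)); // true
console.log(isFromCloudFront({ "x-origin-verify": "old-secret-value" }, afterOverlap));  // false
console.log(isFromCloudFront({}, duringOverlap));                                        // false
```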
CloudFront Functions vs Lambda@Edge — pick by the job:
| | CloudFront Functions | Lambda@Edge |
|---|---|---|
| Runtime | Lightweight JS, sub-millisecond | Node/Python, up to seconds |
| Triggers | Viewer request/response only | All four (viewer + origin, request + response) |
| Use for | Header manipulation, URL rewrites, redirects, simple auth/token checks | Heavy logic, network/SDK calls, body manipulation, A/B at origin |
A canonical viewer-request CloudFront Function, which strips a header that clients should never be allowed to set:
function handler(event) {
  var request = event.request;
  var headers = request.headers;
  // Drop any client-supplied copy of the origin secret header;
  // only the CloudFront origin configuration should ever set it
  if (headers['x-origin-verify']) {
    delete headers['x-origin-verify']; // clients must never spoof the origin secret
  }
  return request;
}
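CloudFront Functions can also run on the viewer-response event, which is one place to enforce security headers on the way out. The sketch below follows the documented event shape for response headers (lowercase name mapped to an object with a value field); the specific header values are illustrative choices, and a response headers policy may be a simpler alternative for the same job.

```javascript
// Viewer-response function: add security headers to every response.
function handler(event) {
  var response = event.response;
  var headers = response.headers;
  // Header names are lowercase keys; each value is wrapped in { value: ... }.
  headers['strict-transport-security'] = { value: 'max-age=63072000; includeSubDomains' };
  headers['x-content-type-options'] = { value: 'nosniff' };
  return response;
}
```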
6. AWS WAF at the edge: managed rules, rate limiting, bot control
WAF attaches to a CloudFront distribution as a web ACL with scope CLOUDFRONT, which means the Web ACL must be created in us-east-1 regardless of where your origins live. Build the ACL from AWS managed rule groups plus your own rate-based and custom rules, ordered by priority (lower number evaluates first).
aws wafv2 create-web-acl \
--name edge-frontdoor-acl \
--scope CLOUDFRONT \
--region us-east-1 \
--default-action '{"Allow":{}}' \
--visibility-config '{"SampledRequestsEnabled":true,"CloudWatchMetricsEnabled":true,"MetricName":"edgeAcl"}' \
--rules '[
{
"Name": "AWSCommonRules",
"Priority": 1,
"OverrideAction": { "None": {} },
"Statement": { "ManagedRuleGroupStatement": {
"VendorName": "AWS", "Name": "AWSManagedRulesCommonRuleSet" } },
"VisibilityConfig": { "SampledRequestsEnabled": true, "CloudWatchMetricsEnabled": true, "MetricName": "commonRules" }
},
{
"Name": "KnownBadInputs",
"Priority": 2,
"OverrideAction": { "None": {} },
"Statement": { "ManagedRuleGroupStatement": {
"VendorName": "AWS", "Name": "AWSManagedRulesKnownBadInputsRuleSet" } },
"VisibilityConfig": { "SampledRequestsEnabled": true, "CloudWatchMetricsEnabled": true, "MetricName": "badInputs" }
},
{
"Name": "RateLimitPerIP",
"Priority": 10,
"Action": { "Block": {} },
"Statement": { "RateBasedStatement": {
"Limit": 2000, "AggregateKeyType": "IP" } },
"VisibilityConfig": { "SampledRequestsEnabled": true, "CloudWatchMetricsEnabled": true, "MetricName": "rateLimit" }
}
]'
Three things to get right:
- Managed groups use OverrideAction, not Action. A managed rule group has its own internal actions; you can override the whole group to Count (observe without blocking) during rollout. Rate-based and custom statements use Action.
- Rate-based limits are evaluated over a rolling window (five minutes by default). Limit: 2000 with AggregateKeyType: IP blocks an IP exceeding ~2000 requests in the evaluation window. Use FORWARDED_IP if you must rate-limit on a header instead of the connection IP, but only when you trust that header’s provenance.
- Always roll out new managed groups in Count mode first. The Common Rule Set is broad and will false-positive on legitimate traffic (file uploads, rich JSON bodies, certain query patterns). Watch the sampled requests and metrics for a few days, exclude the specific rules that misfire, then flip to block.
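The rolling-window behavior of a rate-based rule can be approximated with a sliding-window counter. This is a conceptual sketch only: real WAF evaluates rates periodically and blocking lags detection, and the function and parameter names here are hypothetical.

```javascript
// Per-IP sliding-window limiter: blocks once an IP exceeds `limit` requests in the window.
function makeRateLimiter(limit, windowSeconds) {
  const seen = new Map(); // ip -> timestamps (seconds) of recent requests
  return function isBlocked(ip, nowSeconds) {
    const recent = (seen.get(ip) || []).filter(t => t > nowSeconds - windowSeconds);
    recent.push(nowSeconds);
    seen.set(ip, recent);
    return recent.length > limit;
  };
}

const isBlocked = makeRateLimiter(3, 300); // tiny limit so the example is easy to follow
console.log([0, 1, 2, 3].map(t => isBlocked("203.0.113.7", t))); // [ false, false, false, true ]
console.log(isBlocked("203.0.113.7", 400));                      // false: earlier requests aged out
console.log(isBlocked("198.51.100.9", 3));                       // false: counted per IP
```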
For Bot Control, add AWSManagedRulesBotControlRuleSet — it labels and can block automated traffic, with a Targeted inspection level that defends against more sophisticated bots. It carries additional cost and inspects more of each request, so scope it to the paths that need it (login, checkout, scraping-sensitive endpoints) rather than the whole site, and run it in Count mode first to size the impact.
Finally, associate the ACL — for CloudFront you set the Web ACL ARN on the distribution config (WebACLId), not via associate-web-acl (that call is for regional resources like ALBs).
7. TLS, ACM certificates, and SNI
Three rules cover almost every CloudFront TLS question:
- The viewer-facing certificate must be in us-east-1. CloudFront is global and pulls its ACM cert from N. Virginia exclusively. Request it there even if everything else lives in eu-west-1. (Origin-facing certs on the ALB live in the origin’s Region: different cert, different Region.)
- Use SNI, not a dedicated IP. SSLSupportMethod: sni-only is free and correct for all modern clients. Dedicated-IP SSL exists only for ancient non-SNI clients, bills a significant monthly fee per distribution, and you almost certainly do not need it.
- Set a modern security policy so the negotiated minimum TLS version and cipher suite are current.
aws cloudfront update-distribution --id E1EXAMPLE --if-match ETAG --distribution-config '{
"...": "...",
"Aliases": { "Quantity": 1, "Items": ["app.example.com"] },
"ViewerCertificate": {
"ACMCertificateArn": "arn:aws:acm:us-east-1:111122223333:certificate/abcd-1234",
"SSLSupportMethod": "sni-only",
"MinimumProtocolVersion": "TLSv1.2_2021"
}
}'
ACM certificates that CloudFront uses must be validated and renewable; DNS validation in the same Route 53 zone lets ACM auto-renew indefinitely without you touching it again.
Verify
Prove each layer independently — a green dashboard is not verification.
# 1. TLS + edge: confirm CloudFront serves over HTTPS and reports an edge hit/miss
curl -sSI https://app.example.com/static/app.js | grep -i -E 'x-cache|via|x-amz-cf-pop'
# Expect: X-Cache: Hit from cloudfront (after a warm-up request)
# 2. Origin lock-down: hit the origin directly and confirm it refuses
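# (-k because the ALB certificate is issued for your domain, not the raw elb.amazonaws.com hostname)
curl -sSIk https://alb-primary.us-east-1.elb.amazonaws.com/ | head -1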
# Expect: 403 (missing X-Origin-Verify) — NOT 200
# 3. WAF rate limit: send more requests than the limit (2000 here) and confirm 403s appear
#    (blocking is not instant; expect a short lag after the rate is exceeded)
seq 1 2500 | xargs -P 20 -I{} curl -s -o /dev/null -w "%{http_code}\n" https://app.example.com/ | sort | uniq -c
# 4. Route 53 failover: check which record is currently answering and its health
aws route53 get-health-check-status --health-check-id abcd1234-primary-hc \
--query 'HealthCheckObservations[].StatusReport.Status'
# 5. Origin-group failover: with the primary returning 503, confirm the secondary still serves 200
curl -sS -o /dev/null -w "%{http_code}\n" https://app.example.com/
For a real failover drill, inject failure rather than reasoning about it: make the primary origin’s /healthz/deep return 503 (a feature flag, or block the health-check IPs at a security group), then watch Route 53 flip the record and CloudFront fall through the origin group. Do this in a game day, not in your head.
Observability: real-time logs, metrics, synthetic monitoring
- CloudFront standard logs land in S3 (or via CloudWatch) and are your forensic record — status, edge location, cache result, time-taken, per request. Real-time logs stream a configurable sample to Kinesis Data Streams within seconds, for live dashboards and anomaly detection. Use standard for completeness, real-time for latency-sensitive alerting.
- Cache hit ratio is the metric that proves your cache-key design. Watch CacheHitRate per distribution in CloudWatch (it is one of CloudFront’s additional metrics, so it must be turned on for the distribution first); a falling ratio after a deploy almost always means someone added a header, cookie, or query string to the cache key and fragmented the cache.
- WAF metrics (BlockedRequests, AllowedRequests, CountedRequests) per rule tell you whether a managed rule is doing its job or quietly false-positiving. Enable sampled requests so you can inspect what got blocked.
- CloudWatch Synthetics canaries give you outside-in truth. A canary running from multiple Regions, hitting https://app.example.com on a schedule, catches DNS, TLS-expiry, and edge problems that origin-side health checks never see, including a Route 53 misconfiguration that no internal probe would notice.
# Example CloudWatch alarm: alert when edge cache hit ratio drops below 80%
AlarmName: cloudfront-low-cache-hit
Namespace: AWS/CloudFront
MetricName: CacheHitRate
Dimensions:
  - Name: DistributionId
    Value: E1EXAMPLE
  - Name: Region
    Value: Global
Statistic: Average
Period: 300
EvaluationPeriods: 3
Threshold: 80
ComparisonOperator: LessThanThreshold
CloudFront metrics publish to the AWS/CloudFront namespace with the Region dimension set to Global, and you read them from us-east-1. Building the alarm in any other Region’s console and finding “no data” is a common waste of an afternoon.
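The CloudWatch figure can be cross-checked against logs. The sketch below computes an approximate hit ratio from the x-edge-result-type values found in CloudFront logs, counting Hit and RefreshHit as hits and ignoring errors and redirects; how the log lines are parsed into that array is left out, and the result is an approximation of the published metric rather than its exact definition.

```javascript
const HIT_TYPES = new Set(["Hit", "RefreshHit"]);
const COUNTED_TYPES = new Set(["Hit", "RefreshHit", "Miss"]);

// resultTypes: array of x-edge-result-type values, one per logged request.
// Returns a percentage, or null when no cacheable requests were seen.
function cacheHitRatio(resultTypes) {
  let hits = 0;
  let counted = 0;
  for (const type of resultTypes) {
    if (!COUNTED_TYPES.has(type)) continue; // skip Error, Redirect, and similar
    counted += 1;
    if (HIT_TYPES.has(type)) hits += 1;
  }
  return counted === 0 ? null : (100 * hits) / counted;
}

console.log(cacheHitRatio(["Hit", "Hit", "Miss", "RefreshHit", "Error"])); // 75
```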
Enterprise scenario
A media company ran an active-passive setup: primary origin in us-east-1, warm standby in eu-west-1, fronted by a single CloudFront distribution with an origin group. They had done the homework — health checks, failover criteria on 500/502/503/504, low TTLs. During a partial us-east-1 ALB degradation, the primary started returning a mix of 200s and 429 Too Many Requests under load. Their dashboards showed the origin group not failing over, and customers in Europe — who should have been served by the nearby standby anyway — were seeing errors.
Two root causes. First, 429 was not in their FailoverCriteria, and origin groups fail over only on the configured status codes (or connection errors); an unlisted 429 is treated as a valid response and returned to the client, never retried against the secondary. Second, every viewer worldwide was being routed to the single distribution’s origin group, whose primary was the struggling us-east-1 ALB; CloudFront origin failover is per-request and reactive, so European users still hit the failing primary first and only fell through if the response was a configured 5xx.
The fix layered the two failover mechanisms correctly instead of relying on origin groups alone. They kept the 500, 502, 503, 504 coverage they already had, added health checks tuned to the real failure signal, and, critically, moved Region selection up to Route 53 with latency records plus health checks, so resolvers in Europe were steered to a distribution whose primary origin was eu-west-1. The origin group remained as the last line of defense within each Region.
# Origin group now also treats overload as failover-worthy where idempotent,
# and Route 53 latency records carry the health check so a sick Region sheds traffic at DNS.
# (429 is intentionally NOT added to FailoverCriteria for write paths;
# it is handled by a separate, GET-only behavior pointing at a read-replica origin group.)
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "app.example.com", "Type": "A",
"SetIdentifier": "eu-west-1", "Region": "eu-west-1",
"AliasTarget": { "HostedZoneId": "Z2FDTNDATAQYW2",
"DNSName": "d222222ghijkl9.cloudfront.net", "EvaluateTargetHealth": false },
"HealthCheckId": "eu-primary-hc"
}
}]
}'
The lesson the team took away: origin groups and Route 53 failover solve different outage shapes. Origin groups handle “this origin returned a 5xx for this request.” Route 53 handles “this whole Region is sick, steer everyone away from it.” Relying on either alone leaves a gap, and 429/4xx overload is a gap origin groups will not close unless you deliberately list those codes for idempotent paths.
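The DNS-level half of that layering, latency records gated by health checks, can be modeled in a few lines. This is a simplified sketch with hypothetical inputs: real latency routing uses AWS's own latency data, and the fallback to all records when none is healthy is how this model chooses to handle that case.

```javascript
// latencyMsByRegion: estimated latency from a resolver to each Region's record.
// healthyByRegion: health-check state for each Region's record.
function pickRegion(latencyMsByRegion, healthyByRegion) {
  const all = Object.keys(latencyMsByRegion);
  if (all.length === 0) return null;
  const healthy = all.filter(r => healthyByRegion[r] === true);
  // If nothing is healthy, consider every record rather than answering with nothing.
  const pool = healthy.length > 0 ? healthy : all;
  return pool.reduce((best, r) => (latencyMsByRegion[r] < latencyMsByRegion[best] ? r : best));
}

const fromEurope = { "us-east-1": 90, "eu-west-1": 20 }; // hypothetical latencies
const fromUsEast = { "us-east-1": 15, "eu-west-1": 95 };

console.log(pickRegion(fromEurope, { "us-east-1": true, "eu-west-1": true }));  // eu-west-1
console.log(pickRegion(fromUsEast, { "us-east-1": false, "eu-west-1": true })); // eu-west-1
console.log(pickRegion(fromUsEast, { "us-east-1": true, "eu-west-1": true }));  // us-east-1
```

In this model European resolvers never reach the us-east-1 record while eu-west-1 is healthy, and a sick us-east-1 sheds its own nearby viewers at DNS, leaving the origin group to handle per-request errors inside each Region.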