Most teams reach for “multi-region” the moment a board asks “what happens if a region goes down?” — and then discover that the honest answer depends entirely on how you went multi-region. A warm standby that nobody has failed over to in 18 months is not resilience; it is a liability with a runbook. This article is about the harder, more useful pattern: active-active, where both regions serve live production traffic all the time, so a regional loss is a capacity event rather than a heroic, untested failover.
We will build this on AWS with the four services that make active-active actually feasible without a self-managed replication layer: Amazon Route 53 for global traffic steering and health-based failover, Amazon CloudFront for edge delivery and origin failover, Amazon DynamoDB Global Tables for multi-active key-value/document data, and Amazon Aurora Global Database for relational data that needs a single writer but cross-region read scale and fast promotion. The interesting engineering is not in turning these on — it is in the boundaries between them: where you accept eventual consistency, where you refuse to, and how you keep two live regions from corrupting each other.
The business scenario
The pattern fits a surprisingly wide band of organisations, because the driver is rarely raw scale — it is revenue-per-minute, regulatory geography, or contractual availability.
- A mid-market SaaS company (say, 200–2,000 paying tenants) signs an enterprise customer whose procurement team demands 99.99% availability and “no single-region dependency” in the security questionnaire. A single-region deployment with a documented RTO of 4 hours will lose the deal.
- A fintech / payments platform must keep authorising transactions during a regional impairment because every minute of downtime is directly quantifiable: a $40M/year merchant base losing 30 minutes at peak is roughly $2,300 of authorised volume per minute they cannot process, plus reputational damage and SLA penalties.
- A global consumer app (gaming backend, ride-hailing, social) needs low write latency on two continents at once. A user in Frankfurt and a user in Virginia both need sub-100ms writes; a single writer 6,000 km away cannot deliver that.
- A large enterprise running a customer-facing portal under a regulator that expects demonstrable, regularly-exercised resilience (DORA in the EU, FFIEC/OCC guidance in US banking) — where “we have backups” is no longer an acceptable answer and auditors ask for evidence of actual failover tests.
What unites them is the same uncomfortable realisation: the failure modes that hurt are correlated and regional. A bad deployment, a control-plane event, an AZ-spanning power or networking incident, or a throttling storm tends to take out a region’s worth of a service at once, not one server. Multi-AZ (which you should already have) protects against a data-centre; it does nothing for a regional control-plane degradation or a fat-fingered region-wide config change.
The problem this architecture solves: serve every user from a healthy region with low latency, survive the loss of an entire region with an RTO measured in minutes and an RPO measured in seconds, and do it as a steady-state property of the system rather than a fire-drill. The cost is real — roughly 1.7–2.1x the single-region infrastructure bill plus meaningful inter-region data-transfer charges — so the rest of this article is also about deciding which parts of your system genuinely need it.
Architecture overview
Picture two AWS Regions running the same stack — call them us-east-1 (Virginia) as the primary and eu-west-1 (Ireland) as the secondary, though “primary/secondary” only matters for the relational tier; everything else is genuinely symmetric.
The request path, edge inward:
- A user resolves app.example.com. Route 53 answers with a latency-based (or geoproximity) routing policy, returning the entry point for the AWS Region closest to them. Each record is tied to a Route 53 health check so an unhealthy Region is withdrawn from DNS automatically.
- The resolved endpoint is a CloudFront distribution. Static assets and cacheable API responses are served from the edge. CloudFront is configured with an origin group: a primary origin (the nearest Region’s regional entry point) and a secondary origin, so that even before DNS TTLs expire, CloudFront can fail a single request over to the other Region on a 5xx/connection error.
- Dynamic requests hit the Region’s Application Load Balancer, fronted by AWS WAF, terminating TLS via ACM. The ALB distributes to the compute tier — ECS Fargate or EKS tasks (or Lambda behind API Gateway for the event-driven slices) — running identical container images in both Regions, deployed from one pipeline.
- The application reads and writes data through two distinct data planes, and this split is the heart of the design:
- DynamoDB Global Tables for data that is naturally key-addressable and tolerant of last-writer-wins semantics — sessions, user profiles, carts, feature flags, event ledgers, idempotency keys. Each Region writes to its local replica; DynamoDB asynchronously replicates both directions, typically within a second or two. Both Regions are writers. There is no failover for this tier — it is already multi-active.
- Aurora Global Database (PostgreSQL- or MySQL-compatible) for relational data needing strong consistency, foreign keys, and transactions: orders, invoices, ledgers-of-record, anything where “last writer wins” would be a financial bug. Here there is exactly one writer Region at a time. The secondary Region holds a read-only replica kept current via Aurora’s storage-layer replication (typically under one second of lag, often ~tens of milliseconds). Local reads are fast; writes that originate in the secondary Region still commit on the primary writer, either because the application connects to the primary’s writer endpoint cross-Region or because write forwarding is enabled, in which case the secondary cluster accepts the statement and forwards it to the primary. Either way the write pays a cross-Region round trip.
- CloudFront and Route 53 are global services, and DynamoDB Global Tables is multi-active across its replica Regions, so none of them has a failover step of its own. Only the relational writer and the regional compute fleets have a concept of primary/secondary.
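The DNS step of the request path could be expressed in Terraform roughly as follows. This is a minimal sketch for the us-east-1 side only (the eu-west-1 record would be symmetric). The variable names, the /health path, and the assumption that the regional entry point is an alias-capable target such as an ALB are illustrative, not taken from the article.

```hcl
# Hypothetical inputs: names are illustrative, not from the article.
variable "zone_id" {
  type        = string
  description = "Route 53 hosted zone ID for example.com"
}

variable "use1_entry_dns_name" {
  type        = string
  description = "DNS name of the us-east-1 regional entry point (assumed alias-capable, e.g. an ALB)"
}

variable "use1_entry_zone_id" {
  type        = string
  description = "Hosted zone ID of that alias target"
}

# Deep health check against the application, not just a TCP connect.
resource "aws_route53_health_check" "use1" {
  type              = "HTTPS"
  fqdn              = var.use1_entry_dns_name
  port              = 443
  resource_path     = "/health" # assumed path
  request_interval  = 10        # seconds; 10 and 30 are the allowed values
  failure_threshold = 3
}

# One latency record per Region. Route 53 stops returning this record
# while its health check is failing.
resource "aws_route53_record" "app_use1" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "A"
  set_identifier  = "us-east-1"
  health_check_id = aws_route53_health_check.use1.id

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = var.use1_entry_dns_name
    zone_id                = var.use1_entry_zone_id
    evaluate_target_health = true
  }
}
```

A second record with set_identifier "eu-west-1" and its own health check completes the pair; both records must share the same name and type for latency routing to choose between them.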
The data path under steady state: a write in Ireland to a DynamoDB-backed feature flag is acknowledged locally in single-digit milliseconds and shows up in Virginia ~1–2s later. A write in Ireland to an Aurora-backed order is either forwarded to the Virginia writer (adding one cross-Atlantic round trip, ~70–90ms) or, if Ireland is the writer, committed locally and streamed to Virginia.
The failure path: if Virginia degrades, Route 53 health checks flip DNS so new users resolve to Ireland; CloudFront origin failover catches in-flight requests immediately; the DynamoDB tier needs no action because Ireland was always a writer; and the operator (or an automated runbook) promotes the Aurora secondary in Ireland to writer — a managed planned failover completes in about a minute with near-zero data loss, while an unplanned failover (region truly gone) is a few minutes with an RPO usually under a second.
The single most important design decision visible in this overview: DynamoDB removes failover from the equation for everything you can model as key-value; Aurora concentrates the only real failover risk into one well-understood, well-tooled operation. Minimise what lives in the Aurora “single-writer” world and active-active gets dramatically simpler.
Component breakdown
| Component | Role in this architecture | Key configuration choices |
|---|---|---|
| Route 53 | Global DNS steering + health-based regional withdrawal | Latency or geoproximity routing; health checks on a deep /health endpoint (not just TCP); low TTL (30–60s) on the apex; Evaluate Target Health on alias records; optionally Route 53 ARC for orchestrated failover |
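| CloudFront | Edge caching, TLS, DDoS surface reduction, sub-DNS origin failover | Origin group (primary + failover origin) with failover on 500/502/503/504 and connection errors (origin failover applies to GET/HEAD/OPTIONS requests only); cache policies per path; AWS WAF + Shield attached; OAC to lock S3 origins to CloudFront, and the CloudFront managed prefix list plus a secret origin header for ALB origins |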
| AWS WAF + Shield | L7 filtering, rate-limiting, managed rule sets | Rate-based rules per IP; AWS Managed Rules (core, SQLi, known-bad-inputs); Shield Advanced if availability SLAs/penalties justify it |
| Application Load Balancer | Regional ingress to compute | One per Region; cross-zone balancing on; deregistration delay tuned; HTTP/2; access logs to S3 |
| ECS Fargate / EKS | Stateless compute, identical in both Regions | Same image digest per release; min capacity sized to absorb the other Region’s traffic (see N+1 sizing below); per-Region auto scaling |
| DynamoDB Global Tables | Multi-active NoSQL data plane | Global Tables v2 (2019.11.21); on-demand or autoscaled capacity; PITR + DynamoDB Streams; design for last-writer-wins; conflict-aware item modelling |
| Aurora Global Database | Relational data with single writer + cross-region read & fast promotion | One global cluster; primary + ≥1 secondary Region; storage-level replication; managed planned failover for drills; write forwarding if secondary needs local writes; RPO target ~1s |
| S3 + Cross-Region Replication | Object storage (uploads, exports, static origin) | Per-Region buckets with bidirectional or hub-and-spoke CRR; replication metrics; versioning on |
| AWS Global Accelerator (optional) | Anycast IPs + faster failover than DNS for non-cacheable, latency-sensitive APIs | Two static anycast IPs; endpoint groups per Region; traffic dials; health checks at the network layer |
| Secrets Manager / KMS | Secrets + encryption, replicated per Region | Multi-Region KMS keys; Secrets Manager cross-Region replica secrets; no plaintext secrets in images |
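The CloudFront row in the table can be sketched as an origin group in Terraform. This is an illustrative fragment under stated assumptions: the two origin domain variables are hypothetical, the default CloudFront certificate stands in for a real ACM certificate, and the AWS-managed caching policy is used only to keep the example short. Note that the behaviour allows only GET, HEAD, and OPTIONS, because CloudFront origin failover applies to those methods only.

```hcl
# Hypothetical inputs: the two regional entry points (for example, ALB DNS names).
variable "use1_origin_domain" {
  type = string
}

variable "euw1_origin_domain" {
  type = string
}

# AWS-managed cache policy, looked up by name.
data "aws_cloudfront_cache_policy" "caching_optimized" {
  name = "Managed-CachingOptimized"
}

resource "aws_cloudfront_distribution" "app" {
  enabled = true

  origin {
    origin_id   = "use1"
    domain_name = var.use1_origin_domain

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  origin {
    origin_id   = "euw1"
    domain_name = var.euw1_origin_domain

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # The first member is the primary. The second is tried when the primary
  # returns one of the listed status codes or cannot be reached.
  origin_group {
    origin_id = "app-failover"

    failover_criteria {
      status_codes = [500, 502, 503, 504]
    }

    member {
      origin_id = "use1"
    }

    member {
      origin_id = "euw1"
    }
  }

  default_cache_behavior {
    target_origin_id       = "app-failover"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = data.aws_cloudfront_cache_policy.caching_optimized.id
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
      locations        = []
    }
  }

  # Placeholder; a real deployment would reference an ACM certificate in us-east-1.
  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
```

Write paths (POST/PUT/DELETE) would need a separate cache behaviour that targets a single origin, which is one reason a different failover mechanism is usually needed for non-cacheable APIs.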
A few component choices deserve their why, not just their what:
Why DynamoDB Global Tables and not “Aurora for everything”? Because Aurora has exactly one writer. If you put high-velocity, geographically-distributed writes (sessions, telemetry, idempotency keys) into Aurora, half your users pay a transatlantic write penalty all the time, and a writer-region loss blocks all writes until promotion. DynamoDB lets both Regions write locally and reconciles asynchronously. The price is last-writer-wins semantics with no built-in row-level merge — so you must model data to make concurrent writes either rare or commutative (see the conflict note below).
Why keep Aurora at all, then? Because last-writer-wins is wrong for money and relationships. An order total, a ledger entry, a uniqueness constraint, a multi-row transaction — these need a single source of truth and serialisable behaviour. Aurora Global Database gives you that plus a secondary Region that is already warm, replicated at the storage layer with typically sub-second lag, and promotable in about a minute. You get strong consistency where it matters and you concede that a small slice of your system has a real (but fast and well-tooled) failover.
The conflict reality for DynamoDB Global Tables: replication is last-writer-wins based on the most recent write timestamp. If the same item is updated in two Regions within the replication window, one update silently wins. Mitigations that actually work in practice:
- Partition writes by geography where possible so a given item is “owned” by one Region (e.g. shard a user to their home Region).
- Make updates idempotent and commutative — use conditional writes, append-only event items instead of in-place mutation, or per-attribute updates that don’t collide.
- For counters, prefer DynamoDB atomic counters per Region summed at read time, or move the counter to a model that tolerates merge.
- Never assume cross-Region read-after-write; treat the remote replica as eventually consistent.
Implementation guidance
Infrastructure as Code. Treat the two Regions as one logical system expressed in Terraform, not two copies maintained by hand. The clean structure:
- A global stack for the genuinely global resources: Route 53 hosted zone + records + health checks, CloudFront distribution and origin group, AWS WAF web ACL (CloudFront scope is us-east-1 / global), the multi-Region KMS primary key, and the ACM cert (CloudFront certs must live in us-east-1).
- A reusable region module instantiated twice with different provider aliases (aws.use1, aws.euw1), producing the VPC, subnets, ALB, ECS/EKS, Aurora regional cluster member, and the DynamoDB table (DynamoDB Global Tables v2 is modelled as aws_dynamodb_table with replica { region_name = ... } blocks; you declare the replicas on the table, not as separate resources).
- A thin data stack wiring the Aurora Global Database (aws_rds_global_cluster + a primary aws_rds_cluster in us-east-1 and a secondary aws_rds_cluster in eu-west-1 referencing the global cluster) and the S3 CRR rules.
Terraform shape for the data tier (illustrative):
```hcl
resource "aws_rds_global_cluster" "this" {
  global_cluster_identifier = "ex-global"
  engine                    = "aurora-postgresql"
  engine_version            = "16.4"
  database_name             = "appdb"
  storage_encrypted         = true
}

# Primary writer: us-east-1
resource "aws_rds_cluster" "primary" {
  provider                    = aws.use1
  cluster_identifier          = "ex-use1"
  engine                      = aws_rds_global_cluster.this.engine
  engine_version              = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier   = aws_rds_global_cluster.this.id
  master_username             = var.db_user
  manage_master_user_password = true # password lives in Secrets Manager, not state
  kms_key_id                  = aws_kms_key.use1.arn
  db_subnet_group_name        = module.region_use1.db_subnet_group
}

# Secondary reader (promotable): eu-west-1
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.euw1
  cluster_identifier        = "ex-euw1"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  kms_key_id                = aws_kms_key.euw1.arn
  db_subnet_group_name      = module.region_euw1.db_subnet_group
  depends_on                = [aws_rds_cluster.primary]
}
```
And the DynamoDB global table as a single resource with replicas:
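```hcl
resource "aws_dynamodb_table" "sessions" {
  provider         = aws.use1 # the table's home Region
  name             = "sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "pk"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  # HCL allows only one argument in a single-line block, so this one is expanded.
  attribute {
    name = "pk"
    type = "S"
  }

  point_in_time_recovery { enabled = true }

  # List only the other Regions; the provider's Region already hosts the table.
  replica { region_name = "eu-west-1" }
}
```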
Networking. Each Region gets its own VPC; do not rely on a cross-Region VPC peering hot path for the request flow. Keep request handling entirely in-Region and let only data replication cross Regions (over the AWS backbone, which Aurora and DynamoDB use natively). Aurora write forwarding travels over that same managed replication path, so it needs no network plumbing of your own. If services in one Region must reach the other directly (e.g. an in-Region app connecting to the remote writer endpoint), use Transit Gateway with inter-Region peering or PrivateLink, and budget for the inter-Region data-transfer cost. Put VPC endpoints (Gateway endpoints for DynamoDB and S3; Interface endpoints for everything else) in each Region so data-plane traffic to DynamoDB/S3 stays off NAT and off the internet.
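The Gateway endpoints mentioned above might look like this inside the per-Region module. It is a sketch only: the three variables are hypothetical module inputs, and Interface endpoints for other services are left out.

```hcl
# Hypothetical module inputs; names are illustrative.
variable "region" {
  type        = string
  description = "Region this module instance is deployed to, e.g. us-east-1"
}

variable "vpc_id" {
  type = string
}

variable "private_route_table_ids" {
  type        = list(string)
  description = "Route tables of the subnets that run the application tasks"
}

# Gateway endpoints add routes to the listed route tables, so DynamoDB and
# S3 traffic from those subnets bypasses the NAT gateway.
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}
```

Because the module is instantiated once per Region, each Region ends up with its own pair of endpoints and no request has to leave its Region to reach the local DynamoDB replica.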
Identity and access. One AWS Organization, with the workload spread across accounts by environment (and optionally by Region for blast-radius isolation). Use IAM Identity Center for human SSO. For the application: per-Region IAM roles assumed by the Fargate/EKS tasks (IRSA on EKS), scoped to that Region’s table/cluster ARNs. KMS keys are multi-Region so an encrypted DynamoDB/S3 item replicated to the other Region decrypts under the local key replica without a cross-Region KMS call. Secrets Manager uses replica secrets so each Region reads its DB credentials locally — never a cross-Region Secrets Manager dependency on the hot path.
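A multi-Region key and a replicated secret could be declared along these lines. This is a minimal sketch with hypothetical resource and secret names; it is separate from the aws_kms_key.use1 / aws_kms_key.euw1 keys referenced in the Aurora snippet, and key policies are omitted.

```hcl
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "euw1"
  region = "eu-west-1"
}

# Primary multi-Region key in us-east-1.
resource "aws_kms_key" "primary" {
  provider                = aws.use1
  description             = "Multi-Region primary key for application data"
  multi_region            = true
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

# Replica of the same key material in eu-west-1, usable without a
# cross-Region KMS call.
resource "aws_kms_replica_key" "euw1" {
  provider                = aws.euw1
  description             = "Replica of the multi-Region key in eu-west-1"
  primary_key_arn         = aws_kms_key.primary.arn
  deletion_window_in_days = 30
}

# Secret created in us-east-1 and replicated to eu-west-1, encrypted there
# with the local replica key.
resource "aws_secretsmanager_secret" "app_config" {
  provider   = aws.use1
  name       = "app/config" # hypothetical secret name
  kms_key_id = aws_kms_key.primary.arn

  replica {
    region     = "eu-west-1"
    kms_key_id = aws_kms_replica_key.euw1.arn
  }
}
```

Tasks in each Region then read the secret through their local Secrets Manager endpoint, which keeps the other Region off the hot path.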
Deployment. One CI/CD pipeline (CodePipeline or GitHub Actions) builds one image, pushes to ECR with cross-Region replication enabled, and deploys the same digest to both Regions — ideally one Region at a time (canary in Region A, bake, then Region B) so a bad release cannot brick both Regions simultaneously. This staggering is itself a resilience control: your two Regions are also two blast-radius boundaries for deploys.
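The ECR cross-Region replication step might be configured as below. A sketch, assuming images are pushed to us-east-1 and that application repositories share a hypothetical "app/" name prefix; the staggered deploy itself lives in the pipeline, not in Terraform.

```hcl
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

data "aws_caller_identity" "current" {
  provider = aws.use1
}

# Registry-level setting: every image pushed to a matching repository in
# us-east-1 is copied to the same account's registry in eu-west-1.
resource "aws_ecr_replication_configuration" "to_euw1" {
  provider = aws.use1

  replication_configuration {
    rule {
      destination {
        region      = "eu-west-1"
        registry_id = data.aws_caller_identity.current.account_id
      }

      repository_filter {
        filter      = "app/" # assumed repository name prefix
        filter_type = "PREFIX_MATCH"
      }
    }
  }
}
```

Replication preserves the image digest, so both Regions can be pinned to the same digest while pulling from their local registry.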
Schema migrations are the sharp edge of single-writer relational data: run them against the Aurora primary writer only, design them backward-compatible (expand/contract), and never ship an app version that requires a schema the other Region’s in-flight traffic hasn’t seen — because after a failover the other Region becomes the writer.
Enterprise considerations
Security and Zero Trust. The perimeter is CloudFront + WAF + Shield, but trust is enforced per request, per Region. Lock S3 origins with Origin Access Control so the buckets only accept requests from your CloudFront distribution. OAC does not cover ALB origins, so for the ALB restrict the security group to the CloudFront managed prefix list and require a secret custom header that CloudFront adds to origin requests. Terminate TLS everywhere (ACM), encrypt at rest with multi-Region KMS keys, and keep all secrets in Secrets Manager with per-Region replicas. Apply least-privilege IAM scoped to in-Region resource ARNs: a compromised task in Ireland should have no standing path to Virginia’s data plane beyond what replication already provides. Enable GuardDuty, Security Hub, and CloudTrail (organization trail, multi-Region) so detection is symmetric; a threat actor will not politely confine themselves to your primary Region.
Cost optimization. Active-active is not free; the discipline is spending it where it buys availability and trimming where it does not.
- Compute: you do not need 2x full capacity. Size each Region for its own share of steady-state peak (roughly half of the combined load) plus enough headroom to absorb the other Region’s share during failover. A common target is each Region running at ~50–60% utilisation, so that the surviving Region can take 100% of traffic while autoscaling catches up. Use Compute Savings Plans on the always-on baseline and Spot for fault-tolerant async workers.
- Data transfer is the silent line item. DynamoDB Global Tables charges replicated write capacity (rWCUs) for cross-Region replication, and Aurora Global Database charges for replicated data transfer between Regions. At scale this can rival the compute bill. Reduce it by not globally replicating data that doesn’t need it — keep purely regional data (e.g. region-local analytics staging) in single-Region tables.
- Aurora: the secondary cluster is real spend even while “just reading.” Right-size it; you can run the secondary smaller than primary and scale it up as part of the promotion runbook if your RTO budget allows the extra minute.
- CloudFront offsets cost by serving cache hits at edge pricing instead of origin compute + cross-Region transfer — push cacheability as high as correctness allows.
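The compute sizing rule in the first bullet can be approximated with a target-tracking policy on the ECS service. This is a sketch under assumptions: the variable names are hypothetical, CPU is used as a stand-in for whatever metric actually tracks load, and the 3x ceiling is an arbitrary illustration of "room to absorb the other Region".

```hcl
# Hypothetical inputs; names are illustrative.
variable "ecs_cluster_name" {
  type = string
}

variable "ecs_service_name" {
  type = string
}

variable "steady_state_tasks" {
  type        = number
  description = "Task count that serves this Region's own peak share"
}

resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/${var.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = var.steady_state_tasks
  max_capacity       = var.steady_state_tasks * 3 # assumed ceiling: own share + other Region + margin
}

# Hold average CPU near 55% so there is slack to take the other Region's
# traffic while new tasks start.
resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-55"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.app.service_namespace
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 55
    scale_out_cooldown = 30
    scale_in_cooldown  = 300

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

A short scale-out cooldown and a long scale-in cooldown bias the service towards keeping capacity during a failover rather than trimming it too early.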
Scalability. DynamoDB on-demand scales horizontally per Region with no capacity planning; Aurora scales reads via in-Region replicas and the cross-Region secondary, but writes remain bound to one node in one Region — that is the architecture’s scaling ceiling for relational data. If relational write throughput becomes the limit, the answer is not a second Aurora writer (multi-writer Aurora is a niche, contention-prone mode) but moving more of the write path onto DynamoDB or sharding the relational domain by tenant/Region.
Reliability and DR (RTO/RPO). This is the headline:
| Tier | RPO (data loss) | RTO (time to recover) | Mechanism |
|---|---|---|---|
| DynamoDB Global Tables | ~0 for the writing Region; in-flight cross-Region writes within the ~1–2s replication window may be reordered/LWW | ~0 (both Regions already active) | Native multi-active replication; no failover needed |
| Aurora — planned failover (drills, maintenance) | ~0 (replication caught up before switch) | ~1 minute | Managed planned failover of the Global Database |
| Aurora — unplanned failover (region lost) | typically < 1 second of lag at the moment of loss | a few minutes to promote secondary + repoint app | Promote secondary to standalone writer; app uses Region-local writer endpoint |
| Edge / routing | n/a | DNS: 30–60s for new resolvers (TTL); CloudFront origin failover and Global Accelerator: seconds for in-flight | Route 53 health checks; CloudFront origin group; optional Global Accelerator |
The crucial practice: exercise the Aurora planned failover on a schedule (monthly/quarterly) using GameDays. A failover path you have never run is an RTO you cannot honestly claim. Route 53 Application Recovery Controller (ARC) is worth adopting for the orchestration — its routing controls let you flip Region traffic deterministically (and its readiness checks continuously verify the standby is actually able to take load), avoiding the trap of DNS health checks misbehaving during a partial brownout.
Observability. Metrics, logs, and traces are emitted per Region (CloudWatch, X-Ray / OpenTelemetry) and aggregated into a single pane — either CloudWatch cross-account/cross-Region dashboards or a third-party (Datadog/Grafana). The few signals you must watch specifically for this pattern: DynamoDB ReplicationLatency per replica, Aurora AuroraGlobalDBReplicationLag / AuroraGlobalDBRPOLag, Route 53 health-check status, and CloudFront origin failover counts. A rising Aurora replication lag is a leading indicator that your RPO promise is degrading before any outage happens.
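Two of the signals above could be alarmed on as follows. This is a hedged sketch: the thresholds are placeholders, no alarm actions are wired up, and the dimension used for the Aurora metric is an assumption that should be checked against the metric as it appears in your CloudWatch console.

```hcl
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "euw1"
  region = "eu-west-1"
}

# DynamoDB publishes ReplicationLatency (milliseconds) in the source Region,
# per table and receiving Region.
resource "aws_cloudwatch_metric_alarm" "sessions_replication_latency" {
  provider            = aws.use1
  alarm_name          = "sessions-replication-latency-to-eu-west-1"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ReplicationLatency"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5000 # ms; placeholder value

  dimensions = {
    TableName       = "sessions"
    ReceivingRegion = "eu-west-1"
  }
}

# Aurora reports global replication lag (milliseconds) for the secondary cluster.
resource "aws_cloudwatch_metric_alarm" "aurora_global_lag" {
  provider            = aws.euw1
  alarm_name          = "ex-euw1-global-replication-lag"
  namespace           = "AWS/RDS"
  metric_name         = "AuroraGlobalDBReplicationLag"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1000 # ms; placeholder aligned with a ~1s RPO target

  dimensions = {
    DBClusterIdentifier = "ex-euw1" # assumed dimension set; verify before relying on it
  }
}
```

After a failover the roles reverse, so a mirrored pair of alarms (eu-west-1 to us-east-1) is needed for the same coverage in the other direction.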
Governance. Enforce with Service Control Policies (e.g. deny creation of resources outside your two sanctioned Regions to prevent shadow expansion), AWS Config conformance packs checking that tables are Global Tables and clusters belong to the global cluster, and tagging that distinguishes data-residency classes so a future engineer cannot accidentally replicate EU-resident personal data into a non-permitted Region — a real compliance trap in any cross-Region design.
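The Region-restriction SCP could be sketched like this. The exempted action list is deliberately short and illustrative; a production policy needs a fuller list of global services, and the attachment target is a hypothetical variable. Global Accelerator is exempted because its control-plane API is served from us-west-2.

```hcl
variable "workload_ou_id" {
  type        = string
  description = "Hypothetical: ID of the organizational unit holding the workload accounts"
}

resource "aws_organizations_policy" "region_allowlist" {
  name = "deny-outside-sanctioned-regions"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyOutsideSanctionedRegions"
      Effect = "Deny"
      # Global or home-Region-bound services that must stay reachable.
      # Illustrative, not exhaustive.
      NotAction = [
        "iam:*",
        "organizations:*",
        "route53:*",
        "cloudfront:*",
        "globalaccelerator:*",
        "sts:*",
        "support:*"
      ]
      Resource = "*"
      Condition = {
        StringNotEquals = {
          "aws:RequestedRegion" = ["us-east-1", "eu-west-1"]
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "workload" {
  policy_id = aws_organizations_policy.region_allowlist.id
  target_id = var.workload_ou_id
}
```

It is worth applying a policy like this to a test OU first, because a missing exemption tends to surface as an unexplained access-denied error in an unrelated service.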
Reference enterprise example
Helios Pay, a fictional B2B payments platform, processes virtual-card authorisations for ~600 business customers across the EU and US. Their single-region (us-east-1) stack had a 99.9% SLA; a new anchor customer — a logistics firm doing ~$90M/year through Helios — made 99.99% and “no single-region dependency” contractual, with a 0.1% monthly fee credit per 0.1% under SLA. Downtime was now line-itemed.
What they decided:
- Regions: us-east-1 (existing) as Aurora primary, eu-west-1 as the active secondary — also a latency win, since 40% of volume is European.
- Data split (the key architectural call):
- Aurora Global Database (PostgreSQL): the ledger of record — authorisations, settlements, double-entry accounting, the card-balance source of truth. Non-negotiably strongly consistent; LWW would be a financial defect.
- DynamoDB Global Tables: idempotency keys (every authorisation request carries one; dedupe must work even mid-failover), tokenised card metadata lookups, rate-limit counters per merchant (atomic, per-Region, summed), and session/auth context for the merchant dashboard.
- Edge: CloudFront for the dashboard SPA and cacheable reference data; Global Accelerator in front of the authorisation API because that path is latency-critical and non-cacheable, and they wanted sub-DNS, sub-10s failover rather than waiting on resolver TTLs.
- Compute: ECS Fargate, each Region sized to run at ~55% of combined peak so either Region can absorb 100% while autoscaling ramps.
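The Global Accelerator piece of a setup like this might be declared roughly as follows. The resource names and ALB ARN variables are hypothetical and this is not Helios Pay's actual configuration (the company is fictional); the us-west-2 provider alias reflects the fact that the Global Accelerator API is served from that Region.

```hcl
provider "aws" {
  alias  = "usw2"
  region = "us-west-2" # Global Accelerator's control plane lives here
}

# Hypothetical inputs: ARNs of the two regional ALBs.
variable "use1_alb_arn" {
  type = string
}

variable "euw1_alb_arn" {
  type = string
}

resource "aws_globalaccelerator_accelerator" "auth_api" {
  provider        = aws.usw2
  name            = "auth-api"
  ip_address_type = "IPV4"
  enabled         = true
}

resource "aws_globalaccelerator_listener" "https" {
  provider        = aws.usw2
  accelerator_arn = aws_globalaccelerator_accelerator.auth_api.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}

# One endpoint group per Region. Setting traffic_dial_percentage to 0
# drains a Region without touching DNS.
resource "aws_globalaccelerator_endpoint_group" "use1" {
  provider                = aws.usw2
  listener_arn            = aws_globalaccelerator_listener.https.id
  endpoint_group_region   = "us-east-1"
  traffic_dial_percentage = 100

  endpoint_configuration {
    endpoint_id = var.use1_alb_arn
    weight      = 100
  }
}

resource "aws_globalaccelerator_endpoint_group" "euw1" {
  provider                = aws.usw2
  listener_arn            = aws_globalaccelerator_listener.https.id
  endpoint_group_region   = "eu-west-1"
  traffic_dial_percentage = 100

  endpoint_configuration {
    endpoint_id = var.euw1_alb_arn
    weight      = 100
  }
}
```

Clients connect to the accelerator's static anycast IPs, so shifting traffic between endpoint groups does not depend on resolver TTLs.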
Numbers:
- Steady-state: ~1,100 authorisations/sec at peak, split roughly 60/40 US/EU.
- DynamoDB replication latency observed: p50 ~0.7s, p99 ~1.9s.
- Aurora AuroraGlobalDBRPOLag: typically 50–400 ms.
- Monthly cost moved from ~$58k (single Region) to ~$112k active-active, a 1.93x multiplier. Breakdown of the increase: secondary Aurora cluster + replicated transfer (~$21k), DynamoDB replicated write capacity (~$9k), second Fargate fleet (~$16k), Global Accelerator + cross-Region transfer + extra observability (~$8k).
- They justified it bluntly: one 30-minute total outage at peak = ~$2k of stuck authorisation volume per minute plus SLA credits across the whole book — a single bad incident would erase a quarter of the added annual spend, before counting the lost anchor customer.
The test that mattered: in their first quarterly GameDay they ran a managed planned Aurora failover from us-east-1 to eu-west-1 during a low-traffic window. New writes were serving from Ireland in ~70 seconds; DynamoDB needed no action; Global Accelerator shifted the auth API in ~8 seconds; the dashboard followed via Route 53 within a minute. Idempotency keys in DynamoDB meant a handful of authorisation requests retried across the switch did not double-charge — the single most important correctness outcome of the whole exercise.
One scar they earned: their first design put per-merchant rate-limit counters as in-place UPDATEs on a single DynamoDB item, and under concurrent US+EU traffic the LWW replication undercounted — a merchant briefly exceeded their cap. The fix was a per-Region counter item (pk = merchant#REGION) summed at read time. It is the canonical Global Tables lesson: model for last-writer-wins, or it will model you.
When to use it
Use active-active multi-region when:
- Availability is contractual or regulatory (99.99%+, DORA, banking guidance) and must be demonstrably exercised, not just architected on paper.
- A regional outage has directly quantifiable revenue/penalty cost that exceeds the ~2x infrastructure premium.
- You genuinely have users on multiple continents needing low write latency, which a single writer cannot serve.
- A meaningful share of your data is key-addressable and LWW-tolerant, so DynamoDB Global Tables can carry it and shrink the relational failover surface.
Think hard (and probably choose simpler) when:
- Your real availability need is 99.9–99.95%. Multi-AZ in one Region already delivers that, at a fraction of the cost and complexity. Multi-region here is over-engineering.
- Your data is overwhelmingly relational and write-heavy with no LWW-tolerant slice. You will end up with one Aurora writer anyway, paying transatlantic write penalties and a cross-Region failover — i.e. warm standby, not true active-active. Be honest about which one you’re building.
- You cannot commit to running failover GameDays. Untested active-active is more dangerous than honest single-region, because the false confidence is worse than the known limitation.
Anti-patterns to avoid:
- Treating Aurora like it’s multi-active. It has one writer. Routing all writes everywhere without write-forwarding or a clear writer Region causes errors or cross-Region latency you didn’t plan for.
- Ignoring DynamoDB conflict semantics. In-place mutation of the same item from two Regions silently loses data under LWW. Model commutatively or partition ownership.
- Cross-Region dependencies on the hot path — a synchronous call from Ireland to a Virginia service, a single shared Secrets Manager secret, a single-Region KMS key. Each one quietly converts your “independent” Regions into a single failure domain.
- DNS-only failover for latency-critical, non-cacheable APIs. Resolver TTL caching makes DNS failover minutes-slow in practice; use CloudFront origin failover (which covers GET/HEAD/OPTIONS requests only) or Global Accelerator for those paths.
- One pipeline deploying both Regions simultaneously. A bad release then takes out both Regions at once — the exact catastrophe multi-region was meant to prevent. Stagger deploys.
Alternatives worth weighing:
- Warm standby / pilot light (active-passive): far cheaper, simpler, acceptable when RTO of minutes-to-an-hour and a small RPO are fine and you don’t need multi-continent write latency. Most enterprises genuinely need this, not active-active.
- Single-Region multi-AZ with rigorous backups + PITR: the right default for the majority of workloads; reach for multi-region only when a concrete driver above forces it.
- Multi-Region only for the data tier (Global Tables + Aurora Global DB) with compute that fails over rather than runs hot: a pragmatic middle ground that gets you fast data recovery without doubling the compute fleet.
The honest framing for any architecture review: active-active multi-region is the right answer to a specific and expensive problem, and an expensive mistake when adopted for prestige. Decide which one you have before you double your bill.