A single db.r6g.2xlarge with Multi-AZ is not a resilient database; it is a single point of failure with a warm standby and a reconnect storm waiting to happen. Aurora changes the physics of high availability by decoupling compute from a distributed, self-healing storage layer. The architect’s job is to design the cluster topology, connection path, and runbooks so the failures Aurora handles for you stay invisible to your application. This guide walks the decisions I make on every production Aurora cluster.
What Aurora’s storage architecture changes about HA
In standard RDS, the instance owns its storage. A Multi-AZ standby is a second full copy kept current by physical replication; failover means promoting that standby and repointing DNS. In Aurora, all instances in a cluster — the writer and every reader — attach to the same shared storage volume, replicated six ways across three Availability Zones. The instances are stateless compute over that volume.
These differences drive every design choice below:
| Property | Standard RDS Multi-AZ | Aurora |
|---|---|---|
| Storage copies | 2 (primary + standby) | 6, across 3 AZs |
| Replica lag | Standby: synchronous, not readable. Read replicas: async, seconds to minutes | Typically <100 ms (no data copy, just redo) |
| Readers serve traffic? | No (standby is passive) | Yes, all replicas are queryable |
| Failover target | The one standby | Any replica, chosen by tier |
| Storage durability quorum | N/A | 4-of-6 writes, 3-of-6 reads |
Because readers share storage with the writer, failover does not require copying or catching up data — Aurora just promotes an existing replica. That is why Aurora failover is measured in seconds rather than the minute-plus of promoting an RDS standby. Your reconnection logic, not the database, becomes the long pole.
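The quorum row in the table is easier to reason about with the arithmetic written out. The sketch below is a toy model, not Aurora's implementation: it assumes the six copies are spread two per AZ and simply counts healthy copies against the 4-of-6 write and 3-of-6 read quorums.

```python
# Toy model of the storage quorum: 6 copies, assumed 2 per AZ across 3 AZs.
COPIES_PER_AZ = 2
AZ_COUNT = 3
TOTAL_COPIES = COPIES_PER_AZ * AZ_COUNT  # 6
WRITE_QUORUM = 4
READ_QUORUM = 3


def availability(failed_azs: int = 0, extra_failed_copies: int = 0) -> dict:
    """Report whether reads and writes still reach quorum after failures."""
    lost = failed_azs * COPIES_PER_AZ + extra_failed_copies
    healthy = max(TOTAL_COPIES - lost, 0)
    return {
        "healthy_copies": healthy,
        "can_write": healthy >= WRITE_QUORUM,
        "can_read": healthy >= READ_QUORUM,
    }


if __name__ == "__main__":
    # Losing a whole AZ leaves 4 copies: writes and reads both continue.
    print(availability(failed_azs=1))
    # Losing an AZ plus one more copy leaves 3: reads continue, writes stall.
    print(availability(failed_azs=1, extra_failed_copies=1))
```

Under these assumptions a full AZ outage costs nothing, and an AZ outage plus one further copy still leaves the data readable, so the volume can be repaired rather than restored.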
Step 1 — Cluster topology and connection management
An Aurora cluster exposes managed endpoints. You almost never connect to an instance endpoint directly in application code:
- Cluster (writer) endpoint — always points at the current writer. Survives failover; the CNAME is repointed for you.
- Reader endpoint — DNS round-robins across available replicas for read-only traffic.
- Custom endpoints — a named subset of instances (e.g. route reporting queries to two large analytics replicas).
- Instance endpoints — one per instance; use only for diagnostics.
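One way to keep instance endpoints out of application code is to resolve every connection through a small lookup keyed by workload. The sketch below is illustrative only: the hostnames are placeholders (the xxxx suffix is account-specific), and the "reporting" custom endpoint name is a hypothetical example.

```python
# Hypothetical hostnames; "xxxx" stands in for the account-specific suffix.
ENDPOINTS = {
    "write": "prod-app.cluster-xxxx.us-east-1.rds.amazonaws.com",
    "read": "prod-app.cluster-ro-xxxx.us-east-1.rds.amazonaws.com",
    "reporting": "reporting.cluster-custom-xxxx.us-east-1.rds.amazonaws.com",
}


def endpoint_for(workload: str) -> str:
    """Return the managed endpoint for a workload; never an instance endpoint."""
    try:
        return ENDPOINTS[workload]
    except KeyError:
        known = ", ".join(sorted(ENDPOINTS))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(endpoint_for("write"))
    print(endpoint_for("read"))
```

Failing loudly on an unknown workload is deliberate: a silent fallback to the writer is how read traffic ends up on the wrong instance.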
Provision the cluster and a couple of replicas with Terraform. The replicas live in different AZs so a zone failure cannot take out every reader:
resource "aws_rds_cluster" "main" {
  cluster_identifier = "prod-app"
  engine = "aurora-postgresql"
  engine_version = "16.4"
  database_name = "app"
  master_username = "app_admin"
  manage_master_user_password = true # store the secret in Secrets Manager
  db_subnet_group_name = aws_db_subnet_group.aurora.name
  vpc_security_group_ids = [aws_security_group.aurora.id]
  storage_encrypted = true
  kms_key_id = aws_kms_key.aurora.arn
  backup_retention_period = 14
  preferred_backup_window = "03:00-04:00"
  deletion_protection = true
  enabled_cloudwatch_logs_exports = ["postgresql"]
}

resource "aws_rds_cluster_instance" "writer" {
  identifier = "prod-app-0"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class = "db.r6g.xlarge"
  engine = aws_rds_cluster.main.engine
  promotion_tier = 0
}

resource "aws_rds_cluster_instance" "reader" {
  count = 2
  identifier = "prod-app-${count.index + 1}"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class = "db.r6g.xlarge"
  engine = aws_rds_cluster.main.engine
  promotion_tier = 1
}
Use manage_master_user_password so the credential is generated and rotated in AWS Secrets Manager rather than living in Terraform state. Never put a real password in master_password.
Put RDS Proxy in front of the writer
Serverless and high-concurrency workloads churn connections aggressively. Every Postgres backend is a forked process with real memory cost, and a connection storm during failover can knock the new writer over before it stabilizes. RDS Proxy maintains a warm pool, multiplexes client connections onto fewer database connections, and — critically — holds client connections open and routes them to the new writer during failover, cutting failover time as the application sees it.
resource "aws_db_proxy" "main" {
  name = "prod-app-proxy"
  engine_family = "POSTGRESQL"
  role_arn = aws_iam_role.proxy.arn
  vpc_subnet_ids = aws_db_subnet_group.aurora.subnet_ids
  vpc_security_group_ids = [aws_security_group.proxy.id]
  require_tls = true

  auth {
    auth_scheme = "SECRETS"
    iam_auth = "REQUIRED"
    secret_arn = aws_rds_cluster.main.master_user_secret[0].secret_arn
  }
}

resource "aws_db_proxy_default_target_group" "main" {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    max_connections_percent = 90
    max_idle_connections_percent = 50
  }
}

resource "aws_db_proxy_target" "main" {
  db_proxy_name = aws_db_proxy.main.name
  target_group_name = aws_db_proxy_default_target_group.main.name
  db_cluster_identifier = aws_rds_cluster.main.id
}
Point your application’s write traffic at the proxy’s default endpoint, which is read/write, and its read traffic at a read-only proxy endpoint. RDS Proxy supports both for Aurora clusters, but the read-only endpoint is an additional proxy endpoint that you create yourself. With iam_auth = "REQUIRED" the app fetches a short-lived token instead of a static password:
TOKEN=$(aws rds generate-db-auth-token \
--hostname prod-app-proxy.proxy-xxxx.us-east-1.rds.amazonaws.com \
--port 5432 --username app_admin --region us-east-1)
Step 2 — Scaling reads: auto scaling replicas vs. Serverless v2
You have two ways to add read capacity, and they are not mutually exclusive.
Provisioned replicas with Auto Scaling keep a fixed floor of instances and add more when a target metric (CPU or connections) is breached. Use this when load is steady or predictably diurnal and you want a known cost. Define a scaling target against the cluster’s reader role:
resource "aws_appautoscaling_target" "replicas" {
  service_namespace = "rds"
  resource_id = "cluster:${aws_rds_cluster.main.cluster_identifier}"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  min_capacity = 2
  max_capacity = 8
}

resource "aws_appautoscaling_policy" "replicas_cpu" {
  name = "aurora-reader-cpu"
  service_namespace = aws_appautoscaling_target.replicas.service_namespace
  resource_id = aws_appautoscaling_target.replicas.resource_id
  scalable_dimension = aws_appautoscaling_target.replicas.scalable_dimension
  policy_type = "TargetTrackingScaling"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }
    target_value = 60
    scale_in_cooldown = 300
    scale_out_cooldown = 60
  }
}
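To get a feel for how a 60% CPU target translates into replica counts, a proportional rule is a useful mental model. The sketch below is an approximation only: Application Auto Scaling does not publish its exact algorithm, and the function name and defaults are assumptions that mirror the min/max/target values in the policy above.

```python
import math


def desired_replicas(
    current: int,
    avg_cpu: float,
    target_cpu: float = 60.0,
    min_capacity: int = 2,
    max_capacity: int = 8,
) -> int:
    """Approximate target tracking: scale the replica count in proportion
    to how far average reader CPU is from the target, then clamp."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    proposed = math.ceil(current * avg_cpu / target_cpu)
    return max(min_capacity, min(max_capacity, proposed))


if __name__ == "__main__":
    for current, cpu in [(2, 90.0), (4, 30.0), (8, 95.0)]:
        print(current, cpu, desired_replicas(current, cpu))
```

The clamp is the part that matters for capacity planning: once the cluster sits at max_capacity, sustained CPU above target means the ceiling, not the policy, is the constraint.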
Aurora Serverless v2 scales an instance’s capacity vertically in fine-grained Aurora Capacity Units (ACUs, each ~2 GiB of memory) without disconnecting clients. Set a serverlessv2_scaling_configuration on the cluster and create instances with class db.serverless. It shines for spiky or unpredictable load and for non-prod environments that should scale toward zero overnight:
resource "aws_rds_cluster" "main" {
  # ...as above...
  serverlessv2_scaling_configuration {
    min_capacity = 0 # auto-pause needs a 0 ACU minimum; use 0.5 or higher where pausing is not wanted
    max_capacity = 16
    seconds_until_auto_pause = 3600 # pause after an hour with no connections
  }
}
resource "aws_rds_cluster_instance" "serverless_reader" {
  identifier = "prod-app-sv2-1"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class = "db.serverless"
  engine = aws_rds_cluster.main.engine
  promotion_tier = 1
}
A common, pragmatic pattern: a provisioned writer for predictable baseline write throughput, plus Serverless v2 readers that absorb read spikes. You can mix db.serverless and provisioned instances in the same cluster.
Step 3 — Failover behavior and tuning reconnection
When the writer fails (or you reboot it with failover), Aurora promotes a replica and repoints the cluster endpoint CNAME. Promotion tier (promotion_tier, 0–15) decides who wins: Aurora prefers the lowest-numbered tier, breaking ties by the replica largest in size so the new writer can handle the load. Pin your beefy replicas to tier 0 or 1 and keep tiny analytics nodes at tier 15 so they are never promoted into the writer role.
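The selection rule described above (lowest tier first, then largest instance) can be sketched in a few lines to sanity-check a planned topology. This is a simplified model, not Aurora's actual code: the Replica type is hypothetical and memory in GiB is used as a stand-in for instance size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    identifier: str
    promotion_tier: int  # 0-15, lower is preferred
    memory_gib: int      # assumption: memory stands in for "size"


def pick_failover_target(replicas: list[Replica]) -> Replica:
    """Lowest promotion tier wins; ties go to the largest replica."""
    if not replicas:
        raise ValueError("no replicas available to promote")
    return min(replicas, key=lambda r: (r.promotion_tier, -r.memory_gib))


if __name__ == "__main__":
    fleet = [
        Replica("prod-app-1", promotion_tier=1, memory_gib=32),
        Replica("prod-app-2", promotion_tier=1, memory_gib=64),
        Replica("analytics-1", promotion_tier=15, memory_gib=16),
    ]
    print(pick_failover_target(fleet).identifier)
```

Running a planned fleet through a check like this before applying Terraform makes it obvious when a small node has been left at a low tier by default.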
The database side of failover is fast. The slow part is almost always the client:
- DNS TTL. The cluster endpoint CNAME has a short TTL (around 5 seconds). JVMs that cache DNS forever are the classic offender: set networkaddress.cache.ttl to a low value or you will keep hammering the old IP.
- Connection pool validation. Configure your pool (HikariCP, PgBouncer, etc.) to test connections on borrow and evict dead ones quickly, instead of handing the app a half-open socket to the demoted instance.
- Use RDS Proxy. It absorbs the reconnect storm and routes clients to the new writer, which is the single biggest lever for shrinking application-observed failover time.
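Where the client has to reconnect on its own, retrying with capped exponential backoff and jitter keeps a fleet of clients from hitting the new writer in lockstep. The following is a minimal sketch, not a drop-in for any particular driver: connect is a hypothetical zero-argument callable that raises ConnectionError on failure, and the delays are illustrative defaults.

```python
import random
import time


def connect_with_backoff(
    connect,
    attempts: int = 6,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Call connect() until it succeeds, sleeping a random amount up to an
    exponentially growing cap between attempts ("full jitter")."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # no point sleeping after the final attempt
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The sleep and rng parameters exist so the behavior can be tested without waiting; in production the defaults are what you want. Keep the total retry budget close to the failover time you measured, so a genuinely dead database still surfaces as an error quickly.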
Trigger a controlled failover to a specific target:
aws rds failover-db-cluster \
--db-cluster-identifier prod-app \
--target-db-instance-identifier prod-app-1
Step 4 — Cross-region DR with Aurora Global Database
Multi-AZ protects you from instance and zone failure. It does nothing for a regional outage or a region-wide control-plane event. Aurora Global Database replicates from a primary region to up to five secondary regions using the storage layer’s dedicated replication infrastructure, with typical cross-region lag around one second and negligible impact on primary write performance.
resource "aws_rds_global_cluster" "global" {
  global_cluster_identifier = "prod-app-global"
  engine = "aurora-postgresql"
  engine_version = "16.4"
}

# Primary regional cluster joins the global cluster
resource "aws_rds_cluster" "primary" {
  provider = aws.us_east_1
  cluster_identifier = "prod-app-use1"
  global_cluster_identifier = aws_rds_global_cluster.global.id
  engine = aws_rds_global_cluster.global.engine
  engine_version = aws_rds_global_cluster.global.engine_version
  # ...storage, subnets, security groups...
}

# Secondary read-only cluster in another region
resource "aws_rds_cluster" "secondary" {
  provider = aws.eu_west_1
  cluster_identifier = "prod-app-euw1"
  global_cluster_identifier = aws_rds_global_cluster.global.id
  engine = aws_rds_global_cluster.global.engine
  engine_version = aws_rds_global_cluster.global.engine_version
  source_region = "us-east-1"
  # ...storage, subnets, security groups...
}
The secondary region serves low-latency reads to local users. For DR you have two recovery modes:
- Managed planned failover — for a healthy primary (e.g. a region evacuation drill). Aurora coordinates so no data is lost: RPO is effectively zero, and it demotes the old primary to a secondary so the global topology is preserved.
- Unplanned (“detach and promote”) — when the primary region is gone. You detach the secondary from the global cluster and promote it to a standalone writable cluster. RPO equals whatever replication lag existed at the moment of failure (typically ~1 s), and RTO is dominated by how fast you can repoint application traffic and (re)build a new global cluster afterward.
# Planned, zero-RPO switchover to the secondary region
aws rds switchover-global-cluster \
--global-cluster-identifier prod-app-global \
--target-db-cluster-identifier arn:aws:rds:eu-west-1:111122223333:cluster:prod-app-euw1
Set explicit targets and write them in the runbook. A realistic posture for Global Database: RPO ~1 second, RTO of a few minutes for unplanned promotion, gated mostly by DNS/Route 53 repointing and application config, not the database promotion itself.
Step 5 — Zero-downtime schema and engine changes with blue/green
In-place major version upgrades and risky schema migrations are where teams take outages. RDS Blue/Green Deployments create a full, synchronized copy of the cluster (the green environment) replicating from production (blue). You apply your engine upgrade or schema change to green, validate it against real replicated data, and then switch over — Aurora redirects the endpoints to green, typically within a minute, with built-in guardrails that abort if replication is unhealthy or lag is too high.
aws rds create-blue-green-deployment \
--blue-green-deployment-name prod-app-pg17-upgrade \
--source arn:aws:rds:us-east-1:111122223333:cluster:prod-app \
--target-engine-version 17.2 \
--target-db-cluster-parameter-group-name prod-app-pg17
Workflow:
- Create the deployment; green spins up and begins replicating from blue.
- Apply schema DDL to green if needed. Keep changes backward compatible (additive columns, new tables) so blue and the application keep working during the window — replication from blue to green stops working if you make changes on green that conflict with incoming changes.
- Run your test suite and compare query plans against green’s endpoints.
- Switch over. Endpoints repoint to green; the old blue cluster is kept (renamed) so you can investigate or roll back by redeploying.
aws rds switchover-blue-green-deployment \
--blue-green-deployment-identifier bgd-xxxxxxxxxxxx \
--switchover-timeout 300
Blue/green is the right tool for engine upgrades and infra-level parameter changes. For purely additive application schema changes, the expand/contract pattern (deploy schema, deploy code that tolerates both shapes, backfill, then remove the old shape) is still your friend and needs no green environment.
Step 6 — Backups, point-in-time recovery, and cloning
Aurora continuously backs up to S3 with no performance penalty. With backup_retention_period set, you can restore the cluster to any second within the window. Point-in-time recovery always creates a new cluster — it never overwrites the running one — which is exactly what you want when recovering from a bad migration or an errant DELETE:
aws rds restore-db-cluster-to-point-in-time \
--db-cluster-identifier prod-app-recovered \
--source-db-cluster-identifier prod-app \
--restore-to-time 2026-04-22T09:15:00Z
For safe testing against production-sized data, use database cloning. Aurora clones use copy-on-write at the storage layer: the clone is near-instant and initially consumes almost no extra storage, diverging only as pages are written. Spin one up to test a migration or load test against real data, then throw it away:
aws rds restore-db-cluster-to-point-in-time \
--db-cluster-identifier prod-app-clone-test \
--source-db-cluster-identifier prod-app \
--restore-type copy-on-write \
--use-latest-restorable-time
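Copy-on-write is why a clone starts out nearly free and only costs storage as it diverges. The toy model below illustrates the idea with an in-memory page map; it is a teaching sketch under simplified assumptions, not how Aurora's storage layer is actually implemented, and the Volume class is hypothetical.

```python
class Volume:
    """Toy copy-on-write volume: pages are shared until one side writes."""

    def __init__(self, base=None):
        self._base = base if base is not None else {}  # shared, read-only
        self._own = {}                                 # pages written here

    def write(self, page_id, data):
        self._own[page_id] = data  # never touches the shared base

    def read(self, page_id):
        if page_id in self._own:
            return self._own[page_id]
        return self._base[page_id]

    def clone(self):
        # Freeze the current view as a shared base for both volumes.
        shared = {**self._base, **self._own}
        self._base, self._own = shared, {}
        return Volume(base=shared)

    def private_pages(self):
        """Pages written since the last clone point."""
        return len(self._own)


if __name__ == "__main__":
    prod = Volume()
    prod.write(1, "orders v1")
    test = prod.clone()
    test.write(1, "orders after migration")
    print(prod.read(1), "|", test.read(1), "|", test.private_pages())
```

Writes on either side land in that side's private pages, so a destructive migration test on the clone cannot reach production data.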
Verify
Confirm the topology, roles, and DR posture before declaring the cluster production-ready.
# Writer vs. reader roles and per-member parameter group status
aws rds describe-db-clusters --db-cluster-identifier prod-app \
--query 'DBClusters[0].DBClusterMembers[].{id:DBInstanceIdentifier,writer:IsClusterWriter,paramGroupStatus:DBClusterParameterGroupStatus}' \
--output table
# Promotion tiers — make sure tiny readers are NOT tier 0/1
aws rds describe-db-instances \
--query 'DBInstances[?DBClusterIdentifier==`prod-app`].{id:DBInstanceIdentifier,tier:PromotionTier,class:DBInstanceClass}' \
--output table
# Global cluster members and which region is primary
aws rds describe-global-clusters --global-cluster-identifier prod-app-global \
--query 'GlobalClusters[0].GlobalClusterMembers[].{arn:DBClusterArn,writer:IsWriter}' \
--output table
# Replica lag in milliseconds (CloudWatch)
aws cloudwatch get-metric-statistics --namespace AWS/RDS \
--metric-name AuroraReplicaLag --statistics Maximum --period 60 \
--start-time 2026-04-22T09:00:00Z --end-time 2026-04-22T09:30:00Z \
--dimensions Name=DBClusterIdentifier,Value=prod-app
A real verification step is a game day: in a staging cluster, run failover-db-cluster, watch AuroraReplicaLag and DatabaseConnections in CloudWatch, and time how long application requests fail. If that number is more than a few seconds, the problem is on the client (DNS caching or pool config), not Aurora.
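To put a number on "how long application requests fail", a once-per-second health probe and a small script are enough. The sketch below assumes a hypothetical probe log of (ISO timestamp, succeeded) pairs sorted by time; the log format and function name are illustrative, not part of any AWS tooling.

```python
from datetime import datetime


def longest_outage_seconds(probes):
    """Return the longest gap between the first failed probe of an outage
    and the next successful one. An outage still open at the end of the
    log is not counted."""
    longest = 0.0
    outage_start = None
    for timestamp, ok in probes:
        moment = datetime.fromisoformat(timestamp)
        if not ok and outage_start is None:
            outage_start = moment
        elif ok and outage_start is not None:
            longest = max(longest, (moment - outage_start).total_seconds())
            outage_start = None
    return longest


if __name__ == "__main__":
    log = [
        ("2026-04-22T09:15:00", True),
        ("2026-04-22T09:15:01", False),
        ("2026-04-22T09:15:04", False),
        ("2026-04-22T09:15:07", True),
    ]
    print(longest_outage_seconds(log))
```

Record this figure for every game day; a trend that creeps upward usually points at a client library or pool setting that changed, not at the database.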
Production readiness checklist
Performance Insights and parameter tuning
Enable Performance Insights on every instance (performance_insights_enabled = true, with performance_insights_kms_key_id) — it gives you the Average Active Sessions view broken down by wait event and top SQL, which is how you find the query melting a reader long before it pages you. Tune at the cluster parameter group for storage/cluster-wide settings (e.g. rds.logical_replication) and the DB parameter group for instance-level settings. Change parameters in a blue/green green environment or a clone first; never test a pending-reboot static parameter change directly on the writer.
Failover game-day runbook
Keep a short, rehearsed runbook so the on-call engineer is not improvising at 3 a.m.:
- Detect: alarm on AuroraReplicaLag, writer CPUUtilization, and DatabaseConnections. Confirm scope (instance, AZ, or region).
- Single instance/AZ: Aurora auto-fails-over to the highest-priority healthy replica. Verify the new writer with describe-db-clusters and confirm app reconnection in metrics.
- Region down: execute the Global Database failover/promote runbook, repoint Route 53 to the secondary, and announce the new primary region.
- After stabilization: rebuild redundancy: add replicas back, re-establish the global cluster, and capture the actual RTO/RPO observed for the postmortem.
Enterprise scenario
A fintech payments platform ran a provisioned Aurora PostgreSQL cluster behind RDS Proxy with two readers and a Global Database secondary in eu-west-1. Failover game days passed in under 8 seconds, so the team signed off on the HA posture. Then a routine PostgreSQL 15-to-16 blue/green upgrade switched over cleanly — and an hour later the writer’s storage replication started lagging into the secondary region, eventually breaching the 4 GB lag budget and detaching it.
Root cause: the green environment had been created from a cluster parameter group where rds.logical_replication = 1 was left on from an old Debezium CDC experiment. Logical replication slots on the new writer pinned WAL, the restart_lsn stopped advancing, and physical storage-level replication to the global secondary fell behind because WAL could not be recycled. The blue/green health checks never caught it — they validate replication into green, not downstream global lag.
The fix was to drop the orphaned slots and rebuild the global secondary, then add a guardrail so this never ships again:
-- Find slots pinning WAL on the writer
SELECT slot_name, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots WHERE slot_type = 'logical';
SELECT pg_drop_replication_slot('debezium_orders');
They also wired a CloudWatch alarm on OldestReplicationSlotLag and added a CI check that diffs the green parameter group against an approved baseline before any switchover-blue-green-deployment. Lesson: a blue/green that passes its own guardrails can still poison a downstream global cluster — the parameter group is part of your change surface, not background config.
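The CI check in that story can be as simple as a dictionary diff. The sketch below assumes the parameter group has already been exported to a name-to-value mapping (for example from describe-db-cluster-parameters output); the function name and the sample values are hypothetical.

```python
def parameter_drift(baseline: dict, candidate: dict) -> dict:
    """Return {name: (expected, actual)} for every parameter that differs
    from the approved baseline, including added or missing parameters."""
    drift = {}
    for name in sorted(set(baseline) | set(candidate)):
        expected = baseline.get(name)
        actual = candidate.get(name)
        if expected != actual:
            drift[name] = (expected, actual)
    return drift


if __name__ == "__main__":
    approved = {"rds.logical_replication": "0", "max_connections": "2000"}
    green = {"rds.logical_replication": "1", "max_connections": "2000"}
    problems = parameter_drift(approved, green)
    for name, (expected, actual) in problems.items():
        print(f"{name}: expected {expected!r}, found {actual!r}")
    raise SystemExit(1 if problems else 0)
```

Exiting non-zero on any drift makes the check usable as a pipeline gate ahead of the switchover step; intentional changes then go through an update to the baseline, which is where review happens.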
Pitfalls
- Connecting to instance endpoints in app config. After a failover the old writer is a reader; your writes start failing. Always use the cluster/proxy endpoint.
- DNS caching defeating fast failover. A JVM caching DNS for the process lifetime turns a 10-second failover into a multi-minute outage. Set a low TTL.
- Tiny readers at tier 0. Aurora may promote a db.r6g.large analytics node to writer and it falls over under production write load. Set tiers deliberately.
- Assuming Global Database is zero-RPO for unplanned failures. It is zero-RPO only for managed planned failover. An unplanned region loss costs you the in-flight replication lag.
- Treating blue/green green as throwaway-safe for any change. Non-backward-compatible DDL on green breaks replication from blue. Keep changes additive until switchover.
- Skipping the game day. A DR plan you have never executed is a hypothesis, not a plan.
Aurora hands you a distributed, self-healing storage layer and fast failover almost for free. The engineering that remains is the connection path, the promotion tiers, the cross-region story, and the runbooks — get those right and the failures your users would have noticed become a line in a CloudWatch dashboard.