
Aurora for Production: Multi-AZ Failover, Global Database, and Zero-Downtime Operations

A single db.r6g.2xlarge with Multi-AZ is not a resilient database; it is a single point of failure with a warm standby and a reconnect storm waiting to happen. Aurora changes the physics of high availability by decoupling compute from a distributed, self-healing storage layer. The architect’s job is to design the cluster topology, connection path, and runbooks so the failures Aurora handles for you stay invisible to your application. This guide walks through the decisions I make on every production Aurora cluster.

What Aurora’s storage architecture changes about HA

In standard RDS, the instance owns its storage. A Multi-AZ standby is a second full copy kept current by synchronous physical replication; failover means promoting that standby and repointing DNS. In Aurora, all instances in a cluster (the writer and every reader) attach to the same shared storage volume, replicated six ways across three Availability Zones. The instances are stateless compute over that volume.

This has several consequences that drive every design choice below:

Property                  | Standard RDS Multi-AZ                        | Aurora
Storage copies            | 2 (primary + standby)                        | 6, across 3 AZs
Replica lag               | Async for read replicas, seconds to minutes  | Typically <100 ms (no data copy, just redo)
Readers serve traffic?    | No (standby is passive)                      | Yes, all replicas are queryable
Failover target           | The one standby                              | Any replica, chosen by tier
Storage durability quorum | N/A                                          | 4-of-6 writes, 3-of-6 reads

Because readers share storage with the writer, failover does not require copying or catching up data — Aurora just promotes an existing replica. That is why Aurora failover is measured in seconds rather than the minute-plus of promoting an RDS standby. Your reconnection logic, not the database, becomes the long pole.
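
The quorum numbers in the table are easier to reason about with a little arithmetic. The sketch below is a toy model, not Aurora's implementation: it assumes the six copies are laid out two per AZ and simply counts survivors against the 4-of-6 write and 3-of-6 read quorums. The function names are illustrative.

```python
# Toy model of the storage quorum: 6 copies, assumed 2 per AZ across 3 AZs.
COPIES_PER_AZ = 2
AZ_COUNT = 3
WRITE_QUORUM = 4  # copies that must acknowledge a write
READ_QUORUM = 3   # copies needed to read and repair


def surviving_copies(failed_azs: int = 0, extra_failed_copies: int = 0) -> int:
    """Copies still healthy after losing whole AZs plus individual copies."""
    total = COPIES_PER_AZ * AZ_COUNT
    lost = failed_azs * COPIES_PER_AZ + extra_failed_copies
    return max(total - lost, 0)


def can_write(failed_azs: int = 0, extra_failed_copies: int = 0) -> bool:
    return surviving_copies(failed_azs, extra_failed_copies) >= WRITE_QUORUM


def can_read(failed_azs: int = 0, extra_failed_copies: int = 0) -> bool:
    return surviving_copies(failed_azs, extra_failed_copies) >= READ_QUORUM


print(can_write(failed_azs=1))                         # True: 4 copies left
print(can_write(failed_azs=1, extra_failed_copies=1))  # False: 3 copies left
print(can_read(failed_azs=1, extra_failed_copies=1))   # True: 3 copies left
```

In this model, losing a whole AZ leaves writes available, and losing an AZ plus one more copy still leaves enough copies to read and rebuild.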

Step 1 — Cluster topology and connection management

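An Aurora cluster exposes managed endpoints: a cluster (writer) endpoint that always resolves to the current writer, a reader endpoint that spreads connections across the replicas, and optional custom endpoints for a chosen subset of instances. You almost never connect to an instance endpoint directly in application code.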

Provision the cluster and a couple of replicas with Terraform. Spread the replicas across AZs (set availability_zone on each instance if you need a guaranteed layout) so a zone failure cannot take out every reader:

resource "aws_rds_cluster" "main" {
  cluster_identifier      = "prod-app"
  engine                  = "aurora-postgresql"
  engine_version          = "16.4"
  database_name           = "app"
  master_username         = "app_admin"
  manage_master_user_password = true # store the secret in Secrets Manager
  db_subnet_group_name    = aws_db_subnet_group.aurora.name
  vpc_security_group_ids  = [aws_security_group.aurora.id]
  storage_encrypted       = true
  kms_key_id              = aws_kms_key.aurora.arn
  backup_retention_period = 14
  preferred_backup_window = "03:00-04:00"
  deletion_protection     = true
  enabled_cloudwatch_logs_exports = ["postgresql"]
}

resource "aws_rds_cluster_instance" "writer" {
  identifier         = "prod-app-0"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.r6g.xlarge"
  engine             = aws_rds_cluster.main.engine
  promotion_tier     = 0
}

resource "aws_rds_cluster_instance" "reader" {
  count              = 2
  identifier         = "prod-app-${count.index + 1}"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.r6g.xlarge"
  engine             = aws_rds_cluster.main.engine
  promotion_tier     = 1
}

Use manage_master_user_password so the credential is generated and rotated in AWS Secrets Manager rather than living in Terraform state. Never put a real password in master_password.

Put RDS Proxy in front of the writer

Serverless and high-concurrency workloads churn connections aggressively. Every Postgres backend is a forked process with real memory cost, and a connection storm during failover can knock the new writer over before it stabilizes. RDS Proxy maintains a warm pool, multiplexes client connections onto fewer database connections, and — critically — holds client connections open and routes them to the new writer during failover, cutting failover time as the application sees it.

resource "aws_db_proxy" "main" {
  name                   = "prod-app-proxy"
  engine_family          = "POSTGRESQL"
  role_arn               = aws_iam_role.proxy.arn
  vpc_subnet_ids         = aws_db_subnet_group.aurora.subnet_ids
  vpc_security_group_ids = [aws_security_group.proxy.id]
  require_tls            = true

  auth {
    auth_scheme = "SECRETS"
    iam_auth    = "REQUIRED"
    secret_arn  = aws_rds_cluster.main.master_user_secret[0].secret_arn
  }
}

resource "aws_db_proxy_default_target_group" "main" {
  db_proxy_name = aws_db_proxy.main.name
  connection_pool_config {
    max_connections_percent      = 90
    max_idle_connections_percent = 50
  }
}

resource "aws_db_proxy_target" "main" {
  db_proxy_name         = aws_db_proxy.main.name
  target_group_name     = aws_db_proxy_default_target_group.main.name
  db_cluster_identifier = aws_rds_cluster.main.id
}

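Point your application’s write traffic at the proxy’s default (read/write) endpoint. For read traffic, create an additional read-only proxy endpoint (an aws_db_proxy_endpoint with target_role = "READ_ONLY"); RDS Proxy supports these for Aurora clusters, but does not create one for you. With iam_auth = "REQUIRED" the app fetches a short-lived token (valid for 15 minutes) instead of a static password: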

TOKEN=$(aws rds generate-db-auth-token \
  --hostname prod-app-proxy.proxy-xxxx.us-east-1.rds.amazonaws.com \
  --port 5432 --username app_admin --region us-east-1)

Step 2 — Scaling reads: auto scaling replicas vs. Serverless v2

You have two ways to add read capacity, and they are not mutually exclusive.

Provisioned replicas with Auto Scaling keep a fixed floor of instances and add more when a target metric (CPU or connections) is breached. Use this when load is steady or predictably diurnal and you want a known cost. Define a scaling target against the cluster’s reader role:

resource "aws_appautoscaling_target" "replicas" {
  service_namespace  = "rds"
  resource_id        = "cluster:${aws_rds_cluster.main.cluster_identifier}"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  min_capacity       = 2
  max_capacity       = 8
}

resource "aws_appautoscaling_policy" "replicas_cpu" {
  name               = "aurora-reader-cpu"
  service_namespace  = aws_appautoscaling_target.replicas.service_namespace
  resource_id        = aws_appautoscaling_target.replicas.resource_id
  scalable_dimension = aws_appautoscaling_target.replicas.scalable_dimension
  policy_type        = "TargetTrackingScaling"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }
    target_value       = 60
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
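
To get a feel for how a 60% CPU target translates into replica counts, here is a simplified proportional model. It is an assumption for sizing conversations only: Application Auto Scaling's actual target-tracking algorithm also involves alarms and cooldowns and is not reproduced here. The function name and parameters are illustrative.

```python
import math


def desired_replicas(current: int, observed_cpu: float, target_cpu: float,
                     min_capacity: int, max_capacity: int) -> int:
    """Scale the replica count by observed/target CPU, then clamp to the bounds."""
    if current < 1 or target_cpu <= 0:
        raise ValueError("current must be >= 1 and target_cpu must be > 0")
    raw = math.ceil(current * observed_cpu / target_cpu)
    return max(min_capacity, min(max_capacity, raw))


# Two readers running at 90% CPU against a 60% target, bounds 2..8.
print(desired_replicas(2, 90, 60, min_capacity=2, max_capacity=8))
```

The clamp is the part worth internalizing: no matter how hot the readers run, the policy cannot exceed max_capacity, so size that ceiling for your worst realistic read spike.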

Aurora Serverless v2 scales an instance’s capacity vertically in fine-grained Aurora Capacity Units (ACUs, each ~2 GiB of memory) without disconnecting clients. Set a serverlessv2_scaling_configuration on the cluster and create instances with class db.serverless. It shines for spiky or unpredictable load and for non-prod environments that should scale toward zero overnight:

resource "aws_rds_cluster" "main" {
  # ...as above...
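  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # production floor; keeps the instance warm
    max_capacity = 16
    # Non-prod only: set min_capacity = 0 and add seconds_until_auto_pause
    # (300-86400) so idle instances pause. Auto-pause requires min_capacity = 0.
  }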
}

resource "aws_rds_cluster_instance" "serverless_reader" {
  identifier         = "prod-app-sv2-1"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.main.engine
  promotion_tier     = 1
}

A common, pragmatic pattern: a provisioned writer for predictable baseline write throughput, plus Serverless v2 readers that absorb read spikes. You can mix db.serverless and provisioned instances in the same cluster.

Step 3 — Failover behavior and tuning reconnection

When the writer fails (or you reboot it with failover), Aurora promotes a replica and repoints the cluster endpoint CNAME. Promotion tier (promotion_tier, 0–15) decides who wins: Aurora prefers the lowest-numbered tier, breaking ties by the replica largest in size so the new writer can handle the load. Pin your beefy replicas to tier 0 or 1 and keep tiny analytics nodes at tier 15 so they are never promoted into the writer role.
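
The selection rule described above can be sketched in a few lines. This only models the documented ordering (lowest tier first, then largest instance); the Replica type and the use of memory as a stand-in for "size" are assumptions for illustration, and Aurora's behavior when tier and size both tie is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    identifier: str
    promotion_tier: int  # 0-15, lower is preferred
    memory_gib: int      # hypothetical stand-in for instance size


def pick_failover_target(replicas: list[Replica]) -> Replica:
    """Lowest promotion tier wins; among equals, the largest instance."""
    if not replicas:
        raise ValueError("no replicas available to promote")
    return min(replicas, key=lambda r: (r.promotion_tier, -r.memory_gib))


candidates = [
    Replica("prod-app-1", promotion_tier=1, memory_gib=32),
    Replica("prod-app-2", promotion_tier=1, memory_gib=64),
    Replica("analytics-1", promotion_tier=15, memory_gib=256),
]
print(pick_failover_target(candidates).identifier)  # prod-app-2
```

Note that the large analytics node loses despite its size: tier is evaluated before size, which is exactly why tier 15 keeps it out of the writer role.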

The database side of failover is fast. The slow part is almost always the client:

  1. DNS TTL. The cluster endpoint CNAME has a short TTL (around 5 seconds). JVMs that cache DNS forever are the classic offender — set networkaddress.cache.ttl to a low value or you will keep hammering the old IP.
  2. Connection pool validation. Configure your pool (HikariCP, PgBouncer, etc.) to test connections on borrow and evict dead ones quickly, instead of handing the app a half-open socket to the demoted instance.
  3. Use RDS Proxy. It absorbs the reconnect storm and routes clients to the new writer, which is the single biggest lever for shrinking application-observed failover time.
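
For clients that connect without a proxy, the first two points boil down to "resolve fresh, retry with backoff". A minimal sketch using only the standard library follows; connect_with_backoff and resolve are illustrative names, and a real application would pass its database driver's connect call as the connect argument.

```python
import random
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def resolve(hostname: str, port: int) -> str:
    """Do a fresh DNS lookup on every call so a repointed CNAME is picked up."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]


def connect_with_backoff(connect: Callable[[], T], attempts: int = 6,
                         base_delay: float = 0.2, max_delay: float = 5.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, sleeping with full jitter between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as exc:  # refused/reset connections, timeouts, DNS errors
            last_error = exc
            if attempt < attempts - 1:
                delay = min(max_delay, base_delay * (2 ** attempt))
                # Jitter spreads reconnects so clients do not stampede the new writer.
                sleep(random.uniform(0, delay))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

A caller might pass something like lambda: socket.create_connection((resolve(host, 5432), 5432), timeout=3), where host is your cluster endpoint. The jitter matters as much as the backoff: without it, every client retries on the same schedule and recreates the storm.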

Trigger a controlled failover to a specific target:

aws rds failover-db-cluster \
  --db-cluster-identifier prod-app \
  --target-db-instance-identifier prod-app-1

Step 4 — Cross-region DR with Aurora Global Database

Multi-AZ protects you from instance and zone failure. It does nothing for a regional outage or a region-wide control-plane event. Aurora Global Database replicates from a primary region to up to five secondary regions using the storage layer’s dedicated replication infrastructure, with typical cross-region lag around one second and negligible impact on primary write performance.

resource "aws_rds_global_cluster" "global" {
  global_cluster_identifier = "prod-app-global"
  engine                    = "aurora-postgresql"
  engine_version            = "16.4"
}

# Primary regional cluster joins the global cluster
resource "aws_rds_cluster" "primary" {
  provider                  = aws.us_east_1
  cluster_identifier        = "prod-app-use1"
  global_cluster_identifier = aws_rds_global_cluster.global.id
  engine                    = aws_rds_global_cluster.global.engine
  engine_version            = aws_rds_global_cluster.global.engine_version
  # ...storage, subnets, security groups...
}

# Secondary read-only cluster in another region
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.eu_west_1
  cluster_identifier        = "prod-app-euw1"
  global_cluster_identifier = aws_rds_global_cluster.global.id
  engine                    = aws_rds_global_cluster.global.engine
  engine_version            = aws_rds_global_cluster.global.engine_version
  source_region             = "us-east-1"
  # ...storage, subnets, security groups...
}

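The secondary region serves low-latency reads to local users. For DR you have two recovery modes. A planned switchover (switchover-global-cluster) waits for the secondary to catch up before swapping roles, so RPO is zero; use it for drills and region rotation. An unplanned failover (failover-global-cluster with --allow-data-loss) promotes the secondary immediately when the primary region is unreachable and can lose the most recent writes that had not yet replicated.

# Planned, zero-RPO switchover to the secondary region
aws rds switchover-global-cluster \
  --global-cluster-identifier prod-app-global \
  --target-db-cluster-identifier arn:aws:rds:eu-west-1:111122223333:cluster:prod-app-euw1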

Set explicit targets and write them in the runbook. A realistic posture for Global Database: RPO ~1 second, RTO of a few minutes for unplanned promotion, gated mostly by DNS/Route 53 repointing and application config, not the database promotion itself.

Step 5 — Zero-downtime schema and engine changes with blue/green

In-place major version upgrades and risky schema migrations are where teams take outages. RDS Blue/Green Deployments create a full, synchronized copy of the cluster (the green environment) replicating from production (blue). You apply your engine upgrade or schema change to green, validate it against real replicated data, and then switch over — Aurora redirects the endpoints to green, typically within a minute, with built-in guardrails that abort if replication is unhealthy or lag is too high.

aws rds create-blue-green-deployment \
  --blue-green-deployment-name prod-app-pg17-upgrade \
  --source arn:aws:rds:us-east-1:111122223333:cluster:prod-app \
  --target-engine-version 17.2 \
  --target-db-cluster-parameter-group-name prod-app-pg17

Workflow:

  1. Create the deployment; green spins up and begins replicating from blue.
  2. Apply schema DDL to green if needed. Keep changes backward compatible (additive columns, new tables) so blue and the application keep working during the window — replication from blue to green stops working if you make changes on green that conflict with incoming changes.
  3. Run your test suite and compare query plans against green’s endpoints.
  4. Switch over. Endpoints repoint to green; the old blue cluster is kept (renamed) so you can investigate or roll back by redeploying.

Run the switchover with a timeout (in seconds) so it aborts rather than hangs:

aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier bgd-xxxxxxxxxxxx \
  --switchover-timeout 300

Blue/green is the right tool for engine upgrades and infra-level parameter changes. For purely additive application schema changes, the expand/contract pattern (deploy schema, deploy code that tolerates both shapes, backfill, then remove the old shape) is still your friend and needs no green environment.
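
The expand/contract sequence is easy to rehearse locally. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL; the users table and the full_name/display_name columns are hypothetical, and a production backfill would run in batches rather than one UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (full_name) VALUES (?)",
                 [("Ada Lovelace",), ("Alan Turing",)])

# 1. Expand: an additive, nullable column. Code that ignores it keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# 2. Deploy code that tolerates both shapes: write both, read new-or-old.
def create_user(name: str) -> None:
    conn.execute("INSERT INTO users (full_name, display_name) VALUES (?, ?)",
                 (name, name))


def read_display_name(user_id: int) -> str:
    row = conn.execute(
        "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


create_user("Grace Hopper")

# 3. Backfill rows written before the new column existed.
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")
conn.commit()

# 4. Contract (not run here): once nothing reads full_name, drop it in a later release.
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL").fetchone()[0]
print(remaining)  # 0 rows left to backfill
```

The COALESCE read is what makes the middle of the migration safe: at no point does a reader depend on the backfill having finished.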

Step 6 — Backups, point-in-time recovery, and cloning

Aurora continuously backs up to S3 with no performance penalty. With backup_retention_period set, you can restore the cluster to any second within the window. Point-in-time recovery always creates a new cluster — it never overwrites the running one — which is exactly what you want when recovering from a bad migration or an errant DELETE:

aws rds restore-db-cluster-to-point-in-time \
  --db-cluster-identifier prod-app-recovered \
  --source-db-cluster-identifier prod-app \
  --restore-to-time 2026-04-22T09:15:00Z

For safe testing against production-sized data, use database cloning. Aurora clones use copy-on-write at the storage layer: the clone is near-instant and initially consumes almost no extra storage, diverging only as pages are written. Spin one up to test a migration or load test against real data, then throw it away:

aws rds restore-db-cluster-to-point-in-time \
  --db-cluster-identifier prod-app-clone-test \
  --source-db-cluster-identifier prod-app \
  --restore-type copy-on-write \
  --use-latest-restorable-time
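
If copy-on-write is unfamiliar, a toy page store shows why a clone is cheap at first and only grows as it diverges. This is a conceptual model, not Aurora's storage protocol; the Volume class and its methods are invented for illustration.

```python
class Volume:
    """Toy page store: a clone shares pages with its source until either side writes."""

    def __init__(self, shared=None):
        self._shared = shared if shared is not None else {}  # frozen pages, never mutated
        self._own = {}                                       # pages written since the last clone

    def write(self, page_no: int, data: bytes) -> None:
        self._own[page_no] = data  # the write lands privately; shared pages stay intact

    def read(self, page_no: int) -> bytes:
        if page_no in self._own:
            return self._own[page_no]
        return self._shared[page_no]

    def clone(self) -> "Volume":
        # Freeze the current view; source and clone both start with no private pages.
        frozen = {**self._shared, **self._own}
        self._shared, self._own = frozen, {}
        return Volume(shared=frozen)

    def private_pages(self) -> int:
        return len(self._own)


source = Volume()
source.write(0, b"orders-v1")
source.write(1, b"users-v1")

clone = source.clone()
clone.write(0, b"orders-migrated")  # only this page diverges
print(clone.private_pages())        # 1
print(source.read(0))               # b'orders-v1'
```

The same property is what makes throwaway clones practical for migration rehearsals: in this model, extra storage scales with what the test writes, not with the size of the source.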

Verify

Confirm the topology, roles, and DR posture before declaring the cluster production-ready.

# Writer vs. reader roles and parameter-group sync status per member
aws rds describe-db-clusters --db-cluster-identifier prod-app \
  --query 'DBClusters[0].DBClusterMembers[].{id:DBInstanceIdentifier,writer:IsClusterWriter,paramGroupStatus:DBClusterParameterGroupStatus}' \
  --output table

# Promotion tiers — make sure tiny readers are NOT tier 0/1
aws rds describe-db-instances \
  --query 'DBInstances[?DBClusterIdentifier==`prod-app`].{id:DBInstanceIdentifier,tier:PromotionTier,class:DBInstanceClass}' \
  --output table

# Global cluster members and which region is primary
aws rds describe-global-clusters --global-cluster-identifier prod-app-global \
  --query 'GlobalClusters[0].GlobalClusterMembers[].{arn:DBClusterArn,writer:IsWriter}' \
  --output table

# Replica lag in milliseconds (CloudWatch)
aws cloudwatch get-metric-statistics --namespace AWS/RDS \
  --metric-name AuroraReplicaLag --statistics Maximum --period 60 \
  --start-time 2026-04-22T09:00:00Z --end-time 2026-04-22T09:30:00Z \
  --dimensions Name=DBClusterIdentifier,Value=prod-app

A real verification step is a game day: in a staging cluster, run failover-db-cluster, watch AuroraReplicaLag and DatabaseConnections in CloudWatch, and time how long application requests fail. If that number is more than a few seconds, the problem is on the client (DNS caching or pool config), not Aurora.
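
To put a number on "how long application requests fail", a small probe loop is enough. The sketch below is one way to do it under stated assumptions: probe is any callable you supply that returns True when a request succeeds (for example a SELECT 1 through your normal connection path), and the result is only as precise as the polling interval.

```python
import time
from typing import Callable


def longest_outage(probe: Callable[[], bool], duration_s: float,
                   interval_s: float = 0.5,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll probe() for duration_s; return the longest continuous failure window in seconds."""
    start = clock()
    outage_started = None
    worst = 0.0
    while clock() - start < duration_s:
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # any error counts as a failed request
        now = clock()
        if ok:
            if outage_started is not None:
                worst = max(worst, now - outage_started)
                outage_started = None
        elif outage_started is None:
            outage_started = now
        sleep(interval_s)
    if outage_started is not None:  # still failing when the window closed
        worst = max(worst, clock() - outage_started)
    return worst
```

Start the loop, trigger the failover, and record the returned value in the game-day notes. The clock and sleep parameters exist so the function can be tested without waiting in real time.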

Production readiness checklist

Performance Insights and parameter tuning

Enable Performance Insights on every instance (performance_insights_enabled = true, with performance_insights_kms_key_id) — it gives you the Average Active Sessions view broken down by wait event and top SQL, which is how you find the query melting a reader long before it pages you. Tune at the cluster parameter group for storage/cluster-wide settings (e.g. rds.logical_replication) and the DB parameter group for instance-level settings. Change parameters in a blue/green green environment or a clone first; never test a pending-reboot static parameter change directly on the writer.

Failover game-day runbook

Keep a short, rehearsed runbook so the on-call engineer is not improvising at 3 a.m.:

  1. Detect — alarm on AuroraReplicaLag, writer CPUUtilization, and DatabaseConnections. Confirm scope (instance, AZ, or region).
  2. Single instance/AZ — Aurora auto-fails-over to the highest-priority healthy replica. Verify the new writer with describe-db-clusters and confirm app reconnection in metrics.
  3. Region down — execute the Global Database failover/promote runbook, repoint Route 53 to the secondary, and announce the new primary region.
  4. After stabilization — rebuild redundancy: add replicas back, re-establish the global cluster, and capture the actual RTO/RPO observed for the postmortem.

Enterprise scenario

A fintech payments platform ran a provisioned Aurora PostgreSQL cluster behind RDS Proxy with two readers and a Global Database secondary in eu-west-1. Failover game days passed in under 8 seconds, so the team signed off on the HA posture. Then a routine PostgreSQL 15-to-16 blue/green upgrade switched over cleanly, and an hour later replication from the writer into the secondary region started lagging, eventually breaching the 4 GB lag budget and detaching the secondary.

Root cause: the green environment had been created from a cluster parameter group where rds.logical_replication = 1 was left on from an old Debezium CDC experiment. Logical replication slots on the new writer pinned WAL, the restart_lsn stopped advancing, and physical storage-level replication to the global secondary fell behind because WAL could not be recycled. The blue/green health checks never caught it — they validate replication into green, not downstream global lag.

The fix was to drop the orphaned slots and rebuild the global secondary, then add a guardrail so this never ships again:

-- Find slots pinning WAL on the writer
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots WHERE slot_type = 'logical';

SELECT pg_drop_replication_slot('debezium_orders');

They also wired a CloudWatch alarm on OldestReplicationSlotLag and added a CI check that diffs the green parameter group against an approved baseline before any switchover-blue-green-deployment. Lesson: a blue/green that passes its own guardrails can still poison a downstream global cluster — the parameter group is part of your change surface, not background config.
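
The CI check in that story can be as simple as a dictionary diff. The sketch below is an assumption about how such a guardrail might look, not the team's actual code: diff_parameters is an illustrative name, and the two dictionaries would in practice be built from the approved baseline file and from the output of aws rds describe-db-cluster-parameters for the green parameter group.

```python
def diff_parameters(baseline: dict[str, str],
                    candidate: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (baseline_value, candidate_value)} for every parameter that differs."""
    drift = {}
    for name in sorted(set(baseline) | set(candidate)):
        expected = baseline.get(name)  # None means "not in the baseline"
        actual = candidate.get(name)   # None means "missing from the candidate"
        if expected != actual:
            drift[name] = (expected, actual)
    return drift


approved = {"rds.logical_replication": "0", "max_connections": "2000"}
green = {"rds.logical_replication": "1", "max_connections": "2000"}

drift = diff_parameters(approved, green)
print(drift)  # {'rds.logical_replication': ('0', '1')}
```

In a pipeline, a non-empty result would fail the job before the switchover step runs, forcing someone to either fix the parameter group or consciously update the baseline.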

Pitfalls

Aurora hands you a distributed, self-healing storage layer and fast failover almost for free. The engineering that remains is the connection path, the promotion tiers, the cross-region story, and the runbooks — get those right and the failures your users would have noticed become a line in a CloudWatch dashboard.
