Zero-Downtime RDS and Aurora Upgrades with Blue/Green Deployments

The riskiest maintenance window most platform teams run is a major engine upgrade. An in-place aws rds modify-db-instance --engine-version on a production Postgres or MySQL database takes the writer offline for minutes to tens of minutes, is effectively irreversible once it starts, and gives you no way to test the new engine against real traffic before you commit. RDS Blue/Green Deployments turn that one-way door into a rehearsed cutover: AWS stands up a staging copy of your database (the green environment), keeps it in sync with production (blue) via logical replication, lets you upgrade and validate green at your leisure, and then performs a switchover that typically blocks writes for well under a minute. This guide walks the full lifecycle the way I run it on production fleets, including the guardrails that decide whether a switchover is allowed to proceed and the cleanup that catches teams off guard afterward.

How Blue/Green Deployments actually work

A Blue/Green Deployment creates a managed pair:

Green is not a read replica in the classic sense. It is a separate instance or cluster with its own DNS endpoints, on which you can make changes that would be impossible on a physical replica: a higher engine version, a different DB parameter group, a larger instance class, or schema DDL. Logical replication carries row-level changes from blue, so green can have a different binary format and still stay in sync.

The replication direction matters enormously. Replication is one-way, blue to green. Writes you issue directly on green are not sent back to blue, and they can collide with replicated changes and break replication. Treat green as read-only for application traffic; the only writes you make there are deliberate schema/upgrade operations.

Property                 | Blue (production)    | Green (staging)
Serves app traffic       | Yes, read + write    | No (until switchover)
Engine version           | Current              | Can be upgraded
Parameter group          | Current              | Can be changed
Instance class / storage | Current              | Can be resized
Replication role         | Source               | Target (logical)
Endpoints                | Production endpoints | Separate, temporary endpoints

At switchover, RDS does the endpoint swap for you: green is renamed to take over blue’s endpoint names, blue is renamed with an -old1 suffix and kept around (no longer receiving traffic). Applications that connect by the cluster/instance endpoint name keep working without a connection-string change.

RDS Blue/Green supports RDS for MySQL, RDS for MariaDB, RDS for PostgreSQL, Aurora MySQL-Compatible, and Aurora PostgreSQL. The underlying sync uses binlog-based replication for MySQL/MariaDB engines and PostgreSQL logical replication for Postgres engines, which is why the relevant parameters (binlog_format = ROW, or rds.logical_replication = 1) must be enabled on blue before the deployment can be created.

Step 1 — Use cases worth a Blue/Green Deployment

Blue/Green earns its operational overhead for changes that are slow, risky, or irreversible in place:

If your change is a trivial dynamic parameter tweak or a minor-version patch within the same major line, Blue/Green is overkill — apply_immediately or a normal maintenance window is fine. Reach for Blue/Green when the cost of a long outage or an un-rehearsed cutover is the thing you are trying to eliminate.

Step 2 — Prerequisites on the blue database

Logical replication must be enabled before you create the deployment, and enabling it is usually a static-parameter change requiring a reboot. So this is a pre-work step, not part of the cutover.

For RDS for PostgreSQL, set rds.logical_replication = 1 in a custom DB parameter group and reboot:

resource "aws_db_parameter_group" "pg_blue" {
  name   = "prod-pg15-blue" # blue is still on the pre-upgrade major version
  family = "postgres15"

  parameter {
    name         = "rds.logical_replication"
    value        = "1"
    apply_method = "pending-reboot" # static parameter
  }
}

For RDS for MySQL / MariaDB, automated backups must be enabled and binary logging must be in row format:

# MySQL/MariaDB: backups on (binlogs require a non-zero retention),
# and ROW binlog format on the cluster/instance parameter group.
aws rds modify-db-parameter-group \
  --db-parameter-group-name prod-mysql80-blue \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=pending-reboot"

Aurora MySQL requires binlog replication to be enabled at the cluster level (binlog_format = ROW on the cluster parameter group) so the green cluster can be fed. Aurora PostgreSQL uses the same rds.logical_replication flag. Confirm the reboot has happened and the parameter is in-sync before proceeding — a deployment created against a database that has not actually picked up the parameter will fail to start replication.
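A minimal sketch of that confirmation, assuming the instance identifier prod-app used elsewhere in this guide. The helper name params_in_sync is hypothetical; ParameterApplyStatus is the field describe-db-instances reports for each attached parameter group.

```shell
# Succeeds only if every whitespace-separated status passed in is "in-sync".
# An empty list fails, so a mistyped identifier cannot pass silently.
params_in_sync() {
  [ -n "$1" ] || return 1
  for status in $1; do
    [ "$status" = "in-sync" ] || return 1
  done
  return 0
}

# Usage (assumed instance id; requires AWS CLI credentials):
#   statuses=$(aws rds describe-db-instances \
#     --db-instance-identifier prod-app \
#     --query 'DBInstances[0].DBParameterGroups[].ParameterApplyStatus' \
#     --output text)
#   params_in_sync "$statuses" || echo "parameter change not applied yet"
```

A status of pending-reboot here means the static parameter is set in the group but the running engine has not picked it up.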

Step 3 — Create the deployment

The defining choice at creation time is what is different about green. You specify the target engine version, parameter groups, and (for Aurora) cluster parameter group up front; RDS provisions green with those settings already applied.

A major Postgres upgrade with a new parameter group, via CLI:

aws rds create-blue-green-deployment \
  --blue-green-deployment-name prod-pg-16-upgrade \
  --source arn:aws:rds:ap-south-1:111122223333:db:prod-app \
  --target-engine-version 16.4 \
  --target-db-parameter-group-name prod-pg16-green \
  --tags Key=change,Value=CHG-4821 Key=team,Value=platform

For an Aurora cluster, point --source at the cluster ARN and supply cluster-level targets. Leave out --target-db-instance-class here: per the API reference it applies only to RDS DB instances, and instances in a green Aurora cluster are resized individually after the deployment is created:

aws rds create-blue-green-deployment \
  --blue-green-deployment-name aurora-mysql-3-upgrade \
  --source arn:aws:rds:ap-south-1:111122223333:cluster:prod-aurora \
  --target-engine-version 8.0.mysql_aurora.3.07.1 \
  --target-db-cluster-parameter-group-name prod-aurora-mysql3-green

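In Terraform, as far as I know the AWS provider has no standalone Blue/Green deployment resource. What it does offer is a blue_green_update block on aws_db_instance: when enabled, a change such as a new engine_version is applied through a Blue/Green Deployment that Terraform creates, switches over, and cleans up within a single apply. That gets you the short write outage but not the staged validation window, so the rehearsed workflow in this guide is better served by creating the deployment with the CLI. Check the provider documentation for engine support before relying on it:

resource "aws_db_instance" "prod_app" {
  # ... existing arguments unchanged ...
  engine_version              = "16.4"
  parameter_group_name        = aws_db_parameter_group.pg16_green.name
  allow_major_version_upgrade = true

  blue_green_update {
    # create, switch over, and clean up in one apply (no manual validation step)
    enabled = true
  }
}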

Creation takes a while — RDS clones the volume, provisions green, and establishes replication. Watch the status move through PROVISIONING to AVAILABLE:

aws rds describe-blue-green-deployments \
  --blue-green-deployment-identifier bgd-abc123 \
  --query 'BlueGreenDeployments[0].{Status:Status,Tasks:Tasks}'

Until Status is AVAILABLE, switchover is not permitted.
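Rather than re-running that command by hand, a polling loop can wait for the deployment to become usable. This is a sketch under assumptions: the function names, attempt count, and sleep interval are mine, and any status other than PROVISIONING or AVAILABLE is treated as a reason to stop and look.

```shell
# Map a deployment Status to an action: 0 = ready, 1 = keep waiting, 2 = give up.
bgd_status_action() {
  case "$1" in
    AVAILABLE) return 0 ;;
    PROVISIONING) return 1 ;;
    *) return 2 ;;   # e.g. INVALID_CONFIGURATION, DELETING, or an unexpected value
  esac
}

# Poll until the deployment is ready, a stop state appears, or attempts run out.
# $1 = deployment identifier, $2 = max attempts, $3 = seconds between attempts.
wait_for_bgd() {
  attempts=0
  while [ "$attempts" -lt "$2" ]; do
    status=$(aws rds describe-blue-green-deployments \
      --blue-green-deployment-identifier "$1" \
      --query 'BlueGreenDeployments[0].Status' --output text) || return 2
    rc=0; bgd_status_action "$status" || rc=$?
    [ "$rc" -eq 1 ] || return "$rc"
    attempts=$((attempts + 1))
    sleep "$3"
  done
  echo "timed out waiting for $1" >&2
  return 1
}

# Usage: wait_for_bgd bgd-abc123 120 30   # up to an hour, checking every 30s
```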

Step 4 — Validate green and watch replication lag

Once green is AVAILABLE, it has its own endpoints (RDS appends a generated suffix to the green identifiers). Connect to green directly and validate everything that matters before you even think about switching:

Replication lag is the single most important pre-switchover signal. For Postgres engines, lag shows up as replication slot activity on blue and apply lag on green. The cleanest cross-engine view is CloudWatch:

# ReplicaLag is the RDS instance metric for the green target. On Aurora, use the
# engine-specific lag metric instead (for example AuroraBinlogReplicaLag on Aurora MySQL).
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-app-green-xyz \
  --start-time "$(date -u -d '15 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 60 --statistics Maximum

On Postgres, also confirm the slot is active and not retaining unbounded WAL on blue:

-- run on BLUE
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained
FROM pg_replication_slots;

If retained is growing without bound, green is not keeping up (or is stuck), and you must fix that before switchover — a switchover with high lag will either be refused by the guardrails or will extend the write-blocking window while it drains.
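To run that check from a script rather than by eye, here is a sketch that reads psql's unaligned output and flags slots that are inactive or retaining more WAL than a threshold. The function name, the 1 GiB threshold in the usage line, and the BLUE_DSN variable are assumptions for illustration.

```shell
# Reads lines of "slot_name|active|retained_bytes" on stdin (psql -At format,
# where active is t or f). Prints one line per problem and returns non-zero if
# any slot is inactive, retains more than $1 bytes, or no slot exists at all.
check_slots() {
  awk -F'|' -v max="$1" '
    $2 != "t"            { print "inactive: " $1; bad = 1; next }
    ($3 + 0) > (max + 0) { print "retaining " $3 " bytes: " $1; bad = 1 }
    END {
      if (NR == 0) { print "no replication slots found"; bad = 1 }
      exit bad
    }
  '
}

# Usage (BLUE_DSN is a hypothetical connection string for blue):
#   psql "$BLUE_DSN" -At -c "
#     SELECT slot_name, active,
#            pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
#     FROM pg_replication_slots" | check_slots 1073741824
```

A single sample only tells you the current backlog. Running it a few minutes apart is what shows whether retention is actually growing.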

Step 5 — Apply schema changes and upgrades on green

This is the part that makes Blue/Green more than just an upgrade tool. Because green accepts DDL while blue serves production, you can stage schema migrations that would be painful online:

-- run directly on GREEN (it is the upgrade/staging target)
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
ALTER TABLE invoices ADD COLUMN settled_at timestamptz; -- nullable, no default: metadata-only

Two hard rules govern green-side changes:

  1. Never write application data on green. DDL is fine; INSERT/UPDATE/DELETE of business rows is not. Such writes are not replicated back to blue and create divergence that surfaces as data loss the moment you switch over.
  2. Keep the schema replication-compatible. Logical replication maps changes by table and primary key. Dropping a column that blue still writes to, or removing a table’s primary key, breaks the replication stream. Make additive, backward-compatible changes on green; do destructive changes only after switchover.

This is the expand/contract (a.k.a. parallel-change) pattern, just spread across the blue/green boundary: expand the schema on green in a way both the old and new application versions tolerate, switch over, then contract once the old version is fully gone.

Step 6 — Pre-switchover guardrails and health checks

Before issuing the switchover, run a gate. RDS itself enforces several conditions and will refuse a switchover that is unsafe, but I add explicit checks on top because a refused switchover at 2am is a worse outcome than a checklist that caught the problem at 2pm.

RDS-enforced guardrails (it will block switchover if violated):

My added health checks, scripted:

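#!/usr/bin/env bash
set -euo pipefail
MAX_LAG_SEC=5
# Fail early with a clear message if the caller forgot to export GREEN_ID.
: "${GREEN_ID:?set GREEN_ID to the green DB instance identifier}"

# Most recent per-minute maximum of ReplicaLag on green (GNU date syntax).
LAG=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value="$GREEN_ID" \
  --start-time "$(date -u -d '5 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 60 --statistics Maximum \
  --query 'Datapoints | sort_by(@,&Timestamp)[-1].Maximum' --output text)

# No datapoints yields "None"; treat anything non-numeric as a failed check
# instead of letting it compare as zero.
[[ "$LAG" =~ ^[0-9]+(\.[0-9]+)?$ ]] \
  || { echo "ABORT: no usable lag datapoint (got '${LAG}')"; exit 1; }

echo "current green replica lag: ${LAG}s (threshold ${MAX_LAG_SEC}s)"
awk -v lag="$LAG" -v max="$MAX_LAG_SEC" 'BEGIN{exit !(lag <= max)}' \
  || { echo "ABORT: lag too high, not switching over"; exit 1; }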

I also confirm: green’s connection count is near zero (no app accidentally pointed at it), monitoring and alerting are armed for the new endpoints, and the application team has acknowledged the switchover window. An acceptable lag threshold for me is typically a few seconds; anything in the tens of seconds means I wait or investigate rather than switch.
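The connection-count check can be scripted the same way. A sketch, with assumptions stated: the function name and the threshold of 2 are illustrative, and the right threshold depends on how many monitoring or replication sessions your green instance normally shows.

```shell
# Succeeds if the observed connection count is at or below the allowed maximum.
# Input containing anything other than digits and dots (for example "None"
# when CloudWatch has no datapoints) fails closed.
green_is_idle() {
  case "$1" in
    ''|*[!0-9.]*) return 1 ;;
  esac
  awk -v n="$1" -v max="$2" 'BEGIN { exit !(n + 0 <= max + 0) }'
}

# Usage (GREEN_ID as in the lag gate; GNU date syntax):
#   conns=$(aws cloudwatch get-metric-statistics \
#     --namespace AWS/RDS --metric-name DatabaseConnections \
#     --dimensions Name=DBInstanceIdentifier,Value="$GREEN_ID" \
#     --start-time "$(date -u -d '5 minutes ago' +%FT%TZ)" \
#     --end-time "$(date -u +%FT%TZ)" \
#     --period 60 --statistics Maximum \
#     --query 'Datapoints | sort_by(@,&Timestamp)[-1].Maximum' --output text)
#   green_is_idle "$conns" 2 || echo "ABORT: something is connected to green"
```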

Step 7 — Execute the switchover

Switchover is a single API call with a timeout. The timeout is the maximum duration RDS will allow the switchover (including write blocking) to take; if it cannot complete cleanly within that bound, it aborts and rolls back, leaving blue serving traffic untouched.

aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier bgd-abc123 \
  --switchover-timeout 300

What happens during the switchover window, in order:

  1. Write blocking — RDS stops accepting new writes on blue so no transactions are lost. In-flight transactions are allowed to drain.
  2. Drain replication — it waits for green to apply the last replicated changes so green is fully caught up with blue at the row level (logical replication does not produce a byte-identical copy).
  3. Endpoint swap / rename — green’s resources are renamed to take over blue’s endpoint identifiers; blue’s resources are renamed with an -old1 suffix. DNS for the production endpoints now resolves to green.
  4. Resume — green (now production) begins accepting writes under the original endpoint names.

The total write-blocked window for a healthy deployment is typically well under a minute. Applications experience it as a brief period where writes error or block, then succeed again against the new engine. Connection pools should be configured to retry transient errors so this is mostly invisible; long-lived prepared statements and cached connections will need to reconnect.
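For scripts and batch jobs that talk to the database without a pool, the same idea can be sketched as a small retry wrapper with exponential backoff. This is an illustration, not a replacement for your driver's or pool's own retry settings; the function name and the RETRY_BASE_DELAY variable are assumptions.

```shell
# Run a command up to $1 times, doubling the pause between attempts
# (1s, 2s, 4s, ... by default). Returns 0 on the first success, otherwise
# the exit code of the last failed attempt.
retry() {
  max_attempts=$1; shift
  attempt=1
  delay=${RETRY_BASE_DELAY:-1}
  while :; do
    "$@" && return 0
    rc=$?
    [ "$attempt" -ge "$max_attempts" ] && return "$rc"
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage (hypothetical connection string and statement):
#   retry 6 psql "$PROD_DSN" -c "UPDATE jobs SET state = 'done' WHERE id = 42"
```

Six attempts with the default delays cover roughly half a minute, which is in the same range as the write-blocked window described above. Only wrap statements that are safe to run more than once.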

Because the endpoint names are preserved, no connection-string change is required. But DNS TTL and pooled connections mean some clients hold the old IP briefly. Validate that your driver re-resolves DNS on reconnect — a stale pool pointing at the renamed -old1 blue is the most common “we switched over but errors continued” complaint, and it is a client-side caching issue, not a switchover failure.
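One quick way to rule out resolution problems from a given host is to compare what the production name and the renamed old-blue name resolve to. A sketch, assuming dig is installed and that PROD_HOST and OLD_HOST hold the two endpoint names; it checks DNS as seen from the machine you run it on, not what a long-lived pool has already cached.

```shell
# Succeeds when two resolved address lists are both non-empty and share no
# entry, i.e. the production name no longer points where old-blue lives.
endpoints_diverged() {
  [ -n "$1" ] && [ -n "$2" ] || return 1
  for a in $1; do
    for b in $2; do
      [ "$a" = "$b" ] && return 1
    done
  done
  return 0
}

# Usage (PROD_HOST and OLD_HOST are hypothetical variables):
#   endpoints_diverged "$(dig +short "$PROD_HOST")" "$(dig +short "$OLD_HOST")" \
#     || echo "production name still resolves to old-blue, or lookup failed"
```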

Step 8 — Rollback, cleanup, and external consumers

Rollback before switchover is trivial: blue never stopped serving traffic, so you simply delete the deployment and keep running on blue. The green resources are torn down.

Rollback after switchover is the part teams underestimate. Once you switch over, blue is renamed to -old1 and replication stops — the old blue does not continue receiving the writes that now land on green. If you discover a problem post-switchover, you cannot simply “switch back” and have a current database, because old-blue is frozen at the switchover moment. Your realistic options are:

So the rollback plan must be decided before switchover, and your confidence has to come from validating green thoroughly, not from assuming you can reverse the cutover.

Cleanup deletes the deployment object and, optionally, the now-renamed old-blue resources:

# delete the BG deployment; keep old-blue around as a safety net for a while
aws rds delete-blue-green-deployment \
  --blue-green-deployment-identifier bgd-abc123

# later, once you trust the new production, delete the renamed old-blue
aws rds delete-db-instance \
  --db-instance-identifier prod-app-old1 \
  --skip-final-snapshot # or take one; your call

I keep -old1 for a defined cooling-off period (often 24-72 hours) before deleting, so a “restore the pre-upgrade state” request has a fast answer.
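If the deletion is automated, it helps to make the cooling-off period an explicit check rather than a calendar reminder. A small sketch; the function name and the SWITCHOVER_EPOCH variable (the switchover time in Unix seconds, recorded by your runbook) are assumptions.

```shell
# Succeeds once at least $3 hours have passed between the switchover time ($1)
# and the current time ($2), both given as Unix epoch seconds.
cooloff_elapsed() {
  [ $(( $2 - $1 )) -ge $(( $3 * 3600 )) ]
}

# Usage:
#   cooloff_elapsed "$SWITCHOVER_EPOCH" "$(date +%s)" 72 \
#     && aws rds delete-db-instance --db-instance-identifier prod-app-old1 \
#          --skip-final-snapshot
```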

External replicas and CDC consumers are the sharpest edge. Anything attached to blue’s replication stream does not automatically follow to green:

Limitations and gotchas

Enterprise scenario

A fintech platform team I worked with ran a 4 TB RDS for PostgreSQL 13 writer behind RDS Proxy, feeding a Debezium CDC pipeline that drove their ledger-reconciliation service. They needed Postgres 13 -> 15 (for partitioning and planner improvements) but their compliance posture allowed a write outage of at most 60 seconds, and the reconciliation pipeline could not lose or double-process a single ledger event across the upgrade.

An in-place upgrade was a non-starter: the outage alone exceeded the budget, and there was no way to validate the new planner against production query shapes first. They used Blue/Green. They enabled rds.logical_replication, created the deployment targeting 15.x with a new parameter group, and ran their full reconciliation test suite against the green endpoint over two days. They caught one real issue — a query that regressed under the PG 15 planner — and fixed it with an index added CONCURRENTLY on green before switchover.

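The CDC pipeline was the hard part. Their runbook drained and paused the Debezium connector immediately before switchover, recorded the exact LSN it had consumed from blue as a drain checkpoint, ran the switchover (write-blocked window measured at 19 seconds), then re-created the connector against the new production database with a snapshot mode of never so it streamed only new changes. A caveat if you copy this: green has its own WAL, so an LSN recorded on blue does not identify the same position on green, and the connector needs a new replication slot there. The no-gap, no-replay result depends on application writes being quiesced between the drain and the creation of the new slot.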

# the cutover itself: tight timeout, lag pre-checked to < 3s
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier "$BGD_ID" \
  --switchover-timeout 60

They kept prod-app-old1 for 72 hours as a rollback anchor, then deleted it. Total customer-visible write interruption: under 20 seconds, against a budget of 60, with the new planner already validated. The lesson the team internalized: Blue/Green made the database cutover easy; the engineering effort was almost entirely in re-anchoring the CDC consumer cleanly, which the runbook had to own explicitly because RDS does nothing for downstream consumers automatically.

Verify

After switchover, confirm you are actually running on the new green and that nothing is silently still pointed at old-blue:

# 1. Engine version on the production endpoint is the upgraded one
psql "host=prod-app.xxxx.ap-south-1.rds.amazonaws.com dbname=app" \
  -c "SELECT version();"

# 2. The renamed old-blue exists and is the previous version (rollback anchor)
aws rds describe-db-instances \
  --db-instance-identifier prod-app-old1 \
  --query 'DBInstances[0].{Engine:EngineVersion,Status:DBInstanceStatus}'

# 3. The Blue/Green deployment shows SWITCHOVER_COMPLETED
aws rds describe-blue-green-deployments \
  --blue-green-deployment-identifier "$BGD_ID" \
  --query 'BlueGreenDeployments[0].Status'

# 4. DB connection count is nominal on the new endpoint (check application error rate in your own monitoring)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=prod-app \
  --start-time "$(date -u -d '10 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" --period 60 --statistics Average

Then re-point and resume any external replicas and CDC consumers, watch their lag drain to zero, and only after that delete old-blue.
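The first verification step can be turned into a pass/fail check for a runbook. A sketch for PostgreSQL only, assuming the version string format that SELECT version() normally returns; the function name and PROD_DSN are hypothetical.

```shell
# Succeeds if a "SELECT version()" string reports the expected PostgreSQL
# major version, e.g. "PostgreSQL 16.4 on aarch64-unknown-linux-gnu, ..." and 16.
pg_major_is() {
  case "$1" in
    "PostgreSQL $2."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   pg_major_is "$(psql "$PROD_DSN" -At -c 'SELECT version();')" 16 \
#     || echo "production endpoint is not on the upgraded engine"
```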

Pre-flight checklist
