There is one operation in modern infrastructure that engineers are most afraid of, and with good reason: changing a database. Compute is replaceable, storage can be re-provisioned, code can be redeployed, but data is the irreplaceable asset, and any operation that touches it carries an asymmetric risk — the upside is “the migration succeeds” and the downside is “the company loses several days of revenue and several years of customer trust.”
This lesson is about doing the scary thing safely. We will cover three classes of operation:
- Online schema changes on a single live database (add columns, change types, add indexes, partition tables) without locking out applications
- Blue-green database migrations — provisioning a parallel database, replicating into it, validating it, then cutting over with seconds of downtime instead of hours
- Zero-downtime database engine migrations — moving from MySQL 5.7 → 8.0, PostgreSQL 11 → 16, Oracle → PostgreSQL, on-prem → cloud, or instance class upgrades that require a restart
Across all three, the unifying discipline is expand-contract: you never make a destructive schema change directly. You expand the schema to support both the old and the new shape simultaneously, migrate readers and writers across, then contract by removing the old shape once nothing references it. This pattern, more than any tool, is what enables zero-downtime database evolution.
The other unifying discipline is that every step is an Ansible-controlled idempotent operation with explicit checkpoints. Database migrations performed by humans on a CLI are the highest-risk operation in IT. Encoding them as playbooks with mandatory verification gates eliminates the most common failure modes (forgot a step, ran out of order, no rollback plan, no evidence trail).
This lesson assumes familiarity with the previous lessons in the Tier 5 wave. The CHG-gate pattern from D7 is mandatory for every database migration. The evidence-bundle pattern from D1 produces the audit artefact. The DR pattern from D2 is the fallback when a migration goes wrong.
1. The expand-contract pattern: the most important idea in this lesson
If you remember nothing else from this lesson, remember this:
Never make a destructive schema change in a single deployment. Always expand the schema, migrate, then contract.
Concretely, suppose you want to rename a column users.fullname → users.full_name. The naive approach is ALTER TABLE users RENAME COLUMN fullname TO full_name. This requires:
- Application code that reads
fullnameto be deployed simultaneously with code that readsfull_name - Zero requests in-flight using the old column at the moment of cutover
- A perfect rollback plan if the application breaks
In practice, this is impossible in any production system bigger than a single replica. The expand-contract version:
| Phase | Schema | Application | Goal |
|---|---|---|---|
| 1. Expand | Add full_name, copy data, install trigger that keeps both columns in sync |
App still reads/writes fullname only |
Both columns exist; old code unaffected |
| 2. Migrate writes | (no schema change) | Deploy app that writes to full_name (trigger keeps fullname in sync) |
New writes go to full_name; legacy reads still work |
| 3. Migrate reads | (no schema change) | Deploy app that reads from full_name |
All app code uses full_name |
| 4. Contract | Drop trigger, drop fullname |
(no app change) | Old column gone; schema clean |
Each phase is independently deployable, independently rollbackable, and never requires the application and database to be deployed simultaneously. Production databases that follow this discipline rigorously can sustain hundreds of schema changes per quarter with zero downtime.
The Ansible role we will build executes phases 1 and 4 (the schema-touching phases). Phases 2 and 3 are application deploys that happen between Ansible runs.
# roles/db_migration_expand/tasks/main.yml
---
- name: Phase 1 — add new column nullable
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
ALTER TABLE {{ table }}
ADD COLUMN IF NOT EXISTS {{ new_column }} {{ new_type }};
no_log: false
- name: Phase 1 — backfill data in batches
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
UPDATE {{ table }}
SET {{ new_column }} = {{ backfill_expr }}
WHERE {{ new_column }} IS NULL
AND id BETWEEN {{ batch_start }} AND {{ batch_end }};
loop: "{{ range(0, max_id, batch_size) | list }}"
loop_control:
loop_var: batch_start
extended: true
vars:
batch_end: "{{ batch_start + batch_size - 1 }}"
- name: Phase 1 — install dual-write trigger
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
CREATE OR REPLACE FUNCTION {{ table }}_dual_write_{{ new_column }}()
RETURNS TRIGGER AS $$
BEGIN
IF NEW.{{ old_column }} IS NOT NULL AND NEW.{{ new_column }} IS NULL THEN
NEW.{{ new_column }} := NEW.{{ old_column }};
ELSIF NEW.{{ new_column }} IS NOT NULL AND NEW.{{ old_column }} IS NULL THEN
NEW.{{ old_column }} := NEW.{{ new_column }};
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS dual_write_{{ new_column }} ON {{ table }};
CREATE TRIGGER dual_write_{{ new_column }}
BEFORE INSERT OR UPDATE ON {{ table }}
FOR EACH ROW EXECUTE FUNCTION {{ table }}_dual_write_{{ new_column }}();
The contract role drops the trigger and the old column, but only after a verification step:
- name: Phase 4 — verify no recent writes to old column
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
SELECT count(*) AS cnt
FROM {{ table }}
WHERE {{ old_column }} != {{ new_column }};
register: divergence_check
- name: Fail if old and new column diverge
ansible.builtin.fail:
msg: |
Refusing to contract: {{ divergence_check.query_result[0].cnt }} rows have
{{ old_column }} != {{ new_column }}. Either trigger is broken or backfill incomplete.
when: divergence_check.query_result[0].cnt > 0
- name: Phase 4 — drop trigger and old column
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
DROP TRIGGER IF EXISTS dual_write_{{ new_column }} ON {{ table }};
DROP FUNCTION IF EXISTS {{ table }}_dual_write_{{ new_column }}();
ALTER TABLE {{ table }} DROP COLUMN {{ old_column }};
The hard rule: never run the contract phase until application telemetry confirms zero reads/writes against the old column for at least 7 days. Operations that violate this rule are how production outages happen.
2. Online DDL for the unavoidable schema changes
Some operations cannot be done with pure expand-contract because they’re inherently destructive (changing a column type, adding a NOT NULL constraint, partitioning a table). For these, we use online DDL tools that perform the operation as a series of small, non-blocking steps.
2.1 MySQL/MariaDB: gh-ost and pt-online-schema-change
gh-ost (GitHub’s online schema migrator) is the modern choice. Unlike pt-online-schema-change, it doesn’t use triggers (which add load), and it can pause/resume cleanly:
- name: Install gh-ost
ansible.builtin.package:
name: gh-ost
state: present
- name: Render gh-ost command
ansible.builtin.set_fact:
ghost_cmd: >-
gh-ost
--user={{ db_admin_user }}
--password={{ vault_db_admin_password }}
--host={{ db_host }}
--database={{ db_name }}
--table={{ table }}
--alter="{{ alter_statement }}"
--max-load=Threads_running=25
--critical-load=Threads_running=1000
--chunk-size=1000
--throttle-control-replicas={{ replica_hosts | join(',') }}
--max-lag-millis=1500
--switch-to-rbr
--allow-on-master
--cut-over=default
--hooks-path=/etc/gh-ost/hooks
--execute
no_log: true
- name: Run gh-ost migration
ansible.builtin.command: "{{ ghost_cmd }}"
register: ghost_result
changed_when: ghost_result.rc == 0
failed_when: ghost_result.rc != 0
no_log: true
The flags that matter:
--max-load=Threads_running=25: pauses migration if active query count exceeds 25 (i.e. the app is busy)--throttle-control-replicas: pauses migration if any replica falls more than 1500ms behind--chunk-size=1000: copy 1000 rows at a time, yielding to other queries between chunks--cut-over=default: do the final atomic rename when all changes are caught up; lock is held for milliseconds--hooks-path: shell scripts that run at lifecycle events (onStartup,onValidated,onCutOverComplete)
The hooks are where you wire ServiceNow updates and Slack notifications:
#!/bin/bash
# /etc/gh-ost/hooks/gh-ost-on-startup
curl -X POST -H "Content-Type: application/json" \
-d "{\"text\":\"gh-ost started: ${GH_OST_DATABASE_NAME}.${GH_OST_TABLE_NAME}\"}" \
"${SLACK_WEBHOOK_URL}"
For PostgreSQL, the equivalent of gh-ost is pg_repack (for table reorgs and index rebuilds without locks) and pg_squeeze. Both are wrapped trivially in Ansible:
- name: Run pg_repack on bloated table
ansible.builtin.command: >-
pg_repack
--no-superuser-check
--jobs=4
--table {{ table }}
--dbname {{ db_name }}
--host {{ db_host }}
--username {{ db_admin_user }}
environment:
PGPASSWORD: "{{ vault_db_admin_password }}"
no_log: true
2.2 The “lock-free” mental model isn’t quite right
A subtle but important point: “online” DDL tools don’t eliminate locks. They reduce them to milliseconds at the cutover moment, instead of holding them for hours during the data copy. A migration that runs gh-ost on a table with 500M rows might take 6 hours of background copy and then a 50ms lock at the end. That 50ms is not zero — it can still cause a brief connection pool spike — but it’s three orders of magnitude better than a 6-hour blocking ALTER.
The implication for production runbooks: schedule cutover moments during traffic troughs (typically 03:00-04:00 local time), and have application connection pools sized to absorb a brief spike. A pool with 50 connections that hits 80% utilisation during a cutover lock is fine. A pool at 95% utilisation will fail spectacularly.
3. Blue-green database cutovers
For changes that are too large or too risky for online DDL — major engine version upgrades, instance class changes, encryption-at-rest enablement, AZ migrations, on-prem → cloud — the blue-green pattern is the answer.
The shape:
BLUE (current production) GREEN (parallel)
│ │
┌────┴────┐ ┌────┴────┐
│ App │ │ │
│ Pool │ │ │
└────┬────┘ └────┬────┘
│ │
┌────▼────┐ logical replication ┌────▼────┐
│ Primary ├──────────────────────►│ Primary │
│ DB │ │ DB │
└─────────┘ └─────────┘
The five-phase cutover:
- Provision green — identical-shape database with the new version/config
- Replicate — logical replication from blue → green; let it catch up
- Validate — checksum tables, run smoke tests, verify replication lag = 0
- Cut over — pause writes on blue (5-30s), promote green to primary, redirect application
- Decommission — keep blue as read-only for ~24h as escape hatch, then destroy
The Ansible workflow that orchestrates this:
---
- name: Blue-green database migration workflow
hosts: localhost
gather_facts: true
vars:
cutover_window_start: "2026-06-23T03:00:00Z"
cutover_window_end: "2026-06-23T05:00:00Z"
tasks:
- name: Phase 1 — provision green DB cluster
ansible.builtin.include_role:
name: kv.db_provision
vars:
cluster_name: "{{ db_cluster_name }}-green"
engine_version: "{{ target_engine_version }}"
instance_class: "{{ target_instance_class }}"
parameter_group: "{{ target_parameter_group }}"
- name: Phase 2a — configure logical replication on blue
community.postgresql.postgresql_query:
login_host: "{{ blue_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
db: "{{ db_name }}"
query: |
ALTER SYSTEM SET wal_level = 'logical';
SELECT pg_reload_conf();
CREATE PUBLICATION migration_pub FOR ALL TABLES;
no_log: true
- name: Phase 2b — pre-seed green from latest backup
ansible.builtin.include_role:
name: kv.db_restore_from_snapshot
vars:
target_cluster: "{{ db_cluster_name }}-green"
snapshot_id: "{{ latest_blue_snapshot_id }}"
- name: Phase 2c — start logical replication on green
community.postgresql.postgresql_query:
login_host: "{{ green_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
db: "{{ db_name }}"
query: |
CREATE SUBSCRIPTION migration_sub
CONNECTION 'host={{ blue_db_host }} dbname={{ db_name }}
user=replicator password={{ vault_replicator_password }}'
PUBLICATION migration_pub
WITH (copy_data = false);
no_log: true
- name: Phase 3a — wait for replication lag to reach zero
community.postgresql.postgresql_query:
login_host: "{{ green_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
db: "{{ db_name }}"
query: |
SELECT
EXTRACT(EPOCH FROM (now() - last_msg_receipt_time)) AS lag_seconds
FROM pg_stat_subscription;
register: lag
until: lag.query_result[0].lag_seconds | float < 1.0
retries: 240
delay: 30
no_log: true
- name: Phase 3b — checksum critical tables
ansible.builtin.include_role:
name: kv.db_table_checksum_compare
vars:
blue_host: "{{ blue_db_host }}"
green_host: "{{ green_db_host }}"
critical_tables: "{{ critical_business_tables }}"
- name: Phase 3c — run application smoke tests against green (read-only)
ansible.builtin.include_role:
name: kv.app_smoke_tests
vars:
target_db_host: "{{ green_db_host }}"
read_only: true
- name: Phase 4a — open CHG implementation window
servicenow.itsm.change_request:
number: "{{ change_request_number }}"
state: implement
work_notes: "Cutover beginning at {{ ansible_date_time.iso8601 }}"
- name: Phase 4b — pause application writes (set DB to read-only)
community.postgresql.postgresql_query:
login_host: "{{ blue_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
query: "ALTER SYSTEM SET default_transaction_read_only = on; SELECT pg_reload_conf();"
- name: Phase 4c — wait for in-flight transactions to drain
ansible.builtin.pause:
seconds: 10
- name: Phase 4d — confirm zero replication lag
community.postgresql.postgresql_query:
login_host: "{{ green_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
db: "{{ db_name }}"
query: |
SELECT
CASE WHEN replay_lsn = sent_lsn THEN 'caught_up' ELSE 'behind' END AS status
FROM pg_stat_subscription;
register: final_lag
failed_when: final_lag.query_result[0].status != 'caught_up'
- name: Phase 4e — promote green to primary
community.postgresql.postgresql_query:
login_host: "{{ green_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
db: "{{ db_name }}"
query: "DROP SUBSCRIPTION migration_sub;"
- name: Phase 4f — flip application DB endpoint (DNS or service mesh)
ansible.builtin.include_role:
name: kv.app_db_endpoint_flip
vars:
new_db_host: "{{ green_db_host }}"
- name: Phase 4g — verify writes succeeding on green
ansible.builtin.include_role:
name: kv.app_smoke_tests
vars:
target_db_host: "{{ green_db_host }}"
read_only: false
- name: Phase 5 — keep blue as read-only escape hatch (24h)
community.postgresql.postgresql_query:
login_host: "{{ blue_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
query: |
REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM application_role;
The cutover itself (phases 4a-g) takes 30-90 seconds in practice. From the application’s perspective, this is a brief connection pool blip — within the tolerance of any retrying client. From the database’s perspective, it’s a clean atomic switch.
3.1 The escape-hatch discipline
Notice that we don’t drop blue. We keep it for 24 hours, in read-only mode. If something goes catastrophically wrong with green in the first hour after cutover (a query plan regression that wasn’t caught in smoke tests, an unexpected lock contention pattern, anything), we can:
- Flip the application back to blue (point-in-time-restore-aware: any green-writes that happened in the last hour are lost; this is the painful but bounded fallback)
- Investigate green offline
- Try the cutover again with the issue fixed
If the first hour is clean, you typically won’t need this. But the 5 minutes spent leaving blue running cost almost nothing and provide a meaningful insurance policy. After 24 hours of clean green operation, decommission blue.
A more sophisticated variant — forward-flow protection — sets up green-to-blue replication for that 24h window so that the escape hatch doesn’t lose data. This is much more complex (you must handle write conflicts if both sides take writes briefly), and is overkill for most cases. Reserve it for migrations where any data loss is unacceptable (financial transactions, healthcare records).
4. Cross-engine migrations: Oracle → PostgreSQL, MSSQL → PostgreSQL
These are the migrations that cause executives to lose sleep. Schema differences, type mismatches, stored procedure rewrites, application-level ORM changes — every layer has compatibility issues.
The pragmatic toolchain:
| Source → Target | Tool | Ansible wrapper |
|---|---|---|
| Oracle → PostgreSQL | ora2pg | ansible.builtin.command + community.postgresql.* |
| MSSQL → PostgreSQL | pg_chameleon, AWS DMS | community.aws.dms_* collections |
| MySQL → PostgreSQL | pgloader | ansible.builtin.command |
| Anything → Anything | Debezium + Kafka Connect | community.kubernetes.k8s (deploy operators) |
| RDBMS → RDBMS (cloud-native) | AWS DMS, Azure Database Migration Service, GCP Database Migration Service | Cloud-provider collections |
The hard part of these migrations is rarely the data move itself — it’s everything else:
- Stored procedure rewrites: Oracle PL/SQL → PostgreSQL PL/pgSQL is mostly mechanical, but every business-logic procedure must be re-tested. AWS Babelfish or EnterpriseDB’s Oracle compatibility help, but don’t eliminate the work.
- Type system differences: Oracle’s
NUMBERvs PostgreSQL’snumericdiffer in precision behaviour. Oracle’sDATEincludes time; PostgreSQL’sdatedoesn’t. Application code that depends on these subtleties will fail. - Sequence semantics: Oracle’s
SEQUENCE.NEXTVALand PostgreSQL’snextval()behave differently in transaction-rollback scenarios. - Lock and isolation differences: Oracle’s read-consistent-by-default vs PostgreSQL’s MVCC have different deadlock characteristics. Stress test, do not assume.
A sensible pattern for these migrations:
- Phase 0: Schema convertor + stored proc audit. Produce a manual remediation list. This is human work; no automation shortcut.
- Phase 1: Initial bulk load via DMS or equivalent. Ansible orchestrates the DMS task creation and monitoring.
- Phase 2: CDC-based ongoing replication. DMS or Debezium streams changes from source to target.
- Phase 3: Full-stack parallel test — application points at target, traffic is shadowed (read-only) for weeks, mismatches are logged.
- Phase 4: Cutover, exactly as in the blue-green pattern above.
- Phase 5: Source-side decommission, weeks later.
A skeleton AWS DMS playbook:
- name: Create DMS replication instance
community.aws.dms_replication_instance:
replication_instance_identifier: "ora-to-pg-{{ env }}"
replication_instance_class: dms.r5.4xlarge
allocated_storage: 500
vpc_security_group_ids: "{{ dms_sg_ids }}"
publicly_accessible: false
multi_az: true
state: present
- name: Create source endpoint (Oracle)
community.aws.dms_endpoint:
endpoint_identifier: "src-oracle-{{ env }}"
endpoint_type: source
engine_name: oracle
server_name: "{{ oracle_host }}"
port: 1521
database_name: "{{ oracle_service_name }}"
username: "{{ vault_oracle_dms_user }}"
password: "{{ vault_oracle_dms_password }}"
extra_connection_attributes: "useLogminerReader=N;useBfile=Y;asm_user={{ asm_user }};asm_password={{ vault_asm_password }}"
state: present
no_log: true
- name: Create target endpoint (PostgreSQL on RDS)
community.aws.dms_endpoint:
endpoint_identifier: "tgt-pg-{{ env }}"
endpoint_type: target
engine_name: postgres
server_name: "{{ rds_endpoint }}"
port: 5432
database_name: "{{ db_name }}"
username: "{{ vault_pg_dms_user }}"
password: "{{ vault_pg_dms_password }}"
state: present
no_log: true
- name: Create migration task with full-load + CDC
community.aws.dms_replication_task:
replication_task_identifier: "ora-to-pg-task-{{ env }}"
source_endpoint_identifier: "src-oracle-{{ env }}"
target_endpoint_identifier: "tgt-pg-{{ env }}"
replication_instance_identifier: "ora-to-pg-{{ env }}"
migration_type: full-load-and-cdc
table_mappings: "{{ lookup('file', 'table-mappings.json') }}"
replication_task_settings: "{{ lookup('file', 'task-settings.json') }}"
state: present
- name: Start the task
community.aws.dms_replication_task:
replication_task_identifier: "ora-to-pg-task-{{ env }}"
state: started
The table-mappings.json and task-settings.json are where most of the real configuration lives. They should be in version control alongside the playbook. Common useful patterns:
- Map all tables but exclude audit/log tables (huge, low-value)
- Apply LOB (large object) settings appropriate for your data
- Configure parallel-load for the largest tables
- Set CDC starting point appropriately (
now, or a specific SCN/LSN)
DMS is operationally easy to use and operationally easy to misuse. The most common failure mode is “it appeared to work but silently dropped some rows.” Mitigations:
- Always run a row-count comparison after full-load, before enabling CDC, and again before cutover
- Always run column-by-column checksum on critical tables (PK ranges, sample windows)
- Always test the rollback path (DMS task can be reversed in some cases, but not always — know your escape hatch)
5. Engine version upgrades
The “blue-green for major version upgrade” pattern is its own special case worth covering separately. PostgreSQL 11 → 16, MySQL 5.7 → 8.0, MariaDB 10.6 → 11.4. These upgrades require:
- Catalog format changes (cannot just stop and restart on new binaries)
- Possible parameter changes (some params renamed or removed)
- Possible application-level breakage (deprecated SQL features)
- Possible plan regressions (planner changes between versions)
For PostgreSQL specifically, pg_upgrade is the in-place tool, but it requires downtime equal to the time the catalog conversion takes (minutes for small clusters, longer for big). For zero-downtime, you blue-green via logical replication.
The Ansible role for a PG major-version blue-green:
# roles/pg_major_upgrade/tasks/main.yml
---
- name: Pre-flight — check for incompatible features
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
SELECT
'extensions' AS check_type,
string_agg(extname, ', ') AS items
FROM pg_extension
WHERE extname NOT IN ({{ supported_extensions_in_target | join(',') }})
UNION ALL
SELECT
'data_types',
string_agg(DISTINCT typname, ', ')
FROM pg_type
WHERE typname IN ({{ removed_types_in_target | join(',') }});
register: incompat_check
- name: Fail if incompatible features found
ansible.builtin.fail:
msg: "Cannot upgrade: {{ incompat_check.query_result }}"
when: incompat_check.query_result | selectattr('items', 'truthy') | list | length > 0
- name: Provision green cluster on target version
ansible.builtin.include_role:
name: kv.db_provision_pg
vars:
cluster_name: "{{ db_cluster_name }}-green"
engine_version: "{{ target_pg_version }}"
parameter_group: "{{ target_parameter_group }}"
- name: Initial schema export from blue
community.postgresql.postgresql_db:
name: "{{ db_name }}"
state: dump
target: "/tmp/{{ db_name }}-schema.sql"
target_opts: "--schema-only --no-owner --no-acl"
login_host: "{{ blue_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
no_log: true
- name: Import schema into green
community.postgresql.postgresql_db:
name: "{{ db_name }}"
state: restore
target: "/tmp/{{ db_name }}-schema.sql"
login_host: "{{ green_db_host }}"
login_user: "{{ db_admin_user }}"
login_password: "{{ vault_db_admin_password }}"
no_log: true
# (then logical replication setup as in section 3, then cutover, then decommission)
For MySQL, the corresponding flow uses native MySQL replication (CHANGE MASTER TO from a backup-derived position) with gtid_mode = ON for clean failover. Tools like ProxySQL or HAProxy provide the connection-flip layer.
6. Managed-database equivalents (RDS, Cloud SQL, Azure Database)
The same patterns apply with cloud-managed databases, with two critical differences:
- You don’t run pg_upgrade or mysql_upgrade yourself — the cloud provider does. AWS RDS Blue/Green Deployments, Azure Database for PostgreSQL Flexible Server “Restore + Replicate”, Google Cloud SQL “Migrate” — each cloud has its own native blue-green primitive.
- The bypass is more obvious — you cannot get to the underlying OS, so any manual recovery requires a support ticket. Plan for this.
The AWS RDS Blue/Green pattern is the cleanest:
- name: Create blue/green deployment
community.aws.rds_cluster:
db_cluster_identifier: "{{ blue_cluster_name }}"
blue_green_deployment:
target_engine_version: "{{ target_pg_version }}"
target_db_parameter_group_name: "{{ target_parameter_group }}"
state: create_blue_green
register: bg_deployment
- name: Wait for green to catch up (replication lag <= 1s)
community.aws.rds_cluster_info:
db_cluster_identifier: "{{ bg_deployment.green_cluster_identifier }}"
register: green_status
until: green_status.cluster.replication_lag_seconds <= 1
retries: 240
delay: 60
- name: Run validation on green
ansible.builtin.include_role:
name: kv.app_smoke_tests
vars:
target_db_host: "{{ bg_deployment.green_endpoint }}"
read_only: true
- name: Switch over (atomic flip)
community.aws.rds_cluster:
db_cluster_identifier: "{{ blue_cluster_name }}"
blue_green_deployment_identifier: "{{ bg_deployment.id }}"
state: switchover
switchover_timeout: 300
The actual switchover takes ~30 seconds in well-behaved cases. AWS handles all the connection-routing magic — both endpoints (writer and reader) are atomically remapped. Application code using the cluster endpoint sees a brief connection drop and reconnects to green.
The Ansible value here isn’t reinventing what RDS already does — it’s wrapping the RDS calls inside the standard CHG-gated, evidence-bundled, ServiceNow-tracked workflow that every other production change goes through. The migration becomes “another playbook” rather than “a special manual ritual,” and the audit trail is identical to every other change.
7. Stateful application-level migrations
A subtype worth mentioning: data backfills that aren’t schema changes but are still risky operations. “Re-encrypt every PII column with a new key.” “Recalculate every order’s tax amount with new logic.” “Update every customer record with a new normalised address.”
These have the same risk profile as schema changes (they touch every row) but no DDL. The discipline is identical:
- Idempotent (re-running the operation has no effect on already-processed rows)
- Batched (1000-10000 rows per transaction; never one giant UPDATE)
- Throttled (yield to user-facing query load; pause if replication lag rises)
- Resumable (checkpoint progress; pick up where it stopped)
- Verifiable (a separate query that confirms every row meets the new condition)
A pattern:
- name: Backfill loop with throttling and checkpointing
block:
- name: Read checkpoint
ansible.builtin.slurp:
src: "/var/lib/backfills/{{ backfill_id }}.checkpoint"
register: checkpoint_raw
failed_when: false
- name: Set starting position
ansible.builtin.set_fact:
start_id: "{{ (checkpoint_raw.content | b64decode | trim) | default('0', true) | int }}"
- name: Process batches
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
UPDATE {{ table }}
SET {{ update_expr }}
WHERE id BETWEEN {{ batch_start }} AND {{ batch_end }}
AND {{ where_condition }};
loop: "{{ range(start_id | int, max_id, batch_size) | list }}"
loop_control:
loop_var: batch_start
vars:
batch_end: "{{ batch_start + batch_size - 1 }}"
- name: Pause if replication lag too high
community.postgresql.postgresql_query:
db: "{{ db_name }}"
query: |
SELECT max(EXTRACT(EPOCH FROM (now() - last_msg_receipt_time))) AS lag
FROM pg_stat_replication;
register: rep_lag
until: rep_lag.query_result[0].lag | float < 5
retries: 20
delay: 30
- name: Update checkpoint
ansible.builtin.copy:
content: "{{ batch_end }}"
dest: "/var/lib/backfills/{{ backfill_id }}.checkpoint"
8. The cutover runbook is sacred
Every database migration has a cutover moment. That moment is the single highest-risk window in the entire migration. The runbook for that moment must be:
- Written before the migration starts: rehearsed in pre-prod, reviewed by the on-call team
- Time-boxed: every step has a maximum duration; if exceeded, abort
- Fail-forward or rollback explicit: every step has a documented rollback path
- Single-pilot: one named human runs the cutover, with a co-pilot watching dashboards
- No improvisation: if something unexpected happens, abort and reschedule, do not improvise
The Ansible workflow renders the runbook as code, but the human runbook — the document the on-call team reads before kickoff — is its own deliverable. It should explicitly cover:
- Pre-flight checklist (review 24h before, review 1h before)
- Go/no-go criteria
- Communication plan (channels, escalation)
- Step-by-step with expected output and elapsed time
- Failure decision tree (if step N fails, do X)
- Post-cutover validation
- Decommission timeline
- Lessons-learned template
Migrations performed without this runbook discipline are how the dramatic post-mortem stories get written. Migrations performed with it are tedious and uneventful, which is exactly the goal.
9. Common failure modes
| Failure mode | Symptom | Mitigation |
|---|---|---|
| Replication lag never reaches zero | Cutover never happens | Check for long-running transactions on source; check network bandwidth; check target’s write capacity |
| Logical replication stalls on large transactions | A 1B-row transaction gets stuck | Break the source transaction into smaller chunks before migration |
| Sequence/identity gaps after cutover | App generates duplicate IDs | Bump sequences on green to a value safely above blue’s max before cutover |
| Connection pool exhaustion at cutover | App errors briefly | Pre-warm connections to green; size pool for spike |
| Plan regression on new version | Some queries 100x slower on green | Run pg_stat_statements diff in pre-prod; capture problem queries; tune target before cutover |
| Foreign key cascade timing | Replicated child rows arrive before parents | Use foreign_keys=DEFERRED or disable FK during initial load |
| Trigger-induced infinite loop | Dual-write trigger triggers itself via replication | Mark replicated rows with pg_trigger_depth() = 0 check |
| Encoding mismatch | Special characters become garbage on target | Always specify target encoding explicitly; never rely on defaults |
| Time zone drift | TIMESTAMP WITHOUT TIME ZONE columns shift | Always use TIMESTAMP WITH TIME ZONE; document the timezone of the source |
| Cutover window misalignment | App restart misses the DB cutover | Use service-mesh or Consul DNS so app sees endpoint change without restart |
| Lost-updates window | App writes during the read-only pause are dropped | Set application to retry-with-backoff on read-only errors; window must be < retry timeout |
The non-obvious lesson from running many of these: every migration teaches you something about your database that you did not know before. There will be a stored procedure no one remembered. A foreign key with unusual cascade behaviour. A custom collation. A trigger written by someone who left the company three years ago. The migration is also a discovery exercise. Build that into your timeline.
10. Where this fits in the broader course
The Tier 5 wave so far:
- D1 (compliance) — what auditors expect
- D2 (DR) — surviving total failure
- D3 (OS migration) — fleet upgrades
- D4 (air-gap) — isolated environments
- D5 (SAP) — complex stack
- D6 (edge/IoT) — fleet primitives
- D7 (ITSM/ChatOps) — governance fabric
- D8 (backup) — last line of defence
- D9 (this lesson) — moving the data without breaking it
The capstone (D10) is observability — the system that tells you whether all of this automation is actually healthy. It pulls Prometheus, Grafana, Loki, OpenTelemetry, AAP’s own metrics, and ServiceNow’s incident stream into a single coherent operational view. After D10, you will have a complete blueprint for a regulated-enterprise automation platform: governance, compliance, security, scale, recovery, migration, and visibility.
What you should walk away from this lesson with: the conviction that database migrations are engineering exercises, not acts of bravery. With expand-contract, blue-green, and the cutover discipline encoded as Ansible workflows with strict gates, the most feared operation in IT becomes a routine, repeatable, evidence-producing change. Teams that internalise this can deliver schema changes weekly without drama. Teams that don’t continue treating each migration as a unique heroic event, and continue paying the cost in postponed work and weekend pages.