Ansible × Database Migrations & Zero-Downtime Schema Changes, In Depth: Online DDL, Blue-Green Cutovers, Logical Replication & Expand-Contract

There is one operation in modern infrastructure that engineers are most afraid of, and with good reason: changing a database. Compute is replaceable, storage can be re-provisioned, code can be redeployed, but data is the irreplaceable asset, and any operation that touches it carries an asymmetric risk — the upside is “the migration succeeds” and the downside is “the company loses several days of revenue and several years of customer trust.”

This lesson is about doing the scary thing safely. We will cover three classes of operation:

Online schema changes on a single live database (add columns, change types, add indexes, partition tables) without locking out applications
Blue-green database migrations — provisioning a parallel database, replicating into it, validating it, then cutting over with seconds of downtime instead of hours
Zero-downtime database engine migrations — moving from MySQL 5.7 → 8.0, PostgreSQL 11 → 16, Oracle → PostgreSQL, on-prem → cloud, or instance class upgrades that require a restart

Across all three, the unifying discipline is expand-contract: you never make a destructive schema change directly. You expand the schema to support both the old and the new shape simultaneously, migrate readers and writers across, then contract by removing the old shape once nothing references it. This pattern, more than any tool, is what enables zero-downtime database evolution.

The other unifying discipline is that every step is an Ansible-controlled idempotent operation with explicit checkpoints. Database migrations performed by humans on a CLI are the highest-risk operation in IT. Encoding them as playbooks with mandatory verification gates eliminates the most common failure modes (forgot a step, ran out of order, no rollback plan, no evidence trail).

This lesson assumes familiarity with the previous lessons in the Tier 5 wave. The CHG-gate pattern from D7 is mandatory for every database migration. The evidence-bundle pattern from D1 produces the audit artefact. The DR pattern from D2 is the fallback when a migration goes wrong.

1. The expand-contract pattern: the most important idea in this lesson

If you remember nothing else from this lesson, remember this:

Never make a destructive schema change in a single deployment. Always expand the schema, migrate, then contract.

Concretely, suppose you want to rename a column users.fullname → users.full_name. The naive approach is ALTER TABLE users RENAME COLUMN fullname TO full_name. This requires:

Application code that reads fullname to be deployed simultaneously with code that reads full_name
Zero requests in-flight using the old column at the moment of cutover
A perfect rollback plan if the application breaks

In practice, this is impossible in any production system bigger than a single replica. The expand-contract version:

Phase	Schema	Application	Goal
1. Expand	Add `full_name`, copy data, install trigger that keeps both columns in sync	App still reads/writes `fullname` only	Both columns exist; old code unaffected
2. Migrate writes	(no schema change)	Deploy app that writes to `full_name` (trigger keeps `fullname` in sync)	New writes go to `full_name`; legacy reads still work
3. Migrate reads	(no schema change)	Deploy app that reads from `full_name`	All app code uses `full_name`
4. Contract	Drop trigger, drop `fullname`	(no app change)	Old column gone; schema clean

Each phase is independently deployable, independently rollbackable, and never requires the application and database to be deployed simultaneously. Production databases that follow this discipline rigorously can sustain hundreds of schema changes per quarter with zero downtime.

The Ansible role we will build executes phases 1 and 4 (the schema-touching phases). Phases 2 and 3 are application deploys that happen between Ansible runs.

# roles/db_migration_expand/tasks/main.yml
---
- name: Phase 1 — add new column nullable
  community.postgresql.postgresql_query:
    db: "{{ db_name }}"
    query: |
      ALTER TABLE {{ table }}
      ADD COLUMN IF NOT EXISTS {{ new_column }} {{ new_type }};
  no_log: false

- name: Phase 1 — backfill data in batches
  community.postgresql.postgresql_query:
    db: "{{ db_name }}"
    query: |
      UPDATE {{ table }}
      SET {{ new_column }} = {{ backfill_expr }}
      WHERE {{ new_column }} IS NULL
        AND id BETWEEN {{ batch_start }} AND {{ batch_end }};
  loop: "{{ range(0, max_id, batch_size) | list }}"
  loop_control:
    loop_var: batch_start
    extended: true
  vars:
    batch_end: "{{ batch_start + batch_size - 1 }}"

- name: Phase 1 — install dual-write trigger
  community.postgresql.postgresql_query:
    db: "{{ db_name }}"
    query: |
      CREATE OR REPLACE FUNCTION {{ table }}_dual_write_{{ new_column }}()
      RETURNS TRIGGER AS $$
      BEGIN
        IF NEW.{{ old_column }} IS NOT NULL AND NEW.{{ new_column }} IS NULL THEN
          NEW.{{ new_column }} := NEW.{{ old_column }};
        ELSIF NEW.{{ new_column }} IS NOT NULL AND NEW.{{ old_column }} IS NULL THEN
          NEW.{{ old_column }} := NEW.{{ new_column }};
        END IF;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      DROP TRIGGER IF EXISTS dual_write_{{ new_column }} ON {{ table }};
      CREATE TRIGGER dual_write_{{ new_column }}
      BEFORE INSERT OR UPDATE ON {{ table }}
      FOR EACH ROW EXECUTE FUNCTION {{ table }}_dual_write_{{ new_column }}();

The contract role drops the trigger and the old column, but only after a verification step:

- name: Phase 4 — verify no recent writes to old column
  community.postgresql.postgresql_query:
    db: "{{ db_name }}"
    query: |
      SELECT count(*) AS cnt
      FROM {{ table }}
      WHERE {{ old_column }} != {{ new_column }};
  register: divergence_check

- name: Fail if old and new column diverge
  ansible.builtin.fail:
    msg: |
      Refusing to contract: {{ divergence_check.query_result[0].cnt }} rows have
      {{ old_column }} != {{ new_column }}. Either trigger is broken or backfill incomplete.
  when: divergence_check.query_result[0].cnt > 0

- name: Phase 4 — drop trigger and old column
  community.postgresql.postgresql_query:
    db: "{{ db_name }}"
    query: |
      DROP TRIGGER IF EXISTS dual_write_{{ new_column }} ON {{ table }};
      DROP FUNCTION IF EXISTS {{ table }}_dual_write_{{ new_column }}();
      ALTER TABLE {{ table }} DROP COLUMN {{ old_column }};

The hard rule: never run the contract phase until application telemetry confirms zero reads/writes against the old column for at least 7 days. Operations that violate this rule are how production outages happen.

2. Online DDL for the unavoidable schema changes

Some operations cannot be done with pure expand-contract because they’re inherently destructive (changing a column type, adding a NOT NULL constraint, partitioning a table). For these, we use online DDL tools that perform the operation as a series of small, non-blocking steps.

2.1 MySQL/MariaDB: gh-ost and pt-online-schema-change

gh-ost (GitHub’s online schema migrator) is the modern choice. Unlike pt-online-schema-change, it doesn’t use triggers (which add load), and it can pause/resume cleanly:

- name: Install gh-ost
  ansible.builtin.package:
    name: gh-ost
    state: present

- name: Render gh-ost command
  ansible.builtin.set_fact:
    ghost_cmd: >-
      gh-ost
      --user={{ db_admin_user }}
      --password={{ vault_db_admin_password }}
      --host={{ db_host }}
      --database={{ db_name }}
      --table={{ table }}
      --alter="{{ alter_statement }}"
      --max-load=Threads_running=25
      --critical-load=Threads_running=1000
      --chunk-size=1000
      --throttle-control-replicas={{ replica_hosts | join(',') }}
      --max-lag-millis=1500
      --switch-to-rbr
      --allow-on-master
      --cut-over=default
      --hooks-path=/etc/gh-ost/hooks
      --execute
  no_log: true

- name: Run gh-ost migration
  ansible.builtin.command: "{{ ghost_cmd }}"
  register: ghost_result
  changed_when: ghost_result.rc == 0
  failed_when: ghost_result.rc != 0
  no_log: true

The flags that matter:

--max-load=Threads_running=25: pauses migration if active query count exceeds 25 (i.e. the app is busy)
--throttle-control-replicas: pauses migration if any replica falls more than 1500ms behind
--chunk-size=1000: copy 1000 rows at a time, yielding to other queries between chunks
--cut-over=default: do the final atomic rename when all changes are caught up; lock is held for milliseconds
--hooks-path: shell scripts that run at lifecycle events (onStartup, onValidated, onCutOverComplete)

The hooks are where you wire ServiceNow updates and Slack notifications:

#!/bin/bash
# /etc/gh-ost/hooks/gh-ost-on-startup
curl -X POST -H "Content-Type: application/json" \
  -d "{\"text\":\"gh-ost started: ${GH_OST_DATABASE_NAME}.${GH_OST_TABLE_NAME}\"}" \
  "${SLACK_WEBHOOK_URL}"

For PostgreSQL, the equivalent of gh-ost is pg_repack (for table reorgs and index rebuilds without locks) and pg_squeeze. Both are wrapped trivially in Ansible:

- name: Run pg_repack on bloated table
  ansible.builtin.command: >-
    pg_repack
    --no-superuser-check
    --jobs=4
    --table {{ table }}
    --dbname {{ db_name }}
    --host {{ db_host }}
    --username {{ db_admin_user }}
  environment:
    PGPASSWORD: "{{ vault_db_admin_password }}"
  no_log: true

2.2 The “lock-free” mental model isn’t quite right

A subtle but important point: “online” DDL tools don’t eliminate locks. They reduce them to milliseconds at the cutover moment, instead of holding them for hours during the data copy. A migration that runs gh-ost on a table with 500M rows might take 6 hours of background copy and then a 50ms lock at the end. That 50ms is not zero — it can still cause a brief connection pool spike — but it’s three orders of magnitude better than a 6-hour blocking ALTER.

The implication for production runbooks: schedule cutover moments during traffic troughs (typically 03:00-04:00 local time), and have application connection pools sized to absorb a brief spike. A pool with 50 connections that hits 80% utilisation during a cutover lock is fine. A pool at 95% utilisation will fail spectacularly.

3. Blue-green database cutovers

For changes that are too large or too risky for online DDL — major engine version upgrades, instance class changes, encryption-at-rest enablement, AZ migrations, on-prem → cloud — the blue-green pattern is the answer.

The shape:

BLUE (current production)        GREEN (parallel)
        │                                │
   ┌────┴────┐                      ┌────┴────┐
   │ App     │                      │         │
   │ Pool    │                      │         │
   └────┬────┘                      └────┬────┘
        │                                │
   ┌────▼────┐  logical replication ┌────▼────┐
   │ Primary ├──────────────────────►│ Primary │
   │ DB      │                      │ DB      │
   └─────────┘                      └─────────┘

The five-phase cutover:

Provision green — identical-shape database with the new version/config
Replicate — logical replication from blue → green; let it catch up
Validate — checksum tables, run smoke tests, verify replication lag = 0
Cut over — pause writes on blue (5-30s), promote green to primary, redirect application
Decommission — keep blue as read-only for ~24h as escape hatch, then destroy

The Ansible workflow that orchestrates this:

---
- name: Blue-green database migration workflow
  hosts: localhost
  gather_facts: true
  vars:
    cutover_window_start: "2026-06-23T03:00:00Z"
    cutover_window_end: "2026-06-23T05:00:00Z"
  tasks:

    - name: Phase 1 — provision green DB cluster
      ansible.builtin.include_role:
        name: kv.db_provision
      vars:
        cluster_name: "{{ db_cluster_name }}-green"
        engine_version: "{{ target_engine_version }}"
        instance_class: "{{ target_instance_class }}"
        parameter_group: "{{ target_parameter_group }}"

    - name: Phase 2a — configure logical replication on blue
      community.postgresql.postgresql_query:
        login_host: "{{ blue_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        db: "{{ db_name }}"
        query: |
          ALTER SYSTEM SET wal_level = 'logical';
          SELECT pg_reload_conf();
          CREATE PUBLICATION migration_pub FOR ALL TABLES;
      no_log: true

    - name: Phase 2b — pre-seed green from latest backup
      ansible.builtin.include_role:
        name: kv.db_restore_from_snapshot
      vars:
        target_cluster: "{{ db_cluster_name }}-green"
        snapshot_id: "{{ latest_blue_snapshot_id }}"

    - name: Phase 2c — start logical replication on green
      community.postgresql.postgresql_query:
        login_host: "{{ green_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        db: "{{ db_name }}"
        query: |
          CREATE SUBSCRIPTION migration_sub
            CONNECTION 'host={{ blue_db_host }} dbname={{ db_name }}
                        user=replicator password={{ vault_replicator_password }}'
            PUBLICATION migration_pub
            WITH (copy_data = false);
      no_log: true

    - name: Phase 3a — wait for replication lag to reach zero
      community.postgresql.postgresql_query:
        login_host: "{{ green_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        db: "{{ db_name }}"
        query: |
          SELECT
            EXTRACT(EPOCH FROM (now() - last_msg_receipt_time)) AS lag_seconds
          FROM pg_stat_subscription;
      register: lag
      until: lag.query_result[0].lag_seconds | float < 1.0
      retries: 240
      delay: 30
      no_log: true

    - name: Phase 3b — checksum critical tables
      ansible.builtin.include_role:
        name: kv.db_table_checksum_compare
      vars:
        blue_host: "{{ blue_db_host }}"
        green_host: "{{ green_db_host }}"
        critical_tables: "{{ critical_business_tables }}"

    - name: Phase 3c — run application smoke tests against green (read-only)
      ansible.builtin.include_role:
        name: kv.app_smoke_tests
      vars:
        target_db_host: "{{ green_db_host }}"
        read_only: true

    - name: Phase 4a — open CHG implementation window
      servicenow.itsm.change_request:
        number: "{{ change_request_number }}"
        state: implement
        work_notes: "Cutover beginning at {{ ansible_date_time.iso8601 }}"

    - name: Phase 4b — pause application writes (set DB to read-only)
      community.postgresql.postgresql_query:
        login_host: "{{ blue_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        query: "ALTER SYSTEM SET default_transaction_read_only = on; SELECT pg_reload_conf();"

    - name: Phase 4c — wait for in-flight transactions to drain
      ansible.builtin.pause:
        seconds: 10

    - name: Phase 4d — confirm zero replication lag
      community.postgresql.postgresql_query:
        login_host: "{{ green_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        db: "{{ db_name }}"
        query: |
          SELECT
            CASE WHEN replay_lsn = sent_lsn THEN 'caught_up' ELSE 'behind' END AS status
          FROM pg_stat_subscription;
      register: final_lag
      failed_when: final_lag.query_result[0].status != 'caught_up'

    - name: Phase 4e — promote green to primary
      community.postgresql.postgresql_query:
        login_host: "{{ green_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        db: "{{ db_name }}"
        query: "DROP SUBSCRIPTION migration_sub;"

    - name: Phase 4f — flip application DB endpoint (DNS or service mesh)
      ansible.builtin.include_role:
        name: kv.app_db_endpoint_flip
      vars:
        new_db_host: "{{ green_db_host }}"

    - name: Phase 4g — verify writes succeeding on green
      ansible.builtin.include_role:
        name: kv.app_smoke_tests
      vars:
        target_db_host: "{{ green_db_host }}"
        read_only: false

    - name: Phase 5 — keep blue as read-only escape hatch (24h)
      community.postgresql.postgresql_query:
        login_host: "{{ blue_db_host }}"
        login_user: "{{ db_admin_user }}"
        login_password: "{{ vault_db_admin_password }}"
        query: |
          REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM application_role;

The cutover itself (phases 4a-g) takes 30-90 seconds in practice. From the application’s perspective, this is a brief connection pool blip — within the tolerance of any retrying client. From the database’s perspective, it’s a clean atomic switch.

3.1 The escape-hatch discipline

Notice that we don’t drop blue. We keep it for 24 hours, in read-only mode. If something goes catastrophically wrong with green in the first hour after cutover (a query plan regression that wasn’t caught in smoke tests, an unexpected lock contention pattern, anything), we can:

Flip the application back to blue (point-in-time-restore-aware: any green-writes that happened in the last hour are lost; this is the painful but bounded fallback)
Investigate green offline
Try the cutover again with the issue fixed

If the first hour is clean, you typically won’t need this. But the 5 minutes spent leaving blue running cost almost nothing and provide a meaningful insurance policy. After 24 hours of clean green operation, decommission blue.

A more sophisticated variant — forward-flow protection — sets up green-to-blue replication for that 24h window so that the escape hatch doesn’t lose data. This is much more complex (you must handle write conflicts if both sides take writes briefly), and is overkill for most cases. Reserve it for migrations where any data loss is unacceptable (financial transactions, healthcare records).

4. Cross-engine migrations: Oracle → PostgreSQL, MSSQL → PostgreSQL

These are the migrations that cause executives to lose sleep. Schema differences, type mismatches, stored procedure rewrites, application-level ORM changes — every layer has compatibility issues.

The pragmatic toolchain:

Source → Target	Tool	Ansible wrapper
Oracle → PostgreSQL	ora2pg	`ansible.builtin.command` + `community.postgresql.*`
MSSQL → PostgreSQL	pg_chameleon, AWS DMS	`community.aws.dms_*` collections
MySQL → PostgreSQL	pgloader	`ansible.builtin.command`
Anything → Anything	Debezium + Kafka Connect	`community.kubernetes.k8s` (deploy operators)
RDBMS → RDBMS (cloud-native)	AWS DMS, Azure Database Migration Service, GCP Database Migration Service	Cloud-provider collections

The hard part of these migrations is rarely the data move itself — it’s everything else:

Stored procedure rewrites: Oracle PL/SQL → PostgreSQL PL/pgSQL is mostly mechanical, but every business-logic procedure must be re-tested. AWS Babelfish or EnterpriseDB’s Oracle compatibility help, but don’t eliminate the work.
Type system differences: Oracle’s NUMBER vs PostgreSQL’s numeric differ in precision behaviour. Oracle’s DATE includes time; PostgreSQL’s date doesn’t. Application code that depends on these subtleties will fail.
Sequence semantics: Oracle’s SEQUENCE.NEXTVAL and PostgreSQL’s nextval() behave differently in transaction-rollback scenarios.
Lock and isolation differences: Oracle’s read-consistent-by-default vs PostgreSQL’s MVCC have different deadlock characteristics. Stress test, do not assume.

A sensible pattern for these migrations:

Phase 0: Schema convertor + stored proc audit. Produce a manual remediation list. This is human work; no automation shortcut.
Phase 1: Initial bulk load via DMS or equivalent. Ansible orchestrates the DMS task creation and monitoring.
Phase 2: CDC-based ongoing replication. DMS or Debezium streams changes from source to target.
Phase 3: Full-stack parallel test — application points at target, traffic is shadowed (read-only) for weeks, mismatches are logged.
Phase 4: Cutover, exactly as in the blue-green pattern above.
Phase 5: Source-side decommission, weeks later.

A skeleton AWS DMS playbook:

- name: Create DMS replication instance
  community.aws.dms_replication_instance:
    replication_instance_identifier: "ora-to-pg-{{ env }}"
    replication_instance_class: dms.r5.4xlarge
    allocated_storage: 500
    vpc_security_group_ids: "{{ dms_sg_ids }}"
    publicly_accessible: false
    multi_az: true
    state: present

- name: Create source endpoint (Oracle)
  community.aws.dms_endpoint:
    endpoint_identifier: "src-oracle-{{ env }}"
    endpoint_type: source
    engine_name: oracle
    server_name: "{{ oracle_host }}"
    port: 1521
    database_name: "{{ oracle_service_name }}"
    username: "{{ vault_oracle_dms_user }}"
    password: "{{ vault_oracle_dms_password }}"
    extra_connection_attributes: "useLogminerReader=N;useBfile=Y;asm_user={{ asm_user }};asm_password={{ vault_asm_password }}"
    state: present
  no_log: true

- name: Create target endpoint (PostgreSQL on RDS)
  community.aws.dms_endpoint:
    endpoint_identifier: "tgt-pg-{{ env }}"
    endpoint_type: target
    engine_name: postgres
    server_name: "{{ rds_endpoint }}"
    port: 5432
    database_name: "{{ db_name }}"
    username: "{{ vault_pg_dms_user }}"
    password: "{{ vault_pg_dms_password }}"
    state: present
  no_log: true

- name: Create migration task with full-load + CDC
  community.aws.dms_replication_task:
    replication_task_identifier: "ora-to-pg-task-{{ env }}"
    source_endpoint_identifier: "src-oracle-{{ env }}"
    target_endpoint_identifier: "tgt-pg-{{ env }}"
    replication_instance_identifier: "ora-to-pg-{{ env }}"
    migration_type: full-load-and-cdc
    table_mappings: "{{ lookup('file', 'table-mappings.json') }}"
    replication_task_settings: "{{ lookup('file', 'task-settings.json') }}"
    state: present

- name: Start the task
  community.aws.dms_replication_task:
    replication_task_identifier: "ora-to-pg-task-{{ env }}"
    state: started

The table-mappings.json and task-settings.json are where most of the real configuration lives. They should be in version control alongside the playbook. Common useful patterns:

Map all tables but exclude audit/log tables (huge, low-value)
Apply LOB (large object) settings appropriate for your data
Configure parallel-load for the largest tables
Set CDC starting point appropriately (now, or a specific SCN/LSN)

DMS is operationally easy to use and operationally easy to misuse. The most common failure mode is “it appeared to work but silently dropped some rows.” Mitigations:

Always run a row-count comparison after full-load, before enabling CDC, and again before cutover
Always run column-by-column checksum on critical tables (PK ranges, sample windows)
Always test the rollback path (DMS task can be reversed in some cases, but not always — know your escape hatch)

5. Engine version upgrades

The “blue-green for major version upgrade” pattern is its own special case worth covering separately. PostgreSQL 11 → 16, MySQL 5.7 → 8.0, MariaDB 10.6 → 11.4. These upgrades require:

Catalog format changes (cannot just stop and restart on new binaries)
Possible parameter changes (some params renamed or removed)
Possible application-level breakage (deprecated SQL features)
Possible plan regressions (planner changes between versions)

For PostgreSQL specifically, pg_upgrade is the in-place tool, but it requires downtime equal to the time the catalog conversion takes (minutes for small clusters, longer for big). For zero-downtime, you blue-green via logical replication.

The Ansible role for a PG major-version blue-green:

# roles/pg_major_upgrade/tasks/main.yml
---
- name: Pre-flight — check for incompatible features
  community.postgresql.postgresql_query:
    db: "{{ db_name }}"
    query: |
      SELECT
        'extensions' AS check_type,
        string_agg(extname, ', ') AS items
      FROM pg_extension
      WHERE extname NOT IN ({{ supported_extensions_in_target | join(',') }})
      UNION ALL
      SELECT
        'data_types',
        string_agg(DISTINCT typname, ', ')
      FROM pg_type
      WHERE typname IN ({{ removed_types_in_target | join(',') }});
  register: incompat_check

- name: Fail if incompatible features found
  ansible.builtin.fail:
    msg: "Cannot upgrade: {{ incompat_check.query_result }}"
  when: incompat_check.query_result | selectattr('items', 'truthy') | list | length > 0

- name: Provision green cluster on target version
  ansible.builtin.include_role:
    name: kv.db_provision_pg
  vars:
    cluster_name: "{{ db_cluster_name }}-green"
    engine_version: "{{ target_pg_version }}"
    parameter_group: "{{ target_parameter_group }}"

- name: Initial schema export from blue
  community.postgresql.postgresql_db:
    name: "{{ db_name }}"
    state: dump
    target: "/tmp/{{ db_name }}-schema.sql"
    target_opts: "--schema-only --no-owner --no-acl"
    login_host: "{{ blue_db_host }}"
    login_user: "{{ db_admin_user }}"
    login_password: "{{ vault_db_admin_password }}"
  no_log: true

- name: Import schema into green
  community.postgresql.postgresql_db:
    name: "{{ db_name }}"
    state: restore
    target: "/tmp/{{ db_name }}-schema.sql"
    login_host: "{{ green_db_host }}"
    login_user: "{{ db_admin_user }}"
    login_password: "{{ vault_db_admin_password }}"
  no_log: true

# (then logical replication setup as in section 3, then cutover, then decommission)

For MySQL, the corresponding flow uses native MySQL replication (CHANGE MASTER TO from a backup-derived position) with gtid_mode = ON for clean failover. Tools like ProxySQL or HAProxy provide the connection-flip layer.

6. Managed-database equivalents (RDS, Cloud SQL, Azure Database)

The same patterns apply with cloud-managed databases, with two critical differences:

You don’t run pg_upgrade or mysql_upgrade yourself — the cloud provider does. AWS RDS Blue/Green Deployments, Azure Database for PostgreSQL Flexible Server “Restore + Replicate”, Google Cloud SQL “Migrate” — each cloud has its own native blue-green primitive.
The bypass is more obvious — you cannot get to the underlying OS, so any manual recovery requires a support ticket. Plan for this.

The AWS RDS Blue/Green pattern is the cleanest:

- name: Create blue/green deployment
  community.aws.rds_cluster:
    db_cluster_identifier: "{{ blue_cluster_name }}"
    blue_green_deployment:
      target_engine_version: "{{ target_pg_version }}"
      target_db_parameter_group_name: "{{ target_parameter_group }}"
    state: create_blue_green
  register: bg_deployment

- name: Wait for green to catch up (replication lag <= 1s)
  community.aws.rds_cluster_info:
    db_cluster_identifier: "{{ bg_deployment.green_cluster_identifier }}"
  register: green_status
  until: green_status.cluster.replication_lag_seconds <= 1
  retries: 240
  delay: 60

- name: Run validation on green
  ansible.builtin.include_role:
    name: kv.app_smoke_tests
  vars:
    target_db_host: "{{ bg_deployment.green_endpoint }}"
    read_only: true

- name: Switch over (atomic flip)
  community.aws.rds_cluster:
    db_cluster_identifier: "{{ blue_cluster_name }}"
    blue_green_deployment_identifier: "{{ bg_deployment.id }}"
    state: switchover
    switchover_timeout: 300

The actual switchover takes ~30 seconds in well-behaved cases. AWS handles all the connection-routing magic — both endpoints (writer and reader) are atomically remapped. Application code using the cluster endpoint sees a brief connection drop and reconnects to green.

The Ansible value here isn’t reinventing what RDS already does — it’s wrapping the RDS calls inside the standard CHG-gated, evidence-bundled, ServiceNow-tracked workflow that every other production change goes through. The migration becomes “another playbook” rather than “a special manual ritual,” and the audit trail is identical to every other change.

7. Stateful application-level migrations

A subtype worth mentioning: data backfills that aren’t schema changes but are still risky operations. “Re-encrypt every PII column with a new key.” “Recalculate every order’s tax amount with new logic.” “Update every customer record with a new normalised address.”

These have the same risk profile as schema changes (they touch every row) but no DDL. The discipline is identical:

Idempotent (re-running the operation has no effect on already-processed rows)
Batched (1000-10000 rows per transaction; never one giant UPDATE)
Throttled (yield to user-facing query load; pause if replication lag rises)
Resumable (checkpoint progress; pick up where it stopped)
Verifiable (a separate query that confirms every row meets the new condition)

A pattern:

- name: Backfill loop with throttling and checkpointing
  block:
    - name: Read checkpoint
      ansible.builtin.slurp:
        src: "/var/lib/backfills/{{ backfill_id }}.checkpoint"
      register: checkpoint_raw
      failed_when: false

    - name: Set starting position
      ansible.builtin.set_fact:
        start_id: "{{ (checkpoint_raw.content | b64decode | trim) | default('0', true) | int }}"

    - name: Process batches
      community.postgresql.postgresql_query:
        db: "{{ db_name }}"
        query: |
          UPDATE {{ table }}
          SET {{ update_expr }}
          WHERE id BETWEEN {{ batch_start }} AND {{ batch_end }}
            AND {{ where_condition }};
      loop: "{{ range(start_id | int, max_id, batch_size) | list }}"
      loop_control:
        loop_var: batch_start
      vars:
        batch_end: "{{ batch_start + batch_size - 1 }}"

    - name: Pause if replication lag too high
      community.postgresql.postgresql_query:
        db: "{{ db_name }}"
        query: |
          SELECT max(EXTRACT(EPOCH FROM (now() - last_msg_receipt_time))) AS lag
          FROM pg_stat_replication;
      register: rep_lag
      until: rep_lag.query_result[0].lag | float < 5
      retries: 20
      delay: 30

    - name: Update checkpoint
      ansible.builtin.copy:
        content: "{{ batch_end }}"
        dest: "/var/lib/backfills/{{ backfill_id }}.checkpoint"

8. The cutover runbook is sacred

Every database migration has a cutover moment. That moment is the single highest-risk window in the entire migration. The runbook for that moment must be:

Written before the migration starts: rehearsed in pre-prod, reviewed by the on-call team
Time-boxed: every step has a maximum duration; if exceeded, abort
Fail-forward or rollback explicit: every step has a documented rollback path
Single-pilot: one named human runs the cutover, with a co-pilot watching dashboards
No improvisation: if something unexpected happens, abort and reschedule, do not improvise

The Ansible workflow renders the runbook as code, but the human runbook — the document the on-call team reads before kickoff — is its own deliverable. It should explicitly cover:

Pre-flight checklist (review 24h before, review 1h before)
Go/no-go criteria
Communication plan (channels, escalation)
Step-by-step with expected output and elapsed time
Failure decision tree (if step N fails, do X)
Post-cutover validation
Decommission timeline
Lessons-learned template

Migrations performed without this runbook discipline are how the dramatic post-mortem stories get written. Migrations performed with it are tedious and uneventful, which is exactly the goal.

9. Common failure modes

Failure mode	Symptom	Mitigation
Replication lag never reaches zero	Cutover never happens	Check for long-running transactions on source; check network bandwidth; check target’s write capacity
Logical replication stalls on large transactions	A 1B-row transaction gets stuck	Break the source transaction into smaller chunks before migration
Sequence/identity gaps after cutover	App generates duplicate IDs	Bump sequences on green to a value safely above blue’s max before cutover
Connection pool exhaustion at cutover	App errors briefly	Pre-warm connections to green; size pool for spike
Plan regression on new version	Some queries 100x slower on green	Run pg_stat_statements diff in pre-prod; capture problem queries; tune target before cutover
Foreign key cascade timing	Replicated child rows arrive before parents	Use foreign_keys=DEFERRED or disable FK during initial load
Trigger-induced infinite loop	Dual-write trigger triggers itself via replication	Mark replicated rows with `pg_trigger_depth() = 0` check
Encoding mismatch	Special characters become garbage on target	Always specify target encoding explicitly; never rely on defaults
Time zone drift	TIMESTAMP WITHOUT TIME ZONE columns shift	Always use TIMESTAMP WITH TIME ZONE; document the timezone of the source
Cutover window misalignment	App restart misses the DB cutover	Use service-mesh or Consul DNS so app sees endpoint change without restart
Lost-updates window	App writes during the read-only pause are dropped	Set application to retry-with-backoff on read-only errors; window must be < retry timeout

The non-obvious lesson from running many of these: every migration teaches you something about your database that you did not know before. There will be a stored procedure no one remembered. A foreign key with unusual cascade behaviour. A custom collation. A trigger written by someone who left the company three years ago. The migration is also a discovery exercise. Build that into your timeline.

10. Where this fits in the broader course

The Tier 5 wave so far:

D1 (compliance) — what auditors expect
D2 (DR) — surviving total failure
D3 (OS migration) — fleet upgrades
D4 (air-gap) — isolated environments
D5 (SAP) — complex stack
D6 (edge/IoT) — fleet primitives
D7 (ITSM/ChatOps) — governance fabric
D8 (backup) — last line of defence
D9 (this lesson) — moving the data without breaking it

The capstone (D10) is observability — the system that tells you whether all of this automation is actually healthy. It pulls Prometheus, Grafana, Loki, OpenTelemetry, AAP’s own metrics, and ServiceNow’s incident stream into a single coherent operational view. After D10, you will have a complete blueprint for a regulated-enterprise automation platform: governance, compliance, security, scale, recovery, migration, and visibility.

What you should walk away from this lesson with: the conviction that database migrations are engineering exercises, not acts of bravery. With expand-contract, blue-green, and the cutover discipline encoded as Ansible workflows with strict gates, the most feared operation in IT becomes a routine, repeatable, evidence-producing change. Teams that internalise this can deliver schema changes weekly without drama. Teams that don’t continue treating each migration as a unique heroic event, and continue paying the cost in postponed work and weekend pages.