Ansible Lesson 34 of 42

Ansible for Disaster Recovery, In Depth: RPO/RTO Engineering, Site Failover & Cross-Region Runbooks

Disaster recovery is the operations discipline most likely to be rehearsed on paper and broken in production. Every regulated company has a DR plan. Almost no regulated company can execute it inside the recovery window the plan claims, because the plan is a Word document and the failover is a 600-step manual procedure that depends on the one engineer who happened to write it.

Ansible fixes that, but only if you treat DR the way you treat any other production system: with versioned code, signed artefacts, automated validation, and frequent rehearsal. This lesson is the specialist-tier guide to engineering DR with Ansible — translating recovery objectives into runbooks, validating replication, orchestrating cross-region cutovers from the Ansible Automation Platform (AAP), and proving recovery on a cadence that satisfies auditors and operators alike.

We will not cover the marketing distinction between “DR” and “BC/BCP”. We will cover the engineering: how to make a failover happen, how to know it worked, how to fail back without losing data, and how to keep the runbook honest as the underlying systems change.

Position in the curriculum. This lesson assumes Tier 1–4 fluency: roles, dynamic inventory, AAP workflows, vault, no_log discipline, at least one cloud collection, and the Tier 5 compliance lesson (because DR runbooks are themselves compliance evidence).


What “DR” really means in the Ansible context

DR is the discipline of continuing to operate a service after a fault that exceeds the local fault domain. “Local fault domain” here is whatever your steady-state HA was designed for — a node, a rack, a switch, a rack row, a power feed. Anything that takes out more than that is a DR event: a regional outage, a fibre cut, a ransomware encryption, a cloud-provider-wide control plane failure, a fire, a flood, a blast radius from a misapplied automation script.

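DR has two non-negotiable numbers, set by the business: RPO (Recovery Point Objective), the maximum amount of data, measured in time, that the business can afford to lose; and RTO (Recovery Time Objective), the maximum time the service may stay down before it is restored.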

There is a third number that often gets ignored: MTPD — Maximum Tolerable Period of Disruption, the point at which the business itself fails. RTO must always be less than MTPD. RPO and RTO together define the DR architecture pattern you must build:

| Pattern          | Typical RPO | Typical RTO     | Cost      | When to use                                               |
|------------------|-------------|-----------------|-----------|-----------------------------------------------------------|
| Backup & restore | Hours–days  | Hours–days      | Low       | Non-critical apps, archive, dev/test                      |
| Pilot light      | Minutes     | 30–60 min       | Medium    | Core data replicated, app stack scaled out on demand      |
| Warm standby     | Minutes     | 5–15 min        | High      | Smaller capacity running hot, scaled up on failover       |
| Active-passive   | Seconds     | 1–5 min         | High      | Full capacity in DR site, idle until failover             |
| Active-active    | 0 (sync)    | 0 (transparent) | Very high | Mission-critical financial/medical, cross-region traffic  |

Ansible’s job in DR is orchestration, not storage. The data replication itself is done by storage arrays (SnapMirror, Storage Replica, EBS replication), database engines (Postgres streaming, MySQL group replication, MongoDB replica sets), or block-level tools (Veeam, Zerto, AWS DRS). Ansible’s job is to:

  1. Validate that replication is healthy and within RPO every minute of every day.
  2. Promote the DR copy when a disaster is declared.
  3. Reconfigure networking, DNS, certificates and identity to make the DR copy reachable.
  4. Restart application stacks in the DR site in the correct order.
  5. Verify that the failed-over service actually works, end-to-end.
  6. Fail back cleanly when the primary site recovers, without data loss.
  7. Test the whole flow on a non-disruptive cadence — quarterly or better.

If Ansible owns those seven jobs, your DR plan is code. If not, your DR plan is a wiki page that no one has read in 18 months.


The DR architecture this lesson assumes

For concreteness, the rest of this lesson assumes a representative dual-region hybrid pattern that maps onto most enterprise environments:

This is a pilot-light/warm-standby hybrid: the database is hot, application servers are scaled to zero in DR until failover. RPO target is 5 minutes; RTO target is 30 minutes. Those numbers are what we will engineer to.
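One way to pin those targets down is to keep them in group_vars/all/rpo_rto.yml and have the prerequisite playbook assert that they are consistent. The sketch below is illustrative only: the RPO variable names follow the health-check playbook later in this lesson, while mtpd_seconds, its 4-hour value and the assert task are assumptions made for the example.

```yaml
# group_vars/all/rpo_rto.yml (sketch; all values in seconds)
rpo_db_seconds: 300        # 5 minutes, the database RPO target
rpo_storage_seconds: 600   # 10 minutes
rpo_object_seconds: 900    # 15 minutes
rto_target: 1800           # 30 minutes, the RTO target
mtpd_seconds: 14400        # hypothetical 4-hour MTPD; the business sets the real number
---
# playbooks/00-prereq-validate.yml (excerpt; hypothetical task)
- name: Validate recovery objectives
  hosts: localhost
  gather_facts: false
  tasks:
    - name: RTO must stay below MTPD and every RPO must be positive
      ansible.builtin.assert:
        that:
          - (rto_target | int) < (mtpd_seconds | int)
          - (rpo_db_seconds | int) > 0
          - (rpo_storage_seconds | int) > 0
          - (rpo_object_seconds | int) > 0
        fail_msg: "Recovery objectives are inconsistent; fix rpo_rto.yml before any DR run."
```

Keeping the numbers in one file means a change to an objective is a reviewed commit rather than an edit scattered across playbooks.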


The DR repository layout

Treat the DR runbook as a top-level Ansible repository, not a side project of the main one. The blast radius of a DR run is the entire estate; it deserves its own roles, its own review process, and its own test matrix.

dr-orchestration/
├── ansible.cfg
├── collections/requirements.yml      # pinned: amazon.aws, community.aws,
│                                     #         community.postgresql, ansible.posix
├── inventory/
│   ├── primary/                       # DC1 (vSphere)
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── dr/                            # AWS eu-west-1
│       ├── aws_ec2.yml                # dynamic inventory plugin
│       └── group_vars/
├── group_vars/
│   └── all/
│       ├── rpo_rto.yml                # the SLOs (single source of truth)
│       └── vault.yml                  # encrypted secrets
├── playbooks/
│   ├── 00-prereq-validate.yml
│   ├── 10-replication-health.yml      # runs every minute via AAP schedule
│   ├── 20-declare-disaster.yml        # human-gated, requires approval
│   ├── 30-failover.yml                # the actual cutover
│   ├── 40-validate-failover.yml       # synthetic transactions
│   ├── 50-cutover-dns.yml             # last step — irreversible-feeling
│   ├── 60-failback-prep.yml
│   ├── 70-failback.yml
│   └── 99-game-day.yml                # non-destructive test
└── roles/
    ├── replication_check_netapp/
    ├── replication_check_postgres/
    ├── replication_check_s3/
    ├── promote_rds/
    ├── scale_asg/
    ├── reconfigure_dns/
    ├── reconfigure_ad_trust/
    ├── reconfigure_certs/
    ├── app_smoke_test/
    └── dr_evidence/

The split between playbooks/10-replication-health.yml (continuous) and playbooks/30-failover.yml (event) is fundamental. You cannot fail over to a replica that is broken, and you do not want to discover the replica is broken during a real disaster. The continuous playbook runs every minute and screams when replication lag exceeds RPO.


RPO engineering: continuous replication validation

The first runbook to write is not the failover — it is the replication health check. Without this, every other runbook is a guess.

# playbooks/10-replication-health.yml
---
- name: Replication health (must stay green)
  hosts: localhost
  gather_facts: false
  vars:
    rpo_db_seconds: 300       # 5 minutes
    rpo_storage_seconds: 600  # 10 minutes
    rpo_object_seconds: 900   # 15 minutes
  tasks:

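    # pg_last_xact_replay_timestamp() only returns a value on a standby, so on
    # the primary we read replay_lag from pg_stat_replication (PostgreSQL 10+).
    # dr_observer needs pg_monitor membership, otherwise the column reads NULL.
    - name: Postgres replication lag (streaming replica, measured on the primary)
      ansible.builtin.shell: |
        psql -h primary-db.kv.local -U dr_observer -d app -At -c "
          SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag), 0)::int
          FROM pg_stat_replication
          WHERE application_name = 'dr_replica'
        "
      register: pg_lag
      changed_when: false
      no_log: true

    - name: Fail loudly if Postgres RPO breached or the replica is not connected
      ansible.builtin.fail:
        msg: "Postgres replication lag '{{ pg_lag.stdout }}'s > RPO {{ rpo_db_seconds }}s (empty means no replica row)"
      when: pg_lag.stdout | trim | length == 0 or pg_lag.stdout | int > rpo_db_seconds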

    - name: NetApp SnapMirror lag
      netapp.ontap.na_ontap_rest_info:
        hostname: "{{ netapp_host }}"
        username: "{{ netapp_user }}"
        password: "{{ netapp_pass }}"
        gather_subset: snapmirror_info
      register: sm
      no_log: true

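    - name: Fail loudly if storage RPO breached
      ansible.builtin.fail:
        msg: "SnapMirror lag {{ item.lag_time }} on {{ item.destination_path }}"
      loop: "{{ sm.ontap_info.snapmirror_info.records }}"
      # Assumes lag_time is an ISO 8601 duration such as PT4M30S; it is rewritten
      # to "4m 30s" so that community.general.to_seconds can parse it.
      when: >-
        (item.lag_time | lower | regex_replace('[pt]', '')
        | regex_replace('([dhms])', '\\1 ')
        | community.general.to_seconds) > rpo_storage_seconds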

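    # Uses the AWS CLI (assumed to be installed on the controller) and now(),
    # because this play runs with gather_facts: false and has no ansible_date_time.
    # replication_rule_id is an assumed variable: the metric also carries a RuleId dimension.
    - name: S3 cross-region replication metric
      ansible.builtin.command: >-
        aws cloudwatch get-metric-statistics --output json
        --namespace AWS/S3 --metric-name ReplicationLatency
        --period 60 --statistics Maximum
        --start-time {{ '%Y-%m-%dT%H:%M:%SZ' | strftime(now().timestamp() - 300, utc=true) }}
        --end-time {{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}
        --dimensions Name=SourceBucket,Value={{ src_bucket }}
        Name=DestinationBucket,Value={{ dst_bucket }}
        Name=RuleId,Value={{ replication_rule_id }}
      register: s3_lag
      changed_when: false

    - name: Reduce the raw results to one number per metric
      ansible.builtin.set_fact:
        s3_lag_max: >-
          {{ (s3_lag.stdout | from_json).Datapoints | map(attribute='Maximum')
             | list | default([0], true) | max }}
        # Assumes lag_time is an ISO 8601 duration such as PT4M30S.
        sm_lag_max: >-
          {{ sm.ontap_info.snapmirror_info.records | map(attribute='lag_time')
             | map('lower') | map('regex_replace', '[pt]', '')
             | map('regex_replace', '([dhms])', '\\1 ')
             | map('community.general.to_seconds')
             | list | default([0], true) | max }}

    - name: Push DR health to Prometheus pushgateway
      ansible.builtin.uri:
        url: "http://pushgw.kv.local:9091/metrics/job/dr_health"
        method: POST
        body: |
          dr_pg_lag_seconds {{ pg_lag.stdout | int }}
          dr_sm_lag_seconds {{ sm_lag_max }}
          dr_s3_lag_seconds {{ s3_lag_max }}
          dr_health_check_unixtime {{ now().timestamp() | int }}
        status_code: 200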

Run this from AAP every minute on a schedule. Wire the Prometheus metrics into Alertmanager: a single missed minute is ignored; three consecutive breaches raise a P1 in PagerDuty. The breach itself is your early warning that DR cannot meet RPO right now, long before any actual disaster.
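The "three consecutive breaches" rule can be expressed as a Prometheus alerting rule with a for: clause. The following is a minimal sketch, not a tuned production rule: the alert names, the severity label and the file name are hypothetical, and the thresholds simply restate the 5-minute database RPO. Because the health playbook as written fails before it reaches the push task when an RPO is breached, a breach may show up as a stale heartbeat rather than as a high lag value, so the sketch alerts on both.

```yaml
# dr-health.rules.yml (sketch; alert names and labels are hypothetical)
groups:
  - name: dr_health
    rules:
      # Lag above the 300-second RPO, sustained for three 1-minute runs.
      - alert: DrPostgresRpoBreached
        expr: dr_pg_lag_seconds > 300
        for: 3m
        labels:
          severity: page
        annotations:
          summary: "Postgres replication lag has exceeded the 5-minute RPO for 3 minutes"

      # No fresh push for three minutes: the check itself is failing or not running.
      - alert: DrHealthCheckStale
        expr: time() - dr_health_check_unixtime > 180
        labels:
          severity: page
        annotations:
          summary: "No DR health push for 3 minutes; the health check may be failing"
```

The Pushgateway keeps the last pushed value until it is overwritten, which is why the heartbeat metric is needed to tell a healthy reading from an old one.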


RTO engineering: the failover runbook

The failover playbook is the most important Ansible file in your estate. Treat it accordingly: small, idempotent, modular, and gated by approval.

# playbooks/30-failover.yml
---
- name: DR failover  primary -> DR
  hosts: localhost
  gather_facts: false
  vars_prompt:
    - name: confirm
      prompt: "Type FAILOVER to proceed. This will promote DR and stop primary writes."
      private: false
  pre_tasks:
    - name: Abort unless explicitly confirmed
      ansible.builtin.assert:
        that: confirm == "FAILOVER"
        fail_msg: "DR failover not confirmed; aborting."

  tasks:

    - name: 1. Stop writes to primary (fence)
      ansible.builtin.import_role:
        name: fence_primary
      tags: [fence]

    - name: 2. Final replication catch-up window
      ansible.builtin.import_role:
        name: replication_check_postgres
      vars:
        rpo_db_seconds: 60
      tags: [validate]

    - name: 3. Promote RDS read-replica to primary
      ansible.builtin.import_role:
        name: promote_rds
      tags: [promote]

    - name: 4. Scale up DR application tier
      ansible.builtin.import_role:
        name: scale_asg
      vars:
        target_capacity: "{{ primary_capacity }}"
      tags: [compute]

    - name: 5. Reconfigure AD trust + DNS (Route 53)
      ansible.builtin.import_role:
        name: reconfigure_dns
      tags: [dns]

    - name: 6. Re-issue certificates pointing at DR endpoints
      ansible.builtin.import_role:
        name: reconfigure_certs
      tags: [tls]

    - name: 7. Smoke test
      ansible.builtin.import_role:
        name: app_smoke_test
      tags: [verify]

    - name: 8. Record evidence
      ansible.builtin.import_role:
        name: dr_evidence
      vars:
        event: failover
      tags: [evidence]

Note the structure: each step is a role, each step is tagged, each step is idempotent, and each step is named with the order it must execute. You can re-run this playbook safely — that is the point of idempotency in DR. If step 4 (scale ASG) fails because of an EC2 capacity issue and the operator fixes it manually, re-running the playbook will see the ASG already scaled and skip to step 5. This is the difference between a runbook that survives a real disaster and a runbook that breaks the moment reality intrudes.
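The evidence role later in this lesson reads run_start and run_start_epoch, which have to be captured before step 1 runs. One possible way to do that is an extra pre_task, sketched below; the task itself is an assumption rather than part of the playbook above. It relies on now(), since the play runs with gather_facts: false and therefore has no ansible_date_time.

```yaml
# Hypothetical addition to the pre_tasks of playbooks/30-failover.yml
- name: Record when the failover started
  ansible.builtin.set_fact:
    # set_fact evaluates once, so both values stay fixed for the rest of the run
    run_start: "{{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}"
    run_start_epoch: "{{ now().timestamp() | int }}"
```

Facts set this way may be stored as strings, so cast run_start_epoch with | int wherever it is used in arithmetic.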

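The vars_prompt is deliberately a typed phrase, not a y/n. Tired engineers at 3am hit y by reflex. Note that vars_prompt only prompts in an interactive ansible-playbook session; when the playbook runs as an AAP job template, supply confirm through a survey or an extra variable instead.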


The fence: stop primary writes before promoting

This is the step that operators most often skip in tabletop exercises and most often regret in real failovers. You must stop the primary from accepting writes before promoting the replica, or you will end up with a split brain and no clean way to merge the two timelines back.

# roles/fence_primary/tasks/main.yml
---
- name: Disable primary load balancer (HAProxy/F5)
  community.general.haproxy:
    socket: /var/lib/haproxy/stats
    state: disabled
    backend: "{{ item.backend }}"
    host: "{{ item.host }}"
  loop: "{{ primary_app_hosts }}"
  delegate_to: lb-primary.kv.local

- name: Reject writes at primary Postgres (read-only)
  community.postgresql.postgresql_set:
    name: default_transaction_read_only
    value: "on"
    login_host: "{{ primary_pg_host }}"
    login_user: postgres
  no_log: true

- name: Tell ALL primary app hosts to stop ingesting
  ansible.builtin.systemd:
    name: kv-app
    state: stopped
  delegate_to: "{{ item }}"
  loop: "{{ groups['primary_app'] }}"
  ignore_errors: true        # a host may refuse the stop
  ignore_unreachable: true   # a host may be gone entirely; ignore_errors alone does not cover that

- name: Flag the storage volume as fenced (NetApp)
  netapp.ontap.na_ontap_volume:
    state: present
    name: "{{ item }}"
    is_online: false
    hostname: "{{ netapp_host }}"
    username: "{{ netapp_user }}"
    password: "{{ netapp_pass }}"
  loop: "{{ primary_volumes }}"
  no_log: true

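ignore_errors: true on the systemd stop is intentional: in a real disaster the primary may be unreachable. Note that the load balancer, Postgres and NetApp tasks carry no such guard, so if the primary site is completely dark they will fail and stop the play; decide per task whether that is the behaviour you want. When the primary is reachable, the fence has two layers: the LB drains traffic at the network edge, and default_transaction_read_only=on makes every new session start read-only. That setting is a default, not a hard lock (a session can still override it with SET), which is why the application services are stopped as well.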
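Because a half-applied fence is what leads to split brain, it can help to read the setting back before promotion. The tasks below are a sketch under assumptions: the file name verify.yml is hypothetical, and the login details simply mirror the fence task above. SHOW returns a single row whose column is named after the parameter.

```yaml
# roles/fence_primary/tasks/verify.yml (hypothetical file name)
- name: Read back the fence setting from the primary
  community.postgresql.postgresql_query:
    db: postgres
    login_host: "{{ primary_pg_host }}"
    login_user: postgres
    query: SHOW default_transaction_read_only
  register: fence_state

- name: Refuse to promote unless new sessions on the primary are read-only
  ansible.builtin.assert:
    that:
      - fence_state.query_result[0].default_transaction_read_only == 'on'
    fail_msg: "Primary still accepts writes; do not promote the replica."
```

If the primary is unreachable this check cannot run at all, so the runbook needs an explicit decision for that case instead of a silent skip.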


Promotion, scale-up, and DNS cutover

These are the three steps that take the most clock time. Engineer each one to stream progress to the operator rather than sit silently for 10 minutes.

# roles/promote_rds/tasks/main.yml
---
- name: Promote RDS read replica
  amazon.aws.rds_instance:
    id: "{{ dr_db_identifier }}"
    state: present
    read_replica: false        # false promotes an existing replica to a standalone instance
    region: "{{ dr_region }}"
    wait: true
    wait_timeout: 600
  register: rds_promo

- name: Verify Postgres accepts writes
  community.postgresql.postgresql_query:
    db: app
    login_host: "{{ rds_promo.endpoint.address }}"
    login_user: "{{ pg_user }}"
    login_password: "{{ pg_pass }}"
    query: "CREATE TABLE IF NOT EXISTS dr_canary (ts timestamptz); INSERT INTO dr_canary VALUES (now());"
  no_log: true

- name: Update Postgres connection string in AWS Secrets Manager
  amazon.aws.secretsmanager_secret:
    name: app/db/url
    secret_type: string
    secret: "postgresql://{{ pg_user }}:{{ pg_pass }}@{{ rds_promo.endpoint.address }}:5432/app"
    region: "{{ dr_region }}"
    state: present
  no_log: true

The “canary” insert is your proof that promotion worked. If the role completes without that insert succeeding, you do not have a working DR — you have a half-promoted replica, and step 4 (scale ASG) will start launching application servers that immediately fail to connect.

# roles/scale_asg/tasks/main.yml
---
- name: Scale ASG to primary capacity
  amazon.aws.autoscaling_group:
    name: "{{ dr_asg_name }}"
    region: "{{ dr_region }}"
    desired_capacity: "{{ target_capacity }}"
    min_size: "{{ target_capacity }}"
    max_size: "{{ (target_capacity | int * 1.5) | round | int }}"
    wait_for_instances: true
    wait_timeout: 900
  register: scale_result

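# autoscaling_group returns instance IDs only, so look up the private IPs first.
- name: Look up private IPs of the instances in the ASG
  amazon.aws.ec2_instance_info:
    region: "{{ dr_region }}"
    instance_ids: "{{ scale_result.instances }}"
  register: asg_instances
  when: scale_result.instances is defined

- name: Wait until /healthz returns 200 on every new instance
  ansible.builtin.uri:
    url: "http://{{ item.private_ip_address }}:8080/healthz"
    status_code: 200
    timeout: 5
  register: healthz
  until: healthz.status == 200
  retries: 30
  delay: 10
  delegate_to: localhost
  loop: "{{ asg_instances.instances | default([]) }}"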

The DNS cutover happens last because it is the visible-to-customers step. Once Route 53 points at the DR ALB, traffic shifts; that is the moment customers see a recovered service.

# roles/reconfigure_dns/tasks/main.yml
---
- name: Move app.example.com to DR ALB
  amazon.aws.route53:
    state: present
    zone: example.com
    record: app.example.com
    type: A
    alias: true
    alias_hosted_zone_id: "{{ dr_alb_hosted_zone_id }}"
    value: "{{ dr_alb_dns_name }}"
    ttl: 60
    overwrite: true
  no_log: true

- name: Bump TTL back up after stabilisation
  amazon.aws.route53:
    state: present
    zone: example.com
    record: app.example.com
    type: A
    alias: true
    alias_hosted_zone_id: "{{ dr_alb_hosted_zone_id }}"
    value: "{{ dr_alb_dns_name }}"
    ttl: 300
    overwrite: true
  when: stable | default(false)

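Notice the TTL=60 during cutover. You want clients to re-resolve quickly when DR comes online, but you do not want a permanently low TTL that hammers your DNS bill. One caveat: Route 53 does not let you set a TTL on alias records (an alias inherits the TTL of its target), so the ttl values above only take effect if you publish a non-alias record such as a CNAME. The stable flag is set by a follow-up task once the cutover has held for an hour.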


End-to-end smoke test

The last technical step before declaring DR successful is to prove the application works, not just that the infrastructure is up. Health-check endpoints lie. Write a smoke test that exercises the actual user path.

# roles/app_smoke_test/tasks/main.yml
---
# The play runs with gather_facts: false, so ansible_date_time is not available;
# build the canary address once with now() and reuse it for sign-up and login.
- name: Pick one canary address for the whole run
  ansible.builtin.set_fact:
    canary_email: "dr-canary+{{ now().timestamp() | int }}@kv.local"

- name: "Synthetic transaction: sign up a test user"
  ansible.builtin.uri:
    url: "https://app.example.com/signup"
    method: POST
    body_format: json
    body:
      email: "{{ canary_email }}"
      password: "{{ canary_pass }}"
    status_code: 201
    return_content: true
  register: signup
  no_log: true

- name: "Synthetic transaction: log in"
  ansible.builtin.uri:
    url: "https://app.example.com/login"
    method: POST
    body_format: json
    body:
      email: "{{ canary_email }}"
      password: "{{ canary_pass }}"
    status_code: 200
    return_content: true
  register: login
  no_log: true

- name: "Synthetic transaction: read a row"
  ansible.builtin.uri:
    url: "https://app.example.com/api/v1/profile"
    method: GET
    headers:
      Authorization: "Bearer {{ login.json.token }}"
    status_code: 200
  no_log: true

- name: Fail the whole DR if smoke test fails
  ansible.builtin.fail:
    msg: "Smoke test failed; DR is NOT yet customer-ready"
  when: signup.status != 201 or login.status != 200

This is the runbook step that catches misconfigurations DNS health checks can never see: an expired certificate, a missing IAM role, a region-mismatched secret, a forgotten Postgres extension. If your smoke test passes, you can confidently tell the business “we are in DR, customers can use the service.”


Failback: the part everyone forgets

Failover is rehearsed; failback is improvised. That is why most real DR events end with the company running on the DR site for months — the team is too scared to fail back, because they know the failback was never tested.

Failback is its own playbook with its own gating. The steps are roughly the inverse of failover:

  1. Verify primary is healthy (storage replicated, network up, certs valid).
  2. Reverse replication direction: DR (now primary) replicates back to the original primary (now standby).
  3. Wait until reverse replication is within RPO.
  4. Schedule a maintenance window.
  5. Fence DR writes.
  6. Promote primary.
  7. Scale up primary application tier.
  8. Cut DNS back to primary.
  9. Smoke test.
  10. Re-establish forward replication (primary → DR).
  11. Scale DR back down to pilot-light.
  12. Record evidence.

The dangerous step is #2 — reversing replication. Postgres does not let you simply “reverse” a logical replication slot; you need to either rebuild the original primary from the DR site (pg_basebackup --pgdata=/var/lib/postgresql/data --host=dr-primary --wal-method=stream) or use a tool like AWS DMS that can do bidirectional sync. Whichever you choose, bake it into a role and rehearse it, because failback is the failure mode that costs you months of running on a more expensive site.
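The rebuild option can be sketched as a small role that re-seeds the old primary as a standby of the DR site. Everything here is an assumption made for illustration: the role name, the postgresql unit name, the data.pre-failback path, and the premise that the DR-side Postgres accepts physical replication connections (some managed database services do not), with replication credentials already available to the postgres user, for example in .pgpass. The pg_basebackup flags are the ones quoted above plus --write-recovery-conf, which creates standby.signal on PostgreSQL 12 and later.

```yaml
# roles/rebuild_old_primary/tasks/main.yml (hypothetical role; runs on the old primary with become)
- name: Stop Postgres on the old primary
  ansible.builtin.systemd:
    name: postgresql          # assumed unit name
    state: stopped

- name: Move the stale data directory aside instead of deleting it
  ansible.builtin.command:
    cmd: mv /var/lib/postgresql/data /var/lib/postgresql/data.pre-failback
    creates: /var/lib/postgresql/data.pre-failback

- name: Re-seed from the DR site as a streaming standby
  ansible.builtin.command:
    argv:
      - pg_basebackup
      - --pgdata=/var/lib/postgresql/data
      - --host=dr-primary
      - --wal-method=stream
      - --write-recovery-conf
    creates: /var/lib/postgresql/data/standby.signal
  become: true
  become_user: postgres

- name: Start Postgres as a standby of the DR site
  ansible.builtin.systemd:
    name: postgresql
    state: started
```

The creates: guards make a second run skip the move and the base backup, which matters when a failback rehearsal is interrupted halfway.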


Cross-cloud and hybrid DR considerations

The dual-region pattern above lives entirely in AWS for DR. Many enterprises have hybrid DR (primary on-prem, DR in cloud) or cross-cloud DR (primary AWS, DR Azure). Ansible handles both, but the validation steps grow.

Hybrid (vSphere → AWS): the storage replication is usually NetApp SnapMirror to Cloud Volumes ONTAP, or Veeam Cloud Connect, or AWS Storage Gateway. Ansible orchestrates the same way; you just import a different replication-check role. The networking is harder because Direct Connect / VPN tunnels must come up, and routes must be advertised. Pre-stage the BGP peerings and keep them administratively shut, then use community.aws.directconnect_virtual_interface to enable them during failover.

Cross-cloud (AWS → Azure): you cannot use cloud-native replication; you must use application-level (Postgres logical replication over a VPN), storage-level (Veeam), or stream-based (Kafka MirrorMaker) replication. Identity federation gets harder — you need both AD/IAM Identity Center and Azure AD Connect to be replicated. This is the case where AAP automation mesh shines: control plane in one cloud, execution nodes in both, no inbound firewall holes anywhere.

Multi-region active-active (rare, but covered for completeness): there is no failover playbook because every site is always serving traffic. Ansible’s role here is traffic shaping — using Route 53 latency-based routing or Azure Traffic Manager to drain traffic away from a degraded region while the underlying replication keeps converging. The “DR” event is a configuration change, not a promotion.


DR testing: game days, table-tops and chaos

A DR runbook that has not been tested in 12 months is broken. The exam question is not “is it broken?” but “how broken?”. Test on a cadence:

The playbooks/99-game-day.yml playbook is your scaffolding for the partial DR test. It runs the failover playbook with --check --diff against a test inventory, then runs the smoke test against a clone of the DR environment. Schedule it monthly in AAP. The test that runs every month and emails a green “DR rehearsal: PASS” message to risk management is worth more than 100 pages of plan documentation.
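A guard play at the top of 99-game-day.yml can make the rehearsal refuse to run outside check mode or against the wrong inventory. This is a minimal sketch under stated assumptions: dr_environment is a hypothetical variable that the test inventory would set to test in its group_vars, while ansible_check_mode is the built-in variable that is true when the run was started with --check.

```yaml
# playbooks/99-game-day.yml (first play; sketch)
- name: Game-day guard
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Refuse to rehearse outside check mode or outside the test inventory
      ansible.builtin.assert:
        that:
          - ansible_check_mode
          - dr_environment | default('') == 'test'   # hypothetical variable
        fail_msg: "Game day must run with --check against the test inventory."
```

Keep in mind that check mode only predicts changes for modules that support it, so a green dry run is weaker evidence than a full rehearsal in a cloned environment.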


DR evidence: what auditors actually want

Every DR run, real or rehearsed, must produce evidence. The minimum bundle:

# roles/dr_evidence/tasks/main.yml
---
# now() replaces ansible_date_time because the failover play does not gather facts.
- name: Evidence directory for this run
  ansible.builtin.set_fact:
    dr_evidence_dir: "/var/lib/dr/evidence/{{ now().timestamp() | int }}-{{ event }}"

- name: Create the evidence directory
  ansible.builtin.file:
    path: "{{ dr_evidence_dir }}"
    state: directory
    mode: "0750"

- name: Capture replication lag at start of run
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/01-replication-lag.json"
    content: "{{ replication_state | to_nice_json }}"

- name: Capture failover duration
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/02-timing.json"
    content: |
      { "start": "{{ run_start }}",
        "end":   "{{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}",
        "rto_seconds": {{ elapsed | int }},
        "rto_target_seconds": {{ rto_target | int }},
        "met_rto": {{ ((elapsed | int) <= (rto_target | int)) | to_json }}
      }
  vars:
    # run_start and run_start_epoch must be recorded when the failover begins
    elapsed: "{{ (now().timestamp() | int) - (run_start_epoch | int) }}"

- name: Capture smoke-test results
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/03-smoke-test.json"
    content: "{{ smoke_results | to_nice_json }}"

- name: Record the runbook commit hash
  ansible.builtin.shell:
    cmd: "git rev-parse HEAD > {{ dr_evidence_dir }}/00-runbook-commit.txt"
    chdir: "{{ playbook_dir }}"

- name: Checksum the evidence (a tamper-evidence manifest, not a cryptographic signature)
  ansible.builtin.shell:
    cmd: "sha256sum *.json *.txt > MANIFEST.sha256"
    chdir: "{{ dr_evidence_dir }}"

- name: Push the whole bundle to the evidence bucket (object lock + immutable)
  amazon.aws.s3_object:
    bucket: "{{ evidence_bucket }}"
    object: "{{ dr_evidence_dir | basename }}/{{ item | basename }}"
    src: "{{ item }}"
    mode: put
  loop: "{{ query('ansible.builtin.fileglob', dr_evidence_dir ~ '/*') }}"

The S3 bucket must have Object Lock enabled in compliance mode with a retention period that exceeds your audit cycle. Auditors do not accept “we have logs” — they accept “we have logs you cannot tamper with.” Every quarterly review then becomes a query: “show me all DR runs in the last 90 days, their RTO numbers, and whether we met target.” That is a Glue/Athena query against the evidence bucket. It is also the kind of report risk management loves.
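The evidence bucket itself can be declared in the same repository. The task below is a partial sketch: it assumes the evidence_bucket and dr_region variables used elsewhere in this lesson and only turns on versioning and Object Lock. The default retention rule in compliance mode still has to be configured, and how to do that depends on the amazon.aws release you have pinned, so it is deliberately left out here.

```yaml
# Hypothetical task, for example in playbooks/00-prereq-validate.yml
- name: Evidence bucket with versioning and Object Lock enabled
  amazon.aws.s3_bucket:
    name: "{{ evidence_bucket }}"
    region: "{{ dr_region }}"
    state: present
    versioning: true            # Object Lock requires versioning
    object_lock_enabled: true   # retention mode and period are configured separately
```

Running this on every prerequisite check also verifies that nobody has replaced the bucket with one that lacks the lock.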


Workflow: stitching it all together in AAP

The full DR workflow in Automation Controller is a directed graph:

[Replication health (scheduled, every 1m)]
       |
       | (failure path: PagerDuty incident)
       |
[Declare disaster (manual approval)]
       |
       v
[30-failover.yml]
       |
       +-- on success --> [Notify ops + risk]
       |
       +-- on failure --> [Rollback (40-rollback.yml)]
                          |
                          v
                   [Page incident commander]

Two AAP-specific notes:


Anti-patterns that destroy DR

A non-exhaustive list, all collected from real audits:


Frequently asked questions

1. Why orchestrate DR with Ansible instead of a vendor tool like Zerto, AWS DRS or Azure Site Recovery? Vendor tools handle one slice — typically VM-level block replication and orchestrated boot. They do not handle DNS cutover, certificate re-issuance, AD trust reconfiguration, application smoke tests, or evidence generation. Use a vendor tool as one role inside the Ansible runbook. The runbook is still the source of truth for “what does failover mean for this business service.”

2. What’s the smallest sensible RPO target? Anything below 30 seconds requires synchronous replication and a third site to break ties. For most enterprises, the practical floor is 5 minutes with asynchronous replication and a well-tested fence. Below that, costs and complexity explode.

3. How do I keep DR roles in sync with prod? Use the same roles. The application playbook that deploys to prod must also deploy to DR — no DR-specific forks. The differences live in group_vars/dr/ (region, sizing, replication endpoints) only. If you find yourself maintaining two copies of a role, you have an architectural drift bug.

4. What about ransomware? Ransomware is the DR scenario most enterprises fail. The replica is encrypted within minutes of the primary. Defend with immutable, air-gapped backups (S3 Object Lock in compliance mode, NetApp SnapVault with locked snapshots, or tape) plus the DR replica. Failover may not be from the replica; it may be from a 4-hour-old immutable backup, accepting the higher RPO as a deliberate trade-off. Have a separate ransomware runbook (playbooks/30-ransomware-failover.yml) that promotes the immutable backup, not the live replica.

5. How do I federate AD between primary and DR without making AD itself a SPOF? Two domain controllers in DR (different AZs), site-aware DNS, and replication topology configured so that DR is a separate site. AD Connect to Azure AD must use a service account replicated to both sites. Test the DR-only logon path quarterly: shut off the primary DCs at the network layer and ensure a user can still authenticate.

6. What’s the right way to handle stateful Kubernetes workloads in DR? For PVCs, use storage-class replication (Portworx, Longhorn, NetApp Trident with SnapMirror). For control plane, run separate clusters per region — never replicate the etcd quorum across regions; the latency will destroy you. Use Velero or Trident to back up cluster state hourly, and fail over by restoring into the DR cluster. The Kubernetes lesson covers the operator pattern; in DR, treat the operator manifest as the authoritative state.

7. How do I test DR without scaring the business? Invest in a “DR pre-prod” environment: a separate VPC/account that mirrors prod’s topology at 1/4 size with synthetic data. Run the full failover/failback there monthly. Run partial real-prod tests (steps 1–4) quarterly. Run full real-prod tests once a year, scheduled, with stakeholder buy-in.

8. Should the failover playbook use serial and strategy: free? For application-tier scaling, yes — serial: 25% gives you a rolling start that lets healthchecks catch failed AMIs early. For database promotion, no — it must be sequential. Use one playbook per phase, each with the right strategy.

9. How do I handle ChatOps for DR declarations? A Slack bot calls the AAP API to launch the 20-declare-disaster.yml job template, and the actual failover then sits behind a workflow approval node. The Slack interaction is a convenience; the gating still happens inside AAP, with the audit trail. Never expose the failover endpoint directly to a chat command.

10. What’s the single most underrated DR practice? The cleanup playbook. After a successful failover or rehearsal, run a playbook that records the commit hash of the runbook, archives the evidence, regenerates RPO/RTO Grafana dashboards with the actual numbers, and files a Jira ticket for any role that hit ignore_errors. The teams that do this never have a DR plan that quietly rots; the teams that don’t have a “DR reawakening project” once every two years.
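The rolling start described in question 8 could look like the play below. It is a sketch, assuming a hypothetical dr_app inventory group and the kv-app service and /healthz endpoint used earlier in this lesson. serial: "25%" processes a quarter of the hosts at a time, and max_fail_percentage: 0 stops the run as soon as any host in a batch fails.

```yaml
# Hypothetical play for the application-tier phase
- name: Start the DR application tier in rolling batches
  hosts: dr_app               # hypothetical inventory group
  gather_facts: false
  become: true
  serial: "25%"
  max_fail_percentage: 0
  tasks:
    - name: Start the application service
      ansible.builtin.systemd:
        name: kv-app
        state: started

    - name: Wait for the local health check
      ansible.builtin.uri:
        url: "http://127.0.0.1:8080/healthz"
        status_code: 200
      register: healthz
      until: healthz.status == 200
      retries: 30
      delay: 10
```

A bad image then fails in the first batch instead of across the whole tier, which is the point of keeping this phase in its own play.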


Hands-on lab — your first automated failover

The following lab fails over a tiny Postgres + Flask app from a “primary” Docker Compose stack to a “DR” Docker Compose stack, all on one machine. It teaches the runbook structure without needing two cloud accounts.

Prerequisites: docker, docker compose, ansible-core ≥ 2.16, community.postgresql.

mkdir -p dr-lab/{primary,dr,playbooks,roles}
cd dr-lab

# primary/docker-compose.yml
services:
  pg-primary:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: appsecret
      POSTGRES_INITDB_ARGS: "-c wal_level=logical"
    ports: ["5432:5432"]
    volumes:
      # Path is relative to primary/; create primary/init.sql first (it may be empty).
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro

  app-primary:
    image: python:3.12-slim
    command:
      - bash
      - -c
      - |
        pip install flask psycopg2-binary && cat > /tmp/app.py <<'EOF'
        import os, psycopg2
        from flask import Flask
        app = Flask(__name__)
        @app.route('/')
        def index():
            with psycopg2.connect(os.environ['DB']) as conn, conn.cursor() as cur:
                cur.execute('SELECT now()')
                return str(cur.fetchone())
        app.run(host='0.0.0.0', port=8080)
        EOF
        python /tmp/app.py
    environment:
      DB: postgresql://app:appsecret@pg-primary/app
    ports: ["8080:8080"]
    depends_on: [pg-primary]

# dr/docker-compose.yml: same shape but ports 5433/8081
services:
  pg-dr:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: appsecret
    ports: ["5433:5432"]

  app-dr:
    image: python:3.12-slim
    command: ... # same as primary
    environment:
      DB: postgresql://app:appsecret@pg-dr/app
    ports: ["8081:8080"]
    depends_on: [pg-dr]

# playbooks/30-failover.yml
- hosts: localhost
  gather_facts: false
  vars_prompt:
    - name: confirm
      prompt: "Type FAILOVER to proceed"
      private: false
  tasks:
    - assert: { that: confirm == "FAILOVER" }

    - name: Stop primary writes (read-only)
      community.postgresql.postgresql_set:
        name: default_transaction_read_only
        value: "on"
        login_host: 127.0.0.1
        login_port: 5432
        login_user: app          # POSTGRES_USER=app is the superuser here; no "postgres" role exists
        login_password: appsecret
      no_log: true

    - name: Take final snapshot
      ansible.builtin.shell: |
        docker exec primary-pg-primary-1 pg_dumpall -U app > /tmp/dr-final.sql

    - name: Restore into DR
      ansible.builtin.shell: |
        cat /tmp/dr-final.sql | docker exec -i dr-pg-dr-1 psql -U app

    - name: Smoke test DR
      ansible.builtin.uri:
        url: "http://127.0.0.1:8081/"
        status_code: 200
      retries: 10
      delay: 2

Run:

docker compose -f primary/docker-compose.yml up -d
docker compose -f dr/docker-compose.yml up -d
ansible-playbook playbooks/30-failover.yml -e confirm=FAILOVER
curl http://localhost:8081/   # works; 8080 is now read-only

Now extend the lab: add a 40-validate-failover.yml that times the run, a 50-cutover-dns.yml that switches a /etc/hosts entry, a 60-failback-prep.yml that reverses the dump direction, and a 99-game-day.yml that runs the whole flow with --check. You will have a complete DR mental model in under an hour.
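As a starting point for the first extension, a 40-validate-failover.yml for this lab might look like the sketch below. The ports, user and password come from the lab's compose files; validate_budget_seconds and the dr_canary table are assumptions chosen for the example. It times only the validation step, not the failover itself.

```yaml
# playbooks/40-validate-failover.yml (sketch for the lab)
- hosts: localhost
  gather_facts: false
  vars:
    validate_budget_seconds: 60   # hypothetical budget for this lab
  tasks:
    - name: Note when validation started
      ansible.builtin.set_fact:
        validate_start: "{{ now().timestamp() | int }}"

    - name: DR database accepts a write
      community.postgresql.postgresql_query:
        db: app
        login_host: 127.0.0.1
        login_port: 5433
        login_user: app
        login_password: appsecret
        query: "CREATE TABLE IF NOT EXISTS dr_canary (ts timestamptz)"
      no_log: true

    - name: DR app answers on port 8081
      ansible.builtin.uri:
        url: "http://127.0.0.1:8081/"
        status_code: 200

    - name: Validation finished inside the budget
      ansible.builtin.assert:
        that:
          - (now().timestamp() | int) - (validate_start | int) <= validate_budget_seconds
        fail_msg: "Validation took longer than {{ validate_budget_seconds }}s."
```

Run it straight after the failover playbook; to time the whole flow instead, pass the failover start time in as an extra variable and measure from there.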


Glossary


Certification mapping

This lesson maps directly onto:


Next steps

You now have an opinionated, code-first DR foundation. The next two specialist lessons build on it:

If you only take one habit from this lesson: run your DR playbook on a schedule, not on a calendar reminder. The DR plan that runs every month against pre-prod is the DR plan that works the day reality demands it.
