Ansible × Backup & Storage Automation, In Depth: Veeam, Rubrik, Commvault, NetApp ONTAP, Pure & the 3-2-1-1-0 Rule as Code

There is a special kind of weariness that comes over an engineer when they are asked, mid-incident, “do we have a backup?” The answer is almost never a clean yes or no. It’s “yes, but it’s two weeks old, and we haven’t tested a restore in three months, and the storage admin just left the company.” That fragility is what this lesson exists to fix.

In modern infrastructure, backups are simultaneously:

The single most important last line of defence against ransomware
The control that auditors examine most closely (because they’re easy to test on paper but hard to test for real)
The only line of defence that cannot be skipped or postponed when prod is down
The system that operates entirely outside the normal change-management flow (it must keep running even when prod is broken)
The most boring thing in the data centre, which means it’s the easiest to neglect

The thesis of this lesson is that backups become trustworthy only when they are treated as automated, tested, and policy-driven — not when they are configured by hand once and reviewed annually. We’ll cover four pillars:

Backup-policy-as-code with the major vendor collections (Veeam, Rubrik, Cohesity, Commvault)
Storage-array automation with NetApp ONTAP, Pure, and Dell PowerStore for the snapshot tier of the 3-2-1-1-0 rule
Immutability and ransomware resilience — object-lock repositories, tape-equivalents, and the air-gapped 4th copy
Automated restore drills: the discipline that separates real DR readiness from a theatrical compliance checkbox

This lesson assumes you’ve internalised the DR lesson (D2) — backups are the content of DR; this lesson is the plumbing that creates and maintains that content correctly.

1. The 3-2-1-1-0 rule as the unifying policy

The classic backup mantra was 3-2-1: 3 copies, 2 different media, 1 off-site. Modern ransomware threats and compliance frameworks (FFIEC, NIST SP 800-209, HKMA, MAS TRM) extended this to 3-2-1-1-0:

3 copies of data (the original + 2 backups)
2 different media or storage types
1 copy off-site
1 copy immutable or air-gapped (cannot be modified or deleted by anyone, including admins)
0 errors on the latest restore test

The “1 immutable” and “0 errors” additions are what most organisations get wrong. The first because immutability is operationally inconvenient (you cannot delete bad backups quickly even when you want to). The second because automated restore testing is hard to set up and easy to defer.

The whole policy can — and should — be encoded as a single Ansible group_vars file applied to every protected workload:

# group_vars/all/backup_policy.yml
---
backup_policy:
  rpo_hours: 4              # max acceptable data loss
  rto_hours: 8              # max acceptable downtime
  retention:
    daily: 14               # 14 daily backups
    weekly: 6               # 6 weekly backups
    monthly: 12             # 12 monthly backups
    yearly: 7               # 7 yearly backups (regulatory)
  copies:
    primary:
      type: snapshot        # array-based snapshot
      target: ontap-cluster-01
      retention_days: 7
    secondary:
      type: backup
      target: veeam-repo-primary
      retention_days: 30
    offsite:
      type: backup_copy
      target: veeam-repo-aws-s3-frankfurt
      retention_days: 365
    immutable:
      type: backup_copy
      target: veeam-repo-aws-s3-objectlock
      object_lock_days: 90
      retention_days: 2555  # 7 years
  restore_test:
    frequency_days: 30
    sample_size_pct: 5
    success_threshold_pct: 100

Every backup automation playbook in your repo reads from this single source of truth. When the policy changes (auditor says “we need 14 monthlies, not 12”), you change one line and the next sync run reconciles every Veeam job, Rubrik SLA, snapshot schedule, and restore-test frequency. This is the entire game. Everything else in this lesson is implementation detail in service of this principle.
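Because the policy is plain data, it can be linted before any vendor module runs. The following is a minimal Python sketch, assuming the dictionary layout of the group_vars file above. The function name check_32110 and the way each digit of the rule is mapped to a field are illustrative assumptions, not part of Ansible or of any vendor collection.

```python
# Hypothetical pre-flight check: does a backup_policy dict satisfy 3-2-1-1-0?
# POLICY is a trimmed copy of the group_vars example; the rule-to-field
# mapping below is an assumption made for this sketch.

POLICY = {
    "copies": {
        "primary": {"type": "snapshot", "retention_days": 7},
        "secondary": {"type": "backup", "retention_days": 30},
        "offsite": {"type": "backup_copy", "retention_days": 365},
        "immutable": {"type": "backup_copy", "object_lock_days": 90,
                      "retention_days": 2555},
    },
    "restore_test": {"success_threshold_pct": 100},
}


def check_32110(policy):
    """Return a list of violations; an empty list means the policy passes."""
    problems = []
    copies = policy.get("copies", {})

    # 3: the original plus at least two declared copies
    if len(copies) < 2:
        problems.append("3: need the original plus at least 2 copies")
    # 2: at least two different storage types among the copies
    if len({c.get("type") for c in copies.values()}) < 2:
        problems.append("2: need at least 2 different storage types")
    # 1: an off-site copy (assumed to be keyed 'offsite')
    if "offsite" not in copies:
        problems.append("1: no off-site copy declared")
    # 1: an immutable copy (assumed to be any copy with a lock period)
    if not any(c.get("object_lock_days", 0) > 0 for c in copies.values()):
        problems.append("1: no immutable copy declared")
    # 0: restore tests must demand a 100% pass rate
    if policy.get("restore_test", {}).get("success_threshold_pct") != 100:
        problems.append("0: restore tests must require a 100% pass rate")
    return problems


if __name__ == "__main__":
    print(check_32110(POLICY) or "policy satisfies 3-2-1-1-0")
```

Run in CI on every change to the policy repo, a check of this kind would reject a pull request that removes the immutable copy before the change ever reaches a backup server.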

2. Veeam Backup & Replication automation

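Veeam is the most common backup product in enterprise Windows/VMware environments and is increasingly used in physical Linux estates. The examples below use a collection namespaced veeam.backup that wraps the Veeam Backup Enterprise Manager REST API. Module and parameter names differ between Veeam collections and versions, so check every task against the documentation of the collection you actually install.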

2.1 Connection and authentication

# vars
veeam_em_host: "veeam-em.corp.example.com"
veeam_em_port: 9398
veeam_em_user: "ANSIBLE\\svc-veeam"
veeam_em_password: "{{ vault_veeam_em_password }}"

- name: Connect to Veeam EM
  veeam.backup.session:
    host: "{{ veeam_em_host }}"
    port: "{{ veeam_em_port }}"
    username: "{{ veeam_em_user }}"
    password: "{{ veeam_em_password }}"
    validate_certs: true
  register: veeam_session
  no_log: true

The play registers the session token once and passes it to every later task, so you authenticate once per run.

2.2 Job-as-code

A Veeam backup job in code looks like this:

- name: Reconcile backup job for payments_api
  veeam.backup.backup_job:
    session_token: "{{ veeam_session.session_token }}"
    name: "BJ-payments_api-PROD"
    description: "Managed by Ansible — do not edit manually"
    job_type: vmware
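    # loop is a task keyword and cannot be nested inside module arguments,
    # so the object list is built with filters (dict_kv: community.general)
    objects: >-
      {{ groups['app_payments_api']
         | map('community.general.dict_kv', 'vm_name')
         | map('combine', {'host': 'vcenter-dc1.corp.example.com'})
         | list }}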
    schedule:
      type: daily
      time: "22:00"
      timezone: "Europe/Berlin"
    repository: "veeam-repo-primary"
    retention:
      restore_points: "{{ backup_policy.retention.daily }}"
      gfs:
        weekly:
          enabled: true
          count: "{{ backup_policy.retention.weekly }}"
        monthly:
          enabled: true
          count: "{{ backup_policy.retention.monthly }}"
        yearly:
          enabled: true
          count: "{{ backup_policy.retention.yearly }}"
    options:
      backup_mode: incremental
      synthetic_full_enabled: true
      synthetic_full_days: ["saturday"]
      compression_level: 5
      storage_block_size: 1024
      encryption:
        enabled: true
        password_id: "{{ veeam_encryption_key_id }}"
    notifications:
      enabled: true
      email_to: "platform-backup-alerts@example.com"
      on_success: false
      on_warning: true
      on_failure: true
    state: present
  register: veeam_job
  no_log: false

Two key disciplines visible here:

Tag the job description “Managed by Ansible — do not edit manually”. When a Veeam admin opens the GUI and sees this, they know not to make manual changes that will be overwritten on the next sync.
Encryption is non-optional for production. Use a Veeam-managed encryption password stored once in the password manager; reference by ID, never embed the password.

2.3 Backup copy job (the off-site & immutable copies)

The “1 off-site, 1 immutable” rules are implemented as Veeam Backup Copy Jobs (BCJ) that pull from the primary repo to a secondary and tertiary target:

- name: Backup copy job — primary → AWS S3 (off-site copy)
  veeam.backup.backup_copy_job:
    session_token: "{{ veeam_session.session_token }}"
    name: "BCJ-payments_api-OFFSITE"
    source_job: "BJ-payments_api-PROD"
    target_repository: "veeam-repo-aws-s3-frankfurt"
    schedule:
      type: continuous
    retention:
      restore_points: "{{ backup_policy.copies.offsite.retention_days }}"
    state: present

- name: Backup copy job — primary → S3 Object Lock (immutable copy)
  veeam.backup.backup_copy_job:
    session_token: "{{ veeam_session.session_token }}"
    name: "BCJ-payments_api-IMMUTABLE"
    source_job: "BJ-payments_api-PROD"
    target_repository: "veeam-repo-aws-s3-objectlock"
    schedule:
      type: daily
      time: "04:00"
    retention:
      restore_points: "{{ backup_policy.copies.immutable.retention_days }}"
      immutability_days: "{{ backup_policy.copies.immutable.object_lock_days }}"
    state: present

The object-lock repository must be configured with Compliance Mode (not Governance Mode) on the underlying S3 bucket. Compliance Mode is irrevocable: even an AWS root account holder cannot delete objects within their lock period. This is what makes the backup ransomware-resistant — even if an attacker compromises your AWS root credentials, they cannot delete the immutable backups.
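The difference between the two modes can be stated as a small decision function. This is a toy model in Python of the documented S3 Object Lock behaviour, not a call into any AWS SDK; the function names are hypothetical. In Governance Mode a principal that holds the s3:BypassGovernanceRetention permission can remove a locked object version, while in Compliance Mode nobody can until the retain-until date has passed.

```python
from datetime import datetime, timedelta, timezone


def retain_until(written_at, object_lock_days):
    """Lock expiry for an object version written at `written_at`."""
    return written_at + timedelta(days=object_lock_days)


def delete_allowed(mode, expires_at, now, has_bypass_permission=False):
    """Model of whether a locked object version may be deleted."""
    if now >= expires_at:
        return True                      # lock has expired in either mode
    if mode == "GOVERNANCE":
        return has_bypass_permission     # privileged users can override
    return False                         # COMPLIANCE: nobody can, not even root


written = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = retain_until(written, 90)      # 90 = object_lock_days from the policy
day_30 = written + timedelta(days=30)

print(delete_allowed("GOVERNANCE", expires, day_30, has_bypass_permission=True))  # True
print(delete_allowed("COMPLIANCE", expires, day_30, has_bypass_permission=True))  # False
```

The first result is the reason Governance Mode is not enough for the immutable copy: an attacker who steals a sufficiently privileged credential inherits the bypass.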

2.4 Drift detection

Veeam admins, even well-meaning ones, will edit jobs in the GUI when they’re under pressure. The reconciliation playbook should run daily and detect drift:

- name: Check for unmanaged Veeam jobs (drift detection)
  veeam.backup.backup_job_info:
    session_token: "{{ veeam_session.session_token }}"
  register: all_jobs

- name: Find drifted or unmanaged jobs
  ansible.builtin.set_fact:
    drifted_jobs: >-
      {{ all_jobs.jobs
         | rejectattr('description', 'search', 'Managed by Ansible')
         | list }}

- name: Open INC if drifted jobs exist
  servicenow.itsm.incident:
    short_description: "Unmanaged Veeam backup jobs detected"
    description: |
      The following Veeam jobs are not managed by Ansible:
      {% for j in drifted_jobs %}
      - {{ j.name }} (last modified by {{ j.last_modified_by }})
      {% endfor %}
    impact: medium    # servicenow.itsm.incident expects low, medium or high
    urgency: medium
  when: drifted_jobs | length > 0

Now any Veeam job created outside Ansible becomes a ServiceNow incident automatically. Manual edits to managed jobs are a separate case: the daily reconcile run overwrites them and reports the task as changed, which is the signal to alert on. The first time you turn this on, you will discover dozens of orphan jobs, custom retention policies, and "temporary" exceptions that were never cleaned up. Cleaning these up is its own multi-week project, but you will be in a measurably better state at the end.
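Checking the description marker finds jobs that Ansible never created, but it says nothing about which settings were changed on a managed job. A minimal Python sketch of that second check follows. It assumes both the declared job and the job reported by the backup server are available as nested dictionaries; the field names in the sample data, including last_modified_by, are illustrative.

```python
def find_drift(declared, actual, prefix=""):
    """Return {dotted.path: (declared, actual)} for every declared setting
    whose live value differs. Settings present only in `actual` are ignored."""
    drift = {}
    for key, want in declared.items():
        path = prefix + key
        have = actual.get(key)
        if isinstance(want, dict):
            nested = have if isinstance(have, dict) else {}
            drift.update(find_drift(want, nested, path + "."))
        elif have != want:
            drift[path] = (want, have)
    return drift


declared = {
    "schedule": {"type": "daily", "time": "22:00"},
    "retention": {"restore_points": 14},
}
live = {
    "schedule": {"type": "daily", "time": "22:00"},
    "retention": {"restore_points": 7},      # edited by hand in the GUI
    "last_modified_by": "CORP\\jdoe",        # hypothetical extra field
}

print(find_drift(declared, live))
# {'retention.restore_points': (14, 7)}
```

Ignoring keys that exist only on the live side is a deliberate choice here: servers tend to report many read-only fields, and flagging them would bury the real drift.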

3. Rubrik automation

Rubrik takes a different philosophical approach: rather than per-VM job configuration, it uses SLA Domains — declarative policies attached to objects (VMs, databases, filesets) that the Rubrik cluster autonomously satisfies. This maps cleanly to Ansible’s declarative style.

The collection is rubrikinc.cdm.

3.1 SLA-as-code

- name: Define SLA domain for production tier-1
  rubrikinc.cdm.rubrik_sla_domain:
    name: "Tier-1-PROD"
    archive_settings:
      archival_threshold_unit: hours
      archival_threshold: 4
      archival_location_id: "{{ rubrik_archive_aws_id }}"
    replication_settings:
      target_cluster_id: "{{ rubrik_dr_cluster_id }}"
      retention_days: 30
    frequencies:
      hourly:
        frequency: 4
        retention: 72
      daily:
        frequency: 1
        retention: 14
      weekly:
        frequency: 1
        day_of_week: SATURDAY
        retention: 6
      monthly:
        frequency: 1
        day_of_month: LAST_DAY
        retention: 12
      yearly:
        frequency: 1
        day_of_year: LAST_DAY
        retention: 7
    state: present
  register: sla

3.2 Object → SLA assignment

- name: Assign SLA to all production VMs
  rubrikinc.cdm.rubrik_assign_sla:
    object_name: "{{ item.name }}"
    object_type: vmware
    sla_name: "Tier-1-PROD"
  loop: "{{ vmware_vm_inventory }}"
  when: item.tags | intersect(['env:prod', 'tier:1']) | length == 2

The pattern is: read the source of truth (CMDB or vCenter tags), derive the SLA, apply it. When a VM gets retagged in vCenter, the next reconciliation run reassigns it to the correct SLA. Note that the when: filter above simply skips VMs that lack the expected tags, so a "VM with no SLA" state becomes impossible only once you add a default "uncategorised" SLA that catches anything missed and alerts loudly.
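The derive step is a pure function from tags to an SLA name, which makes it easy to test away from any cluster. The Python sketch below is one way to write it. Only the Tier-1-PROD name comes from this lesson; Tier-2-PROD, Uncategorised and the sample VM names are hypothetical.

```python
# Rules are checked in order, most specific first. Only "Tier-1-PROD" comes
# from the lesson; the other SLA names are hypothetical.
SLA_RULES = [
    ({"env:prod", "tier:1"}, "Tier-1-PROD"),
    ({"env:prod"}, "Tier-2-PROD"),
]
DEFAULT_SLA = "Uncategorised"


def sla_for(tags):
    """Return the SLA for a VM's tags, falling back to the catch-all."""
    tagset = set(tags)
    for required, sla in SLA_RULES:
        if required <= tagset:       # every required tag is present
            return sla
    return DEFAULT_SLA


def plan_assignments(vms):
    """Map VM name -> SLA name; every VM gets an entry."""
    return {vm["name"]: sla_for(vm.get("tags", [])) for vm in vms}


vms = [
    {"name": "pay-api-01", "tags": ["env:prod", "tier:1"]},
    {"name": "wiki-01", "tags": ["env:prod"]},
    {"name": "mystery-box", "tags": []},
]
print(plan_assignments(vms))
# {'pay-api-01': 'Tier-1-PROD', 'wiki-01': 'Tier-2-PROD', 'mystery-box': 'Uncategorised'}
```

Because the fallback is unconditional, every VM in the input appears in the output, and anything landing in the catch-all is a ready-made alert list.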

3.3 Rubrik recovery automation

The killer feature of Rubrik is its API-driven instant recovery — mounting a backup as a live VMDK in seconds. This is what makes automated restore drills (section 7) actually feasible:

- name: Live-mount latest snapshot for restore test
  rubrikinc.cdm.rubrik_vsphere_live_mount:
    vm_name: "{{ source_vm }}"
    mounted_vm_name: "{{ source_vm }}-restore-test-{{ ansible_date_time.epoch }}"
    snapshot_id: latest
    host: "{{ restore_test_esxi_host }}"
    datastore: "{{ restore_test_datastore }}"
    power_on: true
  register: live_mount

- name: Wait for VM to boot
  ansible.builtin.wait_for:
    host: "{{ live_mount.ip_address }}"
    port: 22
    timeout: 600
    delay: 30

A live-mount completes in ~30 seconds and consumes near-zero storage (it’s a CoW overlay on the backup). Restore tests that would have taken 2-4 hours of disk copy now take 5 minutes total. This unlocks frequent, automated restore validation.

4. Cohesity and Commvault — the “everything else” tier

If you are running Cohesity, the collection is cohesity.dataprotect. Concepts are similar to Rubrik (Protection Policies are the SLA equivalent):

- name: Define Cohesity protection policy
  cohesity.dataprotect.cohesity_protection_policy:
    name: "Tier-1-PROD"
    backup_schedule:
      unit: hours
      frequency: 4
    retention:
      unit: days
      duration: 14
    extended_retention:
      - schedule:
          unit: weeks
          frequency: 1
        retention:
          unit: weeks
          duration: 6
    archival_targets:
      - target_id: "{{ cohesity_aws_archive_id }}"
        unit: days
        duration: 30
    state: present

Commvault publishes its modules as the commvault.ansible collection. The semantics are different (Commvault uses Storage Policies and Subclients), but the pattern is the same: define policies declaratively, attach them to clients, reconcile daily.

The general lesson: regardless of vendor, everything maps to the same five concepts — schedule, retention, copies, encryption, target. The Ansible policy file looks identical; only the collection’s module names change. Teams who run multi-vendor backup estates can build a thin Ansible role that takes the unified backup_policy dictionary and dispatches to the right vendor module.
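The dispatch idea can be sketched as a table of small translation functions. The Python below is an illustration under stated assumptions: the output shapes imitate the task examples in this lesson rather than any vendor's API reference, and the names to_veeam, to_rubrik and render are hypothetical.

```python
# Hypothetical translation layer: one unified policy in, vendor-shaped
# settings out. The output shapes imitate the task examples in this lesson
# and are not taken from any vendor's API reference.

def to_veeam(policy):
    r = policy["retention"]
    return {
        "retention": {
            "restore_points": r["daily"],
            "gfs": {tier: {"enabled": True, "count": r[tier]}
                    for tier in ("weekly", "monthly", "yearly")},
        }
    }


def to_rubrik(policy):
    r = policy["retention"]
    return {
        "frequencies": {tier: {"frequency": 1, "retention": r[tier]}
                        for tier in ("daily", "weekly", "monthly", "yearly")}
    }


RENDERERS = {"veeam": to_veeam, "rubrik": to_rubrik}


def render(vendor, policy):
    """Dispatch the unified policy to the renderer for `vendor`."""
    if vendor not in RENDERERS:
        raise ValueError(f"no renderer for backup vendor {vendor!r}")
    return RENDERERS[vendor](policy)


policy = {"retention": {"daily": 14, "weekly": 6, "monthly": 12, "yearly": 7}}
print(render("veeam", policy)["retention"]["gfs"]["monthly"])
# {'enabled': True, 'count': 12}
print(render("rubrik", policy)["frequencies"]["weekly"])
# {'frequency': 1, 'retention': 6}
```

In an Ansible role the same table would typically be a set of per-vendor task files selected by a variable; keeping the translation separate from the policy is what lets the policy file stay vendor-neutral.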

5. Storage-array snapshot automation (NetApp, Pure, Dell)

The snapshot tier of the 3-2-1-1-0 rule is satisfied by storage-array native snapshots, not by backup software. These are the fastest recovery option (seconds to mount, vs minutes for backup-software restore) and the cheapest (no separate compute or media). Most major enterprise arrays have an Ansible collection.

5.1 NetApp ONTAP

netapp.ontap is one of the most mature collections. A typical pattern:

- name: Configure SnapMirror policy (replication to DR cluster)
  netapp.ontap.na_ontap_snapmirror_policy:
    state: present
    policy_name: "MirrorAndVault-Daily"
    vserver: "svm-payments"   # SnapMirror policies are scoped to an SVM
    policy_type: mirror_vault
    snapmirror_label: ["daily", "weekly", "monthly"]
    keep: [14, 6, 12]
    schedule: ["daily", "weekly", "monthly"]
    hostname: "{{ ontap_mgmt_ip }}"
    username: "{{ ontap_user }}"
    password: "{{ vault_ontap_password }}"
    https: true
    validate_certs: true
  no_log: true

- name: Configure snapshot policy on volume
  netapp.ontap.na_ontap_volume:
    state: present
    name: "vol_payments_data"
    vserver: "svm-payments"
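    # A snapshot policy is a separate object from the SnapMirror policy above.
    # "sp-payments-daily" is a placeholder for one created with
    # netapp.ontap.na_ontap_snapshot_policy; its SnapMirror labels must match
    # the labels used in MirrorAndVault-Daily.
    snapshot_policy: "sp-payments-daily"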
    snapshot_auto_delete:
      state: "on"
      delete_order: "oldest_first"
      defer_delete: "user_created"
      trigger: "volume"
      target_free_space: 20
    # SnapLock for immutable snapshots (compliance mode).
    # Durations are written in ISO 8601 notation, as the ONTAP REST API expects.
    snaplock:
      type: compliance
      autocommit_period: PT4H
      retention:
        default: P7Y
        minimum: P30D
        maximum: P10Y
    hostname: "{{ ontap_mgmt_ip }}"
    username: "{{ ontap_user }}"
    password: "{{ vault_ontap_password }}"
    https: true
    validate_certs: true
  no_log: true

The key feature here is SnapLock Compliance: once data is committed to a SnapLock Compliance volume, nobody, including ONTAP admins, can delete it before its retention period expires. This is the array-level equivalent of S3 Object Lock and serves the same role: ransomware can encrypt your live volume, but it cannot touch your locked snapshots.

5.2 Pure Storage

Pure Storage ships two collections: purestorage.flasharray for FlashArray and purestorage.flashblade for FlashBlade. The patterns are similar:

- name: Set protection group for volume
  purestorage.flasharray.purefa_pg:
    name: "pg_payments_prod"
    volume:
      - "vol_payments_data"
      - "vol_payments_logs"
    target:
      - "{{ purefa_dr_array_name }}"
    state: present
    fa_url: "{{ purefa_mgmt_ip }}"
    api_token: "{{ vault_purefa_api_token }}"
  no_log: true

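# Schedules and retention are set with purefa_pgsched; purefa_pgsnap only
# takes or restores individual snapshots.
- name: Set snapshot schedule
  purestorage.flasharray.purefa_pgsched:
    name: "pg_payments_prod"
    schedule: snapshot
    enabled: true
    snap_frequency: 14400       # 4 hours in seconds
    all_for: 86400              # keep every snapshot for 1 day
    per_day: 1                  # then keep 1 per day
    days: 14                    # for 14 more days
    fa_url: "{{ purefa_mgmt_ip }}"
    api_token: "{{ vault_purefa_api_token }}"
  no_log: true

- name: Set replication schedule
  purestorage.flasharray.purefa_pgsched:
    name: "pg_payments_prod"
    schedule: replication
    enabled: true
    replicate_frequency: 14400  # 4 hours in seconds
    target_all_for: 86400
    target_per_day: 1
    target_days: 14
    fa_url: "{{ purefa_mgmt_ip }}"
    api_token: "{{ vault_purefa_api_token }}"
  no_log: true
# This sketch leaves the 6-weekly and 12-monthly tiers to the backup software,
# because the protection group schedule only expresses per-day retention.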

- name: Enable SafeMode (ransomware protection)
  purestorage.flasharray.purefa_pg:
    name: "pg_payments_prod"
    safe_mode: true
    fa_url: "{{ purefa_mgmt_ip }}"
    api_token: "{{ vault_purefa_api_token }}"
  no_log: true

Pure’s SafeMode is similar in spirit to ONTAP SnapLock: snapshots and protection groups become undeletable except via a Pure-side multi-party approval workflow. It is the mechanism by which Pure customers survived several high-profile 2023-2024 ransomware attacks where every other storage tier was encrypted.

5.3 Dell PowerStore / IBM Storage / others

dellemc.powerstore and ibm.storage_virtualize collections follow the same shape. The pattern is identical: define protection policies, attach to volumes, enable immutability, replicate to DR.

The discipline that matters across all of these: never allow a production volume to exist without a protection policy. A daily compliance check should query each array’s API for unprotected volumes and open an incident for each one. This is the fastest way to surface “someone provisioned a database last quarter and forgot to set up backups” — exactly the kind of gap that becomes catastrophic during ransomware.
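The compliance check itself is a one-line filter once each array's volume list has been normalised. The Python sketch below assumes a list of records with name, env and protection_policy keys; that shape and the sample volume names vol_reporting_db and vol_scratch are assumptions of this example, since every array API reports volumes differently.

```python
def unprotected_volumes(volumes):
    """Names of production volumes that have no protection policy attached."""
    return sorted(
        v["name"] for v in volumes
        if v.get("env") == "prod" and not v.get("protection_policy")
    )


inventory = [
    {"name": "vol_payments_data", "env": "prod",
     "protection_policy": "pg_payments_prod"},
    {"name": "vol_reporting_db", "env": "prod", "protection_policy": None},
    {"name": "vol_scratch", "env": "dev"},
]
for name in unprotected_volumes(inventory):
    print(f"production volume {name} has no protection policy")
```

The hard part in practice is the normalisation step that precedes this, in particular deciding reliably which volumes count as production.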

6. Immutability strategies and the “1” in 3-2-1-1-0

There are four mechanisms for true immutability, in increasing order of strength:

Mechanism	Strength	Cost	Use case
Hardened Linux repository (Veeam)	Medium	Low	If you don’t have S3
Object-lock S3 (Compliance Mode)	High	Low (cloud storage cost)	Off-site copy, easy automation
Storage array SnapLock / SafeMode	High	Medium (array licensing)	On-prem snapshot tier
Tape (LTO with WORM cartridges)	Highest	High (operational complexity)	Highly regulated industries (banks, defence, nuclear)

The recommendation for most organisations is two of the above: S3 Object Lock for the cloud copy, and either SnapLock or SafeMode for the on-prem snapshot copy. This gives you two independent immutability mechanisms with different threat models, dramatically reducing the chance that a single attacker can defeat both.

The classic mistake is "one immutable copy", say, only S3 Object Lock. An attacker who gets your S3 access keys cannot delete locked objects, but they can delete every object whose lock has already expired and can quietly break the jobs that write new ones, then wait. Two independent immutable copies with different access mechanisms (one needs AWS IAM, the other needs ONTAP admin credentials) are far harder to defeat.

A note on tape: tape is having a renaissance specifically because of ransomware. An LTO tape that has been ejected from the library and stored in a fireproof safe is, by definition, air-gapped — no network attacker can touch it. For compliance-driven industries the cost is justified. The Ansible automation here is minimal (you can’t reach an offline tape) but the workflow automation is rich: ServiceNow CHGs for tape-vaulting events, signed manifests of what’s on each tape, and quarterly retrieval drills that prove the chain of custody works.

7. The discipline that actually matters: automated restore drills

Backups that have never been restored are not backups. They are claims about backups. The 0-errors part of 3-2-1-1-0 is about converting claims into evidence.

The minimum viable restore-drill loop:

---
- name: Monthly automated restore drill
  hosts: localhost
  gather_facts: true
  vars:
    drill_id: "drill-{{ ansible_date_time.iso8601_basic_short }}"
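    # The random filter has no count option: shuffle with a date seed, then slice.
    sample_size: "{{ [(groups['env_prod'] | length * 0.05) | round(0, 'ceil') | int, 1] | max }}"
    target_workloads: "{{ (groups['env_prod'] | shuffle(seed=ansible_date_time.date))[:(sample_size | int)] }}"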
  tasks:

    - name: Open CHG for restore drill
      servicenow.itsm.change_request:
        type: standard
        template: "Standard - DR Restore Drill"
        short_description: "{{ drill_id }} restore drill ({{ target_workloads | length }} workloads)"
        state: implement
        start_date: "{{ ansible_date_time.iso8601 }}"
      register: drill_chg

    - name: Live-mount each target VM
      rubrikinc.cdm.rubrik_vsphere_live_mount:
        vm_name: "{{ item }}"
        mounted_vm_name: "{{ item }}-{{ drill_id }}"
        snapshot_id: latest
        host: "{{ restore_test_esxi_host }}"
        datastore: "{{ restore_test_datastore }}"
        power_on: true
      loop: "{{ target_workloads }}"
      register: mounted_vms

    - name: Wait for each VM to boot and answer on SSH (use 5986 for WinRM hosts)
      ansible.builtin.wait_for:
        host: "{{ item.ip_address }}"
        port: 22
        timeout: 900
      loop: "{{ mounted_vms.results }}"
      loop_control:
        label: "{{ item.item }}"

    - name: Run application-level smoke tests
      ansible.builtin.include_role:
        name: kv.app_smoke_tests
      vars:
        target_host: "{{ item.ip_address }}"
        original_vm_name: "{{ item.item }}"
      loop: "{{ mounted_vms.results }}"
      register: smoke_test_results

    - name: Run filesystem integrity checks
      ansible.builtin.shell: |
        find / -type f \( -name "*.db" -o -name "*.dat" -o -name "*.idx" \) \
          -exec md5sum {} \; > /tmp/restore-checksums.txt
      delegate_to: "{{ item.ip_address }}"
      loop: "{{ mounted_vms.results }}"
      changed_when: false

    - name: Compare checksums against pre-backup baseline
      ansible.builtin.include_role:
        name: kv.checksum_compare
      vars:
        baseline_path: "/var/lib/backup-baselines/{{ item.item }}.md5"
        actual_path: "/tmp/restore-checksums.txt"
      loop: "{{ mounted_vms.results }}"
      register: checksum_results

    - name: Generate drill report
      ansible.builtin.template:
        src: drill_report.html.j2
        dest: "/var/lib/backup-drills/{{ drill_id }}/report.html"
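      vars:
        smoke_results: "{{ smoke_test_results.results }}"
        # Distinct name: registered vars outrank task vars, so a task var
        # called checksum_results would be ignored by the template.
        checksum_report: "{{ checksum_results.results }}"
        chg_number: "{{ drill_chg.record.number }}"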

    - name: Sign and archive drill evidence
      ansible.builtin.shell: |
        tar -czf {{ drill_id }}.tar.gz {{ drill_id }}/
        gpg --batch --yes --detach-sign --armor --local-user backup-drill-signer \
          {{ drill_id }}.tar.gz
        aws s3 cp {{ drill_id }}.tar.gz s3://kv-evidence-immutable/restore-drills/
        aws s3 cp {{ drill_id }}.tar.gz.asc s3://kv-evidence-immutable/restore-drills/
      args:
        chdir: /var/lib/backup-drills/

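    # In production, wrap the drill tasks in block: and run this teardown
    # under always:, so a failed check cannot leave mounted VMs behind.
    - name: Tear down restore-test VMs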
      rubrikinc.cdm.rubrik_vsphere_live_unmount:
        mounted_vm_name: "{{ item.item }}-{{ drill_id }}"
      loop: "{{ mounted_vms.results }}"

    - name: Close CHG with results
      servicenow.itsm.change_request:
        number: "{{ drill_chg.record.number }}"
        state: closed
        close_code: "{{ 'successful' if all_passed else 'unsuccessful' }}"
        close_notes: |
          Restore drill {{ drill_id }} complete.
          Workloads tested: {{ target_workloads | length }}
          Passed: {{ smoke_test_results.results | selectattr('failed', 'equalto', false) | list | length }}
          Failed: {{ smoke_test_results.results | selectattr('failed', 'equalto', true) | list | length }}
          Evidence: s3://kv-evidence-immutable/restore-drills/{{ drill_id }}.tar.gz
      vars:
        all_passed: "{{ smoke_test_results.results | rejectattr('failed', 'equalto', false) | list | length == 0 }}"

The pattern is:

Random sample: every month, pick 5% of the production fleet at random (seeded by date for reproducibility)
Live-mount, don’t full-restore: leverage Rubrik/Veeam Instant Recovery to make this fast enough to actually do
Test at three layers: VM boots, app smoke tests pass, filesystem checksums match baseline
Sign and archive evidence: the drill report itself is signed and goes to immutable storage
CHG-tracked: the entire drill is a Change ticket in ServiceNow, with success/failure recorded

The bar for success is that 100% of the sampled workloads pass. Anything less, and you have a real backup integrity problem to investigate.
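The sampling and pass/fail rules above are small enough to express and test outside the playbook. The following Python sketch assumes a 5% sample with a floor of one workload and a seed derived from the drill date; the function names and the vm-NNN host names are hypothetical.

```python
import math
import random


def pick_drill_sample(hosts, pct, seed):
    """Pick ceil(pct%) of hosts, at least one, reproducibly for a given seed."""
    if not hosts:
        return []
    count = max(1, math.ceil(len(hosts) * pct / 100))
    rng = random.Random(seed)            # private generator; no global state
    return rng.sample(sorted(hosts), count)


def drill_passed(results, threshold_pct=100):
    """True only if the share of passing workloads meets the threshold."""
    if not results:
        return False                     # an empty drill proves nothing
    passed = sum(1 for ok in results.values() if ok)
    return passed * 100 >= threshold_pct * len(results)


fleet = [f"vm-{i:03d}" for i in range(100)]
sample = pick_drill_sample(fleet, pct=5, seed="2024-06-01")
print(len(sample))                                         # 5
print(drill_passed({"vm-001": True, "vm-002": False}))     # False
```

Sorting the hosts before sampling means the result depends only on the fleet membership and the seed, not on the order in which the inventory happens to list them, so a drill can be replayed later for an auditor.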

7.1 The metrics that matter

Three numbers that should be on a dashboard reviewed in every operations standup:

Restore drill pass rate (last 12 months): target 100%, alert below 99%
Time-to-first-byte on restore (P95): target < 10 minutes, alert above 30 minutes
Backup coverage of production CIs: target 100%, alert below 99% (anything missing is a P1 gap)

The first metric is the only one that proves backups work. The second proves they’re operationally useful (a backup that takes 8 hours to mount is useless for most outage scenarios). The third proves they’re complete.
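As a rough illustration, the three numbers and their alert thresholds can be computed as follows. This Python sketch assumes drill outcomes arrive as a list of booleans and restore timings as minutes; the function names are hypothetical, and the P95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
def pass_rate_pct(outcomes):
    """Percentage of drills (True = passed) that succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)      # ceil(0.95 * n) in integer maths
    return ordered[rank - 1]


def coverage_pct(protected, total):
    """Percentage of production CIs that have a backup."""
    return 100.0 * protected / total


def dashboard_alerts(outcomes, restore_minutes, protected, total):
    """Apply the alert thresholds from the list above."""
    alerts = []
    if pass_rate_pct(outcomes) < 99:
        alerts.append("restore drill pass rate below 99%")
    if p95(restore_minutes) > 30:
        alerts.append("P95 time-to-first-byte above 30 minutes")
    if coverage_pct(protected, total) < 99:
        alerts.append("backup coverage below 99%")
    return alerts


outcomes = [True] * 59 + [False]             # one failed drill out of 60
restore_minutes = list(range(1, 21))         # 1..20 minutes
print(p95(restore_minutes))                  # 19
print(dashboard_alerts(outcomes, restore_minutes, protected=990, total=1000))
# ['restore drill pass rate below 99%']
```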

8. Backup chain integrity and synthetic full validation

A specific failure mode worth calling out: silent corruption of the backup chain. Backup software performs incrementals against a previous full or synthetic full. If any link in the chain is corrupt, every subsequent restore from that chain is corrupt. You will not notice until you try to restore.

The mitigations:

Synthetic full weekly: every Saturday, the backup software constructs a new “full” from the prior chain. If the prior chain has issues, this fails — and you find out within a week, not a year.
Health checks (Veeam terminology — equivalents in other tools): periodic re-read and verification of all blocks. CPU-intensive but worth it for tier-1 workloads.
CRC validation on every read: most modern backup repositories do this by default; verify it’s actually enabled.
Periodic forensic scrubs: a quarterly job that mounts a sampling of older snapshots (90, 180, 365 days old) and verifies they can still be read. Backups that worked yesterday tell you nothing about backups from a year ago.

Encoded as Ansible:

- name: Trigger health check on backup repository
  veeam.backup.repository_health_check:
    session_token: "{{ veeam_session.session_token }}"
    repository: "veeam-repo-primary"
    full_scan: false
  register: health_check

- name: Wait for health check to complete
  veeam.backup.repository_health_check_info:
    session_token: "{{ veeam_session.session_token }}"
    job_id: "{{ health_check.job_id }}"
  register: health_status
  until: health_status.state in ['Stopped', 'Failed']
  retries: 360
  delay: 60

- name: Open INC if health check found issues
  servicenow.itsm.incident:
    short_description: "Veeam repository health check found issues"
    description: |
      Repository: veeam-repo-primary
      Issues: {{ health_status.issues_found }}
      Details: {{ health_status.details }}
    impact: high      # servicenow.itsm.incident expects low, medium or high
    urgency: high
  when: health_status.issues_found > 0
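The chain dependency described at the start of this section can be made concrete with a toy model. The Python sketch below is not how any backup product stores its data; it only assumes that each restore point records a checksum at backup time and that an incremental is usable only if every earlier link still verifies.

```python
import hashlib


def make_point(data):
    """Build a restore point record with the checksum taken at backup time."""
    return {"data": data, "sha256": hashlib.sha256(data).hexdigest()}


def first_broken_link(chain):
    """Index of the first restore point whose data no longer matches its
    recorded checksum, or None if the whole chain verifies."""
    for index, point in enumerate(chain):
        if hashlib.sha256(point["data"]).hexdigest() != point["sha256"]:
            return index
    return None


def restorable_count(chain):
    """An incremental depends on every earlier link, so only the points
    before the first broken link can be restored."""
    broken = first_broken_link(chain)
    return len(chain) if broken is None else broken


# Oldest first: one full backup followed by three incrementals.
chain = [make_point(b"full"), make_point(b"inc-1"),
         make_point(b"inc-2"), make_point(b"inc-3")]
chain[2]["data"] = b"inc-2 with a flipped bit"   # simulate silent corruption

print(first_broken_link(chain))   # 2
print(restorable_count(chain))    # 2
```

Note that the fourth point is itself intact yet still unusable, which is why a periodic synthetic full, which re-reads the whole chain, surfaces the problem while a per-file existence check does not.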

9. The economic dimension — and why it matters for adoption

A common political problem with backup automation: backup admins feel threatened by it, because it makes their job legible to engineering leadership. “We’ve always been a 4-FTE team because backups are complicated” is harder to defend when an Ansible role can spin up correct backup configuration for a new application in 30 seconds.

The way to land this politically is to be very clear that automation makes backup admins more valuable, not less. The work shifts from:

“Click through 47 screens to configure a new VM’s backup” (replaceable, low-leverage)

to:

“Design backup policies that satisfy regulators and survive ransomware” (irreplaceable, high-leverage)
“Investigate the 2 restore-drill failures last quarter” (irreplaceable)
“Run quarterly DR exercises with the application teams” (irreplaceable)

In every successful rollout I’ve seen, the backup admin team becomes the trusted advisor function — they review proposed policy changes, approve PRs to the policy repo, and own the relationship with vendors. The grunt work of clicking through GUIs disappears, but their organisational standing increases. Frame it that way from day one and adoption is much easier.

10. Common failure modes (and how to spot them)

Failure mode	What it looks like	How to detect
Backup ran but contains no data	Job succeeds, restored VM is empty	Restore drill checksums mismatch
Encryption key lost	Backup exists but cannot be decrypted	Quarterly key-recovery drill
Object-lock retention misconfigured	Auditor finds 30-day retention where 90 was required	Daily policy reconciliation; alert on drift
Replication broken silently	Off-site copy is N days behind	`last_successful_run` query on every BCJ; alert if > 24h
Test snapshots fill the array	Live-mount drills accumulate, eat capacity	Always tear down mounted VMs in a `block:`/`always:` pattern
Backup admin GUI changes	Policies drift from declared state	Daily reconciliation play with diff alerts
Tape unreadable	Drive degraded or tape damaged	Quarterly tape retrieval drill
Ransomware deletes online backups, immutables save you	But you didn’t test recovery from immutables	Quarterly drill specifically restores from immutable copy
New application onboarded without backup	“We didn’t know we needed it”	Daily compliance scan: any production CI without protection_policy → P1 incident
Service account credentials rotated	All backup jobs fail at 03:00	Backup-system credential rotation must be a coordinated change with explicit verification
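The "replication broken silently" check in the table reduces to comparing timestamps against a maximum lag. A minimal Python sketch follows, assuming the last successful run of each copy job has already been fetched into a dictionary; the job name BCJ-new-app-OFFSITE is hypothetical, and a job that has never succeeded is represented as None.

```python
from datetime import datetime, timedelta, timezone


def stale_copy_jobs(last_success, now, max_lag=timedelta(hours=24)):
    """Names of jobs whose last successful run is missing or too old."""
    return sorted(
        name for name, finished in last_success.items()
        if finished is None or now - finished > max_lag
    )


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
last_success = {
    "BCJ-payments_api-OFFSITE": now - timedelta(hours=3),
    "BCJ-payments_api-IMMUTABLE": now - timedelta(days=4),
    "BCJ-new-app-OFFSITE": None,          # hypothetical job that never succeeded
}
print(stale_copy_jobs(last_success, now))
# ['BCJ-new-app-OFFSITE', 'BCJ-payments_api-IMMUTABLE']
```

Treating "never succeeded" as stale matters: a freshly created copy job that has not yet run is exactly the gap a success-time comparison alone would miss.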

The single most valuable habit is the quarterly chaos drill — pick one production application at random, simulate “your backup is the only thing left, recover the application from scratch in a clean environment, time it.” That exercise will surface every weakness in your backup automation faster than any audit.

11. Where this fits in the broader course

The compliance lesson (D1) gave you the paper-trail discipline. The DR lesson (D2) gave you the worst-case-day response. The migrations lesson (D3) gave you the bulk fleet change skills. The air-gap lesson (D4) gave you the isolated environment patterns. The SAP lesson (D5) gave you the complex-stack integration. The edge lesson (D6) gave you the device fleet primitives. The ITSM lesson (D7) gave you the governance fabric.

This lesson gives you the last line of defence: the system that has to keep working when everything else has failed. It is also the system most often neglected, least often exercised, and most often inadequate when actually called upon.

The remaining lessons cover the database-migration patterns (D9 — online migrations, blue-green database swaps, zero-downtime cutovers) and the observability capstone (D10) that ties metrics, logs, traces, and AAP events into a coherent operational dashboard. Those final two lessons close the loop on the entire Tier 5 curriculum: from what to automate to how to be sure it’s running well.

What you should walk away with is the conviction that backups are not an IT operations side-quest. They are the foundational control that everything else depends on, and they earn that role only when they are tested, declarative, immutable, and reconciled. Anything less is a backup that exists in the same way that a fire extinguisher with an empty CO₂ canister exists — present, mounted on the wall, completely useless when needed.