Ansible Lesson 35 of 42

Ansible for OS Migrations, In Depth: P2V, V2V, RHEL Major-Version Upgrades & Windows Server Upgrades

Operating system migrations are the projects that consume two engineers for nine months and deliver a 30-page wiki document that nobody reads. They do not have to. The work splits into a handful of repeatable patterns, each of which Ansible automates well: lift-and-shift conversion (P2V/V2V), major-version in-place upgrade (RHEL leapp, Windows setup.exe), and migrate-and-modernise (rebuild on a new image and replay configuration with Ansible). This lesson teaches all three, with the runbook patterns that keep migrations safe at fleet scale.

We will be opinionated. We will not cover every vendor migration tool that ever shipped — we will cover the open-source toolchain (virt-v2v, leapp, dnf, ansible-core, the community.windows collection) plus the small number of commercial pieces (vSphere, Veeam, AWS MGN) that actually integrate with Ansible runbooks. We will also be honest about the mistakes — the migrations that look clean on paper and break in production because someone forgot to reconcile UID space or DNS suffix.

Position in the curriculum. This lesson assumes Tier 1–4 fluency plus the Tier 5 compliance and DR lessons. Migrations interact with both: a migration window is a controlled mini-disaster, and the DR runbook is your safety net.


What “OS migration” means in the Ansible context

Three migration shapes account for almost all real work:

  1. P2V — Physical to Virtual. A bare-metal server (often a 10-year-old HP ProLiant) is captured and replayed as a VM on vSphere/KVM/AWS. Used to retire dying hardware without re-architecting the application.
  2. V2V — Virtual to Virtual. A VM moves between hypervisors or between vSphere clusters or out to cloud (vSphere → AWS via AWS MGN, vSphere → KVM via virt-v2v, Hyper-V → vSphere). Used to consolidate, exit a vendor, or migrate to cloud.
  3. In-place major-version upgrade. RHEL 7 → 8 → 9 with leapp; Windows Server 2016 → 2019 → 2022 with setup.exe /Auto:Upgrade; Ubuntu 20.04 → 22.04 → 24.04 with do-release-upgrade. The OS stays on the same VM; only the userland and kernel change.

There is a fourth pattern, rebuild-and-reconfigure, which is technically not a migration: it is a clean install of the new OS plus an Ansible playbook replay against fresh hosts, then a data cut-over. It is the safest of the four when feasible, because the new host is built from the current source of truth instead of carrying years of manual drift. We cover it last because it is the easiest to do right.

Ansible’s role in all four:

The discipline is the same as DR: small, idempotent roles; each step gated; the runbook is the source of truth.


The migration repository layout

Treat the migration as its own short-lived repository (or a sub-tree of a long-lived one). Once the migration is complete, the runbook code is preserved as evidence; the inventory is archived.

os-migration-rhel7-to-rhel9/
├── ansible.cfg
├── collections/requirements.yml      # community.general, ansible.posix,
│                                     # community.crypto, redhat.rhel_system_roles,
│                                     # community.windows (for mixed estates)
├── inventory/
│   ├── source/                        # what we are leaving (RHEL 7)
│   └── target/                        # what we are arriving at (RHEL 9)
├── group_vars/
│   └── all/
│       ├── migration_window.yml
│       ├── leapp.yml
│       └── vault.yml
├── playbooks/
│   ├── 00-discover-source.yml
│   ├── 10-pre-flight-checks.yml
│   ├── 20-snapshot-and-backup.yml
│   ├── 30-leapp-preupgrade.yml
│   ├── 40-leapp-upgrade.yml
│   ├── 50-post-upgrade-reconcile.yml
│   ├── 60-validate.yml
│   ├── 70-rollback.yml                # snapshot revert
│   └── 99-decommission-source.yml
└── roles/
    ├── discover_packages/
    ├── discover_services/
    ├── discover_disk_layout/
    ├── leapp_prepare/
    ├── leapp_run/
    ├── reconcile_network/
    ├── reconcile_repos/
    ├── reconcile_selinux/
    ├── reconcile_certs/
    ├── reconcile_third_party/
    ├── post_upgrade_validate/
    └── migration_evidence/

The naming pattern (00-, 10-, …) makes the migration order obvious to a tired operator at 2am, and the role split keeps the failure radius small.


Inventory and discovery: know what you are migrating

Most migration disasters are inventory disasters. The 18-year-old Linux box had an undocumented cron job that fired once a quarter; you discover it six weeks after the migration, when finance asks where their report went. Spend disproportionate effort on discovery; it is the cheapest insurance you will ever buy.

# playbooks/00-discover-source.yml
---
- name: Discover source estate
  hosts: source_estate
  gather_facts: true
  tasks:

    - name: Capture all installed packages
      ansible.builtin.package_facts:

    - name: Capture all running services
      ansible.builtin.service_facts:

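    - name: Capture cron jobs (system + user)
      ansible.builtin.shell: |
        cat /etc/crontab /etc/cron.d/* 2>/dev/null
        for u in $(cut -d: -f1 /etc/passwd); do
          # The marker is printed only for users that actually have a crontab
          crontab -u "$u" -l 2>/dev/null && echo "##USER:$u"
        done
        true   # a last user without a crontab must not fail the task
      become: true   # reading other users' crontabs requires root
      register: cron_dump
      changed_when: false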

    - name: Capture systemd timers
      ansible.builtin.shell: systemctl list-timers --all --no-pager
      register: timers
      changed_when: false

    - name: Capture network interfaces and routes
      ansible.builtin.shell: |
        ip -j addr; echo '---'; ip -j route
      register: net
      changed_when: false

    - name: Capture mount table
      ansible.builtin.shell: findmnt --json
      register: mounts
      changed_when: false

    - name: Capture third-party agents (everything not in package manager DB)
      ansible.builtin.find:
        paths: [/opt, /usr/local]
        file_type: directory
        recurse: false
      register: third_party

    - name: Capture listening ports
      ansible.builtin.shell: ss -tulpenH
      register: ports
      changed_when: false

    - name: Persist inventory artefact
      ansible.builtin.copy:
        dest: "/var/lib/migration/discovery/{{ inventory_hostname }}.json"
        # Every value goes through to_json so the file is valid JSON.
        # net.stdout holds two JSON documents separated by '---', so it is
        # split before parsing.
        content: |
          {
            "host": "{{ inventory_hostname }}",
            "os": "{{ ansible_distribution }} {{ ansible_distribution_version }}",
            "kernel": "{{ ansible_kernel }}",
            "packages": {{ ansible_facts.packages | to_json }},
            "services": {{ ansible_facts.services | to_json }},
            "cron": {{ cron_dump.stdout_lines | to_json }},
            "timers": {{ timers.stdout_lines | to_json }},
            "net": {{ net.stdout.split('---') | map('from_json') | list | to_json }},
            "mounts": {{ mounts.stdout | from_json | to_json }},
            "third_party_dirs": {{ third_party.files | map(attribute='path') | list | to_json }},
            "ports": {{ ports.stdout_lines | to_json }}
          }
      delegate_to: localhost

Run this against every host you intend to migrate. Build a small Python script (or a Jupyter notebook) over the resulting JSON to surface:

Ship the analysis to the application owner before scheduling the window. The number of “wait, you can’t migrate that — it’s owned by Risk” conversations you avoid is the ROI of the discovery playbook.
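
One possible starting point for that analysis is sketched below in Python. It assumes the JSON layout written by the discovery playbook above (the keys host, cron, third_party_dirs and ports); the function name summarise_discovery and the choice of columns are illustrative, not part of any existing tool.

```python
import json
from pathlib import Path


def summarise_discovery(discovery_dir):
    """Return one summary row per host from the discovery JSON artefacts.

    Assumes each file carries the keys written by 00-discover-source.yml:
    "host", "cron", "third_party_dirs" and "ports".
    """
    rows = []
    for path in sorted(Path(discovery_dir).glob("*.json")):
        data = json.loads(path.read_text())
        # Skip blank lines, comments and the "##USER:" markers in the cron dump.
        cron_lines = [
            line for line in data.get("cron", [])
            if line.strip() and not line.lstrip().startswith("#")
        ]
        rows.append({
            "host": data.get("host", path.stem),
            "cron_lines": len(cron_lines),
            "third_party_dirs": data.get("third_party_dirs", []),
            "listening_sockets": len(data.get("ports", [])),
        })
    return rows
```

Calling summarise_discovery("/var/lib/migration/discovery") and sorting the rows by cron_lines or by the number of third-party directories gives a first cut of which hosts need an owner conversation before a window is booked.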


P2V and V2V with virt-v2v

virt-v2v is the open-source converter that turns a VMware, Xen, or Hyper-V guest into a KVM/QEMU image (or imports it directly into RHV/oVirt); physical machines are handled by its companion tool, virt-p2v. The job it does, installing virtio drivers and fixing up the boot configuration so the guest starts on the new hypervisor, is the same job most cloud importers do under the hood, and it is the converter to bet on for Linux V2V.

The pattern with Ansible is to:

  1. Cleanly stop the source workload (or take a quiesced snapshot if downtime is unacceptable).
  2. Run virt-v2v against the source, with credentials stored in Vault.
  3. Boot the target on the new platform.
  4. Apply post-conversion reconciliation (network, repos, certificates).
  5. Validate; if green, decommission source after the bake-in window.

# roles/v2v_convert/tasks/main.yml
---
- name: Ensure VMware Tools is current (quiesced snapshots depend on it)
  community.vmware.vmware_guest_tools_upgrade:
    hostname: "{{ vcenter_host }}"
    username: "{{ vcenter_user }}"
    password: "{{ vcenter_pass }}"
    name: "{{ source_vm }}"
    validate_certs: false
  when: source_platform == "vsphere"
  no_log: true

- name: Snapshot source for rollback
  community.vmware.vmware_guest_snapshot:
    hostname: "{{ vcenter_host }}"
    username: "{{ vcenter_user }}"
    password: "{{ vcenter_pass }}"
    name: "{{ source_vm }}"
    state: present
    snapshot_name: "pre-v2v-{{ ansible_date_time.epoch }}"
    quiesce: true
    memory_dump: false
  no_log: true

- name: Run virt-v2v from converter host
  ansible.builtin.command:
    cmd: >
      virt-v2v
      -ic vpx://{{ vcenter_user | urlencode }}@{{ vcenter_host }}/Datacenter/Cluster/host?no_verify=1
      --password-file /etc/v2v.pwd
      -o rhv -os {{ rhv_storage_domain }}
      -of raw
      -on {{ source_vm }}-converted
      {{ source_vm }}
  delegate_to: "{{ converter_host }}"
  register: v2v
  no_log: true

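- name: Capture v2v output for evidence
  # virt-v2v reports on stdout/stderr rather than in a log file, so persist
  # the registered output; the evidence directory must exist on the controller.
  ansible.builtin.copy:
    content: "{{ v2v.stdout }}\n{{ v2v.stderr }}\n"
    dest: "/var/lib/migration/evidence/{{ source_vm }}/v2v.log"
  delegate_to: localhost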

- name: Boot converted VM (RHV)
  ovirt.ovirt.ovirt_vm:
    auth: "{{ ovirt_auth }}"
    name: "{{ source_vm }}-converted"
    state: running
    wait: true
    cluster: "{{ rhv_cluster }}"
  no_log: true

The two non-obvious points:

After conversion, the new VM almost always needs network reconciliation (different MAC, different vNIC name) and often needs DNS/AD re-join. Wrap those in roles/reconcile_network and roles/reconcile_third_party.


RHEL major-version in-place upgrades with leapp

leapp is Red Hat’s officially supported in-place upgrade engine. It runs in two phases: leapp preupgrade (analyses the system and emits a report) and leapp upgrade (the actual upgrade, requires a reboot). Crucially, leapp is opinionated: it will refuse to upgrade a system with unsupported configurations, and you must address every blocker before running the upgrade.

The standard fleet pattern with Ansible is:

  1. Run leapp preupgrade against every host.
  2. Collect the JSON reports centrally.
  3. Cluster the reports by blocker type.
  4. Fix each blocker class with a targeted role (or an exception list).
  5. Re-run leapp preupgrade until clean.
  6. Schedule the upgrade window.
  7. Run leapp upgrade and reboot in waves.
  8. Validate; record evidence.

# playbooks/30-leapp-preupgrade.yml
---
- name: Leapp preupgrade
  hosts: leapp_targets
  become: true
  tasks:

    - name: Install leapp packages (the leapp repos must already be enabled)
      ansible.builtin.package:   # resolves to yum on RHEL 7 and dnf on RHEL 8
        name:
          - leapp-upgrade
          - leapp
        state: present

    - name: Stage answerfile for known questions
      ansible.builtin.copy:
        dest: /var/log/leapp/answerfile
        content: |
          [remove_pam_pkcs11_module_check]
          confirm = True
          [authselect_check]
          confirm = True

    - name: Run leapp preupgrade
      ansible.builtin.command:
        cmd: leapp preupgrade --report-schema=1.2.0
      register: preup
      failed_when: preup.rc not in [0, 1]   # 1 = report has inhibitors, expected
      changed_when: true

    - name: Fetch the report JSON
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-report.json
        dest: "/var/lib/migration/leapp/{{ inventory_hostname }}.json"
        flat: true

    - name: Fetch the inhibitor list (human readable)
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-report.txt
        dest: "/var/lib/migration/leapp/{{ inventory_hostname }}.txt"
        flat: true
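
Steps 2 and 3 of the fleet pattern (collect the reports, then cluster them by blocker type) can be sketched in a few lines of Python. This is a minimal sketch, assuming one <hostname>.json per host in the directory the playbook above fetches into, and assuming each report has a top-level entries list whose items carry a title and either a groups or a flags list; check both assumptions against a real report from your leapp version before relying on it. The function name cluster_inhibitors is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def cluster_inhibitors(report_dir):
    """Map each inhibitor title to the sorted list of hosts reporting it.

    Assumption: an entry is an inhibitor when "inhibitor" appears in its
    "groups" list or, for older report schemas, in its "flags" list.
    """
    clusters = defaultdict(set)
    for path in Path(report_dir).glob("*.json"):
        report = json.loads(path.read_text())
        for entry in report.get("entries", []):
            markers = entry.get("groups", []) + entry.get("flags", [])
            if "inhibitor" in markers:
                # The file name (minus .json) is the inventory hostname.
                clusters[entry.get("title", "<untitled>")].add(path.stem)
    return {title: sorted(hosts) for title, hosts in clusters.items()}
```

Sorting the result by the number of hosts per title shows which remediation role pays off first.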

The interesting work happens between preupgrade and upgrade: parsing the reports and writing remediations. Common inhibitors and their fixes:

Each remediation becomes a role:

# roles/leapp_remediate_pam/tasks/main.yml
---
- name: Switch authselect to sssd
  ansible.builtin.command:
    cmd: authselect select sssd with-smartcard --force
  changed_when: true

- name: Remove pam_pkcs11
  ansible.builtin.dnf:
    name: pam_pkcs11
    state: absent

After all remediations, re-run 30-leapp-preupgrade.yml and assert that no host has inhibitors:

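- name: Assert leapp clean
  ansible.builtin.assert:
    that: leapp_inhibitor_count | int == 0
    fail_msg: "{{ inventory_hostname }} still has {{ leapp_inhibitor_count }} inhibitors"
  vars:
    # Report fetched to the controller by 30-leapp-preupgrade.yml
    leapp_report: "{{ lookup('ansible.builtin.file', '/var/lib/migration/leapp/' ~ inventory_hostname ~ '.json') | from_json }}"
    # Depending on the report schema, "inhibitor" sits under groups or flags
    leapp_inhibitor_count: >-
      {{ (leapp_report['entries'] | selectattr('groups', 'defined') | selectattr('groups', 'contains', 'inhibitor') | list | length)
       + (leapp_report['entries'] | selectattr('flags', 'defined') | selectattr('flags', 'contains', 'inhibitor') | list | length) }}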

The actual leapp upgrade run

# playbooks/40-leapp-upgrade.yml
---
- name: Leapp upgrade, wave 1
  hosts: leapp_targets
  serial: "20%"
  become: true
  tasks:

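    - name: Snapshot LVM root for rollback (best effort)
      ansible.builtin.command:
        cmd: lvcreate -s -L 10G -n root_pre_leapp /dev/{{ root_vg }}/root
        creates: "/dev/{{ root_vg }}/root_pre_leapp"
      vars:
        # ansible_lvm.vgs is a dict keyed by VG name, so take the first key
        root_vg: "{{ ansible_lvm.vgs | default({}) | list | first | default('') }}"
      when: root_vg | length > 0
      ignore_errors: true   # not all hosts use LVM or name the LV "root"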

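    - name: Run leapp upgrade (download + reboot)
      ansible.builtin.command:
        # Without --reboot, leapp only stages the upgrade and waits for a manual reboot
        cmd: leapp upgrade --reboot --report-schema=1.2.0
      register: upgrade
      async: 7200
      poll: 0   # fire-and-forget, host will reboot

    - name: Wait for SSH to drop so we do not reconnect before the reboot starts
      ansible.builtin.wait_for:
        host: "{{ ansible_host | default(inventory_hostname) }}"
        port: "{{ ansible_port | default(22) }}"
        state: stopped
        timeout: 7200
      delegate_to: localhost
      become: false

    - name: Wait for the host to come back from the upgrade environment
      ansible.builtin.wait_for_connection:
        delay: 60
        timeout: 1800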

    - name: Confirm upgraded OS major version
      ansible.builtin.setup:
        gather_subset: distribution

    - name: Assert we are now on RHEL 9
      ansible.builtin.assert:
        that:
          - ansible_distribution == "RedHat"
          - ansible_distribution_major_version == "9"

    - name: Capture post-upgrade leapp log
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-upgrade.log
        dest: "/var/lib/migration/leapp/{{ inventory_hostname }}.upgrade.log"
        flat: true

Why serial: "20%"? A wave size of 20% lets you observe failures on a slice of the fleet without committing the entire estate. The first wave should be 1–2 hosts; the second 5–10%; the third 20%; thereafter you can go faster. AAP supports this pattern via job templates with explicit limits (--limit wave1, --limit wave2).
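
The ramp described above (a couple of canaries, then 5%, then 20% per wave) is easy to turn into explicit host lists. The sketch below is one way to do it in Python; plan_waves and its default sizes are illustrative choices drawn from the paragraph above, not an Ansible or AAP feature.

```python
def plan_waves(hosts, canary=2, percents=(5, 20)):
    """Split hosts into waves: a small canary wave, then growing percentages.

    percents are applied to the whole fleet size; once the listed values are
    used up, every later wave reuses the last one.
    """
    remaining = list(hosts)
    total = len(remaining)
    waves = []
    if remaining:
        waves.append(remaining[:canary])
        remaining = remaining[canary:]
    step = 0
    while remaining:
        pct = percents[min(step, len(percents) - 1)]
        # Integer ceiling of total * pct / 100, never smaller than one host.
        size = max(1, (total * pct + 99) // 100)
        waves.append(remaining[:size])
        remaining = remaining[size:]
        step += 1
    return waves
```

Each returned list can be written out as an inventory group (wave1, wave2, and so on) and passed to the job template with --limit, as described above.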

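Why LVM snapshots? They are not a substitute for backups, but they let you start a rollback in about 30 seconds (lvconvert --merge) instead of restoring from backup; the merge of an in-use root volume completes on its next activation, so plan for a reboot. The 10 GB carve-out has to absorb every block the upgrade rewrites on the root volume, and a snapshot that fills up becomes invalid, so size it to your workload.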

Why fire-and-forget on the upgrade command? Because the host reboots into a temporary upgrade initramfs; the SSH session dies. async: 7200 with poll: 0 queues the work; wait_for_connection resumes once SSH is back.


Post-upgrade reconciliation (the part everyone underestimates)

A successful leapp upgrade does not mean a working host. Things that often break and must be reconciled:

Each of these gets its own role; each role is idempotent so it can be re-run. A representative reconciliation playbook:

# playbooks/50-post-upgrade-reconcile.yml
---
- hosts: leapp_targets
  become: true
  tasks:
    - import_role: { name: reconcile_repos }
    - import_role: { name: reconcile_selinux }
    - import_role: { name: reconcile_network }
    - import_role: { name: reconcile_certs }
    - import_role: { name: reconcile_third_party }
    - import_role: { name: reconcile_python }
    - import_role: { name: reconcile_app }   # last; depends on the app stack

The reconcile_app role is application-specific. If you are migrating an app that runs Postgres on the host, you re-run the regular Ansible deploy playbook — the same one used in steady state. If you are migrating a custom Java app, you have a play that ensures java-17-openjdk is installed (RHEL 9 default is 17 not 8) and reconfigures JAVA_HOME accordingly. The point is that the migration playbook delegates application reconciliation to the standard application deploy playbook, ensuring the new host is built from the current source of truth — exactly like a rebuild.


Windows Server in-place upgrades (2016 → 2019 → 2022)

Windows in-place upgrades are less reliable than RHEL leapp, but with care they work. The pattern below uses modules from the ansible.windows collection:

# roles/win_inplace_upgrade/tasks/main.yml
---
- name: Stage Windows Server 2022 ISO
  ansible.windows.win_copy:
    src: /shared/iso/SERVER_2022.iso
    dest: C:\Setup\SERVER_2022.iso

- name: Mount the ISO
  ansible.windows.win_powershell:
    script: |
      $img = Mount-DiskImage -ImagePath C:\Setup\SERVER_2022.iso -PassThru
      ($img | Get-Volume).DriveLetter
  register: mount

- name: Pre-upgrade compatibility check
  ansible.windows.win_command:
    cmd: "{{ mount.output[0] }}:\\setup.exe /Auto:Upgrade /Compat:ScanOnly /Quiet /NoReboot /CopyLogs:C:\\Setup\\compat-logs"
  register: compat
  failed_when: compat.rc not in [0, 0xC1900210, 0xC1900208]
  # 0xC1900210 = compatible; no compatibility issues.
  # 0xC1900208 = compatibility issues found.
  # See https://learn.microsoft.com/windows/deployment/upgrade/upgrade-error-codes

- name: Fail if compatibility issues
  ansible.builtin.fail:
    msg: "Windows compatibility scan failed; check C:\\Setup\\compat-logs"
  when: compat.rc == 0xC1900208

- name: Run the actual upgrade (will reboot)
  ansible.windows.win_command:
    cmd: "{{ mount.output[0] }}:\\setup.exe /Auto:Upgrade /Quiet /CopyLogs:C:\\Setup\\upgrade-logs"
  async: 7200
  poll: 0

- name: Wait for upgrade reboot cycle (Windows reboots multiple times)
  ansible.builtin.wait_for_connection:
    delay: 600      # let setup.exe reach its first reboot before probing; tune per estate
    timeout: 5400

- name: Verify upgrade succeeded
  ansible.windows.win_powershell:
    script: |
      $os = Get-CimInstance Win32_OperatingSystem
      [pscustomobject]@{
        Caption = $os.Caption
        Version = $os.Version
        BuildNumber = $os.BuildNumber
      }
  register: post_os

- name: Assert Windows Server 2022
  ansible.builtin.assert:
    that:
      - "'2022' in post_os.output[0].Caption"

The Windows-specific traps:

The post-upgrade Windows reconciliation is similar to RHEL: re-attach the host to AD if it dropped, re-install monitoring agents, re-import scheduled tasks (which sometimes get reset), and verify the application service starts.


Migrate-and-modernise: rebuild and reconfigure

When the migration path is “RHEL 7 → RHEL 9 with completely new hardware” or “Windows 2016 → AWS-hosted Windows 2022”, the safest approach is rebuild: provision a fresh target host, run your steady-state Ansible playbooks against it, cut the data over, then decommission the source.

This pattern has three big advantages over in-place:

  1. The new host is built from current source of truth, with no inherited drift.
  2. The cut-over is a single network/DNS change; the rollback is a single network/DNS change.
  3. You rehearse the rebuild on pre-prod weeks before production, with high confidence that it will work in prod because the same playbook drives both.

The rebuild migration runbook looks like:

# playbooks/rebuild-migration.yml
---
- name: 1. Provision target host
  hosts: localhost
  tasks:
    - import_role: { name: provision_target }    # vSphere, AWS, Azure, etc.

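# import_playbook is only valid at play level, not inside tasks.
# Scope the run to the new host with --limit or via the groups site.yml targets.
- name: 2. Bootstrap target with steady-state config
  ansible.builtin.import_playbook: ../../app-platform/site.yml   # the regular play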

- name: 3. Replicate data (initial sync)
  hosts: "{{ source_host }}"
  tasks:
    - import_role: { name: data_sync_initial }

- name: "4. Cut-over window: stop source, sync delta, swing DNS"
  hosts: "{{ source_host }}"
  tasks:
    - import_role: { name: stop_source }
    - import_role: { name: data_sync_delta }
    - import_role: { name: swing_dns }
      delegate_to: localhost

- name: 5. Smoke test target
  hosts: "{{ target_host }}"
  tasks:
    - import_role: { name: app_smoke_test }

- name: "6. Bake-in window: keep source on standby"
  hosts: localhost
  tasks:
    - debug:
        msg: "Source kept for {{ bake_in_days }} days. Rollback by reversing DNS swing."

- name: 7. Decommission source
  hosts: "{{ source_host }}"
  tasks:
    - import_role: { name: decommission_source }

Steps 4 and 7 are gated by manual approvals in AAP (separate job templates). The bake-in window in step 6 is non-negotiable for production; one week is the minimum, two weeks is sane, four weeks is conservative.


Ubuntu and Debian major-version upgrades

do-release-upgrade is the analogue of leapp on Ubuntu. The pattern is identical to RHEL but the tooling differs:

- name: Configure unattended upgrade prompts
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/50unattended-upgrades-noninteractive
    content: |
      Dpkg::Options { "--force-confdef"; "--force-confold"; }

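- name: Run release upgrade
  ansible.builtin.command:
    cmd: do-release-upgrade -f DistUpgradeViewNonInteractive
  async: 7200   # survive a dropped SSH session during the upgrade
  poll: 60      # but wait for it to finish; this frontend does not reboot by default
  changed_when: true

- name: Reboot into the upgraded release
  ansible.builtin.reboot:
    reboot_timeout: 1800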

- name: Confirm Ubuntu version
  ansible.builtin.setup:
    gather_subset: distribution

- name: Assert Ubuntu 24.04
  ansible.builtin.assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_version == "24.04"

Debian uses apt full-upgrade after editing /etc/apt/sources.list; the rest is the same. Both Debian and Ubuntu have weaker guarantees than RHEL leapp — the upgrade tooling is less opinionated, so your pre-flight discovery has to do more work to surface conflicts.


Validation: prove the migration succeeded

Validation is the difference between “it booted” and “the application works.” Always run end-to-end synthetic transactions, not just service checks.

# playbooks/60-validate.yml
---
- name: End-to-end validation
  hosts: target_estate
  tasks:

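    - name: Service health check (basic)
      ansible.builtin.systemd:
        name: kv-app
        state: started
      check_mode: true              # report only; never start the service here
      register: svc
      failed_when: svc is changed   # "changed" in check mode means it was not running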

    - name: Application smoke test
      ansible.builtin.uri:
        url: "https://{{ inventory_hostname }}:8443/healthz"
        status_code: 200
        validate_certs: true
      register: health
      retries: 10
      delay: 10

    - name: Synthetic transaction (full path)
      ansible.builtin.uri:
        url: "https://{{ inventory_hostname }}:8443/api/v1/canary"
        method: POST
        body_format: json
        body: { canary: "{{ ansible_date_time.epoch }}" }
        status_code: 201
      no_log: true

    - name: Persist validation result
      ansible.builtin.copy:
        dest: "/var/lib/migration/validation/{{ inventory_hostname }}.json"
        content: |
          { "host": "{{ inventory_hostname }}",
            "ts": "{{ ansible_date_time.iso8601 }}",
            "health": {{ health.status }},
            "passed": true
          }
      delegate_to: localhost

If any host fails validation, that host’s job goes to playbooks/70-rollback.yml instead of 99-decommission-source.yml. The rollback playbook is platform-specific: revert vSphere snapshot, revert LVM snapshot via lvconvert --merge, restore from Veeam backup, etc.
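
Deciding which hosts go to rollback can be mechanised from the validation artefacts. The following Python sketch assumes the per-host files written by 60-validate.yml above and treats a missing or unreadable file as a failure, on the reasoning that the validation play stops for a host before the persist step when a check fails. The function name route_hosts is hypothetical.

```python
import json
from pathlib import Path


def route_hosts(expected_hosts, validation_dir):
    """Split hosts into (validated, rollback) lists.

    A host counts as validated only if <validation_dir>/<host>.json exists
    and records "passed": true; anything else is routed to rollback.
    """
    validated, rollback = [], []
    for host in expected_hosts:
        path = Path(validation_dir) / f"{host}.json"
        try:
            passed = json.loads(path.read_text()).get("passed") is True
        except (OSError, ValueError, AttributeError):
            # Missing file, invalid JSON, or JSON that is not an object.
            passed = False
        (validated if passed else rollback).append(host)
    return validated, rollback
```

The rollback list is what you would feed to playbooks/70-rollback.yml with --limit; the validated list still has to sit through the bake-in window before decommissioning.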


Migration evidence and reporting

Auditors and project sponsors both want per-host migration evidence: when did it migrate, what changed, did it pass validation, who approved the wave. Capture this once, store it forever:

- name: Persist evidence bundle
  ansible.builtin.copy:
    dest: "/var/lib/migration/evidence/{{ inventory_hostname }}/{{ ansible_date_time.epoch }}.json"
    content: |
      {
        "host": "{{ inventory_hostname }}",
        "wave": "{{ wave_id }}",
        "from": "{{ source_os }}",
        "to":   "{{ ansible_distribution }} {{ ansible_distribution_version }}",
        "started": "{{ run_start }}",
        "ended":   "{{ ansible_date_time.iso8601 }}",
        "duration_seconds": {{ (ansible_date_time.epoch | int) - (run_start_epoch | int) }},
        "leapp_report_sha256": "{{ leapp_report_sha }}",
        "validation_passed": {{ validation_passed | bool | to_json }},
        "approved_by": "{{ approver }}",
        "ticket": "{{ change_ticket }}"
      }

Land this in an immutable bucket (S3 Object Lock compliance mode, GCS retention policy, Azure Blob immutability). At the end of the migration project, you can produce a single report: “We migrated 3,412 hosts across 27 waves, with 97.4% first-pass success and a mean RTO per host of 47 minutes.” That number is the kind of thing executive sponsors quote in their next budget request.
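
The headline figures in such a report can be computed directly from the evidence bundles. The Python sketch below assumes the layout used above (one directory per host containing <epoch>.json files that parse as JSON, with the fields wave, validation_passed and duration_seconds) and defines first-pass success as "the host's earliest bundle passed validation"; both the function name summarise_evidence and that definition are assumptions made for illustration.

```python
import json
from pathlib import Path
from statistics import mean


def summarise_evidence(evidence_dir):
    """Aggregate per-host evidence bundles into project-level figures."""
    first_runs = []
    for host_dir in Path(evidence_dir).iterdir():
        if not host_dir.is_dir():
            continue
        # Bundles are named <epoch>.json; the lowest epoch is the first attempt.
        bundles = sorted(
            (p for p in host_dir.glob("*.json") if p.stem.isdigit()),
            key=lambda p: int(p.stem),
        )
        if bundles:
            first_runs.append(json.loads(bundles[0].read_text()))
    if not first_runs:
        return {"hosts": 0, "waves": 0,
                "first_pass_success_pct": 0.0, "mean_duration_minutes": 0.0}
    passed = sum(1 for run in first_runs if run.get("validation_passed") is True)
    durations = [run["duration_seconds"] for run in first_runs]
    return {
        "hosts": len(first_runs),
        "waves": len({run.get("wave") for run in first_runs}),
        "first_pass_success_pct": round(100 * passed / len(first_runs), 1),
        "mean_duration_minutes": round(mean(durations) / 60, 1),
    }
```

Run it against a local copy of the evidence tree; the immutable bucket stays the system of record.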


Anti-patterns that kill migrations


Frequently asked questions

1. Should I prefer in-place upgrade or rebuild for RHEL 7 → RHEL 9? Rebuild, when feasible. RHEL 7 → 9 requires two leapp passes (7→8 then 8→9), and the cumulative risk of inhibitors and reconciliations is high. If your application can be redeployed cleanly from Ansible to a new host, rebuild and cut over. If it has years of manual drift that nobody can replicate, in-place leapp is your friend.

2. How long should bake-in be before decommissioning the source? At minimum, a full business cycle: month-end close, weekly batch, quarterly report. For a typical web app, 2 weeks. For a financial system, 6 weeks. The fact that the new host worked for 3 days proves nothing about the quarterly batch.

3. Can I run leapp on a host running a third-party agent like Splunk or CrowdStrike? Yes, but you must have the RHEL 9-compatible agent ready to install before the upgrade, or remove the agent before upgrade and re-install after. Some agents block the upgrade engine; check vendor documentation. Always test on a non-prod host with the same agent stack first.

4. What’s the right wave size for a fleet upgrade? Wave 1: 1–3 hosts (the canary). Wave 2: 5% of fleet. Wave 3: 20% of fleet. Wave 4 onwards: 25–50%. Leave a gap of 1–2 days between waves, and fix everything found in wave N before starting N+1.

5. How do I migrate a host with a third-party kernel module (NIC driver, GPFS, etc.)? Identify it during discovery. Get the RHEL 9-compatible version from the vendor. Stage it in your local repo. Add a remediation role that swaps the module post-upgrade. If no RHEL 9-compatible version exists, defer migration of that host and engage the vendor.

6. Do I need to re-take SAN snapshots after the upgrade? Yes — your backup baselines are now invalid because the OS, package versions, and possibly file paths have changed. Take fresh full backups within 24 hours of upgrade and re-establish your incremental chain.

7. How do I handle hosts that fail validation post-upgrade? Roll back via snapshot. Investigate the failure on a clone. Add a reconciliation role to fix the failure mode. Re-run the upgrade with the new role included. Do not “fix in place” on a failed upgrade — your source-of-truth runbook will not match what’s running.

8. What’s the right way to upgrade a database host? Almost never in-place. Build a new host with the new OS and the new database version, replicate the data using the database’s native replication (Postgres logical, MySQL replicas, MongoDB replica sets), and cut over with a DNS swing. The database migration becomes its own runbook (covered in the Tier 5 DB migrations lesson).

9. How long does a typical 1000-host RHEL 7 → RHEL 9 migration take, end-to-end? With proper Ansible discovery, remediation, and waving: 8–12 weeks elapsed. Without: 9–18 months. The ratio is the cost of skipping discovery.

10. What’s the single most underrated migration practice? The rebuild-vs-upgrade decision per workload. Rebuild is safer, rehearsable, and produces a cleaner long-term state. Many teams default to in-place because “we can’t redeploy this app from scratch.” If you cannot redeploy, that is the bug to fix first; once you can, every future migration becomes a non-event.


Hands-on lab — your first leapp upgrade

The following lab takes a freshly-spun RHEL 8 container (we use a privileged container as a stand-in for a VM) through the leapp preupgrade analysis, surfacing inhibitors so you can experience the workflow without committing a real VM.

Prerequisites: Podman or Docker, a Red Hat developer subscription (free tier is fine), ansible-core ≥ 2.16.

mkdir -p leapp-lab/{playbooks,roles,inventory}
cd leapp-lab

# inventory/hosts.yml
all:
  hosts:
    rhel8-canary:
      ansible_connection: podman
      ansible_python_interpreter: /usr/bin/python3

podman run -d --name rhel8-canary --privileged registry.access.redhat.com/ubi8/ubi-init sleep infinity
podman exec rhel8-canary subscription-manager register --username "$RHN_USER" --password "$RHN_PASS" --auto-attach

# playbooks/leapp-preupgrade.yml
- hosts: rhel8-canary
  become: true
  tasks:
    - dnf:
        name: leapp-upgrade
        state: present

    - command: leapp preupgrade --target 9.4
      register: leapp
      failed_when: false
      changed_when: leapp.rc == 0

    - fetch:
        src: /var/log/leapp/leapp-report.txt
        dest: ./leapp-report.txt
        flat: true

    - debug:
        msg: "{{ lookup('file','./leapp-report.txt').split('\n')[:50] }}"

ansible-playbook -i inventory/hosts.yml playbooks/leapp-preupgrade.yml
head -100 leapp-report.txt

The output is the leapp report — read every inhibitor. Now the exercise: write a roles/remediate_<inhibitor>/tasks/main.yml for each inhibitor you see, re-run preupgrade, and watch the inhibitor list shrink. By the time the list is empty, you have built the muscle memory for fleet leapp.


Glossary


Certification mapping


Next steps

You now have the mental model and the runbook patterns for the three migration shapes. The next specialist lesson covers air-gapped automation — running Ansible against environments that cannot reach the internet, and how to keep collections, EEs, leapp content, and OS repositories synchronised inside a sealed network. Air-gapped migrations are even less forgiving than connected ones, and they are where the discipline taught here pays the highest dividend.

If you only take one habit from this lesson: discover before you migrate, and rehearse before you decommission. A migration with three weeks of discovery and a week of rehearsal beats the heroic weekend migration every time, even though it looks slower on the project plan.
