Deploy Proxmox VE Cluster with Ceph Hyperconverged Storage and HA Migration

A regional university IT team is told to retire a pair of aging VMware ESXi hosts whose vSphere renewal quote just doubled, and the brief is uncompromising: keep the campus Moodle learning platform, the student records app, and a dozen departmental VMs running through the migration, survive a single host failure without a 2 a.m. callout, and do it on hardware they already own. The answer here is a Proxmox VE cluster with Ceph providing hyperconverged storage on the same three nodes — no separate SAN to buy, VM disks replicated across hosts, and a VM that can live-migrate or be auto-restarted by the HA stack when a node dies. This guide builds that cluster end to end: three nodes, a Ceph RBD pool, live migration, and HA fencing that a fleet of departmental services can actually sit on.

The economics are the whole point. A traditional design splits compute and storage (hypervisor hosts plus a dual-controller SAN), and you pay for the array, its support contract, and the SAN fabric. Hyperconvergence collapses that: each node contributes CPU, RAM, and disks, Ceph stitches the disks into one replicated pool every node can read and write, and you scale by adding nodes. Three nodes is the floor for Ceph for two reasons: the default replicated pool keeps three copies on three separate hosts, and the monitors need a majority (two of three) to stay in quorum when one node is lost.
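The three-node floor follows from majority-quorum arithmetic. A minimal sketch, assuming one vote per node; the function names are illustrative and not part of any Proxmox or Ceph tool:

```shell
#!/usr/bin/env bash
# Majority quorum: more than half of the voters must be present.
quorum_votes() {            # $1 = number of voting nodes
  echo $(( $1 / 2 + 1 ))
}

# How many nodes can fail while the rest still form a majority.
tolerated_failures() {      # $1 = number of voting nodes
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 5; do
  echo "nodes=$n quorum=$(quorum_votes "$n") tolerated=$(tolerated_failures "$n")"
done
```

Two nodes tolerate zero failures, three tolerate one, and five tolerate two, which is why clusters of this kind are sized in odd numbers starting at three.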

Prerequisites

Three physical servers, identical where possible: 8+ cores, 64 GB+ RAM each. Each node needs at least one OS disk (SSD, mirrored ideally) and two or more dedicated, unpartitioned disks for Ceph OSDs (NVMe/SSD strongly preferred — Ceph is latency-sensitive).
Two separate networks: a management/VM network (10.20.0.0/24) and a dedicated Ceph/cluster network of at least 10 GbE (10.20.10.0/24). Do not run Ceph replication over the same NIC as VM traffic.
Proxmox VE 8.x ISO flashed to USB; nodes named pve1, pve2, pve3 with static IPs and matching /etc/hosts entries.
A workstation with terraform (>= 1.6), ansible (>= 2.16), and SSH key access to all three nodes.
NTP reachable from every node — Ceph monitors fall out of quorum on clock skew, and Proxmox corosync is unforgiving about latency.

Target topology

[Figure: topology of the Proxmox VE cluster with Ceph hyperconverged storage and HA migration]

Three Proxmox nodes form a single corosync cluster. Each runs a Ceph monitor (MON), a manager (MGR), and OSD daemons, one per data disk. Ceph presents an RBD pool as shared storage that all three nodes can reach, so with three replicas every VM disk has a copy on each node and a VM can run on, or move to, any host. Corosync provides cluster membership and quorum; the Proxmox HA manager watches resources and waits until a dead node is fenced before restarting its VMs elsewhere. Around the cluster sits the operating model: Microsoft Entra ID (brokering Okta for campus staff) as the SSO identity provider for the Proxmox web UI via OpenID Connect; HashiCorp Vault issuing the Ceph and API credentials that Terraform consumes so no secret is committed; Terraform and Ansible standing the cluster up declaratively; GitHub Actions and Argo CD driving the VM-definition pipeline; Akamai fronting the public Moodle endpoint with TLS and WAF; CrowdStrike Falcon and Wiz for endpoint and posture security; and Dynatrace (with Datadog as an alternative) plus ServiceNow for observability and change/incident workflow.

1. Install Proxmox VE and configure networking

Boot each node from the Proxmox 8.x ISO, install to the OS disk, and set the static management IP. After first boot, switch each node off the enterprise (subscription) repo to the no-subscription repo so apt works without a licence:

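# On each node, disable the enterprise repo and enable no-subscription
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
# PVE 8 installs may also carry an enterprise Ceph repo entry; comment it out if the file exists
[ -f /etc/apt/sources.list.d/ceph.list ] && sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/ceph.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list
apt update && apt -y dist-upgrade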

Define both networks. Edit /etc/network/interfaces on pve1 (repeat with the right IPs on pve2/pve3). Here vmbr0 is the management/VM bridge and ens2 carries Ceph:

auto vmbr0
iface vmbr0 inet static
    address 10.20.0.11/24
    gateway 10.20.0.1
    bridge-ports ens1
    bridge-stp off
    bridge-fd 0

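# Ceph cluster + public network, jumbo frames on the storage net.
# Comments must sit on their own line: /etc/network/interfaces does not
# support end-of-line comments.
auto ens2
iface ens2 inet static
    address 10.20.10.11/24
    mtu 9000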
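Jumbo frames only help if every NIC and switch port on the storage path carries them; a mismatch tends to show up as stalled Ceph traffic rather than a clear error. A minimal sketch of an end-to-end probe, assuming Linux iputils ping and assuming pve2 sits at 10.20.10.12 on the storage network (that address follows the pattern above but is not stated in this guide):

```shell
#!/usr/bin/env bash
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
jumbo_payload() {           # $1 = interface MTU in bytes
  echo $(( $1 - 28 ))
}

# Print (rather than run) the probe so it can be reviewed first.
# -M do sets the don't-fragment bit; -s sets the payload size.
probe_cmd() {               # $1 = MTU, $2 = peer IP on the storage network
  echo "ping -M do -c 3 -s $(jumbo_payload "$1") $2"
}

probe_cmd 9000 10.20.10.12
```

Running the printed command from pve1 should get replies only if the whole path passes 9000-byte frames; a "message too long" error or silence points at a port still set to 1500.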

Make hostname resolution deterministic on every node. Proxmox resolves each node name to its management IP for cluster joins and migration, so the mapping must be identical everywhere:

cat >> /etc/hosts <<'EOF'
10.20.0.11 pve1.lab.kloudvin.local pve1
10.20.0.12 pve2.lab.kloudvin.local pve2
10.20.0.13 pve3.lab.kloudvin.local pve3
EOF
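A drifted hosts file on one node is easy to miss. The helper below is a minimal sketch that checks a hosts file against the expected name-to-IP pairs; the function name check_hosts is hypothetical and the check reads only the file it is given, not DNS:

```shell
#!/usr/bin/env bash
# Check that a hosts file maps each short node name to the expected IP.
# Usage: check_hosts FILE name=ip [name=ip ...]; returns non-zero on any mismatch.
check_hosts() {
  local file=$1 pair name want got rc=0
  shift
  for pair in "$@"; do
    name=${pair%%=*}
    want=${pair#*=}
    # The first non-comment line listing the name as hostname or alias wins.
    got=$(awk -v n="$name" '
      /^[[:space:]]*#/ { next }
      { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$file")
    if [ "$got" = "$want" ]; then
      echo "ok: $name -> $got"
    else
      echo "MISMATCH: $name -> '${got}' (expected $want)"
      rc=1
    fi
  done
  return "$rc"
}
```

On each node, calling check_hosts /etc/hosts pve1=10.20.0.11 pve2=10.20.0.12 pve3=10.20.0.13 should print three ok lines before the cluster is formed.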

2. Form the corosync cluster

Create the cluster on the first node, binding corosync’s ring to the management network. From pve1:

pvecm create campus-cluster --link0 address=10.20.0.11

Join the other two nodes. Run on pve2, then pve3; each join prompts for pve1's root password and asks you to confirm its certificate fingerprint:

# On pve2
pvecm add 10.20.0.11 --link0 address=10.20.0.12
# On pve3
pvecm add 10.20.0.11 --link0 address=10.20.0.13

Confirm all three are present and quorate:

pvecm status
# Expect: "Quorate: Yes", "Total votes: 3", three nodes Online
pvecm nodes
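For unattended checks (a pipeline gate, for instance) the same test can be scripted. A minimal sketch, assuming pvecm status prints "Quorate:" and "Total votes:" as colon-separated fields; the sample text is an abbreviated illustration and cluster_ok is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Read pvecm status text on stdin; succeed only if the cluster is quorate
# and the total vote count equals the expected node count.
cluster_ok() {              # $1 = expected number of votes
  awk -v want="$1" -F':' '
    { key = $1; val = $2
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val) }
    key == "Quorate"     { quorate = val }
    key == "Total votes" { votes = val }
    END { exit !(quorate == "Yes" && votes == want) }
  '
}

# Abbreviated sample standing in for real output.
sample='Quorate:          Yes
Expected votes:   3
Total votes:      3'
printf '%s\n' "$sample" | cluster_ok 3 && echo "cluster healthy"
```

On a node the real output is piped in instead: pvecm status | cluster_ok 3.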

3. Install and bootstrap Ceph

Use the Proxmox-integrated Ceph installer on each node, then create the cluster. Run the install on all three; create the config once on pve1. We pin the Reef release and tell Ceph that the storage network is the 10 GbE segment:

# On all three nodes
pveceph install --repository no-subscription --version reef

# On pve1 only: initialise the Ceph cluster config
pveceph init --network 10.20.10.0/24 --cluster-network 10.20.10.0/24

Create one monitor and one manager per node so the control plane survives a host loss:

# Run the matching command on each node
pveceph mon create   # creates a MON on the local node
pveceph mgr create   # creates a MGR on the local node

Check the cluster is forming. It will report HEALTH_WARN until OSDs exist — that is expected:

ceph -s
# services: mon: 3 daemons, mgr: pve1(active), pve2,pve3 (standbys)

4. Create OSDs from the data disks

List the empty disks Ceph can claim — only use disks with no partitions and no filesystem:

# Identify candidate disks on each node
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
ceph-volume inventory   # shows which devices are "available"

Create an OSD per data disk on every node. With NVMe/SSD, putting the DB on the same fast device is fine; on hybrid setups you would point --db_dev at an SSD:

# Example: pve1 has two NVMe data disks
pveceph osd create /dev/nvme1n1
pveceph osd create /dev/nvme2n1
# Repeat on pve2 and pve3 for their disks

Watch them come up and in:

ceph osd tree
# Each host should list its OSDs as "up" and weighted; status -> HEALTH_OK
ceph osd df
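Rather than re-running the status command by hand, a small polling loop can wait for the cluster to settle. A minimal sketch, assuming whole-second intervals; wait_for is a hypothetical helper, and the echo in the demo stands in for ceph health, which prints the health state as the first word of its output:

```shell
#!/usr/bin/env bash
# Poll a command until its output starts with the wanted string or we time out.
# Usage: wait_for WANT TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for() {
  local want=$1 timeout=$2 interval=$3 waited=0 out
  shift 3
  while :; do
    out=$("$@" 2>/dev/null)
    case $out in
      "$want"*) echo "reached $want after ${waited}s"; return 0 ;;
    esac
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s; last output: $out" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Demo with a stand-in command; on a node use: wait_for HEALTH_OK 600 10 ceph health
wait_for HEALTH_OK 5 1 echo HEALTH_OK
```

The same helper fits the later steps that wait for backfill to finish before an OSD is destroyed or a node is declared recovered.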

5. Create the RBD pool and a CephFS for ISOs

Create a replicated RBD pool for VM disks. With three nodes, size=3/min_size=2 keeps you writable when one node is down:

pveceph pool create vm-rbd \
  --application rbd \
  --size 3 --min_size 2 \
  --pg_autoscale_mode on \
  --add_storages 1     # auto-registers it as Proxmox storage on every node
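The size/min_size pair decides when I/O continues and when it blocks. The sketch below models that rule under a simplifying assumption: one replica per host and exactly as many hosts as replicas, so each host that is down removes one copy. The function names are illustrative only:

```shell
#!/usr/bin/env bash
# Simplified model: one replica per host (host is the failure domain) and
# exactly `size` hosts, so each host that is down removes one replica.
replicas_left() {           # $1 = size, $2 = hosts down
  local left=$(( $1 - $2 ))
  [ "$left" -lt 0 ] && left=0
  echo "$left"
}

# A placement group accepts I/O only while surviving replicas >= min_size.
pool_writable() {           # $1 = size, $2 = min_size, $3 = hosts down
  [ "$(replicas_left "$1" "$3")" -ge "$2" ]
}

for down in 0 1 2; do
  if pool_writable 3 2 "$down"; then state="I/O continues"; else state="I/O blocked"; fi
  echo "size=3 min_size=2 hosts_down=$down replicas=$(replicas_left 3 "$down") -> $state"
done
```

With one host down two replicas remain and I/O continues; with two hosts down the pool blocks instead of accepting writes on a single copy, which is the behaviour min_size=2 is there to enforce.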

Add a small CephFS so all nodes share ISO images and container templates from the same replicated store:

pveceph mds create        # run on each node for an HA metadata server
pveceph fs create --name cephfs --pg_num 32 --add-storage 1

Verify Proxmox now sees shared storage cluster-wide:

pvesm status
# vm-rbd  rbd    active ...
# cephfs  cephfs active ...

6. Stand the cluster definition up with Terraform and Ansible

Treat the cluster as code. Terraform (with the bpg/proxmox provider) declares VMs and storage; Ansible handles in-node configuration that Terraform should not: Ceph tunables, package state, and CrowdStrike/agent rollout. Pull the API token from HashiCorp Vault rather than hardcoding it, so the secret never lands in git. Terraform still records values read through data sources in its state file, so keep state in an encrypted, access-restricted remote backend:

# providers.tf
terraform {
  required_providers {
    proxmox = { source = "bpg/proxmox", version = "~> 0.66" }
    vault   = { source = "hashicorp/vault", version = "~> 4.4" }
  }
}

data "vault_kv_secret_v2" "pve" {
  mount = "secret"
  name  = "proxmox/campus-cluster"
}

provider "proxmox" {
  endpoint  = "https://pve1.lab.kloudvin.local:8006/"
  api_token = data.vault_kv_secret_v2.pve.data["api_token"]
  insecure  = false
}

# moodle-vm.tf: an HA-managed VM on the Ceph pool
resource "proxmox_virtual_environment_vm" "moodle" {
  name      = "moodle-app-01"
  node_name = "pve1"
  vm_id     = 100   # fixed ID so the ha-manager commands in step 7 match
  cpu  { cores = 4 }
  memory { dedicated = 8192 }
  disk {
    datastore_id = "vm-rbd"   # lands on the replicated Ceph pool
    interface    = "scsi0"
    size         = 80
  }
  network_device { bridge = "vmbr0" }
}

A minimal Ansible play to enforce the storage-NIC MTU and install the Falcon sensor on each node:

# site.yml
- hosts: pve_nodes
  become: true
  tasks:
    - name: Ensure jumbo frames on storage NIC
      ansible.builtin.command: ip link set ens2 mtu 9000
      changed_when: false   # setting the same MTU again is a no-op, so do not report a change
    - name: Install CrowdStrike Falcon sensor (EDR on the hypervisors)
      ansible.builtin.apt:
        deb: /opt/pkgs/falcon-sensor.deb

Run them in order:

terraform init && terraform apply -auto-approve
ansible-playbook -i inventory/hosts.ini site.yml

This pipeline runs from GitHub Actions on merge to main; for the application VMs that host Moodle, Argo CD syncs the desired manifests so VM definitions and app config stay declarative and auditable. Every change ticket is opened and closed in ServiceNow, which is the system of record for the maintenance window.

7. Enable HA and configure fencing

Tell Proxmox to manage the Moodle VM as a highly available resource. The HA manager will restart it on a surviving node if its host dies:

# Add the VM (e.g. VMID 100) to HA management
ha-manager add vm:100 --state started --max_restart 3 --max_relocate 3

# Group VMs to prefer nodes but allow failover anywhere
ha-manager groupadd campus-ha --nodes "pve1,pve2,pve3" --nofailback 0
ha-manager set vm:100 --group campus-ha

Fencing in Proxmox is self-fencing via a watchdog: a node that loses quorum stops refreshing its watchdog and is reset within ~60 s, guaranteeing it has released its disks before the HA manager restarts the VM elsewhere (this is what prevents two copies writing the same RBD image). Proxmox falls back to the software softdog unless a hardware module is configured, so enable the hardware watchdog explicitly; on server-class gear that is the ipmi_watchdog module:

# Prefer the hardware watchdog on server-class gear
echo "options ipmi_watchdog action=reset panic_wdt_timeout=10" \
  > /etc/modprobe.d/ipmi-watchdog.conf
sed -i 's/#WATCHDOG_MODULE=.*/WATCHDOG_MODULE=ipmi_watchdog/' /etc/default/pve-ha-manager
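# Apply with a reboot, one node at a time after migrating its VMs away;
# restarting watchdog-mux while HA resources are active can reset the node.
reboot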

8. Gate the web UI with Entra ID SSO

Replace local Proxmox logins with enterprise identity. Register an app in Microsoft Entra ID (which brokers Okta for staff who live in the campus Okta tenant), then add it as an OpenID Connect realm so admins authenticate with corporate MFA and group claims, not shared root passwords:

pveum realm add entra-oidc --type openid \
  --issuer-url https://login.microsoftonline.com/<TENANT_ID>/v2.0 \
  --client-id <APP_CLIENT_ID> \
  --client-key "$(vault kv get -field=oidc_secret secret/proxmox/oidc)" \
  --username-claim email --autocreate 1

# Map the Entra "ProxmoxAdmins" group to the Administrator role
# (the group ProxmoxAdmins-entra-oidc must already exist in Proxmox for the ACL to apply)
pveum acl modify / --roles Administrator --groups ProxmoxAdmins-entra-oidc

Validation

Run these checks before declaring the cluster production-ready:

# 1. Cluster + Ceph health
pvecm status | grep Quorate          # Quorate: Yes
ceph -s | grep HEALTH                 # HEALTH_OK
ceph osd pool ls detail               # vm-rbd: replicated size 3 min_size 2

# 2. Live migration works without a service interruption (shared storage = no disk copy)
qm migrate 100 pve2 --online
# UI shows the VM running on pve2; ping the Moodle endpoint throughout and expect
# at most a brief pause at cutover
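To turn "ping it throughout" into a number, the summary line of the ping run can be parsed. A minimal sketch, assuming the iputils summary format; ping_loss is a hypothetical helper and the sample line is illustrative, not a measured result:

```shell
#!/usr/bin/env bash
# Read ping output on stdin and print "transmitted received lost",
# taken from the summary line at the end of the run.
ping_loss() {
  awk '/packets transmitted/ {
         tx = $1; rx = $4
         print tx, rx, tx - rx
         found = 1
       }
       END { exit !found }'
}

# Illustrative summary line standing in for a real run.
summary='120 packets transmitted, 119 received, 0.833333% packet loss, time 119142ms'
printf '%s\n' "$summary" | ping_loss
```

In practice, start a fixed-count ping against the Moodle VM's address in one terminal, pipe it into ping_loss, and trigger the migration from another; the third number is the count of lost probes across the cutover.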

Now prove HA failover. Hard-power-off the node currently running VMID 100 (pull power or echo c > /proc/sysrq-trigger) and watch:

ha-manager status
# The lost node goes "unknown", then "fence"; once fencing is confirmed the VM
# moves through "recovery" to "started" on a surviving node within ~1-2 minutes.
ceph -s
# Degraded PGs while one node is down, but I/O continues (min_size=2 satisfied).
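To log where a service ended up after the failover, the status text can be reduced to a node and a state. A minimal sketch, assuming service lines shaped like "service vm:100 (pve2, started)"; the sample block is illustrative and service_state is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Read ha-manager status text on stdin and print "<node> <state>" for one
# service ID. Assumes lines shaped like: service vm:100 (pve2, started)
service_state() {           # $1 = service ID, e.g. vm:100
  awk -v sid="$1" '
    $1 == "service" && $2 == sid {
      node = $3; state = $4
      gsub(/[(),]/, "", node)
      gsub(/[(),]/, "", state)
      print node, state
      found = 1
    }
    END { exit !found }'
}

# Illustrative sample standing in for real output.
sample='quorum OK
master pve2 (active, Mon Jan  1 12:00:00 2024)
lrm pve2 (active, Mon Jan  1 12:00:03 2024)
service vm:100 (pve2, started)'
printf '%s\n' "$sample" | service_state vm:100
```

Polling ha-manager status | service_state vm:100 once a second with a timestamp gives a simple record of how long the VM spent outside the started state.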

Confirm the recovered node rejoins cleanly and Ceph re-balances:

ceph osd tree           # returned OSDs back "up/in"
ceph -s                 # HEALTH_OK once backfill completes

Rollback / teardown

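If a node is being decommissioned or the build is being torn down, drain it first so Ceph moves the data before the disks leave. On a three-node cluster with size=3 and the default host failure domain there is no fourth host to hold the third replica, so removing a whole node leaves the pool degraded until a replacement joins: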

# Gracefully remove one node's OSDs (let data drain first)
ceph osd out osd.<id>
# wait for HEALTH_OK / no misplaced objects, then confirm and stop the daemon:
ceph osd safe-to-destroy osd.<id>
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id> --cleanup 1

Remove a node from the cluster (run from a surviving node, node powered off):

ha-manager remove vm:100          # release HA management first
pvecm delnode pve3
pvecm expected 2                  # only needed if the remaining nodes are not quorate

Full teardown — destroy the Terraform-managed resources, then the storage, then the cluster:

terraform destroy -auto-approve
pveceph pool destroy vm-rbd --remove_storages 1
# Stop every metadata server first: pveceph stop --service mds.<node> on each node
pveceph fs destroy cephfs --remove-storages 1
# Finally rebuild each node from ISO if repurposing the hardware

Common pitfalls

Two-node Ceph. With only two nodes you cannot hold a MON quorum or three replicas; the pool blocks on a single failure. Three is the genuine minimum — a 2-node “cluster” plus a QDevice gives corosync quorum but does not fix Ceph replication.
Storage and VM traffic on one NIC. Ceph recovery saturates the link, VM I/O stalls, and corosync starts dropping — the cluster flaps. Keep the 10 GbE storage net physically separate.
min_size = 1. Tempting during an outage, but it lets a single OSD accept writes and invites split-brain data loss. Leave it at 2.
Clock skew. Ceph raises a clock-skew warning once monitors drift more than 50 ms apart by default, and a node drifting a few hundred ms can drop its MON out of quorum. Verify chrony/NTP everywhere before you trust the cluster.
Softdog in production. The software watchdog can fail to fire under a hung kernel; use the hardware/IPMI watchdog so fencing is real.
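The clock-skew pitfall is the easiest of these to check mechanically. A minimal sketch, assuming chrony is the time daemon and that chronyc tracking prints a "System time" line with the offset in seconds; the 0.05 s default mirrors Ceph's default allowed monitor drift, and clock_within is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Read chronyc tracking text on stdin and succeed if the absolute system
# clock offset is below a limit in seconds (default 0.05, i.e. 50 ms).
clock_within() {            # $1 = limit in seconds (optional)
  awk -v limit="${1:-0.05}" '
    /^System time/ {
      # Line shape: System time     : 0.000012345 seconds fast of NTP time
      offset = $4 + 0
      found = 1
    }
    END {
      if (!found) exit 2
      exit !(offset < limit + 0)
    }'
}

# Illustrative line standing in for real output: 120 ms of drift fails the check.
printf 'System time     : 0.120000000 seconds slow of NTP time\n' | clock_within \
  || echo "clock offset exceeds 50 ms"
```

On a node the check becomes chronyc tracking | clock_within, run on all three hosts before bootstrapping the monitors.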

Security notes

Lock the cluster down beyond SSO. Pull every credential Terraform and the nodes need (the Proxmox API token, Ceph keyrings, the OIDC secret) from HashiCorp Vault with short TTLs, so nothing sensitive is committed to git or left in long-lived plaintext files. Run CrowdStrike Falcon sensors on the hypervisor hosts for EDR on the most privileged layer in the building, and scan the Terraform and pipeline definitions with Wiz Code (with Wiz assessing the running posture) so a misconfigured firewall rule or an over-permissive token is caught in the pull request, not after an incident. The public Moodle endpoint sits behind Akamai for TLS termination, WAF, and bot mitigation, so the cluster's web tier is never directly exposed. Keep the Proxmox management network on its own VLAN, reachable only via the admin jump host.

Cost notes

The hyperconverged win is capital: no SAN, no SAN fabric, no per-socket hypervisor licence. Three mid-range nodes with NVMe replace a SAN-plus-hosts design at a fraction of the spend, and you grow by adding a fourth node rather than forklifting an array. Dynatrace (or Datadog) watches OSD latency, pool capacity, and per-VM resource use so you provision the next node from data instead of guesswork. Ceph wants you to expand before the pool passes ~75% full, comfortably ahead of its default 85% nearfull warning, because rebalancing a nearly full cluster gets expensive. Drive capacity, change, and incident workflow through ServiceNow so the maintenance windows and the audit trail line up with the rest of the estate.
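The capacity side of that planning is simple arithmetic. A minimal sketch using integer GiB; the disk count and size are hypothetical figures chosen for illustration, not a recommendation from this guide:

```shell
#!/usr/bin/env bash
# Usable capacity of a replicated pool: raw capacity divided by replica count.
usable_gib() {              # $1 = raw GiB across all OSDs, $2 = pool size
  echo $(( $1 / $2 ))
}

# Stored data (GiB) at which the pool crosses a fill percentage.
expand_at_gib() {           # $1 = raw GiB, $2 = pool size, $3 = percent
  echo $(( $1 / $2 * $3 / 100 ))
}

# Hypothetical hardware: 3 nodes x 2 OSDs x 1920 GiB each.
raw=$(( 3 * 2 * 1920 ))
echo "raw=${raw} GiB usable=$(usable_gib "$raw" 3) GiB expand_at=$(expand_at_gib "$raw" 3 75) GiB"
```

With those assumed disks, 11520 GiB raw becomes 3840 GiB usable at size=3, and the ~75% planning line sits at 2880 GiB of stored data.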