VMware to Azure VMware Solution Migration and Hybrid Operations

A regional health-insurance carrier gets a hard date from its CFO and its head of facilities at the same meeting: the lease on the primary colocation datacenter — 1,900 VMware virtual machines across claims adjudication, member portals, a Moodle-based agent training platform, and a wall of regulated actuarial workloads — expires in fourteen months and will not be renewed. The hardware refresh quote alone is a seven-figure capital request the board has already declined twice. The mandate is blunt: be out of the building before the lease ends, do not refactor 1,900 applications to get there, and do not have a claims-processing outage that lands the company in front of a state insurance regulator. The infrastructure team has spent a decade building operational muscle around vSphere, vCenter, and NSX — and they are being told to keep all of it while the floor disappears underneath them. This article is the reference architecture for that exit: a lift-and-shift to Azure VMware Solution (AVS) that preserves the VMware operating model, moves live workloads with HCX, and lands on a hybrid estate the existing tools and runbooks still understand.

The pressures here are different from a greenfield cloud build, and naming them sets the whole design. Time is fixed and external — a lease clock, not an engineering preference. Risk is regulatory: a claims platform under HIPAA and state DOI oversight cannot have a multi-day cutover window. Skills are the hidden constraint — retraining a vSphere team into native Azure IaaS in fourteen months while also doing the migration is how projects miss the date. And cost is a moving target: AVS is not cheap, so the plan has to include getting smaller over time, not just relocating the whole estate at full size. AVS satisfies all four because it runs the actual VMware stack — ESXi, vCenter, vSAN, NSX-T — on dedicated bare-metal Azure hosts, so the VMs, the runbooks, and the muscle memory all transfer unchanged.

Why not the obvious alternatives

Each alternative fails against this specific clock, and each of the three will have a champion on the steering committee.

Re-platform every VM to native Azure IaaS (Azure VMs, managed disks, VMSS) is the “do it right” answer and the one that misses the lease date. 1,900 workloads means 1,900 OS-level migrations, driver and agent changes, network re-IPs, and regression tests against regulated systems — twelve to twenty-four months of effort the team does not have, on a corpus where many apps have no owner left to test them. Refactor to containers/PaaS is even further out: rewriting actuarial batch jobs and a claims engine as cloud-native services is a multi-year program, not a datacenter exit. A second colocation lease just resets the same problem in three years and still needs a hardware refresh the board already rejected.

AVS threads the clock. Because it is genuine VMware on Azure-hosted bare metal, a VM migrates without changing its guest OS, its IP address, its vNICs, or its backup agent — HCX can even keep the same MAC and Layer-2 segment so the application has no idea it moved. The team keeps vCenter and NSX-T Manager as their day-2 consoles. Migration becomes a relocation, measured in vMotion windows rather than application rewrites, and the harder modernization — re-platforming the workloads that deserve it — happens after the building is vacated, on a clock the company controls.

Architecture overview

Figure: architecture of the VMware to Azure VMware Solution migration and hybrid operations design.

The architecture has two phases that share one topology: a migration phase where on-prem vSphere and an AVS private cloud run as a single stretched estate, and a steady-state hybrid phase where AVS is the primary compute and the colocation is gone. The whole design hinges on building the network and identity plane first, so that on the day a workload moves it lands somewhere already wired, secured, and observable.

The defining property is continuity: AVS exposes the same vCenter and NSX-T control plane the team already operates, so the migration changes where VMs run without changing how they are run. Everything below exists to make that continuity real across the move.

Foundation, built before a single VM moves:

ExpressRoute is the spine. The AVS private cloud is provisioned with its own AVS-managed ExpressRoute circuit, which connects to an Azure ExpressRoute Gateway in a hub VNet, and a second ExpressRoute circuit connects the colocation datacenter to the same hub. Crucially, the two circuits are stitched with ExpressRoute Global Reach, giving on-prem ESXi and AVS ESXi a private, low-latency, high-throughput path to each other: the data highway HCX needs to bulk-copy terabytes and run live vMotion. No migration traffic touches the public internet, which is the first thing the HIPAA assessor asks about.
Identity stays put and federates. The carrier’s workforce already authenticates through Okta as the primary IdP. Okta is federated to Microsoft Entra ID over OIDC so AVS management, the Azure portal, and Azure RBAC all see first-class Entra tokens, while administrators keep their Okta login and conditional-access policies. vCenter and NSX-T Manager bind to Entra (via the AVS identity integration / LDAPS to a domain-controller VM) so VMware roles map to the same human identities — no separate local admin sprawl appearing the moment the estate doubles.
Akamai stays at the edge for the member portal and the public Moodle agent-training site. During migration its origin is failed over from the colocation public IPs to AVS-fronted load balancers, so members and agents see no change while the backend relocates — TLS termination, WAF, and bot mitigation continue uninterrupted at the perimeter.

Migration data flow, per workload:

HCX is deployed as a pair: an HCX Connector on-prem and an HCX Cloud Manager in AVS, linked over Global Reach. HCX stands up its Interconnect, Network Extension, and WAN Optimization appliances — these are the virtual appliances that do the actual work: the Network Extension appliance stretches an on-prem VLAN into an AVS NSX-T segment as Layer 2, so a VM keeps its IP after it moves and its peers on-prem still reach it.
A workload is migrated by its method-to-risk fit: HCX vMotion for zero-downtime live moves of latency-sensitive or always-on VMs (the claims API tier), HCX Bulk Migration for scheduled, replication-based moves of large batches during a maintenance window (the bulk of file and app servers), and HCX Replication Assisted vMotion (RAV) for moving many VMs in parallel with near-zero cutover. Migration waves are grouped by application affinity so an app’s tiers move together and never get split across a high-latency link mid-cutover.
As VMs land in AVS, NSX-T enforces micro-segmentation: a distributed firewall applies allow-list rules between application tiers, scoped by NSX security groups populated from VM tags, so claims, member-portal, and actuarial workloads cannot reach each other except on the exact ports their contracts require. The same policy that took a change-control ticket per rule in the old physical firewall is now declarative and travels with the VM.
Storage is vSAN on the AVS hosts for the hot estate, with Azure NetApp Files mounted over NFS as an external datastore for capacity-heavy, low-IOPS workloads — which decouples the storage bill from the host count and is the single biggest AVS cost lever.
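The method-to-risk fit described above can be sketched as a small decision function. This is a minimal illustration, not HCX behavior: the `Vm` fields, the `parallel_threshold` value, and the returned labels are all hypothetical names chosen for the example, and a real wave plan would weigh more inputs (change rate, disk size, maintenance windows).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    always_on: bool      # cannot tolerate a scheduled cutover reboot
    fragile_guest: bool  # known to fail live migration (old drivers, etc.)


def choose_hcx_method(vm: Vm, wave_size: int, parallel_threshold: int = 20) -> str:
    """Pick a migration method label for one VM.

    Assumed policy (hypothetical thresholds):
    - fragile guests always go replication-based (bulk);
    - always-on VMs move live: vMotion in small waves, RAV when the
      wave is large enough that parallelism matters;
    - everything else goes bulk in a maintenance window.
    """
    if vm.fragile_guest:
        return "bulk"
    if vm.always_on:
        return "rav" if wave_size > parallel_threshold else "vmotion"
    return "bulk"


if __name__ == "__main__":
    claims_api = Vm("claims-api-01", always_on=True, fragile_guest=False)
    print(choose_hcx_method(claims_api, wave_size=5))
```

Keeping the policy in one function makes the choice reviewable in the same change record as the wave itself, rather than decided per VM on the cutover weekend.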

Hybrid steady state, after the colocation is gone: AVS is the primary compute. The native Azure services it reaches over the hub (Azure SQL, Blob, Key Vault, and Azure Monitor) sit behind Private Endpoints, so an AVS VM, or a workload re-platformed later, can call a managed service over private addressing without touching the public internet. The on-prem footprint shrinks to a small disaster-recovery toehold or disappears entirely, and the Global Reach connection to the colo is decommissioned on the last day of the lease.

Component breakdown

Component	Service / tool	Role in the migration	Key configuration choices
Connectivity spine	ExpressRoute + Global Reach	Private, high-throughput path between on-prem and AVS; AVS-to-Azure	AVS-managed circuit to ER Gateway; Global Reach to colo circuit; ER FastPath
Migration engine	VMware HCX	Live + bulk VM migration, L2 network extension	HCX Enterprise; Interconnect/NE/WAN-Opt appliances; RAV for parallel waves
VMware control plane	AVS vCenter + NSX-T Manager	Day-2 operations unchanged; same consoles as on-prem	Entra/LDAPS identity bind; CloudAdmin role scoping
Micro-segmentation	NSX-T Distributed Firewall	Tier-to-tier allow-listing inside the private cloud	Tag-driven security groups; default-deny; per-app sections
Edge	Akamai	TLS, WAF, bot mitigation for portal + Moodle	Origin failover colo → AVS LB; no member-facing change
Identity / SSO	Okta + Microsoft Entra ID	Admin SSO (Okta) federated to Entra for Azure RBAC and vCenter	OIDC federation; conditional access; VMware role mapping
Secrets	HashiCorp Vault	Migration service-account creds, vCenter API tokens, app secrets	Entra auth method; dynamic DB creds for re-platformed apps
Capacity storage	Azure NetApp Files	External NFS datastore to scale storage without buying hosts	NFSv3/v4.1; capacity pools sized off cold tier
Cloud posture	Wiz + Wiz Code	Posture, exposure, and attack-path scanning across AVS + Azure	Agentless scan of VNet/AVS; Wiz Code gates Terraform PRs
Runtime security	CrowdStrike Falcon	EDR on the guest VMs, before and after the move	Sensor in the golden image; detections to the SOC; survives vMotion
Observability	Dynatrace / Datadog	App + infra telemetry continuous across the cutover	OneAgent/agent in guest; AVS host metrics via vCenter API
ITSM / change	ServiceNow	Wave approvals, CMDB sync, cutover change records	Change gate per wave; CMDB updated from vCenter on landing
IaC / automation	Terraform + Ansible	AVS + network provisioning; in-guest config	Terraform for AVS/ExpressRoute/NSX; Ansible for OS hardening
CI / pipelines	Jenkins / GitHub Actions / Argo CD	Migration tooling builds; GitOps for re-platformed workloads	OIDC to Azure; Argo CD for the post-migration AKS landing zone

A few choices deserve the why, because they are where AVS migrations go wrong.

Why ExpressRoute Global Reach, not a VPN. HCX bulk migration moves terabytes and live vMotion is exquisitely sensitive to latency and jitter — a VPN over the public internet introduces both, stretching a migration wave from a weekend into weeks and risking vMotion failures mid-flight. Global Reach links the two ExpressRoute circuits at the Microsoft backbone, so on-prem and AVS ESXi hosts exchange traffic at circuit speed with predictable latency. It is the difference between a migration that finishes on the lease clock and one that does not.

Why Layer-2 extension, not re-IP. The tempting “clean” approach is to give every VM a new AVS subnet and re-IP it. On 1,900 mostly-unowned workloads with hard-coded IPs in config files and firewall rules nobody documented, re-IP is a landmine field. HCX Network Extension stretches the on-prem VLAN into AVS at Layer 2, so a VM keeps its address, its default gateway stays on-prem until you choose to migrate it, and the app never notices. You migrate the gateway last, per segment, on your schedule — converting a risky big-bang re-IP into a controlled, reversible sequence.
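The "gateway last, per segment" rule can be expressed as a simple readiness check. This is a sketch under assumptions: the inventory shape (`segment -> {vm: location}`) and the location labels `"onprem"` and `"avs"` are hypothetical, and in practice the data would come from the vCenter inventories on both sides.

```python
def segments_ready_for_gateway_cutover(vm_locations: dict) -> list:
    """Return extended segments whose VMs have all landed in AVS.

    vm_locations maps segment name -> {vm name: "onprem" | "avs"}.
    A segment with no VMs is not reported, since there is nothing to confirm.
    """
    return sorted(
        segment
        for segment, vms in vm_locations.items()
        if vms and all(location == "avs" for location in vms.values())
    )


if __name__ == "__main__":
    inventory = {
        "vlan-110-moodle": {"moodle-web-1": "avs", "moodle-db-1": "avs"},
        "vlan-220-claims": {"claims-api-1": "avs", "claims-db-1": "onprem"},
    }
    print(segments_ready_for_gateway_cutover(inventory))
```

Only segments reported here are candidates for moving the default gateway into NSX-T; any segment with a VM still on-prem keeps its gateway where it is.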

Why micro-segmentation belongs in NSX, not the perimeter. Once the estate is a flat private cloud, an old-world perimeter firewall sees only north-south traffic and is blind to the east-west movement an attacker uses to pivot from a compromised member-portal VM into the claims database. NSX-T’s distributed firewall enforces allow-list policy at every VM’s vNIC, driven by tags, so segmentation is a property of the workload that travels with it during vMotion — not a static rule on a box the VM just moved away from.
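To make the default-deny, tag-driven model concrete, here is a minimal sketch of how an allow-list evaluates a flow. It models the idea only and is not the NSX-T API: `AllowRule`, the tag names, and the ports are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    src_tag: str
    dst_tag: str
    port: int


def is_allowed(rules, src_tags, dst_tags, port) -> bool:
    """Default-deny: a flow passes only if some rule matches both tags and the port."""
    return any(
        rule.src_tag in src_tags and rule.dst_tag in dst_tags and rule.port == port
        for rule in rules
    )


# Hypothetical per-application allow-list.
RULES = [
    AllowRule("member-portal-web", "member-portal-app", 8443),
    AllowRule("claims-app", "claims-db", 1433),
]

if __name__ == "__main__":
    # The claims app tier may reach its database...
    print(is_allowed(RULES, {"claims-app"}, {"claims-db"}, 1433))
    # ...but a compromised portal VM may not.
    print(is_allowed(RULES, {"member-portal-web"}, {"claims-db"}, 1433))
```

Because the decision depends only on tags, a VM that keeps its tags across a vMotion keeps its policy, which is the property the paragraph above relies on.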

Implementation guidance

Provision the landing zone with Terraform, network first. The order matters because HCX cannot do anything until the path exists.

The AVS private cloud (choose the host SKU, AV36P or AV48 class, and start at the 3-node cluster minimum), placed in the Azure region closest to the colo to keep migration latency low.
The ExpressRoute Gateway in the hub VNet, the AVS-managed circuit connected to it, and Global Reach stitched to the colocation’s existing ExpressRoute circuit. Validate route propagation both ways before going further — a missing route advertisement is the most common silent failure here.
NSX-T segments, tier-0/tier-1 gateways, and the distributed-firewall sections (default-deny, then per-app allow rules keyed to tags).
Azure NetApp Files capacity pools and the external NFS datastore attachment, sized from the cold-data inventory.
Private DNS and Private Endpoints for the Azure PaaS the re-platformed workloads will eventually call, plus the identity plumbing (Entra federation, domain-controller VMs, LDAPS to vCenter/NSX).

A minimal Terraform shape for the private cloud and Global Reach communicates the intent:

resource "azurerm_vmware_private_cloud" "avs" {
  name                = "avs-claims-prod-eus2"
  resource_group_name = azurerm_resource_group.avs.name
  location            = "eastus2"
  sku_name            = "av36p" # the azurerm provider expects the lowercase SKU name

  management_cluster {
    size = 3 # minimum; scale hosts as waves land
  }

  network_subnet_cidr         = "10.20.0.0/22" # AVS management /22, non-overlapping
  internet_connection_enabled = false          # all ingress/egress via ExpressRoute
}

# Stitch the AVS circuit to the colo circuit at the Microsoft backbone.
# Sketch of intent only: the colo peering ID is assumed to follow the standard
# "<circuit id>/peerings/AzurePrivatePeering" form, and the data source
# data.azurerm_express_route_circuit.colo is assumed to be declared elsewhere.
# Confirm the arguments against the azurerm provider version in use.
resource "azurerm_express_route_circuit_connection" "global_reach" {
  name                = "gr-avs-to-colo"
  peering_id          = azurerm_vmware_private_cloud.avs.circuit[0].express_route_private_peering_id
  peer_peering_id     = "${data.azurerm_express_route_circuit.colo.id}/peerings/AzurePrivatePeering"
  address_prefix_ipv4 = "172.16.0.0/29" # /29 transit for Global Reach
}
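The "non-overlapping" requirement on the AVS management range is easy to get wrong on an estate with undocumented subnets, so it is worth checking mechanically before `terraform apply`. A minimal sketch using Python's standard `ipaddress` module follows; the range names and the on-prem CIDRs in the example are hypothetical.

```python
import ipaddress


def find_overlaps(named_cidrs: dict) -> list:
    """Return sorted (name_a, name_b) pairs whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    names = sorted(networks)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    ranges = {
        "avs_mgmt": "10.20.0.0/22",        # from the Terraform above
        "global_reach": "172.16.0.0/29",   # from the Terraform above
        "colo_legacy": "10.20.2.0/24",     # hypothetical on-prem subnet
    }
    for a, b in find_overlaps(ranges):
        print(f"overlap: {a} <-> {b}")
```

Run as a pipeline step, a non-empty result would fail the build before the private cloud is provisioned with an address range that cannot be changed cheaply afterward.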

The pipeline that applies this runs in GitHub Actions (with Jenkins still driving some legacy build jobs the team has not retired), authenticating to Azure via OIDC federation so no service-principal secret is stored — a discipline the security team holds to firmly. Wiz Code runs as a required check on every Terraform pull request, failing the build if a change would open a public IP on AVS, weaken an NSX rule to any-any, or create a datastore without encryption — posture enforced before merge, not discovered after deploy.

In-guest config is Ansible’s job. Terraform builds the AVS and network plane; Ansible handles what lives inside the VMs — baking the CrowdStrike Falcon sensor and the Dynatrace/Datadog agent into the golden image, applying CIS hardening, and rotating the local credentials that HashiCorp Vault now issues dynamically. Because the EDR and observability agents live in the guest, they survive the vMotion unchanged: a VM arrives in AVS already reporting to the SOC and the APM backend, with zero re-instrumentation.

Plan the waves in ServiceNow, drive the CMDB from vCenter. Group the 1,900 VMs into waves by application affinity and risk, lowest-risk first to build confidence (the Moodle training platform and internal file servers before the claims engine). Each wave is a ServiceNow change request with an approval gate, a tested rollback, and a defined window; as VMs land, the CMDB is updated from the AVS vCenter inventory so the source of truth never drifts during the most volatile months the estate will ever see.
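Wave grouping by application affinity and risk can be sketched as a greedy packing that never splits an application across waves. This is an illustrative assumption about how the plan might be generated, not a ServiceNow or HCX feature; the tuple shape `(vm, app, risk)` and the integer risk scale (lower is safer) are hypothetical.

```python
from collections import defaultdict


def plan_waves(vms, max_vms_per_wave: int) -> list:
    """Group VMs into waves, lowest-risk applications first.

    vms: iterable of (vm_name, app_name, risk) with risk as an int.
    An application's VMs always land in the same wave; an application
    larger than max_vms_per_wave gets an oversized wave of its own.
    """
    members = defaultdict(list)
    app_risk = {}
    for vm_name, app, risk in vms:
        members[app].append(vm_name)
        # An app is as risky as its riskiest VM.
        app_risk[app] = max(app_risk.get(app, risk), risk)

    waves, current = [], []
    for app in sorted(members, key=lambda a: (app_risk[a], a)):
        app_vms = sorted(members[app])
        if current and len(current) + len(app_vms) > max_vms_per_wave:
            waves.append(current)
            current = []
        current.extend(app_vms)
    if current:
        waves.append(current)
    return waves


if __name__ == "__main__":
    estate = [
        ("claims-api-1", "claims", 3),
        ("claims-db-1", "claims", 3),
        ("moodle-web-1", "moodle", 1),
        ("moodle-db-1", "moodle", 1),
        ("files-1", "fileshare", 1),
    ]
    for number, wave in enumerate(plan_waves(estate, max_vms_per_wave=3), start=1):
        print(number, wave)
```

Each resulting wave would map to one change request, with the low-risk waves scheduled first to build confidence before the claims engine moves.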

Enterprise considerations

Security & Zero Trust. The posture is Zero Trust by construction and, importantly, continuous across the move. (a) NSX-T micro-segmentation default-denies east-west traffic so a foothold in one tier cannot pivot to another; (b) Okta → Entra federation means every admin action is tied to a real identity with conditional access, and vCenter/NSX roles map to those same identities rather than shared local accounts; (c) HashiCorp Vault issues short-lived credentials for migration service accounts and dynamic database creds for re-platformed apps, so nothing long-lived sits in a config file, which directly addresses the carrier’s standing rule that database passwords are never static or committed; (d) CrowdStrike Falcon runtime EDR rides in the guest image and keeps reporting through the cutover; (e) Wiz runs continuous CSPM and attack-path analysis across both the AVS private cloud and the surrounding Azure estate, alerting the instant a host drifts to public exposure or an NSX rule widens, and Wiz Code stops those drifts at the pull request. All AVS management and migration traffic rides ExpressRoute, leaving no public data-plane surface for the HIPAA assessor to flag.

Cost optimization. AVS bills per host-hour, so the estate’s size is the bill, and “relocate then shrink” is the financial heart of the plan.

Lever	Mechanism	Typical effect
Reserved Instances	1- or 3-year AVS RI on the steady-state host count	Up to ~50% vs pay-as-you-go on committed hosts
External datastore	Offload cold/low-IOPS data to Azure NetApp Files	Storage scales without buying hosts for capacity
Right-size on landing	vSAN dedupe/compression + drop zombie VMs before/after move	Fewer hosts needed than a 1:1 colo copy
Post-migration modernization	Re-platform stateless tiers to AKS/PaaS, retire their VMs	Shrinks the AVS host floor over 12–24 months
Host SKU fit	Match AV36P vs AV48 to the CPU:RAM:storage profile	Avoids over-provisioning the cluster

The trap is treating AVS as the destination and paying to host 1,900 VMs forever. The win is treating it as the bridge: get out of the building at full size on the lease clock, then use the Argo CD / AKS landing zone to re-platform the workloads that justify it and retire AVS hosts as the estate shrinks. That is a glide path that native-IaaS-first migrations never offered.
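The "relocate then shrink" arithmetic is simple enough to model directly. The sketch below assumes per-host-hour billing as described above; the hourly rate, host counts, and discount in the example are placeholder numbers, not AVS pricing.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def monthly_host_cost(hosts: int, payg_rate_per_hour: float,
                      reserved_hosts: int = 0, ri_discount: float = 0.0) -> float:
    """Monthly cost of a cluster with some hosts covered by reservations.

    reserved_hosts beyond the actual host count earn nothing back, which is
    why reservations should be sized to the steady-state floor, not the peak.
    """
    reserved = min(hosts, reserved_hosts)
    on_demand = hosts - reserved
    effective_hosts = on_demand + reserved * (1.0 - ri_discount)
    return HOURS_PER_MONTH * payg_rate_per_hour * effective_hosts


if __name__ == "__main__":
    rate = 10.0  # placeholder hourly rate per host
    landing = monthly_host_cost(hosts=10, payg_rate_per_hour=rate)
    steady = monthly_host_cost(hosts=6, payg_rate_per_hour=rate,
                               reserved_hosts=6, ri_discount=0.5)
    print(f"landing: {landing:,.0f}  steady state: {steady:,.0f}")
```

Comparing the landing-day figure with the post-modernization figure shows why the reservation is committed on the smaller steady-state host count rather than on the size of the initial relocation.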

Scalability. AVS scales by adding hosts to a cluster (and the cluster autoscale policy can add hosts when CPU/RAM/storage cross a threshold), with storage scaling independently via Azure NetApp Files so you are not buying compute to get capacity. HCX migration throughput scales by adding Network Extension and Interconnect appliance pairs and by running RAV to move many VMs in parallel — the appliance count, not a single tunnel, is the migration’s real bandwidth ceiling. ExpressRoute is sized (and FastPath-enabled) so the circuit never becomes the bottleneck during the heaviest bulk-migration weekends.

Failure modes, and what each one looks like. Name them before the cutover weekend.

A missing or unpropagated ExpressRoute route — HCX appliances deploy clean but cannot pair, and migrations stall with no obvious error. Mitigation: assert Global Reach route propagation in Terraform and run a pre-wave smoke test in both directions.
MTU mismatch on the extended segment — large frames fragment or drop across the L2 stretch, and migrations crawl or fail intermittently. Mitigation: validate jumbo-frame end-to-end MTU before the first wave; HCX WAN-Opt tuned to the path.
A split application across the link — half an app’s tiers move, the rest stay on-prem, and every transaction now pays a round-trip across ExpressRoute. Mitigation: wave grouping by application affinity so tiers move together.
AVS capacity exhaustion mid-wave — a cluster fills and the next wave has nowhere to land. Mitigation: stage host additions ahead of the wave schedule; alert on vSAN/host headroom.
vMotion failure on a fragile VM — an old guest with paravirtual driver issues fails live migration. Mitigation: fall back to HCX Bulk (replication-based) for known-fragile VMs; test a representative sample per wave first.
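The capacity-exhaustion failure mode lends itself to a pre-wave gate. Here is a minimal sketch that checks whether a wave fits while preserving a headroom reserve; the resource keys, the 25% default headroom, and the numbers are hypothetical, and real usable capacity depends on the vSAN storage policy in force.

```python
def wave_shortfalls(hosts: int, usable_per_host: dict, used: dict,
                    wave_demand: dict, headroom: float = 0.25) -> dict:
    """Return {resource: shortfall} for every resource the wave would overrun.

    An empty dict means the wave fits with the requested headroom intact.
    All three dicts share the same keys (for example ram_gb, storage_tb).
    """
    shortfalls = {}
    for resource, per_host in usable_per_host.items():
        budget = hosts * per_host * (1.0 - headroom)
        needed = used[resource] + wave_demand[resource]
        if needed > budget:
            shortfalls[resource] = needed - budget
    return shortfalls


if __name__ == "__main__":
    result = wave_shortfalls(
        hosts=4,
        usable_per_host={"ram_gb": 700, "storage_tb": 15},
        used={"ram_gb": 1500, "storage_tb": 30},
        wave_demand={"ram_gb": 500, "storage_tb": 20},
    )
    print(result or "wave fits")
```

A non-empty result is the signal to stage host additions, or move cold data to the external datastore, before the wave's change request is approved.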

Reliability & DR (RTO/RPO). Decide the numbers per tier. During migration, the on-prem copy is the rollback — HCX keeps the source VM until you commit the cutover, so a failed wave reverts in minutes. In steady state, AVS DR uses VMware SRM with vSphere Replication (or JetStream/Zerto) replicating to a second AVS private cloud in a paired region, or back to a small on-prem toehold if one is retained. A pragmatic target for the claims platform: RTO 4 hours, RPO 15 minutes, with member-facing portal tiers tighter once they are re-platformed onto multi-region PaaS. Akamai health checks drive edge failover for the public properties.
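Per-tier RPO targets only matter if something checks them. The following sketch flags VMs whose last successful replication is older than their tier's RPO; the input shape and tier names are hypothetical, and the timestamps would in practice come from whichever replication product is in use.

```python
from datetime import datetime, timedelta


def rpo_breaches(last_replicated: dict, rpo_by_tier: dict, now: datetime) -> list:
    """Return sorted VM names whose replication lag exceeds their tier's RPO.

    last_replicated maps vm name -> (tier name, datetime of last good replica).
    """
    return sorted(
        vm
        for vm, (tier, replicated_at) in last_replicated.items()
        if now - replicated_at > rpo_by_tier[tier]
    )


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0)
    targets = {"claims": timedelta(minutes=15), "internal": timedelta(hours=4)}
    state = {
        "claims-db-01": ("claims", datetime(2025, 1, 1, 11, 50)),
        "claims-db-02": ("claims", datetime(2025, 1, 1, 11, 30)),
        "files-01": ("internal", datetime(2025, 1, 1, 9, 0)),
    }
    print(rpo_breaches(state, targets, now))
```

Feeding the result into the observability backend turns the 15-minute claims RPO from a design target into an alertable condition.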

Observability. Keep telemetry continuous across the move, because a migration is exactly when you most need to see regressions. Dynatrace (or Datadog, the carrier runs both during a tooling consolidation) keeps its agent in the guest image so application metrics, traces, and logs flow unbroken before, during, and after vMotion — a latency spike from a mid-wave split app surfaces immediately. AVS host and cluster health (CPU ready time, vSAN latency, host headroom) is pulled from the vCenter and NSX APIs into the same backend, so the platform team watches infrastructure and applications on one pane. Emit the metrics the migration program actually cares about — VMs migrated vs planned per wave, cutover window adherence, AVS host utilization, and east-west denied-flow counts from the NSX firewall — and pipe wave completion into the ServiceNow change record automatically.
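The program-level metrics named above (VMs migrated versus planned, cutover window adherence) reduce to a small summary that could be attached to the change record. This is a sketch only: the field names are hypothetical and no ServiceNow API is called.

```python
def wave_summary(planned: list, migrated: list,
                 window_minutes: int, actual_minutes: int) -> dict:
    """Summarize one wave: completion, stragglers, and window adherence."""
    planned_set, migrated_set = set(planned), set(migrated)
    done = len(planned_set & migrated_set)
    completion = round(100.0 * done / len(planned_set), 1) if planned_set else 0.0
    return {
        "planned": len(planned_set),
        "migrated": done,
        "completion_pct": completion,
        "missing": sorted(planned_set - migrated_set),
        "within_window": actual_minutes <= window_minutes,
    }


if __name__ == "__main__":
    summary = wave_summary(
        planned=["moodle-web-1", "moodle-db-1", "files-1", "files-2"],
        migrated=["moodle-web-1", "moodle-db-1", "files-1"],
        window_minutes=240,
        actual_minutes=255,
    )
    print(summary)
```

The `missing` list is the part worth surfacing first, since a straggler left on-prem is exactly how an application ends up split across the link.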

Governance. Pin the AVS host SKU and cluster sizing in Terraform so capacity changes are reviewed, not ad hoc. Keep NSX firewall policy as code so every segmentation rule is version-controlled, reviewable, and revertable. Drive every wave through a ServiceNow change gate with a documented rollback, and sync the CMDB from vCenter so the configuration record matches reality throughout. Apply Azure Policy to deny public IP exposure on the AVS network and require diagnostic settings, with Wiz as the independent verifier that the controls are actually holding.

Explicit tradeoffs

Accept these or do not build it. AVS keeps your VMware skills and runbooks intact, but you are still paying for dedicated bare-metal hosts. It is more expensive per workload than well-tuned native Azure IaaS, and the only thing that makes the economics work is the commitment to shrink the estate afterward. You inherit the VMware operating model in full: vCenter, NSX-T, vSAN, and HCX are powerful, but they are more to operate than a managed PaaS. Under the AVS shared-responsibility model Microsoft handles the lifecycle of the hosts and the management components (ESXi, vCenter, vSAN, and NSX patching and upgrades), while your team still owns guest OS patching, NSX segments and firewall policy, and HCX. The Layer-2 extension that makes migration safe also means you carry on-prem network constructs into Azure until you deliberately retire them: convenient during cutover, technical debt if you never finish. And HCX itself is a set of appliances to size, monitor, and occasionally fight; MTU, bandwidth, and appliance scaling are real operational work, not a button.

The alternatives, and when they win. If you have time and application owners, native re-platforming to Azure IaaS/PaaS is cheaper and more cloud-native at the end — choose it when the clock and the staffing allow, and it composes with this design as the post-migration target. If a workload is already stateless and containerizable, re-architecting onto AKS (via the Argo CD GitOps landing zone this design provisions) beats hosting it on AVS forever — do that after the exit, per workload, as the economics justify. And if you are not actually leaving a datacenter — just extending capacity or bursting — a smaller hybrid AVS footprint alongside a retained on-prem estate is the right scope, without the full exit machinery here.

The shape of the win

For the carrier, the payoff is not “we moved to the cloud.” It is that the colocation floor goes dark before the lease ends, the claims engine processed every member’s claim through the entire migration without a regulator-visible outage, and the vSphere team that had spent a decade mastering VMware is still operating VMware the Monday after the last wave — now on Azure hosts, with the same vCenter, the same NSX policy, the same Okta login, the same Falcon sensors, and the same Dynatrace dashboards. That continuity is what let the project hit a fixed external date that a 1,900-application rewrite never could have. And because AVS is the bridge rather than the destination, the harder, more valuable work — re-platforming the workloads that deserve it and retiring AVS hosts as the estate shrinks — now runs on a clock the company controls, not a landlord’s. The architecture here is how a regulated enterprise gets out of the building on time without betting the business on a big-bang rewrite.


Written by Vinod
