A national health-insurance payer is six weeks from an external audit when its cloud security lead gets the finding that derails the quarter: the company’s Azure estate has grown to forty-three spoke VNets across two regions, and every one of them has its own egress path to the internet — some through an Azure NAT Gateway, some through a default route nobody can explain, two through a developer’s leftover public IP on a jump box. The auditor’s question is simple and unanswerable in the current shape: “Show me every flow that left your network last Tuesday, what application it was, and the policy that allowed it.” There is no single place that sees east-west traffic between the claims-processing spoke and the analytics spoke, no application-aware control on egress, and no consistent policy between the two regions. For a payer handling PHI under HIPAA and processing card data under PCI-DSS, “we have NSGs on each subnet” is not an answer — NSGs match ports and CIDRs, not the actual application riding the port, and they are scattered across forty-three blast radii with no central audit. This article is the reference architecture for fixing that properly: a centralized inspection hub on Azure built on a Palo Alto VM-Series HA pair, with all spoke egress and inter-spoke traffic forced through it, App-ID and Threat Prevention doing the actual inspection, and Panorama holding one policy across both regions.
The pressures here are the ones that drive every regulated network: auditability — every flow needs a named application, a verdict, and a log a regulator can read; segmentation — the claims spoke must not be able to reach the analytics spoke except on explicitly allowed flows, and a compromise in one must not pivot freely into the others; consistency — a rule written for the East US region must apply identically in the West US region, with no drift; and threat prevention — known exploits, C2 callbacks, and malware downloads must be blocked inline, not detected after the fact in a log review. A pile of per-subnet NSGs satisfies none of these. A next-generation firewall (NGFW) inspecting a chokepoint satisfies all four — if, and only if, the network topology actually forces traffic through it.
Why not the obvious alternatives
Three cheaper options will be proposed in the first design meeting, and naming why each falls short matters because each is genuinely tempting.
Network Security Groups everywhere. NSGs are free, native, and fine for coarse subnet isolation — but they are stateful packet filters keyed on 5-tuple. They cannot tell HTTPS-to-a-CDN from HTTPS-tunnelled-exfiltration, they do no threat signature matching, and forty-three independent NSG sets are forty-three independent audit surfaces with no unified flow record. They are a complement to the hub, not a substitute.
Azure Firewall (the native NGFW-lite). Azure Firewall Premium adds IDPS, TLS inspection, and FQDN filtering, and for many shops it is the right call — it is managed, it autoscales, and it needs no VMs to patch. But this payer has an existing Palo Alto estate on-premises, a security team fluent in PAN-OS policy and App-ID, a SOC tuned to Palo Alto threat logs, and a compliance posture built around Palo Alto’s reporting. Running the same NGFW in the cloud means one policy model, one log schema, one skill set spanning on-prem and Azure — which is worth more than the operational savings of a managed box when an auditor wants consistency across the whole enterprise. App-ID’s application identification is also more mature than Azure Firewall’s application rules for the deep east-west visibility the finding demands.
Per-spoke firewall appliances. Putting a small NGFW in each spoke gives local inspection but multiplies cost by forty-three, fragments policy, and recreates the exact “no central view” problem the audit flagged. Centralization is the entire point.
The hub pattern threads the needle: a single, scaled, HA inspection point that every spoke routes through, one policy surface managed centrally, and one consolidated flow-and-threat log the SOC and the auditor both read.
Architecture overview
The topology is a classic hub-and-spoke with the firewall as the hub’s beating heart, and the defining property — the one that makes the whole thing real rather than aspirational — is that traffic has no path to the internet or to a sibling spoke except through the firewall data plane. If a packet can find any other way out, the architecture has failed silently, so the routing design below is not decoration; it is the load-bearing wall.
The hub VNet contains the inspection layer. The VM-Series firewalls run as an HA pair — two instances in an Azure Availability Set (or across Availability Zones) — but Azure does not natively support the gratuitous-ARP IP-takeover that a traditional Palo Alto active/passive HA pair uses on-premises. So the cloud-correct pattern is an active/active-style sandwich with Azure load balancers steering traffic to whichever firewalls are healthy, fronted by health probes:
- An internal Standard Load Balancer (ILB) sits on the trust side. Spoke route tables point their default route (0.0.0.0/0) at the ILB’s frontend IP as the next hop. The ILB load-balances across the firewalls’ trust-side NICs using an HA Ports rule (all ports, all protocols) so every flow, not just TCP/80 and 443, is steered to a firewall.
- A public/external Load Balancer sits on the untrust side for inbound flows and as the SNAT path for inspected egress.
- Health probes on both load balancers hit a firewall management/dataplane port so that if one firewall fails, the ILB simply stops sending it flows and the survivor carries the load: no ARP, no floating IP, just probe-driven steering.
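The probe-driven steering described above can be pictured with a toy model. This is only a sketch: Azure's real distribution algorithm is internal to the platform, and the firewall names, addresses, and the `pick_backend` helper are hypothetical. It shows the one property that matters here, namely that a unit failing its probe simply stops receiving flows.

```python
import hashlib

def pick_backend(flow, backends, healthy):
    """Pick a firewall for a flow from the backends currently passing probes."""
    live = [b for b in backends if healthy.get(b, False)]
    if not live:
        raise RuntimeError("no healthy firewall in the pool")
    # Stable hash of the 5-tuple so the same flow keeps landing on the same unit
    key = "|".join(str(field) for field in flow).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return live[digest % len(live)]

backends = ["fw-a", "fw-b"]  # hypothetical firewall names
flow = ("10.1.0.4", 50123, "203.0.113.10", 443, "tcp")

both_up = pick_backend(flow, backends, {"fw-a": True, "fw-b": True})
after_failure = pick_backend(flow, backends, {"fw-a": False, "fw-b": True})
```

In this model a flow may move to a different unit when the healthy set changes, which is one reason failover should be tested with live sessions rather than assumed.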
Each firewall is multi-NIC: a management interface (to Panorama and for admin), an untrust/public interface (internet-facing, behind the public LB), and a trust/private interface (behind the ILB, facing the spokes). PAN-OS zones map to these: an untrust zone, a trust zone, and typically a dedicated zone per security tier so policy can speak in zone terms.
Control flow, following a packet’s life:
- Spoke egress (forced tunnel). A claims-processing VM in Spoke A wants to reach an external payments API. Its subnet’s User-Defined Route (UDR) sends 0.0.0.0/0 to the trust ILB. This is the forced tunnel: the spoke has no NAT Gateway, no public IP, no default Azure internet route; the only way out is the firewall. The ILB hands the flow to a healthy firewall.
- Inspection. The firewall applies its security policy: App-ID identifies the actual application (not just “tcp/443” but, say, the specific SaaS API), the rule checks source zone/address, destination, application, and user; Threat Prevention runs the flow against IPS signatures, antivirus, and anti-spyware (C2/DNS-tunnelling) profiles; URL Filtering and WildFire (sandboxing of unknown files) apply where configured. If the policy permits, the firewall SNATs the flow out its untrust interface via the public LB. If it denies, the flow is dropped and logged.
- East-west (inter-spoke). A flow from the claims spoke to the analytics spoke also routes via the trust ILB (each spoke’s UDR sends the other spoke’s CIDR, or simply 0.0.0.0/0, to the firewall). The firewall sees it as trust → trust (or tier-zone to tier-zone) and applies an explicit allow/deny, so the two spokes can only talk on flows security has named, and a compromise in claims cannot freely pivot into analytics. This is the micro-segmentation choke point the audit demanded.
- Inbound. External traffic to a published service hits the public LB, is DNAT’d by the firewall to the backend in the relevant spoke, inspected on the way in, and returned symmetrically (Azure LB and the firewall’s session table keep the path symmetric, which matters because asymmetric routing breaks a stateful firewall).
- On-prem. An ExpressRoute or VPN gateway in the hub connects the corporate network; on-prem-to-Azure and Azure-to-on-prem flows are routed through the firewall too, giving one consistent inspection and policy boundary spanning the whole estate.
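To see why a single default UDR captures both internet egress and inter-spoke traffic, it helps to model how a route is chosen. Azure picks the longest matching prefix and, when prefixes tie, prefers user-defined routes over BGP routes over system routes. The sketch below applies that rule to a simplified route table; the address ranges, the ILB address, and the dictionary shape are assumptions for illustration.

```python
import ipaddress

# Lower number wins when two routes share the same prefix length.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}

def effective_next_hop(dest_ip, routes):
    """Longest-prefix match; ties go to user-defined, then BGP, then system routes."""
    dest = ipaddress.ip_address(dest_ip)
    matches = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    best = min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r["prefix"]).prefixlen,
                       SOURCE_PRIORITY[r["source"]]),
    )
    return best["next_hop"]

ILB = "VirtualAppliance:10.0.2.4"  # hypothetical trust ILB frontend IP
claims_spoke_routes = [
    {"prefix": "10.10.0.0/16", "source": "system", "next_hop": "VnetLocal"},    # claims spoke itself
    {"prefix": "10.0.0.0/16", "source": "system", "next_hop": "VNetPeering"},   # hub VNet
    {"prefix": "0.0.0.0/0", "source": "system", "next_hop": "Internet"},
    {"prefix": "0.0.0.0/0", "source": "user", "next_hop": ILB},                 # forced tunnel
]

internet_hop = effective_next_hop("198.51.100.7", claims_spoke_routes)
analytics_hop = effective_next_hop("10.20.1.5", claims_spoke_routes)  # sibling spoke, not peered directly
local_hop = effective_next_hop("10.10.3.9", claims_spoke_routes)
```

Note what the model also reveals: traffic inside the same spoke VNet still matches the more specific local route, so it is not inspected by the hub unless a more specific UDR is added for it.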
Management plane, the part that makes it scale: Panorama — Palo Alto’s central management server, itself deployed as a redundant pair in the hub (or a dedicated management VNet) — pushes one device-group policy hierarchy and template stack to every firewall in both regions. Security rules, threat profiles, and App-ID updates are authored once in Panorama and pushed everywhere, which is precisely how you kill the “East US drifted from West US” problem. Panorama is also the log collector: every firewall streams traffic and threat logs to Panorama, which aggregates them and forwards to the SOC’s SIEM — one consolidated flow-and-threat record for the auditor.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Inspection hub | Palo Alto VM-Series (HA pair) | Stateful NGFW: App-ID, Threat Prevention, URL filtering, NAT | Multi-NIC; active/active via LB; zone-based policy |
| Trust-side steering | Azure Internal Load Balancer | Steers spoke egress + east-west to healthy firewalls | Standard SKU; HA Ports rule; dataplane health probe |
| Untrust-side | Azure Public Load Balancer | Inbound DNAT path + egress SNAT exit | Standard SKU; floating IP for symmetric return |
| Forced tunnel | User-Defined Routes (UDRs) | Force 0.0.0.0/0 and inter-spoke CIDRs to the firewall | Default route → ILB next hop; no NAT GW in spokes |
| Central policy | Panorama (HA pair) | One policy + threat config + App-ID across both regions | Device groups; template stacks; log collection |
| Hybrid link | ExpressRoute / VPN Gateway | On-prem reach, inspected through the hub | In hub; routes via firewall trust side |
| Identity / admin SSO | Okta + Microsoft Entra ID | Admin SSO to Panorama and Azure portal; RBAC | Okta SAML to Panorama; Entra RBAC + PIM on the subscription |
| Bootstrap secrets | HashiCorp Vault | Holds bootstrap auth keys, API keys, licence tokens | Dynamic Azure creds; bootstrap key issued at deploy |
| Runtime / endpoint security | CrowdStrike Falcon | Protects management jump hosts and bootstrap automation runners | Sensor on admin/runner VMs; detections to SOC |
| CSPM / drift | Wiz | Detects routing drift, public-IP creation, NSG widening | Agentless scan; alert on any spoke gaining a direct egress path |
| Observability | Dynatrace | Throughput, session, health-probe, latency telemetry | OneAgent on runners; metrics from firewall + LB; Davis anomaly |
| ITSM / change | ServiceNow | Firewall rule-change approvals, incident tickets on threats | Change gate before commit; auto-ticket on critical threat log |
| CI / IaC | Terraform + GitHub Actions | Deploy VNets, LBs, UDRs, firewalls; bootstrap config | OIDC to Azure; plan/apply with policy gate |
A few choices carry the why, because they are where cloud NGFW deployments most often go wrong.
Why HA Ports on the internal load balancer, not a per-port rule. A normal Azure LB rule covers one protocol and one port. A firewall must see whatever the spoke emits, so the trust ILB uses an HA Ports rule: a single rule (protocol All, port 0) that sends flows on every port to the backend firewall pool. Azure documents HA Ports as load-balancing TCP and UDP flows, so confirm how other protocols such as ICMP or ESP are handled in your design before relying on the rule for them. Without HA Ports, anything that isn’t on your listed ports bypasses inspection or black-holes.
Why active/active steering instead of classic Palo Alto floating-IP HA. On-premises, a Palo Alto active/passive pair fails over by having the passive unit gratuitously-ARP and claim the active’s data-plane IPs. Azure’s software-defined network does not honor gratuitous ARP, so that mechanism does not work. The cloud-native pattern is to put both firewalls behind load balancers with health probes and let the probes route around a failed unit. The cost is that you must engineer symmetric return traffic (via LB floating-IP/Direct Server Return and consistent SNAT) so a stateful session always returns to the same firewall that owns its session table — get this wrong and you see mysterious mid-session drops.
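A toy model makes the symmetric-return requirement concrete. It is a sketch, not PAN-OS behavior in detail: the `Firewall` class and the addresses are hypothetical, and real session matching involves NAT and TCP state. The point is only that a session table is local to one unit, so a reply arriving at the other unit has nothing to match.

```python
class Firewall:
    """Toy stateful firewall: accepts return packets only for sessions it created."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()

    def forward(self, src, sport, dst, dport, proto="tcp"):
        # The outbound packet creates a session entry on this unit only
        self.sessions.add((src, sport, dst, dport, proto))

    def accept_return(self, src, sport, dst, dport, proto="tcp"):
        # A return packet matches if the reversed tuple is a known session
        return (dst, dport, src, sport, proto) in self.sessions

fw_a, fw_b = Firewall("fw-a"), Firewall("fw-b")
fw_a.forward("10.10.1.4", 50123, "10.20.1.8", 5432)

symmetric = fw_a.accept_return("10.20.1.8", 5432, "10.10.1.4", 50123)
asymmetric = fw_b.accept_return("10.20.1.8", 5432, "10.10.1.4", 50123)
```

Here `symmetric` is accepted and `asymmetric` is dropped, which is the mid-session failure the paragraph above warns about.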
Why Panorama is non-negotiable at this scale. Two firewalls per region across two regions is four data-plane devices, and the audit finding was inconsistency. Managing four firewalls by logging into each is exactly how East and West drift apart. Panorama makes the policy a single source of truth pushed everywhere, gives you one place to author a rule and one place to read every log, and lets you stage and validate a change before it hits production firewalls.
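Panorama reports device sync state itself; as an independent illustration of what "drift" means, the sketch below compares a digest of each firewall's rulebase against a reference. The firewall names, the rule dictionaries, and both helper functions are hypothetical, and nothing here calls a Panorama API.

```python
import hashlib
import json

def policy_digest(rules):
    """Order-sensitive digest of a rulebase (rule order matters in a firewall)."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def out_of_sync(reference_rules, deployed):
    """Return names of firewalls whose rulebase differs from the reference."""
    expected = policy_digest(reference_rules)
    return sorted(name for name, rules in deployed.items()
                  if policy_digest(rules) != expected)

reference = [
    {"name": "claims-egress-payments", "action": "allow"},
    {"name": "default-deny", "action": "deny"},
]
deployed = {
    "fw-eus-a": reference,
    "fw-eus-b": reference,
    "fw-wus-a": reference,
    "fw-wus-b": reference[1:],  # drifted: missing the allow rule
}
drifted = out_of_sync(reference, deployed)
```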
Implementation guidance
Provision with Terraform, and treat routing as the acceptance test. The deployment order matters because a half-wired forced tunnel leaks traffic silently: packets find the default Azure internet route and leave uninspected, and nothing errors. Build in this order:
- The hub VNet with subnets for management, untrust, trust, the gateway, and a Panorama subnet; the spoke VNets peered to the hub (hub-spoke peering, with allowForwardedTraffic so the firewall can forward).
- The load balancers: internal (trust) with the HA Ports rule and dataplane health probe, public (untrust) with the inbound rules.
- The VM-Series firewalls, multi-NIC, in an Availability Set or across zones, IP-forwarding enabled on the trust/untrust NICs, bootstrapped from an Azure Storage bootstrap package (so they come up already licensed and connected to Panorama).
- The UDRs: every spoke subnet gets a route table whose 0.0.0.0/0 points to the trust ILB frontend IP as VirtualAppliance next hop, and BGP/Azure default-route propagation is disabled on those subnets so nothing competes with the forced-tunnel route. Inter-spoke routes (or a blanket 0.0.0.0/0) likewise point at the firewall.
- Panorama deployed and the firewalls registered into a device group and template stack.
The forced-tunnel route is the single most important object in the build. A minimal Terraform shape communicates the intent — the spoke’s default route is the firewall, full stop:
```hcl
resource "azurerm_route_table" "spoke_claims" {
  name                          = "rt-spoke-claims-eus"
  location                      = "eastus"
  resource_group_name           = azurerm_resource_group.net.name
  bgp_route_propagation_enabled = false # nothing competes with the forced tunnel
}

resource "azurerm_route" "default_to_firewall" {
  name                   = "default-via-fw"
  route_table_name       = azurerm_route_table.spoke_claims.name
  resource_group_name    = azurerm_resource_group.net.name
  address_prefix         = "0.0.0.0/0"
  next_hop_type          = "VirtualAppliance"
  next_hop_in_ip_address = azurerm_lb.trust_ilb.frontend_ip_configuration[0].private_ip_address
}
```
Apply via GitHub Actions authenticating to Azure with OIDC federation, so there is no stored service-principal secret to leak. A required pipeline step runs a connectivity assertion: from a probe VM in each spoke, confirm that egress is SNAT’d to the firewall’s public IP and that a direct route does not exist — turning “is the forced tunnel actually forcing?” into an automated gate rather than a hope.
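The decision logic of such a connectivity assertion might look like the sketch below. It assumes the pipeline has already collected two inputs: the public IP an external echo endpoint observed for the probe VM, and the probe NIC's effective routes (for example from `az network nic show-effective-route-table`). The simplified route dictionaries, the IP addresses, and the `check_forced_tunnel` helper are all hypothetical.

```python
import ipaddress

def check_forced_tunnel(observed_egress_ip, firewall_public_ips, effective_routes):
    """Return a list of failures; an empty list means the spoke passes the gate."""
    failures = []
    allowed = {ipaddress.ip_address(ip) for ip in firewall_public_ips}
    if ipaddress.ip_address(observed_egress_ip) not in allowed:
        failures.append(f"egress seen from {observed_egress_ip}, not a firewall IP")
    for route in effective_routes:
        is_active_default = route["prefix"] == "0.0.0.0/0" and route["state"] == "Active"
        if is_active_default and route["next_hop_type"] != "VirtualAppliance":
            failures.append(f"active default route via {route['next_hop_type']}")
    return failures

firewall_ips = ["203.0.113.10", "203.0.113.11"]  # documentation-range placeholders
forced = [{"prefix": "0.0.0.0/0", "next_hop_type": "VirtualAppliance", "state": "Active"}]
leaky = [{"prefix": "0.0.0.0/0", "next_hop_type": "Internet", "state": "Active"}]

ok = check_forced_tunnel("203.0.113.10", firewall_ips, forced)
bad = check_forced_tunnel("198.51.100.77", firewall_ips, leaky)
```

A pipeline step would fail the run whenever the returned list is non-empty for any spoke.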
Bootstrap secrets and licensing via Vault. The VM-Series bootstrap needs an auth key (to register with Panorama) and licence/API tokens. Do not bake these into the image or a plaintext init script. Store them in HashiCorp Vault, have the deployment runner pull a short-lived bootstrap auth key at apply time, and write it into the bootstrap package; Vault also issues the dynamic Azure credentials the runner uses, so no long-lived cloud secret sits in the pipeline. The bootstrap key is consumed once and expires.
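As a rough sketch of the "write it into the bootstrap package" step, the function below renders an `init-cfg.txt` body with the auth key supplied at run time. The field names follow the VM-Series bootstrap format as commonly documented, but treat them as assumptions to verify against the Palo Alto documentation for your PAN-OS version. The Vault fetch itself is omitted, and every value shown is a placeholder.

```python
def render_init_cfg(auth_key, panorama_ip, template_stack, device_group, hostname):
    """Render init-cfg.txt content; the auth key is passed in, never hard-coded."""
    if not auth_key:
        raise ValueError("bootstrap auth key is required")
    fields = {
        "type": "dhcp-client",
        "hostname": hostname,
        "panorama-server": panorama_ip,
        "tplname": template_stack,
        "dgname": device_group,
        "vm-auth-key": auth_key,
    }
    # One key=value pair per line, in insertion order
    return "\n".join(f"{key}={value}" for key, value in fields.items()) + "\n"

cfg = render_init_cfg(
    auth_key="EXAMPLE-KEY",          # placeholder; would come from Vault at apply time
    panorama_ip="10.0.4.10",         # hypothetical Panorama address
    template_stack="ts-azure-hub",   # hypothetical template stack name
    device_group="dg-azure-eus",     # hypothetical device group name
    hostname="fw-eus-a",
)
```

Keeping the key as a function argument means it lives only in the runner's memory and in the generated package, not in the repository.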
Admin identity: federate the humans, gate the privilege. Panorama administrators authenticate via Okta SAML, so firewall admin uses the same workforce identity, MFA, and conditional-access posture as everything else — no local Panorama accounts to manage or offboard. Azure-side, subscription access uses Microsoft Entra ID with PIM (Privileged Identity Management) so changing the hub network is a time-boxed, approved, just-in-time elevation, not standing access. Role separation is real: network-platform engineers can change routing and infrastructure; only the security team’s Panorama role can author or commit firewall security rules.
Policy authoring. Define the App-ID-based rules in Panorama device groups, with shared rules for enterprise-wide baselines (block known-bad URL categories, deny risky apps, force decryption where policy allows) and device-group-specific rules per region or environment where they legitimately differ. A representative trust-to-untrust egress rule reads in plain terms — claims servers may reach the named payments application over the approved port, inspected for threats, and nothing else:
```text
rule "claims-egress-payments" {
  from zone trust;     source addr claims-app-servers;
  to zone untrust;     destination fqdn payments-api.partner.example;
  application [ ssl payments-partner-api ];   # App-ID, not just port
  service application-default;
  profile-setting group "strict-threat";      # IPS + AV + anti-spyware + URL
  action allow;        log-end yes;
}
```
Everything not explicitly allowed falls to a default deny rule that logs — so the audit answer to “what left the network and under what policy” is a single Panorama log query.
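The first-match, default-deny behavior can be sketched in a few lines. This is a simplified model, not how PAN-OS evaluates policy internally: the rule dictionaries mirror the plain-terms rule above, and the `evaluate` helper and flow fields are hypothetical.

```python
RULES = [
    {
        "name": "claims-egress-payments",
        "from_zone": "trust", "to_zone": "untrust",
        "source": "claims-app-servers",
        "destination": "payments-api.partner.example",
        "applications": {"ssl", "payments-partner-api"},
        "action": "allow",
    },
    # Catch-all: matches anything the rules above did not
    {"name": "default-deny", "action": "deny"},
]

def evaluate(flow, rules=RULES):
    """Return (action, rule name) for the first rule whose fields all match."""
    for rule in rules:
        scalar_ok = all(
            rule[field] == flow[field]
            for field in ("from_zone", "to_zone", "source", "destination")
            if field in rule
        )
        app_ok = "applications" not in rule or flow["application"] in rule["applications"]
        if scalar_ok and app_ok:
            return rule["action"], rule["name"]
    return "deny", "implicit-deny"

payments_flow = {
    "from_zone": "trust", "to_zone": "untrust",
    "source": "claims-app-servers",
    "destination": "payments-api.partner.example",
    "application": "payments-partner-api",
}
ssh_flow = dict(payments_flow, application="ssh")

allowed = evaluate(payments_flow)
denied = evaluate(ssh_flow)
```

Because every verdict carries the name of the rule that produced it, each log line can answer "under what policy" directly.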
Enterprise considerations
Security & Zero Trust. The hub is the Zero Trust enforcement point for the network: nothing routes east-west or egresses without an identity-, application-, and threat-aware verdict, and the default is deny. Layer on top: (a) micro-segmentation by giving each security tier its own zone so the firewall policy speaks in tier terms and a compromise is contained to its segment; (b) TLS decryption where compliance and privacy policy permit, so App-ID and Threat Prevention can see inside encrypted flows rather than waving them through. Decryption is the difference between real inspection and theatre, and where it is disallowed (e.g., certain PHI flows) you accept reduced visibility deliberately and document it; (c) Wiz running continuous CSPM that specifically watches for the failure mode that started this project (a spoke gaining a public IP, a NAT Gateway, or a UDR change that creates an uninspected egress path) and alerts the moment routing drifts away from the forced tunnel; (d) CrowdStrike Falcon on the management jump hosts and the bootstrap automation runners, since those are the high-value targets that can rewrite firewall policy; (e) a critical threat log (an IPS block on an active exploit, a C2 callback) auto-raises a ServiceNow incident so the SOC has a ticket, not just a log entry, and every firewall rule change goes through a ServiceNow change gate before it is committed in Panorama, giving the auditor a documented approval per rule.
Cost optimization. The VM-Series licence and the firewall VM compute are the dominant cost, and they scale with throughput, so size deliberately.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Right-size VM-Series SKU | Match the VM size/licence tier to measured throughput, not peak fear | Avoids over-buying dataplane capacity |
| Flexible/credit licensing | Use Palo Alto’s flexible vCPU licensing pooled across firewalls | Spend follows actual deployed capacity |
| Single hub per region | One inspection hub serves all spokes vs per-spoke appliances | Collapses 43 potential firewalls into 2/region |
| Reserved/savings-plan VMs | Commit the steady-state firewall VMs | Cuts compute ~30–60% vs on-demand |
| Scale-out only on demand | Add firewalls to the LB pool when throughput SLOs slip, not pre-emptively | Capacity tracks load |
Centralization is itself the biggest cost lever — two firewalls per region inspecting everything is dramatically cheaper than scattering appliances or, worse, the unmeasured cost of an audit failure.
Scalability. The data plane scales horizontally: add more VM-Series instances to the load-balancer backend pools and the ILB/public LB spread sessions across them, so throughput grows with instance count rather than being capped by one box. Each firewall scales vertically within its SKU’s session and throughput limits, so the levers are (1) a larger VM size/licence tier for more per-instance capacity and (2) more instances for aggregate capacity and resilience. Panorama scales by adding dedicated Log Collectors when log volume from many firewalls outgrows the management pair. The practical ceiling is per-instance session-table size and Azure LB flow limits, which is why a large estate plans the SKU and instance count against measured connections-per-second, not just bandwidth.
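The sizing logic can be written as simple arithmetic: find the tightest dimension, keep it under a target utilization, and add a spare unit for resilience. The figures below are invented for illustration and are not VM-Series datasheet values; the `firewalls_needed` helper and its defaults are assumptions.

```python
import math

def firewalls_needed(peak, per_instance, target_utilization=0.6, spare=1):
    """Instances required so the tightest dimension stays under target utilization."""
    ratios = [peak[dim] / (per_instance[dim] * target_utilization) for dim in peak]
    return math.ceil(max(ratios)) + spare

# Hypothetical numbers; use measured peaks and the vendor datasheet for your SKU
peak = {"gbps": 6.0, "cps": 45_000, "sessions": 900_000}
per_instance = {"gbps": 4.0, "cps": 50_000, "sessions": 2_000_000}

count = firewalls_needed(peak, per_instance)
```

With these inputs throughput is the binding constraint (6.0 / 2.4 = 2.5, rounded up to 3), and the spare unit brings the pool to 4. Swap in higher connection rates and connections-per-second becomes the limit instead, which is why bandwidth alone is a poor sizing input.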
Failure modes, and what each one looks like. Name them before they page the on-call.
- A leaky forced tunnel — a spoke subnet missing its UDR, or with Azure default-route propagation left on, sends traffic straight to the internet uninspected. It works perfectly and silently bypasses every control. Mitigation: assert UDRs in Terraform, disable route propagation, and run the per-spoke egress-path assertion in CI; let Wiz catch any post-deploy drift.
- Asymmetric routing — return traffic lands on a different firewall than the one holding the session, and the stateful firewall drops it as out-of-state. You see mid-session resets that defy single-packet debugging. Mitigation: LB floating-IP/DSR for symmetric return and consistent SNAT so a session always returns to its owner.
- Health-probe misconfiguration — the LB probes a port the firewall isn’t actually serving on, marks healthy units down, and the pool empties; or it probes too loosely and keeps a dead firewall in rotation, black-holing a share of flows. Mitigation: probe a true dataplane/management liveness signal and test failover before go-live.
- Firewall becomes the bottleneck — sessions or throughput exceed the SKU and latency climbs or new connections are refused, which looks like a flaky application, not a firewall. Mitigation: Dynatrace alarms on session-table utilization and throughput well before saturation; scale out the pool.
- Panorama push drift/outage — a failed or partial commit-push leaves firewalls out of sync, recreating the inconsistency you set out to eliminate. Mitigation: validate commits, push to both regions together, and alert on out-of-sync device state.
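For the first failure mode in particular, a static audit of declared route tables complements the runtime probe. The sketch below flags any spoke subnet that could bypass the firewall; the input shape is a hypothetical simplification of what an inventory export might contain, and `audit_spoke_subnets` is an illustrative helper rather than a feature of any tool named here.

```python
def audit_spoke_subnets(subnets, ilb_ip):
    """Return {subnet name: [problems]} for subnets that could bypass the firewall."""
    findings = {}
    for subnet in subnets:
        problems = []
        table = subnet.get("route_table")
        if table is None:
            problems.append("no route table attached")
        else:
            # Treat a missing flag as enabled, the unsafe assumption
            if table.get("bgp_route_propagation_enabled", True):
                problems.append("route propagation is enabled")
            defaults = [r for r in table["routes"] if r["prefix"] == "0.0.0.0/0"]
            if not any(r["next_hop_type"] == "VirtualAppliance"
                       and r["next_hop_ip"] == ilb_ip for r in defaults):
                problems.append("default route does not point at the trust ILB")
        if problems:
            findings[subnet["name"]] = problems
    return findings

ILB_IP = "10.0.2.4"  # hypothetical trust ILB frontend IP
good_table = {
    "bgp_route_propagation_enabled": False,
    "routes": [{"prefix": "0.0.0.0/0", "next_hop_type": "VirtualAppliance",
                "next_hop_ip": ILB_IP}],
}
subnets = [
    {"name": "claims-app", "route_table": good_table},
    {"name": "analytics-db", "route_table": None},
    {"name": "claims-batch",
     "route_table": {"bgp_route_propagation_enabled": True, "routes": []}},
]
findings = audit_spoke_subnets(subnets, ILB_IP)
```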
Reliability & DR (RTO/RPO). Within a region, the HA pair behind health-probed load balancers gives sub-minute failover for a single firewall loss; deploying the pair across Availability Zones survives a zone failure. For regional DR, the second region already runs its own hub and firewall pair receiving the same Panorama policy, so a regional outage is a routing/DNS failover to the standby region’s hub rather than a rebuild — the policy is already there. Panorama runs as an HA pair and its configuration is the source of truth, backed up and version-controlled, so the entire policy estate is recoverable. A pragmatic target for this platform: RTO under 5 minutes for in-region firewall failure (probe-driven), RTO 30–60 minutes for a full regional cutover (routing/DNS plus standby validation), and effectively zero RPO for policy since it is replicated by Panorama and held in source control. ExpressRoute should be provisioned with redundant circuits so the hybrid path is not a single point of failure.
Observability. Stream firewall traffic and threat logs to Panorama and on to the SOC’s SIEM as the security record, and separately pull operational telemetry into Dynatrace — firewall throughput and session counts, load-balancer health-probe status and backend availability, SNAT port utilization (a sneaky exhaustion failure under high egress fan-out), and end-to-end latency through the hub — with Davis anomaly detection so a throughput or session regression surfaces on its own rather than in a post-incident review. Emit the metrics the business and audit actually care about: percentage of egress that is inspected (should be 100% — anything less is a leaky tunnel), threats blocked, top applications by volume (App-ID’s gift — you finally know what is actually traversing the network), firewall capacity headroom, and policy-sync status across regions. The combination answers the auditor’s original question — every flow, its application, its verdict, its policy — from one place.
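The "percentage of egress that is inspected" metric could be computed along these lines, assuming an independent record of egress flows (for instance network flow logs) that can be joined to firewall log entries by some flow identifier. The identifiers and the `inspected_egress_pct` helper are hypothetical, and the join key is the hard part in practice.

```python
def inspected_egress_pct(egress_flows, firewall_logged_ids):
    """Share of egress flows that have a matching firewall log entry."""
    if not egress_flows:
        return 100.0  # no egress observed, so nothing escaped inspection
    seen = sum(1 for flow_id in egress_flows if flow_id in firewall_logged_ids)
    return round(100.0 * seen / len(egress_flows), 2)

egress_flows = ["f1", "f2", "f3", "f4"]   # flows observed leaving the estate
firewall_logged = {"f1", "f2", "f3"}      # flows the firewall has a log for

pct = inspected_egress_pct(egress_flows, firewall_logged)
```

Any value below 100 points at a flow the firewall never saw, which is the leaky-tunnel signal worth alerting on.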
Governance. Pin the PAN-OS version explicitly and promote upgrades through a staged Panorama push (validate on one device first) so behavior does not drift; keep firewall policy as a reviewable, revertable change set with a ServiceNow change record per modification. Apply Azure Policy to deny the creation of public IPs and NAT Gateways in spoke subscriptions and to require the forced-tunnel route table on workload subnets, with Wiz as the independent verifier that the guardrail is actually holding. Enforce role separation between who can change Azure routing and who can author firewall rules, and log every administrative action in Panorama for audit.
Explicit tradeoffs
Accept these or do not build it. The hub is a single chokepoint — powerful for control and audit, but it means the firewall pair’s capacity, health, and latency now sit on the critical path of essentially all traffic, so it must be HA, scaled with headroom, and monitored like the production-critical dependency it is. The active/active load-balancer sandwich is more complex than on-prem floating-IP HA and demands careful symmetric-routing and SNAT design; getting it subtly wrong yields intermittent drops that are miserable to debug. You are now operating firewall VMs — patching PAN-OS, sizing SKUs, owning the licence — which is real toil that a fully-managed native firewall would absorb. TLS decryption is what makes inspection meaningful, but it carries privacy, performance, and certificate-management weight, and some regulated flows you will deliberately leave undecrypted, accepting the visibility gap. And the forced tunnel that delivers the audit win also means there is exactly one way out: when the hub has a bad day, every spoke feels it, which is the price of having one place that sees and controls everything.
The alternatives, and when they win. If your team has no Palo Alto investment and wants the lightest operational load, Azure Firewall Premium gives you IDPS, TLS inspection, and FQDN filtering as a managed, autoscaling service — choose it when consistency with an existing Palo Alto estate is not a requirement. If your needs are coarse subnet isolation without application awareness or threat prevention, NSGs plus Azure NAT Gateway are far simpler and cheaper — choose them when “what application and is it a threat” is genuinely not a question you must answer. If you want centralized inspection without hand-building the routing sandwich, Azure Virtual WAN with a secured hub (Routing Intent) automates much of the forced-tunnel and next-hop wiring and can host the firewall — choose it to reduce the routing toil at some loss of fine-grained control. And if you only need to inspect east-west between a handful of workloads rather than the whole estate, a service-mesh or host-based segmentation approach may be lighter than routing everything through a network hub. This VM-Series hub is the right destination specifically when you have an existing Palo Alto practice, a regulated estate that must prove consistent application-aware control and one consolidated flow record across regions, and the scale to justify a dedicated inspection layer.
The shape of the win
For the payer’s security team, the payoff is not “we bought a firewall.” It is that when the auditor asks “show me every flow that left your network last Tuesday, its application, and the policy that allowed it,” the answer is a single Panorama query returning App-ID-classified flows, threat verdicts, and the named rule for each — across both regions, identically governed — and the answer to “can the claims spoke reach the analytics spoke” is “only on these three explicitly allowed flows, here is the deny log for everything else.” That is the sentence that passes the audit. Everything upstream — the forced-tunnel UDRs, the HA Ports load balancer, the active/active firewall pair, the Panorama policy fabric, the Vault-held bootstrap keys, the Wiz drift detection, the Dynatrace capacity telemetry — exists so that a compliance officer, a CISO, and an auditor can each see one consistent, application-aware, fully-logged boundary around the whole Azure estate. Start with one region and a few spokes if you must, but a regulated payer’s network has to land here.