A mid-sized automotive parts manufacturer — call it the kind of Tier-1 supplier that ships brake assemblies to three carmakers on just-in-time contracts — has a connectivity problem that has quietly become a board-level risk. Their MES (manufacturing execution system) and a fleet of PLCs on four plant floors still run on-prem, because a brake line cannot pause for a cloud round-trip and because the OT network is air-gapped by policy. But the demand-forecasting and supply-chain analytics that decide how many of each part to build now live in Azure, the connected-vehicle telematics ingestion their largest customer mandated lives in AWS, and the SAP estate is mid-migration with one foot in each. Today all of this is stitched together with a pair of site-to-site IPsec VPNs over the public internet, and last quarter a fibre cut at the ISP plus a BGP flap took the forecasting feed offline for six hours during a model-year ramp. The plant kept building to a stale forecast; the over-build cost more than a year of private connectivity would. The CIO’s directive is blunt: private, redundant reach from the data centres into both clouds, no single point of failure, and one place to provision and reason about it all. This article is the reference architecture for that — hybrid connectivity to Azure and AWS through a Megaport fabric, built so a network architect, a security lead, and a CFO each sign off.
The pressures here are the classic ones, just wearing factory clothes. Resilience is the headline: a JIT manufacturer that loses its demand signal builds the wrong parts, and a six-hour outage is a real line-down event with contractual penalties. Performance and determinism matter because analytics that drive same-day build decisions cannot tolerate the jitter and unpredictable latency of best-effort internet. Security and compliance matter because automotive customers now impose TISAX-style assessments and because OT/IT segmentation is non-negotiable. And cost matters because dedicated circuits and port-hours are not free, and the CFO will rightly ask why this beats two more VPNs. Private, fabric-provisioned interconnect satisfies all four at once: dedicated bandwidth, predictable latency, traffic that never touches the public internet, and — critically — virtual cross-connects you can stand up, resize, and tear down in minutes instead of the months a carrier cross-connect used to take.
Why not the obvious shortcuts
Each cheaper option fails in a way someone on the project will have to be talked out of.
More site-to-site VPNs over the internet are quick and cheap, but they inherit every pathology that caused the outage: shared public paths, ISP-dependent routing, throughput capped by IPsec and CPU, and latency that swings with internet weather. They are a fine backup, and we keep them as exactly that — not as the primary for a line-down-sensitive feed.
A single carrier ExpressRoute or a single Direct Connect, ordered the traditional way, takes weeks to months to provision, locks you to one carrier’s footprint, and — if you buy just one — reintroduces the single point of failure you are trying to kill. Buying two carrier circuits to two clouds means four separate orders, four contracts, four lead times, and four mental models.
A cloud-vendor-native transit (Azure Virtual WAN alone, or AWS Cloud WAN alone) solves one cloud beautifully and leaves the other stranded, which is the opposite of what a genuinely multi-cloud manufacturer needs. You would still need a private on-ramp into each, and you would be operating two unrelated connectivity planes.
A Network-as-a-Service fabric threads the needle. You provision a small number of physical ports into the fabric once, then spin up Virtual Cross Connects (VXCs) from those ports to an ExpressRoute circuit and a Direct Connect connection as software constructs — same pair of ports, multiple private on-ramps, minutes not months, and one portal and one API (Terraform-driven) to reason about the whole topology. The fabric becomes the single layer-2/3 meeting point between your data centres and both clouds.
Architecture overview
The design is a dual-region, fully diverse hub-and-spoke that meets both clouds at a Network-as-a-Service fabric. Hold two ideas separately and the rest follows: there is a physical layer (ports, cross-connects, fibre paths that must be diverse) and a logical layer (BGP sessions, route advertisements, and the policies that decide which path carries which traffic). Most outages on hybrid links are failures of the logical layer hiding behind a healthy physical one.
The defining property of the whole topology is the one the CIO demanded: no single point of failure on any layer. That means two physical ports into the fabric from two diverse points of presence (PoPs) reached over two carriers; ExpressRoute’s mandatory redundant pair of connections; Direct Connect taken as two connections in two AWS Direct Connect locations; and BGP tuned so that the loss of any one port, PoP, circuit, or cloud-edge router shifts traffic to a surviving path within seconds, never silently blackholes it.
The data-plane path from a plant analytics query to Azure runs as follows:
- A demand-forecasting job on a plant-floor server needs data in Azure. It resolves a private IP (RFC 1918) for the Azure workload — no public DNS, no public endpoint — and the packet enters the on-prem network.
- The packet hits the on-prem edge routers (a redundant pair). These hold BGP sessions out to the fabric and, by route preference, send cloud-bound traffic onto the primary physical port into the Megaport fabric, with the second port as the diverse standby. Internet-bound traffic does not take this path at all.
- Inside the fabric, a Virtual Cross Connect (VXC) carries the traffic to the ExpressRoute circuit’s edge. ExpressRoute presents its mandatory primary and secondary connections; BGP sessions run over both, advertising the manufacturer’s on-prem prefixes to Azure and learning Azure’s prefixes in return.
- Traffic lands in the Azure hub VNet — a Virtual WAN hub (or a classic hub VNet with an ExpressRoute Gateway). From the hub it routes via VNet peering into the spoke VNets that hold the forecasting and SAP-on-Azure workloads. The hub is the single Azure ingress for on-prem traffic and the place where Azure Firewall inspects east-west and north-south flows.
- The response retraces the path. Because the fabric, ExpressRoute, and the on-prem edge all carry symmetric BGP-learned routes, the return packet takes a private path home — no asymmetric drop, no internet leak.
Data-plane path to AWS is the mirror image and shares the same two physical ports: a second pair of VXCs from the fabric reaches two Direct Connect connections in two DX locations. Each Direct Connect carries a transit Virtual Interface (VIF) to a Direct Connect Gateway, which associates to a Transit Gateway (a private VIF reaches virtual private gateways, not a Transit Gateway); the Transit Gateway is the AWS hub, fanning out via attachments to the VPCs that hold telematics ingestion and the AWS side of SAP. BGP over each VIF advertises on-prem prefixes into AWS and learns VPC prefixes back.
The fabric in the middle is the unifying move. Two physical ports, provisioned once, terminate as many private on-ramps as you need — here an ExpressRoute pair and a Direct Connect pair — and the entire mesh is described in Terraform through the fabric provider and the two cloud providers, so the topology is code, reviewable in a pull request, and reproducible in a DR region.
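The "no single point of failure" property can be checked mechanically before anything is ordered. The sketch below is one way to do that, assuming a simplified directed graph of the design: every node name is hypothetical, and the edges only model reachability from the plant toward each cloud hub, not bandwidth or routing policy.

```python
from collections import deque

# Directed edges from on-prem toward the clouds; all node names are hypothetical.
FULL_DESIGN = [
    ("plant", "edge-a"), ("plant", "edge-b"),
    ("edge-a", "edge-b"), ("edge-b", "edge-a"),
    ("edge-a", "port-pop1"), ("edge-b", "port-pop2"),
    ("port-pop1", "er-primary"), ("port-pop2", "er-secondary"),
    ("port-pop1", "dx-loc1"), ("port-pop2", "dx-loc2"),
    ("er-primary", "azure-hub"), ("er-secondary", "azure-hub"),
    ("dx-loc1", "aws-tgw"), ("dx-loc2", "aws-tgw"),
]


def reachable(edges, start, removed=None):
    """Return every node reachable from start once `removed` is taken out."""
    adjacency = {}
    for src, dst in edges:
        if removed in (src, dst):
            continue
        adjacency.setdefault(src, set()).add(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(edges, source, targets):
    """List nodes whose loss alone cuts the source off from any target."""
    candidates = {n for edge in edges for n in edge} - {source, *targets}
    return sorted(
        n for n in candidates
        if not set(targets) <= reachable(edges, source, removed=n)
    )


CLOUD_HUBS = ["azure-hub", "aws-tgw"]
print(single_points_of_failure(FULL_DESIGN, "plant", CLOUD_HUBS))

# The same design with only one Direct Connect location.
single_dx = [edge for edge in FULL_DESIGN if "dx-loc2" not in edge]
print(single_points_of_failure(single_dx, "plant", CLOUD_HUBS))
```

The full design reports no single points of failure, while dropping the second Direct Connect location immediately surfaces the port, edge router, and DX location that AWS reach then hangs on.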
Component breakdown
| Layer | Component / service | Role in the architecture | Key configuration choices |
|---|---|---|---|
| On-prem edge | Redundant edge router pair | Originate on-prem prefixes; prefer private path; keep internet separate | eBGP to fabric; BFD; route-maps to control advertisements |
| Fabric ports | Two physical ports, diverse PoPs | Single physical meeting point with both clouds | Two carriers, two PoPs; LAG optional; diverse fibre entries |
| Fabric VXCs | Virtual Cross Connects | Software circuits to ExpressRoute + Direct Connect | Per-cloud VXC pairs; rate-limited; provisioned via Terraform/API |
| Azure circuit | ExpressRoute (+ ER Gateway) | Private L3 reach into Azure | Redundant primary/secondary; Private peering; ER Gateway in hub |
| Azure hub | Virtual WAN hub / hub VNet | Single Azure ingress; spoke fan-out; inspection point | Hub routing; Azure Firewall; VNet peering to spokes |
| AWS circuit | Direct Connect (+ DX Gateway) | Private L3 reach into AWS | Two connections, two DX locations; transit VIF; DX Gateway |
| AWS hub | Transit Gateway | Single AWS ingress; VPC fan-out | DXGW association; route tables per attachment; appliance-mode for inspection |
| Routing control | BGP everywhere | Decide active path, fail over, prevent leaks | AS-path prepend, MED, local-pref; BFD; max-prefix limits |
| Identity / SSO | Microsoft Entra ID + Okta | Admin access to portals, clouds, network tooling | Okta workforce SSO federated to Entra; conditional access; MFA on all consoles |
| Secrets | HashiCorp Vault | Fabric/cloud API tokens, BGP MD5 keys, device creds | Dynamic cloud creds; short leases; no static keys in pipelines |
| CSPM / posture | Wiz + Wiz Code | Detect misconfig, public exposure, IaC drift before merge | Agentless scan of both clouds; Wiz Code scans Terraform in PR |
| Runtime security | CrowdStrike Falcon | Runtime protection on hub appliances and on-prem servers | Sensors on inspection VMs and edge hosts; detections to SOC |
| Observability | Dynatrace / Datadog | Path health, BGP state, latency/loss per link, flow telemetry | SNMP/flow ingest; synthetic probes per path; alert on path-down |
| ITSM / change | ServiceNow | Change approval for any routing or circuit change | Change gate on Terraform apply; auto-incident on path failover |
| CI / IaC | GitHub Actions / Jenkins + Terraform / Ansible | Provision fabric+cloud as code; configure edge devices | OIDC to clouds; Ansible pushes edge router config; plan/approval gate |
| Edge protection | Akamai | Protect the few public-facing apps the plants still expose | WAF/DDoS at the edge, kept entirely off the private hybrid path |
A few of these choices deserve an explanation, because they are where hybrid links actually break.
Why the fabric instead of direct carrier circuits. The traditional path — order a carrier ExpressRoute, separately order a Direct Connect — gives you long lead times, carrier lock-in, and four uncoordinated contracts. Provisioning VXCs over a shared pair of fabric ports collapses that to one port order and software-defined circuits you can create, resize from 1 Gbps to 10 Gbps, or delete through Terraform in minutes. When the telematics contract doubled ingestion volume, the AWS VXCs were resized in an afternoon without touching the Azure side or ordering anything physical.
Why ExpressRoute is a redundant pair by design, and Direct Connect must be made one. ExpressRoute always provisions two connections (primary and secondary): Microsoft builds the redundancy in, and you must run BGP over both rather than treat one as ornamental. AWS Direct Connect does not give you that for free. A single Direct Connect connection is a single point of failure, so resilience is your responsibility: take two connections in two separate Direct Connect locations, each carrying a transit VIF to the DX Gateway (the VIF type a Transit Gateway association requires), so a location or device loss costs you one path, not all reach to AWS.
Why hub-and-spoke on both sides. A hub concentrates the cloud-edge gateway, firewall inspection, and shared services in one VNet/VPC, and spokes peer or attach to it, so on-prem traffic has exactly one ingress per cloud to secure and observe, and adding a workload is adding a spoke, not re-plumbing the edge. On Azure this is the Virtual WAN hub (or a hub VNet with an ExpressRoute Gateway); on AWS it is the Transit Gateway, the native any-to-any router and the practical way to fan the Direct Connect pair out to many VPCs.
BGP and route control — where hybrid actually lives
The physical diversity above buys you nothing if BGP sends all traffic down one path and silently blackholes on failure. Route control is the real engineering, and it is worth being explicit.
Symmetric, deterministic path selection. You decide which physical port is primary and make BGP agree in both directions, because asymmetric routing across a stateful firewall drops connections. The standard levers:
    ! On-prem → cloud (egress): prefer the primary port.
    ! Apply inbound to routes learned from the primary neighbor;
    ! the highest local-preference wins for egress.
    route-map TO_FABRIC_PRIMARY permit 10
     set local-preference 200
    !
    ! Cloud → on-prem (ingress): make the secondary path less attractive.
    ! - AS-path prepend our own ASN on the secondary advertisement
    ! - or raise MED on the secondary so the cloud prefers the primary
    ! Apply outbound to the secondary neighbor; a longer path is less preferred inbound.
    route-map ADVERTISE_SECONDARY permit 10
     set as-path prepend 65010 65010
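local-preference controls your egress; AS-path prepend and MED influence which path the cloud prefers for the return traffic, and you need both halves to keep the flow symmetric. AS-path prepending is the most portable of these levers; how each cloud weighs MED and its own BGP communities differs, so confirm the behaviour against the ExpressRoute and Direct Connect routing documentation before relying on it.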
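To see why both halves are needed, it helps to model the decision each side makes. The following is a deliberately simplified sketch of BGP best-path selection (highest local-preference, then shortest AS path, then lowest MED). Real implementations have more tie-breakers and only compare MED between routes from the same neighbouring AS; the `Route` type and the attribute values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    via: str                 # which physical path the route was learned over
    local_pref: int = 100    # set locally; higher wins
    as_path: tuple = ()      # shorter wins
    med: int = 0             # lower wins


def best_path(routes):
    """Simplified best-path: local-pref, then AS-path length, then MED."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))


# On-prem view of a cloud prefix: local-pref 200 applied on the primary.
egress = [
    Route(via="primary", local_pref=200, as_path=(12076,)),
    Route(via="secondary", local_pref=100, as_path=(12076,)),
]

# Cloud view of an on-prem prefix: the secondary advertisement is prepended.
ingress = [
    Route(via="primary", as_path=(65010,)),
    Route(via="secondary", as_path=(65010, 65010, 65010)),
]

forward = best_path(egress).via
reverse = best_path(ingress).via
print(forward, reverse, "symmetric" if forward == reverse else "ASYMMETRIC")
```

If the prepend is forgotten, the two ingress routes tie on every modelled attribute and the return path is no longer pinned by policy, which is exactly the asymmetry a stateful firewall punishes.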
Fast failure detection. Plain BGP waits for its hold timer to expire (commonly 180 s by default) before it declares a silent peer dead, which is far too slow for a line-down feed. Enable BFD (Bidirectional Forwarding Detection) on every session so a path loss is detected in about a second or less and BGP reconverges onto the surviving path before the MES job times out.
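The gap between the two detection mechanisms is simple arithmetic. The sketch below assumes a 300 ms BFD interval with a multiplier of 3, which is a commonly quoted configuration but still an assumption here; the timers actually negotiated with each cloud edge should be confirmed on the session itself.

```python
def bfd_detection_ms(tx_interval_ms: int, multiplier: int) -> int:
    """A BFD peer is declared down after `multiplier` missed intervals."""
    return tx_interval_ms * multiplier


HOLD_TIMER_S = 180                      # BGP hold timer from the text above
detect_ms = bfd_detection_ms(300, 3)    # assumed interval and multiplier

print(f"BFD detects a dead path in about {detect_ms} ms")
print(f"Hold-timer expiry would take {HOLD_TIMER_S * 1000 // detect_ms}x longer")
```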
Don’t let a leak become an outage. Set a maximum-prefix limit on each session so a misconfiguration on one side cannot flood your edge routers’ tables, and use route-maps / prefix-lists to advertise only the prefixes each cloud should learn — the OT network ranges are never advertised to either cloud, enforcing the air-gap in routing as well as in policy.
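The advertisement policy can be prototyped and unit-tested outside the router before it becomes a prefix-list. This is a minimal sketch using Python's standard `ipaddress` module; the address ranges, the per-session limit, and the function name are all hypothetical.

```python
import ipaddress

# Hypothetical ranges: what a cloud may learn, and what it must never learn.
SANCTIONED = [ipaddress.ip_network("10.10.0.0/16")]
OT_RANGES = [ipaddress.ip_network("10.200.0.0/16")]


def advertisable(candidates, sanctioned, forbidden, max_prefixes):
    """Keep prefixes inside a sanctioned range that touch no forbidden range."""
    allowed = []
    for text in candidates:
        net = ipaddress.ip_network(text)
        if any(net.overlaps(bad) for bad in forbidden):
            continue  # never leak OT space, even partially
        if any(net.subnet_of(ok) for ok in sanctioned):
            allowed.append(net)
    if len(allowed) > max_prefixes:
        raise ValueError(
            f"{len(allowed)} prefixes exceeds the session limit of {max_prefixes}"
        )
    return allowed


routes = ["10.10.1.0/24", "10.200.5.0/24", "192.168.0.0/24"]
print(advertisable(routes, SANCTIONED, OT_RANGES, max_prefixes=100))
```

Running the same function in CI against the intended prefix-list gives an early failure if someone adds an OT range or pushes the count past what the peer will accept.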
| Failure | What BGP/BFD does | Result |
|---|---|---|
| Primary fabric port dies | BFD drops the session; secondary’s routes become best | Traffic shifts to standby port in < 1s |
| One ExpressRoute connection fails | Surviving connection’s BGP session carries all Azure traffic | Azure reach preserved, reduced bandwidth |
| One Direct Connect location fails | Second DX location’s VIF carries all AWS traffic | AWS reach preserved, reduced bandwidth |
| Entire fabric / both ports down | On-prem edge withdraws private routes; IPsec VPN backup takes over | Degraded but connected (lower throughput) |
| BGP flap / route leak | max-prefix limit + prefix-lists contain it; alert fires | No table flood, no blackhole |
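The failure table reduces to one rule: among the paths that are still up, the one with the highest preference carries the traffic. A small sketch of that rule follows; the path names and preference values are illustrative assumptions rather than values from a real configuration.

```python
# Hypothetical local-preference values: private paths first, VPN last.
PATH_PREFERENCE = {
    "fabric-primary": 200,
    "fabric-secondary": 150,
    "ipsec-vpn": 50,
}


def active_path(paths_up):
    """Return the most preferred surviving path, or None if nothing is up."""
    candidates = [p for p in paths_up if p in PATH_PREFERENCE]
    return max(candidates, key=PATH_PREFERENCE.get, default=None)


print(active_path({"fabric-primary", "fabric-secondary", "ipsec-vpn"}))
print(active_path({"fabric-secondary", "ipsec-vpn"}))   # primary port died
print(active_path({"ipsec-vpn"}))                       # whole fabric down
print(active_path(set()))                               # total isolation
```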
Implementation guidance
Provision the whole mesh as code, network first. Order matters: the physical ports and circuits must exist before the logical sessions, and DNS/routing must be correct before any workload depends on the path.
- Terraform the two fabric ports (diverse PoPs/carriers), then the VXCs to ExpressRoute and to the two Direct Connect connections.
- Create the ExpressRoute circuit (with its redundant pair) and the Direct Connect connections + transit VIFs; on the cloud side, stand up the Virtual WAN hub / ER Gateway and the Transit Gateway + DX Gateway.
- Configure BGP on every leg with BFD, prefix-lists, and the local-pref/MED/prepend policy above; Ansible pushes the matching config to the on-prem edge pair so device state is also code, not console clicks.
- Wire hub-and-spoke — VNet peering on Azure, TGW attachments on AWS — and place the firewall/inspection appliances in each hub.
- Keep the IPsec VPN backup configured and BGP-attached as a lower-preference path, so loss of the entire fabric degrades rather than disconnects.
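The ordering constraint in the steps above is a dependency graph, and Terraform resolves the equivalent graph on its own. Sketching it explicitly is still useful for runbooks and for staging a rollout. The example below uses the standard-library `graphlib` module (Python 3.9+); the step names and the exact dependencies are assumptions drawn from the list, for instance that a VXC needs both its fabric port and the cloud circuit whose key it references.

```python
from graphlib import TopologicalSorter

# step -> the steps that must already exist (names are illustrative)
DEPENDS_ON = {
    "fabric_ports": set(),
    "cloud_circuits": set(),      # ExpressRoute circuit, Direct Connect connections
    "cloud_hubs": set(),          # vWAN hub / ER Gateway, Transit Gateway + DX Gateway
    "vxcs": {"fabric_ports", "cloud_circuits"},
    "bgp_sessions": {"vxcs", "cloud_hubs"},
    "hub_and_spoke": {"cloud_hubs"},
    "vpn_backup": {"cloud_hubs"},
    "workloads": {"bgp_sessions", "hub_and_spoke", "vpn_backup"},
}

order = list(TopologicalSorter(DEPENDS_ON).static_order())
for position, step in enumerate(order, start=1):
    print(position, step)
```

Several valid orders exist; what matters is that no step appears before something it depends on, and that workloads come last.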
A minimal Terraform shape for one cloud-bound VXC communicates the intent — a software circuit from a fabric port straight to the ExpressRoute service key:
    # Illustrative shape only: block and attribute names differ between
    # Megaport provider versions, so check the provider documentation.
    resource "megaport_vxc" "to_azure_primary" {
      product_name = "vxc-onprem-to-azure-er-primary"
      rate_limit   = 1000 # Mbps; resizable later
      port_uid     = megaport_port.primary.product_uid

      csp_settings {
        attached_to = "AZURE_EXPRESS_ROUTE"
        service_key = var.expressroute_service_key # from the ER circuit

        peering {
          type     = "private"
          peer_asn = 12076 # Azure's ER ASN
        }
      }
    }
    # A mirror resource on megaport_port.secondary gives the diverse second path.
The pipeline that applies this runs in GitHub Actions (or Jenkins where the manufacturer already standardised on it), authenticating to Azure and AWS via OIDC federation so there is no stored cloud secret to leak. Wiz Code scans the Terraform in the pull request and fails the build on a misconfiguration — a route table that would leak the OT range, a VIF without redundancy — before anything reaches the fabric. A terraform plan that changes routing is gated behind a ServiceNow change approval, because a one-line BGP edit can take a plant offline.
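One way to implement the "routing changes need approval" gate is to inspect the machine-readable plan that `terraform show -json` produces and look at its `resource_changes` entries. The sketch below does only that detection step. The set of resource types treated as routing-affecting is an assumption to adapt to the providers actually in use, and the hand-off to ServiceNow is deliberately left out.

```python
import json

# Assumed list of resource types whose changes can alter routing or circuits.
ROUTING_TYPES = {
    "megaport_vxc",
    "azurerm_express_route_circuit",
    "aws_dx_transit_virtual_interface",
    "aws_ec2_transit_gateway_route",
}
HARMLESS_ACTIONS = (["no-op"], ["read"])


def routing_changes(plan: dict) -> list:
    """Return addresses of planned changes that touch routing-affecting resources."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if change["type"] in ROUTING_TYPES
        and change["change"]["actions"] not in HARMLESS_ACTIONS
    ]


# A trimmed, hand-written example of the plan JSON structure.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "megaport_vxc.to_azure_primary", "type": "megaport_vxc",
     "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"actions": ["create"]}},
    {"address": "azurerm_express_route_circuit.main",
     "type": "azurerm_express_route_circuit",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

flagged = routing_changes(sample_plan)
print("needs change approval:" if flagged else "no routing impact", flagged)
```

In a pipeline, a non-empty result would block the apply job until the corresponding change record is approved.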
Identity: federate the humans, kill the static keys. Every console that can change this network (the fabric portal, the Azure and AWS consoles, the device manager) sits behind SSO. The manufacturer’s workforce signs in through Okta, federated to Microsoft Entra ID so Azure RBAC sees a first-class token, with conditional access and MFA mandatory on every one of those consoles. The fabric and cloud API tokens, BGP MD5 authentication keys, and edge-device credentials that the pipeline and Ansible need are never embedded. They live in HashiCorp Vault: cloud credentials are leased dynamically with short TTLs, and secrets that cannot be generated on demand, such as BGP MD5 keys, are read at run time, so a leaked pipeline log exposes as little durable material as possible. (The team has a standing rule, learned the hard way, that no credential is ever committed.)
Enterprise considerations
Security & segmentation. Private interconnect is not the same as a secure one: a private link with a flat route table is just a faster way to spread a compromise. Layer accordingly: (a) advertise only sanctioned prefixes per cloud and never the OT ranges, so the air-gap holds in routing; (b) terminate on-prem traffic at a hub firewall (Azure Firewall in the vWAN hub; an inspection VPC with Transit Gateway appliance-mode on AWS) and inspect east-west and north-south; (c) optionally enable MACsec on the fabric ports for layer-2 encryption of the cross-connects, since ExpressRoute/Direct Connect private peering is private but not inherently encrypted; (d) run Wiz continuous CSPM across both clouds so any drift to public exposure, an over-broad route, or a misconfigured peering is flagged, with Wiz Code catching the same classes in IaC before merge; (e) put CrowdStrike Falcon sensors on the hub inspection appliances and the on-prem servers that ride the link, feeding the manufacturer’s SOC; (f) any path failover or routing change auto-raises a ServiceNow incident so the network team has a ticket, not just a graph. The handful of genuinely public plant apps stay behind Akamai for WAF and DDoS at the internet edge, deliberately off the private hybrid path entirely, so the two planes never blur.
Cost optimisation. Private connectivity is a real bill, and the CFO will compare it to “two more VPNs,” so engineer the spend.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Right-size VXC rate-limits | Provision Mbps to measured demand; resize in software | Pay for actual throughput, not peak guesses |
| Shared ports, many VXCs | Two physical ports backing all on-ramps to both clouds | One port cost amortised across every circuit |
| ExpressRoute SKU choice | Metered vs unlimited by egress profile; Local SKU for in-region | Match billing model to traffic shape |
| Watch cloud egress | Private interconnect lowers egress rates but does not zero them | Model data-transfer-out per cloud, both directions |
| Backup, not duplicate | Keep VPN as cheap standby, not a second always-on private link | Resilience without doubling circuit spend |
Pipe per-link bytes and circuit utilisation into Datadog/Dynatrace so utilisation drives the resize decisions and the CFO sees a chargeback view per business line.
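A resize decision driven by utilisation can be as simple as taking a high percentile of observed throughput and adding headroom. The sketch below shows that calculation; the 95th percentile, the 30% headroom, and the 100 Mbps rounding step are all assumptions to tune, and the samples are synthetic rather than measured.

```python
import math


def suggest_rate_limit(samples_mbps, headroom=1.3, step_mbps=100):
    """Suggest a VXC rate limit: p95 of observed Mbps plus headroom, rounded up."""
    if not samples_mbps:
        raise ValueError("need at least one utilisation sample")
    ordered = sorted(samples_mbps)
    p95_index = math.ceil(0.95 * len(ordered)) - 1   # nearest-rank percentile
    p95 = ordered[p95_index]
    return math.ceil(p95 * headroom / step_mbps) * step_mbps


# Synthetic samples: 100 readings from 10 to 1000 Mbps.
samples = list(range(10, 1010, 10))
print(suggest_rate_limit(samples))
```

Feeding this from the same per-link metrics the observability stack already collects keeps the resize conversation with finance anchored in measured demand.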
Scalability. Each layer scales independently. VXCs resize in software (1→10 Gbps) without re-cabling, so the telematics ingestion growth was absorbed without a physical change. Azure scales by adding spoke VNets peered to the hub (the Virtual WAN hub raises the per-hub connection ceiling); AWS scales by adding VPC attachments to the Transit Gateway, the native fan-out point. New plants or regions are new ports into the fabric and new VXCs from them, described in the same Terraform — the topology grows by composition, not redesign. The natural ceilings to plan for are ExpressRoute’s prefix-advertisement limits and the Transit Gateway’s per-attachment bandwidth, both of which you design around early, not discover under load.
Reliability & DR (RTO/RPO). This architecture is the resilience story, but name the numbers. With diverse ports, the ExpressRoute pair, dual Direct Connect locations, and BFD-driven BGP, the loss of any single port, PoP, circuit, or cloud-edge router fails over sub-second to a few seconds — a degraded-bandwidth event, not an outage. Loss of the entire fabric drops to the IPsec VPN backup automatically (lower throughput, still connected). For a regional cloud-edge disaster, the same Terraform stands the meeting point up in a second fabric metro with its own ExpressRoute/Direct Connect, failed over at the hub. A pragmatic target for this manufacturer: RTO seconds for single-path loss, minutes for a full-fabric failover to VPN, and a documented hours-scale rebuild of a region from code if a whole metro is lost. The connectivity layer carries no state, so RPO is a property of the workloads on top, not of the network.
Observability. Instrument every path independently in Dynatrace / Datadog — BGP session state, BFD up/down, per-link latency, packet loss, and bytes in each direction — plus synthetic probes that traverse each physical path end to end so a partially failed circuit (up at layer 1, broken at BGP) surfaces as its own alert rather than as mysterious application latency. The metrics the business actually feels: per-path latency and loss, failover events and time-to-reconverge, circuit utilisation vs rate-limit, and forecasting-feed availability — the SLO that maps straight to the line-down risk that started this. A path-down or a flap pages the network team and opens a ServiceNow record automatically.
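The "up at layer 1, broken at BGP" case is easiest to catch when each path's signals are combined into a single state rather than alerted on separately. A minimal sketch of that classification follows; the three boolean inputs and the state labels are hypothetical and would map onto whatever the monitoring platform actually exposes.

```python
def classify_path(link_up: bool, bgp_established: bool, probe_ok: bool) -> str:
    """Collapse per-path signals into one alertable state."""
    if not link_up:
        return "down"
    if not bgp_established:
        return "partial: link up, BGP down"
    if not probe_ok:
        return "partial: BGP up, end-to-end probe failing"
    return "healthy"


# Hypothetical readings for the four private paths.
readings = {
    "er-primary": (True, True, True),
    "er-secondary": (True, False, False),
    "dx-loc1": (True, True, False),
    "dx-loc2": (False, False, False),
}

for path, signals in readings.items():
    print(f"{path}: {classify_path(*signals)}")
```

Each non-healthy state can then carry its own alert route, so a partially failed circuit pages the network team instead of surfacing as unexplained application latency.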
Governance. The topology is Terraform in version control, so every circuit, VXC, peering, and route policy is reviewable in a pull request and revertable in one. Wiz Code is the gate that a change does not leak the OT range or drop a redundancy before it merges; ServiceNow is the human change-approval gate before a routing-affecting apply; and Ansible keeps the on-prem edge config as code rather than as tribal knowledge in someone’s terminal history. BGP authentication keys and API tokens are pinned to Vault with short leases, never floating in a pipeline variable.
Explicit tradeoffs
Accept these or do not build it. Private, redundant, multi-cloud interconnect is genuinely more to own than two VPNs: ports and circuits cost real money every month whether or not they are saturated, BGP route control is a skill the team must actually have (a wrong local-pref or a forgotten prepend creates asymmetric routing that drops sessions across the firewall), and the diversity that makes it resilient — two carriers, two PoPs, two DX locations, the ExpressRoute pair — is more moving parts to provision and monitor. The fabric adds a vendor and a layer between you and the clouds; you trade a little direct control for software-defined circuits and one API. And the security posture that keeps OT air-gapped in routing means disciplined prefix-lists you must maintain, not a flat any-to-any you set once.
The alternatives, and when they win. If you are single-cloud and small, a single carrier ExpressRoute or a single Direct Connect, taken with that vendor’s built-in redundancy, is simpler and may be all you need. If your latency tolerance is genuinely loose and the data is not line-down-critical, resilient site-to-site VPNs over two ISPs are dramatically cheaper and a legitimate choice — we kept them, just as the backup. If you live overwhelmingly in one cloud, that cloud’s native transit (Virtual WAN or Cloud WAN) end to end is the path of least resistance. The fabric-based dual-cloud design here earns its complexity precisely when you are genuinely multi-cloud, latency- and resilience-sensitive, and want one provisioning plane over both on-ramps — which is exactly the manufacturer’s situation, and increasingly the default for anyone running production workloads in more than one cloud.
The shape of the win
For the manufacturer, the payoff is not “faster links.” It is that the next fibre cut at the ISP is a non-event — BFD drops the session, BGP reconverges onto the diverse path in under a second, the demand-forecasting feed never blinks, and the plant keeps building to a current signal instead of a stale one. The over-build that cost more than a year of connectivity does not happen again, the telematics ingestion their largest customer mandated runs on a private path that passed the TISAX assessment, and the SAP migration straddling both clouds rides one coherent, code-described network instead of a tangle of VPNs. Everything upstream — the diverse fabric ports, the ExpressRoute pair, the dual Direct Connect locations, the disciplined BGP, the Vault-held keys, the Wiz posture scanning, the Dynatrace per-path probes — exists so that a network architect, a security lead, and a CFO each say yes. Start single-cloud if you must; this is where a resilient, multi-cloud, line-down-sensitive hybrid network has to land.