Most cloud networks do not fail on day one. They fail in year two, when the third business unit shows up with its own VPCs, the security team mandates traffic inspection, an acquisition arrives with overlapping CIDRs, and finance discovers that forty-three VPCs each run their own NAT gateways. What started as a handful of peering connections has quietly become a full mesh that no one can reason about, audit, or price.
This article is a reference architecture for hybrid connectivity at scale on AWS — a network foundation that connects on-premises data centers, branch offices, and a growing fleet of VPCs through a single, inspectable, governable core. It is built around four load-bearing services: AWS Transit Gateway as the routing hub, AWS Direct Connect (with Site-to-Site VPN as the failover path) for the on-premises link, AWS Network Firewall for centralized east-west and egress inspection, and a centralized egress VPC so that internet-bound traffic leaves through one priced, logged, controlled door.
The business scenario
Picture a mid-sized financial-services or manufacturing enterprise — call it 600 to 8,000 employees — three to five years into its cloud journey. The first workloads went up as isolated VPCs, connected to the data center with one-off VPN tunnels and stitched together with VPC peering. It worked. Then it grew.
The recurring symptoms look the same across every organisation that hits this wall:
- The peering mesh is unmanageable. VPC peering is non-transitive: if VPC A peers with B, and B peers with C, A still cannot reach C. To connect n VPCs fully you need n(n-1)/2 peering connections — 45 for ten VPCs, 190 for twenty. Each one needs route-table edits on both sides, and the on-prem link must be re-attached to every new VPC by hand.
- There is no place to inspect traffic. Peering and plain VPN give you connectivity but no choke point. The security team cannot put an IDS/IPS or egress filter anywhere, because every VPC talks to every other VPC directly and reaches the internet through its own NAT gateway.
- Egress is sprawled and unpriced. Every VPC runs its own NAT gateways. NAT gateway hourly charges plus per-GB processing, multiplied across dozens of VPCs and Availability Zones, become a five- or six-figure annual line item that nobody can attribute or cap. There is no single egress IP allow-list to hand a SaaS vendor.
- The data-center link is fragile and slow. A single IPsec VPN tunnel over the public internet caps at roughly 1.25 Gbps per tunnel, with internet-variable latency and jitter — unacceptable for chatty database replication, file shares, or a phased data-center exit.
- Audit and acquisitions are painful. When a new business unit or an acquired company arrives (frequently with a `10.0.0.0/16` that already collides with three existing VPCs), there is no clean way to onboard them or to prove to auditors how traffic flows.
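To make the mesh arithmetic above concrete, here is a minimal sketch that compares the n(n-1)/2 peering count with the one-attachment-per-VPC count of a hub. The function names are illustrative and not part of any AWS tooling.

```python
def full_mesh_peerings(n: int) -> int:
    """Peering connections needed to connect n VPCs pairwise: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def hub_attachments(n: int) -> int:
    """With a central hub, each VPC needs exactly one attachment."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n


for n in (10, 20, 43):
    print(f"{n} VPCs: {full_mesh_peerings(n)} peerings "
          f"vs {hub_attachments(n)} hub attachments")
```

The gap widens quadratically, which is why a mesh that felt fine at five VPCs becomes hard to manage at forty.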
The mandate that lands on the network team is therefore not “build a VPC.” It is: give us one connectivity backbone that any team can plug into in minutes, that inspects and logs traffic centrally, that has a resilient high-bandwidth path to the data center, that bills egress to one place, and that an auditor can understand in a single diagram. That backbone is what we build here. It scales down to a two-account startup and up to a multi-Region, several-hundred-VPC enterprise without changing shape.
Architecture overview
The architecture is a hub-and-spoke network with AWS Transit Gateway at the center, deployed inside a dedicated Network account in AWS Organizations. Spoke VPCs (owned by application teams in their own accounts) attach to the Transit Gateway; the Transit Gateway also terminates the hybrid links to on-premises and routes selected traffic through inspection and egress VPCs. Transit Gateway route tables — not a single flat routing domain — decide who can talk to whom, which gives you segmentation by design.
Follow four representative traffic flows end to end.
1. Application VPC to the data center (hybrid, the primary use case). A workload in the prod-payments VPC needs to reach an on-prem Oracle database. Its subnet route table sends the data-center CIDR to a Transit Gateway attachment. The TGW consults the route table associated with that attachment, sees the on-prem prefixes advertised over BGP, and forwards the packet out the Direct Connect path: across a Transit VIF, through a Direct Connect Gateway, to the customer router in a colocation facility, and into the data center. The return path is symmetric. If Direct Connect is down, the same prefixes are still reachable over the Site-to-Site VPN attachment, which BGP keeps as a less-preferred backup — failover is automatic and needs no human.
2. East-west between two application VPCs, inspected. prod-web needs to call an internal API in prod-payments. Instead of letting spokes talk directly, the spoke route tables point all inter-VPC traffic (the RFC 1918 supernet) at the TGW, and the TGW’s spoke route table sends that traffic to an inspection VPC running AWS Network Firewall across all AZs. The firewall applies stateful rules and Suricata-compatible IDS/IPS signatures, then hands the packet back to the TGW, which forwards it to the destination spoke. This is the classic “centralized inspection” pattern: every VPC-to-VPC flow is forced through one firewall using appliance mode on the inspection attachment so that request and response always traverse the same firewall endpoint (essential for stateful inspection).
3. Application VPC to the internet (centralized egress). A workload in any spoke needs to pull a patch from the internet. Its route table sends 0.0.0.0/0 to the TGW. The TGW’s spoke route table sends default-route traffic to a dedicated egress VPC, where the packet first hits Network Firewall (for domain/FQDN allow-listing and TLS-SNI filtering), then a NAT gateway, then the internet gateway. All spokes share this one egress path, so there is one set of egress IPs, one place to filter outbound destinations, and one NAT bill instead of forty-three.
4. Application VPC to AWS services, privately. When a workload calls Amazon S3, DynamoDB, or Secrets Manager, you do not want that traffic on the internet at all. Gateway VPC endpoints (for S3 and DynamoDB) and interface VPC endpoints (AWS PrivateLink, for most other services) keep AWS-API traffic on the AWS backbone, off the egress path entirely. Interface endpoints can optionally be centralized in a shared-services VPC with Route 53 Resolver; gateway endpoints are route-table entries and have to be created in each VPC that uses them.
Three cross-cutting layers wrap all of this. AWS Resource Access Manager (RAM) shares the Transit Gateway from the Network account to every other account, so application teams create attachments without the network team handing out anything by hand. Route 53 Resolver endpoints and rules provide bidirectional DNS resolution between cloud and on-premises. And flow logs plus firewall logs stream to a central S3 bucket and CloudWatch for the security and audit story. The mental picture: a star, with the TGW as the hub, application VPCs and hybrid links as spokes, and the inspection and egress VPCs as special spokes that every flow is steered through by routing rather than by trust.
Component breakdown
| Component | Role in the architecture | Why it is here | Key configuration choices |
|---|---|---|---|
| Transit Gateway (TGW) | Central regional router connecting all VPCs and hybrid links | Replaces the O(n²) peering mesh with O(n) attachments and transitive routing | Multiple TGW route tables for segmentation; disable default route-table association/propagation; enable appliance mode on the inspection VPC attachment |
| TGW route tables | Per-attachment routing policy (who can reach whom) | Turns one TGW into many isolated routing domains — prod, non-prod, shared, inspection | Separate tables for spokes, on-prem, and inspection; spokes default-route to inspection/egress, never to each other directly |
| Direct Connect + DX Gateway | Private, high-bandwidth link to on-premises | Predictable latency/bandwidth (1–100 Gbps) vs. internet VPN; backbone for DC-exit and replication | Two DX connections in two locations for resilience; Transit VIF to a DX Gateway associated with the TGW; BGP with AS-path prepending for path preference |
| Site-to-Site VPN | Encrypted failover (or low-cost primary) path to on-prem | DX has no inherent encryption and takes weeks to provision; VPN is the day-one and the backup path | Attach to TGW; BGP dynamic routing; advertise same prefixes as DX but less-preferred; enable acceleration if branches are global |
| Network Firewall | Stateful L3–L7 inspection for east-west and egress | One auditable choke point for IDS/IPS, FQDN filtering, and outbound control | Firewall endpoint per AZ; Suricata-compatible rule groups; stateful default-deny on egress; managed threat-signature feeds |
| Inspection VPC | Hosts the firewall endpoints between TGW and traffic | Keeps inspection off the application teams’ plate and centrally owned | Dedicated /24 per AZ for firewall + TGW subnets; routes craft the TGW → firewall → TGW hairpin |
| Egress VPC | Single shared internet-egress point | One NAT bill, one egress IP set, one outbound filter for all VPCs | NAT gateway + IGW per AZ; firewall in front of NAT for FQDN allow-listing; static EIPs for vendor allow-lists |
| VPC endpoints (gateway and interface) | Private reach to AWS service APIs | Keeps S3/DDB/Secrets/etc. traffic off the internet and off the egress path | Gateway endpoints for S3/DynamoDB (free) in each VPC that needs them; centralized interface endpoints (PrivateLink) + Route 53 for the rest |
| Resource Access Manager (RAM) | Shares the TGW across accounts | Self-service attachments without manual hand-offs; enforces the Network-account ownership boundary | Share TGW to the Org or specific OUs; enable auto-accept of shared attachments on the TGW so in-Org attachments need no manual approval |
| Route 53 Resolver | Hybrid DNS resolution both directions | Cloud must resolve on-prem names and vice versa | Inbound + outbound resolver endpoints; forwarding rules for on-prem zones; shared via RAM |
A few choices deserve emphasis. Appliance mode on the inspection attachment is non-negotiable for stateful firewalls: without it, the TGW may hash the forward and return packets of one flow to firewall endpoints in different AZs, breaking connection state and silently dropping traffic. Disabling the TGW’s default association and propagation is what converts a flat router into a segmented one — every attachment is then explicitly placed in a route table, so “deny by default between business units” becomes the natural posture rather than an afterthought. And keeping the VPN prefixes identical but less-preferred versus Direct Connect (via AS-path prepending on the customer side and BGP local-preference) is what makes sub-minute failover happen with zero operator action.
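The appliance-mode point is easier to see with a toy model. The sketch below is not AWS's actual algorithm: it assumes that without appliance mode a packet stays in the AZ it entered the TGW from, and that with appliance mode one AZ is chosen per flow from a direction-independent hash. The AZ names and the `Packet` type are hypothetical.

```python
import hashlib
from typing import NamedTuple

AZS = ["az-a", "az-b", "az-c"]  # hypothetical AZ names


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int
    ingress_az: str  # AZ of the attachment the packet entered the TGW from


def reply_to(p: Packet, ingress_az: str) -> Packet:
    """Build the return packet of a flow, entering from the responder's AZ."""
    return Packet(p.dst_ip, p.dst_port, p.src_ip, p.src_port,
                  p.protocol, ingress_az)


def firewall_az_default(p: Packet) -> str:
    """Toy model without appliance mode: traffic stays in its ingress AZ."""
    return p.ingress_az


def firewall_az_appliance_mode(p: Packet) -> str:
    """Toy model with appliance mode: one AZ per flow, same in both directions."""
    # Sorting the two endpoints makes the key identical for request and reply.
    endpoints = sorted([(p.src_ip, p.src_port), (p.dst_ip, p.dst_port)])
    key = f"{endpoints[0]}|{endpoints[1]}|{p.protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return AZS[int.from_bytes(digest[:4], "big") % len(AZS)]


request = Packet("10.1.0.10", 40000, "10.2.0.20", 443, 6, "az-a")
reply = reply_to(request, ingress_az="az-b")

print("default:  ", firewall_az_default(request), firewall_az_default(reply))
print("appliance:", firewall_az_appliance_mode(request),
      firewall_az_appliance_mode(reply))
```

In the default model the request and reply land on different firewall endpoints whenever client and server sit in different AZs, which is exactly the case a stateful firewall cannot handle.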
Implementation guidance
Account and ownership model. Stand this up under AWS Organizations with a dedicated Network account inside an Infrastructure OU. The Network account owns the Transit Gateway, Direct Connect, the inspection and egress VPCs, and the central DNS resolver. Application teams own only their spoke VPCs in their own accounts. This is the AWS multi-account (“landing zone”) pattern, and it cleanly separates the blast radius and the bill of the network platform from the workloads riding on it. Enforce it with Service Control Policies that deny application accounts the ability to create internet gateways or their own NAT gateways — that is what forces traffic onto the central egress path rather than merely encouraging it.
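As a rough illustration of the guardrail described above, the following sketch builds an SCP document as a Python dictionary and prints it as JSON. The `Sid` and the function name are made up for this example, and the action list is an assumption about what "egress bypass" should cover; a real policy would be attached to the workload OUs, not to the OU that holds the Network account.

```python
import json


def deny_egress_bypass_scp() -> dict:
    """Return an SCP that blocks local internet-egress resources.

    Assumption: attach this to application OUs only, so the Network
    account can still create the central IGW and NAT gateways.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLocalInternetEgress",  # hypothetical name
                "Effect": "Deny",
                "Action": [
                    "ec2:CreateInternetGateway",
                    "ec2:AttachInternetGateway",
                    "ec2:CreateEgressOnlyInternetGateway",
                    "ec2:CreateNatGateway",
                ],
                "Resource": "*",
            }
        ],
    }


print(json.dumps(deny_egress_bypass_scp(), indent=2))
```

Keeping the policy in code means it can be reviewed and versioned alongside the rest of the network module.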
Provisioning order (the dependency chain matters):
- Create the TGW in the Network account; set `default_route_table_association = disable` and `default_route_table_propagation = disable`.
- Build the inspection VPC and egress VPC; deploy Network Firewall with one endpoint per AZ in each.
- Create TGW route tables: `spokes`, `onprem`, `inspection`, `egress`. Associate and propagate per the steering rules below.
- Share the TGW via RAM to the Organization.
- Application teams create spoke VPCs and TGW attachments (accepted automatically when auto-accept for shared attachments is enabled on the TGW); their attachments associate to the `spokes` route table.
- Provision Direct Connect (lead time is weeks, so start early), a DX Gateway, a Transit VIF, and a Site-to-Site VPN as the same-day bring-up and permanent backup.
- Wire Route 53 Resolver inbound/outbound endpoints and forwarding rules; share via RAM.
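One way to keep the dependency chain honest is to write it down as a graph and let a topological sort produce a valid order. The sketch below uses Python's standard `graphlib`; the step names and the exact edges are an interpretation of the list above, not something AWS or Terraform prescribes.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (assumed edges).
DEPENDS_ON = {
    "tgw": set(),
    "inspection_and_egress_vpcs": {"tgw"},
    "tgw_route_tables": {"tgw", "inspection_and_egress_vpcs"},
    "ram_share": {"tgw"},
    "spoke_attachments": {"ram_share", "tgw_route_tables"},
    "direct_connect_and_vpn": {"tgw_route_tables"},
    "route53_resolver": {"tgw"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

for position, step in enumerate(order, start=1):
    print(position, step)
```

Several orders are valid; what matters is that no step appears before something it depends on, and that a cycle would raise an error instead of silently producing a broken plan.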
The routing logic, stated plainly. Spoke subnet route tables send 0.0.0.0/0 to the TGW (for egress) and the on-prem/inter-VPC supernets to the TGW as well. On the TGW side: the spokes route table default-routes (0.0.0.0/0) to the egress VPC attachment and routes the RFC 1918 supernet to the inspection VPC attachment; the inspection and egress route tables propagate the spoke and on-prem routes back so return traffic finds its way home. On-prem prefixes arrive by BGP propagation from the DX and VPN attachments into the onprem table, which is associated with those attachments and shared into the spoke/inspection tables as policy dictates.
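The steering described above is ordinary longest-prefix matching. The following sketch models the `spokes` route table as a dictionary and resolves a destination to an attachment. The attachment names and the on-premises prefix `10.200.0.0/16` are hypothetical placeholders chosen for the example.

```python
import ipaddress
from typing import Optional

# Toy model of the TGW "spokes" route table: CIDR -> target attachment.
SPOKES_ROUTE_TABLE = {
    "0.0.0.0/0": "egress-vpc-attachment",
    "10.0.0.0/8": "inspection-vpc-attachment",
    "172.16.0.0/12": "inspection-vpc-attachment",
    "192.168.0.0/16": "inspection-vpc-attachment",
    "10.200.0.0/16": "dx-gateway-attachment",  # hypothetical on-prem prefix
}


def lookup(route_table: dict, destination: str) -> Optional[str]:
    """Return the target of the most specific route covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # Longest prefix wins, as in any IP router.
    return max(matches)[1]


for dest in ("8.8.8.8", "10.1.2.3", "10.200.5.5"):
    print(dest, "->", lookup(SPOKES_ROUTE_TABLE, dest))
```

Because the more specific on-premises prefix beats the RFC 1918 supernet, hybrid traffic takes the Direct Connect path while all other private traffic is hairpinned through inspection.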
Infrastructure as Code. Terraform is the natural fit here because the network spans many accounts and mature, widely used community modules exist for exactly this shape:
- Use the community `terraform-aws-modules/transit-gateway` module for the TGW, its route tables, and VPC attachments; it exposes the association/propagation toggles directly.
- Use `terraform-aws-modules/vpc` for spoke, inspection, and egress VPCs.
- Manage Network Firewall with the `aws_networkfirewall_firewall`, `aws_networkfirewall_firewall_policy`, and `aws_networkfirewall_rule_group` resources; keep Suricata rules in version-controlled `.rules` files so security changes go through pull requests.
- Drive multi-account deployment with Terraform workspaces or stacks per account and assume-role providers, or wrap it in an AWS-native pipeline. For teams standing up the whole landing zone, AWS Control Tower plus Account Factory for Terraform (AFT) bootstraps the Organization, OUs, and guardrails, and you layer this network module on top.
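- Share resources with `aws_ram_resource_share` and `aws_ram_principal_association` targeting the Organization ARN, so member accounts receive the share without an invitation (this requires RAM sharing with AWS Organizations to be enabled); set `auto_accept_shared_attachments` on the TGW so their attachments auto-accept.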
Identity and access wiring. Application teams need only ec2:CreateTransitGatewayVpcAttachment against the shared TGW — they never touch the TGW itself, its route tables, the firewall, or egress. Centralize human access through IAM Identity Center with permission sets scoped per OU. For the data-center side, the BGP session and customer-router config live with the network team; document the on-prem ASN, the advertised prefixes, and the BCP 38 anti-spoofing expectations in the runbook so a failover is boring.
Enterprise considerations
Security and Zero Trust. The architecture operationalises Zero Trust at the network layer through segmentation by routing rather than trust by adjacency. Because the TGW’s default association/propagation is off, two spokes cannot reach each other unless a route table explicitly allows it — and even then the traffic is forced through Network Firewall in appliance mode, where stateful rules and IDS/IPS signatures inspect every east-west flow. Egress is default-deny: outbound traffic is dropped unless the destination FQDN is on the allow-list, which neutralises a large class of data-exfiltration and C2 paths. Pair this with security groups and NACLs inside each VPC (micro-segmentation), VPC endpoints to keep AWS-API calls off the internet entirely, and encryption in transit (TLS for apps, IPsec on the VPN, optional MACsec on Direct Connect). The result is defence in depth: identity, network segmentation, inspection, and egress control reinforcing one another.
Cost optimization. The headline saving is centralized egress: collapsing dozens of per-VPC NAT gateways into one shared egress VPC removes a stack of hourly NAT charges and consolidates per-GB processing, often the single biggest line item in a sprawling network. But be honest about the trade-off the TGW introduces: you now pay a per-attachment hourly charge and a per-GB TGW data-processing charge, and inspected/egress traffic can cross the TGW more than once (spoke → TGW → inspection → TGW → destination), so each GB is processed multiple times. The levers: keep latency-sensitive, high-volume same-Region VPC-to-VPC flows on direct VPC peering when they need no inspection (peering adds no per-GB processing charge, only standard data-transfer rates when traffic crosses Availability Zones); use Gateway VPC endpoints for S3/DynamoDB (free, and they keep that traffic off both NAT and TGW); and reserve Direct Connect for the bandwidth that justifies it while right-sizing the port speed. Tag every attachment to attribute TGW and egress cost back to the owning team.
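A back-of-the-envelope model helps when weighing that trade-off. The sketch below compares per-VPC NAT against a shared egress VPC behind the TGW. All rates are placeholder values chosen for illustration, not quoted AWS prices, and the model ignores Network Firewall charges, data-transfer-out, and return-direction TGW processing unless `tgw_traversals` is raised to account for them.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common billing approximation


@dataclass(frozen=True)
class Rates:
    """Unit prices in dollars. Supply current values for your Region."""
    nat_hourly: float
    nat_per_gb: float
    tgw_attachment_hourly: float
    tgw_per_gb: float


def decentralized_monthly(rates: Rates, nat_gateways: int,
                          egress_gb: float) -> float:
    """Every VPC runs its own NAT gateways; no TGW on the egress path."""
    return (nat_gateways * rates.nat_hourly * HOURS_PER_MONTH
            + egress_gb * rates.nat_per_gb)


def centralized_monthly(rates: Rates, nat_gateways: int, attachments: int,
                        egress_gb: float, tgw_traversals: int = 1) -> float:
    """Shared egress VPC: fewer NAT gateways, plus TGW hourly and per-GB cost."""
    nat = (nat_gateways * rates.nat_hourly * HOURS_PER_MONTH
           + egress_gb * rates.nat_per_gb)
    tgw = (attachments * rates.tgw_attachment_hourly * HOURS_PER_MONTH
           + egress_gb * rates.tgw_per_gb * tgw_traversals)
    return nat + tgw


# Placeholder rates and volumes, for illustration only.
rates = Rates(nat_hourly=0.05, nat_per_gb=0.05,
              tgw_attachment_hourly=0.05, tgw_per_gb=0.02)
before = decentralized_monthly(rates, nat_gateways=86, egress_gb=10_000)
after = centralized_monthly(rates, nat_gateways=3, attachments=44,
                            egress_gb=10_000)
print(f"decentralized: ${before:,.2f}/month")
print(f"centralized:   ${after:,.2f}/month")
```

Raising `egress_gb` or `tgw_traversals` shows where the per-GB TGW charge starts to erode the NAT saving, which is the point at which the peering and gateway-endpoint levers become worth using.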
Scalability. A single TGW supports thousands of attachments and thousands of routes, with up to roughly 100 Gbps per VPC attachment per Availability Zone; these are service quotas that change over time, so confirm the current values before sizing. Hybrid throughput scales separately: a VPN attachment can exceed the per-tunnel limit by using ECMP across multiple tunnels, and Direct Connect scales by port speed and by adding connections. For multi-Region, deploy a TGW per Region and connect them with inter-Region TGW peering, advertising a clean CIDR plan so routes summarise. The hub-and-spoke shape means onboarding the fiftieth VPC is the same one-attachment operation as the fifth: linear effort, not quadratic.
Reliability and DR (RTO/RPO). Every layer is multi-AZ: firewall endpoints, NAT gateways, and TGW attachment subnets exist in every AZ in the Region, and the TGW itself is a managed, AZ-redundant service. The hybrid path is the key resilience story: two Direct Connect connections in two separate DX locations survive a facility failure, and the Site-to-Site VPN is a permanently attached, less-preferred BGP backup, so a total DX loss fails over to the internet path with no operator action. How fast depends on failure detection: with BFD or tuned BGP timers on the Direct Connect sessions the effective network RTO is measured in seconds, whereas default hold timers can stretch it past a minute. For Regional DR, the same module deployed in a second Region with inter-Region TGW peering gives a warm network fabric; workload RPO is then governed by the data-replication strategy (database replication, S3 Cross-Region Replication) riding over this backbone, not by the network itself. Model the failure cases explicitly in a runbook: single AZ loss, single DX loss, full DX-location loss, and Region loss.
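The failover behaviour can be reasoned about with a very small model: among the paths that are currently up, the most preferred one carries traffic. The sketch below is a simplification that collapses BGP attributes and the TGW's own route-type ordering into a single `preference` number; the path names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Path:
    name: str
    preference: int  # lower value = more preferred (simplified model)
    up: bool = True


def best_path(paths: list) -> Optional[Path]:
    """Return the most preferred path that is still up, or None."""
    candidates = [p for p in paths if p.up]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.preference)


dx_a = Path("dx-location-a", preference=10)
dx_b = Path("dx-location-b", preference=10)
vpn = Path("site-to-site-vpn", preference=100)
paths = [dx_a, dx_b, vpn]

print("normal:          ", best_path(paths).name)
dx_a.up = False
print("one DX lost:     ", best_path(paths).name)
dx_b.up = False
print("both DX lost:    ", best_path(paths).name)
```

Walking the runbook's failure cases through a model like this is a cheap way to confirm that every case still leaves a usable path before testing it on real links.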
Observability. Turn on VPC Flow Logs on every spoke, inspection, and egress VPC; TGW Flow Logs for hub-level visibility; and Network Firewall alert/flow logs for the inspected-traffic record. Stream all of it to a central S3 bucket (for Athena queries and long retention) and CloudWatch (for live dashboards and alarms). Use Reachability Analyzer and Network Access Analyzer to prove — before an auditor asks — that a given path is open or closed, and CloudWatch Network Monitor or DX/VPN CloudWatch metrics to alarm on BGP session drops and link health.
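For quick triage outside Athena, flow-log records can be parsed with a few lines of code. This sketch assumes the default (version 2) VPC Flow Logs format with its fourteen space-separated fields; the sample records, account ID, and interface ID are invented for the example.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_record(line: str) -> dict:
    """Split one default-format flow-log line into a field dictionary."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_bytes_by_source(lines) -> Counter:
    """Sum bytes of REJECT records per source address."""
    totals = Counter()
    for line in lines:
        record = parse_record(line)
        if record["action"] == "REJECT":
            totals[record["srcaddr"]] += int(record["bytes"])
    return totals


# Made-up sample records for illustration.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.1.0.10 10.2.0.20 40000 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.1.0.11 203.0.113.9 40001 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.1.0.11 203.0.113.9 40002 22 6 2 120 1700000000 1700000060 REJECT OK",
]

print(rejected_bytes_by_source(SAMPLE))
```

A custom log format changes the field order, so `FIELDS` would need to match whatever format the flow log was created with.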
Governance. Ownership is the policy: the Network account owns the fabric, SCPs prevent application accounts from creating their own IGWs/NAT (closing the bypass), and RAM enforces that attachments are the only self-service action. Codify CIDR allocation centrally (an IPAM such as AWS VPC IP Address Manager prevents the overlapping-10.0.0.0/16 problem before it happens), require firewall-rule changes to go through pull requests, and run AWS Config rules to detect drift such as a rogue NAT gateway or an un-inspected route.
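The overlapping-CIDR check that an IPAM enforces can be approximated in a pipeline with the standard `ipaddress` module. The following is a minimal sketch; the VPC names and allocations are hypothetical, and a managed IPAM does considerably more than this (pools, delegation, history).

```python
import ipaddress
from itertools import combinations


def find_overlaps(allocations: dict) -> list:
    """Return pairs of names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr)
                for name, cidr in allocations.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical allocations: the acquired company reuses 10.0.0.0/16.
allocations = {
    "prod-web": "10.0.0.0/16",
    "prod-payments": "10.1.0.0/16",
    "acquired-co": "10.0.0.0/16",
}

for a, b in find_overlaps(allocations):
    print(f"overlap: {a} and {b}")
```

Running a check like this against proposed CIDRs in a pull request catches collisions before a VPC is ever created.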
Reference enterprise example
Meridian Components is a fictional but representative automotive-parts manufacturer: 4,200 employees, two data centers (one in Virginia for ERP/MES, one in Texas as DR), eleven plants, and a three-year-old AWS footprint that had grown to 38 VPCs across 14 accounts, connected by 61 peering connections and 9 ad-hoc VPN tunnels. The breaking point was an audit finding: the security team could not demonstrate that traffic between the plant-floor IoT ingestion VPCs and the ERP-integration VPCs was inspected — because it wasn’t — and finance flagged $214,000/year in NAT gateway charges spread across 31 VPCs that nobody could attribute.
Meridian rebuilt on this reference architecture over a quarter:
- Network account + TGW in `us-east-1`, with route tables `spokes`, `onprem`, `inspection`, `egress`; default association/propagation disabled.
- 2 × 10 Gbps Direct Connect connections in two Ashburn DX locations, via a DX Gateway and Transit VIF, replacing the 9 VPN tunnels for the Virginia data center. The Texas DR data center kept an accelerated Site-to-Site VPN, and a VPN to Virginia remained as the DX backup.
- Network Firewall in an inspection VPC (appliance mode), enforcing IDS/IPS on all east-west traffic — including the plant-IoT-to-ERP path the auditors had flagged.
- One egress VPC with FQDN allow-listing; SCPs blocked the application accounts from creating NAT gateways or IGWs. The NAT gateways spread across 31 VPCs collapsed to 3 (one per AZ).
- RAM shared the TGW org-wide; teams migrated their VPCs by swapping peering routes for a TGW attachment — about a day of work per VPC, mostly testing.
The numbers after one quarter: NAT/egress spend fell from $214K to roughly $61K/year (≈71% reduction), even after accounting for the new TGW attachment and data-processing charges. The 61 peering connections dropped to zero (everything routes through the hub). A simulated full-DX-location failure failed over to VPN in under 40 seconds with no packet loss visible to the ERP integration. And the audit finding closed: the security team now ships a single architecture diagram plus Reachability Analyzer evidence showing every inter-VPC and egress path traversing the firewall. Onboarding the next plant’s VPC went from “open a peering ticket and edit nine route tables” to a 20-line Terraform attachment merged the same morning.
When to use it
Use this architecture when you have — or can see coming within a year — more than a handful of VPCs, a real on-premises footprint that needs predictable bandwidth, a mandate to inspect and log traffic centrally, or multiple accounts/business units that must be segmented and governed. It is the right foundation for regulated industries (finance, healthcare, manufacturing with OT/IT convergence), for phased data-center exits, and for any enterprise where “who can reach whom” must be auditable.
Be aware of the trade-offs. The TGW adds a per-attachment and per-GB cost, and routing inspected traffic through the hub multiplies data-processing charges — for a tiny two-VPC environment with no inspection or hybrid needs, that overhead is not yet justified. Centralized inspection adds a hop of latency and makes the inspection VPC a critical path (mitigated by per-AZ firewall endpoints and appliance mode, but real). And the architecture demands organisational discipline: it only works if SCPs actually prevent egress bypass and if the Network account is treated as a shared platform with proper change control.
Anti-patterns to avoid. Do not skip appliance mode on the inspection attachment — asymmetric routing will silently break stateful inspection and you will chase phantom drops for days. Do not leave default TGW route-table association/propagation on; that quietly recreates a flat any-to-any network and erases your segmentation. Do not let application teams keep their own NAT gateways “just for now” — the central egress savings and the egress-control story both evaporate the moment one bypass exists. And do not run a single Direct Connect connection and call it redundant; one DX plus one VPN is the minimum, two DX in two locations plus VPN is the standard.
Alternatives and when they fit. For genuinely small or transient environments, VPC peering plus a single Site-to-Site VPN is simpler and cheaper — graduate to this architecture when the mesh or the egress bill starts to hurt. AWS Cloud WAN is the natural next step for large multi-Region estates: it adds a policy-driven global network with centralized segmentation across Regions on top of the same TGW primitives, and is worth adopting when you outgrow managing per-Region TGWs and inter-Region peering by hand. For the firewall layer specifically, AWS Network Firewall is the managed default, but a Gateway Load Balancer fronting third-party NGFW appliances (Palo Alto, Fortinet, Check Point) slots into the exact same inspection-VPC position when you need a specific vendor’s feature set or to match on-prem tooling. The hub-and-spoke skeleton stays the same; you are only swapping the inspection engine.