Resilient Hybrid Connectivity with HA VPN, Cloud Router, and BGP on GCP

Every serious hybrid estate eventually outgrows a single VPN tunnel. A flapping link or a maintenance window on one side should never partition your network, and you want both paths carrying traffic rather than one sitting idle. On GCP the answer is HA VPN plus Cloud Router running BGP: redundant tunnels across two interfaces, dynamic route exchange, and explicit control over which path wins. This walkthrough builds that seam end to end, then steers failover deterministically and stitches multiple sites together with Network Connectivity Center.

Step 1: Understand HA VPN topology and the 99.99% SLA

HA VPN is not “Classic VPN with a second tunnel.” It is a distinct gateway resource with two interfaces, interface 0 and interface 1, each given its own external IP from a separate Google edge domain. That separation is what backs Google’s 99.99% availability SLA. The SLA is conditional, and the conditions are where most designs go wrong:

You must configure two tunnels, one from each gateway interface, to a peer that is itself redundant.
The peer side must be either a second HA VPN gateway, two physical on-prem devices, or one device with two separate external IPs (two WAN uplinks).
Both tunnels must run BGP through Cloud Router (HA VPN supports only dynamic routing). A single tunnel, or falling back to Classic VPN for static routing, drops you to a 99.9% SLA at best.

# Create the HA VPN gateway (two interfaces are allocated automatically)
gcloud compute vpn-gateways create ha-vpn-gw-use4 \
  --network=prod-vpc \
  --region=us-east4

# Inspect the two interfaces and their auto-assigned public IPs
gcloud compute vpn-gateways describe ha-vpn-gw-use4 \
  --region=us-east4 \
  --format="table(vpnInterfaces[].id, vpnInterfaces[].ipAddress)"

Rule of thumb: the SLA is a property of the whole path, not the gateway. Two tunnels to a single non-redundant firewall is still a single point of failure, SLA or not.

For GCP-to-GCP, you create an HA VPN gateway in each VPC and point each side's tunnels at the other gateway (--peer-gcp-gateway on the tunnel). For GCP-to-on-prem, you model the far side with an external VPN gateway resource describing your physical device IPs.

# Model an on-prem peer with two WAN IPs as a 2-interface external gateway
gcloud compute external-vpn-gateways create onprem-dc1 \
  --interfaces=0=203.0.113.10,1=198.51.100.10

Step 2: Cloud Router fundamentals - BGP sessions, ASNs, dynamic routing mode

Cloud Router is the BGP speaker. It learns routes from your peer and advertises GCP subnets back. Two decisions matter before you create it.

ASN selection. Cloud Router needs a private ASN (64512-65534, or 4200000000-4294967294 in the 32-bit range). Your peer needs a different ASN; eBGP between distinct ASNs is the normal mode. Reusing the same ASN on both ends forces iBGP semantics you do not want over VPN.
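
A quick pre-flight check on the ASN plan can catch a bad pick before any router exists. The sketch below is plain bash arithmetic against the private-use ranges; is_private_asn is a hypothetical helper name, not a gcloud feature, and the peer ASN only needs to be private if that is your own policy.

```shell
#!/usr/bin/env bash
# Classify an ASN against the private-use ranges.
# is_private_asn is a hypothetical helper, not part of gcloud.
is_private_asn() {
  local asn="$1"
  # Reject anything that is not a plain decimal number.
  [[ "$asn" =~ ^[0-9]+$ ]] || { echo "invalid"; return; }
  asn=$((10#$asn))   # force base 10 so leading zeros are harmless
  if (( asn >= 64512 && asn <= 65534 )); then
    echo "private-16bit"
  elif (( asn >= 4200000000 && asn <= 4294967294 )); then
    echo "private-32bit"
  else
    echo "not-private"
  fi
}

# The two ASNs used in this walkthrough.
cloud_router_asn=65001
peer_asn=64600
echo "Cloud Router $cloud_router_asn: $(is_private_asn "$cloud_router_asn")"
echo "Peer $peer_asn: $(is_private_asn "$peer_asn")"
if [[ "$cloud_router_asn" != "$peer_asn" ]]; then
  echo "ASNs differ: eBGP"
fi
```

Both values used in this walkthrough land in the 16-bit private range and differ from each other, which is the combination the rest of the steps assume.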

Dynamic routing mode. This is set on the VPC, not the router, and it is the single most consequential networking flag in a hybrid design:

Mode	What Cloud Router advertises	What it learns	Use when
`regional` (default)	Only subnets in the router’s own region	Peer routes applied to the router’s region	Single-region footprint, or you want regional isolation
`global`	All subnets in the VPC, across every region	Peer routes propagated to all regions	Multi-region VPC reachable over one hybrid edge

# Set the VPC to global dynamic routing so on-prem reaches every region
gcloud compute networks update prod-vpc --bgp-routing-mode=global

# Create the Cloud Router with a private ASN
gcloud compute routers create cr-use4 \
  --network=prod-vpc \
  --region=us-east4 \
  --asn=65001 \
  --advertisement-mode=default

--advertisement-mode=default means the router auto-advertises connected subnets. We switch to custom in Step 4 once we need to control exactly what leaves.

Step 3: Active/active vs active/passive design and ECMP

With two tunnels up and both running BGP, Cloud Router programs ECMP (Equal-Cost Multi-Path) across them by default. Traffic is hashed across both tunnels on a 5-tuple basis. This is active/active, and it is what you want for throughput: two tunnels means roughly 2x aggregate bandwidth (each HA VPN tunnel caps near 3 Gbps, and per-flow throughput is bounded well below that, so multiple flows are what fill the pipe).
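
To see why a single flow never exceeds one tunnel while many flows fill both, here is a toy model of 5-tuple hashing. It is only an illustration: the real hash function is internal to Google, cksum stands in for it, and pick_tunnel plus the addresses are made up for the example.

```shell
#!/usr/bin/env bash
# Illustrative only: map a flow's 5-tuple to one of N tunnels.
# cksum (a CRC) is a stand-in for whatever hash the platform really uses.
pick_tunnel() {
  local src_ip="$1" src_port="$2" dst_ip="$3" dst_port="$4" proto="$5" tunnels="$6"
  local crc
  # cksum prints "<crc> <byte-count>"; keep only the CRC.
  crc=$(printf '%s' "$src_ip|$src_port|$dst_ip|$dst_port|$proto" | cksum | cut -d' ' -f1)
  echo $(( crc % tunnels ))
}

# Every packet of one flow lands on the same tunnel...
pick_tunnel 10.20.1.5 40000 192.168.10.8 443 tcp 2
pick_tunnel 10.20.1.5 40000 192.168.10.8 443 tcp 2
# ...while different source ports (different flows) may spread across tunnels.
for port in 40001 40002 40003 40004; do
  echo "port $port -> tunnel $(pick_tunnel 10.20.1.5 "$port" 192.168.10.8 443 tcp 2)"
done
```

The takeaway is the shape, not the numbers: the mapping is deterministic per flow, so parallel streams are what turn two tunnels into twice the bandwidth.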

The decision tree:

Active/active - both tunnels advertise routes with equal BGP attributes. Cloud Router installs both as ECMP next-hops. Maximum throughput, instant failover, but return-path symmetry is not guaranteed (acceptable for most stateless routing; a problem for stateful middleboxes on-prem).
Active/passive - one tunnel is made less preferred via BGP attributes (a higher advertised MED value, or a longer AS-path). All traffic uses the primary; the standby carries traffic only when the primary withdraws its routes. Predictable, symmetric paths, at the cost of half your bandwidth idle.

First, create the tunnels. Each tunnel binds to a specific gateway interface and a specific peer interface:

# Tunnel 0: GCP interface 0 -> peer interface 0
gcloud compute vpn-tunnels create tun-use4-if0 \
  --peer-external-gateway=onprem-dc1 \
  --peer-external-gateway-interface=0 \
  --region=us-east4 \
  --ike-version=2 \
  --shared-secret="${PSK_0}" \
  --router=cr-use4 \
  --vpn-gateway=ha-vpn-gw-use4 \
  --interface=0

# Tunnel 1: GCP interface 1 -> peer interface 1
gcloud compute vpn-tunnels create tun-use4-if1 \
  --peer-external-gateway=onprem-dc1 \
  --peer-external-gateway-interface=1 \
  --region=us-east4 \
  --ike-version=2 \
  --shared-secret="${PSK_1}" \
  --router=cr-use4 \
  --vpn-gateway=ha-vpn-gw-use4 \
  --interface=1

Now attach a BGP interface and peer to each tunnel. Each tunnel gets its own /30 carved from the link-local range 169.254.0.0/16, with one address for Cloud Router and one for the peer:

# BGP session on tunnel 0
gcloud compute routers add-interface cr-use4 \
  --interface-name=if-tun0 \
  --vpn-tunnel=tun-use4-if0 \
  --ip-address=169.254.0.1 \
  --mask-length=30 \
  --region=us-east4

gcloud compute routers add-bgp-peer cr-use4 \
  --peer-name=bgp-tun0 \
  --interface=if-tun0 \
  --peer-ip-address=169.254.0.2 \
  --peer-asn=64600 \
  --region=us-east4

# BGP session on tunnel 1 (second /30)
gcloud compute routers add-interface cr-use4 \
  --interface-name=if-tun1 \
  --vpn-tunnel=tun-use4-if1 \
  --ip-address=169.254.1.1 \
  --mask-length=30 \
  --region=us-east4

gcloud compute routers add-bgp-peer cr-use4 \
  --peer-name=bgp-tun1 \
  --interface=if-tun1 \
  --peer-ip-address=169.254.1.2 \
  --peer-asn=64600 \
  --region=us-east4

At this point, with default attributes on both sessions, you have active/active ECMP.
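
The .1/.2 addressing used in the BGP sessions above follows mechanically from each /30's network address, which makes it easy to script when you have many tunnels. A minimal sketch, assuming you assign one /30 per tunnel yourself; bgp_pair is a hypothetical helper that only does address arithmetic.

```shell
#!/usr/bin/env bash
# Derive the two usable host addresses of a link-local /30.
# bgp_pair is a hypothetical helper; it does not talk to GCP.
bgp_pair() {
  local base="$1" a b c d
  IFS=. read -r a b c d <<< "$base"
  # Must be inside 169.254.0.0/16 and aligned on a /30 boundary.
  if [[ "$a" != 169 || "$b" != 254 ]] || (( d % 4 != 0 )); then
    echo "error: $base is not a link-local /30 network address" >&2
    return 1
  fi
  echo "cloud-router=$a.$b.$c.$((d + 1)) peer=$a.$b.$c.$((d + 2))"
}

bgp_pair 169.254.0.0   # tunnel 0
bgp_pair 169.254.1.0   # tunnel 1
```

Feeding it an unaligned or non-link-local address prints an error instead of a pair, which is the mismatch that otherwise shows up later as a BGP session stuck outside Established.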

Step 4: Custom route advertisements - summarization, specific ranges, priorities

By default Cloud Router advertises every subnet, which leaks your full IP plan to the peer and produces a noisy table. Switch to custom advertisements to send exactly what you intend - ideally a summarized supernet rather than dozens of /24s.

You can set advertisements at the router level (applies to every BGP peer) or per-peer (overrides the router for that session). Per-peer is what enables asymmetric, active/passive steering.

# Router-level: advertise all subnets plus an explicit summary range
gcloud compute routers update cr-use4 \
  --region=us-east4 \
  --advertisement-mode=custom \
  --set-advertisement-groups=all_subnets \
  --set-advertisement-ranges=10.20.0.0/16=GCP-prod-supernet

all_subnets is a convenience group; combining it with explicit ranges lets you advertise connected subnets plus a static summary (for example, a route to a downstream NCC spoke or a peered VPC the peer should also reach). If you want a clean summary only, drop the group and list ranges explicitly.

Summarize aggressively. On-prem route tables are finite and often shared with the rest of the enterprise WAN; advertising a /16 instead of sixty /24s is a courtesy that prevents real incidents.
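
Before switching to a summary-only advertisement, it is worth confirming that the summary actually covers every subnet you expect the peer to reach. The sketch below does that check with IPv4 bit arithmetic in bash. The subnet list is hypothetical; only 10.20.0.0/16 comes from the command above.

```shell
#!/usr/bin/env bash
# Check that a summary route covers each subnet (IPv4 only).
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# covers SUMMARY_CIDR SUBNET_CIDR -> prints yes or no
covers() {
  local s_net="${1%/*}" s_len="${1#*/}" n_net="${2%/*}" n_len="${2#*/}"
  # A summary cannot cover a prefix that is shorter (larger) than itself.
  if (( n_len < s_len )); then echo no; return; fi
  local mask=$(( (0xFFFFFFFF << (32 - s_len)) & 0xFFFFFFFF ))
  if (( ($(ip_to_int "$s_net") & mask) == ($(ip_to_int "$n_net") & mask) )); then
    echo yes
  else
    echo no
  fi
}

# Hypothetical subnets checked against the summary from this step.
for subnet in 10.20.1.0/24 10.20.200.0/24 10.21.0.0/24; do
  echo "$subnet covered by 10.20.0.0/16: $(covers 10.20.0.0/16 "$subnet")"
done
```

Any subnet that reports "no" would silently become unreachable from on-prem once the all_subnets group is dropped, so it either needs its own range or a wider summary.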

Step 5: Steering failover with MED, AS-path prepending, and base priorities

For deterministic active/passive (or to prefer one region over another), you bias BGP. GCP gives you three levers; understand which direction each one steers.

Advertised route priority (MED). When Cloud Router advertises a route, --advertised-route-priority becomes the MED the peer sees. Lower MED wins at the peer. So to make tunnel 0 primary for traffic coming from on-prem into GCP, advertise GCP routes with a lower priority on tunnel 0 and higher on tunnel 1:

# Make tunnel 0 the preferred ingress path (lower MED = preferred)
gcloud compute routers update-bgp-peer cr-use4 \
  --peer-name=bgp-tun0 \
  --region=us-east4 \
  --advertised-route-priority=100

gcloud compute routers update-bgp-peer cr-use4 \
  --peer-name=bgp-tun1 \
  --region=us-east4 \
  --advertised-route-priority=200

Base priority / learned-route preference. The advertised priority above only steers the on-prem-to-GCP direction. The GCP-to-on-prem direction is decided by how Cloud Router ranks the routes it learns. Identical learned routes get ECMP; to make Cloud Router prefer tunnel 0, the peer must influence the ranking, either by advertising a lower MED toward GCP on tunnel 0 or by AS-path prepending on the standby. Prepending lengthens the AS-path, and the longer AS-path loses:

! On the peer (e.g. Cisco IOS-XE), prepend own ASN on the standby tunnel
route-map TO-GCP-STANDBY permit 10
  set as-path prepend 64600 64600
!
router bgp 64600
  neighbor 169.254.1.1 route-map TO-GCP-STANDBY out

The clean mental model:

Direction	Decided by	To prefer tunnel 0
On-prem -> GCP	MED that GCP advertises	Lower `--advertised-route-priority` on tunnel 0
GCP -> on-prem	What the peer advertises (MED or AS-path)	Peer sends lower MED, or prepends AS-path on tunnel 1

Set both sides consistently or you get asymmetric routing: egress on tunnel 0, return on tunnel 1. That breaks stateful firewalls on-prem and is the single most common HA VPN misconfiguration.
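
The two attributes from this step can be reasoned about with a small comparison function. This is a deliberately simplified model, assuming AS-path length is compared before MED and ignoring every other BGP tie-breaker (local preference, origin, and so on); prefer is a hypothetical name and the inputs are example values.

```shell
#!/usr/bin/env bash
# Simplified two-path comparison: shorter AS-path wins, then lower MED.
# Real best-path selection has more steps; this models only the two
# attributes discussed in this step.
prefer() {
  local len0="$1" med0="$2" len1="$3" med1="$4"
  if   (( len0 < len1 )); then echo tunnel0
  elif (( len0 > len1 )); then echo tunnel1
  elif (( med0 < med1 )); then echo tunnel0
  elif (( med0 > med1 )); then echo tunnel1
  else echo ecmp
  fi
}

prefer 1 100 1 100   # identical attributes on both tunnels
prefer 1 100 3 100   # peer prepends its ASN twice on tunnel 1
prefer 1 100 1 200   # higher MED on tunnel 1
```

Run it once per direction with the attributes each side actually receives. If the two directions disagree on the winner, that is the asymmetric-routing case described above.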

Step 6: Network Connectivity Center - VPN spokes and hub-and-spoke transit

A pair of HA VPN tunnels connects one VPC to one site. When you have many sites and many VPCs, full-mesh VPN does not scale and VPC peering is non-transitive. Network Connectivity Center (NCC) gives you a hub with spokes, where spokes can be HA VPN tunnels, Interconnect attachments, or router appliances, and the hub provides transitive any-to-any reachability between them.

# Create the NCC hub
gcloud network-connectivity hubs create global-hub \
  --description="Enterprise hybrid transit hub"

# Attach the HA VPN tunnels as a spoke (both tunnels = one redundant spoke)
gcloud network-connectivity spokes linked-vpn-tunnels create dc1-spoke \
  --hub=global-hub \
  --region=us-east4 \
  --vpn-tunnels=tun-use4-if0,tun-use4-if1 \
  --site-to-site-data-transfer

With --site-to-site-data-transfer, two VPN spokes on the same hub can route to each other through GCP’s backbone - a branch office in one region reaches another branch in a different region without a direct site-to-site tunnel. The Cloud Routers on each spoke automatically exchange the dynamic routes learned across the hub, so on-prem prefixes propagate site-to-site without static glue.

NCC is the right tool when you have N sites needing any-to-any reach. It replaces an N-squared mesh of tunnels with N spokes on one hub, and it keeps using the same HA VPN + BGP primitives you already built.
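
The scaling argument is easy to put numbers on. A rough count, assuming one logical connection per site pair in the mesh case and one spoke per site in the hub case (each of which would still be a redundant tunnel pair in practice):

```shell
#!/usr/bin/env bash
# Site-to-site connections needed for any-to-any reach between n sites.
full_mesh_links() { echo $(( $1 * ($1 - 1) / 2 )); }
hub_spokes()      { echo "$1"; }

for n in 4 10 25; do
  echo "sites=$n full-mesh=$(full_mesh_links "$n") hub-spokes=$(hub_spokes "$n")"
done
```

At 25 sites the mesh needs 300 pairings against 25 spokes, and every one of those pairings carries its own tunnels, BGP sessions, and route policy to keep consistent.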

Step 7: MTU, IKEv2, and rekey for stable throughput

Throughput problems on VPN are almost always MTU and fragmentation, not bandwidth.

MTU. HA VPN supports a maximum payload MTU of 1460 bytes. The ESP/IPsec overhead is carried on top of that. Set the on-prem tunnel interface MTU accordingly and, critically, make sure either path MTU discovery works end to end or TCP MSS is clamped. The most reliable fix for mysterious stalls on large transfers is MSS clamping on the peer (ip tcp adjust-mss 1360 on Cisco), which forces TCP to negotiate a segment size that survives encapsulation.
IKEv2. Always use --ike-version=2. IKEv2 is required for the cleanest HA VPN behavior, supports MOBIKE, and rekeys without tearing the SA. Classic IKEv1 is legacy.
Rekey / DPD. GCP manages IKE and IPsec SA lifetimes; you do not tune them on the GCP side. Make sure the peer’s rekey margins and Dead Peer Detection are sane so a rekey does not look like a link failure to BGP. Mismatched aggressive DPD timers are a classic cause of tunnels that drop every few hours.
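
The MSS figure is simple arithmetic, and it helps to see how much slack a given clamp leaves. A minimal sketch, assuming IPv4 with no IP or TCP options and taking the 1460 and 1360 values from the MTU item above at face value; the function names are illustrative.

```shell
#!/usr/bin/env bash
# IPv4 packet size produced by a given TCP MSS (no IP or TCP options assumed).
IPV4_HEADER=20
TCP_HEADER=20
packet_size_for_mss() { echo $(( $1 + IPV4_HEADER + TCP_HEADER )); }
headroom()            { echo $(( $1 - $(packet_size_for_mss "$2") )); }  # args: MTU, MSS

mtu=1460   # figure quoted in this step
mss=1360   # value used with `ip tcp adjust-mss`
echo "MSS $mss -> IPv4 packet of $(packet_size_for_mss "$mss") bytes"
echo "Headroom under MTU $mtu: $(headroom "$mtu" "$mss") bytes"
```

A 1360-byte MSS yields 1400-byte packets, 60 bytes below 1460. The exact encapsulation overhead depends on the ciphers negotiated, so treat the clamp as a conservative margin rather than a precise fit.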

A correct Terraform tunnel pins the version and PSK explicitly:

resource "google_compute_vpn_tunnel" "tun_if0" {
  name                            = "tun-use4-if0"
  region                          = "us-east4"
  vpn_gateway                     = google_compute_ha_vpn_gateway.gw.id
  vpn_gateway_interface           = 0
  peer_external_gateway           = google_compute_external_vpn_gateway.onprem.id
  peer_external_gateway_interface = 0
  shared_secret                   = var.psk_if0
  router                          = google_compute_router.cr.id
  ike_version                     = 2
}

Verify

Validate the data and control planes before declaring victory.

# 1. Both tunnels established at the IPsec layer
gcloud compute vpn-tunnels list \
  --filter="region:us-east4" \
  --format="table(name, status, detailedStatus)"

# 2. Both BGP sessions are 'Up' and learning/advertising routes
gcloud compute routers get-status cr-use4 \
  --region=us-east4 \
  --format="flattened(result.bgpPeerStatus[].name,
            result.bgpPeerStatus[].status,
            result.bgpPeerStatus[].state,
            result.bgpPeerStatus[].numLearnedRoutes,
            result.bgpPeerStatus[].advertisedRoutes[].destRange)"

# 3. Confirm the learned routes are installed
#    (dynamic routes are reported by the router, not by `gcloud compute routes list`)
gcloud compute routers get-status cr-use4 \
  --region=us-east4 \
  --format="yaml(result.bestRoutes)"

For a real failover test, do not just ping. Administratively bring down the primary BGP session and confirm traffic survives:

# Disable the primary peer; learned routes withdraw, ECMP/standby takes over
gcloud compute routers update-bgp-peer cr-use4 \
  --peer-name=bgp-tun0 --region=us-east4 --no-enabled

# ... run a sustained transfer across the tunnel here, confirm no hard break ...

# Re-enable and confirm the route returns
gcloud compute routers update-bgp-peer cr-use4 \
  --peer-name=bgp-tun0 --region=us-east4 --enabled

Troubleshooting flapping BGP. If get-status shows a peer cycling between Connect/Established, work the layers in order: (1) confirm the tunnel itself is stable in Cloud Logging - a flapping IPsec SA from mismatched DPD looks like BGP flap; (2) verify the link-local /30 matches exactly on both ends (.1 on GCP, .2 on peer) and the peer ASN is correct; (3) check for an MTU black hole - BGP opens fine but large UPDATE packets are dropped, so the session resets the moment route count grows; (4) confirm the peer is not also advertising a default route that creates a routing loop back into the tunnel.

Enterprise scenario

A payments platform ran a Shared VPC in us-east4 with HA VPN back to a colo data center that fronted a pair of stateful Palo Alto firewalls. Both tunnels were up, BGP was healthy, and throughput was fine. Intermittently, though, a subset of TCP sessions to on-prem services would hang and reset. The firewall logs showed packets arriving with no matching session and being dropped.

The constraint: the firewalls are stateful and not clustered for asymmetric flows. The platform team had left both BGP sessions at default attributes, so Cloud Router was doing ECMP egress while the firewalls’ own routing sent return traffic over whichever tunnel they preferred. Result: a flow would leave GCP on tunnel 0 and pass through one firewall, the reply would come back via tunnel 1 through the other firewall, and that firewall, finding no matching session, would drop it.

They fixed it by forcing symmetric active/passive without giving up fast failover. On the GCP side they lowered the advertised MED on tunnel 0 so on-prem always ingressed via tunnel 0; on the firewall side they prepended the AS-path on tunnel 1 so GCP always egressed via tunnel 0. Tunnel 1 stayed hot as standby and took over automatically only when tunnel 0’s routes withdrew.

# GCP: tunnel 0 strongly preferred for on-prem -> GCP ingress
gcloud compute routers update-bgp-peer cr-use4 \
  --peer-name=bgp-tun0 --region=us-east4 --advertised-route-priority=100
gcloud compute routers update-bgp-peer cr-use4 \
  --peer-name=bgp-tun1 --region=us-east4 --advertised-route-priority=1000

The lesson: ECMP active/active is correct for stateless routing, but the moment a stateful middlebox sits in the path, you must enforce path symmetry through BGP attributes on both directions. The 99.99% SLA was never in question - the topology was sound; the route policy was the bug.

Written by Vinod
