Default BGP path selection is not your friend in a hybrid estate. Plug in ExpressRoute, Direct Connect, and a VPN backup, advertise the same prefixes over all three, and the router picks a “best” path using rules that have nothing to do with your contract, your bandwidth, or your intent. When the primary drops, traffic does not always fail to the standby you provisioned — it fails to whatever the decision tree picks next, and sometimes that is a black hole because each side thinks the other path is live. This guide walks the best-path algorithm, then shows which knob to turn on which side so the path you pay for is the path you get, and failover is deterministic and measurable.
1. The best-path decision tree and which knobs each cloud lets you turn
Every BGP speaker runs the same ordered tie-break when it has multiple routes to the same prefix. Stop at the first step that produces a single winner:
| Order | Attribute | Direction it influences | Who sets it |
|---|---|---|---|
| 1 | Weight (Cisco-proprietary, local) | Outbound from this router | Your edge router only |
| 2 | Local-Preference | Outbound from your AS | Your edge, propagated via iBGP |
| 3 | Locally originated / aggregate | Outbound | Origination config |
| 4 | Shortest AS-path | Both (prepend on advertise) | Either side |
| 5 | Lowest origin type (IGP < EGP < Incomplete) | Both | Origination |
| 6 | Lowest MED | Inbound to your AS | You, on the advertisement; the neighbor compares it between paths from the same AS |
| 7 | eBGP over iBGP | Internal preference | Topology |
| 8 | Lowest IGP metric to next-hop | Internal | Your IGP |
| 9 | Oldest route / lowest router-id | Tie-break | Arbitrary |
Two facts drive everything that follows. Weight and Local-Preference decide where your traffic exits (outbound toward the cloud) and you control them on your edge. AS-path length and MED are how you ask the cloud to prefer one of its return paths (inbound), honored only within the cloud’s own policy. You cannot set Local-Pref inside Microsoft’s or AWS’s network; you influence their decision only with attributes they agree to read — which is what published BGP communities are for.
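The tie-break order in the table can be sketched as a sort key. This is a simplified model, not a router implementation: the `Path` class, its field names, and ASN 64512 are hypothetical, MED is compared across all paths (real speakers compare it only between paths from the same neighbor AS), and the oldest-route step is omitted.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Path:
    """Hypothetical container for the attributes the decision tree reads."""
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"


def preference_key(p):
    """Smaller tuple means better path; fields follow the table order."""
    return (
        -p.weight,                 # 1: highest weight wins
        -p.local_pref,             # 2: highest Local-Preference wins
        not p.locally_originated,  # 3: locally originated wins
        len(p.as_path),            # 4: shortest AS-path wins
        ORIGIN_RANK[p.origin],     # 5: lowest origin type wins
        p.med,                     # 6: lowest MED wins (simplified)
        not p.ebgp,                # 7: eBGP beats iBGP
        p.igp_metric,              # 8: lowest IGP metric to next-hop
        tuple(int(x) for x in p.router_id.split(".")),  # 9: lowest router-id
    )


def best_path(paths):
    return min(paths, key=preference_key)


# The circuit has a longer AS-path but a higher Local-Preference.
circuit = Path("circuit", local_pref=200, as_path=(64512, 64512, 64512))
vpn = Path("vpn", local_pref=100, as_path=(64512,))
print(best_path([circuit, vpn]).name)  # circuit
```

Because Local-Preference sits at step 2, the circuit wins even though its AS-path is longer; with equal Local-Preference the shorter VPN path would win at step 4.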
The cruel part: clouds short-circuit this tree with their own policy before AS-path matters.
- Azure ExpressRoute vs VPN: if the same prefix arrives over both ExpressRoute and a VPN gateway, Azure prefers ExpressRoute by default regardless of AS-path. That is documented behavior, not a tie-break you won. Prepending the VPN advertisement 50 times changes nothing — you need connection weight, or to not advertise the overlap.
- AWS Direct Connect: AWS evaluates longest-prefix-match first, then the local-preference BGP communities you tag, and only then AS-path length. A more-specific route always wins over prepending, and a path tagged 7224:7300 beats an untagged path with a shorter AS-path.
Internalize this: AS-path prepending is a weak, last-resort lever that any more-specific route or any cloud policy override will defeat. Reach for communities and prefix scoping first; prepend only to break ties between otherwise-equal paths.
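To make the ordering concrete, here is a minimal sketch of the Direct Connect return-path selection as described above: longest prefix first, then the local-preference community, then AS-path length. The function name, the route dictionaries, and the `via` labels are assumptions for illustration; this models the documented order, not AWS's actual implementation.

```python
import ipaddress

# Higher rank = more preferred; untagged routes are treated as medium.
COMMUNITY_RANK = {"7224:7100": 1, "7224:7200": 2, "7224:7300": 3}


def pick_return_route(dest, routes):
    """Pick the route the cloud would use to reach `dest` under this model."""
    addr = ipaddress.ip_address(dest)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda r: (
            ipaddress.ip_network(r["prefix"]).prefixlen,  # longest match first
            COMMUNITY_RANK.get(r.get("community"), 2),    # then community
            -r["as_path_len"],                            # then shortest AS-path
        ),
    )


routes = [
    {"via": "primary", "prefix": "10.50.0.0/16", "community": "7224:7300", "as_path_len": 1},
    {"via": "backup", "prefix": "10.50.0.0/16", "community": "7224:7100", "as_path_len": 4},
]
print(pick_return_route("10.50.7.9", routes)["via"])  # primary

# A leaked more-specific on the backup wins despite its low community and prepends.
routes.append(
    {"via": "backup-leak", "prefix": "10.50.7.0/24", "community": "7224:7100", "as_path_len": 4}
)
print(pick_return_route("10.50.7.9", routes)["via"])  # backup-leak
```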
2. Primary-vs-backup design: transport preferred, VPN as standby
The target topology across both clouds is the same shape:
on-prem edge (ASN 65001)
|-- ExpressRoute / Direct Connect (PRIMARY, low latency, high bw)
|-- IPsec VPN over internet (BACKUP, standby only)
cloud VNet/VPC
Four things must be simultaneously true:
- Outbound (you to cloud) prefers the dedicated circuit. Set a higher Local-Preference on routes learned over ExpressRoute/Direct Connect than over the VPN.
- Inbound (cloud to you) also prefers the circuit. Make the cloud see the VPN path as worse — via AS-path prepend, a low-pref community, or (cleanest) by not advertising the prefix over the VPN until needed.
- The backup is actually viable. The VPN session must be up and advertising in steady state, so failover is a withdraw-and-reconverge, not a cold start. A standby that only comes up after the primary dies adds tunnel-negotiation time to the outage.
- No prefix is advertised in a way that creates a return-path mismatch — where black holes live (Section 7).
On your edge router, outbound preference is a Local-Pref policy keyed off the neighbor:
! Cisco IOS-XE: prefer the Direct Connect / ExpressRoute neighbor outbound
route-map FROM_PRIMARY permit 10
set local-preference 200
route-map FROM_BACKUP permit 10
set local-preference 100
!
router bgp 65001
address-family ipv4 unicast
neighbor 169.254.10.1 route-map FROM_PRIMARY in
neighbor 169.254.20.1 route-map FROM_BACKUP in
Higher Local-Pref wins (step 2), and because it propagates through your iBGP mesh, every internal router agrees to exit via the primary. That is the single most important outbound control, and it is entirely on your side.
3. AS-path prepending to deprioritize a path, and where it gets ignored
Prepending makes a path look longer by stuffing your own ASN into the AS-path, so a remote AS comparing AS-path length picks the other path. Apply it outbound, on the advertisement over the path you want demoted (the VPN).
! Cisco IOS-XE: prepend on the VPN advertisement so the cloud prefers the circuit
route-map TO_BACKUP_VPN permit 10
set as-path prepend 65001 65001 65001
!
router bgp 65001
address-family ipv4 unicast
neighbor 169.254.20.1 route-map TO_BACKUP_VPN out
On Azure VPN Gateway you do the symmetric thing without touching a router — the gateway demotes the connection via its routing weight. In Terraform:
resource "azurerm_virtual_network_gateway_connection" "vpn_backup" {
name = "vpn-backup"
resource_group_name = azurerm_resource_group.net.name
location = azurerm_resource_group.net.location
type = "IPsec"
virtual_network_gateway_id = azurerm_virtual_network_gateway.vpn.id
local_network_gateway_id = azurerm_local_network_gateway.onprem.id
shared_key = var.vpn_psk
enable_bgp = true
# Keep the backup at the lowest routing weight; Azure prefers the connection with the higher weight.
# Assumption to verify against current Azure docs: weight semantics for VPN connections.
routing_weight = 0
}
Three hard limits you must respect:
- It is ignored before the AS-path step is reached. Direct Connect’s longest-prefix-match and local-pref communities, and Azure’s ExpressRoute-over-VPN preference, all sit above AS-path. Prepend is invisible to them.
- Per-AS, not global. Some upstreams cap or collapse prepends, and an AS that selects on Local-Pref or MED never looks at your padding. You cannot guarantee a distant AS honors it.
- More than 3-5 prepends buys nothing. Two or three extra hops already make your path strictly longer to a peer comparing paths. Padding to 20 is cargo-cult, and only widens the blast radius if a leak re-originates the path elsewhere.
Prepend nudges a neighbor that has two equal-length paths from you. It is not a failover mechanism. If correctness depends on a prepend being honored three ASes away, the design is wrong.
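A small sketch shows why padding beyond a few prepends buys nothing when the neighbor compares only AS-path length. The helper names are hypothetical and the model ignores every step above AS-path in the decision tree.

```python
def prepend(as_path, asn, count):
    """Return a new AS-path with `asn` repeated `count` times at the front."""
    return [asn] * count + list(as_path)


def shorter_path(candidates):
    """Return the name of the shortest AS-path, or None when the top two tie."""
    ranked = sorted(candidates.items(), key=lambda item: len(item[1]))
    if len(ranked) > 1 and len(ranked[0][1]) == len(ranked[1][1]):
        return None
    return ranked[0][0]


base = [65001]
for count in (0, 3, 20):
    offers = {"circuit": base, "vpn": prepend(base, 65001, count)}
    print(count, shorter_path(offers))
# 0 None
# 3 circuit
# 20 circuit
```

Three prepends and twenty prepends produce the same outcome; the only case prepending changes is the tie.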
4. Local-preference and weight for inbound vs outbound steering
These two are frequently confused. Keep the scope straight:
| Lever | Scope | Honored by | Use for |
|---|---|---|---|
| Weight | Single router (Cisco-local, never advertised) | Only the router it is set on | Pinning one edge box’s outbound choice |
| Local-Preference | Entire AS (propagated in iBGP) | Every iBGP speaker in your AS | Your whole network’s outbound exit |
| AS-path prepend | Advertised to neighbors | Neighbors that compare AS-path | Asking the cloud to demote a return path |
| MED | Advertised to one neighbor AS | That neighbor, comparing its own paths | Steering inbound when you own both links to one AS |
Weight is the highest-priority, lowest-scope lever — a hard local override on a single edge router with both a Direct Connect VIF and a VPN tunnel, independent of iBGP propagation:
! Weight: highest wins, applies only to THIS router's outbound decision
router bgp 65001
address-family ipv4 unicast
neighbor 169.254.10.1 weight 200 ! Direct Connect VIF
neighbor 169.254.20.1 weight 100 ! VPN tunnel
Use Local-Preference when multiple edge routers must agree (the common enterprise case). Use weight only when a single box must override regardless of iBGP — for example a regional edge that should always prefer its local circuit even if a remote site advertises a better Local-Pref.
For inbound steering when both links land on the same neighbor AS (e.g. two Direct Connect VIFs to AWS in one Region), MED is correct: AWS compares MED between paths from your same ASN and prefers the lower. MED does not survive crossing into a different AS, so it only works link-to-link with one provider.
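The same-AS restriction on MED can be sketched from the receiving side's point of view. This is an illustrative model only: the dictionary keys, link names, and the second ASN 65002 are hypothetical.

```python
from collections import defaultdict


def lowest_med_per_neighbor(paths):
    """Group paths by the neighbor AS they came from; lowest MED wins per group."""
    groups = defaultdict(list)
    for path in paths:
        groups[path["neighbor_as"]].append(path)
    return {
        asn: min(group, key=lambda p: p["med"])["via"]
        for asn, group in groups.items()
    }


paths = [
    {"via": "dx-vif-a", "neighbor_as": 65001, "med": 50},
    {"via": "dx-vif-b", "neighbor_as": 65001, "med": 100},
    {"via": "other-as-link", "neighbor_as": 65002, "med": 10},
]
print(lowest_med_per_neighbor(paths))
# {65001: 'dx-vif-a', 65002: 'other-as-link'}
```

The MED of 10 from the other AS never competes with the two links from ASN 65001, which is why MED only steers link-to-link with one provider.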
5. BGP communities: tagging routes and using cloud-published communities
A BGP community is a 32-bit tag (ASN:value) on a prefix. Two uses: your own internal signaling, and — far more powerful in cloud — cloud-published communities the provider’s policy reads to change its own routing or scope your advertisements.
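As a quick illustration of the 32-bit layout, the sketch below packs and unpacks the ASN:value notation, with the ASN in the high 16 bits and the value in the low 16 bits. The function names are hypothetical.

```python
def encode_community(text):
    """Pack 'ASN:value' into the 32-bit integer carried in the BGP attribute."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError(f"both halves must fit in 16 bits: {text}")
    return (asn << 16) | value


def decode_community(raw):
    """Unpack a 32-bit community back into 'ASN:value' notation."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


raw = encode_community("7224:7300")
print(decode_community(raw))         # 7224:7300
print(decode_community(0xFFFFFF01))  # 65535:65281, the well-known NO_EXPORT
```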
AWS Direct Connect inbound local-preference communities. AWS sets its local-preference for traffic returning to you based on the community you tag on the advertised prefix:
| Community | AWS local-pref | Meaning |
|---|---|---|
| 7224:7100 | Low | Least-preferred return path |
| 7224:7200 | Medium | Default if untagged |
| 7224:7300 | High | Most-preferred return path |
Tag your primary VIF advertisement 7224:7300 and your backup VIF 7224:7100, and AWS returns traffic over the primary deterministically — this beats AS-path length, so it is far more reliable than prepending:
! Tag the primary Direct Connect advertisement HIGH, backup LOW
ip community-list standard DX_HIGH permit 7224:7300
ip community-list standard DX_LOW permit 7224:7100
!
route-map TO_DX_PRIMARY permit 10
set community 7224:7300
route-map TO_DX_BACKUP permit 10
set community 7224:7100
!
router bgp 65001
address-family ipv4 unicast
neighbor 169.254.10.1 route-map TO_DX_PRIMARY out
neighbor 169.254.30.1 route-map TO_DX_BACKUP out
neighbor 169.254.10.1 send-community
neighbor 169.254.30.1 send-community
AWS Direct Connect scope communities (public VIFs). On public VIFs, AWS publishes scope tags that limit how far your prefix propagates within AWS’s network:
| Community | Scope |
|---|---|
| 7224:9100 | Local AWS Region only |
| 7224:9200 | Same continent |
| 7224:9300 | Global (all public Regions) |
You also receive AWS’s own scope communities on the routes it advertises to you over public VIFs (7224:8100 for routes originating in the local Region, 7224:8200 for the same continent; routes from other continents carry no scope tag), which you can match on to filter what you accept.
Azure ExpressRoute communities. On Microsoft peering, ExpressRoute tags every advertised prefix with a per-region BGP community (a region-specific value of the form 12076:51xxx) plus a service community such as 12076:5010 for Exchange Online. You match on these to scope what you accept and re-advertise. To send a community Azure acts on (notably NO_EXPORT semantics, to keep a prefix from leaking past Microsoft’s edge), set it on the advertisement; Azure honors standard well-known communities on private peering. Confirm current published values in the provider docs before hard-coding; region tags change as regions are added.
! Accept only the home-region ExpressRoute prefixes; drop everything else
! 12076:51005 is assumed to be the East US 2 regional value; verify against the published list
ip community-list standard ER_HOME_REGION permit 12076:51005
route-map FROM_EXPRESSROUTE permit 10
match community ER_HOME_REGION
set local-preference 200
route-map FROM_EXPRESSROUTE deny 20
6. Prefix filtering, max-prefix limits, and summarization
Communities steer; filters protect. Three controls keep a hybrid table from leaking or exploding:
Prefix lists / route filters — accept only what you expect. Never accept a default route or an unplanned aggregate. Azure ExpressRoute Route Filters gate which Microsoft-peering BGP community prefixes you receive:
az network route-filter create \
--name rf-m365 --resource-group net-rg --location eastus2
az network route-filter rule create \
--resource-group net-rg --filter-name rf-m365 \
--name allow-exchange --access Allow \
--communities 12076:5010 12076:5020
On your edge, an inbound prefix-list is non-negotiable:
ip prefix-list FROM_CLOUD seq 5 permit 10.50.0.0/16 le 24
ip prefix-list FROM_CLOUD seq 10 deny 0.0.0.0/0 le 32
route-map FROM_PRIMARY permit 10
match ip address prefix-list FROM_CLOUD
set local-preference 200
Max-prefix limits — bound the blast radius of a leak. A misconfigured neighbor that suddenly advertises the full table should tear down the session, not melt your control plane:
router bgp 65001
address-family ipv4 unicast
neighbor 169.254.10.1 maximum-prefix 100 80 restart 15
That caps the neighbor at 100 prefixes, warns at 80%, and auto-restarts the session after 15 minutes. ExpressRoute itself enforces a hard route limit (4000 prefixes on standard, 10000 with the premium add-on) — exceed it and the entire BGP session drops, taking the circuit with it. Summarize aggressively so you never approach that ceiling.
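The threshold arithmetic behind that line, plus the headroom check against a provider route limit, can be sketched as follows. The function names are hypothetical, and the exact boundary at which a given router logs the warning is an assumption here.

```python
def max_prefix_action(received, limit=100, warn_pct=80):
    """Model 'maximum-prefix <limit> <warn_pct>': warn near the cap, tear down above it."""
    warn_at = limit * warn_pct // 100
    if received > limit:
        return "teardown"
    if received >= warn_at:
        return "warn"
    return "ok"


def headroom(advertised, circuit_limit):
    """Prefixes left before a provider-side route limit is hit."""
    return circuit_limit - advertised


for count in (79, 80, 100, 101):
    print(count, max_prefix_action(count))
# 79 ok
# 80 warn
# 100 warn
# 101 teardown

print(headroom(350, 4000))  # 3650 prefixes left under a 4000-route limit
```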
Summarization — advertise aggregates, not host routes. Send an estate as a handful of /16s, not hundreds of /24s. Fewer prefixes means faster reconvergence and headroom under the limit. Beware the longest-prefix-match interaction: if your primary advertises a /16 and a leak injects a /24 inside it over the backup, the /24 wins on AWS regardless of your communities (Section 1). Summarize symmetrically on every path, or not at all on one.
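A minimal sketch using Python's standard `ipaddress` module shows both halves of that advice: collapsing contiguous subnets into an aggregate, and flagging a more-specific on one path that sits inside an aggregate advertised on another. The function names and path labels are hypothetical.

```python
import ipaddress


def summarize(prefixes):
    """Collapse contiguous prefixes into the fewest covering networks."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


def shadowing_specifics(by_path):
    """Find prefixes on one path that are strict subnets of a prefix on another."""
    findings = []
    for path, prefixes in by_path.items():
        for other, other_prefixes in by_path.items():
            if other == path:
                continue
            for p in prefixes:
                for agg in other_prefixes:
                    pn, an = ipaddress.ip_network(p), ipaddress.ip_network(agg)
                    if pn != an and pn.subnet_of(an):
                        findings.append((path, p, other, agg))
    return findings


print(summarize([f"10.50.{i}.0/24" for i in range(256)]))  # ['10.50.0.0/16']

adverts = {
    "primary": ["10.50.0.0/16"],
    "backup": ["10.50.0.0/16", "10.50.7.0/24"],
}
print(shadowing_specifics(adverts))
# [('backup', '10.50.7.0/24', 'primary', '10.50.0.0/16')]
```

Any finding means longest-prefix-match will pull that range onto the path carrying the more-specific, whatever the communities say.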
7. Asymmetric-routing and route-leak traps
The black holes are almost always the same two failure modes.
Same prefix on both paths with mismatched preference. If outbound prefers the circuit (Local-Pref) but inbound prefers the VPN (you forgot to demote the VPN advertisement, or a more-specific leaked), traffic leaves over ExpressRoute and returns over the VPN. Stateful firewalls and NAT devices that expect both directions drop the asymmetric flow — a “half-open” outage that pings fine but breaks TCP. Fix it by making inbound and outbound preference agree: demote the backup in both directions, or do not advertise the overlap on the backup until failover.
The leak that silently wins. Direct Connect’s longest-prefix-match means one stray more-specific (a /32 a teammate redistributed, a component route someone forgot to suppress under its aggregate) pulls traffic onto a path your communities and prepends never touch, because the cloud never reaches the AS-path step. Guard with strict outbound prefix-lists on every advertisement and aggregate-address ... summary-only so components cannot leak past the aggregate:
router bgp 65001
address-family ipv4 unicast
aggregate-address 10.50.0.0 255.255.0.0 summary-only
neighbor 169.254.20.1 prefix-list ONLY_AGGREGATE out
!
ip prefix-list ONLY_AGGREGATE seq 5 permit 10.50.0.0/16
The black-hole test that catches both: from a host behind the cloud, traceroute back to on-prem while the primary is up. If the return path does not traverse the circuit you intended, you have an asymmetry waiting to become an outage — fix it before you ever pull a cable.
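Once you have recorded, per prefix, which path traffic exits on and which path the return traceroute arrives on, the comparison itself is simple. The sketch below assumes you have already collected both maps; the function name, the path labels, and the sample prefixes are hypothetical.

```python
def find_asymmetric(outbound_exit, return_entry):
    """Return prefixes whose return path differs from (or lacks) the outbound path."""
    problems = {}
    for prefix, exit_path in outbound_exit.items():
        entry_path = return_entry.get(prefix)
        if entry_path is None:
            problems[prefix] = (exit_path, "no return path observed")
        elif entry_path != exit_path:
            problems[prefix] = (exit_path, entry_path)
    return problems


outbound_exit = {"10.50.0.0/16": "circuit", "10.70.0.0/16": "circuit"}
return_entry = {"10.50.0.0/16": "circuit", "10.70.0.0/16": "vpn"}
print(find_asymmetric(outbound_exit, return_entry))
# {'10.70.0.0/16': ('circuit', 'vpn')}
```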
Verify
Prove the steady state before trusting failover. Confirm the attributes, not just reachability — a path can be reachable and still be the wrong one.
! Best path and the attributes that chose it
show bgp ipv4 unicast 10.50.0.0/16
! Look for: ">" on the circuit path, Local-Pref 200, your communities, AS-path length
show ip bgp neighbors 169.254.10.1 advertised-routes ! what you SEND the cloud
show ip bgp neighbors 169.254.10.1 received-routes ! what the cloud SENDS you (requires soft-reconfiguration inbound)
show ip bgp 10.50.0.0/16 bestpath ! why this path won
On Azure, dump the routes the gateway learned over BGP and the prefixes it is advertising back to on-prem:
# Routes the ExpressRoute/VPN gateway learned via BGP
az network vnet-gateway list-learned-routes \
--name er-gw --resource-group net-rg -o table
# Prefixes the gateway is advertising to on-prem
az network vnet-gateway list-advertised-routes \
--name er-gw --resource-group net-rg --peer 169.254.21.2 -o table
On AWS, confirm VIF state and the right communities (the Direct Connect console and aws directconnect describe-virtual-interfaces show BGP status; verify learned routes on the Transit Gateway / VGW route table). Match every received prefix against your expected list — anything extra is a leak in progress.
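Matching received prefixes against the expected list is easy to automate once the routes are exported as text. The sketch below mirrors a "permit 10.50.0.0/16 le 24" style policy using the standard `ipaddress` module; the function name and the sample prefixes are hypothetical.

```python
import ipaddress


def unexpected_prefixes(received, allowed, max_len=24):
    """Return received prefixes that fall outside `allowed` or are longer than `max_len`."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    extras = []
    for text in received:
        net = ipaddress.ip_network(text)
        expected = net.prefixlen <= max_len and any(
            net.version == a.version and net.subnet_of(a) for a in allowed_nets
        )
        if not expected:
            extras.append(text)
    return extras


received = ["10.50.1.0/24", "10.50.0.0/16", "0.0.0.0/0", "10.50.1.7/32", "192.168.0.0/24"]
print(unexpected_prefixes(received, ["10.50.0.0/16"]))
# ['0.0.0.0/0', '10.50.1.7/32', '192.168.0.0/24']
```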
Enterprise scenario
A retail platform team ran ExpressRoute into Azure East US 2 as primary, a VPN gateway as backup, and a stateful Palo Alto pair on-prem inspecting all hybrid traffic. They had set a higher Local-Pref on ExpressRoute routes, so outbound correctly preferred the circuit, and failover testing “passed.” Then a routine VNet expansion added 10.60.0.0/16, advertised over both the ExpressRoute and VPN connections — with the VPN left at routing_weight = 0 and no prepend. The firewall policy had only ever been built around the original prefix range.
The result was textbook asymmetry on the new range. Azure-egress traffic for 10.60.0.0/16 returned over the VPN (Azure had no reason to prefer ExpressRoute for an equally advertised prefix), while on-prem sent to it over ExpressRoute because of the Local-Pref. The Palo Altos saw SYNs on the ExpressRoute interface and SYN-ACKs on the VPN interface, flagged the flow asymmetric, and silently dropped it. ICMP worked, monitoring stayed green, and one new subnet was unreachable over TCP for forty minutes during business hours.
The fix had two parts. First, stop advertising the overlap on the backup in steady state — the VPN advertises the estate aggregate only when ExpressRoute withdraws. Second, where an overlap was genuinely needed, demote the VPN path with prepend so Azure’s return path matched the firewall’s expectation. The corrected connection:
resource "azurerm_virtual_network_gateway_connection" "vpn_backup" {
name = "vpn-backup"
resource_group_name = azurerm_resource_group.net.name
location = azurerm_resource_group.net.location
type = "IPsec"
virtual_network_gateway_id = azurerm_virtual_network_gateway.vpn.id
local_network_gateway_id = azurerm_local_network_gateway.onprem.id
shared_key = var.vpn_psk
enable_bgp = true
routing_weight = 0 # never preferred for Azure egress while ER is up
}
On-prem, the VPN advertisement was prepended 65001 65001 65001 and gated to the aggregate only, so no more-specific could leak and win on longest-prefix. They added a synthetic monitor that runs a return-path traceroute per advertised prefix and alarms if it does not egress the expected circuit — turning the asymmetry class of bug into a dashboard, not an incident.