Diagnosing and Killing SNAT Port Exhaustion on Cloud NAT Gateways

The pager goes off for “intermittent connection timeouts to the payments provider.” Everything else works. The provider swears their endpoint is healthy, and from your laptop it is. You retry the failing request and it succeeds. This is the signature of SNAT port exhaustion, and it is one of the most misdiagnosed failures in cloud networking because every individual symptom points somewhere else — at the destination, at TLS, at a flaky network — and never at the thing actually starving: the finite pool of source ports your NAT gateway hands out per destination.

This article builds the mental model first (the 5-tuple math that caps concurrent flows), shows you how to measure exhaustion instead of guessing, and then walks three remediations in priority order across Azure NAT Gateway and AWS NAT Gateway. We finish by reproducing exhaustion under load so you can prove headroom after the fix rather than waiting for the next incident.

1. What a SNAT port actually is

Source Network Address Translation lets many private instances share a small number of public IPs for outbound traffic. When an instance on 10.0.1.5 opens a connection to a public endpoint, the NAT gateway rewrites the packet’s source from the private IP to a public IP and rewrites the source port to a value it owns. Return traffic is matched back to the originating flow using the full 5-tuple:

(protocol, source IP, source port, destination IP, destination port)

The gateway’s public IP is fixed and the destination IP and port are fixed (you are talking to api.provider.com:443). The protocol is fixed (TCP). That leaves exactly one field the gateway can vary to keep concurrent flows distinct: the source port. The ephemeral port range is roughly 1024-65535, which is where the magic numbers come from:

Azure NAT Gateway: each attached public IP provides 64,512 SNAT ports.
AWS NAT Gateway: supports up to 55,000 simultaneous connections to each unique destination, where “unique destination” is the combination of destination IP, destination port, and protocol.

The critical insight that everyone misses: the limit is per destination, not global. You can have a million idle ports and still exhaust the pool to one busy endpoint. If all 64,512 (Azure) or 55,000 (AWS) ports on a single-IP gateway are occupied by flows to api.provider.com:443, the next connection to that destination has no free source port and fails, even though connections to any other endpoint succeed instantly. That is why “timeouts to one endpoint while everything else works” is the textbook symptom.
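
The per-destination behaviour is easier to see in a toy model. The sketch below is an illustration only, not how either cloud implements allocation: the `ToyNat` class, the IP addresses, and the tiny pool of 3 ports are all assumptions chosen so the effect shows up in a few lines.

```python
from collections import defaultdict


class ToyNat:
    """Toy SNAT allocator: one port pool per (public IP, protocol, destination)."""

    def __init__(self, public_ips, ports_per_ip):
        self.public_ips = list(public_ips)
        self.ports_per_ip = ports_per_ip
        # (public_ip, proto, dst_ip, dst_port) -> number of source ports in use
        self.in_use = defaultdict(int)

    def open_flow(self, proto, dst_ip, dst_port):
        """Return the public IP that took the flow, or None if every pool is full."""
        for ip in self.public_ips:
            key = (ip, proto, dst_ip, dst_port)
            if self.in_use[key] < self.ports_per_ip:
                self.in_use[key] += 1
                return ip
        return None


# One public IP with a deliberately tiny pool (real pools are 64,512 / 55,000).
nat = ToyNat(["198.51.100.1"], ports_per_ip=3)

# Four flows to the hot destination: the fourth finds no free port.
hot = [nat.open_flow("tcp", "203.0.113.10", 443) for _ in range(4)]

# A flow to a different destination still succeeds, because its pool is untouched.
other = nat.open_flow("tcp", "203.0.113.20", 443)

print(hot)    # ['198.51.100.1', '198.51.100.1', '198.51.100.1', None]
print(other)  # 198.51.100.1
```

Adding a second address to `public_ips` lets the fourth flow succeed, which is the same reason extra public IPs raise the ceiling on a real gateway.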

A “flow” here is not a request. With HTTP keep-alive, one TCP connection carries thousands of requests and holds one port the whole time. Without it (a new socket per request) you burn and release a port every call, and the port is not free the moment the socket closes: TIME_WAIT holds it for 60 seconds on Linux and for up to four minutes (2 × MSL) under the original TCP specification. The port math is dominated by connection churn and TIME_WAIT residency, not by request rate.
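
A rough way to put numbers on that is Little's law: ports held is approximately the new-connection rate multiplied by how long each port stays occupied. The sketch below assumes a steady state and a single hold time after close (`hold_after_close_s`); the real hold time depends on the OS and the gateway, so treat the inputs as placeholders to replace with your own measurements.

```python
def ports_in_use(new_conns_per_sec, conn_lifetime_s, hold_after_close_s):
    """Steady-state ports held to one destination (Little's law: rate x residency)."""
    return new_conns_per_sec * (conn_lifetime_s + hold_after_close_s)


def max_conn_rate(pool_size, conn_lifetime_s, hold_after_close_s):
    """Highest sustained new-connection rate a pool can absorb without exhausting."""
    return pool_size / (conn_lifetime_s + hold_after_close_s)


# Per-request sockets: 500 new connections/s, each open for 1 s,
# then held for an assumed 60 s after close.
churn = ports_in_use(500, conn_lifetime_s=1, hold_after_close_s=60)
print(churn)  # 30500 ports parked for only 500 requests per second

# The same pool size expressed as a rate ceiling (about 1,057 connections/s).
print(round(max_conn_rate(64_512, conn_lifetime_s=1, hold_after_close_s=60)))
```

With keep-alive, the same 500 requests per second might ride on a few dozen long-lived connections, so the occupied-port count is the pool size of the client rather than a function of request rate.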

2. The symptom profile

Before you touch a metric, recognize the shape. SNAT exhaustion is almost never a hard outage; it is a probabilistic one that tracks load.

Symptom	Why it points at SNAT (not the destination)
Timeouts to one busy endpoint; the rest of the internet is fine	The pool is per-destination; only the hot 5-tuple is starved
Failures correlate with traffic volume — worst at peak, gone at 3am	Port occupancy scales with concurrent flows + churn
`connect()` hangs then times out rather than returning RST/refused	No source port to allocate, so the gateway drops the SYN and nothing comes back
The destination’s own logs show no failed requests	The packets never left your NAT gateway
Retries succeed seconds later	Ports free up as other flows close past `TIME_WAIT`
New instances or a restart temporarily helps	Clears accumulated `TIME_WAIT`/idle connections, then it creeps back

If you see this profile, stop blaming the endpoint. The timeout is happening on your side of the NAT, before a single byte reaches the provider.
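
To tell a hang apart from a refusal without reading application logs, a small probe can classify a single connect attempt. This is a minimal sketch using only the standard library; the function name and return labels are made up for illustration, and a `timeout` result is consistent with SNAT starvation but does not prove it (a firewall that silently drops packets looks the same).

```python
import socket


def classify_connect(host, port, timeout_s=3.0):
    """Return 'connected', 'refused', 'timeout', or 'error:<name>' for one TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except ConnectionRefusedError:
        # An RST came back: the packet reached something that answered.
        return "refused"
    except (socket.timeout, TimeoutError):
        # Nothing came back at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error:{type(exc).__name__}"


if __name__ == "__main__":
    # Hypothetical target; replace with the hot destination's IP and port.
    print(classify_connect("203.0.113.10", 443, timeout_s=3.0))
```

Running the probe in a loop from inside the subnet during a burst, and again from a host that does not egress through the gateway, gives a quick side-by-side of the two paths.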

3. Measuring it instead of guessing

Diagnosis is a metrics exercise. Both clouds expose port-allocation counters; the trick is knowing which ones actually indicate exhaustion versus normal churn.

Azure: SNAT and port-allocation metrics

Azure NAT Gateway publishes metrics through Azure Monitor. The four that matter:

Metric	What it tells you
`SNATConnectionCount`	New SNAT connections made in each interval, split by `Attempted`/`Failed` via the Connection State dimension
`TotalConnectionCount`	Total active connections through the gateway
`PacketCount` / `PacketDropCount`	Dropped packets — a secondary exhaustion signal
`DatapathAvailability`	Health of the datapath itself (rule out a platform issue)

The smoking gun is failed connections, so query it directly. In the portal or through the metrics REST API, filter SNATConnectionCount on the Connection State = Failed dimension; a non-zero, load-correlated failed count is exhaustion until proven otherwise.

# Pull SNATConnectionCount split by connection state for a NAT gateway
NATGW_ID=$(az network nat gateway show \
  -g rg-egress -n natgw-prod --query id -o tsv)

az monitor metrics list \
  --resource "$NATGW_ID" \
  --metric "SNATConnectionCount" \
  --dimension "ConnectionState" \
  --interval PT1M \
  --aggregation Total \
  --start-time 2026-06-08T12:00:00Z \
  --end-time   2026-06-08T13:00:00Z \
  -o table

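If you ship NAT gateway metrics into a Log Analytics workspace, you can ask the same question in KQL. This is the alert you actually want, because it fires on the failed dimension, not on raw volume. The query assumes the exported records keep the ConnectionState dimension; if your export flattens dimensions, alert on the Azure Monitor metric directly instead: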

AzureMetrics
| where ResourceId has "natGateways/natgw-prod"
| where MetricName == "SNATConnectionCount"
| extend state = tostring(parse_json(Dimensions).ConnectionState)
| where state == "Failed"
| summarize FailedConns = sum(Total) by bin(TimeGenerated, 1m)
| where FailedConns > 0
| order by TimeGenerated desc

AWS: the ErrorPortAllocation counter

AWS makes this refreshingly unambiguous. NAT Gateway emits a CloudWatch metric named ErrorPortAllocation: the number of times the gateway could not allocate a source port. Any sustained non-zero value is exhaustion — there is no benign reading. Pair it with PacketsDropCount.

# ErrorPortAllocation over the last hour, 1-minute resolution
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name ErrorPortAllocation \
  --dimensions Name=NatGatewayId,Value=nat-0abc123def456 \
  --start-time 2026-06-08T12:00:00Z \
  --end-time   2026-06-08T13:00:00Z \
  --period 60 \
  --statistics Sum \
  --query 'Datapoints[?Sum>`0`].[Timestamp,Sum]' \
  --output table

Wire it to an alarm so you never diagnose this from a pager guess again:

aws cloudwatch put-metric-alarm \
  --alarm-name natgw-prod-port-exhaustion \
  --namespace AWS/NATGateway \
  --metric-name ErrorPortAllocation \
  --dimensions Name=NatGatewayId,Value=nat-0abc123def456 \
  --statistic Sum --period 60 --evaluation-periods 1 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:netops-pager
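
Once the datapoints are in hand, it helps to collapse them into incident windows that can be lined up against traffic graphs. The sketch below assumes the JSON shape that `get-metric-statistics` returns with `--output json` (a `Datapoints` list with `Timestamp` and `Sum`); the sample values and the helper name are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sample in the shape of `aws cloudwatch get-metric-statistics` output.
SAMPLE = {
    "Datapoints": [
        {"Timestamp": "2026-06-08T12:02:00+00:00", "Sum": 7.0},
        {"Timestamp": "2026-06-08T12:00:00+00:00", "Sum": 0.0},
        {"Timestamp": "2026-06-08T12:01:00+00:00", "Sum": 5.0},
        {"Timestamp": "2026-06-08T12:10:00Z", "Sum": 3.0},
        {"Timestamp": "2026-06-08T12:03:00+00:00", "Sum": 0.0},
    ]
}


def exhaustion_windows(datapoints, period_s=60):
    """Group consecutive non-zero datapoints into (start, end, total_errors) tuples."""
    hits = sorted(
        # Normalise a trailing 'Z' so fromisoformat works on older Python versions.
        (datetime.fromisoformat(d["Timestamp"].replace("Z", "+00:00")), d["Sum"])
        for d in datapoints
        if d["Sum"] > 0
    )
    windows = []
    for ts, errors in hits:
        if windows and ts - windows[-1][1] <= timedelta(seconds=period_s):
            start, _, total = windows[-1]
            windows[-1] = (start, ts, total + errors)  # extend the open window
        else:
            windows.append((ts, ts, errors))  # gap found: start a new window
    return windows


for start, end, total in exhaustion_windows(SAMPLE["Datapoints"]):
    print(start.isoformat(), end.isoformat(), total)
```

Datapoints come back unordered, which is why the helper sorts before grouping.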

Instance-side confirmation

Cloud metrics tell you the gateway is starved; the instance tells you why. Count sockets to the hot destination — a wall of TIME_WAIT is the confession:

# How many sockets to the destination, grouped by TCP state?
ss -tan dst 203.0.113.10 | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Aggregate view; a large TIME_WAIT count = per-request socket churn
# (-H drops the header line so the count is exact)
ss -Htan state time-wait | wc -l

If TIME_WAIT to one IP runs into the tens of thousands, the application is opening a fresh connection per request and never reusing sockets. That is remediation #2 territory, and often the real root cause hiding behind the gateway metric.
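
When several destinations are in play, grouping by peer as well as by state shows which one is responsible. The sketch below parses `ss -tan` text; it assumes the usual column order (state first, peer address fifth), and the sample output is fabricated for illustration.

```python
from collections import Counter

# Fabricated sample in the layout `ss -tan` prints.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
TIME-WAIT  0      0      10.0.1.5:40112       203.0.113.10:443
TIME-WAIT  0      0      10.0.1.5:40114       203.0.113.10:443
ESTAB      0      0      10.0.1.5:40120       203.0.113.10:443
ESTAB      0      0      10.0.1.5:52000       198.51.100.7:5432
"""


def states_by_peer(ss_output):
    """Count sockets per (peer IP, TCP state) from `ss -tan` text output."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "State":
            continue  # skip the header and any short or blank lines
        state, peer = fields[0], fields[4]
        peer_ip = peer.rsplit(":", 1)[0]  # strip the port, keep the address
        counts[(peer_ip, state)] += 1
    return counts


for (peer_ip, state), n in states_by_peer(SAMPLE).most_common():
    print(f"{n:6d}  {peer_ip:15s}  {state}")
```

On a live host the same function can read piped input, for example by passing `sys.stdin.read()` and running `ss -tan | python3 states.py` (the script name is arbitrary).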

4. Default port budgets and why outbound rules make it worse

Two design choices quietly shrink the pool below what people assume.

Azure default outbound access and Load Balancer SNAT are the trap. Before NAT Gateway, the common outbound paths were “default outbound access” (being retired, and never something to depend on) and Load Balancer outbound rules, which pre-allocate a fixed, small number of SNAT ports per instance. With a public Load Balancer, the default allocation is on the order of 1,024 ports per VM unless you configure an outbound rule to raise it. Pre-allocation is the worst model for bursty per-destination load: a VM that needs 4,000 concurrent flows to one endpoint hits a wall at 1,024 even though the IP has 64,512 ports sitting unused on other VMs.

NAT Gateway fixes this structurally. It allocates dynamically at the subnet level: every instance in the subnet draws from the shared pool of all attached public IPs on demand, so a single hot VM can consume far more than any static per-VM slice. This is why “attach a NAT Gateway and stop using Load Balancer for outbound” is itself remediation #1’s first move — you go from a rigid per-instance budget to one elastic pool. If both a NAT Gateway and Load Balancer outbound rules are configured on the same subnet, NAT Gateway takes precedence for outbound; do not leave conflicting outbound rules in place expecting them to add capacity.

AWS: the limit is per gateway, per destination, per AZ. A single NAT gateway gives 55,000 connections per unique destination. Two facts make this worse than it sounds: cross-AZ NAT traffic is billed and adds latency (so teams correctly deploy one NAT gateway per AZ — but that means each AZ has its own 55,000 ceiling, and a skewed client distribution can starve one AZ while others idle), and a chatty fleet all hammering the same SaaS IP shares that one ceiling.
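
The AZ skew is simple arithmetic, but it is worth writing down because the fleet-wide total can look comfortable while one zone is over its ceiling. A minimal sketch, with invented per-AZ flow counts and one NAT gateway per AZ assumed:

```python
AWS_PER_DEST_LIMIT = 55_000  # connections per unique destination, per gateway IP


def az_headroom(flows_by_az, ips_per_gateway=1):
    """Remaining connections to one destination per AZ; negative means starved."""
    ceiling = AWS_PER_DEST_LIMIT * ips_per_gateway
    return {az: ceiling - flows for az, flows in flows_by_az.items()}


# Hypothetical concurrent flows to a single destination, by AZ.
demand = {"us-east-1a": 61_000, "us-east-1b": 9_000, "us-east-1c": 8_000}

print(sum(demand.values()))  # 78000, well under 3 x 55,000 in aggregate
print(az_headroom(demand))
# {'us-east-1a': -6000, 'us-east-1b': 46000, 'us-east-1c': 47000}
```

Only `us-east-1a` is short, and nothing in the other two zones can help it, because each gateway's pool is independent.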

5. Remediation 1 — multiply the port pool with more public IPs

The most direct fix: more public IPs means more 5-tuple space, because each IP is a distinct source address and therefore a fresh full range of source ports per destination.

Azure — attach multiple public IPs (or a prefix). A Standard NAT Gateway supports up to 16 public IPv4 addresses. Each adds 64,512 ports, so a fully populated gateway provides up to ~1,032,192 SNAT ports. A public IP prefix is the clean way to do this: one contiguous block, and the gateway uses every address in it.

# A /28 prefix = 16 contiguous public IPs, one resource to manage
az network public-ip prefix create \
  -g rg-egress -n pipfx-natgw \
  --length 28 --location eastus2 --sku Standard

# Create the gateway in the same region as the prefix
az network nat gateway create \
  -g rg-egress -n natgw-prod \
  --location eastus2 \
  --public-ip-prefixes pipfx-natgw \
  --idle-timeout 4 --sku Standard

In Terraform, so the prefix size is a reviewable knob:

resource "azurerm_public_ip_prefix" "natgw" {
  name                = "pipfx-natgw"
  resource_group_name = azurerm_resource_group.egress.name
  location            = azurerm_resource_group.egress.location
  prefix_length       = 28          # 16 IPs -> ~1,032,192 SNAT ports
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "prod" {
  name                    = "natgw-prod"
  resource_group_name     = azurerm_resource_group.egress.name
  location                = azurerm_resource_group.egress.location
  sku_name                = "Standard"
  idle_timeout_in_minutes = 4        # default and minimum; higher values hold idle ports longer
}

resource "azurerm_nat_gateway_public_ip_prefix_association" "prod" {
  nat_gateway_id      = azurerm_nat_gateway.prod.id
  public_ip_prefix_id = azurerm_public_ip_prefix.natgw.id
}

resource "azurerm_subnet_nat_gateway_association" "app" {
  subnet_id      = azurerm_subnet.app.id
  nat_gateway_id = azurerm_nat_gateway.prod.id
}

idle_timeout_in_minutes controls how long an idle flow keeps its SNAT port (default 4, range 4-120). The default is already the minimum, so the lever here is not lowering it but refusing to raise it: if your traffic is short bursts of many connections, keeping it at 4 minutes returns idle ports to the pool as fast as the platform allows. It will not save you from a true keep-alive bug, but it avoids giving away headroom.

AWS — associate secondary IPs on the NAT gateway. AWS NAT Gateway supports up to 8 IPv4 addresses (1 primary + 7 secondary), and each additional IP raises the per-destination ceiling by another 55,000 — up to 440,000 simultaneous single-destination connections.

# Allocate secondary EIPs and attach them to an existing NAT gateway
EIP2=$(aws ec2 allocate-address --domain vpc --query AllocationId --output text)
EIP3=$(aws ec2 allocate-address --domain vpc --query AllocationId --output text)

aws ec2 associate-nat-gateway-address \
  --nat-gateway-id nat-0abc123def456 \
  --allocation-ids "$EIP2" "$EIP3"
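
How many IPs to attach can be estimated from the peak number of concurrent flows to the hottest destination. The sketch below uses the per-IP figures and IP caps quoted in this article; the 70% target utilization is an arbitrary assumption to leave burst headroom, not a vendor recommendation.

```python
import math

# Per-IP port budgets and IP caps as quoted in this article.
LIMITS = {
    "azure": {"ports_per_ip": 64_512, "max_ips": 16},
    "aws": {"ports_per_ip": 55_000, "max_ips": 8},
}


def public_ips_needed(cloud, peak_flows_per_destination, target_utilization=0.7):
    """IPs needed so peak flows to one destination stay under the target utilization."""
    limits = LIMITS[cloud]
    usable_per_ip = limits["ports_per_ip"] * target_utilization
    needed = max(1, math.ceil(peak_flows_per_destination / usable_per_ip))
    if needed > limits["max_ips"]:
        raise ValueError(
            f"{needed} IPs needed but {cloud} allows {limits['max_ips']} per gateway; "
            "reduce churn or move the destination to a private path"
        )
    return needed


print(public_ips_needed("azure", 120_000))  # 3
print(public_ips_needed("aws", 120_000))    # 4
```

If the function raises, more IPs alone cannot cover the demand on one gateway, and connection reuse or a private path has to carry the rest.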

Caveat that catches teams off guard: an allow-list. If the destination firewalls inbound by source IP (common with payment processors and B2B partners), every NAT IP and every address in an Azure prefix must be added to their allow-list, or you have just created a new class of intermittent failure where requests randomly egress from an un-allowed IP. Coordinate the IP list before you scale out.
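
Partners often want individual addresses rather than CIDR notation, so it is handy to expand the prefix before sending the list. A minimal sketch with the standard `ipaddress` module; the prefix shown is from a documentation range and stands in for whatever block the cloud actually assigns.

```python
import ipaddress


def allow_list_entries(prefix_cidr):
    """Every address in a public IP prefix, as strings, for a partner allow-list."""
    network = ipaddress.ip_network(prefix_cidr, strict=True)
    # Iterate the network itself (not .hosts()) so the first and last
    # addresses are included; the gateway can egress from any of them.
    return [str(addr) for addr in network]


entries = allow_list_entries("203.0.113.16/28")  # hypothetical /28 prefix
print(len(entries), entries[0], entries[-1])  # 16 203.0.113.16 203.0.113.31
```

Using `strict=True` makes a mistyped prefix (host bits set) fail loudly instead of silently expanding the wrong block.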

More IPs is the fastest mitigation, but it treats the symptom. If the app churns sockets, you are buying a bigger bucket for a leak. Fix the leak next.

6. Remediation 2 — connection reuse, keep-alive, and killing per-request sockets

The highest-leverage fix is usually free: stop opening a new connection per request. One reused, keep-alive connection holds one port and serves thousands of requests; the per-request anti-pattern burns a port and parks it in TIME_WAIT for minutes after.

Use a pooled, long-lived client and never construct a new one per call. The canonical .NET bug is new HttpClient() per request — each instance opens fresh sockets and leaks ports. Use a single shared client (or IHttpClientFactory) and cap pooled-connection lifetime so DNS changes are still honored:

// One shared handler; bounded pool; reuse across the whole process.
var handler = new SocketsHttpHandler
{
    PooledConnectionLifetime    = TimeSpan.FromMinutes(5),  // refresh for DNS
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
    MaxConnectionsPerServer     = 50,                        // bound the fan-out
};
// Register as a singleton — do NOT 'new' this per request.
services.AddSingleton(new HttpClient(handler));

The same discipline in any runtime:

Node.js: set keepAlive: true on the HTTP(S) Agent and bound maxSockets; before Node 19 the default global agent did not enable keep-alive, so idle sockets were closed instead of reused.
Python requests: use a module-level requests.Session() (pooled keep-alive) instead of bare requests.get(), which builds and tears down a connection every call.
Go: reuse one *http.Client and always drain and Close() the response body, or the connection is not returned to the pool.
Databases / Redis / any backend: size the connection pool deliberately. An over-large pool that opens hundreds of idle connections to one DB IP is itself a per-destination SNAT consumer.
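
The difference between churn and reuse can be observed directly by counting the client source ports a server sees. The sketch below is a self-contained local demonstration using only the Python standard library: it starts a throwaway HTTP server on 127.0.0.1 (an assumption for the demo, not part of any real setup) and compares one connection per request with a single reused connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_ports = []  # client source port of every request, recorded by the server


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # required for the server to keep connections alive

    def do_GET(self):
        seen_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo(n=5):
    """Return (distinct ports with a socket per request, distinct ports with reuse)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Anti-pattern: a brand-new connection for every request.
        seen_ports.clear()
        for _ in range(n):
            conn = http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", "/")
            conn.getresponse().read()
            conn.close()
        churned = len(set(seen_ports))

        # Reuse: one connection carries every request.
        seen_ports.clear()
        conn = http.client.HTTPConnection("127.0.0.1", port)
        for _ in range(n):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
        conn.close()
        reused = len(set(seen_ports))
    finally:
        server.shutdown()
        server.server_close()
    return churned, reused


churned, reused = run_demo()
print(f"socket per request: {churned} source ports; reused connection: {reused}")
```

Behind a NAT gateway, each of those distinct source ports would map to a SNAT port for the same destination, which is why the reused case is so much cheaper.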

Tune TIME_WAIT handling on the instances as a secondary lever. Letting new outbound connections reuse ports parked in TIME_WAIT, and widening the local ephemeral range, relieves port pressure on the host (Linux):

# Reuse TIME_WAIT sockets for new OUTBOUND connections (safe; client-side)
sysctl -w net.ipv4.tcp_tw_reuse=1
# Widen the local ephemeral port range available to the app
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

Do not reach for tcp_tw_recycle: it broke NAT’d clients badly and was removed in Linux 4.12. tcp_tw_reuse=1 is the correct, supported knob for outbound-heavy hosts. And remember: these tune the instance’s local ports. The NAT gateway’s per-destination pool is the real ceiling, but reducing churn upstream directly reduces pressure on it.

After this change, the instance-side ss count to the hot destination should collapse from tens of thousands of churning sockets to a small, stable set of reused ones. That collapse is the proof the fix landed.

7. Remediation 3 — remove hot destinations from SNAT entirely with Private Link

The cleanest fix is to stop translating at all. If the hot destination is a service that supports private connectivity, traffic that rides a private path is not SNAT’d through the NAT gateway — so it consumes zero ports from the per-destination pool, no matter how many flows you open.

Azure — Private Endpoint / Service Endpoint. A Private Endpoint projects a NIC for the PaaS resource (Storage, SQL, Key Vault, a Private Link Service) into your VNet with a private IP. Connections go over the Microsoft backbone via that private IP and never touch the NAT gateway, removing that destination from the SNAT equation completely.

resource "azurerm_private_endpoint" "sql" {
  name                = "pe-sql-payments"
  location            = azurerm_resource_group.app.location
  resource_group_name = azurerm_resource_group.app.name
  subnet_id           = azurerm_subnet.privateendpoints.id

  private_service_connection {
    name                           = "psc-sql-payments"
    private_connection_resource_id = azurerm_mssql_server.payments.id
    subresource_names              = ["sqlServer"]
    is_manual_connection           = false
  }
}

A lighter-weight option for first-party services is a Service Endpoint, which keeps traffic to that service on the Azure backbone and off the public path. Private Endpoint is preferred for new designs (it gives a private IP and works with on-prem), but a service endpoint on the subnet is a fast way to pull, say, all Storage traffic out of the NAT pool.

AWS — VPC Endpoints. A Gateway VPC Endpoint (S3, DynamoDB) routes that traffic through a route-table entry, completely bypassing the NAT gateway — and it is free, so there is no excuse to NAT S3 traffic. An Interface VPC Endpoint (PrivateLink, powered by an ENI in your subnet) does the same for most other AWS services and partner SaaS that publish a PrivateLink service.

# Gateway endpoint for S3 — removes all S3 egress from the NAT gateway, free
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0aaa111 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0priv1 rtb-0priv2

The architectural point: NAT gateways are for the open internet. Every destination you can reach privately — your own PaaS, AWS services, a SaaS partner on PrivateLink — should leave the SNAT pool, both to kill exhaustion and to drop NAT data-processing cost. Often a single S3 Gateway Endpoint erases the largest consumer of NAT bandwidth in the whole account.

8. Load test to reproduce exhaustion and confirm headroom

Never declare victory on a fix you have not stressed. Reproduce exhaustion deliberately, then re-run after each remediation and watch the failure metric flatline.

Open many concurrent, non-reused connections to a single destination from inside the subnet. A short script that deliberately does the wrong thing (one socket per request) drives ports up fast:

# Hammer one destination with fresh connections; watch ports climb.
# Run from a VM/instance whose egress is the NAT gateway under test.
export DEST="https://api.provider.example/health"
# Wrap curl in sh -c so the failure message prints per request,
# not once for the whole pipeline.
seq 1 60000 | xargs -P 800 -I{} sh -c \
  'curl -s -o /dev/null --connect-timeout 5 \
        -H "Connection: close" "$DEST" \
   || echo "connect failure at $(date +%T)"'

While it runs, watch the authoritative counter in another terminal:

# AWS: this should climb from 0 under load on an undersized gateway
watch -n 5 'aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway --metric-name ErrorPortAllocation \
  --dimensions Name=NatGatewayId,Value=nat-0abc123def456 \
  --start-time $(date -u -d "-5 min" +%FT%TZ) \
  --end-time   $(date -u +%FT%TZ) \
  --period 60 --statistics Sum --query "Datapoints[].Sum"'

On an undersized gateway, ErrorPortAllocation (AWS) or SNATConnectionCount Failed (Azure) climbs off zero and the curl loop starts printing connect failures. Apply a remediation (more IPs or a private endpoint), re-run the identical test, and the counter should stay pinned at zero while curl errors disappear. To validate the connection-reuse fix, drive the same request volume through the fixed application client instead: each curl process opens its own socket, so changing the Connection header in this loop does not make it reuse connections. That before/after under the same load is your headroom proof.

Drive the load from inside the subnet (an EC2 instance, an AKS pod, a VM whose default route is the NAT gateway), not from your laptop. Laptop traffic does not traverse the gateway and proves nothing about its pool.

Verify

Run these after remediation to confirm the fix end to end, not just that an alert went quiet.

# AWS — ErrorPortAllocation must be 0 across the test window
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway --metric-name ErrorPortAllocation \
  --dimensions Name=NatGatewayId,Value=nat-0abc123def456 \
  --start-time 2026-06-08T12:00:00Z --end-time 2026-06-08T13:00:00Z \
  --period 60 --statistics Sum \
  --query 'Datapoints[?Sum>`0`]'        # empty result == healthy

# AWS — confirm the extra IPs actually attached
aws ec2 describe-nat-gateways --nat-gateway-ids nat-0abc123def456 \
  --query 'NatGateways[0].NatGatewayAddresses[].PublicIp'

# Azure — failed SNAT connections must be 0
az monitor metrics list --resource "$NATGW_ID" \
  --metric "SNATConnectionCount" \
  --filter "ConnectionState eq 'Failed'" \
  --interval PT1M --aggregation Total -o table

# Azure — confirm the prefix/IP count behind the gateway
az network nat gateway show -g rg-egress -n natgw-prod \
  --query "{ips:publicIpAddresses, prefixes:publicIpPrefixes}" -o jsonc

# Instance — socket churn to the hot destination should be small & stable
ss -tan dst 203.0.113.10 | awk 'NR>1 {print $1}' | sort | uniq -c

A correct result: the failed/error counter holds at zero through the same load that previously broke it; the gateway shows the expected number of public IPs; and instance-side sockets to the hot destination are a small, reused set rather than tens of thousands in TIME_WAIT.

Enterprise scenario

A fintech platform team ran a single AKS-backed payment service behind one Azure public Load Balancer for outbound. Reconciliation jobs at the top of every hour fired thousands of short-lived HTTPS calls to a single card-network endpoint, and a slice of them timed out — but only during the hourly burst, and only to that one host. The card network’s logs showed no inbound failures. Engineers spent two days suspecting TLS handshake limits and the partner’s rate limiting.

The constraint surfaced in the metrics: the public Load Balancer was the outbound path, and its default SNAT allocation was ~1,024 ports per node. The reconciliation pods on a single node needed several thousand concurrent flows to that one destination during the burst, so they hit the per-node, per-destination ceiling while 60,000-plus ports sat unused elsewhere. The pre-allocated, per-instance model was the entire problem — raw volume was never the issue, distribution was.

The fix was two moves. First, attach a NAT Gateway with a /28 public IP prefix to the AKS subnet so outbound used a dynamic, subnet-wide pool (~1,032,192 ports across 16 IPs) instead of a rigid 1,024-per-node slice. Second, fix the app to reuse a single pooled HttpClient instead of constructing one per call, which collapsed TIME_WAIT churn. The constraint that made move one non-trivial: the card network allow-listed inbound by source IP, so all 16 prefix addresses had to be registered with the partner before cutover or the bursts would have failed from un-allowed IPs.

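# Subnet egress switched from Load Balancer SNAT to NAT Gateway with a 16-IP prefix
az network nat gateway create -g rg-egress -n natgw-aks \
  --public-ip-prefixes pipfx-aks-natgw --sku Standard --idle-timeout 4

# The gateway and the VNet are in different resource groups, so pass the full ID
NATGW_AKS_ID=$(az network nat gateway show \
  -g rg-egress -n natgw-aks --query id -o tsv)

az network vnet subnet update -g rg-aks \
  --vnet-name vnet-aks -n snet-pods \
  --nat-gateway "$NATGW_AKS_ID"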

After cutover, SNATConnectionCount Failed held at zero through the hourly reconciliation spike, and the per-node port ceiling stopped being a ceiling because every node now drew from the shared pool. The runbook lesson: a public Load Balancer is an outbound anti-pattern for bursty, per-destination workloads — its static per-instance budget is exactly the wrong shape — and NAT Gateway’s dynamic pool plus connection reuse is the correct answer.

Pitfalls and next steps

The failures that recur most often, in order:

Blaming the destination. The packets never left your NAT gateway; the endpoint’s logs are clean because nothing reached it. Check your own port-allocation metric first.
Scaling IPs while the app still churns sockets. More IPs is a bigger bucket; a per-request-socket app is a hole in it. You will exhaust the larger pool later, under a slightly heavier burst. Fix connection reuse in tandem.
Forgetting the destination allow-list. You add IPs (or an Azure prefix) and now a fraction of requests egress from an un-allowed source IP, converting one intermittent failure into a different, harder-to-spot one. Register every IP first.
Leaving Load Balancer outbound rules alongside NAT Gateway. They do not stack. NAT Gateway takes precedence; conflicting outbound rules just confuse the next engineer reading the config.
Reaching for tcp_tw_recycle. Removed from modern kernels and historically catastrophic behind NAT. Use tcp_tw_reuse=1.
Cross-AZ skew on AWS. One NAT gateway per AZ is correct for cost and latency, but each AZ has its own 55,000-per-destination ceiling. A load balancer that piles clients into one AZ can exhaust that AZ while others idle.

Next, make exhaustion structurally impossible to miss: alert on ErrorPortAllocation > 0 and SNATConnectionCount Failed > 0 as page-worthy, ship a quarterly load test that drives the old failure and proves zero, and audit egress for any first-party or SaaS destination that could move to a private endpoint. The goal is that “intermittent timeouts to the provider” stops being a multi-day investigation and becomes a glance at one graph.