Quick take — A production azurerm ~> 4.0 module for Azure NAT Gateway: zonal deployment, multiple static public IPs and an IP prefix for SNAT port scale, idle-timeout tuning, and subnet association. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
```hcl
# azurerm 4.x needs a subscription ID: set subscription_id here or export ARM_SUBSCRIPTION_ID.
provider "azurerm" {
  features {}
}

module "nat_gateway" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-nat-gateway?ref=v1.0.0"

  name                = "..." # NAT gateway name; also the prefix for its public IP and…
  resource_group_name = "..." # Resource group for the NAT gateway and its public IP re…
  location            = "..." # Azure region; must match the associated subnets' VNet r…
}
```
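Then run terraform init && terraform apply. Every other input has a default (see Inputs below to override behaviour), but note that subnet_ids defaults to an empty list, so no subnet egresses through the gateway until you pass at least one subnet ID.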
What this module is
Azure NAT Gateway is a fully managed, highly resilient network address translation service that gives resources in a private subnet deterministic, scalable outbound connectivity to the internet without exposing them with public IPs. When you attach a NAT Gateway to a subnet, all outbound (egress) flows from that subnet are source-NAT’d (SNAT) through the NAT Gateway’s static public IP addresses. Inbound-initiated connections are not allowed, so it is purely an egress primitive — the opposite of a Load Balancer’s inbound role.
The reason NAT Gateway matters in production is SNAT port exhaustion. Default outbound access and Load Balancer outbound rules statically pre-allocate a small, fixed number of SNAT ports per VM, which collapses under chatty microservices that open many concurrent connections to the same destination (a database, an API gateway, a package registry). NAT Gateway instead allocates SNAT ports on demand from a shared pool across the whole subnet, and each attached public IP contributes 64,512 SNAT ports. Attaching a /28 public IP prefix (16 addresses) yields over a million ports — which is why a NAT Gateway with an IP prefix is the standard fix for “intermittent outbound connection timeouts at scale.”
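The port arithmetic above can be checked with a few Terraform locals. This is only a worked calculation, not part of the module; the local and output names are illustrative. The 64,512 figure is the 65,536 ports of a single address minus the 1,024 reserved low ports.

```hcl
locals {
  snat_ports_per_ip = 64512                            # per attached public IP
  prefix_length     = 28                               # a /28 public IP prefix
  prefix_ip_count   = pow(2, 32 - local.prefix_length) # 2^4 = 16 addresses
  total_snat_ports  = local.prefix_ip_count * local.snat_ports_per_ip
}

output "total_snat_ports" {
  value = local.total_snat_ports # 16 * 64,512 = 1,032,192
}
```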
Wrapping this in a reusable Terraform module is worth it because a correct NAT Gateway is never a single resource. You need the gateway itself plus the public IP(s) and/or public IP prefix it draws from, the explicit azurerm_subnet_nat_gateway_association for every subnet that should route through it, a sensible idle_timeout_in_minutes, and zone pinning that is consistent with the public IP SKU. This module makes all of that var-driven and prevents the most common production mistakes: forgetting the subnet association (so nothing actually egresses through it), mixing a zonal public IP with a non-zonal gateway, or running out of SNAT ports because only one IP was attached.
When to use it
- A private subnet (AKS node pools, Container Apps, VMSS, Functions on a VNet) needs outbound internet access but the workloads must not have public IPs.
- You are hitting SNAT port exhaustion (intermittent connection timed out / ETIMEDOUT errors to external endpoints under load) and need a large, on-demand SNAT port pool.
- A downstream partner, SaaS provider, or on-prem firewall requires you to allow-list a small, stable set of egress public IPs (a NAT Gateway with a public IP prefix gives you a known CIDR).
- You want outbound that survives the deprecation of default outbound access (Azure is retiring implicit outbound for new VNets from 2025 onward), so every new subnet needs an explicit egress method.
- You want zonal egress pinned to a single Availability Zone for a zonal workload, or one NAT Gateway per zone for a zone-redundant design.
Do not use it for inbound exposure (use a Load Balancer / Application Gateway), for east-west or VNet-to-VNet traffic, or where all egress must be force-tunneled through a firewall via UDR 0.0.0.0/0 — though NAT Gateway and Azure Firewall can be combined, with the firewall sending allowed traffic out via the NAT Gateway.
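As a rough sketch of the firewall combination mentioned above, the gateway can be associated with the firewall's own subnet so that traffic the firewall allows leaves through the NAT Gateway addresses. The resource names azurerm_resource_group.network and azurerm_subnet.firewall are hypothetical and assumed to exist already (the latter being the subnet named AzureFirewallSubnet); the firewall, its policy and the UDRs are out of scope here.

```hcl
module "nat_gateway_firewall" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-nat-gateway?ref=v1.0.0"

  name                = "ngw-prod-cin-fw"
  resource_group_name = azurerm_resource_group.network.name # hypothetical resource group
  location            = "centralindia"

  # Hypothetical reference to the existing AzureFirewallSubnet.
  subnet_ids = [azurerm_subnet.firewall.id]

  # Egress through a /28 prefix only; no discrete public IPs.
  public_ip_count         = 0
  public_ip_prefix_length = 28
}
```

Workload subnets keep their 0.0.0.0/0 route to the firewall; only the firewall subnet is bound to the gateway.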
Module structure
```text
terraform-module-azure-nat-gateway/
├── versions.tf   # provider + Terraform version pins
├── main.tf       # NAT gateway, public IP(s), IP prefix, subnet associations
├── variables.tf  # var-driven inputs with validation
└── outputs.tf    # ids/names + the public IPs to allow-list
```
versions.tf
```hcl
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}
```
main.tf
```hcl
locals {
  # Number of discrete (non-prefix) Standard public IPs to create and attach.
  pip_names = [
    for i in range(var.public_ip_count) :
    format("%s-pip-%02d", var.name, i + 1)
  ]

  # Zone list used for both the gateway and its public IPs so they stay aligned.
  # An empty list ("regional"/no-zone) is valid and means non-zonal.
  zones = var.availability_zone == null ? [] : [var.availability_zone]
}

# One or more Standard, static public IPs. Each contributes 64,512 SNAT ports.
resource "azurerm_public_ip" "this" {
  for_each = toset(local.pip_names)

  name                 = each.value
  resource_group_name  = var.resource_group_name
  location             = var.location
  allocation_method    = "Static"
  sku                  = "Standard"
  sku_tier             = "Regional"
  zones                = local.zones
  ddos_protection_mode = "VirtualNetworkInherited"
  tags                 = var.tags
}

# Optional public IP prefix. A contiguous CIDR is the clean way to give partners
# a stable egress range and to scale SNAT ports far beyond individual IPs.
resource "azurerm_public_ip_prefix" "this" {
  count = var.public_ip_prefix_length == null ? 0 : 1

  name                = "${var.name}-pipprefix"
  resource_group_name = var.resource_group_name
  location            = var.location
  prefix_length       = var.public_ip_prefix_length
  sku                 = "Standard"
  ip_version          = "IPv4"
  zones               = local.zones
  tags                = var.tags
}

resource "azurerm_nat_gateway" "this" {
  name                    = var.name
  resource_group_name     = var.resource_group_name
  location                = var.location
  sku_name                = "Standard"
  idle_timeout_in_minutes = var.idle_timeout_in_minutes

  # A NAT gateway is zonal: either pinned to a single zone or non-zonal.
  zones = local.zones
  tags  = var.tags
}

# Associate each discrete public IP with the NAT gateway.
resource "azurerm_nat_gateway_public_ip_association" "this" {
  for_each = azurerm_public_ip.this

  nat_gateway_id       = azurerm_nat_gateway.this.id
  public_ip_address_id = each.value.id
}

# Associate the optional public IP prefix with the NAT gateway.
resource "azurerm_nat_gateway_public_ip_prefix_association" "this" {
  count = var.public_ip_prefix_length == null ? 0 : 1

  nat_gateway_id      = azurerm_nat_gateway.this.id
  public_ip_prefix_id = azurerm_public_ip_prefix.this[0].id
}

# Bind the NAT gateway to every subnet that should egress through it.
# Without this association the gateway exists but routes no traffic.
resource "azurerm_subnet_nat_gateway_association" "this" {
  for_each = toset(var.subnet_ids)

  subnet_id      = each.value
  nat_gateway_id = azurerm_nat_gateway.this.id
}
```
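The per-variable validations cannot see each other, so nothing stops a caller from combining public_ip_count with a prefix that together exceed what one gateway accepts. The sketch below is a possible guard, not something the published module contains. It assumes the documented limit of 16 public IP addresses per NAT gateway (discrete IPs plus prefix addresses) and uses the built-in terraform_data resource (Terraform 1.4+) purely as a carrier for a precondition. The names prefix_ip_count, total_ip_count and ip_budget are hypothetical.

```hcl
locals {
  # Addresses contributed by the optional prefix: a /28 is 2^(32-28) = 16.
  prefix_ip_count = var.public_ip_prefix_length == null ? 0 : pow(2, 32 - var.public_ip_prefix_length)

  # Discrete public IPs plus prefix addresses.
  total_ip_count = var.public_ip_count + local.prefix_ip_count
}

# Creates no Azure resource; it only fails the plan when the address budget is wrong.
resource "terraform_data" "ip_budget" {
  lifecycle {
    precondition {
      # Assumed limit: 1 to 16 addresses per NAT gateway in total.
      condition     = local.total_ip_count >= 1 && local.total_ip_count <= 16
      error_message = "public_ip_count plus the prefix size must total between 1 and 16 addresses."
    }
  }
}
```

With this in place, public_ip_count = 1 together with public_ip_prefix_length = 28 (17 addresses) is rejected at plan time rather than during apply.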
variables.tf
```hcl
variable "name" {
  type        = string
  description = "Name of the NAT gateway. Also used as the prefix for its public IP and prefix resources."

  validation {
    condition     = can(regex("^[a-zA-Z0-9][a-zA-Z0-9._-]{0,78}[a-zA-Z0-9_]$", var.name))
    error_message = "name must be 2-80 chars, start alphanumeric, and contain only letters, numbers, '.', '_' or '-'."
  }
}

variable "resource_group_name" {
  type        = string
  description = "Resource group in which to create the NAT gateway and its public IP resources."
}

variable "location" {
  type        = string
  description = "Azure region (e.g. centralindia, eastus). Must match the subnets' VNet region."
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnet resource IDs to associate with this NAT gateway. A subnet can only be bound to one NAT gateway, and that NAT gateway must be in the same region and subscription as the VNet."
  default     = []
}

variable "public_ip_count" {
  type        = number
  description = "Number of discrete Standard static public IPs to create and attach. Each adds 64,512 SNAT ports. Set to 0 if you only use a public IP prefix. A NAT gateway accepts at most 16 addresses in total, counting those inside a prefix."
  default     = 1

  validation {
    condition     = var.public_ip_count >= 0 && var.public_ip_count <= 16
    error_message = "public_ip_count must be between 0 and 16."
  }
}

variable "public_ip_prefix_length" {
  type        = number
  description = "Prefix length for an optional public IP prefix (e.g. 28 for a /28 = 16 IPs). null disables the prefix. Azure allows /28 (16) down to /31 (2) for NAT gateway."
  default     = null

  validation {
    # Conditional form keeps the check null-safe on Terraform versions where || does not short-circuit.
    condition = (
      var.public_ip_prefix_length == null
      ? true
      : (var.public_ip_prefix_length >= 28 && var.public_ip_prefix_length <= 31)
    )
    error_message = "public_ip_prefix_length must be between 28 and 31 (a /28 to /31), or null."
  }
}

variable "idle_timeout_in_minutes" {
  type        = number
  description = "Idle timeout for outbound SNAT flows. Lower values release SNAT ports faster (helps short-lived, high-volume connections); higher values keep long-lived flows alive."
  default     = 4

  validation {
    condition     = var.idle_timeout_in_minutes >= 4 && var.idle_timeout_in_minutes <= 120
    error_message = "idle_timeout_in_minutes must be between 4 and 120."
  }
}

variable "availability_zone" {
  type        = string
  description = "Single Availability Zone to pin the NAT gateway and its public IPs to (e.g. \"1\"). null/omit for a non-zonal deployment. For zone-redundant egress, deploy one module instance per zone."
  default     = null

  validation {
    # Same null-safe conditional form as public_ip_prefix_length.
    condition     = var.availability_zone == null ? true : contains(["1", "2", "3"], var.availability_zone)
    error_message = "availability_zone must be one of \"1\", \"2\", \"3\", or null."
  }
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the NAT gateway, public IPs and IP prefix."
  default     = {}
}
```
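One caveat with subnet_ids as a list: the module feeds it to for_each via toset, and Terraform must know every for_each key at plan time. If a subnet is created in the same apply as the gateway, its ID is still unknown and the plan fails with an "Invalid for_each argument" error. A possible alternative, sketched below, keys the association by a static name so that only the values may be unknown. The subnets variable is hypothetical and would change the module's interface; it is not part of v1.0.0.

```hcl
variable "subnets" {
  type        = map(string)
  description = "Hypothetical replacement for subnet_ids: static key => subnet resource ID."
  default     = {}
}

# Keys come from the caller's literal map, so they are known at plan time
# even when the subnet IDs themselves are only known after apply.
resource "azurerm_subnet_nat_gateway_association" "this" {
  for_each = var.subnets

  subnet_id      = each.value
  nat_gateway_id = azurerm_nat_gateway.this.id
}
```

A caller would then pass something like subnets = { aks_nodes = azurerm_subnet.aks_nodes.id }. With the list-based input as published, the simpler workaround is to create the subnets in an earlier apply or a separate state.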
outputs.tf
```hcl
output "id" {
  description = "Resource ID of the NAT gateway."
  value       = azurerm_nat_gateway.this.id
}

output "name" {
  description = "Name of the NAT gateway."
  value       = azurerm_nat_gateway.this.name
}

output "resource_guid" {
  description = "The resource GUID of the NAT gateway."
  value       = azurerm_nat_gateway.this.resource_guid
}

output "public_ip_ids" {
  description = "Resource IDs of the discrete public IPs attached to the NAT gateway."
  value       = [for pip in azurerm_public_ip.this : pip.id]
}

output "public_ip_addresses" {
  description = "The egress IP addresses of the discrete public IPs; these are the values to allow-list on partner firewalls."
  value       = [for pip in azurerm_public_ip.this : pip.ip_address]
}

output "public_ip_prefix_id" {
  description = "Resource ID of the public IP prefix, or null if none was created."
  value       = try(azurerm_public_ip_prefix.this[0].id, null)
}

output "public_ip_prefix_cidr" {
  description = "The CIDR of the public IP prefix to allow-list, or null if none was created."
  value       = try(azurerm_public_ip_prefix.this[0].ip_prefix, null)
}

output "associated_subnet_ids" {
  description = "Subnet IDs associated with this NAT gateway for outbound SNAT."
  value       = [for assoc in azurerm_subnet_nat_gateway_association.this : assoc.subnet_id]
}
```
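When both discrete IPs and a prefix are attached, a partner needs both sets of addresses. The sketch below merges them into one list of CIDRs. The output name egress_cidrs is hypothetical and not part of the published module; it only combines attributes the module already reads.

```hcl
output "egress_cidrs" {
  description = "Hypothetical convenience output: every egress address as a CIDR (discrete IPs as /32, plus the prefix if one exists)."
  value = concat(
    # Each discrete public IP becomes a single-address CIDR.
    [for pip in azurerm_public_ip.this : "${pip.ip_address}/32"],
    # The splat yields an empty list when the prefix is disabled (count = 0).
    azurerm_public_ip_prefix.this[*].ip_prefix,
  )
}
```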
How to use it
This example gives an AKS node subnet and a jobs subnet deterministic egress through a /28 public IP prefix (a stable CIDR a partner can allow-list), pinned to zone 1. It sets public_ip_count to 0 because the /28 already accounts for the 16 addresses a single NAT gateway accepts, and it keeps the 4-minute idle timeout (the minimum) so SNAT ports are recycled quickly under load.

```hcl
module "nat_gateway" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-nat-gateway?ref=v1.0.0"

  name                = "ngw-prod-cin-aks"
  resource_group_name = azurerm_resource_group.network.name
  location            = "centralindia"

  subnet_ids = [
    azurerm_subnet.aks_nodes.id,
    azurerm_subnet.jobs.id,
  ]

  public_ip_count         = 0  # the /28 below already uses the 16-address limit
  public_ip_prefix_length = 28 # /28 = 16 IPs => ~1M SNAT ports + a stable egress CIDR
  idle_timeout_in_minutes = 4  # minimum (and default): releases idle SNAT ports fastest
  availability_zone       = "1"

  tags = {
    environment = "prod"
    owner       = "platform-network"
    cost_center = "cc-1042"
  }
}

# Downstream: allow outbound HTTPS from the node subnet to the partner API.
# 203.0.113.10 is a documentation placeholder for the partner's address.
resource "azurerm_network_security_rule" "partner_api_egress" {
  name                        = "allow-egress-partner-api"
  resource_group_name         = azurerm_resource_group.network.name
  network_security_group_name = azurerm_network_security_group.aks_nodes.name
  priority                    = 200
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "*"
  destination_address_prefix  = "203.0.113.10/32"
}

# Surface the egress CIDR straight from the module output, so the partner's
# allow-list is never built from hand-copied IPs.
output "egress_cidr_to_share" {
  description = "Send this to the partner to allow-list our outbound traffic."
  value       = module.nat_gateway.public_ip_prefix_cidr
}
```
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
```hcl
remote_state {
  backend  = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state bucket/container + key per path...
  }
}
```
2. Module config — live/prod/nat_gateway/terragrunt.hcl:
```hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-nat-gateway?ref=v1.0.0"
}

inputs = {
  name                = "..."
  resource_group_name = "..."
  location            = "..."
}
```
3. Deploy one environment, or roll out all modules together:
```shell
cd live/prod/nat_gateway && terragrunt apply   # this module only
cd .. && terragrunt run-all apply              # now in live/prod: every module under it
```
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
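The root config shown earlier only generates the backend. For the provider to live in the root as well, a Terragrunt generate block can write it into every module directory. The following is a minimal sketch assuming the same live/terragrunt.hcl root and that the subscription is supplied through the ARM_SUBSCRIPTION_ID environment variable; adjust the provider arguments to your own authentication setup.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt" # only overwrite files Terragrunt itself generated
  contents  = <<EOF
provider "azurerm" {
  features {}
}
EOF
}
```

Because the published module declares no provider block of its own, the generated provider.tf is the only place the provider is configured.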
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | — | Yes | NAT gateway name; also the prefix for its public IP and prefix resources. Validated 2-80 chars. |
| resource_group_name | string | — | Yes | Resource group for the NAT gateway and its public IP resources. |
| location | string | — | Yes | Azure region; must match the associated subnets' VNet region. |
| subnet_ids | list(string) | [] | No | Subnet IDs to associate for outbound SNAT. Each subnet binds to only one NAT gateway. |
| public_ip_count | number | 1 | No | Count of discrete Standard static public IPs to create/attach (0-16). Each adds 64,512 SNAT ports. |
| public_ip_prefix_length | number | null | No | Prefix length (28-31) for an optional public IP prefix, or null to disable. |
| idle_timeout_in_minutes | number | 4 | No | Outbound SNAT idle timeout (4-120). Lower recycles ports faster under high connection churn. |
| availability_zone | string | null | No | Single zone ("1"/"2"/"3") to pin the gateway and its IPs, or null for non-zonal. |
| tags | map(string) | {} | No | Tags applied to the NAT gateway, public IPs and IP prefix. |
Outputs
| Name | Description |
|---|---|
| id | Resource ID of the NAT gateway. |
| name | Name of the NAT gateway. |
| resource_guid | The resource GUID of the NAT gateway. |
| public_ip_ids | Resource IDs of the discrete public IPs attached to the NAT gateway. |
| public_ip_addresses | Egress IP addresses of the discrete public IPs; these are the values to allow-list on partner firewalls. |
| public_ip_prefix_id | Resource ID of the public IP prefix, or null if none was created. |
| public_ip_prefix_cidr | CIDR of the public IP prefix to allow-list, or null if none was created. |
| associated_subnet_ids | Subnet IDs associated with this NAT gateway for outbound SNAT. |
Enterprise scenario
A fintech platform team runs a multi-tenant AKS cluster in centralindia whose pods call a payments partner that only accepts traffic from a pre-registered IP range. Under peak settlement load the node pool was hitting SNAT port exhaustion against the partner's single API endpoint, causing sporadic ETIMEDOUT failures and retries. The team deployed this module once per Availability Zone with a /28 public IP prefix, giving each zone's node subnet roughly a million on-demand SNAT ports and a stable 16-address egress CIDR per zone, which the partner allow-listed once. Egress timeouts disappeared, and because each CIDR is now a Terraform output, onboarding the next partner is a one-line firewall change rather than a chase for "what are our outbound IPs?"
Best practices
- Always attach more than one IP's worth of ports at scale. A single public IP caps you at 64,512 SNAT ports; for chatty workloads attach a /28 public IP prefix or multiple public IPs so the on-demand pool does not exhaust against a small set of destinations.
- Keep the gateway and public IPs zone-aligned. A NAT gateway is a zonal resource: pin the gateway and every attached Standard public IP to the same zone, and deploy one module instance per zone for zone-redundant egress instead of expecting one gateway to span zones.
- Tune idle_timeout_in_minutes to the traffic shape. Short-lived, high-churn outbound calls reclaim ports fastest at the 4-minute default, which is also the minimum; only raise it for genuinely long-lived flows, since a high timeout holds ports open and works against you under exhaustion.
- Prefer a public IP prefix for partner allow-lists. A contiguous CIDR is stable and far easier for a partner firewall to accept than a growing list of individual addresses; expose it via the public_ip_prefix_cidr output and reference that downstream rather than hard-coding IPs.
- Don't forget the subnet association, and remember it's exclusive. The gateway only does work once a subnet is associated, and a subnet can belong to just one NAT gateway; Basic-SKU load balancers and Basic public IPs on the same subnet are not supported alongside it, so make NAT Gateway the single egress method.
- Name and tag for cost attribution. NAT Gateway bills an hourly gateway charge plus per-GB data processed, so use a clear convention like ngw-<env>-<region>-<workload> and consistent cost_center / owner tags to keep egress spend traceable per team.
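The "one module instance per zone" advice can be expressed with for_each on the module block. This is a sketch under assumptions: the three azurerm_subnet resources and azurerm_resource_group.network are hypothetical and assumed to exist already, with each subnet hosting the workload of one zone.

```hcl
locals {
  # Hypothetical mapping: zone => the subnet that hosts that zone's workload.
  zonal_subnets = {
    "1" = azurerm_subnet.aks_zone1.id
    "2" = azurerm_subnet.aks_zone2.id
    "3" = azurerm_subnet.aks_zone3.id
  }
}

module "nat_gateway" {
  for_each = local.zonal_subnets
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-nat-gateway?ref=v1.0.0"

  name                = "ngw-prod-cin-aks-z${each.key}"
  resource_group_name = azurerm_resource_group.network.name
  location            = "centralindia"

  subnet_ids        = [each.value]
  availability_zone = each.key # gateway and its prefix pinned to the same zone

  public_ip_count         = 0
  public_ip_prefix_length = 28
}

# One egress CIDR per zone; a partner needs all of them on its allow-list.
output "egress_cidrs_by_zone" {
  value = { for zone, ngw in module.nat_gateway : zone => ngw.public_ip_prefix_cidr }
}
```

Losing one zone then removes only that zone's gateway and subnet from service, while the other zones keep their own egress path.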