You have an internal API that another team, or another company, needs to reach. The reflex is to peer the VPCs or hang both off a Transit Gateway. Both grant network-layer reachability, and both fall apart the moment the consumer’s 10.0.0.0/16 collides with yours, which across a large estate it eventually does. PrivateLink solves a narrower problem and solves it cleanly: it exposes one service across a trust boundary as an endpoint ENI in each consumer subnet you choose. No routes, no transitive reachability, no CIDR negotiation. This is how to build the provider and consumer sides correctly, wire up private DNS, and keep it observable.
When an endpoint service is the right call
PrivateLink, peering, and Transit Gateway are not interchangeable. Pick by what you are actually sharing.
| Pattern | What it connects | CIDR overlap | Direction | Bills per GB |
|---|---|---|---|---|
| VPC peering | Two VPCs, full IP reachability | Must not overlap | Bidirectional | No (intra-region) |
| Transit Gateway | Many VPCs/accounts, policy-routed | Must not overlap | Bidirectional | Yes |
| PrivateLink | One service behind an NLB/GWLB | Irrelevant | Unidirectional (consumer to provider) | Yes |
The deciding question is reachability. Peering and TGW give the consumer a route to your network; a misconfigured security group or a curious operator can then reach anything routable. PrivateLink gives the consumer a route to one load balancer, full stop. Traffic is initiated only from consumer to provider, it never traverses the public internet, and because connectivity is an endpoint rather than a route, the two VPCs can use identical address space. That last property is why PrivateLink is the default for multi-tenant SaaS on AWS and for any “expose this API to 200 internal accounts” problem.
Mental model: peering and TGW are routing. PrivateLink is publishing. If you would put the thing behind a load balancer and a DNS name anyway, publish it.
The cost is that it is one-directional and one-service-per-endpoint, and it requires a Network Load Balancer (or Gateway Load Balancer) in front of the workload. If you need bidirectional, everything-talks-to-everything connectivity, this is the wrong tool — reach for TGW.
Step 1 — Front the service with an NLB (provider account)
An endpoint service sits on top of an NLB (or GWLB). We will use an NLB. It must be internal — PrivateLink targets the load balancer’s private addresses, not an internet-facing scheme. Register your service’s targets (instances, IPs, or an ALB as a target if you need L7 routing behind it) in a target group.
resource "aws_lb" "svc" {
name = "payments-svc-nlb"
internal = true
load_balancer_type = "network"
subnets = var.provider_subnet_ids # one per AZ you publish
enable_cross_zone_load_balancing = true
}
resource "aws_lb_target_group" "svc" {
name = "payments-svc-tg"
port = 8443
protocol = "TCP"
vpc_id = var.provider_vpc_id
target_type = "ip"
health_check {
protocol = "TCP"
port = "8443"
healthy_threshold = 3
unhealthy_threshold = 3
interval = 10
}
}
resource "aws_lb_listener" "svc" {
load_balancer_arn = aws_lb.svc.arn
port = 8443
protocol = "TCP"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.svc.arn
}
}
Two decisions here matter for everything downstream. First, the subnets you attach to the NLB define which AZs the service is available in. A consumer can only create an endpoint in an AZ where you have a presence. Publish in at least two, and prefer to publish in every AZ your largest consumers use. Second, turn on cross-zone load balancing. An NLB is zonal by default: an endpoint ENI in us-east-1a will only send to targets in us-east-1a unless cross-zone is enabled. With uneven target distribution that produces hot zones and surprising health-check behavior. (Note: cross-zone traffic on an NLB incurs inter-AZ data charges, which the load-balancer owner pays — budget for it.)
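For the ALB-as-target option mentioned in this step, the following is a minimal sketch rather than a definitive setup. It assumes an internal ALB named aws_lb.app with an HTTPS listener on 8443 already exists; that resource, the target group name, and the /healthz path are illustrative assumptions, not part of the configuration above.

```hcl
# Sketch: register an existing internal ALB behind the NLB.
# Hypothetical: aws_lb.app (an ALB with an HTTPS listener on 8443) is
# managed elsewhere and is not defined in this article.
resource "aws_lb_target_group" "svc_alb" {
  name        = "payments-svc-alb-tg"
  target_type = "alb"
  port        = 8443  # must match the ALB listener port
  protocol    = "TCP" # ALB-type target groups use TCP
  vpc_id      = var.provider_vpc_id

  health_check {
    protocol = "HTTPS"    # assumption: the ALB listener speaks HTTPS
    path     = "/healthz" # assumption: same path the Verify step curls
  }
}

resource "aws_lb_target_group_attachment" "svc_alb" {
  target_group_arn = aws_lb_target_group.svc_alb.arn
  target_id        = aws_lb.app.arn
  port             = 8443
  # If the ALB listener lives in this configuration, add a depends_on to it
  # so the listener exists before the attachment is created.
}
```

To use it, point the NLB listener's default_action at aws_lb_target_group.svc_alb.arn instead of the IP target group.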
Step 2 — Create the VPC endpoint service
With the NLB live, create the endpoint service that points at it. The key knob is acceptance_required.
resource "aws_vpc_endpoint_service" "payments" {
acceptance_required = true
network_load_balancer_arns = [aws_lb.svc.arn]
tags = { Name = "payments-endpoint-service" }
}
output "service_name" {
# e.g. com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0
value = aws_vpc_endpoint_service.payments.service_name
}
AWS assigns a service name of the form com.amazonaws.vpce.<region>.vpce-svc-xxxxxxxxxxxxxxxxx. This string is what consumers use to find you; it is not secret, but it is the coordination key — hand it to consumers out of band.
acceptance_required = true means every new connection lands in pendingAcceptance and waits for you to approve it. For a controlled internal platform with a known consumer list, that manual gate is worth keeping. For self-service at scale, you flip it to false and rely on the allow-list (next step) instead. Do not run with false and an open allow-list unless you genuinely intend anyone in any account to connect.
Step 3 — Allowed principals, acceptance, and notifications
Two independent controls govern who reaches your service. They stack.
Allowed principals decide who is even permitted to create an endpoint to your service. Without an entry, the consumer cannot see or target the service at all. Scope these as tightly as the consumer’s identity allows — a specific role ARN is better than a whole account, which is better than *.
resource "aws_vpc_endpoint_service_allowed_principal" "consumer" {
vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
principal_arn = "arn:aws:iam::222222222222:root" # tighten to a role ARN if you can
}
Acceptance is the per-connection gate that applies when acceptance_required = true. List pending connections and approve them:
aws ec2 describe-vpc-endpoint-connections \
--filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
--query 'VpcEndpointConnections[?VpcEndpointState==`pendingAcceptance`].[VpcEndpointId,VpcEndpointOwner]' \
--output table
aws ec2 accept-vpc-endpoint-connections \
--service-id vpce-svc-0123456789abcdef0 \
--vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8
To avoid polling for pending connections, wire a connection notification to SNS. You get an event on Connect, Accept, Reject, and Delete, which you can route to a Lambda for auto-approval against an authoritative consumer registry, or just to a channel so a human acts within minutes instead of hours.
resource "aws_vpc_endpoint_connection_notification" "payments" {
vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
connection_notification_arn = aws_sns_topic.privatelink_events.arn
connection_events = ["Connect", "Accept", "Reject", "Delete"]
}
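The notification resource above references aws_sns_topic.privatelink_events without defining it. A minimal sketch of that topic follows; the topic name is an assumption. The part that is easy to miss is the topic policy: the VPC endpoint service principal (vpce.amazonaws.com) needs permission to publish, otherwise creating the notification is likely to fail.

```hcl
# Sketch: SNS topic that receives PrivateLink connection events.
resource "aws_sns_topic" "privatelink_events" {
  name = "privatelink-connection-events" # hypothetical name
}

# Allow the VPC endpoint service principal to publish to the topic.
data "aws_iam_policy_document" "allow_vpce_publish" {
  statement {
    sid       = "AllowPrivateLinkNotifications"
    effect    = "Allow"
    actions   = ["SNS:Publish"]
    resources = [aws_sns_topic.privatelink_events.arn]

    principals {
      type        = "Service"
      identifiers = ["vpce.amazonaws.com"]
    }
  }
}

# Note: this replaces the topic's default access policy.
resource "aws_sns_topic_policy" "privatelink_events" {
  arn    = aws_sns_topic.privatelink_events.arn
  policy = data.aws_iam_policy_document.allow_vpce_publish.json
}
```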
Step 4 — Create the interface endpoint (consumer account)
Now switch to the consumer account. The consumer creates a VPC endpoint of type Interface, naming the provider’s service. AWS provisions one ENI per subnet you specify, each with a private IP from that subnet’s range. Those ENIs are the only thing the consumer’s workloads ever talk to.
resource "aws_vpc_endpoint" "payments" {
vpc_id = var.consumer_vpc_id
service_name = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
vpc_endpoint_type = "Interface"
subnet_ids = var.consumer_subnet_ids # one per AZ — must overlap provider AZs
security_group_ids = [aws_security_group.endpoint.id]
private_dns_enabled = false # see Step 5 before enabling
}
Three things define correctness on this side:
- AZ alignment. Put the endpoint in the same AZs the provider published. An endpoint ENI in an AZ the provider does not serve is dead weight: there is no target on the other side. Use AZ IDs (use1-az1), not names, when reasoning across accounts: us-east-1a in your account may be a different physical zone than in the provider’s.
- The security group is on the endpoint ENI. This is the most common stumble. The SG attached to the endpoint controls traffic from consumer workloads into the ENI. It must allow inbound on the service port from the client CIDRs/SGs. Security groups are optional on an NLB, so the provider’s load balancer may not filter anything itself, and consumer-side SGs on the clients still need egress to the endpoint.
resource "aws_security_group" "endpoint" {
name = "payments-endpoint-sg"
vpc_id = var.consumer_vpc_id
ingress {
description = "Clients to PrivateLink endpoint"
from_port = 8443
to_port = 8443
protocol = "tcp"
cidr_blocks = [var.app_subnet_cidr]
}
}
- One endpoint, many AZs, one connection. Each consumer VPC needs exactly one interface endpoint to the service; the ENIs across AZs share a single connection from the provider’s point of view.
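One way to handle the AZ-alignment point in Terraform is to let the consumer account look up the service and filter subnets by the zones it reports. This is a sketch under assumptions: the Tier tag is hypothetical, and it relies on the lookup returning AZ names as the consumer account sees them, which is my understanding of the API but worth confirming against AZ IDs. The lookup only succeeds once the consumer is on the provider’s allow-list.

```hcl
# Sketch: discover which AZs the provider's service is published in,
# as seen from the consumer account.
data "aws_vpc_endpoint_service" "payments" {
  service_name = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
}

# Consumer subnets that sit in those AZs.
data "aws_subnets" "endpoint_candidates" {
  filter {
    name   = "vpc-id"
    values = [var.consumer_vpc_id]
  }

  filter {
    name   = "availability-zone"
    values = data.aws_vpc_endpoint_service.payments.availability_zones
  }

  # Hypothetical tag: assumes exactly one tagged subnet per AZ, because an
  # interface endpoint accepts at most one subnet per AZ.
  tags = {
    Tier = "endpoints"
  }
}

output "endpoint_subnet_ids" {
  value = data.aws_subnets.endpoint_candidates.ids
}
```

The resulting IDs could feed subnet_ids on the endpoint in place of var.consumer_subnet_ids.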
Step 5 — Private DNS and domain ownership verification
By default the endpoint hands the consumer an ugly regional DNS name like vpce-0a1b2c3d4e5f6a7b8-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com, plus zonal variants. Functional, but nobody wants that baked into client config. Private DNS names let the provider associate a friendly name — say payments.internal.example.com — so consumers keep calling that hostname and resolution silently points at the endpoint.
The catch, and the reason this trips teams up, is that AWS makes the provider prove they own the domain before any consumer is allowed to enable private DNS. This prevents someone from publishing a service that hijacks *.example.com.
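Add private_dns_name to the endpoint service resource from Step 2 (update that resource in place; declaring a second resource with the same name is an error in Terraform), then publish the TXT record AWS gives you: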
resource "aws_vpc_endpoint_service" "payments" {
acceptance_required = true
network_load_balancer_arns = [aws_lb.svc.arn]
private_dns_name = "payments.internal.example.com"
}
AWS returns a verification token. Fetch it and create the TXT record in the public hosted zone for the domain (verification is done against public DNS, even though the service itself is private):
aws ec2 describe-vpc-endpoint-service-configurations \
--service-ids vpce-svc-0123456789abcdef0 \
--query 'ServiceConfigurations[0].PrivateDnsNameConfiguration'
# -> { "State": "pendingVerification", "Type": "TXT",
# "Value": "vpce:abc123...", "Name": "_a1b2c3d4" }
resource "aws_route53_record" "privatelink_verify" {
zone_id = var.public_zone_id
name = "_a1b2c3d4.payments.internal.example.com"
type = "TXT"
ttl = 1800
records = ["vpce:abc123..."]
}
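If the endpoint service and the public zone are managed in the same Terraform configuration, the token does not have to be copied by hand. The variant below is a sketch that replaces the hardcoded record above, reading the name and value from the private_dns_name_configuration attribute the provider exports on the endpoint service; it assumes the record name is the verification name prefixed to the private DNS name, as in the hardcoded example.

```hcl
# Sketch: same TXT record, with name and value read from the endpoint
# service instead of pasted from the CLI output.
resource "aws_route53_record" "privatelink_verify" {
  zone_id = var.public_zone_id
  name = join(".", [
    aws_vpc_endpoint_service.payments.private_dns_name_configuration[0].name,
    aws_vpc_endpoint_service.payments.private_dns_name,
  ])
  type    = "TXT"
  ttl     = 1800
  records = [aws_vpc_endpoint_service.payments.private_dns_name_configuration[0].value]
}
```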
Then trigger verification:
aws ec2 start-vpc-endpoint-service-private-dns-verification \
--service-id vpce-svc-0123456789abcdef0
Once the state flips to verified, consumers can set private_dns_enabled = true on their interface endpoints. Behind the scenes AWS creates a managed private hosted zone in the consumer VPC that resolves payments.internal.example.com to the endpoint ENIs — classic split-horizon: the public name resolves to nothing useful publicly, but inside the consumer VPC it points at the local ENIs. For private_dns_enabled to actually take effect, the consumer VPC must have enableDnsHostnames and enableDnsSupport both turned on, or resolution silently falls back to the long regional name.
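The VPC DNS requirement can be checked at plan time instead of discovered in production. The sketch below revises the Step 4 endpoint to enable private DNS and adds a lifecycle precondition (Terraform 1.2 or later) that reads the two flags from the consumer VPC; the data source name is an assumption.

```hcl
# Sketch: read the consumer VPC's DNS attributes.
data "aws_vpc" "consumer" {
  id = var.consumer_vpc_id
}

resource "aws_vpc_endpoint" "payments" {
  vpc_id              = var.consumer_vpc_id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.consumer_subnet_ids
  security_group_ids  = [aws_security_group.endpoint.id]
  private_dns_enabled = true # only after the provider's domain is verified

  lifecycle {
    # Fail the plan early if either DNS attribute is off.
    precondition {
      condition = (
        data.aws_vpc.consumer.enable_dns_support &&
        data.aws_vpc.consumer.enable_dns_hostnames
      )
      error_message = "Private DNS needs enableDnsSupport and enableDnsHostnames on the consumer VPC."
    }
  }
}
```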
Verify
Prove the path end to end before declaring victory.
# 1. Provider: service exists, NLB attached, DNS verified
aws ec2 describe-vpc-endpoint-service-configurations \
--service-ids vpce-svc-0123456789abcdef0 \
--query 'ServiceConfigurations[0].[ServiceState,PrivateDnsNameConfiguration.State,AvailabilityZones]'
# 2. Provider: the consumer's connection is accepted (not pendingAcceptance/rejected)
aws ec2 describe-vpc-endpoint-connections \
--filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
--query 'VpcEndpointConnections[].[VpcEndpointOwner,VpcEndpointState]' --output table
# 3. Consumer: endpoint is "available" with ENIs in each AZ
aws ec2 describe-vpc-endpoints \
--vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8 \
--query 'VpcEndpoints[0].[State,NetworkInterfaceIds,DnsEntries[0].DnsName]'
# 4. From a consumer instance: DNS resolves to a private (ENI) address...
dig +short payments.internal.example.com
# 10.x.x.x (a consumer-subnet IP, not a public one)
# 5. ...and the service answers
curl -sS -o /dev/null -w '%{http_code}\n' https://payments.internal.example.com:8443/healthz
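If dig does not return a consumer-subnet address, and only the long vpce-...amazonaws.com name resolves to the ENIs, private DNS is not effective: re-check private_dns_enabled, the DNS verification state, and the VPC’s DNS attributes (enableDnsSupport and enableDnsHostnames).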
Scaling, resiliency, and quotas
A few limits and behaviors decide whether this holds up under load.
- Cross-zone load balancing. Covered above, but worth repeating because it bites in production: without it, an endpoint ENI only reaches targets in its own AZ. Enable it on the NLB unless you have a deliberate zonal-isolation design and matching per-AZ target capacity.
- Endpoint connection capacity. A single interface endpoint scales horizontally across AZs; AWS gives roughly tens of thousands of concurrent connections per AZ per endpoint. For very high fan-in, the bottleneck is usually NLB target capacity and source-port exhaustion on long-lived connections, not the endpoint itself. Watch ActiveFlowCount and NewFlowCount on the NLB.
- Quotas. The defaults you will actually hit: interface endpoints per VPC, and allowed principals per endpoint service (default 50, raise via a quota request well before you onboard the 50th consumer account). Endpoint services per account is also capped. Track these in your platform’s quota dashboard.
- Idle timeout. NLB TCP flows have a 350-second idle timeout. Long-lived gRPC or database-style connections through PrivateLink need TCP keepalives below that, or you will see connections silently reset.
Observability and cost
You are billed on two axes for an interface endpoint: an hourly charge per endpoint per AZ, and a per-GB data-processing charge on traffic through it. The per-AZ hourly line is why you do not blindly enable every AZ — each ENI is its own meter. The data-processing charge is on top of any inter-AZ transfer the NLB incurs. At high fan-in across hundreds of consumer endpoints, the per-endpoint hourly cost dominates and is easy to overlook.
For traffic visibility, VPC Flow Logs on the endpoint ENIs show source IPs, ports, and accept/reject actions — invaluable when a consumer swears they cannot connect. The endpoint ENIs have stable interface IDs; filter on them.
// CloudWatch Logs Insights over VPC Flow Logs — rejected traffic to the endpoint ENI
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter interfaceId = "eni-0abc123def456789"
| filter action = "REJECT"
| stats count() as rejects by srcAddr, dstPort
| sort rejects desc
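The query above assumes flow logs are already being delivered to CloudWatch Logs. A minimal sketch of enabling them follows. It logs at the subnet level for the endpoint subnets, which also captures the endpoint ENIs, so that for_each works from known inputs instead of ENI IDs that only exist after the endpoint is created. The two variables are hypothetical inputs for an existing log group and an IAM role the Flow Logs service can assume.

```hcl
# Sketch: flow logs for every subnet that hosts an endpoint ENI.
# Hypothetical inputs: var.flow_log_group_arn, var.flow_log_role_arn.
resource "aws_flow_log" "endpoint_subnets" {
  for_each = toset(var.consumer_subnet_ids)

  subnet_id            = each.value
  traffic_type         = "ALL" # REJECT-only is cheaper if you only debug denials
  log_destination_type = "cloud-watch-logs"
  log_destination      = var.flow_log_group_arn
  iam_role_arn         = var.flow_log_role_arn
}
```

Subnet-level logging records other interfaces in those subnets too, so the interfaceId filter in the query still matters.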
On the provider side, the meaningful signals are NLB CloudWatch metrics: HealthyHostCount/UnHealthyHostCount per target group, ActiveFlowCount, and TCP_Target_Reset_Count. A rise in target resets with healthy hosts usually points at idle-timeout or application-side connection churn rather than the network.
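As a starting point for alerting on those signals, here is a sketch of one CloudWatch alarm on UnHealthyHostCount for the Step 1 target group. The alarm name, thresholds, and var.alerts_topic_arn are assumptions to tune for your own environment.

```hcl
# Sketch: alarm when any target behind the NLB is unhealthy for 3 minutes.
resource "aws_cloudwatch_metric_alarm" "svc_unhealthy_hosts" {
  alarm_name          = "payments-svc-unhealthy-hosts" # hypothetical name
  namespace           = "AWS/NetworkELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0

  dimensions = {
    LoadBalancer = aws_lb.svc.arn_suffix
    TargetGroup  = aws_lb_target_group.svc.arn_suffix
  }

  alarm_actions = [var.alerts_topic_arn] # hypothetical SNS topic for paging
}
```

The same shape works for TCP_Target_Reset_Count by swapping the metric name, using the Sum statistic, and dropping the TargetGroup dimension if you want a per-load-balancer view.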
Troubleshooting runbook
When it does not work, walk these in order — most failures are one of the first three.
- Endpoint stuck in pendingAcceptance. The provider has not accepted the connection (Step 3), or the consumer is not in the allowed-principals list. Check both. With acceptance_required = false, an unlisted principal still cannot connect.
- available endpoint but connections hang or refuse. Almost always the endpoint ENI’s security group does not allow inbound on the service port from the client. If the provider’s NLB has no security group attached, the endpoint SG is the only L4 filter on the path. Also confirm targets are healthy: a dead target group looks like a network problem from the client.
- DNS returns the long vpce-...amazonaws.com name instead of your friendly name. Private DNS is not effective: either the provider’s domain verification is not verified, the consumer set private_dns_enabled = false, or the consumer VPC lacks enableDnsHostnames/enableDnsSupport.
- Works in one AZ, fails in another. AZ mismatch. The provider did not publish in that AZ, or cross-zone load balancing is off and that AZ has no healthy targets. Remember AZ names are not consistent across accounts, so compare AZ IDs.
- Intermittent resets on long-lived connections. NLB 350-second idle timeout. Add TCP keepalives on the client below that threshold.
- “Asymmetric routing”-style weirdness. PrivateLink itself does not produce asymmetry — return traffic flows back through the same ENI. If you see it, the cause is almost always an ALB-as-target-behind-the-NLB setup with mismatched health checks or an extra hop you added, not PrivateLink. Simplify the target chain and re-test.
Enterprise scenario
A payments platform team ran a fraud-scoring API that ~120 internal accounts needed to call, plus three external partner accounts under contract. Their first instinct was a Transit Gateway attachment per consumer. It died on contact with reality: two recently-acquired business units had VPCs on 10.20.0.0/16 — the same block the fraud service’s VPC used — and renumbering a live, PCI-scoped production VPC was a non-starter. Worse, security flagged that a TGW attachment would grant those partner accounts a route into the platform network, far more reach than “call one API” warranted.
They rebuilt it as a PrivateLink endpoint service. The fraud API went behind an internal NLB across three AZs with cross-zone enabled; the endpoint service ran with acceptance_required = true and an allow-list keyed to specific consumer role ARNs, not account roots, so a partner could only connect from their designated workload role. A connection-notification SNS topic fed a small Lambda that auto-approved any principal present in the platform’s account-registry DynamoDB table and left everything else pending for a human — onboarding dropped from a ticket-and-a-meeting to minutes.
The overlapping 10.20.0.0/16 BUs connected with zero renumbering, because PrivateLink never exposes the provider’s address space. They published fraud.internal.payments.example.com as a verified private DNS name so every consumer used one stable hostname regardless of account.
# The control that made the partner case acceptable to security:
# allow only the partner's specific workload role, never the account root.
resource "aws_vpc_endpoint_service_allowed_principal" "partner_a" {
vpc_endpoint_service_id = aws_vpc_endpoint_service.fraud.id
principal_arn = "arn:aws:iam::333333333333:role/fraud-client-prod"
}
The one real cost they had to plan for was the per-endpoint, per-AZ hourly charge multiplied across 120+ consumers — a line item that genuinely showed up on the bill. They accepted it as the price of dropping TGW attachment management and CIDR coordination entirely. Net: the consumer count became a self-service onboarding problem instead of a networking project, which is exactly the trade PrivateLink is built to make.