Building Cross-Account Services with AWS PrivateLink: Endpoint Services, NLBs, and DNS

You have an internal API that another team — or another company — needs to reach. The reflex is to peer the VPCs or hang both off a Transit Gateway. Both grant network-layer reachability, and both fall apart the moment the consumer’s 10.0.0.0/16 collides with yours, which across a large estate it always eventually does. PrivateLink solves a narrower problem and solves it cleanly: it exposes one service across a trust boundary as a single ENI in the consumer’s subnet. No routes, no transitive reachability, no CIDR negotiation. This is how to build the provider and consumer sides correctly, wire up private DNS, and keep it observable.

When an endpoint service is the right call

PrivateLink, peering, and Transit Gateway are not interchangeable. Pick by what you are actually sharing.

| Pattern | What it connects | CIDR overlap | Direction | Per-GB processing charge |
| --- | --- | --- | --- | --- |
| VPC peering | Two VPCs, full IP reachability | Must not overlap | Bidirectional | No (standard cross-AZ and inter-Region transfer rates still apply) |
| Transit Gateway | Many VPCs/accounts, policy-routed | Must not overlap | Bidirectional | Yes |
| PrivateLink | One service behind an NLB/GWLB | Irrelevant | Unidirectional (consumer to provider) | Yes |

The deciding question is reachability. Peering and TGW give the consumer a route to your network; a misconfigured security group or a curious operator can then reach anything routable. PrivateLink gives the consumer a route to one load balancer, full stop. Traffic is initiated only from consumer to provider, it never traverses the public internet, and because connectivity is an endpoint rather than a route, the two VPCs can use identical address space. That last property is why PrivateLink is the default for multi-tenant SaaS on AWS and for any “expose this API to 200 internal accounts” problem.

Mental model: peering and TGW are routing. PrivateLink is publishing. If you would put the thing behind a load balancer and a DNS name anyway, publish it.

The cost is that it is one-directional and one-service-per-endpoint, and it requires a Network Load Balancer (or Gateway Load Balancer) in front of the workload. If you need bidirectional, everything-talks-to-everything connectivity, this is the wrong tool — reach for TGW.

Step 1 — Front the service with an NLB (provider account)

An endpoint service sits on top of an NLB (or GWLB). We will use an NLB. It must be internal — PrivateLink targets the load balancer’s private addresses, not an internet-facing scheme. Register your service’s targets (instances, IPs, or an ALB as a target if you need L7 routing behind it) in a target group.

resource "aws_lb" "svc" {
  name                             = "payments-svc-nlb"
  internal                         = true
  load_balancer_type               = "network"
  subnets                          = var.provider_subnet_ids   # one per AZ you publish
  enable_cross_zone_load_balancing = true
}

resource "aws_lb_target_group" "svc" {
  name        = "payments-svc-tg"
  port        = 8443
  protocol    = "TCP"
  vpc_id      = var.provider_vpc_id
  target_type = "ip"

  health_check {
    protocol            = "TCP"
    port                = "8443"
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
  }
}

resource "aws_lb_listener" "svc" {
  load_balancer_arn = aws_lb.svc.arn
  port              = 8443
  protocol          = "TCP"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.svc.arn
  }
}

Two decisions here matter for everything downstream. First, the subnets you attach to the NLB define which AZs the service is available in. A consumer can only create an endpoint in an AZ where you have a presence. Publish in at least two, and prefer to publish in every AZ your largest consumers use. Second, turn on cross-zone load balancing. An NLB is zonal by default: an endpoint ENI in us-east-1a will only send to targets in us-east-1a unless cross-zone is enabled. With uneven target distribution that produces hot zones and surprising health-check behavior. (Note: cross-zone traffic on an NLB incurs inter-AZ data charges, which the load-balancer owner pays — budget for it.)
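Because the attached subnets decide which AZs consumers can use, it can help to publish the AZ IDs alongside the service name. The following is a minimal sketch, assuming var.provider_subnet_ids is the same list passed to the NLB above; the data source and output names are illustrative and not part of the original configuration.

```hcl
# Look up each subnet the NLB is attached to (names here are illustrative).
data "aws_subnet" "published" {
  for_each = toset(var.provider_subnet_ids)
  id       = each.value
}

# AZ IDs (for example use1-az1) refer to the same physical zone in every
# account, unlike AZ names, so they are the safer thing to hand to consumers.
output "published_az_ids" {
  value = sort(distinct([for s in data.aws_subnet.published : s.availability_zone_id]))
}
```

Consumers can compare this list with the AZ IDs of their own subnets before they create an endpoint.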

Step 2 — Create the VPC endpoint service

With the NLB live, create the endpoint service that points at it. The key knob is acceptance_required.

resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]

  tags = { Name = "payments-endpoint-service" }
}

output "service_name" {
  # e.g. com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0
  value = aws_vpc_endpoint_service.payments.service_name
}

AWS assigns a service name of the form com.amazonaws.vpce.<region>.vpce-svc-xxxxxxxxxxxxxxxxx. This string is what consumers use to find you; it is not secret, but it is the coordination key — hand it to consumers out of band.

acceptance_required = true means every new connection lands in pendingAcceptance and waits for you to approve it. For a controlled internal platform with a known consumer list, that manual gate is worth keeping. For self-service at scale, you flip it to false and rely on the allow-list (next step) instead. Do not run with false and an open allow-list unless you genuinely intend anyone in any account to connect.

Step 3 — Allowed principals, acceptance, and notifications

Two independent controls govern who reaches your service. They stack.

Allowed principals decide who is even permitted to create an endpoint to your service. Without an entry, the consumer cannot see or target the service at all. Scope these as tightly as the consumer’s identity allows — a specific role ARN is better than a whole account, which is better than *.

resource "aws_vpc_endpoint_service_allowed_principal" "consumer" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
  principal_arn           = "arn:aws:iam::222222222222:root"   # tighten to a role ARN if you can
}
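With many consumers, the allow-list is easier to keep tight when it is driven from data rather than one resource per consumer. A sketch, assuming a hypothetical variable consumer_role_arns that holds one role ARN per consumer; the ARNs shown are placeholders.

```hcl
variable "consumer_role_arns" {
  description = "Role ARNs permitted to create endpoints to the service (hypothetical variable)"
  type        = set(string)
  default = [
    "arn:aws:iam::222222222222:role/payments-client",
    "arn:aws:iam::333333333333:role/payments-client",
  ]
}

# One allowed-principal entry per ARN in the set.
resource "aws_vpc_endpoint_service_allowed_principal" "consumers" {
  for_each = var.consumer_role_arns

  vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
  principal_arn           = each.value
}
```

Removing an ARN from the set removes only that consumer's entry on the next apply.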

Acceptance is the per-connection gate that applies when acceptance_required = true. List pending connections and approve them:

aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query 'VpcEndpointConnections[?VpcEndpointState==`pendingAcceptance`].[VpcEndpointId,VpcEndpointOwner]' \
  --output table

aws ec2 accept-vpc-endpoint-connections \
  --service-id vpce-svc-0123456789abcdef0 \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8
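If acceptance is managed in Terraform rather than by hand, the AWS provider offers an accepter resource that plays the same role as the CLI call above. A sketch for the provider account, assuming the consumer's endpoint ID arrives through a hypothetical variable:

```hcl
variable "consumer_endpoint_id" {
  description = "ID of the consumer's interface endpoint, e.g. vpce-0a1b2c3d4e5f6a7b8 (hypothetical variable)"
  type        = string
}

# Accepts the pending connection from that endpoint to the payments service.
resource "aws_vpc_endpoint_connection_accepter" "consumer" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
  vpc_endpoint_id         = var.consumer_endpoint_id
}
```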

To avoid polling for pending connections, wire a connection notification to SNS. You get an event on Connect, Accept, Reject, and Delete, which you can route to a Lambda for auto-approval against an authoritative consumer registry, or just to a channel so a human acts within minutes instead of hours.

resource "aws_vpc_endpoint_connection_notification" "payments" {
  vpc_endpoint_service_id     = aws_vpc_endpoint_service.payments.id
  connection_notification_arn = aws_sns_topic.privatelink_events.arn
  connection_events           = ["Connect", "Accept", "Reject", "Delete"]
}
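The notification resource above refers to aws_sns_topic.privatelink_events without defining it. One possible definition is sketched below, assuming the topic lives in the provider account; the topic name is illustrative. The topic policy is what allows the VPC endpoints service to publish events.

```hcl
resource "aws_sns_topic" "privatelink_events" {
  name = "privatelink-connection-events" # illustrative name
}

# Let the VPC endpoints service publish connection events to the topic.
data "aws_iam_policy_document" "privatelink_events" {
  statement {
    effect    = "Allow"
    actions   = ["SNS:Publish"]
    resources = [aws_sns_topic.privatelink_events.arn]

    principals {
      type        = "Service"
      identifiers = ["vpce.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "privatelink_events" {
  arn    = aws_sns_topic.privatelink_events.arn
  policy = data.aws_iam_policy_document.privatelink_events.json
}
```

If other publishers already use the topic, merge this statement into the existing policy instead of replacing it.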

Step 4 — Create the interface endpoint (consumer account)

Now switch to the consumer account. The consumer creates a VPC endpoint of type Interface that names the provider’s service. AWS provisions one ENI per subnet you specify, each with a private IP from that subnet’s range. Those ENIs are the only thing the consumer’s workloads ever talk to.

resource "aws_vpc_endpoint" "payments" {
  vpc_id              = var.consumer_vpc_id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.consumer_subnet_ids       # one per AZ — must overlap provider AZs
  security_group_ids  = [aws_security_group.endpoint.id]
  private_dns_enabled = false                          # see Step 5 before enabling
}

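Three things define correctness on this side:

  1. Subnets. Place the endpoint in AZs where the provider has published the service. AZ names differ between accounts, so compare AZ IDs.
  2. Security group. The endpoint ENIs need an inbound rule on the service port from your clients; the group below allows 8443 from the application subnet.
  3. Private DNS. Leave private_dns_enabled = false until the provider has verified a private DNS name (Step 5).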

resource "aws_security_group" "endpoint" {
  name   = "payments-endpoint-sg"
  vpc_id = var.consumer_vpc_id

  ingress {
    description = "Clients to PrivateLink endpoint"
    from_port   = 8443
    to_port     = 8443
    protocol    = "tcp"
    cidr_blocks = [var.app_subnet_cidr]
  }
}
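The "must overlap provider AZs" comment on subnet_ids can be checked from the consumer side before the endpoint is created. The sketch below assumes the consumer principal is already on the allow-list (otherwise the lookup finds nothing) and that the AZ names returned are expressed from the consumer account's point of view; confirm against AZ IDs if in doubt. The data source and output names are illustrative.

```hcl
# What the provider publishes, as seen from the consumer account.
data "aws_vpc_endpoint_service" "payments" {
  service_name = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
}

# Consumer subnets that sit in one of those AZs.
data "aws_subnets" "endpoint_candidates" {
  filter {
    name   = "vpc-id"
    values = [var.consumer_vpc_id]
  }

  filter {
    name   = "availability-zone"
    values = data.aws_vpc_endpoint_service.payments.availability_zones
  }
}

output "endpoint_candidate_subnet_ids" {
  value = data.aws_subnets.endpoint_candidates.ids
}
```

An interface endpoint takes at most one subnet per AZ, so if the VPC has several subnets in the same zone, pick one of the candidates per AZ for subnet_ids.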

Step 5 — Private DNS and domain ownership verification

By default the endpoint hands the consumer an ugly regional DNS name like vpce-0a1b2c3d4e5f6a7b8-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com, plus zonal variants. Functional, but nobody wants that baked into client config. Private DNS names let the provider associate a friendly name — say payments.internal.example.com — so consumers keep calling that hostname and resolution silently points at the endpoint.

The catch, and the reason this trips teams up, is that AWS makes the provider prove they own the domain before any consumer is allowed to enable private DNS. This prevents someone from publishing a service that hijacks *.example.com.

Enable a private DNS name on the endpoint service, then publish the TXT record AWS gives you:

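# Same resource as in Step 2, edited in place. Do not declare it a second time.
resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
  private_dns_name           = "payments.internal.example.com"

  tags = { Name = "payments-endpoint-service" }
}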

AWS returns a verification token. Fetch it and create the TXT record in the public hosted zone for the domain (verification is done against public DNS, even though the service itself is private):

aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].PrivateDnsNameConfiguration'
# -> { "State": "pendingVerification", "Type": "TXT",
#      "Value": "vpce:abc123...", "Name": "_a1b2c3d4" }

resource "aws_route53_record" "privatelink_verify" {
  zone_id = var.public_zone_id
  name    = "_a1b2c3d4.payments.internal.example.com"
  type    = "TXT"
  ttl     = 1800
  records = ["vpce:abc123..."]
}

Then trigger verification:

aws ec2 start-vpc-endpoint-service-private-dns-verification \
  --service-id vpce-svc-0123456789abcdef0

Once the state flips to verified, consumers can set private_dns_enabled = true on their interface endpoints. Behind the scenes AWS creates a managed private hosted zone in the consumer VPC that resolves payments.internal.example.com to the endpoint ENIs. This is classic split-horizon DNS: the name resolves to nothing useful publicly, but inside the consumer VPC it points at the local ENIs. Private DNS also depends on the consumer VPC having both enableDnsHostnames and enableDnsSupport turned on; without them it does not take effect, and clients are left with only the long regional name.
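On the consumer side, the two VPC attributes map directly onto arguments of the aws_vpc resource. A minimal sketch, assuming the consumer VPC is managed in the same Terraform configuration (the article otherwise passes it in as var.consumer_vpc_id); the resource name and CIDR are placeholders.

```hcl
# Hypothetical: the consumer VPC managed alongside the endpoint.
resource "aws_vpc" "consumer" {
  cidr_block           = "10.0.0.0/16" # placeholder; overlap with the provider is fine
  enable_dns_support   = true          # defaults to true
  enable_dns_hostnames = true          # defaults to false, so set it explicitly
}
```

With both attributes on and the provider's name verified, the Step 4 endpoint can switch private_dns_enabled from false to true.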

Verify

Prove the path end to end before declaring victory.

# 1. Provider: service exists, NLB attached, DNS verified
aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].[ServiceState,PrivateDnsNameConfiguration.State,AvailabilityZones]'

# 2. Provider: the consumer's connection is accepted (not pendingAcceptance/rejected)
aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query 'VpcEndpointConnections[].[VpcEndpointOwner,VpcEndpointState]' --output table

# 3. Consumer: endpoint is "available" with ENIs in each AZ
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8 \
  --query 'VpcEndpoints[0].[State,NetworkInterfaceIds,DnsEntries[0].DnsName]'

# 4. From a consumer instance: DNS resolves to a private (ENI) address...
dig +short payments.internal.example.com
# 10.x.x.x  (a consumer-subnet IP, not a public one)

# 5. ...and the service answers
curl -sS -o /dev/null -w '%{http_code}\n' https://payments.internal.example.com:8443/healthz

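If dig returns nothing, or returns a public address, private DNS is not in effect and only the long vpce-...amazonaws.com name will reach the endpoint. Re-check private_dns_enabled, the DNS verification state, and the VPC’s enableDnsHostnames and enableDnsSupport attributes.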

Scaling, resiliency, and quotas

A few limits and behaviors decide whether this holds up under load.

Observability and cost

You are billed on two axes for an interface endpoint: an hourly charge per endpoint per AZ, and a per-GB data-processing charge on traffic through it. The per-AZ hourly line is why you do not blindly enable every AZ — each ENI is its own meter. The data-processing charge is on top of any inter-AZ transfer the NLB incurs. At high fan-in across hundreds of consumer endpoints, the per-endpoint hourly cost dominates and is easy to overlook.
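To get a feel for the hourly line at scale, the arithmetic can be sketched as Terraform locals. Every number below is an assumption for illustration, including the hourly rate; check current pricing for your Region before relying on it.

```hcl
locals {
  consumer_endpoints = 120  # one endpoint per consumer account (illustrative)
  azs_per_endpoint   = 3    # ENIs per endpoint (illustrative)
  hours_per_month    = 730
  hourly_rate_usd    = 0.01 # ASSUMED per-endpoint, per-AZ hourly rate

  # Fixed monthly cost across all consumers, before any data-processing charges.
  monthly_hourly_cost_usd = local.consumer_endpoints * local.azs_per_endpoint * local.hours_per_month * local.hourly_rate_usd
}

output "monthly_hourly_cost_usd" {
  # 120 * 3 * 730 * 0.01 = 2628
  value = local.monthly_hourly_cost_usd
}
```

Under these assumptions the fixed charge is about 2,628 USD per month in total, spread across the consumer accounts that own the endpoints.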

For traffic visibility, VPC Flow Logs on the endpoint ENIs show source IPs, ports, and accept/reject actions — invaluable when a consumer swears they cannot connect. The endpoint ENIs have stable interface IDs; filter on them.

# CloudWatch Logs Insights over VPC Flow Logs: rejected traffic to the endpoint ENI
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter interfaceId = "eni-0abc123def456789"
| filter action = "REJECT"
| stats count() as rejects by srcAddr, dstPort
| sort rejects desc
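The query above assumes flow logs are already being delivered to CloudWatch Logs. One way to set that up is a flow log on each subnet that hosts an endpoint ENI, sketched below. The log group name, retention period, and the flow_log_role_arn variable are assumptions; the IAM role itself is not shown.

```hcl
variable "flow_log_role_arn" {
  description = "IAM role that lets VPC Flow Logs write to CloudWatch Logs (hypothetical variable)"
  type        = string
}

resource "aws_cloudwatch_log_group" "endpoint_flow" {
  name              = "/vpc/flow/payments-endpoint" # illustrative name
  retention_in_days = 14
}

# One flow log per subnet that hosts an endpoint ENI.
resource "aws_flow_log" "endpoint_subnets" {
  for_each = toset(var.consumer_subnet_ids)

  subnet_id            = each.value
  traffic_type         = "ALL"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.endpoint_flow.arn
  iam_role_arn         = var.flow_log_role_arn
}
```

Subnet-level logs capture every interface in the subnet, which is why the query filters on the endpoint's interfaceId.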

On the provider side, the meaningful signals are NLB CloudWatch metrics: HealthyHostCount/UnHealthyHostCount per target group, ActiveFlowCount, and TCP_Target_Reset_Count. A rise in target resets with healthy hosts usually points at idle-timeout or application-side connection churn rather than the network.
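These metrics are only useful if something watches them. A sketch of an alarm on UnHealthyHostCount for the Step 1 target group follows; the alarm name, thresholds, and the alarm_topic_arn variable are assumptions chosen for illustration.

```hcl
variable "alarm_topic_arn" {
  description = "SNS topic that receives alarm notifications (hypothetical variable)"
  type        = string
}

# Fires when any target has been unhealthy for three consecutive minutes.
resource "aws_cloudwatch_metric_alarm" "svc_unhealthy_hosts" {
  alarm_name          = "payments-svc-unhealthy-hosts" # illustrative name
  namespace           = "AWS/NetworkELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = aws_lb.svc.arn_suffix
    TargetGroup  = aws_lb_target_group.svc.arn_suffix
  }

  alarm_actions = [var.alarm_topic_arn]
}
```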

Troubleshooting runbook

When it does not work, walk these in order — most failures are one of the first three.

  1. Endpoint stuck in pendingAcceptance. The provider has not accepted the connection yet (Step 3). If the consumer cannot find the service or create the endpoint at all, the problem is earlier: their principal is missing from the allowed-principals list. That check applies even with acceptance_required = false, so an unlisted principal still cannot connect.
  2. Endpoint is available but connections hang or are refused. Almost always the endpoint ENI’s security group does not allow inbound on the service port from the client. The NLB built in Step 1 has no security group attached, so the endpoint SG is the only security group between the client and the load balancer. Also confirm targets are healthy: a dead target group looks like a network problem from the client.
  3. The friendly name does not resolve inside the consumer VPC and only the long vpce-...amazonaws.com name works. Private DNS is not in effect: either the provider’s domain is not yet verified, the consumer set private_dns_enabled = false, or the consumer VPC lacks enableDnsHostnames/enableDnsSupport.
  4. Works in one AZ, fails in another. AZ mismatch. The provider did not publish in that AZ, or cross-zone load balancing is off and that AZ has no healthy targets. Remember AZ names are not consistent across accounts — compare AZ IDs.
  5. Intermittent resets on long-lived connections. The NLB drops idle TCP flows after its idle timeout, which is 350 seconds by default. Add TCP keepalives on the client below that threshold.
  6. “Asymmetric routing”-style weirdness. PrivateLink itself does not produce asymmetry — return traffic flows back through the same ENI. If you see it, the cause is almost always an ALB-as-target-behind-the-NLB setup with mismatched health checks or an extra hop you added, not PrivateLink. Simplify the target chain and re-test.

Enterprise scenario

A payments platform team ran a fraud-scoring API that ~120 internal accounts needed to call, plus three external partner accounts under contract. Their first instinct was a Transit Gateway attachment per consumer. It died on contact with reality: two recently-acquired business units had VPCs on 10.20.0.0/16 — the same block the fraud service’s VPC used — and renumbering a live, PCI-scoped production VPC was a non-starter. Worse, security flagged that a TGW attachment would grant those partner accounts a route into the platform network, far more reach than “call one API” warranted.

They rebuilt it as a PrivateLink endpoint service. The fraud API went behind an internal NLB across three AZs with cross-zone enabled; the endpoint service ran with acceptance_required = true and an allow-list keyed to specific consumer role ARNs, not account roots, so a partner could only connect from their designated workload role. A connection-notification SNS topic fed a small Lambda that auto-approved any principal present in the platform’s account-registry DynamoDB table and left everything else pending for a human — onboarding dropped from a ticket-and-a-meeting to minutes.
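The wiring between the notification topic and an approval function like the one described might look as follows. This is a sketch under stated assumptions: the topic is the aws_sns_topic.privatelink_events referenced in Step 3, and approver_lambda_arn is a hypothetical variable pointing at a function whose code is not shown here.

```hcl
variable "approver_lambda_arn" {
  description = "ARN of the auto-approval Lambda function (hypothetical; the function itself is not shown)"
  type        = string
}

# Deliver connection events from the topic to the function.
resource "aws_sns_topic_subscription" "approver" {
  topic_arn = aws_sns_topic.privatelink_events.arn
  protocol  = "lambda"
  endpoint  = var.approver_lambda_arn
}

# Allow SNS to invoke the function, but only on behalf of this topic.
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowPrivateLinkEventsTopic"
  action        = "lambda:InvokeFunction"
  function_name = var.approver_lambda_arn
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.privatelink_events.arn
}
```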

The overlapping 10.20.0.0/16 BUs connected with zero renumbering, because PrivateLink never exposes the provider’s address space. They published fraud.internal.payments.example.com as a verified private DNS name so every consumer used one stable hostname regardless of account.

# The control that made the partner case acceptable to security:
# allow only the partner's specific workload role, never the account root.
resource "aws_vpc_endpoint_service_allowed_principal" "partner_a" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.fraud.id
  principal_arn           = "arn:aws:iam::333333333333:role/fraud-client-prod"
}

The one real cost they had to plan for was the per-endpoint, per-AZ hourly charge multiplied across 120+ consumers — a line item that genuinely showed up on the bill. They accepted it as the price of dropping TGW attachment management and CIDR coordination entirely. Net: the consumer count became a self-service onboarding problem instead of a networking project, which is exactly the trade PrivateLink is built to make.

Checklist
