RDS Proxy in Production: Connection Pooling, Failover Acceleration, and IAM Authentication

A relational database has a hard ceiling on concurrent connections, and that ceiling is far lower than the concurrency your serverless and container fleets can generate. A db.r6g.large PostgreSQL instance defaults to roughly 850 max_connections; a single Lambda burst can ask for thousands of fresh TCP sessions in seconds, each one a forked backend process with real memory cost. The database does not gracefully shed that load. It runs out of connection slots, new sessions get FATAL: remaining connection slots are reserved, and your healthy application starts throwing 500s because it cannot open a socket — not because the query was slow.

RDS Proxy sits between the application and the database, maintains a warm pool of database connections, and multiplexes a large number of short-lived client connections onto a small number of long-lived database connections. It also holds client sockets open during a failover and routes them to the new writer, which shrinks application-observed failover time dramatically. This article is the production build: provisioning, the connection pinning trap that silently destroys multiplexing, tuning the pool under load, IAM authentication, Lambda and VPC wiring, and the CloudWatch signals that tell you whether the proxy is actually helping.

1. The connection-exhaustion problem and what multiplexing buys you

The math is unforgiving. Postgres forks a backend process per connection; even when idle, each backend holds memory and occupies a process slot. The relationship between application concurrency and database connections is the whole game: with direct connections, every concurrent client costs one backend.

The proxy decouples the two through transaction-level multiplexing: a database connection is borrowed from the pool when a client begins a transaction and returned when the transaction commits or rolls back. Between transactions the client holds no backend. This is the same idea as PgBouncer in transaction mode, but managed, VPC-native, and integrated with Secrets Manager and IAM.
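To make the arithmetic concrete, here is a rough sketch. It assumes every client repeats a simple cycle of `txn_ms` inside a transaction followed by `idle_ms` holding no backend, which is a simplification of real traffic; the function name and the numbers are illustrative, not measurements.

```python
def backends_needed(clients: int, txn_ms: int, idle_ms: int) -> int:
    """Average number of backends busy at once under transaction-level multiplexing.

    Assumes each client spends txn_ms in a transaction, then idle_ms with no
    backend borrowed, and repeats (a simplifying assumption).
    """
    cycle_ms = txn_ms + idle_ms
    # Integer ceiling of clients * txn_ms / cycle_ms, avoiding float rounding.
    return -((-clients * txn_ms) // cycle_ms)


if __name__ == "__main__":
    clients = 2000  # hypothetical Lambda burst
    print("direct connections needed:", clients)
    print("multiplexed backends busy:", backends_needed(clients, txn_ms=5, idle_ms=95))
```

The result is an average, so a real pool still needs headroom for peaks, and any pinned session counts as a full backend for as long as the client stays connected.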

The caveat that defines everything downstream: multiplexing only works when the database connection is safe to hand to a different client after each transaction. Some session state makes that unsafe, and the proxy responds by pinning — dedicating a backend to one client for the life of its session. Pinning is multiplexing’s off switch. Section 3 is about not tripping it.

RDS Proxy supports MySQL and PostgreSQL on both Aurora and RDS. It is a managed, autoscaling fleet inside your VPC; you pay per vCPU-hour of the underlying database instance class it fronts, which is why the cost trade-off in section 8 is real and not an afterthought.

2. Provisioning the proxy: secrets, IAM, and target groups

The proxy needs three things wired correctly: a Secrets Manager secret holding the database credentials it uses to open backend connections, an IAM role granting it read access to that secret, and a target pointing at your cluster or instance.

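First, the secret. RDS Proxy authenticates to the database with credentials from Secrets Manager, never inline. The proxy requires the username and password keys; the other keys shown follow the standard RDS secret layout. The username is also the database user that clients present when they log in through the proxy:

{
  "username": "app_user",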
  "password": "REDACTED",
  "host": "prod-aurora.cluster-abc123.us-east-1.rds.amazonaws.com",
  "port": 5432,
  "engine": "postgres",
  "dbname": "appdb"
}

Store it:

aws secretsmanager create-secret \
  --name prod/aurora/proxy-user \
  --secret-string file://proxy-secret.json

The proxy’s IAM role needs to read that secret and decrypt it with the KMS key:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadProxySecret",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/aurora/proxy-user-*"
    },
    {
      "Sid": "DecryptSecret",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/KMS-KEY-ID",
      "Condition": {
        "StringEquals": { "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com" }
      }
    }
  ]
}

The role’s trust policy must allow rds.amazonaws.com to assume it. Now create the proxy. Terraform is the durable way to express this:

resource "aws_db_proxy" "aurora" {
  name                   = "prod-aurora-proxy"
  engine_family          = "POSTGRESQL"
  role_arn               = aws_iam_role.proxy.arn
  vpc_subnet_ids         = var.private_subnet_ids
  vpc_security_group_ids = [aws_security_group.proxy.id]

  require_tls            = true
  idle_client_timeout    = 1800   # seconds; reap idle clients
  debug_logging          = false  # enable temporarily to capture SQL in pinning logs

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.proxy_user.arn
    iam_auth    = "REQUIRED"   # force IAM token auth from clients
    description = "app proxy user"
  }
}

resource "aws_db_proxy_default_target_group" "aurora" {
  db_proxy_name = aws_db_proxy.aurora.name

  connection_pool_config {
    max_connections_percent      = 75
    max_idle_connections_percent = 50
    connection_borrow_timeout    = 120
  }
}

resource "aws_db_proxy_target" "aurora" {
  db_proxy_name         = aws_db_proxy.aurora.name
  target_group_name     = aws_db_proxy_default_target_group.aurora.name
  db_cluster_identifier = aws_rds_cluster.aurora.id   # Aurora cluster (use db_instance_identifier for RDS)
}

Two distinctions that bite people:

3. Connection pinning: the silent killer of multiplexing

Pinning is the single most important RDS Proxy concept to internalize. When a client does something that makes a backend connection unsafe to share, the proxy stops multiplexing that connection and dedicates it to the client until the client disconnects. Enough pinning and your “pooled” proxy degrades into a 1:1 passthrough — you pay for the proxy and get none of the multiplexing.

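For PostgreSQL, common pinning triggers include session-level SET statements (for example SET search_path), session-level advisory locks such as pg_advisory_lock, temporary tables, SQL-level PREPARE, and LISTEN on a notification channel.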

You detect pinning two ways. The DatabaseConnectionsCurrentlySessionPinned CloudWatch metric shows how many connections are pinned right now; if it tracks close to your client connection count, multiplexing is effectively off. For root cause, enable proxy logging and read the pinning reason in the logs:

fields @timestamp, @message
| filter @message like /pinned/
| sort @timestamp desc
| limit 50

A pinning log line names the cause, for example a session variable being set, so you can map it back to a query pattern.

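Practical fixes: move session settings onto the role with ALTER ROLE ... SET so the application never issues a runtime SET, and replace session-level advisory locks with transaction-scoped ones such as pg_advisory_xact_lock, which release at commit.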

Pinning is not a bug, it is correctness. The proxy refuses to leak one client’s session state into another’s. The job is to write query patterns that do not require session state to survive across transactions.
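One way to catch pinning before it reaches production is to scan the SQL your application emits for statements that set session state. The sketch below is a heuristic under stated assumptions: the patterns cover only a few PostgreSQL triggers, the names `PINNING_PATTERNS` and `find_pinning_risks` are hypothetical, and the proxy's own pinning logs remain the source of truth.

```python
import re

# Heuristic patterns for illustration only; this is NOT the proxy's rule set.
PINNING_PATTERNS = {
    # Flags every statement that starts with SET, to stay conservative.
    "session SET": re.compile(r"^\s*SET\s", re.IGNORECASE),
    # Matches pg_advisory_lock / pg_try_advisory_lock, not the _xact_ variants.
    "session advisory lock": re.compile(
        r"\bpg_(try_)?advisory_lock(_shared)?\s*\(", re.IGNORECASE
    ),
    "temporary table": re.compile(
        r"\bCREATE\s+(TEMP|TEMPORARY)\s+TABLE\b", re.IGNORECASE
    ),
}


def find_pinning_risks(statements):
    """Return (statement, reason) pairs for statements matching a pattern."""
    risks = []
    for sql in statements:
        for reason, pattern in PINNING_PATTERNS.items():
            if pattern.search(sql):
                risks.append((sql, reason))
    return risks


if __name__ == "__main__":
    sample = [
        "SET search_path = orders, public",
        "SELECT pg_advisory_lock(42)",
        "SELECT pg_advisory_xact_lock(42)",
        "SELECT 1",
    ]
    for sql, reason in find_pinning_risks(sample):
        print(f"{reason}: {sql}")
```

Running it over an ORM's statement log in CI is a cheap early warning; confirm any hit against DatabaseConnectionsCurrentlySessionPinned under real load.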

4. Tuning the pool: MaxConnectionsPercent, idle, and borrow timeout

Three knobs in connection_pool_config govern behavior under load.

max_connections_percent caps the proxy’s database connections as a percentage of the target’s max_connections. At 75% against an 850-slot instance, the proxy will open up to ~637 backends. Leave headroom: administrative tools, replication, and direct connections also consume slots. If you front one database with multiple proxies or also allow direct app connections, their percentages must sum to under 100 with margin.

max_idle_connections_percent sets how many backends the proxy keeps warm when demand drops. Higher means faster response to the next burst (no cold connect) at the cost of holding idle backends. For spiky serverless traffic, keeping this meaningfully above zero (e.g. 50%) avoids re-establishing connections on every burst; for steady traffic you can run it lower to free slots. It must be less than or equal to max_connections_percent.

connection_borrow_timeout is how long a client waits for a backend when the pool is saturated before the proxy returns an error. This is your backpressure valve. Under a connection storm with the pool maxed, clients queue here. A short timeout fails fast and sheds load (good for Lambda, where a hung invocation burns billed duration and concurrency); a longer timeout absorbs brief spikes without errors. Tune it against your client’s own timeout so the proxy errors before the client gives up, giving you a clean signal.
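The max_connections_percent budget is easy to check with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, rounding down is an assumption about how the percentage maps to whole connections, and `reserved_direct` stands for whatever slots you set aside for admin tools and direct connections.

```python
def proxy_backend_cap(max_connections: int, max_connections_percent: int) -> int:
    """Backends one proxy may open, rounding down (rounding is an assumption)."""
    return max_connections * max_connections_percent // 100


def remaining_slots(max_connections: int, proxy_percents, reserved_direct: int = 0) -> int:
    """Slots left after every proxy reaches its cap and direct connections are reserved.

    A negative result means the configuration can oversubscribe the database.
    """
    proxy_total = sum(proxy_backend_cap(max_connections, p) for p in proxy_percents)
    return max_connections - proxy_total - reserved_direct


if __name__ == "__main__":
    print(proxy_backend_cap(850, 75))                      # one proxy at 75%
    print(remaining_slots(850, [75], reserved_direct=50))  # headroom with one proxy
    print(remaining_slots(850, [60, 50]))                  # two proxies, oversubscribed
```

A negative value from `remaining_slots` is the case the text warns about: percentages across proxies that sum past 100.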

A starting point for a Lambda-heavy workload:

connection_pool_config {
  max_connections_percent      = 75
  max_idle_connections_percent = 50
  connection_borrow_timeout    = 30   # fail fast; let Lambda retry rather than hang
}

The signal to watch while tuning is DatabaseConnectionsBorrowLatency (section 8). If it climbs, clients are queuing for backends and you are either pinning too aggressively or max_connections_percent is too low for the offered load.

5. Failover acceleration and reader endpoints

The second reason to run RDS Proxy — independent of pooling — is failover behavior. During an Aurora or RDS Multi-AZ failover, the writer’s DNS flips to a new instance. Applications connecting directly must detect the broken connection, re-resolve DNS (subject to TTL and cached resolvers), and reconnect — and a fleet doing this simultaneously is a reconnect storm that can knock over the freshly promoted writer.

RDS Proxy changes the failure mode. It holds the client’s connection open, absorbs the reconnect against the database internally, and routes the held client connection to the new writer once promotion completes. The application sees a brief pause on in-flight transactions rather than a flood of connection errors. This is the single biggest lever for shrinking application-observed failover time, and it is why the related Aurora HA build puts the proxy in front of the writer as a baseline.

For Aurora clusters, create a read-only proxy endpoint so read traffic uses the proxy too and benefits from the same pooling and failover handling:

aws rds create-db-proxy-endpoint \
  --db-proxy-name prod-aurora-proxy \
  --db-proxy-endpoint-name prod-aurora-proxy-ro \
  --target-role READ_ONLY \
  --vpc-subnet-ids subnet-aaa subnet-bbb

Route writes at the default (read-write) endpoint and reads at the READ_ONLY endpoint. Two operational notes:

6. Enforcing TLS and IAM database authentication

Long-lived database passwords sitting in application config or environment variables are the credential you most want to eliminate. RDS Proxy lets you replace them with IAM authentication: the application calls AWS to mint a short-lived (15-minute) token and uses it as the database password. No static secret in the app; access is governed by IAM and fully logged.

With iam_auth = REQUIRED and require_tls = true on the proxy, the path is:

  1. The application’s IAM principal must hold rds-db:connect for the specific database user it logs in as:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "rds-db:connect",
    "Resource": "arn:aws:rds-db:us-east-1:111122223333:dbuser:prx-0abc123def456/app_user"
  }]
}

The resource ARN uses the proxy resource ID (prx-..., from describe-db-proxies), not the database instance ID, and the trailing segment is the database username. Scope it to exactly the user(s) the principal may assume.

  2. The application generates a token at connect time and uses it as the password:
export PGPASSWORD="$(aws rds generate-db-auth-token \
  --hostname prod-aurora-proxy.proxy-abc123.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username app_user \
  --region us-east-1)"

psql "host=prod-aurora-proxy.proxy-abc123.us-east-1.rds.amazonaws.com \
      port=5432 user=app_user dbname=appdb sslmode=verify-full \
      sslrootcert=/etc/ssl/certs/rds-combined-ca-bundle.pem"
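  3. The database user must exist with a password that matches a Secrets Manager secret registered with the proxy for that username. With auth_scheme = SECRETS, IAM authentication covers only the hop from client to proxy; the proxy itself logs in to the database with the stored username and password. Do not grant rds_iam to this user: a PostgreSQL role that holds rds_iam authenticates only via IAM tokens, which would break the proxy’s password login:
CREATE USER app_user WITH PASSWORD 'REDACTED';   -- same password as the secret
GRANT CONNECT ON DATABASE appdb TO app_user;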

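Use sslmode=verify-full (not require) so the client verifies the proxy’s certificate and hostname, defeating man-in-the-middle. RDS Proxy presents certificates issued through AWS Certificate Manager, so confirm that the CA file you pass as sslrootcert includes the Amazon Trust Services roots; a bundle that covers only the RDS instance CAs may fail verification. The token is signed with SigV4: minting it requires valid IAM credentials at that moment, and it expires in 15 minutes, so a leaked token is short-lived. The result: the app holds no database password, only an IAM identity.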
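The rds-db:connect resource ARN from step 1 is easy to get wrong by hand, so it can help to assemble it in code. This is a small sketch, not an AWS API: `rds_db_connect_arn` and `connect_policy` are hypothetical helpers, and the `aws` partition is assumed (GovCloud and China regions use different partitions).

```python
import json


def rds_db_connect_arn(region: str, account_id: str, proxy_resource_id: str, db_user: str) -> str:
    """Build the rds-db:connect ARN for a proxy and database user."""
    # The ARN takes the proxy resource ID (prx-...), not a DB instance ID.
    if not proxy_resource_id.startswith("prx-"):
        raise ValueError(f"expected a proxy resource ID (prx-...), got {proxy_resource_id!r}")
    return f"arn:aws:rds-db:{region}:{account_id}:dbuser:{proxy_resource_id}/{db_user}"


def connect_policy(arn: str) -> dict:
    """Wrap the ARN in a minimal IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "rds-db:connect", "Resource": arn}],
    }


if __name__ == "__main__":
    arn = rds_db_connect_arn("us-east-1", "111122223333", "prx-0abc123def456", "app_user")
    print(json.dumps(connect_policy(arn), indent=2))
```

Rejecting anything that does not start with `prx-` guards against the common mistake of pasting a database instance resource ID into the policy.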

7. Lambda integration, VPC networking, and cold-start storms

RDS Proxy and Lambda are made for each other, but the wiring has sharp edges.

VPC and security groups. The proxy lives in private subnets. The chain of security groups must allow: Lambda SG -> proxy SG on 5432, and proxy SG -> database SG on 5432. The proxy’s own SG is the trust boundary for the database; the database should accept connections from the proxy SG, not from the Lambda SG directly, so that the proxy is the only path in.

Get the connection lifecycle right. The classic Lambda mistake is opening a connection inside the handler and never reusing it. Open the connection (and fetch the IAM token) in the module/init scope so it is reused across warm invocations on the same execution environment:

import os

import boto3
import psycopg2

rds = boto3.client("rds")
HOST = os.environ["PROXY_HOST"]
USER = os.environ["DB_USER"]

def _connect():
    # Mint a fresh 15-minute IAM token for every new connection.
    token = rds.generate_db_auth_token(
        DBHostname=HOST, Port=5432, DBUsername=USER, Region=os.environ["AWS_REGION"]
    )
    c = psycopg2.connect(
        host=HOST, port=5432, user=USER, dbname=os.environ["DB_NAME"],
        password=token, sslmode="verify-full",
        sslrootcert="/var/task/rds-combined-ca-bundle.pem",
    )
    # psycopg2 opens an implicit transaction on the first statement. Autocommit
    # ends it per statement so the proxy can return the backend to the pool.
    c.autocommit = True
    return c

conn = _connect()   # module scope: reused across warm invocations

def _select_one():
    # Reads the module-level conn, so it picks up a replaced connection.
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        return cur.fetchone()[0]

def handler(event, context):
    global conn
    try:
        return _select_one()
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        conn = _connect()   # reconnect on a dropped backend (e.g. after failover)
        return _select_one()

Why the proxy specifically helps cold starts. When Lambda scales out hard, hundreds of new execution environments each open a connection in init. Without the proxy that is a direct connection storm against the database. The proxy absorbs it: client connections land on the proxy and borrow from the warm pool, so the database sees a bounded number of backends no matter how wide Lambda fans out. Keep max_idle_connections_percent high enough that the warm pool is ready for the next burst rather than cold-connecting under the spike.

Token, not password, in the environment. Note there is no PGPASSWORD env var holding a secret — only the IAM token minted per connection. The Lambda execution role carries the rds-db:connect permission from section 6.

Enterprise scenario

A retail platform team ran order-processing on Lambda against a provisioned Aurora PostgreSQL cluster (db.r6g.2xlarge, ~3400 max_connections). They had already put RDS Proxy in front of the writer and considered the connection problem solved. During a flash sale, throughput tripled, and they watched DatabaseConnectionsBorrowLatency spike into the seconds while the database’s own connection count sat far below capacity — the pool was saturated even though the database was not. Clients were timing out on connection_borrow_timeout and Lambda was retrying, amplifying the load.

The constraint: they could not simply raise max_connections_percent, because the backends were not idle and available — they were pinned. DatabaseConnectionsCurrentlySessionPinned was tracking nearly 1:1 with client connections, so the proxy had silently degraded to passthrough and every Lambda invocation was effectively holding a dedicated backend for its whole lifetime.

Enabling debug logging surfaced the cause in the pinning logs: the ORM issued SET search_path on every connection, and an audit path used a session-level advisory lock. Both pin. The fix was three moves, none of them “buy a bigger database”:

-- 1. Move search_path off the runtime SET and onto the role so the proxy never sees it
ALTER ROLE app_user SET search_path = orders, public;

# 2. Restore headroom now that connections multiplex again
connection_pool_config {
  max_connections_percent      = 80
  max_idle_connections_percent = 60
  connection_borrow_timeout    = 20   # fail fast, let Lambda retry, shed load cleanly
}

They also replaced the session advisory lock with a transaction-scoped pg_advisory_xact_lock, which releases at commit and does not pin. After the change, DatabaseConnectionsCurrentlySessionPinned dropped to near zero, borrow latency fell back into single-digit milliseconds, and the same cluster absorbed the next sale at double the concurrency without touching the instance class. The lesson the team took away: with RDS Proxy, the metric that predicts an outage is pinning, not CPU.

Verify

Confirm the proxy is healthy, multiplexing, and enforcing auth before you route production traffic.

Target health and endpoints:

aws rds describe-db-proxy-targets --db-proxy-name prod-aurora-proxy \
  --query "Targets[].{Target:RdsResourceId,State:TargetHealth.State,Reason:TargetHealth.Reason}"

aws rds describe-db-proxy-endpoints --db-proxy-name prod-aurora-proxy \
  --query "DBProxyEndpoints[].{Name:DBProxyEndpointName,Role:TargetRole,Status:Status}"

TargetHealth.State should be AVAILABLE. An UNAVAILABLE target with reason AUTH_FAILURE means the secret credentials are wrong or the role cannot read the secret.
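The same check can gate a deployment. The sketch below assumes the JSON shape returned by describe-db-proxy-targets (a `Targets` list whose entries carry `RdsResourceId` and `TargetHealth.State` / `TargetHealth.Reason`, the fields queried above); the function name and the sample data are hypothetical.

```python
def unhealthy_targets(response: dict) -> list:
    """Return (resource_id, state, reason) for every target that is not AVAILABLE."""
    bad = []
    for target in response.get("Targets", []):
        health = target.get("TargetHealth", {})
        if health.get("State") != "AVAILABLE":
            bad.append((target.get("RdsResourceId"), health.get("State"), health.get("Reason")))
    return bad


if __name__ == "__main__":
    # Hypothetical sample shaped like the CLI's JSON output.
    sample = {
        "Targets": [
            {"RdsResourceId": "prod-aurora-1", "TargetHealth": {"State": "AVAILABLE"}},
            {"RdsResourceId": "prod-aurora-2",
             "TargetHealth": {"State": "UNAVAILABLE", "Reason": "AUTH_FAILURE"}},
        ]
    }
    for resource_id, state, reason in unhealthy_targets(sample):
        print(f"{resource_id}: {state} ({reason})")
```

Feed it the parsed output of the CLI command (or the equivalent SDK response) and fail the pipeline when the returned list is non-empty.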

IAM auth path works (and direct password auth is rejected):

# Should succeed with a freshly minted token over TLS
PGPASSWORD="$(aws rds generate-db-auth-token --hostname $PROXY_HOST --port 5432 \
  --username app_user --region us-east-1)" \
  psql "host=$PROXY_HOST user=app_user dbname=appdb sslmode=verify-full \
        sslrootcert=rds-combined-ca-bundle.pem" -c "select current_user;"

Multiplexing is actually happening. Generate load, then compare client connections to database connections — the ratio is the proof:

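# These are CloudWatch metrics (namespace AWS/RDS, dimension ProxyName); date -d assumes GNU date
for m in ClientConnections DatabaseConnectionsCurrentlyBorrowed DatabaseConnectionsCurrentlySessionPinned; do
  aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name "$m" \
    --dimensions Name=ProxyName,Value=prod-aurora-proxy --period 60 --statistics Average \
    --start-time "$(date -u -d '-15 minutes' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)"
done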

ClientConnections should be well above DatabaseConnectionsCurrentlyBorrowed, and DatabaseConnectionsCurrentlySessionPinned should be near zero. If pinned tracks client connections, you are not multiplexing — go back to section 3.

Failover behaves. Trigger a controlled failover and watch the application:

aws rds failover-db-cluster --db-cluster-identifier prod-aurora

Through the proxy you should see a brief stall on in-flight transactions and a fast recovery, not a flood of connection-refused errors across the fleet.
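To put a number on "brief stall", record probe results during the failover and compute the longest gap. This sketch assumes you already log one `(timestamp_seconds, ok)` pair per probe, for example from a loop running SELECT 1 through the proxy; the probe itself is not shown and `longest_outage` is a hypothetical helper.

```python
def longest_outage(samples) -> float:
    """Longest span from the first failed probe to the next successful one.

    samples: iterable of (timestamp_seconds, ok) pairs in time order.
    An outage that never recovers within the samples is not counted.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts          # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None        # recovered
    return longest


if __name__ == "__main__":
    # Hypothetical one-second probes around a failover.
    probes = [(0, True), (1, True), (2, False), (3, False), (4, False), (5, True), (6, True)]
    print("application-observed outage (s):", longest_outage(probes))
```

Running the same probe once directly against the cluster endpoint and once through the proxy gives a like-for-like comparison of application-observed failover time.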

Production checklist
